Acceso abierto

Research on Collection and Preprocessing Strategies of Traffic Data Driven by Big Data

  
21 mar 2025

Cite
Descargar portada

Introduction

The ever-developing information technology has also updated the means of collecting transportation worker data, which enables transportation data to be collected from richer sources and to broaden the application areas of the data. Nowadays, traffic data has shown the development of multiple datasets [1-3]. The so-called static traffic data, in fact, is a certain period of time, the transportation system in the stable information data, facility information, traffic volume, etc. are its main content, while the dynamic traffic data, traffic control information, real-time traffic flow information, such as continuous changes in the information [4-6].

Intelligent transportation is a product of the fusion of electronic information technology and transportation technology, which is the best means to solve urban traffic congestion, improve driving safety, and improve travel efficiency [7-8]. Intelligent transportation system (ITS) is an inevitable trend in the development of the transportation industry, and the collection and processing of real-time dynamic traffic information is a prerequisite for the realization of ITS. If there is no massive traffic information collection and processing as a support, ITS will only stay in the conceptual stage [9-10].

To recognize big data, cloud computing must be recognized first. Cloud computing is a new type of computing processing developed in recent years, is a support platform for processing big data, is a systematic project in the era of big data, and is significantly affecting the process of information technology and transportation industry service [11-13].

The technique of pre-processing data before data mining is called data preprocessing. Its main working principle is to carry out a series of processing work such as necessary cleaning, integration, transformation, discretization and statute of the original data, so that it meets the minimum specifications and standards required by data mining algorithms to carry out knowledge acquisition and research [14-16]. Since the change process of dynamic traffic data is a real-time, nonlinear, high-latitude, non-stationary stochastic process, which has the characteristics of uncertainty, randomness, and incompleteness with different factors such as road and time, data preprocessing research is required [17-18]. In a complete data mining process, preprocessing techniques take about 60% of the time. First of all, data preprocessing as a key step to improve data quality, its importance is mainly manifested in the following aspects, first of all, ITS massive data has a large number of characteristics and distribution, resulting in the traditional manual elimination and screening methods are not suitable for dealing with massive data. Secondly, the continuous working conditions and environment for a long time make the chances of errors and failures of various traffic detection equipments greatly increase [19-21]. Furthermore, different systems and users have different requirements for data quality and accuracy, and need to adopt different preprocessing methods for real-time traffic data. Common data preprocessing techniques are data cleaning, data integration, data transformation and data statute, data cleaning, i.e., filling missing data values, smoothing noisy data, identifying or removing outliers, and correcting inconsistencies in the data [22-23]. Data integration refers to merging data from multiple data sources together to form a consistent data store, data transformation means, transforming data into a form suitable for mining, such as scaling attribute data to fall into a relatively small and specific interval. Data statute, on the other hand, represents, without affecting the mining results, the use of numerical aggregation, the removal of redundant features to compress the data, to improve the quality of the mining model and to reduce the time complexity. It should be noted that the above mentioned approaches are interrelated [24-25]. For example, the removal of redundant data is both a form of data cleansing and a data statute, and the data cleansing process still needs to be carried out after the integration of the data.

The main applications of big data technology to empower intelligent transportation systems include traffic data collection, data analysis and prediction, and data storage, etc., and the relevant research on the data sources of intelligent transportation systems based on big data technology includes, Literature [26] classifies the data sources used for traffic flow prediction, quantifies the contribution of data sources such as social media and cellular networks, and compares and studies traffic prediction data sources from the perspectives of reliability, adequacy, and operation and maintenance costs.Some scholars have focused on analyzing the data collection and mining strategies, from algorithms, tools, and other perspectives, in Intelligent Transportation Systems (ITS). Literature [27] conceived a big data analytics architecture for Intelligent Transportation Systems (ITS) to enable the analysis and storage of data collected from transportation multimodal sensors, and the feasibility of the proposed big data analytics architecture was verified using Hadoop program.Literature [28] introduces an automatic vehicle detection and counting strategy based on distributed fiber optic sensors and confirms that the proposed strategy is effective through performance tests, and finally, it also proposes the idea of rough vehicle classification, which achieves the classification and identification of heavy and light vehicles. The most popular research direction is how to use big data technology to preprocess and predict traffic data for control. Literature [29] discusses the problems related to traffic data acquisition and dynamic processing, and proposes to combine software analysis and visualization techniques to cope with the prediction and analysis of traffic flow in each time period, and the study provides an important reference for researchers in the field of transportation. Literature [30] aims to optimize the modern traffic information physical fusion network fitness, data type complexity, data collection and data transportation and other issues, proposed the intelligent traffic information fusion cloud control system program, and through simulation experiments for the relevant validation, confirmed the proposed program in the prediction of urban traffic flow as well as good performance in traffic flow control. Literature [31] compared and studied the performance of full sample demand distribution model, traffic integration model and model biological protein expression database model in traffic flow prediction, emphasized that the development of big data technology has brought new development and challenges to the field of traffic flow prediction, and pointed out that how to reasonably utilize the big data technology is the future key research direction in the field of traffic flow prediction. Finally, some researchers also analyzed the value of traffic data information based on big data technology, intelligent transportation system communication based on big data technology and other issues. Literature [32] discusses the application potential of spatio-temporal big data generated in the field of transportation, including the prediction of individual travel demand as well as transportation demand, and the provision of reference for the planning and construction of urban transportation networks. Literature [33] envisioned an intelligent transportation system architecture with big data analytics as the core logic to support data information processing and friendly communication in intelligent transportation, and the effectiveness of the proposed scheme was confirmed in data evaluation tests with multiple input libraries.

This paper analyzes the basic data system and information characteristics of traffic data, and processes the microwave data acquired by RTMS instruments on the expressway in M city. The preprocessed traffic flow data’s missing states are categorized. On the basis of considering the multidimensional spatio-temporal characteristics of lane traffic flow on urban expressways, tensor decomposition theory is introduced, and a tensor Tucker decomposition-based traffic flow missing data repair method (TDIM) is constructed. Five lane cross-sections in M city are selected as the research objects, and the tensor input model of expressway lane correlation is constructed, and the repair effect of the TDIM method is examined by constructing the missing tensor.

Acquisition and pre-processing of traffic data
Overview of Big Data in Transportation
Basic data systems in transportation systems

According to the different ways of obtaining traffic big data can be subdivided into four types of traffic flow monitoring data, traffic floating vehicle monitoring data, traffic monitoring video data and traffic service data data types of data include both unstructured data such as video, images and other types of data, including some structured data, according to the different sources of data and different information characteristics of the visual analysis of traffic big data also presents diverse characteristics.

Traffic flow monitoring data

Traffic flow monitoring data refers to the information about traffic flow and passenger flow collected by vehicle network sensor devices laid on urban road networks.Traffic flow monitoring information is generally obtained through a variety of sensors, such as loop coil detectors, ultrasonic coil detectors, magnetic coil detectors, microwave ultraviolet detectors, and so on.

Traffic Floating Vehicle Data

Floating vehicles are ordinary vehicles equipped with satellite positioning and communication control devices.Floating cars during the driving process can provide real-time access to information such as driving time, driving direction, driving speed, position coordinates, and other data. The dynamic data information obtained through the floating vehicle collection equipment has the technical advantages of high efficiency, flexibility and low cost, and can constantly provide a large number of valuable data samples for traffic visual analysis.

Traffic Monitoring Video Data

Mainly through the establishment of a digital monitoring network covering highways, major traffic arteries and intersections, and the acquisition of traffic monitoring data in real time, uploaded to the vehicle scheduling center, through the visualization system to monitor the implementation of vehicle monitoring based on the video monitoring data can identify the vehicle characteristics, matching vehicle information and predict the location of the vehicle, to provide a source of data for the traffic visual analysis.

Traffic Service Data

Mainly refers to a series of unstructured traffic service information released by the media to remind travelers to understand the current traffic conditions and assist in travel planning. Similar to traditional traffic data, traffic service data is unstructured data, and the use of this type of data also requires structured processing to convert it into user-recognizable text or image information.

Information Characteristics of Transportation Big Data

1) Spatio-temporal mobility

Traffic data is dynamic and shows how a moving object’s trajectory changes at different locations. Since the traffic target cannot generate a trajectory route at a precise point in time, the change process of the traffic entity must be described based on continuous points in time. The movement process of the traffic entity can be regarded as a series of spatial-time pairs, through the use of a variety of data sensing equipment can record a huge number of spatial-temporal pairs (points) and thus generate the movement trajectory of the entity target, based on the visualization of the traffic situational information to be perceived.

2) Multi-source heterogeneity

Multisource refers to a wide range of traffic information sources, data types, and formats.Its information sources, information forms, occurrence times, occurrence spaces, and information users are very different.Heterogeneity mainly manifests itself in different forms of data and information expression, with different levels of certainty and standard formats. As traffic data come from different collection devices and application systems, there are different interface standards and storage formats, which leads to the existence of a large number of unstructured data and structured data with different attributes in traffic data.

3) Cycle similarity Although urban transportation is in constant dynamic change, in the same geographic location and time range, the travel characteristics of residents in a certain time cycle shows repetitiveness and regularity. Based on the information characteristics of traffic data cycle similarity, it can be combined with the existing traffic sample data through visual analysis to make a prediction of its traffic trends in the future.Based on the information characteristics of the traffic data cycle similarity, we can combine the existing traffic sample data and make predictions on the trend of the traffic situation in the future using visual analysis.

4) Regional correlation

The structure of urban transportation network is complex, and the traffic flow between different roads or regions often interacts with each other, resulting in the phenomenon of fusion and separation of traffic flow, which produces a certain correlation relationship. Based on the correlation analysis of regional traffic flow, the change of regional traffic structure can be studied to provide decision makers with a reference for decision-making.The characteristics of traffic flow cycles, such as similarity and regional correlation, are the basis for predicting traffic dynamics.

Traffic flow data sources and processing
Data sources

The data source of this paper is the microwave data acquired by the RTMS instrument on the expressway of M city [34], and in the subsequent research and analysis of this paper, the data taken is the traffic flow data located on a roadway in M city, which is collected from March 8, 2023 to March 22, 2023, a total of two weeks of data. The microwave data are collected every two minutes, i.e., the time interval between two adjacent data is 2 min, and each detector collects 800 data per day, and the data content mainly includes fields such as collection time, lane flow, lane speed, and lane occupancy. The fields contained in the raw data are: detector number (POSID), moment of data collection (TIMETAG), flow (VOLUME), speed (SPEED) and occupancy (OCCUPANCY) of each lane. The study section is a bidirectional 6-lane roadway, divided into inner ring lanes and outer ring lanes, with the middle of the roadway separated by a guardrail as a central barrier, for the three outer ring lanes, the L1, L2, and L3 lanes, respectively, from the middle band sequentially outward; for the three inner ring lanes, the same from the middle band sequentially outward for the L11, L12, and L13 lanes, respectively.

Data pre-processing
Standardized timestamps

This paper proposes the following method for the processing of timestamps [35]:

The time of a day is divided into 720 intervals according to the interval of 2 min, and each interval can be expressed as Ar = [2T – 2,2T),T = 1,2,3,…,720 in minutes (min). The actual data collected each day is i, and each corresponding time point is denoted as at,t = 1,2,3,…,i. Then take the concatenation set indexed by the detector number and timestamp, and compare the relationship between AT and at, which can be processed in three cases:

I. If there exists a unique at in Ar, then the data in interval AT is the actual data corresponding to time point at;

II. If there are two or more at in Ar, i.e. data redundancy, then take the mean of the corresponding actual data for that interval AT;

III. If there is no corresponding at in Aτ, i.e., data is missing, then it can be handled as default for the time being.

Abnormal Data Handling

Threshold theory refers to the identification of abnormal data by setting a reasonable critical value interval for the parameters and observing whether the actual detected values of the parameters are within the specified theoretical threshold range. In this paper, for any traffic flow parameter of flow rate, speed and occupancy, the actual detected values in the microwave data are screened by giving its reasonable value range, and the data points which are not within the threshold range are eliminated. This method can simply and quickly remove the data with abnormal values and make the values of each parameter of the study roadway realistic.

I. Flow rate: 0qfcCT60 \[0\le q\le \frac{{{f}_{c}}\cdot C\cdot T}{60}\]

where.

fc - correction factor, the value is generally between 1.3 and 1.5;

C - road capacity, in veh/h;

T - time interval for data collection, in min.

II. Speed: 0vfvV \[0\le v\le {{f}_{v}}\cdot V\]

Where, V - Road design speed in km/h.

III. Occupancy: 0o100% \[0\le o\le 100%\]

Since the original data selected in this paper are microwave data, the occupancy data obtained from the detection is time occupancy, i.e., the proportion of the time in which a vehicle is detected to have passed over a period of time during the observation process to the total time of detection in that period, so it is expressed in the form of a percentage, and accordingly, the size of its value should be in the range of [0, 100%].

A traffic flow mechanism is a theoretical representation of the correlation relationship between different parameters, thus describing the operation of the traffic flow at the macro level. Under the premise that the detected values of each parameter are within the threshold range, it is also necessary to satisfy the logical relationship that conforms to the characteristics of the traffic flow. For example, when one of the parameters of a lane at a certain point in time is 0, if the other two parameters are not equal to 0, which is in line with the threshold theory but not in line with the combination of logical relationships between the parameters of the traffic flow, the data is abnormal data, which may be due to the software or hardware failures and other problems that lead to the content of the data is not recorded, and it should be repaired or eliminated from the process; if the other two parameters are also 0, it indicates that no vehicle passes through at that point in time and the detector does not detect vehicle data, which is in line with the combined logical relationship between the parameters and the operational characteristics of the traffic flow, then the data is recorded correctly.

Spatial and temporal correlation analysis of traffic data
Time correlation analysis of traffic data

Depending on the differences in the collection methods of traffic speed time series data, data correlation analysis can be performed in two dimensions: the dimension of adjacent time series and the dimension of cycle time series. Adjacent time series refers to the time series of traffic data for the same road section in a continuous and uninterrupted time range. Taking the traffic speed dataset of M city as an example, the traffic speed data of its same road section for the week of March 1 to March 7, 2023 is analyzed. The speed time series plot in adjacent time series mode is shown in Figure 1. Analyzing the data observations, it can be learned that in the consecutive one-week time period, the value of the traffic speed of the same road does not change much, and the overall trend of change is similar.

Figure 1.

The velocity data diagram of one star period on a road in M city

Next, the correlation between the traffic speed data for this section of the roadway was calculated and analyzed according to the Pearson correlation function (Pearson). The Pearson correlation coefficient values for the weekly speed data are shown in Table 1. The results show that the Pearson correlation coefficient value of weekday traffic speed data from Monday to Friday is between 0.8915-0.9343, and the traffic speed data is linearly highly correlated, which indicates that the traffic speed data from Monday to Friday can be used as the reference data for the data repair model. And the similarity of the traffic speed data of the two days of the holiday weekend is 0.7867, showing a high degree of linear correlation, which indicates that the traffic speed data of the two days of the weekend are also significantly similar, and can also be used as the reference data of the traffic data interpolation model. However, in conjunction with the example in the circle in Fig. 1, it can be seen that the degree of similarity of the traffic speed data series between weekdays and two days of weekend compared to weekdays is in the range of 0.6116-0.684, which is a linear significant correlation due to the different start and end times of the peak traffic hours in these two periods of time. Therefore, although there is a certain degree of similarity between the traffic data of weekdays and two days of weekend, the degree of similarity is not high enough to be well used as reference data for the interpolation model of traffic data.

The velocity data Pearson correlation coefficient value

ρ Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Monday 1
Tuesday 0.9324 1
Wednesday 0.9245 0.9343 1
Thursday 0.9211 0.8915 0.9319 1
Friday 0.9141 0.9187 0.9123 0.9275 1
Saturday 0.635 0.6555 0.6569 0.6465 0.684 1
Sunday 0.6122 0.6116 0.6382 0.6271 0.6143 0.7867 1

Cycle time series, on the other hand, refers to the time series of traffic data on the same roadway at the same time of the week. As an example, the traffic speed dataset of City M is analyzed for its traffic speed data on the same roadway in March 2023 on four consecutive Wednesdays. The speed time series plot in cycle time series mode is shown in Figure 2. Analyzing the data observations, it can be learned that the values of traffic speeds on the same roadway do not differ much during the time period of the same date in a month and the overall trend shows similar changes.

Figure 2.

The speed time sequence diagram of the periodic time sequence mode

The correlation between the traffic speed data for the same days of the month for this segment was calculated and analyzed using the Pearson correlation function. The results of Pearson’s correlation coefficient for four consecutive weeks of week 3 speed data are shown in Table 2. It was learned that the values of Pearson correlation coefficient for the time period of one month with the same dates ranged from 0.7219 to 0.8316. From the range of Pearson’s correlation coefficient values, it can be seen that the traffic speed data in the time period of one month with the same date belongs to the linear significant correlation, which indicates that the traffic speed data in the time period of one month with the same date can be considered as the reference data for the data repair model.

Pearson coefficient of velocity data for continuous four-star period

ρ Week 1 Week 2 Week 3 Week 4
Week 1 1 0.8316 0.7792 0.7936
Week 2 1 0.7458 0.7219
Week 3 1 0.8248
Week 4 1
Spatial correlation analysis of traffic data

Also based on the above idea, in analyzing the spatial correlation of traffic data, this section delves into the traffic speed data of five adjacent roadway segments on the day of 7/3/2023. The adjacent roadway speed data are shown in Figure 3, which depicts the curves of traffic speed over time for these five adjacent roadways.

Figure 3.

Adjacent road speed data

The values of Pearson correlation coefficients for speed data of neighboring roads are shown in Table 3. The analysis found that the Pearson correlation coefficients between the five selected road sections range from 0.5075 to 0.8714, indicating that the speed data between these road sections show a linear significant correlation. Meanwhile, the correlation between different road segments changes with spatial distance. For example, the correlation between road 4 and road 5 is higher than that between road 1 and road 3, and similarly the correlation between road 1 and road 2 and road 3 is higher than that between road 4 and road 5. After comprehensive analysis, it can be learned that the traffic speed data in different sections of the road shows a certain correlation, and the degree of this correlation is affected by the spatial structure of the road network, so the traffic data for the adjacent roads should be targeted to be considered as the reference data for the data repair model.

The correlation coefficient value of the adjacent road velocity data

ρ Road 1 Road 2 Road 3 Road 4 Road 5
Road 1 1
Road 2 0.8608 1
Road 3 0.8249 0.8299 1
Road 4 0.5107 0.5145 0.5075 1
Road 5 0.5286 0.5181 0.5153 0.8714 1
Tensor Decomposition-Based Repair of Missing Data for Expressway Traffic Flow
Tensor Basis
Definition of tensor

A tensor of order N, X ∈ ℝn1×n2×…×nn, is an element of a space tensored by a product of bases in a N-vector space, i.e. ℝn1×n2×…×nN.

Freely varying two of the tensor subscript variables and fixing the other subscripts yields a slice of the tensor, i.e. χ = a(1)a(2)∘…∘a(N), which is then represented as follows: xi1,i2,...iN=ai1(1)ai2(2)...aiN(N),1inIn \[{{x}_{{{i}_{1}},{{i}_{2}},...{{i}_{N}}}}=a_{{{i}_{1}}}^{(1)}a_{{{i}_{2}}}^{(2)}...a_{{{i}_{N}}}^{(N)},1\le {{i}_{n}}\le {{I}_{n}}\]

The matrixization of mode n of tensor χ ∈ ℝn1×n2×…×nN refers to the unfolding of tensor χ into a matrix χ(n) along mode n. In essence, this matrixization process maps the elements xi1i2in of tensor χ to the elements xin,j of the matrix, where: j=1+h=1,hnM(ih1)Jh,Jh=m=1,mnh1Im \[j=1+\sum\limits_{h=1,h\ne n}^{M}{({{i}_{h}}-1)}{{J}_{h}},{{J}_{h}}=\prod\limits_{m=1,m\ne n}^{h-1}{{{I}_{m}}}\]

The expansion matrix of the third-order tensor model [36] is shown in Fig. 4.

Figure 4.

The expansion matrix of the third order tensor model

The third-order tensor is X(X ∈ ℝ3×3×2), then the n-mode matrixized expansion of tensor χ has the following three forms: X(1)=[ 147101316258111417369121518 ] \[{{X}_{(1)}}=\left[ \begin{matrix} 1 & 4 & 7 & 10 & 13 & 16 \\ 2 & 5 & 8 & 11 & 14 & 17 \\ 3 & 6 & 9 & 12 & 15 & 18 \\ \end{matrix} \right]\] X(2)=[ 123101112456131415789161718 ] \[{{X}_{(2)}}=\left[ \begin{matrix} 1 & 2 & 3 & 10 & 11 & 12 \\ 4 & 5 & 6 & 13 & 14 & 15 \\ 7 & 8 & 9 & 16 & 17 & 18 \\ \end{matrix} \right]\] X(3)=[ 128910111718 ] \[{{X}_{(3)}}=\left[ \begin{matrix} 1 & 2 & \cdots & \cdots & 8 & 9 \\ 10 & 11 & \cdots & \cdots & 17 & 18 \\ \end{matrix} \right]\]

Tensor operations

Definition 1: Tensor inner product

The inner product of two Nth-order tensors χ ∈ ℝl1×l2×⋯×ln and Y ∈ ℝl1×l2×⋯×ln is defined as the sum of the products of the individual corresponding elements, i.e. χ,Y=i1=1I1i2=1I2iN=1INxi1i2iNyi2iN$\langle \chi ,\text{Y}\rangle =\sum\limits_{{{i}_{1}}=1}^{{{I}_{1}}}{\sum\limits_{{{i}_{2}}=1}^{{{I}_{2}}}{\cdots }}\sum\limits_{{{i}_{N}}=1}^{{{I}_{N}}}{{{x}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{N}}}}}{{y}_{{{i}_{2}}\cdots {{i}_{N}}}}$. Two tensors are orthogonal if the inner product is zero.

Definition 2: Frobenius Paradigm of a Tensor

The Frobenius norm of a tensor X1I×I2××IN$X\in \mathbb{R}_{1}^{I}\times {{I}_{2}}\times \cdots \times {{I}_{N}}$ of order N is ||X||F= X,X =i1=1I1i2=1I2iN=1INxi1i2iN2$||X|{{|}_{F}}=\sqrt{\left\langle X,X \right\rangle }=\sum\limits_{{{i}_{1}}=1}^{{{I}_{1}}}{\sum\limits_{{{i}_{2}}=1}^{{{I}_{2}}}{\cdots }}\sum\limits_{{{i}_{N}}=1}^{{{I}_{N}}}{x_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{N}}}^{2}}$.

Definition 3: Kronecker product of a matrix

If we have matrix A ∈ ℝI×J,B ∈ ℝK×L, the Kronecker product of matrices A and B is each element of matrix A multiplied by all the elements of matrix B, the result is denoted as AB with dimension (LK)×(JL) and is computed as shown below: AB=[ a11Ba12Ba1JBa21Ba22Ba2JBaI1BaI2BaIJB ] \[A\otimes B=\left[ \begin{matrix} {{a}_{11}}B & {{a}_{12}}B & \cdots & {{a}_{1J}}B \\ {{a}_{21}}B & {{a}_{22}}B & \cdots & {{a}_{2J}}B \\ \vdots & \vdots & \ddots & \vdots \\ {{a}_{I1}}B & {{a}_{I2}}B & \cdots & {{a}_{IJ}}B \\ \end{matrix} \right]\] where ai is the element at position (i,j) in matrix A.

Definition 4: Khatri-Rao product of matrices

If there is a matrix A ∈ ℝI×J,B ∈ ℝK×L, the Khatri-Rao product of two matrices is defined as the multiplication of each column element of matrix A and each column element of matrix B, the result of which is denoted by AB and has dimension (LK, as computed as shown below: AB=[ a1b1a2b2aLbL ] \[A\odot B=\left[ \begin{matrix} {{a}_{1}}\otimes {{b}_{1}} & {{a}_{2}}\otimes {{b}_{2}} & \cdots & {{a}_{L}}\otimes {{b}_{L}} \\ \end{matrix} \right]\] where ai and bi are the ith column vectors of matrices A and B respectively.

Definition 5: Hadamard product of tensors

If there are two dimensions each element of tensor χ ∈ ℝI1×I2×⋯×IN and Y1I×I2××IN,X$\text{Y}\in \mathbb{R}_{1}^{I}\times {{I}_{2}}\times \cdots \times {{I}_{N}},\text{X}$ is multiplied by the multiplication of the corresponding positional element in γ to obtain the Hadamard product of χ and γ , the result is denoted by (X,y) and the dimension is I1×I2×⋯×IN The computational procedure is shown below: (X*Y)i1i2in=xi1i2inyi1i2in \[{{(X*Y)}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}={{x}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}{{y}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}\] where xh2in is the (i1i2in)rd element in tensor χ ; yh2hn is the (i1i2in)th element in matrix γ .

Definition 6: n-mode product of a tensor

If there is a tensor χ1I×I2××In\[\chi \in \mathbb{R}_{1}^{I}\times {{I}_{2}}\times \cdots \times {{I}_{n}}\] and matrices U ∈ ℝJ×In, χ and U with n– -mode product of χxnU(n),XxnU(n) of dimension I1×⋯×In–1×J×In+1×⋯×IN, the computation is shown below: (XxnU)i1in1jin+1iN=in=1i1xi1i2iNujin \[{{({{X}_{xn}}U)}_{{{i}_{1}}\cdots {{i}_{n-1}}j{{i}_{n+1}}\cdots {{i}_{N}}}}=\sum\limits_{{{i}_{n}}=1}^{{{i}_{1}}}{{{x}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{N}}}}}{{u}_{j{{i}_{n}}}}\]

To facilitate the illustration of the n– mode product of the tensor, χ1I×I2×I3,M(n)In×In$\chi \in \mathbb{R}_{1}^{I}\times {{I}_{2}}\times {{I}_{3}},{{M}^{(n)}}\in {{\mathbb{R}}^{{{I}_{n}}\times {{I}_{n}}}}$ is used as an example, then XxnM(n) is computed as follows: (Xx1M(1))j1i2i3=i1xi1i2i3mj1i1(1) \[{{({{\text{X}}_{{{x}_{1}}}}{{M}^{(1)}})}_{{{j}_{1}}{{i}_{2}}{{i}_{3}}}}=\sum\limits_{{{i}_{1}}}{{{x}_{{{i}_{1}}{{i}_{2}}{{i}_{3}}}}}m_{{{j}_{1}}{{i}_{1}}}^{(1)}\] (Xx3M(3))i1i2j3=i3xi1i2i3mj3i3(3) \[{{({{\text{X}}_{{{x}_{3}}}}{{M}^{(3)}})}_{{{i}_{1}}{{i}_{2}}{{j}_{3}}}}=\sum\limits_{{{i}_{3}}}{{{x}_{{{i}_{1}}{{i}_{2}}{{i}_{3}}}}}m_{{{j}_{3}}{{i}_{3}}}^{(3)}\] (Xx2M(2))i1j2i3=i2xi1i2i3mj2i2(2) \[{{({{\text{X}}_{{{x}_{2}}}}{{M}^{(2)}})}_{{{i}_{1}}{{j}_{2}}{{i}_{3}}}}=\sum\limits_{{{i}_{2}}}{{{x}_{{{i}_{1}}{{i}_{2}}{{i}_{3}}}}}m_{{{j}_{2}}{{i}_{2}}}^{(2)}\]

Definition 7: Weighted Paradigm of a Tensor

If there is a tensor X ∈ ℝI1×I2×⋯×IN, the weighted paradigm of tensor χ is denoted by ∥xw as follows: ||X||W=||X*W|| \[||\text{X}|{{|}_{\text{W}}}=||\text{X}*\text{W}||\] where W is the weight of tensor X and the value of W is given in the following equation: Wi1i2in={ 1xi1i2inPresence observation0xi1i2inMissing observations \[{{\text{W}}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}=\left\{ \begin{array}{*{35}{l}} 1 & {{x}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}\text{Presence observation} \\ 0 & {{x}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}\text{Missing observations} \\ \end{array} \right.\] χIV stores all observations of the tensor, and χW has the values shown below: Xi1i2inνν=(X*W)i1i2in={ xi1i2inifWi1i2in=10ifWi1i2in=0 \[\text{X}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}^{\nu \nu }={{(\text{X}*\text{W})}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}=\left\{ \begin{array}{*{35}{l}} {{x}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}} & \;if\;{{\text{W}}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}=1 \\ 0 & \;if\;{{\text{W}}_{{{i}_{1}}{{i}_{2}}\cdots {{i}_{n}}}}=0 \\ \end{array} \right.\]

Tensor Tucker decomposition model

Tensor Tucker decomposition is the representation of a tensor as a kernel tensor multiplied by a matrix along each mode. For a third order tensor χ ∈ ℝI×J×K, its Tucker decomposition is: XG;A,B,C=G×,A×,B×,C=p=1Pq=1Qr=1Rgpqrapbqcr \[\text{X}\approx \;\text{G};A,B,C\;=\;\text{G}\times \;,A\times \;,B\times \;,C=\sum\limits_{p=1}^{P}{\sum\limits_{q=1}^{Q}{\sum\limits_{r=1}^{R}{{{g}_{pqr}}}}}{{a}_{p}}\cdot {{b}_{q}}\cdot {{c}_{r}}\] where G ∈ ℝP×Q×R is the kernel tensor; A ∈ ℝI×P,B ∈ ℝJ×Q,C ∈ ℝK×R is the orthogonal factor matrix, which can be regarded as the principal components along the corresponding modes.

The Tucker decomposition is improved as an extension of the matrix singular value threshold decomposition in a higher dimensional space, and therefore, the Tucker decomposition is also called the higher order singular value decomposition (HOSVD).

The Tucker decomposition of a third-order tensor X ∈ ℝI×J×K is shown in Fig. 5.

Figure 5.

Tucker decomposition of third-order tensors

The Tucker decomposition process for the third-order tensor X ∈ ℝI×J×K can be expanded into the matrix form shown below: X(1)AG(1)(CB)T \[{{X}_{(1)}}\approx A{{G}_{(1)}}{{(C\otimes B)}^{T}}\] X(2)BG(2)(CA)T \[{{X}_{(2)}}\approx B{{G}_{(2)}}{{(C\otimes A)}^{T}}\] X(3)CG(3)(AB)T \[{{X}_{(3)}}\approx C{{G}_{(3)}}{{(A\otimes B)}^{T}}\]

When the individual dimensions of the kernel tensor are the same and diagonal, i.e., P = Q = R, then the Tucker decomposition of the tensor evolves into the CP decomposition. Therefore, the CP decomposition model of the tensor can be regarded as a special case of the Tucker decomposition model. The relationship between the tensor CP decomposition model and the Tucker decomposition model is shown in Fig. 6.

Figure 6.

Relationship between CP decomposition model and Tucker decomposition

If the above Tucker decomposition model is generalized to any higher order tensor, the Tucker decomposition model for the Nth order tensor χ can be obtained: X[ G;A(1),A(2),...,A(n) ]=G×1A(1)×2A(2)...×NA(N)) \[\text{X}\approx \left[ \text{G};{{A}^{(1)}},{{A}^{(2)}},...,{{A}^{(n)}} \right]=\text{G}{{\times }_{1}}{{A}^{(1)}}{{\times }_{2}}{{A}^{(2)}}...{{\times }_{N}}{{A}^{(N)}})\]

The expansion matrix is of the following form: X(n)A(n)G(n)(A(N)A(n+1)A(n1)A(1))T \[{{X}_{(n)}}\approx {{A}^{(n)}}{{G}_{(n)}}{{({{A}^{(N)}}\otimes \cdots \otimes {{A}^{(n+1)}}\otimes {{A}^{(n-1)}}\otimes \cdots \otimes {{A}^{(1)}})}^{T}}\]

Tucker Decomposition for Fixing Missing Traffic Flow Data
Objective function setting

Suppose A is the original tensor of dimension I1×I2×I3 and with several missing values, and W is the missing data position tensor to describe the position of the missing elements in A. W is defined as follows: Wi1i2i3={ 1xi1i2i3Observed presence0xi1i2i3Observed missing \[{{\text{W}}_{{{i}_{1}}{{i}_{2}}{{i}_{3}}}}=\left\{ \begin{array}{*{35}{l}} 1 & {{x}_{{{i}_{1}}{{i}_{2}}{{i}_{3}}}}\text{Observed presence} \\ 0 & {{x}_{{{i}_{1}}{{i}_{2}}{{i}_{3}}}}\text{Observed missing} \\ \end{array} \right.\]

Eq. i1 ∈ [1,I1];i2 ∈ [1,I2];i3 ∈ [1,I3].

The expression for the TDIM model is shown below: F(S,X,Y,Z)=argmin(12||W*(AS×1X×1Y×1Z)||F2+λ2(||S||2+||X||2+||Y||2+||Z||2)) \[\text{F}\left( S,X,Y,Z \right)=\text{argmin}\left( \begin{align} & \frac{1}{2}||\text{W}*(\text{A}-S{{\times }_{1}}X{{\times }_{1}}Y{{\times }_{1}}Z)||_{F}^{2} \\ & +\frac{\lambda }{2}(||S|{{|}^{2}}+||X|{{|}^{2}}+||Y|{{|}^{2}}+||Z|{{|}^{2}}) \\ \end{align} \right)\] where X ∈ ℝl1×R1, Y ∈ ℝl2×R2, Z ∈ ℝl3×R3 is the factor matrix, which can be regarded as the principal components of the tensor over each mode. S ∈ ℝR1×R2×R3 is the kernel tensor, and λ2(||S||2+||X||2+||Y||2+||Z||2)$\frac{\lambda }{2}(||S|{{|}^{2}}+||X|{{|}^{2}}+||Y|{{|}^{2}}+||Z|{{|}^{2}})$ is the regularization term, which is used to reduce the overfitting of the function.

The goal of the TDIM model is to obtain the kernel tensor S and the factor matrix X,Y,Z and to make the error function value as small as possible.

Initial solution of the algorithm

Higher order singular values [37] (HOSVD) are applied to determine the initial values as follows: M=S×1X˜×2Y˜×3Z˜ \[\text{M}={{\text{S}}_{\times 1}}{{\tilde{X}}_{\times 2}}{{\tilde{Y}}_{\times 3}}\tilde{Z}\] where M=W*A;S˜I×I2×I3$\text{M}=\text{W}*\text{A};\tilde{S}\in {{\mathbb{R}}^{I\times {{I}_{2}}\times {{I}_{3}}}}$ is the kernel matrix; X˜I1×I1,Y˜I2×I2,Z˜I3×I3$\tilde{X}\in {{\mathbb{R}}^{{{I}_{1}}\times {{I}_{1}}}},\tilde{Y}\in {{\mathbb{R}}^{{{I}_{2}}\times {{I}_{2}}}},\tilde{Z}\in {{\mathbb{R}}^{{{I}_{3}}\times {{I}_{3}}}}$ is the three orthogonal factors. The three orthogonal matrices are truncated by taking their first R1,R2,R3 columns respectively to form the initial values X0,Y0 and Z0 obtained by HOSVD, where X0=[x˜1,...,x˜R1],Y0=[y˜1,...,y˜R2],Z0=[z˜1,...,z˜R3]${{X}_{0}}=[{{\tilde{x}}_{1}},...,{{\tilde{x}}_{R1}}],{{Y}_{0}}=[{{\tilde{y}}_{1}},...,{{\tilde{y}}_{R2}}],{{Z}_{0}}=[{{\tilde{z}}_{1}},...,{{\tilde{z}}_{R3}}]$.

For a given (X0,Y0,Z0), the initial tensor S0 can be obtained by optimizing the F(S,X0,Y0,Z0) function to minimize the

The gradient of F(S,X0,Y0,Z0) is as follows: F.=(W*(AS×1X0×2Y0×3Z0))×1X0×2TY0×3TZ0T \[\overset{.}{\mathop{\text{F}}}\,=-{{(\text{W}*(\text{A}-{{\text{S}}_{\times 1}}{{X}_{0\times 2}}{{Y}_{0\times 3}}{{Z}_{0}}))}_{\times 1}}X_{0\times 2}^{T}Y_{0\times 3}^{T}Z_{0}^{T}\]

The initial value of the kernel matrix S0 is obtained by making F = 0. Then S0,X0,Y0,Z0 is the initial value of the optimization algorithm.

Algorithmic Gradient Calculation

The following equation can be obtained by making M = W*A,N = W*(Sx1Xx2Yx3,Z): F(S,X,Y,Z)=argmin(12||MN||F2+λ2(||S||2+||X||2+||Y||2+||Z||2)) \[\text{F}\left( S,X,Y,Z \right)=\text{argmin}\left( \frac{1}{2}||\text{M}-\text{N}||_{F}^{2}+\frac{\lambda }{2}(||S|{{|}^{2}}+||X|{{|}^{2}}+||Y|{{|}^{2}}+||Z|{{|}^{2}}) \right)\]

Since the values of W and A do not change during the iteration, M can be obtained by precomputation. Then the partial derivative of the objective function F(S,X,Y,Z) is: FS=(MN)×1X0×2TY0×3TZ0T+λS \[\frac{\partial \text{F}}{\partial \text{S}}={{(\text{M}-\text{N})}_{\times 1}}X_{0\times 2}^{T}Y_{0\;\times 3}^{T}Z_{0}^{T}+\lambda S\] FX=(M(1)N(1))(ZY)S(1)T+λX \[\frac{\partial \text{F}}{\partial X}=({{\text{M}}_{(1)}}-{{\text{N}}_{(1)}})(Z\otimes Y)S_{(1)}^{T}+\lambda X\] FY=(M(2)N(2))(ZX)S(2)T+λY \[\frac{\partial \text{F}}{\partial Y}=({{\text{M}}_{(2)}}-{{\text{N}}_{(2)}})(Z\otimes X)S_{(2)}^{T}+\lambda Y\] FZ=(M(3)N(3))(XY)S(3)T+λZ \[\frac{\partial \text{F}}{\partial Z}=({{\text{M}}_{(3)}}-{{\text{N}}_{(3)}})(X\otimes Y)S_{(3)}^{T}+\lambda Z\]

To implement the TDIM traffic flow data repair algorithm, this paper applies the MATLAB programming platform as well as TensorToolbox2.6 The tensor processing package solves S,X,Y and Z under the minimum of the objective function.

Specific steps of the missing data repair algorithm

The specific steps of the TDIM traffic flow missing data repair algorithm proposed in this paper are shown below.

Step 1: Data input

Input data include: original traffic data tensor A , missing data location tensor W , maximum iteration number kmax, step size τ , allowable error θ , relative error rate R;

Calculation M = W*A.

Step 2: Initialization

Apply HOSVD to compute the initial tensor S0,X0,Y0,Z0

Step 3: Iterative process

For i = 0 to kmax;

Calculation N = W*(S×1X×2Y×3Z);

Calculate F(S,X,Y,Z)=argmin(12MNF2+λ2(||S||2+||X||2+||Y||2+||Z||2))$\text{F}\left( S,X,Y,Z \right)=\text{argmin}\left( \frac{1}{2}\parallel \text{M}-\text{N}\parallel _{F}^{2}+\frac{\lambda }{2}(||S|{{|}^{2}}+||X|{{|}^{2}}+||Y|{{|}^{2}}+||Z|{{|}^{2}}) \right)$.

Compute function gradients FS,FX,FY$\frac{\partial \text{F}}{\partial S},\frac{\partial \text{F}}{\partial X},\frac{\partial \text{F}}{\partial Y}$ and FZ$\frac{\partial }{\text{F}}\partial Z$ and update S,X,Y,Z based on the gradients.

Calculate Rk=||MN||F||M||F${{R}_{k}}=\frac{||\text{M}-\text{N}|{{|}_{F}}}{||\text{M}|{{|}_{F}}}$.

If RkRk–1θ, truncate to step 4.

End for

Step 4: Data Output

Output the estimation tensor A^=Sk×1X×2Y×3Z\[\hat{A}={{S}_{k}}{{\times }_{1}}X{{\times }_{2}}Y{{\times }_{3}}Z\].

Traffic data recovery performance results and analysis

In this paper, two types of traffic data are selected as experimental data, namely traffic flow data and traffic speed data. The superiority of Tucker’s tensor decomposition algorithm can also be verified by comparing the recovery results with those of five missing value recovery methods.

In order to verify the superiority of the algorithm in this paper, typical missing value recovery algorithms such as LRMC, PPCA, LLS, Kernel Sparse Representation based on Elastic Network Regularization (KSR-EN), and High Accuracy Low Rank Tensor Completion Method (Ha LRTC) are introduced for comprehensive comparison. Among them, LRMC and Ha LRTC belong to the matrix and tensor complementation based methods, respectively.PPCA is based on probabilistic generative model, and LLS adopts linear regression model for imputation.KSR-EN is a missing value recovery model based on augmented sparse representation that combines kernel learning with elastic network regularization, and utilizes kernel algorithm to perform sparse representations in high-dimensional nonlinear space. The parameters in each method are tuned to their optimal values during the experimental validation. In order to reduce the random effect, each trial is repeated 10 times, and the average value is taken as the final experimental result. Taking the traffic flow data as an example, the experimental results in MCAR, MIXED, and MAR data missing modes are shown in Table 4.

Experimental results in MCAR, MIXED&MAR data deletion mode

Traffic flow data δ LRMC PPCA LLS KSR-EN HaLRTC TDIM
MCAR 0.2 81.83 77.23 73.84 68.25 64.04 60.55
0.3 83.2 81.57 81.42 70.61 65.91 60.34
0.4 87.41 83.88 88.83 75.76 71.19 64.07
0.5 89.1 88.86 100.5 79.92 73.39 66.15
0.6 95.59 93.23 113.49 86.41 77.8 71.99
MIXED 0.2 88.82 81.86 80.25 71.62 67.99 61.87
0.3 91.48 87.66 84.98 76.34 69.17 62.72
0.4 95.65 89.45 93.21 80.68 71.05 64.4
0.5 97.36 94.29 102.48 87.28 74.92 69.69
0.6 103.23 100.61 117.36 94.83 79.79 73.52
MAR 0.2 98.81 90.37 84.32 77.2 68.15 66.23
0.3 100.3 96.04 90.09 79.94 71.93 66.61
0.4 102.17 98.61 100.03 86.83 74.93 72.39
0.5 107.4 101.79 112.08 91.53 76.77 74.87
0.6 108.91 105.64 129.41 99.1 85.72 79.6

Meanwhile, in order to illustrate the recovery results more intuitively and fully demonstrate the degree of fit between the recovered values and the real values, the comparison results of the recovery results of different methods are shown in Fig. 7 when the traffic flow data are in the MIXED missing mode and δ = 0.3, (a) to (f) represent the LRMC, PPCA, LLS KSR-EN, HaLRTC, and TDIM algorithms. We get the effect graphs of recovering missing values with different methods. From the graphs, we can see that the methods proposed in this paper have small residuals in the missing data recovery process. From these results, we can get some interesting points:

1) In general, MCAR missing mode has the least difficult data recovery, while MAR is the most difficult case to recover. This is reasonable because successive losses lose a lot of valuable information, which increases the difficulty of accurate recovery. It can also be observed that the recovery error of each method increases with the missing rate.

2) The poor recovery results of the LRMC, PPCA, and LLS methods may be due to the fact that their a priori assumptions cannot be satisfied on this dataset.The KSR-EN and Ha LRTC methods outperform the above methods in modeling complex spatio-temporal data. However, the recovery performance of Ha LRTC is better than that of KSR-EN, which implies that multidimensional correlation modeling of traffic data facilitates the recovery of missing values.

3) The proposed method in this paper outperforms HaLRTC and all other compared methods in terms of recovery results regardless of any missing rate and missing mode. And the recovery errors obtained by this paper’s algorithm are 34%-42.12%, 25.89%-35.82%, 29.58%-45.37% lower than LRMC, PPCA, LLS, KSR-EN and HaLRTC algorithms, respectively, under the three datasets, 25.72%-31.94%, and -5.16%-4.47%.

Figure 7.

The results of different methods were compared

The experimental results of traffic speed data under different missing modes and missing rates are shown in Table 5, and we can observe a similar phenomenon.In summary, the experimental results show that the method presented in this paper can effectively improve the recovery performance of the low-rank tensor complementation method for spatio-temporal traffic data.

Data experiment results of different missing modes and missing rates

Traffic flow data δ LRMC PPCA LLS KSR-EN HaLRTC TDIM
MCAR 0.2 4.446 4.161 3.982 3.702 2.738 2.804
0.3 4.522 4.402 4.403 3.833 2.842 2.792
0.4 4.756 4.530 4.815 4.119 3.135 2.999
0.5 4.850 4.807 5.463 4.350 3.257 3.115
0.6 5.211 5.049 6.185 4.711 3.502 3.439
MIXED 0.2 4.834 4.418 4.338 3.889 2.957 2.877
0.3 4.982 4.740 4.601 4.151 3.023 2.924
0.4 5.214 4.839 5.058 4.392 3.127 3.018
0.5 5.309 5.108 5.573 4.759 3.342 3.312
0.6 5.635 5.459 6.400 5.178 3.613 3.524
MAR 0.2 5.389 4.891 4.564 4.199 2.966 3.119
0.3 5.472 5.206 4.885 4.351 3.176 3.141
0.4 5.576 5.348 5.437 4.734 3.343 3.462
0.5 5.867 5.525 6.107 4.995 3.445 3.599
0.6 5.951 5.739 7.069 5.416 3.942 3.862
Conclusion

In this study, the microwave data of traffic are acquired by RTMS instrument, the time stamps and outliers of the collected data are preprocessed, the missing data of traffic flow are repaired by using Tucker tensor decomposition technique, and the expressway lane operation data are utilized for analysis.

1) This paper systematically analyzes the spatio-temporal characteristics of the road network traffic data from two perspectives: temporal correlation and spatial correlation, and finds that the strong spatio-temporal correlation demonstrated by the urban road network traffic data is reflected in multiple levels. On the one hand, there is an obvious change cycle law of the road itself; on the other hand, the traffic conditions between different road sections influence and propagate each other. Therefore, by combining the theory of spatio-temporal correlation research, the traffic data repair model can be constructed more comprehensively and accurately.

2) From the perspective of temporal and spatial correlation of traffic data, it mines the overall correlation of the data from a higher dimension, while taking into account the correlation between samples within the traffic data, and makes effective use of the data patterns.Compared with the comparison methods, the experimental results show that when recovering the data through this paper’s method, it has the lowest recovery error in three different datasets than the LRMC, PPCA, LLS, KSR-EN and HaLRTC algorithms by -5.16%-45.37%, having the lowest recovery error.