Intelligent capture analysis model for highway toll evasion vehicles based on a vehicle re-identification algorithm
Published Online: Mar 17, 2025
Received: Oct 22, 2024
Accepted: Feb 15, 2025
DOI: https://doi.org/10.2478/amns-2025-0181
© 2025 Sinan Song, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Under China's "one network" nationwide highway tolling mode, a single toll for a long-distance, cross-provincial trip can be substantial, so the illegal gains to be made by evading tolls are greater than in the past [1-2]. At the same time, as the highway network grows more complex, auditing for evasion becomes increasingly difficult, and a small number of unscrupulous owners use a variety of means to avoid paying highway tolls without being detected. How to quickly detect and curb such toll evasion by offending owners and how to build an efficient, effective auditing system are therefore pressing problems for highway operators and managers.
Vehicle re-identification (VRI) aims to retrieve and associate images of the same vehicle identity captured by different cameras and to identify a target vehicle from images or video sequences [3-4]. Vehicles travelling on a highway are recorded by monitoring equipment at different angles and resolutions, so when an unscrupulous owner uses some means to evade a toll, corresponding behavioral traces are generally left behind. By detecting vehicles in each frame of the surveillance video, a dataset of offending vehicle images can be constructed, and suspected toll-evading vehicles can then be identified from the behavioral characteristics of evasion [5-7]. Applying vehicle re-identification technology to these behavioral characteristics, together with corresponding management measures and inspection systems, can effectively regulate traffic order, curb refusal to pay, evasion, and underpayment of highway tolls, avoid the loss of state-owned assets, and maintain fairness in the transportation market [8-10].
Liu, X. et al. note that constructing a vehicle re-identification image dataset must satisfy two important requirements, namely capturing a large number of vehicles in a real traffic environment and enabling cross-camera vehicle search; they propose the large-scale benchmark dataset "VeRi" to meet these requirements, together with a baseline model that combines color, texture, and high-level semantic information [11]. Sun, W. et al. introduced a multi-feature learning model with enhanced local attention (MFELA), which strengthens the global feature representation by extracting multi-scale semantic features and heightens attention to local regions through a region batch drop block method, combining the feature representations of the global and local branches to achieve better vehicle re-identification performance [12]. Liu, X. et al. designed a region-aware deep model to learn discriminative local features as an aid to vehicle re-identification, and the method achieved strong performance on large-scale datasets [13]. Liu, X. et al. proposed a vehicle re-identification method with global-regional features, which describes additional local details on top of the recognized vehicle appearance to enhance discrimination between global contexts, and uses a group loss to optimize the distances within and between groups of vehicle images for a more convenient and efficient recognition process [14]. Zakria, Cai, J. et al. built a vehicle re-identification system with a global channel and a local-region channel, which first extracts global feature vectors of the database images to recognize the vehicle coarsely and then extracts more discriminative and salient features from different regions of the vehicle [15]. Shen, F. et al. constructed a vehicle re-identification model based on a graph-interactive transformer, using the graph to extract discriminative local features within an image and the transformer to extract global features across images [16]. He, B. et al. introduced a partially regularized discriminative feature preservation method that enhances the ability to perceive subtle differences, and built a framework combining global-constraint and local-constraint modules for end-to-end training, which significantly improved the performance of vehicle re-identification algorithms [17].
Tian, X. et al. emphasized that the key to vehicle re-identification is extracting discriminative vehicle features, so they constructed a combined feature extraction approach: a three-branch adaptive attention network extracts the important features of the vehicle, a global relational attention approach then captures the global structural information, and finally multi-granularity feature learning captures fine-grained local information [18]. Shen, J. et al. applied an attention model to achieve robust and efficient vehicle re-identification and proposed an end-to-end partitioned and fused multi-branch network (PFMN), which learns highly discriminative vehicle features without requiring additional annotations or attributes [19]. Rong, L. et al. likewise proposed a multi-branch network model that fuses global and local features to capture vehicle information, embedded a channel attention module to achieve personalized feature extraction for the target vehicle, and then used a weighted local feature control method to suppress background and noise information, effectively improving the accuracy of vehicle re-identification [20]. Yang, J. et al. investigated how to learn discriminative information from vehicle images with multiple viewpoints and proposed a two-branch feature learning network, in which a pyramid local attention module learns local features at different scales, a spatial attention module learns attention-weighted global features, and the pooled local and global feature vectors are finally used for identity re-identification [21].
To summarize, research on VRI technology continues to advance, improving the accuracy, robustness, and real-time performance of vehicle recognition in complex environments. However, existing research still has shortcomings, such as handling complex recognition environments and the high similarity between identified vehicles. Therefore, this study proposes an intelligent capture analysis model for highway toll evasion vehicles based on the VRI algorithm. Combining a multi-dimensional self-attention structure with multi-dimensional feature fusion improves the accuracy of the intelligent capture analysis model for toll-evading vehicles. The research aims to maintain the order of highway operations and enable precise inspection of vehicles that evade fees. The innovation lies in combining deep learning and computer vision to automatically learn complex features in vehicle images, thereby achieving high-precision vehicle recognition.
This section has two parts. First, a vehicle recognition algorithm based on multi-dimensional self-attention is proposed. Second, a VRI method based on multi-dimensional feature fusion, built on the multi-dimensional self-attention algorithm, is designed; together they form an intelligent capture and analysis model for highway toll evasion vehicles based on the vehicle re-identification algorithm.
The recognition module in the intelligent capture analysis model for highway toll evasion vehicles consists of a vehicle recognition method based on multi-dimensional self-attention and a vehicle recognition algorithm based on multi-dimensional feature fusion. The multi-dimensional self-attention algorithm effectively extracts salient appearance features of fee-evading vehicles, while the multi-dimensional feature fusion algorithm fuses the extracted features across multiple dimensions. Traditional attention-oriented VRI methods use only global or local attention mechanisms for feature learning and information processing; because they give insufficient consideration to scale changes, they are inefficient at extracting and exploiting multi-dimensional information [22-23]. Multi-dimensional self-attention helps the model locate prominent feature regions in vehicle images. Therefore, a vehicle recognition method based on multi-dimensional self-attention is designed to enhance VRI accuracy using vehicle appearance features. The model of the vehicle recognition algorithm based on multi-dimensional self-attention is displayed in Figure 1.

Schematic model of vehicle recognition algorithm based on multi-dimensional self-attention
As shown in Figure 1, the vehicle image first passes through a 50-layer residual network to extract initial features and then enters the multi-dimensional spatial attention module, where convolution kernels of different sizes extract multi-dimensional features, generate spatial attention, and enhance the feature map. The channel attention module then generates channel attention maps using spatiotemporal pooling, one-dimensional convolution, and activation functions. Finally, the channel attention is fused to produce the optimized feature maps. The calculation of the VRI algorithm using multi-dimensional self-attention is displayed in Equation (1).
In equation (1),

Multi-scale spatial attention module structure
In Figure 2, the input data
In equation (2),
In equation (3),
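As a rough illustration of the multi-dimensional spatial attention described above, the PyTorch sketch below applies parallel convolutions with different kernel sizes to the same feature map and merges their responses into a single spatial attention map that re-weights the input. The class name `MultiScaleSpatialAttention`, the kernel sizes (3, 5, 7), and the residual-style enhancement are illustrative assumptions rather than the exact formulation of Equations (1)-(3).

```python
import torch
import torch.nn as nn

class MultiScaleSpatialAttention(nn.Module):
    """Sketch of a multi-scale spatial attention block.

    Convolutions with different kernel sizes observe the feature map at
    several receptive fields; their responses are merged into a single
    spatial attention map that re-weights the input features.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per kernel size; each branch compresses the input to
        # a single-channel spatial response map.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=k, padding=k // 2, bias=False)
            for k in kernel_sizes
        )
        # A 1x1 convolution fuses the per-scale responses into one map.
        self.fuse = nn.Conv2d(len(kernel_sizes), 1, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-scale spatial responses, concatenated along the channel axis.
        responses = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Attention map in (0, 1); broadcast over channels when multiplying.
        attention = torch.sigmoid(self.fuse(responses))
        # Residual-style enhancement keeps the original signal.
        return x + x * attention


if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)        # e.g. a mid-level ResNet feature map
    enhanced = MultiScaleSpatialAttention(256)(feat)
    print(enhanced.shape)                      # torch.Size([2, 256, 32, 32])
```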

Channel attention structure
In Figure 3, the feature map
In equation (4),
In equation (5),
In equation (6),
In equation (7),
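A minimal sketch of the channel attention branch described for Figure 3, assuming an ECA-style design: global average pooling summarises each channel, a one-dimensional convolution models local cross-channel interaction, and a sigmoid produces the per-channel weights. The class name, the kernel size of 3, and the use of average pooling are assumptions; the exact definitions in Equations (4)-(7) may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of a channel attention branch as described for Figure 3."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)          # (B, C, H, W) -> (B, C, 1, 1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=kernel_size,
                                padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel descriptor, reshaped so channels form the 1-D sequence axis.
        descriptor = self.avg_pool(x).view(b, 1, c)
        # Local interaction between neighbouring channels, then gating weights.
        weights = torch.sigmoid(self.conv1d(descriptor)).view(b, c, 1, 1)
        # Re-weight the feature map channel by channel.
        return x * weights


if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)
    print(ChannelAttention()(feat).shape)   # torch.Size([2, 256, 32, 32])
```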
The VRI algorithm based on multi-dimensional self-attention can effectively extract subtle, salient features of fare-evading vehicles and identify them accurately. In real traffic scenes, however, factors such as diverse shooting angles and dimensional differences cause vehicle images to appear at varied scales, so vehicles occupy different proportions of the image, which directly affects VRI accuracy [24]. Therefore, a VRI method based on the multi-dimensional self-attention algorithm and multi-dimensional feature fusion is proposed, which extracts multi-dimensional features in depth and fuses feature information at various scales. The calculation of multi-dimensional feature extraction is shown in Equation (8).
In equation (8),
In equation (9),
In equation (10),

Schematic model of vehicle re-identification method based on multi-dimensional feature fusion
In Figure 4, the model comprises a multi-dimensional feature extraction module and a complementary fusion module. The feature extraction module is built on a 50-layer residual network and a feature pyramid; it extracts rich semantic and scale features, integrates semantics through convolution and pooling, and generates an output feature map.
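A minimal sketch, under stated assumptions, of this pyramid-based feature extraction: the Stage 2 to Stage 5 outputs of a torchvision ResNet-50 are tapped, 1x1 lateral convolutions match the channel counts, and the maps are merged layer by layer by downsampling and addition (described in more detail with Figure 5 below). The class name `PyramidFeatureExtractor`, the common channel width of 256, and the propagation direction of the merge are assumptions, not the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

class PyramidFeatureExtractor(nn.Module):
    """Sketch: ResNet-50 Stage 2-5 feature maps aligned to a common width."""

    def __init__(self, out_channels: int = 256):
        super().__init__()
        backbone = resnet50(weights=None)   # ImageNet weights could be loaded here
        # Tap the outputs of layer1..layer4 (Stage 2..Stage 5).
        self.body = create_feature_extractor(
            backbone, return_nodes={f"layer{i}": f"c{i + 1}" for i in range(1, 5)}
        )
        stage_channels = [256, 512, 1024, 2048]
        # 1x1 lateral convolutions match every stage to the same channel count.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in stage_channels
        )

    def forward(self, x):
        feats = list(self.body(x).values())            # c2, c3, c4, c5
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Merge layer by layer: max-pool the finer map down to the next
        # stage's resolution and add it in (an assumed reading of the text).
        for i in range(1, len(laterals)):
            down = F.adaptive_max_pool2d(laterals[i - 1], laterals[i].shape[-2:])
            laterals[i] = laterals[i] + down
        return laterals                                 # multi-scale feature maps


if __name__ == "__main__":
    maps = PyramidFeatureExtractor()(torch.randn(1, 3, 224, 224))
    print([m.shape for m in maps])
```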

Structure of the multi-scale feature extraction module
In Figure 5, the core of the module is to introduce a feature pyramid on top of the 50-layer residual network and to fuse the multi-dimensional feature maps from Stage 2 to Stage 5. Feature maps are aligned by max downsampling and 1x1 convolutions that match the channel counts, summed, and passed layer by layer to the lower layer; multi-dimensional spatial attention convolution and average downsampling then yield the multi-dimensional features. The complementary fusion module integrates these features using dilated convolutions with dilation rates of 2, 3, and 4 and deformable convolutions with kernel sizes of 1, 3, and 5, respectively, and the fused features are concatenated along the channel dimension to form a more robust multi-dimensional representation.
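The complementary fusion just described can be sketched as parallel branches of dilated convolutions (dilation 2, 3, 4) and deformable convolutions (kernel sizes 1, 3, 5) whose outputs are concatenated along the channel dimension. How the branches are paired with the pyramid scales is not specified here, so this sketch simply applies all six branches to one feature map; the offset-prediction convolutions and the class name `ComplementaryFusion` are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ComplementaryFusion(nn.Module):
    """Sketch of a complementary fusion module with dilated and deformable branches."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Dilated 3x3 convolutions with dilation rates 2, 3, 4 (size-preserving).
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (2, 3, 4)
        )
        self.offset_convs = nn.ModuleList()
        self.deformable = nn.ModuleList()
        for k in (1, 3, 5):
            # The offset field has 2 values (dy, dx) per kernel sampling point.
            self.offset_convs.append(
                nn.Conv2d(channels, 2 * k * k, kernel_size=3, padding=1)
            )
            self.deformable.append(
                DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs = [conv(x) for conv in self.dilated]
        for offset_conv, deform in zip(self.offset_convs, self.deformable):
            outputs.append(deform(x, offset_conv(x)))
        # Channel-wise concatenation yields the fused multi-dimensional feature.
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    fused = ComplementaryFusion(256)(torch.randn(1, 256, 28, 28))
    print(fused.shape)   # torch.Size([1, 1536, 28, 28]): 6 branches x 256 channels
```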
To demonstrate the effectiveness and superiority of the vehicle recognition method based on multi-dimensional self-attention and multi-dimensional feature fusion, experiments are conducted on the VeRI-776 and VehicleID datasets and the results are analyzed in detail. The software and hardware environment of the experiments is summarized in Table 1.
Name | Configuration
---|---
Operating system | Ubuntu 18.04 LTS
CUDA | 10.2
cuDNN | 7.6.5
Programming language | Python 3.7.7
Deep learning framework | PyTorch
CPU | Intel(R) Core(TM) i7-12700H
GPU | RTX 3090Ti
GPU memory | 24 GB
RAM | 64 GB
Hard disk | 2 TB
Optimizer | Adam
To validate the contribution of each module in the vehicle recognition method based on multi-dimensional self-attention, three comparison variants are built on the backbone network: one adding a Relation-Aware Global Attention (RGA) module, one adding an Efficient Channel Attention Network (ECANet) module, and one adding a Convolutional Block Attention Module (CBAM). Together with the proposed algorithm, these are denoted M1, M2, M3, and M4, respectively, and compared on the VeRI-776 and VehicleID datasets, as displayed in Figure 6.

Schematic of comparison experiments on VeRI-776 and VehicleID datasets
According to Figure 6(a), M4 performed best among the compared algorithms, with an mAP of 76.42%, Rank-1 of 94.64%, and Rank-5 of 98.15%. As shown in Figure 6(b), on the Small subset the Rank-1 and Rank-5 of M4 were 76.42% and 83.52%, respectively, both of which were optimal. To further validate the effectiveness of the proposed vehicle recognition algorithm based on multi-dimensional self-attention, the VeRI-776 dataset is split into training and testing sets at a ratio of 7:3. Figure 7 shows the results.

Evaluation indicators and loss functions
As shown in Figure 7(a), as the number of training rounds increases, the mAP, Rank-1, and Rank-5 of the vehicle recognition algorithm based on multi-dimensional self-attention gradually rise and eventually stabilize at about 75%, 87%, and 92%, respectively. In Figure 7(b), the loss value gradually decreases with the number of rounds and converges towards 0, which demonstrates the effectiveness of the algorithm.
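For reference, the sketch below shows how mAP and Rank-k (CMC) scores of the kind reported here are commonly computed for re-identification from a query-gallery distance matrix; the function name is illustrative, and the camera-based filtering usually applied on VeRi-776 is omitted for brevity.

```python
import numpy as np

def evaluate_rank_map(dist, query_ids, gallery_ids, topk=(1, 5)):
    """Compute CMC Rank-k and mAP from a (num_query x num_gallery) distance matrix."""
    num_query = dist.shape[0]
    cmc = np.zeros(max(topk))
    average_precisions = []
    valid = 0

    for q in range(num_query):
        order = np.argsort(dist[q])                        # nearest gallery first
        matches = (gallery_ids[order] == query_ids[q]).astype(np.float32)
        if matches.sum() == 0:                             # identity absent from gallery
            continue
        valid += 1
        first_hit = int(np.argmax(matches))                # rank of first correct match
        cmc[first_hit:] += 1                               # counted as a hit from that rank on
        # Average precision: precision at each position of a correct match.
        hit_positions = np.where(matches == 1)[0]
        precisions = (np.arange(len(hit_positions)) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())

    cmc = cmc / max(valid, 1)
    results = {"Rank-%d" % k: float(cmc[k - 1]) for k in topk}
    results["mAP"] = float(np.mean(average_precisions))
    return results


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.random((10, 50))                               # toy distance matrix
    q_ids = rng.integers(0, 5, 10)
    g_ids = rng.integers(0, 5, 50)
    print(evaluate_rank_map(d, q_ids, g_ids))
```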
To verify the effectiveness of the VRI method based on multi-dimensional feature fusion, the experimental environment used for training remains unchanged. The optimized 50-layer residual network is selected as the main framework for feature extraction, and pre-trained parameters from the ImageNet dataset are used to initialize this backbone network to improve model performance. Three comparison methods are set up, namely Position Map Regression Networks (PRN), Uncertainty-aware Multi-shot Teacher-Student (UMTS), and Multi-Scale Deep (MSDeep), and compared with the proposed VRI method based on multi-dimensional feature fusion on the VeRI-776 and VehicleID datasets, as displayed in Figure 8.
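A minimal sketch of initializing the 50-layer backbone from ImageNet pre-trained parameters with torchvision; the `ResNet50_Weights.IMAGENET1K_V1` enum (torchvision 0.13+) and the removal of the classification head are assumptions about the setup, not details given in the text.

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def build_backbone() -> nn.Module:
    """ResNet-50 feature extractor initialised from ImageNet weights."""
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # pretrained initialization
    model.fc = nn.Identity()        # drop the 1000-class ImageNet classifier head
    return model                    # outputs a 2048-d embedding per image
```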

Comparative experiments of various methods on VeRI-776 and VehicleID
According to Figure 8(a), on VeRI-776 the multi-dimensional fusion VRI method achieved the best mAP and Rank-1, at 83.65% and 96.32%, respectively. From Figure 8(b), the proposed method also performed well on VehicleID, especially on the Small subset, where its Rank-1 and Rank-5 were both the best, at 91.42% and 96.35%, respectively. The ablation experiment combining the proposed multi-dimensional self-attention with the multi-dimensional feature fusion vehicle recognition algorithm is displayed in Table 2.
Methodologies | VehicleID Small mAP (%) | VehicleID Small Rank-1 (%) | VehicleID Medium mAP (%) | VehicleID Medium Rank-1 (%) | VehicleID Large mAP (%) | VehicleID Large Rank-1 (%) | VeRI-776 mAP (%) | VeRI-776 Rank-1 (%) | VeRI-776 Rank-5 (%)
---|---|---|---|---|---|---|---|---|---
Baseline | 82.67 | 92.31 | 88.12 | 90.26 | 77.29 | 95.84 | 82.36 | 92.17 | 94.67
Baseline + multi-scale attention mechanism | 86.22 | 95.14 | 88.20 | 90.24 | 77.84 | 90.59 | 83.15 | 93.25 | 95.49
Baseline + multi-scale information fusion | 84.57 | 97.12 | 88.11 | 95.12 | 74.48 | 91.13 | 83.64 | 94.36 | 96.17
Multi-scale attention mechanism + multi-scale information fusion | 89.23 | 98.65 | 88.26 | 96.16 | 77.97 | 94.62 | 84.69 | 97.64 | 98.15
According to Table 2, on the VeRI-776 dataset the combined method improves VRI performance significantly, with mAP reaching 84.69%, Rank-1 reaching 97.64%, and Rank-5 reaching 98.15%. On the VehicleID dataset, after combining the multi-dimensional attention and feature fusion algorithms, the mAP of the Small subset increases by 6.56 percentage points, the Medium subset by 0.14, and the Large subset by 0.68. Incorporating either the multi-dimensional self-attention structure or the multi-dimensional feature fusion alone already improves the metrics, and the results are best when both are combined. This demonstrates that combining the multi-dimensional self-attention structure with multi-dimensional feature fusion effectively improves the performance of the intelligent capture and analysis model for highway toll evasion vehicles based on the VRI algorithm.
Recognizing highway toll evasion vehicles faces multiple challenges, including feature variation caused by differences in capture equipment, inter-class confusion caused by diverse traffic scenes, intra-class similarity, and the need to preserve detail. This study improved the vehicle recognition method by combining multi-dimensional self-attention with multi-dimensional feature fusion and proposed an intelligent capture analysis model for highway toll evaders based on the VRI algorithm. The experimental results showed that on VeRI-776 the multi-dimensional attention module achieved the best mAP, Rank-1, and Rank-5, at 76.42%, 94.64%, and 98.15%, respectively. The VRI method based on multi-dimensional feature fusion achieved the best mAP and Rank-1 on VeRI-776, at 83.65% and 96.32%, and the best Rank-1 and Rank-5 on the Small subset of VehicleID, at 91.42% and 96.35%. When the two were combined, the metrics improved further: the mAP of the VehicleID Small test subset increased by 6.56 percentage points, and on VeRI-776 the mAP reached 84.69%, Rank-1 reached 97.64%, and Rank-5 reached 98.15%. Overall, the model that integrates the multi-dimensional self-attention structure with multi-dimensional features has practical value for improving the intelligent capture of highway toll evaders. Future research can further explore its application in real-world scenarios and consider multimodal data integration to achieve broader applications and higher recognition performance.