Optimisation of highway vehicle occlusion recognition based on attention and multitasking approach

  
17 March 2025

Introduction

As major transportation routes between cities, highways carry high vehicle speeds and dense traffic flow, which makes accurate and efficient risk identification and early warning systems all the more necessary for coping with potential traffic accidents [1-2]. In modern traffic safety management, real-time identification and early warning of highway vehicle risks is an important task for ensuring the safety and smoothness of road traffic [3-5].

With the increase in motor vehicle ownership and the continuous extension of highway mileage, the highway accident rate has been trending gradually upward, seriously affecting people's travel safety. For complex highway scenarios, especially accident-prone road sections, accurate detection and identification of vehicle risks are therefore crucial for preventing safety accidents and formulating corresponding control programs [6-8]. In the actual highway surveillance view, however, vehicle targets often suffer from inter-target occlusion and occlusion by the environmental background [9-10]. Because of occlusion, existing target detection and recognition algorithms struggle to recover the exact position and category of a target from video data with missing appearance and motion information, which leaves these algorithms with poor reliability and low robustness [11-12]. To better address this problem, existing intelligent algorithms for vehicle target detection and recognition are optimized to enhance the detection and recognition capability of traffic environment perception algorithms in dynamic occlusion scenarios, to serve intelligent transportation applications such as traffic efficiency optimization and traffic safety warning in intelligent vehicle-road systems, and to contribute to the development of transportation intelligence [13-15].

Vehicles are among the main participants in transportation, and the detection and tracking techniques associated with them are the main components of an intelligent transportation risk identification system. Literature [16] developed a non-intrusive, vision-based system to identify highway vehicles, which uses deep convolutional neural networks to extract feature descriptors from video regions to locate vehicle positions and a linear support vector machine to classify the vehicles, greatly improving vehicle identification performance. Literature [17] proposed an integrated recognition method, YOLOv4-l, based on YOLOv4 to reduce computational complexity, and used this deep learning algorithm to analyze the characteristics of multi-lane traffic flow at different flow densities; it significantly improves the speed of detecting and recognizing vehicles, but at the cost of some recognition accuracy. Literature [18] proposed a video vehicle detection and classification method based on the static appearance features and motion features of vehicles, relying on traffic surveillance video to describe vehicle motion and combining designed algorithms for removing, selecting, and reorganizing detection targets to improve detection and classification accuracy. Literature [19] combined a deep learning model with a single-shot multibox detector (SSD), and experiments show that the proposed method detects more vehicles than the pre-trained SSD model on highway scenes. Literature [20] showed that road sensors providing accurate vehicle trajectory data face occlusion, which reduces the integrity and reliability of target detection, and therefore proposed a recognition algorithm based on LiDAR sensor data that quickly and accurately determines what is occluded by automatically recognizing a vehicle's local occlusion and the relationships between occlusions. Literature [21] studied how to improve the accuracy of LiDAR-based vehicle detection and identification under weather occlusion and proposed a 3D-SDBSCAN algorithm based on density-based spatial clustering of applications with noise (DBSCAN); experiments proved that the proposed algorithm better overcomes weather occlusion under rain and snow conditions.

In summary, previous research has made significant progress in addressing vehicle occlusion on highways. However, these studies still have limitations. For example, in highly complex or variable environments, such as under lighting changes and strong occlusion, there is still considerable room for improving detection precision. To this end, this research innovatively introduces the Convolutional Attention Mechanisms for Three-Branch Structures (CAMTS) and the Unbiased Coordinate System Transformation (UCST) for keypoint extraction and transformation, combined with Multi-scale Attention Feature Fusion (MAFF) to dynamically adjust feature weights at different resolutions, thereby achieving efficient detection of vehicles in complex occlusion scenes. The research targets the performance degradation of existing detection algorithms in heavily occluded and variable environments and aims to provide a more robust solution for vehicle detection and recognition in autonomous driving and intelligent transportation systems.

Methods and materials
Attention-based vehicle keypoint conversion and detection

Occlusion takes various forms and may be caused by vehicles ahead, roadblocks, or other traffic participants. Traditional vehicle detection methods, however, mostly rely on global vehicle appearance information; when a vehicle is partially or completely obscured, these methods struggle to identify it accurately, resulting in a significant decrease in detection accuracy [22-23]. Therefore, the study first extracts keypoint features from vehicles, then converts the keypoint coordinates, and finally performs image detection. Key points are extracted and transformed using UCST, as shown in Figure 1.

Figure 1.

Key point extraction and coordinate positioning of vehicles

Figure 1 (a) shows an example of key point extraction for a vehicle, and Figure 1 (b) shows an example of key point coordinate localization. In Figure 1 (a), multiple feature points of the vehicle, covering key parts such as the wheels, roof, and headlights, are identified through keypoint detection. Figure 1 (b) shows the coordinate positioning of these keypoints in the image, where w and h denote the width and height of the vehicle, respectively. UCST not only considers coordinate translation and scale changes but also introduces higher-order transformations, such as nonlinear transformation and a rotation matrix, to ensure effective transformation of the key points in complex scenes [24-25]. The equation for coordinate translation and normalization is shown in equation (1).

$P_t' = P_i - \frac{1}{n}\sum_{i=1}^{n} P_i$ (1)

In equation (1), Pi represents the original keypoint coordinates, Pt' represents the translated coordinates, and n represents the total number of keypoints. The mean of all key points gives the centroid position, and each key point is translated into an unbiased coordinate system centered on the centroid. Scale normalization then follows, as shown in equation (2).

$P_t'' = \frac{P_i}{\sum_{i=1}^{n}\left(x_i^2 + y_i^2\right)}$ (2)

In equation (2), Pt'' represents the key point coordinates after scale normalization: the sum of squared Euclidean distances from the key points to the centroid is computed, and the coordinates of each key point are divided by that value. A rotation transformation is then performed so that vehicle key points captured from different angles can be uniformly mapped to a standard posture. The calculation formula is shown in equation (3).

$P_t''' = R(\theta)\,P_t'' = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} P_t''$ (3)

In equation (3), R(θ) represents the rotation matrix and θ represents the rotation angle. A homogeneous coordinate transformation is then used to achieve a nonlinear projective transformation of the two-dimensional image space, with the calculation expression shown in equation (4).

$P_t'''' = L\,P_t''' = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} x_i''' \\ y_i''' \\ 1 \end{pmatrix}$ (4)

In equation (4), L represents a 3 × 3 homogeneous transformation matrix, and xi''' and yi''' represent the coordinates of the feature points after the rotation transformation. After completing the two-dimensional transformation of the vehicle keypoints, CAMTS is introduced to enhance the spatiotemporal data processing capability and improve keypoint recognition accuracy. The structure of CAMTS is shown in Figure 2 [26-27].
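As a worked illustration of equations (1)-(4), the following is a minimal NumPy sketch of the UCST transformation chain, assuming the keypoints are given as an n × 2 array; the function name and default arguments are illustrative rather than taken from the paper's implementation.

```python
import numpy as np

def ucst_normalize(points, theta=0.0, H=np.eye(3)):
    """Sketch of the UCST chain in equations (1)-(4).

    points : (n, 2) array of raw keypoint coordinates P_i
    theta  : rotation angle mapping the view to a standard posture
    H      : 3x3 homogeneous transformation matrix L
    """
    # Equation (1): translate to a centroid-centred (unbiased) frame
    centroid = points.mean(axis=0)
    p_t = points - centroid

    # Equation (2): scale normalisation by the sum of squared distances to the centroid
    scale = np.sum(p_t[:, 0] ** 2 + p_t[:, 1] ** 2)
    p_s = p_t / scale

    # Equation (3): rotate to the standard posture
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    p_r = p_s @ R.T

    # Equation (4): homogeneous (projective) transformation
    ones = np.ones((p_r.shape[0], 1))
    p_h = (H @ np.hstack([p_r, ones]).T).T
    return p_h[:, :2] / p_h[:, 2:3]   # back to Cartesian coordinates
```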

Figure 2.

The structure of CAMTS

As shown in Figure 2, the left branch first passes the C × H × W input tensor into the Z-pool layer, which reduces the number of channels to 2 and lowers the computational cost. The resulting 2 × H × W tensor is then passed through convolutional and batch normalization layers to produce a 1 × H × W tensor, to which an activation function is applied to obtain the corresponding attention weights. The middle branch permutes the input tensor to H × C × W, passes it into the Z-pool layer to reduce the number of channels to 2, obtains a 1 × W × C tensor through convolutional and batch normalization layers, and finally applies an activation function to generate the attention weights. The expression of the Z-pool layer is given in equation (5).

$Zpool(x) = \left[ \mathrm{MaxPool}(x), \mathrm{AvgPool}(x) \right]$ (5)

In equation (5), MaxPool and AvgPool denote max pooling and average pooling, respectively. The formula for the convolutional attention mechanism is given in equation (6).

$F'_{C \times H \times W} = M_c^{C \times 1 \times 1} \otimes F_{C \times H \times W}$ (6)

In equation (6), $F_{C \times H \times W}$ and $M_c^{C \times 1 \times 1}$ denote the output matrix of the previous layer and the one-dimensional channel weight matrix, respectively; ⊗ represents matrix multiplication. The expression of the weight matrix M is given in equation (7).

$M = \mathrm{Expand\_as}\left( \sigma \left( \mathrm{Con}_{16}(\mathrm{GAP}(F))\,\mathrm{Con}_k(F') \right) \right)$ (7)

In equation (7), σ and F represent the Sigmoid activation function and the input matrix, respectively; GAP represents the global average pooling layer; Con16 and Conk represent 16 one-dimensional convolutions with kernel size 1 and the cross-channel interactive convolution, respectively; and F′ represents the feature map formed from the input matrix by global average pooling and global attention. The cross-entropy loss function is applied to evaluate CAMTS, and its equation is shown in equation (8).

$Loss = -\sum_{i=0}^{M} y_i^{*} \log(q_i) = -\log(q_d)$ (8)

In equation (8), qd represents the predicted probability of the true class d, qi represents the predicted probability of the ith class, and yi* denotes the actual distribution of the sample labels.
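As an illustration of equations (5) and (6), the following is a minimal PyTorch sketch of the Z-pool operation and one CAMTS branch; the convolution kernel size and class names are assumptions rather than settings reported in the paper.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Equation (5): concatenate max- and average-pooled channel maps."""
    def forward(self, x):                               # x: (B, C, H, W)
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)

class CAMTSBranch(nn.Module):
    """One branch of the three-branch attention: Z-pool -> conv/BN -> sigmoid."""
    def __init__(self, kernel_size=7):                  # kernel size is an assumption
        super().__init__()
        self.zpool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):                                # x: (B, C, H, W)
        w = torch.sigmoid(self.bn(self.conv(self.zpool(x))))  # (B, 1, H, W) weights
        return x * w                                     # re-weighted features, as in eq. (6)

# Example: the other two branches would apply the same operations to permuted tensors
out = CAMTSBranch()(torch.randn(1, 16, 64, 64))
```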

Construction of Multi-task vehicle occlusion recognition model with fusion attention

After extracting the vehicle key points, converting their coordinates, and enhancing them with CAMTS, it was found that conventional feature point recognition methods still degrade in dynamic and complex environments, such as under changing shooting angles, lighting changes, and vehicle turns [28]. To address this issue, the research improves the traditional feature point recognition algorithm and proposes a two-line multi-task occlusion recognition and detection model. The structure of the model is shown in Figure 3.

Figure 3.

Structure of the two-line multi-task vehicle recognition model

In Figure 3, the first line is responsible for extracting the basic features of the vehicle, using multiple convolutional layers to process the input image layer by layer and gradually extract the key vehicle features through stacked feature maps. The second line handles the vehicle's dynamic information, which includes lighting changes, turning, and other complex situations. After feature extraction, the two lines are fused through a shared layer, and the extracted features are fed into multiple branch networks for different tasks, such as vehicle position detection and occlusion recognition; each task branch is optimized through an independent loss function (a compact code skeleton of this two-line structure is sketched below). The feature fusion branch uses the MAFF method. Compared with other methods, MAFF can dynamically adjust the importance weights of features at different scales through multi-scale feature extraction and attention mechanisms, thereby preserving more useful feature information at different resolutions [29-30]. Its structure is depicted in Figure 4.
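The following is a minimal PyTorch skeleton of the two-line structure described above, intended only as an illustration: the layer configuration, channel counts, and task-head outputs (class scores, a four-value box regression, and an occlusion score) are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TwoLineMultiTaskNet(nn.Module):
    """Skeleton of the two-line, multi-task model in Figure 3 (sizes are illustrative)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Line 1: static appearance features
        self.line1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Line 2: dynamic information (lighting changes, turning, ...)
        self.line2 = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Shared fusion layer followed by global pooling
        self.shared = nn.Conv2d(128, 128, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Task heads: classification, localisation, occlusion recognition
        self.cls_head = nn.Linear(128, num_classes)
        self.loc_head = nn.Linear(128, 4)        # bounding-box regression
        self.mask_head = nn.Linear(128, 1)       # occlusion score

    def forward(self, x_static, x_dynamic):
        f = torch.cat([self.line1(x_static), self.line2(x_dynamic)], dim=1)
        f = self.pool(self.shared(f)).flatten(1)
        return self.cls_head(f), self.loc_head(f), self.mask_head(f)
```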

Figure 4.

The structure of MAFF

In Figure 4, MAFF first adjusts the resolution of the keypoint feature map and selects the feature map X0 with moderate resolution. After multiple screenings, three feature maps of different sizes are obtained, namely Y1, Y2, and Y3, with example scales of 8 × 128 × 128, 16 × 256 × 256, and 26 × 512 × 512. Then, through convolution and batch normalization operations, features matching the overall resolution of the vehicle are obtained, and the three feature maps are adjusted accordingly. The fusion attention module performs three types of feature fusion and finally produces high-quality feature maps that match the original vehicle, thereby completing recognition. The calculation formula for multi-scale feature extraction is shown in equation (9).

$F_i = \mathrm{Conv}(X_i), \quad i = 1, 2, 3$ (9)

In equation (9), Fi represents the feature map obtained through the convolution operation, and Xi indicates the ith input feature map; the different input feature maps correspond to features at different scales, such as low, medium, and high resolutions. A reasonable weight distribution among the feature maps is ensured by dynamically adjusting the importance of features at different scales, as calculated in equation (10).

$A_i = \sigma(G \cdot F_i), \quad i = 1, 2, 3$ (10)

In equation (10), Ai represents the attention weight at the ith scale, and G represents the learned weight matrix. The calculation for multi-scale feature fusion is shown in equation (11).

$F_{fusion} = \sum_{i=1}^{3} A_i \cdot F_i$ (11)

In equation (11), the features Fi of each scale are weighted by their attention weights Ai and summed to obtain the final fused feature Ffusion. This fusion method preserves useful information from feature maps of different resolutions while enhancing the influence of important features through the attention mechanism. Finally, the losses of the multiple tasks are jointly optimized through the detection objective function to ensure that the model performs well on tasks such as vehicle recognition and occlusion processing. The calculation formula is shown in equation (12).

$D = D_{cls} + \lambda_1 D_{loc} + \lambda_2 D_{mask}$ (12)

In equation (12), Dcls represents the classification loss, Dloc the localization loss, and Dmask the occlusion recognition loss; λ1 and λ2 are hyperparameters that control the contribution of each part to the final optimization (a minimal sketch of this combined objective is given below). In summary, the overall two-line, multi-task process for recognizing occluded vehicles on highways is shown in Figure 5.
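Below is a minimal sketch of how the combined objective in equation (12) can be assembled; the choice of the individual loss functions (cross entropy, smooth L1, binary cross entropy) and the λ values are assumptions for illustration, not the paper's reported settings.

```python
import torch.nn as nn

cls_loss = nn.CrossEntropyLoss()       # D_cls
loc_loss = nn.SmoothL1Loss()           # D_loc (choice of loss is an assumption)
mask_loss = nn.BCEWithLogitsLoss()     # D_mask

def total_loss(cls_out, loc_out, mask_out, cls_gt, loc_gt, mask_gt,
               lambda1=1.0, lambda2=1.0):          # lambda values are illustrative
    # Equation (12): D = D_cls + lambda1 * D_loc + lambda2 * D_mask
    return (cls_loss(cls_out, cls_gt)
            + lambda1 * loc_loss(loc_out, loc_gt)
            + lambda2 * mask_loss(mask_out, mask_gt))
```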

Figure 5.

The two-line multi-task vehicle occlusion recognition modeling process

As shown in Figure 5, the first stage of the process is to obtain the vehicle keypoint features from the input image and unify their coordinates using the unbiased coordinate system transformation. Next, CAMTS is used to enhance the spatiotemporal processing capability of the keypoint features: the model processes spatial attention, channel attention, and global contextual features through three parallel branches and uses a Z-pool layer to reduce computational complexity. The fused attention weights are then obtained, and the MAFF module is used to extract and weight features at different resolutions and generate a fused feature map. Subsequently, the model adopts the two-line architecture, with one line responsible for extracting basic vehicle features and the other processing dynamic information such as lighting changes and turning. After the extracted features are fused in the shared layer, the model performs tasks such as vehicle position detection and occlusion recognition through multiple branch networks, each of which optimizes its loss function independently to ensure accuracy. Finally, by integrating the target detection loss function and jointly optimizing classification, localization, and occlusion recognition, the robustness and accuracy of vehicle recognition in complex scenes are improved.
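To make the fusion step concrete, the following is a minimal PyTorch sketch of a MAFF-style module implementing equations (9)-(11); the output channel count, the resizing to the middle resolution, and the realisation of G as a 1 × 1 convolution are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAFF(nn.Module):
    """Sketch of multi-scale attention feature fusion (equations (9)-(11))."""
    def __init__(self, in_channels=(8, 16, 26), out_channels=32):
        super().__init__()
        # Equation (9): F_i = Conv(X_i) for each scale
        self.convs = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=1) for c in in_channels)
        # Equation (10): A_i = sigmoid(G(F_i)); G realised here as a 1x1 convolution
        self.gates = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=1) for _ in in_channels)

    def forward(self, xs, out_size=(256, 256)):
        fused = 0.0
        for x, conv, gate in zip(xs, self.convs, self.gates):
            f = F.interpolate(conv(x), size=out_size,
                              mode="bilinear", align_corners=False)
            a = torch.sigmoid(gate(f))          # attention weight A_i
            fused = fused + a * f               # Equation (11): weighted sum
        return fused

# Example with the feature-map scales quoted in the text (batch size 1)
y1 = torch.randn(1, 8, 128, 128)
y2 = torch.randn(1, 16, 256, 256)
y3 = torch.randn(1, 26, 512, 512)
fused = MAFF()([y1, y2, y3])    # -> (1, 32, 256, 256)
```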

Results
Performance testing of the two-line multi-task vehicle occlusion recognition model

The experiments used an Intel Core 2.5 GHz dual-core CPU with 16 GB of memory and an NVIDIA GeForce RTX166s GPU. The programming language was Python 3.7, and the model was built in PyTorch with Adam as the optimizer. The publicly available COCO dataset and the ApolloScape Open Dataset for Autonomous Driving (ApolloScape) were used as testing data sources. COCO is a dataset commonly employed in computer vision for multi-object detection and contains a large number of vehicle images with occlusions. ApolloScape is a large-scale outdoor dataset made specifically for autonomous driving tasks; it includes street-view images taken under various weather, time, and traffic conditions, with a particular focus on detecting vehicle occlusions. Firstly, the final proposed occlusion recognition model was subjected to an ablation test, and the outcomes are shown in Figure 6.

Figure 6.

Ablation test results

Figure 6 (a) shows the ablation test results of the final model on the COCO dataset, and Figure 6 (b) shows the results on the ApolloScape dataset. In Figure 6, the pixel values represent the Euclidean distance between the extracted feature points and the annotated feature points; the smaller the distance, the closer the extracted feature points are to the real feature points and the higher the detection accuracy. Compared with the individual CAMTS and MAFF modules, the combined CAMTS-MAFF configuration achieved a lower detection pixel value of approximately 1.8 px. The pixel value of the final model converged the fastest, with its lowest value remaining stable at around 1 px. In Figure 6 (b), as on the COCO dataset, the overall performance of the final model was better than that of the other individual and combined models; the final model converged in around 100 iterations, and its minimum feature point extraction pixel value was 1 px. The study then introduced advanced occlusion detection algorithms for comparison, namely Mask Region-Based Convolutional Neural Networks (Mask R-CNN), Faster Region-Based Convolutional Neural Networks (Faster R-CNN), and You Only Look Once version 7 (YOLOv7). The PR curves and the corresponding areas S under the curves were plotted with precision and recall as the coordinates, and the results are shown in Figure 7.

Figure 7.

PR curves and area test results for each module

Figure 7 (a) shows the PR curve and area results of Mask R-CNN, Figure 7 (b) those of Faster R-CNN, Figure 7 (c) those of YOLOv7, and Figure 7 (d) those of the proposed model. In Figure 7, the PR curve of Mask R-CNN exhibited noticeable fluctuations, with precision dropping rapidly at higher recall rates; its final area was 0.747, which is relatively low and indicates that Mask R-CNN was not stable enough in complex scenes. The PR curve of Faster R-CNN was comparatively smooth and its precision remained good at medium to high recall rates, but it still showed a downward trend. The PR curve of YOLOv7 maintained high precision, performing especially well at medium to high recall rates with an area of 0.941, yet it remained inferior to the proposed method. The PR curve of the proposed method showed the best smoothness and stability and maintained high precision even at high recall rates; its area of 0.953 was the largest among the four models. The study then evaluated the models using the P-value, R-value, F1 value, Mean Squared Error (MSE), and Mean Absolute Error (MAE) as indicators, with the results given in Table 1.

Table 1. Comparison of different methods for detecting occluded vehicles

Dataset | Model | P/% | R/% | F1/% | MSE | MAE
COCO | Mask R-CNN | 79.33 | 82.29 | 80.81 | 0.04 | 0.06
COCO | Faster R-CNN | 84.69 | 84.17 | 84.43 | 0.02 | 0.04
COCO | YOLOv7 | 89.75 | 88.74 | 89.32 | 0.02 | 0.02
COCO | Our method | 93.28 | 92.47 | 92.82 | 0.01 | 0.02
ApolloScape | Mask R-CNN | 84.25 | 86.54 | 85.41 | 0.03 | 0.06
ApolloScape | Faster R-CNN | 86.69 | 88.15 | 87.36 | 0.03 | 0.05
ApolloScape | YOLOv7 | 89.51 | 90.33 | 89.51 | 0.03 | 0.04
ApolloScape | Our method | 92.23 | 93.82 | 92.62 | 0.01 | 0.02

According to Table 1, the precision, recall, and overall detection efficacy of the proposed model in the occluded vehicle detection task were higher than those of Mask R-CNN, Faster R-CNN, and YOLOv7. In particular, on the COCO dataset, the F1 value of the proposed model reached 92.82%, about 3.5 percentage points higher than YOLOv7, demonstrating strong comprehensive detection ability. The advantages of the proposed model are also reflected in the error metrics MSE and MAE. On the COCO and ApolloScape datasets, the MSE of the proposed model was 0.01, while the other models had higher MSE values, especially Mask R-CNN with MSEs of 0.04 and 0.03. Likewise, the MAE values showed that the proposed model had the smallest error, only 0.02. In conclusion, the proposed method is more precise, recalls more effectively, and is more error-resistant than the existing mainstream methods.
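For reference, the indicators reported in Table 1 can be computed as in the following sketch, which assumes binary detection labels and per-detection confidence scores; the example arrays are illustrative, not the study's data.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative ground-truth labels and predictions (1 = occluded vehicle detected)
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7])   # predicted confidences

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
mse = np.mean((scores - y_true) ** 2)   # MSE between confidence and label
mae = np.mean(np.abs(scores - y_true))  # MAE between confidence and label
print(f"P={p:.2%} R={r:.2%} F1={f1:.2%} MSE={mse:.3f} MAE={mae:.3f}")
```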

Simulation testing of the two-line multi-task vehicle occlusion recognition model

To test the practical applicability of the newly proposed model, the study randomly selected two images from each of the two datasets and again compared the occlusion detection performance of the four methods. The results are shown in Figure 8.

Figure 8.

The occlusion detection results of different detection methods

Figures 8 (a) and 8 (b) show the occlusion detection results of the Mask R-CNN model, Figures 8 (c) and 8 (d) those of the Faster R-CNN model, Figures 8 (e) and 8 (f) those of the YOLOv7 model, and Figures 8 (g) and 8 (h) those of the proposed model. In Figure 8, for severely occluded vehicles, the Mask R-CNN and Faster R-CNN models failed to fully detect the occluded areas and exhibited some false positives and false negatives. The YOLOv7 model performed well and could accurately detect occluded vehicles, but in complex scenes its detection precision decreased where the bounding boxes of multiple vehicles overlapped. The model proposed by the research demonstrated the best detection capability, effectively detecting all occluded vehicles with the most accurate bounding boxes and almost no false positives or false negatives, especially in complex multi-lane scenes. The research then tested the model's detection performance under different occlusion rates, and the findings are given in Table 2.

Table 2. Comparison of model detection results under different occlusion conditions

Occlusion rate | Method | Color precision/% | Vehicle type precision/% | Detection time/s
20% | Mask R-CNN | 79.08 | 87.22 | 1.04
20% | Faster R-CNN | 76.33 | 72.23 | 1.46
20% | YOLOv7 | 82.81 | 73.52 | 1.22
20% | Our method | 91.74 | 93.49 | 0.29
40% | Mask R-CNN | 80.13 | 79.45 | 0.93
40% | Faster R-CNN | 75.62 | 86.51 | 0.77
40% | YOLOv7 | 82.77 | 89.52 | 0.61
40% | Our method | 90.06 | 89.06 | 0.44
60% | Mask R-CNN | 80.45 | 80.94 | 0.55
60% | Faster R-CNN | 83.92 | 78.25 | 1.94
60% | YOLOv7 | 76.76 | 83.49 | 0.74
60% | Our method | 94.56 | 88.67 | 0.21
80% | Mask R-CNN | 78.77 | 81.47 | 1.28
80% | Faster R-CNN | 82.36 | 84.22 | 1.18
80% | YOLOv7 | 84.58 | 87.11 | 1.92
80% | Our method | 92.52 | 88.16 | 0.53

Table 2 shows that the proposed model performed best under all occlusion conditions, particularly in terms of color precision and vehicle type precision, still maintaining 92.52% color precision and 88.16% vehicle type precision at an 80% occlusion rate. In contrast, the precision of the other methods, such as Mask R-CNN and Faster R-CNN, dropped markedly at high occlusion rates. The proposed model also had a significant advantage in detection time, which remained within about 0.5 seconds under all test conditions and was only 0.21 seconds at a 60% occlusion rate, much faster than the other models. Thus, the proposed model not only performs excellently in precision for occluded vehicle detection but also demonstrates extremely high efficiency in detection time, especially at high occlusion rates.

Conclusion

Aiming at the complex vehicle occlusion problem on highways, an occlusion recognition optimization method integrating an attention mechanism and a multi-task approach was designed. Keypoint detection and the unbiased coordinate system transformation were combined with CAMTS to extract the main features of vehicles, and the MAFF mechanism allowed multiple tasks to be processed effectively. The results indicate that the new model converged in as few as 100 iterations, with its lowest pixel value remaining stable at around 1 px. Compared with other advanced methods of the same type, the PR curve of the proposed model exhibited the best smoothness and stability, with a maximum area under the PR curve of 0.953. In addition, the new model achieved the highest P value of 93.28%, the highest R value of 93.82%, the highest F1 value of 92.82%, the lowest MSE of 0.01, and the lowest MAE of 0.02, all significantly better than the existing mainstream methods. The simulation results showed that, compared with Mask R-CNN, Faster R-CNN, and YOLOv7, the proposed method reached the highest vehicle color detection precision of 94.56% and the highest vehicle type detection precision of 89.06%, with detection times within about 0.5 seconds, demonstrating both very high detection accuracy and excellent detection efficiency. Despite these research achievements, limitations remain in practical applications, such as the need to improve performance under extreme lighting changes or different weather conditions. Future research can explore more tests in different scenarios and consider introducing richer environmental information, such as dynamic objects and multi-sensor data fusion, to further improve the precision and adaptability of occluded vehicle detection.