Study on automatic target identification method in substation 3D scene model
Published online: 25 Sep 2025
Received: 01 Feb 2025
Accepted: 09 May 2025
DOI: https://doi.org/10.2478/amns-2025-1009
© 2025 Youhui Chen et al., published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the development of the social economy, the demand for power supply across industries keeps rising, and the stable operation of substations plays a vital role in providing a stable and reliable power supply [1]. To monitor the equipment in a substation, regular inspection and maintenance are required. Substations advocate the concept of "unattended" operation, which can effectively reduce accidents caused by human misoperation and improve the level of automation [2]. At present, substations have already realized the "four remote" functions, i.e., telemetry, telesignaling, telecontrol, and teleregulation [3-4]. However, they lack the ability to supervise the safety of field personnel and cannot meet the unattended safety-protection requirements of the substation. Inspection and maintenance are still performed manually, which not only consumes a great deal of manpower but also wastes time; most importantly, relying on manual operation and maintenance often leads to unclear or incomplete records, so that substation equipment failures cannot be found in time, resulting in chain accidents [5-6]. At the same time, because a substation contains much high-voltage equipment, the risk of manual operation and maintenance is very high. Therefore, replacing manual inspection with automatic target recognition will become the development trend of substation inspection.
With the development of three-dimensional technology, 3D scene models of substations have been used in non-contact distance detection for electrical equipment and in virtual-reality substation simulation training systems because of their intuitiveness and high accuracy [7-9]. Substation 3D scene modeling technology is therefore of great significance to the safe operation of substations. Automatic front-view target recognition in the substation 3D scene model plays a great role in substation operation and maintenance: front-view template matching directly searches for the target and offers strong adaptability, high efficiency, and accurate recognition [10]. The method can effectively locate targets in the substation; once substation equipment or a line fails, it can greatly shorten the maintenance time of technicians, which is of great significance for ensuring the safe and stable operation of the substation [11-13]. Therefore, 3D modeling technology and front-view template matching technology can be combined to construct a simulation model for automatic front-view target recognition in the substation 3D scene.
In this paper, an improved YOLOv5 model is proposed for automatic target recognition in the substation 3D scene. Firstly, the three-dimensional point cloud data of a substation are obtained by LiDAR, and the data are processed through point cloud denoising, downsampling, and point cloud alignment; the 3D scene model of the substation is then constructed by fusing the 3D point cloud data with 2D imagery and using 3D modeling software. On the basis of the YOLOv5 algorithm, the BiFPN module is introduced to enhance multi-scale feature extraction for substation targets, coordinate attention is added to further improve recognition accuracy for small targets, and the CIoU loss function is adopted to reduce the recognition loss of the model for substation targets. The automatic target recognition performance of the model in the substation 3D scene model is verified through simulation experiments, which provides a new research idea for improving the safe, stable operation and real-time monitoring of substations.
As the core of power transmission network construction, the substation, with its complicated power lines and complex equipment, brings great challenges to the management and maintenance of equipment information. 3D GIS, with its good spatial representation and spatial analysis capability, is gradually being applied to the visual management of power transmission and substation projects. Relying on the substation 3D scene model, it can assist in realizing intelligent identification and monitoring of the substation and better maintain its safe operation.
The establishment of the substation 3D scene model is inseparable from the acquisition of 3D point cloud data, which relies on various types of 3D acquisition technology. This section introduces the principle of LiDAR 3D point cloud data acquisition to provide data support for the construction of the substation 3D scene model.
LiDAR measures the distance, speed, and position of a target by transmitting a laser beam toward it and receiving the reflected light. It is mainly composed of a laser, a receiver, an optical system, a clock, and a data processing unit. The laser emits an infrared beam with high energy density and narrow beamwidth. The beam is directed to the target surface through the beam control system; the target reflects it back, and the optical system converges the reflected light onto the receiver, which converts it into an electrical signal. The clock records the time of the laser beam from emission to reflection and back to the receiver; this time difference is used to calculate the distance between the target and the LiDAR with the time-of-flight (TOF) method, which includes two variants: pulse ranging and phase ranging.
Pulse ranging obtains the target distance by measuring the round-trip flight time of a laser pulse between the radar and the target. The distance is given by:

$$R = \frac{c \cdot t}{2}$$

where $R$ is the distance to the target, $c$ is the speed of light, and $t$ is the measured round-trip flight time of the pulse.
Phase ranging obtains distance information by measuring the phase difference generated by the round-trip flight of a continuous-wave laser signal between the radar and the target. To relate the phase difference to the distance, the phase offset accumulated over the round trip must be calculated first. The phase offset formula is:

$$\Delta\varphi = \frac{4\pi f R}{c}$$

where $\Delta\varphi$ is the measured phase offset, $f$ is the modulation frequency of the continuous-wave signal, and $R$ is the distance to the target.
Finally, solving the phase offset formula for the distance gives the range between the object and the LiDAR:

$$R = \frac{c \, \Delta\varphi}{4\pi f}$$

where $c$ is the speed of light.
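As a quick sanity check on the two ranging formulas, the following minimal Python sketch evaluates both (the constants and sample values are illustrative and not taken from the sensor used later):

```python
import math

C = 299_792_458.0  # speed of light, m/s

def pulse_range(round_trip_time_s: float) -> float:
    """Pulse ranging: R = c * t / 2."""
    return C * round_trip_time_s / 2.0

def phase_range(phase_offset_rad: float, mod_freq_hz: float) -> float:
    """Phase ranging: R = c * delta_phi / (4 * pi * f)."""
    return C * phase_offset_rad / (4.0 * math.pi * mod_freq_hz)

print(pulse_range(100e-9))             # a 100 ns round trip is ~15.0 m
print(phase_range(math.pi / 2, 10e6))  # a quarter-cycle offset at 10 MHz is ~3.75 m
```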
Based on these LiDAR scanning principles, this paper chooses a 110 kV substation as the research object and uses an Ouster OS LiDAR to collect the 3D point cloud data of the substation. A wide-angle 120° camera collects the accompanying RGB images; a total of 2048 samples are collected and named the BDZ dataset. The annotations comprise 1024 transformers, 740 fire sandboxes, and 5267 operators. The collected data can be used for tasks such as target recognition and 3D modeling of substations.
Outlier denoising based on statistical filtering. Statistical filtering first needs to determine a neighborhood window size, which is usually chosen according to the characteristics of the point cloud and the application requirements. For each point, statistical filtering finds all the points in its neighborhood window and calculates the mean and standard deviation of parameters such as coordinates, distances, and normal vectors. These statistics reflect the local distribution of the point cloud and help to determine which points are outliers. Secondly, by comparing each point's statistics with those of its neighborhood, the point is judged to be an outlier or not: points that deviate from the neighborhood mean by more than a set threshold, typically a multiple of the standard deviation, are judged as outliers. Finally, the outliers are removed or replaced; common options include deleting the point entirely or replacing it with the mean of its neighborhood.
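For illustration, here is a minimal sketch of this statistical outlier removal using Open3D, assuming the library is installed and with a hypothetical file name; the two parameters correspond to the neighborhood size and the standard-deviation threshold described above:

```python
import open3d as o3d

# Load the raw substation scan (file name is illustrative).
pcd = o3d.io.read_point_cloud("bdz_substation.pcd")

# For each point, the mean distance to its 20 nearest neighbors is computed;
# points whose mean distance lies more than 2 standard deviations above the
# global average are treated as outliers and removed.
filtered, inlier_idx = pcd.remove_statistical_outlier(nb_neighbors=20,
                                                      std_ratio=2.0)
print(f"kept {len(inlier_idx)} of {len(pcd.points)} points")
```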
The key to outlier noise removal is determining which points are outliers; after the above processing, the points that remain can be treated as valid measurements of the scene.

Voxel grid-based point cloud data downsampling. In order to speed up point cloud target recognition, a downsampling operation is needed first, i.e., the number of points is reduced in a reasonable way to improve overall efficiency. In this paper, the voxel grid method is chosen to downsample the substation 3D point cloud data: the bounding box of the cloud is divided into voxels of a set resolution, and all points falling into the same voxel are replaced by their centroid.
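A minimal NumPy sketch of the voxel grid method just described; the function name and the 5 cm voxel size in the usage comment are ours:

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Replace all points that fall in the same voxel by their centroid.

    points: (N, 3) array of XYZ coordinates.
    """
    # Integer voxel index of every point, relative to the cloud's minimum corner.
    idx = np.floor((points - points.min(axis=0)) / voxel_size).astype(np.int64)
    # Group points by voxel and average each group.
    _, inverse, counts = np.unique(idx, axis=0,
                                   return_inverse=True, return_counts=True)
    centroids = np.zeros((counts.size, 3))
    np.add.at(centroids, inverse, points)   # sum the points of each voxel
    return centroids / counts[:, None]      # divide by the voxel populations

# Usage: sampled = voxel_downsample(cloud, voxel_size=0.05)  # 5 cm voxels
```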
In order to better realize the construction of the substation 3D scene model, the processed substation 3D point cloud data need to be aligned. In this paper, on the basis of the traditional ICP point cloud alignment algorithm, we optimize the algorithm by adding new point cloud descriptors that assign different weights to matched point pairs of different importance, which accelerates convergence under specific convergence conditions [14].
The algorithm in this paper improves on the traditional ICP algorithm, and the main idea is to enhance the robustness of point pair matching when using a KD-Tree for neighborhood search: the search condition for the nearest point is changed from the plain Euclidean distance to a feature distance that also accounts for normal features. Matched point pairs can likewise be filtered according to this feature distance.
A normal-based evaluation function of the alignment error is chosen, i.e., the weighted point-to-plane error:

$$E(R, t) = \sum_{i=1}^{N} w_i \left[ (R p_i + t - q_i) \cdot n_i \right]^2$$

where $p_i$ is a point in the cloud to be aligned, $q_i$ is its matched point in the reference cloud, $n_i$ is the normal at $q_i$, $w_i$ is the weight of the point pair, and $R$ and $t$ are the rotation and translation to be solved.
The specific implementation steps of the improved ICP point cloud alignment algorithm are as follows:

1. Input the reference point cloud and the point cloud to be aligned.
2. Establish a KD-Tree for the reference point cloud.
3. Calculate the weighting factors and feature descriptors using the normal features of the two point clouds.
4. Iterate over all the points and search for nearest neighbors under the feature distance to form matched point pairs.
5. Calculate the average feature distance of the matched pairs, sort the pairs, and filter out pairs whose feature distance is too large.
6. Using the method proposed by Low, linearize the point-to-plane error under the small-angle assumption; the SVD method is utilized to solve for the rotation and translation.
7. Calculate the evaluation function; if it falls below the convergence threshold, stop, otherwise apply the transformation and return to step 4.
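The core numerical step, solving the rigid transformation from weighted correspondences with an SVD, can be sketched as follows. This is a point-to-point simplification for illustration only; the paper's algorithm additionally matches under the feature distance and evaluates convergence with the normal-based error above, and all names here are ours:

```python
import numpy as np
from scipy.spatial import cKDTree

def weighted_icp_step(src, ref, weights):
    """One ICP iteration: KD-Tree matching plus a weighted SVD (Kabsch) solve.

    src: (N, 3) cloud being aligned; ref: (M, 3) reference cloud;
    weights: (N,) per-pair weights, e.g. from normal-feature similarity.
    Returns the rotation R (3, 3) and translation t (3,).
    """
    # Nearest-neighbour correspondences through a KD-Tree on the reference.
    _, nn = cKDTree(ref).query(src)
    tgt = ref[nn]
    w = weights / weights.sum()
    # Weighted centroids and weighted cross-covariance.
    mu_s, mu_t = w @ src, w @ tgt
    H = (src - mu_s).T @ ((tgt - mu_t) * w[:, None])
    # SVD solve; the sign correction guards against reflections.
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```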
In this paper, the fusion between the 3D point cloud and the 2D color image is achieved by finding the correspondence between LiDAR point cloud coordinates in 3D space and 2D image pixel coordinates. At any moment, there is a projective relationship between the 3D point cloud data and the image pixel coordinate system, and the mapping between the two coordinate systems is expressed as:

$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R \mid t \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$
Since the frame rates of the two sensors differ, the timestamps of the data acquired by the two sensors need to be aligned. In this paper, the Time Synchronizer in the ROS system is used to align the current frame of the 3D LiDAR point cloud data with the 2D image data in the color solid-state LiDAR scanning system.
Each laser point in the point cloud frame, with world coordinates $(X_w, Y_w, Z_w)$, is projected into the pixel coordinate system $(u, v)$ through the above mapping, where $K$ is the camera intrinsic matrix obtained by calibration, $[R \mid t]$ is the extrinsic matrix between the LiDAR and the camera, and $z_c$ is the depth of the point in the camera coordinate system.
When a 3D laser point is converted to the pixel coordinate system, the resulting pixel coordinates are generally not integers, so the adjacent pixels in the 2D image are selected, and their RGB values are fused with weights given by the Euclidean distances to the projected position, realizing color fusion between the laser point cloud and the image pixels. Finally, the texture and color information of the 2D image is assigned to the 3D point cloud, realizing 2D-3D information fusion and providing effective data for the construction of the substation 3D scene model.
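A compact sketch of this projection-and-coloring step under the pinhole model above; for brevity it samples the nearest pixel rather than the distance-weighted fusion of neighboring pixels, and all names are ours:

```python
import numpy as np

def colorize_points(pts, img, K, R, t):
    """Project LiDAR points into the image and pick an RGB value per point.

    pts: (N, 3) LiDAR points; img: (H, W, 3) RGB image;
    K: (3, 3) intrinsics; R, t: LiDAR-to-camera extrinsics.
    """
    cam = pts @ R.T + t                # LiDAR frame -> camera frame
    uv = cam @ K.T                     # homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:3]        # perspective division
    h, w = img.shape[:2]
    # Keep points in front of the camera and inside the image bounds.
    ok = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w - 1) \
         & (uv[:, 1] >= 0) & (uv[:, 1] < h - 1)
    rgb = np.zeros((len(pts), 3), dtype=img.dtype)
    u = np.round(uv[ok, 0]).astype(int)
    v = np.round(uv[ok, 1]).astype(int)
    rgb[ok] = img[v, u]                # rows are v, columns are u
    return rgb
```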
In order to realize the construction of the substation 3D scene model, this paper takes the Unity3D software and the HTC Vive device as the basis and, combined with the substation 3D point cloud data collected above, obtains the construction process of the substation 3D scene model shown in Figure 1.

The 3D scene model construction process of the substation
The terrain data of the substation are scanned by LiDAR, and the acquired 3D point cloud data are used in the 3ds Max modeling tool to establish the 3D terrain model of the substation. Image material of the primary and secondary equipment of the substation is collected through a binocular stereo vision system and, based on the established 3D terrain model, 3ds Max is used to build the 3D models of the substation's firefighting equipment and scene from the collected imagery. The 3D models are imported into Unity3D to develop the 3D scene of the substation, giving the scene vividness and realism. The HTC Vive device is selected to realize human-computer interaction with the 3D scene, giving the substation operating environment a sense of presence and immersion. The substation 3D scene is then used for substation operation, and the virtual simulation results are used to evaluate operational safety and provide a guarantee for the safe operation of the substation.
The substation occupies a pivotal position in the power system, and its safe and stable operation is directly related to the reliability of power delivery. With the gradual maturation of 3D modeling technology, the substation 3D scene model has become a mainstream mode of substation operation and maintenance because of its intuitiveness and high accuracy. Automatic target recognition plays a great role in substation operation and maintenance, and using deep learning to recognize targets in the substation 3D scene model offers strong adaptability, high efficiency, and accurate recognition. The method can effectively locate targets in the substation and can greatly shorten the maintenance time of technicians once substation equipment or a line fails, which is of great significance for ensuring the safe and stable operation of the substation.
YOLOv5 is a one-stage, anchor-based target recognition algorithm that can quickly identify a target and its relative position in the image. The YOLOv5 model consists of an input module, a feature extraction module, a feature fusion module, and an output module, each built from basic components such as convolution, pooling, and activation functions [16].
- Input module. The YOLOv5s input module includes Mosaic data enhancement, adaptive anchor computation, and adaptive scaling. Mosaic data enhancement addresses varying scales and uneven data distribution by randomly selecting images and reassembling them through scaling, splicing, and cropping, making the model's prediction of small targets more accurate. Adaptive anchor computation automatically adapts the anchors to the training and validation sets according to the set parameters. Adaptive image scaling reduces computation by resizing the image to a fixed size while keeping the aspect ratio and unifying the image size with a padded-border ("black edge") strategy, thus speeding up training and inference (a simplified sketch follows this list).
- Feature extraction module. The YOLOv5 feature extraction module consists of four parts, namely the Focus, CBS, C3, and SPP modules, which are designed to improve the model's feature learning capability and computational efficiency.
- Feature fusion module. The YOLOv5 feature fusion module realizes multi-scale detection based on a feature pyramid network: the model uses convolution and upsampling to fuse feature maps of different resolutions along the vertical pathway and adopts skip connections to fuse maps with different semantic features along the horizontal pathway. Through these two fusion paths, the model obtains multi-scale feature information, improving its recognition ability and accuracy.
- Output module. The output module of YOLOv5 combines detection heads at three different scales and classifies and predicts on the results of feature extraction and fusion. In this module, the exact location of the prediction box is determined using the GIoU loss function, which not only reflects the distance between targets but also establishes a relationship between non-overlapping targets, alleviating slow convergence. When prediction boxes overlap heavily, they are filtered so that only the boxes with the highest confidence are retained. This reduces computation and enhances the model's predictive ability, realizing accurate recognition of the target.
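A simplified sketch of the adaptive scaling described in the input-module item, assuming a square output and plain black-border padding; YOLOv5's actual letterbox additionally pads only to a stride multiple, and the function name is ours:

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_size: int = 640, pad_value: int = 0) -> np.ndarray:
    """Resize while keeping the aspect ratio, then pad the short side."""
    h, w = img.shape[:2]
    scale = new_size / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    # Uniform border ("black edge" strategy) centres the resized image.
    out = np.full((new_size, new_size, 3), pad_value, dtype=img.dtype)
    top = (new_size - resized.shape[0]) // 2
    left = (new_size - resized.shape[1]) // 2
    out[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return out
```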
The attention mechanism is an idea that mimics the human visual system and can enhance a deep learning model's ability to focus on input information. A neural network model based on the attention mechanism can adaptively assign different weights according to the importance and relevance of the input data, helping the model focus on useful information so as to better process the input and improve the accuracy and generalization ability of the model [17].
Suppose there is an input sequence $X = (x_1, x_2, \ldots, x_n)$ and a target sequence $Y = (y_1, y_2, \ldots, y_m)$ to be generated.

The model is divided into two phases, i.e., an encoding phase and a decoding phase. In the encoding phase, the input sequence is encoded into a set of vectors $H = (h_1, h_2, \ldots, h_n)$; in the decoding phase, the target sequence elements are generated one by one conditioned on these vectors.

In the above model, the role of the attention mechanism is to provide contextual information for the decoding phase to better generate the target sequence elements. Specifically, for the $j$-th decoding step, an alignment score $e_{ij}$ is computed between the decoder state $s_j$ and each encoder vector $h_i$, normalized into weights $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{n} \exp(e_{kj})$, and the context vector is obtained as the weighted sum $c_j = \sum_{i=1}^{n} \alpha_{ij} h_i$.
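A few lines of NumPy make the weighting concrete; a generic dot-product score is assumed here, since the paper does not specify the score form:

```python
import numpy as np

def attention_context(h: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Context vector c_j = sum_i alpha_ij * h_i for one decoding step.

    h: (n, d) encoder vectors; s: (d,) current decoder state.
    """
    scores = h @ s                          # e_ij, here a dot-product score
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ h                        # weighted sum of encoder vectors
```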
Based on the substation 3D scene model data, and combining the YOLOv5 model with the attention mechanism, this paper proposes a target recognition model oriented to the substation 3D scene model; its framework is shown in Fig. 2. The two main components of this structure are the PAFPN and the prediction channels. In the original YOLOv5 structure, the prediction part uses only a three-layer network; this shallow structure reduces computational complexity, but its feature extraction ability for small targets is slightly insufficient. This paper therefore adds a fourth prediction layer so that the network can extract deeper features of the target and reduce the probability of missed detections. At the same time, attention modules are added to the up-sampling and down-sampling paths of the network to enhance the bottom-up transfer of semantics from low-level features, so that the network pays more attention to small targets and reduces the probability of false detections. Finally, the prediction boxes are filtered by non-maximum suppression to obtain the final inference results.

The 3D scene model of the substation is the model
The multi-scale feature fusion of the YOLO architecture has undergone continuous optimization and improvement, from FPN-like designs to FPN and then PANet. In principle, the shallow layers of a network have higher resolution and carry more accurate location information, while the deep layers have a larger receptive field and carry more high-dimensional semantic information, contributing more to the classification of the target. Optimizing how information at different scales is fused is therefore a way to enhance the network architecture.
The problem faced by single-stage target detection algorithms is that features of different scales cannot be obtained from a single stage of feature extraction. The features extracted in the backbone are divided by stage and denoted as $\{C_3, C_4, C_5\}$, with resolution decreasing and semantic level increasing from $C_3$ to $C_5$.

Feature fusion is then performed layer by layer from top to bottom, and the outputs are denoted as $\{P_3, P_4, P_5\}$.
YOLOv5 applies PANet for feature fusion in the neck, and the PANet structure is characterized by a bottom-up fusion link established in the layers $\{P_3, P_4, P_5\}$ on top of the top-down path. BiFPN goes further by removing nodes with only one input edge, adding same-scale skip connections, and weighting the input features of each fusion node:

$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \, I_i$$

where $I_i$ are the input feature maps of a fusion node, $w_i \geq 0$ are learnable weights reflecting the importance of each input, and $\epsilon$ is a small constant that keeps the normalization stable.
Through this design, a small increase in accuracy and a significant reduction in computation can be realized in theory, which is of great practical significance for improving target recognition and efficient monitoring of the substation 3D scene.
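The fast normalized fusion above is only a few lines in PyTorch; the module name is ours, and the usage comment shows one hypothetical fusion node:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN fusion: O = sum_i (w_i / (eps + sum_j w_j)) * I_i, w_i >= 0."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # learnable importances
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)             # keep the weights non-negative
        w = w / (self.eps + w.sum())       # fast normalization, no softmax
        return sum(wi * x for wi, x in zip(w, inputs))

# e.g. fusing a same-scale backbone feature with an upsampled deeper feature:
# fuse = WeightedFusion(2); p4 = fuse([c4, upsampled_p5])
```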
In recent years, attention modules have been widely used in computer vision tasks to tell the model what to pay attention to and where, and they are now common in deep neural networks as a way to improve performance. In lightweight networks, however, the application of attention mechanisms is somewhat limited, because most of them incur additional computational overhead that lightweight networks cannot afford. Therefore, this paper introduces a simple and flexible coordinate attention mechanism (CAM), which adds little computational overhead, to improve the accuracy of the network.
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, each channel is first pooled along the two spatial directions separately, using pooling kernels of size $(H, 1)$ and $(1, W)$:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

The two transformations above aggregate features along the two spatial directions. The two generated feature maps $z^h$ and $z^w$ are then concatenated and passed through a shared $1 \times 1$ convolution $F_1$:

$$f = \delta \left( F_1 \left( [z^h, z^w] \right) \right)$$

where $\delta$ is a nonlinear activation and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map encoding spatial information in both directions, with $r$ the reduction ratio. $f$ is split along the spatial dimension into $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, which are transformed by two $1 \times 1$ convolutions $F_h$ and $F_w$ and a sigmoid function $\sigma$:

$$g^h = \sigma(F_h(f^h)), \qquad g^w = \sigma(F_w(f^w))$$

In the formulas, $g^h$ and $g^w$ are the attention weights along the height and width directions, and the output of the CAM is:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
The CAM is able to attend to informative channels while retaining spatial location information. In this paper, this attention mechanism is embedded into the CBL module in the Backbone and the residual blocks of the CSP in the Bottleneck to strengthen the model's feature extraction for targets of interest.
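A self-contained PyTorch sketch of the CAM as formulated above; the reduction ratio and the BN-plus-ReLU joint encoding follow the common coordinate attention design, and all names are ours:

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: direction-aware pooling, joint encoding, reweighting."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)   # shared 1x1 encoding F1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)  # F_w

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # (N, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (N, C, W, 1)
        # Joint encoding of the concatenated coordinate maps.
        y = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (N, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * g_h * g_w
```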
The loss function of YOLOv5 is at the heart of its training process and is responsible for measuring the discrepancy between the model predictions and the labels. To train the model to accurately predict the location, size, and class of the target, the improved YOLOv5 employs a composite loss function with three components: the bounding box regression loss, the object confidence loss, and the classification loss.
The object confidence loss $L_{obj}$ measures whether a prediction box contains a target and is computed with the binary cross-entropy:

$$L_{obj} = -\sum_{i} \left[ \hat{c}_i \log(c_i) + (1 - \hat{c}_i) \log(1 - c_i) \right]$$

where $c_i$ is the predicted confidence of the $i$-th prediction box and $\hat{c}_i$ is its label, equal to 1 if the box contains a target and 0 otherwise.
The classification loss $L_{cls}$ likewise uses the binary cross-entropy, summed over the classes:

$$L_{cls} = -\sum_{i} \sum_{k \in \mathrm{classes}} \left[ \hat{p}_i(k) \log(p_i(k)) + (1 - \hat{p}_i(k)) \log(1 - p_i(k)) \right]$$

where $p_i(k)$ is the predicted probability that the target in the $i$-th box belongs to class $k$ and $\hat{p}_i(k)$ is the corresponding label.
In this paper, the CIoU loss is used to measure the difference between the predicted bounding box (centroid coordinates, width, and height) and the ground-truth bounding box [18]. CIoU improves on IoU by jointly considering the overlap between boxes, the centroid distance, and the aspect ratio, providing a more comprehensive assessment of box quality. The bounding box regression loss is calculated as follows:

$$L_{box} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$

where $b$ and $b^{gt}$ are the centroids of the predicted and ground-truth boxes, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of the smallest box enclosing both, $w, h$ and $w^{gt}, h^{gt}$ are the widths and heights of the predicted and ground-truth boxes, $v$ measures the consistency of the aspect ratios, and $\alpha$ is a trade-off parameter.
The total loss function of the improved YOLOv5 model is the weighted sum of the above three partial losses, i.e.:

$$L = \lambda_{box} L_{box} + \lambda_{obj} L_{obj} + \lambda_{cls} L_{cls}$$

where $\lambda_{box}$, $\lambda_{obj}$, and $\lambda_{cls}$ are the weighting coefficients of the three loss terms.
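For reference, here is a straightforward PyTorch implementation of the CIoU term above, with boxes in center-size format; the function name is ours:

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """CIoU loss for (cx, cy, w, h) boxes of shape (N, 4)."""
    # Corner coordinates of both boxes.
    p1, p2 = pred[:, :2] - pred[:, 2:] / 2, pred[:, :2] + pred[:, 2:] / 2
    t1, t2 = target[:, :2] - target[:, 2:] / 2, target[:, :2] + target[:, 2:] / 2
    # Intersection, union, IoU.
    inter = (torch.min(p2, t2) - torch.max(p1, t1)).clamp(min=0).prod(dim=1)
    union = pred[:, 2] * pred[:, 3] + target[:, 2] * target[:, 3] - inter + eps
    iou = inter / union
    # Squared centre distance over the squared enclosing-box diagonal.
    c2 = ((torch.max(p2, t2) - torch.min(p1, t1)) ** 2).sum(dim=1) + eps
    rho2 = ((pred[:, :2] - target[:, :2]) ** 2).sum(dim=1)
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (torch.atan(target[:, 2] / target[:, 3])
                              - torch.atan(pred[:, 2] / pred[:, 3])) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / c2 + alpha * v).mean()
```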
The operational reliability of power components in the substation is extremely important. With manual inspection, inspectors' workloads and professional skill levels vary, which easily leads to low inspection efficiency and insufficiently accurate results. Remote inspection helps substation managers quickly find faults in substation power equipment. Building a substation 3D scene model with 3D modeling technology and combining it with deep learning enables effective monitoring of targets in the substation scene, providing protection for the safe and stable operation of the substation.
In order to compare the effect of different downsampling methods on the point cloud, this section applies random downsampling, uniform downsampling, and voxel downsampling to the original substation point cloud data. By adjusting the neighbor search radius in uniform downsampling and the grid resolution in voxel downsampling, the number of points after uniform and voxel downsampling is kept the same, at 47251. The experimental data are shown in Table 1.
Comparison of downsampling experiment results

| Method | Original points | Points after downsampling | Time (s) |
|---|---|---|---|
| Random downsampling | 281421 | 47251 | 0.1274 |
| Uniform downsampling | 281421 | 47251 | 0.1532 |
| Voxel downsampling | 281421 | 47251 | 0.1049 |
As can be seen from the table, for the same reduction from 281421 points to 47251 points, voxel downsampling takes the shortest time, only 0.1049 s, which is 17.66% and 31.53% faster than random downsampling and uniform downsampling, respectively. When voxel downsampling the original point cloud, points in high-density regions such as the target object and its immediate surroundings have a higher probability of being sampled, while points in low-density regions such as interference noise have a lower probability. As a result, the voxel-downsampled point cloud filters out most of the interfering noise while preserving the important information about the target object and its surroundings.
In order to compare the effects of different outlier removal methods and parameters, this section denoises the downsampled point cloud with a statistical filter and a radius filter, respectively. For the statistical filter, the results are tuned through the number of nearest-neighbor search points; for the radius filter, through the threshold on the number of points within the search radius. The results are shown in Table 2.
Outlier removal experiment results

| Method | Parameter | Points after downsampling | Points after outlier removal | Time (s) |
|---|---|---|---|---|
| Radius filtering | | 47251 | 47179 | 0.1042 |
| | | 47251 | 47103 | 0.1225 |
| | | 47251 | 47192 | 0.1563 |
| Statistical filtering | | 47251 | 46935 | 0.0991 |
| | | 47251 | 46326 | 0.1208 |
| | | 47251 | 45834 | 0.1517 |
From the data in the table, it can be seen that when the radius filter is used for outlier removal, the neighborhood point-count threshold is related to the number of remaining points and to the elapsed time: in general, the larger the threshold, the fewer points remain after removal and the longer the time taken. In the best-performing experiment, the number of points is reduced from 47251 to 47103 in 0.1225 s. When the statistical filter is used, there is no simple relationship between the number of nearest-neighbor search points and the number of remaining points, but the elapsed time grows with the number of search points. In the best-performing experiment, the number of points is reduced from 47251 to 45834 in 0.1517 s. Overall, this paper uses statistical filtering to remove outliers from the substation 3D scene data with good results, ensuring accurate processing of the data.
In order to verify the feasibility of the improved ICP point cloud alignment algorithm proposed in this paper, the traditional ICP algorithm and the RANSAC+ICP algorithm are used to align multi-view point clouds of 10 pieces of target equipment in the substation 3D scene model, and the root-mean-square error (RMSE) is used to measure alignment accuracy. Table 3 shows the RMSE and time consumption of the different algorithms.
RMSE and time consumption of different algorithms

| No. | ICP RMSE | ICP Time/s | RANSAC+ICP RMSE | RANSAC+ICP Time/s | Ours RMSE | Ours Time/s |
|---|---|---|---|---|---|---|
| 1 | 0.632 | 0.159 | 0.385 | 0.253 | 0.079 | 0.275 |
| 2 | 0.683 | 0.214 | 0.465 | 0.276 | 0.075 | 0.318 |
| 3 | 0.652 | 0.208 | 0.472 | 0.295 | 0.077 | 0.299 |
| 4 | 0.637 | 0.201 | 0.516 | 0.282 | 0.076 | 0.273 |
| 5 | 0.601 | 0.311 | 0.243 | 0.674 | 0.019 | 0.615 |
| 6 | 0.641 | 0.374 | 0.357 | 0.645 | 0.008 | 0.649 |
| 7 | 0.648 | 0.459 | 0.551 | 1.142 | 0.009 | 1.106 |
| 8 | 0.639 | 0.428 | 0.538 | 0.938 | 0.013 | 0.627 |
| 9 | 0.655 | 0.417 | 0.492 | 0.729 | 0.007 | 1.235 |
From the table, it can be seen that the traditional ICP algorithm is fast but its alignment results are relatively poor. RANSAC+ICP improves the alignment accuracy over traditional ICP, but its efficiency is not high, and its accuracy degrades for point clouds with relatively few repetitive structures. In comparison, the improved ICP point cloud alignment algorithm proposed in this paper achieves more accurate and stable alignment results in various scenarios, although its alignment time is relatively long. On average, the RMSE of the proposed method is 93.73% lower than that of traditional ICP and 90.97% lower than that of RANSAC+ICP. The method not only provides innovative improvements to the limitations of the traditional ICP algorithm but also achieves significant gains in the robustness and accuracy of point cloud alignment. The improved algorithm mainly uses a KD-Tree for neighborhood search, changes the search condition of the nearest points, and combines the SVD method to solve the rigid transformation of the point cloud coordinates, significantly improving the alignment of the substation 3D point cloud data. The algorithm better describes the local geometric features of the substation 3D scene point cloud, improving the accuracy of the initial matching, and its robustness to outliers and the accuracy of the initial estimate are further strengthened, making the overall alignment process more robust.
For the improved YOLOv5 model designed in this paper for substation 3D scene target recognition, the BDZ dataset obtained above is divided into a training set and a test set at a ratio of 8:2. The main evaluation indexes selected are precision (P), recall (R), average precision (AP), mean average precision (mAP@0.5 and mAP@0.5:0.95), recognition speed (FPS), and model volume (Vol).
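Among these, AP is the area under the P-R curve detailed next; a minimal NumPy sketch of the standard all-point computation, assuming recall is sorted ascending:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the P-R curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Make precision monotonically non-increasing, then integrate over recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum(np.diff(r) * p[1:]))
```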
P-R curve of the model. The P-R curve of the model indicates the effectiveness of target recognition in the substation 3D scene; the horizontal and vertical coordinates of its Cartesian coordinate system are the recall rate and precision rate, respectively, and the AP value is obtained by calculating the area under the P-R curve. The BDZ dataset constructed in this paper mainly contains three categories (transformers, fire sandboxes, and operators), and their P-R curves are shown in Figure 3.
Figure 3. P-R curves of the model for each target category

Based on the data in the figure, the AP values of the improved YOLOv5 model for recognizing the three categories of transformers, fire sandboxes, and operators in the substation 3D scene are 97.05%, 94.78%, and 95.91%, respectively. The main reason complete recognition is not achieved is that the complex spatial environment of the substation interferes with target recognition; although this paper preprocesses the point cloud data, certain defects remain in target recognition. Therefore, subsequent work will consider adding preprocessing of substation scene images to improve the cleaning rate and recognition rate of the images and better guarantee safety monitoring of substation operation and maintenance scenes.

Performance comparison of different algorithms. In order to further verify that the improved YOLOv5 model proposed in this paper has advantages in detection accuracy, model volume, and recognition speed over other mainstream target detection algorithms, it is compared with the YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, YOLOv3, YOLOv4, YOLOXs, SSD, and Faster RCNN algorithms on the self-constructed BDZ dataset; the experimental results are shown in Table 4.
As can be seen from the table, the improved YOLOv5 model proposed in this paper has the highest detection accuracy among the mainstream target recognition algorithms compared, with a model volume of 8.72 MB and a detection speed of 92 f/s, which makes it well suited to automatic target recognition in the substation 3D scene model. Compared with YOLOv5x, the algorithm with the closest detection accuracy, the improved algorithm reduces the model volume by 166.57 MB and increases mAP@0.5 and mAP@0.5:0.95 by 0.73 and 1.53 percentage points, respectively, giving it obvious advantages in volume and accuracy. In summary, the improved YOLOv5 model attains the highest detection accuracy while remaining lightweight and maintaining good real-time performance, which further proves the superiority of the algorithm. Applied to automatic target recognition in the substation 3D scene model, it can recognize substation scene targets with higher speed and accuracy and provide support for ensuring the safe and stable operation of substations.

Model
mAP@0.5/%
mAP@0.5:0.95/%
Vol/MB
FPS/(f/s)
YOLOv5n
84.62
59.08
3.88
139
YOLOv5s
87.85
63.92
14.45
125
YOLOv5m
88.89
66.63
42.54
95
YOLOv5l
89.15
67.25
93.81
80
YOLOv5x
89.64
67.41
175.29
55
YOLOv3
81.82
46.34
246.64
43
YOLOv4
82.06
47.49
256.41
35
YOLOXs
81.38
50.88
36.64
32
SSD
40.54
20.25
98.58
43
Faster RCNN
50.29
23.21
115.37
15
Ours
90.37
68.94
8.72
92
In order to further validate the detection performance of the proposed algorithm and to explore the effectiveness of each improvement, ablation experiments were designed on the basis of YOLOv5, using the same hyperparameters and training techniques for each group of experiments; the results are shown in Table 5.
Ablation experiment results
| Group | BiFPN | CAM | CIoU | Vol/MB | FPS/(f/s) | mAP@0.5/% | mAP@0.5:0.95/% |
|---|---|---|---|---|---|---|---|
| 1 | × | × | × | 14.45 | 125 | 87.85 | 63.92 |
| 2 | √ | × | × | 13.98 | 103 | 87.94 | 64.06 |
| 3 | × | √ | × | 12.07 | 101 | 88.06 | 64.73 |
| 4 | × | × | √ | 14.45 | 125 | 87.85 | 63.92 |
| 5 | √ | √ | × | 11.39 | 100 | 88.74 | 65.97 |
| 6 | √ | × | √ | 13.98 | 98 | 89.23 | 66.48 |
| 7 | × | √ | √ | 10.86 | 95 | 89.62 | 67.81 |
| 8 | √ | √ | √ | 8.72 | 92 | 90.37 | 68.94 |
As can be seen from the table, introducing the CAM module reduces the model size by 2.38 MB, an effective lightweighting step, while mAP@0.5 and mAP@0.5:0.95 increase by 0.21 and 0.81 percentage points over the baseline; although the accuracy gain is modest, the parameter count is significantly reduced. After introducing the BiFPN module to reduce model complexity, the recognition speed drops to 103 f/s, but mAP@0.5 and mAP@0.5:0.95 increase by 0.09 and 0.14 percentage points, respectively, which shows that the BiFPN module improves the model's ability to fuse multi-scale features of targets in the substation 3D scene and makes the identified boxes fit object contours better. Compared with YOLOv5, the overall recognition speed of the improved model decreases by 26.4% (from 125 to 92 f/s), while mAP@0.5 and mAP@0.5:0.95 increase by 2.52 and 5.02 percentage points, respectively. The model further strengthens the identification and fitting of substation 3D scene targets and has lower hardware requirements at run time, so it can be widely used in automatic recognition tasks for substation 3D scene targets that demand high box IoU and more accurate localization.
In order to verify the effectiveness of the improvement modules relative to the baseline, lateral comparison experiments are conducted for each improvement module on the same YOLOv5 version. For the attention mechanism comparison, three typical attention mechanisms, SE, BAM, and CBAM, are selected as the control group. In the experiments, the different attention modules are introduced at the same positions in the backbone network and the neck feature fusion network, while the rest of the network remains unchanged. To verify the effectiveness of the BiFPN neck network, four typical neck structures (FPN, PANet, GFPN, and PRFPN) are selected as control groups. In addition, five traditional loss functions (IoU, GIoU, DIoU, EIoU, and SIoU) are selected for the loss function comparison. Table 6 shows the results of the lateral comparison experiments for the improved modules.
Lateral comparison experiments of the improved modules

| Model | mAP@0.5/% | mAP@0.5:0.95/% | Model | mAP@0.5/% | mAP@0.5:0.95/% |
|---|---|---|---|---|---|
| YOLOv5 | 49.84 | 30.49 | FPN | 49.84 | 30.49 |
| SE | 50.69 | 32.95 | PANet | 51.06 | 32.95 |
| BAM | 50.93 | 33.41 | GFPN | 51.98 | 33.16 |
| CBAM | 51.42 | 33.87 | PRFPN | 52.31 | 33.83 |
| CAM | 52.08 | | BiFPN | 52.87 | |

| Model | Loss value | mAP@0.5/% | Model | Loss value | mAP@0.5/% |
|---|---|---|---|---|---|
| IoU | 0.0712 | 60.52 | EIoU | 0.0691 | 64.27 |
| GIoU | 0.0694 | 63.45 | SIoU | 0.0685 | 64.41 |
| DIoU | 0.0695 | 62.93 | CIoU | 0.0623 | 66.58 |
Based on the data in the table, the following conclusions can be drawn:
Compared with introducing SE, BAM, and CBAM, and with using no attention mechanism, introducing CAM increases mAP@0.5 by 1.39, 1.15, 0.66, and 2.24 percentage points, respectively, which verifies the effectiveness of the CAM attention mechanism in improving the YOLOv5 model. PANet adds a cross-scale aggregation path on the basis of FPN and improves feature interaction, increasing mAP@0.5 by 1.22 percentage points. GFPN introduces learnable weights for the input features and simplifies PANet, increasing mAP by 2.14 percentage points. PRFPN improves feature extraction through bidirectional fusion and an improved parallel FP structure, increasing mAP by 2.47 percentage points. When BiFPN is introduced, mAP increases by 3.03 percentage points, which verifies the effectiveness of the module. According to the loss function comparison, the CIoU loss adopted in this paper achieves a loss value of only 0.0623 and an mAP@0.5 of 66.58%, better than the other five loss functions. This shows that the CIoU loss fully considers the overlap between bounding boxes, the distance between their centers, and the aspect ratio, so it obtains better loss values in target recognition for the substation 3D scene model.
Starting from the actual demand for automatic target recognition in the substation 3D scene model, this paper proposes a target recognition network for the substation 3D scene model based on an improved YOLOv5s algorithm. On the basis of the original YOLOv5 model, the multi-scale feature fusion BiFPN module reduces the complexity, parameter count, and computation of the model. To compensate for the accuracy loss caused by the multi-scale feature fusion operation, the CAM attention mechanism is further introduced to improve model performance. The experimental results show that the proposed algorithm can provide effective technical support for automatic target recognition in the substation 3D scene model.
