Open Access

Study on automatic target identification method in substation 3D scene model

Sep 25, 2025


Introduction

With the development of the social economy, the demand for power supply across industries keeps rising, and the stable operation of substations plays a vital role in providing a stable and reliable power supply [1]. To monitor the equipment in a substation, regular inspection and maintenance are required. Substations advocate the "unattended" concept, which can effectively reduce accidents caused by human misoperation and improve the level of automation [2]. At present, substations have already realized the "four remotes", i.e., telemetry, telesignaling, telecontrol, and teleregulation [3-4]. However, they lack the ability to safeguard field personnel and cannot meet the unattended safety protection requirements of the substation. Inspection and maintenance are still carried out manually, which consumes a great deal of manpower and time; more importantly, manual operation and maintenance is often unclear or incomplete, so equipment failures may not be found in time, leading to chain accidents [5-6]. At the same time, because there is a large amount of high-voltage equipment in the substation, manual operation and maintenance carries a high risk. Therefore, replacing manual inspection with automatic target recognition will become the development trend of substation inspection.

With the development of three-dimensional technology, 3D scene models of substations have been applied to non-contact distance measurement of electrical equipment and to virtual-reality substation simulation training systems because of their intuitiveness and high accuracy [7-9]. Substation 3D scene modeling technology is therefore of great significance to the safe operation of substations. Automatic front-view target recognition in the substation 3D scene model plays a major role in substation operation and maintenance: front-view template matching searches for the target directly and offers strong adaptability, high efficiency and accurate recognition [10]. The method can effectively locate targets in the substation; once substation equipment or a line fails, it can greatly shorten maintenance time, which is of great significance for ensuring safe and stable operation [11-13]. Therefore, 3D modeling technology and front-view template matching can be combined to construct a simulation model for automatic front-view target recognition in the substation 3D scene.

In this paper, an improved YOLOv5 model is proposed for automatic target recognition in the substation 3D scene. First, 3D point cloud data of a substation are acquired with LiDAR; the data are processed through point cloud denoising, downsampling and point cloud alignment, and the 3D scene model of the substation is constructed by combining 3D point cloud data fusion with 3D modeling software. On the basis of the YOLOv5 algorithm, the BiFPN module is introduced to enhance multi-scale feature extraction for substation targets, coordinate attention is added to further improve recognition accuracy for small targets, and the CIoU loss function is adopted to reduce the recognition loss for substation targets. The automatic target recognition performance of the model in the substation 3D scene is verified through simulation experiments, providing a new research idea for improving the safe, stable operation and real-time monitoring of substations.

Three-dimensional scene model construction of substation

As the core of power transmission network construction, a substation, with its complicated power lines and complex equipment, poses great challenges to the management and maintenance of substation equipment information. 3D GIS is gradually being applied to the visualization management of power transmission and transformation projects because of its good spatial representation and spatial analysis capabilities. Relying on the 3D scene model of the substation can assist intelligent identification and monitoring of the substation and better maintain its safe operation.

Three-dimensional point cloud data acquisition

The establishment of the substation 3D scene model is inseparable from the acquisition of 3D point cloud data, which relies on various types of 3D acquisition technology. This section introduces the principle of LiDAR-based 3D point cloud data acquisition, which provides data support for the construction of the substation 3D scene model.

LiDAR measures the distance, speed and position of a target by transmitting a laser beam to the target and receiving the reflected light. It mainly consists of a laser, a receiver, an optical system, a clock and a data processing unit. The laser emits an infrared beam with high energy density and narrow beamwidth. After passing through the beam control system, the beam reaches the target surface; the target reflects the beam back, and the optical system converges it onto the receiver, which converts the reflected light into an electrical signal. The clock records the time of the laser beam from emission to reflection and back to the receiver, and this time difference is used to calculate the distance between the target and the LiDAR. The distance is obtained by the time-of-flight (TOF) method, which includes pulse ranging and phase ranging.

Pulse ranging obtains the target distance by measuring the round-trip flight time of a laser pulse between the radar and the target. The distance is given by: $d = \frac{c}{2} \times t$

where d denotes the distance from the object to the LiDAR, c denotes the speed of light, and t denotes the round-trip time delay of the laser pulse; the factor of 1/2 accounts for the round-trip distance traveled by the pulse.

Phase ranging obtains distance information by measuring the phase difference generated by the round-trip flight of a continuous-wave laser signal between the radar and the target. To account for the effect of the phase difference on the distance measurement, the phase offset must first be calculated: $\Delta\varphi = 2\pi \times \frac{\Delta t}{T}$

where Δφ denotes the phase offset, Δt denotes the time difference between the transmitted and received laser pulses, and T denotes the period of the laser pulse. Since the phase difference of a complete laser pulse is 2π, the phase offset can be calculated from the time difference.

Finally, the pulse-ranging and phase-offset formulas are combined to accurately obtain the distance between the object and the LiDAR: $d = \frac{c}{2} \times \left(t + \frac{\Delta\varphi}{2\pi f}\right)$

where f represents the frequency of the laser pulse. This formula takes into account the effect of phase difference on the distance, so it has higher measurement accuracy and stability.
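To make the two ranging formulas concrete, the following minimal Python sketch evaluates them for an illustrative round-trip delay and phase offset; the function names and the sample numbers are hypothetical and for demonstration only.

```python
# Illustrative sketch of the pulse-ranging and combined pulse/phase formulas above.
import math

C = 299_792_458.0  # speed of light in m/s

def pulse_range(t: float) -> float:
    """Pulse ranging: d = (c / 2) * t, with t the round-trip delay in seconds."""
    return C / 2.0 * t

def combined_range(t: float, delta_phi: float, f: float) -> float:
    """Combined ranging: d = (c / 2) * (t + delta_phi / (2 * pi * f))."""
    return C / 2.0 * (t + delta_phi / (2.0 * math.pi * f))

if __name__ == "__main__":
    t = 200e-9                                     # 200 ns round trip -> roughly 30 m
    print(round(pulse_range(t), 3))                # ~29.979 m
    print(round(combined_range(t, 0.5, 10e6), 3))  # small phase-based refinement added
```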

Based on these LiDAR scanning principles, this paper takes a 110 kV substation as the research object and uses an Ouster OS LiDAR to collect 3D point cloud data of the substation, while a wide-angle 120° camera collects RGB images. A total of 2048 data samples are collected and named the BDZ dataset; it contains 1024 transformers, 740 fire sandboxes, and 5267 operators. The collected data can be used for tasks such as target recognition and 3D modeling of substations.

Point cloud data preprocessing

Outlier denoising based on statistical filtering

Statistical filtering first determines a neighborhood window size, usually chosen according to the characteristics of the point cloud and the application requirements. For each point, statistical filtering finds all the points within its neighborhood window and calculates the mean and standard deviation of parameters such as coordinates, distances and normal vectors. These statistics reflect the local distribution of the point cloud and help determine which points are outliers. Next, the statistics of each point are compared with those of its neighborhood; points that deviate from the neighborhood mean by more than a set threshold (expressed in multiples of the standard deviation) are judged to be outliers. Finally, the outliers can be removed or replaced; common options include deleting the point entirely or replacing it with the mean of the neighborhood points. Statistical filtering for outlier denoising proceeds as follows:

Step1 Assume there are n points in the point cloud. For each point $p_i(x_i, y_i, z_i)$, find its k nearest neighbor points $p_j(x_j, y_j, z_j)$ in 3D space using a KD-Tree search.

Step2 Calculate the average distance $d_i$ between the current point and its k nearest neighbors: $d_i = \frac{1}{k}\sum_{j=1}^{k}\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$

Step3 Calculate the global mean distance $\mu$: $\mu = \frac{1}{n}\sum_{i=1}^{n} d_i$

Step4 Solve for the standard deviation $\sigma$: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(d_i - \mu)^2}$

The key to outlier noise removal is determining which points are outliers. After the above processing, we check whether $d_i$ lies within $\mu \pm a\sigma$; if not, the point is judged to be an outlier and removed. The parameter a and the number of neighbors k need to be adapted to the data volume and density of the point cloud being processed.
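As an illustration of Steps 1-4, the following Python sketch implements the statistical outlier removal described above with NumPy and SciPy's KD-tree; the function name and the default values of k and a are assumptions, not the paper's settings.

```python
# Hedged sketch of statistical outlier removal: KD-tree k-NN search, mean distance
# per point, global mean/std, and the mu ± a*sigma acceptance test described above.
import numpy as np
from scipy.spatial import cKDTree

def statistical_outlier_removal(points: np.ndarray, k: int = 20, a: float = 1.0):
    tree = cKDTree(points)
    # query k+1 neighbours because the closest neighbour of each point is itself
    dists, _ = tree.query(points, k=k + 1)
    d_i = dists[:, 1:].mean(axis=1)           # Step 2: mean distance to k neighbours
    mu, sigma = d_i.mean(), d_i.std()         # Steps 3-4: global mean and std
    keep = np.abs(d_i - mu) <= a * sigma      # keep points inside mu ± a*sigma
    return points[keep], keep
```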

Voxel grid-based point cloud data downsampling

In order to speed up the point cloud target recognition, it is necessary to perform the downsampling operation first, i.e., to reduce the number of point clouds in a reasonable way, so as to improve the overall efficiency. In this paper, we choose to use the voxel grid method to downsample the substation 3D point cloud data. The specific operation steps are as follows:

Step1 Determine the side length of the voxel grid from the coordinate set of the point cloud by taking the extreme values $X_{max}, X_{min}, Y_{max}, Y_{min}, Z_{max}, Z_{min}$ on the X, Y and Z axes. The side lengths of the largest bounding box are then: $L_x = X_{max} - X_{min},\quad L_y = Y_{max} - Y_{min},\quad L_z = Z_{max} - Z_{min}$

Step2 Divide the largest bounding box into small voxel grids. First set the side length of each voxel grid to c, and divide the X, Y and Z axes of the point cloud uniformly into M, N and L parts, so that the bounding box is divided into M × N × L voxel grids, where $\lfloor\cdot\rfloor$ denotes rounding down and Sum is the total number of voxel grids: $Sum = M \times N \times L,\quad M = \left\lfloor \frac{L_x}{c} \right\rfloor,\quad N = \left\lfloor \frac{L_y}{c} \right\rfloor,\quad L = \left\lfloor \frac{L_z}{c} \right\rfloor$

Step3 Assign the data points to the corresponding voxel grids, numbering each voxel grid as (i, j, k). The voxel grid number corresponding to the m-th data point $(X_m, Y_m, Z_m)$ in the point cloud is: $i = \left\lfloor \frac{X_m - X_{min}}{c} \right\rfloor,\quad j = \left\lfloor \frac{Y_m - Y_{min}}{c} \right\rfloor,\quad k = \left\lfloor \frac{Z_m - Z_{min}}{c} \right\rfloor$

Step4 Simplify the point cloud. The centroid $(r_x, r_y, r_z)$ of each voxel grid is calculated and replaces all the points in that grid; combining the centroids of all voxel grids yields the simplified point cloud. For the n points $P_m$ in the voxel numbered (i, j, k), the centroid is: $r_x = \frac{1}{n}\sum_{m=1}^{n} P_{mx},\quad r_y = \frac{1}{n}\sum_{m=1}^{n} P_{my},\quad r_z = \frac{1}{n}\sum_{m=1}^{n} P_{mz}$
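The following Python sketch illustrates Steps 1-4 of the voxel-grid downsampling with NumPy; the function name and interface are hypothetical.

```python
# Minimal sketch of voxel-grid downsampling: bounding box, voxel indices by
# floor division with side length c, then one centroid per occupied voxel.
import numpy as np

def voxel_downsample(points: np.ndarray, c: float) -> np.ndarray:
    mins = points.min(axis=0)                        # Step 1: X_min, Y_min, Z_min
    idx = np.floor((points - mins) / c).astype(int)  # Step 3: voxel index (i, j, k)
    # Step 4: group points by voxel and replace each group with its centroid
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    centroids = np.zeros((len(keys), 3))
    for dim in range(3):
        centroids[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return centroids
```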

Substation 3D scene model
Improved ICP point cloud alignment

In order to better realize the construction of the substation 3D scene model, the processed substation 3D point cloud data must be aligned. In this paper, on the basis of the traditional ICP point cloud alignment algorithm, we optimize the ICP algorithm by adding new point cloud descriptors and assigning different weights to matched point pairs of different importance, which accelerates convergence under specific convergence conditions [14].

The algorithm in this paper improves on the traditional ICP algorithm; the main idea is to enhance the robustness of point-pair matching when KD-Tree neighborhood search is used. The search condition for the nearest point is changed to the feature distance: $\min\left( f_i \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2 + (c_i - c_j)^2} \right)$

Also, feature distance filtering can be performed for matched point pairs based on this equation.

The normal-based evaluation function of the alignment error is chosen, i.e.: $E(R,T) = \arg\min \sum_{i=1}^{N} \left\| \left( p_i - (q_j R + T) \right) \cdot n_i \right\|_2^2$

where qj is the point corresponding to pi.

The specific implementation steps of the improved ICP point cloud alignment algorithm are as follows:

(1) Input the reference point cloud P and the point cloud Q to be aligned.

(2) Establish KD-Trees for P and Q and solve the normal features of the two point clouds.

(3) Calculate the weighting factors and feature descriptors using the normal features of the two point clouds.

(4) Iterate over all points $p_i$ in P, search for the nearest point $q_i$ as a match according to the KD-Tree, and obtain the set of matching point pairs $\{S_i\}$.

(5) Calculate the average feature distance $d_{mean}$ and standard deviation $\sigma$ of the matched pairs in $\{S_i\}$. If the feature distance $d_{si}$ of a pair satisfies the $3\sigma$ criterion, i.e., $d_{si} < d_{mean} + 3\sigma$, the pair is retained; otherwise the pair $S_i$ is rejected.

(6) Sort $\{S_i\}$ by $d_{si}$ from smallest to largest and take the top 60% of matched point pairs for calculation.

(7) Using the method proposed by Low, assume that $\alpha, \beta, \gamma, t_x, t_y, t_z$ are all close to 0 after coarse alignment, so that $\sin\theta \approx \theta$ and $\cos\theta \approx 1$. The alignment error evaluation function can then be approximately converted to:

$\arg\min \left\| \begin{bmatrix} (q_1 \times n_1)^T & n_1^T \\ \vdots & \vdots \\ (q_N \times n_N)^T & n_N^T \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \\ \gamma \\ t_x \\ t_y \\ t_z \end{bmatrix} - \begin{bmatrix} (q_1 - p_1) \cdot n_1 \\ \vdots \\ (q_N - p_N) \cdot n_N \end{bmatrix} \right\|^2$

(8) Solve with the SVD method to obtain $\alpha, \beta, \gamma, t_x, t_y, t_z$, which yields the rigid transformation matrix $[R \mid T]$; Q is then rigidly transformed (a simplified sketch of this linearized solve is given after these steps).

(9) Calculate the alignment error evaluation function $E_k(R, T)$ of this iteration, where k denotes the k-th transformation. If the difference $D_k = E_k - E_{k+1}$ between two consecutive evaluations meets the set threshold, the algorithm has converged; otherwise return to step (4) and continue iterating.
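For illustration, the following Python sketch shows one linearized point-to-plane update as used in steps (7)-(8), solved with SVD-based least squares; matched point pairs and normals are assumed to be given, and the sign convention follows the standard point-to-plane derivation rather than the paper's exact notation.

```python
# Simplified sketch of one linearized point-to-plane update (Low's small-angle method).
# p, q are matched (n, 3) arrays, n_q are the corresponding normals.
import numpy as np

def point_to_plane_step(p: np.ndarray, q: np.ndarray, n_q: np.ndarray) -> np.ndarray:
    A = np.hstack([np.cross(q, n_q), n_q])      # rows: [(q_i x n_i)^T, n_i^T]
    b = np.einsum("ij,ij->i", p - q, n_q)       # (p_i - q_i) . n_i
    x, *_ = np.linalg.lstsq(A, b, rcond=None)   # SVD-based least-squares solve
    alpha, beta, gamma, tx, ty, tz = x
    # compose the small-angle rotation and translation into a 4x4 rigid transform
    R = np.array([[1.0, -gamma, beta],
                  [gamma, 1.0, -alpha],
                  [-beta, alpha, 1.0]])
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, [tx, ty, tz]
    return T
```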

3D Point Cloud Data Fusion

In this paper, the fusion between the 3D point cloud and the 2D color image is achieved by finding the correspondence between LiDAR point cloud coordinates and 2D image pixel coordinates in 3D space. At any moment, there is a projective mapping relationship between the 3D point cloud data and the image pixel coordinate system, expressed as: $s\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix}, \quad i = 1, 2, 3, \ldots$

Since the frame rates of the two sensors are not the same, the timestamps of the data acquired by the two sensors need to be aligned. In this paper, the Time Synchronizer in the ROS system is used to realize the alignment of the current frame of the 3D LiDAR point cloud data with the 2D image data in the color solid-state LiDAR scanning system.

Each laser point in the point cloud frame, with coordinates p(x, y, z), is projected into the 2D pixel coordinate system of the corresponding image through the mapping relationship; its nearest neighboring pixel is searched to obtain the RGB information, which is then mapped back to the radar coordinate system [15]. At this point, the 3D color laser point has the form: $p_c = [x, y, z, R(u,v), G(u,v), B(u,v)]^T$

where R(u, v), G(u, v) and B(u, v) denote the image color information of the projected pixel; each point (X, Y, Z, R, G, B) of the colored point cloud thus contains both 3D coordinates and color information.

When a 3D laser point is converted to the pixel coordinate system, its pixel coordinates are generally not integers, so the adjacent pixels in the 2D image are selected by calculating the Euclidean distance to them and fusing their RGB values, achieving color fusion between the laser point cloud and the image pixels. Finally, the texture and color information of the 2D image is assigned to the 3D point cloud, realizing 2D-3D information fusion and providing effective data for the construction of the substation 3D scene model.
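The following Python sketch illustrates the projection-and-coloring procedure described above, using nearest-pixel sampling for simplicity; the calibration matrices K, R, t and the function interface are assumed inputs.

```python
# Hedged sketch of LiDAR-camera fusion: project points with intrinsics K and
# extrinsics (R, t), then look up the nearest pixel's RGB for each visible point.
import numpy as np

def colorize_points(points, image, K, R, t):
    cam = points @ R.T + t                          # LiDAR frame -> camera frame
    mask = cam[:, 2] > 0                            # keep points in front of the camera
    uv = cam[mask] @ K.T                            # s * [u, v, 1]^T = K [X, Y, Z]^T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, image.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, image.shape[0] - 1)
    rgb = image[v, u].astype(float)                 # nearest-pixel RGB lookup
    return np.hstack([points[mask], rgb])           # [x, y, z, R, G, B] in LiDAR frame
```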

Substation 3D scene modeling

In order to realize the construction of the substation 3D scene model, this paper uses Unity3D software and the HTC Vive device, combined with the substation 3D point cloud data collected earlier; the resulting construction process is shown in Figure 1.

Figure 1.

The 3D scene model construction process of the substation

The terrain data of the substation are scanned by LiDAR, and the acquired 3D point cloud data are used with the 3ds Max modeling tool to establish the 3D terrain model of the substation. Image material of the primary and secondary equipment is collected with a binocular stereo vision system; based on the established terrain model, 3ds Max is then used to build 3D models of the substation's firefighting equipment and scene from the collected images. The 3D models are imported into Unity3D to develop the substation 3D scene, giving it vividness and realism. The HTC Vive device is used for human-computer interaction with the 3D scene, giving the substation operating environment a sense of presence and immersion. The substation 3D scene is used for substation operation, and the virtual simulation results are used to evaluate operational safety and provide a guarantee for the safe operation of the substation.

Target recognition model for 3D scene of substation

Substation occupies a pivotal position in the power system, and its safe and stable operation is directly related to the reliability of the power system to deliver electricity. With the gradual maturity of 3D modeling technology, the 3D scene model of substation has become the mainstream mode of substation operation and maintenance because of its intuitive and high accuracy. Automatic target recognition plays a great role in substation operation and maintenance, and the use of deep learning technology to recognize the targets in the substation 3D scene model has the advantages of strong adaptability, high efficiency and accurate recognition. The method can effectively locate the target in the substation, which can greatly shorten the maintenance time of the technicians once the substation equipment or line fails, which is of great significance to ensure the safe and stable operation of the substation.

YOLOv5 and the attention mechanism
YOLOv5 algorithm

YOLOv5 is a one-stage anchor-based target recognition algorithm that can quickly identify targets and their relative positions in an image. The YOLOv5 model consists of an input module, a feature extraction module, a feature fusion module and an output module, each of which is built from basic components such as convolution, pooling and activation functions [16].

Input module. The YOLOv5s input module includes Mosaic data augmentation, adaptive anchor computation and adaptive image scaling. Mosaic data augmentation addresses varying scales and uneven data distribution by randomly selecting images and recombining them through scaling, splicing and cropping, making the model's prediction of small targets more accurate. Adaptive anchor computation automatically adapts the anchors to the training and validation sets according to the set parameters. Adaptive image scaling reduces computation by adjusting the aspect ratio of the image to a fixed size and then unifying the image size with a black-edge (letterbox) strategy, thus speeding up model training and inference.

Feature Extraction Module: the YOLOv5 feature extraction module consists of four parts, namely the Focus module, the CBS module, the C3 module and the SPP module, which are designed to improve the model's feature learning capability and computational efficiency.

Feature Fusion Module: the YOLOv5 feature fusion module realizes multi-scale detection based on a feature pyramid network. The model uses convolution and up-sampling to fuse feature maps of different resolutions along the vertical pathway, and adopts skip connections to fuse feature maps with different semantic levels along the horizontal pathway. Through these two kinds of fusion, the model obtains multi-scale feature information, which improves its recognition ability and accuracy.

Output Module. The output module of the YOLOv5 model combines detection heads at three different scales and then classifies and predicts the results of feature extraction and feature fusion. In this module, the exact location of the prediction box is determined using the GIoU loss function, which not only reflects the distance between targets but also establishes a relationship between non-overlapping targets, thus alleviating slow convergence. When prediction boxes overlap, they are filtered and only the box with the highest confidence is retained. This reduces computation and enhances the model's predictive ability, realizing accurate recognition of the target.

Attention mechanisms

Attention mechanism is an idea that mimics the human visual system and can enhance the deep learning model’s ability to focus on input information. Based on the neural network model of the attention mechanism, it can adaptively assign different weights according to the importance and relevance of the input data, thus helping the model to focus more on the useful information in order to better process the input data and improve the accuracy and generalization ability of the model [17].

Suppose there is an input sequence $X = (x_1, x_2, \ldots, x_n)$ and a target sequence $Y = (y_1, y_2, \ldots, y_m)$, where $x_i$ and $y_j$ denote the i-th and j-th elements of the input and target sequences, respectively. Given X and Y, the most common attention mechanism model can be represented as: $a_t = \text{Attention}(h_{t-1}, X)$, $C_t = \sum_{i=1}^{n} a_{t,i} h_i$, $s_t = \text{DecoderRNN}([y_{t-1}, C_t], s_{t-1})$, $y_t = \text{OutputProjection}(s_t)$

The model is divided into two phases, an encoding phase and a decoding phase. In the encoding phase, the input sequence is encoded into a set of vectors $H = [h_1, h_2, \ldots, h_n]$, where $h_i$ denotes the encoded representation of the i-th element of the input sequence. In the decoding phase, the target sequence Y is generated element by element, and the generation of each target element $y_t$ depends on the previously generated elements as well as the encoded representation of the input sequence.

In the above model, the role of the attention mechanism is to provide contextual information for the decoding phase so as to better generate the target sequence elements. Specifically, $s_t$ denotes the state representation of the first t elements of the target sequence, and DecoderRNN is a recurrent neural network used to compute the state representation of the target sequence. Attention is a function used to compute the correlation between the t-th element of the target sequence and each element of the input sequence, yielding the attention vector $a_t$. Each element of $a_t$ represents the attention paid by the t-th target element to an element of the input sequence and can be regarded as a weight. $C_t$ denotes a contextual representation of the input sequence obtained by weighting with $a_t$, which enhances the state representation $s_t$ in the decoding stage by providing contextual information related to the input sequence. $y_t$ denotes the output obtained from the t-th element of the target sequence and $C_t$.
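As a toy illustration of the attention equations above, the following NumPy sketch computes the attention weights and the context vector for one decoding step; the dot-product scoring and the variable names are assumptions made purely for illustration.

```python
# Toy sketch of one attention step: scores between the previous decoder state and
# the encoder outputs H, softmax into weights a_t, then context C_t = sum_i a_{t,i} h_i.
import numpy as np

def attention_step(s_prev: np.ndarray, H: np.ndarray):
    scores = H @ s_prev                  # relevance of each h_i to the previous state
    a_t = np.exp(scores - scores.max())
    a_t /= a_t.sum()                     # attention weights over the input elements
    C_t = a_t @ H                        # weighted context vector
    return a_t, C_t
```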

Target Recognition with Multiscale Attention

Based on the 3D scene model data of the substation, this paper combines the YOLOv5 model with the attention mechanism to propose a target recognition model oriented to the substation 3D scene model; its framework is shown in Fig. 2. The two main components of the model are the PAFPN and the prediction channels. In the original YOLOv5 structure, the prediction part only uses a three-layer network; this shallow structure reduces the computational complexity of YOLOv5, but its feature extraction ability for small targets is slightly insufficient. This paper therefore adds another prediction layer so that the network can extract deeper target features and reduce the probability of missed detections. At the same time, two and three attention modules are added to the up-sampling and down-sampling paths of the network, respectively, to enhance the upward transfer of low-level semantic features, improving the network's attention to small targets and reducing the probability of false detections. Finally, the prediction boxes are filtered by non-maximum suppression to obtain the final inference results.

Figure 2.

Framework of the target recognition model for the substation 3D scene

Multi-scale feature fusion

The multi-scale feature fusion of the YOLO architecture has been continuously optimized and improved, from FPN-like structures to FPN and then PANet. In principle, the shallow layers of the network have higher resolution and carry more accurate location information, while the deep layers have a larger receptive field, carry more high-dimensional semantic information, and contribute more to target classification. Optimizing the fusion of information at different scales is therefore a way to enhance the network architecture.

The problem faced by single-stage target detection algorithms is that features of different scales cannot be obtained from a single stage of feature extraction. The features extracted in the backbone are divided by stage and denoted C1, C2, ⋯, C7, where the number indicates how many times the image resolution has been halved; for example, C4 denotes stage 4, whose output feature map is 1/16 the size of the original image.

After that, feature fusion is performed layer by layer from top to bottom, and the output is denoted as P. This process can be expressed as: $P_i = f_i(C_i, P_{i+1}), \quad i \in \{3, 4, \ldots, 6\}$

YOLOv5 applies PANet for feature fusion in the neck, and the PANet structure is characterized by a bottom-up fusion path established in layers C3 to C7, which enhances the upward transfer of strongly localized features from the bottom layers. Compared with PANet, BiFPN makes three main improvements. First, some nodes are removed: nodes with only one input edge are deleted, because such nodes carry no extra information compared with the previous node, which reduces redundant computation. Second, skip connections are added, so that the output layer not only receives the information that has taken part in bottom-up feature fusion but also retains the unfused information of the original node. Third, the fusion module is formed as a whole that can be stacked repeatedly for further fusion. The relationship between the layers is given by:

$P_7^{out} = \text{Conv}(P_7^{in})$
$P_6^{out} = \text{Conv}(P_6^{in} + \text{Resize}(P_7^{out}))$
$P_5^{out} = \text{Conv}(P_5^{in} + \text{Resize}(P_6^{out}))$
$P_4^{out} = \text{Conv}(P_4^{in} + \text{Resize}(P_5^{out}))$
$P_3^{out} = \text{Conv}(P_3^{in} + \text{Resize}(P_4^{out}))$

where $P_i^{in}$ denotes the input of node $P_i$, $P_i^{out}$ denotes the output of node $P_i$, and Resize(·) is the upsampling operation.
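A minimal PyTorch-style sketch of the top-down fusion equations above is given below; the channel width, the nearest-neighbor upsampling, and the module name are assumptions rather than the paper's exact configuration.

```python
# Sketch of the top-down fusion: each level adds the upsampled higher level
# and is passed through a 3x3 convolution, i.e. Pi_out = Conv(Pi_in + Resize(Pi+1_out)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, channels: int = 256, levels: int = 5):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1)
                                   for _ in range(levels))

    def forward(self, feats):                   # feats: [P3_in, ..., P7_in], low -> high
        outs = [self.convs[-1](feats[-1])]      # P7_out = Conv(P7_in)
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(outs[0], size=feats[i].shape[-2:], mode="nearest")
            outs.insert(0, self.convs[i](feats[i] + up))
        return outs                             # [P3_out, ..., P7_out]
```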

Through this design, a small increase in accuracy and a significant reduction in the amount of operations can be realized theoretically, which is of great practical significance for improving the target recognition and efficient monitoring of the 3D scene of the substation.

Coordinate Attention Mechanism

In recent years, the Attention Mechanism module has been widely used in computer vision tasks to tell the model what to pay more attention to and where, and is now used in deep neural networks in order to improve the performance of the model. However, in lightweight networks, the application of attention mechanisms is somewhat limited due to the fact that most of them incur additional computational overheads that are unaffordable for lightweight networks. Therefore, in this paper, we introduce a simple and flexible coordinate attention mechanism (CAM) with little additional computational overhead to improve the accuracy of the network.

The input feature map X is the output of the previous convolution layer with dimensions C × H × W, i.e., C channels, height H and width W. Average pooling with kernels of size (H, 1) and (1, W) is used to encode the output of each channel along the horizontal and vertical coordinate directions, i.e., the c-th channel at height h and the c-th channel at width w, respectively: $z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$

These two transformations aggregate features along the two spatial directions. The two generated feature maps $z^h$ and $z^w$ are then concatenated and passed through a convolution $F_1$ with kernel size 1 to generate an intermediate feature map f encoding spatial information in both the horizontal and vertical directions: $f = \delta(F_1([z^h, z^w]))$

f is split along the spatial dimension into two tensors $f^h$ and $f^w$, which are then transformed by two convolutions $F_h$ and $F_w$ with kernel size 1 into feature maps with the same number of channels as the input X: $g^h = \sigma(F_h(f^h)), \qquad g^w = \sigma(F_w(f^w))$

Here σ is the Sigmoid activation function, which maps values to the range 0 to 1 representing the degree of importance. Expanding $g^h$ and $g^w$ as attention weights, the final output is: $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$

The CAM is able to effectively attend to the relevant channels while focusing on spatial location information. In this paper, this attention mechanism is embedded into the CBL module in the Backbone and the residual blocks of the CSP in the Bottleneck, helping the model extract features of the targets of interest more effectively.
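The following PyTorch sketch outlines a coordinate attention block following the equations above (pooling along H and W, a shared 1×1 convolution, splitting, and sigmoid gating); the reduction ratio and layer arrangement are assumptions, not the paper's exact implementation.

```python
# Compact sketch of coordinate attention: directional average pooling, a shared
# 1x1 conv on the concatenated descriptor, split, sigmoid gates, then reweighting.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(8, channels // r)
        self.f1 = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU())
        self.fh = nn.Conv2d(mid, channels, 1)
        self.fw = nn.Conv2d(mid, channels, 1)

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                     # (B, C, H, 1): pool along W
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2) # (B, C, W, 1): pool along H
        f = self.f1(torch.cat([z_h, z_w], dim=2))             # shared 1x1 conv on [z_h; z_w]
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.fh(f_h))                     # (B, C, H, 1)
        g_w = torch.sigmoid(self.fw(f_w.permute(0, 1, 3, 2))) # (B, C, 1, W)
        return x * g_h * g_w                                  # y_c(i,j) = x * g^h(i) * g^w(j)
```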

Loss function design

The loss function of YOLOv5 is at the heart of its training process and is responsible for measuring the discrepancy between model predictions and actual labels. In order to train the model to accurately predict the location, size, and class of the target, Improved YOLOv5 employs a composite loss function that consists of 3 main components, namely the bounding box regression loss, the object confidence loss, and the classification loss.

The object confidence loss $L_{conf}$ is computed using a binary cross-entropy loss. This part of the loss optimizes the model's prediction of the presence or absence of a target, ensuring that the model accurately identifies regions containing a target. The formula is: $Loss_{conf} = -\sum_{i=0}^{K \times K}\sum_{j=0}^{M} I_{ij}^{obj}\left[\hat{C}_{ij}\log C_{ij} + (1 - \hat{C}_{ij})\log(1 - C_{ij})\right] - \lambda_{noobj}\sum_{i=0}^{K \times K}\sum_{j=0}^{M} I_{ij}^{noobj}\left[\hat{C}_{ij}\log C_{ij} + (1 - \hat{C}_{ij})\log(1 - C_{ij})\right]$

where K denotes that the final output feature map of the network is divided into K × K cells, M denotes the number of anchor boxes in each cell, $I_{ij}^{obj}$ denotes anchor boxes containing a target, $I_{ij}^{noobj}$ denotes anchor boxes without a target, and $\lambda_{noobj}$ denotes the weight coefficient of the confidence loss for anchor boxes without a target.

The classification loss $L_{class}$ is calculated using the cross-entropy loss. This part of the loss ensures that the model can accurately classify each detected target: $Loss_{class} = -\sum_{i=0}^{K \times K} I_{ij}^{obj}\left[\hat{P}_{ij}\log p_{ij} + (1 - \hat{P}_{ij})\log(1 - p_{ij})\right]$

where $p_{ij}$ is the predicted probability and $\hat{P}_{ij}$ is the actual probability of the category to which the target in the grid belongs.

In this paper, the CIoU loss is used to measure the difference between the predicted bounding box (including center coordinates, width and height) and the ground-truth bounding box [18]. CIoU improves on IoU by taking into account the overlap between bounding boxes, the center distance, and the aspect ratio, providing a more comprehensive assessment of bounding box quality. The bounding box regression loss is calculated as: $L_{CIoU} = 1 - \left( IoU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v \right)$

where IoU is the intersection-over-union of the predicted and ground-truth boxes, b is the center of the predicted box, $b^{gt}$ is the center of the ground-truth box, ρ is the Euclidean distance between the two centers, c is the diagonal length of the smallest rectangle enclosing both boxes, α is a weight coefficient, and v measures the difference between the aspect ratios of the predicted and ground-truth boxes.

The total loss function of the improved YOLOv5 model is the weighted sum of the above three partial losses, i.e.: $Loss = \lambda_1 L_{CIoU} + \lambda_2 L_{conf} + \lambda_3 L_{class}$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weighting coefficients used to balance the loss contributions of the different components. These coefficients can be adapted to specific tasks and datasets to optimize the overall performance of the model.
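For reference, the following PyTorch sketch computes the CIoU term and the weighted total loss defined above for boxes in (x1, y1, x2, y2) format; the epsilon value and the example lambda weights are illustrative assumptions.

```python
# Hedged sketch of the CIoU bounding-box loss and the weighted total loss above.
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # intersection-over-union of the two boxes
    ix1, iy1 = torch.max(pred[..., 0], target[..., 0]), torch.max(pred[..., 1], target[..., 1])
    ix2, iy2 = torch.min(pred[..., 2], target[..., 2]), torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared centre distance rho^2 and enclosing-box diagonal c^2
    cp = (pred[..., :2] + pred[..., 2:]) / 2
    ct = (target[..., :2] + target[..., 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(-1)
    cx1, cy1 = torch.min(pred[..., 0], target[..., 0]), torch.min(pred[..., 1], target[..., 1])
    cx2, cy2 = torch.max(pred[..., 2], target[..., 2]), torch.max(pred[..., 3], target[..., 3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps
    # aspect-ratio consistency term v and its weight alpha
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wt, ht = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)     # L_CIoU per box

def total_loss(l_ciou, l_conf, l_class, lam=(0.05, 1.0, 0.5)):
    # weighted sum of the three components; the lambda values are illustrative only
    return lam[0] * l_ciou + lam[1] * l_conf + lam[2] * l_class
```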

Substation 3D scene target recognition experiment

The operational reliability of power components in the substation is extremely important. When manual inspection is used, the workload and professional skill levels of inspection personnel vary, which easily leads to low inspection efficiency and insufficiently accurate inspection results. Remote inspection helps substation managers quickly find faults in substation power equipment. Building a 3D scene model of the substation with 3D modeling technology and combining it with deep learning technology enables effective monitoring of targets in the substation scene and provides protection for its safe and stable operation.

Substation 3D point cloud data validation
Point cloud data processing effect

In order to compare the effect of different downsampling methods, this section applies random downsampling, uniform downsampling and voxel downsampling to the original point cloud data of the substation. By adjusting the neighbor search radius in uniform downsampling and the grid resolution in voxel downsampling, the number of points after uniform and voxel downsampling is kept the same, namely 47251. The experimental data are shown in Table 1.

Table 1. Comparison of downsampling results

Method | Original number of points | Points after downsampling | Time (s)
Random downsampling | 281421 | 47251 | 0.1274
Uniform downsampling | 281421 | 47251 | 0.1532
Voxel downsampling | 281421 | 47251 | 0.1049

As can be seen from the table, for the same reduction from 281421 points to 47251 points, voxel downsampling takes the shortest time, only 0.1049 s, which is 17.66% and 31.53% faster than random downsampling and uniform downsampling, respectively. When the original point cloud is voxel-downsampled, points in high-density regions such as the target object and its surrounding clutter have a higher probability of being sampled, while points in low-density regions such as interference noise have a lower probability. As a result, the voxel-downsampled point cloud filters out most of the interfering noise while retaining the important information about the target object and its surroundings.

In order to compare the effects of different outlier removal methods and parameters, this section denoises the downsampled point cloud with a statistical filter and a radius filter, respectively. For the statistical filter, the optimal parameter is sought by adjusting the number of nearest-neighbor search points k1; for the radius filter, the optimal parameters are sought by adjusting the neighbor search radius r1 and the neighbor point count threshold k2. The experimental data are shown in Table 2.

Table 2. Outlier removal experiment

Method | Parameters | Points after downsampling | Points after outlier removal | Time (s)
Radius filtering | k1 = 15 | 47251 | 47179 | 0.1042
Radius filtering | k1 = 25 | 47251 | 47103 | 0.1225
Radius filtering | k1 = 90 | 47251 | 47192 | 0.1563
Statistical filtering | r1 = 15, k2 = 10 | 47251 | 46935 | 0.0991
Statistical filtering | r1 = 15, k2 = 25 | 47251 | 46326 | 0.1208
Statistical filtering | r1 = 15, k2 = 45 | 47251 | 45834 | 0.1517

From the data in the table, it can be seen that when the radius filter is used for outlier removal, there is a roughly linear relationship between the neighbor point count threshold and both the number of points remaining after removal and the time taken: the larger the threshold, the fewer points remain and the longer the processing takes. In its best experiment, the number of points is reduced from 47251 to 47103 in 0.1225 s. When the statistical filter is used, there is no linear relationship between the number of nearest-neighbor search points and the number of points remaining after removal, but the time taken does increase with the number of search points. In its best experiment, the number of points is reduced from 47251 to 45834 in 0.1517 s. Overall, this paper uses statistical filtering to remove outliers from the substation 3D scene data with good results, ensuring accurate processing of the substation 3D scene data.

Point cloud data alignment effect

In order to verify the feasibility of the improved ICP point cloud alignment algorithm proposed in this paper, the traditional ICP algorithm and the RANSAC+ICP algorithm are used to align multi-view point clouds of 10 target equipment items in the substation 3D scene model, and the root-mean-square error (RMSE) is used to measure alignment accuracy. Table 3 shows the RMSE and time consumption of the different algorithms.

Table 3. Comparison of RMSE and time consumption of different algorithms

No. | ICP RMSE | ICP Time/s | RANSAC+ICP RMSE | RANSAC+ICP Time/s | Ours RMSE | Ours Time/s
1 | 0.632 | 0.159 | 0.385 | 0.253 | 0.079 | 0.275
2 | 0.683 | 0.214 | 0.465 | 0.276 | 0.075 | 0.318
3 | 0.652 | 0.208 | 0.472 | 0.295 | 0.077 | 0.299
4 | 0.637 | 0.201 | 0.516 | 0.282 | 0.076 | 0.273
5 | 0.601 | 0.311 | 0.243 | 0.674 | 0.019 | 0.615
6 | 0.641 | 0.374 | 0.357 | 0.645 | 0.008 | 0.649
7 | 0.648 | 0.459 | 0.551 | 1.142 | 0.009 | 1.106
8 | 0.639 | 0.428 | 0.538 | 0.938 | 0.013 | 0.627
9 | 0.655 | 0.417 | 0.492 | 0.729 | 0.007 | 1.235

From the table, it can be seen that the traditional ICP algorithm is fast, but its alignment results are relatively poor. Although RANSAC+ICP improves the alignment accuracy compared with traditional ICP, its alignment efficiency is not high, and its accuracy deteriorates for point clouds with relatively few repetitive structures. In comparison, the improved ICP alignment algorithm proposed in this paper achieves more accurate and stable alignment results in various scenarios, although the alignment time is relatively long. The average RMSE of the proposed method is 93.73% lower than that of traditional ICP alignment and 90.97% lower than that of RANSAC+ICP alignment. The method not only addresses the limitations of the traditional ICP algorithm but also achieves significant improvements in the robustness and accuracy of point cloud alignment. The improved algorithm mainly uses a KD-Tree for neighborhood search with a modified nearest-point search condition and combines the SVD method to solve the rigid transformation of the point cloud coordinates, which significantly improves the alignment of the substation 3D point cloud data. The algorithm better describes the local geometric features of the substation 3D scene point cloud, improving the accuracy of the initial matching, and further strengthens robustness to outliers and the accuracy of the initial estimate, making the overall alignment process more robust.

Substation 3D scene target identification
Performance comparison of different algorithms

For the improved YOLOv5 model designed in this paper for substation 3D scene target recognition, the BDZ dataset obtained earlier is divided into a training set and a test set at a ratio of 8:2. The main evaluation indexes are precision (P), recall (R), average precision (AP), mean average precision (mAP@0.5 and mAP@0.5:0.95), recognition speed (FPS) and model volume (Vol).

P-R curve of the model

The P-R curve of the model indicates the effectiveness of target recognition in the substation 3D scene; the horizontal and vertical axes of its coordinate system are recall and precision, respectively, and the AP value is obtained by calculating the area under the P-R curve. The BDZ dataset constructed in this paper mainly contains three categories, transformers, fire sandboxes and operators, and the corresponding P-R curves are shown in Figure 3.

Figure 3.

P-R curves of the model for each target category

From the figure, it can be seen that the AP values of the improved YOLOv5 model for recognizing transformers, fire sandboxes and operators in the substation 3D scene are 97.05%, 94.78% and 95.91%, respectively. The main reason complete recognition is not achieved is that the complex spatial environment of the substation interferes with target recognition; although the point cloud data are preprocessed in this paper, some recognition defects remain. In subsequent work, we will therefore consider adding preprocessing of the substation scene images to improve image cleanliness and recognition rates and better guarantee the safety monitoring of substation operation and maintenance scenes.

Performance comparison of different algorithms

In order to further verify that the improved YOLOv5 model proposed in this paper has advantages in detection accuracy, model volume and recognition speed over other mainstream target detection algorithms, it is compared with YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, YOLOv3, YOLOv4, YOLOXs, SSD and Faster RCNN on the self-constructed BDZ dataset. The experimental results are shown in Table 4.

Table 4. Performance comparison of different algorithms

Model | mAP@0.5/% | mAP@0.5:0.95/% | Vol/MB | FPS/(f/s)
YOLOv5n | 84.62 | 59.08 | 3.88 | 139
YOLOv5s | 87.85 | 63.92 | 14.45 | 125
YOLOv5m | 88.89 | 66.63 | 42.54 | 95
YOLOv5l | 89.15 | 67.25 | 93.81 | 80
YOLOv5x | 89.64 | 67.41 | 175.29 | 55
YOLOv3 | 81.82 | 46.34 | 246.64 | 43
YOLOv4 | 82.06 | 47.49 | 256.41 | 35
YOLOXs | 81.38 | 50.88 | 36.64 | 32
SSD | 40.54 | 20.25 | 98.58 | 43
Faster RCNN | 50.29 | 23.21 | 115.37 | 15
Ours | 90.37 | 68.94 | 8.72 | 92

As can be seen from the table, the improved YOLOv5 model proposed in this paper achieves the highest detection accuracy among the mainstream target recognition algorithms compared, with a model volume of 8.72 MB and a detection speed of 92 f/s, so it can be readily applied to automatic target recognition in the substation 3D scene model. Compared with YOLOv5x, which has the closest detection accuracy, the proposed algorithm reduces the model volume by 166.57 MB and increases mAP@0.5 and mAP@0.5:0.95 by 0.73 and 1.53 percentage points, respectively, giving it obvious advantages in volume and accuracy. In summary, the improved YOLOv5 model achieves the highest detection accuracy while remaining lightweight and maintaining good real-time performance, which further proves the superiority of the proposed algorithm. Applied to automatic target recognition in the substation 3D scene model, it can recognize substation scene targets effectively with high recognition speed and accuracy, supporting the safe and stable operation of substations.

Analysis of model ablation experiments

In order to further validate the detection performance of the proposed algorithm and explore the effectiveness of each improvement, ablation experiments were designed on the basis of YOLOv5, using the same hyper-parameters and training techniques for each group. The results of the ablation experiments are shown in Table 5.

Table 5. Ablation experiment results

Group | BiFPN | CAM | CIoU | Vol/MB | FPS/(f/s) | mAP@0.5/% | mAP@0.5:0.95/%
1 | × | × | × | 14.45 | 125 | 87.85 | 63.92
2 | √ | × | × | 13.98 | 103 | 87.94 | 64.06
3 | × | √ | × | 12.07 | 101 | 88.06 | 64.73
4 | × | × | √ | 14.45 | 125 | 87.85 | 63.92
5 | √ | √ | × | 11.39 | 100 | 88.74 | 65.97
6 | √ | × | √ | 13.98 | 98 | 89.23 | 66.48
7 | × | √ | √ | 10.86 | 95 | 89.62 | 67.81
8 | √ | √ | √ | 8.72 | 92 | 90.37 | 68.94

As can be seen from the table, introducing the CAM module reduces the model size by 2.38 MB, making it an effective lightweight modification; mAP@0.5 increases by 0.21 percentage points and mAP@0.5:0.95 by 0.81 percentage points over the baseline, so although the accuracy gain is modest, the number of parameters is significantly reduced. After the BiFPN module is introduced to reduce model complexity, the recognition speed drops to 103 f/s, but mAP@0.5 and mAP@0.5:0.95 increase by 0.09 and 0.14 percentage points, respectively, showing that BiFPN improves the model's ability to fuse multi-scale features of targets in the substation 3D scene and makes the identified bounding boxes fit object contours better. Compared with YOLOv5, the overall recognition speed of the improved YOLOv5 model is reduced by 36.95%, while mAP@0.5 and mAP@0.5:0.95 increase by 2.52 and 5.02 percentage points, respectively. The improved model further strengthens the ability to identify and fit targets in the substation 3D scene, and its lower hardware requirements allow it to be widely applied to automatic recognition tasks in the substation 3D scene that demand high IoU of the target boxes and more accurate positioning.

Horizontal comparison of improvement modules

In order to verify the effectiveness of the improvement modules against the baseline, lateral comparison experiments are conducted for each improvement module based on YOLOv5. For the comparison of attention mechanisms, three typical attention mechanisms, SE, BAM and CBAM, are selected as the control group. In the experiments, the different attention modules are introduced at the same positions in the backbone and the neck feature fusion network, while the rest of the network structure remains unchanged. To verify the effectiveness of the BiFPN neck network, four typical structures, FPN, PANet, GFPN and PRFPN, are selected as control groups. In addition, five traditional loss functions, IoU, GIoU, DIoU, EIoU and SIoU, are selected for the loss function comparison. Table 6 shows the results of the lateral comparison experiments of the improvement modules.

Table 6. Lateral comparison experiments of the improvement modules

Attention mechanism | mAP@0.5/% | mAP@0.5:0.95/%
YOLOv5 | 49.84 | 30.49
SE | 50.69 | 32.95
BAM | 50.93 | 33.41
CBAM | 51.42 | 33.87
CAM | 52.08 | 34.62

Neck network | mAP@0.5/% | mAP@0.5:0.95/%
FPN | 49.84 | 30.49
PANet | 51.06 | 32.95
GFPN | 51.98 | 33.16
PRFPN | 52.31 | 33.83
BiFPN | 52.87 | 35.46

Loss function | Loss value | mAP@0.5/%
IoU | 0.0712 | 60.52
GIoU | 0.0694 | 63.45
DIoU | 0.0695 | 62.93
EIoU | 0.0691 | 64.27
SIoU | 0.0685 | 64.41
CIoU | 0.0623 | 66.58

Based on the data in the table, the following conclusions can be drawn:

Compared with SE, BAM, CBAM and the baseline without an attention mechanism, CAM increases mAP@0.5 by 1.39, 1.15, 0.66 and 2.24 percentage points, respectively, which verifies the effectiveness of the CAM attention mechanism in improving the YOLOv5 model.

PANet adds a cross-scale aggregation path on the basis of FPN, improving feature interaction and increasing mAP@0.5 by 1.22 percentage points. GFPN introduces learnable weights for the input features and simplifies PANet, increasing mAP by 2.14 percentage points. PRFPN improves feature extraction through bidirectional fusion and an improved parallel FP structure, increasing mAP by 2.47 percentage points. When BiFPN is introduced, mAP increases by 3.03 percentage points, which verifies the effectiveness of the module.

According to the loss function results, the CIoU loss function adopted in this paper yields a loss value of only 0.0623 and an mAP@0.5 of 66.58%, better than the other five loss functions. This shows that the CIoU loss fully considers the overlap between bounding boxes, the distance between their centers and the aspect ratio, and therefore achieves better loss values in target recognition for the substation 3D scene model.

Conclusion

Starting from the practical demand for automatic target recognition in the substation 3D scene model, this paper proposes a target recognition network model based on an improved YOLOv5s algorithm. On the basis of the original YOLOv5 model, the multi-scale feature fusion BiFPN module is introduced, which reduces model complexity, parameters and computation. To compensate for the accuracy loss caused by the multi-scale feature fusion operation, the CAM attention mechanism is further introduced to improve model performance. The experimental results show that the proposed algorithm can provide effective technical support for automatic target recognition in the substation 3D scene model.
