
Development of real-time video target recognition system based on convolutional neural network

  
Mar 21, 2025


Introduction

Video target analysis and recognition is an important research direction in the field of computer vision, with applications in video surveillance, intelligent transportation, video search and many other fields. Video target analysis and recognition technology based on convolutional neural networks has received extensive attention and research in recent years, and it is grounded in the ideas of deep learning [1-4]. Deep learning is a machine learning method that learns and represents the features of data through multi-level neural network models. The convolutional neural network is a classical model in deep learning: it extracts the spatial features of the image through local receptive fields and parameter sharing, and propagates and classifies the information through multiple convolutional layers and fully connected layers [5-8]. The application of convolutional neural networks in video target analysis and recognition falls mainly into spatial-based target analysis and temporal-based target recognition [9-10].

Spatial-based target analysis is realized by feature extraction and classification of each frame in the video: a convolutional neural network extracts features from each frame, and the extracted features are then classified and recognized. Its key technologies are the selection of image feature extraction and classification methods [11-14]. Temporal-based target recognition builds on spatial target analysis, takes a sequence of consecutive video frames as input, and is realized by modeling and classifying the temporal features; its key technologies are the modeling of temporal features and the selection of the classification method [15-18].

Literature [19] aims to reveal the role of deep learning and convolutional neural networks in target detection tasks. Convolutional neural networks are analyzed as the backbone of the target detection model and tested on the latest generalized datasets and benchmarks, demonstrating the great potential of several convolutional neural network architectures for such applications. Literature [20] introduces low-power and flexible platforms as hardware accelerators for CNNs and uses the AlexNet architecture in real-time object recognition applications on reconfigurable hardware; the effectiveness of the proposed CNN accelerator is demonstrated by an evaluation on the ZC706 board. Literature [21] proposes a target detection technique that enables real-time detection of targets in any device and environment where the model is running, and uses convolutional neural networks to build a multilayer model. A single-shot multibox detection algorithm is used to improve the computational performance of the target detection technique, and the accuracy of the detected targets is examined through parameters such as LP and mAP. Literature [22] uses deep learning together with a convolutional neural network to recognize objects that appear when video is used as input. The convolutional neural network gives a confidence score for each object during recognition, and a residual network is applied instead of the VGG network to increase the computational speed. Literature [23] provides a systematic review of recent advances of convolutional neural networks in the field of object detection, describing the types of object detection models, the available benchmark datasets and the use of object detection models in various applied research works. Literature [24] proposes a convolutional neural network incorporating time-domain motion features for detecting UAV video targets. The motion information between two neighboring frames is extracted and combined with a baseline network, and the effectiveness of the proposed method in detecting targets is verified. Literature [25] introduces the IRRCNN model, which utilizes the functionality of the RCNN. This method is able to improve the recognition accuracy of the Inception-residual network with the same number of network parameters. The performance of the IRRCNN model is empirically evaluated, and the results show that the method achieves high recognition accuracy compared with commonly used DCNN models, including the RCNN. Literature [26] proposes a criminal activity detection system based on E-CNN, with the data examined using SPSS software. The results show that the average accuracy of criminal activity detection is generally high, so the system can be used to enhance security warning against potential criminal activities. Literature [27] proposed the deep learning framework T-CNN, which combines the temporal and contextual information of tubelets obtained from the video and can effectively improve the baseline performance of existing still-image detection frameworks when applied to video. Literature [28] detects objects in the environment by applying CNNs. Two target detection models, SSD with MobileNetV1 and Faster R-CNN with InceptionV2, were compared, and the results indicate that the two models are suitable for real-time application and accurate target detection, respectively. Literature [29] developed an intelligent video technology based on convolutional neural network deep learning, which was used in smart cities to improve the accuracy and real-time recognition of abnormal behaviors in massive video surveillance data, showing that this technology has good value for popularization in the field of intelligent video applications.

In this paper, video image enhancement comparison experiments are first conducted to determine the video image enhancement method. On the basis of the enhanced video images, a target recognition algorithm based on a composite backbone network of YOLO-V5 is designed. To address the fact that the feature pyramid of YOLO-V5 is relatively simple and performs poorly in complex environments, the backbone network is optimized and the convolutional block attention module CBAM is added to enhance the image feature extraction capability. The feature pyramid network FPN and the path aggregation network PANet are adopted as the architecture for multi-scale feature fusion, which better combines shallow detail information with deep semantic information, enhances the performance of the network and improves the detection accuracy of target recognition. A real-time video target recognition system is then built with the constructed real-time video target recognition model as its core. The training process before and after optimization of the model is observed in terms of precision, recall and average precision to examine its performance, and the system is applied to the real-time video target recognition task of rabbits and young rabbits to validate its utility for real-time video target recognition.

Video Image Enhancement Technology

With the development of information technology, the information carried by video images is increasingly important, and higher demands are placed on video image enhancement technology. To address common problems of video output images such as noise and low contrast, this paper uses the background differential detection method for the initial detection of moving targets in the video image, and introduces three video image enhancement techniques: linear grayscale transformation, histogram adjustment and histogram equalization.

Background differential detection processing

The background differential detection method is a commonly used moving target detection method. Its basic principle is to subtract a pre-constructed background image from the current frame: if the grayscale difference of a pixel is greater than a threshold, the pixel is determined to be part of a moving target, otherwise it belongs to the background.

Let It denote the t-th frame image, Bt the pre-constructed background image, Dt the grayscale difference image of these two frames, and Rt the result of the binarization process, i.e., the detected moving target. The mathematical expression of the background difference method is \[{{D}_{t}}(x,y)=|{{I}_{t}}(x,y)-{{B}_{t}}(x,y)|\] \[{{R}_{t}}(x,y)=\left\{ \begin{array}{*{35}{l}} 1 & \text{Target,}\ {{D}_{t}}(x,y)>T \\ 0 & \text{Background,}\ {{D}_{t}}(x,y)\le T \\ \end{array} \right.\] where It(x,y) denotes the grayscale value of the t-th frame image at pixel point (x,y), and Bt(x,y) denotes the grayscale value of the pre-constructed background image at point (x,y). When Dt(x,y) is greater than the threshold T, the grayscale change at point (x,y) relative to the background is obvious and the point is determined to be part of the target; when Dt(x,y) is less than or equal to the threshold T, no obvious grayscale change relative to the background has occurred and the point is determined to be background.

The key to the background differential detection method is to construct a complete background image. First, the background must be initialized. Since it cannot be determined whether there is a moving target in the first frame, the first frame cannot simply be used as the background; instead, the first N frames can be processed to obtain an initialized background image. In most cases the background changes over time, and if a fixed background is used the moving target cannot be detected accurately, so the background image must also be updated in real time to reduce detection errors caused by scene changes. The background update equation is \[{{B}_{t}}(x,y)=(1-\alpha )\cdot {{B}_{t-1}}(x,y)+\alpha \cdot {{I}_{t}}(x,y)\] where α denotes the update rate, which takes a value between 0 and 1 and regulates how quickly the background is updated.
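As a concrete illustration, the following is a minimal sketch of the background-difference detector and the running-average background update described above, written with OpenCV and NumPy; the video path, threshold T and update rate α are assumed example values rather than settings from this paper.

```python
import cv2
import numpy as np

# Illustrative sketch of the background-difference detector described above.
# The video path, threshold T and update rate ALPHA are assumed values.
T = 25          # grayscale difference threshold
ALPHA = 0.05    # background update rate, 0 < alpha < 1

cap = cv2.VideoCapture("input.mp4")
ok, frame = cap.read()
# Initialize the background from the first frame (a mean of the first
# N frames could be used instead, as the text suggests).
background = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # D_t(x, y) = |I_t(x, y) - B_t(x, y)|
    diff = cv2.absdiff(gray, background)

    # R_t(x, y) = 1 (target) if D_t > T, else 0 (background)
    target_mask = (diff > T).astype(np.uint8)

    # B_t = (1 - alpha) * B_{t-1} + alpha * I_t   (running-average update)
    background = (1 - ALPHA) * background + ALPHA * gray

cap.release()
```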

The background differential detection method is simple in principle, easy to implement, computationally efficient, and free of voids and ghosting compared to the inter-frame differential method, and its difficulty lies in the maintenance and updating of the background image, which requires a lot of time and resources.

Video Image Contrast Enhancement Technology
Linear grayscale transformation

If r′ = T[r] is a linear single-valued function, the grayscale transformation determined by it is called a grayscale linear transformation, or linear transformation for short [30]. Concretely, it is a function that maps the gray value of each pixel in the original image to another value interval according to this linear relationship.

Suppose the gray value of the image is r = f(x,y) and it undergoes the following linear transformation, whose expression is:

\[\left\{ \begin{matrix} {{r}^{\prime }}=\frac{r_{b}^{\prime }-r_{a}^{\prime }}{{{r}_{b}}-{{r}_{a}}}(r-{{r}_{a}})+r_{a}^{\prime },r\in [{{r}_{a}},{{r}_{b}}] \\ {{r}^{\prime }}=r_{a}^{\prime },r\in [{{r}_{\text{min}}},{{r}_{a}}] \\ {{r}^{\prime }}=r_{b}^{\prime },r\in [{{r}_{b}},{{r}_{\text{max}}}] \\ \end{matrix} \right.\]

This transformation extends the range of the transformed grayscale r′ to $[r_{a}^{\prime },r_{b}^{\prime }]$, and in the extreme case $r_{a}^{\prime }=r_{\min }^{\prime },r_{b}^{\prime }=r_{\max }^{\prime }$, i.e., the full output grayscale range. The human eye perceives a difference in luminance between two neighboring pixels only when their grayscale values (corresponding to their luminance) differ by a certain degree. When the grayscale value r lies only in the smaller interval [ra,rb], the total number of luminance levels the human eye can distinguish in the figure is very small. If the difference between the gray value of a target and the gray value of its background is small, the human eye cannot detect it; after such a linear transformation, the total number of luminance levels the human eye can distinguish in the transformed image is increased, and the transformed image looks much clearer.

There is also a segmented (piecewise) linear transformation, in which the value range of the image is divided into several intervals and a different linear transformation is applied to each. Using the segmented linear transformation, according to the requirements of the transformation, one portion of the grayscale interval can be compressed while another portion is expanded. The transformation expression is shown in equation (5): \[{{r}^{\prime }}=\frac{r_{c}^{\prime }-r_{a}^{\prime }}{{{r}_{c}}-{{r}_{a}}}(r-{{r}_{a}})+r_{a}^{\prime },\ r\in [{{r}_{a}},{{r}_{c}}]\] \[{{r}^{\prime }}=\frac{r_{b}^{\prime }-r_{c}^{\prime }}{{{r}_{b}}-{{r}_{c}}}(r-{{r}_{c}})+r_{c}^{\prime },\ r\in [{{r}_{c}},{{r}_{b}}]\]

The segmented linear transform can form a piecewise mapping with multiple segments and thus approximate a curve. It extends the grayscale range of the image details of interest, enhancing their contrast, while compressing the grayscale range of the details that are not of interest, reducing their contrast. It is worth noting that with this segmented linear transformation, the total grayscale range of the entire image is unchanged before and after the transformation.
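The following NumPy sketch illustrates both the basic linear stretch above and the segmented transformation of equation (5); it is an illustrative implementation, and the interval endpoints used in the example are assumed values.

```python
import numpy as np

def linear_stretch(img, r_a, r_b, r_a2, r_b2):
    """Map gray values in [r_a, r_b] linearly onto [r_a2, r_b2];
    values below r_a / above r_b are clipped to the endpoints."""
    img = img.astype(np.float32)
    out = (r_b2 - r_a2) / (r_b - r_a) * (img - r_a) + r_a2
    out = np.where(img < r_a, r_a2, out)
    out = np.where(img > r_b, r_b2, out)
    return np.clip(out, 0, 255).astype(np.uint8)

def piecewise_stretch(img, r_a, r_c, r_b, r_a2, r_c2, r_b2):
    """Segmented linear transform: [r_a, r_c] -> [r_a2, r_c2] and
    [r_c, r_b] -> [r_c2, r_b2], as in Eq. (5)."""
    img = img.astype(np.float32)
    low = (r_c2 - r_a2) / (r_c - r_a) * (img - r_a) + r_a2
    high = (r_b2 - r_c2) / (r_b - r_c) * (img - r_c) + r_c2
    out = np.where(img <= r_c, low, high)
    return np.clip(out, 0, 255).astype(np.uint8)

# Example with assumed interval endpoints: stretch [50, 180] to [0, 255].
gray = np.random.randint(50, 181, size=(4, 4), dtype=np.uint8)
enhanced = linear_stretch(gray, 50, 180, 0, 255)
```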

Histogram adjustment

The histogram of an image is an important statistical feature and can be regarded as an approximation of the density function of the image's grayscale distribution. Usually the grayscale distribution density function of an image is related to pixel location. Let the grayscale distribution density function of the image at point (x,y) be p(z;x,y); then the grayscale density function of the image is: \[p(z)=\frac{1}{S}\iint\limits_{D}{p}(z;x,y)dxdy\] where D is the domain of the image and S is the area of region D. In general, it is difficult to obtain the density function of the grayscale distribution of an image precisely, so in practice the histogram of the image is used instead. The grayscale histogram is a discrete function that represents the correspondence between each gray level of a digital image and the probability of occurrence of that gray level. Let the total number of pixels in a digital image be N, let there be L gray levels, and let there be nk pixels with the k-th gray level rk. Then the probability of occurrence of the k-th gray level is: \[{{h}_{k}}=\frac{{{n}_{k}}}{N},k=0,1,\ldots ,L-1\]
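A short sketch of how the empirical gray-level probabilities h_k = n_k / N can be computed with NumPy is given below; the test image is randomly generated purely for illustration.

```python
import numpy as np

def gray_histogram(img, levels=256):
    """Return h_k = n_k / N for k = 0..L-1, the empirical probability of
    each gray level, used here in place of the true density function."""
    counts = np.bincount(img.ravel(), minlength=levels)  # n_k
    return counts / img.size                             # h_k = n_k / N

# Example on an assumed 8-bit grayscale image.
img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
h = gray_histogram(img)
assert abs(h.sum() - 1.0) < 1e-9   # probabilities sum to one
```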

Histogram equalization

The histogram grayscale adjustment described above is based on probability theory. In fact, a common method for changing the shape of the histogram is histogram equalization. The basic idea of this method is to transform the unbalanced histogram of the original image into a uniformly distributed form, which increases the dynamic range of the gray values, thus achieving the effect of enhancing the overall contrast of the image.

The image enhancement transform function needs to fulfill two conditions.

1) T(s) is a single-valued and monotonically increasing function in the range 0 ≤ s ≤ L − 1.

2) 0 ≤ T(s) ≤ L − 1 for 0 ≤ s ≤ L − 1.

The first condition ensures that the inverse transformation exists and that the gray levels of the original image remain in order from black to white after the transformation, preventing inverted gray levels from appearing in the transformed image. The second condition ensures that the dynamic range of the gray values is consistent before and after the transformation, i.e., the original image and the transformed image have the same range of gray levels. For the equation t = T(s), the inverse transform can be expressed as: \[s={{T}^{-1}}(t),0\le t\le L-1\]

And it can be proved that Eq. (8) should also satisfy the above conditions (1) and (2).

The gray levels of an image can be regarded as random variables on the interval [0, L − 1], and it can be proved that the cumulative distribution function (CDF) satisfies the above two conditions and can transform the distribution of s into a uniform distribution of t. In fact, the CDF of s is the cumulative histogram of the original image; in this case: \[{{t}_{k}}=T({{s}_{k}})=\sum\limits_{i=0}^{k}{\frac{{{n}_{i}}}{n}}=\sum\limits_{i=0}^{k}{{{p}_{s}}({{s}_{i}})},k=0,1,2,\ldots ,L-1\]

From Eq. (10), the gray value of each pixel after histogram equalization is calculated directly from the image histogram. In practical digital image processing, tk must of course also be rounded to an integer gray level to meet the requirements. The inverse transformation can be written as: \[{{s}_{k}}={{T}^{-1}}({{t}_{k}}),0\le {{t}_{k}}\le L-1\]
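Below is a minimal sketch of histogram equalization following the cumulative-histogram mapping above; since t_k as defined lies in [0, 1], the sketch scales it by L − 1 and rounds, which is the usual convention and an assumption here rather than a detail stated in the text.

```python
import numpy as np

def equalize(img, levels=256):
    """Histogram equalization: t_k = T(s_k) = sum_{i<=k} n_i / n, then the
    cumulative value is scaled to [0, L-1] and rounded, as described above."""
    hist = np.bincount(img.ravel(), minlength=levels)        # n_k
    cdf = np.cumsum(hist) / img.size                          # t_k in [0, 1]
    mapping = np.round(cdf * (levels - 1)).astype(np.uint8)   # scale and round
    return mapping[img]                                       # apply T to every pixel

# Example: a low-contrast image whose gray values occupy only [40, 120).
img = np.random.randint(40, 120, size=(64, 64), dtype=np.uint8)
out = equalize(img)   # gray values now spread over a wider dynamic range
```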

Video Image Enhancement Comparison Experiment

In the previous section, three video image enhancement techniques were introduced: linear grayscale transformation, histogram adjustment, and histogram equalization. In this chapter, comparison experiments with the three techniques are used to determine the final video image enhancement method by observing the histograms of the processed images. The histograms of the processed images are shown in Fig. 1. Figure (a) shows the histogram of the original image, while Figures (b)~(d) show the histograms of the image after histogram adjustment, histogram equalization and linear grayscale transformation, in that order. It can be seen that histogram adjustment enhances the brightness of the image while increasing the contrast, but the originally brighter areas of the image appear overexposed, and the quality of the video image decreases rather than increases. The image after histogram equalization shows no obvious difference from the original image, so the processing effect is not obvious. The brightness of the image after the linear grayscale transformation is improved, and it performs better in terms of image detail clarity. Consequently, the linear grayscale transformation method was selected as the video image enhancement technique for video enhancement and preprocessing in this study.

Figure 1.

Image enhancement and histogram

Real-time video target recognition model based on convolutional neural network

In this study, three video image enhancement techniques, linear grayscale transformation, histogram adjustment and histogram equalization, were examined, and through the video image enhancement comparison experiments the linear grayscale transformation technique was finally chosen as the means of video image enhancement processing. Taking the enhanced and preprocessed video image as the premise, this chapter carries out the research on real-time video target recognition technology based on a convolutional neural network combined with the YOLO-V5 target recognition algorithm.

Convolutional neural network model

Convolutional neural networks, also known as convolutional networks, are a type of neural network specialized in processing data with a grid-like structure [31]. The input image serves as the input layer of the convolutional network; the middle of the network consists of several convolutional and pooling layers that extract the image feature information; near the end of the network a fully connected layer usually plays the role of integrating these features; and finally the output result forms the output layer of the convolutional network.

Convolutional layers

In its usual form, convolution refers to a mathematical operation on two functions of a real variable. In functional analysis, convolution denotes the process of generating a new function from two functions f(x) and g(x); it can be characterized as the integral of the product of the two functions over their overlapping region after one of them is flipped and translated.

Let f(x), g(x) be two integrable functions on R. The integral \[\int_{-\infty }^{\infty }{f}(t)g(x-t)dt\] is called the convolution of f(x) with g(x) and can be denoted as \[s(x)=(f*g)(x)\]

In a convolutional neural network processing image data, the first parameter f(x) of the convolution operation represents the input image, the second parameter g(x) represents the convolution kernel function, and s(x) represents the resulting image. All subsequent references to convolution in this paper refer to the convolution operation expressed in Equation (14). The input in image processing can be viewed as a two-dimensional discrete point set, so the convolutional neural network generally performs a two-dimensional convolution operation, which can be written as: \[S(i,j)=(I*K)(i,j)=\sum\limits_{m}{\sum\limits_{n}{I}}(m,n)K(i-m,j-n)\] where S(i,j) represents the convolution result image, I represents the original image, K represents the convolution kernel, m and n are the summation indices over which the kernel is applied, and (i,j) represents the position of the pixel on which this convolution operation acts. Mathematically, the convolution operation is commutative, i.e., Equation (13) can also be written as s(x) = (g*f)(x) with no change in the result. In the two-dimensional convolution used on images this property corresponds to flipping the convolution kernel K(m,n); since flipping the kernel has no special significance for convolutional neural networks, convolution operations in neural networks usually do not flip the kernel.
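For reference, here is a small NumPy sketch of the two-dimensional discrete convolution above, computed directly over the valid region; the input and kernel are arbitrary example values.

```python
import numpy as np

def conv2d(image, kernel):
    """Direct implementation of S(i,j) = sum_m sum_n I(m,n) K(i-m, j-n)
    ('valid' region only). Most CNN frameworks skip the kernel flip and
    compute cross-correlation instead, as noted in the text."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]          # true convolution flips the kernel
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.arange(25, dtype=np.float32).reshape(5, 5)
edge_kernel = np.array([[1, 0, -1]], dtype=np.float32)  # simple horizontal gradient
result = conv2d(image, edge_kernel)
```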

Pooling layer

The pooling layer takes a statistic of the neighborhood around a position in the input feature map as the value of the output at that position; the common statistics are maximum pooling, average pooling, and weighted average pooling. Typically, if one is more concerned with the overall features of the image rather than fine local features, the pooling operation makes the representation of the input approximately invariant to small local transformations. In other words, if the input image undergoes a small local transformation and the same pooling operation is applied before and after, the two resulting representations will be approximately the same. If one only needs to determine whether a certain feature appears in the input image, without caring about its exact location, the pooling layer can help make this judgment. In a typical example of the pooling operation, the pooling kernel has a size of 2 × 2 and the stride is 2, where the stride is the offset by which the pooling kernel is moved across the input image.
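The following sketch implements the 2 × 2, stride-2 max pooling described above with plain NumPy, purely as an illustration.

```python
import numpy as np

def max_pool(feature_map, k=2, stride=2):
    """Max pooling with a k x k kernel and the given stride, matching the
    2 x 2, stride-2 example mentioned above."""
    h, w = feature_map.shape
    oh, ow = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.zeros((oh, ow), dtype=feature_map.dtype)
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + k,
                                 j * stride:j * stride + k]
            out[i, j] = window.max()      # statistic of the neighborhood
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]], dtype=np.float32)
print(max_pool(fm))   # [[6. 4.] [7. 9.]]
```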

Fully connected layers

Typically most convolutional neural network models have a fully connected layer, which integrates the outputs of the convolutional or pooling layers, so that the previously local features become global features after passing through it. The fully connected layer examines the output of the previous layer and determines which category these features best match; given different weight values, the probability assigned to the correct category becomes higher. The fully connected layer typically produces a multidimensional vector whose dimension is linked to the number of target categories.

In a convolutional neural network, the convolutional layer, the pooling layer, and the fully connected layer can usually be referred to as hidden layers, so the neural network model can be simplified to an input layer, a hidden layer, and an output layer.
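To make the input, convolution/pooling, fully connected and output structure concrete, here is a minimal PyTorch sketch; the layer widths, input size and number of classes are arbitrary assumptions for illustration and do not describe the network used in this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the input -> convolution/pooling -> fully connected -> output
# structure described above; layer sizes and the number of classes are assumptions.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(            # hidden layers: conv + pool
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                       # 2 x 2 pooling, stride 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(           # fully connected layer integrates
            nn.Flatten(),                          # local features into global ones
            nn.Linear(32 * 16 * 16, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 3, 64, 64)     # a dummy 64 x 64 RGB input image
logits = SimpleCNN()(x)           # one score per target category
```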

YOLO-V5 Target Recognition Algorithm

Based on YOLO-V5, this paper improves the prediction scales while keeping the number of detection heads unchanged, which preserves the network’s ability to capture shallow details and ensures the completeness of deep semantic information, thus improving the model’s detection accuracy [32].

Inputs

In the YOLO family of algorithms, the model’s prediction boxes are output and backpropagated based on the initial anchor boxes, so the initial size of the anchor boxes has a direct impact on the final accuracy of the model. Prior to YOLO-V5, the anchor box sizes were usually determined with the K-Means algorithm, which had to be run separately before model training. YOLO-V5 simplifies the process by integrating the anchor box size calculation into model training, so it is no longer necessary to run a separate program. At the same time, if the adaptively generated anchor box sizes are not satisfactory, the user can still customize the anchor boxes.

Outputs and Loss Functions

The output part of YOLO-V5 is responsible for determining the category of the target and predicting its bounding box. The loss function of YOLO-V5 consists of three main components: the bounding box prediction error, the confidence score error, and the category prediction error. The total loss function expression is shown in equation (15), where box_gain, cls_gain and obj_gain denote the weights of the corresponding losses, with general default values of 0.05, 0.5, and 1.0: \[Loss=box\_gain\times bbox\_loss+cls\_gain\times cls\_loss+obj\_gain\times obj\_loss\]

The bounding box loss function discards the GIoU-Loss computation of earlier versions in favor of the CIoU-Loss computation. The GIoU-Loss computation is shown in Equation (16): \[GIo{{U}_{LOSS}}=IoU-\frac{|C/(A\cup B)|}{|C|}\] where A is the predicted box, B is the true box, and C denotes the minimum enclosing rectangle of the two. Although this approach solves the problem of bounding box mismatch, represents the degree of overlap between the two bounding boxes more accurately, and speeds up the convergence of the model, the intersection-over-union (IoU) still degrades if one bounding box completely contains the other. CIoU-Loss, on the other hand, considers the aspect ratios of the predicted box and the real box, the overlap area between the two boxes, and the distance between their centers at the same time, and thus captures the difference between the predicted box and the real box along more dimensions. The computation of CIoU-Loss is shown in Eq. (17): \[CIo{{U}_{LOSS}}=1-IoU+\frac{{{\rho }^{2}}(b,{{b}^{gt}})}{{{c}^{2}}}+\alpha \nu \] \[\nu =\frac{4}{{{\pi }^{2}}}{{\left(\arctan \frac{{{w}^{gt}}}{{{h}^{gt}}}-\arctan \frac{w}{h}\right)}^{2}}\] \[\alpha =\frac{\nu }{(1-IoU)+\nu }\]

In these equations, ρ represents the Euclidean distance between the center points of the bounding boxes, and c represents the diagonal length of the minimum enclosing region of the predicted and real bounding boxes. b and bgt represent the coordinates of the center points of the predicted and real boxes, respectively. ν is used to measure the consistency of the aspect ratios, with the expression shown in equation (18), and α is the weight parameter, with the expression shown in equation (19).
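The sketch below computes the CIoU loss of Eqs. (17)-(19) for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; it is a simplified illustration and not the exact YOLO-V5 implementation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss, following Eqs. (17)-(19): 1 - IoU + rho^2/c^2 + alpha * v."""
    # Intersection and union for IoU
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # rho^2: squared center distance; c^2: squared diagonal of the enclosing box
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v measures aspect-ratio consistency (Eq. 18); alpha is its weight (Eq. 19)
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

pred = torch.tensor([[50., 50., 150., 150.]])
gt = torch.tensor([[60., 60., 170., 160.]])
loss = ciou_loss(pred, gt)
```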

YOLO-V5 usually uses the binary cross-entropy function to calculate the classification loss, defined as shown in Equation (20): \[L=-y\log p-(1-y)\log (1-p)=\left\{ \begin{array}{*{35}{l}} -\log p, & y=1 \\ -\log (1-p), & y=0 \\ \end{array} \right.\]

In the formula, y denotes the actual label of the input sample (positive class samples are labeled 1 and negative class samples are labeled 0), while p denotes the probability that the model predicts the sample to be of the positive class.

The confidence of each prediction box reflects the accuracy of that prediction box, with a higher confidence indicating a better match between the prediction box and the ground-truth box.

As with the classification loss, YOLO-V5 uses the binary cross-entropy function by default to calculate the confidence loss.

Enhancing network feature extraction capabilities

In order to enhance the feature extraction capability of the network structure, the lightweight convolutional block attention module (CBAM) is introduced into the YOLO-V5 detection model. It applies attention in both the channel and spatial dimensions, improving the feature extraction capability of the network model without significantly increasing the computational cost.

The channel attention mechanism first passes the input feature map through two parallel average pooling and maximum pooling layers, reducing the height and width of the feature map to 1. The pooled results then pass through a shared MLP module, in which the number of channels is first compressed and then expanded back to the original number, followed by a ReLU activation function, yielding two activated results. The two outputs are summed element by element and weighted by a Sigmoid activation function to obtain the channel attention, which is finally multiplied with the original feature map to restore the original size.

The spatial attention mechanism first applies parallel average pooling and maximum pooling to highlight the feature regions, obtaining two 1×H×W feature maps. These are then passed through a convolutional layer and fused into a single-channel feature map, which is normalized by a Sigmoid function to obtain the spatial attention matrix. Finally, a weighting operation applies the spatial attention to the feature map to obtain the new feature map.
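A compact PyTorch sketch of a CBAM-style block following the description above is given below; the reduction ratio and the 7×7 spatial convolution kernel are assumed typical choices, not values stated in this paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a CBAM-style block: channel attention (parallel
# avg/max pooling -> shared MLP -> sigmoid) followed by spatial attention
# (channel-wise avg/max -> 7x7 conv -> sigmoid). Reduction ratio and kernel
# size are assumed typical values.
class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.shared_mlp = nn.Sequential(          # compress then restore channels
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size,
                                      padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel attention: pool over H and W, pass both results through the
        # shared MLP, sum element-wise, then gate the input with a sigmoid.
        avg = self.shared_mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.shared_mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)

        # Spatial attention: average and max over channels give two 1 x H x W
        # maps, which a convolution fuses into a single attention map.
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn

features = torch.randn(1, 64, 32, 32)
refined = CBAM(64)(features)     # same shape, re-weighted features
```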

Improvement of multi-scale feature network fusion

The V3 version of the YOLO family of algorithms includes multi-scale feature fusion to increase the network’s predictive capacity by combining shallow detail information with deep semantic information.

YOLO-V5 employs the Feature Pyramid Network FPN and the Path Aggregation Network PANet as its architecture for multiscale feature fusion [33]. FPN consists of bottom-up feature extraction and top-down horizontal feature fusion. Although FPN facilitates the transfer of deep semantic information to shallow layers, detailed information may be lost in the process.

Compared to PANet, BiFPN has the following design changes [34].

1) Removal of nodes that only serve as inputs. If a node only has input connections and does not perform feature fusion, its contribution to a network designed to integrate different features is minimal. Therefore, removing it not only has little impact on the performance of the network, but also helps simplify the architecture of the network.

2) Fusing more features without adding much cost. If the original input and output nodes are in the same layer, BiFPN adds an extra edge between the original input and output nodes.

3) A feature network layer containing bidirectional paths is constructed. Unlike PANet, which contains only one top-down and one bottom-up path, BiFPN treats a set of bidirectional paths as one feature network layer and repeats the same layer several times, thereby achieving deeper feature fusion.

4) A weighted summation is used to assign differentiated weights to different features. The specific weighted fusion process is shown in Eq. (21). This method is similar to the softmax-based fusion strategy in terms of fusion efficiency and accuracy: \[O=\sum\limits_{i}{\frac{{{\omega }_{i}}\times {{I}_{i}}}{\varepsilon +\sum\limits_{j}{{{\omega }_{j}}}}}\] where ωi denotes a weight parameter of the model with ωi ≥ 0, Ii denotes the input of the corresponding layer, and ε is a small value that ensures numerical stability.
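The following PyTorch sketch shows the weighted fusion of Eq. (21) as a small module with learnable non-negative weights; the feature map shapes in the example are assumptions, and the inputs are taken to be already resized to a common resolution and channel count before fusion.

```python
import torch
import torch.nn as nn

# Sketch of the weighted ("fast normalized") fusion of Eq. (21):
# O = sum_i (w_i * I_i) / (eps + sum_j w_j), with w_i kept non-negative.
class WeightedFusion(nn.Module):
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # learnable w_i
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)          # enforce w_i >= 0
        w = w / (self.eps + w.sum())          # normalize by eps + sum_j w_j
        return sum(wi * x for wi, x in zip(w, inputs))

# Two feature maps from different layers, assumed already resized to a
# common resolution and channel count before fusion.
p_td = torch.randn(1, 64, 40, 40)
p_in = torch.randn(1, 64, 40, 40)
fused = WeightedFusion(2)([p_td, p_in])
```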

Real-time video target recognition system based on convolutional neural network

Based on the real-time video target recognition algorithm using convolutional neural networks proposed in this paper, a real-time video target recognition system is developed, consisting of hardware devices and software programs.

System hardware

The system hardware consists of cameras, a turntable, a host computer, alarms, lithium batteries, monitors, tripods, and other components. On the basis of ensuring real-time detection by the model, the system uses low-cost hardware wherever possible; for example, the CPU and graphics card are entry-level products.

System software

The system software is written in C++ on the Windows 10 operating system, developed on the Visual Studio 2017 platform and combined with the OpenCV graphics library. The main process of the system software includes system initialization, video image acquisition, target recognition in the image, pseudo-target rejection and early warning. Target recognition in the image adopts the real-time video target recognition algorithm based on convolutional neural networks proposed in this paper. The workflow of the system software is as follows (a minimal illustrative sketch of this loop is given after the list).

1) System initialization: the host computer establishes communication with the dual-spectrum camera, turntable, alarm and other equipment through the serial port.

2) Video image acquisition: video image data is acquired according to the acquisition mode of the camera.

3) Target recognition in a single frame image: the collected video images are recognized by the real-time video target recognition algorithm based on convolutional neural networks proposed in this paper. If the target to be detected is recognized, the process goes to step 4; if not, the process jumps directly to step 6.

4) Pseudo-target rejection: the recognized target is judged. If it is a real target, a warning signal is sent to the alarm and the process moves to step 5; if it is a pseudo-target, the process jumps to step 6.

5) Early warning: when the alarm receives the warning signal, it triggers the sound and light alarm.

6) Judge whether to stop monitoring: if so, the system is shut down; otherwise, the process jumps back to step 2 and continues.
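The sketch below expresses the loop above in Python for clarity, although the actual system software is written in C++ with OpenCV; recognize(), is_pseudo_target(), send_alarm_signal() and stop_requested() are hypothetical placeholders for the recognition model, the time-domain pseudo-target check, the serial-port alarm interface and the stop condition.

```python
import cv2

# Python sketch of the workflow above (the actual system is written in C++
# with OpenCV). The callable arguments are hypothetical placeholders.
def run_system(recognize, is_pseudo_target, send_alarm_signal, stop_requested):
    cap = cv2.VideoCapture(0)                 # steps 1-2: init and start acquisition
    while not stop_requested():               # step 6: stop check closes the loop
        ok, frame = cap.read()                # step 2: acquire a video image
        if not ok:
            break
        detections = recognize(frame)         # step 3: recognize targets in the frame
        for det in detections:
            if not is_pseudo_target(det):     # step 4: pseudo-target rejection
                send_alarm_signal(det)        # step 5: early warning
    cap.release()
```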

System functions

The main functions of the real-time video target recognition system developed in this paper are as follows.

Acquire and save video data

A dual-spectrum camera is used to capture video images. The camera is mounted on the turntable and rotates with it along the horizontal direction, so 360° of video data can be collected.

Real-time recognition of video images

The collected video image data is recognized by the deep learning based target detection algorithm, and recognized pseudo-targets are then rejected. Based on the time-domain characteristics of the image sequence, the system determines whether the target is a real moving target: if it is a real target, an early warning signal is sent to the alarm; if it is a pseudo-target, it is rejected directly.

Display and save the detection results

The recognized video image is displayed on the program interface in real time. If a moving target is detected, it is marked with a bounding box and its category is indicated; meanwhile, the image of the detected moving target and the corresponding result are automatically stored on the local disk for later viewing.

Real-time video target recognition experiments
Real-time video target recognition model improvement training

In order to examine the real-time video target recognition model based on convolutional neural networks proposed in this paper, the training state of the YOLO-V5 based target recognition algorithm before and after the improvement is observed in terms of precision, recall and average precision. The precision is analyzed first, as shown in Figure 2. From the figure, it can be seen that the precision of this paper's model increases steadily as training proceeds. The precision fluctuates more in the first 0-50 epochs, but the curve lies to the upper left of the original model's, and the convergence speed is slightly faster. After 50 epochs, the precision increases steadily with smaller fluctuations and settles at around 0.848.

Figure 2.

Precision

The recall of this paper's model and of the original model before improvement is shown in Fig. 3. This paper's model approaches the peak at around the 80th, 100th, 138th and 240th epochs, while the recall of the original model approaches the peak at the 164th epoch, which indicates that the network structure is better and the convergence ability is enhanced.

Figure 3.

Recall rate

Combining the two indicators of precision and recall, mAP is further used to evaluate the average precision of this paper's model and the original model before improvement, as shown in Fig. 4. The average precision of this paper's model approaches the peak at the 45th epoch, while the original model approaches the peak at the 64th epoch; after 250 epochs the average precision of the two is comparable. From this comprehensive indicator, the network structure of the improved algorithm preserves the precision of the original model, and its convergence ability is stronger than that of the original model.

Figure 4.

mAP

Since higher precision generally comes with lower recall and vice versa, the two are a pair of contradictory indicators, and evaluating the model purely by precision and recall is not sufficient, so the P-R curve, F1 curve and other evaluation indicators can also be used.

The P-R curves of the two models are shown in Fig. 5. Ideally, the closer the curve is to the upper right corner, i.e., the higher the precision and recall, the better the model. It can be seen that the P-R curves of the two models are almost the same and their performance is comparable, indicating that the improved model speeds up training convergence while maintaining the performance of the original model, which further illustrates the feasibility of the improved model.

Figure 5.

P-R curve

The F1 scores of the two models are shown in Fig. 6 and are basically the same, following the same trend. The difference is that the improved model reaches its maximum F1 score at a confidence of 0.485, while the original model reaches its maximum at a confidence of 0.614; in the confidence interval [0.12, 0.86] the F1 scores of both fluctuate smoothly and differ little. From such a comprehensive indicator, the network structure of the improved model is designed reasonably, and the training speed is improved while the performance of the original model is guaranteed.

Figure 6.

F1 curve

Real-time video target recognition system application practice

In order to test the recognition performance of the real-time video target recognition system constructed in this paper, we carry out the application practice test. The comparison object chosen for this application practice is the traditional real-time video target recognition system (hereinafter referred to as the “traditional system”) that does not use the improved target recognition algorithm of YOLO-V5. The test task is set to recognize rabbits and young rabbits, and is divided into three test groups. The first group consists of 10 images containing only rabbit targets, the second group consists of 10 images containing only young rabbit targets, and the third group consists of 10 images containing both mother and young rabbits.

The results of the first group of tests are shown in Table 1. As can be seen from the data in the table, in terms of detection speed for each image, this paper's system takes less time than the traditional system on almost every image; its total test time is 0.009 s lower (0.15 s versus 0.159 s), and its total inference time is 2.3 ms less than that of the original model, indicating that the real-time detection performance of the improved model is better than that of the traditional system.

Table 1. Test results of the first group

System Test time for each image(s) Number of detection objects Pretreatment time(ms) Inference time(ms) NMS time(ms) Total test time(s)
Traditional system 0.017 6 0 15.8 2.7 0.159
0.017 1
0.017 6
0.015 1
0.015 1
0.015 4
0.015 3
0.017 4
0.016 3
0.015 4
System of this article 0.014 5 0 13.5 0 0.15
0.015 1
0.015 6
0.016 1
0.015 1
0.015 4
0.015 2
0.015 2
0.014 3
0.016 4

The results of the second group of tests are shown in Table 2. The detection time of this paper's system for each image is again less than that of the traditional system, and the total time consumed is 0.117 s, which is 0.059 s less than the traditional system's 0.176 s, indicating that the detection speed of this paper's system is effectively improved compared with the traditional system.

Table 2. Test results of the second group

System Test time for each image(s) Number of detection objects Pretreatment time(ms) Inference time(ms) NMS time(ms) Total test time(s)
Traditional system 0.076 1 0.5 10.5 1.4 0.176
0.011 8
0.01 10
0.012 3
0.012 7
0.011 7
0.012 2
0.01 9
0.011 11
0.011 5
System of this article 0.011 1 0.6 10.9 1.5 0.117
0.012 8
0.012 10
0.012 3
0.011 7
0.012 5
0.012 2
0.012 9
0.011 11
0.012 4

The third group consists of 10 images containing both mother rabbits and young rabbits, and the test results are shown in Table 3. In the table, ra represents the detected rabbit objects and sub_ra represents the detected young rabbit objects. From this set of data, it can be seen that each image contains both rabbits and young rabbits, and the total time consumed by this paper's system in detecting the images is 0.0763 s, which is 0.0377 s less than the traditional system's 0.114 s, proving once again that the detection speed of this paper's system is effectively improved.

Table 3. Test results of the third group

System Test time for each image(s) Number of detection objects Pretreatment time(ms) Inference time(ms) NMS time(ms) Total test time(s)
ra sub_ra
Traditional system 0.013 1 5 0.5 10.4 1.6 0.114
0.011 1 4
0.012 1 4
0.011 1 4
0.011 1 3
0.012 1 5
0.011 1 5
0.012 1 5
0.01 1 4
0.011 1 3
System of this article 0.015 1 5 0.6 7.8 0 0.0763
0.0011 1 4
0.0011 1 4
0.0011 1 4
0.014 1 3
0.012 1 6
0.001 1 5
0.001 1 6
0.015 1 4
0.015 1 5
Conclusion

Through comparison experiments on video image enhancement, this paper selects the linear gray scale transformation technique as the way to enhance video images. With the premise that the video image is enhanced, a real-time video target recognition model based on convolutional neural network is proposed to form a real-time video target recognition system.

The training performance of the real-time video target recognition model in this paper before and after optimization is observed in terms of precision, recall, and average precision. As training proceeds, the precision of the model in this paper steadily increases and stabilizes at about 0.848 after 50 epochs. Compared with the original model before optimization, the recall and average precision of this paper's model approach their peaks faster, at the 80th and 45th epochs respectively, indicating a better network structure and enhanced convergence ability. Introducing the P-R curve and F1 curve as supplementary evaluation indicators, the curves of the two models are almost the same and their performance is comparable, indicating that the model in this paper maintains the performance of the original model while improving the convergence speed. The difference between the F1 score curves of the two is small and the fluctuation is smooth, which further validates the reasonableness of the network structure design of the model in this paper.

To test the functional performance of the real-time video target recognition system in this paper, it is applied to the task of recognizing rabbits and young rabbits in real-time video. In the three groups of video image tests, the detection speed of this system is faster than that of the traditional system, and the total recognition time in the first, second and third groups is lower than that of the traditional system by 0.009 s, 0.059 s, and 0.0377 s, respectively. Compared with the traditional system, the system of this paper significantly improves the detection speed and works more efficiently in accomplishing the real-time video target recognition task.
