
Research on target detection and tracking algorithms in real-time video streaming

Sep 29, 2025


Introduction

With the rapid development of multimedia and network technology, video has become one of the main carriers of information dissemination and communication in modern society. Since vision is a key channel of human information perception, visual research on video has long been a hot topic [1-3]. Moving targets are generally the part of a video stream that the human eye attends to most, because image sequences and videos provide far richer motion information than still images; the analysis and study of moving targets therefore deserves to be a focus of research on image sequences and video streams [4-7].

In the whole process of understanding the behavior of people or moving objects in real-time video streams, the detection and tracking of moving targets are critical [8-9]. A video surveillance system can usually be decomposed into four functional units: target detection, which searches for the regions of interest; target tracking, which captures the motion trajectories of those regions; target classification, which categorizes a tracked target as a person, a vehicle or another moving object; and target behavior recognition, which recognizes the behavior of the tracked target [10-13]. Target detection, the prerequisite of video surveillance, is a low-level vision problem for which a variety of mature algorithms exist. Target tracking, the most basic function of video surveillance, is a mid-level vision problem and remains one of the main bottlenecks limiting the performance of current video surveillance systems [14-17]. Moving-target detection and tracking technology has broad application prospects in the military, national defense, broadcasting and television, industrial process control, medical image processing, automatic traffic monitoring, and aircraft guidance [18-21]. Moving-target detection also has important military applications such as TV guidance, whose covertness, intuitiveness and resistance to electronic interference make it extremely valuable [22-25].

Yaseen, M.U. et al. proposed a cloud-based automated video analytics system to handle large numbers of video streams; the system automates the video analytics process and reduces human intervention, and was shown to be effective while exhibiting good scalability and high classification accuracy [26]. Hatwar, R.B. et al. detailed the basic steps of the object detection process, including the identification and classification of objects in a video sequence, outlined the steps involved in tracking objects, and pointed out the wide range of applications of object tracking, including computerized video surveillance, traffic surveillance and military surveillance systems [27]. Ramasamy, D. and Shanmugavadivel, K. introduced the functionality of video surveillance systems, pointed out the challenges facing the detection, classification and tracking of moving targets in video streams, proposed a kernel support vector machine learning technique for moving-target detection and tracking, and verified that the method performs well in terms of recall and precision [28]. Funde, N. et al. aimed to shorten the response time to forensic events and reduce manual workload, and therefore designed an effective target tracking system that uses the network configuration of cameras and roads in the surveillance area, video dumps, and images of the objects to be tracked [29]. Elhoseny, M. proposed a multi-object detection and tracking (MODT) method that uses an optimal Kalman filtering technique to track moving targets across video frames, and verified that the maximum detection and tracking accuracies of MODT are both relatively high [30]. Ramasamy, D.P. and Kanagaraj, K. described how deep learning has boosted computer vision and thus advanced video processing applications, making systems smarter and more reliable, and surveyed the state of the art of target detection and visual tracking methods together with open research problems and solutions [31]. Llano, C.R., Ren, Y. and Shaikh, N.I. reviewed machine learning algorithms and techniques for human recognition and tracking in video, presented a state-space representation of the object tracking problem, and showed that the proposed method achieves tracking of human bodies in video [32]. Wangulkar, S.S. et al. presented the steps involved in tracking a target in a video sequence, including target detection, target classification and target tracking, describing various detection and tracking methods and comparing the techniques used at different tracking stages [33]. Banelas, D. and Petrakis, E.G. introduced MotionInsights and its application to video streaming; it performs object detection and tracking over multiple video streams in real time by exploiting the distributed stream-processing capabilities of Apache Flink and Apache Kafka, and detects and tracks objects in video streams faster than traditional approaches [34].

In this paper, the target characterization methods (HOG and LBP) are first elaborated in detail. In the target detection stage, a three-frame difference method incorporating the Sobel operator is designed and combined with morphological filtering to suppress noise and repair broken edges, overcoming the effects of fast target motion, illumination changes, noise and other disturbances and significantly improving detection accuracy. In the target tracking stage, the Mean Shift algorithm is improved by introducing a context-aware mechanism and optimizing the loss function, enhancing the model's ability to discriminate against occlusion and background interference.

Methods underlying target detection and tracking algorithms

Aiming at the core challenges of target detection and tracking in real-time video streaming, this chapter will systematically elaborate the basic algorithm design, including feature extraction, motion detection optimization and tracking model construction.

Characterization of target tracking
HOG Characterization

HOG features are obtained by computing and accumulating histograms of oriented gradients over image regions. The image is first divided into small connected patches (cells); the gradient is computed at each pixel of a cell, the gradient orientations are accumulated into a histogram for that cell, and the histograms of these regions are then concatenated to form the HOG descriptor. The HOG feature extraction process is shown in Figure 1 below.

Figure 1.

HOG feature extraction process

Because HOG features describe the whole image through local block-based statistics, they are insensitive to illumination-induced changes in the environment and to geometric deformations of the image. Detection results obtained with HOG features remain stable under coarse spatial sampling, fine orientation sampling and strong local grayscale gradients.
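To make the pipeline in Figure 1 concrete, the sketch below computes a HOG descriptor with scikit-image; the cell size, block size and 9 orientation bins are illustrative assumptions (the classic pedestrian-window settings), not parameters taken from this paper.

```python
import numpy as np
from skimage import color
from skimage.feature import hog

def extract_hog(frame_rgb):
    """Compute a HOG descriptor for one frame (illustrative parameters)."""
    gray = color.rgb2gray(frame_rgb)       # gradients are computed on intensity
    return hog(
        gray,
        orientations=9,                    # 9 gradient-direction bins per cell
        pixels_per_cell=(8, 8),            # cell size
        cells_per_block=(2, 2),            # blocks of 2x2 cells
        block_norm="L2-Hys",               # per-block normalization
        feature_vector=True,               # concatenate block histograms
    )

# Example: a random 128x64 "pedestrian window"
window = np.random.rand(128, 64, 3)
print(extract_hog(window).shape)           # (3780,) for these settings
```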

LBP Characterization

LBP (local binary pattern) features are a kind of hand-crafted feature obtained as follows. First, the pixel value at the center of a selected grayscale window is compared with the pixels in its surrounding neighborhood, converting the decimal pixel values into a binary representation: a neighboring pixel larger than the center pixel is recorded as 1, otherwise as 0, which yields the binary pattern associated with the center pixel. The binary pattern is then re-encoded as a decimal value according to a fixed rule and used as the new value of the center pixel, and the histogram of these values over the grayscale window is taken as the new descriptor of the target for subsequent processing.

Although the resulting binary-pattern histogram of the grayscale window captures the relationship between the target's grayscale values and its neighborhood pixels, it still loses a considerable amount of information, so the robustness of the feature is limited.
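As an illustration of the encoding just described, the following minimal NumPy sketch computes basic 3x3 LBP codes and their 256-bin histogram; the neighborhood ordering and bin count are assumptions for demonstration only.

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel
    and pack the results into an 8-bit code (border pixels are skipped)."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    # neighbour offsets in a fixed clockwise order starting at the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy: g.shape[0] - 1 + dy, 1 + dx: g.shape[1] - 1 + dx]
        code |= ((neighbour >= c).astype(np.uint8) << bit)
    return code

def lbp_histogram(gray):
    """256-bin normalized LBP histogram used as the region descriptor."""
    codes = lbp_image(gray)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)

patch = (np.random.rand(32, 32) * 255).astype(np.uint8)
print(lbp_histogram(patch).shape)   # (256,)
```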

Introduction and Improvement of Motion Target Detection Algorithm

The purpose of a moving-target detection algorithm is to enable computers and other devices to automatically detect moving targets in images or video frames and to draw bounding boxes around them. In this design, the target detection algorithm detects the coordinates of the moving target in each frame in real time, providing input for the subsequent tracking stage. To address the ghosting of fast-moving objects and the inaccurate, blurred edges produced by the plain inter-frame difference method, a three-frame difference method incorporating the Sobel operator is designed, with morphological filtering used to repair broken edges and deliver good detection results. The method is summarized as follows.

Inter-frame Difference Methods

The video stream acquired by the sensor is a sequence of consecutive single-frame images. If there is no moving object in the stream, the change between two adjacent frames is very small; if there is a moving object, there is a significant change between adjacent frames. The moving target can therefore be located by differencing two adjacent frames and processing the difference result.

Figure 2 illustrates the two-frame difference operation. Let the current frame be $F_k(x, y)$ and the previous frame be $F_{k-1}(x, y)$. Taking the absolute difference of the gray values of corresponding pixels gives the difference image $D_k(x, y)$:
$$D_k(x,y)=\left|F_k(x,y)-F_{k-1}(x,y)\right|$$

Figure 2.

Two frame difference algorithm

A binarized image $R_k(x, y)$ is obtained by applying a threshold $T$ to each pixel of the difference image $D_k(x, y)$:
$$R_k(x,y)=\begin{cases}1, & D_k(x,y)\ge T\\ 0, & D_k(x,y)<T\end{cases}$$

The coordinate region of the moving target in the image is then obtained from the binary map $R_k(x, y)$ by connected-component analysis.
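A minimal OpenCV sketch of this two-frame difference pipeline is given below; the threshold value and the minimum blob area are illustrative assumptions, not the settings used in this paper.

```python
import cv2
import numpy as np

def frame_difference(prev_gray, curr_gray, thresh=25):
    """Two-frame difference on grayscale uint8 frames: |F_k - F_{k-1}|,
    binarized with threshold T, then connected components give candidate
    moving-target regions."""
    diff = cv2.absdiff(curr_gray, prev_gray)                          # D_k(x, y)
    _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)   # R_k(x, y)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, num):                       # label 0 is the background
        x, y, w, h, area = stats[i]
        if area > 50:                             # drop tiny noise blobs
            boxes.append((x, y, w, h))
    return binary, boxes
```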

The inter-frame difference method uses two consecutive frames of the video sequence to obtain the moving target. Compared with the optical flow method and background subtraction, it has the advantages of a small computational load, insensitivity to illumination interference, high real-time performance, and easy hardware implementation. Its shortcomings are that ghosting easily appears when the target moves too fast, the detection result is affected by background interference, the edge information is incomplete, and strong illumination changes introduce noise and burrs that degrade boundary extraction.

Improved inter-frame differencing method

This design proposes an improved inter-frame difference algorithm. It uses the three-frame difference method to extract fast-moving targets and remove ghosting, and integrates Sobel edge detection with morphological filtering to obtain good boundary discrimination and denoising, which facilitates the subsequent extraction of the bounding box. The overall computation of the algorithm is shown in Figure 3.

Figure 3.

Improved frame difference algorithm

When the video image sequence arrives, the DDR caches two frames $F_{k-1}(x, y)$ and $F_k(x, y)$; as the third frame $F_{k+1}(x, y)$ enters the DDR cache, the two pairs of neighboring frames are differenced and thresholded with $T$ to obtain the difference images $D_1(x, y)$ and $D_2(x, y)$, as shown in Eqs. (3) and (4):
$$D_1(x,y)=\begin{cases}1, & \left|F_k(x,y)-F_{k-1}(x,y)\right|>T\\ 0, & \left|F_k(x,y)-F_{k-1}(x,y)\right|\le T\end{cases}$$
$$D_2(x,y)=\begin{cases}1, & \left|F_{k+1}(x,y)-F_k(x,y)\right|>T\\ 0, & \left|F_{k+1}(x,y)-F_k(x,y)\right|\le T\end{cases}$$

The two difference results $D_1(x, y)$ and $D_2(x, y)$ are combined by a logical AND to give the binary image $R_1(x, y)$, which is then ANDed with $S_1(x, y)$, the Sobel edge map of the $k$-th frame, to give $R_2(x, y)$, as shown in Eqs. (5) and (6):
$$R_1(x,y)=D_1(x,y)\wedge D_2(x,y)$$
$$R_2(x,y)=R_1(x,y)\wedge S_1(x,y)$$

A morphological filtering step is then applied to the binary result of the above steps: an opening operation filters out burr noise and the broken edges are repaired, restoring a clear target contour, and finally the bounding coordinates around the target are extracted and output.
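The following OpenCV sketch illustrates one possible implementation of the improved pipeline on grayscale uint8 frames: three-frame differencing, an AND with the Sobel edge map of the middle frame, and morphological filtering. The thresholds, the 3x3 structuring element and the extra closing step are assumptions added for illustration, not the paper's exact settings.

```python
import cv2
import numpy as np

def improved_three_frame_diff(f_prev, f_curr, f_next, thresh=25):
    """Three-frame difference fused with Sobel edges and morphological filtering."""
    # Two neighbouring difference maps D1, D2 (Eqs. (3) and (4))
    d1 = cv2.threshold(cv2.absdiff(f_curr, f_prev), thresh, 255, cv2.THRESH_BINARY)[1]
    d2 = cv2.threshold(cv2.absdiff(f_next, f_curr), thresh, 255, cv2.THRESH_BINARY)[1]
    r1 = cv2.bitwise_and(d1, d2)                 # R1 = D1 AND D2, removes ghosting

    # Sobel edge map S1 of the middle frame, binarized
    gx = cv2.Sobel(f_curr, cv2.CV_16S, 1, 0, ksize=3)
    gy = cv2.Sobel(f_curr, cv2.CV_16S, 0, 1, ksize=3)
    mag = cv2.addWeighted(cv2.convertScaleAbs(gx), 0.5,
                          cv2.convertScaleAbs(gy), 0.5, 0)
    s1 = cv2.threshold(mag, 40, 255, cv2.THRESH_BINARY)[1]

    r2 = cv2.bitwise_and(r1, s1)                 # R2 = R1 AND S1 (Eq. (6))

    # Opening filters burr noise; a closing (added here as an assumption)
    # helps bridge broken edges before the bounding box is extracted.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    cleaned = cv2.morphologyEx(r2, cv2.MORPH_OPEN, kernel)
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
    return cleaned
```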

Mean Shift based target tracking algorithm

In a vision-based target tracking task for a given scene, the mathematical representation of the background or target is crucial, and the problem is essentially one of modeling. A well-established approach is to model the target statistically. There are currently two main ways of describing and modeling the target via the probability density distribution of features: parametric density estimation and nonparametric density estimation. Parametric density estimation requires a priori knowledge of the probability density function, which is difficult to obtain in practice, whereas nonparametric density estimation estimates the density of the whole population directly from the sample data and then uses the estimated density distribution to describe the data.

The Mean Shift algorithm is a nonparametric kernel density estimation algorithm. Kernel density estimation is one of the most commonly used and effective nonparametric density estimation methods, and its principle is similar to the histogram technique: the value range of a set of sampled data is divided into a number of equal intervals (bins), the data are grouped by bin, and the probability value of each bin is the ratio of the number of samples in that group to the total number of samples. The difference from the histogram method is that kernel density estimation smooths the data with a kernel function. Kernel density estimation is well suited to situations where the sample size is relatively small; for small to medium sized datasets it quickly produces an asymptotically unbiased density estimate.

Kernel estimation

Let the window radius (kernel bandwidth) be $h$. In $d$ dimensions, the kernel density estimate at sample $x$ is:
$$\hat f(x)=\frac{1}{nh^d}\sum_{i=1}^{n}K\!\left(\frac{x-x_i}{h}\right)$$

where $K(x)$ is a kernel function whose profile is a function $k:[0,\infty)\to\mathbb{R}$ such that $K(x)=k\!\left(\|x\|^2\right)$.

A commonly used kernel is the Epanechnikov kernel, which is optimal in the sense of minimizing the mean integrated squared error (MISE):
$$K_E(x)=\begin{cases}\dfrac{1}{2}c_d^{-1}(d+2)\left(1-\|x\|^2\right), & \|x\|<1\\[4pt] 0, & \text{otherwise}\end{cases}$$

Another commonly used kernel function is the multivariate normal kernel:
$$K_N(x)=(2\pi)^{-d/2}\exp\!\left(-\frac{1}{2}\|x\|^2\right)$$

If the multivariate kernel density estimate is expressed using the profile of the kernel function, Eq. (7) can be written as:
$$\hat f_{h,K}(x)=\frac{1}{nh^d}\sum_{i=1}^{n}k\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)$$

The kernel density gradient of the sample is then estimated from the above equation as:
$$\hat\nabla f_{h,K}(x)=\nabla\hat f_{h,K}(x)=\frac{2}{nh^{d+2}}\sum_{i=1}^{n}(x-x_i)\,k'\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)$$

Let $g(x)=-k'(x)$, and assume that the profile $k(x)$ is differentiable for all $x\in[0,\infty)$ except a finite number of points. Taking $g(x)$ as a profile, a kernel $G(x)$ can be defined as $G(x)=C\,g\!\left(\|x\|^2\right)$, where $C$ is a normalization factor. Substituting $g(x)$ into Eq. (11) gives:
$$\hat\nabla f_{h,K}(x)=\frac{2C}{nh^{d+2}}\sum_{i=1}^{n}(x_i-x)\,g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)
=\frac{2C}{nh^{d+2}}\left[\sum_{i=1}^{n}g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)\right]\left[\frac{\sum_{i=1}^{n}x_i\,g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n}g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}-x\right]$$

The above expression contains two terms: the factor in front of the second bracket and the term inside the second bracket. Referring to Eq. (10), it is easy to see that the first factor is (up to a constant) the kernel density estimate at point $x$ based on the kernel $G(x)$:
$$\hat f_{h,G}(x)=\frac{C}{nh^d}\sum_{i=1}^{n}g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)$$

The second term is the Mean Shift vector, denoted:
$$m_{h,G}(x)=\frac{\sum_{i=1}^{n}x_i\,g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n}g\!\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}-x$$

Rearranging Eq. (14) yields:
$$m_{h,G}(x)=\frac{1}{2}h^2C\,\frac{\hat\nabla f_{h,K}(x)}{\hat f_{h,G}(x)}$$
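The following NumPy sketch illustrates the iteration implied by Eq. (15): repeatedly adding the Mean Shift vector $m_{h,G}(x)$ moves $x$ toward a mode of the estimated density. The flat profile $g$ (the one induced by the Epanechnikov kernel) and the toy data are assumptions for illustration.

```python
import numpy as np

def mean_shift_vector(x, samples, h):
    """Mean Shift vector m_{h,G}(x) of Eq. (14) with a flat profile g:
    the shift is the mean of the samples inside the window minus x."""
    d2 = np.sum(((samples - x) / h) ** 2, axis=1)   # ||(x - x_i)/h||^2
    g = (d2 < 1.0).astype(float)                    # g(t) = 1 for t < 1, else 0
    if g.sum() == 0:
        return np.zeros_like(x)
    return (samples * g[:, None]).sum(axis=0) / g.sum() - x

def mean_shift_mode(x0, samples, h=1.0, tol=1e-3, max_iter=100):
    """Iterate x <- x + m_{h,G}(x) until the shift is below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        m = mean_shift_vector(x, samples, h)
        x = x + m
        if np.linalg.norm(m) < tol:
            break
    return x

# Toy example: samples clustered around (2, 3); the iteration converges there.
pts = np.random.randn(500, 2) * 0.3 + np.array([2.0, 3.0])
print(mean_shift_mode(np.array([0.0, 0.0]), pts, h=5.0))
```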

Representation of the target model

In the Mean Shift tracking algorithm the target is usually represented by a color histogram, although edges, texture, shape, or combinations of them can also be used. The target model comprises the target model in the current frame and the candidate target model in the next frame. The target model in the current frame is represented by the probability density function $\hat q=\{\hat q_u\}_{u=1,\dots,m}$ and the candidate target model by $\hat p(y)=\{\hat p_u(y)\}_{u=1,\dots,m}$.

Let $\{x_i^*\}_{i=1,\dots,n}$ denote all pixel locations in the target region, normalized so that the target center lies at the origin, and let $k(x)$ be the profile of the kernel function. Each pixel in the region is assigned a weight: the closer a point is to the center, the more reliable it is and the higher its weight; conversely, points far from the center are less reliable and receive lower weights.

Define the function $b:\mathbb{R}^2\to\{1,\dots,m\}$ that quantizes the pixels of the target into $m$ bins in the feature space. For each feature index $u=1,2,\dots,m$, the target model can then be expressed as:
$$\hat q_u=C\sum_{i=1}^{n}k\!\left(\left\|x_i^*\right\|^2\right)\delta\!\left[b(x_i^*)-u\right]$$

where $\delta$ is the Kronecker delta function and $C$ is the normalization constant ensuring $\sum_{u=1}^{m}\hat q_u=1$, i.e.
$$C=\frac{1}{\sum_{i=1}^{n}k\!\left(\left\|x_i^*\right\|^2\right)}$$

Similarly, the candidate target model centered at $y$ can be represented as:
$$\hat p_u(y)=C_h\sum_{i=1}^{n_h}k\!\left(\left\|\frac{y-x_i}{h}\right\|^2\right)\delta\!\left[b(x_i)-u\right]$$

where $C_h=\dfrac{1}{\sum_{i=1}^{n_h}k\!\left(\left\|\frac{y-x_i}{h}\right\|^2\right)}$, $h$ is the bandwidth, and $\hat p_u(y)$ is the probability of the $u$-th feature in the candidate model $\hat p(y)$.
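The sketch below builds the kernel-weighted histogram of Eqs. (16)-(18) for an image patch; the Epanechnikov profile, the 16-bin quantization and the use of grayscale rather than color are simplifying assumptions for illustration.

```python
import numpy as np

def epanechnikov_profile(t):
    """k(t) = 1 - t for t < 1, 0 otherwise (up to a constant factor)."""
    return np.where(t < 1.0, 1.0 - t, 0.0)

def kernel_histogram(patch_gray, m=16):
    """Kernel-weighted histogram model: pixels are quantized into m bins by b(.),
    and each pixel is weighted by k(||x*||^2), where x* is its location normalized
    to the patch centre. Grayscale bins stand in for the color histogram here."""
    h_, w_ = patch_gray.shape
    ys, xs = np.mgrid[0:h_, 0:w_]
    # normalized coordinates with the patch centre as origin
    nx = (xs - (w_ - 1) / 2.0) / (w_ / 2.0)
    ny = (ys - (h_ - 1) / 2.0) / (h_ / 2.0)
    k = epanechnikov_profile(nx ** 2 + ny ** 2)

    bins = patch_gray.astype(np.int32) * m // 256     # b(x_i) in {0, ..., m-1}
    q = np.zeros(m)
    np.add.at(q, bins.ravel(), k.ravel())             # sum of kernel weights per bin
    return q / max(q.sum(), 1e-12)                    # normalize so sum_u q_u = 1

patch = (np.random.rand(40, 40) * 255).astype(np.uint8)
print(kernel_histogram(patch))                        # q_hat, sums to 1
```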

Similarity function

The similarity function describes the degree of similarity between the target model and the candidate model; ideally the probability distributions of the two models coincide exactly. The Bhattacharyya coefficient is a divergence-type measure whose direct geometric interpretation is the cosine of the angle between two vectors, and it is used to define the distance between the target model and the candidate model:
$$d(y)=\sqrt{1-\rho\!\left[\hat p(y),\hat q\right]}$$

where
$$\rho(y)=\rho\!\left[\hat p(y),\hat q\right]=\sum_{u=1}^{m}\sqrt{\hat p_u(y)\,\hat q_u}$$

Target localization

To locate the target in the current frame, the distance between the candidate model and the target model in Eq. (19) should be minimized, i.e. the Bhattacharyya coefficient of the two should be maximized. To maximize $\rho(y)$, the search for the target center in the current frame starts from the position $y_0$ of the target center in the previous frame: the candidate model of the region centered at $y_0$ is computed first, and a Taylor expansion around $\hat p_u(y_0)$ gives the approximation:
$$\rho\!\left[\hat p(y),\hat q\right]\approx\frac{1}{2}\sum_{u=1}^{m}\sqrt{\hat p_u(y_0)\,\hat q_u}+\frac{1}{2}\sum_{u=1}^{m}\hat p_u(y)\sqrt{\frac{\hat q_u}{\hat p_u(y_0)}}$$

In the above expression, the first term on the right-hand side is a constant and the second term is a function of $y$, so maximizing $\rho\!\left[\hat p(y),\hat q\right]$ amounts to maximizing the second term. Substituting Eq. (18) into the above expression, the Bhattacharyya coefficient can be written as:
$$\rho\!\left[\hat p(y),\hat q\right]\approx\frac{1}{2}\sum_{u=1}^{m}\sqrt{\hat p_u(y_0)\,\hat q_u}+\frac{C_h}{2}\sum_{i=1}^{n_h}w_i\,k\!\left(\left\|\frac{y-x_i}{h}\right\|^2\right)$$

where:
$$w_i=\sum_{u=1}^{m}\sqrt{\frac{\hat q_u}{\hat p_u(y_0)}}\,\delta\!\left[b(x_i)-u\right]$$

Taking the second term of Eq. (21) to a local maximum, i.e. setting its derivative to zero, gives the new target position:
$$y_1=\frac{\sum_{i=1}^{n_h}x_i\,w_i\,g\!\left(\left\|\frac{y_0-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n_h}w_i\,g\!\left(\left\|\frac{y_0-x_i}{h}\right\|^2\right)}$$
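Putting Eqs. (20)-(23) together, the sketch below performs Mean Shift localization using the kernel_histogram() helper defined earlier; the window size, bin count, stopping threshold and the flat profile $g$ matching the Epanechnikov kernel are illustrative assumptions, and the window is assumed to stay inside the frame.

```python
import numpy as np

def mean_shift_localize(frame_gray, q_hat, y0, win=(40, 40), m=16,
                        max_iter=20, eps=1.0):
    """Iterate the window centre: compute the candidate histogram at y0,
    back-project the weights w_i (Eq. (22)), and move the centre to the
    weighted mean y1 (Eq. (23)) until the shift is below eps pixels."""
    h_, w_ = win
    y = np.array(y0, dtype=float)                     # (row, col) centre
    for _ in range(max_iter):
        r0, c0 = int(y[0] - h_ // 2), int(y[1] - w_ // 2)
        patch = frame_gray[r0:r0 + h_, c0:c0 + w_]
        p_hat = kernel_histogram(patch, m)            # candidate model p_hat(y0)

        # w_i = sqrt(q_u / p_u(y0)) for the bin u that pixel i falls into
        bins = patch.astype(np.int32) * m // 256
        w = np.sqrt(q_hat[bins] / np.maximum(p_hat[bins], 1e-12))

        ys, xs = np.mgrid[0:h_, 0:w_]
        coords = np.stack([ys + r0, xs + c0], axis=-1).astype(float)
        # y1 = weighted mean of pixel coordinates (flat profile g)
        y1 = (coords * w[..., None]).sum(axis=(0, 1)) / w.sum()

        if np.linalg.norm(y1 - y) < eps:
            y = y1
            break
        y = y1
    return y
```

In use, q_hat would be computed once with kernel_histogram() on the target patch of the first frame and then passed to mean_shift_localize() for every subsequent frame, with the returned position serving as the next frame's starting centre.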

Experimental design and result analysis of target detection and tracking algorithms

In order to verify the effectiveness of the above algorithms, Chapter 3 quantitatively analyzes them in terms of the dimensions of detection accuracy, tracking robustness, and real-time performance by means of multiple sets of comparative experiments.

Experiments based on the improved inter-frame difference method
Comparison of Matching Correctness

In order to verify the feature point matching effect of this paper’s algorithm, quantitative comparison experiments are conducted using four methods, namely, SURF+FLANN, SURF+Bidirectional FLANN, SURF+Bidirectional FLANN+RANSAC, SURF+Bidirectional FLANN + RANSAC based on parallax-gradient constraints, respectively.

The matching correct rate is a key indicator of feature-descriptor matching performance. In this paper the matching correct rate directly affects the quality of background compensation, which in turn affects subsequent target detection. It is calculated as:
$$P=\frac{TP}{TP+FP}$$

where $TP$ is the number of correctly matched feature point pairs and $FP$ is the number of mismatched pairs. Figure 4 shows the matching accuracy of SURF+FLANN, SURF+bidirectional FLANN, SURF+bidirectional FLANN+RANSAC and the method in this paper over three sets of experiments; experiments 1 to 3 use the Highway, Road and Highway2 sequences. The three experiments contain rich background content and different shooting angles, and the UAV is in different motion states during shooting.

Figure 4.

Comparison of matching accuracy of different algorithms

From the experimental results, it can be seen that the SURF+FLANN method matches more pairs of feature points, but there are more wrongly matched feature points, and the correct matching rate is lower. After adding two-way matching, the two images are used as reference images for feature matching, which obviously improves the correct rate of matching, but there are still a large number of false matches in Experiment 1. After adding the RANSAC algorithm to re-match the feature point pairs, basically no obvious mis-matched point pairs can be seen, and the correct rate of matching is significantly improved. Finally, after the purification of feature pairs in this paper, the number of pairs is reduced again and the correct matching rate is improved. In Highway’s Experiment 1, the number of matched pairs is 296, and the number of correctly matched pairs is 274, and the matching accuracy rate reaches 92.57%. The experimental results show that the method in this paper has good stability and anti-interference ability in matching feature points of aerial images.

Comparison of quantitative experiments

To verify the accuracy of the improved inter-frame difference algorithm in detecting moving targets, quantitative comparison experiments are conducted in multiple scenes with several methods, and the detection accuracy of the different algorithms is objectively analyzed with three indicators: the recall $Re$, the precision $P$, and the comprehensive score F-measure. $Re$ and F-measure are calculated as:
$$Re=\frac{TP}{TP+FN}$$
$$F\text{-}measure=\frac{2\times P\times Re}{P+Re}$$

where $TP$ denotes the number of moving-target pixels that are correctly detected, and $FN$ denotes the number of moving-target pixels incorrectly detected as background.
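For clarity, the precision, recall and F-measure defined above can be computed from the pixel counts as in the brief sketch below; the example counts are made up.

```python
def detection_metrics(tp, fp, fn):
    """Pixel-level precision P, recall Re and F-measure defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# e.g. 9200 correctly detected foreground pixels, 480 false alarms, 620 misses
print(detection_metrics(9200, 480, 620))
```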

Three video sequences in the dataset are compared experimentally using five methods. Experiments 1 to 3 correspond to the Highway, Road and Highway2 sequences, respectively. The five methods are the traditional three-frame difference method before improvement, the traditional mixture-of-Gaussians algorithm, the optical flow method, background subtraction, and the algorithm in this paper. To account for the interference that the dynamic background generated by UAV motion causes for moving-target detection, the comparison results for the three sequences are shown in Fig. 5, Fig. 6 and Fig. 7.

Figure 5.

Accuracy comparison of 5 detection methods under Highway

Figure 6.

Accuracy comparison of 5 detection methods under Road

Figure 7.

Accuracy comparison of 5 detection methods under Highway2

As can be seen from the figures, across the three scenes the traditional three-frame difference method has the lowest F-measure, the traditional mixture-of-Gaussians algorithm improves on it, and the optical flow and background subtraction methods further improve recall, precision and overall score relative to the previous two algorithms. The improved inter-frame difference algorithm incorporating the Sobel operator proposed in this paper achieves the highest F-measure in all three scenes, namely 93.38%, 96.20% and 91.83%, respectively. It greatly improves the completeness and accuracy of target detection and has a clear detection advantage.

Mean Shift based target tracking algorithm test
Dataset and parameter environment

In this paper, two public datasets, UAV123 and VOT2024, are selected to provide vehicle motion videos for training and to validate the success rate and accuracy of the proposed tracking algorithm. The 26 video sequences in UAV123 were captured by UAV and contain various influences such as scale variation and motion blur. VOT2024 contains 30 vehicle video sequences with manually labeled target bounding boxes, 63,422 in total. The images in the dataset are 1280 pixels × 720 pixels, and each frame carries four labeled values representing the horizontal and vertical coordinates of the vehicle's center point and the width and height of the bounding box in the image.

The proposed algorithm is implemented on Ubuntu 16.04 using Python 3.7, CUDA 10.1 and the PyTorch 1.5.0 framework. The experiments were carried out on an Intel(R) Core(TM) i7-8750H CPU @ 3.60 GHz and an NVIDIA RTX 2070 graphics card.

Experimental parameters and evaluation indicators

During training the dataset is preprocessed and the modified Res2Net-50 feature extraction network is initialized. The learning rate decays exponentially from 0.01 to 0.00001, stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005 is used to reduce the training loss, and the parameters of the attention mechanism are randomly initialized. The number of training iterations is 100 and the minimum batch size per iteration is 8.
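A minimal PyTorch sketch of this training configuration is shown below; the placeholder network and the per-epoch stepping of the scheduler are assumptions, with the decay factor chosen so that the learning rate falls from 0.01 to 1e-5 over 100 epochs.

```python
import torch

# Placeholder network standing in for the modified Res2Net-50 backbone.
model = torch.nn.Linear(256, 10)
epochs = 100

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# Exponential decay from 1e-2 down to 1e-5 over the training run:
gamma = (1e-5 / 1e-2) ** (1.0 / epochs)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(epochs):
    # ... forward pass, loss.backward(), optimizer.step() on mini-batches of 8 ...
    optimizer.step()        # stands in for one training epoch
    scheduler.step()        # decay the learning rate once per epoch
```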

In this paper, tracking accuracy and average success rate, which are standard metrics for target tracking algorithms, are used for evaluation, and accuracy and success-rate plots are drawn. Tracking accuracy is defined via the distance between the target center estimated by the tracker and the labeled ground-truth center: it is the percentage of frames in which this distance is below a given threshold, here set to 20 pixels. The distance is the Euclidean distance between the center coordinates:
$$d_{CLE}=\sqrt{(x_i-x_g)^2+(y_i-y_g)^2}$$

where $(x_i, y_i)$ denotes the coordinates of the target center predicted by the tracking algorithm and $(x_g, y_g)$ the true center coordinates of the target. The success rate is based on the average overlap score (AOS): the intersection over union between the bounding box produced by the tracker and the ground-truth bounding box is taken as the AOS. When the AOS is greater than or equal to a set threshold, the target is tracked successfully in the current frame, otherwise tracking fails. The threshold is set to 0.5, and the success rate is the percentage of successful frames over the total number of frames.
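A small sketch of the two evaluation quantities, the center location error and the overlap (IoU) score used for the success rate, is given below; boxes are assumed to be in (x, y, w, h) form.

```python
import numpy as np

def center_location_error(pred_center, gt_center):
    """Euclidean distance d_CLE between predicted and ground-truth centres;
    a frame counts toward the accuracy when d_CLE < 20 pixels."""
    return float(np.linalg.norm(np.asarray(pred_center) - np.asarray(gt_center)))

def overlap_score(box_a, box_b):
    """Intersection over union (AOS) of two (x, y, w, h) boxes;
    a frame is a tracking success when the score is >= 0.5."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(center_location_error((120, 80), (126, 88)))        # 10.0
print(overlap_score((10, 10, 40, 40), (20, 20, 40, 40)))  # ~0.39
```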

Qualitative analysis

To further verify the effectiveness and accuracy of the proposed Mean Shift-based target tracking algorithm, it is compared qualitatively on the UAV123 and VOT2024 datasets against the mainstream target tracking algorithms SiamFC, Siam-Res2Net, Siam-AM, Siam-Tri and SiamRPN. The accuracy and success rate on the two datasets are shown in Fig. 8, Fig. 9, Fig. 10 and Fig. 11, respectively.

Figure 8.

Accuracy of different algorithms in the UAV123 data set

Figure 9.

Success rate (AOS) of different algorithms in the UAV123 data set

Figure 10.

Accuracy of different algorithms in the VOT2024 data set

Figure 11.

Success rate (AOS) of different algorithms in the VOT2024 data set

As can be seen in Fig. 8, Fig. 9, Fig. 10 and Fig. 11, Mean Shift has higher accuracy and success rate than the other five target tracking algorithms, SiamFC, Siam-Res2Net, Siam-AM, Siam-Tri, and SiamRPN, on the two datasets, UAV123 and VOT2024. Compared to the SiamFC algorithm, SiamRPN incorporates RPN after the feature extraction network to classify and regress the template image and the search image, respectively, and improves the accuracy by 3.02% and 4.62% on the two datasets, while the success rate is also improved by 0.53% and 1.14%, respectively.

The Mean Shift algorithm proposed in this paper draws on a distractor-aware model and uses contextual information to enhance the discriminative ability of the model; compared with the best of the other five models, the SiamRPN algorithm, it still improves accuracy by 2.67% and 3.09% on UAV123 and VOT2024, while the success rates improve by 0.98% and 1.75%. It also combines a dual attention mechanism with an improved loss function, achieving better tracking results while reaching real-time tracking speed.

Comparison of integrated detection and tracking models
Quantitative comparison of models

In this experiment, mean average precision (MAP), recall and frames per second (FPS) are used as evaluation metrics for the vehicle detection algorithms. The average precision of each vehicle category is denoted AP, and the number of frames that can be processed per second is denoted FPS. The mean average precision is the mean of the per-class AP values,
$$\text{MAP}=\frac{1}{N}\sum_{c=1}^{N}AP_c,$$
where each $AP_c$ is the area under the precision-recall curve of class $c$.
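As a sketch of these metrics, the following NumPy code computes AP as the area under a precision-recall curve (all-point interpolation, a common convention assumed here rather than taken from the paper) and MAP as the mean over the vehicle classes.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve: the precision is made
    monotonically decreasing (its envelope), then integrated over recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])                    # envelope of the PR curve
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """MAP = mean of the per-class AP values (Car, Bus, Van, Others here)."""
    return float(np.mean(ap_per_class))

rec = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
prec = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
print(average_precision(rec, prec))                          # 0.8
print(mean_average_precision([0.9662, 0.9428, 0.9468, 0.9322]))  # ~0.947
```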

In order to verify the effectiveness of the improved algorithm in this paper, we continue to test it on the UAV123 dataset, and conduct comparative experiments on the designed anchor box, the replaced feature extraction network, and the optimized loss function, respectively, and the evaluation indexes are shown in Table 1.

Table 1. Comparison of the improved inter-frame difference method on the UAV123 dataset

Detection method  Precision% (Car, Bus, Van, Others)  MAP%  Recall%  FPS (f/s)
Interframe difference method 86.09 86.01 84.37 84.87 85.34 84.68 24
Recluster the anchor box only 87.27 88.98 86.36 85.54 87.04 86.53 29
Only after improving the extraction network 89.56 87.88 88.36 86.41 88.05 86.41 25
Only after improving the loss function 90.15 87.54 91.29 87.09 89.02 87.96 30
Improved interframe difference method+Mean shift 96.62 94.28 94.68 93.22 94.70 89.92 32

As can be seen from Table 1, the re-clustered anchor boxes are more favorable for vehicle bounding-box regression and further improve detection precision and recall, while the reduced number of network parameters improves detection speed relative to the original inter-frame difference method. The improved loss function raises the intersection over union between the predicted and ground-truth boxes, balances the positive and negative samples, and makes the model focus on hard-to-classify samples, which benefits vehicle classification accuracy. Combining the Mean Shift tracking network with the inter-frame difference method that uses anchor-box re-clustering, the replaced feature extraction network and the optimized loss function, the improved algorithm fused with the Sobel operator reaches an accuracy of 94.7%, a recall of 89.92% and a frame rate of 32 f/s, all higher than the original inter-frame difference method and than the variants that only re-cluster the anchor boxes, only improve the extraction network or only improve the loss function. At the same time, each of the individual improvements to the anchor boxes, the feature extraction network and the loss function exceeds the original inter-frame difference algorithm in accuracy, recall and frame rate.

Comparison of different target detection algorithms

To further verify the accuracy and practicality of the algorithm, the mainstream detection algorithms Faster-RCNN, SSD-512, YOLOv3, YOLOx and YOLOv7 are compared on the test set, and, to better assess the performance of the improved algorithm in this paper, comparisons of parameter counts and GFLOPs computation are added; the experimental results are shown in Table 2.

Table 2. Comparison of results of different target detection algorithms

Detection method  Precision% (Car, Bus, Van, Others)  MAP%  Recall%  FPS (f/s)  Parameter quantity/M  GFLOPs/G
Faster-RCNN 89.63 88.36 90.88 88.21 89.27 85.96 11 51.1 124.1
SSD-512 87.84 86.43 85.64 87.7 86.90 84.24 15 47.6 129.7
YOLOv3 84.36 85.06 85.61 86.07 85.28 83.32 8 48.7 109.7
YOLOx 85.97 86.81 86.72 85.10 86.15 82.56 21 38.1 114.28
YOLOv7 87.18 88.58 87.93 89.52 88.30 86.36 23 49.8 117.3
OURS 96.62 94.28 94.68 93.22 94.70 89.92 32 37.4 112.9

As can be seen from Table 2, the algorithm proposed in this paper obtains higher MAP, recall and frame rate than Faster-RCNN, SSD-512, YOLOv3, YOLOx and YOLOv7 while satisfying real-time requirements. This is mainly because this paper fuses anchor boxes re-clustered on the UAV123 dataset, the EfficientNet feature extraction network and the optimized loss function into a multi-scale vehicle detection algorithm. The more efficient EfficientNet backbone reduces the number of network parameters and the computation, and enhancing the feature information of smaller vehicles reduces the impact of shallow network features on small-size vehicle localization, so that vehicle detection precision is improved under the real-time requirement. The algorithm in this paper achieves an average precision of 94.70%, a recall of 89.92% and a frame rate of 32 f/s with 37.4M parameters and 112.9G GFLOPs of computation.

Research on correlation of multi-target tracking data

Based on the validation of single-target tracking, Chapter 4 further extends to multi-target scenarios to explore the optimization effect of data association strategies on complex tracking tasks.

Experimental dataset and experimental setup

This section demonstrates the effectiveness of the proposed optimization strategy for the tracking optimization module of this paper. To meet the requirements of real traffic scenarios, additional multi-object tracking experiments are performed on the VisDrone2019 dataset, whose tracking results are manually labeled with high quality for every frame. The dataset contains five object classes: pedestrians, cars, vans, buses and trucks. The default training and validation splits of the dataset are used, but pedestrian objects are excluded and only the four vehicle classes are kept for the related experiments.

Because the poor performance of a self-built server would result in long training times and its memory would limit the model batch size, the experiments in this chapter use a rented cloud server; the relevant server configuration is shown in Table 3.

Table 3. Cloud server configuration

Module Model/Version
CPU 4 Cores
Internal memory 32GB
Graphics card Tesla V100
Operating system Ubuntu
Python 3.8.15
CUDA 11.8
CuDnn 8.2.0
Evaluation indicators

To demonstrate the effectiveness of the overall framework of the algorithm, comparative experiments are also conducted on the MOT16 dataset. The evaluation metrics are accuracy, precision, recall, total false alarms (FP), total missed detections (FN), the proportion of correctly tracked target trajectories (MT), the proportion of lost target trajectories (ML), multi-object tracking accuracy (MOTA), the proportion of detected targets that acquire the correct identity (IDF1), and the number of identity switches (IDSW). Together these metrics reflect the performance of a tracking algorithm in terms of accuracy, robustness and identity preservation. MOTA and IDF1 are calculated as follows:
$$MOTA=1-\frac{FN+FP+IDSW}{GT}$$
$$IDF1=\frac{2}{\frac{1}{IDP}+\frac{1}{IDR}}$$

where $GT$ refers to the number of ground-truth objects, $IDP$ is the precision of identity tracking and $IDR$ is the recall of identity tracking. The IDF1 metric focuses on how long the tracking algorithm follows a particular target, examining the continuity of tracking and the accuracy of re-identification. The best possible IDF1 is 1, and a higher value indicates better accuracy in tracking a particular target.
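The two formulas can be evaluated directly from the accumulated counts, as in the brief sketch below; the example numbers are hypothetical.

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with GT the number of
    ground-truth objects over all frames."""
    return 1.0 - (fn + fp + idsw) / gt

def idf1(idp, idr):
    """IDF1 is the harmonic mean of ID precision (IDP) and ID recall (IDR)."""
    return 2.0 / (1.0 / idp + 1.0 / idr) if idp > 0 and idr > 0 else 0.0

# Hypothetical counts for one sequence
print(mota(fn=1200, fp=300, idsw=50, gt=10000))   # 0.845
print(idf1(idp=0.82, idr=0.78))                   # ~0.80
```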

MOT16 experimental results

For the MOT16 dataset, the evaluation metrics are accuracy, precision, recall, MT, ML, MOTA, IDF1 and IDSW. The two-stage baselines DeepSORT2, RAR16, TAP, CNNMTT and POI, all using the same RCNN detector, are selected for comparison, together with the industry-advanced JDE, TubeTK and CTracker algorithms. The comparison of the proposed model with these multi-object tracking algorithms on the MOT16 dataset is shown in Table 4.

Table 4. MOT16 comparison experiment

Method Accuracy Precision Recall MT ML MOTA IDF1 IDSW
DeepSORT2 95.37% 93.72% 94.54% 35.21% 18.18% 62.53 61.20 894
JDE 94.15% 93.89% 94.02% 35.72% 17.21% 63.39 67.36 1022
RAR16 87.14% 85.18% 86.15% 33.76% 18.62% 61.22 64.33 560
TAP 92.14% 90.73% 91.43% 37.84% 19.97% 65.50 67.31 839
CNNMTT 94.59% 93.41% 94.00% 37.59% 20.74% 68.06 71.49 1106
POI 94.48% 94.12% 94.30% 38.74% 21.42% 63.42 70.86 699
TubeTK 95.39% 95.37% 95.11% 39.20% 17.48% 69.47 71.39 1185
CTracker 96.18% 96.18% 96.37% 40.18% 17.84% 71.30 72.05 1244
OURS 97.57% 98.82% 98.19% 43.86% 16.62% 75.76 73.37 1376

From the comparison results, all indicators improve further after the improved data association strategy is used. On the MOT16 dataset, the accuracy, precision and recall of this paper's algorithm are 97.57%, 98.82% and 98.19%, respectively, significantly better than the other algorithms. The proportions of correctly tracked and lost target trajectories are 43.86% and 16.62%, the MOTA and IDF1 are 75.76 and 73.37, and the number of identity switches is 1376. Overall the method outperforms the other eight algorithms, further demonstrating the effectiveness of the overall framework of the joint detection and tracking algorithm.

Results of the MOT17 experiment

For the MOT17 dataset, two additional metrics, FP and FN, are evaluated on top of the metrics used for the MOT16 dataset. The comparison of the final model with the same eight algorithms on the MOT17 dataset is shown in Table 5.

Table 5. MOT17 comparison experiment

Method Accuracy Precision Recall MT ML MOTA IDF1 IDSW FP FN
DeepSORT2 96.76% 94.62% 95.43% 38.34% 17.45% 64.78 63.56 1534 22746 143976
JDE 95.55% 94.65% 93.49% 39.03% 16.34% 65.43 68.40 1735 22420 193791
RAR16 90.56% 87.25% 88.43% 36.51% 17.35% 62.90 64.78 1575 19527 135461
TAP 94.50% 92.07% 93.01% 38.52% 17.43% 66.34 68.50 1903 18811 183747
CNNMTT 95.43% 94.20% 93.94% 39.07% 19.34% 68.96 75.63 2154 15345 184952
POI 95.32% 95.38% 95.30% 40.39% 21.07% 65.04 72.11 1865 22966 167528
TubeTK 96.03% 96.12% 96.09% 43.56% 16.40% 71.33 73.75 2287 16778 168269
CTracker 96.77% 96.84% 97.14% 45.43% 15.45% 73.986 72.94 2335 15549 149950
OURS 98.91% 99.43% 99.52% 49.45% 14.34% 80.61 75.37 3408 10184 129203

On the MOT17 dataset, the model's performance improves further after the improved data association strategy is used, and the improvement is even larger than in the MOT16 comparison. The accuracy, precision and recall reach 98.91%, 99.43% and 99.52%, MT and ML are 49.45% and 14.34%, MOTA and IDF1 are 80.61 and 75.37, the number of identity switches is 3408, and the total numbers of false alarms and missed detections are 10184 and 129203, respectively, again significantly better than the other algorithms.

Conclusion

In this paper, for the problem of target detection and tracking in real-time video streams, a three-frame difference method fused with the Sobel operator and an improved Mean Shift tracking algorithm are proposed. The improved detection method achieves the highest comprehensive F-measure on the Highway, Road and Highway2 sequences, up to 96.20%, and the detection speed is raised to 32 f/s. The tracking algorithm achieves higher accuracy and success rate than the mainstream models on the UAV123 and VOT2024 datasets. The optimized network has 37.4M parameters and 112.9G GFLOPs of computation, more efficient than YOLOv7 (49.8M, 117.3G), while maintaining 94.70% MAP and 89.92% recall. In addition, the multi-object tracking experiments achieve MOTA values of 75.76 and 80.61 on the MOT16 and MOT17 datasets, respectively, and the IDF1 values rise to 73.37 and 75.37, validating the comprehensive performance of the algorithm in complex scenes. The proposed framework, which integrates the improved inter-frame difference method and Mean Shift, outperforms mainstream methods in detection accuracy, tracking robustness and real-time performance, and provides efficient and reliable technical support for real-world scenarios such as intelligent surveillance and autonomous driving.
