Applying Deep Learning Networks to Identify Optimized Paths in Gymnastic Movement Techniques

A deep learning network is a type of artificial neural network that has been widely applied in recent years. It stacks multiple layers of artificial neurons to extract high-level abstract features from complex data. Deep learning networks resemble traditional artificial neural networks but differ in key respects: traditional networks usually have only one or two layers, relatively few parameters, and simple structures, whereas deep learning networks can contain tens of layers or more, with a very large number of parameters and considerable structural complexity [1-2]. This architecture lets deep learning networks perform better on large-scale, high-dimensional data, which makes them suitable for recognizing gymnastic movements.

For a long time, the diversity and complexity of gymnastics movements have made judging artistic gymnastics difficult. In competition, judges score against established standards, which requires them to sustain high concentration, accurate judgment, and physical stamina over long periods. Finding a more effective way to assist judging is therefore significant for both sports events and the daily teaching of artistic gymnastics. Action recognition is an important task in computer vision, in which the actions of a human body or an object are identified by analysing video streams or image sequences. Deep learning is a powerful technique that has been widely used to solve computer vision problems, including gymnastics action recognition [3-6].

Literature [7] proposes a movement evaluation system built on a state-of-the-art 3D CNN. The system performs well in analysing and evaluating gymnastic movements: it accurately identifies the joints and positions of the human body and uses spatial features to capture the fluidity and rhythm of the movements, providing a standard tool for gymnastics training and teaching. Literature [8] describes a software architecture with three layers (MLP, HMM and MLD) for recognising and evaluating gymnastic activities; it gives an overview of the architecture, reports the experiments conducted and their results, and analyses the state of the art in the automatic recognition of gymnastic activities. Literature [9] builds an aerobics pose recognition model that combines a neural network with a sensor network, and proposes a CNN+LSTM deep neural network to address sparse texture features and unsatisfactory results in dynamic scenes; the experiments show that the model performs better. Literature [10] presents an action recognition system based on a cost-effective single-node sensor and the proposed VITGS model; the experiments verify that the method is effective for recognising gymnastic actions and can also be applied to other areas of action recognition. Literature [11] uses computer vision to evaluate the robustness and stability of still-rings movements in gymnastics competitions. Video clips were compiled from internet platforms to build a dataset, and high motion detection accuracy was obtained in the experiments; to raise action recognition accuracy further, technical group points were placed on the athlete's body, which yielded very high detection accuracy.

Literature [12] studies the starting time and technique of trampoline somersaults based on image recognition, pointing out that the starting time is the moment when the spring recovers and exerts a force on the body. It also analyses trampoline actions and the application of artificial intelligence to action recognition in sports, and its results illustrate the effectiveness of the method. Literature [13] presents the design and optimisation of an aerobic exercise recognition system using high-dimensional biotechnology data. Real-time, multidimensional physiological data of athletes were collected with biosensing technology and wearable devices, and deep learning techniques were used to classify and recognise exercise movements; the experiments confirmed that the system recognises and classifies movements accurately. Literature [14] combines PTP and CNN into an improved action recognition algorithm and builds a human-computer interaction gymnastics action recognition system on it; the tests confirmed the effectiveness of the PTP-CNN algorithm, and the system outperformed the other compared systems. Literature [15] reviews past and current research on biomechanical feedback in sports and artistic performance and combines the findings with wearable technology to propose dual-chain body model monitoring using six inertial measurement units and deep learning; the study shows that wearable devices can provide effective and practical biomechanical feedback for sports and artistic training. Literature [16] introduces a lightweight action recognition architecture based on deep neural networks. It uses CNNs to extract spatial features, extracts temporal motion features from the spatial feature maps of different CNN layers, and designs a joint optimisation module to explore the connection between the two types of LSTM features; its experimental results illustrate the effectiveness of the proposed method.

In this paper, gymnastic movements are identified using the OpenPose algorithm, and the gymnasts' skeletal points are extracted with the OpenPose network structure. The MobileNet-V3 network then replaces the VGG-19 network to optimize the original OpenPose model, yielding the OpenPose-MobileNet-V3 gymnastics action recognition model. The recognition results of the OpenPose model before and after optimization are compared on 14 gymnastics actions, and the recognition accuracy of the OpenPose-MobileNet-V3 model is then compared with that of other action recognition models to assess the optimization. In addition, the learning rate decay schedule of MobileNet-V3 is compared with a cosine annealing strategy to analyze how each affects gymnastics action recognition with the OpenPose model.

2

Gymnastic movement recognition and optimization based on the OpenPose model

2.1

OpenPose-based Gymnastic Movement Recognition Algorithm

The OpenPose algorithm is an open-source method that detects multi-person poses using convolutional neural networks (CNNs) and the Caffe framework, and it remains a popular method for motion detection [17]. Trained on a large amount of data, it detects the positions of human joints in color images and completes human skeleton detection without relying on local joint features in the image, extracting joints with high accuracy even in noisy images. Posture features are then extracted by modeling and reconstructing the detected joint positions, which effectively improves recognition accuracy and continuity.

2.1.1

OpenPose algorithm flow

1) An RGB color image is input and processed by a convolutional neural network, which extracts the image features F.

2) The image features are processed by a two-branch network: one branch predicts the set S of confidence maps of the joint parts, and the other predicts the set L of two-dimensional vector fields that encode the association between body parts.

3) The associations between the parts of the human body are detected from S and L.

4) Finally, the association vector fields of the detected targets are analyzed by confidence level, and after several iterations the joint points of the human body are obtained and the skeleton map is constructed.

Here, $S = (S_{1}, S_{2}, \ldots, S_{J})$ with $S_{j} \in \mathbb{R}^{w \times h}$, $j \in \{1, \ldots, J\}$, i.e., one confidence map is predicted for each of the J joint locations.

$L = (L_{1}, L_{2}, \ldots, L_{C})$ with $L_{c} \in \mathbb{R}^{w \times h \times 2}$, $c \in \{1, \ldots, C\}$, i.e., C vector fields are predicted, one for each part association of the detected target.

2.1.2

OpenPose Network Architecture

OpenPose uses a VGG network as its backbone; the network structure is shown in Fig. 1. The network is divided into two main parts, each consisting of a sequence of convolutional layers [18]. The upper half of the network predicts the confidence maps of the skeletal points and forms the first branch; the lower half predicts the part association vector fields and forms the second branch. The two branches iteratively predict and regress the confidence maps and association vector fields at the same time, and each regression completes one iteration. Detection is complete after T iterations, $t \in \{1, \ldots, T\}$.

F is the input to the first stage of both branches: a set of feature maps obtained by the CNN from the original input image. In the first stage, the confidence maps $S^{1} = \rho^{1}(F)$ and the association vector fields $L^{1} = \phi^{1}(F)$ are obtained. To make later predictions more accurate, the outputs of the previous stage are concatenated with F and used as the input of the next stage. The predictions at stage t are:

(1) $S^{t} = \rho^{t}(F, S^{t-1}, L^{t-1}), \quad \forall t \geq 2$

(2) $L^{t} = \phi^{t}(F, S^{t-1}, L^{t-1}), \quad \forall t \geq 2$

where $\rho^{t}$ and $\phi^{t}$ denote the CNNs that perform the inference at stage t.
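The stage-wise recursion in Eqs. (1) and (2) can be sketched as a short loop. This is a minimal control-flow illustration under stated assumptions, not the OpenPose implementation: each feature map is reduced to a single number per channel, and `run_stages`, `make_predictor` and the toy predictors are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Sequence, Tuple

Channels = List[float]  # toy stand-in: one number per channel instead of a w x h map
Predictor = Callable[[Channels], Channels]


def run_stages(F: Channels,
               rho: Sequence[Predictor],
               phi: Sequence[Predictor]) -> Tuple[Channels, Channels]:
    """Stage 1 sees F only; every later stage sees F, S^{t-1} and L^{t-1}."""
    S = rho[0](F)  # S^1 = rho^1(F)
    L = phi[0](F)  # L^1 = phi^1(F)
    for t in range(1, len(rho)):
        x = F + S + L                # channel-wise concatenation (list join)
        S, L = rho[t](x), phi[t](x)  # both branches read the same stage input
    return S, L


def make_predictor(n_out: int) -> Predictor:
    """Hypothetical predictor: returns n_out channels holding the input mean."""
    return lambda x: [sum(x) / len(x)] * n_out


T = 3  # number of stages, chosen arbitrarily for the example
S_T, L_T = run_stages([0.2, 0.4, 0.6],
                      [make_predictor(2)] * T,
                      [make_predictor(2)] * T)
print(len(S_T), len(L_T))
```

Updating S and L in a single tuple assignment matters: it guarantees that both branches at stage t are computed from the stage t-1 outputs, as Eqs. (1) and (2) require.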

After the predictions are obtained, a loss is computed at the end of each stage to improve recognition accuracy. OpenPose uses an L2 loss to measure the error between the predictions and the ground truth. Because some datasets are not completely labeled, correct detections at unlabeled locations would otherwise be penalized; the loss is therefore weighted spatially so that such regions are excluded. The loss functions of the two branches at stage t are:

(3) $f_{S}^{t} = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \| S_{j}^{t}(p) - S_{j}^{*}(p) \|_{2}^{2}$

(4) $f_{L}^{t} = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \| L_{c}^{t}(p) - L_{c}^{*}(p) \|_{2}^{2}$

where $S_{j}^{*}$ denotes the ground-truth confidence map of joint j, $L_{c}^{*}$ denotes the ground-truth part association vector field, and W is a binary mask with $W(p) = 0$ where the annotation is missing at image location p, which avoids penalizing correct predictions there. Supervising every stage in this way replenishes the gradient periodically during training and mitigates vanishing gradients. The overall objective function is:

(5) $f = \sum_{t=1}^{T} (f_{S}^{t} + f_{L}^{t})$
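As a worked illustration of Eqs. (3) to (5), the sketch below evaluates the masked L2 loss on tiny hand-made maps. It assumes that each map is flattened to a list of per-pixel values and that every vector field is stored as two channels (its x and y components), so that the squared norm becomes a plain sum over channels. The names `masked_l2` and `total_loss` and all numbers are hypothetical.

```python
from typing import List, Sequence

Map = List[float]  # a w*h map flattened to a list of per-pixel values


def masked_l2(pred: Sequence[Map], target: Sequence[Map], mask: Map) -> float:
    """Sum over channels and pixels of W(p) * (pred - target)^2."""
    total = 0.0
    for pred_map, target_map in zip(pred, target):
        for w, a, b in zip(mask, pred_map, target_map):
            total += w * (a - b) ** 2
    return total


def total_loss(S_stages, L_stages, S_star, L_star, mask) -> float:
    """Add the losses of both branches over all stages t = 1..T."""
    return sum(masked_l2(S_t, S_star, mask) + masked_l2(L_t, L_star, mask)
               for S_t, L_t in zip(S_stages, L_stages))


mask = [1.0, 0.0, 1.0, 1.0]        # pixel 1 is unlabeled, so W(p) = 0 there
S_star = [[0.0, 1.0, 0.0, 0.0]]    # one joint (J = 1)
S_pred = [[0.5, 0.0, 0.0, 0.0]]
L_star = [[0.0] * 4, [0.0] * 4]    # one vector field as x and y channels

print(masked_l2(S_pred, S_star, mask))  # 0.25: the error at pixel 1 is masked out
print(total_loss([S_pred, S_star], [L_star, L_star], S_star, L_star, mask))
```

Only the error at pixel 0 contributes; the larger error at the unlabeled pixel is multiplied by zero and therefore does not influence training.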

2.1.3

OpenPose Skeletal Point Extraction Steps

The general flow of skeletal point extraction using the OpenPose algorithm is shown in Figure 2.

Step 1: Input the color image, preprocess it, and pass the processed image through VGG-19 to select the external feature points of the color image.

Step 2: The network splits into two branches for iterative prediction: one branch predicts the skeletal points of the human body and the other predicts the connecting lines between the skeletal points. The two branches run in parallel, and the results are stored in JSON format.

Step 3: Select the skeletal point data, convert the information stored in JSON format to text format, and store the valid data.

Step 4: Preprocess the skeletal point data.

The OpenPose model provides the 18 major skeletal points of the human body shown in Figure 3, each identified by its own code number.
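Step 3 can be sketched as follows. This is a minimal example that assumes the JSON layout written by recent OpenPose releases, where each entry of `people` carries a flat `pose_keypoints_2d` list of (x, y, confidence) triplets; the paper does not state which layout or text format was used, so the key names, the function name `keypoints_to_text` and the one-line-per-person text format are assumptions made for illustration.

```python
import json
from typing import List


def keypoints_to_text(json_text: str, n_points: int = 18) -> List[str]:
    """Return one whitespace-separated text line per detected person."""
    lines = []
    for person in json.loads(json_text).get("people", []):
        flat = person["pose_keypoints_2d"]  # assumed key name and flat layout
        if len(flat) != 3 * n_points:
            continue                        # skip records with another keypoint count
        lines.append(" ".join(f"{value:.3f}" for value in flat))
    return lines


# Synthetic record standing in for one frame of OpenPose output.
sample = json.dumps({"people": [{"pose_keypoints_2d": [float(i) for i in range(54)]}]})
for line in keypoints_to_text(sample):
    print(line)
```

Each output line holds 54 numbers, that is, x, y and confidence for each of the 18 skeletal points, which is a convenient starting point for the preprocessing in Step 4.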

2.1.4

OpenPose function

The OpenPose algorithm is an open-source human pose estimation model for research use, developed on the Caffe framework. Its main features are as follows:

1) Multi-platform operation: it runs on a variety of operating systems, such as Windows and Ubuntu, and provides C++ and Python APIs to suit different programming needs.

2) It supports single- and multi-person pose recognition as well as facial and hand joint recognition.

3) It handles special scenes such as partial occlusion well and maintains high recognition performance.

4) It accepts many forms of input, such as color images, videos, and video streams from cameras, which covers the needs of different scenarios.

5) It offers several output options, such as overlaying the skeletal points on the original data (color image, video, etc.) for display and saving the recognized skeletal point data to files.

2.2

OpenPose model improvement

2.2.1

Raw Feature Extraction Network VGG-19

The VGG-19 network in the OpenPose model is responsible for skeleton feature extraction; its structure is shown schematically in Fig. 4. Its high computational cost and deep structure make processing slow. To alleviate this problem, a lighter network architecture or improvements to the existing algorithm can be considered, so as to improve the real-time performance of the model while maintaining or improving its accuracy.

2.2.2

MobileNet-V3 network

The VGG-19 design is relatively simple and focuses on network depth: it stacks convolutional layers of the same type to increase depth and thereby improve feature extraction. However, it has a large number of parameters and high computational complexity, requires substantial computational resources, and is relatively slow at inference, which hinders real-time detection. To address these problems, this study replaces the VGG-19 network in the OpenPose model with the lightweight MobileNet network.

In 2019, the Google team introduced MobileNet-V3, then the latest version of the MobileNet family [19]. This version inherits the depthwise separable convolution from V1 and the inverted residual and linear bottleneck design from V2. In addition, V3 integrates the SENet (squeeze-and-excitation) module, which adaptively reweights the feature channels. To optimize the network without sacrificing accuracy, MobileNet-V3 is used in this study to replace the original VGG-19 network architecture.

The standard convolutions used in VGG-19 tend to suffer from vanishing gradients as the network depth increases, which degrades training. In contrast, MobileNet-V3 combines depthwise separable convolutions with a residual structure to mitigate vanishing gradients. The structure first expands the channel dimension and then reduces it, which improves gradient propagation, lowers the storage requirements during computation, reduces the consumption of computational resources, and speeds up data processing.
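The saving from a depthwise separable convolution can be checked with simple arithmetic. The sketch below counts weights (biases ignored) for a standard k × k convolution and for its depthwise separable counterpart, which applies one k × k filter per input channel followed by a 1 × 1 pointwise convolution. The layer size (3 × 3 kernel, 128 input and output channels) is an illustrative assumption, not a figure taken from the paper.

```python
def standard_conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Every output channel has its own k x k filter over all input channels."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k: int, c_in: int, c_out: int) -> int:
    """One k x k filter per input channel, then a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 128  # hypothetical layer size
standard = standard_conv_weights(k, c_in, c_out)         # 147456
separable = depthwise_separable_weights(k, c_in, c_out)  # 17536
print(standard, separable, round(standard / separable, 2))
```

For this layer size the separable form needs roughly one eighth of the weights, which is the kind of reduction that motivates swapping VGG-19 for a MobileNet backbone.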

As noted above, MobileNet-V3 retains the depthwise separable convolution of V1 and the inverted residual and linear bottleneck structures of V2, and adds the SENet module, which adaptively reweights the feature channels.

MobileNet-V3 uses the improved h-swish activation function in place of ReLU; its formula is given in Eq. (6). h-swish mitigates the limitations of ReLU and is smoother than ReLU, which improves model performance on low-dimensional data, and it reduces the computational cost as the network layers deepen:

(6) $\text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$
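Eq. (6) translates directly into code. The following scalar sketch uses only built-in functions; the names `relu6` and `h_swish` are chosen here for readability.

```python
def relu6(x: float) -> float:
    """ReLU clipped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """h-swish(x) = x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


for x in (-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 6.0):
    print(x, h_swish(x))
```

The function is zero for x ≤ -3, equals x for x ≥ 3, and interpolates between the two regimes in between, using only a clip, a multiplication and a division by a constant.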

MobileNet-V3 also optimizes the tail of the MobileNet-V2 network. To maintain a sufficient number of feature maps, the 1 × 1 convolutional layer is placed after the average pooling layer, and the 3 × 3 depthwise convolution (DConv) and other 1 × 1 convolutional layers are removed. This adjustment maintains the accuracy of the model while significantly reducing its number of parameters and computational requirements.

The main innovation of the improved version of OpenPose is that its feature extraction part is built from Bneck (bottleneck) modules. Each module starts with a 1 × 1 convolutional layer that expands the channels, followed by a batch normalization (BN) layer and an h-swish activation layer, and ends with a 1 × 1 convolutional layer for channel compression, or with compression after average pooling. To avoid the loss of information as the network deepens and to improve the flow of feature information, the module also incorporates a residual connection.

In summary, MobileNet-V3 employs the h-swish activation function to improve model performance in low-dimensional feature spaces and uses the SENet module to adjust the importance of the feature maps; together these improvements give MobileNet-V3 a better balance between efficiency and performance.

2.2.3

Improved OpenPose Network Modeling

The OpenPose network applies five 7 × 7 convolutions in both the upper and lower branches from stage 2 onward, and these operations are very costly. To reduce the complexity of the model without lowering accuracy, this experiment replaces the large convolution kernels with smaller ones. The resulting problem is that small kernels shrink the receptive field (i.e., the size of the input region that a convolution can observe), so dilated (atrous) convolution is introduced to recover the lost receptive field.

In human behavior recognition, where spatial feature information is particularly important, dilated convolution avoids additional downsampling in the convolutional network and helps maintain the spatial resolution of the feature map. In a dilated convolution, gaps are inserted between the elements of the convolutional kernel to enlarge the region that the kernel covers.

The 7 × 7 convolution in the original OpenPose network is modified as follows: each 7 × 7 convolution is replaced by one 1 × 1 convolution followed by two 3 × 3 convolutions, the last of which is a dilated convolution with a dilation factor of 2 to compensate for the reduced receptive field. A skip connection built from 1 × 1 convolutions is added across the three convolutions to counter the vanishing gradients that arise as the depth of the network increases.
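The effect of this replacement can be verified with a short calculation. For stride-1 convolutions applied in sequence, the receptive field grows by (k - 1) · d per layer, where k is the kernel size and d the dilation factor. The sketch below compares the original 7 × 7 layer with the 1 × 1, 3 × 3, dilated 3 × 3 stack; the channel count of 128 is an illustrative assumption, and the 1 × 1 kernels of the skip connection are not counted.

```python
from typing import List, Tuple

Layer = Tuple[int, int]  # (kernel size, dilation factor), stride 1 assumed


def receptive_field(layers: List[Layer]) -> int:
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf


def conv_weights(layers: List[Layer], channels: int) -> int:
    """Weight count when every layer maps `channels` to `channels` (no bias)."""
    return sum(k * k * channels * channels for k, _ in layers)


original = [(7, 1)]
replacement = [(1, 1), (3, 1), (3, 2)]  # last 3 x 3 layer has dilation 2

print(receptive_field(original), receptive_field(replacement))      # 7 7
print(conv_weights(original, 128), conv_weights(replacement, 128))  # 802816 311296
```

The stack reaches the same 7 × 7 receptive field with 19 instead of 49 weights per input-output channel pair, which is why the dilation factor of 2 on the last layer is sufficient to compensate for the smaller kernels.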

In summary, the improved OpenPose network model is shown in Fig. 5. It uses the Bneck module of MobileNet-V3 as a building block, whose internal structure is designed to extract key information efficiently and thereby improve skeleton detection performance. For feature extraction, the network first applies an ordinary convolutional layer with a stride of 2 to the input training images to increase the number of channels, then extracts features with the Bneck modules, and finally performs downsampling.

3

Analysis of optimization effects

3.1

Overall performance

The data in this section are derived from the NTU RGB+D dataset. The data cover 10 gymnasts and 14 types of gymnastics movements (suspension straight body circle, shoulder blade swing, shoulder support level, forward roll, back roll, fish leap forward roll, swallow balance, shoulder and elbow handstand, headstand, side hand flip, front hand flip, push-up jump, bend back roll, and leg bend hanging and swing), numbered 1 to 14. Ninety percent of the dataset is used as the training set and 10% as the test set, and ten-fold cross-validation is used to improve the generalization ability of the model. The confusion matrix of the optimized OpenPose algorithm is shown in Table 1 and that of the original OpenPose algorithm in Table 2. In each confusion matrix, the rows represent the predicted action types, the columns represent the actual action types, and the diagonal entries give the recognition accuracy of each action.

Table 1.

Identification confusion matrix of improved OpenPose algorithm

Movement	1	2	3	4	5	6	7
1	0.95	0.01	0.01	0.01	0.00	0.00	0.00
2	0.01	0.97	0.00	0.00	0.00	0.00	0.02
3	0.02	0.01	0.94	0.00	0.00	0.02	0.00
4	0.00	0.04	0.03	0.93	0.00	0.00	0.00
5	0.00	0.01	0.00	0.00	0.95	0.00	0.00
6	0.00	0.00	0.00	0.00	0.00	0.95	0.00
7	0.00	0.00	0.00	0.00	0.01	0.01	0.96
8	0.00	0.00	0.00	0.00	0.00	0.00	0.00
9	0.00	0.00	0.00	0.03	0.01	0.00	0.00
10	0.01	0.03	0.01	0.00	0.00	0.00	0.00
11	0.00	0.01	0.00	0.00	0.00	0.00	0.00
12	0.00	0.00	0.00	0.00	0.00	0.00	0.00
13	0.00	0.00	0.00	0.00	0.01	0.00	0.00
14	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Movement	8	9	10	11	12	13	14
1	0.00	0.00	0.00	0.00	0.00	0.01	0.00
2	0.00	0.00	0.00	0.00	0.00	0.00	0.00
3	0.00	0.00	0.00	0.01	0.00	0.00	0.00
4	0.00	0.00	0.00	0.00	0.00	0.00	0.00
5	0.00	0.03	0.01	0.00	0.00	0.00	0.00
6	0.01	0.00	0.02	0.00	0.00	0.02	0.00
7	0.00	0.01	0.00	0.00	0.01	0.00	0.00
8	0.96	0.04	0.00	0.00	0.00	0.00	0.00
9	0.00	0.95	0.01	0.00	0.01	0.00	0.00
10	0.01	0.00	0.96	0.00	0.00	0.00	0.01
11	0.00	0.00	0.01	0.98	0.00	0.00	0.00
12	0.00	0.03	0.00	0.00	0.97	0.00	0.00
13	0.00	0.00	0.00	0.00	0.00	0.96	0.03
14	0.00	0.00	0.01	0.00	0.00	0.01	0.98

Table 2.

Identification confusion matrix of OpenPose algorithm

Movement	1	2	3	4	5	6	7
1	0.87	0.01	0.02	0.03	0.00	0.00	0.01
2	0.00	0.89	0.00	0.02	0.03	0.00	0.01
3	0.00	0.04	0.89	0.01	0.00	0.02	0.03
4	0.01	0.02	0.01	0.90	0.00	0.01	0.00
5	0.01	0.02	0.01	0.01	0.88	0.00	0.00
6	0.00	0.00	0.00	0.00	0.01	0.92	0.02
7	0.01	0.02	0.03	0.01	0.02	0.00	0.90
8	0.00	0.00	0.02	0.02	0.00	0.00	0.00
9	0.04	0.01	0.00	0.00	0.00	0.00	0.00
10	0.02	0.00	0.00	0.01	0.00	0.00	0.00
11	0.02	0.01	0.00	0.00	0.01	0.00	0.00
12	0.00	0.00	0.00	0.00	0.04	0.00	0.00
13	0.01	0.03	0.00	0.01	0.00	0.04	0.01
14	0.01	0.01	0.01	0.02	0.00	0.00	0.00
Movement	8	9	10	11	12	13	14
1	0.02	0.01	0.01	0.00	0.01	0.01	0.00
2	0.02	0.00	0.01	0.00	0.00	0.01	0.01
3	0.01	0.00	0.00	0.00	0.00	0.00	0.00
4	0.00	0.00	0.02	0.01	0.00	0.01	0.01
5	0.00	0.03	0.00	0.00	0.02	0.01	0.01
6	0.01	0.01	0.02	0.01	0.00	0.00	0.00
7	0.00	0.00	0.00	0.00	0.01	0.00	0.00
8	0.91	0.00	0.01	0.03	0.01	0.00	0.00
9	0.02	0.92	0.00	0.00	0.00	0.01	0.00
10	0.00	0.02	0.89	0.03	0.00	0.00	0.03
11	0.01	0.04	0.03	0.86	0.02	0.00	0.00
12	0.00	0.00	0.01	0.05	0.89	0.00	0.01
13	0.00	0.00	0.01	0.00	0.03	0.85	0.01
14	0.04	0.00	0.02	0.00	0.01	0.00	0.88

Overall, the average recognition accuracies of the optimized and original OpenPose algorithms are 95.786% and 88.929%, respectively, so the optimized algorithm is clearly more accurate. Among the 14 movements, the optimized algorithm scores below 95% on only two, shoulder support level (94%) and forward roll (93%); the other 12 movements are recognized with at least 95% accuracy. The original OpenPose algorithm reaches 90% to 92% on five movements (forward roll, fish leap forward roll, swallow balance, shoulder and elbow handstand, and headstand) and falls below 90% on the remaining nine.

Overall, the optimized OpenPose-MobileNet-V3 algorithm has a better ability to recognize gymnastic movements.
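The averages quoted above are the means of the diagonal entries of the two confusion matrices. The short check below copies the diagonals from the tables and recomputes both figures; the variable and function names are chosen for this example only.

```python
# Diagonal entries (per-movement accuracy), movements 1..14.
improved = [0.95, 0.97, 0.94, 0.93, 0.95, 0.95, 0.96,
            0.96, 0.95, 0.96, 0.98, 0.97, 0.96, 0.98]
original = [0.87, 0.89, 0.89, 0.90, 0.88, 0.92, 0.90,
            0.91, 0.92, 0.89, 0.86, 0.89, 0.85, 0.88]


def mean_accuracy_percent(diagonal):
    """Average of the per-class accuracies, in percent, to three decimals."""
    return round(100 * sum(diagonal) / len(diagonal), 3)


print(mean_accuracy_percent(improved))   # 95.786
print(mean_accuracy_percent(original))   # 88.929

# Movements on which the improved model stays below 95%.
print([i + 1 for i, acc in enumerate(improved) if acc < 0.95])  # [3, 4]
```

Because every class contributes equally, this is a macro-averaged accuracy; it coincides with overall accuracy only if the 14 movements are equally represented in the test set.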

3.2

Comparison of methods

The recognition accuracies of the proposed model (OpenPose-MobileNet-V3) and other state-of-the-art models on the NTU RGB+D and Northwestern-UCLA datasets are shown in Table 3. The proposed model achieves 95.786% and 94.572% accuracy on the NTU RGB+D and Northwestern-UCLA datasets, respectively. Compared with the earlier Lie Group, HBRNN-L, Part-Aware LSTM and ST-LSTM+Trust Gate models, its accuracy is higher by 43.385, 38.794, 35.495 and 34.621 percentage points on the NTU RGB+D dataset, and by 42.729, 41.285, 40.477 and 38.717 percentage points on the Northwestern-UCLA dataset. Compared with the dual-stream Two-stream RNN model, it gains 30.623 and 33.25 percentage points on the NTU RGB+D and Northwestern-UCLA datasets, respectively. It also shows a clear advantage over most LSTM-based models, such as STA-LSTM and Ensemble TS-LSTM. Compared with the GCN-based ST-GCN, BGC-LSTM and DPRL models, its accuracy is 11.829, 11.21 and 9.56 percentage points higher on the NTU RGB+D dataset, and 8.636, 6.843 and 6.382 percentage points higher on the Northwestern-UCLA dataset, respectively. These results demonstrate the superior recognition accuracy of the proposed model (OpenPose-MobileNet-V3) over the other action recognition models.

Table 3.

Comparison with other advanced models on NTU RGB+D and Northwestern-UCLA

Model	NTU RGB+D accuracy (%)	Northwestern-UCLA accuracy (%)
Lie Group	52.401	51.843
HBRNN-L	56.992	53.287
Part-Aware LSTM	60.291	54.095
ST-LSTM+Trust Gate	61.165	55.855
Two-stream RNN	65.163	61.322
STA-LSTM	65.551	62.935
Ensemble TS-LSTM	66.617	63.712
Deep STGC_K	70.017	78.493
Clips+CNN+MTLN	70.263	75.726
ST-NBMIM	70.808	69.986
E1Att-GRU	71.203	86.976
RotClips+MTCNN	71.547	81.304
ST-GCN	83.957	85.936
BGC-LSTM	84.576	87.729
DPRL	86.226	88.190
OpenPose-MobileNet-V3	95.786	94.572
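The margins discussed in the text are differences in percentage points between the last row of Table 3 and each baseline row. A small check on a few rows of the table is sketched below; the dictionary and the function name `margin` are introduced for this example only.

```python
# (NTU RGB+D, Northwestern-UCLA) accuracies in percent, copied from Table 3.
baselines = {
    "Lie Group": (52.401, 51.843),
    "Two-stream RNN": (65.163, 61.322),
    "ST-GCN": (83.957, 85.936),
    "DPRL": (86.226, 88.190),
}
proposed = (95.786, 94.572)  # OpenPose-MobileNet-V3


def margin(model: str):
    """Advantage of the proposed model over `model`, in percentage points."""
    return tuple(round(ours - theirs, 3)
                 for ours, theirs in zip(proposed, baselines[model]))


for name in baselines:
    print(name, margin(name))
```

Reporting the gap in percentage points rather than as a relative percentage avoids ambiguity: 95.786% versus 52.401% is a 43.385-point gap, not a 43.385% relative improvement.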

3.3

Parameter sensitivity experiments

This section explores the effect of different learning rate decay methods on the proposed model. In MobileNet-V3, the initial learning rate is set to 0.1, the learning rate is warmed up during the first 5 iterations, and it is decayed by a factor of 0.1 at the 35th and 55th iterations. In this experiment, cosine annealing is used in place of the MobileNet-V3 learning rate decay method, and the loss and learning rate curves of the two strategies during training are plotted with the visualization tool tensorboardX under PyTorch. The curves are shown in Fig. 6, where Fig. 6(a) and Fig. 6(b) show the learning rate and the loss value, respectively.

Fig. 6(a) clearly shows the difference between the two learning rate decay methods. With cosine annealing, the learning rate follows a smooth curve that decreases slowly at every iteration. In contrast, MobileNet-V3 training is divided into stages, and the learning rate drops directly to 0.1 times its value when the specified iteration is reached. Fig. 6(b) shows the loss value, which changes more smoothly with cosine annealing than with the MobileNet-V3 schedule. With the MobileNet-V3 schedule, the loss falls markedly each time the learning rate is reduced, which shows that the way the learning rate decays affects model training. In the late stage of training (iterations 55 to 70), the loss under cosine annealing still fluctuates slightly while the MobileNet-V3 curve has stabilized, which indicates that the learning rate decay applied in this paper benefits the training of the proposed model.
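The two schedules compared above can be written down in a few lines. The sketch below makes several assumptions that the text does not specify: the warm-up is linear, iteration counters start at 0, training runs for 70 iterations, each milestone multiplies the current rate by 0.1, and cosine annealing decays to a minimum rate of 0. It is an illustration of the two curve shapes, not the training code used in the experiments.

```python
import math

BASE_LR = 0.1          # initial learning rate from the text
TOTAL = 70             # assumed total number of iterations
WARMUP = 5             # warm-up length from the text
MILESTONES = (35, 55)  # decay points from the text


def step_lr(it: int) -> float:
    """Staged schedule: linear warm-up (assumed), then x0.1 at each milestone."""
    if it < WARMUP:
        return BASE_LR * (it + 1) / WARMUP
    drops = sum(it >= m for m in MILESTONES)
    return BASE_LR * 0.1 ** drops


def cosine_lr(it: int) -> float:
    """Cosine annealing from BASE_LR down to 0 over TOTAL iterations."""
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * it / TOTAL))


for it in (0, 4, 34, 35, 54, 55, 69):
    print(it, round(step_lr(it), 5), round(cosine_lr(it), 5))
```

Printing both columns side by side reproduces the qualitative picture of Fig. 6(a): the staged schedule is flat with two abrupt drops, while the cosine schedule decreases a little at every iteration.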

After the model is trained with the cosine annealing strategy, its recognition accuracy is examined; the results on the NTU RGB+D test set are shown in Table 4. The model trained with cosine annealing achieves an accuracy of 93.643%, which is 2.143 percentage points lower than that of the OpenPose-MobileNet-V3 model, indicating that the MobileNet-V3 learning rate decay schedule is more conducive to the recognition performance of the OpenPose model.

Table 4.

Identification confusion matrix of cosine annealing strategy

Movement	1	2	3	4	5	6	7
1	0.91	0.00	0.01	0.00	0.05	0.00	0.00
2	0.00	0.90	0.03	0.00	0.00	0.03	0.00
3	0.00	0.02	0.93	0.00	0.00	0.00	0.00
4	0.01	0.00	0.02	0.89	0.00	0.02	0.00
5	0.00	0.01	0.00	0.02	0.91	0.00	0.01
6	0.00	0.00	0.00	0.00	0.04	0.96	0.00
7	0.00	0.00	0.00	0.00	0.00	0.03	0.97
8	0.01	0.00	0.00	0.00	0.00	0.00	0.00
9	0.00	0.00	0.00	0.00	0.01	0.00	0.02
10	0.00	0.00	0.00	0.00	0.01	0.00	0.00
11	0.01	0.00	0.00	0.01	0.00	0.00	0.00
12	0.00	0.00	0.00	0.00	0.00	0.00	0.00
13	0.00	0.01	0.00	0.00	0.00	0.00	0.02
14	0.00	0.00	0.00	0.00	0.00	0.00	0.01
Movement	8	9	10	11	12	13	14
1	0.01	0.00	0.00	0.01	0.00	0.01	0.00
2	0.00	0.01	0.00	0.00	0.01	0.01	0.01
3	0.00	0.00	0.04	0.01	0.00	0.00	0.00
4	0.00	0.03	0.03	0.00	0.00	0.00	0.00
5	0.05	0.00	0.00	0.00	0.00	0.00	0.00
6	0.00	0.00	0.00	0.00	0.00	0.00	0.00
7	0.00	0.00	0.00	0.00	0.00	0.00	0.00
8	0.95	0.04	0.00	0.00	0.00	0.00	0.00
9	0.04	0.92	0.00	0.00	0.00	0.01	0.00
10	0.00	0.01	0.97	0.00	0.00	0.00	0.01
11	0.00	0.00	0.02	0.95	0.01	0.00	0.00
12	0.00	0.00	0.00	0.00	0.96	0.00	0.04
13	0.00	0.00	0.00	0.00	0.00	0.94	0.03
14	0.01	0.01	0.00	0.02	0.00	0.00	0.95

4

Conclusion

In this paper, the OpenPose algorithm is used to recognize gymnastics movement skills. MobileNet-V3 is embedded into the OpenPose model to replace the original VGG-19 feature extraction network, yielding the OpenPose-MobileNet-V3 model. The optimized model is then evaluated on gymnastics movement recognition.

The recognition accuracies of the OpenPose-MobileNet-V3 algorithm and the original OpenPose algorithm on the 14 types of gymnastic movements are 95.786% and 88.929%, respectively. In comparison with other recognition models, OpenPose-MobileNet-V3 achieves 95.786% and 94.572% accuracy on the NTU RGB+D and Northwestern-UCLA datasets, respectively, the best among the compared models. Comparing the MobileNet-V3 learning rate schedule with cosine annealing, cosine annealing gives smoother learning rate and loss curves, while the MobileNet-V3 learning rate decays in stages and the loss drops at each decay. The model trained with cosine annealing reaches an accuracy of 93.643%, which is 2.143 percentage points lower than that of the OpenPose-MobileNet-V3 model, so the MobileNet-V3 schedule is the better choice for the OpenPose model.

Funding:

This research was supported by the 2022 Jilin Provincial Social Science Fund Project (General Optional Project): Jilin Province Sports Consumption Pilot City Construction Research (Project Number: 2022B157).



Dan Mo

Yintong Wang

Bowen Zhang

Published: 17 Mar 2025

Received: 10 Oct 2024

Accepted: 01 Feb 2025

DOI: https://doi.org/10.2478/amns-2025-0265

Keywords: Deep learning, OpenPose algorithm, MobileNet-V3 network, Gymnastics action recognition

© 2025 Dan Mo et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
