A Study on the Optimal Design of Reinforcement Learning-Driven Personalized Physical Training Strategies in Physical Education Instruction
Published: 21 Mar 2025
Received: 30 Oct 2024
Accepted: 07 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0601
© 2025 Yinghui Jiang et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Physical education has always been an indispensable part of school education, and with the development of society and people’s deepening concern for health, its importance has become increasingly prominent. As an important form of physical exercise, the application of physical fitness training in physical education has also received growing attention [1-4]. With increasingly fierce social competition, the pressure faced by students keeps rising, and their physical quality and fitness level are drawing more and more concern [5-7]. To improve students’ physical quality and fitness level, personalized physical training has emerged; it focuses not only on improving each student’s individual physical quality but also on cultivating students’ comprehensive abilities and teamwork spirit [8-11].
Individualized physical training refers to planned, organized and targeted sports training aimed at improving body functions, enhancing physical fitness, improving athletic ability and adapting to a given sports load [12-14]. Personalized physical training is an important form of physical exercise: by scientifically selecting sports, mastering training methods and following a reasonable training program, it can effectively improve the function of the body’s organs, strengthen the body’s capabilities, improve sports performance and reduce the risk of sports injuries [15-18]. In physical education, personalized physical training is one of the most important means of cultivating students’ comprehensive quality. Through physical training, students can comprehensively improve their physical quality, enhance their fitness level, and improve their motor skills and their ability to adapt to various sports programs, so that they can better participate in sports, enjoy them and improve their quality of life [19-22].
Existing recommender systems cannot perform dynamic modeling and their recommendations lack timeliness. Based on this, this paper proposes a personalized recommendation model based on reinforcement learning for recommending physical training strategies. The proposed model can better capture users’ real-time dynamic interests within a reinforcement learning process. Softmax is used to convert each action into a selection probability, and a pseudo-twin component is designed to compute rewards and remove noisy interaction records so as to achieve more accurate recommendations. After the rewards are calculated, they are ranked and the top-ranked items are selected as the final recommendation. Finally, the effectiveness of the recommendation algorithm is verified on the mL-1m and mL-100k datasets.
The SOM neural network [23] is an unsupervised, competitive-learning network whose basic idea is as follows:
When a certain type of data is input, one neuron node in the output layer receives the maximum stimulus and wins, and the connection weight vectors of the winning node and its surrounding nodes are adjusted in the direction of the input vector. When the input data changes, the winning position shifts from the original winning node to other nodes. From the output state of the SOM neural network, the distribution characteristics of all the sample data can be obtained, and each output neuron node comes to represent a certain type of pattern, so that the class to which an input vector belongs can be judged.
The SOM neural network is a two-layer network consisting of an input layer and an output layer (also called the competition layer), and the neurons of the two layers are fully interconnected. The input layer receives input information from the outside world, and its number of neurons equals the dimension of the input data. The output layer processes the input information, and each of its neurons has a weight vector. After receiving input data, the network determines the winning neuron in the output layer according to the winner-takes-all (WTA) rule; once the neuron weights have stabilized through continued training on the input data, the position of the input data in the low-dimensional space is determined. Usually, each neuron represents one clustered category.
The neurons in the output layer have a topological relationship with each other, and different network topologies can be utilized to meet different needs. In this regard, Fig. 1 shows the topology of SOM neural network, Fig. 1(a) shows the one-dimensional line array of SOM neural network, and Fig. 1(b) shows the two-dimensional planar array of SOM neural network.

Structure of the SOM neural network
The training goal of the SOM neural network is to find appropriate weight vectors for each output layer neuron in order to maintain the topology. The three phases of SOM training are:
In the SOM neural network, the similarity between the input vector and a neuron’s weight vector is the basis on which the network classifies or clusters the input vectors, and it is also the key to finding the winning neuron. Similarity is usually measured by the distance between the two vectors: the larger the distance, the lower the similarity. After the input layer receives an input vector, the network calculates the Euclidean distance between that vector and each neuron’s weight vector, and the neuron with the smallest distance (i.e., the largest similarity) is chosen as the winning neuron.
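As a concrete illustration, the following minimal NumPy sketch implements this competition phase; the function name and array shapes are assumptions for illustration, not the paper's code.

```python
import numpy as np

def find_winner(x, weights):
    """Competition phase: return the index of the output-layer neuron whose
    weight vector is closest to the input vector x (smallest Euclidean
    distance, i.e. highest similarity).

    x       : (d,)   input vector
    weights : (m, d) one weight vector per output-layer neuron
    """
    distances = np.linalg.norm(weights - x, axis=1)  # Euclidean distance to every neuron
    return int(np.argmin(distances))                 # winning neuron
```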
Assume that there is
In the cooperation phase, the winning neuron cooperates with its neighboring neurons and becomes the adjustment center, forming the superior (winning) neighborhood. The strength of this cooperation should decrease as the lateral distance from the winning neuron increases, so a Gaussian function is usually chosen to compute it.
Let
The adaptive phase updates the weight vectors of all neurons in the superior neighborhood, centered on the winning neuron; the degree of adjustment is determined by the lateral distance to the winning neuron, with closer neurons adjusted more strongly. When the learning of the entire network is completed, the competition-layer neurons can recognize similar input patterns in the sample data set, so this stage is the key to classifying and clustering the sample data. The adjustment formula for a neuron’s weight vector is:
The specific implementation steps of the SOM neural network algorithm are described as follows:
Step 1: Initialize the network parameters: set the number of nodes in the input and output layers and the initial learning rate.
Step 2: Randomly select an input vector from the sample data set and normalize it.
Step 3: Determine the winning neuron in the output layer by calculating the Euclidean distance between the sample and each neuron’s weight vector.
Step 4: Define the winning neighborhood around the winning neuron.
Step 5: Update the learning rate and the neighborhood radius for the next iteration and renormalize the learned weight vectors.
Step 6: Continue iterating until the learning rate decays to 0 or the maximum number of iterations is reached.
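The six steps can be tied together in a short training loop. The sketch below is a minimal, illustrative implementation under assumed parameter values and exponential decay schedules for the learning rate and neighbourhood radius; it is not the exact configuration used in the study.

```python
import numpy as np

def train_som(data, grid_shape=(5, 5), eta0=0.5, sigma0=2.0, n_iter=1000, seed=0):
    """Minimal SOM training loop following Steps 1-6 above (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_neurons, d = grid_shape[0] * grid_shape[1], data.shape[1]
    # Step 1: initialise and normalise the weight vectors
    weights = rng.random((n_neurons, d))
    weights /= np.linalg.norm(weights, axis=1, keepdims=True)
    # grid coordinates of each output neuron (two-dimensional planar array, Fig. 1(b))
    coords = np.array([(i, j) for i in range(grid_shape[0])
                       for j in range(grid_shape[1])], dtype=float)

    for t in range(n_iter):
        eta = eta0 * np.exp(-t / n_iter)       # Step 5: decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)   # shrinking neighbourhood radius
        # Step 2: randomly select and normalise an input vector
        x = data[rng.integers(len(data))]
        x = x / np.linalg.norm(x)
        # Step 3: winning neuron = smallest Euclidean distance
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Step 4: Gaussian winning neighbourhood based on lateral grid distance
        lateral = np.linalg.norm(coords - coords[winner], axis=1)
        h = np.exp(-lateral ** 2 / (2 * sigma ** 2))
        # Adaptive phase: pull weights in the superior neighbourhood towards x, then renormalise
        weights += eta * h[:, None] * (x - weights)
        weights /= np.linalg.norm(weights, axis=1, keepdims=True)
    return weights  # Step 6: stop after n_iter iterations (or once the learning rate has decayed)
```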
The flow of the SOM neural network algorithm is shown in Figure 2.

Flow diagram of SOM neural network
The SOM algorithm utilizes self-organizing features to group data by mapping high-dimensional complex samples onto low-dimensional neuron arrays, but there are some shortcomings in the SOM neural network algorithm.
SOM lacks a specific objective function and does not guarantee convergence. Almost all learning algorithms are ultimately reduced to optimization problems: to reach the final goal, an objective function is constructed and the model is solved. The SOM algorithm, however, lacks a reasonable objective function, which makes its learning process somewhat blind; parameters such as the learning rate, the neighborhood function and the network structure must be preset, and overall convergence is difficult to guarantee.
The number of output neurons may not match the number of clustering categories. In clustering, the number of categories is difficult to predict in advance, and an SOM output neuron does not necessarily correspond to exactly one category; a single category may be merged with or split across several neurons.
SOM neural network measures the similarity between data in terms of Euclidean distance. Therefore, to use the SOM algorithm for data clustering, the following three problems need to be solved: first, set the objective function and optimize the convergence conditions to speed up the convergence of the network; second, organize and optimize the training results of SOM, design a self-clustering method, determine the number of clusters and divide the categories; third, explore whether the combination of SOM neural network and dimensionality reduction algorithm is feasible.
This section details a personalized recommendation model for sports training strategies based on LSTM [24] and reinforcement learning [25]. The LSTM long- and short-term interest layer and reinforcement learning are used to remove recommendation noise and to acquire the user’s time-varying points of interest. The proposed algorithm consists of a long- and short-term interest point acquisition layer, a reinforcement learning decision layer, and a reward output layer; its structure is shown in Figure 3. First, based on the recommendation interaction records, the algorithm selects a number of items that a particular user has interacted with as input to the model. Second, the embedding vectors of these items are fed into the LSTM as 3D vectors to obtain the feature information of each sequence; the agent in reinforcement learning is cascaded with an MLP layer and a Softmax layer, and the sequence dimension is converted into the total number of items to obtain the selection probability of each action. Next, the first

Model architecture of DPRM-LRL
The pseudo-twin component is used to fully capture the interaction between items in both subsequences and output delayed rewards. Finally, a gradient strategy is designed to perform robust gradient estimation.
Long Short-Term Memory Network (LSTM) is a deep learning model commonly used to process sequence data, especially for sequence data with long-term dependencies. It is an extension of recurrent neural network (RNN), which effectively captures and remembers long-term dependencies in sequence data by introducing a gating mechanism, while avoiding the gradient vanishing problem in traditional RNN.
First, the input of the long- and short-term network is constructed: the user’s interaction records are obtained and the resulting item features are grouped into inputs, so that each group can be fed to the LSTM to produce a new prediction. Second, the embedding vector of each item is obtained from the interaction record. Finally, after passing through the LSTM, MLP and Softmax layers, the sequence features predicted for each sequence are generated. At this point the task of the long- and short-term time-varying interest acquisition layer is finished, and the two-dimensional feature vectors of the original sequences are obtained.
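A compact PyTorch sketch of this pipeline is given below. The class name, embedding size, hidden size and layer structure are assumptions for illustration; only the overall flow (item embeddings → LSTM → MLP → Softmax over all items) follows the description above.

```python
import torch
import torch.nn as nn

class InterestLayer(nn.Module):
    """Sketch of the long/short-term interest acquisition layer (illustrative sizes)."""
    def __init__(self, n_items, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, emb_dim)            # item embedding vectors
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, n_items))  # map to the total number of items

    def forward(self, item_seq):
        # item_seq: (batch, seq_len) interacted item ids, in timestamp order
        emb = self.item_emb(item_seq)          # (batch, seq_len, emb_dim): 3D input to the LSTM
        out, _ = self.lstm(emb)                # sequence features for every position
        logits = self.mlp(out[:, -1, :])       # feature of the last step of each sequence
        return torch.softmax(logits, dim=-1)   # selection probability of every action/item
```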
The strategy component uses a multilink level agent module for decision making, where the sequence features are first fed into this module, all actions in each state are sampled, and the top
The actions are defined as all candidate items output after the long- and short-term interest point acquisition layer and the MLP and Softmax layers: each item is defined as an action, also called an arm, and all the actions are represented by the set
Given the definitions of the actions and states of the strategy component, the strategy scheme is as follows. First, starting from the state, the two-dimensional sequence features are fed into the MLP layer, which converts the vector of
After the corresponding item scores are obtained, the probability of each item is computed using Softmax, which converts the feature values into probabilities as follows:
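Assuming the standard Softmax transformation (the original equation is not reproduced here), the score (feature value) $z_i$ of each candidate item $i$ is converted into a selection probability

$$ p_i = \mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{|A|} e^{z_j}}, \qquad i = 1, \dots, |A|, $$

where $|A|$ denotes the total number of candidate items (actions).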
Given an interaction sequence
Based on the two subsequences generated, they are then aggregated using the pseudo-twin component. Specifically, given subsequences
After obtaining the embeddings of the two subsequences, a multilayer perceptron (MLP) is used to compute their delay rewards. The functions are shown in Eq. (10) and Eq. (11):
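Since the exact functional forms of Eqs. (10) and (11) are not reproduced here, the following PyTorch sketch only illustrates the general idea under assumed dimensions: the two subsequence embeddings share one MLP (hence "pseudo-twin"), which maps each of them to a scalar delayed reward.

```python
import torch
import torch.nn as nn

class PseudoTwinReward(nn.Module):
    """Hedged sketch of the pseudo-twin reward component; layer sizes are
    assumptions, not Eq. (10)/(11) themselves."""
    def __init__(self, emb_dim=128, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))  # shared ("twin") reward head

    def forward(self, emb_a, emb_b):
        # emb_a, emb_b: (batch, emb_dim) embeddings of the two subsequences
        r_a = self.mlp(emb_a).squeeze(-1)   # delayed reward of subsequence A
        r_b = self.mlp(emb_b).squeeze(-1)   # delayed reward of subsequence B
        return r_a, r_b
```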
In this model, sequences are selected using a reinforcement learning approach. The goal is to learn a stochastic strategy that maximizes the expected cumulative reward
Specifically, given a list of sampled actions (
Depending on the level of sampling
A pairwise comparison is formed as an additional constraint to fully exploit the information provided by these two subsequences and improve the learning process. Note that binary actions are used in the model and the probability is equal to 1. Therefore, the probability of generating
Based on the probabilities and rewards of the generated subsequences, the traditional policy learning procedure is transformed into a pairwise learning process, and the final gradient of
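As a point of reference, the standard REINFORCE estimator on which such a pairwise scheme is typically built is

$$ \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R_n \, \nabla_{\theta} \log \pi_{\theta}(a_n \mid s_n), $$

where $\pi_{\theta}$ is the Softmax policy over candidate items, $a_n$ a sampled action and $R_n$ the delayed reward returned by the pseudo-twin component; the pairwise scheme described above additionally contrasts the probabilities and rewards of the two sampled subsequences. This is a generic, assumed form rather than the paper's final gradient expression.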
Unlike previous sequence recommendation methods, the advantage of this method is that it provides a new reinforcement learning perspective for sequence recommendation.
Items are fed into a reinforcement learning component, and a pseudo-twin component is designed to perform the computation of rewards as well as the removal of noise for personalized recommendation of sports training strategies.
Calculate their rewards
The strategy network and pseudo-twin network are intertwined, so they need to be trained together. In the entire training process, the parts to be trained are the LSTM part, MLP part, and the neural network layer in the pseudo-twin module.
The standard preprocessing procedure is replicated. For all datasets, the presence of a comment or rating is treated as implicit feedback (i.e., the user interacted with the item), and timestamps are used to determine the order of operations. For splitting, each user’s history sequence is divided using the leave-one-out protocol: (1) the most recent interaction, which is used for testing, and (2) the preceding interaction sequence, which is used for validation.
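A minimal sketch of this splitting protocol, assuming each user's interactions are already sorted by timestamp and that the held-out positions follow the usual leave-one-out convention (latest interaction for testing, the one before it for validation); the function name is hypothetical.

```python
def leave_one_out_split(interactions):
    """Split one user's timestamp-ordered item list into train / validation / test.
    Assumes at least three interactions per user (illustrative only)."""
    train = interactions[:-2]   # everything before the two held-out interactions
    valid = interactions[-2]    # second most recent interaction -> validation
    test = interactions[-1]     # most recent interaction -> test
    return train, valid, test
```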
This paper takes 12,503 students in years one to four of a sports college as the research object and uses their test results under the Standard at the end of 2022 as the data source. Of the students who took the test, 5,518 were male and 6,985 were female. For each student, 10 items of basic information, 9 items of testing-environment information, and 8 items of physical fitness test data were collected; only the physical function indicators, the physical fitness-related indicators, and the overall evaluation were finally selected for data mining.
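As an illustration of how the selected indicators can be turned into the cluster labels reported below, the following sketch assigns each student's normalized test-score vector to the winning neuron of a trained SOM (see the `train_som` sketch above). The function name is hypothetical and the actual clustering pipeline and parameters used in the study may differ; note that the result tables label the output as k-means clusters.

```python
import numpy as np

def assign_clusters(scores, weights):
    """Map each student's test-score vector to a cluster label, defined here
    as the index of the winning (closest) SOM neuron. Illustrative only."""
    scores = scores / np.linalg.norm(scores, axis=1, keepdims=True)        # normalise each row
    dists = np.linalg.norm(scores[:, None, :] - weights[None, :, :], axis=2)
    return dists.argmin(axis=1)                                            # winning neuron per student
```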
After analyzing the clustering results, the girls’ group is divided into 5 categories of students, and the boys’ group is divided into 3 categories of students, each of which presents different characteristics of its physical test indicators. The specific results are analyzed as follows, and the final clustering results of the girls’ and boys’ groups are shown in Table 1 and Table 2, respectively. The overall trends of the clustering results are shown in Figures 4 and 5, respectively.
Table 1. Female group k-means clustering results

| Indicator | Cluster1 | Cluster2 | Cluster3 | Cluster4 | Cluster5 |
|---|---|---|---|---|---|
| Case number | 650 | 710 | 4520 | 895 | 210 |
| Lung capacity score | 85 | 84 | 85 | 78 | 83 |
| 50 m run score | 62 | 68 | 73 | 64 | 37 |
| Standing long jump score | 64 | 65 | 73 | 35 | 27 |
| Sit-and-reach score | 76 | 82 | 84 | 75 | 75 |
| Sit-ups score | 65 | 18 | 71 | 68 | 28 |
| 1000 m / 800 m run score | 34 | 67 | 75 | 65 | 24 |
| Total score | 65.25 | 69.26 | 79.83 | 68.56 | 52.13 |
Table 2. Male group k-means clustering results

| Indicator | Cluster1 | Cluster2 | Cluster3 |
|---|---|---|---|
| Case number | 1630 | 2468 | 1420 |
| Lung capacity score | 84 | 84 | 83 |
| 50 m run score | 81 | 78 | 68 |
| Standing long jump score | 64 | 65 | 16 |
| Sit-and-reach score | 75 | 72 | 63 |
| Pull-ups score | 72 | 7 | 5 |
| 1000 m / 800 m run score | 72 | 63 | 49 |
| Total score | 78.46 | 68.21 | 57.44 |

Mean scores of the female group after clustering

Mean scores of the male group after clustering
The girls in Cluster 3 achieved the highest results on all measures; these students have better overall physical fitness and account for 64.71% of the female students. The students in Cluster 5 had the lowest mean total score, 52.13, which is below the passing line, and lower means on every item except lung capacity and sit-and-reach, indicating that this category is weak in endurance, lower-limb explosive strength, and waist and abdominal strength. In addition, the girls in Clusters 1, 2 and 4 achieved poor results in long-distance running, sit-ups and the standing long jump, respectively, so each of these three cluster types has its own weak item.
The male students in Cluster 1 had the highest mean scores on the measured items, showing better physical fitness in this category, while the male students in Cluster 3 had the lowest total test scores. Clusters 2 and 3 scored poorly on pull-ups; these two categories also contain the most students, 74.27% of the total, and Cluster 3 had the lowest mean pull-up score of only 5. Cluster 3 also had a low mean score on the standing long jump. In addition, the Cluster 3 students failed on the mean total score, with male students mainly failing because of weak performance on the standing long jump and pull-up test items.
The results of the cluster analysis show a significant difference in physical fitness factors between the male and female student groups. The polarization within the boys’ group is more apparent. The boys in Cluster 1 were more balanced across all test indicators and had better overall fitness. Clusters 2 and 3 contain the larger numbers of students, and the weak points of the boys’ group are concentrated in the upper-body strength item, with mean pull-up scores only in the single digits. Cluster 3 had the worst overall performance of the three categories, with a failing standing long jump score.
The factors affecting the female group are more complex: apart from Cluster 3, each cluster category has its own key weakness. Cluster 1 is weak in the endurance event, Cluster 2 in sit-ups, and Cluster 4 in the standing long jump. Cluster 5 scored poorly on essentially every item. This suggests that different exercise methods should be chosen for different students to make up for their shortcomings, and it also demonstrates the utility of the SOM-based algorithm in optimizing physical education.
To better demonstrate the effectiveness of the proposed algorithm, four baseline algorithms (KNNBasic, KNNWithMeans, KNNBaseline and SVD) and four module-combination algorithms (DDPG+RLSTM, DDPG+LSTM, DDPG+T_self_attention and DDPG+self_attention) are compared with the proposed algorithm.
Table 3 shows the experimental comparison: on both the mL-1m and mL-100k datasets, the algorithm proposed in this paper outperforms the other algorithms on Ave_RMSE and Ave_MAE. In addition, all of the module-combination algorithms outperform the baseline algorithms, indicating that both this paper’s algorithm and the module-combination algorithms are superior to the baselines.
Table 3. Comparison of experimental results

| Algorithm | mL-1m Ave_RMSE | mL-1m Ave_MAE | mL-100k Ave_RMSE | mL-100k Ave_MAE |
|---|---|---|---|---|
| This algorithm | 0.402 | 0.231 | 0.243 | 0.106 |
| DDPG+RLSTM | 0.446 | 0.258 | 0.527 | 0.407 |
| DDPG+LSTM | 0.603 | 0.467 | 0.564 | 0.436 |
| DDPG+T_self_attention | 0.576 | 0.398 | 0.451 | 0.267 |
| DDPG+self_attention | 0.595 | 0.425 | 0.535 | 0.368 |
| KNNBasic | 0.922 | 0.726 | 0.975 | 0.774 |
| KNNWithMeans | 0.928 | 0.738 | 0.951 | 0.75 |
| KNNBaseline | 0.897 | 0.703 | 0.923 | 0.727 |
| SVD | 0.875 | 0.683 | 0.938 | 0.74 |
Tables 4 and 5 show the percentage improvement in Ave_RMSE and Ave_MAE of each combination algorithm relative to the baseline algorithms on the two datasets. All values in the tables are positive. On the mL-1m dataset, this paper’s algorithm improves Ave_RMSE by 56.68% and Ave_MAE by 68.70% over the worst-performing baseline, KNNWithMeans. On the mL-100k dataset, it improves Ave_RMSE by 75.08% and Ave_MAE by 86.30% over the baseline KNNBasic. Meanwhile, the results of the module-combination algorithms are all greatly improved over the baselines. Therefore, both this paper’s algorithm and the combination algorithms are superior to the baseline algorithms.
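The improvement percentages in Tables 4 and 5 are consistent with relative reductions of the error metrics, i.e. (baseline − ours) / baseline × 100, as the following quick check against Table 3 shows (the helper function name is illustrative):

```python
def improvement(baseline, ours):
    """Relative reduction of an error metric, in percent."""
    return (baseline - ours) / baseline * 100

# mL-1m, against the worst-performing baseline KNNWithMeans (values from Table 3)
print(f"{improvement(0.928, 0.402):.2f}")  # Ave_RMSE -> 56.68
print(f"{improvement(0.738, 0.231):.2f}")  # Ave_MAE  -> 68.70
# mL-100k, against KNNBasic
print(f"{improvement(0.975, 0.243):.2f}")  # Ave_RMSE -> 75.08
print(f"{improvement(0.774, 0.106):.2f}")  # Ave_MAE  -> 86.30
```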
Table 4. Ave_RMSE improvement (%) of this algorithm and the module combinations over the baselines

| Dataset | Algorithm | KNNBasic | KNNWithMeans | KNNBaseline | SVD |
|---|---|---|---|---|---|
| mL-1m | This algorithm | 56.40 | 56.68 | 55.18 | 54.01 |
| mL-1m | DDPG+RLSTM | 51.63 | 51.94 | 50.28 | 49.03 |
| mL-1m | DDPG+LSTM | 34.60 | 35.02 | 33.11 | 31.09 |
| mL-1m | DDPG+T_self_attention | 37.53 | 37.93 | 35.79 | 34.17 |
| mL-1m | DDPG+self_attention | 35.47 | 35.88 | 33.67 | 32.00 |
| mL-100k | This algorithm | 75.08 | 74.45 | 73.67 | 74.09 |
| mL-100k | DDPG+RLSTM | 45.95 | 44.58 | 42.90 | 43.82 |
| mL-100k | DDPG+LSTM | 42.15 | 40.69 | 38.89 | 39.87 |
| mL-100k | DDPG+T_self_attention | 53.74 | 52.58 | 51.14 | 51.92 |
| mL-100k | DDPG+self_attention | 45.13 | 43.74 | 42.04 | 42.96 |
Table 5. Ave_MAE improvement (%) of this algorithm and the module combinations over the baselines

| Dataset | Algorithm | KNNBasic | KNNWithMeans | KNNBaseline | SVD |
|---|---|---|---|---|---|
| mL-1m | This algorithm | 68.18 | 68.70 | 67.14 | 66.19 |
| mL-1m | DDPG+RLSTM | 64.46 | 65.04 | 63.30 | 62.24 |
| mL-1m | DDPG+LSTM | 35.67 | 36.72 | 33.57 | 31.63 |
| mL-1m | DDPG+T_self_attention | 45.18 | 46.07 | 43.39 | 41.73 |
| mL-1m | DDPG+self_attention | 41.46 | 42.41 | 39.54 | 37.77 |
| mL-100k | This algorithm | 86.30 | 85.90 | 85.42 | 85.68 |
| mL-100k | DDPG+RLSTM | 47.42 | 45.73 | 44.02 | 45.00 |
| mL-100k | DDPG+LSTM | 43.67 | 41.87 | 40.03 | 41.08 |
| mL-100k | DDPG+T_self_attention | 66.50 | 64.40 | 63.27 | 63.92 |
| mL-100k | DDPG+self_attention | 52.45 | 50.93 | 49.38 | 50.27 |
This part of the experiment collects the reward value (Reward) obtained by the agent after each prediction during testing, as well as the RMSE and MAE values of each round of prediction, and analyzes the convergence of the algorithm by observing the trend of these evaluation indexes. To evaluate convergence as a whole, this paper also observes the trend of the average return Ave_Reward in each round. Figures 6 and 7 show how the Ave_Reward values of this paper’s algorithm and of each module-combination algorithm change with the number of training iterations on the mL-1m and mL-100k datasets, respectively. The overall results of all algorithms on mL-1m are better than those on mL-100k; the reason was analyzed in the overall performance section. In both figures this paper’s algorithm performs best, but some issues remain:

mL-1m ave_reward trend chart

mL-100k ave_reward trend chart
On mL-1m, the algorithm eventually converges to a higher level than the other algorithms, although this advantage is not very pronounced, while on mL-100k the convergence level of this paper’s algorithm is much higher than that of the other algorithms. This indicates that, when training on the mL-1m dataset, the algorithm can better learn users’ long-term physical training interests, which dominate a student’s overall training interests, and that the state-enhancement mechanism used in modeling long-term training interests makes the algorithm more capable of capturing students’ long-term interests.
Taking student A of the sports college as the experimental subject, the recommendation algorithm produces the different physical training strategies and their recommendation degrees shown in Table 6. Ranking by the set-pair recommendation degree identifies the physical training strategies most compatible with student A. Analysis of the original physical test data shows that the recommended strategies account for both the similarity and the dissimilarity between the strategies and the experimental subject, which is consistent with the recommendation principle proposed in this study. In practice, real-world factors must also be considered: among the highly recommended training strategies, users can filter subjectively according to intensity, mode, relevance and other factors, so as to obtain the strategy that best matches their own situation.
Table 6. Physical training strategy set and recommendation degrees for student A

| Number | Degree of recommendation | Rank |
|---|---|---|
| 1 | 0.668 | 3 |
| 2 | 0.526 | 5 |
| 3 | 0.459 | 6 |
| 4 | 0.745 | 2 |
| 5 | 0.324 | 8 |
| 6 | 0.159 | 10 |
| 7 | 0.569 | 4 |
| 8 | 0.951 | 1 |
| 9 | 0.437 | 7 |
| 10 | 0.266 | 9 |
This paper proposes a personalized recommendation model for physical training strategies based on LSTM and reinforcement learning, building on the SOM neural network algorithm used to cluster students by physical fitness, and recommends personalized physical training strategies for students.
The SOM neural network algorithm can quickly classify students by physical fitness category, assign each student to a category, and also reveal the weak items of each category. In the girls’ group, Cluster 3 achieved better results on the mean values of all indicators, and these students account for 64.71% of all female students. In the boys’ group, Cluster 3 had the lowest mean pull-up score, only 5 points. Based on the characteristics obtained for each category, students can therefore be helped to select appropriate physical training strategies.
The personalized recommendation model for physical training strategies based on LSTM and reinforcement learning achieves advanced performance: the proposed algorithm outperforms the other algorithms on Ave_RMSE and Ave_MAE on both the mL-1m and mL-100k datasets. On the mL-1m dataset, its Ave_RMSE and Ave_MAE are 56.68% and 68.70% better than those of KNNWithMeans, respectively, and on the mL-100k dataset its Ave_RMSE and Ave_MAE are also better than those of the best-performing comparison algorithms.
In terms of the set-pair recommendation degree, the model delivers reliable physical training strategy recommendations, and it is able to make scientific recommendations while capturing both long- and short-term interests.