Research on the optimization system of athlete selection and training effect based on big data

In competitive sports, athlete selection and training effect determine the level of an athletic team. The rapid development of big data technology provides new methods and ideas for optimizing both. Big data relies on the fast classification and recording capability of computers for data collection. Through continuing advances in search engine technology, the various types of data on Internet platforms are exchanged and linked, and, with the help of cloud computing, individual data are aggregated into a large, all-encompassing database [1–3].

Athlete selection is the foundation of competitive sports, and long-term training is the main means of improving athletes' quality and skills [4–5]. With the aid of big data, selection and training methods become evidence-based, and therefore more professional and reliable. For example, the U.S. professional basketball league has a powerful system for data statistics and mining, and the quantification of game data is taken to the extreme: scoring, points conceded, shooting percentage, and other statistics amount to more than 90 technical indicators [6–7]. In addition, there are player contribution statistics, which add up a player's positive technical indicators (whole-game scoring, positional scoring, free-throw scoring and other active scoring, first-class assists, steals, and so on) and subtract negative data such as fouls and turnovers through a corresponding formula to produce a series of evaluation indicators [8–10]. On the basis of these evaluation indexes, athletes' basic information, physical indexes, in-competition performance, event results, and training-specific data are collected from professional medical records, training records of the athletes' training bases, and records of the competition process. Big data analysis is then used to assess the athletes' indexes quantitatively, so as to predict their athletic ability and development potential [11–13]. The tacit understanding and coordination between players in team events also affect results. More advanced data mining techniques can be used to discover correlations and regularities between athletes and their influence on the team's winning rate, providing a more scientific basis for selection [14–15].
According to the indicators and characteristics of each athlete, combined with the results of big data analysis, personalized training plans can be developed that strengthen weak links while sustaining the development of the athlete's strong events, thus improving the training effect [16–18]. Big data therefore has important application value for optimizing athlete selection and training effect.

In this paper, an index system for evaluating athlete selection is constructed. The indexes are pre-processed and analyzed for significance and correlation, and the homogeneity among the test items is verified as a basis for arranging training content. The median is introduced into the calculation of the profile (silhouette) coefficient, and an improved K-means clustering algorithm based on this coefficient is proposed. After the number of clusters is determined, the improved algorithm is used to cluster and analyze the athletes' scores on the selection indexes. Subsequently, athlete and sport models are constructed, and collaborative filtering and content-based recommendation algorithms are combined to recommend suitable training programs and to develop training intensity plans matched to each athlete's tolerance. Finally, an athlete selection and training optimization system is designed to optimize selection and training arrangements.

2

Construction and Quantitative Analysis of Athlete Selection Indicators Based on Big Data

2.1

Selection index system construction

Based on expert consultation, a review of the research literature on orienteering athlete selection, and the statistical results of a preliminary questionnaire survey, the questionnaire of this paper was compiled and the athlete selection indexes were constructed. The index system covers, as comprehensively as possible, six aspects: body morphology, physical function, physical fitness, special sports quality, mental quality, and coaches' evaluation. The final selection indexes are shown in Table 1.

Table 1.

Athletes selection indexes

Primary index	Secondary index
Physical form	Weight, height, sitting height, length of legs, finger distance, Kole index, chest circumference, shoulder width, arm’s length, body fat, upper and lower limbs length, hips width, waistline, arch
Body function	Lung activity, maximum oxygen uptake, hemoglobin, blood pressure, cardio index, blood testosterone level
Physical quality	Lead-up, 50m run, 1000(800)m run, fixed jump, seated forward bend(sit-ups)
Special quality	Reading chart, using compass, route selection, sense of direction, distance judgment, checkpoint capture, repositioning
Mental quality	Cognitive psychology, emotional emotion, will quality, personality characteristics, temperament type, general intelligence, motion intelligence
Trainer	Sports ability, sports sense, acceptance ability, combat psychological quality, will-quality

Across the six aspects established in Table 1 (physical form, physical function, physical quality, special sports quality, mental quality, and coach evaluation), there are 44 secondary athlete selection indicators in total.

2.2

Quantitative analysis of selection indicators

1)

Data source and pre-processing

In the previous section, a comprehensive qualitative exploration of athlete selection indicators was conducted to lay down the qualitative support for the subsequent research. In this section, the data from the athlete selection test will be analyzed to provide data support for the subsequent study.

Athlete fitness test data from the 2023-2024 academic year at the University of H was utilized for two quantitative analyses. The total amount of raw data was 7449 entries, of which the total amount of data for female athletes was 3435 entries and the total amount of data for male athletes was 4014 entries. Python's data analysis package Pandas was utilized to preprocess the raw data.
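The paper does not describe the preprocessing steps in detail. The following is a minimal sketch of what such a Pandas step might look like, assuming that preprocessing means dropping incomplete and duplicated records and splitting the data by gender. The column names (`gender`, `bmi`, `run_50m`) and the sample values are hypothetical and are not taken from the original data set.

```python
import pandas as pd

def preprocess(raw: pd.DataFrame) -> dict:
    """Drop incomplete or duplicated rows, then split the records by gender."""
    cleaned = raw.dropna().drop_duplicates().reset_index(drop=True)
    # "gender" is a hypothetical column name with values "M" / "F".
    return {g: part.drop(columns="gender") for g, part in cleaned.groupby("gender")}

# Illustrative records only; the third row has a missing BMI and is discarded.
raw = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "bmi": [22.0, 21.2, None, 20.8],
    "run_50m": [6.3, 7.3, 6.0, 7.2],
})
by_gender = preprocess(raw)
```

Splitting by gender at this stage mirrors the paper's separate analyses for male and female athletes.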

2)

Significance analysis between some indicators

“Physical fitness” was selected as the representative aspect for correlation analysis. Before the correlation between two indicators was verified, the significance of the difference between them was tested. The significance test is a t-test carried out with Python's SciPy scientific computing package. For convenience of presentation, the Chinese names of the indicators are replaced by English abbreviations: BMI (body mass index), CV (lung capacity), ST (seated forward bend), 50m (50-meter run), SLJ (standing long jump), SU (sit-ups), PU (pull-ups), and 1000m/800m (1000-meter run and 800-meter run).

The significance analysis heatmaps between the events of male and female athletes are shown in Figures 1 and 2, respectively. The data in the figures show that the significance value (p-value) between all pairs of events is ≤ 0.05 for both male and female athletes, which means that there is a significant difference between all the events and supports the validity of the data.
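The pairwise significance matrix behind such a heatmap can be sketched as follows. The paper states only that a t-test from SciPy is used; the choice of `scipy.stats.ttest_ind` with `equal_var=False` (Welch's test) is an assumption, as are the column names and the synthetic data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def pairwise_pvalues(df: pd.DataFrame) -> pd.DataFrame:
    """Return a symmetric matrix of t-test p-values for every pair of columns."""
    cols = list(df.columns)
    p = pd.DataFrame(np.ones((len(cols), len(cols))), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            # Welch's t-test is an assumption; the paper does not name the variant.
            result = stats.ttest_ind(df[a], df[b], equal_var=False)
            p.loc[a, b] = p.loc[b, a] = result.pvalue
    return p

# Synthetic scores for illustration only (hypothetical column names).
rng = np.random.default_rng(0)
scores = pd.DataFrame({
    "run_50m": rng.normal(6.0, 0.3, 50),
    "jump": rng.normal(235.0, 10.0, 50),
    "sit_reach": rng.normal(15.0, 3.0, 50),
})
p_values = pairwise_pvalues(scores)
```

The resulting `p_values` frame could be passed to any heatmap plotting routine; entries at or below 0.05 correspond to the significant differences reported in the figures.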

3)

Pearson correlation coefficient analysis between indicators

In this paper, the Pearson correlation coefficient is used to analyze the correlation between indicators. It should be noted that for BMI neither a larger nor a smaller value is simply better, so BMI needs to be considered separately. An absolute Pearson correlation coefficient greater than 0.2 is taken to indicate correlation, and a value below 0.2 is considered basically uncorrelated. The correlations in the physical fitness test data of athletes at the University of H were analyzed separately for male and female athletes, and the results of the Pearson correlation analysis are shown in Fig. 3 and Fig. 4, respectively.
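As an illustration of the 0.2 threshold described above, the following sketch computes the Pearson correlation matrix with Pandas and lists the pairs whose absolute coefficient exceeds the threshold. The column names and values are hypothetical and chosen only to make the behavior visible.

```python
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.2):
    """Return the Pearson matrix and the column pairs with |r| above the threshold."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return corr, pairs

# Illustrative data: sprint time falls as jump distance rises,
# while the flexibility score is unrelated to both.
scores = pd.DataFrame({
    "run_50m": [6.0, 6.2, 6.4, 6.6, 6.8],
    "jump": [250, 245, 240, 232, 228],
    "sit_reach": [15, 17, 14, 17, 15],
})
corr, pairs = correlated_pairs(scores)
```

In this toy data only the sprint and jump columns pass the threshold, and their coefficient is negative because a shorter sprint time accompanies a longer jump.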

It can be seen that, for both male and female athletes, the results of the 50m run and the standing long jump are negatively correlated. Because a better 50m result is a shorter time, this means that the better the 50m result, the better the standing long jump result tends to be. For other events that show weak positive or weak negative correlation, there is also some connection between the two events. Speed is mainly determined by muscle strength, muscle speed, and neuromuscular sensitivity. The standing long jump mainly reflects the explosive power of the lower limbs and the muscle power of the upper limbs, and explosive power is itself a combined expression of muscle speed, muscle strength, and neuromuscular sensitivity. Therefore, the 50m run and the standing long jump are homogeneous to a certain extent in their physical quality requirements. Homogeneity between events is one of the bases for arranging training content and improving performance.

Among the other weakly positively and weakly negatively correlated items, certain similarities between the items can also be found, which verifies the credibility of the source data to some extent. Apart from these items, most items are not correlated with each other. For example, the seated forward bend is almost uncorrelated with the other events: it reflects the flexibility of the body and is independent of the other items. No pair of items shows a moderate or stronger correlation, which also reflects the scientific soundness of the athlete selection index system constructed in this paper: it reflects the athletes' physical qualities comprehensively, covering physical form, physical function, and physical quality.

3

Cluster analysis based on selection indicators

3.1

K-means clustering based on optimizing initial cluster centers and profile coefficients

3.1.1

K-means clustering

The K-means clustering algorithm is an iterative relocation algorithm that is generally divided into two steps. The first step is assignment: the distance between each sample point and each clustering center is calculated, and each sample point is assigned to the closest cluster to complete the initial clustering. The second step is relocation: the clustering center of each cluster is recalculated, and the sample points are reassigned to their closest clusters. This operation is repeated until the clustering centers no longer change, so the clustering result is obtained through multiple iterations. The basic idea is as follows. First, K initial clustering centers are randomly selected from the given data set. Then the distance from every sample point to each center is calculated, and the sample points are assigned to clusters according to the nearest-distance principle. The clustering center of each cluster is recalculated in each iteration. When the clustering centers no longer change or a specified stopping condition is reached, the clustering ends and the results are output [19].
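The basic procedure described above can be sketched in NumPy as follows. This is the standard K-means with random initialization, not the improved algorithm of Section 3.1.3; the function name, the `seed` parameter, and the toy data are illustrative assumptions.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """Plain K-means: random initial centers, then assign and relocate until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocation step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(X, k=2)
```

Because the initial centers are drawn at random, different seeds can give different label numbering, which is one motivation for the improved initialization discussed later in the paper.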

Denote the average sample distance of dataset D as: $MeanDist(D) = \frac{1}{C_{n}^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{i} d(x_{i}, x_{j})$ (1), where n is the number of sample points in data set D, $C_{n}^{2}$ is the number of ways of choosing two sample points out of n, and $d(x_{i}, x_{j})$ denotes the Euclidean distance from sample point $x_{i}$ to sample point $x_{j}$.

The sum of squared errors for data set D is denoted as: $SSE = \sum_{i = 1}^{k} \sum_{x \in C_{i}} d(x, c_{i})^{2}$ (2), where k is the number of clusters, $c_{i}$ is the clustering center of the i-th cluster $C_{i}$, and $d(x, c_{i})$ is the dissimilarity between data point x and $c_{i}$. Different measures of dissimilarity often lead to different clustering results; the Euclidean distance is usually used here. The magnitude of SSE indicates how densely the sample points are grouped around their cluster centers.
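A minimal sketch of Eqs. (1) and (2) follows, assuming Euclidean distance throughout. The function names `mean_dist` and `sse` and the sample points are illustrative.

```python
from itertools import combinations
from math import comb

import numpy as np

def mean_dist(X: np.ndarray) -> float:
    """Eq. (1): average Euclidean distance over all unordered pairs of points."""
    total = sum(np.linalg.norm(a - b) for a, b in combinations(X, 2))
    return float(total / comb(len(X), 2))

def sse(X: np.ndarray, labels: np.ndarray, centers: np.ndarray) -> float:
    """Eq. (2): sum of squared distances from each point to its cluster center."""
    return float(sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centers)))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
labels = np.array([0, 0, 1])
centers = np.array([[1.5, 2.0], [6.0, 8.0]])
avg = mean_dist(X)                 # pairwise distances are 5, 10 and 5
err = sse(X, labels, centers)      # each point in cluster 0 contributes 6.25
```

Pairs with i = j contribute a zero distance, so summing over unordered pairs gives the same total as the double sum in Eq. (1).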

3.1.2

Euclidean distance

The Euclidean distance, also called the Euclidean metric, is named after the ancient Greek mathematician Euclid. It is a common distance metric that measures the straight-line distance between two points in n-dimensional space.

In 2D and 3D space the Euclidean distance is the ordinary distance between two points. The 2D Euclidean distance formula is: $d = \sqrt{{(x_{1} - x_{2})}^{2} + {(y_{1} - y_{2})}^{2}}$ (3)

The 3D Euclidean distance formula is: $d = \sqrt{{(x_{1} - x_{2})}^{2} + {(y_{1} - y_{2})}^{2} + {(z_{1} - z_{2})}^{2}}$ (4)

Generalizing to n-dimensional space, the Euclidean distance between points $x = (x_{1}, \dots, x_{n})$ and $y = (y_{1}, \dots, y_{n})$ is given by: $d = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}$ (5)

The Euclidean distance measures the distance between two sample points. The smaller the distance between two sample points, the more similar they are; the larger the distance, the less similar they are [20].

3.1.3

Improved K-means clustering algorithm

In this section, a K-means clustering algorithm based on optimizing the initial clustering centers and the profile coefficient (also known as the silhouette coefficient) is designed. For a better description of the algorithm, it is assumed that the dataset to be clustered is $D = \{x_{i} \mid x_{i} \in R^{p},\ i = 1, 2, \dots, n\}$ and each sample element is denoted as $x_{i} = (x_{i1}, x_{i2}, \dots, x_{ip})$, $1 \le i \le n$, where p is the dimension of the sample element.

The inter-cluster similarity is defined as: $CluSim = \min (d(c_{i}, c_{j}))$ (6)

where i = 1, 2, …, k–1 and j = i+1, …, k, $c_{i}$ is the cluster center of cluster $C_{i}$, $c_{j}$ is the cluster center of cluster $C_{j}$, and $d(c_{i}, c_{j})$ denotes the Euclidean distance between cluster centers $c_{i}$ and $c_{j}$. The smaller the CluSim value, the closer the two nearest clusters are and the more similar the clusters are to each other. Conversely, the larger the distance between clusters, the less similar they are.
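Eq. (6) amounts to taking the smallest distance over all pairs of cluster centers. A minimal sketch, with the function name `clu_sim` and the example centers chosen for illustration:

```python
from itertools import combinations

import numpy as np

def clu_sim(centers: np.ndarray) -> float:
    """Eq. (6): smallest Euclidean distance between any two cluster centers."""
    return min(float(np.linalg.norm(a - b)) for a, b in combinations(centers, 2))

centers = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
closest = clu_sim(centers)  # the pair (0, 0) and (3, 4) is the closest
```

At least two centers are required; with a single center there are no pairs and `min` raises an error.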

In order to better reflect the characteristics of the sample points, this section introduces the median into the calculation of the profile coefficient. Assuming that sample point $x_{i}$ is clustered into cluster C, the median distances are defined as: $a_{i} = median(d(x_{i}, x_{j}))$ (7) and $b_{i} = \min_{p, p \neq c} [median(d(x_{i}, x_{j}))]$ (8), where $a_{i}$ denotes the median distance between sample point $x_{i}$ and all other sample points belonging to the same cluster C, and $b_{i}$ denotes the median distance between sample point $x_{i}$ and all sample points in the nearest cluster other than C.

Assuming that sample point $x_{i}$ is clustered into cluster C, the median-based profile coefficient is defined as: $s_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})}$ (9)

The median-based mean profile coefficient over all n sample points is defined as: $S = \frac{1}{n} \sum_{i = 1}^{n} s_{i}$ (10)

The median-based average profile coefficient combines intraclass and interclass distances to evaluate the overall quality of a clustering, and takes a value between -1 and 1. A value close to 1 means that the intraclass distance of the samples is much smaller than the minimum interclass distance, which indicates a good clustering result for the data set; the closer the value is to 1, the better.
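Eqs. (7) to (10) can be sketched as follows. The handling of singleton clusters and of a zero denominator (both mapped to $s_{i} = 0$) is an assumption, since the paper does not specify these cases. The function name and the toy data are illustrative.

```python
import numpy as np

def median_silhouette(X: np.ndarray, labels: np.ndarray):
    """Median-based profile (silhouette) coefficient per point and its mean."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # singleton cluster: s_i = 0 (assumed convention)
        a = np.median(dists[i, same])                           # Eq. (7)
        b = min(np.median(dists[i, labels == c])                # Eq. (8)
                for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0            # Eq. (9)
    return s, float(s.mean())                                   # Eq. (10)

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
s, S = median_silhouette(X, labels)
```

Evaluating `S` for several candidate values of k and keeping the k with the largest value is one way such a coefficient can guide the choice of cluster number.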

3.2

Analysis of K-means clustering results for athlete selection test

3.2.1

Selection of K value

Figure 5 shows the fitting effect for different numbers of clusters on the male athletes' selection test scores for the school sports teams in the 2023-24 academic year, and Figure 6 shows the same for the female athletes' test scores. The curves show how the fit changes with the k value for male and female athletes respectively: as the number of clusters increases, the within-group sum of squares decreases and the between-group sum of squares increases. According to the figures and the background of the data, the scores can be clustered into 5 classes. Beyond 5 classes the rise of the curve slows and the fitted curve tends to be stable, indicating that clustering into 5 classes fits the data well.

3.2.2

Comparison of clustering results of male athletes' selection performance

The five categories of male athletes for each physical quality are summarized in Table 2. The first category had the largest number of male athletes, with 401. This category had moderate results in total score, body mass index, lung capacity, long jump, forward body flexion, and endurance run, but poorer results in sprinting and higher results in pull-ups.

Table 2.

The cluster of male athletes

	Total score	BMI	CV	50m	SLJ	ST	1000m	PU	Count
1	77.03	22.02	3524.06	6.28	236.62	15.52	240.14	8.19	401
2	67.31	23.53	3061.73	5.93	231.14	14.05	245.08	8.01	114
3	73.63	20.32	3226.11	5.92	234.86	15.14	242.22	8.57	295
4	77.95	20.82	3983.44	6.23	238.03	16.10	249.25	7.95	320
5	79.38	24.69	4598.46	5.90	239.67	17.05	253.55	6.90	135

The second category of male athletes had low total scores and poor results in lung capacity, body mass index, standing long jump, and forward body flexion, but performed better in sprint and endurance running events. This category had the fewest male athletes, with only 114.

The third category of male athletes had the lowest body mass index, higher performance in sprints, and best performance in pull-ups. There were 295 male athletes in this category.

The fourth category of male athletes performed better in long jump, forward body flexion, and lung capacity, but poorly in endurance running and sprinting, which indicates poor speed and endurance qualities. There were 320 male athletes in this category.

The fifth category of male athletes had the highest total score, with the best results in long jump and forward body flexion and the highest lung capacity and body mass index, but lower scores in pull-ups and endurance running. Although it is the category with the highest total score, it is not well-rounded and lacks endurance and upper body strength.

3.2.3

Comparison of clustering results of female athletes' selection performance

Table 3 shows the five categories of female athletes for each physical quality; the female athletes' physical fitness test results are likewise clustered into 5 categories. The first category had the highest total score (78.61) and contained 91 female athletes. This category had the best results in long jump, forward bend, endurance run, and one-minute sit-ups in the selection test, but the worst results in the sprint test. The best sprint results were obtained by the third and fifth categories. The third category had the lowest total score and the worst results in BMI, lung capacity, long jump, forward bend, endurance run, and one-minute sit-ups, but the best results in sprinting; it was also the smallest category, with only 65 female athletes. The fifth category excelled in sprinting and endurance running but was average in lung capacity, forward bend, long jump, and one-minute sit-ups; there were 222 female athletes in this category. The second category performed at an average level in every event tested, with balanced scores and neither outstanding nor especially poor results. The fourth category had a higher total score, with better and more balanced scores across the test events.

Table 3.

The cluster of female athletes

	Total score	BMI	CV	50m	SLJ	ST	1000m	SU	Count
1	78.61	22.83	3993.82	7.46	167.79	21.41	231.64	33.47	91
2	74.48	21.24	3035.79	7.26	163.68	20.02	235.85	31.72	285
3	67.00	23.59	2037.57	7.1	161.40	19.26	238.08	30.30	65
4	76.67	21.99	3455.24	7.21	165.65	20.56	234.49	32.42	237
5	72.69	20.78	2613.45	7.18	162.41	19.34	232.45	31.21	222

Overall, the clustering results for female athletes are more complex, mainly because the sprint scores behave differently from the endurance running scores. Female athletes with higher overall scores had poorer sprint results, while female athletes with lower overall scores had better sprint results.

4

Personalized training recommendation algorithm for athletes

4.1

Personalized Training Recommendation Algorithm

In this paper, athletes and sports are both modeled, and the collaborative filtering (CF) algorithm is combined with the content-based (CB) recommendation algorithm to recommend sports training programs that match each athlete's characteristics. The principle of collaborative filtering is to predict what the current athlete may like based on the past behaviors or opinions of the registered athlete population. From the perspective of the athlete group, the athlete-based collaborative filtering algorithm (UB-CF) is used to build a model of the athletes. From the perspective of the sports, the CB recommendation algorithm is used to build a model of the recommended objects based on the characteristics of each sport. Finally, the UB-CF and CB algorithms are combined into a personalized sports recommendation algorithm that gives personalized recommendations to the athletes.

4.1.1

Athlete modeling

The quality of the athlete model directly affects the recommendation effect of the whole recommendation system. The construction of the athlete model considers not only the basic information that the athlete fills in at first registration, which is used to build the initial model, but also the short-term and long-term changes in the athlete's interests while using the system. The athlete model is updated accordingly so that recommendations remain accurate.

The athlete model is made up of a model of the athlete's basic physical traits and a model of the athlete's rating matrix for the programs. The basic athlete profile model includes the athlete's basic attribute information and interests. Athlete keywords are extracted, each keyword being one item of basic information such as age, gender, BMI, or sports category of interest. The athlete model is represented as shown in equation (11): $\vec{U_{u}} = \{(k_{u}^{1}, w_{u}^{1}), (k_{u}^{2}, w_{u}^{2}), (k_{u}^{3}, w_{u}^{3}), \dots, (k_{u}^{n}, w_{u}^{n})\}$ (11)

where $\vec{U_{u}}$ denotes the feature vector of athlete u, $k_{u}^{n}$ denotes the n-th keyword of athlete u, and $w_{u}^{n}$ denotes the weight of the n-th keyword of athlete u.

The athlete-to-program rating matrix model represents the athlete's liking of the sport he or she is interested in, i.e., the rating scale. The athlete's rating level of the sport contains access, like, share, and favorite. The rating representation is implicitly obtained and expresses the athlete's preference for the recommended sport through the athlete's interface actions.

4.1.2

Sport modeling

Because recommender systems are applied in different domains, there is no standard guideline for a unified modeling approach, and the way the recommended objects are modeled has a great impact on the recommender system. In this paper, sports are modeled from two perspectives. According to the part of the human body on which a sport acts, sports can be divided into upper limb, trunk, and lower limb sports. According to the physical qualities involved, sports can be divided by five major qualities: strength, speed, flexibility, endurance, and agility. For example, if a sport involves both the upper limbs and the lower limbs, the weights of these two keywords are set to 1 and the weight of the trunk keyword is set to 0. If the sport involves the three qualities of strength, speed, and flexibility, their weights are set to 1 and the weights of endurance and agility are set to 0. The resulting sport model is shown in equation (12): $\vec{I_{u}} = \{(p_{u}^{1}, w_{u}^{1}), (p_{u}^{2}, w_{u}^{2}), (p_{u}^{3}, w_{u}^{3}), \dots, (p_{u}^{7}, w_{u}^{7}), (p_{u}^{8}, w_{u}^{8})\}$ (12)

where $p_{u}^{n}$ represents the n-th keyword and $w_{u}^{n}$ represents the weight of the n-th keyword. The weight is either 0 or 1, indicating whether or not the sport has this keyword.
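The eight-keyword sport model of Eq. (12) can be illustrated as a binary vector. The keyword identifiers and their ordering below are assumptions made for the sketch; the paper fixes only the three body regions and the five qualities.

```python
# Three body regions followed by five physical qualities (ordering is assumed).
KEYWORDS = [
    "upper_limbs", "trunk", "lower_limbs",
    "strength", "speed", "flexibility", "endurance", "agility",
]

def sport_vector(tags):
    """Return the 0/1 weights of Eq. (12) for a sport described by a set of keywords."""
    unknown = set(tags) - set(KEYWORDS)
    if unknown:
        raise ValueError(f"unknown keywords: {sorted(unknown)}")
    return [1 if keyword in tags else 0 for keyword in KEYWORDS]

# The example from the text: upper and lower limbs, with strength, speed and flexibility.
example = sport_vector({"upper_limbs", "lower_limbs", "strength", "speed", "flexibility"})
```

A fixed keyword order keeps every sport vector comparable position by position, which is what the cosine similarity in the content-based step relies on.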

4.1.3

Athlete-based collaborative filtering algorithm

The UB-CF algorithm measures and scores the degree of liking for these items based on the historical behavior of the athletes, and then calculates the relationship between the athletes' attitudes and preferences for the same item.

In this paper, the Pearson correlation coefficient is used to calculate the similarity between two athletes. The result lies in [-1, +1], where -1 means that the two athletes' ratings are perfectly negatively correlated, +1 means that they are perfectly positively correlated, and 0 means that there is no correlation. The formula is as follows: $sim(u, v) = \frac{\sum_{p \in P_{u, v}} (r_{u, p} - {\bar{r}}_{u}) (r_{v, p} - {\bar{r}}_{v})}{\sqrt{\sum_{p \in P_{u, v}} {(r_{u, p} - {\bar{r}}_{u})}^{2}} \sqrt{\sum_{p \in P_{u, v}} {(r_{v, p} - {\bar{r}}_{v})}^{2}}}$ (13), where $P_{u, v}$ denotes the set of items rated by both athlete u and athlete v, ${\bar{r}}_{u}$ and ${\bar{r}}_{v}$ denote the average ratings of athlete u and athlete v, p is an item in that set, and $r_{u, p}$ denotes the rating of athlete u on item p.

Using the above formula, the set of athletes most similar to the target athlete, i.e., the nearest-neighbor set, can be calculated. Denoting this set by N, the rating pred(u, p) of athlete u for item p can be predicted from N as follows: $pred(u, p) = {\bar{r}}_{u} + \frac{\sum_{v \in N} sim(u, v) * (r_{v, p} - {\bar{r}}_{v})}{\sum_{v \in N} sim(u, v)}$ (14)

where sim(u, v) denotes the similarity of athlete u and athlete v, ${\bar{r}}_{u}$ and ${\bar{r}}_{v}$ denote the average ratings of athlete u and athlete v, and $r_{v, p}$ denotes the rating of athlete v for program p.

Finally, the items are sorted in descending order of predicted rating, and the Top-N programs are selected for the target athlete.
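A minimal sketch of Eqs. (13) and (14) follows. Ratings are stored as nested dictionaries, each athlete's mean is taken over all of that athlete's ratings, and neighbors who have not rated the item are skipped; these choices, the names, and the ratings themselves are assumptions for illustration.

```python
from math import sqrt

def mean_rating(ratings, u):
    """Average of all ratings given by athlete u."""
    return sum(ratings[u].values()) / len(ratings[u])

def pearson_sim(ratings, u, v):
    """Eq. (13): Pearson similarity over the items rated by both athletes."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    mu, mv = mean_rating(ratings, u), mean_rating(ratings, v)
    num = sum((ratings[u][p] - mu) * (ratings[v][p] - mv) for p in common)
    den = (sqrt(sum((ratings[u][p] - mu) ** 2 for p in common))
           * sqrt(sum((ratings[v][p] - mv) ** 2 for p in common)))
    return num / den if den else 0.0

def predict(ratings, u, p, neighbors):
    """Eq. (14): mean-centered, similarity-weighted prediction for item p."""
    mu = mean_rating(ratings, u)
    num = den = 0.0
    for v in neighbors:
        if p not in ratings[v]:
            continue  # a neighbor without a rating for p cannot contribute
        s = pearson_sim(ratings, u, v)
        num += s * (ratings[v][p] - mean_rating(ratings, v))
        den += s
    return mu + num / den if den else mu

# Hypothetical ratings for three athletes.
ratings = {
    "a": {"swim": 5, "run": 3, "jump": 4},
    "b": {"swim": 4, "run": 2, "jump": 3, "bike": 5},
    "c": {"swim": 2, "run": 5, "jump": 1, "bike": 1},
}
sim_ab = pearson_sim(ratings, "a", "b")
sim_ac = pearson_sim(ratings, "a", "c")
pred_bike = predict(ratings, "a", "bike", neighbors=["b"])
```

Here athlete b rates items in the same pattern as athlete a and is a natural neighbor, while athlete c has a negative similarity and is left out of the neighbor set. Because Eq. (14) divides by the plain sum of similarities, including neighbors with negative similarity can make the denominator small or negative, which is a reason to restrict N to positively correlated athletes.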

4.1.4

Content-based recommendation algorithms

The CB recommendation algorithm relies on a single kind of feature, namely labels. Each sport is decomposed into a series of labels that are sufficient to characterize it, and the relationship between a sport and a user is derived from the user's behavior in the system (viewing, sharing, rating, and favoriting) [21].

The cosine similarity formula for the CB recommendation algorithm is as follows: $sim(A, B) = \frac{\sum_{i = 1}^{n} A_{i} * B_{i}}{\sqrt{\sum_{i = 1}^{n} {(A_{i})}^{2}} * \sqrt{\sum_{i = 1}^{n} {(B_{i})}^{2}}}$ (15)

where $A_{i}$ indicates the user's degree of preference for the i-th type of sport, and $B_{i}$ indicates whether the sport belongs to the i-th type; this membership value is either zero or one.

Finally, the sports are sorted in descending order of similarity, and the Top-N are recommended to the user.
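The content-based step can be sketched with Eq. (15) as follows. The preference weights and the 0/1 sport vectors are hypothetical values over eight assumed keywords and are not taken from the paper's data.

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Eq. (15): cosine similarity between a preference vector and a sport vector."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def recommend(preference, sports, top_n=2):
    """Rank sports by similarity to the preference vector and return the Top-N names."""
    scored = sorted(((cosine_sim(preference, vec), name) for name, vec in sports.items()),
                    reverse=True)
    return [name for _, name in scored[:top_n]]

# Hypothetical preference weights and binary sport vectors.
preference = [0.9, 0.1, 0.8, 0.2, 0.9, 0.0, 0.1, 0.7]
sports = {
    "badminton": [1, 0, 1, 0, 1, 0, 0, 1],
    "swimming": [1, 1, 1, 1, 0, 0, 1, 0],
    "yoga": [0, 1, 0, 0, 0, 1, 0, 0],
}
top2 = recommend(preference, sports, top_n=2)
```

Since cosine similarity ignores vector length, a sport with many keywords is not favored merely for having more non-zero entries.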

4.1.5

Training intensity development

Training intensity is one of the main factors determining training tolerance, and it usually refers to the degree of exertion and strain on the body during training. It is usually measured by physiological indicators such as enzyme shedding, heart rate, and maximum oxygen consumption. It has been found that training intensity affects the raw materials used by the body and the adaptations the body undergoes. Reasonable training intensity not only improves cardiac and respiratory capacity but also enhances muscle strength, bone density, and the body's responsiveness to the outside world, and reduces feelings of depression, thus improving physical fitness.

The exercise heart rate equation is as follows: $THR = (HR_{max} - HR_{rest}) * EI_{desired}$ (16)

where THR stands for exercise heart rate, $HR_{max}$ stands for maximum exercise heart rate, which can be estimated as 220 minus age, $HR_{rest}$ stands for resting heart rate, and $EI_{desired}$ stands for the desired training intensity.

The resting heart rate is obtained by measuring the pulse for one minute in the morning, while awake and before getting out of bed; alternatively, it can be measured on three mornings and the three measurements averaged. The desired training intensity is now commonly divided into three levels (minimum, optimal, and maximum intensity), which are calculated as follows: $THR = (HR_{max} - HR_{rest}) * \alpha + HR_{rest}$ (17)

In Eq. (17), when α is 0.9, the THR obtained is the maximum training intensity; when α is 0.8, it is the optimal training intensity; and when α is 0.7, it is the minimum training intensity. Because $HR_{max}$ decreases as age increases, the calculated training intensities show an overall downward trend from year to year. The three calculated intensities are displayed visually and compared with the athlete's heart rate in each training session, which intuitively prompts the athlete to adjust the training intensity in time.
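The three intensity levels of Eq. (17) can be computed as in the following sketch, which uses the 220 minus age estimate for the maximum heart rate given in the text. The function names and the example athlete are illustrative.

```python
def target_heart_rate(age: float, hr_rest: float, alpha: float) -> float:
    """Eq. (17): THR = (HR_max - HR_rest) * alpha + HR_rest, with HR_max = 220 - age."""
    hr_max = 220 - age
    return (hr_max - hr_rest) * alpha + hr_rest

def intensity_zones(age: float, hr_rest: float) -> dict:
    """Minimum, optimal and maximum training heart rates (alpha = 0.7, 0.8, 0.9)."""
    levels = (("minimum", 0.7), ("optimal", 0.8), ("maximum", 0.9))
    return {name: target_heart_rate(age, hr_rest, alpha) for name, alpha in levels}

# Hypothetical athlete: 20 years old with a resting heart rate of 60 bpm.
zones = intensity_zones(age=20, hr_rest=60)
```

For this example the heart rate reserve is 140 bpm, so the three levels come out at roughly 158, 172, and 186 bpm. Comparing a measured training heart rate against these bounds is the check the system visualizes for the athlete.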

4.2

Validation of Recommendation Results

Eight preparatory athletes were selected as experimental subjects and a questionnaire was designed to collect data on the effect of training recommendations. The questionnaire was in the form of a Likert scale with options categorized as 1, 2, 3, 4, 5, and 6, where 6 represents the highest level of satisfaction with the training recommendation and 1 represents the lowest level of satisfaction with the training recommendation.

The athletes' data were automatically divided into training set (70%) and test set (30%), the training set was used to predict the rating matrix of the test set, and then the root mean square error (RMSE) between the true and predicted values was calculated for evaluating the prediction quality of the algorithm, and the smaller the RMSE, the better the model. Since the training and test sets are randomly divided, we use the mean value of 200 experiments to represent the prediction results, which is more convincing. Figure 7 displays the experimental results and algorithm comparison.
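The evaluation protocol above (repeated random 70/30 splits with the RMSE averaged over the runs) can be sketched as follows. The predictor used here is a placeholder that predicts the mean training rating; it stands in for the recommendation algorithm and is not the paper's method. All names and the sample ratings are assumptions.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean square error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_rmse(ratings, predict_fn, runs=200, test_frac=0.3, seed=0) -> float:
    """Average RMSE over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    n_test = max(1, int(round(test_frac * len(ratings))))
    scores = []
    for _ in range(runs):
        idx = rng.permutation(len(ratings))
        test, train = idx[:n_test], idx[n_test:]
        preds = predict_fn(ratings[train], len(test))
        scores.append(rmse(ratings[test], preds))
    return float(np.mean(scores))

def mean_predictor(train, n):
    """Placeholder model: predict the training mean for every test rating."""
    return np.full(n, train.mean())

# Hypothetical Likert ratings on the 1 to 6 scale.
sample = [4, 5, 3, 6, 4, 5, 2, 4, 5, 3]
baseline = mean_rmse(sample, mean_predictor)
```

Averaging over many splits reduces the dependence of the reported RMSE on any single random partition, which is the rationale given in the text for using 200 experiments.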

The personalized training recommendation algorithm proposed in this paper achieves the lowest RMSE value among the four methods compared, i.e., the best result. Under the same similarity calculation model, lower RMSE values are obtained with the new recommendation method, which supports the rationality of the personalized training recommendation method proposed in this study.

Table 4 shows the real recommendation results of the test set, while Table 5 shows the recommendation results of the algorithm proposed in this paper. The actual top-2 recommended sports of the 5th athlete are badminton and swimming, while the algorithm recommended badminton and basketball. Apart from that, the actual and predicted top-2 recommendations of all other athletes are consistent, which also shows the advantage of this algorithm.

Table 4.

The true recommendation of the test set

Athlete	FB (football)	BB (basketball)	VB (volleyball)	TT (table tennis)	B (badminton)	T (tennis)	S (swimming)	Top-2 recommendation
1	0.23	0.10	0.07	0.23	0.23	0.10	0.07	Football/badminton/table tennis
2	0.44	0.44	0.13	0.44	0.55	0.44	0.44	Badminton, football/basketball/table tennis/tennis/swimming
3	0.10	1.60	0.07	0.07	0.55	0.10	2.12	Swimming, basketball
4	0.18	0.81	0.10	0.26	1.91	0.13	1.60	Badminton, swimming
5	0.23	0.34	0.10	0.34	3.17	0.07	0.86	Badminton, swimming
6	0.18	1.60	0.18	0.18	0.18	0.18	1.91	Swimming, basketball
7	0.10	3.80	0.10	0.10	0.10	0.10	2.86	Basketball, swimming
8	0.23	0.10	0.07	0.23	0.23	0.10	0.07	Football/badminton

Table 5.

The recommendation results of the proposed algorithm

Athlete	FB (football)	BB (basketball)	VB (volleyball)	TT (table tennis)	B (badminton)	T (tennis)	S (swimming)	Top-2 recommendation
1	0.50	0.27	0.09	0.20	0.64	0.09	0.19	Badminton, football
2	0.34	0.24	0.09	0.15	0.42	0.13	0.27	Badminton, football
3	0.37	0.84	0.10	0.26	0.34	0.10	1.64	Swimming, basketball
4	0.10	0.94	0.19	0.18	1.68	0.14	1.64	Badminton, swimming
5	0.19	1.20	0.13	0.19	2.22	0.13	0.61	Badminton, basketball
6	0.38	0.85	0.10	0.26	0.33	0.10	1.61	Swimming, basketball
7	0.35	1.27	0.09	0.20	0.20	0.09	0.94	Basketball, swimming
8	0.50	0.27	0.09	0.20	0.64	0.09	0.19	Badminton, football

The algorithm's top-2 sport recommendations for the 8 preparatory athletes are very close to the actual results: for 7 athletes the recommendations coincide with the actual situation, and for 1 athlete they differ from the actual situation by one sport.
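The comparison between the two tables can be reproduced from their score columns. The sketch below is one possible check, not the procedure used in this paper: it takes the two highest predicted scores per athlete and treats them as matching when both sports are among the actual top 2, counting actual scores tied with the second-highest value as equally valid. This tie rule and the function names are assumptions introduced here.

```python
# Column abbreviations from Tables 4 and 5.
SPORTS = ["FB", "BB", "VB", "TT", "B", "T", "S"]

ACTUAL = [  # Table 4 scores, athletes 1-8
    [0.23, 0.10, 0.07, 0.23, 0.23, 0.10, 0.07],
    [0.44, 0.44, 0.13, 0.44, 0.55, 0.44, 0.44],
    [0.10, 1.60, 0.07, 0.07, 0.55, 0.10, 2.12],
    [0.18, 0.81, 0.10, 0.26, 1.91, 0.13, 1.60],
    [0.23, 0.34, 0.10, 0.34, 3.17, 0.07, 0.86],
    [0.18, 1.60, 0.18, 0.18, 0.18, 0.18, 1.91],
    [0.10, 3.80, 0.10, 0.10, 0.10, 0.10, 2.86],
    [0.23, 0.10, 0.07, 0.23, 0.23, 0.10, 0.07],
]
PREDICTED = [  # Table 5 scores, athletes 1-8
    [0.50, 0.27, 0.09, 0.20, 0.64, 0.09, 0.19],
    [0.34, 0.24, 0.09, 0.15, 0.42, 0.13, 0.27],
    [0.37, 0.84, 0.10, 0.26, 0.34, 0.10, 1.64],
    [0.10, 0.94, 0.19, 0.18, 1.68, 0.14, 1.64],
    [0.19, 1.20, 0.13, 0.19, 2.22, 0.13, 0.61],
    [0.38, 0.85, 0.10, 0.26, 0.33, 0.10, 1.61],
    [0.35, 1.27, 0.09, 0.20, 0.20, 0.09, 0.94],
    [0.50, 0.27, 0.09, 0.20, 0.64, 0.09, 0.19],
]


def top_k(scores, k=2):
    """Sports with the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [SPORTS[i] for i in ranked[:k]]


def tied_top_k(scores, k=2):
    """All sports scoring at least as high as the k-th best score."""
    cutoff = sorted(scores, reverse=True)[k - 1]
    return {SPORTS[i] for i, s in enumerate(scores) if s >= cutoff}


def mismatched_athletes(actual, predicted, k=2):
    """1-based athlete numbers whose predicted top-k is not within the actual top-k."""
    return [n for n, (a, p) in enumerate(zip(actual, predicted), start=1)
            if not set(top_k(p, k)) <= tied_top_k(a, k)]


print(mismatched_athletes(ACTUAL, PREDICTED))
```

With the scores as tabulated, only athlete 5 is flagged, where basketball replaces swimming in the predicted top 2.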

The traditional collaborative filtering recommendation algorithm for athletes considers only whether the athlete likes an item, which often does not match actual training scenarios. The experimental results show that the algorithm in this paper achieves a better recommendation effect.

5

Athlete Selection and Training Optimization System

5.1

Purpose of system design

The system is designed to meet the needs of selection at different levels and stages, so that it simulates how the human brain selects the best candidates. Its overall function is the comprehensive management of selection information: athletes' test indexes, evaluation results, and analysis results are saved in the system's libraries. On the one hand, the system supports dynamic evaluation of athletes; on the other hand, it supports quick decisions, that is, both one-time decisions and the preferential ranking of athletes across multi-year training stages. The system can also evaluate athletes at the same level and handle routine tasks such as printing reports.

5.2

Basic structure of the system

The system structure designed in this paper is shown in Fig. 8. Users (coaches, researchers, managers, etc.) interact with the system through the multimedia human-computer intelligent interface, a human-computer interface based on multimedia technology and designed and realized with artificial intelligence methods. It provides functions such as multimedia information input and output, information storage and processing, and intelligent interaction.

The total system control is a software system based on the management system of each library. It carries out cooperative scheduling, mutual communication, overall control, resource sharing, and cooperative operations for each library.

The database management system (DBMS) is a software system for storing, querying, managing, and maintaining data. In line with the characteristics of athlete selection information, the system initially connects three databases and a data dictionary. The central database stores the selection test data and makes it available to other parts of the system. The standard database stores the evaluation standards and test descriptions for selection, which serve as the guideline for user evaluation. The evaluation and analysis results database stores the results produced after the system evaluates and predicts the athletes, for dynamic tracking and evaluation. The data dictionary includes the names of the databases and the help software for operation, management, and maintenance.
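As a rough illustration of how the three stores could relate to one another, the sketch below models them as tables in a single in-memory SQLite database. The table names, column names, and the pass-mark rule are hypothetical and chosen only for illustration; the paper does not specify a schema or an evaluation rule at this level.

```python
import sqlite3


def build_stores():
    """Create hypothetical tables standing in for the three databases."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- central database: raw selection test data
        CREATE TABLE test_data (athlete_id INTEGER, indicator TEXT, value REAL);
        -- standard database: evaluation standards and test descriptions
        CREATE TABLE standards (indicator TEXT PRIMARY KEY, pass_mark REAL, description TEXT);
        -- evaluation and analysis results database
        CREATE TABLE evaluation_results (athlete_id INTEGER, indicator TEXT, passed INTEGER);
    """)
    return conn


def evaluate(conn):
    """Judge each test value against its standard and store the outcome."""
    conn.execute("""
        INSERT INTO evaluation_results (athlete_id, indicator, passed)
        SELECT t.athlete_id, t.indicator, t.value >= s.pass_mark
        FROM test_data AS t
        JOIN standards AS s ON s.indicator = t.indicator
    """)
    conn.commit()


conn = build_stores()
# Hypothetical indicator and values, for illustration only.
conn.execute("INSERT INTO standards VALUES ('long_jump_cm', 250, 'standing long jump')")
conn.executemany("INSERT INTO test_data VALUES (?, ?, ?)",
                 [(1, 'long_jump_cm', 262.0), (2, 'long_jump_cm', 241.0)])
evaluate(conn)
print(conn.execute("SELECT * FROM evaluation_results ORDER BY athlete_id").fetchall())
```

Keeping standards in their own table means the judging criteria can be revised without touching the stored test data, and re-running the evaluation supports the dynamic tracking described above.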

The model library management system (MBMS) is the software system that handles the storage, calling, management, assembly, and construction of models. The model library (MB) mainly stores the evaluation, prediction, and statistical analysis models and methods. It also includes mathematical algorithms, applications, system programs, and other tools used for modeling. The image bank management system (IBMS) is a software system for the storage, calling, stitching, construction, and management of image information. The image library stores technical images and statistical graphs for athlete evaluation.

The knowledge base management system (KBMS) is a software system for storing, querying, managing, and maintaining knowledge information. In addition to the system knowledge base, the system includes a neural network knowledge base. The system knowledge base stores experiences, facts, and reasoning rules used by coaches for evaluation and prediction. The neural network knowledge base mainly contains the self-learning model of the neural networks and rules used for network reasoning.

The sample library management system is a software system for storing, querying, managing, and modifying sample information. The sample library is used to store the samples for neural network training.

The main function of the inference machine is to utilize the knowledge in the system knowledge base, call various models in the model library for inference, or utilize the self-learning model of the neural network knowledge base for self-learning inference.

6

Conclusion

In this study, an improved K-means clustering algorithm was designed to analyze the athletes' selection test performance, and the athletes were clustered into five categories based on the fitting results. Among the male athletes, the category 2 group had the worst performance in the selection test with a total score of 67.31. While category 5 male athletes had the highest total score performance with a total score of 79.38, their physical fitness was not comprehensive enough. Among female athletes, category 1 group had the highest total score with a total of 78.61 points. The category 3 group had the worst performance with a total test score of 67. The personalized training algorithm designed in this paper achieved lower RMSE values than the other three compared methods. Among the eight experimental subjects selected, the recommendation results of this paper's algorithm for seven of the athletes matched the actual situation, proving the accuracy of the new method's recommendations. It shows that the work in this paper has practical results for both the analysis of selection data and training recommendation optimization of athletes, and the selection and training optimization system constructed in this paper has solid technical support.


Yongkang Guan

Weijia Xue

Published online: 21 Mar 2025

Received: 11 Oct 2024

Accepted: 05 Feb 2025

DOI: https://doi.org/10.2478/amns-2025-0563

Keywords: K-means, Profile coefficients, Collaborative filtering, Content recommendation, Athlete selection

© 2025 Yongkang Guan, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
