Physical Health Data Analysis of Youth Sports Based on Cloud Computing and Gait Perception

Young people’s physical health is related to personal growth, and is also an important cornerstone for the prosperity of the country and the nation. Improving the physical health level of adolescents is conducive to implementing the Healthy China strategy. As an important means to exercise the physical and mental health of adolescents and cultivate healthy habits of adolescents, sports play an important role in improving adolescents’ physical health. At present, the relevant health data of youth sports is mainly analyzed by professionals, which has problems such as large data volume, long analysis time, and obvious subjectivity in the analysis results. Moreover, it is difficult to explore the deep characteristics of physical health data of youth sports, resulting in difficulty in grasping the real situation of adolescents’ physical health.

In recent years, the rapid rise of big data and cloud computing technology has provided a reference for the analysis of adolescent sports physical health data. By collecting large-scale athletes’ health data and utilizing the edge computing model in cloud computing, Xiaoyun Z et al. evaluated the athletes’ health status by adopting the bicubic polynomial interpolation method, and achieved a good result [1]. Based on secondary data from the national trends survey of health information, Steven T L D et al. analyzed the relationship between exercise and physical health in cancer survivors using a multiple logistic regression model. They concluded that exercise was beneficial for improving the physical health of cancer survivors [2]. By collecting pulse signal data before and after human exercise, and using motion recognition algorithms, Zhang et al. proposed a human health assessment method to achieve more accurate physical health assessment according to human movement [3]. Based on big data and cloud computing technology, Xinyuan F et al. analyzed and visualized physical health data in the WOS database to effectively evaluate the physical health of people of all ages [4]. Jeanette H et al. utilized big data technology to explore how the physical health of 60- to 70-year-old people in Denmark changed over time, which provided a reference for evaluating the physical health of the elderly population in Denmark [5].

The above research shows that cloud computing technology has been widely applied to data analysis in the field of physical health, and has shown good analysis effects. Therefore, this paper attempts to use cloud computing technology to analyze the physical health data of youth sports. Taking three common types of youth sports, namely competitive football, wrestling, and sports acrobatics [6], and collecting their physical health data such as gait, heart function index, and heart rate, this paper proposes a physical health data analysis method of youth sports based on cloud computing and gait perception.

The organizational structure of this paper is as follows: the first section describes the background and importance of physical health data analysis of youth sports, as well as related literature; the second section introduces the basic theory of cloud computing and the random forest algorithm in cloud computing; the third section constructs a cloud computing framework for analyzing physical health data of youth sports, and improves the random forest algorithm; based on the collected adolescents’ health data, the fourth section carries out a case simulation to verify the performance of the method; the fifth section summarizes the research results of this paper and looks forward to future research directions.

2 Basic Theory

2.1 Cloud Computing

Cloud computing utilizes the Internet for computing and storage, so as to provide various types and quantities of virtual resources to users with different needs. It has the characteristics of dynamic scalability, high flexibility, strong reliability and strong computing power [7-10].

The overall architecture of cloud computing is shown in Fig. 1. By integrating multiple low-cost computing entities into one entity with powerful computing power, and with the help of three basic services: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS), its computing power can be distributed to various terminals [11-12].

SaaS provides application programs through multi-tenancy. In short, the software infrastructure is hosted by the service provider, and then provided to users through the Internet.

PaaS services treat a platform as a service and provide shared applications to users.

IaaS refers to the outsourcing of resources and equipment that support operations to provide users with a virtualized infrastructure [13-15].

With the development of computer technology, cloud computing is widely used in various scenarios, covering many industries such as healthcare, education, and finance, and has achieved excellent results. Therefore, this paper analyzes the physical health data of youth sports based on cloud computing.

2.2 Principle of Random Forest Algorithm

Random forest algorithm is an ensemble classifier. By constructing multiple decision trees to participate in learning and training, the idea of voting method is utilized to count prediction results of each decision tree, and the category with the most votes is taken as the final category, which can be used to deal with classification and regression problems [16-18].

The random forest algorithm uses the Gini index and information entropy, as shown in equations (1) and (2) respectively, to split nodes.

(1) $Gini(D) = 1 - \sum_{i=1}^{m} p_{i}^{2}$

(2) $Entropy(D) = -\sum_{i=1}^{m} p_{i} \log_{2}(p_{i})$

Where m represents the number of categories in training set D, and p_i indicates the probability that a sample in the training set belongs to category c_i.
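As a small illustration of the two impurity measures, the following Python sketch computes them from a list of class labels, using the standard definitions (Gini index as one minus the sum of squared class probabilities, entropy as the negative sum of p_i log2 p_i). The label values are made up for the example and are not taken from the experimental data.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Empirical probability p_i of each category in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini index: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Information entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

# Hypothetical labels: two categories with equal frequency.
labels = ["healthy", "healthy", "at_risk", "at_risk"]
print(gini(labels))     # 0.5
print(entropy(labels))  # 1.0
```

Both measures are zero for a node that contains a single category and grow as the categories become more evenly mixed, which is why a split that lowers them is preferred.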

The splitting process can be expressed as: (3) $\min_{j, s} [\min_{c_{1}} \sum_{R_{1}} {(y_{i} - c_{1})}^{2} + \min_{c_{2}} \sum_{R_{2}} {(y_{i} - c_{2})}^{2}]$

Where R₁ and R₂ represent the two regions defined by split variable j and split point s, and y is the output variable of the data set. R₁ and R₂ are defined as follows:

(4) $R_{1}(j, s) = \{x \mid x^{(j)} \leq s\}$

(5) $R_{2}(j, s) = \{x \mid x^{(j)} > s\}$

Segmentation is repeated until the regression tree no longer grows. In this case, the solution of the decision tree model is:

(6) $f(x) = \sum_{i=1}^{n} \left(\frac{1}{N_{i}} \sum_{x_{k} \in R_{i}(j, s)} y_{k}\right) I(x \in R_{i})$

Where N_i is the number of samples falling in region R_i, and I(x ∈ R_i) is the indicator function. When x ∈ R_i, I(x ∈ R_i) = 1, otherwise I(x ∈ R_i) = 0.
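The split search in equation (3) and the piecewise-constant prediction in equation (6) can be sketched for a single feature and a single split as follows. This is a minimal illustration rather than the implementation used in the experiments; the function names and the toy data are assumptions.

```python
def best_split(xs, ys):
    """Return (s, c1, c2, loss) minimising the squared error of equation (3) for one feature."""
    best = None
    # The largest value is skipped as a candidate because it would leave R2 empty.
    for s in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= s]    # outputs in R1 = {x | x <= s}
        right = [y for x, y in zip(xs, ys) if x > s]    # outputs in R2 = {x | x > s}
        c1 = sum(left) / len(left)                      # the region mean minimises the inner sum
        c2 = sum(right) / len(right)
        loss = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if best is None or loss < best[3]:
            best = (s, c1, c2, loss)
    return best

def predict(x, split):
    """Piecewise-constant prediction: the mean output of the region containing x."""
    s, c1, c2, _ = split
    return c1 if x <= s else c2

# Toy data (hypothetical): the output jumps between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 5.0, 5.0, 9.0, 9.0, 9.0]
split = best_split(xs, ys)
print(split)               # (3.0, 5.0, 9.0, 0.0)
print(predict(2.5, split)) # 5.0
```

A full regression tree applies the same search recursively inside each region and over all candidate features j.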

Furthermore, the random forest algorithm uses the margin function mr(X, Y), as shown in equation (7), to judge classification accuracy.

(7) $mr(X, Y) = P[h_{k}(X) = Y] - \max_{1 \leq j \leq c,\ j \neq Y} P[h_{k}(X) = j]$

Where P[h_k(X) = Y] is the probability of a correct judgment, $\max_{1 \leq j \leq c,\ j \neq Y} P[h_{k}(X) = j]$ is the largest probability of an incorrect judgment, and c is the number of categories.

Random forest uses equation (8) to determine the parameter mtry, and equation (9) to determine the classifier generalization error PE.

(8) $mtry = 3\sqrt{\text{gene number selected for each separation point}}$

(9) $PE = P_{X, Y}(mg(x, y) < 0)$

Where (x, y) is a sample point, mg(x, y) represents the margin function, which can be expressed by equation (10), and P_{X,Y} is the probability over the space X, Y.

(10) $mg(x, y) = av_{k} I(h_{k}(x) = y) - \max_{j \neq y} av_{k} I(h_{k}(x) = j)$
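To make the margin function and the generalization error concrete, the sketch below computes the margin of one sample from the votes of the individual trees (the average vote for the true category minus the largest average vote for any other category) and then estimates PE as the fraction of samples with a negative margin. The votes and labels are hypothetical.

```python
from collections import Counter

def margin(votes, true_label):
    """mg(x, y): share of trees voting for y minus the largest share for any other category."""
    counts = Counter(votes)
    n = len(votes)
    correct = counts.get(true_label, 0) / n
    wrong = max((c / n for label, c in counts.items() if label != true_label), default=0.0)
    return correct - wrong

def generalization_error(all_votes, true_labels):
    """Empirical estimate of PE: fraction of samples whose margin is negative."""
    margins = [margin(votes, y) for votes, y in zip(all_votes, true_labels)]
    return sum(m < 0 for m in margins) / len(margins)

# Hypothetical votes of four trees on one sample and three trees on another.
print(margin(["A", "A", "B", "C"], "A"))                                       # 0.25
print(generalization_error([["A", "A", "B", "C"], ["B", "B", "A"]], ["A", "A"]))  # 0.5
```

A positive margin means the forest classifies the sample correctly, and a larger margin means the vote was won by a wider gap.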

Assuming there is a dataset D consisting of n samples, and each sample contains m features, the workflow of the random forest algorithm can be summarized in the following 5 steps:

1) Draw y samples from D with replacement and repeat n times to generate n training subsets, and generate a decision tree from each subset;

2) Randomly extract x features from the m features to form a feature subset. Then, based on the principle of minimizing node impurity, split points are selected to partition the child nodes, and the training subset is divided into the corresponding child nodes;

3) Repeat the division process of step (2) until all nodes of the decision tree are generated;

4) Repeat steps (2) and (3) until n decision trees are generated, and combine the n decision trees to form a random forest;

5) Input the data not selected in step (1) into the random forest for classification prediction, and obtain n prediction results. Then, following the voting method, the category with the most votes is taken as the final classification prediction result.

The above process can be illustrated in Fig. 2.
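The workflow can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not the algorithm evaluated in this paper: each tree is a single-split "stump" rather than a fully grown tree, prediction is shown on new points rather than on the out-of-bag data, and the (height, weight) rows and their labels are invented for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """One-split tree: choose the (feature, threshold) with the lowest weighted Gini impurity."""
    best_score, best_stump = None, None
    for j in feature_ids:
        for s in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= s]
            right = [y for r, y in zip(rows, labels) if r[j] > s]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best_score is None or score < best_score:
                best_score, best_stump = score, (j, s, majority(left), majority(right))
    if best_stump is None:  # all values identical: fall back to the majority class
        return (feature_ids[0], float("inf"), majority(labels), majority(labels))
    return best_stump

def predict_stump(stump, row):
    j, s, left_label, right_label = stump
    return left_label if row[j] <= s else right_label

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # step 1: bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)   # step 2: random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Step 5: majority vote over the predictions of all trees."""
    return Counter(predict_stump(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical (height in cm, weight in kg) rows with made-up labels.
rows = [(150, 40), (152, 42), (155, 45), (170, 60), (172, 63), (175, 66)]
labels = ["low", "low", "low", "high", "high", "high"]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (140, 35)), predict_forest(forest, (180, 70)))
```

Fixing the random seed keeps the sketch reproducible; a practical implementation would grow each tree recursively and evaluate it on the samples left out of its bootstrap subset.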

3 Physical Health Data Analysis of Youth Sports Based on Cloud Computing and Gait Perception

3.1 Cloud Computing Framework Design

Combining cloud computing and gait perception, this paper designs a cloud computing framework, shown in Fig. 3, to analyze and process the physical health data of youth sports. The framework includes five parts: data acquisition layer, cloud storage layer, cloud computing layer, data analysis layer and data application layer.

Data acquisition layer is used to collect physical health data of adolescents during different exercise processes, including respiration difference and heart function index. The data acquisition layer includes various sensors [19-20].

Cloud storage layer is used to store the physical health data of various youth sports, and this layer contains multiple data servers.

Cloud computing layer mainly uses a data mining algorithm to classify the stored data. The random forest algorithm randomly selects some samples from the large number of original physical health samples of adolescent sports to generate a new sample set, learns from this new sample set, and repeats the process to generate multiple sample sets and the corresponding decision trees. In this way, the random forest algorithm can more accurately capture the essential characteristics of adolescent sports physical health data from the sample data. Therefore, this paper chooses the random forest algorithm as the data mining algorithm for adolescents’ physical health data.

Data analysis layer is used to connect cloud computing and data application layers, and can transmit the analysis results of cloud computing layer to data application layer. Users can obtain the physical health data analysis results of youth sports through the PC end of data application layer [21-23].

3.2 Improvement of Random Forest Algorithm

Adopting the random forest algorithm as the data mining algorithm of the cloud computing layer has certain limitations in extracting features from a small number of samples. To address this issue, the random forest algorithm is improved.

1) Improvement of sampling method

Random forest carries out sampling through Bootstrap, which can effectively solve the problem of sample imbalance to some extent, but there are still limitations in effectively extracting features from a small number of samples. To solve this problem, the sampling method of random forest algorithm is improved in this paper. The improvement ideas are as follows:

Assuming that there is a low proportion of anomalous samples, after using the bootstrap method for sampling, multiple random sampling is performed to increase the proportion of anomalous samples, so that the weak classifier pays more attention to the characteristics of abnormal samples [24-26].

The specific improvement process is shown in Fig. 4. Firstly, threshold value a is set according to the characteristics of different samples, where a ∈ (0.5%, 1%) is used to determine whether the samples are unbalanced. The calculation formula is as follows:

(11) $a = \frac{\text{abnormal sample size}}{\text{normal sample size}}$

Then threshold value b is set, where b ∈ (0.01%, 0.1%) is the sample proportion of a single resampling. When a indicates that the samples are unbalanced, a sample is randomly selected from the samples. If this sample has not been selected before, it is added as a new sample point; otherwise, its weight is increased to obtain a new sample.
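One way to read this resampling loop, including the repeated balance check with a, is sketched below. Several details are assumptions made only for illustration: b is interpreted as a fraction of the total sample count, "adding a sample point" and "increasing its weight" are both modelled as incrementing a per-sample usage count, and the default thresholds are simply values inside the stated intervals.

```python
import math
import random

def rebalance(normal, abnormal, a_target=0.008, b=0.0005, seed=0):
    """Oversample `abnormal` until the ratio abnormal/normal of equation (11) reaches `a_target`."""
    rng = random.Random(seed)
    weights = {i: 1 for i in range(len(abnormal))}   # times each abnormal sample is used
    # Number of samples drawn in a single resampling round (assumed: b times the total count).
    batch = max(1, math.ceil(b * (len(normal) + len(abnormal))))
    while sum(weights.values()) / len(normal) < a_target:   # still unbalanced
        for _ in range(batch):
            weights[rng.randrange(len(abnormal))] += 1       # add the point or raise its weight
    return [abnormal[i] for i, w in weights.items() for _ in range(w)]

# Hypothetical data: 10,000 normal samples and 3 abnormal ones.
normal = list(range(10000))
abnormal = ["x", "y", "z"]
resampled = rebalance(normal, abnormal)
print(len(resampled))  # 81
```

With these inputs each round draws 6 samples, so the abnormal set grows from 3 to 81 entries before the ratio first reaches 0.8%.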

Finally, a is used again to determine whether the samples are balanced. If they are balanced, sampling ends; otherwise, sampling continues until the samples are balanced.

2) Improvement of decision tree feature splitting method

As can be seen, the improved sampling method mentioned above will increase the feature dimension, resulting in increased computational complexity and difficulty of the decision tree, which will affect the efficiency of random forest algorithm. Therefore, in order to avoid this problem, this paper improves the decision tree feature splitting method of random forest algorithm.

The decision tree splitting methods of the standard random forest algorithm mainly include CART and ID3, which can be selected according to the characteristics of the dataset. The features of physical health data of youth sports take many discrete values. Therefore, a decision tree splitting method using information entropy feature neutralization is proposed on the basis of the CART and ID3 decision tree splitting methods [27-28].

Firstly, outliers are processed for all physical health data features of youth sports, and a quantile threshold R ∈ [0.99, 0.999] is selected. If a data feature exceeds the R quantile, it indicates that the data is abnormal and needs to be dynamically adjusted.

Then, the features are finely segmented with a step length of 1 up to the R quantile, and the initial information entropy H of all data is calculated according to equation (12).

(12) $H = -\sum_{i=1}^{X_{[R]}} p_{i} \log p_{i} - p_{l} \log p_{l}$

(13) $p_{i} = \frac{count(X_{[i-1, i)})}{count(X)}$

(14) $p_{l} = 1 - R$

Where, p_i is the proportion of the number of feature values in [i – 1, i) to all numbers; p_l is the proportion of samples exceeding R quantile to the total number of samples; X_[R] is the number of feature X exceeding R quantile; X_{[i–1, i)} represents the sample value of feature X from i − 1 to i.
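Equations (12) to (14) can be illustrated with a short Python sketch that bins a non-negative feature into unit-width intervals [i − 1, i) up to the R quantile and adds one tail term with p_l = 1 − R. The nearest-rank quantile and the way the last bin index is derived from it are assumptions of this sketch, as are the sample values.

```python
import math

def binned_entropy(values, r=0.99):
    """Entropy H of equation (12): unit-width bins up to the R quantile plus one tail term."""
    xs = sorted(values)
    n = len(xs)
    quantile_value = xs[min(n - 1, math.ceil(r * n) - 1)]  # nearest-rank R quantile (assumed)
    upper = math.floor(quantile_value) + 1                 # index of the last bin [upper - 1, upper)
    h = 0.0
    for i in range(1, upper + 1):
        p_i = sum(1 for x in xs if i - 1 <= x < i) / n     # equation (13)
        if p_i > 0:
            h -= p_i * math.log(p_i)
    p_l = 1 - r                                            # equation (14)
    if p_l > 0:
        h -= p_l * math.log(p_l)
    return h

# Hypothetical feature: half of the values fall in [0, 1) and half in [1, 2).
values = [0.5] * 50 + [1.5] * 50
print(binned_entropy(values, r=0.99))
```

For this input the two bins each contribute 0.5 log 2, and the tail term adds −0.01 log 0.01, so H is slightly above log 2.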

Afterwards, an integer R₁ ≥ 1 is selected so that the number of sub-intervals does not exceed $\frac{X_{[R]}}{R_{1}}$, and adjacent intervals are merged to obtain a new interval. Then the information entropy H₁ of the new interval is calculated, and the maximum value of H − H₁ is calculated. R₂ is the penalty item set. When the number in the interval exceeds $\frac{X_{[R]}}{R_{1}} \times R_{2}$, the problem of a large amount of data in the maximum deviation interval can be solved, thereby realizing the feature neutralization. The specific process of feature neutralization is shown in Fig. 5.

Finally, the Gini coefficient is adopted to split on the merged features to form the decision tree, which effectively reduces the feature dimension, lowers the difficulty and complexity of the decision tree, and thereby improves the operating efficiency of the random forest algorithm.

3.3 Analysis of Physical Health Data of Youth Sports

Based on the cloud computing platform and the improved random forest algorithm, physical health data of youth sports are analyzed, and the specific process is shown in Fig. 6. When using cloud computing and improved random forest algorithm to analyze data, the first step is to calculate the weighted information gain rate of the different decision tree feature variables and the improved random forest decision tree. Then, the importance of the two characteristics is calculated, and k important physical health characteristics of youth sports could be screened out. Moreover, n different features are randomly selected from the remaining features, and k features are combined to form x-dimension features, which can reduce the physical health data of youth sports to x-dimension. Finally, by comparing the characteristics with the real influencing factors, the analysis results of physical health data of youth sports can be obtained.
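The screening step, which keeps the k most important features and adds n randomly chosen features from the remainder to obtain x = k + n dimensions, might be sketched as follows. The importance scores below are invented purely for illustration and are not results of this paper; only the feature names are taken from the data description.

```python
import random

def select_features(importance, k, n, seed=0):
    """Keep the k most important features plus n randomly chosen from the remaining ones."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    top_k, rest = ranked[:k], ranked[k:]
    extra = random.Random(seed).sample(rest, min(n, len(rest)))
    return top_k + extra

# Hypothetical importance scores (made up for this example).
importance = {
    "heart function index": 0.31,
    "weight": 0.22,
    "shuttle run": 0.18,
    "standing long jump": 0.12,
    "lung capacity": 0.09,
    "grip strength": 0.05,
    "height": 0.03,
}
selected = select_features(importance, k=3, n=2)
print(selected[:3])   # ['heart function index', 'weight', 'shuttle run']
print(len(selected))  # 5
```

Mixing a few randomly chosen features with the top-ranked ones keeps some diversity between trees while still reducing the data to x dimensions.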

4 Simulation Experiment

4.1 Establishment of Experimental Environment

The improved random forest algorithm is built on the Ubuntu 10.04 operating system, with Scala 2.10.4 and Java 1.7 used as the software development environment for the cloud platform. In addition, OpenStack is used to set up one host node and four worker nodes, which together form the cloud computing platform. The IP address of the host node is 172.16.0.1, and the IP addresses of the worker nodes are 172.16.0.2, 172.16.0.3, and 172.16.0.4.

The hardware configuration of the host node is a 4-core Intel Xeon E3-122v5 3.0GHz CPU with 32GB of memory. The hardware configuration of each worker node is an Intel Xeon E3-122v5 3.0GHz CPU with 16GB of memory.

4.2 Data Source and Preprocessing

In this experiment, 800 teenagers are selected to participate in three types of sports: competitive football, wrestling and sports acrobatics, and their physical health data such as heart function index, respiration difference, and lung capacity are used as experimental data.

Among them, data reflecting body shape include respiration difference, height, weight, waist circumference, hip circumference, chest circumference and body fat percentage; data reflecting physical functions include lung capacity, pulse pressure difference, heart function index, and heart rate; data reflecting physical fitness include grip strength, 50-meter dash, shuttle run, and standing long jump [29-30]. These physical health data are measured using an electronic sphygmomanometer, a spirometer and other instruments. Some data examples are shown in Table 1.

Table 1. Data examples

Serial number	Height (cm)	Weight (kg)	50m dash time (s)	Mean grip strength (kg)
1	153.23	48.26	8.9	20.36
2	156.78	51.29	9.1	21.05
3	161.29	55.30	8.7	22.68
…	…	…	…	…

Then, the data are normalized according to equation (15):

(15) $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

Where, x and x′ are values before and after normalization, x_max and x_min represent the maximum and minimum values.
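A minimal Python sketch of this min-max normalization is given below, applied to the three height values listed in Table 1. Returning zeros for a constant column is an assumption added to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using x' = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:               # constant column: assumed to map to 0
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

heights = [153.23, 156.78, 161.29]   # height (cm) from the data examples
normalized = min_max_normalize(heights)
print(normalized[0], normalized[-1])  # 0.0 1.0
```

The smallest value maps to 0 and the largest to 1, so features measured in different units become comparable before training.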

Considering that there are missing values in the collected data and that they may affect the final analysis results of physical health data of youth sports, all missing values are filled in according to equation (16) before the experiment is carried out. In addition, since the input of the random forest algorithm is numerical data, while some collected data such as “one-minute tennis toss” are textual, such data must be converted, with all non-numerical data mapped to values between 0 and 1.

(16) $g_{ij} = \frac{\sum_{k=1}^{K} g_{k} / d_{k}}{\sum_{k=1}^{K} 1 / d_{k}}$

Where d_k is the Euclidean distance to the k-th nearest sample, g_k represents the gene expression value of sample k, and g_ij indicates the weighted summation value.
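Equation (16) has the form of an inverse-distance-weighted average over the K nearest samples, a form commonly used to estimate a missing entry from similar records. The sketch below shows that computation under this reading; the function name, the choice of K, the handling of zero distance, and the toy rows are all assumptions.

```python
import math

def impute(target, donors, missing_index, k=2):
    """Estimate target[missing_index] from the k nearest donors, weighted by 1/d_k."""
    observed = [j for j in range(len(target)) if j != missing_index]

    def distance(row):
        # Euclidean distance computed on the features observed in `target`.
        return math.sqrt(sum((row[j] - target[j]) ** 2 for j in observed))

    nearest = sorted(donors, key=distance)[:k]
    for row in nearest:
        if distance(row) == 0:       # identical on all observed features
            return row[missing_index]
    numerator = sum(row[missing_index] / distance(row) for row in nearest)
    denominator = sum(1 / distance(row) for row in nearest)
    return numerator / denominator

# Hypothetical rows: the second feature of `target` is missing.
target = [1.0, None, 1.0]
donors = [(1.0, 10.0, 2.0), (1.0, 20.0, 3.0), (9.0, 99.0, 9.0)]
print(impute(target, donors, missing_index=1))  # 13.33..., i.e. (10/1 + 20/2) / (1/1 + 1/2)
```

Closer samples receive larger weights, so the estimate is pulled towards the value of the most similar complete record.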

4.3 Evaluation Index

In this paper, precision, recall, F1 value, accuracy and training time are selected as indexes to evaluate the improved random forest algorithm. The calculation formulas of precision, recall, F1 value and accuracy are as follows:

(17) $precision = \frac{TP}{TP + FP}$

(18) $recall = \frac{TP}{TP + FN}$

(19) $F1 = \frac{2 \times precision \times recall}{precision + recall}$

(20) $accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

Where, TP represents positive samples that are correctly classified; TN represents negative samples that are correctly classified; FP represents negative samples that are incorrectly classified as positive; FN represents positive samples that are incorrectly classified as negative.
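The four indexes follow directly from the confusion-matrix counts, as the short Python sketch below shows. The counts used in the example are hypothetical and do not come from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 value and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for a binary task with 200 samples.
metrics = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print(metrics["precision"])  # 0.9
print(metrics["accuracy"])   # 0.875
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which gives a quick sanity check on reported values.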

4.4 Results and Analysis

1) Improvement validation of random forest algorithm

(1) Sampling mode improvement verification

This study sets the decision tree feature splitting mode to ID3. To verify the effectiveness of improving the sampling mode of the random forest algorithm, the performance indexes of the algorithm before and after the sampling mode improvement are compared, as shown in Table 2. As can be seen, the improved sampling mode significantly improves the accuracy, recall, and other indicators of the random forest algorithm, but increases its training time from 128.26s to 142.39s.

Table 2. Comparison of Various Performance Indexes of Random Forest Algorithm Before and After Improvement of Sampling Mode

Sampling mode	Training time	Precision	Recall	F1 value	Accuracy
Bootstrap	128.26s	85.26%	85.47%	85.61%	86.74%
Improved random forests	142.39s	98.23%	98.14%	98.06%	98.41%

(2) Decision tree feature splitting method improvement verification

To verify the effectiveness of improving the decision tree feature splitting method of the random forest algorithm, this paper compares the performance indexes of the algorithm before and after the improvement, and the results are shown in Table 3. It can be seen from the table that the random forest algorithm with the improved decision tree feature splitting method has a shorter training time (58.36s, compared with 126.33s for CART and 128.26s for ID3), and its accuracy and recall have also improved to about 90%.

Table 3. Comparison of Performance Indexes of Random Forest Algorithm Before and After Improvement of Decision Tree Feature Splitting Method

Decision tree feature splitting method	Training time	Precision	Recall	F1 value	Accuracy
CART	126.33s	84.33%	85.28%	84.31%	83.28%
ID3	128.26s	85.26%	85.47%	85.61%	86.74%
Multiple splitting method	58.36s	90.11%	90.26%	90.55%	90.27%

(3) Verification of comprehensive improvement

To verify the effectiveness of the improvements made to the random forest algorithm, the performance indexes of the algorithm before and after the joint improvement of the sampling mode and the decision tree feature splitting method are compared on the experimental datasets. The splitting mode of the random forest algorithm before improvement is ID3. Table 4 shows the training time, accuracy, recall, F1 value, and precision of the random forest algorithm before and after improvement. It can be seen from Table 4 that, compared with the random forest algorithm before the improvement, the training time after improvement has decreased from 128.26s to 110.14s, and its accuracy, recall, F1 value and precision have been significantly improved. This indicates that the multiple sampling mode and the decision tree splitting method based on information entropy feature merging improve the precision of the random forest algorithm and to some extent reduce the difficulty and computational complexity of algorithm training. The operating efficiency of the algorithm is also improved, so the improvement is effective.

Table 4. Comparison of Performance Indexes of Random Forest Algorithm Before and After Improvement

Algorithm	Training time	Precision	Recall	F1 value	Accuracy
Random forest	128.26s	85.26%	85.47%	85.61%	86.74%
Improved random forest	110.14s	99.81%	99.56%	99.15%	99.92%

(4) Comparison with other algorithms

To verify the superiority of the improved random forest algorithm, its performance indexes are further compared with those of the traditional logistic regression method and LightGBM on the experimental data set, and the results are shown in Table 5. As can be seen, compared to logistic regression, the improved random forest exhibits better performance in terms of training time, accuracy, recall and precision. Compared to LightGBM, although the training time of the improved random forest algorithm is longer (110.14s versus 50.28s), it has obvious advantages in accuracy, recall and other indexes. Overall, the performance of the improved random forest algorithm is better. This shows that the improved random forest algorithm has certain advantages in the analysis of physical health data of youth sports based on cloud computing and gait perception.

Table 5. Comparison of Performance Indexes of Different Methods

Algorithm	Training time	Precision	Recall	F1 value	Accuracy
Logistic regression	155.37s	76.14%	76.22%	76.41%	75.08%
LightGBM	50.28s	91.23%	91.62%	92.27%	92.39%
Improved random forest	110.14s	99.81%	99.56%	99.15%	99.92%

2) Analysis results of improved random forest on physical health data of youth sports

The above verification results indicate that the improved random forest algorithm has certain effectiveness and superiority in the analysis of physical health data of youth sports. Therefore, this article adopts the improved random forest algorithm to further analyze physical health data of youth sports.

Fig. 7 shows the top 5 results of the improved random forest algorithm for the physical health data analysis of competitive football. From the figure, it can be seen that in competitive football for teenagers, the heart function index, weight, shuttle run, standing long jump, and lung capacity are the key indicators that affect the physical health of teenagers, which is consistent with our understanding. Therefore, for competitive football, the improved random forest algorithm can effectively analyze the physical health data of youth sports.

The top 5 results of the physical health data analysis of youth wrestling by the improved random forest algorithm are illustrated in Fig. 8. In youth wrestling, the heart function index, 50-meter dash, weight, shuttle run and respiration difference are the key indicators that affect the physical health of teenagers. The reason for this is that wrestling mainly exercises the strength of teenagers. The above results are consistent with actual perceptions. This reveals that the improved random forest algorithm can effectively analyze the physical health data of youth wrestling.

Fig. 9 shows the top 5 results of the improved random forest algorithm for the analysis of the physical health data of youth sports acrobatics. In sports acrobatics, the heart function index, 50-meter dash, shuttle run, standing long jump, and respiration difference are the key indicators affecting teenagers’ physical health. The reason for this is that sports acrobatics mainly exercise the comprehensive abilities of teenagers. The above results are consistent with actual perceptions. This indicates that the improved random forest algorithm can effectively analyze the physical health data of youth sports acrobatics.

To quantitatively analyze the physical health data of adolescents in competitive football, wrestling and sports acrobatics, the improved random forest algorithm is adopted. Combining the analysis of the above three types of data, the accuracy of the improved random forest in evaluating the Top1, Top3 and Top5 factors influencing adolescents’ physical health is computed and compared with that of logistic regression and LightGBM, and the results are shown in Table 6. As can be seen, the improved random forest algorithm has a high accuracy for different categories of youth sports, with an average accuracy of 99.22% on Top1, Top3 and Top5, which is significantly higher than that of logistic regression and LightGBM. This indicates that the improved random forest algorithm can effectively analyze the physical health data of adolescents during exercise, and the analysis results have positive significance in guiding adolescents’ exercise and improving their physical function.

Table 6. Analysis Results of Different Algorithms on Physical Health Data of Different Categories of Youth Sports

Category	Accuracy	Logistic regression	LightGBM	Improved random forest
Competitive football	Top1	76.23%	90.12%	98.88%
	Top3	78.33%	92.63%	98.92%
	Top5	79.33%	93.18%	99.99%
Wrestling	Top1	76.17%	90.05%	98.81%
	Top3	78.29%	92.36%	99.23%
	Top5	79.65%	93.07%	99.91%
Sports acrobatics	Top1	74.32%	90.91%	98.23%
	Top3	76.48%	92.76%	99.24%
	Top5	78.91%	93.66%	99.78%
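The average accuracy of 99.22% quoted for the improved random forest can be reproduced from the nine Top1, Top3 and Top5 entries of the table with a few lines of Python. The same calculation is shown for the other two methods as a point of comparison; the variable names are arbitrary.

```python
# Top1, Top3, Top5 accuracies (%) per sport category, copied from the table above.
results = {
    "Logistic regression": [76.23, 78.33, 79.33, 76.17, 78.29, 79.65, 74.32, 76.48, 78.91],
    "LightGBM": [90.12, 92.63, 93.18, 90.05, 92.36, 93.07, 90.91, 92.76, 93.66],
    "Improved random forest": [98.88, 98.92, 99.99, 98.81, 99.23, 99.91, 98.23, 99.24, 99.78],
}

def average(values):
    return sum(values) / len(values)

averages = {name: round(average(values), 2) for name, values in results.items()}
print(averages["Improved random forest"])  # 99.22
print(averages["Logistic regression"], averages["LightGBM"])
```

The improved random forest averages about 7 percentage points above LightGBM and more than 21 points above logistic regression on these entries.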

5 Conclusions

In summary, the analysis method of physical health data of youth sports based on cloud computing and gait perception realizes the rapid and accurate analysis of adolescents’ physical health data during sports. The improvement of the random forest algorithm in this method improves both its accuracy and its operating efficiency. Compared to the standard random forest algorithm, the precision, recall, F1 value and accuracy have been improved to varying degrees, reaching 99.81%, 99.56%, 99.15% and 99.92%, respectively, and the training time has been reduced from 128.26s to 110.14s. This method adopts the improved random forest algorithm in the cloud computing layer of the cloud computing framework to mine adolescents’ health data, thereby improving the precision of adolescent health data analysis. In addition, the average accuracy on Top1, Top3, and Top5 reaches 99.22%, which is significantly higher than that of logistic regression and LightGBM.

This study has achieved certain research results. However, due to the constraints of conditions, there are still some deficiencies to be improved. For example, the feature selection method of adolescents’ physical health data used is based on supervision, but with the increase of data reflecting adolescents’ physical health, the feature selection based on supervision may have certain limitations. Therefore, the next step of research will attempt to select features based on unsupervised methods, so as to improve the universality of the method.

Ming Lei

Published Online: Feb 27, 2025

Received: Oct 11, 2024

Accepted: Jan 28, 2025

DOI: https://doi.org/10.2478/amns-2025-0100

Keywords: cloud computing, youth sports, physical health data, random forest algorithm

© 2025 Ming Lei, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.