Open Access

Protection and Inheritance Strategy of She Traditional Sports Skills Based on Pattern Recognition

  
21 March 2025

Introduction

Traditional sports of ethnic minorities are an indispensable part of social and cultural life, and an irreplaceable component of national soft power and education [1-3]. Among China's minority cultures, She traditional sports have always been distinctive. They are closely akin to the folk arts around them, blending elements such as music, dance and martial arts without any Western influence, and they embody the national spirit of ancient China and an unyielding, valorous character; they are a cultural essence that condenses the wisdom of the She people [4-6].

She traditional sports ultimately originated in the practical activities of the She people's daily life and health practices, customs and habits, and resistance to external calamities, and they occupy an important place in the course of human culture [7-9]. Most She people have long lived in deep mountains and dense forests, and differences in production, lifestyle and geography have shaped a unique cultural form. She traditional sports are closely tied to production and labor, living customs, marriage and courtship, religious belief, and so on, and can be roughly divided into four categories: dances of a sporting nature, athletic activities of a sporting nature, recreational activities of a sporting nature, and martial arts in the traditional sense [10-11]. There are no sharp boundaries between these categories, and the same She traditional sports program can serve different purposes and show different characteristics on different occasions. However, many young people today know almost nothing about She traditional sports, so greater efforts should be made to protect and pass on this heritage in time.

This paper proposes a multi-Kinect human behavior recognition study for She traditional sports skills. The human skeletal motion model is analyzed, and the final data fusion result is obtained by predicting the human skeletal joint points and applying the Kalman filter mathematical model adapted in this paper. After collecting movement data of She traditional sports skills, the human skeleton is represented as a graph structure and processed with the dynamic spatio-temporal graph convolutional network DLSTM-GCN. Experiments are carried out on the skeletal data obtained after target detection and skeletal key point detection, and comparative analysis shows that the improved model better captures the short-term and long-term temporal dependencies between actions, improving action recognition accuracy.

Overview

In modern society, the development of traditional ethnic-minority sports can appear somewhat out of step with modern life. In this regard, the literature [12] describes how to protect traditional sports culture while governing the community through the dissemination of traditional sports culture and modern governance; indeed, the protection and dissemination of traditional sports heritage are inseparable from social and community governance. The literature [13] constructed a multi-source data analysis model to analyze the reasons for participation in She community sports, providing a reference for sustainable strategies for the dissemination and inheritance of She sports. With the development of intelligent technologies, traditional ethnic-minority sports are seeing new hope and no longer rely solely on oral transmission. For example, the literature [14] found that ethnic traditional sports have innovated their inheritance methods through smart media technology and diversified communication, promoting sustainable development. The literature [15] analyzed the evolution of the inheritance of ethnic traditional sports culture using intelligent PLS software. As always, traditional sports serve to maintain the health of ethnic people and spread their culture. The literature [16] used a neural network algorithm to recognize and segment images of ethnic-minority costumes; its recognition speed and accuracy help retain and authenticate traditional clothing elements. The literature [17] uses machine learning models to recognize Bima script handwriting for the preservation and proper dissemination of Bima script. The literature [18] recognized historical artifacts using digital image processing techniques such as image enhancement, image coding, and pattern recognition, with recognition speed and accuracy higher than traditional handwriting recognition.
Most of the above approaches can in fact be carried over to the recognition of She traditional sports, but recognition needs to be more accurate and more visual to match the characteristics of sports movements.

Human Skeletal Data in She Traditional Sports Skills under Multi-Kinect
Kinect Sensor
Kinect Sensor Architecture

The human behavior recognition system mainly uses Kinect as the visual sensor for data collection. Kinect is a human-computer interaction device released by Microsoft in 2010 for use with the Xbox 360 gaming console. In 2012, Kinect was integrated with the Windows platform through the launch of Kinect for Windows, encouraging developers to design Kinect-based somatosensory interaction devices, and in 2014 Microsoft introduced the new-generation Kinect 2.0 sensor. Both generations can acquire color and depth data simultaneously, and the acquired data can be used for 3D reconstruction, human posture recognition, gesture recognition, and other applications. The Kinect structure is shown in Figure 1.

Figure 1.

Kinect structure diagram

Acquisition of Kinect sensor depth information

Unlike ordinary sensors, the Kinect can collect not only color images but also depth information of the objects in the scene. Kinect 2.0 adopts the TOF (time-of-flight) detection method: an infrared emitter emits infrared light whose source is modulated by a square wave with a frequency in the range of 10-100 MHz [19]. Phase detection yields the phase shift and attenuation between the emitted light and the light reflected back from the object. From this, the total flight time of the infrared light from the source to the object surface and back to the sensor is obtained, and the distance from the object to the sensor follows from the round-trip flight time. The depth is calculated using equation (1).

$$2d = \frac{\text{phase}}{2\pi} \cdot \frac{c}{f}$$

Where d is the depth, phase is the phase shift of the modulated signal, c is the speed of light, with the speed of light in air $c = 3 \times 10^8$ m/s, and f is the modulation frequency of the sensor.
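As a minimal sketch of Eq. (1), the depth can be recovered from the measured phase shift as follows (the function name and example values are illustrative, not from the paper):

```python
import math

def tof_depth(phase_shift: float, mod_freq_hz: float, c: float = 3e8) -> float:
    """Depth from the phase shift of a modulated TOF signal, per Eq. (1):
    2d = (phase / 2*pi) * (c / f), hence d = phase * c / (4*pi*f)."""
    return phase_shift * c / (4.0 * math.pi * mod_freq_hz)

# e.g. a pi/2 phase shift at a 30 MHz modulation frequency gives d = 1.25 m
d = tof_depth(math.pi / 2, 30e6)
```

Note the factor of 2 in Eq. (1) accounts for the round trip of the light, so the one-way distance is half the total flight path.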

After obtaining the target human body joint coordinate information using the Kinect sensor, the joint information can be used for subsequent data fusion and human behavior recognition studies.

Multi-Kinect sensor-based external parameter calibration

Sensor calibration generally involves solving for a set of unknown parameters that describe the sensor projection process, by which any point in three-dimensional space can be projected to the corresponding point in the two-dimensional image plane. The unknown parameters usually include the focal lengths fx, fy, the projection center position cx, cy, the rotation matrix R that transforms the sensor's coordinate system to the global coordinate system, and the translation matrix T. fx, fy, cx, and cy are determined by the sensor itself and are referred to as the sensor's internal parameters, while R and T are the sensor's external parameters. In addition, the distortion model usually contains three radial distortion parameters (k1, k2, k3) and two tangential distortion parameters (p1, p2).

Three coordinate systems

The Kinect sensor has three coordinate systems: the color image coordinate system, the depth image coordinate system, and the skeletal space coordinate system. In the color image coordinate system, coordinate (X,Y) is the color image pixel value; in the depth image coordinate system, coordinate (X,Y) is the depth image pixel value, and Z represents the distance from the object to the sensor; and in the skeletal space coordinate system, coordinate (X,Y,Z) is the coordinate information of the joints of the human skeleton.

Global coordinate transformation

In this paper, the Kinect1 (abbreviated K1) spatial coordinate system is denoted S1 and the Kinect2 (abbreviated K2) spatial coordinate system is denoted S2, and S1 is taken as the global coordinate system. Figure 2 shows the imaging model of a target human body in the two Kinect coordinate systems: the 3D coordinates of any joint point P of the target are P1 = (x1, y1, z1)^T in S1 and P2 = (x2, y2, z2)^T in S2. From Eq. (2), the transformation between the two Kinect coordinate systems amounts to finding the rotation matrix R and translation matrix T of S2 with respect to S1.

$$P_1 = R P_2 + T$$

where $P_1 = (x_1, y_1, z_1)^T$ is the position in S1 of the joint whose coordinates in S2 are $P_2$, R is a 3 × 3 matrix containing three independent variables Φ, θ, and ψ, i.e., the relative rotation angles between the two coordinate systems S1 and S2, and T is a 3 × 1 translation matrix that also contains three independent variables.

Figure 2.

Kinect imaging model

3D coordinate transformation based on vector space

From subsection 2.2.2, the calibration of the two Kinects' external parameters amounts to solving the rotation matrix R and translation matrix T. In this paper, homonymous vectors and homonymous points are used to solve for R and T; the solution process is described as follows.

Let vector L = M1M2 be any vector in three-dimensional space, where M1 = (x1, y1, z1)^T and M2 = (x2, y2, z2)^T are any two points. Let L1 and L2 be the expressions of L in coordinate systems S1 and S2, and let N = L1/||L1|| and n = L2/||L2|| be the corresponding pair of normalized vectors. The expressions of the same space vector in different coordinate systems are called a pair of homonymous vectors.

From Eq. (2) and vector properties, the homonymous vectors N and n have the following relation as shown in Eq. (3).

N=Rn

From the Rodrigues rotation relations of screw theory, there exists a skew matrix U formed from a rotation vector u = (ux, uy, uz), with the expression shown in Eq. (4); U satisfies the relation (5) with the homonymous vectors N and n, and U and the rotation matrix R are related by Eq. (6).

$$U = \begin{bmatrix} 0 & -u_z & u_y \\ u_z & 0 & -u_x \\ -u_y & u_x & 0 \end{bmatrix} \qquad N - n = U(N + n) \qquad R = (I + U)(I - U)^{-1}$$

U is a skew-symmetric matrix and therefore singular, so u = (ux, uy, uz) cannot be obtained by direct inversion of Eq. (5); instead, the generalized (pseudo-inverse) matrix method is used to solve for the rotation vector u, and then R.

Given i (i ≥ 2) groups of homonymous vectors, let Ei = Ni + ni and Di = Ni − ni. Equation (5) can then be rewritten as the linear system (7), where E and D are given by Eqs. (8) and (9).

$$Eu = D$$
$$E = \begin{bmatrix} 0 & E_{1z} & -E_{1y} \\ -E_{1z} & 0 & E_{1x} \\ E_{1y} & -E_{1x} & 0 \\ \vdots & \vdots & \vdots \\ 0 & E_{iz} & -E_{iy} \\ -E_{iz} & 0 & E_{ix} \\ E_{iy} & -E_{ix} & 0 \end{bmatrix} \qquad D = \left[ D_{1x}, D_{1y}, D_{1z}, \cdots, D_{ix}, D_{iy}, D_{iz} \right]^T$$

The rotation vector u can be obtained by using the least squares method, as shown in Eq. (10).

$$u = (E^T E)^{-1} E^T D$$

Finally, R can be found by substituting matrix U into Eq. (6).

As for the translation matrix T, it can be solved from the homonymous points. For any point Pk (k = 1, 2, 3, ⋯), let its coordinates in the S1 and S2 coordinate systems be P1k and P2k; P1k and P2k are called a group of homonymous points. The translation matrix T is then solved by equation (11):
$$T = \frac{\sum_{k=1}^{K} \left( P_{1k} - R P_{2k} \right)}{K}$$

From the above analysis, only two or more groups of homonymous vectors in three-dimensional space are needed to derive the rotation matrix R, and once R is known, one or more groups of homonymous points suffice to derive the translation matrix T.
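The steps of Eqs. (5)-(11) can be sketched as follows. This is an illustrative implementation under our own reading of the sign conventions (the function names and the use of NumPy's least-squares routine in place of the explicit pseudo-inverse of Eq. (10) are assumptions, not from the paper):

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x such that [v]_x @ w = v x w (Eq. 4 form)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def solve_rotation(N_list, n_list):
    """Solve R from pairs of homonymous unit vectors via Eqs. (5)-(10):
    N - n = U(N + n), stacked as E u = D and solved by least squares,
    then R = (I + U)(I - U)^-1 (Eq. 6, the Cayley transform)."""
    E_blocks, D_rows = [], []
    for N, n in zip(N_list, n_list):
        E_blocks.append(-skew(N + n))   # D_i = u x E_i = -[E_i]_x u
        D_rows.append(N - n)
    E = np.vstack(E_blocks)
    D = np.concatenate(D_rows)
    u, *_ = np.linalg.lstsq(E, D, rcond=None)   # least squares, Eq. (10)
    U = skew(u)
    I = np.eye(3)
    return (I + U) @ np.linalg.inv(I - U)       # Eq. (6)

def solve_translation(P1, P2, R):
    """T averaged over homonymous points, Eq. (11)."""
    return np.mean(P1 - P2 @ R.T, axis=0)
```

Two non-parallel homonymous vector pairs already make the stacked system full rank, which matches the requirement i ≥ 2 in the text.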

In this paper, the homonymous vectors and homonymous points are constructed from the normal vectors of a three-plane target. The point cloud of the target is extracted with the two Kinects, and the parametric equations of the three planes are obtained using the PCL point cloud library, yielding the normal vectors and plane equations in the two Kinect coordinate systems; the intersection point of the three planes is then found from the plane equations.

Human Skeletal Joint Point Data Fusion under Multiple Kinect Constraints
Multi-constraint optimization of human skeletal joint point data

To reduce the computation required to fuse the two data sets and to improve the stability and reliability of the fused human skeletal joint positions, reliable data are screened before fusion. The positions of the human skeletal joints are interdependent and mutually constrained; according to the physiological characteristics of the human body, bone lengths are fixed, so bone lengths are derived from the coordinates of the acquired data and used as constraints to optimize the skeletal position data.

Each frame acquired by each Kinect device contains the 3D coordinates of 25 human skeletal joint points, labeled by joint type as g1(x1, y1, z1), g2(x2, y2, z2), ⋯, gi(xi, yi, zi), gj(xj, yj, zj), ⋯, g25(x25, y25, z25). For two neighboring joint points connected by a bone, the bone length is
$$L_{ij} = \left| g_i g_j \right| = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$

The lengths of corresponding bones acquired by the two different Kinects should be consistent. Based on the two data sets over N frames, the lengths $L_{ij}^{11}, L_{ij}^{21}, L_{ij}^{12}, L_{ij}^{22}, \cdots, L_{ij}^{1N}, L_{ij}^{2N}$ are computed and compared, bone lengths with large deviations are excluded, and the remaining K valid values are averaged to obtain the effective bone-length constant used for data constraint and optimization.

The corresponding formula for bone length Lij is shown in (12).

$$L_{ij} = \frac{1}{K} \sum_{k=0}^{K-1} L_{ij}^{k}$$
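The screening and averaging step can be sketched as below. The paper only says that lengths with "a large deviation" are excluded, so the median-based outlier rule and its 0.05 m tolerance are our own assumptions:

```python
import numpy as np

def bone_length(gi, gj):
    """Euclidean distance between two joint coordinates (the L_ij input)."""
    return float(np.linalg.norm(np.asarray(gi) - np.asarray(gj)))

def effective_bone_length(lengths, max_dev=0.05):
    """Average the measured bone lengths after discarding outliers that
    deviate from the median by more than max_dev metres, then apply
    Eq. (12): L_ij = (1/K) * sum of the K remaining valid lengths."""
    lengths = np.asarray(lengths, dtype=float)
    med = np.median(lengths)
    valid = lengths[np.abs(lengths - med) <= max_dev]
    return float(valid.mean())
```

The resulting constant can then be used to reject frames whose recomputed bone lengths violate the constraint.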

To represent the rotation angles of joint activity and screen the data against the actual physiological joint-angle thresholds, the rotation matrix R is converted to Euler-angle form, giving the transformed angles α, β, and γ of rotation about the axes in the order Z, Y, X.

The specific formula for converting the rotation matrix R to Euler angle form is shown in equation (13).

$$R = R_z(\alpha) R_y(\beta) R_x(\gamma) = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix}$$

At this point, $\alpha = \arctan\frac{p_{21}}{p_{11}}$, $\beta = \arctan\frac{-p_{31}}{\sqrt{p_{32}^2 + p_{33}^2}}$, and $\gamma = \arctan\frac{p_{32}}{p_{33}}$.
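A minimal sketch of the Z-Y-X Euler decomposition, assuming the non-degenerate case cos β ≠ 0 (the function name is ours; `atan2` is used in place of arctan so the quadrant is recovered correctly):

```python
import math
import numpy as np

def euler_zyx(R):
    """Decompose R = Rz(alpha) @ Ry(beta) @ Rx(gamma) (Eq. 13) into
    Euler angles, assuming cos(beta) != 0 (no gimbal lock)."""
    alpha = math.atan2(R[1, 0], R[0, 0])                        # arctan(p21 / p11)
    beta = math.atan2(-R[2, 0], math.hypot(R[2, 1], R[2, 2]))   # arctan(-p31 / sqrt(p32^2 + p33^2))
    gamma = math.atan2(R[2, 1], R[2, 2])                        # arctan(p32 / p33)
    return alpha, beta, gamma
```

The recovered angles can then be compared against physiological joint-angle thresholds to screen implausible frames.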

Determination of weighting coefficients for human skeletal joint point contribution

Since the two Kinect devices in this paper are placed at 90° to each other, consider the direction of the human posture relative to the YOZ plane of each Kinect's reference coordinate system. Let α1 be the angle between the human posture and the YOZ plane of the first Kinect's detection field, and β1 the angle between the posture and the YOZ plane of the second Kinect's detection field; α1 and β1 are complementary. The larger α1 is, the more the body faces the first Kinect, and the larger β1 is, the more it faces the second, so the confidence of the data can be expressed through these angles. The posture direction is represented by the vector d formed by the body's left and right shoulder joint points, and α1 and β1 are solved using the inverse-cosine relation between a vector and a plane.

Combining the two factors to determine the contribution of the human skeletal joints: the smaller the standing distance $d_{ki}$, the greater the contribution, and the larger the posture-facing angle, the greater the contribution. The two Kinect weights are therefore set to
$$w_{1i} = \frac{d_{2i}}{2(d_{1i} + d_{2i})} + \frac{\tan\alpha_1}{2(\tan\alpha_1 + \tan\beta_1)}, \qquad w_{2i} = \frac{d_{1i}}{2(d_{1i} + d_{2i})} + \frac{\tan\beta_1}{2(\tan\alpha_1 + \tan\beta_1)}$$
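The weight formulas can be sketched directly (the function name is an assumption; angles are taken in radians). By construction the two weights always sum to 1:

```python
import math

def kinect_weights(d1, d2, alpha1, beta1):
    """Contribution weights for the two Kinects, combining standing
    distance and body-facing angle as in the text; angles in radians."""
    dist_total = 2.0 * (d1 + d2)
    tan_total = 2.0 * (math.tan(alpha1) + math.tan(beta1))
    w1 = d2 / dist_total + math.tan(alpha1) / tan_total
    w2 = d1 / dist_total + math.tan(beta1) / tan_total
    return w1, w2
```

Note the distance term of the first Kinect uses $d_{2i}$ in the numerator: the closer the subject stands to Kinect 1 (small $d_{1i}$), the larger $w_{1i}$ becomes.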

When both Kinects lose data at the same time, both weights are set to 0, and the lost data must be compensated by prediction.

Skeletal joint point compensation under improved Kalman filtering

Kalman filtering is a common method for fusing multi-sensor data from different perspectives. It recursively uses the state of the previous moment as a parameter to predict the state at the next moment, then corrects the estimate with the observation to obtain the optimal prediction. The computational cost of the filtering prediction is small, which ensures both the speed of data processing during prediction and the optimality of the final estimate. The state equation is $s_k = A s_{k-1} + B u_{k-1} + q_k$ and the observation equation is $o_k = H s_k + r_k$, where A and H are the state-transition and observation matrices, B and $u_{k-1}$ are the control matrix and control vector, $q_k$ is the process white noise and $r_k$ the observation white noise, with corresponding covariance matrices $Q_k$ and $R_k$. From this, the predicted value is $\bar{s}_k = A s_{k-1} + B u_{k-1}$, the predicted-state covariance is $\bar{Y}_k = A Y_{k-1} A^T + Q_k$, and updating the state first requires the Kalman gain $K_k = \bar{Y}_k H_k^T (H_k \bar{Y}_k H_k^T + R_k)^{-1}$, which gives the updated state value $s_k = \bar{s}_k + K_k (o_k - H_k \bar{s}_k)$ and its corresponding covariance $Y_k = (I - K_k H_k) \bar{Y}_k$.
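The predict/update cycle above can be sketched as one function (a generic textbook Kalman step, with symbol names following the text; it is not the paper's exact implementation):

```python
import numpy as np

def kalman_step(s, Y, o, A, H, Q, R_cov, B=None, u=None):
    """One predict/update cycle of the standard Kalman filter used for
    joint-point compensation. s: state, Y: state covariance, o: observation."""
    # Predict: s_bar = A s + B u,  Y_bar = A Y A^T + Q
    s_bar = A @ s
    if B is not None and u is not None:
        s_bar = s_bar + B @ u
    Y_bar = A @ Y @ A.T + Q
    # Update: K = Y_bar H^T (H Y_bar H^T + R)^-1
    K = Y_bar @ H.T @ np.linalg.inv(H @ Y_bar @ H.T + R_cov)
    s_new = s_bar + K @ (o - H @ s_bar)
    Y_new = (np.eye(len(s)) - K @ H) @ Y_bar
    return s_new, Y_new
```

In the joint-compensation setting, s would hold a joint's position (and optionally velocity), and the Kinect measurement plays the role of o.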

To ensure the accuracy and real-time performance of the improved prediction method for nonlinear, complex human motion, the nonlinear motion characteristics are linearized so that the Kalman filtering idea can still be applied, enabling fast and efficient prediction of the state position.

Here the dynamic nonlinear characteristics of human motion are expressed through a Taylor series expansion. The position change of the human skeletal joints is expressed as p(x), computed as in (14) by expanding at x0; the velocity change is expressed by the first-order derivative p′(x0) of p(x) at x0. Similarly, retaining the first two terms of the Taylor expansion transforms the complex, real-time nonlinear change of the body's motion velocity into a linear representation.

$$p(x) = \frac{p(x_0)}{0!} + \frac{p'(x_0)}{1!}(x - x_0) + \cdots + \frac{p^{(n)}(x_0)}{n!}(x - x_0)^n + R_n(x)$$

For multidimensional variables, the Taylor series expansion takes the form of Eq. (15), where $J_p$ denotes the Jacobian matrix and $H(x_k)$ stands for the Hessian matrix.

$$p(x) = p(x_k) + J_p(x - x_k) + \frac{1}{2!}(x - x_k)^T H(x_k)(x - x_k) + o^n$$

Bringing this linear representation of the nonlinear human motion velocity into the computation yields the improved Kalman filter, whose state-transition equation is $s_k = p(s_{k-1}) + q_k$, while the corresponding observation equation is $o_k = h(s_k) + r_k$.

Expanding the two equations with the Taylor series formula gives Eqs. (16) and (17), where $P_{k-1}$ and $H_k$ denote the Jacobian matrices of $p$ at $s_{k-1}$ and of $h$ at $\bar{s}_k$, respectively.

$$s_k = p(s_{k-1}) + q_k \approx p(\hat{s}_{k-1}) + P_{k-1}(s_{k-1} - \hat{s}_{k-1}) + q_k$$
$$o_k = h(s_k) + r_k \approx h(\bar{s}_k) + H_k(s_k - \bar{s}_k) + r_k$$

From the improved formulas, the corresponding predicted value is $\bar{s}_k = p(s_{k-1})$, the improved predicted-state covariance is $\bar{Y}_k = P_{k-1} Y_{k-1} P_{k-1}^T + Q_k$, and the Kalman gain $K_k = \bar{Y}_k H_k^T (H_k \bar{Y}_k H_k^T + R_k)^{-1}$ changes accordingly, so the updated state value becomes $s_k = \bar{s}_k + K_k (o_k - h(\bar{s}_k))$ with corresponding covariance $Y_k = (I - K_k H_k) \bar{Y}_k$.
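The improved (extended) step can be sketched as follows. Since the paper does not give analytic Jacobians, a finite-difference approximation stands in for $P_{k-1}$ and $H_k$ here; this is our own simplification:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian, standing in for the analytic
    Taylor-expansion Jacobians P_{k-1} and H_k of Eqs. (16)-(17)."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.asarray(f(x + dx)) - fx) / eps
    return J

def ekf_step(s, Y, o, p, h, Q, R_cov):
    """One extended-Kalman step: linearise the state map p and the
    observation map h at the current estimate, then reuse the standard
    gain and update equations."""
    P = numerical_jacobian(p, s)          # Jacobian of the state map
    s_bar = np.asarray(p(s))              # s_bar_k = p(s_{k-1})
    Y_bar = P @ Y @ P.T + Q
    H = numerical_jacobian(h, s_bar)      # Jacobian of the observation map
    K = Y_bar @ H.T @ np.linalg.inv(H @ Y_bar @ H.T + R_cov)
    s_new = s_bar + K @ (np.asarray(o) - np.asarray(h(s_bar)))
    Y_new = (np.eye(s_new.size) - K @ H) @ Y_bar
    return s_new, Y_new
```

With linear p and h the step reduces exactly to the standard Kalman update of the previous subsection.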

Skeletal data-based action recognition methods
Graph Convolutional Neural Networks

Video and image data are Euclidean data: pixels are neatly arranged in a matrix and each pixel has the same number of neighbors. Traditional convolutional neural networks can only handle such Euclidean data and cannot process non-Euclidean graph data. In mathematical graph theory, graph data are topological graphs constructed from a set of nodes and the edges connecting them. Graph data have no regular structure, each node has a different number of neighbors, and there is no translation invariance, so a fixed-size convolution kernel cannot be used to extract node features. To process complex graph data efficiently, graph convolutional networks can be used; they essentially act as feature extractors that update node features using the connectivity relationships between nodes.

There are two types of graph convolution for GCNs: spectral-domain methods and spatial-domain methods; the latter is described here.

Spatial-domain graph convolution updates node features by aggregating the features of the nodes surrounding each node. In a CNN, the features of a neuron in one layer are obtained by aggregating the features of a local region of the previous layer; applying this idea of local connectivity to GCNs, the node iteration formula can be obtained intuitively as shown in Eq. (18).

$$H^{(l+1)} = A H^{(l)} W^{(l)}$$

In Eq. (18), $H^{(l)}$ is the node feature matrix in layer l, A is the adjacency matrix of the graph, and $W^{(l)}$ is the weight parameter of the layer-l network. A drawback of this calculation is that a node's own features are not taken into account, so A can be replaced by Ã = A + I, where I is the identity matrix. This accumulates the features of the neighboring nodes together with the node's own, but the resulting feature values grow with the number of neighbors. To solve this problem without changing the feature distribution, Ã is normalized over its rows, i.e., D⁻¹Ã, where D is the degree matrix of Ã, and over its columns, i.e., ÃD⁻¹; Eq. (18) can then be rewritten as the symmetrically normalized Eq. (19).

$$H^{(l+1)} = D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}} H^{(l)} W^{(l)}$$
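One layer of Eq. (19) can be sketched in a few lines (a minimal dense-matrix illustration without the nonlinearity, not the paper's actual network code):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer with the symmetric normalisation of
    Eq. (19): H' = D^{-1/2} A~ D^{-1/2} H W, where A~ = A + I adds
    self-loops and D is the degree matrix of A~."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
```

For skeleton data, A would be the 25-joint adjacency matrix of the human skeleton graph and H the per-joint coordinate features.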
Spatio-Temporal Graph Convolutional Network ST-GCN

With in-depth study of GCNs and the rapid development of artificial intelligence, the ST-GCN model, which integrates graph convolution with skeleton-based human action recognition, was proposed. The input of the model is the skeleton spatio-temporal graph constructed from human skeletal data. The spatial and temporal features of this graph are aggregated by a spatial graph convolutional network and a temporal convolutional network, gradually generating higher-level feature maps to realize action recognition. The process of Sanda action recognition is shown in Fig. 3.

Figure 3.

The process of Sanda action recognition

Dynamic spatio-temporal graph convolutional network DLSTM-GCN
Pre-processing

After human posture estimation, an action video yields a continuous sequence of human skeletal pose points. Because actions are continuous in time, the posture change between some adjacent frames is very small and the skeletal point coordinates barely differ, so a period of continuous pose frames contains many redundant frames. Removing them reduces the redundancy of the skeletal point sequence, improves model accuracy, simplifies the action representation, and reduces computational cost [20]. In the preprocessing stage, a simple algorithm based on non-maximum suppression is designed to remove redundant frames. It removes redundant pose frames in a targeted way while retaining frames with posture changes or significant action characteristics, preserving more distinctive action information, thereby improving action recognition accuracy and reducing the cost of model inference.

The non-maximum suppression method is used to remove redundant frames from the continuous skeletal pose sequence; the redundancy-elimination condition is shown in (20).

$$G(P_i, P_j) = 1\left[ \mathrm{oks}(P_i, P_j) \ge \eta \right]$$

Where j and i are posture frame numbers with initial values i = 1, j = 2, the function oks judges whether two human skeleton postures $P_i$ and $P_j$ are similar, and η is the threshold of the redundant-frame elimination condition. When the oks value between the two postures is greater than or equal to η, G takes the value 1, indicating that the two skeleton postures are duplicated or the change between them is slight; $P_j$ is judged redundant, deleted, and the comparison moves to the next frame j + 1. When the oks value is less than η, G takes the value 0, indicating that the two skeleton postures differ; $P_i$ is included in the input sequence and the comparison continues from the next frame, until all consecutive skeleton pose frames have been compared. oks is calculated as shown in equation (21).

$$\mathrm{oks} = \frac{\sum_i \exp\left( -\frac{d_i^2}{2 s^2 k_i^2} \right) \delta(V_i > 0)}{\sum_i \delta(V_i > 0)}$$

Where $d_i$ is the Euclidean distance between corresponding joints in the current and previous frames, s is the scale of the human posture, $k_i$ is the scale factor of the real key point, usually set as the distance from the key point center to the edge of the body part, $V_i$ is the visibility of the real key point, and $\delta(V_i > 0)$ equals 1 when $V_i > 0$ holds and 0 otherwise. By this formula the similarity of the joint points lies in [0, 1]: oks is 1 for a stationary target, and close to 0 when the gap between the character's movements in consecutive frames is too large.
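Eqs. (20)-(21) can be sketched together as below. The function names, the per-joint parameter layout, and the example threshold are our own illustrative choices:

```python
import numpy as np

def oks(pose_a, pose_b, s, k, vis):
    """Object-keypoint similarity between two pose frames (Eq. 21).
    pose_a/pose_b: (N, 2 or 3) joint coordinates, s: posture scale,
    k: per-joint scale factors, vis: per-joint visibility flags."""
    d2 = np.sum((np.asarray(pose_a) - np.asarray(pose_b)) ** 2, axis=1)
    mask = np.asarray(vis) > 0
    sim = np.exp(-d2 / (2.0 * s ** 2 * np.asarray(k, dtype=float) ** 2))
    return float(sim[mask].sum() / mask.sum())

def drop_redundant(frames, s, k, vis, eta=0.95):
    """NMS-style pruning (Eq. 20): drop frame j when oks(i, j) >= eta,
    always comparing against the last retained reference frame."""
    kept = [frames[0]]
    for f in frames[1:]:
        if oks(kept[-1], f, s, k, vis) < eta:
            kept.append(f)
    return kept
```

Near-identical frames collapse onto their first occurrence, while frames with significant pose change survive the pruning.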

Identifying Termination Strategy Networks

As an improvement to ST-GCN, this paper uses an LSTM-based recognition termination strategy network instead of the TCN. Compared with the TCN, the LSTM is better at capturing temporal correlation in time-series data and has stronger temporal modeling ability, capturing the temporal information and long-term dependencies in action sequences. LSTMs usually have fewer parameters, which helps reduce the overall complexity and computation of the model and makes efficient action classification easier in resource-constrained environments, and their hidden states provide richer action representations that can serve as latent representations of the action sequences.

Then, the LSTM-based policy module uses the currently pooled features and the hidden state $h_t^{(n)}$ containing historical frame information for action classification, obtaining the probability distribution $p_t^{(n)}$ over action categories. For early termination of classification, the network is given an action probability threshold ρ and a count threshold $T_{\min}$. Recognition terminates only when the probability of the recognized action in the current frame exceeds ρ and a total of $T_{\min}$ recognitions including the current frame satisfy this condition; that is, model inference stops only when formulas (22) and (23) hold simultaneously, and the action category at that moment is output as the final recognition result. In formula (22), $\max(p_t^{(n)})$ is the maximum of the probability distribution $p_t^{(n)}$; in formula (23), $T_{\max(p_t^{(n)}) > \rho}$ is the number of frames up to the current one whose classification result is the same class with $\max(p_t^{(n)}) > \rho$.

$$\max(p_t^{(n)}) > \rho$$
$$T_{\max(p_t^{(n)}) > \rho} = T_{\min}$$
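The termination test of Eqs. (22)-(23) can be sketched as a standalone check (the dictionary-based probability format and function name are illustrative assumptions, not the paper's interface):

```python
def should_terminate(prob_history, rho=0.9, t_min=3):
    """Early-termination test of Eqs. (22)-(23): stop when the current
    frame's top class has probability > rho and the same class has
    exceeded rho in at least t_min frames, including the current one.
    prob_history: list of {class: probability} dicts, oldest first."""
    current = prob_history[-1]
    top_class = max(current, key=current.get)
    if current[top_class] <= rho:        # Eq. (22) fails
        return False
    hits = sum(1 for p in prob_history   # count frames satisfying Eq. (23)
               if max(p, key=p.get) == top_class and p[top_class] > rho)
    return hits >= t_min
```

Requiring $T_{\min}$ consistent high-confidence frames, rather than a single one, guards against terminating on a transient misclassification.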
Protection and Inheritance under the Identification of She Traditional Sports Skills

In order to verify the feasibility of the research methodology proposed in this paper, this section takes the traditional sports skill “Shequan” of the She ethnic group as an example for data collection and action recognition.

Experimental analysis of movement data fusion for She traditional sports skills

The data fusion experiments in this project involve filtering estimation and fusion, conducted on Shequan data captured simultaneously by the two Kinect sensors. The acquisition frame rate is 30 frames per second, so multiple frames of skeletal point data are obtained over a sampling time of several seconds; after coordinate calibration, these data are used in the data fusion simulation experiments.

When two Kinects collect Shequan data, both sensors can simultaneously track the joint position information of the movement, but problems such as body occlusion and data loss may occur. When one Kinect captures the data and the other only estimates it or misses it entirely, the fused data may be insufficiently accurate and carry a large error, so the observed data must be judged. Obtaining the estimate from the observation by filtering is a continuously iterative process, and by discarding observations with large errors and unusable values, a more accurate fusion value with smaller error can be obtained.

First, a simulation experiment for filter estimation is carried out, in which data from different skeletal joints in 3D coordinates over multiple frames are simulated, and good filtering results are obtained. This paper therefore takes the right knee (KneeRight) and the right hand (HandRight) as examples, analyzing the filtering effect on these joints to verify the feasibility of Kalman filtering for joint prediction.

Fig. 4 and Fig. 5 show the right knee joint point data in the X-axis direction. From the figures, the filtered estimate of the right knee skeletal point is closer to the true value, with an error of no more than 0.05 m; the error after filtering estimation is also smaller than the observation error, and the filtering effect is obvious. Filtering fusion under multi-Kinect with Kalman estimation is therefore feasible.

Figure 4.

The right knee filter experiment trajectory

Figure 5.

The right knee filter experiment trajectory error diagram

Figure 6 shows the right-hand skeletal joint point data captured by one Kinect sensor. In the extracted filtering trajectory of the skeletal point coordinates, right-hand tracking is inaccurate: occlusion may occur and the data jitter is large. In this case Kalman estimation is applied, taking the estimate as the observation for the experiment, so that the data become smooth, closer to the true value, and free of jumps. It can therefore be concluded that Kalman filtering is feasible for predicting the coordinates of skeletal points.

Figure 6.

Right hand filter experiment trajectory

Figure 7 shows the data fusion experiment for the right knee joint point. The data collected by the two Kinect sensors are filtered and estimated separately; since the two sensors are of the same type, both weights can be set to 0.7, and the fusion operation is carried out on the estimated values. The final fused data lie between 0.09 m and 0.11 m and contain all the detected skeletal-point data in the sports skill movement.

Figure 7.

Data fusion experiment comparison diagram
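A minimal sketch of the per-joint fusion step just described, assuming a fixed weight per sensor that is normalized at fusion time so the fused coordinate stays on the same scale even when the raw weights (e.g. 0.7 and 0.7 for two identical sensors) do not sum to 1. The function and its defaults are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of weighted fusion of two (x, y, z) joint estimates
# from two Kinect sensors; weights are normalized before combining.
def fuse_estimates(est_k1, est_k2, w1=0.7, w2=0.7):
    """Fuse two 3-D joint estimates with per-sensor weights."""
    total = w1 + w2
    return tuple((w1 * a + w2 * b) / total for a, b in zip(est_k1, est_k2))
```

With equal weights the normalized fusion reduces to the arithmetic mean of the two estimates.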

The data fusion experimental diagram of the lower She-fist posture selected for the experiment is captured by the color camera. The “K1 acquisition pose” and “K2 acquisition pose” are the action postures tracked by the two Kinects in the corresponding frames, and the “post-fusion pose” is the human skeleton model produced by the data fusion algorithm. Solving the occlusion problem yields a complete fused human skeleton motion model, and recognition of the corresponding She-fist posture can then be completed by extracting human posture features.

Experiment and Analysis of She Traditional Sports Skill Movement Recognition
Evaluation indicators

In the field of action classification, accuracy is an important metric for evaluating a detection model: it is the ratio of the number of correctly classified samples to the total number of samples on the test dataset. When the number of samples in each category is roughly balanced, accuracy reflects the model’s performance well. It is calculated as:

Accuracy = Correct sample size / Total sample size

For datasets with an uneven sample distribution, precision is often considered instead. Compared with accuracy, precision focuses on the model’s effect on each category and is calculated as: Precision(x) = Samples predicted as class x and predicted correctly / All predictions of class x

By calculating the precision of the model on each category, its specific recognition effect for each category can be known, and high per-class precision avoids misrecognition of specific categories.
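The two metrics above can be computed directly from predicted and true labels; a small self-contained sketch:

```python
# Sketch of the two evaluation metrics: overall accuracy and per-class
# precision (correct predictions of class x over all predictions of x).
from collections import defaultdict

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_precision(y_true, y_pred):
    predicted = defaultdict(int)   # how often each class was predicted
    correct = defaultdict(int)     # how often that prediction was right
    for t, p in zip(y_true, y_pred):
        predicted[p] += 1
        if t == p:
            correct[p] += 1
    return {c: correct[c] / predicted[c] for c in predicted}
```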

Experimental analysis and comparison

The model was first trained on the training set of the NTU-RGB+D dataset and tested on the validation set; the accuracy over the 60 action types was calculated and compared with commonly used generic action classification algorithms. The specific results are shown in Table 1.

As can be seen from the table, the dynamic spatio-temporal graph convolution classifier is in a leading position in general action classification when only human key-point data are used. Compared with more traditional methods such as Lie Group, LSTM, and simple CNNs, GCN-based methods have an obvious advantage, and dynamic spatio-temporal graph convolution also clearly outperforms simple GCN methods such as ST-GCN, although it is slightly inferior to DGNN. A possible reason is that DGNN uses fully connected layers, rather than the convolutional layers used in this network, as the feature extraction module of its backbone, which gives it an obvious advantage in parameter count. In terms of training and inference time, however, DGNN is at a significant disadvantage: dynamic spatio-temporal graph convolution requires only about one-third of DGNN’s training and inference time.

Comparison of results with other algorithms

Model name Accuracy rate(%)
Lie Group 52.5
ARRN-LSTM 83.6
3scale ResNet152 87.7
ST-GCN 81.5
Pb-GCN 89.2
DGNN 91.2
DLSTM-GCN 90.1

This model performs the fine-grained action classification task on the she-boxing action classification set. The backbone network of the dynamic spatio-temporal graph convolution is designed with 10 layers, the input human key-point sequence is 300 frames long, and there are 25 human key points, each a 3-dimensional coordinate, with a batch size of 16. The dataset contains a total of 1274 samples in 10 classes, divided into training and validation sets at a ratio of 8:2. A comparison of the results of this method with other classification algorithms is shown in Table 2.
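The data layout and split described above can be sketched as follows; the shuffled index split is an assumption about how the 8:2 division is realized, since the paper does not specify the mechanism.

```python
# Sketch of the dataset setup described above: sequences of shape
# (batch, frames, joints, coords) = (16, 300, 25, 3), and an 8:2 split
# of the 1274-sample she-boxing set into training and validation indices.
import random

def split_dataset(n_samples=1274, train_ratio=0.8, seed=0):
    """Return disjoint train/validation index lists (assumed shuffled split)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_ratio)
    return indices[:cut], indices[cut:]
```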

Comparison of experimental accuracy on the action classification dataset

Categories ST-GCN 2s-AGCN DGNN FenceNet DLSTM-GCN
SL-R 0.38 0.58 0.45 0.72 0.71
SL-L 0.12 0.41 0.68 0.39
SJ-R 0.53 0.8 0.87 0.83
SJ-L 0.88 0.89 0.87 0.87
SW-R 0.78 0.89 0.87 0.87
SW-L 0.46 0.54 0.29 0.74
FH-R 0.66 0.85 0.78 0.83
FH-L 0.45 0.75 0.68 0.68
UC-R 0.42 0.73 0.52 0.71
UC-L 0.24 0.5 0.62 0.87
Top 1(%) 67.75 84.57 76.67 82.76 93.45
Top 5(%) 97 100 98.15 99.51 100

As can be seen from the results, dynamic spatio-temporal graph convolution achieves good results on the fine-grained she-fist action classification dataset. In terms of per-category classification accuracy, the proposed model has a significant advantage over the other models, especially in the swinging-fist and uppercut categories. The DLSTM-GCN-based classification model, like 2s-AGCN built on a key-point weight matrix, is also much higher than the other two models in classification accuracy on multiple categories.

In this paper, we also analyze the classification accuracy of this method and the other methods on each category of the she-fist movement classification dataset; the results are shown in Table 3. From the table, the recognition accuracy of the present method clearly leads in several categories. On almost all categories, its recognition effect is more balanced, with accuracy generally around 0.8~0.97, which gives it high stability and reliability in practical scenarios.

Per-category accuracy comparison on the classification dataset

Categories ST-GCN 2s-AGCN DGNN FenceNet DLSTM-GCN
SL-R 0.38 0.58 0.45 0.72 0.81
SL-L 0.53 0.7 0.54 0.75 0.84
SJ-R 0.38 0.39 0.2 0.2 0.82
SJ-L 0.68 0.77 0.58 0.79 0.89
SW-R 0.44 0.6 0.79 0.66 0.85
SW-L 0.63 0.77 0.82 0.71 0.97
FH-R 0.38 0.58 0.72 0.87 0.96
FH-L 0.55 0.89 0.83 0.76 0.88
UC-R 0.44 0.83 0.71 0.77 0.76
UC-L 0.37 0.55 0.64 0.77 0.97

DLSTM-GCN was also tested in a simple comparison on the motion-capture-based she-boxing dataset; the results are shown in Table 4. This dataset has less data, more categories, and a higher degree of fine-grainedness. Even so, DLSTM-GCN reaches 63.55% Top-1 and 98.76% Top-5 accuracy, a significant improvement over the compared models.

Comparison of experimental results based on action capture

Model name Top 1(%) Top 5(%)
ST-GCN 52.11 95.76
2s-AGCN 53.47 94.21
DLSTM-GCN 63.55 98.76
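The Top-1 and Top-5 figures in the tables follow the standard definition: a prediction counts as correct if the true class is among the k highest-scoring classes. A sketch (the class scores in the test are hypothetical):

```python
# Sketch of Top-k accuracy: fraction of samples whose true label is among
# the k classes with the highest predicted scores.
def top_k_accuracy(scores, labels, k=5):
    """scores: list of per-class score lists; labels: true class indices."""
    hits = 0
    for row, label in zip(scores, labels):
        # indices of the k highest-scoring classes for this sample
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)
```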
Experimental Comparison of Data Expansion Algorithms

This subsection compares the performance of several commonly used interpolation algorithms on the she-fist action dataset. The human key-point sequences actually collected are first downsampled, the downsampled sequences are then expanded back to their original length by each interpolation algorithm, and the mean error and root mean square error are calculated at multiple expansion multiples; the results are shown in Table 5. The experiment shows that the accuracies of the polynomial interpolation methods are close to one another; the mean error is lowest when the interpolation reaches 8th order, with a minimum value of 0.3011. The table also shows that satisfactory interpolation results can be obtained even at higher data expansion multiples.

Accuracy comparison of the interpolation algorithms

Interpolation algorithm Mean error (k=2) Mean error (k=5) RMSE (k=2) RMSE (k=5)
Neville (order 2) 0.0781 0.4612 0.1338 0.6353
Neville (order 4) 0.0413 0.3184 0.0689 0.4474
Neville (order 8) 0.0391 0.3011 0.0628 0.4123
Linear interpolation 0.0781 8.0278 0.1338 9.7477

Figures 8-10 take the X-axis motion of the hand in one sample as an example. The true length of the key-point sequence in the sample is 60 frames; the original sequence is downsampled by 2x, 3x, and 5x, and the downsampled data is then expanded back to the original length by each interpolation algorithm. Compared with linear interpolation, Neville interpolation perceives the motion characteristics of the joints better, compensates for some missing key frames, and predicts key-point positions more accurately. However, when the sampling frequency is too low, as in the 5-fold expansion example in the figure, the expanded data become distorted.

Figure 8.

Interpolation algorithm data expansion 2 times rendering

Figure 9.

Interpolation algorithm data to expand 3 times

Figure 10.

The interpolation algorithm data expands 5 times the rendering
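The downsample-then-expand procedure above can be sketched with Neville's algorithm, which evaluates the interpolating polynomial through the retained frames at the dropped frame indices. The quadratic test track below is illustrative, not the paper's data.

```python
# Sketch of track expansion via Neville's algorithm: evaluate the
# polynomial through the retained (frame, value) samples at every
# frame index of the original sequence.
def neville(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    p = list(ys)
    n = len(xs)
    for level in range(1, n):
        for i in range(n - level):
            p[i] = ((x - xs[i + level]) * p[i] +
                    (xs[i] - x) * p[i + 1]) / (xs[i] - xs[i + level])
    return p[0]

def expand_track(frames, values, length):
    """Rebuild a track of `length` frames from a downsampled subset."""
    return [neville(frames, values, t) for t in range(length)]
```

As with the experiments above, polynomial interpolation through too few retained frames (too high an expansion multiple) can oscillate and distort the reconstructed track, which linear interpolation avoids at the cost of missing the motion's curvature.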

Traditional Sports Skills Inheritance Strategies
Give full play to the positive role of civil organizations in the inheritance of traditional She sports

The key to the protection and transmission of intangible cultural heritage lies at the grassroots level and in giving full play to the power of folk organizations. Folk organizations are generally composed of artists who are usually the protectors and inheritors of intangible cultural heritage and who have strong appeal and influence in their regions, so their enthusiasm should be fully mobilized in the concrete work of passing on She traditional sports and cultural heritage. In fact, protection should cover not only She traditional sports themselves but also the individuals and organizations who inherit this cultural heritage and own its cultural form. In the process of transmission, therefore, it is also necessary to protect the bearers of intangible cultural heritage who are able to pass it on and keep innovating, rather than limiting the work to collecting and preserving the physical artifacts of She traditional sports.

Strengthening the inheritance of She traditional sports cultural heritage in school education

As an important base for cultivating talents, schools also play an important role in preserving China’s traditional sports. However, in today’s school physical education, whether at the compulsory, high school, or university stage, teaching is limited to simple, basic exercise skills, the form of teaching is boring and monotonous, and excellent traditional sports programs are rarely introduced, so students’ understanding of them is very insufficient. The author therefore believes that, on the one hand, traditional sports should be introduced into school physical education, the traditional sports curriculum should be expanded, and teaching materials on intangible cultural heritage should be developed, so that students can fully recognize and understand She traditional sports in thought and action; on the other hand, the ranks of intangible-culture teachers should be strengthened, so that more teachers devote themselves to the inheritance and protection of intangible cultural heritage and the transmission of She traditional sports is promoted comprehensively.

Enhance the public’s awareness of the inheritance of She traditional sports

Doing a good job of transmitting and protecting Chinese traditional sports in the context of intangible cultural heritage depends on the support and help of the general public. The adoption of the Intangible Cultural Heritage Law of the People’s Republic of China to regulate the protection of China’s intangible cultural heritage is of great significance: the enactment of the law and the establishment of the legal status of protection and inheritance have undoubtedly provided the people with a clear “weather vane” and an effective “shot in the arm”. We should therefore strengthen legal publicity for China’s intangible cultural heritage through widely used Internet channels such as Weibo and WeChat, as well as radio and television media, so that the public understands and becomes familiar with the law on intangible cultural heritage, consciously establishes an awareness of protecting it, and develops a stronger sense of responsibility for passing on the traditional sports of the She ethnic group.

Conclusion

In this paper, we take the traditional sports skill “She Quan” as an example and use the joint coordinates tracked by multiple Kinect sensors to recognize She Quan movements: real-time coordinate points and angle features of the human skeleton are extracted, and data fusion for She Quan is completed in combination with a dynamic time warping algorithm. By introducing a key-point weight matrix and a dynamic graph connection mechanism, the parts of the body that contribute most to traditional sports skills receive higher weights. Experiments on the she-boxing action dataset with the dynamic spatio-temporal graph convolutional network verify the algorithm’s ability to classify fine-grained boxing technical actions and the contribution of each module to classification accuracy; the results show that the recognition effect of this method is balanced, with accuracy generally around 0.8~0.95. These two methods are used to protect the traditional sports skills of the She ethnic group, and on this basis inheritance strategies for She traditional sports skills are proposed.

Funding:

This article is part of the Fujian Provincial Social Science Fund Project: Research on the Identity and Ethnic Integration of the She Ethnic Folk Sports Culture in the New Era in China (FJ2021B129).