Open Access

Design and Implementation of Dance Personalized Teaching System Assisted by Artificial Intelligence

Mar 21, 2025


Introduction

As the times change, education is continually reformed along with social progress. The new era pursues individuality and innovation, respecting and promoting each person's individuality. Personalized teaching is a teaching mode that tailors strategies and content to students' individual differences [1-4]. It aims to bring students' potential into full play and improve learning outcomes [5-6]. In modern education, personalized teaching has attracted the attention of, and been practiced by, more and more educational institutions and teachers. It can improve learning motivation, raise achievement, develop comprehensive abilities, and enhance self-confidence [7-9]. However, personalized instruction also faces challenges concerning teacher resources and skill levels, instructional costs, assessment and tracking, and students' adaptability [10-12].

As an artistic form of movement, dance has been introduced into primary and secondary schools and even university classrooms. Since the broadcast of the program This Is Street Dance, and especially its 2022 season, which set off a nationwide street dance craze, more and more people have begun to learn dance and feel its charm. However, most dance movements nowadays look alike; few are eye-catching, and many dancers lack a unique style of their own. Good dance teaching not only improves students' physical fitness and artistic expressiveness but also promotes dancers' mental health, aesthetic ability, and creativity [13-16]. Personalized dance teaching fully combines the individuality of the art form with each learner's own personality: it promotes learners' overall development, fosters their individuality, highlights the value of that individuality, and meets the interests and needs of different learners.

Personalized teaching requires not only individuality and creativity but also attention to the type of dance and, most importantly, to each learner's personality, teaching according to the learner's interests so as to complement strengths and weaknesses. This improves the quality of dance teaching, helps learners grasp the essence of the dance as soon as possible, and provides each person with a personalized stage. Artificial intelligence technology, by integrating online and offline learning, enhances the learner's experience and supports independent learning and teaching reflection, thereby improving dance technique [17-20].

Artificial intelligence improves teaching efficiency and quality in dance teaching through intelligent recognition of dance gestures [21]. Literature [22] summarizes the role of virtual technology in dance teaching and offers suggestions, while literature [23] uses virtual reality to strengthen dancers' memory of dance techniques and help them improve efficiency. Literature [24] constructed a virtual teaching robot system to improve student participation and teaching effectiveness. Literature [25] constructed a dance learner model in which a wearable device automatically captures relevant dance data, so that students and teachers can use the data to design targeted teaching programs. Literature [26] and literature [27] describe an online dance teaching system and a distance learning system for dance movements, respectively; the former helps students understand their own condition and coordination and provides teachers and students with different dance steps, while the latter evaluates students' perception of their own dance ability, thus realizing personalized teaching. Literature [28] applied target detection and automatic recognition techniques from artificial intelligence to a Latin dance teaching system to provide guidance for teaching. Similarly, literature [29] used convolutional neural networks to extract features from dance images and to identify and classify them accurately. Literature [30] analyzes dance movements with computer techniques and trains machine learning models to classify and evaluate them, while literature [31] designed a dance motion capture system that compares a dancer's poses with standard poses. These studies help teachers in personalized instruction to guide and manage each student individually.
Artificial intelligence technology also has the potential to enhance the level of innovation in dance and the approaches taken to it [32].

This paper constructs a personalized dance teaching system based on Kinect's human body detection and tracking technology combined with an LSTM. For joint point recognition, edge detection is used to find the region occupied by the human body, each joint part is identified, and frontal and lateral joint points are analyzed to determine the body's coordinate positions. After an interrelationship graph of the skeleton nodes is constructed, CNN local convolution extracts image region features, with shared convolution weights and pooling operations reducing the data dimensionality. The skeletal information is further processed with an LSTM network to compensate for the convolutional network's limited mining of temporal features. Experiments are designed to verify system performance; the system is then applied in dance teaching and user feedback is collected.

Method
Function and process design of the system
Required system functionality

The functional design objectives of the dance teaching system in this paper are as follows:

Introduction of basic knowledge: background knowledge of the dance, dance types, history of development, and current situation, presented as text for users to preview before learning. The material is obtained from the local area where the dance originated.

Video playback demonstration: with the help of dance inheritors, we recorded and edited dance teaching videos, so that users can watch and learn from standard demonstrations.

Dance practice mode: optical motion capture first records the standard movement data stream of the inheritor, and this movement data is then applied to the character model we built, which serves as the teacher avatar. Users imitate the avatar's movements while watching, and the system gives instant and total scores to assess how standard their movements are.

System Functional Flow Design

The system's functions are divided into three parts: introduction of basic knowledge, video playback demonstration, and dance practice mode. The basic knowledge introduction provides users with text and illustrations about the dance. The video playback demonstration offers functions similar to conventional video players on the market: progress-bar control, volume control, playback speed, replay, pause, and so on, so users can control the playback of the dance video according to their needs. The dance practice module follows the process shown in Figure 1; the user completes a dance practice along this process, and the system finally gives total feedback measuring the learning effect.

Figure 1.

The flowchart of the dance practice module

Human detection and tracking technology
Human detection and tracking technologies

Human detection technology

Commonly used methods include the optical flow method and the adjacent frame difference method, followed by analysis and judgment based on the extracted foreground. These traditional detection methods perform poorly in complex external environments and under occlusion by external objects, and they are easily affected by noise in the low-level image features.

In general, the classification of foreground targets mainly includes classification methods based on appearance, shape, features and other information. Specifically, the commonly used classification methods are as follows:

Adjacent frame difference method

The theoretical basis of the adjacent frame difference method is the difference operation, which determines the outline of a moving object by analyzing the difference between two frames in the video stream: $$D_x(x,y) = \left| f_{k-1}(x,y) - f_k(x,y) \right|$$

Equation (1) is a difference operation on the image. In practice it is difficult to guarantee the quality of captured video, and the surrounding environment and the camera itself introduce considerable noise into the results. The result of the difference operation must therefore be processed to eliminate the effect of noise. Since noise usually obeys a Gaussian distribution, its effect can be overcome by setting a threshold, as in equation (2): $$\left\{ \begin{array}{ll} D_x(x,y) = 0, & D_x(x,y) < \tau \\ D_x(x,y) = 1, & D_x(x,y) \ge \tau \end{array} \right.$$
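The two equations above can be sketched with a few lines of Python (numpy only; the threshold value of 25 is an illustrative choice, not one fixed by the text):

```python
import numpy as np

def frame_difference(prev_frame, curr_frame, tau=25):
    """Adjacent-frame difference with a noise threshold (Eqs. 1-2).

    Pixels whose absolute inter-frame difference is below tau are
    treated as Gaussian noise and suppressed; the rest are marked
    as moving foreground. tau is a hypothetical threshold value.
    """
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff >= tau).astype(np.uint8)  # 1 = moving object, 0 = background

# Toy example: a bright 2x2 block "moves" one pixel to the right.
f1 = np.zeros((4, 4), dtype=np.uint8)
f2 = np.zeros((4, 4), dtype=np.uint8)
f1[1:3, 0:2] = 200
f2[1:3, 1:3] = 200
mask = frame_difference(f1, f2, tau=25)
```

Only the leading and trailing edges of the moving block survive the difference; the overlap of the two frames cancels out, which is why the method recovers an outline rather than the full object.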

Background Difference

The background difference method performs a difference operation between the current frame and the background to separate foreground from background. The background image is modeled; during target detection, the frame captured from the video is differenced against the background model, yielding the foreground to be detected.

Considering the video image I(x, y, t) as consisting of a moving target m(x, y, t) and a background b(x, y, t): $$I(x,y,t) = m(x,y,t) + b(x,y,t)$$

From equation (3): $$m(x,y,t) = I(x,y,t) - b(x,y,t)$$

However, because of the presence of noise n(x, y, t), the result computed by Eq. (4) contains noise as well as the moving target, so the differential image d(x, y, t) is: $$d(x,y,t) = I(x,y,t) - b(x,y,t) - n(x,y,t)$$
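A minimal sketch of background differencing, assuming a simple running-average background model (the learning rate `alpha` and threshold `tau` are hypothetical values, not taken from the text):

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background model b(x, y, t).

    alpha is a hypothetical learning rate: small values adapt slowly
    and keep the background stable against noise n(x, y, t).
    """
    return (1 - alpha) * background + alpha * frame

def foreground_mask(frame, background, tau=30):
    """Eq. (4): m = I - b, thresholded to suppress noise."""
    return (np.abs(frame - background) >= tau).astype(np.uint8)

# A static scene of value 100 with a moving target patch of value 220.
bg = np.full((6, 6), 100.0)
frame = bg.copy()
frame[2:4, 2:4] = 220.0
mask = foreground_mask(frame, bg, tau=30)
new_bg = update_background(bg, frame)
```

Unlike frame differencing, the full silhouette of the target survives here, because the whole target differs from the modeled background rather than only its moving edges.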

Optical flow method

The optical flow method first projects the 3D model onto a 2D plane, associating the motion target of the 3D model with that of the 2D image. The motion of each pixel is then determined from the change of its intensity in the time domain.

Human body tracking technology

A tracking method that monitors a detected target, analyzes and predicts its trajectory, and then follows the target of interest is called target tracking. The Kalman filter, Meanshift, dynamic Bayesian, and particle filter tracking algorithms are among the commonly used methods. Feature-based and contour-based methods are usually used for human tracking.

Kinect-based human detection and tracking technology

Kinect Depth Map Principles and Measurements

Optical encoding imaging is divided into the following steps:

Calibration: a laser speckle pattern is first projected into the space to be detected, and a large amount of laser scattering information is collected in the target area; Z1, Z2, Z3, and Z4 denote the positions of the reference images [33].

Sampling: the speckle image formed on an object's surface changes depending on whether the object is opaque or in motion, producing a different reference image; the scattering on the surfaces of objects A and B forms patterns at positions ZA and ZB.

Localization: cross-correlation between the test image and each reference image yields correlation coefficients, and the object's most probable location is that of the reference image with the largest coefficient. A is judged to be at Z2 and B at Z3, because ZA has its largest correlation coefficient with Z2 and ZB has its largest with Z3.
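The localization step can be illustrated as follows; the patch sizes and the use of zero-mean normalized cross-correlation are assumptions for the sketch, since the text does not specify the exact correlation measure:

```python
import numpy as np

def normalized_correlation(a, b):
    """Zero-mean normalized cross-correlation coefficient of two patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom)

def locate(test_patch, reference_patches):
    """Return the index of the reference plane (Z1, Z2, ...) whose
    calibrated speckle pattern correlates best with the observed patch."""
    scores = [normalized_correlation(test_patch, ref) for ref in reference_patches]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
refs = [rng.normal(size=(8, 8)) for _ in range(4)]   # calibrated planes Z1..Z4
observed = refs[2] + 0.1 * rng.normal(size=(8, 8))   # object near Z3, plus noise
plane = locate(observed, refs)
```

Because the speckle patterns at different depths are nearly uncorrelated with each other, the correct plane stands out even under moderate sensor noise.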

Reconstruction:

Kinect calibrates the RGB and IR cameras separately. The raw value dr of a spatial point p can be derived from the correspondence between the color image and the depth information:

$$d_r = K \tan\left( H \cdot d + L \right) - O$$

where d is the depth value of point p in cm, H = 3.5 × 10⁻⁴ rad, K = 12.36 cm, L = 1.18 rad, and O = 3.7 cm. The depth map is labeled with different colors depending on the distance between the Kinect and the target.

After obtaining the depth image, the world coordinates of point p can be obtained; the depth coordinate $$\left( x_d, y_d, z_d \right)$$ corresponds to the world coordinate $$\left( x_w, y_w, z_w \right)$$ as: $$\left\{ \begin{array}{l} x_w = \left( x_d - \frac{w}{2} \right) \cdot \left( z_w + D' \right) \cdot F \cdot \frac{w}{h} \\ y_w = \left( y_d - \frac{h}{2} \right) \cdot \left( z_w + D' \right) \cdot F \\ z_w = d \end{array} \right.$$

where D′ = −10, F = 0.002, and the Kinect resolution w × h is 1920 × 1080.
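Equations (6) and (7) can be sketched directly, using the constants stated in the text (units as given there; the function names are illustrative):

```python
import math

# Calibration constants from the text (units as stated there)
H, K, L, O = 3.5e-4, 12.36, 1.18, 3.7   # Eq. (6)
D_PRIME, F = -10.0, 0.002               # Eq. (7)
W, HRES = 1920, 1080                    # Kinect resolution w x h

def raw_to_depth_cm(d):
    """Eq. (6): map a raw disparity value d to a depth value in cm."""
    return K * math.tan(H * d + L) - O

def depth_to_world(x_d, y_d, d):
    """Eq. (7): convert depth-map coordinates (x_d, y_d, d) to
    world coordinates (x_w, y_w, z_w)."""
    z_w = d
    x_w = (x_d - W / 2) * (z_w + D_PRIME) * F * (W / HRES)
    y_w = (y_d - HRES / 2) * (z_w + D_PRIME) * F
    return x_w, y_w, z_w
```

Note that Eq. (6) is monotonically increasing in the raw value over the sensor's working range, and that a pixel at the image center maps to the world-coordinate origin in x and y, as Eq. (7) requires.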

Kinect-based human joint point recognition

Kinect is capable of recognizing and tracking the human skeleton. It first recognizes the coordinates of the body's 25 joints, establishes the structure of the human skeleton, and combines depth information to represent the skeleton structure in three-dimensional space.

The human body joint point recognition includes the following three parts:

Removing the background and first finding the possible regions of the human body: the body contour is extracted using edge detection, with the distance from the Kinect sensor used for the analysis.

Recognition of important parts of the human body, including the head, arms, legs, and body torso.

Joint recognition: in Kinect, the body parts are connected by joints, so Kinect analyzes the frontal and lateral joints to determine the coordinate positions of the human body [34].

Pixel-level information can be inferred from body component recognition. The density estimate of body component c is defined as follows:

$$f_c(\hat x) \propto \sum\limits_{i = 1}^{N} w_{ic} \exp\left( - \left\| \frac{\hat x - \hat x_i}{b_c} \right\|^2 \right)$$

where $$\hat x$$ is a 3D spatial coordinate, N is the number of pixels, w_ic are the pixel weights, $$\hat x_i$$ represents the projection of pixel x_i into world space, and b_c represents the bandwidth of component c. The weight w_ic balances the pixel inference probability against the spatial area probability: $$w_{ic} = P\left( c \mid I, x_i \right) \cdot d_l\left( x_i \right)^2$$

This approach improves the accuracy of joint prediction and also yields a depth-invariant density estimate.
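Equation (8) can be sketched as a weighted Gaussian-kernel density estimate (the cluster data and bandwidth used here are illustrative):

```python
import numpy as np

def part_density(x_hat, points, weights, bandwidth):
    """Eq. (8): weighted Gaussian-kernel density of a body part at x_hat.

    points  : (N, 3) world-space projections of the pixels
    weights : (N,)   w_ic = P(c | I, x_i) * d(x_i)^2  (Eq. 9)
    bandwidth : the per-component width b_c
    """
    d = np.linalg.norm((x_hat - points) / bandwidth, axis=1)
    return float(np.sum(weights * np.exp(-d ** 2)))

# Illustrative cluster of pixel projections around a joint at (0, 0, 2).
rng = np.random.default_rng(1)
pts = rng.normal(loc=[0.0, 0.0, 2.0], scale=0.05, size=(50, 3))
w = np.ones(50)
near = part_density(np.array([0.0, 0.0, 2.0]), pts, w, bandwidth=0.1)
far = part_density(np.array([1.0, 0.0, 2.0]), pts, w, bandwidth=0.1)
```

The density peaks at the cluster center and decays rapidly away from it; a joint proposal is typically taken at a mode of this density.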

Research on human posture recognition algorithm based on human skeleton map and deep learning
Image representation of the human skeleton

When extracting the spatial information of the skeleton, the relative relationships among all joints change with different actions, not only among the joints connected in the skeleton's own structure. Therefore, this paper models the interrelationships among the joints as a graph, which better represents the characteristics of an action.

Convolutional Neural Networks and LSTMs

In a convolutional neural network, a point on a feature map corresponds to a region of the input map. The convolution kernel has the same size across all feature maps; by sharing weights in this way, the model becomes compact and easier to understand and train.

Convolution Layer

The feature map of the convolution layer is computed as: $$f_{ij,k} = \max\left( w_k x_{i,j}, 0 \right)$$

Pooling layer

The pooling layer aims to reduce the number of model parameters and the computational complexity. The pooling layer has hyper-parameters, such as the size of the kernel, i.e. the window over which the maximum is taken; unlike the convolution operation, the pooling operation only selects among the detected features.

Activation Functions

The use of activation functions gives multilayer networks the ability to model nonlinear features; the most typical are tanh, sigmoid, and ReLU, which differ in their formulas and output ranges.
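For reference, the three activation functions and their output ranges, in a minimal numpy sketch:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, output range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, output range (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit, output range [0, +inf)."""
    return np.maximum(x, 0.0)
```

Sigmoid and tanh saturate for large inputs, which can slow gradient-based training; ReLU avoids saturation on the positive side, which is one reason it is the common default in convolutional networks.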

Loss function

The most common loss function is the Softmax (cross-entropy) loss, which is widely applied in image classification, video recognition, and other research:

$$P_j^{(i)} = e^{a_j^{(i)}} \Big/ \sum\limits_{l=0}^{K-1} e^{a_l^{(i)}}$$ $$L_{softmax} = -\frac{1}{N}\left[ \sum\limits_{i=1}^{N} \sum\limits_{j=0}^{K-1} 1\left\{ y^{(i)} = j \right\} \log P_j^{(i)} \right]$$
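Equations (10) and (11) can be sketched as follows (subtracting the row maximum before exponentiating is a standard numerical-stability step, not part of the formulas above):

```python
import numpy as np

def softmax(logits):
    """Eq. (10): P_j = exp(a_j) / sum_l exp(a_l), computed stably per row."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_loss(logits, labels):
    """Eq. (11): mean cross-entropy over N samples with K classes."""
    p = softmax(logits)
    n = logits.shape[0]
    return float(-np.log(p[np.arange(n), labels]).mean())

# Two samples, three classes; both logits favor the correct class.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 3.0, 0.4]])
labels = np.array([0, 1])
probs = softmax(logits)
loss = softmax_loss(logits, labels)
```

The indicator term in Eq. (11) selects the predicted probability of the true class, which is why the implementation reduces to indexing `p` with the label array.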

Regularization

Overfitting is an almost unavoidable problem in deep convolutional neural networks, and regularization techniques such as Dropout can effectively prevent overfitting from occurring.
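As an illustration of Dropout, here is a minimal sketch of the standard inverted-dropout formulation (the rescaling by 1/(1 − rate) is the usual convention, assumed here; the text does not specify the variant):

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: randomly zero a fraction `rate` of units and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((1000,))
out = dropout(a, rate=0.5, rng=rng)  # roughly half the units survive, scaled by 2
```

At test time (`training=False`) the layer is an identity, so no rescaling is needed at inference, which is the practical advantage of the inverted formulation.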

Gradient descent optimization algorithm

Batch gradient descent uses all samples for each update, which makes training slow, although with enough iterations it can eventually find the optimal solution. Stochastic gradient descent (SGD) trains much faster by comparison, using a single random example $$\left( x_t, y_t \right)$$ to estimate the gradient of the loss function. In practice, each SGD parameter update is based on a small batch of training data, which makes convergence more stable and reduces the variance of the parameter updates.
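A minimal sketch of mini-batch SGD on a least-squares problem, illustrating the update just described (the learning rate, batch size, and epoch count are illustrative choices):

```python
import numpy as np

def sgd_linear(X, y, lr=0.1, batch_size=8, epochs=200, seed=0):
    """Mini-batch SGD for least-squares regression: each update uses a
    small random batch (x_t, y_t) to estimate the loss gradient."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# Noiseless synthetic data: SGD should recover the true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -2.0])
y = X @ true_w
w = sgd_linear(X, y)
```

Each batch gradient is an unbiased estimate of the full gradient, so averaging over many small batches trades a little per-step noise for far cheaper updates.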

Construction of graph structure skeleton images

This section focuses on image modeling of the human skeleton and how to extract spatial features, temporal features and merge the two respectively. Figure 2 shows the human skeleton delineation.

Figure 2.

Human skeleton map

A human posture skeleton sequence S contains N frames, each frame K_t containing M joint points, with the 3D coordinate information of each joint point represented by (x, y, z). A human posture sequence can thus be represented as in Eqs. (13) and (14): $$S = \left\{ K_1, K_2, \ldots, K_N \right\}$$ $$K_t^j = \left( x_j, y_j, z_j \right), \quad 1 \le j \le M$$

We encode each frame of the human skeleton as a network, and the relationships between the joint points in each frame change accordingly over time. The network of joint-point interactions at frame t is defined as $$SAN_t = \left( V_t, E_t \right)$$, where V_t denotes the set of vertices and E_t the set of edges in the network at frame t. The distance between joints i and j is: $$d(i,j) = \sqrt{ \left( x_i - x_j \right)^2 + \left( y_i - y_j \right)^2 + \left( z_i - z_j \right)^2 }$$

The relationship between nodes of the same body part can be expressed as in equation (16): $$w(i,j) = \left\{ \begin{array}{ll} d(i,j) \times \alpha_1, & 1 \le i,j \le 4 \\ d(i,j) \times \alpha_2, & 5 \le i,j \le 12 \\ d(i,j) \times \alpha_3, & 13 \le i,j \le 20 \end{array} \right.$$
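Equations (15) and (16) can be sketched as follows; the joint-index ranges follow Eq. (16), but the α values themselves are hypothetical, since the text does not give them:

```python
import numpy as np

# Hypothetical region scaling factors alpha_1..alpha_3 from Eq. (16)
ALPHA = {"head": 1.0, "torso_arms": 0.8, "legs": 0.6}

def joint_distance(ki, kj):
    """Eq. (15): Euclidean distance between two 3D joint coordinates."""
    return float(np.linalg.norm(np.asarray(ki, float) - np.asarray(kj, float)))

def region_of(j):
    """Map a joint index to its body region (index ranges as in Eq. 16)."""
    if 1 <= j <= 4:
        return "head"
    if 5 <= j <= 12:
        return "torso_arms"
    return "legs"

def edge_weight(i, j, joints):
    """Eq. (16): w(i, j) = d(i, j) * alpha for joints in the same region;
    Eq. (16) only defines weights within one region, so return None otherwise."""
    if region_of(i) != region_of(j):
        return None
    return joint_distance(joints[i], joints[j]) * ALPHA[region_of(i)]

# Tiny illustrative skeleton: joints 1-2 in the head region, 5-6 in torso/arms.
joints = {1: (0, 0, 0), 2: (0, 1, 0), 5: (1, 0, 0), 6: (1, 2, 0)}
```

Edges within a region thus carry distance-scaled weights, giving the per-frame graph SAN_t its weighted adjacency.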

CNN and LSTM based Skeletal Point Image Classifier

In this paper, a small convolutional neural network is designed. The image sequence is first converted into a batch of images by a sequence folding layer, and independent convolutional computations are performed along the time dimension. The first and second convolutional layers use 3 × 3 kernels with a stride of 1, and the third convolutional layer uses a stride of 2. ReLU neurons and dropout regularization are used, and a flattening layer then converts the images into a feature vector output [35]. The LSTM gates are computed as: $$i_t = g\left( W_{xi} x_t + W_{hi} h_{t-1} + b_i \right)$$ $$f_t = g\left( W_{xf} x_t + W_{hf} h_{t-1} + b_f \right)$$ $$o_t = g\left( W_{xo} x_t + W_{ho} h_{t-1} + b_o \right)$$
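Equations (17) to (19) can be sketched in numpy, taking g(·) to be the sigmoid (the usual choice for LSTM gates; the weight shapes and random initialization here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gates(x_t, h_prev, params):
    """Eqs. (17)-(19): input (i), forget (f), and output (o) gates of
    one LSTM step. params holds hypothetical weight matrices W_x*, W_h*
    and biases b_*; the sigmoid g(.) keeps each gate value in (0, 1)."""
    gates = {}
    for name in ("i", "f", "o"):
        Wx, Wh, b = params["Wx" + name], params["Wh" + name], params["b" + name]
        gates[name] = sigmoid(Wx @ x_t + Wh @ h_prev + b)
    return gates

# Illustrative dimensions: 3-d input feature, 4-d hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = {}
for name in ("i", "f", "o"):
    params["Wx" + name] = rng.normal(scale=0.1, size=(d_h, d_in))
    params["Wh" + name] = rng.normal(scale=0.1, size=(d_h, d_h))
    params["b" + name] = np.zeros(d_h)
g = lstm_gates(rng.normal(size=d_in), np.zeros(d_h), params)
```

The forget gate f_t scales the previous cell state and the input gate i_t scales the new candidate, which is how the LSTM captures the temporal dependencies that the convolutional layers alone miss.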

Results and Discussion
Experimental setup and procedure
Experimental setup

Before constructing the dance movement dataset, this study screened and identified six types of dance as typical examples, namely, the swinging hand dance (D1), the brocade chicken dance (D2), the chaff bag dance (D3), the weaving dance (D4), the bamboo drum dance (D5), and the lusheng dance (D6). Each type of dance contains three movement segments representative of that dance, totaling 18 segments. When constructing the dataset, Kinect was used to collect the 3D position coordinates of key human skeletal points, such as the left and right shoulders.

Before the experiment, each participant was required to sign an informed consent form. During the experiment, the following constraints were set according to the actual laboratory environment:

The Kinect should be mounted 1 to 1.5 m above the ground and 2 to 2.5 m horizontally from the trainer. Because Kinect's field of view is limited, all skeletal nodes of the human body can be fully detected by the camera within this range, giving optimal data capture and the best tracking performance.

The relative angle between the trainer and the Kinect should not exceed 60 degrees; if the angle is too large, the joints detected by the Kinect will be distorted.

Since joint data acquisition is calibrated according to the angle of the body toward the Kinect, the trainer should stand directly in front of the Kinect during motion capture, and the test site should be open and free of obstructions.

Experimental procedure

Twenty physically fit volunteers (age 25 ± 4 years) participated in this study. None had engaged in strenuous exercise in the 24 hours before the experiment, and all were in good health. Table 1 shows the basic information of the 20 volunteers. ID is the number of the trainer, A the age, G the gender (M: male, F: female), H and W the height and weight, S the dance level the trainer possessed before recording, and the numbers under D1-D6 the number of recordings of the typical movements of each dance category.

Basic information of the 20 volunteers and their recorded movement counts

ID The basic information of the tester Dance types and recording times
A G W H S D1 D2 D3 D4 D5 D6
1 26 M 66 175 11 16 18 15 18 17 17
2 21 F 54 162 8 11 13 12 11 12 11
3 23 F 55 167 5 5 6 7 7 5 6
4 21 M 70 177 4 5 4 4 5 4 5
5 23 M 57 167 7 10 10 8 10 7 9
6 26 F 71 177 2 2 1 1 9 1 8
7 22 M 62 175 6 9 9 9 12 10 9
8 24 M 61 173 11 16 16 16 16 17 17
9 22 M 53 170 9 13 14 11 11 13 15
10 21 M 70 180 1 2 2 1 0 2 1
11 24 F 67 176 1 0 2 3 1 3 2
12 28 F 53 164 1 1 0 0 3 2 0
13 27 F 47 161 5 4 4 7 5 4 5
14 25 M 66 182 4 5 4 3 4 2 6
15 23 F 70 179 8 11 13 11 11 11 11
16 26 F 66 182 8 12 11 11 13 9 11
17 27 F 69 166 10 16 15 15 16 13 15
18 24 M 55 167 12 20 18 18 18 16 16
19 22 F 60 184 8 11 10 11 10 11 10
20 24 M 66 187 4 2 4 4 5 4 5

Before data collection, professional dancers were required to learn the 18 original typical dance movement examples from the movement videos and explanations within 7 days, until they could perform them fluently, in order, and in time with the music. During the experiment, stops and overly fast or slow movements were avoided as much as possible. The 18 movement segments were danced following voice instructions and beat commands, and the coordinates of each skeletal point of the professional dancers' standard movements were collected by the motion acquisition program. At the end of the experiment, the recorded samples were deposited into the dataset. Table 2 lists the names of the movements in the original dance dataset and the maximum number of frames of each.

Movement names in the original dance dataset and their maximum frame numbers

Dance species Serial number Action name Maximum frame number
D1 1 Cyclotron 74
2 Single pendulum 47
3 Side pendulum 157
D2 1 Four-step 362
2 Golden chicken spread 61
3 Golden chicken foraging 80
D3 1 Chaff bag 69
2 Bran bag 51
3 Start back 71
D4 1 Planting cotton 69
2 Spinning 154
3 Spun yarn 64
D5 1 Drumming 73
2 Lookout drum 58
3 Drum 70
D6 1 Mountain bead 62
2 Lusheng 80
3 Chinese grass 71
Dynamic weighting of skeletal points

The skeletal points are dynamically weighted according to the algorithm designed in the previous section.

Figure 3 shows the joint weights during the practice of one movement of the "chaff bag dance". The most significant changes occur at the knees and elbows: the right leg approaches the upright position, pacing with the rhythm of the music, so the weights of the two right leg joints, the hip and the ankle, increase as the leg lifts and then gradually decrease. The movement also involves the left and right hands flinging the chaff bag in rhythm, so the angle formed by the two arms changes slightly and the elbow weights change considerably; the weight diagram shows the left and right elbow weights first increasing and then decreasing with the rhythm.

Figure 3.

Action joint weight variation diagram

Dance Movement Recognition

The 3D CNN in this paper was implemented in Python 3.6 on a Windows 10 computer with an AMD Ryzen 7 4800H with Radeon Graphics at 2.90 GHz; the Keras deep learning framework was used to program the algorithms.

First, to mitigate overfitting, the Dropout technique is used. Dropout ratios of 0.2, 0.4, 0.5, 0.6, and 0.8 were each tested on our dataset over 100 iterations; the validation results are shown in Figure 4. Comparing the test set recognition accuracy under the different ratios, 0.4 and 0.5 give good recognition results, and after 50 iterations the accuracy at a ratio of 0.5 is slightly higher than at 0.4. The Dropout ratio was therefore set to 0.5 for all subsequent experiments.

Figure 4.

Dropout ratio experiment

During the experiments, the model was trained on the typical movement datasets of the various dances, yielding the loss curves shown in Figure 5, where the vertical axis is the loss value and the horizontal axis the number of iterations. After about 50 iterations, the loss for all dance movement classes gradually stabilizes at about 0.326, at which point the model achieves its best training effect.

Figure 5.

Loss function curve

The recognition accuracy for the various dance movements is shown in Figure 6, which gives the training and validation accuracies on the typical folk dance movement dataset. After 40 training iterations, the training set recognition rate reaches 98.27%, after which the curve remains essentially stable, the model's best recognition performance. The recognition rates for all dance types from D1 to D6 are above 88.7%.

Figure 6.

The method of identification of the dance action data set

The confusion matrix of the experimental results is shown in Figure 7. It compares recognition on a test set of 20 randomly selected actions; the number of instances per action varies because the test set consists of randomly selected individual actions amounting to 20% of the total sample. The figure shows that action 10 has the highest prediction error, while the remaining actions are recognized well.

Figure 7.

The confusion matrix of the experimental results

Scoring of dance movements

To further verify the effectiveness of the proposed method, the average scores of the 20 participants on the "single pendulum (D1-1)", "golden chicken spreading wings (D2-5)", "chaff bag (D3-8)", "spinning thread (D4-12)", "Wanggu (D5-12)", and "Earth Dragon Rolling Thorn (D6-18)" movements were visualized; the results are shown in Figure 8.

Figure 8.

Exercise average score

As can be seen from the figure, the subjects' dance movement evaluation results were positively correlated with their dance level.

For the single pendulum movement of the D1 swinging hand dance, there were no significant distinguishing features between trainers of different levels, and the evaluation scores were all above 0.655. This result suggests that practicing the "single pendulum" movement depends little on the trainer's prior dance proficiency. In contrast, the evaluation results for the D4 weaving dance "spinning thread" are highly correlated with the subjects' dance level, so trainers who cannot clearly judge their own level can input a recording of their weaving dance movements into the model to estimate it.

User feedback

To verify the usability and applicability of the proposed system, a total of 50 evaluators at different dance levels were invited to assess it subjectively. They were divided into a dance field expert group, a dance teacher group, and a dancer group, and each scored the system after using the dance movement teaching system.

The three groups were shown the analysis results of the teaching effect for different dance movements, and each member scored the system on aesthetics and effectiveness according to their experience of using it. Scores range from 1 to 10; the higher the score, the more effective the system. Partial scoring results from the three groups are shown in Table 3.

Partial scoring results

Choreographer Dance teacher Dancer
Aesthetics 9.8 8.5 9.7 8.2 9.3 8.5 9.4 8.7 9.6 9.3
Validity 8.4 8.5 8.6 8.9 9.4 8.6 8.8 9.3 9.5 9.1

Collating the results shows that the evaluators' overall ratings of the proposed dance teaching system are all higher than 8, verifying that the proposed movement-recognition-based dance teaching system meets the needs of the dance field and that its validity is confirmed.

Conclusion

This paper presents a personalized dance teaching system based on movement recognition. Experiments analyzing the system's functional effectiveness show that after about 50 iterations the loss for all dance movement classes gradually stabilizes at about 0.326, the model's best training effect. After 40 iterations, the training set recognition rate reaches 98.27% and the curve then remains essentially stable, with recognition rates above 88.7% for all dance types from D1 to D6. Action 10 has the most prediction errors, while the remaining actions are recognized well. In summary, the dance teaching system developed in this paper is effective and meets its design expectations.
