Data-driven Optimization of Folk Dance Inheritance and Protection Strategies and Their Realization Paths
Published Online: Sep 26, 2025
Received: Jan 12, 2025
Accepted: Apr 27, 2025
DOI: https://doi.org/10.2478/amns-2025-1056
© 2025 Wenjing Zhou, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Folk dance comes from the daily life of the working people, and it is a form of self-written, self-directed and self-performed dance with distinctive vernacular characteristics [1]. Through folk dance, we can see the hard work and simple lifestyle of the working people [2]. Thus, folk dance can deeply reflect the local working people’s production methods, customs and traditional culture, and is filled with various ethnic customs and social beliefs in terms of dance theme content. Folk dance often retains the characteristics of local production and life [3-5].
Dance is a form of body language that dates back to ancient times [6]. Folk dance carries thousands of years of history and cultural accumulation, and has a unique artistic charm and flavor [7]. At present, China has more than one hundred folk dances, many of which are widely known, such as the Dai Peacock Dance and the Mongolian Andai Dance. Folk dance has pronounced national characteristics on the one hand and distinctive themes and content on the other, which gives this art form its own value and advantages [8-10]. For this reason, the inheritance and development of folk dance need to be studied in depth so that excellent Chinese culture continues to be inherited and carried forward [11-12].
In the course of globalization and modernization, the world’s cultural ecology has changed greatly. Preserving folk culture and intangible cultural heritage, which embody the spiritual home of a nation, is an inevitable cultural requirement for a country that wants to promote its national spirit and develop independently and sustainably [13-15]. Owing to changes in the social environment, some cultural heritage is disappearing, or has already disappeared, at a rapid pace [16]. Guarding this spiritual home has therefore become especially important. Protecting folk culture and intangible cultural heritage means protecting and continuing the historical lineage and maintaining the spiritual bond that links people across generations [17-19]. Folk dance is an important part of China’s outstanding traditional culture. It flourishes among the folk, is rooted in the masses, and is known as the “mother of artistic dance”. Original folk dance is a historical memory passed down from generation to generation, and it is precious historical material and a living fossil for the research, preservation, inheritance and development of human resources [20-23].
With the rise of China’s international status and of people’s national self-confidence, more and more scholars are paying attention to the development and inheritance of China’s fine traditional culture, with folk dance at the center of this attention. The related research, however, shows that the inheritance and development of folk dance face serious problems. Literature [24] summarizes the current situation of the inheritance and protection of traditional Yi dance culture in Liangshan, Sichuan Province, and offers specific suggestions from macro and micro perspectives for inheriting and protecting traditional dance in Liangshan in the face of modern and foreign cultural influences. Literature [25] uses a questionnaire survey to reveal the obstacles to the inheritance and protection of Liaoning Manchu dance, including a lack of cultural self-confidence and a shortage of talent for dance inheritance.
It is therefore urgent to explore paths for the inheritance and protection of folk dance, and some researchers have examined the roles that the state, schools and society should play in it. Literature [26] proposes that colleges and universities take part in the inheritance and protection of folk dances and organically integrate folk dance with college education, so as to promote the inheritance and protection of intangible cultural heritage dances and carry forward fine traditional culture. Literature [27] points out the lack of an intangible cultural heritage protection mechanism in China and the drastic impact of foreign culture on traditional culture, and argues that society and government departments must act as a moat by formulating corresponding policies and implementation measures to protect intangible cultural heritage dances and culture. Literature [28] affirms the positive role of colleges and universities in the inheritance and protection of intangible cultural heritage dances, which promotes the innovative development and inheritance of folk dances and contributes to the promotion and protection of traditional folk arts.
Some experts also analyze the effective path of folk dance inheritance and protection based on the technical perspective. Literature [29], through teaching experiments, confirms that information technology such as Kinect capture helps to improve the quality and standardization of folk dance teaching, which in turn promotes the inheritance and innovative development of folk dance. Literature [30] illustrated the importance of folk dance inheritance and conceptualized a metadata model to promote the digital inheritance and preservation of folk dance.
To improve the accuracy of folk dance movement recognition, this paper preprocesses the collected machine vision images of folk dance and uses the preprocessed data as the basis for movement recognition. The average value method converts the color images into grayscale images, which reduces the data volume and improves processing capability. The temporal feature extraction model (MTNFF) is then added to the folk dance action recognition network structure based on channel topology time elimination, and the resulting spatio-temporal features of the human skeleton sequence are fed into the action classifier to obtain the classification probability of each action. The single-stream folk dance action recognition structure is introduced first, followed by the complete MTSGCN network, which is formed by late fusion of four different input streams. Finally, the recognition and application performance of the model is analyzed, and a path for folk dance inheritance and protection is proposed.
In this paper, a binocular stereo vision depth camera is used to collect video images of folk dance movements and to record the whole training process of the dancers, and a binocular stereo vision model is constructed to improve the accuracy of the collected data [31].
The
where the translation vector is described by
where the camera focal length is described by
where the internal parameters of the camera are described by
The original folk dance visual images collected are 3-channel color images, including a total of 2500 color variations. This paper therefore preprocesses the images by converting them to grayscale with the average value method, which reduces the amount of data and improves data processing capability. The average value method is calculated as follows:
$$Gray(x,y) = \frac{R(x,y) + G(x,y) + B(x,y)}{3}$$
where the red, green and blue channels of the pixel at $(x,y)$ are described by $R(x,y)$, $G(x,y)$ and $B(x,y)$, and $Gray(x,y)$ is the resulting gray value.
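As a minimal sketch of the average value method, the function below averages the three channels of each pixel. The nested-list image layout and the name `to_grayscale_average` are assumptions made for illustration and are not taken from the paper.

```python
def to_grayscale_average(image):
    """Convert an RGB image to grayscale by averaging the three channels.

    `image` is assumed to be a list of rows, each row a list of (R, G, B) tuples.
    """
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in image]


# Tiny 2x2 example image (hypothetical pixel values).
rgb = [[(255, 0, 0), (30, 60, 90)],
       [(0, 0, 0), (255, 255, 255)]]
gray = to_grayscale_average(rgb)
```

Averaging keeps one value per pixel instead of three, which is the data reduction the text refers to.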
Thresholding is applied to the folk dance visual grayscale image to produce a binarized folk dance visual image, and the thresholding formula is as follows:
Where, (
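The thresholding step can be sketched as follows. The threshold value of 128 and the convention that pixels at or above the threshold become 255 (and all others 0) are assumptions for illustration; the paper's own threshold and output values are not recoverable from the text.

```python
def binarize(gray, threshold=128, high=255, low=0):
    """Map each gray value to `high` if it reaches the threshold, else to `low`.

    `gray` is assumed to be a list of rows of numeric gray values.
    """
    return [[high if value >= threshold else low for value in row] for row in gray]


# Hypothetical grayscale patch.
binary = binarize([[85.0, 60.0], [0.0, 255.0]], threshold=60)
```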
The background subtraction of binarized folk dance visual images is implemented by Gaussian model [32] in the following process:
Step 1: Construct the model. Set the pixel point values in the binarized folk dance visual image at the
where the mean is described by
Step 2: Update and match the model. For each newly input frame, whether a pixel value in the frame matches the Gaussian model is determined by the following formula:
When the conditions of Eq. (6) are not met, foreground points are derived; when the conditions of Eq. (6) are met, background points are derived.
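The two steps can be illustrated with a per-pixel single-Gaussian background model. This is a sketch under stated assumptions: the matching test |x − μ| ≤ kσ with k = 2.5, the update rate `alpha`, the initial variance and the variance floor are common choices and are not confirmed to be the paper's Eq. (6) or its parameters.

```python
class GaussianBackground:
    """Per-pixel single-Gaussian background model (illustrative sketch)."""

    def __init__(self, first_frame, init_var=225.0, k=2.5, alpha=0.05, min_var=1.0):
        # Assumed initialization: mean from the first frame, fixed starting variance.
        self.mean = [[float(v) for v in row] for row in first_frame]
        self.var = [[init_var for _ in row] for row in first_frame]
        self.k, self.alpha, self.min_var = k, alpha, min_var

    def apply(self, frame):
        """Return a mask (255 = foreground, 0 = background) and update matched pixels."""
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, value in enumerate(row):
                mu, var = self.mean[y][x], self.var[y][x]
                diff = value - mu
                # Assumed matching condition: |x - mu| <= k * sigma.
                matched = diff * diff <= (self.k ** 2) * var
                if matched:
                    # Running update of the background statistics for matched pixels.
                    self.mean[y][x] = mu + self.alpha * diff
                    self.var[y][x] = max(self.min_var, var + self.alpha * (diff * diff - var))
                mask_row.append(0 if matched else 255)
            mask.append(mask_row)
        return mask


model = GaussianBackground([[0, 0], [0, 0]])
mask = model.apply([[0, 255], [0, 0]])
```

Pixels that fail the matching test are reported as foreground and leave the background statistics unchanged, which mirrors the foreground/background decision described above.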
The Gaussian model yields the background-subtracted folk dance visual image, but the large amount of noise remaining in the foreground hinders later folk dance movement recognition. Median filtering is therefore applied to the background-subtracted image for noise reduction, as follows:
Move the template until its center coincides with a pixel of the folk dance visual image, and take that pixel as the center of the window. Sort the gray values of all pixels in the window from smallest to largest. The median pixel gray value is then obtained with the formula:
where the window is described by
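The median filtering procedure can be sketched in a few lines. The 3×3 window size and the border handling (edge pixels are replicated) are assumptions for illustration; the paper does not state either.

```python
def median_filter(image, size=3):
    """Replace each pixel by the median of the size x size window centered on it.

    `image` is a list of rows of gray values; borders are handled by replicating
    the nearest edge pixel (an assumed convention).
    """
    height, width = len(image), len(image[0])
    radius = size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            window.sort()                         # order gray values from smallest to largest
            out[y][x] = window[len(window) // 2]  # take the middle value
    return out


# A single bright noise pixel in a dark patch is removed by the filter.
noisy = [[0, 0, 0],
         [0, 255, 0],
         [0, 0, 0]]
denoised = median_filter(noisy)
```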
In this paper, we propose a neighboring frame feature processing module to capture the relationship between neighboring frames for learning, and add the feature relationship between neighboring frames to capture the subtle transformations while preserving the original temporal features. The spatial feature processed feature sequence is set to
where
Obtaining the sequence of skeleton features between neighboring frames, the processing flow of the neighboring frame feature processing module is applied to
where
After extracting the temporal feature sequences from the neighboring frames, 1 × 1-convolution operation is performed on them and the Sigmoid activation function is used to process the relevant features, and then the residual connection is used to fuse them with the original feature maps, and this processing flow can be expressed as follows:
Where,
In order to fuse the features between all the frames after processing, the Hadamard product is taken for aggregation. The feature sequence after the neighbor frame feature processing module can be expressed as:
where
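One possible reading of the neighboring frame feature processing module is sketched below: the difference between adjacent frames is passed through a 1×1 convolution and a Sigmoid, and the resulting gate is combined with the original features by a Hadamard product plus a residual connection. The exact combination, the handling of the last frame, and all names (`neighbor_frame_features`, `weight`, `bias`) are assumptions, since the paper's formulas are not available here.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def conv1x1(vector, weight, bias):
    """A 1x1 convolution at one position is a linear map across channels."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]


def neighbor_frame_features(frames, weight, bias):
    """frames: list of T per-frame feature vectors with C channels each."""
    out = []
    for t, current in enumerate(frames):
        # Assumed boundary rule: the last frame is paired with itself.
        following = frames[min(t + 1, len(frames) - 1)]
        diff = [b - a for a, b in zip(current, following)]    # neighboring-frame difference
        gate = [sigmoid(v) for v in conv1x1(diff, weight, bias)]
        # Hadamard product with the gate, plus a residual connection to the input.
        out.append([x + x * g for x, g in zip(current, gate)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
features = neighbor_frame_features([[1.0, 2.0], [1.0, 2.0]], identity, [0.0, 0.0])
```

With identical neighboring frames the difference is zero, so the gate is 0.5 everywhere and each feature is scaled by 1.5; larger frame-to-frame changes push the gate toward 0 or 1.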
For temporal feature extraction, multi-scale temporal modeling is adopted [33]. Four different branches are described in detail below.
Each of the four branches inputs the human skeleton structure information after spatial feature extraction
The first branch performs a 1×1 convolution on the input, mainly to reduce the channel dimension, followed by a dilated convolution with a dilation rate of 1 to enlarge the receptive field;
The second branch applies the neighboring frame feature processing module to the input to obtain the relationship between neighboring frames, followed by a 1×1 convolution and a 5×1 convolution;
The third branch first performs a 1×1 convolution on the input and then applies a 3×1 max pooling layer;
The fourth branch simply performs 1×1 convolution for channel dimension adjustment.
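The four branches can be illustrated with a toy, dependency-free sketch that operates on a sequence of per-frame channel vectors. Several details are assumptions: the 3×1 kernel of the first branch, the use of one shared temporal kernel for every channel, zero padding in time, and the final concatenation with a residual addition and ReLU. The helper names and `neighbor_fn` (a stand-in for the neighboring frame module) are hypothetical.

```python
def conv1x1(seq, weight):
    """Per-frame linear map across channels: seq is T x C_in, weight is C_out x C_in."""
    return [[sum(w * v for w, v in zip(row, frame)) for row in weight] for frame in seq]


def temporal_conv(seq, kernel, dilation=1):
    """k x 1 convolution along time with zero padding (same kernel for every channel)."""
    length, half = len(seq), (len(kernel) // 2) * dilation
    out = []
    for t in range(length):
        frame = []
        for c in range(len(seq[0])):
            acc = 0.0
            for i, w in enumerate(kernel):
                s = t - half + i * dilation
                if 0 <= s < length:
                    acc += w * seq[s][c]
            frame.append(acc)
        out.append(frame)
    return out


def temporal_max_pool(seq, size=3):
    """size x 1 max pooling along time with stride 1."""
    length, radius = len(seq), size // 2
    return [[max(seq[s][c] for s in range(max(0, t - radius), min(length, t + radius + 1)))
             for c in range(len(seq[0]))] for t in range(length)]


def multi_branch_block(seq, reduce_w, k3, k5, neighbor_fn=lambda s: s):
    b1 = temporal_conv(conv1x1(seq, reduce_w[0]), k3, dilation=1)  # 1x1 conv + dilated conv
    b2 = temporal_conv(conv1x1(neighbor_fn(seq), reduce_w[1]), k5)  # neighbor module + 1x1 + 5x1
    b3 = temporal_max_pool(conv1x1(seq, reduce_w[2]), size=3)       # 1x1 conv + 3x1 max pooling
    b4 = conv1x1(seq, reduce_w[3])                                  # 1x1 conv only
    out = []
    for t in range(len(seq)):
        concatenated = b1[t] + b2[t] + b3[t] + b4[t]                # channel concatenation
        out.append([max(0.0, a + b) for a, b in zip(concatenated, seq[t])])  # residual + ReLU
    return out


# Toy input: 3 frames, 4 channels; each branch reduces to one channel.
sequence = [[1, -1, 0, 2], [2, -2, 5, 1], [3, -3, 1, 0]]
weights = [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]], [[0, 0, 0, 1]]]
result = multi_branch_block(sequence, weights, k3=[0, 1, 0], k5=[0, 0, 1, 0, 0])
```

Reducing each branch to a quarter of the channels keeps the concatenated output the same width as the input, which is what makes the residual addition possible in this sketch.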
MTSGCN single-stream folk dance action recognition consists of three parts: spatial feature extraction, temporal feature extraction, and folk dance action classification. The MCTTE and MTNFF models are used for spatial and temporal feature extraction, respectively, and the feature sequences obtained after spatio-temporal feature extraction are input into the classifier to obtain the corresponding action classification. The specific network structure of MTSGCN [34] is shown in Fig. 1.

Figure 1. MTSGCN single-stream human action recognition framework
Firstly, the input sequence is denoted as
where
The feature sequence
The first branch performs a 1×1 convolution on the input, mainly to reduce the channel dimension, followed by a dilated convolution with a dilation rate of 1 to enlarge the receptive field;
The second branch first performs information fusion between neighboring frames on the input sequence, and the formula is expressed as follows:
where BN denotes batch normalization,
A 1×1 convolution and a 5×1 convolution are then applied to the processed feature sequence.
The third branch first performs a 1×1 convolution on the input and then applies a 3×1 max pooling layer;
The fourth branch simply performs 1×1 convolution for channel dimension adjustment.
Afterwards, the feature sequences output by the four branches are concatenated and added element-wise to the original input, and the ReLU activation function is then applied to introduce nonlinearity between the layers of the network. The formula is as follows:
where
Therefore, spatial feature extraction is performed on the initial feature sequence first, and the feature sequence after spatial feature extraction is fed into the temporal feature extraction module to obtain the feature sequence after spatio-temporal feature extraction
The sequence after feature extraction needs to go through the classifier to get the corresponding classification result. The probability value of the corresponding action category is output. The formula is expressed as follows:
where
The MTSGCN network proposed in this paper consists of four different input streams, each of which performs spatio-temporal feature extraction on its own feature sequence and outputs the corresponding folk dance movement classification probabilities. Finally, late fusion sums the probabilities of the four streams to obtain the final folk dance movement classification. The complete network structure is shown in Fig. 2.

Figure 2. MTSGCN network structure
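The late fusion step can be sketched as follows. The use of a softmax to turn each stream's class scores into probabilities is an assumption (the paper only states that the classifier outputs class probabilities), and the score values are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_scores):
    """Sum the per-class probabilities of all streams and return (label, fused)."""
    probabilities = [softmax(scores) for scores in stream_scores]
    fused = [sum(p[c] for p in probabilities) for c in range(len(probabilities[0]))]
    return fused.index(max(fused)), fused


# Four hypothetical input streams scoring three action classes.
label, fused = late_fusion([[2.0, 1.0, 0.0],
                            [0.0, 3.0, 0.0],
                            [1.0, 2.0, 0.0],
                            [0.0, 0.0, 1.0]])
```

Because each stream contributes a probability vector that sums to one, a single confident stream cannot dominate the fused decision by more than its share.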
The experiments in this paper are conducted on the PyTorch platform on a Windows system. The dropout of both the spatial-domain network and the time-domain network is set to 0.5, and the momentum is set to 0.9. The initial learning rate is 0.001. For the spatial-domain network, the learning rate is reduced to 1/10 of its value at the 20th epoch and by a further factor of 10 at the 40th epoch. For the time-domain network, the learning rate is reduced to 1/10 of its value at the 25th epoch and by a further factor of 10 at the 50th epoch. Both networks are trained for 50 epochs.
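The learning-rate schedule described above amounts to a step decay. The sketch below assumes that epochs are counted from 1 and that the reduced rate takes effect from the milestone epoch onward; the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.001, milestones=(20, 40), gamma=0.1):
    """Learning rate at a given epoch, multiplied by `gamma` at each milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


# Spatial-domain network: decay at epochs 20 and 40.
spatial_schedule = [step_lr(e) for e in range(1, 51)]
# Time-domain network: decay at epochs 25 and 50.
temporal_schedule = [step_lr(e, milestones=(25, 50)) for e in range(1, 51)]
```

In PyTorch the same behavior is usually obtained with a multi-step scheduler attached to the optimizer rather than a hand-written function.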
For training, the videos of each category in the two datasets are divided into three equal parts, and the training and test sets are formed in three ways. In Split1, the first third is the test set and the middle and last thirds are the training set; in Split2, the middle third is the test set and the first and last thirds are the training set; in Split3, the last third is the test set and the first and middle thirds are the training set. The model is trained and tested on each of the three splits separately, and the three accuracies are averaged to give the action recognition accuracy on the dataset.
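The three-way split and the averaging of the resulting accuracies can be sketched as below for the videos of a single category. The function names and the example accuracies are hypothetical; any remainder videos that do not divide evenly by three are placed in the last part, which is an assumption.

```python
def three_way_splits(videos):
    """Return [(train, test)] for Split1, Split2 and Split3 of one category."""
    third = len(videos) // 3
    parts = [videos[:third], videos[third:2 * third], videos[2 * third:]]
    splits = []
    for test_index in range(3):
        train = [v for i, part in enumerate(parts) if i != test_index for v in part]
        splits.append((train, parts[test_index]))
    return splits


def mean_accuracy(accuracies):
    """Average the accuracies obtained on the three splits."""
    return sum(accuracies) / len(accuracies)


splits = three_way_splits(list(range(9)))    # nine hypothetical video ids
overall = mean_accuracy([0.90, 0.92, 0.94])  # made-up per-split accuracies
```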
To verify the effectiveness of the proposed algorithm, a series of comparative experiments is conducted on dataset 1 (single-person dance) and dataset 2 (two-person dance), and the accuracy of “traditional model 1, model 2 and model 3 (this paper’s model)” is recorded during training. Figure 3 shows the accuracy curves on dataset 1 and Figure 4 those on dataset 2; in both figures, (a) and (b) correspond to the Flow and RGB test conditions, respectively. The figures show that the proposed model improves on the traditional benchmark network. On dataset 1, the accuracy of the model for single-person dance movement recognition exceeds 90% after 30 iterations in the Flow condition and exceeds 85% after 15 iterations in the RGB condition. On dataset 2, the accuracy of the model exceeds 60% only after 45 and 15 iterations in the Flow and RGB conditions, respectively.

Figure 3. Accuracy changes on dataset 1

Figure 4. Accuracy changes on dataset 2
To verify the validity of the added modules, ablation experiments are conducted in this section. The three models are compared pairwise, and their accuracy under the three input modes on the two datasets is shown in Table 1. On dataset 1, all three models (Model 1, Model 2 and Model 3) achieve their highest accuracy in the RGB+Flow input mode: 88.37%, 93.81% and 97.41%, respectively. In the RGB and Flow input modes, the accuracies of the three models range over 82.84%-86.44% and 86.11%-91.75%, respectively.
Table 1. Accuracy of the three input modes on the two datasets
Data set | Network model | RGB (%) | Flow (%) | RGB+Flow (%) |
---|---|---|---|---|
Data set 1 | Network model 1 | 82.84 | 86.11 | 88.37 |
Data set 1 | Network model 2 | 83.78 | 87.46 | 93.81 |
Data set 1 | Network model 3 | 86.44 | 91.75 | 97.41 |
Data set 2 | Network model 1 | 59.12 | 61.08 | 72.09 |
Data set 2 | Network model 2 | 61.62 | 64.13 | 73.05 |
Data set 2 | Network model 3 | 65.95 | 66.05 | 77.39 |
On dataset 2, the RGB+Flow input mode again gives the highest accuracy for all three models: 72.09%, 73.05% and 77.39%, respectively. In the RGB and Flow input modes, the accuracies of all three models are below 70%. The table also shows that the networks trained on dataset 1 are more accurate than those trained on dataset 2. This is because dataset 1 contains less complex scenes and is larger than dataset 2.
In dataset 1, this paper investigates nine folk dance movements, “fanning, twisting, throwing waist, kicking, lowering the waist, pacing, swinging arms, jumping, and squatting”, which are labeled 1~9. In dataset 2, this paper investigates nine movements, namely “pair fan, double waist twist, double waist turn, double arm swing, double kick, arm extension, double paw hitting the ground, double hip twisting, and double exchange step”, which are labeled 11~19. Because the confusion matrix summarizes the recognition accuracy of each action category well, this section uses it to visualize the recognition accuracy of the actions in dataset 1 and dataset 2.
The confusion matrix for the identified results on dataset 1 is shown in Figure 5. The vertical axis in the coordinate diagram represents the true category of each action, the horizontal axis represents the predicted category of each action, and the diagonal line represents the correct recognition accuracy of each action category. The results show that on dataset 1, the recognition accuracy of the model for “waist twisting, fan turning, kicking-fan turning, swinging arm-throwing waist, jumping-kicking” is less than 70%. The recognition accuracy of “kicking-lower waist, lower waist-fan, lower waist-throwing waist, pacing-throwing waist, jumping-throwing waist, jumping-lower waist, squatting-fanning, squatting-jumping” is more than 95%. The recognition accuracy between the rest of the actions is within 70%-95%. Overall, the recognition accuracy of the model is good.

Figure 5. Confusion matrix of the recognition results on dataset 1
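A confusion matrix of the kind described here, with true categories on the rows and predicted categories on the columns, can be computed as in the sketch below. The labels are made-up toy values with classes numbered from 0, not data from the paper.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """counts[i][j] = number of samples of true class i predicted as class j."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        counts[true][predicted] += 1
    return counts


def per_class_accuracy(counts):
    """Diagonal entry divided by the row total, i.e. the recall of each class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(counts)]


# Toy example with three action classes.
counts = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
accuracy = per_class_accuracy(counts)
```

The diagonal of the row-normalized matrix gives the per-class recognition accuracy, while off-diagonal entries show which pairs of actions are confused with each other.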
The confusion matrix of the recognition results on dataset 2 is shown in Fig. 6. Dataset 2 also contains nine types of actions, and the recognition accuracy is above 60% for all of them except “two-person arm swing - two-person waist turn, two-person foot strike - two-person waist turn”. The recognition accuracy of two-person kicking and two-person twisting reaches 99.07%, followed by “two-person exchange step - arm extension, two-person arm swing - two-person twisting, two-person exchange step - two-person twisting, two-person exchange step - two-person twisting”, with recognition accuracies of 96.02%, 94.38%, 92.88% and 90.15%, respectively. The accuracies of the remaining movements are between 60% and 90%. Overall, the recognition results of the model are good.

Figure 6. Confusion matrix of the recognition results on dataset 2
An overall analysis of the recognition accuracies of the two datasets reveals that the model has a higher recognition accuracy for dataset 1 than for dataset 2. This may be related to the difficulty of the folk dance movements in both datasets. Because the movements in dataset 2 are produced by two-person interactions, the difficulty of recognition increases compared to the single-person movements in dataset 1, so it leads to a slight decrease in the recognition accuracy. However, overall, the model has better recognition accuracy in both datasets.
Five existing action recognition methods, ST-ResNet, KVMF, TBN, ActionVLAD and STRA-Net, are selected for comparison with the MTSGCN method in this paper. The accuracy of the different methods on dataset 1 and dataset 2 is shown in Table 2. The proposed method achieves the highest accuracy on both datasets (97.06% and 89.01%), which is 5.9%-18.85% and 8.46%-17.28% higher than the other five algorithms on dataset 1 and dataset 2, respectively. This shows that the algorithm in this paper recognizes folk dance movements with high accuracy and has good application prospects.
Table 2. Accuracy of different methods on dataset 1 and dataset 2
Model | Data set 1 (%) | Data set 2 (%) |
---|---|---|
ST-ResNet | 87.81 | 80.55 |
KVMF | 84.23 | 74.08 |
TBN | 91.16 | 72.11 |
ActionVLAD | 85.62 | 73.32 |
STRA-Net | 78.21 | 71.73 |
MTSGCN | 97.06 | 89.01 |
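
The margins quoted for Table 2 can be checked with a few lines of arithmetic. The dictionary below simply copies the table values; the helper name `margin_range` is hypothetical.

```python
# Accuracy (%) on dataset 1 and dataset 2, copied from Table 2.
table2 = {
    "ST-ResNet": (87.81, 80.55),
    "KVMF": (84.23, 74.08),
    "TBN": (91.16, 72.11),
    "ActionVLAD": (85.62, 73.32),
    "STRA-Net": (78.21, 71.73),
    "MTSGCN": (97.06, 89.01),
}


def margin_range(results, column, ours="MTSGCN"):
    """Smallest and largest gap (in percentage points) between `ours` and the others."""
    gaps = [round(results[ours][column] - accuracy[column], 2)
            for name, accuracy in results.items() if name != ours]
    return min(gaps), max(gaps)


dataset1_margin = margin_range(table2, column=0)
dataset2_margin = margin_range(table2, column=1)
```

The smallest gap on dataset 1 is against TBN and the smallest on dataset 2 is against ST-ResNet, while STRA-Net is furthest behind on both datasets.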
In order to verify whether the dance movement feature extraction method based on multi-branching time and neighboring frame features proposed in this paper is effective in movement recognition, training and validation are carried out on the publicly available folk dance movement dataset FD1 to evaluate the effectiveness of the algorithm.
To verify the effectiveness of folk dance movement recognition, experiments are conducted on the FD1 dataset. The results for the effectiveness of the overall network structure are shown in Table 3, which compares the baseline model with the model in this paper using the top-1 accuracy on the validation set. The top-1 action recognition accuracy of the proposed model on the validation set is 92.21%, an improvement of 8.75 percentage points over the 83.46% of the baseline model, although the model is also larger (252.23 MB versus 65.92 MB). The experimental results show that the model in this paper has obvious advantages in the recognition and extraction of multi-branch time and neighbor frame features, which further improves the recognition performance of folk dance actions.
Table 3. Effectiveness of the overall network structure
Model | Top-1 (%) | Model size (MB) |
---|---|---|
Baseline model | 83.46 | 65.92 |
MTSGCN | 92.21 | 252.23 |
Figure 7 shows the recognition accuracy for the nine single-person folk dance movements. The model in this paper recognizes all nine movements more accurately than the baseline model. The gap between the two models is largest for the “fan turn” movement, followed by the lower back and swinging arm movements, where this paper’s model is more accurate than the baseline model by 28.9%, 24.88% and 21.57%, respectively. In addition, this paper’s model is 16.04%, 16.88% and 12.61% more accurate than the baseline model for the waist-swinging, lower back and squatting movements, respectively. The recognition errors for the jumping and twisting movements are both within 10%. Overall, the average recognition accuracies of the proposed algorithm and the baseline model over the nine single-person folk dance movements are 90.7% and 72.44%, respectively, which shows that the algorithm in this paper performs well.

Figure 7. Recognition accuracy for the nine single-person folk dance movements
Figure 8 shows the recognition accuracy for the nine two-person interaction actions. For the model in this paper, the recognition accuracy is below 80% for “two-person waist turning, two-person arm swinging, and two-person leg kicking” and ranges from 81.24% to 88.51% for the other six actions. For the baseline model, in contrast, the recognition accuracy is below 70% for every action category except the two-person exchange step. The two models differ considerably on the two-person interaction actions: across the nine actions, the recognition accuracy of this paper’s model is 13.39%-24.1% higher than that of the baseline model. Overall, the average recognition accuracy of this paper’s model (82.31%) is 16.81% higher than that of the baseline model (65.5%). The proposed model thus still performs well on these hard-to-recognize action samples, which indicates that it can effectively extract the subtle interaction features between the action nodes and therefore recognize the categories of difficult action samples more accurately.

Figure 8. Recognition accuracy for the nine two-person interaction actions
The rational use of digital equipment has greatly facilitated the preservation and transmission of folk dances. For example, a camera can film a dance performance for repeated playback; a scanner can copy image material in full and save it to a computer for viewing and printing at any time; and an Internet database can store a large volume of material and make it retrievable without limits of time and space.
Today, with the help of intelligent devices (e.g., cameras and video cameras), we can objectively and faithfully record the performance forms, dance vocabulary, props and other aspects of folk dances, duplicate and scan precious dance literature, and record the audio of folk dances for long-term storage. In addition, information related to folk dances can be compiled on the Internet for public inquiry and retrieval. The collected materials can also be organized and stored by category to establish a digital archive. Folk dances can be publicized through official websites, WeChat official accounts and other channels, and a website for folk dance cultural resources can be created to promote folk dances online.
Folk dance is a comprehensive and culturally rich art form in which each form and movement has a specific connotation. Some folk dances have long been popular but are poorly documented, and collected text and video alone cannot fully express and pass on their connotation, which calls for more advanced digitization technology. Motion capture is one example. It was initially applied to animation production: recorders used motion capture equipment to project the movements of dance performers onto a computer screen, and the production staff then created the animation from the projected movements. In addition, digital technology can be used to analyze representative clips of folk dance and the laws and characteristics of their movements, capture the common laws of folk dance movements, and produce digital models of folk dance, so that lost art forms can be presented to people again as models.
In this paper, folk dance movements are captured with visual image technology, and the visual images are processed with a Gaussian model. The spatio-temporal convolutional features of folk dance movements are then extracted by the MTSGCN network to obtain the final folk dance movement recognition results, and a path for inheritance and protection is proposed.
In folk dance movement recognition, single-person dance is recognized significantly better than two-person dance, with recognition accuracies of 97.41% and 77.39%, respectively, in the RGB+Flow input mode. In addition, the visualization results show that most movement recognition accuracies in the two datasets lie between 70%-95% and 60%-90%, respectively, so the model recognizes the movements effectively. The recognition accuracy of the MTSGCN method is higher than that of the other five recognition methods by 5.9%-18.85% on dataset 1 and by 8.46%-17.28% on dataset 2, which shows that this paper’s algorithm recognizes folk dance movements with high accuracy and has good application prospects. The action recognition accuracy of this paper’s model on the validation set is improved by 8.75% compared with the baseline model; the recognition accuracy of this paper’s model on the nine single-person and nine two-person folk dance actions is improved by 2.09%-28.9% and 13.39%-24.1% compared with the baseline model, respectively. The experimental results show that the model in this paper has obvious advantages in the recognition and extraction of multi-branch time and neighbor frame features, which significantly improves the recognition performance of folk dance movements. In specific practice, each region should combine the characteristics of its own folk dance culture, integrate and optimize resources and materials, start from the classification of folk dance content, build different modules, form a comprehensive and systematic model of dance inheritance and development, and make full use of information technology to give strong impetus to the inheritance of traditional folk dance culture.