Research on automatic identification and evaluation method of piano playing skills based on convolutional neural network 
Published: 21 Mar 2025
Received: 20 Oct 2024
Accepted: 02 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0622
© 2025 Xiaoliang Wu, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
In the teaching of piano performance in colleges and universities, the words “performance” and “playing” reflect the dual requirements of musical expressiveness and technical skill, respectively [1]. Technique forms the basis of piano performance, and students must master it, while musical expression is the soul of the performance: it gives emotion and depth to the work, so that each performance has its own personality [2-3]. However, current piano performance teaching in colleges and universities often favors the cultivation of students’ technical skills and neglects musical expression, which restricts the breadth and depth of students’ artistic expression and also limits their ability to fully understand music performance [4-6].
With the continuous development of music education, the relationship between playing skills and musical expression on the piano, an important instrument for musical expression, has attracted increasing attention [7]. Traditional music education tends to focus on technical training while paying relatively little attention to musical expression [8]. As musical expression becomes increasingly important in performance, how to combine technique and expression effectively has become an active research topic. Under the influence of the Internet, music education has also gradually entered a new stage of development [9-10].
Internet-based music education breaks through the limitations of traditional education. By making full use of the network, it lets students learn without restrictions of place and time, and its rich learning resources stimulate students’ motivation and interest in learning [11-13]. As an important part of the Internet, the mobile Internet can provide more complete network services on mobile terminals [14].
Music performance has always been an important subject in the music education curriculum, and innovative, reliable, credible, and effective music performance assessment is an important part of successful music education [15-16]. The potential problems of music performance assessment are closely related to the characteristics of music performance. For any musical work, different musicians will differ in timbre, performance, intensity, interpretation, and so on. Thus, different musical performances may require different assessment methods [17-19].
In this paper, a hand gesture recognition system based on G-HRNet is constructed for piano playing technique. The system acquires camera data in real time, processes it on the fly, and recognizes the joint points of the hand using target detection and gesture recognition. On this basis, the high-resolution network HRNet is adopted, the Ghost module and the DFC attention mechanism are introduced into it, a lightweight G-Block module is constructed, and the lightweight high-resolution network model G-HRNet is proposed. Human joint angle features are then combined with the DTW algorithm to assess the similarity between the hand movements during piano playing and the standard playing movements. Finally, the effectiveness of the proposed algorithm is verified on public datasets.
First, the system captures the piano playing video in real time through the camera and, after image processing, sends it to the algorithm recognition module for playing error recognition, hand shape error recognition, and scoring. The recognized video is then displayed on the front end through WebSocket communication. The algorithm recognition module counts the number of playing errors and hand shape errors and sends the errors and the overall score to the front end over HTTP. The overall framework of the system is shown in Figure 1. Algorithm recognition consists of data acquisition, data labeling, model training, and algorithm deployment. The YOLOv3 target detection algorithm is used for model training, and in the deployment step the trained model is integrated into the system and deployed on the server.

System framework
The deep learning-based piano fingering recognition system designed in this paper can recognize and correct the fingering of practitioners. The modules of the system are shown in Figure 2. AI technology is used to simulate an accompanying trainer that guides students’ piano practice: image and video recognition algorithms detect the hand shape during playing and correct students’ hand shape and fingering errors in real time. The system mainly implements three functions: image acquisition, gesture recognition, and recognition comparison.

System module
The key algorithm for hand recognition in this project consists of two primary components: a target detection algorithm and a pose estimation algorithm. The target detection algorithm mainly uses YOLOv3 [20] to detect the position of the hand in the key frame. YOLOv3 performs target detection with a single neural network by dividing the image into regions and predicting the probability and bounding box for each region. The block diagram of the scoring algorithm for this system is shown in Fig. 3. Detection is divided into two main steps. First, a series of candidate regions is generated on the image according to certain rules, and these candidate regions are labeled according to their positional relationship with the ground-truth box. Second, image features are extracted using a convolutional neural network to predict the location and category of the candidate regions. Two functions are built on this basis. The first uses YOLOv3 to judge whether the hand shape in the current key frame is incorrect. The second uses the HRNet pose estimation algorithm to predict the hand keypoint coordinates once the position of the hand in the current key frame has been obtained. The keypoint coordinates are stored in an array, and the DTW algorithm [21] is used to compare the player’s performance with the standard data played by the piano teacher and to score it. The final output is the number of incorrect hand shapes and the score of the player’s performance.
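The "positional relationship" between a candidate region and the ground-truth box is usually quantified by intersection over union (IoU). The following is a minimal sketch under that assumption; the box layout `(x_min, y_min, x_max, y_max)`, the function names, and the 0.5 threshold are illustrative choices, not values taken from this paper.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive_candidate(candidate: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Label a candidate as positive when it overlaps the truth box enough.

    The threshold is a hypothetical value chosen for illustration.
    """
    return iou(candidate, truth) >= threshold


hand_truth = (10.0, 10.0, 50.0, 50.0)
print(iou((20.0, 20.0, 60.0, 60.0), hand_truth))
print(is_positive_candidate((12.0, 12.0, 50.0, 50.0), hand_truth))
```

A candidate that barely touches the annotated hand box gets an IoU near zero and is treated as background, while one that covers most of it is treated as a hand sample.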

Scoring algorithm block diagram
This project uses the YOLOv3 target detection algorithm for model training. The officially provided default weights file is used for initialization, and the model’s label types and other information are configured. The labeled dataset is preprocessed and fed into the network, and the network is trained on the images in the dataset to produce its outputs. Finally, the trained model is used for target prediction.
Prior to the advent of the high-resolution network HRNet [22], network models for pose estimation relied on recovering high-resolution feature maps. However, such an approach loses a large amount of useful information during repeated up- and downsampling. To improve joint detection accuracy, network structures have been made ever deeper and wider, and even the input image resolution has been increased. As a result, existing high-precision models have a large number of parameters and a heavy computational load, which makes them difficult to use when computational resources are limited. This paper therefore proposes the lightweight network G-HRNet, a lightweight improvement of HRNet.
Deploying neural networks on embedded devices is difficult due to memory and computational power limitations. The operation used to generate an arbitrary convolutional layer of 
Where “*” represents the convolution operation; 
Assume that there are 
The convolution filters are 
In order to obtain the required 
Where 
Given an input feature 
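As a rough illustration of why a Ghost convolution is lighter than an ordinary convolution, the sketch below counts parameters under the formulation commonly given for the Ghost module [23]: a primary convolution produces n/s intrinsic feature maps, and cheap d×d linear operations generate the remaining maps from them. The symbols `c_in`, `n_out`, `k`, `s`, `d` and the function names are assumptions made for this example, and biases are ignored.

```python
def conv_params(c_in: int, n_out: int, k: int) -> int:
    """Weights of an ordinary k x k convolution with n_out filters."""
    return n_out * c_in * k * k


def ghost_params(c_in: int, n_out: int, k: int, s: int = 2, d: int = 3) -> int:
    """Weights of a Ghost convolution producing n_out feature maps.

    s is the assumed ratio between output maps and intrinsic maps,
    d is the assumed kernel size of the cheap linear operations.
    """
    if n_out % s != 0:
        raise ValueError("n_out must be divisible by s in this sketch")
    m = n_out // s                   # intrinsic maps from the primary convolution
    primary = m * c_in * k * k       # ordinary convolution with only m filters
    cheap = m * (s - 1) * d * d      # one d x d filter per generated ghost map
    return primary + cheap


ordinary = conv_params(c_in=64, n_out=128, k=3)
ghost = ghost_params(c_in=64, n_out=128, k=3, s=2, d=3)
print(ordinary, ghost, ordinary / ghost)
```

With these illustrative sizes the ratio comes out close to s, which is the compression the Ghost module is designed to deliver.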
The G-Block consists of three Ghost convolutions. The first Ghost convolution serves as an expansion layer that increases the number of feature channels and is followed by the ReLU activation function to speed up training. The second and third Ghost convolutions reduce the number of channels back to the original number, and batch normalization (BN) is added after the last Ghost convolution. Finally, following the principle of residual networks, the input is split into a main input and a residual input, and the input and output are connected by a shortcut, which prevents the loss of information.
The original HRNet is made lightweight in two respects. First, Ghost convolution [23] and the DFC attention mechanism are used to construct a lightweight residual module. Second, the parallel subnets of the original HRNet are streamlined, which reduces the number of model parameters. The G-HRNet network is mainly composed of three stages. It consists of parallel connected subnetworks whose resolution decreases gradually from top to bottom, and multi-scale feature fusion is carried out between features of different resolutions: low-resolution features are restored to high resolution through upsampling and then fused with the high-resolution features, while high-resolution features are reduced to lower resolution through downsampling and then fused with the low-resolution features. Feature fusion is thus carried out repeatedly.
The heatmap regression approach DARK uses the Taylor expansion of the Gaussian distribution to mitigate the quantization error in heatmap regression. DARK decoding proceeds as follows:
Assume that the predicted heatmap is completely free of error, so that it equals the constructed heatmap label, i.e.:
In the first step, we log the predicted heatmap and 
In the second step, the logarithmic heat map is Taylor-expanded at the maximum point 
In the third step, both sides of the equal sign of the above equation are derived simultaneously with respect to 
In the fourth step, bring 
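The four decoding steps can be sketched numerically. Under the usual DARK formulation, assumed here, the logarithm of a Gaussian heatmap is quadratic, so a second-order Taylor expansion at the integer maximum m gives the refined location m − H⁻¹g, where g and H are the gradient and Hessian of the log-heatmap at m. The function names, the finite-difference derivatives, and the toy heatmap are illustrative assumptions; the heatmap smoothing that DARK applies before decoding is omitted.

```python
import math
from typing import List, Tuple

Heatmap = List[List[float]]  # indexed as heatmap[y][x]


def gaussian_heatmap(width: int, height: int, cx: float, cy: float, sigma: float) -> Heatmap:
    """Toy heatmap: an ideal 2D Gaussian centred at the sub-pixel point (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def dark_decode(heatmap: Heatmap) -> Tuple[float, float]:
    """Refine the integer argmax of a heatmap to a sub-pixel (x, y) location."""
    height, width = len(heatmap), len(heatmap[0])
    my, mx = max(((y, x) for y in range(height) for x in range(width)),
                 key=lambda p: heatmap[p[0]][p[1]])
    # Derivatives need all eight neighbours, so border maxima are returned unrefined.
    if not (0 < mx < width - 1 and 0 < my < height - 1):
        return float(mx), float(my)

    def log_h(y: int, x: int) -> float:
        return math.log(max(heatmap[y][x], 1e-10))  # step 1: take the logarithm

    # Steps 2-3: first and second derivatives at the maximum (central differences).
    dx = 0.5 * (log_h(my, mx + 1) - log_h(my, mx - 1))
    dy = 0.5 * (log_h(my + 1, mx) - log_h(my - 1, mx))
    dxx = log_h(my, mx + 1) - 2 * log_h(my, mx) + log_h(my, mx - 1)
    dyy = log_h(my + 1, mx) - 2 * log_h(my, mx) + log_h(my - 1, mx)
    dxy = 0.25 * (log_h(my + 1, mx + 1) - log_h(my + 1, mx - 1)
                  - log_h(my - 1, mx + 1) + log_h(my - 1, mx - 1))
    det = dxx * dyy - dxy * dxy
    if abs(det) < 1e-12:
        return float(mx), float(my)
    # Step 4: offset = -H^{-1} g, using the closed-form inverse of a 2x2 matrix.
    off_x = -(dyy * dx - dxy * dy) / det
    off_y = -(dxx * dy - dxy * dx) / det
    return mx + off_x, my + off_y


demo = gaussian_heatmap(16, 16, cx=6.3, cy=9.7, sigma=2.0)
print(dark_decode(demo))
```

Plain argmax decoding would report the grid point (6, 10) for this toy heatmap, while the Taylor-based refinement recovers the sub-pixel centre.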
In this paper, we build a hand pose estimation model based on the G-HRNet network. The model takes a color hand image as input. The Stem stage downsamples the input image by a factor of 4, and the resulting hand feature maps are fed into the first-stage subnetwork. In the second stage, the high-resolution feature maps are downsampled to produce feature maps at half their resolution, which are used for feature extraction in a parallel branch. In the third stage, the second-stage feature maps are downsampled in the same way to generate the feature maps of the third-stage subnetwork. Finally, the feature maps generated by the three subnetworks are fused and output at the size of the highest-resolution feature map, i.e., 64×64. After a convolution with 21 kernels, 21 joint-point heatmaps are obtained, and the final hand joint-point detection results are obtained after DARK heatmap decoding.
To obtain a more reliable output for similarity assessment, the difference along the time axis must be overcome when calculating the similarity between two human joint angle sequences: the two joint angle feature vectors to be assessed have to be matched and aligned on the time axis. The dynamic time warping (DTW) algorithm is an effective way to solve this problem. It warps the time series by stretching or compressing them along the time axis, thus enabling similarity calculation between time series.
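A joint angle feature can be derived from three neighbouring keypoints, for example two finger joints and the fingertip. The following minimal sketch assumes 2D keypoint coordinates and measures the angle at the middle point; the function name and the sample coordinates are hypothetical and only illustrate the geometry.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("neighbouring keypoints must not coincide with the joint")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical keypoints of one finger in image coordinates.
print(joint_angle((120.0, 80.0), (130.0, 100.0), (128.0, 125.0)))
```

Computing this angle for every frame of a video yields one joint angle sequence per joint, which is the kind of time series that DTW then aligns.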
The DTW algorithm addresses a limitation of matching based on the traditional Euclidean distance, which cannot stretch or compress sequences along the time axis and therefore cannot compare time series of unequal length, such as speech feature parameter sequences. The core idea of DTW is to find the best temporal match between the elements of two time series by dynamic programming. DTW therefore allows one-to-many or many-to-one mappings between time series elements, lengthening or shortening the series to align and evaluate them on the time axis. DTW is widely applied in human movement recognition and analysis, DNA sequence alignment, and image processing tasks that can be transformed into linear sequence analysis.
There are two time series 
The DTW algorithm needs to add some constraints when finding the optimal matching path, i.e:
 1) Boundary condition:  2) Continuity: Each step of the optimal matching path  3) Monotonicity: 
Among the paths that satisfy the above constraints, the path with the smallest distortion distance is the required optimal matching path. Then:
Through denominator 
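For reference, the three constraints and the objective are commonly written as follows in the DTW literature. The notation is an assumption made here and may differ from the symbols used in this paper: $X = (x_1, \dots, x_m)$ and $Y = (y_1, \dots, y_n)$ are the two time series, $W = (w_1, \dots, w_K)$ with $w_k = (i_k, j_k)$ is a matching path, and $d(\cdot,\cdot)$ is the local distance between two elements.

```latex
\begin{aligned}
&\text{Boundary condition:} && w_1 = (1, 1), \quad w_K = (m, n) \\
&\text{Continuity:} && i_{k+1} - i_k \le 1, \quad j_{k+1} - j_k \le 1 \\
&\text{Monotonicity:} && i_{k+1} \ge i_k, \quad j_{k+1} \ge j_k \\[4pt]
&\text{Objective:} && \mathrm{DTW}(X, Y) = \min_{W} \frac{1}{K} \sqrt{\sum_{k=1}^{K} d\!\left(x_{i_k}, y_{j_k}\right)} \\[4pt]
&\text{Recurrence:} && \gamma(i, j) = d(x_i, y_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}
\end{aligned}
```

In this form the denominator $K$ compensates for matching paths of different lengths, with $\max(m, n) \le K \le m + n - 1$, and the recurrence is what the dynamic programming search evaluates cell by cell.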
Let the human joint angle feature sequences of the two action videos to be evaluated be  1) Based on the three constraints described in the previous subsection, the search cycle starts from  2) Repeat the previous step until both human joint angle feature sequences have been searched, i.e.,  3) Create a sequence element matching array  4) According to the saved correspondences in the matching array 
Normalize 
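The search and backtracking procedure can be sketched as follows for two one-dimensional joint angle sequences. This is a minimal illustration under stated assumptions: the absolute difference is used as the local distance, the accumulated cost is normalized by the path length, and `dtw_similarity` maps the result to (0, 1] with a hypothetical formula that is not taken from this paper.

```python
from typing import List, Sequence, Tuple


def dtw(x: Sequence[float], y: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the accumulated DTW cost and the optimal matching path (0-based indices)."""
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j].
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end point to the start point (boundary condition),
    # moving one cell at a time (continuity) and never forward (monotonicity).
    path: List[Tuple[int, int]] = []
    i, j = m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[m][n], path


def dtw_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Hypothetical similarity score in (0, 1]; 1 means identical after warping."""
    total, path = dtw(x, y)
    return 1.0 / (1.0 + total / len(path))


standard = [10.0, 20.0, 35.0, 50.0]               # teacher's joint angles per frame
student = [10.0, 10.0, 20.0, 35.0, 35.0, 50.0]    # same motion, played more slowly
print(dtw(standard, student))
print(dtw_similarity(standard, student))
```

Because the slower performance only repeats frames, warping aligns it with the standard sequence at zero cost, which a frame-by-frame Euclidean comparison could not do for sequences of different length.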
In this paper, inertial sensor gloves and an infrared detection rod are used to collect the playing data. Through simulation experiments, playing data are collected for 10 piano pieces, including “To Alice” and “Wedding in a Dream”. The 10 pieces are each played three times, and a total of 200 playing samples are obtained, which are uniformly divided into a training set and a test set. The action features of different fingers are extracted during playing, and the ability of the proposed method to extract features of piano playing gestures is analyzed.
The results of the analysis of the feature extraction ability for different finger movements are shown in Table 1. The table shows that the method effectively extracts the playing features of each part of the hand, including the standard deviation and extreme deviation of each joint during playing, and identifies the movement features of each finger without substantial error. After applying the method in this paper, hand gesture recognition extracts the changing features of each joint of the hand well, with an extraction accuracy of more than 90%.
Analysis of the feature extraction ability for different finger movements
| Position | Feature | Thumb (%) | Index (%) | Middle (%) | Ring (%) | Little (%) |
|---|---|---|---|---|---|---|
| Back of hand | Angle deviation | 97.095 | 96.575 | 93.418 | 99.424 | 97.213 |
| | Elevation aberration | 98.43 | 98.287 | 98.075 | 96.356 | 96.326 |
| | Angle deviation | 97.733 | 99.254 | 99.382 | 97.693 | 98.146 |
| | Cross angle anomaly | 95.92 | 100 | 99.026 | 100 | 95.432 |
| | Acceleration standard deviation | 98.275 | 99.897 | 97.083 | 97.894 | 98.143 |
| | Standard deviation | 96.713 | 99.615 | 99.171 | 95.663 | 99.642 |
| | Angular velocity | 95.36 | 98.604 | 98.144 | 100 | 98.188 |
| Lower finger | Angle deviation | 96.081 | 100 | 98.179 | 99.902 | 96.333 |
| | Elevation aberration | 98.596 | 100 | 100 | 100 | 96.736 |
| | Angle deviation | 95.946 | 98.202 | 96.977 | 97.918 | 97.544 |
| | Cross angle anomaly | 93.586 | 97.657 | 100 | 97.024 | 99.371 |
| | Acceleration standard deviation | 95.741 | 97.362 | 98.857 | 97.367 | 97.616 |
| | Standard deviation | 98.6 | 100 | 100 | 97.714 | 96.094 |
| | Angular velocity | 94.91 | 95.79 | 99.519 | 98.123 | 97.329 |
| Knuckle | Angle deviation | 97.847 | 98.066 | 99.5 | 96.653 | 96.642 |
| | Elevation aberration | 96.563 | 98.21 | 97.891 | 97.161 | 100 |
| | Angle deviation | 96.883 | 98.891 | 97.3 | 96.603 | 98.627 |
| | Cross angle anomaly | 98.727 | 99.909 | 97.421 | 99.085 | 98.224 |
| | Acceleration standard deviation | 97.777 | 98.935 | 96.007 | 96.22 | 97.338 |
| | Standard deviation | 98.94 | 98.383 | 96.957 | 96.978 | 95.886 |
| | Angular velocity | 98.92 | 98.95 | 98.553 | 97.403 | 100 |
The dynamic gesture recognition method based on feature action sequences (Method 1) and the LSTM-based CSI gesture recognition method (Method 2) are selected for comparison with the method of this paper (G-HRNet). The recognition results on the two datasets are shown in Fig. 4, in which (a) and (b) are the accuracy on the training set and the test set, respectively. As the figure shows, the proposed method recognizes piano playing gestures well: training on the dataset effectively improves its recognition accuracy, and its average accuracy on the training set and test set reaches 95.09% and 95.94%, respectively. The accuracy of Method 2 remains at a lower level throughout, with average accuracy of 70.47% and 69.94% on the training set and test set, respectively. Method 1 performs slightly better than Method 2 (73.09% and 73.51%) but still worse than the proposed method. The proposed method outperforms Method 1 and Method 2 at every sample size, which indicates that it can effectively improve the recognition accuracy of piano playing gestures.

Recognition results on the two datasets
The method of this paper is applied to analyze gesture recognition during the performance of “To Alice” from the test set, tracking the dynamic change of the pitch angle of each joint of the different fingers. The change curves of the piano playing hand are shown in Fig. 5. The method effectively recognizes the gesture data of piano playing and clearly expresses the dynamics of the hand. The fluctuation of the pitch angle of the upper finger joints at different times is clearly visible, which reflects how the player’s fingers change over time, and each joint is recognized in detail.

The curve of the piano playing hand changes
The distribution of the continuous keystroke time of the fingers is shown in Fig. 6. Most subjects in the experimental group have a continuous keystroke time between 1051 ms and 1948 ms. Very few subjects have a continuous keystroke time within 1110 ms, which is regarded as very flexible, and very few have a continuous keystroke time greater than 1727 ms, which is regarded as not flexible enough. Because of the gravity rebound speed of the test device, the limit of the continuous keystroke time is about 900 ms. Overall, the distribution shows a normal trend. After a further Kolmogorov-Smirnov (K-S) normality test, the continuous keystroke time can be considered to obey a normal distribution with mean μ of 1438.74 ms and standard deviation σ of 110.37 ms. Based on this distribution and on the scoring standards for difficulty characteristics used in track events, μ-2.5σ, i.e., 1073 ms, is set as the 100-point mark and μ+2σ, i.e., 1840 ms, as the 50-point mark. Substituting these into the cumulative scoring formula gives the following system of equations:
Solving this system of equations gives: 
Where 

The distribution of finger continuous keystroke time
Table 2 shows the statistics of continuous keystroke time and dexterity for each finger in the experimental group. For ease of understanding, the fingers are numbered as follows: L and R denote the left and right hands, respectively, and the subscripts 1-5 denote the thumb, index finger, middle finger, ring finger, and little finger, respectively. Since the experimental subjects are all right-handed, the flexibility scores of the right-hand fingers are significantly higher than those of the left-hand fingers. The index finger of each hand is the most flexible and the most independent, while the ring finger is constrained by the little finger and the middle finger and is the least flexible, and the standard deviation of the little finger is the largest. The scoring results are in line with reality and reflect the flexibility of each finger well.
Continuous keystroke time and flexibility score of each finger
| Finger number | Mean (ms) | Standard deviation (ms) | Average flexibility score | 
|---|---|---|---|
| L5 | 1522.198 | 172.484 | 61.21 | 
| L4 | 1638.956 | 166.778 | 53.77 | 
| L3 | 1509.271 | 155.468 | 66.4 | 
| L2 | 1476.312 | 177.393 | 69.31 | 
| L1 | 1518.939 | 174.481 | 59.66 | 
| R1 | 1404.546 | 178.336 | 73 | 
| R2 | 1314.255 | 139.574 | 78.32 | 
| R3 | 1360.534 | 132.529 | 71.62 | 
| R4 | 1484.587 | 156.215 | 71.8 | 
| R5 | 1422.542 | 160.372 | 68.85 | 
Tables 3 and 4 show examples of the left-hand and right-hand finger spans, respectively, for the same subject in the experimental group, obtained by processing the data collected from the test platform. The finger spans are expressed as piano interval differences to facilitate the later calculation of distance fit. The span differs between fingers, so it is necessary to measure the finger span.
Example of the left finger span
| Finger number | L5 | L4 | L3 | L2 | L1 | 
|---|---|---|---|---|---|
| L5 | — | 5.55 | 7.05 | 8.05 | 11.55 | 
| L4 | 5.55 | — | 4.05 | 7.05 | 10.55 | 
| L3 | 7.05 | 4.05 | — | 5.55 | 9.05 | 
| L2 | 8.05 | 7.05 | 5.55 | — | 8.55 | 
| L1 | 11.55 | 10.55 | 9.05 | 8.55 | — | 
Example of the right finger span
| Finger number | R1 | R2 | R3 | R4 | R5 | 
|---|---|---|---|---|---|
| R1 | -- | 8.55 | 9.55 | 10.55 | 11.55 | 
| R2 | 8.55 | -- | 5.55 | 7.05 | 9.05 | 
| R3 | 9.55 | 5.55 | -- | 4.05 | 5.55 | 
| R4 | 10.55 | 7.05 | 4.05 | -- | 4.55 | 
| R5 | 11.55 | 9.05 | 5.55 | 4.55 | -- | 
The experiments in this paper apply the proposed fingering assessment method, which is based on the G-HRNet hand posture estimation model and the audio-visual piano fingering assessment scale developed in this paper, to the dataset constructed in this paper.
Using the dataset constructed in this paper, test experiments were conducted for the a priori knowledge-based piano fingering evaluation scheme, and the corresponding video evaluation results are shown in Table 5. Each fingering subset includes the following video samples: correct fingering, wrong hand shape, wrong fingering, deviated key direction, collapsed palm, unstable hand, and curled fingers. Under the same conditions, fingerings played by a single finger (tick, holder, and erase) are easier to assess than fingerings played by two fingers at the same time (big pinch and little pinch), so their assessment accuracy is slightly higher.
Video assessment experiment results
| Subset | Tick | Holder | Erase | Big pinch | Little pinch | 
|---|---|---|---|---|---|
| Accuracy rate(%) | 95.55 | 96.77 | 97.53 | 90.49 | 85.42 | 
Fig. 7 and Fig. 8 show the time-domain waveform and the frequency-domain results of chromatic order 4 (fa), respectively, where Fig. 8(a) shows the spectrum including harmonics and Fig. 8(b) shows the fundamental frequency. The fundamental frequency is clearly visible in the figure.

The result of the time domain waveform of the semi-tone order 4(fa)

Semi-tone order 4(fa) frequency domain results
For the fingering audio, frequency-domain information is obtained by Fourier transform. The measured frequency is consistent with the standard pitch of 783 Hz for the middle-register fa in the key of D major, so this note is correctly pitched, which indicates that the key press strength of the left hand is reasonable. The evaluation result is therefore that the audio of this fingering is correct.
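The pitch comparison can be illustrated with a small, self-contained sketch. It synthesizes a tone, finds the strongest frequency bin with a naive discrete Fourier transform, and compares it with the 783 Hz reference. The sample rate, signal length, harmonic content, tolerance, and function names are all assumptions for illustration; a real system would apply an FFT to recorded audio.

```python
import cmath
import math
from typing import List


def dominant_frequency(samples: List[float], sample_rate: float) -> float:
    """Frequency in Hz of the strongest DFT bin below the Nyquist frequency."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        # Naive O(n^2) DFT: correlate the signal with a complex sinusoid at bin k.
        acc = sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n


def is_in_tune(measured_hz: float, reference_hz: float, tolerance_hz: float) -> bool:
    """True when the measured pitch lies within the tolerance of the reference."""
    return abs(measured_hz - reference_hz) <= tolerance_hz


SAMPLE_RATE = 4000.0   # assumed sample rate in Hz
N_SAMPLES = 1000       # 0.25 s of audio, i.e. 4 Hz frequency resolution
# Synthetic note: 783 Hz fundamental plus a weaker second harmonic.
tone = [math.sin(2 * math.pi * 783.0 * t / SAMPLE_RATE)
        + 0.4 * math.sin(2 * math.pi * 1566.0 * t / SAMPLE_RATE)
        for t in range(N_SAMPLES)]

peak_hz = dominant_frequency(tone, SAMPLE_RATE)
print(peak_hz, is_in_tune(peak_hz, reference_hz=783.0, tolerance_hz=4.0))
```

The tolerance is set to the frequency resolution of the transform, since the peak can only be located to the nearest bin; a longer recording gives a finer resolution and allows a tighter tolerance.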
The results of the audio comparison experiment are shown in Table 6. Each fingering subset consists of the following audio samples: correct pitch, wrong key played, and deviation in chromatic scale pitch. Intonation is crucial for learning a musical instrument, and for audio assessment, pitch comparison is simple and effective. It achieves higher accuracy than video assessment.
Audio comparison experimental results
| Subset | Tick | Holder | Erase | Big pinch | Little pinch | 
|---|---|---|---|---|---|
| Accuracy rate(%) | 99.02 | 99.97 | 98.62 | 96.03 | 97.46 | 
In this paper, a series of lightweight modifications is applied to the structure of the HRNet model for hand pose estimation, and the resulting G-HRNet model is proposed and constructed. The DTW algorithm is used to determine the similarity between the hand movements during piano playing and the standard movements and to verify the recognition performance of the model. The results show that the G-HRNet model effectively extracts the angular features of the back of the player’s hand, the lower finger joints, and the upper finger joints. Compared with other methods, it achieves higher recognition accuracy and effectively recognizes the changes in the pitch angle of the upper and lower finger joints at different times. In addition, the proposed system for automatic recognition of hand posture in piano playing measures finger flexibility and span simply and effectively, which provides a data basis for the recognition of piano playing skills.
