Published online: 17 Mar 2025
Received: 10 Oct 2024
Accepted: 27 Jan 2025
DOI: https://doi.org/10.2478/amns-2025-0231
Keywords
© 2025 Huiming Liu, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Piano playing is an auditory art, an art of imagination, and an art of time and space. In performance, the player must build on the basic components of playing; otherwise, the audience will merely listen rather than truly appreciate [1-2]. The process of piano playing requires not only rich emotion, an understanding of the instrument, good control, and solid training, but above all the mastery of playing technique [3-4]. Such technique cannot be acquired in a day, nor through theory alone. The mastery of piano playing skills deepens on the foundation of general performance and mainly includes finger touch, coordination of movement, coherent playing, control of finger strength, coordination of rhythm and music, use of the pedals, and the other rules of piano playing [5-8].
With the development of Artificial Intelligence (AI) technology, its application across industries is becoming increasingly extensive, and music is no exception: AI is gradually becoming a practical tool for music creation and performance. It has had a profound impact on composition and performance, providing a powerful, practical tool for musicians who lack inspiration, need immediate feedback, or want to improve their playing [9-12]. As a useful aid, AI can give real-time feedback on the user's piano playing, pointing out problems such as inaccurate tempo, incorrect finger position, and wrong technique, and providing appropriate suggestions and guidance [13-15].
This paper studies the application of AI recognition of human skeletal joint movements in piano performance, using neural networks to realize movement recognition. Taking piano fingering as the object of study, a bidirectional recurrent network algorithm is used to model the relationship between piano notes and fingering. As an experimental dataset, 28 short piano pieces by Bach, 5 pieces from Czerny Op. 299, and 7 pieces from Levels 1-3 of the China Conservatory of Music's Social Art Proficiency Examination were collected, 40 scores in total, on which the performance of the proposed algorithms is tested and analyzed. Using the established quantitative evaluation of piano fingering based on AI action recognition, the strengths and weaknesses of different fingerings are analyzed objectively.
The Kinect sensor measures the time light takes to fly through the air (time-of-flight, TOF) and then calculates the 3D coordinates of the target object by triangulation. TOF technology is non-contact, non-destructive, fast, and strongly interference-resistant, so this paper uses it to acquire human body action images, providing reliable image input for subsequent recognition and improving recognition accuracy. Background segmentation separates the foreground (human body) from the background in the depth image by setting a suitable threshold. The Otsu threshold segmentation method can determine this threshold adaptively, making the separation of foreground and background more accurate; it is also simple to compute and efficient, so it is chosen here to segment the human body from the background. Its implementation process is as follows.
Step 1: Input the human movement image acquired by the Kinect sensor.
Step 2: Compute the histogram of the image and divide the gray values evenly into $L$ gray levels.
Step 3: Compute the total number of pixels in the whole human action image, $N = \sum_{i=0}^{L-1} n_i$. In equation (1), $n_i$ is the number of pixels at gray level $i$.
Step 4: Compute the gray-level probability $p_i = n_i / N$.
Step 5: Segment the human action image into foreground $C_0$ (gray levels $0, \dots, t$) and background $C_1$ (gray levels $t+1, \dots, L-1$) with a candidate threshold $t$.
Step 6: Compute the probability distribution of the pixels for the two classes of foreground and background: $\omega_0(t) = \sum_{i=0}^{t} p_i$ and $\omega_1(t) = 1 - \omega_0(t)$.
Step 7: Compute the average gray value of the foreground and background classes: $\mu_0(t) = \sum_{i=0}^{t} i\,p_i / \omega_0(t)$ and $\mu_1(t) = \sum_{i=t+1}^{L-1} i\,p_i / \omega_1(t)$.
Step 8: Compute the between-class variance $\sigma_B^2(t) = \omega_0(t)\,\omega_1(t)\,[\mu_0(t) - \mu_1(t)]^2$.
Step 9: Substitute every candidate value of $t$ and take the optimal threshold $t^* = \arg\max_t \sigma_B^2(t)$.
Step 10: According to the obtained optimal threshold $t^*$, binarize the image: pixels on one side of $t^*$ are labeled foreground (human body) and the rest background.
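Steps 1-10 can be sketched in a few lines of NumPy. The `otsu_threshold` helper below is an illustrative sketch, not the paper's implementation; it uses the standard cumulative-sum form of the between-class variance, which is algebraically equivalent to the expression in Step 8.

```python
import numpy as np

def otsu_threshold(depth_img, levels=256):
    """Adaptive Otsu threshold for foreground/background segmentation.

    depth_img: 2D array of integer gray levels in [0, levels).
    Returns the threshold t* that maximizes the between-class variance.
    """
    hist = np.bincount(depth_img.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                   # gray-level probabilities p_i
    omega0 = np.cumsum(p)                   # foreground probability w0(t)
    mu = np.cumsum(p * np.arange(levels))   # cumulative mean up to level t
    mu_T = mu[-1]                           # global mean
    # sigma_B^2(t) = w0*w1*(mu0-mu1)^2 = (mu_T*w0 - mu)^2 / (w0*(1-w0))
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_T * omega0 - mu) ** 2 / (omega0 * (1 - omega0))
    sigma_b2 = np.nan_to_num(sigma_b2)      # undefined at w0 = 0 or 1
    return int(np.argmax(sigma_b2))

# usage: binarize a toy "depth image" into body / background
img = np.array([[10, 12, 11], [200, 210, 205], [11, 198, 202]])
t = otsu_threshold(img)
mask = img > t   # foreground mask
```

On the toy image the threshold lands between the two pixel clusters, so `mask` isolates the five high-valued "body" pixels.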
After distinguishing the human body from the background, action features are extracted from the skeletal joint points. Three kinds of features are used:
1) Joint vector angle. The joint vector angle is the angle between two joint vectors. By calculating it, the angle information of the joints can be obtained, which in turn describes the motion state and direction of the human joints:
$$\theta = \arccos\left(\frac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|}\right) \quad (8)$$
In Eq. (8), $v_1$ and $v_2$ are the two joint vectors and $\theta$ is the angle between them.
2) Joint displacement vector. The joint displacement vector is the change of a joint point's position in space. By analyzing it, the trajectory, speed, and other information of the human body movement can be obtained:
$$\Delta p_t = p_t - p_{t-1} \quad (9)$$
In Eq. (9), $p_t$ and $p_{t-1}$ are the 3D positions of the joint point in the current and previous frames, and $\Delta p_t$ is its displacement vector.
3) Relative position of joint points. The relative position of joint points refers to the relative positional relationship between the various parts of the human body. By analyzing it, the spatial structure and posture information of human body movements can be obtained:
$$r_{ij} = p_i - p_j \quad (10)$$
In Eq. (10), $p_i$ and $p_j$ are the positions of joint points $i$ and $j$, and $r_{ij}$ is their relative position.
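As a concrete sketch of the three features in Eqs. (8)-(10), assuming each joint is given as a 3D coordinate from the Kinect skeleton (the function names are illustrative, not from the paper):

```python
import numpy as np

def joint_angle(p_a, p_b, p_c):
    """Angle at joint b between the joint vectors b->a and b->c (Eq. (8))."""
    v1 = np.asarray(p_a, float) - np.asarray(p_b, float)
    v2 = np.asarray(p_c, float) - np.asarray(p_b, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def joint_displacement(p_now, p_prev):
    """Displacement vector of one joint between two frames (Eq. (9))."""
    return np.asarray(p_now, float) - np.asarray(p_prev, float)

def relative_position(p_i, p_j):
    """Relative position of joint i with respect to joint j (Eq. (10))."""
    return np.asarray(p_i, float) - np.asarray(p_j, float)

# elbow angle of a right-angled arm: shoulder above, wrist to the side
shoulder, elbow, wrist = [0, 1, 0], [0, 0, 0], [1, 0, 0]
angle = joint_angle(shoulder, elbow, wrist)   # 90.0 degrees
```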
In human skeletal joint action AI recognition, a BP neural network is used to train the model so that it can automatically recognize the type of action from the input action features of human skeletal joint points. The BP neural network recognition model is shown in Fig. 1.

BP neural network recognition model
The BP neural network can be described by equation (11):
$$y = f\big(W^{(2)} f(W^{(1)} x + b^{(1)}) + b^{(2)}\big) \quad (11)$$
In Eq. (11), $x$ is the input feature vector of the skeletal joint points, $W^{(1)}$, $b^{(1)}$ and $W^{(2)}$, $b^{(2)}$ are the weights and biases of the hidden and output layers, $f(\cdot)$ is the activation function, and $y$ is the output action category.
The BP neural network adjusts the network parameters through the error back propagation algorithm to minimize the recognition error.
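A minimal sketch of such a BP network trained by error back-propagation, on toy two-class data; the architecture, learning rate, and iteration count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy data: 2-D "joint features", two action classes (XOR pattern)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# one hidden layer of 8 sigmoid units
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)

def forward(X):
    h = sigmoid(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

mse_before = np.mean((forward(X)[1] - y) ** 2)
lr = 1.0
for _ in range(5000):
    h, out = forward(X)
    # error back-propagation: squared-error gradients through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(0)
mse_after = np.mean((forward(X)[1] - y) ** 2)
```

After training, the recognition error (`mse_after`) is lower than before training, which is the behavior the back-propagation step is meant to produce.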
After the above process, the AI recognition of human skeletal joint actions is complete, which provides an important driving basis for the immersive experience.
This paper takes playing fingering in piano performance skills as an example and discusses fingering changes in piano performance so as to enhance the performer's fingering skills.
The structural characteristics of feedforward (FF) networks are not conducive to modeling sequence information, which requires connecting the elements of a sequence to be understood. An RNN additionally takes the timing of the input sequence into account: the current output depends not only on the current input but also on previous inputs, which makes it better suited to the fingering estimation problem.
When the current pitch information is input, the network retains information related to the pitch at the previous moment, similar to how fingering is determined manually. In sequence annotation tasks similar to fingering estimation, RNNs have achieved significant results, for example in language modeling and speech recognition. The memory an RNN keeps of historical information makes it possible to model problems that depend on long-distance features. There are many variants of RNNs; the most basic RNN is shown in Fig. 2.

Basic RNN diagram
Fig. 2(b) shows the time-expanded structure of Fig. 2(a); both represent the same RNN. The basic RNN consists of an input layer, a hidden layer, and an output layer.
In contrast to the basic neural network, the recurrent neural network adds a path through which the hidden layer state is passed from one time step to the next:
$$s_t = f(U x_t + W s_{t-1})$$
$$o_t = g(V s_t)$$
where $x_t$ is the input at time $t$, $s_t$ the hidden state, $o_t$ the output, $U$, $W$, $V$ the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, and $f$, $g$ activation functions.
From the above two equations, it can be seen that the output value $o_t$ is affected not only by the current input $x_t$ but, through $s_{t-1}$, by all earlier inputs $x_{t-1}, x_{t-2}, \dots$.
For language modeling, information from both the preceding and following words of the current word affects the output. A unidirectional RNN, which incorporates only the preceding information, cannot model the effect of the following words, so a bidirectional neural network is needed to combine the full context. The basic construction of a bidirectional RNN is shown in Fig. 3.

Bidirectional RNN diagram
From Fig. 3, it can be seen that the hidden layer of the bidirectional network holds two values at each step: a forward state $s_t$ computed by one recurrent unit from past inputs, and a backward state $s'_t$ computed by a second recurrent unit from future inputs:
$$s_t = f(U x_t + W s_{t-1})$$
$$s'_t = f(U' x_t + W' s'_{t+1})$$
$$y_t = g(V s_t + V' s'_t)$$
where $U$, $W$, $V$ and $U'$, $W'$, $V'$ are the weights of the forward and backward directions respectively; the two directions do not share weights. The output $y_t$ thus depends on both the past and the future of the sequence.
Long Short-Term Memory (LSTM) networks can handle long-distance dependency problems that basic recurrent networks cannot solve. In addition to the hidden state of a basic RNN, and in order to better exploit the relationships between inputs and avoid the exploding- and vanishing-gradient problems, LSTM adds an extra unit state, also called the cell state, to hold long-term memory information [17]. Through its three gates, LSTM learns from the data whether to maintain the long-term memory, whether to pass the current input into the long-term state, and whether to use the long-term state in the output [18]. In this paper, we combine the LSTM network with a bidirectional recurrent neural network to construct a bidirectional LSTM (BI-LSTM) suited to fingering labeling. The BI-LSTM structure matches part of the reasoning used when determining fingering manually, as it can weigh the pitch information before and after the current note when annotating its fingering.
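The bidirectional wiring of Fig. 3 can be sketched as below, with plain tanh recurrences standing in for the LSTM gates to keep the example short (an untrained forward pass only; all names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def bidirectional_rnn(x_seq, d_h=16, d_out=5):
    """Score each element of a sequence using both past and future context.

    One pass runs over x_1..x_T, a second over x_T..x_1, and the two
    hidden states are combined at every step, as in fingering labeling
    where the pitches before and after the current note both matter.
    """
    d_in = x_seq.shape[1]
    U,  W  = rng.normal(0, .1, (d_h, d_in)), rng.normal(0, .1, (d_h, d_h))  # forward
    Ub, Wb = rng.normal(0, .1, (d_h, d_in)), rng.normal(0, .1, (d_h, d_h))  # backward
    V = rng.normal(0, .1, (d_out, 2 * d_h))                                 # output

    T = len(x_seq)
    s_f, s_b = np.zeros((T, d_h)), np.zeros((T, d_h))
    h = np.zeros(d_h)
    for t in range(T):                 # forward direction: past context
        h = np.tanh(U @ x_seq[t] + W @ h)
        s_f[t] = h
    h = np.zeros(d_h)
    for t in reversed(range(T)):       # backward direction: future context
        h = np.tanh(Ub @ x_seq[t] + Wb @ h)
        s_b[t] = h
    # each step's output sees both directions
    return np.concatenate([s_f, s_b], axis=1) @ V.T

# a 7-step note-difference sequence, 1 feature per step, 5 finger labels
note_diffs = rng.normal(size=(7, 1))
scores = bidirectional_rnn(note_diffs)   # shape (7, 5)
```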
In this chapter, an intelligent evaluation scheme for piano fingering based on AI action recognition is proposed; its overall framework is shown in Figure 4. The piano fingering to be assessed is evaluated comprehensively from both visual and auditory perspectives. The scheme takes the basic laws of piano fingering learning into account, converts the fingering requirements of teaching practice into a machine-representable form, and realizes intelligent, specialized auxiliary assessment in computer language. The input performance is split into video and audio streams: the visual side is evaluated by AI-based fingering recognition and video-based fingering evaluation, the auditory side by audio-based pitch comparison, and finally a comprehensive intelligent evaluation result is output.

Fingering evaluation based on AI action recognition
For the piano performance skills to be assessed, image-based hand movement recognition and fingering-based assessment are performed in sequence, and the corresponding assessment results are obtained from the visual perspective using AI movement recognition technology and the video content.
For the piano fingering technique to be evaluated, audio-based pitch comparison is performed, and audio analysis is used to obtain the corresponding evaluation result from the auditory perspective. The audio is extracted from the input performance, and the Fourier transform is used to obtain its frequency-domain information; its fundamental frequency is then compared with the standard tones to produce the audio evaluation result. Pianos are conventionally tuned with reference to the tuning fork of the one-line octave a¹; the correspondence between scale degrees and frequencies used here is shown in Table 1.
D major scale and frequency correspondence table (middle register)

Scale degree | Solfège | Note name | Frequency (Hz)
---|---|---|---
Pentatonic scale | | |
1 | Do | D | 585
2 | Re | E | 651
3 | Mi | #F | 735
5 | Sol | A | 875
6 | La | B | 991
Chromatic scale | | |
4 | Fa | G | 774
7 | Si | #C | 1104
The audio is extracted from the input performance and converted from the time domain to the frequency domain to obtain its frequency content. The time-domain expression of the sound is shown in equation (15):
$$x(t) = A \sin(2\pi f t + \varphi) \quad (15)$$
where $A$ is the amplitude, $f$ the frequency, and $\varphi$ the initial phase of the tone.
Sampling $x(t)$ and applying the discrete Fourier transform gives the frequency-domain representation
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$$
where $N$ is the number of samples and $k$ the frequency bin index.
The modulus and phase angle of $X(k)$ are $|X(k)| = \sqrt{\mathrm{Re}^2[X(k)] + \mathrm{Im}^2[X(k)]}$ and $\arg X(k) = \arctan\!\big(\mathrm{Im}[X(k)] / \mathrm{Re}[X(k)]\big)$; the fundamental frequency $f_0$ is taken at the peak of the magnitude spectrum.
The audio to be measured is compared in turn with the frequencies of the pentatonic scale. According to Eq. (21), the level $l^*$ whose standard frequency is closest to the measured fundamental is taken:
$$l^* = \arg\min_l |f_0 - f_l| \quad (21)$$
Judgment: if the difference $|f_0 - f_{l^*}|$ is within the threshold $\varepsilon$, the tone is judged to match that scale degree.
Otherwise, the audio is compared with the standard frequencies of the chromatic scale; if the difference is still greater than the threshold $\varepsilon$, the pitch is judged inaccurate and the deviation enters the evaluation result.
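The pitch-comparison step can be sketched as follows, using the frequencies of Table 1; the 20 Hz tolerance is an assumed stand-in for the paper's unspecified threshold, and the helper names are illustrative:

```python
import numpy as np

# middle-register D major frequencies from Table 1
SCALE = {"Do": 585, "Re": 651, "Mi": 735, "Fa": 774,
         "Sol": 875, "La": 991, "Si": 1104}

def fundamental_freq(signal, fs):
    """Fundamental estimated as the peak of the magnitude spectrum |X(k)|."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin

def nearest_scale_tone(f0, threshold=20.0):
    """Scale degree with the smallest |f0 - f_l|, or None if all are too far."""
    name, f_ref = min(SCALE.items(), key=lambda kv: abs(f0 - kv[1]))
    return name if abs(f0 - f_ref) <= threshold else None

# synthetic half-second 875 Hz tone ("Sol") sampled at 8 kHz
fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
tone = np.sin(2 * np.pi * 875 * t)
f0 = fundamental_freq(tone, fs)
label = nearest_scale_tone(f0)
```

With 4,000 samples the bin spacing is 2 Hz, so the estimated fundamental lands within a couple of hertz of 875 Hz and the tone is matched to "Sol".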
Based on the LSTM method, 40 scores were collected and labeled with fingering data for training and testing. For evaluation metrics, in addition to the rate of agreement between the generated fingering sequences and the manual labels, two new metrics are proposed to measure the elasticity of the fingering sequences and the efficiency of playing.
There is as yet no formally recognized dataset or rubric. Therefore, a self-constructed dataset was used: 28 of Bach's short piano pieces, 5 pieces from Czerny Op. 299, and 7 pieces from Grades 1-3 of the China Conservatory of Music's Social Art Level Examination, 40 pieces in total, with more than 4,000 notes for the left hand and more than 6,000 for the right. To verify the reliability of the model, the dataset was divided into five groups for five-fold cross-validation, and eight sets of experiments were designed to compare and verify the improvements of the proposed scheme.
With 88 possible notes, the number of note combinations is so large that the full output probability matrix would contain 31 × 61,929 ≈ 1.92 million parameters, of which combinations forming four- and five-note chords account for 90.2%. Statistics on the notes and fingerings in the dataset show that the probability of four- and five-note chords appearing is extremely low, approximating zero. Therefore, by deleting four-note chords, five-note chords, and their fingerings from the output probability matrix and the transfer probability matrix, the output probability matrix, now composed of single tones and two- and three-note chords, is reduced to (86 + 981 + 5,136) × 25 ≈ 160,000 parameters, with a conditional state transfer matrix of 25 × 25 = 625. The retained single tones and two- and three-note chords account for 98.5% of the occurrences covered by the original output probability matrix.
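The matrix-size reductions above are simple arithmetic and can be checked directly:

```python
# full output-probability matrix over all note/fingering combinations
full = 31 * 61929                    # = 1,919,799 parameters (~1.92 million)
# keep only single tones and two-/three-note chords
reduced = (86 + 981 + 5136) * 25     # = 155,075 parameters (~160,000)
transfer = 25 * 25                   # = 625 conditional state transitions
```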
Figure 5 shows the frequency distribution of right-hand notes and fingerings. Multi-finger chords are very low-frequency in the self-built dataset, and many positions have a frequency of 0. The reason is that the model focuses on monophonic score sequences, which are more demanding in terms of finger transitions, so chords are not well covered in the training set; more datasets containing chords need to be added in the future.
The output matrix theoretically spans all notes, but as Fig. 5 shows, the right hand mainly covers the right half of the MIDI note range (18-105), while the left hand is similarly concentrated in the left half, with very few notes appearing in the opposite half. The output probability matrices of the left- and right-hand models can therefore each be reduced by roughly another half.

Right hand note and fingering frequency distribution
To illustrate the behavior of this paper's labeling results and expose the flaws of the agreement-rate metric, part of the experimental labeling output is examined: the labels for two bars of the upper voice of the Minuet in G major from the 28 Bach short pieces. The 5→3 fingering labeled by Original LSTM+Viterbi spans from G4 to G5, i.e., one octave, which exceeds the maximum playable span between fingers 3 and 5 without changing hand position; it is therefore an unplayable fingering. The 4→3 fingering in the second bar is an incorrect crossing fingering, and the second bar even contains a 1→3 cross-finger. By contrast, the 3→2→1 fingering produced by this paper's algorithm in the second bar is merely the ground truth's 4→3→2 fingering shifted by one consecutive finger. On the surface, both Original LSTM+Viterbi and BI-LSTM disagree with the ground truth, but the BI-LSTM labeling is clearly more reasonable. In this excerpt, 3/5 of the Original LSTM+Viterbi fingerings agree with the ground truth versus only 2/5 for BI-LSTM, yet BI-LSTM remains within the fault-tolerance range; the agreement rate alone therefore does not fully measure the quality of a fingering labeling algorithm. It can also be observed that the fingerings on which Original LSTM+Viterbi disagrees are weakly correlated with one another, whereas BI-LSTM tends to produce whole runs of labels that disagree with the ground truth, which depresses its agreement rate. As a result, BI-LSTM tends toward two extremes of very high or very low agreement, ranging from 88% down to only 42% on the test set.
The trained model is used for fingering prediction with the BI-LSTM algorithm; the results are shown in Table 2. Training on raw note sequences performs poorly, with an agreement rate of only 57.11%, while training on note-difference sequences reaches 69.7%, higher than Judgement-LSTM. Compared with a one-to-one mapping between fingering and notes, fingering labeling is thus more closely tied to note-difference information, which is exactly what the proposed BI-LSTM labeling method exploits. However, Bi-LSTM-CRF still has 3% of inelastic fingerings, with an unreasonable-fingering rate of 6.02%; in terms of fingering elasticity, the algorithm in this paper is therefore superior to Bi-LSTM-CRF.
Comparison of annotation results of different algorithms

Algorithm | Accuracy | Unplayable rate | Unreasonable rate
---|---|---|---
Original LSTM | 53.56% | 25.24% | 19.6%
Fault-tolerant LSTM | 59.91% | 23.41% | 17.2%
Merged-output LSTM | 54.45% | 24.7% | 20.7%
Judgement-LSTM | 68.71% | 5% | 19.2%
Bi-LSTM-CRF (note sequence) | 57.11% | 17.77% | 18.1%
BI-LSTM (note difference sequence) | 69.7% | 0% | 3.87%
To verify the effectiveness and objectivity of the established quantitative evaluation method for piano fingering based on AI action recognition, the following experiments were designed to compare and analyze fingering evaluation results. The algorithm of the quantitative fingering evaluation system was written in Python and debugged and run in the PyCharm IDE. Since the reinforcement learning network of the evaluation system scores both single-step fingering transitions and complete fingering sequences, experiments are designed for these two aspects.
The single-step fingering evaluation of the reinforcement learning network concerns the finger transition between two consecutive notes. Different categories of transition fingerings between two consecutive notes were set up, with five fingerings per category, and each was scored by the quantitative fingering evaluation system, yielding five scores per category, as shown in Table 3. The smooth-fingering category obtained the highest average, a full score of 6 each time, which accords with the criterion that natural, smooth fingerings are optimal. The averages of the expanded-finger, contracted-finger, and transferred-finger categories decreased step by step, which is also basically in line with the priority ranking of inter-finger transitions, while the violating-fingering category scored lowest, negative every time, which helps performers avoid violating fingerings in subsequent playing.
Evaluation results of transition fingerings

Fingering class | Smooth finger | Expanded finger | Contracted finger | Transferred finger | Violating fingering
---|---|---|---|---|---
Score 1 | 6 | 4 | 1 | 4 | -2
Score 2 | 6 | 6 | 3 | -2 | -3
Score 3 | 6 | 4 | 3 | 4 | -3
Score 4 | 6 | 1 | 1 | 4 | -3
Score 5 | 6 | 4 | 4 | 1 | -2
Average score | 6 | 3.8 | 2.4 | 2.2 | -2.6
The comparison of the scores of the various fingering types is shown in Figure 6. The score distribution of each type is balanced yet clearly ordered from high to low, which helps guide the generation of reasonable, scientific fingering sequences and helps performers avoid erroneous fingerings when playing.

Comparison of various fingering scores
The quantitative evaluation method for piano fingering based on AI action recognition evaluates not only each fingering transition during playing but also the complete generated fingering sequence. A key-sequence fragment of 10 notes was selected, and a standard fingering sequence, a general fingering sequence, and a random fingering sequence were input into the quantitative fingering evaluation system for scoring; the results are shown in Table 4 and Fig. 7. From the AI's recognition and evaluation of the different fingering movements, the total score and scoring rate of the standard fingering sequence are very high, and its score trend rises steadily with little fluctuation. The total score and scoring rate of the general fingering sequence are lower than those of the standard sequence, and its score trend is less stable and fluctuates more. The random fingering sequence has a very low total score and scoring rate, and its performance is very unstable, with large score fluctuations. The evaluation of piano fingering sequences by the established quantitative method based on AI action recognition is therefore essentially consistent with the rules of fingering, and is scientific and effective.
Fingering sequence evaluation results

No. | Fingering sequence | Total score | Scoring rate
---|---|---|---
1 | Standard | 48 | 0.96
2 | General | 35 | 0.67
3 | Random | 9 | 0.15

Fingering sequence score trend
This paper takes fingering technique in piano performance as a case study, proposes piano fingering evaluation based on AI action recognition, and compares the agreement of different LSTM-based fingering annotation algorithms with performance fingering annotation. The results show that, among the LSTM variants compared, the BI-LSTM fingering labeling algorithm proposed in this paper has high consistency and tracks the rhythm of piano notes better. In the evaluation of piano fingering based on AI action recognition, the total score and scoring rate of the standard fingering sequence are very high, at 48 and 0.96 respectively, with a steadily rising, low-fluctuation score trend, while the random fingering sequence has a very low total score and scoring rate and is very unstable, with large score fluctuations. The quantitative evaluation method for piano fingering based on AI action recognition established in this paper thus evaluates piano fingering sequences essentially in accordance with the rules of fingering, and is scientific and effective.