Open Access

Quantitative Analysis and AI Assessment of Piano Performance Techniques

  
17 Mar 2025

Introduction

Piano playing is an auditory art: an art of imagination, of time and of space. In performance, the player must play on the basis of the performance's components; otherwise, the audience will merely listen rather than truly appreciate [1-2]. The playing process requires not only rich emotion, an understanding of the instrument, good control, and sound training, but above all a mastery of playing technique [3-4]. Such mastery cannot be achieved in a day, nor through a certain amount of theory alone. It deepens on the basis of general performance and mainly includes finger touch, coordination of movement, coherent playing, control of finger strength, coordination of rhythm and music, use of the pedals, and the other rules of piano playing [5-8].

With the development of Artificial Intelligence (AI), its applications have spread across industries, and music is no exception: AI is gradually becoming a practical tool for music creation and performance. It has had a profound impact on composition and performance, offering a powerful, practical aid to musicians who lack inspiration, need immediate feedback, or want to improve their playing [9-12]. As such an aid, AI can give real-time feedback on a user's piano playing, pointing out problems such as inaccurate tempo, incorrect finger position, and wrong technique, and providing appropriate suggestions and guidance [13-15].

This paper proposes applying AI recognition of human skeletal joint movements to piano performance, using neural networks to realize movement recognition. Taking piano fingering as the example, a bidirectional recurrent network is used to study the relationship between piano notes and fingering. Experimentally, 28 short piano scores by Bach, 5 scores from Czerny Op. 299, and 7 scores from Levels 1-3 of the China Conservatory of Music's Social Art Proficiency Examination (40 scores in total) are collected as the dataset on which the performance of this paper's algorithms is tested and analyzed. Based on the established quantitative evaluation of piano fingering by AI action recognition, the advantages and disadvantages of different fingerings are analyzed objectively.

AI quantitative assessment of piano performance skills
AI Recognition of Skeletal Joints for Piano Playing
Human motion image acquisition and segmentation

The Kinect sensor measures the distance light travels through the air using time-of-flight (TOF) technology, then calculates the 3D coordinates of the target object by triangulation. TOF is non-contact, non-destructive, fast, and strongly resistant to interference, so this paper uses it to acquire human-body action images, providing reliable image input for subsequent recognition and improving recognition accuracy. Background segmentation separates the foreground (the human body) from the background in the depth image by choosing a suitable threshold. The Otsu method determines this threshold adaptively, which makes the separation of foreground and background more accurate; it is also simple and efficient to compute, so it is chosen here. Its implementation proceeds as follows.

Step 1: Input the human movement image acquired by Kinect sensor.

Step 2: Calculate the histogram of the image, dividing the gray values evenly into L levels.

Step 3: Calculate the total number of pixels $A$ in the whole human action image:
$$A=\sum_{i=0}^{L-1} a_i \tag{1}$$

In equation (1), ai is the number of pixels in the image with gray level i.

Step 4: Calculate the probability $B_i$ of gray level $i$:
$$B_i=\frac{a_i}{A} \tag{2}$$

Step 5: Segment the human action image into foreground O0 and background O1 assuming a threshold γ.

Step 6: Calculate the probability distributions of the foreground and background pixel classes:
$$B_0=\sum_{i=0}^{\gamma-1} B_i \tag{3}$$
$$B_1=\sum_{i=\gamma}^{L-1} B_i \tag{4}$$

Step 7: Calculate the average gray values of the foreground and background pixel classes:
$$C_0=\frac{\sum_{i=0}^{\gamma-1} i B_i}{B_0} \tag{5}$$
$$C_1=\frac{\sum_{i=\gamma}^{L-1} i B_i}{B_1} \tag{6}$$

Step 8: Calculate the between-class variance $b_\gamma^2$:
$$b_\gamma^2=B_0 B_1 (C_0-C_1)^2 \tag{7}$$

Step 9: Substitute the candidate values of $\gamma$ into Eqs. (3)-(7) in turn and find the threshold that maximizes $b_\gamma^2$.

Step 10: According to the obtained optimal threshold γ, the background segmentation is realized.
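Steps 1-10 can be sketched in a few lines of Python (illustrative only; the paper gives no code, the input image here is random, and L = 256 gray levels are assumed):

```python
import numpy as np

def otsu_threshold(gray_levels: np.ndarray, L: int = 256) -> int:
    """Find the threshold that maximizes between-class variance (Steps 2-9)."""
    # Steps 2-3: histogram and total pixel count A
    hist = np.bincount(gray_levels.ravel(), minlength=L).astype(float)
    A = hist.sum()
    # Step 4: gray-level probabilities B_i
    B = hist / A
    best_gamma, best_var = 0, -1.0
    i = np.arange(L)
    for gamma in range(1, L):            # Step 9: scan candidate thresholds
        B0 = B[:gamma].sum()             # Step 6: class probabilities
        B1 = B[gamma:].sum()
        if B0 == 0 or B1 == 0:
            continue
        C0 = (i[:gamma] * B[:gamma]).sum() / B0   # Step 7: class means
        C1 = (i[gamma:] * B[gamma:]).sum() / B1
        var_b = B0 * B1 * (C0 - C1) ** 2          # Step 8: between-class variance
        if var_b > best_var:
            best_var, best_gamma = var_b, gamma
    return best_gamma

# Step 10: segment foreground/background with the optimal threshold
img = np.random.randint(0, 256, (64, 64))
gamma = otsu_threshold(img)
foreground = img >= gamma
```

On a depth image, the same scan applies; only the number of quantization levels changes.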

Motion Feature Extraction for Human Skeletal Joint Points

After separating the human body O0 from the background image, motion features of the skeletal joint points must be extracted. Before extraction, to improve its accuracy and efficiency, the key features to extract are clarified. The human body contains 206 bones, but its various movement postures mainly involve 28 of them, connected by joints; since bone movement is driven through the corresponding joints, extracting the features of the individual joints is enough to identify the type of body movement. Three features are extracted: the joint vector angle, the joint displacement vector, and the relative position of joint points.

1) Joint vector angle. The joint vector angle is the angle between two joint vectors. Calculating it yields the angular information of the joints, which in turn describes the motion state and direction of the human joints:
$$\cos\theta=\frac{A\cdot B}{|A|\times|B|} \tag{8}$$

In Eq. (8), $A\cdot B$ is the dot product of the two vectors, and $|A|$ and $|B|$ are their moduli.

2) Joint displacement vector. The joint displacement vector is the change of a joint point's position in space. Analyzing it yields the trajectory, speed, and other information of the body's movement:
$$H_p(k)=\frac{h_p(k+1)-h_p(k-1)}{\Delta t} \tag{9}$$

In Eq. (9), Hp(k) is the displacement vector of joint p in the frame k of the same action sequence; hp(k + 1) and hp(k – 1) are the 3D coordinates of joint p in the frames k + 1 and k – 1 of the same action sequence; and Δt is the time interval between the two frames.

3) Relative position of joint points. The relative position of joint points refers to the relative positional relationship between parts of the human body. Analyzing it yields the spatial structure and posture information of the body's movements:
$$G_{pj}(k)=h_p(k)-h_j(k) \tag{10}$$

In Eq. (10), Gpj(k) is the relative position of joints p and j in the frame k of the same action sequence; hp(k) and hj(k) are the 3D coordinates of joints p and j in the frame k of the same action sequence.
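The three features above can be computed directly from the 3D joint coordinates; a minimal Python sketch of Eqs. (8)-(10) (the function names are illustrative, not from the paper):

```python
import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray) -> float:
    """Angle between two joint vectors, from cos(theta) in Eq. (8)."""
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against floating-point values just outside [-1, 1]
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def joint_displacement(h_next: np.ndarray, h_prev: np.ndarray, dt: float) -> np.ndarray:
    """Displacement vector of a joint at frame k, Eq. (9) (central difference)."""
    return (h_next - h_prev) / dt

def relative_position(h_p: np.ndarray, h_j: np.ndarray) -> np.ndarray:
    """Relative position of joints p and j at frame k, Eq. (10)."""
    return h_p - h_j
```

Each function takes 3D coordinates as returned per frame by the Kinect skeleton stream.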

Implementation of AI Recognition for Human Skeletal Joint Movements

In AI recognition of human skeletal joint actions, a BP neural network is trained so that it can automatically recognize the action type from the input action features of the skeletal joint points. The BP neural network recognition model is shown in Fig. 1.

Figure 1.

BP neural network recognition model

The BP neural network can be described by Eq. (11):
$$Y_k=f\left(\sum_{j=1}^{n} w_{kj}\, f\left(\sum_{m=1}^{M} w_{jm} r_m+\phi_j\right)+\phi_k\right) \tag{11}$$

In Eq. (11), $r_m$ is the joint action feature at input-layer neuron $m$; $f(\cdot)$ is the activation function; $Y_k$ is the action recognition result output by output-layer neuron $k$; $w_{kj}$ and $w_{jm}$ are connection weights; $\phi_j$ and $\phi_k$ are biases; and $M$ and $n$ are the numbers of input-layer and hidden-layer neurons, respectively.

The BP neural network adjusts the network parameters through the error back propagation algorithm to minimize the recognition error.
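A minimal Python sketch of the forward pass of Eq. (11), assuming a sigmoid activation and illustrative layer sizes (6 input features, 8 hidden neurons, 4 action classes); training by error back propagation is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_forward(r, W_jm, phi_j, W_kj, phi_k):
    """Forward pass of Eq. (11): input features r -> hidden layer -> action scores Y."""
    hidden = sigmoid(W_jm @ r + phi_j)      # inner sum over m, plus bias phi_j
    return sigmoid(W_kj @ hidden + phi_k)   # outer sum over j, plus bias phi_k

rng = np.random.default_rng(0)
r = rng.normal(size=6)                       # 6 joint-action features
W_jm, phi_j = rng.normal(size=(8, 6)), rng.normal(size=8)
W_kj, phi_k = rng.normal(size=(4, 8)), rng.normal(size=4)
Y = bp_forward(r, W_jm, phi_j, W_kj, phi_k)  # scores for 4 action classes
```

In training, the back-propagation step would adjust `W_jm`, `W_kj`, and the biases to minimize the recognition error on labeled action sequences.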

After the above process, AI recognition of human skeletal joint actions is complete, providing the basis for the subsequent fingering evaluation.

Piano performance indicator labeling

This paper takes the playing fingering in piano performance skills as an example to discuss the fingering changes in piano performance to enhance the performer’s fingering skills when performing.

Basic RNN

The structure of feed-forward (FF) networks is not well suited to modeling sequence information, which requires connecting the items of a sequence to be understood. An RNN adds the timing of the input sequence to the model: the current output depends not only on the current input but also on previous inputs, which makes it better suited to the fingering-estimation problem.

When the current pitch is input, the network retains information about the pitch of previous moments, similar to how fingering is determined manually. In sequence-labeling tasks similar to fingering estimation, such as language modeling and speech recognition, RNNs have achieved significant results. Their memory of historical information allows them to model problems that depend on long-distance feature information. There are many RNN variants; the most basic is shown in Fig. 2.

Figure 2.

Basic RNN diagram

Fig. 2(b) shows the time-unrolled structure of Fig. 2(a); both represent the same RNN. The basic RNN consists of an input layer, a hidden layer, and an output layer. Here x is the input vector; U is the weight matrix from the input layer to the hidden layer; V is the weight matrix from the hidden layer to the output layer; W weights the previous hidden-layer value fed back as input to the current step; and o is the output vector. The basic RNN thus realizes a one-to-one, sequence-to-sequence mapping, which meets the requirements of the fingering-estimation problem.

In contrast to a basic neural network, the recurrent network adds a path feeding the hidden state st forward as input to the next step, establishing a temporal relationship between earlier inputs and the current output. The basic RNN connects the current state to the previous state through Eqs. (12) and (13):

$$o_t=g(Vs_t) \tag{12}$$
$$s_t=f(Ux_t+Ws_{t-1}) \tag{13}$$

where g and f are the activation functions.

From these two equations, the output $o_t$ of the basic RNN is affected by the current input $x_t$ and by arbitrarily many preceding inputs $x_{t-1}, x_{t-2}, x_{t-3}$, and so on. The recurrent network thus establishes relationships between earlier and later inputs that a basic neural network cannot. However, the basic RNN does not see the inputs that come after the current one, and it is prone to gradient explosion and gradient vanishing during training, which prevent it from capturing long-distance dependencies [16]. The former problem can be solved with a bidirectional structure that also connects the later inputs; the latter is inherent to the basic RNN's structure and must be addressed by changing the model's internals, which has led to architectures that can capture long-distance relationships, such as the LSTM network and the Gated Recurrent Unit (GRU).
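Eqs. (12)-(13) can be sketched in a few lines of Python (shapes are illustrative; f = tanh and g = identity are assumptions, as the paper does not fix the activations):

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    """One step of the basic RNN, Eqs. (12)-(13), with f = tanh and g = identity."""
    s_t = np.tanh(U @ x_t + W @ s_prev)  # Eq. (13): new hidden state
    o_t = V @ s_t                        # Eq. (12): output
    return o_t, s_t

# Run a short pitch-feature sequence through the recurrence
rng = np.random.default_rng(1)
U, W, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
s = np.zeros(5)                          # initial hidden state
for x in rng.normal(size=(4, 3)):        # 4 time steps, 3-dim input each
    o, s = rnn_step(x, s, U, W, V)
```

Because `s` is threaded through the loop, the output at each step depends on all earlier inputs, which is exactly the property the fingering-estimation problem needs.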

Bidirectional recurrent neural networks

For language modeling, both the preceding and the following words of the current word affect the output. A unidirectional RNN, which combines only the preceding information, cannot model the effect of the following words, so a bidirectional neural network is needed to combine the full context. The basic construction of a bidirectional RNN is shown in Fig. 3.

Figure 3.

Bidirectional RNN diagram

From Fig. 3, the hidden layer of the bidirectional network stores two values: one recurrent unit A participates in the forward computation and another unit A′ in the backward computation. The final output depends on both A and A′. At a moment t, the output is calculated as follows:

$$y_t=g(VA_t+V'A'_t) \tag{14}$$

where $g$ denotes the activation function, $A_t=f(Ux_t+WA_{t-1})$, and $A'_t=f(U'x_t+W'A'_{t+1})$; $A_t$ and $A'_t$ use different parameters, i.e., the forward and backward networks do not share weights. The bidirectional network matches the manual approach to fingering estimation, in which the pitches both before and after the current tone are taken into account.
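A minimal Python sketch of the bidirectional computation of Eq. (14), with separate forward and backward weight sets (shapes and activations are illustrative assumptions):

```python
import numpy as np

def bidirectional_rnn(xs, U, W, V, U2, W2, V2, hidden=4):
    """Eq. (14): y_t = g(V A_t + V' A'_t), with f = tanh and g = identity."""
    T = len(xs)
    A = np.zeros((T, hidden))    # forward states A_t
    A2 = np.zeros((T, hidden))   # backward states A'_t
    for t in range(T):                        # forward pass: A_t uses A_{t-1}
        prev = A[t - 1] if t > 0 else np.zeros(hidden)
        A[t] = np.tanh(U @ xs[t] + W @ prev)
    for t in reversed(range(T)):              # backward pass: A'_t uses A'_{t+1}
        nxt = A2[t + 1] if t < T - 1 else np.zeros(hidden)
        A2[t] = np.tanh(U2 @ xs[t] + W2 @ nxt)
    return [V @ A[t] + V2 @ A2[t] for t in range(T)]

rng = np.random.default_rng(2)
xs = rng.normal(size=(6, 3))                  # 6 time steps of 3-dim pitch features
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
U2, W2, V2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
ys = bidirectional_rnn(xs, U, W, V, U2, W2, V2)
```

Each output `ys[t]` now depends on the inputs both before and after step t, mirroring how a pianist weighs surrounding pitches when choosing a fingering.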

Long and short-term memory networks

Long Short-Term Memory (LSTM) networks can handle the long-distance dependency problems that basic RNNs cannot. In addition to the hidden states of a basic RNN, LSTM adds unit states (also called cell states) to hold long-term memory, which better exploits the relationships between inputs and mitigates gradient explosion and vanishing [17]. From the data, LSTM learns three gated decisions: whether to maintain the long-term memory, whether to pass the current input into the long-term state, and whether to use the long-term state as input [18]. In this paper, the LSTM network is combined with a bidirectional recurrent neural network to construct a bidirectional LSTM (BI-LSTM) suited to the fingering-labeling task. The BI-LSTM structure reflects part of the manual fingering procedure: it can weigh the influence of the pitch information before and after the current note on the fingering annotation.

Fingering assessment method based on multimodal decision-level fusion of audio and video
Piano fingering assessment program

In this chapter, an intelligent evaluation scheme for piano fingering based on AI action recognition is proposed; its overall framework is shown in Figure 4. The fingering to be assessed is evaluated comprehensively from both the visual and the auditory perspective. The scheme takes into account the basic laws of learning piano fingering, converts the fingering requirements of teaching practice into a machine-representable form, and realizes intelligent, specialized auxiliary assessment in computer code. The input performance is split into video and audio streams: the visual side is evaluated by AI-based fingering recognition and video-based fingering evaluation, the auditory side by audio-based pitch comparison, and a comprehensive intelligent evaluation result is finally output.

Figure 4.

Fingering evaluation based on AI action recognition

Fingering evaluation based on AI action recognition

For the piano performance skills to be assessed, image-based hand movement recognition and fingering-based assessment in performance are performed sequentially, and the corresponding assessment results are obtained from a visual perspective using AI movement recognition technology and video content.

Audio-based fingering evaluation module

For the piano fingering techniques to be evaluated, audio-based pitch comparison is performed, and audio-analysis techniques yield the corresponding evaluation result from the auditory perspective. Concretely, the audio is extracted from the input performance and its frequency-domain information is obtained with the Fourier transform; the fundamental frequency is then compared with the standard tone to produce the audio evaluation result. Pianos are conventionally tuned to a fixed standard pitch; the correspondence between scale degrees and frequencies used here is shown in Table 1.

D major scale and frequency correspondence table (middle register)

Degree   Roll-call   Note name   Frequency (Hz)
Pentatonic scale
1        Do          D           585
2        Re          E           651
3        Mi          #F          735
5        Sol         A           875
6        La          B           991
Chromatic scale
4        Fa          G           774
7        Si          #C          1104

The audio is extracted from the input fingering performance and converted from the time domain to the frequency domain to obtain its frequencies. The time-domain expression of the sound is shown in Eq. (15).

$$Y_k=\sum_{i} A_i\cos(2\pi f_i k t_s+\varphi_i),\quad k=0,1,2,\dots \tag{15}$$

where $Y_k$ is the discrete sound-wave sequence, $A_i$ the amplitude, $f_i$ the frequency, $t_s$ the sampling interval, and $\varphi_i$ the phase of component $i$. After the discrete Fourier transform (DFT), the frequency-domain expression of the sound is shown in Eq. (16).

$$Z_k=\sum_{n=0}^{N-1} Y_n e^{-i\frac{2\pi}{N}kn},\quad 0\le k\le N-1 \tag{16}$$

where Zk is a sequence of complex numbers as shown in Eq. (17).

$$Z_k=a_k+b_k i,\quad 0\le k\le N-1 \tag{17}$$

The modulus and phase angle of $Z_k$ give the amplitude and phase of the sound wave, as shown in Eqs. (18)-(20):

$$A_0=\frac{\sqrt{a_0^2+b_0^2}}{N} \tag{18}$$
$$A_k=\frac{2\sqrt{a_k^2+b_k^2}}{N},\quad 1\le k\le N-1 \tag{19}$$
$$\varphi_k=\tan^{-1}\frac{b_k}{a_k} \tag{20}$$
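The pitch-extraction step of Eqs. (15)-(20) can be sketched with NumPy's FFT. The 585 Hz test tone matches the table's "Do"; the sampling rate and signal length are illustrative assumptions:

```python
import numpy as np

def fundamental_frequency(samples: np.ndarray, fs: float) -> float:
    """Estimate the fundamental as the strongest DFT bin, per Eqs. (16)-(19)."""
    Z = np.fft.rfft(samples)               # DFT of the real-valued signal, Eq. (16)
    amps = np.abs(Z) / len(samples) * 2    # amplitudes A_k, Eq. (19)
    amps[0] /= 2                           # the DC term A_0 is not doubled, Eq. (18)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    return float(freqs[np.argmax(amps[1:]) + 1])  # skip the DC bin

# Synthesize a 585 Hz tone (the table's "Do") and recover its frequency
fs = 8000                                  # assumed sampling rate, 1 s of audio
t = np.arange(fs) / fs
signal = 0.8 * np.cos(2 * np.pi * 585 * t)
f0 = fundamental_frequency(signal, fs)
```

The recovered `f0` is then compared against the standard frequencies of Table 1 as described in the next two subsections.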
Determination of string selection

Compare the measured audio with the frequencies of the pentatonic scale in turn. According to Eq. (21), take the degree P whose standard frequency gives the smallest absolute difference. Then judge by Eq. (22): if P is the same as the specified degree Q, the correct string was played; if not, the wrong string was played.

$$O=\min_i |f-f_i|,\quad i\in\{D,E,\#F,A,B\} \tag{21}$$

Judgment:
$$\begin{cases} P=Q, & \text{Correct}\\ P\neq Q, & \text{Error}\end{cases} \tag{22}$$
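A hedged Python sketch of Eqs. (21)-(22); the frequency table follows Table 1, and the function names are illustrative:

```python
# Standard frequencies of the pentatonic degrees, from Table 1 (Hz)
SCALE = {"D": 585.0, "E": 651.0, "#F": 735.0, "A": 875.0, "B": 991.0}

def nearest_degree(f: float) -> str:
    """Eq. (21): the degree whose standard frequency minimizes |f - f_i|."""
    return min(SCALE, key=lambda name: abs(f - SCALE[name]))

def string_correct(f: float, specified: str) -> bool:
    """Eq. (22): correct iff the nearest degree P equals the specified degree Q."""
    return nearest_degree(f) == specified
```

For example, a measured tone at 738 Hz snaps to #F, so it counts as correct only when #F was the specified degree.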

Determination of Chromatic Intonation

Compare the measured audio with the standard frequencies of the chromatic scale. If the difference exceeds the threshold η, the tone is high and the left-hand stroke depth should be reduced; if the difference is below −η, the tone is low and the stroke depth should be increased, with the adjustment scaled to the absolute value of the difference. If the absolute difference is at most η, the tone is judged correct. The judgment is shown in Eq. (23).

$$\begin{cases} f-f_j>\eta, & \text{Higher}\\ -\eta\le f-f_j\le\eta, & \text{Correct}\\ f-f_j<-\eta, & \text{Lower}\end{cases}\qquad j\in\{G,\#C\} \tag{23}$$
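Eq. (23) as a small Python function; the threshold value η = 3 Hz is an illustrative assumption, not a value from the paper:

```python
def intonation(f: float, f_std: float, eta: float = 3.0) -> str:
    """Eq. (23): compare the measured frequency with the chromatic standard f_j."""
    diff = f - f_std
    if diff > eta:
        return "higher"   # tone is sharp: reduce the stroke depth
    if diff < -eta:
        return "lower"    # tone is flat: increase the stroke depth
    return "correct"      # |f - f_j| <= eta
```

With the Table 1 standard of 774 Hz for G, a measured 780 Hz would be judged "higher" and a measured 770 Hz "lower".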
Quantitative results and analysis of piano performance skills
Experiments and analysis of indicator labeling

Based on the LSTM method, 40 scores were collected and labeled with fingering data for training and testing. For evaluation, in addition to the rate of agreement between the predicted fingering sequences and the manual labels, two new metrics are proposed to measure the elasticity of the fingering sequences and the efficiency of playing.

Dataset of scores with fingerings

There is no formally recognized dataset or rubric yet, so a self-constructed dataset was used: Bach's 28 short piano pieces (28 pieces), Czerny Op. 299 (5 pieces), and Grades 1-3 of the China Conservatory of Music Social Art Level Examination (7 pieces), 40 pieces in total, with more than 4,000 notes for the left hand and more than 6,000 for the right. To verify the model's reliability, the dataset was divided into five groups for five-fold cross-validation, and eight sets of experiments were designed to compare and verify the improvements of the proposed scheme.

Characterization of fingering data and parameter selection

With 88 notes, the number of output combinations is enormous: the full output probability matrix would hold 31×61929 ≈ 1.91 million parameters, of which combinations of four- and five-note chords account for 90.2%. Statistics on the notes and fingerings in the dataset show that four- and five-note chords are extremely rare, with probability approaching 0. Deleting the four-note, five-note, and corresponding chord entries from the output probability matrix and the transfer probability matrix therefore reduces the output probability matrix, composed of single tones and two- and three-finger chords, to (86+981+5136)×25 ≈ 160,000 parameters, with a conditional state-transfer matrix of 25×25 = 625. Two- and three-finger chords account for 98.5% of this reduced output probability matrix.

Figure 5 shows the frequency distribution of right-hand notes and fingerings. Multi-finger chords are very rare in the self-built dataset, and many positions have a frequency of 0. The reason is that the model focuses on monophonic score sequences, which place higher demands on finger transitions, so the training set covers few chords; datasets containing more chords should be added in the future.

The output matrix theoretically spans all notes, but as Fig. 5 shows, within the MIDI note-number range (18-105) the right hand covers mainly the right half, and the left hand is similarly concentrated in the left half, with very few notes appearing in the other half. The output probability matrix of the left- and right-hand models is thus reduced by roughly half again.

Figure 5.

Right hand note and fingering frequency distribution

Experimental results

To illustrate the performance of the labeling results and to verify the flaws of the agreement-rate metric, part of the experimental output is examined: the labels for two bars of the upper voice of the Minuet in G major from Bach's 28 short piano pieces. The 5→3 fingering labeled by Original LSTM+Viterbi spans from G4 to G5, a full octave, which exceeds the maximum playable span between fingers 3 and 5 without changing hand position; it is an unplayable fingering. The 4→3 fingering in the second measure is a wrong crossing, and that measure even contains a 1→3 cross-finger. By contrast, the 3→2→1 fingering produced by this paper's algorithm is merely shifted by one consecutive finger relative to the 4→3→2 fingering in the ground truth. On the surface, both Original LSTM+Viterbi and BI-LSTM disagree with the ground truth, yet the BI-LSTM labels are clearly more reasonable. Among the labels, 3/5 of the Original LSTM+Viterbi fingerings agree with the ground truth versus only 2/5 for BI-LSTM, although BI-LSTM is the more reasonable within the fault-tolerance range; the agreement rate therefore does not fully measure the performance of a fingering-labeling algorithm. It can also be observed that the inconsistent fingerings of Original LSTM+Viterbi are only weakly correlated with their neighbors, whereas BI-LSTM tends to produce runs of labels that disagree with the ground truth, which depresses the agreement rate further. Consequently, BI-LSTM is prone to extremes: its agreement rate ranges from a high of 88% down to only 42% on the test set.

The trained BI-LSTM model is used for fingering prediction; the results are shown in Table 2. By comparison, training on raw note sequences performs poorly, with an agreement rate of only 57.11%, while training on note-difference sequences reaches 69.7%, higher than Judgement-LSTM. Compared with a one-to-one mapping between fingering and notes, fingering labeling is thus more closely tied to note-difference information, which is precisely what the proposed BI-LSTM labeling method combines. Bi-LSTM-CRF, however, still produces 3% inelastic fingerings, with an irrationality rate of 6.02%; in terms of fingering elasticity, the algorithm of this paper is therefore superior to Bi-LSTM-CRF.

Comparison of annotation results of different algorithms

Algorithm                            Accuracy   Unrealizable rate   Unreasonable rate
Original LSTM                        53.56%     25.24%              19.6%
Fault-tolerant LSTM                  59.91%     23.41%              17.2%
Merged-output LSTM                   54.45%     24.7%               20.7%
Judgement-LSTM                       68.71%     5%                  19.2%
Bi-LSTM-CRF (note sequence)          57.11%     17.77%              18.1%
BI-LSTM (note-difference sequence)   69.7%      0%                  3.87%
Quantitative results for the piano indicator

To test the effectiveness and objectivity of the established quantitative evaluation method of piano fingering based on AI action recognition, the following comparative experiments on fingering evaluation results were designed. The quantitative-evaluation algorithm was written in Python and debugged and run in the PyCharm IDE. Since the reinforcement-learning network in the system evaluates both single-step fingering transitions and complete fingering sequences, experiments were designed to test both aspects.

Single-step fingering evaluation in the reinforcement-learning network mainly involves the finger transition between two consecutive notes. Different categories of transition fingerings over two consecutive notes were set up, with five fingerings per category, and each was scored by the quantitative evaluation system; the results are shown in Table 3. The smooth-fingering category has the highest average, a full score of 6 in every case, in line with the criterion that natural, smooth fingering is optimal. The averages of the extended-finger, contracted-finger, and crossed-finger categories decrease in turn, which also basically matches the priority ranking of transition fingerings, while the illegal-fingering category has the lowest average, negative in every case, which helps performers avoid illegal fingerings in subsequent playing.

Evaluation results of transition fingerings

Fingering class   Smooth   Extended   Contracted   Crossed   Illegal
Score 1           6        4          1            4         -2
Score 2           6        6          3            -2        -3
Score 3           6        4          3            4         -3
Score 4           6        1          1            4         -3
Score 5           6        4          4            1         -2
Average           6        3.8        2.4          2.2       -2.6

The comparison of the scores of the various fingering categories is shown in Figure 6. The distribution of scores within each category is balanced, while the categories themselves rank clearly from high to low, which helps guide the generation of reasonable, scientific fingering sequences and helps performers avoid erroneous fingerings when playing.

Figure 6.

Comparison of various fingering scores

The quantitative evaluation method of piano fingering based on AI action recognition evaluates not only each fingering transition during playing but also the generated fingering sequence as a whole. A key-sequence fragment of 10 notes was selected, and a standard, a general, and a random fingering sequence were input into the system for scoring; the results are shown in Table 4 and Fig. 7. The standard fingering sequence achieves a very high total score and scoring rate, with a steadily increasing score trend and little fluctuation. The general fingering sequence scores lower than the standard one, with a less stable trend and larger fluctuations. The random fingering sequence has a very low total score and scoring rate, and its performance is very unstable, with large score fluctuations. The evaluation of fingering sequences by the established method therefore basically follows the rules of fingering, and is scientific and effective.

The result of fingering sequence evaluation

No.   Fingering sequence   Total score   Scoring rate
1     Standard             48            0.96
2     General              35            0.67
3     Random               9             0.15
Figure 7.

Fingering sequence score trend

Conclusion

This paper takes fingering technique in piano performance as a case study, proposes piano fingering evaluation based on AI action recognition, and compares the agreement of fingering-annotation results across different LSTM-based algorithms. The results show that the BI-LSTM fingering-labeling algorithm proposed here achieves high consistency and tracks the rhythm of the piano notes well. Under the AI action-recognition evaluation, the standard fingering sequence attains a total score of 48 and a scoring rate of 0.96, with a steadily increasing, low-fluctuation score trend, while the random fingering sequence has a very low total score and scoring rate and highly unstable, widely fluctuating scores. The quantitative evaluation method established in this paper therefore evaluates piano fingering sequences basically in accordance with the rules of fingering, and is scientific and effective.

Language:
English
Publication frequency:
once a year
Journal subjects:
Biological Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other