Open Access

Research on automatic identification and evaluation method of piano playing skills based on convolutional neural network

  
21 March 2025


Introduction

In the teaching of piano performance in colleges and universities, the words “performance” and “playing” reveal the dual requirements of musical expressiveness and technical skill, respectively [1]. Skill and technique form the basis of piano performance, and students must be technically proficient, while musical expression is the soul of the performance, giving emotion and depth to the work so that each performance has a unique personality and expression [2-3]. However, current piano performance teaching in colleges and universities often favors the cultivation of students’ technical skills and neglects the importance of musical expression, which not only restricts the breadth and depth of students’ artistic expression but also limits their ability to fully understand musical performance [4-6].

With the continuous development of music education, the piano, as an important tool for musical expression, has drawn growing attention to the relationship between playing technique and musical expression [7]. Traditional music education tends to focus on the training of technique while paying relatively little attention to musical expression [8]. However, as musical expression becomes increasingly important in performance, how to effectively combine technique and expression has become a hot research topic. Influenced by the Internet, music education has also gradually entered a new stage of development [9-10].

Internet-based music education breaks through the limitations of traditional education: by making full use of the network, it allows students to learn at any place and time, while its rich learning resources also mobilize students’ motivation and interest in learning [11-13]. As an important part of the Internet, the mobile Internet can provide people with more complete network services on mobile terminals [14].

Music performance has always been an important subject in the music education curriculum, and innovative, reliable, credible, and effective music performance assessment is an important part of the success of music education [15-16]. The potential problems of music performance assessment are closely related to the characteristics of music performance itself. For each musical work, different musicians will differ in timbre, intensity, interpretation, and so on. Thus, different musical performances may require different assessment methods [17-19].

In this paper, a hand gesture recognition system based on G-HRNet is constructed for piano playing techniques. The system processes real-time camera data and recognizes the joint points of the hand using target detection and gesture recognition. On this basis, the high-resolution network HRNet is introduced; the Ghost module and the DFC attention mechanism are incorporated into this network model, the lightweight G-Block module is constructed, and the lightweight high-resolution network model G-HRNet is proposed. The human joint angle feature is combined with the DTW algorithm to assess the similarity between the hand movements during piano playing and the standard playing movements. Finally, the application effect of this paper’s algorithm is verified on public datasets.

Automatic estimation model of piano playing hand posture based on G-HRNet
Deep learning based piano hand fingering recognition system
General system architecture

Firstly, the system acquires the corresponding piano playing video in real time through the camera; after image processing, the video is sent to the algorithm recognition module for playing error recognition, hand shape error recognition, and scoring. Finally, the recognized video is displayed on the front end through WebSocket communication. The algorithm recognition module counts the number of playing errors and hand shape errors and displays the errors and the overall score on the front end using HTTP communication. The overall framework of the system is shown in Figure 1. Algorithm recognition comprises data acquisition, data labeling, model training, and algorithm deployment. The YOLOv3 target detection algorithm is used for model training, and algorithm deployment integrates the trained model into the system and deploys it on the server.

Figure 1.

System framework

System Functional Module Design

The deep learning-based piano hand fingering recognition system designed in this paper can recognize and correct the fingerings played by practitioners. The system composition is shown in Figure 2. AI technology simulates an accompanying trainer to guide students’ piano practice: image and video recognition algorithms accurately detect the hand shape during playing and correct students’ hand shape and fingering errors in real time. The system mainly realizes the functions of image acquisition, gesture recognition, and recognition and comparison.

Figure 2.

System module

Algorithm Architecture Design

The key algorithm for hand recognition in this project is composed of two primary components: the target detection algorithm and the pose estimation algorithm. The target detection algorithm mainly uses the YOLOv3 algorithm [20] to detect the position of the hand in each key frame. YOLOv3 employs a single neural network for target detection, dividing the image into regions and predicting the probability and bounding box for each region. The block diagram of the scoring algorithm of this system is shown in Fig. 3. The algorithm is divided into two main steps: first, a series of candidate regions is generated on the image according to certain rules, and these candidate regions are labeled by their positional relationship with the ground-truth box; second, image features are extracted using a convolutional neural network to predict the location and category of the candidate regions. Two functions are then built on this basis: the first uses the YOLOv3 algorithm to judge whether the hand shape in the current key frame is erroneous; the second, after obtaining the position coordinates of the hand in the current key frame, predicts the hand key points using the HRNet pose estimation algorithm. The obtained hand keypoint coordinates are stored in an array, and the DTW algorithm [21] is used to compare the player’s performance with standard data recorded by a piano teacher and to score the performance. The final output is the number of incorrect hand shapes and the performance score.

Figure 3.

Scoring algorithm block diagram
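
To make the data flow concrete, the sketch below chains the stages of Fig. 3 in Python. The names hand_detector, pose_estimator, and teacher_track are hypothetical placeholders, not the system's actual API, and dtw_distance refers to the routine sketched in the DTW section later in this paper:

```python
import numpy as np

def score_performance(frames, hand_detector, pose_estimator, teacher_track):
    """Hypothetical sketch of the Fig. 3 pipeline: per-frame hand detection
    (YOLOv3-style), keypoint estimation (HRNet-style), then a DTW comparison
    of the whole keypoint track against the teacher's standard data."""
    hand_shape_errors = 0
    keypoint_track = []
    for frame in frames:
        box, shape_ok = hand_detector(frame)       # bounding box + hand-shape check
        if not shape_ok:
            hand_shape_errors += 1                 # count erroneous hand shapes
        keypoint_track.append(pose_estimator(frame, box))  # 21 hand keypoints
    # dtw_distance as sketched in the DTW section below
    score = dtw_distance(np.asarray(keypoint_track), np.asarray(teacher_track))
    return hand_shape_errors, score
```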

Target Detection Algorithm Training

This project uses the YOLOv3 target detection algorithm for model training. The officially provided default weights file is used for initialization, and the model’s label types and related information are set. The labeled dataset is fed into the network for data preprocessing, the network is trained on the images in the dataset to obtain its output, and finally the trained model is used for target prediction.

Lightweight G-HRNet model construction
Introduction to the High Resolution Network HRNet

Prior to the advent of the high-resolution network HRNet [22], network models for pose estimation were based on recovering high-resolution feature maps from low-resolution ones. However, this approach loses a large amount of valid information in the repeated down- and up-sampling. To improve joint detection accuracy, such networks are made ever deeper and wider, and even the resolution of the input image is increased, so existing high-precision models have large numbers of parameters and heavy computation and are difficult to use when computational resources are limited. In this paper, we therefore propose the lightweight network G-HRNet, a lightweight improvement of HRNet.

G-Block

Deploying neural networks on embedded devices is difficult due to memory and computational power limitations. The operation of an arbitrary convolutional layer that generates $n$ feature maps is shown in Equation (1):

$$Y = X * f + b \tag{1}$$

where $*$ denotes the convolution operation and $b$ the bias term; $Y \in \mathbb{R}^{h \times w \times n}$ is the output feature map with $n$ channels, with $h$ and $w$ its height and width; $f \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution filter of this layer, and $k \times k$ is the kernel size of the filter $f$.

Assume that $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h \times w \times m}$ are generated by an initial convolution, as shown in Equation (2):

$$Y' = X * f' \tag{2}$$

The convolution filter is $f' \in \mathbb{R}^{c \times k \times k \times m}$ with $m \le n$; the other hyperparameters, such as convolution kernel size, stride, and spatial size, are consistent with those of the original convolution.

In order to obtain the required $n$ feature maps, a series of cheap linear operations is applied to the intrinsic feature maps $Y'$ to generate $s$ “Ghost” feature maps each, as shown in Equation (3):

$$y_{ij} = \Phi_{ij}(y'_i), \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, s \tag{3}$$

where $y'_i$ is the $i$-th intrinsic feature map in $Y'$ and $\Phi_{ij}$ is the $j$-th linear operation generating the Ghost feature map $y_{ij}$. The Ghost module contains one identity mapping and $m(s-1) = \frac{n}{s}(s-1)$ linear operations, so in theory the Ghost module yields the same number of feature maps as the standard convolution. Its theoretical speed-up ratio $r_s$ and compression ratio $r_c$ relative to the standard convolution can be deduced as:

$$r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s-1}{s} \cdot d \cdot d} \approx s \tag{4}$$

$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{5}$$

where $d \times d$ is the kernel size of the linear operations and $h'$, $w'$ are the height and width of the output feature maps.
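
As a concrete illustration, below is a minimal PyTorch sketch of a Ghost module consistent with Eqs. (1)-(5). Treating depthwise $d \times d$ convolutions as the cheap linear operations $\Phi_{ij}$ is an assumption of this sketch (a common choice), as is $s = 2$:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a small primary convolution produces
    m = n/s intrinsic maps; cheap depthwise convolutions generate the
    remaining "ghost" maps (Eq. (3)); the two sets are concatenated."""
    def __init__(self, in_ch, out_ch, kernel_size=1, s=2, d=3, relu=True):
        super().__init__()
        self.out_ch = out_ch
        init_ch = out_ch // s                       # m intrinsic channels
        cheap_ch = init_ch * (s - 1)                # m(s-1) ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )
        # depthwise d x d convolutions play the role of the linear ops Phi_ij
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, d, padding=d // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )

    def forward(self, x):
        y_intrinsic = self.primary(x)
        y_ghost = self.cheap(y_intrinsic)
        out = torch.cat([y_intrinsic, y_ghost], dim=1)
        return out[:, : self.out_ch]                # trim if out_ch % s != 0
```

With $s = 2$, half of the output channels come from the primary convolution and the rest from the cheap branch, which is the source of the roughly $s$-fold savings in Eqs. (4)-(5).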

Given an input feature $Z \in \mathbb{R}^{H \times W \times C}$, it can be treated as $HW$ tokens $z_i \in \mathbb{R}^{C}$, i.e., $Z = \{z_{11}, z_{12}, \ldots, z_{HW}\}$. One way to implement an attention map with a fully connected (FC) layer is:

$$a_{hw} = \sum_{h', w'} F_{hw, h'w'} \odot z_{h'w'} \tag{6}$$

where $\odot$ denotes the element-wise operation, $F \in \mathbb{R}^{HW \times H \times W}$ is the learnable weight, and $A = \{a_{11}, a_{12}, \ldots, a_{HW}\}$ is the resulting attention map. In computing the attention value $a_{hw}$ for each location, information from all locations is fused, so global information is captured by combining all tokens with learnable weights. However, feature maps in CNNs are usually low-rank, so it is not actually necessary to densely connect all input and output tokens across spatial locations. The attention can therefore be factorized as:

$$a'_{hw} = \sum_{h'=1}^{H} F^{H}_{h, h'w} \odot z_{h'w}, \quad h = 1, 2, \ldots, H, \; w = 1, 2, \ldots, W \tag{7}$$

$$a_{hw} = \sum_{w'=1}^{W} F^{W}_{w, hw'} \odot a'_{hw'}, \quad h = 1, 2, \ldots, H, \; w = 1, 2, \ldots, W \tag{8}$$

where $F^H$ and $F^W$ are learnable weights and the input is the original feature $Z$. Equations (7) and (8) act on the features sequentially, capturing long-range dependencies along the two directions respectively; this is the decoupled fully connected (DFC) attention mechanism. DFC attention captures dependencies between pixels at distant spatial locations, which greatly enhances the expressive power of the lightweight model. It decomposes the FC layer into a horizontal FC and a vertical FC, each with a large receptive field along its direction. Adding this computationally efficient and easily deployed attention mechanism to lightweight networks achieves a better balance between accuracy and speed.
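
A minimal sketch of DFC attention follows. The horizontal and vertical FC layers of Eqs. (7)-(8) are approximated here with depthwise strip convolutions; the kernel length of 5, the 1x1 pre-projection, and the sigmoid gating are assumptions of this sketch, not specifics from the text:

```python
import torch.nn as nn

class DFCAttention(nn.Module):
    """Sketch of decoupled fully connected (DFC) attention: strip-shaped
    depthwise convolutions stand in for the horizontal FC (Eq. (8)) and
    vertical FC (Eq. (7)), applied sequentially to form an attention gate."""
    def __init__(self, channels, kernel_len=5):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            # horizontal branch: aggregates along the width of each row
            nn.Conv2d(channels, channels, (1, kernel_len),
                      padding=(0, kernel_len // 2), groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            # vertical branch: aggregates along the height of each column
            nn.Conv2d(channels, channels, (kernel_len, 1),
                      padding=(kernel_len // 2, 0), groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)     # re-weight features with the attention map
```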

The G-Block consists of three Ghost convolutions. The first Ghost convolution serves as an expansion layer that increases the number of feature channels and is followed by the ReLU activation function to speed up training. The second and third Ghost convolutions reduce the number of channels back to the original number, and batch normalization (BN) is added after the last Ghost convolution. Finally, following the principle of residual networks, the input is split into a main branch and a residual branch, and input and output are connected with a shortcut, which prevents the loss of information.
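
Putting the two pieces together, a G-Block might look like the following sketch, built from the GhostModule and DFCAttention sketches above. The channel schedule of the three Ghost convolutions and the placement of the DFC attention are assumptions of this sketch:

```python
import torch.nn as nn

class GBlock(nn.Module):
    """Sketch of the G-Block: Ghost-convolution expansion with ReLU, two
    Ghost convolutions that bring the channel count back down, BN after
    the last one, a DFC attention gate, and a residual shortcut."""
    def __init__(self, channels, expand_ratio=2):
        super().__init__()
        mid = channels * expand_ratio
        self.g1 = GhostModule(channels, mid, relu=True)     # expansion layer
        self.attn = DFCAttention(mid)                       # long-range context
        self.g2 = GhostModule(mid, channels, relu=False)    # reduce channels
        self.g3 = GhostModule(channels, channels, relu=False)
        self.bn = nn.BatchNorm2d(channels)                  # BN after last Ghost conv

    def forward(self, x):
        out = self.g3(self.g2(self.attn(self.g1(x))))
        return self.bn(out) + x                             # shortcut keeps information
```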

G-HRNet network structure

The original HRNet is made lightweight. The improvement has two aspects: first, Ghost convolution [23] and the DFC attention mechanism are used to construct a lightweight residual module; second, the parallel subnetworks of the original HRNet are streamlined, which reduces the number of model parameters. The G-HRNet network consists of three stages. It is composed of parallel connected subnetworks whose resolution decreases gradually from top to bottom, and multi-scale feature fusion is carried out between features of different resolutions: low-resolution features are restored to high resolution through upsampling and fused with the high-resolution features, while high-resolution features are reduced to lower resolution through downsampling and fused with the low-resolution features, so that feature fusion proceeds continuously.

DARK encoding and decoding methods

The heatmap regression approach DARK utilizes a Taylor expansion of the Gaussian distribution to mitigate the quantization error in heatmap regression. DARK decoding proceeds as follows:

Assume that the predicted heatmap is completely error-free, i.e., it equals the constructed heatmap label:

$$heatmap(x, y) = e^{-\frac{(x - \mu_x)^2 + (y - \mu_y)^2}{2\sigma^2}} \tag{9}$$

where $heatmap$ is the predicted heatmap, $(x, y)$ is the coordinate of a pixel on the heatmap, $\mu = (\mu_x, \mu_y)$ is the true coordinate of the key point, and $\sigma$ is a constant.

In the first step, the logarithm of the predicted heatmap is taken; $P$ denotes the logarithmic heatmap:

$$P(x, y) = -\frac{(x - \mu_x)^2 + (y - \mu_y)^2}{2\sigma^2} \tag{10}$$

In the second step, the logarithmic heatmap is Taylor-expanded at its maximum point $m$; the value of the logarithmic heatmap at the true keypoint coordinate $\mu = (\mu_x, \mu_y)$ can then be written as follows, where $D'(m)$ and $D''(m)$ are the first- and second-order derivatives of the logarithmic heatmap at the maximum point $m$:

$$P(\mu) = P(m) + D'(m)(\mu - m) + \frac{1}{2} D''(m)(\mu - m)^2 \tag{11}$$

In the third step, both sides of the above equation are differentiated with respect to $\mu$:

$$D'(\mu) = 0 + D'(m) + \frac{1}{2} D''(m)(2\mu - 2m) \tag{12}$$

In the fourth step, since $\mu$ is the maximum of $P$, we have $D'(\mu) = 0$; substituting this into Eq. (12) and simplifying yields Eq. (13):

$$\mu = m - \left( D''(m) \right)^{-1} D'(m) \tag{13}$$
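
As a sketch of this decoding step, the derivatives in Eq. (13) can be approximated by finite differences on the log-heatmap. The boundary guard and the finite-difference stencils below are implementation choices, not taken from the text:

```python
import numpy as np

def dark_decode(heatmap, eps=1e-10):
    """Sketch of DARK sub-pixel decoding for one 2-D heatmap: take the log,
    then refine the integer argmax m with the Taylor step of Eq. (13),
    mu = m - D''(m)^{-1} D'(m)."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return np.array([x, y], dtype=float)   # no room for finite differences
    P = np.log(np.maximum(heatmap, eps))
    # first derivatives (gradient) at m
    dx = 0.5 * (P[y, x + 1] - P[y, x - 1])
    dy = 0.5 * (P[y + 1, x] - P[y - 1, x])
    # second derivatives (Hessian) at m
    dxx = P[y, x + 1] - 2 * P[y, x] + P[y, x - 1]
    dyy = P[y + 1, x] - 2 * P[y, x] + P[y - 1, x]
    dxy = 0.25 * (P[y + 1, x + 1] - P[y + 1, x - 1]
                  - P[y - 1, x + 1] + P[y - 1, x - 1])
    H = np.array([[dxx, dxy], [dxy, dyy]])
    grad = np.array([dx, dy])
    if abs(np.linalg.det(H)) < eps:
        return np.array([x, y], dtype=float)
    offset = -np.linalg.solve(H, grad)          # mu - m = -H^{-1} grad
    return np.array([x, y], dtype=float) + offset
```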

Hand Posture Estimation Model Based on G-HRNet

In this paper, we build a hand pose estimation model based on the G-HRNet network. It takes a color hand image as input; the Stem stage downsamples the input image by a factor of 4, and the resulting hand feature maps are fed into the first-stage subnetwork. The model then enters the second stage: the high-resolution feature map is downsampled to produce a feature map at half its resolution for feature extraction. Next, in the third-stage subnetwork, the second-stage feature maps are downsampled to generate the third-stage feature maps. This process is repeated, and finally the feature maps generated by the three subnetworks are fused and output at the size of the highest-resolution feature map, i.e., 64×64. After a final convolution producing 21 channels, 21 joint-point heatmaps are output, and the final hand pose joint detection results are obtained after DARK heatmap decoding.

DTW-based piano fingering evaluation

In order to obtain more reliable similarity assessments, the problem of time-axis differences must be overcome when calculating the similarity between two human joint angle sequences: the two joint angle feature vector sequences to be assessed must be matched and aligned on the time axis. The dynamic time warping algorithm is an effective way to solve this problem. It warps the time series by stretching or shortening them along the time axis, thus enabling similarity calculation between time series.

Dynamic Time Warping (DTW) Algorithm

The DTW algorithm effectively solves a problem of traditional Euclidean distance matching for time series of unequal lengths, which cannot be stretched or compressed on the time axis for similarity comparison of, e.g., speech signal feature parameter sequences. The core idea of the DTW algorithm is to search, by dynamic programming, for the optimal temporal match between the elements of two time series; DTW allows one-to-many or many-to-one mappings between elements, lengthening or shortening the series to achieve alignment and evaluation on the time axis. The DTW algorithm is widely applied in fields that can be cast as linear sequence analysis, such as human movement recognition and analysis, DNA sequence alignment, and image processing.

There are two time series $X = \{X(1), X(2), \cdots, X(m)\}$ and $Y = \{Y(1), Y(2), \cdots, Y(n)\}$ of lengths $m$ and $n$, where $X(i)$ and $Y(j)$ denote the feature vectors within the two sequences. The goal of the DTW algorithm is to find the matching that minimizes the cumulative distortion distance between the two sequences. To find the optimal matching path of the two sequences intuitively, an $m \times n$ matrix $A$ is constructed, in which the element $A_{i,j}$ in row $i$ and column $j$ denotes the distance $D(X(i), Y(j))$ between point $X(i)$ of sequence $X$ and point $Y(j)$ of sequence $Y$, generally the squared Euclidean distance, i.e., $D(X(i), Y(j)) = (X(i) - Y(j))^2$. Element $A_{i,j}$ thus represents the alignment of point $X(i)$ with point $Y(j)$. A path $W$ from $(1, 1)$ to $(m, n)$ is found in matrix $A$ by dynamic programming (DP) to characterize the direct mapping between sequence $X$ and sequence $Y$. $W$ consists of contiguous elements of matrix $A$ such that the sum of all matrix elements along the path is minimized; the lattice points the path passes through are the aligned element pairs of sequences $X$ and $Y$, and the sum of the matrix elements on the path is the DTW minimum matching distance. Defining the $k$-th element of the path as $w_k = (i, j)_k$, $W$ can be expressed as:

$$W = \{w_1, w_2, \cdots, w_K\}, \quad \max(m, n) \le K \le m + n - 1 \tag{14}$$

The DTW algorithm adds several constraints when searching for the optimal matching path, namely:

1) Boundary condition: $w_1 = (1, 1)$ and $w_K = (m, n)$. The optimal matching path $W$ must start from the lower-left corner of matrix $A$ and end at the upper-right corner, with $\max(m, n) \le K \le m + n - 1$.

2) Continuity: each step of the optimal matching path $W$ may only move to an element adjacent to the current one. That is, if $w_{k-1} = (a', b')$, the next point $w_k = (a, b)$ of the path must satisfy $a - a' \le 1$ and $b - b' \le 1$.

3) Monotonicity: $a$ and $b$ in $w_k = (a, b)$ must be non-decreasing along the optimal matching path $W$. That is, if $w_{k-1} = (a', b')$, the next point $w_k = (a, b)$ of the path must satisfy $a - a' \ge 0$ and $b - b' \ge 0$.

Among the paths satisfying the above constraints, the one with the smallest distortion distance is the desired optimal matching path:

$$DTW(X, Y) = \min\left\{ \sum_{k=1}^{K} w_k \Big/ K \right\} \tag{15}$$

The denominator $K$ compensates for warping paths of different lengths, so that sequences of unequal length can be stretched or compressed. The cumulative distortion distance $\beta(i, j)$ is obtained by the DP recursion shown in Eq. (16):

$$\beta(i, j) = D(X(i), Y(j)) + \min\{\beta(i-1, j-1), \beta(i, j-1), \beta(i-1, j)\} \tag{16}$$
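
A minimal NumPy sketch of the recursion in Eq. (16) follows; normalizing by m + n instead of the exact path length K is a simplification of Eq. (15):

```python
import numpy as np

def dtw_distance(X, Y):
    """Minimal DTW sketch following Eq. (16): beta(i, j) accumulates the
    local distance D plus the cheapest of the three admissible predecessors.
    X and Y are arrays of joint angle feature vectors, one row per frame."""
    m, n = len(X), len(Y)
    beta = np.full((m + 1, n + 1), np.inf)
    beta[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])   # local distance D(X(i), Y(j))
            beta[i, j] = d + min(beta[i - 1, j - 1],  # diagonal match
                                 beta[i, j - 1],      # stretch X
                                 beta[i - 1, j])      # stretch Y
    return beta[m, n] / (m + n)   # simplified normalization of Eq. (15)
```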

Algorithm implementation

Let the human joint angle feature sequences of the two action videos to be evaluated be $P = \{p_1, p_2, p_3, \cdots, p_m\}$ and $Q = \{q_1, q_2, q_3, \cdots, q_n\}$, where $p_m$ is the joint angle feature vector of frame $m$ of the first action video and $q_n$ is the joint angle feature vector of frame $n$ of the second action video. The specific steps of the human action similarity evaluation algorithm based on joint angle features and DTW are as follows:

1) Based on the three constraints described in the previous subsection, search from $w_1 = (1, 1)$ to $w_K = (p_m, q_n)$ for the optimal matching distance $\beta(p_m, q_n)$:

$$\beta(p_m, q_n) = dist(p_m, q_n) + \min\{\beta(p_{m-1}, q_n), \beta(p_m, q_{n-1}), \beta(p_{m-1}, q_{n-1})\} \tag{17}$$

2) Repeat the previous step until both joint angle feature sequences have been fully traversed, i.e., $i = m$ and $j = n$, to obtain the complete best matching path $W = \{w_1, w_2, w_3, \cdots, w_K\}$.

3) Create a sequence element matching array $M$ of length $m$ to record the unique mapping of each element of the joint angle feature sequence $P$ under evaluation to the standard joint angle feature sequence $Q$. Iterate over the best matching path $W$ obtained in the previous step; if an element of sequence $P$ is matched to several elements of sequence $Q$ (a one-to-many relationship), select the pair with the smallest cosine distance as the current match and store it in the matching array $M$.

4) According to the correspondences saved in the matching array $M$, the cosine similarity $CosineSim(p_m, q_n)$ between corresponding elements is computed one by one, then summed and averaged to obtain $Sim(P, Q)$:

$$Sim(P, Q) = \frac{\sum_{k=1}^{K} CosineSim(p_m, q_n)}{n_p} \tag{18}$$

where $n_p$ is the number of matched pairs stored in $M$ and

$$CosineSim(p_m, q_n) = \frac{\sum_{k=1}^{9} p_m^k q_n^k}{\sqrt{\sum_{k=1}^{9} (p_m^k)^2} \sqrt{\sum_{k=1}^{9} (q_n^k)^2}}$$

with the superscript $k$ indexing the nine joint angle components of each feature vector.

$Sim(P, Q)$ is normalized to obtain the total similarity score between sequence $P$ and sequence $Q$; the higher the score, the more similar $P$ is to $Q$. The formula is as follows:

$$Similarity(P, Q) = \frac{1}{Sim(P, Q) + 1} \tag{19}$$
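
A compact sketch of steps 3)-4) follows; `matches` stands for the one-to-one index pairs retained in the matching array M, and Eq. (19) is applied literally as printed:

```python
import numpy as np

def cosine_sim(p, q):
    """Cosine similarity between two joint angle feature vectors (Eq. (18))."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def similarity_score(P, Q, matches):
    """Mean cosine similarity over the matched pairs (i, j) kept in M,
    normalized as in Eq. (19)."""
    sim = np.mean([cosine_sim(P[i], Q[j]) for i, j in matches])
    return 1.0 / (sim + 1.0)
```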

Results of the assessment of automatic recognition of piano playing skills
Deep learning based piano playing gesture recognition results

In this paper, inertial sensor gloves and an infrared detection rod are used to collect playing data. Through simulation experiments, playing data for 10 piano pieces, including “To Alice” and “Wedding in a Dream”, are collected, with each piece played three times, and a total of 200 playing samples are obtained; these samples are divided into a training set and a test set. The action features of different fingers are extracted during playing, and the feature extraction capability for piano playing gestures is analyzed using the method presented in this paper.

Analysis of different finger movement feature extraction capabilities

The results of the analysis of different finger movement feature extraction ability are shown in Table 1. As the table shows, the method can effectively extract the playing features of each position of the hand, including the standard deviation and range of each joint during playing, and can identify the movement features of each finger without substantial error. After applying the method of this paper, hand gesture recognition extracts the changing features of each hand joint well, with an extraction accuracy above 90%.

Table 1. Accuracy of finger motion feature extraction (%)

Position and feature Thumb Index Middle Ring Little
Back of hand: Angle deviation 97.095 96.575 93.418 99.424 97.213
Back of hand: Elevation aberration 98.43 98.287 98.075 96.356 96.326
Back of hand: Angle deviation 97.733 99.254 99.382 97.693 98.146
Back of hand: Cross angle anomaly 95.92 100 99.026 100 95.432
Back of hand: Acceleration standard deviation 98.275 99.897 97.083 97.894 98.143
Back of hand: Standard deviation 96.713 99.615 99.171 95.663 99.642
Back of hand: Angular velocity 95.36 98.604 98.144 100 98.188
Lower finger joint: Angle deviation 96.081 100 98.179 99.902 96.333
Lower finger joint: Elevation aberration 98.596 100 100 100 96.736
Lower finger joint: Angle deviation 95.946 98.202 96.977 97.918 97.544
Lower finger joint: Cross angle anomaly 93.586 97.657 100 97.024 99.371
Lower finger joint: Acceleration standard deviation 95.741 97.362 98.857 97.367 97.616
Lower finger joint: Standard deviation 98.6 100 100 97.714 96.094
Lower finger joint: Angular velocity 94.91 95.79 99.519 98.123 97.329
Knuckle: Angle deviation 97.847 98.066 99.5 96.653 96.642
Knuckle: Elevation aberration 96.563 98.21 97.891 97.161 100
Knuckle: Angle deviation 96.883 98.891 97.3 96.603 98.627
Knuckle: Cross angle anomaly 98.727 99.909 97.421 99.085 98.224
Knuckle: Acceleration standard deviation 97.777 98.935 96.007 96.22 97.338
Knuckle: Standard deviation 98.94 98.383 96.957 96.978 95.886
Knuckle: Angular velocity 98.92 98.95 98.553 97.403 100
Analysis of model recognition effect

In this paper, a dynamic gesture recognition method based on feature action sequences (Method 1) and an LSTM-based CSI gesture recognition method (Method 2) are selected for comparison with the method of this paper (G-HRNet). The recognition results on the two datasets are shown in Fig. 4, in which (a) and (b) give the accuracy on the training set and the test set, respectively. As can be seen from the figure, the proposed method recognizes piano playing gestures very well, indicating that training on the dataset effectively improves its recognition accuracy; its average accuracy on the training and test sets reaches 95.09% and 95.94%, respectively. The recognition performance of Method 2 remains at a much lower level, with average accuracies of 70.47% and 69.94% on the training and test sets. Although the recognition performance of Method 1 (73.09% and 73.51%) is slightly higher than that of Method 2, it is still poorer than that of this paper. The method of this paper outperforms Method 1 and Method 2 under different sample sizes, which indicates that it effectively improves the recognition accuracy of piano playing gestures.

Figure 4.

Recognition results on the two datasets

Analysis of the results of hand changes in piano playing

The method of this paper is applied to analyze gesture recognition during the test-set performance of “To Alice”, tracking the dynamic change of the pitch angle of each joint of each finger. The change curves for the playing hand are shown in Fig. 5. It can be observed that the method effectively recognizes the gesture data of piano playing and clearly expresses the dynamics of the hand. The fluctuation of the pitch angle of the upper finger joints at different times is clearly visible, reflecting the player’s finger movements over time, and each joint can be recognized in detail. The method presented in this paper thus effectively recognizes the fluctuation of hand gestures during piano playing, and the recognition results are very clear.

Figure 5.

The curve of the piano playing hand changes

Finger dexterity and span analysis
Finger dexterity analysis

The distribution statistics of continuous finger keystroke times are shown in Fig. 6. Most subjects in the experimental group have continuous keystroke times between 1051 ms and 1948 ms; very few have times within 1110 ms, which is regarded as very flexible, and very few have times greater than 1727 ms, which is regarded as not flexible enough. Due to the gravity rebound speed of the test device, the lower limit of the continuous keystroke time is about 900 ms. Overall, the distribution of continuous keystroke times follows a normal trend; a K-S normality test confirms that the distribution can be considered normal with mean $\mu = 1438.74$ ms and standard deviation $\sigma = 110.37$ ms. Based on the statistical distribution of continuous keystroke times in the experimental group and the scoring standards used for difficulty characteristics in track events, $\mu - 2.5\sigma$, i.e., 1073 ms, is set as the 100-point mark and $\mu + 2\sigma$, i.e., 1840 ms, as the 50-point mark. Substituting into the cumulative scoring formula gives the following system of equations:

$$\begin{cases} 100 = k \cdot 7.5^2 + Z \\ 50 = k \cdot 3^2 + Z \end{cases}$$

Solving this system of equations gives $k = 1.06$ and $Z = 40.46$. The finger dexterity score is calculated as follows:

$$y_i = 1.06 \times \left[ 5 - (t_i - 1469)/180 \right]^2 + 40.46, \quad 1 \le i \le 10$$

where $t_i$ is the continuous keystroke time of the finger numbered $i$ and $y_i$ is its dexterity score; in particular, when $t_i$ is greater than 2041 ms, the score is directly assessed as 0 points.
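
The scoring rule can be written directly as a small function; the constants are those of the formula above:

```python
def dexterity_score(t_ms):
    """Finger dexterity score from the formula above; t_ms is a finger's
    continuous keystroke time in milliseconds. Times beyond 2041 ms are
    assessed as 0 points, per the text."""
    if t_ms > 2041:
        return 0.0
    return 1.06 * (5 - (t_ms - 1469) / 180) ** 2 + 40.46

# Example: dexterity_score(1438.74) gives roughly 69 points at the group mean.
```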

Figure 6.

The distribution of finger continuous keystroke time

Table 2 shows the statistics of continuous keystroke time and dexterity of each finger in the experimental group. For clarity, the finger numbering is explained here: the symbols L and R denote the left and right hands respectively, and the subscripts 1-5 denote the thumb, index finger, middle finger, ring finger, and little finger respectively. Since the experimental subjects are all right-handed, the flexibility scores of the right-hand fingers are significantly higher than those of the left hand. The index finger of each hand is the most flexible and has the best independence; the ring finger, constrained by the little and middle fingers, is the least flexible; and the little finger has the largest standard deviation. The scoring results agree with reality and reflect the degree of flexibility of each finger well.

Table 2. Continuous keystroke time and flexibility score of each finger

Finger number Mean (ms) Standard deviation (ms) Average flexibility score
L5 1522.198 172.484 61.21
L4 1638.956 166.778 53.77
L3 1509.271 155.468 66.4
L2 1476.312 177.393 69.31
L1 1518.939 174.481 59.66
R1 1404.546 178.336 73
R2 1314.255 139.574 78.32
R3 1360.534 132.529 71.62
R4 1484.587 156.215 71.8
R5 1422.542 160.372 68.85
Finger Span Analysis

Tables 3 and 4 show examples of the left- and right-hand spans of the same subject in the experimental group, obtained after processing the data collected from the test platform. The finger spans are expressed as piano interval differences to facilitate the later calculation of distance fit. It can be seen that the spans of the individual fingers differ, so it is necessary to measure finger span.

Table 3. Example of left-hand finger spans

Finger number L5 L4 L3 L2 L1
L5 -- 5.55 7.05 8.05 11.55
L4 5.55 -- 4.05 7.05 10.55
L3 7.05 4.05 -- 5.55 9.05
L2 8.05 7.05 5.55 -- 8.55
L1 11.55 10.55 9.05 8.55 --

Table 4. Example of right-hand finger spans

Finger number R1 R2 R3 R4 R5
R1 -- 8.55 9.55 10.55 11.55
R2 8.55 -- 5.55 7.05 9.05
R3 9.55 5.55 -- 4.05 5.55
R4 10.55 7.05 4.05 -- 4.55
R5 11.55 10.55 9.05 8.55 --
Piano fingering audio-video evaluation experiments

Based on the audio-video piano fingering assessment scale developed in this paper, experiments were conducted with the proposed fingering assessment method based on the G-HRNet hand posture estimation model, using the dataset constructed in this paper.

Fingering Video Evaluation Experiments

Using the dataset constructed in this paper, test experiments were conducted on the a priori knowledge-based piano fingering evaluation scheme; the corresponding video evaluation results are shown in Table 5. Each fingering subset includes the following video samples: correct fingering, wrong hand shape, wrong fingering, deviated key direction, collapsed palm, unstable hand, and curled fingers. Under the same conditions, fingerings played by a single finger (hook, support, and wipe) are easier to assess than fingerings played by two fingers simultaneously (big pinch and little pinch), so the assessment accuracy of the former is slightly higher than that of the latter.

Table 5. Video assessment experiment results

Subset Hook Support Wipe Big pinch Little pinch
Accuracy rate (%) 95.55 96.77 97.53 90.49 85.42
Fingering Audio Comparison Experiments

Fig. 7 and Fig. 8 show the time-domain waveform and the frequency-domain results of chromatic order 4 (fa), respectively, where Fig. 8(a) shows the harmonic-containing information and Fig. 8(b) shows the fundamental frequency. The figures clearly show the fundamental frequency information.

Figure 7.

Time-domain waveform of chromatic order 4 (fa)

Figure 8.

Frequency-domain results of chromatic order 4 (fa)

For fingering audio, frequency-domain information is obtained by the Fourier transform. The measured frequency is consistent with the 783 Hz standard pitch of the middle fa in the key of D major, so this chromatic tone is correctly pitched, which indicates that the key-press strength of the left hand is reasonable; the evaluation result is therefore that the audio of this fingering is correct.
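
A minimal sketch of this intonation check: the fundamental is estimated as the strongest FFT peak, which assumes the fundamental dominates the spectrum (a real system would need harmonic handling), and compared with the expected standard pitch:

```python
import numpy as np

def fundamental_hz(samples, sample_rate):
    """Estimate the fundamental frequency of a fingering audio clip as the
    strongest FFT peak (e.g. ~783 Hz for the tone discussed in the text)."""
    windowed = samples * np.hanning(len(samples))     # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def pitch_correct(samples, sample_rate, expected_hz, tol_hz=5.0):
    """Evaluate intonation: within tolerance of the standard pitch -> correct."""
    return abs(fundamental_hz(samples, sample_rate) - expected_hz) <= tol_hz
```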

The results of the audio comparison experiment are shown in Table 6. Each fingering subset consists of the following audio samples: correct pitch, wrong key played, and deviation in chromatic pitch. Intonation is crucial when learning a musical instrument, and for audio assessment, pitch comparison is simple and effective, achieving higher accuracy than video assessment.

Table 6. Audio comparison experiment results

Subset Hook Support Wipe Big pinch Little pinch
Accuracy rate (%) 99.02 99.97 98.62 96.03 97.46
Conclusion

In this paper, building on the HRNet model for human hand pose estimation, a series of lightweight modifications is applied to the model structure, and the G-HRNet model is proposed and constructed. The DTW algorithm is used to determine the similarity between the player’s hand movements and standard movements during piano playing and to verify the model’s recognition performance. The results show that the G-HRNet model can effectively extract the angular features of the back of the player’s hand and of the lower and upper finger joints; compared with other methods, it achieves higher recognition accuracy and can effectively recognize the changes in the pitch angle of the upper and lower finger joints at different times. In addition, the proposed system based on automatic recognition of hand posture in piano playing can simply and effectively measure finger flexibility and span, providing a data measurement basis for the recognition of piano playing skills.
