Open Access

Research on automatic identification and evaluation method of piano playing skills based on convolutional neural network

  
21 March 2025


Introduction

In the teaching of piano performance in colleges and universities, the words “performance” and “playing” reveal the dual requirements of musical expressiveness and technical skill, respectively [1]. Skill and technique form the basis of piano performance, and students must be technically proficient, while musical expression is the soul of the performance, giving emotion and depth to the work so that each performance has a unique personality and expression [2-3]. However, current piano performance teaching in colleges and universities often favors the cultivation of students’ technical skills and neglects the importance of musical expression, which not only restricts the breadth and depth of students’ artistic expression but also limits their ability to fully understand musical performance [4-6].

With the continuous development of music education, the piano, as an important tool for musical expression, has drawn growing attention to the relationship between playing technique and musical expression [7]. Traditional music education tends to focus on the training of technique while paying relatively little attention to musical expression [8]. However, as musical expression becomes increasingly important in performance, how to effectively combine technique and expression has become a hot research topic. Influenced by the Internet, music education has also gradually entered a new stage of development [9-10].

Internet-based music education breaks through the limitations of traditional education: by making full use of the network, it allows students to learn at any place and time, while its rich learning resources also mobilize students’ motivation and interest in learning [11-13]. As an important part of the Internet, the mobile Internet can provide people with more complete network services on mobile terminals [14].

Music performance has always been an important subject in the music education curriculum, and innovative, reliable, credible, and effective music performance assessment is an important part of the success of music education [15-16]. The potential problems of music performance assessment are closely related to the characteristics of music performance itself. For each musical work, different musicians will differ in timbre, intensity, interpretation, and so on. Thus, different musical performances may require different assessment methods [17-19].

In this paper, a hand gesture recognition system based on G-HRNet is constructed for piano playing techniques. The system processes real-time camera data and recognizes the joint points of the hand using target detection and gesture recognition. On this basis, the high-resolution network HRNet is introduced; the Ghost module and the DFC attention mechanism are incorporated into this network model, the lightweight G-Block module is constructed, and the lightweight high-resolution network model G-HRNet is proposed. The human joint angle feature is combined with the DTW algorithm to assess the similarity between the hand movements during piano playing and the standard playing movements. Finally, the application effect of this paper’s algorithm is verified on public datasets.

Automatic estimation model of piano playing hand posture based on G-HRNet
Deep learning based piano hand fingering recognition system
General system architecture

Firstly, the system acquires the corresponding piano playing video in real time through the camera; after image processing, the video is sent to the algorithm recognition module for playing error recognition, hand shape error recognition, and scoring. Finally, the recognized video is displayed on the front end through WebSocket communication. The algorithm recognition module counts the number of playing errors and hand shape errors and displays the errors and the overall score on the front end using HTTP communication. The overall framework of the system is shown in Figure 1. Algorithm recognition comprises data acquisition, data labeling, model training, and algorithm deployment. The YOLOv3 target detection algorithm is used for model training, and algorithm deployment integrates the trained model into the system and deploys it on the server.

Figure 1.

System framework

System Functional Module Design

The deep learning-based piano hand fingering recognition system designed in this paper can recognize and correct the fingerings played by practitioners. The system composition is shown in Figure 2. AI technology simulates an accompanying trainer to guide students’ piano practice: image and video recognition algorithms accurately detect the hand shape during playing and correct students’ hand shape and fingering errors in real time. The system mainly realizes the functions of image acquisition, gesture recognition, and recognition and comparison.

Figure 2.

System module

Algorithm Architecture Design

The key algorithm for hand recognition in this project is composed of two primary components: the target detection algorithm and the pose estimation algorithm. The target detection algorithm mainly uses the YOLOv3 algorithm [20] to detect the position of the hand in each key frame. YOLOv3 employs a single neural network for target detection, dividing the image into regions and predicting the probability and bounding box for each region. The block diagram of the scoring algorithm of this system is shown in Fig. 3. The algorithm is divided into two main steps: first, a series of candidate regions is generated on the image according to certain rules, and these candidate regions are labeled by their positional relationship with the ground-truth box; second, image features are extracted using a convolutional neural network to predict the location and category of the candidate regions. Two functions are then built on this basis: the first uses the YOLOv3 algorithm to judge whether the hand shape in the current key frame is erroneous; the second, after obtaining the position coordinates of the hand in the current key frame, predicts the hand key points using the HRNet pose estimation algorithm. The obtained hand keypoint coordinates are stored in an array, and the DTW algorithm [21] is used to compare the player’s performance with standard data recorded by a piano teacher and to score the performance. The final output is the number of incorrect hand shapes and the performance score.

Figure 3.

Scoring algorithm block diagram
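
To make the data flow concrete, the sketch below chains the stages of Fig. 3 in Python. The names hand_detector, pose_estimator, and teacher_track are hypothetical placeholders, not the system's actual API, and dtw_distance refers to the routine sketched in the DTW section later in this paper:

```python
import numpy as np

def score_performance(frames, hand_detector, pose_estimator, teacher_track):
    """Hypothetical sketch of the Fig. 3 pipeline: per-frame hand detection
    (YOLOv3-style), keypoint estimation (HRNet-style), then a DTW comparison
    of the whole keypoint track against the teacher's standard data."""
    hand_shape_errors = 0
    keypoint_track = []
    for frame in frames:
        box, shape_ok = hand_detector(frame)       # bounding box + hand-shape check
        if not shape_ok:
            hand_shape_errors += 1                 # count erroneous hand shapes
        keypoint_track.append(pose_estimator(frame, box))  # 21 hand keypoints
    # dtw_distance as sketched in the DTW section below
    score = dtw_distance(np.asarray(keypoint_track), np.asarray(teacher_track))
    return hand_shape_errors, score
```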

Target Detection Algorithm Training

This project uses the YOLOv3 target detection algorithm for model training. The officially provided default weights file is used for initialization, and the model’s label types and related information are set. The labeled dataset is fed into the network for data preprocessing, the network is trained on the images in the dataset to obtain its output, and finally the trained model is used for target prediction.

Lightweight G-HRNet model construction
Introduction to the High Resolution Network HRNet

Prior to the advent of the high-resolution network HRNet [22], network models for pose estimation were based on recovering high-resolution feature maps from low-resolution ones. However, this approach loses a large amount of valid information in the repeated down- and up-sampling. To improve joint detection accuracy, such networks are made ever deeper and wider, and even the resolution of the input image is increased, so existing high-precision models have large numbers of parameters and heavy computation and are difficult to use when computational resources are limited. In this paper, we therefore propose the lightweight network G-HRNet, a lightweight improvement of HRNet.

G-Block

Deploying neural networks on embedded devices is difficult due to memory and computational power limitations. The operation of an arbitrary convolutional layer that generates $n$ feature maps is shown in Equation (1):

$$Y = X * f + b \tag{1}$$

where $*$ denotes the convolution operation and $b$ the bias term; $Y \in \mathbb{R}^{h \times w \times n}$ is the output feature map with $n$ channels, with $h$ and $w$ its height and width; $f \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution filter of this layer, and $k \times k$ is the kernel size of the filter $f$.

Assume that $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h \times w \times m}$ are generated by an initial convolution, as shown in Equation (2):

$$Y' = X * f' \tag{2}$$

The convolution filter is $f' \in \mathbb{R}^{c \times k \times k \times m}$ with $m \le n$; the other hyperparameters, such as convolution kernel size, stride, and spatial size, are consistent with those of the original convolution.

In order to obtain the required $n$ feature maps, a series of cheap linear operations is applied to the intrinsic feature maps $Y'$ to generate $s$ “Ghost” feature maps each, as shown in Equation (3):

$$y_{ij} = \Phi_{ij}(y'_i), \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, s \tag{3}$$

where $y'_i$ is the $i$-th intrinsic feature map in $Y'$ and $\Phi_{ij}$ is the $j$-th linear operation generating the Ghost feature map $y_{ij}$. The Ghost module contains one identity mapping and $m(s-1) = \frac{n}{s}(s-1)$ linear operations, so in theory the Ghost module yields the same number of feature maps as the standard convolution. Its theoretical speed-up ratio $r_s$ and compression ratio $r_c$ relative to the standard convolution can be deduced as:

$$r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s-1}{s} \cdot d \cdot d} \approx s \tag{4}$$

$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{5}$$

where $d \times d$ is the kernel size of the linear operations and $h'$, $w'$ are the height and width of the output feature maps.
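
As a concrete illustration, below is a minimal PyTorch sketch of a Ghost module consistent with Eqs. (1)-(5). Treating depthwise $d \times d$ convolutions as the cheap linear operations $\Phi_{ij}$ is an assumption of this sketch (a common choice), as is $s = 2$:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a small primary convolution produces
    m = n/s intrinsic maps; cheap depthwise convolutions generate the
    remaining "ghost" maps (Eq. (3)); the two sets are concatenated."""
    def __init__(self, in_ch, out_ch, kernel_size=1, s=2, d=3, relu=True):
        super().__init__()
        self.out_ch = out_ch
        init_ch = out_ch // s                       # m intrinsic channels
        cheap_ch = init_ch * (s - 1)                # m(s-1) ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )
        # depthwise d x d convolutions play the role of the linear ops Phi_ij
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, d, padding=d // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )

    def forward(self, x):
        y_intrinsic = self.primary(x)
        y_ghost = self.cheap(y_intrinsic)
        out = torch.cat([y_intrinsic, y_ghost], dim=1)
        return out[:, : self.out_ch]                # trim if out_ch % s != 0
```

With $s = 2$, half of the output channels come from the primary convolution and the rest from the cheap branch, which is the source of the roughly $s$-fold savings in Eqs. (4)-(5).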

Given an input feature $Z \in \mathbb{R}^{H \times W \times C}$, it can be treated as $HW$ tokens $z_i \in \mathbb{R}^{C}$, i.e., $Z = \{z_{11}, z_{12}, \ldots, z_{HW}\}$. One way to implement an attention map with a fully connected (FC) layer is:

$$a_{hw} = \sum_{h', w'} F_{hw, h'w'} \odot z_{h'w'} \tag{6}$$

where $\odot$ denotes the element-wise operation, $F \in \mathbb{R}^{HW \times H \times W}$ is the learnable weight, and $A = \{a_{11}, a_{12}, \ldots, a_{HW}\}$ is the resulting attention map. In computing the attention value $a_{hw}$ for each location, information from all locations is fused, so global information is captured by combining all tokens with learnable weights. However, feature maps in CNNs are usually low-rank, so it is not actually necessary to densely connect all input and output tokens across spatial locations. The attention can therefore be factorized as:

$$a'_{hw} = \sum_{h'=1}^{H} F^{H}_{h, h'w} \odot z_{h'w}, \quad h = 1, 2, \ldots, H, \; w = 1, 2, \ldots, W \tag{7}$$

$$a_{hw} = \sum_{w'=1}^{W} F^{W}_{w, hw'} \odot a'_{hw'}, \quad h = 1, 2, \ldots, H, \; w = 1, 2, \ldots, W \tag{8}$$

where $F^H$ and $F^W$ are learnable weights and the input is the original feature $Z$. Equations (7) and (8) act on the features sequentially, capturing long-range dependencies along the two directions respectively; this is the decoupled fully connected (DFC) attention mechanism. DFC attention captures dependencies between pixels at distant spatial locations, which greatly enhances the expressive power of the lightweight model. It decomposes the FC layer into a horizontal FC and a vertical FC, each with a large receptive field along its direction. Adding this computationally efficient and easily deployed attention mechanism to lightweight networks achieves a better balance between accuracy and speed.
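
A minimal sketch of DFC attention follows. The horizontal and vertical FC layers of Eqs. (7)-(8) are approximated here with depthwise strip convolutions; the kernel length of 5, the 1x1 pre-projection, and the sigmoid gating are assumptions of this sketch, not specifics from the text:

```python
import torch.nn as nn

class DFCAttention(nn.Module):
    """Sketch of decoupled fully connected (DFC) attention: strip-shaped
    depthwise convolutions stand in for the horizontal FC (Eq. (8)) and
    vertical FC (Eq. (7)), applied sequentially to form an attention gate."""
    def __init__(self, channels, kernel_len=5):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            # horizontal branch: aggregates along the width of each row
            nn.Conv2d(channels, channels, (1, kernel_len),
                      padding=(0, kernel_len // 2), groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            # vertical branch: aggregates along the height of each column
            nn.Conv2d(channels, channels, (kernel_len, 1),
                      padding=(kernel_len // 2, 0), groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)     # re-weight features with the attention map
```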

The G-Block consists of three Ghost convolutions. The first Ghost convolution serves as an expansion layer that increases the number of feature channels and is followed by the ReLU activation function to speed up training. The second and third Ghost convolutions reduce the number of channels back to the original number, and batch normalization (BN) is added after the last Ghost convolution. Finally, following the principle of residual networks, the input is split into a main branch and a residual branch, and input and output are connected with a shortcut, which prevents the loss of information.
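
Putting the two pieces together, a G-Block might look like the following sketch, built from the GhostModule and DFCAttention sketches above. The channel schedule of the three Ghost convolutions and the placement of the DFC attention are assumptions of this sketch:

```python
import torch.nn as nn

class GBlock(nn.Module):
    """Sketch of the G-Block: Ghost-convolution expansion with ReLU, two
    Ghost convolutions that bring the channel count back down, BN after
    the last one, a DFC attention gate, and a residual shortcut."""
    def __init__(self, channels, expand_ratio=2):
        super().__init__()
        mid = channels * expand_ratio
        self.g1 = GhostModule(channels, mid, relu=True)     # expansion layer
        self.attn = DFCAttention(mid)                       # long-range context
        self.g2 = GhostModule(mid, channels, relu=False)    # reduce channels
        self.g3 = GhostModule(channels, channels, relu=False)
        self.bn = nn.BatchNorm2d(channels)                  # BN after last Ghost conv

    def forward(self, x):
        out = self.g3(self.g2(self.attn(self.g1(x))))
        return self.bn(out) + x                             # shortcut keeps information
```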

G-HRNet network structure

The original HRNet is made lightweight. The improvement has two aspects: first, Ghost convolution [23] and the DFC attention mechanism are used to construct a lightweight residual module; second, the parallel subnetworks of the original HRNet are streamlined, which reduces the number of model parameters. The G-HRNet network consists of three stages. It is composed of parallel connected subnetworks whose resolution decreases gradually from top to bottom, and multi-scale feature fusion is carried out between features of different resolutions: low-resolution features are restored to high resolution through upsampling and fused with the high-resolution features, while high-resolution features are reduced to lower resolution through downsampling and fused with the low-resolution features, so that feature fusion proceeds continuously.

DARK encoding and decoding methods

The heatmap regression approach DARK utilizes a Taylor expansion of the Gaussian distribution to mitigate the quantization error in heatmap regression. DARK decoding proceeds as follows:

Assume that the predicted heatmap is completely error-free, i.e., it equals the constructed heatmap label:

$$heatmap(x, y) = e^{-\frac{(x - \mu_x)^2 + (y - \mu_y)^2}{2\sigma^2}} \tag{9}$$

where $heatmap$ is the predicted heatmap, $(x, y)$ is the coordinate of a pixel on the heatmap, $\mu = (\mu_x, \mu_y)$ is the true coordinate of the key point, and $\sigma$ is a constant.

In the first step, the logarithm of the predicted heatmap is taken; $P$ denotes the logarithmic heatmap:

$$P(x, y) = -\frac{(x - \mu_x)^2 + (y - \mu_y)^2}{2\sigma^2} \tag{10}$$

In the second step, the logarithmic heatmap is Taylor-expanded at its maximum point $m$; the value of the logarithmic heatmap at the true keypoint coordinate $\mu = (\mu_x, \mu_y)$ can then be written as follows, where $D'(m)$ and $D''(m)$ are the first- and second-order derivatives of the logarithmic heatmap at the maximum point $m$:

$$P(\mu) = P(m) + D'(m)(\mu - m) + \frac{1}{2} D''(m)(\mu - m)^2 \tag{11}$$

In the third step, both sides of the above equation are differentiated with respect to $\mu$:

$$D'(\mu) = 0 + D'(m) + \frac{1}{2} D''(m)(2\mu - 2m) \tag{12}$$

In the fourth step, since $\mu$ is the maximum of $P$, we have $D'(\mu) = 0$; substituting this into Eq. (12) and simplifying yields Eq. (13):

$$\mu = m - \left( D''(m) \right)^{-1} D'(m) \tag{13}$$
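
As a sketch of this decoding step, the derivatives in Eq. (13) can be approximated by finite differences on the log-heatmap. The boundary guard and the finite-difference stencils below are implementation choices, not taken from the text:

```python
import numpy as np

def dark_decode(heatmap, eps=1e-10):
    """Sketch of DARK sub-pixel decoding for one 2-D heatmap: take the log,
    then refine the integer argmax m with the Taylor step of Eq. (13),
    mu = m - D''(m)^{-1} D'(m)."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return np.array([x, y], dtype=float)   # no room for finite differences
    P = np.log(np.maximum(heatmap, eps))
    # first derivatives (gradient) at m
    dx = 0.5 * (P[y, x + 1] - P[y, x - 1])
    dy = 0.5 * (P[y + 1, x] - P[y - 1, x])
    # second derivatives (Hessian) at m
    dxx = P[y, x + 1] - 2 * P[y, x] + P[y, x - 1]
    dyy = P[y + 1, x] - 2 * P[y, x] + P[y - 1, x]
    dxy = 0.25 * (P[y + 1, x + 1] - P[y + 1, x - 1]
                  - P[y - 1, x + 1] + P[y - 1, x - 1])
    H = np.array([[dxx, dxy], [dxy, dyy]])
    grad = np.array([dx, dy])
    if abs(np.linalg.det(H)) < eps:
        return np.array([x, y], dtype=float)
    offset = -np.linalg.solve(H, grad)          # mu - m = -H^{-1} grad
    return np.array([x, y], dtype=float) + offset
```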

Hand Posture Estimation Model Based on G-HRNet

In this paper, we build a hand pose estimation model based on the G-HRNet network. It takes a color hand image as input; the Stem stage downsamples the input image by a factor of 4, and the resulting hand feature maps are fed into the first-stage subnetwork. The model then enters the second stage: the high-resolution feature map is downsampled to produce a feature map at half its resolution for feature extraction. Next, in the third-stage subnetwork, the second-stage feature maps are downsampled to generate the third-stage feature maps. This process is repeated, and finally the feature maps generated by the three subnetworks are fused and output at the size of the highest-resolution feature map, i.e., 64×64. After a final convolution producing 21 channels, 21 joint-point heatmaps are output, and the final hand pose joint detection results are obtained after DARK heatmap decoding.

DTW-based piano fingering evaluation

In order to obtain more reliable similarity assessments, the problem of time-axis differences must be overcome when calculating the similarity between two human joint angle sequences: the two joint angle feature vector sequences to be assessed must be matched and aligned on the time axis. The dynamic time warping algorithm is an effective way to solve this problem. It warps the time series by stretching or shortening them along the time axis, thus enabling similarity calculation between time series.

Dynamic Time Warping (DTW) Algorithm

The DTW algorithm effectively solves a problem of traditional Euclidean distance matching for time series of unequal lengths, which cannot be stretched or compressed on the time axis for similarity comparison of, e.g., speech signal feature parameter sequences. The core idea of the DTW algorithm is to search, by dynamic programming, for the optimal temporal match between the elements of two time series; DTW allows one-to-many or many-to-one mappings between elements, lengthening or shortening the series to achieve alignment and evaluation on the time axis. The DTW algorithm is widely applied in fields that can be cast as linear sequence analysis, such as human movement recognition and analysis, DNA sequence alignment, and image processing.

There are two time series $X = \{X(1), X(2), \cdots, X(m)\}$ and $Y = \{Y(1), Y(2), \cdots, Y(n)\}$ of lengths $m$ and $n$, where $X(i)$ and $Y(j)$ denote the feature vectors within the two sequences. The goal of the DTW algorithm is to find the matching that minimizes the cumulative distortion distance between the two sequences. To find the optimal matching path of the two sequences intuitively, an $m \times n$ matrix $A$ is constructed, in which the element $A_{i,j}$ in row $i$ and column $j$ denotes the distance $D(X(i), Y(j))$ between point $X(i)$ of sequence $X$ and point $Y(j)$ of sequence $Y$, generally the squared Euclidean distance, i.e., $D(X(i), Y(j)) = (X(i) - Y(j))^2$. Element $A_{i,j}$ thus represents the alignment of point $X(i)$ with point $Y(j)$. A path $W$ from $(1, 1)$ to $(m, n)$ is found in matrix $A$ by dynamic programming (DP) to characterize the direct mapping between sequence $X$ and sequence $Y$. $W$ consists of contiguous elements of matrix $A$ such that the sum of all matrix elements along the path is minimized; the lattice points the path passes through are the aligned element pairs of sequences $X$ and $Y$, and the sum of the matrix elements on the path is the DTW minimum matching distance. Defining the $k$-th element of the path as $w_k = (i, j)_k$, $W$ can be expressed as:

$$W = \{w_1, w_2, \cdots, w_K\}, \quad \max(m, n) \le K \le m + n - 1 \tag{14}$$

The DTW algorithm adds several constraints when searching for the optimal matching path, namely:

1) Boundary condition: $w_1 = (1, 1)$ and $w_K = (m, n)$. The optimal matching path $W$ must start from the lower-left corner of matrix $A$ and end at the upper-right corner, with $\max(m, n) \le K \le m + n - 1$.

2) Continuity: each step of the optimal matching path $W$ may only move to an element adjacent to the current one. That is, if $w_{k-1} = (a', b')$, the next point $w_k = (a, b)$ of the path must satisfy $a - a' \le 1$ and $b - b' \le 1$.

3) Monotonicity: $a$ and $b$ in $w_k = (a, b)$ must be non-decreasing along the optimal matching path $W$. That is, if $w_{k-1} = (a', b')$, the next point $w_k = (a, b)$ of the path must satisfy $a - a' \ge 0$ and $b - b' \ge 0$.

Among the paths satisfying the above constraints, the one with the smallest distortion distance is the desired optimal matching path:

$$DTW(X, Y) = \min\left\{ \sum_{k=1}^{K} w_k \Big/ K \right\} \tag{15}$$

The denominator $K$ compensates for warping paths of different lengths, so that sequences of unequal length can be stretched or compressed. The cumulative distortion distance $\beta(i, j)$ is obtained by the DP recursion shown in Eq. (16):

$$\beta(i, j) = D(X(i), Y(j)) + \min\{\beta(i-1, j-1), \beta(i, j-1), \beta(i-1, j)\} \tag{16}$$
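
A minimal NumPy sketch of the recursion in Eq. (16) follows; normalizing by m + n instead of the exact path length K is a simplification of Eq. (15):

```python
import numpy as np

def dtw_distance(X, Y):
    """Minimal DTW sketch following Eq. (16): beta(i, j) accumulates the
    local distance D plus the cheapest of the three admissible predecessors.
    X and Y are arrays of joint angle feature vectors, one row per frame."""
    m, n = len(X), len(Y)
    beta = np.full((m + 1, n + 1), np.inf)
    beta[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])   # local distance D(X(i), Y(j))
            beta[i, j] = d + min(beta[i - 1, j - 1],  # diagonal match
                                 beta[i, j - 1],      # stretch X
                                 beta[i - 1, j])      # stretch Y
    return beta[m, n] / (m + n)   # simplified normalization of Eq. (15)
```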

Algorithm implementation

Let the human joint angle feature sequences of the two action videos to be evaluated be $P = \{p_1, p_2, p_3, \cdots, p_m\}$ and $Q = \{q_1, q_2, q_3, \cdots, q_n\}$, where $p_m$ is the joint angle feature vector of frame $m$ of the first action video and $q_n$ is the joint angle feature vector of frame $n$ of the second action video. The specific steps of the human action similarity evaluation algorithm based on joint angle features and DTW are as follows:

1) Based on the three constraints described in the previous subsection, search from $w_1 = (1, 1)$ to $w_K = (p_m, q_n)$ for the optimal matching distance $\beta(p_m, q_n)$:

$$\beta(p_m, q_n) = dist(p_m, q_n) + \min\{\beta(p_{m-1}, q_n), \beta(p_m, q_{n-1}), \beta(p_{m-1}, q_{n-1})\} \tag{17}$$

2) Repeat the previous step until both joint angle feature sequences have been fully traversed, i.e., $i = m$ and $j = n$, to obtain the complete best matching path $W = \{w_1, w_2, w_3, \cdots, w_K\}$.

3) Create a sequence element matching array $M$ of length $m$ to record the unique mapping of each element of the joint angle feature sequence $P$ under evaluation to the standard joint angle feature sequence $Q$. Iterate over the best matching path $W$ obtained in the previous step; if an element of sequence $P$ is matched to several elements of sequence $Q$ (a one-to-many relationship), select the pair with the smallest cosine distance as the current match and store it in the matching array $M$.

4) According to the correspondences saved in the matching array $M$, the cosine similarity $CosineSim(p_m, q_n)$ between corresponding elements is computed one by one, then summed and averaged to obtain $Sim(P, Q)$:

$$Sim(P, Q) = \frac{\sum_{k=1}^{K} CosineSim(p_m, q_n)}{n_p} \tag{18}$$

where $n_p$ is the number of matched pairs stored in $M$ and

$$CosineSim(p_m, q_n) = \frac{\sum_{k=1}^{9} p_m^k q_n^k}{\sqrt{\sum_{k=1}^{9} (p_m^k)^2} \sqrt{\sum_{k=1}^{9} (q_n^k)^2}}$$

with the superscript $k$ indexing the nine joint angle components of each feature vector.

$Sim(P, Q)$ is normalized to obtain the total similarity score between sequence $P$ and sequence $Q$; the higher the score, the more similar $P$ is to $Q$. The formula is as follows:

$$Similarity(P, Q) = \frac{1}{Sim(P, Q) + 1} \tag{19}$$
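
A compact sketch of steps 3)-4) follows; `matches` stands for the one-to-one index pairs retained in the matching array M, and Eq. (19) is applied literally as printed:

```python
import numpy as np

def cosine_sim(p, q):
    """Cosine similarity between two joint angle feature vectors (Eq. (18))."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def similarity_score(P, Q, matches):
    """Mean cosine similarity over the matched pairs (i, j) kept in M,
    normalized as in Eq. (19)."""
    sim = np.mean([cosine_sim(P[i], Q[j]) for i, j in matches])
    return 1.0 / (sim + 1.0)
```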

Results of the assessment of automatic recognition of piano playing skills
Deep learning based piano playing gesture recognition results

In this paper, inertial sensor gloves and an infrared detection rod are used to collect playing data. Through simulation experiments, playing data for 10 piano pieces, including “To Alice” and “Wedding in a Dream”, are collected, with each piece played three times, and a total of 200 playing samples are obtained; these samples are divided into a training set and a test set. The action features of different fingers are extracted during playing, and the feature extraction capability for piano playing gestures is analyzed using the method presented in this paper.

Analysis of different finger movement feature extraction capabilities

The results of the analysis of different finger movement feature extraction ability are shown in Table 1. As the table shows, the method can effectively extract the playing features of each position of the hand, including the standard deviation and range of each joint during playing, and can identify the movement features of each finger without substantial error. After applying the method of this paper, hand gesture recognition extracts the changing features of each hand joint well, with an extraction accuracy above 90%.

Table 1. Accuracy of finger motion feature extraction (%)

Position and feature Thumb Index Middle Ring Little
Back of hand: Angle deviation 97.095 96.575 93.418 99.424 97.213
Back of hand: Elevation aberration 98.43 98.287 98.075 96.356 96.326
Back of hand: Angle deviation 97.733 99.254 99.382 97.693 98.146
Back of hand: Cross angle anomaly 95.92 100 99.026 100 95.432
Back of hand: Acceleration standard deviation 98.275 99.897 97.083 97.894 98.143
Back of hand: Standard deviation 96.713 99.615 99.171 95.663 99.642
Back of hand: Angular velocity 95.36 98.604 98.144 100 98.188
Lower finger joint: Angle deviation 96.081 100 98.179 99.902 96.333
Lower finger joint: Elevation aberration 98.596 100 100 100 96.736
Lower finger joint: Angle deviation 95.946 98.202 96.977 97.918 97.544
Lower finger joint: Cross angle anomaly 93.586 97.657 100 97.024 99.371
Lower finger joint: Acceleration standard deviation 95.741 97.362 98.857 97.367 97.616
Lower finger joint: Standard deviation 98.6 100 100 97.714 96.094
Lower finger joint: Angular velocity 94.91 95.79 99.519 98.123 97.329
Knuckle: Angle deviation 97.847 98.066 99.5 96.653 96.642
Knuckle: Elevation aberration 96.563 98.21 97.891 97.161 100
Knuckle: Angle deviation 96.883 98.891 97.3 96.603 98.627
Knuckle: Cross angle anomaly 98.727 99.909 97.421 99.085 98.224
Knuckle: Acceleration standard deviation 97.777 98.935 96.007 96.22 97.338
Knuckle: Standard deviation 98.94 98.383 96.957 96.978 95.886
Knuckle: Angular velocity 98.92 98.95 98.553 97.403 100
Analysis of model recognition effect

In this paper, a dynamic gesture recognition method based on feature action sequences (Method 1) and an LSTM-based CSI gesture recognition method (Method 2) are selected for comparison with the method of this paper (G-HRNet). The recognition results on the two datasets are shown in Fig. 4, in which (a) and (b) give the accuracy on the training set and the test set, respectively. As can be seen from the figure, the proposed method recognizes piano playing gestures very well, indicating that training on the dataset effectively improves its recognition accuracy; its average accuracy on the training and test sets reaches 95.09% and 95.94%, respectively. The recognition performance of Method 2 remains at a much lower level, with average accuracies of 70.47% and 69.94% on the training and test sets. Although the recognition performance of Method 1 (73.09% and 73.51%) is slightly higher than that of Method 2, it is still poorer than that of this paper. The method of this paper outperforms Method 1 and Method 2 under different sample sizes, which indicates that it effectively improves the recognition accuracy of piano playing gestures.

Figure 4.

Recognition results on the two datasets

Analysis of the results of hand changes in piano playing

The method of this paper is applied to analyze gesture recognition during the test-set performance of “To Alice”, tracking the dynamic change of the pitch angle of each joint of each finger. The change curves for the playing hand are shown in Fig. 5. It can be observed that the method effectively recognizes the gesture data of piano playing and clearly expresses the dynamics of the hand. The fluctuation of the pitch angle of the upper finger joints at different times is clearly visible, reflecting the player’s finger movements over time, and each joint can be recognized in detail. The method presented in this paper thus effectively recognizes the fluctuation of hand gestures during piano playing, and the recognition results are very clear.

Figure 5.

The curve of the piano playing hand changes

Finger dexterity and span analysis
Finger dexterity analysis

The distribution statistics of continuous finger keystroke times are shown in Fig. 6. Most subjects in the experimental group have continuous keystroke times between 1051 ms and 1948 ms; very few have times within 1110 ms, which is regarded as very flexible, and very few have times greater than 1727 ms, which is regarded as not flexible enough. Due to the gravity rebound speed of the test device, the lower limit of the continuous keystroke time is about 900 ms. Overall, the distribution of continuous keystroke times follows a normal trend; a K-S normality test confirms that the distribution can be considered normal with mean $\mu = 1438.74$ ms and standard deviation $\sigma = 110.37$ ms. Based on the statistical distribution of continuous keystroke times in the experimental group and the scoring standards used for difficulty characteristics in track events, $\mu - 2.5\sigma$, i.e., 1073 ms, is set as the 100-point mark and $\mu + 2\sigma$, i.e., 1840 ms, as the 50-point mark. Substituting into the cumulative scoring formula gives the following system of equations:

$$\begin{cases} 100 = k \cdot 7.5^2 + Z \\ 50 = k \cdot 3^2 + Z \end{cases}$$

Solving this system of equations gives $k = 1.06$ and $Z = 40.46$. The finger dexterity score is calculated as follows:

$$y_i = 1.06 \times \left[ 5 - (t_i - 1469)/180 \right]^2 + 40.46, \quad 1 \le i \le 10$$

where $t_i$ is the continuous keystroke time of the finger numbered $i$ and $y_i$ is its dexterity score; in particular, when $t_i$ is greater than 2041 ms, the score is directly assessed as 0 points.
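
The scoring rule can be written directly as a small function; the constants are those of the formula above:

```python
def dexterity_score(t_ms):
    """Finger dexterity score from the formula above; t_ms is a finger's
    continuous keystroke time in milliseconds. Times beyond 2041 ms are
    assessed as 0 points, per the text."""
    if t_ms > 2041:
        return 0.0
    return 1.06 * (5 - (t_ms - 1469) / 180) ** 2 + 40.46

# Example: dexterity_score(1438.74) gives roughly 69 points at the group mean.
```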

Figure 6.

The distribution of finger continuous keystroke time

Table 2 shows the statistics of continuous keystroke time and dexterity of each finger in the experimental group. For clarity, the finger numbering is explained here: the symbols L and R denote the left and right hands respectively, and the subscripts 1-5 denote the thumb, index finger, middle finger, ring finger, and little finger respectively. Since the experimental subjects are all right-handed, the flexibility scores of the right-hand fingers are significantly higher than those of the left hand. The index finger of each hand is the most flexible and has the best independence; the ring finger, constrained by the little and middle fingers, is the least flexible; and the little finger has the largest standard deviation. The scoring results agree with reality and reflect the degree of flexibility of each finger well.

Table 2. Continuous keystroke time and flexibility score of each finger

Finger number Mean (ms) Standard deviation (ms) Average flexibility score
L5 1522.198 172.484 61.21
L4 1638.956 166.778 53.77
L3 1509.271 155.468 66.4
L2 1476.312 177.393 69.31
L1 1518.939 174.481 59.66
R1 1404.546 178.336 73
R2 1314.255 139.574 78.32
R3 1360.534 132.529 71.62
R4 1484.587 156.215 71.8
R5 1422.542 160.372 68.85
Finger Span Analysis

Tables 3 and 4 show examples of the left- and right-hand spans of the same subject in the experimental group, obtained after processing the data collected from the test platform. The finger spans are expressed as piano interval differences to facilitate the later calculation of distance fit. It can be seen that the spans of the individual fingers differ, so it is necessary to measure finger span.

Table 3. Example of left-hand finger spans

Finger number L5 L4 L3 L2 L1
L5 -- 5.55 7.05 8.05 11.55
L4 5.55 -- 4.05 7.05 10.55
L3 7.05 4.05 -- 5.55 9.05
L2 8.05 7.05 5.55 -- 8.55
L1 11.55 10.55 9.05 8.55 --

Table 4. Example of right-hand finger spans

Finger number R1 R2 R3 R4 R5
R1 -- 8.55 9.55 10.55 11.55
R2 8.55 -- 5.55 7.05 9.05
R3 9.55 5.55 -- 4.05 5.55
R4 10.55 7.05 4.05 -- 4.55
R5 11.55 10.55 9.05 8.55 --
Piano fingering audio-video evaluation experiments

Based on the audio-video piano fingering assessment scale developed in this paper, experiments were conducted with the proposed fingering assessment method based on the G-HRNet hand posture estimation model, using the dataset constructed in this paper.

Fingering Video Evaluation Experiments

Using the dataset constructed in this paper, test experiments were conducted on the a priori knowledge-based piano fingering evaluation scheme; the corresponding video evaluation results are shown in Table 5. Each fingering subset includes the following video samples: correct fingering, wrong hand shape, wrong fingering, deviated key direction, collapsed palm, unstable hand, and curled fingers. Under the same conditions, fingerings played by a single finger (hook, support, and wipe) are easier to assess than fingerings played by two fingers simultaneously (big pinch and little pinch), so the assessment accuracy of the former is slightly higher than that of the latter.

Table 5. Video assessment experiment results

Subset Hook Support Wipe Big pinch Little pinch
Accuracy rate (%) 95.55 96.77 97.53 90.49 85.42
Fingering Audio Comparison Experiments

Fig. 7 and Fig. 8 show the time-domain waveform and the frequency-domain results of chromatic order 4 (fa), respectively, where Fig. 8(a) shows the harmonic-containing information and Fig. 8(b) shows the fundamental frequency. The figures clearly show the fundamental frequency information.

Figure 7.

Time-domain waveform of chromatic order 4 (fa)

Figure 8.

Frequency-domain results of chromatic order 4 (fa)

For fingering audio, frequency-domain information is obtained by the Fourier transform. The measured frequency is consistent with the 783 Hz standard pitch of the middle fa in the key of D major, so this chromatic tone is correctly pitched, which indicates that the key-press strength of the left hand is reasonable; the evaluation result is therefore that the audio of this fingering is correct.
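
A minimal sketch of this intonation check: the fundamental is estimated as the strongest FFT peak, which assumes the fundamental dominates the spectrum (a real system would need harmonic handling), and compared with the expected standard pitch:

```python
import numpy as np

def fundamental_hz(samples, sample_rate):
    """Estimate the fundamental frequency of a fingering audio clip as the
    strongest FFT peak (e.g. ~783 Hz for the tone discussed in the text)."""
    windowed = samples * np.hanning(len(samples))     # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def pitch_correct(samples, sample_rate, expected_hz, tol_hz=5.0):
    """Evaluate intonation: within tolerance of the standard pitch -> correct."""
    return abs(fundamental_hz(samples, sample_rate) - expected_hz) <= tol_hz
```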

The results of the audio comparison experiment are shown in Table 6. Each fingering subset consists of the following audio samples: correct pitch, wrong key played, and deviation in chromatic pitch. Intonation is crucial when learning a musical instrument, and for audio assessment, pitch comparison is simple and effective, achieving higher accuracy than video assessment.

Table 6. Audio comparison experiment results

Subset Hook Support Wipe Big pinch Little pinch
Accuracy rate (%) 99.02 99.97 98.62 96.03 97.46
Conclusion

In this paper, building on the HRNet model for human hand pose estimation, a series of lightweight modifications is applied to the model structure, and the G-HRNet model is proposed and constructed. The DTW algorithm is used to determine the similarity between the player’s hand movements and standard movements during piano playing and to verify the model’s recognition performance. The results show that the G-HRNet model can effectively extract the angular features of the back of the player’s hand and of the lower and upper finger joints; compared with other methods, it achieves higher recognition accuracy and can effectively recognize the changes in the pitch angle of the upper and lower finger joints at different times. In addition, the proposed system based on automatic recognition of hand posture in piano playing can simply and effectively measure finger flexibility and span, providing a data measurement basis for the recognition of piano playing skills.
