Open Access

Efficiency of AI Technology Application in Music Education - A Perspective Based on Deep Learning Model DLMM

17 Mar 2025

Introduction

Music teaching occupies an important position in today's education system: it not only enriches students' campus life and enhances their aesthetic ability, but also provides them with diversified ways of learning and thinking [1-2]. However, many music classrooms still rely on traditional teaching methods, which often neglect students' individual differences and subjective initiative and leave the effect of music teaching unsatisfactory [3-5]. In this context, how to improve the efficiency of music teaching, stimulate students' innovative thinking, and cultivate their musical literacy has become an important issue for music educators and researchers.

In recent years, China's level of science and technology has developed rapidly, and the field of music education has gradually become more and more closely cross-fertilized with science and technology, especially artificial intelligence (AI) [6-7]. Teachers use AI technology to conduct remote interactive teaching with students, breaking the geographical limitations of the traditional music classroom [8-11]. At the same time, the instant data analysis provided by AI allows teachers to better grasp students' learning status and to teach in a more targeted way [12-15]. In the practice of AI-assisted music education, however, model performance degrades when the data distributions of the training set and test set deviate from each other [16-17]. The multimodal domain adaptation algorithm based on differential learning (DLMM) effectively exploits the variability of different modal models to perform multimodal domain adaptation and provide better multimodal decision-making for music education [18].

The article first designs a lightweight music score recognition method based on an improved CRNN. Specifically, CRNN-lite introduces residual depthwise separable convolution in the convolutional layer to reduce computation and accelerate feature-map extraction; the recurrent layer uses a bidirectional simple recurrent unit and employs parallel computation to avoid the strong-dependency problem of serial computation; and the transcription layer adjusts the parameters of the cross-entropy function to target the learning of unbalanced sample data. The article then proposes a multimodal domain adaptation algorithm based on differential learning, whose main phases include pre-training on the source domain and a prototype-based reliability metric. Furthermore, the paper explores the effect of different datasets on the performance of the proposed method through comparative experiments. Finally, the effectiveness of music teaching after applying the proposed method is analyzed.

Lightweight music score recognition method based on improved CRNN
Application of Artificial Intelligence in Music Education

Nowadays, the application of artificial intelligence technology in education has become a popular trend, and the music teaching content of colleges and universities has become more dynamic, open, and diversified with its help. On the one hand, AI technology is highly interconnected and well suited to retrieving and presenting music teaching resources, which can effectively broaden the cognitive horizons of teachers and students and enrich access to resources. Music teachers can use AI technology to obtain classic but old and hard-to-reach audio and video materials, such as rare performance recordings and detailed explanations of classic works. At the same time, AI technology can quickly retrieve teaching resources according to teachers' different needs, including different versions, different picture quality, or even a specific group's performance, thereby enhancing the relevance and effectiveness of teaching content. On the other hand, AI makes it possible to abandon fixed learning content and instead continuously adjust teaching content according to students' learning progress, learning preferences, and personality characteristics, realizing personalized and dynamic teaching. For example, music teachers can use deep learning algorithms to analyze students' learning history and behavioral patterns and automatically recommend suitable materials such as music theory tutorials, composition skills teaching, and instrumental performance guidance.

Music combines two parts, the visual score and the auditory song, and its teaching audience is usually young people, so the teaching process often needs ways of highlighting the directly visible and audible characteristics of the art of music more than traditional theory teaching does. In traditional classroom teaching, the common method is to prepare PPT courseware in Office on a PC, but music theory symbols are difficult to render and play directly on a PC, so producing such courseware is laborious, its interactivity is poor, and it is not attractive enough to young learners. With the development of technology, a number of domestic and foreign music notation programs such as Overture, Sibelius, and Pizzicato have appeared and gradually realized software-based notation and playback, but because they demand considerable specialized theoretical skill and involve cumbersome operation steps, most teachers engaged in art and music teaching are still deterred. In this paper, we adopt an improved CRNN-lite-based recognition model for score recognition in music education, which can effectively extract music scores and quickly convert musical symbols, rhythms, and melodies so that they can be played by a computer or recorded as formatted music files for various scoring software. It helps teachers and students meet their educational and creative requirements and facilitates music education.

Methodological process

Previous studies usually use deep networks to extract features in the convolutional layer and complex gating units computed serially in the recurrent layer, without additional processing of the loss function in the transcription layer, which leads to long training times and low accuracy for CRNN [19].

To address these problems, CRNN-lite introduces depthwise separable convolution with residual connections in the convolutional layer, which reduces the computational effort while improving the learning ability of the network. A simple recurrent unit (SRU) is introduced in the recurrent layer to reduce the number of gating units and speed up model training through parallel computation. Focal Loss is introduced into the CTC method of the transcription layer, improving it to Focal CTC, which equalizes the learning of samples and improves the accuracy of the model [20].

The structure of the music score recognition model of CRNN-lite is shown in Fig. 1, and the specific music score recognition process is as follows:

Convolutional layer feature extraction: the sheet music image is transformed into a single-channel grayscale map. Four depthwise separable convolution blocks perform convolution operations on the grayscale map, with residual connections over the intermediate results of the convolution process, and output the feature map.

Recurrent layer feature sequence classification: the feature map matrix is reshaped into a feature sequence of lower dimension. The feature sequence is fed into a bidirectional recurrent neural network composed of SRUs, which outputs the recognition distribution matrix of the sequence. Each frame corresponds to a one-dimensional vector whose length equals the total number of recognized note types, and each element of the vector represents the probability of the corresponding note.

Transcription layer alignment of the output sequence: the elements with the highest probability in the probability distribution matrix are recombined into a one-dimensional sequence matrix, and frames with the same semantics are aligned using CTC to remove redundant invalid frames.

Output recognition sequence: convert the processed sequence matrix into a note encoding sequence for output.

Update model parameters: the correct sequence matrix and the recognized sequence matrix are taken as the arguments of the Focal Loss function; the loss value is adjusted according to the degree of sample balance and then back-propagated to update the parameters of the CRNN network (a minimal code sketch of this pipeline follows Figure 1).

Figure 1.

Music score recognition model of CRNN-lite
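To make the five-step pipeline concrete, the following is a minimal, runnable PyTorch sketch. The channel counts, pooling layout, and vocabulary size are illustrative assumptions rather than the paper's exact configuration, and a bidirectional GRU stands in here for the bidirectional SRU (a single SRU step is sketched in a later subsection); CTC alignment and Focal Loss training are only indicated in comments.

```python
# Minimal sketch of the CRNN-lite pipeline; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DSConvBlock(nn.Module):
    """Depthwise separable convolution: depthwise conv + 1x1 pointwise conv."""
    def __init__(self, cin, cout):
        super().__init__()
        self.depthwise = nn.Conv2d(cin, cin, 3, padding=1, groups=cin)
        self.pointwise = nn.Conv2d(cin, cout, 1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class CRNNLiteSketch(nn.Module):
    def __init__(self, num_notes=64, hidden=256):
        super().__init__()
        # Step 1: convolutional feature extraction on the grayscale score image.
        self.conv = nn.Sequential(DSConvBlock(1, 32), nn.MaxPool2d(2),
                                  DSConvBlock(32, 64), nn.MaxPool2d(2))
        # Step 2: recurrent classification of the feature sequence
        # (a GRU stands in for the paper's bidirectional SRU).
        self.rnn = nn.GRU(input_size=64, hidden_size=hidden,
                          bidirectional=True, batch_first=True)
        # Per-frame distribution over all note types, plus one CTC blank.
        self.fc = nn.Linear(2 * hidden, num_notes + 1)

    def forward(self, img):                       # img: (B, 1, H, W)
        fmap = self.conv(img)                     # (B, C, H', W')
        seq = fmap.mean(dim=2).permute(0, 2, 1)   # collapse height -> (B, W', C)
        out, _ = self.rnn(seq)
        return self.fc(out).log_softmax(-1)       # per-frame log-probabilities

# Steps 3-5 (CTC alignment, decoding, Focal-Loss-weighted update) consume
# these per-frame log-probabilities, e.g. via nn.CTCLoss during training.
logits = CRNNLiteSketch()(torch.randn(2, 1, 64, 256))
print(logits.shape)  # torch.Size([2, 64, 65])
```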

Methodological design
Residual Depth Separable Convolution Based Feature Extraction

The convolutional layer’s network structure is crucial and has a direct impact on the model’s training efficiency and the upper learning limit. For sheet music images, the feature map extracted by the convolutional layer will be transformed into a sequence of features, which will then be passed into the recurrent layer for learning. Therefore, in this paper, the convolutional layer is improved to both speed up the training process and ensure the effectiveness of the extracted features.

Depthwise convolution splits the convolution kernel into single-channel form and, without changing the depth of the input feature map, performs the convolution on each channel separately, so the output feature map has the same number of channels as the input. Pointwise convolution is essentially a convolution with a 1 × 1 kernel, which performs channel up- and down-scaling on the feature map. Standard convolution and depthwise separable convolution are shown in Fig. 2 and Fig. 3.

Figure 2.

Standard convolution

Figure 3.

Depthwise separable convolution

Assume the standard convolution kernel has size $D_K \times D_K \times M$, representing the width, height, and depth of the kernel; the subscript $K$ denotes the kernel size, and since kernels usually have equal width and height, both share $D_K$. The number of parameters of $N$ standard convolutions is $D_K \times D_K \times M \times N$, and the amount of computation is $D_K \times D_K \times M \times N \times D_W \times D_H$, where the subscripts $W$ and $H$ denote the width and height of the output matrix, i.e., the number of times the kernel slides over the input matrix in the horizontal and vertical directions. Similarly, the depthwise convolution kernel has size $D_K \times D_K \times M$, the pointwise convolution kernel has size $1 \times 1 \times M$ with $N$ kernels, and the number of parameters of the depthwise separable convolution is $D_K \times D_K \times M + M \times N$. The depthwise convolution performs $D_K \times D_K \times M \times D_W \times D_H$ multiply-add operations and the $N$ pointwise convolutions perform $M \times N \times D_W \times D_H$ in total, so the amount of computation of the depthwise separable convolution is $D_K \times D_K \times M \times D_W \times D_H + M \times N \times D_W \times D_H$. The ratio of depthwise separable convolution computation to standard convolution computation is shown in Equation (1):
$$\frac{D_K \times D_K \times M \times D_W \times D_H + M \times N \times D_W \times D_H}{D_K \times D_K \times M \times N \times D_W \times D_H} = \frac{1}{N} + \frac{1}{D_K^2} \tag{1}$$
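As a quick numerical check of Eq. (1), the following sketch counts multiply-adds for both convolution types; the sizes $D_K = 3$, $M = 32$, $N = 64$, $D_W = D_H = 128$ are assumed example values, not taken from the paper.

```python
# Cost of standard vs. depthwise separable convolution, following Eq. (1).
def standard_conv_cost(dk, m, n, dw, dh):
    params = dk * dk * m * n
    flops = dk * dk * m * n * dw * dh
    return params, flops

def separable_conv_cost(dk, m, n, dw, dh):
    params = dk * dk * m + m * n                      # depthwise + pointwise
    flops = dk * dk * m * dw * dh + m * n * dw * dh   # two cheap stages
    return params, flops

dk, m, n, dw, dh = 3, 32, 64, 128, 128                # assumed example sizes
_, f_std = standard_conv_cost(dk, m, n, dw, dh)
_, f_sep = separable_conv_cost(dk, m, n, dw, dh)
print(f_sep / f_std)          # ~0.1267 of the standard cost
print(1 / n + 1 / dk ** 2)    # same value, per Eq. (1)
```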

This section combines the ideas of depthwise separable convolution and residual networks to improve the convolutional layer into a residual depthwise separable convolutional network. Unlike the ReLU6 activation function used with depthwise separable convolution in MobileNets, the depthwise separable convolution defined in this paper uses the Leaky ReLU activation function; its structure is shown in Fig. 4.

Figure 4.

Depthwise separable convolution structure

The residual depthwise separable convolutional network consists of four depthwise separable convolution blocks, which output C1, C2, C3, and C4, respectively. The original image, besides being the input to the first block, is also added via an identity mapping to the output C2 of the second block, yielding M1; M1 serves as the input to the third block and is in turn added via an identity mapping to the output C4 of the fourth block to give M2, the final output. The residual depthwise separable convolutional network structure is shown in Fig. 5 and sketched in code below.

Figure 5.

Residual depthwise separable convolutional network structure
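The following is a minimal sketch of the connectivity in Fig. 5, assuming (so that the identity additions are valid) that every block preserves the channel count and spatial size; the real network's channel configuration is not specified here and is an assumption.

```python
# Residual depthwise separable network: four DS blocks, with identity
# shortcuts input->C2 (giving M1) and M1->C4 (giving M2), as in Fig. 5.
import torch
import torch.nn as nn

def ds_block(c):
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise convolution
        nn.Conv2d(c, c, 1),                       # pointwise convolution
        nn.LeakyReLU(0.1),
    )

class ResidualDSConvNet(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.b1, self.b2, self.b3, self.b4 = (ds_block(channels) for _ in range(4))

    def forward(self, x):
        c1 = self.b1(x)
        c2 = self.b2(c1)
        m1 = x + c2        # identity mapping from the original image to C2
        c3 = self.b3(m1)
        c4 = self.b4(c3)
        return m1 + c4     # identity mapping from M1 to C4 -> final output M2

feat = ResidualDSConvNet()(torch.randn(1, 1, 64, 256))
print(feat.shape)  # torch.Size([1, 1, 64, 256])
```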

SRU-based note classification recognition

An RNN is a class of neural networks for sequence data; the feature maps extracted by the CNN from score images can be transformed into feature sequences and fed into an RNN for classification. However, during backpropagation the gradient of an RNN is dominated by nearby time steps, which makes it difficult for the model to learn long-distance dependencies and gives rise to the vanishing gradient problem.

The SRU improves the gating unit to remove the network's strong dependence on the hidden state and realizes parallel computation by obtaining the gating parameter matrices in advance, which greatly accelerates learning. The structure of the SRU at time $t$ is shown in Figure 6: $x_t$ is the input, $h_t$ is the output state, $c_t$ is the internal state, $\sigma$ is the Sigmoid activation function, $v_f$ and $v_r$ are the parameter vectors applied to $c_{t-1}$, and $g_t$ is the linear transformation of $x_t$: $g_t = W x_t$, where $W$ is a parameter matrix. $f_t$ denotes the forget gate and $r_t$ the reset gate.

Figure 6.

SRU structure

To reduce the degree of recursion, the two gates $f_t$ and $r_t$ no longer depend on the hidden state $h_{t-1}$ of the previous moment but on the internal state $c_{t-1}$; the forget gate $f_t$ is computed as in Eq. (2) and the reset gate $r_t$ as in Eq. (3), where $b_f$ and $b_r$ are the bias terms of $f_t$ and $r_t$, respectively [21]. $c_t$ synthesizes the information of the past state and the current input and reduces computation by using the Hadamard product instead of the matrix product, as defined in Equation (4):
$$f_t = \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \tag{2}$$
$$r_t = \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot g_t \tag{4}$$

$h_t$ employs a skip connection, as shown in Eq. (5), which incorporates the input $x_t$ directly into the computation in order to optimize gradient propagation, so that the gradient does not vanish as network depth increases and the propagation distance grows.

$$h_t = r_t \odot c_t + (1 - r_t) \odot x_t \tag{5}$$

The SRU recurrent network structure is shown in Fig. 7. The model contains two layers of bidirectional SRUs with 512 hidden units per layer; information is first passed in chronological order, and the last output unit then passes information in reverse chronological order. The outputs of the two bidirectional SRU layers are finally combined by dot-product computation for recognition and classification. A minimal single-step SRU cell is sketched after Fig. 7.

Figure 7.

SRU cyclic network structure

Balanced Sample Learning Based on Focal Loss

In the semantic reconstruction phase of OMR, a neural network can be trained end-to-end with the CTC loss function. CTC targets the output of the correct sequence without regard to which frames correspond to which input symbols; it only constrains the model to assemble the expected sequence. Traditional CTC does not train well on datasets that are extremely unbalanced or contain many low-frequency samples: symbols that occur rarely have little influence on the model during training, while symbols that occur frequently have a large influence. This imbalance in the dataset leads to overfitting of high-frequency notes and underfitting of low-frequency notes.

Focal Loss addresses this problem well by overcoming the overfitting and underfitting caused by dataset imbalance. Based on the focusing idea and cross-entropy, the Focal Loss function is defined in Equation (6):
$$L_{\text{Focal\_Loss}}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \tag{6}$$

Here $p_t$ is the recognition probability, $\alpha_t$ and $\gamma$ are adjustable factors, and $L_{\text{Focal\_Loss}}(p_t)$ is the loss when the recognition probability is $p_t$. In Equation (6), $p_t$ reflects how close the recognized sequence is to the real sequence: a larger $p_t$ indicates greater closeness, i.e., more accurate classification. $p_t$ also reflects the difficulty of classification: the larger $p_t$, the higher the classification confidence and the easier the sample; the smaller $p_t$, the lower the confidence and the harder the sample. $\alpha_t$ regulates the ratio between positive and negative sample losses to suppress their imbalance, and $\gamma$ controls the imbalance between easy and hard samples. $(1 - p_t)^\gamma$ is the modulating factor of the loss function: it tends to 0 for accurately classified samples ($p_t \to 1$) and to 1 for inaccurately classified samples ($p_t \to 0$). Under the combined effect of $\alpha_t$ and $\gamma$, the Focal Loss remains essentially unchanged for inaccurately classified samples and becomes smaller for accurately classified ones. Overall, Focal Loss increases the weight of inaccurately classified samples in the loss function, which biases training toward hard-to-classify samples and helps improve their accuracy.
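This behavior is easy to verify numerically. The sketch below implements Eq. (6) as a standalone function; in Focal CTC the per-sequence CTC probability would play the role of $p_t$, but that pairing is only indicated here and is not the paper's exact implementation.

```python
# Focal Loss of Eq. (6): easy samples (p_t near 1) are down-weighted,
# hard samples (p_t near 0) keep almost their full loss.
import torch

def focal_loss(p_t: torch.Tensor, alpha_t: float = 0.25, gamma: float = 2.0):
    """L = -alpha_t * (1 - p_t)^gamma * log(p_t), for p_t in (0, 1]."""
    return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t)

easy = torch.tensor(0.95)   # confidently correct: modulating factor -> 0
hard = torch.tensor(0.10)   # hard/misclassified: modulating factor -> 1
print(focal_loss(easy).item())  # ~3e-5, nearly no contribution
print(focal_loss(hard).item())  # ~0.47, dominates the batch loss
```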

Multimodal domain adaptation algorithm design based on differential learning

With the popularization and development of artificial intelligence, intelligent applications have begun to trend toward customization, and how to migrate the trained lightweight score recognition method based on improved CRNN to the application scenario of music education quickly and well, while guaranteeing accuracy, has become one of the technical difficulties of concern to the education community. Domain adaptation algorithms are an important technical route for solving such problems. Therefore, this section designs a new multimodal domain adaptation algorithm based on differential learning, which performs multimodal domain adaptation by exploiting the variability of different modal models.

Algorithmic framework for multimodal domain adaptation based on differential learning

The DLMM framework can be divided into four stages:

Pre-training on the source domain: each modal model is pre-trained with labeled source domain data and is divided into three parts, namely a feature extractor, a prototype mapper, and a classifier. The feature extractor extracts the features of the sample data, the prototype mapper further maps those features into the prototype space and learns the prototype representation of each class, and the classifier feeds the extracted features into a fully connected network and outputs the recognition results. The pre-training phase optimizes the supervised classification task and the prototype learning task synchronously in a multi-task learning mode.

Prototype-based reliability measure: the pre-trained models of each modality are migrated to the target domain, where their classifiers identify the target domain samples; the similarity between a target domain sample and the class prototype representation in the prototype space serves as a measure of the reliability of that modal model's decision. For each target domain sample, each of the M modal models produces an identification and a corresponding reliability, and the identification with the highest reliability is adopted as the pseudo-label for that sample.

Asynchronous learning: after obtaining highly reliable pseudo-labels for the target domain samples, the model of each modality uses the target domain data for incremental learning to improve its generalization beyond the source domain. In this phase, the models of different modalities first individually evaluate the learning difficulty of these incremental samples and then learn the target domain samples in order from easy to hard. Since the order of learning differs for each modal model, this process is called asynchronous learning.

Reliability-aware fusion: after multimodal asynchronous learning is complete, each modality's model has improved generalization performance on the target domain. To better fuse the decisions of the different modal models, the reliability of each model is used as the weight of its recognition result, and the decisions of all models are fused to obtain the final recognition result [22] (a toy sketch of this weighting follows).
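As a toy illustration of this last stage, the sketch below weights each modality's class scores by its decision reliability before summing; the scores and reliabilities are made-up values, not measurements from the paper.

```python
# Reliability-aware fusion: per-modality class scores weighted by reliability.
import numpy as np

def fuse(decisions, reliabilities):
    """decisions: (M, C) class scores per modality; reliabilities: (M,)."""
    w = np.asarray(reliabilities)[:, None]
    return (np.asarray(decisions) * w).sum(axis=0)   # (C,) fused scores

decisions = [[0.7, 0.2, 0.1],    # modality 1, e.g. the score image
             [0.1, 0.6, 0.3]]    # modality 2, e.g. the audio
fused = fuse(decisions, reliabilities=[0.9, 0.3])
print(fused)           # [0.66 0.36 0.18]
print(fused.argmax())  # 0 -> the more reliable modality dominates
```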

Pre-training phase on the source domain

The first stage of the multimodal domain adaptation algorithm based on differential learning is pre-training the model of each modality on source domain data. The multi-task pre-training stage is shown in Fig. 8. Each modal model consists of three parts: a feature extractor F, a prototype mapper M, and a classifier G. The feature extractor F extracts the features of the sample data, the prototype mapper M further maps those features into the prototype space and learns the prototype representation of each class, and the classifier G feeds the extracted features into a fully connected network and outputs the recognition results.

Figure 8.

Pre-training based on multi-task learning

For the supervised classification task, the source domain samples are first fed into the feature extractor to obtain feature vectors, which are then fed into the classifier to obtain the classification results; the cross-entropy function is used as the loss to optimize the feature extractor and classifier. The classification loss is shown in Equation (7):
$$L_C^m = -\sum_i Y_{S_i} \log \sigma\big(G_m(F_m(X_{S_i}^{(m)}))\big) \tag{7}$$

In Eq. (7), $F_m$ $(1 \le m \le M)$ denotes the feature extractor of modality $m$, $G_m$ its corresponding classifier, $\sigma$ the Softmax function, and $Y_{S_i}$ the one-hot label vector of the source domain sample.

The prototype learning task aims to learn a class-level vector feature representation of each modality's data, called a class prototype in this paper. The similarity between a sample vector and a class prototype is defined in Equation (8):
$$d(x, w_k^m) = e^{-\frac{\left\| M_m(F_m(x)) - w_k^m \right\|_2^2}{\gamma}} \tag{8}$$

In Eq. (8), $M_m$ denotes the prototype mapper of modality $m$, which maps the sample vector into the prototype space to measure its similarity with the class prototype $W_k^m$, and $\gamma$ is a scaling factor, set to 2 by default. As can be seen from Eq. (8), the similarity $d(x, w_k^m)$ between sample vectors and class prototypes of different modalities takes values in $[0,1]$, which ensures to a certain extent that the reliabilities of the decisions of different modalities are comparable. Furthermore, this paper uses a multilabel loss to learn the parameters of the class prototypes and prototype mappers; the prototype learning loss is shown in Equation (9):
$$L_P^m = -\sum_i \sum_{k=1}^{C} \left[ Y_{S_i,k} \log d(X_{S_i}^{(m)}, W_k^m) + (1 - Y_{S_i,k}) \log\big(1 - d(X_{S_i}^{(m)}, W_k^m)\big) \right] \tag{9}$$
where $Y_{S_i,k}$ denotes the $k$-th element of the one-hot label vector $Y_{S_i}$, a binary element of 0 or 1. When its value is 1, the label of the current sample belongs to class $k$, and the sample vector is encouraged to be similar to the class prototype $W_k^m$ of class $k$, i.e., $d(X_{S_i}^{(m)}, W_k^m)$ should be as close to 1 as possible, while being as far as possible from the prototypes of all classes other than $k$.

For the pre-training of each modal model on the source domain, this paper adopts a multi-task learning approach to optimize the supervised classification task and the prototype learning task simultaneously; the total pre-training loss of each model is shown in Equation (10):
$$L_S^m = L_C^m + \lambda L_P^m \tag{10}$$

In Eq. (10), $\lambda$ is a balancing factor between the supervised classification loss $L_C^m$ and the prototype learning loss $L_P^m$; it is a hyperparameter set to 0.5 by default. Optimizing the total loss $L_S^m$ thus learns both tasks simultaneously in the pre-training phase, and after pre-training of the model of each modality is completed, the set of prototypes of that modality $\{W_k^m\}$ is obtained together with the optimized feature extractor $F_m$, prototype mapper $M_m$, and classifier $G_m$.
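A compact sketch of the two pre-training objectives is given below, assuming the prototype mapper's output and the prototypes live in a common space; the dimensions and the use of a mean-reduced binary cross-entropy for Eq. (9) are simplifying assumptions.

```python
# Pre-training losses: Eq. (8) similarity, Eq. (9) prototype loss,
# Eq. (10) combined multi-task loss. Dimensions are illustrative.
import torch
import torch.nn.functional as F

def prototype_similarity(z, prototypes, gamma=2.0):
    """Eq. (8): d = exp(-||z - w_k||^2 / gamma), with z = M_m(F_m(x))."""
    d2 = torch.cdist(z, prototypes) ** 2      # (B, C) squared distances
    return torch.exp(-d2 / gamma)             # similarities in (0, 1]

def pretrain_loss(logits, z, prototypes, labels, lam=0.5):
    l_c = F.cross_entropy(logits, labels)     # Eq. (7): classification loss
    d = prototype_similarity(z, prototypes)
    y = F.one_hot(labels, prototypes.size(0)).float()
    l_p = F.binary_cross_entropy(d, y)        # Eq. (9): multilabel prototype loss
    return l_c + lam * l_p                    # Eq. (10): total loss

B, C, D = 8, 5, 16                            # batch, classes, prototype dim
loss = pretrain_loss(torch.randn(B, C), torch.randn(B, D),
                     torch.randn(C, D), torch.randint(0, C, (B,)))
print(loss.item())
```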

Prototype-based reliability metrics

After pre-training on the source domain is completed, the model is migrated to the target domain, and the feature extractor $F_m$ and classifier $G_m$ are used to recognize the target domain samples and obtain pseudo-labels. Specifically, given a target domain sample $X_{T_i} = \{X_{T_i}^{(m)} \mid 1 \le m \le M\}$, where $X_{T_i}^{(m)}$ is the input of the $m$-th modality, let $\hat{Y}_{T_i}^m$ denote the recognition result of the $m$-th modality's classifier $G_m$, in which the element corresponding to the class $\hat{k}$ with the highest output probability of $G_m$ is set to 1 and all other elements to 0. The reliability of the $m$-th modality's pseudo-label can then be measured by the similarity between the sample and the prototype of class $\hat{k}$, as shown in Equation (11):
$$R_{T_i}^m = d(X_{T_i}^{(m)}, W_{\hat{k}}^m) \tag{11}$$

In Eq. (11), the function $d(\cdot)$, defined by Eq. (8), represents the similarity between the sample and the prototype. Because the class prototype actually represents the knowledge the model learned on the source domain, prototypes are relatively less sensitive to domain shift than reliability measures such as posterior probability and therefore provide a more robust measure of decision reliability.

Let $\{X_{T_i}^{(m)}, \hat{Y}_{T_i}^m, R_{T_i}^m\}$ $(1 \le m \le M)$ denote the sample vectors, pseudo-labels, and pseudo-label reliabilities of a given target domain sample as observed by the different modal models. For a given target domain sample, the pseudo-label reliabilities of the different modal models usually differ; it is therefore necessary to compare them and select the pseudo-label with the highest reliability as the sample's final pseudo-label. The modality with the highest identification reliability is defined by Equation (12):
$$\tilde{m} = \arg\max_m R_{T_i}^m \tag{12}$$

After identifying the modality $\tilde{m}$ with the highest recognition reliability, the final pseudo-label $\hat{Y}_{T_i}$ of the target domain sample $X_{T_i}$ is designated as the recognition result $\hat{Y}_{T_i}^{\tilde{m}}$ of modality $\tilde{m}$, and the corresponding reliability is the recognition reliability of modality $\tilde{m}$. The pseudo-label and reliability are defined by Equations (13) and (14), respectively:
$$\hat{Y}_{T_i} = \hat{Y}_{T_i}^{\tilde{m}} \tag{13}$$
$$R_{T_i} = R_{T_i}^{\tilde{m}} \tag{14}$$

To mitigate the negative effects of erroneous pseudo-labels, this section employs a reliability-based label smoothing strategy: the lower the reliability, the stronger the smoothing of the pseudo-label. Equation (15) gives the expression of the smoothed pseudo-label:
$$\dot{Y}_{T_i} = R_{T_i} \hat{Y}_{T_i} + \frac{1 - R_{T_i}}{C} \tag{15}$$

In Eq. (15), $C$ denotes the total number of categories. After identification and reliability assessment of all target domain samples, only the most reliable pseudo-labels are selected as learning material for subsequent incremental learning; the screened reliable pseudo-labels are defined by Eq. (16):
$$U_R = \left\{ (X_{T_i}, \dot{Y}_{T_i}, R_{T_i}) \mid R_{T_i} > R_\lambda \right\} \tag{16}$$

In Eq. (16), $U_R$ denotes the set of incremental learning material, and $R_\lambda$ is a constant threshold used to screen out low-reliability pseudo-labels, so that a sample and its pseudo-label are used as learning material only if the sample's pseudo-label reliability $R_{T_i}$ is higher than $R_\lambda$. By default, $R_\lambda$ is set so that the target domain samples whose reliability ranks in the top 50% are retained, i.e., it equals the median reliability.
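The whole pseudo-labeling stage (Eqs. (11)-(16)) can be sketched for one sample and a small batch as follows; the similarity values are toys, and taking the median as $R_\lambda$ reflects the default top-50% rule.

```python
# Pseudo-labeling on the target domain: pick the most reliable modality
# (Eqs. 11-14), smooth by reliability (Eq. 15), screen by threshold (Eq. 16).
import torch

def pseudo_label(similarities):
    """similarities: (M, C) prototype similarities of one sample per modality."""
    rel_per_mod, preds = similarities.max(dim=1)  # R^m and predicted class k-hat
    m_star = rel_per_mod.argmax()                 # Eq. (12): best modality
    rel = rel_per_mod[m_star]                     # Eq. (14): its reliability
    y_hat = torch.zeros(similarities.size(1))
    y_hat[preds[m_star]] = 1.0                    # Eq. (13): one-hot pseudo-label
    c = similarities.size(1)
    return rel * y_hat + (1 - rel) / c, rel       # Eq. (15): smoothed label

sims = torch.tensor([[0.8, 0.1, 0.2],    # modality 1: confident on class 0
                     [0.3, 0.4, 0.2]])   # modality 2: weakly prefers class 1
smoothed, rel = pseudo_label(sims)
print(smoothed)                 # tensor([0.8667, 0.0667, 0.0667])

# Eq. (16): keep only samples whose reliability is above the median (top 50%).
rels = torch.tensor([0.9, 0.4, 0.7, 0.2])
print(rels > rels.median())     # tensor([ True, False,  True, False])
```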

Experimental tests
Comparative experiments with different algorithms

To verify the effect of CRNN-Lite+DLMM on sheet music recognition in music education, this section first compares CRNN-Lite+DLMM with other algorithms without domain adaptation, and also compares the effect of different domain adaptation algorithms on the same recognition algorithms. The results of the comparison experiments are shown in Table 1.

Table 1. Comparison of different algorithms

| Algorithm | P/% | R/% | mAP@0.5/% | Model parameters/MB | FPS | GFLOPs |
|---|---|---|---|---|---|---|
| CRNN-Lite | 92.7 | 91.6 | 93.3 | 10.7 | 90.9 | 56.3 |
| CRNN-Lite+DLMM | 95.6 | 92.7 | 94.6 | 9.5 | 92 | 55.2 |
| CRNN-Lite+Mm-Sada | 69 | 51.2 | 53.8 | 3 | 173.8 | 10.4 |
| CRNN-Lite+Mla | 81.4 | 57.5 | 71 | 13.7 | 143.7 | 26.8 |
| CRNN-Lite+Pfts | 84 | 56.6 | 68.8 | 26.2 | 106.4 | 66.6 |
| CRNN-Lite+γ-Mod | 61.5 | 44.4 | 50.1 | 4.7 | 192.7 | 12.3 |
| MIR | 73.4 | 53.3 | 60.9 | 14.8 | 160.1 | 44 |
| MIR+DLMM | 79 | 57.9 | 59.7 | 49.8 | 83.2 | 166.4 |
| MIR+Mm-Sada | 80.7 | 42.6 | 49.7 | 37.3 | 161.9 | 104.1 |
| MIR+Mla | 73.6 | 51.7 | 59.9 | 3.9 | 150.9 | 8.9 |
| MIR+Pfts | 84.3 | 57.7 | 67.2 | 11.8 | 175.5 | 27.3 |
| MIR+γ-Mod | 86.5 | 61.2 | 68.8 | 26.1 | 92.5 | 80.8 |
| Polyphonic-Tromr | 68.3 | 54.6 | 60.4 | 8.9 | 171.7 | 11.5 |
| Polyphonic-Tromr+DLMM | 76 | 61.3 | 71.5 | 15.7 | 148.2 | 26 |
| Polyphonic-Tromr+Mm-Sada | 84.7 | 63.3 | 72.2 | 18.3 | 97.6 | 65.5 |
| Polyphonic-Tromr+Mla | 62.5 | 40.6 | 56.1 | 3.6 | 190.7 | 4.8 |
| Polyphonic-Tromr+Pfts | 70.5 | 54.7 | 60.1 | 14.5 | 161.1 | 41.3 |
| Polyphonic-Tromr+γ-Mod | 80.8 | 57.4 | 59.6 | 55.7 | 77.6 | 162.2 |

As shown in the table, CRNN-Lite+DLMM outperforms the other algorithms in precision, recall, and mAP@0.5. Compared with the original recognition algorithm CRNN-Lite, its precision increased by 2.9 percentage points, its recall by 1.1, and its mAP@0.5 by 1.3. Compared with the CRNN-Lite+Mm-Sada, CRNN-Lite+Mla, CRNN-Lite+Pfts, and CRNN-Lite+γ-Mod algorithms, its precision increased by 26.6, 14.2, 11.6, and 34.1 percentage points, its recall by 41.5, 35.2, 36.1, and 48.3, and its mAP@0.5 by 40.8, 23.6, 25.8, and 44.5, respectively. The comparative experiments on the influence of different domain adaptation algorithms on the recognition algorithm show that, compared with the MIR+DLMM and Polyphonic-Tromr+DLMM algorithms, the precision of CRNN-Lite+DLMM increased by 16.6 and 19.6 percentage points, its recall by 34.8 and 31.4, and its mAP@0.5 by 34.9 and 23.1, respectively. The experimental results show that the CRNN-Lite+DLMM algorithm can better complete the task of score recognition in music teaching.

Analysis of experimental results for recognition efficiency

This paper uses the total time spent by an algorithm on training and testing as the evaluation index of algorithm efficiency. To assess the performance of the proposed model, comparison experiments were designed in the same experimental environment; the experiment times of the different algorithms are shown in Table 2. As the table shows, the CRNN-Lite+DLMM algorithm proposed in this paper has obvious advantages over the other algorithms; compared with the CRNN-Lite algorithm, its recognition efficiency improved by 12.5%.

Table 2. Experiment times of different algorithms

| Algorithm | Training time (s) | Test time (s) |
|---|---|---|
| CRNN-Lite | 2215 | 465 |
| CRNN-Lite+DLMM | 1985 | 412 |
| CRNN-Lite+Mm-Sada | 3265 | 802 |
| CRNN-Lite+Mla | 2657 | 495 |
| CRNN-Lite+Pfts | 3019 | 777 |
| CRNN-Lite+γ-Mod | 3530 | 778 |
| MIR | 2906 | 555 |
| MIR+DLMM | 4263 | 1011 |
| MIR+Mm-Sada | 2981 | 693 |
| MIR+Mla | 3265 | 780 |
| MIR+Pfts | 2648 | 504 |
| MIR+γ-Mod | 2999 | 782 |
| Polyphonic-Tromr | 3523 | 770 |
| Polyphonic-Tromr+DLMM | 2896 | 534 |
| Polyphonic-Tromr+Mm-Sada | 4260 | 1007 |
| Polyphonic-Tromr+Mla | 2969 | 683 |
| Polyphonic-Tromr+Pfts | 2983 | 796 |
| Polyphonic-Tromr+γ-Mod | 3515 | 797 |
Analysis of the effectiveness of music teaching

This section analyzes the effect of music teaching after applying the proposed method along two dimensions: song singing completion and music learning effect. The experiments were set up with an experimental class (N=50) and a control class (N=50). The experimental class was taught music using the method proposed in this paper, while the control class was taught using the traditional method.

Completion of song singing

After the experimental class and the control class were taught the same course by the different methods, the author examined the students of both classes one week later. In the song singing examination, 66% of the students in the experimental class could sing with emotion and master the basic requirements of song singing, while most students in the control class either did not invest enough emotion in singing or did not meet the basic requirements. This shows that applying the method proposed in this paper to music teaching can greatly improve students' learning efficiency while giving them more time to appreciate music and to discover and feel its beauty. The statistics of students' song singing completion are shown in Table 3 (scoring criteria: ① emotional expression, singing posture, breathing method, pitch, rhythm, strength, completeness; ② emotional expression, singing posture, breathing method, completeness; ③ completeness; ④ not meeting the requirements of song singing).

Table 3. Students' song singing completion

| Grade | Experimental class (N=50): Number | Experimental class (N=50): Percentage | Control class (N=50): Number | Control class (N=50): Percentage |
|---|---|---|---|---|
| ① | 33 | 66% | 20 | 40% |
| ② | 15 | 30% | 18 | 36% |
| ③ | 1 | 2% | 8 | 16% |
| ④ | 1 | 2% | 4 | 8% |
Analysis of music learning effects

This study uses the paired samples test, which determines whether two related samples come from populations with the same mean. There are two ways of pairing data: self-pairing and homogeneous pairing. Self-pairing is used here: the same experimental subjects receive two treatments, before and after, at two different times, and the two observations are used for control and comparison. The data from the pre-test and post-test questionnaires of the experimental subjects were imported into SPSS and the paired samples test was carried out, with the confidence interval set to 95%. The paired samples analysis and test results are shown in Table 4 and Table 5.

Table 4. Paired samples statistics

| Pair | Variable | Mean | N | Std. deviation | Std. error mean |
|---|---|---|---|---|---|
| Pair 1 | Study interest | 2.3891 | 50 | 0.69622 | 0.09855 |
| | Study interest 2 | 4.3395 | 50 | 0.55987 | 0.06577 |
| Pair 2 | Learning motivation | 2.2501 | 50 | 0.53612 | 0.07602 |
| | Learning motivation 2 | 4.2651 | 50 | 0.52584 | 0.07362 |
| Pair 3 | Learning efficiency | 2.0561 | 50 | 0.49532 | 0.06855 |
| | Learning efficiency 2 | 4.1658 | 50 | 0.57846 | 0.08107 |
| Pair 4 | Learning emotion | 2.2989 | 50 | 0.59336 | 0.08425 |
| | Learning emotion 2 | 4.4102 | 50 | 0.51322 | 0.07321 |

Table 5. Paired samples test

| Pair | Mean difference | Std. deviation | Std. error mean | 95% CI of difference (lower) | 95% CI of difference (upper) | t | df | Sig. (two-tailed) |
|---|---|---|---|---|---|---|---|---|
| Pair 1: Study interest − Study interest 2 | -1.95025 | 0.92366 | 0.12985 | -2.23115 | -1.68512 | -15.122 | 50 | 0.000 |
| Pair 2: Learning motivation − Learning motivation 2 | -2.01585 | 0.80051 | 0.11362 | -2.24985 | -1.77544 | -17.854 | 50 | 0.000 |
| Pair 3: Learning efficiency − Learning efficiency 2 | -2.10305 | 0.78442 | 0.10885 | -2.33165 | -1.87322 | -19.305 | 50 | 0.000 |
| Pair 4: Learning emotion − Learning emotion 2 | -2.11065 | 0.76051 | 0.10601 | -2.42112 | -1.87622 | -20.055 | 50 | 0.000 |
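For reference, a self-paired test of the kind reported above can be reproduced with SciPy as in the sketch below; the pre/post arrays are toy scores, not the study's questionnaire data.

```python
# Paired-samples t-test: same subjects measured before and after treatment.
import numpy as np
from scipy import stats

pre = np.array([2.1, 2.4, 2.0, 2.6, 2.3])    # toy pre-test scores
post = np.array([4.2, 4.5, 4.0, 4.4, 4.1])   # toy post-test scores
t, p = stats.ttest_rel(pre, post)
print(f"t = {t:.3f}, p = {p:.4f}")           # p < 0.05 -> significant change
```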

The paired sample statistics show that the mean values of all four dimensions (learning interest, learning motivation, learning efficiency, and learning emotion) increased. Here the mean value is students' average score on the questionnaire items of each dimension, scored on a 1-5 scale where higher scores represent higher satisfaction, and Study Interest 2, Learning Motivation 2, Learning Efficiency 2, and Learning Emotion 2 denote the post-test values of the four dimensions. Learning interest rose from 2.3891 on the pre-test to 4.3395 on the post-test; learning motivation from 2.2501 to 4.2651; learning efficiency from 2.0561 to 4.1658; and learning emotion from 2.2989 to 4.4102. The experimental class's means thus increased considerably in all four dimensions, showing that the students improved in each of them after being taught music with the method proposed in this paper. In terms of standard deviation, the pre- and post-test standard deviations of learning motivation fluctuate little and are relatively stable, suggesting that students' choices were fairly consistent, while those of learning interest, learning efficiency, and learning emotion fluctuate more, indicating that students' choices were more dispersed. The paired samples test shows that the significance of all four dimensions is 0.000, less than 0.05, proving that students differ significantly between pre-test and post-test, and the mean differences show a large increase in all four dimensions.

In summary, the students in the experimental class showed significant differences in the four dimensions of learning interest, learning motivation, learning efficiency, and learning emotion, and they were more satisfied with the effect of music teaching using the method proposed in this paper. Specifically:

First, the students in the experimental class showed a significantly increased interest in learning music. Teachers assisted the teaching with the methods proposed in this paper, strengthened students' understanding and perception of music in various respects, and used a variety of ways and means to enhance students' interest in music theory learning.

Second, the motivation of the students in the experimental class to learn music was significantly enhanced. Teachers added multimedia technology to the four sections of music appreciation, music performance, music creation, and music-related culture, presented in the form of pictures, audio, video, and animation.

Third, the efficiency and effectiveness of the experimental class students' music learning improved significantly. The difference in mean value was largest for learning efficiency, at 2.1097, indicating that students improved most in this dimension, with great progress in music perception, rhythm and beat discrimination, and mastery of music theory knowledge.

Fourth, students in the experimental class showed significantly positive changes in their emotions related to music learning. Students reported that learning music makes them happy and enriches their emotions. The difference in the mean value of learning emotion is the second highest, trailing the first by a hair's breadth, which indicates a great change in students' emotions in music learning and a greatly improved sensitivity to musical emotion.

Conclusion

Based on the perspective of the deep learning model DLMM, this paper explores the application of a lightweight score recognition method based on improved CRNN in music education. Targeting AI technology in music education, the paper proposes the lightweight music score recognition method CRNN-lite, which alleviates the problems of time-consuming single iterations and a high total number of iteration rounds during training and improves the recognition rate of distorted music scores. To solve the problem of migrating the method to music education, the article further proposes the multimodal domain adaptation algorithm DLMM based on differential learning. The experimental results show that DLMM has a significant advantage in recognition efficiency, improving it by 12.5% compared with the CRNN-Lite algorithm. After music was taught through the comprehensive application of the methods in this paper, the students in the experimental class were significantly more efficient and effective in music learning, with the mean value of learning efficiency improving by 2.1097 from pre-test to post-test.