Efficiency of AI Technology Application in Music Education - A Perspective Based on Deep Learning Model DLMM
Published online: 17 March 2025
Received: 29 October 2024
Accepted: 7 February 2025
DOI: https://doi.org/10.2478/amns-2025-0326
© 2025 Jie Chang et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Music teaching occupies an important position in today’s education system: it not only enriches students’ campus life and enhances their aesthetic ability, but also provides them with diversified ways of learning and thinking [1-2]. However, many music classrooms still adopt traditional teaching methods, which often neglect students’ individual differences and subjective initiative, leaving the effect of music teaching unsatisfactory [3-5]. In this context, how to improve the efficiency of music teaching, stimulate students’ innovative thinking, and cultivate their musical literacy has become an important issue for music educators and researchers.
In recent years, China’s science and technology have developed rapidly, and the field of music education has gradually become more closely intertwined with technology, particularly artificial intelligence (AI) [6-7]. Teachers use AI technology to conduct remote interactive teaching with students, breaking the geographical limitations of the traditional music classroom [8-11]. At the same time, teachers can better grasp students’ learning status through the instant data analysis that AI provides, and thus teach in a more targeted way [12-15]. In the practice of AI-assisted music education, however, model performance degrades when the data distributions of the training and test sets deviate from each other [16-17]. The multimodal domain adaptation algorithm based on differential learning (DLMM) addresses this by exploiting the variability among the models of different modalities to perform multimodal domain adaptation, providing better multimodal decision-making for music education [18].
The article first designs a lightweight music score recognition method based on an improved CRNN. Specifically, CRNN-lite introduces residual depthwise separable convolution in the convolutional layer to reduce computation and accelerate feature-map extraction; the recurrent layer uses bidirectional simple recurrent units, employing parallel computation to avoid the strong-dependency problem of serial computation; and the transcription layer adjusts the parameters of the cross-entropy function to target learning on unbalanced sample data. The article then proposes a multimodal domain adaptation algorithm based on differential learning, whose core phases are pre-training on the source domain and a prototype-based reliability measure. Furthermore, the paper explores the effect of different datasets on the performance of this method and conducts comparative experiments. Finally, the effectiveness of music teaching after applying the proposed method is analyzed.
Nowadays, the application of artificial intelligence technology in education has become a popular trend, and the music teaching content of colleges and universities has likewise become more dynamic, open, and diversified with the help of artificial intelligence. On the one hand, AI technology has strong interconnectivity and is well suited to retrieving and presenting music teaching resources, which can effectively broaden the cognitive horizons of teachers and students and enrich their access to resources. Music teachers can use AI technology to obtain classic but old and hard-to-reach audio and video materials, such as rare performance recordings and detailed explanations of classic works. At the same time, AI technology can quickly retrieve teaching resources according to teachers’ different needs, including different versions and different picture quality, even down to a specific group’s performance, so as to enhance the relevance and effectiveness of teaching content. On the other hand, AI technology can move away from fixed learning content and continuously adjust the teaching content according to students’ learning progress, learning preferences, and personality characteristics, realizing personalized and dynamic teaching content. For example, music teachers can use deep learning algorithms to analyze students’ learning history and behavioral patterns and automatically recommend suitable materials such as music-theory tutorials, composition-skills teaching, and instrumental-performance guidance.
Music comprises two parts, the visual score and the auditory song, and its teaching audience is usually young students, so the teaching process often needs ways of highlighting what can be directly seen and heard in music, more so than traditional theory teaching. In conventional teaching, the common method is PPT-based instruction using Office on a PC, but music-theory symbols are difficult to render and play directly on a PC, the PPT courseware is laborious to produce and poorly interactive, and it holds little attraction for young learners. With the development of technology, a number of domestic and foreign music notation programs, such as Overture, Sibelius, and Pizzicato, have appeared and gradually realized software-based notation and playback, but because they demand considerable specialized theoretical skill and involve cumbersome operation steps, most teachers engaged in art and music teaching are still deterred. In this paper, we adopt an improved CRNN-lite-based recognition model for score recognition in music education, which can effectively extract music scores and quickly convert musical symbols, rhythms, and melodies so that they can be played by computer or recorded as formatted music files into various scoring programs. It helps teachers and students with their educational and creative needs and has a facilitating effect on music education.
Previous studies usually use deep networks to extract features in the convolutional layer and complex gating units computed serially in the recurrent layer, with no additional processing of the loss function in the transcription layer, which leads to long training times and low accuracy for CRNN [19].
To address the above problems, CRNN-lite introduces depthwise separable convolution in the convolutional layer and extracts features through residual connections, which reduces computation while improving the network’s learning ability. A simple recurrent unit (SRU) is introduced in the recurrent layer to reduce the number of gate units and speed up model training through parallel computation. In the transcription layer, Focal Loss is introduced into the CTC method, improving it to Focal CTC, which balances the learning samples and improves model accuracy [20].
The structure of the music score recognition model of CRNN-lite is shown in Fig. 1, and the specific music score recognition process is as follows:
1. Convolutional-layer feature extraction: the sheet-music image is converted into a single-channel grayscale map. Four depthwise separable convolution blocks perform convolution operations on the grayscale map, residual connections are applied over the intermediate results of the convolution process, and the feature map is output.
2. Recurrent-layer feature-sequence classification: the feature-map matrix is reshaped into a feature sequence of lower dimension. The feature-sequence matrix is fed into a bidirectional recurrent neural network composed of SRUs, which outputs the recognition distribution matrix of the sequence. Each frame corresponds to a one-dimensional vector whose length equals the total number of recognized note types, and each element of the vector gives the probability of the corresponding note.
3. Transcription-layer sequence alignment: the highest-probability elements of the probability distribution matrix are recombined into a one-dimensional sequence matrix, and CTC aligns frames with the same semantics, removing redundant invalid frames.
4. Output of the recognition sequence: the processed sequence matrix is converted into a note-encoding sequence for output.
5. Model parameter update: the ground-truth sequence matrix and the recognized sequence matrix are taken as the arguments of the Focal Loss function, the loss value is adjusted according to the degree of sample balance, and backpropagation updates the parameters of the CRNN network.

A minimal code skeleton of this pipeline is sketched after Fig. 1.

Recognition model structure of CRNN-lite
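To make the pipeline above concrete, here is a minimal skeleton of a CRNN-lite-style model in PyTorch. The channel widths, pooling scheme, and class count are illustrative assumptions, and a GRU stands in for the bidirectional SRU stack (an SRU sketch appears later); only the overall shape of the pipeline follows the paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv per channel followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class CRNNLiteSkeleton(nn.Module):
    """Illustrative skeleton: conv feature extractor -> BiRNN -> CTC head."""
    def __init__(self, num_classes=64):   # assumed note-type count incl. CTC blank
        super().__init__()
        chans = [1, 32, 64, 128, 256]     # assumed channel widths
        self.blocks = nn.ModuleList(
            DepthwiseSeparableConv(chans[i], chans[i + 1]) for i in range(4)
        )
        self.pool = nn.MaxPool2d((2, 1))  # shrink height, preserve time axis
        # GRU used as a runnable stand-in for the bidirectional SRU stack.
        self.rnn = nn.GRU(256, 256, num_layers=2, bidirectional=True,
                          batch_first=True)
        self.head = nn.Linear(512, num_classes)

    def forward(self, x):                  # x: (batch, 1, height, width)
        for block in self.blocks:          # residual shortcuts omitted here;
            x = self.pool(block(x))        # see the residual block sketch below
        x = x.mean(dim=2).transpose(1, 2)  # collapse height -> (batch, time, feat)
        x, _ = self.rnn(x)
        return self.head(x).log_softmax(-1)  # per-frame log-probs for CTC
```

For training, these per-frame log-probabilities would be passed to a CTC-style loss together with the target note sequences, as described in the transcription-layer step.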
The convolutional layer’s network structure is crucial and has a direct impact on the model’s training efficiency and the upper learning limit. For sheet music images, the feature map extracted by the convolutional layer will be transformed into a sequence of features, which will then be passed into the recurrent layer for learning. Therefore, in this paper, the convolutional layer is improved to both speed up the training process and ensure the effectiveness of the extracted features.
Depthwise convolution splits the convolution kernel into single-channel form and, without changing the depth of the input feature map, performs the convolution operation on each channel separately, yielding an output feature map with the same number of channels as the input. Pointwise convolution is essentially a convolution kernel of size 1 × 1, which serves to expand or reduce the channel dimension of the feature map. Standard convolution and depthwise separable convolution are shown in Fig. 2 and Fig. 3.

Standard convolution

Depthwise separable convolution
Assuming that the size of the standard convolution kernel is $D_K \times D_K \times M$, that $N$ such kernels are applied, and that the output feature map has spatial size $D_F \times D_F$, standard convolution requires $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$ multiplications. Depthwise separable convolution splits this into a depthwise step costing $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$ and a pointwise step costing $M \cdot N \cdot D_F \cdot D_F$, reducing the computation to a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of the standard convolution.
This section combines the ideas of depthwise separable convolution and residual networks to improve the convolutional layer into a residual depthwise separable convolutional network. Unlike the ReLU6 activation function used with depthwise separable convolution in MobileNets, the depthwise separable convolution defined in this paper uses the Leaky ReLU activation function; its structure is shown in Fig. 4.

Depthwise separable convolution structure
The residual depthwise separable convolutional network consists of four depthwise separable convolution blocks, each of which performs a convolution operation on its input and adds a residual shortcut to produce the output feature map, as shown in Fig. 5; a code sketch of such a block follows the figure.

Residual depthwise separable convolutional network structure
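As a minimal sketch of one residual depthwise separable block as described above (the 1 × 1 projection on the shortcut is an assumption, used so the addition stays shape-compatible when the channel counts differ; the paper does not specify how the shortcut is formed):

```python
import torch.nn as nn

class ResidualDSConv(nn.Module):
    """Depthwise separable convolution block with a residual shortcut."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise
            nn.Conv2d(in_ch, out_ch, 1),                          # pointwise
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),                                    # per the paper
        )
        # Assumed 1x1 projection so the residual addition matches shapes.
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```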
An RNN is a class of neural network for sequence data; the feature maps extracted by the CNN from score images can be converted into feature sequences and fed into an RNN for classification. However, during backpropagation the gradient of an RNN is dominated by nearby time steps, so gradients from distant steps vanish and the model struggles to learn long-range dependencies; this is the vanishing gradient problem.
The SRU improves the gating unit to remove the network’s strong dependence on the previous hidden state, and realizes parallel computation by obtaining the gating parameter matrices in advance, which greatly accelerates the learning speed of the network. The structure of the SRU at time step $t$ is shown in Fig. 6.

SRU structure
To reduce the degree of recursion, its two gating units, the forget gate and the reset gate, are computed from the current input rather than from the previous hidden state, so the gate matrices for all time steps can be computed in advance in parallel.
The SRU recurrent network structure is shown in Fig. 7. The model contains two layers of bidirectional SRUs, each with 512 hidden units in total; information is first passed in chronological order, and the last output unit then passes information in reverse chronological order. The outputs of the two bidirectional SRU layers are finally combined by dot-product computation for recognition and classification. A minimal SRU cell sketch follows Fig. 7.

SRU recurrent network structure
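The following single-direction SRU cell follows the standard SRU formulation; the paper’s exact gating is not fully reproduced above, so treat this as an illustrative sketch (the highway term of the full SRU is omitted for brevity). The key property is visible in the code: the heavy matrix products for all time steps are computed up front in parallel, and only cheap elementwise operations remain serial.

```python
import torch
import torch.nn as nn

class SRUCell(nn.Module):
    """Single-layer, single-direction SRU (illustrative sketch)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One fused projection yields candidate, forget gate, and reset gate.
        self.proj = nn.Linear(input_size, 3 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x):                     # x: (time, batch, input_size)
        T, B, _ = x.shape
        # All matrix multiplications are batched over time steps at once.
        z, f, r = self.proj(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = x.new_zeros(B, self.hidden_size)  # initial cell state
        outs = []
        for t in range(T):                    # only elementwise ops are serial
            c = f[t] * c + (1 - f[t]) * z[t]  # gated cell-state update
            outs.append(r[t] * torch.tanh(c)) # reset-gated output
        return torch.stack(outs)              # (time, batch, hidden_size)
```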
In the semantic reconstruction phase of OMR, the neural network can be trained end to end with the CTC loss function. CTC targets the output of the correct sequence without regard to which input frames correspond to which symbols; it only constrains the model to assemble the expected sequence. However, traditional CTC does not train well on datasets that are extremely unbalanced or contain a large number of low-frequency samples: symbols that occur rarely have little influence on the model during training, while symbols that occur frequently have a large influence. This imbalance in the dataset leads to overfitting of notes with a high frequency of occurrence and underfitting of notes with a low frequency of occurrence.
Focal Loss solves this problem well by overcoming the overfitting and underfitting caused by dataset imbalance. Based on the focusing idea and cross-entropy, the Focal Loss function is defined as shown in Equation (6):

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $p_t$ is the probability the model assigns to the true class, $\alpha_t$ is a class-balancing weight, and $\gamma \geq 0$ is the focusing parameter that down-weights well-classified (high-frequency) samples so that training concentrates on hard, low-frequency ones.
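A hedged sketch of combining Focal Loss with CTC follows: the paper’s exact Focal CTC formulation is not reproduced above, so the construction below, which weights each sequence’s CTC loss by a focal factor computed from the sequence probability, is one common assumption rather than the paper’s definition.

```python
import torch
import torch.nn as nn

def focal_ctc_loss(log_probs, targets, input_lens, target_lens,
                   alpha=0.25, gamma=2.0):
    """Focal-weighted CTC loss: down-weights easy (high-probability) samples.

    log_probs: (time, batch, classes) log-softmax output, as for nn.CTCLoss.
    alpha and gamma are the usual Focal Loss hyperparameters.
    """
    ctc = nn.CTCLoss(blank=0, reduction="none")
    per_sample = ctc(log_probs, targets, input_lens, target_lens)  # -log p
    p = torch.exp(-per_sample)          # sequence probability under the model
    return (alpha * (1 - p) ** gamma * per_sample).mean()
```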
With the popularization and development of artificial intelligence, a trend toward customization has appeared in intelligent applications, and how to migrate the trained lightweight CRNN-based score recognition method to music-education application scenarios quickly and well, while guaranteeing accuracy, has become one of the technical difficulties of concern to the education community. Domain adaptation algorithms are an important technical route for solving such problems. Therefore, this section designs a new multimodal domain adaptation algorithm based on differential learning, which performs multimodal domain adaptation by exploiting the variability among the models of different modalities.
The DLMM framework can be divided into four stages (a small code sketch of the fusion step follows the list):
1. Pre-training on the source domain: each modal model is pre-trained with labeled source-domain data. Each model is divided into three parts: a feature extractor, a prototype mapper, and a classifier. The feature extractor extracts features from the sample data; the prototype mapper maps those features into the prototype space and learns a prototype representation for each class; and the classifier feeds the extracted features into a fully connected network and outputs the recognition result. Pre-training optimizes the supervised classification task and the prototype learning task jointly, in a multi-task learning mode.
2. Prototype-based reliability measure: the pre-trained model of each modality is migrated to the target domain, where its classifier identifies the target-domain samples. The similarity between a target-domain sample and the class prototypes in the prototype space serves as a measure of the reliability of that modal model’s decision, so for each target-domain sample the decisions of the individual modal models can be weighted by their reliability to produce a pseudo-label.
3. Asynchronous learning: after obtaining highly reliable pseudo-labels for the target-domain samples, each modality’s model uses the target-domain data for incremental learning to improve generalization beyond the source domain. In this phase, the models of different modalities first individually assess the learning difficulty of the incremental samples and then learn the target-domain samples in order from easy to hard; because this order differs across modal models, the process is called asynchronous learning.
4. Reliability-aware fusion: after asynchronous learning is complete, each modality’s model has improved generalization on the target domain. To better fuse the decisions of the different modal models, each model’s reliability is used as the weight of its recognition result, and the decisions of all models are fused into the final recognition result [22].
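As a small concrete illustration of the reliability-aware fusion in stage 4 (a sketch under the assumption that each modality supplies a class-score vector and a scalar reliability; all names here are hypothetical):

```python
import torch

def fuse_decisions(scores_per_modality, reliabilities):
    """Reliability-weighted fusion of per-modality class scores.

    scores_per_modality: list of (num_classes,) tensors, one per modality.
    reliabilities: list of scalars in [0, 1], one per modality.
    """
    total = sum(r * s for r, s in zip(reliabilities, scores_per_modality))
    return int(total.argmax())

# Example: two modalities, three classes.
audio_scores = torch.tensor([0.2, 0.7, 0.1])
image_scores = torch.tensor([0.6, 0.3, 0.1])
print(fuse_decisions([audio_scores, image_scores], [0.9, 0.4]))  # -> 1
```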
The first stage of the multimodal domain adaptation algorithm based on differential learning is pre-training the model of each modality with source-domain data. The pre-training stage, based on multi-task learning, is shown in Fig. 8; each modal model consists of three parts: a feature extractor, a prototype mapper, and a classifier.

Pre-training based on multi-task learning
For the supervised classification task, the source-domain samples are first fed into the feature extractor to obtain feature vectors; the extracted feature vectors are then fed into the classifier to obtain the classification results, and the cross-entropy function is used as the loss function to optimize the feature extractor and the classifier. The loss function for the classification task is shown in Equation (7):

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}$$

In Eq. (7), $N$ is the number of source-domain samples, $C$ is the number of classes, $y_{i,c}$ is the one-hot ground-truth label of sample $i$, and $\hat{y}_{i,c}$ is the probability the classifier assigns to class $c$ for sample $i$.
The prototype learning task aims to learn a class-level vector feature representation of each modality’s data, which this paper calls the class prototype. The similarity between the sample vectors and the class prototypes is denoted as in Equation (8):
In Eq. (8),
For the pre-training of each modal model on the source domain, this paper adopts a multi-task learning approach that optimizes the supervised classification task and the prototype learning task simultaneously; the total loss function in the pre-training stage of each model is shown in Equation (10):

In Eq. (10), the total loss combines the supervised classification loss of Eq. (7) with the prototype learning loss.
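A minimal sketch of one multi-task pre-training step under these definitions follows. The squared-distance pull toward the class prototype is an assumption (the paper’s Eqs. (8)-(10) are not fully reproduced above), and the weighting coefficient `lam` is hypothetical:

```python
import torch
import torch.nn.functional as F

def pretrain_step(feature_extractor, proto_mapper, classifier, prototypes,
                  x, y, optimizer, lam=0.5):
    """One multi-task pre-training step: classification + prototype learning.

    prototypes: (num_classes, proto_dim) learnable tensor of class prototypes.
    lam is a hypothetical weight balancing the two task losses.
    """
    feats = feature_extractor(x)               # sample features
    logits = classifier(feats)                 # supervised classification head
    cls_loss = F.cross_entropy(logits, y)      # Eq. (7): cross-entropy loss

    z = proto_mapper(feats)                    # map features to prototype space
    # Pull each sample toward its class prototype (assumed form of the
    # prototype-learning loss; the paper's exact formulation may differ).
    proto_loss = ((z - prototypes[y]) ** 2).sum(dim=1).mean()

    loss = cls_loss + lam * proto_loss         # Eq. (10): total multi-task loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```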
After pre-training is completed on the source domain, the model is migrated to the target domain and feature extractor
In Eq. (11), the function
Set
After identifying the modality
In order to mitigate the negative effects of erroneous pseudo-labels, this section employs a reliability-based label smoothing strategy, where the lower the reliability, the higher the degree of smoothing of the pseudo-labels. Equation (15) gives the expression of the smoothed pseudo-label:
In Eq.
In Eq.
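A hedged sketch of reliability-based label smoothing: the paper’s Eq. (15) is not reproduced above, so the interpolation toward a uniform distribution used here is one standard smoothing construction, assumed for illustration.

```python
import torch

def smooth_pseudo_label(pseudo_label, reliability, num_classes):
    """Smooth a hard pseudo-label toward uniform as reliability drops.

    pseudo_label: int class index from the reliability-weighted decision.
    reliability: scalar in [0, 1]; lower reliability -> more smoothing.
    Interpolating with the uniform distribution is an assumed form of Eq. (15).
    """
    one_hot = torch.zeros(num_classes)
    one_hot[pseudo_label] = 1.0
    uniform = torch.full((num_classes,), 1.0 / num_classes)
    return reliability * one_hot + (1.0 - reliability) * uniform

# Example: a confident and an unconfident pseudo-label over 4 classes.
print(smooth_pseudo_label(2, 0.9, 4))  # nearly one-hot
print(smooth_pseudo_label(2, 0.3, 4))  # heavily smoothed
```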
In order to verify the effect of CRNN-Lite+DLMM on score recognition in music education, this section first compares CRNN-Lite+DLMM with other algorithms without domain adaptation, and also compares the effect of different domain adaptation algorithms applied to the same recognition algorithm. The results of the comparison experiments are shown in Table 1.
Comparison of different algorithms
| Algorithm | P/% | R/% | mAP@0.5/% | Model Parameters/MB | FPS | GFLOPs |
|---|---|---|---|---|---|---|
| CRNN-Lite | 92.7 | 91.6 | 93.3 | 10.7 | 90.9 | 56.3 |
| CRNN-Lite+DLMM | 95.6 | 92.7 | 94.6 | 9.5 | 92 | 55.2 |
| CRNN-Lite+Mm-Sada | 69 | 51.2 | 53.8 | 3 | 173.8 | 10.4 |
| CRNN-Lite+Mla | 81.4 | 57.5 | 71 | 13.7 | 143.7 | 26.8 |
| CRNN-Lite+Pfts | 84 | 56.6 | 68.8 | 26.2 | 106.4 | 66.6 |
| CRNN-Lite+ | 61.5 | 44.4 | 50.1 | 4.7 | 192.7 | 12.3 |
| MIR | 73.4 | 53.3 | 60.9 | 14.8 | 160.1 | 44 |
| MIR + DLMM | 79 | 57.9 | 59.7 | 49.8 | 83.2 | 166.4 |
| MIR+Mm-Sada | 80.7 | 42.6 | 49.7 | 37.3 | 161.9 | 104.1 |
| MIR+Mla | 73.6 | 51.7 | 59.9 | 3.9 | 150.9 | 8.9 |
| MIR+Pfts | 84.3 | 57.7 | 67.2 | 11.8 | 175.5 | 27.3 |
| MIR+ | 86.5 | 61.2 | 68.8 | 26.1 | 92.5 | 80.8 |
| Polyphonic-Tromr | 68.3 | 54.6 | 60.4 | 8.9 | 171.7 | 11.5 |
| Polyphonic-Tromr+ DLMM | 76 | 61.3 | 71.5 | 15.7 | 148.2 | 26 |
| Polyphonic-Tromr+Mm-Sada | 84.7 | 63.3 | 72.2 | 18.3 | 97.6 | 65.5 |
| Polyphonic-Tromr+Mla | 62.5 | 40.6 | 56.1 | 3.6 | 190.7 | 4.8 |
| Polyphonic-Tromr+Pfts | 70.5 | 54.7 | 60.1 | 14.5 | 161.1 | 41.3 |
| Polyphonic-Tromr+ | 80.8 | 57.4 | 59.6 | 55.7 | 77.6 | 162.2 |
As the table shows, CRNN-Lite+DLMM outperforms the other algorithms on the precision, recall, and mAP@0.5 indicators. Compared with the original recognition algorithm CRNN-Lite, its precision increased by 2.9%, its recall by 1.1%, and its mAP@0.5 by 1.3%. Compared with the CRNN-Lite+Mm-Sada, CRNN-Lite+Mla, CRNN-Lite+Pfts, and CRNN-Lite+ -Mod algorithms, its precision increased by 26.6%, 14.2%, 11.6%, and 34.1%, its recall by 41.5%, 35.2%, 36.1%, and 48.3%, and its mAP@0.5 by 40.8%, 23.6%, 25.8%, and 44.5%, respectively. The comparison of different domain adaptation algorithms applied to different recognition algorithms shows that, relative to the MIR+DLMM and Polyphonic-Tromr+DLMM algorithms, the precision of CRNN-Lite+DLMM increased by 16.6% and 19.6%, its recall by 34.8% and 31.4%, and its mAP@0.5 by 34.9% and 23.1%, respectively. The experimental results show that the CRNN-Lite+DLMM algorithm can better complete the task of score recognition in music teaching.
In this paper, the total time the algorithm spends on training and testing is used as the evaluation index of algorithm efficiency. To assess the performance of the proposed model, comparison experiments were designed in the same experimental environment; the experiment times of the different algorithms are shown in Table 2. As the table shows, the CRNN-Lite+DLMM algorithm proposed in this paper has clear advantages over the other algorithms; compared with the CRNN-Lite algorithm, its recognition efficiency improved by 12.5%.
Experiment length of different algorithms
| Algorithm | Training time(s) | Test time(s) |
|---|---|---|
| CRNN-Lite | 2215 | 465 |
| CRNN-Lite+DLMM | 1985 | 412 |
| CRNN-Lite+Mm-Sada | 3265 | 802 |
| CRNN-Lite+Mla | 2657 | 495 |
| CRNN-Lite+Pfts | 3019 | 777 |
| CRNN-Lite+ | 3530 | 778 |
| MIR | 2906 | 555 |
| MIR + DLMM | 4263 | 1011 |
| MIR+Mm-Sada | 2981 | 693 |
| MIR+Mla | 3265 | 780 |
| MIR+Pfts | 2648 | 504 |
| MIR+ | 2999 | 782 |
| Polyphonic-Tromr | 3523 | 770 |
| Polyphonic-Tromr+ DLMM | 2896 | 534 |
| Polyphonic-Tromr+Mm-Sada | 4260 | 1007 |
| Polyphonic-Tromr+Mla | 2969 | 683 |
| Polyphonic-Tromr+Pfts | 2983 | 796 |
| Polyphonic-Tromr+ | 3515 | 797 |
This section analyzes the effect of music teaching after applying the method proposed in this paper along two dimensions: song-singing completion and music-learning effect. The experiment was set up with an experimental class (N=50) and a control class (N=50); the experimental class was taught music using the method proposed in this paper, while the control class was taught using the traditional method.
After a week of teaching the same course to the experimental class and the control class with the two different methods, the author examined the students of both classes. In the song-singing examination, 66% of the students in the experimental class could sing with emotion and master the basic requirements of song singing, while most students in the control class either invested too little emotion in singing or did not meet the basic requirements. This shows that applying the method proposed in this paper to music teaching can greatly improve students’ learning efficiency, while giving students more time to appreciate music and to discover and feel its beauty. The statistics of students’ song-singing completion are shown in Table 3 (scoring criteria: ① emotional expression, singing posture, breathing method, pitch, rhythm, strength, completeness; ② emotional expression, singing posture, breathing method, completeness; ③ completeness; ④ not meeting the requirements of song singing).
Student song performance
| Class | Experimental Class (N=50) | | Control Class (N=50) | |
|---|---|---|---|---|
| | Number | Percentage | Number | Percentage |
| ① | 33 | 66% | 20 | 40% |
| ② | 15 | 30% | 18 | 36% |
| ③ | 1 | 2% | 8 | 16% |
| ④ | 1 | 2% | 4 | 8% |
This study uses the paired-samples t-test, which determines whether two related samples come from populations with the same mean. There are two ways of pairing data: self-pairing and matched-group pairing. Self-pairing is used here: the same experimental subjects receive two treatments, before and after, at two different times, and the two observations are used for control and comparison. The data from the subjects’ pre-test and post-test questionnaires were imported into SPSS and the paired-samples test was carried out, with the confidence interval set to 95%. The paired-samples statistics and test results are shown in Table 4 and Table 5.
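For readers without SPSS, an equivalent paired-samples t-test can be run in Python; a minimal sketch with made-up placeholder scores (not the study’s data):

```python
from scipy import stats

# Placeholder pre/post questionnaire scores for one dimension (not real data).
pre  = [2.1, 2.6, 2.3, 1.9, 2.8, 2.4, 2.2, 2.5]
post = [4.2, 4.5, 4.1, 3.9, 4.6, 4.3, 4.0, 4.4]

# Paired-samples t-test: are the pre/post means significantly different?
t_stat, p_value = stats.ttest_rel(pre, post)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> significant change
```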
Paired samples statistics
| Pair | Measure | Mean | N | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| Pair 1 | Study Interest | 2.3891 | 50 | 0.69622 | 0.09855 |
| | Study Interest 2 | 4.3395 | 50 | 0.55987 | 0.06577 |
| Pair 2 | Learning Motivation | 2.2501 | 50 | 0.53612 | 0.07602 |
| | Learning Motivation 2 | 4.2651 | 50 | 0.52584 | 0.07362 |
| Pair 3 | Learning Efficiency | 2.0561 | 50 | 0.49532 | 0.06855 |
| | Learning Efficiency 2 | 4.1658 | 50 | 0.57846 | 0.08107 |
| Pair 4 | Learning Emotion | 2.2989 | 50 | 0.59336 | 0.08425 |
| | Learning Emotion 2 | 4.4102 | 50 | 0.51322 | 0.07321 |
Paired samples test
| Pair | Paired Difference | Mean | Std. Deviation | Std. Error Mean | 95% CI Lower | 95% CI Upper | t | df | Sig. (2-tailed) |
|---|---|---|---|---|---|---|---|---|---|
| Pair 1 | Study Interest - Study Interest 2 | -1.95025 | 0.92366 | 0.12985 | -2.23115 | -1.68512 | -15.122 | 50 | 0.000 |
| Pair 2 | Learning Motivation - Learning Motivation 2 | -2.01585 | 0.80051 | 0.11362 | -2.24985 | -1.77544 | -17.854 | 50 | 0.000 |
| Pair 3 | Learning Efficiency - Learning Efficiency 2 | -2.10305 | 0.78442 | 0.10885 | -2.33165 | -1.87322 | -19.305 | 50 | 0.000 |
| Pair 4 | Learning Emotion - Learning Emotion 2 | -2.11065 | 0.76051 | 0.10601 | -2.42112 | -1.87622 | -20.055 | 50 | 0.000 |
The paired-samples statistics show that the mean values of all four dimensions (learning interest, learning motivation, learning efficiency, and learning emotion) increased. Here the mean value is the students’ average questionnaire score across the items of each dimension, scored from 1 to 5, with higher scores representing higher satisfaction; Study Interest 2, Learning Motivation 2, Learning Efficiency 2, and Learning Emotion 2 denote the post-test values of the four dimensions. Learning interest rose from 2.3891 on the pre-test to 4.3395 on the post-test; learning motivation from 2.2501 to 4.2651; learning efficiency from 2.0561 to 4.1658; and learning emotion from 2.2989 to 4.4102. The means of the experimental class thus increased considerably in all four dimensions, proving that the students improved in all four after being taught music with the method proposed in this paper. In terms of standard deviation, the pre- and post-test standard deviations of learning motivation fluctuate little and are relatively stable, suggesting that students’ choices were fairly consistent, while those of learning interest, learning efficiency, and learning emotion fluctuate more, indicating more dispersed choices. The paired-samples test shows that the significance of all four dimensions is 0.000, less than 0.05, proving that students differ significantly between pre-test and post-test, with large increases in the mean values of all four dimensions.
In summary, the students in the experimental class showed significant differences across the four dimensions of learning interest, learning motivation, learning efficiency, and learning emotion, and were more satisfied with the effect of music teaching using the method proposed in this paper. Specifically:
First, the students in the experimental class showed significantly greater interest in learning music. Teachers assisted instruction with the method proposed in this paper, strengthened students’ understanding of and ability to perceive music from multiple aspects, and used a variety of ways and means to enhance students’ interest in music-theory learning.
Secondly, the motivation of students in the experimental class to learn music was significantly enhanced. Teachers added multimedia technology to the four sections of music appreciation, music performance, music creation, and music-related culture. This was manifested in the form of pictures, audio, video, and animation.
Thirdly, the efficiency and effectiveness of the experimental-class students’ music learning improved significantly. Learning efficiency shows the largest pre/post difference in mean value, 2.1097, indicating that students improved most in this dimension, with great progress in music perception, rhythm and beat discrimination, and mastery of music-theory knowledge.
Fourth, students in the experimental class showed significant positive changes in their emotions related to music learning. Students reported that learning music makes them happy and that it enriches their emotions. The difference in the mean value of learning emotion is the second highest, trailing first place by a hair’s breadth, which indicates a great change in students’ emotions in music learning; students’ sensitivity to musical emotion also improved greatly.
Based on the perspective of the deep learning model DLMM, this paper explores the application of a lightweight score recognition method based on an improved CRNN in music education. For AI technology in music education, the paper proposes CRNN-lite, a lightweight score recognition method based on an improved CRNN, which alleviates the long single-iteration time and high total number of training iterations, and improves the recognition rate of distorted scores. To migrate the method to music education, the article further proposes DLMM, a multimodal domain adaptation algorithm based on differential learning. Experimental results show that DLMM has a clear advantage in recognition efficiency, improving on the CRNN-Lite algorithm by 12.5%. After music was taught with the combined methods of this paper, the students in the experimental class became significantly more efficient and effective in music learning, with the mean value of learning efficiency improving by 2.1097 from pre-test to post-test.
