Open access

Research on Deep Learning Model Training and Cross-Ethnic Cultural Interaction Mechanism of Daur Music Inheritance in Heilongjiang Province

  
23 Sep. 2025

Introduction

Daur traditional music is an important part of Daur traditional culture. It records the history and folk customs of the Daur ethnic group and therefore has considerable historical and cultural significance [1-2]. Passing on Daur traditional music not only protects the Daur language but also deepens people’s knowledge of Daur folk customs and cultural values, and it helps promote social harmony and progress [3-4].

Folk songs bear the marks of the periods in which they arose, so Daur folk customs and the psychological characteristics of the people are fully reflected in the folk songs of each period [5-6]. Daur people usually use music to record their emotions, and their creative material is drawn from life; it reflects people’s lives and is a crystallization of their wisdom [7-8]. Compared with the traditional music of other ethnic groups, Daur traditional music is more approachable, which is closely tied to its distinctive manner of performance. The traditional music of other ethnic groups is usually performed in a theater or on a large stage, which makes it more dramatic [9-11]. Aesthetic simplicity is another cultural characteristic of Daur traditional music, and it is manifested in three aspects: the simplicity of the lyrics, of the song structure, and of the melody [12-13]. As for the lyrics, most traditional Chinese music is highly artistic, with ornate lyrics rich in poetic meaning. Daur traditional music is the opposite: it pays more attention to depicting scenes and expressing emotions, and its lyrics are relatively concise and plain [14-15].

In recent years, through the joint efforts of all sectors of society, the protection and inheritance of Daur traditional music has gradually gained favorable momentum [16]. Government departments at all levels, cultural departments, institutions of higher learning, and the general public have all taken part in the inheritance and protection of intangible cultural heritage, trying out various forms and modes of protection and inheritance [17-18]. Protection and inheritance carried out through colleges and universities focuses on passing on national music culture and cultivating talent. Protection and inheritance carried out through intangible cultural heritage workshops extends the space for living inheritance by organizing training courses on intangible cultural heritage projects, cultural and artistic performances, and other activities, and by creating an atmosphere that favors the protection and inheritance of intangible cultural heritage [19-21].

To address the problem of passing on Daur music in Heilongjiang Province, this paper proposes a solution based on deep learning. The study first analyzes the cultural characteristics of Daur music and the mechanism of inter-ethnic cultural interaction, and explores a theoretical path for the inheritance and development of Daur music in Heilongjiang Province. On this basis, it proposes a deep learning-based music generation model. To counter the training instability of the original generative adversarial network in music generation, the model is optimized with WGAN-GP and an attention-based similarity matching module, yielding the MIDI-GAN Daur music generation model. Finally, the model is trained and tested, and the generated Daur music is evaluated both subjectively and objectively.

Daur Music of Heilongjiang Province

The Daur ethnic group is one of China’s least populous minority groups and one of the most representative ethnic groups of northern China. The Daur live in relatively concentrated communities, mainly in Heilongjiang Province, the Inner Mongolia Autonomous Region, and the Xinjiang Uygur Autonomous Region; Qiqihar City in Heilongjiang Province is an important area of settlement. Daur traditional music is an important part of Daur traditional culture: it records the history and folk customs of the whole ethnic group and has considerable historical and cultural significance. Passing it on not only protects the Daur language but also deepens people’s knowledge of Daur folklore and customs and their recognition of its cultural values, and it helps promote social harmony and progress.

Cultural identity of Daur music

As shown in Figure 1, the cultural characteristics of Daur music are closeness to life, collectivity, approachability, and simplicity, while its transmission is characterized by oral tradition and variability.

The Daur people usually use music to record their emotions, and their creative material is drawn from life; it reflects people’s lives and is a crystallization of their wisdom. Daur traditional music covers a relatively wide range of subjects: songs depicting everyday life and work, such as “logging song” and “fishing”; songs depicting landscapes, such as “spring is coming” and “red flowers on the slopes of the mountains”; and songs documenting the customs of the ethnic group. The material of these songs comes from daily life and has a very strong flavor of life. In addition, Daur traditional music is often created collectively and shows the life of the whole group rather than that of a single individual. One example is the “Ruri Geler” category of Daur traditional music, which mainly depicts Daur songs, dances, and boxing scenes that are closely related to, and epitomize, the daily life of the Daur people.

Compared with the traditional music of other ethnic groups, Daur traditional music is more approachable, which is closely tied to its distinctive manner of performance. Apart from religious ceremonial music (Yadegenino), which must be performed on specific occasions, Daur traditional music is generally performed in everyday life. This free and casual manner of performance makes Daur traditional music more approachable than other ethnic music.

Aesthetic simplicity is one of the cultural characteristics of Daur traditional music, and its simplicity is generally manifested in the simplicity of the lyrics, the simplicity of the song structure and the simplicity of the melody.

The Daur are a relatively unusual ethnic group in that they have their own language but no corresponding written script. The lack of a script, together with linguistic diversity, means that Daur traditional music can be passed on only orally. Different people live in different environments, so their life experience and musical ability differ. In the process of transmission, learners therefore usually cannot fully understand and appreciate the emotions that the original performers wanted to express, and Daur traditional music undergoes certain changes as a result. Daur traditional music is also improvisational to some degree, which often alters its rhythm and melody and further amplifies these changes.

Figure 1.

The Daur music culture

Inter-ethnic cultural interaction mechanisms for the Daur community

The relationship between the economy and ethnic culture in minority areas is one of mutual penetration and mutual promotion, and reasonable adjustments can be made to bring them into balance.

The relationship between the economy and ethnic culture in ethnic minority regions is shown in Figure 2. Growth of the overall economy in these regions, accelerating industrialization and urbanization, and improving material living conditions have led to an increasing convergence of consumption patterns, lifestyles, and behaviours, and in turn to the internationalization of languages, clothing, diets, architecture, education, and knowledge systems. Economic development inevitably requires the introduction of advanced science, technology, and management methods from outside, while disseminating and applying science and technology requires uniform academic symbols, calculation formulas, computer languages, and so on; the institutions that manage and serve economic activities must likewise be unified. Minority regions are thus adopting external technical standards, norms, and forms of economic organization. At the same time, the scope of use of traditional languages and scripts is gradually shrinking, and the opportunity cost for ethnic minorities of receiving education in their own ethnic language and culture is gradually increasing. This ultimately makes ethnic culture harder to transmit and gradually reduces cultural diversity. Economic development in ethnic areas further promotes exchange and communication among ethnic groups, gradually weakening the independence and integrity of ethnic cultures. The distribution of China’s ethnic minorities, with many groups living intermingled and few living in compact communities, determines the complex ethnic structure of ethnic areas. National autonomous areas are all inhabited by ethnic minorities, but no autonomous area is inhabited only by the ethnic groups exercising autonomy.
In China, there are autonomous areas in which one ethnic group exercises autonomy, and there are also areas in which several ethnic groups exercise joint autonomy. In either case, however, they are inhabited by other nationalities in addition to those practicing autonomy. As ethnic areas develop economically, the independence and integrity of ethnic cultures gradually diminish as ethnic groups interact with and influence one another.

Figure 2.

The economic and ethnic cultural relations of ethnic minorities

The interactive development of the economy and ethnic culture in ethnic minority areas is an inevitable requirement for the comprehensive and sustainable economic and social development of ethnic areas, and is a new development concept, new development idea and new development strategy for promoting regional development. At the same time, the interactive development of the economy and national culture in ethnic minority areas has a realistic foundation and institutional guarantee, which is not only very necessary but also feasible.

As for the cultural interaction mechanism of the Daur ethnic group, on the one hand the state provides an institutional guarantee for the interactive development of ethnic culture: ethnic autonomous areas are administrative regions of the country that enjoy the right to regional ethnic autonomy. Regional ethnic autonomy means that, under the unified leadership of the State, places inhabited by ethnic minorities practise regional autonomy, set up organs of self-governance, and exercise the right of self-governance. The implementation of regional ethnic autonomy has played an enormous role in strengthening relations of equality, solidarity, and mutual assistance among all ethnic groups, safeguarding national unity, accelerating the development of ethnic autonomous areas, and promoting the progress of ethnic minorities. On the other hand, there is a realistic basis for the interactive development of the economy and culture of ethnic minority areas: Daur regions have recognized the advantages of developing the economy and ethnic culture together and have begun to explore ways of making the two interact effectively.

The scope of the mechanism of cross-ethnic cultural interaction and development involves the state, the ethnic regions and the ethnic minority groups. If the mechanism at the national level is regarded as a macro-mechanism, then the mechanism at the level of ethnic regions is a meso-mechanism, and the mechanism at the level of ethnic minority groups is a micro-mechanism.

The macro-mechanism at the national level operates a number of favourable mechanisms, mainly in the form of autonomous arrangements, management and development of economic construction and autonomous development of education, science and technology, culture and other social undertakings. To a large extent, it has refined the regulations on the arrangement of infrastructure projects, resource development and ecological protection, financial transfers and cultural and educational development in the economic development of ethnic regions. As a result, the development mechanisms at this level are generally more complete.

As the main body of the meso-level mechanism, the national autonomous regions have introduced relevant laws and regulations and established relevant mechanisms for the economic development and cultural inheritance of the national regions, but their operational effectiveness needs to be improved. In terms of economic development mechanisms, the enjoyment of special preferential policies of the state has run relatively smoothly, while there are deficiencies in the mechanisms for independent innovation based on the characteristics of the region and the implementation or termination of certain policies that are not suitable for ethnic regions. In terms of cultural development mechanisms, sufficient attention has been paid to the development of national cultural resources, which is conducive to short-term economic growth, while the issue of national cultural inheritance, which has a bearing on long-term development, has not received the attention it deserves.

The micro-mechanism is a deeper and more specific operating mechanism than the macro- and meso-mechanisms. It adjusts the type of economic life of ethnic minorities according to the orientation set by the macro- and meso-mechanisms. At the same time, the different aspirations of ethnic minorities for affluence and spiritual fulfillment form new micro-mechanisms that act back on the macro- and meso-mechanisms. At present, the guidance that the macro- and meso-mechanisms give to the micro-mechanisms is realized well, whereas the micro-mechanisms have fewer ways of acting back on the macro- and meso-mechanisms. Minority people should be given more opportunities to express their aspirations in order to improve the macro- and meso-mechanisms and to promote the interactive development of the economy and national culture in minority areas.

Deep Learning-Based Generative Modeling of Daur Music

This paper constructs a deep learning-based Daur music generation model to meet the need for digital inheritance of Daur traditional music and to support Daur music creation in the information age. The aim is to protect the Daur language, deepen people’s knowledge of Daur folklore, customs, and cultural values, and maintain an environment in which Daur music can develop and be passed on.

Music Generation Model
Generative Adversarial Networks

Generative models are an important class of deep learning models that produce new data resembling the data they were trained on. A Generative Adversarial Network (GAN) is a deep learning model composed of two neural networks that learn by competing against each other. Because a GAN can learn from a dataset and generate new data that follows the distribution of the original data, GANs have been widely used in recent years for music generation, text generation, image generation, and similar tasks. As shown in Figure 3, a GAN consists mainly of a generator network and a discriminator network. The generator extracts features from its input and generates new data samples accordingly, while the discriminator judges whether its input is real or generated. Through continued adversarial training, the two networks drive each other to improve: the generator adjusts its parameters so that the generated data moves closer to the distribution of real data, and the discriminator becomes better at telling real data from generated data. This competition enables the GAN to approximate the real data distribution progressively and to generate high-quality new data.

Figure 3.

Generative adversarial network model

During training, the generator and the discriminator are optimized against each other. The discriminator is trained to improve its ability to identify generated (fake) data, i.e., to maximize the rate at which it classifies samples correctly. The generator is trained to produce data as close as possible to the real data so that the discriminator cannot distinguish generated data from real data, i.e., to maximize the probability that the discriminator misclassifies generated data. The optimization objective of the generative adversarial network, written in hinge form, is:

$L_{adv}(D ; G)=-E_{x}\left[\min (0,-1+D(x))\right]-E_{z}\left[\min (0,-1-D(G(z)))\right]$

$L_{adv}(G ; D)=-E_{z}\left[D(G(z))\right]$

Where, z denotes the data input to the generator, G denotes the generator and D denotes the discriminator. In the training network phase, the generator generates data from the input data, and the discriminator receives the data generated by the generator and the real data and outputs a discriminant value (true or false). In the network generation phase, the generative adversarial network will keep only the generator and generate new data from the input of the generator.
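As a rough illustration of how the two sides of a hinge-form adversarial objective are evaluated, the following minimal Python sketch computes both losses from lists of discriminator scores. The function names `d_hinge_loss` and `g_hinge_loss` and the example scores are assumptions made for illustration; they do not come from the paper.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator side of a hinge-form objective (illustrative sketch).

    Computes -E[min(0, -1 + D(x))] - E[min(0, -1 - D(G(z)))],
    where the expectations are replaced by batch averages.
    """
    real_term = sum(min(0.0, -1.0 + s) for s in real_scores) / len(real_scores)
    fake_term = sum(min(0.0, -1.0 - s) for s in fake_scores) / len(fake_scores)
    return -real_term - fake_term


def g_hinge_loss(fake_scores):
    """Generator side: -E[D(G(z))], again as a batch average."""
    return -sum(fake_scores) / len(fake_scores)


# Hypothetical discriminator outputs for a batch of real and generated samples.
confident = d_hinge_loss(real_scores=[2.0, 1.5], fake_scores=[-2.0, -1.0])
undecided = d_hinge_loss(real_scores=[0.0], fake_scores=[0.0])
print(confident, undecided, g_hinge_loss([1.0, 3.0]))
```

In this sketch the discriminator loss vanishes once real samples score at least 1 and generated samples score at most -1, while the generator loss falls as the discriminator scores the generated samples higher.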

Transformer

Music is a temporal art with strong dependencies between earlier and later events on the time scale. Because recurrent neural networks handle temporal data well, many scholars have adopted them as the basic model for music generation. However, owing to its recursive structure, a recurrent neural network can pass information to the next time step only after processing the current one, so its computation cannot be parallelized and training is inefficient. Moreover, as the sequence grows longer, information may be lost as it is passed along, so the recurrent neural network fails to capture long-distance dependencies.

The Transformer model uses an attention mechanism to model the data, which addresses both the lack of parallel computation and the vanishing gradients of recurrent neural networks, and it achieves superior performance in tasks such as natural language processing, music generation, and machine translation. The Transformer is a sequence-to-sequence model consisting of an encoder and a decoder. The encoder is a stack of layers, each combining a multi-head self-attention mechanism with a feed-forward neural network and applying a residual connection and normalization. Unlike the encoder, the multi-head self-attention mechanism of the decoder applies a mask that prevents each position from attending to subsequent positions. In addition, the decoder has an extra multi-head cross-attention mechanism between the multi-head self-attention mechanism and the feed-forward neural network, which receives the information passed by the encoder.
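The masking step in the decoder can be made concrete with a small sketch. The pure-Python function below implements single-head scaled dot-product self-attention with a causal mask. As a simplifying assumption, the query, key, and value projections are taken to be the identity, so it illustrates only the masking idea and is not a full Transformer layer; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head self-attention over x (T vectors of dimension d).

    Assumption: identity projections for queries, keys, and values.
    Position i may only attend to positions j <= i.
    """
    T, d = len(x), len(x[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for i in range(T):
        scores = []
        for j in range(T):
            if j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                scores.append(sum(a * b for a, b in zip(x[i], x[j])) / scale)
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * x[j][k] for j in range(T)) for k in range(d)])
    return outputs, weights


# Three time steps with two features each (toy values).
out, attn = causal_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for row in attn:
    print([round(v, 3) for v in row])
```

Each printed row is the attention distribution of one position; entries above the diagonal are zero because of the mask.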

WGAN and WGAN-GP

Because generative adversarial networks suffer from unstable training and possible mode collapse, which degrade the quality of generated samples, the Wasserstein GAN (WGAN) was proposed. The most significant improvement of WGAN is that it adopts the Wasserstein distance to measure the gap between the real and generated data distributions; compared with the KL divergence and the JS divergence, the Wasserstein distance is smoother. In theory it therefore alleviates the problem of vanishing or exploding gradients and helps the network generate higher-quality samples. The Wasserstein distance is defined as:

$W\left(P_r, P_g\right)=\inf _{\gamma \in \Pi\left(P_r, P_g\right)} E_{(x, y) \sim \gamma}\left[\|x-y\|\right]$

where $P_r$ is the distribution of real samples, $P_g$ is the distribution of generated samples, and $\Pi(P_r, P_g)$ is the set of all possible joint distributions of $P_r$ and $P_g$. For any joint distribution $\gamma$ in this set, a real sample $x$ and a generated sample $y$ can be drawn from it, and $\|x-y\|$ is the distance between the two samples, so the expected sample distance under that joint distribution is $E_{(x, y) \sim \gamma}[\|x-y\|]$. The Wasserstein distance is the infimum (greatest lower bound) of this expectation over all possible joint distributions. Since the Wasserstein distance cannot be solved directly in this form, WGAN rewrites it as follows:

$W\left(P_r, P_g\right)=\frac{1}{K} \sup _{\|f\|_L \leqslant K} E_{x \sim P_r}[f(x)]-E_{x \sim P_g}[f(x)]$

This equation introduces the Lipschitz continuity condition. The condition requires that there exist a constant $K \geqslant 0$ such that any two points $x_1$ and $x_2$ in the domain of the function $f$ satisfy the following inequality, where $K$ is referred to as the Lipschitz constant of $f$:

$\left|f\left(x_1\right)-f\left(x_2\right)\right| \leqslant K\left|x_1-x_2\right|$

In WGAN, the Wasserstein distance is thus obtained by taking the supremum of $E_{x \sim P_r}[f(x)]-E_{x \sim P_g}[f(x)]$ over all functions $f$ whose Lipschitz constant does not exceed $K$ (written $\|f\|_L \leqslant K$) and dividing by $K$. If a family of functions $f_\omega$ parameterized by $\omega$ is used to represent the functions that may satisfy the condition, the approximate solution is expressed as follows:

$K \cdot W\left(P_r, P_g\right) \approx \max _{\omega:\left\|f_\omega\right\|_L \leqslant K} E_{x \sim P_r}\left[f_\omega(x)\right]-E_{x \sim P_g}\left[f_\omega(x)\right]$

However, the above equation is still subject to the restriction $\|f_\omega\|_L \leqslant K$. To enforce it, WGAN introduces a weight clipping strategy: every parameter $\omega_i$ of the neural network is restricted to the range $[-c, c]$, where $c$ is a fixed positive constant. This keeps the derivative $\partial f_\omega / \partial x$ with respect to the input sample $x$ within a bounded range as well, which guarantees that some constant $K$ exists such that $\|f_\omega\|_L \leqslant K$. Estimating the Wasserstein distance then amounts to maximizing $E_{x \sim P_r}[f_\omega(x)]-E_{x \sim P_g}[f_\omega(x)]$ with respect to $\omega$, and the generator is trained to minimize this estimate.
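A minimal sketch of the clipping step and of the quantity being maximized is given below. It assumes a toy linear critic $f_\omega(x)=\omega \cdot x$ so that everything stays in plain Python; the helper names and the value `c = 0.01` are illustrative choices, not details taken from the paper.

```python
def clip_weights(weights, c):
    """Restrict every parameter to the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


def critic(x, weights):
    """Toy linear critic f_w(x) = w . x (an assumption for illustration)."""
    return sum(w * xi for w, xi in zip(weights, x))


def wasserstein_estimate(real_batch, fake_batch, weights):
    """Batch estimate of E_{x~Pr}[f_w(x)] - E_{x~Pg}[f_w(x)]."""
    mean_real = sum(critic(x, weights) for x in real_batch) / len(real_batch)
    mean_fake = sum(critic(x, weights) for x in fake_batch) / len(fake_batch)
    return mean_real - mean_fake


c = 0.01  # illustrative clipping threshold
w = clip_weights([0.5, -0.3, 0.005], c)
print(w)
print(wasserstein_estimate([[1.0, 1.0, 0.0]], [[0.0, 0.0, 0.0]], w))
```

For this linear critic the gradient with respect to the input is simply the weight vector, so clipping each weight to $[-c, c]$ bounds its norm and hence the Lipschitz constant. The same example also shows the drawback of clipping: weights that are pushed outward all end up sitting exactly on the boundary values.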

WGAN improves training stability by introducing the Wasserstein distance and reduces the occurrence of problems such as mode collapse. However, the weight clipping strategy proposed in WGAN has some drawbacks. On the one hand, it causes the parameters to concentrate: the weights learned by the network pile up at the two clipping boundaries, which reduces the discriminator’s generalization and discrimination ability. On the other hand, when the discriminator network is deep, WGAN may still suffer from exploding or vanishing gradients. For this reason, WGAN-GP was proposed.

To address the shortcomings of WGAN, WGAN-GP replaces the weight clipping strategy with a gradient penalty. The basic idea is to constrain the gradient of the discriminator directly instead of limiting its weights. Specifically, the gradient penalty penalizes the discriminator according to the norm of its gradient with respect to its input: the further this norm deviates from 1, for example because the discriminator is overly sensitive to changes in the input, the larger the penalty. The gradient penalty is given by:

$GP\_Loss=\lambda \cdot E_{x \sim P_{data}(x), \varepsilon \sim U[0,1]}\left[\left(\left\|\nabla_x D(\varepsilon \cdot x+(1-\varepsilon) \cdot G(z))\right\|_2-1\right)^2\right]$

where $\lambda$ is the coefficient of the gradient penalty, which controls its weight in the total loss, $x$ is a sample drawn from the real data distribution $P_{data}(x)$, $\varepsilon$ is a random value drawn from the uniform distribution $U[0,1]$, $z$ is random noise, $G(z)$ is the sample produced by the generator, and $\nabla_x D(\varepsilon \cdot x+(1-\varepsilon) \cdot G(z))$ is the gradient of the discriminator with respect to its input, evaluated at the interpolated sample.

By adding a gradient penalty to the loss function, the training stability of WGAN is further improved and higher quality samples can be generated. In the model presented in this paper, WGAN-GP is used to improve the training stability and generate higher quality music samples.
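The gradient penalty term can be sketched in a few lines. A real discriminator is a neural network whose input gradient would be obtained by automatic differentiation. To stay self-contained, the sketch below assumes a toy critic $D(x)=\frac{1}{2} \sum_i w_i x_i^2$, whose input gradient $w_i x_i$ is known in closed form. The function names and the value `lam = 10.0` are illustrative assumptions.

```python
import math
import random


def critic_input_grad(x, w):
    """Closed-form input gradient of the toy critic D(x) = 0.5 * sum(w_i * x_i**2)."""
    return [wi * xi for wi, xi in zip(w, x)]


def gradient_penalty(real_batch, fake_batch, w, lam=10.0, rng=random):
    """lam * mean((||grad_x D(x_hat)||_2 - 1)^2) over interpolated samples x_hat."""
    total = 0.0
    for x, g in zip(real_batch, fake_batch):
        eps = rng.random()  # eps drawn uniformly from [0, 1)
        x_hat = [eps * xi + (1.0 - eps) * gi for xi, gi in zip(x, g)]
        grad = critic_input_grad(x_hat, w)
        norm = math.sqrt(sum(v * v for v in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)


random.seed(0)
real = [[1.0, 0.0], [0.0, 1.0]]
fake = [[2.0, 0.0], [0.0, 3.0]]
print(gradient_penalty(real, fake, w=[1.0, 1.0]))
```

The penalty is zero only when the gradient norm equals 1 at every interpolated point, so adding it to the discriminator loss encourages the 1-Lipschitz behaviour that weight clipping enforced more crudely.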

Daur music generation model construction

The overall framework of the music generation model is shown in Fig. 4. It consists of two parts: the WGAN-GP-based music generation module and the attention-based similarity matching module. The WGAN-GP-based Daur music generation module converts random noise into a music coding matrix, which is in turn converted into a playable MIDI file. To improve the stability of the model during training, the Wasserstein distance and a gradient penalty are also introduced. The attention-based similarity matching module uses the self-attention mechanism to match the coding matrices of the generated music samples against those of the real music samples and adds the matching result to the generator’s loss function. This gives the generator additional guidance and directs it to generate music clips that sound similar to the real music samples.

Figure 4.

Music generation model

The model works as follows. The generator receives a random noise vector as input and generates the coding matrix of a music sample, which is then passed to the discriminator for recognition. The generator’s goal during training is to make its samples more likely to be recognized by the discriminator as real Daur music samples. Because a similarity matching module based on the self-attention mechanism is also introduced, the generator has a second goal during training: to generate music samples that match the real music samples more closely, which further improves the quality of the generated music. The discriminator takes as input the music coding matrix generated by the generator and the coding matrix of real music. Its goal during training is to distinguish generated music samples from real ones as well as possible, so as to guide the optimization of the generator.

The Daur music generation module uses WGAN-GP as its main framework. The Wasserstein distance is introduced to measure the difference between generated and real music samples: the discriminator is trained to estimate this distance, and the generator is trained to minimize it. To keep the decision boundary of the discriminator smooth and to improve training stability, the discriminator outputs a scalar and a linear loss function is used. In addition, so that the discriminator learns the data distribution of real music samples rather than only the generator’s strategy, a gradient penalty is introduced into the model. The penalty encourages the discriminator to satisfy the Lipschitz continuity condition and keeps its gradients consistent, which helps stabilize the training process.

The generative model can assist the creation of Daur music, and it provides new ideas for the development of Daur music while creating a good digital inheritance environment for Daur music.

Experiments in generative modeling of music of the Daur ethnic group

The dataset used for the experiments consists of 1200 single-track monophonic MIDI samples of Daur music collected from the web. The MIDI data were normalized and cut into fixed-length music samples, which were then encoded as two-dimensional sequences and fed into the model for training. Each input and each generated sample is 5 bars long. The time signature was standardized to 4/4, the smallest note unit was set to a sixteenth note, and the sampling interval was set to 0.125 s, i.e., 8 minimum note units per second, so all the music data were segmented into 5-bar segments of 80 time steps. With the help of pretty_midi, a MIDI data processing package for Python, the segments were multi-hot encoded and converted into 2D note sequences. Each 2D note sequence can be written as $x \in\{0,1\}^{h \times w}$, where $h$ indexes the MIDI note (pitch), whose range within MIDI is 0-127, and $w$ indexes the time step. After preprocessing, each segment is a 128 × 80 matrix whose elements indicate whether the corresponding note is sounding. The final training dataset contains a total of 8600 fixed-size 2D note sequences.
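The encoding described above can be sketched without any MIDI library. The following Python example assumes that the notes of one segment have already been extracted as hypothetical `(pitch, start_seconds, end_seconds)` tuples, and it builds the corresponding 128 × 80 binary matrix on a 0.125 s grid. The constant and function names are illustrative and are not taken from the authors’ code.

```python
STEPS_PER_BAR = 16          # sixteenth-note grid in 4/4 time
BARS_PER_SEGMENT = 5
SEGMENT_STEPS = STEPS_PER_BAR * BARS_PER_SEGMENT   # 80 time steps
N_PITCHES = 128             # MIDI pitch range 0-127
STEP_SECONDS = 0.125        # duration of one grid step


def encode_segment(notes):
    """Convert (pitch, start_sec, end_sec) tuples into a 128 x 80 binary matrix.

    roll[pitch][t] is 1 if the note with that pitch sounds during step t.
    Notes extending past the segment are truncated at its end.
    """
    roll = [[0] * SEGMENT_STEPS for _ in range(N_PITCHES)]
    for pitch, start, end in notes:
        if not 0 <= pitch < N_PITCHES:
            raise ValueError(f"pitch {pitch} is outside the MIDI range 0-127")
        first = max(int(round(start / STEP_SECONDS)), 0)
        last = min(int(round(end / STEP_SECONDS)), SEGMENT_STEPS)
        for t in range(first, last):
            roll[pitch][t] = 1
    return roll


# Two quarter notes (0.5 s each at this grid): middle C followed by D.
roll = encode_segment([(60, 0.0, 0.5), (62, 0.5, 1.0)])
print(len(roll), len(roll[0]), sum(roll[60]), sum(roll[62]))
```

With a step of 0.125 s, the 80 steps of one segment span 10 s of music, which corresponds to five 4/4 bars at the standardized tempo.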

The negative penalized discriminator loss in WGAN is negatively correlated with generation quality and with the degree of convergence; that is, the smaller the loss value, the better the generated samples. For comparison with MIDI-GAN, this paper also trains a model in which the multi-scale feature extraction module is replaced by a convolutional neural network, with all other parameters identical. The negative penalized discriminator loss of both models is computed at each training step.

The comparison is shown in Fig. 5, which plots the loss curves of the two models over 100k training steps. With the multi-scale feature extraction module, the music generation model reaches a lower negative penalized discriminator loss and converges faster, which indicates that the module strengthens the feature extraction ability of the discriminator and improves generation quality. The convergence of MIDI-GAN slows down after about 25k training steps, at which point the model begins to generate regular music. It converges completely after 55k steps and then generates the best results.

Figure 5.

Comparison of loss values of model training

In addition, the model was trained with the gradient penalty strategy and with the weight clipping strategy in order to compare the weight distributions of the discriminator. The results are shown in Fig. 6, with the model of this paper on the left and the original WGAN model using weight clipping on the right. With weight clipping, the discriminator weights clearly pile up at the clipping boundaries, whereas the discriminator weights obtained with the gradient penalty are evenly distributed. An even weight distribution is more helpful for the generalization and robustness of the model. MIDI-GAN enforces the Lipschitz constraint through the gradient penalty, which better prevents the model from overfitting.

Figure 6.

The weight distribution of the discriminator under different strategies
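
The two strategies compared in Fig. 6 can be illustrated with a minimal NumPy sketch. It is not the MIDI-GAN discriminator: the toy critic D(x) = v · tanh(Wx), the penalty coefficient lam = 10 and the clipping bound c = 0.01 are all assumptions chosen for illustration, and the input gradient is written analytically so that no autodiff library is needed.

```python
import numpy as np

def critic(x, W, v):
    """Toy critic D(x) = v . tanh(W x) for a batch x of shape (B, d)."""
    return np.tanh(x @ W.T) @ v

def critic_input_grad(x, W, v):
    """Analytic gradient of D with respect to each input row, shape (B, d)."""
    h = np.tanh(x @ W.T)             # hidden activations, shape (B, k)
    return ((1.0 - h ** 2) * v) @ W  # chain rule through tanh

def gradient_penalty(real, fake, W, v, lam=10.0, rng=None):
    """Penalise deviation of the input-gradient norm from 1 on interpolates."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake  # points between real and fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, W, v), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

def clip_weights(W, c=0.01):
    """Weight clipping alternative: force every weight into [-c, c]."""
    return np.clip(W, -c, c)
```

The penalty is a soft constraint added to the discriminator loss, so weights remain free to spread out, while clipping is a hard projection that moves every out-of-range weight exactly onto the boundary. This is consistent with the pile-up at the clipping edges described above.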

Training speed is also an important index of model performance, and MIDI-GAN can be trained to convergence faster because it uses optimization strategies such as residual connections and the gradient penalty. The current mainstream music generation models MusicVAE and SR-CNN-VAEGAN are chosen for a complexity comparison with MIDI-GAN. MusicVAE is a music generation model based on the variational autoencoder, while SR-CNN-VAEGAN is a hybrid model that combines the advantages of the variational autoencoder and the generative adversarial network. Both show better generation results than RNN models in the field of music generation. In the experiments, each model is trained five times on the same dataset and the results are averaged. The comparison is shown in Table 1. MIDI-GAN has 2.13M parameters, fewer than MusicVAE (4.62M) and SR-CNN-VAEGAN (12.11M). It also converges in fewer epochs (300, against 653 and 350) and needs less training time (6054 s, against 25154 s and 64542 s).

Table 1.

Model comparison

Model             Parameter quantity (Million)   Convergence number (epoch)   Training time (s)   Unit time (s)
SR-CNN-VAEGAN     12.11                          350                          64542               215.2
MusicVAE          4.62                           653                          25154               42.3
MIDI-GAN (Ours)   2.13                           300                          6054                21.5

Evaluation of the effectiveness of generating Daur music

It is difficult to judge Daur music with objective quantitative indexes alone. Unlike classification and prediction tasks, whose results can be measured with standard quantitative indexes, model-generated music needs to be evaluated subjectively by professionals on aspects such as fluency, completeness and emotional richness. Many objective evaluation indexes for music generation also exist at the present stage. They mainly reflect the similarity between generated and real music in terms of music theory, but they are less persuasive than subjective evaluation. Therefore, this paper combines subjective and objective evaluation to assess the quality of the music generated by the model comprehensively.

Objective assessment of the quality of music

Researchers have proposed a large number of objective indexes for judging the quality of generated music. To evaluate the generated music comprehensively, this paper assesses it objectively from four aspects:

Perplexity Level (PPL)

PPL is a commonly used index for evaluating language models. It reflects how well the content generated by the model matches the training set. A good language model generates content similar to the dataset, so the lower the PPL, the closer the generated music is to real music and the more effective the generative model. It is calculated as follows:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(x_i)\right)$$

where $N$ is the number of elements in the evaluated sequence and $P(x_i)$ is the probability that the model assigns to the $i$-th element $x_i$.
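
As a minimal sketch of the perplexity computation above, assuming the per-element probabilities P(x_i) have already been obtained from the model (the function name and the list input are illustrative choices, not part of the original experiments):

```python
import math

def perplexity(token_probs):
    """Perplexity: exp of the negative mean log-probability of the elements."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

A model that assigns probability 0.25 to every element has a perplexity of 4, up to floating-point error, which matches the intuition of choosing uniformly among four options. A model that is always certain and correct has a perplexity of 1.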

Pitch categories (UPC)

UPC indicates the number of different pitches used in a measure and can reflect the pitch diversity of the generated sample.
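
A minimal sketch of the UPC computation, under two assumptions that the text does not spell out: pitches are MIDI note numbers folded into the 12 pitch classes (pitch modulo 12), and the per-measure counts are averaged over all measures, with an empty measure counting as 0. The input format, a list of measures each holding a list of pitches, is hypothetical.

```python
def used_pitch_classes(measures):
    """Mean number of distinct pitch classes per measure.

    measures: list of measures, each a list of MIDI pitch numbers.
    """
    if not measures:
        raise ValueError("at least one measure is required")
    # Fold each pitch into its pitch class (C = 0, ..., B = 11) and count
    # the distinct classes that occur in every measure.
    counts = [len({pitch % 12 for pitch in measure}) for measure in measures]
    return sum(counts) / len(counts)
```

For example, a first measure with the pitches 60, 72 and 64 uses two pitch classes, because 60 and 72 are both C, so together with a second measure containing only 62 the mean is 1.5.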

Rhythmic Consistency (GC)

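GC refers to the degree of rhythmic similarity between neighboring sections and can be used to assess the rhythmic consistency of the music. The higher the GC value, the smoother and more powerful the music sounds. It is calculated as follows:

$$\mathrm{GC} = 1 - \frac{1}{T-1}\sum_{i=1}^{T-1} d\left(G_i, G_{i+1}\right)$$

where $T$ is the number of sections, $G_i$ is the rhythmic pattern of the $i$-th section, and $d(\cdot,\cdot)$ is the distance between two patterns.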
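
The GC definition above can be sketched as follows. The sketch assumes that each section is represented by a binary onset vector of equal length (1 where a note starts at that position, 0 otherwise) and that d is the Hamming distance divided by the vector length, so that GC lies between 0 and 1. Both choices are assumptions for illustration.

```python
import numpy as np

def groove_consistency(onsets):
    """GC = 1 - mean normalised Hamming distance between adjacent sections.

    onsets: array-like of shape (T, R), one binary onset vector per section.
    """
    patterns = np.asarray(onsets, dtype=bool)
    if patterns.ndim != 2 or patterns.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two sections")
    # Fraction of positions that differ between each pair of neighbours.
    distances = np.mean(patterns[:-1] != patterns[1:], axis=1)
    return 1.0 - float(np.mean(distances))
```

Two identical sections give a GC of 1, and two sections whose onset patterns differ at every position give a GC of 0.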

Empty Beat Rate (EBR)

EBR represents the proportion of empty beats (i.e., beats in which no note is played) to the total number of beats in a piece of music. The higher the EBR, the looser the rhythmic feel of the music; the lower the EBR, the stronger the rhythmic feel. EBR can therefore reflect the rhythmic variation of the music to a certain extent. The formula is as follows:

$$\mathrm{EBR} = \frac{\#(\text{empty beats})}{\#(\text{beats})}$$
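
A minimal sketch of the EBR computation, assuming notes are given as hypothetical (start, end) pairs measured in beats, with the end exclusive, and that a beat counts as non-empty whenever any note sounds during it:

```python
import math

def empty_beat_rate(notes, n_beats):
    """Fraction of beats in which no note sounds.

    notes: iterable of (start, end) times in beats, end exclusive.
    n_beats: total number of beats in the piece.
    """
    if n_beats <= 0:
        raise ValueError("n_beats must be positive")
    occupied = [False] * n_beats
    for start, end in notes:
        first = max(0, math.floor(start))
        last = min(n_beats, math.ceil(end))
        for beat in range(first, last):
            occupied[beat] = True  # some note sounds during this beat
    return occupied.count(False) / n_beats
```

With a four-beat piece and the notes (0, 1) and (2, 2.5), beats 0 and 2 are occupied and beats 1 and 3 are empty, so the EBR is 0.5.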

Each model generates a total of 100 pieces of Daur music, and the metrics are computed with the methods provided by the MusPy package, taking the arithmetic mean as the final result. To obtain the scores of the same indexes on the real dataset, 100 distinct samples were likewise randomly selected from the validation set. In addition, mLSTM and Transformer models are introduced for comparison. The objective metrics of the music generated by the different models, compared with the dataset, are shown in Table 2.

On the PPL metric, mLSTM and Transformer score relatively high, both exceeding SR-CNN-VAEGAN, MusicVAE and the model in this paper. This shows that the Transformer-based and mLSTM-based music generation models are less effective in dealing with long Daur music sequences. The model in this paper has the lowest PPL, 1.70, indicating that the music it generates is the closest to the music distribution of the real dataset and has a higher degree of realism. The model also maintains the best results on the three indicators UPC, GC and EBR, which indicates that its architecture is reasonable. Its UPC and EBR scores are the closest to those of the dataset and differ considerably from those of the other models. This indicates that the model generates music with better pitch diversity, rhythmic consistency and structural stability, and further shows that the generated music is closer to human composition.

Table 2.

Comparison of objective metrics for the music generated by different models

Model             PPL    UPC    GC     EBR
mLSTM             2.26   9.35   0.65   0.25
Transformer       1.79   9.27   0.87   0.19
SR-CNN-VAEGAN     1.76   9.34   0.84   0.18
MusicVAE          1.72   8.75   0.83   0.15
MIDI-GAN (Ours)   1.70   8.43   0.81   0.11

Subjective evaluation of generated music

Although many objective evaluation indexes have been proposed, music is a product of artistic creation and its evaluation still needs human participation. A work of art cannot be judged by hard quantitative indicators alone, and human subjective evaluation remains the most persuasive. For this reason, this paper designs subjective evaluation indexes to assess the generated music more comprehensively: (A) evaluative value, (B) arousal, (C) authenticity, (D) harmony, and (E) overall quality.

Before the experiment, 26 experts in the field of Daur music, thirteen male and thirteen female, were selected as raters. The participants rated the musical works on the five proposed indicators, on a scale from 1 (lowest) to 5 (highest). The subjective scores are visualized in Figure 7. The scores of the model in this paper on the individual indicators range from 3.52 to 4.21, and its overall quality score is 3.92, all higher than those of the other models. In subjective listening, the model in this paper is better than the other models: the generated music is more realistic and harmonious and the overall quality is better, so the proposed model performs better on the task of emotional music generation.

Figure 7.

Subjective score evaluation visual results
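
The paper does not state how the individual ratings are aggregated into the scores of Figure 7. A plausible minimal sketch, assuming a simple per-indicator mean over raters, is given below. The rating matrix layout (one row per rater, one column per indicator) is hypothetical.

```python
import numpy as np

INDICATORS = ["evaluative value", "arousal", "authenticity",
              "harmony", "overall quality"]

def mean_scores(ratings):
    """Average the 1-5 ratings of all raters for each indicator.

    ratings: array-like of shape (n_raters, 5), columns ordered as INDICATORS.
    """
    r = np.asarray(ratings, dtype=float)
    if r.ndim != 2 or r.shape[1] != len(INDICATORS):
        raise ValueError("expected one column per indicator")
    if r.min() < 1 or r.max() > 5:
        raise ValueError("ratings must lie between 1 and 5")
    # Mean over raters (rows), rounded to two decimals as in the reported scores.
    return dict(zip(INDICATORS, r.mean(axis=0).round(2).tolist()))
```

In the setting described above, the input would have 26 rows, one for each expert.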

Conclusion

This paper focuses on the music heritage of the Daur ethnic group and on music generation based on the generative adversarial network, improving the original generative adversarial network to construct the MIDI-GAN music generation model. The model improves training stability by introducing the Wasserstein distance and the gradient penalty strategy, and guides the generator toward samples that sound closer to real music through a similarity matching module based on the self-attention mechanism. In the experiments, the number of parameters, the number of convergence epochs, the training time and the unit time of the model are 2.13M, 300 epochs, 6054 s and 21.5 s, respectively, each of which is an advantage over the mainstream music generation models MusicVAE and SR-CNN-VAEGAN. The model achieves the best results in the objective evaluation of perplexity, pitch category, rhythmic consistency and empty beat rate. In the subjective evaluation, its scores on the individual indicators range from 3.52 to 4.21, with an overall quality score of 3.92, which is 0.17 points higher than that of the next-best model, SR-CNN-VAEGAN.

The training results and the subjective and objective evaluation experiments show that the model converges more stably during training and that the Daur music samples it generates are similar to real music samples. The generated music has a certain appreciation value and is of some use for music inheritance.