The role of musical feature extraction model MFE in multicultural music fusion innovation
Published online: 21 Mar 2025
Received: 12 Nov 2024
Accepted: 10 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0661
© 2025 Tianhang Li, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Music is a borderless art form in which elements from different cultural backgrounds collide and merge, forming a rich and colorful variety of musical styles and genres. Multicultural musical intermingling can be traced back to ancient times, when traders, travelers and conquerors moved between regions, bringing their own musical traditions with them [1-4]. Over time, these traditions merged with local music in new geographical environments, giving rise to new musical forms. Multicultural music fusion not only enriches musical expression but also promotes understanding and respect among different cultures.
Multicultural music fusion is an evolving process that reflects the diversity and inclusiveness of human society. Through music, people can feel the aesthetics and emotional expressions of other cultures, which enhances cultural exchange and resonance on a global scale [5-7]. In addition, the integration of cultural music provides musicians with a broader creative space, stimulating more creativity and inspiration [8-9]. Practical strategies for multicultural music integration include cultural identification, cultural experience, cultural exchange and cultural innovation. First, creators need to identify and understand the characteristics and values of different cultures and maintain an open attitude toward multiculturalism [10-11]. Cultures differ, and creators will encounter many of them during the creative process, sometimes finding them difficult to accept or understand; the attitude toward multiculturalism is therefore important for creating music. Second, creators experience the musical expressions of different cultures by participating in their musical activities and performances [12]; only by experiencing the charm of a culture first-hand can they create multicultural music well. Furthermore, creators can take part in cross-cultural music exchange, in which they learn from, borrow and integrate the musical elements of other cultures [13-15]. Exchange and learning are prerequisites for integration: just as in the collision of popular music and traditional culture, the characteristics of both sides must be grasped so that the integration process is not distorted. Finally, creators can practice innovation, combining the cultural elements they have learned to create new musical works and forms of expression [16-18]. However, both culture and music have their own unique features, and integrating them requires a method that can extract representative features from diverse cultural materials and audio data so that the fusion is smoother and more natural; musical feature extraction offers a good solution to this problem.
This paper takes the application of MFE in music composition as its main focus and analyzes multicultural integration in music composition. Audio features are extracted more effectively by exploiting the dynamic changes between audio frames: while TDNN extracts features from the acoustic input, a GRU iterates over the audio features frame by frame to learn the dynamic information, and a hybrid feature extraction method, MFE, combining GRU and TDNN is proposed. Transformer-XL is used as the basic music generation model, its Mask mechanism is improved, and MFE is applied to the improved Transformer-XL model to complete the construction of the music generation model. The multicultural fusion music generated by the model is analyzed along two dimensions: melody generation and arrangement generation. Melody generation is discussed in terms of overall performance, chord accuracy and phrase stops, while arrangement generation is analyzed through online audition and offline performance evaluation experiments.
Music is an art form created by human beings, and music creation is an important form of cultural exchange that continues to develop and evolve over time. With the strengthening of globalization and cultural exchange, music creation is constantly innovating, and the phenomenon of multicultural fusion is becoming more and more evident. This phenomenon is reflected in the following dimensions; it not only brings new artistic styles and ways of expression, but also reflects the diversity of cultures and the depth of their exchange.
Music style fusion is a common form of expression of multicultural fusion in music creation; it mainly refers to blending musical elements of different cultural styles to create new forms of music, for instance combining classical and popular music elements, or folk music and contemporary popular music.
Fusion of cultural elements is another important form of expression of multicultural fusion in music creation; it mainly refers to incorporating factors from different cultures into music to create works with a unique cultural charm, which helps the listener understand their cultural connotations in depth. Examples include combining traditional Chinese music elements with pop music, or Indian music elements with electronic music.
The fusion of art forms refers to combining music with other cultural and artistic forms to create works that are more expressive and infectious, highlighting the artistic value and unique charm of musical works. For instance, combining theater and music can improve the way emotions and stories are conveyed to the audience: in musicals, actors sing, dance and perform dialogue, which makes the whole work more dramatic and musical, and at the same time more vivid and touching.
Digital technology fusion is a newer expression of multicultural fusion that applies modern technology; it mainly refers to combining different musical elements through digital technology to create more novel and unique musical works and performance forms, for example using electronic music and sampling technology to integrate musical elements of different styles and eras. In this study, MFE technology is used to extract musical characteristics from different cultures, which are then used to construct intelligent models for music innovation and creation.
In recent research, many deep learning models have been used for music generation, and Transformer is one of the most prominent and popular generative models. In this paper, we use the Transformer-XL variant and incorporate the BERT training method so that Transformer-XL can also see bidirectional information, which improves its ability to generate music. In addition, instead of feeding the MIDI representation of music directly into the model for training, this paper uses the MFE method to extract multicultural music features and combines it with the improved Transformer-XL model, which not only prolongs the duration of the generated music but also improves its quality.
MFCC and FBank features are designed to simulate the human ear's sound perception mechanism and approximate the auditory response, and they have therefore achieved excellent results in many audio signal processing tasks, including audio recognition. However, this short-time feature extraction based on frame splitting truncates the continuity of the audio signal, which in turn affects the extraction of temporal information. Therefore, in this work the entire audio sequence is modeled automatically by neural networks to better capture its dynamic features.
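As an illustration of this conventional frame-based pipeline, the following minimal sketch computes FBank and MFCC features with librosa; the file path, sample rate and frame parameters are illustrative assumptions rather than the configuration used in this paper.

```python
# Minimal sketch of frame-based FBank and MFCC extraction (illustrative parameters).
import librosa

y, sr = librosa.load("example.wav", sr=16000)            # mono audio at 16 kHz (assumed path)

# Log-Mel filterbank (FBank) features: 40 Mel bands, 25 ms window, 10 ms hop
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel)                          # shape: (40, n_frames)

# MFCC features derived from the same log-Mel spectrogram
mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=13)           # shape: (13, n_frames)

print(fbank.shape, mfcc.shape)
```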
In this section, a hybrid feature extraction (MFE) method is designed based on GRU and TDNN, which utilizes GRU to learn the temporal dependencies between audio features and extract the dynamic features therein, and then combines them with the features extracted by TDNN to achieve the purpose of enhancing the dynamic information.
The internal structure of the GRU loop body contains two gating functions as well as a hidden state. The reset gate $r_t$ and the update gate $z_t$ are computed from the current input $x_t$ and the previous hidden state $h_{t-1}$:

$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$ (1)

$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$ (2)

where $\sigma(\cdot)$ is the Sigmoid function and $W_r$, $U_r$, $b_r$, $W_z$, $U_z$, $b_z$ are trainable parameters. The candidate hidden state $\tilde{h}_t$ is obtained after the reset gate filters the past information:

$\tilde{h}_t = \tanh\left(W_h x_t + U_h\left(r_t \odot h_{t-1}\right) + b_h\right)$ (3)

where $\odot$ denotes element-wise multiplication. The new hidden state $h_t$ is a weighted combination of the previous hidden state and the candidate state:

$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (4)

It can be seen that GRU controls the degree of forgetting the past information and updating the current information through the reset gate $r_t$ and the update gate $z_t$.
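For illustration, the gating equations (1)-(4) can be written directly as a single GRU step; the following NumPy sketch uses illustrative weight names and dimensions and is not the implementation used in this paper.

```python
# One GRU time step implementing equations (1)-(4); weights are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h):
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate, Eq. (1)
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate, Eq. (2)
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate state, Eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_cand                 # new hidden state, Eq. (4)

d_in, d_h = 40, 128
rng = np.random.default_rng(0)
params = [rng.standard_normal(s) * 0.01 for s in
          [(d_h, d_in), (d_h, d_h), (d_h,)] * 3]               # W, U, b for r, z, h
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), *params)
print(h.shape)  # (128,)
```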
TDNN is a feed-forward neural network for processing sequential data. Unlike a traditional fully connected network, it uses neurons with a time-delay structure: each neuron has a certain width, and its output depends only on the input within a certain time range. Each point of the TDNN represents a frame feature vector, and the neurons slide along the temporal direction with a certain stride to extract local features. During this sliding, the neuron weights are shared, which means that the neurons at each time step use the same weights to process the input data. This weight-sharing mechanism helps reduce the number of model parameters and improves the generalization ability of the model. As the network deepens, the temporal range connected by each TDNN layer gradually widens, so that a global feature representation is abstracted.
The width of a TDNN neuron is the size of the convolution kernel, and during computation the frames within its time span are weighted and summed by the parameters of the kernel. These weights are continuously adjusted by training to minimize the loss function and optimize model performance.
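As a minimal sketch, a TDNN layer can be realized as a one-dimensional convolution in PyTorch, where the kernel width is the temporal context of the neuron and the kernel weights are shared across time steps; the layer sizes below are illustrative assumptions, not the exact configuration used in this paper.

```python
# TDNN layers as dilated 1-D convolutions; deeper layers see a wider temporal context.
import torch
import torch.nn as nn

tdnn = nn.Sequential(
    nn.Conv1d(in_channels=40, out_channels=128, kernel_size=5, dilation=1),  # context of 5 frames
    nn.ReLU(),
    nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, dilation=2), # context widened by dilation
    nn.ReLU(),
)

frames = torch.randn(8, 40, 200)   # (batch, feature_dim, n_frames)
out = tdnn(frames)
print(out.shape)                   # (8, 128, n_frames - receptive_field + 1)
```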
The operations in these network layers are linear, and in order to fit real data the model needs to be capable of nonlinear transformations. Therefore, the output of the network layer is transformed nonlinearly by functions called activation functions. Common activation functions include the Sigmoid function, the ReLU function and the tanh function, as shown in Equation (5), Equation (6) and Equation (7):

$\mathrm{Sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$ (5)

$\mathrm{ReLU}(x) = \max(0, x)$ (6)

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (7)
These functions have different properties and map the input signal to different ranges through nonlinear transformations, thus increasing the expressive power and fitting ability of the network.
The structure of the MFE is shown in Fig. 1. The input acoustic features are fed in parallel into the GRU branch, which iterates frame by frame to extract the dynamic information, and the TDNN branch, which extracts local features; the outputs of the two branches are then combined to form the enhanced feature representation.

Figure 1. The structure of MFE
It should be noted that the time cost of the GRU is high because it must iterate time step by time step during computation. In the PyTorch framework, in order to improve the parallelization of the computation, the GRU processes a whole batch at the same time step simultaneously, and it therefore uses a different input data format than the one-dimensional convolutional layer. One-dimensional convolutional layers require input of shape (batch_size, feature_dim, n_frames), whereas the GRU by default expects (n_frames, batch_size, feature_dim), so the features have to be transposed accordingly when they are passed between the two branches.
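The following sketch illustrates the hybrid idea under stated assumptions: a GRU branch extracts the frame-by-frame dynamic information, a one-dimensional convolution plays the role of the TDNN branch, and the two outputs are concatenated. The layer sizes and the concatenation-based fusion are illustrative choices, not necessarily the exact configuration of the MFE model.

```python
# Hybrid GRU + Conv1d (TDNN-style) feature extractor; note the different input layouts.
import torch
import torch.nn as nn

class HybridFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.gru = nn.GRU(input_size=feat_dim, hidden_size=hidden)      # expects (T, B, F) by default
        self.tdnn = nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2)  # expects (B, F, T)

    def forward(self, x):                          # x: (batch, n_frames, feat_dim)
        dyn, _ = self.gru(x.transpose(0, 1))       # GRU branch over (n_frames, batch, feat_dim)
        dyn = dyn.transpose(0, 1)                  # back to (batch, n_frames, hidden)
        loc = self.tdnn(x.transpose(1, 2))         # Conv1d branch over (batch, feat_dim, n_frames)
        loc = loc.transpose(1, 2)                  # back to (batch, n_frames, hidden)
        return torch.cat([dyn, loc], dim=-1)       # fused features: (batch, n_frames, 2*hidden)

feats = HybridFeatureExtractor()(torch.randn(4, 200, 40))
print(feats.shape)                                 # torch.Size([4, 200, 256])
```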
The Transformer model consists of four parts, which are the input, output, encoder, and decoder parts.
The input part of the Transformer model consists of a data embedding layer and a positional encoding layer, and the output part consists of a linear layer and a Softmax layer. The data embedding layer transforms the input textual or numerical data into vector representations, and the positional encoding layer injects position information into the input data. The linear layer applies a linear transformation to the output of the previous step to project it to the required dimensionality. The Softmax layer normalizes the values along the last dimension of the vector into the range 0-1 so that they sum to 1, and the most suitable output at that time step is obtained according to these probabilities.
The biggest problem with Transformer is that it cannot model sequences longer than its maximum length; longer sequences must be truncated and the resulting fragments re-encoded. This is feasible, but it leads to context fragmentation: each fragment is modeled individually without sufficient contextual information from its neighbors. Transformer-XL, which is built on Transformer, introduces a segment-level recurrence mechanism and replaces absolute position encoding with relative position encoding to address context fragmentation and long-term dependency.
Like Transformer, Transformer-XL is fed fixed-length segments during training; the difference is that Transformer-XL caches the state of the previous segment and reuses the hidden-layer state of the previous time slice when computing the current segment. Because the features of the previous segment are reused in the current segment, Transformer-XL gains the ability to model long-term dependencies. The segment recursion is given by equation (8):

$\tilde{h}_{\tau+1}^{n-1} = \left[\mathrm{SG}\left(h_{\tau}^{n-1}\right) \circ h_{\tau+1}^{n-1}\right]$ (8)

where $h_{\tau}^{n-1}$ is the hidden state of segment $\tau$ at layer $n-1$, $\mathrm{SG}(\cdot)$ denotes stop-gradient (the cached state does not participate in back-propagation), and $[\cdot \circ \cdot]$ denotes concatenation along the length dimension.
Since different segments in Transformer share the same position encoding for the same position, the model cannot correctly distinguish the position information of different segments, so Transformer-XL replaces absolute position encoding with relative position encoding. Specifically, only the relative position relationship between the Query vector and the Key vector is considered when calculating the attention score, and this relative position term is added to the attention computation at each layer. The formula is shown in (9):

$A_{i,j}^{rel} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}$ (9)

where $E_{x_i}$ and $E_{x_j}$ are the embeddings of the query token $x_i$ and the key token $x_j$, $R_{i-j}$ is the relative position encoding, $W_q$, $W_{k,E}$ and $W_{k,R}$ are projection matrices, and $u$ and $v$ are learnable vectors that replace the absolute-position query terms.
Considering the unique "cloze" mechanism of the BERT model, this paper applies this training method to Transformer-XL, so that the improved Transformer-XL can clearly see both past and future information and its predictions fully exploit the "context". The Mask mechanism of the original Transformer-XL masks all information after the predicted position, so the model attends only to past information. In view of this, this paper improves the Mask mechanism so that during training it masks only the next moment, i.e. the position that is about to be predicted, and leaves the information from the following positions visible.
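A hypothetical sketch of the difference between the standard causal Mask and the improved cloze-style Mask described above is given below; the tensor convention (True marks a masked, non-attendable position) is an assumption for illustration, not the paper's exact implementation.

```python
# Standard causal mask vs. cloze-style mask that hides only the predicted position.
import torch

def standard_causal_mask(seq_len: int) -> torch.Tensor:
    # True = masked: every position after the query position is hidden
    return torch.ones(seq_len, seq_len).triu(diagonal=1).bool()

def cloze_mask(seq_len: int, target_pos: int) -> torch.Tensor:
    # Improved scheme: only the token that is about to be predicted is hidden,
    # so attention can use both past and future context (BERT-style cloze).
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, target_pos] = True
    return mask
```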
Since Transformer-XL combines Transformer and RNN model ideas and proposes a fragment recursion mechanism, which recurses the features of the previous fragment to the next fragment, the model’s past information includes not only all the information before the current token prediction moment, but also the information recursed from the previous fragment, which is also the information of the MEMORY token. Therefore, the model can make better use of past information when predicting, avoiding the problem of generating music with poor wholeness and fragmentation between bars.
In modeling music, the main framework of the model is based on Transformer-XL, to which the improved Mask mechanism is added, giving the model the ability to see bidirectional information. In the training process, the complete piece of music is divided into multiple segments, and the layer-$(n-1)$ input of segment $\tau+1$ is extended with the cached state of segment $\tau$:

$\tilde{h}_{\tau+1}^{n-1} = \left[\mathrm{SG}\left(h_{\tau}^{n-1}\right) \circ h_{\tau+1}^{n-1}\right]$ (10)

Among them, the features of the previous segment are gradient-stopped and serve only as memory; they are concatenated with the features of the current segment along the length dimension.

When the model is being trained, the information of each segment of the input music is represented by the Query vector, Key vector and Value vector:

$q_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_q^{\top}, \quad k_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_k^{\top}, \quad v_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_v^{\top}$ (11)

In this recursive self-attention computation, the Query vector is computed only from the features of the current segment without the recursion, while both the Key and Value vectors incorporate the features of the previous segment through the extended state $\tilde{h}_{\tau+1}^{n-1}$.

Since the improved Transformer-XL model also uses relative position encoding, the layer-$n$ hidden state of segment $\tau+1$ is obtained from the relative-position attention over these vectors:

$h_{\tau+1}^{n} = \mathrm{TransformerLayer}\left(q_{\tau+1}^{n}, k_{\tau+1}^{n}, v_{\tau+1}^{n}\right)$ (12)

where the attention score of each query-key pair is computed according to equation (9), and $W_q$, $W_k$, $W_v$ are the projection matrices of layer $n$.
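The segment recursion and the asymmetric treatment of Query, Key and Value can be sketched as follows; the dimensions and weight names are illustrative assumptions, and the relative-position terms of equation (9) are omitted for brevity.

```python
# Sketch of the segment-recurrence step: memory is gradient-stopped and concatenated,
# Queries come from the current segment only, Keys/Values from the extended sequence.
import torch
import torch.nn as nn

d_model = 256
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

h_prev = torch.randn(2, 64, d_model)                 # cached hidden states of segment tau
h_curr = torch.randn(2, 64, d_model)                 # hidden states of segment tau+1

h_ext = torch.cat([h_prev.detach(), h_curr], dim=1)  # SG(.) then concatenate along length

q = W_q(h_curr)                                      # queries: current segment only
k, v = W_k(h_ext), W_v(h_ext)                        # keys/values: memory-extended sequence
attn = torch.softmax(q @ k.transpose(1, 2) / d_model ** 0.5, dim=-1) @ v
print(attn.shape)                                    # torch.Size([2, 64, 256])
```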
Figure 2 shows the specific process of music generation. First, the music is converted into MIDI-format audio; then the MFE method transforms the musical features into specific events; these events are fed to the model for training; finally, music with multicultural fusion characteristics is obtained.

Figure 2. The process of music generation
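As an illustration of the first step in Figure 2, the following sketch reads a MIDI file with pretty_midi and flattens its notes into a simple time-ordered event list; the event format shown is an assumption for illustration, not the exact event vocabulary used by the MFE method.

```python
# Read a MIDI file and flatten its notes into time-ordered (event, time, pitch, velocity) tuples.
import pretty_midi

midi = pretty_midi.PrettyMIDI("song.mid")            # illustrative file path
events = []
for instrument in midi.instruments:
    for note in instrument.notes:
        events.append(("note_on", note.start, note.pitch, note.velocity))
        events.append(("note_off", note.end, note.pitch, 0))
events.sort(key=lambda e: e[1])                      # order events by time before training
print(events[:5])
```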
In order to verify the effectiveness of the proposed melody and arrangement generation framework, this paper conducts a large number of experiments on real music datasets. In this section, the experimental dataset, melody generation experimental results, and arrangement generation experimental results are analyzed in detail as shown below.
In this paper, extensive experiments were conducted on a realistic music dataset consisting of more than 30,000 digital music (MIDI) files. To avoid bias, incomplete music files, such as MIDI files missing necessary tracks, were removed. After extensive preprocessing, more than 10,000 multicultural fusion songs were used in the experiments: 7,000 tracks were used for training, 2,000 for model tuning, and the remaining tracks for validation.
This section verifies the performance of the improved Transformer-XL music generation model on the melody generation task through manual evaluation, chord accuracy analysis and phrase stop analysis. Recent strong music generation models are selected for comparison: magenta (RNN), GANMidi (GAN), CRMCG, Transformer-XL, and RM-Transformer.
Since there are no suitable automated metrics for evaluating music generation, the experiment was evaluated manually on four aspects grounded in music-domain knowledge: rhythm, melody, integrity, and singability. Ten music experts were invited to evaluate the multicultural fusion music generated by the different models. In each scoring round, every model generated one piece of music, and 60 rounds were conducted in total. The experts ranked the generated pieces according to the proposed manual evaluation indexes, assigning scores from 6 down to 1. After all scoring data were collected, the scores of each model were statistically analyzed.
Figure 3 shows the melody generation evaluation results for the different models. The improved Transformer-XL model achieved the best results on all indicators, verifying its effectiveness on the music generation task, with an overall average score of 3.76 for the generated music, 6.21% to 94.82% higher than the scores of the other models. Meanwhile, the improved Transformer-XL model outperforms the original Transformer-XL model on several metrics, indicating that the improved Mask mechanism does enhance the effectiveness of multicultural fusion music generation. The RNN-based music generation model significantly outperforms the GAN-based model, indicating that the RNN model is better suited to sequence generation.

Figure 3. Evaluation results of melody generation for the different models
The experiments further analyze the chord-progression performance of the improved Transformer-XL music generation model by defining chord accuracy to assess whether the chords of the generated melody match the input chord sequence:

$CA = \dfrac{N_{match}}{N_{total}}$ (13)

where $N_{match}$ is the number of chords of the generated melody that match the input chord sequence and $N_{total}$ is the total number of chords in the input sequence.
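A minimal sketch of this chord-accuracy computation is given below, assuming the chords of the generated melody have already been labelled and aligned with the input chord sequence.

```python
# Chord accuracy: share of positions where the generated chord matches the input chord.
def chord_accuracy(generated_chords, input_chords):
    matched = sum(g == c for g, c in zip(generated_chords, input_chords))
    return matched / len(input_chords)

print(chord_accuracy(["C", "G", "Am", "F"], ["C", "G", "Am", "G"]))  # 0.75
```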
Figure 4(a) shows the chord analysis of the generated music: the average chord accuracy of the melodies generated by the improved Transformer-XL model is 87.24%. Figures 4(b)-(e) show the effect of chord accuracy on rhythm, melody, completeness and singability respectively, verifying that melody quality improves significantly as chord accuracy increases.

Figure 4. Analysis of the chords of the generated music
Pauses are common in music, i.e. periods of time in which no note appears. On the one hand, pauses divide the music into sections of different lengths, making it more rhythmic; on the other hand, they give the listener and the singer intervals to rest and breathe. A good-sounding piece of music must maintain a good dynamic balance between musical activity and pauses. Figure 5 shows the distribution of phrase lengths in the generated music and in real music, where each phrase is delimited by pauses; the proportions of phrases of different lengths range from 0 to 24% for the generated music and from 0 to 14% for the real music. The generated music and the real music have similar phrase-length distributions, indicating that the improved Transformer-XL model learns the phrase pauses of multicultural fusion music well and maintains good structural properties.

Figure 5. Distribution of phrases of different lengths in generated and real music
In this paper, the improved Transformer-XL model-generated multicultural fusion arrangements are analyzed through experiments in two dimensions: online audition evaluation and offline performance evaluation.
The online audition scoring platform was developed with separated front end and back end, the front end in React and the back end in Java. After the platform was developed, 15 test arrangements were placed on it: four from the demonstration audio set, denoted D1-D4; four from the midishow website, a professional music sharing and communication platform where users can upload their own creations, denoted D5-D8; and seven generated by the improved Transformer-XL model, denoted D9-D15. Each arrangement was truncated to 30 seconds to avoid listener fatigue. The testers, all music lovers, listened to the 15 arrangements online and rated them on a five-grade scale based on their subjective listening experience.
The scores of each piano piece were counted and averaged, and the online audition rankings are shown in Fig. 6. The multicultural fusion arrangement D10 generated by the improved Transformer-XL model enters the top three with a score of 4.23, showing that the testers could not completely distinguish human works from the automatically generated arrangements of this paper, and that the generated arrangements meet the testers' appreciation requirements. In addition, D11, D13 and D15 score below 3.5 and rank near the bottom of the list, indicating that the quality of the automatically generated arrangements varies and that the automatic composition network model still has room for optimization.

Figure 6. Rankings of online audition scores
The offline performance evaluation invited professionals who have extensive experience in music performance. The professionals specified five indicators for this evaluation, namely “melody, texture, harmony, tension, and aesthetics”, each of which is worth 100 points, and then performed the selected 15 arrangements live and recorded the scores of the five indicators for each arrangement.
For the weight of each indicator, this paper uses the entropy weight method. Based on the information provided by each indicator, the method uses information entropy to determine how much each indicator contributes to the comprehensive evaluation, and thereby obtains a specific weight value.
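A minimal sketch of the entropy weight computation is given below; the score matrix is randomly generated for illustration and does not correspond to the actual evaluation data.

```python
# Entropy weight method: indicators with lower entropy (more discriminating information)
# receive larger weights in the comprehensive evaluation.
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """scores: (n_items, n_indicators) matrix of positive scores, larger is better."""
    p = scores / scores.sum(axis=0, keepdims=True)   # column-wise proportions
    k = 1.0 / np.log(scores.shape[0])
    entropy = -k * (p * np.log(p)).sum(axis=0)       # information entropy per indicator
    redundancy = 1.0 - entropy                       # degree of useful information
    return redundancy / redundancy.sum()             # normalized weights

# 15 arrangements x 5 indicators (melody, texture, harmony, tension, aesthetics)
scores = np.random.uniform(60, 100, size=(15, 5))
print(entropy_weights(scores))
```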
The final results of the offline performance evaluation are shown in Figure 7. The demonstration audio and the midishow-created arrangements rank in the top three, with evaluation scores above 92. Three of the automatically generated arrangements in this paper rank ahead of the manually created ones, and their scores differ little from that of the midishow-created arrangement D7, all being above 88. This shows that the arrangements automatically generated by the improved Transformer-XL model can reach the level of ordinary manual creation, although the gap with the demonstration audio works remains larger. The lowest-scoring arrangement, D15, is also an automatically generated arrangement, which again indicates that the quality of the automatically generated multicultural fusion arrangements varies and that the automatic composition network model needs further optimization.

Figure 7. Final results of the offline performance evaluation
Algorithmic composition can not only inspire composers but also enable a wide range of non-professionals to participate in music creation and enjoy it. Building on the innovation of multicultural integration, this paper uses GRU and TDNN to construct the MFE model for multicultural music feature extraction, realizes music generation with an improved Transformer-XL model, and discusses the effect of algorithmic composition. The primary findings are as follows:
1) The MFE-based music generation model in this paper achieves a chord accuracy of 87.24% in melody generation and learns phrase-stop features well. At the same time, the generated melody score is 3.76, which is 6.21% to 94.82% higher than the melody scores of the other models, so the model achieves better results in melody generation.
2) In terms of arrangement generation, the music generated by the model reaches the general level of human creation, with arrangements near the top of both the online audition and offline performance rankings: arrangement D10 scores 4.23 in the online evaluation, and three arrangements score above 88 in the offline performance. Since some model-generated arrangements still receive relatively low scores, there remains room for improving the music generation model of this paper.
The research on music generation based on the improved Transformer-XL under the MFE music feature extraction model proposed in this paper has achieved solid results: it can generate well-structured melodies and good-quality arrangements, and it innovatively explores the fusion of multicultural music creation, providing technical support for music creators.
