The role of musical feature extraction model MFE in multicultural music fusion innovation
Published online: 21 Mar 2025
Received: 12 Nov 2024
Accepted: 10 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0661
© 2025 Tianhang Li, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Music is a borderless art form in which elements from different cultural backgrounds collide and merge, forming a rich and colorful variety of musical styles and genres. Multicultural musical intermingling can be traced back to ancient times, when traders, travelers and conquerors moved between regions, bringing their own musical traditions with them [1-4]. Over time, these traditions merged with local music in new geographical environments, giving rise to new musical forms. Multicultural music fusion not only enriches musical expression but also promotes understanding and respect among different cultures.
Multicultural music fusion is an evolving process that reflects the diversity and inclusiveness of human society. Through music, people can feel the aesthetics and emotional expressions of other cultures, which enhances cultural exchange and resonance on a global scale [5-7]. In addition, the integration of cultural music provides musicians with a broader creative space, stimulating more creativity and inspiration [8-9]. Practical strategies for multicultural music integration include cultural identification, cultural experience, cultural exchange and cultural innovation. First, creators need to identify and understand the characteristics and values of different cultures and maintain an open attitude toward multiculturalism [10-11]. Cultures differ, and creators will encounter many of them during the creative process, sometimes finding them difficult to accept or understand; the attitude toward multiculturalism is therefore important for creating music. Second, creators experience the musical expressions of different cultures by participating in their musical activities and performances [12]; only by experiencing the charm of a culture first-hand can they create multicultural music well. Furthermore, creators can take part in cross-cultural music exchange, in which they learn from, borrow and integrate the musical elements of other cultures [13-15]. Exchange and learning are prerequisites for integration: just as in the collision of popular music and traditional culture, the characteristics of both sides must be grasped so that the integration process is not distorted. Finally, creators can practice innovation, combining the cultural elements they have learned to create new musical works and forms of expression [16-18]. However, both culture and music have their own unique features, and integrating them requires a method that can extract representative features from diverse cultural materials and audio data so that the fusion is smoother and more natural; musical feature extraction offers a good solution to this problem.
This paper takes the application of MFE in music composition as its main focus and analyzes multicultural integration in music composition. Audio features are extracted more effectively by exploiting the dynamic changes between audio frames: while TDNN extracts features from the acoustic input, a GRU iterates over the audio features frame by frame to learn the dynamic information, and a hybrid feature extraction method, MFE, combining GRU and TDNN is proposed. Transformer-XL is used as the basic music generation model, its Mask mechanism is improved, and MFE is applied to the improved Transformer-XL model to complete the construction of the music generation model. The multicultural fusion music generated by the model is analyzed along two dimensions: melody generation and arrangement generation. Melody generation is discussed in terms of overall performance, chord accuracy and phrase stops, while arrangement generation is analyzed through online audition and offline performance evaluation experiments.
Music is an art form created by human beings, and music creation is an important form of cultural exchange that continues to develop and evolve over time. With the strengthening of globalization and cultural exchange, music creation is constantly innovating, and the phenomenon of multicultural fusion is becoming more and more evident. This phenomenon is reflected in the following dimensions; it not only brings new artistic styles and ways of expression, but also reflects the diversity of cultures and the depth of their exchange.
Music style fusion is a common form of expression of multicultural fusion in music creation; it mainly refers to blending musical elements of different cultural styles to create new forms of music, for instance combining classical and popular music elements, or folk music and contemporary popular music.
Fusion of cultural elements is another important form of expression of multicultural fusion in music creation; it mainly refers to incorporating factors from different cultures into music to create works with a unique cultural charm, which helps the listener understand their cultural connotations in depth. Examples include combining traditional Chinese music elements with pop music, or Indian music elements with electronic music.
The fusion of art forms refers to combining music with other cultural and artistic forms to create works that are more expressive and infectious, highlighting the artistic value and unique charm of musical works. For instance, combining theater and music can improve the way emotions and stories are conveyed to the audience: in musicals, actors sing, dance and perform dialogue, which makes the whole work more dramatic and musical, and at the same time more vivid and touching.
Digital technology fusion is a newer expression of multicultural fusion that applies modern technology; it mainly refers to combining different musical elements through digital technology to create more novel and unique musical works and performance forms, for example using electronic music and sampling technology to integrate musical elements of different styles and eras. In this study, MFE technology is used to extract musical characteristics from different cultures, which are then used to construct intelligent models for music innovation and creation.
In recent research, many deep learning models have been used for music generation, and Transformer is one of the most prominent and popular generative models. In this paper, we use the Transformer-XL variant and incorporate the BERT training method so that Transformer-XL can also see bidirectional information, which improves its ability to generate music. In addition, instead of feeding the MIDI representation of music directly into the model for training, this paper uses the MFE method to extract multicultural music features and combines it with the improved Transformer-XL model, which not only prolongs the duration of the generated music but also improves its quality.
MFCC and FBank features are designed to simulate the human ear's sound perception mechanism and approximate the auditory response, and they have therefore achieved excellent results in many audio signal processing tasks, including audio recognition. However, this short-time feature extraction based on frame splitting truncates the continuity of the audio signal, which in turn affects the extraction of temporal information. Therefore, in this work the entire audio sequence is modeled automatically by neural networks to better capture its dynamic features.
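As an illustration of this conventional frame-based pipeline, the following minimal sketch computes FBank and MFCC features with librosa; the file path, sample rate and frame parameters are illustrative assumptions rather than the configuration used in this paper.

```python
# Minimal sketch of frame-based FBank and MFCC extraction (illustrative parameters).
import librosa

y, sr = librosa.load("example.wav", sr=16000)            # mono audio at 16 kHz (assumed path)

# Log-Mel filterbank (FBank) features: 40 Mel bands, 25 ms window, 10 ms hop
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel)                          # shape: (40, n_frames)

# MFCC features derived from the same log-Mel spectrogram
mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=13)           # shape: (13, n_frames)

print(fbank.shape, mfcc.shape)
```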
In this section, a hybrid feature extraction (MFE) method is designed based on GRU and TDNN, which utilizes GRU to learn the temporal dependencies between audio features and extract the dynamic features therein, and then combines them with the features extracted by TDNN to achieve the purpose of enhancing the dynamic information.
The internal structure of the GRU loop body contains two gating functions as well as a hidden state. The reset gate $r_t$ and the update gate $z_t$ are computed from the current input $x_t$ and the previous hidden state $h_{t-1}$:

$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$ (1)

$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$ (2)

where $\sigma(\cdot)$ is the Sigmoid function and $W_r$, $U_r$, $b_r$, $W_z$, $U_z$, $b_z$ are trainable parameters. The candidate hidden state $\tilde{h}_t$ is obtained after the reset gate filters the past information:

$\tilde{h}_t = \tanh\left(W_h x_t + U_h\left(r_t \odot h_{t-1}\right) + b_h\right)$ (3)

where $\odot$ denotes element-wise multiplication. The new hidden state $h_t$ is a weighted combination of the previous hidden state and the candidate state:

$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (4)

It can be seen that GRU controls the degree of forgetting the past information and updating the current information through the reset gate $r_t$ and the update gate $z_t$.
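For illustration, the gating equations (1)-(4) can be written directly as a single GRU step; the following NumPy sketch uses illustrative weight names and dimensions and is not the implementation used in this paper.

```python
# One GRU time step implementing equations (1)-(4); weights are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h):
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate, Eq. (1)
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate, Eq. (2)
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate state, Eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_cand                 # new hidden state, Eq. (4)

d_in, d_h = 40, 128
rng = np.random.default_rng(0)
params = [rng.standard_normal(s) * 0.01 for s in
          [(d_h, d_in), (d_h, d_h), (d_h,)] * 3]               # W, U, b for r, z, h
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), *params)
print(h.shape)  # (128,)
```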
TDNN is a feed-forward neural network for processing sequential data. Unlike a traditional fully connected network, it uses neurons with a time-delay structure: each neuron has a certain width, and its output depends only on the input within a certain time range. Each point of the TDNN represents a frame feature vector, and the neurons slide along the temporal direction with a certain stride to extract local features. During this sliding, the neuron weights are shared, which means that the neurons at each time step use the same weights to process the input data. This weight-sharing mechanism helps reduce the number of model parameters and improves the generalization ability of the model. As the network deepens, the temporal range connected by each TDNN layer gradually widens, so that a global feature representation is abstracted.
The width of a TDNN neuron is the size of the convolution kernel, and during computation the frames within its time span are weighted and summed by the parameters of the kernel. These weights are continuously adjusted by training to minimize the loss function and optimize model performance.
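As a minimal sketch, a TDNN layer can be realized as a one-dimensional convolution in PyTorch, where the kernel width is the temporal context of the neuron and the kernel weights are shared across time steps; the layer sizes below are illustrative assumptions, not the exact configuration used in this paper.

```python
# TDNN layers as dilated 1-D convolutions; deeper layers see a wider temporal context.
import torch
import torch.nn as nn

tdnn = nn.Sequential(
    nn.Conv1d(in_channels=40, out_channels=128, kernel_size=5, dilation=1),  # context of 5 frames
    nn.ReLU(),
    nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, dilation=2), # context widened by dilation
    nn.ReLU(),
)

frames = torch.randn(8, 40, 200)   # (batch, feature_dim, n_frames)
out = tdnn(frames)
print(out.shape)                   # (8, 128, n_frames - receptive_field + 1)
```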
The operations in these network layers are linear, and in order to fit real data the model needs to be capable of nonlinear transformations. Therefore, the output of the network layer is transformed nonlinearly by functions called activation functions. Common activation functions include the Sigmoid function, the ReLU function and the tanh function, as shown in Equation (5), Equation (6) and Equation (7):

$\mathrm{Sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$ (5)

$\mathrm{ReLU}(x) = \max(0, x)$ (6)

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (7)
These functions have different properties and map the input signal to different ranges through nonlinear transformations, thus increasing the expressive power and fitting ability of the network.
The structure of the MFE is shown in Fig. 1. The input acoustic features are fed in parallel into the GRU branch, which iterates frame by frame to extract the dynamic information, and the TDNN branch, which extracts local features; the outputs of the two branches are then combined to form the enhanced feature representation.

Figure 1. The structure of MFE
It should be noted that the time cost of the GRU is high because it must iterate time step by time step during computation. In the PyTorch framework, in order to improve the parallelization of the computation, the GRU processes a whole batch at the same time step simultaneously, and it therefore uses a different input data format than the one-dimensional convolutional layer. One-dimensional convolutional layers require input of shape (batch_size, feature_dim, n_frames), whereas the GRU by default expects (n_frames, batch_size, feature_dim), so the features have to be transposed accordingly when they are passed between the two branches.
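The following sketch illustrates the hybrid idea under stated assumptions: a GRU branch extracts the frame-by-frame dynamic information, a one-dimensional convolution plays the role of the TDNN branch, and the two outputs are concatenated. The layer sizes and the concatenation-based fusion are illustrative choices, not necessarily the exact configuration of the MFE model.

```python
# Hybrid GRU + Conv1d (TDNN-style) feature extractor; note the different input layouts.
import torch
import torch.nn as nn

class HybridFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.gru = nn.GRU(input_size=feat_dim, hidden_size=hidden)      # expects (T, B, F) by default
        self.tdnn = nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2)  # expects (B, F, T)

    def forward(self, x):                          # x: (batch, n_frames, feat_dim)
        dyn, _ = self.gru(x.transpose(0, 1))       # GRU branch over (n_frames, batch, feat_dim)
        dyn = dyn.transpose(0, 1)                  # back to (batch, n_frames, hidden)
        loc = self.tdnn(x.transpose(1, 2))         # Conv1d branch over (batch, feat_dim, n_frames)
        loc = loc.transpose(1, 2)                  # back to (batch, n_frames, hidden)
        return torch.cat([dyn, loc], dim=-1)       # fused features: (batch, n_frames, 2*hidden)

feats = HybridFeatureExtractor()(torch.randn(4, 200, 40))
print(feats.shape)                                 # torch.Size([4, 200, 256])
```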
The Transformer model consists of four parts, which are the input, output, encoder, and decoder parts.
The input part of the Transformer model consists of a data embedding layer and a positional encoding layer, and the output part consists of a linear layer and a Softmax layer. The data embedding layer transforms the input textual or numerical data into vector representations, and the positional encoding layer injects position information into the input data. The linear layer applies a linear transformation to the output of the previous step to project it to the required dimensionality. The Softmax layer normalizes the values along the last dimension of the vector into the range 0-1 so that they sum to 1, and the most suitable output at that time step is obtained according to these probabilities.
The biggest problem with Transformer is that it cannot model sequences longer than its maximum length; longer sequences must be truncated and the resulting fragments re-encoded. This is feasible, but it leads to context fragmentation: each fragment is modeled individually without sufficient contextual information from its neighbors. Transformer-XL, which is built on Transformer, introduces a segment-level recurrence mechanism and replaces absolute position encoding with relative position encoding to address context fragmentation and long-term dependency.
Like Transformer, Transformer-XL is fed fixed-length segments during training; the difference is that Transformer-XL caches the state of the previous segment and reuses the hidden-layer state of the previous time slice when computing the current segment. Because the features of the previous segment are reused in the current segment, Transformer-XL gains the ability to model long-term dependencies. The segment recursion is given by equation (8):

$\tilde{h}_{\tau+1}^{n-1} = \left[\mathrm{SG}\left(h_{\tau}^{n-1}\right) \circ h_{\tau+1}^{n-1}\right]$ (8)

where $h_{\tau}^{n-1}$ is the hidden state of segment $\tau$ at layer $n-1$, $\mathrm{SG}(\cdot)$ denotes stop-gradient (the cached state does not participate in back-propagation), and $[\cdot \circ \cdot]$ denotes concatenation along the length dimension.
Since different segments in Transformer share the same position encoding for the same position, the model cannot correctly distinguish the position information of different segments, so Transformer-XL replaces absolute position encoding with relative position encoding. Specifically, only the relative position relationship between the Query vector and the Key vector is considered when calculating the attention score, and this relative position term is added to the attention computation at each layer. The formula is shown in (9):

$A_{i,j}^{rel} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}$ (9)

where $E_{x_i}$ and $E_{x_j}$ are the embeddings of the query token $x_i$ and the key token $x_j$, $R_{i-j}$ is the relative position encoding, $W_q$, $W_{k,E}$ and $W_{k,R}$ are projection matrices, and $u$ and $v$ are learnable vectors that replace the absolute-position query terms.
Considering the unique "cloze" mechanism of the BERT model, this paper applies this training method to Transformer-XL, so that the improved Transformer-XL can clearly see both past and future information and its predictions fully exploit the "context". The Mask mechanism of the original Transformer-XL masks all information after the predicted position, so the model attends only to past information. In view of this, this paper improves the Mask mechanism so that during training it masks only the next moment, i.e. the position that is about to be predicted, and leaves the information from the following positions visible.
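A hypothetical sketch of the difference between the standard causal Mask and the improved cloze-style Mask described above is given below; the tensor convention (True marks a masked, non-attendable position) is an assumption for illustration, not the paper's exact implementation.

```python
# Standard causal mask vs. cloze-style mask that hides only the predicted position.
import torch

def standard_causal_mask(seq_len: int) -> torch.Tensor:
    # True = masked: every position after the query position is hidden
    return torch.ones(seq_len, seq_len).triu(diagonal=1).bool()

def cloze_mask(seq_len: int, target_pos: int) -> torch.Tensor:
    # Improved scheme: only the token that is about to be predicted is hidden,
    # so attention can use both past and future context (BERT-style cloze).
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, target_pos] = True
    return mask
```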
Since Transformer-XL combines Transformer and RNN model ideas and proposes a fragment recursion mechanism, which recurses the features of the previous fragment to the next fragment, the model’s past information includes not only all the information before the current token prediction moment, but also the information recursed from the previous fragment, which is also the information of the MEMORY token. Therefore, the model can make better use of past information when predicting, avoiding the problem of generating music with poor wholeness and fragmentation between bars.
In modeling music, the main framework of the model is based on Transformer-XL, to which the improved Mask mechanism is added, giving the model the ability to see bidirectional information. In the training process, the complete piece of music is divided into multiple segments, and the layer-$(n-1)$ input of segment $\tau+1$ is extended with the cached state of segment $\tau$:

$\tilde{h}_{\tau+1}^{n-1} = \left[\mathrm{SG}\left(h_{\tau}^{n-1}\right) \circ h_{\tau+1}^{n-1}\right]$ (10)

Among them, the features of the previous segment are gradient-stopped and serve only as memory; they are concatenated with the features of the current segment along the length dimension.

When the model is being trained, the information of each segment of the input music is represented by the Query vector, Key vector and Value vector:

$q_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_q^{\top}, \quad k_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_k^{\top}, \quad v_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_v^{\top}$ (11)

In this recursive self-attention computation, the Query vector is computed only from the features of the current segment without the recursion, while both the Key and Value vectors incorporate the features of the previous segment through the extended state $\tilde{h}_{\tau+1}^{n-1}$.

Since the improved Transformer-XL model also uses relative position encoding, the layer-$n$ hidden state of segment $\tau+1$ is obtained from the relative-position attention over these vectors:

$h_{\tau+1}^{n} = \mathrm{TransformerLayer}\left(q_{\tau+1}^{n}, k_{\tau+1}^{n}, v_{\tau+1}^{n}\right)$ (12)

where the attention score of each query-key pair is computed according to equation (9), and $W_q$, $W_k$, $W_v$ are the projection matrices of layer $n$.
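The segment recursion and the asymmetric treatment of Query, Key and Value can be sketched as follows; the dimensions and weight names are illustrative assumptions, and the relative-position terms of equation (9) are omitted for brevity.

```python
# Sketch of the segment-recurrence step: memory is gradient-stopped and concatenated,
# Queries come from the current segment only, Keys/Values from the extended sequence.
import torch
import torch.nn as nn

d_model = 256
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

h_prev = torch.randn(2, 64, d_model)                 # cached hidden states of segment tau
h_curr = torch.randn(2, 64, d_model)                 # hidden states of segment tau+1

h_ext = torch.cat([h_prev.detach(), h_curr], dim=1)  # SG(.) then concatenate along length

q = W_q(h_curr)                                      # queries: current segment only
k, v = W_k(h_ext), W_v(h_ext)                        # keys/values: memory-extended sequence
attn = torch.softmax(q @ k.transpose(1, 2) / d_model ** 0.5, dim=-1) @ v
print(attn.shape)                                    # torch.Size([2, 64, 256])
```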
Figure 2 shows the specific process of music generation. First, the music is converted into MIDI-format audio; then the MFE method transforms the musical features into specific events; these events are fed to the model for training; finally, music with multicultural fusion characteristics is obtained.

Figure 2. The process of music generation
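As an illustration of the first step in Figure 2, the following sketch reads a MIDI file with pretty_midi and flattens its notes into a simple time-ordered event list; the event format shown is an assumption for illustration, not the exact event vocabulary used by the MFE method.

```python
# Read a MIDI file and flatten its notes into time-ordered (event, time, pitch, velocity) tuples.
import pretty_midi

midi = pretty_midi.PrettyMIDI("song.mid")            # illustrative file path
events = []
for instrument in midi.instruments:
    for note in instrument.notes:
        events.append(("note_on", note.start, note.pitch, note.velocity))
        events.append(("note_off", note.end, note.pitch, 0))
events.sort(key=lambda e: e[1])                      # order events by time before training
print(events[:5])
```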
In order to verify the effectiveness of the proposed melody and arrangement generation framework, this paper conducts a large number of experiments on real music datasets. In this section, the experimental dataset, melody generation experimental results, and arrangement generation experimental results are analyzed in detail as shown below.
In this paper, extensive experiments were conducted on a realistic music dataset consisting of more than 30,000 digital music (MIDI) files. To avoid bias, incomplete music files, such as MIDI files missing necessary tracks, were removed. After extensive preprocessing, more than 10,000 multicultural fusion songs were used in the experiments: 7,000 tracks were used for training, 2,000 for model tuning, and the remaining tracks for validation.
This section verifies the performance of the improved Transformer-XL music generation model on the melody generation task through manual evaluation, chord accuracy analysis and phrase stop analysis. Recent strong music generation models are selected for comparison: magenta (RNN), GANMidi (GAN), CRMCG, Transformer-XL, and RM-Transformer.
Since there are no suitable automated metrics for evaluating music generation, the experiment was evaluated manually on four aspects grounded in music-domain knowledge: rhythm, melody, integrity, and singability. Ten music experts were invited to evaluate the multicultural fusion music generated by the different models. In each scoring round, every model generated one piece of music, and 60 rounds were conducted in total. The experts ranked the generated pieces according to the proposed manual evaluation indexes, assigning scores from 6 down to 1. After all scoring data were collected, the scores of each model were statistically analyzed.
Figure 3 shows the melody generation evaluation results for the different models. The improved Transformer-XL model achieved the best results on all indicators, verifying its effectiveness on the music generation task, with an overall average score of 3.76 for the generated music, 6.21% to 94.82% higher than the scores of the other models. Meanwhile, the improved Transformer-XL model outperforms the original Transformer-XL model on several metrics, indicating that the improved Mask mechanism does enhance the effectiveness of multicultural fusion music generation. The RNN-based music generation model significantly outperforms the GAN-based model, indicating that the RNN model is better suited to sequence generation.

Figure 3. Evaluation results of melody generation for the different models
The experiments further analyze the chord-progression performance of the improved Transformer-XL music generation model by defining chord accuracy to assess whether the chords of the generated melody match the input chord sequence:

$CA = \dfrac{N_{match}}{N_{total}}$ (13)

where $N_{match}$ is the number of chords of the generated melody that match the input chord sequence and $N_{total}$ is the total number of chords in the input sequence.
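A minimal sketch of this chord-accuracy computation is given below, assuming the chords of the generated melody have already been labelled and aligned with the input chord sequence.

```python
# Chord accuracy: share of positions where the generated chord matches the input chord.
def chord_accuracy(generated_chords, input_chords):
    matched = sum(g == c for g, c in zip(generated_chords, input_chords))
    return matched / len(input_chords)

print(chord_accuracy(["C", "G", "Am", "F"], ["C", "G", "Am", "G"]))  # 0.75
```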
Figure 4(a) shows the chord analysis of the generated music: the average chord accuracy of the melodies generated by the improved Transformer-XL model is 87.24%. Figures 4(b)-(e) show the effect of chord accuracy on rhythm, melody, completeness and singability respectively, verifying that melody quality improves significantly as chord accuracy increases.

Figure 4. Analysis of the chords of the generated music
Pauses are common in music, i.e. periods of time in which no note appears. On the one hand, pauses divide the music into sections of different lengths, making it more rhythmic; on the other hand, they give the listener and the singer intervals to rest and breathe. A good-sounding piece of music must maintain a good dynamic balance between musical activity and pauses. Figure 5 shows the distribution of phrase lengths in the generated music and in real music, where each phrase is delimited by pauses; the proportions of phrases of different lengths range from 0 to 24% for the generated music and from 0 to 14% for the real music. The generated music and the real music have similar phrase-length distributions, indicating that the improved Transformer-XL model learns the phrase pauses of multicultural fusion music well and maintains good structural properties.

Figure 5. Distribution of phrases of different lengths in generated and real music
In this paper, the improved Transformer-XL model-generated multicultural fusion arrangements are analyzed through experiments in two dimensions: online audition evaluation and offline performance evaluation.
The online audition scoring platform was developed with separated front end and back end, the front end in React and the back end in Java. After the platform was developed, 15 test arrangements were placed on it: four from the demonstration audio set, denoted D1-D4; four from the midishow website, a professional music sharing and communication platform where users can upload their own creations, denoted D5-D8; and seven generated by the improved Transformer-XL model, denoted D9-D15. Each arrangement was truncated to 30 seconds to avoid listener fatigue. The testers, all music lovers, listened to the 15 arrangements online and rated them on a five-grade scale based on their subjective listening experience.
The scores of each piano piece were counted and averaged, and the online audition rankings are shown in Fig. 6. The multicultural fusion arrangement D10 generated by the improved Transformer-XL model enters the top three with a score of 4.23, showing that the testers could not completely distinguish human works from the automatically generated arrangements of this paper, and that the generated arrangements meet the testers' appreciation requirements. In addition, D11, D13 and D15 score below 3.5 and rank near the bottom of the list, indicating that the quality of the automatically generated arrangements varies and that the automatic composition network model still has room for optimization.

Figure 6. Rankings of online audition scores
The offline performance evaluation invited professionals who have extensive experience in music performance. The professionals specified five indicators for this evaluation, namely “melody, texture, harmony, tension, and aesthetics”, each of which is worth 100 points, and then performed the selected 15 arrangements live and recorded the scores of the five indicators for each arrangement.
For the weight of each indicator, this paper uses the entropy weight method. Based on the information provided by each indicator, the method uses information entropy to determine how much each indicator contributes to the comprehensive evaluation, and thereby obtains a specific weight value.
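A minimal sketch of the entropy weight computation is given below; the score matrix is randomly generated for illustration and does not correspond to the actual evaluation data.

```python
# Entropy weight method: indicators with lower entropy (more discriminating information)
# receive larger weights in the comprehensive evaluation.
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """scores: (n_items, n_indicators) matrix of positive scores, larger is better."""
    p = scores / scores.sum(axis=0, keepdims=True)   # column-wise proportions
    k = 1.0 / np.log(scores.shape[0])
    entropy = -k * (p * np.log(p)).sum(axis=0)       # information entropy per indicator
    redundancy = 1.0 - entropy                       # degree of useful information
    return redundancy / redundancy.sum()             # normalized weights

# 15 arrangements x 5 indicators (melody, texture, harmony, tension, aesthetics)
scores = np.random.uniform(60, 100, size=(15, 5))
print(entropy_weights(scores))
```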
The final results of the offline performance evaluation are shown in Figure 7. The demonstration audio and the midishow-created arrangements rank in the top three, with evaluation scores above 92. Three of the automatically generated arrangements in this paper rank ahead of the manually created ones, and their scores differ little from that of the midishow-created arrangement D7, all being above 88. This shows that the arrangements automatically generated by the improved Transformer-XL model can reach the level of ordinary manual creation, although the gap with the demonstration audio works remains larger. The lowest-scoring arrangement, D15, is also an automatically generated arrangement, which again indicates that the quality of the automatically generated multicultural fusion arrangements varies and that the automatic composition network model needs further optimization.

Figure 7. Final results of the offline performance evaluation
Algorithmic composition can not only inspire composers but also enable a wide range of non-professionals to participate in music creation and enjoy it. Building on the innovation of multicultural integration, this paper uses GRU and TDNN to construct the MFE model for multicultural music feature extraction, realizes music generation with an improved Transformer-XL model, and discusses the effect of algorithmic composition. The primary findings are as follows:
1) The MFE-based music generation model in this paper achieves a chord accuracy of 87.24% in melody generation and learns phrase-stop features well. At the same time, the generated melody score is 3.76, which is 6.21% to 94.82% higher than the melody scores of the other models, so the model achieves better results in melody generation.
2) In terms of arrangement generation, the music generated by the model reaches the general level of human creation, with arrangements near the top of both the online audition and offline performance rankings: arrangement D10 scores 4.23 in the online evaluation, and three arrangements score above 88 in the offline performance. Since some model-generated arrangements still receive relatively low scores, there remains room for improving the music generation model of this paper.
The research on music generation based on the improved Transformer-XL under the MFE music feature extraction model proposed in this paper has achieved solid results: it can generate well-structured melodies and good-quality arrangements, and it innovatively explores the fusion of multicultural music creation, providing technical support for music creators.
