A stochastic process model of melody generation in popular music composition and its contribution to compositional innovation
Published: 21 Mar 2025
Received: 21 Oct 2024
Accepted: 09 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0603
© 2025 Hongxu Kang, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Music generation is a popular research direction in the field of artificial intelligence, aiming to produce artistic and creative musical works by computer. Generative models play a key role in music generation: they learn the features and patterns of music and then generate new works that match a specific style or emotion [1-4]. A generative model is a class of model in artificial intelligence whose goal is to learn the probability distribution of the given training data and to generate new samples from that distribution [5-7]. In music generation, generative models learn the probability distributions of musical elements such as notes, rhythms, and chords, and use these distributions to generate new musical compositions [8-11].
Several common generative models are widely used in music generation. Hidden Markov Models (HMMs) treat music as a discrete sequence of states and generate new music by learning transition and emission probabilities; HMMs can capture the structure and regularity of music and generate compositions that conform to a specific style [12-15]. Generative Adversarial Networks (GANs) consist of a generator and a discriminator that play against each other to generate musical works with a high degree of authenticity [16-17]. The generator produces music and the discriminator evaluates its authenticity; through continuous adversarial learning, GANs can generate highly realistic musical works [18-20]. Variational Auto-Encoders (VAEs) are autoencoders that map the input data to a latent space and generate new works by learning the parameters of its distribution; VAEs perform well in music generation and can produce varied and creative compositions [21-24].
This paper models the stochastic process of pop melody generation and explores new paths for compositional innovation. MIDI-format scores are represented with rests added, and a probabilistic model of pitch sequences is built through Markov modeling. To address the cluttered results of traditional Markov generation, the probabilistic model is constrained so that the output pitch sequence satisfies given constraints. The number of unit pulses and their time value are then determined to generate a pop rhythm sequence, which is blended with the note sequence. Finally, a density distribution of the probability of occurrence of different rhythms is constructed to integrate the generated melodies and rhythms tightly and innovate the compositional process. The generated melodies are evaluated both subjectively and objectively, and the contribution of this paper's method to compositional innovation is summarized.
The basic attributes of music include pitch, duration, intensity, and timbre. Music is a kind of sound and therefore shares the attributes of sound. Pitch arises from the vibration frequency of the sounding body, and Western music divides pitches into twelve-tone equal temperament according to this frequency. Duration (time value) describes how long a pitch is sounded and is commonly expressed in quarter notes, eighth notes, sixteenth notes, and so on. The tones named with the capital letters C, D, E, F, G, A, and B are called the basic steps, while altered steps are raised or lowered by a semitone from the basic steps. For convenience in singing, the basic steps have corresponding solfège names: do, re, mi, fa, sol, la, si. Intensity is the strength of each note and is related to the amplitude of the vibration. Timbre represents the color of the music; the same piece played on different instruments will have different timbres.
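As a concrete illustration of twelve-tone equal temperament, the following minimal Python sketch converts a note number to its vibration frequency. The MIDI note numbering (A4 = note 69) and the A4 = 440 Hz reference are common conventions assumed here, not specifications from this paper.

```python
# Minimal sketch of twelve-tone equal temperament, assuming MIDI note
# numbering (A4 = note 69) and the common A4 = 440 Hz reference pitch.

def midi_to_frequency(note_number: int, a4_hz: float = 440.0) -> float:
    """Frequency of a MIDI note under twelve-tone equal temperament."""
    # Each semitone multiplies the frequency by the twelfth root of 2.
    return a4_hz * 2.0 ** ((note_number - 69) / 12)

if __name__ == "__main__":
    for name, n in [("C4", 60), ("A4", 69), ("C5", 72)]:
        print(f"{name} (MIDI {n}): {midi_to_frequency(n):.2f} Hz")
```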
Rhythm is the organization of tones of different lengths combined according to specific rules. Rhythm is embodied in the beat and cannot be separated from it; the beat is a regular, repeating cycle of strong and weak pulses.
Pitches are separated by a distance, and the interval expresses the distance between two pitches, measured in degrees. The degree of an interval is obtained by counting the total number of steps contained between the two pitches; for example, between the solfège names re and fa there are three steps (re, mi, fa), so the interval between re and fa is a third. In the musical tone system, however, two intervals may contain the same number of steps but a different number of semitones; to distinguish them, the qualifiers perfect, major, minor, augmented, diminished, doubly augmented, and doubly diminished are added before the degree. Perceptually, intervals can be categorized as consonant or dissonant: consonant intervals, as the name suggests, are comfortable, harmonious, and pleasing to the ear, while dissonant intervals can sound harsh.
Melody is the soul of music: high and low notes, fast and slow rhythms, and strong and weak dynamics give the melody different colors. Different pitches connected together form a melodic pitch profile, which can be abstracted as a melodic curve; the distances between points on the curve represent the intervals between pitches. The basic forms of melodic motion can be summarized as horizontal, upward, downward, and wave-like progression. Melody is the basis of the voice part: monophonic music has only one melody, while polyphonic music contains several melodies that are independent yet interact around a main theme. The motion between melodic voices can in general be divided into similar, parallel, contrary, and oblique motion.
Stochastic processes are the discipline of modeling and analyzing random phenomena that vary in time and space, and are widely used in fields such as physics, biology, engineering, music, and management. Given a stochastic process $\{X_n, n \geq 0\}$ with a discrete state space $S$, if for any $n$ and any states $i_0, i_1, \ldots, i_{n-1}, i, j \in S$,
$$P(X_{n+1}=j \mid X_n=i, X_{n-1}=i_{n-1}, \ldots, X_0=i_0) = P(X_{n+1}=j \mid X_n=i),$$
then $\{X_n, n \geq 0\}$ is called a Markov chain: the future state depends only on the present state and not on the past.
In practical applications, the Markov chain is often assumed to be homogeneous, i.e., the transition probabilities $P(X_{n+1}=j \mid X_n=i) = p_{ij}$ do not depend on $n$, so that the standard results for homogeneous Markov chains can be used for prediction and analysis of real-life processes. A homogeneous Markov chain is completely determined by its initial distribution $\pi$ and its transition probability matrix $P = (p_{ij})$.
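To illustrate how a homogeneous Markov chain is determined by the pair $(\pi, P)$, the following Python sketch draws a state sequence from an initial distribution and a transition matrix. The pitch states and probability values are invented for illustration.

```python
import random

# Minimal sketch: a homogeneous Markov chain is fully determined by its
# initial distribution pi and transition matrix P (values invented here).

def sample_chain(pi, P, states, length):
    """Draw a state sequence of the given length from (pi, P)."""
    seq = [random.choices(states, weights=pi)[0]]
    while len(seq) < length:
        row = P[states.index(seq[-1])]   # transition row for current state
        seq.append(random.choices(states, weights=row)[0])
    return seq

states = ["C", "D", "E", "G", "A"]       # e.g. a pentatonic pitch set
pi = [0.4, 0.2, 0.2, 0.1, 0.1]
P = [[0.1, 0.4, 0.3, 0.1, 0.1],
     [0.3, 0.1, 0.4, 0.1, 0.1],
     [0.2, 0.3, 0.1, 0.3, 0.1],
     [0.2, 0.2, 0.3, 0.1, 0.2],
     [0.3, 0.1, 0.2, 0.3, 0.1]]
print(sample_chain(pi, P, states, 8))
```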
All melodies were collected in MIDI file format. For convenience, rests were added to the sequences obtained from the MIDI files; this processing helps to analyze music fragments containing both note sequences and rests [26].
Results generated by the Markov model alone are too random, while melodies generated purely from fixed rules have overly homogeneous note arrangements and lack innovation [27]. Therefore, the initial probability matrix and transition probability matrix of the probabilistic model are estimated by training on melodies in the corpus, and a fixed style or specific constraints on the melody are selected, so that the constrained probabilistic model outputs each note in a way that satisfies the given style.
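The training step can be sketched as follows: the initial and transition probabilities are estimated from relative frequencies in a corpus of pitch sequences. The toy corpus is illustrative only; the paper's actual corpus and preprocessing are not reproduced here.

```python
from collections import Counter, defaultdict

# Sketch of the training step: estimate the initial distribution and the
# transition probability matrix from relative frequencies in a corpus of
# pitch sequences. The corpus below is invented for illustration.

def train_markov(corpus):
    initial = Counter(seq[0] for seq in corpus)
    transitions = defaultdict(Counter)
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            transitions[a][b] += 1
    # Normalize counts into probability distributions.
    total_starts = sum(initial.values())
    pi = {s: c / total_starts for s, c in initial.items()}
    P = {s: {t: c / sum(row.values()) for t, c in row.items()}
         for s, row in transitions.items()}
    return pi, P

corpus = [["C", "D", "E", "G", "E"], ["E", "G", "A", "G", "E", "D"]]
pi, P = train_markov(corpus)
print(pi)
print(P["E"])   # transition distribution out of pitch E
```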
Take the following melodic sequence as a reference:
First, the probabilistic model parameters <Q, P> extracted from the given melodic sequence are shown below:
Transition probability distribution matrix:
Next, each row of the transition probability matrix is normalized so that it forms a probability distribution over the next note.
A constraint satisfaction problem requires the states of a set of objects to satisfy certain constraints or rules. A constraint satisfaction problem is defined as a triple <V, D, C>, where V is the set of variables, D the set of their domains, and C the set of constraints among the variables.
As an example, a pentatonic melody requires that the output pitch sequence be of length 4 and that the second note of the output be restricted to specified pitches.
The initial probability matrix of the probabilistic model, derived from the given melodic fragment, is:
Variables are then assigned to the four positions of the output sequence, with each variable's domain being the set of pitches it may take. The second variable's domain is restricted by the style constraint, and this restriction propagates through the transition probabilities to the neighboring variables, whose domains are pruned in turn until every variable's domain is consistent with the constraints.
The probabilistic model satisfying these constraints yields a final melodic sequence of the following structure:
In this way, the generated melodies can be given certain stylistic constraints, producing melodic fragments with a high degree of pleasantness.
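One simple way to realize such constrained generation is sketched below, under the assumption that each output position carries a set of allowed pitches: each transition row is masked to the allowed pitches and renormalized before sampling. A full constraint-satisfaction treatment would additionally propagate constraints between positions to avoid dead ends; all pitches and probabilities here are invented.

```python
import random

# Sketch of constrained generation: at each position, the distribution is
# masked to the pitches allowed by the constraint and renormalized before
# sampling. Constraints and probabilities are illustrative only.

def generate_constrained(pi, P, constraints, length):
    """constraints[i] is the set of pitches allowed at position i."""
    def draw(dist, allowed):
        masked = {s: p for s, p in dist.items() if s in allowed and p > 0}
        if not masked:
            raise ValueError("no pitch satisfies the constraint")
        total = sum(masked.values())
        return random.choices(list(masked),
                              [p / total for p in masked.values()])[0]

    seq = [draw(pi, constraints[0])]
    for i in range(1, length):
        seq.append(draw(P[seq[-1]], constraints[i]))
    return seq

pentatonic = {"C", "D", "E", "G", "A"}
# Length-4 output whose second note is restricted to D or E (hypothetical).
constraints = [pentatonic, {"D", "E"}, pentatonic, pentatonic]
pi = {"C": 0.5, "E": 0.3, "G": 0.2}
P = {"C": {"D": 0.5, "E": 0.5},
     "D": {"E": 0.6, "C": 0.4},
     "E": {"G": 0.5, "D": 0.3, "A": 0.2},
     "G": {"A": 0.5, "E": 0.5},
     "A": {"G": 0.6, "E": 0.4}}
print(generate_constrained(pi, P, constraints, 4))
```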
In order to generate a rhythmic model, the number of "unit pulses" per measure must be defined. Here, 8 unit pulses are defined per measure, and the time signature is 4/4.
The time value of one unit pulse corresponds to one eighth note. A state whose time value exceeds one unit pulse must pass through correspondingly more nodes in the modified model, where each state is represented by a note together with its position within that note's duration, so that a note spanning several unit pulses occupies a chain of states.
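The unit-pulse expansion can be sketched as follows, assuming the stated unit pulse of one eighth note (8 pulses per 4/4 measure). Representing each state as a (pitch, pulse-index) pair is an illustrative choice, since the paper's exact state encoding is not shown.

```python
# Sketch of the unit-pulse expansion: with the unit pulse set to one eighth
# note (8 pulses per 4/4 measure), a note whose time value spans k pulses is
# represented by a chain of k states. The (pitch, pulse-index) encoding is
# an assumption for illustration.

PULSES = {"eighth": 1, "quarter": 2, "half": 4, "whole": 8}

def to_pulse_states(melody):
    """melody: list of (pitch, duration-name) pairs -> unit-pulse state chain."""
    states = []
    for pitch, dur in melody:
        for i in range(PULSES[dur]):
            states.append((pitch, i))    # i-th pulse within the note
    return states

print(to_pulse_states([("C", "quarter"), ("D", "eighth"), ("E", "half")]))
# [('C', 0), ('C', 1), ('D', 0), ('E', 0), ('E', 1), ('E', 2), ('E', 3)]
```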
The two algorithms of the probabilistic model process melody and rhythm, respectively. Although both are based on the same reference melody, they are independent processes with no connection between them: they generate the note sequence and the rhythm sequence separately and then combine them. Because melody and rhythm are processed separately, the final generated sequence will contain note and rhythm combinations that do not appear in the reference melody.
The probabilistic model was constrained above, but the link between melody and rhythm remained loose. For this reason, a density distribution of the probability of the same note appearing with different rhythms is constructed and combined with the constrained probabilistic model to generate musical fragments.
The density distribution of the probability of the same note appearing with different rhythms in a melodic fragment captures the probability that a given pitch in the reference melody is expressed with different time values. It is defined as follows:
First, the specific note to be counted is selected; second, the total number of occurrences of that note in the training set is computed; then the number of its occurrences at each time value is counted separately to obtain its probability density distribution.
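This counting procedure can be sketched directly; the training data below are invented, and the resulting per-pitch distributions correspond to the parameter B introduced later.

```python
from collections import Counter, defaultdict

# Sketch of the counting procedure: for each pitch, tally how often it
# appears with each time value and normalize. The training data are invented.

def duration_distributions(training_set):
    counts = defaultdict(Counter)
    for pitch, duration in training_set:
        counts[pitch][duration] += 1
    return {p: {d: c / sum(row.values()) for d, c in row.items()}
            for p, row in counts.items()}

training_set = [("C", "eighth"), ("C", "quarter"), ("C", "eighth"),
                ("D", "quarter"), ("C", "eighth"), ("D", "half")]
B = duration_distributions(training_set)
print(B["C"])   # e.g. {'eighth': 0.75, 'quarter': 0.25}
```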
The melody output selects the time value of each note according to the following rules:
Extract the probability density distribution of each note's time values from the training set. Replace each probability value with the cumulative sum of the preceding values. Choose a random number and compare it with the cumulative values. Output the time value whose cumulative interval contains the random number.
Assume, for example, that for a given note the random number falls within the cumulative interval assigned to the eighth-note time value; the corresponding time value is then an eighth note, and an eighth note of that pitch is output.
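The selection rule above is inverse-transform sampling over the cumulative distribution; a minimal sketch, with invented distribution values, follows.

```python
import random

# Sketch of the rule above: accumulate the probabilities into a running sum,
# draw a random number, and output the first time value whose cumulative
# probability exceeds it. Distribution values are invented.

def sample_duration(dist):
    r = random.random()                  # uniform in [0, 1)
    cumulative = 0.0
    for duration, p in dist.items():
        cumulative += p
        if r < cumulative:
            return duration
    return duration                      # guard against rounding error

dist = {"eighth": 0.5, "quarter": 0.3, "half": 0.2}
print(sample_duration(dist))             # e.g. 'eighth' with probability 0.5
```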
This method is combined with the probabilistic model to generate a sequence of notes. First, the parameters <Q, P, B> are defined, where Q is the set of pitch states, P is the transition probability matrix, and B is the density distribution of time values for each note.
The initial probability matrix of the probability model is obtained:
Transition probability matrix:
Then B is obtained from the probabilities of the different note time values, and finally a probabilistic model of time values is constructed from the above parameters, as shown in Fig. 1.

Figure 1: Constructing a probabilistic model of time values
The above model combines the generation of rhythm and melody, taking note time values into account while generating the melody, so that the mismatch between melody and rhythm is reduced.
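Putting the pieces together, the following sketch generates (pitch, time value) pairs by drawing pitches from the transition model and drawing each pitch's time value from its distribution in B, in the spirit of the model in Fig. 1. All numbers are invented, and the constraint machinery shown earlier could equally be applied to the pitch draws.

```python
import random

# End-to-end sketch: pitches come from the transition model, and each
# pitch's time value is drawn from its own duration distribution B,
# in the spirit of the model in Fig. 1. All numbers are invented.

def sample(dist):
    r, cum = random.random(), 0.0
    for value, p in dist.items():
        cum += p
        if r < cum:
            return value
    return value                         # guard against rounding error

def generate_fragment(pi, P, B, length):
    pitch = random.choices(list(pi), weights=pi.values())[0]
    fragment = [(pitch, sample(B[pitch]))]
    for _ in range(length - 1):
        row = P[fragment[-1][0]]
        pitch = random.choices(list(row), weights=row.values())[0]
        fragment.append((pitch, sample(B[pitch])))
    return fragment

pi = {"C": 0.6, "E": 0.4}
P = {"C": {"D": 0.5, "E": 0.5}, "D": {"E": 1.0},
     "E": {"G": 0.6, "C": 0.4}, "G": {"E": 1.0}}
B = {"C": {"quarter": 0.6, "eighth": 0.4}, "D": {"eighth": 1.0},
     "E": {"eighth": 0.7, "quarter": 0.3}, "G": {"half": 1.0}}
print(generate_fragment(pi, P, B, 6))
```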
Compositional experiments are conducted with the method of this paper, and the effects are evaluated by comparison. For convenience, the method that generates melodic segments using the traditional Markov-chain stochastic process is abbreviated MCR, and the improved method of this paper that fuses rules and probabilistic models is abbreviated RPM.
To assess the effectiveness of formalized composition methods, the quality of their outputs must be evaluated. Here, the MCR and RPM methods are each used for automatic composition with the same compositional control parameters, i.e., tempo, key signature, time signature, and so on. To account for the evaluation bias that listeners' personal preferences introduce across musical styles, two sharply different styles were generated: a relaxed, upbeat style and a sad, lyrical style (sad and lyrical belong to one style in music theory). The most commonly used parameters were selected: 80 beats/min for the lyrical style and 140 beats/min for the lighthearted style, with key signatures chosen randomly and time signatures selected randomly from 2/4 and 4/4, producing 100 MIDI-format outputs (50 lighthearted, 50 sad and lyrical). The specific choice of control parameters does not affect the evaluation results, provided the parameters are the same for both methods.
The final comparison of the two methods is shown in Figure 2, which gives the results of 8+8 sampling, i.e., randomly selecting 8 songs of each style for evaluation. For 8+8 sampling, the composite mean score of the MCR method is 5.06 and that of the RPM method is 6.17, an improvement of about 21.95%. The number of sampled works affects the assessment, since subjects' mental state changes during evaluation; the sample size is therefore varied to view the experimental results comprehensively.

Figure 2: 8+8 sampling results
The sample sizes were then reduced to 6 songs and expanded to 12 songs, respectively. The results of re-evaluating 6 randomly selected songs of each style are shown in Figure 3. For 6+6 sampling, the composite mean score of the MCR method is 5.52 and that of the RPM method is 7.28, an improvement of 31.88%.

Figure 3: 6+6 sampling results
The results obtained by re-selecting 12 songs of each style for evaluation are shown in Figure 4. For 12+12 sampling, the composite mean score of the MCR method is 4.61 and that of the RPM method is 5.99, an improvement of about 30%.

Figure 4: 12+12 sampling results
Across these three samplings, the average improvement in subjective ratings ranged from roughly 22% to 32%, indicating that, from the listeners' subjective point of view, the melodies generated by this paper's method provide a better listening experience than those of the traditional Markov-chain stochastic process.
Three objective evaluation metrics are used for the generated melodies:
R: a phrase is said to be related if it uses any melodic development technique in relation to neighboring phrases; the value of R is the number of phrases with such correlation.
I: for the interval characteristics of popular music, if the proportion of intervals between adjacent notes within a phrase that are major seconds, minor thirds, or perfect intervals exceeds 59%, the phrase is judged to conform; the value of I is the percentage of conforming intervals (a sketch of this computation follows below).
D: the number of phrases ending with a short rhythm resolving into a long rhythm.
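A minimal sketch of metric I follows. Reading "major second, minor third, and perfect intervals" as the semitone distances {0, 2, 3, 5, 7, 12} and using MIDI pitch numbers are our assumptions; the paper does not define the interval set numerically.

```python
# Hedged sketch of metric I: the share of adjacent-note intervals that fall
# in the allowed set. Mapping "major second, minor third, and perfect
# intervals" to the semitone distances {0, 2, 3, 5, 7, 12} is our reading,
# as is MIDI pitch numbering; neither is spelled out in the paper.

ALLOWED_SEMITONES = {0, 2, 3, 5, 7, 12}

def metric_i(phrase, threshold=0.59):
    """phrase: list of MIDI pitches; returns (proportion, conforms?)."""
    intervals = [abs(b - a) for a, b in zip(phrase, phrase[1:])]
    share = sum(i in ALLOWED_SEMITONES for i in intervals) / len(intervals)
    return share, share > threshold

print(metric_i([60, 62, 65, 64, 67, 72]))   # C4 D4 F4 E4 G4 C5 -> (0.8, True)
```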
The objective evaluation metric I is obtained by calculating the proportion of pitch intervals that conform to the solution strategy. The melodic lines generated by the traditional Markov model, the attention_rnn model, and the method of this paper are shown in Figures 5 to 7, respectively.

Figure 5: Melody line of traditional Markov

Figure 6: Melody line of attention_rnn

Figure 7: Melody line of RPM
The objective comprehensive evaluation data are shown in Fig. 8; all values in the figure are averages. Compared with the Markov and attention_rnn models, the model of this paper shows clear improvements in the R, I, and D indexes; overall, it improves the objective comprehensive evaluation indexes by about 1 to 1.7 times.

Figure 8: Objective comprehensive evaluation data
The RPM model proposed in this paper has clear advantages in pop-music melody generation, specifically:
Clear tonality: the melodies generated by the RPM model conform to the tonal characteristics of popular music, while the tonality of the melodies generated by the Markov and attention_rnn models is inconspicuous.
Accurate interval relationships: the melodies generated by the attention_rnn model fall short of the solution strategy for index I. Although the melodies generated by the RPM and Markov models both satisfy the index-I strategy, analysis of the intervals between neighboring notes shows that the Markov model's melodies can contain one-octave leaps between adjacent notes, which reduces the quality of the composition. By comparison, the model of this paper grasps the interval relationships of pop music more accurately than the Markov and attention_rnn models, and its melodies are more characteristic of pop music.
Rhythmic regularity: the melodic phrases generated by the RPM model are rhythmically regular, and most phrases in each melody end with a short rhythm resolving into a long rhythm. The Markov and attention_rnn models, by contrast, generate more scattered rhythms, which makes phrase lengths hard to grasp and deprives the melodies of a relatively stable sense of repose.
Clear melodic structure: the melody generated by the RPM model consists of four phrases with high correlation between them: the second phrase is a repetition or sequence of the first, the third functions as a transition and forms the climax of the whole, and the fourth plays a summarizing role, a typical opening-development-turn-conclusion structure.
The Markov model generates melodies note by note at the micro level, so correlation between phrases is low and the global structure is weak. The melodies generated by the attention_rnn model depend on the quantity and quality of the data; owing to the low quality of China's existing pop-music database, the generated melodies contain some meaningless repetitions, and the sectional structure is relatively inconspicuous.
Traditional music composition is a process in which the composer encodes musical information in a score, and the singer or performer, presented with the finished composition, decodes that encoding. The score plays a certain role in this decoding, but it cannot completely record the composer's entire encoding of the musical information, so factors such as the performer's personal innovation or biased understanding open a gap between the intended effect and the actual musical expression. With the computer-based composition of this paper, no such gap arises. First, the computer makes the composer's encoding easier and records the composer's information in fine detail. It also reduces the difficulty of decoding for singers and performers, since the conversion is carried out by the computer: the composer need only perform normally on the keys, and this performance information is automatically encoded by the computer, which is digital encoding in the true sense. The computer can also decode automatically, converting digital signals into analog signals. When composing on the computer, the composer must unify conception, composition, and performance into a whole, process the acoustics in time, and arrange the best possible finished product.
When the composer uses the computer method of this paper, the computer receives the composition signal and immediately converts the composer's score information into sound, completing the perception of the music in time through headphones, speakers, or other devices. The computer can also present these music signals directly in the sequencer software interface, displaying audio graphics or frequency curves consistent with the speaker output, so that the composition becomes a dynamic process. Through these displays the composer can see the effect of the creation, and its shortcomings, in time, and can re-encode the music, reorganizing it by merging, copying, cutting, and deleting, so that the musical effect is further perfected and the creation is completed in a dynamic, operational process.
This article adds constraints to the traditional Markov stochastic process, balancing innovation and regularity in melody generation and tightening the combination of generated rhythm and melody. Under the three sampling schemes of 8+8, 6+6, and 12+12, the scores of melodies generated by the proposed method exceed those of the traditional Markov method by about 21.95%, 31.88%, and 30% respectively, ensuring that the generated melodies sound pleasant to the human ear. On the three objective indexes of phrase correlation, interval characteristics, and the number of short rhythms resolving into long rhythms, this paper's model outperforms the traditional Markov and attention_rnn models, with overall performance improved by about 1 to 1.7 times. It can therefore be concluded that the method of this paper effectively optimizes melody generation in popular-music composition and helps promote compositional innovation.
2022 Provincial Quality Engineering Continuing Education Teaching Reform Project in Anhui Province: Construction and Practice of the Curriculum System of Higher Education in Northern Anhui under the Background of Rural Revitalization - Taking the Musicology Major of Fuyang Normal University as an Example (2022jxjy044); Horizontal Research Project of Fuyang Normal University: Midshore Art Innovation Technology Transformation and Service Contract (HX2021048); 2024 Social Science Innovation and Development in Anhui Province: Research on the Logical Way of Spreading Revolutionary Songs in Anhui (1921-1949) (2024CX148).