Research on Pattern Recognition Methods of Traditional Music Style Characteristics in Big Data Environment
Published online: 21 Mar 2025
Received: 03 Nov 2024
Accepted: 24 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0658
© 2025 Qinfei Han et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Traditional music style reflects the overall basic characteristics of traditional music works and is the basis for traditional music appreciation, analysis, and research. Computerized automatic genre analysis of traditional music has great theoretical significance and practical value for the retrieval, classification, distribution, and research of traditional music works [1-2]. Determining the genre of a piece of music is a clustering problem under fuzzy conditions. The current common scheme is to segment a piece of music, judge the genre characteristics of each segment separately, and synthesize the results of the segments to judge the genre of the whole piece [3-4]. Accurately describing a specific piece of traditional music involves a wide range of elements, including at least rhythm, melody, harmony, and timbre [5-6].
With the development of information and multimedia technologies, digital music can be widely accessed through different media, including radio broadcasting, the Internet, and digital storage devices such as CD-ROMs [7-8]. Access to a large amount of music requires tools that can efficiently retrieve and manage music of interest to end users, hence the emergence of Music Information Retrieval (MIR), an emerging research area in multimedia that has received a great deal of attention from both the research and music communities [9-10]. One of the key issues in MIR is categorization, which assigns labels to each song based on style, mood, artist, etc. Music categorization is an interesting topic with many potential applications and provides important utility for music retrieval, since most end users are likely to be interested in only certain types of music [11-12]. A recognition and classification system can therefore search for music that people are interested in. Moreover, different music genres have different attributes, and once they are categorized, people can manage them more efficiently.
Music recognition has received a great deal of attention from MIR researchers in recent years, and many tasks in MIR can be naturally cast as recognition and classification problems, such as style recognition, emotion recognition, artist recognition, and instrument recognition. Among them, music style recognition occupies a very important position in music information retrieval [13-14]. Many music users are only interested in specific styles of music, and music style recognition classifies music into different types according to style, so that users can employ the music recommendation function according to their interests, which facilitates fast retrieval and efficient management of their favorite music [15].
Most of the music repertoire is sung by people and accompanied by various types of musical instruments. In addition, the structural features of a music style differ from style to style, and even the same person will sing differently when interpreting different music styles in different registers [16]. These factors together make it very difficult to extract the features of music signals, which in turn makes it difficult to improve the recognition accuracy of music styles. Recognizing different music styles has therefore received extensive attention recently and has developed rapidly. How to further improve the recognition accuracy and efficiency of music styles has become the research focus in this field today [17-18].
In this paper, LPC cepstrum coefficients, Mel frequency cepstrum coefficients, and time-domain parameters are taken as traditional music style feature parameters and extracted. Based on statistical learning theory, the support vector machine (SVM) model is elaborated, and the model is extended from the binary classification problem to the multi-classification problem applicable to traditional music style recognition. Incremental learning theory is used to improve the support vectors of the SVM model, forming the traditional music style pattern recognition model of this paper. A test dataset is constructed with 2000 traditional music pieces representing the baroque, romantic, jazz, folk, and opera styles as research objects. The music signals input to the dataset are pre-emphasized, windowed, and framed so as to reduce the noise in the data before the traditional music style features are extracted. The Hamming window is selected as the main window function to extract the time-domain feature parameters of traditional music styles. To avoid the appearance of zero values, this paper adds a fixed jitter constant to the energy spectrum when the Mel frequency cepstrum coefficients are extracted. Based on the constructed traditional music dataset, five traditional music styles are recognized using the recognition model in this paper, and the effectiveness of the SVM model under incremental learning in traditional music style pattern recognition is highlighted by comparing it with the conventional SVM model.
Traditional music styles are composed of different musical elements; these elements include meter, tone, etc., and their embodied forms are the characteristic parameters of the music. The extraction of music characteristic parameters is based on operating on file frames. A file frame is a data group extracted from a sound waveform file, which divides an audio file into multiple consecutive frames, each containing a fixed number of sampling points.
LPC refers to Linear Predictive Coefficients. LPC technology is widely used in speech coding and decoding; from the AMR codec proposed by 3GPP to ITU-T G.729, LPC is the core component. LPC Cepstrum Coefficients (LPCC) are based on LPC technology, and many successful systems today use LPCC as their feature parameters [19]. Most parameters used for speech pattern recognition can also be used for traditional music pattern recognition because of the great similarity between traditional music and speech.
The LPCC is a complex cepstrum, which is solved recursively from the LPC coefficients.

First, it is assumed that the reference model for analysis is the all-pole model, whose system function is shown in Eq. (1):

$$H(z)=\frac{1}{1-\sum_{i=1}^{p}a_{i}z^{-i}}\tag{1}$$

where $p$ is the prediction order and $a_{i}$ are the linear prediction coefficients.

Taking the logarithm of $H(z)$ gives Eq. (2):

$$\hat{H}(z)=\ln H(z)\tag{2}$$

and expanding $\hat{H}(z)$ as a power series in $z^{-1}$ defines the cepstrum coefficients $c_{n}$:

$$\hat{H}(z)=\sum_{n=1}^{\infty}c_{n}z^{-n}\tag{3}$$

Substituting Eq. (1) into Eq. (2) and taking partial derivatives with respect to $z^{-1}$ on both sides gives:

$$\frac{\partial}{\partial z^{-1}}\ln\frac{1}{1-\sum_{i=1}^{p}a_{i}z^{-i}}=\frac{\partial}{\partial z^{-1}}\sum_{n=1}^{\infty}c_{n}z^{-n}\tag{4}$$

To wit:

$$\frac{\sum_{i=1}^{p}ia_{i}z^{-(i-1)}}{1-\sum_{i=1}^{p}a_{i}z^{-i}}=\sum_{n=1}^{\infty}nc_{n}z^{-(n-1)}\tag{5}$$

Therefore:

$$\sum_{i=1}^{p}ia_{i}z^{-(i-1)}=\left(1-\sum_{i=1}^{p}a_{i}z^{-i}\right)\sum_{n=1}^{\infty}nc_{n}z^{-(n-1)}\tag{6}$$

Organizing the two sides and collecting the terms with equal powers of $z^{-1}$ yields:

$$\sum_{i=1}^{p}ia_{i}z^{-(i-1)}-\sum_{n=1}^{\infty}nc_{n}z^{-(n-1)}+\sum_{n=1}^{\infty}\sum_{i=1}^{p}nc_{n}a_{i}z^{-(n+i-1)}=0\tag{7}$$

By making the coefficient of each power of $z^{-1}$ in Eq. (7) equal to 0, a recurrence relation between $c_{n}$ and $a_{i}$ is obtained:

$$c_{1}=a_{1},\qquad c_{n}=a_{n}+\sum_{k=1}^{n-1}\frac{k}{n}c_{k}a_{n-k}\ (1<n\leq p),\qquad c_{n}=\sum_{k=n-p}^{n-1}\frac{k}{n}c_{k}a_{n-k}\ (n>p)\tag{8}$$

where the $c_{n}$ computed by Eq. (8) are the LPC cepstrum coefficients; in practice, only the first dozen or so coefficients are retained as feature parameters.
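The recursion that converts LPC coefficients into LPCC can be sketched in a few lines; the following is a minimal NumPy version (the function name and the predictor-form coefficient convention are our own):

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps):
    """Convert LPC coefficients a = [a_1, ..., a_p] (predictor form,
    H(z) = 1 / (1 - sum_i a_i z^-i)) into n_ceps LPC cepstrum
    coefficients via the standard recursion."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        # c_n starts from a_n while n <= p, and from 0 afterwards
        acc = a[n - 1] if n <= p else 0.0
        # add the (k/n) * c_k * a_{n-k} cross terms
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a first-order predictor $H(z)=1/(1-az^{-1})$, the known closed form is $c_{n}=a^{n}/n$, which the recursion reproduces.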
Mel frequency cepstrum coefficients (MFCC) are a better parameter in music style recognition and outperform other parameters in many experiments [20]. Below 1000 Hz, the perceived frequency of the human ear is close to a linear function of the natural frequency, while above 1000 Hz, the perceived frequency becomes a logarithmic function of the natural frequency. Since human auditory perception is ultimately the evaluation standard and object in music style pattern recognition, the concept of Mel frequency was proposed, with 1 Mel defined as 1/1000 of the degree of pitch perception at 1000 Hz.
The extraction process of MFCC includes the following steps:
1) Firstly, the original music signal $x(n)$ is pre-emphasized, framed, and windowed to obtain short-time frames.

2) Then, each time-domain frame is transformed to the frequency domain by the discrete Fourier transform (DFT). Since zero-padding of the frame length has been performed prior to the DFT, the spectral expression $X(k)$ of the original music time-domain signal can be derived using the Fast Fourier Transform (FFT).

3) The energy spectrum $|X(k)|^{2}$ obtained from the FFT is passed through a bank of $M$ triangular Mel filters. The triangular filter center frequencies $f(m)$, $m=1,2,\dots,M$, are uniformly spaced on the Mel scale, where the Mel frequency is related to the natural frequency $f$ by:

$$\mathrm{Mel}(f)=2595\lg\left(1+\frac{f}{700}\right)\tag{10}$$

The frequency response of the $m$th triangular filter is shown in Eq. (11):

$$H_{m}(k)=\begin{cases}0, & k<f(m-1)\\[2pt] \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\leq k\leq f(m)\\[2pt] \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k\leq f(m+1)\\[2pt] 0, & k>f(m+1)\end{cases}\tag{11}$$

where $f(m-1)$ and $f(m+1)$ denote the lower and upper cutoff frequencies of the $m$th filter.

4) After finding the Mel spectrum, it is necessary to take its logarithm so that there is better robustness when using the Mel frequency for spectral error estimation. The output logarithmic spectrum obtained after taking the logarithm of the filtered music signal is:

$$s(m)=\ln\left(\sum_{k=0}^{N-1}|X(k)|^{2}H_{m}(k)\right),\quad 0\leq m<M\tag{12}$$

5) After finding the logarithmic spectrum of the music signal, it is necessary to perform the discrete cosine transform on it to obtain the MFCC:

$$C(n)=\sum_{m=0}^{M-1}s(m)\cos\left(\frac{\pi n(m+0.5)}{M}\right),\quad n=1,2,\dots,L\tag{13}$$

where $L$ is the number of cepstrum coefficients retained.
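The extraction steps above can be sketched end to end for a single frame; this is a minimal illustration in NumPy (the filter count, FFT size, and the small constant added before the logarithm are illustrative values, not the paper's):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=24, n_ceps=12, n_fft=1024, eps=1e-10):
    """MFCC of one pre-emphasized, windowed frame (steps 2-5 above)."""
    # FFT with zero-padding to n_fft, then energy spectrum
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # Triangular filter bank whose centers are uniform on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log of the Mel spectrum; eps plays the role of the jitter constant
    log_mel = np.log(fbank @ spec + eps)
    # DCT of the log Mel spectrum keeps the first n_ceps coefficients
    m_idx = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), m_idx + 0.5) / n_filters)
    return dct @ log_mel
```

In practice, the MFCCs of all frames of a clip are stacked into a feature matrix and fed to the classifier.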
Commonly used time-domain parameters in music style pattern recognition include short-time energy, short-time average amplitude, and short-time average zero-crossing rate. "Short-time" means that a short window function is applied to the music signal, and the corresponding energy analysis, signal amplitude analysis, or zero-crossing analysis is carried out on the windowed signal.
In music signals, the energy and average amplitude represent the degree of strength of the music signal, which is an important characteristic of music signals.
The short-time energy of a music signal $x(m)$ is defined as follows:

$$E_{n}=\sum_{m=-\infty}^{\infty}\left[x(m)w(n-m)\right]^{2}\tag{14}$$

where $w(n)$ is the window function. Commonly used windows are:

Rectangular window:
$$w(n)=\begin{cases}1, & 0\leq n\leq N-1\\ 0, & \text{otherwise}\end{cases}$$

Hamming window:
$$w(n)=\begin{cases}0.54-0.46\cos\dfrac{2\pi n}{N-1}, & 0\leq n\leq N-1\\ 0, & \text{otherwise}\end{cases}$$

Hanning window:
$$w(n)=\begin{cases}0.5\left(1-\cos\dfrac{2\pi n}{N-1}\right), & 0\leq n\leq N-1\\ 0, & \text{otherwise}\end{cases}$$

Let $h(n)=w^{2}(n)$; then the short-time energy can be written as a convolution:

$$E_{n}=\sum_{m=-\infty}^{\infty}x^{2}(m)h(n-m)\tag{15}$$

In music style analysis, the short-time energy is sensitive to high signal levels because of the squaring operation, and the short-time average amplitude removes this effect. The short-time average amplitude is defined as follows:

$$M_{n}=\sum_{m=-\infty}^{\infty}|x(m)|w(n-m)\tag{16}$$
It can be seen that the selection of the window function has a large impact on the short-time energy and short-time average amplitude. When calculating the short-time energy and short-time average amplitude, the following two aspects should be considered: the type of window function and the length of the window function. The time corresponding to the window length of the window function is generally taken as 0ms~100ms, in which the rhythmic change information of the music can be reflected.
The short-time zero-crossing rate refers to the number of times the time-domain waveform of the music signal crosses the zero axis per unit of time. When the sample values of two neighboring sample points have different signs, one zero crossing occurs. The short-time zero-crossing rate is defined as follows:

$$Z_{n}=\frac{1}{2}\sum_{m=-\infty}^{\infty}\left|\operatorname{sgn}[x(m)]-\operatorname{sgn}[x(m-1)]\right|w(n-m)\tag{17}$$

where $\operatorname{sgn}[\cdot]$ is the sign function.

It can be seen that the short-time zero-crossing rate is sensitive to noise: if noise is present, it will cause a large number of spurious zero crossings, so band-pass filtering of the original music file is required. The short-time zero-crossing rate after filtering is defined as follows:

$$Z_{n}=\frac{1}{2}\sum_{m=-\infty}^{\infty}\Big\{\left|\operatorname{sgn}[x(m)-T]-\operatorname{sgn}[x(m-1)-T]\right|+\left|\operatorname{sgn}[x(m)+T]-\operatorname{sgn}[x(m-1)+T]\right|\Big\}w(n-m)\tag{18}$$

The zero-crossing rate in Eq. (18) is expressed as the number of times the positive and negative thresholds $T$ and $-T$ are traversed; the threshold truncation filters out the noise that affects the zero-crossing count of the music and improves its correctness.
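The three short-time parameters can be computed frame by frame; the sketch below uses a Hamming window and the paper's frame sizes (35 ms window, 20 ms shift at 18 kHz), while the threshold value `T` is an illustrative choice:

```python
import numpy as np

def short_time_features(x, frame_len=630, hop=360, T=0.02):
    """Per-frame short-time energy, average amplitude, and thresholded
    zero-crossing count of a music signal x."""
    w = np.hamming(frame_len)
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        f = x[start:start + frame_len] * w
        energy = np.sum(f ** 2)               # Eq.-style short-time energy
        amplitude = np.sum(np.abs(f))         # short-time average amplitude
        # count crossings of the +T and -T thresholds (noise suppression)
        zc = 0.5 * (np.sum(np.abs(np.sign(f[1:] - T) - np.sign(f[:-1] - T))) / 2
                    + np.sum(np.abs(np.sign(f[1:] + T) - np.sign(f[:-1] + T))) / 2)
        feats.append((energy, amplitude, zc))
    return np.array(feats)
```

Each row of the returned array describes one frame, so the rhythmic evolution of a clip appears as the trajectory of these three values over time.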
In recent years, SVM has gradually become a popular technology in the field of data mining and machine learning. Many researchers have conducted in-depth studies on support vector machines and achieved results in both theory and practice. The core ideas of SVM are structural risk minimization and the optimal classification hyperplane. A classification model based on SVM theory can achieve excellent performance even with few samples. In addition, compared with other learning algorithms, SVM has significant advantages in handling problems such as over-fitting, stability, and the curse of dimensionality. It is because of these advantages that SVM is selected in this paper as the basic algorithm for building a music style feature recognition model and for reaching the goal of incremental music style recognition.
Statistical learning theory is a statistical theory that systematically studies machine learning problems under small-sample conditions [21]. The machine learning problem is simply the general problem of function estimation using a finite sample of observations. It can be formulated as follows: given $\ell$ independent and identically distributed observations

$$(x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{\ell},y_{\ell})\tag{19}$$

where the unknown joint probability distribution $F(x,y)$ generates the samples, the goal is to find, within a set of functions $\{f(x,w)\}$, the function that best approximates the dependence between $x$ and $y$, i.e., the one that minimizes the expected risk:

$$R(w)=\int L\left(y,f(x,w)\right)\mathrm{d}F(x,y)\tag{20}$$

where $L(y,f(x,w))$ is the loss function. Since $F(x,y)$ is unknown, $R(w)$ cannot be computed directly. It is usually approximated by finding a function that minimizes the empirical risk on the training samples:

$$R_{emp}(w)=\frac{1}{\ell}\sum_{i=1}^{\ell}L\left(y_{i},f(x_{i},w)\right)\tag{21}$$

This is the empirical risk minimization (ERM) principle.

Two factors affect the generalization ability of the ERM algorithm: one is the confidence range associated with the VC dimension, and the other is the empirical risk. The two satisfy the following relationship with probability at least $1-\eta$:

$$R(w)\leq R_{emp}(w)+\sqrt{\frac{h\left(\ln\frac{2\ell}{h}+1\right)-\ln\frac{\eta}{4}}{\ell}}\tag{26}$$

where $h$ is the VC dimension of the function set $\{f(x,w)\}$, $\ell$ is the number of training samples, and the second term on the right-hand side is the confidence range.

The VC dimension is intuitively defined as follows: for a set of indicator functions $\{f(x,w)\}$, if there exist $h$ samples that can be separated by functions in the set into all $2^{h}$ possible labelings, while no set of $h+1$ samples can be so separated, then the VC dimension of the function set is $h$.

From Eq. (26), it can be seen that if a larger function set $\{f(x,w)\}$ is chosen, the empirical risk can be made smaller, but the VC dimension $h$ becomes larger and the confidence range widens, so the bound on the actual risk may worsen.
The SRM principle is a method aimed at minimizing the generalization risk over its two terms: the empirical risk and the confidence range. The basic idea is to construct a nested sequence of function subsets

$$S_{1}\subset S_{2}\subset\cdots\subset S_{k}\subset\cdots\tag{27}$$

where the VC dimensions of the subsets satisfy $h_{1}\leq h_{2}\leq\cdots\leq h_{k}\leq\cdots$.

The SRM principle is to choose the appropriate subset $S_{k}$ in which minimizing the empirical risk also yields the smallest value of the bound on the actual risk in Eq. (26).

In general, as the VC dimension $h$ increases, the minimum empirical risk decreases but the confidence range increases; SRM therefore seeks the best trade-off between the two.
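The trade-off in Eq. (26) can be evaluated numerically; the sketch below computes only the confidence-range term (the confidence level $\eta=0.05$ and the sample counts used in the comment are illustrative):

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """Confidence range of Eq. (26):
    sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l)."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

# With the sample size fixed, a richer function class (larger VC
# dimension h) widens the confidence range, so a smaller empirical
# risk does not automatically mean a smaller actual-risk bound.
```

For example, with 5500 training samples the confidence range grows monotonically as $h$ increases from 10 to 100, which is exactly the tension that SRM balances.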
Based on the linearly separable case, the SVM is proposed where an optimal classification plane exists [22]. The optimal classification plane is defined as the one that maximizes the classification margin while ensuring that the two classes of samples are classified accurately. Classifying the two classes of samples correctly realizes empirical risk minimization, while maximizing the margin keeps the confidence range as small as possible; the combination of the two satisfies the SRM principle. In the linearly separable case, the goal is to find a hyperplane that correctly classifies the samples of the two classes, and the whole process is shown schematically in Fig. 1, in which the two classes of training samples are represented by circles and squares.

Optimal hyperplane in linearly separable case
Suppose the set of training samples is $\{(x_{i},y_{i})\}$, $i=1,2,\dots,n$, with $x_{i}\in\mathbb{R}^{d}$ and $y_{i}\in\{+1,-1\}$. A separating hyperplane is written as $w\cdot x+b=0$. After normalization, all samples satisfy the constraint shown in Eq. (28):

$$y_{i}(w\cdot x_{i}+b)\geq 1,\quad i=1,2,\dots,n\tag{28}$$

and the classification margin equals $2/\|w\|$, so maximizing the margin is equivalent to minimizing the objective in Eq. (29):

$$\varphi(w)=\frac{1}{2}\|w\|^{2}\tag{29}$$

The decision function is:

$$f(x)=\operatorname{sgn}(w\cdot x+b)\tag{30}$$

where sgn(·) is the sign function.

Then, the hyperplane that minimizes the function $\varphi(w)$ under the constraints of Eq. (28) is the optimal classification plane.

Next, the following function is defined using Lagrange's method for finding extreme values:

$$L(w,b,\alpha)=\frac{1}{2}\|w\|^{2}-\sum_{i=1}^{n}\alpha_{i}\left[y_{i}(w\cdot x_{i}+b)-1\right]\tag{31}$$

where $\alpha_{i}\geq 0$ are the Lagrange multipliers. Setting the partial derivatives of $L$ with respect to $w$ and $b$ to zero transforms the problem into its dual form.

Solve for the maximum value of the following function under the constraints $\sum_{i=1}^{n}y_{i}\alpha_{i}=0$ and $\alpha_{i}\geq 0$:

$$W(\alpha)=\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})\tag{32}$$

If $\alpha^{*}=(\alpha_{1}^{*},\dots,\alpha_{n}^{*})$ is the optimal solution, then:

$$w^{*}=\sum_{i=1}^{n}\alpha_{i}^{*}y_{i}x_{i}\tag{33}$$

That is, the weight coefficients of the optimal classification plane can be represented by a linear combination of the training sample vectors, where $b^{*}$ is obtained from the complementarity condition

$$\alpha_{i}^{*}\left[y_{i}(w^{*}\cdot x_{i}+b^{*})-1\right]=0\tag{34}$$

In the training samples, there will be many samples $x_{i}$ whose multipliers satisfy $\alpha_{i}^{*}=0$; these samples do not affect the solution.

The optimal classification surface is determined only by the samples with $\alpha_{i}^{*}>0$, which are called support vectors.

Then, the optimal classification function is:

$$f(x)=\operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_{i}^{*}y_{i}(x_{i}\cdot x)+b^{*}\right)\tag{36}$$
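The fact that the weight vector is a combination of the support vectors can be verified directly; the sketch below uses scikit-learn's `SVC` on toy data (the data and the library choice are illustrative, not the paper's setup):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

# w* = sum_i alpha_i* y_i x_i ; SVC stores alpha_i * y_i in dual_coef_
w = clf.dual_coef_ @ clf.support_vectors_
assert np.allclose(w, clf.coef_)  # matches the fitted weight vector

# f(x) = sgn(w.x + b) recovers the training labels
pred = np.sign(X @ w.ravel() + clf.intercept_)
```

Only a few of the six points end up as support vectors, illustrating that the remaining samples (those with $\alpha_i^*=0$) do not influence the classifier.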
In addition to the linearly separable case described above, most problems in practical situations are linearly non-separable. As shown in Figure 2, using the optimal classification surface directly in this situation often produces a large error. When the linearly non-separable case is encountered, the optimal classification surface needs to be improved accordingly: a slack variable $\xi_{i}\geq 0$ is added to each constraint, and the problem becomes:

$$\min\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}\xi_{i},\qquad\text{s.t.}\ \ y_{i}(w\cdot x_{i}+b)\geq 1-\xi_{i}\tag{37}$$

where $C>0$ is the penalty factor, which controls the degree of punishment for misclassified samples.

Optimal hyperplane in the case of linear indivisibility
For the minimization problem of Eq. (37), the approach taken is also to utilize the Lagrangian dual form, the difference being that the constraints differ slightly from the linearly separable case. To wit, solve for the maximum of:

$$W(\alpha)=\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})\tag{38}$$

The constraints are:

$$\sum_{i=1}^{n}y_{i}\alpha_{i}=0,\qquad 0\leq\alpha_{i}\leq C,\quad i=1,2,\dots,n\tag{39}$$

where the only change is that the multipliers are now bounded above by the penalty factor $C$.

For a nonlinear problem, the inner product $(x_{i}\cdot x_{j})$ is further replaced by a kernel function $K(x_{i},x_{j})$ satisfying Mercer's condition, so that the dual objective becomes:

$$W(\alpha)=\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}K(x_{i},x_{j})\tag{40}$$

and the corresponding discriminant function Eq. (36) should also become:

$$f(x)=\operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_{i}^{*}y_{i}K(x_{i},x)+b^{*}\right)\tag{41}$$
Therefore, the idea of SVM can be summarized as the use of a suitable inner product function to deal with nonlinear problems, which aims to map the input space to high dimensions to find the optimal classification surface.
Different kernel functions and parameters can have different effects on the performance of SVM. In SVM algorithms, the choice of kernel function is critical for linearly non-separable data. The aim of the kernel trick is to introduce a new high-dimensional feature space into which the training samples are projected so that they become linearly separable there. This mapping is defined as $\phi:\mathbb{R}^{d}\to H$, which maps each input sample $x$ to a point $\phi(x)$ in the high-dimensional feature space $H$.

However, because this mapping goes from the input space to the feature space, the dimensionality can grow explosively, making the inner product $\phi(x_{i})\cdot\phi(x_{j})$ computationally intractable. The kernel function avoids this by computing the feature-space inner product directly in the input space: $K(x_{i},x_{j})=\phi(x_{i})\cdot\phi(x_{j})$.
The four types of kernel functions most used today in theoretical studies and practical applications of SVMs are:

1) Linear kernel: $K(x,x_{i})=x\cdot x_{i}$. The SVM obtained in this case is a hyperplane in the sample space.

2) Polynomial kernel: $K(x,x_{i})=(x\cdot x_{i}+1)^{d}$. The SVM derived in this case is a $d$th-order polynomial classifier.

3) Gaussian radial basis kernel: $K(x,x_{i})=\exp\left(-\|x-x_{i}\|^{2}/\sigma^{2}\right)$. In the Gaussian kernel function, the center of each basis function corresponds to a support vector, and the weights are obtained by the algorithm's computation.

4) Sigmoid kernel: $K(x,x_{i})=\tanh\left(v(x\cdot x_{i})+c\right)$. In this case, the SVM is a multilayer perceptron containing one hidden layer, and the algorithm automatically determines the number of nodes in the hidden layer.
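The four kernels can be written down directly; the parameter values below ($d$, $\sigma$, $v$, $c$) are illustrative defaults, not values from the paper:

```python
import numpy as np

# The four kernel functions listed above, for vectors x, z
def linear_k(x, z):
    return x @ z                                  # hyperplane in sample space

def poly_k(x, z, d=3):
    return (x @ z + 1.0) ** d                     # d-th order polynomial classifier

def rbf_k(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)  # Gaussian radial basis

def sigmoid_k(x, z, v=0.5, c=-1.0):
    return np.tanh(v * (x @ z) + c)               # one-hidden-layer perceptron
```

Any of these can be substituted for the inner product in the dual objective, which is the only place the data enter the optimization.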
Multi-class classification learning is an important problem in the field of machine learning, and its application in real life is very common. The principle and implementation method of multi-classification in support vector machines is discussed below.
According to the given training sample set $(x_{1},y_{1}),\dots,(x_{l},y_{l})$, where $x_{i}\in\mathbb{R}^{n}$ and the class labels $y_{i}\in\{1,2,\dots,N\}$, find a decision function:

$$f:\mathbb{R}^{n}\to\{1,2,\dots,N\}$$

It can be seen that solving a multi-class classification problem is essentially finding the rule that divides the points of $\mathbb{R}^{n}$ into $N$ parts.
The one-to-one (1-a-1) algorithm constructs all possible two-class classifiers among the $N$ classes of training samples, each trained only on the samples of the two classes concerned, resulting in a total of $N(N-1)/2$ classifiers. When classifying a test sample, a voting method is used: if the classifier for classes $i$ and $j$ decides that the sample belongs to class $i$, class $i$ receives one vote, otherwise class $j$ does; the sample is finally assigned to the class with the most votes. Since the number of two-class classifiers, $N(N-1)/2$, grows quadratically with the number of classes, the training cost becomes high when the number of classes is large.
The one-against-rest (1-a-r) algorithm constructs $N$ two-class classifiers, the $k$th of which is trained by taking the samples of class $k$ as positive examples and all remaining samples as negative examples. In the classification of test data samples, the comparison method is used: the test sample $x$ is evaluated by all $N$ classifiers and assigned to the class whose classifier gives the largest decision function value. Because each of the $N$ classifiers is trained on all the training samples, the training cost per classifier is higher than in the 1-a-1 scheme, although fewer classifiers are needed.
Based on the 1-a-1 algorithm, Platt et al. propose a new learning structure: the decision directed acyclic graph (DDAG). The method contains $N(N-1)/2$ internal nodes, each of which is a two-class classifier, and $N$ leaves, each corresponding to one class.
In DDAG training, only the individual sub-classifiers need to be trained, and by maximizing the margins of the binary classifiers in the DDAG structure, the misclassification rate can be made lower. DAGSVM uses an exclusion method to classify the samples: after each sub-classifier operation, the least likely category is excluded. The DDAG structure differs from a tree structure: in a general decision tree, an error occurring at a node is carried forward, whereas the DDAG structure has redundancy in that the classification paths for the same category may differ. This method is easier to compute and more efficient to learn than a general decision tree.
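The 1-a-1 voting scheme described above can be sketched as follows (scikit-learn is used for the pairwise classifiers; the data and parameters are illustrative):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_one_vs_one(X, y):
    """Train all N(N-1)/2 pairwise classifiers (1-a-1 scheme),
    each on the samples of its two classes only."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return models

def predict_by_voting(models, x):
    """Each pairwise classifier casts one vote; the majority class wins."""
    votes = {}
    for m in models.values():
        winner = m.predict(x.reshape(1, -1))[0]
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```

With $N=5$ styles as in this paper, the scheme trains 10 pairwise classifiers, and each test clip receives 10 votes in total.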
The classification and recognition process usually faces constantly arriving new data, and the initial training sample set cannot reflect all of the sample information. When the new training samples accumulate to a certain scale, directly merging the new sample dataset with the already-trained sample dataset and retraining not only makes training abnormally difficult but also consumes a great deal of training time and storage capacity. Traditional machine learning algorithms cannot solve these problems well and introduce further difficulties of their own. A pioneering idea is to split the new data into several subsets and train on them incrementally in turn, with each round of training utilizing the previous historical training results, so that the classification model continually adapts and improves in accuracy as incremental training proceeds. Traditional music style feature recognition has not effectively solved the above problems, so there is a need to study the incremental learning technique, which allows the music style recognition model to be optimized continuously as music data are added.
The meaning of incremental learning is to treat the new samples as an increment and optimize the initial recognition model obtained from the previous training, so that the optimized recognition model maintains high recognition accuracy on both the previous training sample set and the incremental sample set [23]. Ideally, the accuracy of the optimized recognizer improves as incremental learning accumulates. In addition, traditional machine learning algorithms do not retain the effective information in previous training results, while incremental learning algorithms can effectively filter out the most valuable samples of the previous training set, discard repeated and invalid information, and reduce the time of the next training, thereby improving efficiency.
Through the analysis of SVM theory, it can be seen that the key role in the establishment of the recognition model is, in fact, the support vector set, and the SV set can even represent all the information of the current training sample set. In addition, the SV set generally accounts for only a small proportion of the entire training sample set. As new samples are constantly added, the original SV set can no longer characterize the information of the new samples, i.e., the current recognition model can no longer accurately recognize them, so it needs to be retrained. Due to the importance of the SV set and its representativeness, if the SV set can be utilized appropriately instead of being discarded, the information of the last recognition model can be utilized, and the size of the training sample set can be reduced. In summary, it is the theoretical advantages of SVM that make it very effective in solving incremental learning challenges.
For the SVM's optimization problem, each training datum $x_{i}$ corresponds to a Lagrange multiplier $\alpha_{i}$, and $(\alpha_{1},\dots,\alpha_{n})$ is the optimal solution of the dual problem if and only if the Karush-Kuhn-Tucker (KKT) conditions are satisfied:

$$\alpha_{i}=0\ \Rightarrow\ y_{i}f(x_{i})\geq 1,\qquad 0<\alpha_{i}<C\ \Rightarrow\ y_{i}f(x_{i})=1,\qquad \alpha_{i}=C\ \Rightarrow\ y_{i}f(x_{i})\leq 1$$

where $f(x)$ is the decision function of the current classifier. A newly added sample that violates the KKT conditions of the current classifier falls into one of three cases:

1) It is located within the classification interval, on the same side of the classification boundary as samples of its own class; it can still be correctly classified by the original classifier and satisfies $0\leq y_{i}f(x_{i})<1$.

2) It is located within the classification interval but on the opposite side of the classification boundary from samples of its own class; it cannot be correctly classified by the original classifier and satisfies $-1<y_{i}f(x_{i})\leq 0$.

3) It is located outside the classification interval and on the opposite side of the classification boundary from samples of its own class; it cannot be correctly classified by the original classifier and satisfies $y_{i}f(x_{i})\leq -1$.

In summary, the samples that violate the KKT conditions are exactly those with $y_{i}f(x_{i})<1$, and it is these samples that can change the current SV set and hence the classifier.

After the analysis and reasoning in the related literature, if the newly added samples all satisfy the KKT conditions of the current classification model, the optimal solution remains unchanged and retraining is unnecessary; if some new samples violate the KKT conditions, the support vector set will change after retraining.
In summary, in the incremental learning process of SVM, the SV set of the current classification model may undergo the following transformation: a portion of the non-support vectors (nSVs) may become SVs, and some of the original SVs may become nSVs. This conclusion is depicted in Fig. 3.
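One round of such an incremental step, retaining the current SV set and adding only the KKT-violating new samples before retraining, might be sketched as follows (this is a simplification for illustration; the linear kernel and the value of `C` are assumptions, not the paper's configuration):

```python
import numpy as np
from sklearn.svm import SVC

def incremental_update(clf, X_old, y_old, X_new, y_new, C=1.0):
    """One incremental step: keep the current SV set, add only the new
    samples that violate the KKT conditions (y * f(x) < 1), and retrain
    on this reduced set instead of the full merged data."""
    sv_idx = clf.support_                       # indices of current SVs
    margins = y_new * clf.decision_function(X_new)
    violators = margins < 1.0                   # KKT-violating new samples
    X_train = np.vstack([X_old[sv_idx], X_new[violators]])
    y_train = np.concatenate([y_old[sv_idx], y_new[violators]])
    return SVC(kernel="linear", C=C).fit(X_train, y_train)
```

Because the SV set is usually a small fraction of the training set, each retraining round operates on far fewer samples than a full merge-and-retrain would.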

Changes of support vector after increment
There are approximately 2,000 pieces of traditional music in the experimental database, including five types of music: baroque, romantic, jazz, folk, and opera. Most of the Baroque-style music in the database is by Bach and Handel, who were the most important composers of the Baroque era. The Romantic style music in the database consists of works by Chopin, Schubert, Liszt, Beethoven, and other composers of the Romantic era. Jazz consists of works by jazz singers, the singers include nine male and sixteen female singers. Folk songs and operas in the database also include works by many different composers.
All music data in the database is 18kHz, 32-bit, mono files. About 6750 30-second clips were randomly selected from 2000 pieces of music to make up the recognition database, of which 5500 were used for training, while 1250 were used for testing. For each traditional music genre, about 1100 clips were in the training set, and about 250 clips were in the test set; 30-second clips from the same piece of music do not appear in both the training and test sets.
In order to facilitate the subsequent processing of the music signal, the pre-emphasized output signal is usually normalized. A comparison of the signal before and after pre-emphasis is shown in Fig. 4, with (a) and (b) indicating the original signal waveform and the waveform after pre-emphasis, respectively. The peaks of the waveform after pre-emphasis are more prominent, which facilitates further processing.

Pre-emphasis comparison
A good window function should generally satisfy two requirements. In the time domain, the slopes at the two ends of the window should be correspondingly reduced so that, when multiplied by the music waveform, the edges of the window transition slowly, thus reducing the truncation effect on the frame. In the frequency domain, it should have a relatively low maximum sidelobe level and a narrow 3 dB main-lobe bandwidth. Considering these two factors, this paper chooses the Hamming window; the actual framing of the music signal is shown in Figure 5.

Music signal frame
Fig. 6 shows the traditional time-domain feature parameter extraction process, where the top plot represents the original music signal fragment and the middle plot is the window function. The bottom plot shows the N sampling points within a frame obtained by multiplying the window function and the music signal fragment at a certain moment.

Traditional time-domain characteristic parameter extraction
In the process of selecting the window function, the rectangular window is rarely used in practical applications because its spectrum leakage is very serious. In this paper, the Hamming window is selected for its relatively small sidelobes, fast sidelobe decay, and small leakage; the time-domain feature parameters it extracts are shown in Figure 7. In the extraction of time-domain feature parameters of music signals, mostly short-time features are extracted; these short-time feature parameters can capture the characteristic attributes of the music style. Because the music signal is non-stationary over long durations, in this paper, before processing, the signal is segmented into 30-50 ms sections so that each section can be treated as short-time stationary. As the name implies, the feature parameters extracted on a 30-50 ms time window are the time-domain feature parameters.

Hamming frequency response
In this paper, the sampling rate processed is 18 kHz, the window length used is 35 ms, and the frame shift is 20 ms, so the window length expressed in samples is N = 0.035 × 18000 = 630.
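The pre-emphasis, framing, and windowing pipeline with these parameters might be sketched as follows (the pre-emphasis coefficient 0.97 and the jitter constant value are assumed for illustration; the paper states only that a fixed jitter constant is added):

```python
import numpy as np

def frame_energy_spectrum(x, sr=18000, win_ms=35, hop_ms=20, jitter=1e-8):
    """Pre-emphasize, frame, and Hamming-window the signal, then compute
    each frame's FFT energy spectrum; the small jitter constant keeps a
    subsequent log() finite, as described in the text."""
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])       # pre-emphasis
    n = int(sr * win_ms / 1000)                       # 630 samples at 18 kHz
    hop = int(sr * hop_ms / 1000)                     # 360-sample frame shift
    w = np.hamming(n)
    frames = [x[i:i + n] * w for i in range(0, len(x) - n + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2 + jitter
```

The resulting per-frame energy spectra are the input to the Mel filter bank stage described earlier.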

The energy spectrum of one frame signal after processing
Fig. 9 shows the MFCC feature parameters of a certain frame extracted for five different styles of traditional music. S1-S5 denotes baroque, romantic, jazz, folk, and opera styles, respectively. From the figure, it can be seen that the trend of MFCC features for each music style is different.

MFCC of different music styles
In order to verify the effectiveness of the traditional music style recognition model constructed based on the incremental SVM, recognition experiments are carried out on the data containing five music styles using the incremental SVM and the traditional SVM recognition model, respectively, under the same music style feature parameters and the same kernel function. The results are shown in Fig. 10, in which S1-S5 denote the five traditional music styles and MA denotes the average accuracy rate. The black bars on the left side of the figure show the recognition results for the five music styles after introducing incremental learning, while the white bars show the recognition results without incremental learning. The last column shows the average accuracy in the two cases, which is 96.27% and 70.48%, respectively. The comparison of the results in the figure verifies the effectiveness of the incremental learning introduced on the basis of the traditional SVM: to a certain extent, the scope of the unrecognizable region is narrowed and the recognition accuracy is improved, especially for the jazz style.

Identification result contrast
Figure 11 shows the recognition results of this paper's traditional music style recognition model for the five traditional music styles, from which it can be seen that the model's accuracy for each of the five styles is around 90%, of which the recognition rate of opera-style traditional music is as high as 95.11%. In contrast, the model has a higher misrecognition rate on the romantic and jazz styles, with 8 pieces of romantic-style traditional music misrecognized as jazz and 7 pieces of jazz misrecognized as baroque-style traditional music. Overall, the traditional music style pattern recognition model constructed based on the incremental SVM algorithm is able to effectively capture traditional music style features and thus realize accurate recognition of traditional music styles.

Incremental SVM model recognition results
This paper introduces incremental learning theory on the basis of the SVM algorithm to construct a traditional music style pattern recognition model and uses the extracted traditional music style feature parameters as the model input to recognize five different styles of traditional music. The per-style recognition accuracy of this model on the five traditional music styles, including baroque, romantic, jazz, folk song, and opera, is maintained at about 90% or above, and the average recognition accuracy reaches 96.27%. The model's misrecognition rate on the romantic and jazz styles is relatively high: 8 romantic-style pieces are misrecognized as jazz, 7 jazz pieces are misrecognized as baroque-style music, and 7 jazz pieces are misrecognized as folk songs. To summarize, the model performs well in the traditional music style recognition task, and the SVM algorithm incorporating incremental learning theory improves the average accuracy of traditional music style recognition by 25.79 percentage points compared with the traditional SVM algorithm. These results indicate that the introduction of incremental learning improves the performance of the SVM model and demonstrates the feasibility of recognizing different traditional music styles from music style feature parameters.
