Open Access

Research on Pattern Recognition Methods of Traditional Music Style Characteristics in Big Data Environment

21 March 2025

Introduction

Traditional music style reflects the overall basic characteristics of traditional music works and is the basis for traditional music appreciation, analysis, and research. Computerized automatic genre analysis of traditional music has great theoretical significance and practical value for the retrieval, classification, distribution, and research of traditional music works [1-2]. Determining the genre of a piece of music is a clustering problem under fuzzy conditions. The current common scheme is to segment a piece of music, judge the genre characteristics of each segment separately, and synthesize the results of the segments to judge the genre of the whole piece [3-4]. Accurately describing a specific piece of traditional music involves a wide range of elements, including at least rhythm, melody, harmony, and timbre [5-6].

With the development of information and multimedia technologies, digital music can be widely accessed through different media, including radio broadcasting, the Internet, and digital storage devices such as CD-ROMs [7-8]. Access to a large amount of music requires tools that can efficiently retrieve and manage the music of interest to end users, hence the emergence of Music Information Retrieval (MIR), an emerging research area in multimedia that has received a great deal of attention from both the research and music communities [9-10]. One of the key issues in MIR is categorization, which assigns labels to each song based on style, mood, artist, and so on. Music categorization is an interesting topic with many potential applications and provides important utility for music retrieval, since most end users are likely to be interested in only certain types of music [11-12]. A recognition and classification system can therefore search for the music people are interested in. On the other hand, different music genres have different attributes, and once they are categorized, people can manage them more efficiently.

Music recognition has received a great deal of attention from MIR researchers in recent years, and many tasks in MIR can naturally be cast as recognition and classification problems, such as style recognition, emotion recognition, artist recognition, and instrument recognition. Among them, music style recognition occupies a very important position in music information retrieval [13-14]. Many music users are interested only in specific styles of music, and music style recognition classifies music into different types according to style, so that users can apply music recommendation functions according to their interests, which facilitates fast retrieval and efficient management of their favorite music [15].

Most of the music repertoire is sung by people and accompanied by various types of musical instruments. In addition, the structural features of a piece differ from style to style, and even the same person will sing differently when interpreting different music styles in different registers [16]. These factors make it very difficult to extract features from music signals, which in turn makes it difficult to improve the recognition accuracy of music styles. Recognizing different music styles has therefore recently received extensive attention and developed rapidly, and how to further improve the recognition accuracy and efficiency of music styles has become the research focus in this field [17-18].

In this paper, LPC cepstrum coefficients, Mel frequency cepstrum coefficients, and time-domain parameters are taken as traditional music style feature parameters and extracted. Based on statistical learning theory, the support vector machine (SVM) model is elaborated, and the model is extended from the binary classification problem to the multi-classification problem applicable to traditional music style recognition. Incremental learning theory is then used to improve the support vectors of the SVM model, forming the traditional music style pattern recognition model of this paper. A test dataset is constructed from 2000 traditional music pieces representing the baroque, romantic, jazz, folk, and opera styles. The music signals in the dataset are pre-emphasized, windowed, and framed to reduce noise before the traditional music style features are extracted. The Hamming window is selected as the main window function for extracting the time-domain feature parameters of traditional music styles. To avoid zero values, a fixed jitter constant is added to the energy spectrum when the Mel frequency cepstrum coefficients are extracted. Based on the constructed traditional music dataset, five traditional music styles are recognized using the recognition model of this paper, and the effectiveness of the SVM model under incremental learning in traditional music style pattern recognition is highlighted by comparing it with the commonly used SVM model.

Parameters characterizing traditional music styles

Traditional music styles are composed of different musical elements, such as meter and tone, and the embodied form of these elements is the characteristic parameter of the music. The extraction of these characteristic parameters operates on music file frames. A file frame is a data group extracted from a sound waveform file, dividing an audio file into groups of n consecutive sampling points. The frame length n is typically 8192, 4096, 2048, 1024, or 512 samples, and adjacent frames may overlap. After the music file is split into frames, time-domain feature parameters can be extracted directly, or the corresponding transform-domain feature parameters can be extracted using transform operations such as the FFT. The following section briefly introduces the feature parameters commonly used in traditional music style pattern recognition.

LPC cepstrum coefficients

LPC stands for Linear Predictive Coefficients. Linear prediction is widely used in speech coding and decoding; from the AMR codec proposed by 3GPP to ITU-T G.729, LPC is a core component. LPC Cepstrum Coefficients (LPCC) are derived from LPC analysis, and many successful systems use LPCC as their feature parameters [19]. Because traditional music and speech are highly similar signals, most parameters used for speech pattern recognition can also be used for traditional music pattern recognition.

The LPCC is a complex cepstrum: the music signal is first z-transformed, the logarithm of the transform is taken, and the result is then inverse z-transformed to obtain the LPCC parameters.

First, it is assumed that the reference model for analysis is the all-pole model, whose system function is shown in Eq. (1):
$$H(z)=\frac{1}{1-\sum_{k=1}^{p} a_k z^{-k}}$$

In Eq. (1), p is the order of the LPC linear predictor. Assuming that the impulse response of this model is h(n), then:
$$H(z)=\sum_{n=1}^{\infty} h(n) z^{-n}$$

Taking the logarithm of H(z) in equation (2) yields:
$$\hat{H}(z)=\lg H(z)=\sum_{n=1}^{\infty}\hat{h}(n)z^{-n}$$

Substituting Eq. (1) into Eq. (2) and taking partial derivatives with respect to z−1 on both sides of the equation gives:
$$\frac{\partial}{\partial z^{-1}}\lg\left[\frac{1}{1-\sum_{k=1}^{p}a_k z^{-k}}\right]=\frac{\partial}{\partial z^{-1}}\sum_{n=1}^{\infty}\hat{h}(n)z^{-n}$$

That is:
$$\frac{\sum_{k=1}^{p}k a_k z^{-k+1}}{1-\sum_{k=1}^{p}a_k z^{-k}}=\sum_{n=1}^{\infty}n\hat{h}(n)z^{-n+1}$$

Therefore:
$$\left(1-\sum_{k=1}^{p}a_k z^{-k}\right)\sum_{n=1}^{\infty}n\hat{h}(n)z^{-n+1}=\sum_{k=1}^{p}k a_k z^{-k+1}$$

Collecting terms gives:
$$z^{0}\left[\hat{h}(1)-a_{1}\right]+z^{-1}\left[2\hat{h}(2)-\hat{h}(1)a_{1}-2a_{2}\right]+\cdots+z^{-p}\left[(p+1)\hat{h}(p+1)-p\hat{h}(p)a_{1}-\cdots-\hat{h}(1)a_{p}\right]+z^{-p-1}\left[(p+2)\hat{h}(p+2)-(p+1)\hat{h}(p+1)a_{1}-\cdots-2\hat{h}(2)a_{p}\right]+\cdots=0$$

By setting the coefficient of each term in Eq. (7) equal to 0, a recurrence relation between h^(n) and ak is obtained, from which h^(n) can be derived from ak:
$$\begin{cases}\hat{h}(0)=0\\ \hat{h}(1)=a_{1} & (n=1)\\ \hat{h}(n)=a_{n}+\sum\limits_{k=1}^{n-1}\left(1-\dfrac{k}{n}\right)a_{k}\hat{h}(n-k) & (1<n\le p)\\ \hat{h}(n)=\sum\limits_{k=1}^{p}\left(1-\dfrac{k}{n}\right)a_{k}\hat{h}(n-k) & (n>p)\end{cases}$$

where ak is the LPC coefficient and p is its order.
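As a concrete illustration of the recursion above, the following sketch computes LPC cepstral coefficients from a given set of LPC coefficients ak. The function name and the example coefficients are illustrative and are not taken from the paper; the LPC coefficients themselves would normally be estimated per frame (e.g., by the Levinson-Durbin algorithm).

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LPC cepstral coefficients via the recursion above:
    h_hat(1) = a_1;
    h_hat(n) = a_n + sum_{k=1}^{n-1} (1 - k/n) a_k h_hat(n-k)   for 1 < n <= p;
    h_hat(n) =       sum_{k=1}^{p}   (1 - k/n) a_k h_hat(n-k)   for n > p."""
    p = len(a)
    a = np.concatenate(([0.0], np.asarray(a, dtype=float)))  # 1-based indexing: a[1..p]
    h_hat = np.zeros(n_ceps + 1)                              # h_hat[0] = 0 by definition
    for n in range(1, n_ceps + 1):
        if n == 1:
            h_hat[1] = a[1]
        elif n <= p:
            h_hat[n] = a[n] + sum((1 - k / n) * a[k] * h_hat[n - k] for k in range(1, n))
        else:
            h_hat[n] = sum((1 - k / n) * a[k] * h_hat[n - k] for k in range(1, p + 1))
    return h_hat[1:]

# Illustrative LPC coefficients for one frame (order p = 4)
print(lpc_to_cepstrum([1.2, -0.6, 0.3, -0.1], n_ceps=8))
```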

Mel frequency cepstrum coefficients

The Mel frequency cepstrum coefficient (MFCC) is among the best-performing parameters in music style recognition and outperforms other parameters in many experiments [20]. Below about 1000 Hz, the frequency perceived by the human ear is close to a linear function of the physical frequency, while above 1000 Hz the relationship becomes logarithmic. Since human auditory perception is ultimately the evaluation standard in music style pattern recognition, the concept of Mel frequency was proposed, and 1 Mel is defined as 1/1000 of the perceived pitch at 1000 Hz.

The extraction process of MFCC includes the following steps:

1) Firstly, the original music signal s(n) is pre-processed, including pre-emphasis, frame-splitting, windowing, etc. After a series of pre-processing, the time-domain signal x(n) is obtained from the music frame.

2) Then, the framed time-domain music signal x(n) is zero-padded so that the frame length becomes an integer power of 2; generally, the frame length is taken to be N = 512. A discrete Fourier transform (DFT) is applied to the framed signal, giving the frequency-domain expression X(k) of the music signal:
$$X(k)=\sum_{n=0}^{N-1}x(n)e^{-j2\pi nk/N}\quad(0\le n,k\le N-1)$$

Since the frame length has been zero-padded to a power of 2 before the DFT, the spectrum of the original time-domain music signal can be computed efficiently with the Fast Fourier Transform (FFT).

3) The spectrum X(k) obtained from the FFT is filtered by the Mel frequency filter bank; the filtered result is the Mel spectrum. After the Mel spectrum is obtained, logarithmic energy processing is applied, i.e., the logarithm of the energy is taken, giving the logarithmic spectrum S(m) of the music signal. The Mel frequency filter bank consists of M band-pass filters Hm(k), 0 ≤ m < M, which are triangular filters; the center frequency of filter Hm(k) is denoted f(m). The spacing of the center frequencies depends on the filter index: neighboring triangular filters at the low-frequency end of the bank are more closely spaced than those at the high-frequency end. The Mel frequency filter is expressed as follows:
$$H_m(k)=\begin{cases}0 & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)} & f(m-1)\le k\le f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)} & f(m)<k\le f(m+1)\\ 0 & k>f(m+1)\end{cases}\qquad(0\le m<M)$$

The triangular filter center frequency f(m) is defined as follows:
$$f(m)=\left(\frac{N}{F_s}\right)B^{-1}\!\left(B(f_l)+m\,\frac{B(f_h)-B(f_l)}{M+1}\right)$$

In Eq. (11), f_l and f_h are the minimum and maximum frequencies of the triangular filter bank, respectively, and N is the length of the FFT sequence, i.e., the length of the zero-padded music frame. F_s denotes the sampling rate of the music signal, and B−1 is the inverse Mel mapping B−1(b) = 700(e^{b/1125} − 1).

After the Mel spectrum is found, its logarithm is taken so that the result is more robust when the Mel frequencies are used for spectral error estimation. The logarithmic spectrum obtained from the music signal is:
$$S(m)=\ln\left(\sum_{k=0}^{N-1}|X(k)|^{2}H_m(k)\right)\quad(0\le m<M)$$

After the logarithmic spectrum of the music signal has been obtained, a discrete cosine transform is applied to S(m), transforming the logarithmic spectrum into the cepstral domain. The output of the discrete cosine transform is the Mel frequency cepstrum coefficient c(n):
$$c(n)=\sum_{m=0}^{M-1}S(m)\cos\!\left(\frac{\pi n(m+1/2)}{M}\right)\quad(0\le n<M)$$
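The MFCC extraction steps above can be sketched as follows for a single, already pre-processed frame. The filter count, cepstral order, and jitter constant eps are illustrative defaults, not values prescribed by the paper; the Mel mapping matches the B and B−1 forms given above.

```python
import numpy as np

def mel(f):        # B(f), the Mel mapping consistent with B^{-1}(b) = 700(e^{b/1125} - 1)
    return 1125.0 * np.log(1.0 + f / 700.0)

def inv_mel(b):    # B^{-1}(b)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mfcc_frame(x, fs, n_filters=24, n_ceps=13, eps=1e-10):
    """MFCC of one pre-processed frame x: FFT -> Mel filter bank -> log -> DCT.
    eps is the fixed jitter constant added to the energy spectrum to avoid log(0)."""
    N = len(x)
    power = np.abs(np.fft.fft(x, N)) ** 2                  # energy spectrum |X(k)|^2

    # Triangular filter centre frequencies (in FFT-bin units), equally spaced on the Mel scale
    mel_points = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    bins = np.floor((N / fs) * inv_mel(mel_points)).astype(int)

    # Triangular Mel filter bank
    H = np.zeros((n_filters, N))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            H[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            H[m - 1, k] = (right - k) / max(right - centre, 1)

    S = np.log(H @ power + eps)                             # logarithmic Mel spectrum S(m)
    n = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    dct_basis = np.cos(np.pi * n * (m + 0.5) / n_filters)   # DCT basis as in the c(n) formula
    return dct_basis @ S

# Illustrative call on a synthetic 512-sample frame sampled at 18 kHz
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 18000)
print(mfcc_frame(frame, fs=18000))
```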

Time domain parameters

Commonly used time-domain parameters in music style pattern recognition include the short-time energy, the short-time average amplitude, and the short-time zero-crossing rate. "Short-time" means that a short window function is applied to the music signal, and the corresponding energy, amplitude, or zero-crossing analysis is then carried out on the windowed signal.

Short-time energy and short-time average amplitude

In music signals, the energy and average amplitude represent the degree of strength of the music signal, which is an important characteristic of music signals.

The short-time energy of a music signal is defined as follows:
$$E_n=\sum_{m=-\infty}^{\infty}\left[x(m)w(n-m)\right]^{2}=\sum_{m=n}^{n+N-1}\left[x(m)w(n-m)\right]^{2}$$

w(n) in Eq. (14) is the window function, and the following window functions are commonly used in signal processing:

Rectangular window:
$$w(n)=\begin{cases}1 & 0\le n\le N-1\\ 0 & \text{otherwise}\end{cases}$$

Hamming window:
$$w(n)=\begin{cases}0.54-0.46\cos\!\left(\dfrac{2\pi n}{N-1}\right) & 0\le n\le N-1\\ 0 & \text{otherwise}\end{cases}$$

Hanning window:
$$w(n)=\begin{cases}0.5\left(1-\cos\!\left(\dfrac{2\pi n}{N-1}\right)\right) & 0\le n\le N-1\\ 0 & \text{otherwise}\end{cases}$$

Let h(n) = w²(n) and write Eq. (14) as a convolution:
$$E_n=\sum_{m=-\infty}^{\infty}x^{2}(m)w^{2}(n-m)=\sum_{m=-\infty}^{\infty}x^{2}(m)h(n-m)=x^{2}(n)*h(n)$$

In music style analysis, the short-time energy is sensitive to large signal amplitudes because of the squaring; the short-time average amplitude removes this effect. The short-time average amplitude is defined as follows:
$$M_n=\sum_{m=-\infty}^{\infty}|x(m)|\,w(n-m)=|x(n)|*w(n)$$

It can be seen that the choice of window function has a large impact on the short-time energy and the short-time average amplitude. When calculating them, two aspects should be considered: the type of the window function and its length. The duration corresponding to the window length is generally taken in the range of 0 ms to 100 ms, within which the rhythmic change of the music can be reflected.

Short-time zero crossing rate

The short-time zero-crossing rate is the number of times per unit time that the time-domain waveform of the music signal crosses the zero axis. When the sample values at two neighboring sample points have opposite signs, one zero crossing has occurred. The short-time zero-crossing rate is defined as follows:
$$Z_n=\sum_{m=-\infty}^{\infty}\left|\operatorname{sgn}[x(m)]-\operatorname{sgn}[x(m-1)]\right|\,w(n-m)$$

Here sgn[x] is the sign function; it and the window function w(n) are defined as follows:
$$\operatorname{sgn}(x)=\begin{cases}1 & x\ge 0\\ -1 & x<0\end{cases}\qquad w(n)=\begin{cases}1 & 0\le n\le N-1\\ 0 & \text{otherwise}\end{cases}$$

It can be seen that the short-time zero-crossing rate is sensitive to noise; if noise is present, it produces a large number of spurious zero crossings, so the original music signal must first be band-pass filtered. The thresholded short-time zero-crossing rate is defined as follows:
$$Z_n=\sum_{m=-\infty}^{\infty}\Big\{\left|\operatorname{sgn}[x(m)-T]-\operatorname{sgn}[x(m-1)-T]\right|+\left|\operatorname{sgn}[x(m)+T]-\operatorname{sgn}[x(m-1)+T]\right|\Big\}\,w(n-m)$$

In Eq. (18), the zero-crossing rate is expressed as the number of times the positive and negative thresholds ±T are crossed; the threshold truncation filters out the noise that affects the zero-crossing rate of the music and improves its reliability.
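A minimal sketch of frame-wise short-time energy, average amplitude, and thresholded zero-crossing rate is given below, assuming a Hamming window. The frame length, hop size, and threshold T are illustrative values and are not taken from the paper.

```python
import numpy as np

def short_time_features(x, frame_len=1024, hop=512, T=0.02):
    """Per-frame short-time energy E_n, average amplitude M_n, and thresholded
    zero-crossing rate (crossings of +T and -T, as in Eq. (18)) with a Hamming window."""
    w = np.hamming(frame_len)
    energies, amplitudes, zcrs = [], [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energies.append(np.sum((frame * w) ** 2))            # E_n
        amplitudes.append(np.sum(np.abs(frame) * w))          # M_n
        sgn_hi = np.where(frame - T >= 0, 1, -1)              # sgn[x(m) - T]
        sgn_lo = np.where(frame + T >= 0, 1, -1)              # sgn[x(m) + T]
        crossings = np.abs(np.diff(sgn_hi)) / 2 + np.abs(np.diff(sgn_lo)) / 2
        zcrs.append(np.sum(crossings * w[1:]))                # weighted crossing count
    return np.array(energies), np.array(amplitudes), np.array(zcrs)

# Illustrative call on a synthetic signal
signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 18000)) + 0.01 * np.random.randn(18000)
E, M, Z = short_time_features(signal)
```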

Traditional music style feature recognition model

In recent years, SVM has gradually become a popular technique in data mining and machine learning. Many researchers have studied support vector machines in depth and achieved results in both theory and practice. The core ideas of the SVM are structural risk minimization and the optimal classification hyperplane. A classification model based on SVM theory can achieve excellent performance even with few samples. In addition, compared with other learning algorithms, SVM has significant advantages in handling problems such as over-learning, stability, and the curse of dimensionality. Because of these advantages, SVM is selected in this paper as the basic algorithm for building the music style feature recognition model, with the goal of achieving incremental music style recognition.

Statistical learning
Empirical risk minimization principles and VC dimensions

Statistical learning theory is a statistical theory for small samples that systematically studies machine learning problems [21]. The machine learning problem is, simply put, the general problem of estimating a function from a finite sample of observations. It can be formulated as follows: given l independent and identically distributed observations (x1,y1),(x2,y2),…,(xl,yl), find an optimal function f(x,δ0) in a given set of functions {f(x,δ), δ ∈ Δ} such that the expected risk is minimized:
$$R(\delta)=\int L(y,f(x,\delta))\,dF(x,y)$$

where the unknown joint probability distribution F(x,y) = F(x)F(y|x) describes the relationship between the variables x ∈ Rn and y ∈ R, Δ is the parameter set, and L(y,f(x,δ)) is the loss incurred when the function f(x,δ) is used for prediction, also known as the loss function. However, since the joint distribution F(x,y) is unknown, the expected risk in Eq. (24) cannot be computed directly. All that is available is the sample (x1,y1),…,(xl,yl) ∈ Rn × R, so, by the law of large numbers, the expected risk is replaced by the following empirical risk:
$$R_{emp}(\delta)=\frac{1}{n}\sum_{i=1}^{n}L(y_i,f(x_i,\delta))$$

The function f(x,δ0) is then approximated by a function f(x,δn) that minimizes the empirical risk Remp(δ).

It is usually possible to find a function f(x,δn) in {f(x,δ), δ ∈ Δ} that minimizes the empirical risk, e.g., f(xi,δn) = yi, i = 1,…,l, but the predictive ability of this function on new samples may be poor, i.e., an "over-learning" problem arises, which hurts the generalization ability of the model. There are two common approaches to overcoming the over-learning problem: one is the regularization algorithm, and the other is the empirical risk minimization (ERM) principle, which selects a suitable function set H and minimizes the empirical risk on H.

Two factors affect the generalization ability of the ERM algorithm: one is the confidence range associated with the VC dimension, the other is the empirical risk. With probability at least 1–η, they satisfy:
$$R(\delta)\le R_{emp}(\delta)+\sqrt{\frac{h\left(\ln(2n/h)+1\right)-\ln(\eta/4)}{n}}$$

where the square-root term $\sqrt{\frac{h\left(\ln(2n/h)+1\right)-\ln(\eta/4)}{n}}$ is called the confidence risk.

The VC dimension can be defined intuitively as follows: for a set of indicator functions {f(x,δ), δ ∈ Δ}, if there exist h samples that the functions in the set can separate in all 2^h possible ways, the set is said to shatter those h samples, and the VC dimension of the function set is the maximum number of samples h that it can shatter. In particular, if the function set can shatter any number of samples, its VC dimension is infinite. The VC dimension reflects the capacity of the function set: the larger the VC dimension, the richer the functions contained in the set.

From Eq. (26), it can be seen that if a larger function set {f(x,δ), δ ∈ Δ} is chosen, a suitable approximation f(x,δn) can be found that makes Remp smaller, but this leads to a higher VC dimension and therefore a larger confidence range. Conversely, choosing a smaller function set {f(x,δ), δ ∈ Δ} lowers the VC dimension but increases the empirical risk Remp(δ). To strike a balance between the two, statistical learning theory introduces the principle of structural risk minimization.

Principle of structural risk minimization

The SRM principle aims to minimize the generalization risk over both terms: the empirical risk and the confidence range. The basic idea is to decompose the function set S = {f(x,δ), δ ∈ Δ} into a sequence of nested subsets satisfying:
$$S_1\subset S_2\subset\cdots\subset S_k\subset\cdots\subset S$$

where the VC dimension hi of each function subset is finite and satisfies:
$$h_1\le h_2\le\cdots\le h_k\le\cdots$$

The SRM principle then chooses the appropriate h* that minimizes the sum of the confidence range and the empirical risk, which determines the corresponding function subset S*.

In general, as the VC dimension h increases, the confidence risk increases and the empirical risk decreases; conversely, as h decreases, the empirical risk increases and the confidence risk decreases. The function in the subset S* that balances the two is therefore chosen as the optimal function. The support vector machine, which is based on statistical learning theory, is proposed on the basis of the SRM criterion.

Support Vector Machine Model
Optimal Classification Surface
Linearly separable case

SVM was first proposed for the linearly separable case, in which an optimal classification plane exists [22]. The optimal classification plane is defined as the plane that maximizes the classification margin while ensuring that the two classes of samples are classified accurately. Correctly classifying the two classes realizes empirical risk minimization, while maximizing the margin keeps the confidence range as small as possible; the combination of the two satisfies the SRM principle. In the linearly separable case, the goal is to find a hyperplane that correctly separates the samples of the two classes; the whole process is shown schematically in Fig. 1. The two classes of training samples in the figure are represented by circles and squares, P is the optimal classification line, δ is the maximum classification margin, and P1 and P2 are the straight lines on which the support vectors lie. Generalized to higher dimensions, P is the optimal hyperplane, and P1 and P2 are the planes on which the support vectors lie. From statistical learning theory, the optimal hyperplane satisfies the SRM principle, so its generalization ability is higher than that of other classification hyperplanes.

Figure 1.

Optimal hyperplane in linearly separable case

Suppose the training sample set is {(xi,yi)}, i = 1,2,…,n, xi ∈ Rd, and yi ∈ {1,−1} is the class label. The general form of a linear discriminant function in d-dimensional space is g(x) = ω·x + b, and the classification surface equation is:
$$\omega\cdot x+b=0$$

The decision function is:
$$f(x)=\operatorname{sgn}(\omega\cdot x+b)$$

where sgn(·) is the sign function.

The function g(x) is then normalized so that |g(x)| = 1 for the samples closest to the classification surface, at which point the classification margin is 2/‖ω‖. Maximizing the margin is therefore equivalent to minimizing ‖ω‖ (or ‖ω‖²). The optimization problem of the linearly separable SVM can then be summarized as:
$$\min\;\frac{\|\omega\|^{2}}{2}\quad\text{s.t.}\quad y_i(\omega\cdot x_i+b)\ge 1,\;i=1,2,\dots,n$$

Next, the following function is defined using the Lagrange method for finding extrema:
$$L(\omega,b,\alpha)=\frac{\|\omega\|^{2}}{2}-\sum_{i=1}^{n}\alpha_i\left[y_i(\omega\cdot x_i+b)-1\right]$$

where αi > 0 are the Lagrange multipliers, introduced to find the optimal ω* and b* that minimize the function. The original convex quadratic optimization problem can be converted into its dual by taking the partial derivatives of Eq. (31) with respect to ω and b and setting them to zero, which gives the constraints:
$$\sum_{i=1}^{n}y_i\alpha_i=0,\qquad \alpha_i\ge 0,\;i=1,2,\dots,n$$

Under these constraints on αi, the maximum of the following function is sought:
$$\max Q(\alpha)=\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j (x_i\cdot x_j)$$

If αi* is the optimal solution, then:
$$\begin{cases}\omega^{*}=\sum\limits_{i=1}^{n}\alpha_i^{*}y_i x_i\\[4pt] b^{*}=-\dfrac{1}{2}\,\omega^{*}\cdot(x_r+x_s)\end{cases}$$

That is, the weight vector of the optimal classification plane can be represented as a linear combination of the training samples, where xr and xs are an arbitrary pair of support vectors, one from each class.

Among the training samples, αi* is zero for most samples and non-zero only for a small fraction; the optimal classification surface is determined by the support vectors, i.e., the samples whose αi* is non-zero.

The optimal classification function is then:
$$f(x)=\operatorname{sgn}(\omega^{*}\cdot x+b^{*})=\operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i^{*}y_i(x_i\cdot x)+b^{*}\right)$$

b* is the threshold for classification, which can be derived from any support vector.
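The support-vector representation of ω* and the decision function above can be checked with scikit-learn, which stores αi*yi for the support vectors in dual_coef_. The synthetic two-class data and the large C value (approximating the hard-margin case) below are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (illustrative synthetic data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=2, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e6)      # large C approximates the hard-margin case
clf.fit(X, y)

# Reconstruct omega* = sum_i alpha_i* y_i x_i from the support vectors
omega = (clf.dual_coef_ @ clf.support_vectors_).ravel()
print("omega*:", omega)
print("b*    :", clf.intercept_[0])
print("number of support vectors:", clf.support_vectors_.shape[0])

# Decision function f(x) = sgn(omega* . x + b*)
x_new = np.array([[1.5, 1.0]])
print("prediction:", np.sign(x_new @ omega + clf.intercept_[0]))
```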

Linearly non-separable case

In addition to the linearly separable case described above, most problems in practice are linearly non-separable. As shown in Figure 2, using the optimal classification surface of the separable case in this situation often produces large errors, so the optimal classification surface needs to be modified. The usual approach is to add slack variables ξi ≥ 0 to the constraints; the optimization problem for the optimal plane then becomes minimizing Eq. (37):
$$\begin{cases}\phi(\omega,\xi)=\dfrac{1}{2}\|\omega\|^{2}+C\displaystyle\sum_{i=1}^{n}\xi_i\\[4pt]\text{s.t.}\;\; y_i(\omega\cdot x_i+b)\ge 1-\xi_i,\;\;\xi_i\ge 0\end{cases}$$

where C is a constant that penalizes misclassified samples: the larger C is, the heavier the penalty for misclassification. This strengthens the learning model and improves its generalization ability.

Figure 2.

Optimal hyperplane in the linearly non-separable case

For the minimization problem of Eq. (37), the same approach is taken: the Lagrangian dual form is used, the only difference being that the constraints differ slightly from the linearly separable case. That is:
$$\max Q(\alpha)=\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j (x_i\cdot x_j)$$

The constraints are:
$$0\le\alpha_i\le C,\;i=1,2,\dots,n,\qquad \sum_{i=1}^{n}\alpha_i y_i=0$$

where αi can take three kinds of values: (1) αi = 0, (2) 0 < αi < C, (3) αi = C.

The xi satisfying (2) are the standard support vectors, and the xi satisfying (3) are called boundary support vectors. From the above analysis, only the support vectors play a role in determining the optimal hyperplane and the decision function. Meanwhile, the optimal classification function in equation (36) contains a dot product between the classification samples and the support vectors. In the binary, low-dimensional case this can be computed directly; however, in a high-dimensional feature space an inner product must be used in place of the dot product, so that the original feature space can be transformed into a new feature space. Here K(x,x′) is used to replace the dot product in the optimal classification surface, and the optimization function becomes:
$$Q(\alpha)=\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i,x_j)$$

and the corresponding discriminant function Eq. (36) becomes:
$$f(x)=\operatorname{sgn}\left\{\sum_{i=1}^{n}\alpha_i^{*}y_i K(x_i,x)+b^{*}\right\}$$

Therefore, the idea of SVM can be summarized as the use of a suitable inner product function to deal with nonlinear problems, which aims to map the input space to high dimensions to find the optimal classification surface.

Feature Space and Kernel Functions

Different kernel functions and parameters have different effects on the performance of the SVM, and for linearly non-separable data the choice of kernel function is critical. The purpose of the kernel trick is to introduce a high-dimensional feature space into which the training samples are projected so that they become linearly separable there. Denoting this mapping by ϕ(x), the constrained optimization problem becomes:
$$\min_{\alpha}\;\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\,(\phi_i\cdot\phi_j)-\sum_{i=1}^{l}\alpha_i\quad\text{s.t.}\;\sum_{i=1}^{l}\alpha_i y_i=0,\;\alpha_i\ge 0,\;i=1,2,\dots,l$$

However, because this mapping goes from the input space to a much higher-dimensional feature space, the dimensionality can grow explosively, making the inner product ϕi·ϕj in the above problem prohibitively expensive to compute. A kernel function is therefore constructed with the expression:
$$K(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)$$

The four types of kernel functions that are most used today in theoretical studies and practical applications of SVMs are:

Linear kernel functions
$$K(x,x_i)=x\cdot x_i$$

The SVM obtained in this case is a hyperplane in the sample space.

Polynomial kernel function
$$K(x,x_i)=\left[(x\cdot x_i)+1\right]^{n}$$

The SVM derived in this case is an nth-order polynomial classifier.

Gaussian kernel function
$$K(x,x_i)=\exp\!\left(-\frac{\|x-x_i\|^{2}}{2\sigma^{2}}\right)$$

In the Gaussian kernel function, each basis function is centered on a support vector, and the algorithm computes the corresponding weights.

Sigmoid kernel function
$$K(x,x_i)=\tanh\!\left(v\,(x\cdot x_i)+c\right)$$

In this case, the SVM is a multilayer perceptron with one hidden layer, and the algorithm automatically determines the number of hidden-layer nodes.
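The four kernels listed above can be written down directly; in the sketch below the parameter values (polynomial order, σ, v, c) are illustrative choices, not values from the paper.

```python
import numpy as np

def linear_kernel(x, xi):
    return np.dot(x, xi)

def polynomial_kernel(x, xi, n=3):
    return (np.dot(x, xi) + 1) ** n

def gaussian_kernel(x, xi, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xi) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, xi, v=0.01, c=0.0):
    return np.tanh(v * np.dot(x, xi) + c)

# Illustrative evaluation of each kernel on two sample vectors
x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(x, xi))
```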

Multi-classification problems

Multi-class classification learning is an important problem in the field of machine learning, and its application in real life is very common. The principle and implementation method of multi-classification in support vector machines is discussed below.

Given the training sample set T = {(x1,y1),(x2,y2),…,(xl,yl)} ∈ (X×Y)^l, where:
$$x_i\in X=R^{n},\quad y_i\in Y=\{1,2,\dots,M\},\quad i=1,2,\dots,l$$

find a decision function:
$$f(x):X=R^{n}\to Y$$

It can be seen that solving a multi-class classification problem is essentially finding the rule that divides the points of Rn into M parts. Commonly used multi-classification strategies for support vector machines are the one-against-rest method, the one-against-one method, the decision directed acyclic graph method, and so on.

One-against-one method

The one-against-one algorithm constructs all possible two-class classifiers from the k classes of training samples, giving a total of k(k–1)/2 classifiers, each trained on only two of the k classes. These two-class classifiers are combined with a voting scheme, and the class receiving the most votes is the class to which the sample point belongs. A test sample x is input to the two-class classifier constructed from the samples of class m and class n of the k classes, whose classification function is:
$$f_{mn}(x)=\operatorname{sgn}\left\{\sum_{i=1}^{l}a_i^{mn}y_i^{mn}K(x_i,x)+b^{mn}\right\}$$

If the output of x belongs to category m, one vote will be added to category m. If it belongs to category n, one additional vote is given to category n.

Since the number of two-class classifiers, k(k–1)/2, grows rapidly with the number of classes k, the algorithm requires a large amount of computation, making training and test classification slower. Moreover, during test classification, when two classes receive the same number of votes it is impossible to determine which class the sample belongs to, which may result in classification errors.

One-against-rest method

The one-against-rest algorithm constructs k two-class classifiers for a k-class problem. In constructing the m-th of the k classifiers, the training samples of class m are treated as one class with label y_i^m = 1, and the samples of the remaining classes are treated as the other class with label y_i^m = –1. The optimization yields the classification output function of the m-th classifier:
$$f_m(x)=\operatorname{sgn}\left\{\sum_{x_i\in \mathrm{SV}}a_i^{m}y_i^{m}K(x_i,x)+b^{m}\right\}$$

When classifying test samples, the comparison method is used: the test sample x is fed into the k two-class classifiers, and the k outputs of the classification output functions are compared. The index of the classifier with the largest output is the class number to which the test sample x belongs.

When constructing each two-class classifier, all n training samples of the k classes have to participate in the computation, and during test classification all k two-class classifiers must be evaluated before the class of a test sample is determined. Hence, when the number of training samples n and the number of classes k are large, training and test classification are slow. The method also produces an indivisible region during classification, which affects the classification accuracy.
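Both decompositions are available as generic wrappers in scikit-learn, which makes the difference in the number of sub-classifiers easy to see. In the sketch below, the synthetic five-class data merely stands in for the five style classes; the RBF kernel choice is illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))              # placeholder feature vectors
y = rng.integers(0, 5, size=500)            # five classes standing in for the five styles

ovo = OneVsOneClassifier(SVC(kernel="rbf"))     # k(k-1)/2 = 10 binary classifiers
ovr = OneVsRestClassifier(SVC(kernel="rbf"))    # k = 5 binary classifiers
ovo.fit(X, y)
ovr.fit(X, y)
print("one-against-one sub-classifiers :", len(ovo.estimators_))
print("one-against-rest sub-classifiers:", len(ovr.estimators_))
```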

Decision-oriented acyclic graph

Based on the one-against-one algorithm, Platt et al. proposed a new learning structure: the decision directed acyclic graph (DDAG). The method contains k(k–1)/2 classifiers and introduces the idea of directed acyclic graphs from graph theory when combining the one-against-one classifiers. For a k-class classification problem, there are k(k–1)/2 nodes. Each classifier corresponds to two classes, and the nodes are arranged in a k-layer structure: the top layer contains only one node, called the root node, the second layer contains two nodes, and the j-th layer contains j nodes, where the i-th node in the j-th layer points to the i-th and (i+1)-th nodes in the (j+1)-th layer.

In DDAG training, only the individual sub-classifiers need to be trained, and by maximizing the margins of the binary classifiers in the DDAG structure, the misclassification rate can be kept low. DAGSVM uses an exclusion method to classify samples: after each sub-classifier is evaluated, the least likely class is excluded. The DDAG structure differs from a tree structure: in a general decision tree, an error at one node is carried forward, whereas the DDAG structure has redundancy, because the classification paths leading to the same class may differ. This method is easier to compute and more efficient to learn than a general decision tree.

SVM incremental learning
Incremental learning

The classification and recognition process usually faces a continual arrival of new data, and the initial training sample set cannot reflect all the sample information. When the new training samples accumulate to a certain scale, directly merging the new sample set with the already-trained sample set and retraining not only makes training exceedingly difficult but also consumes a great deal of training time and storage. Traditional machine learning algorithms cannot solve these problems satisfactorily and introduce other problems that are hard to resolve. A more promising idea is to split the new data into several subsets and train on these subsets incrementally in turn, with each round making use of the previous training result, so that the classification model adapts and its accuracy improves as the number of incremental training rounds increases. Traditional music style feature recognition has not effectively addressed these problems, so the incremental learning technique is studied here, allowing the music style recognition model to be optimized continuously as music data are added.

Incremental learning means taking the new samples as an increment and optimizing, to a certain extent, the initial recognition model obtained from the previous training, so that the optimized model maintains high recognition accuracy on both the previous training set and the incremental sample set [23]. Ideally, the accuracy of the optimized recognizer improves as incremental learning accumulates. In addition, traditional machine learning algorithms do not retain the useful information in previous training results, whereas an incremental learning algorithm can effectively filter out the most valuable samples from the previous training set, discard repeated and uninformative samples, and reduce the time of the next training round, thereby improving efficiency.

The analysis of SVM theory shows that what actually determines the recognition model is the support vector (SV) set, which can represent essentially all the information in the current training sample set. Moreover, the SV set generally accounts for only a small proportion of the entire training set. As new samples are added, the original SV set can no longer characterize the information of the new samples, i.e., the current recognition model can no longer recognize them accurately, so retraining is needed. Because of the importance and representativeness of the SV set, if it is retained and reused rather than discarded, the information in the previous recognition model can be exploited and the size of the training sample set can be reduced. In summary, these theoretical advantages make the SVM well suited to incremental learning.

Support vector change process after incremental learning

For the SVM optimization problem, each sample xi satisfies the KKT conditions (52)~(54) below if and only if:
$$\alpha_i=0\;\Leftrightarrow\;y_i f(x_i)\ge 1$$
$$0<\alpha_i<C\;\Leftrightarrow\;y_i f(x_i)=1$$
$$\alpha_i=C\;\Leftrightarrow\;y_i f(x_i)\le 1$$

The optimal solution of the optimization problem is α = (α1,α2,…,αl).

The decision function is f(x) and the new sample is (xi,yi). Then this sample violates the KKT condition in the following three ways:

1) is located in the classification interval on the same side of the classification boundary as the samples of the same kind and can be correctly classified by the original classifier, satisfying 0 ≤ yif(xi) < 1.

2) It lies inside the classification interval, on the opposite side of the classification boundary from samples of its own class, and cannot be correctly classified by the original classifier, satisfying -1 ≤ yif(xi) < 0.

3) It lies outside the classification interval, on the opposite side of the classification boundary from samples of its own class, so it cannot be correctly classified by the original classifier, satisfying yif(xi) < -1.

In summary, yif(xi) < 1 is a sufficient condition for the new sample (xi,yi) to violate the KKT conditions.

According to the analysis and reasoning in the related literature, let (xi,yi) be a new sample and f(x) the decision function of the classifier trained on the original training set. If (xi,yi) satisfies the KKT conditions of this classifier, the new sample will not change the original support vector set; conversely, if (xi,yi) does not satisfy the KKT conditions, it will change the support vector set. Therefore, for both original and new samples, whichever does not comply with the KKT conditions will affect the classification model obtained after incremental learning. For new samples that satisfy the KKT conditions, their information is already contained in the original classification model, so the next round of incremental training can focus on the new samples that violate the KKT conditions, because they are the ones that change the classification model. In addition, if during the next incremental training the new dataset is simply stacked onto the already-trained dataset, important classification information from the previous training set is likely to be discarded, degrading the performance and accuracy of the classification model.
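A minimal sketch of one incremental step as described above, assuming a binary SVC trained with scikit-learn: new samples violating the KKT condition yif(xi) ≥ 1 are merged with the current support-vector set and the model is retrained. The kernel, C value, and synthetic data are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

def incremental_update(clf, X_old, y_old, X_new, y_new, C=10.0):
    """One incremental step: keep only the support vectors of the current model,
    add the new samples that violate the KKT condition y_i f(x_i) >= 1, retrain."""
    sv_idx = clf.support_                                   # indices of support vectors
    violates = y_new * clf.decision_function(X_new) < 1     # KKT violation test
    X_inc = np.vstack([X_old[sv_idx], X_new[violates]])
    y_inc = np.concatenate([y_old[sv_idx], y_new[violates]])
    new_clf = SVC(kernel="rbf", C=C).fit(X_inc, y_inc)
    return new_clf, X_inc, y_inc

# Illustrative use with synthetic two-class data
rng = np.random.default_rng(0)
X0 = np.vstack([rng.normal(-1, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y0 = np.array([-1] * 100 + [1] * 100)
clf = SVC(kernel="rbf", C=10.0).fit(X0, y0)

X1 = np.vstack([rng.normal(-1, 1, (40, 5)), rng.normal(1, 1, (40, 5))])
y1 = np.array([-1] * 40 + [1] * 40)
clf, X_cur, y_cur = incremental_update(clf, X0, y0, X1, y1)
print("samples kept for the next increment:", len(y_cur))
```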

In summary, during the incremental learning of an SVM, the SV set of the current classification model may change as follows: some non-support vectors (nSVs) may become SVs, and some original SVs may become nSVs. This is illustrated in Fig. 3. S1~S4 are the original support vectors obtained from training on the original training set, A1~A3 are the newly added samples, and B1~B3 are samples from the original training set lying within the geometric margin determined by H1 and H2. H3 and H4 determine the optimal hyperplane obtained after incremental training, and {N2, S3, B2, A1, A2, N1} is the SV set after incremental learning. The initial samples N1 and N2 and the newly added sample A1 become SVs after incremental learning, the newly added sample A2 and the sample B2 within the original geometric margin that violate the KKT conditions become SVs after incremental training, and S1 and S2 in the original SV set become samples that satisfy the KKT conditions, and are no longer SVs, after incremental learning.

Figure 3.

Changes in the support vectors after incremental learning

Results of pattern recognition of traditional music styles
Experimental database

There are approximately 2,000 pieces of traditional music in the experimental database, covering five styles: baroque, romantic, jazz, folk, and opera. Most of the Baroque-style music in the database is by Bach and Handel, the most important composers of the Baroque era. The Romantic-style music consists of works by Chopin, Schubert, Liszt, Beethoven, and other composers of the Romantic era. The jazz music consists of works by jazz singers, nine male and sixteen female. The folk songs and operas in the database also include works by many different composers.

All music data in the database are 18 kHz, 32-bit, mono files. About 6,750 30-second clips were randomly selected from the 2,000 pieces of music to make up the recognition database, of which 5,500 were used for training and 1,250 for testing. For each traditional music style, about 1,100 clips are in the training set and about 250 clips in the test set; 30-second clips from the same piece of music never appear in both the training and test sets.
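One way to enforce the constraint that clips from the same piece never appear in both sets is a group-aware split, sketched below with placeholder arrays standing in for the real clip features, style labels, and piece identifiers; the split ratio follows the 5,500/1,250 division reported above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_clips = 6750
features = rng.normal(size=(n_clips, 13))          # placeholder clip feature vectors
labels = rng.integers(0, 5, size=n_clips)          # placeholder style labels (0..4)
piece_ids = rng.integers(0, 2000, size=n_clips)    # which of the 2000 pieces a clip came from

splitter = GroupShuffleSplit(n_splits=1, test_size=1250 / 6750, random_state=0)
train_idx, test_idx = next(splitter.split(features, labels, groups=piece_ids))
# Clips sharing a piece id end up entirely in the training set or entirely in the test set
X_train, X_test = features[train_idx], features[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]
```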

Data pre-processing
Pre-emphasis

In order to facilitate subsequent processing of the music signal, the pre-emphasized output signal is usually normalized. A comparison of the signal before and after pre-emphasis is shown in Fig. 4, where (a) and (b) show the original waveform and the waveform after pre-emphasis, respectively. The peaks of the waveform after pre-emphasis are more prominent, which facilitates further processing.
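A minimal pre-emphasis sketch with peak normalization is shown below; the filter coefficient 0.97 is a conventional choice and is not stated in the paper.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1], followed by
    peak normalization so that later processing works on a common scale."""
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    return y / (np.max(np.abs(y)) + 1e-12)
```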

Figure 4.

Comparison of the signal before and after pre-emphasis

Windowing and framing

A good window function should generally satisfy two requirements. In the time domain, the slopes at the two ends of the window should taper off so that, when the window is multiplied by the waveform, the edges of the windowed frame transition smoothly, reducing the truncation effect on the frame. In the frequency domain, the window should have a relatively low side-lobe maximum and a relatively high 3 dB bandwidth. Considering these two factors, this paper chooses the Hamming window. The framing of the actual music signal is shown in Figure 5.
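Framing with a Hamming window can be sketched as follows; the 35 ms window and 20 ms shift follow the settings reported later in this paper, while the function name itself is illustrative.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=35, hop_ms=20):
    """Split a music signal into overlapping frames and apply a Hamming window."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * w for i in range(n_frames)])
    return frames          # shape: (n_frames, frame_len)
```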

Figure 5.

Music signal frame

Music style feature parameter extraction
Time domain feature parameter extraction

Fig. 6 shows the traditional time-domain feature parameter extraction process, where the top plot represents the original music signal fragment and the middle plot is the window function. The bottom plot shows the N sampling points within a frame obtained by multiplying the window function and the music signal fragment at a certain moment.

Figure 6.

Traditional time-domain characteristic parameter extraction

When selecting the window function, the rectangular window suffers from serious spectral leakage and is rarely used in practice. In this paper, the Hamming window is selected because its side lobes are relatively small, decay quickly, and leak little; its frequency response is shown in Figure 7. For the extraction of time-domain feature parameters of music signals, mostly short-time features are extracted; these short-time parameters capture the characteristic attributes of the music style. Because a music signal is non-stationary over long durations, the signal is cut into segments of 30-50 ms before processing, so that each segment can be treated as short-time stationary. The feature parameters extracted on such 30-50 ms windows are the time-domain feature parameters.

Figure 7.

Hamming frequency response

Mel Frequency Cepstrum Coefficient Extraction

In this paper, the sampling rate is 18 kHz, the window length is 35 ms, the frame shift is 20 ms, and the window length corresponds to N = 520 sample points, that is, there are 520 sample points in a frame. A discrete Fourier transform is performed on these N points to obtain the spectral expression of one frame of the music signal. After the spectral values are obtained, they are squared to obtain the energy spectrum. In order to avoid zero values, a fixed jitter constant is added to the energy spectrum; the processed energy spectrum is shown in Figure 8.

Figure 8.

The energy spectrum of one frame signal after processing

Fig. 9 shows the MFCC feature parameters of one frame extracted from five different styles of traditional music. S1-S5 denote the baroque, romantic, jazz, folk, and opera styles, respectively. The figure shows that the trend of the MFCC features differs across music styles.

Figure 9.

MFCC of different music styles

Traditional Music Style Pattern Recognition

In order to verify the effectiveness of the traditional music style recognition model constructed in this paper based on the incremental SVM, recognition experiments were carried out on the data containing the five music styles using both the incremental SVM and the traditional SVM recognition model, with the same music style feature parameters and the same kernel function. The results are shown in Fig. 10, in which S1-S5 denote the five traditional music styles and MA denotes the average accuracy. The black bars on the left of each group show the recognition results after introducing incremental learning, while the white bars show the results without incremental learning. The last group shows the average accuracy in the two cases, 96.27% and 70.48%, respectively. The comparison verifies the effectiveness of introducing incremental learning on the basis of the traditional SVM: it narrows the scope of the unrecognizable region to a certain extent and improves recognition accuracy, especially for the jazz style.

Figure 10.

Comparison of recognition results

Figure 11 shows the recognition results of this paper's traditional music style recognition model for the five traditional music styles. The accuracy of the model for each of the five styles is around 90% or higher, and the recognition rate for opera-style traditional music is as high as 95.11%. In contrast, the model has a higher misrecognition rate on the Romantic and Jazz styles: 8 pieces of Romantic-style traditional music were misrecognized as Jazz, and 7 pieces of Jazz were misrecognized as Baroque-style traditional music. Overall, the traditional music style pattern recognition model constructed on the incremental SVM algorithm can effectively capture traditional music style features and thus realize accurate recognition of traditional music styles.

Figure 11.

Incremental SVM model recognition results

Conclusion

This paper introduces incremental learning theory on the basis of the SVM algorithm to construct a traditional music style pattern recognition model and uses the extracted traditional music style feature parameters as the model input to recognize five different styles of traditional music. The recognition accuracy of this model on the five traditional music styles (baroque, romantic, jazz, folk song, and opera) is maintained at about 90% or higher, and the average recognition accuracy reaches 96.27%. The model's misrecognition rate on the romantic and jazz styles is relatively high: 8 romantic-style pieces were misrecognized as jazz, 7 jazz pieces were misrecognized as baroque-style music, and 7 jazz pieces were misrecognized as folk songs. In summary, the model performs well on the traditional music style recognition task, and the SVM algorithm incorporating incremental learning improves the average accuracy of traditional music style recognition by 25.79 percentage points compared with the traditional SVM algorithm. These results indicate that introducing incremental learning improves the performance of the SVM model and demonstrates the feasibility of recognizing different traditional music styles from music style feature parameters.
