Open Access

A study of the influence of audio signal processing technology on the expression of music aesthetics in piano performance

  
Mar 21, 2025


Introduction

Music as an art form has developed over a long period from ancient times to the present day; many outstanding musicians have emerged and produced a large number of popular works [1]. Traditional music has been handed down in the form of musical scores and performances, and different performance needs have given rise to a variety of instruments, represented by wind and string instruments. Audio produced by multiple instruments playing together gives rise to mixed instrumental audio [2-3]. Moreover, mixed audio is not limited to multiple instruments: many musical works contain mixed vocal audio, a blend of singing and instrumental accompaniment, whereas traditional research on vocal and instrumental audio has been based only on their different acoustic principles [4]. With the development of computer technology, traditional audio processing techniques were introduced into the computing field, and applying modern technology to the study of music audio has made people more skillful in understanding and applying music, as seen in the emergence of electronic music [5]. Electronic music has played a key role in the creation and dissemination of music; more importantly, advances in computer technology have also rapidly improved audio processing technology [6].

In modern society, all kinds of music are emerging. With the popularity of music, people's requirements for music signal processing and analysis have also increased, and new audio processing, music retrieval, and related techniques are constantly being proposed. The combination of music characterization and pattern recognition has driven the development of music classification based on musical features, which is extremely important for progress and breakthroughs in music retrieval and music classification [7-8].

Consider the recognition of the instruments playing in a piece of audio: it is relatively easy for the human ear to identify them, but recognizing different instruments in an audio segment from a computer's perspective remains quite difficult because the audio of the different instruments is mixed together [9-10]. Another example is everyday speech audio, which is often interspersed with background speech, background music, and so on. These components are usually unwanted and meaningless and can be treated as noise; the computer must automatically distinguish the speech from the background noise and output the clean speech separately, so as to extract pure speech audio from mixed audio containing background speech and background music [11-12]. Most current methods perform audio source separation from the perspective of traditional audio processing. Such processing is simple, but it is not fully realized from the computer's point of view, and its generalization ability is weak. It therefore remains valuable to study how computers can process complex music signals and to apply the results in practice [13-14].

From the perspective of signal processing, music itself is a combination of different frequency components at different times. Even when the same piece of music is performed on different instruments, it sounds very different, and this difference appears in the time and frequency domains as clearly distinguishable features [15-16]. How to use computers to identify these features and use them to analyze and recognize music signals is a very meaningful research direction. It not only helps people understand the nature of music more deeply, but can also be applied to music teaching and retrieval. The huge amount of music resources on the network can be analyzed and processed accordingly, audio can be denoised to obtain better quality, and music information retrieval becomes easier, which in turn greatly facilitates the dissemination and exchange of music [17-18].

In this study, we start from the acquisition of piano performance audio signals and carry out signal preprocessing and feature extraction. Wavelet-transform noise reduction is used to denoise the audio signals and improve signal quality, and the Hidden Markov Model is then combined with signal pattern matching and emotion analysis. Through simulation experiments comparing the audio signals before and after processing, we quantify the specific impact of audio signal processing on the expression of music aesthetics and evaluate the role of audio signal processing technology in improving the sound quality and emotion transmission of piano performance.

Audio signals and processing in piano performance
Audio signals

Audio information in piano performance refers to the audible sound produced by the piano as a sound source. Sound is a physical phenomenon, while what the human ear perceives is a psychological phenomenon. A fluctuating signal whose amplitude and frequency vary regularly and that carries speech and other sound characteristics is called an audio signal.

Tone and audio

Frequency is the number of times a signal changes per second. Pitch is a person's perception of the frequency of a sound, generally called pitch in music, and it is directly correlated with frequency. The relationship between frequency and pitch is shown in Table 1.

The frequency and pitch table

Scale                     B      A      G      F      E      D      C
Numbered notation         7      6      5      4      3      2      1
Frequency (Hz)            495    441    393    350    331    294    262
Logarithm of frequency    53.9   52.9   51.9   50.9   50.4   49.4   48.4
Amplitude and Intensity

Sound intensity refers to the strength of the main tone in a sound signal and is a basis for discriminating music; the strength of a sound is an objective physical quantity. The subjective perception of sound intensity by the human ear is called "loudness", the auditory attribute by which the strength of a sound is judged.

Tone and harmonics

The subjective perception of the spectral characteristics of a composite sound is collectively called timbre. Timbre is mainly related to the harmonic components of the sound. Sound is by nature a vibration; among the various frequency components that make up a composite sound, the lowest intrinsic frequency is called the fundamental frequency, and the remaining components, whose frequencies are integer multiples of the fundamental, are called harmonics.

Tone Width and Frequency Bands

The more harmonic components an audio signal contains, the better the tone. In piano playing, the quality of the sound is measured in terms of the frequency range of the harmonic components contained in the sound signal, i.e. the bandwidth, which describes the range of frequencies that make up the composite signal.

Signal Acquisition and Coding

The audio signal is in nature an analog signal. It is usually obtained by picking up sound with a microphone, yielding a continuously varying level signal, i.e. a continuous function of time. Volume corresponds to the amplitude of this function and pitch to its frequency; the lowest frequency the human ear can hear is 20 Hz. Audio is input mainly in two ways: microphone input and line input.

The first step in turning the collected analog signal into a digital signal acceptable to computers and networks is to sample it so that it becomes a discrete function of time: at a fixed interval T, an amplitude value is taken from the analog sound wave, and this amplitude is represented by a number of binary digits, giving the discrete signal x(nT), where n is an integer, T is called the sampling period, and 1/T the sampling frequency. Representing each sampled value as a binary number is called quantization coding. The second step is to encode the sampled discrete signal, which is called pulse code modulation (PCM), i.e. the amplitude of each discrete sample is expressed as a binary code. In hardware this is implemented mainly by a sample-and-hold circuit and an analog-to-digital converter, which together constitute an audio input device.
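As an illustrative sketch only (none of the sample values below come from the paper), the sampling-and-quantization process described above can be expressed in a few lines of NumPy; the 44.1 kHz rate, 16-bit depth, and 440 Hz test tone are assumptions made for the example.

```python
import numpy as np

# Sketch: sample a continuous tone and quantize it to 16-bit PCM.
fs = 44100                                  # sampling frequency 1/T in Hz (assumed)
T = 1.0 / fs                                # sampling period
t = np.arange(0, 0.5, T)                    # sample instants nT over 0.5 s
x = 0.8 * np.sin(2 * np.pi * 440 * t)       # analog waveform evaluated at nT -> x(nT)

# Quantization coding: map each amplitude to a 16-bit signed integer (PCM).
pcm = np.round(x * 32767).astype(np.int16)

print(pcm[:10], pcm.dtype)                  # first few quantized samples
```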

Signal Conversion and Compression

Audio signal conversion refers to analog-to-digital conversion, i.e. the process of converting an analog signal into a digital signal. Analog-to-digital conversion mainly includes:

Quantization: extracting digital values of the signal along the amplitude axis.

Sampling: extracting digital values of the signal along the time axis.

Encoding: storing the sampled and quantized data digitally in a specified format.

The theory of audio signal compression is quite different from that of computer data compression: a certain amount of distortion is allowed for audio signals, whereas computer data must not be distorted. Compression of binary computer data must be lossless, with no errors; this is called lossless compression. Audio signal compression does not need to be lossless: as long as the compressed sound signal, despite some distortion, sounds the same as the original, the compression is acceptable, so audio signals are comparatively easy to compress.

Audio signal recognition techniques in piano performance

The main components of the audio signal recognition system in piano performance are shown in Fig. 1. The audio signal recognition technology is used to identify and classify the emotions in piano performance.

Figure 1.

Schematic diagram of audio signal recognition

Preprocessing of audio signal recognition system

Audio signals are processed over short time frames. Preprocessing of audio signals [19-20] mainly includes pre-emphasis, windowing and framing, and endpoint detection.

Audio signal pre-emphasis

As audio travels from the sound source to the computer acquisition equipment it is attenuated, with the attenuation exceeding 6 dB, so the captured audio must be compensated after acquisition; this compensation process is called pre-emphasis. In digital signal processing, pre-emphasis is generally implemented with a digital filter based on the difference equation: $$y(n) = s(n) - \alpha s(n - 1),\quad \alpha \in [0.9,1.0]$$
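A minimal sketch of this pre-emphasis filter, assuming NumPy; the default α = 0.97 is a common choice within the stated range [0.9, 1.0], not a value prescribed by the paper.

```python
import numpy as np

def pre_emphasis(s, alpha=0.97):
    """First-order pre-emphasis filter: y(n) = s(n) - alpha * s(n-1)."""
    # Keep the first sample unchanged; alpha is typically in [0.9, 1.0].
    return np.append(s[0], s[1:] - alpha * s[:-1])
```

A captured signal would simply be passed through `pre_emphasis` before the framing step described next.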

Split-frame processing of audio signals

Framing of the audio signal mostly uses the Hamming window. After pre-emphasis, the signal is divided into frames; the frame length L generally takes a value of 256 points or more, and the frame shift D is generally taken as 64 points or a fixed fraction of the frame length. If the audio signal sequence is S(n), then after framing the relationship between the nth point of the lth frame and the original audio sequence S(n) is: $${x_l}(n) = s[(l - 1) \cdot D + n],\quad n \in [0,255]$$

where xl(n) is the value corresponding to point n on frame l.
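A sketch of the framing and Hamming-windowing step under the parameters quoted above (frame length 256, frame shift 64), assuming NumPy and zero-based indexing; dropping the final partial frame is an implementation choice, not the paper's rule.

```python
import numpy as np

def frame_signal(s, frame_len=256, frame_shift=64):
    """Split signal s into overlapping frames and apply a Hamming window.
    Zero-based form of x_l(n) = s[(l-1)*D + n]."""
    num_frames = 1 + (len(s) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.empty((num_frames, frame_len))
    for l in range(num_frames):
        start = l * frame_shift
        frames[l] = s[start:start + frame_len] * window
    return frames
```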

Endpoint detection of audio signals

The purpose of endpoint detection is to determine the boundaries of the audio unit to be recognized, which reduces the amount of recognition computation, eliminates redundant information, and improves recognition efficiency. In this paper, the short-time average amplitude is used to estimate the audio energy.

The idea of the average amplitude method is to take the absolute values of all points in the region covered by a moving window and average their sum. To determine the thresholds, 10 consecutive frames known to be silence before the sound begins are used as the basis: the maximum average amplitude Amax and the minimum average amplitude Amin of these 10 frames are taken, and the low energy threshold Ithl and the high energy threshold Ithh are then determined by formula (3): $$\left\{ {\begin{array}{*{20}{l}} {{I_a} = 0.03 \cdot ({A_{\max }} - {A_{\min }}) + {A_{\min }}} \\ {{I_b} = 4 \cdot {A_{\min }}} \\ {{I_{thl}} = \min ({I_a},{I_b})} \\ {{I_{thh}} = 5 \cdot {I_{thl}}} \end{array}} \right.$$

The starting point is determined from the thresholds as follows. Using Ithl and Ithh, take the first point S(i) whose amplitude exceeds Ithl and locate the frame in which it lies. If, after a specified period of time, the frame amplitude has not reached Ithh and falls back below Ithl, then the frame containing this S(i) cannot serve as a boundary frame, and the system returns to its original state to make a new selection. This continues until some S(i), after exceeding the low threshold, also exceeds the high threshold within the specified period; that frame is then taken as the starting frame for audio signal recognition.
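The threshold rule of Eq. (3) and the starting-point search can be sketched as follows, assuming NumPy; the max_wait parameter stands in for the unspecified "specified period of time" above and is therefore a hypothetical knob.

```python
import numpy as np

def short_time_avg_amplitude(frames):
    """Average absolute amplitude of each frame (one value per frame)."""
    return np.mean(np.abs(frames), axis=1)

def energy_thresholds(amp, n_silence=10):
    """Low/high thresholds from the first n_silence frames, assumed silent (Eq. 3)."""
    a_max, a_min = amp[:n_silence].max(), amp[:n_silence].min()
    i_a = 0.03 * (a_max - a_min) + a_min
    i_b = 4.0 * a_min
    i_thl = min(i_a, i_b)
    i_thh = 5.0 * i_thl
    return i_thl, i_thh

def find_start_frame(amp, i_thl, i_thh, max_wait=10):
    """First frame that crosses the low threshold and reaches the high threshold
    within max_wait frames (a simplified reading of the rule above)."""
    for i, a in enumerate(amp):
        if a > i_thl and np.any(amp[i:i + max_wait] >= i_thh):
            return i
    return None
```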

Feature extraction of audio signals
LPCC parameters

The core idea of linear prediction is that, based on the linear correlation between a point in the audio signal and the points preceding it, a linear combination of a series of preceding points is used to represent (predict) the information at that point. The difference between the predicted value and the actual value is taken as the prediction error, from which a series of prediction coefficients is obtained. According to the principle of linear correlation, the prediction error e(n) is expressed as: $$e(n) = s(n) - \sum\limits_{i = 1}^N {{a_i}s(n - i)}$$

In Eq. (4), N is the linear prediction order, generally chosen as 18. To obtain the optimal solution for ai, e(n) must be minimized, and the minimum mean square error criterion is generally used. The short-time average error within a frame is: $$E\{ {e^2}(n)\} = E\left\{ {{{\left[ {s(n) - \sum\limits_{i = 1}^N {{a_i}s(n - i)} } \right]}^2}} \right\}$$

Setting the partial derivative of this error with respect to ai to zero gives: $$E\left\{ {\left[ {s(n) - \sum\limits_{i = 1}^N {{a_i}s(n - i)} } \right]s(n - j)} \right\} = 0$$

where i, j ∈ [1, N], N is the maximum prediction order.

From the above equation, when the linear prediction coefficients take their optimal values, the prediction error e(n) is orthogonal to the past sample points. By the short-time stationarity of the audio signal, for each audio segment Sn within a frame there is a correlation function Φn(i, j): $${\Phi _n}(i,j) = E[{S_n}(m - i){S_n}(m - j)]$$

Then there is: $$\sum\limits_{j = 1}^p {{a_j}{\Phi _n}(i,j)} = {\Phi _n}(i,0)$$

Let the autocorrelation function of the audio segment be: $${R_N}(i) = \sum\limits_{n = i}^{N - 1} {{S_N}(n){S_N}(n - i)}$$

The autocorrelation method gives the equation for the optimal coefficients: $$\left[ {\begin{array}{*{20}{c}} {{R_N}(0)}&{{R_N}(1)}& \cdots &{{R_N}(p - 1)} \\ {{R_N}(1)}&{{R_N}(0)}& \cdots &{{R_N}(p - 2)} \\ \vdots & \vdots & \ddots & \vdots \\ {{R_N}(p - 1)}&{{R_N}(p - 2)}& \cdots &{{R_N}(0)} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{a_1}} \\ {{a_2}} \\ \vdots \\ {{a_p}} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{R_N}(1)} \\ {{R_N}(2)} \\ \vdots \\ {{R_N}(p)} \end{array}} \right]$$

The prediction coefficients can be obtained by solving Eq. (10) with the Levinson-Durbin recursive algorithm.

The cepstrum is obtained by taking the discrete Fourier transform of the audio signal, taking the logarithm of the transformed sequence, and then taking the inverse discrete Fourier transform of this logarithmic sequence. The cepstrum coefficients can be computed recursively from the prediction coefficients: $${C_n} = \left\{ {\begin{array}{*{20}{l}} {{a_n} + \sum\limits_{k = 1}^{n - 1} {\frac{k}{n}{C_k}{a_{n - k}}} ,}&{1 \le n \le p} \\ {\sum\limits_{k = n - p}^{n - 1} {\frac{k}{n}{C_k}{a_{n - k}}} ,}&{n > p} \end{array}} \right.$$

In Eq. (11), Cn is the cepstrum coefficient, an is the linear prediction coefficient, and p is the linear prediction order.
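A sketch of the LPCC computation chain described above (frame autocorrelation, Levinson-Durbin solution of Eq. (10), and the cepstral recursion of Eq. (11)), assuming NumPy; the order settings are illustrative and follow the values quoted in the text.

```python
import numpy as np

def autocorrelation(frame, p):
    """R(i) for i = 0..p over one windowed frame."""
    N = len(frame)
    return np.array([np.dot(frame[i:], frame[:N - i]) for i in range(p + 1)])

def levinson_durbin(R, p):
    """Solve the Toeplitz normal equations (Eq. 10) for LPC coefficients a_1..a_p."""
    a = np.zeros(p + 1)
    E = R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        E *= (1 - k * k)
    return a[1:]                                         # a_1 .. a_p

def lpcc_from_lpc(a, n_ceps):
    """Cepstral coefficients from LPC coefficients via the recursion of Eq. (11)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

# Example use on one windowed frame (order 18, as quoted in the text):
# R = autocorrelation(frame, 18); a = levinson_durbin(R, 18); lpcc = lpcc_from_lpc(a, 18)
```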

MFCC parameters

The MFCC is obtained by first performing a short-time Fourier transform [21-22] on the audio signal to obtain the spectrum, and then squaring the magnitude of the spectrum to obtain the energy spectrum. Next, a set of triangular filters is applied to the energy spectrum in the frequency domain. The center frequencies of the triangular filters are uniformly spaced on the Mel frequency scale, and the relationship between the Mel frequency scale and the Fourier (linear) frequency scale is: $$M = 2595\,{\log _{10}}(1 + f/700)$$

Where M is the Mel frequency in Mel and f is the frequency characterized by Fourier transform in Hz.

The number of filters is usually chosen according to the number of critical bands. Let the number of triangular filters be M and the filtered output be X(k), where k ∈ [1, M]. Taking the logarithm of the filter bank outputs and then applying the inverse (cosine) transform gives the MFCC. This transformation can be expressed as: $${C_n} = \sqrt {\frac{2}{M}} \sum\limits_{k = 1}^M {\log } X(k)\cos [(k - 0.5)\pi n/M]$$

Of these, n = 1, 2, …, L. Generally speaking, the number of MFCC coefficients for audio signals is selected from 12 to 16.
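The MFCC pipeline just described (power spectrum, triangular Mel filter bank, logarithm, cosine transform) can be sketched with NumPy and SciPy as follows; the filter count, FFT size, and the 13 retained coefficients are illustrative assumptions consistent with the 12-16 range mentioned above.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Standard Mel scale: M = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames, fs, n_filters=26, n_mfcc=13, n_fft=512):
    """MFCC sketch: power spectrum -> triangular Mel filter bank -> log -> DCT."""
    # 1) short-time power spectrum of each (already windowed) frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # 2) triangular filters with centres evenly spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 3) log filter-bank energies, 4) cosine transform, keep the first n_mfcc
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```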

Pattern Training and Recognition Techniques for Audio Signals

Pattern matching and training algorithms for audio signals are mainly implemented using wavelet networks and related techniques; this paper adopts the Hidden Markov Model as the audio signal recognition method.

The Hidden Markov Model associates each relatively stable articulatory unit in the acoustic layer with a hidden state, describes the change of articulation through state transitions and state durations, and introduces a probabilistic statistical model: a probability density function is generally used to compute the output probability of the audio parameters given the model, the optimal state sequence is searched for, and the recognition result is chosen according to the maximum a posteriori probability criterion. The algorithm represents the audio signal as a sequence of states, where each state corresponds to a part of the audio signal; the states include monophthongs, diphthongs, and triphthongs. If each state of a Hidden Markov Model corresponds to a symbol and each state is associated with a probability, then by Markov theory the probability of the next possible state can be predicted from the previous states, and the possible paths of the audio signal through the model can be scored.
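As a hedged illustration of how such a model might be trained and used for classification (not the paper's exact configuration), the third-party hmmlearn package provides a Gaussian HMM whose log-likelihood score can implement the maximum a posteriori decision under equal priors; the number of states and the feature inputs below are assumptions.

```python
import numpy as np
from hmmlearn import hmm   # third-party package, used here only as an illustration

def train_class_model(feature_seqs, n_states=5):
    """Fit one Gaussian HMM on a list of (T_i, D) feature arrays (e.g. MFCC frames)."""
    X = np.vstack(feature_seqs)
    lengths = [len(seq) for seq in feature_seqs]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
    model.fit(X, lengths)
    return model

def classify(models, features):
    """Pick the class whose HMM gives the highest log-likelihood for the observed
    feature sequence (maximum a posteriori decision with equal priors)."""
    scores = {label: m.score(features) for label, m in models.items()}
    return max(scores, key=scores.get)
```

In practice one model would be trained per class (per emotion category or per instrument), and an unknown recording would be assigned to the class whose model scores it highest.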

Wavelet transform-based noise reduction for audio signals
Wavelet analysis methods
Continuous wavelet transform

The basic idea of the wavelet transform [23-24] is to represent or approximate a signal or function by a family of functions called the wavelet function family, which is constructed by translating and dilating a basis wavelet function.

Let ψ(t) ∈ L2(R), where L2(R) denotes the space of square-integrable real functions, i.e., the space of signals with finite energy, and let $\hat \psi (\omega )$ be its Fourier transform. Suppose $\hat \psi (\omega )$ satisfies the admissibility condition: $${C_\psi } = \int_{\mathbb{R}} {\frac{{{{\left| {\hat \psi (\omega )} \right|}^2}}}{{\left| \omega \right|}}} \,d\omega < \infty$$

Then ψ(t) is called a basis wavelet (or wavelet basis function). Dilating (scaling) and translating (time-shifting) the basis wavelet ψ(t) gives a wavelet family: $${\psi _{a,b}}(t) = {\left| a \right|^{ - 1/2}}\psi \left( {(t - b)/a} \right),\quad a,b \in R,\;a \ne 0$$

where a is called the scale factor and b is called the translation factor.

Let ψ(t) be the basis wavelet and ψa,b(t) the continuous wavelet family obtained by scaling and translating it; then the continuous wavelet transform (CWT) of a signal f(t) ∈ L2(R) is defined as: $$({W_\psi }f)(a,b) = \left\langle {f,{\psi _{a,b}}} \right\rangle = {\left| a \right|^{ - 1/2}}\int_\mathbb{R} {f(t)\overline {\psi ((t - b)/a)} } \,dt$$

where $\overline {{\psi _{a,b}}(t)}$ denotes the complex conjugate of ψa,b(t).

A transformation is useful in practice only if it has an inverse. Let ψ(t) be a basis wavelet; then for any f(t), g(t) ∈ L2(R): $$\int_{ - \infty }^\infty {\int_{ - \infty }^\infty {({W_\psi }f)(a,b)\,\overline {({W_\psi }g)(a,b)} } } \,\frac{1}{{{a^2}}}\,da\,db = {C_\psi }\langle f,g\rangle$$

Moreover, when the Fourier transform $\hat \psi (\omega )$ of the basis wavelet ψ(t) satisfies the admissibility condition, the original signal f(t) can be recovered from its continuous wavelet transform whenever f(t) is continuous at t ∈ R, i.e., the inverse continuous wavelet transform exists: $$f(t) = \frac{1}{{{C_\psi }}}\int_{ - \infty }^\infty {\int_{ - \infty }^\infty {({W_\psi }f)(a,b)\,{\psi _{a,b}}(t)} } \,\frac{1}{{{a^2}}}\,da\,db$$

In ψa,b(t), the variable a reflects the scale of the function, and the variable b determines the position of the translation along the time (or position) axis. In general, the energy of the basis function ψ(t) is concentrated at the origin, and the energy of the continuous wavelet function ψa,b(t) is concentrated around the point b.
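A small sketch of the continuous wavelet transform of Eq. (16) using the PyWavelets library; the Morlet wavelet, the scale range, and the synthetic two-tone test signal are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np
import pywt

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
# Synthetic test signal: two superimposed tones.
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

scales = np.arange(1, 128)                                    # the scale factor a
coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1.0 / fs)
# coeffs[i, n] ~ (W_psi f)(a_i, b_n): the wavelet coefficient at scale a_i, shift b_n
```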

Discrete wavelet transform

In practical applications of the wavelet transform, especially when analyzing signals on a computer, the continuous wavelet transform must be discretized. Consider again the continuous wavelet function: $${\psi _{a,b}}(t) = |a{|^{ - 1/2}}\psi \left( {\frac{{t - b}}{a}} \right)$$

For simplicity, assume a ∈ R+, b ∈ R, a ≠ 1, and that the Fourier transform $\hat \psi (\omega )$ of the wavelet basis function ψ(t) satisfies the admissibility condition. Discretizing the scale and translation parameters as a = a0-j and b = k b0 a0-j (j, k ∈ Z) then gives the discretized wavelet family ψj,k(t) = a0j/2ψ(a0jt − kb0).

The reconstruction formula for the discretized wavelet transform is: $$f(t) = c\sum\limits_{j = - \infty }^\infty {\sum\limits_{k = - \infty }^\infty {{C_{j,k}}{\psi _{j,k}}(t)} }$$

Here, c is a constant independent of the original signal. A discretized wavelet ψj,k(t) = 2j/2ψ(2jt − k) (j, k ∈ Z), obtained with the discretization parameters a0 = 2 and b0 = 1, is often called a dyadic (binary) wavelet.

Multi-resolution analysis

The tree structure of the three-level multiresolution analysis of Signal S is shown in Figure 2:

Figure 2.

Tree structure of multi-resolution analysis

Multi-resolution analysis further decomposes only the low-frequency part of the signal, while the high-frequency part is not decomposed further, and the decomposition satisfies the relation S = A3 + D3 + D2 + D1.

By multi-resolution analysis and the theory of orthogonal space decomposition, $${L^2}(R) = \left( {\bigoplus\limits_{j = - \infty }^s {{W_j}} } \right) \oplus {V_s}$$ (s is an arbitrarily chosen scale). If a signal f(t) ∈ L2(R) is decomposed in the space L2(R), the following decomposition expression is obtained: $$f(t) = \sum\limits_{j = - \infty }^s {\sum\limits_{k = - \infty }^{ + \infty } {{d_{j,k}}{\psi _{j,k}}(t)} } + \sum\limits_{k = - \infty }^{ + \infty } {{c_{s,k}}{\phi _{s,k}}(t)}$$ where ψj,k are the wavelet functions and φs,k the scaling functions at scale s.

Basic steps of wavelet noise reduction

Let a signal f(n) of length N be contaminated by noise e(n); the measured noisy signal is then: $$x(n) = f(n) + \sigma \times e(n)$$

The noise removal process can be carried out as follows: first, the signal is decomposed by wavelet analysis, for example into three layers as shown in Figure 3. The noise is usually contained in the detail coefficients cD1, cD2, and cD3, so these wavelet coefficients can be processed by thresholding or similar methods, after which the signal is reconstructed, achieving the purpose of noise reduction.

Figure 3.

Three-layer decomposition plot of the signal

where s is the noise-containing signal, cA1, cA2, and cA3 are the low-frequency coefficients of layers 1-3, and cD1, cD2, and cD3 are the high-frequency coefficients of layers 1-3, respectively.

In general, the wavelet noise reduction process can be carried out in three steps:

Select a wavelet and determine the number of layers for a wavelet decomposition, and then perform a wavelet decomposition on the noise-containing signal.

For the decomposed wavelet coefficients, keep all the coefficients at large scales (low resolution) unchanged, and set a threshold value with which to threshold the wavelet coefficients at each small scale (high resolution).

Perform wavelet reconstruction of the signal from the low-frequency coefficients of the Nth layer of the decomposition and the thresholded high-frequency coefficients of layers 1 to N; the denoised signal is then passed to the Hidden Markov Model used as the audio signal recognition method.
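The three steps above map directly onto the PyWavelets API. The sketch below uses the sym8 wavelet mentioned later in the paper and soft thresholding; when no threshold is supplied, it falls back to the universal threshold estimated from the finest detail band, which is a common default rather than the paper's exact rule.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet='sym8', level=3, threshold=None):
    """Three-step wavelet noise reduction sketch:
    1) decompose, 2) threshold the detail (high-frequency) coefficients, 3) reconstruct."""
    # 1) multi-level wavelet decomposition: [cA_N, cD_N, ..., cD_1]
    coeffs = pywt.wavedec(x, wavelet, level=level)

    # 2) keep the approximation cA_N, soft-threshold every detail band cD_j
    if threshold is None:
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise estimate from cD_1
        threshold = sigma * np.sqrt(2 * np.log(len(x)))
    denoised = [coeffs[0]] + [pywt.threshold(c, threshold, mode='soft')
                              for c in coeffs[1:]]

    # 3) reconstruct the signal from the processed coefficients
    return pywt.waverec(denoised, wavelet)[:len(x)]
```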

Impact of audio signal processing technology on the aesthetic expression of piano music

Using audio signal processing techniques, features such as timbre can be extracted during piano playing, and musical features are the basis of aesthetic expression. Feature extraction enables music emotion recognition, which helps listeners quickly find music whose content matches the emotional character they are looking for. In addition, audio signal processing technology can improve the aesthetic effect of musical works by enhancing the quality of the music's sound.

Wavelet Noise Reduction Decomposition Layer Determination and Analysis

In this experiment, the original audio signal of a piano performance is collected with a data length of 32 bits. To determine the optimal number of decomposition layers for wavelet noise reduction, the wavelet basis function sym8 is chosen and noise reduction is performed on a noisy piano audio signal with a signal-to-noise ratio of -6 at different numbers of decomposition layers. The signal-to-noise ratio and relative error corresponding to different numbers of decomposition layers are shown in Fig. 4.

Figure 4.

Signal-noise ratio and relative error results

According to the data in the figure, the more decomposition layers used for wavelet threshold noise reduction, the higher the signal-to-noise ratio, the smaller the relative error, and the better the denoising effect. However, beyond 7 decomposition layers the improvement in noise reduction is no longer obvious, the useful audio signal begins to be partially distorted, and the reconstructed audio signal deviates too much from the original signal. At 7 layers the signal-to-noise ratio is 7.55 and the relative error is 0.16. The number of decomposition layers is therefore set to 7.

White noise is added to the piano audio signal at a signal-to-noise ratio of 6, and wavelet threshold noise reduction is applied to the noisy signal with the chosen wavelet basis function and number of decomposition layers. The noise reduction effect of the method proposed in this paper is shown in Figure 5. It can be seen that the wavelet transform analysis method proposed in this paper effectively removes the noise and the denoising effect is obvious, which verifies the feasibility and superiority of the algorithm.

Figure 5.

The method of noise reduction is proposed in this paper

Parametric analysis of timbre characteristics in piano playing

In this section, the timbre analysis of the model in this paper is tested and its performance is evaluated. "Music 1" from the piano music database is used as the input music data source to test the timbre analysis in audio signal processing. The results of the timbre analysis of the test track are shown in Table 2.

Analysis of the test repertoire

Spectral component (Hz) 0-40 40-100 100-200 200-500 500-1K 1K-4K 4K-8K 8K-16K 16K-24K
Proportion 0.002 0.003 0.005 0.009 0.013 0.035 0.296 0.331 0.306

As can be seen from the analysis results in Table 2, for the 30 s piece of piano music the energy proportion of each frequency band differs: the energy in the 8 kHz-16 kHz band is the largest, accounting for about 33.1%, and the energy above 4 kHz accounts for most of the total, about 93.3%. This gives the whole piece a strong sense of space, with a relatively bright and powerful sound.

In addition, according to the characteristic parameter vectors of the music, the result of using the Hidden Markov Model to classify the music is piano, which is consistent with the expectation, indicating that the audio signal processing model constructed in this paper makes correct predictions about the playing instruments of this music.

Figure 6 shows the visualization of the MFCC timbre feature parameters produced by this paper's model. The MFCC feature parameter values of the test track are distributed in the range of -12.68 to 12.43 and show irregular ups and downs, indicating that the model can capture key information in the piano audio signal well and improve timbre recognition performance. The test presents the data to the user in the form of graphs, which is interactive, friendly, and intuitive.

Figure 6.

MFCC tonal feature parameter visualization

Evaluation of the effect of music emotion expression recognition

To test the model's recognition of music emotion expression, this paper uses the self-built piano music database as well as the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus for experimental validation.

Only five categories of emotion data in IEMOCAP are used: Happy (merged with the Excited category), Angry, Sad, Fear, and Neutral. Samples from the remaining categories were discarded, leaving a total of 6156 samples for this classification task. Both datasets are evaluated using five-fold cross-validation, with 70% of the data used to train the Hidden Markov Model and the remaining data used for validation and accuracy testing.

The confusion matrix for music emotion expression recognition is shown in Fig. 7. The model achieves high recognition accuracy for emotional expressions in piano music: the recognition accuracy for Happy reaches 94.7%, although there is still a 1.1% probability of it being recognized as Angry or as another emotion.

Figure 7.

Emotional identification confusion matrix

Assessment of the effect of sound quality in musical expression on the piano

In this section, an experiment is conducted to verify the effectiveness of audio signal processing techniques in improving sound quality during piano playing. The objective of the experiment is to quantitatively evaluate the effectiveness of audio signal processing technology in enhancing sound quality.

The original signal, the distorted signal, and the processed signal are recorded with high-precision audio acquisition equipment to ensure the accuracy and reliability of the data, and the signals are saved in a lossless format. Professional audio analysis software is used to analyze the data, including spectrum analysis, time-domain analysis, subjective listening evaluation, and objective index measurement. Spectrum analysis compares the frequency distribution of the signals, while time-domain analysis compares changes in the signal waveform. Subjective listening evaluation is conducted by a listening test panel, which records perceived differences. Objective assessment includes calculation of the signal-to-noise ratio (SNR) and total harmonic distortion (THD) to quantify the processing effect. Figure 8 shows the comparison between the original signal spectrum, the distorted signal spectrum, and the processed signal spectrum.
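For reference, the two objective indices can be computed along the following lines, assuming NumPy; the SNR sketch treats the deviation from the reference signal as noise, and the simplified THD routine assumes the fundamental frequency f0 is known and picks spectral peaks at its integer multiples, which is an illustrative simplification rather than the paper's measurement procedure.

```python
import numpy as np

def snr_db(reference, processed):
    """SNR in dB, treating the difference from the reference as noise."""
    noise = processed - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def thd_percent(x, fs, f0, n_harmonics=5):
    """Total harmonic distortion (%): RMS of harmonics 2..N over the fundamental at f0."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)

    def peak(f):
        # Magnitude of the bin closest to frequency f (simplified peak picking).
        return spectrum[np.argmin(np.abs(freqs - f))]

    fundamental = peak(f0)
    harmonics = np.sqrt(sum(peak(k * f0) ** 2 for k in range(2, n_harmonics + 1)))
    return 100.0 * harmonics / fundamental
```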

Figure 8.

Original signal, distortion signal, calibration signal comparison result

As can be seen from the figure, after processing the audio signal, the phase relationship of each channel signal is significantly improved. Before processing, the distorted signal spectrum has a larger deviation compared to the original spectrum, and the smoothness and consistency of the spectrum are affected. The processed signal spectrum is closer to the original signal spectrum, indicating that the application of audio signal processing technology can effectively reduce spectral distortion and improve signal quality.

The sound quality before and after processing was evaluated using a combination of subjective listening assessment and objective index measurement. The results are shown in Table 3. After processing, all the sound quality indicators improved significantly. Subjective listening indicators such as clarity, stereo sense, and spatial sense improved by 34.66% to 55.63% compared with before processing, the SNR increased by 13 dB, and the THD decreased by more than 60%. This shows that audio signal processing technology has a significant effect on improving the quality of the audio signal in piano performance.

The results of the sound quality evaluation before and after processing

Index              Clarity score   Stereo rating   Spatial rating   SNR (dB)   THD (%)
Pre-processing     3.26            3.51            2.84             80         0.09
Post-processing    4.39            4.77            4.42             93         0.03
Conclusion

In this paper, audio signals in piano playing are analyzed in depth, and the acquisition of audio signals and their processing methods are introduced. Audio signal recognition techniques are used to preprocess the audio signals and extract features. The Hidden Markov Model is used as the main model for audio signal recognition, combined with the wavelet transform method to reduce noise in the audio signal.

When the number of decomposition layers is 7, the wavelet transform method has the best denoising effect and can effectively remove noise from the audio signal of a piano performance.

The MFCC feature parameter values of the test tracks are distributed in the range of -12.68 to 12.43, which indicates that the Hidden Markov Model adopted in this paper can accurately recognize the timbre characteristics of the piano performance.

The model in this paper has a high accuracy in recognizing musical emotion expressions, with the highest accuracy being 94.7% in recognizing Happy.

Audio signal processing technology can reduce piano spectral distortion and improve the quality of audio signals.
