Open Access

A study on timbre feature extraction and sound quality optimisation of guzheng performance based on spectral analysis

  
29 September 2025

Introduction

In today’s increasingly integrated world, the unique individual tone of national instruments is valued and explored by a growing number of musicians. In the continuous improvement of musical instruments, once the shape of an instrument is decided, its nature and personality are decided as well; that is to say, the basic timbre of the instrument itself is fixed. This basic timbre cannot be changed or replaced arbitrarily, and the quality and timbre of an instrument can only be determined through the vibration of its sound [1-3].

The guzheng belongs to the plucked instruments: exciting the zheng strings in different directions and at different positions produces different timbres and sound intensities (loudness), and the vibrations in different directions and positions reflect the characteristics of the guzheng’s sound in a given orientation. The modern guzheng triggers its sound source through string vibration, excited by plucking the strings with the fingers or artificial nails (picks). The strings, arranged in sequence, are displaced from their original static state, i.e., pushed out of the equilibrium position; the strings’ inertia and restoring force then make them rebound back through the equilibrium position and overshoot, and this cycle repeats, the continuous back-and-forth generating the acoustic vibration [4-6]. The sound energy is then conducted through the zheng bridges (columns) to the resonance box, where it is diffused. At the same time, acoustic convection and acoustic resonance are generated, thus amplifying the zheng’s sound. In practice, the pitch and timbre of the guzheng are controlled and altered by the tension and length of the strings and the density of the string material. Strings of different materials have different elasticity and density, and vibrate with different frequency components [7-8].

The term “sound quality” is generally used in a broad sense to refer to the quality of sound. To judge the sound quality of a sound-producing body, the pitch, volume, and timbre of the sound are assessed comprehensively. Timbre, also known as tone colour, is the core element that characterises a voice. Pitch is determined by the frequency of vibration of the sounding body, volume by the amplitude of vibration, and timbre depends on the harmonic series structure and the onset transient of the sound (commonly known as the “tone head”) [9-10]. Sound quality encompasses timbre. However, timbre is difficult to improve relative to the sounding body, as it expresses the harmonic content inherent in the sound wave. Because different guzhengs use sounding bodies of different materials, with different wood thickness, density, and structure, the timbre emitted differs even at the same pitch and the same intensity. This is why people can distinguish different guzhengs by the tones their sounding bodies emit. Therefore, as professional manufacturers of musical instruments, professional staff need to distinguish the concepts and nature of “sound quality” and “timbre”, and not mix them arbitrarily [11-12].

The silk strings of the guzheng’s traditional era, though pure in sound quality, lacked ethereality; the tone was not bright and penetrating in the high register, resulting in a short after-tone and low volume. The steel strings introduced after this reform made up for the shortcomings of the silk strings, and their clear and bright tone is more suitable for playing sensuous and rich music. However, their long after-tone produces unwanted murmur, which makes their performance far inferior to that of metal-nylon composite strings. Xue, H et al. constructed a generalisable guzheng work dataset with multiple sources and types of texts, achieved accurate classification based on a single feature, and confirmed that the quality of synthetic guzheng music differs significantly from that of real guzheng music [13]. Han, M et al. designed a large Long Short-Term Memory (LSTM) model to assess the quality and style of synthesised guzheng compositions, and tested it with famous guzheng performance pieces, among others, confirming the feasibility and validity of the proposed model [14]. Wang, Z et al. proposed a neural network model based on residual convolution algorithms, with single-task and multi-task models and three recognition strategies, for the recognition of Chinese ethnic pentatonic tunings; the study promotes the diversified development of traditional music culture [15]. Jiang, W et al. conceived a framework that can analyse and perceive timbre features, and through simulation experiments corroborated that the proposed framework can scientifically analyse the correlation between spatial dimensions and timbre evaluation, thereby confirming the auditory perception attributes of the three-dimensional timbral space [16]. Li, D et al. envisioned a multiscale network as the underlying technical approach to the frame-level multi-label classification problem in guzheng performance, and experimental results showed that the approach optimised IPT detection [17].

The main accessories affecting the sound of the guzheng are the strings and bridges. Owing to differences in production materials, craft standards and handmade habits among manufacturers, the main body of the guzheng is often mismatched with its accessories, which affects the sound quality of the product. Zhang, S analysed the factors affecting the expressiveness of the guzheng from the perspectives of performance technique, emotion, arrangement, and performance environment, together with suggestions for improvement, making a positive contribution to enhancing the artistic charm of guzheng performance [18]. Ding, H et al. cut and subdivided guzheng audio samples to highlight the attribute features of each fingering, and replaced the traditional machine learning algorithm with a deep learning algorithm for fingering recognition, which effectively improved the recognition accuracy of six guzheng playing techniques [19]. Chen, H et al. combined quantitative research methods to reveal a strong correlation between ancient accounts of guzheng techniques, compositions, regional influences and authors’ subjective perceptions among guzheng researchers, as well as clarifying the role the guzheng has played in the development of traditional music and deepening people’s understanding of it [20]. Zhao, C et al. demonstrated the consistency between an objective evaluation system of pipa string sound quality and subjective perception based on spectral analysis and numerical simulation, and elucidated that the rosewood pipa has better sound pressure uniformity while the mahogany pipa has superior sound quality [21]. Zhong, X. introduced the birth and development of the Ya-Zheng national instrumental music and its performance characteristics based on the historical research literature on the Ya-Zheng, and pointed out that the development of the Ya-Zheng was closely related to the social power hierarchy, the historical environment and the cultural background of the time [22].

This paper establishes a string vibration model of guzheng music based on the sound-production principle of the instrument. Using the fast Fourier transform, the guzheng performance audio is preprocessed: the centre of the window function is shifted, the signal is intercepted, and the Fourier transform converts the time-frequency function into a spectral function, completing the extraction of the guzheng performance audio signal, whose output is passed through a band-pass filter. Subsequently, the power spectrum of the guzheng performance is obtained using the discrete Fourier transform in place of the continuous Fourier transform. Additive and convolutional operations are applied to transform the guzheng signal into an additive signal, and the timbre component sequence is obtained by linear-system processing. The Mel frequency cepstrum coefficients are selected as the characteristic parameters of guzheng timbre and extracted accordingly. The harmonic structure is used to analyse the timbre expression spectrum of the guzheng and to evaluate the timbre extraction results. Finally, a sound quality optimisation experiment is designed, and the optimised guzheng sound quality is evaluated by time-frequency curve analysis.

Spectrum analysis-based timbre feature extraction for guzheng performance
Guzheng Music Playing
Principles of Guzheng Sound Generation

According to the analysis of the energy conduction path of string vibration through the body of the guzheng, the instrument’s sound production can be divided into four parts: the excitation system, the vibration system, the conduction system and the resonance system. The vibrating body and resonating body of the guzheng interact with each other, and different frequency components are enhanced or attenuated during sound production, so different guzheng timbres differ in auditory perception.

String Vibration Modelling of Guzheng Music

The source of the sound of the guzheng is the vibration of its strings. Assume the string has length L and is fixed at both ends on the zheng panel, with tension T. A plucking force F applied at x = x0 sets the string vibrating; let u(x, t) denote the lateral displacement at position x and time t. The vibration of the zheng string then reduces to the fixed-solution problem:

$$\begin{cases} \dfrac{\partial^2 u}{\partial t^2}=a^2\dfrac{\partial^2 u}{\partial x^2}, & 0<x<L \\ u(0,t)=0,\quad u(L,t)=0 \\ u(x,0)=\varphi(x),\quad u_t(x,0)=0 \end{cases} \qquad \varphi(x)=\begin{cases} \dfrac{F(L-x_0)x}{TL}, & 0\le x\le x_0 \\ \dfrac{F(L-x)x_0}{TL}, & x_0\le x\le L \end{cases}$$

Solving the equation gives:

$$u(x,t)=\frac{2FL}{\pi^2 T}\sum_{n=1}^{\infty}\frac{1}{n^2}\sin\frac{n\pi x_0}{L}\cos\frac{n\pi a t}{L}\sin\frac{n\pi x}{L}$$

From the formula it can be seen that the guzheng’s sound consists of a superposition of partial tones with different frequencies, phases and amplitudes; each guzheng presents a different timbre, with the composition and relative strength of these partials playing the decisive role.
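As an illustration of this superposition (not part of the original derivation), the partial amplitudes implied by the closed-form solution can be computed numerically; the symbols F, T, L and x0 follow the definitions above, while the function name and parameter values are ours:

```python
import numpy as np

# Partial amplitudes implied by the solution above:
# b_n = (2FL / (pi^2 T)) * (1/n^2) * sin(n*pi*x0/L)
def partial_amplitudes(F, T, L, x0, n_max):
    n = np.arange(1, n_max + 1)
    return (2 * F * L / (np.pi ** 2 * T)) * np.sin(n * np.pi * x0 / L) / n ** 2

# Plucking at the midpoint (x0 = L/2): every even partial vanishes,
# since sin(n*pi/2) = 0 for even n.
amps = partial_amplitudes(F=1.0, T=100.0, L=1.0, x0=0.5, n_max=8)
```

The 1/n² decay and the sin(nπx0/L) factor show concretely how plucking position shapes the relative strength of the partials, and hence the timbre.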

Guzheng audio pre-processing
Spectrum analysis

Spectral analysis is the process of transforming the time domain into the frequency domain. Fast Fourier Transform (FFT) is one of the algorithms to perform this transformation [23].

Tonal feature extraction under fast Fourier analysis

Fourier analysis is a powerful tool for analysing the steady-state properties of linear systems and smooth signals, and is widely used in many engineering and scientific fields. Fast Fourier analysis, a method for handling non-smooth signals through steady-state analysis under the assumption of short-time smoothness, is also known as the time-dependent Fourier transform. For guzheng music, fast passages generally run at about 240 beats per minute. Even at the limit of a guzheng player’s ability at this speed, counting neighbouring notes as distinct, about 960 notes can be played per minute, so a single note occupies 0.0625 s. The short-time smoothness assumption for fast Fourier analysis of guzheng music is therefore valid (the music signal can be assumed smooth over a time period as short as 10 ms). In this case the lowest distinguishable frequency is 16 Hz, while the pitch of the lowest tone on the guzheng is about 27.5 Hz. The process of sound perception is closely related to the spectral analysis performed by the human auditory system; spectral analysis of music signals is therefore one of the effective means of recognising music signals and processing audio signals.

The fast Fourier transform of the signal {x(n)} is defined as:

$$X_n(e^{j\omega})=\sum_{m=-\infty}^{\infty}x(m)\,\omega(n-m)\,e^{-j\omega m} \tag{4}$$

where {ω(n)} is a window sequence. Clearly X_n(e^{jω}) is a two-dimensional function, also known as the time-frequency function, and its physical meaning can be understood from two perspectives. The first interpretation: when n is fixed, say n = n0, then X_{n0}(e^{jω}) is obtained by moving the centre of the window function to n0, intercepting the signal, and taking the Fourier transform, yielding a spectral function. This follows directly from equation (4) along the frequency axis. The second interpretation is along the time axis: when the frequency is fixed, say ω = ω_k, then X_n(e^{jω_k}) is a function of time n.

This can be viewed as the output produced when the time signal passes through a bandpass filter with centre frequency ω_k. This is because the window sequence {ω(n)} in Eq. (4) usually has a low-pass frequency response, and the modulating factor e^{−jω_k m} acting on x(m) shifts its spectrum, i.e., it moves the component of the spectrum of x(n) corresponding to frequency ω_k down to zero frequency, where it is then selected by the low-pass window.
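The equivalence of the two readings can be checked numerically. The following sketch (illustrative only; the random test signal, Hanning window and frequency are our choices, not the paper’s) evaluates X_n(e^{jω_k}) once as a windowed Fourier sum and once as demodulation to zero frequency followed by low-pass filtering with the window:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)      # arbitrary test signal x(m)
w = np.hanning(16)               # low-pass window sequence w(n)
omega = 2 * np.pi * 0.2          # fixed analysis frequency w_k
n0 = 32                          # fixed analysis time n_0

def stft_point(x, w, n, omega):
    # Reading 1: X_n(e^{jw}) = sum_m x(m) w(n-m) e^{-jwm}
    total = 0.0 + 0.0j
    for m, xm in enumerate(x):
        k = n - m
        if 0 <= k < len(w):
            total += xm * w[k] * np.exp(-1j * omega * m)
    return total

# Reading 2: demodulate by e^{-jwm} (shifting the w_k component to DC),
# then low-pass filter with w; the output sample at time n0 is the same number.
demod = x * np.exp(-1j * omega * np.arange(len(x)))
filtered = np.convolve(demod, w)   # (w * demod)[n] = sum_m demod[m] w[n-m]
assert np.allclose(filtered[n0], stft_point(x, w, n0, omega))
```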

Spectrogram and temporal resolution

The square of the fast Fourier transform magnitude, |X_n(e^{jω})|², is the spectral energy density function of the signal x(n) at time n. When x(n) is regarded as a finite-energy signal, its spectral energy is continuously distributed in the frequency domain and can only be given as a density function. It is the Fourier transform of the short-time autocorrelation function of x(n), i.e.:

$$P_n(\omega)=|X_n(e^{j\omega})|^2=\sum_{k=-\infty}^{\infty}R_n(k)\,e^{-j\omega k}$$

where the short-time autocorrelation function is defined as:

$$R_n(k)=\sum_{m=-\infty}^{\infty}x(m)\,\omega(n-m)\,x(m+k)\,\omega(n-m-k)$$

In practical calculations the discrete Fourier transform is generally used instead of the continuous Fourier transform, which requires a periodic extension of the signal: x(n) · ω(n) is regarded as one period of some periodic signal, and a discrete Fourier transform is applied to it, which yields the power spectrum [24]. Note that if the window length is L, then the length of x(n) · ω(n) is L while R_n(k) spans 2L lags. If x(n) · ω(n) is extended with period L, aliasing occurs in the autocorrelation domain: the cyclic correlation of the periodic function differs from the linear correlation R_n(k) within one period, and the resulting power spectrum is only a set of undersampled values of the true power spectrum, i.e., L samples. To obtain all 2L values of the power spectrum, append L zero samples after x(n) · ω(n), extend it into a signal with period 2L, and then take the Fourier transform. The cyclic correlation is then equivalent to the linear correlation.
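The zero-padding argument can be verified directly. In this sketch (illustrative; the random frame is our choice), padding a length-L windowed frame to 2L points before the FFT makes the resulting power spectrum exactly the DFT of the linear autocorrelation arranged circularly:

```python
import numpy as np

L = 8
rng = np.random.default_rng(1)
frame = rng.standard_normal(L) * np.hanning(L)   # x(n) * w(n)

# Linear autocorrelation, lags -(L-1) .. (L-1); index L-1 is lag 0.
lin_corr = np.correlate(frame, frame, mode="full")

# Power spectrum from a zero-padded 2L-point FFT ...
P = np.abs(np.fft.fft(frame, n=2 * L)) ** 2

# ... equals the DFT of the linear autocorrelation arranged circularly:
# lags 0..L-1 at the front, lag L is zero, lags -(L-1)..-1 wrapped to the end.
circ = np.zeros(2 * L)
circ[:L] = lin_corr[L - 1:]
circ[L + 1:] = lin_corr[:L - 1]
assert np.allclose(P, np.fft.fft(circ).real, atol=1e-9)
```

With only an L-point FFT the wrap-around would overlap the positive and negative lags, which is precisely the autocorrelation-domain aliasing described above.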

Tone Feature Extraction
Homomorphic processing, cepstrum and complex cepstrum

Assume that the music signal x(n) is:

$$x(n)=x_1(n)*x_2(n)$$

where x_1(n) is the excitation, which determines the pitch, x_2(n) is the system component that determines the timbre, including the vibration of the string and of the resonator, and * denotes convolution. The homomorphic system consists of a linear subsystem L[·] and two characteristic subsystems D_*[·] and D_*^{-1}[·].

Here “+” and “*” denote additive and convolutional operations, respectively [25]. The role of the first system D_*[·] is to convert the convolution signal into an additive signal: a logarithmic operation is performed after the Z-transform of the sound signal x(n), and the inverse Z-transform finally turns the convolution signal into an additive signal, as follows:

$$\begin{cases} Z[x(n)]=Z[x_1(n)*x_2(n)]=X_1(z)X_2(z)=X(z) \\ \ln X(z)=\ln X_1(z)+\ln X_2(z)=\hat{X}_1(z)+\hat{X}_2(z)=\hat{X}(z) \\ Z^{-1}[\hat{X}(z)]=Z^{-1}[\hat{X}_1(z)+\hat{X}_2(z)]=\hat{x}_1(n)+\hat{x}_2(n) \end{cases}$$

The second system L[·] is a linear system satisfying the superposition principle; the additive signal \hat{x}(n) is linearly processed to obtain \hat{y}(n). The third system D_*^{-1}[·] is the inverse of the first; it restores the additive signal \hat{y}(n)=\hat{y}_1(n)+\hat{y}_2(n) to a convolutional signal through the following operations:

$$\begin{cases} Z[\hat{y}(n)]=\hat{Y}(z)=\hat{Y}_1(z)+\hat{Y}_2(z) \\ \exp(\hat{Y}(z))=Y(z)=Y_1(z)Y_2(z) \\ y(n)=Z^{-1}[Y_1(z)Y_2(z)]=y_1(n)*y_2(n) \end{cases}$$

In the actual processing of music signals, if \hat{x}_1(n) and \hat{x}_2(n), the sequences of the excitation and the system component, occupy different positions and do not overlap with each other, then with a properly designed linear system L[·] it is possible to separate \hat{x}_1(n) and \hat{x}_2(n) and obtain the system-component sequence that determines the timbre.

In the characteristic subsystems D_*[·] and D_*^{-1}[·] of homomorphic processing, although the signals \hat{x}(n) and \hat{y}(n) are both time-domain sequences, the discrete time domain in which they reside is clearly different from that of x(n) and y(n); it is therefore called the complex cepstral domain, or complex cepstrum for short.

In most digital signal processing, the convergence region of X(z) and \hat{X}(z) includes the unit circle, and the Z-transform on the unit circle is the Fourier transform, giving:

$$\begin{cases} F[x(n)]=X(e^{j\omega})=|X(e^{j\omega})|\,e^{j\arg(X(e^{j\omega}))} \\ \hat{X}(e^{j\omega})=\ln[X(e^{j\omega})]=\ln|X(e^{j\omega})|+j\arg(X(e^{j\omega})) \\ \hat{x}(n)=F^{-1}[\hat{X}(e^{j\omega})] \end{cases}$$

If only the real part of \hat{X}(e^{jω}) is considered, we obtain:

$$\begin{cases} C(e^{j\omega})=\ln|X(e^{j\omega})| \\ c(n)=F^{-1}[C(e^{j\omega})]=F^{-1}[\ln|X(e^{j\omega})|] \end{cases}$$

where c(n) is called the real cepstrum, or cepstrum for short: the inverse Fourier transform of the logarithmic amplitude spectrum of x(n).
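A minimal sketch of this real cepstrum (not the paper’s code; the epsilon guard and the synthetic test signal are our assumptions) shows the separation the homomorphic analysis relies on: a periodic excitation convolved with a short resonator response yields low-quefrency resonator content plus a spike at the excitation period:

```python
import numpy as np

def real_cepstrum(x, eps=1e-12):
    # c(n) = F^{-1}[ ln|X(e^{jw})| ]; eps guards against log(0)
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + eps)).real

period, reps = 32, 8
excitation = np.zeros(period * reps)
excitation[::period] = 0.9 ** np.arange(reps)   # decaying pulse train (pitch)
resonator = np.exp(-np.arange(8) / 2.0)         # short "timbre" impulse response
signal = np.convolve(excitation, resonator)[: period * reps]

c = real_cepstrum(signal)
lo = period // 2
# The pitch appears as a spike at quefrency = period, while the resonator
# stays near the origin, so a linear "lifter" L[ ] can separate the two.
assert lo + np.argmax(np.abs(c[lo: len(c) // 2])) == period
```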

Cepstral coefficients in sound signal processing contain more information than other parameters; the most commonly used acoustic feature parameters are MFCC and LPCC. MFCC models human auditory perception: the sound signal is passed through a filter bank and the features are obtained directly via the discrete Fourier transform (DFT). LPCC, in contrast, starts from the acoustic production model and uses linear predictive coding (LPC) to derive the cepstral coefficients. Since the evaluation of guzheng timbre is essentially based on the auditory perception characteristics of the human ear, this paper chooses MFCC as the characteristic parameter of guzheng timbre.

Mel Frequency Cepstrum Coefficient Extraction

The Mel frequency cepstrum coefficient (MFCC) is a characteristic parameter commonly used in the analysis of music signals that emphasises the auditory properties of the human ear, i.e., it analyses the spectral characteristics of music signals based on the results of human hearing experiments, and derives timbral characteristics of the guzheng that conform to subjective human auditory sensation [26]. The MFCC transforms the actual frequency onto the Mel scale, which emphasises the low-frequency information of the sound signal. The relationship between Mel frequency and actual frequency can be expressed as:

$$F_{mel}=2595\,\lg(1+f/700)$$

The unit of the perceived frequency F_mel is the Mel, and f is the actual frequency in Hz. By convention, a 1000 Hz tone at 60 dB is defined to have a perceived frequency of 1000 Mel, which anchors the curve relating perceived and actual frequency. In the calculation of MFCC, the sound signal is divided into a series of frequency groups in the frequency domain to form a Mel filter bank; MFCC shows good recognition performance and noise robustness in speech recognition.
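The Mel mapping and its inverse can be written directly from the formula above (the function names are ours):

```python
import math

def hz_to_mel(f):
    # F_mel = 2595 * lg(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse of the mapping above
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The conventional anchor: 1000 Hz maps to (almost exactly) 1000 Mel.
# The scale stretches low frequencies: the first kilohertz spans more Mel
# than the second, matching the ear's finer low-frequency resolution.
low_span = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_span = hz_to_mel(2000.0) - hz_to_mel(1000.0)
```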

Based on the above theory, combined with the characteristics of guzheng music signal, this paper adopts the MFCC parameters as the timbre feature parameters of guzheng, and Fig. 1 shows the specific computational steps of MFCC feature parameter extraction.

Figure 1.

Extraction of MFCC feature parameters

Preprocessing

The preprocessing of the guzheng music signal x(n) is the same as the music signal preprocessing above, including pre-emphasis, endpoint detection, framing, and windowing with the window function.

Fast Fourier Transform (FFT)

The discrete Fourier transform converts the music signal from the time domain to the frequency domain, but its drawback is the large computational load; to reduce the computation in MFCC parameter extraction, the fast Fourier transform is used instead for the time-to-frequency conversion. The FFT is applied to the i-th frame of the guzheng music signal x_i(m) to transform the music data from time domain to frequency domain:

$$X(i,k)=\mathrm{FFT}[x_i(m)]$$

Spectral line energy calculation

Before Mel filtering, the spectral line energy of each frame of data after the FFT must be calculated, i.e.:

$$E(i,k)=|X(i,k)|^2$$

Mel filter bank

After preprocessing and the FFT, the corresponding discrete spectrum is obtained; the energy of each frame’s spectrum is passed through a corresponding sequence of triangular filters to achieve the filtering process, yielding a series of related coefficients m1, m2, …. The number of filters M in the filter bank depends on the cut-off frequency of the music signal: in general, all filters in the bank must lie between 0 Hz and one-half the signal’s sampling rate, i.e., the Nyquist frequency.

The filter coefficients m1, m2, … are calculated as shown in equation (15):

$$m_j=\ln\left(\sum_{k=0}^{N-1}E(i,k)H_j(k)\right),\quad j=1,\ldots,M \tag{15}$$

where E(i, k) is the spectral energy value of each frame of music data after the FFT, and H_j(k) represents the frequency-domain response of each filter in the Mel filter bank. Different filters correspond to different frequency bands and hence different H_j(k), calculated as shown in Eq. (16):

$$H_j(k)=\begin{cases} 0, & k<f(j-1)\ \text{or}\ k>f(j+1) \\ \dfrac{2(k-f(j-1))}{(f(j+1)-f(j-1))(f(j)-f(j-1))}, & f(j-1)\le k\le f(j) \\ \dfrac{2(f(j+1)-k)}{(f(j+1)-f(j-1))(f(j+1)-f(j))}, & f(j)\le k\le f(j+1) \end{cases} \tag{16}$$

where f(j) represents the centre frequency of each triangular filter, which can be defined using equation (17):

$$f(j)=\left(\frac{N}{f_s}\right)F_{mel}^{-1}\left(F_{mel}(f_l)+j\,\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right) \tag{17}$$

where f_l is the lowest frequency of the filter range, f_h the highest, N the FFT length, f_s the sampling frequency, and F_mel^{-1} the inverse function of F_mel. The centres f(j) satisfy:

$$F_{mel}(f(j+1))-F_{mel}(f(j))=F_{mel}(f(j))-F_{mel}(f(j-1))$$
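Equations (16) and (17) together define the filter bank. The following sketch constructs it under assumed parameter values (M = 24 filters, N = 512-point FFT, fs = 44.1 kHz; function names are ours); the bin indices f(j) are obtained by spacing the centres uniformly on the Mel scale, as the equal-Mel-spacing condition requires:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(M=24, N=512, fs=44100.0, fl=0.0, fh=None):
    """Triangular filters H_j(k) with centres equally spaced on the Mel scale."""
    if fh is None:
        fh = fs / 2.0
    mels = np.linspace(hz_to_mel(fl), hz_to_mel(fh), M + 2)
    f = np.floor((N / fs) * mel_to_hz(mels)).astype(int)  # bin indices f(0..M+1)
    H = np.zeros((M, N // 2 + 1))
    for j in range(1, M + 1):
        lo, c, hi = f[j - 1], f[j], f[j + 1]  # must be distinct bin indices
        for k in range(lo, c + 1):            # rising edge, Eq. (16)
            H[j - 1, k] = 2.0 * (k - lo) / ((hi - lo) * (c - lo))
        for k in range(c, hi + 1):            # falling edge, Eq. (16)
            H[j - 1, k] = 2.0 * (hi - k) / ((hi - lo) * (hi - c))
    return H

H = mel_filter_bank()
```

Because the centres are equally spaced in Mel, the filters widen with frequency, mirroring the ear’s coarser resolution at high frequencies.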

To compute the Mel filter energies, the spectral line energy derived above is passed through the Mel filter bank and the energy in each band is calculated. That is, the frequency-domain energy spectrum E(i, k) of each frame is multiplied by the Mel filter frequency-domain response H_j(k) and summed:

$$S(i,j)=\sum_{k=0}^{N-1}E(i,k)H_j(k),\quad 0\le j<M$$

where i denotes the i-th frame and k denotes the k-th spectral line.

Discrete cosine transform (DCT)

Assuming that x(n) is a one-dimensional real signal sequence of finite length N, the discrete cosine transform (DCT) is expressed as:

$$X(k)=\sqrt{\frac{2}{N}}\sum_{n=0}^{N-1}C(k)\,x(n)\cos\left[\frac{\pi(2n+1)k}{2N}\right],\quad k=0,1,\ldots,N-1$$

where C(k) is an orthogonality factor introduced to ensure the orthonormality of the transform basis, defined as:

$$C(k)=\begin{cases}\sqrt{2}/2, & k=0 \\ 1, & k=1,2,\ldots,N-1\end{cases}$$

The discrete cosine transform captures the signal’s spectral components with good energy concentration, and does not need to estimate the phase of the sound, so it can achieve a better speech enhancement effect with lower computational complexity.

The cepstrum via the DCT is calculated in the MFCC extraction process. Analogous to the FFT cepstrum, where the signal’s Fourier transform is logarithmically transformed and then inverse-transformed back to the time domain, here the Mel filter energies are logarithmically transformed and their DCT is computed:

$$mfcc(i,n)=\sqrt{\frac{2}{M}}\sum_{j=0}^{M-1}\log[S(i,j)]\cos\left(\frac{\pi n(2j+1)}{2M}\right)$$

This gives the Mel cepstrum coefficients of the signal in frame i. Here M is the number of Mel filters, j indexes the j-th filter, S(i, j) is the Mel filter energy derived above, and n indexes the spectrum after the DCT.
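The final DCT step can be sketched as follows for a single frame; the toy energy values and the number of retained coefficients are our assumptions, and the cosine argument follows the DCT convention used above:

```python
import numpy as np

def mfcc_from_filter_energies(S, n_coeffs):
    """DCT of the log Mel filter energies of one frame."""
    M = len(S)
    j = np.arange(M)
    logS = np.log(S)
    return np.array([
        np.sqrt(2.0 / M) * np.sum(logS * np.cos(np.pi * n * (2 * j + 1) / (2 * M)))
        for n in range(n_coeffs)
    ])

S = np.array([1.0, 2.0, 4.0, 8.0, 4.0, 2.0, 1.0, 0.5])  # toy Mel energies
coeffs = mfcc_from_filter_energies(S, n_coeffs=5)

# A flat log-energy spectrum keeps only the 0th coefficient: the higher
# cosine basis vectors are orthogonal to a constant.
flat = mfcc_from_filter_energies(np.full(8, np.e), n_coeffs=5)
```

In practice only the first dozen or so coefficients are kept, since they summarise the spectral envelope that carries the timbre.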

Analysis of the results of timbre feature extraction for guzheng performance
Guzheng Tone Expression Spectrum

Since the timbre of a musical instrument depends on its harmonic structure, and similar instruments have similar harmonic structures, the harmonic structure can be defined as an objective index corresponding to instrument timbre. Based on the existing discrete harmonic transform, the steps of harmonic-structure timbre feature extraction are as follows. First, the music signal is divided into frames with a frame length of 0.5 s and a frame shift of 0.25 s; signals shorter than 0.5 s are not framed. Following the harmonic-structure extraction method, the harmonic structure information of the guzheng signal is obtained and the harmonic coefficients are normalised; the timbre expression spectrum is then formed from the discrete harmonic transform coefficients together with their first-order and second-order differences.

For the A4 single tone of the guzheng, the fundamental frequency is 440.0 Hz. Since the sampling rate is 44.2798 kHz, the window length of the discrete harmonic transform is 100 samples, with the window shift set to 1/3 of the window length; for each 1 s frame of the audio signal, harmonics up to order 10 are computed. Fig. 2 shows the timbre expression spectrum of the guzheng A4 single tone with a highest harmonic number of 10; the figure plots the eigenvalues of frames 0~100. It can be seen that the first few frames of the timbre expression spectrum contain a large amount of audio information, with expression-spectrum coefficients greater than 0.5, so an appropriate highest harmonic number must be chosen when performing guzheng timbre feature extraction experiments.

Figure 2.

Timbre expression spectrum of the guzheng A4 single tone

Guzheng Tone Extraction

The harmonic structures of the seven single tones from C4 to B4 of the guzheng were extracted, and the harmonic structures at three randomly selected sampling points were analysed. The amplitude-frequency spectra of the guzheng are shown in Fig. 3, with (a) and (b) for the first sampling point, (c) and (d) for the second, and (e) and (f) for the third; each subfigure shows the first 10 harmonic coefficients of the seven single tones. The figure shows that the time-frequency waveform at sampling point 2 has the largest signal amplitude, ranging from -0.75 to 0.75, while sampling point 3 has the longest duration, close to 4 × 10⁴ sampling points. At the same time, the harmonic structures of different sampling points are clearly distinguishable, while those of the same instrument (guzheng) are similar, which can serve as a basis for extracting the timbre of the guzheng.

Figure 3.

Amplitude-frequency spectra of the guzheng

Sound quality optimisation and analysis of results
Optimising experimental programme design

The sound quality of the guzheng is closely related to the degree of aging of the wood, and the body can be regarded as an exquisite, special wooden box structure. Acoustic frequency-sweep technology is adopted, and rising resonance peaks, lowered resonance peak frequencies, and the appearance of multiple resonance peaks are taken as effective criteria for the sound quality optimisation process of the guzheng. It thus becomes possible to transfer the vibrational aging technique, already applied successfully to steel structures, to guzheng production. This chapter explores the feasibility of optimising the sound quality of the guzheng by applying a sweep signal to a test instrument, reading back the resonance frequency, and then continuously driving the guzheng at that resonance frequency to excite vibration.

Four strings were sampled before and after vibration, for open-string pulling and plucking respectively. In this chapter only the results for the A string are compared and analysed; data segments with similar waveform and duration were extracted, so that the effect of vibration on the guzheng can be observed through the amplitude-time and amplitude-frequency curves, and the effect of vibration on the optimisation of the guzheng’s acoustic quality explored.

Optimisation test data processing and result analysis
Analysis of sound quality optimisation effects

The results of the guzheng sound quality optimisation test were first evaluated instrumentally. The Labview sound acquisition program recorded the two sweeps before and after the vibration optimisation and automatically computed the difference between them.

The first optimisation test ran for 50 minutes. Owing to the immaturity of the Labview sound acquisition program, acquisition stopped partway through and the experimental data were lost. Since a characteristic of vibration aging is that more than 70% of the effect appears in the first half hour of vibration, the second optimisation test extended the optimisation time to 150 minutes; the resulting curves are shown below. Figure 4 compares the sweep curves of the second test. It can be seen that after the second optimisation, under equal-amplitude vibration, the 4# guzheng has tended to stabilise: over the first 500 Hz the amplitude difference between the two sweeps is small, with amplitudes between 0.2 and 0.45; around 750 Hz and 1750 Hz the amplitude difference between the two sweeps is large, while elsewhere the sweep amplitudes are almost identical. The next step is to increase the drive power in order to achieve the desired optimisation effect.

Figure 4.

Comparison of the sweep curves of the second test

The power amplifier was therefore replaced with a more powerful one, together with a correspondingly thicker coil. Under this configuration a third optimisation test was carried out, with a vibration time of 150 minutes.

Figure 5 compares the sweep curves of the third test. As the figure shows, from 20 Hz to 1200 Hz the effect is not obvious; from 1200 Hz to 1700 Hz several new peaks emerge, at 1250 Hz, 1385 Hz and 1678 Hz, with peak amplitudes of 0.5 V, 0.65 V and 0.735 V respectively. The region above 1700 Hz shows the most significant effect, with the peak amplitude of the second sweep exceeding that of the first across the board, indicating that the high-frequency region has been significantly optimised. Since the low-amplitude vibration of the first test had already produced an optimisation effect in the low-frequency region and part of the mid-frequency region, increasing the amplifier power raised the amplitude and optimised the high-frequency region.

Figure 5.

Comparison of the sweep curves of the third test

Figure 6 shows the difference between the two sweep curves. From this difference curve it can be seen more intuitively that, in the high-frequency region above 1700 Hz, the amplitude difference fluctuates within the interval [0.01, 0.058], which further verifies the effect of the optimisation on the guzheng's sound quality.
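The difference check used for Figure 6 amounts to subtracting the two sweeps pointwise and inspecting the band above 1700 Hz. A minimal sketch, assuming both sweeps are sampled on a common frequency grid; the data below are synthetic stand-ins constructed to mimic the reported fluctuation interval, and `sweep_difference` is an illustrative name.

```python
import numpy as np

def sweep_difference(freqs, amps_before, amps_after, f_min=1700.0):
    """Pointwise amplitude difference of two sweeps restricted to freqs >= f_min.

    Returns (min_diff, max_diff), i.e. the fluctuation interval of the
    optimisation effect in that band. Both sweeps must share the grid.
    """
    mask = freqs >= f_min
    diff = amps_after[mask] - amps_before[mask]
    return float(diff.min()), float(diff.max())

# Synthetic stand-in: the second sweep is 0.01-0.058 V higher above 1700 Hz,
# mimicking the interval reported in the text.
rng = np.random.default_rng(0)
freqs = np.linspace(20.0, 2000.0, 500)
before = 0.3 + 0.05 * np.sin(freqs / 100.0)
after = before + np.where(freqs >= 1700.0,
                          rng.uniform(0.01, 0.058, freqs.size), 0.0)

lo, hi = sweep_difference(freqs, before, after)
print(lo, hi)  # both endpoints fall inside [0.01, 0.058]
```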

Figure 6.

Difference between the two sweep curves

Guzheng time-frequency curve analysis

The more pronounced the overtone peaks and the greater the variation in the guzheng's time-domain curve, the more beautiful the instrument sounds. The optimisation of the 4# guzheng is therefore also examined from the point of view of time-domain analysis.

Figure 7 shows the amplitude-time curve of a pluck of the A string before the vibration treatment of the guzheng, together with the amplitude-frequency curve obtained by FFT. The curve is characterised by the fact that the highest amplitude corresponds to a frequency of 1136 Hz (about 5/2 times the fundamental), followed by 1326.5 Hz. The peak corresponding to the fundamental frequency of 436.2 Hz is not the highest; in the high-frequency region, the amplitudes corresponding to the 5x component and the 11/2 and 17/2 components are relatively large, while the rest are small.

Figure 7.

Amplitude curves of the A string before vibration treatment of the guzheng

Figure 8 shows the amplitude-time curve of a pluck of the A string after the vibration treatment of the guzheng, together with the amplitude-frequency curve obtained by FFT. Its characteristics are that the fundamental at 483 Hz now has the highest amplitude, the 2x, 3x, 5x, 8x and 9x overtones are relatively more prominent, and the amplitude of the fundamental decays rapidly, following a roughly logarithmic curve. In addition, one or two small overtone peaks appear between the harmonics of the fundamental. These changes are the result of the vibration optimisation.
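The FFT step used to obtain the amplitude-frequency curves of Figures 7 and 8 can be sketched as follows. Since the recorded audio is not included in the paper, a synthetic plucked-string tone is used: a decaying 483 Hz fundamental plus weaker overtones at the harmonic numbers highlighted above. All signal parameters (sampling rate, decay constant, overtone weights) are illustrative assumptions.

```python
import numpy as np

fs = 44100            # sampling rate (Hz), assumed
f0 = 483.0            # fundamental reported after vibration treatment
t = np.arange(0, 1.0, 1.0 / fs)

# Synthetic plucked-string tone: decaying fundamental plus weaker
# overtones at harmonic numbers 2, 3, 5, 8, 9.
signal = np.exp(-3.0 * t) * np.sin(2 * np.pi * f0 * t)
for k in (2, 3, 5, 8, 9):
    signal += (0.3 / k) * np.exp(-3.0 * t) * np.sin(2 * np.pi * k * f0 * t)

# Amplitude spectrum via FFT, as done for the amplitude-frequency curves.
spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)

peak_freq = freqs[np.argmax(spectrum)]
print(peak_freq)  # dominant peak near the 483 Hz fundamental
```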

Figure 8.

Amplitude curves of the A string after vibration treatment of the guzheng

Composition and Relevance of Tonal Expression in Guzheng Performance
Tonal Expression in Guzheng Performance
Basic Tone Applications

The basic tone comprises four dimensions: fullness, solidity, evenness and relaxation. If a performance departs from the basic tone, the melody can only be heard as floating notes without a sense of tonal concentration, and the mood the composer wants to express cannot be conveyed. In addition, the bass melody of the left hand contrasts with the middle- and high-register melody of the right hand; the left hand must exploit the basic tone to the fullest in order to enrich the melody and harmony of the piece, otherwise the left-hand melody is merely notes floating in the air.

Starting from the concept of the basic tone, the first element is the "full" tone: in playing, a full tone must "sink down", and in any musical context a round, full tone projected from the inside out is something the player must master. The second is the "solid" tone: the player uses the upper arm to drive the forearm, letting the power fall through the fingertips into the stroke, so that the resulting sound is more solid and brings out the flavour of the music. The third is the "even" tone: evenness requires that each note be played uniformly, in techniques such as shaking fingers, wheel fingers and arpeggios, and likewise within phrases that call for a uniform tone.

Carving out granular timbres

The granular tone is characterised by fast power generation and strong finger independence. When playing the wheel finger, the player should first distinguish it from an unaccented single tone: the middle finger, index finger and thumb should concentrate their power and play with explosive force, paying particular attention to the independence of the fingertips so that the three tones of the wheel finger do not blur into one. Secondly, the mood of the piece should be advanced through each repetition, which requires the player to differentiate the strength of the granular timbre when playing the same melody the first and second time.

Presentation of linear tones

Linear tone is used relatively widely in zheng performance, and it is also reflected in innovations in playing technique such as finger shaking, wheel playing, trills and glissandi, all of which enhance the expression of linear tone. The linear timbre of a monophonic melody is mainly found in the free-tempo and slow sections of a piece. To express, for example, a hazy morning mist, a piece may use a simple, single technique; from the player's point of view, expressing linear tone with a single technique requires slow force application driven from the wrist, so that the tones connect with one another.

Controlled Tone Grasp

Controlled tone emphasises a fingertip command of timbre that is "strong but not explosive, weak but not false". To depict misty, illusory scenery, the player needs to play the right-hand melody extremely softly, which requires the ability to control weak tones and to achieve a full sound even in the lightest, weakest passages. A tone that is strong but not explosive also belongs to tone control: in expressive fast passages the player's state is generally excited, and if tone control is not well mastered, the sound, especially in the treble register, can take on a harsh "sonic boom" quality. The right hand should therefore pay attention to its power when playing and keep it within certain limits.

The Relevance of Tonal Expression in Guzheng Performance
Motivating players to focus on tonal expression

When playing guzheng music, the first priority is attention to timbral expression during performance. The primary task of timbral expression is to convey the mood the composer depicts and the emotion he wishes to express; its second role is to connect the performer's subjective emotion to the piece. Most guzheng players neglect the importance of timbral expression in their practice: their interpretation of the work is not expressed in their performance, and over time they neglect the training of fingertip timbre. The importance of timbral expression is reflected not only in the emotional expression and mood-shaping of the piece, but also in the way timbre training and fingertip timbre processing raise the level of performance, which in turn further motivates players to pay attention to timbral expression.

Enhancement of the performer’s ability to analyse the context of a piece of music

After attending to fingertip playing tone, the performer can also develop the ability to analyse the context of a piece before playing, which lays the foundation for combining timbral expression with that context. In analysing the context of a piece, one should first study its genre, the background of its composition and its programme notes in depth, which establishes the contextual colours of the whole piece.

Promoting the integration of timbral expression in performance with the context of the piece

Once the player pays more attention to fingertip timbre and strengthens the ability to analyse a piece's context, he or she can go further and apply the appropriate timbre to the corresponding passage. For example, in lyrical passages the player should use linear timbre to bring out the emotion of the piece, while in passages of intense emotion a more granular, solid timbre should be used. Which timbre to use in different musical situations, and how to use tone to express the mood of a piece, form a mutually reinforcing process that requires thought and practice on the performer's part. Timbral expression is thus not only a matter of fingertip timbre in performance; it also progressively improves the player's ability to analyse the context of the music, combining theory with practice so that timbral expression and musical context reinforce each other, ultimately realising the practical significance of timbral expression in performance.

Conclusion

This paper proposes a string vibration model for the guzheng based on the sound-generation principle of guzheng playing. The guzheng audio is preprocessed using the Fast Fourier Transform, and the Mel-frequency cepstral coefficients (MFCC), extracted after homomorphic, cepstral and complex-cepstral processing, are selected as the timbre characteristic parameters of the guzheng. The timbre extraction results are then analysed through simulation experiments.

The harmonic structures of seven single tones in the C4~B4 range of the guzheng were extracted, and the harmonic structures at three randomly selected sampling points were analysed. The amplitude of the time-frequency waveform signal at sampling point 2 was the largest, ranging from -0.75 to 0.75, and the duration at sampling point 3 was the longest, close to 4 × 10⁴ s.

Acoustic sweep technology was adopted to optimise the sound quality of the guzheng. After three experimental sweeps, new peaks appeared in the 1200 Hz~1700 Hz range, at frequencies of 1250 Hz, 1385 Hz and 1678 Hz, with peak amplitudes of 0.5 V, 0.65 V and 0.735 V; the effect in the region above 1700 Hz was most significant. The sweep-curve difference shows intuitively that, in the high-frequency region above 1700 Hz, the amplitude difference fluctuates within the interval [0.01, 0.058], confirming the optimisation effect on the guzheng's sound quality.

Language:
English
Publication frequency:
1 issue per year
Journal subjects:
Biological Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, other