Research on the Application of Artificial Intelligence Technology in Guzheng Music Cultural Inheritance

As an important part of Chinese culture, traditional music has a distinctive appeal and wide international influence. Disseminating Chinese traditional music strengthens understanding and communication between cultures; it promotes cultural exchange among countries, enriches the world’s music culture, and increases the international influence of Chinese culture [1-4]. The guzheng is a representative Chinese traditional instrument, and it plays an important role in the development and inheritance of traditional music. Since the reform and opening up, Chinese traditional culture has been strongly affected by the influx of foreign cultures, and music is no exception [5-7]. Under pressure from popular and foreign music, the inheritance and development of traditional music face a serious challenge, which has become increasingly evident in the 21st century [8-10]. In recent years, traditional music culture such as guzheng music culture has received growing attention, but the protection measures implemented so far have rarely played a decisive role [11-13]. With the rapid development of science and technology, artificial intelligence has become a key force in promoting social progress, economic transformation and cultural dissemination. Its development and application provide new opportunities for the creative transformation and innovative development of outstanding Chinese traditional music culture [14-16]. Transforming and developing traditional music culture through “artificial intelligence + guzheng music culture” offers new ideas and opens new paths for its dissemination [17-19].

In this paper, a guzheng fingering estimation method based on prior knowledge and a pitch difference-fingering model is proposed to improve the learning outcomes of learners in the cultural inheritance of guzheng music. A bidirectional LSTM is used to mine the mapping relationship between pitch difference and fingering. A fingering transfer layer is established to capture the fingering transfer patterns of single tones. A Hidden Markov generative model is used to augment the data for guzheng fingering annotation. Prior knowledge is also explicitly incorporated into the model to ensure the flexibility of the final output fingering results. The ZF-VA-Dataset, an audio and video dataset of guzheng fingering, is used to evaluate the accuracy of the model. The model is then applied to practical guzheng teaching, and its reliability is discussed through comparative analysis.

2

Combination of Artificial Intelligence and Music Heritage Education

2.1

Examples of the integration of artificial intelligence and music heritage education

China is a multi-ethnic country with a long history and a deep cultural heritage. In music, ethnic instruments of many different styles are carriers of ethnic music culture and need to be protected and inherited. Most listeners are not familiar with ethnic musical instruments. Using computers and artificial intelligence to identify these instruments lets people learn which instrument is being played while they listen to ethnic music, so that they come to understand and appreciate the instruments in an enjoyable way. This is currently a valuable and meaningful research direction.

For example, the founder of Yi Guzheng proposed a new artificial intelligence-assisted guzheng teaching method that combines artificial intelligence with traditional guzheng teaching. The Yi Guzheng Intelligent Music Classroom, in cooperation with domestic experts in fields such as guzheng education, artificial intelligence and audio recognition, combines traditional Chinese art with modern technology, and traditional guzheng culture with the developmental characteristics of modern children.

2.2

Guzheng playing techniques

The guzheng is an ancient Chinese plucked instrument with a wide pitch range and a beautiful sound. Because it is flexible and versatile to play, the guzheng was already popular in Qin during the Warring States period, and over its long history it has accumulated a distinctly Chinese cultural heritage. The zheng is rich in techniques: it supports pressing and plucking as well as monophonic and polyphonic playing, and it has a relatively large dynamic range among plucked instruments. With the introduction and popularization of western musical instruments, guzheng performance technique has continued to develop. Many composers and performers have fused western musical elements with traditional techniques, continually reforming performance technique and promoting the development of the guzheng.

Players use the traditional guzheng techniques that have developed over a long period. The right-hand techniques include single-tone fingerings (chopping, resting, wiping, picking, etc.), double-tone fingerings (double resting, double chopping, double picking, etc.), continuous-tone fingerings (flower finger), and sustained-tone fingerings (finger wheeling, finger rocking). The left-hand techniques include press, vibrate, knead, note, chant, point, and slide. These traditional techniques still dominate and are essential for every player. A number of modern techniques are also used. With the continuous influx and integration of western musical elements, the guzheng has absorbed the playing techniques of other instruments and freed itself from the traditional division of left-hand and right-hand activities, so that its sound and acoustic performance are richer and more diversified. In some compositions, the bamboo beaters of the yangqin and the bow of the cello are used on the guzheng, together with new techniques such as tapping the strings of the zheng, tapping the zither box, sweeping, and playing on the left side of the yard. As a result, the sound of the guzheng has become richer and more diversified.

3

Pitch difference-fingering matching model

3.1

Note data preprocessing

There are two main approaches to data processing. One converts pitch information into its corresponding MIDI number and uses it as input to estimate fingering. The other uses the relative position information between the characterizing keys as input. This paper refers to data processed in the second way as pitch difference (PD) data. This representation contains the physical position interval of neighboring pitches together with temporal information, which makes it easier for the model to explore the relationship between pitch and fingering. The left-hand and right-hand data are processed and trained separately. The representation is as follows: (1) $$d^{(t)} = \left\{ \begin{array}{ll} 100n & t = 1 \\ x^{(t)} - 100x^{(t-1)}n & \left| x^{(t)} - x^{(t-1)} \right| < 12,\ t > 1 \\ 80\operatorname{sgn}\left[ x^{(t)} - x^{(t-1)} \right] & \left| x^{(t)} - x^{(t-1)} \right| \geq 12,\ t > 1 \end{array} \right.$$

Where: x denotes the MIDI number corresponding to the pitch of the note; t denotes the sequential index of the note; n denotes the number of notes at the current moment, i.e., the number of notes contained in the polyphony, and n is set to 0 when the current note is monophonic.

3.2

Fingering representation

This paper adopts the common fingering representation of sheet music: “finger 1” is the thumb, “finger 2” is the index finger, and so on. The left-hand and right-hand finger numbers are shown in Figure 1. The sequence labels of the data used in this paper are likewise the five numbers from 1 to 5.

3.3

Pitch Difference-Fingering Relationship Mapping Layer

3.3.1

Mapping Pitch Differences to Fingerings

The pitch difference-to-fingering mapping uses a BI-LSTM-like network, in which two LSTM layers capture the forward and backward contextual information of the pitch differences in the score to model the mapping between pitch difference and fingering. The basic units of the forward and backward networks are LSTM cells, whose internal gates are designed to capture long-range relationships among pitch differences so that contextual information can be fully used to determine fingerings. Long-range context is particularly important in certain situations, such as learning finger-turning for ascending or descending pitches, or finger-swiping to play rapidly repeated notes. For some notes, the fingering annotation depends on both the preceding and the following notes. A bidirectional LSTM is therefore used to mine note-to-finger relationships: when determining the current fingering, it considers not only the current note and the notes already played but also the notes still to be played.

The basic cell of the LSTM has three main gates to realize its function: the forget gate, the input gate and the output gate. At a certain moment t, the forward process within the cell and its internal update process are as follows:

The forget gate discards useless long-term memories and determines whether to continue storing long-term information. For example, if a previous chord input has little effect on the current fingering, the forget gate may discard that chord information: (2) $$f^{(t)} = \sigma\left( W_f \cdot \left[ h^{(t-1)}, e^{(t)} \right] + b_f \right)$$

where W_f is the weight matrix of the forget gate, $[h^{(t-1)}, e^{(t)}]$ is the concatenation of the input pitch-difference information with the hidden layer state of the previous moment, b_f is the bias of the forget gate, and σ is the sigmoid function.

The input gate controls whether the current pitch difference information is allowed to enter the cell state: (3) $$i^{(t)} = \sigma\left( W_i \cdot \left[ h^{(t-1)}, e^{(t)} \right] + b_i \right)$$ (4) $$\tilde{C}^{(t)} = \tanh\left( W_c \cdot \left[ h^{(t-1)}, e^{(t)} \right] + b_c \right)$$

where W_i is the weight matrix of the input gate and b_i is the bias of the input gate.

The current state of the cell C^(t) is a combination of previous and current memories. Under the control of the forget gate, it can retain score information from much earlier time steps, and under the control of the input gate, it can prevent currently irrelevant content from being memorized: (5) $$C^{(t)} = f^{(t)} * C^{(t-1)} + i^{(t)} * \tilde{C}^{(t)}$$

where * denotes element-wise multiplication of vectors. This calculation saves the real-time state of the pitch information to long-term memory.

The output gate controls the effect of the long-term memory on the current output: (6) $$o^{(t)} = \sigma\left( W_o \cdot \left[ h^{(t-1)}, e^{(t)} \right] + b_o \right)$$

where W_o is the weight matrix of the output gate and b_o is the bias of the output gate.

The state of the hidden layer is used to control whether the long-term state C^(t) is output as the fingering label of the current LSTM cell: (7) $$h^{(t)} = o^{(t)} * \tanh\left( C^{(t)} \right)$$

The final output of the LSTM cell is determined by the output gate and the cell state. If the hidden layer state is extracted at each time step, the output sequence has the same length as the input. LSTM networks can therefore, in theory, model the correspondence between note sequences and fingerings. The BI-LSTM combines a forward LSTM and a backward LSTM. Its output is the concatenation $\left[ h^{(t)}_{(forward)}\ h^{(t)}_{(backward)} \right]$ of the forward and backward hidden layer vectors, with size 2*hidden_embeding_size.
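The gate updates in Eqs. (2)-(7) and the bidirectional concatenation can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this paper: the function names `lstm_step`, `init_params` and `bilstm`, the random initialization, and the zero initial states are all assumptions introduced for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(e_t, h_prev, c_prev, p):
    """One forward step of an LSTM cell, following Eqs. (2)-(7)."""
    z = np.concatenate([h_prev, e_t])           # [h^(t-1), e^(t)]
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate, Eq. (2)
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate, Eq. (3)
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate state, Eq. (4)
    c = f * c_prev + i * c_tilde                # cell state, Eq. (5)
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate, Eq. (6)
    h = o * np.tanh(c)                          # hidden state, Eq. (7)
    return h, c

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "fico":
        p[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_size, hidden_size + input_size))
        p[f"b_{gate}"] = np.zeros(hidden_size)
    return p

def bilstm(seq, p_fwd, p_bwd, hidden_size):
    """Concatenate forward and backward hidden states at every time step."""
    def run(xs, p):
        h, c = np.zeros(hidden_size), np.zeros(hidden_size)
        out = []
        for e_t in xs:
            h, c = lstm_step(e_t, h, c, p)
            out.append(h)
        return out
    hs_f = run(seq, p_fwd)
    hs_b = run(seq[::-1], p_bwd)[::-1]          # re-align backward pass with time
    return np.stack([np.concatenate([a, b]) for a, b in zip(hs_f, hs_b)])
```

For an input of shape (T, input_size), `bilstm` returns an array of shape (T, 2*hidden_size), matching the concatenated vector described above.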

3.3.2

Fingering probability mapping

To better characterize the fingering recognition results, the hidden state vector of size 2*hidden_embeding_size is mapped to k dimensions in the connection layer at the back end of the BI-LSTM, where k is the number of fingering labels and λ^(t) denotes the probability of each fingering label for the current input: (8) $$\lambda^{(t)} = c\left[ h^{(t)}_{(forward)}\ h^{(t)}_{(backward)} \right]$$ (9) $$\lambda^{(t)} = \left( \lambda_1^{(t)}, \lambda_2^{(t)}, \lambda_3^{(t)}, \cdots, \lambda_k^{(t)} \right)^T$$

$\lambda_j^{(t)}$ indicates the probability that the fingering is finger j at position t.

3.4

Finger transfer layer

When the pitch rises or falls continuously, finger-turning is often used instead of shifting the hand position. Although the basic bidirectional long short-term memory (BI-LSTM) network can capture long-range pitch difference information and the relationship between pitch difference and fingering, it cannot directly reflect the ergonomic constraints between fingers, which are important for fingering annotation. This motivates a further improvement of the BI-LSTM network that incorporates a priori knowledge about finger sequences and note contexts. To this end, the finger transfer probability matrix W_T is introduced into the model. It counts the fingering transfer patterns of single tones in the statistics set and constrains neighboring fingerings in the BI-LSTM output. Finger shifts are related to the rise and fall of a monophonic sequence, so the fingering transfer probability matrices W_T↑ and W_T↓ are defined for rising and falling input pitch, respectively. Their combination is shown in Eq. (10): (10) $$W_T = \frac{1}{2}\left[ \operatorname{sgn}\left( d^{(t)} \right) \cdot \left( W_{T\uparrow} - W_{T\downarrow} \right) - \operatorname{sgn}\left( d^{(t)} \right) \cdot \left( W_{T\uparrow} + W_{T\downarrow} \right) \right]$$

W_T↑ and W_T↓ constrain only the fingering choices of adjacent single tones. Each entry P_ij of these matrices is the likelihood of transfer from the ith finger to the jth finger. In the model, the parameters of these matrices are trained together with the parameters of the mapping part and are updated during backpropagation.

Based on the input pitch difference sequence D, the fingering transfer matrix W_T and the output fingering sequence Λ obtained from the BI-LSTM, the output Y is represented as follows: (11) $$y^{(t)} = W_T * y^{(t-1)} + \lambda^{(t)}$$ (12) $$\hat{y}^{(t)} = \mathrm{softmax}\left( y^{(t)} \right)$$ (13) $$Y = \left( \hat{y}^{(1)}, \hat{y}^{(2)}, \cdots, \hat{y}^{(n)} \right)$$

where $y_i^{(1)} = \lambda_i^{(1)} = \Lambda(1, i)$, and y^(t) can also be expressed in the following form: (14) $$y^{(t)} = \left( y_1^{(t)}, y_2^{(t)}, y_3^{(t)}, \cdots, y_k^{(t)} \right)^T$$

$y_i^{(t)}$ denotes the probability of finger i appearing at time t.

The finger with the highest probability at time t, i.e., the finger number φ^(t) corresponding to the current pitch during training, is: (15) $$\varphi^{(t)} = \arg\max_i \left[ y_i^{(t)} \right]$$
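The recursion in Eqs. (11)-(15) can be illustrated with a short NumPy sketch. Several choices here are assumptions made only for the example and are not confirmed by the text: the product of W_T and y^(t-1) is read as a matrix-vector product with the matrix transposed, so that entry P_ij carries score from finger i to finger j; W_T↑ is selected when the pitch difference is non-negative and W_T↓ otherwise; and the function name `transfer_decode` is hypothetical.

```python
import numpy as np

def softmax(v):
    z = np.exp(v - np.max(v))   # shift for numerical stability
    return z / z.sum()

def transfer_decode(lams, d, W_up, W_down):
    """Apply the finger transfer layer to BI-LSTM outputs.

    lams:   (n, k) array, lambda^(t) for each time step
    d:      length-n pitch differences (only the sign is used here)
    W_up:   (k, k) transfer matrix for rising pitch (assumed selection rule)
    W_down: (k, k) transfer matrix for falling pitch
    Returns the softmax outputs and 1-based finger numbers.
    """
    lams = np.asarray(lams, dtype=float)
    y = lams[0]                                 # y^(1) = lambda^(1)
    probs = [softmax(y)]
    fingers = [int(np.argmax(y)) + 1]
    for t in range(1, len(lams)):
        W_T = W_up if d[t] >= 0 else W_down     # assumption: choose by sign of d^(t)
        y = W_T.T @ y + lams[t]                 # Eq. (11), transposed reading
        probs.append(softmax(y))                # Eq. (12)
        fingers.append(int(np.argmax(y)) + 1)   # Eq. (15)
    return np.stack(probs), fingers
```

With a rising-pitch matrix that moves all score from finger 1 to finger 2, a sequence that starts on finger 1 and rises is decoded as fingers 1 then 2 when the BI-LSTM output for the second note is uninformative.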

3.5

A priori fingering knowledge

The finger transfer matrix learned during training is affected by the model structure, data diversity, and number of samples, and can only characterize the weak transfer relationship between adjacent fingers. If the current fingering is not realizable, the transfer probability between fingers is set to 0; otherwise the current output of the model is maintained. In left-handed fingering, there is a 5-finger to 1-finger hand translation within an octave, so this fingering is not subject to prior knowledge. The decision function T is shown in Equation (16): (16) $$T = \left\{ \begin{array}{ll} \frac{1}{2}\left\{ \operatorname{sgn}\left[ 4.5 - f^{(t-1)} \cdot f^{(t)} \right] + 1 \right\} & \text{right: } -12 < d^{(t)} < 12 \text{ and } \operatorname{sgn}\left( d^{(t)} \right) \cdot \left( f^{(t)} - f^{(t-1)} \right) < 0 \\ \frac{1}{2}\left\{ \operatorname{sgn}\left[ 5.5 - f^{(t-1)} \cdot f^{(t)} \right] + 1 \right\} & \text{left: } -12 < d^{(t)} < 12 \text{ and } \operatorname{sgn}\left( d^{(t)} \right) \cdot \left( f^{(t)} - f^{(t-1)} \right) > 0 \\ 1 & \text{other} \end{array} \right.$$

When the decision function is equal to 0, there are fewer fingering paths between the two moments. Some of the transfer paths are pruned when the left hand music pitch rises or when the right hand pitch falls.

After adding the path pruning condition, the expression of $\tilde{y}^{(t)}$ in the recognition phase is shown in Equation (17): (17) $$\tilde{y}^{(t)} = \left( T \cdot W_T \right)^T * y^{(t-1)} + \lambda^{(t)}$$

At time t, we choose the fingering with the highest probability as the final fingering estimate, and the modeled fingering sequence number φ^(t) for the current pitch is: (18) $$\varphi^{(t)} = \arg\max_i \left[ \tilde{y}_i^{(t)} \right]$$

3.6

Data Enhancement Methods Based on Hidden Markov Generative Modeling

Although data enhancement methods from the field of natural language processing cannot be applied directly to score-fingering data, some of their ideas are still instructive. For example, synonym substitution is a very common data enhancement method in tasks such as lexical annotation of text sequences or machine translation. Synonym substitution achieves data enhancement by maintaining the semantic equivalence between two words. By analogy, score-fingering data can be enhanced by maintaining the equivalence of the mapping relationship between the score data and the fingering.

In the fingering annotation task, any fingering is feasible unless the note is preceded or followed by other notes, i.e., fingering is essentially a problem of finding an optimal sequence of smooth state transitions. Guzheng playing can therefore be interpreted as a process of generating a sequence of played notes from a sequence of finger position states. On this basis, a data augmentation method based on the Hidden Markov Model (HMM) is proposed. The HMM has also been used directly for fingering estimation tasks before, with good results.

The fingering state is taken as the hidden state and the pitch difference as the observed state to construct the HMM. To build a first-order HMM, for example, the following statistics are collected from the score fingering dataset: the frequency of the initial fingering $F\left( f^{(1)} \right)$, the size of the initial set of fingerings, the frequency $F\left( f^{(t)} | f^{(t-1)} \right)$ of the current fingering given the previous fingering, and the frequency $F\left( PD^{(t)} | f^{(t)} \right)$ of the pitch difference corresponding to fingering f^(t) in the existing scores. The parenthesized superscript of a pitch difference or fingering is its time step. In addition, since the notes of a chord occur together, each chord is counted as a single element.

Based on these statistics, the three elements of the first-order HMM are constructed: the initial probability π, the state transfer matrix A and the emission matrix B. The initial state $\pi = \left( F\left( f_1^{(1)} \right), F\left( f_2^{(1)} \right), \cdots, F\left( f_k^{(1)} \right) \right)$ is the probability distribution of the initial fingering. In the state transfer probability matrix $A = \left[ a_{i,j} \right]$, a_{i,j} denotes the probability that, if the fingering state at any moment is s_i, the state at the next moment is s_j, i.e., $\sum_{t=2}^{n} F\left( f^{(t)} = s_j | f^{(t-1)} = s_i \right)$ in the statistical results, where n denotes the sequence length. In the emission matrix $B = \left[ b_{i,m} \right]$, b_{i,m} denotes the probability that observation o_m is obtained at any moment if the state is s_i, i.e., $\sum_{t=1}^{n} F\left( \text{PD} = o_m | f^{(t)} = s_i \right)$ in the statistical results.
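The counting procedure for π, A and B can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's code: fingering states are encoded as integers 0 to k-1, each pitch difference (or whole chord) is a hashable observation symbol, the name `fit_first_order_hmm` is hypothetical, and rows with no counts are simply left as zeros.

```python
import numpy as np

def normalize_rows(m):
    """Turn counts into frequencies; all-zero rows are left as zeros."""
    m = np.asarray(m, dtype=float)
    totals = m.sum(axis=-1, keepdims=True)
    totals[totals == 0] = 1.0
    return m / totals

def fit_first_order_hmm(sequences, k, observations):
    """Estimate (pi, A, B) from annotated (fingerings, pitch differences) pairs.

    sequences:    iterable of (fingers, pds), two equal-length lists
    k:            number of fingering states
    observations: list of all distinct observation symbols
    """
    obs_index = {o: m for m, o in enumerate(observations)}
    pi = np.zeros(k)
    A = np.zeros((k, k))
    B = np.zeros((k, len(observations)))
    for fingers, pds in sequences:
        pi[fingers[0]] += 1                           # initial fingering count
        for prev, cur in zip(fingers[:-1], fingers[1:]):
            A[prev, cur] += 1                         # transfer s_i -> s_j
        for f, pd in zip(fingers, pds):
            B[f, obs_index[pd]] += 1                  # observation o_m in state s_i
    return normalize_rows(pi), normalize_rows(A), normalize_rows(B)
```

Each row of A and B then sums to 1 (or to 0 for a state never seen in the data), so the rows can be used directly as sampling distributions.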

With a sufficient amount of data, statistics such as $F\left( f^{(t)} | f^{(t-1)}, f^{(t-2)} \right)$ can be used to build higher-order HMMs, which link more fingering relations and generate higher-quality fingering sequences. The data enhancement process based on the first-order HMM is as follows.

Assuming that the newly generated initial fingering $f_a^{(1)}$ obeys the initial state π, e.g., the probability that the newly generated initial fingering is finger 2 is $P\left\{ f_a^{(1)} = 2 \right\} = F\left( 2^{(1)} \right)$, the initial fingering $f_a^{(1)}$ is: (19) $$f_a^{(1)} = random\_(\pi)$$

where random_(π) denotes the generation of values in its sample space based on the probability matrix π. After the initial fingering is obtained, its corresponding pitch difference is: (20) $$PD_a^{(1)} = random\_\left( B\left( s_i = f_a^{(1)} \right) \right)$$

The next fingering is: (21) $$f_a^{(t)} = random\_\left( A\left( s_i = f_a^{(t-1)} \right) \right)$$

The corresponding pitch difference is: (22) $$PD_a^{(t)} = random\_\left( B\left( s_i = f_a^{(t)} \right) \right)$$

This generates an augmented sequence of the specified length.
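The sampling loop of Eqs. (19)-(22) can be sketched with NumPy's random generator. The function name `generate_sequence` and the integer encoding of states and observations (indices into the rows and columns of A and B) are assumptions made for this example; the sketch is not the paper's implementation.

```python
import numpy as np

def generate_sequence(pi, A, B, length, rng):
    """Sample an augmented (fingering, pitch difference) sequence.

    pi: (k,) initial distribution; A: (k, k) transfer matrix;
    B: (k, m) emission matrix; rng: numpy.random.Generator.
    Returns two lists of integer indices of the given length.
    """
    pi, A, B = (np.asarray(x, dtype=float) for x in (pi, A, B))
    fingers, pds = [], []
    f = None
    for t in range(length):
        if t == 0:
            f = rng.choice(len(pi), p=pi)        # Eq. (19)
        else:
            f = rng.choice(A.shape[1], p=A[f])   # Eq. (21)
        pd = rng.choice(B.shape[1], p=B[f])      # Eqs. (20) and (22)
        fingers.append(int(f))
        pds.append(int(pd))
    return fingers, pds
```

A call such as `generate_sequence(pi, A, B, 16, np.random.default_rng(0))` returns one augmented sequence of 16 steps; repeating the call with different seeds yields additional training samples.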

4

Using the guzheng fingering matching model to assist guzheng practice

As one of the most important national musical instruments in China, the guzheng relies heavily on the coordination of the player’s multiple senses and body movements during practice. For beginners, the guzheng is more difficult to learn than other instruments, mainly because of the following factors:

1) Diversification of fingerings: There are more than 30 kinds of fingerings in guzheng. Fingerings are marked as symbols in the music score, so a practitioner who is not familiar with these fingering symbols faces a higher learning cost.

2) High requirements of fingering patterns: Some people are nervous when playing guzheng, which results in stiff or clumsy fingers and leads to the habitual use of wrong patterns. It is often hard for learners to notice and correct these errors themselves, and the cost of correction is very high once a wrong habit has formed.

3) Difficulty in coordinating left and right hands: Guzheng playing requires close coordination between the left and right hands. For example, the two-handed rocking technique requires both hands to rock on different melodic parts at the same time to produce sound.

4) Playing position is more flexible: The reasonable playing position of the right hand in standard teaching is about 1/7 of the string length from the yue shan, while the left-hand position requirement is also more flexible.

The development and design of intelligent musical instruments is a form of computer-aided instruction. Instrumental music teachers can use fingering matching models to teach better and to transfer playing knowledge more accurately. This way of teaching with artificial intelligence technology is becoming more and more common and is turning into a new way of teaching instrumental music.

5

Empirical studies of examples of assessment models

5.1

Guzheng fingering audio and video assessment experiments

5.1.1

Accuracy assessment

In this section, the ZF-VA-Dataset, an audio-video dataset of guzheng fingering, is constructed. It contains 600 audio-video fingering recordings collected from 22 volunteers, categorized into five subsets: hook, rest, wipe, major pinch, and minor pinch. Each subset includes correct and commonly incorrect fingerings that can be assessed from the video perspective, and accurate and biased pitches that can be assessed from the audio perspective.

Using the ZF-VA-Dataset constructed in this section, test experiments are conducted for the guzheng fingering estimation method based on a priori knowledge and the pitch difference-fingering model. The corresponding video and audio evaluation results are shown in Table 1.

Each fingering subset includes the following video samples: correct fingering, wrong hand shape, wrong fingering, deviated string playing direction, collapsed palm, unstable hand, and curled fingers. Since this is a comprehensive assessment of each fingering, the accuracy is reduced compared to each subdivided subset in ZF-Dataset. All things being equal, fingerings played with a single finger (hook, rest, and wipe) are less difficult to assess than fingerings played with two fingers at the same time (major and minor pinches), so their assessment accuracy is slightly higher. Each fingering subset includes the following audio samples: correct pitch, playing the wrong string, and deviations from chromatic pitch. Intonation is crucial for learning a musical instrument, and pitch comparison is a simple and effective method of audio assessment. Its accuracy is higher than that of the video evaluation, exceeding 90% for every fingering subset.

Table 1.

An experiment on the evaluation of guzheng

Subset	Hook	Rest	Wipe	Major pinch	Minor pinch
Video evaluation accuracy	72.4%	74.82%	74.53%	67.85%	62.47%
Audio evaluation accuracy	98.35%	99.47%	97.85%	96.47%	97.28%

5.1.2

Guzheng fingering audio evaluation experiments

The guzheng has 21 strings, each corresponding to one tone. The strings are numbered from 1 to 21, from thin to thick, and are tuned to the pentatonic scale. This paper takes the common D-key tuning as the base, so string No. 1 corresponds to double treble do and string No. 21 corresponds to double bass do. The data analyzed come from a self-built guzheng database containing more than 8000 recordings. All data were collected by microphone in a standard room under quiet conditions, from standard guzheng scales played by five professional teachers on five different brands of guzheng. The sampling rate is 44100 Hz, and the data used below involve only the 3 basic fingerings (no complicated techniques), namely hook, wipe, and rest. Because all analyses in this paper are based on frequency sequences, the sample data must first be converted from the time domain to the frequency domain. The Fourier transform is used for this conversion, and the number of Fourier points is set to 8192 to ensure adequate frequency resolution. Since the very high harmonics of the guzheng decay quickly and carry little energy, the number of frequency points is capped at 512. After the time-frequency conversion, each data sample corresponds to a spectrum sequence of 512 frequency points with a frequency resolution of (44100/8192) Hz.

For the fingering audio, the frequency-domain information is obtained by Fourier transform and compared with the corresponding standard tones. Taking chromatic scale 4 (fa) as an example, its time-domain waveform, frequency-domain and harmonic information are shown in Fig. 2, where (a) contains harmonic information (b) and (c) clearly shows its fundamental frequency information. The frequency value obtained is consistent with the standard tone of the D major alto fa of about 750 Hz, and this chromatic scale is pitched correctly, indicating that the left hand presses the strings with a reasonable strength, so the evaluation result is that the fingering audio is correct.
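The time-frequency conversion with the stated settings (44100 Hz sampling rate, 8192 Fourier points, first 512 frequency points kept) can be sketched with NumPy. The helper names `spectrum` and `peak_frequency` are assumptions for this example, and taking the strongest bin as the pitch estimate is a simplification, since the strongest partial of a real guzheng tone is not necessarily its fundamental.

```python
import numpy as np

SAMPLE_RATE = 44100                  # Hz, as stated in the text
N_FFT = 8192                         # number of Fourier points
N_BINS = 512                         # upper limit on frequency points
RESOLUTION = SAMPLE_RATE / N_FFT     # about 5.38 Hz per frequency point

def spectrum(signal):
    """Magnitude spectrum truncated to the first N_BINS frequency points."""
    mags = np.abs(np.fft.rfft(signal, n=N_FFT))  # crops or zero-pads to N_FFT
    return mags[:N_BINS]

def peak_frequency(signal):
    """Frequency in Hz of the strongest retained frequency point."""
    return float(np.argmax(spectrum(signal))) * RESOLUTION
```

With these settings the retained spectrum covers frequencies up to about 512 × 5.38 ≈ 2756 Hz, and any estimated frequency is quantized to multiples of the resolution.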

5.2

Example empirical studies

Eight participants were recruited from a music program at a university. None of the eight participants had any learning experience with the guzheng or other musical instruments; all were between the ages of 22 and 26 and were right-handed. Before the experiment began, its purpose, content, inclusion criteria and potential risks were explained to the participants in detail. The eight participants were randomly divided into two groups:

The experimental group, consisting of 4 participants, performed fingering exercises with the assistance of the system during the independent practice session after the guzheng lesson.

The control group, consisting of 4 participants, performed fingering practice in the traditional mode during the independent practice session after the guzheng lesson, i.e., practicing on their own without any external tutoring.

In addition, a guzheng teacher with 13 years of guzheng learning experience and 6 years of guzheng teaching experience was invited to teach basic guzheng knowledge to the 8 participants in the pre-test period.

5.2.1

Performance level and fingering mastery dimensions

The experiment calculated and counted the correct rate of 31 indicators for each participant separately to verify whether the performance level and fingering mastery of the practitioners were improved after practice. The percentage of the indicators with less than 60% correct rate of the test before and after the after-school independent practice for both groups of participants is shown in Figure 3. Before the independent practice, the performance level and mastery of the target fingerings were basically the same in both groups, with some participants in the control group having a lower number of indicators below 60% than those in the experimental group. However, after the after-school independent practice, the number of indicators with less than 60% correctness of the four participants in the experimental group decreased significantly, from an average of 51.3% before the practice to an average of about 7.25% after the practice, a decrease of about 85.87%. In contrast, the four participants in the control group decreased from an average of 55.59% before the exercise to 41.78% after the exercise, a decrease of about 24.84%. It is reasonable to speculate that compared to the four participants in the control group, the four participants in the experimental group effectively practiced by improving their fingering performance level and mastery.
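The reported decreases are relative to the pre-practice value, i.e., (before − after) / before. The short check below reproduces both figures from the averages quoted above; the function name `relative_decrease` is only illustrative.

```python
def relative_decrease(before, after):
    """Relative decrease in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0

# Experimental group: 51.3% of indicators below 60% before, 7.25% after.
experimental = relative_decrease(51.3, 7.25)

# Control group: 55.59% before, 41.78% after.
control = relative_decrease(55.59, 41.78)

print(f"experimental: {experimental:.2f}%  control: {control:.2f}%")
```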

The correct rates of the indicators that scored below 60% in the pre-test were then analyzed for both groups. Figure 4 shows the change in the average correct rate of these indicators before and after practice. As the figure shows, after the system-assisted practice the four participants in the experimental group improved markedly on these indicators, and the average correct rate of the indicators originally below 60% exceeded 70% for all four participants.

In contrast, the four participants in the control group improved less on the indicators that were below 60% in the pre-test, and these indicators remained below 60% on average after practice. Across all indicators below 60% in the pre-test, the four practitioners in the experimental group achieved an improvement of about 1-2 times the original correct rate after practice, even though those rates were very low in the pre-test. The four practitioners in the control group also had very low correct rates in the pre-test, but their correct rates increased little or not at all after practice. It is therefore reasonable to infer that the feedback provided by the system kept the four participants in the experimental group informed of the results of their playing behaviors and let them correct erroneous cognitions and muscle memories, which improved their fingering performance and mastery and made their practice effective.
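The comparison described for Figure 4 can be sketched as follows: select the indicators whose pre-test correct rate is below 60%, then average their correct rates before and after practice. This is only an illustration of the calculation; the function name and the five sample rates are hypothetical and are not data from the experiment.

```python
from statistics import mean

THRESHOLD = 60.0  # correct-rate cutoff (%) used in the text


def weak_indicator_means(pre, post, threshold=THRESHOLD):
    """Average pre/post correct rates over indicators below `threshold` in the pre-test.

    Returns None when no indicator falls below the threshold.
    """
    weak = [i for i, rate in enumerate(pre) if rate < threshold]
    if not weak:
        return None
    return mean(pre[i] for i in weak), mean(post[i] for i in weak)


# Hypothetical correct rates (%) for five indicators of one participant.
pre = [30.0, 45.0, 80.0, 20.0, 95.0]
post = [75.0, 80.0, 85.0, 70.0, 95.0]

pre_mean, post_mean = weak_indicator_means(pre, post)
print(f"weak indicators: {pre_mean:.2f}% before, {post_mean:.2f}% after")
```

Only the indicators that were weak in the pre-test enter both averages, so the comparison is not diluted by indicators the participant had already mastered.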

5.2.2

Validation of User Experience Dimensions of Guzheng Fingering Assessment Models

Table 2 shows the results of the one-sample t-test on the TAM questionnaire for the four participants in the experimental group. The results for usability and ease of use indicate that all four participants rated the model highly on both. Similarly, the satisfaction results indicate that the users were highly satisfied with the model (p<0.01). These results support the user experience dimension of the guzheng fingering assessment method proposed in this paper. In interviews, all four participants in the experimental group further indicated that the feedback from the model helped them determine whether their current fingering movements were correct, which served as re-teaching and consolidation. They explained that, because of their short exposure to the guzheng, they did not fully understand the knowledge points after the teacher's lesson and sometimes even forgot the playing requirements the teacher had taught; the model addressed this problem well. They also said that expressing the requirements as indicators made it clearer what each playing requirement asked of them, which made independent practice easier.

Table 2.

Survey analysis

Categories	Number	Average	Standard deviation	t	p
Usability	4	20.31	0.822	-12.255	0.001**
Ease of use	4	21.84	0.964	-6.804	0.008**
Satisfaction	4	14.27	0.961	-24.27	0.001**
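For reference, a one-sample t-test compares a sample mean with a fixed reference value using t = (mean - reference) / (s / sqrt(n)) with n - 1 degrees of freedom. The paper does not report the raw questionnaire scores or the reference value used, so the sketch below runs on hypothetical numbers only and does not reproduce Table 2.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(scores, reference):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are required")
    s = stdev(scores)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("scores have zero variance")
    return (mean(scores) - reference) / (s / sqrt(n)), n - 1


# Hypothetical questionnaire totals for four participants and a
# hypothetical reference value; neither is taken from the paper.
scores = [20.0, 21.0, 19.0, 22.0]
reference = 18.0

t_stat, df = one_sample_t(scores, reference)
print(f"t = {t_stat:.3f}, df = {df}")
```

With four participants the test has only 3 degrees of freedom, so the p-value depends heavily on each individual score.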

6

Conclusion

This study promotes the inheritance of guzheng music culture by integrating guzheng education with artificial intelligence technology. On this basis, the learning effect of guzheng music teaching is improved through a fingering annotation method based on a pitch difference fingering model and prior knowledge of fingering. Experiments on the guzheng fingering audio and video dataset ZF-VA-Dataset show that the model in this paper achieves high accuracy in the evaluation of each fingering. For the audio, the extracted chromatic scale pitches are correct and their frequency information meets the standard. The feedback provided by the model keeps learners informed of the results of their playing behavior, which improves their fingering performance and enhances their fingering mastery. The model in this paper therefore plays a clear role in promoting the transmission of guzheng music culture.

Jiangli Jia

Yiran Yang

Published online: 21 March 2025

Received: 17 Oct. 2024

Accepted: 02 Feb. 2025

DOI: https://doi.org/10.2478/amns-2025-0699

Keywords: Artificial intelligence, Guzheng music, Two-way network, Pitch difference-fingerings

© 2025 Jiangli Jia et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
