
Research on the Path of Traditional Music Education Inheritance Innovation and Aesthetic Experience Enhancement Supported by Information Technology

Introduction

In recent years, Chinese colleges and universities have established music education programs and gradually formed their own systems. Looking across music education in Chinese higher education, however, Western music is generally held in higher esteem, while the inheritance and teaching of China's own traditional music receives too little attention [1-2]. Traditional Chinese music in particular, as part of folk music, has typically been passed from generation to generation through master-disciple relationships, and in many respects it fits poorly with modern schooling. Realizing the inheritance of traditional Chinese music in current university teaching therefore requires comprehensive reform and development of both consciousness and teaching ideas [3-4].

As an important part of college teaching, music courses have received growing attention, and music education in universities has gradually developed into a relatively complete teaching system. It was in this very process that a large amount of Western culture and Western music theory was introduced, giving China's music education an inherent tendency toward Westernization [5-6]. Current Chinese music education pays insufficient attention to the traditional music heritage. Subjectively, Chinese colleges and universities generally lack awareness of traditional music: in organizing the teaching process there is no clear understanding of what the construction of traditional music education requires or of the instruments that must be prepared, and the funds invested fall far short. Compounded by the influence of Western musical thought, China's distinctive traditional music is not well recognized internationally [7-9]. Objectively, traditional music has usually been transmitted through one-on-one master-disciple inheritance, with its own etiquette and relationships, which modern education obviously cannot replicate; this objectively hinders its transmission. The inheritance of traditional music covers not only techniques and operating principles but also culture, and if the advantages of modern education are not consciously recognized, the goals of teaching remain far out of reach [10-11].

Music education has two unavoidable topics: aesthetic philosophy and aesthetic practice. The former is a "metaphysical" current of thought and cognitive theory; the latter is the concrete infiltration of that thought into practice [12]. Since the early twentieth century, when Cai Yuanpei proposed "aesthetic education in place of religion", Chinese aesthetic education has entered a new course of development. Modern music education has consistently adhered to the benchmark of aesthetic education, applying aesthetic theory in teaching with the aim of cultivating feeling, and striving toward ever higher goals; this has naturally been accompanied by wide-ranging discussions of the philosophy of music and aesthetic education and by varied attempts at practice. The resulting philosophical debates on music education, and the many perspectives derived from them, have already been laid out. Entering the new era, traditional Chinese music education faces still deeper challenges at the level of aesthetic education practice [13-15].

Current music education in colleges and universities shows an obvious excessive worship of Western music culture alongside a neglect of traditional Chinese music education. Passing on traditional Chinese music through teaching is a responsibility that today's university music education should consciously assume. Literature [16] describes the gradual marginalization of traditional music in today's world as Western art and music expand their influence, and argues that traditional music education is needed to advocate and publicize traditional music culture. Literature [17] draws on interview data to reveal the educational purpose of the Lanna music program, which aims to pass on and innovate traditional music so as to promote its presentation on stage. Literature [18] conducted a teaching experiment assessing the feasibility of online Chinese folk music courses; the study confirmed their feasibility and, to a certain extent, their contribution to the inheritance of traditional Chinese music culture. Literature [19] emphasized the inheritance problems faced by traditional Chinese music, discussed the importance of traditional music culture, and called for in-depth exploration of paths for its transmission. Literature [20] examines the selection mechanisms of music inheritance and dissemination, points out that music can promote group harmony and cohesion, and argues for analyzing the logic by which music serves cultural well-being. Literature [21] discusses how to innovate the connotation and spirit of local music culture on the basis of inheriting local musical traditions, and reveals the importance of music education for personal development, arguing that from birth onward music education influences the individual through language arts and literary arts. Literature [22] discusses the connotation and significance of traditional music, noting that the term has been central to the discipline of ethnomusicology as well as within its organizations. Researchers have thus affirmed the significance of traditional music culture and pondered methods for its inheritance and innovation, pointing to traditional music education as a crucial means; nevertheless, the design of specific teaching methods and the selection of teaching content still leave very large room for research.

Traditional Chinese music education has run into difficulties in its encounter with modern aesthetic education: first, the dominance of Western culture in the education system has eroded the foundation of traditional culture and traditional aesthetic education; second, the instrumental, utilitarian treatment of traditional music aesthetics makes it hard to return to its origins and exert a genuinely nurturing force, so the aesthetic-education practice of traditional music education struggles to take a substantial step forward, let alone achieve its own complete transformation. Literature [23] conceptualizes a counter-hegemonic approach to music education to resist the destructive influence of neoliberalism, aiming to preserve the artistry, creativity, and unique aesthetics of music education. Literature [24] systematically reviews the value chain of music education institutions, pointing out that such institutions are in gradual decline and that social inequalities in education are visible from the curriculum-design stage. Literature [25] investigated the association between familiarity and musical preference experimentally and found that, regardless of musical complexity, familiarity was a significant predictor of differences in preference. Literature [26] analyzed music enjoyment as an integrated perceptual and emotional aesthetic experience and found, in the neuroimaging literature, that music listening is traceable in target areas of the brain. Literature [27] explored how listeners use subjective, differently weighted criteria to assign aesthetic value to musical works; its experiments with practical interactions indicate that listeners have few appreciation criteria, low perceptual consistency, and poor awareness of their own evaluation and appreciation strategies. Literature [28] introduces the connotation and development of punk music culture, advances the hypothesis that DIY music can serve as a path of aesthetic education, and carries out detailed analysis to corroborate it. When scholars study music aesthetics, the research directions are broad, covering channels of musical aesthetic education and the logic and standards of aesthetic assessment, but the experimental work remains shallow and insufficiently persuasive.

This study develops a learning aid system for sight-singing and ear training based on an intelligent guided learning system, so as to innovate traditional music education methods. The system works on digital audio, using the Fourier transform to carry the complex original traditional-music signal from the time domain into the time-frequency domain, which supports better analysis of the information. Drawing on the harmonic peak method and the SHS algorithm, an MF-SHS pitch extraction algorithm is proposed to extract traditional music features. On this basis, the DTW algorithm is improved to speed up sequence matching and achieve intelligent scoring of traditional music sight-singing. Finally, experiments test the effect of the proposed sight-singing and ear-training learning aid system, exploring a path toward aesthetic experience in traditional music education.

Information technology-supported traditional music sight-singing education model
Music Sight-Singing Domain Knowledge Model
Development of Musical Ability in Sight Singing

Sight-singing is a highly technical subject within music teaching. Its content includes the study of basic music theory and emphasizes the application of theoretical knowledge and skills, so it also has the character of a comprehensive subject. The purpose of sight-singing teaching is to cultivate students' musical perception, understanding, appreciation, and creativity by training their sense of rhythm, pitch, and tonality and their music-reading ability, so that they master correct pitch and rhythm and can express musical emotion. This makes it easier for students to grasp musical imagery and to perceive a work's style, genre, and theme, laying a solid foundation for future music learning.

Sight-singing knowledge model

Knowledge modeling for the discipline of sight-singing is the basis for personalized instruction in the sight-singing intelligent guided learning system. The knowledge model gives a structured description of domain knowledge that reflects the relationships among knowledge items and between knowledge and teaching resources. With this representation, personalized management and pushing of teaching resources can be achieved.

Learning resource model

Learning resources are the objects that support knowledge learning, and diverse data types allow different display and interaction methods for different resources, satisfying the preferences of different learners. Following the categorization of sight-singing knowledge, sight-singing teaching resources comprise four kinds: sample resources, sight-singing skill learning resources, musical ability training resources, and teaching videos. The first three consist mainly of various music files, such as staff notation images, music-modeling audio, and MIDI files; the teaching videos are curriculum resources that explain knowledge and train skills.

Traditional music digitization techniques
Audio Digitization

Natural sound is very complex and its waveform is extremely variable; to store it in a computer, the waveform must be converted into digital form, and the method mainly used today is pulse code modulation, or PCM, coding. To store the amplitude values, which represent the strength of the sound, the sampled values must be represented by finite strings of binary digits, i.e., quantized into a finite set of amplitude values [29]. First, the full amplitude range is divided into a finite set of small intervals (the quantization step); sample values falling within the same step are grouped together and given the same quantization value. After sampling and quantization, the audio analog signal must be encoded, that is, the quantized value of each sample is represented as a binary number, completing the analog-to-digital conversion process.
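The sampling, quantization, and coding steps can be illustrated directly. The following minimal Python sketch is ours (the sampling rate, test tone, and 16-bit depth are illustrative assumptions, not values from the paper): it samples a sine wave, quantizes each sample to a 16-bit integer, and shows the binary words that would be stored.

```python
import numpy as np

# Sample a 440 Hz tone: 10 ms of "analog-like" waveform in [-1, 1].
fs = 44100                      # sampling rate (Hz), illustrative
t = np.arange(0, 0.01, 1 / fs)
x = 0.8 * np.sin(2 * np.pi * 440 * t)

# Quantization: divide the amplitude range into 2^16 steps and map every
# sample falling within a step to the same integer code.
pcm = np.round(x * 32767).astype(np.int16)

# Coding: each quantized value is stored as a binary word.
print(pcm[:5], pcm[:5].tobytes().hex())
```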

Audio data formats

With computer processing, music expresses its content in two different ways. One is symbolic data based on musical scores (Symbolic Data), mainly MIDI, Humdrum, and the like, characterized by pitch, melody, rhythm, and other information being stored in the file in symbolic form. The other is audio data based on acoustic signals (Audio Data); common formats are WAV, MP3, and so on. Simple software or hardware synthesizers can easily convert MIDI data into audio signals and play them back.

CAT-based Learning Aid System for Sight Singing and Ear Training
General framework design

Based on the above requirements analysis, this paper builds a sight-singing and ear-training learning assistance system on CAT technology; its overall structure is shown in Figure 1. After logging in successfully, the user enters the main interface, whose main functional modules are the teaching module, song library module, ear-training learning module, sight-singing learning module, and data management module. Database operations are integrated into the data management module, and the functions of lower modules are realized to support the upper modules. Administrator users can operate all functions; ordinary users mainly use the teaching module, the sight-singing and ear-training learning modules, and the information query function in the data management module, where the query chiefly concerns personal information.

Figure 1.

Overall system structure

Functional module design
Library Management Module

The library management module performs two main functions: track entry and track management. Track entry proceeds as follows: select the tracks, import each track's chart file and sound file into the system, and complete the track description information, such as name and category, in the import dialog. The track's audio features are extracted from the sound file and the corresponding files are stored under the specified path; the track information and feature information are then written to the database, completing the library. Track management allows track information to be updated and deleted; to maintain data consistency, deleting a track requires deleting the related records in several database tables.

Teaching module

The steps are as follows:

(1) Playback mode selection;

(2) MIDI file playback.

Among them, four playback modes are provided: sequential playback, random playback, selected-track playback, and playback according to historical scores. MIDI file playback proceeds as follows:

(1) Detect whether the system has a device for MIDI audio playback;

(2) Open the MIDI file into the system's memory and turn on the MIDI device, associating it with the MIDI driver;

(3) Extract all variables related to initialization from the information in the header block of the MIDI file, and at the same time prepare the MIDI device;

(4) Extract all the track data in the MIDI file, decode this data and process the MIDI messages, after which MIDI playback mode can be entered;

(5) After the MIDI file has been played, release the resources, turn off the MIDI audio output device, and close the MIDI file.
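As an illustration of steps (1)-(5), here is a hedged sketch using the third-party mido library (our choice for illustration; the paper does not name a library, and mido additionally needs a backend such as python-rtmidi). 'song.mid' is a placeholder path.

```python
import mido

if not mido.get_output_names():          # (1) detect a MIDI playback device
    raise RuntimeError("no MIDI output device found")

mid = mido.MidiFile("song.mid")          # (2)+(3) open the file; mido parses
print(mid.type, mid.ticks_per_beat)      #         the header-block variables

with mido.open_output() as port:         # (2) open device via the MIDI driver
    for msg in mid.play():               # (4) decoded track data, sent with
        port.send(msg)                   #     correct timing (playback mode)
# (5) leaving the 'with' block closes the device and releases its resources
```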

Sight Singing Learning Module

The basic process of the sight-singing practice module is as follows:

(1) The system feeds the user's sight-singing signal into the computer through an audio input device such as a microphone; the signal is stored in PCM-encoded WAV format.

(2) A preprocessing step passes the user's PCM signal through signal-processing methods such as DC offset elimination to remove interfering components.

(3) The feature extraction module invokes an audio feature extraction method on the preprocessed audio to obtain the desired pitch feature sequence.

(4) A smoothing step smooths the obtained pitch feature sequence to remove abnormal noise values.
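A minimal sketch of this pipeline follows, assuming a mono 16-bit PCM WAV file ('take.wav' is a placeholder) and substituting a crude autocorrelation pitch estimator for the MF-SHS extractor described later; frame sizes and the median smoother are illustrative choices.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import medfilt

fs, pcm = wavfile.read("take.wav")        # (1) PCM-encoded WAV input (mono)
x = pcm.astype(np.float64) / 32768.0
x -= x.mean()                             # (2) DC offset elimination

def estimate_pitch(frame, fs):
    # placeholder pitch tracker: strongest autocorrelation lag per frame
    corr = np.correlate(frame, frame, "full")[len(frame) - 1:]
    lag = np.argmax(corr[32:]) + 32       # skip implausibly small lags
    return fs / lag

N, hop = 2048, 512
pitches = np.array([estimate_pitch(x[i:i + N], fs)        # (3) pitch features
                    for i in range(0, len(x) - N, hop)])
smoothed = medfilt(pitches, kernel_size=5)                # (4) smoothing
```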

Database design

According to the requirements analysis, the system involves three main subjects: users, learning records, and the repertoire list, with learning records further divided into sight-singing records and ear-training records. The E-R diagram of these subjects is shown in Figure 2. There is an M:N relationship between users and repertoire, since a user can practice more than one piece and a piece can be practiced by more than one user. Users and learning records stand in a 1:N relationship: a user can have multiple learning records, and each learning record corresponds to one user and one piece.

Figure 2.

Database E-R diagram
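The E-R design can be sketched as a relational schema. The following SQLite snippet is an illustrative assumption (table and column names are ours): the M:N relationship between users and repertoire is carried by the learning-record tables, each row of which references one user and one piece.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user  (user_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE piece (piece_id INTEGER PRIMARY KEY, title TEXT, category TEXT);

-- one table per learning-record type, as in the E-R diagram
CREATE TABLE sight_singing_record (
    record_id    INTEGER PRIMARY KEY,
    user_id      INTEGER REFERENCES user(user_id),   -- 1:N user -> records
    piece_id     INTEGER REFERENCES piece(piece_id), -- records realize M:N
    score        REAL,
    practiced_at TEXT
);
CREATE TABLE ear_training_record (
    record_id    INTEGER PRIMARY KEY,
    user_id      INTEGER REFERENCES user(user_id),
    piece_id     INTEGER REFERENCES piece(piece_id),
    score        REAL,
    practiced_at TEXT
);
""")
```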

Intelligent scoring of traditional music sight-singing
Traditional music signal feature extraction
Time-frequency analysis of music signals
Short-time Fourier transform

The short-time Fourier transform is a linear transform used to determine the frequency and phase of a signal over time in a localized region. In practice, the STFT is computed by splitting a longer signal into equal-length shorter local segments and then computing the discrete Fourier transform on each segment separately; this splitting is called framing [30]. The DFT of a signal of length N is:
\[X(k)=\begin{cases}\sum\limits_{n=0}^{N-1}{x(n)\,e^{-j\frac{2\pi }{N}kn}}, & 0\le k\le N-1\\ 0, & \text{otherwise}\end{cases}\]

where X(k) is the data after the DFT and x(n) is the signal after framing and sampling.
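A minimal sketch of the STFT as described: frame the signal, window each frame (the Hann window is our assumption; the text does not specify one), and take the DFT of every frame.

```python
import numpy as np

def stft(x, frame_len=1024, hop=256):
    # split into equal-length frames, window, then DFT each frame
    w = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * w
              for i in range(0, len(x) - frame_len, hop)]
    return np.array([np.fft.fft(f) for f in frames])   # X(k) per frame

fs = 16000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 440 * t))
print(X.shape)   # (num_frames, frame_len) complex spectra
```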

Constant Q-transform

The constant Q transform (CQT) with dense coefficients is an invertible nonlinear transform that resembles the human auditory system. The CQT gives low-frequency components higher frequency resolution, to separate notes of similar frequency, and high-frequency components higher temporal resolution, to track rapidly changing overtones. The quality factor Q is the same at every frequency point of the CQT and can be expressed simply as the following equation.

\[Q=\frac{f_c}{f_2-f_1}\]

where Δf = f2 − f1 is the bandwidth, taken at −3 dB relative to the peak at the center frequency fc.

If x(n) is a discrete time-domain signal, its CQT is
\[X^{CQ}(k,n)=\sum\limits_{j=n-\left\lfloor \frac{N_k}{2} \right\rfloor}^{n+\left\lfloor \frac{N_k}{2} \right\rfloor}{x(j)\,a_k^{*}\left(j-n+\frac{N_k}{2}\right)}\]
\[a_k(n)=\frac{1}{N_k}w\left(\frac{n}{N_k}\right)\exp\left(-j2\pi n\frac{f_k}{f_s}\right)\]
where fk is the center frequency of the kth frequency point, B is the number of frequency points per octave, Nk is the window length at the kth point (inversely proportional to fk, with ratio Q), fs is the sampling frequency, and w(t) is a window function of length Nk. Here $a_k^{*}(n)$ is the complex conjugate of ak(n), and ak(n) is also known as the time-frequency atom. The center frequency fk is defined as:
\[f_k=f_1\,2^{\frac{k-1}{B}}\]

where f1 is the center frequency of the lowest frequency point. In practice, B is a very important parameter that usually determines the time-frequency resolution. The corresponding window length is given by:
\[N_k=\frac{f_s}{f_k\left(2^{\frac{1}{B}}-1\right)}\]

The largest possible Q is usually chosen so that the bandwidth Δfk at each frequency point is small enough to minimize frequency interference. At the same time, Q must not be arbitrarily large: too large a value makes it difficult to fully analyze the spectrum between frequency points. The following formula is therefore usually used to select Q:
\[Q=\frac{q}{\Delta w\left(2^{\frac{1}{B}}-1\right)}\]

where 0 < q ≤ 1 is a scaling factor and Δw is the −3 dB bandwidth of the main lobe of the spectrum of the window function w(t).

Obviously, directly computing $X^{CQ}(k,n)$ at every frequency point n of the input signal is very expensive, so an efficient method for computing the CQT has been proposed based on the identity
\[\sum\limits_{n=0}^{N-1}{x(n)\,a^{*}(n)}=\sum\limits_{j=0}^{N-1}{X(j)\,A^{*}(j)}\]
where X(j) is the discrete Fourier transform (DFT) of x(n) and A(j) is the DFT of a(n). This identity holds for all discrete signals x(n), a(n) by Parseval's theorem, and with it the CQT can be rewritten as
\[X^{CQ}\left(k,\frac{N}{2}\right)=\sum\limits_{j=0}^{N}{X(j)\,A_k^{*}(j)}\]

However, this computational procedure has two significant problems: first, when a large frequency range is involved, the DFT transform block becomes large and the spectral kernel is no longer sparse; second, to analyze all portions of the input signal reasonably, the CQT at the high-frequency points must analyze at least Nk/2 sample points, where Nk is the window length at the high-frequency points.
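For illustration, an efficient CQT of this kind is available in the librosa library (our choice, not the paper's toolchain); the parameter values below are illustrative. A synthetic A4 tone makes the result easy to check: with f1 = C1 and B = 12, the strongest bin should be 45 semitones above C1.

```python
import numpy as np
import librosa

sr = 22050
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)   # 1 s of A4

C = np.abs(librosa.cqt(y, sr=sr,
                       fmin=librosa.note_to_hz("C1"),  # f1, lowest bin
                       n_bins=84,                      # 7 octaves
                       bins_per_octave=12))            # B = 12

print(C.shape)            # (84, n_frames): one constant-Q bin per semitone
print(C[:, 10].argmax())  # strongest bin ~45 = semitones from C1 to A4
```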

Pitch feature extraction algorithm

The sub-harmonic summation (SHS) algorithm is used to detect the pitch of a musical signal. Without expanding on the algorithm in detail, the focus here is its confidence formula:
\[H(f)=\sum\limits_{n=1}^{N}{h_n\,P(nf)}\]

where f is a candidate fundamental frequency, selected from the 5 to 15 frequency values with the largest amplitudes in the search range, P(nf) is the amplitude of the nth harmonic component of the candidate fundamental, and the compression factor is $h_n=h^{n-1}$, where h is a preset value, usually less than 1.

The MF-SHS algorithm is a combination of the two: it adopts the candidate fundamental selection scheme of the harmonic peak method, selecting five fundamental frequencies as candidates after traversing the amplitudes in the frequency domain, and it replaces the confidence formula of the harmonic peak method with that of the SHS algorithm:
\[B(N)=\sum\limits_{i}^{M}{P(i)}\;\to\; H(f)=\sum\limits_{n=1}^{N}{h_n\,P(nf)}\]
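A minimal sketch of this idea: take the five largest spectral peaks as candidate fundamentals, then score each with the SHS confidence H(f). The compression base h = 0.84 and the harmonic count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def shs_confidence(spec, freqs, f0, n_harm=8, h=0.84):
    # H(f) = sum_n h^(n-1) * P(n*f), with P read off the magnitude spectrum
    return sum(h ** (n - 1) * np.interp(n * f0, freqs, spec)
               for n in range(1, n_harm + 1))

def mf_shs_pitch(spec, freqs):
    candidates = freqs[np.argsort(spec)[-5:]]        # 5 largest-amplitude bins
    scores = [shs_confidence(spec, freqs, f0) for f0 in candidates]
    return candidates[int(np.argmax(scores))]

# toy usage: one frame of a 220 Hz tone with decaying harmonics
fs, N = 16000, 4096
t = np.arange(N) / fs
x = sum((0.5 ** k) * np.sin(2 * np.pi * 220 * k * t) for k in range(1, 5))
spec = np.abs(np.fft.rfft(x * np.hanning(N)))
freqs = np.fft.rfftfreq(N, 1 / fs)
print(mf_shs_pitch(spec, freqs))   # ~220 Hz
```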

Traditional music feature matching
Principles of the dynamic time warping algorithm

The dynamic time warping (DTW) algorithm is based on the idea of dynamic programming (DP) and is a classical algorithm in speech recognition. It solves the problem of matching a test template against a reference template when the two pronunciations differ in length: dynamic programming reduces the complex global optimization problem to many local optimization problems by searching for an optimal path. The local distance usually adopted is the Euclidean distance, which is simple to compute and is also the most commonly used speech similarity measure [31]. Let the feature parameters of T and R be K-dimensional, with L and L′ the elements of T and R respectively; then the Euclidean distance is:
\[d[T(n),R(m)]=\frac{1}{K}\sum\limits_{r=1}^{K}{(L_r-L'_r)^2}\]

Improvements to the dynamic time warping algorithm

DTW realized using DP technology has the following main drawbacks:

(1) The system's recognition performance is overly dependent on endpoint detection; (2) the amount of dynamic-programming computation is too large; (3) it does not make full use of the temporal information of the speech signal; and (4) when computing the cumulative distance, the frames of the test template are given equal weights.

To address these shortcomings, the following improvements are adopted in this paper:

1) A DTW algorithm with relaxed endpoints is used. It is unrealistic to expect endpoint detection to locate the beginnings and endings of the two pronunciations exactly, and it is equally forced, when searching with fixed endpoints in DTW, to assume that the first frames of the two pronunciations are both word beginnings and the last frames are both word endings. This paper therefore adopts a relaxed-endpoint DTW algorithm to accommodate inaccurate endpoint detection.

2) A DTW algorithm using acoustic stimulation. The acoustic stimulation method uses the amount of change in the frequency domain as the basis for assigning frame samples. It first sums the spectral changes between equally spaced neighboring frames of the template to obtain the total spectral change of the whole template; it then analyzes the template again and selects non-uniformly spaced frame vectors on the principle of distributing the spectral change uniformly. Specifically, suppose the sequence of feature vectors of a speech template is Y1, Y2, ⋯, YN, where the nth frame vector Yn is the energy output of an L-channel band-pass filter bank: Yn = {yn1, yn2, ⋯, ynL}. The acoustic stimulation generated by the spectra of two adjacent frames is defined as:
\[\delta_n=\sum\limits_{j=1}^{L}{\left| y_{nj}-y_{(n+1)j} \right|}\quad (n=1,2,\cdots ,N-1)\]
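For reference, here is a minimal sketch of the basic DTW recursion with the Euclidean local distance d[T(n), R(m)] defined above; the relaxed-endpoint and acoustic-stimulation refinements are omitted for brevity, and the random templates are placeholders.

```python
import numpy as np

def dtw(T, R):
    # local Euclidean distances between every test and reference frame
    n, m = len(T), len(R)
    d = np.array([[np.mean((t - r) ** 2) for r in R] for t in T])

    # DP over the cumulative-distance grid
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                            D[i, j - 1],      # deletion
                                            D[i - 1, j - 1])  # match
    return D[n, m]   # accumulated distance of the optimal warping path

T = np.random.rand(40, 12)   # test template: 40 frames of K = 12 features
R = np.random.rand(50, 12)   # reference template
print(dtw(T, R))
```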

Traditional music scoring

The distance between sequences is obtained by the DTW algorithm and, normalized to [0, 1], yields the singer's score. However, the main purpose of scoring sight-singing here is to find objective proficiency indicators that measure, by comparison, the learner's mastery of the musical skills involved in sight-singing, and then to compute an overall performance score by combining the indicators; on that basis, intelligent guidance and recommendations can be given to learners. For convenience in describing the indicators, the matching relationship between each template note and the pitch sequence is written as:
\[(p_i,t_i)\to (p(k+1),p(k+2),\cdots ,p(k+m_i))\]
where pi and ti denote the pitch and the number of frames of the ith note, and p(k+1), p(k+2), …, p(k+mi) is the pitch subsequence matched to that note, a run of mi consecutive frames starting at frame k. The following three evaluation metrics are then defined:

1) Timing correctness: when the absolute difference between a note's frame count and the frame count of its matched pitch subsequence does not exceed 0.3 of the note's frame count, the note's duration is considered sung correctly:
\[da_i=\begin{cases}1, & \frac{\left| t_i-m_i \right|}{t_i}<0.3\\ 0, & \text{otherwise}\end{cases}\]

The overall timing correctness is:
\[Da=\frac{1}{n}\sum\limits_{i=1}^{n}{da_i}\]

2) Pitch correctness: compute the mean pitch pmi of the subsequence matched to the note, and judge whether the note's pitch is sung correctly according to whether it equals the note's pitch:
\[pa_i=\begin{cases}1, & p_i=pm_i\\ 0, & \text{otherwise}\end{cases}\]

The overall pitch correctness is:
\[Pa=\frac{1}{n}\sum\limits_{i=1}^{n}{pa_i}\]

3) Breath smoothness: this mainly refers to the stability of pitch while singing. The dispersion (standard deviation) of the pitch values in the subsequence matched to each note is used to judge whether the singer's breath is steady. The smoothness of a single note is:
\[va_i=1-\frac{std_i}{\max_i-\min_i}\]

where, for the ith note, stdi, maxi, and mini are the standard deviation, maximum, and minimum of the subsequence p(k+1), p(k+2), …, p(k+mi), respectively.

The overall smoothness is:
\[Va=\frac{1}{n}\sum\limits_{i=1}^{n}{va_i}\]

The final performance score combines duration correctness and pitch correctness, with breath smoothness as a weighting factor:
\[result=(Da+Pa)\times Va\]

where Da, Pa, and Va are duration correctness, pitch correctness, and breath smoothness, respectively.
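A compact sketch of the three metrics and the final score follows; it assumes notes holds template notes (p_i, t_i) and segments holds the matched pitch subsequences from the alignment step, and it tests pitch equality via the rounded mean as a stand-in for p_i = pm_i.

```python
import numpy as np

def score(notes, segments):
    da, pa, va = [], [], []
    for (p, t), seg in zip(notes, segments):
        seg = np.asarray(seg, dtype=float)
        da.append(1.0 if abs(t - len(seg)) / t < 0.3 else 0.0)   # timing
        pa.append(1.0 if round(seg.mean()) == p else 0.0)        # pitch
        rng = seg.max() - seg.min()
        va.append(1.0 - seg.std() / rng if rng > 0 else 1.0)     # smoothness
    Da, Pa, Va = np.mean(da), np.mean(pa), np.mean(va)
    return (Da + Pa) * Va                                        # final score

notes = [(60, 10), (62, 8)]               # (MIDI pitch, frame count)
segments = [[60] * 10, [62] * 7 + [63]]   # matched pitch frames per note
print(score(notes, segments))
```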

Analysis of the effects of traditional music education transmission and feedback on aesthetic experience
Traditional Music Singing Matching Results and Analysis

Singing audio is characterized by abundant glissandi, legato, volume changes, and pitch jitter, which can make the boundaries between notes very indistinct and pose a challenge for singing transcription. In this section we analyze, for different methods, the note detection errors made when students' practice recordings of traditional music are compared with the original song's notes. Note detection errors are first divided into four categories: independent multi-detection errors (spurious notes caused by background sounds), successive multi-detection errors (a real note detected as multiple notes), independent missed-detection errors (real notes neither merged nor detected), and merged missed-detection errors (multiple real notes detected as one). The error values of the four types are shown in Figure 3, where a yellow line indicates the presence of an error.

Figure 3.

Error values of the four note detection error types

From the figure it can be seen that successive multi-detections and merged missed detections are the main problems faced by all four singing transcription methods, indicating that unclear note boundaries in the spectrogram are genuinely difficult to detect. Compared with the other three methods, the auxiliary system proposed in this paper shows fewer detection errors of every type in Fig. 3(a), with only three yellow lines and an error rate of 5%, indicating strong note detection performance and generalization ability.

Subsequently, the errors in the frame-level pitch search results were tallied. A comparison of the pitch search results with the frame-level pitch annotations for a few audios randomly selected from the dataset is shown in Fig. 4, where red circles indicate incorrectly predicted pitches. We hypothesize that the onset and offset output by note detection constrain the temporal range of the frame-level pitch search while the search range on the frequency axis is fixed, so a missing part of a note, or an unvoiced frame inside the search range, may lead to a pitch prediction error.

Figure 4.

Pitch search results for amy_1_02.wav compared with the frame-level pitch annotations

The statistics of the distance between pitch-error frames and the predicted note boundary (onset or offset) within each predicted note are shown in Fig. 5; the horizontal axis is the distance of the error frame from the predicted boundary and the vertical axis is its percentage of all error frames. Most pitch prediction errors lie within 3 to 20 frames of the predicted note boundaries, which verifies the above conjecture and indicates that the accuracy of note detection affects the accuracy of pitch prediction.

Figure 5.

Distribution of pitch search errors

Because of the pitch instability of the human voice and the extra and omitted notes in sight-singing audio, sight-sung note sequences often deviate from the score's note sequences, challenging the note sequence alignment algorithm. To analyze the alignment errors produced, we classify their causes into extra-singing, omission, and pitch errors, and count the cases in which the alignment algorithm handles each kind correctly.

After counting, the sight-singing test set contained 70 extra-singing error notes, 6 omission error notes, and 2998 pitch error notes. The percentage of errors correctly aligned by each alignment algorithm is shown in Fig. 6. The alignment schemes based on relative pitch and relative duration modeling (RP&RD-SHS and RP&RD-MF-SHS) solve the alignment problems caused by pitch errors more effectively than the other schemes, and both the relative-pitch (RP) and the RP&RD approaches deal well with the alignment of extra-singing errors; since the test set contains only a few omission errors, the corresponding statistics may not be representative and are not discussed further. Comparing the correct alignment of MF-SHS and RP-SHS shows a marked drop in MF-SHS's ability to align extra-singing errors, verifying that adding relative duration information between notes leads the algorithm to more ineffective forced alignments. Meanwhile, comparing RP&RD-MF-SHS with RP-MF-SHS, the former improves on both extra-singing errors and pitch errors, indicating that the MF-SHS algorithm can exploit inter-note relative duration information efficiently and obtain better alignment performance in the note sequence alignment task. Taken together, the MF-SHS note feature extraction algorithm proposed in this paper performs best.

Figure 6.

Percentage of errors correctly aligned by each alignment algorithm

Music scoring experiment
Numbered notation (jianpu) recognition

The experimental environment for sheet music recognition was Visual Studio 2010 + OpenCV 2.4.8 on Windows. As this research concerns music teaching, the experiments tested four songs from Chinese music textbooks: "Farewell", "School Song", "Jingle Bells", and "City in the Sky", with the image resolution of the four songs increasing in that order. Their MF-SHS recognition results are shown in Table 1.

Table 1. Comparison of numbered notation recognition results (each cell lists the paired counts reported in the source, actual/recognized)

Item | Farewell | School Song | Jingle Bells | City in the Sky
Time consumed (ms) | 555 | 989 | 2891 | 9931
Resolution | 746×621 | 976×582 | 1485×1891 | 2356×2971
Consonant | 14 / 14 | 3 / 2 | 12 / 11 | 11 / 11
Bar line | 18 / 18 | 8 / 7 | 19 / 18 | 52 / 53
Digital note | 68 / 68 | 28 / 27 | 103 / 102 | 187 / 177
Treble point | 10 / 10 | 5 / 5 | 3 / 2 | 83 / 84
Bass point | 4 / 4 | 3 / 2 | 24 / 22 | 2 / 3
Underline | 30 / 29 | 23 / 22 | 97 / 88 | 69 / 70
Short line | 12 / 12 | 6 / 5 | 7 / 6 | 39 / 40
Attached point | 6 / 6 | 3 / 2 | 13 / 10 | 17 / 18
Total | 162 / 161 | 79 / 72 | 278 / 259 | 460 / 456
Accuracy rate | 99.51% | 98.88% | 96.21% | 97.74%

In terms of recognition time, the higher the image resolution, the longer the recognition takes; after reaching 4K resolution the time exceeds 9 seconds. Yet there is no necessary connection between image resolution and recognition accuracy: what matters most is that the musical symbols in the target image be sharp. From these results, an image of about 300 pixels is the most reasonable choice.

System scoring performance tests

To test the performance of the CAT-based learning assistance system, this section selects traditional Chinese pieces: "Flowing Water in the Empty Valley", "Jasmine Flower", "Wild Geese Descending on the Sandbank", "Ambush on Ten Fronts", "Moonlit Night on the Spring River", "White Snow in Spring", "Guangling San", and "Three Variations on the Plum Blossom". The system's scoring performance test has three parts: a note-level detection test, a bar scoring test, and a whole-song scoring test.

Note Level Detection Test

A common criterion for evaluating the correctness of multi-pitch detection is the note-level F-measure; the higher the F-measure, the more correct the detection. Let nCorr denote the total number of correctly detected notes, nRef the total number of notes in the standard audio, and nTot the total number of detected notes; then recall = nCorr/nRef, precision = nCorr/nTot, and the F-measure is their harmonic mean, 2 × precision × recall / (precision + recall). The note-level evaluation results of the test samples are shown in Table 2.

Table 2. Note-level evaluation results of the test samples

Music nCorr nRef nTot Recall Precision F-measure
“Empty water” 119 124 160 0.96073 0.75682 0.84516
“Jasmine flower” 89 96 96 0.93235 0.93117 0.93007
“Wild goose” 91 98 96 0.93355 0.94986 0.93993
“The ten faces ambush” 106 110 115 0.96471 0.92516 0.94286
“Spring river flower moon” 134 136 148 0.98417 0.90854 0.9432
“The white snow” 73 76 76 0.96329 0.96211 0.96101
“Guangling” 169 180 175 0.94018 0.96433 0.9504
“Triad” 146 156 147 0.93789 0.99026 0.96164
In total 927 976 1013 0.95211 0.92535 0.93428

From the test results it can be seen that, except for "Flowing Water in the Empty Valley", the F-measure values of the other seven pieces are all above 0.93, and the overall F-measure across the eight pieces reaches 0.93428; according to the related literature, comparable systems achieve F-measure values of around 0.88, so these note-level detection results are good.
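As a worked check of the definitions (not of the table itself: its totals row appears to average the per-piece values rather than pool the counts), the metrics for the pooled totals can be recomputed as follows.

```python
# recall = nCorr/nRef, precision = nCorr/nTot, F = harmonic mean of the two
def note_level_metrics(n_corr, n_ref, n_tot):
    recall = n_corr / n_ref
    precision = n_corr / n_tot
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# pooled totals from Table 2; the printed totals row differs slightly
print(note_level_metrics(927, 976, 1013))
```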

Bar Scoring Test

The bar scoring test recorded, over 100 samples of the 8 pieces, the actual number of bars whose correct-playing rate was 80%-90%, 70%-80%, 60%-70%, or below 60%, together with the number correctly detected and the number over-detected; the results are shown in Table 3. The detection rate (DR) is the number correctly detected divided by the actual number. Abbreviations: S1 "Flowing Water in the Empty Valley", S2 "Jasmine Flower", S3 "Wild Geese Descending on the Sandbank", S4 "Ambush on Ten Fronts", S5 "Moonlit Night on the Spring River", S6 "White Snow in Spring", S7 "Guangling San", S8 "Three Variations on the Plum Blossom"; Al = actual, CT = correctly detected, MT = multi-detected.

Table 3. Bar scoring test results (each bucket: Al / CT / MT)

Music | 80%-90% | 70%-80% | 60%-70% | Below 60% | DR
S1 | 0 / 0 / 24 | 0 / 0 / 1 | 0 / 0 / 9 | 33 / 31 / 0 | 97.89%
S2 | 0 / 0 / 34 | 0 / 0 / 3 | 0 / 0 / 0 | 37 / 36 / 0 | 96.89%
S3 | 0 / 0 / 12 | 0 / 0 / 2 | 0 / 0 / 7 | 37 / 36 / 0 | 95.33%
S4 | 0 / 0 / 3 | 0 / 0 / 1 | 0 / 0 / 1 | 27 / 26 / 2 | 95.33%
S5 | 0 / 0 / 7 | 0 / 0 / 5 | 0 / 0 / 0 | 23 / 22 / 4 | 95.44%
S6 | 6 / 6 / 51 | 10 / 10 / 11 | 9 / 7 / 8 | 6 / 6 / 5 | 96.89%
S7 | 0 / 0 / 6 | 0 / 0 / 6 | 0 / 0 / 0 | 21 / 21 / 0 | 96.01%
S8 | 0 / 0 / 24 | 0 / 0 / 1 | 0 / 0 / 9 | 33 / 31 / 0 | 97.89%
Total | 6 / 6 / 168 | 10 / 10 / 30 | 9 / 7 / 25 | 220 / 214 / 12 | 96.46%

From the table, the detection rate of each piece is above 95%, and the average detection rate across the 8 pieces is 96.46%, a good bar detection result. There are also many multi-detections, and they are concentrated almost entirely in bars with an 80%-90% correct-playing rate: note detection accuracy is not 100%, and when one note in a correctly played bar is detected wrongly, the bar's correct-playing rate may drop into the 80%-90% range. At the same time, the number of multi-detections for bars played with less than 60% accuracy is very low, because it is very unlikely that many notes in the same bar are all detected incorrectly.

Whole song scoring test

The whole-song scoring test compares the overall detection accuracy of the 100 sample audios with the overall actual accuracy, computes the difference, multiplies it by 100, and rounds; the results are shown in Table 4. The mean scoring deviation of the pieces is generally within 6, and the average deviation over the 100 scored samples is 5.71, a good whole-song scoring result.

Table 4. Whole-song scoring test results (deviation per sample)

Music | Samples 1-8 | Total deviation | Mean deviation
S1 | 6, 6, 4, 6, 4, 2, 7, 7 | 28 | 27
S2 | 9, 5, 8, 6, 9, 5, 7, 10 | 49 | 6.7
S3 | 9, 6, 6, 4, 5, 4, 4, 3 | 39 | 5.7
S4 | 6, 3, 6, 5, 6, 2, 5, 6 | 30 | 4.8
S5 | 4, 4, 4, 3, 4, 3, 3, 5 | 18 | 3.6
S6 | 6, 5, 7, 6, 4, 5, 5, 7 | 37 | 5.5
S7 | 8, 8, 7, 8, 9, 8, 8, 9 | 59 | 7.7
S8 | 8, 9, 8, 4, 6, 4, 6, 4 | 42 | 6
Total | 56, 46, 50, 42, 47, 33, 45, 51 | 302 | 5.71
Traditional Music Aesthetic Experience Path
Enriching teaching content and expanding students’ horizons

An important role of information technology in the classroom is to make an originally monotonous, boring classroom more diversified and rich, expand teaching capacity, deepen classroom connotation, and continually widen students' horizons, laying a good foundation for raising their level of music appreciation. To realize this goal, teachers should attend to several levels in the teaching process. First, they should break with the single-mode method of relying purely on recordings and develop teaching modes better adapted to students' physical and mental characteristics, the laws of music teaching, and the features of the times, so as to improve students' comprehensive musical quality. Second, with multimedia teaching tools, teachers can guide students to understand the characteristics of music through graphics, text, sound, and image, improving their knowledge of music; through information technology, students can learn about more musical and cultural works without leaving the classroom, which greatly facilitates both learning and teaching. Third, teachers should integrate information technology with scenes from daily life, making the complex and tedious process of music teaching simpler and more interesting, so that students deepen their understanding of music in everyday settings and their enthusiasm for learning is further mobilized.

Complementing singing instruction to improve musical expression

Teachers should make good use of the advantages of information technology in music teaching, focusing on improving students' singing ability and continually raising their aesthetic ability through improved musical expression. In practice, teachers can start at two levels. On the one hand, let multimedia play the role of a model. A love of singing and dancing is natural to children, and many students are interested in music, but problems in the teaching process leave them without objects to imitate and learn from, so they gradually lose enthusiasm and interest. By introducing information technology, teachers can provide examples, models, and clear goals for learning to sing; with its help, students gain a fresh knowledge and understanding of a song's melody, rhythm, and other elements. On the other hand, exploit students' plasticity and let information technology promote a deep, diversified understanding of music. Singing should not focus only on method but should be combined with other forms of performance. This requires teachers to be good at using information technology to improve students' performance ability, introducing the background of a work's creation, the author's profile, medleys of songs in a similar style, and so on, so that students understand the full process of a song's creation and singing, deepen their grasp of musical skills, and improve their own musical expression.

Conclusion

This study analyzes both the individual optimization modules of the proposed traditional music knowledge model and the overall experimental results. The CAT-based sight-singing learning assistance system performs traditional music note matching detection with an error rate of only 5%, indicating strong note detection performance and generalization ability and offering real help to students in singing practice. Four tunes from music textbooks were selected, and the MF-SHS feature extraction method was used to recognize the notation features of each tune. To test the system's effectiveness in practice, experiments covered a note-level detection test, a bar scoring test, and a whole-song scoring test; the overall F-measure across the 8 pieces reached 0.93428, showing good note detection.
