A study of speech recognition technology as an aid to timbre accuracy and pitch control in vocal music teaching

Introduction

As a form of musical expression, vocal music has always carried human emotion and cultural meaning, and it is an indispensable part of the art world. Vocal music teaching is an important way to cultivate vocal artists and music lovers, and the teaching methods used bear directly on learning outcomes and on students’ musical literacy [1-2]. Vocal music is a highly technical discipline: students must master a range of basic skills such as pitch control, timbre adjustment, and breathing technique, and must keep refining them until they can vocalize accurately while maintaining a stable pitch and tone quality [3-4].

Computer technology is now widely adopted in music education worldwide, and its share of applications in vocal music teaching keeps growing. By combining deep learning and natural language processing, speech recognition in vocal music teaching can better understand students’ needs and problems and thus provide more personalized instruction, and through intelligent analysis of a singer’s performance data it can offer more accurate and effective training suggestions [5-8]. Specifically, speech recognition technology can accurately recognize and distinguish different voice types and characteristics, providing precise voice training and guidance for singers [9-10]. It can also analyze and learn from large collections of music recordings, imitating different styles of vocal sound and supplying singers with richer and more diverse training resources [11-12]. In addition, it can comprehensively analyze and evaluate a singer’s performance from timbre, pitch, and other measured data, helping singers understand their shortcomings and improve [13-15]. As speech recognition technology becomes more intelligent, personalized, and efficient, combining it with vocal music teaching to assist students’ pitch training opens up broader development space and opportunities for the field.

Pérez-Gil, M. et al. introduced the role of the Cantus application in vocal music teaching: it supports real-time analysis of audio streams, making it a suitable training tool with which teachers can customize courses by adjusting melodic patterns and the syllabic system while also providing evaluation functions [16]. Wang, X. et al. used speech recognition technology to extract frequency, amplitude, and other objective acoustic parameters of the singing voice and applied a neural network to evaluate singing quality objectively; comparison with teachers’ subjective evaluations showed that the combined method assesses singing quality accurately and objectively, which benefits the cultivation and selection of musical talent [17]. Liu, S. studied voice teaching technology based on speech recognition, combining MFCC parameters with speech-frame energy to form feature parameters and improving the DTW algorithm as a vocal evaluation algorithm, finally realizing real-time modification and playback of sound waveforms and pitches on the computer [18]. Huang, T. et al. used MFCC coefficients and the fundamental period as features to describe differences in vocalization and used the LAS model to extract high-level features of speech signals, exploring physiologically sound vocalization methods and continuously identifying and correcting students’ vocalizations [19]. He, Z. proposed a music score recognition method combining transfer learning and convolutional neural networks: after timbre processing and vocal feature extraction, the weight parameters of the trained convolutional model were transferred to voice recognition, assisting vocal teaching in respiration, exhalation, articulation, and register training [20]. Nan, C. designed an intelligent vocal training system that converts audio into a spectrogram, processes it with the wavelet transform to represent its information in the time and frequency domains, and obtains a more compact audio representation through a dimensionality-reduction algorithm to detect vocal quality [21]. Huang, C. compared different speech recognition techniques for pharyngeal training in vocal music teaching; the results show that the audio extraction method effectively captures the features of pharyngeal training, significantly shortens recognition time, and helps raise students’ vocal skill level [22].

To address background-noise interference in speech recognition, wavelet decomposition is used to reconstruct the signal from its low-frequency coefficients, and higher-order cumulants are then combined with this reconstruction to detect and analyze the sound signal. The improved fundamental period extraction algorithm finds the singer’s vocal frequency and compares it with the specified pitch to evaluate pitch accuracy. Relevant feature vectors are extracted sequentially from the time series of voice signal samples, and a convolutional neural network operating on one-dimensional vectors is proposed and constructed, avoiding the cumbersome conversion of one-dimensional feature vectors demanded by two-dimensional models and realizing automatic discrimination of vocal timbre. The proposed algorithm is used to discriminate timbre and pitch, its evaluations are compared with those of professionals to test its effectiveness, and teaching experiments are designed to compare its practical effect in the classroom.

Speech recognition and vocal evaluation method based on one-dimensional convolutional neural network
Improvement of fundamental period extraction methods
Algorithm for fundamental period detection based on an improved wavelet transform

Generally speaking, in a clean environment the energy of the nonlinear, non-stationary sound signal is higher than the noise energy, but in real environments the signal is often accompanied by disturbing background noise. The resulting low signal-to-noise ratio substantially increases the difficulty of fundamental period detection; traditional detection methods cannot handle it effectively, and some special-purpose methods only solve fundamental detection in one specific noise environment. In addition, feature-filtering methods increase the computational load and alter the spectral structure of the artistic voice, losing information, so weighting different feature combinations cannot, at acceptable cost and robustness, solve fundamental detection in noisy environments.

In conventional signal processing it is usually assumed that the noise or the sound signal follows a Gaussian distribution, which greatly facilitates mathematical analysis [23]. In particular, higher-order cumulants share one notable property: every cumulant of a Gaussian signal of order three and above is zero, and this peculiarity is an excellent basis for noise suppression. Therefore, this paper uses wavelet decomposition to reconstruct the signal from its low-frequency coefficients and then detects and analyzes the reconstructed sound signal by means of higher-order cumulants.

In sound signal processing, the cumulant is a very practical statistic. The second-, third-, and fourth-order cumulants are described mathematically as follows. Assume a stationary random artistic voice signal $\{x(n)\}$, $n = 0, \pm 1, \ldots, \pm\infty$, with zero mean; its second-order cumulant is given by equation (1):

$$C_{2,x}(\tau) = E\{x(n)\,x^*(n+\tau)\} \tag{1}$$

The third-order cumulant is given by equation (2):

$$C_{3,x}(\tau_1,\tau_2) = E\{x(n)\,x(n+\tau_1)\,x^*(n+\tau_2)\} \tag{2}$$

The fourth-order cumulant is given by equation (3):

$$\begin{aligned} C_{4,x}(\tau_1,\tau_2,\tau_3) = {} & E\{x(n)x(n+\tau_1)x^*(n+\tau_2)x^*(n+\tau_3)\} \\ & - E\{x(n)x(n+\tau_1)\}\,E\{x^*(n+\tau_2)x^*(n+\tau_3)\} \\ & - E\{x(n)x^*(n+\tau_2)\}\,E\{x(n+\tau_1)x^*(n+\tau_3)\} \\ & - E\{x(n)x^*(n+\tau_3)\}\,E\{x(n+\tau_1)x^*(n+\tau_2)\} \end{aligned} \tag{3}$$

From the above, it follows that the third- and fourth-order cumulants of a stationary Gaussian random signal are all zero, in line with the theory that every cumulant of order three and above of any Gaussian sound signal vanishes.
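This vanishing property is easy to check numerically. The following sketch (not from the paper; `cum4` follows equation (3) for a real, zero-mean signal) estimates sample cumulants of white Gaussian noise; the fourth-order estimate hovers near zero while the variance survives.

```python
import numpy as np

def cum2(x, tau):
    """Sample estimate of C_{2,x}(tau) = E{x(n) x(n+tau)} for a real signal."""
    t = abs(tau)
    n = len(x) - t
    return np.mean(x[:n] * x[t:t + n])

def cum4(x, t1, t2, t3):
    """Sample estimate of the fourth-order cumulant of a real, zero-mean
    signal, following equation (3): the fourth moment minus the three
    pairwise second-moment products."""
    m = len(x) - max(t1, t2, t3)
    m4 = np.mean(x[:m] * x[t1:t1 + m] * x[t2:t2 + m] * x[t3:t3 + m])
    return (m4
            - cum2(x, t1) * cum2(x, t3 - t2)
            - cum2(x, t2) * cum2(x, t3 - t1)
            - cum2(x, t3) * cum2(x, t2 - t1))

rng = np.random.default_rng(0)
g = rng.standard_normal(200_000)   # zero-mean white Gaussian noise
print(cum2(g, 0))                  # ~1.0: the second-order cumulant survives
print(cum4(g, 1, 2, 3))            # ~0.0: orders three and above vanish for Gaussians
```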

If the clean sound signal is denoted $s(t)$ and the Gaussian noise $n(t)$, the noisy sound signal $x(t)$ is given by equation (4):

$$x(t) = a\,s(t) + b\,n(t) \tag{4}$$

where $a$ and $b$ are gain coefficients. Traditional fundamental detection methods lose much of their accuracy at low signal-to-noise ratios, so a new method is needed that performs well in that regime. From the higher-order statistics introduced above, not only can the fourth-order statistics of a stationary random artistic voice be calculated, but, by the property that all cumulants of Gaussian signals of order three and above are zero, so can the fourth-order statistics of the noisy signal $x(t)$. The method proposed here uses wavelet decomposition to reconstruct the low-frequency signal and computes its fourth-order statistic to detect the fundamental period of the reconstructed signal. This improves detection accuracy while keeping the amount of computation under control, and during detection the spectral structure of the sound signal remains unaltered, so the acoustic information is preserved relatively intact [24].

First, a three-level center clipper is applied to the noisy sound signal to attenuate the influence of high-frequency noise and formant peaks; its output function is given by equation (5):

$$y(t) = C[x(t)] = \begin{cases} 1, & x(t) > C_L \\ 0, & |x(t)| \le C_L \\ -1, & x(t) < -C_L \end{cases} \tag{5}$$
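As a concrete illustration, equation (5) maps directly to a few lines of code. The choice of the clipping level $C_L$ as a fixed fraction of the frame’s peak amplitude is an assumption for this sketch; the text does not specify how $C_L$ is set.

```python
import numpy as np

def center_clip(frame, ratio=0.65):
    """Three-level center clipper of equation (5): +1 above C_L, -1 below
    -C_L, 0 in between. `ratio` (an assumed value) sets C_L per frame."""
    c_l = ratio * np.max(np.abs(frame))
    y = np.zeros(len(frame))
    y[frame > c_l] = 1.0
    y[frame < -c_l] = -1.0
    return y
```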

With the output signal of equation (4) denoted $x(t)$ and the fourth-order statistic given by equation (3), the pitch-detection autocorrelation function derived from the fourth-order statistic is given by equation (6):

$$R(\tau_3) = \left[ \frac{\sum_{i=0}^{N-1} C_{4,x}(i)\, C_{4,x}(i+\tau_3)}{\sum_{i=0}^{N-1} C_{4,x}^{2}(i+\tau_3)} \right]^2 \tag{6}$$

In equation (6), $N$ is the window length. The peak positions of $R$ are located and compared against a threshold, and the fundamental period to be detected is obtained from the time span between two neighboring peaks.
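A minimal sketch of equation (6) and the peak reading: correlate the fourth-order cumulant sequence with a lagged copy of itself, normalize, square, and take the lag of the dominant peak as the period. Picking the single strongest peak (rather than thresholding several peaks as described above) is a simplification.

```python
import numpy as np

def pitch_period(c4, fs, fmin=60.0, fmax=500.0):
    """Estimate the fundamental period from a fourth-order cumulant
    sequence `c4` of one frame, per equation (6). `fs` is the sample rate;
    the admissible lag range corresponds to [fmin, fmax] Hz."""
    n = len(c4)
    lo, hi = int(fs / fmax), min(int(fs / fmin), n - 1)
    r = np.zeros(hi + 1)
    for tau in range(lo, hi + 1):
        num = np.sum(c4[:n - tau] * c4[tau:])      # numerator of R(tau)
        den = np.sum(c4[tau:] ** 2)                # denominator of R(tau)
        r[tau] = (num / den) ** 2 if den > 0 else 0.0
    tau_star = lo + int(np.argmax(r[lo:]))         # lag of the dominant peak
    return tau_star / fs                           # fundamental period, seconds
```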

Given a noisy sound signal x(t), the fundamental period detection process for x(t) is as follows:

1. Preprocess each frame of the noisy sound signal to obtain the voiced segments and their locations.

2. Adjust the signal-to-noise ratio parameter, eliminate the DC component of the sound signal, add noise to the original signal, apply the DWT using the acquired voiced-segment information, and reconstruct the signal from the low-frequency band coefficients.

3. Let Cum(·) denote the cumulant operation. If $\lambda_i$ $(i = 1, 2, \ldots, k)$ are constants and $x_i$ $(i = 1, 2, \ldots, k)$ are random variables, then equation (7) holds:

$$\mathrm{Cum}(\lambda_1 x_1, \lambda_2 x_2, \ldots, \lambda_k x_k) = \left( \prod_{i=1}^{k} \lambda_i \right) \mathrm{Cum}(x_1, x_2, \ldots, x_k) \tag{7}$$

Since the sound signal $s(t)$ is independent of and uncorrelated with the Gaussian noise $n(t)$, the fourth-order statistic of the noisy signal $x(t)$ follows from equation (7) as equation (8):

$$\begin{aligned} \mathrm{Cum}[x(k+\xi_1)\,x(k+\xi_2)\,x(k+\xi_3)\,x(k+\xi_4)] = {} & a^4\,\mathrm{Cum}[s(k+\xi_1)\,s(k+\xi_2)\,s(k+\xi_3)\,s(k+\xi_4)] \\ & + b^4\,\mathrm{Cum}[n(k+\xi_1)\,n(k+\xi_2)\,n(k+\xi_3)\,n(k+\xi_4)] \end{aligned} \tag{8}$$

Because $n(t)$ is Gaussian, its fourth-order cumulant is zero, so the noise term vanishes and only the signal term remains.

4. Detect the fundamental period in the preselected speech segments, compare the predicted fundamental period plot against the ground-truth plot to calculate the accuracy of the proposed algorithm, and then assess its stability and robustness through several groups of detection experiments under different signal-to-noise ratio conditions. A minimal code sketch of the whole pipeline follows.
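The sketch below strings the steps together, assuming PyWavelets (`pywt`) for the DWT and reusing `center_clip`, `cum4`, and `pitch_period` from the earlier sketches. Taking the diagonal slice $C_{4,x}(\tau,\tau,\tau)$ as the cumulant sequence is an assumption; the text does not pin down which slice is used.

```python
import numpy as np
import pywt

def lowband_reconstruct(frame, wavelet="db4", level=3):
    """Keep only the approximation (low-frequency) DWT coefficients and
    reconstruct; detail bands are zeroed out."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(frame)]

def detect_frame_f0(frame, fs):
    """Fundamental frequency of one voiced frame, per steps 1-4 above."""
    frame = frame - np.mean(frame)                 # remove the DC component
    low = lowband_reconstruct(frame)               # step 2
    low = center_clip(low)                         # attenuate formant peaks
    c4 = np.array([cum4(low, t, t, t)              # diagonal-slice sequence
                   for t in range(len(low) // 2)])
    return 1.0 / pitch_period(c4, fs)              # step 4: F0 in Hz
```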

Pitch vocalization evaluation system

An accurate fundamental frequency is the basic requirement for singing in tune; if a given tone does not even meet the fundamental-frequency requirement, the vocalist’s artistic voice can be judged unqualified. When the vocal folds vibrate they produce a series of overtones (the first overtone, the second overtone, and so on); since the overtones are in general integer multiples of the fundamental frequency, they are anchored to it. International notation is further divided into male and female vocal notation: each note or key has a corresponding international clef that specifies its frequency, but the correspondences differ between the male and female clefs. For example, C1 in the male clef corresponds to C0 in international notation, while C0 in international notation corresponds to C2 in the female clef [25].

On the piano, each type of singer has a fixed fundamental-frequency region when vocalizing: tenor, baritone, and bass correspond to different frequency ranges in the male vocal score, while soprano, mezzo-soprano, and alto correspond to different ranges in the female vocal score, so a singer’s register can be judged from the frequency of his or her vocalization.

With the improved fundamental period extraction algorithm, the corresponding frequency can be obtained, and from that frequency value it can be judged whether the produced pitch matches the specified pitch. The fundamental frequency extraction proceeds as follows: first, each frame of the noisy signal is preprocessed to extract the speech-segment information and eliminate the DC component; then, after windowing and framing, endpoint detection is performed, followed by filtering and frame division; next, wavelet decomposition is applied and the signal is reconstructed from the low-frequency coefficients; finally, the fundamental frequency is obtained by applying fourth-order cumulant pitch detection to the reconstructed signal. Whether the tested voice is in tune can then be judged by comparing the extracted frequency with the corresponding note in international notation.
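For the final comparison step, a simple equal-tempered lookup suffices to score the extracted frequency against the specified pitch (A4 = 440 Hz is an assumed reference; in practice the paper’s male/female international-notation tables would supply the targets). The frequencies below are the trajectory means reported in the experiment later in the paper.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(f0_hz, a4=440.0):
    """Name of the equal-tempered note closest to f0_hz
    (A4 sits 57 semitones above C0)."""
    n = round(12 * math.log2(f0_hz / a4)) + 57
    return f"{NOTE_NAMES[n % 12]}{n // 12}"

def cents_off(f0_hz, target_hz):
    """Deviation from the target pitch in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f0_hz / target_hz)

print(nearest_note(153.48))            # D#3
print(cents_off(153.48, 155.78))       # about -26 cents: slightly flat
```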

Speech recognition algorithm based on one-dimensional convolutional neural network
One-dimensional convolutional neural networks

Convolutional neural networks are widely used in the image field, where the network input is mainly a two-dimensional matrix. With the development of artificial intelligence they have also been applied to speech recognition. Speech, however, is a one-dimensional signal; the traditional practice is still to use a two-dimensional model, intercepting frames of equal length, extracting feature parameters, and stacking them with the number of frames as one axis and the per-frame feature dimension as the other, forming an image-like representation of the speech signal, as shown in Figure 1 [26].

Figure 1. The structure of a one-dimensional convolutional neural network

To retain the complete one-dimensional features of the original signal, this paper proposes and constructs a convolutional neural network operating on one-dimensional vectors, built on the principle of extracting the relevant feature vectors sequentially from the original time series. Its input is a one-dimensional feature vector, so its internal convolution kernels and feature maps are all one-dimensional, and the value at each output position is given by equation (9):

$$V_{ij}^{y} = \mathrm{sigmoid}\left( b_{ij} + \sum_{m} \sum_{l=0}^{L_i - 1} w_{ijm}^{l}\, V_{(i-1)m}^{y+l} \right) \tag{9}$$

In equation (9), $V_{ij}^{y}$ is the value at the $y$th position on the $j$th feature map of layer $i$, $\mathrm{sigmoid}(\cdot)$ is the activation function, $b_{ij}$ is the bias of this feature map, $m$ indexes the feature maps of layer $(i-1)$ connected to this feature map, $w_{ijm}^{l}$ is the value at the $l$th position of the convolution kernel connecting feature map $m$ of layer $(i-1)$ to the $j$th feature map of layer $i$, and $L_i$ is the length of the convolution kernels in layer $i$.
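Equation (9) translates directly into a deliberately loop-based, unoptimized NumPy routine: each output position is a bias plus a double sum over the connected input maps and kernel positions, passed through a sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_layer(v_prev, w, b):
    """One layer of equation (9).
    v_prev: (M, T) feature maps of layer i-1
    w:      (J, M, L) one-dimensional kernels
    b:      (J,) biases
    returns (J, T - L + 1) feature maps of layer i (valid convolution)."""
    j_maps, m_maps, L = w.shape
    t_out = v_prev.shape[1] - L + 1
    out = np.zeros((j_maps, t_out))
    for j in range(j_maps):
        for y in range(t_out):
            acc = b[j]
            for m in range(m_maps):
                for l in range(L):
                    acc += w[j, m, l] * v_prev[m, y + l]
            out[j, y] = sigmoid(acc)
    return out
```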

Speech input and evaluation analysis

In a convolutional neural network the core of the whole network is the computation in the convolutional layers. In the conventional setup the kernel computation takes place across frequency bands, the purpose being to extract local features per band with a certain robustness to noise in different bands. Because the speech signal has two defining characteristics, one-dimensional structure and time variation, the input samples here are arranged as one-dimensional vectors, the feature parameters of the speech signal are extracted in sequential order, and the convolutional layers then extract short-time characteristics along the time axis. This preserves and analyzes the one-dimensional feature vectors of the time series, paving the way for modeling the time-varying nature of speech. The one-dimensional model chosen in this paper avoids the cumbersome handling of one-dimensional feature vectors inside a two-dimensional model, and by sliding the convolution kernels along the time series it keeps the correlation and integrity of the frequency bands intact, improving the recognition performance of the whole network.
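A compact model of this kind can be sketched with PyTorch; the layer widths, kernel sizes, and number of timbre classes below are illustrative assumptions, not the paper’s configuration. `Conv1d` kernels slide along the time axis only, so the one-dimensional structure of the feature sequence is preserved end to end.

```python
import torch
import torch.nn as nn

class Timbre1DCNN(nn.Module):
    def __init__(self, n_features=1, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 16, kernel_size=9), nn.Sigmoid(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=9), nn.Sigmoid(),
            nn.AdaptiveAvgPool1d(1),           # collapse the time axis
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, n_features, T)
        return self.fc(self.net(x).squeeze(-1))

logits = Timbre1DCNN()(torch.randn(8, 1, 1024))   # -> shape (8, 4)
```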

The automatic timbre discrimination process proposed in this paper comprises four parts: preprocessing, feature extraction, feature selection, and classification; its flowchart is shown in Figure 2, in which feature extraction is the key step. Research has shown that timbre is a multidimensional perceptual attribute affected by both the time-domain and frequency-domain structure of the sound signal, and that multiple signal features are closely related to timbre perception. This paper therefore extracts features that reflect timbre attributes, including both time-domain and frequency-domain features; an illustrative feature computation is sketched after Figure 2.

Figure 2. Flow chart of recognition
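The text does not enumerate the extracted features, so the following is only an illustrative set of time-domain (RMS energy, zero-crossing rate) and frequency-domain (spectral centroid, roll-off) descriptors of the kind known to correlate with timbre perception.

```python
import numpy as np

def timbre_features(frame, fs):
    """Illustrative per-frame timbre descriptors (assumed feature set)."""
    frame = frame - np.mean(frame)
    rms = np.sqrt(np.mean(frame ** 2))                  # loudness proxy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero-crossing rate
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)  # brightness
    cum = np.cumsum(spec)
    rolloff = freqs[np.searchsorted(cum, 0.85 * cum[-1])]     # 85% roll-off
    return np.array([rms, zcr, centroid, rolloff])
```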

Analysis of the effects of vocal evaluation methods
Evaluation of timbre accuracy
Evaluation criteria

MCD Evaluation

The objective evaluation of the algorithm measures the difference in spectral characteristics between the actual speech and the target speech. Before evaluation, a frame aligner must establish a frame-level mapping between the two so that they have the same length. The commonly used objective metric is the Mel Cepstral Distortion (MCD), defined in equation (10):

$$\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{24} \left( m_{k,i}^{t} - m_{k,i}^{c} \right)^2 } \tag{10}$$

MCD measures the difference between the Mel cepstra (MCEP) of two audio signals, where $i$ indexes the Mel cepstral coefficient dimensions for frame $k$, and $m^t$ and $m^c$ denote the MCEP feature vectors of the target speech and the converted speech, respectively. A lower MCD value indicates better timbre reproduction accuracy, and a higher value worse. However, MCD values do not always correlate with human perception, mainly because they measure the distortion between test and target speech in terms of acoustic features rather than the actual speech signal heard by humans. A subjective evaluation is therefore also required.
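Equation (10) maps to a short routine over frame-aligned MCEP matrices (frame alignment, e.g. by DTW, is assumed to have been done already). Excluding dimension 0 (the energy term) so that dimensions 1-24 enter the sum matches the $i = 1 \ldots 24$ range of the formula.

```python
import numpy as np

def mcd_db(mcep_target, mcep_test):
    """Mean MCD in dB per equation (10).
    mcep_*: (n_frames, 25) frame-aligned MCEP matrices, dim 0 = energy."""
    diff = mcep_target[:, 1:25] - mcep_test[:, 1:25]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))    # lower is closer to the target timbre
```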

MOS test

The MOS test reflects the quality of timbre reproduction through evaluators’ scores. Testers rate students’ timbre reproduction on a 5-point scale: “5” means the timbre is very accurate and very close to the standard timbre; “4” means accuracy is good and close to the standard; “3” means accuracy is average, deviating slightly from the standard; “2” means accuracy is poor, deviating considerably; and “1” means accuracy is very poor, deviating from the standard timbre completely.

Experimental results

Sixteen students were randomly selected from the sophomore voice majors of a university as experimental subjects (numbered 1~16) to test the timbre evaluation results of the proposed algorithm; voice teachers were then invited to rate the algorithm’s evaluations subjectively in order to assess its accuracy in evaluating vocal timbre.

This paper adopts the Mel Cepstral Distortion (MCD) as the objective evaluation index. The lower the MCD value, the smaller the Mel cepstral gap between the actual voice and the target voice and the closer the vocal fragment is to the standard timbre; conversely, a higher value means the fragment deviates from the standard timbre. Figure 3 shows the MCD evaluation results of the proposed algorithm for the 16 students.

Figure 3. The MCD results of the 16 students

As shown in Figure 3, the student with the highest MCD value among the 16 is No. 11 (7.56), indicating that this student’s timbre deviates most clearly from the standard timbre in this test. The students with lower MCD values are Nos. 7, 8, 13, and 16, at 6.86, 6.83, 6.86, and 6.75 respectively, indicating that these four students’ timbre is closer to the Mel cepstrum of the target voice and their timbral accuracy is better.

Twenty raters with a background in vocal music were selected for this experiment and allowed to replay the recordings; they scored the 16 students’ singing according to their genuine impressions. Figure 4 shows the MOS assessment results for the 16 students. The results are similar to those of the MCD test: the highest ratings went to students 7, 8, and 16, with MOS scores of 3.93, 4.21, and 4.51, respectively, while the lowest MOS ratings went to students 10, 11, and 12.

Figure 4. The MOS results of the 16 students

The evaluations produced by the proposed algorithm thus reach conclusions similar to the subjective MOS assessment, meaning the algorithm judges vocal timbre accuracy nearly as well as professional vocal teaching and can therefore serve as an auxiliary means of vocal music teaching.

Pitch control evaluation

One student (No. 8) was randomly selected from the 16 students above as the subject for testing the algorithm’s pitch control evaluation. The scoring test used skill-training practice pieces, the most common material in vocal singing training, which center on specific vowels, syllables, and technical vocalizations. In the experimental simulation, the five most common vowels (a, e, i, o, u) were selected for testing and analysis of basic vocal training material, together with a male closed-mouth humming exercise.

For the pitch comparison, the fundamental frequency trajectories of the two renditions were obtained by the cepstral (inverse spectral) method; the trajectory for the male closed-mouth humming piece is shown in Figure 5, and a minimal sketch of this cepstral tracker is given after the figure. The mean target fundamental frequency is 155.78 Hz and the mean of the trial is 153.48 Hz, so, combined with Figure 5, the student’s pitch control can initially be judged to have reached a good level.

Figure 5. Comparison of the fundamental frequency trajectories
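A minimal cepstral F0 tracker of the kind used for these trajectories: the fundamental appears as a peak in the real cepstrum at a quefrency equal to the pitch period. Frame length, hop size, and search band below are illustrative assumptions.

```python
import numpy as np

def cepstral_f0_track(x, fs, frame_len=1024, hop=256, fmin=80.0, fmax=400.0):
    """Per-frame F0 (Hz) via the real cepstrum ('inverse spectral' method)."""
    lo, hi = int(fs / fmax), int(fs / fmin)       # quefrency search band
    track = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
        ceps = np.fft.irfft(spec)                 # real cepstrum
        q = lo + int(np.argmax(ceps[lo:hi]))      # dominant pitch quefrency
        track.append(fs / q)
    return np.array(track)
```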

Twenty voice teachers were then asked to rate the student’s pitch control; the pitch and vowel ratings for the male closed-mouth humming piece are shown in Table 1. An additional tester with no vocal training was included for comparison (No. 0). As Table 1 shows, the scores differ significantly according to the testers’ vocal foundation. The algorithm’s score serves as feedback that reflects, to a certain extent, the similarity between the tester’s pitch and the standard pitch, and it produces results close to manual scoring, in line with human subjective perception.

Table 1. Scoring results (exercise melody vowels a, o, e, i, u and pitch control)

Tester number   Scoring             a       o       e       i       u       Pitch control
8               Algorithm scoring   75.68   85.54   76.58   86.38   68.55   85.71
                Expert scoring      79      81      74      83      80      88
                Average error       1.824%
0               Algorithm scoring   53.58   63.36   13.85   12.88   12.35   58.66
                Expert scoring      46      62      14      13      11      60
                Average error       2.681%
Analysis of students’ vocal performance
Effectiveness of Speech Recognition Algorithms to Assist Teaching and Learning

Semester grades are a direct reflection of students’ learning outcomes in vocal music. Taking two sophomore voice-major classes (class of 2019) at college C as an example, one class was randomly designated the experimental class (N=45) and the other the control class (N=45), and the vocal grades obtained under traditional teaching were compared with those obtained under teaching assisted by the proposed algorithm, reflecting learning outcomes under the two methods.

Figure 6 shows the change in the control class’s vocal scores before and after instruction with traditional vocal teaching methods. The post-test scores of the control class improved over the pre-test scores: the number of passing students increased by 2, the number in the low band (60-69) decreased by 28%, and the number in the middle band (70-79) increased by 7. The number in the high band (80 and above) was 12 both before and after, accounting for 26.7%. Traditional vocal teaching therefore does have some effect, improving low-scoring students more while showing no obvious effect on high-scoring students.

Figure 6. Score comparison for the control group

Figure 7 shows the change in the experimental class’s vocal scores before and after instruction assisted by the proposed algorithm. As Figure 7 shows, the pre-test score distributions of the experimental and control classes were fairly close, while the experimental class’s post-test scores improved more. After the proposed singing-scoring algorithm was used to assist teaching, the number of failing students decreased by 2 and the number in the low band (60~69) decreased by 50%. Although the number in the 70~79 band did not change, the number in the high band (80 and above) increased by 45.5%, reaching 48.9% of the class, significantly higher than in the control group.

Figure 7. Score comparison for the experimental group

It can be seen that after the proposed algorithm was used for assisted vocal teaching, the new teaching method and mode brought an obvious improvement in students’ vocal performance.

Students’ attitudes toward algorithm-assisted instruction

Asked to choose between the traditional lecture classroom and teaching combined with algorithmic evaluation, 93.3% of the students chose the classroom with speech recognition algorithm assistance and only 6.7% chose the traditional classroom, so the vast majority preferred vocal teaching combined with objective algorithmic evaluation. In the new format, with its atmosphere of interactive discussion and cooperative learning and with algorithmic scoring assisting vocal practice, students received scientific feedback immediately after singing. After the algorithm was applied, 85% of the students felt their interest in vocal lessons had greatly increased, and 81% felt their vocal level had greatly improved with the algorithmic evaluation tools. The new teaching method was thus recognized and well liked by the majority of students, and it did much to improve their interest and motivation in learning.

Prospects for vocal music teaching in the age of artificial intelligence

Speech recognition technology is a branch of artificial intelligence, and artificial intelligence, as an exploration of human thinking, represents a revolutionary direction of great significance for the development of vocal music. From a market perspective, AI-based digital technology can be used not only for song recording and tuning but also to promote the development of music education at a macro level. As mentioned above, Internet companies have invested in AI research and development projects, exploring how to create greater value in the music education industry. Big data analysis built around AI, for example, has very broad application scenarios: teachers can use it to analyze students’ aesthetic preferences in music and to quantify, statistically, the emotional expression of songs, proficiency in voice training, and audiences’ evaluation criteria for different singing genres, allowing AI to play a positive role in vocal music teaching. AI can also promote the creative industries around vocal music teaching and provide creators with abundant sources of inspiration. From the perspective of talent cultivation, AI can create more job opportunities for vocal students in digital media, mobile gaming, animation, and film soundtracks, attracting more students to music education and helping to solve their employment problems to a certain extent.

Making machines understand music may seem incredible, but scientists are making it a reality. Artificial intelligence still has various shortcomings, yet it already plays an important role in our lives and has even surpassed human performance in some areas. In vocal music teaching, AI is driving intelligent teaching tools, diversified teaching scenes, dynamic teaching methods, the archiving of singing information, and humanized machine accompaniment. While enjoying the convenience brought by science and technology, we should continue to follow developments in the AI industry and push China’s vocal music teaching toward a more scientific and efficient future.

Conclusion

In this paper, objective evaluation of pitch and timbre accuracy is realized through the improved wavelet-transform fundamental period detection algorithm and the one-dimensional convolutional neural network. Experiments show that, among the 16 selected subjects, the algorithm’s timbre-accuracy evaluation rates student No. 11 worst and students Nos. 7, 8, 13, and 16 better, which is very close to the evaluations of professional vocal teachers. Comparing the algorithm’s pitch-control evaluation of the two test subjects with the evaluations of 20 vocal teachers shows that the algorithm’s scores for the five vowels and for pitch control are close to the manual scores, with average errors of 1.824% and 2.681%. The improved speech recognition algorithm thus performs well in practice: after it was used for scoring-assisted teaching, the number of experimental-class students in the high band of the vocal test increased by 45.5% and the number in the low band decreased by 50%, a clear improvement in vocal performance. It can be seen that integrating speech recognition technology into vocal music teaching is feasible.