
A Study of Using Deep Learning Technology to Improve the Accuracy of Polyphonic Singing in Community Choirs

3 February 2025

Introduction

Deep learning refers to machine learning methods that use multi-layer neural networks to perform specific computational tasks. In recent years, with the growth of computing power and the arrival of the big data era, deep learning has been widely adopted. As an important branch of artificial intelligence, it holds broad application prospects for improving the accuracy of polyphonic singing [14].

Choral art is a form of singing in which everyone can participate. Community chorus activities have developed rapidly in recent years: as living standards improve, people are no longer satisfied with material conditions alone and pay more attention to their spiritual life [5-8]. Community choirs not only help to build a harmonious society but also raise people's cultural literacy, and a good community choir performance cannot be separated from the accuracy of its polyphonic singing [9-10]. A polyphonic chorus is a choral form composed of multiple voice parts; when singing, each part carries a different musical line at the same time, forming rich textural layers and harmonic effects. Generally, a polyphonic chorus is divided into three parts, such as bass, male voice, and female voice, with each part responsible for a different register and share of the voicing [11-14]. The polyphonic chorus is a challenging and expressive choral form. For less experienced choirs, it is necessary to choose age-appropriate repertoire and assign the choristers to suitable voice parts [15-17]. Through reasonable arrangement and skill guidance, beginning choristers can be exercised and developed in multi-part chorus and create wonderful musical works together [18-19].

The study designs a musical melody extraction algorithm for community choir polyphonic singing. The algorithm improves on RNN, and a joint neural network based on Res-CBAM is proposed. The network is jointly constructed from a main network for pitch estimation and an auxiliary network for vocal detection. The CBAM attention mechanism is introduced into both networks to learn weights for salient features, and a bidirectional feedback mechanism is embedded in the output layer. Based on the main vocal melody extracted by the proposed algorithm, three kinds of listening discrimination training are designed: pitch, rhythm, and vocal balance. A controlled experiment compares the polyphonic singing effect before and after training to verify the effectiveness of the proposed method.

Deep learning based polyphonic music melody extraction algorithm

Polyphonic singing training places high demands on the basic elements of music, including pitch, tempo, dynamics, and rhythm. The study uses deep learning to extract the main vocal melody and thereby help community choirs improve the precision of their polyphonic singing.

Separation algorithm based on recurrent neural network

In the RNN-based music source separation algorithm, the input is the magnitude spectrum of the mixed signal. The separation model estimates, from this input, the magnitude spectrum of each target source, while the phase spectrum of the mixture is left unchanged. Through the inverse transform, each separated magnitude spectrum is combined with the mixture phase spectrum to restore the vocal source and the accompaniment source as one-dimensional time-domain signal sequences [20].

The overall RNN-based separation structure is shown in Figure 1. The phase spectrum of the signal is unchanged, and the feature used for separation is the magnitude spectrum; time-frequency masking is added during training so that the network predicts the vocal and accompaniment signals jointly.

Figure 1.

The overall separation structure based on RNN
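
As an illustrative sketch only (not the paper's implementation; the STFT parameters and the network-predicted magnitude spectra are assumptions), the reconstruction step described above can be written as soft time-frequency masking followed by an inverse STFT that reuses the mixture phase:

```python
import numpy as np
import librosa

def reconstruct_sources(mix, vocal_mag, accomp_mag, n_fft=1024, hop=256):
    """Combine estimated magnitude spectra with the mixture phase and invert.

    `vocal_mag` and `accomp_mag` are the magnitude spectra predicted by the
    separation network (assumed shape: [n_fft // 2 + 1, n_frames]).
    """
    mix_stft = librosa.stft(mix, n_fft=n_fft, hop_length=hop)
    mix_mag, mix_phase = np.abs(mix_stft), np.angle(mix_stft)

    # Soft time-frequency mask from the two estimates (Wiener-like masking).
    eps = 1e-8
    vocal_mask = vocal_mag / (vocal_mag + accomp_mag + eps)

    # Apply the masks to the mixture magnitude, reuse the mixture phase, and invert.
    vocal = librosa.istft(vocal_mask * mix_mag * np.exp(1j * mix_phase), hop_length=hop)
    accomp = librosa.istft((1.0 - vocal_mask) * mix_mag * np.exp(1j * mix_phase), hop_length=hop)
    return vocal, accomp
```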

An important innovation of the RNN-based separation is the integration of time-frequency masking into the network training process, which uses the RNN's memory of temporal order to model and train on speech signals. Its shortcoming is that the RNN is weak at extracting features from two-dimensional images, so it cannot learn the local time-frequency structure of the speech spectrum as well as architectures such as CNNs, which are the most powerful feature extractors in the image domain. Therefore, this chapter proposes a melody extraction method based on saliency and a joint neural network, built on top of the RNN.

Melody extraction algorithm based on improved joint neural network

In this paper, we propose a melody extraction algorithm based on an improved joint neural network, which uses a preprocessed saliency map as the network input to estimate melodic pitch sequences. The melody extraction task consists of two parts. The first is vocal detection, i.e., determining whether a given time frame contains a melody pitch, which can be regarded as a binary classification task. The second is pitch detection, i.e., determining the most probable melodic pitch for each time frame, which can be regarded as a multi-class classification task. Among the evaluation metrics of melody extraction, two are directly related to vocal detection, and the detection of melody pitch presupposes that the frame is recognised as a vocal frame; the two tasks are therefore closely related, and the accuracy of vocal detection also affects the performance of melody extraction [21].

Structure of the joint neural network based on Res-CBAM

The joint neural network is a combination of two sub-networks: a main network that performs pitch classification and predicts the melody of the song, and an auxiliary network that performs vocal-interval classification and detects the human voice. The network structure is shown in Figure 2. The main network performs pitch classification and outputs the pitch sequence; it is the part responsible for melody extraction. Its structure consists of one convolution module, three residual modules, one downsampling module, a CBAM attention module, and a Bi-LSTM layer for predicting pitch sequences. The main network's output layer contains the pitch labels and a single non-vocal label [22], and these labels are treated as classes of the same multi-class problem. In the melody extraction task, although vocal detection and pitch estimation are closely related, they have different goals and learn different feature information, so simply adding a non-vocal label has limitations for extracting voice-activity features. In the joint network, an auxiliary network for vocal detection is therefore added: the existing main network is retained, and a dedicated vocal-detection branch is attached rather than merged back into the main network.

Figure 2.

Joint neural network structure based on Res-CBAM
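
The exact layer sizes are not given in the text, so the following is only an illustrative PyTorch sketch of the joint structure described above (convolution module, residual blocks, CBAM, Bi-LSTM, and two output heads); the channel counts, kernel sizes, simplified CBAM internals, and the 601-class pitch output (600 pitches plus one non-vocal class) are assumptions:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Simplified CBAM: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                      # x: (B, C, F, T)
        avg = x.mean(dim=(2, 3))               # channel attention
        mx = x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca[:, :, None, None]
        sa = torch.sigmoid(self.spatial_conv(  # spatial attention
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class JointMelodyNet(nn.Module):
    """Main branch: 601-way pitch classes (600 pitches + non-vocal).
       Auxiliary branch: 2-way vocal / non-vocal detection."""
    def __init__(self, freq_bins=64, channels=32, n_pitch=601):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.res = nn.Sequential(*[ResBlock(channels) for _ in range(3)])
        self.down = nn.MaxPool2d((2, 1))       # downsample frequency only
        self.cbam = CBAM(channels)
        feat = channels * (freq_bins // 2)
        self.pitch_lstm = nn.LSTM(feat, 128, batch_first=True, bidirectional=True)
        self.voice_lstm = nn.LSTM(feat, 64, batch_first=True, bidirectional=True)
        self.pitch_head = nn.Linear(256, n_pitch)
        self.voice_head = nn.Linear(128, 2)

    def forward(self, saliency):               # saliency: (B, 1, F, T)
        x = self.cbam(self.down(self.res(self.stem(saliency))))
        b, c, f, t = x.shape
        seq = x.permute(0, 3, 1, 2).reshape(b, t, c * f)          # (B, T, C*F)
        pitch_logits = self.pitch_head(self.pitch_lstm(seq)[0])   # (B, T, 601)
        voice_logits = self.voice_head(self.voice_lstm(seq)[0])   # (B, T, 2)
        return pitch_logits, voice_logits
```

Both branches share the convolutional features, matching the joint-learning design described in the next subsection.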

Joint network loss calculation

The Res-CBAM-based joint network for pitch estimation and vocal detection connects the main and auxiliary networks by jointly learning the features of the convolutional module. If the per-frame binary decision of vocal detection can be fed back to the main pitch-estimation network, the network can better discriminate whether a frame is non-vocal or contains a melody, enlarging the gap between the probability of the non-vocal "0" class and the probabilities of the remaining pitch classes. At the same time, if the multi-class results of pitch estimation can be fed back into the binary vocal-detection network, the accuracy of vocal-frame discrimination can be enhanced. The two results are therefore combined: a bidirectional feedback mechanism, denoted BF, is added to the output layer of the network, and a loss function that combines the two kinds of features is constructed.

After feature extraction from the saliency map by the main and auxiliary networks, the features of the main network are mapped by a fully connected layer to the same dimension as the input annotation, and the features of the auxiliary network are mapped to a two-dimensional vector; each is then mapped by the softmax normalisation function to class probabilities that sum to 1, as shown in Equation (1): $\hat{y} = \mathrm{softmax}(z_j) = \mathrm{softmax}(w_j x + b_j)$, where $w_j$ is the weight vector of class $j$ in the fully connected layer and $b_j$ is the bias term. In the main network responsible for pitch estimation, the value of the first dimension computed by softmax is the probability of a non-vocal, non-melodic frame, denoted $P_{mnv}$, while the sum of the remaining 600 dimensions is the probability of a melodic frame containing a human voice, i.e. $1 - P_{mnv}$. In the auxiliary network responsible for vocal detection, the two dimensions computed by softmax are the probabilities of a non-vocal frame and a vocal frame, denoted $P_{anv}$ and $P_{av}$.

To improve the accuracy of the joint network for vocal detection and to widen the gap between vocal and non-vocal frames, the main network's prediction of whether a frame carries melodic features is superimposed on the auxiliary network, and the corresponding probabilities are summed as shown in Equation (2): $P'_{av} = P_{av} + (1 - P_{mnv})$, $P'_{anv} = P_{anv} + P_{mnv}$.

Since the binary classification problem of vocal detection is easier to learn than the multi-class problem of pitch estimation, feeding the vocal detection results into the main network's pitch estimation sharpens the distinction between the non-melodic class and the pitch classes and improves classification accuracy. The probability $P_{anv}$ of a non-vocal frame predicted by the auxiliary network is multiplied by the first-dimension non-melodic probability $P_{mnv}$ of the main network, and the probability $P_{av}$ of a vocal frame predicted by the auxiliary network is multiplied by each of the remaining melodic-pitch probabilities $P_{mv_i}$ of the main network, as shown in Equation (3): $P'_{mnv} = P_{anv} \cdot P_{mnv}$, $P'_{mv_i} = P_{av} \cdot P_{mv_i}$, $i = 1, 2, \ldots, 600$.

Softmax is then applied once more to $P'_{anv}$ and $P'_{av}$, and to $P'_{mnv}$ and $P'_{mv_i}$, yielding the outputs $y_{pitch}$ and $y_{voice}$ after bidirectional feedback between the main and auxiliary networks.
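
A minimal numerical sketch of this bidirectional feedback step is given below, assuming per-frame probability vectors produced by the two heads; the exact re-normalisation used in the paper is not specified, so the final softmax-over-log step here is only one plausible choice:

```python
import torch
import torch.nn.functional as F

def bidirectional_feedback(pitch_logits, voice_logits):
    """pitch_logits: (T, 601) with index 0 = non-melodic; voice_logits: (T, 2)."""
    pitch_p = F.softmax(pitch_logits, dim=-1)        # P_mnv = pitch_p[:, 0]
    voice_p = F.softmax(voice_logits, dim=-1)        # [P_anv, P_av] per frame
    p_mnv = pitch_p[:, :1]                           # (T, 1)
    p_anv, p_av = voice_p[:, :1], voice_p[:, 1:]     # (T, 1) each

    # Feedback main -> auxiliary (Eq. 2): probabilities are summed.
    voice_fb = torch.cat([p_anv + p_mnv, p_av + (1.0 - p_mnv)], dim=-1)

    # Feedback auxiliary -> main (Eq. 3): probabilities are multiplied.
    pitch_fb = torch.cat([p_anv * p_mnv, p_av * pitch_p[:, 1:]], dim=-1)

    # Re-normalise so each frame's probabilities sum to 1 again.
    y_voice = F.softmax(torch.log(voice_fb + 1e-8), dim=-1)
    y_pitch = F.softmax(torch.log(pitch_fb + 1e-8), dim=-1)
    return y_pitch, y_voice
```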

In the joint network, the annotation of the input consists of two matrices of different sizes, and the outputs should correspond to the inputs, so the loss function of the joint network is constructed by combining the loss functions of the two networks for training. The loss function is given in Equation (4), with the binary cross-entropy term defined in Equation (5): $L = \omega_1\,\mathrm{CE}(y_{pitch}, \hat{y}_{pitch}) + \omega_2\,\mathrm{BCE}(y_{voice}, \hat{y}_{voice})$, where $L_{BCE} = -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y})$.

Where CE is the multi-class cross-entropy loss function and BCE is the binary cross-entropy loss function calculated by Equation (5). $\omega_1$ and $\omega_2$ are the weights of the two loss terms; in the experiments, the best results were obtained with $\omega_1 = 0.65$ and $\omega_2 = 0.35$, since the focus of the melody extraction task remains the pitch estimation of the main network, so a larger weight is given to its loss.
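
For illustration, a hedged PyTorch sketch of the weighted joint loss in Equation (4) follows; it assumes frame-level integer pitch labels and a single voicing logit per frame, neither of which is specified in the text:

```python
import torch
import torch.nn as nn

class JointLoss(nn.Module):
    """Weighted sum of pitch cross-entropy and voicing binary cross-entropy."""
    def __init__(self, w_pitch=0.65, w_voice=0.35):
        super().__init__()
        self.w_pitch, self.w_voice = w_pitch, w_voice
        self.ce = nn.CrossEntropyLoss()          # multi-class pitch loss (CE)
        self.bce = nn.BCEWithLogitsLoss()        # binary voicing loss (BCE)

    def forward(self, pitch_logits, voice_logits, pitch_target, voice_target):
        # pitch_logits: (N, 601), pitch_target: (N,) class indices (0 = non-vocal)
        # voice_logits: (N,) single voicing logit, voice_target: (N,) in {0.0, 1.0}
        return (self.w_pitch * self.ce(pitch_logits, pitch_target)
                + self.w_voice * self.bce(voice_logits, voice_target))
```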

Evaluation criteria

There are two main tasks in melody extraction:

Estimation of melodic pitch. When the difference between the predicted value and the reference value is within 0.5 semitones, the melody pitch is considered correctly estimated; otherwise it is incorrect [23].

Melodic activity detection, i.e., determining whether the current frame is melodic. The evaluation metrics are defined as follows:

Voicing recall (VR): $VR = TP / G_V$

Voicing false alarm rate (VFA): $VFA = FP / G_U$

Raw pitch accuracy (RPA): $RPA = (TPC + FNC) / G_V$

Raw chroma accuracy (RCA): $RCA = (TPC_{ch} + FNC_{ch}) / G_V$

Overall accuracy (OA): $OA = (TPC + TN) / T_0$

Where $G_V$ denotes the number of melodic frames in the reference and $G_U$ the number of non-melodic frames in the reference; $D_V$ and $D_U$ denote the melodic and non-melodic frames in the detection result. TP is the number of melodic frames correctly detected, FP the number of non-melodic frames incorrectly detected as melodic, TN the number of non-melodic frames correctly detected, and FN the number of melodic frames incorrectly detected as non-melodic. TPC is the number of correctly detected melodic frames whose pitch is also correct, and FNC is the number of melodic frames missed by voicing detection but whose pitch estimate is correct. $TPC_{ch}$ and $FNC_{ch}$ are the corresponding counts when pitch correctness is judged on chroma, i.e., ignoring octave errors. $T_0$ is the total number of frames.
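
As a hedged sketch (not the official MIREX evaluation code), the five metrics can be computed from frame-level pitch sequences roughly as follows; the 0.5-semitone (50-cent) tolerance and the chroma folding follow the definitions above, and the negative-pitch convention for estimated unvoiced frames is an assumption made so that FNC can be counted:

```python
import numpy as np

def melody_metrics(ref_pitch, est_pitch, cent_tol=50.0):
    """Frame-level melody evaluation.

    ref_pitch: reference pitch in cents, 0 for non-melodic frames.
    est_pitch: estimated pitch in cents; negative values mark frames the
               algorithm judges unvoiced while still reporting a pitch.
    """
    ref_v, est_v = ref_pitch > 0, est_pitch > 0
    est_abs = np.abs(est_pitch)

    # Pitch correctness within 0.5 semitone, and on chroma (modulo one octave).
    correct = np.abs(est_abs - ref_pitch) <= cent_tol
    correct_ch = np.abs((est_abs - ref_pitch + 600) % 1200 - 600) <= cent_tol

    gv, gu, t0 = ref_v.sum(), (~ref_v).sum(), len(ref_pitch)
    tp = (ref_v & est_v).sum()                  # melodic frames detected as melodic
    fp = (~ref_v & est_v).sum()                 # non-melodic frames detected as melodic
    tn = (~ref_v & ~est_v).sum()                # non-melodic frames correctly rejected
    tpc = (ref_v & est_v & correct).sum()       # detected melodic frames, pitch correct
    fnc = (ref_v & ~est_v & correct).sum()      # missed melodic frames, pitch correct
    tpc_ch = (ref_v & est_v & correct_ch).sum()
    fnc_ch = (ref_v & ~est_v & correct_ch).sum()

    return {"VR": tp / gv, "VFA": fp / gu,
            "RPA": (tpc + fnc) / gv, "RCA": (tpc_ch + fnc_ch) / gv,
            "OA": (tpc + tn) / t0}
```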

Method design of musical melody extraction algorithms to aid singing accuracy

The vocal main melody extraction method for polyphonic music was designed above; in this chapter, the technique is used in polyphonic singing training for community choirs to improve singing accuracy in three aspects: pitch, rhythm, and vocal balance.

Auditory discrimination training for intonation and pitch

Intonation training is the primary element of a polyphonic chorus; only when every part sings accurately can the chorus achieve its intended effect. It is therefore necessary to carry out aural training to strengthen listening ability. In a chorus, pitch relies mainly on hearing, so intonation training and aural training should be organically combined, improving intonation by improving aural discrimination.

If the pitch is too high or too low during choral training, the polyphonic music melody extraction algorithm can be used to identify the different intervals and determine the correct pitch for repeated vocal training, which gradually improves the accuracy of the singer's hearing and helps the singer grasp the pitch of the work correctly.

Auditory discrimination training for rhythm

Rhythm is a basic element of music and the skeleton on which melody is built; the two cannot be separated. In the training process, a good grasp of rhythm is one of the most important aspects. Targeted rhythm training not only cultivates the singers' sense of rhythm and their ability to listen to one another in ensemble but also activates the atmosphere and stimulates the singers' interest in learning. Accurate mastery of rhythm is therefore essential to the singers' musical learning.

To develop the singers' sense of rhythm, the main vocal melody extracted by the method of this paper is used in rhythm training. Rhythm training requires mastery of rhythmic patterns consisting of notes of various beats and time values; a correct understanding of rhythm plays an important role in expressing the ideas, emotions, and images of a musical work.

Auditory discrimination training for vocal balance

The most important indicator of choral singing is the harmonious relationship between the individual voice parts. By analysing the main vocal melody, the balance between the parts can be adjusted and unity of tone achieved. The singers should open and close their voices neatly, listen to the other parts, and adjust the timing of their own entries and exits. At the same time, the singers should be trained to use correct breathing, vocal position, resonance, and timbre so that the whole choir achieves basic consistency. Secondly, melody and humming should be combined: in a choral work, the different parts need to hum on appropriate vowels in order to find the correct soft high notes and blend into one another. Finally, active listening should be combined with self-positioning. Chorus is a group art: singers must not only sing their own part but also keep a keen ear on the changes in the other parts and, following the conductor's instructions, adjust their own part in time so that the choral work is performed well as a whole.

Empirical analyses
Analysis of experimental results
Experimental test data set

The main melody dataset used in this paper comes from the Music Information Retrieval Evaluation eXchange (MIREX), organised by the international music information retrieval community, which includes a main melody extraction task. To evaluate the strengths and weaknesses of the proposed method effectively, all of the following experiments use the publicly available datasets from this competition.

In this paper, two datasets, MIR-1K and MIREX05, are used for the main melody extraction experiments:

MIR-1K dataset: recorded by 19 trained singers in a pop singing style, it contains a total of 1000 clips of R&B, pop, jazz, opera, and other styles of music signals, with a sampling frequency of 44.1 kHz; each recording lasts 8 to 12 seconds. The melodic pitch labels of the whole dataset are spaced at 10 ms intervals. In this paper, 800 clips are randomly selected as training data, and the remaining 200 clips are used as test data.

MIREX05 dataset: labelled by the LabROSA team at Columbia University, it includes 13 music clips in various styles, with a sampling frequency of 44.1 kHz; each clip lasts about 24 to 39 seconds. The melodic pitch labels are spaced at 10 ms intervals. In this paper, one clip is randomly selected as test data, and the rest are used as training data.
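
A minimal sketch of the random 800/200 train/test split described above for MIR-1K is shown below; the directory layout, file naming, and fixed random seed are assumptions for illustration only:

```python
import random
from pathlib import Path

# Hypothetical local layout: one .wav clip per segment under MIR-1K/Wavfile.
DATA_DIR = Path("MIR-1K/Wavfile")

clips = sorted(DATA_DIR.glob("*.wav"))
random.seed(42)                 # fixed seed so the split is reproducible
random.shuffle(clips)

train_clips, test_clips = clips[:800], clips[800:1000]
print(f"{len(train_clips)} training clips, {len(test_clips)} test clips")
```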

Analysis of model performance

The proposed melody extraction method, a joint neural network based on Res-CBAM, is compared with the CNN-CRF method. In addition, this paper also varies the structure of the main network, implementing it with CRNN-CRF, CNN-CRF, and CRNN structures respectively, and compares the performance of different main networks in joint detection as well as the model performance after adding the CBAM attention mechanism. The experimental results are shown in Figure 3, where (a) and (b) show the results on the MIR-1K dataset and the MIREX05 dataset, respectively.

Figure 3.

Results of different network structures on the two datasets

Comparing the experimental data of CNN-CRF, CRNN-CRF, CRNN, and Res-CBAM, it can be seen that the joint detection network trained on the dual task of main melody extraction and vocal detection effectively reduces the vocal false alarm rate on the MIR-1K dataset. Since the energy ratio of vocals to accompaniment in MIR-1K is 1:1, fewer accompaniment frames are misrecognised as vocal frames, which raises the overall accuracy OA: Res-CBAM improves OA by 2.8 percentage points over the best-performing CNN-CRF, demonstrating the effectiveness of the joint detection model. It can also be noticed that this is obtained at the expense of some melodic pitch accuracy. Comparing the results after adding the CBAM attention mechanism, the gain is most significant for CRNN: RPA and RCA rise by 6 and 4.9 percentage points, respectively, compared with CRNN alone, while the overall accuracy increases to 79.7%. On the MIREX05 dataset, the overall performance of Res-CBAM likewise exceeds that of the other models.

In addition, unidirectional feedback structures are tested on the CRNN structure. The experimental results for different feedback directions on the two datasets are shown in Figure 4. CRNN-BF1 is the model whose feedback path runs from the main network to the auxiliary network, CRNN-BF2 is the model whose feedback path runs from the auxiliary network to the main network, and CRNN-BF is the bidirectional feedback model. The experimental data show that, compared with the network without a feedback module, model performance improves significantly with both unidirectional and bidirectional feedback structures, which confirms the effectiveness of the feedback mechanism. More notably, on the MIR-1K dataset, bidirectional feedback improves pitch accuracy by about 6.3 percentage points over unidirectional feedback, and on the MIREX05 dataset, bidirectional feedback further improves vocal recall over unidirectional feedback. Overall, on both datasets, the model with the bidirectional feedback structure achieves a larger improvement in overall accuracy OA with a lower octave error than the models with unidirectional feedback. This is because the bidirectional feedback mechanism effectively transfers the different feature information of the main and auxiliary networks, so the two tasks reinforce each other. The octave-error data show that adding the feedback mechanism improves the accuracy of note pitch recognition compared with the original model. On MIREX05, the octave error was already low, so no significant reduction is seen; however, the joint detection network greatly improves vocal recall, and the addition of the bidirectional feedback mechanism also significantly improves melodic pitch recognition.

Figure 4.

Results of different feedback directions on the two datasets

Analysis of the effect of algorithm application

Twenty members of a community choir were selected for the study and received the training method designed in this paper. Singing accuracy tests were administered before and after the training, and observation records were kept to evaluate the research process.

The effect of the music melody extraction algorithm on singing accuracy was assessed through professional scoring before and after training, covering six aspects: singing ability, breath control, pitch, rhythm, polyphonic choral ability, and expressiveness. The paired-sample statistics are shown in Table 1. The mean differences of the paired samples were 0.82, 1.82, 1.54, 1.12, 1.24, and 1.35, respectively. The most pronounced change was in breath control, followed by pitch and expressiveness. Most of the singers in this study had some experience in vocal study, but many of them still did not know how to use the breath properly for vocalisation and singing. During the study, breath training was the main focus of teaching, leading the singers to produce their voices correctly and scientifically so that they gradually understood the importance of breathing. In particular, during singing, the balance and coordination of the voice parts must be built on good breath support.

Table 1. Matched sample statistics

Pair    Dimension         Test       Mean   SD     N
Pair 1  Singing ability   Pretest    2.84   0.536  20
                          Posttest   3.66   0.475  20
Pair 2  Breath control    Pretest    2.34   0.526  20
                          Posttest   4.16   0.511  20
Pair 3  Pitch             Pretest    2.33   0.475  20
                          Posttest   3.87   0.463  20
Pair 4  Rhythm            Pretest    2.96   0.369  20
                          Posttest   4.08   0.529  20
Pair 5  Chorus ability    Pretest    2.63   0.511  20
                          Posttest   3.87   0.414  20
Pair 6  Expressiveness    Pretest    2.76   0.332  20
                          Posttest   4.11   0.442  20

The paired-sample test is shown in Table 2. The scores of the singers on the various indicators changed markedly compared with those before the aural training. The paired t-test shows that p < 0.05 for every indicator, indicating that the scores of each ability before and after the aural training differ significantly; the aural training therefore plays an evident role in developing the singers' various singing abilities, and the aural training based on the polyphonic music melody extraction algorithm has achieved a certain degree of success.

Table 2. Matched sample test

Pair    Dimension         Mean difference  SE mean  95% CI lower  95% CI upper  t      df  Sig. (2-tailed)
Pair 1  Singing ability   -0.82            0.236    0.756         1.236         3.641  8   0.000
Pair 2  Breath control    -1.82            0.254    1.423         2.341         6.324  8   0.001
Pair 3  Pitch             -1.54            0.187    0.789         1.254         3.214  8   0.000
Pair 4  Rhythm            -1.12            0.236    0.741         1.274         5.321  8   0.003
Pair 5  Chorus ability    -1.24            0.255    0.755         1.149         5.412  8   0.002
Pair 6  Expressiveness    -1.35            0.214    0.763         1.254         6.102  8   0.000
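
A hedged sketch of the paired t-test underlying Table 2 is given below, using scipy.stats.ttest_rel on hypothetical pre/post score arrays (the raw individual scores are not reported in the paper):

```python
import numpy as np
from scipy import stats

# Hypothetical pre- and post-training scores for one dimension (N = 20 singers).
pre = np.array([2.8, 3.1, 2.5, 2.9, 3.0, 2.7, 2.6, 3.2, 2.9, 2.8,
                2.7, 3.0, 2.9, 2.6, 2.8, 3.1, 2.7, 2.9, 3.0, 2.8])
post = np.array([3.6, 3.9, 3.4, 3.7, 3.8, 3.5, 3.6, 4.0, 3.7, 3.6,
                 3.5, 3.8, 3.7, 3.4, 3.6, 3.9, 3.5, 3.7, 3.8, 3.6])

# Paired (dependent-samples) t-test: tests whether the mean pre/post difference is zero.
t_stat, p_value = stats.ttest_rel(post, pre)
print(f"mean difference = {np.mean(post - pre):.2f}, t = {t_stat:.3f}, p = {p_value:.4f}")
```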
Conclusion

The study improves on the RNN by designing a joint neural network based on Res-CBAM to extract the main vocal melody of community-choir polyphonic singing, and designs auditory training based on it to improve singing accuracy.

Res-CBAM improves the overall accuracy OA on the MIR-1K dataset by 2.8 percentage points compared with the best-performing CNN-CRF. After the CBAM attention mechanism is added, RPA and RCA increase by 6 and 4.9 percentage points, respectively, compared with CRNN. The addition of the bidirectional feedback mechanism yields a further improvement of approximately 6.3 percentage points in pitch accuracy compared with unidirectional feedback. These results show the effectiveness of the joint detection model.

After the aural training, the mean scores of the community choir members in singing ability, breath control, pitch, rhythm, polyphonic choral ability, and expressiveness improved by 0.82, 1.82, 1.54, 1.12, 1.24, and 1.35, respectively, compared with the pre-training means, and all improvements were significant (p < 0.05). This demonstrates that aural training based on a polyphonic music melody extraction algorithm can enhance singing accuracy.