
Language model optimization and corpus construction techniques for Japanese speech recognition

Mar 21, 2025


Introduction

Corpus development has a history of nearly 60 years. A corpus is a language database built from facts of real language use which, with the help of modern computer technology, supports a data-driven empirical research methodology for the comprehensive study of language as well as of language learning and use [1-3]. The development, construction and application of corpora have been emphasized all over the world, and various types of corpora have come into being, providing rich and informative language materials for language research. Compared with the vigorous development of corpus research in Europe and America, Japanese corpus research has long been in a relatively backward state [4-6]. The reasons include objective constraints such as copyright issues, the slow change of researchers' awareness, and the relative lag of corpus infrastructure, all of which are bottlenecks hindering its development [7-8]. With the promotion of the national "One Belt, One Road" strategy and the rapid development of international trade cooperation, contemporary college students not only need to practice the five basic skills of listening, speaking, reading, writing and translating when learning foreign languages; the times also place higher demands on the interpreting ability of foreign-language graduates [9-12]. How can the quality of foreign language classroom teaching be improved within limited time? How can foreign language talents be cultivated to meet market demand more effectively? These questions pose a challenge to current foreign language teaching and force educators to think deeply [13-15]. The development of information technology and the continuous improvement of storage technology provide convenience for foreign language teaching and research, and optimizing the language model for Japanese speech recognition and constructing a Japanese corpus has positive guiding significance for Japanese language teaching and research [16-18].

The study constructs a Japanese speech recognition model based on a Bi LSTM-CTC acoustic model. The language model module of this speech recognition system is an RNN-based language model. A parallel optimization training algorithm is proposed to optimize the RNN language model: training is implemented on the GPU, and the powerful computational capability of the GPU is used to speed up the matrix and vector operations during network training. The Japanese speech recognition corpus is constructed with a speech-text alignment technique to improve model accuracy. In addition, the optimization effect of the Bi LSTM-CTC based Japanese speech recognition model is evaluated with three evaluation indexes: perplexity, accuracy, and word error rate.

Japanese Speech Recognition Model
Japanese Speech Recognition Model Construction Based on Bi LSTM-CTC

Japanese has many modeling primitives, so there are several unit choices for continuous Japanese speech recognition. A recurrent neural network (RNN) is a network structure with memory capability and is well suited to modeling sequential data such as speech and text [19]; RNNs are therefore suitable for speech recognition. LSTM adds an internal cell state to the RNN and is an improved version of it. Its cell structure is shown in Fig. 1. ft denotes the forget gate, through which part of the information in the previous cell state ct−1 can be discarded. it is the input gate, which determines how much of the new information is retained in the temporary state ct. ot denotes the output gate, which determines the output produced from the current internal state ct.

Figure 1.

LSTM unit structure

LSTM processes data in a single, front-to-back direction. This unidirectional flow means that LSTM often loses part of the available information. For speech data, context in both directions must be combined to extract contextual information accurately. Therefore, the study uses Bi LSTM instead of LSTM.

The propagation of Bi LSTM can be divided into three parts. First, the forward LSTM hidden layer computes features of the forward signal to obtain the forward output. Second, the backward LSTM hidden layer treats the reversed input signal as a pattern and computes its features to obtain the backward output. Third, the two sets of outputs are combined, and the resulting output of the Bi LSTM model is compared with the previously input text to determine whether there are any anomalies. This approach effectively mines the features of the data in both directions and captures tighter correlations [20]. Connectionist temporal classification (CTC) is essentially an end-to-end algorithm that maps input sequences to output sequences. The training process of the neural network is as follows: first, the network is initialized and the training data is input and normalized. The corresponding network model structure is then built, parameters are set, and training loops over num_epochs epochs; in each epoch the model iterates over train_features, computes the loss value, clears the gradients, and proceeds to the next iteration. Finally, an optimizer performs the parameter updates. In this study, the adaptive moment estimation (Adam) optimizer is used to train the Japanese speech model, and the Softmax function is used as the activation function of the output layer of the acoustic model. The structure of the acoustic model based on Bi LSTM-CTC is shown in Fig. 2.

Figure 2.

Model structure based on Bi LSTM-CTC

The Japanese speech recognition model based on Bi LSTM-CTC has 2 fully connected layers followed by 3 Bi LSTM network layers, and ends with 2 fully connected layers of 200 nodes each. The activation function of the output layer is Softmax, and the ReLU function is used for all other layers. The Bi LSTM-CTC has the same input-layer function and parameters as the LSTM-CTC model. The two fully connected layers at the front change the dimensionality of the input features so that they can be fed into the Bi LSTM. The data then passes into the fully connected layer with the Softmax function. The discrepancy between the model output and the pre-labeled data is measured by the CTC loss function, which aligns the output to the labeled results. Given scalars x1, ⋯, xK, the Softmax function is computed as shown in equation (1): $${z_k} = soft\max \left( {{x_k}} \right) = \frac{{{e^{{x_k}}}}}{{\sum\limits_{i = 1}^K {{e^{{x_i}}}} }},\ k = 1, \cdots ,K$$

Assuming a sample x, the probability that it belongs to class c can be calculated using equation (2): $$P(y = c|x) = soft\max \left( {w_C^Tx} \right)$$

In equation (2), $$w_C^T$$ denotes the weight parameter. After the Softmax layer, the probabilities of the different labels for this sample are obtained, and the label with the highest probability is selected. ReLU is a commonly used activation function that improves model performance to a certain extent. In essence, it is a piecewise function with 0 as the cutoff point: if the input is greater than 0, the output equals the input; otherwise, the output is 0. The ReLU activation function is given in equation (3): $$\sigma (x) = \max (0,x)$$

LeakyReLU is a modified version of the ReLU function, whose expression is shown in equation (4): $$f\left( x \right) = \left\{ {\begin{array}{l} {\alpha x,\ x \leq 0} \\ {x,\ x > 0} \end{array}} \right.$$

In equation (4), the parameter α is a very small value greater than 0. The study trains the acoustic model with the LeakyReLU activation function to remedy the shortcomings of ReLU and increase the accuracy of the model.
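As a concrete illustration of the architecture and training procedure described above (two fully connected layers, three Bi LSTM layers, two 200-node fully connected layers, a Softmax/CTC output, LeakyReLU activations, and the Adam optimizer), the following is a minimal PyTorch sketch. The feature dimension, hidden size, number of output classes, and tensor shapes are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class BiLSTMCTC(nn.Module):
    """Sketch: 2 FC layers -> 3 Bi LSTM layers -> 2 FC layers (200 nodes) -> log-softmax for CTC."""
    def __init__(self, feat_dim=39, hidden=256, num_classes=100):  # illustrative sizes
        super().__init__()
        self.front = nn.Sequential(                  # two FC layers reshape the input features
            nn.Linear(feat_dim, 2 * hidden), nn.LeakyReLU(0.01),
            nn.Linear(2 * hidden, 2 * hidden), nn.LeakyReLU(0.01))
        self.blstm = nn.LSTM(2 * hidden, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.back = nn.Sequential(                   # two 200-node FC layers
            nn.Linear(2 * hidden, 200), nn.LeakyReLU(0.01),
            nn.Linear(200, 200), nn.LeakyReLU(0.01))
        self.out = nn.Linear(200, num_classes)       # Softmax (log form) is applied in forward

    def forward(self, x):                            # x: (batch, frames, feat_dim)
        h = self.front(x)
        h, _ = self.blstm(h)
        return self.out(self.back(h)).log_softmax(dim=-1)

# One training step with the CTC loss and the Adam optimizer.
model = BiLSTMCTC()
ctc_loss = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(4, 120, 39)                       # a batch of 4 utterances, 120 frames each
targets = torch.randint(1, 100, (4, 20))              # label indices (0 is reserved for blank)
input_lengths = torch.full((4,), 120, dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)

log_probs = model(feats).transpose(0, 1)              # CTCLoss expects (frames, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
optimizer.zero_grad()                                 # clear gradients, then update parameters
loss.backward()
optimizer.step()
```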

Language Model Optimization
RNN language models

An N-gram language model can only use the preceding N − 1 words, ignoring longer-range contextual information [21]. Therefore an RNN language model is chosen to assist speech recognition. An RNN consists mainly of three layers: an input layer, a hidden layer and an output layer.

Xt is the input at moment t (e.g., X0 is the input at moment t = 0).

Ht is the hidden layer state at moment t, i.e., the memory of the recurrent neural network, which is jointly determined by Xt and Ht−1. The formula is as follows: $${H_t} = f\left( {U \cdot {X_t} + W \cdot {H_{t - 1}} + \beta } \right)$$

Where U, W and β are network parameters, f is the activation function (tanh or ReLU), and the above equation is applied recurrently at each time step.

Ot is the output at moment t, which is determined by the hidden layer state Ht (the memory) of the model at the current moment; at each moment the output probabilities over the vocabulary sum to 1: $${O_t} = soft\max \left( {V \cdot {H_t} + \eta } \right)$$

Where V and η are network parameters and softmax is the activation function of the output layer.
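For illustration, the following NumPy sketch applies equations (5) and (6) step by step over a short input sequence; the dimensions and the choice of tanh for f are assumptions made for the example.

```python
import numpy as np

def rnn_forward(X, U, W, V, beta, eta):
    """Apply equations (5)-(6) over an input sequence X of shape (T, input_dim)."""
    H = np.zeros(W.shape[0])                       # initial hidden state H_{-1}
    outputs = []
    for x_t in X:                                  # the recurrence is applied at every time step
        H = np.tanh(U @ x_t + W @ H + beta)        # eq. (5): H_t = f(U·X_t + W·H_{t-1} + β)
        logits = V @ H + eta                       # eq. (6) before the softmax
        e = np.exp(logits - logits.max())
        outputs.append(e / e.sum())                # softmax: probabilities sum to 1 at each moment
    return np.stack(outputs)

# Illustrative sizes: 10-dimensional inputs, 16 hidden units, a 50-word vocabulary.
rng = np.random.default_rng(0)
U, W = rng.normal(size=(16, 10)), rng.normal(size=(16, 16))
V, beta, eta = rng.normal(size=(50, 16)), np.zeros(16), np.zeros(50)
probs = rnn_forward(rng.normal(size=(5, 10)), U, W, V, beta, eta)   # (5, 50) word distributions
```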

Parallel optimization of language models

The classification-based RNN framework is shown in Fig. 3. The learning and training process of the RNN language model mainly consists of adjusting the four weight matrices in the network, i.e., the matrices U, W, V, and X in Fig. 3, and its main operations are the computations on matrices and vectors and their updates.

Figure 3.

Structure of the class based RNN

When implemented on a CPU, such operations can only be realized with nested loops, which is time-consuming because the dimensions are large. Therefore, if the matrix and vector operations can be processed in parallel, the training efficiency of RNN language models can be greatly improved.

A graphics processing unit (GPU) integrates graphics and image computation on a dedicated chip and contains a large number of execution units that can easily be used for parallel processing, in contrast to the largely serial processing of a CPU. Therefore, in this paper, the functions of the original training tool are first modified and recompiled in a CUDA environment to train the RNN Japanese language model on the GPU. With the other parameters unchanged, the number of word samples trained per second on the GPU is 2~3 times that on the CPU.

Training the neural network with the GPU's particular strengths in linear algebra can accelerate model convergence to a certain extent, so this paper applies parallel optimization after moving model training to the GPU. During neural network training, the batch method is usually combined with the SGD algorithm: the gradient vectors of multiple samples are computed and the network parameters are then updated in one step. In each training step, m samples are randomly selected from the labeled sample set $$\left\{ {1, \cdots ,l} \right\}$$ as a subset for training; this subset is usually called a batch. Processing m samples at once (batch = m) often takes far less time than training m times with one sample at a time (batch = 1), so batching accelerates model training. In addition, batching turns the matrix-vector operations in the network computation into matrix-matrix operations, which makes full use of the computing advantages of the GPU and further improves training speed.
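The effect described here, replacing many matrix-vector products by a single matrix-matrix product over a stacked batch, can be sketched as follows in PyTorch; the matrix sizes and the optional use of CUDA are illustrative assumptions.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
U = torch.randn(512, 300, device=device)              # one weight matrix of the network

# batch = 1: m separate matrix-vector products, one small operation per sample
xs = [torch.randn(300, device=device) for _ in range(64)]
h_single = [U @ x for x in xs]

# batch = m: the samples are stacked and processed with one matrix-matrix product,
# which keeps the GPU's many execution units busy in parallel
X = torch.stack(xs)                                   # shape (m, 300)
h_batched = X @ U.T                                   # shape (m, 512), same values as h_single
```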

In the traditional neural network batch mode there is no special requirement on the training order of the labeled samples: with randomly ordered samples, the order in which they are processed has no intrinsic effect on the final convergence of the model. Training an RNN language model is slightly different. Within a sentence sample, the training of a word sample must fully take the word's history into account, so a word can only be processed after all the words before it have been processed, rather than selecting words randomly from the corpus for training. The traditional batch method is therefore no longer suitable for training RNN language models. For parallel training of RNN language models, several sentences need to be processed in parallel at the same moment without disrupting the order of the word samples within each sentence.

Assume that the batch size is 6, i.e., 6 sentence samples are randomly drawn from the training corpus, say sentences s1 to s6. The training method of the class-based RNN is to randomly select a sentence from the corpus, load each sentence serially into the network, and process each word in the sentence sequentially, in the order w11, w12, ⋯, w1n, w21, w22, ⋯, w2n, ..., w61, w62, ⋯, w6n. Such an approach is equivalent to treating the input of the network as a data stream and processing each word in the stream in turn.

Since the sentence samples in the corpus may differ in length, i.e., contain different numbers of word samples, when training batch sentences in parallel some sentences are shorter and finish training sooner. In that case, the next sentence sample from the remaining corpus is loaded directly into that data stream, so that batch data streams are always processed simultaneously until the whole text corpus has been loaded and trained, which completes one training cycle. During training, all data streams share the four weight matrices U, W, V and X.
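A minimal sketch of this data-stream arrangement is given below: batch parallel streams each hold one sentence, one word per stream is consumed at each step so that word order within every sentence is preserved, and a stream whose sentence ends is refilled with the next unused sentence. The generator interface and the toy sentences are assumptions for illustration.

```python
from collections import deque

def parallel_sentence_streams(sentences, batch=6):
    """Yield one word per data stream at each step, preserving word order inside every sentence."""
    pending = deque(sentences)
    streams = [deque(pending.popleft()) for _ in range(batch)]    # one sentence per stream
    while any(streams):
        step = []
        for i, stream in enumerate(streams):
            if not stream and pending:                            # a short sentence finished early:
                streams[i] = stream = deque(pending.popleft())    # refill the stream with a new sentence
            step.append(stream.popleft() if stream else None)     # None pads exhausted streams
        yield step                                                # these batch words are trained in parallel

# Toy corpus of sentences with unequal lengths.
corpus = [["w11", "w12", "w13"], ["w21"], ["w31", "w32"], ["w41", "w42", "w43", "w44"],
          ["w51"], ["w61", "w62"], ["w71"], ["w81", "w82"]]
for step in parallel_sentence_streams(corpus, batch=6):
    print(step)
```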

Corpus construction based on speech-to-text alignment technology

Speech recognition converts speech signals into corresponding text, and the corpus used to train speech recognition models requires aligned speech and text samples. In this paper, the Japanese speech recognition corpus construction problem is formalized as a speech-text alignment problem, i.e., determining where a speech segment starts and ends in the audio and identifying the portion of the reference text that corresponds to that speech segment [22].

Japanese Text Alignment System Framework

First, based on the differences between the spectral density characteristics of speech and noise, a Gaussian-model-based speech endpoint detection technique is used to separate the noise segments from the speech segments in the long audio and to mark the endpoints of the speech segments. Second, the Bi LSTM-CTC acoustic model constructed above recognizes the speech segments one by one as transcribed Japanese, producing a speech recognition result sequence and the times corresponding to this sequence; this result sequence is referred to as the hypothesis sequence. Meanwhile, the reference text is corrected and processed to obtain the reference text sequence. Finally, text features are extracted from the sentences of the recognition result and the reference text based on the vector space model (VSM), the similarity between the hypothesis sequence and the reference text sequence is calculated, and the set of aligned segments is obtained with the dynamic time warping (DTW) algorithm.

Speech endpoint detection and denoising based on Gaussian mixture modeling

First, the speech signal of the original audio is downsampled to 8 kHz and the spectrum is divided into six subbands using the Fourier transform. According to the Nyquist frequency theorem, the informative part of the speech spectrum lies below 4 kHz, so the energy ranges of the six subbands are set to 80-250 Hz, 250-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz; at the same time, a sequence of feature vectors for the energy of each band is computed using the crossover frequency method. Each subband energy feature vector is modeled, and each model mixes two Gaussian distributions, one for speech and one for noise, as shown in equation (7): $$f(x|Z,r) = \frac{1}{{\sqrt {2\pi {\theta ^2}} }}\,{e^{ - \frac{{{{(x - u)}^2}}}{{2{\theta ^2}}}}}$$

where x is the sequence of feature vectors of the six subband energies. r represents the set of model parameters u and θ, where u is the mean of the input signal and θ² is its variance; these determine the probability value of the Gaussian distribution of speech for each frame. In Eq. (7), if parameter Z is 0, the noise probability is computed; if Z is 1, the speech probability is computed.

The log-likelihood ratio is computed for each subband feature, and the weighted global likelihood-ratio sum is then obtained as shown in equation (8), where Ki is the weighting factor of the likelihood ratio. A local threshold Tτ and a global threshold Ta are set: if any of the six subband features has a likelihood ratio exceeding Tτ, the frame is considered to contain speech; if the weighted sum of the likelihood ratios of the six subbands exceeds Ta, the frame is also considered to contain speech, as shown in equation (9): $$L(x(n)) = \sum\limits_i {{K_i}L(x(n),i)} = \sum\limits_i {{K_i}\log \left( {\frac{{{f_s}(x(n),i)}}{{{f_n}(x(n),i)}}} \right)}$$ $${F_{vad}}(n) = \left\{ {\begin{array}{l} 1&{L(x(n)) > {T_a}\ \vert\vert\ {L_i} > {T_\tau }} \\ 0&{else} \end{array}} \right.$$
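A minimal NumPy sketch of the decision rule in equations (7)-(9) follows; the Gaussian parameters, weights Ki, and thresholds are made-up values for illustration rather than trained or tuned ones.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Equation (7): Gaussian likelihood of one subband energy feature."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def frame_is_speech(x, speech_params, noise_params, K, T_local, T_global):
    """Equations (8)-(9): per-subband and weighted global likelihood-ratio test for one frame."""
    # x: energies of the six subbands for this frame; *_params: (mean, variance) per subband.
    llr = np.array([np.log(gaussian_pdf(x[i], *speech_params[i]) /
                           gaussian_pdf(x[i], *noise_params[i])) for i in range(len(x))])
    local_hit = np.any(llr > T_local)            # some single subband exceeds the local threshold
    global_hit = np.sum(K * llr) > T_global      # the weighted sum exceeds the global threshold
    return bool(local_hit or global_hit)         # F_vad(n) = 1 if either condition holds

# Illustrative call with made-up parameters for the six subbands.
speech_params = [(5.0, 2.0)] * 6
noise_params = [(1.0, 2.0)] * 6
frame = np.array([4.8, 5.1, 0.9, 5.0, 4.7, 5.2])
print(frame_is_speech(frame, speech_params, noise_params,
                      K=np.full(6, 1 / 6), T_local=2.0, T_global=1.0))
```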

Finally the parameters of the Gaussian mixture model are updated adaptively: the mean and variance of the speech and noise are recalculated with maximum likelihood estimation based on the data already classified. The whole process is then repeated for the next frame of speech, labeling the speech and noise endpoints and removing the segments between the noise points.

Text sentence alignment based on dynamic time regularization algorithm

The segments delimited by the speech endpoints are passed through Japanese speech recognition to obtain the hypothesis sequences, and the reference text is processed to obtain the reference text sequences. The reference and hypothesis sequences are converted into phoneme sequences because Japanese exhibits homophony, and phoneme sequences represent the degree of similarity between two sentences more accurately than word sequences. The hypothesis sequence and the reference text sequence are then aligned with the dynamic time warping (DTW) algorithm to obtain the correct set of aligned segments.

Alignment is typically carried out at the word level or the sentence level. Word-level alignment is simpler and faster, but its accuracy is closely tied to how complete the corpus is. Given the goal of this experiment and the fact that the dataset contains errors, this paper adopts sentence-based alignment. Meanwhile, the vector space model (VSM) is used to represent both the speech recognition result sequence and the reference text sequence as sentence vectors: each sentence contains a number of phonemes T1, T2, ⋯, Tn and can be represented as the vector $$D\left( {{w_1},{w_2}, \cdots ,{w_k}} \right)$$.

The alignment process takes the hypothesis sequence as the standard and searches the reference text sequentially for the part most similar to the hypothesis sequence. In this paper, cosine similarity is used to measure the similarity between two sentence sequences from the reference text and the result text, as shown in equation (10): $$sim\left( {{D_i},{D_j}} \right) = \frac{{\sum\limits_{k = 1}^n {{w_{ik}} \times {w_{jk}}} }}{{\sqrt {\sum\limits_{k = 1}^n {w_{ik}^2} } \cdot \sqrt {\sum\limits_{k = 1}^n {w_{jk}^2} } }}$$

where wk is the weight of phoneme Tk in each sentence, which can be calculated by term frequency-inverse document frequency (TF-IDF) as shown in equation (11). TFk is the number of times phoneme Tk occurs in the sentence, N is the total number of sentences in the reference text, and DFk denotes the number of sentences of the reference text in which Tk occurs: $${w_k} = T{F_k} \times \log \left( {\frac{N}{{D{F_k}}}} \right)$$
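A small sketch of the sentence-vector weighting and similarity of equations (10) and (11) follows; representing each sentence as a list of phoneme strings and the toy sequences are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vector(sentence, reference_sentences):
    """Equation (11): TF-IDF weight of each phoneme in one sentence."""
    n = len(reference_sentences)
    tf = Counter(sentence)
    return {p: tf[p] * math.log(n / max(1, sum(p in s for s in reference_sentences)))
            for p in tf}                               # max(1, ...) guards phonemes unseen in the reference

def cosine_sim(d_i, d_j):
    """Equation (10): cosine similarity between two sparse sentence vectors."""
    num = sum(w * d_j.get(k, 0.0) for k, w in d_i.items())
    den = math.sqrt(sum(w * w for w in d_i.values())) * math.sqrt(sum(w * w for w in d_j.values()))
    return num / den if den else 0.0

# Toy phoneme sequences: one hypothesis sentence compared with two reference sentences.
reference = [["k", "o", "n", "n", "i", "ch", "i", "w", "a"], ["s", "a", "y", "o", "n", "a", "r", "a"]]
hypothesis = ["k", "o", "n", "i", "ch", "i", "w", "a"]
ref_vecs = [tfidf_vector(s, reference) for s in reference]
hyp_vec = tfidf_vector(hypothesis, reference)
print([round(cosine_sim(hyp_vec, v), 3) for v in ref_vecs])   # most similar reference sentence scores highest
```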

The distance (cost) matrix S between the hypothesis sequence and the reference text sequence is computed with equation (10).

The dynamic time warping algorithm finds the best alignment between two temporally related sequences. The hypothesis sequence and the reference text sequence undergo a warping transformation along the time axis so that their similarity is maximized. The task of the DTW algorithm is then to find a path L from (0, 0) to (m, n) on the matrix S such that the total accumulated cost of the path is minimized. This path is called the optimal alignment path and is defined by equation (12), where $$sim\left( {{D_i},{D_j}} \right)$$ is the similarity between sentence i of the reference text and sentence j of the hypothesis text: $$L = \arg \min \sum\limits_{(i,j) \in L} {sim\left( {{D_i},{D_j}} \right)}$$

The dynamic time warping algorithm exploits the independence of subproblems and uses the idea of decomposition to split the large problem into a number of small, simple subproblems. In general, to improve search efficiency, the data points on the DTW path are restricted to lie within a parallelogram and the slope of the path is kept between 0.5 and 2. Accordingly, there are only three possible predecessor grid points for the current point, and the optimal alignment path can be extended as in Eq. (13): $$\gamma (i,j) = \min \left\{ {\begin{array}{l} {\gamma (i - 1,j - 1) + \eta \, sim\left( {{D_i},{D_j}} \right)} \\ {\gamma (i - 2,j - 1) + sim\left( {{D_i},{D_j}} \right)} \\ {\gamma (i - 1,j - 2) + sim\left( {{D_i},{D_j}} \right)} \end{array}} \right.$$

where γ(i, j) is the accumulated path length at point (i, j). The parameter η (η > 1) is the weight of the step from point (i − 1, j − 1) to point (i, j); it makes the cost of the diagonal step to (i, j) greater than that of the other two steps to (i, j). After matching with the DTW algorithm, the optimal path is obtained.
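A minimal sketch of the DTW recurrence in equation (13) over a precomputed sentence-similarity matrix is given below. Taking 1 − sim(Di, Dj) as the step cost, so that minimizing the accumulated cost favors similar sentence pairs, and the value of η are interpretive assumptions for the example; path backtracking is omitted for brevity.

```python
import numpy as np

def dtw_align(sim, eta=1.2):
    """Accumulate costs following the recurrence of equation (13) and return the final path cost."""
    m, n = sim.shape
    cost = 1.0 - sim                              # assumed step cost: lower for more similar sentences
    gamma = np.full((m, n), np.inf)
    gamma[0, 0] = cost[0, 0]
    for i in range(m):
        for j in range(n):
            if i == j == 0:
                continue
            candidates = []                       # only the three predecessors allowed by the slope constraint
            if i >= 1 and j >= 1:
                candidates.append(gamma[i - 1, j - 1] + eta * cost[i, j])
            if i >= 2 and j >= 1:
                candidates.append(gamma[i - 2, j - 1] + cost[i, j])
            if i >= 1 and j >= 2:
                candidates.append(gamma[i - 1, j - 2] + cost[i, j])
            if candidates:
                gamma[i, j] = min(candidates)
    return gamma[m - 1, n - 1]

# Toy similarity matrix between 3 hypothesis sentences and 4 reference sentences.
S = np.array([[0.9, 0.2, 0.1, 0.0],
              [0.1, 0.8, 0.3, 0.2],
              [0.0, 0.1, 0.2, 0.9]])
print(dtw_align(S))
```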

Analysis of the optimization effect of Japanese speech recognition model
Comparison of language model perplexity
Criteria for evaluating language models

The perplexity (PPL) metric is usually used to judge the quality of a language model; it is grounded in information theory. Usually, the smaller the perplexity obtained on a Japanese test set, the more strongly the language model constrains the language and the better the modeling effect. The perplexity is calculated as follows: $$perplexity(S) = p{({w_1},{w_2},{w_3},...,{w_m})^{ - \frac{1}{m}}} = \sqrt[m]{{\frac{1}{{p({w_1},{w_2},{w_3},...,{w_m})}}}} = \sqrt[m]{{\prod\limits_{i = 1}^m {\frac{1}{{p({w_i}|{w_1},...,{w_{i - 1}})}}} }}$$

The perplexity describes the ability of a language model to predict a language sample. Assuming it is known that the phrase (w1, w2, w3, ..., wm) occurs in the corpus, the higher the probability the language model assigns to this phrase, the better the model fits the corpus. From the defining equation, the perplexity is the geometric mean of the inverses of the per-word probabilities, i.e., the average number of choices available to the model when predicting the next word; it can also be thought of as the average number of candidate words per position when the language model predicts a certain type of linguistic phenomenon. For example, when the perplexity of a language model is 30, then on average about 30 words are equally plausible candidates for the next word at each prediction step.

During language model training, the logarithmic form of the perplexity is usually used: $$\log (perplexity(S)) = - \frac{1}{m}\sum\limits_{i = 1}^m {\log p({w_i}|{w_1},...,{w_{i - 1}})}$$

Using this additive form speeds up the computation compared with taking the m-th root of a product, and it avoids floating-point underflow caused by the very small value of the probability product.
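As a small illustration, the following sketch computes the perplexity of a test sentence from its per-word conditional probabilities using the logarithmic form above; the probability values are made up.

```python
import math

def perplexity(word_probs):
    """Perplexity from the conditional probabilities p(w_i | w_1, ..., w_{i-1}) of one test sentence.

    Summing log-probabilities instead of multiplying raw probabilities avoids
    floating-point underflow when the probability product becomes very small.
    """
    m = len(word_probs)
    log_ppl = -sum(math.log(p) for p in word_probs) / m
    return math.exp(log_ppl)

# If every word is predicted with probability 1/30, the perplexity is 30,
# i.e. on average 30 candidate words are equally plausible at each position.
print(perplexity([1 / 30] * 12))   # -> 30.0 (up to rounding)
```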

Data preparation

In this study, Japanese text was crawled from WeChat with an online crawler; the corpus was cleaned by removing garbled text, numbers, Chinese and English, and symbols and by applying error correction and other cleaning operations, and the data was segmented into words with a segmentation tool, giving a total of 260 MB. The training set contains 12,468,876 sentences and the validation set contains 103,142 sentences. The lexicon in this study consists only of words with a frequency greater than 10.

Analysis of experimental results

The experiments compare the 3-gram model with the Kneser-Ney smoothing algorithm, LSTM, and bidirectional LSTM against the parallel-optimized RNN language model; the comparison of language model perplexity is shown in Fig. 4. KN3 denotes the 3-gram model with Kneser-Ney smoothing, and Bi LSTM denotes the neural network model with bidirectional LSTM. From the figure, the perplexity of the parallel-optimized RNN language model of this paper is greatly reduced on both the validation set and the test set compared with the other language models. On the validation set of the Japanese data, the perplexity of this paper's language model is 13.69%, a 62.34% decrease compared with the 3-gram language model.

Figure 4.

Language model PPL comparison

Text Alignment Experimental Results
Corpus evaluation criteria

In order to evaluate the correctness of the generated corpus, 15 episodes of Japanese TV dramas with a total of 10 hours of recognition results were selected to form a test set for manual annotation, and the generated data was compared with the manual annotation results. Since the labeled results may be subjective, a different annotator verifies the results and judges whether the annotated time points are reasonable. The chosen annotation format is TextGrid, which corresponds to manual annotation of the audio data with Praat; the annotator needs to locate the time point at which every sentence in each episode of the TV series starts.

After manual labeling, the segment data generated in this paper can be compared with the manual labels. Since small boundary fluctuations have little effect on the correctness rate, a tolerance t is set: when the time difference between a cut point and the corresponding annotation is less than t, the segmentation is considered correct, as calculated in equation (16): $$\left| {\,{\text{time at the cut position}} - {\text{time at the corresponding label}}\,} \right| \leq t$$

Here t defines a confidence interval around each cut time point, which is used to decide whether the difference between the actual cut time and the labeled time lies within the range $$\left[ { - t,t} \right]$$ and thus whether the cut point is qualified. In the implementation, the two times are not subtracted directly; instead, t is added to and subtracted from each cut time point to obtain a fault-tolerance interval for that cut point, and if a time point in the list of labeled time points falls within this interval, the cut point is judged qualified. The accuracy of the audio segmentation is obtained as the ratio of the number of qualified points to the number of all cut points, as calculated in equation (17): $${\text{Accuracy}} = \frac{{{\text{Number of eligible time points}}}}{{{\text{All time points}}}} \times 100\% $$

where the total number of time points is the number of cut points actually produced by the segmentation. Different values of t give different numbers of qualified points under Eq. (16): a smaller t corresponds to a stricter accuracy criterion and fewer qualified points, while a larger t is more lenient and more points qualify. In the experiment, in order to obtain an accurately segmented corpus, the value of t is limited to at most 1.5 s, and comparison experiments with several tolerance values are carried out to examine the accuracy under different tolerance conditions.
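The following sketch computes the segmentation accuracy of equations (16) and (17): each cut point is widened into a ±t fault-tolerance interval and counts as qualified if some labeled time falls inside it. The time values are made up for illustration.

```python
def cut_accuracy(cut_times, labeled_times, t=1.5):
    """Equations (16)-(17): percentage of cut points whose +/- t interval contains a labeled time."""
    qualified = 0
    for cut in cut_times:
        low, high = cut - t, cut + t                       # fault-tolerance interval of this cut point
        if any(low <= lab <= high for lab in labeled_times):
            qualified += 1
    return 100.0 * qualified / len(cut_times)              # accuracy over all cut points, in percent

# Made-up example: automatic cut points vs. manually labeled sentence start times (seconds).
auto_cuts = [3.2, 9.8, 17.5, 26.1]
manual_labels = [3.0, 10.4, 19.6, 26.3]
for tol in (1.5, 1.0, 0.5):
    print(tol, cut_accuracy(auto_cuts, manual_labels, t=tol))
```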

Test results

To verify the quality of the audio segmentation, the labeled end-of-sentence times are compared with the actual cut-point times, and the number of qualified time points in the test data, i.e., the number of labeled times that fall within a cut-point time interval, is counted; the segmentation quality of the TV-series audio is then verified through the accuracy over the total number of points. To compare the accuracy under different tolerances, three values of t are chosen within a reasonable range: t = 1.5, t = 1, and t = 0.5, in seconds, consistent with the units of the cut-point times. The accuracy of the segmentation results under different values of t is shown in Fig. 5. As t increases, the accuracy of judging the cut points increases. The best result is a sentence segmentation accuracy of 91.33%, obtained with t = 1.5, which means that for most cut points the gap between the actual cut time and the labeled time can be kept within 1.5 seconds. With t = 0.5 the sentence segmentation accuracy is 68.34%, indicating that it is harder to keep the time gap within 0.5 seconds. The whole corpus generation process is fully automated, which provides data support for improving the accuracy of the Japanese speech recognition model.

Figure 5.

Results of accuracy rate

Overall effect of Japanese speech recognition model
Speech Recognition Model Evaluation Criteria

Optimizing the RNN language model and automatically generating the corpus allow the speech recognition model to exclude the interference of irrelevant context and improve recognition accuracy. A commonly used evaluation metric for speech recognition is the word error rate (WER). The WER of a test set is the quotient of the number of the three types of errors accumulated over all test sentences and the total number of words in the labeled text: $$WER = \frac{{\# Insertion + \# Deletion + \# Substitute}}{{Words}}$$

In Eq. (18), #Insertion denotes the number of insertion errors, #Deletion denotes the number of deletion errors, #Substitute denotes the number of substitution errors, and Words denotes the total number of words.
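As a hedged sketch, Eq. (18) can be computed with a standard word-level Levenshtein alignment between the recognized sequence and the reference transcript; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Equation (18): (#insertions + #deletions + #substitutions) / total number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up romanized example: one substitution and one deletion over five reference words.
print(word_error_rate("watashi wa nihongo o hanashimasu",
                      "watashi wa eigo hanashimasu"))   # -> 0.4
```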

Comparison with the baseline model

The Bi LSTM-CTC speech recognition model proposed in this paper builds on the parallel-optimized RNN language model and adds a corpus constructed with the speech-text alignment technique in order to enhance the generalization ability of the speech recognition model. CNN-CTC and LSTM-CTC are selected as baseline models for comparison with Bi LSTM-CTC. For all three models the number of network layers is set to 2, Dropout is used to prevent overfitting, the weights are initialized with random orthogonal initialization, and the learning rate is set to 0.001. To conveniently observe the decline of the loss value, the loss is saved once every 4000 speech samples during training; the loss curves are shown in Fig. 6. Of the three, the CNN-CTC model converges fastest. Bi LSTM-CTC also shows a fast initial decline in loss and then declines steadily; its loss equals that of the CNN-CTC model when the amount of data reaches 61, and its final converged loss is smaller than that of the CNN-CTC model. LSTM-CTC, compared with the former two, also decreases quickly at the beginning but then declines very slowly; when Bi LSTM-CTC and CNN-CTC have already converged, the loss of LSTM-CTC is still relatively high and its training has not finished.

Figure 6.

Loss curve

The number of parameters and the test-set word error rate of the three models are shown in Figure 7. The WER of LSTM-CTC is high because its loss value is still decreasing. Bi LSTM-CTC not only has fewer parameters than LSTM-CTC but also trains one epoch in about half the time LSTM-CTC needs. Compared with CNN-CTC, the word error rate of Bi LSTM-CTC is 6.2 percentage points lower, so its performance is better. From the above experiments it can be concluded that Bi LSTM-CTC trains faster and has fewer parameters than LSTM-CTC, and achieves higher speech recognition accuracy than CNN-CTC, although it trains more slowly than CNN-CTC because Bi LSTM-CTC involves recurrence across time steps.

Figure 7.

The number of parameters and the error rate of the test set

Comparison with other models

The WERs of Bi LSTM-CTC and other models are shown in Fig. 8. The Bi LSTM-CTC model reduces the word error rate by 7.49 percentage points compared with the traditional GMM-HMM model, which is a remarkable improvement. Its word error rate is also lower than that of the end-to-end speech recognition models built with BLSTM and DCNN, indicating that the model in this paper generalizes better.

Figure 8.

Comparison with other models

Conclusion

In this paper, a Japanese speech recognition model is constructed from a Bi LSTM-CTC acoustic model together with a parallel-optimized RNN language model and a corpus built with speech-to-text alignment technology. The optimized RNN language model is evaluated with the perplexity metric: its perplexity is 13.69%, which is 62.34% lower than that of the 3-gram language model. The effectiveness of the speech-to-text alignment technique is examined with the accuracy metric, yielding a sentence segmentation accuracy of 91.33%. The word error rate of the constructed Japanese speech recognition model is lower than that of the baseline models and other speech recognition models. This shows that the Japanese speech recognition model designed in this paper has good application prospects.
