Language model optimization and corpus construction techniques for Japanese speech recognition
Published Online: Mar 21, 2025
Received: Nov 08, 2024
Accepted: Feb 24, 2025
DOI: https://doi.org/10.2478/amns-2025-0697
© 2025 Zhou Huang et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Corpus development has a history of nearly 60 years. A corpus is a language database built from linguistic facts of real language use; with the help of modern computer technology, it adopts a data-driven, empirical research methodology to support comprehensive research on language and on language learning and use [1-3]. Corpus development, construction, and application have been emphasized all over the world, and various types of corpora have come into being, providing rich and informative language materials for language research. Compared with the vigorous development of corpus research in Europe and America, Japanese corpus research has long lagged behind [4-6]. The reasons include objective constraints such as copyright issues, the slow change in researchers' awareness, and relatively underdeveloped corpus infrastructure, all of which are bottlenecks hindering its development [7-8]. With the promotion of the "One Belt, One Road" national strategy and the rapid development of international trade cooperation, contemporary college students must not only practice the five basic skills of listening, speaking, reading, writing, and translating when learning foreign languages; the times also place higher demands on the interpreting ability of foreign language graduates [9-12]. How can the quality of foreign language classroom teaching be improved within limited time? How can foreign language talents be cultivated to meet market demand more effectively? These questions challenge current foreign language teaching and compel educators to think deeply [13-15]. The development of information technology and the continuous improvement of storage technology provide convenience for foreign language teaching and research; optimizing the language model for Japanese speech recognition and constructing a Japanese corpus therefore has positive guiding significance for Japanese language teaching and research [16-18].
This study constructs a Japanese speech recognition model based on a Bi-LSTM-CTC acoustic model. The language model module of this speech recognition system uses an RNN-based language model. A parallel optimization training algorithm is proposed: the RNN language model is trained on the GPU, exploiting the GPU's powerful computational capability to accelerate the matrix and vector operations that dominate network training. The Japanese speech recognition corpus is constructed with a speech-text alignment technique to improve model accuracy. In addition, the optimization effect of the Bi-LSTM-CTC-based Japanese speech recognition model is evaluated with three metrics: perplexity, accuracy, and word error rate.
Since Japanese has many basic phonetic units, there are more modeling options for Japanese continuous speech recognition. A Recurrent Neural Network (RNN) is a network structure with memory capability and is well suited to modeling temporal data such as speech and text [19], which makes it appropriate for speech recognition. LSTM adds an internal cell state to the basic RNN and is an improved version of it. Its cell structure is shown in Fig. 1.

Figure 1: LSTM unit structure
LSTM processes data in a unidirectional, front-to-back flow. This flow pattern often causes LSTM to lose some information: for speech data, context on both sides of a frame is needed to extract contextual information accurately. Therefore, this study uses Bi-LSTM instead of LSTM.
The propagation of Bi-LSTM can be divided into three parts. First, the forward LSTM hidden layer computes features of the forward signal to obtain the forward output. Second, the reverse LSTM hidden layer treats the time-reversed input signal as a pattern and computes its features to obtain the backward output. Third, the two sets of outputs are combined, and the resulting Bi-LSTM output is compared with the previously input text to determine whether there are any anomalies. This approach effectively mines the features of the data in both directions and captures tighter correlations [20]. The connectionist temporal classification (CTC) algorithm is essentially an end-to-end algorithm that maps input sequences to output sequences. The training process of the neural network is as follows: the network is initialized, the training data is input and normalized, the network model is built and its parameters set, and training loops over num_epochs; in each iteration the training features (train_features) are fed through the network to compute the loss, the gradients are cleared, and the next loop begins, with an optimizer performing the parameter updates. In this study, the adaptive moment estimation (Adam) optimizer is used to train the Japanese speech model, and the Softmax function is used as the activation function of the output layer of the acoustic model. The structure of the acoustic model based on Bi-LSTM-CTC is shown in Fig. 2.

Figure 2: Model structure based on Bi-LSTM-CTC
The Japanese speech recognition model based on Bi-LSTM-CTC has 2 fully connected layers followed by 3 Bi-LSTM network layers, and finally 2 more fully connected layers with 200 nodes per layer. The activation function of the output layer is Softmax, and the ReLU family is used for all other layers. The input layer of Bi-LSTM-CTC has the same function and parameters as that of the LSTM-CTC model. The first two fully connected layers change the dimensionality of the input features so that they can serve as inputs to the Bi-LSTM. The data then flows into the fully connected layer with the Softmax function; the discrepancy between the model output and the pre-labeled data is measured by the CTC loss function, which aligns the output to the labeled results.

Assuming a given scalar $z_i$ is the $i$th activation of the output layer, the Softmax function converts it into a posterior probability, as in equation (1):

$$\operatorname{softmax}(z_i)=\frac{e^{z_i}}{\sum_{j} e^{z_j}} \tag{1}$$

Assuming that a sample consists of an input feature sequence $x$ and its label sequence $z$, the CTC loss is the negative log-probability of the label sequence, as in equations (2)-(3):

$$P(z \mid x)=\sum_{\pi \in \mathcal{B}^{-1}(z)} \prod_{t=1}^{T} y_{\pi_t}^{t} \tag{2}$$

$$L_{CTC}=-\ln P(z \mid x) \tag{3}$$

In equation (2), $y_k^t$ is the Softmax output for symbol $k$ at frame $t$, $\pi$ is a frame-level alignment path, and $\mathcal{B}$ is the many-to-one mapping that removes blanks and repeated symbols.

LeakyReLU is a modified version of the ReLU function, whose functional expression is shown in equation (4):

$$\operatorname{LeakyReLU}(x)=\begin{cases}x, & x>0 \\ \alpha x, & x \leq 0\end{cases} \tag{4}$$

In equation (4), parameter $\alpha$ is a small positive constant that gives the negative half-axis a non-zero slope and avoids the dying-neuron problem of ReLU.
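For concreteness, the following is a minimal PyTorch sketch of an acoustic model with this layout (two fully connected layers, three Bi-LSTM layers, two output fully connected layers) trained with CTC loss and the Adam optimizer. The feature dimension, hidden size, vocabulary size, and training data below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class BiLSTMCTC(nn.Module):
    """Sketch of the Bi-LSTM-CTC layout described above: two fully connected
    layers, three Bi-LSTM layers, two output layers, log-Softmax for CTC."""
    def __init__(self, n_feats=39, n_hidden=200, n_tokens=100):
        super().__init__()
        self.fc_in = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(n_hidden, n_hidden, num_layers=3,
                              bidirectional=True, batch_first=True)
        self.fc_out = nn.Sequential(
            nn.Linear(2 * n_hidden, 200), nn.ReLU(),
            nn.Linear(200, n_tokens + 1),      # +1 for the CTC blank symbol
        )

    def forward(self, x):                      # x: (batch, time, n_feats)
        h = self.fc_in(x)
        h, _ = self.bilstm(h)
        return self.fc_out(h).log_softmax(-1)  # CTC loss expects log-probs

model = BiLSTMCTC()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data.
x = torch.randn(4, 150, 39)                    # 4 utterances, 150 frames each
targets = torch.randint(1, 101, (4, 20))       # label sequences (0 = blank)
in_lens = torch.full((4,), 150, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)

log_probs = model(x).transpose(0, 1)           # CTCLoss wants (T, N, C)
loss = ctc(log_probs, targets, in_lens, tgt_lens)
opt.zero_grad(); loss.backward(); opt.step()
```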
The RNN language model computes a hidden state from the current word and the previous hidden state, and predicts the next word from that state, as in equations (5)-(6):

$$s(t)=f\big(Uw(t)+Ws(t-1)\big) \tag{5}$$

$$y(t)=g\big(Vs(t)\big) \tag{6}$$

where $w(t)$ is the one-hot vector of the current word, $s(t)$ is the hidden state, $f(\cdot)$ is the sigmoid activation, and $g(\cdot)$ is the Softmax function.

The classification-based RNN framework is shown in Fig. 3. The learning and training process of the RNN language model mainly adjusts the four weight matrices in the network, i.e., the input weight matrix, the recurrent weight matrix, the hidden-to-class weight matrix, and the hidden-to-word weight matrix.

Figure 3: Structure of the class-based RNN
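As an illustration, here is a minimal NumPy sketch of one forward step of a class-based RNN language model, where the output is factorized into a class distribution and a within-class word distribution. The matrix names, sizes, and the sigmoid/Softmax choices are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

# Vocabulary, hidden, and class sizes are illustrative assumptions.
V, H, C = 10000, 200, 100
rng = np.random.default_rng(0)
U = rng.standard_normal((H, V)) * 0.01   # input weights
W = rng.standard_normal((H, H)) * 0.01   # recurrent weights
Q = rng.standard_normal((C, H)) * 0.01   # hidden -> class weights
P = rng.standard_normal((V, H)) * 0.01   # hidden -> word weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, s_prev):
    """One time step: consume word_id, return the new state, the class
    distribution, and raw word scores (in practice the word Softmax is
    taken only over the words belonging to the predicted class)."""
    x = np.zeros(V); x[word_id] = 1.0                 # one-hot input word
    s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s_prev)))   # sigmoid hidden state
    class_probs = softmax(Q @ s)                      # P(class | history)
    word_scores = P @ s
    return s, class_probs, word_scores

s = np.zeros(H)
s, class_probs, word_scores = step(42, s)
```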
When implemented on a CPU, such matrix and vector operations can only be realized with nested loops, which are time-consuming because of the large dimensions involved. If these operations can be processed in parallel, the training efficiency of RNN language models can therefore be greatly improved.
A graphics processing unit (GPU) packages graphics and image computing functions into a separate chip. A GPU contains a large number of execution units that naturally support parallel processing, unlike the largely serial processing of a CPU. Therefore, in this paper the original training tool functions are first modified and recompiled in the CUDA environment to train the RNN Japanese language model on the GPU. With all other parameters unchanged, the number of word samples trained per second on the GPU is 2~3 times that on the CPU.
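The gain comes from offloading the dense matrix products that dominate RNN training. The small PyTorch sketch below shows the kind of CPU-versus-GPU comparison involved; the matrix sizes, batch width, and iteration count are assumptions chosen only for illustration.

```python
import time
import torch

# Illustrative timing of a recurrent-style batched update on CPU vs GPU.
H, B = 1024, 128                 # hidden size and number of parallel streams
W = torch.randn(H, H)
X = torch.randn(H, B)

def bench(dev, iters=200):
    w, x = W.to(dev), X.to(dev)
    if dev == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        x = torch.tanh(w @ x)    # recurrent-style update, batched over B streams
    if dev == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - t0

print("cpu :", bench("cpu"))
if torch.cuda.is_available():
    print("cuda:", bench("cuda"))
```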
Training the neural network with the GPU's strengths in linear algebra can accelerate model convergence to a certain extent. This paper therefore performs parallel optimization after moving model training onto the GPU. During neural network training, a batch optimization scheme is usually combined with the SGD algorithm: the gradient vectors of multiple samples are computed first, and the network parameters are then updated once from the accumulated gradients.
In the traditional neural network batch-processing mode, there is no special requirement on the training order of labeled samples; when samples are randomly ordered, the order in which they are processed has no intrinsic effect on the final convergence of the model. Training RNN language models is slightly different: within the scope of a sentence sample, the training of a word sample must take the word's history into account, so a word can only be processed after all the words before it have been processed, rather than drawing words at random from the corpus. The traditional batch-processing method is therefore no longer suitable for training RNN language models. For parallel training of an RNN language model, several sentences must be processed in parallel at the same moment without disrupting the order of word samples within each sentence.
Assume the batch size is 6, i.e., 6 sentence samples are randomly drawn from the training corpus, denoted sentences $s_1$ to $s_6$. At each training step, one word from each of the 6 sentences is processed in parallel, so the word order within every sentence is preserved.
Since the sentence samples in the corpus may differ in length, i.e., contain different numbers of word samples, the sentences in a batch are exhausted at different times during parallel training; the training scheme must handle this length mismatch while keeping the word order within each sentence intact, as sketched below.
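One possible treatment, sketched here under the assumption that an exhausted batch slot is simply refilled with the next sentence drawn from the corpus, keeps all parallel streams busy while never reordering words inside a sentence. The function and variable names are illustrative.

```python
from collections import deque

def batched_streams(sentences, batch_size=6):
    """Yield lists of (slot, word) pairs, one word per active slot per step.
    Word order inside each sentence is preserved; an exhausted slot is
    refilled with the next sentence from the corpus."""
    queue = deque(sentences)
    slots = [deque(queue.popleft()) for _ in range(min(batch_size, len(queue)))]
    while any(slots) or queue:
        step = []
        for i, slot in enumerate(slots):
            if not slot and queue:                  # refill an exhausted slot
                slots[i] = slot = deque(queue.popleft())
            if slot:
                step.append((i, slot.popleft()))
        yield step

corpus = [["今日", "は", "晴れ", "です"], ["雨"], ["明日", "も", "晴れ"]]
for step in batched_streams(corpus, batch_size=2):
    print(step)   # each step advances every active sentence by one word
```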
Speech recognition works by converting speech signals into corresponding text, so the corpus for training speech recognition models requires aligned speech and text samples. In this paper, we formalize the Japanese speech recognition corpus construction problem as a speech-text alignment problem, i.e., determining where each speech segment begins and ends in the audio, and identifying the portion of the reference text that corresponds to that segment [22].
First, based on the differences in the spectral density characteristics of speech and noise, a Gaussian-model-based speech endpoint detection technique separates the noise segments from the speech segments in the long audio and marks the endpoints of the speech segments. Second, the Bi-LSTM-CTC acoustic model constructed above recognizes the speech segments one by one as transcribed Japanese, yielding a speech recognition result sequence and its corresponding times; this result sequence is called the hypothesis sequence. Meanwhile, the reference text is corrected and processed to obtain the reference text sequence. Finally, text features are extracted from the sentences of the recognition result and the reference text using the vector space model (VSM), the similarity between the hypothesis sequence and the reference text sequence is calculated, and the set of aligned segments is obtained with the dynamic time warping (DTW) algorithm.
First, the speech signal of the original audio is downsampled to 8 kHz, and the spectrum is divided into six subbands using the Fourier transform. According to the Nyquist sampling theorem, the informative part of the speech spectrum lies below 4 kHz, so the energy ranges of the six subbands are set to 80-250 Hz, 250-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz, and a sequence of energy feature vectors for each band is computed using the crossover frequency method. Each subband energy feature is modeled by a mixture of two Gaussian distributions, one for speech and one for noise, as shown in equation (7):
$$p(x_k)=w_{s,k}\,N\big(x_k;\mu_{s,k},\sigma_{s,k}^{2}\big)+w_{n,k}\,N\big(x_k;\mu_{n,k},\sigma_{n,k}^{2}\big) \tag{7}$$

where $x_k$ is the energy feature of the $k$th subband, $N(\cdot;\mu,\sigma^2)$ is a Gaussian density, and the subscripts $s$ and $n$ denote the speech and noise components with mixture weights $w_{s,k}$ and $w_{n,k}$.

The log-likelihood ratio is computed for each subband feature, and the global weighted sum of log-likelihood ratios is obtained as shown in equation (8):

$$\Lambda=\sum_{k=1}^{6}\omega_{k}\log\frac{N\big(x_k;\mu_{s,k},\sigma_{s,k}^{2}\big)}{N\big(x_k;\mu_{n,k},\sigma_{n,k}^{2}\big)} \tag{8}$$

where $\omega_k$ is the weight of the $k$th subband; a frame is judged to be speech when $\Lambda$ exceeds a preset threshold.
Finally, the parameters of the Gaussian mixture model are updated adaptively: the means and variances of the speech and noise components are re-estimated by maximum likelihood from the frames that have already been classified. The whole process is repeated for the next frame of speech, labeling speech endpoints and noise endpoints and removing the segments between noise points.
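The following is a minimal NumPy sketch of this frame-level decision: subband energies are extracted from an 8 kHz frame and scored against speech and noise Gaussians via the weighted log-likelihood ratio of equation (8). In practice the means and variances are adapted online; here they are fixed, assumed parameters, and the band edges follow the text.

```python
import numpy as np

BANDS = [(80, 250), (250, 500), (500, 1000),
         (1000, 2000), (2000, 3000), (3000, 4000)]   # Hz, from the text
FS = 8000                                            # sampling rate after downsampling

def subband_energies(frame):
    """Sum of power-spectrum bins falling inside each subband."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BANDS])

def log_gauss(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def is_speech(frame, mu_s, var_s, mu_n, var_n, weights, threshold=0.0):
    e = np.log(subband_energies(frame) + 1e-10)      # log-energy features
    llr = log_gauss(e, mu_s, var_s) - log_gauss(e, mu_n, var_n)
    return float(weights @ llr) > threshold          # global weighted LLR, eq. (8)

frame = np.random.default_rng(0).standard_normal(256)
print(is_speech(frame, mu_s=np.full(6, 2.0), var_s=np.ones(6),
                mu_n=np.zeros(6), var_n=np.ones(6), weights=np.full(6, 1 / 6)))
```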
The segments delimited by speech endpoints undergo Japanese speech recognition to obtain hypothesis sequences, and the reference text undergoes text processing to obtain reference text sequences. Both the reference and hypothesis sequences are converted into phoneme sequences: because Japanese exhibits homophony, phoneme sequences represent the similarity between two sentences more accurately than word sequences. The hypothesis sequence and the reference text sequence are then aligned by the dynamic time warping (DTW) algorithm to obtain the correct set of aligned segments.
Alignment is typically carried out at the word or sentence level. Word-level alignment is simpler and faster, but its accuracy depends heavily on how complete the corpus is. Given the goals of this experiment and the fact that the dataset contains errors, this paper adopts sentence-based alignment. Meanwhile, the vector space model (VSM) is used to represent both the speech recognition result sequence and the reference text sequence as sentence vectors: each sentence contains a number of phonemes and is represented by the frequency of each phoneme it contains.
The alignment process takes the hypothesis sequence as the standard and searches the reference text sequentially for the part most similar to the hypothesis sequence. In this paper, cosine similarity is used to measure the similarity between two sentence vectors from the reference text and the recognition result, as shown in equation (10):
$$\operatorname{sim}(A,B)=\cos\theta=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\sqrt{\sum_{i=1}^{n}B_i^{2}}} \tag{10}$$

where $A$ and $B$ are the phoneme-frequency vectors of a hypothesis sentence and a reference sentence, respectively, and $n$ is the size of the phoneme inventory.
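A small Python sketch of this sentence-vector representation and the cosine measure of equation (10) follows; the romanized phoneme strings are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def sentence_vector(phonemes, inventory):
    """VSM representation: phoneme-count vector over a shared inventory."""
    counts = Counter(phonemes)
    return np.array([counts[p] for p in inventory], dtype=float)

def cosine(a, b):
    """Cosine similarity of two sentence vectors (eq. 10)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

hyp = ["k", "o", "N", "n", "i", "ch", "i", "w", "a"]   # hypothesis phonemes
ref = ["k", "o", "N", "n", "i", "ch", "i", "h", "a"]   # reference phonemes
inventory = sorted(set(hyp) | set(ref))
print(cosine(sentence_vector(hyp, inventory), sentence_vector(ref, inventory)))
```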
The distance matrix between the hypothesis sequence and the reference text sequence is then built from these similarities, with the distance between the $i$th hypothesis sentence and the $j$th reference sentence taken as $d(i,j)=1-\operatorname{sim}(i,j)$.
The dynamic time warping algorithm is a local optimization algorithm that finds the best alignment between two temporally related sequences. The hypothesis sequence and the reference text sequence undergo a warping transformation on the time axis so that the similarity between them is maximized. The task of the dynamic time warping algorithm is then to find a path $W=w_1,w_2,\ldots,w_K$ through the grid points of the distance matrix that minimizes the cumulative distance along the path.
The dynamic time warping algorithm exploits the overlapping-subproblem structure of the task and uses a dynamic programming strategy to split the large problem into a number of small, simple ones. In general, to improve search efficiency, the data points on a DTW path are restricted to a parallelogram, and the slope of the path is confined between 0.5 and 2. Accordingly, there are only three possible predecessor grid points for the current point, and the optimal alignment path can be extended as in Eq. (13):
$$\gamma(i,j)=d(i,j)+\min\{\gamma(i-1,j),\ \gamma(i-1,j-1),\ \gamma(i,j-1)\} \tag{13}$$

where $\gamma(i,j)$ is the cumulative distance of the optimal path ending at grid point $(i,j)$ and $d(i,j)$ is the local distance defined above.
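A compact Python sketch of this recurrence and the backtracking step is given below; for brevity it omits the parallelogram slope constraint mentioned above, and the toy similarity matrix is an assumption.

```python
import numpy as np

def dtw(d):
    """DTW over a precomputed distance matrix d (eq. 13): returns the
    optimal cumulative cost and the alignment path."""
    n, m = d.shape
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i, j] = d[i - 1, j - 1] + min(g[i - 1, j], g[i - 1, j - 1], g[i, j - 1])
    # Backtrack from (n, m) through the three allowed predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([g[i - 1, j - 1], g[i - 1, j], g[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return g[n, m], path[::-1]

sim = np.array([[0.9, 0.1], [0.2, 0.8]])
cost, path = dtw(1.0 - sim)      # distance = 1 - cosine similarity
print(cost, path)
```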
The perplexity (PPL) metric is usually used to judge the merits of a language model; its definition rests on information theory. The smaller the perplexity obtained on a Japanese test set, the more strongly the language model constrains the language and the better the modeling effect. The perplexity is calculated as:

$$\mathrm{PPL}=P(w_1w_2\cdots w_N)^{-\frac{1}{N}}=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i\mid w_1\cdots w_{i-1})}} \tag{14}$$

Perplexity describes the ability of a language model to predict a language sample: assuming the phrase $(w_1,w_2,\ldots,w_N)$ is known to appear in the test set, the model that assigns it the higher probability has the lower perplexity and fits the sample better.
Under the training of language models, the logarithmic form of the perplexity is usually used:

$$\log \mathrm{PPL}=-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i\mid w_1\cdots w_{i-1}) \tag{15}$$

The additive form is faster to compute than taking the $N$th root of a product, and it avoids the floating-point underflow caused by very small probability products.
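As a sketch, the logarithmic form translates directly into code; the `model.log_prob(word, history)` interface below is a hypothetical API, not part of any specific toolkit.

```python
import math

def perplexity(sentences, model):
    """Corpus perplexity via the additive log form (eq. 15)."""
    log_sum, n_words = 0.0, 0
    for sentence in sentences:
        history = []
        for word in sentence:
            log_sum += model.log_prob(word, history)  # log P(w | history)
            history.append(word)
            n_words += 1
    return math.exp(-log_sum / n_words)
```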
In this study, we crawled Japanese text from WeChat with a web crawler, cleaned the corpus (removing garbled text, numbers, Chinese and English, and symbols, and performing error correction), and tokenized the data with word segmentation software; the resulting data totals 260 M. The training set contains 12,468,876 sentences and the validation set 103,142 sentences. The lexicon in this study consists only of words with a frequency greater than 10.
The experiments compare a 3-gram model with Kneser-Ney smoothing, an LSTM, and a bidirectional LSTM against the parallel-optimized RNN language model; the perplexity comparison is shown in Fig. 4. KN3 denotes the 3-gram model with Kneser-Ney smoothing, and Bi-LSTM denotes the neural network model with a bidirectional LSTM. The figure shows that the perplexity of the parallel-optimized RNN language model of this paper is greatly reduced on both the validation and test sets compared with the other language models. On the validation set of Japanese data, the perplexity of this paper's language model is 13.69, a 62.34% reduction compared with the 3-gram language model.

Figure 4: Language model PPL comparison
To evaluate the correctness of the generated corpus, 15 episodes of Japanese TV dramas, totaling 10 hours of recognition results, were selected as a test set for manual annotation, and the generated data was compared with the manual annotation. Since the labeled results may be subjective, a second annotator verifies the results and judges whether each annotated time point is reasonable. The annotation format is TextGrid, produced by manually annotating the audio with Praat; the annotator marks the time point at which every sentence in each episode begins.
After manual labeling, the segment data generated in this paper can be compared with the manual annotation. Since small boundary fluctuations have little effect on the correct rate, a fluctuation range $\delta$ is set: a labeled time point counts as qualified when it falls within $\pm\delta$ of a generated cut point. The accuracy is then:

$$\mathrm{Accuracy}=\frac{N_{hit}}{N_{total}}\times 100\% \tag{16}$$

where $N_{hit}$ is the number of labeled time points falling within the fluctuation range of a generated cut point, and $N_{total}$ is the total number of time points in the actual segmentation. Because different values of $\delta$ yield different accuracies, several fluctuation ranges are compared.
To confirm the quality of the audio segmentation, the labeled end-of-sentence times are compared with the actual cut-point times, and the number of qualified time points in the test data, i.e., the number of labeled times falling inside the cut-point time intervals, is counted; the segmentation quality of the TV-series audio is verified by the accuracy over the total number. To compare the accuracy of different intervals, the experiment chooses three different fluctuation ranges $\delta$; the accuracy results are shown in Fig. 5.

Figure 5: Results of accuracy rate
Optimizing the RNN language model and automatically generating the corpus allow the speech recognition model to exclude interference from irrelevant context and improve recognition accuracy. A commonly used evaluation metric in speech recognition is the word error rate (WER). The WER of a test set is the quotient of the accumulated number of the three types of errors over all test sentences and the total number of words in the labeled text:

$$\mathrm{WER}=\frac{\#S+\#D+\#I}{\#N}\times 100\% \tag{18}$$

In Eq. (18), $\#S$, $\#D$, and $\#I$ denote the numbers of substitution, deletion, and insertion errors, respectively, and $\#N$ is the total number of words in the labeled text.
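In practice the three error counts are obtained by Levenshtein alignment between the reference transcript and the recognition hypothesis; a minimal Python sketch of the computation, with assumed toy inputs, is shown below.

```python
def wer(reference, hypothesis):
    """Word error rate (eq. 18) via edit distance: the minimum number of
    substitutions, deletions, and insertions, divided by reference length."""
    r, h = reference, hypothesis
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[len(r)][len(h)] / len(r)

print(wer("今日 は 晴れ です".split(), "今日 晴れ です".split()))  # one deletion -> 0.25
```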
The Bi-LSTM-CTC speech recognition model proposed in this paper builds on the parallel-optimized RNN language model and adds a corpus constructed with the speech-text alignment technique to enhance the model's generalization ability. CNN-CTC and LSTM-CTC are selected as baseline models for comparison with Bi-LSTM-CTC. All three use 2 network layers, Dropout to prevent overfitting, random orthogonal weight initialization, and a learning rate of 0.001. To conveniently track the decline of the loss value, the loss is saved once every 4,000 utterances during training; the loss curves are shown in Fig. 6. Of the three, CNN-CTC converges fastest. Bi-LSTM-CTC also shows a rapid early decline in loss, then settles into a steady decrease; its loss matches that of CNN-CTC when the amount of data reaches 61 (in units of 4,000 utterances) and is lower at final convergence. LSTM-CTC likewise declines quickly at first but then very slowly: by the time Bi-LSTM-CTC and CNN-CTC have converged, its loss is still relatively high and training has not completed.

Figure 6: Loss curve
The number of parameters and the test-set word error rate of the three models are shown in Figure 7. The WER of LSTM-CTC is high because its loss value was still decreasing when training ended. Bi-LSTM-CTC not only has fewer parameters than LSTM-CTC but also trains an epoch in about half the time LSTM-CTC needs. Compared with CNN-CTC, the word error rate of Bi-LSTM-CTC is 6.2 percentage points lower, so its performance is better. From these experiments it can be concluded that Bi-LSTM-CTC trains faster and has fewer parameters than LSTM-CTC, and achieves higher recognition accuracy than CNN-CTC, although it trains more slowly than CNN-CTC because of its recurrent processing across time steps.

Figure 7: The number of parameters and the test-set word error rate
The WERs of Bi-LSTM-CTC and other models are shown in Fig. 8. The Bi-LSTM-CTC model reduces the word error rate by 7.49 percentage points compared with the traditional GMM-HMM model, a remarkable improvement. The word error rate of the constructed end-to-end speech recognition model is also lower than that of the BLSTM and DCNN models, indicating that the model in this paper generalizes better.

Figure 8: Comparison with other models
In this paper, a Japanese speech recognition model is constructed from a Bi-LSTM-CTC acoustic model, a parallel-optimized RNN language model, and a corpus built with speech-text alignment technology. The optimized RNN language model is evaluated by the perplexity metric: its perplexity is 13.69, 62.34% lower than that of the 3-gram language model. The effectiveness of the speech-text alignment technique is examined with the accuracy metric, yielding a sentence segmentation accuracy of 91.33%. The word error rate of the constructed Japanese speech recognition model is lower than that of the baseline models and other speech recognition models. This shows that the Japanese speech recognition model designed in this paper has good application prospects.
