Combined Application of Speech Recognition and Natural Language Processing Technologies in the Electric Power Industry
Published online: 19 March 2025
Received: 02 November 2024
Accepted: 02 February 2025
DOI: https://doi.org/10.2478/amns-2025-0530
© 2025 Tao Xu et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
With the rapid development of smart grids, the demand for speech recognition in grid business is increasing. Like many industries, the power industry has its own specialized vocabulary, such as abbreviations, personal names, and system names. Speech recognition technologies built for the general domain recognize this vocabulary poorly, so recognition accuracy for power-industry speech is low, which in turn affects the efficiency and cost of the various businesses and services in the grid system [1-4]. The wide application of information and communication technology in the electric power industry is driving new modes of power production and development, laying a solid foundation for the intelligent era of electric power and generating a large number of application needs [5-8]. Demand for artificial intelligence technologies such as image recognition, speech recognition, and natural language processing is growing throughout the power system, especially in power construction, operation and inspection, safety supervision, marketing, and operation management. There is an urgent need to combine current advances in AI research to resolve business pain points in the power field and achieve cost reduction and efficiency improvement [9-10]. It is therefore necessary to study speech recognition technology tailored to the electric power industry in order to recognize and process speech information for the smart grid.
Speech recognition technology, also known as automatic speech recognition (ASR), converts human speech signals into input information or text that a computer can recognize and read [11]. Its applications include intelligent voice quality control, voice navigation, indoor equipment control, and equipment condition monitoring. In the electric power industry, the technology serves customer service, marketing, equipment operation and inspection, and other power-specific business scenarios, performing intelligent recognition and fault diagnosis on speaker audio and on equipment operating-condition audio; its key techniques include audio preprocessing, feature parameter extraction, and model training [12-14].
Natural Language Processing (NLP) is an important direction in artificial intelligence that studies theories and methods for effective human-computer communication through natural language [15]. It covers semantic understanding, machine translation, intelligent retrieval, and document proofreading and generation, and is widely used in scenarios such as power-marketing customer-service robots, intelligent office work, and policy intelligence analysis [16-18]. Built on core technologies such as natural-language syntactic logic, character-concept representation, and deep semantic analysis, it enables a unified representation of multiple languages and promotes effective communication and free interaction between humans and machines [19-20].
With the rapid adoption of key AI technologies, the power industry will increasingly rely on the convenience AI brings. Literature [21] proposes a speech recognition method for power systems based on natural language processing that introduces identity vectors into a DNN acoustic model, removing feature-difference information from the speech data while retaining the relevant semantic information, which effectively improves recognition accuracy for the power system. Literature [22] studied the application of speech recognition in the design of an electricity-billing service system for power supply companies and proposed an improved BERT fusion model that provides digital support for billing service and management in the power industry and promotes the efficient daily operation of power supply enterprises. Literature [23] shows that telephone dispatching instructions exchanged with dispatching organizations at all levels are an important part of grid operation, and uses artificial intelligence and related technologies to design a speech-adaptive recognition method that strengthens the recognition capability of intelligent dispatching systems under complex grid operating states. Literature [24] developed a marketing customer-service speech recognition and semantic understanding system for large-scale power marketing service demand; it screens accurate data from massive information and intelligently analyzes customer information, with wide application value for improving power-enterprise services. Literature [25] describes the necessity of intelligent power customer service: traditional customer service is constrained by employees' working hours and knowledge levels, which degrades the user experience, whereas intelligent customer service based on speech-analysis technology copes better with the traffic pressure of a large-scale service hotline. Literature [26] constructed a dispatch-speech acoustic dictionary based on grid professional vocabulary and localized vocabulary and introduced deep-learning algorithms into the acoustic model of dispatch speech recognition, effectively reducing the influence of channel and noise and improving the robustness of the acoustic model.
In this paper, natural language processing technology is applied to speech recognition in the power industry, and a power speech recognition model is designed. It comprises a Transformer-based out-of-set word model, i.e., an acoustic model design and an out-of-set word correction model design. Further, an n-gram language error-checking model is designed by studying the text error-checking process after speech recognition. The combined model of power speech recognition and natural language processing is then applied. Spectrogram, Fbank, and MFCC features are extracted, and the training sets of the THCHS-30 and AISHELL-1 datasets are used for model training to determine the best input features for our model. A power speech database is then built in three steps: audio corpus recording, text corpus design, and corpus annotation. To verify the recognition performance of the proposed model, popular models are compared against it on the power speech dataset established in this paper. Finally, the feasibility of the proposed algorithm in a power dispatching system is verified by comparing the timeliness of processing different kinds of speech information.
This chapter presents the algorithm design of the grid-power intelligent-dispatch speech recognition interaction system, including the Transformer-based out-of-set word model, i.e., the acoustic model design and the out-of-set word correction model design, as well as the n-gram-based language error-checking model.
Unlike generic speech recognition scenarios, speech communication in power dispatching scenarios is characterized by the following features:
1) Special vocabulary. The field of power dispatching uses a large number of professional words, which fall roughly into two categories: professional equipment and dispatching terms such as “grounding switch”, “insulator”, and “power flow map”, and locally flavored line and station names such as “ozotang Huashi line” and “augmentation station”. The first category is rarely used in general communication, and the second is not used at all.
2) Smaller vocabulary. Communication in power dispatching scenarios is centered on power dispatching tasks, so the vocabulary used is small compared with daily communication.
3) Structured instructions. The instructions issued by the dispatcher, such as “open the Guangzhou city tide chart” or “pull open the No. 1 knife gate of Auyu Honggao line at Olin station”, have a fixed grammatical structure and simple relationships.
To address these characteristics of voice communication in the field of power dispatching, this paper designs a dedicated voice recognition interaction system containing an acoustic model, an out-of-set word correction model, and an n-gram language error-checking model, which are introduced one by one in this chapter.
The out-of-set word speech recognition system proposed in this chapter is shown in Figure 1. It is composed of an acoustic model, a language model, and an out-of-set word correction model; a speech synthesis model is used only during training. The acoustic model, language model, and out-of-set word correction model are all Transformer-based neural networks.

Figure 1. Speech recognition system for out-of-set words
During training, the speech synthesis model generates synthetic audio from out-of-set word text to augment the audio data. The synthetic audio and the real audio each pass through the acoustic model to obtain their respective recognition results, and the out-of-set word correction model is then trained on the two types of results together with the corresponding ground truth. During decoding, the acoustic model combines with the language model to predict recognition results, after which the correction model corrects the out-of-set word errors in those results, realizing out-of-set word speech recognition.
The acoustic model contains an encoder, a decoder, and a Softmax prediction module. The encoder and decoder are each composed of a positional embedding layer and a stack of several encoding/decoding blocks. Each encoding block has two sub-layers, a multi-head self-attention layer and a feed-forward neural network layer; each decoding block has three sub-layers: a multi-head self-attention layer, an encoder-decoder multi-head attention layer over the encoded features, and a feed-forward neural network. Residual connections between sub-layers mitigate the gradient-vanishing problem of deep networks, Layer Normalization keeps the feature distribution stable, and Dropout reduces overfitting. The number of encoding block layers is set to the empirical value of 12. The encoder maps the input audio features to encoded features, as in equation (1):

$$h^{\mathrm{enc}} = \mathrm{Encoder}\left(X_{\mathrm{fbank}}\right) \tag{1}$$

where $X_{\mathrm{fbank}}$ is the Fbank audio feature and $h^{\mathrm{enc}}$ is the encoded feature sequence.
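As an illustration, a minimal PyTorch sketch of such an encoder follows; only the 12-layer depth comes from the text, while the feature dimension, model width, head count, and dropout rate are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Sinusoidal positional embedding, added to the projected frames."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                     # x: (batch, frames, d_model)
        return x + self.pe[: x.size(1)]

class AcousticEncoder(nn.Module):
    """Sketch of the acoustic encoder: only the 12-layer depth is from the
    paper; all other hyperparameters here are illustrative assumptions."""
    def __init__(self, feat_dim=80, d_model=256, nhead=4, num_layers=12):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # Fbank frames -> model dim
        self.pe = SinusoidalPE(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        return self.encoder(self.pe(self.proj(x)))   # encoded features h_enc
```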
The decoder combines the encoded features with the previously predicted sequence to iteratively compute the decoded feature for the current time step, as in equation (2):

$$s_t = \mathrm{Decoder}\left(h^{\mathrm{enc}}, y_{1:t-1}\right) \tag{2}$$

The decoder starts predicting from the character “<start>” and iteratively predicts the character sequence until the end character “<end>” is produced. At each time step, the decoded feature $s_t$ is mapped by a linear layer and the Softmax function into a probability vector over the dictionary, from which the Seq2Seq loss is computed. In parallel, the encoded features are mapped, through a linear mapper and the Softmax function, into probability vectors whose dimension equals the dictionary length of the acoustic model, and a CTC loss is computed:

$$\mathcal{L}_{\mathrm{CTC}} = -\ln P_{\mathrm{CTC}}\left(y \mid h^{\mathrm{enc}}\right) \tag{3}$$
The loss function for optimizing the acoustic model is a hybrid of the CTC and Seq2Seq losses, i.e., their weighted sum:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{Seq2Seq}} \tag{4}$$

where $\lambda \in [0,1]$ is the weighting coefficient.
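A compact sketch of this hybrid objective, again in PyTorch; the CTC weight (here 0.3) and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(ctc_logits,    # (batch, frames, vocab) from the encoder branch
                dec_logits,    # (batch, steps, vocab) from the decoder branch
                ctc_targets, dec_targets, input_lens, target_lens, lam=0.3):
    """Weighted sum of CTC and Seq2Seq (cross-entropy) losses, eq. (4)."""
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)  # (T, B, V)
    ctc = F.ctc_loss(log_probs, ctc_targets, input_lens, target_lens, blank=0)
    seq2seq = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets)
    return lam * ctc + (1.0 - lam) * seq2seq
```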
In the decoding inference process, the probability vectors obtained from the two decoding methods are weighted to determine the decoding result. To obtain as accurate a recognition result as possible, beam search is used. Beam search is an N-best decoding strategy over time steps that is used only during inference: at each time step it retains the N best partial results as inputs to the probability calculation of the next step, widening the search space and improving recognition at the cost of some time and memory.
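A minimal beam-search sketch under these definitions; the step function, token strings, and beam size are illustrative assumptions rather than the system's actual implementation:

```python
def beam_search(step_fn, beam_size=5, max_len=50, eos="<end>"):
    """step_fn(prefix) is assumed to return a list of (token, log_prob)
    continuations from the combined acoustic/language model."""
    beams = [(["<start>"], 0.0)]              # (partial sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:             # finished hypotheses carry over
                candidates.append((prefix, score))
                continue
            for tok, lp in step_fn(prefix):   # expand with each continuation
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == eos for p, _ in beams):
            break
    return beams[0]                           # best (sequence, score)
```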
The speech synthesis model and language model are not the main modules of the proposed method and are not elaborated here; the language model in this paper is an end-to-end Transformer-based Chinese language model.
The Transformer-based out-of-set word correction model, shown in Figure 2, is composed of an encoder, a decoder, and a Softmax prediction module. The encoder and decoder are stacks of encoding/decoding blocks with residual connections between blocks, and their structure is the same as in the acoustic model. The number of encoder block layers is likewise set to an empirical value.

Figure 2. Transformer-based OOV-SC model
Unlike the acoustic model, the out-of-set word correction model embeds position information with OneHot coding rather than sinusoidal coding. Sinusoidal coding describes relative position well, which suits speech recognition, where nearby context is more informative than distant context. OneHot-based absolute coding, by contrast, describes absolute position well and is simple to implement, and recognizing out-of-set word error regions relies more on semantic associations over the whole context, so absolute position coding is preferred here.
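The two position-encoding schemes can be sketched side by side as follows; the dimensions are illustrative:

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding: relative-position friendly (acoustic model)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def onehot_pe(max_len: int) -> np.ndarray:
    """One-hot encoding: absolute positions (correction model)."""
    return np.eye(max_len)    # row t is the code for absolute position t

print(sinusoidal_pe(4, 8).shape, onehot_pe(4).shape)   # (4, 8) (4, 4)
```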
The out-of-set word correction model works similarly to machine translation: the input to the encoder is the recognition result of the acoustic model; denoting this recognition result as the character sequence $\hat{y}$, the encoder maps $\hat{y}$ to encoded features.
The decoder iteratively computes the decoded features for each time step using the encoded features together with the previously predicted characters, in the same manner as the acoustic model's decoder.
In each time step, the decoded features are linearly mapped to latent spatial features whose dimension equals the out-of-set word dictionary size, and the latent features are passed through a Softmax function to obtain a probability vector over the out-of-set word dictionary.
The training objective combines the decoder's Seq2Seq loss with an alignment loss by weighting:

$$\mathcal{L}_{\mathrm{OOV}} = \beta\,\mathcal{L}_{\mathrm{Seq2Seq}} + (1-\beta)\,\mathcal{L}_{\mathrm{align}}$$

where $\beta$ is the weighting coefficient.
The preceding sections explained the basic theory of the out-of-set word speech recognition system; this section introduces how to check the text for errors after speech recognition on that basis.
A complete speech communication involves three main factors: the subject, the discourse entities, and the context; their close combination fully expresses the meaning of a piece of speech. The contextual information determines the current speech situation and background and also constrains the semantic expression. A text error inevitably makes the semantics of the sentence incoherent, so finding errors in a sentence means finding the words that do not fit the text's semantics. Such words fall into three cases: they violate the sentence's syntax, its semantics, or its context. This paper temporarily disregards error checking at the syntactic level and realizes text error checking from the semantic and contextual aspects. In most existing natural language processing algorithms, the contextual knowledge of a word is represented by one or two words around the target word, or by some main words in the chapter or paragraph where the target word is located, called context words. In this paper, we take the several words adjacent to the current error-checked word, before and after it, as the context words characterizing textual context knowledge, and characterize semantic knowledge with the core words and their fixed collocations in the text; together these provide the basis for error checking. Figure 3 shows the overall framework of text error checking after speech recognition.

Figure 3. Text error-checking process after speech recognition
From Figure 3, the process of text error checking after speech recognition is as follows: first, the training corpus is used to construct the n-gram model and the core-word collocation thesaurus; the text produced by speech recognition is then preprocessed; the constructed n-gram model computes the contextual harmony of the words in the preprocessed text to perform the initial error check; the core-word collocation thesaurus then computes the collocation aggregation degree of the words that passed the initial check to perform a secondary error check; finally, the wrong words are marked according to the secondary check results.
The main idea of error checking is to use the n-gram model's ability to predict neighboring words as the basis for error checking. Because an n-gram model can only use historical words (the preceding context) to check errors, a weighted-allocation method based on the n-gram model first computes the contextual harmony of the words in the recognized text; a threshold is preset for the contextual-harmony value, and words below the threshold are marked as errors, realizing the initial check for true-word errors and loose-string errors in the text. Then, to compensate for the n-gram model's weakness in detecting true-word errors, the language collocation aggregation degree of phrases and core words in the initially checked text is computed in turn on the basis of the core-word collocation thesaurus; a threshold is preset for the collocation aggregation degree, and words below this threshold are marked as errors, realizing a second check for true-word errors in the recognized text and at the same time correcting the n-gram model's error-checking results. The labels from the second check are taken as the final error-checking output.
There are two main ideas for error checking based on the n-gram language model:
1) Calculate the probability of the current word given the historical words; if the n-gram probability of the current word satisfies the preset threshold, the word is judged correct, otherwise it is marked as an error.
2) Calculate the n-gram probability of the whole sentence; if the value satisfies the threshold, the sentence is judged correct, otherwise it contains an error.
Comparing the two ideas above, this paper takes the first as the main idea of error checking: the contextual harmony of each word in the recognized text is calculated, and words whose value falls below the preset threshold are flagged as erroneous.

Figure 4. Error-checking flow of the weighted-distribution method based on the n-gram model
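A toy sketch of this first error-checking idea, assuming simple maximum-likelihood bigram/trigram estimates; the interpolation weights and the threshold are illustrative assumptions:

```python
from collections import Counter

class NGramChecker:
    """Flags words whose weighted n-gram 'contextual harmony' is too low."""
    def __init__(self, corpus, w_bi=0.6, w_tri=0.4):
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        for sent in corpus:                       # corpus: lists of tokens
            toks = ["<s>", "<s>"] + sent
            for i in range(2, len(toks)):
                self.uni[toks[i]] += 1
                self.bi[(toks[i-1], toks[i])] += 1
                self.tri[(toks[i-2], toks[i-1], toks[i])] += 1
        self.w_bi, self.w_tri = w_bi, w_tri

    def harmony(self, sent, i):
        toks = ["<s>", "<s>"] + sent
        i += 2
        p_bi = self.bi[(toks[i-1], toks[i])] / max(self.uni[toks[i-1]], 1)
        p_tri = (self.tri[(toks[i-2], toks[i-1], toks[i])]
                 / max(self.bi[(toks[i-2], toks[i-1])], 1))
        return self.w_bi * p_bi + self.w_tri * p_tri   # weighted allocation

    def check(self, sent, threshold=1e-3):
        """Return the words marked as errors in the initial check."""
        return [w for i, w in enumerate(sent) if self.harmony(sent, i) < threshold]
```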
The experiments in this chapter use Tsinghua University's open Chinese dataset THCHS-30 in addition to the open-source AISHELL-1 Mandarin Chinese speech data; their training sets are used to train the model and their test sets to run the test experiments. In the experiments, Spectrogram, Fbank, and MFCC features are extracted as inputs to train this paper's end-to-end speech recognition model, respectively.
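A short sketch of extracting the three candidate features with librosa, assuming a typical 25 ms window and 10 ms hop (the paper's exact settings are not stated):

```python
import librosa
import numpy as np

# Load one utterance at the corpus sampling rate of 16 kHz.
y, sr = librosa.load("sample.wav", sr=16000)

# Spectrogram: magnitude of the short-time Fourier transform.
spectrogram = np.abs(librosa.stft(y, n_fft=400, hop_length=160))

# Fbank: log-Mel filterbank energies.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
fbank = librosa.power_to_db(mel)

# MFCC: cepstral coefficients derived from the Mel spectrum.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
```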
After training, the test-set results are used to compare the character error rate (CER) of the three input features, so as to determine which of the Fbank, Spectrogram, and MFCC features extracted from the speech signal is best suited as the input feature of this paper's model. The experimental results are shown in Table 1.
Table 1. Comparison of different input features

Data set | Input feature | CER (%)
---|---|---
THCHS-30 | Spectrogram | 15.72
THCHS-30 | Fbank | 17.28
THCHS-30 | MFCC | 18.68
AISHELL-1 | Spectrogram | 15.54
AISHELL-1 | Fbank | 16.77
AISHELL-1 | MFCC | 19.30
As Table 1 shows, both datasets achieve their lowest character error rates (CER), 15.72% and 15.54% respectively, when the Spectrogram feature is used as the input. Fbank gives the second-lowest CER on both datasets, at 17.28% and 16.77%, while MFCC performs worst, at 18.68% and 19.30%. The MFCC feature requires more manual processing, which loses information and substantially reduces system performance. Fbank preserves the original characteristics of the audio signal as far as possible, but it also carries redundant information, which can prevent the model from capturing the genuinely usable audio information when the dataset is small.
From these two sets of experiments it can be concluded that the Spectrogram feature of the speech signal is better suited as the input feature of the model in this paper: it lowers the CER of the speech recognition model and thereby improves system performance.
A comparison experiment is conducted on the AISHELL-1 dataset between the proposed model and the baseline CNN-BiLSTM and CNN-BiGRU models, using Accuracy, Precision, Recall, and F1 as the performance metrics, to determine whether the model incorporating the attention mechanism improves performance. The results are shown in Fig. 5.

Figure 5. Performance comparison of different models (%)
As shown in Fig. 5, the model in this paper achieves the best performance on all four metrics.
At present, most open-source speech corpora are built from mass news and similar material; such corpora contain essentially none of the power dispatching instructions or specialized vocabulary for equipment and operations, and therefore do not meet the application requirements of power dispatching scenarios. Accordingly, based on the corpus characteristics of power dispatching instructions, a design scheme for a voice dataset in the power dispatching field is proposed: a corpus of 360 power dispatching texts is constructed, a total of 5,000 voice dispatching instructions are recorded, and after construction the dataset is expanded by speed perturbation.
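A sketch of the speed-perturbation expansion, assuming the commonly used 0.9x/1.1x factors (the paper does not state which factors it uses) and the librosa/soundfile libraries; the file name is a made-up example:

```python
import librosa
import soundfile as sf

# Resample each utterance so it plays faster (1.1x) or slower (0.9x),
# changing both tempo and pitch, and save the augmented copies.
y, sr = librosa.load("cmd_0001.wav", sr=16000)
for factor in (0.9, 1.1):
    y_sp = librosa.resample(y, orig_sr=int(sr * factor), target_sr=sr)
    sf.write(f"cmd_0001_sp{factor}.wav", y_sp, sr)
```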
Audio corpus recording
The voice corpus for the power dispatching industry consists of 5,000 Mandarin Chinese audio recordings from 20 speakers (10 male, 10 female), all from different regions. Recording took place in a quiet indoor environment throughout; audio was captured at a 16 kHz sampling rate, mono, with 16-bit sampling precision, and saved in WAV format. The 5,000 recordings were divided 8:2, with 4,000 used as the training set and 1,000 as the test set.
Text corpus design
When constructing the text corpus for the power dispatching field, the design must follow the content characteristics of the instructions dispatchers issue in daily grid dispatching work. By collecting grid scheduling specifications, commonly used scheduling instructions, and typical scheduling operation tickets, and combining them with the special characteristics of grid scheduling work, the usual characteristics of power scheduling instructions are summarized as follows:
Power dispatching instructions involve many names, and dispatchers must state the relevant equipment names clearly when issuing instructions. Professional equipment follows relatively fixed naming conventions: substations are named “place-name station”, lines “place-name-number line”, and power poles “number pole” or “place-name-number pole”. Different local grids also pronounce the digits in numbers differently; for example, the number “1001” is read as “幺洞洞幺” (the radio-style digit reading) rather than “一零零一” (the ordinary Chinese reading). The pronunciation of numbers, symbols, and other special characters must therefore be fully considered in the corpus design.
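This digit-reading rule can be sketched as a small normalization table; only the “1001” → “幺洞洞幺” mapping comes from the text above, while the rest of the table follows the common Chinese radio-style reading and is an assumption:

```python
# Radio-style spoken forms of digits used in dispatch readings.
DIGIT_MAP = {"0": "洞", "1": "幺", "2": "两", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "拐", "8": "八", "9": "勾"}

def spoken_number(num: str) -> str:
    """Convert a digit string to its spoken dispatch form."""
    return "".join(DIGIT_MAP[d] for d in num)

assert spoken_number("1001") == "幺洞洞幺"
```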
The dispatching instructions of a power supply company in Guangdong Province are selected as the template, and a power dispatching text corpus is constructed according to the rules and characteristics of its dialogues. The text covers common equipment names, dispatching, operation, and other specialized terms in the power system; the equipment-name terminology essentially contains the names of the major equipment mentioned in daily dispatching. Numbers, units, and special symbols in the power industry are all written out in Chinese characters; for example, “10 kV” is written as “ten kilovolts”. The scheduling-instruction corpus examples follow the basic order format of daily scheduling instructions.
Corpus annotation
After producing the text and audio of the power scheduling corpus, each piece of audio must be annotated. This paper studies the end-to-end speech recognition model with the character as the basic modeling unit, so audio and Chinese-character text labels must correspond one-to-one. Each label line consists of the path of the audio file followed by the content of the corresponding text statement, with the characters of the statement separated by spaces. After labeling, the speech dataset of power dispatch commands is verified and corrected to ensure its accuracy and usability.
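A small sketch of this label format; the path and transcript are made-up examples:

```python
def make_label(wav_path: str, transcript: str) -> str:
    """One label line: audio path, then the transcript split into characters."""
    return wav_path + " " + " ".join(transcript)

print(make_label("data/train/cmd_0001.wav", "拉开一号刀闸"))
# data/train/cmd_0001.wav 拉 开 一 号 刀 闸
```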
To verify the speech recognition performance of the proposed model, comparison experiments against popular models are conducted on the power speech dataset established in this paper; the results are shown in Table 2. On this dataset, the parameter count of our method is 25, a medium level among the compared models. Its word error rate (WER) of 8.21% is the lowest, indicating the highest recognition accuracy on this dataset, and its real-time factor (RTF) of 0.017 indicates the best real-time performance on this dataset.
Table 2. Comparison of experimental results on the power voice dataset

Method | Parameter quantity | WER (%) | RTF
---|---|---|---
Discriminative | 20 | 11.42 | 0.040
Pseudo Visual | 23 | 13.34 | 0.047
Vanilla Transformer | 25 | 10.62 | 0.031
Speech-Transformer | 26 | 10.06 | 0.026
Open-Transformer | 36 | 9.64 | 0.029
Ours | 25 | 8.21 | 0.017
Table 2 shows that the proposed method performs very well on this dataset: it achieves low WER and RTF while keeping a low parameter count, demonstrating its efficiency and practicality in speech recognition tasks and indicating good generalization performance.
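For reference, minimal sketches of the two metrics reported in Table 2, with WER computed from edit distance and RTF as decoding time divided by audio duration:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between reference and hypothesis token lists."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i-1][j] + 1,                       # deletion
                          d[i][j-1] + 1,                       # insertion
                          d[i-1][j-1] + (ref[i-1] != hyp[j-1]))  # substitution
    return d[len(ref)][len(hyp)]

def wer(ref_words, hyp_words):
    """Word error rate: edits divided by reference length."""
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

def rtf(decode_seconds, audio_seconds):
    """Real-time factor: processing time over audio duration."""
    return decode_seconds / audio_seconds
```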
In addition to recognition accuracy, a power dispatching system must also meet timeliness requirements. To verify the feasibility of the proposed algorithm in the power dispatching system, the timeliness of processing different kinds of voice information is compared. Table 3 shows the system response time for voice inputs of different types and durations. For each input type, the response time grows with the voice duration, but the latency remains low: the longest response time across the three modes is 9.7 s, for a 60 s voice input, which meets the actual requirements of the dispatching system and further demonstrates the feasibility of the proposed algorithm.
Table 3. System response time for different voice types and durations

Speech duration (s) | Real-time voice (s) | Voice file (s) | Telephone terminal (s)
---|---|---|---
5 | 2.6 | 1.6 | 3.3
10 | 3.0 | 1.9 | 3.8
20 | 4.5 | 3.0 | 4.9
30 | 5.3 | 4.4 | 5.8
60 | 8.5 | 7.1 | 9.7
In this study, natural language processing technology is applied to speech recognition in the electric power industry: an electric power speech recognition model is designed, its application is validated, and practical verification on the dataset is carried out, yielding the following conclusions:
1) With Spectrogram features as input, the CERs on THCHS-30 and AISHELL-1 are 15.72% and 15.54% respectively, the lowest among all the features, indicating that the Spectrogram feature of the speech signal is the most suitable input feature for this paper's model.
2) The model in this paper performs best on all four metrics: Accuracy, Precision, Recall, and F1.
3) On the electric power speech dataset, the model's parameter count is 25, its WER is 8.21%, and its RTF is 0.017, demonstrating its efficiency and practicality in speech recognition tasks and proving that the algorithm has good generalization performance.
4) The system's response time increases with speech duration, but the latency remains low; the longest response time of the three-mode system is 9.7 s at a speech duration of 60 s, which satisfies the actual needs of the dispatching system and shows the feasibility of the proposed algorithm.
Supported by China Yangtze Power Co., Ltd. scientific research project (project number: 4323020012).