Open Access

Computer Translation-Based Language Modeling Enables Multi-Scenario Applications of the English Language

  
17 March 2025


Introduction

In the past decade, as many modern cutting-edge scientific technologies have developed, solving frontier problems in a number of disciplines has become increasingly dependent on advances in linguistics. Since language is the carrier of cultural and social information, the further development of information science depends to a large extent on the development of language science. For example, artificial intelligence, which must simulate the mechanisms of the human brain and the process of thought, cannot avoid simulating the internal language processes of human beings to some extent, because human abstract thinking is realized through the expression of language [1-4]. In human-computer dialogue, the central task is the modeling of natural language: without the study of language models, a high degree of formalization of natural language cannot be achieved. It can therefore be said that the “language model” is the “bridge” between linguistics and information science. From the perspective of natural language processing, a language model is a mathematical model that describes the inner laws of natural language. The construction of language models is one of the core methods of computational linguistics and a core theory of corpus linguistics; language models can be divided into rule-based models and statistical models [5-8].

Translating one language into another with the help of computers, referred to here as computer translation (MT), is becoming an increasingly important technology; national defense, economic development, political stability, and social welfare all depend on the sharing of information. Never in the history of mankind has the need to break through language barriers been felt more urgently than today. New common markets and growing world trade have created a strong demand for language support: the exchanges among the nine official languages of the European Community alone require a great deal of manpower every day to translate across seventy-two different language directions [9-12]. Because of this, it is hoped that with the help of the computer, the most powerful tool of information processing, people can be freed from heavy and tedious translation work, or at least from a large part of its repetitive and monotonous portion, to engage in more creative labor. In summary, it is meaningful to examine how computer translation-based language modeling empowers multi-scenario applications of the English language; through the practical use of language in communication, one can observe whether learners can use the learned language well in a specific language environment to interact effectively and achieve the purpose of communication [13-15].

The study addresses the catastrophic forgetting that arises when BERT pre-trained language models are incorporated into neural machine translation models, and introduces a masking matrix strategy to mitigate it. Then, through the internal fusion and dynamic weighting of multiple attention mechanisms, the model can make full use of the output information of the optimized BERT, realizing an English translation model based on improved Masking-BERT enhancement. The performance of the English translation model is analyzed in terms of English utterance compression, training loss, and BLEU score, and its practical effect is explored through translation accuracy, response time, and expert satisfaction scores. After that, Transformer is used to model grammatical error correction, taking into account both local contextual information and long-distance dependencies in the text. The English grammar error correction model is trained and tested, other grammar error correction methods are selected for comparison, and the precision, recall, and F0.5 values for different grammatical error types are analyzed to explore the effect of the constructed model on the detection and correction of English grammar errors.

Overview

Today, when the development of many cutting-edge scientific technologies relies more and more on language development, it is worth exploring in depth how research results on language can be used precisely and effectively for multi-scenario applications of the English language. Literature [16] summarizes the research related to ChatGPT, mainly analyzing the state-of-the-art large-scale language models in the GPT family and their applications in different domains, and points out that the key innovations of large-scale pre-training, instruction fine-tuning, and reinforcement learning from human feedback are of great importance for the adaptability and performance of LLMs. Literature [17] points out that pre-trained language models have led to a paradigm shift in Natural Language Processing (NLP) from supervised learning to pre-training followed by fine-tuning, and examines future research directions in language modeling by analyzing the classification methods, representation methods, and frameworks of pre-trained models. Literature [18] proposes an extractive-abstractive neural document summarization method based on transformer language models and experimentally verifies its effectiveness and feasibility; the method generates more abstractive summaries and also achieves higher ROUGE scores. Literature [19] presents a prompt-based large-scale language model for machine translation, verified on GLM-130B as a testbed to have excellent performance and to enhance translation results. Literature [20] developed an English translation model based on intelligent recognition technology and a deep learning framework and designed experiments to verify its effectiveness; the model has high accuracy and efficiency in speech recognition and translation and provides a feasible and efficient solution for practical applications. Literature [21] proposed a machine translation language model combining a feed-forward neural network decoder and an attention mechanism, and experimentally verified that the proposed model performs better and can be applied to different English scenarios, providing new ideas for the multi-scenario application of machine translation language models in the future.

English Translation Model Based on Improved Masking-BERT Enhancement

With the rapid development of artificial intelligence (AI) technology, using computers to translate between different languages has become a major trend. To improve the efficiency and quality of English teaching, many educators apply AI translation models in the English classroom. In this context, an English translation model based on improved Masking-BERT enhancement is proposed.

Masking Matrix Strategy

To address the catastrophic forgetting problem in BERT training, the study proposes a mask-matrix-based BERT training strategy, as follows:

In BERT, the $l$-th Transformer block consists of 6 linear layers, $W^l \in \left\{ W_K^l, W_Q^l, W_V^l, W_{AO}^l, W_I^l, W_O^l \right\}$. Each linear layer $W^l$ is associated with a real-valued matrix $M^l$ of the same size, randomly initialized from a uniform distribution. During training, backpropagation continuously updates $M^l$ using the loss of the downstream machine translation task. During forward propagation, the binary mask $M_{bin}^l$ of the current linear layer is first computed from $M^l$ by an element-wise threshold function, as in Equation (1): \[\left( M_{bin}^l \right)_{i,j}=\begin{cases} 1, & \text{if } m_{i,j}^l \ge \tau \\ 0, & \text{otherwise} \end{cases}\]

where $\tau$ is the global threshold hyperparameter and $i, j$ are the coordinates within the 2D linear layer, with $m_{i,j}^l \in M^l$.

Multiplying this binary mask $M_{bin}^l$ element-wise with the current linear layer $W^l$ gives the updated linear layer $\hat{W}^l$. In this way, the model can itself select the subset of pre-trained parameters to keep frozen and then update the remaining parameters. The updated parameters are computed as in Equation (2): \[\hat{W}^l := W^l \odot M_{bin}^l\]
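A minimal sketch (not the authors' code) of the masking-matrix idea in Equations (1)-(2): each linear layer $W^l$ is paired with a real-valued score matrix $M^l$, a global threshold $\tau$ turns the scores into a binary mask, and the masked weights are used in the forward pass. The class name `MaskedLinear`, the straight-through gradient trick, and the default $\tau$ are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MaskedLinear(nn.Module):
    """Wrap a pre-trained linear layer with a learnable binary mask (Eqs. (1)-(2))."""

    def __init__(self, pretrained: nn.Linear, tau: float = 0.5):
        super().__init__()
        self.tau = tau
        # Pre-trained weights are kept fixed; only the mask scores are trained.
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        # Real-valued mask scores M^l, randomly initialised from a uniform distribution.
        self.mask_scores = nn.Parameter(torch.empty_like(self.weight).uniform_(0.0, 1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equation (1): element-wise thresholding gives the binary mask M_bin^l.
        # The straight-through estimator keeps the mask trainable by backprop (assumed here).
        hard = (self.mask_scores >= self.tau).float()
        mask = hard + self.mask_scores - self.mask_scores.detach()
        # Equation (2): the effective weights are W^l ⊙ M_bin^l.
        return nn.functional.linear(x, self.weight * mask, self.bias)


layer = MaskedLinear(nn.Linear(768, 768), tau=0.5)
out = layer(torch.randn(2, 16, 768))  # e.g. (batch, sequence length, hidden size)
```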

Multi-attention mechanisms
Internal integration of multi-attention mechanisms

To enable the neural machine translation model to better exploit the output of the BERT trained with the masking matrix strategy, multiple attention mechanisms are fused inside the model. Based on the output of the final hidden layer of Masking-BERT, the encoder-side and decoder-side attention results of the current layer are computed as in Equations (3)~(5), respectively: \[\tilde{h}_i^l=\frac{1}{2}\left( attn_S\left( h_i^{l-1},H_E^{l-1},H_E^{l-1} \right)+attn_B\left( attn_S\left( h_i^{l-1},H_E^{l-1},H_E^{l-1} \right),H_B,H_B \right) \right)\] \[\hat{s}_t^l=attn_S\left( s_t^{l-1},S_{<t+1}^{l-1},S_{<t+1}^{l-1} \right)\] \[\tilde{s}_t^l=\frac{1}{2}\left( attn_E\left( \hat{s}_t^l,H_E^L,H_E^L \right)+attn_B\left( attn_E\left( \hat{s}_t^l,H_E^L,H_E^L \right),H_B,H_B \right) \right)\]

Dynamic weighted fusion methods

Let the self-attention output and the Masking-BERT attention output in layer $l$ on the encoder side be $ATT_S^l$ and $ATT_B^l$, respectively. The two are fused with a dynamic weight in the current layer to obtain the encoder-side hidden state $H_E^l$, computed as in Equation (6): \[H_E^l=g_l\otimes ATT_S^l+\left( 1-g_l \right)\otimes ATT_B^l\] where $g_l \in [0,1]$ is the dynamic weighting coefficient of layer $l$.

Similarly, let the encoder-decoder attention output and the Masking-BERT attention output in layer $l$ on the decoder side be $ATT_E^l$ and $ATT_B^l$, respectively; they are fused in the same way to obtain the decoder-side hidden state of the current layer, computed as in Equation (7): \[H_D^l=g_l\otimes ATT_E^l+\left( 1-g_l \right)\otimes ATT_B^l\]
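A minimal sketch of the dynamically weighted fusion in Equations (6)-(7): a gate $g_l$ mixes the self-attention (or encoder-decoder attention) output with the Masking-BERT attention output. The gate parameterisation used here (a sigmoid over a learned per-dimension parameter) is an assumption; the paper only states that the weighting is dynamic.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse two attention outputs with a learnable dynamic weight g_l (Eqs. (6)-(7))."""

    def __init__(self, d_model: int):
        super().__init__()
        # One gate logit per hidden dimension; zeros make g_l start at 0.5 (assumed choice).
        self.gate_logits = nn.Parameter(torch.zeros(d_model))

    def forward(self, att_self: torch.Tensor, att_bert: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logits)            # g_l in (0, 1)
        return g * att_self + (1.0 - g) * att_bert     # H = g ⊗ ATT_S + (1 - g) ⊗ ATT_B


fuse = GatedFusion(d_model=512)
h = fuse(torch.randn(2, 16, 512), torch.randn(2, 16, 512))  # (batch, seq, hidden)
```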

Neural Machine Translation Model

Let the hidden-layer representation of layer $l$ on the encoder side be $H_E^l$, the hidden-layer representation of the $i$-th word in layer $l$ be $h_i^l$, and the output of the last hidden layer of Masking-BERT be $H_B$. The attention computation of each encoder layer is shown in Equation (8): \[\tilde{h}_i^l=\frac{1}{2}\left( attn_S\left( h_i^{l-1},H_E^{l-1},H_E^{l-1} \right)+attn_B\left( attn_S\left( h_i^{l-1},H_E^{l-1},H_E^{l-1} \right),H_B,H_B \right) \right)\]

where $attn_S$ is the self-attention mechanism and $attn_B$ is the Masking-BERT attention mechanism.

Let the hidden-layer states of the first $t-1$ time steps of layer $l$ on the decoder side be $S_{<t}^l=\left( s_1^l,\cdots ,s_{t-1}^l \right)$; the attention computations of each decoder layer are shown in Equations (9) and (10): \[\hat{s}_t^l=attn_S\left( s_t^{l-1},S_{<t+1}^{l-1},S_{<t+1}^{l-1} \right)\] \[\tilde{s}_t^l=\frac{1}{2}\left( attn_E\left( \hat{s}_t^l,H_E^L,H_E^L \right)+attn_B\left( attn_E\left( \hat{s}_t^l,H_E^L,H_E^L \right),H_B,H_B \right) \right)\]

where $attn_S$, $attn_B$, and $attn_E$ denote the self-attention, Masking-BERT attention, and encoder-decoder attention mechanisms, respectively.
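A minimal sketch of the fused encoder-layer attention in Equation (8) as reconstructed above: the layer input attends over the previous encoder states ($attn_S$) and, using that result as the query, over the Masking-BERT output $H_B$ ($attn_B$); the two results are averaged. The module name, dimensions, and use of `torch.nn.MultiheadAttention` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FusedEncoderAttention(nn.Module):
    """One encoder layer's attention: 0.5 * (attn_S(...) + attn_B(attn_S(...), H_B, H_B))."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_bert: int = 768):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # kdim/vdim let the BERT states keep their own hidden size (e.g. 768).
        self.attn_b = nn.MultiheadAttention(d_model, n_heads,
                                            kdim=d_bert, vdim=d_bert,
                                            batch_first=True)

    def forward(self, h_prev: torch.Tensor, h_bert: torch.Tensor) -> torch.Tensor:
        s_out, _ = self.attn_s(h_prev, h_prev, h_prev)   # attn_S(h^{l-1}, H_E^{l-1}, H_E^{l-1})
        b_out, _ = self.attn_b(s_out, h_bert, h_bert)    # attn_B(attn_S(...), H_B, H_B)
        return 0.5 * (s_out + b_out)                     # Eq. (8)


layer = FusedEncoderAttention()
h = layer(torch.randn(2, 20, 512), torch.randn(2, 24, 768))  # NMT states, frozen BERT states
```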

Translation Model Performance Analysis
English utterance compression

To validate the performance of the English translation model based on improved Masking-BERT enhancement designed in this study, experiments are conducted on the CoNLL 2020 dataset, with online translation and machine translation used as comparison methods.

The compression rate and compression stability are used as comparison indexes. Figure 1 shows the comparison results of the three methods, where (a) shows the compression rate and (b) the compression stability. As can be seen from Figure 1(a), the average compression rates of online translation, machine translation, and the English translation model of this paper are 77.9%, 81.4%, and 88.4%, respectively. As can be seen from Figure 1(b), the average compression stability values of the three methods are 75.2%, 82.3%, and 86.8%, respectively. This indicates that the English translation model of this paper also performs well in English utterance compression.

Figure 1.

Comparison of compression rate and compression stability for the three methods

Training loss and BLEU score

To further validate the performance of the proposed method, Figures 2 and 3 show the comparison of the training loss and BLEU evaluation results of the three models, respectively. The average loss of the English translation method in this paper (0.80) is lower than that of online translation (0.85) and machine translation (0.82). The BLEU score of the English translation method in this paper is 0.86, while those of online translation and machine translation are 0.78 and 0.82, respectively, indicating that the proposed method has a certain degree of superiority over the translation methods in common use at present.

Figure 2.

The results of the training loss of the three models

Figure 3.

The results of the BLEU value of the three models

To further validate the performance of the designed English translation model based on improved Masking-BERT enhancement, the study evaluates the outputs of the three methods; the evaluation results are shown in Figure 4. The compression ratio (0.66), grammaticality (5.67), and information content (4.32) of the English translation method in this paper are significantly better than those of the other two methods, while the difference in the heat ratio (0.48) is not significant. This shows that the design in this paper provides a higher-performance method for English translation.

Figure 4.

Evaluation results of the three methods for translating English into Chinese

Effect of practical application of translation model

To test the application effect of the constructed translation model on real problems, the operational stability and translation satisfaction of the different translation models within the overall translation system are analyzed.

Operational stability

Figure 5 shows the translation accuracy and system response time of the different models in the English-to-Chinese and Chinese-to-English tasks. In English-to-Chinese, the response times of online translation, machine translation, and the English translation model of this paper are 8.16 s, 5.92 s, and 1.21 s, with translation accuracies of 0.74, 0.81, and 0.87, respectively. In Chinese-to-English, the response times are 10.39 s, 7.96 s, and 2.65 s, with translation accuracies of 0.72, 0.76, and 0.84, respectively.

Figure 5.

Translation accuracy and system response time of different models

Satisfaction

Figure 6 shows the satisfaction of 20 experts with the three translation models in an actual translation task, where (a), (b), and (c) show the satisfaction results for online translation, machine translation, and the English translation model of this paper, respectively. The experts’ translation satisfaction and situational fitness scores for the English translation model of this paper almost all fall in the range of 3 to 5 points, and most of the experts are concentrated in the first quadrant, showing that this model has the highest satisfaction. The expert satisfaction scores for the other two models are either concentrated in the middle or more dispersed, indicating poorer satisfaction.

Figure 6.

Expert satisfaction test results

English grammar error correction model based on self-attention mechanism

Automatic grammatical error correction is a typical task in natural language processing, whose goal is to build an automated system that corrects possible grammatical errors in text. Broadly speaking, research on grammatical error correction has evolved from rule-based, to statistics-based, to machine-translation-based approaches. Here, computer translation is applied to English grammatical error correction: a neural machine translation model for English grammatical error correction is constructed based on Transformer, an encoder-decoder model with a self-attention mechanism.

Baseline model for grammatical error correction
A formal definition of grammatical error correction

Grammatical error correction is defined as the following process: given an input sentence that may contain grammatical errors, output a corrected sentence that contains no grammatical errors and preserves the semantics of the original input. Formally, given the input sentence $x$, the corrected sentence $y=\left( y_1,\cdots ,y_n \right)$ is generated autoregressively according to:

p(y|x)=t=1np(yt|x,y1:t1;θ) \[p\left( y\left| x \right. \right)=\prod\limits_{t=1}^{n}{p}\left( {{y}_{t}}\left| x \right.,{{y}_{1:t-1}};\theta \right)\]

In general, the model parameters θ are learned by maximum likelihood estimation: θ=argmaxθt=1nlogp(yt|x,y1:t1;θ) \[\theta =\underset{\theta }{\mathop{\arg \max }}\,\sum\limits_{t=1}^{n}{\log }p\left( {{y}_{t}}\left| x \right.,{{y}_{1:t-1}};\theta \right)\]
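A minimal sketch of how this maximum-likelihood criterion is realised in practice: with teacher forcing, the summed $\log p(y_t \mid x, y_{1:t-1})$ terms are maximised by minimising the token-level cross-entropy between the model's output distributions and the gold corrected tokens. The shapes and the random stand-in for the decoder outputs are illustrative only.

```python
import torch
import torch.nn as nn

vocab_size, batch, tgt_len = 10000, 8, 20
# Stand-in for the decoder's per-step distributions p(y_t | x, y_{1:t-1}); real logits
# would come from the error-correction model.
logits = torch.randn(batch, tgt_len, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch, tgt_len))   # gold corrected tokens y_t

criterion = nn.CrossEntropyLoss()                          # averaged negative log-likelihood
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # minimising this loss w.r.t. theta maximises the training likelihood
```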

Recurrent Neural Networks

A recurrent neural network is a classical model for modeling variable-length sequences, and its standard formalization is as follows: ht=ϕh(Whxt+Uhht1+bh) \[{{h}_{t}}={{\phi }_{h}}\left( {{W}_{h}}{{x}_{t}}+{{U}_{h}}{{h}_{t-1}}+{{b}_{h}} \right)\] yt=ϕy(Wyht+by) \[{{y}_{t}}={{\phi }_{y}}\left( {{W}_{y}}{{h}_{t}}+{{b}_{y}} \right)\] However, vanilla recurrent networks suffer from vanishing and exploding gradients when modeling long sequences, which makes long-distance dependencies difficult to capture.

To address these problems, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are introduced. The LSTM is formally defined as: \[\begin{align} & \text{Input gate: } i_t=\sigma \left( W_i x_t+U_i h_{t-1}+b_i \right) \\ & \text{Forget gate: } f_t=\sigma \left( W_f x_t+U_f h_{t-1}+b_f \right) \\ & \text{Output gate: } o_t=\sigma \left( W_o x_t+U_o h_{t-1}+b_o \right) \\ & \text{Input transformation: } \tilde{c}_t=\tanh \left( W_c x_t+U_c h_{t-1}+b_c \right) \\ & \text{Cell state: } c_t=f_t\circ c_{t-1}+i_t\circ \tilde{c}_t \\ & \text{Hidden state: } h_t=o_t\circ \tanh \left( c_t \right) \\ \end{align}\]

where $W_*$, $U_*$, and $b_*$ are the parameters of the LSTM network and $\circ$ denotes element-wise multiplication. The GRU simplifies the LSTM and contains only two gates in its recurrent cell; it is formally defined as: \[\begin{align} & \text{Reset gate: } r_t=\sigma \left( W_r x_t+U_r h_{t-1}+b_r \right) \\ & \text{Update gate: } z_t=\sigma \left( W_z x_t+U_z h_{t-1}+b_z \right) \\ & \text{Input transformation: } \tilde{h}_t=\tanh \left( W_h x_t+U_h\left( r_t\circ h_{t-1} \right)+b_h \right) \\ & \text{Hidden state: } h_t=\left( 1-z_t \right)\circ h_{t-1}+z_t\circ \tilde{h}_t \\ \end{align}\]
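A minimal sketch of one GRU step written out gate by gate, mirroring the reset-gate, update-gate, candidate-state, and hidden-state equations above (`torch.nn.GRUCell` packages the same computation). The dimensions and randomly initialised parameters are illustrative; in practice they are learned.

```python
import torch

d_in, d_hid = 32, 64
x_t = torch.randn(1, d_in)        # current input x_t
h_prev = torch.zeros(1, d_hid)    # previous hidden state h_{t-1}

# Illustrative (untrained) parameters W_*, U_*, b_*.
W_r, U_r, b_r = torch.randn(d_hid, d_in), torch.randn(d_hid, d_hid), torch.zeros(d_hid)
W_z, U_z, b_z = torch.randn(d_hid, d_in), torch.randn(d_hid, d_hid), torch.zeros(d_hid)
W_h, U_h, b_h = torch.randn(d_hid, d_in), torch.randn(d_hid, d_hid), torch.zeros(d_hid)

r_t = torch.sigmoid(x_t @ W_r.T + h_prev @ U_r.T + b_r)           # reset gate
z_t = torch.sigmoid(x_t @ W_z.T + h_prev @ U_z.T + b_z)           # update gate
h_tilde = torch.tanh(x_t @ W_h.T + (r_t * h_prev) @ U_h.T + b_h)  # candidate state
h_t = (1 - z_t) * h_prev + z_t * h_tilde                          # new hidden state h_t
```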

The encoder-decoder model

In sequence-to-sequence learning, the task is typically modeled with a neural encoder-decoder model. The encoder first encodes the input sequence into a series of hidden-state representations (vectors) in a continuous space; based on these hidden states and the prefix of the output sequence generated so far, the decoder predicts the next output symbol at each time step. Formally: e1,e2,,em=encoder(x1,x2,,xm) \[{{e}_{1}},{{e}_{2}},\ldots ,{{e}_{m}}=encoder\left( {{x}_{1}},{{x}_{2}},\ldots ,{{x}_{m}} \right)\] yt=decoder(e1,e2,,em,y1,y2,,yt1) \[{{y}_{t}}=decoder\left( {{e}_{1}},{{e}_{2}},\ldots ,{{e}_{m}},{{y}_{1}},{{y}_{2}},\ldots ,{{y}_{t-1}} \right)\]
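A minimal sketch of the encoder-decoder contract described above: the encoder maps the input to states $e_1,\ldots,e_m$ once, then the decoder emits one symbol per step conditioned on those states and the prefix generated so far. `encoder`, `decoder_step`, and the BOS/EOS ids are hypothetical placeholders standing in for a trained model; greedy search is used only to keep the loop short.

```python
import torch


def greedy_decode(encoder, decoder_step, src_ids, bos_id, eos_id, max_len=50):
    enc_states = encoder(src_ids)                  # e_1, ..., e_m (computed once)
    ys = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(enc_states, torch.tensor(ys))  # scores for y_t given e, y_{<t}
        next_id = int(logits[-1].argmax())                   # greedy choice of the next symbol
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys[1:]


# Toy stand-ins so the loop can be executed; a real system would use trained networks.
vocab = 6
encoder = lambda src: src.float().mean()
decoder_step = lambda enc, ys: torch.randn(len(ys), vocab)
print(greedy_decode(encoder, decoder_step, torch.tensor([1, 2, 3]), bos_id=0, eos_id=5))
```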

Attention mechanisms

The attention mechanism is an important component of today’s mainstream neural encoder-decoder models. Specifically, at time step $t$, the current decoder hidden state $d_t$ and the encoder hidden states $e_i$, $i=1,2,\ldots,m$, are used to compute the attention weight assigned to each $e_i$. All the attention weights form a probability distribution over the input sequence, based on which the encoder hidden states $e_i$, $i=1,2,\ldots,m$, are weighted and summed to obtain the context representation $c_t$ at the current time step, which helps the decoder predict the $t$-th target-language word. The specific formulas are given below: αt,i=exp(Atten(dt,ei))j=1mexp(Atten(dt,ej)) \[{{\alpha }_{t,i}}=\frac{\exp \left( Atten\left( {{d}_{t}},{{e}_{i}} \right) \right)}{\sum\limits_{j=1}^{m}{\exp }\left( Atten\left( {{d}_{t}},{{e}_{j}} \right) \right)}\] ct=i=1mαt,iei \[{{c}_{t}}=\sum\limits_{i=1}^{m}{{{\alpha }_{t,i}}}\cdot {{e}_{i}}\]

In the above equations, $Atten$ is the function that computes the attention score, commonly defined as: \[Atten\left( d,e \right)=\begin{cases} d^{T}e & \text{dot} \\ d^{T}W_{A}e & \text{general} \\ v^{T}\tanh \left( W_{A}\cdot \left[ d;e \right] \right) & \text{concat} \end{cases}\]

where WA and v are learnable parameters.
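A minimal sketch of the attention computation above: a score function compares the decoder state $d_t$ with every encoder state $e_i$, the scores are softmax-normalised into $\alpha_{t,i}$, and the context $c_t$ is their weighted sum. The bilinear ("general") score is shown; the "dot" and "concat" variants differ only in how the score is computed. Shapes are illustrative.

```python
import torch
import torch.nn as nn

m, d = 12, 256                       # source length, hidden size
encoder_states = torch.randn(m, d)   # e_1, ..., e_m
d_t = torch.randn(d)                 # current decoder state
W_A = nn.Linear(d, d, bias=False)    # parameter matrix of the "general" score

scores = encoder_states @ W_A(d_t)        # score_i = e_i^T (W_A d_t), a bilinear score
alpha = torch.softmax(scores, dim=0)      # attention distribution alpha_{t,i} over the source
c_t = alpha @ encoder_states              # context vector c_t = sum_i alpha_{t,i} * e_i
```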

English Grammar Error Correction Model
Model structure

The Transformer model consists of an encoder and a decoder. Given a source-side erroneous sentence $x=\left( x_1,x_2,\ldots,x_m \right)$, $x_i\in \mathcal{X}$, where $\mathcal{X}$ is the source-side vocabulary, the Transformer encoder encodes $x$ into a set of hidden-state representations in continuous space $e=\left( e_1,e_2,\ldots,e_m \right)$, based on which the Transformer decoder generates the corrected sentence $y=\left( y_1,y_2,\ldots,y_n \right)$, $y_i\in \mathcal{Y}$, time step by time step, where $\mathcal{Y}$ is the target-side vocabulary.

Encoder and Decoder

The encoder and decoder in Transformer each contain six identical layers. Each encoder layer consists of a self-attention sublayer and a feed-forward network sublayer: the input first passes through the self-attention sublayer, after which the same feed-forward network is applied to the output at every position of the self-attention sublayer.

Attention mechanism in Transformer
Scaled dot-product attention

Given a query vector $q$, a set of key vectors $K$, and a set of value vectors $V$, scaled dot-product attention is computed as: \[ScaAtten\left( q,K,V \right)=\operatorname{softmax}\left( \frac{q{{K}^{\text{T}}}}{\sqrt{{{d}_{k}}}} \right)V\] where $d_k$ is the dimensionality of the key vectors.

Multi-head attention

To allow the model to attend to information from different representation subspaces at different positions in the sequence simultaneously, multi-head attention performs several scaled dot-product attention computations in parallel: MultiHead(q,K,V)=Concat(head1,head2,,headh)WO \[MultiHead\left( q,K,V \right)=Concat\left( hea{{d}_{1}},hea{{d}_{2}},\ldots ,hea{{d}_{h}} \right){{W}^{O}}\] headi=ScaAtten(qWiQ,KWiK,VWiV) \[hea{{d}_{i}}=ScaAtten\left( qW_{i}^{Q},KW_{i}^{K},VW_{i}^{V} \right)\]
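A minimal sketch of scaled dot-product attention and its multi-head wrapper: the manual function mirrors the $ScaAtten$ formula, while `torch.nn.MultiheadAttention` implements the projected $h$-head version with the output projection $W^O$. The tensor shapes are illustrative.

```python
import math
import torch
import torch.nn as nn


def scaled_dot_product_attention(q, K, V):
    """ScaAtten(q, K, V) = softmax(q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V


q = torch.randn(1, 5, 64)    # (batch, query positions, d_k)
K = torch.randn(1, 9, 64)
V = torch.randn(1, 9, 64)
single_head = scaled_dot_product_attention(q, K, V)

# h parallel heads with learned projections W_i^Q, W_i^K, W_i^V, concatenated and
# projected by W^O, as in the MultiHead equation above.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
multi_head, _ = mha(q, K, V)
```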

Self-attention sublayer

Multi-head attention is applied in the self-attention sublayer, where the query, key, and value vectors are all derived from the output of the previous sublayer (or directly from the embedding of the input).

Encoder-Decoder Attention Sublayer

This sublayer is similar to the attention layer in a typical recurrent neural network encoder-decoder model, with query vector q coming from the output of the previous sublayer in the decoder, and key vector set K and value vector set V both coming from the output of the encoder.

Positional Encoding

Since the Transformer model contains no recurrent structure, positional encodings are added to the input embeddings in order to exploit the positions of symbols in the sequence. The positional encoding has the same dimensionality as the model’s hidden dimension $d_{model}$ and is computed as: \[\begin{align} & P{{E}_{\left( pos,2i \right)}}=\sin \left( \frac{pos}{{{10000}^{\frac{2i}{{{d}_{model}}}}}} \right) \\ & P{{E}_{\left( pos,2i+1 \right)}}=\cos \left( \frac{pos}{{{10000}^{\frac{2i}{{{d}_{model}}}}}} \right) \\ \end{align}\]
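A minimal sketch of the sinusoidal positional encoding above: even dimensions receive sine and odd dimensions cosine, with wavelengths forming a geometric progression in the hidden dimension. The function name and chosen sizes are illustrative.

```python
import math
import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of PE(pos, 2i) and PE(pos, 2i+1)."""
    pos = torch.arange(max_len).unsqueeze(1).float()                 # positions 0..max_len-1
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))                # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)   # PE(pos, 2i+1)
    return pe


pe = sinusoidal_positional_encoding(max_len=128, d_model=512)  # added to the input embeddings
```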

Input and Output Layers

As with typical sequence-to-sequence models, Transformer uses embedding layers at the bottom of the encoder and decoder to convert the symbols in a sequence into vectors. When generating symbols on the target side, the decoder output is converted into a probability distribution over the vocabulary by a linear transformation followed by a softmax function. Unlike the original Transformer paper, when modeling grammatical error correction in this paper the parameters of the encoder and decoder embedding layers and of the linear transformation before the softmax layer are not shared.

Training criterion

When training the Transformer, maximum likelihood estimation is used, with the goal of maximizing the likelihood of the model on the training data S : θ=argmaxθ(x,y)Slogp(y|x;θ) \[\theta =\underset{\theta }{\mathop{\arg \max }}\,\sum\limits_{\left( x,y \right)\in S}{\log }p\left( y\left| x \right.;\theta \right)\]

Decoding Strategies

Given an input erroneous sentence $x$, the Transformer model uses beam search to generate the corrected sentence $y_{hyp}$ on the target side; at each time step, the top $k$ candidate prefixes with the highest scores are retained. In addition, to counteract the model’s tendency to favor shorter sentences, a length penalty is introduced into the original likelihood score, as specified by the following formulas: score(yhyp,x)=log(p(yhyp|x))LP(yhyp) \[score\left( {{y}_{hyp}},x \right)=\frac{\log \left( p\left( {{y}_{hyp}}\left| x \right. \right) \right)}{LP\left( {{y}_{hyp}} \right)}\] LP(yhyp)=(5+| yhyp |)α(5+1)α,α[ 0.6,0.7 ] $LP\left( {{y}_{hyp}} \right)=\frac{{{\left( 5+\left| {{y}_{hyp}} \right| \right)}^{\alpha }}}{{{\left( 5+1 \right)}^{\alpha }}},\alpha \in \left[ 0.6,0.7 \right]$
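A minimal sketch of the length-penalised beam score above: the hypothesis log-probability is divided by $LP(y_{hyp})$ so that shorter outputs are not unduly favoured when hypotheses of different lengths are compared during beam search. The example values are illustrative.

```python
def length_penalty(hyp_len: int, alpha: float = 0.6) -> float:
    """LP(y) = (5 + |y|)^alpha / (5 + 1)^alpha, with alpha typically in [0.6, 0.7]."""
    return ((5 + hyp_len) ** alpha) / ((5 + 1) ** alpha)


def beam_score(log_prob: float, hyp_len: int, alpha: float = 0.6) -> float:
    """score(y, x) = log p(y | x) / LP(y)."""
    return log_prob / length_penalty(hyp_len, alpha)


# Example: a 12-token hypothesis with total log-probability -7.3.
print(beam_score(-7.3, hyp_len=12))
```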

Experimental results and analysis
Experimental data

The training data for the experiments in this chapter incorporate forged erroneous sentences in addition to manually annotated data. Two different types of training data are used, and their composition is described below:

1) Grammatical errors are sparse in manually annotated corpora. For example, the NUCLE dataset has 57,151 sentences, but only 3,716 sentences contain at least one grammatical error, and less than 0.4% of the words in the whole dataset need to be corrected, with article errors accounting for 14.8%, noun number errors for 8.4%, and preposition errors for only 5.4%. This sparsity of grammatical errors greatly affects the effectiveness of model training, so in this chapter only sentences containing grammatical errors are extracted as training data.

2) Since corpora with manually annotated errors are limited, and in order to improve the performance of the error correction model, training data with forged errors are also introduced for the experiments in this chapter in addition to the training data described above. The WiKi-Text103 corpus is selected as the seed corpus, and preposition errors, article errors, and noun number errors are created at random: prepositions and articles are randomly replaced or deleted once they are recognized in a sentence, and a detected noun is randomly replaced with its other singular/plural form, so as to generate more training data for training the classifiers for the corresponding error types, as sketched below.
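A minimal sketch of the error-forging procedure described in item 2): prepositions and articles are randomly replaced or deleted, and other words are occasionally flipped between a naive singular/plural form to imitate noun number errors. The word lists, corruption probabilities, and the pluralisation rule are deliberately simplistic placeholders; a real pipeline would rely on a POS tagger to identify nouns.

```python
import random

PREPOSITIONS = {"in", "on", "at", "for", "to", "of", "with"}
ARTICLES = {"a", "an", "the"}


def toggle_number(word: str) -> str:
    # Naive singular/plural flip, for illustration only.
    return word[:-1] if word.endswith("s") else word + "s"


def forge_errors(tokens, p: float = 0.3, seed: int = 0):
    """Corrupt a clean token list with preposition, article, and number errors."""
    rng = random.Random(seed)
    corrupted = []
    for tok in tokens:
        low = tok.lower()
        if low in PREPOSITIONS and rng.random() < p:
            if rng.random() < 0.5:
                continue                                    # delete the preposition
            corrupted.append(rng.choice(sorted(PREPOSITIONS - {low})))  # or replace it
        elif low in ARTICLES and rng.random() < p:
            if rng.random() < 0.5:
                continue                                    # delete the article
            corrupted.append(rng.choice(sorted(ARTICLES - {low})))      # or replace it
        elif rng.random() < p * 0.3:
            corrupted.append(toggle_number(tok))            # crude noun-number error
        else:
            corrupted.append(tok)
    return corrupted


print(forge_errors("The cats sat on the mat".split()))
```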

In this chapter, W&I-dev is chosen as the validation set, the test data provided by CoNLL-2014 is used as the test set, and P (precision), R (recall), and F0.5 are selected as the evaluation metrics.

Training results

In this paper, the annotated data and the forged-error data sets are merged to jointly train the English error correction model based on the self-attention mechanism. The training results of the model are shown in Figure 7: the final model achieves 93% accuracy on the test set, with the loss value stabilizing at 0.08.

Figure 7.

Model training results

Comparative analysis of models

The English error correction method based on the self-attention mechanism proposed in this paper and other grammar error correction methods (denoted Method 1~Method 4) were evaluated on the CoNLL-2014 test set. The experimental comparison is shown in Figure 8. The results of this paper’s model on the CoNLL-2014 test set improve on the other grammar error correction methods across the board: its precision, recall, and F0.5 reach 0.783, 0.788, and 0.785, improvements of 2.4%~12.7%, 7.0%~8.7%, and 5.1%~10.7%, respectively. The introduction of Transformer improves the performance of the grammar error correction model.

Figure 8.

Experimental comparison of English grammar error correction methods

Error type analysis

The ERRANT grammatical error annotation toolkit was used to explore the performance of this paper’s model on different grammatical error types. Before the analysis, the M2 files of the CoNLL-2014 test set were first converted with ERRANT into M2 files adapted to the input format of the toolkit. The prefixes “M”, “R”, and “U” denote missing, replacement, and unnecessary errors, respectively.

This paper focuses on the correction of common error types; the grammatical error correction results of the model are shown in Figure 9, where (a) and (b) are the test results for grammatical error correction and grammatical error detection, respectively. The model corrects noun, preposition, spelling, and verb form errors well, with correction accuracy above 76%; the accuracy for correcting spelling errors is the highest, at 79.1%. In error type detection, the model identifies noun number and verb form errors with high accuracy, 78.4% and 79.6%, respectively. For the identification and correction of punctuation errors, the model achieves F0.5 values of 0.726 and 0.748 for correction and detection, showing that it has some ability to identify and correct punctuation-type errors.

Figure 9.

Grammatical error correction results of the model

Conclusion

For better English language learning and use, this paper proposes an English translation model based on improved Masking-BERT enhancement from the perspective of computer translation. The English grammar error correction task is treated as a sequence-to-sequence generation task, i.e., error correction is regarded as translating an incorrect sentence into a correct one, and the self-attention mechanism is used to construct the English grammar error correction model. Through the analysis of the English translation and grammar error correction models, their multi-scenario application to the English language is explored.

Compared with the other models, the English translation model in this paper has better English utterance compression performance, with a compression rate of 88.4% and compression stability of 86.8%, and it also outperforms the comparison models in training loss and BLEU score. Meanwhile, the model’s translation accuracy in English-to-Chinese and Chinese-to-English reaches 72%~87%, its response time is less than 3 s, and its satisfaction scores are concentrated in the 3~5 range, making it superior in practical application.

The English grammar error correction model in this paper scores more than 78% on precision, recall, and F0.5, higher than the other grammar error correction models, and can correct English grammar errors more accurately. The accuracy of the model in correcting noun, preposition, spelling, and verb form errors exceeds 76%, and the accuracy of detecting noun number and verb form errors exceeds 78%. The model is capable of successfully detecting and correcting common English grammar errors.

Under the trend of globalization, the increase in English usage scenarios such as work, study, and immigration has produced a large population of English learners. This study constructs an English translation model and a grammar error correction model based on computer translation, which can assist the public in learning and using English in daily life and improve the efficiency of English learning.

Funding:

This research was supported by the Ministry of Education’s Supply and Demand Integration Project of Employment and Education: “Research on Enhancing the Innovation and Entrepreneurship Ability of Traditional Chinese Medicine International Communication Talents” (No. 2024041148735).