Optimising Semantic Accuracy and Contextual Comprehension in English Translation Based on Deep Learning Algorithms
Published: 24 Sep 2025
Received: 31 Dec 2024
Accepted: 19 Apr 2025
DOI: https://doi.org/10.2478/amns-2025-1002
© 2025 Ronglin Fu, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
At present, English is used ever more widely in daily communication, and the problem of erroneous English text increasingly appears in the translation process. Such errors have a considerable impact on accurate English translation and substantially reduce translation accuracy. A large number of error-text detection systems for English translation robots have been developed to address this. However, with the widespread adoption of the Internet and digital information transmission [1-4], the drawbacks of traditional error-text detection systems for English translation robots have gradually become apparent: their grammatical error correction suffers from low accuracy and low efficiency, which has largely hindered the application and development of English translation robots [5-8].
With the development and maturation of automatic translation software, higher demands are placed on the accuracy of machine translation and its calibration. In a machine translation environment, semantic analysis needs to be combined with the contextual features handled by the automatic translation software [9-11]. Automatic translation and calibration based on semantic similarity improve the accuracy of semantic assignment in automatic translation. Under conditions of semantic heterogeneity, automatic calibration of machine translation is mainly achieved through conceptual analysis of semantic similarity [12-15]. By abstracting the relevance features and semantic similarity features of automatically translated texts, semantic heterogeneity can be eliminated according to similar semantics, the knowledge structure graph of the English translation process can be analysed, and a semantic concept tree can be constructed to reduce translation errors [16-19].
In this paper, the optimisation of semantic accuracy and contextual comprehension in English translation is achieved by constructing an English machine translation model that fuses a convolutional neural network (CNN) with the Transformer model. Experiments are conducted on an English-Chinese parallel corpus to compare the CNN-Transformer English translation model proposed in this paper with a basic CNN model and the standard Transformer machine translation model. The training speed of the models is compared first, and then two metrics, Bilingual Evaluation Understudy (BLEU) and Perplexity (PPL), are used to compare the three models' optimisation of semantic accuracy and contextual comprehension, respectively.
Machine Translation (MT) usually refers to the use of computer technology to convert one natural language (i.e., source language) into another natural language (i.e., target language). Machine translation technology has gone through four stages of development, which are rule-based, instance-based, statistical and deep learning approaches to machine translation.
Rule-based machine translation systems accomplish the basic tasks of inputting the source language, analysing, converting and outputting the target language by means of encoders and decoders, which form the basic framework of machine translation. The rule-based approach is limited by manually defined rules, dictionaries and knowledge bases that require a great deal of time and resources to construct and maintain. Rule bases and dictionaries have limited coverage and cannot cope with all translation situations, especially domain-specific terms, dialects or new vocabulary. Therefore, although such systems perform well in some domains, they have limited utility for a wide range of natural language translation tasks.
Instance-based machine translation methods and statistical machine translation methods can both be categorised as corpus-based machine translation, the basic principle of which is to carry out translation with the help of a corpus. Instance-based methods mainly use existing translated texts as samples and compare untranslated sentences against these samples to derive a mapping relationship. Because corpora are limited in size and vary in quality, the method has some limitations in practical application.
Statistical machine translation (SMT) can be understood as a process of information transfer, in which translation is interpreted as an operation performed under a channel model. The core task of machine translation in this framework is to find the target sentence with the highest probability. However, the method only considers linear relationships between words and does not take sentence structure into account, so it cannot improve translation quality when the word order of the two languages differs substantially.
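For reference, the channel-model view mentioned above corresponds to the standard noisy-channel formulation of SMT (a textbook result, not specific to this paper):
$$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(f \mid e)\,P(e)}{P(f)} = \arg\max_{e} P(f \mid e)\,P(e)$$
where $f$ is the source sentence, $e$ a candidate target sentence, $P(f \mid e)$ the translation model and $P(e)$ the language model. The formulation makes the limitation explicit: it scores word-level correspondences and target-side fluency, but has no direct representation of sentence structure or long-range reordering.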
Deep learning-based machine translation methods, also known as Neural Machine Translation (NMT), are gradually replacing traditional machine translation models that combine rule-based and corpus-based approaches in many language pairs as well as application scenarios. It focuses on modelling the mapping relationship between input and output directly from a single system by means of neural networks, without the need for separate optimisation processes for multiple systems as in traditional machine translation systems. Among them, the Transformer model is an architecture based purely on the self-attention mechanism and feed-forward neural networks, which not only sets a new record for machine translation but also greatly accelerates the convergence speed of model training.
Although neural machine translation has substantially improved translation quality, problems such as translation omission, data sparsity and the difficulty of incorporating external knowledge remain, and these still limit further improvements in quality. Therefore, in this paper we improve the Transformer model to increase the semantic accuracy and contextual comprehension of English translation.
Deep learning maps words, phrases or sentences into a high-dimensional vector space through embedding techniques, and these vectors are able to capture semantic relationships between words. For example, word embedding models such as Word2Vec and GloVe can learn the similarities and differences between words, providing rich semantic information for translation models. During translation, the model can then accurately understand the semantics of the source-language text and generate a translation that preserves those semantics in the target language.
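As a minimal, self-contained sketch of this idea (using toy 4-dimensional vectors rather than real Word2Vec or GloVe embeddings, which would normally be loaded from a pre-trained model), the semantic closeness of two words can be measured by the cosine similarity of their embedding vectors:

```python
import numpy as np

# Toy embeddings: in practice these would be loaded from a pre-trained
# Word2Vec or GloVe model; the 4-dimensional values here are illustrative only.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.2]),
    "queen": np.array([0.7, 0.7, 0.1, 0.3]),
    "apple": np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words should score higher than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```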
The Attention Mechanism (AM) is a major innovation of deep learning in machine translation, which allows the model to dynamically focus on different parts of the source language text while translating. By assigning different weights to each word or phrase in the source language, the model is able to capture key information more accurately, thus prioritising this important information during translation and improving the semantic accuracy of the translation.
RNNs and their variants (e.g. LSTM, GRU) have a powerful sequence modelling capability, capable of handling input sequences of arbitrary length and preserving long-distance dependencies in the sequences. This is crucial for understanding complex sentence structures and long-distance dependencies in English. In the translation process, these models are able to synthesise the contextual information of the whole sentence and generate a more contextualised translation.
The Transformer model further enhances contextual comprehension through the self-attention mechanism. It no longer relies on the recurrent structure of RNNs to process sequences, but adopts parallel computation, which greatly improves processing speed. At the same time, the self-attention mechanism enables the model to attend to all positions in the input sequence simultaneously, capturing contextual information more comprehensively and generating more fluent and natural translations.
The Transformer model is a novel neural machine translation model, which not only achieves better translation performance but also greatly improves the model training efficiency by using the attention mechanism [20].
The Transformer model adopts a standard encoder-decoder structure. Both the encoder and the decoder consist of multiple stacked layers: each encoding layer consists of two parts, self-attention and a position-wise feed-forward neural network, while each decoding layer consists of self-attention, encoder-decoder attention and a position-wise feed-forward neural network. In addition, to address issues such as exploding or vanishing gradients and unstable training, the Transformer applies residual connections around each sub-layer followed by layer normalisation.
Benefiting from the design of the attention mechanism, which can model the relationship between any two positions in a sequence, the Transformer not only matches CNNs in its ability to process data in parallel, but also captures long-distance dependencies in sequences better than LSTMs and extracts features more effectively.
The attention mechanism used in this paper is the scaled dot-product attention mechanism, which is computed from three vectors: the query $Q$, the key $K$ and the value $V$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key vectors and the scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large.
The traditional multi-head attention mechanism consists of multiple dot-product attention heads; the results of the individual heads are concatenated and passed through a linear layer to fuse the information from each attention head:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
where $h$ is the number of attention heads and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are learnable projection matrices.
The advantage of multi-head attention is that the model learns in different representation subspaces through individual, mutually independent attention heads. To prevent the independent heads from acquiring duplicate information, a linear layer is added after the multi-head attention for information interaction between the different heads, which improves the model's generalisation ability and allows it to learn richer representations.
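A minimal numpy sketch of the two formulas above may help make the head-splitting and fusion steps concrete; the sequence length, model dimension, head count and random projection matrices are illustrative assumptions, not the learned parameters of the model in this paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    return softmax(scores, axis=-1) @ V                 # (h, n, d_k)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Split d_model into h heads, attend per head, concatenate, fuse with W_o."""
    n, d_model = X.shape
    d_k = d_model // h
    # Project and reshape to (h, n, d_k).
    Q = (X @ W_q).reshape(n, h, d_k).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, h, d_k).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(Q, K, V)        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o                                  # linear layer fuses the heads

# Illustrative sizes: sequence length 5, d_model 16, 8 heads.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (5, 16)
```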
In order to further optimise the semantic accuracy and contextual comprehension of the Transformer model in English translation, this paper proposes an English machine translation model based on the fusion of CNN and Transformer. The model follows a two-branch (left-right) design: the left branch is an attention mechanism for capturing global contextual feature information [21], while local contextual information is handled by convolution. To further reduce the computational cost, the model uses dynamic convolution instead of ordinary convolution.
In this way, the attention and convolution modules are placed side by side and view the input from two different perspectives, global and local, so that the model can capture features better. The structure of the model's attention module is shown in Fig. 1.
Since the model in this paper only modifies the attention module of the original model, the inputs and outputs remain the same as in the original model: after converting the input sentence sequence into word vectors, the model adds position vectors for fusion, and the result is then used as the input to the attention module.

Figure 1: Convolution attention module
Let the input source language sequence be represented by $X = (x_1, x_2, \dots, x_n)$. The word vector encoding of the sequence can be represented by equation (5):
$$H^{0} = \left[ E(x_1) + PE_1,\; E(x_2) + PE_2,\; \dots,\; E(x_n) + PE_n \right] \tag{5}$$
In Eq. (5), $E(\cdot)$ denotes the word embedding of each token and $PE_i$ denotes the positional encoding of position $i$. Both share the same dimension $d_{model}$, so that $H^{0} \in \mathbb{R}^{n \times d_{model}}$ can be fed directly into the attention module.
In order to be consistent with the parameters of the original Transformer model, this paper defines the total number of attention heads of the original Transformer model as $h$ and splits these heads between the two branches. The right branch of the model uses dynamic convolution to compute local attention; denoting the number of heads assigned to the convolution branch by $h_c$ and the number assigned to the attention branch by $h_a$, the outputs of the two branches are concatenated and fused through a linear layer in the same way as the heads of standard multi-head attention. Thus the global features captured by self-attention and the local features captured by dynamic convolution are combined within a single module. The overall structure of the model is shown in Fig. 2.

Figure 2: Model structure
Both the encoder and the decoder in the model contain an attention module and a feed-forward network module. The main role of the feed-forward module is to enhance the output of the attention module through linear transformations, and the model uses the same feed-forward module as the original Transformer. The numbers of heads assigned to the two branches of the attention module must satisfy equation (7):
$$h_a + h_c = h \tag{7}$$
where $h_a$ is the number of heads used by the attention branch, $h_c$ the number of heads used for the convolutional computation, and $h$ the total number of heads in the original Transformer. For example, if the attention branch occupies 2 heads and the convolution branch uses 6 heads, the constraint is satisfied for the standard setting of $h = 8$ heads.
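A simplified sketch of how the head budget in equation (7) might be split between the two branches is given below. It uses a plain depthwise 1D convolution with a normalised kernel as a stand-in for the dynamic convolution described above, random matrices in place of learned projections, and the head counts from the example in the text ($h_a = 2$, $h_c = 6$, $h = 8$); it is an illustration of the branch structure, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_branch(X, h_a, d_head):
    """Left branch: h_a self-attention heads over the full sequence (global context)."""
    n, d_model = X.shape
    out = []
    for _ in range(h_a):
        # Illustrative random projections; in the real model these are learned.
        W_q, W_k, W_v = (np.random.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        out.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)       # (n, d_head)
    return np.concatenate(out, axis=-1)                          # (n, h_a * d_head)

def convolution_branch(X, h_c, d_head, kernel_size=3):
    """Right branch: h_c convolution heads over a local window (local context).
    A plain depthwise 1D convolution is used here as a simplified stand-in
    for the dynamic convolution described in the text."""
    n, d_model = X.shape
    out = []
    pad = kernel_size // 2
    for _ in range(h_c):
        W_v = np.random.normal(size=(d_model, d_head))
        V = X @ W_v                                               # (n, d_head)
        kernel = softmax(np.random.normal(size=kernel_size))      # normalised kernel weights
        V_pad = np.pad(V, ((pad, pad), (0, 0)))
        conv = sum(kernel[k] * V_pad[k:k + n] for k in range(kernel_size))
        out.append(conv)                                          # (n, d_head)
    return np.concatenate(out, axis=-1)                           # (n, h_c * d_head)

# Head budget from equation (7): h_a + h_c = h, with d_head = d_model / h.
n, d_model, h, h_a, h_c = 5, 16, 8, 2, 6
assert h_a + h_c == h
d_head = d_model // h
X = np.random.normal(size=(n, d_model))
merged = np.concatenate([attention_branch(X, h_a, d_head),
                         convolution_branch(X, h_c, d_head)], axis=-1)  # (n, d_model)
W_out = np.random.normal(size=(d_model, d_model))
print((merged @ W_out).shape)  # fuse the two branches with a linear layer -> (5, 16)
```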
In this paper, BLEU and PPL are selected as model performance evaluation metrics to assess the semantic accuracy and contextual comprehension of English translation, respectively.
Bilingual Evaluation Understudy (BLEU) is a commonly used metric for assessing the quality of machine translation. It calculates the similarity between the machine translation output and the reference translation using n-gram matching: the larger the BLEU value, the higher the similarity and the better the translation quality [22].
Its formula is as follows:
$$BLEU = BP \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)$$
where $p_n$ is the modified n-gram precision, $w_n$ is the weight of each n-gram order (usually $w_n = 1/N$ with $N = 4$), and $BP$ is the brevity penalty:
$$BP = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases}$$
where $c$ is the length of the machine translation output and $r$ is the length of the reference translation.
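As a rough illustration of the n-gram matching idea (a simplified, single-reference, sentence-level version, without the smoothing and corpus-level aggregation used by standard BLEU tools), the metric can be computed as follows:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped n-gram
    precisions multiplied by the brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # avoid log(0) in this toy version
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_avg)

cand = "the quick brown fox jumps over the lazy dog".split()
ref = "the quick brown fox jumped over the lazy dog".split()
print(round(sentence_bleu(cand, ref), 4))
```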
Perplexity (PPL) is a common measure of language model quality; it evaluates the probability of a sentence based on the conditional probability of each word in the sentence [23]. Its formula is derived as follows.
For the sentence $S = (w_1, w_2, \dots, w_N)$, the probability of its occurrence is:
$$P(S) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_N \mid w_1, \dots, w_{N-1}) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})$$
The formula for PPL is shown in equation (16):
$$PPL(S) = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}} \tag{16}$$
The basic idea behind the PPL evaluation metric is that language models that can assign higher probability values to the sentences in the test set are more effective. From equation (16), it can be seen that inside the root sign is the inverse of the probability of the sentence, so when the sentence is better (with a high probability of occurrence), the PPL will be smaller, and the effectiveness of the model, i.e., the contextual comprehension, will be better.
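A minimal sketch of equation (16), with made-up per-word conditional probabilities standing in for a real language model's output, illustrates how PPL rewards sentences the model finds likely:

```python
import math

def perplexity(word_probs):
    """PPL(S) = (prod 1 / P(w_i | history))^(1/N): the geometric mean of the
    inverse per-word probabilities, so more probable sentences score lower."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)

# Hypothetical per-word conditional probabilities assigned by a language model.
fluent_sentence = [0.20, 0.35, 0.50, 0.40, 0.30]
disfluent_sentence = [0.02, 0.05, 0.01, 0.04, 0.03]

print(round(perplexity(fluent_sentence), 2))     # lower PPL -> more fluent
print(round(perplexity(disfluent_sentence), 2))  # higher PPL -> less fluent
```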
The experiment starts from three dimensions, time, space and topic, and uses Python to crawl news articles from the Internet. From these, 400,000 Chinese sentences and 400,000 English sentences are selected as the Chinese and English monolingual corpora, and they are arranged and integrated using a multidimensional integration method to form an English-Chinese weakly parallel corpus as the training set. To strengthen the relevance of this weakly parallel corpus, about 100,000 English-Chinese parallel sentence pairs provided by CWMT2017 are added to the training set; the crawled weakly parallel corpus and the CWMT2017 parallel corpus are mixed and shuffled, yielding the final training corpus, which contains about 500,000 Chinese sentences and 500,000 English sentences. The development set consists of 2,400 Chinese-English bilingual sentence pairs randomly selected from the CWMT2017 portion of the training data. For the test set, the English-Chinese test set test2015 provided by CWMT is used, with a total of 1,500 sentence pairs.
In order to verify the performance of the proposed English machine translation model based on the fusion of CNN and Transformer, the English translation system is developed using the MyEclipse and PyCharm IDEs. The operating system is Linux, and the experimental environment uses Python 3.6 and CUDA 9.0.
In order to achieve better experimental results, a word segmentation tool is used to split the English sentences at four different granularities: word, syllable, subword and character, and the English-Chinese and Chinese-English parallel corpora are translated at each granularity. The vocabulary size used in the experiments is set to 32k, the number of encoder and decoder layers is set to 6, and the batch size is set to 1024.
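For reproducibility, a subword segmentation step of this kind could look like the following sketch; the use of the sentencepiece library, the file names and all settings other than the 32k vocabulary size are illustrative assumptions rather than the exact tooling used in the original experiments:

```python
# Illustrative subword segmentation with SentencePiece (BPE); the file names
# and settings below are assumptions for demonstration, apart from the 32k
# vocabulary size mentioned in the text.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.en",            # hypothetical raw English training text
    model_prefix="en_subword",
    vocab_size=32000,            # matches the 32k word list used in the experiments
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="en_subword.model")
print(sp.encode("Machine translation quality depends on segmentation.", out_type=str))
```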
In order to gain a more intuitive understanding of the training behaviour of the proposed CNN-Transformer model, the experiments compare its training time with that of the traditional convolutional neural network (CNN) model and the standard Transformer machine translation model on a training set of 250,000 English-Chinese bilingual parallel sentence pairs.
The experimental results show that, on the same training dataset, the training times of this model, the CNN model and the Transformer machine translation model are 17 h, 20 h and 26 h, respectively; the other two models exceed the training time of this model by 3 h and 9 h. This shows that the proposed CNN-Transformer model improves the training speed and computational efficiency of the machine translation model and shortens the training time.
In order to verify the performance of each translation model under different English segmentation granularities, the experiments use the proposed model, the basic CNN model and the Transformer machine translation model to translate the English-Chinese and Chinese-English parallel corpora at the four granularities of word, syllable, subword and character. The resulting BLEU scores of the three models are shown in Figure 3, where panels (a) and (b) show the BLEU scores for English-Chinese and Chinese-English translation, respectively.
As can be seen from the results in Fig. 3, the BLEU values of this paper's model are higher than those of the basic CNN model and the Transformer machine translation model at every English segmentation granularity. At subword granularity, the BLEU values of this paper's model in the English-Chinese and Chinese-English directions are 21.64% and 22.47%, respectively, which are 4.13% and 3.87% higher than the CNN model and 1.42% and 1.50% higher than the Transformer machine translation model. At character granularity, the model improves the BLEU scores in English-Chinese and Chinese-English by 4.31% and 4.38% over the CNN model, and by 1.62% and 1.51% over the Transformer machine translation model. Overall, this paper's model achieves high translation quality at all four granularities in both the English-Chinese and Chinese-English translation directions.

Figure 3: BLEU scores of three models
In order to verify the optimisation effect of the CNN-Transformer English translation model on contextual comprehension, it is compared with the basic CNN model and the Transformer machine translation model. The experiments use the test set data to evaluate the English machine translation models with perplexity (PPL) as the evaluation metric, while the trained translation models are used to generate the translations that are evaluated.
The trend of PPL change in model training in the direction of English-Chinese translation is shown in Figure 4.
As can be seen from Fig. 4, the PPL values of the trained CNN-Transformer, Transformer and CNN models decreased by 89.79%, 86.24% and 82.61%, respectively, when performing English-Chinese translation on the test set. The PPL values show an overall decreasing trend, and the decrease for the CNN-Transformer model is the largest, with its PPL value always smaller than that of the other two models. This indicates that the model in this paper produces the most fluent English-to-Chinese translations and has the best optimisation effect on contextual comprehension.

Figure 4: The trend of PPL in English-Chinese translation
The trend of PPL change in model training in the direction of Chinese-English translation is shown in Figure 5.
As can be seen from Fig. 5, the PPL values of CNN-Transformer model, Transformer model and CNN model show a decreasing trend in Chinese-English translation training on the test set, with a decrease of 87.52%, 82.10% and 81.63%, respectively. The PPL value of the CNN-Transformer model is also always smaller than that of the Transformer model and the CNN model, indicating that the CNN-Transformer translation model in this paper has a higher fluency when translating from Chinese to English and has the best optimisation effect on context comprehension among the three models.

Figure 5: The trend of PPL in Chinese-English translation
Figures 4 and 5 together show that the perplexity (PPL) of this paper's CNN-Transformer translation model on the English-Chinese parallel corpus is low; the final PPL values after training are 80.81 for English-Chinese translation and 42.44 for Chinese-English translation. The lower the PPL value, the lower the semantic confusion, the more fluent the generated utterances in the target language, and the better the contextual comprehension, which demonstrates the effectiveness of this paper's model in optimising contextual comprehension in English translation.
In this paper, a CNN-Transformer English machine translation model fusing a convolutional neural network and the Transformer model is constructed to optimise the semantic accuracy and contextual comprehension of English translation, and its effect is verified.
On the same training dataset, the training times of this paper's model, the CNN model and the Transformer machine translation model are 17 h, 20 h and 26 h, respectively; the other two models exceed the training time of this paper's model by 3 h and 9 h. This indicates that the CNN-Transformer model constructed in this paper improves the training speed and computational efficiency of the machine translation model and shortens the training time.
At subword granularity of English segmentation, the BLEU values of this paper's model in the English-Chinese and Chinese-English directions are 21.64% and 22.47%, respectively. Compared with the CNN model, the BLEU scores of this model are improved by 4.13% and 3.87%, respectively; compared with the Transformer machine translation model, they are improved by 1.42% and 1.50%, respectively. The BLEU scores of this paper's model are also higher than those of the other two models at word, syllable, subword and character granularity. This indicates that this paper's model achieves high translation quality in English-Chinese bidirectional translation at all four granularities and, compared with the other two models, offers the best optimisation of semantic accuracy.
On the test set, the PPL values of the trained CNN-Transformer, Transformer and CNN models decreased by 89.79%, 86.24% and 82.61% for English-Chinese translation, and by 87.52%, 82.10% and 81.63% for Chinese-English translation, respectively. The PPL values show an overall decreasing trend, and the PPL value of the CNN-Transformer model is always smaller, and its decrease always larger, than those of the other two models. This indicates that the model in this paper achieves the highest fluency in English-Chinese bidirectional translation and can effectively optimise the contextual comprehension of English translation.
