Selection and Optimization of University English Teaching Path Based on Knowledge Distillation and Transfer Learning

  

Introduction

Traditional data-driven diagnostic methods require training and testing data to have consistent distributions; however, distributional differences inevitably exist across domains. To adapt to these differences, transfer learning methods have become increasingly popular as a means of solving such problems: a pre-trained model is fine-tuned to reuse a limited amount of target data [1-4]. In computer vision, transfer learning is widely used in tasks such as image classification, object detection, and image generation. In natural language processing, transfer learning also plays an important role: in tasks such as text classification, sentiment analysis, and machine translation, using pre-trained word vectors or language models can significantly improve the performance and generalization ability of models. Challenges remain in how to accurately assess the similarity between source and target tasks to determine the effectiveness of transfer, how to choose appropriate transfer strategies for different tasks and data distributions, and how to balance the relationship between source and target tasks to avoid negative transfer [5-8].

Knowledge distillation refers to transferring high-precision diagnostic knowledge from a cumbersome model to a lightweight model to improve the latter's accuracy. A number of novel algorithms and techniques have emerged in this field in recent years. Contrastive distillation facilitates the learning of more detailed feature representations by introducing an additional contrastive loss term in the student model [9-11]. Attention-guided knowledge distillation uses the attention mechanism of the teacher model to guide the student model toward important feature maps or temporal information; by distilling the attention weights or the attention distribution of the feature maps, the student model learns the key features more effectively [12]. Adaptive distillation techniques dynamically adjust the distillation strategy based on the performance of the student model during training to utilize teacher knowledge more efficiently [13]. Multi-teacher distillation combines the knowledge of multiple teacher models with different strengths to provide more comprehensive guidance to the student model. These state-of-the-art knowledge distillation techniques continue to push the boundaries of model compression and transfer learning, providing more powerful and flexible solutions for real-world applications [14-15].

Learning ability is one of the core competencies to be cultivated by the English curriculum and a key element in the development of core literacy; developing learning ability helps students master scientific learning methods and form good lifelong learning habits [16-17]. The teaching methods of English teachers shape students' English learning methods, and scientific, effective teaching methods are an important way to help students improve their learning outcomes and develop their learning ability. Therefore, teachers should purposefully and consciously use theories related to knowledge distillation and transfer learning to guide teaching design and promote the transfer of learning in classroom teaching practice [18-20].

In this paper, knowledge distillation and transfer learning are combined to build a model for selecting and optimizing college English teaching paths, and a teaching experiment is evaluated using the t-test method. The model is divided into two parts, a teacher model and a student model: a dense convolutional neural network (DenseNet) is selected as the teacher model, and an artificial neural network (ANN) as the student model. In the model evaluation experiment, the experimental group and the control group used the model-optimized English teaching path and the traditional teaching method, respectively. The characteristics of the research subjects are first described, and then differences in background variables and the representativeness of the subjects are examined to ensure that the sample is reliable. Finally, independent-samples and paired-samples t-tests are applied to compare the improvement in English proficiency of students in the experimental and control groups.

The Path of College English Teaching Based on Knowledge Distillation and Transfer Learning
Transfer Learning and Knowledge Distillation
Transfer learning

In transfer learning, the source domain is the domain being transferred from, while the target domain is the domain to be learned. The transfer task involves applying models and knowledge learned in the old domain to the new domain. Transfer learning is achieved by identifying and utilizing the commonalities of the domains as a bridge, systematically transferring existing cognitive achievements from the source domain to the target domain.

In this paper, we use the notation D to denote a domain of transfer learning, where the inputs of the sample data in the domain are X, the outputs are Y, and the probability distribution is denoted as P(x, y). The source and target domains are denoted as Ds and Dt, respectively. When Ds ≠ Dt, it corresponds to Xs ≠ Xt, Ys ≠ Yt, or Ps(x, y) ≠ Pt(x, y).

Specifically, through transfer learning, the source domain data is utilized on the target domain to learn a model that minimizes the error of the prediction function f : Xt → Yt on the target domain, by leveraging the rich information contained in the source domain data. The aim is to realize effective transfer and adaptive learning of knowledge from the source domain to the target domain by exploring and exploiting the commonalities and correlations that may exist between the two domains.

A core aspect of transfer learning is to construct and optimize a predictive model for the target domain based on existing source domain data. A learning process in which one task helps to facilitate another is called "positive transfer", while a learning process in which one task hinders another is called "negative transfer".

Knowledge distillation

The essence of knowledge distillation (KD) belongs to the category of transfer learning. Its main idea is to take a well-trained model as the teacher model and, by controlling the temperature T, "distill" the "knowledge" from the teacher model's output for training the student model, in the hope that the lightweight model can learn the teacher model's "knowledge" and achieve performance comparable to that of the teacher. The "knowledge" here is narrowly interpreted as the similarity information in the teacher model's output, which can be transferred to assist in the training of other models.

The knowledge distillation model consists of three parts: teacher model, student model and knowledge transfer, and the whole process is trained on a supervised dataset. It transfers the knowledge learned by the teacher model, which has a large number of parameters and strong learning ability, to the student model, which has fewer parameters and weaker learning ability.

Based on the knowledge used for distillation, distillation can be divided into the following three ways:

1) Response-based distillation: learning the output of the teacher model, e.g. DistilBERT model, where the student model learns the knowledge of the output layer of the teacher model.

2) Feature-based distillation: learning the knowledge of the middle layer of the teacher model, e.g. PKDBERT, where the student model learns the knowledge of the middle layer.

3) Relation-based distillation: learning the relationships between layers of the teacher model or between samples; e.g., TinyBERT learns the knowledge of the embedding layer.

The main reason why this knowledge is effective is that some implicit features (dark knowledge) cannot be represented at the data level, and teacher models with strong learning ability can learn these features. For general classification problems, the label of data is a “one-hot” category, i.e., the category of a piece of data is fixed, which is called “hard label”.

In the process of knowledge distillation, during prediction the trained teacher network provides the label probability distributions of its softmax layer to the student model as guidance. These label probability distributions contain inter-category information and are referred to as soft labels.

The degree of distillation is determined by the temperature: a higher temperature value indicates a higher degree of distillation and a more moderate (softer) label distribution, whereas a lower temperature value indicates a lower degree of distillation, sharpening the distribution and amplifying the effect of misclassified probabilities, thereby introducing unnecessary noise.

The teacher network and the student network are jointly trained, and the knowledge and learning style of the teacher network affect the learning of the student network. The loss function Loss is shown in equation (1): \[\begin{array}{*{35}{l}} Loss=\lambda \times {{L}_{distill}}+(1-\lambda )\times {{L}_{CE}} \\ =\lambda \times cross\text{ }entropy({{s}_{i}},{{t}_{i}})+(1-\lambda )\times cross\text{ }entropy({{s}_{i}},{{y}_{i}}) \\ \end{array}\]

Where: Ldistill denotes the loss function for knowledge distillation between teacher and student. LCE denotes the cross-entropy loss function between the output of the student model and the labeled categorical hard labels. λ denotes the balance parameter of the two loss functions. si denotes the student model output. yi is the hard label. ti denotes the output of the teacher model, the soft label. cross entropy is the cross-entropy loss function. The larger the weighting coefficient of the soft label cross entropy, the more reliant transfer learning is on the contribution of the teacher model.
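To make equation (1) concrete, below is a minimal sketch in Python, assuming PyTorch; T = 4 and lam = 0.6 are illustrative settings rather than the paper's tuned values, and the soft term is implemented as a KL divergence, which for a fixed teacher distribution differs from cross entropy against the soft targets only by a constant.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, hard_labels, T=4.0, lam=0.6):
    """Equation (1): lam * L_distill + (1 - lam) * L_CE."""
    # Soft term: match the student's softened distribution to the teacher's.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    l_distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * T**2
    # Hard term: ordinary cross entropy with the one-hot class labels.
    l_ce = F.cross_entropy(student_logits, hard_labels)
    return lam * l_distill + (1 - lam) * l_ce

# Toy usage: a batch of 8 samples over 5 teaching-path categories.
s = torch.randn(8, 5)              # student logits
t = torch.randn(8, 5)              # teacher logits
y = torch.randint(0, 5, (8,))      # hard labels
print(kd_loss(s, t, y).item())
```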

Mechanisms for combining knowledge distillation and transfer learning

The combination of knowledge distillation and transfer learning provides an innovative approach to teaching English in higher education. Specifically, a model pre-trained on a large-scale multilingual dataset can be adapted to a specific English teaching task using transfer learning techniques. Then, knowledge distillation techniques are used to further optimize this adapted model to make it more suitable for specific teaching scenarios and student groups. This combination mechanism not only improves the pedagogical adaptability and efficiency of the model, but also reduces the reliance on large amounts of labeled data to some extent.

Teacher Model and Student Model Selection
Teacher model selection

In this paper, the selected teacher model is Dense Convolutional Neural Network (DenseNet).

DenseNet is a deep learning network inspired by and improved upon the residual network (ResNet) architecture; its structure is shown in Fig. 1 [21]. Unlike ResNet's approach to improving network performance, its core idea is dense connectivity: connections are established between different layers so that the feature information of every layer is fully utilized, improving the training of the network. DenseNet mainly consists of multiple dense blocks and transition layers. Within a dense block, each layer is connected to all preceding layers by concatenation, and the feature-map size is kept the same across layers. Adjacent dense blocks are connected by a transition layer, which performs downsampling through a batch normalization layer, an activation layer, a convolution layer, and a pooling layer: a 1×1 convolution reduces the number of channels and pooling reduces the size of the feature maps, so the transition layer serves as a compression module.

Figure 1.

Structure of dense convolutional neural network

In a conventional convolutional neural network with L layers there are L connections, one between each layer and the next, whereas an L-layer DenseNet has L(L + 1)/2 connections, and the input to each layer is the output of all preceding layers. That is, the output of a conventional network at layer l is: \[{{x}_{l}}={{H}_{l}}\left( {{x}_{l-1}} \right)\]

In DenseNet, each layer takes the concatenation of all preceding layers as input: \[{{x}_{l}}={{H}_{l}}\left( \left[ {{x}_{0}},{{x}_{1}},\ldots ,{{x}_{l-1}} \right] \right)\]

where Hl(·) denotes the nonlinear transformation function, comprising operations such as batch normalization, activation, and convolution. If each Hl(·) outputs k feature maps, i.e., the number of feature-map channels obtained is k, this hyperparameter k is called the growth rate in DenseNet. In general, a smaller growth rate (e.g., k = 12) makes the network narrower, reduces the number of parameters, and achieves better network performance. In contrast to other convolutional neural networks, DenseNet is thus an implicitly strongly supervised model: by strengthening the inputs to each layer it enhances the propagation of features through the network and enables feature reuse, and each layer has direct access to the gradient from the loss function and to the original input signal, which boosts the backpropagation of the gradient and makes the network easier to train, both reducing the number of parameters and alleviating the vanishing-gradient problem.
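The dense connectivity described above can be sketched as follows, assuming PyTorch; the layer count, channel sizes, and the BN-ReLU-Conv composition of Hl(·) follow the standard DenseNet design rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each H_l(.) is BN -> ReLU -> 3x3 conv, producing k new feature maps.
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}]): concatenate all earlier outputs.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16, growth_rate=12, num_layers=4)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32]): 16 + 4 * 12 channels
```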

Student model selection

In this paper, the selected student model is the Artificial Neural Network (ANN).

ANN is a multilayer supervised perceptron with strong self-learning ability that minimizes the empirical risk; its structure is shown in Fig. 2 [22]. The ANN model consists of two parts: feed-forward propagation of information and back-propagation of error. In feed-forward propagation, features of the input samples are adaptively and randomly extracted through multiple hidden layers and mapped to the target type by the output layer.

Figure 2.

Structure diagram of ANN

Let the training sample set be \[\left\{ {{x}_{i}},{{y}_{i}} \right\}_{i=1}^{m}\] with m samples, where xi ∈ Rd contains d features and yi ∈ Rl covers l output categories. The function expression for the h th hidden layer is: \[{{(H_{i}^{h})}_{j}}={{\sigma }^{h}}\left( \sum\limits_{i=1}^{{{n}_{h-1}}}{\omega _{j}^{h}\centerdot x_{i}^{h-1}+b_{j}^{h}} \right)\]

Where: \[{{(H_{i}^{h})}_{j}}\] denotes the output of the j th hidden-layer neuron. nh represents the number of hidden-layer neurons. σh represents the hidden-layer activation function. \[\omega _{j}^{h}\] represents the weights between neurons of the previous layer. \[b_{j}^{h}\] represents the bias of the h th hidden layer.

The expression of the output-layer prediction function of the ANN is: \[{{O}_{k}}={{\sigma }^{o}}\left( \sum\limits_{i=1}^{{{n}_{H}}}{\omega _{j}^{o}\centerdot x_{i}^{H}+b_{j}^{o}} \right)\]

Where: Ok denotes the predicted output of the k th neuron of the output layer. σo denotes the output-layer activation function. \[\omega _{j}^{o}\] denotes the output-layer weights. \[b_{j}^{o}\] denotes the output-layer bias.

Given the training samples {xi, yi}, the ANN is optimized to minimize the error between the predicted output and the target; the optimization objective function of the ANN is: \[\underset{\omega ,b}{\mathop{\min }}\,{{E}_{i}}=\frac{1}{2}\sum\limits_{k=1}^{l}{{{\left[ {{\left( {{O}_{i}} \right)}_{k}}-{{\left( {{\widehat{O}}_{i}} \right)}_{k}} \right]}^{2}}}\]

The training parameters ω and b are updated by gradient descent as follows: \[\omega \leftarrow \omega -\alpha \centerdot \frac{\partial {{E}_{i}}}{\partial \omega },\quad b\leftarrow b-\alpha \centerdot \frac{\partial {{E}_{i}}}{\partial b}\]

Where: α denotes the learning rate. The error gradient is propagated backward from the output layer to the input layer, and the training parameters are iteratively updated layer by layer.
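A minimal sketch of the student ANN and one gradient-descent update, assuming PyTorch; the layer sizes, the learning rate α = 0.01, and the random data are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Multilayer perceptron: d = 20 input features -> two hidden layers -> l = 5 outputs.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 5),
)
criterion = nn.MSELoss()  # squared-error objective E_i (up to the 1/2 factor)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # alpha = 0.01

x = torch.randn(32, 20)   # a batch of 32 training samples
y = torch.randn(32, 5)    # their target outputs

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()           # error gradients propagate from output back to input
optimizer.step()          # w <- w - alpha * dE/dw,  b <- b - alpha * dE/db
```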

Selection and Optimization of English Teaching Paths in Colleges and Universities
Knowledge Distillation in Higher Education English Teaching Pathways

In this paper, the teacher model is used as the training model for the source domain, while the student model is used as the training model for the target domain. A dense convolutional neural network is built as the teacher model to pre-train on the source domain samples, and its output predictions are the soft labels. An artificial neural network is built as the student model; its main task is fitting the hard labels at temperature T = 1, and its auxiliary task is fitting the soft-label predictions at temperature T = t. The principle of knowledge distillation for teaching English in colleges and universities is shown in Fig. 3.

Figure 3.

Schematic diagram of knowledge distillation

The KD method introduces a temperature factor T, which affects the behavior of the Softmax, and the KD-Softmax expression is: \[{{q}_{i}}=\frac{{{e}^{\frac{{{z}_{i}}}{T}}}}{\sum\limits_{j}{{{e}^{\frac{{{z}_{j}}}{T}}}}}\]

Where: qi denotes the soft label of each category and zi denotes the logit output for each category. When T = 1 it reduces to the traditional Softmax classification function: \[{{q}_{i}}=\frac{{{e}^{{{z}_{i}}}}}{\sum\limits_{j}{{{e}^{{{z}_{j}}}}}}\]

where qi represents the probability distribution of the output of each category.
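The effect of the temperature factor can be seen numerically in the small sketch below (NumPy; the logits are made up): raising T softens the distribution and exposes the inter-class "dark knowledge" carried by the non-maximal probabilities.

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    e = np.exp(z / T)
    return e / e.sum()

z = np.array([6.0, 2.0, 1.0])   # made-up logits for three categories
print(softmax_T(z, T=1.0))      # sharp: ~[0.976, 0.018, 0.007]
print(softmax_T(z, T=4.0))      # soft:  ~[0.604, 0.222, 0.173]
```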

The teacher model loss function Lossteacher uses a cross-entropy function, an asymmetric measure of the difference between the distributions of random variables; the teacher network loss function expression is: \[Los{{s}_{\text{teacher}}}=-\sum\limits_{i=1}^{n}{\left[ y_{s}^{i}\log \hat{y}_{s}^{i}+(1-y_{s}^{i})\log (1-\hat{y}_{s}^{i}) \right]}\]

Where: ysi$y_{s}^{i}$ denotes the source domain sample. y^si$\hat{y}_{s}^{i}$ denotes the source domain soft label.

The student model loss function Lossstudent also uses the categorical cross-entropy function, and its expression is: \[Los{{s}_{\text{student}}}=-\sum\limits_{i=1}^{n}{\left[ y_{t}^{i}\log \hat{y}_{t}^{i}+(1-y_{t}^{i})\log (1-\hat{y}_{t}^{i}) \right]}\]

Where: yti$y_{t}^{i}$ denotes the target domain sample. y^ti$\hat{y}_{t}^{i}$ denotes the target domain hard-label.

The soft-label loss consists of two parts, the teacher model soft-label loss and the student model soft-label prediction loss, i.e., the accumulation of the teacher model cross entropy and the student model cross entropy; its functional expression is: \[Los{{s}_{\text{Soft}}}=\lambda \cdot Los{{s}_{\text{teacher}}}+\beta \cdot Los{{s}_{\text{student}}}\]

Where λ and β denote Lagrange multipliers with value ranges 0 < λ ≤ 1 and 0 < β ≤ 1; this paper takes λ = 0.6 and β = 0.4.

The hard-label loss consists of two parts, the hard-label prediction loss and the hard-label loss of the student model, i.e., the superposition of the hard-label prediction loss at T = 1 and the cross-entropy loss; its functional expression is: \[Los{{s}_{\text{Hard}}}=\mu \cdot Loss_{\text{student}}^{'}-\eta \sum\limits_{i=1}^{n}{\left[ y_{{{t}'}}^{i}\log \hat{y}_{{{t}'}}^{i}+(1-y_{{{t}'}}^{i})\log (1-\hat{y}_{{{t}'}}^{i}) \right]}\]

Where: Lossstudent$Loss_{\text{student}}^{\prime }$ denotes the target domain hard label prediction loss for T = 1 . yti$y_{{{t}^{\prime }}}^{i}$ denotes the target domain sample. y^ti$\hat{y}_{{{t}^{\prime }}}^{i}$ denotes the target domain hard label.

Both model tasks train on college English teaching path samples, which are similar, so the two tasks share the hidden-layer parameters while retaining the output layers of their respective tasks. The distillation loss function is mainly composed of the soft-label loss LossSoft and the hard-label loss LossHard, and its expression is: \[Los{{s}_{\text{Distillation}}}=\alpha {{T}^{2}}\varphi (W_{s}^{T},W_{t}^{T})+(1-\alpha )\phi ({{W}_{s}},Y)\]

Where: α represents the distillation intensity. φ(·) is the relative entropy, i.e., the KL divergence. \[W_{t}^{T}\] denotes the Softmax weights of the target domain model under temperature factor T. \[W_{s}^{T}\] denotes the source domain model weights. ϕ(·) denotes the cross entropy. Wt denotes the hard labeling of the target domain model. Y is the target domain test sample.

The above distillation function only accounts for the marginal distribution difference in domain transfer. On this basis, this paper introduces the stratified transfer learning (STL) algorithm to improve the conditional distribution difference of the domain samples, with the function expression: \[MM{{D}_{STL}}({{A}^{T}}{{x}_{s}},{{A}^{T}}{{x}_{t}})=\sum\limits_{c=1}^{C}{{{\left\| \frac{1}{n_{s}^{(c)}}\sum\limits_{i=1}^{n_{s}^{(c)}}{{{A}^{T}}}{{x}_{si}}-\frac{1}{n_{t}^{(c)}}\sum\limits_{j=1}^{n_{t}^{(c)}}{{{A}^{T}}}{{x}_{tj}} \right\|}_{\mathcal{H}}}}\]

Where: xsi and xtj denote the source and target domain samples, respectively, and ||·||H denotes the norm in a reproducing kernel Hilbert space (RKHS).
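A minimal sketch of this class-conditional MMD term, assuming NumPy and a plain linear projection A in place of an explicit RKHS kernel; the pseudo-label handling for the unlabeled target samples, like every name in the snippet, is an assumption for illustration.

```python
import numpy as np

def mmd_stl(A, xs, ys, xt, yt_pseudo, num_classes):
    """Class-conditional MMD: sum over classes c of the distance between the
    projected class means of the source and target domains. xs, xt are (n, d)
    sample matrices; ys, yt_pseudo are class labels (pseudo-labels would be
    used for the unlabeled target samples)."""
    total = 0.0
    for c in range(num_classes):
        src_c = xs[ys == c] @ A          # A^T x for source samples of class c
        tgt_c = xt[yt_pseudo == c] @ A   # A^T x for target samples of class c
        if len(src_c) == 0 or len(tgt_c) == 0:
            continue                     # skip classes absent from a domain
        total += np.linalg.norm(src_c.mean(axis=0) - tgt_c.mean(axis=0))
    return total
```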

The final distillation target loss LossDistillation is expressed as: \[Los{{s}_{\text{Distillation}}}=\alpha {{T}^{2}}\varphi (W_{s}^{T},W_{t}^{T})+(1-\alpha )\phi ({{W}_{s}},{{W}_{\text{label}}})+\gamma \cdot MM{{D}_{\text{STL}}}({{A}^{T}}{{x}_{s}},{{A}^{T}}{{x}_{t}})\]

where γ is a Lagrange multiplier with value range 0 ≤ γ ≤ 1; in this paper γ = 0.5.
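Putting the three terms together, a hedged PyTorch sketch of the final objective might look as follows; the feature/label arguments and default α and T are assumptions about how the pieces are wired, γ = 0.5 follows the text, and the RKHS norm is again replaced by a plain mean-difference norm.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      feat_s, ys, feat_t, yt_pseudo, num_classes,
                      T=4.0, alpha=0.6, gamma=0.5):
    # alpha * T^2 * KL divergence between softened teacher and student outputs.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T**2
    # (1 - alpha) * cross entropy against the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    # gamma * class-conditional mean-difference term standing in for MMD_STL.
    mmd = feat_s.new_zeros(())
    for c in range(num_classes):
        s_c, t_c = feat_s[ys == c], feat_t[yt_pseudo == c]
        if len(s_c) and len(t_c):
            mmd = mmd + torch.norm(s_c.mean(dim=0) - t_c.mean(dim=0))
    return alpha * soft + (1 - alpha) * hard + gamma * mmd
```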

In summary, for a sample X of the college English teaching path, the output vector s is first obtained by pre-training the teacher model on the source domain samples. Second, the target domain samples are pre-trained by the student model to obtain the output vector t. Using the teacher output vector s and the student output vector t, the temperature factor T is set for distillation "purification" to obtain the soft-label loss LossSoft, which provides rich "dark knowledge" for the student model. When the student model at T = 1 pre-trains the target domain samples and the corresponding cross-entropy loss is computed, the hard-label loss LossHard is obtained. Combining LossSoft and LossHard, STL is introduced to improve the conditional distribution difference of the domain samples and the final distillation loss LossDistillation is obtained; the gradient of the loss function is fed back to the student model, the student model's network parameters are updated, the student model receives the negative-label information, and finally the target domain test samples are used to realize the selection and optimization of college English teaching paths. Some parameters of the teacher model are kept consistent with those of the student model; that is, the number of hidden-layer nodes and the number of neurons in the fully connected layer of DenseNet match the number of hidden-layer nodes and neurons of the student ANN.

Model-specific application steps

The college English teaching path selection and optimization model proposed in this paper consists of four main parts: acquisition of English teaching resources and teaching contexts, sample preprocessing, model construction and training, and transfer optimization of college English teaching paths. The process framework of the model based on knowledge distillation and transfer learning is shown in Figure 4, and its main steps are as follows:

1) English teaching resources and teaching context input. Take the basic knowledge points of English teaching and diversified teaching materials as teaching resources, take specific teaching tasks and needs as teaching contexts, and input both as samples into the model.

2) Sample preprocessing. The input English teaching resources and teaching contexts are normalized and paired to obtain English teaching path samples. The source domain samples are divided into a training set and a test set at a ratio of 7:3, and the target domain samples are divided into 20% labeled and 80% unlabeled samples (a minimal split sketch is given after this list).

3) Model building and training. First, DenseNet adaptively extracts features from the normalized source domain samples, and the source domain test samples are used to obtain the optimal source-domain transfer model. Second, the target domain samples are pre-trained by the ANN model, the temperature factor T is set, knowledge-distillation "purification" is performed to obtain the soft-target loss and the hard-target loss, and stratified transfer is introduced to improve the conditional distribution difference; the final distillation loss function is obtained and fed back to update the student model.

4) College English teaching path transfer optimization. The target domain samples are input to the distilled student model, the features are mapped into the high-dimensional RKHS space, and a Softmax logistic classifier is used to realize the optimization decision for the college English teaching path.
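A minimal sketch of the sample split in step 2, assuming scikit-learn; the feature dimensions and sample counts are placeholders, and only the 7:3 and 20%/80% proportions come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the preprocessed teaching-path samples.
Xs = np.random.rand(1000, 16)        # source domain features
ys = np.random.randint(0, 5, 1000)   # source domain labels
Xt = np.random.rand(400, 16)         # target domain features

# Source domain split into training and test sets at 7:3.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.3, random_state=0)

# Target domain split into 20% labeled and 80% unlabeled samples.
Xt_labeled, Xt_unlabeled = train_test_split(Xt, test_size=0.8, random_state=0)
```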

Figure 4.

Flow chart for selecting and optimizing college English teaching paths

Effectiveness Assessment Model of the Optimization Path of English Teaching in Colleges and Universities

On the basis of the model above, this paper designs a controlled teaching experiment and uses the independent-samples t-test and the paired-samples t-test to explore the effectiveness of the model-optimized college English teaching path.

Independent samples t-test

The t-test for two independent samples is used to test whether two independent samples come from populations with the same mean, that is, to test whether two independent normal populations have equal means [23].

Formulation of the null hypothesis

The two-independent-samples t-test entails testing whether there is a significant difference between the means of the two populations. Its null hypothesis is H0: μ1 − μ2 = 0, where μ1 and μ2 are the means of the two populations, respectively.

Selection of test statistic

The two-independent-samples test of means presupposes that the two independent populations obey the normal distributions \[N\left( {{\mu }_{1}},\sigma _{1}^{2} \right)\] and \[N\left( {{\mu }_{2}},\sigma _{2}^{2} \right)\], where \[\sigma _{1}^{2}\] and \[\sigma _{2}^{2}\] are the variances of the two populations, respectively.

Under the condition that the null hypothesis is valid, the t-statistic is used to test the means of the two independent samples. The t-statistic for two independent samples is constructed and selected by analyzing two cases.

1) When the variances of the two populations are unknown but equal, i.e., \[\sigma _{1}^{2}=\sigma _{2}^{2}\], the t-test statistic is constructed as: \[t=\frac{{{\bar{X}}_{1}}-{{\bar{X}}_{2}}-\left( {{\mu }_{1}}-{{\mu }_{2}} \right)}{{{S}_{v}}\sqrt{\frac{1}{{{n}_{1}}}+\frac{1}{{{n}_{2}}}}}\]

Where: n1 and n2 are the two sample sizes, respectively; S1 and S2 are the two sample standard deviations, respectively; and: \[S_{v}^{2}=\frac{\left( {{n}_{1}}-1 \right)S_{1}^{2}+\left( {{n}_{2}}-1 \right)S_{2}^{2}}{{{n}_{1}}+{{n}_{2}}-2}\]

This statistic obeys a t-distribution with n1 + n2 − 2 degrees of freedom.

2) When the variances of the two populations are unknown and unequal, i.e., \[\sigma _{1}^{2}\ne \sigma _{2}^{2}\], the t-test statistic is constructed as: \[t=\frac{{{\bar{X}}_{1}}-{{\bar{X}}_{2}}-\left( {{\mu }_{1}}-{{\mu }_{2}} \right)}{\sqrt{\frac{S_{1}^{2}}{{{n}_{1}}}+\frac{S_{2}^{2}}{{{n}_{2}}}}}\]

This statistic obeys a t-distribution with modified (Welch-Satterthwaite) degrees of freedom: \[df=\frac{{{\left( \frac{S_{1}^{2}}{{{n}_{1}}}+\frac{S_{2}^{2}}{{{n}_{2}}} \right)}^{2}}}{\frac{{{\left( \frac{S_{1}^{2}}{{{n}_{1}}} \right)}^{2}}}{{{n}_{1}}-1}+\frac{{{\left( \frac{S_{2}^{2}}{{{n}_{2}}} \right)}^{2}}}{{{n}_{2}}-1}}\]

In statistical analysis, if the variances of two populations are equal, they are said to satisfy homogeneity of variance. Determining whether two independent samples satisfy homogeneity of variance is the key to constructing and selecting the two-independent-samples t-test statistic, and Levene's F test can be used to test whether there is a significant difference between the variances of the two populations.

First, the null hypothesis \[{{H}_{0}}:\sigma _{1}^{2}=\sigma _{2}^{2}\] is formulated. The test is then performed: if the probability value is less than a given significance level (typically 0.05), the null hypothesis H0 is rejected and the two population variances are considered unequal; otherwise, the variances of the two populations are considered not significantly different.

The F statistic in the F test is calculated as: \[F=\frac{\max \left( S_{1}^{2},S_{2}^{2} \right)}{\min \left( S_{1}^{2},S_{2}^{2} \right)}\sim F\left( {{n}_{1}}-1,{{n}_{2}}-1 \right)\]

Where: n1 – 1 is the degree of freedom of max(S12,S22)\[\max \left( S_{1}^{2},S_{2}^{2} \right)\] . n2 – 1 is the degree of freedom of min(S12,S22)\[\min \left( S_{1}^{2},S_{2}^{2} \right)\] .
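The procedure above maps directly onto standard library calls; a minimal SciPy sketch follows, with placeholder score vectors rather than the study's data.

```python
import numpy as np
from scipy import stats

# Placeholder score vectors for two independent groups (not the study's data).
group1 = np.array([78.5, 81.0, 75.2, 84.3, 79.9, 77.1])
group2 = np.array([70.1, 72.4, 69.8, 74.0, 71.5, 73.2])

# Step 1: Levene's test of H0: sigma1^2 == sigma2^2.
lev_stat, lev_p = stats.levene(group1, group2)

# Step 2: choose the t-statistic accordingly. equal_var=True pools the
# variances (the first formula above); equal_var=False is Welch's test
# with the modified degrees of freedom (the second formula).
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=(lev_p >= 0.05))
print(f"Levene p = {lev_p:.3f}, t = {t_stat:.3f}, p = {p_val:.3f}")
```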

Determination of significance of differences

Given the null hypothesis, the observed value of the test statistic is obtained by substituting the hypothesized value μ0 into the μ1 − μ2 term of the t-statistic, and the probability p-value of the test statistic is calculated from the distribution function of the t-distribution.

When the p-value is less than the significance level, the null hypothesis is rejected and the two population means are considered significantly different. Conversely, the null hypothesis is retained and the two population means are considered not significantly different.

Paired samples t-test

The two-paired-samples test is used to test whether two related samples come from normal populations with the same mean; i.e., for two paired samples, it is inferred whether the means of the two populations differ significantly [24].

The paired-samples t-test likewise tests whether there is a significant difference between two population means, with the null hypothesis H0: μ1 − μ2 = 0, where μ1 and μ2 are the means of the two populations, respectively.

Let (X1, Y1), (X2, Y2), …, (Xn, Yn) be paired samples with differences di = Xi − Yi, i = 1, 2, …, n. The premise of the paired-samples test is that the difference d of the two samples follows a normal distribution.

Under the condition that the null hypothesis holds, the population mean of the differences is zero.

The paired-samples t-test uses the t-statistic, constructed as: \[t=\frac{\bar{d}-\left( {{\mu }_{1}}-{{\mu }_{2}} \right)}{{S}/{\sqrt{n}}\;}\]

When μ1 − μ2 = 0, the t-statistic obeys a t-distribution with n − 1 degrees of freedom.

The criteria for determining the significance of differences in the paired-samples t-test are consistent with the independent-samples t-test.
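A minimal paired-samples sketch with SciPy; the pre/post arrays are illustrative placeholders for one group's scores, paired by student.

```python
import numpy as np
from scipy import stats

# Placeholder pre/post scores for one group (paired by student).
pre  = np.array([65.0, 71.2, 58.4, 74.9, 66.3, 69.8])
post = np.array([78.1, 86.0, 71.5, 86.7, 80.2, 83.4])

# ttest_rel tests H0: mean difference == 0, with n - 1 degrees of freedom.
t_stat, p_val = stats.ttest_rel(post, pre)
print(f"t = {t_stat:.3f}, p = {p_val:.3f}")
```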

Teaching experiments and analysis
Objectives of the experimental study

This study aims to verify the impact of the college English teaching path optimization model by intervening in college English teaching through a teaching path optimized based on knowledge distillation and transfer learning. The purpose of the specific experimental study includes two aspects: first, to verify whether the optimized college English teaching path has a significant effect on the improvement of English proficiency level of non-English major college students. Second, to verify whether the optimized college English teaching path has different effects on different dimensions of English proficiency.

Experimental Study Subjects
The process of identifying research subjects

In the 2023-2024 academic year, the researcher taught college English courses to the 2023 cohorts of the product design and animation majors at the College of Art and Design, University of S. The students of these two majors were therefore identified as the subjects of this experimental study. The product design cohort consists of three natural classes (referred to as "Product 2301-03"), totaling 92 students; the animation cohort consists of two natural classes (referred to as "Animation 2301-02"), totaling 54 students. Given that college English teaching in the case school adopts the co-teaching system, Product 2301-03 was randomly selected as the experimental class and Animation 2301-02 as the control class, and a non-equivalent pre-test/post-test teaching experiment was carried out with the two natural English classes.

Characterization of the study population

A summary of the personal background information of the experimental research subjects is shown in Table 1.

The personal information of the experimental subjects

Background information                              Experimental class        Control class
                                                    Number   Percentage       Number   Percentage
Gender                      Male                    34       36.96%           20       37.04%
                            Female                  58       63.04%           34       62.96%
Nationality                 Han nationality         75       81.52%           43       79.63%
                            Minority nationality    17       18.48%           11       20.37%
Place of student source     North                   91       98.91%           54       100.00%
                            South                   1        1.09%            0        0.00%
College entrance exam       50-89 points            70       76.09%           42       77.78%
English results             90-150 points           22       23.91%           12       22.22%

As can be seen from Table 1, the research subjects in the two classes are highly similar. In the gender dimension, both classes are dominated by female students, who account for more than 62%. In the ethnic dimension, both are predominantly Han Chinese, accounting for more than 75%. In the place-of-origin dimension, only one person in the experimental class was born in a southern city; all the rest were born in the north. In the dimension of college entrance examination English performance, more than 75% of the subjects in both the experimental class and the control class scored below 90 points, corresponding to the level of students in English B classes at other universities, and more than 20% scored 90 points or above, corresponding to English A classes; this is basically in line with the distribution of English proficiency among non-English-major college students in the case school.

Tests for differences in background variables of the study population

Due to the large difference in the number of students in the experimental and control classes, further tests of the relevant variables for the study population were needed to ensure that there were no significant differences between the study population in the experimental and control classes. The results of the test of differences in background variables between the samples of the experimental and control classes are shown in Table 2.

Difference test of background variables between experimental class and control class

Variable                                Source            Sum of squares   Degrees of freedom   Mean square   F       Significance
Gender                                  Between groups    0.067            1                    0.067         0.236   0.648
                                        Within groups     34.185           144                  0.258
                                        Total             34.264           145
Nationality                             Between groups    0.253            1                    0.239         1.305   0.272
                                        Within groups     25.144           144                  0.181
                                        Total             25.316           145
Place of student source                 Between groups    0.008            1                    0.008         0.042   0.841
                                        Within groups     28.641           144                  0.214
                                        Total             28.657           145
College entrance examination results    Between groups    0.152            1                    0.152         0.224   0.654
                                        Within groups     91.246           144                  0.694
                                        Total             91.383           145

As can be seen from Table 2, a one-way ANOVA was conducted on the research subjects in the experimental and control classes in the dimensions of gender, ethnicity, place of origin, and college entrance examination English achievement; the F-values were 0.236, 1.305, 0.042, and 0.224, and the p-values were 0.648, 0.272, 0.841, and 0.654, respectively, all greater than 0.05. This shows that there is no significant difference between the subjects of the experimental class and the control class in any of these dimensions. The two classes are therefore highly homogeneous and satisfy the conditions needed to conduct the experiment.

Test of representativeness of the study population

In order to further test the representativeness of the students in the quasi-experimental classes, a one-sample t-test was conducted with the research subjects participating in the experiment as a single sample and the 704 students who took part in the case school's English proficiency survey as the reference population, to test the difference in the distribution of English proficiency levels between the experimental sample and the total survey sample. The results of the one-sample t-test are shown in Table 3.

One-sample t-test results

Test value = 1.72
                           t        Degrees of freedom   Sig. (two-tailed)   Mean difference   95% CI lower   95% CI upper
English proficiency level  -0.204   145                  0.857               -0.016            -0.17          0.12

As can be seen from Table 3, although the experimental subjects in this paper are art majors, they do not differ significantly from other non-English-major undergraduates in the case school in the dimension of English proficiency level (t = -0.204, p = 0.857 > 0.05), which suggests that the experimental subjects are representative of the overall level of non-English-major undergraduates in the case school. In addition, combined with Table 1, the distribution of the experimental subjects' college entrance examination English scores is concentrated in the 70-90 point range, which is also broadly consistent with the distribution of English proficiency among non-English-major undergraduates in the case school and reflects the overall characteristics of the research population.

Experimental design and analysis of results
Experimental design

This study adopts an experimental comparison approach. The experimental class and the control class were taught college English with the same teaching materials in the same teaching environment for two semesters, and all subjects took an English proficiency test before and after the experiment (pre-test and post-test). The test covered listening, reading comprehension, oral expression, and written expression, each accounting for 25% of the total score, graded on a 100-point scale. The experimental class was taught following the optimized English teaching path derived from the model in this paper, while the control class was taught in the traditional way.

The study used quantitative methods, applying SPSS 28.0 statistical software for quantitative analysis, including independent-samples and paired-samples t-tests, to compare the performance of students in the experimental and control classes across all dimensions of English proficiency before and after the experiment, and to test the gap in score progress between the two groups.

Statistical analysis of paired samples of experimental and control classes

The results of the statistical analysis of the pre and post-test paired samples of the experimental and control classes are shown in Table 4. In the table, D1, D2, D3 and D4 represent the average scores of listening, reading comprehension, oral expression and written expression, respectively.

Statistical analysis of paired samples before and after testing

                                  D1      D2      D3      D4      Total   Number   Standard deviation   Standard error
Pretest    Experimental class     65.12   71.09   56.89   74.53   66.91   92       9.423                1.476
           Control class          63.64   69.24   58.13   73.96   66.24   54       8.205                1.215
Posttest   Experimental class     78.57   86.35   71.37   86.52   80.70   92       7.341                1.027
           Control class          70.21   76.42   62.44   80.11   72.30   54       8.256                1.234

As can be seen from Table 4, in terms of the mean total score, the experimental class and the control class scored 66.91 and 66.24 respectively in the pre-test, with standard deviations of 9.423 and 8.205, indicating that the English level of the experimental class before the experiment was slightly higher than that of the control class but not markedly different overall. In the post-test, the mean total scores of the experimental class and the control class are 80.70 and 72.30 respectively, with standard deviations of 7.341 and 8.256, indicating that the English proficiency of the experimental class improved after the experiment and its polarization narrowed, whereas the learning level of the control class did not improve significantly and its polarization tended to widen slightly. As for the specific dimension scores, the maximum score difference between the experimental class and the control class in the pre-test is 1.85 points, while the minimum difference between the two classes in the post-test is 6.41 points; the score gap has widened significantly.

Independent samples t-test for experimental and control classes

The results of the independent samples t-test for the pre and post-tests of the experimental and control classes are shown in Table 5. Where D1, D2, D3 and D4 denote listening, reading comprehension, oral expression and written expression, respectively.

Independent sample T-test for pretest and posttest

           Dimension   Assumption        Levene's F   Sig.    t       Sig. (2-tailed)   95% CI lower   95% CI upper
Pretest    D1          Equal variances   3.315        0.081   0.145   0.715             -3.527         4.024
           D2          Equal variances   4.232        0.079   0.156   0.637             -4.135         3.859
           D3          Equal variances   2.964        0.093   0.189   0.741             -2.958         3.421
           D4          Equal variances   3.109        0.105   0.195   0.822             -3.763         4.325
Posttest   D1          Equal variances   2.942        0.089   2.943   0.007             -6.942         -2.354
           D2          Equal variances   3.857        0.094   3.127   0.000             -7.257         -2.678
           D3          Equal variances   2.645        0.132   3.481   0.003             -8.104         -2.593
           D4          Equal variances   3.054        0.103   2.894   0.018             -7.109         -1.742

From Table 5, it can be seen that in the pre-test the significance values of the homogeneity-of-variance test for listening, reading comprehension, oral expression, and written expression are 0.081, 0.079, 0.093, and 0.105, respectively, all greater than 0.05, meaning that the variances of the two classes are homogeneous in each dimension. With equal variances assumed, the significance values (two-tailed) for the dimensions are 0.715, 0.637, 0.741, and 0.822, respectively, all greater than 0.05. Combined with Table 4, this shows that although the two classes differ slightly in mean and standard deviation, their scores on the dimensions of English proficiency do not differ significantly before the experiment; i.e., the two classes are basically comparable in English proficiency before the experiment. In the post-test, the p-values of the homogeneity-of-variance test are again all greater than 0.05, so equal variances are assumed; meanwhile, the significance values (two-tailed) are all less than 0.05, indicating a statistically significant difference in the post-test scores of the two groups.

Paired Samples T-Test for Experimental and Control Classes

The results of the pre and post-test paired samples t-tests for the experimental and control classes are shown in Table 6, with D1, D2, D3 and D4 denoting listening, reading comprehension, oral expression and written expression, respectively.

Paired sample T-test for pretest and posttest

                                                  Paired difference                          t        Sig. (2-tailed)
                                                  Mean     Std. deviation   Std. error mean
Experimental class   D1   Pretest and posttest    -13.45   1.421            0.241            3.147    0.000
                     D2   Pretest and posttest    -15.26   1.714            0.197            4.274    0.000
                     D3   Pretest and posttest    -14.48   2.047            0.259            -2.716   0.000
                     D4   Pretest and posttest    -11.99   1.892            0.264            -4.207   0.000
Control class        D1   Pretest and posttest    -6.57    4.125            0.755            4.815    0.085
                     D2   Pretest and posttest    -7.18    5.264            0.761            -5.143   0.073
                     D3   Pretest and posttest    -4.31    3.973            0.804            -3.459   0.104
                     D4   Pretest and posttest    -6.15    4.677            0.825            -4.531   0.096

As can be seen from Table 6, the p-values for the experimental class's pre- and post-test scores in listening, reading comprehension, oral expression, and written expression are all 0.000, less than 0.05, indicating statistically significant differences between the two test scores of the experimental class in all dimensions of English proficiency. This result supports the hypothesis that the optimized English teaching path presented in this paper can promote the improvement of students' English proficiency. By contrast, there is no statistically significant difference between the two scores of the control class in any dimension (p > 0.05), indicating that the control class's progress is not obvious compared with the experimental class, which verifies the validity of this paper's model for optimizing college English teaching paths.

Conclusion

To enhance the quality and efficiency of English teaching, this paper builds a college English teaching path selection and optimization model that combines knowledge distillation and transfer learning. The teacher and student models pre-train the English teaching path samples in the source and target domains respectively and adaptively and stochastically extract teaching-path features; the final distillation loss is obtained by continuously adjusting the temperature factor T and is fed back to the student model, ultimately realizing the optimization decision for college English teaching paths. The t-test method is then used to evaluate the optimization effect of the model. The conclusions of the model evaluation experiment are as follows:

First, this paper conducts reliability tests on the selected research subjects. A one-way ANOVA was conducted for the students of the experimental and control classes in the dimensions of gender, ethnicity, place of origin, and college entrance examination English achievement; the F-values were 0.236, 1.305, 0.042, and 0.224, and the p-values were 0.648, 0.272, 0.841, and 0.654, all greater than 0.05. This indicates that the subjects of the experimental and control classes are homogeneous in these dimensions and that the conditions for implementing the experiment are met. At the same time, a one-sample t-test was conducted with the research subjects participating in the experiment as a single sample and the students participating in the case school's English proficiency survey as the reference population. The results show no significant difference between the experimental subjects and other non-English-major undergraduate students in the case school in the English proficiency dimension (t = -0.204, p = 0.857 > 0.05), indicating that the selected research subjects are representative.

Second, two-sample t-tests were used to assess the differences in English proficiency between the two groups of students. Before the experiment, there was no significant difference in English proficiency between the experimental group, which adopted the teaching path optimized by this paper's model, and the control group, which adopted the traditional teaching method (p > 0.05), whereas a significant difference appeared after the experiment (p < 0.05). Moreover, the experimental group's English proficiency after the experiment improved significantly compared with before (p < 0.05), while the control group showed no significant improvement (p > 0.05). This fully demonstrates the usefulness of this paper's model for optimizing the English teaching process in colleges and universities.

Funding:

This research was supported by the Henan Provincial Higher Education Teaching Reform Research and Practice Project (Progressive Integration, Categorized Expansion: Research and Practice on the Paradigm of English Curriculum Serving the Career Development of Medical Students; Project No. 2024SJGLX0876).
