
Prediction of English Vocabulary Learning Difficulty and Adjustment of Teaching Strategies Based on Decision Tree Algorithm


Introduction

English vocabulary is the foundation of English education and has a strong influence on the development of students' English reading and writing ability; the importance of vocabulary teaching to English instruction is self-evident [1-2]. Students who want to learn English well need to master a large vocabulary so that they can better understand English material, extract information from a text, and grasp its ideas and feelings [3-4]. When discussing vocabulary teaching, many teachers tend to reduce it to a process of word recognition, explanation, memorization and application, when it is in fact one of the important links in cultivating core English literacy [5-6]. Effective optimization of vocabulary teaching allows students' language proficiency, learning ability, quality of thinking and cultural awareness to develop more fully [7].

Vocabulary teaching is a major difficulty in English instruction. In the traditional form of teaching, teachers often ask students to memorize the relevant vocabulary to acquire basic knowledge and to practice related exercises in order to score well on examinations [8-9]. However, this form of teaching easily limits the development of students' thinking and dampens their interest in learning. Under the educational concepts of the new curriculum standard, teachers of English vocabulary should raise students' interest in learning so that students approach vocabulary from the perspective of improving their core literacy in the English subject [10-12]. Adjusting the vocabulary teaching process so as to promote students' all-round development is also a great test for teachers [13-14].

With the rapid development of information technology, the mode of English teaching has changed greatly. Vocabulary is the foundation of English learning [15-16], yet the traditional way of teaching vocabulary is often too uniform and poorly suited to the needs of an information-technology society [17]. Facing this challenge, teachers need to actively explore new teaching strategies and make full use of information technology in order to stimulate students' interest in learning and improve teaching efficiency [18-19]. It is therefore necessary to explore English vocabulary teaching strategies against the background of information technology and to develop new teaching methods that improve students' vocabulary mastery and the quality of English teaching [20-21].

It is therefore necessary to study the current situation of English vocabulary teaching and students' English vocabulary level. Literature [22] systematically analyzes the importance of English vocabulary and the current state of vocabulary learning, focuses on current vocabulary learning strategies, and contributes to the design and optimization of vocabulary teaching methods; it shows that students' vocabulary knowledge is far from satisfactory and that optimizing vocabulary teaching methods is imperative. Literature [23], based on a mixed-method design, reveals that both teachers and students are motivated in English vocabulary teaching: teachers most prefer whole-context strategies for teaching vocabulary, while students prefer determination and metacognitive strategies for learning it, and the choice of vocabulary strategy is highly correlated with students' vocabulary level. Literature [24] combines qualitative techniques with a descriptive research design, taking English teachers as the subjects and supplementing the analysis with documents, interviews and observations; it finds that English teachers at MTsS Siulak Gedang chose tools such as dictionaries and translation techniques to teach English vocabulary and that their selection of teaching strategies followed the principles of appropriateness and simplicity. Literature [25] describes English vocabulary teaching in Chinese university classrooms as dominated by rote memorization and passive learning, with few opportunities for practice and interaction; it proposes introducing the CLT approach to help students build vocabulary and points out that this strategy also improves students' practical performance in real-life English contexts. Literature [26] designed a vocabulary learning application for mobile phones and discussed its development and design in detail, aiming to meet students' need to learn English vocabulary anywhere and at any time. Using semi-structured interviews, [27] revealed that strategies such as shared learning, feedback and peer assessment help improve students' vocabulary knowledge in teaching practice. Literature [28] discusses an educational-game teaching model built on the Android platform, which has positive significance for vocabulary growth among English learners, and regards this game-based model as an innovative teaching practice that provides an important reference for optimizing English vocabulary teaching.

In this paper, the definition and workflow of DBSCAN clustering are introduced first, and the performance of the clustering method is evaluated using one internal and five external clustering evaluation indexes. A decision tree classification algorithm is then proposed: starting from the common ID3 algorithm, the C4.5 algorithm is adopted as the English vocabulary learning difficulty classifier in this paper, and a balance coefficient is introduced to address the long computation time of C4.5 decision tree construction, so that the difficulty of English vocabulary learning can be predicted. Finally, a method for adjusting English vocabulary teaching strategies is proposed, and the improved teaching strategies are applied in practice in order to test and evaluate the method presented in this paper.

Prediction of English Vocabulary Learning Difficulty Based on Decision Tree Algorithm
Cluster Analysis of English Vocabulary Learning Difficulty Based on DBSCAN
Cluster analysis

Cluster analysis [29] partitions data by similarity: data points within a group are highly similar, while the similarity between groups is small. Assume a data set X of dimension D containing n data points, i.e. X = {xi | xi = (xi,1, xi,2, …, xi,D)}, where i denotes the index of a data point and takes values between 1 and n, and the second subscript denotes the dimension and takes values between 1 and D. Based on their similarity, the data points are divided into k classes C1, C2, …, Ck. These k subgroups must satisfy: Cp ≠ ∅; C1 ∪ C2 ∪ … ∪ Ck = X; Cp ∩ Cq = ∅ for p ≠ q, where p and q denote subgroup indices and take values between 1 and k. In other words, each subgroup is non-empty and contains at least one data point, the union of all subgroups is the entire data set, and the subgroups are pairwise disjoint. Current cluster analysis methods fall into five categories: partitioning, hierarchical, density-based, grid-based, and model-based.

Evaluation metrics for cluster analysis currently fall into two groups: internal evaluation metrics and external evaluation metrics. Several clustering evaluation metrics are described below, including the contour (silhouette) coefficient (SC), the normalized mutual information (NMI), the Jaccard coefficient (Jac), the Rand index (RI), the F-Measure (FM), and the F1 index (F1).

The contour coefficient (SC) is a typical internal evaluation metric for clustering. Suppose dataset X with n data points is clustered into k subgroups. To calculate the contour coefficient (SC) of the whole data set, first calculate a(p), where point p is a data point in subgroup Cj, point q is any other data point in subgroup Cj, d(p,q) denotes the distance between p and q, and a(p) is the average distance from data point p to the other data points in its subgroup, which indicates the compactness of the data points within the subgroup.

\[a(p)=\frac{1}{\left| C_{j} \right|-1}\sum\limits_{q\in C_{j},\,q\ne p} d(p,q)\tag{1}\]

Where point p is a data point in grouping Cj and point q is a data point in another grouping Ci, b(p) indicates the separateness of data points between groupings.

\[b(p)=\min\left\{ \frac{1}{\left| C_{i} \right|}\sum\limits_{q\in C_{i}} d(p,q) \,:\, i=1,2,\ldots ,k,\ i\ne j \right\}\tag{2}\]

Secondly, the contour coefficient SC(p) of point p can be calculated by Eq. (3).

\[SC(p)=\frac{b(p)-a(p)}{\max \{a(p),b(p)\}}\tag{3}\]

Finally, the contour coefficient (SC) of dataset X can be calculated by equation (4). The closer the value of the contour coefficient (SC) is to the value 1, the better the clustering is.

\[SC=\frac{1}{n}\sum\limits_{p\in X} SC(p)\tag{4}\]
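To make the definitions concrete, the following is a minimal Python sketch that computes a(p), b(p), SC(p) and the overall SC exactly as in Eqs. (1)-(4); the function name, the toy data and the handling of singleton clusters are illustrative assumptions of ours, not part of the original experiments.

import numpy as np

def silhouette_coefficient(X, labels):
    """Contour (silhouette) coefficient SC of Eqs. (1)-(4) for a labeled data set."""
    n = len(X)
    clusters = np.unique(labels)
    scores = np.zeros(n)
    for p in range(n):
        same = np.where(labels == labels[p])[0]
        same = same[same != p]
        if len(same) == 0:                     # singleton cluster: define SC(p) = 0
            continue
        # a(p): mean distance to the other points of the same subgroup (Eq. (1))
        a = np.mean(np.linalg.norm(X[same] - X[p], axis=1))
        # b(p): smallest mean distance to the points of any other subgroup (Eq. (2))
        b = min(np.mean(np.linalg.norm(X[labels == c] - X[p], axis=1))
                for c in clusters if c != labels[p])
        scores[p] = (b - a) / max(a, b)        # Eq. (3)
    return scores.mean()                       # Eq. (4)

# Toy usage with two well-separated clusters (illustrative data only)
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(silhouette_coefficient(X, labels))       # value close to 1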

Normalized mutual information (NMI) is a typical external evaluation metric. Suppose RL is the actual clustering result obtained by the algorithm and ST is the standard (reference) clustering result. The NMI of RL and ST can be calculated by Equation (5), where nl is the number of data points in the l-th subgroup of RL, nt is the number of data points in the t-th subgroup of ST, and ntl is the number of data points shared by the l-th subgroup of RL and the t-th subgroup of ST. The larger the value of NMI, the better the clustering.

\[NMI\left( R_{L},S_{T} \right)=\frac{\sum\limits_{l=1}^{L}\sum\limits_{t=1}^{T} n_{t}^{l}\log \left( \frac{n\,n_{t}^{l}}{n^{l}n^{t}} \right)}{\sqrt{\left( \sum\limits_{l=1}^{L} n^{l}\log \left( \frac{n^{l}}{n} \right) \right)\left( \sum\limits_{t=1}^{T} n^{t}\log \left( \frac{n^{t}}{n} \right) \right)}}\tag{5}\]

The four evaluation metrics Jaccard coefficient (Jac), Rand index (RI), F-Measure (FM), and F1 index (F1) are classical, simple, and commonly used external evaluation metrics. Suppose there are data points p and q and two clustering results RL and ST for dataset X. There are four possible scenarios for how p and q are distributed in RL and ST: 1) p and q are in the same grouping in RL and also in the same grouping in ST; 2) p and q are in the same grouping in RL but not in the same grouping in ST; 3) p and q are not in the same grouping in RL but are in the same grouping in ST; 4) p and q are neither in the same grouping in RL nor in the same grouping in ST. Let the number of point pairs in the first situation be α, in the second β, in the third γ, and in the fourth δ; let the total number of point pairs be m, and let the dataset X contain n data points. The relationship among these six parameters is given by formulas (6) and (7).

\[m=\alpha +\beta +\gamma +\delta\tag{6}\]
\[m=\frac{n(n-1)}{2}\tag{7}\]

The Jaccard coefficient (Jac) can be calculated by equation (8); the larger its value, the better the clustering effect.

\[Jac=\frac{\alpha }{\alpha +\beta +\gamma }\tag{8}\]

The Rand index (RI) can be calculated by Equation (9); the larger its value, the better the clustering effect.

\[RI=\frac{\alpha +\delta }{m}\tag{9}\]

The F1 index (F1) can be calculated by Eq. (10); the larger its value, the better the clustering effect.

\[F_{1}=\sqrt{\frac{\alpha }{\alpha +\beta }\cdot \frac{\alpha }{\alpha +\gamma }}\tag{10}\]

F-Measure (FM) uses Precision (Pre) and Recall (Rec) together as an index to evaluate the performance of clustering. Where k denotes the grouping of clusters, Nc denotes the number of common data points contained in standard grouping k and actual grouping k, Nk denotes the number of all data points in standard grouping k, and Nt denotes the number of all data points in actual grouping k.

\[\mathrm{Pre}(k)=\frac{N_{c}}{N_{k}}\tag{11}\]
\[\mathrm{Rec}(k)=\frac{N_{c}}{N_{t}}\tag{12}\]

The F-Measure (FM) can be calculated by equation (13); the larger its value, the better the clustering effect.

\[FM=\frac{2\,\mathrm{Pre}\cdot \mathrm{Rec}}{\mathrm{Pre}+\mathrm{Rec}}\tag{13}\]
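As a minimal sketch of the pair-counting metrics above, the function below takes the counts α, β, γ and δ and returns the Jaccard coefficient, the Rand index and the F1 index of Eqs. (8)-(10); the F-Measure of Eq. (13) is computed separately from precision and recall. The function names and the toy numbers are ours.

from math import sqrt

def pair_counting_scores(alpha, beta, gamma, delta):
    """External clustering metrics computed from the four pair counts (Eqs. (6)-(10))."""
    m = alpha + beta + gamma + delta                  # Eq. (6); equals n(n-1)/2 by Eq. (7)
    jac = alpha / (alpha + beta + gamma)              # Jaccard coefficient, Eq. (8)
    ri = (alpha + delta) / m                          # Rand index, Eq. (9)
    f1 = sqrt(alpha / (alpha + beta) * alpha / (alpha + gamma))   # F1 index, Eq. (10)
    return jac, ri, f1

def f_measure(precision, recall):
    """F-Measure of Eq. (13) from cluster-wise precision (Eq. (11)) and recall (Eq. (12))."""
    return 2 * precision * recall / (precision + recall)

print(pair_counting_scores(alpha=40, beta=5, gamma=8, delta=100))
print(f_measure(precision=0.9, recall=0.8))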
DBSCAN algorithm

DBSCAN [30] is a pioneering and popular algorithm for density-based cluster analysis. It can recognize clusters of arbitrary shape without presetting the number of clusters and is robust to outliers, so in this paper it is applied to clustering English vocabulary learning difficulty. The relevant definitions of the DBSCAN algorithm are as follows:

Definition 1. ε-neighborhood: the ε-neighborhood of data point p is denoted Nε(p) and can be expressed as Nε(p) = {q ∈ Ds | d(p,q) ≤ ε}, where Ds is the data set and d(p,q) is the distance between data points p and q.

Definition 2. Density: the density of data point p is denoted as ρ(p) and can be expressed as ρ(p) = |Nε(p)|. The so-called density of point p is the number of points contained in its ε– neighborhood.

Definition 3. Core point: point p is a core point if ρ(p) ≥ min pts, otherwise point p is a non-core point. Where min pts is the set density threshold.

Definition 4. Boundary point: a point p is a boundary point if p is a non-core point and p ∈ Nε(q), where point q is a core point.

Definition 5. Noise Point: If point p is not a core point and also not a boundary point, then it is a noise point.

Definition 6. Directly density-reachable: point p is directly density-reachable from point q if p ∈ Nε(q) and data point q is a core point. Direct density reachability does not satisfy symmetry.

Definition 7. Density-reachable: if there exists a sequence of data points p1, p2, ⋯, pi, ⋯, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi, then point p is density-reachable from point q, where pi ∈ Ds, 1 ≤ i ≤ n. Density reachability also does not satisfy symmetry.

Definition 8. Density-connected: point q is density-connected to point p if there exists a data point s ∈ Ds such that both point p and point q are density-reachable from s. Density connectedness satisfies symmetry.

Definition 9. Cluster: a cluster Cl is a non-empty subset of Ds that satisfies the following conditions:

(1) For any points p and q: if p ∈ Cl and point q is density-reachable from point p, then q ∈ Cl. (Maximality)

(2) For any p, q ∈ Cl, point p is density-connected to point q. (Connectivity)

The main idea of the DBSCAN algorithm is to randomly select a point p from the dataset. If point p is a core point, find all the points that are density-reachable from p and assign them to the same cluster. Otherwise, randomly select another unlabeled point from the English vocabulary dataset. Repeat the above steps until all points are labeled.
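The following sketch mirrors Definitions 1-9 and the expansion step just described: pick an unvisited point, test whether it is a core point, and if so label every point density-reachable from it with the same cluster id. The parameter names (eps, min_pts) and the queue-based expansion are standard implementation choices rather than details taken from the paper.

import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 marks noise points."""
    n = len(X)
    labels = np.full(n, -1)                    # -1 = not yet assigned / noise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(p):                          # epsilon-neighborhood of p (Definition 1)
        return np.where(np.linalg.norm(X - X[p], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:                # p is not a core point (Definition 3)
            continue                           # stays noise unless reached from a core point later
        labels[p] = cluster_id                 # start a new cluster at the core point p
        queue = list(nbrs)
        while queue:                           # collect all points density-reachable from p
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id         # core or boundary point joins the cluster
            if not visited[q]:
                visited[q] = True
                q_nbrs = neighbors(q)
                if len(q_nbrs) >= min_pts:     # q is itself a core point
                    queue.extend(q_nbrs)       # expand via direct density reachability
        cluster_id += 1
    return labels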

Improved Decision Tree Algorithm for English Vocabulary Learning Difficulty Prediction
Decision Tree Classification Algorithm

The decision tree algorithm is an inductive learning algorithm that infers a decision tree representation and classification rules from a collection of irregular data samples.

The following is a general description of constructing a decision tree:

1) First, attribute selection is carried out over all training samples at the root node, and the best feature attribute is selected as the split attribute of the root node.

2) According to this feature attribute, all training samples at this node are divided, producing the subsets of training samples that best classify the samples at this node.

3) Determine whether each divided subset of samples is already correctly classified; if so, create a leaf node.

4) If some subsets of samples still cannot be classified into the correct class, the best feature is sought for each such subset, the subset is split further, and the corresponding child nodes are built.

5) The whole process is a recursive operation; the termination condition is that all training samples are correctly classified or the set of features is empty, so that every sample corresponds to a leaf node. A minimal recursive sketch of this procedure is given below.
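A minimal recursive sketch of steps 1)-5), assuming the samples are stored as attribute-value dictionaries and that an attribute selection function (information gain for ID3, gain ratio for C4.5) is supplied from outside; all identifiers are ours, not the authors'.

from collections import Counter

def build_tree(samples, labels, attributes, select_best):
    """Recursive decision tree construction following steps 1)-5).

    samples     : list of dicts mapping attribute name -> value
    labels      : list of class labels aligned with samples
    attributes  : attribute names still available for splitting
    select_best : function (samples, labels, attributes) -> best attribute
    """
    if len(set(labels)) == 1:                        # all samples correctly classified
        return labels[0]                             # leaf node
    if not attributes:                               # no features left
        return Counter(labels).most_common(1)[0][0]  # majority-vote leaf
    best = select_best(samples, labels, attributes)  # steps 1) and 4): pick the best attribute
    node = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(s[best] for s in samples):      # step 2): split on each attribute value
        subset = [(s, y) for s, y in zip(samples, labels) if s[best] == value]
        sub_samples = [s for s, _ in subset]
        sub_labels = [y for _, y in subset]
        node[best][value] = build_tree(sub_samples, sub_labels,
                                       remaining, select_best)   # step 5): recurse
    return node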

The key elements of this construction process, attribute selection metrics and pruning, are briefly introduced below.

Attribute selection metric

In the process of constructing a decision tree, the core problem of the whole algorithm is to select the nodes of the decision tree. Information gain and information gain rate are attribute selection metrics based on information entropy, so information entropy is introduced first.

Information Entropy

Information entropy is a measure of the degree of disorder of information. If the classes within a data subset are uniformly mixed, the information entropy is high; if the subset is dominated by a single class, the information entropy is lower. Its specific definition is as follows:

Let S be a collection of s data samples whose category attribute takes m different values, corresponding to m different categories Ci, i ∈ {1,2,3,⋯,m}. Assuming si is the number of samples in category Ci, the amount of information required to classify a given data object is:
\[I\left( s_{1},s_{2},\cdots ,s_{m} \right)=-\sum\limits_{i=1}^{m} p_{i}\log_{2} p_{i}\tag{14}\]
where pi is the probability that an arbitrary data object belongs to category Ci, estimated as si / s. According to information theory, information is encoded in bits, hence the base-2 logarithm.

Information Gain

Information gain refers to the difference between the original information of the data set and the information after categorization, and the calculation process of information gain is as follows:

Let attribute A have v different values {a1, a2, a3, ⋯, av}. Attribute A can then be used to partition the set S into {S1, S2, S3, ⋯, Sv}, where Sj is the subset of S in which A takes the value aj. When attribute A is used as the test attribute, let sij be the number of samples of category Ci in the subset Sj. The information entropy of the partition induced by attribute A is then:
\[E\left( A \right)=\sum\limits_{j=1}^{v}\frac{s_{1j}+\ldots +s_{mj}}{s}\, I\left( s_{1j},\ldots ,s_{mj} \right)\tag{15}\]

Then the information gain obtained after partitioning the sample set of the current node according to attribute A is:
\[Gain(A)=I\left( s_{1},s_{2},\ldots ,s_{m} \right)-E(A)\tag{16}\]

Information Gain Rate

The information gain rate normalizes the information gain, and this normalization uses the concept of split information.

The split information of attribute A is calculated as shown in equation (17):
\[SplitInfo\left( A \right)=-\sum\limits_{j=1}^{v}\frac{\left| S_{j} \right|}{\left| S \right|}\log_{2}\frac{\left| S_{j} \right|}{\left| S \right|}\tag{17}\]

Equation (18) is the formula for the gain rate of attribute A.

\[GainRatio\left( A \right)=\frac{Gain\left( A \right)}{SplitInfo\left( A \right)}\tag{18}\]
The Gini coefficient

The Gini coefficient measures the impurity of a data division; the Gini impurity of a subset is the probability that a randomly selected sample from the subset would be misclassified if it were labeled at random according to the class distribution.

The Gini coefficient is calculated as shown in equation (19):
\[Gini\left( S \right)=1-\sum\limits_{i=1}^{m} p_{i}^{2}\tag{19}\]

The more mixed the categories contained in the data set, the larger the Gini index.
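The attribute selection metrics of Eqs. (14)-(19) can be sketched directly in Python as follows; estimating the probabilities pi from class frequencies and using base-2 logarithms are assumptions consistent with, but not spelled out by, the formulas above, and the toy data are ours.

import math
from collections import Counter

def entropy(labels):
    """I(s1, ..., sm) of Eq. (14), with pi estimated as class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(values, labels):
    """Group the class labels by the value the attribute takes on each sample."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return groups

def information_gain(values, labels):
    """Gain(A) of Eq. (16): entropy before the split minus weighted entropy after (Eq. (15))."""
    n = len(labels)
    e_after = sum(len(g) / n * entropy(g) for g in partition(values, labels).values())
    return entropy(labels) - e_after

def gain_ratio(values, labels):
    """GainRatio(A) of Eq. (18) = Gain(A) / SplitInfo(A), with SplitInfo(A) from Eq. (17)."""
    n = len(labels)
    split_info = -sum(len(g) / n * math.log2(len(g) / n)
                      for g in partition(values, labels).values())
    return information_gain(values, labels) / split_info if split_info else 0.0

def gini(labels):
    """Gini(S) of Eq. (19): impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Toy usage: one candidate attribute observed on five samples (illustrative values)
attr = ["a", "a", "b", "b", "b"]
cls = ["easy", "easy", "hard", "hard", "easy"]
print(information_gain(attr, cls), gain_ratio(attr, cls), gini(cls))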

Decision Tree Pruning

Scholars have proposed pruning techniques to improve the overfitting situation. Currently, the commonly used pruning methods are pre-pruning and post-pruning.

Pre-pruning method is to prune the decision tree by stopping the construction of the tree in advance, which is not often used because the threshold of stopping cannot be obtained in advance.

Post-pruning methods are used to simplify large-scale decision trees by pruning the branches of certain nodes after the decision tree has grown completely.

The C4.5 algorithm is an optimization of the ID3 algorithm and is more accurate and faster than ID3. The decision tree created by C4.5 can be used for classification prediction. The main improvements of C4.5 over ID3 are as follows.

(1) The C4.5 algorithm is able to handle default values in the training data.

(2) There is a more sophisticated approach to the pruning process of decision trees.

(3) The C4.5 algorithm is able to handle continuous data by discretizing the attributes of continuous types.

The main difference between the C4.5 algorithm and the ID3 algorithm [31] lies in the attribute selection metric: ID3 uses information gain, whereas C4.5 applies the information gain rate. In addition, C4.5 can process continuous attributes by discretizing them first.

The C4.5 algorithm uses the information gain rate to select the split attribute of the current node, which effectively eliminates the bias of information gain toward multi-valued attributes.

The flowchart of C4.5 algorithm is shown in Fig. 1.

Figure 1.

C4.5 algorithm flowchart

The calculation of the information gain rate is actually a process of normalizing the information gain, and the concept of split information is used in the calculation of the information gain rate.

The split information of A can be expressed by equation (20).

\[SplitInfo_{A}\left( T \right)=-\sum\limits_{j=1}^{m}\frac{\left| T_{j} \right|}{\left| T \right|}\log_{2}\frac{\left| T_{j} \right|}{\left| T \right|}\tag{20}\]

The C4.5 information gain rate is calculated as:
\[GainRatio\left( A \right)=\frac{Info\left( T \right)-Info_{A}\left( T \right)}{SplitInfo_{A}\left( T \right)}\tag{21}\]

Improvements to the C4.5 algorithm

To address the high time complexity of C4.5, the concept of the equivalent infinitesimal is used to reduce the computation time of decision tree construction. In addition, because the algorithm is applied to the prediction of English vocabulary learning difficulty, a balance coefficient ω (0 < ω < 1), set according to the course credits, is introduced when calculating the split information for different English vocabulary learning difficulties, in order to improve the accuracy of the algorithm.

The concept of the equivalent infinitesimal from Taylor's formula is introduced to reduce the computational cost of the decision tree and thus shorten decision tree generation time. According to the definition of the Taylor series [32], an infinitely differentiable function f(x) on a neighborhood of a has the expansion of Eq. (22):
\[f(x)=\sum\limits_{n=0}^{\infty }\frac{f^{(n)}(a)}{n!}(x-a)^{n}\tag{22}\]

When f(x) = ln(1 + x) and a = 0 are substituted into Eq. (22), f(x) becomes a Maclaurin series:
\[\ln (1+x)=\sum\limits_{n=1}^{\infty }\frac{(-1)^{n+1}}{n}x^{n}\tag{23}\]

So when the value of x is infinitesimally small, the formula can be simplified to:
\[\ln (1+x)=\sum\limits_{n=1}^{\infty }\frac{(-1)^{n+1}}{n}x^{n}\approx x\tag{24}\]

The simplification leads to the simplified information gain rate formula (25) of the C4.5 algorithm:
\[GainRatio\left( A \right)=\frac{Gain\left( A \right)}{SplitInfo_{A}\left( T \right)}=\frac{\sum\limits_{i=1}^{n}\frac{\left| TC_{i} \right|\left( \left| T \right|-\left| TC_{i} \right| \right)}{\left| T \right|}-\sum\limits_{j=1}^{m}\sum\limits_{i=1}^{n}\frac{\left| TC_{ij} \right|\left( \left| T_{j} \right|-\left| TC_{ij} \right| \right)}{\left| T_{j} \right|}}{\sum\limits_{j=1}^{m}\frac{\left| T_{j} \right|\left( \left| T \right|-\left| T_{j} \right| \right)}{\left| T \right|}}\tag{25}\]

After simplification, the formula is changed from logarithmic operation to the basic operation of addition, subtraction, multiplication and division, which improves the efficiency of the algorithm.

In the process of calculating the information gain rate, a balance coefficient ω (0 < ω < 1) is introduced to balance the importance of different course attributes:
\[GainRatio\left( A_{\omega } \right)=\frac{Gain\left( A_{\omega } \right)}{SplitInfo_{A_{\omega }}\left( T \right)}=\frac{Info\left( T \right)-Info_{A_{\omega }}\left( T \right)}{SplitInfo_{A_{\omega }}\left( T \right)}\tag{26}\]

Among them:
\[SplitInfo_{A_{\omega }}\left( T \right)=-\sum\limits_{j=1}^{m}\left( \frac{\left| T_{j} \right|}{\left| T \right|}+\omega \right)\log_{2}\frac{\left| T_{j} \right|}{\left| T \right|}\tag{27}\]

Then after simplification:
\[GainRatio\left( A_{\omega } \right)=\frac{Gain\left( A_{\omega } \right)}{SplitInfo_{A_{\omega }}\left( T \right)}=\frac{\sum\limits_{i=1}^{n}\frac{\left| TC_{i} \right|\left( \left| T \right|-\left| TC_{i} \right| \right)}{\left| T \right|}-\sum\limits_{j=1}^{m}\sum\limits_{i=1}^{n}\frac{\left| TC_{ij} \right|\left( \left| T_{j} \right|-\left| TC_{ij} \right| \right)}{\left| T_{j} \right|}}{\sum\limits_{j=1}^{m}\frac{\left( \omega \left| T \right|+\left| T_{j} \right| \right)\left( \left| T \right|-\left| T_{j} \right| \right)}{\left| T \right|}}\tag{28}\]

In order to avoid the influence of subjective factors, different scores p are assigned according to the importance of different English vocabulary items, and the balance coefficient is set in the algorithm according to these scores: for a vocabulary item with score p, the balance coefficient is ω = 1/p.
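Because the printed forms of Eqs. (25)-(28) are partly garbled, the sketch below only illustrates our reading of the simplification: every term of the form -q·log q is replaced by q·(1-q) using ln(1+x) ≈ x, and the balance coefficient ω = 1/p is added to the branch weights in the split information. It is an interpretation under these assumptions, not the authors' verified implementation.

def approx_entropy(counts, total, omega=0.0):
    """Entropy with -q*log(q) replaced by q*(1-q), following ln(1+x) ~ x (Eq. (24))."""
    return sum((c / total + omega) * (1 - c / total) for c in counts)

def fast_gain_ratio(class_counts, branch_partitions, total, omega):
    """Simplified gain ratio in the spirit of Eqs. (25)-(28).

    class_counts      : class sizes in the current node, e.g. [30, 20]
    branch_partitions : class sizes inside each branch produced by attribute A
    total             : number of samples in the node
    omega             : balance coefficient, set to 1/p for a vocabulary item of score p
    """
    info_before = approx_entropy(class_counts, total)
    info_after = sum((sum(branch) / total) * approx_entropy(branch, sum(branch))
                     for branch in branch_partitions)
    branch_sizes = [sum(branch) for branch in branch_partitions]
    split_info = approx_entropy(branch_sizes, total, omega=omega)   # approximated Eq. (27)
    return (info_before - info_after) / split_info if split_info else 0.0

# Toy usage: 50 samples, two classes, an attribute with two branches, vocabulary score p = 4
print(fast_gain_ratio([30, 20], [[25, 5], [5, 15]], total=50, omega=1 / 4))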

Analysis of the results of the difficulty prediction experiment
Clustering Results and Analysis of English Vocabulary Learning Difficulty

The seven-dimensional vector features of English vocabulary items of different learning difficulty in the dataset were input into the DBSCAN algorithm, and a grid search was carried out that jointly considered the number of clusters and the DBI metric. Finally, all the words in the dataset were clustered into six clusters, and the results are shown in Table 1.

Table 1. Word memory retrieval difficulty clustering results

Cluster 0 1 2 3 4 5
Quantities 233 17 4 6 6 4
1 try 0.31 1.28 0.28 0.02 0.21 0.41
2 tries 5.02 14.51 1.96 1.35 4.02 10
3 tries 22.73 33.28 10.27 9 25.27 37.04
4 tries 34.82 30.22 24.26 26.01 40.97 35.01
5 tries 24.52 15.25 31.81 36.41 22.54 13.21
6 tries 10.65 5.19 25.25 23.01 6.19 3.2
7 or more 1.74 0.84 6.5 3.83 0.79 0.01
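The grid search described above can be reproduced, under assumptions about the parameter ranges (which are not stated here), with scikit-learn's DBSCAN and the Davies-Bouldin index (DBI); the feature matrix and the search ranges below are placeholders.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import davies_bouldin_score

# Placeholder for the real (n_words, 7) matrix of guess-distribution features
X = np.random.rand(270, 7)

best = None
for eps in np.arange(0.05, 0.5, 0.05):               # assumed search range for eps
    for min_pts in range(3, 10):                     # assumed search range for min_pts
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:                           # DBI needs at least two clusters
            continue
        mask = labels != -1                          # score only the non-noise points
        dbi = davies_bouldin_score(X[mask], labels[mask])
        if best is None or dbi < best[0]:            # lower DBI means better separation
            best = (dbi, eps, min_pts, n_clusters)

if best:
    print("best DBI %.3f at eps=%.2f, min_pts=%d, clusters=%d" % best)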

The clusters whose seven-dimensional vectors of English vocabulary learning difficulty were very similar were merged pairwise to obtain three categories of word memory retrieval difficulty: easy, medium, and difficult. The distribution of the mean seven-dimensional vectors for the three difficulty levels is shown in Figure 2.

Figure 2.

Average seven-dimensional profiles of words of different difficulty

Predicting the difficulty of learning English vocabulary

The prediction for the word "cockroach" yields a result of "medium", with an average of 4.49 guesses. Figure 3 shows the distribution of the seven-dimensional vector features of the word "cockroach" and compares it with the seven-dimensional vector features of words of different difficulties. The word "cockroach" leans toward the harder words but still differs considerably from the difficult category.

Figure 3.

Guess distribution of the word compared with words of different difficulty

Model evaluation

The model prediction results were evaluated using the mean absolute error (MAE) and the goodness of fit (R2). MAE was used to evaluate the error of the model, and R2 was used to evaluate the explanatory power of the regression model. The calculation formulas are as follows:
\[MAE=\frac{1}{m}\sum\limits_{i=1}^{m}\left| \hat{y}_{i}-y_{i} \right|\]
\[R^{2}=\frac{\sum\limits_{i=1}^{m}\left( \hat{y}_{i}-\bar{y} \right)^{2}}{\sum\limits_{i=1}^{m}\left( y_{i}-\bar{y} \right)^{2}}\]
where ŷi is the predicted value of the i-th sample, yi is the true value of the i-th sample, and ȳ is the average of all samples.

The number of decision trees is set to 1000 and the feature selection ratio to 40%. The model set of this paper is trained and its MAE and R2 values are calculated; the results, shown in Table 2, indicate a strong goodness of fit.

Table 2. Model set error evaluation indexes

Tries 1 2 3 4 5 6 7+
MAE 0.19 0.87 1.83 1.14 1.15 1.42 0.63
R2 0.85 0.92 0.94 0.92 0.92 0.92 0.91
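A hedged sketch of this training and evaluation step: scikit-learn's random forest (CART-based, so only an approximation of the C4.5 model set) with 1000 trees and a 40% feature-selection ratio, scored with MAE and R2. The data are placeholders, and r2_score uses the usual 1 - SS_res/SS_tot form rather than the ratio written above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Placeholders: sparse word-structure features and the share of players needing 4 tries
X = np.random.rand(300, 142)
y = np.random.rand(300) * 40

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=1000,    # 1000 trees, as stated in the text
                              max_features=0.4,     # 40% feature-selection ratio
                              random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print("MAE:", mean_absolute_error(y_te, pred))
print("R2 :", r2_score(y_te, pred))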

Accuracy, precision, recall, and the F1-value are commonly used evaluation metrics for classification prediction problems; their calculation formulas are, respectively:
\[Accuracy=\frac{TP+TN}{TP+FN+TN+FP}\]
\[Precision=\frac{TP}{TP+FP}\]
\[Recall=\frac{TP}{TP+FN}\]
\[F_{1}=\frac{2TP}{2TP+FN+FP}=\frac{2\cdot Precision\cdot Recall}{Precision+Recall}\]

25% of the samples are held out as the test set and the classification model is retrained. By predicting the samples in the test set, the accuracy of the random forest classification prediction model is found to be 0.988, and the other generalization ability evaluation indexes are shown in Table 3. The generalization indexes of the classification model in this paper are all excellent, with values above 0.8; only the recall of the easy category is slightly lower, at 0.81, and the overall performance of the model is excellent.

Table 3. Generalization ability evaluation indexes of the classification prediction model

Categories Precision Recall F1 Sample size
Easy 1 0.81 0.92 9
Medium 0.97 1 0.97 60
Difficult 1 0.99 0.98 6
Macro average 0.98 0.95 0.99 70
Weighted average 0.98 0.98 0.97 70
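The corresponding classification evaluation on a 25% hold-out set can be sketched in the same way; the forest again stands in for the paper's C4.5 model set, and the features and difficulty labels are placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X = np.random.rand(280, 7)                                    # seven-dimensional word features
y = np.random.choice(["easy", "medium", "difficult"], 280)    # placeholder difficulty labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=1000, max_features=0.4, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print(classification_report(y_te, pred))   # per-class precision, recall and F1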

Robustness refers to how well the model tolerates changes in the data. A model is robust if small deviations in the data have only a small effect on the model’s output. After deleting all structural information of the first and last letters in the 142-dimensional sparse matrix, the C4.5 model set of “4 tries” is trained and the new model is compared with the original model to obtain the results in Fig. 4. Figure 4 shows that after removing 26 features, the R2 of the model does not change much. This indicates that the regression model is robust.

Figure 4.

Test-set R2 of the model after removing part of the input

Each of the seven vector dimensions is deleted in turn, and the remaining six-dimensional vectors are input into the classification prediction model. The results, shown in Table 4, indicate that the C4.5 classification prediction model is robust, with all indexes above 0.9.

Table 4. Model evaluation with part of the input removed

Delete Accuracy Precision Recall F1
- 0.99 0.98 0.99 0.97
1 0.96 0.95 0.93 0.93
2 0.95 0.99 0.97 0.95
3 0.99 0.99 0.99 0.99
4 0.94 0.96 0.96 0.95
5 0.99 0.99 0.99 0.99
6 0.98 0.98 0.99 0.95
7 0.98 0.99 0.99 0.99
Adjustment of English Vocabulary Teaching Strategies Based on the Concept of OBE Education

Under the guidance of the OBE concept, English teaching in higher vocational colleges and universities should uphold the principle of being "student-oriented and result-oriented", clarify educational objectives, give full play to classroom effectiveness, and construct a reasonable teaching assessment system to promote students' continuous progress. The following discusses vocabulary teaching strategies for higher vocational English based on the OBE education concept from several aspects.

Constructing a highly adaptable goal system according to the OBE teaching philosophy

First, consider the end as the beginning and establish teaching objectives. Under the OBE teaching mode, when teachers set word teaching goals, they should take students’ expected performance as the starting point and the end point, and use it as a guide to plan the teaching content and methods.

Second, focus on outcomes and determine the teaching process. Teachers should make detailed and clear teaching plans at the beginning of each class so that students have a clear understanding of their own learning objectives and expected results.

Third, understand the needs and determine the teaching strategies. In teaching practice, teachers need to deeply understand the unique personality and ability of each student, so that they can adjust their teaching strategies in a targeted manner and operate in a diversified teaching mode that is more adaptable to individual needs.

Enriching vocabulary teaching methods based on the requirements of OBE philosophy

First, multimodal learning. English teaching in higher vocational colleges and universities has widely used the “multimodal” learning strategy, which aims at mobilizing multi-sensory experiences such as visual, auditory and tactile senses in order to strengthen students’ language skills.

Second, vocabulary contextualization. English vocabulary contextualization refers to learning in a specific context, so that students can learn and understand vocabulary in the actual environment. In this way, it can not only enhance students’ memory of new vocabulary, but also enable them to understand the meanings and usages of these vocabulary words in real contexts, matching the OBE teaching concept of focusing on effectiveness.

Third, vocabulary interactive exercises. As a teaching strategy that integrates language communication, cooperation, and competition, vocabulary interactive practice can achieve student-centered interactive practice and enhance students’ learning interest and effectiveness.

Fourth, vocabulary classification and generalization. Vocabulary classification and induction is an effective learning strategy in higher vocational English courses, the core concept of which is to guide students to systematize and memorize new vocabulary in a logical order in order to improve the efficiency of memorization and learning effectiveness.

Optimize the vocabulary teaching assessment mode based on the OBE education concept standard

First, design an evaluation index system. In order to deeply and systematically evaluate the effectiveness of higher vocational English teaching practice based on the OBE education concept, teachers should construct a comprehensive evaluation system, which covers three dimensions: “learning outcomes”, “teaching process” and “teaching environment”.

Second, a combination of qualitative and quantitative. In order to ensure the comprehensiveness and accuracy of the course effects, teachers should adopt a combination of qualitative and quantitative assessment methods. In qualitative assessment, relevant data can be collected with the help of diversified ways such as listening and evaluating activities and interviews with teachers and students, in order to understand and grasp the actual experience of teachers and students on the education model in a more in-depth manner, and then the various types of information collected can be analyzed in-depth and compared, so as to make clear the strengths and weaknesses of this education method.

Third, analyze the evaluation results and verify effectiveness. This analysis and verification should start from the following aspects: evaluate the effect of the education model in improving students' practical English ability and cultivating their self-directed learning and teamwork; ensure that classroom teaching strategies are fully adapted to students' actual needs and stimulate their initiative and enthusiasm; focus on the contribution of the curriculum to shaping students' innovative thinking and practical skills; and emphasize the potential of the education model to improve teacher quality, especially professional quality.

Fourth, self-referential assessment is emphasized. Self-referential assessment is an emerging mode of assessment in the current field of education, which encourages students to look at their own learning process from a developmental perspective, based on their current level of understanding of the subject matter, their mastery of skills, and the dynamics of their feelings and emotions.

Analysis of changes in attitudes toward vocabulary learning

In order to test whether the vocabulary learning attitudes of the students in the experimental class changed significantly after the experiment, the author distributed the attitude scale: 50 copies were planned, 50 were actually distributed and 50 were recovered, a recovery rate of 100%. The collected data were analyzed using the paired samples t-test in SPSS, and the descriptive statistics and paired samples t-test results are shown in Table 5 and Table 6, respectively. The descriptive statistics of the experimental class students' vocabulary learning attitude scale before and after the experiment show that the average pre-test score is 71.352 and the average post-test score is 78.045, 6.693 points higher than the pre-test. The paired samples t-test shows that the two-sided significance value (p-value) is 0.000, which is less than 0.05, indicating an obvious difference between the students' vocabulary learning attitudes before and after the experiment; that is, after the application of the improved C4.5 algorithm and the adjusted teaching strategies, the students of the experimental class show a clear change in their attitudes toward English vocabulary learning.

Table 5. Descriptive statistics of the learning attitude scale

Mean N SD SEM
Learning attitude Pre-exp. 71.352 50 9.451 1.289
After-exp. 78.045 50 11.568 1.536

Table 6. Paired samples t-test of the learning attitude scale before and after the experiment

Pair Mean SD SEM 95% CI lower 95% CI upper t df Sig. (2-tailed)
(Pre-exp.) - (After-exp.) -6.1694 7.1562 0.9845 -8.1456 -4.1977 -6.279 49 0.000
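The paired-samples t-test reported in Tables 5 and 6 was run in SPSS; an equivalent computation in Python, with placeholder score vectors standing in for the 50 pre- and post-test scale totals, would look like this.

import numpy as np
from scipy import stats

pre = np.random.normal(71.4, 9.5, 50)     # placeholder pre-test attitude scores
post = np.random.normal(78.0, 11.6, 50)   # placeholder post-test attitude scores

t, p = stats.ttest_rel(pre, post)         # paired-samples t-test
print("mean difference: %.3f" % (post - pre).mean())
print("t = %.3f, two-sided p = %.4f" % (t, p))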

The pre-test learning attitude scale for the experimental class reused an existing scale that investigated four aspects: attitude during vocabulary learning, attitude during vocabulary recognition, attitude during vocabulary review, and frequency of vocabulary review. The pre- and post-test scores of each student on each dimension were analyzed with the paired-samples t-test in SPSS in order to understand the changes in these four dimensions, i.e., the changes in the experimental class students' vocabulary learning attitudes before and after the experiment. The descriptive statistics and paired samples t-test results for each dimension are shown in Tables 7 and 8, respectively.

Table 7. Descriptive statistics for each dimension before and after the experiment

Mean N SD SEM
Lexical acquisition Pre-exp. 29.548 50 4.6258 0.6358
After-exp. 31.056 50 5.4698 0.7589
Memorize words Pre-exp. 20.456 50 4.0658 0.5546
After-exp. 21.598 50 4.2879 0.5839
Review the main ways of words Pre-exp. 11.395 50 2.7892 0.3476
After-exp. 13.245 50 3.2689 0.4215
Review the frequency of words Pre-exp. 8.657 50 1.4539 0.2103
After-exp. 10.698 50 1.4258 0.1987

Table 8. Paired samples t-test for each dimension before and after the experiment

Dimension (pair) Mean SD SEM 95% CI lower 95% CI upper t df Sig. (2-tailed)
Lexical acquisition (Pre-exp.)-(After-exp.) -1.6985 5.0598 0.6985 -3.2655 -0.3036 -2.436 49 0.019
Memorize words (Pre-exp.)-(After-exp.) -1.3356 4.0987 0.5698 -2.4561 -0.2152 -2.368 49 0.023
Review the main ways of words (Pre-exp.)-(After-exp.) -1.3025 3.1456 0.4316 -2.1625 -0.4358 -3.025 49 0.006
Review the frequency of words (Pre-exp.)-(After-exp.) -1.6548 1.9784 0.2874 -2.2358 -1.1167 -6.136 49 0.000

The differences between the mean scores of the four dimensions of the post-test learning attitude scale and those of the pre-test scale in the experimental class are 1.508, 1.142, 1.85, and 2.041, respectively. In other words, after the experiment the experimental class students' scores on the four dimensions of the scale, namely attitude when learning vocabulary, attitude when recognizing vocabulary, attitude when reviewing vocabulary, and frequency of reviewing vocabulary, all increased compared with the pre-test. The paired samples t-test of the pre- and post-test scores on each dimension shows that the two-sided significance values (p-values) are 0.019, 0.023, 0.006 and 0.000, respectively, which indicates a significant difference between the experimental class's post-test and pre-test scores on these four dimensions; that is, English vocabulary teaching based on the improved C4.5 algorithm and the adjusted teaching strategy has a positive effect on students' attitudes when learning vocabulary, attitudes when recognizing words, ways of reviewing vocabulary, and frequency of reviewing words.

Conclusion

In this paper, we propose an English vocabulary learning difficulty prediction method based on DBSCAN clustering and the C4.5 classification prediction algorithm. The model predicts the seven-dimensional vector representation of English vocabulary learning difficulty and also predicts a more concise and intuitive difficulty classification into three levels: "easy", "medium", and "difficult". The experiments show that the model has very good goodness-of-fit and prediction accuracy, with a prediction accuracy of 0.988. In addition, the model is robust as a whole: its R2 changes little after some of the features are deleted.

Applying the methodology of this paper to the practice of English vocabulary teaching, we also propose a method for adjusting English vocabulary teaching strategies. The practical results show that the teaching strategy based on the OBE education concept can effectively improve the level of English vocabulary teaching and students' attitudes towards vocabulary learning. The four examined dimensions of vocabulary learning attitude also improved significantly: compared with the pre-test, the post-test average scores for attitude when learning vocabulary, attitude when recognizing vocabulary, attitude when reviewing vocabulary, and frequency of reviewing vocabulary rose by 1.508, 1.142, 1.85, and 2.041 points, respectively.
