Research on the Optimization of Intelligent English Vocabulary Teaching Paths Based on Reinforcement Learning Models
Published: 29 Sep 2025
Received: 06 Jan 2025
Accepted: 27 Apr 2025
DOI: https://doi.org/10.2478/amns-2025-1135
© 2025 Xuefen Shi and Jinli Pan, published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
Vocabulary teaching plays a vital role in English learning. Vocabulary is the foundation of language, and without a rich vocabulary effective communication is impossible. Mastering more vocabulary means that students can express their thoughts and feelings more accurately and fluently [1-4]. Vocabulary is the soul of language and the key to understanding and using it. Learning vocabulary helps students understand English linguistic patterns and expressions more deeply, thus improving their overall language level [5-7].
The traditional way of teaching English vocabulary has many problems, such as a single teaching method and insufficient student motivation. With the development of technology and the updating of educational concepts, optimized approaches such as multimedia teaching, group cooperative learning, game-based learning, practical teaching, and vocabulary memorization and consolidation have gradually received attention [8-11]. To ensure teaching effectiveness, the vocabulary teaching path needs to be optimized so as to better improve students' learning outcomes and interest. Reinforcement learning is a learning method that achieves optimal behavior by interacting with the environment and adjusting its strategy according to the rewards or punishments obtained [12-15]. In English vocabulary teaching, it differs from traditional supervised and unsupervised learning in that reinforcement learning focuses more on decision making and optimization in a dynamic teaching environment [16-17].
In this study, a reinforcement learning framework and the Q-Learning algorithm are used to construct an English vocabulary exercise recommendation assisted teaching system. To improve the match between students and exercises, this paper designs a method for assessing students' learning ability and a method for assessing exercise difficulty based on item response theory. After obtaining the parameters of the English vocabulary exercises, the idea of Fisher information is introduced to calculate the multilevel difficulty of the exercises, and the architecture of the recommendation assisted teaching system is then designed in terms of environment state, action set, and reward strategy. Evaluation criteria such as a bias metric, precision, recall, and NDCG are selected to evaluate the constructed model, and finally the effectiveness of the system is assessed by means of a questionnaire survey covering system satisfaction, learning attitudes, and learning effects.
Personalized learning is the core goal of computer-assisted instruction, and personalized learning systems built with AI technology tailor learning paths and resources to learners' interests, abilities, and progress. This requires collecting and analyzing learners' learning data, using machine learning algorithms to identify learning patterns and needs, and then designing personalized learning paths and resource recommendation strategies to improve learning efficiency and satisfaction and to stimulate motivation and creativity.
Intelligent teaching aids are key applications of AI technology in computer-aided teaching, covering intelligent Q&A systems, virtual teaching assistants, online labs, etc., which can answer questions, provide real-time feedback and simulate real experimental environments. In order to optimize these tools, it is necessary to continuously improve their algorithms to increase accuracy and response speed, and focus on interface design to enhance the ease of use and friendliness of the user experience. At the same time, a user feedback mechanism should be established to collect and process learners’ opinions in order to continuously improve the tools.
A reinforcement learning model learns an optimal strategy through interaction between an agent and its environment. In this paper, we construct an English vocabulary exercise recommendation assisted teaching system based on a reinforcement learning model, in which the students are the agents, the English vocabulary exercises constitute the environment, and students receive reward or punishment feedback by solving the exercises. The dynamic adjustment mechanism provided by the reinforcement learning model is consistent with the goal of optimizing the teaching path, so the model plays a central role in the construction of the system in this paper.
In a reinforcement learning model [18], the agent's task is to explore continuously within the constructed environment: it observes the state of the environment, takes a corresponding action, receives from the environment a reward for the action just explored, and then learns and updates its policy by maximizing the cumulative reward. Through repeated interaction with the environment, the agent finds an optimal solution that reaches the terminal state.
The environment is a core element of reinforcement learning; it models the setting of the current learning task that the agent explores. The environment interacts with the agent, gives the agent a certain reward for its exploration in the current state, and changes the agent's state, and so on until the agent reaches the terminal state of the environment.
The state of the agent in the environment can be understood as the attributes of the agent itself, which may be location information or other variables describing the environment. A state can be a discrete or a continuous variable and is continuously updated as the agent interacts with the environment.
Actions are the behaviors taken by the agent during its interactive exploration of the current environment.
A reward is the feedback the environment gives the agent according to the action it takes in the current state. Usually, when choosing the next state, the agent considers the reward of the current action in the environment, and the reward obtained in each time window also affects the subsequent training process.
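As a concrete illustration of this observe-act-reward loop, the following minimal Python sketch pairs a toy environment with a placeholder policy; the `Env` class and its `reset`/`step` methods are illustrative assumptions, not the exercise-recommendation environment defined later in this paper.

```python
import random

# A toy environment: states, actions, and rewards are purely illustrative.
class Env:
    def reset(self):
        """Return the initial state."""
        return 0

    def step(self, state, action):
        """Return (next_state, reward, done) after taking `action` in `state`."""
        next_state = state + action
        reward = 1.0 if next_state >= 5 else 0.0   # reward fed back by the environment
        done = next_state >= 5                      # terminal state reached
        return next_state, reward, done

def policy(state, actions=(0, 1)):
    # Placeholder policy; in reinforcement learning this is what the agent
    # updates in order to maximize the cumulative reward it receives.
    return random.choice(actions)

env = Env()
state = env.reset()
done = False
while not done:
    action = policy(state)                          # agent observes the state, takes an action
    state, reward, done = env.step(state, action)   # environment returns reward and next state
```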
Reinforcement learning problems are often described as Markov Decision Processes (MDP). That is, the interaction between the agent and the environment must satisfy the Markov property: the future state of the agent depends only on the current state. The Markov decision process has dependencies between states and a temporal order; it is a sequential decision process. It contains five elements and can be represented by the quintuple
$$\langle S, A, P, R, \gamma \rangle$$
where $S$ is the set of states, $A$ is the set of actions, $P$ is the state transition probability matrix, $R$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor.
The specific Markov Decision Process (MDP) is shown in Figure 1.

Figure 1: Markov decision process
A Markov Decision Process (MDP) is based on the theory of stochastic processes and stipulates that the current state of the agent depends only on the state of the previous moment and the action taken there. The reinforcement learning approach built on this regards the recommendation process as a Markov decision process: at each step, the agent selects an exploratory behavior (action) according to the current state and the state transition probability matrix $P$, receives a reward from the environment, and transitions to the next state.
The Q-Learning algorithm [19] is an off-policy, value-based reinforcement learning algorithm. Here $Q(s, a)$ denotes the expected cumulative reward of taking action $a$ in state $s$ and following the current policy thereafter. The Q-Learning algorithm always maintains a Q-table that stores a $Q(s, a)$ value for every state-action pair and selects actions according to these values. The value update formula of the Q-Learning algorithm is
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r_{t+1}$ is the immediate reward obtained after taking action $a_t$ in state $s_t$, and $s_{t+1}$ is the resulting next state.
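The following short Python sketch performs one tabular update of this form; the learning rate, discount factor, and dictionary-backed Q-table are illustrative choices rather than the exact settings used in the paper.

```python
from collections import defaultdict

# Tabular Q-function: Q[(state, action)] -> estimated long-term value.
Q = defaultdict(float)

def q_learning_update(state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One Q-Learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

# Example: after one recommended exercise is answered, the (state, action)
# pair that produced the recommendation is updated with the observed reward.
actions = [0, 1, 2]
q_learning_update(state=(3, 4), action=1, reward=1.0, next_state=(3, 5),
                  actions=actions)
```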
In order to recommend English vocabulary exercises that match learners' learning abilities, two key issues need to be resolved: how to objectively and accurately assess learners' learning abilities, and how to calculate the difficulty of English vocabulary exercises and match it with those abilities. This chapter proposes a method for assessing learners' learning ability combined with the difficulty of English vocabulary exercises from these two aspects.
Item response theory (IRT) has been widely used in educational and psychological settings, particularly in educational assessment, to assess learner competence. Tests are usually constructed and evaluated using various IRT models by analyzing the characteristics of the items (item discrimination and item difficulty).
The probability that a learner with ability $\theta$ correctly answers a 0-1 scored item $j$ under the two-parameter normal ogive model is
$$P_j(\theta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a_j(\theta - b_j)} e^{-t^2/2} \, dt$$
where $a_j$ is the discrimination parameter of item $j$ and $b_j$ is its difficulty parameter.
Similarly, the three-parameter normal ogive model can be derived from the two-parameter model by adding a guessing coefficient for the item:
$$P_j(\theta) = c_j + (1 - c_j) \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a_j(\theta - b_j)} e^{-t^2/2} \, dt$$
where $c_j$ is the probability that a learner with very low ability answers item $j$ correctly by guessing.
Although the normal ogive model theoretically establishes the basic form of the IRT model, its calculation relies on an integral, which makes it inconvenient for both parameter estimation and practical application. The commonly used logistic IRT model is therefore the three-parameter logistic model:
$$P_j(\theta) = c_j + \frac{1 - c_j}{1 + e^{-D a_j (\theta - b_j)}}$$
where $D = 1.702$ is a scaling constant that brings the logistic curve close to the normal ogive curve.
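As a numerical illustration of the three-parameter logistic model, the sketch below evaluates the response probability for made-up item parameters (the values of `a`, `b`, and `c` are hypothetical).

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.702):
    """Three-parameter logistic IRT model: probability that a learner with
    ability theta answers a 0-1 scored item correctly, given discrimination a,
    difficulty b, and guessing parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Illustrative item: moderate discrimination, average difficulty, 20% guessing floor.
print(p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2))   # 0.60 for an average learner
print(p_correct_3pl(theta=1.5, a=1.2, b=0.0, c=0.2))   # higher ability -> higher probability
```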
Each set of English vocabulary test questions in the exercise resource library contains not only 0-1 scored questions such as single-choice, fill-in-the-blank, and judgment questions, but also multi-level scored questions. Compared with 0-1 scored questions, multi-level scored questions can evaluate learners' learning ability more objectively.
Like 0-1 scored questions, multi-level scored questions have a discrimination parameter and a difficulty parameter, but unlike the 0-1 scoring model they have no guessing parameter, because it is essentially impossible for a learner to answer a multi-level scored question correctly by guessing. Unlike the 0-1 scoring IRT model, the multi-level scoring IRT model subdivides the difficulty parameter of the question: each additional share of the score corresponds to a further step in difficulty. The discrimination parameter is not subdivided in the same way, since it is treated as a property of the item as a whole.
Therefore, a generalization of Eq. (4), the three-parameter logistic model above, yields the logistic formula of the multilevel (graded) scoring model:
$$P_{jk}^{*}(\theta) = \frac{1}{1 + e^{-D a_j (\theta - b_{jk})}}, \qquad P_{jk}(\theta) = P_{jk}^{*}(\theta) - P_{j,k+1}^{*}(\theta)$$
where $a_j$ is the discrimination parameter of item $j$, $b_{jk}$ is the difficulty of obtaining at least score $k$ on item $j$, $P_{jk}^{*}(\theta)$ is the probability that a learner with ability $\theta$ obtains at least score $k$, and $P_{jk}(\theta)$ is the probability of obtaining exactly score $k$. The value of $b_{jk}$ increases with $k$, reflecting that each additional share of the score makes the item a little more difficult.
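The category probabilities of such a multilevel (graded) scoring model can be computed as differences of adjacent cumulative curves, as in the sketch below; the formulation follows the reconstruction above, and the step difficulties are illustrative.

```python
import math

def grm_category_probs(theta, a, step_difficulties, D=1.702):
    """Graded response model: each step k has a 2PL curve P*(score >= k);
    the probability of scoring exactly k is the difference of adjacent
    cumulative curves."""
    cum = [1.0]                                   # P*(score >= 0) = 1
    for b_k in step_difficulties:
        cum.append(1.0 / (1.0 + math.exp(-D * a * (theta - b_k))))
    cum.append(0.0)                               # P*(score > max) = 0
    return [cum[k] - cum[k + 1] for k in range(len(step_difficulties) + 1)]

# Illustrative 3-step item: higher partial scores require higher ability.
print(grm_category_probs(theta=0.5, a=1.0, step_difficulties=[-1.0, 0.0, 1.0]))
```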
This section discusses how to assess the difficulty of English vocabulary exercises. Further, after evaluating the difficulty of exercises, we can more easily and accurately match students’ learning ability with the difficulty of English vocabulary exercises to achieve better recommendation results.
To calculate the difficulty of English vocabulary exercises, we first need to analyze the questions in the exercise bank. In practice, we find that different questions differentiate students of different ability levels to different degrees, and the degree of differentiation of a question is the amount of information it provides about the respondents. A basic question, for instance, provides a great deal of information about students who have not yet mastered the fundamentals, but for students who have mastered them firmly it is a give-away question: it cannot test such students, the information it provides about them is very small, and their answers to it cannot tell us which of them has the stronger learning ability.
Based on the idea of Fisher information, item response theory defines the amount of information contributed by item $j$ when evaluating a student's learning ability $\theta$ as
$$I_j(\theta) = \frac{\left[ P_j'(\theta) \right]^2}{P_j(\theta)\left[ 1 - P_j(\theta) \right]}$$
where $P_j(\theta)$ is the probability that a learner with ability $\theta$ answers item $j$ correctly and $P_j'(\theta)$ is its derivative with respect to $\theta$.
The amount of information an item provides differs for different ability values. Similarly, the notion of information for evaluating an individual student can be defined at the level of the entire test; this is the test information function, which is simply the accumulation of the information functions of the items included in the test:
$$I(\theta) = \sum_{j=1}^{n} I_j(\theta)$$
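A small Python sketch of the item and test information functions follows; it reuses a 3PL response curve with made-up parameters and approximates the derivative numerically to keep the sketch model-agnostic.

```python
import math

def p_3pl(theta, a, b, c, D=1.702):
    # Three-parameter logistic response probability (illustrative parameters below).
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(p_func, theta, eps=1e-4):
    """Fisher information of one item at ability theta:
    I(theta) = P'(theta)^2 / (P(theta) * (1 - P(theta)))."""
    p = p_func(theta)
    dp = (p_func(theta + eps) - p_func(theta - eps)) / (2.0 * eps)  # numerical derivative
    return dp * dp / (p * (1.0 - p))

def test_information(p_funcs, theta):
    """Test information function: the accumulation of the item information functions."""
    return sum(item_information(p, theta) for p in p_funcs)

# Two illustrative items with made-up parameters.
items = [lambda t: p_3pl(t, a=1.2, b=-0.5, c=0.2),
         lambda t: p_3pl(t, a=0.8, b=0.7, c=0.2)]
print(test_information(items, theta=0.0))
```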
The information function of the items allows for the classification of the questions into several categories:
A standard item information curve, whose peak corresponds to a learning ability value slightly below 0 and whose peak is high, indicates good discrimination and a relatively low difficulty; such an item tests a significant portion of the student population well.
An ideal item information curve, with a peak at a learning ability value of 0, a high peak, and steep sides, indicates that the item is of moderate difficulty, discriminates well, tests students whose learning ability is around 0 well, and has a small guessing error.
A biased item information curve has its peak at a low learning ability value; the peak is short and its sides are gentle, indicating that the item discriminates little and is of low difficulty. Because the peak spans a wide interval, it cannot provide effective information for assessing learning ability within a narrow interval, so such items should be given less weight in the assessment of exercise difficulty, or even excluded from it altogether.
A failed item information curve, whatever its waveform, lies entirely below the horizontal axis, indicating that the item cannot provide valid information for any ability assessment; the design of such an item is a failure, and it should be excluded from the assessment of exercise difficulty.
After fitting the weighted information curves of the valid items, a multi-peak graph is obtained. The first peak is taken as the lowest difficulty of the exercise and the last peak as the highest difficulty. The average difficulty of the exercise is calculated by normalizing the information content of all valid items into weights and taking the weighted sum of the item difficulties:
$$\bar{b} = \sum_{j=1}^{m} w_j b_j, \qquad w_j = \frac{I_j}{\sum_{k=1}^{m} I_k}$$
where $m$ is the number of valid items, $b_j$ is the difficulty of item $j$, $I_j$ is its information content, and $w_j$ is the normalized weight derived from it.
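The weighted-average difficulty described above can be computed as in the sketch below; the item difficulties and information values are made-up numbers, and invalid or failed items are assumed to have been filtered out beforehand.

```python
def average_difficulty(difficulties, informations):
    """Normalize each valid item's information content into a weight, then
    take the weighted sum of the item difficulties."""
    total_info = sum(informations)
    weights = [info / total_info for info in informations]
    return sum(w * b for w, b in zip(weights, difficulties))

# Illustrative valid items: difficulties b_j and information values I_j.
difficulties = [-0.8, 0.1, 0.9]
informations = [0.6, 1.4, 1.0]
print(average_difficulty(difficulties, informations))  # information-weighted average difficulty
```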
By dynamically adjusting the exercise recommendation strategy, the reinforcement learning-based exercise recommendation assisted teaching system can provide students with appropriate English vocabulary exercises and answer sequences for training according to each student's state and learning ability level, thereby optimizing the intelligent English vocabulary teaching path.
This section focuses on the reinforcement learning-based exercise recommendation assisted teaching system and takes the architecture of the exercise recommendation system as the main part of the experimental design. It first combines the methods proposed in the previous chapter to construct an exercise recommendation model from the aspects of learner ability and the difficulty of the exercises to be recommended, and then gives a detailed design and description of the experiment, including the structure of the recommendation model based on the Q-Learning algorithm and the design of its environment state, action set, and reward strategy.
This study utilizes a reinforcement learning algorithm and analyzes the environmental features derived from the characteristics of learners and questions in order to select appropriate actions and obtain rewards. The experiment is an offline experiment: effective feature information is extracted from learners' historical exercise answer records to build the personalized exercise recommendation assisted teaching system based on reinforcement learning. First, learners' answer records are collected, learners are divided into ability level intervals according to their results, and the average pass rate and average score of the exercises are provided for quantifying question difficulty. The reinforcement learning model is then trained to recommend exercises to learners. The structure of the experimental model is shown in Figure 2.

Figure 2: Structure of the exercise recommendation model
In the exercise recommendation model based on the reinforcement learning Q-Learning algorithm, the learner's "nearest development zone" (zone of proximal development) is simulated by a two-dimensional model constructed from the learner's ability level and the difficulty of the questions. The experimental code in this study is written in Python, and the simulation platform is built using Tkinter. A 10 × 10 grid is constructed as the environment model for the experiment. The horizontal and vertical axes of this grid environment represent the learner's ability level and the difficulty of the questions, respectively; both values are standardized and normalized by data preprocessing and mapped into a common value interval.
Recommendation based on the reinforcement learning Q-Learning algorithm is an exploratory process, in which the reinforcement learning agent selects the optimal next action through the value function $Q(s, a)$.
The purpose of the agent in reinforcement learning is to find, through learning, the optimal path that yields the maximum reward. The reward is the environment's feedback on the selection and execution of an action during learning: executing a poor action yields a small reward or even a punishment, so the probability that this action is selected again becomes smaller, while executing an optimal action yields the maximum reward, so the probability that it is selected next time becomes larger. A reasonable reward strategy [21] can improve the efficiency of model training, so the design of the immediate reward strategy is very important.
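One possible shape of the ε-greedy action selection and immediate reward strategy on the 10 × 10 ability/difficulty grid is sketched below; the reward values and the distance thresholds are assumptions for illustration, not the exact reward table used in the system.

```python
import random
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] values learned during training
ACTIONS = list(range(10))       # each action = recommending one of 10 difficulty levels

def epsilon_greedy(state, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def immediate_reward(ability_level, difficulty_level):
    """Illustrative reward strategy: exercises close to the learner's current
    ability level (the 'learning area') earn a positive reward; exercises that
    are far too easy or too hard are penalized."""
    gap = abs(difficulty_level - ability_level)
    if gap <= 1:
        return 1.0
    if gap <= 3:
        return 0.0
    return -1.0

state = (4, 3)                  # (learner ability level, difficulty of the last exercise)
action = epsilon_greedy(state)
reward = immediate_reward(ability_level=state[0], difficulty_level=action)
```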
The experiment was conducted on an Intel E7-4820 v4 CPU and a Tesla K40C GPU with 2 × 12 GB of video memory, running Ubuntu 16.04; the data processing and modeling algorithms in this paper were written in Python.
The training data for this experiment were obtained from the backend database of the reinforcement learning-based English vocabulary question recommendation teaching system. After data preprocessing, a total of 110,000 records were obtained, containing the records of 10 users, each with 1,500 records of answering English vocabulary questions from the first semester of the first year of university.
According to the concept of the "learning area" in psychology, the questions given to learners each time should be neither too difficult, which would discourage them, nor too simple, which would make them lose interest in learning. According to this definition, ideally the algorithm should give a moderately difficult exercise from the learning area each time, which is reflected in the correct answer rate finally converging to a fixed value. Let the number of questions a learner answers in each round be $N$ and the number answered correctly be $n$; the per-round accuracy is then
$$\delta = \frac{n}{N}$$
where $N$ is the number of recommended questions per round and $n$ is the number answered correctly. Ideally $\delta$ converges to a fixed target value $\delta^{*}$. The recommendation bias of each round is measured by
$$ERR = \left| \frac{n}{N} - \delta^{*} \right|$$
where $\delta^{*}$ denotes the target accuracy corresponding to exercises drawn from the learner's learning area.
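Under the reconstruction above, and treating the absolute-deviation form of ERR as an assumption, the per-round metrics could be computed as follows; the target value `delta_star = 0.7` is purely illustrative.

```python
def round_metrics(num_correct, num_questions, delta_star=0.7):
    """Per-round accuracy and its deviation from the target value delta*.
    The absolute-deviation form of ERR is an assumption; the paper only
    states that ERR measures the recommendation bias of each round."""
    accuracy = num_correct / num_questions
    err = abs(accuracy - delta_star)
    return accuracy, err

# Example round: 7 of 10 recommended exercises answered correctly.
print(round_metrics(num_correct=7, num_questions=10))   # (0.7, 0.0)
```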
Reinforcement learning is relatively difficult to evaluate and is usually assessed using the score at the end of the current round or the average of the results of several rounds, so the evaluation in this experiment is based on the results of 200 rounds of user responses. To observe the stability of the algorithm, the number of correct answers in each of the 200 rounds is shown as a line graph. Figure 3 shows the distribution of the number of correct answers for a single user during training, and Figure 4 shows the distribution for all users.

Figure 3: Distribution of the number of correct answers for a single user

Figure 4: Distribution of the number of correct answers for all users
In the last 200 rounds of training, the number of correct answers per round was concentrated around $\delta^{*}N$, and the frequency of this value increased significantly while the frequency of other values decreased. The highest accuracy rates reached 71% and 69% in the single-user and all-user experiments, respectively. The model in this experiment can thus effectively learn an English vocabulary questioning strategy during training, and the proportion of questions drawn from the learner's learning area increases and stabilizes with the number of training rounds.
To evaluate the strategy more consistently, we examine the learning process through another metric: the value function of the action actually used. Because the state space of this experiment is large and no fixed user state can be selected, we take the value function of a single user's initial state in each round to check whether the model's decisions on sub-states are stable during training. The value function of the initial state of a single user in each round is shown in Figure 5.

Figure 5: Value function of the initial state of a single user
The q value in the figure represents the expected discounted reward the reinforcement learning model can obtain by taking the best action in that state. To assess the change in the q value accurately, we filtered the q values obtained when the optimal action was taken in the initial state of each round, giving 15,000 values. In iterations 0-7500 the q value oscillates in the range 0-16.8; in iterations 7501-15000 the trend stabilizes, with an average value of about 16.3. This indicates that the q value gradually stabilizes in the middle and late stages of training as the model explores English vocabulary learning strategies, showing that the reinforcement learning model has been trained to a good result.
At the same time, the given recommendation bias metric makes it possible to observe the model's strategy decisions at each step. The value of ERR is calculated once per round. Figures 6 and 7 show the mean and variance of ERR per 200 rounds for a single user, and Figures 8 and 9 show the mean and variance of ERR per 200 rounds for all users. At the beginning of training, both the mean and the variance are large: the mean and variance of ERR per 200 rounds for a single user reach maxima of 0.0391 and 0.0352, and those for all users reach maxima of 0.041 and 0.084, respectively. As training progresses, both the single-user and all-user means and variances of ERR per 200 rounds decrease and stabilize. The mean and variance of ERR per 200 rounds for a single user eventually converge to 0.0041 and 0.0040, while those for all users stabilize at 0.01 and 0.03, respectively, indicating that the model's decisions on question selection gradually stabilize and that the bias of its English vocabulary exercise recommendations gradually decreases and levels off.

Figure 6: Mean of ERR per 200 rounds for a single user

Figure 7: Variance of ERR per 200 rounds for a single user

Figure 8: Mean of ERR per 200 rounds for all users

Figure 9: Variance of ERR per 200 rounds for all users
The experiments use precision, recall, and NDCG to validate the performance of the model.
Precision measures the proportion of the model's positive predictions (here, recommended exercises) that are correct. It is calculated as
$$Precision = \frac{TP}{TP + FP}$$
where $TP$ is the number of true positives (recommended exercises that are relevant) and $FP$ is the number of false positives (recommended exercises that are not relevant).
Recall, also known as sensitivity or the true positive rate, measures the proportion of positive samples that the model is able to find correctly. It is calculated as
$$Recall = \frac{TP}{TP + FN}$$
where $FN$ is the number of false negatives (relevant exercises that were not recommended).
Normalized discounted cumulative gain (NDCG) is a metric used to assess the quality of ranking systems, especially in information retrieval and recommender systems. It is obtained by summing the relevance scores of the items in a recommendation list, weighted by position, and then normalizing the result so that different lists can be compared. NDCG usually ranges from 0 to 1, and the closer it is to 1, the better the ranking and recommendation performance of the system.
The formula for NDCG is
$$NDCG@k = \frac{DCG@k}{IDCG@k}, \qquad DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
where $rel_i$ is the relevance score of the item at position $i$ and $IDCG@k$ is the $DCG@k$ of the ideally ordered list. In this paper, the metrics are evaluated at list lengths $k = 3, 5, 7, 9$.
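For reference, the sketch below computes Precision@k, Recall@k, and NDCG@k for a single recommendation list with binary relevance; the item IDs and the relevance set are made up.

```python
import math

def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and Recall@k for one recommendation list with binary relevance."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k: position-discounted gain of the top-k list, normalized by the ideal list."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up example: exercises 2 and 9 are the ones the learner actually needed.
recommended = [3, 2, 7, 9, 5]
relevant = {2, 9}
print(precision_recall_at_k(recommended, relevant, k=3))  # (0.333..., 0.5)
print(ndcg_at_k(recommended, relevant, k=3))
```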
In order to verify the superiority of the proposed model, the following algorithms are compared with it in the experiments, using three widely used public datasets: MovieLens, Yelp, and Last.fm.
BPR: the BPR algorithm uses a pairwise learning approach, learning user preferences by comparing pairs of items the user has interacted with in order to produce a personalized ranking of the recommendation list. Dijkstra: Dijkstra's algorithm is a classic graph algorithm for the shortest path problem; it is simple and efficient but requires care when negative edge weights are involved. NeuCF: the training process of the NeuCF algorithm usually includes two phases, an embedding learning phase and a neural network training phase. FPMC: this algorithm predicts the user's next likely behavior by learning the transition probabilities in the user's behavioral sequence. DDPRG: DDPRG is a deep deterministic policy gradient (DDPG) based recommendation algorithm designed for personalized recommendation tasks.
In order to verify the effectiveness and advancement of the proposed algorithm, comparison tests are conducted between it and the above algorithms on the same datasets and in the same operating environment; the comparison results at different k values are shown in Tables 1, 2, 3, and 4.
Table 1: Comparison of this model with the baseline models (k = 3)

| Dataset | Metric | BPR | Q-Learning | NeuCF | FPMC | DDPRG | This model |
|---|---|---|---|---|---|---|---|
| Yelp | Precision@3 | 0.0411 | 0.0424 | 0.0810 | 0.0897 | 0.0961 | 0.1159 |
| | Recall@3 | 0.1769 | 0.2458 | 0.5372 | 0.5076 | 0.5305 | 0.5522 |
| | NDCG@3 | 0.1595 | 0.2406 | 0.3598 | 0.3654 | 0.3819 | 0.4021 |
| MovieLens | Precision@3 | 0.0524 | 0.0525 | 0.0993 | 0.0922 | 0.1100 | 0.1257 |
| | Recall@3 | 0.1953 | 0.1581 | 0.4988 | 0.5357 | 0.5404 | 0.5543 |
| | NDCG@3 | 0.3161 | 0.3005 | 0.3685 | 0.4024 | 0.4299 | 0.4367 |
| Last.fm | Precision@3 | 0.0908 | 0.0851 | 0.1142 | 0.1202 | 0.1328 | 0.1453 |
| | Recall@3 | 0.3459 | 0.1953 | 0.5025 | 0.4953 | 0.5204 | 0.5402 |
| | NDCG@3 | 0.3530 | 0.2006 | 0.3680 | 0.3588 | 0.3975 | 0.4099 |
Table 2: Comparison of this model with the baseline models (k = 5)

| Dataset | Metric | BPR | Q-Learning | NeuCF | FPMC | DDPRG | This model |
|---|---|---|---|---|---|---|---|
| Yelp | Precision@5 | 0.0450 | 0.0398 | 0.0784 | 0.0995 | 0.1152 | 0.1258 |
| | Recall@5 | 0.1582 | 0.2244 | 0.5215 | 0.4954 | 0.5461 | 0.5753 |
| | NDCG@5 | 0.1358 | 0.2358 | 0.3353 | 0.3869 | 0.4026 | 0.4345 |
| MovieLens | Precision@5 | 0.0302 | 0.0501 | 0.0892 | 0.1261 | 0.1356 | 0.1450 |
| | Recall@5 | 0.1154 | 0.1496 | 0.5067 | 0.5598 | 0.5895 | 0.5984 |
| | NDCG@5 | 0.2786 | 0.1582 | 0.3598 | 0.4061 | 0.4452 | 0.4566 |
| Last.fm | Precision@5 | 0.0958 | 0.0775 | 0.1064 | 0.1257 | 0.1483 | 0.1660 |
| | Recall@5 | 0.3359 | 0.1594 | 0.5252 | 0.5240 | 0.5356 | 0.5584 |
| | NDCG@5 | 0.3452 | 0.1961 | 0.3796 | 0.3859 | 0.3891 | 0.4060 |
Table 3: Comparison of this model with the baseline models (k = 7)

| Dataset | Metric | BPR | Q-Learning | NeuCF | FPMC | DDPRG | This model |
|---|---|---|---|---|---|---|---|
| Yelp | Precision@7 | 0.0394 | 0.0202 | 0.0951 | 0.1284 | 0.1772 | 0.1954 |
| | Recall@7 | 0.1866 | 0.0998 | 0.5563 | 0.5235 | 0.5569 | 0.6025 |
| | NDCG@7 | 0.2652 | 0.2685 | 0.3856 | 0.4314 | 0.4460 | 0.4752 |
| MovieLens | Precision@7 | 0.0559 | 0.0523 | 0.1270 | 0.1256 | 0.1555 | 0.1753 |
| | Recall@7 | 0.2570 | 0.1686 | 0.5632 | 0.4684 | 0.5473 | 0.6249 |
| | NDCG@7 | 0.1588 | 0.1242 | 0.3999 | 0.3991 | 0.4525 | 0.5004 |
| Last.fm | Precision@7 | 0.1251 | 0.1025 | 0.1400 | 0.1634 | 0.1800 | 0.2063 |
| | Recall@7 | 0.4112 | 0.3000 | 0.5257 | 0.5570 | 0.5782 | 0.6034 |
| | NDCG@7 | 0.2594 | 0.2067 | 0.3764 | 0.3994 | 0.4090 | 0.4583 |
Table 4: Comparison of this model with the baseline models (k = 9)

| Dataset | Metric | BPR | Q-Learning | NeuCF | FPMC | DDPRG | This model |
|---|---|---|---|---|---|---|---|
| Yelp | Precision@9 | 0.0399 | 0.0397 | 0.0701 | 0.1021 | 0.1028 | 0.1162 |
| | Recall@9 | 0.1581 | 0.2106 | 0.5185 | 0.5002 | 0.5225 | 0.5453 |
| | NDCG@9 | 0.1362 | 0.2268 | 0.3234 | 0.3931 | 0.4006 | 0.4070 |
| MovieLens | Precision@9 | 0.0297 | 0.0406 | 0.0804 | 0.1148 | 0.1258 | 0.1357 |
| | Recall@9 | 0.1581 | 0.1400 | 0.4988 | 0.5057 | 0.5486 | 0.5852 |
| | NDCG@9 | 0.2782 | 0.1511 | 0.3438 | 0.4052 | 0.4255 | 0.4441 |
| Last.fm | Precision@9 | 0.0962 | 0.0706 | 0.0985 | 0.1355 | 0.1446 | 0.1574 |
| | Recall@9 | 0.3361 | 0.1506 | 0.5059 | 0.5258 | 0.5335 | 0.5454 |
| | NDCG@9 | 0.3447 | 0.1900 | 0.3672 | 0.3762 | 0.3868 | 0.3990 |
According to the data tables, the performance of each recommendation algorithm on the different datasets differs in terms of Precision, Recall, and NDCG.
First, the proposed algorithm is the strongest on all metrics across all datasets; for example, its Precision@3 on the Yelp dataset is 0.1159, which is 20.60% higher than the second-best algorithm. Second, among the other algorithms, the BPR and Dijkstra algorithms perform worse than the rest on all datasets, with NDCG@5 values of only 0.3452 and 0.1961 on Last.fm, while NeuCF, FPMC, and DDPRG fall in between.
The reason why the BPR and Dijkstra algorithms perform poorly may be that their designs do not adequately account for the dynamics of user behavior in the dataset. Taking k = 5 as an example, the recall of the BPR algorithm on Yelp, MovieLens, and Last.fm is 0.1582, 0.1154, and 0.3359, respectively. Although BPR can be trained with implicit feedback data, it ignores the sequential information of users and items as well as the evolution of user behavior, resulting in poor accuracy and coverage for English vocabulary exercise recommendation. Dijkstra's algorithm is limited by a large state space and slow convergence, which makes its performance unstable in practical English vocabulary exercise recommendation and inferior to the other algorithms.
The NeuCF, FPMC, and DDPRG algorithms also have limitations compared with the algorithm in this paper. Taking k = 7 as an example, the precision, recall, and NDCG of NeuCF on the MovieLens dataset are 0.1270, 0.5632, and 0.3999; those of FPMC on the Yelp dataset are 0.1284, 0.5235, and 0.4314; and those of DDPRG on Last.fm are 0.1800, 0.5782, and 0.4090. Although NeuCF employs a neural network model, it usually captures only static feature information between users and items and ignores the dynamic characteristics and temporal relationships of user behavior, so the user's historical behavior sequence is not fully considered in the recommendation process and the recommendation results are less accurate. Although FPMC considers the interaction between users and items, it focuses on modeling their co-occurrence relationship and ignores the temporal order of user behavior sequences, so it may perform poorly in time-sensitive recommendation scenarios. DDPRG incorporates a reinforcement learning algorithm, but it may suffer from unstable training and an overly large state space, resulting in less stable recommendation performance. In summary, although these three algorithms can provide personalized English vocabulary exercise recommendation to a certain extent, they are inferior to the algorithm in this paper in handling the dynamic characteristics and temporal relationships of user behavior and in overall recommendation effect.
Finally, the algorithm in this paper adopts an intelligent personalized recommendation model for English vocabulary exercises based on a reinforcement learning model, which can effectively capture the temporal relationships and dynamic characteristics of user behavior. For example, when k = 9 its precision, recall, and NDCG are 0.1162, 0.5453, and 0.4070 on Yelp, 0.1357, 0.5852, and 0.4441 on MovieLens, and 0.1574, 0.5454, and 0.3990 on Last.fm. The proposed algorithm models the user's historical behavior sequence and therefore better understands the user's behavioral preferences and the evolution of their interests. In addition, it uses reinforcement learning for recommendation decisions, so the recommendation strategy can be dynamically adjusted according to user feedback and environmental changes to deliver a personalized recommendation service. A comprehensive comparison with the algorithms above shows the superiority and advancement of the model.
The main purpose of this study is to recommend personalized English vocabulary learning exercises for learners. After the functional requirements for learners have been designed and implemented, the English vocabulary exercise recommendation system should be tested for its application effect, and the problems revealed by the test should be corrected to further improve the system. To verify whether the system improves learners' vocabulary learning efficiency and recommends appropriate vocabulary exercise resources according to learners' needs, the application effect of the system is verified through a questionnaire survey.
The questionnaire survey was conducted with non-English majors in their first and second years at a teacher training university, and feedback was collected after the learners had used the system. The questionnaire on the effect of using the English vocabulary exercise recommendation system has ten questions in total, and the system's evaluation indexes are designed from four dimensions: satisfaction with the system, learning attitudes, learning results, and the strengths and weaknesses of the system. A total of 60 questionnaires were distributed, of which 50 valid responses were returned.
Questions 1, 2, and 3 of the questionnaire analyze the learners' satisfaction with the system, mainly from three aspects: whether learners support the use of the learner competence division method, whether the system's resources are comprehensively designed, and whether the recommended English vocabulary exercises meet their personalized needs. The results of the satisfaction survey are shown in Figure 10, with numbers 1, 2, and 3 on the z-axis corresponding to questions 1, 2, and 3, respectively.

Figure 10: System satisfaction survey
Among the learners, 60% strongly support and 26% support the use of the learner competency division method; most learners thus support, to varying degrees, analyzing learners' competence profiles in this way and recommending vocabulary exercise resources for them. 64% of the learners think the system's resources are very comprehensive and 20% think they are fairly comprehensive; overall, most learners consider the vocabulary exercise resources comprehensive, while a few think they could be supplemented. Combined with the tenth question, on the shortcomings of the system, some learners point out that the vocabulary resources could include English explanations of the words; analyzing this feedback can guide further improvement of the system. At the same time, 68% and 20% of the learners think, to different degrees, that the vocabulary recommendations meet their needs, which indicates that the system's recommendation function can basically satisfy learners' personalized vocabulary needs, has a good recommendation effect, and can assist teachers in teaching. Overall, most learners are satisfied with the system to varying degrees, and the satisfaction survey shows that the English vocabulary exercise recommendation system is effective to a certain extent.
Questions 4, 5, and 6 of the questionnaire analyze the learners' attitudes toward learning after using the English Vocabulary Exercise Recommendation Assisted Teaching System, mainly in terms of whether the system is a useful tool for assisting vocabulary learning, whether it increases learners' interest in vocabulary learning, and whether it improves their motivation to learn vocabulary. The results of the learning attitude survey are shown in Figure 11. 57.69% and 25% of the learners think, to varying degrees, that the system is a good tool for assisting vocabulary learning, while 11.54% think it is not very useful; overall, the system helps learners learn vocabulary to a certain degree by developing a personalized vocabulary learning sequence for them, which has a positive effect on CET-4 vocabulary learning. 63.46% of the learners strongly agree and 21.15% agree that the system can increase their interest in learning English vocabulary, while a minority think it cannot. 61.54% of the learners strongly agree and 21.15% agree that the system can improve their motivation to learn vocabulary, and a few think it does not improve their motivation to a high degree. In general, the system has a good positive effect on English vocabulary learning.

Figure 11: Learning attitude survey
Questions 7 and 8 of the questionnaire analyze the learning effect of the English Vocabulary Exercise Recommendation Assisted Teaching System, mainly in terms of whether the personalized exercise recommendations help learners build an overall structure among vocabulary items and whether the system improves the efficiency of learning English vocabulary; the results are shown in Figure 12. The survey shows that 70% and 24% of the learners think, to varying degrees, that the system helps them establish an overall vocabulary structure, which aids memorization to a certain extent. 50% of the learners strongly agree and 38% basically agree that the system improves the efficiency of learning English vocabulary, while a few learners think it has little or no effect. Overall, the learners recognize that the system can help them learn vocabulary to a certain extent.

Figure 12: Learning results survey
Questions 9 and 10 of the questionnaire investigated what the learners felt were the strengths and weaknesses of the system: question 9 asked about its strengths and question 10 about its weaknesses. The results are shown in Table 5, which lists the advantages and shortcomings mentioned most often by the learners. Overall, the system is a good tool for assisting teachers in teaching English vocabulary to students. Regarding the shortcomings the learners pointed out, the system's vocabulary resources could be supplemented and the number of practice questions could be increased to help learners master vocabulary usage better.
Table 5: System advantages and disadvantages survey

| Question number | Part of the learners' feedback |
|---|---|
| 9 | The system is more flexible than the CET-4 vocabulary learning software on the market |
| | The system can analyze my vocabulary level and recommend vocabulary for me |
| | The system can recommend vocabulary resources that I am interested in |
| | The system improves my motivation to learn CET-4 vocabulary |
| | The system improves my efficiency in learning CET-4 vocabulary |
| 10 | The vocabulary resources could include English explanations of the words |
| | The number of practice questions could be increased |
The English vocabulary exercise recommendation assisted teaching system based on the reinforcement learning model provides strong support for optimizing the intelligent English vocabulary teaching path, but to achieve deeper optimization the teaching path strategy can be further improved and supplemented in the following respects.
Precisely positioning the teaching objectives aims to set appropriate guidelines so that teaching activities can be carried out in an orderly manner. On the basis of a comprehensive analysis of students' cognitive level, emotional attitude, and learning status, each student's learning behavior is data-mined to form a system of learning-element indicators with individual characteristics; by comparing these indicators with the teaching objectives, the gaps can be found and the objectives optimized, making the teaching objectives both personalized and holistic.
Optimizing the teaching content means planning content suited to students' personalized development needs on the basis of the behavioral characteristics shown in their learning process. Based on each learner's personal learning portrait, the student's personalized learning data are analyzed and information on behavioral characteristics, teaching evaluation, and emotional attitudes is synthesized. Breaking through the limitations of paper-based teaching materials and making full use of the massive teaching resources provided by big data, highly relevant teaching knowledge is selected, the form and scope of the teaching content are enriched, and the materials are classified, summarized, and reconstructed to form a teaching content plan that meets the teaching objectives and learning conditions and provides students with learning-method knowledge and emotional cultivation programs tailored to their needs.
Intensive teaching feedback analyzes students' strengths and weaknesses in terms of the behaviors, states, emotions, and other characteristics demonstrated in the learning process, so as to provide personalized guidance to each student; such timely instructional feedback helps improve learning outcomes. Current personalized learning still only classifies learners in a simple way and then correlates learning patterns to form personalized recommendations. By building a personalized learning system based on reinforcement learning, an efficient and accurate personalized learning decision-making model can be formed through sample training on students' learning behaviors, and the best personalized learning strategies can then be formulated on the basis of the learning environment, learning behaviors, and learning status.
This paper constructs an English vocabulary exercise recommendation assisted teaching system based on a reinforcement learning model, which achieves good results in optimizing the intelligent English vocabulary teaching path.
The stability of the model is examined through the iterative stability of the mean and variance of the evaluation metric ERR for single and multiple users. During the iterations, both the mean and variance of ERR decrease: for a single user the mean and variance of ERR per 200 rounds stabilize at 0.0041 and 0.0040, and for all users they stabilize at 0.01 and 0.03, respectively. This illustrates the validity of the model and its good decision stability.
The model was also evaluated on different datasets, including MovieLens, Yelp, and Last.fm. Compared with other models, the Precision@3 of the proposed algorithm on the Yelp dataset is 0.1159, 20.60% higher than that of the next-best algorithm, whereas the BPR and Dijkstra algorithms perform poorly, with NDCG@5 values of 0.3452 and 0.1961 on the Last.fm dataset. The experimental results show that, compared with the baselines, the proposed method is better at capturing users' dynamically changing preferences and can inform practical application scenarios of English vocabulary exercise recommendation assisted teaching systems.
The system in this paper can better meet students' personalized English vocabulary learning needs and received high satisfaction ratings from students, as well as good evaluations of learning attitude and learning effect. More than 80% of the students gave positive evaluations of satisfaction, learning attitude, and learning effect, and considered the system helpful for their English vocabulary learning, indicating that the system provides a scientific and effective method for optimizing teaching.
