
Design of Intelligent Online Education Resource Optimization and Scheduling Strategies Based on Deep Reinforcement Learning


Introduction

With the rapid development of the modern economy and high technology, modern technology has become an indispensable part of social life [1-2]. The development of the times has placed higher demands on education [3]. At present, the “Internet + education” model has quietly become a new way forward for education [4]. During the COVID-19 pandemic in particular, online classrooms were opened simultaneously on a large scale, and the “Internet + education” model became the main force of education [5-6]. “Internet + education” is not simply the addition of the Internet to traditional education; rather, it uses Internet platforms and information technology to integrate the Internet deeply into the education process, giving full play to the Internet’s role in optimizing the allocation of educational resources and in wide dissemination, enhancing the speed and quality of educational development, and forming a new pattern of educational development with the Internet as its infrastructure and tool [7-10]. However, in its current state of development, the “Internet + education” model still faces major problems. These problems are roadblocks to its continued development and urgently require attention and solutions [11-13].

For example, the coverage of course resources is not sufficient to meet the needs of all learners, the teaching concept behind platform design is unclear or missing, and the content of course resources needs to be enriched [14-15]. Featured resources and courses are insufficient, and it is difficult for some learners to find course resources that match their learning needs [16-17]. The content of some course resources is not rich enough: the introductions, learning materials, examinations, evaluations, and question-and-answer components are not comprehensive enough to fully meet learners’ self-study needs [18-19]. The interactivity of some course resources is insufficient, and a gap remains compared with offline teaching [20]. Practical resources are relatively scarce, and the combination of theory and practice needs to be improved [21]. Teachers’ information technology literacy and online learning design capabilities are insufficient, and teachers need to intensify the development and construction of curriculum resources based on platform technology [22]. Some schools do not provide enough training for teachers, and the level of teachers varies. Research also finds that some teachers encounter many problems when teaching online, which leaves them struggling to cope during teaching and unable to handle situations with ease, thereby affecting the effectiveness of teaching and learning [23-24]. Understanding the current development of the “Internet + education” model, paying attention to its deficiencies and urgent problems, discovering the real reasons behind these problems, and exploring and studying solutions to them will put forward a new research direction for the development of the Internet and open up a new path for education [25-26].

Internet online education technology has brought profound reform and transformation to the field of education, and research on Internet online education has received heated attention. The literature [27] envisioned a strategy for transitioning from the traditional learning mode to the online teaching mode, confirmed through practice and surveys that the strategy played a positive role, and pointed out that the obstacles in the online teaching mode include teaching and learning interaction and methods for stimulating students’ motivation to participate in learning. Literature [28] combines quantitative and qualitative analytical research methods to describe the mechanism and underlying logic of the online teaching model’s operation, and argues that a bridge needs to be built between online teaching management and teaching processes to cope with sustained academic disruption. Based on a descriptive questionnaire research methodology, literature [29] explains the low level of application of online teaching aids and finds that the gender of the instructor significantly affects the use of online learning assessment tools. Literature [30] describes the concept of an online teaching CoP, builds a link between the concept and an integrated framework of technology-involved learning activities, and reflects on the future development of online teaching and learning.

Studies on the optimization of Internet online education resources, personalized teaching and teaching resource recommendation include the following. Literature [31] analyzed the characteristics of the online education model and its development experience in relation to learners’ age, quality and teaching needs, and proposed optimizing online education resources to provide learners with more professional continuing professional education. Literature [32] defines and designs a set of rules translated into a framework to support the construction of e-learning models, and conducts tests and studies in LMS Moodle to assess the feasibility of the proposed ideas, making a positive contribution to the development and innovation of e-learning models. Literature [33] developed a personalized cloud online teaching recommendation model capable of mining users’ learning preferences, and then recommending and optimizing personalized learning resources adapted to the learner so as to meet the learner’s personalized learning needs. Literature [34] demonstrates the importance of optimizing educational resource allocation, argues for the urgency of innovating optimization paths for educational resource allocation against the background of big data, and proposes an ideological teaching resource model with an improved collaborative filtering algorithm as its core logic, which alleviates the user cold-start problem and realizes personalized teaching resource recommendation. Literature [35] proposed a leapfrog evolutionary ant colony optimization (SEACO) algorithm suitable for online personalized learning path recommendation, which enables accurate learning based on learners’ continuously evolving personalized needs and improves the effect and quality of learning. Literature [36] explores teachers’ behaviors in online teaching models and uncovers the challenges and barriers faced by teachers in the online teaching process, improving teachers’ digital literacy as well as the teacher community’s knowledge of online teaching technologies.

On the basis of existing deep reinforcement learning methods, this paper focuses on how to introduce deep reinforcement learning into the recommendation of online education resources and constructs an online education resource recommendation model based on deep reinforcement learning, with a view to realizing intelligent online education resource optimization. The resource recommendation process is modeled as a Markov decision process covering four parts: state space, action space, state transition and reward function. The learner is treated as the agent of the reinforcement learning model, a reward function combining soft and hard rewards is selected as the reward mechanism, and the maximum cumulative reward is obtained through interaction with the environment. The optimization of the model parameters is accomplished by having the policy network and the value network share the feature layer. The policy neural network is continuously trained and optimized to obtain k paths and their corresponding reward values, the paths are then ranked according to their rewards, and the corresponding resource items are recommended to the learners. The online education resource model proposed in this paper is tested for its performance and applied to the recommendation of book resources in digital libraries to explore its practical application effect.

A Deep Reinforcement Learning-based Recommendation Model for Online Educational Resources

In today’s society, with the continuous development of the Internet, the number of intelligent online educational resources is rising exponentially. Faced with this huge number of online educational resources, it is becoming increasingly difficult for users to find the educational resources that suit them best. How to accurately recommend the most suitable online educational resources to users is therefore a pressing issue, and the effective recommendation and scheduling of intelligent online educational resources has become the main direction of intelligent online educational resource optimization.

In this paper, building on deep reinforcement learning methods, we construct an online education resource recommendation model based on deep reinforcement learning as the scheduling strategy for online education resources, so as to realize accurate recommendation of online education resources to learners and promote the optimization of online education resources.

Deep reinforcement learning
Analysis of reinforcement learning theory

Reinforcement learning studies sequential decision-making problems, whose idealized mathematical form is the Markov Decision Process (MDP). By using the Bellman equation to formally express the value function $V_\pi(s)$ and the action-value function $Q_\pi(s, a)$ in reinforcement learning, problem solving is transformed into the process of optimizing the Bellman equation [37]. This approach greatly reduces the complexity of solving and learning.

Over a sequence of discrete time points, the agent learns a policy $\pi: s \to a$ by interacting with the environment; the policy can be either deterministic, $a = \pi(s)$, or stochastic, $\pi(a \mid s) = \pi(s, a)$. At moment t, the agent chooses an action $A_t \in A(S_t)$ based on the received environment state $S_t$, receives the corresponding reward $R_{t+1}$, and the environment shifts to a new state $S_{t+1}$. This process gradually constitutes an iterative sequence as shown in equation (1):
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$$

The purpose of reinforcement learning is to find the optimal policy $\pi^*$ that maximizes the cumulative reward, and there are three solution methods: dynamic programming (DP), Monte Carlo (MC) and temporal difference (TD). During the interaction with the environment, the cumulative reward $G_t$ that the agent can obtain from all future actions performed from moment t onward is:
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}, \quad 0 \le \gamma \le 1$$
where the subscript t is the number of time steps and γ is the discount factor. The value function based on the Bellman equation is divided into the state-value function $v_\pi(s)$ and the action-value function $q_\pi(s, a)$. The state-value function is defined as:
$$v_\pi(s) = \mathbb{E}\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_s^a + \gamma v_\pi(S_{t+1}) \mid S_t = s\right] = \sum_{a \in A} \pi(a \mid s)\left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s')\right)$$
where $P_{ss'}^a = P(S_{t+1} = s' \mid S_t = s, A_t = a)$ is the state transition probability matrix. $v_\pi(s)$ denotes the expected cumulative reward from moment t to the end, and evaluates how good or bad state s is. The action-value function is defined as:
$$q_\pi(s, a) = \mathbb{E}\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]$$
where the action-value function $q_\pi(s, a)$ represents the expected cumulative reward obtainable after taking action a in state s, and is used to determine the probability of taking this action.
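As a concrete illustration of the discounted return $G_t$ above, the following is a minimal Python sketch; the reward sequence and discount factor are hypothetical values, not taken from this paper.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    # fold the rewards in backwards so each step applies the discount exactly once
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# hypothetical episode rewards r_1, r_2, r_3
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81 = 0.9^2 * 1.0
```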

The state-value function and the action-value function are inter-transformable:
$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a')$$

Reinforcement learning determines the optimal policy $\pi^*$ by finding the optimal state-value function $v_{\pi^*}(s)$ or the optimal action-value function $q_{\pi^*}(s, a)$ that determine the cumulative reward G, formulated as follows:
$$v_{\pi^*}(s) = \max_a q_{\pi^*}(s, a) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_{\pi^*}(s')\right]$$
$$q_{\pi^*}(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_{\pi^*}(s', a')\right]$$

Of the three main solution methods, the dynamic programming (DP) method is a model-based method whose distinctive feature is that the state transition probabilities are known in advance and all successor states s′ of the current state s can be accessed directly, so that state values can be updated according to the Bellman equation. This method computes the value function from the known state transition probabilities and the action set as if from a “God’s-eye view”. However, since it requires prior knowledge of the complete environment model, its application to practical problems is limited, especially when the environment model is unknown or difficult to model accurately. Its value function update formula is:
$$V(s) \leftarrow \mathbb{E}_\pi\left[r + \gamma V(s')\right] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]$$

The Monte Carlo (MC) method is a model-free method whose main idea is to approximate the true value of a state through a large number of samples. Compared with the dynamic programming method, the MC method learns slowly and inefficiently, which limits its applicability. Its value function update formula is as follows:
$$V(s) \leftarrow V(s) + \alpha\left(G_t - V(s)\right)$$
where α ∈ (0,1) is the learning rate.

In this paper, we mainly use the temporal difference method for the iterative computation of Bellman equation values. The temporal difference method mainly comprises the Sarsa algorithm and the Q-learning algorithm, both of which output the optimal policy $\pi^*$ by estimating the action-value function $q_\pi(s, a)$ instead of the state-value function $v_\pi(s)$.

The Sarsa algorithm uses the same policy when updating the action-value function as when executing actions, so it is an on-policy method. Its update formula is shown below, where the bracketed difference is the TD error:
$$q(s, a) \leftarrow q(s, a) + \alpha\big[\underbrace{r + \gamma q(s', a')}_{\text{TD Target}} - q(s, a)\big]$$

In contrast, the Q-learning algorithm selects the maximum possible value when updating the action-value function, a process that is independent of the policy actually being executed, and it is therefore an off-policy method. Its update formula is shown below:
$$q(s, a) \leftarrow q(s, a) + \alpha\big[\underbrace{r + \gamma \max_{a'} q(s', a')}_{\text{TD Target}} - q(s, a)\big]$$
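For concreteness, here is a minimal tabular Python sketch of the Q-learning update above; the states, actions and transition used in the example are hypothetical placeholders rather than part of this paper’s model.

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: q(s,a) <- q(s,a) + alpha * [r + gamma * max_a' q(s',a') - q(s,a)]."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

Q = defaultdict(float)              # tabular action-value estimates, default 0
actions = ["left", "right"]
# one hypothetical transition (s=0, a="right", r=1.0, s'=1)
Q = q_learning_step(Q, 0, "right", 1.0, 1, actions)
```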

Deep reinforcement learning theory analysis

Deep reinforcement learning is a technique that combines deep learning and reinforcement learning, fusing the environment-perception capability of deep learning with the autonomous decision-making capability of reinforcement learning to handle complex problems that require a high degree of autonomous decision-making [38]. For example, it has become a mainstream trend to rely on deep reinforcement learning algorithms to solve the problem of how driverless cars can make optimal autonomous decisions in complex road environments.

Deep reinforcement learning based on value functions

The DQN algorithm combines deep learning with Q-learning. At its core, it uses a deep neural network to approximate the Q function, i.e., the expected payoff of taking an action in a given state. DQN solves the curse of dimensionality caused by the excessively large state and action spaces in traditional Q-learning.

The DQN algorithm fits the optimal action-value function $Q_{\pi^*}(s, a)$ through a neural network:
$$Q_{\pi^*}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q_{\pi^*}(s', a') \mid s, a\right]$$

where $\pi^*$ denotes the optimal policy. The DQN algorithm deals with discrete action spaces. In the neural network the value function is denoted as $Q(s, a; \theta)$; the core question is how to determine the weights θ that approximate the value function, and the gradient descent method is used to minimize the loss function and continuously adjust the network weights θ. The loss function is defined as:
$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(s, a)}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})}_{\text{Target Q-Network}} - \underbrace{Q(s, a; \theta_i)}_{\text{Q-Network}}\big)^2\Big]$$
where $\theta_{i-1}$ is the target network parameter used at the i-th iteration, $\rho(s, a)$ is the probability distribution of state-action pairs, and $\theta_i$ is the parameter of the Q-Network. The gradient descent method is used to find the gradient with respect to θ:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(s, a);\, s' \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$
where D is the pool of samples in the experience cache. The DQN input is the state s and the output is the value $Q_\pi(s, a)$ of the action-value function for each action, i.e., the Q values.
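As an illustration of how this loss is commonly computed in practice, here is a PyTorch-style sketch; the network sizes, batch layout and hyperparameters are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error: the target uses the older (frozen) parameters theta_{i-1}."""
    s, a, r, s_next, done = batch                                 # tensors sampled from the replay pool D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta_i)
    with torch.no_grad():                                         # no gradient through the target network
        q_next = target_net(s_next).max(dim=1).values             # max_a' Q(s', a'; theta_{i-1})
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q_sa, target)

# hypothetical Q-network over a 4-dimensional state and 3 discrete actions
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
target_net.load_state_dict(q_net.state_dict())                    # periodically copied from q_net
```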

Policy-based deep reinforcement learning

Policy-based deep reinforcement learning methods can overcome some of the limitations of value-based methods by directly modeling the policy $\pi_\theta(s, a) = p(a \mid s, \theta)$, which allows actions a to be generated directly from a given state s. For a continuous action space, this method can assume that the actions follow a certain distribution, such as the commonly used Gaussian distribution, and thus outputs the expected action $a_\mu$ based on the state s; the action is then selected as $a \sim N(a_\mu, \sigma)$.

Policy-based deep reinforcement learning maximizes long-term rewards by optimizing the policy directly. It is divided into the stochastic policy gradient (SPG) and the deterministic policy gradient (DPG), and the objective function of SPG is:
$$J(\pi_\theta) = \mathbb{E}_{s \sim \rho_0(\cdot),\, a \sim \pi_\theta(\cdot \mid s)}\left[Q_{\pi_\theta}(s, a)\right]$$

where $\pi_\theta$ is the stochastic policy and $\rho_0(\cdot)$ is the distribution over initial states. The gradient of the SPG objective is derived from Eq. (17) as:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_{\pi_\theta}(s, a)\right]$$
For the deterministic policy function $\mu_\theta(s)$, the objective function of DPG is:
$$J(\mu_\theta) = \mathbb{E}_{s \sim \rho_0(\cdot)}\left[Q_{\mu_\theta}\left(s, \mu_\theta(s)\right)\right]$$

The gradient of the DPG objective is derived from equation (19) as:
$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho_\mu(\cdot)}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q_{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\right]$$

The Critic corresponds to policy evaluation: it estimates the action-value function and updates the action-value function parameter ω with linear TD(0):
$$Q_\omega(s, a) \approx Q_{\pi_\theta}(s, a)$$

The Actor updates the policy parameters θ using the policy gradient method in the direction indicated by the Critic. The Actor-Critic algorithm uses an approximate policy gradient, given by the following formula:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, Q_\omega(s, a)\right]$$
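To make the Actor-Critic interplay concrete, the following is a minimal single-transition PyTorch-style sketch in which the Critic’s TD error stands in for $Q_\omega(s, a)$; the network sizes, learning rate and transition are illustrative assumptions.

```python
import torch
import torch.nn as nn

# hypothetical actor and critic over a 4-dimensional state and 3 discrete actions
actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_step(s, a, r, s_next, gamma=0.99):
    """One update: the critic evaluates the state, the actor follows the policy gradient."""
    td_error = r + gamma * critic(s_next).detach() - critic(s)   # critic's evaluation signal
    log_prob = torch.log_softmax(actor(s), dim=-1)[a]            # log pi_theta(a|s)
    actor_loss = -(log_prob * td_error.detach())                 # ascend grad log pi * advantage
    critic_loss = td_error.pow(2)                                # regress the value estimate
    opt.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    opt.step()

actor_critic_step(torch.randn(4), a=1, r=1.0, s_next=torch.randn(4))
```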

Intelligent online education resource recommendation model

Reinforcement learning can be categorized into value-based and policy-based methods according to whether the value function needs to be learned. The Q-learning model is the classic value-based reinforcement learning model, and DQN integrates deep learning into the Q-learning algorithm; it can effectively solve recommendation tasks with discrete states but lacks the ability to handle paths. A policy-based reinforcement learning model can effectively solve recommendation tasks with continuous states by sampling learning paths in the educational resource recommendation task and continuously updating the policy during training. Therefore, this model adopts the policy-based approach to construct the reinforcement learning model.

Markov decision process design

This model transforms the resource recommendation problem into a Markov decision process, in which the state transitions and action selections of the agent take place in a knowledge graph environment [39]. The process consists of four parts: state space, action space, state transition and reward function.

State space

In this model, the learner is treated as the agent of the reinforcement learning model, and the intelligent education knowledge graph provides the training environment for the reinforcement model, which is trained through continuous interaction between the agent and the environment. The state $S_t$ represents the searching state of the agent in the knowledge graph at step t, defined by the triple $(u, i_t, h_t)$, where u is the starting learner, $i_t$ is the entity reached by the agent at step t, and $h_t$ is the history before step t, i.e., the collection of entities and relations searched in the knowledge graph in the first t steps. The formula for $S_t$ is as follows:
$$S_t = \left[\mathrm{TransE}(u, i_t, h_t),\ \mathrm{DeepWalk}(u, i_t, h_t)\right]$$

$\mathrm{TransE}(u, i_t, h_t)$ and $\mathrm{DeepWalk}(u, i_t, h_t)$ denote the vector representations of the learner u, the intelligent educational resource entity i and the search history h obtained by training on the triple $(u, i_t, h_t)$ with the TransE model and the DeepWalk model, respectively [40-41]. The initial state of the agent-environment interaction is set to $S_0 = (u, u, \varnothing)$, and given the search length T, the final state is $S_T = (u, i_T, h_T)$.
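The paper does not spell out how the history component is vectorized, so the sketch below simply mean-pools the history embeddings; the embedding dictionaries, dimensions and entity names are hypothetical.

```python
import numpy as np

def state_vector(transe_emb, deepwalk_emb, u, i_t, history):
    """S_t = [TransE(u, i_t, h_t), DeepWalk(u, i_t, h_t)]: concatenate the two
    embedding views of the learner, the current entity and the search history."""
    def view(emb):
        h = np.mean([emb[e] for e in history], axis=0) if history else np.zeros_like(emb[u])
        return np.concatenate([emb[u], emb[i_t], h])      # learner, current entity, pooled history
    return np.concatenate([view(transe_emb), view(deepwalk_emb)])

# hypothetical 8-dimensional embeddings keyed by entity id
rng = np.random.default_rng(0)
entities = ["u1", "course_a", "concept_x"]
transe = {e: rng.normal(size=8) for e in entities}
deepwalk = {e: rng.normal(size=8) for e in entities}
s_t = state_vector(transe, deepwalk, u="u1", i_t="course_a", history=["concept_x"])
```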

Action space

In the ERKD model, all entities $i\ (i \in I)$ and relations $r\ (r \in R)$ associated with the smart education resource item $i_t$ form the full action space $A_t$ of the resource item $i_t$ in state $S_t$, defined as shown in equation (23). Resources that learners have already learned are no longer considered, so $i \notin \{i_0, \cdots, i_{t-1}\}$ indicates that the full action space does not include the entities i, and their corresponding relations r, that are already in $h_t$:
$$A_t = \left\{(r, i) \mid (i_t, r, i),\ i \notin \{i_0, \cdots, i_{t-1}\}\right\}$$

Because the out-degree distribution has an obvious long-tail effect, certain nodes have a large out-degree, which wastes resources in the full action space. Therefore, a scoring function is used to compute a score for each edge connected to a node in the graph, and the important nodes are retained according to these scores [42]. The scoring function is denoted by $f((r, i) \mid u)$ and is defined in Equation (25). Given user u, a score is computed for each of its edges by the scoring function, the parameter α sets an upper bound on the number of actions, and the top-α actions are selected according to their scores. The candidate action set $\tilde{A}_u$ is:
$$\tilde{A}_u = \left\{(r, i) \mid \mathrm{rank}\left(f((r, i) \mid u)\right) \le \alpha,\ (r, i) \in A_t\right\}$$
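A minimal sketch of this top-α pruning is given below; the scoring function, α value and action tuples are illustrative stand-ins, since the actual $f((r, i) \mid u)$ is the reward-style score defined in equation (25).

```python
def prune_actions(full_actions, score_fn, user, alpha=250):
    """Keep only the top-alpha (relation, entity) pairs ranked by f((r, i) | u),
    bounding the action space for nodes with a large out-degree."""
    ranked = sorted(full_actions, key=lambda ri: score_fn(ri, user), reverse=True)
    return ranked[:alpha]

# hypothetical scoring function and unpruned action set for one node
score_fn = lambda ri, u: (hash((ri, u)) % 100) / 100.0   # stand-in for f((r, i) | u)
actions = [("teaches", "concept_x"), ("belongs_to", "course_a"), ("similar_to", "course_b")]
candidates = prune_actions(actions, score_fn, user="u1", alpha=2)
```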

Reward function

In order to optimize the objectives of the model’s decision-making process, the ERKD model introduces a reward mechanism, which evaluates the recommended sequences through the scoring function and feeds the results back to the agent. In the Markov decision process, the environment gives corresponding rewards according to the action choices made by the agent, and the agent updates its strategy according to the rewards fed back from the environment in order to maximize the reward.

The intelligent educational resource recommendation process needs to produce different impacts as the learner’s learning progresses, so the agent must make full use of the heterogeneous information in the network to explore as many paths as possible while also taking into account the reasonableness of the recommendation. Since reinforcement learning methods suffer from sparse rewards, this paper chooses a reward function that combines soft and hard rewards, defined as follows:
$$f_T((r, i) \mid u) = \begin{cases} \max\left(0,\ \dfrac{\mathrm{InP}_{\mathrm{TransE}}(u, i_T) + \mathrm{InP}_{\mathrm{DeepWalk}}(u, i_T)}{\max_{i \in I}\left(\mathrm{InP}_{\mathrm{TransE}}(u, i) + \mathrm{InP}_{\mathrm{DeepWalk}}(u, i)\right)}\right), & \text{if } i_T \in I \\ 0, & \text{if } i_T \notin I \end{cases}$$

During model training, in order to ensure that the recommendation process considers all relevant resources comprehensively, the agent needs to explore as many optimal paths as possible. The optimal path is designed to find, for the learner, a resource item that needs to be learned and that the learner has not yet learned, by passing through other items that the learner has already mastered. Therefore, in the scoring calculation only the state of the final item (i.e., at moment T) is scored. As shown in Equation (25), $f_T((r, i) \mid u)$ represents the score of the state at moment T, where $\mathrm{InP}_{\mathrm{TransE}}(u, i)$ is the inner product of the vectors u and i generated by the TransE model, and $\mathrm{InP}_{\mathrm{DeepWalk}}(u, i)$ is the inner product of the vectors u and i generated by the DeepWalk model.
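A minimal sketch of this terminal reward, assuming the same kind of embedding dictionaries as in the state example above; the argument layout is hypothetical.

```python
import numpy as np

def terminal_reward(u, i_T, transe, deepwalk, items):
    """Reward of equation (25): 0 if the final entity i_T is not a resource item
    (hard reward); otherwise the TransE + DeepWalk inner products with the learner,
    normalized by the best-scoring item for this learner (soft reward)."""
    if i_T not in items:
        return 0.0
    score = np.dot(transe[u], transe[i_T]) + np.dot(deepwalk[u], deepwalk[i_T])
    best = max(np.dot(transe[u], transe[i]) + np.dot(deepwalk[u], deepwalk[i]) for i in items)
    return max(0.0, score / best)
```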

State Transition

In the ERKD model, given state $S_t$ and action $A_t$, the relations related to the current entity are stored in the action space $A_t$; the optimal action is selected from it and executed, transitioning to the next state $S_{t+1}$ determined by $(S_t, A_t)$, as shown in equation (26):
$$P\left\{S_{t+1} = (u, i_{t+1}, h_{t+1}) \mid S_t = (u, i_t, h_t),\ A_t = (r_{t+1}, i_{t+1})\right\} = 1$$

The goal of the ERKD model’s MDP is to learn a policy π such that the learner u maximizes the cumulative reward $J(\theta)$ during interaction with the environment, computed as:
$$J(\theta) = \mathbb{E}_\pi\left[\sum_{t=0}^{T-1} \gamma^t f_{t+1} \,\middle|\, s_0 = (u, u, \varnothing)\right]$$

Policy Neural Networks

The intelligent education knowledge graph contains a large amount of learner information, intelligent education resource information and learner-resource interaction information. After MDP modeling, the action space $A_t$ contained in the environment is large, so this paper chooses a deep reinforcement learning model based on policy optimization. The feature layer in the ERKD model is shared by the policy network and the value network.

The policy network is denoted as $\Pi(\cdot \mid s, \tilde{A}(u))$ in the model. Its inputs are the state vector s and the candidate action vector $\tilde{A}(u)$, and it outputs the probability of each action; the probability is set to 0 if the action does not belong to $\tilde{A}(u)$.

The value network $\hat{v}(s)$ is responsible for optimizing the policy network: it maps the state vector s to a real number as an estimate of the value of the current state. In this paper, the state vector $S_t$ is fed into a fully connected network composed of two hidden layers, and the output layer applies a softmax for normalization. The policy network and value network are defined as follows:
$$x = \mathrm{dropout}\left(\sigma\left(\mathrm{dropout}\left(\sigma\left(S_{\mathrm{TransE}} W_1\right)\right) W_2\right)\right) + \mathrm{dropout}\left(\sigma\left(\mathrm{dropout}\left(\sigma\left(S_{\mathrm{DeepWalk}} W_3\right)\right) W_4\right)\right)$$
$$\Pi(\cdot \mid s, \tilde{A}(u)) = \mathrm{softmax}\left(\tilde{A}(u) \odot \left(x W_p\right)\right)$$
$$\hat{v}(s) = x W_v$$

The policy network maps the state vector $S_t$ to probabilities over the candidate actions $\tilde{A}(u)$, and each parameter in the model is then optimized. The set of network parameters $\Theta = \{W_1, W_2, W_3, W_4, W_p, W_v\}$ is first defined, and the policy gradient formula is shown in (31):
$$\nabla_\Theta J(\Theta) = \mathbb{E}_\pi\left[\nabla_\Theta \log \Pi(\cdot \mid s, \tilde{A}(u))\left(\mathrm{DCG}(s, s_t) - \hat{v}(s)\right)\right]$$
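A PyTorch-style sketch of the shared-feature policy/value network described above; the hidden sizes, activation, dropout rate and action-space bound are illustrative assumptions, and the binary mask stands in for the candidate set Ã(u).

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared feature layers with a masked-softmax policy head and a scalar value head."""
    def __init__(self, state_dim=64, hidden=128, action_dim=250):
        super().__init__()
        block = lambda: nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
                                      nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5))
        self.feat_transe, self.feat_deepwalk = block(), block()   # one branch per embedding view
        self.w_p = nn.Linear(hidden, action_dim)                  # policy head W_p
        self.w_v = nn.Linear(hidden, 1)                           # value head W_v

    def forward(self, s_transe, s_deepwalk, action_mask):
        x = self.feat_transe(s_transe) + self.feat_deepwalk(s_deepwalk)   # shared feature x
        logits = self.w_p(x).masked_fill(action_mask == 0, -1e9)          # mask out actions outside A~(u)
        return torch.softmax(logits, dim=-1), self.w_v(x).squeeze(-1)
```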

Path Reasoning and Resource Program Recommendations

The model uses the trained policy network to accomplish the recommendation task on the knowledge graph: given a learner u, the goal is to find a candidate set of recommended educational resources along with the corresponding inference paths. The search over candidate paths and recommended resources needs to be constrained, and the ERKD model uses action probabilities and rewards to guide and constrain the search. That is, given a learner and a maximum path length, the path search is performed according to the trained policy network, and paths are sampled multiple times in the knowledge graph to obtain k paths and their corresponding reward values; the paths are ranked according to their rewards, and the corresponding resource items are recommended to the learner.
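A minimal sketch of this sampling-and-ranking inference step follows; the `policy` and `env` interfaces (reset, candidate_actions, step, reward) are hypothetical wrappers around the trained policy network and the knowledge graph, not APIs defined in the paper.

```python
import numpy as np

def recommend(u, policy, env, k=20, max_len=3, samples=200, seed=0):
    """Sample paths on the knowledge graph with the trained policy, rank them by
    reward, and return the terminal items of the k best paths as recommendations."""
    rng = np.random.default_rng(seed)
    scored_paths = []
    for _ in range(samples):
        state, path = env.reset(u), []
        for _ in range(max_len):
            actions = env.candidate_actions(state)        # pruned candidate set A~(u)
            probs = policy(state, actions)                # probabilities from the policy network
            r_i = actions[rng.choice(len(actions), p=probs)]
            state = env.step(state, r_i)
            path.append(r_i)
        scored_paths.append((env.reward(state), path))
    scored_paths.sort(key=lambda x: x[0], reverse=True)   # rank paths by reward
    return [(reward, path[-1][1]) for reward, path in scored_paths[:k]]   # (reward, item)
```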

Online Educational Resources Recommendation Model Recommendation Performance Test

In this chapter, the recommendation performance of the online educational resource recommendation model based on deep reinforcement learning proposed in this paper will be tested, and the testing metrics used are as follows.

HR@k metric, which measures the percentage of correct recommendation items present in the educational resource recommendation list.

NDCG@k metric, which measures the correctness of the educational resource recommendation list and the influence of the ranking position of the correct items.
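For reference, here is a minimal Python sketch of how these two metrics are typically computed for a single held-out item; the recommendation list and item IDs are hypothetical.

```python
import numpy as np

def hr_at_k(ranked_items, held_out, k=10):
    """HR@k: 1 if the held-out item appears among the top-k recommendations, else 0."""
    return float(held_out in ranked_items[:k])

def ndcg_at_k(ranked_items, held_out, k=10):
    """NDCG@k for one held-out item: a hit is discounted by its rank position."""
    if held_out in ranked_items[:k]:
        pos = ranked_items.index(held_out)        # 0-based rank of the hit
        return 1.0 / np.log2(pos + 2)
    return 0.0

# hypothetical top-3 list and held-out ground-truth item
print(hr_at_k(["b3", "b7", "b1"], "b7"), ndcg_at_k(["b3", "b7", "b1"], "b7"))  # 1.0, ~0.63
```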

In this experiment, the two indicators HR@k and NDCG@k were used to measure the overall accuracy and ranking quality of the recommendation lists generated by the proposed model for intelligent online educational resources. The datasets used are MOOPer and MOOCCube: the former is a large-scale open practice dataset jointly released by the National University of Defense Technology of China and the online practice teaching platform Touge, and the latter is an open data repository jointly created by Tsinghua University and Xuetang Online, which collects real user behavior data from the Xuetang online education platform, including user interactions with learning videos, messages, and so on. The experimental results of the model on the MOOPer dataset with different numbers of graph convolution layers, measured by HR@10 and NDCG@10, are shown in Figure 1, where (a) and (b) correspond to the HR@10 and NDCG@10 results, respectively. As shown in the figure, when D1 and D2 range from 1 to 2, the performance of the recommendation model increases with the number of graph convolution layers, peaking when D1 and D2 are both 2, with HR@10 and NDCG@10 values of 0.9897 and 0.9563, respectively. When D1 and D2 range from 3 to 4, the results show a downward trend as the number of graph convolution layers increases. When the graph convolution layers D1 and D2 are both 2, the HR@10 and NDCG@10 indexes are 6.64% and 4.21% higher than when D1 and D2 are both 4.

Figure 1.

Experimental results of different GCN layers on MOOPer dataset

On the MOOCCube dataset, the experimental results of the model with different numbers of convolution layers on the HR@10 and NDCG@10 indicators are shown in Figure 2, with panels (a) and (b) showing the HR@10 and NDCG@10 results, respectively. In the figure, when D1 takes values from 1 to 2 and D2 from 1 to 3, the performance of the recommendation model increases with the number of graph convolution layers, peaking when D1 and D2 are 2 and 3, respectively, with HR@10 and NDCG@10 values of 0.9464 and 0.8738. When D1 and D2 take values from 3 to 4, the results show a downward trend as the number of convolution layers increases. In terms of HR@10 and NDCG@10, when the convolution layers D1 and D2 of the two modules are 2 and 3, respectively, the results are 5.02% and 3.99% higher than when D1 and D2 are both 4.

Figure 2.

Experimental results of different GCN layers on MOOCCube dataset

In general, in the policy neural network, when the numbers of graph convolution layers D1 (educational resource embedding module) and D2 (learner-educational resource interaction embedding module) are small, the overall resource recommendation performance of the proposed model improves as the number of graph convolution layers increases, and the model recommends online educational resources well.

Experiments on Optimized Scheduling of Online Educational Resources in Digital Libraries

This chapter will apply the online educational resource recommendation model based on deep reinforcement learning proposed in this paper to the recommendation of digital books, an online teaching resource in digital libraries, and explore the effect of the model in reality.

Data set setup

In this chapter, two real datasets, a school digital library dataset and the Goodbooks-10k dataset, are used to evaluate the proposed model. The statistics of the two datasets are shown in Table 1. The school digital library dataset contains 533,114 borrowing records. To ensure model quality, this chapter filters out users with fewer than 4 borrowing records, finally obtaining 518,806 borrowing records comprising 22,683 user borrowing sequences and 124,357 books. The other real dataset used in this work is the Goodbooks-10k dataset, a book recommendation dataset containing 912,608 borrowing records. After the same processing, 898,287 borrowing records remain, covering 10,000 books and 41,482 user borrowing sequences.

Table 1. Statistics for the datasets

                      School digital library data    Goodbooks-10k data
Borrowing records     518,806                        898,287
Users                 22,683                         41,482
Books                 124,357                        10,000
Training data         496,044                        856,772
Test data             22,581                         41,425

The school digital library data and the Goodbooks-10k dataset are preprocessed. Because some users borrowed only a small number of books, too few records to form a sequence, this chapter uses only the data of users with at least 4 borrowing records. The data distribution of the two datasets is shown in Figure 3, where the horizontal axis is the number of records per sequence and the vertical axis is the number of users. It can be observed that most sequences in the two datasets are shorter than 20, while the number of books is above 10,000, which also indicates the sparsity of the book data. In training and testing, this chapter uses the last item of each sequence as the test data and the remaining items as the training data.

Figure 3.

Data sequence distribution
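The preprocessing described above can be summarized in a short sketch; the (user, book, timestamp) record format is an assumption about how the raw borrowing logs are organized.

```python
from collections import defaultdict

def build_sequences(records, min_len=4):
    """Group borrowing records into per-user sequences ordered by time, drop users
    with fewer than min_len records, and hold out each user's last item for testing."""
    by_user = defaultdict(list)
    for user, book, timestamp in sorted(records, key=lambda r: r[2]):
        by_user[user].append(book)
    train, test = {}, {}
    for user, seq in by_user.items():
        if len(seq) < min_len:
            continue                                   # too few records to form a sequence
        train[user], test[user] = seq[:-1], seq[-1]    # leave-one-out split
    return train, test

# hypothetical (user, book, timestamp) borrowing records
records = [("u1", "b1", 1), ("u1", "b2", 2), ("u1", "b3", 3), ("u1", "b4", 4), ("u2", "b9", 1)]
train, test = build_sequences(records)                 # u2 is filtered out
```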

Experimental results and analysis

In order to demonstrate the superiority of the proposed model, the CF, FISM, NAIS, Light-GCN and HRL models are selected as comparison models in this experiment. This chapter again uses HR@k and NDCG@k as the evaluation metrics.

Comparative analysis of models

The experimental results on the school digital library dataset are shown in Table 2. From the table, it can be seen that the proposed model reaches 0.83 and 0.9222 on the HR@5 and HR@10 indexes, respectively, and 0.5901 and 0.6219 on the NDCG@5 and NDCG@10 indexes. Its prediction performance is better than that of the other comparison models, and it can better predict which books users like and recommend those books to them.

Table 2. Comparative experimental results on the school digital library data

                        HR@5     HR@10    NDCG@5   NDCG@10
CF                      0.4515   0.4873   0.2865   0.2728
FISM                    0.2355   0.3243   0.1781   0.2044
NAIS                    0.2143   0.2857   0.1598   0.1835
Light-GCN               0.4695   0.5895   0.3233   0.3759
HRL                     0.6505   0.7824   0.4713   0.5159
Model of this paper     0.83     0.9222   0.5901   0.6219

The experimental results of all models on the Goodbooks-10k dataset are shown in Table 3. CF, FISM, and NAIS adopt deep learning approaches, but data sparsity has a large impact on their training, leading to poorer final results on the HR@5, HR@10, NDCG@5, and NDCG@10 metrics. Light-GCN and HRL obtain slightly better results, with HRL relatively better, but still not as good as the proposed model. The proposed model reaches 0.4807 and 0.7023 on the HR@5 and HR@10 metrics, and 0.3689 and 0.4389 on the NDCG@5 and NDCG@10 metrics, respectively, outperforming the other models, which demonstrates its effectiveness.

Table 3. Comparative experimental results on the Goodbooks-10k data

                        HR@5     HR@10    NDCG@5   NDCG@10
CF                      0.4073   0.5755   0.1013   0.1212
FISM                    0.4019   0.5441   0.2783   0.325
NAIS                    0.3553   0.4989   0.2432   0.289
Light-GCN               0.4301   0.5935   0.2814   0.3467
HRL                     0.2152   0.3223   0.1638   0.1923
Model of this paper     0.4807   0.7023   0.3689   0.4389

Parametric analysis

This section carries out parameter experiments on the school digital library dataset and the Goodbooks-10k dataset, with the number of categories set to 1000, 2000, 5000, and 10000, respectively, to obtain the experimental results of the proposed model under different parameter settings, as shown in Table 4. From the table, it can be seen that as the number of categories increases to a certain point, the classification ability of the model improves. When the number of categories reaches 2000, the HR@5, HR@10, NDCG@5, and NDCG@10 results of the model are optimal, at 0.8294, 0.922, 0.5904, and 0.6204, respectively. As the number of categories continues to grow, the results for each indicator show a clear downward trend. When the number of categories reaches 10000, the HR@5, HR@10, NDCG@5, and NDCG@10 results decrease to 0.3457, 0.4063, 0.2347, and 0.2579. Obviously, setting the number of categories too high weakens the classification ability of the model, so the parameters should be set reasonably in practice in order to make the best use of the model’s online educational resource recommendation performance.

Table 4. Experimental results with different parameter settings

Category    HR@5     HR@10    NDCG@5   NDCG@10
10,000      0.3457   0.4063   0.2347   0.2579
5,000       0.652    0.7019   0.4959   0.5174
2,000       0.8294   0.922    0.5904   0.6204
1,000       0.8058   0.8973   0.5675   0.592

Conclusion

This paper proposes an online education resource recommendation model incorporating deep reinforcement learning to achieve the optimization of intelligent online education resources.

In order to test the resource recommendation performance of the proposed online educational resource recommendation model, a performance test was carried out with HR@k and NDCG@k as evaluation indicators. On the MOOPer dataset, the performance of the recommendation model peaked when D1 and D2 were both 2, with HR@10 and NDCG@10 values of 0.9897 and 0.9563, respectively; afterwards, as the number of graph convolution layers increased, the results showed a downward trend. When D1 and D2 were both 2, the HR@10 and NDCG@10 indexes were 6.64% and 4.21% higher than when D1 and D2 were both 4. On the MOOCCube dataset, performance was best when D1 and D2 were 2 and 3, respectively, with HR@10 and NDCG@10 values of 0.9464 and 0.8738, which were 5.02% and 3.99% higher than when D1 and D2 were both 4. When the values of D1 and D2 are small, the recommendation performance of the proposed model for online educational resources increases with the number of convolution layers, and the recommendation performance is good.

This paper applies the model to the scheduling of book education resources in a digital library and examines its practical application effect. Compared with the CF, FISM, NAIS, Light-GCN, and HRL models, the prediction performance of the proposed model is consistently better on the school digital library dataset, with 0.83 and 0.9222 on the HR@5 and HR@10 indicators and 0.5901 and 0.6219 on the NDCG@5 and NDCG@10 indicators, respectively. On the Goodbooks-10k dataset, the model still achieves the best HR@5, HR@10, NDCG@5, and NDCG@10 values, which are 0.4807, 0.7023, 0.3689, and 0.4389, respectively, strongly demonstrating its effectiveness. Parameter experiments were carried out with the number of categories set to 1000, 2000, 5000, and 10000 to obtain results under different parameter settings. When the number of categories is 2000, the HR@5, HR@10, NDCG@5, and NDCG@10 results are the best, at 0.8294, 0.922, 0.5904, and 0.6204, respectively. When the number of categories increases to 10,000, the HR@5, HR@10, NDCG@5, and NDCG@10 indicators show a significant downward trend, at 0.3457, 0.4063, 0.2347, and 0.2579, respectively. In practical application, reasonable parameter settings allow the online educational resource recommendation performance of this model to be maximized.