An Intelligent Classification Method for Online Resource Data of College Language Teaching Based on Deep Reinforcement Learning
Published online: 29 Sept 2025
Received: 28 Dec 2024
Accepted: 17 Apr 2025
DOI: https://doi.org/10.2478/amns-2025-1134
Keywords
© 2025 Jing Dong and Zhuyun Wang, published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the continuous development of science and technology and the reform and innovation of education, digital education has become a hot topic in today’s education world. In college and university language teaching, online resources provide students with a new way of learning and a broader learning space [1-4]. However, within such a huge body of digital teaching resources it is difficult for teachers and students to find what they need; classifying digital teaching resources is therefore of great significance for improving the efficiency and quality of teaching [5-7].
Digitization of teaching resources is the process of transforming traditional teaching resources (such as books, textbooks, and courseware) into digital form for storage, dissemination and utilization. This process includes not only scanning paper materials into electronic documents, but also converting non-text resources such as instructional videos, audio, and interactive simulation experiments into digital formats [8-11]. Through digital teaching resources, educators can break through the limitations of time and space and provide students with richer and more diverse learning content. The advantages of digital teaching resources lie in their convenience, accessibility and interactivity: students can access them anytime and anywhere via the Internet for independent learning and inquiry [12-15]. Teachers can also integrate teaching resources more easily and design more innovative and interactive teaching activities. Digital teaching resources can also support students with special needs, such as those with visual or hearing impairments [16-19].
With the continuous development of artificial intelligence applications, deep learning is increasingly used in many fields. Reinforcement learning, a machine learning method that learns optimal behavioral strategies by interacting with the environment, has been widely researched and applied in resource scheduling for data centers, terminal devices, cloud services, wireless networks, and other settings [20-23]. For classifying digital resources for language teaching in higher education, reinforcement learning can autonomously learn, optimize, and classify complex digital resources through interaction with the environment [24-25].
After a preliminary study of deep reinforcement learning, and based on the characteristics of online educational resources, this paper first performs feature extraction on online language teaching resources in colleges and universities, identifies the key features in the online resource library, and carries out a preliminary classification of the resources. Secondly, an intelligent classification model of college language online teaching resources based on DRML is constructed by optimizing the node combination problem of the original deep reinforcement learning algorithm. Taking text and image resources, which account for the largest proportion of college language online teaching resources, as experimental objects, we compare the text classification and image classification performance of the DRML model with that of other classification models to verify its effectiveness. Finally, users’ evaluations of the resource classification effect of the DRML model are collected through a questionnaire to explore the model’s practical application effect.
Deep reinforcement learning combines deep learning and reinforcement learning [26], and its main framework is shown in Fig. 1. Deep learning is mainly used for system perception, while reinforcement learning makes decisions based on that perception to achieve specific goals. The overall process of deep reinforcement learning is as follows: the agent first observes the environment to obtain its high-dimensional features and uses deep learning to perceive a specific state representation. It then evaluates the expected value of every action based on the expected return and selects an action according to a specific selection strategy (e.g., a greedy strategy). Finally, the action changes the environment, which enters a new state. By continuously cycling through these steps, the optimal policy is eventually learned.
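For readers who prefer code, this perceive-decide-act cycle can be sketched in a few lines of Python. The environment interface (Gym-style `reset`/`step`), the `agent` object, and its `estimate_action_values` and `learn` methods are illustrative assumptions, not components specified in this paper.

```python
import random

def drl_interaction_loop(env, agent, num_episodes=100, epsilon=0.1):
    """Minimal sketch of the DRL perceive-decide-act cycle (Gym-style env assumed)."""
    for _ in range(num_episodes):
        state = env.reset()                      # observe high-dimensional environment features
        done = False
        while not done:
            # Perception maps the observation to per-action value estimates,
            # then an epsilon-greedy policy selects the action.
            q_values = agent.estimate_action_values(state)
            if random.random() < epsilon:
                action = random.randrange(len(q_values))
            else:
                action = max(range(len(q_values)), key=lambda a: q_values[a])
            next_state, reward, done, _ = env.step(action)   # environment enters a new state
            agent.learn(state, action, reward, next_state, done)
            state = next_state
```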

DRL model framework
The DQN algorithm is a common deep reinforcement learning algorithm [27] that uses a deep neural network to replace the value function of traditional Q-learning. Because deep neural networks can learn high-dimensional abstract features from data, DQN can handle high-dimensional state spaces and produce a score for each action. In DQN, after the agent observes the state from the environment and performs an action, the environment returns a reward and a new state. The agent uses this information to update its estimate of the value function so that the estimate approaches the true value as closely as possible. However, because the agent’s interactions keep changing during reinforcement learning, updating the value-function estimate using only the most recent interaction history can make the estimate unstable and harm training. DQN therefore employs experience replay, caching previous interaction histories and replaying them at random. With the experience replay buffer, DQN can update the value-function estimate smoothly and alleviate overfitting to recent experience and the forgetting of earlier experience.
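A minimal PyTorch sketch of an experience replay buffer and one DQN update is given below; the buffer size, batch size, and other hyperparameters are assumptions for illustration and do not reproduce the exact configuration used in this paper.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Caches past interactions so that value updates use decorrelated mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        states, actions, rewards, next_states, dones = zip(*random.sample(self.buffer, batch_size))
        return (torch.stack(states), torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states), torch.tensor(dones, dtype=torch.float32))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One DQN step: move Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        best_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * best_next * (1.0 - dones)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```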
High-quality learning resources can improve learners’ learning efficiency, help online learning platforms establish trust with learners and teaching staff, and enhance the authority of online learning. They are also conducive to the joint maintenance of the online teaching environment by learners and teachers and promote a positive cycle of resource learning.
Online educational teaching resources are characterized by knowledge-intensive, rich text content and a clear logical structure, so most resource classification models focus on extracting semantic features from the text of the resources. However, false and misleading resources are highly concealed and inflammatory, which makes them difficult to identify from semantic features alone.
Misleading resources disguise themselves by imitating the structure and writing style of high-quality teaching resources and induce readers to spread them quickly through inflammatory remarks. Research has shown that inflammatory text content stimulates readers’ emotions such as fear, disgust, and shock, thereby increasing their willingness to spread and interact with it; emotional incitement is an important driver of the spread of false and misleading information. Misleading information usually uses expressions with strong emotional color and tendency, while genuine teaching resources tend to be objective and factual.
The review of online learning resources must ensure that the content is accurate, authoritative and in line with academic standards to prevent the spread of false and misleading information. Learning resources also need to comply with relevant national education laws and regulations to ensure compliance of online education resources.
Suppose that N homogeneous agents are placed in a static, unknown environment, and each agent’s position is randomly initialized at the beginning. At each time step, an agent collects a partial observation of the environment at its current position and processes the observation locally. The agent then chooses an action to change its position based on the observation. The common goal of the agents is to recognize the key features of the resource information from a limited number of categories and to classify the resources within a limited number of time steps. Under the centralized-training, decentralized-execution architecture, the observations of all agents are learned by a centralized training network. The specific process is shown in Fig. 2.

Multi-agent image feature extraction process
To satisfy the requirement of decentralized execution, each agent’s movement strategy can rely only on its own local information. The agents therefore need to learn to extract relevant features from partial observations and plan their travel paths in the environment rationally, so as to find the most valuable resource information and solve the classification problem reliably. The agent observation module is shown in Fig. 3.

Agent observation module
It is assumed that, at each time step, an agent obtains a partial observation of the environment at its current position and encodes it locally into a compact feature representation.
The current position of the agent is also useful information, so it is more efficient to learn it together with the agent’s locally observed features; the current position is handled by a learned function that maps it into the same feature space.
The trajectory in the environment is a sequence of observations, so a Long Short-Term Memory (LSTM) unit is used as the agent’s perceptual network to enable it to learn from long sequences. At each time step, an agent’s LSTM hidden state is updated from its previous hidden state and the current encoded observation, which describes how the perceptual network evolves over time.
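As an illustrative sketch only (not the paper’s actual network), the per-agent perception module described above can be written as a small PyTorch module that encodes the local observation and the current position, concatenates them, and feeds the result to an LSTM cell; all layer sizes and names are assumed.

```python
import torch
import torch.nn as nn

class AgentPerception(nn.Module):
    """Per-agent module: encode local observation and position, track history with an LSTM."""
    def __init__(self, obs_dim, pos_dim, hidden_dim=128):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.pos_encoder = nn.Linear(pos_dim, hidden_dim)     # function mapping of the position
        self.lstm = nn.LSTMCell(2 * hidden_dim, hidden_dim)

    def forward(self, obs, pos, state=None):
        # state = (h, c): this agent's previous LSTM hidden and cell states (None at t = 0)
        z = torch.cat([self.obs_encoder(obs), self.pos_encoder(pos)], dim=-1)
        h, c = self.lstm(z, state)
        return h, (h, c)
```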
The agents also need to use the knowledge they have learned to categorize the information: a prediction network takes the final joint hidden state of all agents and outputs a probability distribution over the candidate categories. The most probable category is then taken as the result.
Rather than reaching a consensus on categories, the agents combine the local perceptions of all agents into a single prediction.
The global reward of the environment is computed from the category prediction and the true label of the information.
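The joint prediction and global reward might be realized as in the following sketch; fusing the agents’ hidden states by concatenation and using a 0/1 correctness reward are assumptions made here for illustration, not the paper’s stated choices.

```python
import torch
import torch.nn as nn

class JointClassifier(nn.Module):
    """Fuse the final hidden states of all agents and predict a single category."""
    def __init__(self, hidden_dim, num_agents, num_classes):
        super().__init__()
        self.head = nn.Linear(hidden_dim * num_agents, num_classes)

    def forward(self, agent_hidden_states):      # list of [batch, hidden_dim] tensors
        joint = torch.cat(agent_hidden_states, dim=-1)
        return self.head(joint)                  # unnormalized category scores

def predict_and_reward(logits, true_label):
    """Take the most probable category and give all agents a shared global reward."""
    probs = torch.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1)
    reward = (pred == true_label).float()        # 1 if correct, 0 otherwise (assumed scheme)
    return pred, reward
```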
This section describes the generic DRL framework used to solve the node combination optimization problem to construct the DRML resource classification model. The specific information is shown below.
The deep Q-network is the core of the DRL method: a deep neural network analyzes the input features of the nodes and computes a score for each node [28]. The DQN is formulated as a state-action value function that estimates, for the current state, the expected cumulative reward of selecting each candidate node as the next action.
The Markov decision process is an important part of reinforcement learning [29]. Reinforcement learning aims to learn a function [30] that maps state-action pairs to a score, while the Markov decision process executes actions based on this function to obtain a reward and the next state. Each step of this process can be represented as a quadruple consisting of the current state, the action taken, the reward received, and the next state.
To obtain a higher cumulative reward, the Markov decision process uses a greedy strategy [31] that selects the action with the maximum estimated score in the current state. The score estimates are then updated toward the immediate reward plus the discounted maximum score of the next state, so that they gradually approach the true expected returns.
Through the DQN and the Markov decision process, the node combination optimization problem can be equivalently transformed into an optimization problem over the DQN, improving the model’s classification of teaching resources. Given a DQN, the solution state is expanded step by step: at each step the node with the highest estimated score is added to the current combination until a complete node combination is obtained.
During the training process, RL employs a wandering (exploration) strategy: with a small probability a random node is selected instead of the highest-scoring one, so that the state space is explored more fully and the model does not converge prematurely to a locally optimal combination.
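A rough sketch of greedy node selection with an exploration probability is shown below; the node-scoring network, feature shapes, and stopping condition are assumptions, since the paper’s exact formulation of the node combination problem is not reproduced here.

```python
import random
import torch

def select_node_sequence(q_net, node_features, num_steps, epsilon=0.1):
    """Greedily build a node combination from DQN scores, with epsilon exploration."""
    num_steps = min(num_steps, node_features.size(0))
    chosen, remaining = [], list(range(node_features.size(0)))
    for _ in range(num_steps):
        scores = q_net(node_features[remaining])       # one score per remaining candidate node
        if random.random() < epsilon:                  # wandering/exploration step
            idx = random.randrange(len(remaining))
        else:
            idx = int(scores.squeeze(-1).argmax().item())
        chosen.append(remaining.pop(idx))
    return chosen
```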
Because the datasets used in this classification task contain very large label sets, directly applying traditional classification evaluation metrics would consume a large amount of computation; moreover, each sample is relevant to only a very small number of labels, and model performance is poor when a few labels are selected directly from a high-dimensional label set. In practical task scenarios, the prediction for each sample is therefore a short ranked list of relevant labels, where a higher rank indicates higher relevance, and the evaluation metric should accordingly give more weight to higher-ranked labels. Researchers thus prefer ranking-sensitive metrics to measure classifier performance in such scenarios. Following existing work, the evaluation metric in this section is Precision at top-k (P@k), which counts how many of the labels in the first k positions of the ranked list are relevant to the sample; the better the model’s classification performance, the higher the relevant labels score and the closer to the head of the list they rank, so P@k is larger. In this section the DRML model is compared with multiple text classification models on the same datasets, and the results are shown in Table 1.
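As a concrete illustration of the metric, P@k can be computed as in the following sketch, where `ranked_labels` is the model’s ranked label list for one sample and `relevant` is its set of true labels; both names and the toy example are illustrative.

```python
def precision_at_k(ranked_labels, relevant, k):
    """Fraction of the top-k predicted labels that are truly relevant to the sample."""
    top_k = ranked_labels[:k]
    return sum(1 for label in top_k if label in relevant) / k

# Example: 2 of the top 3 predicted labels are relevant -> P@3 ≈ 0.667
print(precision_at_k(["poetry", "grammar", "syntax"], {"poetry", "syntax"}, k=3))
```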
Comparison experiment of DRML and other reference models (%)
| Algorithm | Eurlex-4k P@1 | Eurlex-4k P@3 | Eurlex-4k P@5 | AmazonCat-13k P@1 | AmazonCat-13k P@3 | AmazonCat-13k P@5 | Wiki10-31k P@1 | Wiki10-31k P@3 | Wiki10-31k P@5 |
|---|---|---|---|---|---|---|---|---|---|
| PfastreXML | 74.91 | 73.45 | 57.02 | 92.31 | 78.04 | 63.56 | 82.84 | 69.88 | 60.11 |
| DisMec | 83.90 | 72.02 | 61.15 | 97.46 | 81.12 | 67.48 | 85.79 | 74.92 | 64.88 |
| Parabel | 81.41 | 70.49 | 55.96 | 94.05 | 78.89 | 64.11 | 85.89 | 75.67 | 59.42 |
| SLEEC | 88.01 | 71.57 | 58.40 | 96.69 | 79.84 | 61.33 | 86.40 | 79.60 | 69.35 |
| XML-CNN | 83.18 | 65.77 | 54.59 | 91.73 | 82.04 | 66.88 | 87.02 | 75.89 | 61.37 |
| LAHA | 80.37 | 69.04 | 58.53 | 94.92 | 80.16 | 65.78 | 85.24 | 76.39 | 62.87 |
| AttentionXML | 81.86 | 77.95 | 62.62 | 95.13 | 85.16 | 66.60 | 85.94 | 79.63 | 60.45 |
| X-Transformer | 74.34 | 78.36 | 62.51 | 93.60 | 83.38 | 69.18 | 88.12 | 80.70 | 70.51 |
| APLC-XLNet | 81.14 | 68.85 | 58.15 | 97.25 | 79.49 | 69.54 | 84.75 | 77.78 | 64.17 |
| LightXML | 89.39 | 66.34 | 64.22 | 94.58 | 86.73 | 65.18 | 89.86 | 81.14 | 64.82 |
| DRML | 90.25 | 80.56 | 69.74 | 98.71 | 86.47 | 71.22 | 91.04 | 82.43 | 71.53 |
From the experimental results in Table 1, it can be seen that, compared with the one-vs-rest, embedding-based, and tree-structure-based text classification models, the DRML model proposed in this paper outperforms the compared models on all classification evaluation metrics, improving P@1, P@3, and P@5 by 6.35%, 8.54%, and 8.59% on Eurlex-4k, by 4.66%, 7.58%, and 7.11% on AmazonCat-13k, and by 5.25%, 7.51%, and 6.65% on Wiki10-31k, which demonstrates the superior ability of deep learning to learn sample features.
Compared with the deep learning based text classification models, the DRML model is slightly below LightXML only on P@3 for AmazonCat-13k, and is ahead of all the deep learning models, including LightXML, on every other metric. Relative to LAHA, the improvement is larger: P@1, P@3, and P@5 improve by 9.88%, 11.52%, and 11.21% on Eurlex-4k, by 3.79%, 6.31%, and 5.44% on AmazonCat-13k, and by 5.80%, 6.04%, and 8.66% on Wiki10-31k.
The experiments in this section validate the relationship between the use of sentiment features or label semantic features and the classification performance of the DRML model. The results are shown in Table 2, where Difference gives the performance gap between the two settings: “+” indicates that model performance improves and “-” indicates that it decreases.
Relationship between emotion, label semantic features and classification performance (%)
| Dataset | P@k | Emotion feature | Label semantic feature | Difference |
|---|---|---|---|---|
| Eurlex-4k | P@1 | 85.95 | 89.42 | +3.47 |
| | P@3 | 76.42 | 79.12 | +2.70 |
| | P@5 | 66.37 | 68.15 | +1.78 |
| AmazonCat-13k | P@1 | 94.86 | 98.43 | +3.57 |
| | P@3 | 84.25 | 86.44 | +2.19 |
| | P@5 | 68.36 | 69.75 | +1.39 |
| Wiki10-31k | P@1 | 89.45 | 91.64 | +2.19 |
| | P@3 | 79.64 | 81.06 | +1.42 |
| | P@5 | 69.63 | 70.87 | +1.24 |
As can be seen from Table 2, on all three experimental datasets, using label semantic features as the final text features for the classification task yields better evaluation metrics than using the sentiment feature outputs: P@1, P@3, and P@5 are 3.47%, 2.70%, and 1.78% higher on Eurlex-4k; 3.57%, 2.19%, and 1.39% higher on AmazonCat-13k; and 2.19%, 1.42%, and 1.24% higher on Wiki10-31k.
To test whether the improved module contributes to the classification performance of the DRML model, this section performs an ablation experiment on the module that dynamically adjusts the text set. The experiment covers two settings: “Static” means the text set is not dynamically adjusted and interacts with the text features in its initialized form, while “Dynamic” means the semantic features of the text set are dynamically adjusted by the text of the sample data. “Difference” indicates the performance gap between the two settings, where “+” indicates that model performance improves and “-” indicates that it decreases. The results are shown in Table 3.
Ablation experiment results (%)
| Dataset | P@k | Static | Dynamic | Difference |
|---|---|---|---|---|
| Eurlex-4k | P@1 | 85.46 | 89.63 | +4.17 |
| | P@3 | 74.59 | 78.45 | +3.86 |
| | P@5 | 66.32 | 67.96 | +1.64 |
| AmazonCat-13k | P@1 | 84.26 | 87.58 | +3.32 |
| | P@3 | 73.94 | 75.68 | +1.74 |
| | P@5 | 67.06 | 69.33 | +2.27 |
| Wiki10-31k | P@1 | 87.16 | 89.56 | +2.40 |
| | P@3 | 76.49 | 80.02 | +3.53 |
| | P@5 | 69.28 | 70.63 | +1.35 |
From the experimental results in Table 3, it can be seen that the model with dynamically adjusted text features, i.e., the full DRML model, classifies text on the experimental datasets better than the model using static text features: P@1, P@3, and P@5 are 4.17%, 3.86%, and 1.64% higher on Eurlex-4k; 3.32%, 1.74%, and 2.27% higher on AmazonCat-13k; and 2.40%, 3.53%, and 1.35% higher on Wiki10-31k, which further indicates the strong classification performance of the DRML model.
Among the online teaching resources for college languages, text and images account for the largest share and are the most important resource types; this subsection therefore investigates the image classification performance of the DRML model in depth.
The experiments compare the classification effectiveness of the DRML model in this paper with the current state-of-the-art image classification models on three image datasets (miniImageNet, tieredImageNet, and QHGIM), and the results are shown in Table 4.
Comparison of classification accuracy of different methods (%)
| Method | miniImageNet | tieredImageNet | QHGIM |
|---|---|---|---|
| ProtoNet | 69.48 | 75.46 | 73.52 |
| RelationNet | 70.56 | 75.08 | 70.63 |
| SimCLR | 81.03 | 80.12 | 76.28 |
| SimSiam | 81.89 | 83.44 | 78.49 |
| TPMN | 85.65 | 85.47 | 80.36 |
| RE-Net | 84.74 | 84.23 | 80.07 |
| ProtoNet+Swin | 75.64 | 78.55 | 74.68 |
| BML | 77.42 | 84.64 | 83.59 |
| SUN | 85.79 | 87.09 | 83.66 |
| DRML | 88.96 | 90.41 | 89.73 |
Compared with the state-of-the-art models, DRML outperforms the other models on all three datasets, miniImageNet, tieredImageNet, and QHGIM, with accuracies of 88.96%, 90.41%, and 89.73%, respectively. DRML exceeds the baseline model (ProtoNet) by 19.48%, 14.95%, and 16.21% in classification accuracy, and improves on the best existing model (SUN) by 3.17%, 3.32%, and 6.07%, respectively. This indicates that DRML can classify images quickly and accurately and has practical value.
To examine the feature extraction capability of DRML more intuitively, this subsection samples 20 examples per category from the QHGIM dataset and visualizes their distribution in the feature space, as shown in Fig. 4, where (a) and (b) show the feature visualizations of the baseline model ProtoNet and of DRML, respectively. The results show that DRML generates more accurate decision boundaries, so that different categories are separated more clearly.

The visualized characteristics of the test sample
To investigate the overall generalizability of different models on the QHGIM dataset, this section visualizes the feature distributions of all samples in the dataset under different methods, as shown in Fig. 5, where (a) to (d) are the original feature distribution and the feature distributions produced by ProtoNet, SUN, and DRML, respectively.

The visualized characteristics of all test samples in QHGIM dataset
As shown in Fig. 5(a), the different categories in the original QHGIM dataset overlap heavily, which makes classification challenging. As shown in Fig. 5(b), the conventional prototype network has limited classification ability on the QHGIM dataset: it can distinguish only a small number of samples, and most samples still show significant inter-class overlap. As shown in Fig. 5(c), SUN achieves effective classification for most samples, but the spacing between different classes is too small and some samples are still confused between classes. As shown in Fig. 5(d), DRML exhibits the best classification performance, effectively reducing inter-class overlap and improving the overall result. This further demonstrates the effectiveness of DRML on image classification tasks.
The DRML model of this paper was applied to language teaching at School S, and questionnaires were distributed within the school to investigate whether the DRML-based classification of language teaching resources satisfied its users. Evaluation questionnaires were distributed to 442 teachers and students, and the recovered questionnaires were tallied. The results are shown in Table 5, where -2, -1, 0, 1, and 2 stand for Strongly Disagree, Disagree, Uncertain, Agree, and Strongly Agree, respectively, and Fi stands for the score rate, i.e., the combined percentage of Agree and Strongly Agree responses.
Evaluation for the college Chinese online teaching resource classification of DRML model
| Index | -2 (%) | -1 (%) | 0 (%) | 1 (%) | 2 (%) | Fi (%) | Mean |
|---|---|---|---|---|---|---|---|
| Classification accuracy | 1.79% | 4.91% | 12.14% | 54.76% | 26.40% | 81.16% | 0.99 |
| Resource quality | 0.19% | 2.31% | 10.45% | 55.82% | 31.23% | 87.05% | 1.16 |
| Classification speed | 0.38% | 4.14% | 12.42% | 53.69% | 29.37% | 83.06% | 1.08 |
| Result clarity | 0.97% | 2.45% | 12.36% | 51.28% | 32.94% | 84.22% | 1.13 |
| Learning difficulty | 1.72% | 2.47% | 20.40% | 57.05% | 18.36% | 75.41% | 0.88 |
| Acquisition difficulty | 0.18% | 2.74% | 21.58% | 53.32% | 22.18% | 75.50% | 0.95 |
| Efficiency improvement | 1.32% | 2.07% | 15.79% | 57.07% | 23.75% | 80.82% | 1.00 |
| Tool efficiency | 1.64% | 4.80% | 16.60% | 46.04% | 30.92% | 76.96% | 1.00 |
| Adopt willingness | 1.18% | 3.27% | 11.36% | 47.99% | 36.20% | 84.19% | 1.15 |
| Using willingness | 0.39% | 1.87% | 11.89% | 54.68% | 31.17% | 85.85% | 1.14 |
As can be seen from Table 5, for all 10 indicators the proportion of subjects who agreed or strongly agreed exceeds 75%, the proportion who strongly disagreed is no more than 2%, and the proportion who disagreed is no more than 5%. The mean scores of the 10 indicators lie in the range 0.88-1.16, which shows that the DRML-based classification of online language teaching resources in colleges and universities proposed in this paper received satisfactory evaluations from users.
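For reference, the Fi and Mean columns in Table 5 can be reproduced from the response percentages as sketched below, using the classification-accuracy row as a worked example and assuming, as the table’s figures indicate, that Fi is the combined share of Agree and Strongly Agree responses.

```python
# Response percentages for one indicator, keyed by score (-2 .. 2): classification accuracy row.
classification_accuracy = {-2: 1.79, -1: 4.91, 0: 12.14, 1: 54.76, 2: 26.40}

fi = classification_accuracy[1] + classification_accuracy[2]           # agree + strongly agree
mean = sum(score * pct / 100 for score, pct in classification_accuracy.items())

print(f"Fi = {fi:.2f}%, mean = {mean:.2f}")   # Fi = 81.16%, mean = 0.99
```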
This article fully investigates the characteristics of online educational resources, and feature extraction of online resources for college language teaching is carried out using a multi-agent deep reinforcement learning method. The original deep reinforcement learning algorithm is optimized to construct the DRML model for the intelligent classification of online language teaching resources in colleges and universities.
In text classification, the overall performance of the DRML classification model is better than that of the other classification models. In the feature comparison, classification using label semantic features outperforms classification using sentiment features: P@1, P@3, and P@5 are 3.47%, 2.70%, and 1.78% higher on Eurlex-4k, 3.57%, 2.19%, and 1.39% higher on AmazonCat-13k, and 2.19%, 1.42%, and 1.24% higher on Wiki10-31k. The DRML model with dynamically adjusted text features also classifies text better than the model using static text features. In the image classification of teaching resources, the accuracy reaches 88.96%, 90.41%, and 89.73% on the miniImageNet, tieredImageNet, and QHGIM datasets, outperforming the best existing model by 3.17%, 3.32%, and 6.07%, respectively, and feature visualization confirms the superiority of the DRML model in image classification. In the user evaluation of the DRML model’s classification of teaching resources, the mean scores of the indicators lie in the interval [0.88, 1.16], more than 75% of subjects agreed or strongly agreed on all 10 indicators, and the proportions who strongly disagreed or disagreed are below 2% and 5%, respectively. The DRML model in this paper thus obtained highly favorable evaluations.
