Research on multi-label short text categorization method for online education under deep learning

With the development of Internet technology, online education breaks the limitations of traditional education methods in terms of teacher strength and geography, and promotes the development of the education model in the direction of more flexible, more stable and easier to expand. However, the booming development of online education network platforms has led to the emergence of a large number of short text data, and due to the multi-label, multi-faceted characteristics of this type of data, users are unable to quickly obtain the target information when browsing short text [1–2]. The resulting text categorization technology is the main technology to promote the development of artificial intelligence and realize the free interaction between man and machine, which can deal with and solve the problem of information clutter to a greater extent, and facilitate the user to accurately locate the required information [3–4].

Short text refers to semantically sparse, shorter text of no more than 200 characters, characterized by sparseness and irregular format [5]. Sparseness refers to the short length of short text, the text contains only a small number of words with actual meaning, and it is impossible to obtain sufficient text features when text representation is performed [6–7]. Unstandardized format refers to the fact that short texts do not have strict writing specifications, the text form is liberalized, and there are noisy data such as misspelled words, abbreviations, multiple meanings and unregistered words in the text [8–10]. Due to sparsity, short text will get sparse feature vectors when performing text representation, which is not conducive to subsequent feature extraction and processing [11]. Due to the unstandardized text format, a lot of effort needs to be spent on the processing of noisy data in short text, which can easily cause ambiguity and affect the performance of text representation and subsequent tasks if not handled well [12–15]. Therefore, the short text’s own characteristics limit the text representation and feature extraction in the classification task.

With the abundance of information nowadays, the form of data is becoming more and more complex, a single topic has been unable to accurately describe the semantic information that the object has, multi-label text classification methods can use a collection of labels to express the text topic [16]. Mohammed, H. H. et al. designed two deep learning based approaches for multi-label text categorization with plain embedding layer and neural network model with pre-trained embedded corpus and evaluated the performance of the proposed models using multi-label evaluation metrics [17]. Liu, X. et al. applied the adapted multi-label k-nearest neighbor (MLkNN) classifier to a short text sentiment classification test, and the results of the corpus-based comparison test showed that the label-based MLkNN classifier could achieve high accuracy in short text sentiment classification with smaller training samples and lower training costs [18]. Shimura, K. et al. explored the application of category hierarchies to the task of multi-label categorization of short texts using a convolutional neural network with fine-tuning techniques that utilizes the data from the upper layers to contribute to the categorization of the lower layers, which significantly improves the model’s categorization performance by increasing the level of category linkages in the hierarchy [19]. Maragheh, H. K. et al. showed that deep learning neural networks can learn complex patterns in multi-labeled data, proposed a hybrid Long Short-Term Memory Neural Network (LSTM) and Convolutional Neural Network (CNN) for multi-labeled text categorization, and introduced a Competitive Search Algorithm (CSA) to further improve the detection and recognition accuracy of the model [20]. Almeida, A. M. et al. argued that multi-objective text categorization is an important tool in the field of sentiment analysis based on a multi-stance cognitive perspective, comparing algorithmic adaption, problem transformation under integrated algorithms with multi-labeled solutions, and found that the performance of integrated classifiers for sentiment analysis is more prominent [21]. Liu, H. et al. proposed a multi-label text classification algorithm based on multi-layer attention and label correlation, which first extracts label-related content information and sequence information in text, subsequently considers the label correlation of the original label space when performing label space dimensionality reduction in order to simplify the model learning, and finally utilizes a deep typical correlation analysis technique for coupling the features and the hidden space in the end-to-end model [22]. Gong, J., et al. proposed to construct a deep learning model based on hierarchical graph Transformer for the problem of semantic information loss caused by ignoring the connection between labels in traditional multi-label text classification methods, which utilizes the graph structure and label hierarchical relationship to generate labeled representations and improves the model’s ability to capture the hierarchical structure and logic of the text [23]. Khataei Maragheh, H. et al. evaluated the performance of SHO-LSTM model in a multi-label text classification task by using the Spotted Hyena Optimization Algorithm (SHO) to simulate the large-scale hunting behavior of spotted hyenas to solve the target task, and the comparative experiments showed that the SHO-LSTM model has a higher accuracy rate [24].

The development of the new generation of information technology has made online courses an indispensable mode of education in today’s society, which has attracted widespread attention from teachers and students with its efficient and convenient features. In this paper, we first introduce the related process and evaluation indexes for multi-label short text classification, and then discuss related models using word vector technology and deep learning. Then we extract the word vectors of multilabel short texts for online education based on the BERT model in deep learning technology, use BiLSTM-CNN model to obtain text features, and use Sigmoid function to realize the effective output of multilabel short texts. Finally, the public dataset THCNEWS and the self-made EduData dataset are selected as research objects to explore the feasibility of the BERT=BiLSTM-CNN model applied to the classification of multilabeled short texts in online education.

2

Relevant theories and technologies

The promotion of the global education informatization strategy and the continued steady growth in education investment have led to significant progress in the construction of education informatization in various countries, and the teaching and learning standards have continued to improve. The pursuit of educational equity and quality, educational innovation, personalized education, and capacity development have become common themes in education today. In the process of students’ participation in online education, a large amount of interactive data will be generated, including content-rich course comments and message interaction, these unstructured interactive data which contains rich semantic information, how to effectively use the unstructured text in online education, analyze the semantic information contained in it, become the current online courses to help improve the quality of teaching and learning has become a pressing problem to solve. Problem. Using deep learning for multi-label segment text classification is an effective method that can quickly and accurately mine the potential feature information in multi-label short texts.

2.1

Multi-label text categorization

2.1.1

Multi-label text categorization process

Text classification refers to extracting features from raw text data and predicting the labels of the text data based on these features. The multi-label text classification process is shown in Fig. 1, which can be roughly divided into the steps of text preprocessing, extracting text features and constructing a classifier [25].

Text data is characterized by unstructured, varying lengths, and interspersed data noise. Therefore, preprocessing the raw input text is necessary, and is especially important in the classification of short texts. Preprocessing methods are effective techniques to reduce data sparsity and help improve low-quality text.

In text data, different text records may have different writing styles, there may be the use of emoticons, abbreviations, misspellings, and the use of urls, etc., and the use of appropriate text preprocessing can help computers learn good text representations. Text preprocessing generally involves splitting words, cleaning data, and analyzing statistics. After the text data is preprocessed, the text representation method is used to obtain a form of representation that is easier to understand and has the least loss of information for the computer. Generally speaking, a good text representation has richer semantic features, which can greatly improve the algorithm’s effect. Finally, the represented text is fed into the classifier according to the selected features.

2.1.2

Evaluation Metrics for Multi-Label Text Categorization

For multi-label short text categorization tasks, the performance of the classifier can be evaluated using either micro-averaging (Micro) or macro-averaging (Macro) computational methods. The micro-averaging method does not consider the specific label information, treats all the categories as a binary classification task, and only judges whether the classifier correctly classifies them, and calculates all the TPs, FPs, TNs, and FNs first, and then calculates the corresponding metrics. The computational method of micro-averaging can evaluate the performance of the classifier better when the label distribution is relatively uniform, but when the label distribution is not uniform, the label categories with fewer samples tend to result in poorer classification of sparse label categories due to their lower weights in the computational process. In contrast, macro averaging is calculated by first calculating the accuracy, recall, and F1 composite score for each label category, and then averaging over all the categories, so it can also achieve better results on the more sparse label categories.

The macro-averaging is calculated as follows: 1 $M a c r o - P = \frac{1}{n} \sum_{i = 1}^{n} \frac{T P_{i}}{T P_{i} + F P_{i}}$ 2 $M a c r o - R = \frac{1}{n} \sum_{i = 1}^{n} \frac{T P_{i}}{T P_{i} + T N_{i}}$ 3 $M a c r o - F 1 = \frac{2 * M a c r o - P * M a c r o - R}{M a c r o - P + M a c r o - R}$ where i represents the test statistics associated with the i nd category label, and n is the total number of categories.

2.2

Word Vector Techniques

2.2.1

Vector space model

Word Frequency-Inverse Text Frequency (TF-IDF) constructs word feature representations by considering the importance of words in the text from a frequency perspective. The method asserts that words that can effectively represent the current text category should be those that appear more frequently in the current text and less frequently in other texts. In addition, a word should also gain importance if its total frequency of occurrence in all texts is low. To this end, TF-IDF uses both word frequency (TF) and inverse text frequency (IDF) to obtain a feature representation of a word, and then constructs a text vector space model to be used in the task of multi-label short text classification.

Word frequency, inverse text frequency and TF-IDF are calculated as follows: 4 $T F = \frac{n_{w_{k}}}{\sum_{i = 1}^{L} n_{w_{i}}}$ 5 $I D F = \log (\frac{N}{N_{w_{k}} + 1})$ 6 $T F - I D F = T F \times I D F$

Where w_k denotes the k nd word, L denotes the total number of words in the text, N denotes the total number of texts, and N_{w_k} denotes the total number of texts containing the k th word.

The pre-trained language model is trained on a large-scale corpus by setting up pre-training tasks to enable the model to obtain a generalized comprehension of the language, and then fine-tuned in downstream tasks according to specific task scenarios to achieve better results. In order to avoid the problems of unlimited increase of feature dimension with the increase of the number of words and the independence of words from each other in the traditional text feature representation, the pretrained language models represented by Word2Vec, GloVe, and BERT represent the use of fixed-size dimension to represent words, and the use of a large amount of corpus data to obtain the generalized linguistic knowledge, so that semantically similar words will have a similar feature representation.

2.2.2

BERT pre-training models

BERT is a pre-trained language model that learns a rich linguistic representation by training on a large-scale unsupervised corpus [26]. The BERT model is trained in a bi-directional way, which allows us to take into account both the left and right contextual information of a word, resulting in a richer representation. The BERT model’s foundation is the Transformer architecture, which is made up of multiple layers of Encoders. Each Encoder includes a multi-head self-attention mechanism as well as a forward propagation neural network. The multiple layers of encoders stacked in the BERT model allow the model to learn different levels of linguistic representations, which can be used for a wide variety of text categorization tasks. By fine-tuning the pre-trained BERT model, good performance can be obtained on a small amount of labeled data. In the working mechanism of BERT model, pre-training and fine-tuning together constitute the core working principle of BERT model.

1)

Pre-training phase

Input data: In the pre-training phase, the BERT model uses a large-scale text corpus to learn a generalized language representation. The input data is usually a corpus containing a large amount of text, which is divided into segments or sentences and then fed into the BERT model.

Masked Language Model: The goal of BERT is to train a masked language model that understands the contextual information in the text. During training, BERT randomly selects words or fragments and masks them (replacing them with special “[MASK]” symbols). The model’s task is to predict the masked words based on the context. This allows BERT to understand the meaning of words in different contexts.

Bidirectional Transformer Encoder: BERT uses a multi-layer bidirectional Transformer encoder to process the input text. The encoder contains a self-attention mechanism and a feed-forward neural network that captures the relationship between words and contextual information. BERT gradually develops a language representation at different levels by processing the input text multiple times.

2)

Fine-tuning phase

Task-specific fine-tuning: after the completion of the pre-training phase, BERT can be used for specific natural language processing tasks. Fine-tuning refers to further training the pre-trained BERT model on task-specific datasets to adapt it to a specific task. The fine-tuning phase usually includes an output layer designed according to the requirements of the task.

Supervised learning: in the fine-tuning phase, labeled task data is used to tune the parameters of the BERT model to achieve optimal performance on the task. Through optimization methods such as backpropagation and gradient descent, the parameters of BERT are fine-tuned to fit the characteristics of the task.

2.3

Deep Learning Models

2.3.1

Convolutional Neural Networks

Convolutional neural network (CNN) is an important deep learning model in deep learning, the use of CNN for feature extraction of textual information features, so that the model can get a good textual representation, the application of TextCNN to the field of multi-label short text classification makes the classification more accurate. The structure of CNN is shown in Fig. 2. The CNN uses the convolution kernel as a sliding window, the convolution kernel multiply and sum the features at the corresponding position by bit, this operation is called convolution operation. CNN uses the convolutional kernel as a sliding window, and the convolutional kernel is multiplied and summed with the features at the corresponding positions, which is called convolutional operation, and the convolutional operation enables CNN to efficiently perform feature dimensionality reduction of the feature values.The main model of CNN consists of the structure of the input layer, the convolutional layer, the pooling layer, and the fully-connected layer, and so on.

The input layer of CNN is mainly to input data to the model, and in text categorization it is necessary to vectorize the text information to convert it into a word vector matrix of text information. The main role of the convolutional layer is to capture semantic features from textual features through convolutional operations, which are realized by using a fixed-size convolutional kernel that can be slid. The convolution kernel is moved in the feature matrix in specific steps. Each step multiplies the convolution kernel with the feature values bitwise, composing the result into a new matrix for a new computation. The pooling layer is responsible for the operation of sampling feature values, and it mainly includes the average pooling layer, maximum pooling layer, and so on. The pooling layer can reduce the total number of feature values, discard unimportant feature values, and reduce the complexity of feature calculation. The fully connected layer is generally connected to the feature vector obtained from the pooling layer, through the fully connected layer even if the feature vector is converted into different output categories.

2.3.2

BiLSTM networks

In standard LSTM, the transmission of cell states is unidirectional from front to back, so the LSTM model can only learn text features from past moments but not from future moments. In contrast, Bidirectional Long Short-Term Memory Network (BiLSTM) has two unit state conveyor belts, which transmit information from front-to-back and back-to-front, respectively, enabling the BiLSTM model to utilize the text data information of the past moments while being able to learn the features of the text information of the future moments, and to recursively and feedback on them, with more accurate prediction results than the unidirectional LSTM. Mining the connection between the past and future data of the time series to improve data utilization better use of the temporal features of the time series in order to improve the model prediction accuracy [27].

Suppose $\vec{h_{t}}$ is the hidden layer state of the forward LSTM network at moment t, which can be regarded as a single-layer LSTM network, and the process of computing state $\vec{h_{t}}$ at moment t from state ${\vec{h}}_{t - 1}$ at moment t – 1, and x_t is the input at moment t. Then: 7 $\vec{h_{t}} = L S T M (x_{t}, {\vec{h}}_{t - 1})$

Where $\vec{h_{t}}$ is the hidden layer state of the forward LSTM network at moment t, x_t is the input at moment t, and ${\vec{h}}_{t - 1}$ is the hidden layer state of the state forward LSTM network at moment t – 1. Similarly $\vec{h_{t}}$ is the hidden layer state of the reverse LSTM network at moment t, then: 8 $\overset{\leftarrow}{h_{t}} = L S T M (x_{t}, {\overset{\leftarrow}{h}}_{t - 1})$

Where: $\overset{\leftarrow}{h_{t}}$ is the hidden layer state of the forward LSTM network at moment t, x_t is the input at moment t, and ${\overset{\leftarrow}{h}}_{t - 1}$ is the hidden layer state of the state forward LSTM network at moment t – 1. The BiLSTM network output is the combination of the two parts of the hidden layer states $\vec{h_{t}}$ and $\overset{\leftarrow}{h_{t}}$ , which constitutes the overall hidden state of the network h_t.

2.3.3

Attention mechanisms

At its core, the attentional mechanism draws on the fact that the human visual system, in the course of a thorough scanning of the area a person sees, focuses its primary attention on the more critical information and ignores the ineffective information in it.

The attention mechanism can be described as a set of sequences of key-value pairs, where Key = (k₁, k₂, k₂,…,k_d), value Value= (v₁, v₂, v₃,…,v_d), and query sequence Query = (q₁, q₂, q₃,…, q_m), and the process of attention computation is a mapping process of key-value pairs in the query sequence [28]. The main computation steps are the following 3 steps:

First, the attention scores of Key and Value are computed score(q_t, k_t), i.e., e_ti. There are five main formulas to compute the attention scores, summation, splicing, dot product, scaled dot product, and bilinear, which are computed as: 9 $e_{t i} = s c o r e (q_{t}, k_{i}) = v_{T} \tanh (W q_{t} + U k_{i})$ 10 $e_{t i} = s c o r e (q_{t}, k_{i}) = v_{T} \tanh (W [q_{t}^{T}; k_{i}])$ 11 $e_{t i} = s c o r e (q_{t}, k_{i}) = q_{t}^{T} k_{i}$ 12 $e_{t i} = s c o r e (q_{t}, k_{i}) = \frac{q_{t}^{T} k_{i}}{\sqrt{d_{k}}}$ 13 $e_{t i} = s c o r e (q_{t}, k_{i}) = q_{t}^{T} W k_{i}$

Next, the output from the previous step is normalized using the SoftMax function so that the sum of all the obtained attention scores e_it is 1. i.e: 14 $a_{t i} = S o f t \max (e_{t i}) = \frac{\exp (e_{t i})}{\sum_{n = 1}^{N} \exp (e_{t n})}$

Finally, the weight coefficients a_ti are assigned to the corresponding values v_i for weighted summation and finally the output of the attention mechanism is obtained. Namely: 15 $A t t e n t i o n (q_{t}, K, V) = \sum_{i} a_{t i} v_{i}$

In traditional models, fixed weights are usually used for calculation, while the attention mechanism can adaptively assign weights according to different parts of the data, so that the model can pay more attention to the key information in the data, and screen and filter a large amount of redundant information in the model, thus improving the classification performance and model interpretability.

3

Multi-label short text classification model

With the gradual rise and rapid development of modern informatization and other technologies, education informatization and education modernization have been technically supported, and the education industry has gradually entered the stage of intelligent education. Various educational platforms have launched a large number of free online courses for learners to facilitate learning, which also promotes the rapid development of online education. The learning process of online education generates a large amount of text data, and how to understand the development of online education through deep mining of short text data of different categories with different labels is an issue that needs to be focused on in order to further improve the quality level of online education.

3.1

Classification modeling framework

3.1.1

BERT-BiLSTM-CNN modeling

In order to be able to realize the accurate classification of multi-labeled short texts in online education, this paper is based on a bit of BERT model, the output of BERT is accessed into BiLSTM-CNN hybrid network as an Embedding layer, and the short text-level word vectors output from BERT are fused with the output from BiLSTM-CNN layer to form a BERT-BiLSTM-CNN hybrid model to improve the accuracy of classifying multi-labeled short texts for online education, and the structure of the model is shown in Fig. 3 [29].

The BERT-BiLSTM-CNN model proposed in this study utilizes the BERT model’s feature of different output vectors for the same word at different positions in the text to solve the problem of multiple meanings of a word that may occur in a short text. Utilizing the feature that BiLSTM is able to capture context-dependent information in both forward and backward directions, the model is able to capture a more comprehensive global feature of the argumentative text. Utilizing the feature that CNN can abstract the text information in a higher dimension, so that the model can capture the local features of the text, combined with the advantages of deep learning techniques so that the model can better complete the task of multi-labeled short text classification for online education.

3.1.2

Network Architecture Establishment Steps

The specific steps for building the network structure of BERT-BiLSTM-CNN model are as follows:

Step1 Preprocess the text dataset to be classified to get the input text, denoted as E = (E₁, E₂,…,E_i,…,E_n), where E_i(i = 1, 2,…,n) denotes the i rd word of the text.

Step2 Input all the E_i into the word vector representation module, and after encoding by the language encoder Transformer, the text Eis sequence characterized to output the text w_i = (w_1i, w_2i,…, w_ni). where w_ni shows the word vector of the n th word in the i th sentence in the text, and the w₁ – w_n word vectors are spliced (w₁, w₂,…, w_n) to get the BERT word vector representation matrix W.

Step3 Firstly, the BERT word vectors are input into the deep text feature extraction classification module, and the dependence of the semantic syntactic information represented by the text word vectors is further enhanced after two-layer BiLSTM. Second, after convolution with multiple convolution kernels such as 2-gram, 3-gram, and 4-gram, the key features of the text and its deeper structural information are further extracted. Again, in the pooling layer, each feature map obtained from the convolutional layer is processed by 1-max pooling for feature mapping to obtain uniform feature values, and the pooled features are connected by a fully connected layer, and SoftMax regression classification is performed to generate the final categorized feature vectors. Finally, the output is the category to which it belongs.

3.2

Model Functional Modules

3.2.1

Text embedding layer

The word embedding technique can be understood as a distributed representation of words, which improves the computational efficiency and the semantic information representation of word vectors compared to the feature sparsity problem caused by traditional machine learning methods. For the task of encoding words into model inputs, traditional word embedding approaches use language models such as Bag-of-Words, Skip-gram, and Word2Vec models. Word embedding can be considered as a data pre-training process. In this study, the online education multi-label short text classification task belongs to the small sample task, the sample size is not enough to support the model to learn enough parameters, and it is easy to overfit on the training set, so it is necessary to use the pre-training language model as the word embedding representation.The BERT model uses a self-supervised approach to obtain generalized knowledge from the massive unsupervised data, and then fine-tunes it with a small amount of annotated data on teaching behaviors, to be apply to downstream text multi-label classification tasks.

In this paper, the BERT model is used to obtain the vector representation of the input text, and the sentence pairs composed of the labeling information constructed above and the original sentences are input using sentence pairs, which are represented as: 16 $i n p u t = [C L S] + x_{i} + [S E P] + l_{i} + [S E P]$

Where x_i = {w₁, w₂,…, w_m} represents the input sentence of a multi-tagged short text, [CLS] represents the start marker of the input sequence, [SEP] represents the split marker of the input sentence sequence, and l_i represents the tag information. Semantic coding based on the BERT pre-training model can transform the input text into word vectors by performing word embedding operations, and the probabilistic prediction of [MASK] tags is performed by the MLM task in the pre-training model.

3.2.2

Feature extraction layer

After the input text data is obtained to BERT word vectors and sentence vectors through the text representation layer, the semantic features are further extracted by the parallel combination strategy of Multihead Self-Attention Mechanism, CNN and RCNN models.

Firstly, global feature enhancement of BERT word vector Token Embedding is performed with the help of the long-distance dependent learning ability of multi-head self-attention mechanism, which strengthens the ability to capture key features between sentences, and obtains the multi-head attention matrix MultiHeadToken.

Secondly, CNN model is used to extract local features. The N-gram features of MultiHeadToken are combined and screened using a convolutional kernel of height h in the CNN model to obtain local semantic information and word order features. Namely: 17 $\begin{array}{l} c_{i} = f (w \cdot M u l t i H e a d T o k e n_{i : i + h - 1}) + b \\ c = [c_{1}, c_{2}, \dots, c_{N - h + 1}] \end{array}$

Where, MultiHeadTokeni_i:t+h−1 denotes the local features in rows i to i + h – 1 of the Multi Head Attention Matrix Multi Head Token, b is the bias parameter and fis the nonlinear function. Then the most important text features are obtained by global maximum pooling, i.e: 18 $\hat{c} = \max {c}$

After the local features are extracted by individual convolution kernels, the final convolution result CNNToken is obtained. i.e.: 19 $C N N T o k e n = [{\hat{c}}_{1}, {\hat{c}}_{2}, \dots, {\hat{c}}_{m}]$

At the same time, Multi Head Token is input into CNN+ATTENTION model to further extract the global and local features of the text sequence. The context-dependent information of Multi Head Token bidirectional RNN Token is extracted by Bi-LSTM. i.e: 20 $\begin{array}{l} {\vec{h}}_{t} = f (\vec{W} x_{t} + \vec{V} {\vec{h}}_{t - 1} + \vec{b}) \\ {\overset{\leftarrow}{h}}_{t} = f ({\overset{\leftarrow}{W}}_{t} + \overset{\leftarrow}{V} {\overset{\leftarrow}{h}}_{t - 1} + \overset{\leftarrow}{b}) \\ y_{t} = g (W_{y} [{\vec{h}}_{t}, {\overset{\leftarrow}{h}}_{t}] + b_{y}) \\ R N N T o k e n = [y_{1}, y_{2}, \dots, y_{N}] \end{array}$ where ${\vec{h}}_{t}$ and ${\overset{\leftarrow}{h}}_{t}$ denote the hidden layer vectors at the moments of forward and backward propagation, respectively, $\vec{W}$ and $\overset{\leftarrow}{W}$ denote the weight matrices at the moments of forward and backward propagation, respectively, $\vec{b}$ and $\overset{\leftarrow}{b}$ are bias terms, and f and g are nonlinear functions.

BiLSTM model is prone to lose the semantic information at the beginning of the sentence at the decoding end, so the multi-head self-attention mechanism is used to strengthen its key feature information in the global perspective to obtain the attention matrix RNNMHA. Then, based on the optimization idea of RCNN model, a new CNN structure is added, and the n-gram local feature information of the attention matrix is enhanced to obtain the feature vector RCNN Token.Finally, it is spliced with CLSEmbedding and CNNToken to obtain the final text feature vector. Namely: 21 $y = c o n c a t (C L S E m b e d d i n g, C N N T o k e n, R C N N T o k e n)$

3.2.3

Text output layer

After obtaining the final multi-labeled short text representation vectors through the feature fusion layer, the classifier is constructed using the fully connected layer and the Sigmoid activation function, where the fully connected layer maps the dimensionality of the input vectors to the dimension of the number of labels, and the Sigmoid activation function calculates the probability that the tech text belongs to each label. Then: 22 $\hat{y} = s o f t \max (W_{F} M^{(E)} + b)$

Where $\hat{y}$ is the classification result predicted by the model, indicating the probability value that the current technology text belongs to each label, and a threshold is set, if the probability value predicted by the current label is greater than the threshold then the label is considered to be the classification result of the text. W_F is the parameter matrix that maps the text vector to the output space, and b is the deviation term. Finally, the cross-entropy loss function is used as the loss function and the model is trained by minimizing the value of the loss function. Then: 23 $L_{l o s s} = - \sum_{i = 1}^{N} \sum_{j = 1}^{l} (y_{i j} \ln ({\hat{y}}_{i j}) + (1 - y_{i j}) \ln (1 - {\hat{y}}_{i j}))$ where N is the total number of texts, l is the number of labels, and y_ij is the true value of the i th tech text corresponding to the j th label.

4

Evaluation of multi-label short text classification models

The development of the Internet has made it easier for people to communicate with each other, geographically wider, and information has become more real-time. The development of the network has also accelerated the process of educational development. Students are no longer satisfied with daily classroom learning, and gradually began to learn online, the network has laid a good foundation for the development of online education, more and more student users make the text related to online education also gradually increased. How to realize the effective use of multi-label text in online education has become a research trend to improve the teaching level and quality of online education.

In this paper, we use the public dataset THCNEWS and self-collected dataset as the experimental data.THCNEWS is generated by the Natural Language Processing Group of TH University based on the screening and filtering of the historical data of Sina News, from which 50,000 articles are extracted as the experimental data. In this paper, we use web crawler technology to collect teaching review texts from online education related websites as self-collected experimental data (EduData), which total 54637 short multi-tagged text data. The collected data are cleaned and preprocessed, and then the data are divided into a training and test set in a ratio of 8:2 for experiments.

4.1

Parameter impact analysis of the model

4.1.1

Effect of different word vectors

How to embed text vectors and what method to train word vectors are the first issues to be considered in this study, in order to verify the impact of different word embeddings on this experiment, this study selects four commonly used word embedding models in NLP, Word2Vec, ELMo, GloVe, and BERT, which are capable of generating word vectors and also have their own characteristics. Word2Vec is a kind of shallow neural network, GloVe is a word vector generation method based on global matrix decomposition, and the advantage of BERT is that it is able to learn bi-directional relationships between words. This study designs a comparison experiment to verify the effect of different word embeddings on this experiment, and the results of the comparison experiment are shown in Table 1.

Table 1.

Different word embeddings’ impact on the results

Model	THCNEWS		EduData
Model	Train	Test	Train	Test
Word2Vec	93.32	92.64	93.75	93.98
ELMo	94.06	94.83	94.27	94.46
GloVe	94.27	94.78	94.96	94.83
BERT	96.48	96.32	96.71	96.59

As can be seen from the table, Word2Vec’s word vector training approach has a poor classification effect, with the training and testing accuracy ranging from 92.64% to 93.98% on the two datasets, in which the reason may be that CBOW and Skip-gram have a weak ability to process long texts, thus resulting in a poorer classification effect. ELMo and GloVe word vectors have moderate performance, and both of them have a multilabel short text classification accuracy are between 94% and 95%, this is because ELMo will carry out basic pre-training for multi-labeled short text, and its core idea is closer to BERT, while GloVe mainly utilizes co-occurrence matrix to learn the correlation between multi-labeled short text, which helps to improve its classification accuracy for multi-labeled short text. The best performance of the BERT-based model is over 96% on both datasets, which is due to the fact that BERT consists of pre-training and fine-tuning phases. After word vector extraction of multi-labeled short texts in the pre-training phase, labeled multi-labeled short text data are used in the fine-tuning phase to adjust the parameters of the BERT model to achieve the best performance on the multi-labeled short text categorization task. Performance.

4.1.2

Effect of convolution kernel parameters

The performance of CNN is very much determined by the size of the convolutional kernel, considering the effect of different convolutional kernels on this model, three convolutional layers are designed to obtain better performance according to the experimental ideas of existing research. The performance of [1,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6,7] convolutional kernel sizes are designed on the combination of convolutional kernels on this model. Holding other parameters constant, Table 2 shows the experiments performed on THUCNews and EduData data. In the THUCNews dataset, the best performance is achieved when using a convolutional kernel size of [2,3,4], with classification accuracy of more than 98% in all cases. This shows that the model is able to capture important features of multi-labeled short text data using a small convolutional kernel size. On the other hand, the best performance is achieved using larger convolutional kernel sizes [5,6,7] in the homemade EduData dataset, which has a classification accuracy of greater than 98% for multi-labeled short text. This may indicate that the features in the homemade EduData dataset are more dispersed in the text and require a larger convolutional kernel to capture them efficiently. Although the hyperparameter tuning process is usually time-consuming and computationally expensive, it can significantly improve the model accuracy.

Table 2.

The impact of convolutional kernel size

Convolution nucleus	THCNEWS		EduData
Convolution nucleus	Train	Test	Train	Test
[1,2,3]	96.34	96.42	97.18	96.89
[2,3,4]	98.06	98.13	97.42	97.26
[3,4,5]	97.83	97.57	97.63	97.54
[4,5,6]	97.69	97.42	97.71	97.98
[5,6,7]	97.75	97.68	98.25	98.37

4.2

Performance validation of the classification model

4.2.1

Classification Model Training Losses

The EduData dataset is used to train the multi-label short text classification model that was created in this paper. During the training phase of the model, the Adam optimizer is used to train the model to update the weights, and 50 rounds of iterative training are set up, with 1000 iterations in each round, and the number of training samples in each batch is 24. After every 100 iterations during the training process, the loss value and Macro-P of the model are recorded locally. In order to reduce the risk of overfitting the model during the training process, the training is terminated early when the loss value exceeds 5000 iterations and does not decrease. At the end of training, the loss values and Macro-P curves of the multi-label short text classification model during training are plotted as shown in Fig. 4 Where the left axis is the Macro-P training curve and the right axis is the loss value training curve.

Analysis of the loss value change curves shown in the figure shows that in the initial stage of model training, the loss value curve oscillates sharply and decreases dramatically. With the increase in the number of training times, the optimal value of the model parameter region, the fluctuation amplitude of the loss value gradually decreases, the loss value gradually tends to stabilize, and the model training is ended at 5*10⁵ steps, at which time the loss value of the model is about 0.085. From the change of the Marco-P curve of the model, its trend is opposite to the loss value, and it oscillates sharply and is in the stage of rapid increase before 3*10³ steps. After this, it gradually converges and the overall Marco-P stabilizes around 96.05% after 5*10⁵ steps. The change curves of the model training loss and Marco-P can be used to understand the specific parameters of the model convergence, and then better ensure the accuracy of the model in the classification of multi-label short texts.

4.2.2

Comparative Experiments on Classification Models

In order to verify the effectiveness of the multi-label short text classification model based on BERT-BiLSTM-CNN model proposed in this paper, the model in this paper is compared and experimented with TextCNN, BERT, RoBERTa, MacBERT, ERNIE, ERNIE-CNN, CRC-MHA. The models are trained using uniform hyperparameters in the experiments, and Table 3 shows the results of the comparison experiments of different models on different datasets.

Table 3.

Different models compare experimental results

Model	THCNEWS			EduData
Model	Marco-P	Marco-R	Marco-F1	Marco-P	Marco-R	Marco-F1
TextCNN	83.64%	82.01%	0.835	84.45%	83.13%	0.891
BERT	88.06%	84.75%	0.871	89.09%	86.06%	0.932
RoBERTa	87.57%	83.28%	0.882	89.24%	86.28%	0.935
MacBERT	88.25%	85.16%	0.878	89.46%	86.71%	0.937
ERNIE	88.48%	85.34%	0.883	89.87%	87.15%	0.941
ERNIE-CNN	89.73%	86.43%	0.894	91.16%	88.49%	0.948
CRC-MHA	90.12%	88.85%	0.901	91.49%	89.27%	0.953
Ours	92.09%	90.14%	0.915	92.08%	90.38%	0.962

In the classification experiments on the THCNEWS dataset, the BERT-BiLSTM-CNN model proposed in this paper achieves the best results in terms of Marco-P, Marco-R, and Macro-F1 values for multi-labeled short text classification. Compared to BERT and its improved models RoBERTa and MacBERT, the Macro-F1 values have improved by 5.05%, 3.74%, and 4.21%, respectively, and all of them have achieved a substantial lead. Among them, the difference in the performance of BERT compared to RoBERTa and MacBERT on the THCNEWS dataset is not significant, and the RoBERTa model even performs less well than the BERT model, which is presumed to be related to the difference in its pre-training process. Compared to the ERNIE and ERNIE-CNN models, the Macro-F1 values improved by 3.62% and 2.35%, respectively, which is a more significant improvement than the effect of the ERNIE model. The Macro-F1 index of ERNIE-CNN model is improved by 1.1% compared with ERNIE model, which indicates that there is still more hidden information in the output feature vector of the pre-trained model, and the effect of using CNN network for extraction has been significantly improved. Compared with the CRC-MFA model, the Macro-F1 value is elevated by 1.55%, and the classification effect is also improved to a certain extent.The CRC-MFA model adopts a multi-dimensional approach to feature fusion, but there are semantic and information density differences between news headlines and keywords in the dataset, and direct splicing may cause semantic confusion. The BERT-BiLSTM-CNN model proposed in this paper adopts a multi-field attention fusion, where the title and keywords of the multi-labeled short text are separately feature extracted and then feature fused, which can better retain the feature information, and therefore better classification results can be achieved.

In addition, in the classification experiments on the EduData dataset, this paper’s model improves the Macro-F1 value by 3.22%, 2.89%, 2.67%, 2.23%, and 1.48% compared to the BERT, RoBERTa, MacBERT, ERNIE, and ERNIE-CNN models, respectively, which all achieve a substantial lead.

Compared with the CRC-MHA model, the Macro-F1 value only improved by 0.94%, which is not a significant improvement. The reason may lie in the style of the dataset, the EduData dataset contains a large number of fields of online education related event text, and does not reflect the advantages of multi-field feature fusion, but the final classification effect is similar. It shows that the effect of this model in the classification of multi-label short text with multiple fields is similar to the mainstream model, and it has good generalization ability.

4.2.3

Comparison of model time consumption

In order to verify the time performance of the BERT-BiLSTM-CNN model based on BERT-BiLSTM proposed in this paper, the comparison model in Section 4.2.2 is still chosen to compare the training time of different models on different datasets for experiments and to analyze the time complexity. The experimental results that are related are displayed in Fig. 5.

On both THCNEWS and EduData datasets, the training duration of TextCNN is within 2 hours. The time consumption of TextCNN on THCNEWS dataset is 0.41 hours lower than that of BERT, while the time consumption of TextCNN on EduData dataset is 0.51 hours lower than that of BERT. TextCNN is a convolutional neural network based on convolutional kernels of different sizes, which is able to reduce the number of parameters of the model through the parameter-sharing mechanism, and can perform data BERT is based on the Transformer model, which is characterized by multi-head attention mechanism and hierarchical stacking structure, and has more parameters than traditional deep learning models, thus its training time is higher than that of TextCNN model. Overall, the training time of the multi-label short text classification model designed in this paper is significantly higher than the comparison model, which is due to the fact that the model in this paper adopts the BERT model for short text representation, in which the embedded Transformer structure will significantly consume time, and the BiLSTM model utilizes the long and short-term memory network to model the multi-label short text in both forward and backward order, but due to the fact that long and short-term memory network is serial in nature, it can only be processed serially in both directions, which will also increase the training time to some extent.

Although the temporal performance of the multi-label short text classification model based on BERT-BiLSTM-CNN model proposed in this paper is worse than that of the traditional deep learning model, it achieves a better classification effect than the traditional deep learning model within the acceptable training time, so it can also be better applied to the classification of multi-label short texts in online education, and provides reliable We also propose that the model can be better applied to classify multi-label short texts in online education, providing reliable technical support for improving the quality of online education.

5

Conclusion

In this paper, a multi-label short text classification model based on BERT-BiLSTM-CNN is proposed and validated for its performance in text categorization tasks in the online education domain. The short text word vector representation based on the BERT model has a classification accuracy of more than 96% on both THCNEWS and EduData datasets, and the loss value and Marco-P of the model level off around 0.085 and 96.05% after 5*10⁵ steps. The Marco-F1 values of the multi-label short text classification model constructed in this paper reach 0.915 and 0.962 on the THCNEWS and EduData datasets, respectively, which are higher than the comparison model. Although the overall training time is longer than that of the comparison model, this model has better generalization performance considering the classification accuracy. Therefore, it is feasible to carry out online education multi-label short text classification with the support of deep learning technology, which can provide reliable technical support to enhance the deep mining of online education-related text data.

Idioma:: Inglés

Calendario de la edición:: 1 veces al año
Temas de la revista:: Ciencias de la vida, Ciencias de la vida, otros, Matemáticas, Matemáticas aplicadas, Matemáticas generales, Física, Física, otros

RSS Feed de revista

Research on multi-label short text categorization method for online education under deep learning

Yinuo Guo

Publicado en línea: 19 mar 2025

Recibido: 11 nov 2024

Aceptado: 15 feb 2025

DOI: https://doi.org/10.2478/amns-2025-0391

Palabras claveBERT model, BiLSTM-CNN model, Text categorization, Multi-label short text, Online education

© 2025 Yinuo Guo, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Palabras clave
BERT model, BiLSTM-CNN model, Text categorization, Multi-label short text, Online education