Data mining and neural network modeling for teaching and learning in vocational education: promoting innovation in academic management and teaching reforms
Published Online: Mar 19, 2025
Received: Nov 22, 2024
Accepted: Feb 20, 2025
DOI: https://doi.org/10.2478/amns-2025-0444
© 2025 Wangkai Xu et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Since its inception, data mining technology has been widely applied in fields such as finance, telecommunications, retail, health care, biomedicine, and science and engineering, each of which has developed mature mining systems that have contributed greatly to the field. Its adoption, however, is far from universal, and applications in colleges and universities remain relatively rare. As the technology matures and its application domains continue to expand, many higher-education researchers have begun to study how data mining can support university teaching and management [1–5]. Higher vocational education is an important component of both higher education and vocational education. It has developed rapidly in recent years, yet problems remain. Its development depends not on scale or capital investment but on the quality of education. To improve that quality, a higher vocational college should first fully understand its object of cultivation, namely its students, and then combine this understanding with its own characteristics and the demands of society to formulate corresponding educational measures and strategies, forming a management system with its own characteristics [6–10].
Higher vocational education is employment-oriented: schools should cultivate the talents that society needs. Coursework assessment in vocational education should consider every element that may affect students' coursework outcomes, but these elements are numerous and influence one another to varying degrees, so assessment results are usually difficult to express with an exact and appropriate mathematical formula, and traditional classification methods struggle to handle the problem accurately. Neural networks, with their nonlinear mapping, self-learning, self-organizing, and self-adaptive abilities, perform well on problems whose internal mechanisms are complex and hard to model, offering new approaches to nonlinear classification, pattern recognition, signal processing, and similar tasks [11–15]. Grasping the twin centers of vocational education and employment, applying data mining and neural networks to higher vocational education can therefore uncover a large amount of valuable information, yielding a data mining and neural network model for analyzing student characteristics and success factors, predicting student employment, and analyzing coursework assessment. Applying this information to academic management in colleges and universities can promote further reform, improvement, and development of the education system, provide an important basis for management decision-making, and thus support the sustained and healthy development of higher vocational education [16–20].
In this paper, an early warning model (TabNet) for students' academic management and teaching is constructed based on data mining techniques. Data cleaning, transformation, and reduction methods are used to preprocess students' course performance data and construct the dataset. On this basis, an academic early warning system for vocational education teaching in colleges and universities is designed and implemented; the system warns of problems in students' academic performance while providing data and decision-making support for the college's teaching management. Finally, a construction method and implementation process for an intelligent classroom for vocational education teaching are presented.
Data mining [21] is a relatively complex systems-engineering task. In the context of vocational education and teaching, this paper divides the data mining process for academic management and teaching reform into five phases: clarifying the problem, data collection, data preprocessing, model building, and model interpretation and evaluation. The overall process is shown in Figure 1.

Data mining flow chart
In data mining, most collected data is incomplete and disorganized and cannot be used directly for mining analysis. Preprocessing is therefore required to improve the quality of the analysis. Preprocessing methods have matured over many years of development; the following sections introduce data cleaning, data transformation, and data generalization.
Data cleaning is the procedure of detecting and correcting identifiable errors in a dataset, including checking data consistency and handling invalid and missing values. This paper focuses on cleaning incomplete data and noisy data.

Incomplete data. Using incomplete data for mining can distort the results of data mining algorithms and may even cause them to fail. For missing values in the original dataset, it is necessary to determine their range, calculate the proportion of missing values in each field, and choose a cleaning method according to that proportion and the importance of the field.

Noisy data. Noisy data refers to records containing outliers or erroneous values, which affect mining results and must be processed before analysis. Three common approaches are: binning, which divides the data into subintervals by attribute value and smooths the data within each subinterval; clustering, which finds and removes isolated points far from any cluster center; and regression, which fits a curve between two attributes, uses one attribute to predict the other, and removes data that deviate significantly from the predicted values.
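The binning approach described above can be sketched in a few lines. The following helper (my own illustration, not the paper's code) smooths noisy scores by equal-frequency binning, replacing each value with the mean of its bin:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins=3):
    """Equal-frequency binning: sort the values, split them into n_bins
    roughly equal groups, and replace each value by its bin's mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    bins = np.array_split(values[order], n_bins)
    smoothed_sorted = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    # restore the original record order
    smoothed = np.empty_like(smoothed_sorted)
    smoothed[order] = smoothed_sorted
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], n_bins=3))
# -> [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```

Values inside each subinterval are pulled toward the local mean, damping isolated noise spikes without discarding records.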
Data transformation converts data into a form suitable for analysis, or changes the structure or format of the original data. For example, before mining the student achievement data in a university's academic management system, the raw scores can be converted into indicators such as grade point average, failure rate, and failed credits, making the data more meaningful. Two common transformation methods are introduced below.

Min-max normalization. Let max and min be the maximum and minimum values of a field in the dataset. Min-max normalization maps a value $x$ of that field linearly onto a new interval $[\mathrm{new\_min}, \mathrm{new\_max}]$:

$$x' = \frac{x - \min}{\max - \min}\,(\mathrm{new\_max} - \mathrm{new\_min}) + \mathrm{new\_min}$$

Decimal scaling normalization. Decimal scaling normalization, also called base transformation, normalizes a field by moving the decimal point. The transformation is

$$x' = \frac{x}{10^{j}}$$

where $j$ is the smallest integer such that $\max(|x'|) < 1$.
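The two normalization methods can be sketched as follows; the function names and the default target interval $[0, 1]$ are illustrative choices, not taken from the paper:

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Map a field's values linearly from [min, max] to [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return new_min + (x - lo) * (new_max - new_min) / (hi - lo)

def decimal_scaling_normalize(x):
    """Shift the decimal point: x' = x / 10**j, with the smallest j
    such that max|x'| < 1."""
    x = np.asarray(x, dtype=float)
    j = int(np.ceil(np.log10(np.abs(x).max())))
    return x / 10**j

scores = [45, 60, 75, 90]
print(min_max_normalize(scores))          # -> [0.0, 0.333..., 0.666..., 1.0]
print(decimal_scaling_normalize(scores))  # -> [0.45, 0.6, 0.75, 0.9]
```

Min-max normalization preserves the relative spacing of values, while decimal scaling only rescales their magnitude.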
Data generalization (reduction) simplifies the dataset as much as possible while preserving its information content, improving the efficiency of data mining. For large datasets, reduction should be performed before mining; two main methods are introduced below.

Dimensionality reduction. This takes two main forms: the first deletes, directly from the original dataset, fields that are duplicated or weakly related to the mining target; the second recombines the original fields into a small set of mutually uncorrelated composite variables while retaining as much of the information contained in the original fields as possible.

Numerosity reduction. This reduces the data volume by adopting smaller data units or by replacing the original data with a model, and falls into two categories. Parametric methods fit a model to the original dataset and store only the model parameters rather than the data itself. Non-parametric methods store actual data: the original dataset is divided into clusters such that data within a cluster are similar and data in different clusters differ, and the clusters are then used in place of the individual records.
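The second form of dimensionality reduction, recombining the original fields into a few uncorrelated composite variables, is classically realized with principal component analysis. A minimal numpy sketch (an illustration, not the paper's implementation):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the fields of X onto the top-k principal components:
    k mutually uncorrelated composite variables that retain as much
    of the original variance as possible."""
    Xc = X - X.mean(axis=0)                  # center each field
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # scores of the top-k components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # 100 records, 6 original fields
Z = pca_reduce(X, k=2)
print(Z.shape)                  # -> (100, 2)
```

The resulting columns of `Z` are uncorrelated by construction, so the six original fields are compressed into two composite variables.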
Because different courses have different grading standards and examination content, the distribution of achievement data differs across majors, and abnormal distributions appear: some courses' grades are uniformly high or uniformly low rather than approximately normal. To examine intuitively whether the grade distributions of different courses in the academic data are normal, this experiment divides scores into five intervals: (80, 100], (60, 80], (40, 60], (20, 40], and [0, 20]. To verify the approach, several courses were randomly selected as examples: students of four majors A, B, C, and D, with their respective course scores C1, C2, C3, and C4 as the objects of division. The resulting distributions are shown in Figure 2, whose four panels are histograms of the scores in each class after division by interval. The C1 scores of major A are mainly distributed between 80 and 90 points, the C2 scores of major B between 70 and 85, and the C3 scores of major C between 70 and 90, while the C4 grades of students in major D are spread across every interval and do not follow a normal distribution. Grade distributions thus differ significantly across the courses of different majors. This large difference arises not only from course characteristics and the difficulty of the teacher's marking, but also from missing and erroneous values in the raw data, which distort the overall distribution of students' grades; the student data therefore requires further processing.

Grade distributions of the four majors and their corresponding courses
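The interval division used in the experiment can be reproduced with a short helper. The implementation below (illustrative only) counts scores into the five right-closed intervals:

```python
import numpy as np

def score_distribution(scores):
    """Count scores into the intervals [0,20], (20,40], (40,60],
    (60,80], (80,100] used in the experiment (right-closed)."""
    edges = np.array([20, 40, 60, 80])
    labels = ["[0,20]", "(20,40]", "(40,60]", "(60,80]", "(80,100]"]
    # side="left" makes boundary scores (20, 40, ...) fall in the lower interval
    idx = np.searchsorted(edges, scores, side="left")
    counts = np.bincount(idx, minlength=5)
    return dict(zip(labels, counts.tolist()))

print(score_distribution([12, 55, 61, 78, 83, 91, 70]))
# -> {'[0,20]': 1, '(20,40]': 0, '(40,60]': 1, '(60,80]': 3, '(80,100]': 2}
```

Plotting these counts per course produces histograms like the panels of Figure 2.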
Since a college has multiple majors and different majors have different course systems, the academic early warning model constructed in this experiment must apply to students of all relevant majors in the college. Moreover, students take different courses, so scores cannot be compared directly across courses, and treating each individual course as a feature would inflate the feature space, making the model overly complicated and poorly generalizable. Therefore, to represent each student's academic performance objectively, an overall per-semester assessment was used: statistical indicators such as GPA, credits taken, credits failed, number of failed courses, and average score in each semester serve as the input features for analysis. After processing, four students were randomly selected and their input data for the first three semesters listed; Table 1 shows the data input format.
Data input format for the first three semesters
| First semester | | | | |
|---|---|---|---|---|
| Student number (ID) | 20240528 | 20240157 | 20240346 | 2024004 |
| First-semester GPA | 3.72 | 2.56 | 2.88 | 3.83 |
| First-semester credits taken | 27.5 | 32.0 | 28.5 | 27.0 |
| First-semester credits failed | 0 | 4.5 | 0 | 5.0 |
| First-semester failed courses | 0 | 2 | 0 | 3 |
| First-semester average score | 88.53 | 76.56 | 72.38 | 88.86 |
| Second semester | | | | |
| Student number (ID) | 20240077 | 20240172 | 20240369 | 20240605 |
| Second-semester GPA | 2.90 | 2.77 | 3.52 | 3.96 |
| Second-semester credits taken | 27.0 | 30.5 | 26.5 | 28.0 |
| Second-semester credits failed | 4.5 | 0 | 0 | 5.5 |
| Second-semester failed courses | 1 | 0 | 0 | 2 |
| Second-semester average score | 88.63 | 76.66 | 72.48 | 88.96 |
| Third semester | | | | |
| Student number (ID) | 20240077 | 20240172 | 20240369 | 20240605 |
| Third-semester GPA | 2.90 | 2.76 | 3.51 | 3.95 |
| Third-semester credits taken | 27.0 | 29.5 | 25.5 | 28.0 |
| Third-semester credits failed | 0 | 4.5 | 5.5 | 0 |
| Third-semester failed courses | 0 | 1 | 2 | 0 |
| Third-semester average score | 88.53 | 76.56 | 72.38 | 88.89 |
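The per-semester features of Table 1 could be derived from raw course records roughly as follows. The record format and the 4.0-scale grade-point mapping below are assumptions made for illustration, not taken from the paper:

```python
def semester_features(records):
    """records: list of (score, credits) tuples for one student-semester.
    Returns the Table-1 style indicators for that semester."""
    def grade_point(score):
        # assumed linear 4.0-scale mapping: 60 -> 1.0, 90+ capped at 4.0
        return min((score - 50) / 10.0, 4.0) if score >= 60 else 0.0

    total_credits = sum(c for _, c in records)
    failed = [(s, c) for s, c in records if s < 60]
    gpa = sum(grade_point(s) * c for s, c in records) / total_credits
    avg = sum(s for s, _ in records) / len(records)
    return {
        "GPA": round(gpa, 2),
        "credits_taken": total_credits,
        "credits_failed": sum(c for _, c in failed),
        "failed_courses": len(failed),
        "average_score": round(avg, 2),
    }

print(semester_features([(88, 3.0), (72, 2.0), (55, 2.5)]))
```

Aggregating courses this way makes students with different course lists comparable, which is exactly why the paper uses semester-level statistics rather than per-course scores.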
Traditional academic warning suffers from lag: students' academic problems are often exposed only at the end of a semester or in the upper grades, by which point a warning comes too late and its effect is greatly reduced. To intervene earlier, this experiment moves the warning forward: based on students' grades in the first two semesters, it predicts whether a student will need an academic warning in the third semester. The focus of the study is to warn students in the early grades in time for them to make corrections. The student number was used as each student's unique ID; the final data entry format is shown in Table 2.
The final data input format for each student
| Serial number | Field name |
|---|---|
| 1 | Student number (ID) |
| 2 | Student number (ID) |
| 3 | First-semester GPA |
| 4 | First-semester credits taken |
| 5 | First-semester credits failed |
| 6 | First-semester failed courses |
| 7 | Second-semester GPA |
| 8 | Second-semester credits taken |
| 9 | Second-semester credits failed |
| 10 | Second-semester failed courses |
| 11 | Second-semester average score |
Raw data often contain null course scores for some students, caused by missed exams, credit recognition through comprehensive tests, scores not being entered into the system, and so on, so these null values must be handled. There are three mainstream imputation methods: filling with the mean score of the course, filling with the mode or median, and filling with zero. Because this paper uses per-semester statistical indicators, missing course values can simply be discarded without imputation; this operation barely affects each student's overall statistical indicators, which still essentially reflect the student's academic performance in each semester.
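The three imputation strategies, and the drop-the-nulls strategy actually used in this paper, can be sketched with the standard library:

```python
import statistics

scores = [78, None, 85, 62, None, 85, 91]       # None marks a null score
observed = [s for s in scores if s is not None]

mean_filled   = [s if s is not None else statistics.mean(observed)   for s in scores]
median_filled = [s if s is not None else statistics.median(observed) for s in scores]
mode_filled   = [s if s is not None else statistics.mode(observed)   for s in scores]
zero_filled   = [s if s is not None else 0                           for s in scores]
dropped       = observed   # strategy used here: discard the nulls outright

print(mode_filled)   # -> [78, 85, 85, 62, 85, 85, 91]
```

Dropping is safe here precisely because the downstream features are semester-level aggregates, so a few missing course scores barely move them.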
The Pearson correlation coefficient measures the degree of linear correlation between two variables, taking values between -1 and 1. For two variables $X$ and $Y$ with samples $x_i$ and $y_i$ ($i = 1, \dots, n$), it is defined as

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}$$
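A direct numpy implementation of the coefficient, useful for screening feature pairs before modeling:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation r = cov(X, Y) / (sigma_X * sigma_Y), in [-1, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# hypothetical feature columns for illustration
gpa    = [3.7, 2.6, 2.9, 3.8, 3.1]
failed = [0, 2, 1, 0, 1]
print(pearson(gpa, failed))   # strongly negative: more failures, lower GPA
```

Features that correlate too strongly with each other are candidates for removal during dimensionality reduction.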

Feature correlation scatter plot
The process by which a neural network [22] constructs its decision manifold is similar to the process by which a decision tree makes decisions: in the figure, each internal node compares one feature of the sample against a threshold and routes the sample down the corresponding branch, until a leaf, i.e., a decision region, is reached.

Tree model decision case diagram
The neural network, in turn, mimics this tree-model decision process using designed fully connected layers and the ReLU function; the decision flow is shown in Figure 5. As Figure 5 shows, the input to the network is a feature vector: mask layers first select individual features for each branch, fully connected layers multiply the selected features by weights and add biases to reproduce the threshold comparisons, ReLU zeroes out the regions that fail a comparison, and the branch outputs are summed and passed through Softmax to produce the decision-region output, just as a tree aggregates its leaves.

Neural network decision flow
The TabNet architecture is shown in Fig. 6. TabNet [23–24] is a stack of multiple decision steps, each consisting of a Feature transformer, an Attentive transformer, a Mask layer, a Split layer, and ReLU. For discrete input features, TabNet first maps them into continuous numerical features with trainable embeddings, and batch normalization is applied to the raw numerical features, so that every decision step receives the same $B \times D$ feature matrix, where $B$ is the batch size and $D$ the feature dimension.

TabNet schema
Feature transformer
The Feature transformer implements the feature computation at each decision step. Compared with a plain neural network that processes features through fully connected (FC) layers, the Feature transformer is somewhat more complex: it consists of FC layers, BN layers, and Gated Linear Unit (GLU) layers. The purpose of the GLU is to add a gate unit on top of an ordinary FC layer, computed as in Eq. (4):

$$\mathrm{GLU}(\mathbf{x}) = (\mathbf{x}W + \mathbf{b}) \otimes \sigma(\mathbf{x}V + \mathbf{c}) \tag{4}$$

where $\sigma$ is the sigmoid function and $\otimes$ denotes elementwise multiplication.
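A minimal numpy sketch of the GLU computation; the shapes and random initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, b, V, c):
    """Gated Linear Unit: an FC layer gated elementwise by a second,
    sigmoid-activated FC layer: GLU(x) = (xW + b) * sigma(xV + c)."""
    return (x @ W + b) * sigmoid(x @ V + c)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # batch of 4 samples, 8 features
W, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
b, c = np.zeros(16), np.zeros(16)
print(glu(x, W, b, V, c).shape)                  # -> (4, 16)
```

Because the gate lies in (0, 1), each output component is a damped copy of the linear branch, letting the layer learn which feature combinations to pass through.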
Split layer
The Split layer cuts the vector output by the Feature transformer into two parts, as in Eq. (5):

$$[\mathbf{d}[i], \mathbf{a}[i]] = \mathrm{Split}\big(\mathrm{FT}(\mathbf{M}[i] \odot \mathbf{f})\big) \tag{5}$$

In the above equation, $\mathbf{f}$ is the input feature vector, $\mathbf{M}[i]$ is the mask of decision step $i$, $\mathbf{d}[i]$ is passed through ReLU to contribute to the step's output, and $\mathbf{a}[i]$ is fed to the Attentive transformer of the next decision step.
Attentive transformer
The Attentive transformer computes the Mask-layer matrix of the current decision step from the output of the previous decision step, and keeps the Mask matrix sparse and non-repeating, so that different student samples receive different masks; its function is to let each sample select different features.
The Attentive transformer structure is shown in Fig. 7. In Fig. 7, the Sparsemax layer makes the Attentive transformer's output sparse: Sparsemax projects the feature vector directly onto the probability simplex to achieve sparsification, calculated as

$$\mathrm{sparsemax}(\mathbf{z}) = \underset{\mathbf{p} \in \Delta^{K-1}}{\arg\min} \; \lVert \mathbf{p} - \mathbf{z} \rVert^{2}$$

where $\Delta^{K-1}$ is the $(K-1)$-dimensional probability simplex.

Attentive transformer Framework
Applying the Sparsemax function in the Attentive transformer, the mask of decision step $i$ is computed as

$$\mathbf{M}[i] = \mathrm{sparsemax}\big(\mathbf{P}[i-1] \cdot h_i(\mathbf{a}[i-1])\big)$$

In the above equation, $h_i$ denotes the FC and BN layers of the Attentive transformer, $\mathbf{a}[i-1]$ is the Split-layer output of the previous step, and $\mathbf{P}[i-1] = \prod_{j=1}^{i-1}(\gamma - \mathbf{M}[j])$ is the prior scale term, which records how heavily each feature has already been used in earlier steps; the relaxation parameter $\gamma \geq 1$ controls how freely a feature may be reused across steps.
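Sparsemax has a closed-form solution via sorting (the Euclidean projection onto the simplex); a numpy sketch for a single vector:

```python
import numpy as np

def sparsemax(z):
    """Project z onto the probability simplex. Unlike softmax, the
    result can contain exact zeros, which is what makes the TabNet
    masks sparse."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # descending
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = 1 + ks * z_sorted > cumsum        # coordinates kept in the support
    k = ks[support][-1]
    tau = (cumsum[k - 1] - 1.0) / k             # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 0.8, -1.0]))   # -> [0.6 0.4 0. ]  (sums to 1, exact zero)
```

The exact zeros mean a sample simply ignores the masked-out features at that decision step, instead of merely down-weighting them as softmax would.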
The KNN interpolation method uses the remaining (non-missing) features to construct a multidimensional space and selects the $k$ samples nearest to the sample with the missing value. The distance between two student samples $x_a$ and $x_b$ is measured by the Euclidean distance over their jointly observed features:

$$d(x_a, x_b) = \sqrt{\sum_{j}(x_{aj} - x_{bj})^{2}}$$

In the above equation, $x_{aj}$ and $x_{bj}$ are the values of feature $j$ in the two samples. For a sample with a missing value, the $k$ nearest neighbors that do have that value are found. If the missing feature is continuous, the missing value is filled with the (distance-weighted) mean of the neighbors' values; if it is discrete, the mode of the neighbors' values is used instead. For the KNN interpolation method, when the distance-weighted mean is adopted, the estimate is

$$\hat{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d_i}$$

where $x_i$ is neighbor $i$'s value of the missing feature and $d_i$ its distance to the sample being filled.
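A simple unweighted variant of KNN imputation, written as an illustration (the paper does not give its implementation):

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs: for each sample with a missing field, find the k
    nearest complete samples (Euclidean distance over the fields the
    sample does have) and use the mean of their values for that field."""
    X = np.asarray(X, dtype=float).copy()
    for i, j in zip(*np.where(np.isnan(X))):
        obs = ~np.isnan(X[i])                       # fields present in sample i
        candidates = [r for r in range(len(X))
                      if r != i and not np.isnan(X[r, j])
                      and not np.isnan(X[r][obs]).any()]
        d = [np.linalg.norm(X[i, obs] - X[r][obs]) for r in candidates]
        nearest = np.argsort(d)[:k]
        X[i, j] = np.mean([X[candidates[n], j] for n in nearest])
    return X

# hypothetical feature rows: [GPA, average score, credits failed]
X = [[3.7, 88.5, 0.0],
     [2.6, 76.6, 4.5],
     [2.9, np.nan, 0.0],
     [3.8, 88.9, 5.0]]
print(knn_impute(X, k=2))
```

Replacing the plain mean with the inverse-distance weights of the preceding formula gives the weighted variant.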
For visualization, this paper uses 30 test sets and 30 training sets and reports accuracy (a), precision (p), recall (r), and F1 for decision-tree warning alongside the TabNet neural network. The performance of the decision tree on the warning data is shown in Fig. 8. As the figure shows, the classifier's average accuracy on the test set is 93.11% with an average recall of 74.35%, while on the training set it performs worse, with an overall accuracy of 81.36% and a recall of only 47.07%, indicating that the classifier is not precise enough in classifying the positive examples. In contrast to the decision tree, the support vector machine also performed well on the test set; since it maps the data into a high-dimensional space for classification, its decision process is difficult to visualize directly, so only its performance is analyzed here.

The results of the decision tree on the warning data
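The four metrics reported in this section can be computed from the confusion-matrix counts; an illustrative helper (not the paper's code):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for a binary warning label
    (1 = student needs an academic warning)."""
    tp = sum(t == 1 and y == 1 for t, y in zip(y_true, y_pred))
    tn = sum(t == 0 and y == 0 for t, y in zip(y_true, y_pred))
    fp = sum(t == 0 and y == 1 for t, y in zip(y_true, y_pred))
    fn = sum(t == 1 and y == 0 for t, y in zip(y_true, y_pred))
    a = (tp + tn) / len(y_true)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return a, p, r, f1

print(classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]))
```

A high accuracy with a low recall, as seen on the training set above, means the classifier misses many true warning cases even while getting most labels right overall.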
For the support vector machine algorithm, the dataset was normalized. Because the number of features in the warning dataset is modest and the sample set is sufficiently large relative to it, a Gaussian kernel function was used, with its parameter gamma set to 0.5 according to the inverse of the number of warning-result categories. The performance of the support vector machine on the warning dataset is shown in Fig. 9: it obtained 87.36% recall and 91.14% precision on the training set. Ideally one would like both high recall and high precision, but the two metrics constrain each other and cannot be maximized simultaneously. To balance them, the harmonic mean F1 of the two was computed; the average F1 values on the training and test sets were 81.57% and 81.85%, respectively, showing that the model behaves almost identically on the two sets.

Support vector machine algorithm results in the warning data set
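The Gaussian (RBF) kernel with the paper's setting gamma = 0.5 is simply:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2).
    gamma = 0.5 follows the paper's setting (the inverse of the number
    of warning-result categories)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

print(rbf_kernel([3.7, 88.5], [3.7, 88.5]))   # identical samples -> 1.0
```

The kernel value decays from 1 toward 0 as two samples move apart, which is what lets the SVM separate classes nonlinearly in the implicit high-dimensional space.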
This paper compares the TabNet neural network algorithm against the traditional BP neural network algorithm. The loss curves of the two algorithms on the training and test sets are shown in Fig. 10, where (a) and (b) give the loss of the BP network and the TabNet network, respectively. On the test set, the loss curve of the TabNet algorithm is 43.75% lower than that of the BP network, reaching a minimum of 0.045, showing that TabNet generalizes better to new data than the traditional BP network, with a lower chance of overfitting. The BP network ended with a gap of about 33.33% between its training-set and test-set loss curves, while the corresponding gap for TabNet is an order of magnitude smaller. Some gap between training and test loss is inevitable, since a classifier fitted on the training set will incur error when applied to test samples, but an excessive gap signals overfitting. Combined with the accuracy results, both classifiers perform well here, but TabNet is the more refined of the two.

The loss rate curve of the two algorithms in the training set and the test set
To judge the predictive value of the two classifiers intuitively, ROC curves are introduced: the larger the area under the curve, the more valuable the classifier, i.e., the closer the curve lies to 0 on the X-axis and to 1 on the Y-axis, the higher the accuracy. The ROC curves of the two algorithms on the training and test sets are shown in Fig. 11. The TabNet curve lies closer to the upper-left corner, with areas of 0.8054 and 0.8913, respectively, indicating that TabNet's predictions are more accurate and its classification results of greater practical value.

ROC curves of the two algorithms on the training and test sets
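The area under the ROC curve can be computed directly from scores via its rank interpretation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A sketch:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney formulation: fraction of
    (positive, negative) pairs ranked correctly, ties counted as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return float(wins / (len(pos) * len(neg)))

print(roc_auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))   # -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation, which is why the reported areas of 0.8054 and 0.8913 indicate genuinely useful classifiers.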
To avoid issuing warnings at times when they would be ineffective, the three weeks at the beginning and end of the semester were excluded as candidate times, and the classifier was tested on dynamic data grouped in two-week windows. Weeks 9-10 were determined to be the best time to trigger the warning system, both from the standpoint of accuracy and from behavioral data such as the frequency of students' library visits, which showed motivation slackening at that point. A warning at that time gives students room to adjust their learning habits while the prediction remains accurate; afterwards, the student's dynamic academic profile continues to be monitored, with warnings issued periodically. A typical alert indicates that a student is failing exams frequently and that recent study behavior and self-discipline remain poor, so a warning should be issued urging the student to correct course and strengthen study habits.
Building an intelligent teaching system means collecting and analyzing learning behavior data related to vocational education teaching through big data, cloud computing, the Internet of Things, and other emerging information technologies, in order to perceive and identify learners' profiles and learning states and thereby lay the foundation for intelligent teaching. Specific practices include: ① appointing teachers with strong network and information-technology skills as team leaders of the teaching team; ② building teaching resources through human-computer collaboration, cross-disciplinary cooperation, and data-driven methods, chiefly an intelligent classroom platform based on a networked teaching resource library; ③ using the network, artificial intelligence, and related means to build a platform for communication, interaction, and feedback between teachers and students, on which students' questions, answers, and classroom interactions are recorded in real time.
In the intelligent teaching mode, instructional design is a critical link that determines the effectiveness of the course and the teacher's ability to personalize teaching. First, the intelligent teaching mode collects and analyzes data during the design process and, on that basis, proposes personalized learning programs for students of different levels and abilities; this is also the foundation on which personalized teaching modes and methods achieve their expected results. Second, during design, data-empowered analysis yields portraits of students' learning behavior; these portraits are fed into the student learning model and combined with each student's current learning state to design targeted activity plans that meet the learner's knowledge-transfer requirements. Third, the collected student data are classified and analyzed during the design process. Finally, the collected data and information are synthesized, and the results are used to develop a personalized learning-model implementation plan and specific activity arrangements.
The intelligent classroom teaching mode is implemented mainly in two ways. First, before class, the teacher pushes the course syllabus, teaching objectives, unit training content, learning tasks, practice questions, and other relevant content through a mobile application, along with information and knowledge matched to the learning objectives and requirements the teacher has set in advance. Second, teaching and instruction are delivered by intelligent robots or intelligent terminals through online and offline training platforms.
In this paper, an academic management dataset is constructed based on big data, and then the TabNet neural network algorithm is used to provide early warning on the academic results of vocational education teaching. Finally, based on the early warning results, it proposes the construction method and implementation process of the intelligent teaching model for colleges and universities. The primary conclusions are as follows:
Because of missing and erroneous values in the raw grade data of different majors' courses, the overall distributions of students' grades differ widely. This study therefore adopts per-semester statistical indicators, such as GPA, credits taken, credits failed, number of failed courses, and average score, as the input features for analysis. The decision-tree classifier's average accuracy and recall on the test set are 93.11% and 74.35%, respectively, while on the training set it is not precise enough in classifying the positive examples, with a recall of only 47.07% against an overall average accuracy of 81.36%. Compared with the decision tree, the support vector machine achieves 87.36% recall and 91.14% precision on the training set and behaves almost identically on the training and test sets, with mean F1 values of 81.57% and 81.85%, respectively. On the test set, the TabNet algorithm generalizes better than the BP neural network, with a lower chance of overfitting: the BP network ended with a gap of about 33.33% between its training and test loss curves, while TabNet's gap was minimal, and the ROC curves show that TabNet is more accurate and its classification results of greater practical value. Through data mining technology and neural network algorithms, the academic management process is standardized, the collected data are comprehensively analyzed, and the results are used to formulate personalized learning-model implementation plans and specific activity arrangements, thereby promoting innovation in academic management and teaching reform.
