Machine Learning Based Big Data Analytics for Education in Curriculum Reforms

In today's information age, the development of Internet technology has brought new opportunities and challenges to every industry, and industries are constantly exploring new paths of development built on it. The field of education is likewise experiencing unprecedented change. Technologies such as big data, cloud computing and machine learning have provided strong support for this transformation, making educational models and methods increasingly intelligent, personalised and precise[1]. This paper therefore focuses on the application of machine learning-based educational big data analysis in curriculum reform, with a view to providing new ideas and methods for educational reform.

Educational big data refers to the massive and diverse data generated in educational activities, including student behaviour data, teaching activity data and evaluation feedback data. It is an important product of educational informatisation and is increasingly valued in curriculum reform[2]. By collecting, organising and analysing large amounts of data from the field of education, educational big data can provide a scientific basis for educational decision-making and personalised support for teaching activities, while optimising the teaching process and improving the quality of education[3]. However, how to collect these data effectively, and how to store, manage and analyse them efficiently, remain important open issues for the education field. Machine learning, as an efficient data analysis technology, shows great potential in the application of educational big data.

Machine learning is a technology that enables computers to learn autonomously: it automatically extracts rules and patterns from data and performs tasks such as prediction, classification, recognition and recommendation[4]. At present, machine learning has many applications in education that can address problems of the traditional education model. For example, in intelligent teaching scenarios, machine learning can analyse students' learning behaviours and learning trajectories to automatically identify and assess their learning status. This gives teachers insight into students' learning needs and preferences, making it easier to understand students and to provide personalised teaching resources and methods based on these data. In addition, machine learning can monitor students' learning progress to ensure that teaching content remains suitable and that students learn effectively. However, many problems remain in the practical educational application of machine learning, and these require continued in-depth discussion and research.

In this paper, based on the exploration of machine learning algorithms and key technologies commonly used in machine learning and educational big data analysis, a curriculum reform model is proposed to provide decision support for curriculum reform. The model includes steps such as data collection, data preprocessing, feature selection, model training and evaluation. The effectiveness of the proposed curriculum reform model is verified through empirical research. Through the analysis of actual teaching data, the application effect of the model in personalised teaching, learning behaviour analysis and teaching resources recommendation is demonstrated. At the same time, this paper will also explore the challenges that may be encountered in the application of the model in promoting education reform and propose corresponding solution strategies. Through the research in this paper, we aim to provide new ideas and methods for curriculum reform based on machine learning and education big data analysis, improve the quality and efficiency of teaching, and promote the personalised and precise development of education.

2

Machine Learning and Big Data Analytics in Education

2.1

Overview of Machine Learning Theory

Machine learning is a core branch of artificial intelligence, with its main theoretical foundation derived from statistics, computer science and applied mathematics, and is a series of algorithms and techniques that enable computer systems to use data to learn and improve themselves. It enables computers to learn patterns from data by constructing models and use such patterns to make predictions and decisions about unknown data[5].

The main processes of machine learning are data preprocessing, feature selection, model training and evaluation[6]. Firstly, data preprocessing ensures the quality of the data, mainly through operations such as data cleaning, normalisation and denoising. Next, feature selection identifies the most important attributes in the data to reduce dimensionality and improve model performance. After feature selection, an algorithm is trained on the data to fit a model and discover potential relationships in the data. Finally, the model is evaluated by applying specific metrics that measure the accuracy of its predictions and its generalisation ability[7]. Supervised learning, unsupervised learning and reinforcement learning are the three main types of machine learning. Supervised learning learns a model from a labelled training dataset in order to classify or predict new data. Unsupervised learning deals with unlabelled data and focuses on discovering structure and patterns in the data. Reinforcement learning, on the other hand, trains a model to make optimal decisions in a given environment through reward and punishment mechanisms.

2.2

Big Data in Education

In 2012, Big Data for Development: Challenges and Opportunities, published by the United Nations, stated that the era of big data had arrived and would have a profound impact on all areas of society[8]. With the rapid development of Internet technology, all industries have begun to use big data to better determine their strategies and business directions. Big data is characterised by huge data volume, rapid data processing and a wide variety of data types. In the field of education, big data draws on diverse sources such as school information management systems for teachers and students, online learning platforms and course management platforms. These data contain both structured and unstructured information, including detailed records of students' learning progress, information on teachers' teaching activities, and data on the learning behaviours generated in online learning environments. Educational big data can be analysed at three levels: clickstream data at the micro level, text data at the meso level and institutional data at the macro level. Micro-level clickstream data are mainly generated by learners' interactions in learning environments such as smart tutoring systems, Massive Open Online Courses (MOOCs), simulations and games. Meso-level text data are mainly generated when students perform writing activities in learning environments such as smart tutoring systems, online forums and social media. Macro-level institutional data are mainly collected by individual educational institutions and include student personnel data, enrolment data, campus services data, course schedules and course enrolment data, and university major requirements and degree completion data.
By collecting and analysing these educational big data, hidden educational laws and trends can be obtained from them, thus providing a more scientific and accurate basis for educational decision-making.

At present, machine learning is widely used in educational big data analysis. For example, students' academic achievement and performance can be mined and predicted with deep learning regression and linear regression models[9]. A Hidden Markov model can be used to classify students' academic performance levels into a compact learning performance trajectory, from which the relationship between this trajectory and students' ultimate academic success can be obtained[10]. With the rapid growth of online teaching in recent years, students' online learning information is now recorded; one study used a large amount of interaction data from Moodle together with the performance characteristics of college students, applying multiple linear regression, cluster analysis and correlation analysis to predict students' performance and analyse the important factors affecting it[11]. With the continuous progress of machine learning technology, its application in the field of education will become deeper and more extensive, providing powerful technical support for educational reform and innovation.

3

Commonly Used Machine Learning Algorithms and Key Technologies for Education Big Data Analysis

3.1

Recurrent Neural Network

A Recurrent Neural Network (RNN) is a deep learning model that is suitable for processing sequence data of arbitrary length and can capture the temporal dynamic features in a sequence[12]. Its main feature is its short-term memory capability, which makes it suitable for problems such as time-series analysis and speech recognition. A recurrent neural network can be represented by the following formulas, where formula (3.1) is the calculation formula for the output layer and formula (3.2) is the calculation formula for the hidden layer:

(3.1)

o_{t} = g (V s_{t})

(3.2)

s_{t} = f (U x_{t} + W s_{t - 1})

V: the weight matrix of the output layer,

g: the activation function of the output layer,

U: the weight matrix of the input x_{t},

W: the weight matrix applied to the previous hidden state s_{t - 1} when it is fed back as an input,

f: the activation function of the hidden layer.
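The recurrence in formulas (3.1) and (3.2) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the choice of tanh for f and the identity function for g, the helper names `matvec` and `rnn_forward`, and the list-of-lists weight layout are assumptions made for the example, not details given in this paper.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V, s0):
    """Run a simple RNN over the input sequence xs.

    Assumes f = tanh in (3.2) and g = identity in (3.1); both are
    illustrative choices, not taken from the paper.
    """
    s = s0
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(z) for z in pre]   # (3.2): s_t = f(U x_t + W s_{t-1})
        outputs.append(matvec(V, s))      # (3.1): o_t = g(V s_t)
    return outputs, s

# Hypothetical one-dimensional example: two time steps of input.
outputs, final_state = rnn_forward(
    xs=[[0.0], [1.0]], U=[[1.0]], W=[[0.0]], V=[[2.0]], s0=[0.0]
)
```

Because the hidden state s is carried from one step to the next, each output depends on the whole input history seen so far, which is the short-term memory property described above.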

Since recurrent neural networks are not well suited to processing long sequences, the Long Short-Term Memory (LSTM) network, which can retain both long-term and short-term information, was developed as an improvement on this basis.

(3.3)

i_{t} = \sigma (W_{i} x_{t} + U_{i} h_{t - 1})

(3.4)

f_{t} = \sigma (W_{f} x_{t} + U_{f} h_{t - 1})

(3.5)

o_{t} = \sigma (W_{o} x_{t} + U_{o} h_{t - 1})

(3.6)

{\tilde{s}}_{t} = \tanh (W_{s} x_{t} + U_{s} h_{t - 1})

(3.7)

s_{t} = f_{t} \odot s_{t - 1} + i_{t} \odot {\tilde{s}}_{t}

(3.8)

h_{t} = o_{t} \odot \tanh (s_{t})

Here, formula (3.3) represents the input gate, which determines how much of the network input x_t at the current moment is saved to the unit state s_t. Formula (3.4) represents the forget gate, which determines how much of the unit state s_{t-1} at the previous moment is retained in s_t at the current moment. Formula (3.5) represents the output gate, which controls how much of the unit state s_t is passed to the output, and formula (3.6) gives the candidate unit state computed from the current input. The current value of the storage unit is given by formula (3.7), and formula (3.8) gives the output value h_t determined by the unit state s_t.
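A single LSTM time step following formulas (3.3) to (3.8) can be sketched as below. The sketch assumes that the products in (3.7) and (3.8) are element-wise and that the gates use the logistic sigmoid; the function names and the dictionaries `W` and `U` keyed by gate name are hypothetical conveniences for the example, not part of the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(x_t, h_prev, s_prev, W, U):
    """One LSTM step; W and U are dicts of matrices keyed by 'i', 'f', 'o', 's'."""
    pre = {}
    for k in "ifos":
        pre[k] = [a + b for a, b in zip(matvec(W[k], x_t), matvec(U[k], h_prev))]
    i_t = [sigmoid(z) for z in pre["i"]]        # (3.3) input gate
    f_t = [sigmoid(z) for z in pre["f"]]        # (3.4) forget gate
    o_t = [sigmoid(z) for z in pre["o"]]        # (3.5) output gate
    s_tilde = [math.tanh(z) for z in pre["s"]]  # (3.6) candidate unit state
    # (3.7) new unit state: keep part of the old state, add part of the candidate
    s_t = [f * sp + i * c for f, sp, i, c in zip(f_t, s_prev, i_t, s_tilde)]
    # (3.8) output: gated, squashed unit state
    h_t = [o * math.tanh(s) for o, s in zip(o_t, s_t)]
    return h_t, s_t

# Hypothetical example with all-zero weights: every gate evaluates to 0.5.
zeros = [[0.0, 0.0], [0.0, 0.0]]
W = {k: zeros for k in "ifos"}
U = {k: zeros for k in "ifos"}
h_t, s_t = lstm_step([1.0, 2.0], [0.0, 0.0], [1.0, -2.0], W, U)
```

With zero weights the forget gate is 0.5 and the candidate state is 0, so the new unit state is simply half of the previous one, which makes the role of the forget gate easy to see.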

3.2

Random Forest

Random Forest (RF), proposed by Leo Breiman, is an algorithm that performs well in classification and prediction tasks. It improves prediction accuracy without significantly increasing the computational burden. The algorithm can efficiently handle datasets with a large number of feature dimensions without the need for feature dimensionality reduction[13]. In constructing a random forest, each decision tree is grown on a random sample drawn with replacement (a bootstrap sample) from the original dataset of N records, with n samples taken for training, and this process applies to all trees. When the input data has m feature dimensions, a constant a much smaller than m is set; at each split during the construction of a decision tree, a features are randomly drawn from the m features and the best of them is selected for splitting. Ultimately, the prediction of the random forest is obtained by combining the predictions of these independent decision trees.
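The three ingredients described above (sampling with replacement, a random subset of a features, and combining the trees by voting) can be illustrated with a deliberately simplified sketch. As a loudly stated assumption, each "tree" here is only a one-split decision stump chosen by Gini impurity, so this is not a full random forest and not the implementation used in this study; all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(X, y, feature_ids):
    """Choose the (feature, threshold) among feature_ids with the lowest weighted Gini."""
    best = None
    for j in feature_ids:
        for thr in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= thr]
            right = [lab for row, lab in zip(X, y) if row[j] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, thr,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split: always predict the majority label
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def fit_forest(X, y, n_trees=25, a=1, seed=0):
    rng = random.Random(seed)
    m = len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sampling with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        features = rng.sample(range(m), a)                    # random subset of a features
        forest.append(fit_stump(Xb, yb, features))
    return forest

def predict(forest, row):
    """Majority vote over the individual stumps."""
    votes = [left if row[j] <= thr else right for j, thr, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: two features per student, two performance classes.
X = [[1, 10], [2, 11], [3, 12], [8, 20], [9, 21], [10, 22]]
y = ["low", "low", "low", "high", "high", "high"]
forest = fit_forest(X, y, n_trees=25, a=1, seed=0)
```

Because every stump sees a different bootstrap sample and a different feature subset, individual stumps can be wrong while the vote of the ensemble remains stable.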

3.3

K-Nearest Neighbour Algorithm

The K-Nearest Neighbours (KNN) algorithm is a distance-based supervised learning algorithm mainly applicable to classification and regression tasks. It makes predictions based on the distance between test samples and training samples, usually the Euclidean, Manhattan or Minkowski distance[14]. It is a nonlinear, lazy learning algorithm: it requires no assumptions about the distribution or form of the data and, instead of constructing a model in the training phase, it uses the entire training dataset in the prediction phase. In classification problems, the KNN algorithm calculates the distance between the test sample and each sample in the training set, selects the K nearest neighbours, and assigns the test sample to the class chosen by the majority vote of these neighbours. In regression problems, the KNN algorithm predicts the output of the test sample as a weighted average or simple average of the output values of the K nearest neighbours.

(3.9)

d (x, y) = \sqrt{{(x_{1} - y_{1})}^{2} + {(x_{2} - y_{2})}^{2} + \dots + {(x_{n} - y_{n})}^{2}} = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}

(3.10)

d (x, y) = \sum_{i = 1}^{n} | x_{i} - y_{i} |

Equations (3.9) and (3.10) are the Euclidean and Manhattan distance formulas, respectively. Once the distance measure is chosen, the final value of K is determined by tuning the parameter through experiments. The choice of K is crucial to the performance of the KNN algorithm because it affects the bias and variance of the algorithm, so an appropriate value is usually determined with methods such as cross-validation.
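As a minimal sketch of the classification procedure and of choosing K by validation, the following plain-Python example implements the two distance formulas, majority voting, and a leave-one-out accuracy estimate. Leave-one-out is used here as one simple form of cross-validation; the function names and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation (3.9)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Equation (3.10)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Classify query by majority vote among its k nearest training samples."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def loo_accuracy(X, y, k, distance=euclidean):
    """Leave-one-out accuracy: predict each sample from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k, distance) == y[i]
    return hits / len(X)

# Hypothetical toy data: two well-separated groups of students.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]
scores = {k: loo_accuracy(X, y, k) for k in (1, 3)}  # compare candidate K values
```

In practice the candidate K with the best validation score would be kept; a small K follows the training data closely (low bias, high variance), while a large K smooths the decision boundary.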

4

Application of Educational Big Data Analytics in Curriculum Reform

4.1

Analysis and Prediction of Students' Learning Behaviours Based on Machine Learning Educational Big Data

1)

Selection of Data Sets

In this study, the educational dataset originates from the academic affairs system of a university in China and covers information such as students' card swipe records and grades. The data are real records and have been desensitised to ensure data security and privacy. The dataset covers behavioural data, achievement data and basic information data, as shown in Figure 4.1, and records in detail the card-swiping behaviours, book borrowing, academic achievements and basic personal information of students in each grade from 2019 to 2023.

2)

Data Processing

Missing values, abnormal values and duplicate values in the dataset are processed as follows. Missing values are filled with the sample mean or median, and records with too many missing values are deleted directly; abnormal values are corrected by replacing them with the mean; and duplicate records are deleted. Finally, Min-Max Normalisation is used to normalise the data.
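The cleaning steps can be sketched as follows. This is an illustrative sketch only: it shows median filling, duplicate removal and Min-Max Normalisation on plain Python lists, represents missing values as `None`, and uses hypothetical function names; the thresholds for dropping records and the outlier correction used in this study are not specified here and are therefore left out.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]

def drop_duplicates(records):
    """Remove repeated records while keeping the original order."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def min_max(values):
    """Min-Max Normalisation: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical example: one column of scores with a missing entry.
scores = min_max(fill_missing([10, None, 30, 20]))
```

Min-Max Normalisation puts features with different units, such as swipe counts and grades, on the same scale so that no single feature dominates distance-based methods.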

3)

Construction of Entropy of Students' Learning Behaviour

In this study, the 24 hours of a day are divided into 96 time segments of 15 minutes each, labelled 1 to 96, and behavioural entropy is used to assess the orderliness of students' behaviours: the larger the behavioural entropy, the more disordered the student's behaviour; the smaller the behavioural entropy, the more orderly the behaviour. The frequency of occurrence of a particular behaviour in a given time period is calculated as follows, where n_{v_{i, t_{r}}} denotes the number of occurrences of behaviour V_i in time period t_r.

(4.1)

P (V_{i, t_{r}}) = \frac{n_{v_{i, t_{r}}}}{\sum_{r = 1}^{T} n_{v_{i, t_{r}}}}, i = 1, 2, 3, \dots, m; r = 1, 2, \dots, T

Meanwhile, the behavioural entropy of Vi is defined as shown in Equation 4.2.

(4.2)

H (V_{i}) = - \sum_{r = 1}^{T} P (V_{i, t_{r}}) \log P (V_{i, t_{r}}), i = 1, 2, 3, \dots, m

Based on the above formula, the entropy of behaviour corresponding to students' monthly behaviours can be derived, which in turn can be used to analyse the differences in learning behaviours of students with different academic performances.
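A minimal sketch of the entropy calculation is given below. It assumes the usual Shannon form of entropy (a negative sum of p log p over the time periods, with natural logarithm) and frequencies normalised over the 96 time segments; the function names and the example timestamps are hypothetical.

```python
import math
from collections import Counter

def slot_of(hour, minute):
    """Map a time of day to its 15-minute segment label, from 1 to 96."""
    return hour * 4 + minute // 15 + 1

def behavioural_entropy(slots):
    """Shannon entropy of how one behaviour is spread over the time segments.

    slots: the segment labels at which the behaviour was observed.
    Assumes natural logarithm; segments with no occurrences contribute 0.
    """
    counts = Counter(slots)
    total = len(slots)
    entropy = 0.0
    for count in counts.values():
        p = count / total          # frequency of the behaviour in this segment
        entropy -= p * math.log(p)
    return entropy

# Hypothetical students: one always swipes at 08:00, one at scattered times.
regular = behavioural_entropy([slot_of(8, 0)] * 4)
scattered = behavioural_entropy(
    [slot_of(8, 0), slot_of(12, 30), slot_of(18, 45), slot_of(23, 10)]
)
```

The regular student has entropy 0, while the scattered student has the larger value, matching the interpretation that larger entropy means more disordered behaviour.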

4)

Analysis of Student Behaviour

The dataset is analysed using the above model to characterise the library behaviour, Internet behaviour and examination results of students with different academic performances. The library behaviour of students with different academic performances is shown in Figure 4.2. As the figure shows, students with better academic performance borrowed books from the library more often than students in the other categories.

Figure 4.3 demonstrates the distribution of Internet access behaviour of various types of students. Analysis of the chart reveals that those students with high academic performance usually display a more regular pattern of Internet access, which suggests that they are focused on their studies during the day and only occasionally use their mobile phones to access the Internet during break times. On the contrary, students with lower academic performance have a more disorganised online behaviour, which may mean that they frequently use their mobile phones for online activities in their studies and daily life.

We also analysed the online behaviour of students across the semesters of their four years of university; the corresponding behavioural entropy is shown in Figure 4.4. Behavioural entropy changes greatly between semesters. In the first semester there is little difference between the different types of students, whereas in the following semesters, as students gain more time to arrange their own activities, the behavioural entropy of their Internet use changes greatly, showing that their Internet use lacks regularity.

In addition, we analysed students' examination results, taking higher mathematics as a representative subject. The subject is divided into knowledge modules, each corresponding to different examination topics, and statistical analysis is used to obtain each student's mastery of the course; the results are shown in Figure 4.5. The figure shows that students have the best mastery of linear algebra and the weakest mastery of ordinary differential equations, so practice on ordinary differential equations should be strengthened in future teaching.

4.2

Personalised Learning Path Design Based on Machine Learning for Education Big Data

The aim of this study is to construct a student behavioural portrait and apply the K-means clustering algorithm to divide the students into groups, in order to gain a deeper understanding of the student population and to provide customised pedagogical support. The method is as follows. First, the number of student groups K is determined, and K data points are randomly selected as the initial cluster centres. Next, each data sample is assigned to the group represented by its nearest cluster centre, where the distance between a sample and a cluster centre is calculated as in Equation 4.3. The position of each cluster centre is then updated to the mean of all sample points within that group. The assignment and update steps are repeated until the cluster centres no longer change. Through this process, groups of students with well-defined characteristics are formed.

(4.3)

d (x_{i}, c_{j}) = \sqrt{\sum_{k = 1}^{n} {(x_{i k} - c_{j k})}^{2}}
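The clustering procedure can be sketched in plain Python as below. This is a minimal illustration, not the implementation used in the study: the function names, the fixed random seed, the iteration cap and the toy feature values are all assumptions for the example.

```python
import math
import random

def dist(x, c):
    """Euclidean distance between a sample and a cluster centre (Equation 4.3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def kmeans(points, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]  # K random initial centres
    labels = None
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centre.
        new_labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

# Hypothetical one-feature example, e.g. a normalised activity score per student.
points = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
labels, centres = kmeans(points, k=2)
```

The cluster indices themselves are arbitrary (group 0 and group 1 may swap between runs with different seeds); what matters is which students end up in the same group.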

In personalised learning recommendation, the similarity between students is first calculated, and personalised learning content is then provided to students based on that similarity. The similarity between students is calculated using the cosine similarity measure. Assuming that there are two students u and v whose learning behaviour feature vectors are x_u and x_v respectively, the cosine similarity is calculated as shown in Equation 4.4.

(4.4)

\mathrm{sim} (u, v) = \frac{x_{u} \cdot x_{v}}{\| x_{u} \| \| x_{v} \|}

In the formula, · denotes the dot product of vectors, and \| x_{u} \| and \| x_{v} \| are the norms (lengths) of the vectors x_u and x_v. The closer the cosine similarity is to 1, the more similar the two students are.

After calculating the similarity between the students, personalised recommendations can be provided to the students based on the similarity. The formula for personalised recommendation is shown below:

(4.5)

\hat{r}_{u i} = \frac{\sum_{v \in N (u)} \mathrm{sim} (u, v) \cdot r_{v i}}{\sum_{v \in N (u)} \mathrm{sim} (u, v)}

\hat{r}_{u i}: student u's predicted level of interest in course i,

N(u): the set of k students with the highest similarity to student u,

r_{v i}: student v's level of interest in course i.
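The similarity and recommendation formulas (4.4) and (4.5) can be sketched as follows. The sketch makes several assumptions that are not stated in the paper: feature vectors are non-negative, the neighbourhood N(u) is restricted to students who have a recorded interest level for the course, and the dictionary layout and function names are hypothetical.

```python
import math

def cosine_similarity(xu, xv):
    """Equation (4.4): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(xu, xv))
    norm = math.sqrt(sum(a * a for a in xu)) * math.sqrt(sum(b * b for b in xv))
    return dot / norm if norm else 0.0

def predict_interest(u, course, features, ratings, k=2):
    """Equation (4.5): similarity-weighted average over the k most similar students.

    features: {student: feature vector}; ratings: {student: {course: interest}}.
    Only students with a known interest level for the course are considered
    (an assumption made for this sketch).
    """
    sims = [(cosine_similarity(features[u], features[v]), v)
            for v in features if v != u and course in ratings.get(v, {})]
    neighbours = sorted(sims, reverse=True)[:k]  # N(u): top-k by similarity
    denom = sum(s for s, _ in neighbours)
    if denom == 0:
        return None  # no informative neighbours
    return sum(s * ratings[v][course] for s, v in neighbours) / denom

# Hypothetical example: student "u" resembles "v1" and is unlike "v2".
features = {"u": [1.0, 0.0], "v1": [1.0, 0.0], "v2": [0.0, 1.0]}
ratings = {"v1": {"calculus": 4.0}, "v2": {"calculus": 2.0}}
predicted = predict_interest("u", "calculus", features, ratings, k=2)
```

Because "v2" has zero similarity to "u", the prediction follows "v1" entirely, which shows how the weighting in (4.5) favours the most similar students.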

On this basis, the Python language and its related development frameworks are used together with Hadoop big data processing tools to implement personalised learning recommendation. The correctness, stability, performance and user satisfaction of the system were then evaluated. The results show that the personalised learning path recommendation passes the tests and performs well, providing students with a good learning experience.

5

Machine Learning-Based Education Big Data Analysis on Curriculum Reform Strategy

5.1

Machine Learning Technology and Curriculum Teaching Method Integration Innovation

1)

Machine learning is used to analyse educational big data to portray the learning habits and preferences of the student population, so as to explore more effective teaching methods, such as flipped classroom and project-based learning.

2)

Diversify teaching tools in the teaching process, for example, intelligent teaching assistants and virtual reality teaching can be used to increase students' immersion and participation.

3)

After class, teachers can reflect on the content and form of their teaching based on the results of machine learning analysis of educational big data, continuously improve their teaching strategies, and achieve the organic integration of teaching methods and technology.

5.2

Promote the Use of Personalised Learning Path Recommendation Systems in Schools

1)

By using machine learning to analyse educational big data on students' learning behaviours, performance histories and daily personal behaviours, a personalised learning model can be constructed for each student, and schools can deliver teaching content based on students' personalised recommendations.

2)

For students, the personalised learning path recommendation system allows them to more accurately obtain course content and learning sequences that match their learning styles, so that they can be taught according to their aptitude, which can better improve their learning efficiency.

3)

The personalised learning path recommendation system can track students' learning progress and feedback, and the school can continuously adjust the course content according to that feedback to ensure that the course content matches students' learning abilities.

5.3

Dynamic Optimisation of Course Content

1)

Use education big data analysis to assess students' familiarity with and mastery of different learning modules in the course content, identify the strengths and weaknesses in the course, and better improve the relevance of teaching.

2)

Obtain the degree of student interest in the teaching content of the course and regularly update the course materials based on these data, incorporating the latest academic research and industry trends to keep the teaching content current and practical.

6

Conclusion

This study provides data support and a scientific basis for curriculum reform by applying machine learning technology to analyse educational big data in depth. The results show that the machine learning model can effectively identify the patterns and trends of students' learning behaviours, making it possible to formulate personalised curriculum reform strategies. Predicting and analysing students' mastery of different knowledge modules gives schools a basis for designing curriculum content and teaching methods that better meet students' needs, and also significantly improves students' engagement in learning. In addition, the application of machine learning helps to address the uneven distribution of educational resources, providing equal learning opportunities for students of different backgrounds and promoting educational equity. However, the application of machine learning in education is still at an exploratory stage, and future research needs to further explore how to combine knowledge from multiple disciplines, such as educational psychology and cognitive science, to optimise the machine learning model and achieve deeper educational reform. At the same time, when applying machine learning technology, it is important to pay attention to data privacy and ethical issues to ensure the safety and privacy protection of student data. In summary, machine learning-based big data analysis in education provides a new perspective and approach to curriculum reform. Through accurate data-driven decision-making, it enables schools and teachers to meet students' learning needs more effectively, and promotes the development of education towards personalisation and precision.

Language: English

Publication frequency: 1 time per year
Journal subjects: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

Jing Zhang

Yongyan Fan

Published: 27 Feb 2025

Received: 20 Oct 2024

Accepted: 27 Jan 2025

DOI: https://doi.org/10.2478/amns-2025-0135

Keywords: machine learning, big data in education, teaching model, curriculum reform

© 2025 Jing Zhang et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
