An Optimization Study of Information Visualization Techniques in Educational Management Systems for Student Data Analysis

With the information explosion, the rapid growth in data volume makes the need for effective data visualization tools particularly urgent. Visualization research currently takes four forms: information visualization, data visualization, knowledge visualization, and scientific visualization [1-4].

Information visualization technology transforms data, information, and knowledge into graphical presentations that help people understand and analyze complex data patterns and associations [5-7]. It not only presents large amounts of data intuitively and clearly, but also helps people discover hidden patterns and gain insight into deeper information [8-9]. Building on citation analysis, social network analysis, and related methods, information visualization uses computer technology to process massive amounts of data and converts abstract, fuzzy information into a concrete, intuitive knowledge map, so that teachers can quickly understand a student’s learning situation and carry out targeted management and training [10-13].

Information visualization has long been applied in education, where it helps students track the pace and quality of their learning [14-17]. An education management system stores a wide variety of student-related data, but the data are messy and scattered, and their readability is poor. Improving the readability of these data and using them correctly to obtain the information that education requires is therefore very important, and this is where information visualization technology plays its role. Within the education management system, information visualization technology mines and displays the teaching data that need to be used and analyzed in the teaching process: it extracts and analyzes student data, improves their readability and focus, and helps teachers understand their students better so as to manage and cultivate them more effectively [18-20].

Information visualization is widely used in education. Literature [21] creates a visualization platform for student management that analyzes students’ daily learning and allocates resources rationally through visualization technology, making student management more convenient. Literature [22] reports that 3D visualization technology combined with a problem-based learning model can improve the learning efficiency and interest of spinal surgery students. The experimental results of literature [23] show that visualization technology can motivate students to learn and develop their critical thinking. Literature [24] studies the use of smart technologies for information visualization in the educational process and shows that visualization benefits the efficiency of education. Literature [25] summarizes the benefits of visualization technology for schools, teachers, and students: students understand their own development problems, teachers teach more precisely and learn about student progress in time, and schools use the information from both to make targeted adjustments to the goals and strategies of teaching management. These visualizations are inseparable from the information provided by the education management system, which records academic performance, personal performance, educational experience, and other information. Visual analysis based on the student data in the system is therefore a major step forward in education and teaching, and research on its optimization can continuously improve the level of school management and education.

This paper designs a system for analyzing student behavior based on clustering and association rules. First, data preprocessing yields the students’ static attribute data and various dynamic behavior data on the same scale. The original K-Means algorithm is then optimized on the basis of information entropy and density, improving both the selection of initial clustering centers and the quantification of each attribute feature’s contribution. The pruning step of the Apriori algorithm is also improved to reduce its computational cost. On this basis, the architecture and functional structure are designed to complete the construction of the student behavior analysis system. The system is applied to the teaching management of a university to analyze student behavior based on the resulting student portraits.

2

Behavior analysis system based on student data information

2.1

Student Behavioral Profile

A behavioral portrait, also known as behavioral data virtualization, is an effective tool for characterizing target users, associating user behaviors, and predicting behavioral trajectories. It is an important output form of behavioral data: its essence is to organize behavioral data into visual behavioral features that together constitute a user’s behavioral data model. Many fields now use the concept of the user behavioral portrait; in practice, it associates a user’s basic attributes, daily behaviors, and habits with labels that are easy to understand and closely tied to everyday life. A user behavioral portrait is a virtual representation of real user behavioral data. First, the real, specific behavioral data are classified into the user’s behavioral label categories; second, the user’s behavioral characteristics are described according to these categories; finally, a data-based user portrait is sketched out. From the perspective of the different dimensions of user data, a behavioral portrait applies secondary calculations to the massive raw behavioral data and reconstructs them in a new form that outlines the full picture of the user’s behavior. This addresses the problem of turning data into information for decision-making and progressively fills in the data structure of the portrait from the abstract to the detailed. The user behavioral portrait combines qualitative and quantitative methods. Qualitative methods generalize and abstract the static attributes and characteristics of users by analyzing their social situations, usage scenarios, and living habits; quantitative methods apply fine-grained statistical analysis and calculation to users’ attributes and characteristics to obtain a more accurate understanding of users.

2.2

Data pre-processing process

2.2.1

Data cleansing and data integration

Missing value analysis. First, analyze the reasons for the missing data, such as unstructured data that cannot be obtained or errors in the process of obtaining structured data. Second, analyze the impact of the missing data: for example, data mining models lose a great deal of useful information, and the patterns embedded in the data become difficult to capture. Finally, use the Pandas library for Python to read in the collected raw data, inspect its basic condition, and calculate the number of missing values, the number of non-missing values, the missing rate, and so on.
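The missing-value statistics described above can be sketched with Pandas roughly as follows. This is a minimal illustration rather than the system's actual code: the helper `missing_summary`, the column names, and the sample values are all hypothetical, and in practice the frame would be read from the collected raw data instead of being built inline.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column counts of missing and non-missing values plus the missing rate."""
    missing = df.isna().sum()
    present = df.notna().sum()
    return pd.DataFrame({
        "missing": missing,
        "not_missing": present,
        "missing_rate": missing / len(df),
    })

# Hypothetical student records; None becomes NaN in numeric columns.
records = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "library_visits": [12, None, 7, None],
    "gpa": [3.1, 3.6, None, 2.9],
})

summary = missing_summary(records)
print(summary)
```

Columns with a high missing rate can then be inspected before deciding whether to fill, drop, or re-collect them.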

Outlier analysis. Outlier analysis identifies individual values in the raw data that deviate significantly from the rest of the observations. It is mainly carried out using descriptive statistical analysis and box plot analysis; a box plot can detect outliers, as shown in Figure 1.

1)

Descriptive statistical analysis examines summary statistics of the data to identify abnormal values, for example by checking the maximum and minimum values in the data set to determine whether the data fall outside a reasonable range.

2)

In box plot analysis, outliers are defined as behavioral data less than Q_L − 1.5IQR or greater than Q_U + 1.5IQR. Here Q_L is the lower quartile, meaning that a quarter of all data objects have behavioral data less than this value; Q_U is the upper quartile, meaning that a quarter of all data objects have behavioral data greater than this value; and IQR is the interquartile range, the difference between the upper quartile Q_U and the lower quartile Q_L, which contains half of all data objects. The larger the IQR, the greater the variability of the behavioral data; the smaller the IQR, the smaller the variability.
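The box plot rule above can be sketched in a few lines of standard-library Python. The function name `iqr_outliers`, the sample values, and the choice of the "inclusive" quartile method are assumptions made for illustration; other quartile conventions give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values below Q_L - k*IQR or above Q_U + k*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q_l, _, q_u = quantiles(values, n=4, method="inclusive")
    iqr = q_u - q_l
    lower, upper = q_l - k * iqr, q_u + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical daily study minutes for one student; 60 is far from the rest.
daily_study_minutes = [10, 12, 11, 13, 12, 11, 14, 13, 12, 60]
outliers = iqr_outliers(daily_study_minutes)
print(outliers)
```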

Data integration refers to the process of logically or physically centralizing data sources that differ in content, format, and quality; its main purpose is to solve the problems of data distribution and heterogeneity. Because multiple data sources store data in different ways and their data types do not match, the original data are converted, refined, and integrated only after the problems of data object identification and attribute redundancy have been fully considered.

1)

Identification of data objects

Data object identification refers to identifying the same real-world data objects across data sources that differ in content and format. Stored data commonly exhibit three kinds of problems: homonyms, synonyms, and inconsistent units. A homonym means that the same behavioral data name in different data sources describes different data objects; a synonym means that the same data object is described by different behavioral data names in different data sources; and inconsistent units means that the behavioral data of the same data object are expressed in international units in one source and in traditional units of measurement in another. The purpose of identifying data objects is therefore to detect and resolve these inconsistencies in how data are stored across sources.

2)

Identify redundant attributes

Data integration often makes the attributes of data objects redundant: the behavioral data of the same data object may appear multiple times, or inconsistent naming of the same behavioral data may lead to duplication. Integrating data from different sources carefully, so that redundancy and inconsistency in the behavioral data are minimized or avoided altogether, improves the quality of data mining. Redundant behavioral data that appear after integration should first be detected and analyzed, and only then deleted.

2.2.2

Data transformation and data reduction

To unify the different scales and value ranges of the behavioral data of the data objects, the data must be normalized and scaled so that the values fall within a specific interval, which facilitates comprehensive analysis of the data objects.

1)

Min-max normalization

Min-max normalization, also called deviation normalization, applies a linear transformation that maps the behavioral data of the data objects into the interval $[0, 1]$. It preserves the original relationships among the behavioral data and is the simplest way to eliminate the influence of scale and value range. The conversion is shown in Equation (1):

$x^{*} = \frac{x - \min}{\max - \min} \qquad (1)$

where min and max denote the minimum and maximum values of the behavioral data.

2)

Standard deviation normalization

Standard deviation normalization, also called zero-mean normalization, produces behavioral data with a mean of 0 and a standard deviation of 1. The conversion is shown in Equation (2):

$x^{*} = \frac{x - \bar{x}}{\sigma} \qquad (2)$

where $x^{*}$ denotes the value of the behavioral data after standard deviation normalization; $x$ denotes the initial value of the behavioral data; $\bar{x}$ denotes the mean of the behavioral data; and $\sigma$ denotes the standard deviation of the behavioral data.
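Both normalizations can be sketched directly from Equations (1) and (2). The function names are illustrative, the population standard deviation is assumed for σ because the text does not say which estimator is used, and mapping a constant column to zeros is likewise an assumption to avoid division by zero.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Equation (1): map values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant column maps to 0
    return [(v - lo) / (hi - lo) for v in values]

def z_score_normalize(values):
    """Equation (2): subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

scaled = min_max_normalize([2, 4, 6])
standardized = z_score_normalize([2, 4, 6])
print(scaled, standardized)
```

After z-score normalization the values have mean 0 and standard deviation 1, which puts attributes measured in different units on a comparable scale before clustering.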

Discretizing continuous behavioral data essentially means determining the split points of the data values and mapping the values to the corresponding subintervals. Commonly used methods include equal-width binning, equal-frequency binning, and methods based on cluster analysis.

1)

Equal-width binning

Equal-width binning divides the value range of continuous behavioral data into subintervals of the same width.

2)

Equal-frequency binning

Equal-frequency binning divides the behavioral data so that each subinterval contains the same number of values.

3)

Method based on cluster analysis

The method based on cluster analysis applies a clustering algorithm to the continuous behavioral data and then gives all values that fall in the same cluster the same label. When behavioral data are discretized in this way, the number of clusters, and thus the number of intervals into which the data are divided, must be specified manually.
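The difference between the first two binning methods is easiest to see on skewed data. The sketch below is a simple illustration with hypothetical function names and sample values; it assumes the values are not all equal, and it leaves out the cluster-based method, which would reuse a clustering routine.

```python
def equal_width_bins(values, k):
    """Label each value with the index of one of k equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumption: hi > lo
    # The maximum value is placed in the last bin rather than a bin of its own.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Label values so that each of the k bins holds about the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels

# Hypothetical monthly spending values with two large observations.
spending = [1, 2, 3, 4, 100, 200]
width_labels = equal_width_bins(spending, 3)
frequency_labels = equal_frequency_bins(spending, 3)
print(width_labels, frequency_labels)
```

On this sample, equal-width binning puts the four small values in one bin, while equal-frequency binning spreads the values evenly across the three bins.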

Data reduction improves the efficiency of data analysis and mining by constructing a new, smaller data set while maintaining the integrity of the original data. Commonly used data reduction methods fall into two types: attribute reduction and numerical reduction.

2.3

Data mining process

2.3.1

Data mining steps

Starting from the content and characteristics of the data, data mining consists of nine steps: problem definition, data screening, data cleaning, data transformation, algorithm selection, data mining, pattern interpretation, knowledge evaluation, and knowledge presentation.

1)

Problem Definition: Understand the data and business requirements, put forward the problem to be solved by data mining, determine the goal of data mining, and select the knowledge model of data mining.

2)

Data Screening: Explore all the data information related to the specific business, determine the data mining theme, and select the target dataset.

3)

Data Cleaning: Clean the noisy, missing, and redundant data in the business databases to improve the accuracy of data mining.

4)

Data Transformation: Process the target data from different business databases to resolve inconsistencies in data format, content, and type.

5)

Algorithm selection: Selecting appropriate knowledge discovery patterns based on the characteristics of the data itself and the ultimate goal of data mining.

6)

Data mining: Use knowledge discovery and data mining algorithms to extract knowledge patterns of interest to users and display them using data visualization techniques.

7)

Pattern Interpretation: Judge whether the data-mined knowledge patterns can satisfy the user’s needs, analyze the reasons for the problems, iterate the processing steps repeatedly, extract the relevant knowledge, and eliminate the redundant or irrelevant patterns.

8)

Knowledge Evaluation: Present the discovered knowledge patterns to the user in a way that is easy to understand.

9)

Knowledge presentation: use data visualization techniques to display effective data mining patterns.

2.3.2

Correlation analysis studies

Correlation analysis is carried out by calculating the correlation coefficient between data objects. Commonly used measures are the Pearson correlation coefficient, the Spearman rank correlation coefficient, and the coefficient of determination.

1)

Pearson correlation coefficient method

The Pearson correlation coefficient is generally used to analyze the correlation between the behavioral data of data objects. It is calculated as shown in Equation (3):

$r = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}} \qquad (3)$

where $x_{i}$ and $y_{i}$ are the paired observations, $\bar{x}$ and $\bar{y}$ are their means, and n is the number of pairs. The correlation coefficient takes values in the range −1 ≤ r ≤ 1, and different values of r indicate different degrees of correlation between the behavioral data of the data objects:

$\begin{cases} r > 0 & \text{positive correlation} \\ r < 0 & \text{negative correlation} \\ | r | = 0 & \text{no linear relationship} \\ | r | = 1 & \text{perfectly linear relationship} \end{cases} \qquad (4)$

When 0 < |r| < 1, different values of |r| indicate different degrees of linear correlation between the behavioral data of the data objects:

$\begin{cases} r > 0 & \text{positive correlation} \\ r < 0 & \text{negative correlation} \\ | r | = 0 & \text{no linear relationship} \\ | r | = 1 & \text{perfectly linear relationship} \end{cases} \qquad (5)$
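Equation (3) translates directly into code. The sketch below uses only the standard library; the function name `pearson_r` and the sample series are illustrative, and the function assumes that neither series is constant, since the denominator would otherwise be zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)  # assumption: neither series is constant

# Hypothetical paired observations, e.g. weekly study hours and quiz scores.
hours = [1, 2, 3, 4]
r_positive = pearson_r(hours, [2, 4, 6, 8])   # perfectly linear, increasing
r_negative = pearson_r(hours, [8, 6, 4, 2])   # perfectly linear, decreasing
print(r_positive, r_negative)
```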

2.4

K-Means Cluster Analysis of Student Data

2.4.1

Traditional K-Means Clustering Algorithm Flow

K-Means is a clustering algorithm that partitions data elements according to their differences. It first selects K initial clustering centers from the sample set, then calculates the distance between each sample data object and the centers according to the rules of the algorithm, groups the sample data objects according to the results, and iterates until the clustering centers no longer change, yielding K clusters. The K-Means clustering algorithm is implemented as follows.

Input: a sample set S with n data objects and the number of clusters K:

$S = \{x_{1}, x_{2}, \dots, x_{n}\}, \quad K = \{c_{1}, c_{2}, \dots, c_{k}\} \qquad (6)$

Output: K clusters that minimize the criterion function.

The processing flow is as follows:

1)

Arbitrarily select K samples from the sample set S as the initial clustering centers.

2)

Calculate the distance d between each sample and each clustering center, where a center is the mean of the samples in its cluster (the Euclidean distance is used in this paper), and then reassign each sample to the cluster whose center is nearest. For two p-dimensional sample data points $x_{i} = (x_{i 1}, x_{i 2}, \dots, x_{i p})$ and $x_{j} = (x_{j 1}, x_{j 2}, \dots, x_{j p})$, the Euclidean distance between them is defined as in Equation (7):

$d (x_{i}, x_{j}) = \sqrt{{(x_{i 1} - x_{j 1})}^{2} + {(x_{i 2} - x_{j 2})}^{2} + \dots + {(x_{i p} - x_{j p})}^{2}} \qquad (7)$

The average distance over all pairs of sample data points is given by Equation (8):

$\mathrm{Meandist} (S) = \frac{2}{n (n - 1)} \times \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} d (x_{i}, x_{j}) \qquad (8)$

3)

Recalculate the mean (center object) of each cluster that has been obtained.

4)

Repeat steps 2) and 3) until the value of the objective function no longer changes or its change is less than a specified threshold.

The objective function is the squared error criterion function, defined as in Equation (9):

$\sigma_{i} = \sqrt{\frac{\sum_{x_{j} \in C_{i}} {(x_{j} - c_{i})}^{2}}{| C_{i} | - 1}} \qquad (9)$

where c_i is the centroid of the data objects of the same class, defined as in Equation (10):

$c_{i} = \frac{1}{| C_{i} |} \sum_{x_{j} \in C_{i}} x_{j} \qquad (10)$

Here c_i denotes the center of the i-th cluster and $| C_{i} |$ is the number of data objects in class C_i.

5)

The algorithm ends and K clusters are obtained.

The flow of the above K-Means algorithm implementation is shown in Fig. 2.
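The traditional procedure can be sketched in plain Python as follows. This is an illustrative version, not the system's implementation: the function names, the fixed random seed, and the sample points are assumptions, and the loop stops when the assignments no longer change, which is a common stand-in for checking that the objective function has stopped changing.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points, as in Equation (7)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def k_means(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(samples, k)  # step 1: arbitrary initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(x, centers[c]))
            for x in samples
        ]
        if new_assignment == assignment:  # step 4: nothing changed, stop
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [x for x, a in zip(samples, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Two well-separated groups of hypothetical two-dimensional samples.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, k=2)
print(centers, labels)
```

Because the initial centers are drawn at random, different seeds can lead to different results on less clearly separated data, which is the instability that the optimized initialization in the next subsection is meant to reduce.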

2.4.2

K-Means Clustering Algorithm Based on Information Entropy and Density Optimization

1)

The concept of information entropy

Information entropy, a concept from information theory, measures the amount of information required to eliminate uncertainty. It is often used as a quantitative indicator of the information content of a system and can further serve as an optimization objective or as a criterion for parameter selection. In decision tree generation, for example, entropy is the criterion for choosing the optimal attribute on which to split the samples. The concept is briefly introduced below.

Suppose the sample data set is $S = \{x_{1}, x_{2}, \dots, x_{n}\}$ with probabilities $p_{i} = P [X = x_{i}]$. The self-information of a sample data point can be expressed as $I (x_{i}) = \log \frac{1}{p_{i}}$, and the information entropy of the sample data set as $H (X) = \sum_{i = 1}^{n} p_{i} \log \frac{1}{p_{i}}$, with the convention 0 log(0) = 0. Information entropy measures the amount of information: the greater the uncertainty, the greater the amount of information and the greater the entropy; the smaller the uncertainty, the smaller the amount of information and the smaller the entropy [26].
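As a small illustration of the definition above, the entropy of a list of observed values can be estimated from their relative frequencies. The function name is hypothetical, and base-2 logarithms are an assumption, since the text does not fix the base.

```python
import math
from collections import Counter

def information_entropy(values, base=2):
    """H(X) = sum of p_i * log(1 / p_i) over the distinct observed values."""
    counts = Counter(values)
    n = len(values)
    # Values that never occur contribute nothing, matching 0 log(0) = 0.
    return sum((c / n) * math.log(n / c, base) for c in counts.values())

balanced = information_entropy(["a", "a", "b", "b"])  # maximal uncertainty for two values
certain = information_entropy(["a", "a", "a", "a"])   # no uncertainty
print(balanced, certain)
```

An evenly split attribute carries the most uncertainty, whereas a constant attribute has zero entropy.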

2)

Optimization algorithm idea

To address the shortcomings of the traditional K-Means clustering algorithm, this paper clusters the objects under study with a K-Means algorithm optimized on the basis of information entropy and density. The general idea is as follows: instead of directly calculating the Euclidean distance between two data objects as the traditional K-Means algorithm does, the optimized algorithm first calculates the information entropy of each attribute of the data objects and uses it to weight the Euclidean distance formula. The value of the criterion function is then calculated with the weighted distance to determine the initial clustering centers, after which clustering of the data object set continues.

3)

Optimization algorithm attribute weight calculation method

By analyzing the contribution of each attribute to the clustering of the data objects, the entropy method is used to calculate the weight of each attribute feature, so that the Euclidean distance between data points is calculated more precisely for clustering. The steps for calculating the feature weights with the entropy method are as follows.

Step 1 Construct the attribute value matrix as shown in Equation (11):

$A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1 m} \\ a_{21} & a_{22} & \dots & a_{2 m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n 1} & a_{n 2} & \dots & a_{n m} \end{bmatrix} \qquad (11)$

In Equation (11), n denotes the number of sample data objects (rows) and m denotes the number of attribute dimensions of a data object (columns).

Step 2 Calculate the weight of the attribute value of the i-th data object in the j-th dimension [27]. The data are first compressed into the interval [0, 1], as shown in Equation (12):

$M_{i j} = a_{i j} / \sum_{i = 1}^{n} a_{i j} \qquad (12)$

In Equation (12), a_ij represents the attribute value and M_ij represents the attribute value weight, where i = 1, 2, ⋯, n and j = 1, 2, ⋯, m.

Step 3 Calculate the entropy of the j-th dimensional attribute as shown in Equation (13):

$H_{j} = - \frac{1}{\ln n} \sum_{i = 1}^{n} M_{i j} \ln M_{i j} \qquad (13)$

When M_ij = 0, M_ij ln M_ij is taken to be 0. If all a_ij are equal for a given j, then $M_{i j} = a_{i j} / \sum_{i = 1}^{n} a_{i j} = 1 / n$, at which point H_j takes its maximum value, H_j = 1.

Step 4 Calculate the coefficient of variation of the j-th dimensional attribute as shown in Equation (14):

$q_{j} = 1 - H_{j} \qquad (14)$

For a given j, the smaller H_j is, the larger q_j is and the more important the attribute; the larger H_j is, the smaller q_j is and the smaller the attribute’s role in clustering. When H_j = 1, q_j = 0 and the attribute plays no role in clustering.

Step 5 Calculate the weight of the j-th dimensional attribute as shown in Equation (15):

$w_{j} = q_{j} / \sum_{j = 1}^{m} q_{j} \qquad (15)$

Step 6 Use the Euclidean distance formula to calculate the similarity between data objects. With the attribute weights applied, the weighted Euclidean distance is as shown in Equation (16):

$d_{w} (x_{i}, x_{j}) = \sqrt{\sum_{p = 1}^{m} w_{p} {(x_{i p} - x_{j p})}^{2}} \qquad (16)$

In Equation (16), w_p is the weight of the attribute in the p-th dimension. The weighting effectively enlarges or shrinks the attribute values, so that attributes with large weights play a greater role in clustering and attributes with small weights play a smaller role.
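The weight calculation in Steps 1 to 6 can be sketched as below. The sketch assumes that each row is one data object, that attribute values are non-negative with a non-zero column sum, and that at least one attribute is not constant; the function names and sample matrix are illustrative.

```python
import math

def entropy_weights(rows):
    """Entropy-method weights for the columns of an n x m attribute matrix."""
    n, m = len(rows), len(rows[0])
    q = []
    for j in range(m):
        column = [row[j] for row in rows]
        total = sum(column)  # assumption: total > 0
        shares = [a / total for a in column]  # Eq. (12)
        # Eq. (13); terms with a zero share are skipped (0 ln 0 = 0).
        h = -sum(s * math.log(s) for s in shares if s > 0) / math.log(n)
        q.append(1.0 - h)  # Eq. (14)
    q_sum = sum(q)  # assumption: not every attribute is constant
    return [qj / q_sum for qj in q]  # Eq. (15)

def weighted_distance(x, y, w):
    """Weighted Euclidean distance, Eq. (16)."""
    return math.sqrt(sum(wp * (a - b) ** 2 for wp, a, b in zip(w, x, y)))

# The first attribute is constant, so it should receive (almost) no weight.
matrix = [(1, 1), (1, 2), (1, 3), (1, 6)]
weights = entropy_weights(matrix)
print(weights, weighted_distance(matrix[0], matrix[3], weights))
```

A constant attribute has entropy 1 and therefore weight 0, so it no longer influences the distance between data objects.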

Step 7 Use the standard deviation as the measurement function to obtain the weighted objective function of each class, as shown in Equation (17):

$\sigma_{i} = \sqrt{\frac{\sum_{a \in C_{i}} d_{w} (a, c (C_{i}))}{| C_{i} | - 1}} \qquad (17)$

In Equation (17), σ_i denotes the weighted standard deviation of class i, $c (C_{i})$ is the center of class C_i, and $| C_{i} |$ is the number of data objects contained in C_i. As Equation (17) shows, the smaller the value of σ_i, the greater the similarity of the data objects within the class, the denser the data objects, and the better the centroid of the class reflects the classification decision surface.

4)

Optimization algorithm initial clustering center selection method

The traditional K-Means clustering algorithm mostly selects its initial clustering centers at random or by empirical judgment. Random selection makes the algorithm unstable and may trap it in a local optimum, which leads to inaccurate clustering results. The K-Means algorithm based on information entropy and density optimization studied in this paper measures the similarity of data points with the weighted Euclidean distance and, when selecting clustering centers, calculates the objective function of each candidate and sorts the candidates by its value to obtain more accurate initial clustering centers. The initial clustering center selection proceeds as follows.

(1)

Randomly select a specified number $k'$ $(k' > k)$ of data points as candidate initial clustering centers;

(2)

Scan the data set and calculate the weighted Euclidean distances between the remaining data points and the $k'$ candidate centers, sort them by distance, and assign each data point to the category of its nearest center;

(3)

Calculate the weighted objective function value σ_i of each of the $k'$ categories, arrange the σ_i values from smallest to largest, and take the clustering centers corresponding to the first k values of σ_i as the initial clustering centers of the optimized clustering algorithm.

It can be seen that the optimized algorithm selects the initial clustering centers simply and quickly: the seed centers are determined in a single pass by sorting the weighted objective function values. Unlike the random selection of the traditional algorithm, this greatly reduces the chance that several initial clustering centers fall in the same cluster.
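One possible reading of selection steps (1) to (3) is sketched below. Several details are assumptions rather than statements from the paper: each candidate point is treated as the center of its own category, a category with fewer than two members is ranked last, and the function names, seed, weights, and sample points are hypothetical.

```python
import math
import random

def weighted_distance(x, y, w):
    """Weighted Euclidean distance with attribute weights w."""
    return math.sqrt(sum(wp * (a - b) ** 2 for wp, a, b in zip(w, x, y)))

def select_initial_centers(samples, w, k, k_prime, seed=0):
    rng = random.Random(seed)
    candidates = rng.sample(samples, k_prime)  # step (1): k' > k candidates
    groups = [[] for _ in range(k_prime)]
    for x in samples:  # step (2): each point joins its nearest candidate
        nearest = min(range(k_prime),
                      key=lambda c: weighted_distance(x, candidates[c], w))
        groups[nearest].append(x)
    scored = []
    for center, members in zip(candidates, groups):  # step (3): score categories
        if len(members) < 2:
            sigma = math.inf  # assumption: near-empty categories rank last
        else:
            spread = sum(weighted_distance(a, center, w) for a in members)
            sigma = math.sqrt(spread / (len(members) - 1))
        scored.append((sigma, center))
    scored.sort(key=lambda pair: pair[0])  # smallest sigma first
    return [center for _, center in scored[:k]]

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0),
          (5.1, 5.2), (5.2, 5.1), (9.0, 0.0), (9.1, 0.2)]
initial = select_initial_centers(points, w=[0.5, 0.5], k=2, k_prime=4)
print(initial)
```

The returned points would then replace the random initial centers in the standard K-Means loop.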

2.5

Analysis of association rules for student data

2.5.1

Traditional association rules

Association rules are introduced to uncover interesting connections among the various types of student data, which can then inform self-improvement and decision-making by both students and the school. Some basic concepts of association rules are introduced below.

1)

Item set

Taking a Wal-Mart supermarket as an example, all the items sold form a set, which is the item set, and each product is an item. This is expressed as Equation (18):

$I = \{i_{1}, i_{2}, i_{3}, \dots, i_{n}\} \qquad (18)$

where I denotes the item set, which contains all the items, and i_j denotes a specific item.

2)

Database

Again taking the Wal-Mart supermarket as an example, the database is made up of the purchase lists of the individual customers, each list being one basic unit of data, as shown in Equation (19):

$D = \{t_{1}, t_{2}, t_{3}, \dots, t_{n}\} \qquad (19)$

where t_j is a purchase record consisting of a number of items in I, i.e., t_j ⊆ I. Each t_j is called a transaction, and all the transactions make up the database D.

3)

Rule

An association rule has the form X ⇒ Y, where X ⊆ I, Y ⊆ I, X ≠ ∅, Y ≠ ∅, and X ∩ Y = ∅. That is, X and Y are non-empty subsets of I with no items in common; when Y can be deduced from X, X and Y are said to be associated.

4)

Support: the number of times a group of associated items appears together in the data set, as a proportion of the whole data set. When searching for association rules, we need associations that are backed by a high frequency. For the rule that customers who buy diapers are likely to buy beer, for example, we must first make sure that diapers and beer are purchased together in the database with a certain frequency; otherwise the rule is meaningless. Support is therefore defined as in Equation (20):

$\mathrm{Support} (X, Y) = P (X Y) = \frac{\mathrm{number} (X Y)}{\mathrm{number} (\mathrm{AllSamples})} \qquad (20)$

In other words, before studying the association between X and Y, we must make sure that X and Y occur together in the database often enough, because a rare co-occurrence cannot guarantee the reliability of the association. As Equation (20) shows, the support of X and Y is the number of transactions in which X and Y occur together divided by the total number of transactions.

5)

Confidence: the probability that one item appears given that another has appeared, i.e., a conditional probability.

When the support of X and Y meets the minimum support we set, we must then verify whether the association between them really exists, which is assessed by the confidence. Confidence is defined as in Equation (21):

$\mathrm{Confidence} (X \Rightarrow Y) = P (Y \mid X) = \frac{P (X Y)}{P (X)} \qquad (21)$

To summarize, association rule mining is carried out in two steps: the first finds the frequent item sets, and the second generates the association rules.

Suppose $\{A, B, C\}$ is a combination to be mined for association rules. Before mining the relationships within it, we must make sure that the frequency with which $\{A, B, C\}$ occurs in the data set reaches a preset minimum, i.e., Support(A, B, C) ≥ P, where P denotes a minimum proportion or a minimum number of occurrences. This ensures that the combination does not occur merely by chance. P must be set reasonably: if chance occurrences cannot be excluded, the mined association rules will be meaningless and will only add to the computation.

The second step generates the association rules and checks the confidence of the combination. As before, a minimum confidence is chosen manually, and Confidence(A, B ⇒ C) must reach it to show that the occurrence of A and B leads to the occurrence of C. Only a combination that passes both steps can be regarded as an association rule.
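Support and confidence can be computed over a small transaction database as sketched below. Confidence is taken here as the conditional probability of the consequent given the antecedent, that is, the support of both together divided by the support of the antecedent. The transactions and function names are hypothetical, and the antecedent is assumed to occur at least once.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = support of both / support of antecedent."""
    both = set(antecedent) | set(consequent)
    # Assumption: the antecedent occurs in at least one transaction.
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical purchase records in the spirit of the diapers-and-beer example.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"milk"},
]
pair_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(pair_support, rule_confidence)
```

A rule would be kept only if both values reach the manually chosen minimum support and minimum confidence.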

2.5.2

Apriori algorithm and its improvement

The core idea of the Apriori algorithm is to derive the (K + 1)-itemsets from the K-itemsets, instead of searching the complete dataset for (K + 1)-itemsets that meet the minimum support, as the original rule-mining approach does. It relies on the following property: if a (K + 1)-itemset is frequent, then all of its K-item subsets must also be frequent. The core process of the Apriori algorithm is as follows:

1)

Connection step

The Apriori algorithm is based on the a priori property of frequent itemsets, so the candidate set $C_k$ can be obtained by joining $L_{k-1}$ with itself. Let $I_j$ and $I_k$ be two itemsets in $L_{k-1}$ whose items are arranged in lexicographic order. $I_j$ and $I_k$ can be joined if their first $k - 2$ items are identical, $(I_{j} [1] = I_{k} [1] \land I_{j} [2] = I_{k} [2] \land \dots \land I_{j} [k - 2] = I_{k} [k - 2])$, and, to ensure that the join does not produce duplicates, $I_j[k - 1] < I_k[k - 1]$.

2)

Pruning step

The $C_k$ generated by the connection step must be pruned to remove infrequent itemsets before its support is counted. Pruning relies on the a priori property: all non-empty subsets of a frequent itemset must also be frequent. We therefore prune $C_k$ according to $L_{k-1}$: if a $(k - 1)$-item subset of a candidate in $C_k$ is not in $L_{k-1}$, that subset is not frequent, so the candidate containing it must also be infrequent and can be removed directly. This greatly reduces the amount of computation.
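
The connection and pruning steps can be sketched in Python as follows. This is a minimal sketch of the standard Apriori candidate generation, not of the improved pruning introduced below; the function name `apriori_gen` and the example itemsets are assumptions for illustration.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets C_k from the frequent (k-1)-itemsets L_{k-1}."""
    # Store every itemset as a sorted tuple so items are in lexicographic order.
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            # Connection step: the first k-2 items agree and the last item
            # of a precedes the last item of b, so no duplicate is produced.
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                cand = a + (b[k - 2],)
                # Pruning step: every (k-1)-item subset must be in L_{k-1}.
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates


# Hypothetical frequent 2-itemsets.
L2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
C3 = apriori_gen(L2, 3)
```

Here the joins produce ABC, ABD, ACD and BCD; ACD and BCD are pruned because their subset CD is not in the hypothetical $L_2$, so only ABC and ABD remain as candidates.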

Variable description: $| C_{k} |$ denotes the number of itemsets in $C_k$, $| L_{k - 1} |$ denotes the number of itemsets in $L_{k-1}$, ${(C_{k})}_{i}$ denotes the i-th itemset of $C_k$, and ${(L_{k - 1})}_{j}$ denotes the j-th itemset of $L_{k-1}$. The number of $(k - 1)$-item subsets generated by each itemset in $C_k$ is $C_{k}^{1}$ (here $C_{k}^{1}$ is the combination number, not the candidate set). Thus the computational cost of the pruning step in the original Apriori algorithm can be expressed as:

$\sum_{{(C_{k})}_{1}}^{{(C_{k})}_{| C_{k} |}} C_{k}^{1} | L_{k - 1} |$ (22)

The computational cost after the improvement is:

$\sum_{{(L_{k - 1})}_{1}}^{{(L_{k - 1})}_{| L_{k - 1} |}} | C_{k} | C_{k}^{1}$ (23)

The whole pruning process is as follows: each itemset in $L_{k-1}$ is checked against the subsets of every candidate in $C_k$, and a value is assigned to that candidate according to the matching result. Two cases are distinguished:

1)

When $| L_{k - 1} | - C_{k}^{1} + 1 > C_{k}^{1}$, that is, when the number of non-matching itemsets encountered while matching $L_{k-1}$ against a candidate in $C_k$ exceeds the number of subsets $C_{k}^{1}$ of each candidate, the pruning process uses the count of matches judged to be compliant to eliminate the non-compliant candidates in $C_k$.

2)

When $| L_{k - 1} | - C_{k}^{1} + 1 < C_{k}^{1}$, the pruning process instead uses the count of matches judged to be non-compliant to eliminate the non-compliant candidates in $C_k$ [28].

2.6

Student Behavior Analysis System Architecture and Functional Structure Design

This paper designs and builds a student behavior analysis system that incorporates the statistical analysis of student behavior, the cluster analysis of student behavior, and the association analysis between students’ behavioral habits and their performance. Through this system, data from the databases of the various application systems are analyzed and processed in a unified way to improve the efficiency of administrators.

This paper is based on campus logistics data and student performance data. At present these data are merely stored in separate databases and are not fully utilized; they should be integrated and analyzed to uncover potential information that has not yet been discovered. Therefore, we build a big data platform, use the improved algorithms to process the logistics data in parallel on Spark, mine students’ behavioral habits, and apply the analysis results to campus early warning and the determination of poor students. The data mining results are stored in a NoSQL database, and individual student portraits and student group portraits are displayed through Web front-end and back-end technologies. This helps school administrators follow student dynamics in a timely manner and provides data support for decision-making, such as the formulation of campus guidelines. The architecture of the Hadoop-based student behavior analysis system is shown in Figure 3. It consists of five layers: the preprocessing layer, storage layer, processing layer, application layer, and visualization layer.

The layers of the system architecture are described below: 1)

Pre-processing layer. Logistics data, such as campus one-card data and campus takeout data, are pre-processed to provide reliable data for subsequent analysis.

2)

Storage layer. The pre-processed data and the mining results are stored using technologies such as Hive (a distributed data warehouse), HDFS (a distributed file system), and distributed databases, which provide reliable storage for mining students’ behavioral habits and for the applications built on the results.

3)

Processing layer. Using the Spark distributed computing engine, the optimized clustering algorithm and association rule algorithm analyze and process the logistics data and grade data to mine the correlation between students’ behavioral habits and their grades. The same engine also provides more efficient computation for data queries, statistics and other functions.

4)

Application layer. By analyzing the clustering results and association rule results, this layer identifies students’ behavioral habits, consumption levels, and the relationship between behavioral habits and grades, and applies them to the determination of poor students, academic early warning, Internet early warning, security early warning, and so on.

5)

Visualization layer. Django and Bootstrap are used as the back-end and front-end technologies to display the data mining results and realize data visualization. The results of the statistical analysis and of the mining algorithms are displayed as charts.

The student behavior analysis system has five modules, as shown in Figure 4: student personal portrait, student group portrait, comprehensive early warning analysis, determination of poor students, and user management. The personal portrait and group portrait modules provide statistical analysis, cluster analysis, association rule analysis and descriptions of students’ behavior at school, including one-card consumption analysis, campus takeout consumption analysis, library borrowing analysis, Internet access analysis and academic performance analysis. Comprehensive early warning is based on the results of the statistical analysis and the association rule algorithm, and includes academic warning, Internet warning, and safety warning. The determination of poor students uses clustering algorithms on one-card consumption and campus takeout consumption to divide students into groups with different consumption levels, and takes the low-consumption groups as the candidate groups of poor students. User management sets different permissions for different personnel, which helps campus managers administer the platform.

The comprehensive portrait of students includes the group portrait and the individual portrait. It uses the behavioral data generated by students at school for statistical analysis, cluster analysis and association analysis, and describes students’ situation at school in charts and graphs, which helps school administrators understand students’ status more comprehensively and makes management more convenient and efficient. The group portrait includes one-card consumption analysis, campus takeout analysis, library borrowing and access analysis, Internet time analysis, etc. It is a statistical analysis of all student data that shows the overall situation of different behaviors among all students. The individual portrait covers a single student: statistical analysis and association rule analysis are used to understand that student’s situation at school in detail, and the conditions of the portrait can be set by time, name, student number and other criteria.

3

Student profiling and data mining applications

The student behavior analysis system designed in this paper was applied at a university to verify its effectiveness. The data come from public data of that university.

3.1

Test of student portrait construction based on feature indicators

3.1.1

Correlation analysis between gender and student achievement

The influence of gender on students’ performance has always attracted much attention, and related studies have shown that there are some differences between boys and girls in intellectual development, comprehension, imagination, calculation, logic, memory and emotion at different stages. The comparison of male and female students’ academic performance in public and specialized courses is shown in Table 1, and Table 2 shows the distribution of their grades.

Table 1.

Comparison of male and female students’ grades

Gender	Mean (public courses)	Mean (specialized courses)	Mean (overall)	Variance (public courses)	Variance (specialized courses)	Variance (overall)
Male	72.41	69.59	71.64	100.26	212.71	96.67
Female	78.75	75.84	77.88	56.84	109.68	48.59

Table 2.

Comparison of grade distributions by gender

Gender	Public courses <60	Public courses 60-80	Public courses 80-100	Specialized courses <60	Specialized courses 60-80	Specialized courses 80-100
Male	14.00%	62.00%	24.00%	22.00%	58.00%	20.00%
Female	6.00%	48.00%	46.00%	8.00%	66.00%	26.00%

It can be seen that, for both the average score in public courses and the average score in specialized courses, girls perform better overall than boys, and the variance of girls’ results is smaller, indicating that their results are more stable than boys’.

Figure 5 compares the distribution of grades by gender, where (a) shows the performance in public courses and (b) the performance in specialized courses. The yellow line indicates girls’ grades and the purple line indicates boys’ grades; girls’ grades are clearly better than boys’. Overall, girls have fewer failures and a greater proportion of high grades, while boys’ grades are mostly in the middle range.

3.1.2

Analysis of the correlation between daily routine and students’ performance

Poor habits in diet, sleep, hygiene and other aspects of daily life accumulate over time, affect the physical and mental health of college students, and may ultimately lead to illness and harm their studies. In this paper, we select the average monthly consumption amount, the average monthly number of consumption transactions, and a regular work and rest index to study their impact on students’ overall performance. The index is calculated from the average monthly number of early rises and the average monthly number of late returns (regular work and rest index = 60 points + average monthly number of early rises - average monthly number of late returns). To a certain extent, the consumption amount and the number of transactions reflect whether students eat regularly in the campus cafeteria and supermarket, or whether they tend to stay in the dormitory and order takeout or to go off campus. The number of early rises and the number of late returns measure whether students keep a regular schedule, whether they return to the dormitory and go to sleep late, which leads to low class attendance and poor learning results, and even whether they stay out late in violation of dormitory management rules and other regulations.

First, based on their weighted average scores, the students were divided into three categories: those with scores higher than 80 were regarded as excellent, those with scores between 60 and 80 as moderate, and those with scores lower than 60 as abnormal. The distributions of the average monthly consumption amount, the average monthly number of transactions, and the regular work and rest index of the three categories were analyzed separately, and the results are shown in Figures 6, 7, and 8, respectively.
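
The index formula and the three grade categories can be written as a short Python sketch. The function names are hypothetical, and the handling of scores of exactly 60 or 80 is an assumption, since the text does not say to which category the boundary values belong.

```python
def regularity_index(early_rises_per_month, late_returns_per_month):
    """Regular work and rest index = 60 + monthly early rises - monthly late returns."""
    return 60 + early_rises_per_month - late_returns_per_month


def grade_category(weighted_average):
    """Map a weighted average score to one of the three categories in the text."""
    if weighted_average > 80:
        return "excellent"
    # Assumption: boundary scores of exactly 60 and 80 are counted as moderate.
    if weighted_average >= 60:
        return "moderate"
    return "abnormal"


# Hypothetical student: 20 early rises and 5 late returns per month on average.
example_index = regularity_index(20, 5)
example_category = grade_category(85.0)
```

A student who rises early more often than he or she returns late therefore scores above the baseline of 60, and one with more late returns scores below it.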

Blue indicates students with abnormal performance, yellow indicates students with medium performance, and purple indicates students with good performance. As can be seen from the figures, there are obvious differences among the three groups in the number of transactions, the consumption amount, and the regular work and rest index. For students with good and medium grades, the number of transactions and the consumption amount are more concentrated and fluctuate within a smaller range, with the former group slightly higher than the latter; for students with abnormal grades, both are lower and vary over a wider range. This suggests that students with better grades have more regular and reasonable consumption behavior: they eat basically all three meals in the on-campus cafeteria and supermarket, and seldom go off campus or order takeout. Students with poor grades clearly consume less in the on-campus cafeteria and supermarket; most of them may stay in the dormitory for long periods and rely on takeout, while a few may prefer to buy snacks instead of eating proper meals on time, so that their number of transactions and consumption amount are significantly higher than average.

In terms of regular work and rest, students whose average number of early rises exceeds their number of late returns, i.e., those with good habits of going to bed early and getting up early, are clearly more likely to have medium or good grades. Students who frequently stay up late, return late, and oversleep have low regular work and rest indexes, and their grades are more concentrated in the abnormal category. This indicates that poor work and rest habits and consumption habits can, to a certain extent, lead to disturbed schedules, insufficient sleep, and unhealthy eating behavior, thus affecting the academic performance and the physical and mental development of college students.

3.1.3

Correlation Analysis between Social Relationships and Student Achievement

Mining social network relationships from student card data can, to a certain extent, reflect students’ personalities and possible psychological problems. It helps educators understand students’ characteristics more comprehensively, identify lonely students who need attention, verify students’ actual living and learning situations, and carry out targeted guidance and interviews to promote their healthy growth.

To facilitate the calculation, the consumption time is first expressed in numerical form and the consumption locations in the cafeteria and supermarket are numerically coded. The consumption data of 10 students in a certain month are then randomly selected to calculate the matrix of co-occurrence counts, part of which is shown in Table 3.

Table 3.

Co-occurrence count matrix (partial)

21	9	26	5	13	21
21	14	30	21	11	14
10	25	1	25	15	22
28	27	7	11	16	27
21	9	9	25	17	12
6	19	1	26	18	27
10	19	0	21	16	21

Let the maximum of the students’ total numbers of co-occurrences with other students be P and the minimum be Q. The total number of co-occurrences of each student with other students is standardized using P and Q, and the standardized totals and the comprehensive weighted average grades are shown in Figure 9.
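
One way to compute such totals is sketched below in Python. Two points are assumptions rather than details given in the text: two students are counted as co-occurring when they swipe their cards at the same coded location within a fixed time window (60 seconds here), and the standardization is taken to be min-max scaling, (x - Q) / (P - Q). The record layout and function names are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_totals(records, window=60):
    """Total co-occurrence count per student.

    `records` holds (student_id, place_code, timestamp_in_seconds) tuples.
    """
    records = list(records)
    by_place = defaultdict(list)
    for student, place, ts in records:
        by_place[place].append((ts, student))

    pair_counts = defaultdict(int)
    for swipes in by_place.values():
        swipes.sort()  # earlier swipes first, so t2 >= t1 below
        for (t1, s1), (t2, s2) in combinations(swipes, 2):
            if s1 != s2 and t2 - t1 <= window:
                pair_counts[frozenset((s1, s2))] += 1

    # Every student starts at 0 so that isolated students are kept.
    totals = {student: 0 for student, _, _ in records}
    for pair, count in pair_counts.items():
        for student in pair:
            totals[student] += count
    return totals


def min_max_normalize(totals):
    """Scale totals to [0, 1] with P = maximum and Q = minimum."""
    p, q = max(totals.values()), min(totals.values())
    if p == q:
        return {student: 0.0 for student in totals}
    return {student: (v - q) / (p - q) for student, v in totals.items()}


# Invented swipe records: (student, place code, time in seconds).
records = [
    ("s1", 1, 0), ("s2", 1, 30), ("s3", 1, 500),
    ("s1", 2, 1000), ("s2", 2, 1010), ("s3", 2, 1050),
]
totals = cooccurrence_totals(records)
normalized = min_max_normalize(totals)
```

The pairwise comparison is quadratic in the number of swipes per location, which is acceptable for a small sample such as the 10 students used here but would need a sliding window for the full dataset.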

From the line graph, it can be seen that the standardized total number of co-occurrences and the comprehensive grade show broadly similar trends. To verify the correlation between the two, a correlation analysis was performed.
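
The text does not name the correlation measure used. As one common choice, the Pearson correlation coefficient between the standardized co-occurrence totals and the grades could be computed as in the following sketch; the function name and the sample values are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Invented values: standardized co-occurrence totals and weighted grades.
cooccurrence = [0.0, 0.2, 0.5, 0.7, 1.0]
grades = [62.0, 75.0, 70.0, 88.0, 81.0]
r = pearson_r(cooccurrence, grades)
```

A value of r close to 0 would indicate no linear relationship, while values near 1 or -1 would indicate a strong positive or negative linear relationship.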

The resulting density plots for the two variables are shown in Figure 10. The purple curve shows the distribution of the total number of co-occurrences of students with other students, and the red curve shows the distribution of the comprehensive weighted average grade. The distributions of the two variables differ markedly and do not indicate a strong correlation, i.e., the number of co-occurrences with other students does not have a significant impact on a student’s comprehensive weighted average grade. After all, there are top students who keep to themselves as well as top students with a wide circle of friends. In general, however, students with better interpersonal and social relationships are somewhat more likely to perform well academically.

3.1.4

Student profile construction

Based on the above analysis of the characteristic indicators related to students’ performance, a student with the student number 2020****107 is randomly selected from the dataset; indicators are extracted from his various campus data, his characteristic behaviors are analyzed, and a user portrait is constructed. His grades are shown in Table 4.

Table 4.

Student scores

Course name	Credits	Offering college	Score
College sports	2	Institute of sports	90
College English	3	Foreign language college	88
Advanced mathematics	3	Mathematics college	89
Inorganic chemistry experiment	2	Chemistry college	71
Computer principle	3	Computer college	83
Biology	4	Biology college	73
Ethics and law	2	Political academy	85

The weighted average grade for specialized courses is 85.71 and that for public courses is 84.95, so this student’s performance can be considered good.

Figure 11 shows the amount of money spent and the number of consumption transactions of this student in each month. The average monthly consumption amount is 745.71 yuan, and the average monthly number of transactions is 75.86. Neither the amount nor the number of transactions fluctuates significantly from month to month, and both stay within a reasonable range, which means that the student’s consumption level is normal. A comparison with the student’s actual situation shows that the judgment of the system in this paper is in line with reality.

In conclusion, the system in this paper can construct student profiles accurately.

3.2

Student Behavior Analysis Test

3.2.1

Cluster analysis of student behavior based on improved K-means

1)

Cluster analysis of students’ consumption habits

Using the K-means algorithm based on information entropy and density optimization designed in the previous section, the students’ consumption data are clustered. According to the evaluation criteria of the clustering algorithm, the best clustering effect is obtained when the number of clusters is set to 5. We analyze the resulting student classes against the actual consumption situation of students at the school by comparing the mean of each cluster with the overall mean of each indicator: H denotes a cluster mean higher than the overall mean, and L denotes a cluster mean lower than the overall mean. Table 5 displays the results of the cluster analysis.

It can be seen that the first type of students have a low monthly consumption level and a low single-month consumption peak, but consume frequently, so they belong to the group with a lower consumption level. Most of these students are frugal and come from poorer economic circumstances. It is recommended that the school and counselors pay attention to their living conditions; candidates for the identification of poor students and for financial assistance can be chosen from this group.

The second type of students have a medium monthly consumption level: their average monthly consumption is not high, but their single-month consumption peak is high and their consumption frequency is low. These students probably eat irregularly in the school cafeteria and usually prefer to eat off campus or order takeout.

The third type of students have a moderately high monthly consumption level, a high monthly consumption frequency, and a low variability in the consumption amount. This indicates that these students often eat in the school cafeteria and have stable consumption, which matches the normal eating pattern of most students at school.

The fourth type of students have the highest average monthly consumption, consume frequently and have a high consumption peak, indicating that they are the high-consumption group in the cafeteria: they eat in the cafeteria regularly, with higher requirements for meal quality and larger meal sizes.

The fifth type of students have a low average monthly consumption, a low consumption frequency, and a consumption peak that is also not high. These students do not eat in the cafeteria frequently and are more likely to eat and spend off campus. Schools and counselors should also pay attention to the food safety and personal safety of such students, who often go off campus.

2)

Cluster analysis of students’ living habits

Using the K-means algorithm based on information entropy and density optimization designed in the previous section, four indicators of students’ living habits were clustered and analyzed: regular eating habits, Internet surfing habits, early rising habits, and exercise habits. According to the cluster average criterion, a better classification of students is obtained when the number of clusters is 3. The percentage of students in each cluster and the mean of each indicator are shown in Table 6.

It can be seen that students in the first category often get up early each month, eat regularly in the school cafeteria, exercise regularly, and also spend a considerable amount of time on the Internet (although less than the other two categories). These students are more self-disciplined, get up early on time to exercise, and have a regular diet; they have good physical fitness and good living habits. Students in the second category rarely get up early, eat irregularly in the school cafeteria, spend a lot of time on the Internet, and exercise very little. They probably like to stay in the dormitory, often surf the Internet, do not like sports, do not eat on time, and have poor physical fitness. Counselors should pay attention to their study and class attendance, and check whether they often skip classes. Students in the third category often get up early each month, but eat irregularly in the cafeteria, spend more time on the Internet, and exercise less. These students probably tend to skip breakfast and do not like to exercise; they should change these unhealthy habits and pay attention to their physical health.

Generally speaking, students spend a large amount of time on the Internet during their time at school. Although society has entered the Internet era, people obtain more information from the Internet than from anywhere else, and daily life has become inseparable from it, students are still advised to arrange their time online rationally and to pay attention to their physical health.

3)

Cluster analysis of students’ study habits

K-means clustering analysis was performed on four indicators of students’ study habits: class attendance rate, number of books borrowed, number of library visits, and length of study time. The clustering effect is better when the number of clusters is 4. The proportion of students in each cluster and the mean of each indicator are shown in Table 7.

According to the table, the first type of students have a high class attendance rate, borrow fewer books, visit the library more often, and study for a longer time. These students attend class as required and often read and study in the study room and library, but borrow very few books; it is possible that they are used to reading books in the library rather than borrowing them.

The second type of students have a low class attendance rate, borrow few books, seldom visit the library, and study for a short time. Such students often skip classes, do not study hard enough, and do not like to study at school. Counselors should focus on the learning situation of this type of students and promptly urge them to strengthen their study discipline and cultivate good study habits.

The third type of students have a high class attendance rate, borrow more books from the library, visit the library more often, and study for a longer time. These are the hardworking students, who work hard in and out of class, are willing to spend their energy on learning, and have good study habits.

The fourth type of students have an average class attendance rate, borrow few books, do not visit the library often, and spend little time in the study room. These students merely attend class and do not spend much time or effort studying after class. They may be students who do not pay attention to studying in general but cram before exams. It is recommended that they cultivate regular study habits and arrange their study time reasonably.

Table 5.

Student consumption habits clustering results

Consumer class	Student ratio	Average monthly consumption	Consumption frequency	Monthly consumption peak	Comparison
1	10.34	468.37	104.2	534.32	LHL
2	17.25	664.85	68.1	1106.27	LLH
3	33.69	723.17	84.3	876.18	HHL
4	24.86	892.03	97.4	1151.4	HHH
5	13.86	497.32	39.6	625.31	LLL
Mean value		696.81	80.7	913.24
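
The H/L codes in the Comparison column of Table 5 can be reproduced by comparing each cluster mean with the overall mean, as in the following Python sketch. The function name `hl_code` is hypothetical, and treating a value exactly equal to the overall mean as L is an assumption; the numbers are copied from Table 5.

```python
def hl_code(cluster_means, overall_means):
    """Return one letter per indicator: H if above the overall mean, else L."""
    return "".join(
        "H" if cluster > overall else "L"
        for cluster, overall in zip(cluster_means, overall_means)
    )


# Indicators: average monthly consumption, consumption frequency, monthly peak.
overall = (696.81, 80.7, 913.24)
clusters = {
    1: (468.37, 104.2, 534.32),
    2: (664.85, 68.1, 1106.27),
    3: (723.17, 84.3, 876.18),
    4: (892.03, 97.4, 1151.4),
    5: (497.32, 39.6, 625.31),
}
codes = {cluster_id: hl_code(means, overall) for cluster_id, means in clusters.items()}
```

The same function applies unchanged to the four-indicator rows of Tables 6 and 7.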

Table 6.

Student habits clustering results

Life type	Student ratio	Regular eating habits	Get up early	Surfing habits	Physical exercise	Comparison
1	36.23	23.6	20.9	175.25	53.3	HHLH
2	10.29	9.2	6.6	210.15	6.5	LLHL
3	53.48	13.4	19.3	206.2	17.6	LHHL
Mean value		15.4	15.6	197.2	25.8

Table 7.

Clustering results of students’ study habits

Learning type	Student ratio	Class attendance rate	Books borrowed	Library visits	Study time	Comparison
1	32.71	95.37	7.8	29.6	153.6	HLHH
2	11.64	86.15	3.8	11.8	84.4	LLLL
3	12.08	92.49	15.4	26.2	132.5	HHHH
4	43.57	90.75	8.6	16.4	116.3	LLLL
Mean value		91.19	8.9	21	121.7

3.2.2

Association Analysis of Student Behavior and Comprehensive Quality Based on Improved Apriori Association Rule

In the parameter settings of the Apriori algorithm model, the five types of consumption habits, three types of living habits, four types of study habits, and three types of comprehensive quality are set as the antecedent and consequent variables of the association rules. The habit types are listed in Table 8, Table 9 and Table 10 below.

Table 8.

Five types of students’ consumption habits

Consumer class	Student ratio	Consumption characteristics
1	10.62	Low consumption level, low monthly consumption peak, high consumption frequency, stable consumption
2	17.16	Medium consumption level, high monthly consumption peak, low consumption frequency
3	33.63	Moderately high consumption level, low monthly consumption peak, high consumption frequency, stable consumption
4	24.89	Highest consumption level, high monthly consumption peak, high consumption frequency
5	13.7	Low consumption level, low monthly consumption peak, low consumption frequency

Table 9.

Three types of students’ living habits

Life category	Student ratio	Life characteristics
1	36.24	Often gets up early, eats regularly, exercises regularly, spends a long time on the Internet
2	10.55	Seldom gets up early, eats irregularly, does not exercise regularly, spends a long time on the Internet
3	53.21	Often gets up early, eats irregularly, does not exercise regularly, spends a long time on the Internet

Table 10.

Four types of learning habits

Learning class	Student ratio	Learning characteristics
1	32.19	High class attendance, few books borrowed, frequent library visits, long study time
2	11.61	Low class attendance, very few books borrowed, few library visits, short study time
3	12.67	High class attendance, many books borrowed, frequent library visits, long study time
4	43.53	Average class attendance, few books borrowed, few library visits, short study time

With the minimum support set to 10% and the minimum confidence set to 80%, association rule mining yields a total of 24 association rules, and the resulting association network diagram is shown in Fig. 12. In line with the research goal of this paper, redundant rules are eliminated and merged, and the association rules with a lift (degree of enhancement) greater than 1 are selected. Those whose consequent is the students’ comprehensive quality are shown in Table 11.

Table 11.

Mined association rules

Consequent	Antecedent	Rule support (%)	Support (%)	Confidence (%)
Worse	Life type 2	8.475	10.272	82.566
General	Consumer type 5, Life type 3	9.864	11.956	83.251
General	Life type 3	44.872	53.124	84.194
General	Learning type 4, Life type 1	19.018	23.705	80.578
Good	Life type 1, Learning type 3	8.367	10.317	81.362

The mined association rules are: living habit type 1, study habit type 3 → good comprehensive quality (S = 10.317%, C = 81.362%); living habit type 3 → average comprehensive quality (S = 53.124%, C = 84.194%); consumption habit type 5, living habit type 3 → average comprehensive quality (S = 11.956%, C = 83.251%); study habit type 4, living habit type 1 → average comprehensive quality (S = 23.705%, C = 80.578%); and living habit type 2 → poor comprehensive quality (S = 10.272%, C = 82.566%). The lift (degree of enhancement) of each of these five rules is greater than 1, which indicates that the occurrence of the antecedent makes the occurrence of the consequent more likely.
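
The text does not give the formula for the degree of enhancement. Assuming it is the standard lift, Lift(X ⇒ Y) = Confidence(X ⇒ Y) / Support(Y), the rule screening can be sketched in Python as follows. The function names and the example numbers are hypothetical; in particular, the support of the consequent is not reported in Table 11, so invented values are used.

```python
def lift(confidence, consequent_support):
    """Standard lift: how much more likely Y becomes once X has occurred."""
    if not 0 < consequent_support <= 1:
        raise ValueError("consequent support must lie in (0, 1]")
    return confidence / consequent_support


def passes_thresholds(support, confidence, consequent_support,
                      min_support=0.10, min_confidence=0.80):
    """Keep a rule that meets both thresholds and has lift greater than 1."""
    return (
        support >= min_support
        and confidence >= min_confidence
        and lift(confidence, consequent_support) > 1
    )


# Invented rule: support 12%, confidence 80%, consequent occurs in 50% of cases.
example_lift = lift(0.80, 0.50)
example_kept = passes_thresholds(0.12, 0.80, 0.50)
```

A lift of 1 means the antecedent and consequent are independent, so a rule whose consequent is already very common can be rejected even when its confidence is high.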

The first rule indicates that when a student's life habits are getting up early, eating regularly in the cafeteria and exercising regularly, and the student's study habits are a high class attendance rate, borrowing more books, going to the library regularly and studying for longer periods, the student has an 81.343% probability of having good comprehensive quality.

The second rule indicates that when a student often gets up early but eats irregularly and seldom exercises, the student has an 84.068% probability of having average comprehensive quality.

The third rule indicates that when a student's consumption habits are low total monthly spending and infrequent spending, and the student often gets up early but eats irregularly and seldom exercises, the student has an 83.246% probability of having average comprehensive quality.

The fourth rule indicates that when a student's study habits are average class attendance, borrowing fewer books, going to the library less often and spending less time studying, and the student's life habits are often getting up early, eating fairly regularly and exercising regularly, the student has an 80.366% probability of having average comprehensive quality.

The fifth rule indicates that when a student does not get up early, eats irregularly and exercises infrequently, the student has an 82.423% probability of having poor comprehensive quality.

In summary, students' behavioral habits during their time at school are closely related to their comprehensive quality. By mining the relationships between the indicators of students' behavioral characteristics and their comprehensive quality, this paper obtains several regularities that affect the development of that quality. Students' life habits, study habits and consumption habits all influence the development of their quality. Good study and life habits help students plan their time rationally, learn professional knowledge better, improve their practical skills and develop divergent, innovative thinking, thereby enhancing their comprehensive quality.

4

Conclusion

In this paper, a behavioral analysis system that includes student behavioral portraits is designed to meet the informatization requirements of education management in colleges and universities. The system is applied at a university to analyze the correlation between gender, work and rest, social relations and students' performance. The analysis finds that students who often stay up late and sleep in have a low work-and-rest regularity index, and their performance is concentrated in the abnormal category. For a randomly constructed student portrait, the average monthly consumption amount is 745.71 yuan and the average monthly consumption frequency is 75.86 times, which places the consumption level in the normal range. Compared with the real situation, the system's judgment is consistent with reality. The behavioral analysis concludes that when a student's life habits are getting up early, eating regularly in the cafeteria and exercising often, and the student's study habits are a high class attendance rate, borrowing more books, going to the library often and studying for longer periods, the student has an 81.343% probability of having good comprehensive quality. The system thus helps improve the quality and efficiency of education management in colleges and universities.

Language: English

Publication frequency: 1 time per year
Journal subjects: Biological sciences, Biological sciences, other, Mathematics, Applied mathematics, General mathematics, Physics, Physics, other



Hongli Lou

Pin Yue

Published: 21 Mar 2025

Received: 18 Oct 2024

Accepted: 07 Feb 2025

DOI: https://doi.org/10.2478/amns-2025-0573

Keywords: Information entropy, K-Means, Apriori, Association rules, Educational management

© 2025 Hongli Lou, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
