Research on Decision Tree-Based Strategic Modeling of Student Management and Civic Education in Colleges and Universities under Interdisciplinary Perspective

With changing times and the demands of social progress, ideological education and student management have become an important and irreplaceable part of the work of colleges and universities. Civic and political education is one of the basic tasks of talent cultivation in colleges and universities and a necessary way to cultivate socialist builders and successors [1–3]. As an important part of higher education, it aims to solve real problems and build the future, with broadening students’ knowledge and horizons of thinking as its main purpose [4–5]. Student management is likewise an important part of the management work of colleges and universities. It bears on the quality of college education and on the growth, development, and safety of students, so doing it well is one of the important issues facing colleges and universities [6–8].

Judging from the current situation of Chinese colleges and universities, the implementation of civic and political education faces many problems, which hinder the smooth progress of the related work. To ensure that students’ civic and political education is carried out actively and effectively, it is necessary to strengthen student management [9–11]. By actively creating a harmonious and orderly campus culture, making use of multimedia and network technology, strengthening students’ awareness of their own agency, and helping students work through negative states, student management can be strengthened in all respects and ideological education can be carried out in many directions. Only by effectively linking student management with ideological education can colleges and universities improve the comprehensive quality of students and provide society with a genuine talent pool [12–15].

Literature [16] emphasizes the importance of management education to civic education in colleges and universities. Addressing current problems in management education, it combines scientific, standardized, and tacit education to explore practical routes to efficient management education, so as to create a good environment for talent cultivation and improve the effectiveness of ideological education. Literature [17] examines the current situation of student management in colleges and universities and the advantages of information technology for ideological and political education (IPE), and proposes ways in which information-based IPE can improve student management. A comparative study shows that information technology clearly improves the efficiency of both IPE and student management. Literature [18] emphasizes the importance of student management and ideological and political education in higher education, comprehensively discusses the problems exposed by current work in both areas in Chinese higher education, and puts forward solution strategies. Literature [19] points out that although colleges and universities in the new era pay great attention to the management and education of students, the actual results are unsatisfactory, with deficiencies in both education and management, and proposes measures that combine the two to promote talent cultivation.

Literature [20] discusses the integration of ideological education and learning management in colleges and universities against the background of “Internet +”. It argues that integrating the two is not only an innovation in both fields but also helps the work of colleges and universities proceed smoothly, and plays an important role in promoting students’ healthy growth, cultivating teacher-student relationships, and improving the quality of teaching and learning. Literature [21] affirms the important role of civic and political education and formulates a management system for college students’ civic education that prompts the traditional civic education model to adapt to the new education system. The system meets the needs of contemporary talent cultivation, achieves a consistent reform of the management subject and management scope, and supports the healthy development of students’ civic and political education. Literature [22] indicates that, in an era of diversified and fragmented ideology and culture, it is imperative to strengthen the management of students’ ideology and politics. It discusses innovative strategies for student management from the perspective of ideological management, aiming to provide a reference for student management in colleges and universities. Based on the current state of ideological education and student management in colleges and universities, literature [23] proposes combining the two to address their deficiencies, so as to improve the quality of education and cultivate high-quality talents.

In this study, the decision tree classification method from data mining is combined, from an interdisciplinary perspective, with student management and civic education in colleges and universities, and a strategy model is constructed with an improved C4.5 decision tree mining technique at its core. Among decision tree algorithms, the ID3 algorithm selects the attribute with the largest information gain as the branching attribute. The C4.5 algorithm improves on ID3 by adding the handling of continuous attributes and missing attribute values. The information entropy is then simplified using Taylor’s formula and Maclaurin’s formula from higher mathematics, the calculation of the split information is simplified in the same way, and the attribute with the largest information gain rate is selected as the root node. Student behavior data from four semesters of University A from 2022 to 2023 are selected as samples, and the classification results obtained with the proposed model and the improved C4.5 decision tree mining technique are analyzed empirically.

2 Model of student management and civic education strategies in higher education institutions

The decision tree is one of the common methods in data mining and is widely used in research and applications across different fields because of advantages such as easy conversion into classification rules. From an interdisciplinary perspective, this paper combines decision tree data mining with student management and civic education in colleges and universities and builds a corresponding strategy model.

2.1 Data Mining for Student Management and Civic Education in Colleges and Universities

2.1.1 Overview of data mining

Data mining is not simple data analysis. It applies complex mathematical computation to large amounts of data in order to find underlying regularities and valuable information, and on that basis predicts the likelihood of future events [24]. Compared with earlier data analysis methods, the information obtained through data mining is more valuable. Because data mining can usually analyze data without prior assumptions, the information it yields is previously unknown, effective, and practical. For information that cannot be reached by intuition, and especially for information that runs counter to intuition, the more unexpected the mining result, the more valuable it tends to be.

At present, data mining is used in a wide range of fields, mainly business and finance. The way data mining is applied in these areas is worth learning from for educational researchers and is of great significance for student management and ideological education in colleges and universities.

2.1.2 The process of data mining

Data mining is about exploring and analyzing data; it is not just a collection of independent tools, but rather a meticulously planned process of discovering valuable knowledge from large amounts of data. Data mining is a specific step in the process of knowledge discovery in databases, and it uses a special algorithm to extract data patterns. On a macro level, the whole process of data mining is divided into three parts: data preparation, data mining, and interpretation and evaluation of results. Among them, data preparation is a necessary stage before data analysis, which has a direct impact on the mining results. It mainly includes data selection, cleaning, integration, and conversion steps. The whole process of data mining is shown in Figure 1.

2.1.3 Methods of data mining

Many research methods are commonly used in data mining, each with a different focus and suited to different problems. The method used in this study for student management and civic education in colleges and universities is the decision tree, one of the most intuitive classification algorithms in data mining. Using a decision tree, a large amount of complex data can be categorized and the resulting rules can be represented as a flowchart-like tree. A decision tree is a predictive model that can be used to evaluate the risk of a project and thus determine its feasibility, although predicting continuous data with it can be difficult. The premise of using decision trees is that the probability of each situation occurring is known and that there is a mapping between objects and their attribute values. Generally, there are two types of decision trees: classification trees, usually used for discrete variables, and regression trees, usually used for continuous variables.

2.2 Decision Tree Data Mining Techniques

Decision trees are one of the common methods of data mining [25]. Given known probabilities for the various possible situations, a decision tree is constructed to find the probability that the expected net present value is greater than or equal to zero. It is thus an analytical method for evaluating project risk and judging feasibility.

A decision tree is a tree structure built according to strategic choices over attributes. In decision-making, it is a predictive model that represents the mapping between object attributes and object values. Each internal node represents an attribute of an object, each branch represents a value of that attribute, and each leaf node represents a class. In short, a decision tree is a prediction tree that is built from a classified training set and used for prediction and categorization.
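The node, branch, and leaf roles described above can be sketched as a small data structure. This is only an illustrative sketch: the class names `Node` and `Leaf`, the function `classify`, and the attribute names in the toy tree are hypothetical and are not taken from the paper's dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class assigned to every sample that reaches this leaf


@dataclass
class Node:
    attribute: str  # attribute tested at this internal node
    # One branch per attribute value, leading to a subtree or a leaf.
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def classify(tree: Union[Node, Leaf], sample: Dict[str, str]) -> str:
    """Follow the branch matching the sample's attribute value until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[sample[tree.attribute]]
    return tree.label


# Hypothetical toy tree; attribute names and values are illustrative only.
toy_tree = Node(
    attribute="attendance",
    branches={
        "good": Leaf("normal"),
        "poor": Node(
            attribute="dormitory_conduct",
            branches={
                "good": Leaf("education period"),
                "poor": Leaf("critical period"),
            },
        ),
    },
)

print(classify(toy_tree, {"attendance": "poor", "dormitory_conduct": "good"}))
```

Each root-to-leaf path in such a structure corresponds to one classification rule, which is why decision trees convert easily into if-then rules.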

Decision trees originated in concept learning systems and were developed into a practical method through the ID3 algorithm. In the early days, decision trees were an important method in artificial intelligence, and with the development of data mining techniques they became an important tool for building decision support systems.

Among the decision tree algorithms, the ID3 algorithm is one of the more mature ones [26]. The ID3 algorithm selects attributes based on information gain, and selects the attribute with the largest information gain after branching.

The basic idea of the ID3 algorithm is:

1) Construct a decision tree by a top-down greedy search through the space of possible decision trees.

2) Choose an attribute as the root node, construct a branch for each possible value of that attribute, and assign the training samples to the appropriate branches, i.e., divide the samples into subsets, each corresponding to one branch.

3) Repeat this process recursively for each branch, using only the samples that actually reach that branch.

4) If all the samples at a node belong to the same class, stop expanding that part of the tree.

In decision tree classification, the smaller the entropy after a split, the greater the information gain. The algorithm chooses the attribute with the greatest gain as the split node of the decision tree and creates a branch for each value of that attribute. The same method is then applied recursively within each branch until every subset contains samples of a single class.
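The four ID3 steps and the recursive splitting described above can be sketched as follows. This is a minimal sketch under stated assumptions: samples are dictionaries of categorical attributes, the tree is returned as nested dictionaries, and the fallback to the majority class when attributes run out is a common convention that the text does not spell out. The toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain(rows, labels, attr):
    """Information gain obtained by splitting on attribute `attr`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Return a nested dict {attr: {value: subtree}} or a class label (leaf)."""
    if len(set(labels)) == 1:                    # step 4: pure node, stop expanding
        return labels[0]
    if not attrs:                                # assumption: majority class if no attributes remain
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))   # greedy choice (steps 1 and 2)
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):        # one branch per attribute value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3(                             # step 3: recurse on the branch's samples
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
        )
    return tree


# Hypothetical toy data, for illustration only.
rows = [
    {"attendance": "good", "dorm": "tidy"},
    {"attendance": "good", "dorm": "messy"},
    {"attendance": "poor", "dorm": "tidy"},
    {"attendance": "poor", "dorm": "messy"},
]
labels = ["normal", "normal", "education", "critical"]
print(id3(rows, labels, ["attendance", "dorm"]))
```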

The information gain of an attribute is calculated as follows.

Let $c$ be the number of classes in the sample set $S$, and let $p(S, j) = |S_j| / |S|$ denote the probability that a sample in $S$ belongs to class $j$, where $|S_j|$ is the number of samples in $S$ that belong to class $j$. For a given sample classification, the expected information is:

$$Info(S) = -\sum_{j=1}^{c} p(S, j) \log_2 p(S, j) \tag{1}$$

An attribute $T$ with value set $\{a_1, a_2, \ldots, a_k\}$ divides $S$ into subsets $\{S_1, S_2, \ldots, S_k\}$, where $S_i$ contains the samples whose value on $T$ is $a_i$. The expected information of this division, called the entropy of $T$, is the weighted average of the information of the subsets:

$$E(T) = \sum_{i=1}^{k} \frac{|S_i|}{|S|} Info(S_i) \tag{2}$$

The information gain of $T$ is defined as:

$$Gain(S, T) = Info(S) - E(T) \tag{3}$$
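The three quantities in equations (1) to (3) can be computed directly from class counts. The sketch below assumes, as in standard ID3, that $E(T)$ is the size-weighted average of the information of the subsets. The function names and the counts are illustrative assumptions: each pair is a (positive, negative) count for one attribute value.

```python
import math


def info(counts):
    """Expected information of a sample set, from per-class sample counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def expected_info(subsets):
    """Entropy of an attribute: size-weighted average of info over its subsets."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * info(s) for s in subsets)


def gain(subsets):
    """Information gain; the full set S is recovered by summing the subsets."""
    class_totals = [sum(col) for col in zip(*subsets)]
    return info(class_totals) - expected_info(subsets)


# (positive, negative) counts per attribute value; illustrative numbers only.
subsets = [(4, 1), (6, 1), (7, 0), (0, 1)]
print(round(info([17, 3]), 4), round(expected_info(subsets), 4), round(gain(subsets), 4))
```

A pure subset such as `(7, 0)` contributes zero information, which is why splits that isolate a single class raise the gain.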

2.3 Improved C4.5 decision tree data mining techniques

2.3.1 C4.5 algorithm

The C4.5 algorithm is an improvement of the ID3 algorithm. On the basis of ID3, it adds the handling of continuous attributes and missing attribute values, as well as a more mature method for tree pruning [27].

1) The concept of information gain rate

The information gain rate is developed on the basis of information gain [28]. Let attribute $A$ have $v$ different values $\{a_1, a_2, \ldots, a_v\}$. Attribute $A$ can be used to partition $S$ into $v$ subsets $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ contains the samples in $S$ that have value $a_j$ on $A$. If the samples are partitioned by the value of attribute $A$, $SplitI(A)$ is the entropy of that partition, and the information gain rate of the attribute is calculated as:

$$Gain\text{-}Ratio(A) = \frac{Gain(A)}{SplitI(A)} \tag{4}$$

where $p_j = |S_j| / |S|$ is the proportion of samples that fall into subset $S_j$ and

$$SplitI(A) = -\sum_{j=1}^{v} p_j \log_2 (p_j) \tag{5}$$
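Equations (4) and (5) can be sketched in a few lines. The sketch assumes that each subset is given as a tuple of per-class counts and that $p_j$ in equation (5) is the share of samples falling into subset $j$. Returning 0 when the split information is zero is a guard added here for illustration, not something the text specifies.

```python
import math


def entropy(counts):
    """Base-2 entropy of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def split_info(subset_sizes):
    """SplitI(A): entropy of the partition sizes induced by attribute A."""
    return entropy(subset_sizes)


def gain_ratio(subsets):
    """Gain-Ratio(A) = Gain(A) / SplitI(A) for per-class counts per attribute value."""
    sizes = [sum(s) for s in subsets]
    total = sum(sizes)
    class_totals = [sum(col) for col in zip(*subsets)]
    gain = entropy(class_totals) - sum(
        size / total * entropy(s) for size, s in zip(sizes, subsets)
    )
    si = split_info(sizes)
    return gain / si if si > 0 else 0.0  # assumption: ratio is 0 for a one-branch split


# Two attribute values that separate the two classes perfectly.
print(gain_ratio([(2, 0), (0, 2)]))
```

Dividing by the split information penalizes attributes with many small branches, which plain information gain tends to favor.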

2) For numerical attributes, the process of C4.5 is as follows.

(1) Sort the training data according to the attribute values.

(2) Dynamically divide the training data with different thresholds.

(3) Determine a threshold wherever the input changes.

(4) Take the midpoint of the two actual values as the new threshold.

(5) Generate two divisions into which all samples are distributed.

(6) Obtain all possible thresholds, gains, and gain ratios.

(7) Each numerical attribute is thereby split into two intervals, i.e., less than the threshold, or greater than or equal to the threshold.
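The threshold search for a numerical attribute in steps (1) to (7) can be sketched as below. This is a simplified illustration: it scores each candidate midpoint by information gain only, whereas the text also mentions gain ratios, and the function name `best_threshold` and the sample values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value < threshold`."""
    pairs = sorted(zip(values, labels))                 # step (1): sort by attribute value
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                                    # no boundary between equal values
        threshold = (v1 + v2) / 2                       # step (4): midpoint of adjacent values
        left = [l for v, l in pairs if v < threshold]   # step (5): the two divisions
        right = [l for v, l in pairs if v >= threshold]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best[1]:                  # step (6): keep the best candidate
            best = (threshold, base - remainder)
    return best


# Hypothetical numeric attribute (e.g. a count of absences) with two classes.
print(best_threshold([1, 2, 8, 9], ["a", "a", "b", "b"]))
```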

3) Processing training samples containing unknown attribute values

C4.5 handles training samples with unknown attribute values by replacing the unknown value with the most common value of that attribute, or with the most common value among samples of the same class. Alternatively, a probability is assigned to each possible value of the attribute, based on the samples for which the attribute value is known.
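The handling of unknown attribute values described above (most common value, most common value within the same class, or a probability per value) can be sketched as follows. The function names, the use of `None` to mark an unknown value, and the field names `late` and `class` are assumptions made for this illustration.

```python
from collections import Counter


def most_common_value(rows, attr, label=None, label_key="class"):
    """Most frequent known value of `attr`, optionally restricted to one class."""
    known = [
        r[attr] for r in rows
        if r[attr] is not None and (label is None or r[label_key] == label)
    ]
    return Counter(known).most_common(1)[0][0]


def fill_missing(rows, attr, by_class=False, label_key="class"):
    """Return copies of rows with None in `attr` replaced by the most common value."""
    filled = []
    for r in rows:
        if r[attr] is None:
            label = r[label_key] if by_class else None
            r = {**r, attr: most_common_value(rows, attr, label, label_key)}
        filled.append(r)
    return filled


def value_probabilities(rows, attr):
    """Probability of each value of `attr`, estimated from rows where it is known."""
    known = Counter(r[attr] for r in rows if r[attr] is not None)
    total = sum(known.values())
    return {v: c / total for v, c in known.items()}


# Hypothetical records; the last one has an unknown value for "late".
rows = [
    {"late": "no", "class": "normal"},
    {"late": "no", "class": "normal"},
    {"late": "no", "class": "normal"},
    {"late": "yes", "class": "critical"},
    {"late": None, "class": "critical"},
]
print(fill_missing(rows, "late")[-1], fill_missing(rows, "late", by_class=True)[-1])
print(value_probabilities(rows, "late"))
```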

4) Generating rules

Once the decision tree is built, it can be converted into a set of if-then rules. The rules are stored in a two-dimensional array, where each row represents one rule in the tree, i.e., a path from the root to a leaf, and each column stores a node on that path.
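The conversion of a tree into if-then rules can be sketched as a traversal that emits one row per root-to-leaf path. The nested-dictionary tree format, the function names, and the toy tree are assumptions for illustration. A list of lists stands in for the two-dimensional array.

```python
def tree_to_rules(tree, path=()):
    """Return one row per root-to-leaf path: [(attr, value), ..., ("class", label)]."""
    if not isinstance(tree, dict):                  # leaf: close the current rule
        return [list(path) + [("class", tree)]]
    (attr, branches), = tree.items()                # each node holds exactly one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((attr, value),)))
    return rules


def format_rule(rule):
    """Render one row of the rule table as an if-then string."""
    *conditions, (_, label) = rule
    body = " AND ".join(f"{a} = {v}" for a, v in conditions)
    return f"IF {body} THEN class = {label}"


# Hypothetical tree in nested-dict form: {attribute: {value: subtree_or_label}}.
tree = {"attendance": {"good": "normal",
                       "poor": {"dorm": {"tidy": "education", "messy": "critical"}}}}
rules = tree_to_rules(tree)
for rule in rules:
    print(format_rule(rule))
```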

2.3.2 Improved C4.5 algorithm

1) Improvement of the formula

The basic principle of the C4.5 algorithm shows that the construction of the decision tree is based on information theory. Because calculating the information gain rate involves many logarithm operations, library functions must be called in the program, which increases computation time. If this computational cost can be reduced, the time needed to generate the decision tree can be shortened.

Suppose $E = F_1 \times F_2 \times \ldots \times F_n$ is an $n$-dimensional finite vector space, where each $F_j$ is a finite set of discrete symbols. The elements $e = \langle w_1, w_2, \ldots, w_n \rangle$ of $E$ are called examples, where $w_j \in F_j$, $j = 1, 2, \ldots, n$. Let YE and NE be two sets of examples of $E$, called the positive and negative example sets, respectively. Suppose that the positive example set YE and the negative example set NE are of size $y$ and $n$, respectively (from here on, $n$ denotes the number of negative examples).

(1) A correct decision tree on the vector space $E$ classifies an arbitrary example with the same probabilities as the proportions of positive and negative examples in $E$.

(2) The amount of information required for a decision tree to make a correct class judgment on a sample set is:

$$I(y, n) = -\frac{y}{y+n} \log_2 \frac{y}{y+n} - \frac{n}{y+n} \log_2 \frac{n}{y+n} \tag{6}$$

(3) If attribute $A$ is chosen as the root of the decision tree and $A$ takes $v$ different values $\{A_1, A_2, \ldots, A_v\}$, then $A$ divides $E$ into $v$ subsets $\{E_1, E_2, \ldots, E_v\}$, where $E_i$ contains the samples in $E$ for which attribute $A$ takes the value $A_i$. Suppose $E_i$ contains $y_i$ positive examples and $n_i$ negative examples. The expected information required for the subset $E_i$ is then $I(y_i, n_i)$, obtained from equation (6):

$$I(y_i, n_i) = -\frac{y_i}{y_i+n_i} \log_2 \frac{y_i}{y_i+n_i} - \frac{n_i}{y_i+n_i} \log_2 \frac{n_i}{y_i+n_i} \tag{7}$$

Therefore, the information entropy required with attribute $A$ as the root is:

$$E(A) = \sum_{i=1}^{v} \frac{y_i+n_i}{y+n} I(y_i, n_i) \tag{8}$$

Substituting equation (7) into equation (8) and converting the base-2 logarithms to natural logarithms gives the simplified form:

$$E(A) = \frac{1}{(n+y)\ln 2} \sum_{i=1}^{v} \left(-y_i \ln \frac{y_i}{y_i+n_i} - n_i \ln \frac{n_i}{y_i+n_i}\right) \tag{9}$$

Since $\frac{1}{(n+y)\ln 2}$ is a constant for a given training set, it can be factored out, and the function $E'(A)$ is defined to satisfy:

$$E'(A) = \sum_{i=1}^{v} \left(-y_i \ln \frac{y_i}{y_i+n_i} - n_i \ln \frac{n_i}{y_i+n_i}\right) \tag{10}$$

The information entropy is simplified following Taylor’s formula and Maclaurin’s formula from higher mathematics. By the principle of equivalent infinitesimals, if $x$ is very small then $\ln(1+x) \approx x$, which gives [29]:

$$\ln \frac{y_i}{y_i+n_i} = \ln\left(1 - \frac{n_i}{y_i+n_i}\right) \approx -\frac{n_i}{y_i+n_i} \tag{11}$$

$$\ln \frac{n_i}{y_i+n_i} = \ln\left(1 - \frac{y_i}{y_i+n_i}\right) \approx -\frac{y_i}{y_i+n_i} \tag{12}$$

Substituting (11) and (12) into (10) gives the information entropy:

$$E'(A) = \sum_{i=1}^{v} \left(-y_i \ln \frac{y_i}{y_i+n_i} - n_i \ln \frac{n_i}{y_i+n_i}\right) \approx \sum_{i=1}^{v} \frac{2 y_i n_i}{y_i+n_i} \tag{13}$$

(4) Split information:

$$SplitI'(A) = \sum_{i=1}^{v} \left(-\frac{n_i}{y_i+n_i} \log_2 \frac{n_i}{y_i+n_i} - \frac{y_i}{y_i+n_i} \log_2 \frac{y_i}{y_i+n_i}\right) \approx \sum_{i=1}^{v} \frac{2}{\ln 2} \frac{y_i n_i}{(y_i+n_i)^2} \tag{14}$$

(5) The information gain rate is:

$$Gain\text{-}Ratio'(A) = \frac{Gain'(A)}{SplitI'(A)} = \frac{I'(A) - E'(A)}{SplitI'(A)} \tag{15}$$

The improved C4.5 algorithm uses the simplified expression $\sum_{i=1}^{v} \frac{2 y_i n_i}{y_i+n_i}$ for the information entropy and the simplified expression $\sum_{i=1}^{v} \frac{2}{\ln 2} \frac{y_i n_i}{(y_i+n_i)^2}$ for the split information, and selects the attribute with the largest information gain rate as the root node. This calculation involves only the four basic arithmetic operations and no logarithms, so it is quick to carry out on a computer.
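To make the simplification concrete, the following sketch evaluates the logarithmic summand of equation (10) next to its first-order replacement from equation (13), together with the simplified split-information summand of equation (14). The function names and the sample counts are illustrative assumptions. The sketch only lets the two quantities be compared numerically and makes no claim about how close they are in general, since $\ln(1+x) \approx x$ is a first-order truncation.

```python
import math


def exact_entropy_term(y, n):
    """Summand of eq. (10): -y ln(y/(y+n)) - n ln(n/(y+n)), with 0 ln 0 taken as 0."""
    total = y + n
    return -sum(c * math.log(c / total) for c in (y, n) if c > 0)


def approx_entropy_term(y, n):
    """Summand of eq. (13): first-order replacement 2yn/(y+n), no logarithm."""
    return 2 * y * n / (y + n)


def approx_split_term(y, n):
    """Summand of eq. (14): (2/ln 2) * yn/(y+n)^2; ln 2 is a fixed constant."""
    return 2 / math.log(2) * y * n / (y + n) ** 2


def simplified_entropy(subsets):
    """Simplified E'(A) over (y_i, n_i) pairs."""
    return sum(approx_entropy_term(y, n) for y, n in subsets)


def simplified_split(subsets):
    """Simplified SplitI'(A) over (y_i, n_i) pairs."""
    return sum(approx_split_term(y, n) for y, n in subsets)


# Illustrative (y_i, n_i) pairs: a pure subset, a skewed one, and a balanced one.
for y, n in [(7, 0), (9, 1), (5, 5)]:
    print(y, n, round(exact_entropy_term(y, n), 3), round(approx_entropy_term(y, n), 3))
```

Because the approximate forms need only multiplication, division, and addition per subset, the per-attribute cost no longer depends on calls to a logarithm routine.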

2.3.3 Application of the improved C4.5 algorithm

1) Reconstruct the decision tree using the improved C4.5 algorithm

Using the improved formulas for information entropy and split information, the gain rate of each attribute is computed for the sample data, the attribute with the largest information gain rate is selected as the root node, and the decision tree is built:

$$I(y, n) = \frac{17 \times 3}{(17+3)^2} = 0.128 \tag{16}$$

Take the attribute “Student Management” (SM) as an example:

$$E(SM) = \sum_{i=1}^{v} \frac{y_i n_i}{y_i+n_i} = \frac{4 \times 1}{4+1} + \frac{6 \times 1}{6+1} + \frac{7 \times 0}{7+0} + \frac{0 \times 1}{0+1} = 1.657 \tag{17}$$

$$SplitI(SM) = \sum_{i=1}^{v} \frac{y_i n_i}{(y_i+n_i)^2} = \frac{5 \times 7 \times 7 \times 1}{(5+7+7+1)^2} = 0.613 \tag{18}$$

$$Gain\text{-}Ratio'(SM) = \frac{I'(y, n) - E'(SM)}{SplitI'(SM)} = -2.494 \tag{19}$$

Take the attribute “ideological and political education” (IPE) as an example:

$$E(IPE) = \frac{2 \times 0}{2+0} + \frac{2 \times 1}{2+1} + \frac{9 \times 0}{9+0} + \frac{4 \times 2}{4+2} = 2.000 \tag{20}$$

$$SplitI(IPE) = \frac{2 \times 3 \times 9 \times 6}{(2+3+9+6)^2} = 0.810 \tag{21}$$

$$Gain\text{-}Ratio(IPE) = \frac{0.128 - 2.000}{0.810} = -2.311 \tag{22}$$

Compare the information gain rates:

$$Gain\text{-}Ratio(SM) = \frac{0.128 - 1.657}{0.613} = -2.494 \tag{23}$$

$$Gain\text{-}Ratio(IPE) = \frac{0.128 - 2.000}{0.810} = -2.311 \tag{24}$$

$$Gain\text{-}Ratio(Comprehensive) = \frac{0.128 - 0.875}{0.84} = -0.889 \tag{25}$$
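The arithmetic in equations (16) to (25) can be re-traced with a short script. This is a sketch under stated assumptions: the (positive, negative) counts per attribute value are read off the fractions in equations (17) and (20), the split information is evaluated exactly as the printed fractions in equations (18) and (21) (product of subset sizes over the squared total), intermediate values are rounded half up to three decimals as the printed figures appear to be, and the entropy and split information of the “Comprehensive” attribute are taken directly from equation (25) because its counts are not given. All function and variable names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP
from math import prod


def r3(x):
    """Round to three decimals, half up (assumed to match the printed values)."""
    return Decimal(x).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


def simplified_info(y, n):
    return r3(Decimal(y * n) / Decimal((y + n) ** 2))                     # eq. (16)


def simplified_entropy(subsets):
    return r3(sum(Decimal(y * n) / Decimal(y + n) for y, n in subsets))   # eqs. (17), (20)


def simplified_split(subsets):
    sizes = [y + n for y, n in subsets]
    return r3(Decimal(prod(sizes)) / Decimal(sum(sizes) ** 2))            # as printed in eqs. (18), (21)


def gain_ratio(info, entropy, split):
    return r3((info - entropy) / split)                                   # eqs. (19), (22)-(25)


sm = [(4, 1), (6, 1), (7, 0), (0, 1)]    # counts read from eq. (17)
ipe = [(2, 0), (2, 1), (9, 0), (4, 2)]   # counts read from eq. (20)
info = simplified_info(17, 3)

ratios = {
    "SM": gain_ratio(info, simplified_entropy(sm), simplified_split(sm)),
    "IPE": gain_ratio(info, simplified_entropy(ipe), simplified_split(ipe)),
    # Counts for this attribute are not given; values are taken from eq. (25).
    "Comprehensive": gain_ratio(info, Decimal("0.875"), Decimal("0.84")),
}
best = max(ratios, key=ratios.get)
print(ratios, best)
```

Among the three printed gain rates, the largest value is the one for “Comprehensive”, so under the selection rule stated in Section 2.3.2 that attribute would be the candidate for the root node.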

3 Empirical evidence of the model of student management and civic education strategies in higher education institutions

This paper takes the large-scale dataset of students at University A as its research object, selects student behavioral data from the four semesters of 2022-2023 as samples, and analyzes the classification results of the behavioral data using decision tree mining based on the improved C4.5 algorithm.

3.1 Iterative training on student data

Based on the improved C4.5 decision tree data mining technique, the four datasets corresponding to the four semesters of 2022-2023 were classified and predicted. The parameters of the algorithm were tuned to address the imbalance in the data, the sampling scheme was adjusted according to how the algorithm performed, and the training data were iterated over many times to select the best learning result of the decision tree. The relationship between the error rate of the algorithm and the number of decision trees for the four datasets of 2022-2023 is shown in Fig. 2. In the figure, 2022B and 2022J denote the semesters in the first and second half of 2022, respectively, and 2023B and 2023J likewise for 2023. As the figure shows, the error rate converges for all four datasets once the number of decision trees reaches 250, after which the prediction results are stable and highly accurate.

3.2 Gini index measurement of student data

The Gini index is the measure used by the classical CART decision tree to select the optimal splitting feature. Here, the larger the Gini index value, the greater the importance of the corresponding behavior. The strategy model constructed in this paper was applied to the fully standardized student behavior dataset of 2022-2023 to measure the Gini index values of the different student management and civic education behavior indicators, and the results were compared with the actual index values, as shown in Fig. 3, where (a) and (b) show the data for 2022 and 2023, respectively. As the figure shows, with the student data fully standardized using the improved C4.5 decision tree data mining technique, prediction accuracy is good and the measured Gini index values fit the actual values well. The prediction accuracy is about 91.88% and 90.51% for the semesters in the first and second half of 2022, and 91.61% and 94.83% for the semesters in the first and second half of 2023. That is, the prediction accuracy on all four datasets exceeds 90%, which indicates that the improved C4.5 decision tree data mining technique in this paper performs well in data processing.
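The paper does not give the formula it uses for the Gini index. As a hedged illustration, the sketch below uses the standard CART definition of Gini impurity, $1 - \sum_j p_j^2$, and treats the importance of a categorical feature as the impurity decrease obtained by splitting on it. Both choices, the function names, and the toy values are assumptions and not the paper's stated method.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_j^2) of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_decrease(feature_values, labels):
    """Impurity decrease from splitting the labels on a categorical feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - weighted


# Hypothetical feature that separates two classes perfectly.
print(gini(["a", "a", "b", "b"]), gini_decrease(["x", "x", "y", "y"], ["a", "a", "b", "b"]))
```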

3.3 Results of Student Data Classification

In this section, the improved C4.5 decision tree data mining technique proposed in this paper’s strategy model is used to derive, from the learning behavior data of students at University A over the four semesters of 2022-2023, the factors that influence students’ behavioral performance in student management and civic education and the related rules.

3.3.1 Statistics on student behavioral data

The academic behavior data for the four semesters of 2022-2023 contain detailed records of students’ violations of the disciplinary rules for student management and civic education. Each record mainly includes class, semester, disciplinary item, and points. The quality of the disciplinary deduction records also varies from year to year. In this section, the disciplinary data are processed by eliminating items with fewer than 50 demerit points and merging near-duplicate items, such as the differently worded variants of “insubordination, insolence, and defiance of teachers, instructors, and student cadres”. The final demerit statistics are shown in Table 1. As the table shows, the five most frequent disciplinary violations are “absenteeism”, “not tidying up and cleaning according to the requirements of dormitory housekeeping”, “being late”, “not turning off the lights and fans after class”, and “talking and making noise in class”, with counts of 12633, 4573, 4140, 1982, and 1859, respectively. “Absenteeism” is far more frequent than any other violation and is the most common disciplinary violation among students in the four semesters from 2022 to 2023. Serious violations that carry a recorded demerit, such as “participating in gambling” and “confronting teachers and instructors”, are the least frequent of all violations, at 74 and 62, respectively, neither exceeding 100.

Table 1. Statistics of student behavior data

Disciplinary violation	Count
Truancy	12633
Not tidying and cleaning according to dormitory requirements	4573
Late arrival	4140
Students on duty leaving lights and fans on after class	1982
Talking, making noise, and heckling in class	1859
Not organizing dormitory internal affairs according to requirements	1790
Entering and leaving the dormitory at will	1705
Arriving late and leaving early	831
Talking instead of sleeping after lights-out	650
Not in the dormitory	586
Sleeping in class	550
Cigarettes and lighters found in the dormitory	465
Speaking in class	429
Using a mobile phone during class	400
Not sitting according to the seating chart	320
Using a mobile phone in the teaching area	314
Leaving early	268
Playing games in class	223
Smoking at school	172
Staying in another student’s dormitory	133
Doing something not relevant to learning	120
Not engaging in learning	104
Refusing to perform cleaning duties	104
Eating in class	91
Disobedience to management	87
Noisy play in the dormitory	89
Gambling (demerit recorded)	74
Contradicting teachers and instructors (demerit recorded)	62

3.3.2 Disaggregated statistics on student behavioral data

Research experience indicates that students’ disciplinary behaviors in the four semesters of 2022-2023 are concentrated and therefore better suited to semester-by-semester study, and each disciplinary violation is deducted from the corresponding “moral education points”. A decision tree model was established for each of the four semesters, and the moral education scores of students were classified into four categories (education period, critical period, school treatment period, and normal), as shown in Table 2. In the table, “% in term” is the share of each category in that semester’s total, and “% in behavior classification” is that semester’s share of the four-semester total for the category. In all four semesters, the “normal” category accounts for the largest share of the semester, always more than 50%, while the “critical period” category always accounts for the smallest share: 10.58%, 8.77%, 8.26%, and 9.16% in the four semesters and 9.14% overall, i.e., close to or below 10%. Comparing semesters within each category, the second half of 2022 has the highest share in the education period and school treatment period categories, at 28.28% and 36.79%, respectively, while the second half of 2023 has the highest share in the critical period and normal categories, at 29.54% and 31.09%, respectively.

Table 2.

Classification statistics of student behavior data

Semester	Measure	Education period	Critical period	School treatment period	Normal	Total
2022B	Statistics	212	105	163	512	992
	% in term	21.37%	10.58%	16.43%	51.61%	100%
	% in behavior classification	22.63%	22.98%	20.12%	18.32%	19.84%
2022J	Statistics	265	132	298	810	1505
	% in term	17.61%	8.77%	19.80%	53.82%	100%
	% in behavior classification	28.28%	28.88%	36.79%	28.98%	30.11%
2023B	Statistics	212	85	128	604	1029
	% in term	20.60%	8.26%	12.44%	58.70%	100%
	% in behavior classification	22.63%	18.60%	15.80%	21.61%	20.58%
2023J	Statistics	248	135	221	869	1473
	% in term	16.84%	9.16%	15.00%	59.00%	100%
	% in behavior classification	26.47%	29.54%	27.28%	31.09%	29.47%
Total	Statistics	937	457	810	2795	4999
	% in term	18.74%	9.14%	16.20%	55.91%	100%
	% in behavior classification	100%	100%	100%	100%	100%
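
The two percentage rows in Table 2 can be recomputed from the raw counts. The following is a minimal sketch of that arithmetic, not the paper's own code: the counts are copied from the "Statistics" rows of the table, while the names `COUNTS`, `pct`, `percent_in_term`, and `percent_in_classification` are illustrative assumptions.

```python
# Counts copied from the "Statistics" rows of Table 2.
# Class order: education period, critical period, school treatment period, normal.
COUNTS = {
    "2022B": [212, 105, 163, 512],
    "2022J": [265, 132, 298, 810],
    "2023B": [212, 85, 128, 604],
    "2023J": [248, 135, 221, 869],
}


def pct(part, whole):
    """Percentage rounded to two decimals, matching the table's precision."""
    return round(100.0 * part / whole, 2)


def percent_in_term(counts):
    """Row percentages: share of each class within its own semester."""
    return {sem: [pct(c, sum(row)) for c in row] for sem, row in counts.items()}


def percent_in_classification(counts):
    """Column percentages: share of each semester in a class's four-semester total."""
    col_totals = [sum(col) for col in zip(*counts.values())]
    return {
        sem: [pct(c, t) for c, t in zip(row, col_totals)]
        for sem, row in counts.items()
    }


in_term = percent_in_term(COUNTS)
in_class = percent_in_classification(COUNTS)
for sem in COUNTS:
    print(sem, in_term[sem], in_class[sem])
```

Reading the critical period column of `in_term` gives the per-semester shares quoted in the text, and the largest entry in each column of `in_class` identifies the semester with the highest share of that category.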

3.3.3

Analysis of the results of the disaggregation of student behavioral data

Taking the student behavior data from the first half of 2022 as the sample, the improved C4.5 decision tree proposed in this paper was applied, and the main factors identified for that semester are plotted in Figure 4. As the figure shows, the five main factors affecting the moral education classification in the first half of 2022 are “talking and noisy in class”, “absenteeism”, “not tidying up the housekeeping of the dormitory as required”, “using mobile phones during class”, and “disobedience to management”, with importance values of 7.48%, 5.56%, 3.51%, 3.1%, and 2.24% and standardized importance values of 99.5, 73.68, 46.88, 40.05, and 29.26, respectively.

The same method yields the factors affecting the moral education classification in the second half of 2022, the first half of 2023, and the second half of 2023. The main influencing factors in the second half of 2022 are “absenteeism”, “not tidying up dormitory housekeeping as required”, “not sleeping at night when lights are out”, “talking and noisy in class”, and “absenteeism”. The main influencing factors in the first half of 2023 are “absenteeism”, “not entering and leaving the dormitory according to the requirements of work and rest”, “staying at night and not sleeping with lights out”, “not tidying up the housekeeping of the dormitory as required”, and “disobedience to management”. The main influencing factors in the second half of 2023 are “talking and noisy in class”, “absenteeism”, “not entering and leaving the dormitory according to the requirements of work and rest”, “not tidying up the housekeeping of the dormitory as required”, and “not turning off the lights and fans in the dormitory after class”.
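
For readers unfamiliar with how C4.5 ranks candidate factors, the sketch below computes the standard gain ratio (information gain divided by split information) for a binary violation indicator against a moral education class label. It illustrates textbook C4.5 only, not the improved variant proposed in this paper, and the eight student records, the feature names `absent` and `phone`, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Standard C4.5 gain ratio of one discrete feature with respect to the labels."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Expected entropy of the labels after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information penalizes features that fragment the data into many branches.
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records: 1 means the student has the violation on record.
labels = ["critical", "critical", "education", "education",
          "normal", "normal", "normal", "normal"]
absent = [1, 1, 1, 1, 0, 0, 0, 0]
phone = [1, 0, 1, 0, 1, 0, 1, 0]

for name, feature in [("absent", absent), ("phone", phone)]:
    print(name, gain_ratio(feature, labels))
```

In this toy data the `absent` indicator separates all "normal" students from the rest and therefore scores higher than `phone`, which is distributed identically across the classes and carries no information about them.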

4

Conclusion

This paper explores the integration of the decision tree method from data mining with student management and civic education in colleges and universities. The model’s classification results are analyzed empirically on a student behavior dataset from University A covering the four semesters of 2022-2023. In the iterative training on the student sample data, all four semester datasets converge at an effective number of decision trees of 250, and the predictions are stable and accurate. In the Gini index measurement, the values measured on each dataset fit the actual index values closely, and the prediction accuracies of the four datasets are 91.88%, 90.51%, 91.61%, and 94.83%, respectively, all above 90%. In the classification results, the five most frequent disciplinary violations across the four semesters are “absenteeism”, “not tidying up and cleaning according to the requirements of dormitory housekeeping”, “being late”, “students on duty do not turn off the lights and fans after class”, and “talking in class and making noise”, with counts of 12633, 4573, 4140, 1982, and 1859, respectively. Across the four semesters, the “normal” category accounts for more than 50% of moral education scores, while the “critical period” category is the smallest in every semester, at 9.14% overall.

Comparing the four semesters within each category, the second half of 2022 has the highest share of the education period and school treatment period categories, at 28.28% and 36.79%, respectively, and the second half of 2023 has the highest share of the critical period and normal categories, at 29.54% and 31.09%, respectively. Taking the student behavior data from the first half of 2022 as an example, the five main factors affecting the moral education classification are “talking and noisy in class”, “absenteeism”, “not tidying up dormitory housekeeping as required”, “using mobile phones during class”, and “disobedience to management”, with importance values of 7.48%, 5.56%, 3.51%, 3.1%, and 2.24%, respectively. Overall, the student management and civic education strategy model constructed in this paper classifies the student data well in the empirical analysis and identifies the key factors affecting students’ moral education scores and disciplinary behaviors. It also exposes weaknesses in how colleges and universities handle student management and civic education, giving administrators a basis and a direction for formulating targeted strategies.

Language: English

Publication timeframe: 1 time per year
Journal Subjects: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

Wei Luo

Published Online: Mar 19, 2025

Received: Oct 26, 2024

Accepted: Feb 19, 2025

DOI: https://doi.org/10.2478/amns-2025-0428

Keywords: Data mining, Decision tree, ID3 algorithm, C4.5 algorithm, Student management

© 2025 Wei Luo, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
