Extraction and analysis of individual features of students’ ideological and political education based on big data clustering algorithm
Published online: 19 March 2025
Submitted: 27 Oct. 2024
Accepted: 18 Feb. 2025
DOI: https://doi.org/10.2478/amns-2025-0531
© 2025 Ying Zhai, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
With the rapid development of information technology, information technology and daily life intersect ever more closely. As the Internet has spread, the volume of global data has grown explosively and data from around the world are increasingly aggregated; this surge of massive data affects the economic and social development of every country as well as people’s lives [1-2]. At the same time, the development of big data technology provides impetus and technical support for innovation in modern ideological and political education [3]. Applying big data technology to ideological and political education is a natural step as education keeps pace with the times and big data technology matures [4]. Big data technology helps colleges and universities optimize and innovate the methods, modes, management, and evaluation of ideological and political education, which improves its precision and effectiveness [5-6]. Colleges and universities should treat the digital revolution as a major historical opportunity: give full play to the role of big data in ideological and political education, coordinate and balance the dichotomies this education faces in the big data setting, identify effective paths for big-data-driven innovation, improve the effectiveness of ideological and political education, and so write a new chapter of ideological and political education in the history of human digital development.
As a product of the new era, big data has both natural and social characteristics and powerful data analysis and prediction functions, and it can capture the ideological dynamics and behavioral trajectories of college students as a whole. Li, Y. et al. showed that transforming and analyzing the huge volume of data generated by teaching activities with big data technology allows teaching quality to be monitored accurately and teaching methods to be improved, giving universities strong support for strengthening teaching [7]. Liu, R. constructed a data analysis model for educational evaluation using the K-means clustering technique; by mining and analyzing the important features generated in the teaching and learning process, the model gives a detailed picture of the problems students face in learning and provides a basis for developing effective student management strategies [8]. Kausar, S. et al. analyzed the performance of different mining techniques in educational data analysis and pointed out that more robust educational mining tools can adaptively mine effective information from large amounts of educational data and so support administrators’ educational decisions [9]. Wang, Z. emphasized the important role of data mining in instructional management, using clustering algorithms to assess student performance; clustering sorts students into groups with a high degree of similarity, helping instructional administrators understand students’ learning characteristics and adjust their teaching strategies [10]. Oladipupo, O. O. et al. used median deviation statistics to assess the performance of different clustering algorithms in mining knowledge from student data; the algorithms with the highest clustering potential give educational administrators more accurate clustering features, which matters for improving learning outcomes and student success and for supporting strategic decision making [11]. Urbina Nájera, A. B. et al. introduced clustering algorithms that reflect the teacher-student relationship in instructional tutoring; using algorithms such as k-means to relate teacher and student skills and affinity can effectively strengthen tutoring [12]. Dwivedi, S. et al. explored big data analytics in educational recommender systems, extracting individual student characteristics to discover patterns of similarity in grade level, subject matter, and so on, and then recommending appropriate courses of study [13]. Ding, D. et al. proposed a student behavior description index system and a student behavior segmentation model based on cluster analysis, which identifies student groups with different cluster characteristics by segmenting their behaviors and so facilitates student management in schools [14]. Data analysis and processing technology can thus produce both a “group portrait” and an “individual portrait” of college students. In the field of ideological and political education, however, how to use big data technology to achieve precise educational goals still needs further research.
The big data clustering algorithm helps reveal the characteristics and developmental laws of college student groups and supports targeted adjustment of the existing ideological and political teaching system in colleges and universities. First, drawing on clustering theory, this paper focuses on students of higher vocational colleges and adopts the FCM algorithm to construct student portraits. Second, relying on the school information platform dataset and taking a psychological perspective, it extracts students’ individual characteristics in three respects: diligence, sleep pattern, and consumption behavior. Finally, students are divided into five groups and the Target Group Index (TGI) is introduced to characterize them. Comparing the clustered TGI values reveals the characteristics and growth patterns of the student body as a whole and of the different groups.
Two data structures are generally used in cluster analysis: the data matrix and the dissimilarity matrix.
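Assume an entity set with $n$ objects, each described by $p$ variables. The data matrix is the $n \times p$ object-by-variable structure:
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Let $d(i,j)$ denote the dissimilarity between objects $i$ and $j$, with $d(i,j) \ge 0$, $d(i,i) = 0$, and $d(i,j) = d(j,i)$. The dissimilarity matrix is the $n \times n$ object-by-object structure:
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$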
Because many clustering algorithms operate on dissimilarity matrices, the data matrix is often transformed into a dissimilarity matrix before the clustering algorithm is used.
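The transformation from a data matrix to a dissimilarity matrix can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the toy data are hypothetical, and Euclidean distance is assumed as the dissimilarity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(data, dist):
    """Build the symmetric n x n dissimilarity matrix of an n x p data matrix."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # Only the lower triangle is computed; symmetry fills the rest.
            d[i][j] = d[j][i] = dist(data[i], data[j])
    return d


# Hypothetical toy data matrix: 3 objects described by 2 variables.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data, euclidean)
```

Passing the distance as a function keeps the matrix construction independent of the variable type, so the other dissimilarity measures discussed in this section could be plugged in the same way.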
Depending on the type of data, the calculation of the degree of difference between the data is different. The common types of data and their calculation of the degree of difference are described below:
Interval-scaled variables are continuous measures of a rough linear scale. The clustering results of this type of variable may be strongly influenced by the units. Generally the smaller the units of the variable, the larger the domain of values and the greater the impact on the clustering results. Data can generally be standardized to reduce the impact of variable units on clustering results.
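Calculate the mean absolute deviation $s_f$ of variable $f$:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where $x_{1f}, \ldots, x_{nf}$ are the $n$ measured values of $f$ and $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$ is their mean.
Calculate the standardized measure, or z-score:
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$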
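A small sketch of this standardization is given below, assuming that the deviation used is the mean absolute deviation (the average of the absolute differences from the mean) and that the standardized measure divides the centered value by it. The function name is hypothetical.

```python
def standardize(values):
    """Standardize one variable using its mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: average distance of the values from their mean.
    mad = sum(abs(x - mean) for x in values) / n
    if mad == 0:
        # A constant variable carries no information; map it to zeros.
        return [0.0] * n
    return [(x - mean) / mad for x in values]


# The same measurements in two different units give the same standardized values.
z_small_units = standardize([200.0, 400.0, 600.0, 800.0])
z_large_units = standardize([2.0, 4.0, 6.0, 8.0])
```

The two calls illustrate why standardization removes the influence of the measurement unit on the clustering result.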
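For interval-scaled variables, the following distance measures are generally used to describe the dissimilarity between each pair of objects. Let objects $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ be two $p$-dimensional data objects. The Euclidean distance is defined as:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
The Manhattan distance is defined as:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
Minkowski distance is defined as:
$$d(i,j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$$
In the above equation, $h$ is a positive integer; $h = 1$ gives the Manhattan distance and $h = 2$ gives the Euclidean distance.
The above distance formulas have the following properties: $d(i,j) \ge 0$ (non-negativity); $d(i,i) = 0$ (the distance of an object to itself is zero); $d(i,j) = d(j,i)$ (symmetry); and $d(i,j) \le d(i,k) + d(k,j)$ for any object $k$ (triangle inequality).
When it is necessary to differentiate the importance of variables, weights $w_1, \ldots, w_p$ can be assigned to the variables with the following formula:
$$d(i,j) = \sqrt{w_1|x_{i1} - x_{j1}|^2 + w_2|x_{i2} - x_{j2}|^2 + \cdots + w_p|x_{ip} - x_{jp}|^2}$$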
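The three distances and the weighted variant can be covered by one function, sketched below. The order parameter is called `h` here as an assumption (1 for Manhattan, 2 for Euclidean), and the weights are assumed to multiply each per-variable term before the root is taken.

```python
def minkowski(a, b, h=2, weights=None):
    """Weighted Minkowski distance; h=1 is Manhattan, h=2 is Euclidean."""
    if weights is None:
        weights = [1.0] * len(a)  # unweighted case
    total = sum(w * abs(x - y) ** h for w, x, y in zip(weights, a, b))
    return total ** (1.0 / h)


a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(a, b, h=1)
euclid = minkowski(a, b, h=2)
# Giving the first variable four times the weight doubles its contribution
# to the (Euclidean) distance along that axis.
weighted = minkowski([0.0, 0.0], [1.0, 0.0], h=2, weights=[4.0, 1.0])
```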
Binary variables have two states, 0 and 1: the value is 1 when the variable is present and 0 otherwise. The possible values of binary variables for two objects $i$ and $j$ are shown in Table 1, where $q$ is the number of variables equal to 1 for both objects, $r$ the number equal to 1 for object $i$ and 0 for object $j$, $s$ the number equal to 0 for object $i$ and 1 for object $j$, and $t$ the number equal to 0 for both objects. The total number of variables is $p = q + r + s + t$.
Table 1. Possible values of binary variables
Object $i$ \ Object $j$ | 1 | 0 | Sum |
---|---|---|---|
1 | $q$ | $r$ | $q + r$ |
0 | $s$ | $t$ | $s + t$ |
Sum | $q + s$ | $r + t$ | $p$ |
Binary variables are subdivided into symmetric and asymmetric binary variables.
A variable is symmetric binary if its two states have equal value and the same weight. An example is the gender attribute. The dissimilarity formula for symmetric binary variables is:
$$d(i,j) = \frac{r + s}{q + r + s + t}$$
An asymmetric binary variable is one whose two states are not equally important, such as the positive and negative results of a disease test. By convention the more important outcome, usually the rarer one (here the positive result), is coded as 1. Because a match on 0 then carries little information, $t$ is ignored, and the dissimilarity of asymmetric binary variables is:
$$d(i,j) = \frac{r + s}{q + r + s}$$
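The two binary dissimilarities can be sketched as below. The counts follow the usual contingency-table convention, which is an assumption here: `q` counts variables that are 1 for both objects, `r` those that are 1 only for the first, `s` those that are 1 only for the second, and `t` those that are 0 for both. The example vectors are hypothetical test results.

```python
def binary_counts(a, b):
    """Return (q, r, s, t) for two 0/1 vectors of equal length."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_dissimilarity(a, b):
    """Both states count equally: mismatches over all variables."""
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(a, b):
    """Matches on 0 are ignored: mismatches over variables with at least one 1."""
    q, r, s, _ = binary_counts(a, b)
    denom = q + r + s
    return (r + s) / denom if denom else 0.0


# Hypothetical test results for two students (1 = positive).
x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
counts = binary_counts(x, y)
d_sym = symmetric_dissimilarity(x, y)
d_asym = asymmetric_dissimilarity(x, y)
```

The asymmetric value is larger because the three shared zeros no longer count as agreement.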
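A categorical variable is similar to a binary variable, but it can take more than two state values. For example, a color variable may have five values: red, yellow, orange, green, and blue. Assuming that the number of states of a categorical variable is $M$, the dissimilarity between two objects can be computed in two ways. Simple matching method:
$$d(i,j) = \frac{p - m}{p}$$
where $m$ is the number of variables on which objects $i$ and $j$ take the same state and $p$ is the total number of variables.
Create a binary variable for each of the $M$ states: for an object in a given state, the binary variable for that state is set to 1 and the remaining ones to 0, after which the dissimilarity formulas for binary variables apply.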
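Both treatments of categorical variables can be sketched briefly. Simple matching is assumed here to be the share of variables on which two objects disagree; the second helper creates one binary variable per state. The function names and example values are hypothetical.

```python
def simple_matching(a, b):
    """Share of categorical variables on which two objects disagree."""
    p = len(a)                                   # total number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Encode one categorical value as a binary variable per state."""
    return [1 if value == state else 0 for state in states]


colors = ["red", "yellow", "orange", "green", "blue"]
encoded = one_hot("green", colors)

# Two hypothetical students described by three categorical variables.
d = simple_matching(["red", "science", "urban"], ["red", "arts", "urban"])
```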
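Ordinal variables are divided into discrete ordinal variables and continuous ordinal variables. Discrete ordinal variables are similar to categorical variables except that the ordering of the $M$ states is meaningful. Continuous ordinal variables resemble continuous data of unknown scale, for which the relative order of the values matters but their actual magnitude does not.
Suppose that some ordinal-type variable $f$ has $M_f$ ordered states $1, \ldots, M_f$. Its dissimilarity is computed in three steps:
Step 1: Replace the value $x_{if}$ of variable $f$ for object $i$ with its rank $r_{if} \in \{1, \ldots, M_f\}$.
Step 2: Map each rank onto $[0, 1]$ so that variables with different numbers of states carry equal weight, replacing $r_{if}$ with
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
Step 3: Compute the dissimilarity from the $z_{if}$ values using any of the distance measures for interval-scaled variables.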
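The usual treatment of an ordinal variable, assumed here, is to replace each state by its rank and rescale the rank to the interval from 0 to 1, after which interval-scaled distances apply. The sketch below uses a hypothetical three-level rating.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal state to [0, 1] via its 1-based rank.

    ordered_states must list at least two states from lowest to highest.
    """
    rank = ordered_states.index(value) + 1
    m = len(ordered_states)
    return (rank - 1) / (m - 1)


levels = ["fair", "good", "excellent"]  # hypothetical ordered states
z = [ordinal_to_unit_interval(v, levels) for v in levels]
# Dissimilarity between two objects on this variable is then the
# absolute difference of their rescaled ranks.
d_fair_excellent = abs(z[0] - z[2])
```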
Proportional scale (ratio-scaled) variables take positive values on a nonlinear scale and approximately follow an exponential form such as $Ae^{Bt}$ or $Ae^{-Bt}$, where $A$ and $B$ are positive constants. There are three ways to compute their dissimilarity:
(1) Use the same method as for interval-scaled variables.
(2) Apply a logarithmic transformation to the values of the proportional scale variable, and then compute the dissimilarity of the transformed values with the methods used for interval-scaled variables.
(3) Treat the proportional scale variable as continuous ordinal data and process it in the ordinal manner.
The treatment should be chosen according to actual needs, but the latter two are generally more effective.
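In actual data processing, the objects to be processed are often described by variables of several different types. For objects described by mixed types of variables, the degree of difference between them can be computed using Equation (11). The degree of dissimilarity $d(i,j)$ between objects $i$ and $j$ is:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \tag{11}$$
where the indicator $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and $f$ is an asymmetric binary variable; otherwise $\delta_{ij}^{(f)} = 1$.
Depending on the type of variable $f$, its contribution $d_{ij}^{(f)}$ is computed as follows. If $f$ is binary or categorical, $d_{ij}^{(f)} = 0$ when $x_{if} = x_{jf}$ and $d_{ij}^{(f)} = 1$ otherwise. If $f$ is interval-scaled, $d_{ij}^{(f)} = \dfrac{|x_{if} - x_{jf}|}{\max_g x_{gf} - \min_g x_{gf}}$, where $g$ runs over all objects with a non-missing value of $f$. If $f$ is ordinal or proportional scale, compute the ranks $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.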
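A hedged sketch of a mixed-type dissimilarity is shown below. It assumes the common scheme in which each variable contributes a value between 0 and 1, variables with a missing value are skipped, and a 0-0 match on an asymmetric binary variable is also skipped. The `kinds` labels and the `ranges` argument (the max-minus-min range of each interval variable) are hypothetical conventions of this example.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-variable dissimilarity over the variables that count."""
    num = 0.0
    den = 0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # missing value: this variable is not counted
        if kind == "asymmetric" and x == 0 and y == 0:
            continue  # 0-0 match on an asymmetric binary variable is ignored
        if kind == "interval":
            contribution = abs(x - y) / value_range  # normalized to [0, 1]
        else:
            # binary or categorical: 0 on a match, 1 on a mismatch
            contribution = 0.0 if x == y else 1.0
        num += contribution
        den += 1
    return num / den if den else 0.0


kinds = ["asymmetric", "categorical", "interval"]
ranges = [None, None, 40.0]  # only the interval variable needs a range
d1 = mixed_dissimilarity([1, "red", 10.0], [0, "red", 30.0], kinds, ranges)
d2 = mixed_dissimilarity([0, "red", None], [0, "blue", 30.0], kinds, ranges)
```

In the second call only the categorical variable counts, so the result is driven entirely by the color mismatch.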
Clustering is an unsupervised learning method that aims to group data objects in a dataset into distinct groups or clusters in such a way that the data objects within a group are highly similar and the similarity between groups is low. Clustering algorithms usually group data without prior knowledge or labeling. As one of the classical machine learning algorithms, clustering algorithms are characterized by simple principles, high efficiency and practicality, and have high applicability in several application areas such as image segmentation, document analysis, feature learning and market segmentation.
Fuzzy clustering was proposed in the late 1960s as an improvement on clustering techniques. It was developed from the traditional distance-based clustering techniques and aims to solve the problems that traditional algorithms have with complex datasets. Fuzzy clustering is based on a fuzzy partition of the objects combined with a fuzzy measure of each object. Its basic idea is that a data point in the clustering space can belong to several clusters at the same time, with a proportional weight indicating its degree of belonging to each cluster. The results are expressed as a fuzzy matrix, from which, combining subjective and objective considerations, one can decide which clusters each object belongs to and to what degree; this makes the results easier to interpret.
FCM is the best-known fuzzy clustering method. In hard clustering algorithms, represented by K-means, each feature vector is a member of exactly one cluster; in FCM, a feature vector can belong to one or more clusters with different degrees of membership, subject to the constraint that its membership degrees sum to one.
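Let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset of $n$ feature vectors to be divided into $c$ clusters with center vectors $V = \{v_1, v_2, \ldots, v_c\}$, and let $U = [u_{ij}]$ be the $n \times c$ affiliation (membership) matrix, where $u_{ij} \in [0, 1]$ is the degree to which $x_i$ belongs to cluster $j$ and $\sum_{j=1}^{c} u_{ij} = 1$ for every $i$.
The FCM minimizes an objective function that takes into account the distance from each feature vector to each cluster center:
$$J_m(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \lVert x_i - v_j \rVert^2$$
In the equation, $m > 1$ is the fuzziness exponent that controls how soft the partition is, and $\lVert x_i - v_j \rVert$ is the distance between feature vector $x_i$ and cluster center $v_j$.
Algorithm 1: Fuzzy C-means (FCM)
Input: dataset $X$, number of clusters $c$, fuzziness exponent $m$, stopping threshold $\varepsilon$
Output: set of clustering center vectors $V$
Step 1: Initialize the affiliation matrix $U$ so that each row sums to 1
Step 2: Calculate the degree of affiliation $u_{ij} = \left[\sum_{k=1}^{c} \left(\dfrac{\lVert x_i - v_j \rVert}{\lVert x_i - v_k \rVert}\right)^{2/(m-1)}\right]^{-1}$
Step 3: Calculate the clustering center $v_j = \dfrac{\sum_{i=1}^{n} u_{ij}^{m} x_i}{\sum_{i=1}^{n} u_{ij}^{m}}$
Step 4: Repeat steps 2 to 3 until the error variation rate calculated by Equation (17) is less than $\varepsilon$
Step 5: Return the set of clustering center vectors $V$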
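The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the standard FCM update formulas are used; the centers are computed from the initialized affiliation matrix before the first affiliation update, since that update needs centers; and, because Equation (17) is not reproduced here, the stopping rule is assumed to be the largest change in any membership falling below `eps`. The names `fcm`, `m`, `eps`, `max_iter`, and `seed` are hypothetical.

```python
import math
import random


def fcm(data, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-means on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    # Step 1: random affiliation matrix, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    power = 2.0 / (m - 1.0)
    centers = []
    for _ in range(max_iter):
        # Cluster centers: means of the data weighted by u_ij ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(wi * x[k] for wi, x in zip(w, data)) / total for k in range(p)]
            )
        # Affiliation update from the distances to the current centers.
        new_u = []
        for x in data:
            dists = [math.dist(x, v) for v in centers]
            if 0.0 in dists:
                # The point sits exactly on a center: give it crisp membership.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_u.append([h / sum(hits) for h in hits])
            else:
                new_u.append(
                    [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
                )
        # Assumed stopping rule: largest change in any membership value.
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:
            break
    return centers, u


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, memberships = fcm(points, c=2)
labels = [row.index(max(row)) for row in memberships]
```

Taking the cluster with the largest membership for each point, as in the last line, turns the fuzzy partition into a hard one when a single label per student is needed.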
Ideological and political education in higher vocational colleges and universities needs to adjust the existing teaching system and carry out teaching practices precisely through student portraits, so there is an urgent need to obtain student portrait labels from the data collected in schools. Establishing portrait labels is the key work of user profiling, and the label name is a semanticized short text to summarize the function or meaning of the label. Based on the logic of teaching Civics class in higher vocational colleges and universities, the labels are extracted from students’ basic situation data, academic level data, social practice data, and physical and mental health data.
Through the above data collection and label extraction, the Civics teaching management department of higher vocational colleges and universities can draw portraits of incoming students at the grade, major, and class levels before the first semester of the new academic year begins, so as to formulate the teaching plan and assign lecturers reasonably. First, at the grade level, the student portrait is combined with the academic level data labels to make a first round of division between freshmen who have never taken university Civics classes and students who have already taken them. Second, at the major level, the portrait is combined with the basic information and academic level data labels to make a second round of division according to disciplinary background (liberal arts, science, engineering, agriculture, and art) and the overall performance profile of the major (including willingness to learn Civics and learning ability in other subjects). Third, at the class level, the portrait is combined with the basic situation and academic level data labels to make a third round of division according to middle and high school background and enthusiasm for independent learning. Finally, at the student level, the portrait is combined with the basic situation, social practice, and physical and mental health data labels to make a fourth round of division according to students’ individual profiles (including freshmen and former freshmen, nature of hukou, age, sex, and grade), professional recognition, social service awareness, self-management, and sound personality.
Following the teaching arrangement, the Civics teaching management department can start from the first round of division and carry out label extraction, classification, and integration for the target group, so as to construct a student portrait that matches actual teaching.
The scope of the acquired dataset is mainly the data generated by students using the information system provided by the school. In order to abstract business scenarios as much as possible, the behavioral characteristics of students are portrayed in terms of their intrinsic traits, so that these characteristics can play a certain degree of universal value in a wide range of application scenarios. Therefore, in this paper, we portray students’ individual characteristics from three perspectives: diligence, sleep patterns, and consumption behavior situations.
Diligence is strongly correlated with its sub-trait, aggressiveness, which reflects the expectation of pursuing competence and success in work or study. Ideally, diligence would be measured by the amount of time a student spends studying. Campus card data contain no such explicit metric, but the frequency with which a student is present in study areas can serve as a proxy. The most obvious study areas are libraries and academic buildings. For security reasons, colleges and universities install access control at the library so that no one can enter without a campus card; students must therefore swipe their cards to get in, and they also borrow books with their campus cards. Academic buildings have no access control, but students usually fetch water there during classes or self-study. In Chinese colleges and universities, the boiled-water dispensers in academic buildings are usually fitted with card readers, so students must swipe their cards to get water, which also saves water. These consumption records can therefore be taken to reflect students’ study activity in the building.
For this reason, this paper counts the number of times each student appears in the academic building and the library and uses the count as the diligence indicator: the more often a student appears in these places, the more diligent the student is taken to be. The correlation between diligence and academic performance is shown in Fig. 1(a) and Fig. 1(b), where the effect of the mean has been removed so that different students can be compared. As the figures show, diligence and achievement rank are inversely related (more diligent students hold numerically smaller, that is better, ranks), with a mean Spearman rank correlation coefficient of -0.381. This observation suggests that students’ efforts tend to be rewarded with good academic performance. The correlation also varies across semesters. In particular, the first-semester coefficients of -0.152 and -0.149 are markedly weaker than those of the other semesters, where the coefficient is stronger than -0.261. One possible reason is that first-semester grades may still depend heavily on what was learned in high school. Another is that many universities impose a strict routine during freshman year, which reduces the variation in behavior across students. It is worth noting that this paper also counts visits to the print room as an indicator of diligence, because students often need to print a lot of instructional materials for review before exams (20 days). Cramér’s V was used for the sleep-pattern feature, while the other correlations are Spearman coefficients; the p-values of all correlations were much smaller than 0.001.

Figure 1. Correlation between academic achievement and diligence
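The Spearman rank correlation used for the diligence indicator can be sketched as the Pearson correlation of the ranks, with tied values sharing their average rank. The visit counts and grade ranks below are made-up illustrative values, not data from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: library/academic-building visits and grade rank (1 = best).
visits = [120, 95, 60, 30, 10]
grade_rank = [1, 2, 3, 4, 5]
rho = spearman(visits, grade_rank)
```

In this toy case more visits always go with a better (smaller) rank, so the coefficient reaches the extreme negative value; real data such as those reported above give a much weaker negative coefficient.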
Literature examining the relationship between sleep patterns and student achievement suggests that students with good sleep habits are likely to achieve good grades. In particular, wake-up time and bedtime have a significant impact on student achievement, and students who stay up late and sleep in are generally less successful.
In the dataset collected from the schools, there were no actual wake-up and bedtimes. However, students need to use their campus cards to turn on the water, enter and exit the dormitory or take a shower. Therefore, in this paper, the first time and the last time of swiping the card each day are used as the wake-up time and bedtime. Considering the variability of the first and last time of the day for different students, this paper does not use too specific time to represent the sleeping pattern. Therefore, the frequency of the first swipe and the last swipe of each student’s campus card corresponding to the hour is calculated, and the timestamp of the highest frequency is used to denote the sleeping pattern of each student. Corresponding to wake-up and sleep patterns, there are only a few discrete values to represent them, for example, wake-up patterns are mainly concentrated at 6:00, 7:00, 8:00, 9:00, and 10:00, while sleep patterns are mainly concentrated at 22:00, 23:00, 24:00, 1:00, and 2:00. The correlation between students’ sleep patterns and grades is shown in Fig. 2(a) and Fig. 2(b).

Figure 2. Correlation between sleep duration and student achievement
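The extraction of a sleep pattern from card swipes can be sketched as follows. The sketch assumes one list of swipe timestamps per student and takes the most frequent hour of the first and of the last swipe per day. Because bedtimes after midnight belong to the previous evening, each "day" is assumed to start at 04:00; that cutoff and the function name are assumptions of this example, not values from the paper.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

DAY_START_HOUR = 4  # assumed cutoff: swipes before 04:00 count for the previous day


def sleep_pattern(swipes):
    """Return (typical wake-up hour, typical bedtime hour) for one student."""
    by_day = defaultdict(list)
    for ts in swipes:
        day = (ts - timedelta(hours=DAY_START_HOUR)).date()
        by_day[day].append(ts)
    first_hours = Counter(min(day_swipes).hour for day_swipes in by_day.values())
    last_hours = Counter(max(day_swipes).hour for day_swipes in by_day.values())
    # The most frequent hour represents the student's pattern.
    return first_hours.most_common(1)[0][0], last_hours.most_common(1)[0][0]


# Hypothetical swipes over three days; the 00:20 swipe closes the second day.
swipes = [
    datetime(2024, 3, 4, 7, 10), datetime(2024, 3, 4, 12, 0), datetime(2024, 3, 4, 23, 30),
    datetime(2024, 3, 5, 7, 40), datetime(2024, 3, 5, 18, 0), datetime(2024, 3, 6, 0, 20),
    datetime(2024, 3, 6, 8, 5), datetime(2024, 3, 6, 23, 50),
]
wake_hour, bed_hour = sleep_pattern(swipes)
```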
In Figure 2(a), students who wake up later generally have worse grades. Since the first class usually starts at 8:00 a.m., students who wake up at 6:00-7:00 a.m. or 7:00-8:00 a.m. are considered early risers, and those who often wake up during these hours have the best grades. Figure 2(b) shows that students who stay up late also have lower grades, possibly because they indulge in recreational activities such as games or novels. This paper therefore includes the sleep pattern among the individual characteristics. Because the sleep pattern feature is one-hot encoded (categorical), the Spearman correlation coefficient cannot be calculated; Cramér’s V between discretized grades and sleep patterns is used instead, with a mean of 0.0677.
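Cramér's V for two categorical sequences can be sketched from the chi-square statistic of their contingency table. The standard definition is assumed (the square root of chi-square divided by the sample size times the smaller table dimension minus one); the wake-up hours and grade bands below are made-up illustrative values.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences of equal length."""
    n = len(xs)
    row_totals, col_totals = Counter(xs), Counter(ys)
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical data: wake-up hour category and discretized grade band.
wake = [6, 6, 7, 7, 8, 8]
band = ["high", "high", "mid", "mid", "low", "low"]
v_perfect = cramers_v(wake, band)

# No association: every wake-up hour is spread evenly over the grade bands.
v_none = cramers_v([6, 6, 7, 7], ["high", "low", "high", "low"])
```

The value ranges from 0 (no association) to 1 (each category of one variable determines the other), which puts a mean such as 0.0677 close to the lower end of the scale.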
The students’ consumption behavior data, extracted and aggregated according to the consumption behavior description indices, are shown in Table 2. To characterize consumption behavior more precisely, the indices are defined as follows: the average monthly consumption amount is the amount a student spends in the restaurant; the peak monthly consumption amount is the student’s total spending in the restaurant and the supermarket; and the average monthly consumption frequency is the number of times the student makes purchases in the restaurant.
Table 2. Student consumption behavior data
Student number | Average monthly consumption/Yuan | Peak monthly consumption/Yuan | Average monthly consumption frequency/times |
---|---|---|---|
201901**** | 739.54 | 820.24 | 73.0 |
201901**** | 832.28 | 1093.13 | 87.3 |
201901**** | 428.42 | 592.01 | 50.3 |
201901**** | 924.14 | 1093.23 | 59.1 |
201901**** | 380.09 | 532.08 | 50.2 |
201901**** | 879.24 | 967.29 | 78.5 |
201901**** | 921.33 | 987.23 | 79.2 |
201901**** | 350.20 | 380.92 | 48.3 |
201901**** | 730.29 | 829.34 | 78.1 |
201901**** | 926.24 | 988.26 | 80.5 |
201901**** | 539.21 | 678.29 | 67.9 |
201901**** | 370.13 | 540.23 | 73.2 |
201901**** | 863.92 | 930.92 | 64.1 |
201901**** | 572.39 | 783.93 | 70.3 |
Considering the actual circumstances of the school and its students, the overall average monthly consumption is not large, indicating that students do not spend much in the restaurant and that restaurant prices are favorable. The overall average monthly consumption frequency is also not high, indicating that college students’ diets are rather irregular and that they generally do not eat breakfast.
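The three consumption indicators of Table 2 could be derived from raw card transactions roughly as sketched below. The record layout `(month, place, amount)` and the place labels are hypothetical, and the peak amount is interpreted here, as an assumption, as the largest monthly total across restaurant and supermarket spending.

```python
from collections import defaultdict


def consumption_features(transactions):
    """Aggregate one student's transactions into three monthly indicators."""
    restaurant_amount = defaultdict(float)
    restaurant_count = defaultdict(int)
    total_amount = defaultdict(float)
    for month, place, amount in transactions:
        if place == "restaurant":
            restaurant_amount[month] += amount
            restaurant_count[month] += 1
        if place in ("restaurant", "supermarket"):
            total_amount[month] += amount
    months = sorted(total_amount)
    n = len(months)
    return {
        # average restaurant spending per month
        "avg_monthly_amount": sum(restaurant_amount[m] for m in months) / n,
        # largest monthly total over restaurant and supermarket (assumed reading)
        "peak_monthly_amount": max(total_amount[m] for m in months),
        # average number of restaurant purchases per month
        "avg_monthly_frequency": sum(restaurant_count[m] for m in months) / n,
    }


# Hypothetical transactions for one student over two months.
records = [
    ("2019-09", "restaurant", 10.0),
    ("2019-09", "restaurant", 12.0),
    ("2019-09", "supermarket", 30.0),
    ("2019-10", "restaurant", 8.0),
    ("2019-10", "supermarket", 5.0),
]
features = consumption_features(records)
```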
In this paper, the clustered student data are divided into five categories, four special groups and one reference group: Class I, the group with learning difficulties; Class II, the group with economic difficulties; Class III, the group with psychological difficulties; Class IV, the group with employment difficulties; and Class V (reference group), the general group.
To outline the group portraits more easily, the Target Group Index (TGI) is introduced to characterize the different student groups. The TGI measures how strongly a group exhibits a given characteristic: it is the ratio of the proportion of that characteristic in the target group to its proportion in the whole population, expressed on a scale where 100 means the two proportions are equal. It thus captures both the strength of a characteristic within a group and the differences between groups. A TGI above 100 indicates an above-average tendency, and a TGI below 100 a below-average one. Usually a TGI above 120 is taken to mean that the characteristic is markedly more pronounced in the target group, and a TGI below 80 that it is markedly more pronounced in the population as a whole. The results of clustering TGI values for the different student group characteristics are shown in Table 3.
Table 3. Clustering table of TGI values of student group characteristics
Characteristic label | Group with learning difficulties | Group with financial difficulties | Group with psychological difficulties | Group with employment difficulties | General group |
---|---|---|---|---|---|
Diligence | 82.89 | 84.92 | 88.03 | 81.47 | 94.18 |
Sleep pattern | 159.23 | 92.11 | 110.35 | 137.35 | 98.46 |
Consumption behavior | 105.32 | 74.92 | 101.29 | 102.51 | 103.63 |
For example, the TGI value of the economically disadvantaged group in the characteristic of “consumption behavior” is 74.92, which indicates that the performance of the target group is significantly lower than that of the overall level in this characteristic. The Target Group Index (TGI) is a numerical representation of the group’s characteristics, and a portrait based on the TGI can better reflect the characteristics and growth patterns of this group of students.
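The TGI calculation and its reading can be sketched as follows. The scaling by 100 is an assumption implied by the thresholds of 100, 120, and 80 used in the text; the function names and the example shares are hypothetical.

```python
def tgi(target_share, overall_share):
    """Target Group Index: share in the target group relative to the overall share."""
    if overall_share <= 0:
        raise ValueError("overall_share must be positive")
    return target_share / overall_share * 100.0


def interpret_tgi(value):
    """Read a TGI value against the 120 / 80 thresholds."""
    if value > 120:
        return "markedly above the population"
    if value < 80:
        return "markedly below the population"
    return "close to the population"


# Hypothetical shares: 15% of the target group vs. 20% of all students.
example = tgi(0.15, 0.20)
reading = interpret_tgi(74.92)  # consumption TGI of the economically disadvantaged group
```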
To further reveal the characteristics and growth patterns of the student body as a whole and of the different groups, and to turn the description of group characteristics from data into a graphical portrait, the results of TGI value clustering are shown in Figure 3. The TGI values of each characteristic show that the different student groups have both commonalities and differences in the three characteristics.

Figure 3. Clustering results of TGI values of student group characteristics
Across the three behavioral labels, the diligence TGI values of the five student groups differ little. Their average is 86.298, which is below the baseline of 100 but not below 80; this is read as indicating that the five groups are firm in their pursuit of values and in their cultural confidence, with the majority striving persistently and a minority choosing to “lie down”.
The average sleep-pattern TGI of the five groups is 119.5, above the baseline, but the groups differ widely. The TGI values of the learning-difficulty group (159.23) and the employment-difficulty group (137.35) are well over 120, which is attributed to students’ poor self-discipline and initiative, entertainment-dominated use of cell phones, and weak capacity for self-education and self-management. It is therefore necessary to make use of role models to improve the learning atmosphere and to cultivate students’ awareness of independent learning.
The average consumption-behavior TGI of the five groups is 97.534, close to the baseline. Within it, the TGI of the economically disadvantaged group is 74.92, well below the baseline. This group should be supported precisely, through a model that centers on financial aid and combines it with spiritual education.
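The averages quoted above can be checked directly against the values in Table 3. The short sketch below recomputes the mean TGI of each characteristic and lists the groups whose sleep-pattern TGI exceeds 120; only the variable names are introduced here.

```python
groups = [
    "learning difficulties",
    "financial difficulties",
    "psychological difficulties",
    "employment difficulties",
    "general",
]

# TGI values copied from Table 3, in the group order above.
table3 = {
    "Diligence": [82.89, 84.92, 88.03, 81.47, 94.18],
    "Sleep pattern": [159.23, 92.11, 110.35, 137.35, 98.46],
    "Consumption behavior": [105.32, 74.92, 101.29, 102.51, 103.63],
}

# Mean TGI per characteristic across the five groups.
means = {label: sum(values) / len(values) for label, values in table3.items()}

# Groups whose sleep-pattern TGI is above the 120 threshold.
sleep_over_120 = [g for g, v in zip(groups, table3["Sleep pattern"]) if v > 120]
```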
In order to adjust the existing ideological and political education teaching system in higher vocational colleges and universities, this paper extracts and analyzes individual characteristics of students’ ideological and political education based on big data clustering algorithm.
Diligence and grade ranking are inversely related, with an average Spearman’s rank correlation coefficient of -0.381. Sleep pattern and grade ranking are also associated, with an average Cramér’s V of 0.0677 and p-values much smaller than 0.001. Consumption behavior differs across students.
The diligence TGI values of the five student groups differ little, with a mean of 86.298, which is below the baseline of 100 but not below 80. The mean sleep-pattern TGI is 119.5, which exceeds the baseline but varies greatly between groups: the TGI values of the group with learning difficulties and the group with employment difficulties are well above 120. The mean consumption-behavior TGI is 97.534, and the TGI of the economically disadvantaged group is 74.92, well below the baseline.
The big data clustering algorithm intuitively reveals the characteristics and growth patterns of different student groups, and improves the precision of ideological and political education. Precise ideological and political education in colleges and universities in the new era should not only focus on extracting effective information from data, but also build an integrated system including data extraction, analysis and research, law discovery, strategy formulation, effective implementation, evaluation and feedback, and dynamic adjustment, in order to meet the requirements of ideological and political education of “transforming according to the events, advancing according to the times, and being new according to the situation”.
This research was supported by the Hubei Province Philosophy and Social Science Research Special Task Project of 2020 (Ideological and Political Theory Course), “Research on Synergistic Training of Innovation and Entrepreneurship Education and Ideological and Political Education in Universities” (20Z059).