
Research on the Innovation of College Students’ Employment Guidance Methods and Their Practical Effects in Higher Education Institutions under the Environment of Big Data

  

Introduction

In recent years, with the popularization of higher education and the market-oriented reform of college students' employment, the scale of China's higher education institutions has been expanding and the number of college graduates has been increasing year by year; the increasingly prominent employment problem of college students has become a focal concern for students and their families, for schools, and for society [1-3]. With the deepening reform of the employment system for college graduates in China, "two-way choice and independent career choice" through the talent market has become the main way for graduates to find employment. This new form of employment has greatly strengthened graduates' agency and right of choice in employment, but "two-way choice" has also brought them many confusions [4-6]. Practice has shown that successful employment requires not only professional knowledge and skills in the traditional sense, but also the ability and strategies to cope with the market; graduates' demand for employment guidance is therefore growing ever stronger [7-9]. Providing scientific and effective employment guidance and services for college students has become one of the important tasks of higher education institutions and an important part of university education [10-11]. It is of great practical significance to conduct in-depth research on the principles and methods of employment guidance for college graduates in the big data environment, in order to improve both the effectiveness and the methods of that guidance [12-14].

To address the problems of low data-processing efficiency and coarse data classification in the traditional employment guidance methods of higher education institutions, this paper takes data mining technology as its basis: it establishes the mining objects and the database, collects students' basic information from the Employment Office, the Academic Affairs Office and the Student Affairs Department, and carries out pre-processing and filling of students' employment data. The C4.5 decision tree algorithm is adopted, and its information gain rate is simplified using Taylor's formula and the Maclaurin formula, which reduces the computational complexity of the algorithm. A kernel function is added to the Mean Shift vector to improve it, the BWP index is used to find the optimal number of initial clusters, and the MSK clustering algorithm is proposed. The employment data of college students at higher education institution A are selected as the research sample to carry out the innovation and practice of employment guidance methods, to explore the correlation between students' in-school performance and employment in the student data, and to construct indicators from the students' employment data.

Overview

Big data has a far-reaching impact on the analysis of the employment situation of college students, the formulation of policies, employment services and guidance, and students' career development planning, and is of great significance in promoting the full employment of contemporary college students. Literature [15] pointed out that career decision-making difficulties can affect the successful employment of college students; taking 1,092 college students as the research object, it examined the influence mechanism among future time perspective, self-efficacy and career decision-making difficulties through big data technology and socioemotional selectivity theory, and found that future time perspective has a negative predictive effect on career decision-making difficulties, with self-efficacy playing a mediating role in the relationship between the two. Literature [16] emphasizes the importance of improving the employment quality of graduates, uses SWOT analysis to investigate the strengths and weaknesses of career guidance quality education in colleges and universities, points out that career guidance in the context of big data has the advantages of diversified data sources and digitized problem presentation, identifies challenges such as the "information cocoon" effect faced by high-quality employment guidance, and formulates a four-dimensional employment development strategy. Literature [17] used the Apriori algorithm in data mining and computer big data to design an employment guidance course system and verified its effectiveness through experiments; the system can help colleges and universities uncover the intrinsic factors affecting students' employment and supports employment guidance work with smoother information flow and simplified statistical analysis. Literature [18] proposes a data-driven approach that considers both institutional data and social media news to predict students' career decisions and experimentally validates its effectiveness; the approach can provide appropriate treatment or early support for students and help graduates achieve career success. Literature [19] proposes a career planning path and employment strategy based on deep learning and verifies its validity and reliability through experiments; it not only helps to solve the main problems faced by students' career planning education but also improves students' employment competitiveness, comprehensive quality and exploratory ability. Literature [20] mainly explores the application of deep learning in college students' career planning and entrepreneurship, and verifies through empirical analysis that deep learning can serve as an effective tool to help college students plan their careers and improve their entrepreneurial ability, providing useful insights and tools for the theoretical study of career planning and entrepreneurship.
Literature [21] proposed the LSTM-Canopy algorithm based on deep learning and big data technology and applied it to a visualization system for college students' career planning paths; performance testing verified the validity and feasibility of the system, which can improve experts' ability to analyze and judge students' careers and help college students adapt to a flexible employment environment.

Innovation of employment guidance methods for university students

As the main way for higher education institutions to help college students successfully find employment, career guidance plays an important role in guiding college students to establish a correct outlook on life, values, and employment. However, in today’s big data era, employment guidance in institutions of higher education generally suffers from problems such as inefficient data processing and rough classification. Therefore, how to innovate the traditional college student employment guidance method in higher education institutions has become an important issue in current student employment guidance education.

Aiming at the problems of low data-processing efficiency and coarse data classification in the traditional employment guidance methods of higher education institutions, this paper proposes an innovative method of employment guidance for college students based on data mining technology: efficient mining of employment data is achieved by improving the C4.5 classification algorithm, an employment data clustering method with the MSK algorithm at its core is established, and employment guidance for graduates is then carried out in a more targeted way on the basis of the clustering results.

Employment data mining algorithms
Establishment of mining objects and creation of a database

In collecting the information, this paper gathers not only the graduates' information from the Employment Office, but also the academic records and grades provided by the Academic Affairs Office, as well as the students' basic information from the Student Affairs Department.

Basic student information

The data structure mainly includes the following attributes: major, student number, class, name, gender, date of birth, place of origin, political outlook, awards and so on.

Students’ academic status and performance

The data structure mainly includes the following attributes: student number, major, class, name, year and month of enrollment, academic performance in each subject, English grade, etc.

Students’ employment information

The data structure mainly includes the following attributes: student number, major, class, name, place of origin, contact information, employment intention, employment unit, address of the unit, signing status, transferring status, employment status, whether the student is poor or not, whether he/she needs help, etc.
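To make the three record structures concrete, the following minimal Python sketch shows one possible representation; all field names are illustrative assumptions rather than the actual schema used by the offices above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BasicInfo:
    """Basic student information (illustrative field names)."""
    student_id: str
    name: str
    major: str
    class_name: str
    gender: str
    birth_date: str
    origin: str
    political_status: str
    awards: list = field(default_factory=list)

@dataclass
class AcademicRecord:
    """Academic status and performance."""
    student_id: str
    major: str
    class_name: str
    name: str
    enrollment_date: str
    subject_scores: dict = field(default_factory=dict)  # course -> score
    english_level: Optional[str] = None

@dataclass
class EmploymentRecord:
    """Employment information."""
    student_id: str
    major: str
    class_name: str
    name: str
    origin: str
    contact: str
    employment_intention: str
    employer: Optional[str] = None
    employer_address: Optional[str] = None
    contract_signed: bool = False
    further_study: bool = False
    employed: bool = False
    is_poor: bool = False
    needs_help: bool = False
```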

Data pre-processing

Before data mining, it is necessary to improve the “quality” of the data by means of pre-processing.

Data cleaning

Data cleaning can fill in the gaps in the data, smooth out the noise, correct inconsistent errors, find isolated points, and then improve the quality of the data in the database, thus improving the performance of data mining and the accuracy of mining.

After the collection of student information and the other data described above is completed, considerable time must be spent cleaning the data as a whole: data with no value can be cleaned out or ignored directly, and vacant values in the attributes can be filled in.

Data Integration

Since a large amount of student information has been collected and much of it is duplicated, the existing data must be extracted and consolidated in order to effectively improve the quality of data mining.

Attribute Reduction

The objects of data analysis may contain attributes that are unrelated or only weakly related to the data to be mined, as well as redundant data. Attributes therefore need to be reduced to ensure the quality of data mining, which is necessary for constructing an effective decision tree.

Dimensionality Reduction

In order to reduce the amount of data, dimensionality reduction can be used to compress the data.

Numerical Reduction

Numerical reduction is a technique that reduces the data volume by replacing the original data with an alternative, "smaller" representation.

Data Filling

In data mining, certain algorithms, such as decision trees and K-means clustering, require that the dataset be complete and accurate at the time of acquisition, and missing data can degrade their performance. Simply ignoring missing data, however, triggers a large number of errors: on the one hand, it increases the computing time and complexity of subsequent algorithms; on the other hand, it may make the results inaccurate. Great effort must therefore be put into handling missing data.

Bayesian networks serve two main functions: first, to discover potential relationships among data [22]; second, to express the probability of possibly complex relationships between random variables, or the uncertainty that a given relationship exists.

Bayesian networks evolved from path analysis, correction patterns, influence diagrams and related work. The algorithm is based on the probability of occurrence of the data and organically combines probability theory with graph theory to yield a compact and expressive representation of the joint probability distribution.

Let X = {X₁, X₂, ⋯, Xₙ} be a given set of random variables, where each Xᵢ is an m-dimensional vector. A Bayesian network is defined as: $$B = \langle G, \theta \rangle$$

θ represents the set of parameters used to quantify the network. For the values xᵢ of Xᵢ, $\theta_{x_i \mid pa(X_i)} = P(x_i \mid pa(X_i))$ is the set of parameters representing the conditional probability that xᵢ occurs given pa(Xᵢ). For the given set of variables X, the joint probability is expressed by the following formula: $$P_B(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P_B\left(X_i \mid pa(X_i)\right)$$

The Bayesian classification process works as follows:

X is an n-dimensional vector; a data sample is expressed as X = {x₁, x₂, ⋯, xₙ}, meaning that the sample is measured on the n attributes A₁, A₂, ⋯, Aₙ.

Suppose the data samples fall into m classes C₁, C₂, ⋯, Cₘ. For a given data sample X with unknown class label, simple Bayesian classification assigns X to class Cᵢ if and only if: $$p(C_i \mid X) > p(C_j \mid X), \quad 1 \le j \le m,\ j \ne i$$

The class Cᵢ that maximizes p(Cᵢ|X) is the maximum a posteriori hypothesis. Its formula is: $$p(C_i \mid X) = \frac{p(X \mid C_i)\, p(C_i)}{p(X)}$$

Since p(X) is constant over all classes, only $P(X \mid C_i)p(C_i)$ needs to be maximized. If the prior probabilities of the classes are not given, i.e., the priors are unknown, it is usually assumed that $p(C_1) = p(C_2) = \cdots = p(C_m)$, and only $P(X \mid C_i)$ is maximized. Otherwise the class prior is estimated as $p(C_i) = s_i/s$, where $s_i$ denotes the number of training samples in class $C_i$ and $s$ denotes the total number of training samples.

When the dataset has many attributes, computing $P(X \mid C_i)$ directly is very time-consuming. In practice, the joint distribution needs to be simplified as much as possible to minimize the computational overhead of $P(X \mid C_i)$. If the attribute values are conditionally independent of one another given the class label, i.e., there are no dependencies between attributes, the formula becomes: $$P(X \mid C_i) = \prod_{k=1}^{n} p(x_k \mid C_i)$$

If the assumption of conditional independence between the feature variables is relaxed, the method becomes more widely applicable; according to Bayes' theorem, the following formula can be established: $$p(c \mid x_1, x_2, \cdots, x_n) = \frac{p(x_1, x_2, \cdots, x_n, c)}{p(x_1, x_2, \cdots, x_n)} = \frac{p(x_1, x_2, \cdots, x_n \mid c)\, p(c)}{p(x_1, x_2, \cdots, x_n)} = \partial \prod_{i=1}^{n} p(x_i \mid c)\, p(c)$$ where $\partial = 1/p(x_1, x_2, \cdots, x_n)$ is a normalization constant. The probability coefficients $p(x_1 \mid C_i), p(x_2 \mid C_i), \cdots, p(x_n \mid C_i)$ can be estimated from the training samples. If $A_k$ is a discrete-valued attribute, then $p(x_k \mid C_i) = s_{ik}/s_i$.

The posterior probability serves as the Bayesian indicator of classification, and its target value is the maximum of the output conditional probability. It can be seen from Eq. (6) that, by estimating the conditional probabilities $p(x_i \mid c)$, the equivalent form of the Bayesian decision is: $$C^{*} = \arg\max_{c}\, p(c \mid x_1, x_2, \cdots, x_n) = \arg\max_{c} \prod_{i=1}^{n} p(x_i \mid c)\, p(c)$$

For each class $C_i$, $p(X \mid C_i)p(C_i)$ is computed; a necessary and sufficient condition for assigning sample X to class $C_i$ is: $$p(X \mid C_i)\,p(C_i) > p(X \mid C_j)\,p(C_j), \quad 1 \le j \le m,\ j \ne i$$

That is, X is assigned to the class $C_i$ for which $p(X \mid C_i)p(C_i)$ is maximal.

Given a student record A containing missing values, it is represented by an n-dimensional feature vector A = {A₁, A₂, ⋯, Aₙ} describing the values of record A on the n attributes {X₁, X₂, ⋯, Xₙ}. For an attribute Xᵢ (1 ≤ i ≤ n) with missing values there are m possible classes {B₁, B₂, ⋯, Bₘ}; the Bayesian classifier above is used to predict the most probable class and fill the vacancy.
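The following sketch illustrates the idea of filling a missing attribute with the most probable value under the naive Bayes assumption of Eq. (6); the record format (dictionaries keyed by attribute name) and the add-one smoothing are assumptions made for the example, not the authors' exact implementation.

```python
import math
from collections import Counter, defaultdict

def naive_bayes_fill(records, target_attr):
    """Fill missing values (None) of `target_attr` by choosing the value c that
    maximizes p(c) * prod_k p(x_k | c), estimated from the complete records."""
    complete = [r for r in records if r.get(target_attr) is not None]
    class_counts = Counter(r[target_attr] for r in complete)          # s_i
    total = sum(class_counts.values())                                # s

    # conditional counts: (class value, attribute) -> Counter of attribute values
    cond = defaultdict(Counter)
    for r in complete:
        c = r[target_attr]
        for attr, val in r.items():
            if attr != target_attr and val is not None:
                cond[(c, attr)][val] += 1

    for r in records:
        if r.get(target_attr) is not None:
            continue
        best, best_score = None, float("-inf")
        for c, sc in class_counts.items():
            score = math.log(sc / total)                              # log prior
            for attr, val in r.items():
                if attr == target_attr or val is None:
                    continue
                seen = cond[(c, attr)]
                # add-one smoothing so unseen values do not zero out the product
                score += math.log((seen[val] + 1) / (sc + len(seen) + 1))
            if score > best_score:
                best, best_score = c, score
        r[target_attr] = best
    return records
```

For example, a record whose "employment status" field is vacant would be assigned the value that the complete records make most probable given its other attributes.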

Improved C4.5 classification algorithm
The classical C4.5 algorithm

Decision trees are a common method in data mining, and the C4.5 algorithm is a relatively mature and mainstream decision tree algorithm with fast classification speed and high classification accuracy [23]. C4.5 is Quinlan's improvement on the ID3 algorithm; compared with ID3, it adds handling of continuous attributes and of vacant attribute values. The idea of the C4.5 algorithm is as follows: let S be the training sample set; when constructing the decision tree for S, select the attribute with the largest value of Gain-Ratio(x) as the splitting node, and according to this criterion S can be divided into n subsets. If the i-th subset Sᵢ contains tuples of the same category, this node is used as a leaf node of the decision tree and splitting stops. For those Sᵢ that do not satisfy this criterion, the tree is built recursively by the same method until all subsets contain tuples of the same category. The algorithm is based on the following principles:

Definition 1, category information entropy. Let the training set S contain s samples divided into m classes, let the number of instances in the i-th class be $s_i$, and let $p_i = s_i/s$ be its probability. The category information entropy Info(S) is: $$Info(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Definition 2, conditional information entropy. If attribute A, with k different values {a₁, a₂, …, aₖ}, is chosen to divide the training set S, then S is divided into k subsets {S₁, S₂, …, Sₖ}. Let $s_{ij}$ be the number of training instances belonging to class i in subset $S_j$. The conditional information entropy of attribute A, i.e. the information entropy of the subsets obtained by dividing S by A, is given by Eq. (10): $$Info_A(S) = \sum_{j=1}^{k} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \times Info(S_j)$$ where $Info(S_j) = -\sum_{i=1}^{m} p_{ij}\log_2(p_{ij})$ and $p_{ij} = \frac{s_{ij}}{|S_j|}$ is the probability of class i in $S_j$.

Definition 3: The information gain for classifying by attribute A is calculated as: $$Gain(A, S) = Info(S) - Info_A(S)$$

Definition 4, split information entropy. Let attribute A have k different values, so that A divides the sample set S into k subsets, where subset $S_j$ contains the samples of S whose value on attribute A is $a_j$. If the samples are partitioned according to the value of attribute A, the split information entropy of A is given by Eq. (12): $$Info(A) = -\sum_{j=1}^{k} p_j \log_2(p_j)$$ where $p_j = |S_j|/s$.

Definition 5: The information gain rate for attribute A is: $$GainRatio(A) = \frac{Gain(A, S)}{Info(A)}$$
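As a minimal sketch of Definitions 1-5, the following Python functions compute Info(S), Info_A(S), the gain and the gain ratio for one discrete attribute of a row-oriented dataset; the data layout (rows as tuples with the class label in the last column) is an assumption made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(S) = -sum_i p_i * log2(p_i) over the class distribution of `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attr_index, label_index=-1):
    """Gain-Ratio(A) = (Info(S) - Info_A(S)) / Info(A) for a discrete attribute."""
    labels = [r[label_index] for r in rows]
    info_s = entropy(labels)

    # partition S by the values of attribute A
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr_index], []).append(r[label_index])

    n = len(rows)
    info_a = sum(len(s) / n * entropy(s) for s in subsets.values())          # Info_A(S)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets.values())
    gain = info_s - info_a
    return gain / split_info if split_info > 0 else 0.0
```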

Improved Information Gain Rate Calculation Methods

Theoretical foundation

A Taylor series represents a function as a series, i.e., as a sum of infinitely many terms computed from the derivatives of the function at a given point. Taylor's mean value theorem is stated as follows [24].

If a function f(x) has derivatives up to order (n+1) on some open interval (a, b) containing $x_0$, then for any x ∈ (a, b): $$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - x_0)^{n+1}$$

The ξ in the above equation refers to a value between x0 and x.

In Taylor's formula, let $x_0 = 0$; then ξ lies between 0 and x. Letting $\xi = \theta x$ (0 < θ < 1), Taylor's formula takes the simple form often referred to as the Maclaurin formula with a Lagrange remainder term: $$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n + \frac{f^{(n+1)}(\theta x)}{(n+1)!}x^{n+1} \quad (0 < \theta < 1)$$

Removing the Lagrange remainder term gives the approximate formula: $$f(x) \approx f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n$$

Improvement Ideas

From the basic principle of the C4.5 algorithm it is known that attribute selection when generating the decision tree is based on information theory. Because the information gain rate formula involves multiple logarithmic operations, the library function must be called several times during the calculation, which greatly increases computation time. To address this problem, an improved calculation method for the information gain rate formula is proposed: the information gain rate of the C4.5 algorithm is simplified using Taylor's formula and the Maclaurin formula, which greatly reduces the computational complexity of the algorithm. The improved C4.5 algorithm is named the TAM-C4.5 algorithm [25].

Since the derivative of ln(x) at $x_0 = 0$ is undefined, and the probabilities appearing in the information gain rate formula lie in [0, 1], this paper chooses the Maclaurin expansion of ln(x+1) to improve the traditional C4.5 information gain rate formula, as in Eq. (17): $$\ln(x+1) \approx x - \frac{1}{2}x^2 + \frac{1}{3}x^3 - \cdots + (-1)^{n-1}\frac{1}{n}x^n$$

Therefore: $$\ln(x) \approx (x-1) - \frac{1}{2}(x-1)^2 + \frac{1}{3}(x-1)^3 - \cdots + (-1)^{n-1}\frac{1}{n}(x-1)^n$$

When x ∈ (0, 1): $$\ln(x) \approx (x-1) - \frac{1}{2}(x-1)^2 + \frac{1}{3}(x-1)^3$$

Through the above approximate simplification, logarithmic operations can be converted into non-logarithmic operations, eliminating the complex logarithms in the information gain rate formula and thereby simplifying the calculation and improving the efficiency of tree building. Moreover, the approximation in Eq. (19) is more accurate than the equivalent-infinitesimal simplification ln(1 + x) ≈ x.
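A quick numeric check (illustrative, not from the paper) of the claim that the third-order expansion in Eq. (19) is more accurate than the first-order, equivalent-infinitesimal approximation ln(x) ≈ x − 1:

```python
import math

def ln_approx(x):
    """Third-order approximation ln(x) ≈ (x-1) - (x-1)^2/2 + (x-1)^3/3, for x in (0, 1]."""
    u = x - 1
    return u - u**2 / 2 + u**3 / 3

for x in (0.9, 0.7, 0.5, 0.3):
    print(f"x={x}: ln={math.log(x):.4f}, third-order={ln_approx(x):.4f}, "
          f"first-order={x - 1:.4f}")
```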

The transformation of the category information entropy is shown in Eq. (20): $$\begin{aligned} Info(S) &= -\sum_{i=1}^{m} p_i \log_2(p_i) = -\sum_{i=1}^{m} \frac{s_i}{s}\log_2\left(\frac{s_i}{s}\right) = -\frac{1}{\ln 2 \times s}\sum_{i=1}^{m} s_i \ln\left(\frac{s_i}{s}\right) \\ &\approx -\frac{1}{\ln 2 \times s}\sum_{i=1}^{m} s_i\left[\left(\frac{s_i}{s}-1\right) - \frac{1}{2}\left(\frac{s_i}{s}-1\right)^2 + \frac{1}{3}\left(\frac{s_i}{s}-1\right)^3\right] \\ &= -\frac{1}{\ln 2 \times s}\sum_{i=1}^{m}\frac{s_i(s_i - s)\left(11s^2 + 2s_i^2 - 7s_i s\right)}{6s^2} \end{aligned}$$

Similarly, the transformations of the conditional and split information entropies are shown in Eqs. (21) and (22): $$Info_A(S) \approx -\frac{1}{\ln 2 \times s}\sum_{j=1}^{k}\sum_{i=1}^{m}\frac{s_{ij}(s_{ij} - s_j)\left(11s_j^2 + 2s_{ij}^2 - 7s_{ij}s_j\right)}{6s_j^2}$$ $$Info(A) \approx -\frac{1}{\ln 2 \times s}\sum_{j=1}^{k}\frac{s_j(s_j - s)\left(11s^2 + 2s_j^2 - 7s_j s\right)}{6s^2}$$

Therefore, the transformed information gain rate formula is shown in Eq. (23): $$GainRatio(A) = \frac{Info(S) - Info_A(S)}{Info(A)}$$

Analyzing the improved formulas, the category information entropy is the same in every calculation of an attribute's information gain rate. Since the common factor $\frac{1}{\ln 2 \times s}$ is omitted from each part of the simplified formulas, the improved formulas are applied to the conditional category entropy in such a way that the ordering of the information gain rates of the individual conditional attributes is not changed, so the classification accuracy of the algorithm is not affected.
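The simplified calculation can be sketched as follows. Because the common factor −1/(ln 2 × s) is dropped from Info(S), Info_A(S) and Info(A) alike, it cancels in the quotient of Eq. (23), so the attribute ranking is preserved; the layout mirrors the gain_ratio() sketch above and is an illustration of the TAM-C4.5 idea rather than the authors' exact implementation.

```python
from collections import Counter

def raw_info(counts, total):
    """Sum_i s_i (s_i - s)(11 s^2 + 2 s_i^2 - 7 s_i s) / (6 s^2): the entropy term of
    Eqs. (20)-(22) with the common factor -1/(ln2 * s) omitted (no logarithms needed)."""
    s = total
    return sum(si * (si - s) * (11 * s * s + 2 * si * si - 7 * si * s) / (6 * s * s)
               for si in counts)

def tam_gain_ratio(rows, attr_index, label_index=-1):
    """Gain ratio computed with the Maclaurin-simplified, logarithm-free entropies."""
    n = len(rows)
    info_s = raw_info(Counter(r[label_index] for r in rows).values(), n)

    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr_index], []).append(r[label_index])

    info_a = sum(raw_info(Counter(lbls).values(), len(lbls)) for lbls in subsets.values())
    split_info = raw_info([len(lbls) for lbls in subsets.values()], n)
    return (info_s - info_a) / split_info if split_info != 0 else 0.0
```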

Employment data clustering algorithm
Mean Shift vectors with added kernel functions

The traditional calculation of the Mean Shift vector assumes by default that every point has the same influence on the vector, whereas in reality each point contributes differently. This paper therefore adds a kernel function to the Mean Shift vector, which assigns a weight coefficient to each sample and thereby accounts for the different influence of data points at different distances from the center point [26].

The knowledge related to the improved Mean Shift vector is defined as follows.

Definition 1 Dataset X: Assume a multidimensional space in which each sample point x is represented by a column vector of its data features, with squared modulus $\|x\|^2 = x^T x$.

Definition 2 Kernel function: there exists a function K : X → R (R denoting the real numbers) and a profile function k : [0, ∞) → R such that, as in Eq. (24): $$K(x) = k\left(\|x\|^2\right)$$ where k must satisfy several requirements: (1) k is nonnegative; (2) k is nonincreasing, i.e., if a < b then k(a) ≥ k(b); and (3) k is piecewise continuous and $\int_0^{\infty} k(r)\,dr < \infty$. Such a function K is called a kernel function.

The Gaussian kernel function (also known as the RBF kernel) is widely used because of its special properties, which allow a finite-dimensional space to be mapped into a higher-dimensional space. It has applications in several fields: in statistics it appears as the normal distribution density; in electronic signal processing it is used to define filters; in image processing it is used for blurring images; and in mathematics it is used mainly in solving the heat equation and diffusion equation and for adding weighting coefficients to equations.

In this paper, the kernel function is added to the Mean Shift vector so as to weight points at different distances from the center and avoid the drawback of giving them all the same influence, as shown in Eq. (25): $$K\left(\frac{x_i - x}{h}\right) = \frac{1}{h\sqrt{2\pi}}\, e^{-\frac{(x_i - x)^2}{2h^2}}$$ where h is the bandwidth, and the form of the kernel function varies with the bandwidth. A suitable value of h can therefore be chosen to obtain weights that satisfy the condition that the closer a point is to the center point, the larger its weight.

After adding the Gaussian kernel function, the improved Mean Shift vector takes the following form: $$M_h(x) = \frac{\sum_{x_i \in S_h}\left[K\left(\frac{x_i - x}{h}\right)(x_i - x)\right]}{\sum_{x_i \in S_h} K\left(\frac{x_i - x}{h}\right)}$$
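A compact numeric sketch of Eqs. (25)-(26): the shift vector at a point x is the kernel-weighted average of the offsets of the samples inside the bandwidth window. The window definition (points within distance h of x) is an assumption, since the paper does not specify exactly how S_h is delimited.

```python
import numpy as np

def mean_shift_vector(x, points, h):
    """Kernel-weighted Mean Shift vector M_h(x): closer points get larger weights."""
    x = np.asarray(x, dtype=float)
    points = np.asarray(points, dtype=float)
    diffs = points - x                              # x_i - x
    dists2 = np.sum(diffs ** 2, axis=1)
    mask = dists2 <= h ** 2                         # samples in the window S_h
    diffs, dists2 = diffs[mask], dists2[mask]
    if len(diffs) == 0:
        return np.zeros_like(x)
    weights = np.exp(-dists2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))  # Eq. (25)
    return (weights[:, None] * diffs).sum(axis=0) / weights.sum()        # Eq. (26)
```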

Determining the optimal number of initial clusters

The number of clusters in the K-means algorithm is decided by the researcher, and different choices inevitably produce different results. These results can be analyzed, and the most appropriate one selected, to determine the best number of clusters.

The knowledge related to the evaluation indicator BWP is defined as follows:

Definition 1 Dataset S: There exists a dataset S in a finite space containing n data samples, each denoted $x_i$ with 1 ≤ i ≤ n; it is assumed that the dataset can be partitioned into k classes.

Definition 2 Intra-class distance a(j, i): here j denotes class j and i denotes the i-th sample in class j. The intra-class distance is the average distance from a sample point in one class of the clustering result to the other sample points in that class, as shown in Eq. (27), where $n_j$ denotes the number of samples in class j, $\|\cdot\|^2$ denotes the squared Euclidean distance, $x_{ij}$ denotes the i-th sample in class j, $x_{pj}$ denotes the p-th sample in class j, and p ≠ i: $$a(j, i) = \frac{1}{n_j - 1}\sum_{p=1,\, p \ne i}^{n_j}\left\|x_{pj} - x_{ij}\right\|^2$$

Definition 3 Minimum inter-class distance b(j, i): here j denotes the class and i the i-th sample of class j. The inter-class distance is the minimum, over the other classes, of the average distance from this sample to the samples of that class, as shown in Eq. (28), where $n_c$ denotes the number of samples in class c, $x_{ij}$ denotes the i-th sample in class j, and $x_{pc}$ denotes the p-th sample in class c: $$b(j, i) = \min_{1 \le c \le k,\, c \ne j}\left(\frac{1}{n_c}\sum_{p=1}^{n_c}\left\|x_{pc} - x_{ij}\right\|^2\right)$$

Definition 4 Clustering effectiveness evaluation indicator BWP(j, i): it is calculated as shown in Eq. (29), where a(j, i) and b(j, i) are given by Eqs. (27) and (28): $$BWP(j, i) = \frac{b(j, i) - a(j, i)}{b(j, i) + a(j, i)}$$

Analyzing Eqs. (27) and (28): the larger the inter-class distance b(j, i), the farther apart and more distinct the different classes are; the smaller the intra-class distance a(j, i), the closer and more similar the samples within the same class are. Therefore, the larger the value of BWP(j, i) for a clustering of the sample set, the better the result; the average BWP value of each clustering can thus be calculated and used to determine the optimal number of clusters. The average BWP indicator is: $$BWP(k) = \frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n_j} BWP(j, i)$$

From the above equation, the value of k at which the average BWP(k) attains its maximum is the optimal number of clusters.
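The BWP computation can be sketched as follows; the scan over candidate cluster numbers at the bottom uses scikit-learn's KMeans purely for illustration, and X stands for any (n_samples × n_features) array of student features, which is assumed rather than reproduced here.

```python
import numpy as np

def bwp_index(X, labels):
    """Average BWP over all samples, Eqs. (27)-(30); larger means a better clustering."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    scores = []
    for j in classes:
        idx_j = np.where(labels == j)[0]
        for i in idx_j:
            d2 = np.sum((X - X[i]) ** 2, axis=1)                 # squared distances to x_i
            same = idx_j[idx_j != i]
            a = d2[same].mean() if len(same) else 0.0            # intra-class distance a(j,i)
            b = min(d2[labels == c].mean() for c in classes if c != j)  # b(j,i)
            scores.append((b - a) / (b + a) if (b + a) > 0 else 0.0)
    return float(np.mean(scores))

# Illustrative sweep: keep the k with the largest average BWP (X assumed given)
# from sklearn.cluster import KMeans
# best_k = max(range(2, 10),
#              key=lambda k: bwp_index(X, KMeans(n_clusters=k, n_init=10).fit_predict(X)))
```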

MSK algorithm

The algorithm consists mainly of two parts: first, the best number of clusters is obtained using the BWP index and the initial cluster centers are obtained using the improved, kernel-weighted Mean Shift procedure; second, the cluster centers and the cluster number so obtained are used as the initial cluster centers and cluster number of the K-means algorithm for clustering.

The specific algorithmic flow of the MSK algorithm proposed in this paper is shown below; a code sketch of the whole flow is given after the steps.

Step1, select a center point from the dataset.

Step2, calculate the MeanShift vector based on that center point and the points within its radius.

Step3, move the center point in the direction indicated by the MeanShift vector, by a distance equal to the modulus of that vector.

Step4, repeat Step2 and Step3 until the size of the shift vector meets the set threshold requirement, record this centroid, and include this centroid and the data within its radius in the class cluster.

Step5, repeat Step1, Step2, Step3, Step4 steps until all data points are categorized.

Step6, after the first 5 steps, the center point cluster and the number n can be found by MeanShift algorithm.

Step7, use the BWP evaluation criteria to find the best number of clusters m.

Step8, if m=n, then the cluster derived in Step6 is the initial center point, which is recorded as new_cluster; otherwise, the center point cluster derived in Step6 is clustered twice to obtain the new center point new_cluster.

Step9, the set new_cluster obtained in Step8 is used as the initial clustering centers of the K-means algorithm, and the size of new_cluster is used as the number of clusters.

Step10, calculate the distance of each point from each centroid, select the smallest value among them, and categorize the point into the class with the smallest distance.

Step11, recalculate the average value of the data points within each class cluster after categorization, and record the average value as the center point of the next clustering.

Step12, iterate Step10 and Step11, and stop iterating when the center points derived in Step11 no longer change appreciably.

Step13, output the individual class clusters generated after the above steps.
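Putting the steps together, a rough Python sketch of the MSK flow might look like the following; it reuses the mean_shift_vector() and bwp_index() sketches above, and the bandwidth h, the convergence tolerance, the mode-merging rule and the k-range are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def msk_cluster(X, h, tol=1e-3, k_range=range(2, 10)):
    """Sketch of the MSK flow: shift points to modes (Steps 1-5), merge modes into
    candidate centres (Step 6), choose the cluster number by BWP (Step 7), reconcile
    the two counts (Step 8), then run K-means seeded with the centres (Steps 9-13)."""
    X = np.asarray(X, dtype=float)

    # Steps 1-5: converge every point to a mode of the density
    modes = []
    for x in X:
        x = x.copy()
        for _ in range(100):                      # cap the number of shifts
            shift = mean_shift_vector(x, X, h)
            x += shift
            if np.linalg.norm(shift) < tol:
                break
        modes.append(x)

    # Step 6: merge modes closer than h into a single candidate centre
    centres = []
    for m in modes:
        if not any(np.linalg.norm(m - c) < h for c in centres):
            centres.append(m)
    centres = np.array(centres)
    n = len(centres)

    # Step 7: best cluster number m by the average BWP index
    m_best = max(k_range, key=lambda k: bwp_index(
        X, KMeans(n_clusters=k, n_init=10).fit_predict(X)))

    # Step 8: if the counts disagree, cluster the candidate centres a second time
    if m_best < n:
        centres = KMeans(n_clusters=m_best, n_init=10).fit(centres).cluster_centers_

    # Steps 9-13: K-means seeded with the candidate centres
    km = KMeans(n_clusters=len(centres), init=centres, n_init=1).fit(X)
    return km.labels_, km.cluster_centers_
```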

Innovative Practices in Employment Guidance Methods for University Students

In this chapter, the employment data of college students at higher education institution A are selected as the research sample to carry out the innovation practice of the college students' employment guidance method and to verify the utility of the innovative method proposed in this paper. The data mining method with the improved C4.5 algorithm at its core is used to mine institution A's student employment data, and the MSK algorithm is used for clustering analysis.

Analysis of the correlation between students’ performance in school and their employment situation

There is a correlation between college students' in-school performance and their employment; that is, in-school performance can reflect employment to a certain extent, which is the prerequisite for clustering students according to their performance and providing corresponding employment guidance. In this section, the correlation between college students' in-school performance and employment is analyzed using the chi-square test, which can be used to infer whether there is an association between two categorical variables. For different data designs, the conditions for applying the chi-square test must be met; this paper uses row × column contingency tables and requires that the theoretical (expected) frequency T of every cell be greater than 5, or that the number of cells with 1 < T < 5 not exceed 1/5 of the total number of cells.

First, the correlation between majors and industry categories is analyzed, taking as an example the employment information of the 2023 graduates of four majors in college A: marketing, logistics management, numerical control technology and computer network technology. The contingency table is shown in Table 1. From the table, the cell with the smallest theoretical frequency corresponds to computer network technology majors working in the service sector, with only 1 person observed and a theoretical frequency of 8.86; that is, the minimum theoretical frequency is greater than 5, which meets the application conditions of the test. Students' in-school performance and their employment situation are thus correlated, as confirmed by the test below.

Table 1. Chi-square test (contingency table of majors × employment industries)

| Majors | Internet | Manufacturing | Logistics transport | Service sector | Trade | Other | Total |
| Marketing | 22 | 20 | 6 | 25 | 42 | 26 | 141 |
| Logistics management | 3 | 18 | 36 | 8 | 20 | 4 | 89 |
| Numerical control technology | 3 | 70 | 0 | 9 | 13 | 18 | 113 |
| Computer network technology | 41 | 23 | 7 | 1 | 2 | 15 | 89 |
| Total | 69 | 131 | 49 | 43 | 77 | 63 | 432 |

Using SPSS, the Pearson chi-square test of the contingency table gives an asymptotic significance value of 0.002, which is less than 0.05, so the two variables of major and industry category show a significant correlation. The chi-square test is then applied to every pairwise combination of the remaining variables, and the asymptotic significance values of the Pearson chi-square statistic are collated; the results are shown in Figure 1. It can be seen that major is significantly correlated with industry category (0.003) and job type (0.005); major grade ranking is significantly correlated with major match (0.022); English level is significantly correlated with job type (0.041); professional skill level is significantly correlated with monthly salary level (0.029), grade level (0.019) and major match (0.012); receiving scholarships is significantly correlated with monthly salary level (0.039) and grade level (0.037); punishment for violations is significantly correlated with grade level (0.015); participation in associations is significantly correlated with type of position (0.045), monthly salary level (0.039) and grade level (0.015); and participation in competitions is significantly correlated with monthly salary level (0.04) and major match (0.02).
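The Table 1 calculation is straightforward to reproduce; the sketch below uses SciPy's chi2_contingency in place of SPSS and recovers the expected (theoretical) frequencies, including the 8.86 minimum mentioned above, from the observed counts.

```python
from scipy.stats import chi2_contingency

# Observed counts from Table 1 (rows: majors; columns: employment industries)
observed = [
    [22, 20,  6, 25, 42, 26],   # Marketing
    [ 3, 18, 36,  8, 20,  4],   # Logistics management
    [ 3, 70,  0,  9, 13, 18],   # Numerical control technology
    [41, 23,  7,  1,  2, 15],   # Computer network technology
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4g}, dof = {dof}")
print(f"smallest expected frequency = {expected.min():.2f}")   # ≈ 8.86
```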

Figure 1. Significant values (asymptotic significance of the pairwise Pearson chi-square tests)

Cluster analysis of students based on MSK algorithm

In the previous section, this paper confirmed that students' in-school performance is related to employment, but it is difficult to measure and describe the specific association between each individual student's performance and employment. In this section, the MSK algorithm is used to cluster students from the perspective of in-school performance, so as to group together students with the same characteristics. In principle, each group contains both students about to graduate and graduates already working, so the better industries and positions in which the employed graduates of a group are engaged can be recommended to the graduating students of the same group, helping them choose their careers better.

This paper uses the Weka platform for the student cluster analysis. The major attribute takes many discrete values and is not itself suitable for clustering, yet the influence of majors on industry categories and job types is relatively large and cannot be ignored; therefore students of the same major are treated as one group by default, and clustering is carried out within each major. Students' in-school performance is refined into 23 indicators used for cluster analysis, the standard form of the content of each indicator is specified, and numerical contents are explained; the clustering indicators are shown in Table 2.

Table 2. Clustering indicators

| Index name | Index content |
| Ranking | TOP30% / TOP70% / Else |
| Reported_score | Average score of all courses assessed by "report examination" |
| Test_score | Average score of all courses assessed by "closed-book examination" or "open-book examination" |
| Attendance_rate | Average attendance rate of all courses, excluding courses that do not take attendance |
| Task_subrate | Average homework submission rate of all courses, excluding courses without homework |
| Task_Excellent | Average rate of excellent homework across all courses, excluding courses without homework |
| English_level | High / Normal / Low |
| Skill_level | High / Normal / Low |
| Scholarship_num | Number of scholarships at all levels received during school |
| Highest_scholarship | Provincial / Campus-level / else |
| Honor_num | Sum of the number of penalties incurred during school |
| Highest_punish | Especially serious / Serious / General |
| Community_nature | Student_management / Public_welfare / Interest |
| Holdpost | Yes / No |
| Skill_competition | Yes / No |
| Skill_level | National / Provincial / else |
| Skill_award | First prize / Second prize / else |
| Innavation_competition | Yes / No |
| Innavation_level | National / Provincial / else |
| Innavation_award | First prize / Second prize / else |
| Expression_competition | Yes / No |
| Expression_level | National / Provincial / else |
| Expression_award | First prize / Second prize / else |

Considering that positions may change over time and that the evaluation criteria for students' in-school performance may change, the in-school performance information of students within five years of graduation and of students about to graduate is selected for clustering. This paper takes the 582 students of the 2020 computer network technology program as an example to demonstrate how career guidance can be provided to students. Retaining nine of the above indicators (professional grade ranking, average course attendance rate, average coursework submission rate, English proficiency, professional skill level, number of scholarships, club positions, honors received, and competition participation), the students are clustered into six categories. Observing the indicator values of each category and drawing on common sense and experience in student evaluation, the clustering results for the 582 students are summarized in Table 3. The six categories are the leadership, mediocre, common, application, learning, and all-around types.
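Before clustering, the categorical indicator levels of Table 2 have to be turned into numbers. The ordinal codings below are illustrative assumptions (the paper does not state the coding actually used in Weka); numeric indicators such as attendance and submission rates pass through unchanged.

```python
# Illustrative ordinal encodings for some of the retained indicators; the actual
# coding scheme used with the Weka platform is not specified in the text.
ORDINAL_MAPS = {
    "Ranking":           {"TOP30%": 2, "TOP70%": 1, "Else": 0},
    "English_level":     {"High": 2, "Normal": 1, "Low": 0},
    "Skill_level":       {"High": 2, "Normal": 1, "Low": 0},
    "Holdpost":          {"Yes": 1, "No": 0},
    "Skill_competition": {"Yes": 1, "No": 0},
}

def encode_student(record):
    """Map one student's indicator values to a numeric feature vector.
    Numeric indicators (attendance rate, submission rate, scholarship count,
    honour count, ...) are passed through unchanged."""
    vec = []
    for name, value in record.items():
        if name in ORDINAL_MAPS:
            vec.append(ORDINAL_MAPS[name][value])
        else:
            vec.append(float(value))
    return vec
```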

Table 3. Clustering results

| Number | Description | Cluster name |
| 0 | Strong learning, language expression and organizational ability | Leadership type |
| 1 | Attitude not serious, weak learning ability, no outstanding aspect, some have violation and penalty records | Mediocre type |
| 2 | Relatively serious attitude, average learning ability, no outstanding performance in any aspect | Common type |
| 3 | Less serious attitude, average learning ability, strong professional skills and ability to apply knowledge, lacking in language expression and organizational coordination | Application type |
| 4 | Serious attitude, strong learning ability and professional skills, weak language expression and organizational coordination | Learning type |
| 5 | Relatively serious attitude; strong learning ability, professional skills, organizational coordination, language expression and time coordination | All-around type |

A total of 600 students of the class of 2020 majoring in computer network technology are clustered into the above six categories based on their in-school performance, demonstrating how clustering can provide students with a reference for choosing careers: graduates in the same category provide employment information as a reference for the graduating students of the class of 2022. Since graduates within five years of graduation are clustered together with the students of the class of 2022, the employment information of the graduates in the category corresponding to a current student can be found through the student number, including industry, employer, position, monthly salary level and grade level. Schools can add a career-reference function to the employment service system; the function is available to students about to graduate after logging into the system with a password. According to the student number, the employment information of graduates in the same category is displayed in response to the student's query, which realizes the innovation of the employment guidance method for college students.

Regression analysis of students’ performance in school and employment gains

This section uses multiple linear regression models to examine the relationship between students' in-school performance and the extent of their employment gains, based on the results of the cluster analysis above.

Regression analysis of students’ performance in school on employability gains

This paper uses stepwise multiple regression analysis to regress the gain in employability on students' in-school performance and derive a regression model. Stepwise multiple regression eliminates, step by step, the independent variables that have no significant effect on the dependent variable, finally retaining only the significant ones, so as to construct the best model. The regression results are shown in Table 4. From the table, R² is 0.405, indicating that the regression equation explains 40.5% of the variance; the tolerances are close to 1 and the VIFs are less than 10, so there is no collinearity among the independent variables and the equation fits well. In addition, violation penalties are the only variable in the table with a negative effect on employability gain, with a regression coefficient of -0.145, while all other variables have positive effects.

Table 4. Regression analysis of students' in-school performance on employment ability

| Variable | Unstandardized coefficient B | Std. error | Standardized coefficient (Beta) | t | Sig. | Tolerance | VIF | R² |
| Constant | 0.346 | 0.072 | - | 1.335 | 0.069 | - | - | 0.405 |
| Professional ranking | 0.128 | 0.061 | 0.164 | 2.756 | 0.032 | 0.561 | 1.321 | - |
| Average course attendance rate | 0.089 | 0.028 | 0.132 | 2.165 | 0.018 | 0.898 | 1.165 | - |
| Average homework submission rate | 0.102 | 0.055 | 0.205 | 1.888 | 0.014 | 0.764 | 1.112 | - |
| Punishment | -0.145 | 0.048 | -0.216 | -0.114 | 0.006 | 0.734 | 1.522 | - |
| Professional skill level | 0.124 | 0.066 | 0.168 | 2.388 | 0.004 | 0.832 | 1.286 | - |
| Scholarships | 0.115 | 0.053 | 0.125 | 1.218 | 0.003 | 0.911 | 1.723 | - |
| Club position | 0.108 | 0.053 | 0.096 | 0.367 | 0.002 | 0.672 | 1.114 | - |
| Honor | 0.138 | 0.068 | 0.157 | 4.692 | 0.001 | 0.742 | 1.343 | - |
| Student competition | 0.082 | 0.042 | 0.078 | 0.713 | 0.002 | 0.585 | 1.745 | - |
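For reference, the regression and collinearity diagnostics of Tables 4-6 can be reproduced along the following lines; this sketch uses statsmodels OLS plus VIF/tolerance rather than SPSS stepwise regression, and the DataFrame df, the predictor column names and the outcome column are assumptions about the coded survey data, which is not reproduced here.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def fit_and_check(df: pd.DataFrame, predictors: list, outcome: str):
    """OLS of an ability-gain score on the retained school-performance indicators,
    with the tolerance/VIF collinearity check reported in Tables 4-6."""
    X = sm.add_constant(df[predictors])
    model = sm.OLS(df[outcome], X).fit()
    vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
                    index=predictors, name="VIF")
    tolerance = 1.0 / vif                       # tolerance = 1 / VIF
    return model, vif, tolerance

# Hypothetical usage, assuming the coded indicators exist as columns of `df`:
# model, vif, tol = fit_and_check(df,
#     ["ranking", "attendance", "submission", "punishment", "skill_level",
#      "scholarship", "club_post", "honor", "competition"], "employability_gain")
# print(model.summary()); print(vif); print(tol)
```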
Regression analysis of students’ performance in school on the gain of career planning skills

The results of the stepwise multiple regression of students' in-school performance on the gain in career planning ability are shown in Table 5. As can be seen from the table, R² is 0.452, i.e., the regression equation explains 45.2% of the variance; there is no collinearity among the independent variables and the equation fits well. Among the variables in the table, the regression coefficient of the violation and punishment variable is negative at -0.116, so it has a negative effect on the gain in career planning ability.

Table 5. Regression analysis of students' in-school performance on career planning ability

| Variable | Unstandardized coefficient B | Std. error | Standardized coefficient (Beta) | t | Sig. | Tolerance | VIF | R² |
| Constant | 0.084 | 0.022 | - | 1.832 | 0.558 | - | - | 0.452 |
| Professional ranking | 0.327 | 0.104 | 0.136 | 0.787 | 0.019 | 0.567 | 1.526 | - |
| Average course attendance rate | 0.093 | 0.029 | 0.026 | 0.267 | 0.033 | 0.9 | 1.762 | - |
| Professional skill level | 0.237 | 0.056 | 0.129 | 1.161 | 0.001 | 0.754 | 2.002 | - |
| Punishment | -0.116 | 0.043 | -0.057 | -0.342 | 0.01 | 0.749 | 1.623 | - |
| Scholarships | 0.15 | 0.046 | 0.122 | 0.147 | 0.017 | 0.831 | 1.263 | - |
| Club position | 0.115 | 0.03 | 0.091 | 0.782 | 0.016 | 0.9 | 1.423 | - |
| Honor | 0.185 | 0.045 | 0.152 | 1.246 | 0.008 | 0.691 | 1.302 | - |
| Student competition | 0.091 | 0.012 | 0.09 | 2.209 | 0.007 | 0.737 | 1.44 | - |
Regression analysis of students’ performance in school on entrepreneurship gains

The results of the multiple regression of students' in-school performance on entrepreneurship gains are shown in Table 6. From the table, the regression equation explains 38.2% of the variance, with tolerances close to 1 and VIFs less than 10, demonstrating a good fit. Among the specific variables, major, professional skill level and club participation entered the equation (P < 0.05); the regression coefficient of the violation and punishment variable is -0.112, indicating that violation penalties in students' in-school performance have a negative effect on entrepreneurial ability gain, while the other variables have positive effects.

Table 6. Regression analysis of students' in-school performance on entrepreneurial ability

| Variable | Unstandardized coefficient B | Std. error | Standardized coefficient (Beta) | t | Sig. | Tolerance | VIF | R² |
| Constant | 0.513 | 0.204 | - | 3.24 | 0.143 | - | - | 0.382 |
| Majors | 0.151 | 0.08 | 0.613 | 2.746 | 0.02 | 0.932 | 1.227 | - |
| Punishment | -0.112 | 0.067 | -0.5 | -1.244 | 0.009 | 0.919 | 1.321 | - |
| Professional skill level | 0.175 | 0.05 | 0.255 | 1.896 | 0.005 | 0.672 | 1.065 | - |
| Awards for scholarship | 0.121 | 0.033 | 0.335 | 0.68 | 0.004 | 0.635 | 1.327 | - |
| Innovation competition | 0.166 | 0.039 | 0.206 | 1.191 | 0.007 | 0.586 | 1.554 | - |
| Community participation | 0.058 | 0.026 | 0.199 | 2.707 | 0.01 | 0.809 | 1.271 | - |
Conclusion

In order to realize the innovation of college students' employment guidance methods in higher education institutions, this paper proposes an improved information gain rate calculation method for the C4.5 algorithm and an employment data clustering method with the MSK algorithm at its core, which solve the problems of inefficient data processing and rough data classification in traditional employment guidance methods. The employment data of college students at a higher education institution are taken as the research sample for an innovation practice of college students' employment guidance. Taking the employment information of the 2023 graduates of four majors in college A (marketing, logistics management, numerical control technology and computer network technology) as an example, the correlation between students' in-school performance and employment is analyzed using the chi-square test; the minimum theoretical frequency is 8.86, which is greater than 5, and it is determined that students' in-school performance and their employment situation are correlated. Clustering students' in-school performance information yields six categories: the leadership, mediocre, common, application, learning, and all-around types. In the regression analyses of students' in-school performance on employability gain, career planning ability gain and entrepreneurial ability gain, the R² values are 0.405, 0.452 and 0.382 respectively, the tolerances are close to 1 and the VIFs are less than 10, there is no collinearity among the independent variables, and the equations fit well. Among the specific variables, violation penalties have negative coefficients and a negative effect on the gains in employability, career planning ability and entrepreneurial ability, while the other variables have positive effects.
