Research on Employment and Entrepreneurship Potential Mining and Cultivation Mechanism of College Students Based on Decision Tree Modeling

The number of college graduates in China is growing sharply while the number of available jobs grows slowly, creating a serious employment conflict [1-2]. At the same time, the conflict between the marketization of graduate employment and graduates’ lagging employment concepts is prominent: a trend of graduates seeking employment in economically underdeveloped regions and at the grassroots level has not yet formed, and the structural contradiction between the supply of and demand for talent persists, which affects the employment of college graduates [3-6]. In this context, mining college students’ employment and entrepreneurship potential and cultivating talent with big data technology is of great significance [7-9].

Cultivating college students’ employment and entrepreneurship ability is an important task in current higher education and requires a variety of paths and methods [10-12], including curriculum and teaching reform, entrepreneurial practice and internship opportunities, construction of innovation and entrepreneurship education platforms, and employment and entrepreneurship guidance and counseling services [13-16]. These measures can effectively improve the competitiveness of college students in employment and entrepreneurship, cultivate their entrepreneurial spirit and practical ability, and lay a solid foundation for their future careers [17-19]. Students should also pursue self-improvement and actively participate in practical activities to strengthen their own abilities. Only through the joint efforts of schools and students can the employment and entrepreneurship ability of college students be better cultivated and contribute to social and economic development [20-23].

In this paper, the improved decision tree C4.5 algorithm is used to construct a prediction model of college students’ employment and entrepreneurship potential, providing a preliminary prediction that paves the way for employment and entrepreneurship guidance. The main factors influencing this potential are then extracted by factor analysis. First, it is verified whether the student data used in this paper meet the requirements of factor analysis. The number of factors is then determined using principal component extraction and the scree plot criterion. To increase the variance gap between the factors so that they can be defined and interpreted, the factors are rotated and then named. Finally, factor scores are calculated to replace the original variables, completing the characterization of college students’ employment and entrepreneurial potential. The available data are analyzed to verify the effectiveness of the decision tree model and the factor analysis model on the task of mining and cultivating students’ employment and entrepreneurial potential.

2

Student employment prediction model based on decision tree algorithm

2.1

Data preprocessing and correlation analysis

1)

Data collection. This study collects and organizes data on three aspects: basic information of college graduates, graduate performance information, and graduate employment information. From these, the data information form is established.

2)

Data integration and summarization. The original graduate data are obtained and integrated, data with the same attributes are deleted, and the information is summarized into an information summary table. Factors that have little impact on the employment of graduates are excluded.

3)

Data conversion. The attribute values above are encoded so that they fall into a small, limited value space, which benefits the generation of decision trees. The 9 relevant attributes are normalized in this way. The data conversion rules are shown in Table 1.

Table 1.

Property values and conversion value comparison

Attribute classification	Attribute name	Attribute value	Conversion value
Basic attribute	Gender	Male and female	1, 0
	Political identity	Party member, non-party member	1, 0
	Source information	Eastern region, central region, western region	1, 0, -1
	Whether to be a student cadre	Yes, no	1, 0
	Comprehensive achievement	85 points above 85 points	1, 0
	Job category	Teaching personnel and non-teaching personnel	1, 0
	Job matching	Match, mismatch	Y, N
Predictive attribute	Employment situation	Employment and employment	1, 0, -1

Correlation analysis of attributes affecting college graduates’ successful employment. To construct an effective decision tree model for predicting whether graduates are successfully employed, the correlation between each attribute and graduate employment must be analyzed so that the test attributes can be selected and the accuracy of the model ensured. The correlation analysis is conducted mainly with SPSS software.

Table 2 shows the correlation analysis of the attributes affecting the employment of college graduates. The four attributes of comprehensive achievement (grade), student cadre status, student source information and political identity are significantly correlated with employment and are taken as the test attributes affecting successful employment. Gender is not significantly correlated with employment (correlation 0.085, Sig. 0.156).

Table 2.

Correlation analysis of the attributes affecting college graduates’ employment

		Employment	Cadre	Source information	Grade	Political identity	Gender
Employment	Cor	1.000	0.755	0.705	0.822	0.584	0.085
	Sig.2	0.000	0.000	0.000	0.000	0.000	0.156
	N	1200	1200	1200	1200	1200	1200
Cadre	Cor	0.728	1.000	0.516	0.638	0.451	0.003
	Sig.2	0.000	0.000	0.000	0.000	0.000	0.566
	N	1200	1200	1200	1200	1200	1200
Source information	Cor	0.705	0.511	1.000	0.584	0.412	0.035
	Sig.2	0.000	0.000	0.000	0.000	0.000	0.563
	N	1200	1200	1200	1200	1200	1200
Grade	Cor	0.825	0.634	0.595	1.000	0.0477	0.035
	Sig.2	0.000	0.000	0.000	0.000	0.000	0.000
	N	1200	1200	1200	1200	1200	1200
Political identity	Cor	0.559	0.454	0.412	0.485	1.000	0.042
	Sig.2	0.000	0.000	0.000	0.000	0.000	0.000
	N	1200	1200	1200	1200	1200	1200
Gender	Cor	0.085	0.001	0.033	0.031	0.045	1.000
	Sig.2	0.158	0.981	0.566	0.000	0.000	0.000
	N	1200	1200	1200	1200	1200	1200

2.2

Fundamentals of the C4.5 algorithm

The C4.5 [24] algorithm is an improvement of the ID3 algorithm. Unlike the ID3 algorithm [25], the C4.5 algorithm selects attributes for each node of the tree based on the information gain rate. The algorithm selects the attribute with the highest gain rate as the test attribute for the current node. This attribute minimizes the amount of information needed to classify the samples in the resulting partition and reflects the minimal randomness or “impurity” of the partition. This theoretical approach minimizes the expected number of tests required to classify an object and ensures that a simple tree is found. For the sake of convenience, the concepts are explained below.

Definition 1: Let dataset S contain s data samples, and let the category attribute take m different values corresponding to m different categories C_i (i = 1, 2, ⋯, m). Let s_i be the number of samples in category C_i. The amount of information required to categorize a given data object is called the entropy of S before division, i.e.:

$$I(s_1, s_2, \cdots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{1}$$

where p_i is the probability that any data object belongs to category C_i: p_i = s_i/s.

The entropy reflects the average uncertainty and purity of the sample set s. In general, the higher the entropy value, the higher the average uncertainty and the lower the purity.

Definition 2: Let an attribute A take v different discrete values {a₁, a₂, ⋯, a_v}. Using attribute A, the set S can be partitioned into v subsets {S₁, S₂, ⋯, S_v}, where S_j contains the samples of S for which attribute A takes the value a_j. If attribute A is selected as the test attribute, i.e., attribute A is used to partition the current sample set, let s_ij be the number of samples belonging to category C_i in subset S_j. The information required after partitioning the current sample set by attribute A is called the entropy after partitioning, i.e.:

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} I(s_{1j}, \cdots, s_{mj}) \tag{2}$$

where $\frac{s_{1j} + \cdots + s_{mj}}{s}$ denotes the weight of subset S_j in dataset S.

Definition 3: Information gain is defined as:

$$Gain(A) = I(s_1, s_2, \cdots, s_m) - E(A) \tag{3}$$

For sample set S, the larger Gain(A) is, the higher the purity of subset partitioning.

Definition 4: Information gain rate is defined as:

$$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)} \tag{4}$$

where $SplitInfo(A) = -\sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \log_2 \frac{s_{1j} + \cdots + s_{mj}}{s}$.

Calculate the information gain rate for each attribute by using the above formula. The attribute with the highest information gain rate is selected as the test attribute for the given set s, a node is created and labeled as that attribute, and branches are created for each value of the attribute to divide the sample.
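Definitions 1 to 4 can be illustrated with a short Python sketch. It is only a minimal illustration, not the tool developed in this study: the function names `entropy` and `gain_ratio` and the eight-sample toy data are hypothetical, and attributes are assumed to be discrete.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """I(s_1, ..., s_m): entropy of a list of class labels (Definition 1)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Return (Gain, SplitInfo, GainRatio) for one discrete attribute."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # E(A): entropy of each subset S_j weighted by |S_j| / s (Definition 2)
    e_a = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - e_a                      # Definition 3
    split_info = -sum(len(s) / total * log2(len(s) / total)
                      for s in subsets.values())
    ratio = gain / split_info if split_info > 0 else 0.0   # Definition 4
    return gain, split_info, ratio


# Hypothetical toy data: encoded grade (1/0) and employment label (Y/N).
grade = [1, 1, 1, 0, 0, 0, 1, 0]
employed = ["Y", "Y", "Y", "N", "N", "Y", "Y", "N"]
gain, split_info, ratio = gain_ratio(grade, employed)
print(f"Gain={gain:.4f} SplitInfo={split_info:.4f} GainRatio={ratio:.4f}")
```

In this toy case the attribute splits the samples into two equal halves, so SplitInfo equals 1 and the gain rate coincides with the information gain.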

2.3

Description of the C4.5 algorithm

1)

Algorithm description

Assuming T is the training set, when constructing the decision tree for T, the attribute with the largest information gain rate is selected as the split node, and T is divided into n subsets according to this criterion. If the i th subset T_i contains tuples of the same category, the node becomes a leaf node of the decision tree and stops splitting. And for the other subsets of T that do not satisfy this criterion, recursively generate the tree as described above until all subsets contain tuples belonging to one category.

2)

Flowchart of C4.5 algorithm

According to the description of C4.5 algorithm, the flowchart of C4.5 algorithm is given, and the flow is shown in Fig. 1.
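The recursive procedure described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: attributes are discrete, pruning is omitted, ties in gain rate are broken by attribute order, and the names `build_tree` and `predict` as well as the four toy samples are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    split = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    return (entropy(labels) - cond) / split if split > 0 else 0.0


def build_tree(rows, labels, attrs):
    # Leaf: all tuples share one category, or no test attribute is left
    # (in the latter case the majority category is used).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    node = {"attr": best, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node


def predict(tree, row):
    # Follow branches until a leaf (a plain label) is reached.
    # Attribute values never seen in training would raise KeyError.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


# Hypothetical toy training set using the abbreviations CJ (grades), GB (cadres).
rows = [{"CJ": 1, "GB": 1}, {"CJ": 1, "GB": 0},
        {"CJ": 0, "GB": 1}, {"CJ": 0, "GB": 0}]
labels = ["Y", "Y", "Y", "N"]
tree = build_tree(rows, labels, ["CJ", "GB"])
print(tree)
```

Each recursive call receives only the subset T_i for one attribute value, which mirrors the division of T into subsets described above.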

2.4

Pruning algorithm in C4.5 algorithm

Pruning a decision tree means cutting out replaceable subtrees and replacing them with leaves to simplify the tree. Pruning also reduces the prediction error and so improves the quality of the classification model. If the expected misclassification rate of a subtree is greater than the error rate predicted for a single leaf, the subtree is replaced by that leaf.

The C4.5 algorithm uses post-pruning with a pessimistic estimate of the prediction error. The method uses the training sample set itself to estimate the error before and after pruning and decides on that basis whether to prune. The formulas used in the method are as follows:

$$\Pr\left[\frac{f - q}{\sqrt{q(1 - q)/N}} > z\right] = c \tag{5}$$

where N is the number of instances, f = E/N is the observed error rate (E being the number of classification errors among the N instances), q is the true error rate, c is the confidence level (an input parameter of the C4.5 algorithm, with a default value of 0.25), and z is the number of standard deviations corresponding to c, which can be read from a table of the standard normal distribution for the chosen value of c. The formula yields an upper confidence limit for the true error rate q, which is used as a pessimistic estimate of the error rate e of the node:

$$e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}} \tag{6}$$

Compare e before and after pruning; if e becomes smaller, prune.
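The pessimistic estimate of Eq. (6) and the resulting pruning decision can be sketched as below. The sketch assumes that z is taken as the upper c-quantile of the standard normal distribution (about 0.67 for c = 0.25) and that the error of an unpruned subtree is the instance-weighted average of its leaves’ estimates; the function name and the leaf counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pessimistic_error(errors, n, confidence=0.25):
    """Upper confidence limit e of Eq. (6) for a node with `errors` of `n` wrong."""
    f = errors / n
    z = NormalDist().inv_cdf(1 - confidence)   # assumption: one-sided quantile
    numerator = f + z * z / (2 * n) + z * sqrt(
        f / n - f * f / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)


# Hypothetical subtree with three leaves, given as (errors, instances).
leaves = [(2, 6), (1, 2), (2, 6)]
n_total = sum(n for _, n in leaves)
subtree_e = sum(n * pessimistic_error(e, n) for e, n in leaves) / n_total

# The same instances collapsed into a single leaf: 5 errors out of 14.
leaf_e = pessimistic_error(sum(e for e, _ in leaves), n_total)

prune = leaf_e < subtree_e
print(f"subtree e={subtree_e:.3f}, single-leaf e={leaf_e:.3f}, prune={prune}")
```

For these illustrative counts the single leaf has the smaller estimate, so the subtree would be replaced by a leaf.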

3

Prediction of student employment based on employment prediction models

To construct a decision tree prediction model for employment and entrepreneurship potential, the final classification result is whether a student is successfully employed: successful employment is represented by Y and awaiting employment by N. After data preprocessing and attribute screening, the test attributes are “political identity”, “student source information”, “cadre situation”, and “grades”.

1)

Calculate the amount of information of categorized attributes

There are 2000 sample data in the training set, among which there are 150 data whose class is employment and 50 data whose class is to be employed, and the amount of information of the categorized attributes can be obtained according to the formula. The test set contains 1200 samples.

2)

Calculate the information entropy of each test attribute

The attribute “political identity” has two attribute values. First calculate the amount of information of the subset defined by each attribute value, and then obtain the information entropy of the attribute on this basis.

3)

Calculate the information gain of the test attribute.

4)

Calculate the split information entropy of each test attribute.

5)

Calculate the information gain rate of each test attribute.

6)

Select the attribute with the largest information gain rate.

From step 5), the information gain rate of the attribute “Grade” is 0.0126, which is the largest among the test attributes. Following the C4.5 algorithm, the “Grade” attribute is selected as the root node. Since the “Grade” attribute has two attribute values, the training set is divided into two parts.

7)

Repeat steps 2)-6) to complete the division of each branch.

Each subset formed by a division is then split again in the same way until all of its samples belong to the same category or all test attributes have been used, at which point the final decision tree model is formed.
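As a worked instance of step 1), assume that the class proportions follow the two class counts quoted in that step (150 and 50), i.e. p₁ = 0.75 and p₂ = 0.25. The entropy of Eq. (1) depends only on these proportions:

$$I = -0.75\log_2 0.75 - 0.25\log_2 0.25 \approx 0.75 \times 0.4150 + 0.25 \times 2 \approx 0.8113$$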

To avoid the cumbersome calculation caused by large data samples and many test attributes, this study designs and develops a simple college student employment prediction tool based on the core idea of the C4.5 algorithm, written in Python and using Excel as the database for storing sample data. The tool constructs a prediction model, evaluates the prediction accuracy of the model, and applies the model to make predictions.

With the college student employment prediction tool, the decision tree prediction model of employment and entrepreneurship potential is constructed as follows: click the “Select Training Set” button in the interface, load the data table named “Training Set 1” into the program, and click the “Generate Prediction Model” button. The resulting decision tree prediction model is shown in Figure 2.

CJ, GB, SYXX and ZZSF abbreviate the attributes “grades”, “cadres”, “student source information” and “political identity”, respectively, and the 0 and 1 on the directed arrows are the values of each attribute, whose meanings are given in Table 1 in Section 2. This completes the construction of the decision tree prediction model of employment and entrepreneurship potential. Traversing the final decision tree from the root node to each leaf node in turn yields the corresponding classification rules. The employment and entrepreneurship potential prediction model has 14 classification rules, which are detailed in Table 3.

Table 3.

Rules for the classification of successful employment

Number	Rule	Conclusion
1	Grades less than 85 points and cadres=No and political identity=Non-party and source=Central region	Employment
2	Grades less than 85 points and cadres=No and political identity=Non-party and source=Eastern region	Unemployment
3	Grades less than 85 points and cadres=No and political identity=Non-party and source=Western region	Employment
4	Grades less than 85 points and cadres=No and political identity=Party and source=Central region	Employment
5	Grades less than 85 points and cadres=No and political identity=Party and source=Eastern region	Employment
6	Grades less than 85 points and cadres=No and political identity=Party and source=Western region	Unemployment
7	Grades less than 85 points and cadres=Yes	Employment
8	Grades greater than 85 points and source=Central region	Employment
9	Grades greater than 85 points and source=Eastern region and political identity=Non-party	Employment
10	Grades greater than 85 points and source=Eastern region and political identity=Party and cadres=No	Employment
11	Grades greater than 85 points and source=Eastern region and political identity=Party and cadres=Yes	Unemployment
12	Grades greater than 85 points and source=Western region and cadres=No	Employment
13	Grades greater than 85 points and source=Western region and cadres=Yes and political identity=Non-party	Employment
14	Grades greater than 85 points and source=Western region and cadres=Yes and political identity=Party	Unemployment
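The 14 rules of Table 3 can be written as one nested decision function. The sketch below is an illustrative transcription only: it assumes that every rule is a conjunction of conditions on grades, cadre status, political identity and source region as listed in the table, and the function name `classify` and its argument encoding are hypothetical.

```python
def classify(high_grade, cadre, party, source):
    """Apply the rules of Table 3.

    high_grade: True if grades are greater than 85 points
    cadre:      True if the student is a student cadre
    party:      True if the student is a Party member
    source:     "central", "eastern" or "western"
    """
    if not high_grade:
        if cadre:                                   # rule 7
            return "Employment"
        if not party:                               # rules 1-3
            return "Unemployment" if source == "eastern" else "Employment"
        # rules 4-6
        return "Unemployment" if source == "western" else "Employment"
    if source == "central":                         # rule 8
        return "Employment"
    if source == "eastern":                         # rules 9-11
        return "Unemployment" if (party and cadre) else "Employment"
    # western region, rules 12-14
    return "Unemployment" if (cadre and party) else "Employment"


print(classify(high_grade=False, cadre=False, party=False, source="eastern"))
```

Each return statement corresponds to one or more leaf nodes reached by traversing the tree from the root attribute (grades) downward.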

After the decision tree C4.5 prediction model is constructed, its accuracy must be tested against the requirements. The randomly selected test set data from the previous section are used to evaluate the accuracy and applicability of the model.

The classification accuracy of the decision tree C4.5 model is shown in Table 4. The prediction accuracy is 88.59% on the training set and 84.56% on the test set, so the model predicts the employment and entrepreneurial potential of college students with high accuracy on both.

Table 4.

The accuracy of graduation prediction

Model	Sample set	Accuracy/%	Error rate/%
C4.5	Training set	88.59%	11.41%
C4.5	Test set	84.56%	15.44%

The prediction accuracy for each category of employment and entrepreneurship potential was calculated separately on the test set, as shown in Table 5. The accuracy of the decision tree C4.5 model is 82.53% for government agencies/institutions and state-owned enterprises, 85.03% for further education, 79.15% for foreign-funded and private enterprises, and 84.41% for freelancing. The accuracy thus varies by category and is highest for further education, at just over 85%.

Table 5.

The accuracy of prediction by employment category

	Government agencies/institutions, state-owned enterprises	Further education	Foreign enterprises, private enterprises	Freelancing	Acc/%
Government agencies/institutions, state-owned enterprises	241	15	17	19	82.53
Further education	15	267	14	18	85.03
Foreign enterprises, private enterprises	27	24	262	18	79.15
Freelancing	14	15	12	222	84.41
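The per-category accuracies in Table 5 can be recomputed from the counts in the table. The short sketch below assumes that each row holds the samples of one actual category and each column one predicted category, so that the accuracy of a category is its diagonal count divided by its row total; the variable names are illustrative.

```python
# Counts copied from Table 5 (assumed: rows = actual, columns = predicted).
categories = [
    "Government agencies/institutions, state-owned enterprises",
    "Further education",
    "Foreign enterprises, private enterprises",
    "Freelancing",
]
confusion = [
    [241, 15, 17, 19],
    [15, 267, 14, 18],
    [27, 24, 262, 18],
    [14, 15, 12, 222],
]

# Accuracy of category i = correctly predicted samples / all samples in row i.
per_class = [round(100 * row[i] / sum(row), 2) for i, row in enumerate(confusion)]
n_samples = sum(sum(row) for row in confusion)

for name, acc in zip(categories, per_class):
    print(f"{name}: {acc}%")
print("test samples:", n_samples)
```

The row totals add up to the 1200 samples of the test set.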

The gain curves for the prediction results of the decision tree C4.5 model are shown in Figure 3. The top curve is the optimal gain curve and the middle curve is the gain curve of the decision tree C4.5 model. The overall trend of the model’s gain curve is close to that of the optimal curve, and the difference between the two gains never exceeds 15%. The gain curve of the decision tree C4.5 model therefore fits the optimal curve closely, indicating good classification performance in predicting college students’ employment and entrepreneurship potential.

The lift curves for the prediction results of the decision tree C4.5 model are shown in Figure 4. The rightmost curve in the figure is the optimal curve, and the curve below it is the lift curve of the decision tree C4.5 model. Over the 0 to 30th percentile range, the model’s lift curve decreases and then becomes steeper at around the 30th percentile, whereas the optimal curve remains straight over this range. Above the 30th percentile, the lift decreases continuously, following a trend similar to that of the optimal lift curve. In summary, the lift curve of the decision tree C4.5 model fits the optimal curve well, and the rules of the model have a high confidence level.

4

Research on Influencing Factors of College Students’ Employment and Entrepreneurship Based on Factor Analysis

4.1

Factor analysis model

Factor analysis [26] is a data simplification technique that embodies the idea of dimensionality reduction. By studying the interdependence among variables, it identifies a few abstract variables that synthesize the main information of all the variables. These abstract variables cannot be measured directly and are usually called factors.

Like cluster analysis, factor analysis is divided into R-type and Q-type: R-type factor analysis analyzes variables and Q-type factor analysis [27] analyzes samples. This paper uses R-type factor analysis, in which the common factors cannot be observed directly but exist objectively. Given n samples and p indicators, let X = (X₁, X₂, …, X_p)^T be a random vector and F = (F₁, F₂, …, F_m)^T the common factors to be sought. Then:

$$X_i = a_{i1}F_1 + a_{i2}F_2 + \cdots + a_{im}F_m + \varepsilon_i, \quad i = 1, 2, \ldots, p \tag{7}$$

In the above equation, F₁, F₂, …, F_m are the common factors, ε_i is the special factor of X_i, representing the part of the variable’s variance due to influences other than the common factors, and X_i is the measurable variable. In matrix form the model is:

$$X = AF + \varepsilon \tag{8}$$

where

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pm} \end{bmatrix} = (A_1, A_2, \ldots, A_m) \tag{9}$$

is called the factor loading matrix, a_ij is the factor loading, whose meaning is the correlation coefficient between X_i and F_j, and

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}, \quad F = \begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_m \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_p \end{bmatrix} \tag{10}$$

This mathematical model needs to satisfy the following four conditions:

1)

m < p, i.e. the number of extracted common factors is less than the number of original variables;

2)

Cov(F, ε) = 0, i.e., the common and special factors are uncorrelated;

3)

D(F) = I_m, i.e., the common factors are mutually uncorrelated and each has variance 1;

4)

Cov(ε_i, ε_j) = 0 for i ≠ j and D(ε_i) = σ_i², i.e., the individual special factors are uncorrelated and have different variances.

Substituting X = AF + ε into the covariance matrix of the original variable X gives the decomposition Cov(X) = ACov(F)A^T + Cov(ε), i.e.:

$$Cov(X) = AA^T + diag(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2) \tag{11}$$

The smaller the values of $\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2$, the larger the share of the variance of X accounted for by the common factors.

The loading matrix is not unique. Let T be an m × m orthogonal matrix, and let A^* = AT and F^* = T′F; then the model can equally be expressed as X = A^*F^* + ε.

1)

Statistical significance of factor loadings a_ij.

For the factor model:

$$X_i = a_{i1}F_1 + a_{i2}F_2 + \cdots + a_{im}F_m + \varepsilon_i, \quad i = 1, 2, \ldots, p \tag{12}$$

The covariance between X_i and F_j can be obtained as:

$$\begin{aligned} Cov(X_i, F_j) &= Cov\left(\sum_{k=1}^{m} a_{ik}F_k + \varepsilon_i, F_j\right) \\ &= Cov\left(\sum_{k=1}^{m} a_{ik}F_k, F_j\right) + Cov(\varepsilon_i, F_j) \\ &= a_{ij} \end{aligned} \tag{13}$$

If X_i is standardized, so that both X_i and F_j have a standard deviation of 1, then:

$$\gamma_{X_i, F_j} = \frac{Cov(X_i, F_j)}{\sqrt{D(X_i)}\sqrt{D(F_j)}} = Cov(X_i, F_j) = a_{ij} \tag{14}$$

Thus, for the standardized X_i, a_ij is the correlation coefficient between X_i and F_j, i.e., the weight with which X_i depends on F_j. Psychologists call it the loading: it is the loading of the i th variable on the j th common factor and reflects the relative importance of the i th variable on the j th common factor.

2)

Statistical significance of variable commonality.

From the factor model it follows that:

$$\begin{aligned} D(X_i) &= a_{i1}^2 D(F_1) + a_{i2}^2 D(F_2) + \cdots + a_{im}^2 D(F_m) + D(\varepsilon_i) \\ &= a_{i1}^2 + a_{i2}^2 + \cdots + a_{im}^2 + D(\varepsilon_i) \\ &= h_i^2 + \sigma_i^2 \end{aligned} \tag{15}$$

The communality of variable X_i is:

$$h_i^2 = \sum_{j=1}^{m} a_{ij}^2, \quad i = 1, 2, \ldots, p \tag{16}$$

If X_i is standardized, then:

$$1 = h_i^2 + \sigma_i^2 \tag{17}$$

3)

Statistical significance of the variance contribution $g_j^2$ of the common factor F_j. Let the factor loading matrix be A. The sum of squares of the elements of column j, i.e.:

$$g_j^2 = \sum_{i=1}^{p} a_{ij}^2, \quad j = 1, 2, \ldots, m \tag{18}$$

is the variance contribution of the common factor F_j to X. It sums the variance that F_j contributes to all p variables: the larger the value of $g_j^2$, the greater the contribution of F_j to X. By comparison, the communality $h_i^2$ indicates the extent to which the information of variable X_i can be described by the extracted m common factors, with a value interval of (0, 1); the larger its value, the higher the proportion of that variable’s information that can be interpreted by the common factors. The relative importance of the common factor F_j can be measured by $g_j^2/h$. The purpose of factor analysis is to derive the solution of the factor analysis model from the covariance matrix Σ or correlation matrix R of the original random vector, i.e., to derive the loading matrix A and the specific variance matrix D_ε, and to interpret the results.
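The quantities defined in this section can be checked numerically. The NumPy sketch below uses a hypothetical 3 × 2 loading matrix for standardized variables; it computes the communalities and variance contributions and verifies that multiplying A by an orthogonal matrix T leaves AA^T and the communalities unchanged, which is why the loading matrix is not unique.

```python
import numpy as np

# Hypothetical loading matrix A for p = 3 standardized variables, m = 2 factors.
A = np.array([[0.8, 0.3],
              [0.7, 0.4],
              [0.2, 0.9]])

h2 = (A ** 2).sum(axis=1)   # communality h_i^2: row sums of squared loadings
g2 = (A ** 2).sum(axis=0)   # variance contribution g_j^2: column sums
sigma2 = 1.0 - h2           # special-factor variances, from 1 = h_i^2 + sigma_i^2

# An orthogonal matrix T (rotation by 30 degrees) gives new loadings A* = A T.
theta = np.pi / 6
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A_star = A @ T
h2_star = (A_star ** 2).sum(axis=1)

print("communalities:", h2)
print("variance contributions:", g2)
print("AA^T unchanged:", np.allclose(A_star @ A_star.T, A @ A.T))
print("communalities unchanged:", np.allclose(h2_star, h2))
```

The individual loadings and the per-factor contributions do change under rotation, which is what makes rotation useful for interpretation.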

4.2

Steps in Factor Analysis

First, the original data with high reliability and authenticity are selected, and such data are usually obtained based on the research of actual problems.

Second, all the original variables are standardized to eliminate differences in order of magnitude between variables, and the correlation matrix is computed from the standardized data to express the correlation between variables.

Third, principal component analysis [28] is used to extract the common factors and derive the factor loading matrix.

Fourth, to make the coefficients in the factor loading matrix more distinct, the factor loading matrix can be rotated. This paper uses the maximum variance (varimax) orthogonal rotation method, which maximizes the relative sum of squares of the loadings, and the factors are then named and interpreted.

Fifth, calculate the factor scores.

Sixth, analyze the results and draw conclusions.
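Steps two to five can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, not the computation carried out in this study: the number of factors m is passed in by hand, the `varimax` function is one common SVD-based iteration for the varimax criterion, factor scores are obtained by the regression method, and all names and data are hypothetical.

```python
import numpy as np


def pc_loadings(Z, m):
    """Step 3: principal component extraction of m factors from standardized Z."""
    R = np.corrcoef(Z, rowvar=False)
    eigval, eigvec = np.linalg.eigh(R)            # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:m]
    return R, eigvec[:, top] * np.sqrt(eigval[top])


def varimax(A, max_iter=100, tol=1e-8):
    """Step 4: orthogonal varimax rotation of the loading matrix A."""
    p, m = A.shape
    T = np.eye(m)
    crit = 0.0
    for _ in range(max_iter):
        L = A @ T
        G = A.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                                # product of orthogonal matrices
        new_crit = s.sum()
        if new_crit < crit * (1 + tol):           # no further improvement
            break
        crit = new_crit
    return A @ T


# Synthetic data: 200 samples, 6 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 2))
W = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.2],
              [0.1, 0.8], [0.0, 0.9], [0.2, 0.7]])
X = F @ W.T + 0.4 * rng.normal(size=(200, 6))

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # step 2: standardize
R, A = pc_loadings(Z, m=2)                        # step 3: loadings
A_rot = varimax(A)                                # step 4: rotate
scores = Z @ np.linalg.solve(R, A_rot)            # step 5: regression scores
print(np.round(A_rot, 2))
```

Because the rotation is orthogonal, the communality of each variable is the same before and after rotation; only the distribution of loadings across factors changes.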

The logic diagram of factor analysis is shown in Figure 5:

5

Graduate descriptive modeling and analysis based on factor analysis

5.1

Construction of Graduate Description Indicator System

To cover as many factors affecting graduates’ employment as possible in the initial choice of measurement indicators, this paper establishes a relatively reasonable indicator system for describing graduates.

Starting from the needs of enterprises, the factors affecting the employment quality of college students are shown in Figure 6. On the basis of a comprehensive understanding of the employment situation of fresh graduates of the information class, 15 measurement indicators are proposed following the five principles of indicator selection mentioned above.

5.2

Collection and Processing of Graduate Employment Data

A sample survey was conducted on the graduates of the last three years from three universities, namely, University X, University Y and University Z, in the area of City B. A total of 620 questionnaires were distributed and 600 questionnaires were recovered, among which 600 questionnaires were valid, and the validity rate of the questionnaires was 100%.

This paper divides salary into grades according to the ratio of an individual sample’s salary to the average salary of the sample set. The division is shown in Table 6, in which S_t denotes the salary of a single sample and S_d denotes the average salary of all sample data.

Table 6.

Income classification

Evaluation index	Evaluation criteria	Evaluation level
Remuneration	S_t≥1.5S_d	A
	1.5 S_d > S_t≥1.2 S_d	B
	1.2 S_d > S_t≥0.8 S_d	C
	0.8 S_d > S_t≥0.5 S_d	D
	0.5 S_d >S_t	E
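The grading rule of Table 6 can be expressed as a small function. The sketch below is illustrative only; the function name `salary_grade` and the sample salaries are hypothetical, and S_d is taken as the arithmetic mean of the sample salaries.

```python
from statistics import mean


def salary_grade(s_t, s_d):
    """Return the grade A-E of salary s_t relative to the average salary s_d."""
    ratio = s_t / s_d
    if ratio >= 1.5:
        return "A"
    if ratio >= 1.2:
        return "B"
    if ratio >= 0.8:
        return "C"
    if ratio >= 0.5:
        return "D"
    return "E"


# Hypothetical monthly salaries; their mean plays the role of S_d.
salaries = [7500, 6500, 5000, 4000, 2000]
s_d = mean(salaries)
grades = [salary_grade(s, s_d) for s in salaries]
print(s_d, grades)
```

Checking the thresholds from the top down means each salary receives the highest grade whose lower bound it meets.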

In this paper, 15 observable indicators are used as factors influencing the salary level of employed college students. These indicators are expressed as variables: X1 denotes GPA scores, X2 denotes English scores, X3 denotes scholarships won, X4 denotes course design scores, X5 denotes project experience, X6 denotes award-winning experience, X7 denotes award level, X8 denotes student cadre experience, X9 denotes cadre level, X10 denotes campus honors awarded, X11 denotes social situation, X12 denotes political identity, X13 denotes part-time job experience, X14 denotes internship experience, and X15 denotes interview experience.

To make the model describe the characteristics of the sample more objectively, the data of all observed variables are normalized to eliminate differences in dimension, transforming all data into the interval [0, 1]:

$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \tag{19}$$
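The normalization of Eq. (19) applied column by column can be sketched with NumPy. The function name `min_max` and the sample values are hypothetical, and a constant column is mapped to 0 here to avoid division by zero, which is an added assumption rather than part of the formula.

```python
import numpy as np


def min_max(X):
    """Scale each column of X to [0, 1] following Eq. (19)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # constant columns: avoid 0/0
    return (X - lo) / span


# Hypothetical raw values for two indicators, e.g. GPA and English score.
raw = np.array([[2.0, 420.0],
                [3.0, 510.0],
                [4.0, 600.0]])
scaled = min_max(raw)
print(scaled)
```

After scaling, indicators measured on very different scales contribute comparably to the correlation and factor analysis.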

5.3

Factor analysis to construct a descriptive model for graduates

5.3.1

Data quality checks

Before the factor analysis, a correlation analysis of the 15 original variables was carried out on the standardized data; the matrix of correlation coefficients is shown in Table 7. A large share of the correlation coefficients exceed 0.3 in absolute value. In particular, X1, X2 and X3 are correlated with one another, as are X4 to X7, X8 to X12, and X13 to X15.

Table 7.

Correlation matrix

	X1	X2	X3	X4	X5	X6	X7	X8	X9	X10	X11	X12	X13	X14	X15
X1	1.000	0.643	0.524	0.492	-0.152	0.153	-0.311	0.308	-0.252	0.365	-0.189	0.303	0.109	0.061	0.251
X2	0.636	1.000	0.762	0.162	0.043	-0.112	0.118	0.308	-0.174	0.275	-0.557	0.456	-0.426	0.036	0.317
X3	0.526	0.761	1.000	0.246	0.23	0.027	-0.428	-0.383	0.1	0.469	0.158	0.122	0.169	0.353	0.281
X4	0.491	0.168	0.25	1.000	0.647	0.369	0.338	0.405	0.005	0.063	-0.103	-0.033	0.198	0.384	0.398
X5	-0.554	0.049	0.212	0.378	1.000	0.434	0.401	-0.434	0.194	-0.01	0.299	-0.301	0.173	0.306	0.209
X6	0.145	-0.118	0.029	0.337	0.438	1.000	0.388	0.314	0.18	-0.067	-0.509	0.248	0.498	0.25	0.351
X7	-0.318	0.127	-0.425	0.407	0.414	0.398	1.000	-0.198	-0.084	-0.448	0.063	-0.555	0.206	0.135	0.263
X8	0.302	0.299	-0.372	0.006	-0.431	0.313	-0.193	1.000	0.658	0.524	0.752	-0.307	0.457	0.143	0.273
X9	-0.261	-0.185	0.098	0.07	0.192	0.173	-0.104	0.653	1.000	0.295	0.473	0.029	0.39	0.056	0.416
X10	0.36	0.283	0.468	-0.103	0.009	-0.072	-0.448	0.532	0.311	1.000	0.399	0.086	0.22	0.124	0.416
X11	-0.18	-0.558	0.155	-0.031	0.298	-0.501	0.056	0.75	0.467	0.393	1.000	-0.027	-0.02	0.127	0.494
X12	0.305	0.451	0.119	0.192	-0.299	0.24	-0.55	-0.318	0.03	0.08	-0.035	1.000	0.667	0.396	0.399
X13	0.101	-0.42	0.161	0.382	0.174	0.494	0.198	0.458	0.389	0.227	-0.019	0.654	1.000	0.471	0.524
X14	0.062	0.032	0.37	0.397	0.313	0.257	0.143	0.138	0.062	0.119	0.134	0.392	0.471	1.000	0.351
X15	0.25	0.313	0.294	0.109	0.215	0.347	0.26	0.272	0.416	0.402	0.485	0.4	0.524	0.347	1.000

To further determine whether the data are suitable for factor analysis, the KMO test and Bartlett’s test of sphericity were performed. Factor analysis usually requires a KMO statistic greater than 0.5 and a significance value for Bartlett’s test below 0.05. In this paper, the KMO value is 0.771 and the significance value of Bartlett’s test is reported as 0.000, so the data are suitable for factor analysis.
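For reference, the two statistics are usually defined as follows. These are the standard textbook forms, given here as an aid to the reader; the paper does not print them, and the symbols R (correlation matrix), r_{ij} (simple correlations), a_{ij} (partial correlations), n (sample size) and p (number of variables) are notation introduced for this note.

$$\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}$$

$$\chi^{2} = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R \rvert, \qquad df = \frac{p(p-1)}{2}$$

With p = 15 variables, Bartlett’s test has 15 × 14 / 2 = 105 degrees of freedom. A KMO value close to 1 indicates that partial correlations are small relative to simple correlations, which is the situation in which common factors are plausible.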

5.3.2

Factor extraction process

In this paper, the principal component method is used to extract the common factors, and the number of factors is determined by the eigenvalue criterion (eigenvalue greater than 1) together with the scree plot. The eigenvalues obtained by principal component analysis are listed in Table 8, and the scree plot, drawn in the order of factor extraction, is shown in Figure 7.

Table 8.

Eigenvalue and variance contribution rate

Factor	Eigenvalue	Variance contribution (%)	Cumulative (%)
1	5.278	36.955%	36.955%
2	2.196	15.356%	52.311%
3	1.906	13.335%	65.646%
4	1.113	7.752%	73.398%
5	0.868	6.083%	79.481%
6	0.798	5.602%	85.083%
7	0.685	4.750%	89.833%
8	0.473	3.287%	93.120%
9	0.350	2.451%	95.571%
10	0.263	1.803%	97.374%
11	0.161	1.189%	98.563%
12	0.095	0.657%	99.220%
13	0.077	0.541%	99.761%
14	0.022	0.123%	99.884%
15	0.021	0.116%	100.000%

Four factors have eigenvalues greater than 1 and lie on the steep part of the scree plot. Factor 1 alone explains 36.955% of the variance of the original variables, Factor 2 explains 15.356%, Factor 3 explains 13.335%, and Factor 4 explains 7.752%. Cumulatively, the first two factors explain 52.311%, the first three explain 65.646%, and the first four explain 73.398% of the total variance, which is sufficient to represent most of the information in the 15 original observed variables. Therefore, four factors are retained and the initial factor loading matrix is obtained.

The initial factor loading matrix is shown in Table 9. Before rotation, the loadings of each factor do not differ much across the original variables, which makes the factors difficult to interpret and name. It is therefore necessary to apply a factor rotation so that the loadings move toward 1 or 0. In this paper, varimax (variance-maximizing) orthogonal rotation is used, and the rotated factor loading matrix is shown in Table 10.
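The retention rule described above can be checked directly against Table 8. The sketch below applies the eigenvalue-greater-than-1 criterion and accumulates the variance contributions exactly as reported; the helper names `kaiser_count` and `cumulative` are hypothetical, and the percentages are taken from the table rather than recomputed from the eigenvalues.

```python
# Eigenvalues and variance contributions (%) as reported in Table 8.
eigenvalues = [5.278, 2.196, 1.906, 1.113, 0.868, 0.798, 0.685, 0.473,
               0.350, 0.263, 0.161, 0.095, 0.077, 0.022, 0.021]
contributions = [36.955, 15.356, 13.335, 7.752, 6.083, 5.602, 4.750, 3.287,
                 2.451, 1.803, 1.189, 0.657, 0.541, 0.123, 0.116]


def kaiser_count(eigs, threshold=1.0):
    """Number of factors whose eigenvalue exceeds the threshold."""
    return sum(1 for e in eigs if e > threshold)


def cumulative(percentages):
    """Running total of the variance contributions, rounded to 3 decimals."""
    total, out = 0.0, []
    for p in percentages:
        total += p
        out.append(round(total, 3))
    return out


n_factors = kaiser_count(eigenvalues)
cum = cumulative(contributions)
print(n_factors, cum[n_factors - 1])  # 4 73.398
```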

Table 9.

Component matrix

Measured variable	Factor 1	Factor 2	Factor 3	Factor 4
X1	0.526	0.480	0.255	-0.304
X2	0.345	0.201	0.266	-0.019
X3	0.429	0.385	0.089	0.358
X4	0.170	0.256	0.202	0.209
X5	0.388	0.515	0.468	0.189
X6	0.423	0.421	0.210	0.393
X7	0.515	0.463	0.462	-0.122
X8	0.495	-0.322	0.674	0.387
X9	0.457	0.427	0.360	0.296
X10	-0.389	0.268	0.294	0.193
X11	0.301	0.315	0.291	0.309
X12	-0.138	-0.105	0.292	0.312
X13	0.205	0.289	0.360	0.404
X14	-0.210	0.419	0.309	0.407
X15	0.289	0.342	0.210	0.355

Table 10.

Rotated component matrix

Measured variable	Factor 1	Factor 2	Factor 3	Factor 4
X1	0.926	0.078	0.126	0.078
X2	0.728	0.197	0.157	-0.022
X3	0.937	0.123	0.152	0.052
X4	0.565	0.735	0.073	0.448
X5	0.322	0.798	0.166	0.219
X6	0.277	0.852	0.051	0.131
X7	0.142	0.865	0.149	-0.059
X8	0.162	-0.068	0.902	0.114
X9	0.109	0.326	0.774	0.134
X10	-0.030	0.185	0.856	-0.009
X11	0.084	0.138	0.781	0.101
X12	-0.077	-0.496	0.728	0.007
X13	0.185	0.128	0.255	0.814
X14	-0.091	0.109	0.017	0.806
X15	0.420	0.215	0.006	0.640

According to the rotated component matrix in Table 10, each of the four common factors is dominated by one group of variables:

Factor 1: can be interpreted and named as students’ “learning ability”. Its largest loadings are on X1, X2 and X3 (GPA, English scores and scholarships won), and it accounts for 36.955% of the total variance of the initial variables. This also indicates that the most critical factor in deciding whether or not to hire a student is whether the student’s demonstrated learning ability meets the requirements of the company.

Factor 2: can be interpreted and named as students’ “practical ability”. Its largest loadings are on X4 to X7 (course design scores, project experience, award-winning experience and award level), and it accounts for 15.356% of the total variance. The loadings on the award-related measures, i.e., award-winning experience and award level, are the largest, which suggests that students should take part in more competitions inside and outside the university during their college years, so as to put what they have learned into practice and improve their practical ability.

Factor 3: can be interpreted and named as students’ “interpersonal skills”. Its largest loadings are on X8 to X12, and it accounts for 13.335% of the total variance.

Factor 4: can be interpreted and named as “vocational ability”. Its largest loadings are on X13, X14 and X15 (part-time job, internship and interview experience), and it accounts for 7.752% of the total variance.
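One common way to read a rotated loading matrix is to assign each variable to the factor on which its absolute loading is largest. The sketch below applies that rule to the values printed in Table 10; the rule itself and the name `dominant_factor` are assumptions of this illustration, since the paper does not state how the grouping was derived.

```python
# Rotated loadings as printed in Table 10 (columns: factors 1 to 4).
rotated_loadings = {
    "X1": (0.926, 0.078, 0.126, 0.078),
    "X2": (0.728, 0.197, 0.157, -0.022),
    "X3": (0.937, 0.123, 0.152, 0.052),
    "X4": (0.565, 0.735, 0.073, 0.448),
    "X5": (0.322, 0.798, 0.166, 0.219),
    "X6": (0.277, 0.852, 0.051, 0.131),
    "X7": (0.142, 0.865, 0.149, -0.059),
    "X8": (0.162, -0.068, 0.902, 0.114),
    "X9": (0.109, 0.326, 0.774, 0.134),
    "X10": (-0.030, 0.185, 0.856, -0.009),
    "X11": (0.084, 0.138, 0.781, 0.101),
    "X12": (-0.077, -0.496, 0.728, 0.007),
    "X13": (0.185, 0.128, 0.255, 0.814),
    "X14": (-0.091, 0.109, 0.017, 0.806),
    "X15": (0.420, 0.215, 0.006, 0.640),
}

FACTOR_NAMES = ("learning ability", "practical ability",
                "interpersonal skills", "vocational ability")


def dominant_factor(loadings):
    """Index of the factor with the largest absolute loading."""
    return max(range(len(loadings)), key=lambda k: abs(loadings[k]))


groups = {name: [] for name in FACTOR_NAMES}
for var, row in rotated_loadings.items():
    groups[FACTOR_NAMES[dominant_factor(row)]].append(var)

for name, variables in groups.items():
    print(f"{name}: {', '.join(variables)}")
```

Under this rule the variables fall into the groups X1 to X3, X4 to X7, X8 to X12 and X13 to X15, which matches the correlated blocks noted in the correlation analysis.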

To simplify the raw data for subsequent analysis and to use the common factors as input variables for the classifier, factor scores are calculated from the factor score function. The regression method is used to express each common factor as a linear combination of the observed variables, and the resulting factor score coefficient matrix is shown in Table 11.

Table 11.

Component score coefficient matrix

Measured variable	Factor 1	Factor 2	Factor 3	Factor 4
X1	0.088	-0.019	0.056	0.047
X2	-0.268	0.019	0.282	-0.026
X3	0.099	-0.039	0.172	-0.102
X4	-0.157	-0.040	0.360	-0.016
X5	0.067	-0.272	-0.003	0.065
X6	0.071	-0.258	0.009	0.025
X7	-0.080	0.055	-0.212	0.014
X8	-0.009	0.080	0.199	-0.103
X9	-0.011	-0.120	0.253	0.124
X10	-0.003	0.157	0.175	-0.179
X11	0.266	0.243	-0.084	0.100
X12	0.128	-0.204	-0.074	0.007
X13	0.136	0.021	0.122	0.202
X14	0.107	0.072	-0.047	0.070
X15	0.005	0.120	0.031	0.137

After the factor score function is obtained, the graduate description model can be represented by the common factors instead of the original variables. The four extracted factors, i.e., learning ability, practical ability, interpersonal handling ability, and vocational ability, are used as the main influencing factors for students’ salary ratings, that is, as the input variables of the salary prediction classifier. The feature vector describing a college student is given by the following equation, where Z, Y, J and L denote learning ability, practical ability, interpersonal handling ability, and vocational ability, respectively, and Z_i, Y_i, J_i and L_i denote the values of these features for sample S_i:

$$S_{i} = (Z_{i}, Y_{i}, J_{i}, L_{i}) \tag{20}$$
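As a minimal sketch of how a feature vector S_i could be computed, each factor score is the weighted sum of a student’s 15 indicator values, using the coefficients printed in Table 11. The function name `factor_scores`, the assumption that columns 1 to 4 map to Z, Y, J and L in that order, and the use of already-scaled inputs are all assumptions of this illustration, not details confirmed by the paper.

```python
# Score coefficients as printed in Table 11 (columns: factors 1 to 4).
SCORE_COEFFS = {
    "X1": (0.088, -0.019, 0.056, 0.047),
    "X2": (-0.268, 0.019, 0.282, -0.026),
    "X3": (0.099, -0.039, 0.172, -0.102),
    "X4": (-0.157, -0.040, 0.360, -0.016),
    "X5": (0.067, -0.272, -0.003, 0.065),
    "X6": (0.071, -0.258, 0.009, 0.025),
    "X7": (-0.080, 0.055, -0.212, 0.014),
    "X8": (-0.009, 0.080, 0.199, -0.103),
    "X9": (-0.011, -0.120, 0.253, 0.124),
    "X10": (-0.003, 0.157, 0.175, -0.179),
    "X11": (0.266, 0.243, -0.084, 0.100),
    "X12": (0.128, -0.204, -0.074, 0.007),
    "X13": (0.136, 0.021, 0.122, 0.202),
    "X14": (0.107, 0.072, -0.047, 0.070),
    "X15": (0.005, 0.120, 0.031, 0.137),
}


def factor_scores(sample):
    """Map a dict of scaled indicator values (X1..X15) to the tuple (Z, Y, J, L)."""
    scores = [0.0, 0.0, 0.0, 0.0]
    for var, coeffs in SCORE_COEFFS.items():
        x = sample[var]
        for k, w in enumerate(coeffs):
            scores[k] += w * x
    return tuple(scores)


# Hypothetical student whose indicators are all 0.5 after scaling.
student = {f"X{i}": 0.5 for i in range(1, 16)}
z, y, j, l = factor_scores(student)
print(round(z, 4), round(y, 4), round(j, 4), round(l, 4))
```

The resulting four-element tuple is what would be passed to the salary classifier in place of the 15 original indicators.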

6

Mechanisms for cultivating the employment and entrepreneurial abilities of university students

6.1

Enhancement of one’s comprehensive quality and employment and entrepreneurial ability

Improving college students’ employment and entrepreneurship ability requires continuous improvement of their comprehensive quality. In cultivating their own abilities, students should actively absorb professional knowledge, participate enthusiastically in social practice, adapt to the talent demands of enterprises and institutions, and keep raising their own quality, so as ultimately to achieve their goals in choosing employment or starting a business.

Career planning, also called career design, brings together the individual and the organization. It comprehensively considers the subjective and objective conditions of a person’s career, weighs personal interests, abilities, external evaluations and social relations, takes into account the needs of the employer and the person’s own career inclination, determines the most reasonable career goal, and makes scientific and reasonable arrangements for achieving it. Cultivating employment and entrepreneurship ability is a long-term, systematic project whose prerequisite and foundation is scientific and reasonable career planning. A lack of career planning is especially prominent among students in independent colleges, and it largely affects the quality and level of their employment and entrepreneurship.

6.2

Strengthening the internal construction of schools

Only by emphasizing market orientation and internationalization, strengthening their distinctive characteristics, and running independently, in accordance with the law and in service of society, can schools adapt to the trend of educational development and survive and develop amid increasingly fierce competition. This is also the fundamental way, and the inevitable choice, to cope with the challenge of global economic integration. With the deepening of the new industrial revolution, the development of the knowledge-based economy and economic globalization, the internationalization of higher education has arrived, and schools, which are closely linked to the market economy, cannot ignore this trend in educational development. In the face of intensifying competition, schools can adapt to the new trend only by developing distinctive schooling characteristics. They must rise to the challenge, participate actively in market and international competition, position themselves accurately, give their schooling model distinctive features, and align actively with market supply and demand. Only then can they turn crisis into opportunity, create good conditions for students’ employment and entrepreneurship, and win opportunities for their own survival, development and growth.

6.3

Strengthening macro-control and government service awareness

Public service is one of the important functions of modern government. As the provider and manager of this function, the government plays a key leading role in solving the employment and entrepreneurship problems of college students, whether by guiding employment and entrepreneurship activities at the micro level or by regulating them at the macro level.

The government should take the initiative, strengthen its sense of service, and continuously introduce policies and measures that promote the employment and entrepreneurship of college students in light of actual conditions. First, it should create a high-quality market economic environment and help build a capital market for college students’ entrepreneurship. Financing is one of the biggest difficulties college students face when starting a business; solving it would mobilize their enthusiasm for entrepreneurship and steadily advance their ventures. To this end, enterprises, banks and government departments can jointly build a capital market for college students’ entrepreneurship, easing the shortage of start-up capital. Second, the government should build a social support and guarantee system for student entrepreneurship. It should coordinate the relevant ministries and departments, strengthen policy training, and promote the implementation of policies among college students. The relevant ministries and departments can set up entrepreneurship guidance organizations that provide employment and entrepreneurship training directly to college students or the general public, and they can also guide and supervise universities in offering relevant employment and entrepreneurship courses. The government should introduce more favorable policies to encourage and support the entrepreneurial activities of university students. For example, it should reduce or waive tuition fees, provide tax incentives, simplify registration procedures, and open entrepreneurial parks so that university students can carry out their entrepreneurial activities without too many worries.

7

Conclusion

In this paper, the C4.5 decision tree algorithm is used to construct a prediction model of college students’ employment and entrepreneurship potential, and a classical factor analysis model is applied to identify the factors that most significantly influence college students’ employment and entrepreneurship, so as to support the decisions of employment guidance departments in colleges and universities.

The decision tree model achieves a prediction accuracy of 88.59% on the training set and 84.56% on the test set. Both values are above 80%, indicating that the model can predict college students’ employment and entrepreneurship potential well.

From the perspective of enterprises, the article chooses 15 measurement indicators to explore the main influencing factors of college students’ employment and entrepreneurship and establishes a salary rating scale. Factor analysis extracts four determinants of college students’ employment and salary: learning ability, practical ability, interpersonal processing ability, and vocational ability. These four factors account for 36.955%, 15.356%, 13.335% and 7.752% of the total variance, respectively, and are the main influencing factors of college students’ employment and entrepreneurship potential.

By identifying the main influencing factors of college students’ employment and entrepreneurship potential, this paper proposes a cultivation mechanism at three levels: individual students, the internal construction of schools, and macro-control by the government. This is of practical significance for improving the employment and entrepreneurship ability of college students.

Research on Employment and Entrepreneurship Potential Mining and Cultivation Mechanism of College Students Based on Decision Tree Modeling

Yan Kong

Published Online: Sep 26, 2025

Received: Jan 14, 2025

Accepted: Apr 18, 2025

DOI: https://doi.org/10.2478/amns-2025-1080

Keywords: Decision tree C4.5 algorithm, Factor analysis model, Post pruning method, Employment and entrepreneurship ability

© 2025 Yan Kong, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
