Data-driven Multiple Regression Analysis of Teaching Mode Innovation and Teaching Quality of English Education in Colleges and Universities Based on Data

Innovative teaching modes are key to improving teaching quality and cultivating students’ core competitiveness, so exploring and applying data-driven teaching mode innovation has become crucial. Through data analysis, teachers can gain a deep understanding of students’ learning needs, progress and styles, and can design and adjust teaching content and methods accordingly. Personalized teaching thus becomes possible, helping each student achieve better learning outcomes [1-2]. However, to realize the innovation of teaching mode, we need to overcome various challenges, continuously improve the technical infrastructure, ensure data security and privacy protection, and improve the professionalism and training level of teachers. Only in this way can colleges and universities meet the challenges in the field of education and create a more valuable educational environment for students’ learning and development [3-4].

Data analysis is of great significance in English teaching in colleges and universities. By collecting, organizing and analyzing large amounts of student learning data, educational institutions and teachers can gain a deeper understanding of students’ learning needs and behavioral patterns, and so carry out more precise and personalized teaching [5-6]. First, by analyzing data such as learning time, use of learning materials and answers to questions, teachers can identify students’ bottlenecks in English learning, accurately grasp their learning needs, and design teaching content and activities in a targeted way [7-9]. Second, data analysis can reveal students’ learning behavior patterns. By analyzing students’ clicking and browsing records on online learning platforms, teachers can understand students’ preferences for different types of resources and then optimize the selection and presentation of teaching resources. Analyzing learning behaviors also reveals students’ capacity for independent learning and their motivation, which helps teachers provide targeted guidance and motivation strategies [10-12]. Therefore, by exploring students’ learning needs and behavioral patterns in depth, teachers can implement personalized teaching and provide accurate learning guidance, thus improving students’ learning outcomes and English proficiency. Data analysis also gives teachers a basis for teaching evaluation and adjustment, promoting the continuous improvement of education quality [13-14].

In recent years, with the rapid development of computer and network technology, digital education has been gradually spreading across the country. In 2020, a sudden epidemic hampered normal offline teaching and learning activities across the country [15]. In this context, “online teaching” has been rapidly developed. “Online teaching” is a new type of teaching method that uses the Internet, multimedia, and a variety of interactive means to teach and interact systematically, and “online” means that all of learners’ teaching and learning activities are carried out on the platform, that is, in the network [16-17]. During the online teaching period, the teachers’ own lack of online teaching experience and insufficient information technology capabilities have led to the fact that their assessment of students’ learning status can only be obtained through online interaction and assignments. Obviously, such evaluation is far from playing its due value and role, which is bound to seriously affect the quality of online teaching. However, the quality of online teaching not only depends on teachers’ online teaching design, but also on the quality of online teaching evaluation [18-19]. It can be said that teaching evaluation throughout the whole process of online teaching plays a decisive role in the quality of the whole university English online teaching. Therefore, it is of great significance to track, monitor and analyze the quality of university English online teaching by using big data, effectively promote the quality of university English online teaching, and strive to realize the “substantial equivalence” between online teaching and offline classroom teaching quality [20].

Scholars first analyze the innovation of English teaching from a technological perspective. Literature [21] introduces principal component analysis to reduce the dimensionality of the evaluation indexes in an English online teaching quality assessment model, which effectively improves the model’s performance in evaluating English teaching. Literature [22] systematically reviewed the research literature on digital English teaching and affirmed the positive role of digital information technology in improving English teaching. Literature [23] constructed a computerized English teaching system based on C++ and Windows tools, which to a certain extent improved the effect of English teaching. Literature [24] conceived an English translation model with a neural machine algorithm as the underlying structure, which produces higher-quality English translations and performs well in business English translation teaching practice. Literature [25] attempts to integrate a convolutional neural network (CNN)-recurrent neural network into immersive situated English teaching scenarios, which promotes the interactivity, sense of immersion and sense of cognition of students’ English learning.

Secondly, English teaching innovation is discussed from the perspective of English teaching methods. Literature [26] envisioned an English teaching ecosystem based on big data technology, and in the teaching comparison experiments, it was confirmed that the proposed teaching methods help students reform and innovate their English teaching concepts, methods and contents. Based on the empirical investigation method, literature [27] reveals that the English flipped teaching classroom contributes to the effect of teacher-student interaction and the quality of interaction, and at the same time provides an important reference for English teachers’ practice of flipped classroom.

In order to use big data to promote the innovation of English teaching mode, this article first analyzes the current trends of English education and teaching mode innovation in colleges and universities. It then takes the learning data of English majors at a college in J province as an example and carries out a preliminary correlation analysis after preprocessing the sample data. The improved K-means++ algorithm computes and sums the squared Euclidean distances between the data points and the cluster centers, and new cluster centers are selected according to a probability formula. The English teaching data are clustered by this method, and the patterns among the clusters of teaching needs are derived. On this basis, a SPOC college English informatization teaching model comprising online teaching, classroom guidance, classroom research and after-class practice is constructed. A teaching quality evaluation path using multiple regression analysis is then explored for this model to assess the teaching quality of the new English teaching model.

2

Innovative trends in the teaching model of English language education in higher education institutions

English informatization teaching in colleges and universities has greatly promoted and enriched the integration and utilization of high-quality educational resources, and, with the support of new media technology, a large number of high-quality digital resources have made it possible to meet learners’ personalized needs. The current trend of data-driven innovation in the college English teaching mode is mainly reflected in the following aspects.

2.1

Utilization of digital resources

At present, the focus and research hotspot of China’s English education informatization resource construction has shifted from the early high-quality courses, high-quality resource-sharing courses and open video courses to the construction of microcourses and MOOCs. Nevertheless, the earlier construction of digital resources such as high-quality courses and online courses laid a solid foundation for English informatization classroom teaching. Teachers can obtain the digital resources for network-based English informatization teaching in several ways: introducing high-quality MOOC course resources; building online courses and microcourse video resources with local characteristics; and transforming and upgrading the existing on-campus high-quality resource-sharing courses into the digital resources needed for English informatization classrooms. Through continuous development, these form the resources needed for a “flipped classroom” with Chinese characteristics. It can be said that the classroom cannot be “flipped” without rich digital resources.

2.2

Relying on new media technologies

With the intervention of new media, different communication technologies have been reorganized and integrated on the basis of digital network technology, and single media have been transformed into all-media, which makes the English informatization classroom far more feasible. In the English informatized classroom, teachers deliver teaching information to students through online teaching platforms, mobile terminals and other new media technologies, and students can communicate with teachers in real time through the same technologies. In addition, learners can collaborate and communicate with each other through new media such as microblogging, WeChat, QQ and online forums to jointly construct knowledge. Moreover, when educators or learners use various media and devices to learn and work, the media and devices themselves play an interactive role, and information passes between people and machines to form interactions. It can be seen that an informatized classroom is difficult to realize without the support of new media technology.

2.3

Educational methods to support personalized learning

English informatization classroom learning is not limited by time or place. Before class, teachers use educational technology to record short, concise microclass videos so that students can study independently outside class before the new lesson, realizing flipped classroom teaching. During class, teachers analyze the student learning data fed back by the online platform, provide personalized guidance, and conduct targeted discussions and explanations. After class, students collaborate to complete assignments and interact with teachers and classmates online, and teachers provide online counseling. Teachers and automated programs supervise the learning process and issue reminders about learning tasks, while learners freely choose the learning content and control the learning process. Learners thus shift from passively receiving knowledge to actively acquiring it while accessing course resources, which makes blended teaching focused on personalized learning possible.

3

Acquisition and analysis of data underlying innovations in teaching models

From the above analysis, it is not difficult to find that the collection and analysis of rich English teaching big data is the basis for realizing the innovation of new media teaching, personalized education and other teaching modes. Therefore, this study first collects and analyzes the sample data of English education in colleges and universities, refines the basic laws of English education and teaching, and carries out the construction of new teaching models on this basis.

3.1

Data acquisition

This paper is based on the historical operation data of the academic affairs system of a university in J province, including the English results of various majors and courses, the evaluation results, the attendance rate, the listening rate, and the preview situation. The survey data were preprocessed to form five factor attributes that may affect teaching quality, namely “preview”, “evaluation”, “achievement”, “attendance” and “listening”, and the sample set is as follows: (1) $U_{i} = [u_{i 1}, u_{i 2}, \dots, u_{i k}]$

where $i \in [1, M]$, $M$ denotes the size of the set, and $u_{i k}$ denotes the $k$th attribute value of the $i$th sample $U_{i}$ in the set.

3.2

Data pre-processing

One-hot encoding is chosen to convert categorical data such as “Prep”, “Course Type” and “Analysis Type” into numerical values. The conversion of some attribute codes is shown in Table 1.

Table 1.

Conversion table of attribute codes

Text type	Data attribute	Value
Preview	Yes	0
Preview	No	1
Course	Basic English	0
	Professional English	1
	Business English	2
	English listening	3
Requirement	Self-directed type	0
	Self-driven type	1
	Friendly type	2
	Passive type	3
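
As a minimal sketch of this conversion, the mapping in Table 1 can be written as plain Python dictionaries. The dictionary names and the `encode_record` and `one_hot` helpers are hypothetical illustrations and are not taken from the original study; only the codes themselves come from Table 1. The values listed in Table 1 are integer category codes, so the `one_hot` helper shows how such a code could be expanded into a one-hot vector, assuming that is the intended final representation.

```python
# Integer codes copied from Table 1; all names here are illustrative.
PREVIEW_CODES = {"Yes": 0, "No": 1}
COURSE_CODES = {
    "Basic English": 0,
    "Professional English": 1,
    "Business English": 2,
    "English listening": 3,
}
REQUIREMENT_CODES = {
    "Self-directed type": 0,
    "Self-driven type": 1,
    "Friendly type": 2,
    "Passive type": 3,
}


def encode_record(preview, course, requirement):
    """Map the three text attributes of one record to their Table 1 codes."""
    return (
        PREVIEW_CODES[preview],
        COURSE_CODES[course],
        REQUIREMENT_CODES[requirement],
    )


def one_hot(code, num_classes):
    """Expand an integer category code into a one-hot list."""
    vector = [0] * num_classes
    vector[code] = 1
    return vector


encoded = encode_record("No", "Business English", "Passive type")
print(encoded)
print(one_hot(encoded[1], len(COURSE_CODES)))
```

Keeping the mapping in dictionaries makes an unknown category fail loudly with a `KeyError` instead of silently receiving a wrong code.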

A manual calibration method is used to calibrate the demand types to form a calibrated dataset to better evaluate the analytical model, which is then divided into a training set and a test set. The training set data attributes and example data are shown in Table 2.

Table 2.

Data attributes and examples of training set

u_i1	u_i2	u_i3	u_i4	u_i5	u_i6	u_i7
1	88	91	90	85	2	3
0	89	98	84	97	1	0
1	82	93	87	90	0	1
0	89	94	98	91	0	2
1	83	86	94	81	0	0
1	81	84	87	89	0	2
0	88	80	80	92	2	2
1	78	79	88	90	0	1

In order to understand the distribution of courses in the dataset, statistics were made according to four types of needs: “autonomous”, “friendly”, “self-driven” and “passive”. The statistical results are shown in Figure 1.

3.3

Data attribute correlation analysis

3.3.1

Pearson’s correlation coefficient

Correlation analysis measures the degree of association between two or more related variables. It is used to investigate whether some kind of dependence exists between phenomena and to analyze the direction and strength of that dependence, and it is an important statistical method for studying the correlation between random variables. The correlation measure used in this paper is the Pearson correlation coefficient.

Correlation analysis based on classical statistics measures the strength of the linear correlation between variables, which is generally described quantitatively using the Pearson correlation coefficient. The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is a linear correlation coefficient and is the most commonly used type of correlation coefficient [28]. Denoted as r, it reflects the degree of linear correlation between two variables, the dependent variable Y and the independent variable X. The value of r ranges from -1 to 1, and a larger absolute value indicates a stronger correlation. A value of r = 0 indicates that the two variables are not linearly correlated, although they may still be related in other ways (e.g., in a curvilinear way). If r < 0, there is a negative correlation between the two variables, i.e., the larger the value of one variable, the smaller the value of the other. If r > 0, there is a positive correlation between the two variables, i.e., the larger the value of one variable, the larger the value of the other. When r = 1 or r = -1, the dependent variable Y and the independent variable X are described exactly by a linear equation and all sample points fall on a straight line. The formula is shown in equation (2): (2) $r = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}}$

where r is the Pearson correlation coefficient of the two variables, x_i and y_i are the sample observations of the two variables, n is the number of samples, and $\bar{x}$ and $\bar{y}$ denote the mean values of the two variables, respectively.
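
The Pearson coefficient can be computed directly from its definition as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The sketch below is illustrative only: the function name `pearson_r` is an assumption, and the toy input reuses two columns of the example rows in Table 2 purely as sample numbers, not as a reproduction of the paper's analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a sample is constant")
    return cov / math.sqrt(var_x * var_y)


# Columns u_i2 and u_i3 of the example rows in Table 2, used only as toy input.
u2 = [88, 89, 82, 89, 83, 81, 88, 78]
u3 = [91, 98, 93, 94, 86, 84, 80, 79]
print(round(pearson_r(u2, u3), 3))
```

A symmetric curvilinear relationship such as `xs = [-1, 0, 1]`, `ys = [1, 0, 1]` gives r = 0 even though the variables are clearly related, which illustrates why r = 0 rules out only a linear relationship.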

3.3.2

Correlation analysis of EFL data

In this section, attribute correlation analysis is performed on the calibrated training set to remove data attributes with weak correlation to ensure the stability of the subsequent clustering algorithm. In order to reduce the amount of computation, the correlation analysis is performed only on the four analysis types of basic English courses, and the Pearson correlation coefficient is used to statistically analyze the degree of closeness of relationship between the variables.

Substituting the values of the data attributes in the training set into Pearson’s formula yields the correlation coefficient analysis shown in Table 3, where ${(S_{m})}_{u_{i k}}$ and ${(S_{n})}_{u_{i j}}$ represent the standard deviations, ${(S_{m} R_{n})}_{u_{i j} u_{i k}}$ represents the covariance, and S_mR_n represents the correlation coefficient for the elements of the analysis type set R and the relationship grouping set S. R₀ ~ R₃ represent the four demand types, autonomous, self-driven, friendly and passive, respectively, while S₁ ~ S₆ represent the attribute relationship combinations (assessment/grades), (assessment/listening rate), (assessment/attendance), (grades/listening rate), (grades/attendance) and (listening rate/attendance), respectively.

Table 3.

Analysis of correlation coefficients

Group	Type	n	${(S_{m})}_{u_{i k}}$	${(S_{n})}_{u_{i j}}$	${(S_{m} R_{n})}_{u_{i j} u_{i k}}$	S_mR_n	r
S₁	S₁R₀	88	8.37	6.01	7.29	0.15	0.23
	S₁R₁	68	8.91	6.29	11.02	0.22
	S₁R₂	92	8.74	7.31	18.10	0.26
	S₁R₃	75	10.14	9.13	22.72	0.26
S₂	S₂R₀	88	8.37	8.38	11.75	0.18	0.2
	S₂R₁	68	8.91	8.91	4.31	0.08
	S₂R₂	92	8.72	8.74	21.90	0.30
	S₂R₃	75	10.14	10.12	16.57	0.21
S₃	S₃R₀	88	8.37	5.04	2.71	0.07	0.1
	S₃R₁	68	8.89	5.42	7.40	0.16
	S₃R₂	92	8.74	7.32	2.61	0.05
	S₃R₃	75	10.12	7.47	14.43	0.20
S₄	S₄R₀	88	5.97	8.37	9.15	0.21	0.25
	S₄R₁	68	6.29	8.91	2.98	0.05
	S₄R₂	92	7.29	8.72	22.34	0.35
	S₄R₃	75	9.13	10.14	25.04	0.34
S₅	S₅R₀	88	6.00	5.06	1.47	0.04	0.07
	S₅R₁	68	6.27	5.42	4.31	0.11
	S₅R₂	92	7.28	7.31	-0.11	0.00
	S₅R₃	75	9.16	7.45	9.52	0.15
S₆	S₆R₀	88	8.35	5.04	0.42	0.02	0.08
	S₆R₁	68	8.83	5.42	1.67	0.05
	S₆R₂	92	8.72	7.32	5.44	0.08
	S₆R₃	75	10.14	7.45	14.88	0.23

It can be found that relationship group S₄ (grade/listening rate) has the highest correlation (r=0.25), followed by S₁ (evaluation/grade) (r=0.23). This shows that, in the training set, the attribute pairs “grade” and “listening rate” and “evaluation” and “grade” are the most strongly correlated across the course demand types, so the cluster analysis should be carried out mainly on these relationship groups.

3.4

Cluster analysis of English language teaching data

3.4.1

K-means++ clustering algorithm

K-means is a well-performing clustering algorithm with low complexity that can be applied to large amounts of data. The algorithm first randomly selects k points in the dataset as the initial cluster centers, assigns each data point to the cluster of its nearest center according to the Euclidean distance, and then iteratively updates the cluster centers until k clusters that meet the requirements are formed.

As a basic algorithm, K-means is robust and can be used for many types of data sets. However, it also has shortcomings: it is sensitive to the initial selection of k and to the distribution of the data points, and it converges slowly on large datasets [29]. Because the dataset used in this paper is large, the K-means algorithm needs to be improved.

The steps to execute the improved K-means++ algorithm of the paper are as follows:

Step 1: An initial center point C is selected at random from the data set.

Step 2: For each data point x_i in the data set, calculate its Euclidean distance to the nearest existing center point C with the following formula: (3) $D_{i} (x) = \sqrt{{(a_{i} - a_{j})}^{2} + {(b_{i} - b_{j})}^{2}}$

where $(a_{i}, b_{i})$ are the coordinates of x_i and $(a_{j}, b_{j})$ are the coordinates of C. The cumulative value of the squared Euclidean distances is: (4) $S_{j} = \sum_{x_{i} \in X} D_{i} {(x)}^{2}$

Step 3: Select a new center point C_j from the data points, where each point x_i is chosen with the probability given in equation (5): (5) $P(C_{j} = x_{i}) = \frac{D_{i} {(x)}^{2}}{S_{j}}$

Step 4: Repeat steps 2 and 3 until k center points have been generated.

It can be seen that the K-means++ algorithm improves the selection of the initial centroids and therefore achieves better clustering performance [30].
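
The four seeding steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `kmeans_pp_init`, the `rng` parameter and the toy points are hypothetical, and the sketch works for points of any dimension even though equation (3) is written for two coordinates.

```python
import numpy as np


def kmeans_pp_init(points, k, rng=None):
    """Choose k initial centers from `points` (shape n x d) by K-means++ seeding."""
    rng = np.random.default_rng() if rng is None else rng
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Step 1: pick the first center uniformly at random.
    centers = [points[rng.integers(n)]]

    while len(centers) < k:
        # Step 2: squared Euclidean distance from each point to its nearest center.
        diffs = points[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        total = d2.sum()  # cumulative squared distance, equation (4)
        if total == 0:
            # Every point coincides with a center; fall back to uniform sampling.
            probs = np.full(n, 1.0 / n)
        else:
            probs = d2 / total  # selection probability, equation (5)
        # Step 3: sample the next center with probability proportional to D^2.
        centers.append(points[rng.choice(n, p=probs)])

    # Step 4 is the loop condition: stop once k centers exist.
    return np.asarray(centers)


toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(kmeans_pp_init(toy, k=2, rng=np.random.default_rng(0)))
```

Because a point that is already a center has distance zero, it has zero probability of being chosen again, which is what spreads the initial centers apart.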

3.4.2

Clustering test for ELT data

The improved K-means++ algorithm was used to run a test cluster analysis of the instructional data in order to provide an algorithmic basis for constructing the new instructional model. Based on the data preprocessing in the previous section, the K-means++ algorithm was used to test clustering on 30% of the raw data. Random initial centroids O (k=4) were constructed, each point was assigned to the nearest centroid, the centroids were then recalculated, and the process was repeated until the cluster assignments of the data points no longer changed. Following this process, the algorithm was implemented in Python on the preprocessed dataset, and the graphical results are shown in Figure 2. Fig. 2 (a) and (b) show the clustering results for the “grade-listening” and “grade-assessment” relationships, respectively.

The clustering results are tested separately for the two relationship groupings. Let the calibrated number of samples in a particular class be $N_{c}$, and the number obtained for that class through the correlation analysis and clustering in this paper be $N_{c}^{'}$; the accuracy of this classification is then defined as $p = N_{c} / N_{c}^{'}$. In this paper, the validity of the test is measured by the post-clustering accuracy p.

Figure 3 shows the evaluation results for the two relationship groupings, i.e., the clustering effects for the “lecture-grade” and “grade-evaluation” relationships. The average classification accuracy for “grade” and “listening rate” is $\bar{p} = 91.7\%$, and that for “evaluation” and “grade” is $\bar{p} = 82.7\%$. It can be seen that the accuracy for the “lecture-grade” relationship is higher, which is consistent with the result of the correlation analysis.
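
The assign-and-recompute loop described in this paragraph can be sketched as below. The paper does not publish its Python code, so this is only an illustrative version: the function name `kmeans`, the `max_iter` safeguard and the toy data are assumptions, and the initial centroids are passed in as an argument rather than generated inside the function.

```python
import numpy as np


def kmeans(points, init_centers, max_iter=100):
    """Run the assign/update loop until cluster assignments stop changing."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Recompute each centroid as the mean of its assigned points.
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centers = kmeans(toy, init_centers=[[0.0, 0.0], [5.0, 5.0]])
print(labels)
print(centers)
```

The stopping rule mirrors the text: iteration ends when no data point changes cluster, with `max_iter` acting only as a guard against unusually long runs.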

4

Data-driven multiple regression analysis of teaching model construction and teaching quality

Based on the previous analysis of the innovative trend of English education in colleges and universities and the preliminary analysis of English teaching data, this paper proposes a data-driven informatization education model for English in colleges and universities and a teaching quality evaluation method based on multiple linear regression.

4.1

Constructing Informatization Teaching Mode of English in Colleges and Universities

Most of the traditional English informatization teaching models in colleges and universities use MOOCs as the course format, whose advantages lie mainly in the openness of content and form. However, the shortcomings of MOOCs are also obvious: no prerequisites for learners, no limit on class size, low completion rates, no formal credit certification, and open online exams that are prone to academic integrity problems.

SPOC, as a form of online course derived from MOOC, adheres to the teaching concept and teaching design of MOOC, utilizes the resources of high-quality MOOCs, improves the teaching methods and processes in schools, and improves the teaching effect. Therefore, this paper proposes a college English informatization teaching model based on SPOC flipped classroom.

The teaching activities of SPOC can be divided into pre-course orientation, classroom research and post-course practice, forming a complete flipped classroom teaching process that organically combines the physical classroom with network teaching. Drawing on the research of many scholars and on the concept of SPOC, this paper builds on an existing network platform and the informatization teaching framework designed earlier, and combines the characteristics of the flipped classroom and blended learning to construct a SPOC-based informatization classroom teaching process model. The process model of the SPOC-based English informatization teaching mode is shown in Figure 4.

4.1.1

E-learning platforms

The effective development of the blended teaching mode cannot be separated from the support of a network platform. The platform can use an existing stable, easy-to-operate online course platform or high-quality resource-sharing course website as its carrier, which the teacher redesigns and regularly maintains for learners. Digital resources are the core of the network platform, and each course is independent. The functions of different network teaching platforms generally include a resource area, a communication area, a management area and a learning area. The network teaching platform provides the environment for SPOC teaching activities: teachers build resources, release learning tasks, manage students and interact with students on the platform, while students obtain the learning materials they need, carry out independent learning, complete online tests and discussions, and come to understand and apply knowledge.

4.1.2

Pre-course orientation

Before the class, the teacher is the creator and integrator of the course resources, and the designer of the content and progress of the “SPOC flipped classroom”, and the students are the implementers of the “flipped classroom”. Teachers release “learning tasks” and push learning resources, and students independently choose a variety of high-quality resources for online learning through the online platform, watch micro-videos to understand the main content of the classroom in advance, and identify problems.

4.1.3

Classroom research

In the classroom, the teacher is the guide and facilitator of teaching activities, providing individualized guidance to students, organizing group seminars, carrying out project training, jointly solving problems encountered, and providing feedback on classroom problems. The classroom teaching methodology has shifted from monolithic theory teaching to diversified collaboration, inquiry, discussion and interaction. Teachers are the leaders of “SPOC Flipped Classroom”, summarizing the knowledge points and giving new tasks according to the students’ situation, and students understand and internalize the knowledge through practical operation.

4.1.4

Practice learning after school

After class, the teacher is a supporter, assigning homework and administering after-class tests to help students assess what they have learned. Based on the completed homework, the teacher selects outstanding student work to display and share on the network platform and provides targeted counseling and evaluation. With the teacher’s assistance and through intra-group and inter-group collaboration, students consolidate and deepen their previous knowledge and prepare for the next stage of learning.

4.2

Teaching quality evaluation method of SPOC teaching model based on multiple regression

4.2.1

Multiple linear regression algorithms

Univariate linear regression is the simplest linear regression model. It analyzes the linear relationship between one independent variable and the dependent variable, and its basic idea is to predict the value of the dependent variable by modeling that relationship. In real problems, however, the dependent variable is often influenced by several variables at the same time. Univariate linear regression then cannot predict the dependent variable adequately, and two or more independent variables must be introduced to explain its variation jointly, i.e., multiple regression. When the relationship between the independent variables and the dependent variable is linear, this is referred to in this paper as multiple linear regression. Multiple linear regression is a statistical method that uses multiple independent variables to predict a dependent variable; it can analyze the relationship between the independent variables and the dependent variable and estimate the functional form between them. The modeling and parameter calculation process is as follows:

When y is the dependent variable, x₁,x₂,⋯,x_i are the independent variables, and there is a linear relationship between the independent variables and the dependent variable, the general form of the multiple linear regression model is as follows: (6) $y = a_{0} + a_{1} x_{1} + a_{2} x_{2} + \dots + a_{i} x_{i} + e$

Where a₀ represents the constant term, a₁,a₂,⋯,a_i represent the regression coefficients, and e represents the error term.

If two independent variables x₁, x₂ and the same dependent variable y are linearly correlated, the multiple linear regression model formula is: (7) $y = a_{0} + a_{1} x_{1} + a_{2} x_{2} + e$

As with the univariate linear regression equation, the parameters of a multiple regression model are estimated by least squares, i.e., they are chosen so that the sum of squared errors $(\sum e^{2})$ is minimized.

For the two-variable linear regression model, the normal equations for solving the regression parameters are shown in equation (8): (8) $\left\{ \begin{array}{l} \sum y = n a_{0} + a_{1} \sum x_{1} + a_{2} \sum x_{2} \\ \sum x_{1} y = a_{0} \sum x_{1} + a_{1} \sum x_{1}^{2} + a_{2} \sum x_{1} x_{2} \\ \sum x_{2} y = a_{0} \sum x_{2} + a_{1} \sum x_{1} x_{2} + a_{2} \sum x_{2}^{2} \end{array} \right.$

The values of a₀, a₁, and a₂ can be found by solving this system of equations, which can also be solved in matrix form: (9) $a = {(x' x)}^{- 1} \cdot (x' y)$ (10) $\left[ \begin{array}{l} a_{0} \\ a_{1} \\ a_{2} \end{array} \right] = {\left[ \begin{matrix} n & \sum x_{1} & \sum x_{2} \\ \sum x_{1} & \sum x_{1}^{2} & \sum x_{1} x_{2} \\ \sum x_{2} & \sum x_{1} x_{2} & \sum x_{2}^{2} \end{matrix} \right]}^{- 1} \cdot \left[ \begin{array}{l} \sum y \\ \sum x_{1} y \\ \sum x_{2} y \end{array} \right]$

It should be noted that the correlation between independent variables needs to be considered when modeling with multiple linear regression models. If there is a high degree of correlation between the independent variables, it may lead to a decrease in the accuracy of the multiple linear regression model, and at this time, the use of feature selection, principal component analysis and other methods can be considered to reduce the correlation between the independent variables, so as to improve the accuracy of the model.
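
For the two-variable model in equation (7), the normal equations in (8) can be assembled and solved numerically. The following NumPy sketch is illustrative: the function name `fit_two_variable_regression` and the noise-free toy data are assumptions made for demonstration, and the linear system is solved directly, which is algebraically equivalent to the matrix form in equation (9).

```python
import numpy as np


def fit_two_variable_regression(x1, x2, y):
    """Least-squares estimates (a0, a1, a2) for y = a0 + a1*x1 + a2*x2 + e."""
    x1, x2, y = (np.asarray(v, dtype=float) for v in (x1, x2, y))
    n = len(y)
    # Coefficient matrix and right-hand side of the normal equations (8).
    lhs = np.array([
        [n,        x1.sum(),        x2.sum()],
        [x1.sum(), (x1 ** 2).sum(), (x1 * x2).sum()],
        [x2.sum(), (x1 * x2).sum(), (x2 ** 2).sum()],
    ])
    rhs = np.array([y.sum(), (x1 * y).sum(), (x2 * y).sum()])
    # Solving the system avoids forming the matrix inverse explicitly.
    return np.linalg.solve(lhs, rhs)


# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefficients = fit_two_variable_regression(x1, x2, y)
print(coefficients)
```

If the independent variables are perfectly collinear, the coefficient matrix is singular and `np.linalg.solve` raises `LinAlgError`, which is the numerical counterpart of the correlation problem noted above.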

4.2.2

Multiple covariance tests

Multicollinearity is a situation where there is a high degree of correlation between independent variables in a multiple regression model. Multicollinearity can lead to inaccurate regression coefficients, make it difficult to make statistical inferences, and even cause the model to fail. Variance inflation factor (VIF), is used to portray the severity of complex (multiple) correlations among multiple variables. It is the ratio of the variance of the regression coefficients estimated under the assumption of a nonlinear relationship between the independent variables. When 0 < VIF < 5, there is no covariance. If 5 < VIF < 10, the phenomenon is weak complex covariance. When the value is 10 < VIF < 100, the covariance is moderate. Severe covariance occurs when VIF is greater than 100. The calculation formula is shown in equation (11): (11) $V I F_{j} = \frac{1}{1 - R_{j}^{2}}$

where VIF_j is the variance inflation factor of the j-th explanatory variable and $R_{j}^{2}$ is the coefficient of determination of the auxiliary regression of the j-th explanatory variable on the remaining explanatory variables.

In this paper, we choose to utilize the variance inflation factor (VIF) to test for the presence of multicollinearity in the independent variables.
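As a rough illustration of equation (11), the following sketch computes VIF_j by running the auxiliary regression for each column. The function name `variance_inflation_factors` and the synthetic data are assumptions made for this example only; they are not the variables or data used in the paper.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X (equation (11))."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression: column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (an assumption for illustration): x3 is nearly a copy of x1,
# so x1 and x3 should show high VIF values while x2 stays close to 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In terms of the thresholds above, VIF = 5, 10, and 100 correspond to auxiliary $R_{j}^{2}$ values of 0.8, 0.9, and 0.99 respectively.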

4.2.3

Analysis of multiple regression results

1)

Single-factor linear regression analysis

In order to study the influence of pre-study rate, attendance rate, listening rate, course grade, and evaluation grade on the final exam grade, this paper first examines the influence of each factor individually. A linear regression of the dependent variable on each independent variable is established separately, and the model is as follows: (12) $Y = \beta_{0i} + \beta_{1i} X_{i} + \varepsilon_{i}, \quad i = 1, 2, \dots, 5$

where β_0i and β_1i are the regression parameters to be estimated and ε_i is the random error. The regression parameter estimates and their confidence intervals, together with the test statistics R², F, p, and s² obtained using statistical software, are shown in Table 4.

Table 4.

Calculation results of the single-factor linear regression models

Parameter	Parameter estimate	Confidence interval
β₀₁	0.4843	[0.4064, 0.5631]
β₁₁	0.3185	[0.2182, 0.4192]
R² = 0.2043, F = 39.7541, p = 0.0000, s² = 0.0374
β₀₂	-0.9687	[-1.945, 0.0082]
β₁₂	1.7012	[0.7135, 2.6789]
R² = 0.0729, F = 11.5871, p = 0.0005, s² = 0.0421
β₀₃	0.3712	[0.2538, 0.4957]
β₁₃	0.5876	[0.3872, 0.7831]
R² = 0.1879, F = 36.3412, p = 0.0000, s² = 0.0384
β₀₄	0.2395	[0.1728, 0.3116]
β₁₄	0.7221	[0.6234, 0.8123]
R² = 0.5871, F = 207.3967, p = 0.0000, s² = 0.0198
β₀₅	0.3387	[0.0184, 0.6621]
β₁₅	0.3871	[0.0553, 0.7213]
R² = 0.0354, F = 5.3687, p = 0.0213, s² = 0.0443

The results in Table 4 show that the p value of every model is less than 0.05 and that the F value exceeds the critical value of F, indicating that pre-study rate, attendance rate, listening rate, course grade, and evaluation grade each have a significant effect on the final exam grade. Taking R² as the contribution rate, course grade (β₁₄) contributes 58.71%, pre-study rate (β₁₁) 20.43%, listening rate (β₁₃) 18.79%, attendance rate (β₁₂) 7.29%, and evaluation grade (β₁₅) 3.54%. Among the classroom teaching data, course grade, pre-study rate, and listening rate therefore have the greatest influence on the English final exam results, so the management and supervision of pre-class study, in-class question answering, and after-class homework should be strengthened in the teaching process.
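The statistics reported for each single-factor model can be reproduced in outline as follows. This is a minimal sketch under stated assumptions: the function name `simple_regression_summary` and the sample values are hypothetical, and the p value is omitted because it requires the F distribution from a statistics library.

```python
import numpy as np

def simple_regression_summary(x, y):
    """Fit Y = b0 + b1*X by least squares and return (b0, b1, R^2, F, s^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = ((x - x.mean()) ** 2).sum()
    sxy = ((x - x.mean()) * (y - y.mean())).sum()
    b1 = sxy / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    ss_res = (resid ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot       # share of variance explained by X
    s2 = ss_res / (n - 2)            # residual variance estimate
    f_stat = (ss_tot - ss_res) / s2  # 1 numerator and n - 2 denominator degrees of freedom
    return b0, b1, r2, f_stat, s2

# Hypothetical rates and grades on a 0-1 scale, for illustration only.
x = [0.2, 0.4, 0.5, 0.7, 0.9, 0.6]
y = [0.35, 0.50, 0.48, 0.66, 0.80, 0.55]
b0, b1, r2, f_stat, s2 = simple_regression_summary(x, y)
print(b0, b1, r2, f_stat, s2)
```

For a single regressor, F and R² are linked by F = (n − 2) · R² / (1 − R²), so a larger R² always comes with a larger F for a fixed sample size.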

2)

Multifactor linear regression analysis

In order to study the overall impact of pre-study rate, attendance rate, listening rate, course grade, and evaluation grade on the final exam grade, this paper establishes multifactor linear regression model (1): (13) $Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} X_{4} + \beta_{5} X_{5} + \varepsilon$

where β₀, β₁, β₂, β₃, β₄, and β₅ are the regression parameters to be estimated and ε is the random error. Table 5 shows the calculation results of multifactor linear regression model (1).

Table 5.

Calculation results of multifactor linear regression model (1)

Parameter	Parameter estimate	Confidence interval
β₀	0.8213	[0.0645, 1.5578]
β₁	0.1073	[0.0213, 0.1984]
β₂	-0.8531	[-1.7682, 0.0536]
β₃	0.0573	[-0.1328, 0.2368]
β₄	0.6742	[0.5574, 0.7921]
β₅	0.1983	[-0.0651, 0.4622]
R² = 0.6151, F = 46.1386, p = 0.0000, s² = 0.0175

The results in Table 5 show that the p value is less than 0.05 and that the F value exceeds the critical value of F, indicating that the model is valid overall. However, the confidence intervals of parameters β₂, β₃, and β₅ contain zero, indicating that the effects of attendance rate, listening rate, and evaluation grade on the final exam grade are not significant.
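The rule applied above, that a parameter is not significant when its confidence interval contains zero, can be sketched as follows. The function name `ols_with_intervals` and the synthetic data are assumptions for illustration. The intervals use a large-sample normal quantile from the standard library rather than the exact t quantile that statistical software would normally use, so they are slightly narrower than exact intervals for small samples.

```python
import numpy as np
from statistics import NormalDist

def ols_with_intervals(X, y, level=0.95):
    """Least-squares fit with approximate confidence intervals for each parameter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])  # intercept column plus regressors
    k = A.shape[1]
    xtx_inv = np.linalg.inv(A.T @ A)
    beta = xtx_inv @ A.T @ y
    resid = y - A @ beta
    s2 = float(resid @ resid) / (n - k)   # residual variance estimate
    se = np.sqrt(s2 * np.diag(xtx_inv))   # standard error of each parameter
    # Assumption: normal approximation instead of the exact t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower, upper = beta - z * se, beta + z * se
    significant = (lower > 0) | (upper < 0)  # interval excludes zero
    return beta, lower, upper, significant

# Synthetic data (illustrative only): only the first regressor affects y.
rng = np.random.default_rng(1)
n = 150
X = rng.uniform(0.0, 1.0, size=(n, 3))
y = 0.2 + 0.7 * X[:, 0] + rng.normal(scale=0.1, size=n)

beta, lower, upper, significant = ols_with_intervals(X, y)
for name, b, lo, hi, sig in zip(["b0", "b1", "b2", "b3"], beta, lower, upper, significant):
    print(f"{name}: {b:.4f} [{lo:.4f}, {hi:.4f}] significant={sig}")
```

Regressors whose intervals include zero are the candidates for removal before refitting a reduced model.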

After removing the non-significant factors, linear regression model (2) is re-established: (14) $Y = k_{0} + k_{1} X_{1} + k_{4} X_{4} + \varepsilon$

The calculation results (Ⅰ) of multifactor linear regression model (2) are shown in Table 6. Since p < 0.05 and the F value is much larger than the critical value, the model is usable overall, and it explains 60.31% of the variation in the final exam grade.

Table 6.

Calculation results (Ⅰ) of multifactor linear regression model (2)

Parameter	Parameter estimate	Confidence interval
k₀	0.2011	[0.1276, 0.2732]
k₁	0.1153	[0.0364, 0.1935]
k₄	0.6531	[0.5384, 0.7633]
R² = 0.6031, F = 112.9138, p = 0.0000, s² = 0.0184

Figure 5 shows the distribution of the residuals. The residual confidence intervals of 16 data points do not contain zero, so these points are treated as outliers.

After eliminating these outliers and re-running the calculation, the results (Ⅱ) of multifactor linear regression model (2) are shown in Table 7. The regression parameters k₀, k₁, and k₄ change little, the confidence intervals of the parameters become shorter, the values of R² and F become larger, and the residual variance s² becomes smaller, indicating that the modified model is more reliable and that it explains 78.43% of the variation in the final exam grade.

Table 7.

Calculation results (Ⅱ) of multifactor linear regression model (2)

Parameter	Parameter estimate	Confidence interval
k₀	0.1864	[0.1324, 0.2379]
k₁	0.1012	[0.0443, 0.1566]
k₄	0.7036	[0.6311, 0.7792]
R² = 0.7843, F = 244.1613, p = 0.0000, s² = 0.0081
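The outlier-removal and refitting step can be sketched as follows. This is a simplified stand-in, not the paper's procedure: instead of residual confidence intervals, it flags points whose internally standardized residual exceeds a threshold of 2 in absolute value. The function name `refit_without_outliers`, the threshold, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def refit_without_outliers(X, y, threshold=2.0):
    """Fit, flag points with large standardized residuals, and refit without them."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # intercept column plus regressors

    def fit(A, y):
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        return coef, resid, r2

    coef, resid, r2 = fit(A, y)
    n, k = A.shape
    s = np.sqrt(resid @ resid / (n - k))
    # Leverage h_ii: diagonal of the hat matrix A (A'A)^-1 A'.
    h = np.einsum("ij,jk,ik->i", A, np.linalg.inv(A.T @ A), A)
    standardized = resid / (s * np.sqrt(1.0 - h))
    keep = np.abs(standardized) <= threshold
    coef_refit, _, r2_refit = fit(A[keep], y[keep])
    return (coef, r2), (coef_refit, r2_refit), ~keep

# Synthetic data with two injected gross errors (illustrative only).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = 0.2 + 0.7 * x + rng.normal(scale=0.02, size=30)
y[5] += 1.0
y[20] -= 1.0

(before_coef, before_r2), (after_coef, after_r2), flagged = refit_without_outliers(x, y)
print(np.flatnonzero(flagged), before_r2, after_r2)
```

As in the comparison of Tables 6 and 7, removing the flagged points should raise R² while leaving the underlying relationship essentially unchanged.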

In the single-factor linear regression analysis, the effects of pre-study rate, attendance rate, listening rate, course grade, and evaluation grade on the final exam grade are all significant. In the multifactor linear regression analysis, however, only pre-study rate and course grade have a significant effect on the final exam grade, and course grade has the largest coefficient and the highest contribution. The regression results suggest that there is multicollinearity among attendance rate, listening rate, course grade, and evaluation grade, and that course grade largely reflects attendance rate, listening rate, and evaluation grade.

The multiple regression process described above makes it possible to assess and analyze the quality of English teaching after the SPOC college English informatization teaching model is applied, so that the English teaching model constructed in this paper can be continuously optimized and the quality of English teaching in colleges and universities improved.

5

Conclusion

This paper focuses on the correlation analysis and clustering of basic English teaching data and proposes a data-driven English education model. The correlation analysis shows that English grades have the highest correlations (r = 0.25, 0.23) with listening rate and assessment. Combined with the higher average classification accuracy of the “listening-grade” relationship (91.7% > 82.7%), this leads to the conclusion that listening rate has the highest correlation with English grades. Based on this teaching rule, a SPOC informationized teaching model aimed at improving classroom attention was constructed, and a teaching quality assessment method using multiple linear regression was then proposed. After removing insignificant factors, the multifactor linear regression analysis shows that pre-study rate and course grade explain most (60.31%) of the variation in the English final exam scores. Therefore, to better assess the teaching quality achieved by the SPOC informationized teaching model proposed in this paper, more attention should be paid to students’ pre-class study and course grades.


u_i1	u_i2	u_i3	u_i4	u_i5	u_i6	u_i7
1	88	91	90	85	2	3
0	89	98	84	97	1	0
1	82	93	87	90	0	1
0	89	94	98	91	0	2
1	83	86	94	81	0	0
1	81	84	87	89	0	2
0	88	80	80	92	2	2
1	78	79	88	90	0	1


Data-driven Multiple Regression Analysis of Teaching Mode Innovation and Teaching Quality of English Education in Colleges and Universities Based on Data

Qingyan Ge

Published Online: Sep 26, 2025

Received: Dec 26, 2024

Accepted: Apr 16, 2025

DOI: https://doi.org/10.2478/amns-2025-1063

Keywords: Multiple regression analysis, K-means++, Pearson correlation analysis, English teaching mode innovation

© 2025 Qingyan Ge, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
