Data-driven Multiple Regression Analysis of Teaching Mode Innovation and Teaching Quality of English Education in Colleges and Universities Based on Data 
Published Online: Sep 26, 2025
Received: Dec 26, 2024
Accepted: Apr 16, 2025
DOI: https://doi.org/10.2478/amns-2025-1063
© 2025 Qingyan Ge, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Innovating the teaching mode is key to improving teaching quality and cultivating students’ core competitiveness, which makes exploring and applying data-driven teaching mode innovation crucial. Through data analysis, teachers can gain a deep understanding of students’ learning needs, progress and styles, and can design and adjust teaching content and methods accordingly; personalized teaching thus becomes possible, helping each student achieve better learning outcomes [1-2]. However, realizing teaching mode innovation requires overcoming various challenges: continuously improving the technical infrastructure, ensuring data security and privacy protection, and raising the professionalism and training level of teachers. Only in this way can colleges and universities meet the challenges in the field of education and create a more valuable educational environment for students’ learning and development [3-4].
Data analysis is of great significance in English teaching in colleges and universities. By collecting, organizing and analyzing large amounts of student learning data, educational institutions and teachers can gain a deeper understanding of students’ learning needs and behavioral patterns, and so teach more precisely and personally [5-6]. First, learning data such as study time, use of learning materials, and answers to questions reveal students’ bottlenecks in English learning, which helps teachers accurately grasp learning needs and design teaching content and activities in a targeted way [7-9]. Second, data analysis can reveal students’ learning behavior patterns. By analyzing students’ click and browsing records on online learning platforms, teachers can understand students’ preferences for different types of resources and then optimize the selection and presentation of teaching resources. Analysis of learning behaviors also sheds light on students’ capacity for independent learning and their motivation, giving teachers a basis for targeted guidance and motivation strategies [10-12]. Therefore, by deeply exploring students’ learning needs and behavioral patterns, teachers can implement personalized teaching and provide accurate learning guidance, thus improving students’ learning outcomes and English proficiency. Data analysis also gives teachers a basis for teaching evaluation and adjustment, promoting the continuous improvement of education quality [13-14].
In recent years, with the rapid development of computer and network technology, digital education has gradually spread across the country. In 2020, a sudden epidemic hampered normal offline teaching and learning activities nationwide [15], and in this context “online teaching” developed rapidly. “Online teaching” is a new teaching method that uses the Internet, multimedia, and a variety of interactive means to teach and interact systematically; “online” means that all of the learners’ teaching and learning activities take place on the platform, that is, on the network [16-17]. During the online teaching period, teachers’ lack of online teaching experience and insufficient information technology skills meant that they could assess students’ learning status only through online interaction and assignments. Such evaluation falls far short of its due value and role, which is bound to seriously affect the quality of online teaching, because that quality depends not only on teachers’ online teaching design but also on the quality of online teaching evaluation [18-19]. Teaching evaluation runs through the whole process of online teaching and plays a decisive role in the quality of university English online teaching as a whole. Therefore, it is of great significance to use big data to track, monitor and analyze the quality of university English online teaching, to effectively improve that quality, and to strive for “substantial equivalence” between the quality of online teaching and offline classroom teaching [20].
Scholars have first analyzed the innovation of English teaching from a technological perspective. Literature [21] introduced principal component analysis to reduce the dimensionality of the evaluation indexes in an English online teaching quality assessment model, which effectively improved the model’s performance in evaluating English teaching. Literature [22] systematically reviewed the research literature on English digital teaching and affirmed the positive role of digital information technology in improving English teaching. Literature [23] constructed a computerized English teaching system based on C++ and Windows tools, which to a certain extent improved the effectiveness of English teaching. Literature [24] conceived an English translation model with a neural machine algorithm as its underlying structure, which produced higher-quality English translation and performed well in business English translation teaching practice. Literature [25] attempted to integrate a convolutional neural network (CNN)-recurrent neural network into immersive situated English teaching scenarios, which enhanced the interactivity, sense of immersion, and sense of cognition of students’ English learning.
Secondly, English teaching innovation is discussed from the perspective of English teaching methods. Literature [26] envisioned an English teaching ecosystem based on big data technology, and in the teaching comparison experiments, it was confirmed that the proposed teaching methods help students reform and innovate their English teaching concepts, methods and contents. Based on the empirical investigation method, literature [27] reveals that the English flipped teaching classroom contributes to the effect of teacher-student interaction and the quality of interaction, and at the same time provides an important reference for English teachers’ practice of flipped classroom.
In order to use big data to promote the innovation of English teaching mode, the article analyzes the current trend of English education and teaching mode innovation in colleges and universities. Subsequently, it takes the learning data of English majors in a college in J province as an example, and carries out preliminary correlation analysis after data preprocessing of the samples. The improved K-means++ algorithm is used to calculate the Euclidean distance of the clustering centers and sum them up, and the clustering centers are constantly updated based on the probability formula. The English teaching data were clustered by the above method, and the law between the clusters of teaching needs was derived. On this basis, a SPOC college English informatization teaching model containing online teaching, classroom guidance, classroom research, and after-class practice is constructed. A teaching quality evaluation path using multiple regression analysis is explored for this model to realize the teaching quality assessment of the new English teaching model.
English informatization teaching in colleges and universities has greatly promoted and enriched the integration and use of high-quality educational resources, and a large number of high-quality digital resources, supported by new media technology, have made it possible to meet learners’ personalized needs. The current trend of data-driven innovation in college English teaching modes is mainly reflected in the following aspects.
At present, the focus and research hotspot of China’s English education informatization resource construction has shifted from the early high-quality courses, high-quality resource sharing courses, and open video courses to the construction of microcourses and MOOCs, but the earlier construction of digital resources such as high-quality courses and online courses has laid a very good foundation for English informatization classroom teaching. Teachers can obtain the digital resources for network-platform-based English informatization teaching in several ways: introducing high-quality MOOC course resources; building online courses and microcourse video resources with localized characteristics; transforming and upgrading the existing on-campus high-quality resource sharing courses into the digital resources needed for English informatization classrooms; and, through continuous development, forming the resources needed for a “flipped classroom” with Chinese characteristics. It can be said that the classroom cannot be “flipped” without rich digital resources.
With the intervention of new media, different communication technologies have been reorganized and integrated on the basis of digital network technology, and single media have been transformed into all-media, which makes the English informatization classroom far more feasible. In the English informatized classroom, teachers deliver teaching information to students through online teaching platforms, mobile terminals and other new media technologies, and students can communicate with teachers in real time through the same technologies. In addition, learners can collaborate and communicate with each other through new media such as microblogging, WeChat, QQ, and online forums to jointly construct knowledge. Moreover, when educators or learners use various media and devices to learn and work, the media and devices themselves also play an interactive role, and information passes between people and machines to form interactions. It can be seen that an informatization classroom is difficult to realize without the support of new media technology.
English informatization classroom learning is not limited by time or place. Before class, teachers use educational technology to record short, concise microclass videos so that students can study independently ahead of the new lesson, realizing flipped classroom teaching. During class, teachers analyze the students’ learning data fed back by the online platform, provide personalized guidance, and conduct targeted discussions and explanations. After class, students collaborate to complete assignments and interact with teachers and classmates online, and teachers provide online counseling. The learning process and learning tasks are supervised and prompted by teachers and by automated procedures; learners can freely choose the learning content and control the learning process, shifting from passively receiving knowledge to actively learning it while accessing course resources, which makes blended teaching focused on personalized learning possible.
From the above analysis, it is not difficult to find that the collection and analysis of rich English teaching big data is the basis for realizing the innovation of new media teaching, personalized education and other teaching modes. Therefore, this study first collects and analyzes the sample data of English education in colleges and universities, refines the basic laws of English education and teaching, and carries out the construction of new teaching models on this basis.
This paper is based on the historical operation data of the academic affairs system of a university in J province, including English results across majors and courses, teaching evaluation results, the attendance rate, the listening rate, and the preview situation. The survey data were preprocessed to form five factor attributes that may affect teaching quality, namely “preview”, “evaluation”, “achievement”, “attendance” and “listening”, and the sample set is as follows:
   
where 
One-hot encoding is used to convert categorical data such as “Prep”, “Course Type” and “Analysis Type” into numerical values; the conversion of some attribute codes is shown in Table 1.
Conversion table of attribute codes
| Text type | Data attribute | Value |
|---|---|---|
| Preview | Yes | 0 |
| | No | 1 |
| Course | Basic English | 0 |
| | Professional English | 1 |
| | Business English | 2 |
| | English listening | 3 |
| Requirement | Self-directed type | 0 |
| | Self-driven type | 1 |
| | Friendly type | 2 |
| | Passive type | 3 |
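The conversion in Table 1 can be sketched in a few lines of Python. This is only an illustration: the dictionary and function names are hypothetical, and the field name "Achievement" in the usage note is an assumption. Note that Table 1 assigns a single integer code to each category; a strict one-hot representation would expand each code into a 0/1 indicator vector, so the sketch shows both forms.

```python
# Integer codes copied from Table 1; the names below are hypothetical.
ATTRIBUTE_CODES = {
    "Preview": {"Yes": 0, "No": 1},
    "Course": {
        "Basic English": 0,
        "Professional English": 1,
        "Business English": 2,
        "English listening": 3,
    },
    "Requirement": {
        "Self-directed type": 0,
        "Self-driven type": 1,
        "Friendly type": 2,
        "Passive type": 3,
    },
}


def encode_record(record):
    """Replace each categorical text value with its integer code from Table 1.

    Fields that are not listed in ATTRIBUTE_CODES pass through unchanged.
    """
    encoded = {}
    for name, value in record.items():
        if name in ATTRIBUTE_CODES:
            encoded[name] = ATTRIBUTE_CODES[name][value]  # KeyError on unknown label
        else:
            encoded[name] = value
    return encoded


def one_hot(attribute, value):
    """Expand one categorical value into a 0/1 indicator vector."""
    codes = ATTRIBUTE_CODES[attribute]
    vector = [0] * len(codes)
    vector[codes[value]] = 1
    return vector
```

For example, `encode_record({"Preview": "No", "Course": "Business English", "Achievement": 88})` maps the two text fields to their codes and leaves the numeric field untouched, while `one_hot("Course", "Business English")` gives the indicator form of the same category.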
A manual calibration method is used to calibrate the demand types to form a calibrated dataset to better evaluate the analytical model, which is then divided into a training set and a test set. The training set data attributes and example data are shown in Table 2.
Data attributes and examples of training set
| 1 | 88 | 91 | 90 | 85 | 2 | 3 | 
| 0 | 89 | 98 | 84 | 97 | 1 | 0 | 
| 1 | 82 | 93 | 87 | 90 | 0 | 1 | 
| 0 | 89 | 94 | 98 | 91 | 0 | 2 | 
| 1 | 83 | 86 | 94 | 81 | 0 | 0 | 
| 1 | 81 | 84 | 87 | 89 | 0 | 2 | 
| 0 | 88 | 80 | 80 | 92 | 2 | 2 | 
| 1 | 78 | 79 | 88 | 90 | 0 | 1 | 
In order to understand the distribution of courses in the dataset, statistics were made according to four types of needs: “autonomous”, “friendly”, “self-driven” and “passive”. The statistical results are shown in Figure 1.

Overview statistics of data set
Correlation analysis examines the degree of association between two or more related variables. It is used to investigate whether some kind of dependence exists between phenomena and to analyze the direction and degree of that dependence, and it is an important statistical method for studying the correlation between random variables. The correlation measure used in this paper is the Pearson correlation coefficient.
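Correlation analysis based on classical statistics measures the strength of linear correlation between variables and is generally described quantitatively by the Pearson correlation coefficient. The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is a linear correlation coefficient and the most commonly used type of correlation coefficient [28]. Denoted as $r$, it reflects the degree of linear correlation between two variables, a dependent variable $Y$ and an independent variable $X$:

$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{2}$$

where $x_i$ and $y_i$ are the $i$-th observations of the two variables, $\bar{x}$ and $\bar{y}$ are their sample means, $\sigma_X$ and $\sigma_Y$ are their standard deviations, and $n$ is the number of samples.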
In this section, attribute correlation analysis is performed on the calibrated training set to remove data attributes with weak correlation to ensure the stability of the subsequent clustering algorithm. In order to reduce the amount of computation, the correlation analysis is performed only on the four analysis types of basic English courses, and the Pearson correlation coefficient is used to statistically analyze the degree of closeness of relationship between the variables.
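As a minimal sketch of this screening step, the following computes the Pearson coefficient in its standard product-moment form and lists the attributes whose correlation with a target attribute is weak. The function names, the column layout, and the `threshold` parameter are assumptions for illustration; the text does not state the cutoff that was actually used.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations (numerator) and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def weakly_correlated(columns, target, threshold):
    """Names of columns whose |r| with the target column falls below threshold.

    `threshold` is a hypothetical cutoff chosen by the analyst.
    """
    return [
        name
        for name, values in columns.items()
        if name != target and abs(pearson_r(values, columns[target])) < threshold
    ]
```

Attributes returned by `weakly_correlated` would be the candidates for removal before clustering.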
Substituting the values of the data attributes in the training set into the Pearson’s formula yields a table of correlation coefficient analysis, as shown in Table 3. Where 
Analysis of correlation coefficients
| Group | Type | ||||||
|---|---|---|---|---|---|---|---|
| 88 | 8.37 | 6.01 | 7.29 | 0.15 | 0.23 | ||
| 68 | 8.91 | 6.29 | 11.02 | 0.22 | |||
| 92 | 8.74 | 7.31 | 18.10 | 0.26 | |||
| 75 | 10.14 | 9.13 | 22.72 | 0.26 | |||
| 88 | 8.37 | 8.38 | 11.75 | 0.18 | 0.2 | ||
| 68 | 8.91 | 8.91 | 4.31 | 0.08 | |||
| 92 | 8.72 | 8.74 | 21.90 | 0.30 | |||
| 75 | 10.14 | 10.12 | 16.57 | 0.21 | |||
| 88 | 8.37 | 5.04 | 2.71 | 0.07 | 0.1 | ||
| 68 | 8.89 | 5.42 | 7.40 | 0.16 | |||
| 92 | 8.74 | 7.32 | 2.61 | 0.05 | |||
| 75 | 10.12 | 7.47 | 14.43 | 0.20 | |||
| 88 | 5.97 | 8.37 | 9.15 | 0.21 | 0.25 | ||
| 68 | 6.29 | 8.91 | 2.98 | 0.05 | |||
| 92 | 7.29 | 8.72 | 22.34 | 0.35 | |||
| 75 | 9.13 | 10.14 | 25.04 | 0.34 | |||
| 88 | 6.00 | 5.06 | 1.47 | 0.04 | 0.07 | ||
| 68 | 6.27 | 5.42 | 4.31 | 0.11 | |||
| 92 | 7.28 | 7.31 | -0.11 | 0.00 | |||
| 75 | 9.16 | 7.45 | 9.52 | 0.15 | |||
| 88 | 8.35 | 5.04 | 0.42 | 0.02 | 0.08 | ||
| 68 | 8.83 | 5.42 | 1.67 | 0.05 | |||
| 92 | 8.72 | 7.32 | 5.44 | 0.08 | |||
| 75 | 10.14 | 7.45 | 14.88 | 0.23 | 
It can be found that the correlation between relationship group 
K-means is a better performance clustering algorithm with low complexity, which can be applied to the case of large amount of data. The algorithm first randomly selects 
As a basic algorithm, K-means is robust and can be used for all types of data sets. However, the algorithm also has many shortcomings, such as the algorithm is susceptible to the selection of the initial 
The steps to execute the improved K-means++ algorithm of the paper are as follows:
Step 1: The center point 
Step 2: For each data point 
   
where 
   
Step 3: Select the brand new center point 
   
Step 4: Repeat steps 2 and 3 until 
It can be seen that the K-means++ algorithm improves the selection of the initial centroids and achieves better clustering performance [30].
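For orientation, the seeding rule of the standard K-means++ algorithm can be sketched as follows. This is the textbook procedure, not necessarily the improved variant used in this paper; the function name, the tuple representation of data points, and the fixed random seed are assumptions.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers with the standard K-means++ seeding rule."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Step 1: the first center is drawn uniformly at random from the data.
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Step 2: squared Euclidean distance from each point to its nearest center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(points[rng.randrange(len(points))])
            continue
        # Step 3: draw the next center with probability D(x)^2 / sum of D(x)^2.
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = len(points) - 1  # guards against floating-point shortfall
        for index, weight in enumerate(d2):
            cumulative += weight
            if cumulative > threshold:
                chosen = index
                break
        centers.append(points[chosen])
    # Step 4: the loop ends once k centers have been selected.
    return centers
```

Because a point that is already a center has zero weight, it cannot be drawn again, so distant points are favored as new centers.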
The improved K-means++ algorithm was used to run a test cluster analysis of the teaching data, in order to provide an algorithmic basis for constructing the new teaching model. Based on the data preprocessing in the previous section, the K-means++ algorithm was used to test clustering on 30% of the raw data. Random initial centroids O (k=4) were constructed, each point was assigned to the nearest centroid, the centroids were then recalculated, and the process was repeated until the cluster assignments of the data points no longer changed. Following this procedure, the preprocessed dataset was processed in Python and the graphical results are shown in Figure 2. Fig. 2 (a) and (b) show the clustering results of the “grade-listening” and “grade-assessment” relationships, respectively.
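The assign-and-recompute loop described here can be sketched as below. It is a generic illustration rather than the program used for Figure 2: the function name and the `max_iter` safeguard are assumptions, and the initial centroids are supplied by the caller (for instance from a K-means++ seeding step with k=4).

```python
import math


def kmeans(points, centers, max_iter=100):
    """Lloyd iterations: assign to nearest centroid, recompute, stop when stable."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers
```

Each point would be a pair such as (grade, listening rate), so the returned labels can be plotted directly as a two-dimensional scatter.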

Comparison diagrams of clustering relationship
The results after clustering are tested for two different relationship groupings separately. Let the calibrated number of a particular classification be 
Figure 3 shows the evaluation results for the two relationship groupings, “listening-grade” and “grade-evaluation”. Among them, the average classification accuracy of the “listening-grade” relationship (91.7%) is higher than that of the “grade-evaluation” relationship (82.7%).

Comparison of relationship grouping evaluation
Based on the previous analysis of the innovative trend of English education in colleges and universities and the preliminary analysis of English teaching data, this paper proposes a data-driven informatization education model for English in colleges and universities and a teaching quality evaluation method based on multiple linear regression.
Most traditional English informatization teaching models in colleges and universities deliver courses as MOOCs, whose main advantage is the openness of content and form. However, the shortcomings of MOOCs are also obvious: no prerequisites for learners, no limit on class size, low completion rates, no formal credit certification, and open online exams that are prone to academic integrity problems.
SPOC, as a form of online course derived from MOOC, adheres to the teaching concept and teaching design of MOOC, utilizes the resources of high-quality MOOCs, improves the teaching methods and processes in schools, and improves the teaching effect. Therefore, this paper proposes a college English informatization teaching model based on SPOC flipped classroom.
The teaching activities of SPOC can be divided into pre-course orientation, classroom research and post-course practice, forming a complete flipped classroom teaching process that organically combines the physical classroom with network teaching. Drawing on the research of many scholars and on the concept of SPOC, this paper builds on an existing network platform and the informatization teaching framework designed earlier, and combines the characteristics of the flipped classroom and blended learning to construct a SPOC-based informatization classroom teaching process model. The process model of the SPOC-based English informatization teaching mode is shown in Figure 4.

Process model of English information based teaching model based on SPOC
The effective development of the blended teaching mode cannot be separated from the support of a network platform. The platform can be built on an existing stable, easy-to-operate online course platform or a high-quality resource sharing course website, which the teacher redesigns and regularly maintains for learners. Digital resources are the core of the network platform, and each course is independent. Although network teaching platforms differ, their functions generally include a resource area, a communication area, a management area, and a learning area. The network teaching platform provides the environment for SPOC teaching activities: teachers build resources, release learning tasks, manage students, and interact with students on the platform, while students obtain the learning materials they need, study independently, complete online tests and discussions, and come to understand and apply knowledge.
Before the class, the teacher is the creator and integrator of the course resources, and the designer of the content and progress of the “SPOC flipped classroom”, and the students are the implementers of the “flipped classroom”. Teachers release “learning tasks” and push learning resources, and students independently choose a variety of high-quality resources for online learning through the online platform, watch micro-videos to understand the main content of the classroom in advance, and identify problems.
In the classroom, the teacher is the guide and facilitator of teaching activities, providing individualized guidance to students, organizing group seminars, carrying out project training, jointly solving problems encountered, and providing feedback on classroom problems. The classroom teaching methodology has shifted from monolithic theory teaching to diversified collaboration, inquiry, discussion and interaction. Teachers are the leaders of “SPOC Flipped Classroom”, summarizing the knowledge points and giving new tasks according to the students’ situation, and students understand and internalize the knowledge through practical operation.
After class, the teacher is a supporter, assigning homework and administering after-class tests to help students assess what they have learned. Based on the completed homework, the teacher selects outstanding student work to display and share on the network platform and provides targeted counseling and evaluation. With the teacher’s assistance and through intra-group and inter-group collaboration, students consolidate and deepen what they have learned and prepare for the next stage of learning.
Univariate linear regression is the simplest linear regression model. It analyzes the linear relationship between one independent variable and the dependent variable, and its basic idea is to predict the value of the dependent variable by modeling that relationship. In real problems, however, the dependent variable is often affected by several variables at once. In that case univariate linear regression cannot adequately predict the dependent variable, and two or more jointly acting variables must be introduced to explain its change, i.e., multiple regression. When the relationship between the multiple independent variables and the variable to be predicted is linear, this paper refers to it as multiple linear regression. Multiple linear regression is a statistical method that uses multiple independent variables to predict a dependent variable; it can analyze the relationship between the independent variables and the dependent variable and estimate the functional form between them. The modeling and parameter calculation process is as follows:
When 
   
Where 
If two independent variables 
   
As with the binary linear regression equation, the parameters of a multiple regression model are estimated by least squares, that is, by requiring that the sum of squared errors $\sum (y - \hat{y})^2$ be minimized.
With the binary linear regression model, the standard set of equations for solving the regression parameters is shown in equation (8):
   
The values of 
   
   
It should be noted that the correlation between independent variables needs to be considered when modeling with multiple linear regression models. If there is a high degree of correlation between the independent variables, it may lead to a decrease in the accuracy of the multiple linear regression model, and at this time, the use of feature selection, principal component analysis and other methods can be considered to reduce the correlation between the independent variables, so as to improve the accuracy of the model.
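As a minimal sketch of the least-squares estimation described above, the following solves the normal equations $(X^\top X)\beta = X^\top y$ directly. The function names and the pure-Python elimination routine are illustrative assumptions, not the paper's implementation. A singular system signals perfectly collinear predictors, the extreme case of the correlation problem just noted.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)
```

With two predictors per row this reduces to the binary case: the system has three equations in the intercept and the two slopes.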
Multicollinearity is a situation in which the independent variables in a multiple regression model are highly correlated with one another. It can make the estimated regression coefficients inaccurate, make statistical inference difficult, and even cause the model to fail. The variance inflation factor (VIF) is used to describe the severity of multicollinearity among multiple variables. It is the ratio of the variance of an estimated regression coefficient to the variance that estimate would have if the independent variables were not linearly related. When 0 < 
   
where 
In this paper, we choose to utilize the variance inflation factor (VIF) to test for the presence of multicollinearity in the independent variables.
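A VIF check can be sketched as follows, using the standard definition $\mathrm{VIF}_i = 1/(1 - R_i^2)$, where $R_i^2$ is the coefficient of determination obtained by regressing the $i$-th independent variable on all the others. The function names and the row-wise data layout are assumptions, and the sketch does not encode any particular cutoff for judging severity.

```python
def _solve(a, b):
    """Gaussian elimination with partial pivoting for the square system a x = b."""
    n = len(b)
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(rows, y):
    """R^2 of the least-squares fit of y on the columns of rows (with intercept)."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = _solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def variance_inflation_factors(rows):
    """VIF_i = 1 / (1 - R_i^2), regressing predictor i on all the others."""
    k = len(rows[0])
    vifs = []
    for i in range(k):
        target = [r[i] for r in rows]
        others = [[v for j, v in enumerate(r) if j != i] for r in rows]
        r2 = r_squared(others, target)
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs
```

Uncorrelated predictors give values of 1, and the values grow without bound as one predictor becomes a linear combination of the others.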
Single-factor linear regression analysis. In order to study the influence of pre-study rate, attendance rate, listening rate, course grade, and evaluation grade on the final exam grade, this paper first studies the influence of each factor on the final exam grade separately. A linear regression of the dependent variable on each independent variable is established in turn, and the model is as follows:
   
Where, 
The calculation result of single factor linear regression model
| Parameter | Parameter estimate | Confidence interval |
|---|---|---|
| | 0.4843 | [0.4064, 0.5631] |
| | 0.3185 | [0.2182, 0.4192] |
| | -0.9687 | [-1.945, 0.0082] |
| | 1.7012 | [0.7135, 2.6789] |
| | 0.3712 | [0.2538, 0.4957] |
| | 0.5876 | [0.3872, 0.7831] |
| | 0.2395 | [0.1728, 0.3116] |
| | 0.7221 | [0.6234, 0.8123] |
| | 0.3387 | [0.0184, 0.6621] |
| | 0.3871 | [0.0553, 0.7213] |
It can be seen from the calculation results in Table 4 that the value of 
Multifactor linear regression analysis. In order to study the overall impact of multiple variables such as pre-study rate, attendance rate, listening rate, course grade, and evaluation grade on the final exam grade, this paper establishes a multivariate linear regression model (1):
   
Where, 
The calculation result of multi-factors linear regression model (1)
| Parameter | Parameter estimate | Confidence interval |
|---|---|---|
| | 0.8213 | [0.0645, 1.5578] |
| | 0.1073 | [0.0213, 0.1984] |
| | -0.8531 | [-1.7682, 0.0536] |
| | 0.0573 | [-0.1328, 0.2368] |
| | 0.6742 | [0.5574, 0.7921] |
| | 0.1983 | [-0.0651, 0.4622] |
From the calculation results in Table 5, it can be seen that the value of 
Remove the non-significant factors and re-establish the linear regression model (2):
   
The results of the calculation of the multifactor linear regression model (2) (Ⅰ) are shown in Table 6. From the calculation results, 
The calculation result (Ⅰ) of multi-factors linear regression model (2)
| Parameter | Parameter estimate | Confidence interval |
|---|---|---|
| | 0.2011 | [0.1276, 0.2732] |
| | 0.1153 | [0.0364, 0.1935] |
| | 0.6531 | [0.5384, 0.7633] |
Figure 5 shows the distribution of the residuals. Sixteen data points had residual confidence intervals that did not contain zero, and these points should be treated as outliers.

The distribution of residual
After eliminating them and re-running the program to calculate, the results of calculating the multifactor linear regression model (2) (II) are shown in Table 7. From the calculation results, the regression parameters 
The calculation result(Ⅱ) of multi-factors linear regression model (2)
| Parameter | Parameter estimate | Confidence interval |
|---|---|---|
| | 0.1864 | [0.1324, 0.2379] |
| | 0.1012 | [0.0443, 0.1566] |
| | 0.7036 | [0.6311, 0.7792] |
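The fit, remove outliers, refit sequence behind Tables 6 and 7 can be illustrated with a simplified sketch. Two simplifications are assumed here and are not the paper's method: the example uses a single predictor, and it flags outliers by a standardized-residual cutoff rather than by residual confidence intervals that exclude zero. The function names and the default `cutoff` are hypothetical.

```python
import math


def fit_line(x, y):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def drop_outliers_and_refit(x, y, cutoff=2.0):
    """Fit, drop points whose standardized residual exceeds cutoff, then refit."""
    intercept, slope = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom (two fitted parameters).
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    if sigma == 0:
        return (intercept, slope), list(range(len(x)))  # perfect fit, keep everything
    keep = [i for i, r in enumerate(residuals) if abs(r) / sigma <= cutoff]
    refit = fit_line([x[i] for i in keep], [y[i] for i in keep])
    return refit, keep
```

The function returns the refitted coefficients together with the indices of the retained points, so the removed observations can be inspected before they are discarded.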
In the single-factor linear regression analysis, the effects of pre-study rate, attendance rate, listening rate, course grade, and evaluation grade on the final exam grade are all significant, but in the multifactor linear regression analysis only the pre-study rate and course grade have a significant effect on the final exam grade, with course grade carrying the largest weight and the highest contribution. The regression model indicates multicollinearity among attendance rate, listening rate, course grade, and evaluation grade, and course grade can largely reflect attendance rate, listening rate and evaluation grade.
Based on the above multiple regression process, the quality of English teaching after applying the SPOC college English informatization teaching model can be assessed and analyzed, so that the English teaching model constructed in this paper can be continuously optimized and improved and the quality of English teaching in colleges and universities enhanced.
This paper focuses on the correlation analysis and clustering of basic English teaching data and proposes a data-driven English education model. The correlation analysis shows that English grades have the highest correlation (r=0.25, 0.23) with listening rate and assessment; combined with the higher average classification accuracy of the “listening-grade” relationship (91.7%>82.7%), it is concluded that the listening rate has the highest correlation with English grades. Based on this teaching rule, a SPOC informatization teaching model aimed at improving classroom attention was constructed, and a teaching quality assessment method using multiple linear regression was then proposed. After removing insignificant factors, the multifactor linear regression analysis shows that the pre-study rate and course grade can explain most (60.31%) of the variation in English final exam scores. Therefore, to better assess the teaching quality achieved by the SPOC informatization teaching model proposed in this paper, more attention should be paid to students’ pre-class preparation and course grades.
