Model Innovation and Practice of English Education in Colleges and Universities in the Context of Internet

In the Internet era, the dissemination of information and knowledge has become more convenient and comprehensive. Students can obtain a large number of English learning resources through the Internet, such as English learning websites, English learning apps and English e-books. This abundance poses new challenges for English teaching, because traditional classroom teaching and teaching resources can no longer meet students’ needs [1-4]. Students’ requirements for learning have also changed: they pay more attention to practicality and personalization, so the traditional teaching mode struggles to match their learning habits and attitudes. College English teaching needs to be reformed and innovated to adapt to the development of the times and the needs of students [5-8].

Under the background of Internet education, the traditional English teaching mode has faced many challenges and opportunities for change. With the progress of science and technology and the continuous development of information technology, Internet education presents the characteristics of diversification and personalization. The innovation of English teaching mode is an important means to improve the quality of education and learning effect [9-12]. By innovating in terms of teaching content, teaching methods, teaching evaluation, teaching environment and teaching resources, we can better adapt to the needs of learners in the Internet era and provide a more personalized and diversified learning experience [13-15]. Educators should actively explore the teaching mode adapted to the background of Internet education, and make positive contributions to the cultivation of future talents with innovation and adaptability [16-17].

Literature [18] studied interactive English teaching in colleges and universities supported by Internet of Things (IoT) technology. Students were surveyed through questionnaires and interviews, and the responses were analyzed with statistical software; the study concluded that the IoT-based interactive teaching mode is clearly effective and well received by students. Literature [19] discussed the blended teaching mode of college English in the network environment, emphasizing that combining traditional teaching methods with information networks breaks the limitations of time and space and effectively improves students’ interest in learning and the quality of English teaching. Literature [20] discussed the impact of the Internet on college English teaching and the associated teaching principles, and proposed a teaching mode for the Internet environment, pointing out that colleges and universities need to change their teaching concepts, update their teaching methods, and make use of Internet teaching platforms in order to promote students’ comprehensive learning of English. Literature [21] described the role of the Internet in driving innovation in the traditional English teaching mode, outlined the innovations and current development of college English teaching in the intelligent environment, and emphasized the integration and application of the hybrid teaching mode in that setting. Literature [22] examined the integration of multimodal resources into English reading teaching against the background of “Internet Plus”. Based on a literature review and case studies, it indicated that using audio, video and other media resources in teaching can effectively improve students’ English reading ability, and it recommended the use of digital media resources in teaching.

Literature [23] introduced the application of blended teaching in college English teaching. It pointed out that blended teaching combines the traditional classroom with online learning so that the two complement each other, and it studied the design and application of a blended college English teaching mode in the Internet context. Literature [24] proposed an English teaching model based on the Internet of Things and the cultivation of students’ critical thinking ability, using a computer data simulation model to build an innovative English teaching model and introducing a particle swarm optimization algorithm into it. The results indicated that the particle swarm optimization algorithm can improve the quality of English teaching. Literature [25] addressed problems such as college students’ poor English performance by proposing a research method for the reform and development direction of Internet-based English teaching. Starting from the problems exposed by current English teaching, it analyzed the advantages of Internet teaching in order to strengthen the connection between English teaching and the Internet. A Boltzmann machine model was used to select a suitable direction for teaching reform and development, and the experimental results verified the effectiveness of the method, with students’ English performance and learning efficiency both improving. Literature [26] aimed to establish an interactive education paradigm based on the Internet of Things to improve the quality of English teaching. Various sensors were used to compare students’ speech and vocabulary in order to correct their English, which improved students’ learning efficiency and helped teachers choose suitable teaching resources; the results demonstrated the effectiveness of the method. Literature [27] examined how greater exposure to Internet resources can improve students’ language skill development and learning. By integrating Internet resources into college English teaching and sharing effective methods, it explored the challenges faced by college English teachers. Literature [28] analyzed the current development of English education and teaching in colleges and universities, elucidated the advantages of college English teaching in the context of the Internet, and put forward suggestions to promote the sustainable development of college English education. Literature [29] discussed the current situation and problems of college English teaching from the perspective of quality education and education reform, compared the modern and traditional teaching modes, and constructed a new teaching system for colleges and universities. Literature [30] proposed a multimedia English teaching system based on Internet virtual technology, constructed a three-layer distance teaching architecture based on the principle of distance teaching, and created a virtual learning environment using virtual reality technology. The experimental results showed that the proposed system offers stable performance and real-time interaction, which is conducive to improving the effect of English teaching.

In this paper, edge features and pixel ratio features are used to describe the target model, and a tracking model based on particle filtering is constructed. The similarity between the target template and the candidate template is calculated by fusing the two features, extracted from gray-scale and binary templates respectively, and is represented using the Bhattacharyya distance. Kinect is used to obtain the hand position information, map the real hand to the virtual space, and control the virtual hand to perform classroom interaction operations. A method for correcting the position of the virtual hand is proposed that takes the actual interaction distance into account. According to the given course content, the user’s operation intention is defined and the interaction intention is analyzed. Deep RNNs and ResNet are used to recognize teacher and student behaviors and classroom teaching scenarios, respectively, in order to design inquiry-based interactive English classrooms. The model developed in this paper is applied to English education in colleges and universities, and its interaction effect is evaluated.

2

Interactive tools and methods for English language teaching and learning

2.1

Gesture fingertip tracking method based on edge features and pixel ratio

2.1.1

Tracking model based on particle filtering

The tracked target is defined as the target model, and the target in the current frame is the candidate model. Edge features and pixel ratio features are each used to describe the target model. Let ${\hat{q}}_{u}$ denote the probability density of level u of the edge direction feature of the target model, and let n₁ denote the number of quantization levels of the edge direction value. The edge direction feature of the target model can then be described as: (1) ${\hat{q}}_{u} = C \sum_{i = 1}^{n} k (\| X_{i}^{*} \|) δ (b_{EOH} (X_{i}) - u)$ (2) ${\hat{q}}_{EOH} = \{{\hat{q}}_{u}\}_{u = 1 \dots n_{1}}, \quad \sum_{u = 1}^{n_{1}} {\hat{q}}_{u} = 1$ where k(·) is the profile function of the Epanechnikov kernel and δ(·) is the Dirac function. X_i denotes the gradient magnitude at the i-th edge point of the target model, and n denotes the number of edge points in the target model. b_EOH(·) determines whether the direction value at an edge point equals the feature level u. C is the normalization coefficient.

Let ${\hat{q}}_{v}$ denote the probability density of the target model human pixel ratio, T_i denote the number of target model human pixels, and C_i denote the total number of target model pixels: (3) ${\hat{q}}_{v} = \frac{T_{i}}{C_{i}}$

Experiments show that setting the size of the selected region of interest to 40×40 gives good tracking performance.

2.1.2

Fusion characteristics and similarity

The two features are extracted from templates of different sizes. The edge direction histogram is extracted from the gray-scale medium template (size 80×90 (a)) and the pixel ratio feature is extracted from the binary template (size 40×40 (b)). The two features are then fused to calculate the similarity between the target template and the candidate template. The Bhattacharyya distance is used to represent the similarity of the edge direction features between the target model and the candidate model. For the pixel ratio feature, a new similarity measure is defined as: (4) $ρ (y) = 1 - \frac{| {\hat{p}}_{v} (y) - {\hat{q}}_{v} |}{{\hat{q}}_{v}}$ where ${\hat{p}}_{v} (y)$ is the pixel ratio of the candidate model at position y and ${\hat{q}}_{v}$ is the pixel ratio of the target model.

The larger ρ is, the more similar the pixel ratio features are. In a changing scene, the two features behave differently. To combine the advantages of both, the two features are weighted linearly, with ρ₁ the Bhattacharyya distance of the edge direction features and ρ₂ the pixel ratio similarity of Eq. (4), so that the two weights sum to one: (5) $α_{1} = \frac{1 - ρ_{1}}{(1 - ρ_{1}) + ρ_{2}}$ (6) $α_{2} = \frac{ρ_{2}}{(1 - ρ_{1}) + ρ_{2}}$

The better a feature performs, the more similar it is to the target model and the greater its weight. The joint feature probability density can be expressed as: (7) $ρ_{sum} (y) = α_{1} (1 - ρ_{1}) + α_{2} ρ_{2}$
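The fusion in Eqs. (4)-(7) can be illustrated with a minimal Python sketch. It assumes that ρ₁ is the Bhattacharyya distance between two normalized edge direction histograms (0 means identical), that ρ₂ is the pixel ratio similarity of Eq. (4), and that the two weights are normalized to sum to one, i.e. α₂ is taken proportional to ρ₂. The function names and sample values are illustrative only and are not taken from the paper.

```python
import math

def bhattacharyya_distance(p, q):
    """Distance between two normalized histograms; 0 means identical."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))

def pixel_ratio_similarity(p_v, q_v):
    """Eq. (4): 1 - |p_v - q_v| / q_v, with q_v the target-model pixel ratio."""
    return 1.0 - abs(p_v - q_v) / q_v

def fused_similarity(rho1, rho2):
    """Eqs. (5)-(7), assuming weights normalized to sum to one."""
    s1 = 1.0 - rho1          # edge-direction similarity
    s2 = rho2                # pixel-ratio similarity
    total = s1 + s2
    if total <= 0.0:         # both features report no similarity
        return 0.0, (0.5, 0.5)
    a1, a2 = s1 / total, s2 / total
    return a1 * s1 + a2 * s2, (a1, a2)

# Illustrative values only
rho1 = bhattacharyya_distance([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
rho2 = pixel_ratio_similarity(0.5, 0.5)
score, weights = fused_similarity(rho1, rho2)
```

With this weighting, the feature that currently matches the target better contributes more to the fused score, which is the behavior described in the text.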

2.2

Position correction algorithm based on virtual hand interaction

2.2.1

Virtual hand based interaction control

Using the hand position information obtained by Kinect, the real hand can be easily mapped into the virtual space so that the user can directly control the virtual hand [31]. The positions along three directions constitute three degrees of freedom of the virtual hand, and the other three degrees of freedom are the rotations about the three directions. The degrees of freedom of the virtual hand, i.e., the characteristics of the virtual hand, are described by the following equation: (8) $VRHand = \{p_{x}, p_{y}, p_{z}, r_{x}, r_{y}, r_{z}\}$

If the depth coordinate of the real hand is [h_x,h_y,h_z], then the coordinate mapped to the virtual space [p_x,p_y,p_z] can be expressed as: (9) $\begin{bmatrix} p_{x} \\ p_{y} \\ p_{z} \end{bmatrix} = \begin{bmatrix} t_{x} & 0 & 0 \\ 0 & t_{y} & 0 \\ 0 & 0 & - t_{z} \end{bmatrix} \begin{bmatrix} h_{x} \\ h_{y} \\ h_{z} \end{bmatrix} + \begin{bmatrix} λ_{x} \\ λ_{y} \\ λ_{z} \end{bmatrix}$

In matrix T, t_x, t_y, t_z denote the scale of the transformation to real hand coordinates. This scale is set according to the size of the actual virtual reality interaction space. It is worth noting that since the value of axis Z in the depth coordinates obtained by Kinect is positive, the value of t_z is negative.
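As a small illustration of Eq. (9), the sketch below maps a Kinect hand coordinate to the virtual space by scaling each axis and adding an offset, with the sign of the depth axis flipped as in the matrix of Eq. (9). The function name and the numeric scale and offset values are assumptions chosen for the example; in practice they would be set from the size of the actual interaction space.

```python
def map_hand_to_virtual(hand, scale, offset):
    """Eq. (9): p = diag(t_x, t_y, -t_z) * h + lambda."""
    h_x, h_y, h_z = hand
    t_x, t_y, t_z = scale
    l_x, l_y, l_z = offset
    return (
        t_x * h_x + l_x,
        t_y * h_y + l_y,
        -t_z * h_z + l_z,   # depth axis is flipped, as in the matrix of Eq. (9)
    )

# Hypothetical values: hand 1.5 m in front of the sensor, slightly right and up
p = map_hand_to_virtual(hand=(0.2, 0.1, 1.5),
                        scale=(2.0, 2.0, 1.0),
                        offset=(0.0, 1.0, 3.0))
```

Because the transform is diagonal, each virtual coordinate depends only on the matching real coordinate, so the scales can be tuned per axis independently.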

2.2.2

Virtual hand position correction algorithm

Based on the analysis of the user interaction model in the previous section, the position of the virtual hand can be corrected in real time simply by correcting the rotation angle of the virtual hand around the Y-axis during left-right movement. Let this angle be σ, where: (10) $\begin{cases} σ_{\max} = θ + μ \\ σ_{\min} = 0 \end{cases}, \quad θ = \tan (\frac{VE\_Width / 2}{VE\_Depth})$ where μ is a constant, set empirically to μ = 30. However, a virtual hand adjusted according to this value still deviates slightly from the actual manipulation vector. The reason is that the formula above calculates the offset from the distance between the screen and the plane in virtual space where the virtual hand is located, whereas the viewing angle is in fact based on the distance, as observed by the user’s eyes, between the real hand and the virtual hand plane. The distance d between the real hand and the screen should therefore be taken into account. In the Unity3D scene, object positions are expressed in meters, and the typical distance between a person’s hand and the screen during actual interaction is d = 1. Formula (10) is therefore modified to: (11) $\begin{cases} σ_{\max} = θ' + μ \\ σ_{\min} = 0 \end{cases}, \quad θ' = \tan (\frac{VE\_Width / 2}{VE\_Depth + d})$

Then, the normal vector of the virtual hand is updated in real time. Statistics of the actual data show that the rotation angle σ is linearly related to p_x in the virtual space [32]: (12) $σ = \frac{σ_{\max}}{(VE\_Width + b) / 2} \cdot p_{x} + ρ$ where b = 1 is an offset that favors the operating range of the right hand, since the user’s right hand lies to the right of the user’s center. ρ is a constant whose value can be obtained by substituting the values of σ and p_x into Eq. (12) during initialization. The resulting σ corresponds to r_y in the virtual hand attributes. By modifying r_y according to this angle, the real hand and the virtual hand keep consistent movement. Although the formula is derived for right-hand operation, Eq. (12) applies to both left-handed and right-handed operation.

2.3

Inquiry-based Interaction Model in Virtual English Language Teaching

2.3.1

Definition of user intent

For a given virtual English lesson, let the current lesson content be ℝ ∈ Cdata, where Cdata denotes the database containing all virtual English content. For a given lesson content ℝ, the user’s current operation intent is defined as: (13) $I_{k} = \{ψ ({w_{1} |}_{t_{1}}^{t_{2}}, {w_{2} |}_{t_{1}}^{t_{2}}, {w_{3} |}_{t_{1}}^{t_{2}}) \mid ℝ\}, \quad I_{k} \in I_{ℝ}$ where ${w_{i} |}_{t_{1}}^{t_{2}}$ denotes the user interaction information of the i-th channel during the time period [t₁,t₂], and I_ℝ denotes the set of intentions that the user is able to execute under the current instructional content. In the exploratory interaction model envisioned in this paper, each user intention I_k receives feedback from the system, and the system feedback r_k = {I_k | k = 1,2,…,N_ℝ} (N_ℝ denoting the number of possible intentions that the user can express under the current instructional content ℝ) serves as an important basis for deciding whether the current instructional content ℝ will be updated. The new instructional content ℝ′ is described in the following form: (14) $(ℝ', I_{ℝ}') = φ (r_{k} \mid I_{k}, ℝ, Cdata)$

As the content is updated, the user’s set of executable intents is also updated. In this way, the system is able to satisfy the user’s need for exploratory interactions, and whether the user expresses an intent to follow the standard flow of the original instructional content ℝ or to perform an exploratory operation, he or she will get a result that is appropriate and satisfies the user’s intent. The crux of the problem is to identify such a function ψ(·) that determines the user’s intent and a function φ(·) that updates the current instructional content and set of intents based on the user’s intent.

2.3.2

Interaction Intent Analysis

The process by which a user expresses a single intention is defined as an operation unit U_ac = {I_k,r_k}, which describes the user’s operation intention I_k and the feedback r_k eventually received. The value of the operation unit U_ac indicates whether the result of the current user operation satisfies the user’s expectation, according to the following relationship: (15) $U_{ac} = \begin{cases} 0, & I_{k} \cap r_{k} < ε \\ 1, & I_{k} \cap r_{k} \geq ε \end{cases}$

ε is the threshold by which the user judges whether the result of an operation meets the expectation. If the expectation is met (U_ac = 1), the user will not perform further unnecessary operations. When U_ac = 0, the user needs to change the way the intention is expressed in order to achieve the expected result. According to the operation intention I_k defined in Eq. (13), the information with which the user expresses an intention comes from three channels. If the user has used only a single channel and obtained U_ac = 0, the user can either strengthen the expression by adding a new channel of information or express the intention through a different channel, as in Eq. (16): (16) $I_{k (n)}' = \begin{cases} I_{k (n = 1)} \cap add ({w_{i} |}_{t_{1}}^{t_{2}}), & n < 3, U_{ac} = 0 \\ update (I_{k (n = 1)}), & n \leq 3, U_{ac} = 0 \end{cases}$ where n denotes the number of channels included in the current user’s intent, and ${w_{i} |}_{t_{1}}^{t_{2}} = \{G_{i}, V_{i}, S_{i}\}$ denotes the one or more channels of information selected by the user [33]. After the user re-expresses the intention, the operation unit U_ac is also updated, which in turn re-characterizes the user’s single intention expression process. This paper follows the principle that every user intent should be processed and receive feedback, and treats valid and invalid intents equally. Therefore, when the value of one of the user’s operation units U_ac remains 0, the user receives an indication that the current operation is invalid, which is a simple compensation for the system’s lack of a priori knowledge and for its limited recognition ability in the face of diverse intention expressions.

2.3.3

Exploratory Interaction Design

In the intent understanding phase, the user’s intent behavior A_i is obtained. In the feasibility analysis phase, this behavior is resolved into one of two categories, A_i ∈ {EPAC, NO_EPAC}, and the category of the behavior is determined first: (17) $A_{i} = \begin{cases} 1, & A_{i} \in EPAC \\ 0, & A_{i} \in NO\_EPAC \end{cases}$

If A_i = 1, then the validity of the current inquiry-based operation needs to be considered, i.e., Value(EPAC): (18) $Value (EPAC) = \begin{cases} 1, & EPAC \cap ℝ \cap Cdata \neq \emptyset \\ 0, & EPAC \cap ℝ \cap Cdata = \emptyset \end{cases}$

Otherwise, A_i = 0 indicates a standard-process operation: teaching continues as planned and the algorithm does not proceed to the next step. After the feasibility analysis, the validity of the user’s behavior has been determined, and the algorithm enters the immediate update phase. Taking Aoe(Seq) as an example, in which the user intends to adjust the teaching order, the algorithm updates the teaching content once the validity of this operation has been verified: (19) $ℝ' = update (ℝ \mid Aoe (Seq))$
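The decision flow of Eqs. (17)-(19) can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions: lesson content and the content database are modeled as collections of step names, an exploratory operation is modeled as the list of steps it refers to, and the update for the Aoe(Seq) case simply moves the requested steps to the front. All names and sample data are hypothetical.

```python
def classify_behavior(action, exploratory_actions):
    """Eq. (17): 1 if the behavior is exploratory (EPAC), 0 for the standard flow."""
    return 1 if action in exploratory_actions else 0

def epac_value(epac_items, lesson, cdata):
    """Eq. (18): 1 if EPAC, the lesson content and Cdata share at least one item."""
    return 1 if set(epac_items) & set(lesson) & set(cdata) else 0

def update_sequence(lesson, requested_order):
    """Eq. (19) for Aoe(Seq): requested steps first, remaining steps in original order."""
    first = [step for step in requested_order if step in lesson]
    rest = [step for step in lesson if step not in first]
    return first + rest

def handle(action, requested_order, lesson, cdata, exploratory_actions):
    if classify_behavior(action, exploratory_actions) == 0:
        return list(lesson)                      # standard flow, content unchanged
    if epac_value(requested_order, lesson, cdata) == 0:
        return list(lesson)                      # invalid exploratory operation
    return update_sequence(lesson, requested_order)

# Hypothetical lesson data
cdata = {"greeting", "listening", "dialogue", "quiz"}
lesson = ["greeting", "listening", "dialogue"]
new_lesson = handle("reorder", ["dialogue"], lesson, cdata, {"reorder"})
```

An invalid or non-exploratory request leaves the lesson untouched, which mirrors the text: only a validated inquiry-based operation triggers the immediate update.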

3

Teacher and student classroom behavior recognition based on multimodal fusion

3.1

Deep RNN-based automatic teacher-student behavior recognition

Deepening a neural network brings clear gains but also many problems. In this study, the deepened network is improved to capture long-range temporal context by designing a temporal layer and a representation layer, which capture the sequential context of the video frames and the representational information of each individual frame, respectively. An RNN can be deepened directly by stacking, but it is then difficult to train the two information streams jointly, and practice also shows that such networks are hard to train. For this reason, a CNN is generally used first to extract features, which are then fed into a shallow RNN for training; such a network, however, is not end-to-end.

The representation layer uses a CNN-based structure to extract features from each input frame individually, while the temporal information is captured by the RNN [34]. As shown in Fig. 1, R denotes the representation unit and T denotes the temporal unit: features are extracted by the representation unit R, and the temporal information is encoded by the temporal unit T. The representation stream is designed as a regular CNN and is given by Eq. (20), where ${o^{'}}_{i, t}$ is the representation feature of layer i at timestamp t, φ_i denotes the parameters of R at layer i, and R is the ReLU(Conv(·)) function. (20) ${o^{'}}_{i, t} = R (o_{i - 1, t}; φ_{i})$

The temporal stream is then represented by Eq. (21), where c_i,t represents the memory state of layer i at timestamp t, ∅_i represents the parameters of T at layer i, and T is the Sigmoid(Conv(·)) function: (21) $c_{i, t} = T (c_{i, t - 1}, o_{i - 1, t}; \emptyset_{i})$ (22) $o_{i, t} = \hat{u} ({o^{'}}_{i, t}, c_{i, t})$

Eq. (22) fuses the two information streams.

In addition, to ease the difficulty of training, the network is generally expected to learn mostly from the representation stream at first and less from the temporal stream, and to learn the temporal information only as the network deepens. A Dropout-like approach is also adopted to reduce the training complexity.
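To make the data flow of Eqs. (20)-(22) concrete, the following NumPy sketch runs a small stack of layers over a sequence of frame features. It is a simplified stand-in, not the network used in this paper: dense matrices replace the convolutions in R and T, the fusion function of Eq. (22) is assumed to be an element-wise product (a gate), and all sizes and weights are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_step(o_below, c_prev, params):
    """One layer i at one timestamp t."""
    o_rep = relu(params["W_r"] @ o_below)                          # Eq. (20), dense stand-in for Conv
    c = sigmoid(params["W_c"] @ c_prev + params["W_o"] @ o_below)  # Eq. (21)
    o = o_rep * c                                                  # Eq. (22), assumed gating fusion
    return o, c

def run_network(frames, layers):
    """frames: (T, D) per-frame features; layers: list of parameter dicts."""
    dim = frames.shape[1]
    memories = [np.zeros(dim) for _ in layers]   # c_{i,0} for every layer
    outputs = []
    for x in frames:                             # iterate over timestamps
        o = x
        for i, params in enumerate(layers):      # iterate over depth
            o, memories[i] = layer_step(o, memories[i], params)
        outputs.append(o)
    return np.stack(outputs)

rng = np.random.default_rng(0)
D, T, depth = 8, 5, 3                            # arbitrary illustrative sizes
layers = [{name: rng.normal(scale=0.3, size=(D, D)) for name in ("W_r", "W_c", "W_o")}
          for _ in range(depth)]
out = run_network(rng.normal(size=(T, D)), layers)
```

Each layer keeps its own memory state across timestamps, so depth and time are handled in the same forward pass, which is what allows such a design to be trained end to end.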

3.2

Classroom Scene Detection Based on Resnet

3.2.1

Speech pre-processing

Speech is a kind of sound that has a certain meaning and connotation through the vibration of human vocal organs. The original signal of speech cannot be utilized directly due to the distortion and mixing phenomena of the vocal organs and the acquisition and modulation equipment, but must go through a series of pre-processing processes such as pre-emphasis, frame splitting, windowing and other operations. Through the pre-processing process, we usually obtain a uniform and smooth signal, which significantly improves the quality of speech. Specifically, it consists of four parts: pre-emphasis, framing, windowing, and endpoint detection:

1)

Pre-emphasis

The speech signal is usually denoted s(n). Glottal excitation and oral-nasal radiation affect the average power of the speech signal, so that the high-frequency portion of the signal (above 800 Hz) attenuates at 6 dB/oct (octave), with components becoming smaller at higher frequencies. It is therefore necessary to boost the attenuated high-frequency portion of the speech signal. Pre-emphasis is generally implemented with a digital filter, and the input-output relationship for speech signal s(n) is given by Eq. (23): (23) $\tilde{s} (n) = s (n) - a s (n - 1)$ where a is the pre-emphasis factor, generally taken as 0.9375.
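Eq. (23) is a first-order difference filter and can be written in a few lines of NumPy. The sketch below assumes the first sample is passed through unchanged, since s(-1) is not defined; the function name is illustrative.

```python
import numpy as np

def pre_emphasis(signal, a=0.9375):
    """Eq. (23): s~(n) = s(n) - a * s(n - 1); the first sample is kept as is."""
    s = np.asarray(signal, dtype=float)
    if s.size == 0:
        return s
    return np.concatenate(([s[0]], s[1:] - a * s[:-1]))

# A constant (zero-frequency) signal is strongly attenuated after the first sample
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

A constant input is reduced to 1 - a = 0.0625 per sample, while rapidly alternating input is amplified, which is the high-frequency boost described above.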

2)

Framing

Because the speech signal changes over time, it is time-varying; within a short interval (10~30 ms), however, it can be regarded as a quasi-steady-state process (short-time stationarity). “Short-time analysis” is the core of speech analysis technology, and all related analysis is built on this short-time basis. The speech signal is analyzed in short segments to extract the corresponding characteristic parameters; each segment is a “frame”, and the frame length is generally 10~30 ms. In this way, the speech signal is converted into a time sequence of frames.
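Framing can be sketched as slicing the signal into fixed-length segments. The frame length of 25 ms lies within the 10~30 ms range given above; the 10 ms hop between frame starts (which makes consecutive frames overlap) is an assumption of this example, as are the function name and the 16 kHz sampling rate.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into frames of frame_ms, starting every hop_ms."""
    s = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if len(s) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(s) - frame_len) // hop
    # Row r holds samples r*hop ... r*hop + frame_len - 1
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return s[idx]

# One second of a ramp signal at an assumed 16 kHz sampling rate
frames = frame_signal(np.arange(16000), sample_rate=16000)
```

Trailing samples that do not fill a complete frame are dropped in this sketch; padding the last frame with zeros would be an equally reasonable choice.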

3)

Windowing

The signal is first divided into frames and then windowed. The window operation is usually applied around sampling point n of the speech waveform, emphasizing the waveform near that point while attenuating the rest. In short-time analysis of speech signals, windowing thus amounts to performing an operation or transformation in the vicinity of the sampling point. In practice, the rectangular window, the Hamming window and the Hanning window are the three most commonly used types; their definitions for a window of length N are given in Eqs. (24)-(26):

(1)

Rectangular window (24) $f (n) = \begin{cases} 1, & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise} \end{cases}$

(2)

Hamming window (25) $f (n) = \begin{cases} 0.54 - 0.46 \cos (2 π n / (N - 1)), & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise} \end{cases}$

(3)

Hanning window (26) $f (n) = \begin{cases} 0.5 [1 - \cos (2 π n / (N - 1))], & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise} \end{cases}$
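The three windows of Eqs. (24)-(26) can be written directly from their definitions, assuming the cosine argument is 2πn/(N − 1). The sketch below uses NumPy, whose built-in `np.hamming` and `np.hanning` follow the same formulas; the function names here are illustrative.

```python
import numpy as np

def rectangular_window(N):
    """Eq. (24): all ones on 0 <= n <= N - 1."""
    return np.ones(N)

def hamming_window(N):
    """Eq. (25): 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

def hanning_window(N):
    """Eq. (26): 0.5 [1 - cos(2 pi n / (N - 1))]."""
    n = np.arange(N)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (N - 1)))

# Apply a Hamming window to one frame of 400 samples (frame length is an example value)
frame = np.ones(400)
windowed = frame * hamming_window(400)
```

The Hamming and Hanning windows taper the frame edges, which reduces the spectral leakage that the abrupt edges of the rectangular window would cause.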

3.2.2

Speech feature extraction

Mel-frequency cepstral coefficients (MFCC) extract features by modeling the response and perception of the human auditory system to sounds of different frequencies; the way the human ear discriminates sound frequencies is similar to taking the logarithm of the signal.

The MFCC algorithm process is as follows: 1)

Fast Fourier Transform (FFT): (27) $X [k] = \sum_{n = 0}^{N - 1} x [n] e^{- j \frac{2 π}{N} n k}, \quad k = 0, 1, 2, \dots, N - 1$

x[n] (n = 0,1,2,…,N–1) represents one frame of the discrete speech sequence obtained by frame-by-frame sampling, where N is the frame length, X[k] is the complex N-point spectrum, and |X[k]| is the modulus of X[k], i.e., its amplitude spectrum.

2)

Conversion of the actual frequency scale to the Mel frequency scale: (28) $Mel (f) = 2595 \lg (1 + f / 700)$
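The frequency conversion of Eq. (28) and its inverse can be sketched as follows, using the commonly cited constant 2595 and a base-10 logarithm. The inverse function is not given in the text; it is obtained here by solving Eq. (28) for f.

```python
import math

def hz_to_mel(f):
    """Eq. (28): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. (28), solved for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

mel_1k = hz_to_mel(1000.0)   # close to 1000 mel by construction of the scale
```

The scale is roughly linear below about 1 kHz and logarithmic above it, so equal steps in Mel correspond to increasingly wide steps in Hz.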

3)

Construct a triangular filter bank and compute the output of each filter: (29) $F (l) = \sum_{k = f_{o} (l)}^{f_{h} (l)} w_{l} (k) | X [k] |, \quad l = 1, 2, \dots, L$

The meanings of w_l(k), f_o(l), f_c(l) and f_h(l) are shown in equations (30)-(31) below: (30) $w_{l} (k) = \begin{cases} \frac{k - f_{o} (l)}{f_{c} (l) - f_{o} (l)}, & f_{o} (l) \leq k \leq f_{c} (l) \\ \frac{f_{h} (l) - k}{f_{h} (l) - f_{c} (l)}, & f_{c} (l) \leq k \leq f_{h} (l) \end{cases}$ (31) $f_{o} (l) = \frac{o (l)}{f_{s} / N}, \quad f_{c} (l) = \frac{c (l)}{f_{s} / N}, \quad f_{h} (l) = \frac{h (l)}{f_{s} / N}$

w_l(k) denotes the coefficients of the l-th filter, f_s is the sampling rate, L is the number of filters, o(l), c(l), and h(l) denote the lower limit frequency, center frequency, and upper limit frequency of the l-th filter on the real frequency axis, and F(l) denotes the output of the filter.

4)

The logarithm of the filter outputs is taken, and a Discrete Cosine Transform (DCT) is then applied to obtain the MFCC features, as shown in Eq. (32): (32) $M (i) = \sqrt{\frac{2}{N}} \sum_{l = 1}^{L} \log F (l) \cos [(l - \frac{1}{2}) \frac{i π}{L}], \quad i = 1, 2, \dots, Q$ where Q is the number of MFCC coefficients retained.
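The four steps above can be combined into a compact NumPy sketch for a single frame. It is an illustration under stated assumptions rather than the implementation used in this paper: the filter edges are spaced evenly on the Mel scale between 0 Hz and half the sampling rate, the Mel constant is taken as 2595, the number of filters (26) and coefficients (13) are common example values, and the scale factor follows Eq. (32) with N taken as the frame length.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters w_l(k) of Eq. (30) over FFT bins 0 .. n_fft // 2."""
    # Assumed layout: n_filters + 2 edge points, evenly spaced on the Mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    # Eq. (31): frequency in Hz divided by the bin width f_s / N
    bins = mel_to_hz(mel_points) / (sample_rate / n_fft)
    k = np.arange(n_fft // 2 + 1)
    bank = np.zeros((n_filters, k.size))
    for l in range(n_filters):
        low, center, high = bins[l], bins[l + 1], bins[l + 2]
        rising = (k - low) / (center - low)
        falling = (high - k) / (high - center)
        bank[l] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCC of one windowed frame, following Eqs. (27), (29) and (32)."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    spectrum = np.abs(np.fft.rfft(frame))                            # |X[k]|, Eq. (27)
    energies = mel_filterbank(n_filters, N, sample_rate) @ spectrum  # F(l), Eq. (29)
    log_energies = np.log(np.maximum(energies, 1e-10))               # guard against log(0)
    l = np.arange(1, n_filters + 1)
    i = np.arange(1, n_coeffs + 1)[:, None]
    basis = np.cos((l - 0.5) * i * np.pi / n_filters)                # DCT basis of Eq. (32)
    return np.sqrt(2.0 / N) * (basis @ log_energies)

# Example: a 440 Hz tone, 400 samples at an assumed 16 kHz sampling rate
t = np.arange(400) / 16000.0
coeffs = mfcc_frame(np.hamming(400) * np.sin(2.0 * np.pi * 440.0 * t), 16000)
```

Only the non-negative frequency bins are used (`np.fft.rfft`), since the amplitude spectrum of a real signal is symmetric and the remaining bins carry no additional information.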

4

Analysis of the Effectiveness of English Education Model Innovation in Colleges and Universities

4.1

Analysis of the interactive effect of English education in colleges and universities

4.1.1

Classroom interaction

The model constructed in this paper is applied to English education in colleges and universities. Class A is selected as the control class and class B as the experimental class; teacher Y teaches class A and teacher X teaches class B. Empirical investigations are conducted on the students of classes A and B respectively. Class A uses the traditional English teaching mode, while class B uses the virtual interactive English teaching mode. This subsection verifies the effectiveness of the virtual interactive teaching mode by comparing the two English teaching methods.

Ms. Y’s class has 25 students, including 10 boys and 15 girls. Each class lasts 90 minutes and follows a regular flow. The class opens with a presentation: in each lesson, one student presents on a topic, after which the teacher comments on the presentation. Ms. Y then starts the formal lesson. For each new paragraph, she asks a student to read the text aloud and then leads the students through a close reading covering vocabulary, difficult sentences, the meaning of the paragraph, the purpose of the writing, and so on. After explaining vocabulary and expressions, she also asks the students to translate sentences, to help them consolidate what they have learned and to make sure they have grasped the meanings and usages.

Teacher X’s class has 27 students, including 24 girls and 3 boys. Teacher X uses the Virtual Interactive English Teaching Model (VIETM), an online classroom model built from educational scenarios. The model also has rich interactive features, such as board, desktop sharing, grouping, raising hands, answerer, timer and small blackboard, which can basically meet the various needs of teachers’ English teaching.

Table 1 shows teacher Y’s classroom interaction. The total score of teacher-student interaction in teacher Y’s online classroom is 4.765, which is a medium level, and the SD is 1.355, indicating that the interaction status is relatively average. The total score in the domain of emotional support is 4.555, also a medium level. Teacher Y’s classroom is dominated by the teacher’s lecture, with less interaction between teacher and students and among students. The main forms of teacher-student interaction were asking students to read the text aloud and asking students to translate sentences. When students were thinking, Ms. Y gave them enough waiting time. When students finished a task, Ms. Y gave affirmation and praise, and when students could not answer, she patiently encouraged them.

Table 1.

Teacher Y’s classroom interaction

		M	SD	Minimum value	Maximum value
Emotional support	Positive atmosphere	3.452	0.769	2	4
	Teacher sensitivity	5.534	0.526	4	6
	Respect students’ opinions	4.678	0.826	5	6
	Mean	4.554667	1.125
Classroom organization	Behavior management	3.793	0.469	3	5
	Productivity	6.285	0.675	6	7
	Negative atmosphere	6.402	0.498	6	7
	Mean	5.493333	1.385
Teaching support	Teaching and learning arrangement	5.189	0.398	5	6
	Content understanding	6.632	0.463	6	7
	Analysis and exploration	3.485	0.632	2	4
	Feedback quality	2.713	0.749	2	4
	Instructional dialogue	3.221	0.832	2	4
	Mean	4.248	1.536
Student participation	Student participation	2.415	0.539	2	3
Student participation	Total mean	4.765333	1.355
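The domain means in Table 1 can be reproduced from the item scores with a few lines of Python. The check below also suggests that the reported total mean (4.765333) coincides with the average of the three domain means for emotional support, classroom organization and teaching support, with student participation left out; this is an observation from the numbers, not a calculation rule stated in the text.

```python
from statistics import mean

# Item means (column M) copied from Table 1
scores = {
    "Emotional support": [3.452, 5.534, 4.678],
    "Classroom organization": [3.793, 6.285, 6.402],
    "Teaching support": [5.189, 6.632, 3.485, 2.713, 3.221],
}

# Mean of the item scores within each domain
domain_means = {name: mean(values) for name, values in scores.items()}

# Average of the three domain means (student participation excluded)
total_mean = mean(list(domain_means.values()))
```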

During lectures, Ms. Y also strives to make sure that the students understand; the language of instruction is mainly English, supplemented by Chinese when necessary. After explaining a vocabulary word or sentence pattern, Ms. Y would have students translate a sentence for practice. In the interviews, Ms. Y was confident in her teaching ability. Her years of teaching experience and understanding of the students allowed her to accurately grasp the key points of the teaching and promote students’ understanding of the content. The students also felt that they were able to learn something in Ms. Y’s class, and classroom satisfaction was high.

4.1.2

Classroom Climate

In this study, an independent samples t-test was conducted on 41 questionnaires using SPSS software to analyze whether there was a significant difference between the classroom climate of teachers X and Y on each dimension and factor. The data met the requirements of the independent samples t-test, as shown by the results of the normality and homogeneity-of-variance tests. The results of the independent samples t-test are shown in Table 2.

Table 2.

Independent samples t-test

Dimension		Median	Mean	SD	t	P	MD	Cohen’s Value
Learning behavior		3.763	3.713	0.163	2.163	0.034**	0.183	0.765
Interpersonal support		4.632	4.162	0.182	3.523	0.004***	0.245	1.215
Situational support		4.096	4.026	0.163	0.189	0.896	0.254	0.049
Learning input		3.796	3.736	0.384	3.196	0.025**	0.398	1.652
Dimension	Factor name	Median	Mean	SD	t	P	MD	Cohen’s Value
Learning behavior	Contact reality	3.648	3.363	0.385	0.958	0.385	0.136	0.396
	Task orientation	4.241	4.128	0.248	-1.758	0.053*	0.175	0.465
	Active participation	3.769	3.685	0.475	0.148	0.999	0.093	0.096
	Personalized learning	3.631	3.689	0.462	-0.783	0.436	0.048	0.245
	Exploratory learning	3.685	3.866	0.315	0.189	0.815	0.026	0.056
	Cooperative learning	3.752	3.625	0.558	6.289	0.000***	0.866	2.655
Interpersonal support	Teacher support	4.298	4.286	0.248	1.289	0.245	0.169	0.483
	Student equality	4.425	4.396	0.485	4.068	0.000***	0.455	1.341
	Student rapport	4	3.348	0.268	0.285	0.843	0.059	0.045
Situational support	Multimedia application	4	3.985	0.385	-4.678	0.000***	0.395	1.593
	Activity innovation	3.7	4.687	0.307	1.732	0.066*	0.189	0.685
	Classroom management	4.696	4.325	0.317	2.963	0.005***	0.269	1.095
Learning input	Study interest	4	3.978	0.285	4.745	0.000***	0.618	1.123
	Self-efficacy	3.258	2.915	0.495	1.267	0.278	0.298	0.388
	Harvest evaluation	4.169	4.065	0.588	2.388	0.059**	0.284	0.845

The independent samples t-test shows significant differences between the two teachers on the dimensions of learning behavior (p = 0.034), learning input (p = 0.025), and interpersonal support (p = 0.004), and no significant difference on situational support (p = 0.896). At the factor level, the two teachers differ significantly on student equality, cooperative learning, and multimedia application, each with a reported p-value of 0.000 (i.e., p < 0.001). The differences on most other factors are relatively small. Taken together, the questionnaire and classroom observation results indicate that the overall classroom climate of teacher X was better than that of teacher Y.

4.1.3

Analysis of teaching tendencies

To explore more deeply the differences in teaching interaction behaviors between college English smart classrooms and traditional classrooms, the researcher conducted a paired-sample t-test, and Table 3 shows the results for teaching behaviors. For the mean of teacher speech, although the two types of classrooms differ somewhat, the two-sided significance p = 0.628 > 0.05 implies that there is no significant difference between them. For the mean of student speech, however, the two types of classrooms differ significantly, with a two-sided significance of p = 0.036 < 0.05, which suggests a significant increase in student speech in the college English smart classroom. This result points to a positive effect of the smart classroom on students’ engagement and expression, and to the educational value of the smart classroom reform.

Table 3.

Paired-sample t-test results for teaching behaviors

Teaching behavior	Categories	Mean	Correlation coefficient	Sig.	Sig. (two-tailed)
Teacher speech	Teacher	0.458	-0.386	0.385	0.628
Teacher speech	Student	0.426	-0.386	0.385	0.628
Student speech	Teacher	0.327	0.825	0.028	0.036
Student speech	Student	0.386	0.825	0.028	0.036
Indirect influence/direct impact	Teacher	1.495	0.036	0.726	0.008
Indirect influence/direct impact	Student	2.536	0.036	0.726	0.008
Perceptual input and language intake	Teacher	0.489	0.278	0.463	0.498
Perceptual input and language intake	Student	0.432	0.278	0.463	0.498
Integration and output	Teacher	0.289	-0.183	0.452	0.385
Integration and output	Student	0.291	-0.183	0.452	0.385

Teaching tendency is categorized into two types: indirect guidance and direct control. It is calculated as the ratio of indirect influence to direct influence in the teacher’s speech: if the ratio is greater than 1, the tendency is of the indirect guidance type; otherwise, it is of the direct control type. Direct influence consists of the teacher’s explanation of knowledge points and direct feedback on students’ attitudes, while indirect influence is the teacher’s indirect feedback on students’ attitudes and responses in the form of questions and answers, adoption, and so on. Figure 2 shows the ratios for classroom teaching tendency and emotional climate, in which the first two curves are the ratio of indirect influence to direct influence and the last two curves are the ratio of positive influence to negative influence. The ratio of indirect influence to direct influence exceeds 1 in 90% of the lesson cases in both the smart and traditional college English classrooms. In the smart classroom environment, however, this ratio is higher still, with a maximum value of 5.214. This indicates that college English teachers in both the smart and traditional classrooms tend to guide the class and the students indirectly, i.e., to teach through encouragement, questions and answers, dialogues, situation design, and similar approaches. Judging by the change in the mean value of this ratio, teachers in the smart classroom exercised looser control over the class and took on the role of guide rather than controller. This result highlights the role of smart classroom reform in shaping teaching patterns and pedagogical dispositions.
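The classification rule described above can be sketched in a few lines. This is a minimal illustration, not the authors’ implementation: the function name and the example counts are hypothetical, and treating a ratio of exactly 1 as direct control is an assumption based on the "greater than 1" wording.

```python
from typing import Tuple


def teaching_tendency(indirect: float, direct: float) -> Tuple[float, str]:
    """Return the indirect/direct ratio and the resulting tendency label.

    `indirect` and `direct` are amounts of teacher speech coded as indirect
    and direct influence (for example, counts of coded time intervals).
    """
    if direct <= 0:
        raise ValueError("direct influence must be positive to form a ratio")
    ratio = indirect / direct
    # Assumption: a ratio of exactly 1 is treated as direct control,
    # because the rule only names ratios greater than 1 as indirect guidance.
    label = "indirect guidance" if ratio > 1 else "direct control"
    return ratio, label


# Hypothetical lesson: 30 intervals of indirect and 12 of direct influence.
ratio, label = teaching_tendency(30, 12)
print(f"ratio = {ratio:.3f}, tendency = {label}")
```

A ratio such as the maximum of 5.214 reported for the smart classroom would fall well inside the indirect guidance type under this rule.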

4.2

Analysis of English Language Teaching Achievement

4.2.1

Analysis of the results of the reading training experiment

This experiment was carried out in two second-year teaching classes at a local college. The subjects were all the students in Class 2 (39 students) and Class 7 (41 students). Class 7 was the control class, which used the traditional English teaching method combined with the original multimedia, i.e., only its basic functions such as demonstration and audio playback. Class 2 was the experimental class, which used interactive multimedia-assisted learning. The two classes have the same level of English and both belong to the middle tier of the second year. The students’ individual circumstances, English education background, and English performance are basically the same. Both classes are lively, active, and inquisitive, and their students are relatively quick to absorb knowledge and like to learn with multimedia equipment. In the experiment, both classes were taught by the same English teacher, who was familiar with multimedia-assisted English teaching and had new insights into classroom model design. Both classes used the same teaching content, the Oxford English textbook, and studied the whole of the first volume used in the second year over the same period, a total of 24 teaching weeks, completing the study and review of 8 units. The experimental period was from March 2022 to July 2022.

Multimedia-assisted teaching uses computers to integrate video, sound, images, text, and other information elements in a reasonable way, according to the teaching process, teaching content, teaching objectives, and the learners’ learning characteristics and knowledge structure, so as to form an efficient mode of teaching and achieve satisfactory teaching results. The diversity, integration, and interactivity of multimedia are especially evident in supporting English reading performance. Multimedia-assisted teaching provides foreign language teachers with rich teaching resources and advanced teaching concepts and methods, breaks through the closed mode of traditional teaching, and allows students to participate more actively in the classroom, acquire knowledge more vividly and intuitively, and improve their performance more markedly.

Students at different performance levels were selected for individual analysis. Figure 3 shows the reading performance of some students, with Figure 3(a) for the experimental class and Figure 3(b) for the control class. The data in Figure 3(a) show that interactive multimedia-assisted English teaching can markedly improve the performance of students at all levels. Academically excellent students were able to strive for further excellence. Intermediate students showed a marked improvement, and those close to the excellent level rapidly entered the ranks of excellence; the grades of Student 3 were 79, 78, and 83 in that order. Interactive multimedia-assisted English teaching also had a notable effect among struggling students. Students on the pass borderline rapidly improved their grades to a pass or above, such as Student 2, who improved from 58 to 71. For particularly weak students, interactive multimedia greatly stimulated their interest in learning, so their grades rose sharply. Students like the novel format and the independent learning environment, which allow them to truly become masters of their own learning.

Students at all levels in the control class were analyzed in the same way, as shown in Figure 3(b). The results show that, without the aid of interactive multimedia, English performance follows no clear pattern and depends more on individual differences among students. Some students were in a good state of learning and improved their performance, while some intermediate students had poor self-discipline and their performance regressed noticeably. The performance of struggling students clearly stagnated, and some even tended to fall behind and fail to keep up with the pace of learning. Because the learning content is difficult and boring, students lose interest and motivation to learn, which leads to unsatisfactory results.

4.2.2

Listening performance

Based on the listening pre-test and post-test scores of the students in the two classes, and on the pre-test and post-test scores for different question types, this section shows that multimodal listening teaching has a positive effect on students’ listening level and performance, and that it affects, to varying degrees, students’ ability to answer four categories of questions: listening to recordings to select pictures, answering questions, judging whether statements are correct, and filling in the blanks.

Before the beginning of the teaching experiment, the author conducted a listening test with the students in both classes. After the test, the author marked and collated the test papers for both classes and entered the results into SPSS 23.0 for data analysis. This step makes it possible to observe whether there is any difference in the listening level of the students in the two classes, and it also provides a baseline for comparison with the post-test scores to verify the teaching effect of multimodal listening teaching.

(1)

Pre-test analysis

Table 4 shows the descriptive statistics and independent samples t-test for the total pre-test listening scores. The mean listening scores of the experimental group and the control group before the experiment were 10.875 and 11.855, respectively. The difference between the two means is 0.98, less than 1 point, so there is no large gap, although the average score of the experimental group is slightly lower than that of the control group. The standard deviations of the scores of the experimental group and the control group were 4.685 and 4.397, respectively, indicating that the dispersion of the listening scores in the two classes was basically the same.

Table 4.

Statistical analysis of pre-test listening scores

/	Class	N	Average score	Standard deviation	Standard error of the mean
	Experimental group	39	10.875	4.685	0.632
	Control group	41	11.855	4.397	0.695
	Levene’s test for equality of variances
	F	Sig.	t	df
Equal variances assumed	0.156	0.648	-0.469	87
Equal variances not assumed			-0.436	86.548
	t-test for equality of means
	Sig. (two-tailed)	Mean difference	Standard error	95% CI lower limit	95% CI upper limit
Equal variances assumed	0.626	-0.436	0.955	-2.398	1.468
Equal variances not assumed	0.626	-0.436	0.954	-2.397	1.467

Next, consider the results of Levene’s test for homogeneity of variance on the pre-test scores of the experimental group and the control group. The significance value is P = 0.648. A P-value greater than 0.05 means that the variances of the two groups can be treated as equal, which satisfies the homogeneity-of-variance condition of the independent samples t-test, so the experimental group and the control group can be compared with the t-test and the results are valid. According to the t-test, P = 0.626 > 0.05, indicating that there is no significant difference between the experimental group and the control group in pre-test listening performance. The experimental group’s performance is slightly lower than that of the control group (mean difference MD = -0.436). The two classes are therefore similar in English listening level and can be used as parallel classes to carry out the teaching experiment and verify the teaching effect of the interactive innovative teaching mode.
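For readers who want to see how an equal-variance independent samples t statistic is formed, the following sketch computes it from summary statistics alone. The function name is hypothetical, and the inputs are the rounded group sizes, means, and standard deviations from the descriptive rows of Table 4. It illustrates the calculation rather than reproducing the SPSS run, so its output need not match the t and df values reported in the table.

```python
import math
from typing import Tuple


def pooled_t_from_summary(
    n1: int, mean1: float, sd1: float,
    n2: int, mean2: float, sd2: float,
) -> Tuple[float, int]:
    """Equal-variance (pooled) t statistic and degrees of freedom."""
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two group means.
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df


# Descriptive statistics from Table 4 (experimental group first).
t_stat, df = pooled_t_from_summary(39, 10.875, 4.685, 41, 11.855, 4.397)
print(f"t = {t_stat:.3f}, df = {df}")
```

Turning the statistic into a P-value requires the t distribution, which a statistics package such as SPSS provides directly.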

Table 5 shows the scores for each question type in the listening pre-test. Of the four question types, listening to the recording and selecting pictures has the highest degree of completion, and its scores are relatively good. This type of question requires students to select the picture that matches the content of the recording. The picture modality presented in the test questions is more likely to attract students’ interest than the text modality, and it makes the difference between the options more intuitive and easier to distinguish, which reduces the difficulty of the questions, so that students with a lower listening level can also complete them well. For this type of question, the average score of the experimental group was 3.458, while the average score of the control group was 3.648, which is 0.19 points higher.

Table 5.

Pre-test scores by question type

/	Class	N	Average	Standard deviation	Standard error mean
Image selection	Experimental group	39	3.458	1.185	0.198
Image selection	Control group	41	3.648	1.196	0.175
Answer questions	Experimental group	39	2.745	1.326	0.269
Answer questions	Control group	41	2.863	1.287	0.178
Judgment right and wrong	Experimental group	39	2.196	1.396	0.169
Judgment right and wrong	Control group	41	2.398	1.278	0.187
Fill in	Experimental group	39	2.896	1.526	0.297
Fill in	Control group	41	2.689	1.597	0.229

(2)

Post-test analysis

To verify the changes in the listening scores of the two groups of students, a listening post-test was conducted one week after the end of the teaching experiment. Descriptive statistics and an independent samples t-test were carried out, combining the listening pre-test and post-test scores of the experimental group and the control group, to observe the changes in the listening performance of the two groups after the experiment.

The descriptive statistics and independent samples t-test for the listening post-test are shown in Table 6. The average post-test score of the experimental group is 12.786, which is 1.421 higher than that of the control group. This reverses the pre-test situation, in which the experimental group’s average was lower than that of the control group, and reflects an overall improvement in its scores. The standard deviations of the students’ scores in the two groups were 3.256 and 4.266, respectively, indicating that the dispersion of the post-test listening scores in the two groups was basically the same. The independent samples t-test shows that the two-tailed significance of the post-test scores of the two groups is 0.041 < 0.05, so there is a significant difference between the listening scores of the two groups. Accordingly, it can be concluded that interactive English listening teaching can help improve students’ listening performance.

Table 6.

Statistical analysis of post-test listening scores

/	Class	N	Average score	Standard deviation	Standard error of the mean
	Experimental group	39	12.786	3.256	0.596
	Control group	41	11.365	4.266	0.631
	Levene’s test for equality of variances
	F	Sig.	t	df
Equal variances assumed	2.391	0.185	-2.163	87
Equal variances not assumed			-2.695	86.453
	t-test for equality of means
	Sig. (two-tailed)	Mean difference	Standard error	95% CI lower limit	95% CI upper limit
Equal variances assumed	0.041	-1.838	0.848	-3.995	-0.169
Equal variances not assumed	0.041	-1.838	0.848	-3.995	-0.162

On the basis of analyzing the total scores of the posttests of the two groups of students, the scores of each question type were analyzed to observe the specific effects of the teaching experiment on the scores of each question type of the two groups of students.

Table 7 shows the descriptive statistics of the scores for each question type in the post-test. The differences between the average scores of the experimental group and the control group on the four question types are 0.359, 0.329, 0.333, and 0.394, respectively. After the teaching experiment, the scores of the experimental group changed for every question type, and its average on each question type was higher than that of the control group.

Table 7.

Descriptive statistics of post-test scores by question type

/	Class	N	Average	Standard deviation	Standard error mean
Image selection	Experimental group	39	3.955	0.915	0.169
Image selection	Control group	41	3.596	1.165	0.148
Answer questions	Experimental group	39	3.395	1.069	0.153
Answer questions	Control group	41	3.066	1.023	0.157
Judgment right and wrong	Experimental group	39	2.626	1.169	0.186
Judgment right and wrong	Control group	41	2.293	1.399	0.193
Fill in	Experimental group	39	3.348	1.299	0.185
Fill in	Control group	41	2.954	1.469	0.269
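The per-question-type gaps between the two groups can be checked directly from the group means in Tables 5 and 7. The sketch below is only a worked piece of arithmetic with hypothetical variable and function names. Each gap is the experimental group mean minus the control group mean, rounded to three decimals.

```python
from typing import Dict, Tuple

# (experimental mean, control mean) per question type, from Table 5.
pre_test: Dict[str, Tuple[float, float]] = {
    "Image selection": (3.458, 3.648),
    "Answer questions": (2.745, 2.863),
    "Judgment right and wrong": (2.196, 2.398),
    "Fill in": (2.896, 2.689),
}

# (experimental mean, control mean) per question type, from Table 7.
post_test: Dict[str, Tuple[float, float]] = {
    "Image selection": (3.955, 3.596),
    "Answer questions": (3.395, 3.066),
    "Judgment right and wrong": (2.626, 2.293),
    "Fill in": (3.348, 2.954),
}


def group_gaps(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Experimental minus control mean for each question type."""
    return {
        question: round(experimental - control, 3)
        for question, (experimental, control) in table.items()
    }


pre_gaps = group_gaps(pre_test)
post_gaps = group_gaps(post_test)
for question in post_test:
    print(f"{question}: pre {pre_gaps[question]:+.3f}, post {post_gaps[question]:+.3f}")
```

A positive gap means the experimental group scored higher than the control group on that question type.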

5

Conclusion

In this paper, we construct a model for tracking gesture interactions based on particle filtering, fusing two features, grayscale and binary, to calculate similarity. Kinect is used to obtain hand information, and a position correction algorithm is proposed. The user’s intention is defined and the interaction intention is analyzed to design an inquiry-based interactive English classroom. A deep RNN algorithm is combined to recognize the interactions between teachers and students.

The interaction model is applied to English education in colleges and universities to analyze the interaction scores of the subject teachers’ online classrooms. Teacher Y’s total online classroom teacher-student interaction score is 4.765, with an SD of 1.355, which is a medium level. There were significant differences between the two subject teachers in three dimensions of classroom climate: learning behavior, learning input, and interpersonal support, with p-values of 0.034, 0.025, and 0.004, respectively.

The students’ English reading scores were analyzed. After the use of interactive multimedia-assisted English teaching, the scores of students on the pass borderline improved rapidly, for example from 58 to 71 over three tests. As for English listening, the average post-test score of the experimental group was 12.786, which was 1.421 points higher than the average score of the control group. The standard deviations were 3.256 and 4.266, showing basically the same degree of dispersion, and the two-tailed significance was 0.041 < 0.05, so teaching in the interactive mode helps to improve students’ listening scores.

Model Innovation and Practice of English Education in Colleges and Universities in the Context of Internet

Nan Cheng

Qiongyin Tu

Published online: 17 Mar 2025

Received: 29 Oct 2024

Accepted: 29 Jan 2025

DOI: https://doi.org/10.2478/amns-2025-0263

Keywords: Kinect, Bhattacharyya distance, Deep RNN, Resnet technology, Interactive English teaching

© 2025 Nan Cheng et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
