A Study of English Teachers’ Classroom Teaching Behavior Based on Deep Learning

Teaching behavior in the broad sense includes the teacher’s “teaching” and the student’s “learning” behavior, while teaching behavior in the narrow sense refers to the teacher’s “teaching” behavior. Classroom teaching behaviors are all the observable outward behaviors taken by teachers in the classroom to accomplish the teaching objectives. Classroom teaching behaviors of English teachers mainly refer to the various behaviors directly directed to the teaching content that English teachers take in order to make the teaching successfully implemented in the English classroom environment [1–2]. This behavior is due to individual differences in English educational philosophy, teaching style, professional knowledge and skills, English classroom teaching wit, etc., and shows different behavioral approaches in specific English classroom teaching situations. The evaluation of English teachers’ classroom teaching behavior is a value judgment of English teachers’ classroom teaching behavior for the purpose of improving and enhancing English teachers’ teaching ability and classroom teaching effect [3–5].

With the rapid development of education informatization, teacher behavior recognition and analysis has become a popular research direction in the field of education. Traditional teacher behavior recognition and analysis relies on manual observation and subjective judgment, which leads to low recognition accuracy and cumbersome operation. A deep learning-based teacher behavior recognition and analysis system, by using big data and neural network technology, can identify and analyze teachers’ behavior effectively and improve both the recognition accuracy and the degree of automation. Moreover, it can monitor and analyze teachers’ behaviors in real time and provide teachers with a reference basis and suggestions for teaching improvement after class. At the same time, it can also provide strong support for educational research and promote the development of educational informatization [6–9].

Literature [10] indicated the wide application of deep learning technology and explored the feasibility of deep learning technology in classroom teaching behavior analysis based on the characteristics of deep learning technology from classroom teaching behavior analysis. Literature [11] proposed a method for teacher behavior recognition in video teaching scenes. Based on the teaching model of “teacher set”, the recognition and extraction algorithms are discussed to recognize the teachers in the teaching video, and the improved 3D BP-TBR is mentioned to recognize the types of teacher behaviors, and the validity of the performance of 3D BP-TBR is verified through experiments, and the proposed overall method is conducive to improving the accuracy rate of recognizing the teacher behaviors. Literature [12] emphasizes the important role of teachers in regulating the educational process and student learning outcomes, among others. Teachers’ concern and praise and the impact they have on student engagement in the English classroom are examined and some insights are presented to clarify student and teacher practices. Literature [13] builds a classroom teaching behavior analysis and evaluation system with deep learning face recognition technology and unfolds the analysis of classroom behavior in terms of various student behaviors, and the results indicate that deep learning face recognition technology is able to identify students’ classroom behaviors effectively, which helps in teaching management and implementation. Literature [14] explored the effectiveness of the flipped classroom teaching model of deep learning. Applying this teaching model with application teaching and launching teaching experiments, it was concluded that the implementation of environmental protection education based on English teaching is of great significance. 
Literature [15] explored the relationship between English teachers’ interpersonal behavior and students’ English fluency, and the English “Teacher Interaction Questionnaire” was used to investigate. The results indicated that student achievement was negatively related to teacher uncertainty, positively related to the degree of cooperation between teachers and students, and that there were differences in students’ preferences for teacher behavior.

Literature [16] constructed the classroom learning behavior dataset ActRec-Classroom based on classroom behaviors such as listening to lectures, raising hands, etc., and proposed a framework for classroom learning behavior analysis system with the help of CNN, conducted experiments on human body detection, extracting human body key points and designing action recognition classifiers, and the results revealed that the proposed system meets the needs of behavior analysis under classroom teaching. Literature [17] created an integrated practice process for classroom behavior analysis under AI based on human posture assessment, facial expression recognition and other technologies, including modules for acquisition and analysis, feedback and improvement, aiming at guiding teachers to improve their behavior, teaching quality and learning effectiveness. Literature [18] takes multiple data sources for cross-comparison and uses the mature random forest algorithm and correction matrix in the data analysis of classroom teaching behavior, and the analysis results emphasize that deep learning can accurately feed back the data of classroom teaching phenomena and teaching activities, which can help to optimize the classroom management and promote teaching reform. Literature [19] identified predictors of Spanish English learners’ willingness to communicate in a foreign language. Through multiple regression analysis, it was concluded that foreign language classroom anxiety was the most significant negative predictor affecting foreign language learning, while positive predictors were found in the enjoyment of the foreign language and the frequency of the teacher’s use of the foreign language. Literature [20] discusses the predictability of foreign language teachers’ self-efficacy on motivational teaching behavior. 
A questionnaire survey was conducted on 112 English teachers and the results were analyzed by descriptive statistics and multiple regression analysis, which mentioned that English teachers’ efficacy for classroom management and student engagement was much lower than self-efficacy for teaching strategies, proving the causal relationship between self-efficacy and motivational strategies. Literature [21] reviewed the literature on the concept of teacher beliefs and studies exploring teacher beliefs using various methods and examined the relationship between teacher beliefs and classroom practices and the factors influencing teacher beliefs in teaching professional development.

This paper explores the classroom teaching behaviors of English teachers by constructing an English teacher target detection and tracking model, as well as a classroom teaching behavior recognition model. First, this paper utilizes web crawler technology to obtain video data from English classroom teaching. Then, a target detection algorithm based on cross-stage local network and pyramid attention network is utilized to detect English teachers’ targets in classroom teaching videos in real-time. Then, the target tracking algorithm based on deep neural network, Kalman filter algorithm and Hungarian algorithm is utilized to associate the target images of English teachers in adjacent frames to realize the detection and tracking of English teachers’ targets in the classroom teaching video. Then, the classroom behavior dataset of English teachers is constructed, and the HRNet human posture estimation model is used to produce the teacher skeletal information dataset, and the graph attention network (GAT) is introduced into the multistream adaptive graph convolution network (MS-AAGCN) to construct the classroom English teacher teaching behavior recognition model. Finally, the model’s performance is evaluated and its application effect in real teaching scenarios is examined.

2

Acquisition of classroom teaching videos and target detection tracking modeling

2.1

Data Acquisition of English Teachers’ Classroom Behavior

For the acquisition of English classroom teaching video data, web crawler technology or manual download is mainly utilized to crawl English classroom teaching video data for English classroom teaching keywords. Secondly, target tracking is performed on each English classroom teaching video to extract the continuous image of each English teacher. Finally, the continuous images are manually annotated to obtain the English teacher classroom behavior video dataset as well as the student classroom behavior picture dataset. The acquisition process for English classroom teaching video data is shown in Figure 1.

To process the English classroom teaching videos, all videos are first placed in one folder, and for each video a corresponding folder is created to store all of its target images. Target tracking is then performed on each frame, which yields a target ID and the position coordinates of the target image. For each ID, a new folder is created to store all consecutive images of that target, with the images numbered by frame. After all the videos have been processed, an initial dataset is obtained. Finally, all the consecutive target images are manually labeled, and either 16-32 frames are taken as one sample video of an English teacher’s classroom teaching behavior, or a single frame is taken as one sample image.
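The per-ID grouping and clip slicing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in this paper: the `(frame_index, track_id)` records, the function names, and the fixed clip length of 16 frames (the lower end of the 16-32 range) are hypothetical, and gaps in a track are not handled.

```python
from collections import defaultdict


def group_frames_by_id(detections):
    """Group (frame_index, track_id) records into per-ID frame lists sorted by frame."""
    tracks = defaultdict(list)
    for frame_index, track_id in detections:
        tracks[track_id].append(frame_index)
    return {tid: sorted(frames) for tid, frames in tracks.items()}


def slice_into_clips(frames, clip_len=16):
    """Cut one ID's frame list into non-overlapping clips of clip_len frames.

    A trailing remainder shorter than clip_len is dropped.
    """
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]


# Hypothetical tracker output: ID 1 is visible in frames 0-39, ID 2 in frames 10-19.
detections = [(f, 1) for f in range(40)] + [(f, 2) for f in range(10, 20)]
tracks = group_frames_by_id(detections)
clips = {tid: slice_into_clips(frames) for tid, frames in tracks.items()}
print({tid: len(c) for tid, c in clips.items()})
```

In this toy input, the 40 frames of ID 1 give two 16-frame clips, while the 10 frames of ID 2 are too short to form a clip and would only be usable as single-frame image samples.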

2.2

Deep Learning-Based Classroom Teacher Target Detection and Tracking

In order to recognize the behavior of each English teacher target in the classroom teaching video, it is necessary to first detect and track the English teacher targets in the video and extract the spatio-temporal features of each English teacher target. In this paper, we use the target detection algorithm based on cross-stage local network and pyramid attention network to realize the real-time detection of English teacher targets in the classroom teaching video, and then use the target tracking algorithm based on deep neural network, Kalman filtering algorithm, and Hungarian algorithm to correlate the images of the English teacher targets in adjacent frames, so that we can realize the detection and tracking of the English teacher targets in the classroom teaching video.

2.2.1

Real-time detection of classroom teacher targets

Detecting and localizing English teacher targets in classroom teaching videos is key to achieving recognition of English teacher teaching behaviors. In this paper, a deep learning-based video target detection algorithm is utilized to achieve real-time detection of English teacher targets in classroom teaching videos. The deep learning network structure for target feature learning and detection contains a cross-stage local network for extracting target features from classroom teaching video frames, a pyramid attention network for fusing multi-layer features, a convolutional network for computing the target category, and a border regression network for calculating the prediction boxes of English teacher targets. The structure of the deep learning-based real-time target detection network for classroom English teachers is shown in Fig. 2.

The convolution process of each layer of the cross-stage local network for target feature learning in classroom teaching video frames is shown in Equation (1):

$h_{l}(X) = (X * W + b) \otimes \sigma(X * V + c)$ (1)

Where: $X \in \mathbb{R}^{N \times m}$ is the input of each layer. $W \in \mathbb{R}^{k \times m \times n}$ represents the weights of each layer. $b \in \mathbb{R}^{n}$ represents the bias of each layer. $\sigma(X * V + c)$ represents the convolution operation passed through the activation function $\sigma$. $V \in \mathbb{R}^{k \times m \times n}$ represents the weights of the convolution kernel. $c \in \mathbb{R}^{n}$ represents the bias of the convolutional layer. N represents the total number of convolutional layers. k represents the number of the current convolution layer. m represents the dimension of the image. n represents the number of neurons in the convolutional layer.

After obtaining the target features of the classroom teaching video frames, the high-level features learned by the convolutional neural network are fused using a feature fusion network. The fused features are then input into a convolutional network for target detection, and the confidence level of the corresponding category is calculated as shown in Equation (2):

$C_{i}^{J} = P_{r}(\text{Object}) \times \text{IOU}$ (2)

Where: $P_{r}(\text{Object})$ is the probability that the box contains an object, and IOU is the ratio of the intersection to the union of the predicted box and the real box, as shown in Equations (3)-(5):

$\text{IOU} = \frac{S_{1}}{S_{2}}$ (3)

$S_{1} = (\min(x_{p2}, x_{l2}) - \max(x_{p1}, x_{l1})) \times (\min(y_{p2}, y_{l2}) - \max(y_{p1}, y_{l1}))$ (4)

$S_{2} = (x_{p2} - x_{p1}) \times (y_{p2} - y_{p1}) + (x_{l2} - x_{l1}) \times (y_{l2} - y_{l1}) - S_{1}$ (5)

Where: $(x_{p1}, y_{p1}, x_{p2}, y_{p2})$ are the coordinates of the upper left and lower right points of the real target box. $(x_{l1}, y_{l1}, x_{l2}, y_{l2})$ are the coordinates of the upper left and lower right points of the predicted target box. $S_{1}$ is the area of the intersection of the predicted target box and the real target box. $S_{2}$ is the area of their union.
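Equations (3)-(5) can be illustrated with a short Python sketch. The function name and the `(x1, y1, x2, y2)` tuple layout are assumptions made for illustration; the clamp at zero for non-overlapping boxes is also an addition, since Equation (4) as written presumes that the two boxes overlap.

```python
def iou(box_p, box_l):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    xp1, yp1, xp2, yp2 = box_p
    xl1, yl1, xl2, yl2 = box_l
    # Eq. (4): overlap width and height, clamped at zero for disjoint boxes.
    w = max(0.0, min(xp2, xl2) - max(xp1, xl1))
    h = max(0.0, min(yp2, yl2) - max(yp1, yl1))
    s1 = w * h
    # Eq. (5): union = sum of both box areas minus the intersection.
    s2 = (xp2 - xp1) * (yp2 - yp1) + (xl2 - xl1) * (yl2 - yl1) - s1
    # Eq. (3); degenerate zero-area input is mapped to 0.
    return s1 / s2 if s2 > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Because the expression is symmetric in the two boxes, it does not matter which argument holds the real box and which the predicted box.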

Meanwhile, the coordinate position of the target box is obtained by calculation, and the loss function is shown in Equations (6)-(10):

$loss = 1 - \text{IOU} + \frac{\rho^{2}}{c^{2}} + \alpha v$ (6)

$\rho^{2} = (x_{p} - x_{l})^{2} + (y_{p} - y_{l})^{2}$ (7)

$c^{2} = (\max(x_{p2}, x_{l2}) - \min(x_{p1}, x_{l1}))^{2} + (\max(y_{p2}, y_{l2}) - \min(y_{p1}, y_{l1}))^{2}$ (8)

$v = \frac{4}{\pi^{2}} \left(\arctan\frac{w_{l}}{h_{l}} - \arctan\frac{w_{p}}{h_{p}}\right)^{2} = \frac{4}{\pi^{2}} \left(\arctan\frac{x_{l2} - x_{l1}}{y_{l2} - y_{l1}} - \arctan\frac{x_{p2} - x_{p1}}{y_{p2} - y_{p1}}\right)^{2}$ (9)

$\alpha = \frac{v}{1 - \text{IOU} + v}$ (10)

Where: ρ is the distance between the center points $(x_{p}, y_{p})$ and $(x_{l}, y_{l})$ of the real box and the predicted box. c is the diagonal length of the smallest enclosing rectangle of the predicted box and the real box. v measures the aspect ratio similarity between the predicted box and the real box, with w and h denoting the width and height of a box. α is the influence factor of v.
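As a minimal sketch of Equations (6)-(10), the loss can be computed directly from the corner coordinates of the two boxes. The function name, the small constant `eps` that guards against division by zero, and the clamp of the overlap at zero are assumptions added for numerical safety; both boxes are assumed to have positive width and height.

```python
import math


def ciou_loss(box_p, box_l, eps=1e-9):
    """Loss of Eq. (6) for two boxes given as (x1, y1, x2, y2) with positive size."""
    xp1, yp1, xp2, yp2 = box_p
    xl1, yl1, xl2, yl2 = box_l
    # IoU from Eqs. (3)-(5), with the overlap clamped at zero.
    iw = max(0.0, min(xp2, xl2) - max(xp1, xl1))
    ih = max(0.0, min(yp2, yl2) - max(yp1, yl1))
    s1 = iw * ih
    s2 = (xp2 - xp1) * (yp2 - yp1) + (xl2 - xl1) * (yl2 - yl1) - s1
    iou = s1 / (s2 + eps)
    # Eq. (7): squared distance between the two box centers.
    rho2 = ((xp1 + xp2) / 2 - (xl1 + xl2) / 2) ** 2 \
         + ((yp1 + yp2) / 2 - (yl1 + yl2) / 2) ** 2
    # Eq. (8): squared diagonal of the smallest enclosing rectangle.
    c2 = (max(xp2, xl2) - min(xp1, xl1)) ** 2 \
       + (max(yp2, yl2) - min(yp1, yl1)) ** 2
    # Eq. (9): aspect ratio term.
    v = (4 / math.pi ** 2) * (math.atan((xl2 - xl1) / (yl2 - yl1))
                              - math.atan((xp2 - xp1) / (yp2 - yp1))) ** 2
    # Eq. (10): weight of the aspect ratio term.
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / (c2 + eps) + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # close to 0 for identical boxes
print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # close to 6/7 + 1/9
```

For the second pair, the IoU is 1/7, the squared center distance is 2, the squared enclosing diagonal is 18, and the aspect ratios are equal (v = 0), so the loss is approximately 6/7 + 2/18.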

2.2.2

Real-time tracking of classroom teacher targets

First, all English teacher target images are extracted using the target detection algorithm, and each target image is input into a simple appearance embedding model to obtain the appearance features of each English teacher. The Kalman filter algorithm is then used to predict the appearance features of each English teacher in the next frame. After the size and position of the English teacher targets in the next frame image have been detected, the Hungarian algorithm is utilized to perform the matching. In this way, each English teacher’s target is associated across frames to achieve classroom English teacher target tracking. The flow of the classroom English teacher’s target tracking algorithm is shown in Figure 3.
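The association step can be illustrated with a small Python sketch. Several assumptions are made here and none of them is taken from the paper's implementation: the Kalman prediction is not implemented and the predicted boxes are given as input, the matching cost is 1 - IoU only (appearance features are ignored), and a brute-force search over permutations stands in for the Hungarian algorithm. For a handful of targets the brute-force search returns the same minimum-cost assignment; for larger problems a dedicated solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_tracks(predicted, detected, min_iou=0.3):
    """Return (track_index, detection_index) pairs with minimum total 1 - IoU cost.

    Pairs whose IoU falls below min_iou (a hypothetical gate) are discarded.
    """
    n_t, n_d = len(predicted), len(detected)
    cost = [[1.0 - iou(p, d) for d in detected] for p in predicted]
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), perm))
                      for perm in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(perm, range(n_d)))
                      for perm in permutations(range(n_t), n_d))
    best_pairs, best_cost = [], float("inf")
    for pairs in candidates:
        total = sum(cost[t][d] for t, d in pairs)
        if total < best_cost:
            best_pairs, best_cost = pairs, total
    return sorted((t, d) for t, d in best_pairs if cost[t][d] <= 1.0 - min_iou)


# Two predicted track boxes and three detections in the next frame.
predicted = [(0, 0, 2, 2), (10, 10, 12, 12)]
detected = [(10.5, 10, 12.5, 12), (0, 0.5, 2, 2.5), (50, 50, 52, 52)]
matches = match_tracks(predicted, detected)
print(matches)
```

Here track 0 is matched to detection 1 and track 1 to detection 0, while detection 2 stays unmatched and would start a new track in a full tracker.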

3

Teacher Behavior Recognition Model Based on Multi-Stream Graph Convolutional Networks

This chapter first constructs an English teacher behavior dataset from the acquired teaching videos of real classroom scenarios, and then proposes an English teacher teaching behavior recognition model based on multi-stream graph convolutional networks. Building on teacher target detection and tracking, the model recognizes English teachers’ classroom teaching behaviors from the extracted human behavioral features. It also incorporates a graph attention module that helps the model attend to and learn the connections between the key points of the body, which further enhances its performance in recognizing teacher behaviors.

3.1

Classroom English Teacher Behavior Dataset Construction

3.1.1

Classroom English Teacher Behavior Category Classification

The first step in constructing a teacher behavior dataset is to determine the categories of teacher behaviors in the dataset. In this paper, with reference to the Student-Teacher (S-T) Behavior Analysis Method, the Classroom Teaching Behavior Analysis System (TBAS), and related studies, teacher behaviors that appear frequently in classroom teaching, have pedagogical significance, and are easy to distinguish visually are selected for study. Six behaviors are included: writing on the blackboard, operating multimedia, walking around, knowledge explanation, media explanation, and routine standing.

3.1.2

Teacher Behavior Dataset Construction

The production of the dataset mainly includes four steps: collecting data, screening and cleaning the data, data preprocessing, and manual labeling.

The data used in this paper mainly come from real classroom videos recorded in our classrooms and teaching videos made public on the Internet, including a total of 1,000 hours of teaching videos from one hundred English teachers. All the videos collected are teacher-centered, i.e., the teacher is the only one in the picture, and it is guaranteed that the podium, the lectern, the blackboard and the multimedia electronic screen are included in the captured picture.

To ensure data quality, this study preprocesses the original classroom video data and manually crops out video clips with a duration of 1s-10s that meet the requirements, and each video clip contains only one defined teacher behavioral category, i.e., there is only one category label for each video clip. In order to avoid the influence of factors such as teachers’ own appearance characteristics on the results of behavior recognition, each teacher’s behavioral clip of each category appeared no more than five times in the dataset. After the labeling of video clips was completed the data were saved and named by category in the format of category label_category name_video number.mp4.

Meanwhile, the samples of each class were randomly divided into a training set and a validation set in proportions of 80% and 20%, respectively. A comparison with public datasets and with the teacher behavior datasets used in existing research shows that the English teacher behavior recognition dataset constructed in this paper meets the needs of this study in terms of data volume.
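The per-class 80%/20% split can be sketched as follows. The function name, the fixed random seed, the short category names, and the example file names (which follow the `category label_category name_video number.mp4` pattern) are all hypothetical and serve only as an illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_ratio=0.8, seed=0):
    """Split (filename, label) pairs into train/val lists class by class."""
    by_label = defaultdict(list)
    for name, label in samples:
        by_label[label].append(name)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label, names in sorted(by_label.items()):
        names = sorted(names)
        rng.shuffle(names)
        cut = int(round(len(names) * train_ratio))
        train += [(n, label) for n in names[:cut]]
        val += [(n, label) for n in names[cut:]]
    return train, val


# Hypothetical short names for the six behavior categories, ten clips each.
categories = ["writing", "multimedia", "walking", "explaining", "media", "standing"]
samples = [(f"{label}_{name}_{i:03d}.mp4", label)
           for label, name in enumerate(categories)
           for i in range(10)]
train, val = stratified_split(samples)
print(len(train), len(val))
```

Splitting inside each class, rather than over the pooled samples, keeps the class proportions of the training and validation sets identical.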

3.2

Feature extraction of human behavior

3.2.1

Optical flow characteristics

Optical flow is the instantaneous speed of pixel motion in a sequence image, and can be used to measure how image pixels change over time. Optical flow is usually generated by the motion of the target or the camera in the image, or both. Human behavior is captured by the camera as a video, and the human body and its background can be shown as a continuous flow of images in the video, and the movement of the human body can lead to the generation of optical flow. Therefore, optical flow features can characterize human behavior. The two-dimensional vector field formed by calculating the instantaneous velocity of the image pixels one by one is called the optical flow field.

Two assumptions need to be satisfied for the calculation of optical flow: first, the brightness of the image is constant when calculating the optical flow, i.e., the same pixel has the same brightness between the preceding and following frames. Second, the motion in continuous time is “small motion”, which can be represented by the change of gray scale.

Assuming that I(x,y,t) denotes the light intensity of pixel (x, y) at frame t, then, since the brightness of the pixel is constant, the following holds at the next frame:

$I(x, y, t) = I(x + dx, y + dy, t + dt)$ (11)

where dx and dy denote very small displacements in the two spatial directions and dt denotes a very small time interval.

Expanding the right-hand side of Eq. (11) as a Taylor series gives:

$I(x, y, t) = I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt + \varepsilon$ (12)

where ε denotes the higher-order terms, which are infinitesimal and can be neglected. Cancelling I(x, y, t) on both sides of Eq. (12) and dividing by dt gives:

$\frac{\partial I}{\partial x} \frac{dx}{dt} + \frac{\partial I}{\partial y} \frac{dy}{dt} + \frac{\partial I}{\partial t} = 0$ (13)

Let I_x and I_y be the partial derivatives of the gray scale of a pixel point along the image X and Y axes, I_t be the partial derivative along the time direction, and u and v be the velocity components of the optical flow along the X and Y axes, respectively. Eq. (13) can then be written as:

$I_{x} u + I_{y} v + I_{t} = 0$ (14)

Where, $I_{x} = \frac{\partial I}{\partial x}$ , $I_{y} = \frac{\partial I}{\partial y}$ , $I_{t} = \frac{\partial I}{\partial t}$ , $u = \frac{d x}{d t}$ , $v = \frac{d y}{d t}$ .

I_x, I_y, and I_t can be computed from the images in the video, while the optical flow vector (u,v) is the unknown to be solved. Eq. (14) is a single equation with two unknowns, so other constraints are needed to solve it. Different optical flow fields can be solved under different constraints. For example, the LK optical flow algorithm introduces the constraint of spatial consistency, which assumes that neighboring pixels remain neighbors in the preceding and following frames of the video and therefore share the same motion. Using the 9 pixels within a 3×3 window, 9 equations are created according to Eq. (14), which can be expressed in matrix form as:

$\begin{bmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{x9} & I_{y9} \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} -I_{t1} \\ -I_{t2} \\ \vdots \\ -I_{t9} \end{bmatrix}$ (15)

Equation (15) is an overdetermined system of 9 equations in 2 unknowns, and the optical flow vector can be found using the least squares method.
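As a minimal sketch of this least squares step, the system of Eq. (15) can be solved through its 2×2 normal equations. The function name and the synthetic gradient values are assumptions for illustration; in practice I_x, I_y, and I_t would be estimated from two consecutive video frames.

```python
def lucas_kanade_flow(ix, iy, it):
    """Least squares solution (u, v) of Eq. (15) for one window.

    ix, iy, it are equal-length lists of the derivatives I_x, I_y, I_t
    at the pixels of the window (9 values for a 3x3 window).
    """
    # Entries of A^T A and A^T b, where the rows of A are (I_xk, I_yk)
    # and the entries of b are -I_tk.
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None  # gradients are degenerate, the flow is not determined
    # Cramer's rule on [sxx sxy; sxy syy] [u v]^T = [-sxt -syt]^T.
    u = (-sxt * syy + syt * sxy) / det
    v = (-syt * sxx + sxt * sxy) / det
    return u, v


# Synthetic window whose gradients are consistent with the flow (1.0, -0.5).
ix = [1, 0, 2, 1, 3, 0, 1, 2, 1]
iy = [0, 1, 1, 2, 0, 3, 1, 1, 2]
it = [-(gx * 1.0 + gy * -0.5) for gx, gy in zip(ix, iy)]
print(lucas_kanade_flow(ix, iy, it))
```

The `None` branch corresponds to windows with no texture or with gradients in a single direction only, where the two unknowns cannot be separated.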

3.2.2

Characterization of spatio-temporal points of interest

Spatio-temporal points of interest refer to the points that change more significantly in the spatio-temporal domain of the video, which are not considered based on the human body itself, but in some ways can reflect the details of the video motion, and thus some characteristics of human behavior. In addition, spatio-temporal points of interest can also describe some appearance information in the video. To acquire spatio-temporal interest point features, two steps must be taken: detecting spatio-temporal interest points and describing spatio-temporal interest points.

The main idea of spatio-temporal interest point detection is to consider the video as a three-dimensional function, and then utilize a mapping function to map the video from high to low dimensions, i.e., from three dimensions to one dimension, and finally solve for the local maxima, which are the spatio-temporal interest points of the video, in one-dimensional space. Commonly used detection methods include 3D Harris corner point detection, which is extended from the 2D Harris corner point detection method and contains information about spatial structure and temporal dimension. The specific calculation is as follows:

In the first step, the video is transformed into a linear scale space by performing a scale transformation. The temporal scale and spatial scale of the video are determined by the extent of the behavior and have a large impact on the detection of spatio-temporal interest points. To adapt to scale changes in the video, the scale transformation is carried out in time and space before detection, which can be expressed as:

$L(\cdot; \sigma_{l}^{2}, \tau_{l}^{2}) = g(\cdot; \sigma_{l}^{2}, \tau_{l}^{2}) * f(\cdot)$ (16)

Where: f(·) denotes the video function. g(·) denotes the Gaussian filter function. $\sigma_{l}$ denotes the spatial scale parameter. $\tau_{l}$ denotes the time scale parameter. * denotes the convolution operation. $L(\cdot; \sigma_{l}^{2}, \tau_{l}^{2})$ denotes the result obtained after the video is convolved with the Gaussian kernel function. The Gaussian filter function is expressed as follows:

$g(x, y, t; \sigma_{l}^{2}, \tau_{l}^{2}) = \frac{\exp\left(-\frac{x^{2} + y^{2}}{2\sigma_{l}^{2}} - \frac{t^{2}}{2\tau_{l}^{2}}\right)}{\sqrt{(2\pi)^{3} \sigma_{l}^{4} \tau_{l}^{2}}}$ (17)

In the second step, a Gaussian window is slid over the video, and the changes in the window before and after sliding are compared to determine the spatio-temporal interest points. L_x, L_y, and L_t denote the first-order derivatives of the Gaussian-smoothed video L in the horizontal, vertical, and temporal directions, respectively, and the matrix μ is constructed with the following expression:

$\mu = g(\cdot; \sigma_{l}^{2}, \tau_{l}^{2}) * \begin{pmatrix} L_{x}^{2} & L_{x} L_{y} & L_{x} L_{t} \\ L_{x} L_{y} & L_{y}^{2} & L_{y} L_{t} \\ L_{x} L_{t} & L_{y} L_{t} & L_{t}^{2} \end{pmatrix}$ (18)

In the third step, the eigenvalues of matrix μ are calculated and used to construct the response function H, and the spatio-temporal points of interest are obtained by thresholding H. The response function H is denoted as:

$H = \det(\mu) - k \cdot \operatorname{trace}^{3}(\mu) = \lambda_{1} \lambda_{2} \lambda_{3} - k (\lambda_{1} + \lambda_{2} + \lambda_{3})^{3}$ (19)

where λ₁, λ₂, and λ₃ are the eigenvalues of matrix μ. Finally, the positive maxima of the function H are solved, that is, the spatio-temporal points of interest are found.
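Since the determinant of μ equals the product of its eigenvalues and the trace equals their sum, the response H can be evaluated without an explicit eigendecomposition. The following sketch assumes that μ is already available as a 3×3 nested list; the function name and the default value k = 0.005 are assumptions, not values reported in this paper.

```python
def harris3d_response(mu, k=0.005):
    """H = det(mu) - k * trace(mu)^3 for a 3x3 second-moment matrix mu."""
    (a, b, c), (d, e, f), (g, h, i) = mu
    # Determinant by cofactor expansion along the first row.
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    trace = a + e + i
    return det - k * trace ** 3


# Strong variation in x, y and t: positive response (interest point candidate).
print(harris3d_response([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
# Variation in one direction only: negative response (rejected).
print(harris3d_response([[1, 0, 0], [0, 0, 0], [0, 0, 0]]))
```

A positive H requires all three eigenvalues to be large, which is why only points that vary in both the spatial and the temporal directions are retained.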

Detecting spatio-temporal interest points from videos often yields only a small number of points, because the interest points need to satisfy the conditions of both the temporal and spatial domains. This problem is overcome by the dense trajectory algorithm, whose main idea is to densely sample the video at multiple scales and then track the feature points to form feature trajectories.

After the interest points are detected by the spatio-temporal interest point detector, they must also be described structurally and statistically for feature characterization, which has led to various feature descriptors. The jets local descriptor calculates the first- through fourth-order partial derivatives at each spatio-temporal interest point and combines them in order to form the feature vector. This feature mainly captures the motion and appearance information of the point of interest in the spatio-temporal domain. The Cuboid descriptor takes the point of interest as the center, expands it in the temporal and spatial dimensions to construct a cube, and combines the gradients of all the pixels in the cube into a one-dimensional feature vector, whose dimensionality is then reduced using PCA. The main idea of the HOG/HOF feature descriptor is to compute the gradients around the point of interest, divide the region according to the gradient direction, and accumulate the gradient magnitudes of the feature points to form a feature vector.

3.2.3

Characteristics of the human skeleton

The human skeleton can be considered an abstract representation of the human body, capable of focusing on human behavior without the interference of complex environments. The human skeleton is usually composed of joints, and commonly used skeleton models with 18 joints and 25 joints are shown in Fig. 4. The joints of the 18-joint model are numbered 0-17 and denote, in order: Nose, Neck, RShoulder, RElbow, RWrist, LShoulder, LElbow, LWrist, RHip, RKnee, RAnkle, LHip, LKnee, LAnkle, REye, LEye, REar, and LEar, where the prefix R stands for right and L for left. Compared to the 18-joint model, the 25-joint model adds MidHip, LBigToe, LSmallToe, LHeel, RBigToe, RSmallToe, and RHeel.

3.3

Multi-Stream Graph Convolutional Neural Network Modeling

3.3.1

Skeletal keypoint extraction

After organizing the English teacher behavior video dataset, it is also necessary to construct a teacher behavior dataset based on skeletal information in order to achieve effective recognition of English teacher classroom teaching behaviors. By referring to the existing behavior recognition datasets based on skeletal information, this paper chooses to use the HRNet human posture estimation model to produce the teacher skeletal information dataset.

3.3.2

Network structure

The overall structure of the multi-stream graph convolutional network is shown in Fig. 5. Its core structure is a stack of nine base modules, whose number of output channels rises from 32 to 256.

1)

Graph Convolution Module

The behavior recognition method based on graph convolutional network first extracts the skeletal keypoints by the human body pose estimation algorithm HRNet. The input to the graph convolutional network is the human body topology graph, i.e., the extracted keypoints are connected in the order of connectivity of the human body joints [22]. The connections between individual human keypoints in the same frame in the video are used as spatial edges, and the edges between the same keypoints in neighboring frames are used as temporal edges. A topology graph is constructed based on the natural connections of the human body, and the feature map in the network is actually a tensor $f \in \mathbb{R}^{C \times T \times N}$, where N denotes the number of keypoints, T denotes the duration, and C denotes the number of channels.

Given a topological graph of the human skeleton, the convolution in a graph convolutional network can be expressed as:

$f_{out}(v_{i}) = \sum_{v_{j} \in B_{i}} \frac{1}{Z_{ij}} f_{in}(v_{j}) \cdot w(l_{i}(v_{j}))$ (20)

Where: f is the feature map. v denotes a vertex of the graph. $B_{i}$ denotes the convolutional sampling region of $v_{i}$. The convolutional kernel size of the sampling strategy used in this paper is 3. w is the weighting function, which provides a weight vector based on the given input. $l_{i}$ is the mapping function that maps all neighboring vertices to a fixed number of subsets, each with a unique weight vector. $Z_{ij}$ is used to balance the contribution of each subset. For the temporal dimension, a $K_{t} \times 1$ convolution is performed on the output feature map of the spatial convolution, where $K_{t}$ is the kernel size of the temporal dimension. This convolution is applied to each node separately, over $K_{t}$ consecutive frames.
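Equation (20) can be illustrated on a toy skeleton with scalar features. The sketch below makes several assumptions that are not taken from this paper: the mapping $l_{i}$ assigns each vertex in $B_{i}$ to one of three subsets according to its hop distance to a body center (same distance, closer, farther), $Z_{ij}$ is the number of vertices of $B_{i}$ in the same subset as $v_{j}$, and each subset has a single scalar weight instead of a weight vector. All names are hypothetical.

```python
def graph_conv(features, neighbors, hops_to_center, weights):
    """Spatial graph convolution of Eq. (20) with scalar features.

    features:       {vertex: scalar input feature f_in}
    neighbors:      {vertex: list of adjacent vertices}
    hops_to_center: {vertex: hop distance to the assumed body center}
    weights:        [w0, w1, w2], one scalar weight per subset
    """
    out = {}
    for i, nbrs in neighbors.items():
        region = [i] + list(nbrs)  # sampling region B_i, including v_i itself
        labels = {}
        for j in region:
            if hops_to_center[j] == hops_to_center[i]:
                labels[j] = 0  # same distance (contains the root v_i)
            elif hops_to_center[j] < hops_to_center[i]:
                labels[j] = 1  # closer to the center
            else:
                labels[j] = 2  # farther from the center
        # Z_ij: size of the subset that v_j falls into.
        counts = {s: sum(1 for j in region if labels[j] == s) for s in range(3)}
        out[i] = sum(features[j] * weights[labels[j]] / counts[labels[j]]
                     for j in region)
    return out


# Toy chain skeleton 0 - 1 - 2 - 3 with vertex 1 as the center.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hops_to_center = {0: 1, 1: 0, 2: 1, 3: 2}
features = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
out = graph_conv(features, neighbors, hops_to_center, weights=[0.5, 1.0, 2.0])
print(out)
```

For vertex 1, both neighbors fall into the "farther" subset, so each contributes with weight 2.0 divided by 2, which shows how $Z_{ij}$ keeps a large subset from dominating the sum.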

2)

Attention Module

Graph Attention Network (GAT) is a neural network for processing graph-structured data based on GCN improvement, which consists of several graph attention layers [23]. The multi-stream adaptive graph convolutional network MS-AAGCN embeds the STC attention module into the graph convolutional layers to help the model learn to selectively attend to joints, frames, and channels, and co-learn and update them along with other parameters [24]. This data-driven approach increases the flexibility and generalization ability of the model. In this paper, we add a graph attention layer to focus on important nodes based on the MS-AAGCN network to automatically learn and optimize the connectivity relationships between nodes to improve the expressive power of the model.

First, a feature transformation matrix $W \in \mathbb{R}^{F' \times F}$ is defined to transform the features of each node from the input dimension F to the output dimension F′. During aggregation, the attention weight of neighbor node $v_{j}$ with respect to node $v_{i}$ is:

$e_{ij} = f(W h_{i}, W h_{j})$ (21)

Where $e_{ij}$ denotes the importance of node j to node i and f is a function that calculates the correlation between nodes. The attention $e_{ij}$ between two nodes is calculated as follows:

$e_{ij} = f(W h_{i}, W h_{j}) = \text{LeakyReLU}(w^{T} [W h_{i} \| W h_{j}])$ (22)

Where ‖ denotes vector concatenation, $w \in \mathbb{R}^{2F'}$ is the weight vector, and LeakyReLU is the activation function. Node j is a neighbor of node i, and node i may have more than one neighbor, so the correlations of all neighboring nodes are first normalized as follows:

$$\alpha_{ij} = \mathrm{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{v_{k} \in N(v_{i})} \exp(e_{ik})} \tag{23}$$

After obtaining the attention coefficients α_ij, the new features of node v_i are:

$$v_{i} = \sigma\left(\sum_{v_{j} \in N(v_{i})} \alpha_{ij} W h_{j}\right) \tag{24}$$
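The attention computation in Eqs. (22) to (24) can be traced for a single node with the following pure-Python sketch. Several details are assumptions made only for illustration: W is stored as a list of F′ rows of length F, the LeakyReLU slope is 0.2, σ in Eq. (24) is taken to be the Sigmoid, and the neighborhood includes the node itself. The function names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_node_update(i, h, neighbors, W, w):
    """Return (alpha, new_feature) for node i.

    h: list of node feature vectors; neighbors: indices in N(v_i);
    W: feature transformation matrix; w: attention weight vector (length 2F').
    """
    Wh = [matvec(W, h_k) for h_k in h]
    # Eq. (22): e_ij = LeakyReLU(w^T [W h_i || W h_j])
    e = [leaky_relu(sum(a * z for a, z in zip(w, Wh[i] + Wh[j])))
         for j in neighbors]
    # Eq. (23): softmax over the neighborhood (shifted for numerical stability)
    shift = max(e)
    exp_e = [math.exp(x - shift) for x in e]
    total = sum(exp_e)
    alpha = [x / total for x in exp_e]
    # Eq. (24): weighted sum of transformed neighbor features, then sigma
    agg = [sum(a * Wh[j][d] for a, j in zip(alpha, neighbors))
           for d in range(len(Wh[i]))]
    new_feature = [1.0 / (1.0 + math.exp(-x)) for x in agg]
    return alpha, new_feature


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
w = [0.5, -0.5, 1.0, 0.25]
alpha, v0 = gat_node_update(0, h, [0, 1, 2], W, w)
```

The coefficients in `alpha` sum to one over the neighborhood, so the update in Eq. (24) is a convex combination of the transformed neighbor features before the nonlinearity is applied.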

The Temporal Attention Module (TAM) is calculated as:

$$M_{t} = \sigma\left(g_{t}(\mathrm{AvgPool}(f_{in}))\right) \tag{25}$$

where $M_{t} \in \mathbb{R}^{1 \times T \times 1}$, σ is the Sigmoid activation function, and g_t is a one-dimensional convolution operation.

The Channel Attention Module (CAM) helps the model enhance the descriptive features (channels) according to the input sample. Its attention is computed as follows:

$$M_{c} = \sigma\left(W_{2}\left(\delta\left(W_{1}(\mathrm{AvgPool}(f_{in}))\right)\right)\right) \tag{26}$$

where $M_{c} \in \mathbb{R}^{C \times 1 \times 1}$, W₁ and W₂ are the weights of the fully connected layers, and δ is the ReLU activation function.
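A rough sketch of Eqs. (25) and (26) is given below, using nested lists indexed as f_in[c][t][v] (channel, frame, node). It assumes that the temporal module averages over channels and nodes before the one-dimensional convolution, which is a simplification, that the convolution kernel has odd length with zero padding, and that the channel module averages over frames and nodes. The function names and the toy weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def temporal_attention(f_in, kernel):
    """Eq. (25): one gate in (0, 1) per frame."""
    C, T, V = len(f_in), len(f_in[0]), len(f_in[0][0])
    pooled = [sum(f_in[c][t][v] for c in range(C) for v in range(V)) / (C * V)
              for t in range(T)]
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    # g_t: 1-D convolution over the pooled frame sequence
    return [sigmoid(sum(k * padded[t + i] for i, k in enumerate(kernel)))
            for t in range(T)]


def channel_attention(f_in, W1, W2):
    """Eq. (26): one gate in (0, 1) per channel."""
    C, T, V = len(f_in), len(f_in[0]), len(f_in[0][0])
    pooled = [sum(f_in[c][t][v] for t in range(T) for v in range(V)) / (T * V)
              for c in range(C)]
    # delta = ReLU after the first fully connected layer
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in W1]
    return [sigmoid(sum(w * x for w, x in zip(row, hidden))) for row in W2]


# Two channels, three frames, two nodes.
f_in = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
        [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]]
M_t = temporal_attention(f_in, [0.0, 1.0, 0.0])
M_c = channel_attention(f_in, [[0.5, 0.5]], [[1.0], [-1.0]])
```

In both cases the resulting gates would be broadcast over the remaining dimensions of f_in and multiplied element-wise with it.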

3.3.3

Multi-stream framework

Four modalities were trained: bone flow, bone motion flow, keypoint flow, and keypoint motion flow.

Joints near the center of gravity of the body are defined as source keypoints and those far from it as target keypoints; the vector pointing from a source keypoint to a target keypoint represents one bone. In the human body topology graph, the number of keypoints is one more than the number of bones, and the root node has no corresponding source keypoint, so the root node is assigned an empty bone with a value of zero. The bone flow is denoted by B and the keypoint flow by J. Given the source keypoint $v_{i} = (x_{i,t}, y_{i,t}, z_{i,t})$ and the adjacent target keypoint $v_{j} = (x_{j,t}, y_{j,t}, z_{j,t})$ at moment t, the bone vector can be expressed as:

$$b_{i,j} = (x_{j,t} - x_{i,t},\; y_{j,t} - y_{i,t},\; z_{j,t} - z_{i,t}) \tag{27}$$
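Eq. (27) amounts to a coordinate-wise subtraction along the skeleton tree. The sketch below assumes the skeleton is described by a hypothetical `parent` list that gives, for each keypoint, the index of its source keypoint (or `None` for the root); this representation is not taken from the paper.

```python
def bone_vectors(joints, parent):
    """Compute one bone vector per keypoint for a single frame.

    joints: list of (x, y, z) keypoint coordinates.
    parent: parent[j] is the index of the source keypoint of joint j,
            or None for the root, which receives an empty (zero) bone.
    """
    bones = []
    for j, (x, y, z) in enumerate(joints):
        p = parent[j]
        if p is None:
            bones.append((0.0, 0.0, 0.0))
        else:
            px, py, pz = joints[p]
            # target minus source, as in Eq. (27)
            bones.append((x - px, y - py, z - pz))
    return bones


# A three-joint chain: root -> joint 1 -> joint 2.
demo_bones = bone_vectors(
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    [None, 0, 1],
)
```

Because the root is given a zero bone, the bone flow has the same number of entries as the keypoint flow, so both can be fed to networks of the same shape.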

The motion information of bones and keypoints is the difference between the same bones and keypoints in two neighboring frames, and the bone motion flow and keypoint motion flow are denoted by B-M and J-M, respectively. Given the keypoints at moments t and t+1, the motion information between them can be expressed as:

$$m_{t,t+1} = (x_{i,t+1} - x_{i,t},\; y_{i,t+1} - y_{i,t},\; z_{i,t+1} - z_{i,t}) \tag{28}$$
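The frame-to-frame difference in Eq. (28) can be sketched as follows. The function name `motion_flow` is hypothetical, and the sketch returns T-1 motion frames for T input frames; how the last frame is handled in the authors' implementation is not stated.

```python
def motion_flow(frames):
    """Difference of the same keypoints (or bones) in neighboring frames.

    frames[t][i] is the (x, y, z) vector of element i in frame t.
    Returns a list of T-1 frames of difference vectors.
    """
    motion = []
    for current, following in zip(frames, frames[1:]):
        motion.append([
            tuple(b - a for a, b in zip(p, q))  # value at t+1 minus value at t
            for p, q in zip(current, following)
        ])
    return motion


demo_motion = motion_flow([
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.5, 0.0, 0.0), (1.0, 2.0, 0.0)],
])
```

The same function applies to keypoint coordinates (giving J-M) and to bone vectors (giving B-M).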

The four modalities in the multi-stream framework are fed into four separate streams, each of which produces its own score. The scores of the four streams are weighted and summed to obtain the final predicted score, from which the predicted behavior label is derived. Within each stream, multilayer spatio-temporal graph convolution is performed on the topological graph to extract high-level features, and the output feature of the last layer of the network is x. A Global Average Pooling (GAP) layer follows, and the classification scores are predicted with the Softmax function. The resulting score for a single stream is:

$$S_{i} = \mathrm{softmax}(\mathrm{AvgPool}(x)) \tag{29}$$
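The late fusion of the four streams can be sketched as below. The per-stream inputs stand in for the pooled features of Eq. (29), and the equal weights are an assumption made only for illustration, since the fusion weights are not given in the text; the numbers are toy values, not results from the paper.

```python
import math


def softmax(z):
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in z]
    total = sum(exps)
    return [x / total for x in exps]


def fuse_streams(stream_scores, weights):
    """Weighted sum of per-stream class scores and the predicted label."""
    num_classes = len(stream_scores[0])
    fused = [sum(w * s[k] for w, s in zip(weights, stream_scores))
             for k in range(num_classes)]
    label = max(range(num_classes), key=fused.__getitem__)
    return fused, label


# Toy pooled outputs for the J, B, J-M and B-M streams (three classes).
pooled = [
    [2.0, 1.0, 0.1],
    [1.5, 1.2, 0.3],
    [0.2, 1.0, 0.5],
    [1.0, 0.8, 0.1],
]
scores = [softmax(p) for p in pooled]
fused, label = fuse_streams(scores, [0.25, 0.25, 0.25, 0.25])
```

When the weights sum to one, the fused vector is itself a probability distribution over the behavior classes.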

4

Model application analysis

4.1

Model experiment results and analysis

The model evaluation metrics used in this study are mainly classification accuracy and recall. The curves of classification accuracy and cross-entropy loss on the training set for the English teacher classroom teaching behavior recognition method used in this paper are shown in Fig. 6; the shaded area marks the region where the two curves cross.

As the number of iterations increases, the loss value first decreases slowly, then rapidly, and finally slowly again until it flattens out. The classification accuracy, in turn, rises slowly, then rapidly, and finally slowly until it reaches a plateau. After 14,000 iterations, the loss value of the model approaches 0 and the training accuracy approaches 100%, which indicates that the model has been well trained.

During model training, the model is evaluated on the validation set at the end of each training cycle. The curves of classification accuracy and cross-entropy loss on the validation set for the English teacher behavior recognition method are shown in Figure 7.

As can be seen from Fig. 7, when the model is first evaluated on the validation set, the validation accuracy is slightly higher than the training accuracy, which indicates underfitting on the training set. As training proceeds, the validation accuracy stays below the training accuracy while gradually improving; the underfitting problem is resolved and the model reaches a normal, near-ideal fit.

On the CTTAV validation set, the recognition performance of the English teacher teaching behavior recognition model proposed in this paper and of other established behavior recognition models is shown in Table 1.

Table 1.

Comparison of model recognition effect

Method	Input size	Accuracy (%)	Pose extraction
C3D	16×128×171	76.34	No
I3D	32×256×256	80.16	No
SlowOnly	8×256×256	53.25	No
R3D-34	48×112×112	80.63	No
R2plus1D-34	48×112×112	73.14	No
MS-AAGCN	48×56×56	85.49	Yes
HRNet+MS-AAGCN+GAT (Ours)	48×56×56	87.61	Yes

As can be seen from Table 1, the teacher teaching behavior recognition method proposed in this paper achieves a recognition accuracy of 87.61% on the English teacher classroom teaching behavior dataset, which is 2.12 percentage points higher than that of the MS-AAGCN model embedded with the STC attention module. This suggests that combining the graph attention module with the human skeleton topology graph makes fuller use of human behavioral feature information and yields better recognition. On the same dataset, compared with C3D, I3D, SlowOnly, R3D-34, and R2plus1D-34, the proposed model improves the accuracy by 11.27, 7.45, 34.36, 6.98, and 14.47 percentage points, respectively. The teacher behavior analysis network constructed in this paper therefore recognizes English teachers’ classroom teaching behaviors better than the other models on this dataset.
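The improvements quoted above are absolute differences between accuracies, and they can be re-derived directly from the values in Table 1. The short check below uses only those table values; the variable names are illustrative.

```python
# Accuracy (%) values copied from Table 1.
OURS = 87.61
baselines = {
    "C3D": 76.34,
    "I3D": 80.16,
    "SlowOnly": 53.25,
    "R3D-34": 80.63,
    "R2plus1D-34": 73.14,
    "MS-AAGCN": 85.49,
}

# Absolute gain of the proposed model over each baseline, rounded to 2 decimals.
gains = {name: round(OURS - acc, 2) for name, acc in baselines.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The rounding step only removes floating-point noise; the underlying differences are exact to two decimal places.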

The confusion matrix of this paper’s model, which gives the predicted share of each behavior, is shown in Figure 8. X1 to X6 denote the six classroom teaching behaviors of English teachers: blackboard writing, operating multimedia, walking around, knowledge explanation, media explanation, and regular standing, respectively.

As can be seen in Figure 8, the highest recall was obtained for blackboard writing and operating multimedia, at 98% and 93%, respectively. Media explanation and walking around came next, both above 80%. Knowledge explanation and regular standing were lowest, both at 79%. Knowledge explanation and media explanation are similar in how the behavior is expressed and are confused with each other most often: the most frequent misclassification of knowledge explanation is media explanation, and vice versa. In addition, during knowledge explanation the teacher stands facing the students, which is almost the same as regular standing except for differences in the hands and face. Because the sample size for regular standing was smaller than that for knowledge explanation, a larger share of regular standing was misclassified as knowledge explanation. Walking around is a moving behavior, but movement also appears in all the other actions during sampling, and the number of walking-around samples is small, so the model does not recognize and classify this action well. Blackboard writing and operating multimedia have more distinctive characteristics: blackboard writing is usually one-handed writing with the back to the students, whereas operating multimedia is a stationary behavior facing the students with the body leaning forward, so these two are classified most accurately. Overall, the classification performance of the current model is sufficient to support the analysis of classroom teachers’ teaching behaviors.
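Per-class recall and the most frequent confusion partner of a class can both be read off a confusion matrix, as in the sketch below. The 3×3 matrix uses made-up counts purely for illustration and is not the data behind Figure 8; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_recall(cm):
    """Recall of class k = correct predictions in row k / total of row k."""
    return [row[k] / sum(row) if sum(row) else 0.0
            for k, row in enumerate(cm)]


def most_confused_with(cm, k):
    """Index of the wrong class that true class k is most often predicted as."""
    candidates = [j for j in range(len(cm[k])) if j != k]
    return max(candidates, key=cm[k].__getitem__)


# Illustrative counts only (rows: true class, columns: predicted class).
toy_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
recalls = per_class_recall(toy_cm)
partner_of_class_1 = most_confused_with(toy_cm, 1)
```

Applying `most_confused_with` to every row is one way to identify pairs of behaviors that are mutually confused, such as the two explanation behaviors discussed above.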

4.2

Application in real teaching scenarios

In order to verify the practicality of the English classroom teaching behavior target detection and identification model proposed in this paper, classroom teaching behavior data were collected and analyzed from university English teachers at the experimental site in City A using the developed quantitative calculation and evaluation system for teachers’ teaching behavior. The video collection task lasted for one semester, and data collection was conducted for three types of English courses, namely English speaking, English reading, and English writing, involving eight teachers.

With 5 seconds as the time slice for video segmentation, the teaching behaviors of English teachers over the whole lesson were analyzed, including facial orientation (seconds), smiling expression (seconds), gestures (seconds), close body distance to students (seconds), and the number of interactions (times).
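One way to turn per-slice recognition results into the duration statistics listed above is to credit each detected behavior with the length of the slice in which it occurs. The sketch below assumes that the model emits a collection of behavior names for each 5-second slice; the label strings and the function name are hypothetical.

```python
SLICE_SECONDS = 5  # time slice length stated in the text


def behaviour_durations(slice_labels):
    """Sum the duration in seconds of each behavior over a lesson.

    slice_labels: one collection of behavior names per 5-second slice.
    """
    totals = {}
    for labels in slice_labels:
        for name in set(labels):  # count each behavior once per slice
            totals[name] = totals.get(name, 0) + SLICE_SECONDS
    return totals


# Four consecutive slices (20 seconds of video) with hypothetical labels.
demo_totals = behaviour_durations([
    ["facing_students", "gesture"],
    ["facing_students"],
    [],
    ["smile", "facing_students"],
])
```

Durations computed this way are multiples of the slice length, so the 5-second slice sets the temporal resolution of the statistics.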

To verify that the method proposed in this study agrees with teachers’ actual teaching behaviors, the model’s detection results for the three English courses were compared with ground-truth values. The ground truth was recorded by six trained professionals who manually annotated the recorded videos with the help of tools, using the same recording intervals as the deep learning-based target detection and recognition model of this study. A comparison of the model’s automatic detection results with the manual annotations is shown in Figure 9, where (a) to (d) give the time statistics for student-facing orientation recognition, hand gesture recognition, near-student body distance recognition, and smile expression recognition, respectively.

As can be seen from Figure 9, the duration of each behavior obtained by the recognition method proposed in this paper is basically consistent with the results of manual labeling, which supports the validity and reliability of the proposed method and indicates that it can be applied effectively to real scenarios. It can also be seen that, owing to the characteristics of the different English courses, the proportion of class time in which teachers use gestures differs. In oral English teaching, the teacher spends more time explaining to the students.

Instructional gestures and evaluative gestures are used more in reading courses, whereas congruent gestures and symbolic gestures are more prevalent in English speaking and writing classes. Relatively more time is spent on student-based practice in English reading and English writing classes. Expressive language is used more in the English speaking and English writing classes, while the English reading class is lecture-based and uses relatively little expressive language. These results indicate that the method proposed in this paper is valid and reliable and can be used effectively in real-life scenarios.

5

Conclusion

In this paper, we constructed an English teacher target detection and tracking model as well as a classroom teaching behavior recognition model to investigate the teaching behavior of English teachers in the classroom. The main conclusions are as follows:

First, the model was trained and its recognition results were compared with those of other models. After 14,000 iterations, the loss value of the model approaches zero and the training accuracy approaches 100%. As training proceeds, the validation accuracy stays below the training accuracy while gradually improving; the underfitting problem is resolved and a normal fit is formed, which indicates that the model is trained well and reaches a near-ideal state. On the English teacher classroom teaching behavior dataset, the recognition accuracy of the proposed teacher teaching behavior recognition model reaches 87.61%. Compared with the C3D, I3D, SlowOnly, R3D-34, R2plus1D-34, and MS-AAGCN models, the recognition accuracy of this paper’s model is higher by 11.27, 7.45, 34.36, 6.98, 14.47, and 2.12 percentage points, respectively. This demonstrates the effectiveness of this paper’s model in recognizing English teachers’ classroom teaching behaviors.

Second, the target detection and teaching behavior recognition model of this paper was applied to real English teaching scenarios. The behavior durations obtained by the model are basically consistent with those obtained by manual labeling, which indicates that the model can be applied effectively to real scenarios.

Language: English

Publication timeframe: 1 time per year
Journal Subjects: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

A Study of English Teachers’ Classroom Teaching Behavior Based on Deep Learning

Yaya Tian

Hailong Zhang

Published Online: Mar 19, 2025

Received: Oct 17, 2024

Accepted: Feb 02, 2025

DOI: https://doi.org/10.2478/amns-2025-0437

Keywords: Target detection and tracking, Multi-stream graph convolutional networks, Deep learning, Classroom teaching behavior

© 2025 Yaya Tian et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
