A Study of English Teachers’ Classroom Teaching Behavior Based on Deep Learning
Published Online: Mar 19, 2025
Received: Oct 17, 2024
Accepted: Feb 02, 2025
DOI: https://doi.org/10.2478/amns-2025-0437
© 2025 Yaya Tian et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Teaching behavior in the broad sense includes both the teacher’s “teaching” and the students’ “learning” behavior, while in the narrow sense it refers only to the teacher’s “teaching” behavior. Classroom teaching behaviors are all the observable, outward behaviors that teachers adopt in the classroom to accomplish the teaching objectives. The classroom teaching behaviors of English teachers mainly refer to the behaviors, directed at the teaching content, that English teachers adopt so that teaching is implemented successfully in the English classroom environment [1–2]. Because teachers differ in English educational philosophy, teaching style, professional knowledge and skills, and classroom teaching tact, these behaviors take different forms in specific English classroom teaching situations. The evaluation of English teachers’ classroom teaching behavior is a value judgment of that behavior, made for the purpose of improving English teachers’ teaching ability and classroom teaching effectiveness [3–5].
With the rapid development of education informatization, teacher behavior recognition and analysis has become a popular research direction in the field of education. Traditional teacher behavior recognition and analysis relies on manual observation and subjective judgment, which suffers from low recognition accuracy and cumbersome operation. A deep learning-based teacher behavior recognition and analysis system, by using big data and neural network technology, can identify and analyze teachers’ behavior effectively and improve both the recognition accuracy and the degree of automation. Moreover, it can monitor and analyze teachers’ behaviors in real time and provide teachers with a reference basis and suggestions for teaching improvement after class. At the same time, it can provide strong support for educational research and promote the development of educational informatization [6–9].
Literature [10] pointed out the wide application of deep learning technology and, based on its characteristics, explored its feasibility for classroom teaching behavior analysis. Literature [11] proposed a method for teacher behavior recognition in video teaching scenes. Based on the “teacher set” teaching model, it discussed recognition and extraction algorithms to identify the teachers in teaching videos and introduced an improved 3D BP-TBR to recognize the types of teacher behaviors. Experiments verified the performance of 3D BP-TBR, and the overall method helps to improve the accuracy of teacher behavior recognition. Literature [12] emphasized the important role of teachers in regulating the educational process and student learning outcomes. It examined teachers’ care and praise and their impact on student engagement in the English classroom, and presented some insights to clarify student and teacher practices. Literature [13] built a classroom teaching behavior analysis and evaluation system with deep learning face recognition technology and analyzed classroom behavior in terms of various student behaviors. The results indicate that deep learning face recognition technology can identify students’ classroom behaviors effectively, which helps in teaching management and implementation. Literature [14] explored the effectiveness of a deep learning-based flipped classroom teaching model. After applying this teaching model in teaching experiments, it concluded that implementing environmental protection education based on English teaching is of great significance.
Literature [15] explored the relationship between English teachers’ interpersonal behavior and students’ English fluency, and the English “Teacher Interaction Questionnaire” was used to investigate. The results indicated that student achievement was negatively related to teacher uncertainty, positively related to the degree of cooperation between teachers and students, and that there were differences in students’ preferences for teacher behavior.
Literature [16] constructed the classroom learning behavior dataset ActRec-Classroom based on classroom behaviors such as listening to lectures and raising hands, and proposed a CNN-based framework for a classroom learning behavior analysis system. Experiments on human body detection, human body keypoint extraction, and the design of action recognition classifiers showed that the proposed system meets the needs of behavior analysis in classroom teaching. Literature [17] created an integrated practice process for AI-based classroom behavior analysis built on human posture estimation, facial expression recognition, and other technologies, including modules for acquisition and analysis and for feedback and improvement, aiming to guide teachers in improving their behavior, teaching quality, and learning effectiveness. Literature [18] cross-compared multiple data sources and used the mature random forest algorithm and a correction matrix in the data analysis of classroom teaching behavior. The analysis results emphasize that deep learning can accurately feed back data on classroom teaching phenomena and teaching activities, which helps to optimize classroom management and promote teaching reform. Literature [19] identified predictors of Spanish English learners’ willingness to communicate in a foreign language. Multiple regression analysis showed that foreign language classroom anxiety was the most significant negative predictor affecting foreign language learning, while enjoyment of the foreign language and the frequency of the teacher’s use of the foreign language were positive predictors. Literature [20] discussed how well foreign language teachers’ self-efficacy predicts motivational teaching behavior. A questionnaire survey of 112 English teachers was analyzed by descriptive statistics and multiple regression analysis, which showed that English teachers’ efficacy for classroom management and student engagement was much lower than their self-efficacy for teaching strategies, proving the causal relationship between self-efficacy and motivational strategies. Literature [21] reviewed the literature on the concept of teacher beliefs and the studies that explore teacher beliefs with various methods, and examined the relationship between teacher beliefs and classroom practices as well as the factors influencing teacher beliefs in teachers’ professional development.
This paper explores the classroom teaching behaviors of English teachers by constructing an English teacher target detection and tracking model and a classroom teaching behavior recognition model. First, web crawler technology is used to obtain video data of English classroom teaching. Second, a target detection algorithm based on a cross-stage local network and a pyramid attention network detects English teacher targets in classroom teaching videos in real time. Third, a target tracking algorithm based on a deep neural network, the Kalman filter algorithm, and the Hungarian algorithm associates the target images of English teachers in adjacent frames, which realizes the detection and tracking of English teacher targets in the classroom teaching video. Fourth, the classroom behavior dataset of English teachers is constructed, the HRNet human posture estimation model is used to produce the teacher skeletal information dataset, and the graph attention network (GAT) is introduced into the multi-stream adaptive graph convolutional network (MS-AAGCN) to construct the classroom teaching behavior recognition model for English teachers. Finally, the model’s performance is evaluated and its application effect in real teaching scenarios is examined.
English classroom teaching video data are acquired mainly by web crawler technology or manual download, using keywords related to English classroom teaching. Next, target tracking is performed on each English classroom teaching video to extract the continuous images of each English teacher. Finally, the continuous images are manually annotated to obtain the English teacher classroom behavior video dataset as well as the student classroom behavior picture dataset. The acquisition process for English classroom teaching video data is shown in Figure 1.

Classroom teaching video data acquisition process
To process the English classroom teaching videos, all videos are first placed in one folder, and a corresponding new folder is created for each video to store all of its target images. Target tracking is then performed on each frame, which yields a target ID and the position coordinates of the target image. For each ID, a new folder is created to store all consecutive images of that target, and the target images are numbered by frame. After all the videos have been processed, an initial dataset is obtained. All the consecutive images of each target are then manually labeled, and either 16-32 frames are taken as a sample video of an English teacher’s classroom teaching behavior or a single frame is taken as a sample image.
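The folder organization and clip sampling described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `id_XXX` folder naming, and the choice of non-overlapping 16-frame clips are all assumptions made for the example, and the tracker output is represented simply as `(frame_index, track_id)` pairs.

```python
from collections import defaultdict

def group_by_track(detections):
    """detections: iterable of (frame_index, track_id) pairs from a tracker.
    Returns {track_id: sorted list of frame indices}."""
    tracks = defaultdict(list)
    for frame_index, track_id in detections:
        tracks[track_id].append(frame_index)
    return {tid: sorted(frames) for tid, frames in tracks.items()}

def crop_path(video_name, track_id, frame_index):
    """Hypothetical layout: <video>/<track id>/<frame number>.jpg"""
    return f"{video_name}/id_{track_id:03d}/{frame_index:06d}.jpg"

def slice_clips(frames, clip_len=16):
    """Split one track into non-overlapping clips of clip_len consecutive frames.
    A gap in the frame numbers discards the incomplete run and starts a new one;
    leftovers shorter than clip_len are dropped."""
    clips, current = [], []
    for f in frames:
        if current and f != current[-1] + 1:
            current = []  # gap in the track: drop the incomplete run
        current.append(f)
        if len(current) == clip_len:
            clips.append(current)
            current = []
    return clips

# Example: track 1 is visible for 40 frames, track 2 for only 10.
detections = [(i, 1) for i in range(40)] + [(i, 2) for i in range(10)]
tracks = group_by_track(detections)
print(len(slice_clips(tracks[1])), len(slice_clips(tracks[2])))
```

Any clip length in the 16-32 frame range stated above could be passed as `clip_len`; each resulting clip would then be labeled manually.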
To recognize the behavior of each English teacher target in the classroom teaching video, it is first necessary to detect and track the English teacher targets in the video and extract the spatio-temporal features of each target. This paper uses a target detection algorithm based on a cross-stage local network and a pyramid attention network to detect English teacher targets in the classroom teaching video in real time, and then uses a target tracking algorithm based on a deep neural network, the Kalman filter algorithm, and the Hungarian algorithm to associate the images of the English teacher targets in adjacent frames, thereby realizing the detection and tracking of English teacher targets in the classroom teaching video.
Detecting and localizing English teacher targets in classroom teaching videos is key to recognizing English teachers’ teaching behaviors. In this paper, a deep learning-based video target detection algorithm is used to detect English teacher targets in classroom teaching videos in real time. The deep learning network structure for target feature learning and detection contains a cross-stage local network for extracting target features from classroom teaching video frames, a pyramid attention network for fusing multi-layer features, a convolutional network for computing the target category, and a border regression network for calculating the prediction boxes of English teacher targets. The structure of the deep learning-based real-time target detection network for classroom English teachers is shown in Fig. 2.

The classroom teachers target real-time detection network structure
Where the convolution process of each layer of the cross-stage local network for target feature learning in classroom teaching video frames is shown in equation (1):
Where:
After obtaining the target features of the classroom teaching video frames, the high-level features obtained from the convolutional neural network learning are fused using a feature fusion network, and then the fused features are input into a convolutional network for target detection, and the confidence level of the corresponding category is calculated as shown in Equation (2):
The probability of having an object,
Where: (
Meanwhile, the coordinate position of the target frame is obtained by calculation, then the loss function is shown in Equation (6):
Where:
First, all English teacher target images are extracted using the target detection algorithm, and each English teacher target image is input into a simple appearance embedding model to obtain the appearance features of each English teacher. The Kalman filter algorithm is then used to predict the appearance features of each English teacher in the next frame, and, after the size and position of the English teacher targets in the next frame image have been detected, the Hungarian algorithm is used to perform the matching. In this way, each English teacher’s target is associated across frames to achieve classroom English teacher target tracking. The flow of the classroom English teacher target tracking algorithm is shown in Figure 3.
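The predict-then-match loop can be illustrated with a deliberately simplified sketch. It is not the tracker used in the paper: it works on box positions only and omits the appearance embedding, replaces the Kalman filter with a plain constant-velocity shift, and finds the optimal assignment by brute force over permutations as a stand-in for the Hungarian algorithm (adequate for the one or two teachers typically in frame). All function names and the `min_iou` threshold are assumptions.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def predict(box, velocity):
    """Constant-velocity prediction: shift the box by (vx, vy) pixels."""
    vx, vy = velocity
    return (box[0] + vx, box[1] + vy, box[2] + vx, box[3] + vy)

def associate(predicted, detected, min_iou=0.3):
    """Return {track_index: detection_index} minimising the total (1 - IoU) cost.
    Assumes len(predicted) <= len(detected); brute force replaces Hungarian."""
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(detected)), len(predicted)):
        cost = sum(1.0 - iou(predicted[t], detected[d]) for t, d in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    # Reject matches whose overlap is too small to be the same target.
    return {t: d for t, d in enumerate(best) if iou(predicted[t], detected[d]) >= min_iou}

# Two tracks predicted into the next frame, then matched to two detections.
tracks = [predict((0, 0, 10, 10), (2, 0)), predict((50, 50, 60, 60), (0, 0))]
detections = [(50, 50, 60, 60), (2, 0, 12, 10)]
print(associate(tracks, detections))
```

For larger numbers of targets, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement for the permutation search.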

The flowchart of the classroom teacher target tracking algorithm
This chapter first constructs an English teacher behavior dataset from the acquired teaching videos of real classroom scenarios, and then proposes an English teacher teaching behavior recognition model based on multi-stream graph convolutional networks. Building on teacher target detection and tracking, the model recognizes English teachers’ classroom teaching behaviors from the human behavioral features extracted from the English teacher behavior dataset. It is combined with a graph attention module, which helps the model attend to and learn the connection relationships between the key points of the body and further enhances its performance in recognizing teacher behaviors.
The first step in constructing a teacher behavior dataset is to determine the categories of teacher behaviors in the dataset. With reference to the Student-Teacher (S-T) Behavior Analysis Method, the Classroom Teaching Behavior Analysis System (TBAS), and related studies, this paper selects teacher behaviors that appear frequently in classroom teaching, have pedagogical significance, and are easy to distinguish visually. Six behaviors are included: blackboard writing, operating multimedia, walking around, knowledge explanation, media explanation, and routine standing.
The production of the dataset mainly includes four steps: collecting data, screening and cleaning the data, data preprocessing, and manual labeling.
The data used in this paper mainly come from real classroom videos recorded in our classrooms and teaching videos made public on the Internet, including a total of 1,000 hours of teaching videos from one hundred English teachers. All the videos collected are teacher-centered, i.e., the teacher is the only one in the picture, and it is guaranteed that the podium, the lectern, the blackboard and the multimedia electronic screen are included in the captured picture.
To ensure data quality, this study preprocesses the original classroom video data and manually crops out video clips of 1-10 s that meet the requirements. Each video clip contains only one defined teacher behavior category, i.e., each clip has exactly one category label. To limit the influence of factors such as a teacher’s own appearance on the behavior recognition results, each teacher’s behavioral clips of each category appear no more than five times in the dataset. After labeling, the video clips were saved and named by category in the format category label_category name_video number.mp4.
Meanwhile, the samples of each class were randomly divided into a training set and a validation set in proportions of 80% and 20%, respectively. Comparison with public datasets and with the teacher behavior datasets used in existing research shows that the English teacher behavior recognition dataset constructed in this paper meets the needs of this study in terms of data volume.
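A per-class 80/20 split over clips named in the format above might look like the following sketch. The parsing assumes that the category label and the video number contain no underscores; the example file names, the fixed random seed, and the function names are illustrative assumptions rather than details from the paper.

```python
import random
from collections import defaultdict

def parse_label(filename):
    """'3_knowledge explanation_0042.mp4' -> ('3', 'knowledge explanation', '0042')."""
    stem = filename.rsplit(".", 1)[0]
    label, rest = stem.split("_", 1)
    name, number = rest.rsplit("_", 1)
    return label, name, number

def stratified_split(filenames, train_ratio=0.8, seed=0):
    """Randomly split the clips of each class into training and validation lists."""
    by_class = defaultdict(list)
    for f in filenames:
        by_class[parse_label(f)[0]].append(f)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_class):
        files = sorted(by_class[label])
        rng.shuffle(files)
        cut = round(len(files) * train_ratio)
        train += files[:cut]
        val += files[cut:]
    return train, val

# Hypothetical file list: 3 classes with 10 clips each.
files = [f"{c}_class{c}_{i:04d}.mp4" for c in range(3) for i in range(10)]
train, val = stratified_split(files)
print(len(train), len(val))
```

Splitting within each class, rather than over the pooled clips, keeps the class proportions of the training and validation sets the same.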
Optical flow is the instantaneous speed of pixel motion in a sequence image, and can be used to measure how image pixels change over time. Optical flow is usually generated by the motion of the target or the camera in the image, or both. Human behavior is captured by the camera as a video, and the human body and its background can be shown as a continuous flow of images in the video, and the movement of the human body can lead to the generation of optical flow. Therefore, optical flow features can characterize human behavior. The two-dimensional vector field formed by calculating the instantaneous velocity of the image pixels one by one is called the optical flow field.
Two assumptions need to be satisfied for the calculation of optical flow: first, the brightness of the image is constant when calculating the optical flow, i.e., the same pixel has the same brightness between the preceding and following frames. Second, the motion in continuous time is “small motion”, which can be represented by the change of gray scale.
Assuming that
Expand the right-hand side of Eq. (11) in terms of Taylor series as:
Let
Where,
Equation (15) forms an overdetermined solution to the problem, and the optical flow vector can be found using the least squares method.
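As an illustration of the least-squares step, the sketch below assumes the standard Lucas-Kanade setting: every pixel i in a small window contributes one brightness-constancy constraint Ix[i]*u + Iy[i]*v = -It[i], where Ix, Iy, and It are the spatial and temporal image derivatives and (u, v) is the optical flow vector shared by the window. These symbol names are assumptions and may differ from the paper's notation. With more pixels than unknowns the system is overdetermined, and the 2x2 normal equations give the least-squares solution.

```python
def lucas_kanade(Ix, Iy, It):
    """Least-squares solution (u, v) of Ix[i]*u + Iy[i]*v = -It[i] over a window,
    obtained from the 2x2 normal equations."""
    sxx = sum(a * a for a in Ix)
    syy = sum(b * b for b in Iy)
    sxy = sum(a * b for a, b in zip(Ix, Iy))
    sxt = sum(a * t for a, t in zip(Ix, It))
    syt = sum(b * t for b, t in zip(Iy, It))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # All gradients point the same way: the flow is not uniquely determined.
        raise ValueError("gradients are degenerate (aperture problem)")
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

# Synthetic window: derivatives generated from a known flow of (1.5, -0.5).
Ix = [1.0, 0.0, 1.0, 2.0]
Iy = [0.0, 1.0, 1.0, -1.0]
It = [-(a * 1.5 + b * -0.5) for a, b in zip(Ix, Iy)]
print(lucas_kanade(Ix, Iy, It))
```

On real frames the derivatives would come from finite differences of the gray-scale images, and the recovered flow is only approximate because the two assumptions above hold only approximately.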
Spatio-temporal points of interest refer to the points that change more significantly in the spatio-temporal domain of the video, which are not considered based on the human body itself, but in some ways can reflect the details of the video motion, and thus some characteristics of human behavior. In addition, spatio-temporal points of interest can also describe some appearance information in the video. To acquire spatio-temporal interest point features, two steps must be taken: detecting spatio-temporal interest points and describing spatio-temporal interest points.
The main idea of spatio-temporal interest point detection is to consider the video as a three-dimensional function, and then utilize a mapping function to map the video from high to low dimensions, i.e., from three dimensions to one dimension, and finally solve for the local maxima, which are the spatio-temporal interest points of the video, in one-dimensional space. Commonly used detection methods include 3D Harris corner point detection, which is extended from the 2D Harris corner point detection method and contains information about spatial structure and temporal dimension. The specific calculation is as follows:
In the first step, the video is transformed into a linear scale space by a scale transformation. The temporal and spatial scales of the video are determined by the extent of the behavior and have a large impact on the detection of spatio-temporal interest points. To adapt to scale changes in the video, the scale transformation is carried out in both time and space before detection, which can be expressed as:
Where:
In the second step, a Gaussian window is employed to slide in the video, and the changes in the window before and after the sliding are compared to determine the spatio-temporal interest points.
In the third step, the eigenvalues of matrix
where
Detecting spatio-temporal interest points from videos usually has the problem of low number because the interest points need to satisfy the conditions of both temporal and spatial domains. This problem is overcome by the dense trajectory algorithm, whose main idea is to densely sample the video at multiple scales and then track the feature points to form a feature trajectory.
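For the third step of 3D Harris detection described earlier, a common form of the corner response is H = det(μ) − k·trace(μ)³, where μ is the 3x3 spatio-temporal second-moment matrix; in terms of its eigenvalues this equals λ1λ2λ3 − k(λ1 + λ2 + λ3)³. The sketch below assumes that form; the constant k = 0.005 and the function name are assumptions, not values taken from this paper.

```python
def harris3d_response(mu, k=0.005):
    """mu: 3x3 second-moment matrix (nested lists) of the x, y, t gradients.
    Returns det(mu) - k * trace(mu)**3; large positive values mark interest points."""
    (a, b, c), (d, e, f), (g, h, i) = mu
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    trace = a + e + i
    return det - k * trace ** 3

# Strong variation along x, y and t gives a positive response ...
print(harris3d_response([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
# ... while a point with no temporal variation gives a negative one.
print(harris3d_response([[1, 0, 0], [0, 1, 0], [0, 0, 0]]))
```

The second call shows why few points qualify: a spatial corner that does not change over time has a zero eigenvalue and is rejected.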
After the interest points have been detected by the spatio-temporal interest point detector, they must also be described structurally and statistically for feature characterization, and various feature descriptors have been proposed for this purpose. The Jets local descriptor computes the first- through fourth-order partial derivatives at each spatio-temporal interest point and concatenates them in order to form the feature vector. This feature mainly captures the motion and appearance information of the interest point in the spatio-temporal domain. The Cuboid descriptor takes the interest point as the center, expands it in the temporal and spatial dimensions to construct a cube, and combines the gradients of all the pixels in the cube into a one-dimensional feature vector, which is then reduced in dimension using PCA. The main idea of the HOG/HOF feature descriptor is to compute the gradient at the interest point, divide the region according to the gradient direction, and accumulate the gradient magnitudes of the feature points to form a feature vector.
The human skeleton can be regarded as an abstract representation of the human body that captures human behavior without interference from complex environments. The human skeleton is usually composed of joints, and commonly used skeleton models with 18 joints or 25 joints are shown in Fig. 4. The joints of the 18-joint model are numbered 0-17 and denote, in order: Nose, Neck, RShoulder, RElbow, RWrist, LShoulder, LElbow, LWrist, RHip, RKnee, RAnkle, LHip, LKnee, LAnkle, REye, LEye, REar, and LEar, where the prefix R stands for right and L for left. Compared with the 18-joint model, the 25-joint model adds MidHip, LBigToe, LSmallToe, LHeel, RBigToe, RSmallToe, and RHeel.

Common skeleton model
After organizing the English teacher behavior video dataset, it is also necessary to construct a teacher behavior dataset based on skeletal information in order to achieve effective recognition of English teacher classroom teaching behaviors. By referring to the existing behavior recognition datasets based on skeletal information, this paper chooses to use the HRNet human posture estimation model to produce the teacher skeletal information dataset.
The overall structure of the multi-stream graph convolutional network is shown in Fig. 5. Its core structure is a stack of nine basic modules, whose number of output channels rises from 32 to 256.
Graph Convolution Module. The behavior recognition method based on the graph convolutional network first extracts the skeletal keypoints with the human pose estimation algorithm HRNet. The input to the graph convolutional network is the human body topology graph, i.e., the extracted keypoints connected according to the connectivity of the human body joints [22]. The connections between individual human keypoints in the same video frame are used as spatial edges, and the edges between the same keypoints in neighboring frames are used as temporal edges. A topology graph is constructed from the natural connections of the human body, and the feature map in the network is actually a tensor. Given a topological graph of the human skeleton, the convolution in a graph convolutional network can be expressed as:
Where:
Attention Module. The Graph Attention Network (GAT) is a neural network for processing graph-structured data that improves on the GCN and consists of several graph attention layers [23]. The multi-stream adaptive graph convolutional network MS-AAGCN embeds the STC attention module into the graph convolutional layers to help the model learn to attend selectively to joints, frames, and channels, and these attention weights are learned and updated together with the other parameters [24]. This data-driven approach increases the flexibility and generalization ability of the model. In this paper, a graph attention layer is added to the MS-AAGCN network so that the model focuses on important nodes and automatically learns and optimizes the connectivity relationships between nodes, which improves its expressive power. First, define a feature transformation matrix

The overall structure of the multi-stream graph convolutional network
Where
Where ‖ denotes vector concatenation,
After getting the attention coefficient
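The attention coefficients of a graph attention layer can be sketched in plain Python. The sketch follows the standard GAT formulation, which is assumed here to match the lost equations: each node feature h_i is transformed by a matrix W, the score of neighbor j for node i is LeakyReLU(a · [W h_i ‖ W h_j]), and the scores are normalized with a softmax over the neighbors of i. The names `W`, `a`, and `neighbors` and the LeakyReLU slope of 0.2 are assumptions.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def attention_coefficients(features, neighbors, W, a):
    """features: one feature vector per node; neighbors[i]: indices adjacent to
    node i, including i itself; a: attention vector of length 2 * len(W).
    Returns alpha, where alpha[i][j] is the normalized attention of i on j."""
    z = [matvec(W, h) for h in features]
    alpha = []
    for i, nbrs in enumerate(neighbors):
        # z[i] + z[j] is list concatenation, i.e. [W h_i || W h_j].
        scores = {j: leaky_relu(sum(x * y for x, y in zip(a, z[i] + z[j]))) for j in nbrs}
        m = max(scores.values())  # subtract the max for numerical stability
        exps = {j: math.exp(s - m) for j, s in scores.items()}
        total = sum(exps.values())
        alpha.append({j: e / total for j, e in exps.items()})
    return alpha

# Toy skeleton of three keypoints connected in a chain 0 - 1 - 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 0.0]
for row in attention_coefficients(features, neighbors, W, a):
    print(row)
```

Each row of coefficients sums to one, so the layer output for a node is a weighted average of its transformed neighbor features.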
The Time Attention Module (TAM) is calculated as:
The Channel Attention Module (CAM) helps the model to enhance the descriptive features (channels) based on the input samples, and it generates the attention computationally as follows:
where
Four modalities were trained: the bone stream, the bone motion stream, the keypoint stream, and the keypoint motion stream.
The joints near the center of gravity of the body are defined as source keypoints and those far from the center of gravity as target keypoints, and the vector pointing from a source keypoint to its target keypoint represents one bone. In the human body topology graph, the number of keypoints is one more than the number of bones, because the root node has no corresponding source keypoint, so the root node is assigned an empty bone with a value of zero. The bone flow is denoted by
The motion information of bones and keypoints is the difference between the same bones and keypoints in two neighboring frames, and the bone motion flow and keypoint motion flow are denoted by B-M and
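The bone and motion modalities can be derived from the keypoint coordinates as in the following sketch. The `parents` list, which gives the source keypoint of every keypoint, and the 2D coordinates are illustrative assumptions; the actual skeleton layout depends on the pose estimator.

```python
def bone_stream(joints, parents):
    """joints: list of (x, y) per keypoint for one frame; parents[k] is the
    source keypoint of k, or None for the root, which gets a zero bone."""
    bones = []
    for k, (x, y) in enumerate(joints):
        p = parents[k]
        # A bone points from the source keypoint to the target keypoint.
        bones.append((0.0, 0.0) if p is None else (x - joints[p][0], y - joints[p][1]))
    return bones

def motion_stream(sequence):
    """Frame-to-frame difference of a per-frame stream (keypoints or bones)."""
    return [
        [(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev, cur)]
        for prev, cur in zip(sequence, sequence[1:])
    ]

# Toy skeleton: keypoint 0 is the root, 1 hangs off 0, 2 hangs off 1.
parents = [None, 0, 1]
frames = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 2.0), (1.0, 3.0)],
]
bones = [bone_stream(f, parents) for f in frames]
print(bones[0])               # bone stream of the first frame
print(motion_stream(frames))  # keypoint motion stream
print(motion_stream(bones))   # bone motion stream
```

A sequence of T frames yields T − 1 motion frames; implementations commonly pad one zero frame so that all four streams have the same length.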
The four modal data in the multi-stream framework are fed into four separate streams, each of which gives a separate outcome score, and the results of the four streams are weighted and summed to get the final predicted score and give the predicted behavioral labels. A multilayer spatio-temporal graph convolution operation is performed on the topological map to extract high-level features, and the output feature of the last layer of the network is
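The score-level fusion of the four streams can be sketched as a weighted sum followed by an argmax. The stream names, the per-class scores, and the equal weights below are hypothetical values chosen for illustration; the paper does not state the weights here.

```python
def fuse_scores(stream_scores, weights):
    """stream_scores: {stream_name: [score per class]};
    weights: {stream_name: weight}.
    Returns (fused score list, index of the predicted class)."""
    n = len(next(iter(stream_scores.values())))
    fused = [0.0] * n
    for name, scores in stream_scores.items():
        for c, s in enumerate(scores):
            fused[c] += weights[name] * s
    return fused, max(range(n), key=fused.__getitem__)

# Hypothetical per-class scores of the four streams for one clip (3 classes).
scores = {
    "keypoint": [0.1, 0.7, 0.2],
    "bone": [0.2, 0.5, 0.3],
    "keypoint_motion": [0.6, 0.3, 0.1],
    "bone_motion": [0.3, 0.3, 0.4],
}
weights = {name: 1.0 for name in scores}
fused, predicted = fuse_scores(scores, weights)
print(fused, predicted)
```

Here the keypoint motion stream alone would have chosen class 0, but the fused score selects class 1, which is the kind of disagreement the multi-stream design is meant to resolve.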
The model evaluation metrics used in this study are mainly classification accuracy and classification recall. The curves of classification accuracy and cross-entropy loss of the English teacher classroom teaching behavior recognition method on the training set are shown in Fig. 6, where the shaded area marks the region in which the two curves cross.

Classification accuracy and cross entropy loss curve on the training set
As the number of iterations increases, the loss function value tends to decrease slowly, then rapidly, and finally slowly to gradually reach a flat state. The classification accuracy, on the other hand, rises slowly, then rapidly, and finally rises slowly to reach a plateau. After 14000 iterations, the loss function value of the model tends to 0 and the accuracy rate tends to 100%, which indicates that the model has been well trained.
During model training, the model is evaluated on the validation set at the end of each training cycle. The curves of classification accuracy and cross-entropy loss of the English teacher behavior recognition method on the validation set are shown in Figure 7.

Classification accuracy and cross entropy loss curve on the validation set
As can be seen from Fig. 7, when the model is first evaluated on the validation set, the validation accuracy is slightly higher than the training accuracy, which indicates underfitting on the training set. As the number of training cycles increases, the validation accuracy stays below the training accuracy and gradually improves; the underfitting problem is resolved, and the model reaches a normal fit and a more ideal state.
On the CTTAV validation set, the recognition effects of the English teacher teaching behavior recognition model proposed in this paper and other excellent behavior recognition models are shown in Table 1.
Comparison of model recognition effect
| Method | Input size | Accuracy (%) | Pose extraction |
|---|---|---|---|
| C3D | 16×128×171 | 76.34 | No |
| I3D | 32×256×256 | 80.16 | No |
| SlowOnly | 8×256×256 | 53.25 | No |
| R3D-34 | 48×112×112 | 80.63 | No |
| R2plus1D-34 | 48×112×112 | 73.14 | No |
| MS-AAGCN | 48×56×56 | 85.49 | Yes |
| HRNet+MS-AAGCN+GAT (Ours) | 48×56×56 | 87.61 | Yes |
As can be seen from Table 1, the teacher teaching behavior recognition method proposed in this paper achieves a recognition accuracy of 87.61% on the English teacher classroom teaching behavior dataset, which is 2.12 percentage points higher than that of the MS-AAGCN model embedded with the STC attention module. This indicates that combining the graph attention module with the human skeleton topology graph makes fuller use of the human behavioral feature information and achieves a better recognition effect. Similarly, compared with C3D, I3D, SlowOnly, R3D-34, and R2plus1D-34 on the teacher behavior dataset, the model proposed in this paper improves the accuracy by 11.27, 7.45, 34.36, 6.98, and 14.47 percentage points, respectively. The teacher behavior analysis network model constructed in this paper therefore has a better recognition effect on the English teacher classroom teaching behavior dataset than the other models.
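The accuracy gaps quoted above follow directly from Table 1 and can be re-derived with a few lines; the dictionary simply restates the accuracy column of the table.

```python
# Accuracy (%) of each method, copied from Table 1.
accuracy = {
    "C3D": 76.34,
    "I3D": 80.16,
    "SlowOnly": 53.25,
    "R3D-34": 80.63,
    "R2plus1D-34": 73.14,
    "MS-AAGCN": 85.49,
    "HRNet+MS-AAGCN+GAT": 87.61,
}

ours = accuracy["HRNet+MS-AAGCN+GAT"]
# Difference in percentage points between the proposed model and each baseline.
gaps = {
    name: round(ours - acc, 2)
    for name, acc in accuracy.items()
    if name != "HRNet+MS-AAGCN+GAT"
}
for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f}")
```

Because the differences are between two percentages, they are absolute gaps in percentage points rather than relative improvements.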
The confusion matrix of this paper’s model, giving the predicted share of each behavior category, is shown in Figure 8. X1~X6 denote the six classroom teaching behaviors of English teachers: blackboard writing, operating multimedia, walking around, knowledge explanation, media explanation, and routine standing, respectively.

Validation set confusion matrix
As can be seen in Figure 8, the highest classification recall was obtained for blackboard writing and operating multimedia, at 98% and 93%, respectively. The next highest were media explanation and walking around, both above 80%. Knowledge explanation and routine standing were the lowest, both at 79%. Knowledge explanation and media explanation are very similar in behavioral expression and are confused with each other most often: the most frequent misclassification of knowledge explanation is media explanation, and the most frequent misclassification of media explanation is knowledge explanation. In addition, during knowledge explanation the teacher stands facing the students, which is almost the same as routine standing except for differences in the hands and face. Because the sample size for routine standing was smaller than that for knowledge explanation, a larger portion of routine standing was misclassified as knowledge explanation. Walking around is a moving-state behavior, but movement also appears in all the other actions during sampling, and the number of walking-around samples is small, so the model cannot recognize and classify this action well. The characteristics of blackboard writing and operating multimedia are more distinctive: blackboard writing is usually a one-handed writing behavior with the back to the students, and operating multimedia is a stationary behavior facing the students with the body leaning forward, so the classification accuracy of these two is the highest. Overall, the classification performance of the current model on teachers’ teaching behaviors is sufficient to support the analysis of classroom teachers’ teaching behaviors.
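Per-class recall and the most frequent confusion for each class can be read off a confusion matrix as in the sketch below. The 3x3 matrix of counts is a hypothetical example, not the data behind Figure 8, and the function names are assumptions.

```python
def per_class_recall(confusion):
    """confusion[i][j]: number of samples of true class i predicted as class j.
    Recall of class i is the diagonal entry divided by the row sum."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(confusion)]

def most_confused_with(confusion, i):
    """Index of the wrong class that true class i is most often predicted as."""
    candidates = [j for j in range(len(confusion[i])) if j != i]
    return max(candidates, key=lambda j: confusion[i][j])

# Hypothetical counts for three classes with 100 samples each.
confusion = [
    [98, 2, 0],
    [5, 79, 16],
    [3, 14, 83],
]
print(per_class_recall(confusion))
print(most_confused_with(confusion, 1), most_confused_with(confusion, 2))
```

In this toy matrix classes 1 and 2 are each other's most frequent misclassification, the same mutual pattern described above for knowledge explanation and media explanation.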
In order to verify the practicality of the English classroom teaching behavior target detection and identification model proposed in this paper, classroom teaching behavior data were collected and analyzed from university English teachers at the experimental site in City A using the developed quantitative calculation and evaluation system for teachers’ teaching behavior. The video collection task lasted for one semester, and data collection was conducted for three types of English courses, namely English speaking, English reading, and English writing, involving eight teachers.
Each lesson video was segmented into 5-second time slices, and the English teachers' teaching behaviors over the whole lesson were analyzed, including facial orientation while teaching (seconds), smiling expression while teaching (seconds), teaching gestures (seconds), body distance within the proximity range of students (seconds), and the number of interactions (times).
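The slice-based time statistics can be sketched as follows. This is an illustrative sketch only: it assumes that some upstream recognizer has already assigned a set of behavior labels to each 5-second slice, and the function names and label strings are hypothetical rather than taken from the system described in this paper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

SLICE_SECONDS = 5  # slice length stated in the text


def slice_bounds(total_seconds: float, step: int = SLICE_SECONDS) -> List[Tuple[float, float]]:
    """Split a lesson of total_seconds into consecutive (start, end) slices."""
    bounds = []
    start = 0
    while start < total_seconds:
        bounds.append((start, min(start + step, total_seconds)))
        start += step
    return bounds


def behavior_durations(slice_labels: Iterable[Set[str]], step: int = SLICE_SECONDS) -> Dict[str, int]:
    """Accumulate seconds per behavior: each slice showing a behavior adds one step.

    Simplification (an assumption): a shorter final slice still counts as a full step.
    """
    counts: Counter = Counter()
    for labels in slice_labels:
        counts.update(labels)  # each label in the set is counted once per slice
    return {name: n * step for name, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical recognizer output for three consecutive slices.
    labels_per_slice = [{"facing students", "gesture"}, {"gesture"}, set()]
    print(slice_bounds(12))
    print(behavior_durations(labels_per_slice))
```

Counting in whole slices keeps the automatic statistics on the same time grid as an annotation made at the same interval, so the two can be compared slice by slice.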
To verify that the method proposed in this study is consistent with teachers' actual teaching behaviors, the model detection results for the three English courses were compared with ground-truth values. The ground-truth values were recorded by six trained professionals who manually annotated the recorded videos with the help of annotation tools, using the same recording interval as the deep learning-based target detection and recognition model of this study. A comparison of the model's automatic detection results with the manual annotation results is shown in Figure 9, where (a) to (d) give the time statistics for student-oriented recognition, hand gesture recognition, near-student body distance recognition, and smile expression recognition, respectively.

Figure 9. Comparison of the model detection results and the manual annotation results
As can be seen from Figure 9, the duration of each behavior obtained by the recognition method proposed in this paper is basically consistent with the results obtained by manual annotation, which supports the validity and reliability of the proposed method and indicates that it can be applied effectively in real scenarios. It can also be seen that, owing to the characteristics of the different English courses, the proportion of class time in which teachers use gestures differs. In oral English teaching, the teacher spends more time explaining to the students.
Instructional gestures and evaluative gestures were used more in reading courses, whereas congruent gestures and symbolic gestures were more prevalent in English speaking and writing classes. Relatively more time was spent on student-based practice in English reading and English writing classes. Expressive language was used more in the English speaking and English writing classes, while the English reading class was lecture-based and made relatively less use of expressive language. These results indicate that the method proposed in this paper is valid and reliable and can be used effectively in real-life scenarios.
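One simple way to quantify how closely the model-derived durations agree with the manually annotated ones is a per-behavior relative error. The sketch below is an assumption about how such a check could be done, not the evaluation procedure used in this paper; the function name and the example durations are hypothetical.

```python
from typing import Dict


def relative_errors(model: Dict[str, float], manual: Dict[str, float]) -> Dict[str, float]:
    """Relative error |model - manual| / manual for each behavior in the manual annotation.

    Behaviors with a manual duration of zero are skipped to avoid division by zero.
    """
    errors = {}
    for name, truth in manual.items():
        if truth == 0:
            continue
        errors[name] = abs(model.get(name, 0.0) - truth) / truth
    return errors


if __name__ == "__main__":
    # Hypothetical durations in seconds, for illustration only.
    model_seconds = {"facing students": 1900.0, "gesture": 950.0}
    manual_seconds = {"facing students": 2000.0, "gesture": 1000.0}
    for behavior, err in relative_errors(model_seconds, manual_seconds).items():
        print(f"{behavior}: {err:.1%}")
```

A small relative error on every behavior corresponds to the "basically consistent" outcome reported for Figure 9; a large error on a single behavior would point to the recognizer that needs attention.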
In this paper, we constructed an English teacher target detection and tracking model as well as a classroom teaching behavior recognition model to investigate the teaching behavior of English teachers in the classroom. The main conclusions are as follows:
First, the model was trained and its recognition results were compared with those of other models. After 14000 iterations, the loss function value of the model approaches zero and the accuracy approaches 100%. As the number of training cycles increases, the validation set classification accuracy remains lower than the training set accuracy but improves gradually; the underfitting problem is resolved and a normal fit is reached, which indicates that the model is well trained. On the English teacher classroom teaching behavior dataset, the recognition accuracy of the teaching behavior recognition model proposed in this paper reaches 87.61%. Compared with the C3D, I3D, SlowOnly, R3D-34, R2plus1D-34, and MS-AAGCN models, the recognition accuracy of this paper's model is improved by 11.27%, 7.45%, 34.36%, 6.98%, 14.47%, and 2.12%, respectively. This demonstrates the effectiveness of the model in recognizing English teachers' classroom teaching behaviors.
Second, the target detection and teaching behavior recognition model for English teachers was applied to real English teaching scenarios. The behavior durations obtained by the model are basically consistent with those obtained by manual annotation, which indicates that the model can be applied effectively in real scenarios.
