Virtual Learning Environment and Intelligent Interaction Design for the Construction of an Online Platform for Teaching Aesthetic Education in Colleges and Universities under a Three-Stage Fusion Environment
Published: 24 Mar 2025
Received: 21 Oct 2024
Accepted: 10 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0756
© 2025 Haiyan Gu, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Aesthetic education is an important part of the quality education system. Its core goal is to nourish students' minds, cultivate the ability to discover, appreciate and create beauty, refine students' aesthetic taste, and encourage them to actively create beauty in their lives [1-4]. Against the background of the three-stage fusion environment and the digital era, educational activities have begun to move toward digitalization, breaking away from the traditional single teaching mode and innovating teaching through digital means and methods, above all through the construction of an online platform for teaching aesthetic education [5-8]. The platform's virtual learning environment and intelligent interaction design allow students to perceive beauty, experience its connotations, master the relevant knowledge and skills, deepen their understanding of traditional aesthetic ideas, and enhance their cultural self-confidence [9-12].
Virtual learning environments and intelligent interaction design are inseparable from virtual reality and artificial intelligence technology. The goal of designing and developing a virtual reality interactive learning environment is to provide learners with an immersive learning experience. Virtual reality technology can simulate real environments and create a variety of scenarios and situations within them, allowing learners to place themselves inside. This immersive experience greatly enhances learners' sense of participation and engagement and improves learning outcomes [13-16]. Intelligent interaction is a discipline spanning human-computer interaction, user experience and artificial intelligence; it enables efficient, flexible and intelligent communication between humans and machines and is of great significance for improving the quality of education [17-19].
This paper describes the theory of "three-stage fusion" and applies big data, cloud computing, the Internet of Things and other supporting technologies, focusing on how to construct the "platform" component. Based on a cloud computing platform, we design the architecture of a three-dimensional virtual learning environment for teaching aesthetic education in colleges and universities, and use programming languages such as JSP and Java to realize the environment and its functions. Students' past aesthetic education learning records are analyzed statistically with the TF-IDF algorithm. The cosine similarity calculation is weighted by introducing the course category percentage, and user profiles are integrated to construct the personalized learning resource recommendation method of this paper. A Kinect sensor is used as the input device and combined with Unity 3D to design virtualized body posture and voice interaction in the intelligent interaction module. Experiments are designed to verify the virtual learning environment and intelligent interaction design methods of this paper, and to explore the reasonableness of applying them to the construction of an online platform for teaching aesthetic education in colleges and universities.
In recent years, the state has promulgated regulations reflecting the importance of aesthetic education in students' comprehensive development, requiring improvement in the quality of aesthetic education teaching, deepening of teaching reform, and promotion of aesthetic education as a whole. Art disciplines in institutions of higher learning therefore need to give full play to their cluster advantages. On this basis, the "three-stage fusion" aesthetic education teaching model is proposed, in which the podium, the platform and the stage are combined into one. In the context of three-stage fusion, aesthetic education for non-art students is carried out in strict compliance with the requirements and laws of aesthetic education teaching, highlighting students' central position, closely linking the podium, platform and stage, and scientifically constructing a mechanism that connects and integrates them.
Firstly, teachers deliver aesthetic knowledge and skills through podium presentations, implementing theoretical teaching in conjunction with curricular learning to promote general aesthetic knowledge among non-art students. Secondly, the platform plays a role in aesthetic practice that should not be underestimated: it is the main venue of aesthetic practice, using experiential learning to carry out practical teaching and enriching the teaching pathways available to non-art students. Finally, the stage refers to the effective application of the acquisitive learning approach, giving full play to its guiding role and implementing the teaching test, which helps non-art students internalize the results of aesthetic education.
Cloud computing provides the online platform with powerful computing resources and storage capacity, enabling it to cope easily with large-scale concurrent student access and complex data processing needs. Through virtualization technology, the resources of physical servers are pooled and allocated so that users need not concern themselves with building and maintaining the underlying infrastructure and simply obtain the required computing resources on demand, which greatly reduces platform construction and operation costs while improving the flexibility and scalability of the system.
Internet of Things (IoT) technology connects various devices (e.g., sensors, intelligent terminals, etc.) to the Internet, realizing the information interaction between people and things, and between things and things. In the online platform for teaching aesthetic education in colleges and universities, IoT technology can be used to collect students’ learning behavior data, environmental data, etc., providing data support for personalized teaching and intelligent interaction. For example, the physiological state of students during the process of art creation is monitored through smart wearable devices, so as to adjust the teaching rhythm and content difficulty to achieve the best learning effect.
Big data technology is capable of storing, managing and analyzing massive, diverse data. On the platform, big data can be used to analyze students' learning habits, interest preferences and knowledge mastery, uncover latent educational patterns and student needs, and provide a basis for precise teaching, resource recommendation and teaching evaluation. Through in-depth analysis of historical data, teachers can identify problems in the teaching process in a timely manner and improve teaching methods and strategies in a targeted way.
According to the architecture of the 3D virtual learning environment, this paper designs a simulation prototype, i.e., a cloud computing platform, whose architecture is shown in Fig. 1. A high-performance 3D virtual learning environment can be constructed on this platform.
Figure 1. Prototype design of the cloud computing platform
Among them, the servers mainly include login servers, space servers, simulators, scenario file servers, file servers, database servers and web servers, distributed across storage and application nodes respectively. The servers in the cluster run the Red Hat 9.0 Linux operating system, and the parallel environment is provided by MPICH 2.0.
At the far left is the cloud client: the user interacts with the prototype system through an IE browser with embedded JSP, Java and other plug-ins. Services are packaged so that functionality and business logic are realized mainly by middleware technology. The operation process of the simulation prototype is described below.
First, when the user performs an operation, the cloud client makes a judgment, and different types of operations are submitted to the main server for processing in different ways. When the user logs in, registers, manages accounts or orders services, the main server compares, enters or retrieves information in the database according to the type of request and returns the requested information. Second, when the user runs a business application, which is not a simple operation, the master server performs load balancing according to the condition of each node and assigns the application and computing nodes best suited to the user's application, so that the operation achieves the highest efficiency. The functional design of the 3D virtual learning environment is introduced concretely below through the designed simulation prototype, i.e., the cloud computing platform.
The concept of a three-dimensional virtual learning environment is very broad. It can be delivered through text, graphics, images, sound, video and other multimedia forms (such as electronic PPT courseware, lesson plan documents, teaching case videos and MP3 voice resources), or presented as three-dimensional objects that support real-time interaction (such as virtual experiments, modeling demonstrations and virtual campuses). The three-dimensional virtual learning environment for teaching aesthetic education in colleges and universities discussed in this paper mainly provides the following functions:
- User registration and login. The user registers with a unique account, supplies a nickname and gender for the learning environment, and chooses a favorite virtual image as an avatar. After system validation, the user is accepted as a legitimate user. After successful registration, the user can log in through the login page and enter the three-dimensional virtual learning environment.
- Selection of 3D virtual scenes. After logging in, learners can choose a virtual learning scene according to their interests, such as a virtual teacher simulating real classroom teaching, where the teacher organizes students for instruction with online PPT playback and real-time voice explanation.
- Virtualized body roaming. Learners control their avatar's behavior with the mouse or keyboard: flying, walking or teleporting to other virtual scenes, such as entering the virtual classroom, thematic discussion area or online classroom, and exiting the system.
- Online communication. One-to-one, one-to-many and many-to-many real-time interaction is provided between teachers and students, among classmates and among teachers, exchanging ideas and opinions in text or voice. This supports real-time chat, personal logs, log messages, community announcements, user shout-outs and so on.
- Personal data maintenance. Through the personal database, users maintain their own data, including updating basic information, changing the account password and replacing the avatar.
Data such as students' basic information, learning history, interests and learning styles are collected and transformed into a format that Python can process, so as to obtain user labels for constructing user profiles. In the process of extracting course labels, the course text is first segmented into words, and the TF-IDF algorithm is used to weight the segmentation results [20] and locate the course label keywords, from which the user portrait is constructed. After segmentation, the label keywords of a course are obtained; for example, "Principles of Aesthetic Education in Colleges and Universities" yields ["colleges and universities", "principles of aesthetic education"], and these two keywords constitute the vector space of the course, representing the characteristics of the learning resource. By counting the label keywords of all the courses a user has studied, the user's labels and their weights can be obtained.

As the number of courses a user takes increases, the number of course labels also increases, but a label's share of the user profile should not grow linearly with the number of occurrences: a label that recurs across many of a user's courses should have its weight damped rather than accumulated linearly.
Through the above method, the user profile can be constructed by utilizing the user attributes and historical learning information, and the user profile is represented using the representation method of vector space model. Next, the relevant recommendation algorithms are implemented and improved to generate a recommendation candidate set, after which the fit between the user profile and the items in the candidate set is calculated, and the most suitable learning resources for the user are filtered out.
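For illustration, the following Python sketch shows how course labels could be extracted with TF-IDF and aggregated into a weighted, normalized user profile vector; the function names, the use of scikit-learn and the pre-segmented course titles are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): building a user profile
# from segmented course titles with TF-IDF.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

def build_user_profile(studied_courses):
    """studied_courses: list of whitespace-segmented course titles."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(studied_courses)  # courses x terms
    vocab = vectorizer.get_feature_names_out()

    # Accumulate each label's TF-IDF weight over all studied courses.
    profile = defaultdict(float)
    for row in tfidf:
        for idx, weight in zip(row.indices, row.data):
            profile[vocab[idx]] += weight

    # Normalize so the profile is a unit vector in label space.
    norm = sum(w * w for w in profile.values()) ** 0.5
    return {label: w / norm for label, w in profile.items()}

# Example usage with (hypothetical) pre-segmented titles:
profile = build_user_profile([
    "colleges_and_universities principles_of_aesthetic_education",
    "colleges_and_universities art_appreciation",
])
```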
In this paper, a user-based collaborative filtering algorithm is selected to realize collaborative filtering. For user similarity, classical measures such as the Pearson correlation coefficient and cosine similarity are usually used. The dataset used in this paper contains only overall course scores; since Pearson similarity requires user ratings of individual items to participate in the calculation, it is not applicable here, so this paper experiments with cosine similarity and Jaccard similarity and makes targeted improvements. After the data is processed, the cosine similarity [21] can be calculated using Equation (3):

$$\operatorname{sim}(u,v)=\frac{|N(u)\cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}} \tag{3}$$

where $N(u)$ and $N(v)$ denote the sets of courses studied by users $u$ and $v$, respectively.
Equation (3) simplifies the calculation, but it is not optimized for the characteristics of the user's studied courses; although it can achieve good results on some data, its stability is not high. This paper therefore adjusts the weights by introducing the course category percentage, as shown in Equation (4):

$$\operatorname{sim}'(u,v)=\frac{\sum_{i\in N(u)\cap N(v)} p_{c(i)}}{\sqrt{|N(u)|\,|N(v)|}} \tag{4}$$

Eq. (4), compared with Eq. (3), introduces the course category percentage $p_{c(i)}$, i.e., the proportion that category $c(i)$ occupies among the user's studied courses, so that shared courses in a user's dominant subject categories contribute more to the similarity.
The Jaccard similarity [22] measures the degree of association between two sets of Boolean values; computationally, it is simply the ratio of the intersection of the two sets to their union. Again, to amplify the effect of the major course disciplines on the similarity, it is optimized using Equation (5):

$$\operatorname{sim}_J'(u,v)=\frac{\sum_{i\in N(u)\cap N(v)} p_{c(i)}}{|N(u)\cup N(v)|} \tag{5}$$
Experiments using cosine similarity and Jaccard similarity yield similar recommendation results, both in line with users' original learning preferences. However, repeated experiments show that the subject-category distribution of the results obtained with the improved cosine similarity is closer to the category distribution of the user's studied courses, i.e., the recommendations better match the user's original interests. Therefore, in the similarity calculation of this algorithm, this paper selects the optimized cosine similarity as the calculation method.
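Under the reconstruction of Eq. (4) given above, a minimal Python sketch of the category-weighted cosine similarity might look as follows; `category_of` and the other helper names are hypothetical.

```python
# Hedged sketch of the category-weighted cosine similarity (Eq. (4)
# as reconstructed above); names are illustrative.
from math import sqrt

def category_share(courses, category_of):
    """Fraction of a user's courses falling in each category."""
    share = {}
    for c in courses:
        cat = category_of[c]
        share[cat] = share.get(cat, 0.0) + 1.0 / len(courses)
    return share

def weighted_cosine(courses_u, courses_v, category_of):
    nu, nv = set(courses_u), set(courses_v)
    if not nu or not nv:
        return 0.0
    share_u = category_share(courses_u, category_of)
    # Each shared course contributes the weight of its category in u's history.
    overlap = sum(share_u.get(category_of[c], 0.0) for c in nu & nv)
    return overlap / sqrt(len(nu) * len(nv))
```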
In the learning resource recommendation algorithm that fuses user profiles, labeled user profiles are fused with the recommendation algorithms and learning resources: the fit between the user and each learning resource in the recommendation candidate set is calculated, and the N items with the highest fit are recommended to the user.
When improving and realizing the recommendation algorithm, to ensure the effectiveness of the results, each single algorithm is limited to at most 10 recommendations ranked by similarity score, reducing redundant and invalid recommendations. After mixing the results, the candidate set therefore contains at most 30 items, and the fit only needs to be calculated between these 30 candidate resources and the user to obtain the final recommendations, which greatly reduces the amount of computation compared with computing the fit between all resources and the user. The fit between resource $i$ and user $u$ is calculated as in Equation (6):

$$\operatorname{fit}(u,i)=\frac{\vec{U}\cdot\vec{I}}{\|\vec{U}\|\,\|\vec{I}\|} \tag{6}$$

In Eq. (6), $\vec{U}$ is the label-weight vector of the user profile and $\vec{I}$ is the label vector of resource $i$; the larger the value, the better the resource matches the user's interests.
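A hedged sketch of this candidate-set re-ranking step, assuming the cosine-style fit of Eq. (6) as reconstructed above; `fit`, `recommend` and the binary resource label vectors are illustrative choices, not the paper's code.

```python
# Hedged sketch of re-ranking the mixed candidate set by profile fit.
from math import sqrt

def fit(profile, resource_labels):
    """Cosine fit between a user profile dict and a resource's label set."""
    # Resource vector is binary over its labels; the profile carries weights.
    dot = sum(profile.get(label, 0.0) for label in resource_labels)
    pnorm = sqrt(sum(w * w for w in profile.values()))
    rnorm = sqrt(len(resource_labels))
    return dot / (pnorm * rnorm) if pnorm and rnorm else 0.0

def recommend(profile, candidates, top_n=10):
    """candidates: dict mapping resource id -> set of label keywords
    (at most ~30 items after mixing the algorithms' outputs)."""
    ranked = sorted(candidates, key=lambda r: fit(profile, candidates[r]),
                    reverse=True)
    return ranked[:top_n]
```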
In this paper, avatars are designed in an immersive 3D virtual learning environment, using the Kinect sensor as the input device and implementing avatar interaction in the Unity 3D virtual reality engine. Since Unity 3D does not directly support the Kinect SDK, middleware is needed for the connection. Commonly used middleware includes Kinect with MS-SDK, CMU's Kinect Wrapper for Unity 3D, OpenNI's OpenNI Unity Toolkit and the Zigfu Development Kit. Because Kinect with MS-SDK is very simple and can be imported directly into Unity 3D for development, it is chosen in this paper to complete the initialization and processing of Kinect data.
Since the movement of the virtualized body is realized by controlling the joint points of the skeleton, the 3D model of the virtual character with a skeleton is first imported into the Unity 3D scene. The depth image within the field of view is then captured by the Kinect sensor. The Kinect SDK NUI evaluates the acquired depth image at the pixel level, separates the human body contour from the background image and recognizes 32 parts of the human body. The recognized parts are analyzed by machine learning algorithms, from which 20 skeletal joints of the human body and related information are inferred. The distribution of human skeletal joints that Kinect can capture is shown in Figure 2. Through the Kinect with MS-SDK middleware, the 20 skeletal joint points captured by Kinect are mapped one by one onto the skeletal joint points of the virtual character's 3D model in Unity 3D, so that the virtual character moves along with the real person, realizing virtualized body interaction.
Figure 2. Distribution of human skeletal joints captured by Kinect
In this paper, human body pose information is collected with the Kinect sensor, which acquires the depth image of the object and tracks the 3D coordinates of 20 skeletal joints of the human body. This not only avoids the influence of lighting on pose recognition but also eliminates the step of separating the human body from a complex background. Once the 20 skeletal joints are obtained, the pose of the human body can be determined by calculating the distances and angles between joints.
Assuming that the 3D coordinates of joints $i$ and $j$ are $(x_i, y_i, z_i)$ and $(x_j, y_j, z_j)$, the Euclidean distance between them is

$$d_{ij}=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2+(z_i-z_j)^2}$$
For poses in which joint points cross, such as folding both hands across the chest, it is sufficient to compute the distance between the left-hand joint and the right-elbow joint and the distance between the right-hand joint and the left-elbow joint, and compare them with a preset threshold; if both distances are below the threshold, the pose can be considered hands-folded-on-chest. For some postures, however, the distances between joint points alone cannot determine the pose accurately, and the angles between joint points must also be calculated.
The distances $a$, $b$ and $c$ between three related joints form a triangle, as illustrated in Figure 3. By the law of cosines, the angle $\theta$ at the middle joint satisfies

$$\cos\theta=\frac{a^{2}+b^{2}-c^{2}}{2ab}$$
Figure 3. Calculation of the angle between joint points
The magnitude of $\theta$ reflects how the limb bends at the middle joint; during recognition it is compared with preset angle values to determine the pose.
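A minimal sketch, assuming the Euclidean distance and law-of-cosines formulas reconstructed above, of how joint distances (e.g., for the hands-on-chest check) and joint angles might be computed; function names and coordinates are illustrative.

```python
# Hedged sketch: joint distance and angle from 3D joint coordinates,
# following the Euclidean distance and law-of-cosines steps above.
from math import sqrt, acos, degrees

def joint_distance(p, q):
    """Euclidean distance between two (x, y, z) joint coordinates."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def joint_angle(outer1, middle, outer2):
    """Angle (degrees) at `middle` in the triangle of three joints."""
    a = joint_distance(outer1, middle)
    b = joint_distance(middle, outer2)
    c = joint_distance(outer1, outer2)
    return degrees(acos((a * a + b * b - c * c) / (2 * a * b)))

# Example: angle at the left elbow from shoulder, elbow and wrist joints.
angle = joint_angle((0.2, 1.4, 2.5), (0.3, 1.1, 2.5), (0.3, 0.8, 2.5))
```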
Before pose recognition, this paper presets the joint-point angles and an angle threshold for each pose; the size of the angle threshold determines the recognition accuracy. The whole recognition process is illustrated below with the example of raising the left hand. The joints associated with this action are the left wrist, the left elbow and the left shoulder. When the angle between the left shoulder joint and the left elbow joint is 0° and the angle between the left elbow joint and the left wrist joint is 90°, the posture can be assumed to be raising the left hand, provided a suitable threshold is set. After the human posture information is captured with Kinect, it is compared with the preset posture information: if the difference between the measured shoulder-elbow angle and its preset value is within the threshold, and the difference between the measured elbow-wrist angle and its preset value is also within the threshold, the pose match succeeds and the pose is recognized as raising the left hand; if either angular difference exceeds the threshold, the match fails and matching must be repeated, as shown in Equation (10):

$$\left|\theta_{1}-\hat{\theta}_{1}\right|\leq T \;\wedge\; \left|\theta_{2}-\hat{\theta}_{2}\right|\leq T \tag{10}$$

In Equation (10), $\theta_{1}$ and $\theta_{2}$ are the measured shoulder-elbow and elbow-wrist angles, $\hat{\theta}_{1}$ and $\hat{\theta}_{2}$ are the corresponding preset angles, and $T$ is the angle threshold.
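The threshold test of Eq. (10) could be sketched as follows; the preset angles follow the raise-left-hand example above, while the function name and the 10° threshold are assumptions.

```python
# Hedged sketch of the threshold test of Eq. (10) as reconstructed above.
def matches_pose(measured, preset, threshold):
    """measured/preset: dicts of named joint angles in degrees."""
    return all(abs(measured[name] - preset[name]) <= threshold
               for name in preset)

# Preset for "raise left hand": shoulder-elbow 0 deg, elbow-wrist 90 deg.
RAISE_LEFT_HAND = {"shoulder_elbow": 0.0, "elbow_wrist": 90.0}

ok = matches_pose({"shoulder_elbow": 4.2, "elbow_wrist": 86.5},
                  RAISE_LEFT_HAND, threshold=10.0)  # True: within 10 deg
```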
The computer first recognizes human voice information, converts it into commands and sends them to the intelligent character, which completes the corresponding tasks according to the received commands. Since the Kinect sensor has significant advantages in audio processing and speech recognition, this paper uses Kinect to collect the user's speech and completes speech recognition through the SAPI provided by the Kinect for Windows SDK. The intelligent character's voice interaction includes speech-to-text recognition and speech emotion recognition.
Since the raw captured audio is of low quality, audio processing is performed before speech recognition. After Kinect captures the audio, it applies algorithms such as echo cancellation, automatic gain control and noise suppression to improve audio quality. Sound returning from the loudspeaker to the microphone creates an echo; Kinect eliminates it by extracting the user's voice pattern and then extracting the matching audio from the received signal on the basis of this pattern. Automatic gain control keeps the amplitude of the user's voice consistent over time, avoiding effects on speech recognition caused by differences in the user's position. Noise suppression removes ambient noise from the audio signal, allowing Kinect to capture the user's voice more clearly and accurately.
Speech recognition can be categorized into continuous speech recognition and isolated-word speech recognition according to how the speech is spoken. Isolated-word speech recognition recognizes separated words and phrases; it has high recognition accuracy, although it limits what the speaker can say and is less natural and convenient than continuous speech recognition. Since this paper aims to control the intelligent character's movement through voice commands, only isolated-word speech recognition is required.
After starting the Kinect device, the first step is to initialize the audio data stream and register the speech recognition engine, which is the core part of speech recognition and is used to analyze and interpret the audio data stream. Then it is also necessary to create an XML grammar file in which all the required voice commands are written. After writing the grammar XML file, it is loaded into the speech recognition engine, and then the voice commands captured by Kinect are matched with the predefined voice commands in the XML file; if the match is successful, the next step of event processing is performed, and then it continues to recognize new voice commands issued by the user. If the match fails, this voice data is discarded.
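As an illustrative stand-in for the SAPI/XML grammar pipeline just described (not the actual Kinect for Windows SDK API), isolated-word command matching can be sketched in Python; the command set, function name and confidence handling are assumptions.

```python
# Hedged sketch of the isolated-word command flow described above.
COMMANDS = {
    "forward": "MOVE_FORWARD",
    "back": "MOVE_BACK",
    "turn left": "TURN_LEFT",
    "turn right": "TURN_RIGHT",
    "stop": "STOP",
}

def handle_utterance(recognized_text, confidence, min_confidence=0.7):
    """Match a recognized phrase against the predefined command set."""
    if confidence < min_confidence:
        return None                      # discard low-confidence audio
    command = COMMANDS.get(recognized_text.strip().lower())
    if command is not None:
        return command                   # dispatch to the avatar
    return None                          # match failed: discard, re-listen
```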
In this paper, under the "three-stage fusion" environment, the virtual learning environment and intelligent interaction functions of an online platform for teaching aesthetic education in colleges and universities are constructed using big data, cloud computing and other technologies. To verify the effectiveness of the proposed methods, experimental simulations are carried out on the personalized learning resource recommendation, pose recognition and other technologies in the design, and the effectiveness of this paper's methods on the online platform for teaching aesthetic education in colleges and universities is studied.
The experiments use a real dataset from an E-learning environment, the Aesthetic Education 2017-2018 (Ae) dataset, collected from an online platform for teaching aesthetic education at a university. The dataset contains implicit information about learners' interactions with the online learning platform and its learning resources: 359 learners, 1,386,471 interaction messages and 5,237.16 total credit hours, with a sparsity of 77.62%. To evaluate the performance of the algorithm, the dataset is divided into training and test sets in a 7:3 ratio.
The rating sparsity is defined as follows:

$$\text{sparsity}=\left(1-\frac{|R|}{|U|\times|I|}\right)\times 100\%$$

where $|R|$ is the number of observed ratings, $|U|$ the number of learners and $|I|$ the number of learning resources.
In E-learning environments, some information about learners, either explicitly or implicitly, is collected from their active sessions by observing their behavior and interaction with the platform. In order to make the data suitable for mining and implementing recommendations, the raw dataset needs to be cleaned and preprocessed. Implicit learning ratings are represented by a value between 0 and 5.
The main purpose of the experiment is to test the prediction accuracy of the proposed method, and MAE is chosen for evaluation, defined as follows:

$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|\hat{r}_{i}-r_{i}\right|$$

where $\hat{r}_{i}$ is the predicted rating, $r_{i}$ is the actual rating and $N$ is the number of predictions; the smaller the MAE, the higher the prediction accuracy.
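A minimal sketch of the MAE computation defined above, on paired predicted and actual implicit ratings (0-5 scale):

```python
# Minimal sketch of the MAE evaluation on held-out ratings.
def mae(predicted, actual):
    """Mean absolute error over paired rating lists."""
    assert len(predicted) == len(actual) and predicted
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

print(mae([3.8, 1.2, 4.6], [4.0, 1.0, 5.0]))  # 0.266...
```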
The algorithms of this paper are compared under different similarity metrics: Ours-Pearson, Ours-Tanimoto, Ours-Jaccard and Ours-Cosine. The first experiments were performed to find the best value of the number of nearest neighbors, K, on the static and incremental datasets; the results are shown in Fig. 4. To determine the optimal K for this paper's method on the Ae dataset, the experimental range is controlled between 0 and 300. As K increases, there is a significant difference in the MAE values of this paper's algorithm under the different similarity measures. On the Ae dataset, the MAE values of the Pearson and Tanimoto similarity measures are significantly higher than those of the Jaccard and Cosine measures, while the Cosine measure achieves smaller MAE values when the number of nearest neighbors is greater than 180, i.e., the Ours-Cosine method possesses higher prediction accuracy. Overall, by varying the number of nearest neighbors, Ours-Cosine performs better than the other methods on the Ae dataset.
Figure 4. Comparison of the MAE values of different similarity metrics with varying K
Figure 5 shows the results of experiments conducted under the four similarity metrics with K set to 30, 60, 180 and 200. For the Pearson and Tanimoto similarity metrics, the MAE values are higher under all four K values, indicating that on the Ae dataset these two metrics cannot achieve the desired prediction results. Jaccard's MAE values at K = 30, 60 and 180 are lower than those of the Pearson and Tanimoto methods, remaining at a comparatively low level; however, when K exceeds 180, Jaccard's MAE increases significantly. The MAE of the Cosine method stays in the low range of 0.5~0.8 under all four K values and is smallest at K = 180. The experimental results show that selecting the Cosine similarity metric yields higher prediction accuracy and more stable model performance, demonstrating the reasonableness of choosing it as the similarity metric for learning resource recommendation in this paper.
Figure 5. Comparison of the MAE values of different similarity metrics at fixed K values
Fig. 6 compares the prediction accuracy of this paper's learning resource recommendation method incorporating user profiles with the item-based method (Item), the deep-learning-based method (Deep-learning) and the hybrid method (Hybrid) on the Ae dataset. As the figure shows, this paper's method has a lower MAE than the other three methods and therefore higher prediction accuracy; it can recommend more accurate and personalized learning resources for learners on the online platform for teaching aesthetic education in colleges and universities, fully satisfying the diversified learning needs of different students in the process of learning aesthetic education.
Figure 6. Comparison of prediction accuracy of different methods
Expanding on the poses proposed in related studies, a database containing 60 whole-body poses is established to validate the effectiveness of the pose recognition method in this paper.
- Offline experiment. Six subjects (3 males and 3 females) were recruited. The Kinect device was placed horizontally, 50 cm from the ground, with the drive motor angle at positive 20 degrees and a white wall as background. Subjects faced the Kinect device with their whole body in the field of view, 250 cm away from it, and performed the 60 postures in sequence. For subjects 2~6, 110 samples were collected per posture, 6,600 samples each; for subject 1, 220 samples were collected per posture, totaling 13,200 samples. 60% of subject 1's pose data was used for training and the other 40% for testing; all pose data of subjects 2~6 were used for testing. To find the number of joint angle features giving the highest accuracy, the number of features was gradually increased from 5 to 30. With the optimal number of joint angle features fixed, the number of pose classes was then gradually increased from 2 to 60 to analyze the accuracy of this paper's method.
- Online experiment. Using the optimal number of joint angle features from the offline experiments and this paper's algorithm, a real-time pose recognition system was established; the Kinect body sensor captures the coordinate data of 20 joints at 50 frames per second. In real-time recognition, 100 frames of coordinate data are captured continuously, 26 joint angles are extracted from each frame and fed into the recognition model, the occurrences of each pose across the 100 frames are counted, and the pose with the most occurrences is taken as the recognized pose (a majority vote, sketched below). Subject 1 plus 5 newly recruited subjects (3 male, 2 female) took part in the online experiment under the same setup: the Kinect device placed horizontally 50 cm from the floor, the drive motor at positive 20 degrees, a white wall background, and subjects 250 cm away with their entire body in the field of view, performing the 60 poses in sequence.
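The 100-frame majority vote of the online experiment might be sketched as follows; `classify_frame` stands in for the trained recognition model and is an assumption.

```python
# Hedged sketch of the 100-frame majority vote used in the online experiment.
from collections import Counter

def recognize_pose(frames, classify_frame):
    """frames: list of per-frame joint-angle feature vectors (100 frames).
    classify_frame: model mapping one feature vector to a pose label."""
    votes = Counter(classify_frame(f) for f in frames)
    pose, _count = votes.most_common(1)[0]
    return pose  # the pose predicted most often across the window
```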
The number of joint angle features gradually increases from 5 to 30 to analyze the pose recognition accuracy and the results are shown in Figure 7. With the increase in the number of joint angle features, the accuracy rate shows an overall increasing trend. The accuracy rate is highest when the number of joint angle features is 28, which is 97.51%. When the number of joint angle features is greater than 28, the accuracy rate tends to be stable.
Figure 7. Pose recognition accuracy with different numbers of joint angle features
The number of pose types increases gradually from 2 to 60 to analyze the pose recognition accuracy and the results are shown in Figure 8. With the increase in the number of pose types, the overall trend of pose recognition accuracy is decreasing. When the number of pose types is 3, 10, 13, 15, 18 and 26, the accuracy reaches 100%. When the number of pose types does not exceed 29, the accuracy maintains a high level, and when it exceeds 29, the accuracy begins to decline.
Figure 8. Pose recognition accuracy with different numbers of pose types
The results of the online experiment are shown in Table 1: subject 1 achieved an accuracy of 98.43%, and the accuracies of the five newly recruited subjects are all above 95%. This shows that the Kinect-based pose recognition method in this paper can recognize learners' poses on the aesthetic education teaching platform with high accuracy, laying a solid foundation for intelligent interaction on the platform.
Table 1. Accuracy of real-time pose recognition
| Subject number | Accuracy/% |
|---|---|
| 1 | 98.43 |
| 7 | 95.28 |
| 8 | 96.72 |
| 9 | 95.97 |
| 10 | 96.03 |
| 11 | 97.48 |
The speech interaction module designed in this paper provides speech-to-text recognition and speech emotion recognition. The following datasets are used to validate the voice interaction module.

- AISHELL-1 Chinese speech dataset. For the speech-to-text recognition task, this paper uses AISHELL-1 for model training and validation. The dataset's audio was recorded by 400 speakers from different accent regions of China, with a total length of 178 hours at a 16 kHz sampling rate. The texts cover 11 domains, such as unmanned driving, education and teaching, and industrial production. The transcripts have a 99% accuracy rate, and the corpus is used in speech recognition, voiceprint recognition and other fields.
- Homemade speech emotion classification dataset. The speech emotion recognition function is based on classification, and classification models demand high-quality datasets; however, in Chinese speech emotion recognition, open-source datasets are scarce and their emotion taxonomies are inconsistent. Therefore, based on the partially open-source CASIA Chinese emotion corpus and the Chinese portion of the ESD dataset, this paper obtains a homemade speech emotion classification dataset, CA-ES (CASIA-ESD), by writing classification rules for data cleaning and reclassification. The rules classify only by emotion category, with no speaker differentiation. The final CA-ES dataset contains 22,654 utterances in six emotion categories, detailed in Table 2.
Table 2. Homemade speech emotion classification dataset
| Emotion label | Number of utterances |
|---|---|
| Angry | 3943 |
| Happy | 3826 |
| Neutral | 3794 |
| Sad | 3980 |
| Surprise | 3369 |
| Fear | 3742 |
Speech-to-text recognition is implemented with the TensorFlow framework, with Kinect providing the audio input. The Conformer encoder uses 14 identical Conformer modules, with 5 attention heads and an attention dimension of 250; the feed-forward network dimension is 3042. To prevent overfitting, the model uses a warm-up learning rate strategy with a warm-up learning rate of 0.005, 1000 warm-up steps and an initial learning rate of 0.01. Training is set with batch_size=125 and num_workers=20 on the AISHELL-1 dataset. Considering the model size, only 180 epochs of training are performed, taking about 55 h. The results are shown in Fig. 9, where (a)-(d) denote the training set loss, the test set loss, the training learning rate and the test set word error rate, respectively. Over 180 epochs, the best loss is 0.524 and the best word error rate is 0.058, so the model's recognition accuracy exceeds 95%.
Figure 9. Speech recognition model training results
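Assuming a linear ramp (the exact schedule is not specified in the paper), the warm-up learning-rate strategy with the stated values could be sketched as:

```python
# Hedged sketch of the warm-up ("preheat") learning-rate schedule
# described above, assuming a linear ramp from 0.005 to 0.01.
def warmup_lr(step, warmup_steps=1000, warmup_lr_start=0.005, base_lr=0.01):
    """Ramp linearly from warmup_lr_start to base_lr, then hold base_lr."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_lr_start + frac * (base_lr - warmup_lr_start)
    return base_lr

print(warmup_lr(0), warmup_lr(500), warmup_lr(2000))  # 0.005 0.0075 0.01
```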
In this subsection, model training is performed on the experimental platform using the homemade CA-ES dataset; the results are shown in Fig. 10, where (a)-(d) denote the training set loss, the training set classification accuracy, the test set classification accuracy and the model's confusion matrix, respectively. With the same 180 training epochs, the best loss is 0.264 and the best accuracy exceeds 95%. The confusion matrix shows the correspondence between the model's predictions on the test set and the true labels, together with the per-class accuracy and errors. As the figure shows, except for Surprise, the accuracy of the other five emotions exceeds 90%; the accuracy for Surprise is lower because of the smaller number of Surprise utterances. Overall, the Kinect device can effectively complete speech-to-text recognition and speech emotion recognition on the online aesthetic education teaching platform, significantly improving the experience of this paper's intelligent interaction design.
Figure 10. Speech emotion recognition model training results
Combining all the experimental results and analyses, the improved learning resource recommendation method adopted in this paper's virtual learning environment design can effectively meet students' personalized needs in aesthetic education learning. In the intelligent interaction design, the Kinect-based pose recognition and voice interaction methods also demonstrate superior performance. Applying the virtual learning environment and intelligent interaction module designed in this paper to the online aesthetic education teaching platform of colleges and universities, with a focus on platform construction under "three-stage fusion", can significantly promote the reform of aesthetic education teaching in colleges and universities, raise its level, and develop students' aesthetic abilities.
In this paper, a user-based collaborative filtering algorithm is selected to implement collaborative filtering, integrating user profiles and an improved cosine similarity metric to recommend personalized learning resources to students in the virtual learning environment. The Kinect sensor, combined with Unity 3D and other tools, is used to design virtualized body posture interaction and voice interaction in the intelligent interaction module. The experimental results show that the virtual learning environment and intelligent interaction module designed in this paper can meet the needs of constructing an online platform for teaching aesthetic education in colleges and universities and give full play to the platform advantages of "three-stage fusion", significantly raising the level of aesthetic education teaching and developing students' aesthetic abilities. It is worth noting, however, that the pose recognition algorithm adopted here shows advanced recognition performance only when the number of pose types is below 29; beyond 29 types, recognition accuracy tends to decline. Future work should optimize and improve the pose recognition algorithm and increase its capacity, so as to expand the application scale of the online platform for teaching aesthetic education.
