The Application of Artificial Intelligence Big Model Technology in Mental Health Education Work in Colleges and Universities
Published online: 24 Sep 2025
Received: 14 Jan 2025
Accepted: 01 May 2025
DOI: https://doi.org/10.2478/amns-2025-1019
© 2025 Ying Li et al., published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
In today's rapidly developing information age, the mental health problems of college students have become increasingly prominent and are now a social issue of wide concern. In response, artificial intelligence offers a new direction for research on, and application to, college students' mental health education [1-4]. As an emerging technology, the artificial intelligence (AI) big model has shown strong application potential across many fields; its combination with mental health education to provide students with personalized psychological counseling and support is an area of particular interest [5-7].
An AI big model is a model trained with deep learning and related techniques, giving it powerful data processing and analysis capabilities. In mental health education, AI big models can identify potential psychological problems and give appropriate advice and support by analyzing data such as students' speech, text, and even physiological indicators [8-10]. This personalized approach brings mental health education closer to students' actual needs and helps improve treatment outcomes. Compared with traditional psychological counseling, the AI big model offers personalized treatment at greater speed and efficiency; moreover, it is not affected by emotional fluctuations or subjective factors, and can therefore provide counseling services more objectively [11-14]. Nevertheless, applying AI big models to mental health education carries challenges and risks, chiefly the privacy and data security of individual students and possible bias introduced during modeling, which means the advice and support an individual receives are not always accurate. The algorithms and models of the AI big model must be continuously optimized and improved to make their application in mental health education more effective and reliable [15-18].
In this paper, we first analyze the special needs of dialogue models in the mental health domain with respect to user privacy and utterance sensitivity, together with the technical framework of the model. For the model's empathic reply technology, the 4SPG algorithm is then introduced for paraphrase generation and optimization, the UniLM model is used to extract user semantic features and understand utterance information, and the Copy mechanism is used to fuse the user's key semantics into the generated replies; combined, these yield the UniLM-Copy model. Finally, the model is validated with respect to its response generation quality, semantic understanding accuracy, psychological guidance effect, and feasibility.
Starting from the needs of mental health dialogue, this chapter clarifies the main factors that must be considered and satisfied when constructing a dialogue model for the mental health field. It also describes the basic structure and technical advantages of the Rasa dialogue system framework, taking into account the characteristics of the dialogue subjects in this field.
Developing intelligent conversation plugins for the mental health domain requires attention to special needs, so that the plugins provide accurate, sensitive, and appropriate support when handling mental health-related conversations. In this domain, the user's emotional state strongly affects the quality and effectiveness of the dialogue, so conversation models must recognize and understand the user's emotions and respond to them appropriately. Applying deep learning techniques to sentiment analysis and emotion generation helps the dialogue model perceive and process the user's emotional expressions acutely. Protecting user privacy and information security is equally critical: conversational models must ensure that users' sensitive information is protected and must comply with legal and ethical guidelines for privacy protection. The design of AI models and their data processing should follow relevant privacy protection measures, such as desensitizing sensitive information and adopting secure methods of data transmission and storage.
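To make the desensitization measure concrete, the following is a minimal sketch assuming simple regular-expression rules; the patterns and placeholder labels are illustrative only, and a production system would rely on a vetted PII-detection component covering many more identifier types:

```python
import re

# Hypothetical desensitization rules for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[- ]?\d{4}[- ]?\d{4}\b"),
    "ID":    re.compile(r"\b\d{15,18}\b"),  # ID-card-like digit strings
}

def desensitize(text: str) -> str:
    """Replace sensitive substrings with category placeholders before storage."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(desensitize("Call me at 138-1234-5678 or write to a@b.com"))
# -> "Call me at [PHONE] or write to [EMAIL]"
```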
Rasa is an open-source, task-oriented dialogue system framework implemented in Python, with high flexibility and extensibility. Its designers divide Rasa into two parts: Rasa NLU and Rasa Core. Rasa NLU is responsible for understanding user input and extracting valid intent and entity information from it, while Rasa Core contains modules such as conversation state tracking, conversation strategy, and response generation, focusing on dialogue management tasks.
The architecture of Rasa Dialogue System is shown in Fig. 1. In the Rasa Dialogue System framework, the input text message passes through the Natural Language Processing module, Dialogue State Tracking module, Dialogue Strategy module, and Dialogue Action module in turn, and finally generates the reply content. The natural language processing module transmits the captured intent and entity information to the dialog state tracking module. The dialog state tracking module records and updates the intent and entity information captured during the entire dialog process. The dialog strategy module selects appropriate dialog actions based on the dialog state information. Finally, the dialog action module generates replies and updates the dialog state. The Rasa framework hides the tedious pipeline interfacing and dialog state tracking processes, allowing developers to focus on building individual modules.

Rasa’s Architecture Diagram
Building a complete Rasa project requires writing only a few files, which further reduces the learning cost for developers.
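To make the four-module flow of Fig. 1 concrete, here is a minimal schematic in plain Python. It is illustrative only and does not use Rasa's actual APIs; the intent rules and the response table are toy assumptions:

```python
# Schematic of the Rasa-style pipeline: NLU -> state tracking -> policy -> action.

def nlu(text: str) -> dict:
    """Extract a (toy) intent and entities from the raw user message."""
    if any(w in text for w in ("anxious", "worried", "nervous")):
        return {"intent": "express_anxiety", "entities": []}
    return {"intent": "chitchat", "entities": []}

class Tracker:
    """Records and updates dialogue state across turns."""
    def __init__(self):
        self.turns = []
    def update(self, parsed: dict):
        self.turns.append(parsed)

def policy(tracker: Tracker) -> str:
    """Select a dialogue action based on the current dialogue state."""
    last = tracker.turns[-1]["intent"]
    return "utter_comfort" if last == "express_anxiety" else "utter_default"

ACTIONS = {  # hypothetical response templates
    "utter_comfort": "That sounds hard. Can you tell me more about what happened?",
    "utter_default": "I'm listening. Please go on.",
}

tracker = Tracker()
tracker.update(nlu("I feel anxious about my exams"))
print(ACTIONS[policy(tracker)])
```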
Building on the modeling framework and techniques introduced in Chapter 2, this chapter designs the paraphrase generation model 4SPG, which serves as the core algorithm of the conversation model. It then analyzes the particular advantages of the UniLM model for empathic reply generation and combines it with the Copy mechanism to construct a mental health conversation model (the UniLM-Copy model).
Addressing the main open problems in the field of paraphrase generation, this paper proposes a novel paraphrase generation model, 4SPG, which performs representation learning on input utterances using the mainstream encoder-decoder neural network structure. In particular, this paper proposes a Bidirectional Paraphrase Learning (BPL) framework. The comparison between one-way and two-way paraphrase learning is shown in Fig. 2. Current paraphrase learning frameworks all adopt a one-way learning mode, i.e., the original utterance is used as input to learn the corresponding target utterance. In contrast, the bidirectional paraphrase learning framework introduced here simultaneously learns to generate the original utterance with the target utterance as input. Specifically, the encoder and decoder in this framework receive the same paraphrase pair as input, and the decoder then generates the paraphrase results corresponding to the original and target utterances, respectively. Compared with manually expanding the training dataset under a one-way framework to reuse training data, the high extensibility of the BPL framework lets the model learn a deeper relationship between the original and target utterances.

Unidirectional and bidirectional paraphrase learning
As Fig. 2 shows, on top of the BPL framework this paper proposes a semantic enhancer and a style enhancer based on self-supervised learning, improving semantic accuracy for the encoder and stylistic diversity for the decoder, respectively. The semantic enhancer focuses on the semantic distance between two features and guides the encoder to capture more precise semantic information, thus improving paraphrase accuracy. The style enhancer emphasizes potentially changeable regions through an attention mechanism and, via backpropagation, leads the decoder to generate richer expressions. Notably, these enhancers are designed as auxiliary loss functions and require no additional labeled data.
Specifically, given the word embeddings of the original sentence as input, the encoder produces hidden features, which the method then maps into a shared semantic space; the semantic enhancer measures the distance between the original and target representations in this space and uses it as an auxiliary signal guiding the encoder toward more precise semantics. Unlike previous work on paraphrase generation, the BPL framework allows simultaneous learning from both the original and target inputs, so the encoder and decoder process the two directions of each paraphrase pair in parallel. Unlike the semantic enhancer, textual style transfer is usually reflected in subtle lexical changes. For this reason, this paper implements style enhancement by introducing attention between the original features generated by the decoder and the target representation: the style enhancer computes the difference between the hidden features of the two sides and backpropagates it as an auxiliary loss. Throughout training, the method uses three loss functions in total to measure model performance: the cross-entropy loss of paraphrase generation itself, plus the semantic-enhancement and style-enhancement auxiliary losses, combined into the overall training objective.
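The following is a minimal PyTorch-style sketch of one BPL training step under the assumptions just described: a shared encoder-decoder, a cosine-distance semantic term, and an L2 style term. The `encode`/`decode` interface and the weighting coefficients are hypothetical, since the paper's exact formulas are not reproduced here:

```python
import torch
import torch.nn.functional as F

def bpl_step(model, src_ids, tgt_ids, alpha=0.5, beta=0.5):
    """One bidirectional paraphrase-learning step: learn src->tgt and tgt->src."""
    total = 0.0
    for inp, out in [(src_ids, tgt_ids), (tgt_ids, src_ids)]:  # both directions
        enc = model.encode(inp)                # hidden features of the input
        logits, dec = model.decode(enc, out)   # decoder logits and features
        # Generation loss: standard token-level cross-entropy.
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), out.view(-1))
        # Semantic enhancer (assumed form): pull pooled input/output features together.
        sem = 1 - F.cosine_similarity(enc.mean(1), model.encode(out).mean(1)).mean()
        # Style enhancer (assumed form): penalize distance between decoder features
        # and the target-side representation.
        sty = F.mse_loss(dec, model.encode(out))
        total = total + ce + alpha * sem + beta * sty
    return total
```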
UniLM (Unified pre-trained Language Model), as its name suggests, can train several kinds of language model within a single Transformer by applying different mask strategies to the same set of parameters.
UniLM is built mainly by stacking Transformer blocks. The UniLM model used in this paper stacks 12 Transformer layers, each with a hidden size of 768 and 12 attention heads. Since UniLM shares the structure of BERT-BASE, its parameters can be initialized from a pre-trained BERT-BASE. Each token is masked with a probability of 15%; of the masked tokens, 80% are replaced with [MASK], 10% are randomly replaced with words from the vocabulary, and the remaining 10% are left unchanged. Furthermore, a different number of tokens may be covered each time: with 80% probability a single token is masked, and with 20% probability a span of 2-3 consecutive tokens is masked at once.
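A minimal sketch of this masking scheme follows. It is illustrative only; the toy vocabulary is an assumption, and UniLM's actual implementation differs in detail:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "sad", "help", "exam"]  # toy replacement vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """UniLM-style masking: 15% of positions are selected; with 80% probability
    a single token is covered, with 20% probability a span of 2-3 tokens.
    A covered token becomes [MASK] 80% of the time, a random word 10% of the
    time, and stays unchanged 10% of the time."""
    tokens, i = list(tokens), 0
    while i < len(tokens):
        if random.random() < mask_prob:
            span = 1 if random.random() < 0.8 else random.randint(2, 3)
            for j in range(i, min(i + span, len(tokens))):
                r = random.random()
                if r < 0.8:
                    tokens[j] = MASK
                elif r < 0.9:
                    tokens[j] = random.choice(VOCAB)
                # else: leave the original token unchanged
            i += span
        else:
            i += 1
    return tokens

print(mask_tokens("my classmates hit me and I feel miserable".split()))
```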
UniLM can be trained with three mask strategies, corresponding to the unidirectional language model, the bidirectional language model, and the Seq2Seq language model; they differ fundamentally in the range of information each word in a sequence may access. Taking the Seq2Seq language model as an example, UniLM adapts BERT to generative tasks via the mask matrix. Although the model uses only an encoder architecture, it can still serve the Seq2Seq task, for the following reason: the mask matrix controls which words supply contextual information to which positions, and while the words to be predicted in the target sequence have access to the full context of the input sequence, they can only see the part of the target sequence to their left. Since the empathic response generation technique proposed in this section is based on the Seq2Seq language model, its self-attention mask mechanism is discussed below.
In the pre-training phase, the user's statement or question and the counselor's empathic response are organized into the form of contextual clauses and fed into the model. If a masked token of the Seq2Seq language model belongs to Segment 1 (the original text sequence, i.e., the user's statement or question), it may attend only to the other tokens of its own segment and cannot attend to tokens in Segment 2 (the target sequence, i.e., the counselor's empathic response). If the masked token belongs to Segment 2, it can attend both to all tokens of Segment 1 and to the tokens to its left within its own sequence.
For example, take the two sentences "My classmates hit me" and "It's miserable". For the Seq2Seq language model, the input is constructed as "[CLS] classmate hit me [SEP] [MASK] miserable [SEP]".
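The self-attention mask corresponding to such an input can be sketched as follows (a minimal NumPy illustration; the segment lengths follow the toy example above rather than any actual UniLM code):

```python
import numpy as np

def seq2seq_mask(len1: int, len2: int) -> np.ndarray:
    """Build a UniLM-style Seq2Seq attention mask.
    Rows = attending positions, columns = attended positions;
    1 means visible, 0 means blocked."""
    n = len1 + len2
    mask = np.zeros((n, n), dtype=int)
    mask[:, :len1] = 1              # every position sees all of Segment 1
    for i in range(len1, n):        # Segment 2 additionally sees itself
        mask[i, len1:i + 1] = 1     # and its left context in Segment 2
    mask[:len1, len1:] = 0          # Segment 1 never sees Segment 2
    return mask

# "[CLS] classmate hit me [SEP]" (5 tokens) + "[MASK] miserable [SEP]" (3 tokens)
print(seq2seq_mask(5, 3))
```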
During decoding, suppose the input statement is "X1". The sample is encoded with the UniLM model: the first row of the resulting feature matrix is the representation of [CLS], the second row is the feature representation of the first token of X1, and so on. At each decoding step, the feature representation at the [MASK] position passes through a linear layer, a Softmax then yields a probability distribution over the vocabulary, and the word with the highest likelihood is generated; the cycle repeats, terminating when [SEP] is generated.
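As a sketch, this decoding loop can be written as below. The `model` and `tokenizer` objects are hypothetical placeholders; UniLM implementations differ in how the [MASK] position is appended at each step:

```python
import torch

def greedy_decode(model, tokenizer, src: str, max_len: int = 40) -> str:
    """Greedy UniLM-style generation: repeatedly predict the word at the
    appended [MASK] position until [SEP] is produced."""
    ids = tokenizer.encode(f"[CLS] {src} [SEP]")
    out = []
    for _ in range(max_len):
        inp = torch.tensor([ids + tokenizer.encode("[MASK]")])
        logits = model(inp)                     # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())   # distribution at the [MASK] slot
        word = tokenizer.decode([next_id])
        if word == "[SEP]":
            break
        out.append(word)
        ids.append(next_id)                     # committed token replaces [MASK]
    return " ".join(out)
```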
Specifically, the attention distribution from the position currently being predicted to the remaining positions is first obtained by multi-head attention, with the mask matrix controlling the range of positions each word may attend to; the feature vector of the current decoder position is then computed as in Eq. (17):

$$\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}+M\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices obtained by linear projection of the hidden states, $d_k$ is the dimension of the keys, and $M$ is the self-attention mask matrix, whose entries are $0$ at visible positions and $-\infty$ at blocked positions. The Softmax function maps the vector of scores produced at the [MASK] position to a probability distribution over the vocabulary, and the cross-entropy loss is then computed between this predicted distribution and the ground-truth token.
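A compact NumPy rendering of Eq. (17), under the same conventions (the additive mask uses $-\infty$ at blocked positions; shapes and data are illustrative):

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with a UniLM-style additive mask.
    mask is 0 where attention is allowed and -inf where it is blocked."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise Softmax
    return weights @ V

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
mask = np.triu(np.full((n, n), -np.inf), k=1)        # left-to-right visibility
print(masked_attention(Q, K, V, mask).shape)          # (4, 8)
```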
To mitigate the problems mentioned at the beginning, namely inaccurate details of generated complex events and the absence of emotional keywords in the generated empathic responses, this paper introduces a generation probability $p_{gen}$ through the Copy mechanism, allowing the decoder at each step either to generate a word from the vocabulary or to copy a word directly from the user's input. The structure of the UniLM-Copy model is shown in Fig. 3.

UniLM-Copy mechanism model structure

The final output distribution at each decoding step is the mixture $$P(w)=p_{gen}P_{vocab}(w)+\left(1-p_{gen}\right)\sum_{i:w_{i}=w}a_{i}$$ where $p_{gen}\in[0,1]$ is the generation probability computed from the decoder state, $P_{vocab}(w)$ is the probability of word $w$ under the decoder's vocabulary distribution, and $a_i$ is the attention weight assigned to the $i$-th token of the input. In this way, event details and emotional keywords appearing in the user's utterance can be copied verbatim into the reply.
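A minimal sketch of this mixture step (tensor shapes are illustrative, and the projection that produces $p_{gen}$ is an assumption in the spirit of standard pointer-generator networks):

```python
import torch

def copy_mixture(vocab_logits, attn_weights, src_ids, p_gen):
    """Mix the vocabulary distribution with a copy distribution.
    vocab_logits: (batch, vocab)   attn_weights: (batch, src_len)
    src_ids:      (batch, src_len) p_gen:        (batch, 1)"""
    p_vocab = torch.softmax(vocab_logits, dim=-1)
    final = p_gen * p_vocab
    # Scatter attention mass onto the vocabulary ids of the source tokens:
    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i
    final = final.scatter_add(-1, src_ids, (1 - p_gen) * attn_weights)
    return final

vocab_logits = torch.randn(2, 100)
attn = torch.softmax(torch.randn(2, 5), dim=-1)
src_ids = torch.randint(0, 100, (2, 5))
p_gen = torch.sigmoid(torch.randn(2, 1))
print(copy_mixture(vocab_logits, attn, src_ids, p_gen).sum(-1))  # ~1.0 each row
```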
This chapter has studied empathic response generation based on the UniLM-Copy model: it first designed and analyzed the principles of the 4SPG paraphrase generation algorithm, and then proposed a mental health model based on UniLM-Copy, which uses the Copy mechanism to incorporate complex event details and emotional keywords into the generated empathic responses. The model generates empathic responses by analyzing the latent emotions in the context of the user's input statements and combining them with mental health knowledge, thereby guiding the user toward a positive mental state.
To test the suitability of this paper's model for mental health work, this section examines the model's performance in response generation, language comprehension and query, and psychological guidance effect and feasibility, designing test experiments and evaluation analyses accordingly.
To evaluate the response generation of this paper's model, this subsection compares its automatic and human evaluation results against the Vanilla prompting method and the CoT prompting method.
The experiments in this paper were conducted on the ESConv dataset. Constructed through crowdsourcing, it is widely used in emotional support conversation research. The dataset contains 1200 long conversations with 38,365 utterances in total, each conversation being an interaction between a help-seeker and a supporter. In each round of dialogue, the supporter responds with a specific emotional support strategy; these strategies are labeled in the dataset in eight categories: questioning, restatement or paraphrasing, reflection of feelings, self-disclosure, affirmation and reassurance, providing suggestions, providing information, and others. Following the previous data split, and to ensure a fair comparison, this paper keeps the test set for effect evaluation and merges the training and validation sets into a retrieval set, without further training of the model. The statistics of the dataset are shown in Table 1.
Statistics of the ESConv dataset

| Parameter | Test set | Retrieval set (training set + validation set) |
|---|---|---|
| Session count | 200 | 1110 |
| Number of utterances | 6037 | 32327 |
| Average rounds per session | 30.95 | 29.26 |
| Average utterance length | 15.70 | 16.55 |
Table 2 presents the automatic evaluation comparison between this paper's method and the other methods. In most cases, this paper's method outperforms both the Vanilla prompting method and the CoT prompting method, indicating that the inference algorithm in this paper is more effective on emotional support tasks.
Automatic evaluation results (units: %)

| Model | Method | BLEU-2 | BLEU-4 | METEOR | Distinct-1 | Distinct-2 |
|---|---|---|---|---|---|---|
| Mistral | +Vanilla | 2.37 | 0.83 | 12.03 | 5.24 | 25.35 |
| | +CoT | 2.47 | 0.81 | 12.38 | 5.15 | 25.84 |
| | +UniLM-Copy | 2.62 | 1.11 | 12.42 | 7.08 | 36.27 |
| Gemma | +Vanilla | 1.97 | 0.76 | 9.23 | 6.74 | 29.17 |
| | +CoT | 1.93 | 0.74 | 9.18 | 6.53 | 28.43 |
| | +UniLM-Copy | 2.11 | 0.80 | 10.23 | 7.17 | 33.72 |
| Llama2 | +Vanilla | 2.19 | 0.75 | 12.88 | 3.57 | 17.38 |
| | +CoT | 2.09 | 0.71 | 12.68 | 3.62 | 17.59 |
| | +UniLM-Copy | 2.22 | 0.74 | 13.84 | 4.23 | 21.07 |
| Llama3 | +Vanilla | 2.44 | 0.94 | 11.23 | 5.77 | 26.04 |
| | +CoT | 2.52 | 1.05 | 10.61 | 5.77 | 24.85 |
| | +UniLM-Copy | 2.68 | 1.02 | 11.71 | 9.26 | 40.66 |
This paper's method shows clear superiority on the METEOR metric, which evaluates generated text not only by phrase matching but also by word form, synonyms, and semantics. This indicates that the method is stronger at the semantic level and in lexical expression, and better captures the core meaning of the reference text rather than relying on simple n-gram matching.
In addition, this paper's method shows advantages in text diversity, performing better than the fine-tuning method on all models. The diversity improvement is not merely a byproduct of longer text: as the comparison of different prompting methods on a single model shows, this paper's method improves diversity further on that basis, and more significantly. The Distinct-2 score reaches 40.66% on the Llama3 model and 36.27% on the Mistral model, because the method attends fully to the user's fine-grained emotional information, allowing the model to incorporate the user's specific state and diversify its responses accordingly.
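For reference, Distinct-n is the ratio of unique n-grams to total n-grams in the generated text; a minimal implementation (tokenization by whitespace is an assumption):

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams / total n-grams over a set of responses."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

replies = ["that sounds really hard", "that sounds tough, tell me more"]
print(f"Distinct-2 = {distinct_n(replies, 2):.2%}")
```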
Emotional support conversations involve both emotional and cognitive aspects whose quality is sometimes difficult to judge with automatic metrics. To overcome this limitation, a human evaluation was introduced to analyze the content of the responses. We randomly selected 300 examples from the test set together with the replies generated by the different methods; three evaluators voted for the reply that excelled in each aspect, and when the three evaluators produced three different votes, a fourth evaluator was brought in to break the tie. The final results are shown in Table 3.
Human evaluation results (units: %)

| Comparison | Evaluation aspect | Win | Lose | Tie |
|---|---|---|---|---|
| Baseline method | Fluency | 43.1 | 22.1 | 35.1 |
| | Identifiability | 65.1 | 11.8 | 23.4 |
| | Comfort | 59.4 | 12.4 | 28.5 |
| | Suggestion | 52.4 | 16.8 | 31.1 |
| | Overall | 60.8 | 15.1 | 24.4 |
| Textual method | Fluency | 49.4 | 14.4 | 36.5 |
| | Identifiability | 63.8 | 9.4 | 27.1 |
| | Comfort | 55.8 | 21.1 | 23.4 |
| | Suggestion | 50.1 | 15.8 | 34.4 |
| | Overall | 58.4 | 19.8 | 22.1 |
As can be seen, this paper's method outperforms the baseline method in all five aspects. It performs especially strongly in identifiability and comfort, scoring 63.8% and 55.8% respectively, indicating that it perceives and recognizes the user's emotional state more acutely and replies with greater empathy. Further analysis shows that its advantage in fluency is not significant. One reason may be that the fine-tuning method already performs well on fluency after training on the dataset; another is that the large model sometimes generates overly long replies, which can make the progression of the dialogue slightly stiff.
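The tie-breaking vote described above can be made precise with a small sketch (the data structures are hypothetical, as the paper does not describe its annotation tooling):

```python
from collections import Counter

def vote(choices, tiebreaker=None):
    """Three evaluators each pick the best reply for an aspect; if all
    three disagree, a fourth evaluator's vote decides."""
    counts = Counter(choices)
    top, freq = counts.most_common(1)[0]
    if freq == 1 and tiebreaker is not None:  # three-way disagreement
        return tiebreaker
    return top

print(vote(["ours", "ours", "baseline"]))         # majority -> "ours"
print(vote(["ours", "baseline", "cot"], "ours"))  # fourth vote -> "ours"
```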
In this section, the NQ (Natural Questions), TQA (TriviaQA), and WQ (WebQuestions) datasets are used for the comparative evaluation of different algorithms on language querying and comprehension: NQ serves the information retrieval and question answering tasks, TQA covers common-sense questions and their answers across several domains, and WQ is used for the question answering task.
Top-K accuracy is a key metric of retrieval effectiveness: over all queries, it measures how often the model places at least one relevant document within the top K results it returns. The performance comparison between this paper's algorithm and the other algorithms is expanded over six dataset columns: TQA Top-100 (X1), NQ Top-100 (X2), NQ Top-20 (X3), TQA Top-20 (X4), WQ Top-100 (X5), and WQ Top-20 (X6).
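A minimal implementation of the metric (the `retriever` interface is a hypothetical placeholder):

```python
def top_k_accuracy(retriever, queries, relevant, k=20):
    """Fraction of queries with at least one relevant document in the top-k.
    `retriever(q, k)` is assumed to return a ranked list of document ids;
    `relevant[q]` is the set of gold document ids for query q."""
    hits = sum(
        any(doc in relevant[q] for doc in retriever(q, k))
        for q in queries
    )
    return hits / len(queries)
```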
Figure 4 shows the Top-K accuracy of all algorithms on these datasets, where S1 is this paper's algorithm, S2 the DPR algorithm, S3 the ANCE algorithm, and S4 the BM25 algorithm.

Retrieval performance comparison
Among all the algorithms, BM25 performs worst on every dataset, which underlines the importance of deep learning for text retrieval. DPR, as the pioneering dense retrieval model, still performs respectably, but holds no advantage over the other dense retrieval models that improve on it. This paper's model, by contrast, outperforms the baseline BM25 in every case: on the Top-20 metric, it exceeds BM25 by 21.4% on the NQ dataset, 15.5% on TQA, and 20.8% on WQ. This shows that the model holds a significant advantage over traditional algorithms in understanding natural language queries.
To establish the feasibility of this paper's model for practical mental health education in colleges and universities, this section conducts an overall effect test and a feasibility test, examining respectively the model's positive guiding effect on students and its stable operation in realistic scenarios. The overall effect test compares the model against similar algorithms, while the feasibility test is carried out in a multi-user scenario.
A total of 500 second-year students across five majors at a university were psychologically assessed, and the 20 students with the highest anxiety scores in each major were selected as experimental subjects and divided by major into groups A, B, C, D, and E. Group A used the conversation model based on the Vanilla prompting method, group B the model based on the CoT prompting method, group C the model based on the DPR algorithm, group D the model based on the ANCE algorithm, and group E the model of this paper.
The test data collected before and after the experiment were analyzed, with mean scores computed over the ten anxiety items of the SCL-90 scale (items 2, 17, 23, 33, 39, 57, 72, 78, 80, and 86).
Since students of different majors differ in population, disciplinary background, and education and teaching, and since the anxiety levels of the five sample groups are not identical, the mean value of these test items is used to reflect each group's anxiety level. The pre-test and post-test results are shown in Figure 5.
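A sketch of how the group score could be computed, assuming each student's responses are stored as a dict of item scores (SCL-90 items are rated 1-5; the paper does not spell out its exact aggregation):

```python
ANXIETY_ITEMS = [2, 17, 23, 33, 39, 57, 72, 78, 80, 86]  # SCL-90 anxiety items

def group_anxiety_score(students):
    """Group mean of each student's anxiety subscale total (ten items rated
    1-5, so totals range from 10 to 50)."""
    totals = [sum(s[i] for i in ANXIETY_ITEMS) for s in students]
    return sum(totals) / len(totals)

before = [{i: 2 for i in ANXIETY_ITEMS} for _ in range(20)]  # toy data
print(group_anxiety_score(before))  # 20.0 with the toy scores above
```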

Experimental test comparison
As can be seen from Figure 5, the anxiety level of every group decreased after using its model, which shows that such models can provide a degree of help for the mental health education of college students. Group E shows the lowest post-test anxiety level: its mean score is only 15.9, a decrease of 3.0 from the pre-experiment mean.
Figure 6 compares the distribution of the five models' psychological guidance effect scores in the experiment (out of 5 points). The data show that this paper's model performs best among similar models on students' psychological guidance, with a score as high as 3.89.

Guidance mode comparison
To simulate real-world usage scenarios, this subsection designs a response time test to verify the model's responsiveness at multiple concurrency levels and to ensure that it still runs stably and smoothly under high load. The response times of this paper's model under different concurrency levels are shown in Table 4.
Response times under different concurrency levels

| Concurrency | 10 | 30 | 50 | 100 | 300 |
|---|---|---|---|---|---|
| Average response time (s) | 0.524 | 0.619 | 0.837 | 1.059 | 1.345 |
| Maximum response time (s) | 0.832 | 0.826 | 1.139 | 1.256 | 1.579 |
| Minimum response time (s) | 0.416 | 0.467 | 0.630 | 0.838 | 1.036 |
The test data in Table 4 show that the model's average response time across the different concurrency levels ranges from 0.524 s to 1.345 s, indicating strong processing capability and stability.
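A minimal load-test harness of the kind that could produce Table 4's statistics (the endpoint URL and request payload are hypothetical placeholders, assuming the model is served over HTTP):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/chat"  # hypothetical deployment endpoint

def one_request() -> float:
    """Time a single chat request end to end."""
    start = time.perf_counter()
    requests.post(URL, json={"message": "I feel anxious lately"}, timeout=10)
    return time.perf_counter() - start

def load_test(concurrency: int):
    """Fire `concurrency` simultaneous requests; report avg/max/min latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(lambda _: one_request(), range(concurrency)))
    return sum(times) / len(times), max(times), min(times)

for c in (10, 30, 50, 100, 300):
    avg, mx, mn = load_test(c)
    print(f"concurrency={c}: avg={avg:.3f}s max={mx:.3f}s min={mn:.3f}s")
```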
Although the average and maximum response times grow noticeably as concurrency increases, the model still maintains a relatively stable minimum response time under high concurrency. This indicates that in low-concurrency environments the model runs efficiently, with fast processing and short response times that meet daily operational needs. As the number of users increases, the response time rises but remains controllable, showing that the processing load stays within the expected range. Even at peak concurrency, the model maintains a stable minimum response, demonstrating a degree of resilience and stability.
Overall, the model's performance declines as concurrency increases, as expected, but remains stable, and it maintains a serviceable level even under high concurrency, indicating that the model is designed with good concurrent processing capability and stability. This provides data support for actual deployment, resource allocation, and scaling strategy, guaranteeing stable operation of the student mental health dialogue model at multi-user scale and reliable service for universities.
In this paper, taking the special needs of psychological counseling in colleges and universities as the premise, a mental health dialogue model (the UniLM-Copy model) is constructed by combining the 4SPG algorithm, the UniLM model, the Copy mechanism, and other artificial intelligence big model technologies.
In the performance tests and evaluation, the responses generated by the UniLM-Copy model recognize and perceive the user's emotions more acutely, winning 63.8% of identifiability comparisons and 55.8% of comfort comparisons. The model has a strong psychological guidance effect on college students: the students' average anxiety score fell by 3.0 between the pre-test and post-test. It also shows strong processing capability and stability, with average response times between 0.524 s and 1.345 s across the concurrency levels of the multi-user scenario.
Whether in generating highly empathetic responses or in overall psychological guidance effect, the UniLM-Copy model designed in this paper delivers a level of performance that similar algorithms do not reach, making it a successful attempt to explore the application of artificial intelligence big model technology in mental health education in colleges and universities.