The Application of Artificial Intelligence Big Model Technology in Mental Health Education Work in Colleges and Universities
Published online: 24 Sep 2025
Received: 14 Jan 2025
Accepted: 01 May 2025
DOI: https://doi.org/10.2478/amns-2025-1019
© 2025 Ying Li et al., published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
In today's rapidly developing information age, the mental health problems of college students have become increasingly prominent and are now a social issue of wide concern. In response, artificial intelligence offers a new direction for research on, and application to, college students' mental health education [1-4]. As an emerging technology, the artificial intelligence (AI) big model has shown strong application potential across many fields; its combination with mental health education to provide students with personalized psychological counseling and support is an area of particular interest [5-7].
An AI big model is a model trained with deep learning and related techniques, giving it powerful data processing and analysis capabilities. In mental health education, AI big models can identify potential psychological problems and give appropriate advice and support by analyzing data such as students' speech, text, and even physiological indicators [8-10]. This personalized approach brings mental health education closer to students' actual needs and helps improve treatment outcomes. Compared with traditional psychological counseling, the AI big model offers personalized treatment at greater speed and efficiency; moreover, it is not affected by emotional fluctuations or subjective factors, and can therefore provide counseling services more objectively [11-14]. Nevertheless, applying AI big models to mental health education carries challenges and risks, chiefly the privacy and data security of individual students and possible bias introduced during modeling, which means the advice and support an individual receives are not always accurate. The algorithms and models of the AI big model must be continuously optimized and improved to make their application in mental health education more effective and reliable [15-18].
In this paper, we first analyze the special needs of dialogue models in the mental health domain with respect to user privacy and utterance sensitivity, together with the technical framework of the model. For the model's empathic reply technology, the 4SPG algorithm is then introduced for paraphrase generation and optimization, the UniLM model is used to extract user semantic features and understand utterance information, and the Copy mechanism is used to fuse the user's key semantics into the generated replies; combined, these yield the UniLM-Copy model. Finally, the model is validated with respect to its response generation quality, semantic understanding accuracy, psychological guidance effect, and feasibility.
Starting from the needs of mental health dialogue, this chapter clarifies the main factors that must be considered and satisfied when constructing a dialogue model for the mental health field. It also describes the basic structure and technical advantages of the Rasa dialogue system framework, taking into account the characteristics of the dialogue subjects in this field.
Developing intelligent conversation plugins for the mental health domain requires attention to special needs, so that the plugins provide accurate, sensitive, and appropriate support when handling mental health-related conversations. In this domain, the user's emotional state strongly affects the quality and effectiveness of the dialogue, so conversation models must recognize and understand the user's emotions and respond to them appropriately. Applying deep learning techniques to sentiment analysis and emotion generation helps the dialogue model perceive and process the user's emotional expressions acutely. Protecting user privacy and information security is equally critical: conversational models must ensure that users' sensitive information is protected and must comply with legal and ethical guidelines for privacy protection. The design of AI models and their data processing should follow relevant privacy protection measures, such as desensitizing sensitive information and adopting secure methods of data transmission and storage.
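To make the desensitization measure concrete, the following is a minimal sketch assuming simple regular-expression rules; the patterns and placeholder labels are illustrative only, and a production system would rely on a vetted PII-detection component covering many more identifier types:

```python
import re

# Hypothetical desensitization rules for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[- ]?\d{4}[- ]?\d{4}\b"),
    "ID":    re.compile(r"\b\d{15,18}\b"),  # ID-card-like digit strings
}

def desensitize(text: str) -> str:
    """Replace sensitive substrings with category placeholders before storage."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(desensitize("Call me at 138-1234-5678 or write to a@b.com"))
# -> "Call me at [PHONE] or write to [EMAIL]"
```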
Rasa is an open-source, task-oriented dialogue system framework implemented in Python, with high flexibility and extensibility. Its designers divide Rasa into two parts: Rasa NLU and Rasa Core. Rasa NLU is responsible for understanding user input and extracting valid intent and entity information from it, while Rasa Core contains modules such as conversation state tracking, conversation strategy, and response generation, focusing on dialogue management tasks.
The architecture of Rasa Dialogue System is shown in Fig. 1. In the Rasa Dialogue System framework, the input text message passes through the Natural Language Processing module, Dialogue State Tracking module, Dialogue Strategy module, and Dialogue Action module in turn, and finally generates the reply content. The natural language processing module transmits the captured intent and entity information to the dialog state tracking module. The dialog state tracking module records and updates the intent and entity information captured during the entire dialog process. The dialog strategy module selects appropriate dialog actions based on the dialog state information. Finally, the dialog action module generates replies and updates the dialog state. The Rasa framework hides the tedious pipeline interfacing and dialog state tracking processes, allowing developers to focus on building individual modules.

Rasa’s Architecture Diagram
Building a complete Rasa project requires writing only a few files, which further reduces the learning cost for developers.
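To make the four-module flow of Fig. 1 concrete, here is a minimal schematic in plain Python. It is illustrative only and does not use Rasa's actual APIs; the intent rules and the response table are toy assumptions:

```python
# Schematic of the Rasa-style pipeline: NLU -> state tracking -> policy -> action.

def nlu(text: str) -> dict:
    """Extract a (toy) intent and entities from the raw user message."""
    if any(w in text for w in ("anxious", "worried", "nervous")):
        return {"intent": "express_anxiety", "entities": []}
    return {"intent": "chitchat", "entities": []}

class Tracker:
    """Records and updates dialogue state across turns."""
    def __init__(self):
        self.turns = []
    def update(self, parsed: dict):
        self.turns.append(parsed)

def policy(tracker: Tracker) -> str:
    """Select a dialogue action based on the current dialogue state."""
    last = tracker.turns[-1]["intent"]
    return "utter_comfort" if last == "express_anxiety" else "utter_default"

ACTIONS = {  # hypothetical response templates
    "utter_comfort": "That sounds hard. Can you tell me more about what happened?",
    "utter_default": "I'm listening. Please go on.",
}

tracker = Tracker()
tracker.update(nlu("I feel anxious about my exams"))
print(ACTIONS[policy(tracker)])
```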
Building on the modeling framework and techniques introduced in Chapter 2, this chapter designs the paraphrase generation model 4SPG, which serves as the core algorithm of the conversation model. It then analyzes the particular advantages of the UniLM model for empathic reply generation and combines it with the Copy mechanism to construct a mental health conversation model (the UniLM-Copy model).
Addressing the main open problems in the field of paraphrase generation, this paper proposes a novel paraphrase generation model, 4SPG, which performs representation learning on input utterances using the mainstream encoder-decoder neural network structure. In particular, this paper proposes a Bidirectional Paraphrase Learning (BPL) framework. The comparison between one-way and two-way paraphrase learning is shown in Fig. 2. Current paraphrase learning frameworks all adopt a one-way learning mode, i.e., the original utterance is used as input to learn the corresponding target utterance. In contrast, the bidirectional paraphrase learning framework introduced here simultaneously learns to generate the original utterance with the target utterance as input. Specifically, the encoder and decoder in this framework receive the same paraphrase pair as input, and the decoder then generates the paraphrase results corresponding to the original and target utterances, respectively. Compared with manually expanding the training dataset under a one-way framework to reuse training data, the high extensibility of the BPL framework lets the model learn a deeper relationship between the original and target utterances.

Unidirectional and bidirectional paraphrase learning
As Fig. 2 shows, on top of the BPL framework this paper proposes a semantic enhancer and a style enhancer based on self-supervised learning, improving semantic accuracy for the encoder and stylistic diversity for the decoder, respectively. The semantic enhancer focuses on the semantic distance between two features and guides the encoder to capture more precise semantic information, thus improving paraphrase accuracy. The style enhancer emphasizes potentially changeable regions through an attention mechanism and, via backpropagation, leads the decoder to generate richer expressions. Notably, these enhancers are designed as auxiliary loss functions and require no additional labeled data.
Specifically, given the word embeddings of the original sentence as input, the encoder produces hidden features, which the method then maps into a shared semantic space; the semantic enhancer measures the distance between the original and target representations in this space and uses it as an auxiliary signal guiding the encoder toward more precise semantics. Unlike previous work on paraphrase generation, the BPL framework allows simultaneous learning from both the original and target inputs, so the encoder and decoder process the two directions of each paraphrase pair in parallel. Unlike the semantic enhancer, textual style transfer is usually reflected in subtle lexical changes. For this reason, this paper implements style enhancement by introducing attention between the original features generated by the decoder and the target representation: the style enhancer computes the difference between the hidden features of the two sides and backpropagates it as an auxiliary loss. Throughout training, the method uses three loss functions in total to measure model performance: the cross-entropy loss of paraphrase generation itself, plus the semantic-enhancement and style-enhancement auxiliary losses, combined into the overall training objective.
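The following is a minimal PyTorch-style sketch of one BPL training step under the assumptions just described: a shared encoder-decoder, a cosine-distance semantic term, and an L2 style term. The `encode`/`decode` interface and the weighting coefficients are hypothetical, since the paper's exact formulas are not reproduced here:

```python
import torch
import torch.nn.functional as F

def bpl_step(model, src_ids, tgt_ids, alpha=0.5, beta=0.5):
    """One bidirectional paraphrase-learning step: learn src->tgt and tgt->src."""
    total = 0.0
    for inp, out in [(src_ids, tgt_ids), (tgt_ids, src_ids)]:  # both directions
        enc = model.encode(inp)                # hidden features of the input
        logits, dec = model.decode(enc, out)   # decoder logits and features
        # Generation loss: standard token-level cross-entropy.
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), out.view(-1))
        # Semantic enhancer (assumed form): pull pooled input/output features together.
        sem = 1 - F.cosine_similarity(enc.mean(1), model.encode(out).mean(1)).mean()
        # Style enhancer (assumed form): penalize distance between decoder features
        # and the target-side representation.
        sty = F.mse_loss(dec, model.encode(out))
        total = total + ce + alpha * sem + beta * sty
    return total
```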
UniLM (Unified pre-trained Language Model), as its name suggests, can train several kinds of language model within a single Transformer by applying different mask strategies to the same set of parameters.
UniLM is built mainly by stacking Transformer blocks. The UniLM model used in this paper stacks 12 Transformer layers, each with a hidden size of 768 and 12 attention heads. Since UniLM shares the structure of BERT-BASE, its parameters can be initialized from a pre-trained BERT-BASE. Each token is masked with a probability of 15%; of the masked tokens, 80% are replaced with [MASK], 10% are randomly replaced with words from the vocabulary, and the remaining 10% are left unchanged. Furthermore, a different number of tokens may be covered each time: with 80% probability a single token is masked, and with 20% probability a span of 2-3 consecutive tokens is masked at once.
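A minimal sketch of this masking scheme follows. It is illustrative only; the toy vocabulary is an assumption, and UniLM's actual implementation differs in detail:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "sad", "help", "exam"]  # toy replacement vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """UniLM-style masking: 15% of positions are selected; with 80% probability
    a single token is covered, with 20% probability a span of 2-3 tokens.
    A covered token becomes [MASK] 80% of the time, a random word 10% of the
    time, and stays unchanged 10% of the time."""
    tokens, i = list(tokens), 0
    while i < len(tokens):
        if random.random() < mask_prob:
            span = 1 if random.random() < 0.8 else random.randint(2, 3)
            for j in range(i, min(i + span, len(tokens))):
                r = random.random()
                if r < 0.8:
                    tokens[j] = MASK
                elif r < 0.9:
                    tokens[j] = random.choice(VOCAB)
                # else: leave the original token unchanged
            i += span
        else:
            i += 1
    return tokens

print(mask_tokens("my classmates hit me and I feel miserable".split()))
```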
UniLM can be trained with three mask strategies, corresponding to the unidirectional language model, the bidirectional language model, and the Seq2Seq language model; they differ fundamentally in the range of information each word in a sequence may access. Taking the Seq2Seq language model as an example, UniLM adapts BERT to generative tasks via the mask matrix. Although the model uses only an encoder architecture, it can still serve the Seq2Seq task, for the following reason: the mask matrix controls which words supply contextual information to which positions, and while the words to be predicted in the target sequence have access to the full context of the input sequence, they can only see the part of the target sequence to their left. Since the empathic response generation technique proposed in this section is based on the Seq2Seq language model, its self-attention mask mechanism is discussed below.
In the pre-training phase, the user's statement or question and the counselor's empathic response are organized into the form of contextual clauses and fed into the model. If a masked token of the Seq2Seq language model belongs to Segment 1 (the original text sequence, i.e., the user's statement or question), it may attend only to the other tokens of its own segment and cannot attend to tokens in Segment 2 (the target sequence, i.e., the counselor's empathic response). If the masked token belongs to Segment 2, it can attend both to all tokens of Segment 1 and to the tokens to its left within its own sequence.
For example, take the two sentences "My classmates hit me" and "It's miserable". For the Seq2Seq language model, the input is constructed as "[CLS] classmate hit me [SEP] [MASK] miserable [SEP]".
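The self-attention mask corresponding to such an input can be sketched as follows (a minimal NumPy illustration; the segment lengths follow the toy example above rather than any actual UniLM code):

```python
import numpy as np

def seq2seq_mask(len1: int, len2: int) -> np.ndarray:
    """Build a UniLM-style Seq2Seq attention mask.
    Rows = attending positions, columns = attended positions;
    1 means visible, 0 means blocked."""
    n = len1 + len2
    mask = np.zeros((n, n), dtype=int)
    mask[:, :len1] = 1              # every position sees all of Segment 1
    for i in range(len1, n):        # Segment 2 additionally sees itself
        mask[i, len1:i + 1] = 1     # and its left context in Segment 2
    mask[:len1, len1:] = 0          # Segment 1 never sees Segment 2
    return mask

# "[CLS] classmate hit me [SEP]" (5 tokens) + "[MASK] miserable [SEP]" (3 tokens)
print(seq2seq_mask(5, 3))
```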
During decoding, suppose the input statement is "X1". The sample is encoded with the UniLM model: the first row of the resulting feature matrix is the representation of [CLS], the second row is the feature representation of the first token of X1, and so on. At each decoding step, the feature representation at the [MASK] position passes through a linear layer, a Softmax then yields a probability distribution over the vocabulary, and the word with the highest likelihood is generated; the cycle repeats, terminating when [SEP] is generated.
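As a sketch, this decoding loop can be written as below. The `model` and `tokenizer` objects are hypothetical placeholders; UniLM implementations differ in how the [MASK] position is appended at each step:

```python
import torch

def greedy_decode(model, tokenizer, src: str, max_len: int = 40) -> str:
    """Greedy UniLM-style generation: repeatedly predict the word at the
    appended [MASK] position until [SEP] is produced."""
    ids = tokenizer.encode(f"[CLS] {src} [SEP]")
    out = []
    for _ in range(max_len):
        inp = torch.tensor([ids + tokenizer.encode("[MASK]")])
        logits = model(inp)                     # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())   # distribution at the [MASK] slot
        word = tokenizer.decode([next_id])
        if word == "[SEP]":
            break
        out.append(word)
        ids.append(next_id)                     # committed token replaces [MASK]
    return " ".join(out)
```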
Specifically, the attention distribution from the position currently being predicted to the remaining positions is first obtained by multi-head attention, with the mask matrix controlling the range of positions each word may attend to; the feature vector of the current decoder position is then computed as in Eq. (17):

$$\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}+M\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices obtained by linear projection of the hidden states, $d_k$ is the dimension of the keys, and $M$ is the self-attention mask matrix, whose entries are $0$ at visible positions and $-\infty$ at blocked positions. The Softmax function maps the vector of scores produced at the [MASK] position to a probability distribution over the vocabulary, and the cross-entropy loss is then computed between this predicted distribution and the ground-truth token.
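A compact NumPy rendering of Eq. (17), under the same conventions (the additive mask uses $-\infty$ at blocked positions; shapes and data are illustrative):

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with a UniLM-style additive mask.
    mask is 0 where attention is allowed and -inf where it is blocked."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise Softmax
    return weights @ V

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
mask = np.triu(np.full((n, n), -np.inf), k=1)        # left-to-right visibility
print(masked_attention(Q, K, V, mask).shape)          # (4, 8)
```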
To mitigate the problems mentioned at the beginning, namely inaccurate details of generated complex events and the absence of emotional keywords in the generated empathic responses, this paper introduces a generation probability $p_{gen}$ through the Copy mechanism, allowing the decoder at each step either to generate a word from the vocabulary or to copy a word directly from the user's input. The structure of the UniLM-Copy model is shown in Fig. 3.

UniLM-Copy mechanism model structure

The final output distribution at each decoding step is the mixture $$P(w)=p_{gen}P_{vocab}(w)+\left(1-p_{gen}\right)\sum_{i:w_{i}=w}a_{i}$$ where $p_{gen}\in[0,1]$ is the generation probability computed from the decoder state, $P_{vocab}(w)$ is the probability of word $w$ under the decoder's vocabulary distribution, and $a_i$ is the attention weight assigned to the $i$-th token of the input. In this way, event details and emotional keywords appearing in the user's utterance can be copied verbatim into the reply.
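A minimal sketch of this mixture step (tensor shapes are illustrative, and the projection that produces $p_{gen}$ is an assumption in the spirit of standard pointer-generator networks):

```python
import torch

def copy_mixture(vocab_logits, attn_weights, src_ids, p_gen):
    """Mix the vocabulary distribution with a copy distribution.
    vocab_logits: (batch, vocab)   attn_weights: (batch, src_len)
    src_ids:      (batch, src_len) p_gen:        (batch, 1)"""
    p_vocab = torch.softmax(vocab_logits, dim=-1)
    final = p_gen * p_vocab
    # Scatter attention mass onto the vocabulary ids of the source tokens:
    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i
    final = final.scatter_add(-1, src_ids, (1 - p_gen) * attn_weights)
    return final

vocab_logits = torch.randn(2, 100)
attn = torch.softmax(torch.randn(2, 5), dim=-1)
src_ids = torch.randint(0, 100, (2, 5))
p_gen = torch.sigmoid(torch.randn(2, 1))
print(copy_mixture(vocab_logits, attn, src_ids, p_gen).sum(-1))  # ~1.0 each row
```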
This chapter has studied empathic response generation based on the UniLM-Copy model: it first designed and analyzed the principles of the 4SPG paraphrase generation algorithm, and then proposed a mental health model based on UniLM-Copy, which uses the Copy mechanism to incorporate complex event details and emotional keywords into the generated empathic responses. The model generates empathic responses by analyzing the latent emotions in the context of the user's input statements and combining them with mental health knowledge, thereby guiding the user toward a positive mental state.
To test the suitability of this paper's model for mental health work, this section examines the model's performance in response generation, language comprehension and query, and psychological guidance effect and feasibility, designing test experiments and evaluation analyses accordingly.
To evaluate the response generation of this paper's model, this subsection compares its automatic and human evaluation results against the Vanilla prompting method and the CoT prompting method.
The experiments in this paper were conducted on the ESConv dataset. Constructed through crowdsourcing, it is widely used in emotional support conversation research. The dataset contains 1200 long conversations with 38,365 utterances in total, each conversation being an interaction between a help-seeker and a supporter. In each round of dialogue, the supporter responds with a specific emotional support strategy; these strategies are labeled in the dataset in eight categories: questioning, restatement or paraphrasing, reflection of feelings, self-disclosure, affirmation and reassurance, providing suggestions, providing information, and others. Following the previous data split, and to ensure a fair comparison, this paper keeps the test set for effect evaluation and merges the training and validation sets into a retrieval set, without further training of the model. The statistics of the dataset are shown in Table 1.
Statistics of the ESConv dataset

| Parameter | Test set | Retrieval set (training set + validation set) |
|---|---|---|
| Session count | 200 | 1110 |
| Number of utterances | 6037 | 32327 |
| Average rounds per session | 30.95 | 29.26 |
| Average utterance length | 15.70 | 16.55 |
Table 2 presents the automatic evaluation comparison between this paper's method and the other methods. In most cases, this paper's method outperforms both the Vanilla prompting method and the CoT prompting method, indicating that the inference algorithm in this paper is more effective on emotional support tasks.
Automatic evaluation results (units: %)

| Model | Method | BLEU-2 | BLEU-4 | METEOR | Distinct-1 | Distinct-2 |
|---|---|---|---|---|---|---|
| Mistral | +Vanilla | 2.37 | 0.83 | 12.03 | 5.24 | 25.35 |
| | +CoT | 2.47 | 0.81 | 12.38 | 5.15 | 25.84 |
| | +UniLM-Copy | 2.62 | 1.11 | 12.42 | 7.08 | 36.27 |
| Gemma | +Vanilla | 1.97 | 0.76 | 9.23 | 6.74 | 29.17 |
| | +CoT | 1.93 | 0.74 | 9.18 | 6.53 | 28.43 |
| | +UniLM-Copy | 2.11 | 0.80 | 10.23 | 7.17 | 33.72 |
| Llama2 | +Vanilla | 2.19 | 0.75 | 12.88 | 3.57 | 17.38 |
| | +CoT | 2.09 | 0.71 | 12.68 | 3.62 | 17.59 |
| | +UniLM-Copy | 2.22 | 0.74 | 13.84 | 4.23 | 21.07 |
| Llama3 | +Vanilla | 2.44 | 0.94 | 11.23 | 5.77 | 26.04 |
| | +CoT | 2.52 | 1.05 | 10.61 | 5.77 | 24.85 |
| | +UniLM-Copy | 2.68 | 1.02 | 11.71 | 9.26 | 40.66 |
This paper's method shows clear superiority on the METEOR metric, which evaluates generated text not only by phrase matching but also by word form, synonyms, and semantics. This indicates that the method is stronger at the semantic level and in lexical expression, and better captures the core meaning of the reference text rather than relying on simple n-gram matching.
In addition, this paper's method shows advantages in text diversity, performing better than the fine-tuning method on all models. The diversity improvement is not merely a byproduct of longer text: as the comparison of different prompting methods on a single model shows, this paper's method improves diversity further on that basis, and more significantly. The Distinct-2 score reaches 40.66% on the Llama3 model and 36.27% on the Mistral model, because the method attends fully to the user's fine-grained emotional information, allowing the model to incorporate the user's specific state and diversify its responses accordingly.
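For reference, Distinct-n is the ratio of unique n-grams to total n-grams in the generated text; a minimal implementation (tokenization by whitespace is an assumption):

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams / total n-grams over a set of responses."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

replies = ["that sounds really hard", "that sounds tough, tell me more"]
print(f"Distinct-2 = {distinct_n(replies, 2):.2%}")
```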
Emotional support conversations involve both emotional and cognitive aspects whose quality is sometimes difficult to judge with automatic metrics. To overcome this limitation, a human evaluation was introduced to analyze the content of the responses. We randomly selected 300 examples from the test set together with the replies generated by the different methods; three evaluators voted for the reply that excelled in each aspect, and when the three evaluators produced three different votes, a fourth evaluator was brought in to break the tie. The final results are shown in Table 3.
Human evaluation results (units: %)

| Comparison | Evaluation aspect | Win | Lose | Tie |
|---|---|---|---|---|
| Baseline method | Fluency | 43.1 | 22.1 | 35.1 |
| | Identifiability | 65.1 | 11.8 | 23.4 |
| | Comfort | 59.4 | 12.4 | 28.5 |
| | Suggestion | 52.4 | 16.8 | 31.1 |
| | Overall | 60.8 | 15.1 | 24.4 |
| Textual method | Fluency | 49.4 | 14.4 | 36.5 |
| | Identifiability | 63.8 | 9.4 | 27.1 |
| | Comfort | 55.8 | 21.1 | 23.4 |
| | Suggestion | 50.1 | 15.8 | 34.4 |
| | Overall | 58.4 | 19.8 | 22.1 |
As can be seen, this paper's method outperforms the baseline method in all five aspects. It performs especially strongly in identifiability and comfort, scoring 63.8% and 55.8% respectively, indicating that it perceives and recognizes the user's emotional state more acutely and replies with greater empathy. Further analysis shows that its advantage in fluency is not significant. One reason may be that the fine-tuning method already performs well on fluency after training on the dataset; another is that the large model sometimes generates overly long replies, which can make the progression of the dialogue slightly stiff.
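The tie-breaking vote described above can be made precise with a small sketch (the data structures are hypothetical, as the paper does not describe its annotation tooling):

```python
from collections import Counter

def vote(choices, tiebreaker=None):
    """Three evaluators each pick the best reply for an aspect; if all
    three disagree, a fourth evaluator's vote decides."""
    counts = Counter(choices)
    top, freq = counts.most_common(1)[0]
    if freq == 1 and tiebreaker is not None:  # three-way disagreement
        return tiebreaker
    return top

print(vote(["ours", "ours", "baseline"]))         # majority -> "ours"
print(vote(["ours", "baseline", "cot"], "ours"))  # fourth vote -> "ours"
```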
In this section, the NQ (Natural Questions), TQA (TriviaQA), and WQ (WebQuestions) datasets are used for the comparative evaluation of different algorithms on language querying and comprehension: NQ serves the information retrieval and question answering tasks, TQA covers common-sense questions and their answers across several domains, and WQ is used for the question answering task.
Top-K accuracy is a key metric of retrieval effectiveness: over all queries, it measures how often the model places at least one relevant document within the top K results it returns. The performance comparison between this paper's algorithm and the other algorithms is expanded over six dataset columns: TQA Top-100 (X1), NQ Top-100 (X2), NQ Top-20 (X3), TQA Top-20 (X4), WQ Top-100 (X5), and WQ Top-20 (X6).
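A minimal implementation of the metric (the `retriever` interface is a hypothetical placeholder):

```python
def top_k_accuracy(retriever, queries, relevant, k=20):
    """Fraction of queries with at least one relevant document in the top-k.
    `retriever(q, k)` is assumed to return a ranked list of document ids;
    `relevant[q]` is the set of gold document ids for query q."""
    hits = sum(
        any(doc in relevant[q] for doc in retriever(q, k))
        for q in queries
    )
    return hits / len(queries)
```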
Figure 4 shows the Top-K accuracy of all algorithms on these datasets, where S1 is this paper's algorithm, S2 the DPR algorithm, S3 the ANCE algorithm, and S4 the BM25 algorithm.

Retrieval performance comparison
Among all the algorithms, BM25 performs worst on every dataset, which underlines the importance of deep learning for text retrieval. DPR, as the pioneering dense retrieval model, still performs respectably, but holds no advantage over the other dense retrieval models that improve on it. This paper's model, by contrast, outperforms the baseline BM25 in every case: on the Top-20 metric, it exceeds BM25 by 21.4% on the NQ dataset, 15.5% on TQA, and 20.8% on WQ. This shows that the model holds a significant advantage over traditional algorithms in understanding natural language queries.
To establish the feasibility of this paper's model for practical mental health education in colleges and universities, this section conducts an overall effect test and a feasibility test, examining respectively the model's positive guiding effect on students and its stable operation in realistic scenarios. The overall effect test compares the model against similar algorithms, while the feasibility test is carried out in a multi-user scenario.
A total of 500 second-year students across five majors at a university were psychologically assessed, and the 20 students with the highest anxiety scores in each major were selected as experimental subjects and divided by major into groups A, B, C, D, and E. Group A used the conversation model based on the Vanilla prompting method, group B the model based on the CoT prompting method, group C the model based on the DPR algorithm, group D the model based on the ANCE algorithm, and group E the model of this paper.
The test data collected before and after the experiment were analyzed, with mean scores computed over the ten anxiety items of the SCL-90 scale (items 2, 17, 23, 33, 39, 57, 72, 78, 80, and 86).
Since students of different majors differ in population, disciplinary background, and education and teaching, and since the anxiety levels of the five sample groups are not identical, the mean value of these test items is used to reflect each group's anxiety level. The pre-test and post-test results are shown in Figure 5.
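A sketch of how the group score could be computed, assuming each student's responses are stored as a dict of item scores (SCL-90 items are rated 1-5; the paper does not spell out its exact aggregation):

```python
ANXIETY_ITEMS = [2, 17, 23, 33, 39, 57, 72, 78, 80, 86]  # SCL-90 anxiety items

def group_anxiety_score(students):
    """Group mean of each student's anxiety subscale total (ten items rated
    1-5, so totals range from 10 to 50)."""
    totals = [sum(s[i] for i in ANXIETY_ITEMS) for s in students]
    return sum(totals) / len(totals)

before = [{i: 2 for i in ANXIETY_ITEMS} for _ in range(20)]  # toy data
print(group_anxiety_score(before))  # 20.0 with the toy scores above
```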

Experimental test comparison
As can be seen from Figure 5, the anxiety level of every group decreased after using its model, which shows that such models can provide a degree of help for the mental health education of college students. Group E shows the lowest post-test anxiety level: its mean score is only 15.9, a decrease of 3.0 from the pre-experiment mean.
Figure 6 compares the distribution of the five models' psychological guidance effect scores in the experiment (out of 5 points). The data show that this paper's model performs best among similar models on students' psychological guidance, with a score as high as 3.89.

Guidance mode comparison
To simulate real-world usage scenarios, this subsection designs a response time test to verify the model's responsiveness at multiple concurrency levels and to ensure that it still runs stably and smoothly under high load. The response times of this paper's model under different concurrency levels are shown in Table 4.
Response times under different concurrency levels

| Concurrency | 10 | 30 | 50 | 100 | 300 |
|---|---|---|---|---|---|
| Average response time (s) | 0.524 | 0.619 | 0.837 | 1.059 | 1.345 |
| Maximum response time (s) | 0.832 | 0.826 | 1.139 | 1.256 | 1.579 |
| Minimum response time (s) | 0.416 | 0.467 | 0.630 | 0.838 | 1.036 |
The test data in Table 4 show that the model's average response time across the different concurrency levels ranges from 0.524 s to 1.345 s, indicating strong processing capability and stability.
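A minimal load-test harness of the kind that could produce Table 4's statistics (the endpoint URL and request payload are hypothetical placeholders, assuming the model is served over HTTP):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/chat"  # hypothetical deployment endpoint

def one_request() -> float:
    """Time a single chat request end to end."""
    start = time.perf_counter()
    requests.post(URL, json={"message": "I feel anxious lately"}, timeout=10)
    return time.perf_counter() - start

def load_test(concurrency: int):
    """Fire `concurrency` simultaneous requests; report avg/max/min latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(lambda _: one_request(), range(concurrency)))
    return sum(times) / len(times), max(times), min(times)

for c in (10, 30, 50, 100, 300):
    avg, mx, mn = load_test(c)
    print(f"concurrency={c}: avg={avg:.3f}s max={mx:.3f}s min={mn:.3f}s")
```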
Although the average and maximum response times grow noticeably as concurrency increases, the model still maintains a relatively stable minimum response time under high concurrency. This indicates that in low-concurrency environments the model runs efficiently, with fast processing and short response times that meet daily operational needs. As the number of users increases, the response time rises but remains controllable, showing that the processing load stays within the expected range. Even at peak concurrency, the model maintains a stable minimum response, demonstrating a degree of resilience and stability.
Overall, the model's performance declines as concurrency increases, as expected, but remains stable, and it maintains a serviceable level even under high concurrency, indicating that the model is designed with good concurrent processing capability and stability. This provides data support for actual deployment, resource allocation, and scaling strategy, guaranteeing stable operation of the student mental health dialogue model at multi-user scale and reliable service for universities.
In this paper, taking the special needs of psychological counseling in colleges and universities as the premise, a mental health dialogue model (the UniLM-Copy model) is constructed by combining the 4SPG algorithm, the UniLM model, the Copy mechanism, and other artificial intelligence big model technologies.
In the performance tests and evaluation, the responses generated by the UniLM-Copy model recognize and perceive the user's emotions more acutely, winning 63.8% of identifiability comparisons and 55.8% of comfort comparisons. The model has a strong psychological guidance effect on college students: the students' average anxiety score fell by 3.0 between the pre-test and post-test. It also shows strong processing capability and stability, with average response times between 0.524 s and 1.345 s across the concurrency levels of the multi-user scenario.
Whether in generating highly empathetic responses or in overall psychological guidance effect, the UniLM-Copy model designed in this paper delivers a level of performance that similar algorithms do not reach, making it a successful attempt to explore the application of artificial intelligence big model technology in mental health education in colleges and universities.