Research on Language and Cultural Elements and Cultural Confidence of Chinese Literary Works in Southeast Asia in the Internet Era

Literature is the art of language and a cultural activity carried out through language. Language patterns shape patterns of thinking, and through them exert a profound formative effect on the cultural spirit of a nation; literary activity built on a given language and culture therefore inevitably reflects that cultural spirit [1-2]. Southeast Asia is a culturally diverse region, and Southeast Asian Chinese literature is an important part of overseas Chinese literature. Compared with the literary language of mainland China, the most distinctive feature of Southeast Asian Chinese literary language is that it broadly retains the expression system of the northern Chinese dialects while carrying the unique flavor of the southern Chinese dialects, which is clearly reflected in vocabulary, grammar, and other aspects [3-5]. In addition, because it is surrounded by the national languages of the various Southeast Asian countries and by Western languages, its Chinese vocabulary is mixed with many words from local and Western languages that have become deeply embedded in the Chinese language system. The result is a multilingual, hybrid literary language whose aesthetic qualities differ from those of mainland China’s literary language [6-9]. Through seemingly non-standard and even rustic expression, the rich and complex life of the Chinese in Southeast Asia is vividly conveyed, reflecting the strong inclusiveness of Chinese culture and its ability to adapt to its environment. This makes it a literary treasure trove that deserves to be cherished and studied in depth [10-13].

Enhancing cultural self-confidence requires actively conveying Chinese voices to the outside world, so that Chinese culture reaches the world and the world comes to understand, recognize, and respect it. In the Internet era, short videos and social media interaction have brought people closer together, and more and more young people have begun to show the charm of the “Chinese flow” in countries around the world [14]. In-depth study of the linguistic and cultural elements of Southeast Asian Chinese literature can promote cultural understanding and enhance cultural self-confidence, so that people are able and willing to carry this culture forward and boldly display the charm of the “Chinese flow”, laying a solid foundation for future cross-cultural exchanges [15-16].

In this paper, we conduct in-depth text mining of Southeast Asian Chinese literature: keywords are extracted using the LDA model, and their importance is analyzed with the TF-IDF algorithm. The keywords are further examined through co-occurrence analysis and word clustering based on the Word2Vec model. A sentiment recognition method based on RoBERTa-BiLSTM-Attention is proposed to analyze the sentiment of Southeast Asian Chinese literary works and their authors. The linguistic and cultural elements are subjected to TF-IDF calculation, co-occurrence analysis, and cluster analysis, and on this basis the sentiment of the works and authors is analyzed. Finally, the language and cultural elements, as well as cultural self-confidence, are discussed in depth.

2

Textual and Emotional Studies of Chinese Literature in Southeast Asia

2.1

Extraction of Language and Cultural Elements of Southeast Asian Chinese Literature

2.1.1

Text data preprocessing

The purpose of preprocessing the Southeast Asian Chinese literary texts is to transform them into a computer-recognizable form while eliminating noise and interference, so as to improve the accuracy of model processing.

1)

Word segmentation

Words are the smallest units that express meaning. Unlike English text, in which spaces naturally separate words, Chinese text has no fixed marker of word boundaries. The same sentence can take on different meanings depending on how it is segmented, so the accuracy of word segmentation has a crucial impact on the results of text analysis.

Common Chinese word segmentation methods are dictionary-based, statistics-based, and comprehension-based. Dictionary-based methods require no additional corpus and are computationally light, but their results are relatively mechanical. Statistics-based methods identify words better, but they tend to miss low-frequency words and are more computationally expensive. In this paper, we use the jieba library in Python as the segmentation tool. Relying on its dictionary, jieba constructs a directed acyclic graph (DAG) of the possible words in an input sentence and segments the sentence along the maximum-probability path through the DAG; for words not included in the dictionary, it uses a Hidden Markov Model (HMM) and the Viterbi algorithm to find the most probable sequence. jieba provides three modes for segmenting Chinese text: precise mode, full mode, and search engine mode. Full mode is fast but prone to ambiguity, while search engine mode builds on precise mode and re-segments long words to improve recall. By comparison, precise mode best meets the needs of text analysis for accuracy and completeness. This paper therefore adopts precise mode to segment the text.
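
The DAG-and-maximum-probability idea described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not jieba's actual implementation: the lexicon `FREQ`, its frequencies, and the function names `build_dag`, `best_route`, and `segment` are all hypothetical, and the HMM step for out-of-vocabulary words is omitted.

```python
import math

# Toy lexicon with made-up frequencies (assumption, for illustration only).
FREQ = {"南京": 50, "南京市": 40, "市长": 30, "长江": 60, "大桥": 40, "长江大桥": 30}

def build_dag(sentence, freq):
    """For each start index, list the end indices of lexicon words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # fall back to a single character
    return dag

def best_route(sentence, dag, freq):
    """Dynamic programming from right to left over the DAG (log-probabilities)."""
    n = len(sentence)
    log_total = math.log(sum(freq.values()))
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    return route

def segment(sentence, freq=FREQ):
    """Follow the maximum-probability path and return the resulting words."""
    route = best_route(sentence, build_dag(sentence, freq), freq)
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

print(segment("南京市长江大桥"))  # ['南京市', '长江大桥']
```

With these toy frequencies the path "南京市 / 长江大桥" scores higher than "南京 / 市长 / 江 / 大桥", which is why the ambiguous reading is rejected.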

2)

Stop-word removal

In information retrieval, words or phrases that occur with high frequency but carry little information are usually filtered out automatically before or after text processing in order to improve search efficiency and save storage space; these filtered words are called stop words. Stop words increase the complexity of text processing, add noise, and may interfere with the accuracy of search results. Selecting and filtering stop words appropriately is therefore a crucial step when using jieba for word segmentation. An accurate and comprehensive stop-word list can effectively improve the quality of text features, improve the recall of the model, reduce the dimensionality of the text features, and improve the efficiency of information processing.

3)

Custom Dictionary

Constructing a custom dictionary helps the segmentation tool match possible word combinations more quickly, identify proper nouns in the text, and improve segmentation accuracy. Building one requires familiarity with the text: the domain-specific nouns involved are collected, organized, and stored in the custom dictionary in a computer-readable format. The segmentation tool reads the words in the custom dictionary and adds them to its initial dictionary, so that in subsequent segmentation the nouns listed there are kept whole rather than split.

4)

Synonymization

Synonym merging improves the accuracy and reliability of word frequency statistics. It is generally applied to specific nouns and their abbreviations: once two expressions are confirmed to have the same meaning, they are unified and treated as equivalent during identification.
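
As a minimal sketch of the stop-word and synonym steps above, the following plain-Python snippet filters an already segmented token list. The stop-word set, the synonym map, and the function name `clean_tokens` are hypothetical examples, not the lists used in this study.

```python
# Illustrative resources (assumptions): a tiny stop-word list and a map from
# variant forms to one canonical form.
STOP_WORDS = {"的", "了", "是", "在"}
SYNONYMS = {"屈子": "屈原", "万里长城": "长城"}

def clean_tokens(tokens, stop_words=STOP_WORDS, synonyms=SYNONYMS):
    """Merge synonyms into their canonical form, then drop stop words."""
    cleaned = []
    for token in tokens:
        token = synonyms.get(token, token)  # unify variants before counting
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned

tokens = ["屈子", "的", "诗", "在", "万里长城", "流传"]
print(clean_tokens(tokens))  # ['屈原', '诗', '长城', '流传']
```

Merging synonyms before counting means that variant forms contribute to a single word frequency rather than being split across several entries.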

2.1.2

TF-IDF algorithm

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method that measures the importance of a word to a document within a specific text collection or corpus [17]. As a commonly used weighting technique, TF-IDF plays an indispensable role in information retrieval and text mining.

TF-IDF is based on two factors: term frequency (TF) and inverse document frequency (IDF). Term frequency indicates how often a word appears in a particular document: the more often a word occurs in a document, the more weight it carries in assessing the content of that document. TF is calculated as:

$$tf_{ij} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{1}$$

where $tf_{ij}$ is the term frequency of word $t_{i}$ in document $d_{j}$, the numerator $n_{i,j}$ is the number of times the word occurs in the document, and the denominator is the total number of words in the document.

Inverse document frequency (IDF) is a key metric for how widely a word occurs across the whole text collection, and it measures how effectively a word distinguishes between documents. In general, a low IDF value means that the term occurs in most documents and therefore contributes little to differentiating between specific documents. IDF is calculated as:

$$idf_{i} = \log \frac{|D|}{|\{j : t_{i} \in d_{j}\}|} \tag{2}$$

where $idf_{i}$ is the inverse document frequency of word $t_{i}$, the numerator $|D|$ is the total number of documents in the corpus, and the denominator is the number of documents containing the word.

Finally, TF and IDF are multiplied to obtain the TF-IDF value of a word in a particular document. The words with the highest TF-IDF weights can be regarded as the keywords of the document. TF-IDF is calculated as:

$$tfidf_{ij} = tf_{ij} \times idf_{i} \tag{3}$$

where $tfidf_{ij}$ is the TF-IDF value of word $t_{i}$ in document $d_{j}$, $tf_{ij}$ is its term frequency, and $idf_{i}$ is its inverse document frequency.

The larger the TF-IDF value, the higher the importance of the word in the document. Therefore, using TF-IDF value to sort, classify and filter documents can improve the efficiency and accuracy of information retrieval and text mining.
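
The three formulas above can be written directly in plain Python. This is a minimal sketch under stated assumptions: the function name `tf_idf` and the toy documents are hypothetical, the natural logarithm is assumed for the log in Equation (2), and no smoothing is applied, so the values will differ from library implementations that smooth the IDF term.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # number of documents containing each word
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)                   # total number of words in the document
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Toy corpus (assumption): three short, already segmented documents.
docs = [
    ["china", "great_wall", "china"],
    ["china", "chopsticks"],
    ["china", "jiangnan", "jiangnan", "tea"],
]
for doc_scores in tf_idf(docs):
    print(sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus "china" occurs in every document, so its IDF is log(1) = 0 and its TF-IDF is 0 in all three documents, which illustrates why words spread across the whole collection contribute little to distinguishing documents.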

In summary, this study follows the TF-IDF principle and uses the TfidfTransformer class of the scikit-learn library in Python to compute TF-IDF values on the word segmentation results, so as to extract the keywords of Southeast Asian Chinese literary texts. Counting the word frequencies of these keywords and arranging them in descending order reveals the core content of the texts more clearly and shows the high-frequency words of Southeast Asian Chinese literature more intuitively. On this basis, the high-frequency keywords are compared with the topic terms derived from the LDA topic probability model to examine the consistency between the two.

2.1.3

Keyword Vectorized Representation of Word2Vec Models

The preprocessed text corpus is input into the Word2Vec model, which utilizes a neural network model to automatically learn the words in the corpus and map them into a continuous vector space, and finally obtains a vectorized representation of each word in the corpus. Then, the trained word vector model is used to vectorize the extracted topic keywords. By representing words as vectors, the Word2Vec model endows words with richer semantic information so that various operations can be carried out in the vector space, which provides an important support for calculating the similarity between topic keywords [18].

For the vectorized keywords, this paper quantifies the similarity between two words by the cosine of the angle between their keyword vectors, as shown in Equation (4). The value lies between -1 and 1: the closer it is to 1, the more semantically similar the two keywords are, and the closer it is to -1, the less similar they are:

$$Sim(w_{i}, w_{j}) = \frac{\vec{w}_{i} \cdot \vec{w}_{j}}{|\vec{w}_{i}| \, |\vec{w}_{j}|} \tag{4}$$

where w_i and w_j represent the two keywords and ${\vec{w}}_{i}$ is the vectorized representation of keyword w_i. Sim(w_i, w_j) represents the word vector similarity between word w_i and word w_j, which to some extent represents the semantic correlation between the two keywords.
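
Equation (4) can be illustrated with a short plain-Python function. The three-dimensional vectors below are invented for illustration only; real Word2Vec vectors are learned from the corpus and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, as in Equation (4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical keyword vectors (assumption, not trained embeddings).
vectors = {
    "wuxia": [0.9, 0.1, 0.3],
    "jianghu": [0.8, 0.2, 0.4],
    "chopsticks": [-0.2, 0.9, 0.1],
}
print(cosine_similarity(vectors["wuxia"], vectors["jianghu"]))
print(cosine_similarity(vectors["wuxia"], vectors["chopsticks"]))
```

With these made-up vectors, the first pair scores close to 1 and the second close to 0, matching the interpretation of the similarity value given above.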

2.2

Sentiment Analysis Model of Southeast Asian Chinese Literature

This paper proposes a RoBERTa-BiLSTM-Attention based sentiment recognition method to recognize and analyze the sentiment in Southeast Asian Chinese literature.

The method first feeds the Southeast Asian Chinese literature text into the RoBERTa model to obtain the corresponding vectorized sentence embeddings and word embeddings. A bidirectional LSTM network then learns contextual text features, and an attention mechanism focuses on the key information in the text. The output of the attention mechanism is then combined with the sentence embeddings and mapped to the sentiment label space through a fully connected layer. Finally, the Softmax function normalizes the output of the fully connected layer to obtain the classification result. The overall structure is shown in Fig. 1 and comprises a text vector representation layer, a feature extraction layer, and an output layer.

2.2.1

Text Vector Representation Layer

The text vector representation layer is one of the key parts of the RoBERTa-BiLSTM-Attention model. Its main function is to transform the input text into semantically rich vectors for the subsequent sentiment analysis. As a pre-trained model, RoBERTa is capable of in-depth feature extraction and encoding of the input text [19]. With its multi-layer Transformer encoder, RoBERTa learns rich semantic information and encodes it into context-dependent word vectors.

During self-supervised training, RoBERTa-WWM masks a certain percentage of the tokens in the input sentences using the masked language model (MLM) method. The pre-training architecture of the RoBERTa-WWM model consists of an Embedding layer and a Transformer layer based on whole word masking. During pre-training, the model goes through a series of steps to extract and learn the features and semantic information in the text data, mainly the following:

1)

Define the input sentence of the model as e = (e₁, e₂, …, e_n), where e_i denotes the i-th character of the input sentence and n denotes the sentence length.

2)

In the Embedding layer of the RoBERTa-WWM model, the input sentence is processed and transformed into an input sequence T = (t₁, t₂, …, t_n). This process fuses token embeddings (word embedding vectors), segment embeddings (segmentation vectors), and position embeddings (position vectors) to obtain the final representation. Token embeddings are looked up in the word vector table to obtain the word vectors, segment embeddings indicate which sentence each word belongs to, and position embeddings represent the positional information of each word.

3)

Sequence T = (t₁, t₂, …, t_n) is fed into the Transformer layer in order to extract features of the sequence. This step produces a semantically rich output sequence h = (h₁, h₂, …, h_n), which can be used as an encoded representation for the extraction of feature relations in subsequent processing.

2.2.2

Feature extraction layer

The feature extraction layer consists of a bidirectional long short-term memory (BiLSTM) layer and an attention layer, which together model the sequence data effectively. The final predicted label is determined after all of these elements are considered together. The formulas involved are as follows:

$$f_{t} = \sigma(W_{f} \cdot [h_{t-1}, e_{t}] + b_{f}) \tag{5}$$

$$\tilde{C}_{t} = \tanh(W_{c} \cdot [h_{t-1}, e_{t}] + b_{c}) \tag{6}$$

$$i_{t} = \sigma(W_{i} \cdot [h_{t-1}, e_{t}] + b_{i}) \tag{7}$$

$$C_{t} = f_{t} \cdot C_{t-1} + i_{t} \cdot \tilde{C}_{t} \tag{8}$$

$$o_{t} = \sigma(W_{o} \cdot [h_{t-1}, e_{t}] + b_{o}) \tag{9}$$

$$h_{t} = o_{t} \cdot \tanh(C_{t}) \tag{10}$$

In the long short-term memory (LSTM) formulation, f_t denotes the state of the forget gate, i_t the state of the input gate, and o_t the state of the output gate; $\tilde{C}_{t}$ is the candidate cell state and C_t the cell state. W_f, W_i, W_o are the weight matrices of the forget, input, and output gates, respectively, and b_f, b_i, b_o are their bias vectors; W_c and b_c play the same roles for the candidate cell state. σ is the Sigmoid activation function used by the gates. In the reverse LSTM, the sequence is processed in the opposite direction: at each position t, the word vector e_t is combined with the hidden-layer output of the previously processed step to obtain the reverse representation ${\overset{\leftarrow}{h}}_{t}$.

Thus, by combining the forward representation ${\vec{h}}_{t}$ and the reverse representation ${\overset{\leftarrow}{h}}_{t}$, the complete contextual information of the text can be obtained, giving the final hidden state h_t at the current moment t:

$$h_{t} = \vec{h}_{t} \oplus \overset{\leftarrow}{h}_{t} \tag{11}$$
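
To make Equations (5) to (11) concrete, the following plain-Python sketch runs a bidirectional LSTM in which the input, hidden state, and cell state are all scalars. This is a deliberate simplification for readability, not the model used in this paper: the parameter dictionaries `FWD` and `BWD` hold arbitrary made-up values, and each gate stores a weight on the previous hidden state, a weight on the input, and a bias.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(e_t, h_prev, c_prev, params):
    """One LSTM step with scalar input and state, following Eqs. (5)-(10)."""
    def affine(gate):
        w_h, w_e, b = params[gate]          # weight on h_prev, weight on e_t, bias
        return w_h * h_prev + w_e * e_t + b

    f_t = sigmoid(affine("f"))              # Eq. (5): forget gate
    c_tilde = math.tanh(affine("c"))        # Eq. (6): candidate cell state
    i_t = sigmoid(affine("i"))              # Eq. (7): input gate
    c_t = f_t * c_prev + i_t * c_tilde      # Eq. (8): new cell state
    o_t = sigmoid(affine("o"))              # Eq. (9): output gate
    h_t = o_t * math.tanh(c_t)              # Eq. (10): hidden state
    return h_t, c_t

def run_lstm(sequence, params):
    """Run one direction over the sequence, starting from zero states."""
    h, c, hidden = 0.0, 0.0, []
    for e_t in sequence:
        h, c = lstm_step(e_t, h, c, params)
        hidden.append(h)
    return hidden

def bilstm(sequence, fwd_params, bwd_params):
    """Pair forward and reverse hidden states per position, as in Eq. (11)."""
    forward = run_lstm(sequence, fwd_params)
    backward = run_lstm(sequence[::-1], bwd_params)[::-1]
    return list(zip(forward, backward))

# Arbitrary illustrative parameters (assumption): gate -> (w_h, w_e, bias).
FWD = {"f": (0.5, 0.4, 0.0), "i": (0.3, 0.8, 0.1), "c": (0.2, 0.9, 0.0), "o": (0.6, 0.5, 0.0)}
BWD = {"f": (0.4, 0.3, 0.1), "i": (0.5, 0.7, 0.0), "c": (0.1, 1.0, 0.0), "o": (0.7, 0.4, 0.1)}

states = bilstm([0.2, -0.5, 0.9], FWD, BWD)
for forward_h, backward_h in states:
    print(round(forward_h, 4), round(backward_h, 4))
```

Each position ends up with a pair of values: the forward state summarizes the text to its left and the reverse state the text to its right, which is the "complete contextual information" referred to above.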

The output of the BiLSTM network is then passed to the multi-head attention module, which assigns different feature weights according to the importance of the words in the text. This captures the semantic associations between words more effectively, strengthens the key words, and reduces noise in the text, thereby improving the accuracy of emotion recognition. The module uses multi-head attention to emphasize the semantic features of keywords and to capture different aspects of the input sequence. The output vectors h₁ to h_t of the BiLSTM feature extraction module are assembled into a matrix H, which serves as the input of the multi-head attention module. After h distinct linear transformations, three sequences Q, K, and V are generated:

$$Q = H W_{i}^{Q} \tag{12}$$

$$K = H W_{i}^{K} \tag{13}$$

$$V = H W_{i}^{V} \tag{14}$$

The attention scores of each head are then calculated separately and normalized by the Softmax function, and the normalized scores are multiplied by the corresponding values V to obtain the weighted values:

$$Attention(Q, K, V) = softmax\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{15}$$

Finally, the weighted results of all the heads are concatenated to form the output of the multi-head attention mechanism, which is then linearly transformed to give the final output:

$$head_{i} = Attention(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}) \tag{16}$$

$$M = Concat(head_{1}, ..., head_{h}) W^{o} \tag{17}$$

where $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}$ are the parameter matrices of the i-th linear transformation, d_k is the dimension of each attention head, $Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}$ are the projections of the three matrices Q, K, V obtained with a different set of matrices for each head, and W^o is the linear transformation applied to the concatenated results to generate the final output M of the multi-head attention layer.
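
A minimal plain-Python sketch of Equations (15) to (17) is given below, using nested lists as matrices so that every step is visible. The helper names (`matmul`, `attention`, `multi_head`) and the small matrices are hypothetical; the sketch applies each head's projections directly to the BiLSTM output matrix H, which is an assumption about how Q, K, and V are formed.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)                              # subtract the max for stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def attention(q, k, v):
    """Scaled dot-product attention, Eq. (15). Returns (output, weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]         # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head(h_mat, heads, w_o):
    """Eqs. (16)-(17): per-head attention, concatenation, output projection."""
    outputs = []
    for w_q, w_k, w_v in heads:
        head, _ = attention(matmul(h_mat, w_q), matmul(h_mat, w_k), matmul(h_mat, w_v))
        outputs.append(head)
    concat = [sum((out[t] for out in outputs), []) for t in range(len(h_mat))]
    return matmul(concat, w_o)

# Toy input (assumption): 3 positions with 2 features each, and 2 heads.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
HEADS = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
    ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]),
]
W_O = [[1.0, 0.0], [0.0, 1.0]]
M = multi_head(H, HEADS, W_O)
print(M)
```

Because each row of attention weights sums to 1, every output row is a weighted average of the value rows, with larger weights for positions whose keys match the query more closely.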

2.2.3

Output layer

The output part comprises a fully connected module and the Softmax function. To avoid overfitting, the Dropout technique is applied to the fully connected module. Dropout is a common regularization strategy in deep learning that reduces overfitting and improves the generalization of the model by randomly removing a portion of the neural network units. During training, Dropout temporarily discards neurons (including input neurons and hidden-layer neurons) with a certain probability (usually 0.5) by setting their outputs to zero, so that the discarded neurons take no part in forward or back propagation for that iteration. The process is shown in Fig. 2.

Fig. 2.

Mechanism of Dropout

During training, Dropout removes some neurons at random, which simplifies the network structure and reduces the possibility of overfitting. It prevents neurons from depending on specific input features, forcing the network to learn more robust and general features. In each training iteration, the neurons to be discarded are selected at random with a certain probability and their outputs are set to zero, as shown in Equation (18):

$$y = Dropout(W_{d} T + b_{d}) \tag{18}$$

where W_d is the weight matrix of the fully connected layer and b_d is the bias. After the predicted label distribution is normalized, the category with the largest probability value is selected as the prediction result Y:

$$Y = Softmax(y_{i}) = \frac{\exp(y_{i})}{\sum_{j} \exp(y_{j})} \tag{19}$$
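
The output layer can be sketched in plain Python as follows. The rescaling of surviving units by 1/(1-p) (so-called inverted dropout) is an assumption added here because it is the common practical variant; the text above only states that dropped outputs are set to zero. The label names and logits are likewise hypothetical.

```python
import math
import random

def dropout(values, p=0.5, rng=random):
    """Zero each value with probability p and rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

def softmax(logits):
    """Normalize scores into probabilities, as in Eq. (19)."""
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, labels=("negative", "neutral", "positive")):
    """Return the label with the largest probability, plus all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Dropout is applied only during training; prediction uses the raw logits.
train_activations = dropout([0.2, 1.0, 3.0, -0.7], p=0.5, rng=random.Random(0))
label, probs = predict([0.2, 1.0, 3.0])
print(train_activations)
print(label, [round(p, 3) for p in probs])
```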

3

Linguistic, cultural and emotional analysis

3.1

Analysis of linguistic and cultural elements

3.1.1

High TF-IDF value keyword analysis

The word is the smallest unit of analysis in natural language processing. Since the relevant algorithms cannot analyze the raw text directly, word segmentation must first be performed, using the word weights and preferences provided by the user dictionary and the stop-word list, to split the original texts of Southeast Asian Chinese literature into individual words. PaddleNLP was invoked in Python to segment, stage by stage (in 10-year intervals), 285 works of Southeast Asian Chinese literature published between 1961 and 2020; the results, retaining only verbs, nouns, and gerunds, are shown in Table 1. The number of works grows by leaps, yet the proportion of effective words remains at about 22% throughout. On the one hand, this indicates that the dictionaries were constructed reasonably and extract useful information effectively and robustly. On the other hand, it suggests that Southeast Asian Chinese literature may have had a stable creative content and framework over the long run. The descriptive statistics also show that the number of published works grew continuously. In 2011-2020 in particular, 129 works were published, nearly twice as many as in the previous period, and the total number of effective words increased by about 310% compared with the previous period, which suggests that Southeast Asian Chinese literature has entered a brand-new stage of development.

Table 1.

Descriptive statistics

Period	Publication quantity	Novel proportion	Word quantity	Effective word quantity	Effective word proportion
1961-1970	7	0.429	55898	12074	0.216
1971-1980	12	0.417	589059	129004	0.219
1981-1990	22	0.545	648918	142113	0.219
1991-2000	48	0.521	805036	177108	0.220
2001-2010	67	0.597	944736	207842	0.220
2011-2020	129	0.651	3860471	853164	0.221

Southeast Asian Chinese literature encompasses novels as well as other genres such as prose and poetry. Although novels and these other genres are highly similar in ultimate purpose, literary value, and creative arrangement, they differ clearly in creative perspective and structure. Specifically, novels have more pronounced macro features and tend to reflect social and cultural phenomena through the arrangement of storylines, whereas prose and poetry have more pronounced micro features and focus more on the author’s own feelings. To better analyze the characteristics of Chinese literature in Southeast Asia, this paper reports the proportion of novels together with the word counts in the descriptive statistics.

The TF-IDF value reflects the importance of a keyword in a set of text data. On the segmented text, this paper sets the window size to 100 units for keyword recognition; the results are shown in Table 2.

Table 2.

Keywords with high TF-IDF value (part)

1961-1970		1971-1980		1981-1990
Keyword	TF-IDF	Keyword	TF-IDF	Keyword	TF-IDF
China	8.214	Qu Yuan	16.815	Qu Yuan	27.405
Qu Yuan	7.082	China	15.393	China	22.183
Great Wall	5.038	Great Wall	12.852	Wuxia	21.035
Yangtze river	5.026	Forbidden city	12.787	Martial arts	20.788
Chopsticks	5.021	Wuxia	12.411	Great Wall	20.287
Chang’e flies to the moon	4.869	Yangtze river	9.223	Forbidden city	13.695
Wuxia	2.918	Jiangnan	7.889	Chopsticks	13.526
Jiangnan	2.471	Chopsticks	7.303	Jiangnan	11.246
West Lake	1.886	Liang Shanbo & Zhu Yingtai	5.905	Jianghu	10.188
Jianghu	1.062	Tea set	5.199	Bamboo forest	10.144
1991-2000		2001-2010		2011-2020
Keyword	TF-IDF	Keyword	TF-IDF	Keyword	TF-IDF
China	56.962	Qu Yuan	88.031	China	127.756
Qu Yuan	52.219	China	81.279	Qu Yuan	112.581
Wuxia	52.182	Jiangnan	71.965	Jiangnan	101.934
Tea set	47.655	Chopsticks	56.731	Great Wall	99.649
Great Wall	46.462	Li Bai	37.775	Tea set	89.415
Jianghu	29.361	Wuxia	34.832	Chopsticks	77.067
Jiangnan	29.306	Great Wall	29.216	Su Shi	75.014
Chopsticks	21.615	Incense burner	20.135	Wuxia	69.278
Chang’e flies to the moon	20.158	Forbidden city	19.718	Jianghu	65.818
Yangtze river	15.853	Martial arts	13.164	Peony flower	29.465

As Table 2 shows, “Qu Yuan” and “China” are the two most important elements in Southeast Asian Chinese literature: over the 60 years from 1961 to 2020, they are the top two keywords by TF-IDF value in every period. Over this time, the TF-IDF value of “Qu Yuan” increased from 7.082 to 112.581, and that of “China” from 8.214 to 127.756. In addition, elements such as “The Great Wall,” “martial arts,” “Jiangnan,” and “chopsticks” also appear frequently in Southeast Asian Chinese literature.

3.1.2

Analysis of co-occurring words

This subsection uses the Word2Vec model for co-occurrence analysis of Southeast Asian Chinese literature. In literature intelligence analysis, co-occurrence analysis has become an important means of quickly identifying how a field is developing. Its core logic is to count how selected keywords are distributed relative to one another, which reflects where attention in Southeast Asian Chinese literature was directed within a given period, and then to examine the contemporary connotations embedded in these words. Mapping the keywords identified in Table 2 back onto the original texts and calculating their spatial distribution gives the co-occurrence results shown in Table 3.

Table 3.

Co-occurrence matrix (part)

	Qu Yuan	Great Wall	Confucius	China	Yangtze river	Liang Shanbo & Zhu Yingtai	Wuxia	Martial arts	Mid-Autumn festival	Swordsman
Qu Yuan	0	154	130	118	118	102	89	112	51	83
Great Wall	154	0	124	102	113	86	106	45	82	80
Confucius	130	124	0	98	105	111	86	103	83	77
China	118	102	98	0	91	108	88	96	88	77
Yangtze river	118	113	105	91	0	99	75	92	71	76
Liang Shanbo & Zhu Yingtai	102	86	111	108	99	0	85	102	81	75
Wuxia	89	106	86	88	75	85	0	73	74	63
Martial arts	112	95	103	96	92	102	73	0	76	63
Mid-Autumn festival	90	82	83	88	71	81	74	76	0	38
Swordsman	83	80	77	77	76	75	63	63	38	0
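
One simple way to build a keyword co-occurrence matrix of this kind is to count how many text windows contain both keywords of a pair. The sketch below is an assumption about the counting scheme, since the exact procedure is not spelled out here: the function name `cooccurrence`, the use of non-overlapping windows, and the toy tokens are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tokens, keywords, window=100):
    """Count, for each keyword pair, the windows in which both keywords appear."""
    keyword_set = set(keywords)
    counts = Counter()
    for start in range(0, len(tokens), window):      # non-overlapping windows
        present = sorted(set(tokens[start:start + window]) & keyword_set)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1                      # keep the matrix symmetric
    return counts

# Toy example (assumption): six tokens, window of three tokens.
tokens = ["qu_yuan", "x", "great_wall", "y", "qu_yuan", "china"]
keywords = ["qu_yuan", "great_wall", "china"]
counts = cooccurrence(tokens, keywords, window=3)
for a in keywords:
    print(a, [counts[(a, b)] for b in keywords])
```

Under this scheme the matrix is symmetric by construction and its diagonal is zero, because a keyword is never paired with itself.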

3.1.3

Cluster analysis

The keywords of the Southeast Asian Chinese literature texts were clustered, and the results are shown in Table 4. After theme clustering with the Word2Vec model, the keywords fall into six thematic categories: Chinese historical celebrities (Cluster 1), Chinese seasonal customs (Cluster 2), hometown customs and implements (Cluster 3), mountains and rivers (Cluster 4), myths and legends (Cluster 5), and martial arts culture (Cluster 6).

Table 4.

Keyword clustering results

Cluster 1	Cluster 2	Cluster 3	Cluster 4	Cluster 5	Cluster 6
Qu Yuan	Mid-Autumn festival	Chopsticks	Great Wall	Liang Shanbo & Zhu Yingtai	Wuxia
Confucius	Dragon Boat Festival	China	Forbidden city	Chang’e flies to the moon	Jianghu
Li Bai	Qingming festival	Incense burner	Terracotta warriors	Nuwa mends the sky	Bamboo forest
Su Shi	Spring festival	Writing brush	Yangtze river	Hou Yi shoots the sun	Kungfu
Yue Fei	Double ninth festival	Ink	Yellow river	Jing Wei fills the sea	Swordsman
Li Qingzhao	Cold food festival	Tea set	Jiangnan	The legend of the white snake	Revenge
Wang Bo	Summer solstice	Screen	West Lake	Kua Fu chases the sun	Hero
Zhuge Liang	Winter solstice	Oil paper umbrella	Ming Tomb	Yu Gong moves the mountain	Red dust
Liu Yuxi	Autumn equinox	Peony flower	Tengwang Pavilion	King Yu combats the flood	Martial arts
Liu Yong	Spring equinox	Lotus flower	Yueyang Tower	Cangjie creates characters	Chivalry

3.2

Emotional analysis of literary works

This section further analyzes the authors’ sentiment tendencies under the six themes. It obtains the number and percentage of positive, neutral, and negative sentiment for each theme and presents the high-frequency words of positive and negative sentiment as word clouds, so as to gain insight into the authors’ sentiment expression.

The sentiment analysis tool used in this section is the Data Manager software, which automatically analyzes the sentiment tendency of text and outputs the results. It works by segmenting the text and scoring each resulting word against a sentiment dictionary: a text whose scores sum to more than 0 has a positive sentiment tendency, a sum equal to 0 indicates a neutral tendency, and a sum below 0 a negative tendency. The software comes with a sentiment dictionary of 38,452 commonly used Chinese words, each assigned a sentiment score between -10 and 10. Users can also modify the scores of individual words as needed or import a customized sentiment dictionary.
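
The dictionary-based scoring rule described above can be sketched as follows. This is not the Data Manager software itself: the tiny lexicon `SENTIMENT`, its scores, and the function name `sentiment_tendency` are hypothetical, and the sketch ignores degree and negation words.

```python
# Hypothetical sentiment lexicon (assumption): word -> score between -10 and 10.
SENTIMENT = {"beautiful": 6, "hero": 5, "homesick": -3, "sorrow": -6}

def sentiment_tendency(tokens, lexicon=SENTIMENT):
    """Sum the scores of the segmented words and map the total to a tendency."""
    total = sum(lexicon.get(token, 0) for token in tokens)  # unknown words score 0
    if total > 0:
        return "positive", total
    if total < 0:
        return "negative", total
    return "neutral", total

print(sentiment_tendency(["beautiful", "jiangnan", "homesick"]))  # ('positive', 3)
print(sentiment_tendency(["sorrow", "river"]))                    # ('negative', -6)
print(sentiment_tendency(["great_wall", "chopsticks"]))           # ('neutral', 0)
```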

The comments corresponding to the six themes obtained in the previous section are extracted to generate six new documents, which are imported in turn into the Data Manager software for sentiment analysis, yielding six “Theme Comments - Sentiment Tendency” result documents. Each document contains 12 columns: the serial number of the keyword, body text, sentiment tendency, positive words, negative words, degree words, negation words, number of positive sentences, positive score, number of negative sentences, negative score, and total score. The numbers of positive, neutral, and negative sentiment comments in the six documents were counted and their percentages calculated, giving the theme-sentiment statistics shown in Table 5.

Table 5.

Theme-affection tendency statistics

Theme	Affection tendency	Word quantity	Proportion	Theme	Affection tendency	Word quantity	Proportion
1	Positive	2467	80%	2	Positive	2136	74%
	Neutral	432	14%		Neutral	577	20%
	Negative	185	6%		Negative	173	6%
3	Positive	1792	67%	4	Positive	2113	83%
	Neutral	696	26%		Neutral	331	13%
	Negative	187	7%		Negative	102	4%
5	Positive	1669	72%	6	Positive	2449	88%
	Neutral	371	16%		Neutral	250	9%
	Negative	278	12%		Negative	84	3%

The data in Table 5 show that the proportion of positive sentiment exceeds 60% for all six themes, and the sum of positive and neutral sentiment exceeds 85%, indicating that the authors of Southeast Asian Chinese literature generally have a positive attitude towards Chinese culture. Among the six themes, Theme 3 has the smallest proportion of positive sentiment, only 67%, and Theme 5 has the largest proportion of negative sentiment, 12%. Themes 1, 4, and 6 all have positive-sentiment shares of 80% or more; Theme 6 has the highest share of positive sentiment, 88%, and only 3% negative sentiment, indicating that the authors are most enthusiastic about Chinese martial arts culture. Theme 3 has the largest proportion of neutral sentiment, reaching 26%, indicating that there is still room for improvement in how Southeast Asian Chinese literature authors write about hometown customs and implements.

3.3

Language Elements and Cultural Confidence in Southeast Asian Chinese Literature

Southeast Asia is a culturally diverse region, and Southeast Asian Chinese literature is an important part of overseas Chinese literature. It has inherited Chinese traditions while absorbing the diverse cultures of Southeast Asia.

In the language of Southeast Asian Chinese literature, Chinese elements take six forms. There are tributes to Chinese historical figures, such as Qu Yuan, who is particularly popular among Southeast Asian writers of Chinese literature. There are references to Chinese seasonal customs. There is nostalgia for hometown customs and utensils, such as chopsticks. There is fondness for mountains and rivers, such as the Great Wall. There is the transplantation and spread of myths and legends, such as Liang Shanbo and Zhu Yingtai and Chang’e Flying to the Moon. And there is the depiction and transmission of martial arts culture. Whether in modern poems, modern novels, or miscellaneous sketches of life, these Chinese cultural elements appear naturally in the Chinese communities of Southeast Asia and in the view of the people of Nanyang, and these treasures of Chinese culture shine brightly in Nanyang.

Ancient language plays an important role in carrying heritage in Southeast Asian Chinese literature. It holds rich cultural connotations and wisdom, which deepen the literature and give the works both a historical and a contemporary character. It also carries rich emotional expression, making the works more vivid and touching, and its allusions and idioms often give writers a source of creative inspiration. For readers, it offers life wisdom and emotional experience. This inheritance adds depth and breadth to literary works and helps to promote the essence of traditional culture.

Language is not only a tool for human thinking and communication but also a carrier of culture: it is essentially a historical net woven from layers of culture, connecting authors and readers through expression and appreciation. The richness and diversity of language, dialect, and ethnicity in Southeast Asian Chinese literary works set them apart from works from the mainland. At the same time, the blending of local color and humanistic character in the works, especially their cultural Chineseness, gives their language great value for cultural inheritance. Whether vernacular, dialect, or classical language, they all tell the Chinese story.

4

Conclusion

In this paper, we use text mining methods to extract linguistic and cultural elements from Southeast Asian Chinese literature, and propose a RoBERTa-BiLSTM-Attention model for sentiment recognition and analysis of Southeast Asian Chinese literature.

From 1961 to 2020, the number of Southeast Asian Chinese literary works grew steadily, with a sharp rise in 2011-2020, while the effective vocabulary ratio remained at about 22%. A total of 129 works were published in 2011-2020, nearly twice as many as in the previous decade, and the total effective vocabulary increased by 310% compared with 2001-2010. Over the 60-year period, the TF-IDF values of high-frequency keywords kept rising, and the words “Qu Yuan” and “China” consistently held the top two positions.

The Chinese elements in Southeast Asian Chinese literature can be grouped into six categories: Chinese historical figures, Chinese seasonal customs, hometown customs and implements, mountains and rivers, myths and legends, and martial arts culture. The authors of Southeast Asian Chinese literature hold positive attitudes toward all six categories, with positive sentiment of 80% or more toward Chinese historical figures, mountains and rivers, and martial arts culture, and the highest share (88%) for martial arts culture.

Language: English

Publication frequency: 1 issue per year
Journal subjects: Biology, Biology, other, Mathematics, Applied Mathematics, Mathematics, General, Physics, Physics, other

Ting Shao

Published online: 24 March 2025

Submitted: 03 Nov 2024

Accepted: 12 Feb 2025

DOI: https://doi.org/10.2478/amns-2025-0709

Keywords: Text mining, Sentiment analysis, TF-IDF, Word2Vec, Southeast Asian Chinese literature

© 2025 Ting Shao, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
