
Research on Language and Cultural Elements and Cultural Confidence of Chinese Literary Works in Southeast Asia in the Internet Era

  

Introduction

Literature is the art of language, a cultural activity conducted in language. Language patterns shape thinking patterns, and through thinking patterns exert a profound constructive effect on the cultural spirit of a nation, so literary activities built on a given language and culture inevitably reflect that cultural spirit [1-2]. Southeast Asia is a culturally diverse region, and Southeast Asian Chinese literature is an important part of overseas Chinese literature. Compared with the literary language of mainland China, the most distinctive feature of Southeast Asian Chinese literary language is that it generally retains the expression system of the northern Chinese dialects while carrying the unique flavor of the southern Chinese dialects, which is clearly reflected in vocabulary, grammar and other aspects [3-5]. In addition, because it is situated among the national languages of the various Southeast Asian countries and the expression systems of Western languages, its Chinese vocabulary absorbs many local and Western words that have become deeply embedded in the Chinese language system, creating a multilingual hybrid state that differs in aesthetic quality from mainland China's literary language [6-9]. In this seemingly non-standardized, even rustic, language, the rich and complex life of the Chinese in Southeast Asia is vividly conveyed, reflecting the strong inclusiveness and environmental adaptability of Chinese culture; it is a literary treasure trove worthy of cherishing and in-depth study [10-13].

To enhance cultural self-confidence, it is necessary to actively disseminate Chinese voices to the outside world, let Chinese culture go to the world, and let the world understand, recognize and respect it. In the Internet era, short videos and interactive social media have brought people closer together, and more and more young people around the world have begun to display the charm of the "Chinese flow" [14]. In-depth study of the linguistic and cultural elements of Southeast Asian Chinese literature can promote cultural understanding, enhance cultural self-confidence, encourage people to carry this culture forward and boldly display its charm, and lay a solid foundation for future cross-cultural exchanges [15-16].

In this paper, we conduct in-depth text mining of Southeast Asian Chinese literature: we extract keywords using the LDA model, measure their importance with the TF-IDF algorithm, and examine them through co-occurring word analysis and word clustering analysis based on the Word2Vec model. We then propose a sentiment recognition method based on RoBERTa-BiLSTM-Attention to analyze the sentiment of Southeast Asian Chinese literary works and their authors. On the basis of TF-IDF calculation, co-occurring word analysis and clustering analysis of the linguistic and cultural elements, the sentiment of the works and authors is analyzed, and the language and cultural elements, as well as cultural self-confidence, are examined in depth.

Textual and Emotional Studies of Chinese Literature in Southeast Asia
Extraction of Language and Cultural Elements of Southeast Asian Chinese Literature
Text data preprocessing

The purpose of preprocessing Southeast Asian Chinese literary texts is to transform them into a computer-recognizable form while eliminating noise and interference, so as to improve the accuracy of subsequent model processing.

Word segmentation

Words are the smallest units that express semantics. Unlike English text, which naturally uses spaces to divide words, Chinese has no fixed delimiter marking word boundaries. The same sentence can yield different meanings depending on how it is segmented, so the accuracy of word segmentation has a crucial impact on the results of text analysis.

The most common Chinese word segmentation methods are dictionary-based, statistics-based and comprehension-based. Dictionary-based methods require no additional corpus and are relatively light computationally, but they produce rather mechanical results; statistics-based methods identify words better but tend to omit low-frequency words and increase the computational load. In this paper, we use the jieba library in Python as the segmentation tool. The jieba segmentation process relies on its lexicon to construct a directed acyclic graph (DAG) for each input sentence and partitions words along the probability-maximizing path on the DAG; for words not included in the lexicon, it uses a Hidden Markov Model (HMM) and the Viterbi algorithm to find the most probable character sequence. Specifically, jieba offers three modes for slicing Chinese text: precise mode, full mode and search-engine mode. Full mode is fast but easily produces ambiguity, while search-engine mode applies a more complex algorithm that re-segments long words, improving the reliability of results in retrieval scenarios. Comparatively, precise mode best suits the accuracy and completeness demanded by text analysis. This paper therefore adopts precise mode to slice the text and ensure the accuracy of the results.
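A minimal sketch of the three jieba modes on a hypothetical sentence; the actual inputs are the literary texts described in this section.

```python
import jieba

# Hypothetical sample sentence; the real input is the literary corpus.
sentence = "屈原投江的传说在南洋华人社区代代相传"

# Precise mode (used in this paper): the most probable single segmentation.
print("/".join(jieba.cut(sentence, cut_all=False)))

# Full mode: lists every dictionary word it finds; fast but ambiguous.
print("/".join(jieba.cut(sentence, cut_all=True)))

# Search-engine mode: re-segments long words for higher recall.
print("/".join(jieba.cut_for_search(sentence)))
```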

Stop word removal

In information retrieval, to improve search efficiency and save storage space, words or phrases that occur frequently but carry little information are usually filtered out automatically before or after text processing; these filtered words are called stop words. Stop words increase the complexity of text processing, add noise, and may interfere with the accuracy of search results. Reasonably selecting and filtering stop words before segmenting with jieba is therefore a crucial step. An accurate and comprehensive stop word list can effectively improve the quality of text features, raise the recall of the model, reduce the dimensionality of text features, and improve the efficiency of information processing.
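A minimal sketch of stop word filtering combined with jieba segmentation; the stop word entries below are illustrative assumptions, whereas in practice a curated stop word list would be loaded from a file.

```python
import jieba

# Illustrative stop words; a real list is loaded from a curated file.
STOPWORDS = {"的", "了", "在", "是", "和", "就", "也"}

def tokenize(text):
    # Segment in precise mode, then drop stop words and whitespace tokens.
    return [w for w in jieba.cut(text) if w not in STOPWORDS and w.strip()]

print(tokenize("长城在中国的北方，是中华文化的象征"))
```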

Custom Dictionary

Constructing a custom dictionary helps the segmentation tool match likely word combinations more quickly, identify the proper nouns appearing in the text, and improve segmentation accuracy. When building a custom dictionary, one should first become familiar with the text, collect and organize the domain nouns involved, and store them in a computer-recognizable format. The segmentation tool reads the words in the custom dictionary and adds them to its initial dictionary, so that in subsequent segmentation the nouns registered in the custom dictionary are kept intact rather than cut apart.
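A sketch of registering custom-dictionary entries in jieba; the proper nouns below are illustrative examples, not the study's actual dictionary.

```python
import jieba

# Register proper nouns programmatically; equivalently, a dictionary file with
# one "word [freq] [POS]" entry per line can be loaded via jieba.load_userdict().
jieba.add_word("嫦娥奔月", freq=10, tag="nz")
jieba.add_word("梁山伯与祝英台", freq=10, tag="nz")

# Registered proper nouns are now kept intact during segmentation.
print("/".join(jieba.cut("嫦娥奔月的神话传到了南洋")))
```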

Synonymization

Synonym merging can improve the accuracy and reliability of word frequency statistics. It is generally applied to specific nouns and their abbreviations: after confirming that two expressions share the same meaning, they are unified and treated as equivalent during recognition.
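A minimal sketch of synonym merging applied to the token stream; the mappings below are illustrative assumptions, not the actual synonym table used in this study.

```python
# Map each variant or abbreviation to a canonical form before counting.
SYNONYMS = {
    "梁祝": "梁山伯与祝英台",   # abbreviation -> full proper noun (illustrative)
    "武侠小说": "武侠",
}

def merge_synonyms(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]
```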

TF-IDF algorithm

TF-IDF, i.e., term frequency-inverse document frequency, is a statistical analysis method that effectively measures the importance of a word in a document within a specific text set or corpus [17]. As a commonly used weighting technique, TF-IDF plays an indispensable role in information retrieval and text mining.

The principle of TF-IDF rests on two factors: term frequency (TF) and inverse document frequency (IDF). Term frequency indicates how often a word appears in a particular document: the more often a word occurs in a document, the greater its weight in assessing that document's content. The TF formula can be expressed as:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $tf_{i,j}$ is the term frequency of word $i$ in document $j$, the numerator $n_{i,j}$ is the number of times the word occurs in the document, and the denominator is the total number of words in the document.

Inverse document frequency (IDF) evaluates how widely a word occurs across the overall text set and measures the word's effectiveness in distinguishing between documents. When the IDF value is low, the term occurs in most documents and therefore contributes little to differentiating a specific document; a high IDF value marks a term concentrated in few documents and thus more discriminative. The IDF value is calculated as:

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $idf_i$ is the inverse document frequency of word $t_i$, the numerator $|D|$ is the total number of documents in the corpus, and the denominator is the number of documents containing the word.

Finally, TF and IDF are multiplied to obtain the TF-IDF value of a word in a particular document; the words with the highest TF-IDF weights can be regarded as the keywords of the article. TF-IDF is calculated as:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$

where $tfidf_{i,j}$ is the term frequency-inverse document frequency of word $i$ in document $j$.

The larger the TF-IDF value, the higher the importance of the word in the document. Therefore, using TF-IDF value to sort, classify and filter documents can improve the efficiency and accuracy of information retrieval and text mining.

In summary, this study follows the TF-IDF principle and uses the TfidfTransformer from Python's scikit-learn library to compute TF-IDF values over the segmentation results, thereby extracting the keywords of Southeast Asian Chinese literary texts. Counting the frequencies of these keywords and arranging them in descending order reveals the core content of the texts more clearly and displays the high-frequency words intuitively. On this basis, the high-frequency keywords are compared with the thematic terms derived from the LDA topic probability model to examine their consistency.
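A minimal sketch of this keyword extraction step with scikit-learn, using a hypothetical three-document mini-corpus of pre-segmented, space-joined tokens; the real corpus is the segmented literary texts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical pre-segmented documents (jieba tokens joined by spaces).
corpus = ["屈原 投江 端午", "长城 江南 筷子", "屈原 长城 武侠"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)          # raw term counts (TF)
tfidf = TfidfTransformer().fit_transform(counts)   # reweighted by IDF

# Keyword per document: the term with the highest TF-IDF weight.
terms = vectorizer.get_feature_names_out()
for doc_id, row in enumerate(tfidf.toarray()):
    print(doc_id, terms[np.argmax(row)], round(row.max(), 3))
```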

Keyword Vectorized Representation with the Word2Vec Model

The preprocessed text corpus is input into the Word2Vec model, which utilizes a neural network model to automatically learn the words in the corpus and map them into a continuous vector space, and finally obtains a vectorized representation of each word in the corpus. Then, the trained word vector model is used to vectorize the extracted topic keywords. By representing words as vectors, the Word2Vec model endows words with richer semantic information so that various operations can be carried out in the vector space, which provides an important support for calculating the similarity between topic keywords [18].

For the vectorized keywords, this paper quantifies the similarity between two words by computing the cosine of the angle between their keyword vectors, as shown below. The value lies between -1 and 1: the closer it is to 1, the more semantically similar the two keywords are, and the closer to -1, the less similar:

$$Sim(w_i, w_j) = \frac{\vec{w_i} \cdot \vec{w_j}}{|\vec{w_i}|\,|\vec{w_j}|}$$

where $w_i$ and $w_j$ are the two keywords and $\vec{w_i}$ is the vectorized representation of keyword $w_i$. $Sim(w_i, w_j)$ is the word vector similarity between $w_i$ and $w_j$, which to some extent represents the semantic correlation between the two keywords.
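A minimal sketch of training Word2Vec with gensim and computing the cosine similarity between two keyword vectors; the toy sentences and hyperparameters are illustrative assumptions.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized corpus; the real input is the preprocessed texts.
sentences = [
    ["屈原", "端午", "长城"],
    ["江南", "筷子", "长城"],
    ["屈原", "武侠", "江湖"],
]

# Small illustrative hyperparameters; real settings would need tuning.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Cosine similarity between two keyword vectors, as in the formula above.
print(model.wv.similarity("屈原", "长城"))
```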

Sentiment Analysis Model of Southeast Asian Chinese Literature

This paper proposes a RoBERTa-BiLSTM-Attention based sentiment recognition method to recognize and analyze the sentiment in Southeast Asian Chinese literature.

The RoBERTa-BiLSTM-Attention sentiment analysis method first feeds the Southeast Asian Chinese literary text into the RoBERTa model to obtain the corresponding vectorized sentence embeddings and word embeddings. A bidirectional LSTM network then learns contextual text features, and the attention mechanism focuses on the key information in the text. Finally, the output of the attention mechanism is combined with the sentence embeddings and mapped to the sentiment label space through a fully connected layer, whose output is normalized with the Softmax function to obtain the final classification result. The overall structure, shown in Fig. 1, comprises a text vector representation layer, a feature extraction layer and an output layer.

Figure 1.

Structure of the RoBERTa-BiLSTM-Attention sentiment analysis model

Text Vector Representation Layer

The text vector representation layer is one of the key parts of the RoBERTa-BiLSTM-Attention model; its main function is to transform the input text into a semantically rich vector form for the subsequent sentiment analysis. As a pre-trained model, RoBERTa performs deep feature extraction and encoding of the input text [19]. With its multi-layer Transformer encoder, the RoBERTa model learns rich semantic information and encodes it into contextually relevant word vectors.

During self-supervised training, RoBERTa-WWM masks a certain percentage of tokens in the input sentences using the MLM objective with whole-word masking. The pre-training process of the RoBERTa-WWM model consists of an Embedding layer and a Transformer layer, and extracts and learns the features and semantic information in the text data through the following steps:

Define the input sentence of the model as e = (e1, e2, …, en), where ei denotes the i-th character of the input sentence and n denotes the sentence length.

In the Embedding layer of the RoBERTa-WWM model, the input sentence is processed and transformed into an input sequence T = (t1, t2, …, tn). This step fuses Token Embeddings (word embedding vectors), Segment Embeddings (segmentation vectors) and Position Embeddings (position vectors) to obtain the final representation: Token Embeddings are looked up in the word vector table, Segment Embeddings indicate which sentence each word belongs to, and Position Embeddings encode the positional information of each word.

Sequence T = (t1, t2, …, tn) is fed into the Transformer layer in order to extract features of the sequence. This step produces a semantically rich output sequence h = (h1, h2, …, hn), which can be used as an encoded representation for the extraction of feature relations in subsequent processing.
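A minimal sketch of this encoding step with the Hugging Face transformers library, assuming the public hfl/chinese-roberta-wwm-ext checkpoint (which its authors recommend loading through the BERT classes); the paper's exact weights and inputs are not specified.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed public RoBERTa-WWM checkpoint; loaded via BERT classes per its docs.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

inputs = tokenizer("举头望明月，低头思故乡", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

word_embeddings = outputs.last_hidden_state  # h = (h1, ..., hn), one per token
sentence_embedding = outputs.pooler_output   # sentence-level vector
print(word_embeddings.shape, sentence_embedding.shape)
```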

Feature extraction layer

The feature extraction layer consists of a Bidirectional Long Short-Term Memory (BiLSTM) layer and an attention (Attention) layer, which together model the sequence data effectively; the final predicted label is determined after all elements are considered together. The computations involved are:

$$f_t = \sigma(W_f[h_{t-1}, e_t] + b_f)$$
$$\tilde{C}_t = \tanh(W_c[h_{t-1}, e_t] + b_c)$$
$$i_t = \sigma(W_i[h_{t-1}, e_t] + b_i)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma(W_o[h_{t-1}, e_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

In the Long Short-Term Memory (LSTM) formulation, the forget gate state is denoted by $f_t$, the input gate state by $i_t$, and the output gate state by $o_t$. $W_f$, $W_i$ and $W_o$ are the weight coefficient matrices of the forget, input and output gates, respectively, and $b_f$, $b_i$ and $b_o$ their bias vectors. Among these gating mechanisms, $\sigma$ is the Sigmoid function, which is used for activation. In the reverse LSTM, the sequence is processed once more in the opposite direction: at each position $t$, the word vector $e_t$ is combined with the hidden state of the previous step in that direction to obtain the backward representation $\overleftarrow{h_t}$.

Thus, by concatenating the forward representation $\overrightarrow{h_t}$ and the backward representation $\overleftarrow{h_t}$, the complete contextual information of the text is obtained, giving the final hidden state at the current moment $t$:

$$h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$$

The output of the BiLSTM is then fed into the multi-head attention module, which assigns different feature weights according to the importance of the words in the text, grasps the semantic associations between words more effectively, enhances the importance of key words and reduces textual noise, thus greatly improving the accuracy of sentiment recognition. The multi-head attention module uses multiple attention heads to emphasize the semantic characteristics of keywords and capture partial qualities of the input sequence. The output vectors $h_1$ to $h_t$ of the BiLSTM feature extraction module are assembled into a matrix $H$, which becomes the input of the multi-head attention module; after $h$ distinct linear transformations, the three sequences $Q$, $K$ and $V$ are generated:

$$Q = HW_i^Q, \quad K = HW_i^K, \quad V = HW_i^V$$

The attention scores of each head are then computed separately, normalized by the Softmax function, and finally multiplied with the corresponding values $V$ to obtain the weighted values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Finally, the weighted results of all heads are concatenated to form the output of the multi-head attention mechanism, which is then linearly transformed to produce the final output:

$$head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
$$M = \mathrm{Concat}(head_1, \ldots, head_h)W^o$$

where $W_i^Q$, $W_i^K$ and $W_i^V$ are the parameter matrices of the $i$-th linear transformation, $d_k$ is the dimension of each attention head, $QW_i^Q$, $KW_i^K$ and $VW_i^V$ apply distinct projection matrices to $Q$, $K$ and $V$ for each head, and $W^o$ is the linear transformation of the concatenated results that produces the final output $M$ of the multi-head attention layer.
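A minimal PyTorch sketch of the feature extraction layer described above, combining a BiLSTM with multi-head self-attention; all layer sizes (embed_dim, hidden_dim, num_heads) are illustrative assumptions, and mean pooling stands in for the paper's unspecified aggregation step.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    # Sketch of the feature extraction + output layers; sizes are assumptions.
    def __init__(self, embed_dim=768, hidden_dim=128, num_heads=4, num_classes=3):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, word_embeddings):
        H, _ = self.bilstm(word_embeddings)   # h_t = [forward ; backward]
        M, _ = self.attn(H, H, H)             # multi-head self-attention over H
        pooled = M.mean(dim=1)                # aggregate token representations
        return self.fc(self.dropout(pooled))  # logits; Softmax applied later

# Usage with the RoBERTa word embeddings from the previous sketch:
# logits = BiLSTMAttention()(word_embeddings)
```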

Output layer

The output part comprises a fully connected module and the Softmax function; in addition, to avoid overfitting, the Dropout technique is applied to the fully connected module. Dropout is a common regularization strategy in deep learning that reduces overfitting and enhances the generalization of the model by randomly removing a portion of the neural network units. During training, Dropout temporarily sets the outputs of neurons (including input and hidden-layer neurons) to zero with a certain probability (usually 0.5), removing them from that round of forward and backward propagation. The process is shown in Fig. 2.

Figure 2.

Dropout workflow diagram

During training, Dropout randomly removes some neurons, simplifying the network structure and reducing the possibility of overfitting; it prevents neurons from depending on specific input features, forcing the network to learn more robust and generalized features. In each training iteration, the neurons to be discarded are randomly selected with a certain probability and their outputs are set to zero:

$$y = \mathrm{Dropout}(W_d M + b_d)$$

where $W_d$ is the weight matrix of the fully connected layer and $b_d$ its bias. After normalizing the label prediction distribution, the category with the largest probability value is selected as the prediction result $Y$:

$$Y = \mathrm{Softmax}(y_i) = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$$
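A small illustration of the output layer's behavior, with hypothetical sizes (256 features, 3 sentiment classes): Dropout is only active in training mode, and the predicted class is the one with the largest Softmax probability.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)     # zeroes activations with probability 0.5
fc = nn.Linear(256, 3)       # hypothetical: 256 features -> 3 classes

x = torch.randn(1, 256)
drop.train()                             # training mode: dropout active
logits = fc(drop(x))                     # y = W_d * dropped(x) + b_d
probs = torch.softmax(logits, dim=-1)    # Y = Softmax(y)
print(probs, probs.argmax(dim=-1))       # class with the largest probability

drop.eval()                              # inference mode: dropout disabled
```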

Linguistic, cultural and emotional analysis
Analysis of linguistic and cultural elements
High TF-IDF value keyword analysis

The lexicon is the smallest research granularity in natural language processing. Since the relevant algorithms cannot analyze textual information directly, segmentation must be performed using the lexical weights and preference information provided by the user dictionary and the stop word list, slicing the original texts of Southeast Asian Chinese literature into multiple lexemes so that the research can proceed smoothly. The PaddleNLP library was invoked in Python to segment, in ten-year stages, 285 Southeast Asian Chinese literary works published between 1961 and 2020; the results, retaining only verbs, nouns and gerunds, are shown in Table 1. The number of Southeast Asian Chinese literary works shows leaping growth, while the effective vocabulary ratio has long remained at about 22%. On the one hand, this indicates that the lexicon was constructed reasonably and can extract effective information robustly; on the other hand, it suggests that Southeast Asian Chinese literature may have maintained a stable creative content and framework over the long run. In addition, the descriptive statistics show that the number of published works has grown continuously: during 2011-2020 the number of published works reached 129, nearly twice that of the previous decade, and the total effective vocabulary increased by as much as 310% compared with the previous period, suggesting that Southeast Asian Chinese literature has entered a brand-new stage of development.

Descriptive statistics

Period Publication quantity Novel proportion Word quantity Effective word quantity Effective word proportion
1961-1970 7 0.429 55898 12074 0.216
1971-1980 12 0.417 589059 129004 0.219
1981-1990 22 0.545 648918 142113 0.219
1991-2000 48 0.521 805036 177108 0.220
2001-2010 67 0.597 944736 207842 0.220
2011-2020 129 0.651 3860471 853164 0.221

Southeast Asian Chinese literature encompasses novels as well as other genres such as prose and poetry. Although the two groups are highly similar in ultimate purpose, literary value and creative arrangement, they differ clearly in creative perspective and structure. Novels have more pronounced macro features and tend to reflect social and cultural phenomena through the arrangement of storylines, while prose and poetry have more pronounced micro features and focus on the author's specific feelings. To better characterize Chinese literature in Southeast Asia, this paper therefore reports the proportion of novels alongside the word counts in the textual analysis.

The TF-IDF value reflects the importance of keywords within a set of text data. On the basis of the segmentation results, this paper sets a window of 100 units in the text for keyword recognition; the processing results are shown in Table 2.

Keywords with high TF-IDF value (part)

1961-1970 1971-1980 1981-1990
Keyword TF-IDF Keyword TF-IDF Keyword TF-IDF
China 8.214 Qu Yuan 16.815 Qu Yuan 27.405
Qu Yuan 7.082 China 15.393 China 22.183
Great Wall 5.038 Great Wall 12.852 Wuxia 21.035
Yangtze river 5.026 Forbidden city 12.787 Martial arts 20.788
Chopsticks 5.021 Wuxia 12.411 Great Wall 20.287
Chang’e flies to the moon 4.869 Yangtze river 9.223 Forbidden city 13.695
Wuxia 2.918 Jiangnan 7.889 Chopsticks 13.526
Jiangnan 2.471 Chopsticks 7.303 Jiangnan 11.246
West Lake 1.886 Liang Shanbo & Zhu Yingtai 5.905 Jianghu 10.188
Jianghu 1.062 Tea set 5.199 Bamboo forest 10.144
1991-2000 2001-2010 2011-2020
Keyword TF-IDF Keyword TF-IDF Keyword TF-IDF
China 56.962 Qu Yuan 88.031 China 127.756
Qu Yuan 52.219 China 81.279 Qu Yuan 112.581
Wuxia 52.182 Jiangnan 71.965 Jiangnan 101.934
Tea set 47.655 Chopsticks 56.731 Great Wall 99.649
Great Wall 46.462 Li Bai 37.775 Tea set 89.415
Jianghu 29.361 Wuxia 34.832 Chopsticks 77.067
Jiangnan 29.306 Great Wall 29.216 Su Shi 75.014
Chopsticks 21.615 Incense burner 20.135 Wuxia 69.278
Chang’e flies to the moon 20.158 Forbidden city 19.718 Jianghu 65.818
Yangtze river 15.853 Martial arts 13.164 Peony flower 29.465

As can be seen from Table 2, "Qu Yuan" and "China" are the two most crucial elements in Southeast Asian Chinese literature: during the 60 years from 1961 to 2020, the TF-IDF values of these two words consistently ranked in the top two at every stage. Over those 60 years, the TF-IDF value of "Qu Yuan" rose from 7.082 to 112.581, and that of "China" from 8.214 to 127.756. In addition, elements such as "Great Wall," "martial arts," "Jiangnan," and "chopsticks" also appear frequently in Southeast Asian Chinese literature.

Analysis of co-occurring words

This subsection uses the Word2Vec model for co-occurring word analysis of Southeast Asian Chinese literature. In literature intelligence analysis, co-occurring word analysis has become an important means of quickly identifying the development of a discipline. Its core logic is to reflect the allocation of attention within Southeast Asian Chinese literature over a certain period by counting the spatial distribution of selected keywords, and then to examine the contemporary connotations embedded in these words. Mapping the keyword identification results of Table 2 back onto the original texts and calculating their spatial distribution yields the co-occurring word results shown in Table 3 (a counting sketch follows the table).

Co-occurring word matrix (part)

Qu Yuan Great Wall Confucius China Yangtze river Liang Shanbo & Zhu Yingtai Wuxia Martial arts Mid-Autumn festival Swordsman
Qu Yuan 0 154 130 118 118 102 89 112 51 83
Great Wall 154 0 124 102 113 86 106 45 82 80
Confucius 130 124 0 98 105 111 86 103 83 77
China 118 102 98 0 91 108 88 96 88 77
Yangtze river 118 113 105 91 0 99 75 92 71 76
Liang Shanbo & Zhu Yingtai 102 86 111 108 99 0 85 102 81 75
Wuxia 89 106 86 88 75 85 0 73 74 63
Martial arts 112 95 103 96 92 102 73 0 76 63
Mid-Autumn festival 90 82 83 88 71 81 74 76 0 38
Swordsman 83 80 77 77 76 75 63 63 38 0
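A minimal sketch of window-based co-occurrence counting, assuming the 100-token windows used for keyword recognition above; docs and keywords are placeholders for the segmented corpus and the Table 2 keyword set.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs, keywords, window=100):
    # docs: list of token lists; keywords: set of keywords from Table 2.
    counts = Counter()
    for tokens in docs:
        for start in range(0, len(tokens), window):
            present = sorted(set(tokens[start:start + window]) & keywords)
            for a, b in combinations(present, 2):
                counts[(a, b)] += 1   # symmetric pair count per window
    return counts
```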
Cluster analysis

The keywords of Southeast Asian Chinese literary texts were clustered, with the results shown in Table 4 (a clustering sketch follows the table). After theme clustering with the Word2Vec model, six categories of authorial concern are summarized: Chinese historical celebrities (Cluster I), Chinese seasonal customs (Cluster II), hometown customs and implements (Cluster III), mountains and rivers (Cluster IV), myths and legends (Cluster V), and martial arts culture (Cluster VI).

Keyword clustering results

Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6
Qu Yuan Mid-Autumn festival Chopsticks Great Wall Liang Shanbo & Zhu Yingtai Wuxia
Confucius Dragon Boat Festival China Forbidden city Chang’e flies to the moon Jianghu
Li Bai Qingming festival Incense burner Terracotta warriors Nuwa mends the sky Bamboo forest
Su Shi Spring festival Writing brush Yangtze river Hou Yi shoots the sun Kungfu
Yue Fei Double ninth festival Ink Yellow river Jing Wei fills the sea Swordsman
Li Qingzhao Cold food festival Tea set Jiangnan The legend of the white snake Revenge
Wang Bo Summer solstice Screen West Lake Kua Fu chases the sun Hero
Zhuge Liang Winter solstice Oil paper umbrella Ming Tomb Yu Gong moves the mountain Red dust
Liu Yuxi Autumn equinox Peony flower Tengwang Pavilion King Yu combats the flood Martial arts
Liu Yong Spring equinox Lotus flower Yueyang Tower Cangjie creates characters Chivalry
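A minimal sketch of the clustering step, assuming model is the trained gensim Word2Vec model from above and keywords the extracted keyword list; k-means with six clusters stands in for the paper's unspecified clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_keywords(model, keywords, n_clusters=6):
    # Keep only keywords that received a Word2Vec vector during training.
    words = [w for w in keywords if w in model.wv]
    vectors = np.array([model.wv[w] for w in words])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(label, []).append(word)
    return clusters   # e.g. {0: ["屈原", "李白", ...], ...}
```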
Sentiment analysis of literary works

This section further analyzes the authors' sentiment tendencies under the six thematic clusters, obtaining the number and percentage of positive, neutral and negative sentiments for each theme, and presents the high-frequency positive and negative sentiment words as word clouds to gain insight into the authors' emotional expression.

The sentiment analysis tool used in this section is the Data Manager software, which automatically analyzes the sentiment tendency of a text and outputs the result. Its principle is to segment the text and score each word according to a sentiment dictionary: a text whose sentiment scores sum to more than 0 has a positive tendency, exactly 0 a neutral tendency, and less than 0 a negative tendency. The software ships with a sentiment dictionary of 38,452 common Chinese words, each assigned a sentiment score between -10 and 10. Users can also modify the sentiment scores of words or import a customized sentiment dictionary as needed.
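A minimal sketch of this dictionary-based scoring scheme; the entries below are invented for illustration, whereas the actual tool uses its 38,452-word dictionary.

```python
# Illustrative sentiment dictionary: word -> score in [-10, 10].
SENTIMENT = {"壮丽": 4, "欢喜": 5, "凄凉": -5, "悲愤": -4}

def sentiment_tendency(tokens):
    score = sum(SENTIMENT.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment_tendency(["长城", "壮丽", "欢喜"]))  # -> positive
```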

The text passages corresponding to the six themes obtained in the previous section are extracted into six new documents, which are imported in turn into the Data Manager software for sentiment analysis, yielding six "theme text - sentiment tendency" result documents. Each document contains 12 columns: keyword serial number, body text, sentiment tendency, positive words, negative words, degree words, negation words, number of positive sentences, positive score, number of negative sentences, negative score, and total score. The numbers of positive, neutral and negative passages in the six documents were counted and their percentages calculated, giving the theme-sentiment statistics shown in Table 5.

Theme-sentiment tendency statistics

Theme Sentiment tendency Word quantity Proportion
1 Positive 2467 80%
1 Neutral 432 14%
1 Negative 185 6%
2 Positive 2136 74%
2 Neutral 577 20%
2 Negative 173 6%
3 Positive 1792 67%
3 Neutral 696 26%
3 Negative 187 7%
4 Positive 2113 83%
4 Neutral 331 13%
4 Negative 102 4%
5 Positive 1669 72%
5 Neutral 371 16%
5 Negative 278 12%
6 Positive 2449 88%
6 Neutral 250 9%
6 Negative 84 3%

Observing Table 5, the proportion of positive sentiment exceeds 60% for all six themes, and the sum of positive and neutral sentiment exceeds 85%, indicating that the authors of Southeast Asian Chinese literature hold a generally positive attitude toward Chinese culture. Among the six themes, Theme 3 has the smallest proportion of positive sentiment, only 67%, and Theme 5 the largest proportion of negative sentiment, 12%. Themes 1, 4 and 6 all exceed 80% positive sentiment; Theme 6 has the highest proportion of positive sentiment, 88%, with only 3% negative, indicating that the authors are most enthusiastic about Chinese martial arts culture. Theme 3 has the largest proportion of neutral sentiment, 26%, suggesting that there is still some room for improvement in writing on hometown customs and implements.

Language Elements and Cultural Confidence in Southeast Asian Chinese Literature

Southeast Asia is a region of diverse cultural traditions, and Southeast Asian Chinese literature is a significant component of overseas Chinese literature. It has inherited Chinese traditions while incorporating the diverse cultures of Southeast Asia.

Under the linguistic display of Southeast Asian Chinese literature, Chinese elements show six facets. There are tributes to Chinese historical celebrities such as Qu Yuan, of whom Southeast Asian Chinese writers are particularly fond; commemorations of Chinese seasonal customs; nostalgia for hometown customs and utensils, such as chopsticks; fondness for mountains and rivers, such as the Great Wall; the transplantation and spread of myths and legends, such as Liang Shanbo and Zhu Yingtai and Chang'e Flies to the Moon; and the depiction and transmission of martial arts culture. Whether as modern poems, modern novels or miscellaneous life stories, these works let Chinese cultural elements appear naturally before the Chinese community in Southeast Asia and in the vision of the Nanyang people, and these beautiful treasures of Chinese culture shine brightly in Nanyang.

The ancient language plays an important role of inheritance in Southeast Asian Chinese literature. It carries rich cultural connotations and wisdom, deepens the works' heritage, and gives them both a historical and a contemporary sense. It contains rich emotional expression, making the works more vivid and touching; meanwhile, the allusions and idioms of the ancient language often provide writers with sources of creative inspiration. The ancient language not only offers literary works profound cultural connotations but also gives readers rich life wisdom and emotional experience. This inheritance lends literary works greater depth and breadth and helps carry forward the essence of traditional culture.

Language is not only a tool of human thinking and communication but also a carrier of culture, for language is essentially a historical net woven layer by layer from culture, connecting readers and authors between expression and appreciation. The richness and diversity of language, dialect and ethnicity make Southeast Asian Chinese literary works unique compared with works from the mainland. At the same time, the intermingling of the local and the humanistic in these works, especially their cultural Chineseness, gives their language great value for cultural inheritance. Whether in the vernacular, in dialect or in the classical language, they all tell the Chinese story.

Conclusion

In this paper, we use text mining methods to extract linguistic and cultural elements from Southeast Asian Chinese literature, and propose RoBERTa-BiLSTM-Attention for sentiment recognition and analysis of Southeast Asian Chinese literature.

From 1961 to 2020, the number of Southeast Asian Chinese literary works kept growing, showing leaping growth especially during 2011-2020, while the effective vocabulary ratio remained at about 22%. The number of works published during 2011-2020 was 129, nearly twice that of the previous decade, and the total effective vocabulary increased by 310% compared with 2001-2010. Over the 60 years, the TF-IDF values of the high-frequency keywords kept increasing, with "Qu Yuan" and "China" always occupying the top two positions.

The Chinese elements in Southeast Asian Chinese literature can be categorized into six groups: Chinese historical celebrities, Chinese seasonal customs, hometown customs and implements, mountains and rivers, myths and legends, and martial arts culture. The authors of Southeast Asian Chinese literature hold positive attitudes toward all six, with more than 80% positive sentiment toward Chinese historical celebrities, mountains and rivers, and martial arts culture, and the highest proportion of positive sentiment toward martial arts culture (88%).