Research on Language and Cultural Elements and Cultural Confidence of Chinese Literary Works in Southeast Asia in the Internet Era
Published online: 24 March 2025
Submitted: 03 Nov 2024
Accepted: 12 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0709
© 2025 Ting Shao, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Literature is the art of language and a cultural activity carried out through language. Patterns of language shape patterns of thinking, and through them exert a profound constructive effect on the cultural spirit of a nation; literary activity built on a given language and culture therefore inevitably reflects that cultural spirit [1-2]. Southeast Asia is a culturally diverse region, and Southeast Asian Chinese literature is an important part of overseas Chinese literature. Compared with the literary language of mainland China, the most distinctive feature of the Southeast Asian Chinese literary language is that it broadly retains the expression system of the northern Chinese dialects while carrying the unique flavor of the southern Chinese dialects, which is clearly reflected in vocabulary, grammar, and other aspects [3-5]. In addition, because it is surrounded by the national languages of the various Southeast Asian countries and by the expression systems of Western languages, its Chinese vocabulary is mixed with many words from local and Western languages that have become deeply embedded in the Chinese language system. The result is a multilingual, hybrid literary language whose aesthetic qualities differ from those of mainland China’s literary language [6-9]. Through seemingly non-standard and even rustic expression, the rich and complex life of the Chinese in Southeast Asia is vividly conveyed, reflecting the strong inclusiveness of Chinese culture and its ability to adapt to its environment. It is a literary treasure trove that deserves to be cherished and studied in depth [10-13].
To enhance cultural self-confidence, it is necessary to actively convey Chinese voices to the outside world, so that Chinese culture reaches the world and the world comes to understand, recognize, and respect it. In the Internet era, short videos and social media interaction have brought people closer together, and more and more young people have begun to show the charm of the “Chinese flow” in countries around the world [14]. In-depth study of the linguistic and cultural elements of Southeast Asian Chinese literature can promote cultural understanding, enhance cultural self-confidence, make people able and willing to carry this culture forward and boldly display the charm of the “Chinese flow”, and lay a solid foundation for future cross-cultural exchanges [15-16].
In this paper, we conduct in-depth text mining of Southeast Asian Chinese literature. Keywords are extracted with the LDA model, and their importance is measured with the TF-IDF algorithm. The keywords are then examined through co-occurring word analysis and word clustering analysis based on the Word2Vec model. A sentiment recognition method based on RoBERTa-BiLSTM-Attention is proposed to analyze the sentiment of Southeast Asian Chinese literary works and their authors. Building on the TF-IDF values, co-occurring word analysis, and clustering analysis of the linguistic and cultural elements, the sentiment of the works and authors is analyzed, and the linguistic and cultural elements, as well as cultural self-confidence, are discussed in depth.
The purpose of preprocessing the Southeast Asian Chinese literary text is to transform the text into a computer-recognizable form, and at the same time, to eliminate some of the noise and interference, so as to improve the accuracy of the model processing.
Word segmentation. Words are the smallest units that express meaning. Unlike English text, in which spaces naturally separate words, Chinese text has no fixed delimiter between words. The same sentence can take on different meanings depending on how it is segmented, so the accuracy of word segmentation has a crucial impact on the results of text analysis. The common Chinese word segmentation methods are dictionary-based, statistics-based, and understanding-based. Dictionary-based methods need no additional corpus and are relatively cheap to compute, but their results are relatively mechanical. Statistics-based methods identify words better, but they tend to miss low-frequency words and require more computation. In this paper, we use the jieba word segmentation library in Python. Relying on its dictionary, jieba constructs a directed acyclic graph (DAG) for the input sentence and segments it along the maximum-probability path on the DAG; for words not included in the dictionary, it uses a Hidden Markov Model (HMM) and the Viterbi algorithm to find the most probable sequence. jieba offers three modes for segmenting Chinese text: accurate mode, full mode, and search engine mode. Full mode is fast but prone to ambiguity, while search engine mode builds on accurate mode and segments long words a second time, which improves recall. Comparatively speaking, accurate mode best meets the demands of text analysis for accuracy and completeness. Therefore, this paper adopts the accurate mode to segment the text.
Stop-word removal. In information retrieval, to improve search efficiency and save storage space, words or phrases that occur with high frequency but carry little information are usually filtered out automatically before or after the text is processed; these filtered words are called stop words. Stop words increase the complexity of text processing, add noise, and may interfere with the accuracy of the results. Selecting and filtering stop words reasonably is therefore a crucial step when using jieba for word segmentation. An accurate and comprehensive stop-word list can effectively improve the quality of text features, raise the recall of the model, reduce the dimensionality of the text features, and improve processing efficiency.
Custom dictionary. Constructing a custom dictionary helps the segmentation tool match possible word combinations more quickly, identify proper nouns in the text, and improve segmentation accuracy. To build it, one first becomes familiar with the text, collects the domain-specific nouns involved, and stores them in the custom dictionary in a machine-readable format. The segmentation tool reads these words and adds them to its initial dictionary, so that in subsequent segmentation the nouns listed in the custom dictionary are kept whole rather than cut apart.
Synonym merging. Merging synonyms improves the accuracy and reliability of word frequency statistics. It is generally applied to specific nouns and their abbreviations: once the two are confirmed to have the same meaning, they are unified and treated as equivalent in the identification process.
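The dictionary-plus-DAG idea described above can be illustrated with a small pure-Python sketch. It is not jieba's implementation: the toy dictionary, its frequencies, and the function names `build_dag` and `segment` are assumptions made for illustration, and unknown characters simply fall back to single-character words instead of going through an HMM.

```python
import math

# Hypothetical toy dictionary (word -> frequency); not jieba's real dictionary.
TOY_DICT = {
    "东南亚": 50, "东南": 20, "南亚": 10, "华文": 30, "文学": 60,
    "东": 5, "南": 5, "亚": 5, "华": 5, "文": 8, "学": 8,
}
TOTAL = sum(TOY_DICT.values())


def build_dag(sentence, dictionary):
    """For each start index, list every end index whose slice is a dictionary word."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start + 1, n + 1)
                if sentence[start:end] in dictionary]
        # Fall back to a single character so every position stays reachable.
        dag[start] = ends or [start + 1]
    return dag


def segment(sentence, dictionary=TOY_DICT, total=TOTAL):
    """Segment along the maximum-probability path of the DAG (dynamic programming)."""
    dag = build_dag(sentence, dictionary)
    n = len(sentence)
    # best[i] = (best log-probability of sentence[i:], end index of the first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(dictionary.get(sentence[start:end], 1) / total) + best[end][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end])
        i = end
    return words


print(segment("东南亚华文文学"))
```

Adding a proper noun to `TOY_DICT` with a high frequency plays the role of the custom dictionary: the word then wins the path comparison and is kept whole.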
TF-IDF, i.e., term frequency-inverse document frequency, is a statistical method that measures the importance of a word to a document within a specific text set or corpus [17]. As a commonly used weighting technique, TF-IDF plays an indispensable role in information retrieval and text mining.
The principle of TF-IDF rests on two factors: term frequency (TF) and inverse document frequency (IDF). Term frequency indicates how often a word appears in a particular document: the more often a word occurs in a document, the more weight it carries in assessing the content of that document. The formula for TF can be expressed as:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{1}$$
where $tf_{i,j}$ is the term frequency of word $i$ in text $j$, the numerator $n_{i,j}$ is the number of times the word occurs in the text, and the denominator $\sum_{k} n_{k,j}$ is the total number of words in the text.
Inverse document frequency (IDF) is a key metric for how widely a word occurs across the whole text set, and it measures how well the word distinguishes between documents. In general, a low IDF value means that the term occurs in most documents and therefore contributes little to distinguishing specific documents, whereas a high IDF value means that the term is concentrated in a few documents. The IDF value is calculated as:
$$idf_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \tag{2}$$
where $idf_i$ is the inverse document frequency of word $i$, the numerator $|D|$ is the total number of documents in the corpus, and the denominator is the number of documents containing the word.
Finally, TF and IDF are multiplied to get the TF-IDF value of a word in a particular document. The words with the highest TF-IDF weights can be regarded as the keywords of the text. The TF-IDF value is calculated as:
$$tfidf_{i,j} = tf_{i,j} \times idf_{i} \tag{3}$$
where $tfidf_{i,j}$ is the term frequency-inverse document frequency, $tf_{i,j}$ is the term frequency, and $idf_i$ is the inverse document frequency.
The larger the TF-IDF value, the more important the word is in the document. Sorting, classifying, and filtering documents by TF-IDF value can therefore improve the efficiency and accuracy of information retrieval and text mining.
In summary, this study applies the TF-IDF principle and uses the TfidfTransformer class from the scikit-learn library in Python to compute TF-IDF values on the word segmentation results, so as to extract the keywords of Southeast Asian Chinese literary texts. Counting these keywords and arranging them in descending order reveals the core content of the texts more clearly and shows the high-frequency words more intuitively. On this basis, the high-frequency keywords are compared with the topic terms derived from the LDA topic probability model to explore the consistency between them.
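As a rough illustration of the definition, the following sketch computes TF-IDF directly as term frequency times the logarithm of the inverse document frequency. It is a plain-definition version written for clarity, not the paper's pipeline: scikit-learn's TfidfTransformer by default applies a smoothed IDF and L2 normalization, so its numbers will differ. The function name `tf_idf` and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf} dict per document.

    documents: list of token lists (already segmented, stop words removed).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # tf = count / total words; idf = log(N / document frequency)
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical segmented documents, for illustration only.
docs = [
    ["屈原", "端午", "屈原", "长城"],
    ["长城", "江南"],
    ["长城", "筷子", "茶具"],
]
scores = tf_idf(docs)
# Keywords of the first document, highest weight first.
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

A word that occurs in every document, such as "长城" here, gets a weight of zero under this definition, which is why it cannot serve as a distinguishing keyword.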
The preprocessed text corpus is input into the Word2Vec model, which utilizes a neural network model to automatically learn the words in the corpus and map them into a continuous vector space, and finally obtains a vectorized representation of each word in the corpus. Then, the trained word vector model is used to vectorize the extracted topic keywords. By representing words as vectors, the Word2Vec model endows words with richer semantic information so that various operations can be carried out in the vector space, which provides an important support for calculating the similarity between topic keywords [18].
For the vectorized keywords, this paper quantifies the similarity between two words by the cosine of the angle between their vectors, as shown in Equation (4). The value lies between -1 and 1: the closer it is to 1, the more semantically similar the two keywords are, and the closer it is to -1, the less similar they are:
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}} \tag{4}$$
where $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$ are the word vectors of the two keywords and $n$ is the dimension of the vectors.
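A minimal sketch of the cosine similarity between two keyword vectors is given here. The vectors are made-up three-dimensional examples rather than trained Word2Vec embeddings, and the function name `cosine_similarity` is an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical low-dimensional vectors, for illustration only.
vec_wuxia = [0.9, 0.1, 0.3]
vec_jianghu = [0.8, 0.2, 0.4]
print(cosine_similarity(vec_wuxia, vec_jianghu))
```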
This paper proposes a RoBERTa-BiLSTM-Attention based sentiment recognition method to recognize and analyze the sentiment in Southeast Asian Chinese literature.
The RoBERTa-BiLSTM-Attention based sentiment analysis method first feeds the text of Southeast Asian Chinese literature into the RoBERTa model to obtain vectorized sentence embeddings and word embeddings. A bidirectional LSTM network then learns contextual text features, and an attention mechanism focuses on the key information in the text. Finally, the output of the attention mechanism is combined with the sentence embeddings and mapped to the sentiment label space through a fully connected layer, whose output is normalized with the Softmax function to obtain the final classification result. The overall structure, shown in Fig. 1, consists of a text vector representation layer, a feature extraction layer, and an output layer.

Fig. 1. Structure of the RoBERTa-BiLSTM-Attention sentiment analysis model
The text vector representation layer is one of the key parts of the RoBERTa-BiLSTM-Attention model. Its main function is to transform the input text into semantically rich vectors for the subsequent sentiment analysis. As a pre-trained model, RoBERTa performs in-depth feature extraction and encoding of the input text [19]. With its multi-layer Transformer encoder, it learns rich semantic information and encodes it into context-dependent word vectors.
During self-supervised training, RoBERTa-WWM masks a certain percentage of the tokens in the input sentences using the masked language model (MLM) method. The RoBERTa-WWM model consists of an Embedding layer and Transformer layers trained with whole word masking. During pre-training, the model goes through a series of steps to extract and learn the features and semantic information in the text data.
In the first of these steps, the input sentence is processed by the Embedding layer of the RoBERTa-WWM model and transformed into an input sequence, which is then passed to the Transformer layers.
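The feature extraction layer consists of a bidirectional long short-term memory (BiLSTM) layer and an attention layer, which together model the sequence data effectively. The final predicted label is determined after all elements of the sequence are considered together. In the standard LSTM formulation, the computations at time step $t$ are:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
In the long short-term memory (LSTM) formulation, the forget gate state is denoted by $f_t$, the input gate by $i_t$, and the output gate by $o_t$; $\tilde{c}_t$ is the candidate cell state, $c_t$ the cell state, $h_t$ the hidden state, and $x_t$ the input at time step $t$; $W$ and $b$ are the weight matrices and bias vectors, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.
Thus, by combining the forward representation $\overrightarrow{h_t}$ and the backward representation $\overleftarrow{h_t}$, the BiLSTM obtains the hidden state $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, which captures context from both directions.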
The output of the BiLSTM is then passed to the multi-head attention module, which assigns different feature weights according to the importance of the words in the text. This captures the semantic associations between words more effectively, strengthens the key words, and reduces noise in the text, thereby improving the accuracy of sentiment recognition. The multi-head attention module emphasizes the semantic features of keywords and lets each head capture a different aspect of the input sequence. The output vectors $H$ of the BiLSTM are linearly projected into the query, key, and value matrices of each head:
$$Q_i = H W_i^{Q}, \quad K_i = H W_i^{K}, \quad V_i = H W_i^{V}$$
Then the attention scores of each head are calculated separately and normalized by the Softmax function, and the normalized scores are multiplied with the corresponding values $V$ to get the weighted values. In the standard scaled dot-product form:
$$head_i = \text{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
Finally, the weighted results of all heads are concatenated and linearly transformed to give the final output of the multi-head attention mechanism:
$$\text{MultiHead}(H) = \text{Concat}(head_1, \dots, head_h) \, W^{O}$$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable projection matrices, $d_k$ is the dimension of the key vectors, and $h$ is the number of heads.
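The attention step can be sketched for a single head in pure Python. This is an illustrative sketch of standard scaled dot-product attention, assuming the queries, keys, and values have already been projected; a multi-head version would run it once per head and concatenate the results. The function names and sample numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys: lists of vectors of dimension d_k; values: list of vectors.
    Returns one weighted value vector per query.
    """
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim_v)])
    return outputs


# Hypothetical inputs: one query, two positions with identical keys.
out = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[2.0, 0.0], [4.0, 2.0]])
print(out)
```

Because the two keys are identical, both positions receive equal weight and the output is the average of the two value vectors.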
The output part consists of a fully connected module and the Softmax function. To avoid overfitting, Dropout is applied to the fully connected module. Dropout is a common regularization strategy in deep learning that reduces overfitting and improves the generalization of the model by randomly removing a portion of the neural network units. During training, Dropout randomly sets the outputs of neurons (including input neurons and hidden layer neurons) to zero with a certain probability (usually 0.5), so that the discarded neurons temporarily take no part in forward propagation or back propagation for that iteration. The process is shown in Fig. 2.

Fig. 2. Dropout workflow diagram
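Mechanism of Dropout. During training, Dropout randomly removes some neurons to simplify the network structure and reduce the possibility of overfitting. It prevents neurons from depending on specific input features and forces the network to learn more robust and general features. In each training iteration, the neurons to be discarded are selected at random with a certain probability and their outputs are set to zero. In its standard form, this can be written as:
$$r_j \sim \text{Bernoulli}(1 - p), \qquad \tilde{y}_j = r_j \, y_j$$
where $y_j$ is the output of neuron $j$, $p$ is the probability of discarding a neuron, $r_j$ is a random mask that equals 0 with probability $p$ and 1 otherwise, and $\tilde{y}_j$ is the output after Dropout.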
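A minimal sketch of the masking step follows. It uses the "inverted dropout" variant, in which surviving outputs are divided by the keep probability so that no rescaling is needed at inference time; this scaling is a common implementation choice and an assumption here, not something the text specifies. The function name `dropout` and its parameters are illustrative.

```python
import random


def dropout(outputs, drop_prob=0.5, training=True, rng=random):
    """Randomly zero neuron outputs during training (inverted dropout).

    drop_prob: probability of discarding each neuron.
    rng: any object with a random() method returning a float in [0, 1).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training:
        # At inference time every neuron is kept unchanged.
        return list(outputs)
    keep_prob = 1.0 - drop_prob
    # Keep a neuron with probability keep_prob and rescale it; otherwise set it to zero.
    return [y / keep_prob if rng.random() < keep_prob else 0.0 for y in outputs]


layer_outputs = [0.2, 1.5, -0.7, 3.0]
print(dropout(layer_outputs, drop_prob=0.5, rng=random.Random(0)))
```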
Words are the smallest research granularity in natural language processing. Since the relevant algorithms cannot analyze raw text directly, the original texts of Southeast Asian Chinese literature must be segmented into individual words, using the word weight and preference information provided by the user dictionary and the stop-word list. PaddleNLP was invoked in Python to perform stage-by-stage word segmentation (with 10 years as one stage) on 285 works of Southeast Asian Chinese literature published between 1961 and 2020. The results, retaining only verbs, nouns, and gerunds, are shown in Table 1. The number of works shows leaping growth, yet the effective word proportion has stayed at about 22% throughout. On the one hand, this indicates that the lexicon was constructed reasonably and extracts effective information robustly. On the other hand, it suggests that Southeast Asian Chinese literature may have had a stable creative content and framework over the long run. The descriptive statistics also show that the number of published works has grown continuously. In 2011-2020 in particular, 129 works were published, nearly twice as many as in the previous period, and the total number of effective words increased by about 310% compared with the previous period, which suggests that Southeast Asian Chinese literature has entered a brand-new stage of development.
Table 1. Descriptive statistics
| Period | Publication quantity | Novel proportion | Word quantity | Effective word quantity | Effective word proportion |
|---|---|---|---|---|---|
| 1961-1970 | 7 | 0.429 | 55898 | 12074 | 0.216 |
| 1971-1980 | 12 | 0.417 | 589059 | 129004 | 0.219 |
| 1981-1990 | 22 | 0.545 | 648918 | 142113 | 0.219 |
| 1991-2000 | 48 | 0.521 | 805036 | 177108 | 0.220 |
| 2001-2010 | 67 | 0.597 | 944736 | 207842 | 0.220 |
| 2011-2020 | 129 | 0.651 | 3860471 | 853164 | 0.221 |
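The ratios quoted for Table 1 can be re-derived from the table itself. The short sketch below copies the table's figures and recomputes the effective word proportion per period and the growth between the last two periods; the variable and function names are illustrative.

```python
# (period, number of works, word quantity, effective word quantity), copied from Table 1
TABLE_1 = [
    ("1961-1970", 7, 55898, 12074),
    ("1971-1980", 12, 589059, 129004),
    ("1981-1990", 22, 648918, 142113),
    ("1991-2000", 48, 805036, 177108),
    ("2001-2010", 67, 944736, 207842),
    ("2011-2020", 129, 3860471, 853164),
]


def effective_ratio(words, effective_words):
    """Share of effective words among all words in a period."""
    return effective_words / words


def growth(previous, current):
    """Relative increase from one period to the next."""
    return (current - previous) / previous


ratios = [round(effective_ratio(w, e), 3) for _, _, w, e in TABLE_1]
total_works = sum(n for _, n, _, _ in TABLE_1)
effective_growth = growth(TABLE_1[-2][3], TABLE_1[-1][3])
works_multiple = TABLE_1[-1][1] / TABLE_1[-2][1]

print(ratios)
print(total_works, round(effective_growth * 100), round(works_multiple, 2))
```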
Southeast Asian Chinese literature encompasses novels as well as other genres such as prose and poetry. Although novels and the other genres are highly similar in final purpose, literary value, and creative arrangement, they differ clearly in creative perspective and structure. Specifically, novels have more obvious macro features and tend to reflect social and cultural phenomena through the arrangement of storylines, whereas prose and poetry have more obvious micro features and focus more on the individual author’s feelings. To better analyze the characteristics of Chinese literature in Southeast Asia, this paper reports the proportion of novels together with the number of words in the descriptive analysis of the texts.
The TF-IDF value reflects the importance of a keyword in a set of text data. For keyword recognition, this paper sets the window size to 100 units on the segmented text; the results are shown in Table 2.
Table 2. Keywords with high TF-IDF values (partial)
| Keyword (1961-1970) | TF-IDF | Keyword (1971-1980) | TF-IDF | Keyword (1981-1990) | TF-IDF |
|---|---|---|---|---|---|
| China | 8.214 | Qu Yuan | 16.815 | Qu Yuan | 27.405 |
| Qu Yuan | 7.082 | China | 15.393 | China | 22.183 |
| Great Wall | 5.038 | Great Wall | 12.852 | Wuxia | 21.035 |
| Yangtze river | 5.026 | Forbidden city | 12.787 | Martial arts | 20.788 |
| Chopsticks | 5.021 | Wuxia | 12.411 | Great Wall | 20.287 |
| Chang’e flies to the moon | 4.869 | Yangtze river | 9.223 | Forbidden city | 13.695 |
| Wuxia | 2.918 | Jiangnan | 7.889 | Chopsticks | 13.526 |
| Jiangnan | 2.471 | Chopsticks | 7.303 | Jiangnan | 11.246 |
| West Lake | 1.886 | Liang Shanbo & Zhu Yingtai | 5.905 | Jianghu | 10.188 |
| Jianghu | 1.062 | Tea set | 5.199 | Bamboo forest | 10.144 |

| Keyword (1991-2000) | TF-IDF | Keyword (2001-2010) | TF-IDF | Keyword (2011-2020) | TF-IDF |
|---|---|---|---|---|---|
| China | 56.962 | Qu Yuan | 88.031 | China | 127.756 |
| Qu Yuan | 52.219 | China | 81.279 | Qu Yuan | 112.581 |
| Wuxia | 52.182 | Jiangnan | 71.965 | Jiangnan | 101.934 |
| Tea set | 47.655 | Chopsticks | 56.731 | Great Wall | 99.649 |
| Great Wall | 46.462 | Li Bai | 37.775 | Tea set | 89.415 |
| Jianghu | 29.361 | Wuxia | 34.832 | Chopsticks | 77.067 |
| Jiangnan | 29.306 | Great Wall | 29.216 | Su Shi | 75.014 |
| Chopsticks | 21.615 | Incense burner | 20.135 | Wuxia | 69.278 |
| Chang’e flies to the moon | 20.158 | Forbidden city | 19.718 | Jianghu | 65.818 |
| Yangtze river | 15.853 | Martial arts | 13.164 | Peony flower | 29.465 |
As can be seen from Table 2, “Qu Yuan” and “China” are the two most crucial elements in Southeast Asian Chinese literature: throughout the 60-year period from 1961 to 2020, these two words held the top two TF-IDF positions in every stage. Over these 60 years, the TF-IDF value of “Qu Yuan” rose from 7.082 to 112.581, and that of “China” rose from 8.214 to 127.756. In addition, elements such as “Great Wall,” “martial arts,” “Jiangnan,” and “chopsticks” also appear frequently in Southeast Asian Chinese literature.
This subsection uses the Word2Vec model for co-occurring word analysis of Southeast Asian Chinese literature. In bibliometric analysis, co-occurring word analysis has become an important means of quickly identifying the development of a field. Its core logic is to count the spatial distribution of selected keywords in order to reflect how attention is allocated within Southeast Asian Chinese literature over a given period, and then to examine the contemporary connotations embedded in these words. The keywords identified in Table 2 are traced back to the original texts of the works, and their spatial distribution is calculated to obtain the co-occurrence results shown in Table 3.
Table 3. Co-occurrence matrix (partial)
| | Qu Yuan | Great Wall | Confucius | China | Yangtze river | Liang Shanbo & Zhu Yingtai | Wuxia | Martial arts | Mid-Autumn festival | Swordsman |
|---|---|---|---|---|---|---|---|---|---|---|
| Qu Yuan | 0 | 154 | 130 | 118 | 118 | 102 | 89 | 112 | 51 | 83 |
| Great Wall | 154 | 0 | 124 | 102 | 113 | 86 | 106 | 45 | 82 | 80 |
| Confucius | 130 | 124 | 0 | 98 | 105 | 111 | 86 | 103 | 83 | 77 |
| China | 118 | 102 | 98 | 0 | 91 | 108 | 88 | 96 | 88 | 77 |
| Yangtze river | 118 | 113 | 105 | 91 | 0 | 99 | 75 | 92 | 71 | 76 |
| Liang Shanbo & Zhu Yingtai | 102 | 86 | 111 | 108 | 99 | 0 | 85 | 102 | 81 | 75 |
| Wuxia | 89 | 106 | 86 | 88 | 75 | 85 | 0 | 73 | 74 | 63 |
| Martial arts | 112 | 95 | 103 | 96 | 92 | 102 | 73 | 0 | 76 | 63 |
| Mid-Autumn festival | 90 | 82 | 83 | 88 | 71 | 81 | 74 | 76 | 0 | 38 |
| Swordsman | 83 | 80 | 77 | 77 | 76 | 75 | 63 | 63 | 38 | 0 |
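One common way to build such a matrix is to count how often two keywords fall inside the same text window. The sketch below assumes non-overlapping windows of a fixed number of tokens, which is one possible reading of the spatial distribution counting described above and not necessarily the exact procedure used in the paper; the function names and the tiny token list are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokens, keywords, window=100):
    """Count, for every keyword pair, the windows in which both keywords appear."""
    keywords = set(keywords)
    counts = Counter()
    for start in range(0, len(tokens), window):
        # Keywords present in this non-overlapping window.
        present = sorted(keywords.intersection(tokens[start:start + window]))
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


def to_matrix(counts, keywords):
    """Turn pair counts into a symmetric matrix with a zero diagonal."""
    index = {word: i for i, word in enumerate(keywords)}
    matrix = [[0] * len(keywords) for _ in keywords]
    for (a, b), count in counts.items():
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count
    return matrix


# Hypothetical token stream and a tiny window, for illustration only.
tokens = ["Qu Yuan", "Great Wall", "river", "Confucius", "Qu Yuan", "poem"]
keywords = ["Qu Yuan", "Great Wall", "Confucius"]
matrix = to_matrix(cooccurrence_counts(tokens, keywords, window=3), keywords)
print(matrix)
```

By construction the resulting matrix is symmetric, since a pair counted for one keyword is also counted for the other.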
The keywords of the Southeast Asian Chinese literature texts were then clustered; the results are shown in Table 4. After clustering with the Word2Vec model, six thematic categories are summarized: Chinese historical celebrities (Cluster 1), Chinese seasonal customs (Cluster 2), hometown customs and implements (Cluster 3), mountains and rivers (Cluster 4), myths and legends (Cluster 5), and martial arts culture (Cluster 6).
Table 4. Keyword clustering results
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|
| Qu Yuan | Mid-Autumn festival | Chopsticks | Great Wall | Liang Shanbo & Zhu Yingtai | Wuxia |
| Confucius | Dragon Boat Festival | China | Forbidden city | Chang’e flies to the moon | Jianghu |
| Li Bai | Qingming festival | Incense burner | Terracotta warriors | Nuwa mends the sky | Bamboo forest |
| Su Shi | Spring festival | Writing brush | Yangtze river | Hou Yi shoots the sun | Kungfu |
| Yue Fei | Double ninth festival | Ink | Yellow river | Jing Wei fills the sea | Swordsman |
| Li Qingzhao | Cold food festival | Tea set | Jiangnan | The legend of the white snake | Revenge |
| Wang Bo | Summer solstice | Screen | West Lake | Kua Fu chases the sun | Hero |
| Zhuge Liang | Winter solstice | Oil paper umbrella | Ming Tomb | Yu Gong moves the mountain | Red dust |
| Liu Yuxi | Autumn equinox | Peony flower | Tengwang Pavilion | King Yu combats the flood | Martial arts |
| Liu Yong | Spring equinox | Lotus flower | Yueyang Tower | Cangjie creates characters | Chivalry |
This section further analyzes the authors’ sentiment tendencies under the six themes. It obtains the number and percentage of positive, neutral, and negative sentiment in each theme and presents the high-frequency words of positive and negative sentiment as word clouds, so as to gain insight into the authors’ sentiment expression.
The sentiment analysis tool used in this section is the Data Manager software, which automatically analyzes the sentiment tendency of a text and outputs the result. It segments the text and scores each resulting word against a sentiment dictionary. If the sum of the sentiment scores is greater than 0, the text has a positive sentiment tendency; if it equals 0, the tendency is neutral; if it is less than 0, the tendency is negative. The software comes with a sentiment dictionary of 38,452 commonly used Chinese words, each assigned a sentiment score between -10 and 10, with higher scores indicating more positive sentiment. Users can also modify the sentiment scores of words according to their own needs or import a customized sentiment dictionary.
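The scoring rule described above can be sketched in a few lines. This is an illustration of the dictionary-sum rule only, not the Data Manager software itself: the miniature lexicon, its scores, and the function name `sentiment_tendency` are hypothetical.

```python
# Hypothetical miniature sentiment dictionary (word -> score between -10 and 10).
TOY_LEXICON = {"壮丽": 6, "思念": 2, "荒凉": -4, "悲伤": -7}


def sentiment_tendency(tokens, lexicon=TOY_LEXICON):
    """Sum the dictionary scores of the tokens and map the sum to a label."""
    total = sum(lexicon.get(token, 0) for token in tokens)  # unknown words score 0
    if total > 0:
        label = "positive"
    elif total < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, total


print(sentiment_tendency(["长城", "壮丽"]))
print(sentiment_tendency(["荒凉", "思念", "思念"]))
```

Note that a neutral label can arise either from a text with no dictionary words or from positive and negative scores cancelling out, as in the second call.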
The comments corresponding to the six themes obtained in the previous section are extracted into six new documents, which are imported in turn into the Data Manager software for sentiment analysis, yielding six “Theme Comments - Sentiment Tendency” result documents. Each document contains 12 columns: the serial number of the keyword, body text, sentiment tendency, positive words, negative words, degree words, negation words, number of positive sentences, positive score, number of negative sentences, negative score, and total score. The positive, neutral, and negative sentiment comments in the six documents were counted and their percentages calculated to obtain the theme-sentiment statistics shown in Table 5.
Table 5. Theme-sentiment tendency statistics
| Theme | Sentiment tendency | Word quantity | Proportion |
|---|---|---|---|
| 1 | Positive | 2467 | 80% |
| 1 | Neutral | 432 | 14% |
| 1 | Negative | 185 | 6% |
| 2 | Positive | 2136 | 74% |
| 2 | Neutral | 577 | 20% |
| 2 | Negative | 173 | 6% |
| 3 | Positive | 1792 | 67% |
| 3 | Neutral | 696 | 26% |
| 3 | Negative | 187 | 7% |
| 4 | Positive | 2113 | 83% |
| 4 | Neutral | 331 | 13% |
| 4 | Negative | 102 | 4% |
| 5 | Positive | 1669 | 72% |
| 5 | Neutral | 371 | 16% |
| 5 | Negative | 278 | 12% |
| 6 | Positive | 2449 | 88% |
| 6 | Neutral | 250 | 9% |
| 6 | Negative | 84 | 3% |
The data in Table 5 show that the proportion of positive sentiment exceeds 60% for all six themes, and the sum of positive and neutral sentiment exceeds 85%, indicating that the authors of Southeast Asian Chinese literature have a generally positive attitude towards Chinese culture. Among the six themes, Theme 3 has the smallest proportion of positive sentiment, only 67%, and Theme 5 has the largest proportion of negative sentiment, 12%. Themes 1, 4, and 6 all have positive proportions of 80% or more; Theme 6 has the highest, at 88%, with only 3% negative sentiment, indicating that the authors are most enthusiastic about Chinese martial arts culture. Theme 3 has the largest proportion of neutral sentiment, 26%, indicating that there is still some room for improvement in how authors of Southeast Asian Chinese literature write about hometown customs and implements.
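The percentages discussed here follow directly from the counts in Table 5. A brief sketch of that arithmetic is given below; the counts are copied from the table, while the dictionary layout and the function name `proportions` are illustrative.

```python
# theme -> (positive, neutral, negative) counts, copied from Table 5
TABLE_5 = {
    1: (2467, 432, 185),
    2: (2136, 577, 173),
    3: (1792, 696, 187),
    4: (2113, 331, 102),
    5: (1669, 371, 278),
    6: (2449, 250, 84),
}


def proportions(positive, neutral, negative):
    """Return the three shares as whole percentages."""
    total = positive + neutral + negative
    return tuple(round(100 * count / total) for count in (positive, neutral, negative))


shares = {theme: proportions(*counts) for theme, counts in TABLE_5.items()}
most_positive = max(shares, key=lambda theme: shares[theme][0])
most_neutral = max(shares, key=lambda theme: shares[theme][1])
print(shares)
print(most_positive, most_neutral)
```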
Southeast Asia is a region with a range of cultural traditions, and Southeast Asian Chinese literature is a significant component of overseas Chinese literature. It has inherited Chinese traditions and incorporated the diverse cultures of Southeast Asia.
In the language of Southeast Asian Chinese literature, Chinese elements show six faces. There are tributes to Chinese historical celebrities, such as Qu Yuan, of whom Southeast Asian Chinese writers are particularly fond. There is the celebration of Chinese seasonal customs, such as the Mid-Autumn Festival. There is nostalgia for hometown customs and implements, such as chopsticks; fondness for mountains and rivers, such as the Great Wall; the transplantation and spreading of myths and legends, such as Liang Shanbo and Zhu Yingtai and Chang’e flying to the moon; and the depiction and transmission of martial arts culture. Whether they appear in modern poems, modern novels, or miscellaneous stories of everyday life, these Chinese cultural elements arise naturally in the Chinese communities of Southeast Asia and in the vision of Nanyang people, and these treasures of Chinese culture shine brightly in Nanyang.
Classical language plays an important role in carrying heritage in Southeast Asian Chinese literature. It contains rich cultural connotations and wisdom, adds depth to the literature, and gives the works both a historical and a contemporary dimension. It also carries rich emotional expression, making the works more vivid and moving. Meanwhile, allusions and idioms from the classical language often give writers a source of inspiration. Classical language thus provides literary works with profound cultural connotations and provides readers with rich life wisdom and emotional experience. This inheritance adds depth and breadth to literary works and helps to promote the essence of traditional culture.
Language is not only a tool for human thinking and communication but also a carrier of culture, because language is essentially a historical net woven from layers of culture, connecting readers and authors between expression and appreciation. The richness and diversity of language, dialect, and ethnicity in Southeast Asian Chinese literary works make them unique compared with works from the mainland. At the same time, the intermingling of local color and humanistic character in the works, and especially their cultural Chineseness, gives their language great value for cultural inheritance. Whether vernacular, dialect, or classical language, they all tell the Chinese story.
In this paper, we use text mining methods to extract linguistic and cultural elements from Southeast Asian Chinese literature, and propose RoBERTa-BiLSTM-Attention for sentiment recognition and analysis of Southeast Asian Chinese literature.
From 1961 to 2020, the number of works of Southeast Asian Chinese literature kept growing, with leaping growth in 2011-2020 in particular, while the effective word proportion remained at about 22%. The number of works published in 2011-2020 is 129, nearly twice that of the previous decade, and the total number of effective words increased by 310% compared with 2001-2010. Over the 60-year period, the TF-IDF values of the high-frequency keywords kept increasing, and “Qu Yuan” and “China” always occupied the top two positions.
The Chinese elements in Southeast Asian Chinese literature fall into six categories: Chinese historical celebrities, Chinese seasonal customs, hometown customs and implements, mountains and rivers, myths and legends, and martial arts culture. The authors hold positive attitudes toward all six categories, with positive sentiment of 80% or more toward Chinese historical celebrities, mountains and rivers, and martial arts culture, and the highest proportion toward martial arts culture (88%).
