Folk Tales from Diverse Cultures: Digital Analysis of Content using Natural Language Processing
Published online: 19 March 2025
Received: 07 Nov. 2024
Accepted: 10 Feb. 2025
DOI: https://doi.org/10.2478/amns-2025-0529
© 2025 Yaping Li, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
In the digital era, the protection and inheritance of traditional culture face unprecedented opportunities and challenges. Folktales are a valuable part of China's traditional culture: they carry rich history and regional character, and their vivid storylines and profound cultural connotations form a bridge between the past and the future [1-2]. Giving folktales new vitality through digital technology, so that this intangible cultural heritage can flourish in the new era, is both an important route to its innovative development and an urgent problem in the protection and inheritance of Chinese traditional culture [3-6].
Using machine learning to analyze the narrative structure of folktales is a compelling research topic. It would be of wide interest if morphology could be automatically and reliably extracted from a given set of folktales [7]. On the one hand, studying the structure of folktales allows the deepest cultural connotations to be explored through the forest of disorganized cultural phenomena, giving a better understanding of the narrative logic of folklore [8-9]. On the other hand, it facilitates the transformation of folklore resources in the current media era: by borrowing the classical structures of folk narrative texts, excellent folklore resources can be recreated, giving full play to the value of traditional folk narratives in modern society [10-12]. Most importantly, empowered by digital technology, these ancient and mysterious stories can be revitalized, and the diversified, interactive communication methods that digital technology provides open a new path for the digital interpretation of folktales, contributing to their protection and inheritance [13-15].
Natural language processing not only helps people manage huge volumes of data in an efficient and orderly way, but also helps them mine the information and regularities hidden in the data. In this paper, folktales from different cultures in China are selected as the research object. First, distributed crawlers are used to integrate the resources. Second, natural language processing is used to select the key features of the works and to analyze how specific features differ across folktales. Finally, the complex network characteristics of language rhythm in different categories of folktales are analyzed to derive the relatively significant features of each category.
Web pages are described in HTML using markup tags; a web browser reads the HTML document and renders it to the user as a web page. Taking the Chrome browser as an example, open a story website, right-click and choose "Inspect", and examine part of the page layout.
Inspection shows that most of the text is contained in the 〈…〉 element, and the list of links is extracted with an XPath query of the form lists = response.xpath('.//li/div[@class="ic"]/@href').extract().
The crawled data also needs to be cleaned of redundancy. The method above extracts only the main body of each page and discards most other data; regular-expression matching and similar operations are then applied, mainly to remove spaces and redundant blank lines.
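The cleaning step described above can be sketched with regular expressions as follows. This is a minimal illustration, not the paper's actual code: the function name `clean_text` and the exact whitespace rules (trim each line, collapse inner runs of spaces, keep at most one blank line) are assumptions.

```python
import re

def clean_text(raw: str) -> str:
    """Strip stray whitespace and collapse redundant blank lines."""
    # Normalise Windows and old Mac line endings first.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Remove spaces, tabs and full-width spaces at both ends of every line.
    lines = [re.sub(r"^[ \t\u3000]+|[ \t\u3000]+$", "", line)
             for line in text.split("\n")]
    text = "\n".join(lines)
    # Collapse runs of spaces or tabs inside a line to a single space.
    text = re.sub(r"[ \t\u3000]{2,}", " ", text)
    # Collapse three or more newlines (two or more blank lines) into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "  Once upon a time   \n\n\n\n\tthere was a  mountain. \r\n\r\n"
print(repr(clean_text(sample)))
```

For Chinese text, where spaces carry no meaning, the inner-space rule could be changed to delete the spaces entirely instead of collapsing them.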
In the Scrapy framework, the spider first generates a network request, which is passed through the scheduler to the downloader. The downloader fetches the corresponding data and returns the response to the spider, which finally places the extracted data in the item container. The Scrapy engine is responsible for the data flow between all components of the framework and triggers the corresponding actions. The architecture is shown in Figure 1.

Figure 1. Data flow architecture diagram
On this basis, the Redis database is combined with Scrapy to make the crawler distributed. The idea of scrapy-redis is to add a queue to the original design: after a spider generates a request, the request is sent directly to a Redis queue, and the scheduler later extracts requests from that queue. Multiple schedulers can therefore be set up, each extracting requests from Redis (and also depositing requests into it), which makes distributed crawling across multiple servers possible. The distributed crawler architecture is shown in Figure 2.

Figure 2. Architecture diagram of the distributed crawler
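The shared-queue idea can be illustrated without Redis or Scrapy. The sketch below is a single-process simulation under stated assumptions: `SharedRequestQueue` is a hypothetical in-memory stand-in for the Redis queue, `run_workers` takes turns between simulated schedulers, and the link graph is invented for the example. It is not the scrapy-redis API.

```python
from collections import deque

class SharedRequestQueue:
    """In-memory stand-in for the Redis request queue (hypothetical helper)."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()  # requests already enqueued by any worker

    def push(self, url: str) -> bool:
        if url in self._seen:  # duplicate filter shared by all workers
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        return self._queue.popleft() if self._queue else None


def run_workers(start_urls, link_graph, n_workers=2):
    """Round-robin simulation of several schedulers draining one shared queue."""
    shared = SharedRequestQueue()
    for url in start_urls:
        shared.push(url)
    crawled = {w: [] for w in range(n_workers)}
    worker = 0
    while True:
        url = shared.pop()
        if url is None:
            break
        crawled[worker].append(url)           # "download" the page
        for link in link_graph.get(url, []):  # new requests go back to the queue
            shared.push(link)
        worker = (worker + 1) % n_workers
    return crawled

# Hypothetical site structure: page -> links found on that page.
links = {"/index": ["/tale/1", "/tale/2"], "/tale/1": ["/tale/2", "/tale/3"]}
result = run_workers(["/index"], links)
print(result)
```

Because the queue and the duplicate filter live outside the workers, each page is fetched once no matter how many workers share the queue, which is the property the distributed design relies on.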
The process of natural language processing is shown in Fig. 3 and can be roughly divided into five steps. (1) Acquire the text by web crawler or local import. (2) Preprocess the text: segment it into words and remove tone (modal) words and stop words. (3) Characterize the text by mapping words to their corresponding word vectors using one-hot encoding or word embedding techniques. One-hot encoding is also called one-bit-effective encoding, i.e., each word in a vocabulary of size $N$ is represented by an $N$-dimensional vector with a 1 at the position of that word and 0 everywhere else. (4) Train a model on the vectorized text. (5) Evaluate the effect of the trained model.

Figure 3. Natural language processing flow chart
The five steps are briefly introduced as follows. The corpus is mostly collected with web crawlers or taken from local text datasets. The preprocessing stage mainly includes corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal. In the characterization stage, the preprocessed text is vectorized: the segmented words are represented as vectors so that the computer can operate on them, which also helps to discover similarity relationships between words through their vector representations. In the model training stage, the training methods include traditional supervised, unsupervised and semi-supervised learning models; the specific model is selected according to the application scenario. For evaluation after modeling, commonly used indexes include accuracy and recall.
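As a small illustration of the characterization step, the sketch below builds a vocabulary and one-hot vectors for a toy token list. The helper names `build_vocab` and `one_hot` and the sample tokens are assumptions made for the example.

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def one_hot(token, vocab):
    """Return a vector with a single 1 at the token's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

tokens = ["dragon", "king", "sea", "dragon"]
vocab = build_vocab(tokens)
print(one_hot("king", vocab))  # [0, 1, 0]
```

One-hot vectors grow with the vocabulary and carry no notion of similarity between words, which is why word embeddings are usually preferred for large corpora.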
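In natural language processing, words, characters or bytes are often used as labeled entities, depending on the design of the model. The N-gram language model is a probability-based language model: its input is a sentence (a sequence of words) and its output is the probability of that sentence.
Assuming a sentence $S = (w_1, w_2, \ldots, w_n)$, its probability is given by the chain rule:
$$P(S) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \tag{1}$$
Eq. (1) is concise and easy to understand, but suffers from a flaw in its computation, which is that the complexity of estimating $P(w_i \mid w_1, \ldots, w_{i-1})$ grows rapidly with the length of the history. The N-gram model therefore assumes that each word depends only on the $N-1$ words before it:
$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) \tag{2}$$
In Eq. (2), $N$ is the order of the model, i.e., the number of words in each gram.
When $N = 1$ (the unigram model), every word is treated as independent of its context:
$$P(S) = \prod_{i=1}^{n} P(w_i) \tag{3}$$
By analogy, the value of $N$ can be 2, 3, and so on. For $N = 2$ (the bigram model), with $w_0$ denoting a sentence-start marker:
$$P(S) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \tag{4}$$
The conditional probabilities are estimated from corpus counts:
$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})} \tag{5}$$
In equation (5), count represents the frequency of occurrence of the counted words. For example, $\mathrm{count}(w_{i-1}, w_i)$ is the number of times the word $w_{i-1}$ is immediately followed by $w_i$ in the corpus.
The larger the value of $N$, the more context constrains the next word, but the sparser the counts become and the higher the computational cost.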
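The count-based estimate can be made concrete for the bigram case. The sketch below estimates the probability of a word given the previous word as the count of the pair divided by the count of the previous word. The sentence-boundary markers `<s>` and `</s>`, the function names and the toy corpus are assumptions for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and adjacent word pairs over padded sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words[:-1])  # every word that has a successor
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum likelihood estimate count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sent, unigrams, bigrams):
    """Product of bigram probabilities over the padded sentence."""
    words = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        p *= bigram_prob(prev, word, unigrams, bigrams)
    return p

corpus = [["the", "fox", "ran"], ["the", "fox", "slept"], ["the", "monk", "ran"]]
uni, bi = train_bigram(corpus)
print(bigram_prob("the", "fox", uni, bi))
print(sentence_prob(["the", "fox", "ran"], uni, bi))
```

Unseen word pairs receive probability zero in this unsmoothed form; a practical model would add smoothing, which is outside the scope of this sketch.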
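The main role of the skip-gram model (SGM) is to generate the words surrounding a particular word in the text, based on that word. When training the SGM, every word is represented by two fixed-dimensional word vectors that are used to calculate the conditional probabilities. The background word window size must be set before word vector training. During training, each word acts both as a center word and as a background word, and its vectors in these two roles are stored separately. For example, suppose the text sequence consists of $\{w_1, w_2, w_3, w_4, w_5\}$, with $w_3$ as the center word and a window size of 2. The SGM then considers the conditional probability of generating the background words within two positions of the center word, $P(w_1, w_2, w_4, w_5 \mid w_3)$.
The skip-gram model specifies that the probabilities of occurrence of the background words are mutually independent given the center word, so that:
$$P(w_1, w_2, w_4, w_5 \mid w_3) = P(w_1 \mid w_3)\,P(w_2 \mid w_3)\,P(w_4 \mid w_3)\,P(w_5 \mid w_3) \tag{6}$$
That is, assuming that the text set consists of 5 words, the probability of occurrence of each background word is calculated from the center word, with the sliding window set to 2.
In the process of SGM model building, each word is represented as two fixed-dimensional vectors for calculating conditional probabilities. Let a word be indexed as $i$ in the dictionary; its vector is $\mathbf{v}_i$ when it acts as a center word and $\mathbf{u}_i$ when it acts as a background word. For a center word $w_c$ and a background word $w_o$, the conditional probability is given by a softmax over the inner products of the vectors:
$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)} \tag{7}$$
In Eq. (7), $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$ is the set of dictionary indices.
Suppose, in a text sequence of length $T$, the word at time step $t$ is $w^{(t)}$. With a window size of $m$, the likelihood of generating all background words given each center word is:
$$\prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w^{(t+j)} \mid w^{(t)}) \tag{8}$$
where time steps smaller than 1 or larger than $T$ are ignored.
The training process of the SGM requires learning the model parameters through maximum likelihood estimation, which is equivalent to minimizing the following loss function:
$$-\sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w^{(t+j)} \mid w^{(t)}) \tag{9}$$
If stochastic gradient descent is used for optimization, at each iteration the training procedure randomly samples a shorter subsequence, computes the loss on that subsequence, and uses its gradient to update the model parameters. The gradients with respect to the center and background word vectors are obtained from the logarithm of the conditional probability. Substituting Eq. (7) into each log term of Eq. (9) yields:
$$\log P(w_o \mid w_c) = \mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right) \tag{10}$$
Differentiating Eq. (10) yields a gradient with respect to $\mathbf{v}_c$ of:
$$\frac{\partial \log P(w_o \mid w_c)}{\partial \mathbf{v}_c} = \mathbf{u}_o - \sum_{j \in \mathcal{V}} P(w_j \mid w_c)\,\mathbf{u}_j \tag{11}$$
The parameter $\mathbf{v}_c$ is then updated using this gradient, and the gradients of the other word vectors are obtained in the same way.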
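The conditional probability and its gradient can be checked numerically on a toy vocabulary. The sketch below assumes the standard softmax form of the skip-gram probability, in which the probability of a background word given a center word is the exponentiated inner product of their vectors, normalized over the vocabulary. The function names and the vector values are illustrative assumptions.

```python
import math

def softmax_prob(o, c, U, V):
    """P(w_o | w_c) = exp(u_o . v_c) / sum_i exp(u_i . v_c)."""
    scores = [sum(u_k * v_k for u_k, v_k in zip(u, V[c])) for u in U]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[o] / sum(exps)

def grad_log_prob_wrt_center(o, c, U, V):
    """Gradient of log P(w_o | w_c) w.r.t. v_c: u_o - sum_j P(w_j | w_c) u_j."""
    dim = len(V[c])
    probs = [softmax_prob(j, c, U, V) for j in range(len(U))]
    expected_u = [sum(probs[j] * U[j][k] for j in range(len(U)))
                  for k in range(dim)]
    return [U[o][k] - expected_u[k] for k in range(dim)]

# Toy vocabulary of 3 words with 2-dimensional vectors (arbitrary values).
V = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # center-word vectors v_i
U = [[0.3, 0.1], [-0.2, 0.6], [0.5, -0.4]]  # background-word vectors u_i
g = grad_log_prob_wrt_center(o=1, c=0, U=U, V=V)
print(g)
```

The gradient is the observed background vector minus the probability-weighted average of all background vectors, so it vanishes when the model already predicts the observed word with certainty.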
Word frequency analysis, also called word frequency statistical analysis, is a basic method used universally in text analysis. Its principle is to count the number of times each word or phrase appears in a given text, and to use these statistics to explore the lexical regularities of the text and mine its hidden information. This lexical form of text analysis is simple to carry out and gives clear results. Its final output is a word frequency table. Because such a table consists only of text and numbers and is not very visual, it is usually converted into a word cloud. The shape of the word cloud is chosen according to the needs of the research; common choices are geometric shapes, map outlines, figures, and graphics related to the research topic. A word cloud is interpreted on the basis of the underlying frequencies. Word frequency refers to how many times a particular word or phrase appears in a text. This number is usually normalized to avoid bias, because the same phrase tends to appear more often in a long document than in a short one. The text data is therefore first segmented and filtered to remove meaningless words, and the word frequency statistics are then computed. The specific calculation expression is:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of times word $i$ appears in text $j$, and the denominator is the total number of words in text $j$.
With the above formula, the frequency of any given word can be derived. Word frequency analysis makes it possible to identify the core content and viewpoints of a text. In this paper, word frequency statistical analysis is used to study folktales from different cultures, and the word frequency statistics are visualized so that the ideas and contents of the texts can be seen directly, which supports the subsequent analysis.
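A minimal sketch of the normalized word frequency calculation follows, assuming the frequency of a word is its count divided by the total number of retained words. The stop-word list and the sample sentence are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop-word list

def term_frequencies(tokens, stop_words=STOP_WORDS):
    """Return each retained word's count divided by the number of retained words."""
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    counts = Counter(kept)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

tokens = "The dragon and the king of the sea met the dragon".split()
tf = term_frequencies(tokens)
for word, freq in sorted(tf.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(freq, 2))
```

The resulting dictionary is the kind of table that a word cloud tool would take as input, with font size proportional to frequency.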
Semantic network analysis describes what kinds of relationships and associations exist between individual keywords, and represents the keyword associations obtained from text analysis as a network graph. In this paper, the keywords in the text are analyzed quantitatively by "degree of centrality", "word clusters" and "point relevance", and the final result is a keyword association network graph.
Degree of centrality
In semantic network analysis, centrality is usually quantified to reflect the correlation between keywords in a graph. Centrality is divided into absolute degree centrality and relative degree centrality. In this paper, the relative degree centrality is selected for analysis and calculation. The calculation expression is as follows:
$$C_{RD}(i) = \frac{C_{AD}(i)}{n-1}$$
Where $C_{AD}(i)$ is the absolute degree centrality of keyword $i$, i.e., the number of other keywords directly connected to it, and $n$ is the number of keywords (nodes) in the graph.
Word clusters
The study of centrality is usually not limited to the centrality of individual points; the overall centrality of the graph is often studied as well. The keywords are therefore grouped into one or several clusters rather than being studied one keyword at a time. Cluster analysis helps to grasp the study as a whole.
Correlation of points
This is a numerical description of how closely keywords are connected to one another, i.e., whether the connection between a keyword and the other keywords is close or not. The larger the value, the more distant the connection; conversely, the smaller the value, the closer the connection.
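The relative degree centrality of a keyword network can be sketched as follows, assuming it is the number of directly connected keywords divided by the number of other keywords in the graph. The function name and the co-occurrence edges are hypothetical.

```python
def relative_degree_centrality(edges):
    """Degree of each node divided by (n - 1) for an undirected keyword graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Hypothetical keyword co-occurrence edges.
edges = [("dragon", "sea"), ("dragon", "king"),
         ("dragon", "palace"), ("king", "palace")]
centrality = relative_degree_centrality(edges)
print(centrality)
```

Dividing by n - 1 makes the values comparable between graphs of different sizes, which is the reason for preferring the relative measure over the absolute one.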
Complex networks are networks with small-world properties, which are mainly characterized by the network’s scale-free distribution, short average distances and high clustering coefficients.
Before stating the average shortest distance of the network, the distance between the nodes of the network is described. For two nodes $i$ and $j$ of the network, the distance $d_{ij}$ is defined as the number of edges on the shortest path that exists between the two nodes.
Average Shortest Distance:
The average shortest distance of a network is the average of the shortest distances between all connected pairs of nodes in the network. Assuming that there exist $N$ nodes in the network, the average shortest distance $L$ is:
$$L = \frac{2}{N(N-1)} \sum_{i<j} d_{ij}$$
Complex networks have the property of a short average distance; that is, any two points in the network can be connected to each other through a small number of edges.
The degree distribution of the nodes in a network is an important indicator of the network. The degree of a node is the number of edges connecting it to other nodes. In directed networks, degree is subdivided into in-degree and out-degree; in undirected networks there is only a single notion of degree. Here, only the degree distribution of undirected networks is discussed:
$$P(k) = \frac{n_k}{N}$$
where $P(k)$ is the degree probability function, $n_k$ is the number of nodes with degree $k$, and $N$ is the total number of nodes.
In complex networks, the degree distribution exhibits scale-free properties. That is, the degree distribution function obeys a power-law distribution:
$$P(k) \propto k^{-\gamma}$$
with exponent $\gamma > 0$.
The clustering coefficient, which measures the degree of grouping in a network, is an important network parameter. Its physical meaning is the probability that any two neighbors of a node are also neighbors of each other. For each node $i$, the clustering coefficient $C_i$ is:
$$C_i = \frac{2E_i}{k_i(k_i-1)}$$
Where $k_i$ is the degree of node $i$ and $E_i$ is the number of edges that actually exist among its $k_i$ neighbors.
The clustering coefficient of the overall network is then the average of the clustering coefficients of all nodes. Assuming that there are $N$ nodes in the network:
$$C = \frac{1}{N} \sum_{i=1}^{N} C_i$$
Larger values of clustering coefficients in the network indicate a higher degree of aggregation of the network and a more significant network clustering.
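The three network measures discussed above (average shortest distance, degree distribution and clustering coefficient) can be computed for a small undirected graph with the standard library alone. This is an illustrative sketch: the function names and the toy edge list are assumptions, nodes with fewer than two neighbors are given a clustering coefficient of 0, and node labels must be mutually comparable.

```python
from collections import Counter, deque

def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def average_shortest_distance(adj):
    """Mean breadth-first-search distance over all connected node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable nodes other than the source
    return total / pairs if pairs else 0.0

def clustering_coefficient(adj):
    """Average over nodes of 2*E_i / (k_i * (k_i - 1))."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count each edge between two neighbours once (a < b).
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0

def degree_distribution(adj):
    """Fraction of nodes having each degree k."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]  # a triangle with one pendant node
adj = build_adjacency(edges)
print(average_shortest_distance(adj))
print(clustering_coefficient(adj))
print(degree_distribution(adj))
```

Every ordered pair is visited once from each end, so the pair count and the distance total are both doubled and the average is unaffected.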
In this paper, Chinese folktales are categorized into three types based on the crawled content: "Mythological Stories", "Wild Historical Mysteries" and "Literary and Historical Encyclopedias". Differences in the lexical distribution of the works reflect the author's linguistic habits and textual style from the point of view of sentence structure, and this paper analyzes the key features of the three types of folktales.
The average number of words per chapter describes the author's word usage from the perspective of the whole text. It shows the size of the author's vocabulary and thus, to some extent, the author's literary accomplishment and depth. The distribution of the average number of words per chapter for mythological stories, wild historical mysteries and literary and historical encyclopedias is shown in Figure 4. The average vocabulary per chapter of wild historical mysteries is mostly below 10,000 words, that of mythological stories lies between 10,000 and 20,000 words, and many literary and historical encyclopedias exceed 20,000 words, some reaching more than 40,000. The reasons can be analyzed from the genre of each type:
Many wild historical mysteries are passed down through oral accounts, and the oral tradition requires clear and concise language that avoids lengthy descriptions. They also usually focus on specific, marginal historical events or characters, and involve fewer technical terms and less cultural background. Mythological stories usually carry metaphorical meanings, and much of their wording is symbolic, exaggerated or rhetorical, which enriches the vocabulary. Mythological stories also serve to explain natural phenomena, religious beliefs and the origin of the world, so they cover a wide range of subjects, although their coverage is limited by their focus on a specific culture or national tradition. Literary and historical encyclopedias cover a very wide range of fields and include specialized terms and concepts from many of them. They are also deep and systematic in their knowledge, use a rigorous framework for historical events, people and cultures, and have a relatively rich and standardized vocabulary.

Figure 4. Distribution of the average number of words per chapter
The average paragraph length reflects an author's paragraphing habits, indicates the complexity of the text, and is an important indicator of readability. If a text has many short paragraphs, it is simpler, easier for readers to understand, and gives a better reading experience; if it has fewer, longer paragraphs, it is more complex, relatively difficult to understand, and less readable. The average paragraph lengths of mythological stories, wild historical mysteries and literary and historical encyclopedias are shown in Figure 5.

Figure 5. Distribution of the average paragraph length per chapter
As can be seen from Figure 5, the average paragraph length of mythological stories is mostly concentrated between 100 and 200, and that of literary and historical encyclopedias between 200 and 300, which is the highest overall, while the average paragraph length of wild historical mysteries fluctuates greatly. Literary and historical encyclopedias are thus the more complex texts, owing to their systematic nature; mythological stories, being narrative, have more paragraphs and compact plots; and the larger fluctuation for wild historical mysteries is in line with their uncertain historical sources.
A comparison of the distribution of average sentence lengths for mythological stories, wild historical mysteries and literary and historical encyclopedias is shown in Figure 6. The average sentence length distribution of mythological stories does not differ much from that of literary and historical encyclopedias, although in general the sentences of mythological stories are somewhat shorter and the style is more lively and spontaneous. The average sentence length of mythological stories is relatively concentrated, mostly between 10 and 30, and the rhythmic style is consistent from story to story. The average sentence length of wild historical mysteries, in contrast, varies a lot, and the rhythmic style differs markedly and irregularly between works.

Figure 6. Distribution of the average sentence length per chapter
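The three chapter-level features compared in Figures 4 to 6 can be computed along the following lines. This is a sketch under stated assumptions: lengths are measured in whitespace-separated tokens, paragraphs are separated by blank lines, and sentences end at common Western or Chinese terminal punctuation. Chinese text would need a word segmenter in place of `str.split`, and the paper may measure lengths in characters instead.

```python
import re

def chapter_features(chapter: str):
    """Return (word count, mean paragraph length, mean sentence length)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", chapter) if p.strip()]
    sentences = [s for s in re.split(r"[.!?。！？]+", chapter) if s.strip()]
    words = chapter.split()  # whitespace tokens only
    n_words = len(words)
    mean_par = n_words / len(paragraphs) if paragraphs else 0.0
    mean_sent = n_words / len(sentences) if sentences else 0.0
    return n_words, mean_par, mean_sent

chapter = ("The monk set out at dawn. He met a fox.\n\n"
           "The fox spoke! The monk listened.")
print(chapter_features(chapter))  # (16, 8.0, 4.0)
```

Applying the function to every chapter of a work and plotting the resulting values gives distributions of the kind compared across the three categories.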
Every text has a linguistic rhythm. This paper analyzes different categories of folktales from the point of view of the complex network characteristics of the linguistic rhythm of their texts.
This paper selects nine classic works: "The Classic of Mountains and Seas", "Journey to the West", "Romance of the Gods", "Daming Palace Ci", "Wild History of Ming and Qing Dynasties", "The Legacy of Supreme Harmony", "Records of the Grand Historian", "The Biography of Zuo" and "Ci Hai", covering mythological stories, wild historical mysteries and literary and historical encyclopedias. For comparison with these classic works, this paper itself and the collection of scientific papers it cites are also included as samples.
The results of the complex network characterization of the works are shown in Table 1. The aggregation coefficients of the natural language rhythm networks of the classic mythological stories and the literary and historical encyclopedias are very high, all exceeding 0.35. The aggregation coefficients of some of the wild historical mysteries are also high, while that of this paper is only 0.167. The average distances of the natural language rhythm networks of the classic mythological stories and the literary and historical encyclopedias are relatively low; "The Classic of Mountains and Seas" has the lowest at 1.654. From this point of view, the organization of natural language rhythm in "The Classic of Mountains and Seas" is very flexible and varied.
Table 1. Analysis results of complex network characteristics
Title of work | Aggregation coefficient | Average distance | Aggregation coefficient × Average distance |
---|---|---|---|
“The Classic of Mountains and Seas” | 0.738 | 1.654 | 1.220652 |
“Journey to the West” | 0.582 | 2.189 | 1.273998 |
“Romance of the Gods” | 0.622 | 1.746 | 1.086012 |
“Daming Palace Ci” | 0.492 | 2.297 | 1.130124 |
“Wild History of Ming and Qing Dynasties” | 0.286 | 3.147 | 0.900042 |
“The Legacy of Supreme Harmony” | 0.393 | 2.532 | 0.995076 |
“Records of the Grand Historian” | 0.793 | 1.893 | 1.501149 |
“The Biography of Zuo” | 0.863 | 2.109 | 1.820067 |
“Ci Hai” | 0.973 | 1.743 | 1.695939 |
This text | 0.167 | 3.862 | 0.644954 |
Collection of scientific articles | 0.274 | 2.753 | 0.754322 |
The average distance in the complex network of natural language rhythms reflects the flexibility with which the rhythm is organized: the shorter the average distance, the more flexibly the rhythmic units are combined, which suggests that the author has a strong command of combining rhythmic units and can use a variety of natural rhythms flexibly. A high clustering coefficient indicates a tightly knit network, showing that many natural rhythms are clustered together and that the author frequently uses multiple natural rhythmic units in the text. Taken together, a shorter average distance and a higher clustering coefficient describe the author's command of language and the linguistic expressiveness of the work.
Is there a single value that can indicate the quality of an article? Here, the product of the average distance and the clustering coefficient of the complex network is examined. The product is very high for all the classic mythological stories and literary and historical encyclopedias, exceeding 1.1 in every case except "Romance of the Gods", while the products for the wild historical mysteries are uneven and generally lower. The product for this paper is only 0.644954 and that for the collection of scientific papers is only 0.754322, both much lower than those of the classic folktales. The difference between the products of excellent and non-excellent works is therefore very pronounced, which suggests that the product of average distance and clustering coefficient can be used for a preliminary judgment of the quality of a text.
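The product used as a quality indicator is simple to recompute from the two measured columns. The sketch below uses a subset of the values reported in Table 1; the function name `rhythm_score` is a hypothetical label for the product.

```python
# (aggregation coefficient, average distance) copied from Table 1.
table1 = {
    "The Classic of Mountains and Seas": (0.738, 1.654),
    "Journey to the West": (0.582, 2.189),
    "Romance of the Gods": (0.622, 1.746),
    "This text": (0.167, 3.862),
    "Collection of scientific articles": (0.274, 2.753),
}

def rhythm_score(coefficient, distance):
    """Product of aggregation coefficient and average distance."""
    return coefficient * distance

scores = {title: round(rhythm_score(c, d), 6) for title, (c, d) in table1.items()}
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score}")
```

Sorting by the product places the three classic works above the two comparison texts, matching the ordering described in the text.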
Each person's command of language is different. Here, this paper selects the writers Wu Cheng'en, Pu Songling, Luo Guanzhong, Zheng Guangzu, Sima Qian and Zuo Qiuming and analyzes the language rhythm networks of a number of their works.
The results of the complex network characterization of these works are shown in Table 2. Again, the linguistic rhythm networks of the works of famous authors have high aggregation coefficients and very short average distances, and the product of aggregation coefficient and average distance is relatively high in every case.
Table 2. Analysis results of complex network characteristics of writers' works
Writer | Aggregation coefficient | Average distance | Aggregation coefficient × Average distance |
---|---|---|---|
Wu Cheng’en | 0.464 | 2.146 | 0.995744 |
Pu Songling | 0.526 | 2.087 | 1.097762 |
Luo Guanzhong | 0.531 | 2.138 | 1.135278 |
Zheng Guangzu | 0.525 | 1.955 | 1.026375 |
Sima Qian | 0.489 | 2.036 | 0.995604 |
Zuo Qiuming | 0.632 | 1.863 | 1.177416 |
In this paper, folktales from different cultures are classified into three categories based on natural language processing techniques, and their contents are analyzed based on key features.
The comparison of the average number of words per chapter shows that the average vocabulary per chapter of wild historical mysteries is mostly below 10,000 words, that of mythological stories is between 10,000 and 20,000 words, and that of many literary and historical encyclopedias is more than 20,000 words, some reaching more than 40,000. The comparison of average paragraph lengths shows that the average paragraph length of mythological stories is mostly between 100 and 200 and that of literary and historical encyclopedias between 200 and 300, while that of wild historical mysteries fluctuates greatly. The comparison of average sentence lengths shows that the distribution for mythological stories does not differ much from that for literary and historical encyclopedias, being relatively concentrated and mostly in the range of 10 to 30, while the average sentence lengths of wild historical mysteries differ more from one another.
The complex network characterization of classic folktales shows that the aggregation coefficients of the natural language rhythm networks of classic mythological stories and literary and historical encyclopedias exceed 0.35, and those of some wild historical mysteries are also high. The average distances of the natural language rhythm networks of classic mythological stories and literary and historical encyclopedias are relatively low; "The Classic of Mountains and Seas" has the lowest, at 1.654. The complex network characterization of the works of famous authors shows that their linguistic rhythm networks all have aggregation coefficients above 0.35 and average distances below 2.5, and that the product of aggregation coefficient and average distance stays around 1.
This research was supported by the First-Class Discipline Construction Project of Higher Education Institutions in Ningxia (Education Discipline), funded project NXYLXK2021B10.