Folk Tales from Diverse Cultures: Digital Analysis of Content using Natural Language Processing

Introduction

In the digital era, the protection and transmission of traditional culture face unprecedented opportunities and challenges. Folktales, a precious part of China's traditional cultural heritage, carry rich history and regional character; their vivid storylines and deep cultural connotations form a bridge between past and future [1-2]. Giving folktales new vitality through digital technology, so that this valuable intangible cultural heritage shines more brightly in the new era, is not only an important way to promote their innovative development but also an urgent issue in the protection and inheritance of Chinese traditional culture [3-6].

Using machine learning to analyze the narrative structure of folktales is a compelling research topic: it would be of wide interest if narrative morphology could be extracted automatically and reliably from a given set of folktales [7]. On the one hand, studying the structure of folktales lets us trace the deepest cultural connotations through the thicket of disparate cultural phenomena and better understand the narrative logic of folklore [8-9]. On the other hand, it facilitates the transformation of folklore resources in the current media era: by borrowing the classical structures of folk narrative texts, we can recreate these excellent resources and realize the value of traditional folk narratives in modern society [10-12]. Most importantly, digital technology can revitalize these ancient and mysterious stories, and the diversified, interactive modes of communication it provides open a new path for the digital interpretation of folktales, contributing to their protection and inheritance [13-15].

Natural language processing not only helps people manage huge volumes of data efficiently and systematically, but also helps them mine the information and regularities hidden in the data. In this paper, folktales from different cultures in China are selected as the research object. First, distributed crawlers are used to integrate the resources. Second, key features of the works are selected using natural language processing, and the differences in these features across folktale categories are analyzed. Finally, by analyzing the complex-network characteristics of linguistic rhythm in different categories of folktales, the most salient features of each category are derived.

Folktale Resource Integration Using Distributed Crawlers
Analyze web page layout to extract story content

Web pages are described in HTML using markup tags; a web browser reads the HTML document and renders it for the user as a web page. Taking the Chrome browser as an example, one can open a story website, right-click and choose "Inspect" to examine part of the page layout.

Observing that most of the story text is contained in <p></p> tags, we can locate all <p></p> tags on the page and crawl their contents directly. Some pages have a more complex layout and require other HTML attributes (class, name, etc.) to locate elements. This project uses Python together with the general-purpose crawler framework Scrapy. As an example of finding elements with XPath, the following call extracts the href attributes of div blocks whose class name is "ic" and stores them in the variable lists:

lists = response.xpath('.//li/div[@class="ic"]/@href').extract()

The crawled data also needs redundancy-removal processing. The method above in fact extracts only the main body of the page; the data is then cleaned with regular-expression matching and similar operations, mainly to remove spaces and redundant blank lines, as sketched below.
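A minimal sketch combining the extraction and cleaning steps described above; the spider name, site URL and tag layout are hypothetical placeholders, not the actual story site:

```python
import re

import scrapy


class StorySpider(scrapy.Spider):
    """Sketch of the crawling step; URL and selectors are placeholders."""

    name = "story"
    start_urls = ["https://example.com/folktales"]  # hypothetical story site

    def parse(self, response):
        # Locate every <p></p> tag on the page and take its text content.
        paragraphs = response.xpath("//p/text()").extract()
        text = "\n".join(paragraphs)
        # Regular-expression cleanup: remove spaces and redundant blank lines.
        text = re.sub(r"[ \t\u3000]+", "", text)      # ASCII and full-width spaces
        text = re.sub(r"\n{2,}", "\n", text).strip()  # collapse blank lines
        yield {"url": response.url, "text": text}
```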

Distributed design

Based on the Scrapy framework, the spider first generates a network request, which is sent through the scheduler to the downloader; the downloader fetches the corresponding network data and returns a response to the spider, and finally the spider stores the data in an item container. The Scrapy engine is responsible for circulating the data flow among all components of the framework and controls the corresponding actions. The architecture is shown in Figure 1.

Figure 1.

Data flow architecture diagram

On this basis, we combine the Redis database with Scrapy to realize a distributed design. The idea of scrapy-redis is to build a queue on top of the original framework: after a spider generates a network request, the request is sent directly to a Redis queue, from which the scheduler later extracts it. We can therefore set up multiple schedulers, each of which can extract requests from (and also deposit requests into) Redis, achieving distributed crawling across multiple servers. The distributed crawler architecture is shown in Figure 2, and a configuration sketch follows it.

Figure 2.

Architecture diagram of distributed crawler
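The distributed setup above can be sketched as a scrapy-redis configuration. The setting names below are the standard scrapy-redis ones; the Redis address is a placeholder for the actual deployment:

```python
# settings.py (sketch): shared Redis queue for multiple crawling servers.

# Replace Scrapy's default scheduler with the Redis-backed one, so every
# crawler process pushes requests to, and pops requests from, one shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all servers through Redis as well.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Address of the shared Redis instance (placeholder).
REDIS_URL = "redis://127.0.0.1:6379"
```

Each server then runs the same spider; because all requests pass through the one shared queue, adding a server simply adds another consumer.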

Content analysis based on natural language processing
Analysis of Natural Language Processing Technologies
Processes

The process of natural language processing is shown in Fig. 3 and can be roughly divided into five steps. (1) Acquire the text by web crawler or local import. (2) Pre-process the text: segment it into words and remove punctuation and stop words. (3) Characterize the text by mapping words to vectors using one-hot encoding or word embedding. One-hot encoding is also called one-bit-effective encoding: an N-bit register corresponds to N states, and only one bit is set at a time. Word embedding maps words into a dense, low-dimensional vector space of fixed size, which is more efficient than one-hot encoding. (4) Train the model, using machine-learning algorithms such as support vector machines, decision trees, nearest neighbors or logistic regression, or deep-learning algorithms such as convolutional or recurrent neural networks. (5) Validate the trained model on a test set to assess the strengths and weaknesses of the algorithm.

Figure 3.

Natural language processing flow chart

A brief elaboration of the five steps: the corpus is usually collected with web crawlers or from local text datasets. Preprocessing mainly includes corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal. In the characterization step, the preprocessed and segmented text must be vectorized, representing words in vector form so that the computer can operate on them; such representations also help reveal similarity relationships between words. In the model-training step, traditional supervised, unsupervised and semi-supervised learning models can all be used, selected according to the application scenario. For evaluating the trained model, commonly used metrics include accuracy, recall, and so on.
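To make the characterization step concrete, here is a small illustrative sketch contrasting one-hot encoding with a dense embedding. The five-word vocabulary and the random embedding matrix are toy assumptions; real embeddings would be learned from the corpus:

```python
import numpy as np

vocab = ["wolf", "moon", "mountain", "river", "tale"]
idx = {w: i for i, w in enumerate(vocab)}

# One-hot encoding: N words -> N-dimensional sparse vectors;
# only one position is 1, so no similarity information is carried.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[idx[word]] = 1.0
    return v

# Word embedding: each word maps to a dense, low-dimensional vector
# (random here for illustration; in practice learned from a corpus).
dim = 3
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), dim))

def embed(word):
    return embedding[idx[word]]

print(one_hot("moon"))   # e.g. [0. 1. 0. 0. 0.]
print(embed("moon"))     # dense 3-dimensional vector
```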

Statistical models

In natural language processing, words, characters or bytes serve as the labeled units, depending on the design of the model. The N-gram language model is a probability-based discriminative model; the N in "N-gram" stands for a sequence of N words. The text sequence of these N words is taken as input, and the N-gram model outputs the joint probability of their occurrence. The N words must be in sequential order, but repetition between words is allowed. In common usage, the model with N = 1 is called a unigram, N = 2 a bigram, and N = 3 a trigram.

Assume a sentence S consists of n words w1, …, wn. The probability of the sentence can be derived from the chain rule of conditional probability: $$P(S) = P(w_1, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_{n-1}, \ldots, w_1) \tag{1}$$

Eq. (1) is concise and easy to understand, but its computation suffers from a flaw: the number of conditioning terms in $P(w_n \mid w_{n-1}, \ldots, w_1)$ grows as O(n), a huge order of magnitude. To solve this problem, the Markov assumption is introduced: the probability of a word depends only on the N − 1 words preceding it: $$P(w_1, \ldots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1}, \ldots, w_1) \approx \prod_{i=2}^{n} P(w_i \mid w_{i-1}, \ldots, w_{i-N+1}) \tag{2}$$

In Eq. (2), N is the parameter of the N-gram model: the occurrence of word wi depends on the N − 1 words before it. For example, when N = 2: $$P(w_1, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_{n-1}) \tag{3}$$

When N = 3: $$P(w_1, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_{n-1}, w_{n-2}) \tag{4}$$

By analogy, N can be set higher, but in practical experiments setting N to 3 is sufficient for text categorization. Using maximum likelihood estimation, the value of each conditional probability $P(w_n \mid w_{n-1}, \ldots, w_1)$ can be computed from counts; for the trigram case: $$P(w_n \mid w_{n-1}, w_{n-2}) = \frac{\mathrm{count}(w_{n-2}, w_{n-1}, w_n)}{\mathrm{count}(w_{n-2}, w_{n-1})} \tag{5}$$

In Eq. (5), count denotes the frequency of occurrence in the corpus. For example, count(w<sub>n−2</sub>, w<sub>n−1</sub>) is the number of times words w<sub>n−2</sub> and w<sub>n−1</sub> appear together in the dataset, and count(w<sub>n−2</sub>, w<sub>n−1</sub>, w<sub>n</sub>) is the number of times w<sub>n−2</sub>, w<sub>n−1</sub> and w<sub>n</sub> appear together.
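A minimal sketch of the counting estimator of Eq. (5) on a toy, pre-segmented corpus (the tokens are invented for illustration):

```python
from collections import Counter

# Toy corpus of segmented sentences (tokens are hypothetical).
corpus = [["the", "fox", "crossed", "the", "river"],
          ["the", "fox", "crossed", "the", "mountain"]]

bigrams, trigrams = Counter(), Counter()
for sent in corpus:
    for i in range(len(sent) - 2):
        bigrams[tuple(sent[i:i + 2])] += 1    # trigram-prefix bigram counts
        trigrams[tuple(sent[i:i + 3])] += 1

def p_trigram(w1, w2, w3):
    """Eq. (5): count(w1, w2, w3) / count(w1, w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_trigram("fox", "crossed", "the"))    # 1.0
print(p_trigram("crossed", "the", "river"))  # 0.5
```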

The larger N is, the more context words are used to compute each word's probability and the better the model's expressiveness. However, when N becomes large the data becomes sparse and the zero-probability problem appears; in addition, a large N makes the parameter space too large, producing a curse of dimensionality. To solve these problems and improve statistical efficiency, neural language models (NLMs) were developed: by modeling text sequences with distributed word representations, they largely overcome the curse of dimensionality.

Skip-gram model

The skip-gram model (SGM) generates the words surrounding a particular word in the text from that word. During training, every word is represented by two fixed-dimensional vectors used to compute conditional probabilities, and the background (context) window size must be set beforehand. During word-vector training, each word acts both as a center word and as a background word, and the vectors for the two roles are stored separately. For example, suppose the text sequence consists of n words {w1, …, wn}. Take wc as the center word and set the background window size to 2. The training objective of the skip-gram model is then to generate, from the center word wc, the joint probability of the background words wc+1, wc+2, wc−1 and wc−2 that lie no more than two words away: $$P(w_{c+1}, w_{c+2}, w_{c-1}, w_{c-2} \mid w_c) \tag{6}$$

The skip-gram model assumes that, given the center word wc, the occurrences of the background words wc+1, wc+2, wc−1 and wc−2 are independent of each other, so Eq. (6) can be written as the product of four conditional probabilities.

For illustration, assume the text set consists of 5 words; the probability of each background word is computed from the center word with a sliding window of 2.

In building the SGM, each word is represented as two fixed-dimensional vectors for computing conditional probabilities. Let a word have index i in the corpus; its vector is denoted $C_i \in \mathbb{R}^d$ when trained as a center word and $B_i \in \mathbb{R}^d$ when trained as a background word, where $\mathbb{R}^d$ is the fixed-dimensional space to which all word vectors must belong. Suppose the center word wa has index a and one of its background words wb has index b. The conditional probability of generating wb from wa is: $$P(w_b \mid w_a) = \frac{\exp(B_b^{\top} C_a)}{\sum_{i \in V} \exp(B_i^{\top} C_a)} \tag{7}$$

In Eq. (7), V = {1, …, n} denotes the set of word indexes in the text set; Ca and Bb denote the vectors of the center word wa and the background word wb, respectively, and the Bi in the denominator normalizes over all words in the vocabulary. exp denotes the exponential function with the natural constant e as its base.
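Eq. (7) is a softmax over the vocabulary; a small numpy sketch with toy random vectors (the vocabulary size and dimension are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 5, 3                   # toy vocabulary of 5 words, vectors in R^3
C = rng.normal(size=(n_words, dim))   # center-word vectors C_i
B = rng.normal(size=(n_words, dim))   # background-word vectors B_i

def p_background(b, a):
    """Eq. (7): P(w_b | w_a) = exp(B_b^T C_a) / sum_i exp(B_i^T C_a)."""
    scores = B @ C[a]        # B_i^T C_a for every i in V
    scores -= scores.max()   # stabilise the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[b]

print(p_background(2, 0))                          # one conditional probability
print(sum(p_background(b, 0) for b in range(5)))   # sums to 1.0
```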

Suppose a text sequence of length n, a background window size m, and let w(t) denote the word with index t. Assuming the background words are conditionally independent given the center word, the probability that each center word generates its 2m background words is: $$\prod_{t=1}^{n} \prod_{-m \le j \le m,\, j \ne 0} P\big(w^{(t+j)} \mid w^{(t)}\big) \tag{8}$$ Here t runs from 1 to n, traversing all words in the dataset, so Eq. (8) computes the probability of generating the background words for every word acting as center word.

The SGM is trained by maximum likelihood estimation, i.e., by minimizing the negative log-likelihood obtained by taking the logarithm of Eq. (8): $$-\sum_{t=1}^{n} \sum_{-m \le j \le m,\, j \ne 0} \log P\big(w^{(t+j)} \mid w^{(t)}\big) \tag{9}$$

If stochastic gradient descent is used for optimization, at each iteration the training procedure randomly samples a shorter subsequence, computes the loss on that subsequence and its gradient, and updates the model parameters accordingly. The gradients with respect to the center and background word vectors are obtained from the logarithm of the conditional probability. Substituting Eq. (7) into Eq. (9) yields: $$\log P(w_b \mid w_a) = B_b^{\top} C_a - \log\Big(\sum_{i \in V} \exp(B_i^{\top} C_a)\Big) \tag{10}$$

Differentiating Eq. (10) yields the gradient with respect to Ca: $$\frac{\partial \log P(w_b \mid w_a)}{\partial C_a} = B_b - \sum_{j \in V} \frac{\exp(B_j^{\top} C_a)}{\sum_{i \in V} \exp(B_i^{\top} C_a)}\, B_j = B_b - \sum_{j \in V} P(w_j \mid w_a)\, B_j \tag{11}$$

The summation index j is introduced to distinguish the outer summation from the index i used inside the normalizing sum.
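A sketch of the gradient of Eq. (11), verified numerically against finite differences (again with toy random vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, dim = 5, 3
C = rng.normal(size=(n_words, dim))
B = rng.normal(size=(n_words, dim))

def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def grad_center(b, a):
    """Eq. (11): d log P(w_b | w_a) / d C_a = B_b - sum_j P(w_j | w_a) B_j."""
    probs = softmax(B @ C[a])   # P(w_j | w_a) for every j in V
    return B[b] - probs @ B     # weighted sum of background vectors

# Numerical check of the analytic gradient against finite differences.
def log_p(b, a, Ca):
    return np.log(softmax(B @ Ca))[b]

eps, a, b = 1e-6, 0, 2
numeric = np.array([(log_p(b, a, C[a] + eps * np.eye(dim)[k])
                     - log_p(b, a, C[a] - eps * np.eye(dim)[k])) / (2 * eps)
                    for k in range(dim)])
print(np.allclose(grad_center(b, a), numeric, atol=1e-5))   # True
```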

Visual presentation and interpretation of analysis results
Word Frequency Analysis - Word Cloud Presentation

Word frequency analysis, also called word frequency statistical analysis, is a universally used basic method of text analysis. Its basic principle is to count the number of times each word appears in a fixed text; word frequency statistics reveal lexical regularities and mine the information hidden in the text. This form of analysis is simple to operate and its results are clear. The immediate output of word frequency analysis is a word frequency table; since such a table is pure text and numbers, its visual impact is low, so it is often further processed into a word cloud. The shape of the word cloud can be adapted to the research needs, common choices being geometric shapes, map outlines, figures, and graphics related to the research topic. The interpretation of a word cloud is evidence-based. Word frequency refers to how many times a particular word appears in the text; this number is usually normalized to avoid bias, because the same word may appear more often in longer documents than in shorter ones. The text is therefore first segmented, meaningless words are screened out and filtered, and only then are the frequencies computed. The specific calculation is: $$\mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF}, \qquad \mathrm{TF} = \frac{\text{number of times a word appears in the article}}{\text{total number of words in the article}}, \qquad \mathrm{IDF} = \log_{10} \frac{\text{total number of articles}}{\text{number of articles in which the word appears}}$$

From the formula above, the weighted frequency of a given word can be derived. Using word frequency analysis, the core content and viewpoints of a text can be identified. In this paper, word frequency statistics are applied to folktales from different cultures and visualized, which makes the ideas and contents of the texts directly visible and supports the subsequent analysis.
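A small sketch computing TF and IDF exactly as defined above, on a toy set of "articles" with invented tokens; the resulting scores are what a word-cloud tool would size words by:

```python
import math

# Toy "articles": each is a list of segmented words (hypothetical tokens).
articles = [["dragon", "king", "river", "dragon"],
            ["king", "palace", "minister"],
            ["river", "boat", "fisherman", "river"]]

def tf(word, article):
    """TF = times the word appears / total words in the article."""
    return article.count(word) / len(article)

def idf(word):
    """IDF = log10(total articles / articles containing the word)."""
    containing = sum(1 for a in articles if word in a)
    return math.log10(len(articles) / containing)

def tfidf(word, article):
    return tf(word, article) * idf(word)

# Rank the words of the first article; the top entries feed the word cloud.
scores = {w: tfidf(w, articles[0]) for w in set(articles[0])}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```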

Visual representation of semantic network analysis

Semantic network analysis describes what kind of relationships exist between individual keywords and whether associations exist among them; the keyword associations obtained from text analysis are represented as a network graph. In this paper, keywords are analyzed quantitatively through degree centrality, word clusters and point relevance, and the final result is a keyword association network graph.

Degree centrality

In semantic network analysis, centrality is usually quantified to reflect the correlation between keywords in a graph. Centrality studies distinguish the (absolute) degree centrality from the relative degree centrality; this paper uses the relative degree centrality, calculated as: $$C_{RD}(X) = \frac{C_{AD}(X)}{n-1}$$ where $C_{AD}(X)$ is the absolute degree centrality of point X and n is the number of points in the graph.

Here $C_{AD}$ is the number of other points directly connected to a keyword; with n points in the graph, the maximum possible degree of any point is n − 1 (a point is not connected to itself). For example, in a graph with 40 words, a word of degree 25 has relative degree centrality 25/(40 − 1) = 0.64 (rounded to two decimal places).
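A sketch of the relative degree centrality on a toy keyword network using networkx, whose degree_centrality already normalizes by n − 1 (the edges are illustrative):

```python
import networkx as nx

# Toy keyword co-occurrence network (edges are hypothetical).
G = nx.Graph([("dragon", "king"), ("dragon", "river"),
              ("king", "palace"), ("river", "boat")])

n = G.number_of_nodes()
for node in G:
    # Relative degree centrality: C_AD(X) / (n - 1).
    print(node, round(G.degree(node) / (n - 1), 2))

# networkx computes the same normalisation directly:
print(nx.degree_centrality(G))
```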

Word clusters

Usually the study of centrality in a graph is not limited to the degree centrality of individual points; the overall centrality of the graph is often examined as well. Keywords are therefore grouped into one or several clusters rather than studied in isolation. Cluster analysis helps to grasp the study as a whole.

Correlation of points

Point relevance describes numerically how close the connections between keywords are, i.e., whether a keyword is closely connected to the other keywords: the larger the value, the more distant the connection; conversely, the smaller the value, the closer the connection.

Complex network characterization

Complex networks are networks with small-world properties, mainly characterized by a scale-free degree distribution, short average distances and high clustering coefficients.

Average shortest distance

Before stating the average shortest distance of a network, the distance between network nodes is defined:

Assume two network nodes a and b connected by multiple paths $p_i\ (i \in [1, n], n \ge 1)$, where $|p_i|$ is the number of edges contained in path $p_i$. The distance $d_{ab}$ between nodes a and b is: $$d_{ab} = \min_{i \in [1,n]} |p_i|$$

That is, the distance between two points in a network is the number of edges of the shortest path that exists between the two points.

Average Shortest Distance:

The average shortest distance of a network is the average of the shortest distances between all connected node pairs. Assuming the network contains n nodes, and hence n(n − 1)/2 node pairs: $$D = \frac{\sum_{i<j} d_{ij}}{n(n-1)/2}$$

Complex networks have the property of a short average distance: any two points in the network can be connected through a small number of edges.
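A sketch of the node distance and the network average shortest distance on a toy graph, using networkx (the edges are illustrative):

```python
import networkx as nx

# Small toy network (edges are illustrative only).
G = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4)])

# d_ab = number of edges on the shortest path between a and b.
print(nx.shortest_path_length(G, 1, 3))    # 2

# D = sum of d_ij over all connected pairs / (n(n-1)/2).
print(nx.average_shortest_path_length(G))
```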

Degree distribution

The degree distribution of the nodes is an important indicator of network structure. The degree of a node is the number of edges connecting it to other nodes. In directed networks, degree is subdivided into in-degree and out-degree; in undirected networks there is only one notion of degree. Here, only the degree distribution of undirected networks is discussed.

The degree probability function p(k) is the ratio of the number of nodes with degree k to the total number of nodes in the network.

In complex networks, the degree distribution exhibits scale-free properties:

That is, the degree distribution function obeys a power-law distribution: $$p(k) \sim k^{-\gamma}$$ where γ is a constant exponent. A network whose node degree distribution satisfies a power law is scale-free: there are relatively few key nodes with a large number of connections to other nodes, while most ordinary nodes have low degree, giving a pronounced long-tail characteristic. Removing certain key nodes increases the average distance of the network and may even break its connectivity, whereas losing ordinary nodes affects neither the average distance nor the connectivity. The degree distribution thus reflects the degree and level of connectivity of a complex network.
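A sketch of an empirical degree distribution p(k); a Barabási–Albert graph is used here simply because preferential attachment produces the power-law, long-tail shape described above:

```python
import networkx as nx

# Scale-free graph generated by preferential attachment (Barabási–Albert).
G = nx.barabasi_albert_graph(n=1000, m=2, seed=0)

# p(k): fraction of nodes whose degree is k.
hist = nx.degree_histogram(G)   # hist[k] = number of nodes of degree k
n = G.number_of_nodes()
for k, count in list(enumerate(hist))[:10]:
    if count:
        print(k, count / n)     # heavy tail: few high-degree hubs
```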

Clustering coefficients

The clustering coefficient, which measures the degree of grouping in a network, is an important parameter; its physical meaning is the probability that any two neighbors of a node are themselves neighbors of each other. For each node i the clustering coefficient $c_i$ is: $$c_i = \frac{e_i}{k_i (k_i - 1)/2}$$

Here $e_i$ is the actual number of edges between the neighbors of node i, $k_i$ is the degree of node i, and $k_i(k_i - 1)/2$ is the maximum possible number of edges between the neighbors of node i, i.e., the theoretical number. Dividing the actual number of edges between neighbors by the theoretical number describes how aggregated the neighborhood of node i is: the larger $c_i$, the more clustered the nodes around node i; conversely, the smaller $c_i$, the more loosely connected they are.

The clustering coefficient of the overall network is then the average of the clustering coefficients of the individual nodes; assuming the network has n nodes: $$C = \frac{1}{n} \sum_{i=1}^{n} c_i$$

Larger clustering coefficients indicate a more aggregated network and more pronounced clustering.
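A sketch of the node and network clustering coefficients on a toy graph, matching the two formulas above (edges illustrative):

```python
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# c_i = e_i / (k_i (k_i - 1) / 2) for each node i.
print(nx.clustering(G))        # e.g. node 3: one edge among 3 neighbours -> 1/3

# C = average of the node clustering coefficients.
print(nx.average_clustering(G))
```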

Applications of natural language processing for content analysis

In this paper, the crawled Chinese folktales are categorized into three types: "Mythological Stories", "Wild Historical Secrets" and "Literary and Historical Encyclopedias". Differences in lexical distribution reflect the author's linguistic habits and textual style at the level of sentence structure; this paper analyzes the key features of the three types of folktales.

Analysis of differences in specific features across works
Difference in average number of words per chapter

The average number of words per chapter describes the author's word usage at the level of the whole text and reflects the author's vocabulary, and thus, to some extent, literary accomplishment and depth. The distribution of the average number of words per chapter for "Mythological Stories", "Wild Historical Secrets" and "Literary and Historical Encyclopedias" is shown in Figure 4. The average vocabulary per chapter of wild historical secrets is basically below 10,000 words, that of mythological stories lies between 10,000 and 20,000 words, while many literary and historical encyclopedias exceed 20,000 words, some even reaching more than 40,000. The reason can be analyzed from the genre types:

Many wild historical secrets are passed down orally, and the oral tradition requires clear, concise language that avoids lengthy description. At the same time, such tales usually focus on specific, marginal historical events or characters and involve few technical terms or little cultural background.

Mythological stories usually carry metaphorical meanings; much of their wording is symbolic, exaggerated and rhetorical, and this rhetorical language enriches the vocabulary. Mythological stories also serve to explain natural phenomena, religious beliefs and the origin of the world, so they cover a wide range of subjects, though their coverage is limited by their focus on particular cultures or national traditions.

Literary and historical encyclopedias cover a very wide range of fields, including specialized terms and concepts from many disciplines. They also have great depth of knowledge and systematization, a rigorous treatment of historical events, people and cultures, and a relatively rich and standardized vocabulary.

Figure 4.

The distribution of the average number of words per chapter

Differences in average segment lengths

The average paragraph length reflects an author's segmentation habits, shows the complexity of the text, and is an important indicator of readability. If there are many short paragraphs, the text is simple, easier for readers to understand, and offers a better reading experience; if there are few, long paragraphs, the text is complex, relatively difficult to understand, and less readable. The average paragraph lengths of "Mythological Stories", "Wild Historical Secrets" and "Literary and Historical Encyclopedias" are shown in Figure 5.

Figure 5.

The distribution of the average length of each chapter paragraph

As Figure 5 shows, the average paragraph length of mythological stories is concentrated between 100 and 200, that of literary and historical encyclopedias between 200 and 300, while that of wild historical secrets fluctuates widely; overall, literary and historical encyclopedias have the highest average paragraph length. This suggests that encyclopedia works are more complex because of their systematic nature, that mythological stories have more, shorter segments and compact plots because of their storytelling character, and that the strong fluctuation of wild historical secrets matches their uncertain historical sources.

Mean Sentence Length Characterization

A comparison of the distribution of average sentence lengths for "Mythological Stories", "Wild Historical Secrets" and "Literary and Historical Encyclopedias" is shown in Figure 6. The average sentence length distribution of mythological stories does not differ much from that of literary and historical encyclopedias; in general, the sentences of mythological stories are shorter and the style livelier and more spontaneous. The average sentence length of mythological stories is relatively concentrated, basically between 10 and 30, and the rhythmic style is consistent across stories; the average sentence length of wild historical secrets varies greatly, with obvious and irregular differences in rhythmic style between works.

Figure 6.

The distribution of the average sentence length per chapter
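The three chapter-level features above can be computed with a simple routine like the following sketch; the tokenization is deliberately naive (whitespace words, punctuation-delimited sentences, character-counted paragraphs), whereas the actual study of Chinese text would require a proper segmenter such as jieba:

```python
import re

def chapter_features(text):
    """Average-words, average-paragraph-length and average-sentence-length
    features for one chapter; lengths here are counted in characters."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in re.split(r"[。！？.!?]", text) if s.strip()]
    words = text.split()   # whitespace tokens, for illustration only
    return {
        "words_per_chapter": len(words),
        "avg_paragraph_length": sum(len(p) for p in paragraphs) / len(paragraphs),
        "avg_sentence_length": sum(len(s) for s in sentences) / len(sentences),
    }

print(chapter_features("Once upon a time...\nA fox crossed the river. It met a king!"))
```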

Characterization of complex networks of linguistic rhythms

Every text possesses linguistic rhythm; here, different categories of folktales are analyzed through the complex-network characteristics of the linguistic rhythm in the texts.

This paper selects nine classic works spanning the three categories "Mythological Stories", "Wild Historical Secrets" and "Literary and Historical Encyclopedias": "The Classic of Mountains and Seas", "Journey to the West", "Romance of the Gods", "Daming Palace Ci", "Wild History of Ming and Qing Dynasties", "The Legacy of Supreme Harmony", "Records of the Grand Historian", "The Biography of Zuo", and "Ci Hai". For comparison, this paper itself and a collection of scientific papers are analyzed alongside the classic folktales.

Characterization of the linguistic rhythm network in classical works

The results of the complex-network characterization of the works are shown in Table 1. The clustering coefficients of the natural language rhythm networks of the classic mythological stories and the literary and historical encyclopedias are very high, all reaching more than 0.35, and those of some wild historical secrets also perform well, while that of this paper is only 0.17. The average distances of the rhythm networks of the classic mythological stories and encyclopedias are relatively low; "The Classic of Mountains and Seas" has the lowest, at 1.654, which indicates that its natural rhythmic organization is very flexible and varied.

Table 1.

Analysis results of complex network characteristics

| Title of work | Clustering coefficient | Average distance | Clustering coefficient × average distance |
|---|---|---|---|
| "The Classic of Mountains and Seas" | 0.738 | 1.654 | 1.220652 |
| "Journey to the West" | 0.582 | 2.189 | 1.273998 |
| "Romance of the Gods" | 0.622 | 1.746 | 1.086012 |
| "Daming Palace Ci" | 0.492 | 2.297 | 1.130124 |
| "Wild History of Ming and Qing Dynasties" | 0.286 | 3.147 | 0.900042 |
| "The Legacy of Supreme Harmony" | 0.393 | 2.532 | 0.995074 |
| "Records of the Grand Historian" | 0.793 | 1.893 | 1.501149 |
| "The Biography of Zuo" | 0.863 | 2.109 | 1.820067 |
| "Ci Hai" | 0.973 | 1.743 | 1.695939 |
| This paper | 0.167 | 3.862 | 0.644954 |
| Collection of scientific papers | 0.274 | 2.753 | 0.754322 |

Since the average distance in the complex network of linguistic natural rhythms reflects the flexibility of rhythmic organization, the shorter the average distance, the more flexibly the rhythmic units combine, suggesting an author with a strong command of rhythmic combinations who can deploy varied natural rhythms. A high clustering coefficient indicates a tight network, showing the aggregation of the natural rhythms and the author's frequent use of multiple rhythmic units within a text. In short, the shorter the average distance and the higher the clustering coefficient, the better these two network properties describe the author's command of language and the expressiveness of the work.

But is there a single value that can evaluate the quality of a text? Here we examine the product of the average distance and the clustering coefficient of the complex network. This product is very high for all the classic mythological stories and literary and historical encyclopedias, exceeding 1.1 except for "Romance of the Gods", whereas the products for the wild historical secrets are uneven and generally low. For this paper the product is only 0.644954 and for the collection of scientific papers only 0.754322, both much lower than for the classic folktales. The gap between excellent and non-excellent works is thus very significant, so the product of the clustering coefficient and the average distance can serve as a preliminary judgment of a text's quality.

Characterization of linguistic rhythmic networks in the works of famous authors

Each person's command of language differs. This paper therefore selects the writers Wu Cheng'en, Pu Songling, Luo Guanzhong, Zheng Guangzu, Sima Qian and Zuo Qiuming and analyzes several texts by each, characterizing the linguistic rhythm networks of the works of these famous authors.

The results of the complex-network characterization of these works are shown in Table 2. Again, the linguistic rhythm networks of the famous authors' works show high clustering coefficients together with very short average distances, and the products of clustering coefficient and average distance all reach relatively high values.

Table 2.

Analysis results of complex network characteristics of writers' works

| Writer | Clustering coefficient | Average distance | Clustering coefficient × average distance |
|---|---|---|---|
| Wu Cheng'en | 0.464 | 2.146 | 0.995744 |
| Pu Songling | 0.526 | 2.087 | 1.097762 |
| Luo Guanzhong | 0.531 | 2.138 | 1.135278 |
| Zheng Guangzu | 0.525 | 1.955 | 1.026375 |
| Sima Qian | 0.489 | 2.036 | 0.995604 |
| Zuo Qiuming | 0.632 | 1.863 | 1.177416 |
Conclusion

In this paper, folktales from different cultures are classified into three categories based on natural language processing techniques, and their contents are analyzed based on key features.

The comparison of the distribution of average words per chapter shows that the average vocabulary per chapter of wild historical secrets is basically below 10,000 words, that of mythological stories lies between 10,000 and 20,000 words, and that many literary and historical encyclopedias exceed 20,000 words, some even reaching more than 40,000. The comparison of average paragraph lengths shows that mythological stories are concentrated between 100 and 200, literary and historical encyclopedias between 200 and 300, while wild historical secrets fluctuate widely. The comparison of average sentence lengths shows that mythological stories and literary and historical encyclopedias differ little, both being relatively concentrated between 10 and 30, while the average sentence lengths of wild historical secrets differ greatly from work to work.

The complex-network characterization of the classic folktales shows that the clustering coefficients of the natural language rhythm networks of the classic mythological stories and literary and historical encyclopedias all exceed 0.35, and that some wild historical secrets also perform well; the average distances of the rhythm networks of the classic mythological stories and encyclopedias are relatively low, with "The Classic of Mountains and Seas" lowest at 1.654. The characterization of famous authors' works shows that the clustering coefficients of their linguistic rhythm networks are all above 0.35, their average distances all below 2.5, and the products of clustering coefficient and average distance all remain around 1.

Acknowledgements

The Research is Supported by: the First-Class Discipline Construction Project of Higher Education Institutions in Ningxia (Education Discipline) Funded Project (NXYLXK2021B10).
