A Study on the Evolution of Language Style in Japanese Academic Articles Based on Text Mining
Published: 17 Mar 2025
Received: 16 Oct 2024
Accepted: 08 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0319
© 2025 Xueyang Yin, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Academic discourse is the mechanism by which scholars communicate scientific ideas within discipline-specific discourse communities, and it is inherently interactive [1-2]. Interpersonal metadiscourse is widely accepted as a means of guiding the audience through the discourse and inviting them to participate in it. It helps speakers position themselves in the discourse, presenting their viewpoints, evaluations, and attitudes either tentatively or confidently, and it helps them adjust those viewpoints in response to the audience and thereby increase audience participation [3-5]. Interpersonal metadiscourse is therefore a necessary rhetorical tool for interacting effectively with the audience and disseminating results [6]. The opening of an academic speech determines whether the speaker can arouse the audience’s interest and move smoothly into the research topic. Clarifying the interaction strategies used in this part can guide learners of Japanese to use appropriate interaction strategies in academic speeches and to achieve the communicative goal of disseminating results [7-9].
Written and spoken Japanese differ greatly in expression, and learners of Japanese often concentrate on oral practice while neglecting training in written expression. As a result, students’ language foundation is not solid, which affects the quality of their academic writing [10-12]. In research, academic papers are an important channel for presenting new knowledge [13]. As a special genre, the academic paper has its own discourse structure. It is therefore necessary to systematically summarize the linguistic characteristics of academic papers, which helps students improve their writing ability [14-16].
The title, an important part of any written work, has its own characteristics in wording, sentence construction, syntax, and so on. In Japanese academic research, how to choose a good Japanese title for a research paper is a problem that every researcher needs to consider carefully [17-19].
In this paper, we first establish a dataset of Japanese academic articles, use the Python programming language to convert the documents to the desired format, and construct a corpus from the dataset. The text data is then processed with noise reduction and normalization. A linguistic feature model of Japanese academic articles is constructed, and linguistic features such as vocabulary and sentences are selected for text mining. The established dataset for 1981-2020 is used to analyze the linguistic style of Japanese academic articles. Word length changes are analyzed using the average word length and the word length distribution. Sentence length changes during the 40-year period are observed by analyzing the average sentence length, the segmented sentence length, and the sentence length distribution. Finally, changes in vocabulary richness are examined in terms of lexical density, type-token ratio, and single-occurrence words.
Establishing and preprocessing the dataset of Japanese academic articles is the most fundamental step in this study. First, this chapter specifies the research object and the source of the literature data: Japanese academic literature published in mainstream international operations research journals, with publication years restricted to 1981-2020. Second, it describes the data preprocessing in detail, including document format conversion, noise removal, and text normalization. Finally, the preprocessing functions are encapsulated and integrated into the intelligent text mining system for Japanese academic articles.
This paper takes Japanese academic articles as the research object and uses text mining methods to parse and identify this kind of literature. To ensure the scientific rigor, authority, and timeliness of the dataset, the articles are drawn from four major literature retrieval databases: Springer, Elsevier, Taylor & Francis, and Wiley Online.
In this paper, based on the Python programming language and its third-party software packages, the conversion of the dataset document format is accomplished and the extracted content is written to a TXT file, which serves as the dataset corpus.
As an important carrier of information in human society, text is highly unstructured and discontinuous in the way it expresses intrinsic data relationships. Each academic article takes words as its basic unit; words combine into phrases and sentences, which in turn form the whole document. Not only is the syntax highly variable, but words themselves also have many variants and relationships, such as synonymy, near-synonymy, and ambiguity. Text mining therefore requires normalizing and desensitizing the original text so that only the data useful for subsequent research is retained. At the same time, the key information in a text is often hidden in a large amount of noise, so removing that noise is also key to the quality of the modeling results.
This paper is based on Japanese academic documents, in which text noise occurs frequently and can negatively affect the modeling results. A regular expression is a rule-based character matching technique that screens text data by specified character patterns. Because of its flexibility and expressive power, it has become a standard tool for noise removal in text preprocessing.
Regular expression quantifiers can match in greedy or non-greedy mode, depending on how much text they consume. In greedy mode, a quantifier matches as much of the text as the pattern allows; in non-greedy mode, it matches as little as possible. In this paper, a greedy matching mode is used to process the text.
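The difference between the two matching modes can be illustrated with Python's standard `re` module. This is a minimal sketch: the sample string, the tag-like noise, and the `remove_noise` helper are assumptions made for illustration, not the patterns actually used in this paper.

```python
import re

# Hypothetical noisy fragment: markup left over from format conversion.
sample = "結果を<b>表1</b>と<b>表2</b>に示す。"

# Greedy: ".*" expands as far as possible, so a single match runs from the
# first "<b>" to the last "</b>".
greedy = re.findall(r"<b>.*</b>", sample)

# Non-greedy: ".*?" stops at the earliest "</b>", giving one match per span.
lazy = re.findall(r"<b>.*?</b>", sample)


def remove_noise(text):
    """Strip tag-like markup and bracketed citation markers such as [3-5].

    Both patterns are illustrative assumptions about what counts as noise.
    """
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"\[\d+(?:[-,]\d+)*\]", "", text)
    return text


print(greedy)
print(lazy)
print(remove_noise(sample))
```

Which mode is appropriate depends on the noise pattern: a greedy quantifier suits patterns that should swallow a whole run, while a non-greedy one avoids deleting legitimate text that lies between two noise markers.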
For the Japanese academic articles studied here, the steps above remove special characters and other noise from the documents. The cleaned text is then normalized into well-defined sequences of linguistic components with standard structures and markings. As mentioned previously, text is unstructured at the data level, but at the semantic level it is organized into characters, words, phrases, and sentences that together form a complete document. This paper therefore splits and reorganizes each document according to these semantic levels and filters out the remaining interference to obtain normalized text, which lays the foundation for the subsequent feature engineering. The main steps of text normalization are text splitting, acronym expansion, case conversion, stop-word removal, stemming, lemmatization, and syntactic analysis.
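A few of these normalization steps can be sketched with the Python standard library alone. The sketch below covers character normalization, case conversion, sentence splitting, and stop-word removal; the stop-word list is a hypothetical placeholder, and word segmentation (which for Japanese normally requires a morphological analyzer) is assumed to have been done already.

```python
import re
import unicodedata

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"の", "は", "を", "に", "が", "と", "で"}


def normalize_text(text):
    """Fold full-width Latin letters, digits, and spaces to ASCII (NFKC),
    lowercase Latin letters, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split after end-of-sentence marks, keeping each mark with its sentence."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def remove_stop_words(tokens):
    """Drop tokens found in the stop-word list; tokens are assumed to be
    already segmented words."""
    return [t for t in tokens if t not in STOP_WORDS]


print(normalize_text("ＡＩ　モデル"))
print(split_sentences("一文目。二文目？"))
print(remove_stop_words(["本稿", "は", "文体", "を", "分析"]))
```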
Through the construction of a Japanese academic article dataset, this paper obtained the original data. Since text is unstructured data, this paper extracts metric features and transforms them into structured data through linguistic feature modeling for further research.
The language mechanism contains lexical, syntactic, discourse, and phonological modules, which monitor the generation of discourse and constrain speakers and writers to choose the lexis, syntactic structure, discourse cohesion, and prosodic form required by a particular style. The linguistic features of a text are therefore an unconscious and profound reflection of its language style, and they can be captured to a certain extent by quantitative features.
Based on the above discussion, this paper condenses the information of a text D into a linguistic feature model, which is defined as follows:
A text
The so-called linguistic features refer to the linguistic structures that can play a role in distinguishing meaning [20-21]. To study the linguistic style of a text from the perspective of metric stylistics, it is crucial to select appropriate linguistic features. The following requirements should be met when selecting linguistic features:
(1) Quantifiable. To analyze language style with the methods of metric stylistics, the selected linguistic features must be quantifiable; word length and vocabulary richness, for example, are quantifiable indicators. (2) Stable occurrence. The selected features should occur stably in the work, since only stably occurring features reflect its linguistic style objectively and accurately. (3) Distinctiveness. The selected features should distinguish the work from other works, i.e., they should have discriminating power.
At present, the measurable linguistic features that previous studies have proposed and confirmed in practice as representing linguistic style fall into several levels: punctuation, vocabulary, sentences, paragraphs, syntax, and semantics.
Punctuation is divided into two categories: punctuation marks and punctuation symbols. Punctuation marks are further divided into end-of-sentence marks and intra-sentence marks. As a non-verbal stylistic element, punctuation is an integral part of written language and an indispensable auxiliary tool. Each punctuation mark has a unique role, and different habits of punctuation use reflect an author’s language style by giving the text varying emphasis and rhythm.
Based on previous studies, this paper uses the Punctuation Ratio (PR) as its indicator, which is calculated as follows:
Lexical characterization views the text as a collection of sentences consisting of a series of tokens, each token representing a word, number, or punctuation mark [22]. Early researchers mostly analyzed text style through features such as high-frequency words and average word length; Holmes later conducted a more in-depth study of the syllable composition of each word and argued that analyzing the number of syllables per word also yields good results.
The vocabulary richness of a text is a personalized index that reflects the author’s writing characteristics, and it is also an important feature of text style [23]. To quantify this feature, researchers have proposed vocabulary richness functions. A typical one is the type-token ratio (TTR), where a type is a distinct word and a token is a single occurrence of a word; the larger the TTR, the richer the vocabulary.
In this paper, we use the average word length and word length distribution to measure the word length attribute. The calculation is shown below:
Where
For word class information, this paper uses two indicators: the modifier word ratio (MVR) and the real word ratio (CR). They are calculated as follows:
Where
Based on the rank-frequency distribution of words, this paper uses the Gini coefficient G and the A index. They are calculated as follows:
Where
Based on the frequency of word occurrences, this paper selects two further measures, the Yule-K value and the S value. They are calculated as follows:
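As a hedged illustration of the first measure, the sketch below computes Yule's K in its commonly cited form, K = 10^4 · (Σ m²·V(m) − N) / N², where N is the number of tokens and V(m) is the number of distinct words occurring exactly m times. Whether this paper uses exactly this variant is an assumption, and the S value is not sketched because its definition is not given in the text. The function name `yule_k` is hypothetical.

```python
from collections import Counter


def yule_k(tokens):
    """Yule's K in its common form: 1e4 * (sum_m m^2 * V(m) - N) / N^2.

    Lower values indicate a richer vocabulary (less repetition).
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    word_freq = Counter(tokens)              # word -> number of occurrences
    spectrum = Counter(word_freq.values())   # m -> V(m)
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


print(yule_k(["a", "a", "b", "c"]))
```

In the usage line, N = 4, V(1) = 2 and V(2) = 1, so the sum is 1·2 + 4·1 = 6 and K = 10^4 · (6 − 4) / 16.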
The sentence is the basic unit of linguistic structure capable of expressing a complete meaning. The analysis of linguistic features at the sentence level covers two main aspects: sentence length distribution and sentence length dispersion.
Sentence length is the number of characters in a sentence. In Japanese, there are usually long pauses between sentences, and punctuation marks such as periods, question marks, exclamation points, and ellipses are used to mark the end of a sentence. Sentence length is therefore calculated by counting the characters in a sentence, with a period, question mark, exclamation point, ellipsis, or other end-of-sentence punctuation mark as the terminator.
The average sentence length is the ratio of the sum of the lengths of all sentences in a text to the number of sentences. Like average word length, sentence length and average sentence length are important indicators of the readability (ease of reading and understanding) of a text. Generally speaking, the more long sentences there are, the longer the average sentence length, which means that the text is more complex, harder to understand, and less readable. Conversely, the more short sentences there are, the shorter the average sentence length, the lower the complexity of the text, and the better its readability.
Sentence dispersion, also known as the degree of change in sentence length, refers to the degree to which the sentence length of each sentence in the text differs from the average sentence length. Sentence length dispersion can be calculated by the formula:
Where,
The degree of rhythmic variation in a text can be observed through its sentence length dispersion. The lower the dispersion, the less sentence length varies, the more sentence lengths repeat, and the smoother and more rhythmic the text. Conversely, the larger the value, the more sentence length varies, which gives the text, and the reader’s experience of it, more ups and downs.
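The two sentence-level quantities can be sketched as follows. The dispersion formula itself is not reproduced in the text, so the population standard deviation of sentence lengths is used here as an assumption that matches the verbal definition (deviation of each sentence length from the mean). The set of terminators and the function names are likewise illustrative.

```python
import re
import statistics

# End-of-sentence marks named in the text: period, question mark,
# exclamation point, ellipsis (full-width and ASCII forms).
SENTENCE_END = r"[。！？!?…]+"


def sentence_lengths(text):
    """Length in characters of each sentence, terminators excluded."""
    return [len(s.strip()) for s in re.split(SENTENCE_END, text) if s.strip()]


def sentence_stats(text):
    """Return (average sentence length, dispersion).

    Dispersion is taken to be the population standard deviation, which is
    an assumption about the paper's formula.
    """
    lengths = sentence_lengths(text)
    return statistics.fmean(lengths), statistics.pstdev(lengths)


mean_len, dispersion = sentence_stats("あい。うえおか。")
print(mean_len, dispersion)
```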
The Japanese academic articles in the dataset were published between 1981 and 2020. This paper divides them into eight five-year stages, namely 1981-1985, 1986-1990, 1991-1995, 1996-2000, 2001-2005, 2006-2010, 2011-2015, and 2016-2020, and then analyzes by stage the changes in the style of Japanese academic articles over these 40 years.
Words are the building blocks of texts and play an important role in constituting linguistic features. Quantitative analysis of word use with statistical methods can reveal differences in linguistic style between texts. This paper analyzes the linguistic style of Japanese academic texts using representative measures such as average word length and word length distribution.
The average word length and word length dispersion of Japanese academic articles from 1981 to 2020 are shown in Table 1. The average word length ranges from 1.8329 to 1.9507; it is shortest in 1981-1985 and longest in 2016-2020, but the differences between stages are small and not strictly monotonic. Table 1 also shows that Japanese academic articles have relatively high average word length values and correspondingly low readability. This is because both writing and reading academic articles involve knowledge thresholds and professional barriers, which makes academic articles harder to read than ordinary texts.
Table 1. Average word length (ALW) and word length dispersion (DLW) of Japanese academic articles from 1981 to 2020
| Period | Total character number | Total word number | Average word length (ALW) | Word length dispersion (DLW) |
|---|---|---|---|---|
| 1981-1985 | 1585188 | 864852 | 1.8329 | 0.358 |
| 1986-1990 | 1733027 | 931985 | 1.8595 | 0.362 |
| 1991-1995 | 1881979 | 997286 | 1.8871 | 0.346 |
| 1996-2000 | 2032242 | 1086470 | 1.8705 | 0.351 |
| 2001-2005 | 2245315 | 1187746 | 1.8904 | 0.338 |
| 2006-2010 | 2487462 | 1298868 | 1.9151 | 0.345 |
| 2011-2015 | 2723730 | 1406522 | 1.9365 | 0.346 |
| 2016-2020 | 2968347 | 1521683 | 1.9507 | 0.346 |
The greater the word length dispersion, the more varied the lengths of the words an author uses, which reflects flexibility and diversity in word length. The flexibility of word use in Japanese academic texts can be compared using Table 1. Word length dispersion from 1981 to 2020 ranges from 0.338 to 0.362, with the greatest variation in 1986-1990 and the least in 2001-2005. Since academic writing requires standardization and rigor, the word length dispersion of Japanese academic articles changes little between 1981 and 2020 and remains consistently low.
In addition, Table 1 shows that the change in word length is not directly related to the length of the text, that word length dispersion is not significantly correlated with the time of writing, and that the dispersion follows no other significant pattern.
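The average word length values in Table 1 are consistent with dividing the total character count by the total word count for each period. The short check below recomputes them from the table; the dictionary and function names are illustrative only.

```python
# (total characters, total words, reported average word length) from Table 1
TABLE_1 = {
    "1981-1985": (1585188, 864852, 1.8329),
    "1986-1990": (1733027, 931985, 1.8595),
    "1991-1995": (1881979, 997286, 1.8871),
    "1996-2000": (2032242, 1086470, 1.8705),
    "2001-2005": (2245315, 1187746, 1.8904),
    "2006-2010": (2487462, 1298868, 1.9151),
    "2011-2015": (2723730, 1406522, 1.9365),
    "2016-2020": (2968347, 1521683, 1.9507),
}


def average_word_length(total_chars, total_words):
    """Average word length as characters per word."""
    return total_chars / total_words


for period, (chars, words, reported) in TABLE_1.items():
    computed = average_word_length(chars, words)
    print(f"{period}: computed {computed:.4f}, reported {reported:.4f}")
```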
Word length distribution refers to the occurrence of words of different syllables in a text. Authors have personal trade-offs and unique preferences for the use of words of different lengths when creating texts, so counting the use of words of different lengths in texts can serve as an attempt to reveal the evolution of the linguistic style of Japanese academic articles.
This paper counts the use of monosyllabic, disyllabic, trisyllabic, and four-syllable words, and of words with more than four syllables, in the eight stages of Japanese academic articles between 1981 and 2020; their frequencies are listed in Table 2.
Table 2. Distribution of word length
| Period | Monosyllable frequency | Two-syllable frequency | Trisyllable frequency | Four-syllable frequency | Above-four-syllable frequency |
|---|---|---|---|---|---|
| 1981-1985 | 70.62% | 27.36% | 1.26% | 0.64% | 0.12% |
| 1986-1990 | 68.52% | 26.92% | 2.73% | 1.29% | 0.54% |
| 1991-1995 | 67.05% | 26.79% | 2.76% | 2.06% | 1.34% |
| 1996-2000 | 66.93% | 26.84% | 3.37% | 2.21% | 0.65% |
| 2001-2005 | 66.65% | 26.68% | 3.73% | 2.38% | 0.56% |
| 2006-2010 | 64.44% | 25.52% | 4.24% | 4.27% | 1.53% |
| 2011-2015 | 63.79% | 24.92% | 5.36% | 4.84% | 1.09% |
| 2016-2020 | 62.78% | 23.55% | 5.96% | 6.03% | 1.68% |
In the Japanese academic articles published between 1981 and 2020, monosyllabic words account for 62.78% to 70.62% of all words, roughly two-thirds, disyllabic words account for 23.55% to 27.36%, and words of three or more syllables account for only 2.02% to 13.67%. Overall, during the 40 years from 1981 to 2020, the word length distribution is still dominated by monosyllabic and disyllabic words, but their proportion decreases stage by stage, while words of three or more syllables appear more and more often and their proportion gradually increases. As academic writing becomes increasingly standardized, the proportion of monosyllabic and disyllabic words decreases, and flexibility in academic writing gradually gives way to rigorous, standardized writing norms.
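A distribution of this kind can be computed from a list of segmented words as sketched below. One assumption should be stated clearly: word length is measured here in characters, which is consistent with the characters-per-word averages in Table 1, whereas the text speaks of syllables. The function name and sample tokens are hypothetical.

```python
from collections import Counter

LABELS = {1: "1", 2: "2", 3: "3", 4: "4", 5: ">4"}


def word_length_distribution(tokens):
    """Share of words of length 1, 2, 3, 4 and more than 4.

    Length is counted in characters (an assumption); empty tokens are ignored.
    """
    tokens = [t for t in tokens if t]
    if not tokens:
        raise ValueError("tokens must contain at least one non-empty word")
    buckets = Counter(min(len(t), 5) for t in tokens)
    return {LABELS[k]: buckets.get(k, 0) / len(tokens) for k in range(1, 6)}


print(word_length_distribution(["の", "研究", "の", "言語学", "は"]))
```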
Sentence length can be used as a tool to study the linguistic style of a text, as it shows the author’s habit of sentence construction in Japanese academic texts. After eliminating the influence of the text’s own genre on the use of sentences, we can examine the author’s stylistic preference for the use of sentences from the perspective of sentence length, and the cross-sectional analysis of sentence length in different works can also be used to compare and analyze the changes in the style of writing in Japanese academic articles.
In this study, sentence length is calculated in terms of the words in a sentence, without counting punctuation, and the average sentence length is obtained by dividing the number of words in the text, excluding punctuation, by the number of sentences into which the text is divided. Statistics from related studies show that academic articles contain an average of 40.25 words and 69.46 characters per sentence. The number of characters, the number of sentences, and the average sentence length of the Japanese academic articles in the corpus from 1981 to 2020 are shown in Table 3.
Table 3. Character number, sentence number, and average sentence length of Japanese academic articles
| Period | Character number | Sentence number | Average sentence length |
|---|---|---|---|
| 1981-1985 | 1585188 | 36374 | 43.58 |
| 1986-1990 | 1733027 | 39677 | 43.68 |
| 1991-1995 | 1881979 | 42860 | 43.91 |
| 1996-2000 | 2032242 | 44362 | 45.81 |
| 2001-2005 | 2245315 | 48349 | 46.44 |
| 2006-2010 | 2487462 | 53242 | 46.72 |
| 2011-2015 | 2723730 | 56522 | 48.19 |
| 2016-2020 | 2968347 | 60249 | 49.27 |
Table 3 shows that from 1981 to 2020 the average sentence length of Japanese academic articles is always greater than 40, in line with the basic characteristics of academic articles, and that it increases stage by stage from 43.58 in 1981-1985 to 49.27 in 2016-2020. This indicates that the complexity and formality of Japanese academic writing have increased over time.
Sentence pauses are short pauses in discourse and are often used as a measure of sentence dispersion. Depending on where they occur, they are categorized as inter-sentence pauses and mid-sentence pauses. An inter-sentence pause, also known as an end-of-sentence pause, falls between sentences; it is generally longer, and the content of the following sentence is only loosely related to that of the preceding one. A mid-sentence pause falls within a sentence; it is generally shorter, and the text before and after it is more closely related. Of two sentences with the same number of words, the one with more punctuation marks has a relatively fragmented structure, while the one with fewer has a relatively concentrated structure. To study the change in sentence dispersion, the author counted the pauses in the eight stages of Japanese academic articles from 1981 to 2020, taking the comma, pause mark, colon, semicolon, full stop, exclamation mark, and question mark as markers of sentence breaks and treating each occurrence of one of these symbols as one sentence break. The sentence breaks counted on this basis are shown in Table 4.
Table 4. Segmented sentence length of Japanese academic articles
| Period | Character number | Segmented sentence number | Segmented sentence length |
|---|---|---|---|
| 1981-1985 | 1585188 | 206674 | 7.67 |
| 1986-1990 | 1733027 | 225949 | 7.67 |
| 1991-1995 | 1881979 | 304527 | 6.18 |
| 1996-2000 | 2032242 | 347392 | 5.85 |
| 2001-2005 | 2245315 | 418122 | 5.37 |
| 2006-2010 | 2487462 | 515002 | 4.83 |
| 2011-2015 | 2723730 | 599941 | 4.54 |
| 2016-2020 | 2968347 | 711834 | 4.17 |
Table 4 shows that over the eight time periods from 1981 to 2020 the segmented sentence length declines gradually, from 7.67 in 1981-1985 to 4.17 in 2016-2020. It can be seen that the percentage of broken sentences in the Japanese academic articles published in these 40 years decreases, the frequency of whole sentences increases, and the structure of the articles becomes more compact and concentrated, in line with academic requirements.
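The segmented sentence lengths in Table 4 equal the character count divided by the number of segments (for example, 1585188 / 206674 ≈ 7.67). On raw text, the same quantity can be sketched by splitting at the break markers listed above. The exact character set, including the choice of full-width and ASCII variants, is an assumption, as are the function names.

```python
import re

# Break markers named in the text: comma, pause mark, colon, semicolon,
# full stop, exclamation mark, question mark (full-width and ASCII forms;
# the ASCII period is left out so that decimals are not split).
BREAK_MARKS = r"[、，,：:；;。．！!？?]+"


def segment_lengths(text):
    """Length in characters of each stretch of text between break markers."""
    return [len(seg) for seg in re.split(BREAK_MARKS, text) if seg]


def mean_segment_length(text):
    """Average number of characters per segment."""
    lengths = segment_lengths(text)
    return sum(lengths) / len(lengths)


example = "本稿では、語長を分析し、文長も調べる。"
print(segment_lengths(example), mean_segment_length(example))
```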
Sentence length distribution refers to the distribution of the lengths of all sentences in a text; studying it reveals writers’ preferences for sentence length. In this study, sentences are grouped in steps of 15 words: all sentences in each corpus are assigned to the ranges 1-15, 16-30, 31-45, 46-60, 61-75, and more than 75 words, and the proportion of sentences in each range relative to all sentences in the corpus is calculated. The details are shown in Table 5.
Table 5. Distribution of sentence length of Japanese academic articles
| Period | Item | 1~15 | 16~30 | 31~45 | 46~60 | 61~75 | >75 | Total |
|---|---|---|---|---|---|---|---|---|
| 1981-1985 | Sentence number | 11876 | 8250 | 6144 | 4601 | 3368 | 2135 | 36374 |
|  | Proportion | 32.65% | 22.68% | 16.89% | 12.65% | 9.26% | 5.87% | 100% |
| 1986-1990 | Sentence number | 12530 | 9344 | 6642 | 5170 | 3551 | 2440 | 39677 |
|  | Proportion | 31.58% | 23.55% | 16.74% | 13.03% | 8.95% | 6.15% | 100% |
| 1991-1995 | Sentence number | 13779 | 9841 | 7338 | 5675 | 3630 | 2597 | 42860 |
|  | Proportion | 32.15% | 22.96% | 17.12% | 13.24% | 8.47% | 6.06% | 100% |
| 1996-2000 | Sentence number | 13357 | 11024 | 7080 | 6237 | 4330 | 2334 | 44362 |
|  | Proportion | 30.11% | 24.85% | 15.96% | 14.06% | 9.76% | 5.26% | 100% |
| 2001-2005 | Sentence number | 13577 | 10327 | 8490 | 8142 | 5647 | 2166 | 48349 |
|  | Proportion | 28.08% | 21.36% | 17.56% | 16.84% | 11.68% | 4.48% | 100% |
| 2006-2010 | Sentence number | 14343 | 10728 | 9594 | 8796 | 5873 | 3908 | 53242 |
|  | Proportion | 26.94% | 20.15% | 18.02% | 16.52% | 11.03% | 7.34% | 100% |
| 2011-2015 | Sentence number | 16058 | 11881 | 9756 | 8608 | 5850 | 4369 | 56522 |
|  | Proportion | 28.41% | 21.02% | 17.26% | 15.23% | 10.35% | 7.73% | 100% |
| 2016-2020 | Sentence number | 17761 | 13345 | 10628 | 9664 | 6236 | 2615 | 60249 |
|  | Proportion | 29.48% | 22.15% | 17.64% | 16.04% | 10.35% | 4.34% | 100% |
Table 5 shows that sentences of length 1~15 and 16~30 have the highest proportions in Japanese academic articles, and together they account for roughly half of all sentences throughout 1981-2020. Overall, the share of shorter sentences is decreasing, although sentences of length 1~15 and 16~30 still make up close to 50%. The proportion of sentences with length >45 shows an overall increasing trend, from 27.78% in the first stage (1981-1985) to 30.73% in the eighth stage (2016-2020). This indicates that the language of Japanese academic articles became more regularized between 1981 and 2020.
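The binning used for Table 5 can be sketched as follows, given a list of sentence lengths. The bin labels follow the table; the function name and the sample lengths are illustrative assumptions.

```python
BIN_WIDTH = 15
LABELS = ["1~15", "16~30", "31~45", "46~60", "61~75", ">75"]


def length_distribution(lengths):
    """Proportion of sentences falling into each 15-unit length range."""
    counts = dict.fromkeys(LABELS, 0)
    for n in lengths:
        if n < 1:
            continue  # skip empty sentences
        # Lengths 1-15 map to index 0, 16-30 to 1, ..., anything above 75 to 5.
        idx = min((n - 1) // BIN_WIDTH, len(LABELS) - 1)
        counts[LABELS[idx]] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


print(length_distribution([10, 20, 20, 80]))
```

Applied to the Table 5 figures, the shares above 45 in 1981-1985 are 12.65% + 9.26% + 5.87% = 27.78%, which is the value quoted in the text.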
Lexical density is a concept developed by functionalist linguists to measure the amount of information carried by the words in a text. It is the ratio of lexical words (i.e., real words) to all words; the length of the discourse has little effect on it, so it produces stable results. The real words counted in this study include nouns, verbs, adjectives, pronouns, numerals, and quantifiers. The lexical densities for the eight time periods from 1981 to 2020 are shown in Table 6.
Table 6. Lexical density of Japanese academic articles, 1981-2020
| Period | Real word number | Total word number | Lexical density |
|---|---|---|---|
| 1981-1985 | 751470 | 864852 | 0.8689 |
| 1986-1990 | 811107 | 931985 | 0.8703 |
| 1991-1995 | 791446 | 997286 | 0.7936 |
| 1996-2000 | 941209 | 1086470 | 0.8663 |
| 2001-2005 | 1034646 | 1187746 | 0.8711 |
| 2006-2010 | 1081308 | 1298868 | 0.8325 |
| 2011-2015 | 1159677 | 1406522 | 0.8245 |
| 2016-2020 | 1292213 | 1521683 | 0.8492 |
A high lexical density indicates that real words make up a large proportion of the text and that the text carries a large amount of information. The lexical densities of the eight time periods are comparable, between 0.7936 and 0.8711, and the difference between adjacent stages is between 0.0014 and 0.0767. This small difference indicates that the use of real words in Japanese academic articles is comparable across the eight time periods between 1981 and 2020, with no obvious distinguishing features.
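Lexical density as defined above can be sketched for a list of part-of-speech-tagged words. The tag names below are hypothetical English labels for the real-word classes listed in the text; in practice they would come from whatever morphological analyzer is used, which the paper does not name.

```python
# Hypothetical tag names for the real-word (content-word) classes in the text.
CONTENT_POS = {"noun", "verb", "adjective", "pronoun", "numeral", "quantifier"}


def lexical_density(tagged_tokens):
    """Ratio of real words to all words; tagged_tokens is a list of (word, pos)."""
    if not tagged_tokens:
        raise ValueError("tagged_tokens must not be empty")
    content = sum(1 for _, pos in tagged_tokens if pos in CONTENT_POS)
    return content / len(tagged_tokens)


sample = [("研究", "noun"), ("を", "particle"), ("行う", "verb"), ("た", "auxiliary")]
print(lexical_density(sample))

# The first row of Table 6 follows the same ratio: 751470 / 864852.
print(round(751470 / 864852, 4))
```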
The type-token ratio is computed here as the ratio of the number of word tokens to the number of word types, and it reflects the vocabulary used in a writer’s work. The number of word types equals the number of distinct words in the corpus, i.e., the number of words remaining after deduplication. For a given total number of words, the greater the number of word types, the richer the vocabulary; that is, because the ratio is taken as tokens over types, the lower the ratio, the richer the vocabulary. The type-token ratios of Japanese academic articles in the eight stages from 1981 to 2020 are shown in Table 7.
Table 7. Type-token ratio of Japanese academic articles, 1981-2020
| Period | Total word number | Type number | Type-token ratio |
|---|---|---|---|
| 1981-1985 | 864852 | 24109 | 35.8726 |
| 1986-1990 | 931985 | 64061 | 14.5484 |
| 1991-1995 | 997286 | 63524 | 15.6994 |
| 1996-2000 | 1086470 | 70239 | 15.4682 |
| 2001-2005 | 1187746 | 171101 | 6.9418 |
| 2006-2010 | 1298868 | 171165 | 7.5884 |
| 2011-2015 | 1406522 | 87635 | 16.0498 |
| 2016-2020 | 1521683 | 100517 | 15.1386 |
Table 7 shows that the type-token ratios of 2001-2005 and 2006-2010 are the smallest, 6.9418 and 7.5884, respectively, which indicates that the lexical richness of Japanese academic articles is highest, and the vocabulary most varied, in 2001-2010. The type-token ratio of 1981-1985 is the largest, 35.8726, which indicates that articles published in 1981-1985 have the lowest lexical richness and the least varied, most stable vocabulary.
Single-occurrence words, i.e., words that appear only once in a text, are another indicator of vocabulary richness. The more single-occurrence words there are, the richer the vocabulary. The statistics of single-occurrence words in the eight stages from 1981 to 2020 are shown in Table 8.
Table 8. Single-occurrence words in Japanese academic articles, 1981-2020
| Period | Single occurrence word number | Total word number | Cumulative frequency |
|---|---|---|---|
| 1981-1985 | 10724 | 864852 | 0.0124 |
| 1986-1990 | 19292 | 931985 | 0.0207 |
| 1991-1995 | 25431 | 997286 | 0.0255 |
| 1996-2000 | 31073 | 1086470 | 0.0286 |
| 2001-2005 | 106185 | 1187746 | 0.0894 |
| 2006-2010 | 93909 | 1298868 | 0.0723 |
| 2011-2015 | 69763 | 1406522 | 0.0496 |
| 2016-2020 | 76388 | 1521683 | 0.0502 |
In Table 8, “number of single-occurrence words” refers to the number of words that appear exactly once in the corpus, and “cumulative frequency” refers to the share of all single-occurrence words in the total number of words. The table shows that the three stages 1986-1990, 1991-1995, and 1996-2000 have comparable cumulative frequencies of single-occurrence words (0.0207, 0.0255, and 0.0286, respectively), suggesting comparable vocabulary richness. The cumulative frequencies in 2001-2005 and 2006-2010 are high, 0.0894 and 0.0723, respectively, indicating high vocabulary richness; the value for 2001-2005 is the largest, which indicates the richest vocabulary use.
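The two count-based richness measures reported in Tables 7 and 8 can be sketched together. The ratio is computed as tokens over types, which is how the values in Table 7 are obtained (for example, 864852 / 24109 ≈ 35.87), and the single-occurrence frequency is the number of words occurring exactly once divided by the total word count. The function and key names are hypothetical.

```python
from collections import Counter


def richness(tokens):
    """Tokens-per-type ratio and single-occurrence word frequency."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    freq = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freq)
    single = sum(1 for count in freq.values() if count == 1)
    return {
        "tokens_per_type": n_tokens / n_types,    # lower means richer vocabulary
        "single_frequency": single / n_tokens,    # higher means richer vocabulary
    }


print(richness(["a", "a", "b", "c"]))
```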
The lexical densities, type-token ratios, and single-occurrence word frequencies of the eight stages are summarized in Table 9. Across the eight stages from 1981 to 2020, lexical density does not differ significantly, whereas the degree of change in word length, the type-token ratio, and the cumulative frequency of single-occurrence words all point to higher lexical richness in 2001-2010, which is a distinctive feature of that period.
Lexical richness of Japanese academic articles, 1981-2020
| Period | Lexical density | Type-token ratio | Single-occurrence word frequency |
|---|---|---|---|
| 1981-1985 | 0.8689 | 35.8726 | 0.0124 |
| 1986-1990 | 0.8703 | 14.5484 | 0.0207 |
| 1991-1995 | 0.7936 | 15.6994 | 0.0255 |
| 1996-2000 | 0.8663 | 15.4682 | 0.0286 |
| 2001-2005 | 0.8711 | 6.9418 | 0.0894 |
| 2006-2010 | 0.8325 | 7.5884 | 0.0723 |
| 2011-2015 | 0.8245 | 16.0498 | 0.0496 |
| 2016-2020 | 0.8492 | 15.1386 | 0.0502 |
The study establishes a dataset of Japanese academic articles and, after noise reduction and normalization of the text data, constructs a linguistic feature model for text mining of these articles. Using quantitative metrics, it measures the changes in word length, sentence length, and vocabulary richness of Japanese academic articles between 1981 and 2020 in order to study how the style of the articles has changed.
The average word length of Japanese academic articles from 1981 to 2020 ranged from 1.8329 to 1.9507, and the word length dispersion ranged from 0.338 to 0.362. The frequencies of monosyllabic and disyllabic words range from 62.78% to 70.62% and from 23.55% to 27.36%, respectively, while words of three syllables and above account for only 2.02% to 13.67% of the total number of words. Over the period 1981-2020, the proportion of monosyllabic and disyllabic words decreases, while the proportion of words of three syllables and above gradually increases. Japanese academic writing is becoming increasingly standardized.
From 1981 to 2020, the average sentence lengths of Japanese academic texts are all greater than 40 and increase stage by stage, from 43.58 to 49.27 over this 40-year period, reflecting greater complexity and formality of the texts. Over the same period, the broken-sentence lengths of Japanese academic texts become progressively shorter, decreasing from 7.67 to 4.17. Sentences of length 1-15 and 16-30 are the most common, accounting for around 50% of sentences. The proportion of sentences with lengths > 45 is generally on the rise, and the language is becoming increasingly regular.
The lexical densities of the eight time periods are in the range of 0.7936 to 0.8711, with very small differences between the stages and no significant difference in the use of content words. The stage with the largest type-token ratio is 1981-1985 (35.8726), and the stage with the smallest is 2001-2005 (6.9418). The cumulative frequency of single-occurrence words in Japanese academic articles during 2001-2005 is the highest (0.0894), making it the stage with the richest vocabulary.
This paper is the phased achievement of the Philosophy and Social Science Research Youth Project: “Based on the corpus of Japanese form nouns” (No.: 21Q274).
