Open Access

A Study on the Evolution of Language Style in Japanese Academic Articles Based on Text Mining

  
Mar 17, 2025


Introduction

Academic discourse is the mechanism through which scholars communicate scientific ideas within discipline-specific discourse communities, and it is characterized by interactivity [1-2]. Many scholars accept that interpersonal metadiscourse guides the audience in understanding the discourse and invites them to participate in it. It helps speakers intervene in the discourse, present their viewpoints, evaluations, and attitudes either tentatively or confidently, and adjust those viewpoints in response to the audience so as to increase audience participation [3-5]. Interpersonal metadiscourse is therefore a necessary rhetorical tool for interacting effectively with the audience and facilitating the dissemination of results [6]. The opening of an academic speech largely determines whether the speaker can arouse the audience’s interest and move smoothly into the research topic. Clarifying the interaction strategies used in this part can guide learners of Japanese to use appropriate strategies in academic speeches and achieve the communicative goal of disseminating results [7-9].

Written and spoken Japanese differ greatly in expression. Learners of Japanese often focus on practicing oral expression while neglecting training in written expression. As a result, students’ language foundation is not solid, which affects the quality of their academic writing [10-12]. In research, academic papers are an important channel for presenting new knowledge [13]. As a special genre, the academic paper has its own discourse structure. It is therefore necessary to summarize the linguistic characteristics of academic papers systematically, which helps students improve their writing ability [14-16].

The title is an important part of a written work and has its own characteristics in word choice, sentence construction, syntax, and so on. In Japanese academic research, how to choose a good Japanese title for a research paper is a problem that every researcher needs to consider carefully [17-19].

In this paper, we first establish a dataset of Japanese academic articles, use the Python programming language to convert the documents to the desired format, and construct a corpus from the dataset. The text data is then processed with noise reduction and normalization. A linguistic feature model of Japanese academic articles is constructed, and linguistic features such as vocabulary and sentences are selected for text mining. The dataset, which covers 1981-2020, is used to analyze the linguistic style of Japanese academic articles. Changes in word length are analyzed using average word length and word length distribution. Changes in sentence length over the 40-year period are observed through average sentence length, broken sentence length, and sentence length distribution. Finally, changes in vocabulary richness are examined in terms of vocabulary density, type-token ratio, and single-occurrence words.

Construction of a dataset of Japanese-language academic articles

Establishing the dataset of Japanese academic articles and preprocessing the data are the most fundamental and essential steps in the research task of this paper. First, this chapter clarifies the research object and the source of the literature data: the research object is Japanese academic literature published in mainstream international operations research journals, and the publication years collected are 1981-2020. Second, the data preprocessing process, including document format conversion, text noise removal, and text normalization, is described in detail. Finally, the text preprocessing functions are encapsulated and integrated into the intelligent text mining system for Japanese academic articles.

Establishment of data sets

This paper takes Japanese academic articles as the research object and uses text mining methods to parse and identify this kind of literature. To ensure the scientific rigor, authority, and timeliness of the literature dataset, this paper uses four major literature retrieval databases, namely Springer, Elsevier, Taylor & Francis, and Wiley Online, as the sources of literature data for the Japanese academic article dataset.

Conversion of document formats

In this paper, based on the Python programming language and its third-party software packages, the conversion of the dataset document format is accomplished and the extracted content is written to a TXT file, which serves as the dataset corpus.

Text data noise reduction and its normalization

As an important carrier of information in human society, text is highly unstructured and discontinuous in the way it expresses intrinsic data relationships. Each academic article takes words as the basic unit. Words combine into complex phrases and sentences, which in turn form the whole document. Not only is the syntax highly variable, but words themselves also have many variants and relationships, such as synonymy, proximity, and ambiguity. In text mining it is therefore necessary to normalize and desensitize the original text data so that only the data useful for subsequent research is retained. At the same time, the key information in a text is often hidden in a large amount of noise, so removing that noise is also key to the quality of the modeling results.

Text Data Noise Removal

This paper is based on Japanese academic documents. Text noise occurs frequently in such documents and can negatively affect the modeling results. A regular expression is a rule-based character matching technique that screens text data by specified character combination patterns. Because of its flexibility, logic, and functionality, it has become a necessary tool for noise removal in text preprocessing.

Regular expressions can match in greedy or non-greedy mode, and the two modes differ in how much text a quantifier consumes. In greedy mode, a quantifier matches as much of the text as possible while still allowing the whole pattern to match. In non-greedy mode, it matches as little as possible. In this paper, the greedy matching mode is used to process the text.
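The difference between the two matching modes can be illustrated with Python's standard `re` module. This is only a sketch: the sample string, the tag-like noise, and the helper name `remove_tags` are assumptions made for illustration, not the patterns used in this paper.

```python
import re

# Hypothetical sample: a Japanese sentence polluted with tag-like noise.
sample = "結果を示す<sup>1</sup>。手法は<b>単純</b>である。"

# Greedy: ".*" runs to the LAST ">" in the string, so a single match
# swallows everything between the first "<" and the final ">".
greedy = re.findall(r"<.*>", sample)

# Non-greedy: ".*?" stops at the FIRST ">" after each "<",
# so every tag is matched on its own.
lazy = re.findall(r"<.*?>", sample)

def remove_tags(text: str) -> str:
    """Delete tag-like noise. The negated class [^>] keeps a greedy
    quantifier from running past the closing bracket of each tag."""
    return re.sub(r"<[^>]*>", "", text)

print(greedy)
print(lazy)
print(remove_tags(sample))
```

When a greedy quantifier is used for noise removal, bounding it with a negated character class, as in `remove_tags`, is one common way to avoid deleting the useful text that lies between two pieces of noise.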

Text normalization

For the Japanese academic articles studied in this paper, the steps above have already removed special characters and other noise from the documents. The cleaned text is then normalized to turn it into well-defined sequences of linguistic components with standard structures and markings. As mentioned previously, text is not structured at the data level, but at the semantic level it is organized into characters, words, phrases, and sentences that together form a complete document. This paper therefore splits and reorganizes each document according to its semantic levels and filters out remaining interference to obtain the normalized text content, which lays the foundation for the feature engineering that follows. The main steps of text normalization include text splitting, acronym expansion, case conversion, stop-word removal, stemming, lemmatization, and syntactic analysis.

Linguistic Feature Text Mining of Japanese Academic Articles
Definition of linguistic feature model

By constructing the Japanese academic article dataset, this paper obtained the original data. Since text is unstructured data, this paper extracts metric features and transforms the text into structured data through linguistic feature modeling for further research.

The language mechanism contains lexical, syntactic, discourse, and phonological corpus modules. They monitor the generation of discourse and constrain speakers and writers to choose the lexis, syntactic structure, form of discourse articulation, and rhyme form required by a particular corpus. The linguistic features of a text are therefore an unconscious and profound reflection of its language style, and these features can be portrayed to a certain extent by quantitative measures.

Based on the above discussion, this paper summarizes the information of a text D in a linguistic feature model, which is defined as follows:

A text D can be regarded as an n-dimensional vector in which each component is the measure of one linguistic feature, so D can be written as $D = (V_1(T), V_2(T), \ldots, V_n(T))$, where T is the set of linguistic features and $V_i(T)$ is the quantized value of the i-th linguistic feature. This paper calls this n-dimensional vector the linguistic feature model. Thus, by selecting appropriate linguistic features, a text can be transformed into data that can be modeled.
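As a minimal sketch of this definition, a text can be mapped to a tuple with one number per feature. The two features below, the type aliases, and the function names are hypothetical and chosen only for illustration. The sketch assumes the text has already been segmented into a non-empty list of words.

```python
from typing import Callable, Sequence

Tokens = Sequence[str]
Feature = Callable[[Tokens], float]

def average_token_length(tokens: Tokens) -> float:
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens)

def type_count(tokens: Tokens) -> float:
    """Number of distinct tokens."""
    return float(len(set(tokens)))

# T: the selected set of linguistic features (illustrative choice).
FEATURES: Sequence[Feature] = (average_token_length, type_count)

def feature_vector(tokens: Tokens,
                   features: Sequence[Feature] = FEATURES) -> tuple:
    """Return (V_1(T), ..., V_n(T)) for one pre-segmented text."""
    return tuple(feature(tokens) for feature in features)

tokens = ["言語", "の", "研究", "の", "方法"]
print(feature_vector(tokens))
```

Adding a feature to the model then amounts to appending one more function to `FEATURES`.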

Language feature selection

The so-called linguistic features refer to the linguistic structures that can play a role in distinguishing meaning [20-21]. To study the linguistic style of a text from the perspective of metric stylistics, it is crucial to select appropriate linguistic features. The following requirements should be met when selecting linguistic features:

Quantifiable. Using the method of metric stylistics to analyze language style, the selected language features must be quantifiable, such as the length of the word and the richness of the vocabulary, which are all quantifiable indicators.

Stable occurrence. The selected linguistic features should appear consistently in the work, since only stably occurring features can reflect its linguistic style objectively and accurately.

Distinctiveness. The selected linguistic features can distinguish the work from other works, i.e., they should have certain distinguishing significance.

At present, the measurable linguistic features that have been proposed in domestic and international research, and confirmed in practice to represent linguistic style, fall into several levels: punctuation, vocabulary, sentences, paragraphs, syntax, semantics, and others.

Punctuation

Punctuation is divided into two categories: punctuation marks and punctuation symbols. Among them, punctuation is further divided into end-of-sentence punctuation and intra-sentence punctuation. Punctuation, as a non-verbal stylistic element, is an integral part of written language and an indispensable auxiliary tool. Each punctuation mark has a unique role, and different styles of punctuation use give a text different emphasis and rhythmic rises and falls, thereby reflecting the author’s distinctive language style.

Based on previous studies, this paper uses the Punctuation Ratio (PR) as an indicator, which is calculated as $PR = \frac{AP}{N}$, where N is the total number of words in a text and AP is the number of all punctuation marks in it. In general, texts with a high proportion of punctuation tend to have a rich chapter structure and sentence tone.
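The punctuation ratio can be sketched as follows. The punctuation set and the function name are assumptions, and the input is assumed to be a list of tokens in which each punctuation mark is its own token.

```python
# Assumed set of full-width punctuation marks; the paper does not list its set.
PUNCTUATION = set("。、，．！？：；…")

def punctuation_ratio(tokens):
    """PR = AP / N: punctuation tokens divided by word tokens."""
    ap = sum(1 for t in tokens if t in PUNCTUATION)   # AP
    n = sum(1 for t in tokens if t not in PUNCTUATION)  # N
    if n == 0:
        raise ValueError("text contains no words")
    return ap / n

tokens = ["本稿", "は", "、", "文体", "を", "分析", "する", "。"]
print(punctuation_ratio(tokens))
```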

Vocabulary

Lexical characterization views the text as a collection of sentences consisting of a series of tokens, each representing a word, number, or punctuation mark [22]. In earlier years, researchers mostly analyzed text language style through features such as high-frequency words and average word length. Later, Holmes conducted a more in-depth study of the syllable composition of each word and argued that analyzing the number of syllables of each word in a text also yields good results.

The vocabulary richness of a text is an individual index that reflects the author’s writing characteristics, and it is also an important feature of the style of the text [23]. To quantify this feature, some researchers have proposed vocabulary richness functions. A typical example is the type-token ratio (TTR), in which a type is a distinct word form and a token is a single occurrence of a word. The larger the TTR, the richer the words used.

In this paper, we use the average word length (ALW) and the word length dispersion (DLW) to measure the word length attribute. They are calculated as follows:

$$ALW = \frac{L}{N} = \frac{\sum_{i=1}^{N} L_i}{N}$$

$$DLW = \frac{1}{N}\sum_{i=1}^{N}\left(L_i - ALW\right)^2$$

Here L is the total word length, N is the total number of words, and $L_i$ is the length of the i-th word. Generally speaking, the larger the average word length ALW, the more multisyllabic words are used in the text and the less colloquial and less readable the text is. Conversely, the smaller the value, the more monosyllabic words are used, which makes the text easier to understand and more readable. The word length dispersion DLW describes the degree of variation of word length in the text. The larger its value, the more word length varies in the text, and the smaller its value, the less it varies.
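The two word length measures can be sketched as below, reading DLW as the mean squared deviation of word lengths from ALW. Measuring length in characters with `len` is an assumption made for illustration, and the function names are hypothetical.

```python
def average_word_length(words):
    """ALW = L / N, with word length measured in characters."""
    return sum(len(w) for w in words) / len(words)

def word_length_dispersion(words):
    """DLW = (1/N) * sum((L_i - ALW)^2)."""
    alw = average_word_length(words)
    return sum((len(w) - alw) ** 2 for w in words) / len(words)

words = ["言語", "の", "研究", "方法論"]   # lengths 2, 1, 2, 3
print(average_word_length(words), word_length_dispersion(words))

# Dividing total characters by total words for 1981-1985
# (1585188 and 864852) gives the ALW reported for that period.
alw_1981_1985 = 1585188 / 864852
print(round(alw_1981_1985, 4))
```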

For word class information, this paper uses two indicators, the modifier word ratio (MVR) and the real word ratio (CR). They are calculated as follows:

$$MVR = \frac{M}{V}$$

$$CR = \frac{C}{N}$$

Here M and V are the numbers of modifiers and verbs in the text, respectively, and C and N are the number of real words (content words) and the total number of words in the text, respectively. Modifiers are words that can act as determiners in a sentence. In this paper, they are mainly adjectives, pronouns, distinguishing words, and number words. A high proportion of modifiers reflects a strongly descriptive text with more vivid language and more flexible expression. Conversely, a low proportion suggests that the expression may be more bland and colloquial. The proportion of real words, also known as lexical density, is generally used to measure the information content of the words in a text. The higher the value, the richer the words used in the text.

Based on the rank-frequency distribution of words, this paper uses the Gini coefficient G and the A index. They are calculated as follows:

$$G = \frac{1}{V}\left(V + 1 - \frac{2}{N}\sum_{r=1}^{V} r f(r)\right)$$

A=c(VHL2)a

Here V is the total number of different words, N is the total number of words, r and f(r) are the frequency rank and the corresponding frequency, respectively, HL is the number of single-occurrence words, and a and c are obtained by fitting the function $c/r^a$. The larger the Gini coefficient G, the more uneven the use of words in the text and the lower the lexical richness. Conversely, the smaller G is, the more evenly words are used in the text. The larger the A index, the more single-occurrence words the text has and the richer its vocabulary.
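A minimal sketch of the Gini coefficient follows, assuming it takes the form G = (V + 1 - (2/N) Σ r f(r)) / V with words ranked from most to least frequent. The function name and the sample word lists are hypothetical, and the A index is not covered.

```python
from collections import Counter

def gini_coefficient(words):
    """G = (V + 1 - (2/N) * sum(r * f(r))) / V over the rank-frequency list."""
    freqs = sorted(Counter(words).values(), reverse=True)  # f(1) >= f(2) >= ...
    v = len(freqs)   # V: number of different words
    n = sum(freqs)   # N: total number of words
    weighted = sum(rank * f for rank, f in enumerate(freqs, start=1))
    return (v + 1 - 2 * weighted / n) / v

even = ["a", "b", "c", "d"]                 # every word occurs once
skewed = ["a"] * 7 + ["b"] * 2 + ["c"]      # one word dominates
print(gini_coefficient(even), gini_coefficient(skewed))
```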

Based on the frequency of word occurrences, this paper selects two further measures, Yule’s K and the S value. They are calculated as follows:

$$K = 10^4 \cdot \frac{\sum_{i=1}^{N} V(i,N)\, i^2 - N}{N^2}$$

$$S = \frac{V(2,N)}{V}$$

Here N is the total number of words, V is the total number of distinct words, and V(i,N) is the number of words that occur i times in a text with a total word count of N. Both K and S are measures related to vocabulary richness, and the value of K is not affected by the length of the text. A larger S indicates a richer vocabulary, whereas a larger K indicates more repetition of words and therefore a less rich vocabulary.
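Both measures can be computed from the frequency spectrum V(i, N), that is, the number of distinct words occurring exactly i times. The sketch below assumes a pre-segmented word list, and the function name is hypothetical.

```python
from collections import Counter

def yule_k_and_s(words):
    """Return (K, S) for a list of words."""
    n = len(words)                            # N: total number of words
    word_freqs = Counter(words)               # word -> frequency
    v = len(word_freqs)                       # V: number of distinct words
    spectrum = Counter(word_freqs.values())   # i -> V(i, N)
    k = 1e4 * (sum(vi * i * i for i, vi in spectrum.items()) - n) / (n * n)
    s = spectrum.get(2, 0) / v                # share of words occurring twice
    return k, s

words = ["a", "a", "b", "b", "c", "d"]
k, s = yule_k_and_s(words)
print(k, s)
```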

Sentences

The sentence is the basic unit of linguistic structure capable of expressing a complete meaning. Linguistic features at the sentence level are analyzed in two main aspects: sentence length distribution and sentence length dispersion.

Sentence length distribution

Sentence length is the length of a sentence. In Japanese, there are usually long pauses between sentences, and punctuation marks such as periods, question marks, exclamation points, and ellipses are often used at the end of a sentence to indicate where it ends. Sentence length is therefore calculated by counting the number of characters in a sentence, with a period, question mark, exclamation point, ellipsis, or other sentence-final punctuation mark as the sign of termination.

The average sentence length is the ratio of the sum of the lengths of all the sentences in a text to the number of sentences. Like average word length, sentence length and average sentence length are important indicators of the readability (ease of reading and understanding) of a text. Generally speaking, the more long sentences there are, the longer the average sentence length, which means that the text is more complex, harder to understand, and less readable. Conversely, the more short sentences there are, the shorter the average sentence length, the lower the complexity of the text, and the better its readability.
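Sentence splitting and average sentence length can be sketched with a regular expression. The terminator set, the function names, and the choice to exclude the terminator itself from the character count are assumptions made for illustration.

```python
import re

# Assumed sentence-final marks: period, exclamation, question, ellipsis.
SENTENCE_END = r"[。！？…]+"

def split_sentences(text):
    """Split on sentence-final punctuation and drop empty pieces."""
    return [s for s in re.split(SENTENCE_END, text) if s.strip()]

def average_sentence_length(text):
    """Sum of sentence lengths (in characters) divided by sentence count."""
    sentences = split_sentences(text)
    return sum(len(s) for s in sentences) / len(sentences)

text = "本稿は文体を分析する。結果は表に示す！"
print(split_sentences(text))
print(average_sentence_length(text))
```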

Sentence Dispersion

Sentence dispersion, also known as the degree of change in sentence length, refers to the degree to which the length of each sentence in the text differs from the average sentence length. Sentence length dispersion can be calculated by the formula:

$$D_s = \frac{1}{n}\sum_{i=1}^{n}\left(S_i - \bar{S}_s\right)^2 \quad (i = 1, 2, 3, \ldots, n)$$

Here $D_s$ denotes the sentence length dispersion, $S_i$ denotes the length of the i-th sentence, $\bar{S}_s$ denotes the average sentence length, and n denotes the total number of sentences in the text.

The degree of rhythmic change in a text can be observed through its sentence length dispersion. The lower the value, the less sentence length changes, the more sentence lengths repeat, the smoother the whole text, and the stronger the sense of rhythm. Conversely, the larger the value, the more sentence length changes, which gives the text, and the reader’s experience of it, more pronounced rises and falls.
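Reading the formula above as the mean squared deviation of sentence lengths from their average, sentence length dispersion can be sketched as follows. The function name and the sample lengths are hypothetical.

```python
def sentence_length_dispersion(lengths):
    """D_s = (1/n) * sum((S_i - mean)^2) for a list of sentence lengths."""
    n = len(lengths)
    mean = sum(lengths) / n
    return sum((s - mean) ** 2 for s in lengths) / n

uniform = [10, 10, 10]   # identical sentence lengths
varied = [5, 10, 15]     # same mean, more variation
print(sentence_length_dispersion(uniform))
print(sentence_length_dispersion(varied))
```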

Analysis of the evolution of the style of Japanese-language academic articles

The Japanese academic articles in the dataset constructed in this paper were published from 1981 to 2020. This paper uses five-year intervals to divide them into eight stages, namely 1981-1985, 1986-1990, 1991-1995, 1996-2000, 2001-2005, 2006-2010, 2011-2015, and 2016-2020. The changes in the style of Japanese academic articles over the 40 years from 1981 to 2020 are then analyzed stage by stage.
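The stage assignment can be sketched as a small helper that maps a publication year to its five-year period. The function and constant names are hypothetical.

```python
START_YEAR, END_YEAR, SPAN = 1981, 2020, 5

def period_label(year):
    """Map a publication year to its five-year stage, e.g. 1987 -> '1986-1990'."""
    if not START_YEAR <= year <= END_YEAR:
        raise ValueError(f"year {year} is outside {START_YEAR}-{END_YEAR}")
    start = START_YEAR + (year - START_YEAR) // SPAN * SPAN
    return f"{start}-{start + SPAN - 1}"

# Collect the distinct stages in chronological order.
periods = sorted({period_label(y) for y in range(START_YEAR, END_YEAR + 1)})
print(periods)
```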

Analysis of changes in word length

Words are the building blocks of texts and play an important role in constituting linguistic features. Quantitative analysis of word use with statistical methods can reveal differences in linguistic style between texts. This paper focuses on the statistical analysis of the linguistic style of Japanese academic texts using representative measures such as average word length and word length distribution.

Average word length

The average word length and word length dispersion of Japanese academic articles from 1981 to 2020 are shown in Table 1. The average word length ranges from 1.8329 to 1.9507. It is shortest in 1981-1985 and longest in 2016-2020, and it rises in every stage except 1996-2000, although the differences between stages are small. These relatively high average word lengths correspond to lower text readability. This is because both writing and reading academic articles involve knowledge thresholds and professional barriers, which make academic articles harder to read than ordinary texts.

Table 1. ALW and DLW of Japanese academic articles from 1981 to 2020

| Period | Total characters | Total words | Average word length (ALW) | Word length dispersion (DLW) |
|---|---|---|---|---|
| 1981-1985 | 1585188 | 864852 | 1.8329 | 0.358 |
| 1986-1990 | 1733027 | 931985 | 1.8595 | 0.362 |
| 1991-1995 | 1881979 | 997286 | 1.8871 | 0.346 |
| 1996-2000 | 2032242 | 1086470 | 1.8705 | 0.351 |
| 2001-2005 | 2245315 | 1187746 | 1.8904 | 0.338 |
| 2006-2010 | 2487462 | 1298868 | 1.9151 | 0.345 |
| 2011-2015 | 2723730 | 1406522 | 1.9365 | 0.346 |
| 2016-2020 | 2968347 | 1521683 | 1.9507 | 0.346 |

The greater the word length dispersion, the more varied the length of the words used by the author, which reflects the flexibility and diversity of the text in terms of word length. The flexibility of word use in Japanese academic texts can be compared using Table 1. The word length dispersion of Japanese academic texts from 1981 to 2020 ranges from 0.338 to 0.362, with the greatest variation in word length in 1986-1990 and the least in 2001-2005. Since academic writing requires standardization and rigor, the word length dispersion of Japanese academic articles changes little between 1981 and 2020 and remains consistently low.

In addition, Table 1 shows that the variation in word length is not directly related to the length of the text. There is no significant correlation between word length dispersion and the time of writing, and no other significant pattern appears in the word length dispersion of Japanese academic articles.

Word length distribution

Word length distribution refers to the occurrence of words of different syllables in a text. Authors have personal trade-offs and unique preferences for the use of words of different lengths when creating texts, so counting the use of words of different lengths in texts can serve as an attempt to reveal the evolution of the linguistic style of Japanese academic articles.

In this paper, the use of monosyllabic, disyllabic, trisyllabic, and four-syllable words, and of words with more than four syllables, is counted for the eight stages of Japanese academic articles between 1981 and 2020. Their frequencies are listed in Table 2.

Table 2. Distribution of word length

| Period | Monosyllable frequency | Two-syllable frequency | Three-syllable frequency | Four-syllable frequency | Above-four-syllable frequency |
|---|---|---|---|---|---|
| 1981-1985 | 70.62% | 27.36% | 1.26% | 0.64% | 0.12% |
| 1986-1990 | 68.52% | 26.92% | 2.73% | 1.29% | 0.54% |
| 1991-1995 | 67.05% | 26.79% | 2.76% | 2.06% | 1.34% |
| 1996-2000 | 66.93% | 26.84% | 3.37% | 2.21% | 0.65% |
| 2001-2005 | 66.65% | 26.68% | 3.73% | 2.38% | 0.56% |
| 2006-2010 | 64.44% | 25.52% | 4.24% | 4.27% | 1.53% |
| 2011-2015 | 63.79% | 24.92% | 5.36% | 4.84% | 1.09% |
| 2016-2020 | 62.78% | 23.55% | 5.96% | 6.03% | 1.68% |

In the Japanese academic articles published between 1981 and 2020, the frequency of monosyllabic words ranges from 62.78% to 70.62%, roughly two-thirds of all words, while disyllabic words range from 23.55% to 27.36%, and words of three or more syllables account for only 2.02% to 13.67% of the total. Overall, during the 40 years from 1981 to 2020, the word length distribution of Japanese academic articles is still dominated by monosyllabic and disyllabic words, but their proportion decreases stage by stage, while words of three or more syllables appear more and more often and their proportion increases gradually. As academic writing becomes more standardized, the proportion of monosyllabic and disyllabic words in Japanese academic articles decreases, and the flexibility of academic writing gradually gives way to rigorous and standardized writing norms.
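A word length distribution of this kind can be sketched by bucketing words by length. Measuring length in characters with `len` is an assumption made for illustration, the function name is hypothetical, and the input is assumed to contain no empty strings. The last lines recompute the combined share of words with three or more syllables from the first and last rows of Table 2.

```python
from collections import Counter

LABELS = {1: "1", 2: "2", 3: "3", 4: "4", 5: ">4"}

def word_length_distribution(words):
    """Share of words of length 1, 2, 3, 4 and more than 4."""
    buckets = Counter(min(len(w), 5) for w in words)  # lengths above 4 share a bucket
    total = len(words)
    return {label: buckets.get(k, 0) / total for k, label in LABELS.items()}

words = ["の", "研究", "の", "方法論", "言語学的", "アプローチ"]
dist = word_length_distribution(words)
print(dist)

# Combined share (in %) of three-, four-, and above-four-syllable words.
share_1981_1985 = 1.26 + 0.64 + 0.12
share_2016_2020 = 5.96 + 6.03 + 1.68
print(round(share_1981_1985, 2), round(share_2016_2020, 2))
```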

Analysis of changes in sentence length

Sentence length can be used as a tool to study the linguistic style of a text, as it shows the author’s habit of sentence construction in Japanese academic texts. After eliminating the influence of the text’s own genre on the use of sentences, we can examine the author’s stylistic preference for the use of sentences from the perspective of sentence length, and the cross-sectional analysis of sentence length in different works can also be used to compare and analyze the changes in the style of writing in Japanese academic articles.

Average sentence length

In this study, sentence length is calculated in terms of the words in a sentence, without counting the punctuation inside the sentence, and the average sentence length is obtained by dividing the number of words in the text, excluding punctuation, by the number of sentences into which the text is divided. Statistics from related studies show that the text of academic articles contains an average of 40.25 words and 69.46 characters per sentence. The number of characters, the number of sentences, and the average sentence length of the Japanese academic articles in this paper’s corpus from 1981 to 2020 are shown in Table 3.

Table 3. Number of characters, number of sentences, and average sentence length of Japanese academic articles

| Period | Number of characters | Number of sentences | Average sentence length |
|---|---|---|---|
| 1981-1985 | 1585188 | 36374 | 43.58 |
| 1986-1990 | 1733027 | 39677 | 43.68 |
| 1991-1995 | 1881979 | 42860 | 43.91 |
| 1996-2000 | 2032242 | 44362 | 45.81 |
| 2001-2005 | 2245315 | 48349 | 46.44 |
| 2006-2010 | 2487462 | 53242 | 46.72 |
| 2011-2015 | 2723730 | 56522 | 48.19 |
| 2016-2020 | 2968347 | 60249 | 49.27 |

Table 3 shows that from 1981 to 2020 the average sentence length of Japanese academic articles is always greater than 40, which is in line with the basic characteristics of academic articles. The average sentence length increases stage by stage, from 43.58 in 1981-1985 to 49.27 in 2016-2020. This indicates that the complexity and formality of Japanese academic writing have been increasing over time.
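The averages in Table 3 are consistent with dividing the number of characters by the number of sentences in each period. The short check below recomputes them from the tabulated counts. The dictionary name is hypothetical.

```python
# (number of characters, number of sentences) per period, copied from Table 3.
TABLE_3 = {
    "1981-1985": (1585188, 36374),
    "1986-1990": (1733027, 39677),
    "1991-1995": (1881979, 42860),
    "1996-2000": (2032242, 44362),
    "2001-2005": (2245315, 48349),
    "2006-2010": (2487462, 53242),
    "2011-2015": (2723730, 56522),
    "2016-2020": (2968347, 60249),
}

# Average sentence length = characters / sentences, rounded as in the table.
averages = {period: round(chars / sentences, 2)
            for period, (chars, sentences) in TABLE_3.items()}

for period, value in averages.items():
    print(period, value)
```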

Length of broken sentences

Sentence pauses are short pauses in discourse and are often considered a measure of sentence dispersion. Depending on where the pause occurs, sentence pauses can be categorized as inter-sentence pauses and mid-sentence pauses. An inter-sentence pause, also known as an end-of-sentence pause, is the pause between sentences. It is generally longer, and the content of the following sentence is only loosely related to that of the preceding one. A mid-sentence pause is a pause within a sentence. It is generally shorter, and the text before and after the pause is more closely related. Of two sentences with the same number of words, the one with more punctuation marks has a relatively fragmented structure, while the one with fewer punctuation marks has a relatively centralized structure. To study the change of sentence dispersion in Japanese academic articles, this paper counted the pauses in each of the eight stages from 1981 to 2020. The comma, pause mark, colon, semicolon, full stop, exclamation mark, and question mark were taken as markers of sentence breaks, and each occurrence of one of these symbols was counted as one sentence break. The sentence breaks counted in the corpus on this basis are shown in Table 4.

Table 4. Statistical results of segmented sentence length of Japanese academic articles

| Period | Number of characters | Number of segmented sentences | Segmented sentence length |
|---|---|---|---|
| 1981-1985 | 1585188 | 206674 | 7.67 |
| 1986-1990 | 1733027 | 225949 | 7.67 |
| 1991-1995 | 1881979 | 304527 | 6.18 |
| 1996-2000 | 2032242 | 347392 | 5.85 |
| 2001-2005 | 2245315 | 418122 | 5.37 |
| 2006-2010 | 2487462 | 515002 | 4.83 |
| 2011-2015 | 2723730 | 599941 | 4.54 |
| 2016-2020 | 2968347 | 711834 | 4.17 |

Table 4 shows that over the eight periods from 1981 to 2020 the length of broken sentences declines overall, from 7.67 in 1981-1985 to 4.17 in 2016-2020. This paper takes this to indicate that, in the Japanese academic articles published in these 40 years, the percentage of broken sentences decreases and the frequency of whole sentences increases, so that the structure of the articles becomes more compact and concentrated and more in line with academic requirements.
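The counting rule for sentence breaks can be sketched as follows: every occurrence of a break marker counts as one break, and the segmented sentence length is the character count divided by the number of breaks. The marker set (full-width forms), the decision to include punctuation in the character count, and the function name are assumptions.

```python
# Assumed full-width forms of the comma, pause mark, colon, semicolon,
# full stop, exclamation mark, and question mark.
BREAK_MARKS = set("，、：；。！？")

def segmented_sentence_length(text):
    """Characters per sentence break, counting every marker occurrence."""
    breaks = sum(1 for ch in text if ch in BREAK_MARKS)
    if breaks == 0:
        raise ValueError("text contains no sentence breaks")
    return len(text) / breaks

text = "本稿は、文体を分析し、結果を示す。"   # 17 characters, 3 breaks
print(segmented_sentence_length(text))

# The same ratio for the 1981-1985 row of Table 4.
row_1981_1985 = 1585188 / 206674
print(round(row_1981_1985, 2))
```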

Sentence length distribution

Sentence length distribution refers to the distribution of the lengths of all the sentences in a text, and studying it can reveal the writers’ preferences for sentence length. In this study, sentences are grouped into intervals of 15 words. All the sentences in each corpus are assigned to the intervals 1-15 words, 16-30 words, 31-45 words, 46-60 words, 61-75 words, and more than 75 words, and the proportion of sentences in each interval relative to all the sentences in the corpus is calculated. The details are shown in Table 5.

Table 5. Distribution of sentence length of Japanese academic articles

| Period | Item | 1~15 | 16~30 | 31~45 | 46~60 | 61~75 | >75 | Total |
|---|---|---|---|---|---|---|---|---|
| 1981-1985 | Sentence number | 11876 | 8250 | 6144 | 4601 | 3368 | 2135 | 36374 |
| | Proportion | 32.65% | 22.68% | 16.89% | 12.65% | 9.26% | 5.87% | 100% |
| 1986-1990 | Sentence number | 12530 | 9344 | 6642 | 5170 | 3551 | 2440 | 39677 |
| | Proportion | 31.58% | 23.55% | 16.74% | 13.03% | 8.95% | 6.15% | 100% |
| 1991-1995 | Sentence number | 13779 | 9841 | 7338 | 5675 | 3630 | 2597 | 42860 |
| | Proportion | 32.15% | 22.96% | 17.12% | 13.24% | 8.47% | 6.06% | 100% |
| 1996-2000 | Sentence number | 13357 | 11024 | 7080 | 6237 | 4330 | 2334 | 44362 |
| | Proportion | 30.11% | 24.85% | 15.96% | 14.06% | 9.76% | 5.26% | 100% |
| 2001-2005 | Sentence number | 13577 | 10327 | 8490 | 8142 | 5647 | 2166 | 48349 |
| | Proportion | 28.08% | 21.36% | 17.56% | 16.84% | 11.68% | 4.48% | 100% |
| 2006-2010 | Sentence number | 14343 | 10728 | 9594 | 8796 | 5873 | 3908 | 53242 |
| | Proportion | 26.94% | 20.15% | 18.02% | 16.52% | 11.03% | 7.34% | 100% |
| 2011-2015 | Sentence number | 16058 | 11881 | 9756 | 8608 | 5850 | 4369 | 56522 |
| | Proportion | 28.41% | 21.02% | 17.26% | 15.23% | 10.35% | 7.73% | 100% |
| 2016-2020 | Sentence number | 17761 | 13345 | 10628 | 9664 | 6236 | 2615 | 60249 |
| | Proportion | 29.48% | 22.15% | 17.64% | 16.04% | 10.35% | 4.34% | 100% |

Table 5 shows that sentences of length 1~15 and 16~30 make up the largest shares of the sentence length distribution in Japanese academic articles, and together they account for around 50% of all sentences throughout 1981-2020. Overall, the share of these shorter sentences is decreasing, although it remains close to 50%. The proportion of sentences longer than 45 shows an overall increasing trend, from 27.78% in the first stage (1981-1985) to 30.73% in the eighth stage (2016-2020). This indicates that the language of Japanese academic articles tends to become more regularized from 1981 to 2020.
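The binning used for Table 5 can be sketched with the standard `bisect` module. The function and constant names are hypothetical, and sentence lengths are assumed to be positive integers.

```python
from bisect import bisect_left
from collections import Counter

BIN_EDGES = [15, 30, 45, 60, 75]   # inclusive upper bounds of the first five bins
BIN_LABELS = ["1~15", "16~30", "31~45", "46~60", "61~75", ">75"]

def sentence_length_bin(length):
    """Return the label of the 15-unit interval that contains `length`."""
    return BIN_LABELS[bisect_left(BIN_EDGES, length)]

def sentence_length_distribution(lengths):
    """Share of sentences falling into each interval."""
    counts = Counter(sentence_length_bin(n) for n in lengths)
    return {label: counts.get(label, 0) / len(lengths) for label in BIN_LABELS}

lengths = [1, 15, 16, 45, 46, 75, 76, 200]
dist = sentence_length_distribution(lengths)
print(dist)
```

Using `bisect_left` places a boundary value such as 15 or 75 in the lower interval, which matches the interval labels of the table.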

Lexical richness analysis
Vocabulary density

Lexical density is a concept developed by functionalist linguists to measure the amount of information carried by the words in a text. It is the ratio of lexical words (i.e., real words) to all words. It is little affected by the length of the discourse and produces stable results. The real words counted in this study include nouns, verbs, adjectives, pronouns, number words, and quantifiers. The lexical densities for the eight time periods from 1981 to 2020 are shown in Table 6.

Table 6. Lexical density of Japanese academic articles, 1981-2020

| Period | Real words | Total words | Lexical density |
|---|---|---|---|
| 1981-1985 | 751470 | 864852 | 0.8689 |
| 1986-1990 | 811107 | 931985 | 0.8703 |
| 1991-1995 | 791446 | 997286 | 0.7936 |
| 1996-2000 | 941209 | 1086470 | 0.8663 |
| 2001-2005 | 1034646 | 1187746 | 0.8711 |
| 2006-2010 | 1081308 | 1298868 | 0.8325 |
| 2011-2015 | 1159677 | 1406522 | 0.8245 |
| 2016-2020 | 1292213 | 1521683 | 0.8492 |

A high lexical density indicates that the proportion of real words in the text is large and that the text carries a large amount of information. The lexical densities of the eight time periods are comparable, ranging from 0.7936 to 0.8711, and the difference between adjacent stages is between 0.0014 and 0.0767. This small difference indicates that the use of real words in Japanese academic articles is similar across the eight periods between 1981 and 2020, with no obvious distinguishing features.

Type-token ratio

The type-token ratio, defined here as the total number of words (tokens) divided by the number of word types, reflects the range of vocabulary a writer uses. The number of word types is the number of distinct words in the corpus, i.e., the number of words remaining after deduplication. For a given total number of words, the greater the number of word types, the richer the vocabulary usage. That is, under this definition, the lower the type-token ratio, the richer the vocabulary. The type-token ratios of Japanese academic articles in the eight stages from 1981 to 2020 are shown in Table 7.

Table 7. Type-token ratio of Japanese academic articles, 1981-2020

Period       Total words   Word types   Type-token ratio (total words / word types)
1981-1985    864852        24109        35.8726
1986-1990    931985        64061        14.5484
1991-1995    997286        63524        15.6994
1996-2000    1086470       70239        15.4682
2001-2005    1187746       171101       6.9418
2006-2010    1298868       171165       7.5884
2011-2015    1406522       87635        16.0498
2016-2020    1521683       100517       15.1386

Table 7 shows that the type-token ratios of 2001-2005 and 2006-2010 are the smallest, 6.9418 and 7.5884, respectively, which indicates that the lexical richness of Japanese academic articles is highest in 2001-2010, when the vocabulary is most varied. The type-token ratio of 1981-1985 is the largest, 35.8726, which indicates that articles published in 1981-1985 have the lowest lexical richness and, by comparison, a more stable and less varied vocabulary.
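The ratio reported in Table 7 divides the total number of words by the number of word types, so a smaller value means a richer vocabulary. The sketch below illustrates that calculation on a toy token list; the function name `token_type_ratio` and the sample tokens are assumptions for the example, and the text is assumed to be segmented into words already.

```python
def token_type_ratio(tokens):
    """Return total words / distinct words, the ratio as reported in Table 7."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(tokens) / len(set(tokens))

# Hypothetical segmented text: 7 tokens, 6 distinct types ("研究" occurs twice).
tokens = ["研究", "は", "言語", "の", "研究", "で", "ある"]
ratio = token_type_ratio(tokens)

# Cross-check against Table 7 (1981-1985): total words / word types.
table_check = 864852 / 24109
print(ratio, round(table_check, 4))
```

Note that this is the reciprocal of the more common types-divided-by-tokens form, which is why the interpretation is reversed here: lower values indicate greater variety.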

Single occurrence words

Single-occurrence words, i.e., words that appear only once in a text, are another indicator of vocabulary richness. The more single-occurrence words a text contains, the richer its vocabulary. The statistics of single-occurrence words in the eight stages from 1981 to 2020 are shown in Table 8.

Table 8. Single-occurrence words in Japanese academic articles, 1981-2020

Period       Single-occurrence words   Total words   Cumulative frequency
1981-1985    10724                     864852        0.0124
1986-1990    19292                     931985        0.0207
1991-1995    25431                     997286        0.0255
1996-2000    31073                     1086470       0.0286
2001-2005    106185                    1187746       0.0894
2006-2010    93909                     1298868       0.0723
2011-2015    69763                     1406522       0.0496
2016-2020    76388                     1521683       0.0502

In Table 8, “number of single-occurrence words” refers to the number of words that appear exactly once in the corpus, and “cumulative frequency” refers to the share of the total word count made up by these single-occurrence words. The table shows that the three stages 1986-1990, 1991-1995, and 1996-2000 have comparable cumulative frequencies (0.0207, 0.0255, and 0.0286, respectively), suggesting that the vocabulary richness of Japanese academic articles published in these three stages is comparable. The cumulative frequencies in 2001-2005 and 2006-2010 are high, at 0.0894 and 0.0723, respectively, indicating high vocabulary richness in these two stages. The value for 2001-2005 is the largest of all eight stages, which indicates that vocabulary use is richest in that stage.
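The two columns of Table 8 can be derived from word counts as sketched below. This is a minimal example assuming a pre-segmented token list; the function name `single_occurrence_stats` and the sample tokens are hypothetical.

```python
from collections import Counter

def single_occurrence_stats(tokens):
    """Return (number of words occurring exactly once, their share of all tokens)."""
    if not tokens:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    singles = sum(1 for count in counts.values() if count == 1)
    return singles, singles / len(tokens)

# Hypothetical segmented text: "研究" occurs twice, the other five words once each.
tokens = ["研究", "は", "言語", "の", "研究", "で", "ある"]
singles, frequency = single_occurrence_stats(tokens)

# Cross-check against Table 8 (1981-1985): single-occurrence words / total words.
table_check = round(10724 / 864852, 4)
print(singles, frequency, table_check)
```

Since each single-occurrence word contributes exactly one token, its count divided by the total word count is the cumulative frequency reported in the table.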

The lexical densities, type-token ratios, and single-occurrence word frequencies of the eight stages are summarized in Table 9. Across the eight stages from 1981 to 2020, lexical density does not differ significantly, whereas the degree of change in word length, the type-token ratio, and the cumulative frequency of single-occurrence words all point to higher lexical richness in 2001-2010, which is a significant feature.

Table 9. Lexical richness of Japanese academic articles, 1981-2020

Period       Lexical density   Type-token ratio   Single-occurrence word frequency
1981-1985    0.8689            35.8726            0.0124
1986-1990    0.8703            14.5484            0.0207
1991-1995    0.7936            15.6994            0.0255
1996-2000    0.8663            15.4682            0.0286
2001-2005    0.8711            6.9418             0.0894
2006-2010    0.8325            7.5884             0.0723
2011-2015    0.8245            16.0498            0.0496
2016-2020    0.8492            15.1386            0.0502
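
As a quick consistency check on the summary, the short sketch below reads the Table 9 values and picks out the stage that each richness indicator ranks first. The variable names are illustrative only; the numbers are copied from the table.

```python
# Rows copied from Table 9:
# (period, lexical density, type-token ratio, single-occurrence word frequency)
TABLE_9 = [
    ("1981-1985", 0.8689, 35.8726, 0.0124),
    ("1986-1990", 0.8703, 14.5484, 0.0207),
    ("1991-1995", 0.7936, 15.6994, 0.0255),
    ("1996-2000", 0.8663, 15.4682, 0.0286),
    ("2001-2005", 0.8711, 6.9418, 0.0894),
    ("2006-2010", 0.8325, 7.5884, 0.0723),
    ("2011-2015", 0.8245, 16.0498, 0.0496),
    ("2016-2020", 0.8492, 15.1386, 0.0502),
]

# With the ratio taken as total words / word types, a lower value means richer vocabulary.
richest_by_ratio = min(TABLE_9, key=lambda row: row[2])[0]
# A higher share of single-occurrence words also means richer vocabulary.
richest_by_single = max(TABLE_9, key=lambda row: row[3])[0]
print(richest_by_ratio, richest_by_single)
```

Both indicators select the same stage, which is consistent with the reading of the table given above.
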
Conclusion

The study establishes a dataset of Japanese academic articles, constructs a linguistic feature model for text mining of these articles after noise reduction and normalization of the text data, and uses quantitative metrics to measure changes in word length, sentence length, and vocabulary richness between 1981 and 2020 in order to study how the style of the articles has changed.

The average word length of Japanese academic articles from 1981 to 2020 ranges from 1.8329 to 1.9507, and the word length dispersion ranges from 0.338 to 0.362. Monosyllabic and disyllabic words account for 62.78% to 70.62% and 23.55% to 27.36% of all words, respectively, while words of three syllables and above account for only 2.02% to 13.67%. During 1981-2020, the proportion of monosyllabic and disyllabic words decreases while the proportion of words of three syllables and above gradually increases, indicating that Japanese academic writing is becoming increasingly standardized.

From 1981 to 2020, the average sentence length of Japanese academic texts is greater than 40 in every stage and increases stage by stage, from 43.58 to 49.27 over this period, reflecting greater complexity and formality. Over the same period, the broken-sentence length of Japanese academic texts becomes steadily shorter, decreasing from 7.67 to 4.17. Sentences of length 1-15 and 16-30 have the highest share, together accounting for around 50%. The proportion of sentences with length >45 is generally on the rise, and the language is becoming increasingly regular.

The lexical densities of the eight time periods range from 0.7936 to 0.8711, with very small differences between the stages and no significant difference in the use of real words. The type-token ratio (total words divided by word types) is largest in 1981-1985 (35.8726) and smallest in 2001-2005 (6.9418). The cumulative frequency of single-occurrence words is highest in 2001-2005 (0.0894), making it the stage with the richest vocabulary.

Funding:

This paper is the phased achievement of the Philosophy and Social Science Research Youth Project: “Based on the corpus of Japanese form nouns” (No.: 21Q274).

Language:
English