A Probabilistic Modeling Study of the Dynamics of Discourse Expression and the Construction of Discourse Power in an English News Corpus
Published online: 19 March 2025
Submitted: 05 Oct 2024
Accepted: 02 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0368
Keywords
© 2025 Wanni Mo, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
A corpus is a collection of language material of a certain size that is specially compiled for one or more applications, has a defined structure, is representative, and can be searched by computer programs. With the help of computer analysis tools, researchers can use corpora to carry out research on language theory, application, and teaching [1-4]. In the context of globalization and the rapid development of information technology, the external dissemination of Chinese English-language news has been steadily improving, and many colleges and universities use English websites for external communication [5-6]. To further strengthen foreign-oriented publicity, to improve the teaching of publicity translation and news English translation in translation courses, to ensure that the discourse features of news and publicity translation conform to the linguistic habits of native English-speaking audiences [7-10], and to improve the quality of English news reporting and of publicity translation teaching in Chinese colleges and universities, it is necessary to establish a corpus of domestic and foreign English news [11-12].
Discourse usually refers to the verbal means people use to communicate and express their views in a specific social context; its carrier is language with complete meaning. In human life, discourse promotes communication between individuals, but it also reflects inter-subjective power relations to a certain extent [13-16], and the interplay of discourse and power has a long-term influence on the development of society and on inter-subjective relations. As an important part of "discourse power", international discourse influence is not only shaped by the comprehensive power of sovereign states but also acts back on national power; its construction is a dialectical process [17-20].
The study uses web crawler technology to obtain corpus data, combines speech act derivation theory with a Bayesian network model to establish a Bayesian belief network model, and on this basis constructs a probabilistic model of news discourse power. Taking English financial news, the World Customs Organization News Corpus, and the BROWN Corpus as research materials, it analyzes lexical dynamics in the English news corpora and their influence on discourse power from different aspects.
The corpus sources are the relevant reports of the three major mainstream English-language media in China from 2001 to the present (China Daily English http://www.chinadaily.com.cn/, Xinhua English http://www.xinhuanet.com/english/, People's Daily English http://en.people.cn/). In selecting keywords, this study synthesizes relevant information from domestic and foreign media and official enterprise websites to improve accuracy while ensuring data integrity as far as possible, in order to reduce the workload of manual data checking.
Based on the above keywords, a large amount of English corpus data is obtained from web pages using web crawler technology. The crawler architecture is divided into three parts: the crawler scheduler, the main program, and the target program; the main program comprises three modules, namely the URL manager, the web page downloader, and the web page parser, as shown in Fig. 1.

Crawler structure
According to the crawler's structure, the required web page content is obtained as follows: crawler scheduling is carried out first; the URL manager then extracts news links containing the selected keywords using regular expressions; the corresponding web pages are downloaded; and the texts are parsed using BeautifulSoup and lxml.
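The URL-manager and keyword-based link-extraction steps can be sketched with the standard library alone (the study itself parses downloaded pages with BeautifulSoup and lxml; the page fragment, URLs, and keyword below are hypothetical, for illustration only):

```python
import re
from collections import deque

class URLManager:
    """Tracks URLs waiting to be crawled and URLs already visited."""
    def __init__(self):
        self.pending = deque()
        self.visited = set()

    def add(self, url):
        # skip duplicates, whether already crawled or already queued
        if url not in self.visited and url not in self.pending:
            self.pending.append(url)

    def pop(self):
        url = self.pending.popleft()
        self.visited.add(url)
        return url

def extract_news_links(html, keywords):
    """Return hrefs whose anchor text contains any selected keyword."""
    links = re.findall(r'<a\s+href="([^"]+)"[^>]*>([^<]+)</a>', html)
    return [href for href, text in links
            if any(kw.lower() in text.lower() for kw in keywords)]

# Hypothetical page fragment and keyword
page = ('<a href="/en/2021/trade01.html">China trade surplus widens</a>'
        '<a href="/en/2021/sports.html">Football results</a>')
print(extract_news_links(page, ["trade"]))  # → ['/en/2021/trade01.html']
```

The deduplication in `URLManager` prevents the scheduler from re-downloading pages reached via multiple links.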
The English news corpus consists of two major sub-corpora: enterprise and non-enterprise. The non-enterprise sub-corpus is not further subdivided and is mainly used for research on the external publicity of Hefei's image, while the enterprise sub-corpus is divided into three categories: production and sales (P&S), technology (T), and innovation (I). During data mining the data is automatically stored in Excel and TXT forms to retain the original data. TXT documents are uniformly encoded in UTF-8, which makes it convenient to retrieve and verify individual original texts during corpus processing and application.
The raw corpus above must be processed by noise reduction, word segmentation, and annotation before it can serve as a processed corpus for research. Since the raw corpus is a collection of TXT files stored automatically by the Python crawler, it is relatively clean, so noise reduction only needs to remove redundant spaces and carriage returns.
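The noise-reduction step just described (stripping carriage returns and collapsing redundant whitespace) can be sketched as:

```python
import re

def denoise(raw: str) -> str:
    """Remove carriage returns, collapse repeated spaces/tabs,
    and drop empty lines."""
    text = raw.replace("\r", "")
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.split("\n")]
    return "\n".join(ln for ln in lines if ln)

print(denoise("Trade  grew\r\n\r\n  in   2021.\r\n"))  # → "Trade grew\nin 2021."
```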
Segmentation is the process of splitting connected characters into separate morphemes for vocabulary-level statistics and analysis. Word segmentation software for English monolingual corpora is very mature; this study uses SegmentAnt for automatic segmentation. Manual intervention is carried out where the program cannot identify items accurately, mainly some abbreviations and proper nouns.
Annotation of the corpus refers to using tags to mark attributes of the texts to meet the needs of the study, giving the corpus machine-readable features. To realize machine-readability and improve the corpus's utilization value, the key lies in effective annotation [21]. The English news corpus is labeled at two levels: meta-information labeling and part-of-speech assignment. Using meta-information for corpus retrieval is an advanced use of a corpus; here, meta-information refers to information about the corpus itself. Descriptions of the basic attributes of each text, such as "news source", "release time", and "news type", are added, which improves the efficiency of later corpus retrieval and statistical analysis. Part-of-speech assignment means that the lexical category of every word in the corpus is labeled according to the context and features of the text. Since part-of-speech codes represent the grammatical features of words, this assignment is beneficial when analyzing the linguistic features of domestic news reports with this corpus. In this study, TreeTagger is mainly used for automatic tagging, and the tagged corpus is then proofread manually.
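Meta-information labeling can be sketched as prepending a machine-readable header to each raw text. The attribute names mirror the fields described above ("news source", "release time", "news type"); the corpus's actual annotation scheme may differ, and the sample text is hypothetical:

```python
def add_meta_header(text, source, release_time, news_type):
    """Prepend a machine-readable meta-information tag to a raw text."""
    header = (f'<meta source="{source}" '
              f'time="{release_time}" type="{news_type}">')
    return header + "\n" + text

doc = add_meta_header("China Daily reports ...", "China Daily",
                      "2021-03-05", "enterprise/T")
print(doc.splitlines()[0])
```

A retrieval tool can then filter texts by matching on the header attributes without parsing the body.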
Speech act theory under uncertainty introduces a model of rational speech acts in the context of indirect speech acts, combining game theory with Bayes' theorem from probability theory. The model takes the probability that the speaker produces a given utterance, based on his or her knowledge of the state of the world, as the prior probability, and then derives the posterior probability describing how the listener's knowledge of the state of the world changes on hearing the utterance. The whole process is formalized as follows:

$$P(s \mid u) = \frac{P(u \mid s)\,P(s)}{\sum_{s'} P(u \mid s')\,P(s')} \tag{1}$$

where $s$ denotes a state of the world and $u$ the speaker's utterance.
Equation (1) can be simplified as:

$$P(s \mid u) \propto P(u \mid s)\,P(s) \tag{2}$$

In Equation (2), the normalizing denominator of Equation (1) is dropped, since it does not depend on the state of the world. The listener's a posteriori probability thus combines both the prior over states of the world and the likelihood of the utterance the speaker has chosen.
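The listener's Bayesian update (posterior ∝ likelihood × prior, renormalized over candidate world states) can be sketched with hypothetical states and a single indirect utterance:

```python
def listener_posterior(prior, likelihood, utterance):
    """Bayes' rule: P(state | utterance) ∝ P(utterance | state) * P(state)."""
    unnorm = {s: prior[s] * likelihood[s].get(utterance, 0.0) for s in prior}
    z = sum(unnorm.values())           # normalizing constant
    return {s: p / z for s, p in unnorm.items()}

# Hypothetical states of the world and speaker production probabilities
prior = {"raining": 0.3, "sunny": 0.7}
likelihood = {"raining": {"take an umbrella": 0.9},
              "sunny": {"take an umbrella": 0.1}}
posterior = listener_posterior(prior, likelihood, "take an umbrella")
print(posterior["raining"])  # ≈ 0.794: belief in rain rises after the hint
```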
Bayesian belief networks combine graph theory and probability theory, relaxing the naive Bayes classifier's requirement that attributes be independent of each other.
The joint probability distribution of the network variables $x_1, \dots, x_n$ factorizes over the directed graph as:

$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{Parents}(x_i)\bigr)$$

Given a vector of observed attribute values, the probability of any query variable can then be computed by conditioning this joint distribution on the evidence. Assuming that the state of the objective world is perceived through a series of cognitive models, the human cognitive process in cognitive model theory is shown in Figure 2.
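A Bayesian belief network factorizes the joint distribution into conditional distributions given each node's parents. A toy two-node network sketches this (the variables here are illustrative stand-ins, not the study's cognitive-model nodes):

```python
def joint_probability(assignment, parents, cpt):
    """P(x1..xn) = prod_i P(xi | Parents(xi)) for a full assignment."""
    p = 1.0
    for var, val in assignment.items():
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[var][pa_vals][val]   # conditional probability table lookup
    return p

# Toy network A -> B with hypothetical probabilities
parents = {"A": (), "B": ("A",)}
cpt = {"A": {(): {True: 0.6, False: 0.4}},
       "B": {(True,): {True: 0.9, False: 0.1},
             (False,): {True: 0.2, False: 0.8}}}
print(joint_probability({"A": True, "B": True}, parents, cpt))  # ≈ 0.54
```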

Human cognitive process in cognitive model theory
Based on this cognitive process, an attribute relationship diagram for attribute w can be given, which is shown in Figure 3.

Relationship of attribute w
In Figure 3, each idealized cognitive model (ICM) depends on multiple cognitive models (CMs), so in the belief network the probability distribution of an idealized cognitive model is conditioned on its parent cognitive models:

$$P(\mathrm{ICM}) = \sum_{\mathrm{CM}_1, \dots, \mathrm{CM}_k} P(\mathrm{ICM} \mid \mathrm{CM}_1, \dots, \mathrm{CM}_k) \prod_{j=1}^{k} P(\mathrm{CM}_j)$$

MP denotes the intermediate process from organizing the concepts to understanding the meaning of the expression, with a probability distribution conditioned on the idealized cognitive models:

$$P(\mathrm{MP}) = \sum_{\mathrm{ICM}_1, \dots, \mathrm{ICM}_m} P(\mathrm{MP} \mid \mathrm{ICM}_1, \dots, \mathrm{ICM}_m) \prod_{i=1}^{m} P(\mathrm{ICM}_i)$$

Therefore, the probability distribution of a person's perception of the state of the world is obtained by propagating these conditional distributions through the network, where the joint probability density of each node with its parents follows the factorization of the belief network. Similarly, the joint probability density between the parent CM nodes of each ICM can be obtained. In this way, the probability distribution of the listener's perception of the state of the world is derived.
Discourse is a form of power; in simple terms, it is the ability to influence public opinion. Discourse power is essentially the right to contest the negotiation of meaning. Whoever holds the power of discourse on an issue can control how the issue is produced, circulated, and consumed, and how it is spoken about, and will ultimately control public opinion and thereby achieve his or her own interests. In the system of discourse, discursive practice is the central concept [23]. At the macro level, a discursive practice includes the generation, circulation, and consumption of discourse, and can be expressed in the following six elements: (1) the discourse emitter, which can be an official institution of a sovereign state or an unofficial organization or group; (2) the discourse content, the viewpoints and positions reflecting a sovereign state's concerns related to its own interests or the international responsibilities and obligations it has undertaken; (3) the discourse mode, the expression of the discourse content, i.e., the rhetorical manner in which the content is presented and the way the information is packaged, which directly affects the audience's acceptance of the content and in turn its dissemination; (4) the discourse audience, the question of whom to address and how to choose the audience so as to secure or expand the effect of the discourse, which is closely related to the international environment of the topic and the political and ecological environment of the country where the audience is located; (5) the discourse platform, the channels of discourse dissemination, mainly including various forms of media as well as platforms for meaning negotiation and communication in country-to-country interactions; and (6) the discourse effect, the results obtained by the positions, claims, and opinions expressed in the discourse.
The simplest model for surface generation is, given an attribute-value pair, to select the template that most frequently corresponds to that attribute in the training corpus.
If a collection of attribute-value pairs is given, the best choice for generation is the word sequence that accurately describes the input attributes and has the maximum probability of occurrence. Generating a word requires both local information, which can be obtained from an N-gram model, and attribute information. Here a maximum entropy model combines the two to estimate each word's probability: the model is a conditional distribution over the next word given the preceding words and the attribute set $A$,

$$p(w_i \mid w_{i-2}, w_{i-1}, A) = \frac{1}{Z(w_{i-2}, w_{i-1}, A)} \exp\Bigl(\sum_{j} \lambda_j f_j(w_i, w_{i-2}, w_{i-1}, A)\Bigr)$$

where the $f_j$ are binary features over the local N-gram context and the attributes, the $\lambda_j$ are their weights, and $Z$ is the normalizing factor. The probability of occurrence of the word sequence $w_1 \dots w_n$ is then the product of the conditional word probabilities:

$$p(w_1 \dots w_n \mid A) = \prod_{i=1}^{n} p(w_i \mid w_{i-2}, w_{i-1}, A)$$

Selecting features with a low frequency of occurrence would make the N-gram model unreliable, so every feature must occur more than a threshold of K times in the training corpus, with K at least 3. So, corresponding to the attribute-value pair set $A$, the optimal phrase generated in the surface realization is:

$$W^{*} = \arg\max_{w_1 \dots w_n} p(w_1 \dots w_n \mid A)$$
Given a collection of attribute-value pairs, text generation applies the maximum entropy model to combine the local information from the trigram model with syntactic and attribute information, estimating phrase probabilities via the probability of occurrence of the syntactic dependency tree. The occurrence probability of a child-node word in the tree is a conditional distribution given its head word, its already-generated siblings, and the attribute set.

Given attribute set $A$, the probability of a syntactic dependency tree expressing the attributes is computed from the probabilities of generating its left and right child sequences, which are independent of each other at generation time. The probability of the left word sequence of a head $h$ is:

$$p_L(l_1 \dots l_m \mid h, A) = \prod_{i=1}^{m} p(l_i \mid h, l_{i-1}, A)$$

and the probability of the right word sequence is, symmetrically:

$$p_R(r_1 \dots r_n \mid h, A) = \prod_{i=1}^{n} p(r_i \mid h, r_{i-1}, A)$$

where each conditional probability is estimated by the maximum entropy model above. The probability of the whole syntactic dependency tree $T$ is then the product over all head nodes:

$$p(T \mid A) = \prod_{h \in T} p_L(\cdot \mid h, A)\, p_R(\cdot \mid h, A)$$

So the syntactic dependency tree of the optimal phrase corresponding to attribute-value pair set $A$ is:

$$T^{*} = \arg\max_{T} p(T \mid A)$$
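The core idea of this section, scoring candidate surface realizations with an N-gram model and choosing the maximum-probability one, can be sketched in miniature. This uses a plain bigram model over a toy corpus rather than the paper's maximum entropy trigram model; the sentences and candidates are hypothetical:

```python
from collections import Counter

def bigram_model(corpus_sentences):
    """Estimate P(w_i | w_{i-1}) by maximum likelihood on a toy corpus."""
    uni, bi = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.split()
        for a, b in zip(toks, toks[1:]):
            uni[a] += 1
            bi[(a, b)] += 1
    return lambda prev, w: bi[(prev, w)] / uni[prev] if uni[prev] else 0.0

def phrase_probability(phrase, p):
    """Probability of a phrase as a product of bigram probabilities."""
    prob, prev = 1.0, "<s>"
    for w in phrase.split():
        prob *= p(prev, w)
        prev = w
    return prob

def best_phrase(candidates, p):
    """argmax over candidate surface realizations."""
    return max(candidates, key=lambda c: phrase_probability(c, p))

corpus = ["exports rose sharply", "exports rose sharply", "exports rose slowly"]
p = bigram_model(corpus)
print(best_phrase(["exports rose slowly", "exports rose sharply"], p))
# → "exports rose sharply" (bigram "rose sharply" is twice as frequent)
```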
The first experiment constructs discourse power based on both contextual discourse and response discourse. Its main goal is to find, from combinations of lexical features, the feature set best suited to discourse power construction labeled by conversational implicature, and to perform the construction on that basis to obtain classification results.
After feature ranking and selection, this paper finds the optimal feature set F1. The features in F1 and their coefficients are listed in Table 1.
Features in the optimal feature set F1 and their coefficients
| Feature | The Explanation of Feature | Coefficient |
|---|---|---|
| WRDFRQmc | CELEX Log minimum frequency for content words, mean | 0.25745 |
| WRDHYPv | Hypernymy for verbs, mean | 0.172133 |
| WRDFRQc | CELEX word frequency for content words, mean | 0.1534 |
| DESWLltd | Word length, number of letters, standard deviation | -0.13745 |
| DESWLlt | Word length, number of letters, mean | 0.098632 |
| WRDHYPnv | Hypernymy for nouns and verbs, mean | -0.06452 |
| DESWLsyd | Word length, number of syllables, standard deviation | -0.05325 |
| WRDPOLc | Polysemy for content words, mean | 0.048352 |
| WRDHYPn | Hypernymy for nouns, mean | -0.04154 |
| DESWLsy | Word length, number of syllables, mean | -0.03544 |
| WRDFRQa | CELEX Log frequency for all words, mean | 0.027742 |
| LDTTRc | Lexical diversity, type-token ratio, content word lemmas | 0.009453 |
| LDMTLD | Lexical diversity, MTLD, all words | -0.0053 |
| LDTTRa | Lexical diversity, type-token ratio, all words | -0.00357 |
| WRDFAMc | Familiarity for content words, mean | -0.0027 |
| WRDMEAc | Meaningfulness, Colorado norms, content words, mean | 0.002464 |
| WRDIMGc | Imagability for content words, mean | -0.00225 |
From the table, it can be seen that when the text used for classification includes both contextual discourse and response discourse, the required features are not only numerous (17) but also varied, covering descriptive statistics, lexical diversity measures, and values representing lexical information. Features that were not selected are omitted from the table.
The results of the discourse power construction were next analyzed quantitatively. Experiments recorded the classification results of each cross-validation fold. First, the basic assessment metrics were reported, including the number of positive samples predicted as positive (TP) and the number of negative samples predicted as negative (TN), and, based on these, the sensitivity, specificity, and accuracy. Next, independent-samples t-tests were conducted on texts labeled "yes" and "no" to examine whether the accuracy rates of the different categories differ significantly, in order to determine whether the overall reliability of the classification is related to the type of conversational implicature. Finally, a goodness-of-fit test with equal expected frequencies was performed on the accuracy rates to determine whether logistic regression based on the feature set classifies significantly better than chance.
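The basic assessment metrics follow directly from the confusion-matrix counts; a sketch with hypothetical counts (not the counts of the experiment tables):

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # positives correctly found
    specificity = tn / (tn + fp)               # negatives correctly found
    accuracy = (tp + tn) / (tp + fn + tn + fp) # all correct / all samples
    return sensitivity, specificity, accuracy

print(classification_metrics(60, 40, 70, 30))  # → (0.6, 0.7, 0.65)
```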
Discourse power construction results for conversations containing contextual discourse
| Folds | TP | Sensitivity | TN | Specificity | Accuracy | T-test Sig. | χ2 | Goodness-of-fit Sig. |
|---|---|---|---|---|---|---|---|---|
| fold 1 | 53 | 0.684 | 48 | 0.661 | 0.674 | 0.504 | 13.442 | 0.002 |
| fold 2 | 48 | 0.608 | 44 | 0.591 | 0.591 | 0.631 | 4.522 | 0.055 |
| fold 3 | 46 | 0.584 | 50 | 0.652 | 0.611 | 0.333 | 6.003 | 0.016 |
| fold 4 | 43 | 0.554 | 42 | 0.537 | 0.554 | 0.885 | 0.974 | 0.347 |
| Total | 190 | 0.608 | 184 | 0.610 | 0.608 | 0.588 | 24.941 | 0.000 |
According to Table 2, the per-fold cross-validation accuracy is centered around 60%, with an overall accuracy of about 61%. This accuracy alone does not sufficiently establish the effectiveness of the feature set, so the goodness-of-fit test must also be considered: the overall χ2 of 24.941 is significant (p < 0.05), indicating that the classification performs better than chance.
Next, consider whether the above classification is balanced across the positive and negative categories. Judging from the sensitivity and specificity of the per-fold cross-validation, there is no clear pattern in accuracy between the positive and negative categories. Overall, sensitivity (60.8%) and specificity (61%) differ little, and in the independent-samples t-tests the significance of every test, both per fold and overall, is greater than 0.05. It can therefore be affirmed that, when discourse power construction is based on both contextual and response discourse, classification accuracy is independent of the category of the conversational implicature; that is, the discourse power construction of conversational implicature with contextual discourse included is balanced across the positive and negative categories.
The degree of influence of contextual discourse and response discourse on classification results can be determined by comparing the previous two experiments. In addition to gauging the relationship between the two by comparing accuracy and the goodness-of-fit statistics, since the classifications with and without contextual discourse correspond one-to-one, a matched-samples t-test can be performed to determine whether the results of the two experiments differ significantly. The comparison is organized in Table 3.
Comparison of classification with and without contextual discourse
| Folds | Accuracy (with C.) | Accuracy (without C.) | χ2 (with C.) | χ2 (without C.) | Effect comparison | Matched t-test Sig. |
|---|---|---|---|---|---|---|
| fold 1 | 0.653 | 0.596 | 13.454 | 4.325 | 1>2 | 0.108 |
| fold 2 | 0.612 | 0.671 | 5.255 | 17.543 | 1<2 | 0.067 |
| fold 3 | 0.607 | 0.604 | 7.000 | 5.224 | 1>2 | 0.692 |
| fold 4 | 0.564 | 0.624 | 0.972 | 0.970 | 1<2 | 0.174 |
| Total | 0.613 | 0.612 | 21.322 | 29.453 | 1<2 | 0.493 |
The table first compares the accuracy and goodness-of-fit statistics of the two experiments. In the four-fold cross-validation, "containing contextual discourse" sometimes exceeds and sometimes falls below "not containing contextual discourse"; overall, accuracy without contextual discourse is marginally lower while its χ2 statistic is higher. According to the matched-samples t-test, the per-fold cross-validation shows no significant differences (p-values are all greater than 0.05), and the overall significance is 0.493 > 0.05, indicating that overall there is no significant difference in the effect of lexical features on the construction of discourse power for conversational implicature whether or not contextual discourse is included.
One method measures lexical density by the type/token ratio (TTR), the ratio of the number of types to the number of tokens in the corpus. Since the number of types in a language changes little over a given period, the larger the corpus, the smaller the TTR, so the ratio is easily affected by corpus size. Scott therefore improved the TTR by computing it over fixed-size chunks of text (1,000 words by default in the WordSmith software) and averaging the chunk ratios, which largely removes the influence of corpus size on the TTR. Another method of measuring lexical density is to calculate the proportion of content (real) words among all words in the corpus rather than counting the total number of words.
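Scott's standardized type/token ratio, computed per fixed-size chunk and averaged, can be sketched as follows (illustrated with a tiny window rather than WordSmith's default of 1,000 words):

```python
def sttr(tokens, window=1000):
    """Standardised type/token ratio: mean TTR over consecutive
    fixed-size chunks, discarding any final partial chunk."""
    ratios = []
    for i in range(0, len(tokens) - window + 1, window):
        chunk = tokens[i:i + window]
        ratios.append(len(set(chunk)) / window)   # types / tokens per chunk
    return sum(ratios) / len(ratios)

tokens = "a b c a b d e f e g".split()   # toy token stream
print(sttr(tokens, window=5))            # → 0.7  (mean of 3/5 and 4/5)
```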
The word-length distributions of the two corpora and the average word lengths, as computed by WordSmith 5.0, are shown in Figure 4. In terms of average word length there is little difference between the two corpora: 7.51 for the former and 7.47 for the latter. However, the figure shows that the trends of ECFE and FLOB differ. FLOB's 2-letter and 3-letter words are markedly more frequent than ECFE's, while its 1-letter, 4-letter, and 6- to 9-letter words are less frequent. This reflects the fact that English financial news contains relatively fewer short words and relatively more long words; in other words, English financial news uses longer words than general English discourse.

Word length as a percentage of each corpus
The author carried out word-frequency statistics with WordSmith 5.0; the data for rare words occurring fewer than 7 times in the ECFE and FLOB corpora are shown in Figure 5. The proportions of words occurring 1 to 5 times are all lower in English financial news than in the general corpus, while words occurring 6 or more times are markedly more frequent, which indicates that financial news vocabulary use is more concentrated than in the general corpus.

Data for words occurring fewer than 7 times
Since both corpora have been tagged with the CLAWS7 part-of-speech tagger, the word classes of words can be retrieved for comparison. However, English grammars are not entirely consistent in how they categorize word classes, so this paper concentrates only on comparing content words (nouns, verbs, adjectives, adverbs), as these best reflect vocabulary use. The percentages of content words in each corpus, obtained through WordSmith 5.0 search statistics, are shown in Figure 6.

Content words as a percentage of each corpus
As can be seen from the figure, the percentage of content words in English financial news is significantly higher, accounting for 69.2% of the entire corpus, compared with 63.5% in general discourse. Among the four classes of content words, the most obvious difference lies in nouns: their proportion is 33.5% in English financial news but 27.6% in general discourse. The other three classes account for roughly the same proportions in the two corpora. The information density of a text is often measured as the ratio of content words to the total number of words, which represents the amount of information the text carries. From this point of view, the textual information density of English financial news is higher than that of the general corpus, and its texts are more informative.
The object of research in this section is the 1,119 texts of the World Customs Organization News Corpus (765,460 tokens, 74,997 types), taken from WCO NEWS, published by the World Customs Organization. Multidimensional analysis identifies clusters of co-occurring variables through factor analysis, which is widely used in the social sciences to reduce a large number of variables to a manageable set of factors or dimensions, using the Multidimensional Analysis Tagger 1.3 (MAT) developed by Nini (2015).
The MAT software processes the corpus as follows: first, the frequency of each linguistic variable per 100 tokens is counted; z-scores are then calculated from the mean and standard deviation of each variable's frequency; the scores of the six dimensions are computed from these z-scores; and finally each text is assigned to its closest text type.
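The per-100-token normalization and z-scoring steps can be sketched as follows. The feature names and factor loadings below are hypothetical; MAT itself scores the six dimensions using the loadings from Biber's (1988) factor analysis:

```python
def per_100(count, total_tokens):
    """Normalise a raw feature count to a frequency per 100 tokens."""
    return count / total_tokens * 100

def z_score(freq, ref_mean, ref_sd):
    """Standardise a per-100-token frequency against reference norms."""
    return (freq - ref_mean) / ref_sd

def dimension_score(feature_z, loadings):
    """Sum feature z-scores along a dimension, subtracting features
    with negative factor loadings (following Biber's scoring method)."""
    return sum(z if loadings[f] > 0 else -z for f, z in feature_z.items())

# Hypothetical z-scores and loadings for one dimension
feature_z = {"nouns": 1.2, "pronouns": -0.5}
loadings = {"nouns": 0.8, "pronouns": -0.6}
print(dimension_score(feature_z, loadings))  # → 1.7
```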
MAT software was utilized to analyze the BROWN Corpus and each sub-corpus multidimensionally, and the values for each dimension and the closest text type are displayed in Table 4.
Multidimensional analysis of the BROWN Corpus
| Brown sublibrary | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 | The closest text type |
|---|---|---|---|---|---|---|---|
| News report | -17.96 | 0.5 | 4.61 | -1.74 | 0.89 | -1.27 | Academic article |
| Editorial | -12.74 | -0.22 | 4.64 | 1.24 | 0.7 | -0.4 | Universal narrative |
| News review | -15.36 | -0.98 | 5.4 | -3.5 | 0.44 | -1.21 | Academic article |
| Religion | -8.38 | 0.3 | 5.18 | 0.29 | 2.15 | 0.37 | Universal narrative |
| Skills, business and hobbies | -13.37 | -2.18 | 4.58 | -0.89 | 1.54 | -1.2 | Universal narrative |
| Social life | -14.7 | 0.4 | 4 | -0.96 | 1.51 | -0.77 | Universal narrative |
| Biographies and essays | -12.48 | 1.24 | 5.1 | -0.94 | 1.49 | -0.25 | Universal narrative |
| Government document | -17.68 | -2.64 | 8.39 | 0.54 | 2.67 | -0.13 | Academic article |
| Academic paper | -13.13 | -1.95 | 5.82 | -1.09 | 4.48 | -0.15 | Scientific article |
| General novel | -7.35 | 6.29 | 0.32 | -0.4 | -0.36 | -1.34 | Universal narrative |
| Detective story | -1.75 | 6.28 | -1.29 | 0.29 | -0.99 | -1.08 | The fantasy narrative |
| Science fiction | -3.49 | 5.46 | 1.33 | 0.2 | 0.82 | -0.89 | Universal narrative |
| Adventure and western fiction | -5.26 | 6.54 | -0.76 | -1.7 | -1.03 | -1.48 | Universal narrative |
| Love fiction | 0.11 | 6.43 | 0.55 | -0.16 | -1.1 | -1.22 | The fantasy narrative |
| Humor | -6.78 | 3.6 | 2.69 | -1.07 | 0.5 | -0.51 | Universal narrative |
| Whole corpus | -9.43 | 2.14 | 4.55 | -0.86 | 0.77 | -0.79 | Universal narrative |
Then, MAT software was used to conduct a multidimensional analysis of the WCO News Corpus and each sub-corpus, and the values of each dimension and the closest text type of each corpus are shown in Table 5.
Multidimensional analysis of the World Customs Organization News Corpus
| The world customs organization news sub-library | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 | The closest text type |
|---|---|---|---|---|---|---|---|
| Views | -20.4 | -3.54 | 10.4 | -0.45 | 3.06 | -0.76 | Academic article |
| Globe | -23.84 | -3.04 | 9.32 | -2.56 | 2.46 | -1.75 | Academic article |
| Book review | -22.19 | -1.71 | 8.25 | -2.39 | 3.72 | 0.09 | Academic article |
| Flocculus | -24.62 | -4.56 | 12.57 | -2.85 | 0.25 | -2.21 | Academic article |
| Close-up | -28.47 | -3.76 | 14.06 | -3.66 | 2.7 | -1.96 | Academic article |
| File | -22.46 | -4.14 | 11.49 | -1.24 | 3.5 | -1.78 | Academic article |
| Editor’s note | -17.02 | -3.91 | 10.23 | 3.12 | 0.74 | 0.94 | Academic article |
| Event | -23.37 | -4.23 | 11.91 | -2.69 | 1.7 | -1.43 | Academic article |
| Express | -23.05 | -4.43 | 11.13 | -2.77 | 1.32 | -1.56 | Academic article |
| Focal point | -19.85 | -3.33 | 10.2 | -0.21 | 3.6 | -1.22 | Academic article |
| Moderator | -19.91 | -4.81 | 12.7 | 1.02 | 0.74 | -1 | Academic article |
| Dialogue | -18.31 | -3.49 | 10.73 | 0.58 | 2.53 | -0.38 | Academic article |
| File | -22.85 | -3.95 | 14.25 | -0.73 | 2.11 | -0.94 | Academic article |
| Latest report | -22.26 | -1.8 | 10.27 | -2.55 | 1.2 | -0.79 | Academic article |
| Noncolumn name | -17.09 | -5.98 | 14.88 | -6.15 | -1.61 | -2.7 | Academic article |
| Member customs | -24.2 | -4.42 | 11.64 | -2.73 | 1.45 | -1.72 | Academic article |
| Panoramic view | -22.37 | -3.35 | 11.28 | -1.27 | 3.32 | -1.53 | Academic article |
| Publications | -26.31 | -6.65 | 12.73 | -6.92 | -0.61 | -2.62 | Academic article |
| Reader | -23 | -3.78 | 10.82 | -2.48 | 0.97 | -1.42 | Academic article |
| Special report | -21.72 | -3.7 | 12.23 | -1.2 | 2.61 | -1.2 | Academic article |
| Training log | -26.88 | -4.99 | 15.78 | -2.71 | -2.29 | -2.06 | Academic article |
| Focusing | -25.86 | -3.7 | 15.84 | -3.45 | 1.03 | -2.17 | Academic article |
| Whole corpus | -21.21 | -4.35 | 13.65 | -2.14 | 3.22 | -2.33 | Academic article |
As can be seen from Table 4, the BROWN corpus is closest to generalized narratives overall, but the text types of each sub-corpus cover academic articles, generalized narratives, scientific articles, and fantasy narratives, totaling four types. As can be seen from Table 5, the WCO News corpus as a whole and the sub-corpora are closest to academic articles as a text type.
SPSS was then used to carry out independent-samples t-tests on the mean z-scores of 67 linguistic features between the World Customs Organization News Corpus and the news sub-corpus of the BROWN Corpus. A total of 27 features differ significantly (p < 0.05); the 10 with the largest differences are listed in Table 6.
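The core of such a comparison is a two-sample t statistic on the per-text z-scores; a stdlib sketch using Welch's t (the samples below are toy data, and the study's SPSS procedure also reports the corresponding p-values):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)   # sample variances
    return (mean(sample_a) - mean(sample_b)) / sqrt(va / na + vb / nb)

print(welch_t([1, 2, 3], [2, 4, 6]))   # toy feature z-scores per text
```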
The 10 features with the largest differences between the two corpora
| No. | Feature name | WCO News Corpus | BROWN news sub-corpus | p value | Absolute difference |
|---|---|---|---|---|---|
| 1 | PHC | 5.7663 | 2.4081 | 0.001 | 3.28962 |
| 2 | NOMZ | 2.9353 | 0.4015 | 0 | 2.46538 |
| 3 | AWL | 2.8208 | 0.7048 | 0 | 2.0475 |
| 4 | DWNT | -0.766 | 0.4648 | 0 | 1.16432 |
| 5 | SMP | -0.7083 | 0.3648 | 0 | 1.00659 |
| 6 | WZPRES | 1.3081 | 0.3715 | 0.017 | 0.86811 |
| 7 | SYNE | -0.7901 | 0.0481 | 0 | 0.77174 |
| 8 | PUBV | -0.991 | -0.1885 | 0 | 0.73598 |
| 9 | RB | -2.7892 | -2.0485 | 0.001 | 0.67417 |
| 10 | TIME | -0.9315 | -0.1985 | 0 | 0.66644 |
Notes: PHC: phrasal coordination. NOMZ: nominalizations; nouns and plural forms ending in -tion, -ment, -ness, -ity. AWL: average word length; a "word" is any string of characters separated by spaces, as recognized by the Stanford tagger. DWNT: downtoners: almost, barely, hardly, merely, mildly, nearly, only, partially, partly, practically, scarcely, slightly, somewhat. SMP: any form of SEEM or APPEAR. WZPRES: present-participle form of a verb immediately following a noun. SYNE: synthetic negation: no, neither, or nor followed by an adjective, noun, or proper noun. PUBV: public verbs, as listed in Quirk et al. (1985: 1180-1), e.g., acknowledge, add, admit. RB: all adverbs; words tagged RB, RBS, RBR, or WRB by the Stanford tagger. TIME: time adverbials.
The eight features with the largest differences between financial news in the business English corpus and general news are: non-restrictive relative clauses, nouns, public verbs, phrasal coordination, third-person pronouns, special interrogative clauses, other gerunds, and that relative clauses. The results of this study further reveal that, among these, non-restrictive relative clauses, nouns, and phrasal coordination are also among the eight linguistic features that differ most between customs English and general English, while public verbs and phrasal coordination are also among the ten features that differ most between customs news English and general news English.
This paper establishes a Bayesian computational pragmatics inference model for scalar conversational implicature and a Bayesian belief network model for particularized conversational implicature, and builds a discourse power probability model by taking the outputs of the Bayesian network model as features or labels. The English news corpus is chosen as the research object to analyze the effectiveness of the discourse power probability model and the dynamic changes of discourse vocabulary within it. The following conclusions are drawn:
In the construction of discourse power, the experiments first determine, through data calculation, feature ranking and selection, automatic text classification, and significance testing of the classification results, the optimal feature sets for discourse power construction with and without contextual discourse, F1 and F2, containing 17 and 5 features respectively. All results pass the goodness-of-fit test, indicating that linguistic form features can provide some direction for determining the type of conversational implicature. In the lexical statistics of English financial news, words used 1 to 5 times are slightly less frequent than in general discourse, while words used more than 6 times are slightly more frequent, indicating that financial news vocabulary use is more concentrated. The analysis of the different English corpora shows that the multidimensional analysis method can effectively identify differences in linguistic features between customs news English and both general English and general news English. Customs news English is less narrative than general English and general news English, is closer to academic articles, and exhibits more situation-independent linguistic features than general English. This indicates that customs English is more expository, and the findings aid the creation of a core corpus of specialized English.
