A Probabilistic Modeling Study of the Dynamics of Discourse Expression and the Construction of Discourse Power in an English News Corpus

A corpus is a collection of corpus of a certain size that is specially collected for one or more applications, has a certain structure, is representative, and can be retrieved by a computer program. With the help of computer analysis tools, researchers are able to utilize corpora to carry out relevant research on language theory, application and teaching [1-4]. In the context of globalization, with the rapid development of information technology, the external dissemination of Chinese English news has been improving. Many colleges and universities utilize English websites for external communication [5-6]. In order to further improve the strength of foreign propaganda and to strengthen the teaching of foreign propaganda translation and news English translation in the teaching of translation courses, to guarantee that the discourse and discourse characteristics of news and foreign propaganda translation can conform to the linguistic expression habits of native English-speaking audiences [7-10], and to improve the quality of English news reporting and the quality of foreign propaganda translation teaching in Chinese colleges and universities, it is necessary to establish a domestic and foreign news corpus of English news [11-12].

Discourse usually refers to the speech tools used by people to communicate and express their views in a specific social context, and its carrier is language and words with complete meaning. In human life, discourse promotes communication between individuals, but also reflects the inter-subjective power relations to a certain extent [13-16], and the juxtaposition of discourse and power has a long-term impact on the development of society and the direction of inter-subjective relations. As an important part of “discourse power”, the influence of international discourse is not only affected by the comprehensive power of sovereign states, but also has a reverse effect on national power, and its construction is a dialectical process [17-20].

The study uses web crawler technology to obtain corpus information, uses speech act derivation theory and Bayesian network model to combine to establish Bayesian belief network model, and constructs news discourse power probability model on this basis. Taking English financial news, the World Customs Organization News Corpus, and the BROWN Corpus as the research materials, the study analyzes the lexical dynamic changes in the English news corpus and the influence on discourse power from different aspects.

2

English news corpus and the construction of discursive power

2.1

Construction of English News Corpus

2.1.1

Selection of corpus

The corpus sources are the three major mainstream English media in China from 2001 to the present (China Daily English http://www.chinadaily.com.cn/, Xinhua English http: //www.xinhuanet.com/english/, People’s Daily English http://en.people.cn/). The relevant reports. In the selection of keywords, this study synthesizes relevant information from domestic and foreign media and official websites of enterprises to improve the accuracy while ensuring data integrity as much as possible, in order to reduce the workload of manual data checking.

2.1.2

Corpus collection

Based on the above keywords, a large amount of English corpus is obtained from web pages, and web crawler technology is utilized to capture the relevant corpus information. The web crawler architecture is mainly divided into three parts: the crawler scheduler, the main program and the target program, among which the main program is completed by three main modules, namely, the URL manager, the web page downloader and the web page parser as shown in Fig. 1.

According to the crawler’s structure, the required content of the web page is obtained. The implementation of crawler scheduling is carried out first, and then the news links containing the selected keywords are extracted using regular expressions through the URL manager, the corresponding web pages are downloaded, and the corresponding texts are parsed using BeautifulSoup and lxml.

2.1.3

Classification of corpus

The English news corpus consists of two major sub-bases: enterprise category and non-enterprise category. The non-enterprise subcorpus is no longer subdivided into categories and is mainly used for the research of Hefei image outreach, while the enterprise subcorpus is divided into three categories, namely, production and sales P&S, technology T and innovation I. The data is automatically stored in Excel and TXT forms during data mining to retain the original data. TXT documents are uniformly encoded in UTF-8 format, which is convenient for the original corpus to be individually retrieved and verified during the process of corpus processing and application.

2.1.4

Corpus processing

The raw corpus above needs to be processed by noise reduction, word splitting, and labeling before it can be used as a cooked corpus for related research. Since the raw corpus is a collection of txt format text automatically stored by python crawler program, the corpus is cleaner, so the noise reduction process only needs to remove the redundant spaces and carriage returns.

Segmentation is the process of splitting connected characters into mutually separated morphemes for vocabulary level statistics and analysis. Currently, the word segmentation software used for the English monolingual corpus is very mature, and Segment Ant software is chosen for automatic word segmentation in this study. Manual operation is carried out where the program is not accurate enough to identify, mainly some abbreviations and proper nouns.

The annotation of the corpus refers to the use of tags to mark the attributes of the text in the corpus to meet the needs of the study, so that the corpus has machine-readable features. To realize the machine-readability of the corpus and improve the utilization value of the corpus, the key lies in the effective annotation of the corpus [21]. The English news corpus is labeled at two levels: meta-information labeling and lexical assignment. Utilizing meta-information for corpus retrieval is an advanced use of corpus. Here, meta-information refers to information about the corpus. In this corpus, descriptions of the basic attributes of the corpus are added, such as “news source”, “release time”, “news type”, etc., which can improve the efficiency of corpus retrieval and statistical analysis at a later stage. The efficiency of corpus retrieval and statistical analysis can be improved later. Lexical assignment means that the lexical properties of all words in the corpus are labeled according to the context and features of the text. Lexical encoding represents the grammatical features of a word, therefore, lexical assignment is beneficial when analyzing the linguistic features of domestic news reports using this corpus. In this study, Tree Tagger is mainly utilized for automatic encoding, and then the encoded corpus is assisted by manual proofreading.

2.2

Probabilistic modeling of computational pragmatics of the corpus

2.2.1

Indirect speech act theory

Uncertainty in Behavior Theory. They introduce a model of rational speech acts in the context of indirect speech acts, game theory, and the Bayesian period theorem in probability theory. The model takes the probability of the appearance of the speaker’s words based on his or her knowledge of the state of the world as the a priori probability, and then derives the a posteriori probability of the listener’s change in his or her knowledge of the state of the world based on the speaker’s words. The entire process’s formula is as follows: (1) $P_{L} (w | u) = \frac{P (w) P_{S} (u | w)}{\sum_{i = 1}^{n} P (w_{i}) P (u | w_{i})}$

Equation (1) can be simplified as: (2) $P_{L} (w | u) \propto P (w) P_{S} (u | w)$

In Equation (2), w denotes the state of the world and u denotes the words. According to the Bayesian formula, the posterior probability of the listener can be inferred based on the prior probability of the speaker. After that, the rational speech act model adds the speaker’s behavior, which is called the uncertain rational speech act model. It is shown in Equation (3): (3) $P_{L} (w, s | u) \propto P (w) P (s) P_{S} (u | w, s)$

In Equation (3), the listener’s a posteriori probability includes both the state of the world w and the speaker’s behavior s in objective reality, which in turn allows for the speaker’s use of expressive devices such as metaphor, hyperbole, and irony to be taken into account in his or her own a posteriori probability.

2.2.2

Bayesian belief networks

Bayesian belief networks are a combination of graph theory and probability theory that lifts the condition of the plain Bayesian classification method that attributes must be independent of each other.

The joint probability distribution of N variable is: (4) $P (X_{1}, X_{2}, \dots, X_{n}) = \prod_{i = 1}^{n} P (X_{i} | X_{1}, X_{2}, \dots, X_{i - 1})$

Given a vector X = (X₁,X₂,…,X_n) described by attribute Y₁,Y₂,…,Y_n , each variable is conditionally independent of its non-descendants in the network if given two parents of the variable, and so: (5) $P (X_{1}, X_{2}, \dots, X_{n}) = \prod_{i = 1}^{n} P (X_{i} | P a r e n t s (Y_{i}))$ where Parents(Y_i) denotes the biparent node of node Y_i.

2.2.3

Bayesian Belief Network Model of Indirect Speech Behavior

Assuming that the state of the objective world is w, the accuracy of the listener’s cognition of the state of the world is P(w). Examining the theory of cognitive model: cognitive model is a relatively stereotyped mental structure formed by people in the process of recognizing things and understanding the world, which is a pattern of organizing and characterizing knowledge, and consists of concepts and their inter-fixed relations [22]. Ideal cognitive model refers to the abstract, unified and idealized understanding of experience and knowledge in a certain domain made by speakers in a specific cultural context, which is a complex, integrated and finished structure built on many cognitive models. The human cognitive process in cognitive modeling theory is depicted in Fig. 2.

Based on this cognitive process, an attribute relationship diagram for attribute w can be given, which is shown in Figure 3.

In Figure 3, each idealized cognitive model depends on multiple cognitive models, so the probability distribution of idealized cognitive models is: (6) $P (I C M_{1}) = P (I C M_{1} | C M_{1}, C M_{2}, \dots, C M_{a})$

MP denotes the intermediate process from organizing the concept to understanding the meaning of the expression with a probability distribution: (7) $P (M P) = P (M P | I C M_{1}, I C M_{2}, \dots, I C M_{q})$

Therefore, the probability distribution of a person’s perception of the state of the world is (8) $P (w) = P (w | M P)$

In Eq. (8), the joint probability density of ICM₁,ICM₂,…,ICM_q is: (9) $P (I C M_{1}, I C M_{2}, \dots, I C M_{q}) = \prod_{i = 1}^{q} P (I C M_{i} | I C M_{1}, I C M_{2}, \dots, I C M_{i - 1})$

Similarly, the joint probability density between the CMs of the biparental nodes of each ICM can be obtained. In this way, the probability distribution of the listener’s perception of the state of the world can be obtained P(w).

2.3

Probabilistic model of discourse construction

2.3.1

Discourse and discursive practices

Discourse is a form of power; in simple terms, it is the ability to influence public opinion. Discourse power is essentially the right to contest the practice of negotiating meaning. Whoever owns the power of discourse on a certain issue may control the way the issue is produced, circulated or consumed, may control the way the issue is spoken, and ultimately will control the public opinion, and then achieve his or her own interests. In the system of discourse, discursive practice is the central concept [23]. A discourse practice from the macro level includes three aspects of discourse generation, circulation and consumption, and the discourse practice can in turn be expressed in the following six elements: (1) the discourse emitter can be the official institution of a sovereign state or an unofficial organization or group; (2) the content of the discourse is the viewpoints and positions reflecting a sovereign state’s concerns related to its own interests or the international responsibility and obligations it has undertaken; and (3) the Discourse mode is the expression of discourse content, i.e., the rhetorical way in which the discourse content is presented, and the way in which the information is packaged, which directly affects the audience’s acceptance of the discourse content, and in turn also affects the dissemination of the discourse content; (4) Discourse audience is a question of whom to say something to and involves how to choose the audience in order to strive for or expand the effect of the discourse, which has close relationship with the international environment in which the topic is located and the political and ecological environment of the country in which the audience is located. This is closely related to the international environment of the topic and the political and ecological environment of the country where the audience is located; (5) the discourse platform refers to the channels of discourse dissemination, mainly including various forms of media as well as platforms for meaning negotiation and communication in the process of country-to-country interactions; (6) the discourse effect refers to the results obtained by the positions, claims and opinions expressed in the discourse.

2.3.2

Attribute-based probabilistic models

The simplest model for surface generation is to select the most commonly used templates in the training corpus that correspond to the attributes, given a pair of attribute values. out(A) is the output phrase corresponding to the attribute set A: (10) $o u t (A) = {\begin{array}{l} \arg \max_{ρ h r a s e T_{A}} C (p h r a s e, A) & T_{A} \neq ϕ \\ Empty string & T_{A} = ϕ \end{array}$

where T_A is the set of phrases corresponding to A in the training corpus, and C(phrase, A) is the frequency of occurrence in the training corpus of the phrase phrases expressing the attribute-valued pair A, where the maximum number of words in the phrases is M. The model is relatively simple in that it selects templates only based on their frequency of occurrence, and it can generate the text in whole sentences, but with low accuracy, and the generation fails when A is a mixture of attributes. Therefore, the constraint relationship between neighboring words should also be considered in the model.

2.3.3

Probabilistic models based on ternary grammars and attributes

If a collection of attribute one-value pairs is given, the best choice for generation is to select a sequence of words that accurately describes the input attributes and has the maximum probability of occurrence. When generating a word, local information and attribute information are needed, and the local information can be obtained from the N-gram model. Here, the maximum entropy probability model is used to organically combine local information and attribute information to estimate the occurrence probability of words. The probability model is a conditional distribution on V ∪ V_r, V is the word set, which contains all the words occurring in the training corpus, and V_r is the terminator. According to the ternary grammar model and attribute one-value pairs, the occurrence probability of w_i is: (11) $p (w_{i} | w_{i - 1}, w_{i - 2}, a t t r_{i}) = \frac{\prod_{j = 1}^{k} α_{j}^{f_{j} (w_{i}, w_{i - 1}, w_{i - 2}, a t t r_{i})}}{Z (w_{i - 1}, w_{i - 2}, a t t r_{i})}$

where p is the probability: w_i is the ird word in the phrase, which covers V ∪ V_T:{w_i − 1, w_i−2, attr_i} the history: attr_i is the attribute to be generated at position i in the phrase: α_j is the weight of the f_j which can be computed by a send generation algorithm; f_j is called a feature, f_j(a,b)∈{0,1} which is estimated by obtaining the information p(w_i|w_i-1, w_i-2, attr_i) from the history, and the feature can be built by matching the patterns in the training corpus, e.g., a feature derived from the templates of the binary syntax model is: (12) $Z (w_{i - 1}, w_{i - 2}, a t t r_{i}) = \sum_{w^{'}} \prod_{j - 1}^{k} α_{j}^{f_{j} (w^{'}, w_{i - 1}, w_{i - 2}, a t t r_{i})}$ (13) $f (w_{i}, w_{i - 1}, w_{i - 2}, a t t r_{i}) = {\begin{array}{l} 1 & i f w_{i} = At \\ w_{i - 1} = Be compelled \\ Stime \in a t t r_{i} \\ 0 & Other \end{array} .$

Thus the probability of occurrence of the word sequence W = w₁w₂…w_n given a pair of attribute values A: (14) $P (W = w_{1} w_{2} \dots w_{n} | l e n (W) = n, A) = \prod_{i = 1}^{n} p (w_{i} | w_{i - 1}, w_{i - 2}, a t t r_{i})$ where n ≤ M (M is the maximum sentence vocabulary length), len(W) is the length of the word sequence W, and w₁,w₀ is computed as non-existent, i.e., transferred to unary and binary grammars.

If features with low frequency of occurrence are selected it will make the N-tuple grammar model unreliable, so all the features here have to occur more than a certain number of times K in the training corpus, and K has to be at least greater than 3 times.

So corresponding to the attribute one-valued pair A, the optimal phrase generated in the surface implementation is: (15) $o u t (A) = \arg \max_{W \in W_{o u t}} P (l e n (W) = n) P (W | l e n (W) = n, A)$

2.3.4

Probabilistic models based on ternary grammar, syntax and attributes

Given a collection of attribute one-value pairs, when generating the text, the maximum entropy probability model is applied to organically combine the local, syntactic and attribute information obtained from the ternary grammar, and the probability of the corresponding phrases is estimated by estimating the probability of occurrence of the syntactic tree. The occurrence probability of a sub-node word in the syntactic tree is a conditional distribution on V ∪ V_r, V is the word pool, which contains all the words occurring in the training corpus, and V_r is the terminator. According to the ternary grammar model, syntactic dependency tree and attribute one-value pairs, the probability of occurrence of the ith sub-node w_i is: (16) $\begin{array}{l} p (c h_{i} (w) | w, c h_{i - 1} (w), c h_{i - 2} (w), p a r (w), d i r, a t t r_{w, j}) \\ = \frac{\prod_{j = 1}^{t} α_{j}^{f_{j} (c h (w), w, c h_{i - 1} (w), c h_{i - 2} (w), p a r (w), d i r, a w_{w, j})}}{Z (w, c h_{i - 1} (w), c h_{i - 2} (w), p a r (w), d i r, a t t r_{w, j})} \end{array}$ (17) $\begin{array}{l} Z (w, c h_{i - 1} (w), c h_{i - 2} (w), p a r (w), d i r, a t t r_{w, i}) \\ = \sum_{v} \prod_{j = 1}^{k} α_{j}^{f_{j} (w^{'}, w, c h_{i - 1} (w), c h_{i - 2} (w), p a r (w), d i r, a t h_{w, i})} \end{array}$

where p is the probability: ch_i(w) is the ith nearest child node of node w: par(w) is the parent node of node w: dir ∈ {left,right} indicates the child nodes that are the left and right child nodes of the parent node; attr_w,l indicates the attribute to be generated in the tree when the head node w predicts its ith child node: α_j is the weight of f_j, which can be computed by an iterative algorithm, too: f_j is called the feature, and f_j(a,b) ∈ {0,1}, is selected in accordance with the method in the previous section.

The probability of occurrence of a syntactic dependency tree expressing a set of attributes given attribute A can be found by calculating the probability of generating its left and right child nodes, which are independent of each other at the time of generation.

The probability of a sequence of left words in a syntactic dependency tree is: (18) $\begin{array}{l} P_{l e f l} (w | A) = P (l e f l_{c} h i l d = n) \\ \prod_{i = 1}^{n} p (c h_{i} (w) | w, c h_{i - 1} (w), c h_{i - 2} (w), p a r (w), d i r = l e f l, a t t r_{w, j}) \end{array}$

The probability of the right-hand word sequence in the syntactic dependency tree is: (19) $\begin{array}{l} P_{r g h} (w | A) = P (r i g h t_{c} h i l d = n) \\ \prod_{i = 1}^{n} p (c h_{i} (w) | w, c h_{i - 1} (w), c h_{i - 2} (w), p a r (w), d i r = r i g h t, a t t r_{w, j}) \end{array}$

where P(left_child = n),P(left_child = n) denotes the probability that the left and right word sequences are of length n, respectively.

Thus the probability of the whole syntactic dependency tree is: (20) $P (T | A) = \prod_{w \in T} P_{l e f t} (w | A) P_{r i g h t} (w | A)$ where T is the syntactic dependency tree.

So the syntactic dependency tree of the optimal phrase corresponding to attribute one-valued pair A is: (21) $o u t (A) = \arg \max_{T ò T_{o u}} P (T | A)$ $$out(A) = \arg {\max _{Tò{T_{ou}}}}P(T|A)$$

3

Dynamics of discourse expression in the English news corpus

3.1

Significance test for the construction of probabilistic model of discourse power

1)

Discourse power construction of conversational implicatures containing contextual discourse

The first experiment conducts discourse power construction based on both contextual discourse and response discourse, and its main goal is to find out the set of features suitable for discourse power construction labeled by conversational implication that are combined from lexical features, and conducts discourse power construction based on them to get the classification results.

After feature ranking and selection, this paper finds the most optimal feature set F₁ for discourse power construction based on both contextual discourse and response discourse, which contains 17 features in total. According to the interpretation in Coh-Metrix and the coefficients obtained in feature selection, the optimal feature set F₁ is summarized as shown in Table 1.

Table 1.

The best number is the characteristics and their coefficients in F1

Feature	The Explanation of Feature	Coefficient
WRDFRQmc	CELEX Log minimum frequency for content words, mean	0.25745
WRDHYPv	Hypernymy for verbs, mean	0.172133
WRDFRQc	CELEX word frequency for content words, mean	0.1534
DESWLltd	Word length, number of letters, standard deviation	-0.13745
DESWLlt	Word length, number of letters, mean	0.098632
WRDHYPnv	Hypernymy for nouns and verbs, mean	-0.06452
DESWLsyd	Word length, number of syllables, standard deviation	-0.05325
WRDPOLc	Polysemy for content words, mean	0.048352
WRDHYPn	Hypernymy for nouns, mean	-0.04154
DESWLsy	Word length, number of syllables, mean	-0.03544
WRDFRQa	CELEX Log frequency for all words, mean	0.027742
LDTTRc	Lexical diversity, type-token ratio, content word lemmas	0.009453
LDMTLD	Lexical diversity, MTLD, all words	-0.0053
LDTTRa	Lexical diversity, type-token ratio, all words	-0.00357
WRDFAMc	Familiarity for content words, mean	-0.0027
WRDMEAc	Meaningfulness, Colorado norms, content words, mean	0.002464
WRDIMGc	Imagability for content words, mean	-0.00225

From the table, it can be found that when the text used for classification includes both contextual discourse and response discourse, the required features are not only large in number (17), but also rich in variety, containing both descriptive statistics, lexical diversity, and relevant values used to represent lexical information. Features that were not selected for inclusion in F₁ include the mean of the degree of specificity of real words and the mean of the age of acquisition of real words. These two features, although they have high variance, do not contribute much to discriminating categories. And the variance of the value of lexical diversity of all words in VOCD is 0, which naturally cannot be used as a determining feature of discourse power construction.

The results of the discourse power construct were next analyzed in terms of data. Experiments were conducted to record the categorization results of each cross-validation, where the basic categorization assessment metrics, including the number of positive samples predicted by the model to be in the positive category, the number of negative samples predicted by the model to be in the negative category, and based on which the sensitivity, special effects, and accuracy were reported, were first presented. Next, independent sample t-tests are conducted on texts labeled “yes” and “no” to examine whether the accuracy rates of the different categories are significantly different, with a view to determining whether the reliability of the categorization as a whole is related to the type of conversational meaning. Finally, a goodness-of-fit test with equal expected frequency was performed on the accuracy rates to determine whether logistic regression based on feature set F₁ is an effective method for meaning classification, and to verify whether automatic classification of conversational meaning based on lexical features can be performed. The experimental results are shown in Table 2.

Table 2.

The session of the context discourse is the result of building the experiment

Folds	Performance measures					T-test	Goodness of fit test
Folds	TP	sensitivity	TN	specificity	accuracy	Sig.	χ²	Sig.
fold 1	53	0.684	48	0.661	0.674	0.504	13.442	0.002
fold 2	48	0.608	44	0.591	0.591	0.631	4.522	0.055
fold 3	46	0.584	50	0.652	0.611	0.333	6.003	0.016
fold 4	43	0.554	42	0.537	0.554	0.885	0.974	0.347
Total	190	0.608	184	0.610	0.608	0.588	24.941	0.000

According to Table 2, the accuracy of per-fold cross-validation is centered around 60%, with an overall accuracy of 61%. This accuracy rate does not sufficiently indicate that Feature Set F₁ can be used as a basis for the construction of conversationally implicit discourse power, as random factors may also lead to higher than average accuracy rates. According to the p-values in the goodness-of-fit test, two of the four tests were highly significant P₁ =0.002, P₃ =0.016), one was on the edge of significance ( P₂ =0.055), and one was not significant ( P₄ =0.347). This suggests that with the current sample size of the corpus, it is basically possible to determine that feature set F₁ is an influential factor in conversational meaning. In terms of the overall goodness-of-fit test result, the significance is very obvious ( X² =24.941,=0.000), the test result is not surprising, because with the expansion of the sample size, when the accuracy is fixed at 60%, it will be less and less interfered by random factors. So it can be determined that the feature set F1 can be used as a basis for the construction of conversational implicative discourse power.

Next consider whether the above categorization is balanced in both positive and negative categories. Based on the sensitivity and special-effectiveness of the per-fold cross-validation, there is no clear pattern in the accuracy levels between positive and negative categories. As a whole, the values of sensitivity (60.8%) and special-effectiveness (61%) do not differ much, and after independent samples t-tests, the significance of all the tests, both in the per-fold cross-validation and as a whole, is greater than 0.05, i.e., it can be affirmed that in the case of discursive power construction based on both contextual and responsive discourses, the categorization accuracies are independent of the categories of the implied meanings, which suggests that the categories containing the contextual discourse situation, the discourse power construction of conversational implicit meaning is balanced in the positive and negative categories.

2)

Comparison of Contextual Discourse Inclusion and Context-Free Discourse Sessions Implicit Discourse Power Constructs

The degree of influence of contextual discourse and response discourse on categorization results can be determined by cross-sectionally comparing the previous two experiments. Here, in addition to determining the magnitude of the relationship between the two by comparing the metrics associated with the accuracy and goodness-of-fit tests, since the classification of contextual discourse-containing and contextual discourse-absent discourses are one-to-one, a matched-sample t-test can be performed to determine whether the results of the two experiments are significantly different. The results of the comparison were organized as shown in Table 3.

Table 3.

The comparison of the context discourse and the non-contextual discourse

Folds	Performance measures		Goodness of fit test		Effect comparison	Matched T-test
	accuracy		χ²			Matched T-test
	with C.	without C.	with C.	with C.		Sig.
fold 1	0.653	0.596	13.454	4.325	1>2	0.108
fold 2	0.612	0.671	5.255	17.543	1<2	0.067
fold 3	0.607	0.604	7.000	5.224	1>2	0.692
fold 4	0.564	0.624	0.972	0.970	1<2	0.174
Total	0.613	0.612	21.322	29.453	1<2	0.493

The table commences by comparing the accuracy of the two experiments and the statistics of the goodness-of-fit test. It can be found that in the four-fold cross-validation, “containing contextual discourse” is greater than, less than “not containing contextual discourse” have occurred, and overall the indicators of not containing contextual discourse are lower than those of containing contextual discourse. The per-fold cross-validation does not show any significant differences (p-values are all greater than 0.05), according to the results of the matched-samples t-test. The overall significance was 0.493 > 0.05, indicating that overall there was no significant difference in the effect of lexical features on the construction of power in conversationally implicit discourse in the case of containing or not containing contextual discourse.

3.2

Basic statistical analysis

1)

Lexical density

One of the methods is to measure lexical density by the class-symbol-form-symbol ratio. The class-symbol form-symbol ratio is the ratio between the number of all class-symbols and all form-symbols in the corpus. Since the class symbols of a language do not change greatly over a certain period of time. So the larger the corpus, the smaller the TTR, so its ratio is easily affected by the size of the corpus. So, Scott improved the TTR by taking the text to a certain base number (default 1000 in WordSmith software) as the unit to calculate the ratio of class symbols to form symbols, and finally take its average, which can well avoid the influence of different corpus sizes on the TTR. Another method of measuring lexical density is to calculate it by using all real words in the corpus rather than counting the total number of words.

2)

Average word length

The respective word lengths of the two corpora and the average word length as a percentage of the whole corpus, as statistically derived by Word Smith 5.0, are shown in Figure 4. In terms of average word length, there is not much difference between the two corpora, with the former being 7.51 and the latter 7.47. However, it is easy to see from the figure that the trend of ECFE and FLOB is not the same. 2-letter words and 3-letter words of FLOB are significantly higher than those of ECFE, while 1-letter words, 2-letter words, 4-letter words, and 6- to 9-letter words are lower than those of ECFE. This reflects the fact that English financial news has relatively fewer vocabulary words, with relatively fewer short words and relatively more long words. In other words, English financial news uses longer words than general English discourse.

3)

Rarely used words

The author carried out word frequency statistics through Word Smith 5.0, which resulted in the data of rare words used less than 7 times in both ECFE and FLOB corpora as shown in Figure 5. All the words with a frequency of 1 to 5 times in English financial news are lower than those in the generic corpus, while the words with a frequency of 6 times or more are significantly higher, which indicates that the usage of financial news vocabulary is more concentrated compared with the generic corpus.

4)

Word Class Comparison

Since both corpora have been assigned through the CLAWS7 word class assigner, we can retrieve the word classes to which the words belong for comparison. However, English grammars are not entirely consistent in the way they categorize word classes, so in this paper we only concentrate on comparing real words (nouns, verbs, adjectives, adverbs) because these real words best reflect the use of vocabulary. The percentage of real-semantic words in the corpus obtained through Word Smith 5.0 search statistics is shown in Figure 6.

As can be seen from the figure, the percentage of real-semantic words in English financial news is significantly higher, accounting for 69.2% of the entire corpus, compared to 63.5% of the real-semantic words in the generalized discourse. Among these four types of real semantic words, the most obvious difference is in nouns. The proportion of nouns in English financial news is 33.5%, while their proportion in generic discourse is 27.6%. The other three types of real semantic words accounted for about the same proportion in the two corpora. The information density of a text is often measured by calculating the ratio of real-sense words to the total number of words, which represents the amount of information in the text. From this point of view, the textual information density of English financial news is higher than that of the generic corpus, and its text is more informative.

3.3

Multidimensional analysis of English domain features

The object of research in this section is the 1119 texts of the World Customs Organization News Corpus, with 765,460 type symbols and 74,997 class symbols, and the corpus is taken from WCO NEWS (World Customs Organization News) published by the World Customs Organization. Multidimensional analysis was applied to identify clusters of variables through factor analysis, which is widely used in the social sciences, to reduce the dimensionality of a large number of variables into factors or dimensions that are easy to manipulate for analysis, using Multidimensional Annotation and Analysis Tool Multidimensional Analysis Tagger 1.3 (MAT) developed by Nini (2015).

The main process of processing the corpus by the MAT software is: Firstly, the frequency of each linguistic variable in each 100 class symbols is counted, and the Z-value is calculated based on the mean and standard deviation of the frequency of each linguistic variable, and then the dimension scores of the six dimensions are calculated using Z-values with a mean higher than 1, and then the texts are categorized according to their closest type.

MAT software was utilized to analyze the BROWN Corpus and each sub-corpus multidimensionally, and the values for each dimension and the closest text type are displayed in Table 4.

Table 4.

Multidimensional analysis of the white corpus

Brown sublibrary	Dimension 1	Dimension 2	Dimension 3	Dimension 4	Dimension 5	Dimension 6	The closest text type
News report	-17.96	0.5	4.61	-1.74	0.89	-1.27	Academic article
editorial	-12.74	-0.22	4.64	1.24	0.7	-0.4	Universal narrative
News review	-15.36	-0.98	5.4	-3.5	0.44	-1.21	Academic article
religion	-8.38	0.3	5.18	0.29	2.15	0.37	Universal narrative
Skills, business and hobbies	-13.37	-2.18	4.58	-0.89	1.54	-1.2	Universal narrative
Social life	-14.7	0.4	4	-0.96	1.51	-0.77	Universal narrative
Biographies and essays	-12.48	1.24	5.1	-0.94	1.49	-0.25	Universal narrative
Government document	-17.68	-2.64	8.39	0.54	2.67	-0.13	Academic article
Academic paper	-13.13	-1.95	5.82	-1.09	4.48	-0.15	Scientific article
General novel	-7.35	6.29	0.32	-0.4	-0.36	-1.34	Universal narrative
Detective story	-1.75	6.28	-1.29	0.29	-0.99	-1.08	The fantasy narrative
Science fiction	-3.49	5.46	1.33	0.2	0.82	-0.89	Universal narrative
Adventure and western fiction	-5.26	6.54	-0.76	-1.7	-1.03	-1.48	Universal narrative
Love fiction	0.11	6.43	0.55	-0.16	-1.1	-1.22	The fantasy narrative
Humor	-6.78	3.6	2.69	-1.07	0.5	-0.51	Universal narrative
Library	-9.43	2.14	4.55	-0.86	0.77	-0.79	Universal narrative

Then, MAT software was used to conduct a multidimensional analysis of the WCO News Corpus and each sub-corpus, and the values of each dimension and the closest text type of each corpus are shown in Table 5.

Table 5.

Multidimensional analysis of the world customs organization news corpus

The world customs organization news sub-library	Dimension 1	Dimension 2	Dimension 3	Dimension 4	Dimension 5	Dimension 6	The closest text type
Views	-20.4	-3.54	10.4	-0.45	3.06	-0.76	Academic article
Globe	-23.84	-3.04	9.32	-2.56	2.46	-1.75	Academic article
Book review	-22.19	-1.71	8.25	-2.39	3.72	0.09	Academic article
Flocculus	-24.62	-4.56	12.57	-2.85	0.25	-2.21	Academic article
Close-up	-28.47	-3.76	14.06	-3.66	2.7	-1.96	Academic article
File	-22.46	-4.14	11.49	-1.24	3.5	-1.78	Academic article
Editor’s note	-17.02	-3.91	10.23	3.12	0.74	0.94	Academic article
Event	-23.37	-4.23	11.91	-2.69	1.7	-1.43	Academic article
Express	-23.05	-4.43	11.13	-2.77	1.32	-1.56	Academic article
Focal point	-19.85	-3.33	10.2	-0.21	3.6	-1.22	Academic article
Moderator	-19.91	-4.81	12.7	1.02	0.74	-1	Academic article
Dialogue	-18.31	-3.49	10.73	0.58	2.53	-0.38	Academic article
File	-22.85	-3.95	14.25	-0.73	2.11	-0.94	Academic article
Latest report	-22.26	-1.8	10.27	-2.55	1.2	-0.79	Academic article
Noncolumn name	-17.09	-5.98	14.88	-6.15	-1.61	-2.7	Academic article
Member customs	-24.2	-4.42	11.64	-2.73	1.45	-1.72	Academic article
Panoramic view	-22.37	-3.35	11.28	-1.27	3.32	-1.53	Academic article
Publications	-26.31	-6.65	12.73	-6.92	-0.61	-2.62	Academic article
Reader	-23	-3.78	10.82	-2.48	0.97	-1.42	Academic article
Special report	-21.72	-3.7	12.23	-1.2	2.61	-1.2	Academic article
Training log	-26.88	-4.99	15.78	-2.71	-2.29	-2.06	Academic article
Focusing	-25.86	-3.7	15.84	-3.45	1.03	-2.17	Academic article
Library	-21.21	-4.35	13.65	-2.14	3.22	-2.33	Academic article

As can be seen from Table 4, the BROWN corpus is closest to generalized narratives overall, but the text types of each sub-corpus cover academic articles, generalized narratives, scientific articles, and fantasy narratives, totaling four types. As can be seen from Table 5, the WCO News corpus as a whole and the sub-corpora are closest to academic articles as a text type.

Then SPSS software was used to carry out independent samples t-test on the means of z-values of 67 linguistic features of the World Customs Organization Corpus and the News Subcorpus of the BROWN Corpus, and a total of 27 features with significant differences (p < 0.05) were identified, and the 10 features with the largest differences among them were listed as shown in Table 6.

Table 6.

The 10 features of the two groups of corpus differences

Serial number	Feature name	World customs organization news	BROWNNews library	P value	Absolute difference
1	PHC	5.7663	2.4081	0.001	3.28962
2	NOMZ	2.9353	0.4015	0	2.46538
3	AWL	2.8208	0.7048	0	2.0475
4	DWNT	-0.766	0.4648	0	1.16432
5	[SMP]	-0.7083	0.3648	0	1.00659
6	[WZPRES]	1.3081	0.3715	0.017	0.86811
7	SYNE	-0.7901	0.0481	0	0.77174
8	[PUBV]	-0.991	-0.1885	0	0.73598
9	RB	-2.7892	-2.0485	0.001	0.67417
10	TIME	-0.9315	-0.1985	0	0.66644

Notes: 1)

PHC: Plain Concatenated Structure.

2)

NOMZ: Nominalization. Nouns and plural forms ending in -tion, -ment, -ness, -ity.

3)

AWL: Average Word Length. A “word” is any string of characters separated by spaces, as recognized by the Stanford Assigner.

4)

DWNT: Derogations: almost, barely, hard-ly, merely, mildly, nearly, only, partially, partly, practically, scarcely, slightly, somewhat.

5)

SMP: any form of SEEM and APPEAR.

6)

WZPRES: present participle form of a verb after a noun

7)

SYNE: Compound negation. Any adjective (definite adjective, predicate adjective) and noun or proper noun followed by no, neither, nor.

8)

PUBV: Publicly available cognitive verbs. Public verbs listed in Quirk et al. (1985:1180-1), such as acknowledge, add, admit, etc.

9)

RB: All adverbs. Words recognized as RB, RBS, RBR, or WRB by the Stanford Assigner.

10)

TIME: Adverbs of time.

The eight features with the largest differences between financial news in the Business English Corpus and general news are: non-restrictive determiner clauses, nouns, communal cognate verbs, ordinary parallel structures, third person pronouns, special interrogative clauses, other gerunds, and that relational clauses. The results of this study further reveal that, among the above features, the three features of non-restrictive clauses, nouns, and ordinary parallel structure are also among the eight linguistic features that differ most between Customs English and General English, and the communal cognate verbs and ordinary parallel structure are also among the ten features that differ most between Customs News English and General News English.

4

Conclusion

This paper establishes a Bayesian computational pragmatics inference model for hierarchical conversational implication and a Bayesian belief network model for special conversational implication, and establishes a discourse power probability model by taking the outputs in the Bayesian network model as features or labels. The English news corpus is chosen as the research object to analyze the effectiveness of the discourse power probability model and the dynamic change of discourse words in it. The following conclusions are drawn: 1)

In the construction of discourse power, the experiment firstly determines the optimal set of features for the construction of discourse power in the case of containing contextual discourse and not containing contextual discourse as F1 and F2 containing 17 features and 5 features respectively through the data calculation, feature sorting and selection, automatic text classification, and significance test of the classification results, and it is found that the results all pass the the goodness-of-fit test, indicating that the linguistic form features can provide certain directions for the determination of the type of conversational meaning.

2)

In the lexical statistics of English financial news, the statistical results show that in terms of vocabulary usage, the words used from 1 to 5 times in English financial news are slightly lower than those in the general discourse, but the words used more than 6 times are slightly higher than those in the general discourse, which indicates that financial news is more concentrated in terms of vocabulary usage.

3)

According to the analysis of different English corpora, the results of the study show that the multidimensional analysis method can effectively identify the differences in linguistic features between Customs News English and General English and General News English. Customs News English is less narrative than General English and General News English, and is closer to academic articles. Customs News English exhibits more situational dependency than General English, which exhibits more situationally independent linguistic features. It exhibits that Customs English is more discursive and aids in the creation of a co-core corpus of specialized English.

Sprache:: Englisch

Zeitrahmen der Veröffentlichung:: 1 Hefte pro Jahr
Fachgebiete der Zeitschrift:: Biologie, Biologie, andere, Mathematik, Angewandte Mathematik, Mathematik, Allgemeines, Physik, Physik, andere

Zeitschrift RSS Feed

A Probabilistic Modeling Study of the Dynamics of Discourse Expression and the Construction of Discourse Power in an English News Corpus

Wanni Mo

Online veröffentlicht: 19. März 2025

Eingereicht: 05. Okt. 2024

Akzeptiert: 02. Feb. 2025

DOI: https://doi.org/10.2478/amns-2025-0368

SchlüsselwörterNews corpus, Discourse power, Probabilistic modeling, Lexical change

© 2025 Wanni Mo, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Schlüsselwörter
News corpus, Discourse power, Probabilistic modeling, Lexical change