Lexical co-occurrence network and semantic relation mining based on English corpus

According to the cognitive law of learning as categorization and the related research results of mental lexicon, vocabulary learning is the linkage of old and new knowledge, which includes the linkage of morphological, phonological and semantic relations between words. The vocabulary that people have learned is stored in the brain in the form of a network, rather than in alphabetical order as in a dictionary [1-4]. Among them, lexical semantic network is considered to be the most ideal model reflecting the organization of mental lexicon at present. In lexico-semantic networks, words are represented by nodes, which are connected to each other by morphological and phonological similarities and various semantic relations. The distance between the nodes is determined by the strength of the linkage between words [5-8]. Since vocabulary learning is not only about learning new words, but also about learning the relationships between words, the application of lexical semantic networks in learning dictionaries to provide the relationships between words will greatly benefit learners’ vocabulary acquisition [9-11].

With the in-depth development of semantically related lexical co-occurrence research, more and more language scholars have begun to pay attention to its application in university English vocabulary teaching. By analyzing the relationship between different words, semantically related lexical co-occurrence research can help students understand and memorize vocabulary better and improve their vocabulary use ability. Semantic lexical co-occurrence research can help students build vocabulary networks, promote vocabulary depth and breadth, and help students understand the real usage of vocabulary [12-15].

This paper analyzes the process of discovering semantic relations in the domain of English corpus using complex networks. The co-occurrence analysis method is proposed, the general process of the co-occurrence analysis method is elaborated, and the lexical co-occurrence relationship based on lexical semantics is summarized. Combining text complex networks and lexical co-occurrence networks, the TextRank algorithm is improved to optimize the extraction of keywords from lexical co-occurrence networks of text complex networks in the corpus. Combining the community discovery features of complex networks, the FWN-based short text clustering algorithm is proposed. Experiments are conducted to analyze the operational performance of the keyword extraction algorithm based on lexical co-occurrence network and the FWN short text clustering algorithm. Combine the sample corpus for lexical semantic analysis.

2

Semantic relationship analysis of textual complex networks

2.1

Text Complex Networks

2.1.1

Complex networks

With the development of complex two networks, big data in network era provides rich data support for complex networks. And the research object of complex network is also beginning to be romantic to the real meaning, and its application field is also developed from a single mathematical graph theory knowledge application to the communication network, text network and so on. It makes it possible to simplify the complex real networks while realizing the structural characteristics of these networks from the internal exploration, and provides help for exploring the deep information of the actual network [16-17].

The basic properties of complex networks are as follows:

1)

Degree and degree distribution of nodes

If there are k_i nodes connected to node ν_i in the network, then k_i is said to be the degree of node ν_i. The average value of the degree of the nodes in the network is called the average degree of the network graph and is denoted as 〈k_i〉, i.e.: (1) $〈 k_{i} 〉 = \frac{1}{N} \sum_{i = 1}^{N} k_{i}$ \[\langle {{k}_{i}}\rangle =\frac{1}{N}\underset{i=1}{\overset{N}{\mathop \sum }}\,{{k}_{i}}\]

When calculating the node degree in a weighted network, in addition to the number of nodes adjacent to node ν_i , the weights of the edges of the nodes also need to be considered. In order to distinguish from the node degree of the unweighted network, the concept of node strength S_i has been introduced for the weighted network, i.e., the node degree and edge weights should be considered at the same time, and its calculation formula is as follows: (2) $S_{i} = \sum_{j = 1}^{N} (w_{i j} + w_{j i})$ \[{{S}_{i}}=\sum\limits_{j=1}^{N}{({{w}_{ij}}+{{w}_{ji}})}\]

Where w_θ is the weight of the edge of node ν_i pointing to neighboring node ν_j, and if nodes ν_t and ν_j are not adjacent, then w_ij = 0 . N is the number of nodes in the network, in addition to the fact that because of the influence between the nodes of the textual complex network constructed in this paper is mutual, the network constructed is a weighted undirected network, and there is no need to differentiate between the degree of node’s outgoing degree and the degree of incoming degree.

2)

Clustering coefficient

The clustering coefficient is used to determine the degree of tight clustering of the nodes in a network, which reflects the local characteristics of different clusters in the network. If there are k_i nodes connected to node ν_i in the network, these k_i nodes are called neighbors of node ν_i, and at most $c_{k}^{2}$ $c_{k}^{2}$ edges can be generated in these k nodes, then the clustering coefficient of the agglomeration group formed by node ν_i can be defined as: (3) $C_{i} = \frac{E}{c_{k}^{2}}$ \[{{C}_{i}}=\frac{E}{c_{k}^{2}}\]

where C_i is the clustering coefficient E of node ν and is the number of edges that actually exist between the k nodes connected to node ν_i.

Since complex networks are presented as graphs, the tight class coefficients can also be interpreted geometrically, if E is used to represent the number of triangles containing node ν_t in the vertices then Eq. (4) can also be expressed as: (4) $C_{i} = \frac{Number of triangles connected to node v_{i}}{Number of triples connected to node v_{i}}$ \[{{C}_{i}}=\frac{Number~of~triangles~connected~to~node~{{v}_{i}}}{Number~of~triples~connected~to~node~{{v}_{i}}}\]

3)

Median Centrality

For the above two characteristics, node strength usually measures the importance of individual nodes in the whole network. The concept of median is introduced to measure the centrality of a node in the global network, which is defined by the equation: (5) $B_{i} = \sum_{1 \leq j \leq l \leq N} \frac{n_{j l} (i)}{n_{j l}}$ \[{{B}_{i}}=\underset{1\le j\le l\le N}{\mathop \sum }\,\frac{{{n}_{jl}}(i)}{{{n}_{jl}}}\]

Where A B_i denotes the median of node ν_i, n_ij is the number of all shortest paths between node pair (ν_j,ν_l and n_ij(i) represents the number of shortest paths between node pair (ν_j,ν_l) passing through node ν_i.

The importance of nodes with large meshes is overly amplified in the calculation of the composite eigenvalue. For this reason, to address this hidden loyalty, the concept of centrality of meshes is introduced, which, in terms of meaning, is a normalization of the meshes of all nodes, which is calculated as: (6) $C_{B} (v_{i}) = \frac{B_{i}}{[(N - 1) (N - 2)]}$ \[{{C}_{B}}({{v}_{i}})=\frac{{{B}_{i}}}{[(N-1)(N-2)]}\]

4)

Proximity Centrality

Although there is no Euclid distance in complex networks, it can be replaced by the shortest path length, introducing the concept of centrality by proximity, reflecting the degree to which a node is at the center of the network, which Freeman defines as: (7) $C_{c} (v_{i}) = \frac{N - 1}{\sum_{j = 1}^{N} a_{i j}}$ \[{{C}_{c}}({{v}_{i}})=\frac{N-1}{\underset{j=1}{\overset{N}{\mathop \sum }}\,{{a}_{ij}}}\]

Where d_i is the distance between node d_i and node ν_j.

5)

PagcRank value

The PageRank algorithm was first used in web page linking to determine the degree of influence a page has on the pages it links to. Its calculation formula is: (8) $P R (v_{i}) = \frac{1 - d}{N} + d \times \sum_{j \in i n (v_{i})} \frac{P R (v_{j})}{k_{j}}$ \[PR({{v}_{i}})=\frac{1-d}{N}+d\times \underset{j\in in({{v}_{i}})}{\mathop \sum }\,\frac{PR({{v}_{j}})}{{{k}_{j}}}\]

where PR(ν_j) points to the PageRank value of node j of i and de is a model parameter.

2.1.2

Text Complex Network Construction

Up to now, the content characterized by the nodes in the text complex network model can be divided into words, phrases, sentences and paragraphs. As this paper in the construction of the network model also need to add the semantic information of words to maximize the retention of text information characteristics, so the selected text words as network nodes. The text network construction methods using words as nodes include three kinds of word-based co-occurrence, syntactic relationship-based and class-based dictionaries.

1) Text network model based on word co-occurrence

As the two necessary elements of complex network model are nodes and edges, nodes in the text network are composed of words, and the words can be obtained through textual disambiguation.

2) Text network model based on syntactic relations

3) Text network model based on class dictionary

Considering the connection between words in addition to neighboring co-occurring word relations. With syntactic relations, but also includes semantic relations between words. The calculation of semantic relationship is difficult to be realized by direct semantic calculation of words in the text, and the degree of similarity is usually measured by the semantic distance between words.

The calculation of semantic distance between words generally includes two ways based on domain knowledge and corpus.

2.1.3

Semantic relations based on complex networks

In the process of complex network construction, the two most basic elements are nodes and edges. The most important thing in network construction for a certain complexity problem is that nodes and edges should be abstracted. The steps of discovering semantic relations in the domain of English corpus using complex networks are shown in Figure 1.

The relationship extraction process represented by the model can be briefly described by the following steps:

STEP1: Sentences are extracted from the training corpus to form the text set S of the sentences, this time using with the template method to discover the corpus used in the semantic relations.

STEP2: Analyze and get the concepts, objects within the domain by the terms and manual recognition in the domain ontology of the existing English corpus to form the term set T.

STEP3: According to the term set T, the words W associated with the terms, which can be verbs or nouns, are extracted from the sentence set S.

STEP4: The terms in the term set T and the extracted words W associated with the terms are used as nodes to construct the complex network. The formation of a community in the complex network represents that such terms have the same relation R.

STEP5: The complex network of the English corpus is constructed by using the term pairs in the term set T as nodes and the discovered relations R as edges.

2.2

Co-occurrence analysis

2.2.1

Lexical co-occurrence relations based on lexical semantics

From the semantic point of view of vocabulary, common lexical co-occurrence relations can be categorized as follows:

Oppositional relationship, where there is an oppositional relationship between two semantic constituents with the same table difference within the same semantic field between the semantic positions.

Overlapping relations-Relationships between semantic positions within the same semantic field.

Containment relation - When the semantic constituent of B is contained in the semantic position of A within the same semantic field, there is a containment relation between AB and AB.

Contextual relations-Relationships between semantic positions that are not part of the same semantic field. Comparable to contextual relations.

Relatively Irrelevant Relationships-Relationships that are not between semantic bits of the same semantic field. There is a relative irrelevance relation between two loci if their semantic elements do not contain each other.

Combinatorial relations-relationships that do not belong to the same semantic field.

A full understanding of the characteristics of lexical co-occurrence relations can play an important role in the effective utilization of these specific types of semantic relations that exist in the surface of language.

2.2.2

General research process for co-occurrence analysis

The general operation process of co-occurrence analysis is shown in Figure 2, which can summarize the general research process of co-occurrence analysis into three steps, data extraction, construction of co-occurrence matrix or co-occurrence vector, and data analysis. Due to the differences in the research objectives, the specific research process of co-occurrence analysis will also be different.

2.2.3

Data analysis

The collected co-occurrence information is analyzed using the lexical relevance approach. Relevance is a measure of correlation between objects characterized by dispersed attributes. Relevance quantifies the correlation between objects, which in turn clusters pairs of pairs with similar attributes into clusters. The relationships between words can be analyzed by calculating their relatedness, or words can be clustered to form concepts and the relatedness between concepts can be calculated. The main measures to process the first two types of co-occurrence matrices to analyze the correlation between words are Dice index, cosine index, and Jaccard index. The calculation formula is as follows: (9) $Dice’s coefficient = \frac{2 c_{i j}}{c_{i} + c_{j}}$ \[\text{Dic{e}'s coefficient}=\frac{2{{c}_{ij}}}{{{c}_{i}}+{{c}_{j}}}\]]

(10)

Cosine Coefficient = \frac{c_{i j}}{\sqrt{c_{i}} * \sqrt{c_{i}}}

\[Cosine~Coefficient=\frac{{{c}_{ij}}}{\sqrt{{{c}_{i}}}*\sqrt{{{c}_{i}}}}\]

(11)

Jaccard Coefficient = \frac{c_{i j}}{c_{i} + c_{j} - c_{i j}}

\[Jaccard~Coefficient=\frac{{{c}_{ij}}}{{{c}_{i}}+{{c}_{j}}-{{c}_{ij}}}\]

(12)

w_{j k} = \frac{\sum_{j - 1}^{N} d_{i j k}}{\sum_{i - 1}^{N} d_{i j}} * W e i g h t i n g F a c t o r (k)

\[{{w}_{jk}}=\frac{\underset{j-1}{\overset{N}{\mathop \sum }}\,{{d}_{ijk}}}{\underset{i-1}{\overset{N}{\mathop \sum }}\,{{d}_{ij}}}*WeightingF\;actor\;(k)\]

(13)

w_{k j} = \frac{\sum_{i - 1}^{N} d_{i j k}}{\sum_{i - 1}^{N} d_{i k}} * W e i g h t i n g F a c t o r (j)

\[{{w}_{kj}}=\frac{\underset{i-1}{\overset{N}{\mathop \sum }}\,{{d}_{ijk}}}{\underset{i-1}{\overset{N}{\mathop \sum }}\,{{d}_{ik}}}*WeightingF\;actor\;(j)\]

where c_ij is the number of occurrences of words i and j, and c_i,c_j is the total number of occurrences of words i and j in the text set, respectively. where d_ijk is the joint weight of words j and k in document i, and the weighting factor(k) is similar to the inverted document frequency to exclude generic words, i.e., words that occur in many documents.

The Jaccard index is widely used as a standardized correlation coefficient between two words representing c_i,c_j because it directly reflects the similarity between two words based on their co-occurrence frequency and eliminates the negative effect of high-frequency words.

The correlation between two lexical vectors is usually calculated using the cosine index: (14) $\sin m (d_{i .} d_{j}) = \frac{\sum_{k = 1}^{n} w_{i k} \times w_{j k}}{\sqrt{(\sum_{k = 1}^{n} w_{i k}^{2}) (\sum_{k = 1}^{n} w_{j k}^{2})}}$ \[\sin m({{d}_{i.}}{{d}_{j}})=\frac{\underset{k=1}{\overset{n}{\mathop \sum }}\,{{w}_{ik}}\times {{w}_{jk}}}{\sqrt{(\underset{k=1}{\overset{n}{\mathop \sum }}\,w_{ik}^{2})(\underset{k=1}{\overset{n}{\mathop \sum }}\,w_{jk}^{2})}}\]

where w_ik is the value of d_i on the k th dimension of the n-dimensional lexical vector. After obtaining the semantic relations between words in specific applications aimed at analyzing lexical relatedness, such as constructing conceptual spaces and semantic retrieval, the results of the analysis are applied to the corresponding specific processing methods, and the co-occurrence analysis ends here.

2.3

Text networks and word co-occurrence networks

2.3.1

Text Keyword Extraction for Word Co-occurrence Networks

1)

Text Preprocessing

Sentence recognition is performed on the initial text T. The initial text T is divided into a collection of sentence T = {S₁,S₂,S₃,...,S_n}’s using sentence terminators, etc. as sentence separators. The segmented sentences S_i ∈ T are subdivided, labeled lexically, and the deactivated words are deleted according to the denoising word list to obtain the set of candidate keywords corresponding to each sentence S_i = {w_i1,w_i2,w_i3,...,w_im}, where w_ij ∈ S_i, is the remaining candidate keyword after the processing of the above steps, S_i ∈ T.

2)

Text representation

Word co-occurrence network G = (V,E) , where V is the set of nodes, consisting of the set of candidate words, E ⊆ V × V is the set of edges, and E . The size of the co-occurrence window of the text is the length of the sentence after text preprocessing, and the maximum span of the co-occurrence window of the candidate keywords is taken as 2. That is, for any two candidate words w_i, w_j in the text T = {w₁,w₂,w₃,...w_n}. If they appear in a sentence at the same time and the distance between them is not greater than 2, it is considered that there is an association between these two words. Accordingly, connecting edges are established between these two nodes and the same nodes are merged to obtain the word co-occurrence network.

2.3.2

Improved TextRank for Keyword Extraction

The steps of the improved keyword extraction algorithm for TextRank are specified as follows:

1) Construct word co-occurrence network. A series of preprocessing operations are performed on the text, and the word co-occurrence network is constructed on the basis of the word co-occurrence relationship of the processed document.

2) Initial node weight calculation. Considering the degree centrality and clustering coefficient value of each node in the network, the comprehensive feature value of each node is calculated and used as the initial weight of nodes in the network.

3) Connected edge weight calculation. The initial weight is assigned to the connecting edge between two nodes based on the importance of the node’s neighbors to it, so as to achieve the weighting of the connecting edge.

4) Keyword extraction. The C-TextRank algorithm is utilized for iterative calculation to obtain the importance ranking results of all nodes, and on the basis of which the node weights are updated by incorporating the node location features, and the top-K nodes are taken as the text keywords.

The algorithm is realized as follows:

1)

node initial weight calculation

When calculating the initial weights of nodes, this paper comprehensively considers the degree centrality and clustering coefficient of nodes in the computational text network two complex network statistical features.

The degree centrality kc_i and clustering coefficient cc_i of each word ν_i in word co-occurrence network G are calculated first, and then the comprehensive eigenvalue wc_i of each node ν_i is calculated as the initial weight of node ν_i by Eq. (15). That is: (15) $w c_{i} = α * k c_{i} + β * c c_{i}$ \[w{{c}_{i}}=\alpha *k{{c}_{i}}+\beta *c{{c}_{i}}\]

where α,β is the adjustable parameter,α + β = 1, kc_i is the degree centrality of the nodes, and cc_i is the clustering coefficient.

2)

Connected Edge Weight Calculation

First, the comprehensive eigenvalue wc_i of node ν_i is used as the initial weight of node ν_i, and then the initial weight of the node is assigned to the connecting edges between the nodes in the network as the connecting edge weight using equation (16), so as to obtain the undirected weighted network, based on which, the nodes in the network are scored in terms of importance by using the improved formula (16) proposed in this paper. The improvement formula proposed in this paper is as follows: (16) $w_{i j} = w c_{i} \frac{w c_{j}}{\sum_{k \in T (i)} w c_{k}} + w c_{j} \frac{w c_{i}}{\sum_{k \in T (j)} w c_{k}}$ \[{{w}_{ij}}=w{{c}_{i}}\frac{w{{c}_{j}}}{\underset{k\in T(i)}{\mathop \sum }\,w{{c}_{k}}}+w{{c}_{j}}\frac{w{{c}_{i}}}{\underset{k\in T(j)}{\mathop \sum }\,w{{c}_{k}}}\]

Where w_ij denotes the weight of the connecting edges between nodes ν_i,ν_j, wc_i,wc_j is the initial weight of node ν_i,ν_j , and T(i) denotes the set of neighboring nodes to node ν_i. That is: (17) $i m (v_{i}) = (1 - d) \frac{1}{N} + d * \sum_{j \in T (i)} \frac{w_{j i}}{w_{j}} * S (v_{j})$ \[im({{v}_{i}})=(1-d)\frac{1}{N}+d*\underset{j\in T(i)}{\mathop \sum }\,\frac{{{w}_{ji}}}{{{w}_{j}}}*S({{v}_{j}})\]

Where ν_i,ν_j is the neighbor node in the word co-occurrence network. N is the number of nodes in the network. T(i) denotes the set of neighboring nodes to node ν_i and im(ν_i) is the importance score of node ν_i in the word co-occurrence network.

3)

Calculation of Positional Characteristics

In text, the position where the word appears is usually also an important factor in determining the importance of the word, and if the word ν_i that appears in the body part also appears in the title, then the word is more likely to be used as a text keyword. Therefore, after completing the importance scoring of the nodes with the C-TextRank algorithm, the positional coefficient γ is further introduced to adjust the final weights of the keywords. The final weight of each word is E(ν_i) , which is calculated as Eq: (18) $E (v_{i}) = {\begin{array}{l} γ * i m (v_{i}) & v_{i} Appears in the title section \\ (1 - γ) * i m (v_{i}) & v_{i} Appears only in the body \end{array}$ \[E({{v}_{i}})=\{\begin{array}{*{35}{l}} \;\gamma *im({{v}_{i}}) & {{v}_{i}}~Appears~in~the~title~section \\ (1-\gamma )*im({{v}_{i}}) & {{v}_{i}}~Appears~only~in~the~body \\ \end{array}\]

4)

Keyword Extraction

According to E(ν_i) after all the nodes in the word co-occurrence network are sorted in descending order, considering that the keywords of the text will generally not be single-word words. The sorted nodes will be uniformly processed to remove the single-word words in the candidate keywords. Finally, the Top-K node is taken as the key node of the word co-occurrence network, i.e., the text keyword.

2.4

Short Text Clustering Algorithm for Lexical Co-occurrence Networks

The core idea of the FWN short text clustering algorithm is that there is a strong semantic connection between words that co-occur at high frequency. In order to better reveal the semantic connections between words, the FWN algorithm uses a complex network to model the topics in the short texts.The FWN algorithm treats words as vertices V in a graph, and there is an undirected edge E between two words if they occur together at high frequencies.It is shown that the network structure constructed in this way has the typical community structure where the words of the same topic are closely connected. structure. Therefore, by finding the communities of words that exist in the corpus word set co-occurrence network, we can find the feature word sets of topics that exist in the short texts. By utilizing the feature word sets of topics, short texts can be clustered.

2.4.1

Community discovery

Community is a common network structure often phenomenon appearing in complex networks, nodes with similar attributes are closely linked together presenting the structure of community, and nodes in different communities are relatively loosely connected to each other.

In this paper, the classical community discovery algorithm GN algorithm [18] is used in detecting the community structure present in the co-occurring network of English corpus.

2.4.2

Frequent Term Set Mining

Frequent itemsets refer to the set of data objects that co-occur at high frequencies. The frequency of the itemset refers to the number of data that have the objects of the thing and is called the support of the itemset. If the support of the itemset is greater than a predefined minimum support min value, the itemset is said to be called a frequent itemset.

FP-Growth frequent pattern growth algorithm is a classical frequent itemset mining algorithm. It uses the idea of partitioning to compress the set of data elements representing frequent itemsets into a frequent pattern tree called FP tree. This FP tree still retains the association data between these itemsets, and the FP tree algorithm treats the compressed data of things as conditional databases. Each database is associated with a frequent item, and for each conditional database only the data set associated with it needs to be considered.

2.4.3

Frequent word set word co-occurrence network

A frequent word set is a set of frequent items of words in a large corpus, and a more rigorous definition of a frequent word set is given in Definition 1. Words in a frequent word set are used to represent the same set of concepts or to illustrate the same topic at a high frequency.

Let W be the short text data set W = (w₁,w₂,w₃,...,w_n) and the set of words of the short text data w_i be C_i , C_i = {c_i1,c_i2,c_i3,...,c_in}, where c_ij is the j th word of w_i. t is the set of all short text data words in W T = {t₁,t₂,t₃,...,t_n} The set of words U is a subset of T. Define the support of U SUP(U) to be the number of short textual data in W that contain U, divided by the total number of short textual data in W. U is said to be a frequent word set when the value of SUP(U) is greater than the threshold of minimum support.

The short text clustering algorithm based on frequent word set word co-occurrence network, also called FWN short text clustering algorithm.The execution process of FWN short text clustering algorithm mainly consists of six steps as shown above and below:

1) Data preprocessing

2) Mining frequent word sets

3) Constructing FWN network

FWN network, is a network constructed on the basis of word co-occurrence relationship of words in frequent word sets, i.e., all the words in a frequent word set are used as nodes. If two words occur in the same frequent word set then it is considered that there is an edge between these two words. In the FWN network constructed on the basis of frequent word sets, words within the same community, usually describe the same topic, i.e., the topic appears in the structure of a community.

4) Mining topic communities

5) Single-pole clustering

After using the community discovery algorithm to find the topic communities that exist in the FWN network, the topics that exist in the corpus data and the set of feature words that these topics have are found. With these sets of feature words, one-poo clustering is performed by using a method that calculates the similarity between the set of feature words and the short text.

C,H for the two word sets, C = {cl,c2,c3,...,cn}H = {h1,h2,h3,...,hm} . To compute the similarity of two word sets, we introduce the function R(C,H) , which represents the similarity between word set C and word set H . That is: (19) $R_{(C, H) =} \frac{C \cap H}{C}$ \[{{R}_{(C,H)=}}\frac{C\mathop{\cap }^{}H}{C}\]

Define the similarity between C and H as S(C,H): (20) $S_{(C, H) =} \frac{R_{(C, H)} + R_{(H, C)}}{2}$ \[{{S}_{(C,H)=}}\frac{{{R}_{(C,H)}}+{{R}_{(H,C)}}}{2}\]

6) Generate clustering labels

After the above steps are executed, the FWN short text clustering algorithm accomplishes a typical text clustering task. That is, the input short text corpus data is automatically divided into multiple text clusters. The goal of generating class labels is to produce a readable and understandable label that provides a brief summary and description of the content of short texts within the same class of short text in-put.

3

Lexical co-occurrence and structural semantic analysis based on English corpus

3.1

English Vocabulary Keyword Extraction

1) The experimental environment includes:

Operating System: Windows 10 (64-bit)

Processor: Intel(R) Core(TM) i7 -1720HQ CPU 2.60GHz

Programming language: Java Matlab

Open platform: Eclipse 3.7 JDK2.5 Matlab 2023a

2) The experimental dataset of this paper adopts the condensed version of the Reduced Text Corpus, which is a corpus of English texts organized by a laboratory. The Reduced Text Corpus is derived from the news texts of an English news website, which mainly includes nine different categories of text datasets, such as sports, military, finance and economics, etc. Each category contains 2,000 different documents. Each category contains 2000 different documents, and the size of the compressed package is 60 M. Due to the fact that this dataset contains too much corpus, this paper will streamline and process the dataset in the following experiments, and select a part of it to be used for experimental simulation.

Based on the Reduced dataset, a total of 500 documents are extracted from 9 different categories and manually labeled with keywords, and a total of 2000 keywords are extracted. On average, 10.2 keywords are extracted from each document.

3) Experimental steps and parameter settings

The first step is to carry out the preprocessing of the text. It mainly includes Chinese word segmentation for each document, and this paper adopts FNLP word segmentation system. Then remove the deactivated words according to the deactivation word list and construct the candidate word set.

The second step of keyword extraction. The traditional TextRank algorithm and the improved algorithm are used for keyword extraction respectively, in which the specific experimental parameters are as shown in Table 1.

Since each document is manually labeled with 10.2 keywords on average, this experiment extracts the top 10 keywords for each document for comparative analysis. The number of sliding panes belongs to the uncertainty parameter, the experiment will be the number of panes were set to 2, 6, 10, 15, 20 to compare different experimental results.

4) Analysis of experimental results

For the experiment, the accuracy P, recall R, F1–measure(F1) and average iteration I are chosen. set the set of keywords extracted by the algorithm as T, and the set of keywords labeled manually as H.

The keyword extraction is performed according to the above procedure, and the keyword extraction results of the two algorithms at different sliding pane lengths are stored in the table. The comparison of the keyword extraction results is shown in Table 2.

The improved TextRank algorithm outperforms the traditional TextRank algorithm with the same sliding pane length. The results show that the optimization of edge weights of TextRank algorithm by combining the local semantic similarity of words is effective, which not only improves the accuracy of keyword extraction, but also reduces the time complexity and convergence of keyword extraction.

The results of the improved TextRank algorithm are better than those of the TextRank algorithm, with the average P, R, and F1 improved by 0.113, 0.0852, and 0.0826, respectively. The keyword extraction effect of the improved TextRank algorithm is also improved to some extent, but the reduction in the number of iterations of the algorithm is small, which once again indicates that the edge weights have a greater impact on the iterative calculation process.

Table 1.

Experimental parameter

Parameter name	Parameter value
Maximum iteration number	400
Iteration out of the threshold	0.003
The single document extracts the number of keywords N	10
Damping factor d	0.61
Vertex score	1.0
Slide pane size w	2/6/10/15/20

Table 2.

Keyword extraction comparison

Algorithm	Index	2	6	10	15	20	MEAN
TextRank
	P	0.3625	0.3212	0.3578	0.3755	0.3964	0.3627
	R	0.4689	0.4968	0.4578	0.4772	0.4931	0.4788
	F1	0.3715	0.3251	0.3698	0.3745	0.3604	0.3603
I	53	56	59	52	51	54.2
Improved textrank
	P	0.4596	0.4685	0.4725	0.4869	0.4911	0.4757
	R	0.5521	0.5637	0.5417	0.5698	0.5927	0.5640
	F1	0.4122	0.4166	0.4867	0.4474	0.4516	0.4429
I	42	41	38	35	45	40.2

3.2

English Vocabulary Short Text Clustering

The dataset used in this chapter is from a corpus of English texts organized by a laboratory (ibid.). In order to verify the effectiveness of the short text clustering algorithm based on FWN, a total of 300 documents are extracted from 9 different categories, 1560 data are extracted number by number and the order of the text is disrupted, and then cluster analysis is carried out to obtain the clustering results. The number of texts in each dataset is shown in Table 3. The number of texts in the datasets of each category is different, and dataset 1 includes 526 text information.

Table 3.

Number of data sets

Classification	Text quantity	Classification	Text quantity
Data set 1	526	Data set 6	64
Data set 2	125	Data set 7	153
Data set 3	348	Data set 8	186
Data set 4	56	Data set 9	54
Data set 5	48	-	-

In this paper, the results of short text clustering are judged using the check accuracy rate, check completeness rate and F1 value, and the check accuracy rate, check completeness rate and F1 value of text information in each category of dataset in the dataset are shown in Fig. 3. It can be obtained that the FWN short text clustering algorithm proposed in this paper can realize a check accuracy rate of 0.5554 and a check completeness rate of 0.5785, and the mean value of F1 is 0.5620.

3.3

English measurement analysis based on lexical co-occurrence networks

Wikipedia can serve different roles in different applications. In addition to being an encyclopedia, it is also considered a large-scale corpus, thesaurus, taxonomical map, conceptual hierarchy network, or ontological knowledge base. The different contents of Wikipedia sites contain a wealth of available resources, such as synonym page redirection, word disambiguation pages, structured infoboxes, hyperlinks to semantic web information, and taxonomical information classification maps. In this paper, Wikipedia is used as the research data.

In word co-occurrence network, each node represents a word and each word has a corresponding word class. By counting the nodes, the node characteristics of the network can be known. In the total network (which contains news domain word co-occurrence network, literature domain word co-occurrence network, and law domain word co-occurrence network), nouns have the highest number of nodes as nodes, followed by verbs. Time words have the least number of nodes as nodes. The order of the number of nodes from highest to lowest is: nouns > verbs > adjectives > adverbs > pronouns > temporal words > number words > exclamations > intonations > mimics > quantifiers > postpositions = connectives > modals > time words. The network nodes can then be better analyzed by comparing the three different domain words with the present network.

The distribution statistics of word categories of different domain word co-occurrence network nodes are shown in Figure 4. The three domains’ word co-occurrence network node word categories include nouns, adjectives, number words, quantifiers, pronouns, time words, adverbs, modals, mimics, postpositions, inflections, connectives, and exclamations. The largest number of nouns are distributed in the three domains of literature, journalism, and law, which include 8569, 7596, and 4021, respectively.

3.4

Semantic Relevance Calculation for the English Corpus

For the purpose of semantic relatedness evaluation, test sets have been created by manually annotating a collection of semantically related words. Certain publicly available test sets covering a larger number of keywords, which are widely cited in semantic similarity computations and experimental reviews. The Finkelstein-353 test set is a more typical manually labeled semantic similarity test set. It consists of a set of 153 word pairs covering 13 topics, and a set of 200 words covering 16 topics. The words are selected to cover as many possible semantic correlations as possible, and each pair of related words is manually labeled for semantic relatedness by 13-16 people. The average value is then selected as the final labeling result.

In order to review the multipath semantic relevance algorithm based on Wikipedia link graph, this test set is also used in this paper, using Wikipedia’s multi-language pairs of links, and also combined with a translation tool for translation and manual proofreading work on this test set, so that more words can be matched to nodes present in the English Wikipedia.

In order to evaluate the method of extracting semantically related words from the link references in this paper, and also to build a related word review dataset containing more vocabulary for the study of semantic similarity computation. The experiment manually labeled some of the semantically related words.

In order to ensure a certain degree of differentiation, the experiment did not completely select word pairs whose correlations are all particularly obvious. Instead, 800 word pairs from each of the two groups of words A and B were extracted from the set of semantically related words. Where group A are word pairs with high semantic relatedness. It was randomly selected from the set of words and from the set of words with average relevance in the range of 6-8. Group B is the word pairs with average similarity, which were randomly selected from the set of words with similarity in the range of 2-4. Further these related words were independently labeled with semantic relatedness by a team of 8 members and the results of their respective labeling were normalized. The average value was taken as the final result.

Some examples of the manual relevance labeling of the corpus are shown in Table 4. The table reflects the semantic relevance of the two terms, and the correlation between imperialism and colonialism, for example, is 0.472 for the two English terms.

Table 4.

Some examples of artificial relevance in corpus

Word A	Word B	Degree of correlation
Imperialism	Colonialism	0.472
Shampoo	Conditioner	0.399
Overseas Chinese	Settlers	0.085
Lion tiger	Tiger lion	0.551
Yongding River	Lugou Bridge	0.364
First order logic	Field of theory	0.261
The middle ages	Castle	0.277
Symphony	Movement	0.211
Taiwan	NT	0.316
Husband’s family	Wife’s family	0.387
Hewlett-Packard	Printer	0.428
Spring Festival	Dumplings	0.395

The experimental results of multipath semantic relevance calculation are shown in Table 5. Analyzing the experimental results, it can be seen that better results can be achieved by combining document graphs and classification graphs. And using any of the link diagrams alone may lead to worse results because of ignoring the useful semantic information. More desirable results can be achieved when the weighting factor of the classification diagram and document diagram is chosen to be λ = 0.32.

Table 5.

Multi-path semantic correlation calculation results

Characteristics of the correlation algorithm	Grade correlation coefficient
Separate use of the classification diagram	0.42
Reciprocal path	0.23
Depth information weighting	0.39
Information content weighting	0.37
Use the document map separately	0.36
A non-directional diagram of a two-way link	0.25
Link to link separately	0.37
Link to the link	0.31
Forward to the connection and adjust the weight of the parameters	0.41(π = 0.85)
Use the first paragraph instead of English	0.33
Integrated document and classification diagram, and adjust the parameters	0.46(λ = 0.32)
Open test masking(WS353)	0.37

4

Conclusion

In this paper, we analyze the semantic relationship of text complex network, propose the co-occurrence analysis method, and improve TextRank algorithm for keyword extraction of lexical co-occurrence network. Combined with the lexical co-occurrence network in the English corpus for semantic relationship mining. The improved TextRank algorithm can reduce the time complexity and convergence of keyword extraction. In the comparison test, the results of the improved TextRank algorithm are better than the original algorithm, with the average P, R, F1 improved by 0.113, 0.0852, 0.0826 respectively. The FWN short text clustering algorithm based on FWN is able to achieve a checking accuracy rate of 0.5554 in the English text corpus, which meets the demand. The English corpus selected for experimental analysis concentrates on the distribution of node words in the lexical co-occurrence network of literature, news and law in this English corpus. Among the various types of node words, the number of noun distribution is ranked at the top. And the multipath semantic relevance calculation is carried out to find the results of corpus relevance labeling and the relevance between words.

Funding:

1) Project of Guangdong Provincial Private Education Association: Research on innovative application of College English Teaching in Private Universities under the background of digital transformation (No.: GMG2024084).

2) The Quality Engineering Project of Zhanjiang University of Science and Technology: Research on the optimization of online and offline resources and teaching practice under the guidance of innovation and entrepreneurship (No.: JG-2022546).

Language:: English

Publication timeframe:: 1 times per year
Journal Subjects:: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

Journal RSS Feed

Lexical co-occurrence network and semantic relation mining based on English corpus

Guimei Pan

Published Online: Mar 17, 2025

Received: Oct 25, 2024

Accepted: Feb 09, 2025

DOI: https://doi.org/10.2478/amns-2025-0209

KeywordsTextRank algorithm, Text complex networks, Semantic relatedness, Co-occurrence networks, English corpus

© 2025 Guimei Pan, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.

Keywords
TextRank algorithm, Text complex networks, Semantic relatedness, Co-occurrence networks, English corpus