A Clustering Study of Online Public Opinion Texts on Public Emergency Events Based on Sentence-Level Similarity and Sentiment Analysis

With the development of the Internet, social media has become an important platform for people to obtain information and express their opinions, and both the outbreak of public emergencies and the dissemination of public opinion about them have become more rapid and widespread [1-2]. Public emergencies are events that occur suddenly and may have a significant impact on public life, property and social public interests. Because they are sudden and unpredictable, such events often trigger strong public concern and emotional fluctuations, forming a public opinion field [3-6]. The Internet and social media make the dissemination of information faster and more extensive, but they also allow diverse voices to gather and be amplified easily, which gives rise to online public opinion [7-8]. Since emergencies often obstruct the original information channels at the center of the event, inaccurate speculation and false news tend to fill the resulting information vacuum [9-11]. The traditional approach to public opinion guidance, which is single-channel and inefficient, cannot cope with online public opinion that forms rapidly on online platforms and is characterized by strong emotional coloring, uncontrolled sources, and one-sided public pressure [12-15].

Text clustering, as an unsupervised machine learning method, is flexible and can be automated because it requires neither a training phase nor manual labeling of the categories of online texts in advance. It has become the main technical tool for organizing textual information and mining hot topics in online public opinion analysis systems [16-19]. Because the keywords extracted when building text features form a very large set of feature words, the dimensionality of the text feature vector should be reduced before clustering. The usual method is to rank all feature words by weight and select a predetermined number of the highest-ranked words as the feature subset [20-23]. Another common method exploits a characteristic of online information: items generally come with headlines that summarize the main content of the body, so the headline content can also serve as the text feature [24-26].

Internet public opinion on public emergencies is characterized by fast dissemination and emotional polarization, and traditional analysis methods have difficulty mining the laws of topic evolution accurately. Existing studies mostly rely on single text features or static sentiment classification, ignoring the impact of sentence-level semantic associations and dynamic changes in sentiment on the clustering effect. For this reason, this paper proposes an opinion text clustering framework that integrates sentence-level similarity calculation and sentiment analysis. At the method level, a real-time burst word recognition algorithm based on an incremental Kleinberg model is proposed to recognize burst words in real time from streaming appeal records, which provides clues for subsequent emergency event detection. The framework also uses the TF-IDF algorithm to extract topic words, adopts an improved naive Bayes algorithm (incorporating NTUSD, Hownet, and other sentiment lexicons) to improve sentiment classification accuracy and to address the misclassification of words by the traditional model, realizes multi-stage topic clustering through the K-means and DBSCAN algorithms, and visualizes the evolution of hotspots with a theme river graph.

2

Research on Clustering of Internet Public Opinion Texts on Public Emergencies

2.1

Model of the Public Opinion Analysis System for Public Emergencies on the Internet

As a network platform for monitoring and analyzing public opinion by relevant government departments and units, the Public Opinion Analysis System for Public Emergencies can effectively analyze and monitor different types of public opinion information on public emergencies, quickly identify and track public opinion hotspots and sensitive topics, accurately grasp the process of public opinion evolution and trend prediction, and provide auxiliary decision-making for the government to guide the benign development of public opinion in a targeted manner. According to the above general requirements of the system, this paper proposes a web-based text mining system model for analyzing public opinion on public emergencies. The system model consists of four layers: web public opinion collection layer, public opinion data preprocessing layer, public opinion analysis layer and system application layer, and the model architecture is shown in Figure 1.

2.2

Text Processing Related Technologies

2.2.1

Burst word state recognition model

Identifying the burst state of words provides important clues for determining the time and content of emergencies. When an unexpected event occurs, the frequency of certain words closely related to it increases sharply, so recognizing the burst state of words in the data stream makes it possible to delineate the time and semantic content of potential emergencies.

At present, the burst word state recognition model commonly used in emergency event detection methods is the Kleinberg state machine model. Its basic idea is to build a burst state recognition model over the time series of each word, and to identify the words that enter a burst state together with the duration of that state.

The Kleinberg state machine model assumes that the time interval $x$ between successive arrivals of the same word in the data stream follows an exponential distribution with density: (1) $f(x) = \alpha e^{-\alpha x}$

where α represents the arrival rate of the word. The transitions of a word between burst states are simulated by a state machine.

The Kleinberg state machine model comes in two variants: the two-state model and the infinite-state model. The following sections describe each in turn.

1)

Two-state Kleinberg state machine model

The two-state Kleinberg state machine model divides the states of a word into a non-burst state q₀ and a burst state q₁. The arrival rate of the word is α in the non-burst state q₀ and s*α in the burst state q₁, where s > 1 is a scaling factor.

The two-state Kleinberg state machine model is shown in Figure 2. The hollow circles in Fig. 2 represent the states of the word: the circle on the left is the non-burst state and the circle on the right is the burst state. The word moves from one state to the other with probability p.

In the state machine model, the duration of a burst of a word is the time between its transition from the non-burst state into the burst state and its transition back. For a word w, let x_t denote its time interval at moment t and s_t its state at moment t: s_t = q₀ means that w is in the non-burst state at moment t, and s_t = q₁ means that w is in the burst state at moment t.

Given the sequence of time intervals $\{x_1, \dots, x_t, \dots, x_T\}$ of w, the two-state Kleinberg state machine model yields the corresponding sequence of burst states $\{s_1, \dots, s_t, \dots, s_T\}$.

2)

Infinite-state Kleinberg state machine model

The infinite-state Kleinberg state machine model extends the two-state model by dividing the states of a word into infinitely many states. The arrival rate of the word in the unique non-burst state q₀ is α; among the remaining infinitely many burst states $\{q_1, \dots, q_i, \dots\}$, the arrival rate of the word in burst state q_i is sⁱ*α, where s > 1 is a scaling factor.

The sequence of burst states of the infinite-state Kleinberg state machine model and the corresponding sequence of time intervals are shown in Fig. 3. Given the observed sequence of time intervals $\{x_1, \dots, x_t, \dots, x_T\}$ of w, the infinite-state model yields the corresponding sequence of burst states $\{s_1, \dots, s_t, \dots, s_T\}$, where the value q_i of s_t is the state of w at moment t, in which the arrival rate of w is sⁱ*α.
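The two-state model described in 1) can be illustrated with a minimal Python sketch. The paper does not spell out how the state sequence is computed, so the sketch assumes the cost function of Kleinberg's original formulation: the negative log-density of each interval under equation (1), plus a penalty of ln((1-p)/p) for every change of state, minimized by dynamic programming. The function name `two_state_bursts`, the base-rate estimate α = T / Σx_t, the choice to start in q₀, and the default values of s and p are illustrative assumptions, not the authors' implementation.

```python
import math


def two_state_bursts(intervals, s=2.0, p=0.1):
    """Return one state per interval: 0 = non-burst (q0), 1 = burst (q1)."""
    # Assumed base rate: number of arrivals divided by the total time spanned.
    alpha0 = len(intervals) / sum(intervals)
    rates = (alpha0, s * alpha0)
    switch = math.log((1 - p) / p)  # cost charged for each change of state

    def emit(state, x):
        # Negative log of the exponential density alpha * exp(-alpha * x).
        a = rates[state]
        return a * x - math.log(a)

    # cost[q] = cheapest cost of any state sequence that ends in state q.
    # The word is assumed to start in q0, so starting in q1 costs one switch.
    cost = [0.0, switch]
    back = []
    for x in intervals:
        new_cost, pointers = [], []
        for cur in (0, 1):
            cands = [cost[prev] + (switch if prev != cur else 0.0)
                     for prev in (0, 1)]
            best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best] + emit(cur, x))
            pointers.append(best)
        back.append(pointers)
        cost = new_cost

    # Trace the cheapest path backwards from the last interval.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states
```

Short intervals (frequent arrivals) are labeled as burst only when the gain in likelihood outweighs the cost of entering and leaving the burst state, which is what keeps isolated short intervals from being reported as bursts. The infinite-state variant would replace the two rates with the family sⁱ*α.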

2.2.2

Text Sentiment Analysis

Mining the sentiment behind text data effectively is an important task in microblog opinion analysis, and sentiment polarity analysis is an important branch of sentiment analysis. The naive Bayes algorithm is a classical machine learning algorithm that builds on Bayes' theorem and assumes that features are conditionally independent of one another; it computes the probabilities of an object's features and uses them to determine the object's category. The algorithm estimates the prior probability of each category and the probability of each feature within a category from statistics of the corpus, computes the posterior probability that the current text belongs to each category, and takes the category with the highest posterior probability as the prediction. Because naive Bayes is cheap to train, uses few computational resources, and is robust to irrelevant features in the text, it often performs better than complex models that are prone to overfitting, and it is widely used in text sentiment analysis. The principle of the algorithm is as follows:

The text to be analyzed, $d = \{w_1, w_2, \dots, w_n\}$, belongs to one category in the set of sentiment polarity categories, defined as $C = \{c_1, c_2, c_3\}$, whose elements represent positive, neutral and negative sentiment respectively. The naive Bayes algorithm assumes that the sentiment features in the sentiment feature model are independent of one another, which removes the need to estimate their joint probability distribution. Combined with the weights of the sentiment feature words, this gives the sentiment classification formula in equation (2), where $P(c_j)$ is the prior probability of sentiment polarity category $c_j$ and $P(w_i, c_j)$ is the posterior probability of sentiment feature word $w_i$ in sentiment polarity category $c_j$. (2) $c = \underset{c_j \in C}{\arg\max} \left\{ P(c_j) \prod_{i=1}^{n} P(w_i, c_j) \right\}$

The prior probability $P(c_j)$ is the initial probability of the sentiment polarity category $c_j$. The category prior probabilities are computed from the annotated training corpus and are estimated as shown in equation (3), where $Doc(c_j)$ is the number of documents in the corpus that belong to category $c_j$. (3) $P(c_j) = \frac{Doc(c_j)}{\sum_{c_k \in C} Doc(c_k)}$

The posterior probability $P(w_i, c_j)$ is the probability that the sentiment feature word $w_i$ occurs in category $c_j$. It is calculated as shown in Equation (4), where $Count(w_i, c_j)$ is the number of occurrences of the word $w_i$ in the texts belonging to category $c_j$. Because the training set does not necessarily contain every sentiment feature word of the text to be analyzed, a word $w_i$ may not appear in the texts of any category. Its posterior probability is then 0 for every category (and the denominator may itself be 0), so classification cannot proceed normally. Laplace smoothing is therefore applied: a small value δ is added to each count, so that no posterior probability is 0 while the estimates are only slightly affected. In the experiment δ is set to 1. (4) $P(w_i, c_j) = \frac{Count(w_i, c_j) + \delta}{\sum_{k=1}^{n} \left( Count(w_k, c_j) + \delta \right)}$
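Equations (2) to (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `train_nb` and `classify`, the toy corpus in the usage note, the use of log-probabilities to avoid numerical underflow, and the decision to skip words that never occurred in training are all assumptions. The sum in the denominator of equation (4) is taken over the training vocabulary.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, delta=1.0):
    """Estimate log-priors (eq. 3) and smoothed log-likelihoods (eq. 4)."""
    vocab = sorted({w for d in docs for w in d})
    doc_count = Counter(labels)
    word_count = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_count[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in doc_count.items()}
    log_like = {}
    for c in doc_count:
        # Denominator of eq. (4): smoothed counts summed over the vocabulary.
        denom = sum(word_count[c][w] + delta for w in vocab)
        log_like[c] = {w: math.log((word_count[c][w] + delta) / denom)
                       for w in vocab}
    return log_prior, log_like


def classify(doc, log_prior, log_like):
    """Return the category that maximizes eq. (2), computed in log space."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped (an assumption).
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc
                                       if w in log_like[c])
    return max(scores, key=scores.get)
```

For example, training on five hypothetical tokenized posts labeled `"pos"`, `"neg"` and `"neu"` and then calling `classify(["hero", "love"], log_prior, log_like)` returns the label whose smoothed word probabilities best explain the two words. With δ = 1, a word that never occurred in a category still receives a small non-zero probability.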

2.2.3

Text clustering analysis

Mining latent information from unlabeled data is usually done by clustering, an unsupervised learning method. Its principle is to analyze the similarity of text features and place similar objects in the same cluster, so that data within a cluster are as similar as possible and data in different clusters are as different as possible, which yields an unsupervised partition of the unlabeled data. The results of cluster analysis can be visualized by drawing word clouds.

The TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is a statistical method that determines the weight of a word from the number of times it appears in the bag of words. Its principle is that the higher the frequency of a word in a document, the stronger the correlation between the word and the document. Because common words inevitably have high frequencies, the inverse document frequency is used to reduce their weight, as shown in Equation (5): (5) $TFIDF_{ij} = TF_{ij} \times IDF_{ij}$

The TF-IDF value is thus the product of the term frequency TF and the inverse document frequency IDF, and it is used to assess the importance of a word for the text in which it occurs: the larger the TF-IDF value, the more important the word is in the text. The importance of a word increases in proportion to the number of times it appears in the text and decreases in inverse proportion to its frequency across the corpus.
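As a small worked illustration of equation (5), the sketch below computes TF-IDF weights for tokenized documents in plain Python. The paper does not state which TF and IDF variants it uses, so the definitions here (TF as the relative frequency of the word in the document, IDF as ln(N / df) with N documents and document frequency df) are assumptions, as is the function name `tf_idf`.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return weights
```

Under these definitions a word that occurs in every document receives weight 0 however often it is repeated, which is the down-weighting of common words that the inverse document frequency is meant to achieve.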

The K-means algorithm is a partition-based clustering algorithm. After the number of clusters k and k initial centers are selected, the distance from each object to every center is computed according to the objective function and cosine similarity, and each object is assigned to the nearest center. The centers are then updated, and the process iterates until the maximum number of iterations is reached or the cluster centers no longer change. The number of clusters k can be determined from the SSE (sum of squared errors). The K-means algorithm converges quickly, is simple, handles large data sets efficiently, and scales well, which makes it suitable for analyzing large-scale online data on mass incidents.
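The iteration described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the first k points are used as initial centers (the paper does not say how centers are initialized), assignment uses cosine similarity, centers are updated to the mean of their members, and the SSE is reported as the sum of squared Euclidean distances to the assigned center. The names `cosine` and `kmeans` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans(points, k, max_iter=100):
    """Return (labels, centers, sse) for a list of numeric vectors."""
    centers = [list(p) for p in points[:k]]  # assumed initialization
    labels = None
    for _ in range(max_iter):
        # Assign every point to the most similar center.
        new_labels = [max(range(k), key=lambda j: cosine(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sum((x - c) ** 2 for x, c in zip(p, centers[lab]))
              for p, lab in zip(points, labels))
    return labels, centers, sse
```

Running the function for several values of k and plotting the returned SSE against k is one common way to choose the number of clusters, by looking for the point where the curve stops dropping sharply.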

The DBSCAN algorithm is a density-based clustering algorithm. Starting from a point, it takes the point as the center and Eps as the radius to obtain the point's Eps-neighborhood; if the neighborhood contains at least MinPts points (the minimum number of points), the point is a core object. The cluster is then expanded to all points that are density-reachable from it, giving a maximal region of core points and border points in which any two points are density-connected. The process is repeated until all points have been processed. Compared with other algorithms, DBSCAN does not require the number of clusters to be fixed in advance, identifies outliers as noise rather than forcing them into a cluster, and places no restriction on the shape or size of the clusters.
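The expansion procedure can be illustrated with a short pure-Python sketch. It assumes Euclidean distance, counts a point as part of its own Eps-neighborhood, and labels noise as -1; the function name `dbscan` and these conventions are illustrative choices, not details given in the paper.

```python
import math


def dbscan(points, eps, min_pts):
    """Return a cluster id per point; -1 marks noise."""
    labels = [None] * len(points)  # None = not yet visited

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core: border point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels
```

With two tight groups of three points and one distant point, `eps=1.5` and `min_pts=3` give two clusters and mark the distant point as noise, without the number of clusters having been specified.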

3

Examples of evolutionary analysis of public opinion on the Internet in public emergencies

The 2021 Zhengzhou 720 Extraordinarily Heavy Rainstorm Disaster was selected as the research object for the text clustering case analysis of online public opinion.

3.1

Data collection and processing

3.1.1

Microblog Data Crawling

1)

Crawling Scope

According to the analysis steps, the heat of public opinion is calculated at a later stage by counting the relevant elements of the microblogs, so the numbers of original microblogs, comments, and retweets need to be collected.

In terms of content, the case study performs topic extraction and sentiment analysis on the content of microblogs posted by users. Microblog content comprises original microblogs and comments (including retweets with comments). An original microblog directly expresses the user’s own attitude and opinion towards the event, whereas comments evaluate the views of others. A single original microblog may attract tens of thousands of comments, a volume that is too large to analyze, so only original microblogs are selected for content analysis.

2)

Crawling method

The Octopus collector was used for data collection. Octopus is a leading Internet data collection platform in China that converts unstructured web page data into structured data and stores them in forms such as a database or Excel. By configuring simple rules, it captures data from web pages and generates customized, regular data formats. It is implemented in C# on Windows and is based on a browser with the Firefox kernel; it extracts web page content fully automatically by simulating human operations (such as opening a web page or clicking a button on it).

For the case studied in this paper, a total of 81,166 records were obtained, including 52,099 original tweets, 4,068 retweets, and 25,279 comments.

3.1.2

Calculation of Internet Public Opinion Heat

The heat of online public opinion refers to the intensity of public opinion in cyberspace, which is driven by media reports, discussions among netizens, and the specific actions taken by the government in response. Visualizing the relevant data as a line graph makes it possible to delineate the stages of public opinion clearly. The relevant elements are the number of original tweets $W_{1,i}$, the number of retweets $W_{2,i}$, and the number of comments $W_{3,i}$. The daily public opinion heat $O_i$ is calculated as: (6) $O_i = b_1 \times W_{1,i} + b_2 \times W_{2,i} + b_3 \times W_{3,i}$

where i denotes the i-th day, and b₁, b₂ and b₃ are the weights of the numbers of original tweets, retweets and comments, respectively.

Since the daily numbers of microblogs, retweets and comments may reach the hundreds or thousands in practice, the data are normalized with the extreme value (min-max) method to obtain Equation (7): (7) $O_i' = \frac{O_i - \min(O_i)}{\max(O_i) - \min(O_i)} \times 100\%$

For the values of weights in Equation (6), b₁ = 0.4, b₂ = 0.35, and b₃ = 0.25 are taken to get the values of O_i and O_i′. Some of the web public opinion heat values are shown in Table 1.
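Equations (6) and (7) can be sketched directly in Python. The weights are those stated above; the function names `opinion_heat` and `min_max` and the daily counts in the usage note are hypothetical and are not taken from the collected data. The normalized value is returned as a fraction between 0 and 1, which corresponds to the percentage in Equation (7) divided by 100.

```python
def opinion_heat(daily_counts, weights=(0.4, 0.35, 0.25)):
    """Eq. (6): weighted sum of (original, retweet, comment) counts per day."""
    b1, b2, b3 = weights
    return [b1 * orig + b2 * fwd + b3 * com for orig, fwd, com in daily_counts]


def min_max(values):
    """Eq. (7): rescale values to [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All days equally hot: the ratio is undefined, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For three hypothetical days with counts `(100, 10, 40)`, `(200, 20, 80)` and `(50, 0, 20)`, `opinion_heat` gives 53.5, 107.0 and 25.0, and `min_max` maps the hottest day to 1 and the coolest to 0, so that days can be compared on a common scale.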

Table 1.

Online public opinion heat value (part)

Date	Original tweets	Retweets	Comments	O_i	O_i′
0720	812	25	1628	411.8	0.06029
0721	3729	64	4729	734.3	0.16725
0722	10283	2547	6248	8403.6	0.96792
0723	12272	1164	6027	8684.4	1
0724	8281	75	3143	3989.1	0.65322
0725	7934	42	1983	2933.6	0.54855
0726	4032	50	732	512.9	0.20316
0727	2224	44	461	83.8	0.12149
0728	1034	32	223	33.1	0.12883
0729	623	10	56	21.4	0.03946
0730	324	7	30	17.5	0.02181
0731	182	5	12	16.7	0.01486
0801	148	3	4	13.3	0.00740
0802	137	0	2	0	0.00083
0803	84	0	1	0	0.00034

3.1.3

Opinion Stage Classification

Based on the results of the previous step, a line graph is drawn with the date on the horizontal axis and the columns O_i and O_i′ as values; the resulting trend of public opinion heat is shown in Figure 4. The public opinion heat is divided into stages according to the trend of the line.

As shown in the figure, July 20 to July 21 was the initial period of public opinion, with a total of 4,541 original microblogs, 89 retweets, and 6,357 comments; July 22 to July 25 was the outbreak period, with 38,770 original microblogs, 3,828 retweets, and 17,401 comments; July 26 to July 28 was the recurrence period, with 7,290 original microblogs, 126 retweets, and 1,416 comments; and July 29 to August 3 was the slowdown period, with 1,498 original microblogs, 25 retweets, and 105 comments.

3.1.4

Data cleansing

Data cleansing is the processing of non-compliant content; it is the final procedure for detecting and correcting identifiable errors in data files, including the handling of incomplete data, erroneous data, and duplicate data. The ultimate goal of data cleansing is to improve the quality of the data. Data cleansing in this paper addresses two kinds of data: duplicate data and invalid data. Duplicate data arise when, because of possible errors in the operation of the data collector, the result contains two or more completely identical microblog texts posted by the same user at the same time. Invalid data are microblog texts that are completely unrelated to the topic of the study. Sina Weibo has a topic function, usually denoted by #topic#, under which netizens can post microblogs to express their opinions and views. However, posting behavior is highly subjective, and “rubbing on the topic” is inevitable, i.e., posting content completely unrelated to the topic in order to piggyback on its popularity, for example, publicizing the latest movie of a certain celebrity under the topic of the public emergency studied in this paper. Such data make text processing less accurate and less effective, so they need to be deleted.

After the duplicate and invalid data were cleaned manually, a total of 39,027 original tweets remained, of which 4,173 were in the initial period, 29,361 in the outbreak period, 4,730 in the recurrence period, and 763 in the slowdown period. The large difference between the cleaned data and the collected data arises because, as the analysis of the original data in an Excel table shows, “rubbing on the topic” occurs in every stage of public opinion, and a substantial share of the collected microblogs do not discuss the topic itself.
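The cleaning in this paper was done manually. As a hedged illustration of how the two rules could be pre-screened automatically, the sketch below removes posts that repeat the same user, time and text, and keeps only posts that mention at least one topic keyword. The record fields (`user`, `time`, `text`), the function name `clean`, and the keyword test for relevance are all assumptions made for the example; a keyword filter is only a rough stand-in for the manual relevance judgment described above.

```python
def clean(posts, topic_keywords):
    """Drop exact duplicates and posts that mention no topic keyword."""
    seen = set()
    kept = []
    for post in posts:
        # Duplicate rule: same user, same posting time, identical text.
        key = (post["user"], post["time"], post["text"])
        if key in seen:
            continue
        seen.add(key)
        # Relevance rule (assumed): at least one keyword occurs in the text.
        if any(keyword in post["text"] for keyword in topic_keywords):
            kept.append(post)
    return kept
```

Posts that pass the keyword test would still need manual review, since a post can name the event while promoting unrelated content.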

3.2

TF-IDF Algorithm to Extract Topic Words

In this case, the TF-IDF values of the words in each microblog are calculated to derive candidate topic words; the number of times each topic word appears in the whole corpus is then counted (the microblog texts of each stage form one corpus, so there are 4 corpora in total), and the 100 most frequent topic words are selected as the topic words of the corpus. The topic words are listed by number of occurrences from highest to lowest.
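The procedure above can be sketched as follows for one stage corpus. The paper does not say how many words per microblog are kept as candidates or how TF and IDF are defined, so the parameter `per_tweet`, the relative-frequency TF, the ln(N / df) IDF, alphabetical tie-breaking, and the function name `stage_topic_words` are assumptions made for illustration.

```python
import math
from collections import Counter


def stage_topic_words(tweets, per_tweet=3, top_n=100):
    """Rank topic words of one stage corpus by how often they are selected."""
    n = len(tweets)
    df = Counter(w for tweet in tweets for w in set(tweet))
    topic_counts = Counter()
    for tweet in tweets:
        tf = Counter(tweet)
        scores = {w: (c / len(tweet)) * math.log(n / df[w])
                  for w, c in tf.items()}
        # Keep the highest-scoring words of this tweet as its topic words;
        # ties are broken alphabetically so the result is deterministic.
        best = sorted(scores, key=lambda w: (-scores[w], w))[:per_tweet]
        topic_counts.update(best)
    # Topic words sorted by number of occurrences, highest first.
    return topic_counts.most_common(top_n)
```

Calling the function once per stage, with that stage's tokenized microblogs, would give four ranked lists of the kind summarized in Table 2.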

Table 2 shows that the high-frequency words in the initial period are “heavy rain” (273), “early warning” (231), “Zhengzhou” (219), “weather” (210), “subway” (187), “waterlogging” (162), “suspension” (98), “notice” (67), “traffic” (52), and “meteorological bureau” (21). The high-frequency words in the outbreak period are “trapped”, “tunnel”, “information”, “distress”, “come on”, “casualties”, “rescue”, “Line 5”, “firefighter”, and “submerged”. The high-frequency words in the recurrence period are “volunteer”, “donation”, “reconstruction”, “accountability”, “investigation”, “insurance”, “restoration”, “drainage”, “resettlement”, and “victims”. The high-frequency words in the slowdown period are “flood control”, “rectification”, “commemoration”, “sponge city”, “policy”, “flower offering”, “R.I.P.”, and “restoration”.

Table 2.

Example of results of topic term frequency

Initial period		Outbreak period		Recurrence period		Slowdown period
Keyword	Count	Keyword	Count	Keyword	Count	Keyword	Count
Heavy rain	273	Trapped	583	Volunteer	146	Flood control	37
Early warning	231	Tunnel	422	Donation	93	Rectification	30
Zhengzhou	219	Information	394	Reconstruction	90	Commemoration	26
Weather	210	Distress	361	Accountability	74	Sponge city	20
Subway	187	Come on	347	Investigation	69	Policy	19
Waterlogging	162	Casualties	328	Insurance	66	Flower offering	15
Suspension	98	Rescue	287	Restoration	57	R.I.P.	14
Notice	67	Line 5	253	Drainage	48	Restoration	10
Traffic	52	Firefighter	218	Resettlement	39
Meteorological bureau	21	Submerged	184	Victims	31

3.3

Sentiment Analysis of Internet Public Opinion Texts

3.3.1

Constructing an Exclusive Emotional Dictionary

The NTUSD sentiment dictionary was used as the seed, together with the text sentiment analysis method of Section 2.2.2. In the results of the initial calculation, the negative sentiment words include words that are strongly related to the research topic, such as “United States”, “China”, “China and the United States”, and “trade war”, with high weights, so these words need to be removed. In addition, some clearly positive sentiment words, such as “come on”, are classified as negative after the calculation, which would affect the accuracy of the subsequent calculation of text sentiment intensity. To address this, this paper aggregates the NTUSD, Hownet and BosonNLP sentiment dictionaries and other sentiment dictionaries available online to build a positive sentiment thesaurus and a negative sentiment thesaurus, respectively.

The positive and negative sentiment dictionaries obtained with the naive Bayes algorithm of Section 2.2.2 are then compared against this thesaurus. If a word is a positive sentiment word in the thesaurus but a negative sentiment word after the computation, it is corrected by adding 0.3 to its score. If the corrected score is greater than 0, the word is placed in the positive sentiment dictionary; if it is still less than 0, the word remains in the negative sentiment dictionary. The corrected score is used as the final sentiment score of the word.
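The correction rule can be written as a short Python sketch. The 0.3 adjustment and the sign test follow the description above; the function name `correct_scores`, the dictionary-based data layout, the example scores in the usage note, and the handling of a score of exactly 0 (kept on the negative side here, a case the text does not cover) are assumptions.

```python
def correct_scores(computed, positive_lexicon, bonus=0.3):
    """Split computed word scores into positive and negative dictionaries.

    computed: {word: score} from the classifier, negative scores meaning
    negative sentiment. positive_lexicon: words the aggregated thesaurus
    lists as positive.
    """
    positive, negative = {}, {}
    for word, score in computed.items():
        # Correct words that the thesaurus marks positive but that were
        # scored as negative by the computation.
        if word in positive_lexicon and score < 0:
            score = score + bonus
        # Assumption: a score of exactly 0 stays in the negative dictionary.
        target = positive if score > 0 else negative
        target[word] = score
    return positive, negative
```

With hypothetical scores, a word such as “come on” computed at -0.2 moves to the positive dictionary with a score of about 0.1, while a thesaurus-positive word computed at -0.5 is raised to -0.2 and stays negative.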

After several experiments to determine the final positive and negative sentiment words, a total of 2,831 sentiment words were obtained, of which 1,572 are positive and 1,259 are negative. The top 10 positive and negative sentiment words are shown in Fig. 5 and Fig. 6, respectively.

As can be seen from the figures, in the sentiment analysis of online public opinion texts about the 2021 “Zhengzhou 720 Heavy Rainstorm Disaster”, the top 10 positive sentiment words mostly come from praise of mutual aid, rescue, and national strength, reflecting social cohesion; “love”, “positive energy”, and “hero” have the highest sentiment scores, at 0.953, 0.935 and 0.915, respectively. Negative sentiment words focus on the destructiveness of the disaster itself and the fear and grief of the people, reflecting the severity of the disaster; “disaster” and “ruthlessness” have the strongest scores, at -0.947 and -0.938, respectively.

3.3.2

Analysis of overall trends in sentiment

Figure 7 shows the time series of microblog sentiment intensity. The overall sentiment intensity from July 22 to July 26 is higher than in other periods. Positive sentiment intensity peaked on July 23, with 2,371 positive-sentiment posts that day; the second peak occurred on July 22, with 2,246 posts. Negative sentiment is strongest in the initial period of public opinion and reaches its trough on July 21, with 1,504 negative-sentiment posts that day. The intensity of negative sentiment fluctuates little as public opinion breaks out. The negative sentiment score is about half of the positive one, so during the evolution of public opinion most sentiments expressed on Weibo are positive, and positive sentiment is much more intense than negative sentiment.

3.4

Effect of Thematic Dimension on Clustering Results

The online public opinion analysis model for public emergencies constructed in this paper is unsupervised, and the topic dimension must be specified in advance when performing topic modeling. Because different numbers of topics lead to different estimates and representations, the topic dimension affects the performance of the model. To verify the effectiveness of the proposed topic model for mining opinion topic words, the effect of different topic dimensions on K-means clustering is compared at equal numbers of iterations. To reduce experimental error, the experiment is repeated 25 times for each number of iterations and the average is taken as the final value. The changes in F-measure and purity as the topic dimension varies are shown in Fig. 8 and Fig. 9, respectively.

The F-measure values coincide at 6 iterations, where F is 0.5398 for both topic dimensions K=100 and K=200. When the number of iterations is small, the clustering effect decreases slightly as the topic dimension increases. As the number of iterations grows, the K=200 curve becomes progressively smoother, and its F-measure and purity values are higher than those for K=100, indicating a better clustering effect.
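The paper does not define the two evaluation measures, so the sketch below uses definitions that are common in the clustering literature and should be read as an assumption: purity is the share of objects that belong to the majority class of their cluster, and the F-measure is the class-size-weighted average, over reference classes, of the best F1 score that any cluster achieves for that class. The function names are hypothetical.

```python
from collections import Counter


def purity(clusters, classes):
    """Share of objects that fall in the majority class of their cluster."""
    by_cluster = {}
    for k, c in zip(clusters, classes):
        by_cluster.setdefault(k, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(classes)


def f_measure(clusters, classes):
    """Class-weighted best F1 between each reference class and any cluster."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck:
                precision, recall = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total
```

Both measures equal 1 when every cluster coincides with exactly one reference class, and both require reference labels, so they apply only to the labeled portion of the data.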

3.5

Analysis of the evolution of hot topics

To further explore the topics that Internet users focused on during the public emergency and so clarify the direction of public opinion, the study used the K-means and DBSCAN algorithms to conduct cluster analysis of the comment corpus, and constructed the theme river diagram shown in Figure 10, with the daily proportion of topic discussions as the index, to show the hot topics that users paid attention to.

According to the text clustering results, the top 10 hot topic words were “heavy rain”, “subway”, “waterlogging”, “trapped”, “come on”, “rescue”, “firefighter”, “casualties”, “reconstruction” and “drainage”, which ran through three stages with different proportions and frequencies. As shown in Figure 10, in the first stage, the disaster period (July 20 to July 22), the discussion focused on the real-time situation of the rainstorm and of the people trapped in the subway; the words with the highest proportions were “heavy rain” and “subway”, and a large number of on-site videos and distress messages appeared on social media. In the second stage, the rescue period (July 23 to July 26), attention turned to the progress of rescue, and the discussion of non-governmental mutual assistance and official rescue made “rescue”, “come on” and “casualties” frequent topics. The third stage, the reflection and reconstruction period (July 27 to August 3), focused on policy adjustment and urban governance, such as the release of the State Council’s investigation report, the raising of flood control standards, and the improvement of early warning mechanisms, with “reconstruction” and “drainage” as the core topics.
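The index behind the theme river diagram, the daily proportion of discussion per topic, can be computed as in the sketch below. The input format (one `(day, topic)` pair per clustered post), the function name `daily_topic_share`, and the example records in the usage note are assumptions; drawing the diagram itself is left to a plotting library.

```python
from collections import Counter, defaultdict


def daily_topic_share(records):
    """Return {day: {topic: share}} from (day, topic) pairs, days sorted."""
    per_day = defaultdict(Counter)
    for day, topic in records:
        per_day[day][topic] += 1
    shares = {}
    for day in sorted(per_day):
        total = sum(per_day[day].values())
        # Each topic's share of that day's posts; shares of a day sum to 1.
        shares[day] = {topic: count / total
                       for topic, count in per_day[day].items()}
    return shares
```

For instance, three hypothetical posts on day `"0720"` with topics `"heavy rain"`, `"heavy rain"` and `"subway"` give shares of 2/3 and 1/3 for that day; stacking these shares over consecutive days gives the bands of the theme river.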

4

Conclusion

This paper constructs a text clustering model of online public opinion for public emergencies. Taking the evolution of public opinion on the “Zhengzhou 720 rainstorm” as the research object, clustering reveals three stages: the initial period is dominated by disaster descriptions such as “heavy rain” and “subway”, the outbreak period focuses on action responses such as “trapped” and “rescue” (“rescue” occurs 287 times), and the reflection and reconstruction period turns to governance issues such as “drainage” and “policy”. Sentiment analysis shows that the strength of positive sentiment words (e.g., “love” and “hero”, peak 0.953) is significantly higher than that of negative words (e.g., “disaster” and “ruthlessness”), reflecting the dominant role of social cohesion. Experimental results show that the clustering effect is best when the topic dimension is K=200 (F-measure=0.5398, Purity=0.812). However, the “rubbing on the topic” behavior found during data cleaning leads to a cleaning rate of 51.9% of the original data (39,027 entries are retained after cleaning), which highlights the challenge of noise processing. This paper provides theoretical support and new ideas for the study of public opinion on major public events.


Yaxian Qiu

Hui Han

Published online: 25 Sep 2025

Received: 31 Jan 2025

Accepted: 10 May 2025

DOI: https://doi.org/10.2478/amns-2025-1018

Keywords: Sudden public events, Online public opinion, Sentence-level similarity, Sentiment analysis, Text clustering

© 2025 Yaxian Qiu and Hui Han, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
