A Clustering Study of Online Public Opinion Texts on Public Emergency Events Based on Sentence-Level Similarity and Sentiment Analysis
Published online: 25 Sep 2025
Received: 31 Jan 2025
Accepted: 10 May 2025
DOI: https://doi.org/10.2478/amns-2025-1018
© 2025 Yaxian Qiu and Hui Han, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the development of the Internet, social media has become an important platform for people to obtain information and express their opinions, and both the outbreak of public emergencies and the dissemination of public opinion have become faster and more widespread [1-2]. Public emergencies are events that occur suddenly and may have a significant impact on public life, property and the public interest. Because they are sudden and unpredictable, such events often trigger strong public concern and emotional fluctuation, forming a public opinion field [3-6]. While the Internet and social media make information spread faster and further, they also make it easy for diverse voices to gather and be amplified, which triggers online public opinion [7-8]. Since emergencies are often accompanied by disruption of the original information channels at the center of the event, inaccurate speculation and false news readily fill the information vacuum [9-11]. The traditional approach to public opinion guidance, which is one-directional and inefficient, cannot cope with online public opinion that forms rapidly on online platforms, is emotionally charged, comes from uncontrolled sources, and exerts one-sided public pressure [12-15].
Text clustering, as an unsupervised machine learning method, is highly flexible and automated because it requires neither labeled training data nor manual annotation of the categories of online texts in advance, and it has become the main technical tool for organizing textual information and mining hot topics in online public opinion analysis systems [16-19]. Since a massive number of keywords make up the set of feature words when text features are built, the dimensionality of the text feature vector should be reduced before the text is clustered. The usual method is to rank all feature words by weight and select a predetermined number of the best feature words as the resulting feature subset [20-23]. Another common method exploits a characteristic of online information: it generally comes with a headline that summarizes the main content of the body, so the headline can also be used as the text feature [24-26].
Online public opinion on public emergencies spreads quickly and is emotionally polarized, and traditional analysis methods struggle to mine the laws of topic evolution accurately. Existing studies mostly rely on single text features or static sentiment classification, ignoring the impact of sentence-level semantic associations and dynamic changes in sentiment on the clustering effect. For this reason, this paper proposes an opinion text clustering framework that integrates sentence-level similarity calculation and sentiment analysis. At the method level, a real-time burst word recognition algorithm based on an incremental Kleinberg model is proposed to recognize burst words in real time from streaming appeal records, which provides clues for subsequent emergency detection. The framework also uses the TF-IDF algorithm to extract topic words; adopts an improved naive Bayes algorithm that incorporates multiple sentiment lexicons (NTUSD, HowNet and others) to improve sentiment classification accuracy and to address the misclassification of words by the traditional model; realizes multi-stage topic clustering through the K-means and DBSCAN algorithms; and visualizes the evolution of hotspots with a theme river chart.
The public opinion analysis system for public emergencies is a network platform with which government departments and units monitor and analyze public opinion. It should effectively analyze and monitor different types of public opinion information on public emergencies, quickly identify and track hotspots and sensitive topics, accurately capture how public opinion evolves and predict its trend, and support government decisions on guiding public opinion in a targeted and constructive way. Based on these general requirements, this paper proposes a web-based text mining system model for analyzing public opinion on public emergencies. The model consists of four layers: the web public opinion collection layer, the public opinion data preprocessing layer, the public opinion analysis layer and the system application layer. The model architecture is shown in Figure 1.

Figure 1. Network public opinion analysis system model for public emergencies
Identifying the burst state of words provides important clues for determining the time and content of emergencies. When an unexpected event occurs, the frequency of certain words closely related to it rises sharply, so recognizing the burst state of words in the data stream makes it possible to delineate the time and semantic content of potential emergencies.
At present, the model commonly used for burst word state recognition in emergency detection is the Kleinberg state machine model. Its basic idea is to build a burst state recognition model for the time series of each word, and to identify the words that are in a burst state together with the period over which the burst lasts.
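The Kleinberg state machine model assumes that the time interval $x$ between successive arrivals of a word follows an exponential distribution with density $f(x) = \alpha e^{-\alpha x}$, where $\alpha > 0$ is the arrival rate of the word, so that the expected interval is $1/\alpha$.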
The Kleinberg state machine model is divided into the two-state Kleinberg state machine model and the infinite-state Kleinberg state machine model. The following sections describe the implementation of these two state machine models respectively.
Two-state Kleinberg state machine model. The two-state Kleinberg state machine model divides the states of the vocabulary into a non-burst state $q_0$ and a burst state $q_1$. The model is shown in Figure 2. The hollow circles outlined in black in Fig. 2 represent the states of the vocabulary: the circle on the left indicates that the vocabulary is in the non-burst state, and the circle on the right indicates that it is in the burst state. The states transfer to each other with probability $p$.

Figure 2. Two-state Kleinberg state machine model
The duration of a word's burst in the state machine model is the time interval between the transition from the non-burst state into the burst state and the transition back. Given the sequence of time intervals of a word, the model infers the state sequence that best explains those intervals, and the periods assigned to the burst state are the word's bursts.
Infinite-state Kleinberg state machine model. The infinite-state Kleinberg state machine model extends the two-state model by dividing the states of the vocabulary into infinitely many states: a unique non-burst state and an unbounded series of burst states of increasing intensity, each with its own arrival rate. The sequence of burst states of the infinite-state model, together with the corresponding sequence of time intervals, is shown in Fig. 3.

Figure 3. Infinite-state Kleinberg state machine model
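To make the two-state model concrete, the following is a minimal sketch of burst detection as a minimum-cost state sequence, found by dynamic programming. It follows Kleinberg's standard batch formulation, not the incremental variant proposed in this paper. The function name `two_state_bursts`, the rate ratio `s`, the transition penalty `gamma`, and the choice of base rate (number of gaps divided by total time) are assumptions made for illustration.

```python
import math

def two_state_bursts(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap 0 (non-burst) or 1 (burst).

    gaps: positive time intervals between successive occurrences of a word.
    s, gamma: assumed rate ratio and transition penalty (not from the paper).
    """
    n = len(gaps)
    base_rate = n / sum(gaps)              # arrival rate of the non-burst state
    rates = [base_rate, s * base_rate]     # the burst state emits s times faster
    up_cost = gamma * math.log(n)          # cost of entering the burst state

    def gap_cost(state, x):
        # Negative log of the exponential density rate * exp(-rate * x).
        return rates[state] * x - math.log(rates[state])

    cost = [0.0, math.inf]                 # the sequence starts in the non-burst state
    back = []
    for x in gaps:
        new_cost, pointers = [], []
        for j in (0, 1):
            # Moving up (0 -> 1) is penalised; moving down or staying is free.
            options = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best = 0 if options[0] <= options[1] else 1
            new_cost.append(options[best] + gap_cost(j, x))
            pointers.append(best)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states
```

A run of short gaps in the middle of a sequence of long gaps is labelled as a burst only if the likelihood gained over the run outweighs the penalty for entering the burst state, which is what keeps isolated short gaps from being reported as bursts.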
Effectively mining the sentiment information behind text data is an important task in microblog opinion analysis, and sentiment polarity analysis is an important branch of sentiment analysis. The naive Bayes algorithm is a classical machine learning algorithm which, on the basis of Bayes' theorem, assumes that features are conditionally independent of one another given the category, computes the probability of an object's features, and determines the category of the object. It obtains the prior probability of each category and the conditional probability of each feature within a category through statistical analysis of corpus data, computes the posterior probability that the current text belongs to each category, and takes the category with the highest posterior probability as the predicted category. Because the naive Bayes algorithm has a low learning cost, uses few computational resources, and is robust to irrelevant features in the text, it can outperform complex models that easily overfit, and it is widely used in text sentiment analysis. The principle of the algorithm is as follows:
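The sentiment category of text $d$ is the category with the highest posterior probability:
$$\hat{c} = \arg\max_{c \in C} P(c \mid d)$$
A priori probability of category $c$, where $N_c$ is the number of training texts in category $c$ and $N$ is the total number of training texts:
$$P(c) = \frac{N_c}{N}$$
The a posteriori probability of category $c$ for a text $d$ with feature words $w_1, \dots, w_n$, under the conditional independence assumption:
$$P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$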
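The classification rule described above can be sketched in a few lines. This is a minimal multinomial naive Bayes classifier over tokenized texts, not the improved lexicon-based variant used in this paper. The function names, the add-one (Laplace) smoothing, and the decision to skip words that never occurred in training are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: one sentiment category per document."""
    class_counts = Counter(labels)          # used for the prior P(c)
    word_counts = defaultdict(Counter)      # used for the conditional P(w | c)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(tokens, model):
    """Return the category with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)                    # log prior
        denom = sum(word_counts[label].values()) + len(vocab)    # add-one smoothing
        for word in tokens:
            if word in vocab:                                    # skip unseen words
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numerical underflow when many small conditional probabilities are multiplied.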
Mining latent information from unlabeled data is usually done by clustering, an unsupervised learning method. Its principle is to analyze the similarity of text features and place similar objects in the same cluster, so that data within a cluster are as similar as possible and data in different clusters are as different as possible, which yields an unsupervised partition of the unlabeled data. The results of cluster analysis can be visualized by drawing word clouds.
The TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is a statistical method that determines the weight of a word from how often it appears in a text and across the corpus. The principle is that the higher the frequency of a word in a text, the stronger the correlation between the word and that text. Common words inevitably have high frequencies, so the inverse document frequency is used to reduce their weight, as shown in Equation (5):
$$\mathrm{IDF}_i = \log \frac{N}{n_i} \tag{5}$$
where $N$ is the total number of texts in the corpus and $n_i$ is the number of texts that contain word $i$.
The TF-IDF value is therefore the product of the term frequency TF and the inverse document frequency IDF. It is used to assess how important a word is to the text in which it appears: the larger the TF-IDF value, the more important the word. The importance of a word increases in proportion to the number of times it appears in the text and decreases in inverse proportion to how frequently it appears across the corpus.
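As an illustration of the weighting described above, the following sketch computes TF-IDF weights for a small tokenized corpus. It assumes the textbook form, with TF as the relative frequency of a word in a text and IDF as the logarithm of the corpus size divided by the number of texts containing the word; the paper's exact smoothing is not recoverable here, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {word: weight} dict per text."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))        # count each word once per text
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

A word that appears in every text, such as the name of the event itself, receives a weight of zero under this form, which is how common words are suppressed.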
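The K-means algorithm is a partition-based clustering algorithm. After the number of clusters $K$ is selected, $K$ initial cluster centers are chosen, each sample is assigned to its nearest center, and each center is recomputed as the mean of the samples assigned to it; the assignment and update steps are repeated until the assignments no longer change.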
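The following is a minimal sketch of K-means (Lloyd's iteration) on numeric feature vectors, for example TF-IDF vectors. Using the first `k` points as initial centers and squared Euclidean distance are assumptions made to keep the example deterministic; practical implementations usually use randomized or k-means++ initialization.

```python
def kmeans(points, k, max_iters=100):
    """points: list of equal-length numeric tuples. Returns (labels, centers)."""
    centers = [tuple(p) for p in points[:k]]    # assumed initialisation: first k points
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:                # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:                         # keep the old center if a cluster is empty
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers
```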
The DBSCAN algorithm is a density-based clustering algorithm. Starting from a point, it takes the point as the center and Eps as the radius to obtain the point's Eps-neighborhood. If the number of data points in the neighborhood is at least MinPts (the minimum number of points), the point is a core object, and the cluster is expanded to every point that is density-reachable from it. The result is a maximal region that contains core points and border points and in which any two points are density-connected. The process is repeated until all points have been identified. Compared with other algorithms, DBSCAN does not require the number of clusters to be fixed in advance, it identifies outliers as noise rather than forcing them into a cluster, and it does not restrict the shape or size of the clusters.
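The expansion procedure described above can be sketched as follows. This is a minimal, quadratic-time illustration using Euclidean distance, with each point counted in its own neighborhood; the function name and the label `-1` for noise are conventions chosen for the example, not taken from the paper.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [
            j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE               # may later be claimed as a border point
            continue
        cluster += 1                        # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster         # border point: reachable but not core
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # j is also core: keep expanding
                queue.extend(reachable)
    return labels
```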
The 2021 Zhengzhou 720 Extraordinarily Heavy Rainstorm Disaster was selected as the research object for the text clustering case analysis of online public opinion.
Crawling scope. According to the analysis steps, the heat of public opinion is calculated at a later stage by counting the relevant elements of the microblogs, so the numbers of original microblogs, comments and retweets need to be collected. In terms of content, the case is used for theme extraction and sentiment analysis of the microblogs posted by users. Microblog content comprises original microblogs and comments (including retweets with comments). An original microblog expresses the user's own subjective attitude and opinion towards the event, whereas a comment evaluates someone else's view. A single original microblog may attract tens of thousands of comments, a volume of data that is too large to analyze, so only the original microblogs are selected for content analysis.

Crawling method. The Octopus collector is used for data collection. Octopus is a leading Internet data collection platform in China; it converts unstructured web page data into structured data and stores them in forms such as a database or Excel. By configuring simple rules, it captures data from web pages and produces customized, regular data formats. It runs on Windows, is written in C#, and is based on a browser with the Firefox kernel; it extracts web page content fully automatically by simulating human operations such as opening a web page and clicking a button. For the case studied in this paper, a total of 81,166 records were obtained, including 52,099 original tweets, 4,068 retweets, and 25,279 comments.
The heat of online public opinion refers to how intensely an issue is discussed in cyberspace, and it is driven by media reports, discussions among netizens, and the specific actions the government takes in response. Visualizing the relevant data as a line graph allows the stages of public opinion to be clearly delineated. The relevant elements are the daily numbers of original tweets, retweets and comments, which Equation (6) combines with weights into the daily heat value $O_i$. Since the daily numbers of microblogs, retweets and comments may reach hundreds or thousands in practice, the data are simplified by the extreme value method to obtain the normalized heat value $O_i'$ of Equation (7).
Table 1. Online public opinion heat values (partial)
| Date | Original tweets | Retweets | Comments | $O_i$ | $O_i'$ |
|---|---|---|---|---|---|
| 0720 | 812 | 25 | 1628 | 411.8 | 0.06029 |
| 0721 | 3729 | 64 | 4729 | 734.3 | 0.16725 |
| 0722 | 10283 | 2547 | 6248 | 8403.6 | 0.96792 |
| 0723 | 12272 | 1164 | 6027 | 8684.4 | 1 |
| 0724 | 8281 | 75 | 3143 | 3989.1 | 0.65322 |
| 0725 | 7934 | 42 | 1983 | 2933.6 | 0.54855 |
| 0726 | 4032 | 50 | 732 | 512.9 | 0.20316 |
| 0727 | 2224 | 44 | 461 | 83.8 | 0.12149 |
| 0728 | 1034 | 32 | 223 | 33.1 | 0.12883 |
| 0729 | 623 | 10 | 56 | 21.4 | 0.03946 |
| 0730 | 324 | 7 | 30 | 17.5 | 0.02181 |
| 0731 | 182 | 5 | 12 | 16.7 | 0.01486 |
| 0801 | 148 | 3 | 4 | 13.3 | 0.00740 |
| 0802 | 137 | 0 | 2 | 0 | 0.00083 |
| 0803 | 84 | 0 | 1 | 0 | 0.00034 |
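The heat calculation can be sketched as a weighted sum followed by normalization. The weights in the example are placeholders, because the values used in this paper are not recoverable from this copy, and reading the "extreme value method" as min-max scaling to the range [0, 1] is an assumption. The output therefore illustrates the procedure and is not expected to reproduce the table above.

```python
def heat_values(daily_counts, weights=(0.5, 0.3, 0.2)):
    """daily_counts: list of (original tweets, retweets, comments) per day.

    weights: placeholder weights for the three elements (assumed, not the paper's).
    Returns (raw heat values, heat values rescaled to [0, 1]).
    """
    raw = [sum(w * c for w, c in zip(weights, row)) for row in daily_counts]
    lowest, highest = min(raw), max(raw)
    if highest == lowest:                   # all days equal: avoid division by zero
        return raw, [0.0] * len(raw)
    scaled = [(value - lowest) / (highest - lowest) for value in raw]
    return raw, scaled
```

With min-max scaling, the day with the largest raw heat maps to 1 and the day with the smallest maps to 0, so the stages of public opinion can be compared on a common scale.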
According to the results calculated in the previous step, a line graph of the heat values is drawn against the date.

Figure 4. Public opinion heat distribution and trend
As shown in the figure, July 20 to July 21 was the initial period of public opinion, with a total of 4,541 original microblogs, 89 retweets, and 6,357 comments; July 22 to July 25 was the outbreak period, with a total of 38,770 original microblogs, 3,828 retweets, and 17,401 comments; July 26 to July 28 was the recurrence period, with a total of 7,290 original microblogs, 126 retweets, and 1,416 comments; and July 29 to August 3 was the slowdown period, with a total of 1,498 original microblogs, 25 retweets, and 105 comments.
Data cleaning is the processing of non-compliant content: the procedure of detecting and correcting identifiable errors in data files, including incomplete data, erroneous data and duplicate data. Its ultimate goal is to improve data quality. Data cleaning in this paper targets two kinds of data, duplicate data and invalid data. Duplicate data arise when, owing to possible errors in the operation of the data collector, the final result contains two or more completely identical microblog texts posted by the same user at the same time. Invalid data are microblog texts that are completely unrelated to the topic of the study. Sina Weibo has a topic function, usually denoted by #topic#, under which netizens can post microblogs to express their opinions and views. However, posting behavior is highly subjective, and “rubbing on the topic” is inevitable: users post content completely unrelated to the topic itself, for example publicizing a celebrity's latest movie under the topic of the public emergency studied in this paper. Such data make text processing less accurate and less effective, so they need to be deleted.
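The duplicate rule stated above (the same user posting the same text at the same time) lends itself to automation; a minimal sketch follows. The record fields `user`, `time` and `text` are hypothetical names for the collector's output columns. Filtering out off-topic posts is not covered, since it requires judging relevance to the event.

```python
def drop_duplicates(posts):
    """Keep the first copy of each (user, time, text) combination.

    posts: list of dicts with assumed keys "user", "time" and "text".
    """
    seen = set()
    unique = []
    for post in posts:
        key = (post["user"], post["time"], post["text"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```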
After the duplicate and invalid data were cleaned manually, a total of 39,027 original tweets remained, of which 4,173 were in the initial period, 29,361 in the outbreak period, 4,730 in the recurrence period, and 763 in the slowdown period. The large difference between the cleaned data and the collected data arises because analysis of the raw data in an Excel table showed that “rubbing on the topic” occurs in every stage of public opinion, although the microblogs that discuss the topic itself still far outnumber the unrelated ones.
The idea of the TF-IDF computation in this case is to calculate the TF-IDF value of the words in each tweet to derive candidate topic words, then count the number of times each topic word appears in the whole corpus (the microblog texts of each stage form one corpus, so there are 4 corpora in total), and select the 100 most frequent topic words as the topics of that corpus. The number of occurrences of each topic word is displayed from highest to lowest.
Table 2 shows that the high-frequency words in the initial period were “heavy rain” (273), “early warning” (231), “Zhengzhou” (219), “weather” (210), “subway” (187), “waterlogging” (162), “service suspension” (98), “notice” (67), “transportation” (52), and “meteorological bureau” (21). The high-frequency words in the outbreak period are “trapped”, “tunnel”, “information”, “call for help”, “come on”, “casualties”, “rescue”, “Line 5”, “firefighter” and “submerged”. The high-frequency words in the recurrence period are “donation”, “reconstruction”, “accountability”, “investigation”, “insurance”, “claims”, “restoration”, “drainage”, “resettlement”, and “victims”. The high-frequency words in the slowdown period are “flood control”, “rectification”, “commemoration”, “sponge city”, “policy”, “offering flowers”, “prayer” and “restoration”.
Table 2. Example results of topic word frequency
| Keyword (initial period) | Count | Keyword (outbreak period) | Count | Keyword (recurrence period) | Count | Keyword (slowdown period) | Count |
|---|---|---|---|---|---|---|---|
| Heavy rain | 273 | Trapped | 583 | Volunteer | 146 | Flood control | 37 |
| Early warning | 231 | Tunnel | 422 | Donation | 93 | Rectification | 30 |
| Zhengzhou | 219 | Information | 394 | Reconstruction | 90 | Commemoration | 26 |
| Weather | 210 | Call for help | 361 | Accountability | 74 | Sponge city | 20 |
| Subway | 187 | Come on (cheering) | 347 | Investigation | 69 | Policy | 19 |
| Waterlogging | 162 | Casualties | 328 | Insurance | 66 | Offering flowers | 15 |
| Service suspension | 98 | Rescue | 287 | Restoration | 57 | R.I.P. | 14 |
| Notice | 67 | Line 5 | 253 | Drainage | 48 | Restoration | 10 |
| Transportation | 52 | Firefighter | 218 | Resettlement | 39 | | |
| Meteorological bureau | 21 | Submerged | 184 | Victims | 31 | | |
The NTUSD sentiment dictionary was used as the seed, and the text sentiment analysis method in Section 2.2.2 was applied. In the results of the initial calculation, the negative sentiment words included highly weighted words that are strongly related to the research topic, such as “United States”, “China”, “China and the United States”, and “trade war”, so these words need to be removed. In addition, analysis of the negative sentiment words showed that clearly positive sentiment words, such as “come on”, were classified as negative after calculation, which would affect the accuracy of the subsequent calculation of text sentiment intensity. The method adopted in this paper is therefore to aggregate the NTUSD, HowNet and BosonNLP sentiment dictionaries and other sentiment dictionaries available online into a positive sentiment thesaurus and a negative sentiment thesaurus.
The positive and negative sentiment dictionary obtained with the naive Bayes algorithm of Section 2.2.2 is then compared against this thesaurus. If a word is a positive sentiment word in the thesaurus but a negative sentiment word after computation, it is corrected by adding 0.3 to its score. If the corrected score is greater than 0, the word is moved to the positive sentiment dictionary; if it is still less than 0, the word remains in the negative sentiment dictionary. The corrected score is used as the final sentiment score of the word.
After several experiments to determine the final positive and negative sentiment words, a total of 2,831 sentiment words were obtained, of which 1,572 are positive and 1,259 are negative. The top 10 positive and negative sentiment words are shown in Fig. 5 and Fig. 6, respectively.
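The correction rule can be sketched as follows. The function and argument names are hypothetical. The text does not say what happens to a word whose corrected score is exactly 0, so this sketch leaves such a word in the negative dictionary as an assumption.

```python
def correct_scores(scores, positive_lexicon, bonus=0.3):
    """Split computed word scores into positive and negative dictionaries.

    scores: {word: computed sentiment score}.
    positive_lexicon: words listed as positive in the aggregated thesaurus.
    bonus: the correction added to misclassified positive words (0.3 in the text).
    """
    positive, negative = {}, {}
    for word, score in scores.items():
        if word in positive_lexicon and score < 0:
            score = round(score + bonus, 6)     # rounding hides float noise
        # Assumption: a corrected score of exactly 0 stays negative.
        target = positive if score > 0 else negative
        target[word] = score
    return positive, negative
```

For example, a word such as “come on” with a computed score of -0.1 is lifted to 0.2 and moves to the positive dictionary, while a thesaurus-positive word scored at -0.5 is only raised to -0.2 and stays negative.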

Figure 5. Top 10 positive sentiment words

Figure 6. Top 10 negative sentiment words
As can be seen from Figs. 5 and 6, in the sentiment analysis of online public opinion texts about the 2021 “Zhengzhou 720 Heavy Rainstorm Disaster”, the top 10 positive sentiment words mostly come from praise of mutual aid, rescue and national strength, reflecting social cohesion; “love”, “positive energy” and “hero” have the highest sentiment scores, at 0.953, 0.935 and 0.915, respectively. The negative sentiment words focus on the destructiveness of the disaster itself and on people's fear and grief, reflecting the severity of the disaster; “disaster” and “ruthlessness” have the most extreme scores, at -0.947 and -0.938, respectively.
Figure 7 shows the time series of microblog sentiment intensity. The overall sentiment intensity from July 22 to July 26 is higher than in other periods. Positive sentiment intensity peaked on July 23, with 2,371 positive posts that day; the second peak occurred on July 22, with 2,246 posts. Negative sentiment is strongest in the initial period of public opinion and reaches its extreme on July 21, with 1,504 negative posts that day; after the outbreak of public opinion, the intensity of negative sentiment fluctuates little. The negative sentiment score is about half the positive sentiment score, which shows that during the evolution of public opinion most of the sentiment expressed on Weibo is positive, and positive sentiment is considerably more intense than negative sentiment.

Figure 7. Time series of Weibo sentiment intensity
The online public opinion analysis model for public emergencies constructed in this paper is an unsupervised model, and the value of the topic dimension must be set in advance when topic modeling is performed. Since different numbers of topics yield different estimates and representations, the topic dimension affects the performance of the model. To verify the effectiveness of the proposed topic model for mining opinion topic words, the effect of different topic dimensions on K-means clustering is compared at the same number of iterations. To reduce experimental error, the experiment is repeated 25 times for each number of iterations and the average is taken as the final value. The changes in F-measure and purity as the topic dimension changes are shown in Fig. 8 and Fig. 9, respectively.

Figure 8. F-measure comparison diagram

Figure 9. Purity comparison diagram
The F-measure values coincide at 6 iterations, where F is 0.5398 for both topic dimensions K=100 and K=200. When the number of iterations is small, the clustering effect decreases slightly as the topic dimension increases. As the number of iterations increases, the curve for K=200 becomes progressively smoother, and its F-measure and purity values are higher than those for K=100, indicating a better clustering effect.
To further explore the hot topics that Internet users focused on during the Zhengzhou rainstorm and thus clarify the direction of public opinion, the study used the K-means and DBSCAN algorithms to cluster the comment corpus, and constructed the theme river chart shown in Figure 10, using the daily proportion of topic discussion as the index, to show the hot topics that users paid attention to.
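For reference, the two evaluation measures can be computed from reference labels and cluster assignments as sketched below. This uses the common definitions for clustering evaluation: purity is the fraction of texts that fall in the majority class of their cluster, and the F-measure is the class-size-weighted best F1 score over clusters. Whether this paper uses exactly these variants is an assumption.

```python
from collections import Counter, defaultdict

def contingency(true_labels, cluster_labels):
    """Count, for every cluster, how many members belong to each true class."""
    table = defaultdict(Counter)
    for truth, cluster in zip(true_labels, cluster_labels):
        table[cluster][truth] += 1
    return table

def purity(true_labels, cluster_labels):
    table = contingency(true_labels, cluster_labels)
    return sum(max(counts.values()) for counts in table.values()) / len(true_labels)

def f_measure(true_labels, cluster_labels):
    table = contingency(true_labels, cluster_labels)
    class_sizes = Counter(true_labels)
    total = len(true_labels)
    score = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for counts in table.values():
            hits = counts[cls]
            if hits == 0:
                continue
            precision = hits / sum(counts.values())
            recall = hits / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        score += n_cls / total * best       # weight each class by its size
    return score
```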

Figure 10. Evolution of hot topics in Weibo comments
According to the text clustering results, the top 10 hot topic words were “heavy rain”, “subway”, “waterlogging”, “trapped”, “come on”, “rescue”, “firefighter”, “casualties”, “reconstruction” and “drainage”, which ran through three stages with different proportions and frequencies. As shown in Figure 10, during the first stage, the disaster period (July 20 to July 22), the discussion focused on the real-time situation of the rainstorm and on the people trapped in the subway; the words with the highest proportions were “heavy rain” and “subway”, and a large number of on-site videos and calls for help appeared on social media. The second stage, the rescue period (July 23 to July 26), turned to the progress of rescue; discussion of mutual aid among citizens and of official rescue made “rescue”, “come on” and “casualties” frequent topics. The third stage, the reflection and reconstruction period (July 27 to August 3), focused on policy adjustment and urban governance, such as the release of the State Council's investigation report, the raising of flood control standards, and the improvement of early warning mechanisms, with “reconstruction” and “drainage” as the core topics.
This paper constructs a text clustering model of online public opinion for public emergencies. Taking the evolution of public opinion on the “Zhengzhou 720 rainstorm” as the research object, clustering reveals three stages: the initial period is dominated by disaster descriptions such as “heavy rain” and “subway”, the outbreak period focuses on action responses such as “trapped” and “rescue” (“rescue” occurs 287 times), and the reflection and reconstruction period turns to governance issues such as “drainage” and “policy”. Sentiment analysis shows that the intensity of positive sentiment words such as “love” and “hero” (peak score 0.953) exceeds that of negative words such as “disaster” and “ruthless”, reflecting the dominant role of social cohesion. Experimental results show that the clustering effect is best when the topic dimension is K=200 (F-measure=0.5398, purity=0.812). However, “rubbing on the topic” behavior leads to a cleaning rate of 51.9% of the original data (39,027 entries are retained after cleaning), which highlights the challenge of noise processing. This paper provides theoretical support and new ideas for the study of public opinion on major public events.
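The index behind a theme river chart, the daily proportion of discussion devoted to each topic, can be computed as in the sketch below. The input format (one `(day, topic)` pair per clustered comment) and the function name are assumptions; drawing the chart itself is left to a plotting library.

```python
from collections import Counter, defaultdict

def daily_topic_shares(records):
    """records: iterable of (day, topic) pairs, one per clustered comment.

    Returns {day: {topic: share of that day's comments}}.
    """
    counts = defaultdict(Counter)
    for day, topic in records:
        counts[day][topic] += 1
    shares = {}
    for day, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[day] = {topic: n / total for topic, n in topic_counts.items()}
    return shares
```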
