An In-depth Analysis of Data Science Methods on the Path of Women’s Consciousness Awakening under the Cultural System of Marxist Chineseization

With the modernization of Chinese society, the modernization of women is receiving growing attention, and women’s consciousness as a social phenomenon is frequently discussed in everyday life. As subjects indispensable to the development of society and of humanity, women have naturally become the protagonists of this topic [1-4], and the growth of Chinese women’s consciousness has far-reaching significance for China’s modernization. In reality, many factors hinder the awakening of Chinese women’s consciousness, including history, tradition, society, family, and education, and these factors in turn slow the modernization of women in modern China [5-8].

Owing to the deep-rooted influence of feudal ethical codes such as the Three Cardinal Guides and Five Constant Virtues and the doctrine of male superiority and female inferiority, the social status of Chinese women was low, and they were exploited, oppressed, and enslaved for a long time [9-10]. The May 4th period was an era of great ideological liberation and of exchange and mingling between Chinese and Western cultures. Driven by the New Culture Movement, Western ideas of freedom, democracy, equality, human rights, women’s rights, emancipation, and education flooded into China [11-14] and injected fresh blood into Chinese culture. From then on, Chinese women’s self-consciousness began to awaken, and the majority of women began to fight for independence and autonomy. With the Chineseization of Marxism and the modernization of China, Chinese women’s consciousness achieved a comprehensive awakening, supported by economic development, the popularization of education, and advances in science and technology [15-18].

This paper combines the Word2Vec tool with the LDA topic model and introduces the EMD distance formula into text mining to build the framework of the W2v_dist algorithm. A women’s consciousness corpus is prepared, segmented, and cleaned in turn, and the word vectors of the model are then trained on it. Using the semantic information carried by these word vectors together with the EMD distance formula, the distance between each pair of topic words related to women’s consciousness is calculated and normalized. Applying the EMD distance formula a second time defines the text distance metric W2v_dist, which is based on the EMD distance and the LDA topic model of women’s consciousness. With the firefly algorithm, the semantic associations of women’s consciousness texts are converted into similarities in a spatial model and represented as text clustering features. The improved firefly algorithm (AFA), combined with the idea of the K-medoids algorithm, is applied to text clustering and yields a better solution for the topic words related to women’s consciousness. The model is then applied to the literature on the awakening of “women’s consciousness” to study its methods, the groups concerned, and related information.

2

Text-mining-based methods for analyzing literature on women’s consciousness

2.1

Word2vec algorithm

2.1.1

CBOW model

In the CBOW model, the word ω_t is unknown, but its context ω_t–2, ω_t–1, ω_t+1, ω_t+2 is known. The core idea of CBOW is to use this context to predict ω_t. During training the model continually updates its parameters, using gradient ascent to maximize the objective function. It outputs the most likely ω_t together with the update to the word vectors, and adding this update to the initial vectors gives the final trained word vectors of the CBOW model [19].

The CBOW model is a three-layer network. The input layer receives the initial word vectors of the known context, the projection layer sums these vectors, and the output layer maximizes the objective function, propagating corrections back to both the model parameters and the input word vectors. When the objective function is maximized, the current word ω can be predicted and the trained word vectors are obtained at the same time. Using Context(ω) to predict the current word ω, the objective function is defined as follows: (1) \[L=\sum\limits_{\omega \in C}{\log p\left( \omega \left| Context\left( \omega \right) \right. \right)}\] where ω denotes any word in corpus C, and Context(ω) denotes the context of ω.

The CBOW-based word2vec algorithm trains word vectors as follows:

Input layer: the context Context(ω) of the word ω is input as randomly initialized vectors of dimension m, which are updated as training progresses.

Projection layer: the initial word vectors of the context Context(ω) from the input layer are summed directly and passed on as input to the next layer, i.e., ${{X}_{\omega }}=\sum\limits_{i=1}^{2c}{V}\left( Context{{\left( \omega \right)}_{i}} \right)\in {{R}^{m}}$.

Output layer: to simplify the complex multi-class problem, the output layer is designed as a Huffman tree, which turns the multi-class problem into a sequence of binary classifications. Figure 1 shows the output layer structure of the CBOW model. At each non-leaf node of the Huffman tree one binary decision is made, to the left or to the right, and the corresponding output is coded as 0 or 1. The leaf nodes of the Huffman tree are the words of the corpus, and the weight of each leaf is the number of times its word appears in the corpus. With hierarchical softmax, the objective function of the CBOW model is evaluated by walking from the root of the Huffman tree to the leaf of the target word and multiplying the probabilities of all decisions along that path.

In the figure, the context of “female thoughts” is used to predict the probability of “female thoughts”. As the objective function is maximized, the model walks back from the leaf node of the word to the root node, modifying the parameter values and the initial word vectors along the way.

The probability of going right or left at a node of the Huffman tree is given by equations (2) and (3): a step to the right (positive class) has the probability in equation (2), and a step to the left (negative class) has the probability in equation (3): (2) \[\sigma \left( X_{\omega }^{T}\theta \right)=\frac{1}{1+{{e}^{-X_{\omega }^{T}\theta }}}\] (3) \[1-\sigma \left( X_{\omega }^{T}\theta \right)=\frac{{{e}^{-X_{\omega }^{T}\theta }}}{1+{{e}^{-X_{\omega }^{T}\theta }}}\] The probability of the target word is the product of the decision probabilities along its path: (4) \[p\left( \omega \left| Context\left( \omega \right) \right. \right)=\prod\limits_{j=2}^{{{l}^{\omega }}}{p}\left( d_{j}^{\omega }\left| {{X}_{\omega }} \right.,\theta _{j-1}^{\omega } \right)\] where ${{l}^{\omega }}$ is the number of nodes on the path from the root to the leaf of ω, $d_{j}^{\omega }\in \{0,1\}$ is the code of the jth node on the path, and $\theta _{j-1}^{\omega }$ is the parameter vector of the (j–1)th non-leaf node. Each factor equals the probability in equation (2) when $d_{j}^{\omega }=0$ and the probability in equation (3) when $d_{j}^{\omega }=1$, so the objective function (1) becomes: (5) \[L=\sum\limits_{\omega \in C}{\sum\limits_{j=2}^{{{l}^{\omega }}}{\left\{ \left( 1-d_{j}^{\omega } \right)\cdot \log \left[ \sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right]+d_{j}^{\omega }\cdot \log \left[ 1-\sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right] \right\}}}\]

Gradient ascent requires the partial derivative of the summand of Eq. (5), denoted L(ω, j), with respect to $\theta _{j-1}^{\omega }$: (6) \[\begin{aligned} \frac{\partial L\left( \omega ,j \right)}{\partial \theta _{j-1}^{\omega }} &=\frac{\partial }{\partial \theta _{j-1}^{\omega }}\left\{ \left( 1-d_{j}^{\omega } \right)\cdot \log \left[ \sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right]+d_{j}^{\omega }\cdot \log \left[ 1-\sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right] \right\} \\ &=\left[ 1-d_{j}^{\omega }-\sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right]{{X}_{\omega }} \end{aligned}\]

The update rule for $\theta _{j-1}^{\omega }$, with learning rate η, is: (7) \[\theta _{j-1}^{\omega }:=\theta _{j-1}^{\omega }+\eta \left[ 1-d_{j}^{\omega }-\sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right]{{X}_{\omega }}\]

Similarly, the partial derivative of the summand of Eq. (5) with respect to X_ω is: (8) \[\frac{\partial L\left( \omega ,j \right)}{\partial {{X}_{\omega }}}=\left[ 1-d_{j}^{\omega }-\sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right]\cdot \theta _{j-1}^{\omega }\]

The update rule for the word vector V_ω is: (9) \[{{V}_{\omega }}:={{V}_{\omega }}+\eta \sum\limits_{j=2}^{{{l}^{\omega }}}{\left( \left[ 1-d_{j}^{\omega }-\sigma \left( X_{\omega }^{T}\theta _{j-1}^{\omega } \right) \right]\theta _{j-1}^{\omega } \right)}\]
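The update rules in Eqs. (7) and (9) can be sketched for a single training pair as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function name `cbow_hs_step`, the list-based vectors, and the choice to apply the accumulated correction of Eq. (9) to every context vector (the usual word2vec convention) are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cbow_hs_step(context_vecs, path_thetas, path_codes, eta=0.1):
    """One gradient-ascent step for a single (Context(w), w) pair.

    context_vecs: context word vectors, updated in place
    path_thetas:  theta_{j-1} of the non-leaf nodes on the Huffman path of w
    path_codes:   Huffman codes d_j (0 or 1) along the same path
    Returns the log-likelihood of the pair before the update.
    """
    m = len(context_vecs[0])
    # Projection layer: X_w is the sum of the context vectors.
    x = [sum(v[k] for v in context_vecs) for k in range(m)]
    step_x = [0.0] * m
    log_lik = 0.0
    for theta, d in zip(path_thetas, path_codes):
        q = sigmoid(sum(xk * tk for xk, tk in zip(x, theta)))
        log_lik += (1 - d) * math.log(q) + d * math.log(1 - q)
        g = eta * (1 - d - q)            # eta * [1 - d_j - sigma(X^T theta)]
        for k in range(m):
            step_x[k] += g * theta[k]    # Eq. (9), using theta before its update
            theta[k] += g * x[k]         # Eq. (7)
    # Assumption: every context vector receives the same accumulated correction.
    for v in context_vecs:
        for k in range(m):
            v[k] += step_x[k]
    return log_lik

# Hypothetical toy values: two context words, a path with two non-leaf nodes.
context = [[0.1, -0.2], [0.3, 0.1]]
thetas = [[0.2, 0.1], [-0.1, 0.3]]
codes = [0, 1]
before = cbow_hs_step(context, thetas, codes)
after = cbow_hs_step(context, thetas, codes)
print(before, after)
```

Because the step follows the gradient of Eq. (5), the log-likelihood returned by the second call should be higher than the first for a small enough η.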

2.1.2

Skip-gram modeling

The skip-gram model is similar in spirit to the CBOW model, with the roles reversed: the input word ω_t is known while its context is unknown, and the model uses the current word to predict the probability of its context words ω_t–2, ω_t–1, ω_t+1, ω_t+2. The skip-gram model likewise consists of three layers: the input layer, the projection layer, and the output layer [20].

Input layer: the center word ω is input as a randomly initialized vector v(ω) ∈ R^m of dimension m.

Projection layer: this layer has little significance here and is kept mainly to mirror the structure of the CBOW model.

Output layer: as in the CBOW model, the output layer is a Huffman tree analogous to Figure 1, which turns the complex multi-class problem into a sequence of binary classifications. In the skip-gram model with hierarchical softmax, the softmax-normalized probability of each context word of ω is replaced by a walk through the Huffman tree, whose leaf nodes are the words of the corpus vocabulary and whose non-leaf nodes decide which child the walk continues to. The probability of a context word is therefore the product of the decision probabilities at the non-leaf nodes on the path from the root of the Huffman tree to its leaf. The objective function of this model is defined as follows: (10) \[L=\sum\limits_{\omega \in C}{\log }p\left( Context\left( \omega \right)\left| \omega \right. \right)\] (11) \[p\left( Context\left( \omega \right)\left| \omega \right. \right)=\prod\limits_{u\in Context\left( \omega \right)}{p\left( u\left| \omega \right. \right)}\] (12) \[p\left( u\left| \omega \right. \right)=\prod\limits_{j=2}^{{{l}^{u}}}{p}\left( d_{j}^{u}\left| V\left( \omega \right) \right.,\theta _{j-1}^{u} \right)\] (13) \[p\left( d_{j}^{u}\left| V\left( \omega \right) \right.,\theta _{j-1}^{u} \right)={{\left[ \sigma \left( V{{\left( \omega \right)}^{T}}\theta _{j-1}^{u} \right) \right]}^{1-d_{j}^{u}}}\cdot {{\left[ 1-\sigma \left( V{{\left( \omega \right)}^{T}}\theta _{j-1}^{u} \right) \right]}^{d_{j}^{u}}}\] (14) \[\begin{aligned} L=\sum\limits_{\omega \in C}{\sum\limits_{u\in Context\left( \omega \right)}{\sum\limits_{j=2}^{{{l}^{u}}}{}}}\Big\{ & \left( 1-d_{j}^{u} \right)\cdot \log \left[ \sigma \left( V{{\left( \omega \right)}^{T}}\theta _{j-1}^{u} \right) \right] \\ & +d_{j}^{u}\cdot \log \left[ 1-\sigma \left( V{{\left( \omega \right)}^{T}}\theta _{j-1}^{u} \right) \right] \Big\} \end{aligned}\]

Taking the partial derivative of Eq. (14) with respect to $\theta _{j-1}^{u}$ and applying gradient ascent gives the update rule for the parameter vector $\theta _{j-1}^{u}$: (15) \[\theta _{j-1}^{u}:=\theta _{j-1}^{u}+\eta \left[ 1-d_{j}^{u}-\sigma \left( V{{\left( \omega \right)}^{T}}\theta _{j-1}^{u} \right) \right]\cdot V\left( \omega \right)\]

Taking the partial derivative of Eq. (14) with respect to V(ω) and applying gradient ascent gives the update rule for the word vector V(ω): (16) \[V\left( \omega \right):=V\left( \omega \right)+\eta \sum\limits_{u\in Context\left( \omega \right)}{\sum\limits_{j=2}^{{{l}^{u}}}{\left( \left[ 1-d_{j}^{u}-\sigma \left( V{{\left( \omega \right)}^{T}}\theta _{j-1}^{u} \right) \right]\cdot \theta _{j-1}^{u} \right)}}\]
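The skip-gram updates of Eqs. (15) and (16) can be sketched in the same way. The function name `skipgram_hs_step` and the data layout (one `(thetas, codes)` pair per context word) are assumptions made for illustration. The sketch follows Eq. (16) literally by summing over all context words before updating V(ω), which may differ from how a particular library schedules the update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def skipgram_hs_step(v, context_paths, eta=0.1):
    """One gradient-ascent step for a center word vector V(w).

    v:             vector V(w), updated in place
    context_paths: one (thetas, codes) pair per context word u, giving the
                   node parameters theta_{j-1}^u and codes d_j^u on u's path
    """
    m = len(v)
    step_v = [0.0] * m
    for thetas, codes in context_paths:
        for theta, d in zip(thetas, codes):
            q = sigmoid(sum(vk * tk for vk, tk in zip(v, theta)))
            g = eta * (1 - d - q)
            for k in range(m):
                step_v[k] += g * theta[k]   # Eq. (16), theta before its update
                theta[k] += g * v[k]        # Eq. (15)
    for k in range(m):
        v[k] += step_v[k]
    return v

# Hypothetical toy values: one context word whose path has a single node.
v = [0.5, -0.5]
paths = [([[1.0, 0.0]], [0])]
skipgram_hs_step(v, paths)
print(v, paths[0][0][0])
```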

2.2

LDA topic model

LDA is a probabilistic generative model that identifies the latent topic-word information in a document set. The topic of a document is uncertain: it is hidden in the “topic-word” and “document-topic” probability distributions. The number of topics is also uncertain, since a text may contain several topics or may center on a single one.

The LDA model is a three-layer Bayesian model, divided from top to bottom into a document set layer, a topic layer, and a feature word layer [21]. Topics are features of documents, and each document can be considered a mixture distribution over topics. Words are features of topics, and each topic can be regarded as a multinomial distribution over words. The essence of the LDA model is to use these shared features to mine the topics of the text. The formula is as follows: (17) \[P\left( {{w}_{n}}\left| {{M}_{m}} \right. \right)=\sum\limits_{k\in K}{P\left( {{w}_{n}}\left| {{K}_{k}} \right. \right)}P\left( {{K}_{k}}\left| {{M}_{m}} \right. \right)\]

In Equation (17), P(w_n|M_m) denotes the probability that feature word w_n occurs in document M_m, P(w_n|K_k) denotes the probability that feature word w_n appears in topic K_k, and P(K_k|M_m) denotes the probability that topic K_k appears in document M_m.
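As a small worked instance of Eq. (17), the sketch below mixes two topic-word distributions with one document's topic weights. The function name and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def word_given_doc(word_given_topic, topic_given_doc):
    """P(w_n | M_m) = sum_k P(w_n | K_k) * P(K_k | M_m), for every word n."""
    n_words = len(word_given_topic[0])
    n_topics = len(topic_given_doc)
    return [
        sum(word_given_topic[k][n] * topic_given_doc[k] for k in range(n_topics))
        for n in range(n_words)
    ]

# Hypothetical distributions: 2 topics over a 3-word vocabulary.
phi = [[0.7, 0.2, 0.1],    # P(w | K_1)
       [0.1, 0.3, 0.6]]    # P(w | K_2)
theta = [0.25, 0.75]       # P(K | M_m) for one document
p = word_given_doc(phi, theta)
print(p)
```

Since each row of `phi` and the vector `theta` sum to 1, the resulting word probabilities for the document also sum to 1.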

Assuming that the number of documents is M and the number of topics is K, the TF-IDF formula is used to obtain the “topic-word” matrix, which follows a Dirichlet prior with parameter β. The “document-topic” matrix is then obtained, which likewise follows a Dirichlet prior with parameter α. The two-outcome case of this prior is the Beta distribution of equation (18): (18) \[f\left( p;\alpha ,\beta \right)=\frac{{{p}^{\alpha -1}}{{\left( 1-p \right)}^{\beta -1}}}{\int_{0}^{1}{{{u}^{\alpha -1}}}{{\left( 1-u \right)}^{\beta -1}}du}=\frac{1}{B\left( \alpha ,\beta \right)}{{p}^{\alpha -1}}{{\left( 1-p \right)}^{\beta -1}}\] where p follows a Beta distribution with parameters (α,β) and denotes the probability of event 1 (a feature word appears in the topic) or event 2 (a topic appears in the document). Generalizing Eq. (18) to K dimensions gives the Dirichlet distribution: (19) \[Dirichlet\left( \vec{p}\left| {\vec{\alpha }} \right. \right)=\frac{\Gamma \left( \sum\limits_{k=1}^{K}{{{\alpha }_{k}}} \right)}{\prod\limits_{k=1}^{K}{\Gamma }\left( {{\alpha }_{k}} \right)}\prod\limits_{k=1}^{K}{p_{k}^{{{\alpha }_{k}}-1}}\]
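The relation between Eqs. (18) and (19) can be checked numerically: for K = 2 the Dirichlet density reduces to the Beta density. The sketch below is illustrative only. The function names are assumptions, and it relies on the standard identity B(α,β) = Γ(α)Γ(β)/Γ(α+β).

```python
import math

def dirichlet_pdf(p, alpha):
    """Density of Eq. (19), computed in log space for numerical stability."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(pk) for pk, a in zip(p, alpha))
    return math.exp(log_norm + log_kernel)

def beta_pdf(p, a, b):
    """Density of Eq. (18), with B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p) - log_b)

# With K = 2 and p = (0.3, 0.7), both functions should return the same value.
print(dirichlet_pdf([0.3, 0.7], [2.0, 3.0]), beta_pdf(0.3, 2.0, 3.0))
```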

2.3

W2v_dist algorithm model construction

2.3.1

Feasibility analysis

In this section, building on the Word2Vec tool and the LDA topic model, this paper introduces the EMD distance formula, a distance metric widely used in image processing but rarely in text mining, to build the W2v_dist algorithm model. First, the feasibility of using the EMD formula to measure text distance is evaluated.

The EMD distance, also known as the earth mover’s distance, is commonly used to find optimal solutions of transportation problems and to compute image similarity. It is widely used in image processing and computer vision.

The EMD distance quantifies the minimum cost required to transform histogram P into histogram Q, i.e., how similar the two are. Assume P = {p₁,p₂,…,p_n} and Q = {q₁,q₂,…,q_n}. Comparing them gives the ground distance matrix $D=\left[ {{d}_{i,j}} \right]$, where d_i,j denotes the distance from the ith bin of P to the jth bin of Q. Let f_i,j ∈ F be the amount of probability mass that flows from p_i to q_j; the flows are the unknowns of the optimization. The EMD distance essentially solves a transportation problem in which the total amount is the same before and after transport, so it is common practice to normalize P and Q so that each sums to 1. This leads to Equation (20): (20) \[\begin{aligned} & EMD\left( P,Q \right)=\min \left( \sum\limits_{i=1}^{n}{\sum\limits_{j=1}^{n}{{{f}_{ij}}}}{{d}_{ij}} \right) \\ & s.t.\ \forall i:\sum\limits_{j}{{{f}_{ij}}}={{p}_{i}} \\ & \forall j:\sum\limits_{i}{{{f}_{ij}}}={{q}_{j}} \\ & \forall i,j:{{f}_{ij}}\ge 0 \end{aligned}\]

By equation (20), the EMD is the minimum cost of transforming histogram P into histogram Q.

The EMD distance uses the features of images to represent the distance between images. Suppose there are two images P and Q whose features are denoted by p_i and q_j respectively, with n_i denoting the weights of the features. Define the distance matrix $\left[ {{d}_{ij}} \right]$ between the features of P and the features of Q, of size l × m. Suppose there exists a matrix $F=\left[ {{f}_{ij}} \right]$, where f_ij represents the amount transferred from p_i to q_j. The cost function to be minimized is then: (21) \[Work\left( P,Q,F \right)=\sum\limits_{i=1}^{l}{\sum\limits_{j=1}^{m}{{{d}_{ij}}}}{{f}_{ij}}\]

Normalizing Eq. (21) by the total flow yields the EMD distance: (22) \[EMD\left( P,Q \right)=\frac{\sum\limits_{i=1}^{l}{\sum\limits_{j=1}^{m}{{{d}_{ij}}}}{{f}_{ij}}}{\sum\limits_{i=1}^{l}{\sum\limits_{j=1}^{m}{{{f}_{ij}}}}}\]
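For intuition, the EMD has a closed form in one special case, which the sketch below uses: when both histograms share the same ordered bins and the ground distance is d_ij = |i – j|, the minimum cost equals the sum of absolute differences between the two cumulative distributions. This special case and the function name `emd_1d` are assumptions made for illustration; the general problem of Eq. (20) requires a transportation solver.

```python
from itertools import accumulate

def emd_1d(p, q):
    """EMD between two histograms on the same ordered bins, with d_ij = |i - j|.

    Both histograms are normalized to total mass 1 first, so the denominator
    of Eq. (22) equals 1 and the result is the minimum transport cost.
    """
    total_p, total_q = sum(p), sum(q)
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

# All mass moves two bins to the right.
print(emd_1d([1, 0, 0], [0, 0, 1]))
# Half of the mass moves two bins, or equivalently all of it moves one bin.
print(emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))
```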

The EMD distance can likewise express the distance between texts through the features of the texts. From the analysis above, topics are the features of a text, while words are the features of a topic. This paper therefore uses word vectors and the EMD distance to calculate the distance between topics, and then combines the topic distances with the EMD formula to obtain the distance between texts.

2.3.2

Female Awareness Text Set Processing and Word Vector Training

The first step in the implementation of the W2v_dist algorithm is to train the word vectors. Training Word2Vec word vectors means that the researcher utilizes the Skip-gram model or the CBOW model in the Word2Vec tool to train a corpus in a particular domain. The ultimate goal is to obtain word vectors for each word in that corpus.

1)

Preparing the corpus

The size of the corpus is determined by the specific task. When the corpus size is small (total vocabulary less than 100 million words), it is more efficient to use the Skip-gram model to train word vectors. The research context of this paper is women’s consciousness, and the corpus used is a collection of related literature with a total vocabulary of less than 100 million words. Therefore, the skip-gram model is used to train word vectors in this paper.

2)

Text Segmentation

In this paper, the Word2Vec tool that comes with Gensim is used to train word vectors. Since Gensim is developed in Python, this paper uses Python to perform text segmentation operations on the corpus.

3)

Data Cleaning

After Jieba segmentation has been applied to the corpus, the stop words in the corpus must also be filtered out. When loading the stop word list, both the system stop word list and the user-defined stop word list should be loaded.
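The cleaning step can be sketched as below. The helper names `load_stopwords` and `clean_tokens` are hypothetical, and the sketch assumes each document has already been segmented into a list of tokens (for example with `jieba.lcut`); the English tokens are placeholders for the segmented corpus.

```python
def load_stopwords(*word_lists):
    """Merge several stop-word lists, e.g. a system list and a user-defined list."""
    merged = set()
    for words in word_lists:
        merged.update(w.strip() for w in words if w.strip())
    return merged

def clean_tokens(tokens, stopwords):
    """Drop stop words and whitespace-only tokens from one segmented document."""
    return [t for t in tokens if t.strip() and t not in stopwords]

# Hypothetical lists and tokens standing in for the real corpus.
system_list = ["the", "of "]
user_list = ["women"]
stopwords = load_stopwords(system_list, user_list)
tokens = ["the", "awakening", "of", "women", "consciousness", " "]
print(clean_tokens(tokens, stopwords))
```

The cleaned token lists would then be the sentences passed to Gensim's `Word2Vec`, where `sg=1` selects the Skip-gram model.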

2.3.3

Topic distance metric based on word vectors

As described above, the principle of LDA is that each topic can be viewed as a multinomial distribution over words; that is, words are features of topics. Each topic vector can then be represented in the following form: (23) \[vec\left( Top \right)=\left( weight\left( {{w}_{1}} \right),weight\left( {{w}_{2}} \right),weight\left( {{w}_{3}} \right),\ldots ,weight\left( {{w}_{n}} \right) \right)\] where weight(w_i) is the weight of the ith feature word. If common distance formulas such as the Euclidean distance or the Mahalanobis distance are used to calculate the distance between topics, the correlation between words is ignored.

To make better use of the semantic information when calculating the distance between topics, this paper proposes the following formula based on the EMD distance: (24) \[\begin{aligned} & consT\left( To{{p}_{1}},To{{p}_{2}} \right)=\min \sum\limits_{i=1}^{n}{\sum\limits_{j=1}^{n}{transW\left( i,j \right)}}\cdot consW\left( Wor{{d}_{i}},Wor{{d}_{j}} \right) \\ & s.t. \\ & transW\left( i,j \right)\ge 0; \\ & \sum\limits_{i=1}^{n}{transW\left( i,j \right)}=WW{{Q}_{j}}; \\ & \sum\limits_{j=1}^{n}{transW\left( i,j \right)}=WW{{P}_{i}}; \\ & \sum\limits_{i=1}^{n}{\sum\limits_{j=1}^{n}{transW\left( i,j \right)}}=\sum\limits_{i=1}^{n}{WW{{P}_{i}}}=\sum\limits_{i=1}^{n}{WW{{Q}_{i}}}=1 \\ & i,j=1,2,\ldots ,n \end{aligned}\]

In Eq. (24), Top₁ = (WWP₁,WWP₂,…,WWP_n) and Top₂ = (WWQ₁,WWQ₂,…,WWQ_n). WWP_i and WWQ_i denote the weights of the same feature word in the two topics, consW(Word_i,Word_j) is the normalized distance between the word vectors of Word_i and Word_j, and transW(i,j) is the amount moved from WWP_i to WWQ_j. The flows transW(i,j) are normalized so that they sum to 1.

2.3.4

Text distance metric based on topic distance

In the previous section, this paper pointed out that topics can be viewed as features of a text. Each text can be viewed as a low-dimensional vector in which each dimension represents a topic and its value is the weight of that topic in the text. This paper therefore applies the EMD distance formula again and defines the text distance metric as follows: (25) \[\begin{aligned} & W2v\_dist\left( Doc1,Doc2 \right)=\min \sum\limits_{i=1}^{N}{\sum\limits_{j=1}^{N}{transT\left( i,j \right)}}\cdot consT\left( To{{p}_{i}},To{{p}_{j}} \right) \\ & s.t. \\ & transT\left( i,j \right)\ge 0 \\ & \sum\limits_{i=1}^{N}{transT\left( i,j \right)}=TW{{Q}_{j}} \\ & \sum\limits_{j=1}^{N}{transT\left( i,j \right)}=TW{{P}_{i}} \\ & \sum\limits_{i=1}^{N}{\sum\limits_{j=1}^{N}{transT\left( i,j \right)}}=\sum\limits_{i=1}^{N}{TW{{P}_{i}}}=\sum\limits_{i=1}^{N}{TW{{Q}_{i}}}=1 \\ & i,j=1,2,\ldots ,N \end{aligned}\]

In Eq. (25), Doc1 = (TWP₁,TWP₂,…,TWP_N) and Doc2 = (TWQ₁,TWQ₂,…,TWQ_N); TWP_i and TWQ_j denote the weights of the topics in the two texts, and transT(i,j) denotes the amount transferred from TWP_i to TWQ_j. The topic distance consT(Top_i,Top_j) is obtained from Eq. (24). This yields the text distance metric W2v_dist based on the EMD distance and the LDA topic model.
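The two-level construction (word distances to topic distances, then topic distances to text distance) can be sketched with a small transportation solver. Everything below is an assumption made for illustration: the solver uses successive shortest augmenting paths and is only meant for tiny inputs, the names `transport_cost` and `w2v_dist` are hypothetical, and the word distance matrix holds made-up values rather than distances from trained word vectors.

```python
import math

def transport_cost(supply, demand, cost, eps=1e-12):
    """Minimum of sum_ij flow[i][j] * cost[i][j] under the EMD constraints."""
    n, m = len(supply), len(demand)
    flow = [[0.0] * m for _ in range(n)]
    left, need = list(supply), list(demand)
    while True:
        # Bellman-Ford over the residual graph, starting from supplies with mass left.
        ds = [0.0 if left[i] > eps else math.inf for i in range(n)]
        dd = [math.inf] * m
        via_s = [None] * n   # demand node preceding supply i (backward edge)
        via_d = [None] * m   # supply node preceding demand j (forward edge)
        for _ in range(n + m):
            changed = False
            for i in range(n):
                for j in range(m):
                    if ds[i] + cost[i][j] < dd[j] - eps:
                        dd[j], via_d[j], changed = ds[i] + cost[i][j], i, True
                    if flow[i][j] > eps and dd[j] - cost[i][j] < ds[i] - eps:
                        ds[i], via_s[i], changed = dd[j] - cost[i][j], j, True
            if not changed:
                break
        targets = [j for j in range(m) if need[j] > eps and dd[j] < math.inf]
        if not targets:
            break
        target = min(targets, key=lambda t: dd[t])
        # Trace the cheapest path back to a supply that still has mass.
        path, amount, j = [], need[target], target
        while True:
            i = via_d[j]
            path.append((i, j, 1.0))
            if via_s[i] is None:
                amount = min(amount, left[i])
                break
            j = via_s[i]
            path.append((i, j, -1.0))
            amount = min(amount, flow[i][j])
        for a, b, sign in path:
            flow[a][b] += sign * amount
        left[i] -= amount
        need[target] -= amount
    return sum(flow[i][j] * cost[i][j] for i in range(n) for j in range(m))

def w2v_dist(doc1, doc2, topics, word_dist):
    """Eq. (25): text distance from topic weights, with consT taken from Eq. (24)."""
    k = len(topics)
    cons_t = [[transport_cost(topics[a], topics[b], word_dist) for b in range(k)]
              for a in range(k)]
    return transport_cost(doc1, doc2, cons_t)

# Hypothetical toy data: 2 feature words, 2 topics, 2 documents.
word_dist = [[0.0, 1.0], [1.0, 0.0]]    # consW between word vectors (made up)
topics = [[1.0, 0.0], [0.0, 1.0]]       # word weights of each topic
print(w2v_dist([1.0, 0.0], [0.0, 1.0], topics, word_dist))
```

For corpora of realistic size a dedicated linear-programming or optimal-transport routine would be the more practical choice. The pure-Python solver is used here only to keep the sketch self-contained.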

3

Text clustering based on firefly algorithm

3.1

Feasibility analysis of text clustering based on firefly algorithm

This paper finds a strong similarity between the firefly algorithm (FA) and text clustering. FA imitates fireflies that are attracted at night by the light of their companions: how far a firefly flies depends on its distance from a companion and on the companion’s brightness, and the ultimate goal is to find the coordinates of the brightest firefly. In text clustering, after the text feature matrix has been preprocessed, each document corresponds to a weighted vector, which plays the role of a firefly’s spatial position in FA. The smaller the angle between two vectors, the more similar the documents, so the brightness of a firefly can be set to the inverse of the cosine (angular) distance of its document vector: the greater the brightness, the smaller that distance. Once the documents at the cluster centers have been found, one round of text clustering is completed by comparing the distance of every other document to each center document and assigning it to the cluster whose center is closest. The FA algorithm can thus be used to cluster texts through the following correspondences:

1) Each firefly corresponds to one document.

2) The position of the firefly in the spatial coordinate system corresponds to the weights in the feature vector of each document.

3) The brightness of the firefly corresponds to the inverse of the mean distance from the vector of one document to the vectors of the other documents in the cluster, i.e., the objective function.

At the same time, research on clustering with the firefly algorithm is still at an exploratory stage, and the firefly algorithm has not yet been applied to text clustering. Its use here is therefore strongly exploratory and is meaningful both for the application of the firefly algorithm and for the improvement of text clustering.

In summary, applying the firefly algorithm to text clustering is fully feasible. Next, this paper introduces the construction of the text clustering model based on the firefly algorithm in detail.

3.2

General model construction

3.2.1

General design of the model

The firefly algorithm is simple in structure, robust, easy to implement, and has strong search ability. Therefore, this paper applies the firefly algorithm to text clustering. The text clustering model based on the firefly algorithm is described in detail below.

Text clustering first preprocesses the text to remove stop words. Then feature selection or feature extraction picks the terms that best express the text features. Finally, the core step is carried out through cluster analysis, and the firefly algorithm is used for this final clustering stage.

3.2.2

Text clustering feature representation

The representation currently in common use is the vector space model (VSM), which converts the semantic associations of texts into similarities in a spatial model. These similarities can then be passed to clustering algorithms. After the feature words have been obtained from the text dataset through feature selection or extraction, each document can be represented in the following form: (26) ${{d}_{i}}=\left( {{w}_{1}},{{w}_{2}},\ldots ,{{w}_{n}} \right)$

3.2.3

Cluster similarity calculation

In the text clustering process, each text is represented by a feature vector in which each dimension is the weight of the corresponding feature item in that text. Given two n-dimensional objects i and j with text feature vectors i = (x_i1,x_i2,…,x_in) and j = (x_j1,x_j2,…,x_jn), the similarity of text i and text j can be measured by:

1) Euclidean distance: (27) \[sim\left( i,j \right)=d\left( i,j \right)=\sqrt{{{\left| {{x}_{i1}}-{{x}_{j1}} \right|}^{2}}+{{\left| {{x}_{i2}}-{{x}_{j2}} \right|}^{2}}+\cdots +{{\left| {{x}_{in}}-{{x}_{jn}} \right|}^{2}}}\]

2) Manhattan distance: (28) \[sim\left( i,j \right)=d\left( i,j \right)=\left| {{x}_{i1}}-{{x}_{j1}} \right|+\left| {{x}_{i2}}-{{x}_{j2}} \right|+\cdots +\left| {{x}_{in}}-{{x}_{jn}} \right|\]

3) Minkowski distance: (29) \[sim\left( i,j \right)=d\left( i,j \right)={{\left( {{\left| {{x}_{i1}}-{{x}_{j1}} \right|}^{m}}+{{\left| {{x}_{i2}}-{{x}_{j2}} \right|}^{m}}+\cdots +{{\left| {{x}_{in}}-{{x}_{jn}} \right|}^{m}} \right)}^{\frac{1}{m}}}\]

4) Cosine of the angle between the vectors: (30) \[\begin{aligned} sim\left( i,j \right) &=\cos \left( i,j \right) \\ &=\frac{{{x}_{i1}}\cdot {{x}_{j1}}+{{x}_{i2}}\cdot {{x}_{j2}}+\cdots +{{x}_{in}}\cdot {{x}_{jn}}}{\sqrt{{{\left| {{x}_{i1}} \right|}^{2}}+{{\left| {{x}_{i2}} \right|}^{2}}+\cdots +{{\left| {{x}_{in}} \right|}^{2}}}\cdot \sqrt{{{\left| {{x}_{j1}} \right|}^{2}}+{{\left| {{x}_{j2}} \right|}^{2}}+\cdots +{{\left| {{x}_{jn}} \right|}^{2}}}} \end{aligned}\]
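The four measures in Eqs. (27) to (30) can be sketched as plain functions. The function names are assumptions; the Euclidean and Manhattan distances are written as the m = 2 and m = 1 cases of the Minkowski distance.

```python
import math

def minkowski(x, y, m):
    """Eq. (29): Minkowski distance of order m."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def euclidean(x, y):
    """Eq. (27): the Minkowski distance with m = 2."""
    return minkowski(x, y, 2)

def manhattan(x, y):
    """Eq. (28): the Minkowski distance with m = 1."""
    return minkowski(x, y, 1)

def cosine(x, y):
    """Eq. (30): cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical feature vectors of two documents.
doc_i = [1.0, 0.0, 2.0]
doc_j = [0.0, 1.0, 2.0]
print(euclidean(doc_i, doc_j), manhattan(doc_i, doc_j), cosine(doc_i, doc_j))
```

Note that the first three are distances (smaller means more similar), while the cosine is a similarity (larger means more similar).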

3.2.4

Improvement of the Firefly Algorithm

From the above analysis, the step size of a firefly's movement during flight has a direct impact on the performance of the algorithm, so setting an appropriate position update strategy is critical. The improved position update is:

(31) \[x_{i}=x_{j}+\beta \times \left( x_{j}-x_{i} \right)+\alpha \times \frac{\left| I_{b}-I_{j} \right|}{I_{j}}\times \left( rand-\frac{1}{2} \right)\]
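To make the update in Eq. (31) concrete, the following sketch applies it coordinate by coordinate. It rests on assumptions not stated in this section: I_b is taken to be the brightness of the currently brightest firefly, I_j the brightness of firefly j, and rand is drawn uniformly from [0, 1) independently for each coordinate. The function name `update_position` is hypothetical.

```python
import random

def update_position(x_i, x_j, beta, alpha, I_b, I_j, rng=random):
    """Return the new position of firefly i following Eq. (31).

    Assumption: I_b is the brightness of the brightest firefly and
    rand is uniform on [0, 1), drawn once per coordinate.
    """
    # Adaptive factor: the random step shrinks as I_j approaches I_b.
    scale = alpha * abs(I_b - I_j) / I_j
    return [
        xj + beta * (xj - xi) + scale * (rng.random() - 0.5)
        for xi, xj in zip(x_i, x_j)
    ]

# Hypothetical values: firefly j is already the brightest, so the
# random term vanishes and the move is fully deterministic.
print(update_position([0.0, 0.0], [1.0, 2.0], beta=0.5, alpha=0.2, I_b=3.0, I_j=3.0))
```

Under these assumptions the random perturbation is bounded by half of `scale` in each coordinate, so fireflies near the brightest individual take smaller random steps than those far from it.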

3.2.5

Text Clustering Based on Improved Firefly Algorithm

This paper exploits the ability of the FA intelligent bionic algorithm to find optimal solutions and its fast convergence, and at the same time improves on the deficiencies of the traditional FA. The improved firefly algorithm (AFA) is combined with the idea of the K-medoids algorithm and applied to text clustering, yielding a new firefly clustering algorithm (K-AFA). The algorithm is described in three aspects: the selection of the objective function, the idea of the algorithm, and the process of the algorithm.

The FA searches for the optimal solution according to the brightness of each firefly, and the brightness is usually set to the value of the objective function, so the selection of the objective function directly affects the final result. In the K-medoids (K-center point) algorithm, the point with the smallest mean distance to the rest of the data in its cluster is selected as the cluster center. The objective function of this algorithm is therefore defined as: (32) \[f\left( i \right)=\frac{1}{N_{k}}\sum\limits_{j\in S_{k}}d_{ij}\] where N_k denotes the number of data elements in the cluster S_k in which data element i is located, and d_ij denotes the distance between data element i and data element j. This distance is generally the Euclidean distance; in text clustering it is judged by the cosine of the angle between documents.

The FA is applied to the problem of selecting cluster centers. In general, the FA searches for the individual with the greatest brightness, whereas cluster-center selection searches for the point that minimizes the objective function for a given number of clusters. For this reason, the brightness of firefly s is defined as: (33) \[I\left( s \right)=\frac{1}{f\left( s \right)}\]

The higher the value of I(s), the smaller the value of f(s). The lower the value of the objective function found, the closer the final solution is to the ideal solution and the better the final clustering effect.
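As a rough illustration of Eqs. (32) and (33), the sketch below computes the objective value and brightness of each element of one cluster and picks the brightest as the center. It assumes the distance is one minus the cosine of Eq. (30), which is only one possible way to turn the cosine into a distance; the function names and the toy vectors are hypothetical.

```python
import math

def cosine_distance(x, y):
    """Assumed distance: 1 - cos(x, y), with cos as in Eq. (30)."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

def objective(i, cluster, vectors, dist=cosine_distance):
    """Eq. (32): distances from element i to every j in S_k, divided by N_k."""
    return sum(dist(vectors[i], vectors[j]) for j in cluster) / len(cluster)

def brightness(i, cluster, vectors, dist=cosine_distance):
    """Eq. (33): I(s) = 1 / f(s); undefined when f(s) is exactly zero."""
    return 1.0 / objective(i, cluster, vectors, dist)

def brightest(cluster, vectors, dist=cosine_distance):
    """Index of the element with the largest brightness, i.e. the smallest f."""
    return max(cluster, key=lambda i: brightness(i, cluster, vectors, dist))

# Toy cluster of three text vectors; element 1 lies between the other two.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
cluster = [0, 1, 2]
print(brightest(cluster, vectors))
```

Because Eq. (33) inverts the objective, a cluster containing a single element (or only identical vectors) gives f(s) = 0, so a practical implementation would need to guard against division by zero.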

4

Findings and analysis

4.1

Research on the theme of female consciousness based on literature analysis

4.1.1

General characteristics

Applying the firefly text-theme clustering algorithm constructed in this paper to the publication years of 210 papers shows that research on women’s consciousness began to revive in the 1930s and that the number of research results has generally increased since then. The number of papers increased rapidly around 1945, and surged after 1949. The yearly values are shown in Figure 2.

1)

Budding Stage: 1937-1939

The two phases of the surge in the number of essays since 1937 are closely related to the enhancement of external forces, such as the policy orientation of the state to implement civic education. In 1941, the Women’s Federation Organization issued a call for a province-wide literacy campaign on the occasion of Women’s Day on the 8th of March, which further contributed to the development of the literacy class movement, and from then on, the consciousness of women began to awaken and develop rapidly.

2)

Rapid development and maturity stage: 1940-1947

From the second half of 1943 onwards, the war situation changed: the Anti-Japanese War entered the stage of strategic counter-offensive, the Japanese troops were heavily committed to the Pacific War, and the CPC continued to insist on guerrilla warfare, crushing and dismantling the Japanese sweeps in China, so the war situation was favorable. As a result, the scale of the base areas gradually expanded and clustered into smaller areas. Coupled with the maturity and perfection of the Party’s leadership work, all the work in the base areas could be carried out systematically, and the female literacy class movement was no exception. At this stage the literacy class movement was at its largest scale, people’s education continued to develop at a deeper level, and the movement reached its climax in 1945. Almost all the young women in the revolutionary bases at that time participated in the literacy class movement, and the revolutionary bases showed a small cultural upsurge of literacy and learning in the form of “village-run schools, household-reading, anti-Japanese and national salvation, everyone competing to be the first.” A total of 22 articles on women’s consciousness were published from 1945 to 1947.

3)

Stagnation and Recovery Phase: 1948-1949

After the victory in the War of Resistance Against Japanese Aggression, the Communist Party and the Kuomintang maintained peace for a short period of time, and the two parties reached the Double Ten Agreements in Chongqing on the future development of China. However, the Kuomintang side tore up the Double Ten Agreements in June 1946 and waged a war against the Communist Party, resulting in the War of Liberation. In 1947 the Kuomintang launched a focused offensive against Shandong, and the liberated areas of Shandong were constantly shrinking. The Communist Party went all out to break the Kuomintang’s focused attack, so all undertakings in the revolutionary base areas, including the literacy class movement, were brought to a standstill. It was not until September 1948, when the Communist Party of China put an end to the Kuomintang’s rule in Shandong after a hard battle and the war slowed down, that the undertakings in the revolutionary base areas had time to recover and develop. The Women’s Relief Society (WRS) played an important role in this period, actively restoring and developing the literacy class movement in the liberated areas; encouraging women to persist in participating in the movement was a key task of the WRS at that time.

4)

Further development: 1949-1956

During the seven years after the founding of the People’s Republic of China, the number of papers published in core journals under the title of “Women’s Consciousness” accounted for almost 81.71% of the total number of papers published since the War of Resistance Against Japanese Aggression.

4.1.2

Research themes

Of these papers, 179 were selected and their research themes and contents were studied.

The statistics of themes and contents of women’s consciousness research are shown in Figure 3. The themes of women’s consciousness research since the Anti-Japanese War have been widely but unevenly distributed. Among the eight categories of themes and contents summarized, background research on the meaning, value, and practical foundation of cultivating women’s consciousness ranks first, accounting for 49.721% of the total. Research on the status quo, problems, and countermeasures of cultivating women’s consciousness ranks second, accounting for about 14.525%, and research on the social hotspots and phenomena caused by the lack of women’s consciousness accounts for 13.966%. Research on the concept, connotation, and characteristics of women and women’s consciousness has not been given due attention, accounting for only 10.056% of the total. Historical studies on the development of women’s consciousness; comparative studies of foreign theories, ideologies, policies, and experiences of women’s consciousness and of trends in women’s education in the context of globalization; the measurement and evaluation of women’s consciousness; and reviews of women’s consciousness research have all received less attention from scholars.

4.1.3

Thematic content, methodological distribution

Figure 4 shows the distribution of topics, content, and methods related to women’s consciousness research. The field of women’s consciousness research in China mainly adopts qualitative research methods: among the 179 samples in this statistical survey, 130 articles, or about 72.626%, use literature analysis and theoretical discursive research. In contrast, comparative studies, case studies, experimental studies, and multivariate studies are rarely used. As an indispensable method in scientific research, theoretical discursive research is of great significance to the construction of the basic research system of women’s consciousness. However, as an important field of practice, the cultivation of women’s consciousness needs scientific research and empirical analysis to establish its effectiveness, the current situation of women’s consciousness, and the important factors influencing it at the micro level, so that practical work on women’s consciousness can be carried out in a targeted way.
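The percentages reported in Sections 4.1.2 and 4.1.3 can be checked against the sample of 179 papers with a short calculation. The sketch below reproduces the stated share of 130 papers and back-computes the whole-number counts implied by the theme percentages. Those implied counts are inferred here for illustration and are not given in the text, and the helper names are hypothetical.

```python
def share_percent(count, total):
    """Percentage share of `count` in `total`, rounded to three decimals."""
    return round(100.0 * count / total, 3)

def implied_count(percent, total):
    """Nearest whole number of papers consistent with a reported percentage."""
    return round(percent * total / 100.0)

TOTAL = 179  # papers selected for the thematic analysis

# Share of the 130 literature-analysis and theoretical papers.
print(share_percent(130, TOTAL))  # 72.626

# Counts implied by the four theme percentages reported in Section 4.1.2.
for pct in (49.721, 14.525, 13.966, 10.056):
    print(pct, implied_count(pct, TOTAL))
```

Each reported percentage maps back to a whole number of papers out of 179, which suggests that all shares in these subsections use the same denominator.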

4.1.4

Distribution of women’s consciousness research groups

Special groups such as farmers and migrant workers, primary and secondary school students, and university students are gradually receiving attention. Figure 5 shows the distribution of research groups on women’s consciousness. Studies of the ideal general group are the most numerous, accounting for almost 58.659% of the total. The special groups of citizens, farmers and migrant workers, college students, primary and secondary school students, party and governmental organs and the military, ethnic minority groups, and enterprise units have all been covered, among which college students have received the most attention. With the spread of the construction of the new socialist countryside and the prominence of the problem of rural migrant workers, the study of the female consciousness of farmers and rural migrant workers has begun to attract academic attention. In contrast, women’s awareness in primary and secondary schools, which is the main foundation of women’s education in China, has not received enough attention. The study of women’s consciousness in public, party, government, and military organizations, ethnic minority groups, and enterprise groups has been taken up by only a few scholars and needs further attention.

4.2

Empirical analysis of data

4.2.1

Research hypotheses

1)

Periodic background, social environment and the awakening of women’s consciousness

As the core concepts of gender theory, the context of the times and the social environment are the results of the long-term development of institutional arrangements and economic culture in various historical periods, and are important indicators reflecting women’s stratification status, and the application of gender theory helps to identify inequalities in gender relations to a certain extent. With the rapid improvement of Chinese women’s social and economic status, verifying whether the traditional gender concept exists in the social stratification of modern society based on the gender perspective not only helps to expand the scope of the relevant theories and applications of social stratification, but also helps to more objectively judge the current status of gender equality in China under the high labor participation rate of women. Based on this, hypothesis H1 is proposed: the lower the social status of women, the more unfavorable it is for women to awaken to their consciousness.

2)

Cultural values and female class identity

Some studies in China have confirmed that cultural values characterized by education, occupation, and income have a certain influence on women’s identity, but there are differences in opinions about the intensity of the influence. People’s direct experience and cognition of objective cultural value differences are more likely to influence people’s class self-evaluation than objective factors such as education, occupation, income, etc., and such differences are due to people’s self-expectations and comparisons with other individuals and groups. Based on this, the hypothesis H2 is proposed: the higher the cultural values of women, the more favorable it is to the awakening of women’s consciousness.

4.2.2

Variable settings

The variables were categorized into explained, explanatory, and control variables according to the purpose of the study. The assigned values and descriptive statistics of each variable are shown in Table 1.

1) Explained variable. The explained variable is female consciousness awakening. It is measured according to women’s subjective evaluation of their class status, with a mean of 1.7365 and a standard deviation of 0.4693, indicating that female consciousness awakening is poor.

2) Explanatory variables. The explanatory variables are the era context and social environment, and cultural values.

(1) Era background and social environment. In this study, based on relevant studies in the academic world, gender division of labor, gender competence perception, marriage, gender discrimination in employment, and distribution of household chores are selected as the proxy variables for women’s awareness. Higher scores on these five proxies represent a higher social status of women. In descending order, the mean values of gender division of labor, marriage, gender competence, employment discrimination, and housework distribution are 3.3155, 3.1485, 2.9645, 2.1566, and 2.1056, respectively.

(2) With regard to cultural values, the vertical comparison was higher than the horizontal comparison, and the difference between them was 0.5233.

3) Control variables. The control variables are mainly individual characteristics that affect women’s consciousness awakening, comprising nine variables: age, political affiliation, marital status, years of education, work status, household registration type, geographical type, health status, and family economic status. Age is a continuous variable, taken as the actual age of the respondent at the time of the interview. Years of education is also a continuous variable, taken as the respondent’s years of schooling. Political affiliation, marital status, work status, household registration type, geographical type, health status, and family economic status are added to the model as categorical variables. The sample distribution is summarized as follows. The mean age is 49.4885 years with a standard deviation of 16.4586, which indicates that the surveyed women differ widely in age. The mean value of political affiliation is 0.1856, indicating that most of the surveyed females are members of the general public. The mean value of marital status is 0.7985, indicating that most of the surveyed females are married. The mean value of years of education is 8.0655, indicating that the surveyed females have relatively few years of education and that most have a junior high school education. The mean value of work status is 0.5186, which indicates that the number of women surveyed who have a job is basically equal to the number of those who do not. The mean value of household registration type is 0.3648, indicating that most of the surveyed females have agricultural household registration. The mean value of geographical type is 0.3856, indicating that most of the females surveyed are located in inland areas. The mean value of health status is 0.5591, which indicates that most of the females surveyed are in good health. The mean value of family economic status is 0.6245, indicating that most of the females surveyed have above-average household income.

Table 1.

The assignment and descriptive statistics of each variable

Variable type		Variable	Mean	SD
Explained variable	Female consciousness awakening		1.7365	0.4693
Explanatory variable	Era background and social environment	Gender division of labor	3.3155	1.2658
		Gender competence perception	2.9645	1.2615
		Marriage	3.1485	1.1596
		Employment gender discrimination	2.1566	1.0066
		Housework distribution	2.1056	0.9655
	Cultural values	Horizontal comparison	1.7415	0.5236
	Cultural values	Vertical comparison	2.2648	0.6185
Control variable		Age	49.4885	16.4586
		Political affiliation	0.1856	0.3153
		Marital status	0.7985	0.4188
		Years of education	8.0655	5.0652
		Work status	0.5186	0.5269
		Household registration type	0.3648	0.4866
		Geographical type	0.3856	0.4856
		Health status	0.5591	0.4969
		Family economic status	0.6245	0.4826
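The ranking of the five proxy variables and the gap between the two cultural-value comparisons quoted in the text follow directly from the means in Table 1. A small sketch of that arithmetic is given below; the dictionary keys are shorthand labels chosen here, not identifiers from the survey data.

```python
# Means copied from Table 1; the keys are shorthand labels for illustration.
proxy_means = {
    "gender division of labor": 3.3155,
    "gender competence perception": 2.9645,
    "marriage": 3.1485,
    "employment gender discrimination": 2.1566,
    "housework distribution": 2.1056,
}

# Sort the proxies from the highest to the lowest mean score.
ranking = sorted(proxy_means, key=proxy_means.get, reverse=True)

horizontal_mean = 1.7415  # horizontal (lateral) comparison
vertical_mean = 2.2648    # vertical (longitudinal) comparison
gap = round(vertical_mean - horizontal_mean, 4)

print(ranking)
print(gap)  # 0.5233
```

The sorted order matches the sequence given in the text (gender division of labor, marriage, gender competence, employment discrimination, housework distribution), and the gap reproduces the reported 0.5233.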

4.2.3

Main effects analysis

Table 2 reports the main effect analysis of the influence of the era background and social environment and of cultural values on the awakening of women’s consciousness. Model 1, Model 2 and Model 3 are regression results obtained by fitting an ordered Logit model, and Model 4 is the regression result obtained by fitting a multiple linear regression model. Model 1 includes only the control variables: age, political appearance, marital status, years of education, type of household registration, type of region, health status, and family economic status all significantly and positively affect women’s awakening of consciousness, while work status is not statistically significant.

1) Era background and social environment. Model 2 adds the era background and social environment variables to Model 1. Gender division of labor significantly and positively affects women’s class identity at the 1% level. Marriage significantly and negatively affects women’s consciousness awakening at the 5% level, i.e., the more traditional women’s views of marriage are, the less favorable this is to the awakening of women’s consciousness. This is mainly because harsh and stereotyped gender selection in the competitive talent market puts women under greater pressure to survive, so that they place the value of their lives in marriage. Employment gender discrimination significantly and positively affects women’s class identity at the 1% level, i.e., the more employment gender discrimination women experience, the higher the awakening of women’s consciousness. Combined with the above analysis, hypothesis H1 is partially verified.

2) Cultural values and the awakening of female consciousness. Model 3 adds cultural values to Model 2. The two operationalized indicators of cultural values significantly and positively affect the awakening of women’s consciousness at the 1% level, and the adjusted R² is 0.148 (14.8%), so the overall explanatory power is improved compared with Model 2. On the one hand, under the influence of the rich-poor gap and relative poverty, people’s value pursuit is more inclined toward material wealth, and family economic status affects women’s cultural values to a great extent, which directly affects the awakening of female consciousness. On the other hand, when individuals compare themselves horizontally with others around them or vertically with their own past, a sense of relative deprivation arises, and this subjective feeling directly affects the awakening of female consciousness. Therefore, hypothesis H2 is verified.
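For readers unfamiliar with the ordered Logit model used in Models 1 to 3, the sketch below shows how such a model turns a linear predictor into probabilities for the ordered response categories. It is a generic textbook formulation, not the paper's estimation code; the cut points and the linear predictor values are hypothetical, since the source does not report them.

```python
import math

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(linear_predictor, cutpoints):
    """Category probabilities P(y = k), k = 0..K-1, of an ordered logit model.

    The cumulative probability is P(y <= k) = logistic(c_k - x'beta),
    with increasing cut points c_0 < c_1 < ... (hypothetical values here).
    """
    cumulative = [logistic(c - linear_predictor) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # difference of adjacent cumulative values
        previous = cum
    return probs

cutpoints = [-1.0, 1.0]  # hypothetical thresholds for three ordered categories
print(ordered_logit_probs(0.0, cutpoints))
print(ordered_logit_probs(1.0, cutpoints))
```

A positive coefficient raises the linear predictor and so shifts probability toward the higher categories. In this formulation the thresholds take the place of an intercept, which is one common reason an ordered Logit table reports no separate constant term.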

Table 2.

Analysis of the main effect of women’s conscious awakening

Explanatory variable		Ordered Logit						Multiple linear regression
		Model 1		Model 2		Model 3		Model 4
		Coefficient	S.E.	Coefficient	S.E.	Coefficient	S.E.	Coefficient	S.E.
Era background and social environment	Gender division of labor	-	-	0.094***	0.034	0.105***	0.034	0.015***	0.004
	Gender competence perception	-	-	-0.015	0.032	-0.013	0.036	-0.003	0.005
	Marriage	-	-	-0.076**	0.033	-0.057	0.033	-0.015*	0.005
	Employment gender discrimination	-	-	0.115***	0.038	0.098**	0.037	0.018**	0.008
	Housework distribution	-	-	-0.026	0.035	-0.023	0.035	-0.005	0.008
Cultural values	Horizontal comparison	-	-	-	-	0.915***	0.078	0.182***	0.015
Cultural values	Vertical comparison	-	-	-	-	0.348***	0.054	0.067***	0.013
Control variable	Age	0.015***	0.005	0.007***	0.005	0.006**	0.042	0.002*	0.000
	Political affiliation	0.248*	0.154	0.265**	0.125	0.185	0.128	0.025	0.015
	Marital status	0.284***	0.083	0.245***	0.082	0.248***	0.085	0.048***	0.016
	Years of education	0.048***	0.008	0.053***	0.008	0.048***	0.012	0.007***	0.003
	Work status	-0.043	0.078	-0.037	0.072	-0.085	0.078	-0.015	0.014
	Household registration type	0.192**	0.065	0.246***	0.075	0.226***	0.071	0.042***	0.015
	Geographical type	0.548	0.062	0.265***	0.062	0.348***	0.062	0.059***	0.015
	Health status	0.345***	0.075	0.315***	0.073	0.264***	0.087	0.045***	0.013
	Family economic status	1.485***	0.073	1.465***	0.071	0.958***	0.072	0.196***	0.015
Constant term		-	-	-	-	-	-	0.948***	0.054
Adjusted R²		0.085		0.135		0.148		0.175

Note: 1) *, ** and *** indicate that a variable is significant at the 10%, 5% and 1% levels, respectively. 2) Standard errors are robust standard errors. 3) “-” in Model 1 indicates that the ordered Logit regression does not include gender awareness and socioeconomic status, and “-” in Model 2 indicates that the ordered Logit regression does not include socioeconomic status. A “-” in the constant term row indicates that this value does not exist.
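The star notation in the note can be related to a coefficient and its standard error with a simple two-sided z-test. The sketch below assumes a normal approximation, which is an assumption made here: the source does not state which test statistic was used, so stars recomputed this way need not match every table entry. The function names are hypothetical.

```python
import math

def two_sided_p(coef, se):
    """Two-sided p-value of z = coef / se under a standard normal (assumed)."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2.0))

def stars(coef, se):
    """Map a p-value to the *, **, *** convention of the table note."""
    p = two_sided_p(coef, se)
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

# Three Model 2 entries from Table 2: coefficient and robust standard error.
print(stars(0.094, 0.034))   # gender division of labor
print(stars(-0.076, 0.033))  # marriage
print(stars(-0.015, 0.032))  # gender competence perception
```

For these three entries the recomputed levels agree with the table (1%, 5%, and not significant, respectively).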

4.2.4

Analysis of urban and rural differences

The household registration system is an important feature of the urban-rural dichotomy, and the type of household registration causes differences in resource endowments, lifestyles, and social attitudes among different groups, which significantly affects the group’s class identity. This study further conducted a sub-sample regression analysis of women’s awakening of consciousness based on household registration type, and Table 3 shows the analysis of urban-rural differences in the influence of era background and social environment, and cultural values on women’s awakening of consciousness. It is found that there is a significant difference between urban and rural areas in the influence of contemporary background and social environment on women’s awakening to consciousness, while there is no significant difference between urban and rural areas in the influence of cultural values on women’s awakening to consciousness.

Table 3.

The urban and rural differences of women’s consciousness

Explanatory variable		Ordered Logit
		Countryside		Town
		Coefficient	S.E.	Coefficient	S.E.
Era background and social environment	Gender division of labor	0.055	0.048	0.215***	0.053
	Gender competence perception	0.026	0.043	-0.156**	0.061
	Marriage	-0.034	0.034	-0.086	0.057
	Employment gender discrimination	0.154***	0.045	0.034	0.072
	Housework distribution	-0.019	0.044	-0.015	0.065
Cultural values	Horizontal comparison	0.082***	0.082	1.065***	0.136
Cultural values	Vertical comparison	0.254***	0.065	0.469***	0.105
Adjusted R²		0.115		0.189
N		3.485		1.915

Both the horizontal and the vertical socio-economic status comparisons significantly and positively affect the class identity of rural and urban women at the 1% level, further validating hypothesis H2. Meanwhile, the adjusted R² is 0.115 for the rural sample and 0.189 for the urban sample, with urban areas being more awakened to women’s consciousness than rural areas.

5

Conclusion

In this paper, on the basis of the word2vec tool and the LDA topic model, the EMD distance formula is introduced to build the W2v_dist algorithm model, which processes the women’s consciousness text set and trains word vectors to obtain the text distance metric. Firefly algorithm text clustering is then constructed to screen topic words in the text set and select the best word items to express text features. The constructed model is employed to study 210 texts on women’s consciousness awakening. From 1945 to 1947, a total of 22 texts on women’s consciousness were published, and women’s consciousness awakening reached the stage of rapid development and maturity. In the seven years after the founding of the People’s Republic of China, the number of publications in journals focusing on women’s consciousness reached 81.71% of the total number of publications since the war, and women’s consciousness was further developed. Of the five proxy variables for the context of the times and the social environment, gender division of labor, marriage, gender competence, gender discrimination in employment, and the distribution of household chores had mean values of 3.3155, 3.1485, 2.9645, 2.1566, and 2.1056, respectively. For cultural values, the vertical comparison was higher than the horizontal comparison, with a difference of 0.5233.

Language: English

Publication timeframe: 1 time per year
Journal Subjects: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other



Shaohong Li

Published Online: Mar 19, 2025

Received: Nov 01, 2024

Accepted: Feb 19, 2025

DOI: https://doi.org/10.2478/amns-2025-0365

Keywords: Data Mining, Text Analysis, Female Consciousness Awakening, Marxist Chineseization

© 2025 Shaohong Li, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
