Open Access

Integrating Content Analysis and LDA Thematic Modeling to Analyze the Presentation of Youth Culture in Urban Cinema

  
Sep 26, 2025


Introduction

In the post-pandemic era, the film industry has shown strong resilience in both production and box office. Excessive IP speculation is cooling down; realist observation, public issues, and humanistic concern are gaining prominence; art films are being valued and beginning to engage with the market; new mainstream, science fiction, animation, and other genres are on the rise; the genre landscape of film production keeps expanding and breaking through; improving the quality of film content is increasingly the direction of development; and the paths of film innovation are becoming more specialized and diversified [1-4]. Driven by this creative and competitive environment, youth films and other genres that take young people as both subject matter and primary audience continue to change shape, showing new features in content quality and artistic temperament, genre fusion and value integration [5-6]. Youth culture has also broken out of the scope of traditional subculture and entered the production of more types of films with a constructive attitude, becoming a film-culture phenomenon worthy of attention [7-8]. Youth is no longer simple rebellion and transgression, but a valuable life experience that provokes empathy and resonance, and a fresh leaven that promotes cultural reflection, expression, and transcendence. In recent years, a number of excellent new mainstream works have appeared among youth films. Their rewriting of the mainstream value of youth culture is reflected in producing thoughtful, meaningful perspectives grounded in the aesthetic preferences of young people, and in conveying reflection on and care for social reality through everyday, life-oriented narratives [9-12]. This new cultural expression facilitates cultural dialogue and interaction with young audiences. Standing in the position of mainstream values, such films shape a new youth culture and youthful temperament through youth narratives, draw closer to young audiences, and express the importance of and respect for the youth audience, thereby developing a cultural dialogue with contemporary youth [13-16]. Youth culture has begun to break upward, constantly integrating with mainstream values and reshaping mainstream culture with its own advantageous cultural content. Today's youth culture rewrites mainstream values through individualized expression and, by engaging and communicating with mainstream culture on equal terms, is recognized, absorbed, and supported by it, making mainstreamed youth culture an extremely powerful communicative discourse in the current media-culture context [17-19].

As a spatio-temporal art, film contains the dimensions of both time and space. Space is an important perspective for interpreting urban cinema, and spatial narrative, that is, using space to tell a story, can reveal the social reality of its era and convey cultural meaning [20]. Cinema first developed in cities, and Shanghai was the main production space of early Chinese film. Taking images as their carrier, urban films use the camera to show real-life spaces to the audience in a realistic way, linking urban spaces and depicting urban life. In films such as "Monga" and "There is a Cloud Made of Rain in the Wind", the coordinates of cities such as Taipei and Guangzhou have become spiritual and cultural symbols that evolve dynamically across time and space [21-22]. From the perspective of Lefebvre's triadic dialectic of space, the first space is the experience and perception of the city at the material level [23]. Films present the image of the city in visual language and therefore inevitably show the material space of the city. Urban films express the diversity of the city, its cultural history, human complexity, and the specificity of its social problems, providing a unique perspective for observing and understanding the city [24-26]. Films that take youth culture as their object of expression are constantly tied to the times, writing about youth from the situation of the era and thereby offering a perspective on urban development in general. For this reason, analyzing how youth culture is presented in urban films provides a reference for displaying the value of youth culture and for the development of urban cinema.

In this paper, we use the vector space model to transform movie texts into vector form and calculate term weights with three methods, including the Boolean representation. On this basis, an n-gram language model is combined to capture contextual information in the text and enhance the model's semantic understanding. The LDA topic model is used to mine the latent topics of movie texts and analyze their probability distributions in order to extract thematic features of youth culture. The LDA-Kmeans method is further adopted, using the topic probability distributions generated by LDA as feature vectors and combining them with the K-means algorithm for clustering analysis, so as to optimize the model's topic classification. In terms of model optimization, a coding layer that incorporates topic features and a classification layer that enhances semantic information are designed, and comparative experiments verify the optimization effect of the proposed model. Finally, a case study reveals how youth culture is currently presented in urban films and analyzes the resulting characterization of youth subculture.

LDA-Kmeans-based topic modeling approach
Text content analysis based on LDA topic modeling
Vector space model

Text, as a carrier of human-comprehensible knowledge, is not directly comprehensible to computers. To reduce the complexity of processing urban movie texts, they must be represented in a digital form that computers can handle. The vector space model [27] is an effective and simple model for representing text features. Each urban movie document is partitioned into a collection of words, each word represents a feature of the text, and the document is represented as a vector in the feature space. Formally, given a set of $n$ documents $D$ with a unique vocabulary $V$, document $d_i$ is represented as a $|V|$-dimensional vector: $d_i=(w_{1,i},w_{2,i},\ldots,w_{t,i})$

Each dimension in the vector corresponds to a specific word. If a youth culture word appears in a document, the value of the corresponding dimension in the vector is non-zero, and there are three ways to calculate the weights in these dimensions.

Boolean expression

In this basic model, we simply check whether a word in the urban movie document appears in the feature dictionary. The calculation formula is:
$$\mathrm{Count}_{\mathrm{Bool}}(w,d)=\begin{cases}1, & w \text{ appears in document } d\\ 0, & \text{otherwise}\end{cases}$$

This method can simply represent the features of an urban movie document, but it ignores the count information of words reflecting youth culture, and such a representation clearly cannot meet the needs of applications with high semantic requirements.

Frequency of words

The frequency of different words in a document plays an important role in describing its content. If a word appears frequently in a document, that word can better depict the content of the document:
$$\mathrm{TF}(w,d)=\begin{cases}tf, & \text{the number of times } w \text{ appears in document } d\\ 0, & \text{otherwise}\end{cases}$$

TF-IDF

The TF-IDF weighting [28] is widely used in urban movie text mining and youth culture information extraction. It is calculated as:
$$\mathrm{tf\text{-}idf}(w,d)=tf_w\times\log\frac{|D|}{df_w}$$

where $tf_w$ denotes the word frequency and $df_w$ denotes the document frequency of a word, i.e. how many documents the word appears in. The factor $\log(|D|/df_w)$ gives higher weights to more discriminating words; if a word occurs in many documents, that youth-culture-representative word is not discriminating and therefore receives a low weight.
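To make these three weighting schemes concrete, the following minimal sketch computes Boolean, raw term-frequency, and TF-IDF vectors for a small invented corpus using scikit-learn; note that scikit-learn's TF-IDF uses a smoothed variant of the $\log(|D|/df_w)$ factor above, so this is an illustration rather than the paper's exact implementation.

```python
# A minimal sketch of the three weighting schemes (Boolean, TF, TF-IDF)
# on a toy corpus; the example documents are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "youth culture in the city",
    "the city at night",
    "youth films and youth culture",
]

# Boolean weights: 1 if a word appears in the document, 0 otherwise.
bool_vec = CountVectorizer(binary=True)
X_bool = bool_vec.fit_transform(docs)

# Raw term-frequency weights.
tf_vec = CountVectorizer()
X_tf = tf_vec.fit_transform(docs)

# TF-IDF weights: frequent but undiscriminating words are down-weighted.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

print(bool_vec.get_feature_names_out())
print(X_bool.toarray())
print(X_tf.toarray())
print(X_tfidf.toarray().round(3))
```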

The n-gram language model

Many applications of human language technology rely on statistical language models, which give the prior probabilities of word sequences in the youth-culture texts of interest. Since natural language places no restriction on length, word sequences may be arbitrarily long, and the probability of a very long word sequence $W$ cannot be computed directly. By the chain rule, $P(W)$ can be decomposed into a product of conditional probabilities:
$$P(W)=P(w_1w_2\cdots w_t)=P(w_1)\prod_{i=2}^{t}P(w_i\mid w_{i-1}w_{i-2}\cdots w_1)$$

Because each term in the product remains difficult to compute, statistical language models use an n-gram approximation, i.e. the n-gram model. It assumes that only the most recent n − 1 words are relevant to predicting the current youth-culture-representative word, and that earlier words are irrelevant. The n-gram model can be expressed as:
$$P(W)\approx\prod_{i=1}^{t}P(w_i\mid w_{i-1}w_{i-2}\cdots w_{i-n+1})$$

Depending on the value of n, we can define unigram (n = 1), bigram (n = 2), or higher-order models. An n-gram model is equivalent to an (n − 1)th-order Markov model, so the equation reflects the Markov assumption that the current word is related only to the previous n − 1 words.
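As a concrete illustration of the n-gram assumption, the sketch below estimates bigram (n = 2) probabilities by maximum likelihood from a toy token sequence; the corpus and the scored word sequence are invented for the example and are not the paper's data.

```python
# A minimal bigram (n = 2) sketch: P(w_i | w_{i-1}) is estimated by
# maximum likelihood from bigram and unigram counts over a toy corpus.
from collections import Counter

tokens = "youth culture shapes urban film and urban film shapes youth culture".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev: str, word: str) -> float:
    """MLE estimate of P(word | prev); returns 0.0 for unseen histories."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Probability of a short sequence under the bigram (1st-order Markov) assumption.
seq = ["urban", "film", "shapes", "youth"]
p = unigrams[seq[0]] / len(tokens)          # unigram probability of the first word
for prev, word in zip(seq, seq[1:]):
    p *= bigram_prob(prev, word)
print(p)
```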

Probability Correlation Distribution

Bayesian Networks

In a probabilistic graphical model, a node represents a random variable, and the edges between linked nodes represent conditional (in)dependence assumptions between variables. Probabilistic graphical models thus provide a compact representation of a joint distribution over random variables. When the edges of the graph are directed and the graph is acyclic, the model is a Bayesian network, i.e. a directed acyclic graph (DAG); Bayesian networks are a special case of probabilistic graphical models.

A Bayesian network $B=\langle G,\Theta\rangle$ is a labeled directed acyclic graph that represents the joint probability distribution over a set of variables $V$. Here $G$ denotes the DAG of network $B$, whose nodes $X_1, X_2, \ldots, X_n$ represent random variables and whose edges represent dependencies among them. The graph $G$ encodes the independence assumption that each node $X_i$ is independent of its non-descendants given its parents $\pi_i$. $\Theta$ denotes the set of network parameters, including $\theta_{x_i\mid\pi_i}=P_B(x_i\mid\pi_i)$ for each value $x_i$ of $X_i$ conditional on $\pi_i$. Thus $B$ defines a unique joint probability distribution over $V$:
$$P_B(X_1,X_2,\ldots,X_n)=\prod_{i=1}^{n}P_B(X_i\mid\pi_i)=\prod_{i=1}^{n}\theta_{X_i\mid\pi_i}$$

In practice, when there are many cultural variables with dependencies among them, it is impractical to draw all of them, so plate notation (a box) is used to indicate N repetitions of a variable.

Correlation probability density distribution

In a probabilistic graphical model, each node represents a random variable, and each movie-word variable has its own corresponding probability distribution. The two most common distributions in topic models are the multinomial distribution and the Dirichlet distribution, which are conjugate to each other.

A multinomial variable describes which of K possible outcomes occurs in each trial. Assuming $m_1, m_2, \ldots, m_K$ denote the number of times each of the K outcomes occurs in N total trials, their joint distribution is the multinomial distribution:
$$\mathrm{Mult}(m_1,m_2,\ldots,m_K\mid\boldsymbol{\mu},N)=\binom{N}{m_1,m_2,\ldots,m_K}\prod_{k=1}^{K}\mu_k^{m_k}$$

where $\boldsymbol{\mu}=(\mu_1,\ldots,\mu_K)$ is the vector of outcome probabilities with $\sum_{k=1}^{K}\mu_k=1$, and the normalization coefficient is:
$$\binom{N}{m_1,m_2,\ldots,m_K}=\frac{N!}{m_1!\,m_2!\cdots m_K!}$$

and: $\sum_{k=1}^{K}m_k=N$

The multinomial distribution can be seen as a generalization of the binomial distribution.

In the multinomial distribution introduced above, the parameter $\boldsymbol{\mu}=[\mu_1,\mu_2,\ldots,\mu_K]$ is given in advance by a distribution $p(\boldsymbol{\mu})$. There is a simple family of distributions with well-understood analytical properties that can be applied directly in Bayesian estimation: the conjugate prior, a "distribution over distributions". From the form of the multinomial distribution, the conjugate prior takes the form:
$$p(\boldsymbol{\mu}\mid\boldsymbol{\alpha})\propto\prod_{k=1}^{K}\mu_k^{\alpha_k-1}$$

where $0\le\mu_k\le 1$ and $\sum_{k=1}^{K}\mu_k=1$, and $\alpha_1,\alpha_2,\ldots,\alpha_K$ are the parameters of the distribution. $\boldsymbol{\mu}$ is restricted to a (K − 1)-dimensional simplex. The normalized form of this distribution is:
$$\mathrm{Dir}(\boldsymbol{\mu}\mid\boldsymbol{\alpha})=\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k-1}$$

This is the standard Dirichlet distribution, where $\Gamma(\cdot)$ is the gamma function and $\alpha_0=\sum_{k=1}^{K}\alpha_k$. Different values of $\alpha_k$ give different shapes of the distribution over the simplex.
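The following small NumPy sketch illustrates this Dirichlet-multinomial conjugacy as it is used in topic models: outcome probabilities are drawn from a Dirichlet prior, counts are drawn from a multinomial, and the posterior Dirichlet parameters are obtained by adding the counts to the prior. The values of K, α, and N are illustrative assumptions, not settings from the paper.

```python
# A small sketch of Dirichlet-multinomial conjugacy with NumPy.
import numpy as np

rng = np.random.default_rng(0)

K = 4                      # number of outcomes (e.g. topics), illustrative
alpha = np.full(K, 0.5)    # symmetric Dirichlet prior, illustrative
N = 100                    # number of trials (e.g. words in a document)

mu = rng.dirichlet(alpha)          # mu ~ Dir(alpha); entries sum to 1
counts = rng.multinomial(N, mu)    # m ~ Mult(mu, N); entries sum to N

# Conjugacy: the posterior over mu is again Dirichlet with parameters
# alpha_k + m_k, so its mean can be read off directly.
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()

print(mu.round(3), counts, posterior_mean.round(3))
```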

LDA-based text topic modeling approach
LDA-based analysis methods

The core idea of the LDA topic model [29] is the assumption that urban movie documents are represented by a mixture of topics, and each youth culture topic is represented by a mixture of words. The model has three nested levels from the outside in. The hyperparameter $\alpha$ governs the document-topic distribution $\theta$, and $\beta$ is the hyperparameter of the topic-word distribution $\phi$. $z_{d,n}$ is a latent variable denoting the youth culture topic to which each word in urban movie document $d$ belongs, and $\theta_d$ is a document-level variable representing the distribution of youth culture topics in document $d$. The structure of the LDA model is shown in Figure 1.

Figure 1.

LDA probability model

The result of LDA is a $D\times K$ matrix $\theta$ containing the probabilities $p(z_k\mid d_d)$. Concretely, $\theta$ is computed by variational inference: given the words of document $d$, the variational parameters are optimized to approximate the posterior distribution of $\theta$. It is written as:
$$\theta=\begin{pmatrix}\theta_1\\ \vdots\\ \theta_D\end{pmatrix}=\begin{pmatrix}p(z_1\mid d_1) & \cdots & p(z_K\mid d_1)\\ \vdots & \ddots & \vdots\\ p(z_1\mid d_D) & \cdots & p(z_K\mid d_D)\end{pmatrix}$$

where $\theta_1,\ldots,\theta_D$ are $1\times K$ probability vectors, each representing the topic probability distribution of the corresponding document.
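A hedged sketch of how the D × K document-topic matrix θ can be obtained in practice is shown below, using scikit-learn's LatentDirichletAllocation (which fits LDA by variational inference); the toy documents and the choice K = 2 are assumptions for illustration only.

```python
# Obtaining the document-topic matrix theta for a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "youth culture and urban life",
    "city streets and urban space",
    "youth films about city youth",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Each row of theta is the topic probability distribution of one document.
theta = lda.transform(X)          # shape (D, K), rows sum to 1
print(theta.round(3))
```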

LDA Topic Characterization Model for Fusion Content Analysis

In topic modeling, it is usually assumed that an urban film document can contain multiple youth culture themes, and that each culturally representative word in the document is generated by one of those themes. The LDA model gives the probability distribution of each urban film document over the themes, as well as the probability distribution of words within each cultural theme; a schematic of LDA is shown in Figure 2. In this section, the LDA model is introduced through the Gamma function, the binomial distribution, the Beta distribution, and the multinomial distribution.

Figure 2.

Schematic diagram of LDA

Gamma distribution: a continuous probability distribution that can be interpreted as the distribution of the time required for an event to occur α times, with density:
$$\mathrm{Gamma}(t\mid\alpha,\beta)=\frac{\beta^{\alpha}t^{\alpha-1}e^{-\beta t}}{\Gamma(\alpha)}$$

where α is the shape parameter and β is the rate parameter.

Conjugate prior distribution: In Bayesian inference, if the posterior distribution and the prior distribution belong to the same family of distributions, the prior and posterior are called conjugate distributions.

Binomial distribution: proposed by Bernoulli, the binomial distribution assumes that each trial has only two possible outcomes, that the trials are independent of one another, and that the probability of the event occurring remains the same in each independent trial, i.e.:
$$p(X=k)=\binom{n}{k}p^{k}(1-p)^{n-k}$$

Beta distribution: a family of continuous probability distributions defined on the interval (0,1). In Bayesian inference, the Beta distribution is the conjugate prior of the binomial distribution:
$$\mathrm{Beta}(p\mid\alpha,\beta)=\frac{p^{\alpha-1}(1-p)^{\beta-1}}{\int_0^1 p^{\alpha-1}(1-p)^{\beta-1}\,dp}=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}$$

Multinomial distribution: the multinomial distribution generalizes the binomial distribution. The binomial distribution describes n Bernoulli trials with only two possible outcomes per trial, while the multinomial distribution allows K outcomes in N independent trials, each outcome with a fixed probability $p_k$, calculated as:
$$\mathrm{Mult}(\boldsymbol{n}\mid\boldsymbol{p},N)=\binom{N}{\boldsymbol{n}}\prod_{k=1}^{K}p_k^{n_k}$$

where: $\sum_{k=1}^{K}n_k=N$, $\sum_{k=1}^{K}p_k=1$, and $\binom{N}{\boldsymbol{n}}=\dfrac{N!}{\prod_{k}n_k!}$

Dirichlet distribution: a generalization of the Beta distribution to higher dimensions, with probability density:
$$\mathrm{Dir}(\boldsymbol{p}\mid\boldsymbol{\alpha})=\frac{1}{\Delta(\boldsymbol{\alpha})}\prod_{k=1}^{K}p_k^{\alpha_k-1}$$

where: $\Delta(\boldsymbol{\alpha})=\dfrac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}=\displaystyle\int\prod_{k=1}^{K}p_k^{\alpha_k-1}\,d\boldsymbol{p}$ and $E(\boldsymbol{p})=\left(\dfrac{\alpha_1}{\sum_{k=1}^{K}\alpha_k},\dfrac{\alpha_2}{\sum_{k=1}^{K}\alpha_k},\ldots,\dfrac{\alpha_K}{\sum_{k=1}^{K}\alpha_k}\right)$

In the LDA model, each topic is assumed to follow a uniform Dirichlet prior, and the model is initialized randomly, so neither the diversity of youth culture topics nor the diversity of topic words can be guaranteed, which makes it difficult to extract topics from a large-scale corpus. In practice, because of corpus uncertainty and other factors, this prior assumption often does not fully match the actual situation; moreover, the quality of the prior affects the convergence speed and convergence quality of the model, which further degrades its performance.

Optimization of LDA topic models incorporating content analysis
LDA-Kmeans Topic Fusion Algorithm

The topic-embedding extraction task relies on a fusion algorithm of LDA and K-means [30]. To make LDA output topic vectors for urban movie documents, this paper makes the following improvements to LDA:

Expand LDA for clustering into a topic model that can generate vectors of urban movie documents, and no longer use the clustering results of the LDA model;

Modify the Dirichlet distribution of “urban movie document-theme” to take the theme as the dimension and the probability value as the value on the dimension, so as to get the vector representation of urban movie documents on different youth culture theme dimensions;

Combining the LDA theme model and K-means algorithm can enhance the model’s ability to extract movie themes;

The topic vectors of the urban movie documents are concatenated into a matrix of size n × 768.

The outermost layer of the improved LDA-Kmeans model is the document-collection layer, followed by the document and word layers. The latent topic assignment of each word in an urban movie document is z, the youth culture topic words of the text are w, each word has an underlying cultural topic, the number of topics is K, and the distribution of youth culture topics reflected by an urban movie document is:
$$P(\boldsymbol{w},\boldsymbol{z},\theta\mid\alpha,\beta)=P(\theta\mid\alpha)\prod_{n=1}^{N}P(z_n\mid\theta)P(w_n\mid z_n,\beta)$$

The basic logic of the K-means clustering algorithm is to first specify the number of centroids C, and then iteratively assign objects to the nearest centroid and update the centroids until convergence.
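The sketch below illustrates the LDA-Kmeans fusion described above under simplified assumptions: LDA document-topic probability vectors are used as features and then clustered with K-means, instead of relying on LDA's own most-probable-topic assignment. The corpus, topic number, and cluster number are invented for the example and are not the paper's actual settings.

```python
# LDA document-topic vectors fed into K-means clustering (a minimal sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "young people lying flat and resting at home",
    "overtime work and fierce competition in the company",
    "students and parents anxious about exams",
    "enjoying weekends games and milk tea with friends",
]

X = CountVectorizer().fit_transform(docs)

# Step 1: LDA produces a topic probability vector for every document.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(X)                 # shape (n_docs, n_topics)

# Step 2: K-means clusters documents in the topic space rather than using
# LDA's own most-probable-topic assignment.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(theta)

print(theta.round(3))
print(labels)
```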

Coding layer design incorporating thematic features

Structural design of the coding layer

The coding layer consists of 12 Transformer encoders stacked on top of each other, and each layer contains a multi-head self-attention mechanism and a fully connected feed-forward neural network. The multi-head self-attention sublayer can calculate the correlation between any two words/phrases and reduce their distance to 1.

The specific structure of a single encoder in the coding layer is shown in Fig. 3. In the figure, Nx is the number of stacked layers, and E is the vector representation obtained by concatenating the word embedding, position embedding, segment embedding, and city movie theme embedding.

Input representation design for coding layer

Obtaining high-quality youth culture representative word vectors through embedding technology is a prerequisite for classifying the youth culture sentiment polarity of text using deep learning. The role of embedding is to map token sequences into vector representations to realize urban film text vectorization. The input embeddings of the LDA-Kmeans model include word embedding, segment embedding, position embedding and topic embedding.

Because the order of sentences in a document is critical to the expression of sentiment, sentence-segment positions are added to help the model distinguish between different sentence or paragraph orders. The segment embedding is a binary vector, and the paragraph embedding of a document $D_i$ is represented as:
$$\begin{cases}\delta(seg,2i+1)=[E_a,E_b,E_a,\ldots,E_a]\\ \delta(seg,2i)=[E_a,E_b,E_a,\ldots,E_b]\end{cases}$$

Figure 3.

Modular structure of encoder

where seg denotes the segment information in the sentence, usually 0 or 1.

In this paper, a positional embedding $\rho$ of size $n\times 768$ is introduced to guide the coding layer to learn the positional features of the text sequence, compensating for the attention mechanism's lack of positional information. The sine-cosine function is used to generate the position vectors, which can encode positions for sentences of arbitrary length. The positional encoding of the embedding layer is:
$$\begin{cases}\rho(pos,2i)=\sin\!\left(\dfrac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)\\[6pt] \rho(pos,2i+1)=\cos\!\left(\dfrac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)\end{cases}$$

where $pos$ denotes the position of the word in the sentence, $i$ denotes the dimension index of the positional embedding, $i=0,1,2,\ldots,d_{\mathrm{model}}$, and $d_{\mathrm{model}}$ denotes the hidden-layer dimension of the model.
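A minimal NumPy sketch of this sine-cosine positional encoding is given below; the sequence length and hidden size are illustrative and much smaller than the model's n × 768 setting, and d_model is assumed to be even.

```python
# Sine-cosine positional encoding as defined above.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """rho[pos, 2i] = sin(pos / 10000^(2i/d_model)); rho[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                  # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # shape (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # shape (seq_len, d_model/2)
    rho = np.zeros((seq_len, d_model))
    rho[:, 0::2] = np.sin(angle)                       # even dimensions
    rho[:, 1::2] = np.cos(angle)                       # odd dimensions
    return rho

print(positional_encoding(seq_len=4, d_model=8).round(3))
```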

The $n\times 768$ two-dimensional matrix $E_{Tok_i}$, obtained by element-wise summation of the embedding-layer outputs, is the input representation of the ith token, i.e., the input characterization of the coding layer.

Calculation process of the coding layer

The model uses the Softmax function to calculate the weight coefficients of one word/phrase with respect to the other words/phrases $T=\{T_i\mid i\in\{1,2,\ldots,n\}\}$, and uses the syntactic structure and semantic information learned by the coding layer to classify the textual sentiment.

The bi-directional Transformer encoder network extracts text features in parallel and then maps the text features $E=\{E_{Tok_i}\mid i\in\{1,2,\ldots,N\}\}$ to the sample label space through the fully connected layer to obtain the corresponding representation vectors.

Classification Layer Design for Enhanced Semantic Information

In this paper, a pooling layer is added after the encoding layer: all token embeddings T output by the encoding layer undergo a pooling operation to produce a fixed-size urban movie document representation embedding p. The logit vector is then converted into probabilities with the Softmax function: the sentence representation p is mapped to a 2-dimensional vector to compute the probability of youth culture sentiment polarity. Softmax maps the input vector x to a vector of length n, where each element corresponds to a category and its value is the probability of that category:
$$P_i=\frac{e^{W_i x+b_i}}{\sum_{j=1}^{n}e^{W_j x+b_j}}$$

where $P_i$ denotes the class probability in the output vector, $x$ the input vector, $W_i$ the ith row of the weight matrix, $b_i$ the ith element of the bias vector, and $n$ the length of the output vector.
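The following small sketch reproduces this classification step with NumPy under illustrative assumptions: a pooled document vector is mapped through a linear layer and the Softmax function to class probabilities. The dimensions and random weights are placeholders, not the trained model's parameters.

```python
# Linear layer + softmax over a pooled document embedding (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

d_hidden, n_classes = 768, 2
x = rng.normal(size=d_hidden)              # pooled document representation p
W = rng.normal(size=(n_classes, d_hidden)) # placeholder weight matrix
b = rng.normal(size=n_classes)             # placeholder bias vector

logits = W @ x + b                         # W_i x + b_i for each class i
logits -= logits.max()                     # subtract max for numerical stability
probs = np.exp(logits) / np.exp(logits).sum()

print(probs.round(4), probs.sum())         # probabilities over the 2 classes
```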

Evaluation indicators

Accuracy

The percentage of correctly predicted samples $(TP+TN)$, regardless of positive or negative labeling, out of the total number of samples $(TP+TN+FP+FN)$:
$$\mathrm{ACC}=\frac{TP+TN}{TP+TN+FP+FN}\times 100\%$$

Recall

The number of samples $(TP)$ that are correctly predicted and also positively labeled, as a percentage of the total number of positive samples $(TP+FN)$:
$$R=\frac{TP}{TP+FN}\times 100\%$$

F1-value

The F1-value is the harmonic mean of precision $P=\frac{TP}{TP+FP}$ and recall $R$, calculated as:
$$F1=\frac{2PR}{P+R}\times 100\%$$
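As a quick illustration, the sketch below computes the three metrics on invented label vectors with scikit-learn; the labels are placeholders and do not come from the paper's datasets.

```python
# Accuracy, recall, and F1 on dummy binary labels.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)   # (TP + TN) / all samples
rec = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall

print(f"ACC={acc:.2%}  R={rec:.2%}  F1={f1:.2%}")
```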

Performance analysis of LDA-Kmeans models

To verify whether the proposed features that integrate the textual themes of urban movie content are effective in enhancing the extraction of youth culture features, a de-duplicated YCT dataset of 547.31 million multi-domain youth-culture urban-movie content texts obtained from multiple platforms is used. To fully verify the stability of the model, the training and evaluation are repeated on the publicly available Sina Weibo comment datasets weibo1 and online2. In these datasets, the label "0" indicates negative youth culture sentiment and "1" indicates positive youth culture sentiment. During training, the model parameters are set as follows: hidden-layer activation: ReLU; optimizer: Adam; hidden size: 754; batch size: 64; dropout: 0.8; epochs: 20; embedding size: 152.

Model Structure Comparison Experiment

The experiment aims to verify the effectiveness of adding an embedding layer using LDA combined with K-means in the task of analyzing youth culture urban films. The comparison models selected for the experiment are as follows:

CNN: using word vectors from word2vec, textual semantic information is extracted with CNN;

LSTM: using word vectors of Word2vec, text semantic information is extracted using LSTM;

BERT-base: based on BERT-base fine-tuning, use the “[CLS]” text aggregation sequence output from the coding layer to do classification;

The method of this paper (LDA-Kmeans): based on LDA, using LDA combined with K-means to extract topic information from text.

The experimental results of the model structure comparison are shown in Fig. 4. The results show that, compared with the traditional deep learning methods, ALBERT performs better in the first 14 epochs, and as the number of iterations increases, its accuracy converges faster and more stably. When convergence is reached at the 13th iteration, the accuracy of this paper's model is the highest, at 93.59%.

Figure 4.

Model structure comparison experimental results

Comparative Experiments of Feature Fusion and Pooling Methods

This experiment verifies the effectiveness of the proposed LDA- and K-means-based topic feature extraction method and pooling method; the improvement strategies of this paper's model are evaluated on the YCT dataset. The unimproved LDA model serves as the baseline, the ERNIE and ALBERT-base models are added as comparison algorithms, and the improved method of this paper is LDA-Kmeans.

The comparison results for feature fusion and pooling methods are shown in Figure 5. The experimental data show that the model proposed in this study significantly outperforms ERNIE on the textual youth culture analysis task. The model variants that incorporate topic features, including ERNIE, ALBERT-base, and this paper's LDA-Kmeans model, show higher accuracy than the original LDA in all training cycles. This result suggests that incorporating the improved thematic feature extraction and pooling techniques significantly enhances youth culture analysis performance. A comprehensive analysis shows that the LDA-Kmeans model reached its peak accuracy of 97.68% at the 16th iteration and exhibited overfitting in subsequent iterations; therefore, 20 iterations are fully sufficient to obtain the optimal parameterization of the model.

Figure 5.

The comparison of feature fusion and pooling method

Analysis of the overall effect of the model

The comparison test of the overall model effect was conducted 20 times in the same experimental environment; the model effect was computed after 20 rounds in each run, and the average of the best result on each validation set is reported in the table. The test results on the three different datasets are shown in Table 1.

Table 1. Test results on three different datasets

Data set Model Accuracy rate Recall rate F1 value
YCT CNN 0.9015 0.8996 0.9219
LSTM 0.8638 0.8605 0.8646
BERT-base 0.9164 0.9169 0.9131
LDA-Kmeans 0.9613 0.9844 0.9702
LDA 0.9248 0.9118 0.9206
ERNIE 0.9474 0.9465 0.9434
Weibo1 CNN 0.8891 0.8858 0.9078
LSTM 0.8489 0.8472 0.8526
BERT-base 0.9031 0.9015 0.8977
LDA-Kmeans 0.9789 0.9699 0.9545
LDA 0.9063 0.8981 0.9056
ERNIE 0.9341 0.9321 0.9295
Online2 CNN 0.873 0.8694 0.9026
LSTM 0.8381 0.8207 0.8458
BERT-base 0.8942 0.8882 0.8861
LDA-Kmeans 0.9679 0.9789 0.9625
LDA 0.8954 0.8848 0.8948
ERNIE 0.9228 0.908 0.9166

The experimental results show that the LDA-Kmeans method achieves good results on all three datasets of textual youth culture analysis, which indicates that the characterization method of fusing thematic features is effective. The analysis of the comparative experimental results is as follows:

Because more semantic information is incorporated, LDA-Kmeans achieves high classification accuracy even with few iterations, an advantage that traditional deep learning methods do not have. In addition, after incorporating the thematic features extracted by K-means, the model obtains more diverse and coherent information, which deepens its understanding of text content and thematic structure and thus improves its youth culture analysis performance; the method's ability to enhance generalization is verified by the comparative experiments on three different datasets. Compared with ERNIE, which represents the whole text with a single integrated feature, full implicit-feature pooling aggregates the hidden states of the entire text and provides more input features to the classifier, which in turn improves the accuracy of youth culture analysis. Since the average pooling layer adds no model parameters and does not affect training speed, the model with the average pooling layer improves effectiveness without additional computational cost.

In summary, the LDA-Kmeans model proposed in this paper has improved the evaluation indexes, such as precision and recall, in the field of text analysis of youth culture urban movies compared to other models.

Running efficiency of each model for different subject keywords

In this section, three more topic models are added: LSA (Latent Semantic Analysis), PLSA (Probabilistic Latent Semantic Analysis), and STM (Spherical Topic Model). The experimental data use the general, synonymous, and polysemous keywords proposed in the previous section, and the results are measured by two parameters: keyword retrieval execution time and retrieval accuracy for the different topic keywords. The running time of each model for the different topic keywords is shown in Table 2. The execution times of the 9 models differ substantially across keyword types. Compared with the other 8 topic models, LDA-Kmeans consumes the least time for every type of topic keyword, and its running-time advantage is very clear, while the LDA method consumes the most time. For synonymous and polysemous keywords, the running time of this paper's LDA-Kmeans model is reduced by 10.73%-55.51% and 8.58%-39.6%, respectively, compared with the other 8 methods.

Table 2. Running time of each model for different topic keywords

Model Running time(s)
General keywords Synonymous keywords Polysemous keywords
LSA 5.0944 5.0498 5.7439
PLSA 4.2886 3.7515 4.6910
STM 4.3783 4.7040 5.1960
CNN 5.0109 5.2280 6.2341
ERNIE 4.337 4.5253 4.8669
LDA 5.5752 6.1680 6.7025
LSTM 5.0323 5.4087 6.6815
BERT-base 3.3305 3.8339 4.6453
LDA-Kmeans 0.2744 0.3241 0.4543

Table 3 shows the accuracy of each model for different topic keywords. The results show that this paper's LDA-Kmeans model is far more accurate than the other 8 models, whether retrieving general, synonymous, or polysemous keywords. The table also shows that retrieval accuracy is higher for general keywords than for synonymous and polysemous keywords. In addition, the optimized LDA-Kmeans model improves accuracy over the LDA model by 36.72%, 38.6%, and 49.51% for general, synonymous, and polysemous keywords, respectively. The optimized LDA-Kmeans model thus significantly increases the accuracy of content analysis related to youth culture in urban cinema.

Table 3. Retrieval accuracy of each model for different topic keywords

Model Accuracy(%)
General keywords Synonymous keywords Polysemous keywords
LSA 44.37 41.64 38.77
PLSA 56.88 53.37 42.34
STM 64.51 58.32 40.23
CNN 47.74 43.04 37.67
ERNIE 52.78 51.38 50.69
LDA 57.16 51.52 39.03
LSTM 65.75 60.85 54.08
BERT-base 55.85 49.45 49.21
LDA-Kmeans 93.88 90.12 88.54
Analysis of the results of the expression of youth culture and emotions in urban cinema
Data sources and processing

In this study, 120 movies related to youth culture were selected as data sources. Using "youth" and "culture" as search keywords, Python crawlers were used to collect urban movie content related to each keyword. The collection period was set from 2014.1.1 to 2023.12.31, and the collected content consisted of movie texts containing the keywords "youth" and "culture". The collected data were cleaned and organized; duplicates, missing content, and content unrelated to this study were removed, finally yielding 2793 valid text items for the "youth" topic and 1924 valid text items for the "culture" topic. The text data were then analyzed using word frequency statistics and the LDA topic model.

Analysis based on word frequency statistics

After preprocessing the text data, this study uses the LDA-Kmeans algorithm to compute probability distributions and obtain the word frequencies in the text data of the "lying flat" topic and the "involution" topic. Ranking words by frequency yields the top-30 word and frequency tables.
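A minimal sketch of such a word-frequency count is shown below; the jieba segmenter, the example sentences, and the stop-word list are assumptions for illustration and are not the paper's actual preprocessing pipeline.

```python
# Word-frequency statistics over short Chinese movie texts (illustrative only).
from collections import Counter
import jieba  # assumed Chinese word segmenter, not specified by the paper

texts = [
    "年轻人选择躺平，释放内心的焦虑",
    "加班和竞争让年轻人感到疲惫",
]
stopwords = {"的", "和", "让", "，"}  # illustrative stop-word list

counter = Counter()
for text in texts:
    words = [w for w in jieba.lcut(text) if w not in stopwords and len(w) > 1]
    counter.update(words)

# Top-N high-frequency words, analogous to the top-30 tables below.
print(counter.most_common(10))
```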

Table 4 shows the high-frequency vocabulary of the text data for the youth "lying flat" topic (words marked * are high-frequency words shared by the two topics). Apart from the "lying flat" keyword itself, the most frequent word in the texts is "self". Words of emotional expression such as happy, anxious, and tired; words with a youth-cultural tendency such as don't want, like, and hope; words of daily life such as go home, study, and go to work; words indicating confusion, anxiety, and procrastination such as choose, life, and tomorrow; and words closely related to young people's lives such as eat and young people all appear frequently in the movies. These high-frequency words show that the lying-flat phenomenon is a form of youth-cultural self-expression and release of inner feelings in response to real life, reflecting both the current living conditions of contemporary young people and their emotional catharsis.

Table 4. High-frequency vocabulary of the "lying flat" topic text data

Serial number Word Frequency Serial number Word Frequency
1 Lie down* 76500 16 Question * 7014
2 Self * 71034 17 Child * 6757
3 Life * 50377 18 Go home 6549
4 Effort * 30549 19 Learning * 6321
5 Work * 22147 20 Anxiety * 6218
6 Suffer * 19053 21 Young man 6210
7 Like * 9987 22 Fatigue 6138
8 Eat 9654 23 Friend 6022
9 Select 9014 24 World 5317
10 Time * 8326 25 At home 5015
11 Get up 8059 26 Society * 5004
12 Hope * 8011 27 Teacher * 4932
13 Happiness 7877 28 Go to work 2714
14 Joyfulness 7656 29 China* 1999
15 Thing * 7325 30 Tomorrow 1934

Table 5 shows the high-frequency vocabulary of the text data for the youth "involution" topic. Vocabulary related to social life, such as education, industry, company, and school, and vocabulary depicting high-pressure social scenarios, such as seriousness, overtime, and competition, appear in large numbers in youth-culture urban movies. The prevalence of these words indicates that the involution topic is mainly presented, from the perspective of youth, as fierce competition in social life and the internal depletion of the spiritual world. In addition, 17 words such as life, effort, work, and society are high-frequency words shared by the lying-flat and involution topics. Together, these high-frequency words present the life and mentality of young people and indicate a significant correlation and correspondence between the two topics.

Table 5. High-frequency vocabulary of the "involution" topic text data

Serial number Word Frequency Serial number Word Frequency
1 Involution 54870 16 Anxiety * 6891
2 Self * 49423 17 Question* 6624
3 Education* 28762 18 Hope * 6434
4 Child * 18941 19 Time * 6194
5 Work * 11547 20 China* 6097
6 Severity 10562 21 Company 6082
7 Effort * 9848 22 Stars 6012
8 Life * 9534 23 School 5884
9 Society * 8889 24 Student 5190
10 Teacher * 8193 25 Age 4879
11 Donation 7925 26 Overtime 4873
12 Lie down* 7884 27 Competition 4809
13 Like * 7754 28 Parent 2584
14 Industry 7526 29 Thing * 1877
15 Learning * 7206 30 Money 1813

The word frequency analysis of the collected text data shows that the urban movie topics of lying flat and involution mainly revolve around daily life such as work and study. The topics involve complaining about surrounding trivialities, expressing inner emotions, releasing psychological pressure, lamenting social life, voicing dissatisfaction with one's own demands, resisting the existing order, self-deprecating banter, and imagining an ideal life and life goals. The figures involved include children, teachers, friends, and other acquaintances in the social circle, as well as celebrities and other public figures. The cultural expressions of the youth topics include negative emotions such as anxiety, fatigue, unwillingness, and seriousness, as well as positive emotions such as happiness, fondness, hope, and striving.

Analysis based on LDA subject modeling

The theme classification results for the "lying flat" topic text data are shown in Table 6. Combining the theme words with the original movie texts, the themes of the lying-flat topic are summarized as "reasons for lying flat, inner feelings, seeking support, and enjoying life", with corresponding weights of 35.78%, 24.97%, 21.03%, and 18.22%, respectively.

Table 6. Theme classification results of the "lying flat" topic text data

Theme Weighting (%) Core theme words Topic description
Topic 1 35.78 Choices, problems, young people, society, children, life, future, ability, lying flat, opportunity, education fund Reasons for lying flat
Topic 2 24.97 Self, effort, work, no desire, things, anxiety, learning, salted fish, rejection, giving up, resting Inner feelings
Topic 3 21.03 Like, teacher, friend, forever, hope, lovely, good-looking, enter the pit, game, pit, thank you, stage Seeking support
Topic 4 18.22 Happy, home, weekend, day, comfort, sleep, mobile phone, cheering up, sports, summer holidays, happiness, air conditioning Enjoy life

Theme 1 Reasons for lying flat: The keywords of this theme show that contemporary young people face various choices and problems, and the resulting pressures lead many of them to choose to lie flat.

Theme 2 Inner feelings: Among the keywords of this theme, words expressing negative emotions account for the majority, representing young people's real thoughts and mild resistance when facing pressure; they release their inner feelings in a playful, bantering way.

Theme 3 Seeking support: Young people like to communicate with friends and others to seek help when they encounter problems. This theme also highlights fan youth subculture. In the internet era, "pit" has become a byword for expressing fondness, as in binge-watching dramas, and young people often use the phrase "lying flat in the pit" to describe a youth culture of being obsessed with a certain thing or person and unable to extricate themselves.

Theme 4 Enjoying life: In this theme, there are words that express positive youth culture, such as joy and happiness, as well as words that express pastime activities, such as going home and playing sports, which show that young people are enjoying their leisure time.

The theme classification results for the "involution" topic text data are shown in Table 7. Combined with the textual content of youth-culture urban movies, the themes of the involution topic are summarized as "severe involution, pressure of life, resisting involution, and education involution", with corresponding weights of 34.26%, 29.04%, 21.17%, and 15.53%, respectively.

Table 7. Theme classification results of the "involution" topic text data

Theme Weighting (%) Core theme words Topic description
Topic 5 34.26 Society, industry, company, time, work, competition, market, enterprise, Internet, young person, graduation, opportunity Severe involution
Topic 6 29.04 Oneself, hard work, life, work, study, anxiety, lying flat, things, overtime, life, examinations Stress of life
Topic 7 21.17 Serious, after-work, anti-involution, evening, likes, colleagues, work, support, milk tea, mobile phone, game, star Resisting involution
Topic 8 15.53 Education, school, students, parents, training, teachers, institutions, winter and summer holidays, policies, college entrance exams, universities and supplementary courses Education involution

Theme 5 Severe involution: Words such as competition, the Internet, and graduation in this theme convey the severity of involution in the eyes of young people. In recent years, information about the "996" work schedule of Internet companies and other forms of involution has triggered extensive discussion online, leading many young people to lament the arrival of an era of involution.

Theme 6 Pressure of life: The keywords hard work, work, and anxiety in this theme overlap with some of the words in Theme 2. Faced with the pressure of work and study, some young people choose to lie flat, reflecting their soft rebellion against an involuted society.

Theme 7 Resisting involution: Words such as anti-involution, milk tea, mobile phones, and games in this theme show some young people's playful banter and nonchalant attitude in the face of involution.

Theme 8 Education involution: Education-related terms such as school, training, teacher, and college entrance examination appear in this theme. Educational involution has drawn many young people and college students into exhausting competition and has caused great anxiety, particularly among young parents.

Youth Subcultural Representations of Lying Flat and Involution

In terms of the content of urban movies, young people use lying flat and calm responses to involution as opportunities to express their dissatisfaction with the pressures of life and the problems they encounter. The youth group shows a clearly ambivalent attitude of passive resistance and a strong post-modern sense of compromise. Lying flat, like earlier internet buzzwords and emoticons such as "funeral culture, Buddha-like mindset, Ge You slouch, sad frog", reflects the social-psychological phenomenon of some young people's confusion about their ideals, a symptom of the youth-subculture social mentality spawned and shaped in the new media era.

The youth subculture of lying flat and involution shows the characteristics of tribal gathering: young people seek recognition online to strengthen their self-identity and group identity. They express their thoughts in a joking tone, use all kinds of internet memes to vent, talk about lying flat and complain about involution, yet still do not stop striving. Analyzing the content of urban films, we found that many young people believe that releasing their emotions online can relieve anxiety and reduce pressure, and that venting and briefly lying flat are only for the sake of moving forward.

Conclusion

In this study, we constructed an LDA-Kmeans thematic model integrating content analysis, extracted youth cultural characteristics and quantified the strength of cultural element associations in urban film texts, and explored the current situation of contemporary young people’s lives and inner emotions. The main conclusions are as follows:

The accuracy of this paper's LDA-Kmeans model is highest (93.59%) when convergence is reached at the 13th iteration, and in the field of youth-culture urban-movie text analysis, the accuracy, recall, and F1 value of this paper's model are greatly improved compared with the other comparison models. In addition, for synonymous and polysemous keywords, the running efficiency of the LDA-Kmeans model is on average 93.12% and 91.71% higher than the other eight comparison methods, and its accuracy is 38.92% and 44.54% higher, respectively. Thus, whether retrieving general keywords or polysemous keywords, the LDA topic model of this paper has a significant advantage over the other 8 topic models in both time consumption and accuracy.

In this paper, the themes of the lying-flat topic are summarized in four aspects, namely "reasons for lying flat, inner feelings, seeking support, and enjoying life", and the themes of the involution topic in four aspects, namely "severe involution, pressure of life, resisting involution, and education involution". The most heavily weighted themes in the lying-flat and involution topics are reasons for lying flat (35.78%) and severe involution (34.26%), respectively, which reflect the current living conditions and inner feelings of contemporary young people.
