Sentiment analysis and validity evaluation of Japanese language under the transfer learning model
Published Online: Mar 21, 2025
Received: Oct 31, 2024
Accepted: Feb 19, 2025
DOI: https://doi.org/10.2478/amns-2025-0606
© 2025 Xiaodan Li, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Transfer learning is based on the assumption that there is some sharing or correlation of categories and features between the source and target domains. By transferring knowledge from the source domain to the target domain, it can help the learning process in the target domain. There are two main types of transfer learning: feature-based transfer learning and model-based transfer learning [1-2]. Feature-based transfer learning transfers features from the source domain to the target domain and then uses labeled data from the target domain to train the model. Model-based transfer learning applies the model of the source domain directly to the target domain. The application of transfer learning in sentiment analysis is of great significance [3-4].
Japanese sentiment analysis is the process of analyzing Japanese text or speech data with natural language processing techniques and machine learning algorithms in order to extract the sentiment tendency or polarity expressed in it. In recent years, with the development of deep learning technology, its application in the field of sentiment analysis has become increasingly widespread. As an important branch of deep learning, transfer learning improves performance on a target-domain task by utilizing knowledge from a source domain. In sentiment analysis, deep learning models pre-trained on large-scale corpora can be transferred to the sentiment analysis task, so that the performance of sentiment analysis models on small-scale datasets is improved [5-8].
Literature [9] combined the TF-IDF algorithm and SVM to construct a Japanese text sentiment classification model and designed controlled experiments to analyze its performance. The comparative analysis shows that the model exhibits good overall performance, can meet the needs of a sentiment classification system, and is of reference value for related research. Literature [10] emphasizes the importance of sentiment analysis and reviews the development of transfer learning and its applications; based on existing research results, it outlines the algorithms and applications of transfer learning in sentiment analysis and the development trends of the field. Literature [11] uses active learning to construct a high-quality sentiment corpus of Japanese tweets and improves sentiment analysis with a Transformer language model; experiments show that the adapted Transformer language model outperforms other models in both Twitter sentiment analysis and sentiment corpus creation. Literature [12] proposes new techniques that use language-agnostic sentence representations to adapt models trained on texts in one language to recognize polarity in texts from other languages; the model was evaluated on the PolEmo 1.0 sentiment corpus and additionally validated with a deep neural network model. Literature [13] describes the application of deep learning natural language processing techniques to multilingual sentiment analysis; the effectiveness of deep learning models such as BERT in cross-language sentiment classification is confirmed by validating their performance in a multilingual environment, and future research directions for natural language processing are also discussed. Literature [14] examined a sentiment classification model based on radical embeddings for Chinese and Japanese, consisting of a CNN word feature encoder and a bidirectional RNN document feature encoder, and showed that the radical embedding-based approach is cost-effective for machine learning in Chinese and Japanese. Literature [15] created a Japanese social media sentiment classification model with the help of support vector machines and KNN and proposed a topic model for user sentiment analysis based on these classifiers; experiments on a Twitter dataset demonstrate the effectiveness of the proposed method. Literature [16] explores the use of deep learning techniques to detect and predict emotional tendencies by constructing a natural language processing system with artificial neural network models and testing its performance; the results show that the system's analysis techniques meet the basic requirements, and the importance of natural language processing and the application of deep learning to Japanese sentiment analysis are discussed. Literature [17] develops accurate multi-task deep learning models to estimate the type and intensity of emotions in Japanese tweets; by extending a variety of deep learning models for estimating sentiment intensity, both the type of sentiment and its intensity are predicted, and the effectiveness of the developed models is demonstrated experimentally. Literature [18] aims to determine a deep transfer learning baseline for Russian-language sentiment analysis.
By identifying Russian sentiment analysis datasets and language models officially supporting Russian, and fine-tuning a multilingual bidirectional encoder representation, robust state-of-the-art results were obtained on the sentiment datasets, and the fine-tuned model was made publicly available. Literature [19] proposes a deep CNN model for sentiment analysis based on character-level representations; the model is also used to apply transfer learning between the domains of sentiment analysis and emotion detection, and experiments show that it achieves improved accuracy.
This article examines the forms of sentiment analysis at each level of text granularity and reviews the fundamental concepts of transfer learning, transfer methods, and feature selection methods. RoBERTa is selected as the base model for building a multi-step transfer method for Japanese sentiment analysis. Plain text without sentiment labels is used for pre-training the masked language model, and denoising encoding and pre-training are performed within the RoBERTa model. Transfer learning strategies such as introducing external data during pre-training, training the RoBERTa model with a cross-entropy loss function, and assigning smaller gradient descent rates to lower layers are used to alleviate negative transfer and catastrophic forgetting in multi-step transfer learning. The processed text sequences are represented with classification identifiers and separator symbols, and the cross-entropy loss function is introduced to fine-tune the trained RoBERTa model. Finally, sentiment analysis of vocabulary in the comments of a Japanese-language website is carried out with different methods, and a variety of factors that affect sentiment analysis are examined.
Sentiment analysis is a term used to describe emotional tendency analysis and polarity judgment. In plain terms, it is the process of analyzing, processing, summarizing, and reasoning about emotionally charged subjective content. Sentiment analysis examines the emotional state implicit in what people convey and judges or evaluates the speaker's attitude or opinion, while the purpose of sentiment classification is to categorize sentiment data by polarity [20].
With the further development of network informatization and globalization, a large number of subjective texts with emotional polarity have appeared on the Internet, and researchers have gradually moved from analyzing single emotion words to studying more complex emotion sentences and even document-level emotions. Therefore, according to text granularity, sentiment analysis can be categorized into word-level, phrase-level, sentence-level, document-level, and multi-document-level analysis.
1) Judgment of Sentiment Polarity at Word Level
Word-level sentiment polarity can be seen as the basis of text sentiment analysis. Generally, a value of -1 or 1 indicates whether a word is negative or positive. Methods for judging the sentiment polarity of words fall mainly into two groups: corpus-based methods and lexicon-based methods.
Corpus-based methods usually take the connectives and co-occurrence features between words as the main basis for judging sentiment polarity. Lexicon-based methods usually rely on the semantic similarity and hierarchical structure of a dictionary such as the Chinese dictionary HowNet as the main basis for judging the sentiment polarity of words.
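For illustration, the following is a minimal sketch of the lexicon-based idea: a word's polarity is scored by its similarity to small seed sets of positive and negative words. The seed lists and the similarity table are invented placeholders for demonstration only; they are not taken from HowNet or from this paper.

```python
# Minimal sketch of lexicon-based word polarity judgment: a word's polarity is
# estimated from its semantic similarity to small sets of positive and negative
# seed words. The seed lists and the toy similarity table below are illustrative
# stand-ins for a real lexicon such as HowNet.

POSITIVE_SEEDS = {"良い", "嬉しい", "美しい"}   # "good", "happy", "beautiful"
NEGATIVE_SEEDS = {"悪い", "嫌い", "ひどい"}     # "bad", "hate", "terrible"

# Toy similarity table (word pair -> similarity in [0, 1]); in practice this
# would come from a dictionary's hierarchy or pre-trained embeddings.
SIMILARITY = {
    ("素晴らしい", "良い"): 0.9, ("素晴らしい", "悪い"): 0.1,
    ("最悪", "悪い"): 0.95, ("最悪", "良い"): 0.05,
}

def similarity(w1: str, w2: str) -> float:
    return SIMILARITY.get((w1, w2), SIMILARITY.get((w2, w1), 0.0))

def word_polarity(word: str) -> float:
    """Return a score in [-1, 1]: positive if closer to positive seeds."""
    pos = max((similarity(word, s) for s in POSITIVE_SEEDS), default=0.0)
    neg = max((similarity(word, s) for s in NEGATIVE_SEEDS), default=0.0)
    return pos - neg

if __name__ == "__main__":
    for w in ("素晴らしい", "最悪"):
        print(w, round(word_polarity(w), 2))   # > 0 positive, < 0 negative
```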
2) Sentiment analysis at the utterance level
Sentence-level sentiment analysis usually first distinguishes whether an utterance is subjective or objective, then judges whether a subjective sentence is positive or negative, and finally extracts the finer-grained elements of the sentiment tendency in the utterance, typically the holder of the opinion and the object being commented on.
3) Document-level sentiment analysis
Sentiment analysis at the document level usually classifies the text as positive, negative, or neutral overall, and typically requires pre-processing the document before analysis. In this pre-processing, the document is first divided into sentences according to punctuation, each sentence is then analyzed with a sentence-level sentiment analysis method, and finally the sentiment of the document is obtained by integrating the sentiment tendency values of all sentences.
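A minimal sketch of this split-score-aggregate procedure follows. The keyword-based sentence scorer is only a placeholder for the model-based classifiers discussed later; the word lists are illustrative.

```python
import re

# Minimal sketch of document-level sentiment analysis: split the text into
# sentences on punctuation, score each sentence, and aggregate the sentence
# tendencies into a document-level label.

POSITIVE_WORDS = {"良い", "嬉しい", "好き"}
NEGATIVE_WORDS = {"悪い", "嫌い", "ひどい"}

def split_sentences(text: str) -> list[str]:
    # Split on Japanese and ASCII sentence-ending punctuation.
    return [s for s in re.split(r"[。！？!?]", text) if s.strip()]

def sentence_polarity(sentence: str) -> int:
    score = sum(w in sentence for w in POSITIVE_WORDS) - sum(w in sentence for w in NEGATIVE_WORDS)
    return (score > 0) - (score < 0)          # 1, 0, or -1

def document_polarity(text: str) -> str:
    total = sum(sentence_polarity(s) for s in split_sentences(text))
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

if __name__ == "__main__":
    print(document_polarity("料理は良い。でも店員はひどい。サービスも悪い。"))  # negative
```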
Transfer learning differs from traditional machine learning methods: it is a framework for learning across different domains and different kinds of knowledge. Its main advantage is that it does not require much labeling in the target domain; learning in the new domain is completed using the labeled data in the source domain together with a small amount of labeled data in the target domain, thereby achieving the transfer of knowledge from the source domain to the target domain.
The object of transfer refers to what should be transferred between two or more different tasks. The content of transfer can be methods, parameters, features, instances, or related knowledge, which can be summarized into two categories: behavior and knowledge. Behavioral transfer can be interpreted as carrying problem-solving strategies and learning methods from the source domain to a new, different domain, and it focuses on the similarity of solutions to classification problems between domains. Knowledge transfer focuses on domain classification itself, extracting the correlations between different domains with respect to their classification features. The object of transfer is thus the basic principle underlying the construction of classification models [21].
The transfer method refers to how the transferred objects are moved using appropriate means and techniques, and it is the main research content of transfer learning. Different transfer methods applied to the same objects produce different transfer effects and performance, so an appropriate transfer method must be chosen for each situation.
Instance-based transfer learning is the most intuitive and easiest to understand of these methods. It assumes that some training data in the source domain can be weighted and selected and then applied to the target domain, so the weighting strategy for the source-domain training data becomes the key to instance transfer.
If the distribution function of the data in the source domain differs from that of the target domain, the source instances are re-weighted so that their weighted distribution approximates the distribution of the target domain.
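This re-weighting idea can be written as standard importance weighting; the numerical sketch below assumes the notation P_s(x) and P_t(x) for the source and target distributions (the paper's own symbols and estimator are not reproduced), and uses Gaussian densities purely for illustration.

```python
import numpy as np

# Minimal sketch of instance-based transfer via importance weighting: each
# source-domain example x is weighted by w(x) = P_t(x) / P_s(x), so that the
# weighted source data approximates the target distribution. In practice the
# ratio is estimated, e.g. with a domain classifier; here it is computed from
# known toy densities.

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=1000)          # source samples ~ P_s
p_s = gaussian_pdf(x_src, 0.0, 1.0)              # source density P_s
p_t = gaussian_pdf(x_src, 0.5, 1.0)              # target density P_t
weights = p_t / p_s                              # importance weights

# The weighted source mean moves toward the target mean (0.5).
print("unweighted mean:", round(x_src.mean(), 3))
print("weighted mean:  ", round(np.average(x_src, weights=weights), 3))
```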
The purpose of feature-based transfer learning is to learn a "good" feature representation for the target domain: a common feature representation is found in the feature space of the source and target domains, and knowledge is transferred from the source domain to the target domain through this shared representation, which yields a significant improvement in learning performance in the target domain.
Feature-based transfer thereby avoids the problem of directly estimating the source and target data distributions: assuming that a feature transformation maps both domains into a common feature space, the source-domain knowledge can be reused through that shared space.
If both the target and source domains have labeled data, but the target domain has far fewer labeled samples than unlabeled ones, the classification model in the target domain can be optimized with the help of the model parameters learned from the source data.
If the parameters learned from the source domain are taken as the starting point of the target-domain model, they can then be adjusted with the small amount of labeled target-domain data so that the model adapts to the target task.
The process of selecting features that represent the overall characteristics of a text document is called feature selection. Feature selection is a key issue in sentiment analysis: not all features are useful for training a classifier, and selecting features reduces the dimensionality of the text vector space, improves learning accuracy, improves classification performance, and reduces computational overhead.
Feature selection is the selection of a proper subset of the original feature set that preserves the information most useful for classification.
The functions usually used in feature selection methods include feature frequency, document frequency, information gain, weight of evidence, and mutual information.
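As a small illustration of one of these criteria, the sketch below filters terms by document frequency; the tokenized toy corpus and the thresholds are invented for demonstration.

```python
from collections import Counter

# Minimal sketch of feature selection by document frequency: keep only terms
# that occur in at least `min_df` documents and in at most `max_df_ratio` of
# all documents, discarding terms that are too rare or too common.

def select_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                       # count each term once per document
    return {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}

docs = [["映画", "良い", "最高"], ["映画", "退屈", "悪い"], ["映画", "良い"]]
print(select_by_document_frequency(docs))         # {'良い'}: rare and ubiquitous terms removed
```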
Mutual information is a useful information measure in information theory, which can be viewed as the amount of information contained in one random variable about another random variable, or the uncertainty of a random variable reduced by the knowledge of another random variable. Mutual information has been widely used in statistical language modeling to represent the correlation between two different variables [22].
Let the joint distribution of two random variables $(X, Y)$ be $p(x, y)$ and their marginal distributions be $p(x)$ and $p(y)$. The mutual information $I(X;Y)$ is the relative entropy between the joint distribution and the product of the marginals:

$$I(X;Y) = \sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}$$
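The following short sketch evaluates this textbook formula numerically; the 2x2 joint distribution between a term occurrence and a sentiment class is invented purely for illustration.

```python
import numpy as np

# Minimal sketch of mutual information I(X;Y) = sum_{x,y} p(x,y) log(p(x,y) / (p(x) p(y))),
# computed for a toy joint distribution between a term occurrence X and a
# document sentiment class Y.

def mutual_information(p_xy: np.ndarray) -> float:
    p_x = p_xy.sum(axis=1, keepdims=True)          # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = p_xy > 0                                # avoid log(0)
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Rows: term absent / present; columns: negative / positive document.
p_xy = np.array([[0.35, 0.15],
                 [0.10, 0.40]])
print(f"I(X;Y) = {mutual_information(p_xy):.4f} nats")
```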
A state-of-the-art pre-training model, RoBERTa, is chosen as the base model, and a multi-step transfer learning algorithm is designed to accomplish the document-level sentiment analysis task. The robust RoBERTa model ensures that the feature vectors contain rich semantic information, and the multi-step transfer learning approach is proposed to mitigate the scarcity of training data for the document-level task.
The transfer process of the multi-step strategy is shown in Fig. 1.

Multi-step transfer learning flowchart
In the first stage of the transfer learning strategy, the RoBERTa model trained on the generic domain is further trained with the masked language model (MLM) objective on unlabeled plain text from the target domain.
The BERT model randomly replaces tokens in the input sequence with the special token [MASK].
The masked language model introduces a cross-entropy loss function to train the model's ability to extract structural and semantic features of a sentence by predicting, in the final output sequence, the original words hidden by [MASK].
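A minimal sketch of this MLM further-pre-training step is shown below, using the Hugging Face transformers library. The checkpoint path is a placeholder, since the paper does not specify which Japanese RoBERTa weights it starts from; the masking ratio of 15% is the library default, not a value confirmed by the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Minimal sketch of masked language model further pre-training: tokens are
# randomly replaced with [MASK] and the model is trained with a cross-entropy
# loss to recover the original tokens.

checkpoint = "path/to/japanese-roberta"            # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

texts = ["この映画はとても面白かった。", "サービスが悪くて残念だった。"]  # unlabeled plain text
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]
batch = collator(encodings)                        # adds masked input_ids and labels

outputs = model(**batch)                           # outputs.loss is the MLM cross-entropy
outputs.loss.backward()                            # one gradient step of further pre-training
print(float(outputs.loss))
```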
Document-level sentiment analysis tasks struggle to adequately drive very large models such as RoBERTa because of the complexity of manual engineering and the limited training data. The work in this phase addresses this problem by designing a transfer learning strategy that acquires sentiment knowledge from related domains to help deepen the model's comprehension.
When the source and target domains are unrelated, forced transfer can harm the algorithm's performance on the target domain, a situation known as negative transfer. The multi-step transfer learning algorithm therefore introduces external data through a supervised pre-training task rather than simply exposing the model to more text. First, sentiment analysis datasets from the Internet are surveyed, and similarly labeled data from review scenarios are selected as the external pre-training corpus.
The evaluation objects in document-level sentiment analysis correspond to three classification categories: positive, neutral, and negative. When the task-guided sentiment pre-training takes the form of a regression task, the labels must be reclassified according to the range of the document scoring values and the sentiment scores in order to match the target task's samples and labels. The RoBERTa model that has completed the first-stage MLM pre-training is transferred, and a document-level sentiment label classification task is introduced for further supervised pre-training. This phase only requires selecting a small amount of specialized data for targeted training, so that the model can effectively capture the background information of the evaluation object and its sentiment inference patterns. The output feature matrix is then passed to a classification layer that predicts the document-level sentiment label.
A cross-entropy loss function is used to train the RoBERTa model on the document-level sentiment classification task, minimizing the error between the predicted values and the true labels.
In BERTology language models based on the Transformer bidirectional encoder, the encoding layers shift from capturing phrase- and sentence-level structural knowledge in the lower layers to complex semantic knowledge in the higher layers. Inspired by the fine-tuning scheme of the BERT model for text classification, the transferred RoBERTa model uses a hierarchical learning rate during document-level sentiment pre-training, assigning smaller gradient descent rates to the lower layers to ameliorate catastrophic forgetting. The same learning rate is shared by every 3 adjacent encoding layers, with the rate decreasing from the top group of layers to the bottom.
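A minimal sketch of such layer-wise learning rates follows. It assumes a standard Hugging Face RoBERTa classification model (so the attribute paths `model.roberta.encoder.layer`, `model.roberta.embeddings`, and `model.classifier`); the checkpoint name, base rate, and decay factor are illustrative assumptions, not values from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Minimal sketch of hierarchical (layer-wise) learning rates: encoder layers are
# grouped three at a time, and lower groups receive smaller learning rates to
# ease catastrophic forgetting during transfer.

checkpoint = "path/to/japanese-roberta"            # placeholder checkpoint path
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

base_lr, decay = 2e-5, 0.8                         # top group gets base_lr, lower groups decayed
layers = model.roberta.encoder.layer               # 12 transformer blocks for a base-size model
n_groups = (len(layers) + 2) // 3                  # groups of 3 adjacent layers

param_groups = []
for i in range(0, len(layers), 3):
    group_index_from_top = n_groups - 1 - i // 3   # 0 for the top group, larger for lower groups
    lr = base_lr * (decay ** group_index_from_top)
    params = [p for layer in layers[i:i + 3] for p in layer.parameters()]
    param_groups.append({"params": params, "lr": lr})

# Embeddings get the smallest rate, the classification head the largest.
param_groups.append({"params": model.roberta.embeddings.parameters(), "lr": base_lr * decay ** n_groups})
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
print([round(g["lr"], 7) for g in optimizer.param_groups])
```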
Having completed the two pre-training phases of self-supervised and supervised learning, the transferred RoBERTa model is fine-tuned on the target task in the final phase. The processed text sequences are represented with classification identifiers and separator symbols that delimit the review text and the aspect expression (Eq. (12)).
Multiple samples are generated from the same review sentence, one for each specific aspect it addresses. Unlike classical sentence-pair tasks such as text matching and semantic relation recognition, here the separator token joins the review text with an aspect expression rather than two independent sentences.
After the multi-step transferred RoBERTa model has fully extracted useful feature vectors from the document-level samples, the last-layer representation of the leading classification token is fed to the classification layer to output the sentiment category.
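A minimal sketch of this fine-tuning input format is shown below: the review text and one aspect expression are encoded as a pair, and the final-layer representation of the leading token is passed to a small classification head. The checkpoint path and the linear head are illustrative; the separator handling follows the tokenizer's standard pair format rather than the paper's exact equation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Minimal sketch of aspect-oriented fine-tuning input: encode (review, aspect)
# as a sentence pair and classify from the leading token's last-layer vector.

checkpoint = "path/to/japanese-roberta"            # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)   # positive / neutral / negative

review = "料理は美味しいが、店員の態度が悪かった。"
aspect = "店員の態度"
inputs = tokenizer(review, aspect, return_tensors="pt")       # pair input with separator tokens

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state              # (1, seq_len, hidden_size)
cls_vector = hidden[:, 0]                                     # leading classification token
logits = classifier(cls_vector)
print(logits.softmax(dim=-1))
```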
The RoBERTa model trained through multiple stages of pre-training and transfer learning serves as the base model for the document-level sentiment analysis target task, which remedies the weak relevance of large-scale unsupervised MLM pre-training scenarios and the low resource level of the fine-tuning data. The multi-step transfer learning strategy thus expands the traditional paradigm of "transfer model" plus "downstream task fine-tuning" by extending the model training process into phases.
The source-domain task is sentiment analysis of Japanese vocabulary in different domains. This paper uses the multi-domain YelpAspect dataset, obtained from Japanese review websites, which covers three domains: restaurants (R1), beauty (B), and hotels (H), with sentiment tendency categorized as positive, neutral, or negative. The target-domain task is likewise sentiment analysis of Japanese vocabulary in different domains; the data used are the dataset published in Task 4 of the SemEval 2014 competition and a Twitter dataset, again covering three domains: restaurants (R2), laptops (L), and Twitter (T).
To evaluate the multi-step transfer method proposed in this paper, the experiments are divided into 8 groups of cross-domain Japanese vocabulary transfer pairs, namely R1→L, H→L, B→L, H→R2, B→R2, R1→T, H→T, and B→T, where the left side of each arrow is the source domain and the right side is the target domain. For each transfer pair Ds→Dt, the training data in Ds and 70% of the training data in Dt form the training set, the remaining 30% of the training data in Dt serve as the validation set for hyperparameter tuning, and the test data in Dt serve as the test set.
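A small sketch of this split protocol is shown below; the toy records and label names are invented placeholders, and the fixed random seed is an assumption rather than a detail given in the paper.

```python
from sklearn.model_selection import train_test_split

# Minimal sketch of the data split for one transfer pair Ds -> Dt: all source
# training data plus 70% of the target training data form the training set,
# the remaining 30% of the target training data form the validation set, and
# the target test data is kept as the test set.

source_train = [("このホテルは快適だ", "positive"), ("部屋が汚い", "negative")]
target_train = [("バッテリーが長持ちする", "positive"), ("画面が暗い", "negative"),
                ("キーボードは普通", "neutral"), ("動作が遅い", "negative")]
target_test = [("音質が素晴らしい", "positive")]

tgt_70, tgt_30 = train_test_split(target_train, test_size=0.3, random_state=42)

train_set = source_train + tgt_70     # Ds training data + 70% of Dt training data
valid_set = tgt_30                    # 30% of Dt training data, for hyperparameter tuning
test_set = target_test                # Dt test data
print(len(train_set), len(valid_set), len(test_set))
```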
Table 1 compares the results of the proposed method with those of the baseline methods, all of which do not use transfer. Since the training data in this paper cover three domains, the reported results are the averages of the experimental results over these three domains. The methods in the table are divided into three parts: the first part contains the baseline methods, the second part the base model of this paper, and the third part the model proposed in this paper. The following conclusions can be drawn from the table:
1) The base model BERT achieves the best results on two domains (Laptops, Twitter), and the RoBERTa-based method of this paper achieves the best results on the Restaurants (R) domain. This result is expected, because the base model BERT is fine-tuned on the target domain and directly uses the labeled samples there, whereas the method in this paper is an unsupervised domain-adaptive method that does not use labeled target-domain data. The method in this paper performs best on the restaurant domain mainly because a restaurant domain is included among the source domains, so the transferred domain knowledge improves results on the target domain.
2) The base model BERT gives better results than the other baseline methods. The main reason is that BERT is pre-trained on large-scale natural language text, so a large amount of linguistic knowledge is extracted and encoded into its network structure; since the target domain has only a limited amount of labeled data, these linguistic features strongly complement the target task and increase the model's generalization ability.
3) The method in this paper outperforms most baseline methods (TD-LSTM, AE-LSTM, ATAE-LSTM, MemNet, IAN, RAM), especially on the Restaurants (R) domain, even though it does not use labeled target-domain data. This shows that the model can fully exploit the knowledge in the source domain and transfer it well to the target domain.
Table 1: Experimental results of our method and the baseline methods (non-transfer)
| Model | R Acc (%) | R Macro-F1 (%) | L Acc (%) | L Macro-F1 (%) | T Acc (%) | T Macro-F1 (%) |
|---|---|---|---|---|---|---|
| TD-LSTM | 75.63 | 64.25 | 68.12 | 62.28 | 66.58 | 63.98 |
| AE-LSTM | 69.05 | 62.54 | 69.03 | 62.46 | 69.32 | 56.39 |
| ATAE-LSTM | 77.26 | 65.10 | 68.74 | 62.33 | 69.69 | 56.90 |
| MemNet | 78.05 | 65.85 | 70.36 | 63.97 | 68.44 | 67.11 |
| IAN | 78.57 | -- | 72.23 | -- | -- | -- |
| RAM | 78.41 | 68.56 | 72.00 | 68.51 | 69.52 | 67.24 |
| SACA | 82.10 | 73.15 | 76.31 | 73.04 | 72.76 | 71.07 |
| BERT | 82.93 | 74.08 | -- | -- | -- | -- |
| RoBERTa | 72.43 | 69.55 | 70.83 | 68.76 | -- | -- |
Table 2 compares the methods in this paper with baseline methods that all include a transfer strategy. For each transfer pair, two metrics are reported: accuracy (Acc) and macro-averaged F1 (Macro-F1). The following conclusions can be drawn from the comparison:
1) The model in this paper achieves the best results on all transfer pairs compared with the other transfer methods, which indicates the effectiveness of the proposed method.
2) The BERT model gives better results than the MGAN model. Both train on the source domain and then apply the model directly to the target domain; the difference is that BERT uses a BERT encoder while MGAN uses a Bi-LSTM combined with different attention mechanisms, which demonstrates the effectiveness of the pre-trained BERT model.
3) The RoBERTa model in this paper gives slightly better results than the MMD model. Both the MMD metric and the KL divergence can measure the distance between two distributions: the KL divergence can be viewed as a first-order-moment, entropy-based match (i.e., a metric of mean matching), whereas MMD matches a weighted sum of all order moments. During the experiments, this paper found that KL gave slightly better results and is less computationally expensive than MMD, so the KL divergence metric was finally chosen.
4) Both the RoBERTa model of this paper and the MMD model perform better than the DAN method, which is based on adversarial training. In the experiments we found that when the BERT model is sufficiently trained, the domain discriminator is still underfitting, and DAN's approach is more demanding in terms of parameter tuning.
Table 2: Experimental results of our method and the baseline methods (with transfer)
| Ds→Dt | Indicator | BERT | MMD | DAN | MGAN | Ours |
|---|---|---|---|---|---|---|
| R1→L | Acc (%) | 70.67 | 72.33 | 72.05 | 69.80 | -- |
|  | Macro-F1 (%) | 68.77 | 69.84 | 68.49 | 66.96 | -- |
| B→L | Acc (%) | 70.17 | 71.44 | 70.56 | 70.40 | -- |
|  | Macro-F1 (%) | 67.41 | 67.91 | 67.28 | 66.82 | -- |
| H→L | Acc (%) | 70.74 | 71.35 | 71.43 | 70.79 | -- |
|  | Macro-F1 (%) | 67.05 | 67.98 | 67.79 | 67.84 | -- |
| B→R2 | Acc (%) | 76.66 | 78.89 | 78.34 | 72.71 | -- |
|  | Macro-F1 (%) | 69.03 | 69.78 | 70.86 | 64.26 | -- |
| H→R2 | Acc (%) | 75.49 | 79.57 | 78.87 | 72.21 | -- |
|  | Macro-F1 (%) | 68.24 | 70.74 | 70.03 | 62.52 | -- |
| R1→T | Acc (%) | 60.60 | 69.24 | 68.32 | 46.50 | -- |
|  | Macro-F1 (%) | 58.33 | 67.23 | 66.01 | 45.77 | -- |
| B→T | Acc (%) | 61.30 | 70.79 | 70.11 | 46.36 | -- |
|  | Macro-F1 (%) | 59.94 | 68.21 | 68.23 | 45.76 | -- |
| H→T | Acc (%) | 60.91 | 69.62 | 68.96 | 47.52 | -- |
|  | Macro-F1 (%) | 58.21 | 67.33 | 67.24 | 46.77 | -- |
To make the transfer effects between Japanese vocabulary in different domains easy to see, the experimental results are presented as a bar chart in Figure 2, where the horizontal axis is the accuracy rate. Transfer is most effective when the target domain is the restaurant domain and least effective when the target domain is the Twitter domain. Intuitively, review text in the Twitter domain contains more social vocabulary and more diverse expressions, which differ more from the source domains, so the transfer effect is worse. Laptop-domain reviews contain more electronics terminology, but some of the evaluative Japanese words are related to the source domains, which slightly improves the transfer result. These results also indirectly corroborate the soundness of the experimental design.

Visualization of experimental results
Relevant factors affecting Japanese language sentiment analysis are further analyzed in order to determine the optimal environment for the application of the methodology of this paper and to enhance the effectiveness of Japanese language sentiment analysis.
The word vector representation of a word should consider both word-level and document-level sentiment information. Therefore, the total loss function is defined as the weighted sum of the word-level and document-level loss functions, as shown in Equation (15):
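Equation (15) itself is not reproduced above; the following is a plausible reconstruction in standard notation, consistent with the trade-off coefficient α analyzed below. The symbols $L_{\mathrm{word}}$ and $L_{\mathrm{doc}}$ are generic labels for the word-level and document-level loss terms, not the paper's original notation.

```latex
% Reconstructed form of the weighted total loss; alpha is the trade-off
% coefficient discussed below, L_word and L_doc the word-level and
% document-level loss terms.
L_{\mathrm{total}} = \alpha \, L_{\mathrm{word}} + (1 - \alpha) \, L_{\mathrm{doc}}
```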
During the training of word embeddings that incorporate emotional semantics, the value of α affects the representational ability of the embeddings. Since the classification effect is best on the Japanese dataset, that dataset is chosen to explore the effect of different α values in steps of 0.1, and the experimental results are shown in Figure 3.

The effects of different α values on cross-language emotional classification
As can be seen from Fig. 3, the classification accuracy reaches 0.801 when α is 0.1, at which point the document-level sentiment information carries the largest weight. As α increases, the classification accuracy gradually decreases, reaching its lowest value at α = 0.5, where word-level and document-level sentiment information have equal weight. As α increases further, the weight of word-level sentiment information exceeds that of document-level information and the accuracy improves again, reaching its highest value of 0.823 at α = 0.9. The results show that word-level and document-level sentiment information each provide effective supervision on their own, but when their weights are close, the sentiment information is used less efficiently, which weakens the word embedding representation and in turn reduces cross-lingual sentiment classification accuracy.
The dimension size of word vectors has a certain effect on the ability to represent the semantics of words, so the experiments in this section set the word vector dimensions to 50, 100, 150, and 200 dimensions, respectively, to explore the effect of word vector dimensions on cross-lingual sentiment analysis. The Japanese dataset is still chosen for the experiment, and DAN is selected for the feature extraction network. The experimental results are shown in Table 3.
Table 3: Influence of word vector dimension on sentiment analysis
| Method | 50-dim Accuracy | 50-dim F1 | 100-dim Accuracy | 100-dim F1 | 150-dim Accuracy | 150-dim F1 | 200-dim Accuracy | 200-dim F1 |
|---|---|---|---|---|---|---|---|---|
| Upper | 0.856 | 0.867 | -- | -- | -- | -- | -- | -- |
| Machine translation | 0.729 | 0.718 | -- | -- | -- | -- | -- | -- |
| Bi_random | 0.557 | 0.709 | 0.576 | 0.647 | 0.572 | 0.545 | 0.604 | 0.718 |
| Bi_W2V | 0.732 | 0.714 | 0.747 | 0.744 | 0.777 | 0.730 | 0.748 | 0.709 |
| CLCDSA | 0.602 | 0.632 | 0.689 | 0.693 | 0.720 | 0.715 | 0.786 | 0.799 |
| Ours | 0.818 | 0.839 | 0.776 | 0.771 | 0.800 | 0.733 | 0.778 | 0.700 |
As the word vector dimension increases in the cross-lingual sentiment classification task, the Bi_random method, which uses only random word embeddings, improves most obviously, reaching a classification accuracy of 0.604 and an F1 value of 0.718 at 200 dimensions. This indicates that for Bi_random, whose text vectors are randomly initialized, larger word vector dimensions provide more representational capacity and better results. For the Bi_W2V method, accuracy increases slightly as the dimension grows: it obtains its highest F1 value of 0.744 at 100 dimensions and its highest accuracy of 0.777 at 150 dimensions, while at 200 dimensions both accuracy and F1 decrease.
For the method in this paper, changing the word vector dimension does not noticeably improve classification accuracy; the method already integrates sentiment semantic information well at 50 dimensions, where it reaches its highest accuracy of 0.818 and F1 value of 0.839, and it shows good stability across dimensions.
For the CLCDSA method, the best performance is achieved when the word vector dimension is 200, and performance decreases as the dimension decreases. This is mainly because the number of parameters of its Encoder-Decoder model shrinks with the dimension: at 200 dimensions the model has 12,380,000 parameters, while at 50 dimensions it has only 710,000.
To analyze, from a linguistic and semantic point of view, whether word vector representations based on source-language sentiment features take both word semantics and sentiment features into account better than Word2Vec, this section compares the word vector representations obtained by this paper's method and by the Word2Vec model using visualization. The representations produced by both methods are 50-dimensional vectors, which cannot be shown directly in a 2D plane, so they are reduced in dimensionality with Principal Component Analysis (PCA) and then plotted in two dimensions.
Fig. 4 and Fig. 5 show two groups of words visualized in the 2D plane under the Word2Vec representation and under the representation of this paper's method, respectively. To keep the visualization legible, only a small number of words are selected as examples. Each point in the figures is the 2D embedding of a word's high-dimensional vector after PCA dimensionality reduction; the closer two word vectors are, the closer their points lie in the plane. The Word2Vec results are shown on the left of each figure and the results of this paper's method on the right.
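A minimal sketch of this visualization procedure is given below. The random 50-dimensional vectors are placeholders standing in for the embeddings learned by Word2Vec or by the paper's method, and the word list is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Minimal sketch of the visualization step: 50-dimensional word vectors are
# reduced to 2 dimensions with PCA and plotted, so that words whose vectors
# are close end up close in the plane.

rng = np.random.default_rng(0)
words = ["良い", "嬉しい", "美しい", "嫌い", "悪い"]
vectors = rng.normal(size=(len(words), 50))        # placeholder 50-d word vectors

points = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(4, 4))
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.title("PCA projection of word vectors")
plt.tight_layout()
plt.savefig("word_vectors_pca.png")
```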

Example (1) of word vector representation

Example (2) of word vector representation
Figure 4 visualizes a group of words ("good", "good", "hate", "bad", "exciting", "happy", "beautiful") in the 2D plane. The affective polarity of this group is obvious, and it can be seen that the word vector representation of this paper's method takes the words' emotional feature information into account and can separate words of different affective polarity: for example, the negatively polarized words "hate" and "bad" lie close together, while "good" and "delicious" are clustered together. In the Word2Vec representation, by contrast, "happy", "bad", and "beautiful" are clustered together, so the emotional polarity of words cannot be distinguished effectively.
On the basis of Figure 4, a few semantically close words, "dog", "cat", and "bird", are added and a few words are removed at random; the visualization is shown in Figure 5. The Word2Vec model has an advantage in semantic representation and clusters the semantically similar words "dog", "cat", and "bird", but the words "hate" and "exciting" still overlap. The word vector representation of this paper's method, however, can still clearly distinguish the emotional polarity of words: "hate", as a word of negative emotional polarity, is clearly separated from the other words.
The purpose of this paper is to design a multi-step transfer learning strategy based on the RoBERTa model to improve sentiment analysis of Japanese language using transfer learning and sentiment analysis methods. The conclusions drawn from the study are as follows:
1) The method in this paper achieved a higher sentiment classification accuracy (83.71%) and F1 value (74.5%). Among all transfer pairs, the proposed model also exhibits the highest sentiment classification accuracy, which verifies the soundness of the transfer learning strategy.
2) When the trade-off coefficient α is 0.9, the sentiment classification accuracy of the method is highest, and the sentiment recognition rate remains stable across different word vector dimensions.
1) The pre-training model RoBERTa used in this paper ignores the information in the middle layers of the bidirectional encoder network and does not exploit the emotional information present in other modalities such as images and videos.
2) In subsequent research, external neural networks can be combined with the top-layer output of the pre-trained model, and emotional factors such as facial expressions and scenes can be used to complement the text semantics.
This research was supported by the State Scholarship Fund to pursue study: “China Scholarship Council” (File No: [2021]340).