Sentiment analysis and validity evaluation of Japanese language under the transfer learning model
Published Online: Mar 21, 2025
Received: Oct 31, 2024
Accepted: Feb 19, 2025
DOI: https://doi.org/10.2478/amns-2025-0606
© 2025 Xiaodan Li, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Transfer learning is based on the assumption that there is some sharing or correlation of categories and features between the source and target domains. By transferring knowledge from the source domain to the target domain, it can help the learning process in the target domain. There are two main types of transfer learning: feature-based transfer learning and model-based transfer learning [1-2]. Feature-based transfer learning transfers features from the source domain to the target domain and then uses labeled data from the target domain to train the model. Model-based transfer learning applies the model of the source domain directly to the target domain. The application of transfer learning in sentiment analysis is of great significance [3-4].
Japanese sentiment analysis is the process of analyzing Japanese text or speech data with natural language processing techniques and machine learning algorithms in order to extract the sentiment tendency or polarity expressed in it. In recent years, with the development of deep learning technology, its application in the field of sentiment analysis has become increasingly widespread. As an important branch of deep learning, transfer learning improves performance on a target-domain task by utilizing knowledge from a source domain. In sentiment analysis, deep learning models pre-trained on large-scale corpora can be transferred to the sentiment analysis task, so that the performance of sentiment analysis models on small-scale datasets is improved [5-8].
Literature [9] combined the TF-IDF algorithm and SVM to construct a Japanese text sentiment classification model and designed controlled experiments to analyze its performance. The comparative analysis shows that the model exhibits good overall performance, can meet the needs of a sentiment classification system, and is of reference value for related research. Literature [10] emphasizes the importance of sentiment analysis and reviews the development of transfer learning and its applications; based on existing research results, it outlines the algorithms and applications of transfer learning in sentiment analysis and the development trends of the field. Literature [11] uses active learning to construct a high-quality sentiment corpus of Japanese tweets and improves sentiment analysis with a Transformer language model; experiments show that the adapted Transformer language model outperforms other models in both Twitter sentiment analysis and sentiment corpus creation. Literature [12] proposes new techniques that use language-agnostic sentence representations to adapt models trained on texts in one language to recognize polarity in texts from other languages; the model was evaluated on the PolEmo 1.0 sentiment corpus and additionally validated with a deep neural network model. Literature [13] describes the application of deep learning natural language processing techniques to multilingual sentiment analysis; the effectiveness of deep learning models such as BERT in cross-language sentiment classification is confirmed by validating their performance in a multilingual environment, and future research directions for natural language processing are also discussed. Literature [14] examined a sentiment classification model based on radical embeddings for Chinese and Japanese, consisting of a CNN word feature encoder and a bidirectional RNN document feature encoder, and showed that the radical embedding-based approach is cost-effective for machine learning in Chinese and Japanese. Literature [15] created a Japanese social media sentiment classification model with the help of support vector machines and KNN and proposed a topic model for user sentiment analysis based on these classifiers; experiments on a Twitter dataset demonstrate the effectiveness of the proposed method. Literature [16] explores the use of deep learning techniques to detect and predict emotional tendencies by constructing a natural language processing system with artificial neural network models and testing its performance; the results show that the system's analysis techniques meet the basic requirements, and the importance of natural language processing and the application of deep learning to Japanese sentiment analysis are discussed. Literature [17] develops accurate multi-task deep learning models to estimate the type and intensity of emotions in Japanese tweets; by extending a variety of deep learning models for estimating sentiment intensity, both the type of sentiment and its intensity are predicted, and the effectiveness of the developed models is demonstrated experimentally. Literature [18] aims to determine a deep transfer learning baseline for Russian-language sentiment analysis.
By identifying Russian sentiment analysis datasets and language models officially supporting Russian, and fine-tuning a multilingual bidirectional encoder representation, robust state-of-the-art results were obtained on the sentiment datasets, and the fine-tuned model was made publicly available. Literature [19] proposes a deep CNN model for sentiment analysis based on character-level representations; the model is also used to apply transfer learning between the domains of sentiment analysis and emotion detection, and experiments show that it achieves improved accuracy.
This article examines the forms of sentiment analysis at each level of text granularity and reviews the fundamental concepts of transfer learning, transfer methods, and feature selection methods. RoBERTa is selected as the base model for building a multi-step transfer method for Japanese sentiment analysis. Plain text without sentiment labels is used for pre-training the masked language model, and denoising encoding and pre-training are performed within the RoBERTa model. Transfer learning strategies such as introducing external data during pre-training, training the RoBERTa model with a cross-entropy loss function, and assigning smaller gradient descent rates to lower layers are used to alleviate negative transfer and catastrophic forgetting in multi-step transfer learning. The processed text sequences are represented with classification identifiers and separator symbols, and the cross-entropy loss function is introduced to fine-tune the trained RoBERTa model. Finally, sentiment analysis of vocabulary in the comments of a Japanese-language website is carried out with different methods, and a variety of factors that affect sentiment analysis are examined.
Sentiment analysis is a term used to describe emotional tendency analysis and polarity judgment. In plain terms, it is the process of analyzing, processing, summarizing, and reasoning about emotionally charged subjective content. Sentiment analysis examines the emotional state implicit in what people convey and judges or evaluates the speaker's attitude or opinion, while the purpose of sentiment classification is to categorize sentiment data by polarity [20].
With the further development of network informatization and globalization, a large number of subjective texts with emotional polarity have appeared on the Internet, and researchers have gradually moved from analyzing single emotion words to studying more complex emotion sentences and even document-level emotions. Therefore, according to text granularity, sentiment analysis can be categorized into word-level, phrase-level, sentence-level, document-level, and multi-document-level analysis.
1) Judgment of Sentiment Polarity at Word Level
Word-level sentiment polarity can be seen as the basis of text sentiment analysis. Generally, a value of -1 or 1 indicates whether a word is negative or positive. Methods for judging the sentiment polarity of words fall mainly into two groups: corpus-based methods and lexicon-based methods.
Corpus-based methods usually take the connectives and co-occurrence features between words as the main basis for judging sentiment polarity. Lexicon-based methods usually rely on the semantic similarity and hierarchical structure of a dictionary such as the Chinese dictionary HowNet as the main basis for judging the sentiment polarity of words.
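For illustration, the following is a minimal sketch of the lexicon-based idea: a word's polarity is scored by its similarity to small seed sets of positive and negative words. The seed lists and the similarity table are invented placeholders for demonstration only; they are not taken from HowNet or from this paper.

```python
# Minimal sketch of lexicon-based word polarity judgment: a word's polarity is
# estimated from its semantic similarity to small sets of positive and negative
# seed words. The seed lists and the toy similarity table below are illustrative
# stand-ins for a real lexicon such as HowNet.

POSITIVE_SEEDS = {"良い", "嬉しい", "美しい"}   # "good", "happy", "beautiful"
NEGATIVE_SEEDS = {"悪い", "嫌い", "ひどい"}     # "bad", "hate", "terrible"

# Toy similarity table (word pair -> similarity in [0, 1]); in practice this
# would come from a dictionary's hierarchy or pre-trained embeddings.
SIMILARITY = {
    ("素晴らしい", "良い"): 0.9, ("素晴らしい", "悪い"): 0.1,
    ("最悪", "悪い"): 0.95, ("最悪", "良い"): 0.05,
}

def similarity(w1: str, w2: str) -> float:
    return SIMILARITY.get((w1, w2), SIMILARITY.get((w2, w1), 0.0))

def word_polarity(word: str) -> float:
    """Return a score in [-1, 1]: positive if closer to positive seeds."""
    pos = max((similarity(word, s) for s in POSITIVE_SEEDS), default=0.0)
    neg = max((similarity(word, s) for s in NEGATIVE_SEEDS), default=0.0)
    return pos - neg

if __name__ == "__main__":
    for w in ("素晴らしい", "最悪"):
        print(w, round(word_polarity(w), 2))   # > 0 positive, < 0 negative
```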
2) Sentiment analysis at the utterance level
Sentence-level sentiment analysis usually first distinguishes whether an utterance is subjective or objective, then judges whether a subjective sentence is positive or negative, and finally extracts the finer-grained elements of the sentiment tendency in the utterance, typically the holder of the opinion and the object being commented on.
3) Document-level sentiment analysis
Sentiment analysis at the document level usually classifies the text as positive, negative, or neutral overall, and typically requires pre-processing the document before analysis. In this pre-processing, the document is first divided into sentences according to punctuation, each sentence is then analyzed with a sentence-level sentiment analysis method, and finally the sentiment of the document is obtained by integrating the sentiment tendency values of all sentences.
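A minimal sketch of this split-score-aggregate procedure follows. The keyword-based sentence scorer is only a placeholder for the model-based classifiers discussed later; the word lists are illustrative.

```python
import re

# Minimal sketch of document-level sentiment analysis: split the text into
# sentences on punctuation, score each sentence, and aggregate the sentence
# tendencies into a document-level label.

POSITIVE_WORDS = {"良い", "嬉しい", "好き"}
NEGATIVE_WORDS = {"悪い", "嫌い", "ひどい"}

def split_sentences(text: str) -> list[str]:
    # Split on Japanese and ASCII sentence-ending punctuation.
    return [s for s in re.split(r"[。！？!?]", text) if s.strip()]

def sentence_polarity(sentence: str) -> int:
    score = sum(w in sentence for w in POSITIVE_WORDS) - sum(w in sentence for w in NEGATIVE_WORDS)
    return (score > 0) - (score < 0)          # 1, 0, or -1

def document_polarity(text: str) -> str:
    total = sum(sentence_polarity(s) for s in split_sentences(text))
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

if __name__ == "__main__":
    print(document_polarity("料理は良い。でも店員はひどい。サービスも悪い。"))  # negative
```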
Transfer learning differs from traditional machine learning methods: it is a framework for learning across different domains and different kinds of knowledge. Its main advantage is that it does not require much labeling in the target domain; learning in the new domain is completed using the labeled data in the source domain together with a small amount of labeled data in the target domain, thereby achieving the transfer of knowledge from the source domain to the target domain.
The object of transfer refers to what should be transferred between two or more different tasks. The content of transfer can be methods, parameters, features, instances, or related knowledge, which can be summarized into two categories: behavior and knowledge. Behavioral transfer can be interpreted as carrying problem-solving strategies and learning methods from the source domain to a new, different domain, and it focuses on the similarity of solutions to classification problems between domains. Knowledge transfer focuses on domain classification itself, extracting the correlations between different domains with respect to their classification features. The object of transfer is thus the basic principle underlying the construction of classification models [21].
The transfer method refers to how the transferred objects are moved using appropriate means and techniques, and it is the main research content of transfer learning. Different transfer methods applied to the same objects produce different transfer effects and performance, so an appropriate transfer method must be chosen for each situation.
Instance-based transfer learning is the most intuitive and easiest to understand of these methods. It assumes that some training data in the source domain can be weighted and selected and then applied to the target domain, so the weighting strategy for the source-domain training data becomes the key to instance transfer.
If the distribution function of the data in the source domain differs from that of the target domain, the source instances are re-weighted so that their weighted distribution approximates the distribution of the target domain.
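This re-weighting idea can be written as standard importance weighting; the numerical sketch below assumes the notation P_s(x) and P_t(x) for the source and target distributions (the paper's own symbols and estimator are not reproduced), and uses Gaussian densities purely for illustration.

```python
import numpy as np

# Minimal sketch of instance-based transfer via importance weighting: each
# source-domain example x is weighted by w(x) = P_t(x) / P_s(x), so that the
# weighted source data approximates the target distribution. In practice the
# ratio is estimated, e.g. with a domain classifier; here it is computed from
# known toy densities.

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=1000)          # source samples ~ P_s
p_s = gaussian_pdf(x_src, 0.0, 1.0)              # source density P_s
p_t = gaussian_pdf(x_src, 0.5, 1.0)              # target density P_t
weights = p_t / p_s                              # importance weights

# The weighted source mean moves toward the target mean (0.5).
print("unweighted mean:", round(x_src.mean(), 3))
print("weighted mean:  ", round(np.average(x_src, weights=weights), 3))
```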
The purpose of feature-based transfer learning is to learn a "good" feature representation for the target domain: a common feature representation is found in the feature space of the source and target domains, and knowledge is transferred from the source domain to the target domain through this shared representation, which yields a significant improvement in learning performance in the target domain.
Feature-based transfer thereby avoids the problem of directly estimating the source and target data distributions: assuming that a feature transformation maps both domains into a common feature space, the source-domain knowledge can be reused through that shared space.
If both the target and source domains have labeled data, but the target domain has far fewer labeled samples than unlabeled ones, the classification model in the target domain can be optimized with the help of the model parameters learned from the source data.
If the parameters learned from the source domain are taken as the starting point of the target-domain model, they can then be adjusted with the small amount of labeled target-domain data so that the model adapts to the target task.
The process of selecting features that represent the overall characteristics of a text document is called feature selection. Feature selection is a key issue in sentiment analysis: not all features are useful for training a classifier, and selecting features reduces the dimensionality of the text vector space, improves learning accuracy, improves classification performance, and reduces computational overhead.
Feature selection is the selection of a proper subset of the original feature set that preserves the information most useful for classification.
The functions usually used in feature selection methods include feature frequency, document frequency, information gain, weight of evidence, and mutual information.
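As a small illustration of one of these criteria, the sketch below filters terms by document frequency; the tokenized toy corpus and the thresholds are invented for demonstration.

```python
from collections import Counter

# Minimal sketch of feature selection by document frequency: keep only terms
# that occur in at least `min_df` documents and in at most `max_df_ratio` of
# all documents, discarding terms that are too rare or too common.

def select_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                       # count each term once per document
    return {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}

docs = [["映画", "良い", "最高"], ["映画", "退屈", "悪い"], ["映画", "良い"]]
print(select_by_document_frequency(docs))         # {'良い'}: rare and ubiquitous terms removed
```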
Mutual information is a useful information measure in information theory, which can be viewed as the amount of information contained in one random variable about another random variable, or the uncertainty of a random variable reduced by the knowledge of another random variable. Mutual information has been widely used in statistical language modeling to represent the correlation between two different variables [22].
Let the joint distribution of two random variables $(X, Y)$ be $p(x, y)$ and their marginal distributions be $p(x)$ and $p(y)$. The mutual information $I(X;Y)$ is the relative entropy between the joint distribution and the product of the marginals:

$$I(X;Y) = \sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}$$
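The following short sketch evaluates this textbook formula numerically; the 2x2 joint distribution between a term occurrence and a sentiment class is invented purely for illustration.

```python
import numpy as np

# Minimal sketch of mutual information I(X;Y) = sum_{x,y} p(x,y) log(p(x,y) / (p(x) p(y))),
# computed for a toy joint distribution between a term occurrence X and a
# document sentiment class Y.

def mutual_information(p_xy: np.ndarray) -> float:
    p_x = p_xy.sum(axis=1, keepdims=True)          # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = p_xy > 0                                # avoid log(0)
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Rows: term absent / present; columns: negative / positive document.
p_xy = np.array([[0.35, 0.15],
                 [0.10, 0.40]])
print(f"I(X;Y) = {mutual_information(p_xy):.4f} nats")
```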
A state-of-the-art pre-training model, RoBERTa, is chosen as the base model, and a multi-step transfer learning algorithm is designed to accomplish the document-level sentiment analysis task. The robust RoBERTa model ensures that the feature vectors contain rich semantic information, and the multi-step transfer learning approach is proposed to mitigate the scarcity of training data for the document-level task.
The transfer process of the multi-step strategy is shown in Fig. 1.

Multi-step transfer learning flowchart
In the first stage of the transfer learning strategy, the RoBERTa model trained on the generic domain is further trained with the masked language model (MLM) objective on unlabeled plain text from the target domain.
The BERT model randomly replaces tokens in the input sequence with the special token [MASK].
The masked language model introduces a cross-entropy loss function to train the model's ability to extract structural and semantic features of a sentence by predicting, in the final output sequence, the original words hidden by [MASK].
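A minimal sketch of this MLM further-pre-training step is shown below, using the Hugging Face transformers library. The checkpoint path is a placeholder, since the paper does not specify which Japanese RoBERTa weights it starts from; the masking ratio of 15% is the library default, not a value confirmed by the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Minimal sketch of masked language model further pre-training: tokens are
# randomly replaced with [MASK] and the model is trained with a cross-entropy
# loss to recover the original tokens.

checkpoint = "path/to/japanese-roberta"            # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

texts = ["この映画はとても面白かった。", "サービスが悪くて残念だった。"]  # unlabeled plain text
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]
batch = collator(encodings)                        # adds masked input_ids and labels

outputs = model(**batch)                           # outputs.loss is the MLM cross-entropy
outputs.loss.backward()                            # one gradient step of further pre-training
print(float(outputs.loss))
```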
Document-level sentiment analysis tasks struggle to adequately drive very large models such as RoBERTa because of the complexity of manual engineering and the limited training data. The work in this phase addresses this problem by designing a transfer learning strategy that acquires sentiment knowledge from related domains to help deepen the model's comprehension.
When the source and target domains are unrelated, forced transfer can harm the algorithm's performance on the target domain, a situation known as negative transfer. The multi-step transfer learning algorithm therefore introduces external data through a supervised pre-training task rather than simply exposing the model to more text. First, sentiment analysis datasets from the Internet are surveyed, and similarly labeled data from review scenarios are selected as the external pre-training corpus.
The evaluation objects in document-level sentiment analysis correspond to three classification categories: positive, neutral, and negative. When the task-guided sentiment pre-training takes the form of a regression task, the labels must be reclassified according to the range of the document scoring values and the sentiment scores in order to match the target task's samples and labels. The RoBERTa model that has completed the first-stage MLM pre-training is transferred, and a document-level sentiment label classification task is introduced for further supervised pre-training. This phase only requires selecting a small amount of specialized data for targeted training, so that the model can effectively capture the background information of the evaluation object and its sentiment inference patterns. The output feature matrix is then passed to a classification layer that predicts the document-level sentiment label.
A cross-entropy loss function is used to train the RoBERTa model on the document-level sentiment classification task, minimizing the error between the predicted values and the true labels.
In BERTology language models based on the Transformer bidirectional encoder, the encoding layers shift from capturing phrase- and sentence-level structural knowledge in the lower layers to complex semantic knowledge in the higher layers. Inspired by the fine-tuning scheme of the BERT model for text classification, the transferred RoBERTa model uses a hierarchical learning rate during document-level sentiment pre-training, assigning smaller gradient descent rates to the lower layers to ameliorate catastrophic forgetting. The same learning rate is shared by every 3 adjacent encoding layers, with the rate decreasing from the top group of layers to the bottom.
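A minimal sketch of such layer-wise learning rates follows. It assumes a standard Hugging Face RoBERTa classification model (so the attribute paths `model.roberta.encoder.layer`, `model.roberta.embeddings`, and `model.classifier`); the checkpoint name, base rate, and decay factor are illustrative assumptions, not values from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Minimal sketch of hierarchical (layer-wise) learning rates: encoder layers are
# grouped three at a time, and lower groups receive smaller learning rates to
# ease catastrophic forgetting during transfer.

checkpoint = "path/to/japanese-roberta"            # placeholder checkpoint path
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

base_lr, decay = 2e-5, 0.8                         # top group gets base_lr, lower groups decayed
layers = model.roberta.encoder.layer               # 12 transformer blocks for a base-size model
n_groups = (len(layers) + 2) // 3                  # groups of 3 adjacent layers

param_groups = []
for i in range(0, len(layers), 3):
    group_index_from_top = n_groups - 1 - i // 3   # 0 for the top group, larger for lower groups
    lr = base_lr * (decay ** group_index_from_top)
    params = [p for layer in layers[i:i + 3] for p in layer.parameters()]
    param_groups.append({"params": params, "lr": lr})

# Embeddings get the smallest rate, the classification head the largest.
param_groups.append({"params": model.roberta.embeddings.parameters(), "lr": base_lr * decay ** n_groups})
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
print([round(g["lr"], 7) for g in optimizer.param_groups])
```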
Having completed the two pre-training phases of self-supervised and supervised learning, the transferred RoBERTa model is fine-tuned on the target task in the final phase. The processed text sequences are represented with classification identifiers and separator symbols that delimit the review text and the aspect expression (Eq. (12)).
Multiple samples are generated from the same review sentence, one for each specific aspect it addresses. Unlike classical sentence-pair tasks such as text matching and semantic relation recognition, here the separator token joins the review text with an aspect expression rather than two independent sentences.
After the multi-step transferred RoBERTa model has fully extracted useful feature vectors from the document-level samples, the last-layer representation of the leading classification token is fed to the classification layer to output the sentiment category.
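A minimal sketch of this fine-tuning input format is shown below: the review text and one aspect expression are encoded as a pair, and the final-layer representation of the leading token is passed to a small classification head. The checkpoint path and the linear head are illustrative; the separator handling follows the tokenizer's standard pair format rather than the paper's exact equation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Minimal sketch of aspect-oriented fine-tuning input: encode (review, aspect)
# as a sentence pair and classify from the leading token's last-layer vector.

checkpoint = "path/to/japanese-roberta"            # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)   # positive / neutral / negative

review = "料理は美味しいが、店員の態度が悪かった。"
aspect = "店員の態度"
inputs = tokenizer(review, aspect, return_tensors="pt")       # pair input with separator tokens

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state              # (1, seq_len, hidden_size)
cls_vector = hidden[:, 0]                                     # leading classification token
logits = classifier(cls_vector)
print(logits.softmax(dim=-1))
```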
The RoBERTa model trained through multiple stages of pre-training and transfer learning serves as the base model for the document-level sentiment analysis target task, which remedies the weak relevance of large-scale unsupervised MLM pre-training scenarios and the low resource level of the fine-tuning data. The multi-step transfer learning strategy thus expands the traditional paradigm of "transfer model" plus "downstream task fine-tuning" by extending the model training process into phases.
The source-domain task is sentiment analysis of Japanese vocabulary in different domains. This paper uses the multi-domain YelpAspect dataset, obtained from Japanese review websites, which covers three domains: restaurants (R1), beauty (B), and hotels (H), with sentiment tendency categorized as positive, neutral, or negative. The target-domain task is likewise sentiment analysis of Japanese vocabulary in different domains; the data used are the dataset published in Task 4 of the SemEval 2014 competition and a Twitter dataset, again covering three domains: restaurants (R2), laptops (L), and Twitter (T).
To evaluate the multi-step transfer method proposed in this paper, the experiments are divided into 8 groups of cross-domain Japanese vocabulary transfer pairs, namely R1→L, H→L, B→L, H→R2, B→R2, R1→T, H→T, and B→T, where the left side of each arrow is the source domain and the right side is the target domain. For each transfer pair Ds→Dt, the training data in Ds and 70% of the training data in Dt form the training set, the remaining 30% of the training data in Dt serve as the validation set for hyperparameter tuning, and the test data in Dt serve as the test set.
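A small sketch of this split protocol is shown below; the toy records and label names are invented placeholders, and the fixed random seed is an assumption rather than a detail given in the paper.

```python
from sklearn.model_selection import train_test_split

# Minimal sketch of the data split for one transfer pair Ds -> Dt: all source
# training data plus 70% of the target training data form the training set,
# the remaining 30% of the target training data form the validation set, and
# the target test data is kept as the test set.

source_train = [("このホテルは快適だ", "positive"), ("部屋が汚い", "negative")]
target_train = [("バッテリーが長持ちする", "positive"), ("画面が暗い", "negative"),
                ("キーボードは普通", "neutral"), ("動作が遅い", "negative")]
target_test = [("音質が素晴らしい", "positive")]

tgt_70, tgt_30 = train_test_split(target_train, test_size=0.3, random_state=42)

train_set = source_train + tgt_70     # Ds training data + 70% of Dt training data
valid_set = tgt_30                    # 30% of Dt training data, for hyperparameter tuning
test_set = target_test                # Dt test data
print(len(train_set), len(valid_set), len(test_set))
```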
Table 1 compares the results of the proposed method with those of the baseline methods, all of which do not use transfer. Since the training data in this paper cover three domains, the reported results are the averages of the experimental results over these three domains. The methods in the table are divided into three parts: the first part contains the baseline methods, the second part the base model of this paper, and the third part the model proposed in this paper. The following conclusions can be drawn from the table:
1) The base model BERT achieves the best results on two domains (Laptops, Twitter), and the RoBERTa-based method of this paper achieves the best results on the Restaurants (R) domain. This result is expected, because the base model BERT is fine-tuned on the target domain and directly uses the labeled samples there, whereas the method in this paper is an unsupervised domain-adaptive method that does not use labeled target-domain data. The method in this paper performs best on the restaurant domain mainly because a restaurant domain is included among the source domains, so the transferred domain knowledge improves results on the target domain.
2) The base model BERT gives better results than the other baseline methods. The main reason is that BERT is pre-trained on large-scale natural language text, so a large amount of linguistic knowledge is extracted and encoded into its network structure; since the target domain has only a limited amount of labeled data, these linguistic features strongly complement the target task and increase the model's generalization ability.
3) The method in this paper outperforms most baseline methods (TD-LSTM, AE-LSTM, ATAE-LSTM, MemNet, IAN, RAM), especially on the Restaurants (R) domain, even though it does not use labeled target-domain data. This shows that the model can fully exploit the knowledge in the source domain and transfer it well to the target domain.
Table 1: Experimental results of our method and the baseline methods (non-transfer)
| Model | R Acc (%) | R Macro-F1 (%) | L Acc (%) | L Macro-F1 (%) | T Acc (%) | T Macro-F1 (%) |
|---|---|---|---|---|---|---|
| TD-LSTM | 75.63 | 64.25 | 68.12 | 62.28 | 66.58 | 63.98 |
| AE-LSTM | 69.05 | 62.54 | 69.03 | 62.46 | 69.32 | 56.39 |
| ATAE-LSTM | 77.26 | 65.10 | 68.74 | 62.33 | 69.69 | 56.90 |
| MemNet | 78.05 | 65.85 | 70.36 | 63.97 | 68.44 | 67.11 |
| IAN | 78.57 | -- | 72.23 | -- | -- | -- |
| RAM | 78.41 | 68.56 | 72.00 | 68.51 | 69.52 | 67.24 |
| SACA | 82.10 | 73.15 | 76.31 | 73.04 | 72.76 | 71.07 |
| BERT | 82.93 | 74.08 | -- | -- | -- | -- |
| RoBERTa | 72.43 | 69.55 | 70.83 | 68.76 | -- | -- |
Table 2 compares the methods in this paper with baseline methods that all include a transfer strategy. For each transfer pair, two metrics are reported: accuracy (Acc) and macro-averaged F1 (Macro-F1). The following conclusions can be drawn from the comparison:
1) The model in this paper achieves the best results on all transfer pairs compared with the other transfer methods, which indicates the effectiveness of the proposed method.
2) The BERT model gives better results than the MGAN model. Both train on the source domain and then apply the model directly to the target domain; the difference is that BERT uses a BERT encoder while MGAN uses a Bi-LSTM combined with different attention mechanisms, which demonstrates the effectiveness of the pre-trained BERT model.
3) The RoBERTa model in this paper gives slightly better results than the MMD model. Both the MMD metric and the KL divergence can measure the distance between two distributions: the KL divergence can be viewed as a first-order-moment, entropy-based match (i.e., a metric of mean matching), whereas MMD matches a weighted sum of all order moments. During the experiments, this paper found that KL gave slightly better results and is less computationally expensive than MMD, so the KL divergence metric was finally chosen.
4) Both the RoBERTa model of this paper and the MMD model perform better than the DAN method, which is based on adversarial training. In the experiments we found that when the BERT model is sufficiently trained, the domain discriminator is still underfitting, and DAN's approach is more demanding in terms of parameter tuning.
Table 2: Experimental results of our method and the baseline methods (with transfer)
| Ds→Dt | Indicator | BERT | MMD | DAN | MGAN | Ours |
|---|---|---|---|---|---|---|
| R1→L | Acc (%) | 70.67 | 72.33 | 72.05 | 69.80 | -- |
|  | Macro-F1 (%) | 68.77 | 69.84 | 68.49 | 66.96 | -- |
| B→L | Acc (%) | 70.17 | 71.44 | 70.56 | 70.40 | -- |
|  | Macro-F1 (%) | 67.41 | 67.91 | 67.28 | 66.82 | -- |
| H→L | Acc (%) | 70.74 | 71.35 | 71.43 | 70.79 | -- |
|  | Macro-F1 (%) | 67.05 | 67.98 | 67.79 | 67.84 | -- |
| B→R2 | Acc (%) | 76.66 | 78.89 | 78.34 | 72.71 | -- |
|  | Macro-F1 (%) | 69.03 | 69.78 | 70.86 | 64.26 | -- |
| H→R2 | Acc (%) | 75.49 | 79.57 | 78.87 | 72.21 | -- |
|  | Macro-F1 (%) | 68.24 | 70.74 | 70.03 | 62.52 | -- |
| R1→T | Acc (%) | 60.60 | 69.24 | 68.32 | 46.50 | -- |
|  | Macro-F1 (%) | 58.33 | 67.23 | 66.01 | 45.77 | -- |
| B→T | Acc (%) | 61.30 | 70.79 | 70.11 | 46.36 | -- |
|  | Macro-F1 (%) | 59.94 | 68.21 | 68.23 | 45.76 | -- |
| H→T | Acc (%) | 60.91 | 69.62 | 68.96 | 47.52 | -- |
|  | Macro-F1 (%) | 58.21 | 67.33 | 67.24 | 46.77 | -- |
To make the transfer effects between Japanese vocabulary in different domains easy to see, the experimental results are presented as a bar chart in Figure 2, where the horizontal axis is the accuracy rate. Transfer is most effective when the target domain is the restaurant domain and least effective when the target domain is the Twitter domain. Intuitively, review text in the Twitter domain contains more social vocabulary and more diverse expressions, which differ more from the source domains, so the transfer effect is worse. Laptop-domain reviews contain more electronics terminology, but some of the evaluative Japanese words are related to the source domains, which slightly improves the transfer result. These results also indirectly corroborate the soundness of the experimental design.

Visualization of experimental results
Relevant factors affecting Japanese language sentiment analysis are further analyzed in order to determine the optimal environment for the application of the methodology of this paper and to enhance the effectiveness of Japanese language sentiment analysis.
The word vector representation of a word should consider both word-level and document-level sentiment information. Therefore, the total loss function is defined as the weighted sum of the word-level and document-level loss functions, as shown in Equation (15):
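Equation (15) itself is not reproduced above; the following is a plausible reconstruction in standard notation, consistent with the trade-off coefficient α analyzed below. The symbols $L_{\mathrm{word}}$ and $L_{\mathrm{doc}}$ are generic labels for the word-level and document-level loss terms, not the paper's original notation.

```latex
% Reconstructed form of the weighted total loss; alpha is the trade-off
% coefficient discussed below, L_word and L_doc the word-level and
% document-level loss terms.
L_{\mathrm{total}} = \alpha \, L_{\mathrm{word}} + (1 - \alpha) \, L_{\mathrm{doc}}
```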
During the training of word embeddings that incorporate emotional semantics, the value of α affects the representational ability of the embeddings. Since the classification effect is best on the Japanese dataset, that dataset is chosen to explore the effect of different α values in steps of 0.1, and the experimental results are shown in Figure 3.

The effects of different α values on cross-language emotional classification
As can be seen from Fig. 3, the classification accuracy reaches 0.801 when α is 0.1, at which point the document-level sentiment information carries the largest weight. As α increases, the classification accuracy gradually decreases, reaching its lowest value at α = 0.5, where word-level and document-level sentiment information have equal weight. As α increases further, the weight of word-level sentiment information exceeds that of document-level information and the accuracy improves again, reaching its highest value of 0.823 at α = 0.9. The results show that word-level and document-level sentiment information each provide effective supervision on their own, but when their weights are close, the sentiment information is used less efficiently, which weakens the word embedding representation and in turn reduces cross-lingual sentiment classification accuracy.
The dimension size of word vectors has a certain effect on the ability to represent the semantics of words, so the experiments in this section set the word vector dimensions to 50, 100, 150, and 200 dimensions, respectively, to explore the effect of word vector dimensions on cross-lingual sentiment analysis. The Japanese dataset is still chosen for the experiment, and DAN is selected for the feature extraction network. The experimental results are shown in Table 3.
Table 3: Influence of word vector dimension on sentiment analysis
| Method | 50-dim Accuracy | 50-dim F1 | 100-dim Accuracy | 100-dim F1 | 150-dim Accuracy | 150-dim F1 | 200-dim Accuracy | 200-dim F1 |
|---|---|---|---|---|---|---|---|---|
| Upper | 0.856 | 0.867 | -- | -- | -- | -- | -- | -- |
| Machine translation | 0.729 | 0.718 | -- | -- | -- | -- | -- | -- |
| Bi_random | 0.557 | 0.709 | 0.576 | 0.647 | 0.572 | 0.545 | 0.604 | 0.718 |
| Bi_W2V | 0.732 | 0.714 | 0.747 | 0.744 | 0.777 | 0.730 | 0.748 | 0.709 |
| CLCDSA | 0.602 | 0.632 | 0.689 | 0.693 | 0.720 | 0.715 | 0.786 | 0.799 |
| Ours | 0.818 | 0.839 | 0.776 | 0.771 | 0.800 | 0.733 | 0.778 | 0.700 |
As the word vector dimension increases in the cross-lingual sentiment classification task, the Bi_random method, which uses only random word embeddings, improves most obviously, reaching a classification accuracy of 0.604 and an F1 value of 0.718 at 200 dimensions. This indicates that for Bi_random, whose text vectors are randomly initialized, larger word vector dimensions provide more representational capacity and better results. For the Bi_W2V method, accuracy increases slightly as the dimension grows: it obtains its highest F1 value of 0.744 at 100 dimensions and its highest accuracy of 0.777 at 150 dimensions, while at 200 dimensions both accuracy and F1 decrease.
For the method in this paper, changing the word vector dimension does not noticeably improve classification accuracy; the method already integrates sentiment semantic information well at 50 dimensions, where it reaches its highest accuracy of 0.818 and F1 value of 0.839, and it shows good stability across dimensions.
For the CLCDSA method, the best performance is achieved when the word vector dimension is 200, and performance decreases as the dimension decreases. This is mainly because the number of parameters of its Encoder-Decoder model shrinks with the dimension: at 200 dimensions the model has 12,380,000 parameters, while at 50 dimensions it has only 710,000.
To analyze, from a linguistic and semantic point of view, whether word vector representations based on source-language sentiment features take both word semantics and sentiment features into account better than Word2Vec, this section compares the word vector representations obtained by this paper's method and by the Word2Vec model using visualization. The representations produced by both methods are 50-dimensional vectors, which cannot be shown directly in a 2D plane, so they are reduced in dimensionality with Principal Component Analysis (PCA) and then plotted in two dimensions.
Fig. 4 and Fig. 5 show two groups of words visualized in the 2D plane under the Word2Vec representation and under the representation of this paper's method, respectively. To keep the visualization legible, only a small number of words are selected as examples. Each point in the figures is the 2D embedding of a word's high-dimensional vector after PCA dimensionality reduction; the closer two word vectors are, the closer their points lie in the plane. The Word2Vec results are shown on the left of each figure and the results of this paper's method on the right.
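A minimal sketch of this visualization procedure is given below. The random 50-dimensional vectors are placeholders standing in for the embeddings learned by Word2Vec or by the paper's method, and the word list is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Minimal sketch of the visualization step: 50-dimensional word vectors are
# reduced to 2 dimensions with PCA and plotted, so that words whose vectors
# are close end up close in the plane.

rng = np.random.default_rng(0)
words = ["良い", "嬉しい", "美しい", "嫌い", "悪い"]
vectors = rng.normal(size=(len(words), 50))        # placeholder 50-d word vectors

points = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(4, 4))
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.title("PCA projection of word vectors")
plt.tight_layout()
plt.savefig("word_vectors_pca.png")
```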

Example (1) of word vector representation

Example (2) of word vector representation
Figure 4 visualizes a group of words ("good", "good", "hate", "bad", "exciting", "happy", "beautiful") in the 2D plane. The affective polarity of this group is obvious, and it can be seen that the word vector representation of this paper's method takes the words' emotional feature information into account and can separate words of different affective polarity: for example, the negatively polarized words "hate" and "bad" lie close together, while "good" and "delicious" are clustered together. In the Word2Vec representation, by contrast, "happy", "bad", and "beautiful" are clustered together, so the emotional polarity of words cannot be distinguished effectively.
On the basis of Figure 4, a few semantically close words, "dog", "cat", and "bird", are added and a few words are removed at random; the visualization is shown in Figure 5. The Word2Vec model has an advantage in semantic representation and clusters the semantically similar words "dog", "cat", and "bird", but the words "hate" and "exciting" still overlap. The word vector representation of this paper's method, however, can still clearly distinguish the emotional polarity of words: "hate", as a word of negative emotional polarity, is clearly separated from the other words.
The purpose of this paper is to design a multi-step transfer learning strategy based on the RoBERTa model to improve sentiment analysis of Japanese language using transfer learning and sentiment analysis methods. The conclusions drawn from the study are as follows:
1) The method in this paper achieved a higher sentiment classification accuracy (83.71%) and F1 value (74.5%). Among all transfer pairs, the proposed model also exhibits the highest sentiment classification accuracy, which verifies the soundness of the transfer learning strategy.
2) When the trade-off coefficient α is 0.9, the sentiment classification accuracy of the method is highest, and the sentiment recognition rate remains stable across different word vector dimensions.
1) The pre-training model RoBERTa used in this paper ignores the information in the middle layers of the bidirectional encoder network and does not exploit the emotional information present in other modalities such as images and videos.
2) In subsequent research, external neural networks can be combined with the top-layer output of the pre-trained model, and emotional factors such as facial expressions and scenes can be used to complement the text semantics.
This research was supported by the State Scholarship Fund to pursue study: “China Scholarship Council” (File No: [2021]340).