Research on Style Migration Techniques Based on Generative Adversarial Networks in Chinese Painting Creation
Published: 24 Mar 2025
Received: 07 Nov 2024
Accepted: 05 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0781
© 2025 Ying Liu et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Painting is a form of artistic creation and an expression of artistic concepts and themes. Many paintings share common characteristics, and these shared characteristics constitute style [1-2]. The art styles of Eastern and Western painting differ greatly: Western painting usually adopts focal perspective and the rules of light and shade, while Eastern painting emphasizes brush conception and image thinking [3-4]. Chinese painting is unique in the East, and indeed in the world, for its distinctive national style and its complete system of theory and method [5-6].
Artists often invest a great deal of time and effort in creating Chinese paintings. If target artworks could be generated quickly and efficiently with the help of computer technology, the efficiency of modern art creation would improve greatly, which has application value in many fields [7-10].
Chinese painting synthesis belongs to the image transformation problem, whose purpose is to transform a simply pasted composite image into a single fused image by modeling [11-12]. For example, when a portrait (foreground) is inserted into a photograph (background), image synthesis aims to fuse the two so that the observer believes the portrait was originally part of the photograph [13-14]. Because the foreground and background differ in stylistic features such as lighting, brightness, and texture, simple paste compositing can produce an unnatural visual effect that is easily judged a fake composite [15-16]. A fusion process is therefore needed to migrate some of the background's style to the foreground so that the composite is visually unified and coordinated [17-18].
Generative Adversarial Networks (GANs) were proposed in 2014 and have shown impressive performance on many image problems. A GAN consists of a generator and a discriminator: the generator tries to generate images similar to the real data, while the discriminator tries its best to recognize these generated images, until the two reach a Nash equilibrium, a state in which the generator produces sufficiently realistic data [19-21]. DCGAN constructs the generator and discriminator with convolutional neural networks and is used to solve image problems; ICGAN combines a GAN with an encoder to edit image attributes in the feature space and thereby control image generation; and CycleGAN performs the image-to-image generation task with a bidirectionally mapped GAN model [22-23].
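The adversarial objective described above can be sketched numerically. The NumPy snippet below is an illustration, not the paper's implementation: the arrays stand in for hypothetical discriminator output probabilities, and the losses are the standard binary cross-entropy terms the two players minimize.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy the discriminator minimizes:
    push D(real) toward 1 and D(G(z)) toward 0."""
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return -np.mean(np.log(d_fake))

# Hypothetical discriminator outputs (probabilities of being real).
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.2, 0.05])
print(discriminator_loss(d_real, d_fake))  # small: D currently wins
print(generator_loss(d_fake))              # large: G must improve
```

At the Nash equilibrium the discriminator outputs 0.5 everywhere, giving a discriminator loss of 2 log 2 ≈ 1.386.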
Existing generative adversarial network algorithms for simulating style migration in Chinese painting creation suffer from incomplete content and poor style and visual effects. This paper therefore studies the existing CycleGAN model in depth, fully drawing on the loss function design ideas of that model and applying them to the adversarial generation model of this paper. At the same time, the generative network of the CycleGAN model is optimized and upgraded with the ResNeSt network structure in order to improve the style migration level and generation quality of the algorithm. The improved CycleGAN model (ICycleGAN) is theoretically able to complete the migration of Chinese painting styles. To verify its effectiveness, this paper first conducts comparison experiments against other existing algorithms, and then carries out a user survey to rate the Chinese paintings generated by the ICycleGAN model.
The generative adversarial network models designed before CycleGAN all learn only a single mapping from the source domain to the target domain. To compensate for the lack of feature constraints caused by the absence of paired data, the CycleGAN model adds a symmetric mapping from the target domain back to the source domain on top of the original source-to-target mapping. Intuitively, this looks like attaching a mirror copy of the original generative adversarial network: the first generator converts standard Chinese painting images into simulated Chinese painting images, and the second generator converts simulated Chinese painting images back into standard Chinese painting images; the first discriminator recognizes whether its input is a real simulated Chinese painting image or one produced by the first generator, and the second discriminator recognizes whether its input is a real standard Chinese painting image or one reconstructed by the cycle.
If the model used only one generator and one discriminator, it would convert all standard Chinese painting images into the same simulated style in order to obtain a higher score, which completely deviates from the purpose of the generation task; two generator-discriminator pairs are therefore needed to constrain each other so that the generation task is fulfilled properly. The internal structure of the generator and discriminator is not particularly innovative: the generator combines convolution and residual modules into a common encoder-decoder structure, and the discriminator follows the PatchGAN structure used in the Pix2Pix model, which better captures the detailed features of the painting structure. The specific internal structure of the network is shown in Figure 1.

CycleGAN Model network internal structure
The loss function design of the CycleGAN model is highly innovative and general, and has since been widely used in generative tasks across many domains; its design ideas are described in detail below. The loss of the model is divided into two parts. The first part is the adversarial loss, similar to that of the original generative adversarial network, defined as equations (1) and (2):

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_Y(G(x))\right)\right] \tag{1}$$

$$\mathcal{L}_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D_X(x)\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\log\left(1 - D_X(F(y))\right)\right] \tag{2}$$

where $G: X \to Y$ and $F: Y \to X$ are the two generators, $D_X$ and $D_Y$ are the corresponding discriminators, and $p_{data}$ denotes the data distribution of each domain.

However, relying only on this loss function to constrain the network model presents a serious problem: the generator focuses on learning the style of Chinese paintings and neglects the preservation of Chinese painting elements. To prevent this problem from causing the generation task to fail, the CycleGAN model also innovatively employs a cycle consistency loss function, defined in equation (5) below:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\left\| F(G(x)) - x \right\|_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\left\| G(F(y)) - y \right\|_1\right] \tag{5}$$

where $\|\cdot\|_1$ denotes the L1 norm. The meaning of this formula is the expectation that an image passed through generator $G$ and then generator $F$ (or through $F$ and then $G$) is restored to the original image. In addition, an identity loss is employed:

$$\mathcal{L}_{identity}(G, F) = \mathbb{E}_{y \sim p_{data}(y)}\left[\left\| G(y) - y \right\|_1\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\left\| F(x) - x \right\|_1\right]$$

where the intuition is that feeding a generator an image that already belongs to its target domain should leave the image unchanged.
Although paired painting data are difficult to obtain, the cycle consistency loss and identity loss of the CycleGAN model provide exactly the ability to constrain the feature structure of images that the Chinese painting generation task urgently needs. This paper draws fully on the CycleGAN model's loss function contributions in the subsequent research to further improve the quality of generated Chinese painting images.
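The two constraint losses discussed above can be sketched numerically. The NumPy snippet below is illustrative only: the toy generators `G` and `F` are identity placeholders, not the paper's networks.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference (L1 distance) between two images."""
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(x, y, G, F):
    """||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, as in equation (5)."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

def identity_loss(x, y, G, F):
    """||G(y) - y||_1 + ||F(x) - x||_1: a generator fed an image
    already in its target domain should change nothing."""
    return l1(G(y), y) + l1(F(x), x)

# Toy "generators": identity mappings reconstruct perfectly,
# so both constraint losses vanish.
G = F = lambda img: img
x = np.random.rand(4, 4)
y = np.random.rand(4, 4)
print(cycle_consistency_loss(x, y, G, F))  # 0.0
print(identity_loss(x, y, G, F))           # 0.0
```

Any deviation from perfect reconstruction increases both losses, which is what pressures the generators to preserve the content structure of the input.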
Although the traditional generative adversarial network model performs well in image style migration applications, as the model deepens it faces problems such as unstable training, overfitting, heavy computation, and network degradation. The residual network structure (ResNet) can effectively solve these problems, and it is introduced into the traditional CycleGAN. In a ResNet residual block, a direct connection from input to output is added, the convolutional branch is fitted to the difference between the output and the input, and the input and output are batch normalized.
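The identity-shortcut idea can be stated in a few lines. This is a conceptual sketch, with `f` standing in for the convolutional residual branch rather than a real network layer:

```python
import numpy as np

def residual_block(x, f):
    """A residual block fits f to the difference between output and
    input, so the block computes f(x) + x via the skip connection."""
    return f(x) + x

# If the residual branch learns the zero function, the block reduces
# to an identity mapping -- this is what lets very deep networks train
# without degradation.
x = np.arange(4.0)
assert np.allclose(residual_block(x, lambda v: np.zeros_like(v)), x)
```

Because the block only needs to learn a correction to the identity, gradients flow through the shortcut unimpeded, which stabilizes training as depth grows.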
The ResNeSt network structure introduced in this chapter is an improvement on ResNet. In contrast to ResNet, ResNeSt uses nested residual blocks with multiple branches and multiple convolutional layers within each block, which share an attention module. This design gives ResNeSt a shallower network depth than ResNet with fewer parameters. Each nested residual block in ResNeSt contains a channel attention module that adaptively learns the correlation between channels, solving the channel imbalance problem of ResNet; by adaptively assigning different weights to different channels, the module improves the accuracy of the network. ResNeSt also uses a new technique called Split-Attention, which improves accuracy while maintaining efficiency: it uses grouped convolution to capture multiple localized features at each pixel and a channel attention module to capture the interrelationships between features. Before examining ResNeSt, this paper introduces ResNeXt, a successful variant of ResNet.
ResNeXt combines the Inception idea with ResNet and demonstrates that, in addition to increasing the depth and width of the network, model performance can also be improved by increasing the cardinality (the number of parallel groups). ResNeXt is formed by replacing the 3 × 3 convolution in each stage of ResNet with grouped convolution; it is essentially an application of grouped convolution.
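The parameter saving from grouped convolution can be checked with simple arithmetic. The figures below assume a 256-channel 3 × 3 layer and a cardinality of 32, a typical ResNeXt configuration rather than values taken from this paper:

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weight count of a k x k convolution; with grouped convolution
    each of the `groups` groups connects only c_in/groups input
    channels to c_out/groups output channels."""
    return k * k * (c_in // groups) * (c_out // groups) * groups

standard = conv_params(3, 256, 256)            # plain ResNet 3x3 conv
grouped = conv_params(3, 256, 256, groups=32)  # ResNeXt, cardinality 32
print(standard, grouped, standard // grouped)  # 589824 18432 32
```

With cardinality 32 the layer uses 32 times fewer weights, which is why ResNeXt can widen each stage without increasing the parameter budget.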
The splitting idea is also used in ResNeSt, to which the attention ideas of SENet and SKNet are added. In the ResNeSt residual block structure, the input is first split into several cardinal groups, each of which is further divided into splits whose outputs are fused by the Split-Attention module.
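A minimal NumPy sketch of the Split-Attention fusion follows. The global pooling, the dense weights `w`, and the softmax fusion are simplified assumptions for illustration, not the exact ResNeSt formulation:

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_attention(splits, w):
    """Fuse r feature-map splits with per-channel attention.

    splits: (r, c, h, w) feature maps from r parallel branches.
    w:      (r, c) hypothetical dense weights that turn the pooled
            channel summary into per-split attention logits.
    """
    pooled = splits.sum(axis=0).mean(axis=(1, 2))  # global pooling -> (c,)
    logits = w * pooled                            # (r, c) attention logits
    attn = softmax(logits, axis=0)                 # weights sum to 1 per channel
    return (attn[:, :, None, None] * splits).sum(axis=0)  # fused (c, h, w)

out = split_attention(np.random.rand(2, 8, 4, 4), np.random.rand(2, 8))
print(out.shape)  # (8, 4, 4)
```

The key property is that each output channel is a convex combination of the corresponding channels of the splits, with weights learned from the pooled summary.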
The generative network of the original CycleGAN model has an encoder-decoder structure: the encoder consists of three convolutional layers that extract image features, the decoder consists of two transposed convolutional layers and one convolutional layer that reconstruct the image, and a residual network in the middle acts as a converter that transforms the image features. The two generator networks are combined to restore an image to its original state after two transformations. The improved generator replaces this intermediate converter with a densely connected network that acts as the conversion module. The convolutional layers used by the generator for extracting features, transforming features, and reconstructing the image differ in stride, number of convolution kernels, and kernel size; the mapping from the source domain to the target domain is learned during adversarial training by extracting features from the input image and the target style image, and the same network structure is used for both generators.
The dataset images used for training this network model are 256 × 256. The first convolutional layer has 64 convolution kernels of size 7 × 7 with stride 1 and outputs 64 feature maps of size 256 × 256; a padding operation is performed before convolution to fully extract boundary information, the larger kernel size provides a larger receptive field that captures more image feature information, and no pooling operation is used. The image then passes through two more convolutional layers, both with 3 × 3 kernels and stride 2, with 128 kernels in the first and 256 in the second, finally outputting 256 feature maps of size 64 × 64, which are fed into a converter consisting of a densely connected network. The output of layer $l$ in the densely connected network is

$$x_l = H_l\left(\left[x_0, x_1, \ldots, x_{l-1}\right]\right)$$

where $[x_0, x_1, \ldots, x_{l-1}]$ denotes the channel-wise concatenation of the feature maps of all preceding layers and $H_l$ is the composite function of normalization, ReLU, and convolution.
The DenseNet module contains a bottleneck layer that reduces the number of feature maps and thus the computational effort. A convolution is applied before the dense block, and the DenseBlock layer contains 6 DenseNet modules, after which 256 feature maps of size 64 × 64 are output. The image is then generated by two transposed convolutional layers (the transposed convolutions in the table) and a final convolutional layer. The transposed convolutions have 3 × 3 kernels with stride 1/2; the first has 128 kernels and the second has 64, and they are followed by a final layer with 3 × 3 kernels, stride 1, and 3 kernels that builds the desired generated image. In addition, an instance normalization layer and a ReLU activation layer follow every convolution and transposed convolution except the generator's last output convolutional layer, which uses Tanh as its activation function. The improved generator structure is shown in Fig. 2.
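The spatial sizes quoted above can be verified with the standard convolution output-size formulas. The padding and output-padding values below are assumptions chosen to reproduce the stated feature-map sizes, not values given in the paper:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride, pad, output_pad=1):
    """Spatial output size of a transposed (fractional-stride) convolution."""
    return (size - 1) * stride - 2 * pad + kernel + output_pad

s = conv_out(256, 7, 1, 3)   # 7x7, stride 1, padded: stays 256
s = conv_out(s, 3, 2, 1)     # 3x3, stride 2: 128
s = conv_out(s, 3, 2, 1)     # 3x3, stride 2: 64 (encoder output)
s = deconv_out(s, 3, 2, 1)   # transposed conv doubles: 128
s = deconv_out(s, 3, 2, 1)   # transposed conv doubles: 256
print(s)  # 256: the decoder restores the input resolution
```

This confirms the encoder's 256 → 128 → 64 downsampling and the decoder's symmetric 64 → 128 → 256 upsampling.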

Improved generator structure
Based on the above research and analysis, this paper optimizes and upgrades the generative network algorithm in the original CycleGAN generative adversarial model according to the technical requirements of style migration for Chinese painting creation; the effectiveness of the optimized CycleGAN model (ICycleGAN) is examined and evaluated below.
To verify the effectiveness of the proposed ICycleGAN adversarial network model for style migration in Chinese painting creation, and to evaluate its style migration effect, this paper carries out the following effectiveness experiments and user surveys.
In this paper, we conduct experiments using an existing dataset to study the migration of Chinese painting styles. The sample size of the dataset is shown in Table 1.
Number of content samples in dataset
| Object | Train Set | Test set | Seen | Unseen |
|---|---|---|---|---|
| Horse | 1465 | 165 | √ | |
| Shrimp | 525 | 110 | √ | |
| Cattle | × | 205 | | √ |
| Dog | × | 835 | | √ |
| Cat | × | 1115 | | √ |
| Bird | × | 170 | | √ |
| Tiger | × | 530 | | √ |
| Lion | × | 445 | | √ |
| Zebra | × | 110 | | √ |
The input images of the model in this paper are 256 × 256 pixels, so the real photos of these seven unseen object types are all resized to 256 × 256 pixels. These real photos participate only in the testing stage and do not appear in the training stage, in order to test the ability of the proposed network model to generate zero-shot ink-style images. All networks are trained from scratch. The weights of all convolutional layers are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02, the batch size is set to 4, and the total number of epochs is set to 300; the learning rate is kept at 0.002 for the first 150 epochs and decays linearly to 0 over the next 150 epochs, and the model is optimized with the Adam algorithm with betas = (0.5, 0.999). In the linear combination of the total loss function, the weights are set to α = 10.0, β = 10.0, γ = 0.05, and η = 1.0. The content encoder and the style encoder each have three convolutional layers, and the style features are embedded through the style-weaving self-attention module. The intermediate residual network contains 9 residual blocks and the decoder contains 3 up-convolutional layers, each followed by IN and ReLU; the details of each convolutional layer are shown in Table 2. The discriminator is a 70 × 70 PatchGAN that classifies the authenticity of 70 × 70 overlapping image blocks.
The details of the SWAN
| Networks | Operation | Kernel size | Stride | Padding | Feature maps | Normalization | Nonlinearity |
|---|---|---|---|---|---|---|---|
| Content Encoder | Convolution | 6 | 1 | - | 64 | IN | ReLU |
| | Convolution | 3 | 1 | 1 | 128 | IN | ReLU |
| | Convolution | 4 | 3 | 1 | 256 | IN | ReLU |
| Style Encoder | Convolution | 6 | 1 | - | 64 | IN | ReLU |
| | Convolution | 3 | 1 | 1 | 128 | IN | ReLU |
| | Convolution | 4 | 3 | 1 | 256 | IN | ReLU |
| Resnet Blocks | Convolution | 4 | 1 | - | 258 | IN | ReLU |
| | Convolution | 4 | 1 | - | 258 | IN | - |
| Decoder | Convolution | 4 | 3 | 1 | 128 | IN | ReLU |
| | Convolution | 4 | 3 | 1 | 64 | IN | ReLU |
| | Convolution | 6 | 2 | - | 3 | - | - |
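The learning-rate schedule described in the training setup (constant 0.002 for the first 150 epochs, then linear decay to zero over the next 150) can be sketched as a simple function:

```python
def learning_rate(epoch, base_lr=0.002, constant=150, decay=150):
    """Constant for the first `constant` epochs, then linear decay
    to 0 over the next `decay` epochs."""
    if epoch < constant:
        return base_lr
    return base_lr * (1.0 - (epoch - constant) / decay)

print(learning_rate(0))    # 0.002
print(learning_rate(149))  # 0.002
print(learning_rate(225))  # 0.001, halfway through the decay
print(learning_rate(300))  # 0.0
```

In practice the same schedule is usually passed to an optimizer as a per-epoch multiplier (e.g. a lambda scheduler), but the arithmetic is exactly this.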
Table 3 shows the quantitative comparison results of the experiments of this paper’s method and the existing methods. In the metric FID↓, which measures the distance between the feature vectors of the real and generated images, the model score of this paper is 180.0012. In the metric SSIM↑, which measures the similarity of the two images, the model score of this paper is 0.9119. In the Kernel MMD↓, which measures the difference of the data distribution in the domain of the real and generated datasets, the model score of this paper is 0.9502.
Quantitative evaluation results of FID, SSIM and Kernel MMD
| Gatys | AdaIN | CycleGAN | CartoonGAN | ChiGAN | ICycleGAN | |
|---|---|---|---|---|---|---|
| FID↓ | 296.2014 | 224.5274 | 217.0988 | 252.3054 | 234.88873 | 180.0012 |
| SSIM↑ | 0.8719 | 0.9002 | 0.8107 | 0.8827 | 0.8791 | 0.9119 |
| Kernel MMD↓ | 1.085 | 1.0275 | 0.9846 | 1.1352 | 1.1045 | 0.9502 |
In the FID↓ metric, this paper's algorithm is 116.2002 points lower than the highest-scoring Gatys algorithm, indicating that the images generated by this paper's model are the closest to real images. In the SSIM↑ metric, this paper's algorithm scores 0.04 points higher than the Gatys algorithm and 0.1012 higher than the lowest-scoring CycleGAN, indicating that the images generated by this paper's model are the most similar in structure to real images and of the best quality. In the Kernel MMD↓ metric, this paper's algorithm is 0.185 points lower than the highest-scoring CartoonGAN algorithm, indicating that the generated image data of this paper's model differ least from, and are most similar to, the real image data. The model in this paper thus achieves good results on all evaluation metrics, and the experimental results demonstrate the method's ability to retain semantic information and learn the stylistic features of Chinese paintings.
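As an illustration of the Kernel MMD metric used above, the following NumPy sketch computes a biased squared MMD with a Gaussian kernel on synthetic feature vectors. It is a conceptual example, not the exact estimator or feature extractor used in the experiments:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel matrix between rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_mmd2(x, y, sigma=1.0):
    """Biased squared MMD: distance between the distributions of the
    real (x) and generated (y) feature vectors; 0 means identical."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (100, 4))
y_close = rng.normal(0.0, 1.0, (100, 4))  # same distribution as x
y_far = rng.normal(3.0, 1.0, (100, 4))    # shifted distribution
print(kernel_mmd2(x, y_close) < kernel_mmd2(x, y_far))  # True
```

A lower Kernel MMD between generated and real features therefore means the two feature distributions overlap more, which is the sense in which Table 3 ranks the models.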
To comprehensively evaluate the style migration effect of the ICycleGAN algorithm, two rounds of user surveys were designed from different perspectives. First, 40 college students were invited to score the perceived degree of Chinese painting style migration. Then, experts were invited to evaluate the quality of the generated Chinese paintings from more specialized and detailed perspectives, such as outlining, texture strokes, dotting, washing, thickness, lightness, dryness, wetness, yin and yang, and facing and turning.
In the first user survey, 40 university students were invited to score the perceived degree of style migration for each of five algorithms (ICycleGAN, NST, DPST, INST, and WCT). Ten Chinese paintings generated by each algorithm were provided, and participants scored them on a scale of 1 to 5, with higher scores indicating better style migration. The average scores achieved by each algorithm are shown in Fig. 3: the NST algorithm achieved an average score of 2.01, DPST 2.89, INST 1.71, WCT 1.52, and ICycleGAN 4.33. The scores reflect the degree of stylistic similarity between the images generated by each algorithm and the reference Chinese paintings, as well as the satisfaction of the participants. The Chinese paintings generated by the ICycleGAN algorithm achieved the highest scores in this survey, which further proves that ICycleGAN migrates the style of Chinese paintings more effectively than the other four algorithms. The very low scores obtained by the INST and WCT algorithms indicate that they cannot successfully migrate the style of Chinese paintings and that their visual effect is poor.

Student evaluations of the five algorithms
In the second round of the survey, 10 experts were invited to evaluate the quality of the generated Chinese paintings from more specialized and detailed perspectives, such as outlining, texture strokes, dotting, washing, intensity, lightness, dryness, wetness, yin and yang, and facing and turning. In this round, the images generated by the WCT algorithm were not shown, so that the experts could concentrate on evaluating the Chinese paintings generated by the remaining four algorithms. Each algorithm provided 10 paintings, scored from 1 to 5, with higher scores indicating higher quality of the generated Chinese paintings. The average scores obtained by the four algorithms are shown in Fig. 4: NST 2.4, DPST 3.1, INST 1.8, and ICycleGAN 4.2, the highest of them. This indicates that the Chinese paintings created by the ICycleGAN algorithm are also better in their details. The experts also noted that the Chinese paintings generated by the ICycleGAN algorithm convey a classical beauty in their overall color tone, and they believe that this work will bring new opportunities for the development of Chinese painting.

Expert evaluation of the results of the four algorithms
In recent years, image style migration technology has developed rapidly, opening a new chapter in the exploration of diverse art forms. Addressing the pain point that existing image style migration techniques do not fit the creation of Chinese paintings perfectly, this paper adopts the CycleGAN generative adversarial network model. By analyzing the loss function design ideas of the CycleGAN model and improving its original generative network with the ResNeSt residual network structure, the ICycleGAN algorithm of this paper is built. Supported by the ICycleGAN algorithm, the model performs well in the effectiveness experiments, fully retaining image semantic information while learning image style features, and achieves higher scores than other algorithms in the user survey evaluation, receiving unanimous recognition from Chinese painting experts.
The CycleGAN generative adversarial network model optimized by the method of this paper can not only meet the demand for image style migration technology in the creation of Chinese paintings, but also provide effective technical support for the inheritance and development of Chinese painting, which is of great significance to the development of Chinese painting and of traditional arts more broadly.
