Research on data-driven optimization of cross-border e-commerce copywriting and artwork

With the continuous development of the Internet, cross-border e-commerce has become a global hot topic. Governments and enterprises increasingly emphasize the development potential of this market, and it has become a global trade model [1-3]. Copywriting and artwork are important factors in enterprise marketing because they influence consumers' purchasing behavior, and the quality of copywriting and artwork largely determines the competitiveness of cross-border e-commerce enterprises [4-5].

As an important part of cross-border e-commerce, copywriting is becoming an indispensable core capability of e-commerce enterprises. Successful cross-border e-commerce copy needs to be attractive. When writing copy, the needs and preferences of the target customers should be fully considered, and the features and advantages of the products should be described in simple and vivid language so that customers can understand them at a glance [6-9]. In addition, vivid adjectives and interesting metaphors can make the copy more engaging, catch customers' eyes, and arouse their interest. Credibility and emotional resonance are also key factors of successful copywriting, and only by reasonably integrating these elements into the copy can it attract customers' attention and increase product sales [10-13]. Cross-border e-commerce artwork is likewise an important part of cross-border e-commerce operations; it enhances the attractiveness and saleability of goods through art design, product display, advertising, and other means [14-15]. Cross-border e-commerce art designers need both art design ability and knowledge of cross-border e-commerce operations to meet the needs of commodity display and improve sales. For cross-border e-commerce companies, a professional art design team is therefore necessary and is an important guarantee of sales performance [16-19].

This paper uses Octopus Collector to crawl data related to commodity marketing copy and artwork from a cross-border e-commerce website, and preprocesses the crawled data through image information extraction, replacement and filtering of special symbols, text deduplication, and missing value processing. To address the low quality and low efficiency of traditional manual copywriting in cross-border e-commerce, this paper draws on the principles of text generation technology and the idea of the factorization machine model to build a keyword-topic-controlled copy generation model based on a cross-term encoder, and introduces an attention mechanism into the model so that the degree to which the generated copy sequence and the keyword set participate in the generation process can be adjusted arbitrarily. On the basis of the semantic fusion-based generative adversarial network framework, this paper constructs a generative adversarial network model based on coding and decoding structured discriminators for generating cross-border e-commerce artwork images. By selecting different baseline models and evaluation metrics, we analyze the performance advantages of the proposed models in optimizing cross-border e-commerce copywriting and artwork, and further demonstrate the intrinsic synergistic optimization effect of the two models, so as to provide technical support for the optimization and development of cross-border e-commerce enterprises.

2

Data-driven cross-border e-commerce copywriting optimization

Traditional cross-border e-commerce copy is mostly written manually by professional writers, which in theory ensures a certain fit between the copy and the products. In practice, however, the uneven skill of writers and deviations in their understanding of a product's tone often result in low-quality copy. Moreover, for cross-border e-commerce platforms with large volumes of commodity data, manually writing marketing copy is time-consuming and inefficient. Therefore, this study builds an automatic copy generation model for cross-border e-commerce using natural language processing technology, and promotes the optimization of cross-border e-commerce copywriting by mining the commodity and copy data on cross-border e-commerce platforms. Specifically, we design a keyword-topic-controlled copy generation model based on a cross-term encoder. First, the input keyword set is encoded by the cross-term encoder to obtain its semantic vector. Second, this vector is fed into the decoder through the attention mechanism so that it acts on the copy generation process. Finally, a generative adversarial network is used to improve the performance of the model.

2.1

Data Acquisition and Processing

2.1.1

Data acquisition

The data used in the study comes from a large cross-border e-commerce platform. In the Discover Goods section of this platform, each recommended product will have a paragraph of commodity marketing copy written by professional writers, which will be used as the reference commodity marketing copy for the study. The Source, which is the basis for generation, mainly consists of three parts of data, namely, the title of the commodity, the attributes of the commodity, and the marketing image of the commodity marketing copy.

To carry out the research on the generation of copywriting for cross-border e-commerce platforms, it is necessary to collect the data first. Octopus Collector is a universal web page data intelligent collection tool that can collect all public web page data on the network, with rich built-in collection templates and anthropomorphic intelligent algorithms, no need to learn programming, simple operation, and easy for novices to handle. Its self-developed cloud collection technology has more than 5,000 servers around the world, which enables efficient, large-scale acquisition of the required data and rapid export or docking to internal systems. Therefore, Octopus Collector is chosen to capture data from cross-border e-commerce platforms in this study.

Since the product marketing copy and the product title and attributes are not on a single page, direct acquisition is likely to cause misalignment of the result fields. Therefore, this study firstly collects the product marketing copy and the address of the product detail page corresponding to “I want to go and take a look” on the product discovery detail page, and secondly collects the address of the page, product title, product attributes and image information on the product detail page. Finally, the two data result tables are summarized by the public field of the product detail page address. After the above collection process, the study collects a total of 45,000 usable cross-border e-commerce copywriting data.

2.1.2

Data pre-processing

In the cross-border e-commerce copy generation task, the first step after acquiring the text corpus is data preprocessing. Data preprocessing transforms raw text into a structured form, and the subsequent text representation, model training, and prediction all rely on the preprocessed text. The quality of data preprocessing is therefore crucial. This section briefly explains the preprocessing steps, including picture information extraction, replacement and filtering of special symbols, text deduplication, and missing value processing.

1)

Picture Information Extraction

For the collected pictures, the text they contain needs to be extracted. Converting the text in an image into editable, searchable, and analyzable text is generally known as Optical Character Recognition (OCR). The Tesseract text recognition tool can be called from Python to perform text recognition quickly. In this paper, we use Tesseract for image text recognition and test its effect accordingly.

2)

Special symbols and other replacement filters

The collected text data and the image recognition results contain links, emoticons, and other special characters that are unrelated to the product copy to be generated. If these characters are not filtered out, they enlarge the vocabulary, occupy a large amount of memory, and may even directly affect the quality of the generated copy. Filtering them has essentially no impact on the core semantics of the source text. In this paper, regular expressions are used to replace and filter emoticons, links, and similar content.
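The filtering step can be illustrated with a minimal sketch. The paper does not give its regular expressions, so the patterns below (a simple URL pattern and a few common emoji code-point ranges) and the function name `clean_text` are assumptions chosen for illustration only.

```python
import re

# Assumed patterns: the paper does not list its actual expressions.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A few common emoji/pictograph ranges plus variation selector and zero-width joiner.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Replace links and emoji with a space, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "Soft cotton shirt \U0001F60D buy now https://example.com/item?id=1 !!"
cleaned = clean_text(sample)
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together; the final whitespace pass then removes the extra gaps.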

3)

Text deduplication

Because some non-existent URLs are produced when collection URLs are generated in batches, and such pages redirect to a default page, the collected data contain duplicate web pages. Duplicates carry no information beyond their first occurrence, so they must be deleted. There are two ways to deduplicate the text: the drop_duplicates function in pandas and the duplicate-removal function in Excel. The drop_duplicates function has three main parameters: subset, keep, and inplace. subset specifies the column names used to identify duplicates and is None by default, in which case all columns are used. keep has three options, 'first', 'last', and False: the default 'first' retains only the first occurrence of each duplicate and deletes the rest, 'last' retains only the last occurrence, and False deletes all duplicates. inplace is a Boolean parameter: the default False returns a copy with duplicates removed, while True removes duplicates directly from the original data. In this study, 3,072 of the 44,892 collected records were deleted as duplicates, leaving 41,820 records.
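The effect of the keep parameter can be seen on a toy table. This is only a sketch: the column names `detail_url` and `copy` are hypothetical and do not come from the paper's data.

```python
import pandas as pd

# Toy data with hypothetical column names; "u2" and "u3" each appear twice.
df = pd.DataFrame({
    "detail_url": ["u1", "u2", "u2", "u3", "u3"],
    "copy": ["a", "b", "b2", "c", "c2"],
})

# Keep the first occurrence of each duplicated URL (the default behaviour).
kept_first = df.drop_duplicates(subset=["detail_url"], keep="first")
# Keep the last occurrence instead.
kept_last = df.drop_duplicates(subset=["detail_url"], keep="last")
# Drop every row whose URL is duplicated.
kept_none = df.drop_duplicates(subset=["detail_url"], keep=False)
```

With inplace left at its default of False, `df` itself is unchanged and each call returns a new DataFrame.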

4)

Missing value processing

During data collection, web page failures and collection failures occur, which leave missing values in the results, so missing value processing is necessary. A missing value means that the value of one or more fields in the data set is incomplete. Since the experimental data are text, imputation methods are not applicable, and the experiment instead deletes the affected samples. A total of 1,280 records with missing values are removed in this way, and 40,540 records are finally used in this study.
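Sample deletion can be sketched as follows. The field names and the helper `drop_incomplete` are hypothetical, and treating empty or whitespace-only strings as missing is an assumption; the last lines simply re-derive the record counts reported in the text.

```python
def drop_incomplete(records, required_fields):
    """Keep only records in which every required field is present and non-empty."""
    kept = []
    for rec in records:
        values = [rec.get(field) for field in required_fields]
        if all(v is not None and str(v).strip() != "" for v in values):
            kept.append(rec)
    return kept


# Toy records with hypothetical field names.
records = [
    {"title": "Cotton shirt", "attributes": "color: white", "copy": "Soft and breathable."},
    {"title": "Linen dress", "attributes": "", "copy": "Light summer style."},
    {"title": "Wool scarf", "attributes": "size: 180cm", "copy": None},
]
complete = drop_incomplete(records, ["title", "attributes", "copy"])

# Record counts as reported in the text.
after_dedup = 44892 - 3072          # duplicates removed
after_missing = after_dedup - 1280  # records with missing values removed
```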

5)

Constructing the vocabulary list

Building a vocabulary is a crucial step when constructing the dataset used to train the model. A vocabulary maps text to numbers, allowing computers to process it. To ensure its effectiveness, both word frequency information and vocabulary size usually need to be taken into account. A vocabulary that is too large leads to a high computational cost for obtaining word representations, while a vocabulary that is too small forces many different words to share a single representation and loses the distinction between them. For this reason, a word frequency-based approach is usually used: a threshold is set, and words with a higher frequency of occurrence are selected as members of the vocabulary in order to balance computational efficiency and encoding quality. In this paper, after stop-word filtering, words with a frequency of less than 5 are treated as low-frequency words and denoted by the <unk> symbol, and the vocabulary size is set to 8000.
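A frequency-thresholded vocabulary of this kind might be built as in the sketch below. The thresholds (minimum frequency 5, size 8000) follow the text; the function name `build_vocab`, the `<pad>` entry, and the choice to count the special tokens toward the size limit are assumptions.

```python
from collections import Counter


def build_vocab(tokenized_docs, stop_words=frozenset(), min_freq=5, max_size=8000):
    """Map the most frequent non-stop-word tokens to integer ids."""
    counts = Counter(
        tok for doc in tokenized_docs for tok in doc if tok not in stop_words
    )
    # Special tokens are an assumption; the text only mentions <unk>.
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        # most_common() is sorted by frequency, so we can stop at the first rare token.
        if freq < min_freq or len(vocab) >= max_size:
            break
        vocab[tok] = len(vocab)
    return vocab


docs = [["soft", "cotton", "shirt"]] * 5 + [["rare", "the"]]
vocab = build_vocab(docs, stop_words={"the"})
```

Tokens that do not make it into the vocabulary ("rare" here) are later mapped to the `<unk>` id.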

6)

Text serialization

Text serialization refers to converting text into a form that can be stored or transmitted. Word embedding does not convert text into vectors directly; the text is first converted into numbers and then into vectors, so a text-to-number conversion step is required. It is implemented as follows: first, each sentence is segmented into words; then the words are stored in a dictionary and are counted and filtered according to their number of occurrences, using Counter from Python's collections module; finally, methods are implemented to convert text into numeric sequences and numeric sequences back into text.
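The two conversion methods can be sketched as below. The small vocabulary is a hand-written stand-in, whitespace splitting is an assumed tokenizer (the paper does not name its word segmenter), and both function names are hypothetical.

```python
def text_to_sequence(sentence, vocab, unk_token="<unk>"):
    """Split on whitespace (assumed tokenizer) and map each word to its id."""
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]


def sequence_to_text(ids, vocab):
    """Invert the mapping: ids back to words, joined with spaces."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


# Hand-written toy vocabulary for illustration.
vocab = {"<pad>": 0, "<unk>": 1, "soft": 2, "cotton": 3, "shirt": 4}

ids = text_to_sequence("soft cotton polo shirt", vocab)
restored = sequence_to_text(ids, vocab)
```

The round trip is lossy for out-of-vocabulary words: "polo" becomes the `<unk>` id and comes back as `<unk>`.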

2.2

Cross-border e-commerce copywriting generation

2.2.1

Copy generation model

The model in this paper contains three main parts, namely encoder, decoder and discriminator. The main structure of the model is shown in Fig. 1, in the left frame is the copy generator based on cross term encoding, which consists of two parts, the cross term encoder and the decoder.

2.2.2

Cross Term Encoder

In this paper, we draw on the idea of the factorization machine model [20] to mine the hidden information between keywords through keyword combinations, so as to enrich the input of the model and reduce its sensitivity to the sequential position of keywords. The structure of the factorization machine model is shown in equation (1):

(1) $y(x) = b + \sum_{i=1}^{n} w_{i} x_{i} + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w_{ij} x_{i} x_{j}$

where the first two terms form an ordinary linear model that considers each feature individually and completely ignores hidden connections between the input features. $b$ is the bias term, the input vector is $X = \{x_1, x_2, \cdots, x_n\}$, $w_i$ is the parameter of a single feature, and $w_{ij}$ is the parameter of the combination term of any two features.
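Equation (1) can be evaluated directly, as in the following sketch. It uses an explicit pairwise weight matrix exactly as written in the equation (not the low-rank factorized form often used in practice); the function name `fm_predict` and the toy numbers are illustrative assumptions.

```python
def fm_predict(x, b, w, w_pair):
    """Evaluate y(x) = b + sum_i w_i x_i + sum_{i<j} w_ij x_i x_j."""
    n = len(x)
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Only the upper triangle (j > i) of w_pair is used, matching the double sum.
    pairwise = sum(
        w_pair[i][j] * x[i] * x[j]
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    return b + linear + pairwise


x = [1.0, 2.0, 3.0]
w = [1.0, 1.0, 1.0]
w_pair = [[0.0, 0.1, 0.1],
          [0.0, 0.0, 0.1],
          [0.0, 0.0, 0.0]]
y = fm_predict(x, b=0.5, w=w, w_pair=w_pair)
```

For this input the linear part is 6.0 and the pairwise part is 0.1 × (2 + 3 + 6) = 1.1, so y is approximately 7.6.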

The model in this paper is designed around the characteristics of natural language, and its structure is shown in Figure 2. The cross-term encoding model has three layers. The first layer, the cross-term layer, is the core of the whole model; its main operation is to combine the keywords pairwise to obtain combined features between keywords.

The left part of the figure shows the implementation details of the cross-term layer. The keyword indices are first input into the embedding layer to obtain vector representations. These preliminary feature vectors are then passed through a fully connected layer, which enriches the model parameters and adjusts the vector dimension. Finally, the keywords are combined with each other to obtain the combined vectors. This step is shown in equation (2):

(2) $X_{ij} = [X_{i} : X_{j}], \quad i \in [0, l-1], \quad j \in [1, l], \quad j > i, \quad l = \mathrm{len}(\mathrm{keyword\_set})$

where $X_i$ represents the $i$th feature vector, $X_j$ represents the $j$th feature vector, $l$ is the length of the keyword set, and $X_{ij}$ represents the combined feature vector of the $i$th and $j$th inputs. Combining features pairwise is similar to a combinations problem: to avoid repetition, $j > i$ is required. The case $i = 0$ denotes the feature vector of a single keyword of the original input.
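The index pattern of equation (2) can be enumerated with a short sketch. Two points are assumptions rather than statements from the paper: the operator $[X_i : X_j]$ is read as concatenation, and index 0 is modelled as an all-zero placeholder vector so that $X_{0j}$ stands for the single keyword $j$. The function name `cross_terms` is hypothetical.

```python
from itertools import combinations


def cross_terms(keyword_vectors):
    """Return {(i, j): X_ij} for 0 <= i < j <= l, with keywords indexed 1..l."""
    dim = len(keyword_vectors[0])
    # Assumption: index 0 is a zero placeholder, so (0, j) represents keyword j alone.
    vectors = [[0.0] * dim] + [list(v) for v in keyword_vectors]
    return {
        (i, j): vectors[i] + vectors[j]  # list concatenation, read as [X_i : X_j]
        for i, j in combinations(range(len(vectors)), 2)
    }


keywords = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # l = 3 toy keyword vectors
terms = cross_terms(keywords)
```

For a keyword set of length l this yields l single-keyword terms plus l(l-1)/2 pairwise terms, i.e. l(l+1)/2 combined vectors in total.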

The second layer is the fully connected layer, as shown in equation (3):

(3) $Z_{ij} = W_{\text{cross-item}} X_{ij} + b_{\text{cross-item}}$

where $W_{\text{cross-item}}$ is the weight parameter of the fully connected layer and $b_{\text{cross-item}}$ is the bias term.

The third layer is the softmax layer, which takes the output of the fully connected layer as the input of the softmax function for normalization and introduces a nonlinear component. The normalized feature vectors are then weighted and summed, as shown in Eqs. (4) and (5):

(4) $Y_{ijk} = \frac{\exp(Z_{ijk})}{\sum_{n=0} \exp(Z_{ijn})}$

(5) $\hat{y} = \sum_{i,j} v_{\text{cross-item}_{ij}} X_{ij}$

where $Z_{ijk}$ denotes the $k$th element of $Z_{ij}$, $Y_{ijk}$ denotes the softmax value of this element, $v_{\text{cross-item}_{ij}}$ denotes the weight coefficient corresponding to the combined feature $X_{ij}$, and $\hat{y}$ denotes the semantic vector of the entire keyword set.
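The two operations in Eqs. (4) and (5) can be sketched in plain Python. This is an illustrative sketch only: the helper names are hypothetical, the weights are toy values rather than learned coefficients, and the weighted sum follows Eq. (5) as printed, i.e. it is taken over the combined features.

```python
import math


def softmax(z):
    """Eq. (4): normalize the elements of one vector Z_ij."""
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors, weights):
    """Eq. (5): sum of weight * vector over all combined features."""
    dim = len(vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(dim)
    ]


y_ij = softmax([0.0, 0.0, 0.0, 0.0])
y_hat = weighted_sum([[1.0, 2.0], [3.0, 4.0]], weights=[0.5, 0.5])
```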

2.2.3

Weighted Connected Attention Mechanisms

This paper introduces an attention mechanism [21] so that the degree to which the generated copy sequence and the keyword set participate in the copy generation process can be adjusted arbitrarily. In the connected attention mechanism, the generated copy sequence and the keyword set always contribute equally to the generation process; a weighting factor $r$ is used to break this constraint. The strength of the different semantic vectors can be flexibly adjusted by changing $r$, which takes values in $[0, 1]$:

(6) $\mathrm{score}(K, Q) = v_{a}^{T} \tanh(W_{a} [(1 - r) \cdot Q + r \cdot K])$

(7) $v_{j} = \frac{\exp(\mathrm{score}_{j})}{\sum_{k=1} \exp(\mathrm{score}_{k})}$

where $v_a$ and $W_a$ denote trainable parameters and $\tanh$ denotes the activation function. $Q$ denotes the query vector, which in this paper is the generated text sequence. $K$ denotes the key vector, which in this paper is the keywords. $\mathrm{score}(K, Q)$ denotes the calculated relevance score, and $v_j$ denotes the weight probability of the $j$th element of the relevance scores after softmax.
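A small numerical sketch of Eqs. (6) and (7) shows how $r$ shifts the balance between query and key. The parameter values below are arbitrary toy numbers, not learned weights, and the function names are hypothetical.

```python
import math


def score(K, Q, W_a, v_a, r):
    """Eq. (6): v_a^T tanh(W_a [(1 - r) * Q + r * K])."""
    mixed = [(1 - r) * q + r * k for q, k in zip(Q, K)]
    hidden = [math.tanh(sum(w * m for w, m in zip(row, mixed))) for row in W_a]
    return sum(v * h for v, h in zip(v_a, hidden))


def attention_weights(keys, Q, W_a, v_a, r):
    """Eq. (7): softmax over the scores of all keys."""
    scores = [score(K, Q, W_a, v_a, r) for K in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters (assumed, for illustration only).
W_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
Q = [0.2, -0.1]
keys = [[1.0, 0.0], [0.0, 2.0]]

only_query = attention_weights(keys, Q, W_a, v_a, r=0.0)
balanced = attention_weights(keys, Q, W_a, v_a, r=0.6)
```

With r = 0 the score no longer depends on the key, so every keyword receives the same weight; increasing r lets the keywords differentiate the weights.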

3

Cross-border e-commerce aesthetics optimization

Because of its commercial nature, cross-border e-commerce artwork needs to communicate well visually: the image content must be highly consistent with the accompanying text, and the images must be of high quality and rich in style. Therefore, this study adopts text-to-image technology: the copy generation model constructed in this paper generates commodity marketing copy, and high-quality, realistic images that conform to the content of the copy and contain rich detail are then generated from its description. Since the images generated from the same descriptive text are not necessarily identical, the generated images are diverse. In this way, the optimization of cross-border e-commerce artwork images can be achieved.

3.1

SF-GAN network framework

The Semantic Fusion based Generative Adversarial Network (SF-GAN) takes two inputs: the text encoding produced by a text encoder and a random noise vector sampled from a Gaussian distribution [22]. The Gaussian noise vector ensures the diversity of the generated images, i.e., it makes the generated images as diverse as possible while they remain consistent with the given text. The core of the SF-GAN generator consists of six up-sampling layers, six fusion modules (FMs), and one convolutional layer, where each FM is a residual structure consisting of a SATM and a SJAM. The generation process is shown in equations (8) to (11):

(8) $h_{0} = F_{0}(z)$

(9) $h_{1} = F_{1}^{FM}(U_{1}(h_{0}), s)$

(10) $h_{i} = F_{i}^{FM}(U_{i}(h_{i-1}), s), \quad i = 2, 3, \dots, 6$

(11) $o = G_{c}(h_{6})$

In Eq. (8), $z$ is the noise vector, $F_0$ is a fully connected layer, and $h_0$ is the output of the fully connected layer. In Eqs. (9) to (11), $s$ represents the global sentence vector, $F_{i}^{FM}$ represents the FM proposed in this paper, $h_1$ to $h_6$ represent the feature maps output by the six FMs, $U_1$ to $U_6$ represent the six up-sampling layers with scale factor 2, $G_c$ represents the last convolutional layer of the generator, and $o$ represents the resulting image. The generator transforms the 100-dimensional noise vector $z$ into an 8192-dimensional vector via the fully connected layer $F_0$ and reshapes it into a 512 × 4 × 4 feature map. This feature map passes through the six up-sampling layers $U_i$ and the six fusion modules $F_{i}^{FM}$, each consisting of a SATM and a SJAM, to yield a feature map $h_6$ of 32 × 256 × 256 dimensions, which is then passed through the 3 × 3 convolution $G_c$ to yield the final image $o$.
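The dimensions quoted for the generator can be checked with simple arithmetic. The sketch below only tracks tensor sizes; it is not an implementation of the network, and the function name is hypothetical.

```python
def generator_spatial_sizes(start=4, num_upsample=6, scale=2):
    """Spatial side length after the reshape and after each up-sampling layer."""
    sizes = [start]
    for _ in range(num_upsample):
        sizes.append(sizes[-1] * scale)
    return sizes


# The 8192-dimensional output of F_0 reshapes exactly into a 512 x 4 x 4 feature map.
fc_output_dim = 512 * 4 * 4

# Six scale-2 up-samplings take the 4 x 4 map to 256 x 256.
sizes = generator_spatial_sizes()
```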

3.2

Generative Adversarial Networks Based on Codec Structure Discriminators

3.2.1

Network framework for SF-GAN-V2

In the Generative Adversarial Network based on Coding and Decoding Structured Discriminators (SF-GAN-V2), the discriminator built from an encoder and a decoder can judge both the whole picture and local parts of the picture. The network architecture of the proposed SF-GAN-V2 is shown schematically in Fig. 3: the main part of the network consists of a pre-trained text encoder, a generator, and a discriminator. The work in this paper focuses on improving the text encoding and image generator modules in SF-GAN-V2 and proposes a high-precision image synthesis method based on SATM and SJAM.

3.2.2

Discriminators based on coding and decoding structures

In the encoding-decoding architecture, the encoder $D_{enc}$ is built mainly from a convolutional layer and the basic component Block A, and the decoder $D_{dec}$ is built mainly from a convolutional layer and the basic component Block B. The dimensions labeled in the figure are the output feature dimensions of each component, and “□” represents concatenation along the channel axis. For all discriminators, the inputs are an RGB image of size 3 × 256 × 256 and a sentence vector.

1)

The RGB image of size 3 × 256 × 256 is passed through a 3 × 3 convolution to obtain a feature map of size 32 × 256 × 256.

2)

The feature map from step 1 is input into Block A and down-sampled to obtain a feature map of dimension 64 × 128 × 128. It then passes through five further Block A components in sequence, which yield feature maps of dimensions 128 × 64 × 64, 256 × 32 × 32, 512 × 16 × 16, 512 × 8 × 8, and 512 × 4 × 4.

3)

Since the sentence vector used for discrimination is 256-dimensional, it is replicated spatially into a 256 × 4 × 4 feature map, in which every 256 × 1 × 1 position holds the original vector. The replicated map is then concatenated with the output of the final Block A in $D_{enc}$ to obtain a 768 × 4 × 4 feature map that carries both textual and image information.

4)

The feature map from step 3 is passed through a 3 × 3 convolutional layer to obtain a 64 × 4 × 4 feature map, which is activated by the ReLU function. Finally, a 4 × 4 convolution yields a 1 × 1 × 1 output, whose value is the probability that the whole image is real and consistent with the text.
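The four steps above can be traced as a sequence of tensor shapes. This sketch only does shape bookkeeping under stated assumptions: each Block A is assumed to halve the spatial size, the channel widths 64, 128, 256, 512, 512, 512 are assumed, and the function names are hypothetical.

```python
def encoder_shapes(in_shape=(32, 256, 256), channels=(64, 128, 256, 512, 512, 512)):
    """Shapes (C, H, W) after the first convolution and after each Block A."""
    shapes = [in_shape]
    for c in channels:
        _, h, w = shapes[-1]
        shapes.append((c, h // 2, w // 2))  # assumed: each Block A halves H and W
    return shapes


def concat_channels(a, b):
    """Concatenate two (C, H, W) shapes along the channel axis."""
    assert a[1:] == b[1:], "spatial sizes must match"
    return (a[0] + b[0], a[1], a[2])


shapes = encoder_shapes()
image_features = shapes[-1]          # output of the final Block A
sentence_map = (256, 4, 4)           # 256-d sentence vector replicated over 4 x 4
fused = concat_channels(image_features, sentence_map)
```

A 4 × 4 convolution applied without padding to a 4 × 4 map leaves a single spatial position, which is why the final output is one scalar per image.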

4

Analysis of optimization results

4.1

Copy optimization analysis

In this section, we first briefly describe the basic experimental setup and evaluation metrics, then test the training effect of the model and the quality of the generated text, and compare it with advanced models from recent years.

4.1.1

Experimental setup

This study was conducted under the CentOS Linux release 7.5.1804 operating system using two NVIDIA Tesla V100S 32G graphics cards, Intel(R) Xeon(R) Gold 5218 CPU@2.30GHz, and 512GB of RAM in conjunction with the latest version of the PyTorch Deep Learning framework with the CUDA 11.2 and cuDNN8.1 libraries for experiments. The text encoder in this study is BART encoder and the maximum length of the output text sequence is 128. The study uses Adam as the optimization algorithm and sets the learning rate to 0.00001. The specific hyperparameter setting information is shown in Table 1.

Table 1.

Hyperparameter details

Hyperparameter	Value
Learning Rate	0.00001
Warmup Steps	380
Eval Period	105
Beam Size	4
Length Penalty	1.3
Optimizer	Adam
Num Nodes	48
Num Relations	57
Embedding	769
v_a	0.17
W_a	0.48
Batch Size	45

4.1.2

Benchmarking model

In this paper, the following state-of-the-art controlled copy generation models are used for comparative experiments with the proposed Cross-GRU model:

1)

CTRL: a model trained with large-scale supervision by using control codes in the pre-training phase. The experiments use the Huggingface version.

2)

PPLM: a plug-and-play model that adjusts the generation of the pre-trained model by introducing an attribute discriminator. The experiments use the Huggingface version.

3)

CoCon: similar in structure to the model in this paper, it performs self-supervised training by inserting a Transformer layer into the pre-trained model.

4)

TAV-LSTM: uses the weighted average of all topic words to represent the topic semantics, and uses a Long Short-Term Memory network (LSTM) as the encoder and decoder.

This experiment uses the product marketing copy dataset from the selected cross-border e-commerce platform. The data are divided at a ratio of 8:2: 32,432 records are used as the training set, and the remaining 8,108 records are used as the test set.
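The 8:2 division can be reproduced with a simple shuffled split. The paper does not say how the split was drawn, so the shuffling, the seed, and the function name below are assumptions; the record count of 40,540 is the number of records remaining after preprocessing.

```python
import random


def train_test_split(items, train_parts=8, test_parts=2, seed=0):
    """Shuffle (assumed) and split at train_parts : test_parts using integer arithmetic."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * train_parts // (train_parts + test_parts)
    return items[:cut], items[cut:]


record_ids = range(40540)  # stand-in for the preprocessed records
train_ids, test_ids = train_test_split(record_ids)
```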

4.1.3

Evaluation indicators

1)

Automatic Evaluation

BLEU: Bilingual Evaluation Understudy (BLEU) is an automatic evaluation metric originally designed for machine translation. Using the training set as the reference, BLEU values are calculated to evaluate the generated copy. In this paper, the BLEU-2, BLEU-3, and BLEU-4 scores are selected for comparison. The higher the score, the better the accuracy (fluency) of the generated copy.

Back-BLEU: Using the generated copy as the reference, BLEU values are calculated to evaluate the copy in the training set. In this paper, Back-BLEU-2 is selected for comparison and is abbreviated below as B-BLEU. The higher the score, the better the recall (diversity) of the generated copy.
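For readers unfamiliar with the metric, a compact sentence-level BLEU can be sketched as follows. This is a simplified illustration (uniform n-gram weights, standard brevity penalty, no smoothing), not the evaluation script used in the paper; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=2):
    """Geometric mean of clipped n-gram precisions (n = 1..max_n) times brevity penalty."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precision += math.log(clipped / total) / max_n
    c_len = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r_len = min((abs(len(r) - c_len), len(r)) for r in references)[1]
    brevity = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return brevity * math.exp(log_precision)


reference = ["soft", "cotton", "shirt"]
generated = ["soft", "cotton", "dress"]
bleu_2 = bleu(generated, [reference], max_n=2)
back_bleu_2 = bleu(reference, [generated], max_n=2)
```

Back-BLEU simply swaps the roles: the generated copy serves as the reference and the training copy is scored against it.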

2)

Manual evaluation

Five professional cross-border e-commerce copywriters subjectively evaluated 150 random samples generated by each model. Five evaluation dimensions were included: completeness (whether the generated copy is complete), accuracy (whether the generated copy is accurate), relevance (whether the generated copy is relevant to the product), fluency (whether it is well-structured grammatically and syntactically), and coherence (whether it has a thematic and logical structure). Each dimension is scored from 1 to 5, and the final score is then calculated.

4.1.4

Analysis of experimental results

The results of the automatic evaluation on the training set and the test set are shown in Fig. 4, where the number after Cross-GRU indicates the value of the weighting factor r. The results show that the Cross-GRU model performs best on both the training set and the test set for all metrics. In particular, compared with the best baseline model CoCon, the Cross-GRU-0.6 model improves by 15.88, 39.09, 59.25, and 22.88 on the four metrics on the training set, which shows that the fluency and diversity of the generated copy are significantly improved. Further observation shows that, as the weighting factor increases, the scores of Cross-GRU on the four metrics first rise and then fall: when the weighting factor rises from 0.2 to 0.6, the scores on all four metrics rise, whereas when it rises from 0.6 to 0.8, the scores decline. A possible explanation is that the model underfits when the weighting factor is below 0.6 and therefore does not reach its best performance, while it overfits when the weighting factor is above 0.6, which leads to a marked decline in performance. Therefore, based on the experimental results, the optimal weighting factor for the model in this paper is determined to be 0.6.

Fig. 5 shows the results of manual evaluation on the test set and the training set, where D1-D5 denote the five evaluation dimensions of manual evaluation. Similar to the automatic evaluation, the Cross-GRU model of this paper outperforms the best baseline models in both the manual evaluation on the test set and the training set. The training set Cross-GRU-0.6 model outperforms the best baseline model TAV-LSTM in the five dimensions by 1.32, 2.08, 2.04, 1.99, and 1.22, respectively. And the manual evaluation scores of the model in this paper on the test set and the training set also show a trend of increasing and then decreasing with the increase of the weight coefficients, and the best performance of the model is reached when the weight coefficient is 0.6. Therefore, in the experiments and analyses in the following sections, the weighting coefficient of the model is set to 0.6 by default, if not otherwise specified.

The training effect of each model is compared below. Figure 6 shows the decrease of Loss value during the training process. It can be seen that the TAV-LSTM model has the worst convergence effect, and the Cross-GRU model in this paper has a better convergence effect than the other benchmark models. When the model tends to stabilize, the Loss value of Cross-GRU is significantly lower than the Loss value of other models.

Figure 7 compares the accuracy of the copy generated by each model after 100 rounds of training. The accuracy is obtained by an element-by-element comparison of the tensors of the generated copy and the reference copy. As the figure shows, after 100 rounds of training Cross-GRU generates copy with the highest accuracy and TAV-LSTM performs the worst. Among the baseline models, CoCon performs best owing to its low loss during training and is second only to the model in this paper in terms of accuracy.
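An element-by-element accuracy of this kind can be sketched on plain lists of token ids. The paper does not specify details such as padding or length handling, so requiring equal-length sequences is an assumption, and the function name is hypothetical.

```python
def elementwise_accuracy(generated_ids, reference_ids):
    """Fraction of positions at which the two id sequences agree."""
    if len(generated_ids) != len(reference_ids):
        raise ValueError("sequences must have the same length (assumed pre-padded)")
    if not reference_ids:
        return 0.0
    matches = sum(g == r for g, r in zip(generated_ids, reference_ids))
    return matches / len(reference_ids)


accuracy = elementwise_accuracy([2, 3, 7, 4], [2, 3, 1, 4])
```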

To test the model's ability to generate controlled text from a topic, i.e., its ability to control generation at the level of words and phrases, this paper selects a single topic word as the control text in a specific domain and tests whether the generated text conforms to the semantics and topic of the control text. In the experiment, representative words among those related to “shirt” are selected as control texts, and common marketing words are used as cue texts. Each model generates test texts from the cue texts and the individual control texts, and these test texts are input into the trained model, which outputs the corresponding accuracy and F1 scores. The experimental results are shown in Fig. 8. The model in this paper slightly outperforms the baseline models on all metrics, which indicates that the copy it generates has better topic relevance. CTRL and PPLM use control codes for controlled generation, so their generated copy is related to the control code words themselves but is not necessarily more relevant to the corresponding topic and its semantics, and their scores are therefore lower. In contrast, the model in this paper is trained with topic words in the pre-training phase, so its generated copy tends to be related to the control text both literally and at the topic and semantic levels, and it therefore performs better when the topic serves as the control text.

Combining the above experimental results, the Cross-GRU model constructed in this paper performs well in the cross-border e-commerce copywriting task, and is able to efficiently generate rich and varied high-quality merchandise marketing copy with the data in the cross-border e-commerce platform, and the copy generated by the model is highly relevant to the merchandise theme. Using the model in this paper can effectively get rid of the constraints of low-quality and low-efficiency problems of manually written copy, and effectively realize the optimization of cross-border e-commerce copy.

4.2

Aesthetic Optimization Analysis

4.2.1

Data set and parameterization

To validate the effectiveness of the proposed model, the experiments in this section use two datasets. The first consists of image data with copywriting annotations collected from the original cross-border e-commerce platform, 20,457 items in total, and is denoted DS1. The second is a self-made dataset for this study: the images are extracted from the original cross-border e-commerce dataset, the original copywriting annotations are deleted, and the corresponding images are then labeled with the copywriting annotations produced by this paper’s cross-border e-commerce copywriting generation model. This procedure also yields 20,457 items, denoted DS2.

Throughout training, N_w = 128, N_r = 32 and N_m = 256 are set as the dimensions of the copy, image and memory feature vectors, respectively. The hyperparameters λ₁ = 2 and λ₂ = 7 are used for both the DS1 and DS2 datasets. The network is optimized with the Adam optimizer with parameters β₁ = 0.3 and β₂ = 1, and the learning rates of the generator and the discriminator are both set to 0.001. Training runs for 500 rounds on each of the DS1 and DS2 datasets, with batch_size set to 64.
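For reference, the settings listed above can be collected in a single configuration object. The sketch below only records the reported values; the class name `TrainConfig`, its field names, and the helper `steps_per_round` are hypothetical, and the step count assumes that the final partial batch of each round is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Reported training settings (field names are illustrative)."""
    n_w: int = 128                  # copy feature dimension N_w
    n_r: int = 32                   # image feature dimension N_r
    n_m: int = 256                  # memory feature dimension N_m
    lambda_1: float = 2.0           # loss weight λ₁
    lambda_2: float = 7.0           # loss weight λ₂
    adam_beta_1: float = 0.3        # Adam β₁ as reported
    adam_beta_2: float = 1.0        # Adam β₂ as reported
    lr_generator: float = 0.001
    lr_discriminator: float = 0.001
    rounds: int = 500               # training rounds per dataset
    batch_size: int = 64


def steps_per_round(num_samples: int, batch_size: int) -> int:
    """Number of mini-batches per round, assuming the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


cfg = TrainConfig()
steps = steps_per_round(20457, cfg.batch_size)   # 20,457 items in DS1 and in DS2
total_updates = steps * cfg.rounds
print(steps, total_updates)
```

Under these assumptions, each dataset corresponds to 320 mini-batches per round and 160,000 parameter updates over 500 rounds.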

4.2.2

Evaluation indicators

Assessing the performance of a generative model is difficult. Having humans judge the quality of generated images directly is straightforward, but human judgment is inherently subjective and different people apply different standards, which can lead to unfair results. Therefore, this paper uses two widely used metrics for image quality assessment, the Inception score (IS) and the Frechet Inception Distance (FID), to quantify the model's performance and assess the quality and diversity of the generated images.

1) Inception score

The Inception score assesses the quality of generated images through the KL divergence (relative entropy) between the conditional class distribution and the marginal class distribution. It relies on a pre-trained Inception-v3 network, and the performance of the generative network is computed from the outputs of this network. Inception-v3 is a convolutional network whose input is an image tensor and whose output is a 1000-dimensional vector; each dimension gives the probability that the image belongs to a certain category, so the whole vector can be viewed as a probability distribution. The score is calculated as follows: (12) $IS = \exp\left(\mathbb{E}_{\hat{x}}\, D_{KL}\left(p(y|\hat{x}) \,\|\, p(y)\right)\right)$

Where $\hat{x}$ is an image generated by the model, y is the class label predicted by the Inception-v3 model, $p(y|\hat{x})$ denotes the probability that the generated image belongs to each class, and $p(y)$ is the class distribution averaged over all generated images. A good model should produce diverse and realistic images, so the relative entropy between the two distributions should be as large as possible. Therefore, the higher the IS, the more diverse and meaningful the images generated by the model.

2) Frechet Inception Distance

The IS has a serious flaw: the generated samples are not compared with real images, so it cannot measure whether the distribution of the generated images is close to that of the real images. The Frechet Inception Distance evaluates the quality of the generated samples by calculating the Frechet distance between the features of the generated images and those of the real images, as follows: (13) $FID = \| u_{r} - u_{g} \|^{2} + Tr\left(\Sigma_{r} + \Sigma_{g} - 2 (\Sigma_{r} \Sigma_{g})^{1/2}\right)$

Where $u_{r}$ is the mean of the real image features and $u_{g}$ is the mean of the synthesized image features, and $\Sigma_{r}$ and $\Sigma_{g}$ are the covariance matrices of the real and synthesized image features, respectively. The FID score reflects the quality of the synthesized images: the lower the score, the more realistic the images.
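The two metrics can be illustrated with a small, dependency-free sketch. It is not the evaluation code used in the paper. It assumes the Inception-v3 class probabilities and feature vectors have already been extracted, and it simplifies FID by treating the covariance matrices as diagonal, in which case the trace term reduces to a sum over per-dimension variances. It also uses population variances. The function names and toy inputs are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def inception_score(probs: Sequence[Sequence[float]]) -> float:
    """IS = exp(mean over images of KL(p(y|x) || p(y))), from class probabilities."""
    n, k = len(probs), len(probs[0])
    # Marginal class distribution p(y): average of the per-image distributions.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        for p, m in zip(row, marginal):
            if p > 0.0:  # terms with p = 0 contribute nothing to the KL divergence
                total_kl += p * math.log(p / m)
    return math.exp(total_kl / n)


def _mean_and_var(feats: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    n, d = len(feats), len(feats[0])
    mean = [sum(f[j] for f in feats) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in feats) / n for j in range(d)]
    return mean, var


def fid_diagonal(real: Sequence[Sequence[float]], gen: Sequence[Sequence[float]]) -> float:
    """FID under the simplifying assumption of diagonal covariance matrices."""
    mu_r, var_r = _mean_and_var(real)
    mu_g, var_g = _mean_and_var(gen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) is a per-dimension sum.
    trace_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term


# Toy inputs: confident and diverse predictions give a higher IS than uniform ones.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
print(fid_diagonal([[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [3.0, 3.0]]))
```

In the toy example, the two generated feature sets have the same spread and differ only in their means, so the FID value comes entirely from the squared mean difference. A full implementation would use the complete covariance matrices and a matrix square root.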

4.2.3

Analysis of results

In this paper, we experimentally validate the effectiveness of the proposed model (SF-GAN-V2) and compare its results on the DS1 and DS2 datasets with several well-known models from recent years, including GAN-INT-CLS, GAWWN, StackGAN, StackGAN++, AttnGAN, ControlGAN, MirrorGAN, SA-AttnGAN, SegAttnGAN, DualAttn-GAN, DM-GAN, KT-GAN, and OP-GAN. The parameter settings of the comparison models are the same as those in this paper.

The IS and FID comparison results on the DS1 dataset are shown in Fig. 9. As the figure shows, the proposed method performs well on both evaluation metrics on DS1. Its IS is 0.50 higher than that of the benchmark model (AttnGAN), reaching 4.83, a relative improvement of 11.55%, and it also exceeds the multi-stage models proposed in recent years. On the FID metric, the proposed method is 9.66 lower than the benchmark model (AttnGAN), reaching 15.23, a relative improvement of 38.81%. This further shows that the cross-border e-commerce artwork images generated by this paper’s method are of higher quality and match the copy more closely.
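The relative improvements quoted for DS1 follow directly from the reported scores and differences. The short check below recovers the AttnGAN baseline values implied by the text (they are derived here, not quoted from Fig. 9) and recomputes the percentages; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(baseline: float, ours: float, lower_is_better: bool = False) -> float:
    """Improvement over the baseline, as a percentage of the baseline value."""
    diff = baseline - ours if lower_is_better else ours - baseline
    return 100.0 * diff / baseline


# Reported DS1 values for the proposed method and its differences from AttnGAN.
is_ours, is_delta = 4.83, 0.50
fid_ours, fid_delta = 15.23, 9.66

is_base = is_ours - is_delta      # implied AttnGAN IS, about 4.33
fid_base = fid_ours + fid_delta   # implied AttnGAN FID, about 24.89

is_gain = relative_gain(is_base, is_ours)                           # about 11.55
fid_gain = relative_gain(fid_base, fid_ours, lower_is_better=True)  # about 38.81
print(f"IS gain: {is_gain:.2f}%, FID gain: {fid_gain:.2f}%")
```

Because a lower FID is better, the improvement on that metric is measured as the reduction relative to the baseline rather than as an increase.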

To verify that this paper’s method also generalizes well to the self-made dataset, experimental validation is also carried out on DS2; the performance comparison is shown in Fig. 10. On DS2, the IS of this paper’s method reaches 32.97, which is higher than that of all comparison models and 9.11 higher than that of the benchmark model (AttnGAN), a performance improvement of 38.23%. The FID decreases to 12.31, which indicates that the images generated by this paper’s method are much closer to real images and that the method captures the intrinsic connection between the generated images and the marketing copy.

Further observation shows that the two evaluation indexes of the model differ between the DS1 and DS2 datasets. For the IS index, the SF-GAN-V2 model scores 4.83 on DS1 and 32.97 on DS2, an increase of 28.14, i.e., an improvement of nearly six times. For the FID index, the model scores 15.23 on DS1 and 12.31 on DS2, a decrease of 2.92, i.e., a 19.17% improvement. Since DS2 is a self-made dataset built from the copy generated by the Cross-GRU model in this paper, it can be inferred that the SF-GAN-V2 model generates higher-quality images when conditioned on copy generated by the Cross-GRU model, thereby realizing the co-optimization of cross-border e-commerce copy and artwork images.

5

Conclusion

By constructing a copywriting generation model and an artwork image generation model, this paper overcomes the drawbacks of traditional cross-border e-commerce copywriting and artwork production and realizes effective synergistic optimization of the two. The optimization experiments show that the models in this paper perform well. The copy generated by the Cross-GRU model, which is used for copywriting generation, improves on the best baseline model by 15.88, 39.09, 59.25, and 22.88 on the automatic evaluation indexes, indicating that the model has good copywriting generation capability. For the given topic word “shirt”, the accuracy and F1 score of the model are 88.49 and 83.26, respectively, which are 38.55% and 41.86% higher than those of the second-best CoCon model, indicating that the copywriting generation model in this paper can successfully generate high-quality cross-border e-commerce copy. Meanwhile, the SF-GAN-V2 model for artwork image generation also shows strong performance: on the self-made dataset built from copy generated by the Cross-GRU model, its IS reaches 32.97 and its FID decreases to 12.31. The copywriting generation and artwork image generation models proposed in this paper therefore have a significant synergistic optimization effect on cross-border e-commerce copywriting and artwork.

Research on data-driven optimization of cross-border e-commerce copywriting and artwork

Shaolin Hu

Published online: 29 Sep 2025

Received: 14 Jan 2025

Accepted: 10 May 2025

DOI: https://doi.org/10.2478/amns-2025-1133

Keywords: Data-driven, Cross-GRU, SF-GAN, SF-GAN-V2, Copywriting and artwork

© 2025 Shaolin Hu, published by Sciendo.

This work is licensed under the Creative Commons Attribution 4.0 International License.
