Research on Traditional Costume Culture Information Extraction and Digital Reconstruction Methods Based on Artificial Intelligence
Published online: 24 March 2025
Received: 04 Nov. 2024
Accepted: 03 Feb. 2025
DOI: https://doi.org/10.2478/amns-2025-0718
© 2025 Miao Yu, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Chinese traditional dress culture generally refers to the traditional cultural forms characterized mainly by traditional Han Chinese dress. It is the most important and prominent part of Chinese national culture and an important foundation for the development and change of Chinese dress culture [1-4]. Chinese dress culture has a long history and profound cultural connotations, reflecting not only the values, cultural consciousness, and aesthetic characteristics of traditional Chinese culture, but also national characteristics of dressing style, customs, and habits, as well as levels of political, economic, religious, and social development [5-8]. However, with the rapid development of science and technology, traditional costumes have begun to face the challenges and opportunities of digital transformation [9].
The digitalization of traditional dress culture is manifested in the digital design of traditional dress, a mode based on computer-aided design that can quickly generate clothing design models, simplify the designer's workflow, and improve design efficiency. Digital clothing design can be divided into two stages: virtual design and physical design [10-13]. Virtual design uses digital technology to generate three-dimensional clothing models, while physical design uses digital technology to assist in the production of clothing samples [14]. Making full use of digital technology can promote the innovation and development of traditional clothing, combine traditional culture with modern fashion, break through industry boundaries, and create Chinese fashion brands with international influence. Through digital transformation, the charm of traditional costumes can be continued and revitalized [15-18].
In this paper, on the basic framework of the U-Net model, Residual modules are embedded in the encoder, SKNet attention and ParNet are incorporated in the decoder to improve the generalization ability of the model, and the Lovász-hinge loss function is introduced to optimize the network and improve the segmentation accuracy of traditional dress patterns, yielding the RSKP-UNet model. On this basis, we construct a traditional dress cultural information extraction method based on the RSKP-UNet model. Using a self-constructed dataset of crawled traditional dress images, model comparison experiments explore how the proposed method improves the overall accuracy of image feature extraction and recognition. The traditional stable diffusion model is then optimized by combining a LoRA fine-tuning network with a discriminative network, and a traditional dress pattern generation method with an improved stable diffusion model is proposed to explore the digital reconstruction of traditional dress cultural information. Experimental analysis is carried out from both quantitative and qualitative perspectives to comprehensively evaluate the traditional dress pattern images generated by the model.
As one of the carriers of national culture, traditional national costumes are products of the national spirit and its material-cultural foundation. Studying traditional national costume patterns, using artificial intelligence technology to promote more systematic and in-depth learning and understanding of costume patterns and related national elements, and contributing to the informatization of traditional national costumes are of significant and far-reaching value for the inheritance and promotion of national culture. This paper therefore proposes a method for extracting traditional dress culture information and a method for digitally reconstructing it.
In order to protect the transmission of traditional clothing culture and expand research on it, this section takes the information extraction of traditional clothing culture as the research goal and, on the basis of the U-Net model, proposes a traditional clothing pattern segmentation algorithm based on the RSKP-UNet (Residual Selective-Kernel Parallel U-Net) model.
The U-Net model as a whole can be divided into two parts: the encoder and the decoder. The encoder performs feature extraction on the input image to gradually obtain higher-order semantic information, while the decoder performs cross-scale feature fusion and outputs the prediction results. First, the encoder of the U-Net model is improved: except for the first convolutional block, each convolutional block used for downsampling in the encoder is replaced by a Residual module group from the ResNet model. Each Residual module group contains two Residuals: the first performs downsampling to obtain higher-order semantic information, and the second speeds up feature transfer and prevents model overfitting. Then, the decoder of the U-Net model is improved: SKNet attention from the SKNet model is added after the first three concatenations in the decoder, where it suppresses irrelevant feature information in the feature maps generated during upsampling, and a ParNet block from the ParNet model is added after the fourth concatenation. ParNet increases the model's feature expression ability, giving it more nonlinear capacity and multi-scale feature fusion ability, and also helps prevent overfitting.
In the encoder part, a Residual module group consisting of two Residuals replaces each downsampling stage of the U-Net model. Residuals were proposed in the ResNet model, and two Residual structures are used in this paper, applied together at each stage; their formulas can be expressed as equations (1) and (2), respectively:

$$y_1 = F(x, \{W_i\}) + W_s x \tag{1}$$

$$y_2 = F(x, \{W_i\}) + x \tag{2}$$

where $x$ is the input feature map, $F(x, \{W_i\})$ denotes the residual mapping learned by the two 3×3 convolutions, $W_s$ is the 1×1 projection convolution that matches the resolution and channel number of the shortcut in the downsampling Residual, and $y_1$, $y_2$ are the outputs of the two Residuals.
The first Residual operation performs different convolution operations on the input feature map before fusion; downsampling to capture features is realized because the first 3×3 convolution kernel has a stride of 2, while the second 3×3 convolution kernel has a stride of 1. Because traditional dress patterns are complex and varied in content, rich in color, fine in texture, and full of detail, this Residual can better extract these features and transfer them, reducing the loss of detailed content of the traditional dress pattern during downsampling.
The second Residual operation convolves the input feature map before fusing it with the input. Since both 3×3 convolution kernels have a stride of 1, it changes neither the resolution nor the number of channels of the input feature map. Since traditional dress patterns have many features, this Residual serves to prevent overfitting and improve the generalization ability of the model, so that the model can effectively learn the various features of the traditional dress pattern and thus segment it accurately.
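To make the two Residual structures concrete, the following is a minimal PyTorch sketch of a basic Residual block and a Residual module group as described above; the module names and channel sizes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Basic residual block: two 3x3 convolutions plus a shortcut connection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut W_s (1x1 conv) when the spatial size or channel
        # count changes, i.e. in the first (downsampling) Residual of each group.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

def residual_group(in_ch, out_ch):
    """One encoder stage: a downsampling Residual followed by an identity Residual."""
    return nn.Sequential(
        Residual(in_ch, out_ch, stride=2),   # Eq. (1): stride-2 conv halves resolution
        Residual(out_ch, out_ch, stride=1),  # Eq. (2): identity shortcut, same resolution
    )
```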
SKNet attention consists of three parts: Split, Fuse, and Select. The Split part learns features of the feature maps at multiple scales using different convolutional kernels. The Fuse part performs channel attention operations on the feature maps obtained from the different convolutional kernels so that the model can assign higher channel weights to more important feature maps. The Select part re-aggregates the weighted feature maps from the different convolutional kernels so that the model obtains feature maps with weight coefficients, enhancing its feature representation.
The whole SKNet attention can be expressed by the following equations:

$$U = \tilde{U} + \hat{U}, \qquad s_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} U_c(i,j), \qquad z = \delta\big(\mathcal{B}(W s)\big)$$

$$a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z}}, \qquad b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z}}, \qquad V_c = a_c \cdot \tilde{U}_c + b_c \cdot \hat{U}_c, \qquad a_c + b_c = 1$$

where $\tilde{U}$ and $\hat{U}$ are the feature maps produced by the two convolutional kernels in the Split part, $s_c$ is the channel-wise statistic obtained by global average pooling in the Fuse part, $z$ is the compact feature (with $\delta$ the ReLU function and $\mathcal{B}$ batch normalization), $a_c$ and $b_c$ are the soft attention weights of the Select part, and $V_c$ is the aggregated output feature map.
In this paper, SKNet attention is added after upsampling because the upsampling process in the decoder generates irrelevant feature information, and feature fusion by concatenation alone can only partially reduce the impact of this irrelevant information on the subsequent segmentation results; it cannot achieve fine-grained segmentation, and traditional dress patterns contain a great deal of detail. SKNet attention enables the model to recover certain details of the traditional dress pattern after upsampling.
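The following is a minimal PyTorch sketch of SK attention following the standard Split/Fuse/Select design assumed above; the two-branch configuration (a 3×3 kernel plus a dilated 3×3 emulating a 5×5 receptive field) and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SKAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Split: two branches with different receptive fields.
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.branch5 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2, groups=channels)
        d = max(channels // reduction, 4)
        # Fuse: global average pooling + bottleneck FC producing the compact feature z.
        self.fc = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        # Select: per-branch channel weights, softmax-normalized across branches.
        self.fc3 = nn.Linear(d, channels)
        self.fc5 = nn.Linear(d, channels)

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))             # Fuse: global average pooling
        z = self.fc(s)                             # compact feature z
        w = torch.stack([self.fc3(z), self.fc5(z)], dim=0)
        w = torch.softmax(w, dim=0)                # a_c + b_c = 1 per channel
        a = w[0].unsqueeze(-1).unsqueeze(-1)
        b = w[1].unsqueeze(-1).unsqueeze(-1)
        return a * u3 + b * u5                     # Select: weighted aggregation
```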
The structure of ParNet can be divided into three parallel parts: an SSE branch and two branches with different convolutional kernels. The SSE branch contains a global average pooling operation and a Sigmoid activation function.
ParNet can be represented by the following equation:

$$y = \mathrm{SiLU}\big(\mathrm{Conv}_{1\times1}(x) + \mathrm{Conv}_{3\times3}(x) + \mathrm{SSE}(x)\big), \qquad \mathrm{SSE}(x) = x \cdot \sigma\big(W \cdot \mathrm{GAP}(x)\big)$$

where $x$ is the input feature map, $\mathrm{GAP}$ denotes global average pooling, $\sigma$ is the Sigmoid function, $W$ is the weight of the 1×1 convolution in the SSE branch, and the outputs of the three parallel branches are summed before the SiLU activation.
The formula for the SiLU function can be expressed as:

$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$

where $\sigma(x)$ is the Sigmoid function.
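As a concrete illustration, here is a minimal PyTorch sketch of a ParNet-style block with the SSE branch and SiLU activation described above; the exact branch configuration is an assumption based on the standard ParNet design, not the paper's verified implementation.

```python
import torch
import torch.nn as nn

class SSE(nn.Module):
    """Skip-Squeeze-Excitation: global average pooling + sigmoid channel gate."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x * torch.sigmoid(self.fc(self.pool(x)))

class ParNetBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                   nn.BatchNorm2d(channels))
        self.conv3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels))
        self.sse = SSE(channels)
        self.act = nn.SiLU()  # SiLU(x) = x * sigmoid(x), as in the formula above

    def forward(self, x):
        # Three parallel branches summed, then passed through SiLU.
        return self.act(self.conv1(x) + self.conv3(x) + self.sse(x))
```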
The most commonly used loss function in the field of image segmentation is the BCE loss function, with the following formula:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$$

where $N$ is the number of pixels, $y_i$ is the true label of pixel $i$, and $\hat{y}_i$ is the predicted probability for pixel $i$.
Due to the complexity of traditional dress patterns, the segmented target pixels usually far outnumber the background pixels, so traditional dress pattern datasets generally suffer from sample category imbalance and the BCE loss function is no longer suitable. To address this problem, this paper uses the Lovász-hinge loss function for the segmentation of traditional dress patterns, which optimizes the loss calculation by directly optimizing the IoU. The IoU is also known as the Jaccard index, and its calculation formula is as follows:

$$J(y^*, \tilde{y}) = \frac{|y^* \cap \tilde{y}|}{|y^* \cup \tilde{y}|}$$

The formula for the Jaccard loss is:

$$\Delta_J(y^*, \tilde{y}) = 1 - J(y^*, \tilde{y}) = 1 - \frac{|y^* \cap \tilde{y}|}{|y^* \cup \tilde{y}|}$$

where $y^*$ is the set of ground-truth foreground pixels and $\tilde{y}$ is the set of predicted foreground pixels; the Lovász-hinge loss is a convex surrogate of $\Delta_J$ obtained by taking its Lovász extension over the per-pixel hinge errors.
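The sketch below contrasts the BCE baseline with a soft Jaccard (1 − IoU) loss in PyTorch. Note that the paper's Lovász-hinge loss is a piecewise-linear surrogate of the same Jaccard objective; the simpler soft version here is shown only to make the imbalance-robustness argument concrete.

```python
import torch
import torch.nn.functional as F

def bce_loss(pred_logits, target):
    # Binary cross-entropy averaged over all pixels (the standard baseline).
    return F.binary_cross_entropy_with_logits(pred_logits, target)

def soft_jaccard_loss(pred_logits, target, eps=1e-7):
    # 1 - IoU computed on soft predictions; robust to class imbalance
    # because it normalizes by the union rather than the total pixel count.
    p = torch.sigmoid(pred_logits)
    inter = (p * target).sum()
    union = p.sum() + target.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)
```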
The design of traditional dress patterns has been inherited and improved by generations of craftsmen for thousands of years, and contains elemental information in various dimensions such as color, iconography, composition, format, and symbolism. In order to reconstruct the cultural information in traditional costumes by using digital methods, this paper improves on the basis of the stable diffusion model and proposes a traditional costume pattern generation method based on the improved stable diffusion model.
The stable diffusion model is mainly based on the denoising diffusion implicit model (DDIM), which is divided into a forward diffusion process and a reverse diffusion process.
The forward process asymptotically transforms the input real image $x_0$ into pure Gaussian noise by adding noise step by step:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big)$$

where $x_t$ is the noisy image at step $t$, $\beta_t \in (0,1)$ is the noise schedule, and $\mathbf{I}$ is the identity matrix.
The reverse process of the DDIM model generates an image by gradually denoising a noisy image, using a neural network to learn to predict the variance and mean of the Gaussian distribution of the reverse diffusion process. Given a noisy image $x_T \sim \mathcal{N}(0, \mathbf{I})$, the reverse process recovers $x_{t-1}$ from $x_t$ step by step until a clean image $x_0$ is generated.
The forward process is defined through a fixed prior conditioned directly on $x_0$:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big), \quad \text{i.e.,} \quad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is standard Gaussian noise.
During training, the final optimization goal is to make the noise predicted by the network consistent with the true noise, with the following loss function:

$$L = \mathbb{E}_{x_0,\, \epsilon,\, t}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2\big]$$

where $\epsilon_\theta$ is the noise-prediction network and $t$ is sampled uniformly from the diffusion steps.
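A minimal sketch of one training step under this objective, assuming a noise-prediction network `eps_model(x_t, t)` and a precomputed cumulative schedule `alpha_bar`; all names here are illustrative.

```python
import torch

def training_step(eps_model, x0, alpha_bar, num_steps=1000):
    b = x0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                               # true noise
    a = alpha_bar[t].view(b, 1, 1, 1)
    # Forward process in closed form: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps
    # Loss: match predicted noise to true noise (the objective above)
    return torch.mean((eps_model(x_t, t) - eps) ** 2)
```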
The traditional stable diffusion model requires a large amount of training data, and its generation results are highly random, so it cannot be applied directly and effectively to the generation of traditional dress patterns. An improved stable diffusion model structure is therefore proposed, consisting of a stable diffusion model combined with a LoRA fine-tuning network, plus a discriminative network.
The main components of the stable diffusion model are the CLIP text encoder, the VAE model, and the latent (hidden-layer) space.
The text encoder converts the input textual information into digital information that can be understood by the computer.
The Variational Auto-Encoder (VAE) model is composed of two parts, an encoder and a decoder, which transform between image space and latent space.
The hidden (latent) space is where the noise iteration is performed. The 4×64×64 feature vector output by the VAE encoder is passed into the hidden-layer space, and denoising then begins by reversing the time steps of the DDIM sampler and looping over them. The noise vector, iteration step, and prompt condition are passed into the hidden-layer space together, and the noise at each moment is computed by combining the iteration step $t$ with the prompt condition.
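A minimal sketch of this latent-space denoising loop, assuming a deterministic DDIM update (η = 0) and a conditional noise predictor; the function names and timestep schedule are illustrative, and the result would be decoded by the VAE decoder afterwards.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, cond, alpha_bar, steps, shape=(1, 4, 64, 64)):
    z = torch.randn(shape)                      # start from pure noise in latent space
    for i in range(len(steps) - 1):
        t, t_prev = steps[i], steps[i + 1]      # reversed (decreasing) timestep schedule
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_model(z, t, cond)             # noise predicted from latent, step, prompt
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        z = a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps  # DDIM update (eta = 0)
    return z                                    # pass to the VAE decoder to get an image
```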
The LoRA fine-tuning network freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for the downstream task. The LoRA model cannot be used alone; it must be paired with the original model, and the two can be merged to obtain complete weights.
There are many dense parameter matrices in the fine-tuning training process of a large model, but the change matrices of the parameters during fine-tuning are not of full rank, as shown in Eq. (18):

$$\mathrm{rank}(\Delta W) \ll \min(d, k) \tag{18}$$

Using this property, for the pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the weight update is represented by a low-rank decomposition:

$$W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$$

During training, $W_0$ is frozen and receives no gradient updates, while $A$ and $B$ contain the trainable parameters. A random Gaussian initialization is used for matrix $A$, and matrix $B$ is initialized to zero, so that $\Delta W = BA$ is zero at the beginning of training.
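A minimal PyTorch sketch of a LoRA-adapted linear layer following this description; the rank, scaling factor, and merge helper are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pre-trained weights W0
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)  # Gaussian init for A
        self.B = nn.Parameter(torch.zeros(d, rank))          # zero init => BA = 0 at start
        self.scale = alpha / rank

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

    def merge(self):
        """Fold BA into the frozen weight to obtain the complete merged weights."""
        self.base.weight.data += self.scale * (self.B @ self.A)
```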
In the improved stable diffusion model, the fine-tuning task is accomplished by adding the LoRA fine-tuning technique to the U-Net's attention mechanism layer, i.e., the Spatial Transformer layer.
Adding a discriminative network on top of the stable diffusion model to judge whether a generated image is a traditional dress pattern can improve the discriminative effectiveness. The discriminative network adopts an adjusted VGG-16 network. The generated 512×512×3 input image undergoes feature extraction to obtain a 224×224×64 feature layer, followed by repeated feature extraction yielding feature layers of 112×112×128, 56×56×256, 28×28×512, 14×14×512, and 7×7×512. The fully connected layers then produce feature vectors of 1×1×4096, 1×1×1024, and 1×1×2, and the final output has two dimensions, i.e., yes or no, discriminating whether the input image is a traditional dress pattern.
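A minimal sketch of such an adjusted VGG-16 discriminator using torchvision, with the classifier head replaced by the 4096 → 1024 → 2 layers described above; this is an assumed reconstruction, not the paper's exact network (torchvision's adaptive pooling lets the backbone accept 512×512 inputs directly).

```python
import torch.nn as nn
from torchvision.models import vgg16

def build_discriminator():
    net = vgg16(weights=None)        # VGG-16 convolutional feature extractor
    net.classifier = nn.Sequential(  # adjusted fully connected head
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
        nn.Linear(4096, 1024), nn.ReLU(inplace=True), nn.Dropout(),
        nn.Linear(1024, 2),          # two classes: traditional dress pattern or not
    )
    return net
```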
In this chapter, the performance of the traditional dress cultural information extraction method based on the RSKP-UNet model is discussed by constructing a traditional dress image dataset and conducting test experiments.
Crawler technology is used to obtain traditional dress images from Baidu Images, Taobao, Jingdong, and other websites, and region detection technology is used to obtain screenshots containing traditional dress from films, television dramas, and documentaries. To guarantee the reliability of the data, images matching the experiments in this chapter were screened according to image quality, how the clothing is worn, and the proportion of the image occupied by the person; noisy images were eliminated, the images were processed to a unified standard, and computer techniques such as rotation and brightness change were used to supplement and expand the data, yielding the Traditional Dress dataset.
Fine-grained semantic labels, defined for the unique styles, accessories, and other attributes of each ethnic group's traditional costume, are manually annotated using the LabelImg tool, which divides the traditional costumes into six parts: head, chest, arms, legs, feet, and waist.
The traditional dress pattern segmentation algorithm based on RSKP-UNet is used to conduct experiments on ethnic dress element information extraction to verify the validity of the RSKP-UNet model. The training curve and validation curve of the RSKP-UNet model are shown in Fig. 1, where Train loss denotes the training curve, Val loss denotes the validation curve, and Smooth denotes the smoothed version of each curve. Analysis of the curves shows that the loss effectively converges at around 25 epochs: the model fits the distribution of the input samples well and has good generalization ability.

The training curve and the validation curve of the RSKP-UNet model
The RSKP-UNet model is compared experimentally with the ordinary UNet model, and the test results of information extraction for the six parts of traditional ethnic costumes are shown in Table 1 and Table 2. The analysis shows that the AP values of the RSKP-UNet model for cultural information extraction over the six traditional dress parts range from 71.54% to 95.62%, while those of the UNet model range from 62.34% to 92.46%. The information extraction performance of the RSKP-UNet model is therefore better than that of the ordinary UNet model, especially for the head of the traditional dress, where its AP, Precision, F1 value, and Recall are all above 92%.
Information extraction results of the RSKP-UNet model
| Part | AP | Precision | F1 value | Recall |
|---|---|---|---|---|
| Arm | 81.88% | 81.05% | 78.74% | 80.67% |
| Foot | 73.42% | 74.05% | 75.52% | 72.55% |
| Chest | 84.47% | 86.74% | 85.04% | 85.88% |
| Head | 95.62% | 93.29% | 92.71% | 94.33% |
| Leg | 71.54% | 68.94% | 72.17% | 75.24% |
| Waist | 74.23% | 73.03% | 70.18% | 79.74% |
Information extraction results of the UNet model
| Part | AP | Precision | F1 value | Recall |
|---|---|---|---|---|
| Arm | 72.76% | 78.29% | 77.84% | 75.99% |
| Foot | 62.34% | 70.27% | 70.22% | 70.46% |
| Chest | 78.47% | 77.35% | 83.96% | 81.31% |
| Head | 92.46% | 90.14% | 90.39% | 91.58% |
| Leg | 68.17% | 63.09% | 70.92% | 73.87% |
| Waist | 70.17% | 67.58% | 65.33% | 72.49% |
This chapter provides an experimental investigation of the traditional dress pattern generation method based on the improved stable diffusion model from both quantitative and qualitative analytical perspectives.
The samples of the dataset in this paper come from physical image data preserved in cultural relic museums, libraries, archives, and the workshops of existing intangible cultural heritage inheritors; 200 traditional dress patterns were extracted from the traditional dress images, covering categories of animal patterns (elephant, peacock, and horse), botanical patterns, human figures, architectural patterns, and geometric patterns.
Collecting experimental data: according to the experimental requirements, each original image is adjusted to a transparent bitmap of 512×512 px, with a file size of 100-500 KB and a resolution of 300 DPI. Establishing the experimental dataset: based on the clarity, richness, typicality, and heritage value of the pattern images, 100 Dong animal patterns were selected to form the experimental dataset. Pre-processing the experimental data: textual descriptions of six items, namely pattern name, number of thematic motifs, geometric abstraction, pattern color, symmetry or not, and background color, are provided for the samples of the dataset, and these descriptions are saved on the Hugging Face website to form a corpus.
In the training of traditional dress pattern generation models, the selection of appropriate parameters can improve the training effect. In the actual training, the training effect will change according to the change of the dataset, and several parameter adjustments are needed to achieve a more satisfactory effect.
Batch size, epoch, and learning rate are important parameters in deep learning, and their settings affect the training effect of the model. The batch size affects the training time and stability of the model: a larger batch size reduces the training time while keeping the gradient calculation stable. The number of epochs affects the total number of training steps; more steps make the model more accurate, but too many may lead to overfitting. A higher learning rate makes the model converge faster, but may also cause oscillation or non-convergence.
By analyzing the actual training effect and loss curves, it is found that a learning rate higher than 1 × 10⁻⁴ or lower than 1 × 10⁻⁶ causes the model training to fail to fit. Taking 1 × 10⁻⁴ and 1 × 10⁻⁶ as the upper and lower limits of the learning rate and training at fixed intervals within this range, with the epoch set to 100 and the batch size to 8, the loss curves of model training with different learning rates are shown in Fig. 2, where (a)~(d) represent the loss curves for learning rates of 3 × 10⁻⁶, 5 × 10⁻⁶, 3 × 10⁻⁵, and 5 × 10⁻⁵, respectively. When the learning rate is 3 × 10⁻⁶, the loss value fluctuates within a relatively small range of about 0.5 × 10⁻⁶, and the loss curve shows an obvious downward trend, indicating that the model training achieves good results.

Model training loss curve of different learning rate
Due to limited hardware capacity, the RTX A5000 allows a maximum batch size of 8. According to the actual training effect and the background data analysis, the relatively effective batch size lies between 1 and 8. Taking 1 and 8 as the lower and upper limits of the batch size and training at fixed intervals, with the epoch set to 100 and the learning rate to 5 × 10⁻⁵, the training loss curves of models with different batch sizes are shown in Fig. 3, where (a)~(d) represent the loss curves for batch sizes of 1, 3, 5, and 7, respectively. When the batch size is 7, the fluctuation of the loss value is relatively small and the loss curve shows a clear downward trend, indicating that the model training achieves good results.

Model training loss curve of different batch size
The paper evaluates the degree of model fitting by comparing the similarity between the traditional dress pattern images generated by the improved stable diffusion model and the training set images. The image similarity evaluation indexes introduced include the following four:

- Structural similarity (SS): when the SS value is above 0.9, the images are extremely similar; when it is around 0.7, there is a certain degree of similarity between the images, but differences are detectable by the naked eye.
- Mean square error (MSE): evaluates the degree of pixel-level difference between the generated image and the original image; the smaller the MSE, the more similar the images.
- Root mean square error (RMSE): also known as the standard error, the arithmetic square root of the MSE.
- Peak signal-to-noise ratio (PSNR): an objective evaluation index for assessing the noise level or distortion of an image; the larger the PSNR, the less the distortion and the better the quality of the generated image. A PSNR greater than 50 dB indicates only a very small error; in the range (30, 50] dB, the human eye can hardly notice the difference between the compared images; in (20, 30] dB, the human eye can detect the difference; in (10, 20] dB, the original structure of the image is still visible to the naked eye, but the difference between the images is obvious.
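A minimal sketch of computing the four indexes with scikit-image and NumPy, assuming float image arrays scaled to [0, 1]; the function name is illustrative.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def similarity_metrics(a, b):
    """Compute SS, MSE, RMSE, and PSNR between two images in [0, 1]."""
    mse = float(np.mean((a - b) ** 2))
    rmse = float(np.sqrt(mse))
    # PSNR with a peak value of 1.0 for images scaled to [0, 1]
    psnr = 10.0 * np.log10(1.0 / mse) if mse > 0 else float("inf")
    ss = ssim(a, b, channel_axis=-1, data_range=1.0)
    return {"SS": ss, "MSE": mse, "RMSE": rmse, "PSNR": psnr}
```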
Using the improved stable diffusion model to generate the corresponding traditional dress patterns under typical prompt words, the similarity between the generated patterns and all patterns of the corresponding category in the training set is evaluated using the above indexes; the similarity results for patterns of different elements are shown in Table 3. The mean values of structural similarity, mean square error, root mean square error, and peak signal-to-noise ratio across the different element patterns are 0.745, 0.816, 0.887, and 18.712, respectively. The traditional dress patterns generated by the improved stable diffusion model from prompt words thus show a certain degree of similarity to those in the training set while remaining sufficiently distinct, i.e., the model is well fitted and the training has achieved good results.
Similarity of different element images
| Pattern elements of clothing | SS | MSE | RMSE | PSNR |
|---|---|---|---|---|
| Elephant, pagoda | 0.689 | 0.775 | 0.851 | 18.693 |
| Elephant, human | 0.684 | 0.769 | 0.859 | 17.857 |
| Peacock | 0.779 | 0.925 | 0.946 | 20.672 |
| Human, horse | 0.725 | 0.828 | 0.919 | 18.683 |
| Architecture | 0.846 | 0.781 | 0.861 | 17.657 |
| Average | 0.745 | 0.816 | 0.887 | 18.712 |
The qualitative analysis aims to gain an in-depth understanding of the artistic characteristics and visual effects of the generated traditional dress images. In this paper, the appearance characteristics of the patterns are evaluated through a styling index (S) and a color index (C), the inner expression of the patterns through an aesthetics index (A) and an innovation index (Cr), and the application value of the patterns through an application index (P); the weighted composite score is used as the basis of the qualitative analysis. The questionnaire survey randomly selected 20 groups of experimentally generated patterns and evaluated their design along the three dimensions of pattern appearance characteristics, pattern inner expression, and pattern utility value. The research was conducted as an online questionnaire with 15 questions in total; 148 valid responses were received, with 63.72% of the respondents working in art and design and the remaining 36.28% in other fields.
The evaluation scores of respondents from different occupations are shown in Figure 4. The improved stable diffusion model scores higher than the AI tool on the styling index (S) and color index (C), including among art-related respondents, so the improved model better reflects the appearance characteristics of the original traditional dress patterns. The AI tool scores higher than the improved model on the aesthetics index (A) and innovation index (Cr), so the AI-generated patterns are more aesthetic and more innovative. The improved stable diffusion model's scores on the application index (P) were 2.48 (art-related respondents) and 2.67 (others), higher than those of the AI generation tool. After weighting the scores, the improved stable diffusion model's composite score among art-related respondents is slightly higher than that of the AI tool, at 2.78 versus 2.71. Overall, the improved stable diffusion model outperforms the AI tool and better conforms to the design guidelines of art professionals, while the AI tool is better in terms of aesthetics and innovation.

Evaluation score of different professional respondents
In order to better interpret, protect, and inherit the traditional culture of the Chinese nation, this paper studies traditional dress culture information and proposes digitization methods for its extraction and reconstruction: a traditional dress pattern segmentation algorithm based on the RSKP-UNet model and a traditional dress pattern generation method based on an improved stable diffusion model. Experimental analysis shows that the RSKP-UNet model converges faster and fits better; compared with the UNet model, its information extraction AP values for the different traditional dress parts are higher, ranging from 71.54% to 95.62%, which reflects the performance of this paper's RSKP-UNet-based extraction of traditional dress cultural information. In traditional dress pattern generation, the patterns generated by the model show high similarity to the corresponding categories of traditional dress patterns, with mean structural similarity, mean square error, root mean square error, and peak signal-to-noise ratio of 0.745, 0.816, 0.887, and 18.712, respectively, demonstrating that model training achieved good results. Meanwhile, the composite scores given to the model-generated images by art-related and other respondents are 2.78 and 3.59, and the model-generated images score higher than the AI generation tool on the dimensions of pattern appearance characteristics and pattern utility value, indicating that the improved stable diffusion model in this paper achieves superior results in the generation of traditional dress pattern images.
The artificial intelligence-based traditional dress culture analysis method proposed in this paper is conducive to the digital protection of traditional national costume images, and also provides a reference for researchers of traditional national costume images to carry out fast and accurate segmentation of traditional dress patterns. Future research will study in depth the mapping relationship between dress style feature points and image feature points after image segmentation.
