
Research on Dynamic Ink and Character Skeleton Extraction of Calligraphic Style in Calligraphy Creation

  
Introduction

Calligraphy is one of the most representative forms of artistic expression in traditional Chinese culture, and it has endowed Chinese character fonts and shapes with a distinctive beauty [1]. In recent years, calligraphy has not only promoted excellent traditional Chinese culture but has also gradually been integrated into modern education under the core concept of "ink education" [2]. Chinese calligraphy has long existed as an ancient art within the traditional culture of the Chinese nation; it has given a unique soul to Chinese character culture and has witnessed the course of China's development across different periods [3]. However, with the rapid development of the times, people have less and less contact with calligraphy culture in their daily lives. The growth of network technology has also led people to abandon various writing tools, the brush among them [4]. With the rapid development of computer and Internet technology, office work, record keeping, and writing are steadily becoming automated, which has greatly diminished the practicality of calligraphy as a writing tool [5]. As a representative of traditional art, calligraphy today lags far behind its past, and its current situation is worrying. Therefore, finding novel ways to promote, popularize, and inherit calligraphy culture, so that more people can understand and fall in love with calligraphy and help bring Chinese calligraphy to the world, has important academic research value and social significance [6].

With the development of digital technology, calligraphy enthusiasts who appreciate digitized calligraphy works increasingly expect to browse works of similar styles in a style-guided way, which raises pressing research problems in the analysis, identification, and style classification of digital calligraphy works [7]. However, among studies on digitized calligraphic works, image processing and retrieval have received considerable attention, while calligraphic style recognition and classification remain relatively under-explored. Given the huge number of calligraphic font images, automated methods are needed to label works of different styles in order to provide a better user experience [8]. Automatically recognizing the styles of calligraphic works with computers therefore remains a challenging task.

Chinese calligraphy has developed over thousands of years and occupies a very important position in the traditional art and culture of the Chinese nation. With the rapid development of computer technology, the art of calligraphy is being integrated into today's network era in digital form. Huang, J. D. et al. proposed a stone calligraphy identification technique built on a convolutional neural network and validated it through simulation experiments, confirming that the method reaches an identification accuracy above 99% and can accurately identify stone calligraphy [9]. Zhang, Y. C. Y. presented a humidity sensor design based on the concept of calligraphic art to identify the humidity of calligraphic and architectural environments efficiently and quickly [10]. A, R. W., A, C. Z. et al. conceptualized a robotic calligraphy system that can copy calligraphic works when writing Chinese characters and, in practice, can create Chinese calligraphy from the strokes and sequences entered into the system [11]. To optimize the skeleton extraction of calligraphic fonts, Cai, W. optimized and improved an aggregation algorithm and proposed a new fuzzy support vector machine algorithm to identify and authenticate the style and authenticity of calligraphic fonts, which to a certain extent meets the demand for calligraphic font authentication [12]. Liu, W. Y. et al. examined the value-demand preferences of visitor groups to the Taichung Calligraphy Greenway; the findings indicated that the attributes visitors valued most were the quality of recreational services, recreation cost, environmental attributes, cultural attributes, and calligraphic activities, providing important references for the management and operation of the Calligraphy Greenway [13]. Guoqing, L. et al. designed a contour-based stroke extraction algorithm for Han dynasty clerical script to digitize calligraphy for the purpose of cultural heritage protection and publicity; experimental examination showed that the extraction algorithm performs well and achieves its intended purpose [14]. Shi explored the application and significance of traditional concepts in literary and artistic criticism, taking the creation and development of Chinese calligraphy as a case study, and argued that copying traditional calligraphy is a necessary path for the growth of calligraphers, and that inheriting the essence of traditional calligraphy and innovating on that basis are very meaningful [15].

This research centers on the dynamic ink and character skeleton extraction of font styles in calligraphy creation. A convolutional neural network is used to extract image features from all the obtained samples; the dynamic ink of the sample fonts is then obtained by generating a particle system that simulates the movement of ink particles, and the skeletons of the sample fonts are subsequently extracted with a generative adversarial network, laying a foundation for further research. Finally, a twin neural network is used for the font style classification task, and all the research work is analyzed to draw the required conclusions.

Method
Sample image feature extraction

In this study, image features are extracted using a convolutional neural network that contains four convolutional layers, maximum pooling layers, and batch normalization layers. Specifically, each convolutional layer is immediately followed by a maximum pooling layer, and a batch normalization layer is introduced after each maximum pooling layer to speed up the convergence of network training. ReLU is used as the activation function for each layer. Next, the features are aggregated using global average pooling. Finally, Haar wavelets are used to obtain feature representations at different scales.

Convolution

First, the input image is fed to the convolutional layers of the network. The role of a convolutional layer is to obtain deep feature information from the input image, which depends on its convolution kernels. In this paper, the convolution kernels in the network are 5×5, the number of kernels increases exponentially from the third layer, and the stride is 1. Fig. 1 shows the convolution operation. The yellow matrix in the figure is a 2×2 convolution kernel; its weights differ from position to position but are shared across the image, which greatly reduces the number of parameters in the network, enables parallel computation, and shortens training time.

Figure 1.

Convolution operation process

As can be seen, the top-left submatrix of the original image is multiplied element-wise by the corresponding values of the yellow matrix to produce the top-left value of the feature map. The convolution kernel then shifts one position at a time across the original image (with a stride of 1), and the computation is repeated until the entire original image has been traversed and all corresponding values in the feature map have been obtained. The convolutional layer is translation invariant and can extract learnable, interference-resistant features from localized regions of the image.
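The sliding-window computation described above can be made concrete with a short sketch. The following Python code is a minimal illustration, not the paper's implementation: it performs a valid cross-correlation with a shared-weight 2×2 kernel and stride 1, as in Fig. 1, on arbitrary toy data.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid cross-correlation with a shared-weight kernel (stride 1 by default)."""
    kh, kw = kernel.shape
    h, w = image.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # same weights reused at every position
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "original image"
kernel = np.array([[1.0, 0.0], [0.0, 1.0]])        # 2x2 kernel, as in Fig. 1
print(conv2d(image, kernel))                        # 3x3 feature map
```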

Pooling

After the convolutional layer acquires image features, the next operation integrates and compresses this feature information, a process called pooling. The purpose of pooling is to compress the extracted feature map while keeping the main features of the image. On the one hand, it downsizes the feature map, which greatly reduces the amount of computation; on the other hand, it also ensures, to some extent, invariance of the feature map to translation, flipping, and scaling. Because every neuron in a fully connected layer is connected to all the others, the number of parameters is huge, occupies considerable memory, and is prone to overfitting, so using pooling to cut down the parameters of later network layers is very important for the performance of convolutional neural networks. The specific process of maximum pooling is shown in Figure 2.

Figure 2.

Maximum pooling process

Maximum pooling is used for convolutional layer features in the proposed network, and the features are aggregated at the end using global average pooling.

Maximum pooling. According to the pooling kernel and stride, all the elements in the corresponding feature region are compared, the largest element is selected as the representative of that region, and the remaining element information is discarded.

Global average pooling. The values of all the elements in the feature map are summed and averaged, and this average is taken as the representative element of the feature map. Global average pooling acts like an average pooling layer whose window spans the entire feature map: it takes the average of the whole image's features and converts the multi-dimensional initial tensor into a one-dimensional tensor, reducing the number of parameters and lowering the risk of overfitting.
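As a minimal illustration of the two pooling operations just described (toy data, not the paper's code), the following NumPy sketch keeps the maximum of each 2×2 window and then collapses each channel to its global average:

```python
import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    """Keep the largest element of each size x size window; discard the rest."""
    h, w = fmap.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

def global_avg_pool(fmaps):
    """Average every (C, H, W) feature map down to a single value per channel."""
    return fmaps.mean(axis=(1, 2))  # (C, H, W) -> (C,)

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]], dtype=float)
print(max_pool2d(fmap))                  # 2x2 map of window maxima
print(global_avg_pool(fmap[None, ...]))  # one scalar for the whole map
```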

Batch Normalization (BN)

Normalization is an algorithm that limits the data to be processed to a certain range: each input value has the mean subtracted from it, and the result is divided by the standard deviation. Batch normalization normalizes the inputs of each layer of the network, and it is not computed over all the data at once: "batch" refers to a small batch of data.

The specific procedure for the batch normalization forward pass is as follows:

Mean value calculation: $\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i$

Variance calculation: $\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$

Data standardization: $\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$

Data reconstruction: $y_i \leftarrow \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i)$

In the above equations, $x_i$ represents the input data, $m$ is the size of each batch of training samples, $\varepsilon$ is a very small real number, and $y_i$ is derived from the linear transformation with the learnable parameters $\gamma$ and $\beta$; the transformed data better represents the real distribution of the training samples, which further improves the performance of the network. The BN layer is usually added after the convolutional layer; in this paper, we place the BN layer after the maximum pooling layer, which further improves the convergence speed of network training, sidesteps the selection of regularization parameters, and improves the generalization ability of the model.
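The four steps above translate directly into code. The following NumPy sketch of the forward pass (a minimal illustration with made-up toy data; the shapes and hyperparameters are assumptions) follows the mean, variance, standardization, and reconstruction steps in order:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Forward pass of batch normalization over a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                      # (1) mini-batch mean
    var = x.var(axis=0)                      # (2) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # (3) standardization
    y = gamma * x_hat + beta                 # (4) reconstruction with learnable gamma, beta
    return y

x = np.random.randn(8, 4) * 3.0 + 2.0        # batch of m=8 samples, 4 features
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(6))  # ~0 mean, ~1 std per feature
```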

Activation function ReLU

An activation function is the functional relationship between the output of a node in one layer and the input of a node in the next. Its role is to retain feature information and map the inputs of neurons to the corresponding outputs. Without an activation function, the inputs and outputs between the hidden layers of a neural network would be linear, equivalent to the most primitive perceptron, which can only handle linearly separable problems. The activation function gives the neurons a nonlinear factor, which matches the nonlinear data generated by the model during training and enhances the power of the deep neural network.

In this paper, the activation function ReLU is added after the BN layer, with the following expression: $f(x) = \max(0, x)$

Compared with the saturating activation functions Sigmoid and Tanh, ReLU sets the negative region to 0, which improves the expressive ability of the model and better matches the way a CNN emulates the workflow of a biological neural network.

The ReLU function is plotted in Fig. 3. Compared with other activation functions, the advantage of ReLU lies both in improving model training speed and in overcoming the vanishing-gradient problem, keeping model convergence and training speed in a stable state.

Figure 3.

ReLU function image

Haar wavelets

A wavelet can be considered a band-pass filter that only allows signals with frequencies similar to the wavelet basis function to pass through. The basic idea of wavelet decomposition is to represent a function or signal with a set of wavelet functions. Since wavelets offer good time-frequency localization, multi-resolution analysis, and other properties, they are widely applied in many fields, such as image noise reduction and image classification. In this paper, the Haar wavelet is used to further decompose the extracted features at multiple resolutions to obtain feature representations at different scales. The Haar wavelet is not only simple to compute but also achieves satisfactory results in many tasks. The effectiveness of embedding the Haar wavelet decomposition algorithm into convolutional neural networks is verified in subsequent experiments.

The process of Haar wavelet decomposition is the representation of an image with its average pixel values and detail coefficients. The characteristics of Haar wavelet decomposition are as follows:

No image information is lost during the decomposition process and the original image can be reconstructed from the last recorded data.

Images of any resolution can be obtained.

Detail coefficients with small magnitude values are generated after multiple decompositions, which provides an important way for image compression.

The concrete Haar wavelet procedure for a two-dimensional image is as follows. First, the pixels of each row of the image are transformed, producing the row averages and detail coefficients; then a column transformation is applied, producing the column averages and detail coefficients. In this way, four sub-images are obtained after decomposition: the low-frequency approximation image, the horizontal detail image, the vertical detail image, and the diagonal detail image. Feature extraction can then be completed more effectively.
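To make the row-then-column procedure concrete, here is a minimal NumPy sketch of one level of 2D Haar decomposition and its exact inverse (an illustration under an averages-and-half-differences normalization, not the paper's code); it also demonstrates the losslessness noted above:

```python
import numpy as np

def haar2d(img):
    """One level of 2D Haar decomposition: rows first, then columns."""
    # Row transform: averages and details of horizontally adjacent pixels
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0
    # Column transform on both halves yields the four sub-images
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0   # low-frequency approximation
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0   # horizontal detail
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0   # vertical detail
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def inverse_haar2d(ll, lh, hl, hh):
    """Exact reconstruction: no information is lost in the decomposition."""
    lo = np.empty((ll.shape[0] * 2, ll.shape[1]))
    hi = np.empty_like(lo)
    lo[0::2], lo[1::2] = ll + lh, ll - lh
    hi[0::2], hi[1::2] = hl + hh, hl - hh
    img = np.empty((lo.shape[0], lo.shape[1] * 2))
    img[:, 0::2], img[:, 1::2] = lo + hi, lo - hi
    return img

img = np.random.rand(8, 8)
ll, lh, hl, hh = haar2d(img)
print(np.allclose(inverse_haar2d(ll, lh, hl, hh), img))  # True: lossless
```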

Dynamic Ink Extraction of Fonts in Particle System

Calligraphy is written on rice paper by brushes dipped in ink, which has a unique ink effect compared to other artistic effects. With the rapid development of computer graphics, more and more people are using computers to simulate the movement of ink particles, and some even extract the dynamic ink effects of fonts.

Particle Systems

A particle system is a method in computer graphics that uses particles as the basic units to compose an image, and then uses this composed image to simulate a target object. Since its introduction, it has attracted many researchers, and as research has deepened, the method has matured and is now applied in a variety of important fields. The idea of a particle system is to compound structurally simple particles into the target object; the variety of particle units and composite forms can simulate any irregular object. Therefore, as typical dynamic irregular objects, the ink stains of calligraphy works can be well characterized by a particle system, which offers considerable flexibility.

Basic Principle of Particle System

The basic principle of applying particle systems to the description of calligraphy works is to use a large number of particles, gathered in a certain space, to simulate font ink. The smallest units used for simulation, i.e., the particles, can take many shapes: line segments, polygons, or three-dimensional shapes. Attributes of the particles (e.g., state of motion, size, color, life cycle) can be added arbitrarily for the specific object being described. Every particle passes through three states, "formed", "moving", and "dying", which constitute its life-cycle property. The remaining properties of a particle change randomly over time during its life cycle. These particle properties capture the irregularity and dynamics of an object, which matches the requirements for dynamic ink extraction in calligraphic works.

Particle system generation and property description

Usually, an emitter generates the particle units that constitute the particle system and controls their positions and motion modes within a specific space. The emitter also predefines a large amount of attribute information that determines each particle's state, and it passes this information to the particles as they are formed. As time passes, particles continuously form, move, and die out. The detailed steps are given below, followed by a minimal code sketch:

Step 1: Generate particles through the emitter and decide how many particles will be formed per unit of time and in what mode;

Step 2: The emitter gives each particle the corresponding basic attribute information according to the specific target object;

Step 3: Emission of particles, which make random motions according to the attributes given to them at the time of formation;

Step 4: Particle life cycle check, if the particle is not in the life cycle, then stop rendering, otherwise continue to update the particle according to its properties.

Step 5: Continue to generate new particles.
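The following Python sketch walks through the five steps above with a toy emitter; all attribute values and update rules here are illustrative assumptions, not the paper's parameters:

```python
import random

class Particle:
    """Minimal particle with position, velocity, size, and a finite life cycle."""
    def __init__(self, x, y, vx, vy, size, life):
        self.x, self.y = x, y
        self.vx, self.vy = vx, vy
        self.size, self.life = size, life

    def update(self, dt):
        # A random walk around the assigned velocity stands in for ink diffusion.
        self.x += (self.vx + random.uniform(-0.1, 0.1)) * dt
        self.y += (self.vy + random.uniform(-0.1, 0.1)) * dt
        self.size *= 0.98          # ink fades as the particle ages
        self.life -= dt

class Emitter:
    """Generates particles per time step and hands each its attributes (steps 1-2)."""
    def __init__(self, x, y, rate):
        self.x, self.y, self.rate = x, y, rate
        self.particles = []

    def step(self, dt):
        for _ in range(self.rate):                     # step 3: emit particles
            self.particles.append(Particle(
                self.x, self.y,
                random.uniform(-1, 1), random.uniform(-1, 1),
                size=random.uniform(0.5, 2.0), life=random.uniform(0.5, 2.0)))
        for p in self.particles:
            p.update(dt)
        # Step 4: stop rendering particles whose life cycle has ended.
        self.particles = [p for p in self.particles if p.life > 0]

emitter = Emitter(x=0.0, y=0.0, rate=50)
for _ in range(100):                                   # step 5: keep generating
    emitter.step(dt=0.05)
print(len(emitter.particles), "live particles")
```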

Dynamic ink effect production

The dynamic ink effect is highly variable, so the particle system from the previous section is applied to perform dynamic ink extraction on all calligraphic font samples.

In this study, eight samples were selected for the extraction of dynamic character ink using the generated particle system, and the specific results are given in Table 1. From Table 1, it can be seen that the particle system achieves a 100% extraction rate for the speed of ink flow across all samples, and likewise 100% for every extraction item (speed, width, force, and flow) for sample 8.

Table 1. Results of dynamic ink extraction

| Trajectory attribute | Dynamic attribute | Patterns 1-8 |
| --- | --- | --- |
| Speed | Speed of ink flow | ✓ for all 8 patterns |
| | Direction of ink flow | ✓ for 7 patterns, - for 1 |
| | Ink brightness | ✓ for 6 patterns, - for 2 |
| | Particle oscillation frequency | ✓ for 7 patterns, - for 1 |
| Breadth | Dispersion of pen and ink | ✓ for 6 patterns, - for 2 |
| | Particle density | ✓ for 6 patterns, - for 2 |
| | Particle aggregation degree | ✓ for 6 patterns, - for 2 |
| Strength | Length of ink flow | ✓ for 6 patterns, - for 2 |
| | Amplitude of particle oscillation | ✓ for 6 patterns, - for 2 |
| | Particle size | ✓ for 6 patterns, - for 2 |
| Flow rate | Pen density | ✓ for 7 patterns, - for 1 |

Annotation: ✓ indicates successful extraction, - indicates that the extraction failed.

Chinese character skeleton extraction model based on generative adversarial network
Chinese Character Skeleton Extraction Generator

To accurately extract the skeletons of Chinese characters from calligraphy images and obtain a certain generalization ability, the generator for Chinese character skeleton extraction is structured as shown in Figure 4. The model feeds a 128×128×1 Chinese character calligraphy image to the encoder, which outputs a 1×32768 vector; a fully connected neural network reduces this to a 1×4096 vector, and a second fully connected network expands it back to 1×32768. This vector is input to the decoder, which outputs a 128×128×1 image of the Chinese character skeleton.

Figure 4.

Structure of the Chinese character skeleton extraction model generator

In Fig. 4, the encoder consists of four convolutional modules with a step size of 2, a convolutional kernel size of 5 × 5, and a number of filters of 64, 128, 256, and 512, respectively. The decoder consists of four inverse convolution modules with a step size of 2, a convolution kernel size of 5 × 5, and a number of filters of 512, 256, 128, and 64, respectively.
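A hedged PyTorch sketch of this encoder-bottleneck-decoder shape is given below. It follows the layer counts, 5×5 kernels, stride 2, and the 32768/4096 vector sizes stated above; the normalization and activation placement, the single-channel final module, and the Tanh output are assumptions of this sketch, not details from the paper:

```python
import torch
import torch.nn as nn

class SkeletonGenerator(nn.Module):
    """Encoder-bottleneck-decoder sketch for 128x128x1 calligraphy images."""
    def __init__(self):
        super().__init__()
        def down(cin, cout):   # convolutional module: 5x5 kernel, stride 2
            return nn.Sequential(nn.Conv2d(cin, cout, 5, stride=2, padding=2),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        def up(cin, cout):     # inverse-convolution module: 5x5 kernel, stride 2
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 5, stride=2, padding=2, output_padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.encoder = nn.Sequential(down(1, 64), down(64, 128),
                                     down(128, 256), down(256, 512))  # -> 8x8x512
        self.fc = nn.Sequential(nn.Linear(32768, 4096), nn.ReLU(inplace=True),
                                nn.Linear(4096, 32768), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(up(512, 256), up(256, 128), up(128, 64),
                                     nn.ConvTranspose2d(64, 1, 5, stride=2,
                                                        padding=2, output_padding=1),
                                     nn.Tanh())  # 128x128x1 skeleton image

    def forward(self, x):
        z = self.encoder(x).flatten(1)        # 1x32768 vector per image
        z = self.fc(z).view(-1, 512, 8, 8)    # back to spatial form
        return self.decoder(z)

g = SkeletonGenerator()
print(g(torch.randn(1, 1, 128, 128)).shape)   # torch.Size([1, 1, 128, 128])
```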

Chinese Character Skeleton Extraction Discriminator

The discriminator in the Chinese character skeleton extraction model consists of two parts: one determines the authenticity of the Chinese character skeleton, i.e., whether it conforms to the correct topology of the character, and the other determines whether the extracted skeleton matches the input calligraphic image. The discriminator that judges skeleton authenticity is a Markov discriminator, which classifies each receptive field of the image as an N×N patch to determine its authenticity. The discriminator performs convolution operations on the image and takes the mean of the element values in the final output feature map as the final discriminator score. The structure of the discriminator is shown in Fig. 5.

Figure 5.

Structure of the Chinese character skeleton authenticity discriminator

As can be seen from Fig. 5, the discriminator model for determining the authenticity of the skeleton of Chinese characters takes the 128×128×1 image of the skeleton of Chinese characters as the input, and does the convolution operation on the image to obtain the 32×32×1 feature map as the output Out, and FT is the number of convolutional neural network filters in the discriminator. The elements in the output feature map Out are averaged as the output value of the Chinese character skeleton true/false discriminator D1.
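A minimal PyTorch sketch of such a patch-based (Markov) discriminator follows; the 128×128×1 input and 32×32×1 score map match Fig. 5, while the exact layer count, the filter number FT, and the LeakyReLU slope are assumptions:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Markov (patch) discriminator: one authenticity score per receptive field."""
    def __init__(self, in_ch=1, ft=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ft, 5, stride=2, padding=2),      # 128 -> 64
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ft, ft * 2, 5, stride=2, padding=2),     # 64 -> 32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ft * 2, 1, 5, stride=1, padding=2))      # 32x32x1 score map Out

    def forward(self, x):
        out = self.net(x)                 # patch-wise authenticity scores
        return out.mean(dim=(1, 2, 3))    # mean of all elements = output of D1

d1 = PatchDiscriminator()
print(d1(torch.randn(1, 1, 128, 128)).shape)   # torch.Size([1])
```

For the matching discriminator described next, one common choice (an assumption here) is to reuse the same structure with in_ch=2, stacking the two 128×128×1 inputs as a two-channel image.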

The discriminator D2 is used to determine whether the skeleton extracted from a calligraphic Chinese character image matches that character. It takes two 128×128×1 images as input: the skeleton image of the Chinese character and the calligraphic image of the character.

Loss function

The generator loss of the Chinese character skeleton extraction model is composed mainly of an adversarial loss and a consistency loss, as shown in Equation (6). The discriminator loss consists of an adversarial loss and a gradient penalty, as shown in Equation (7):

$$L_G = -D_1[G(X_l)] - D_2[X_l, G(X_l)] + \mathrm{MSE}(G(X_l), X) + 0.1\,\mathrm{SSIM}(G(X_l), X)$$

$$L_D = D_1[G(X_l)] - D_1(X) + D_2[X_l, G(X_l)] - D_2[X_l, X] + GP[G(X_l), X]$$

where $X_l$ denotes the input Chinese character calligraphy image, $X$ denotes the Chinese character skeleton label image corresponding to the calligraphy image, and $G(X_l)$ is the Chinese character skeleton image generated by the generator model.
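The following PyTorch sketch computes these two losses given the generator output and the two discriminators. It illustrates the reconstructed Equations (6) and (7), with a simplified single-window SSIM standing in for the full windowed SSIM (an assumption of this sketch) and the gradient penalty passed in separately:

```python
import torch
import torch.nn.functional as F

def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image: a simplified stand-in."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def generator_loss(d1, d2, x_l, x, g_out):
    """Eq. (6): adversarial terms for both discriminators plus consistency terms."""
    adv = -d1(g_out).mean() - d2(torch.cat([x_l, g_out], dim=1)).mean()
    return adv + F.mse_loss(g_out, x) + 0.1 * global_ssim(g_out, x)

def discriminator_loss(d1, d2, x_l, x, g_out, gp):
    """Eq. (7): Wasserstein-style real/fake gaps plus a precomputed penalty gp."""
    fake = g_out.detach()
    gap1 = d1(fake).mean() - d1(x).mean()
    gap2 = (d2(torch.cat([x_l, fake], dim=1)).mean()
            - d2(torch.cat([x_l, x], dim=1)).mean())
    return gap1 + gap2 + gp
```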

Chinese Character Skeleton Transformation Generator

The Chinese character skeleton transformation model uses a fully convolutional structure so that it can adapt to images of diverse resolutions, and each layer of the generator handles images at a different resolution. The input image, with a resolution of 128×128, is downsampled to a 32×32 image, which serves as the Chinese character skeleton image input to the first-layer generator model. The downsampling factor is set to 0.5, meaning that the resolution processed by each layer is about 0.5 times that of the next layer, and the minimum processing resolution is set to 32×32.

Each layer of the generator has its own random noise and can generate feature maps directly. Up-sampling is not performed internally by inverse convolution; instead, the generator at each level generates an image at the size it currently processes, which is then up-sampled by bilinear interpolation to obtain the Chinese character skeleton image that serves as the input to the next level's generator, as shown in Fig. 6.

Figure 6.

Generator structure of the Chinese character skeleton transformation

$X_{out}$ represents the final output image, $z_1 \sim z_3$ are the random noises input at each level, and $br(\cdot)$ is the bilinear interpolation scaling. $X_{out}$ is calculated as shown in Equation (8):

$$X_{out} = G_3\big(z_3, br\big(G_2\big(z_2, br(G_1(z_1, X_{in}))\big)\big)\big)$$

The structure of each layer of the generator is shown in Fig. 7; no dimensional transformation of the feature map is required inside the generator. Conv_block refers to the module that performs the convolution operation on the feature map; it contains a convolution layer with a step size of 1 and a 3×3 convolution kernel, Instance Normalization (IN), and a Leaky ReLU activation layer.

Figure 7.

Hierarchical structure of the generator

Within the generator of the current layer, $\varphi_i(\cdot)$ denotes the processing from the first convolutional module through the activation function, $x_i$ denotes the image generated by the generator of layer $i$, $\varepsilon$ is the noise rate (noise_str), $z_i$ is the random noise of the layer, and $br(\cdot)$ denotes the bilinear interpolation scaling. There are three layers of generators in the model, and Equation (9) shows how each level's generator processes the image:

$$x_i = \begin{cases} \varphi_i(\varepsilon z_1, X_{in}) & i = 1 \\ \varphi_i[\varepsilon z_2 + br(x_1)] + br(x_1) & i = 2 \\ \varphi_i[\varepsilon z_3 + br(x_2)] + 0.25\,br(x_2) & i = 3 \end{cases}$$
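Below is a hedged PyTorch sketch of this three-level cascade. It reads the first level's two arguments as an additive combination of scaled noise and the down-sampled input (my interpretation of Eq. (9), an assumption), keeps feature-map sizes fixed inside each level as stated, and uses toy channel counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def br(x, scale=2.0):
    """Bilinear interpolation scaling br(.) used between generator levels."""
    return F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)

def conv_block(ch):
    """Stride-1 3x3 convolution + Instance Normalization + Leaky ReLU, as in Fig. 7."""
    return nn.Sequential(nn.Conv2d(ch, ch, 3, stride=1, padding=1),
                         nn.InstanceNorm2d(ch), nn.LeakyReLU(0.2, inplace=True))

class LevelGenerator(nn.Module):
    """One pyramid level: phi_i(.) leaves the feature-map size unchanged."""
    def __init__(self, ch=1, depth=3):
        super().__init__()
        self.phi = nn.Sequential(*[conv_block(ch) for _ in range(depth)])

    def forward(self, x):
        return self.phi(x)

def cascade(gens, x_in, noises, eps=0.1):
    """Eq. (9): level 1 maps the down-sampled input; levels 2-3 refine upsampled output."""
    x = gens[0](eps * noises[0] + x_in)                     # i = 1
    x = gens[1](eps * noises[1] + br(x)) + br(x)            # i = 2
    x = gens[2](eps * noises[2] + br(x)) + 0.25 * br(x)     # i = 3
    return x

gens = [LevelGenerator() for _ in range(3)]
x32 = torch.randn(1, 1, 32, 32)                             # down-sampled skeleton input
noises = [torch.randn(1, 1, 32 * 2 ** i, 32 * 2 ** i) for i in range(3)]
print(cascade(gens, x32, noises).shape)                     # torch.Size([1, 1, 128, 128])
```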

Chinese Character Skeleton Transformation Discriminator

The discriminator in the skeleton transformation model is tied only to the generator of the corresponding level: one discriminator is set for each generator level, and each level's discriminator processes the image generated by its corresponding generator as well as the real sample image obtained by scaling the original image. The discriminator finally takes the mean of all elements in the output feature map matrix as its output. The structure of each level's discriminator is shown in Fig. 8.

Figure 8.

Discriminator structure at each layer

The Conv_block convolution module inside each level's discriminator in Fig. 8 does not change the feature map size and fixes the number of output channels $d$. The mean of the elements in the 32×32×1 feature map matrix is used as the discriminator's output value. Equation (10) shows the calculation of the discriminator output $R$: as $R$ tends to positive infinity, the input image is considered to originate from a real sample, and as it tends to negative infinity, from a fake sample generated by the generator:

$$R = \frac{1}{hw}\sum_{i=1}^{h}\sum_{j=1}^{w} F(i,j)$$

Loss function

The generator loss in the Chinese character skeleton transformation model is shown in Equation (11):

$$Loss_{G_i} = -D_i(x_i) + \alpha \left\| x_i - x \right\|_1$$

$D_i(x_i)$ denotes the score that the $i$-th level discriminator assigns to the image output by the $i$-th level generator, and $\|x_i - x\|_1$ denotes the L1 loss between the output image of the $i$-th level generator and the input image of the first level. $\alpha$ controls the weight of the L1 loss; it can be set according to the structures of different Chinese characters and takes values in the range [0, 10].

The discriminator in the Chinese character skeleton transformation model focuses only on whether it can correctly distinguish whether an image of a specified size originates from a generator-generated image or a real sample image. Denote by $X_{F_i}$ the fake image generated by the generator at level $i$, and by $X_{R_i}$ the real image scaled to the size of level $i$. The discriminator loss at level $i$ can then be expressed by Equation (12), where $GP(\cdot,\cdot)$ is the gradient penalty:

$$Loss_{D_i} = D_i(X_{F_i}) - D_i(X_{R_i}) + GP(X_{F_i}, X_{R_i})$$
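The notation $GP(\cdot,\cdot)$ together with the real/fake gap suggests the standard WGAN-GP formulation. Under that assumption, a minimal PyTorch sketch of the per-level discriminator loss is:

```python
import torch

def gradient_penalty(disc, real, fake, weight=10.0):
    """WGAN-GP style penalty GP(., .): unit gradient norm on interpolated samples."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mix = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    score = disc(mix).sum()
    grad, = torch.autograd.grad(score, mix, create_graph=True)
    return weight * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def level_discriminator_loss(disc, x_fake, x_real):
    """Eq. (12): D_i(X_Fi) - D_i(X_Ri) + GP(X_Fi, X_Ri) for one pyramid level."""
    fake = x_fake.detach()
    return (disc(fake) - disc(x_real)).mean() + gradient_penalty(disc, x_real, fake)
```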

Twin neural network-based model for font style classification
PSA module

In this paper, we draw on previous research and adopt the attention module PSA in the network design. The convolutional attention module CBAM makes reasonable use of the spatial and channel information in features, and subsequent work on attention, such as multi-scale processing and establishing long-range channel dependence, has brought improvements, but these methods are often accompanied by higher parameter counts, which greatly increases model complexity. PSA refines feature information in a multi-scale space and establishes long-distance channel dependence; embedding the PSA module into the BottleNeck block of ResNet and replacing the 3×3 convolutional layer in its structure forms a new basic block, on which EPSANet, an improved version of ResNet, is built. Experimental results show that this network outperforms other channel attention methods.

The structure of the PSA module is given in Fig. 9. The PSA module implements the attention mechanism in four stages. First, the SPC module obtains multi-scale feature information along the channel dimension. Second, the SEWeight module computes the attention weight matrices of the multiple scales in the channel direction. Third, Softmax re-optimizes the attention weights in the channel direction to obtain calibrated, refined weights. Finally, the weight matrix is applied to the original feature information, and the attention-refined features are produced as output.

Figure 9.

PSA module structure

The SPC module extracts multi-scale feature information; its structure is shown in Fig. 10, where C represents the number of channels, X the input feature information, K the convolution kernel size, G the number of feature maps in each group, and F the output feature vector of the SPC module. For the input feature information X, SPC uses a multi-branch approach to extract information at multiple scales in parallel: the convolution kernel size K differs in each branch, the number of channels of the input feature map is compressed by the convolutions to obtain information at different scales, and finally the multiple scales are spliced together and output.

Figure 10.

Structure of SPC module

$$F = \mathrm{Cat}([F_0, F_1, \ldots, F_{S-1}])$$

The splicing operation is shown in Equation (13), where S represents the number of branches. Once the output F is obtained, F is input into the SEWeight module. Like the SE module introduced earlier, SEWeight first compresses the channel information of the input features with global average pooling, expressed as Equation (14):

$$g_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j)$$

$x_c(i,j)$ is the feature value at position $(i,j)$ in the $c$-th channel feature map of F, $g_c$ represents the global average pooling value of the $c$-th channel, and H and W represent the height and width of the feature map, respectively. Then $g_c$ is input into a two-layer fully connected network so that high- and low-dimensional channel information interact, as shown in Eq. (15):

$$w_c = \sigma(W_1(\delta(W_0(g_c))))$$

In Eq. (15), $W_0(\cdot)$ and $W_1(\cdot)$ represent the fully connected layers, $\sigma(\cdot)$ represents the sigmoid activation function, and $\delta(\cdot)$ represents the ReLU activation function, so the channel attention weight vector $Z$ for multi-scale features can be simplified and expressed as Eq. (16):

$$Z_i = \mathrm{SEWeight}(F_i), \quad i = 0, 1, 2, \ldots, S-1$$

Then, for the channel attention weights obtained at multiple scales, Softmax is used to optimize them again so that local and global channel attention interact, yielding the recalibrated weight $att_i$, as formulated in Eq. (17):

$$att_i = \mathrm{Softmax}(Z_i) = \frac{\exp(Z_i)}{\sum_{j=0}^{S-1}\exp(Z_j)}$$
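Putting the four stages together, here is a compact PyTorch sketch of a PSA-style block; the kernel sizes, reduction ratio, and channel split are assumptions based on the description above, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """GAP + two fully connected layers -> per-channel weights (Eqs. 14-15)."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        g = x.mean(dim=(2, 3))        # global average pooling, Eq. (14)
        return self.fc(g)             # w_c, Eq. (15)

class PSA(nn.Module):
    """Pyramid squeeze attention: SPC split -> SEWeight -> Softmax -> reweight."""
    def __init__(self, ch, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert ch % len(kernels) == 0
        self.g = ch // len(kernels)   # channels per branch (G in Fig. 10)
        # SPC: parallel convolutions with different kernel sizes K
        self.branches = nn.ModuleList(
            nn.Conv2d(self.g, self.g, k, padding=k // 2) for k in kernels)
        self.se = SEWeight(self.g)

    def forward(self, x):
        parts = torch.split(x, self.g, dim=1)
        feats = [conv(p) for conv, p in zip(self.branches, parts)]   # F_0..F_{S-1}
        z = torch.stack([self.se(f) for f in feats], dim=1)          # (B, S, G)
        att = torch.softmax(z, dim=1)                                # Eq. (17)
        out = [f * att[:, i].unsqueeze(-1).unsqueeze(-1)
               for i, f in enumerate(feats)]
        return torch.cat(out, dim=1)                                 # refined features

psa = PSA(64)
print(psa(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```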

Cross entropy loss

The classification loss function used to optimize the network is the cross-entropy loss, the most commonly used and efficient loss function for optimizing convolutional neural networks for classification. Its role is mainly to assess the gap between the learned distribution and the actual distribution. Understanding this loss function rests on the following concepts:

Amount of information. The amount of information can be understood abstractly as the degree to which a piece of information eliminates uncertainty. A statement of certain fact has probability 1: since it describes something that definitely occurs, it eliminates no uncertainty and carries little information; conversely, a statement with a high degree of uncertainty contains a larger amount of information. In short, the probability of an event's occurrence is negatively correlated with the amount of information. Assuming $P(x)$ represents the probability that event $x$ occurs, the amount of information can be expressed as Equation (18):

$$I(x) = -\log P(x)$$

Information entropy. Information entropy is the expectation of the information of an event, i.e., the probability of each possible outcome multiplied by its information, summed over all outcomes. Information entropy can therefore be understood abstractly as a value that assesses the uncertainty of an event: the larger the information entropy, the more uncertain the event. It can be expressed as Equation (19), where $X = \{x_1, x_2, \ldots, x_n\}$ is a discrete random variable:

$$H(X) = -\sum_{i=1}^{n} P(x_i)\log P(x_i)$$

Relative entropy. Relative entropy, also known as KL divergence, is mainly used to measure the degree of difference between a pair of probability distributions: when the two distributions are identical, the relative entropy is zero, and as the difference between them grows, the relative entropy increases. Let $P(x)$ and $Q(x)$ be two probability distributions of the same random variable X; the relative entropy can then be expressed as Equation (20):

$$D_{KL}(p \| q) = \sum_{i=1}^{n} p(x_i)\log \frac{p(x_i)}{q(x_i)}$$

Cross entropy. On this basis the relative entropy can be split:

$$D_{KL}(p \| q) = \sum_{i=1}^{n} p(x_i)\log p(x_i) - \sum_{i=1}^{n} p(x_i)\log q(x_i) = \Big[-\sum_{i=1}^{n} p(x_i)\log q(x_i)\Big] - \Big[-\sum_{i=1}^{n} p(x_i)\log p(x_i)\Big] = H(p,q) - H(p)$$

The cross entropy $H(p,q)$ can then be expressed via the relative entropy and the information entropy:

$$H(p,q) = H(p) + D_{KL}(p \| q) = -\sum_{i=1}^{n} p(x_i)\log q(x_i)$$

When $P(x)$ is the real distribution, $H(p)$ is a constant, so the relative entropy that reflects the difference between the distributions depends entirely on the cross entropy. In the actual training of a neural network, optimizing the cross entropy therefore makes the network's predictions converge toward the real results, and choosing the cross entropy rather than the relative entropy also reduces the amount of computation. When the cross-entropy loss is applied to a binary classification problem, assuming the predicted probabilities of the two categories are $p$ and $1-p$ respectively, the cross-entropy loss takes the form:

$$Loss = -\frac{1}{N}\sum_{i}\big(y_i \log p_i + (1-y_i)\log(1-p_i)\big)$$

Here N represents the number of samples, $y_i$ represents the label of the corresponding sample, $p_i$ corresponds to the category $y = 1$, and $1-p_i$ corresponds to the category $y = 0$. In this paper the number of classification categories is 4, so the multi-class form of the cross-entropy loss should be chosen for each branch classification, as shown in Equation (24):

$$Loss = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{k=0}^{M-1} y_{ik}\log p_{ik}$$

where $p_{ik}$ represents the probability that the network predicts the $i$-th sample to be in the $k$-th category, and $y_{ik}$ indicates whether the $i$-th sample truly belongs to the $k$-th category: $y_{ik} = 1$ if the $i$-th sample is in the $k$-th category, and $y_{ik} = 0$ otherwise.
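As a quick numerical illustration of Equation (24), the NumPy sketch below evaluates the multi-class cross entropy on a toy batch with made-up probabilities and one-hot labels:

```python
import numpy as np

def cross_entropy(p_pred, y_true, eps=1e-12):
    """Equation (24): multi-class cross entropy, averaged over N samples."""
    p_pred = np.clip(p_pred, eps, 1.0)        # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(p_pred), axis=1))

# Toy batch: N = 2 samples, M = 4 style categories, one-hot labels y_ik.
y = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.2, 0.5, 0.1]])
print(cross_entropy(p, y))                    # -(log 0.7 + log 0.5) / 2 ≈ 0.525
```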

Results and Discussion
Effect of different activation functions on feature extraction accuracy

Commonly used activation functions include the Sigmoid function, the Tanh hyperbolic tangent function, and the ReLU function. To verify the influence of different activation functions on the performance of the model proposed in this paper, experiments were run on the selected calligraphy font dataset with all other model parameters unchanged; the results are shown in Figure 11. Using ReLU as the convolutional-layer activation function yields the highest recognition rate, with 19 of 20 samples correctly recognized, a rate of 95%; the Sigmoid function reaches 75%, and the Tanh hyperbolic tangent function is lowest at 60%. This is because the ReLU function is less computationally intensive than the Sigmoid function, overcomes the Sigmoid function's tendency toward vanishing gradients, and maintains the integrity of the data to a certain extent. Although the Tanh function alleviates the dispersion of data samples to a certain extent, it still cannot fully address the two problems of neuron saturation and vanishing gradients. The ReLU function, by contrast, effectively avoids vanishing gradients for unsaturated neurons and offers high computational efficiency and fast convergence, so it is the more appropriate activation function for the convolutional layers.

Figure 11.

Recognition rate changes under different activation functions

Effect of different attention mechanisms on model noise immunity

In previous studies, CBAM and SE attention modules have usually been embedded in convolutional neural networks, and those experiments show that attention modules do enhance model performance on the problem of classifying font styles in calligraphy creation. This practice inspired the model design of this paper, so the CBAM, SE, and PSA attention modules were chosen for comparative experiments. Fig. 12 shows the performance of the three mainstream attention modules on the CCS and eCCS datasets. The gap between the three is not obvious on the CCS dataset, but the noise immunity of the PSA module on the eCCS dataset is 0.978, an improvement of 15.47% and 20.74% over the CBAM and SE modules, respectively; this paper therefore embeds the PSA module in the feature extraction network.

Figure 12.

Comparison of attention modules

Analysis of glyph skeleton extraction under generative adversarial networks

In glyph skeleton extraction and classification, the regular and running script styles are easily confused, mainly because running script originated and developed on the basis of regular script, adding only rhythm to the strokes to make them more lively. To address this problem, the glyph skeleton extraction algorithm and font style classification model proposed in this paper are co-supervised by two different losses during training. In addition to the commonly used classification loss, a contrastive loss makes the distance between two input image features from the same class as small as possible and the distance between features from different classes as large as possible. Since each branch of the network decomposes its features with a Haar wavelet, the extracted image features are highly discriminative and can effectively capture subtle differences between images. To illustrate this point, the image features extracted in this paper are reduced in dimension and visualized with the t-SNE method; Figure 13 shows the visualization results for the CNCalliStyle dataset. The algorithm in this paper effectively extracts the glyph skeletons of the samples and classifies them by font style; overall, the accuracy of both extraction and classification exceeds 90%.

Figure 13.

Skeleton extraction and classification on CNCalliStyle
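The contrastive co-supervision described above is commonly implemented in a margin-based form; the following PyTorch sketch illustrates that standard formulation (the margin value and feature dimensionality are assumptions, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, same, margin=1.0):
    """Pull same-class feature pairs together; push different-class pairs apart."""
    d = F.pairwise_distance(f1, f2)                           # Euclidean distance
    pos = same * d.pow(2)                                     # same class: shrink d
    neg = (1 - same) * torch.clamp(margin - d, min=0).pow(2)  # different: enforce margin
    return (pos + neg).mean() / 2

f1, f2 = torch.randn(4, 128), torch.randn(4, 128)  # features from the two twin branches
same = torch.tensor([1., 0., 1., 0.])              # 1 = same style, 0 = different style
print(contrastive_loss(f1, f2, same))
```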

Accuracy analysis of style classification based on twin neural networks

To better demonstrate the accuracy and effectiveness of the twin-neural-network-based font style classification model designed in this paper, the confusion matrix for the five style categories is analyzed below. Each element of the matrix indicates the probability that the true category of its row is predicted as the category of its column, which clearly exhibits the probabilities of correct classification and of misclassification among the categories. The classification results are shown in Fig. 14, where SC denotes Seal Script, OS denotes Official Script, CH denotes Cursive Script, RH denotes Running Script, and RS denotes Regular Script. From the results in the figure, the prediction accuracy for every style is above 80%, and the model's prediction accuracy for cursive script is as high as 91%. By comparison, the prediction accuracy for seal script is only 81%: 8% of seal scripts are incorrectly recognized as clerical scripts, and 6% as running scripts. This result may be related to the fact that seal script has a more complex glyph structure and that seal script fonts are less common.

Figure 14.

Confusion matrix of the proposed method

Conclusion

This paper aims to realize the dynamic ink and character skeleton extraction of font styles in calligraphy creation, and successfully completes the extraction task with the help of convolutional neural networks, particle systems, generative adversarial networks, and other technical means. At the same time, the paper classifies the styles of all the extracted font samples with a twin neural network to lay a foundation for further research. Through the analysis, the following conclusions were drawn:

Unlike the Sigmoid function and the Tanh hyperbolic tangent function, using the ReLU function as the activation function of the convolutional layers raises the recognition accuracy of image feature extraction to 95%, nearly 20 percentage points higher than the Sigmoid function. This shows that the convolutional neural network algorithm used in this paper can effectively complete sample image feature extraction.

To address the easy confusion between regular script and running script in glyph skeleton extraction, this paper innovatively combines a classification loss with a contrastive loss and decomposes the features with Haar wavelets, which ultimately brings the glyph skeleton extraction accuracy of all samples above 90%. The accurate extraction of glyph skeletons not only achieves the research purpose of this paper but also prepares for the next step of style classification.

The classification model based on the twin neural network achieves more than 80% accuracy in predicting and recognizing all font style types, including 91% accuracy in recognizing cursive script. It is notable that 8% of the seal scripts in the study were mistakenly labeled as clerical scripts, and 9% of the running scripts were mistakenly labeled as clerical scripts. As these results indicate, the classification model can still be optimized to adapt to more complex research work in the future.
