Automatic Generation of Cinematic Animated Characters and Their Behavioral Characterization Using Graph Generation Networks
Published online: 17 March 2025
Received: 16 October 2024
Accepted: 31 January 2025
DOI: https://doi.org/10.2478/amns-2025-0268
Keywords
© 2025 Wei Peng et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
The continuous development of computer technology has made the presentation of animation scenes increasingly exquisite and realistic, but the complexity and expressive subtlety of the human body still limit the presentation of character models to a certain extent. Technical personnel therefore need to innovate actively and experiment boldly, which in turn advances China's animation character modelling technology and supports the healthy, sustainable development of China's animation industry [1-3].
There are two key technologies in the production of animation character models: human motion control, and human modelling with skin deformation. Since the development of mannequins in the 1970s, character models in animation have progressed steadily in production technique and quality, but the treatment of skin and the human body still has deficiencies to be perfected, making it an area worth challenging and improving [4-5]. Skin and body simulation is considered difficult for several reasons. First, human motion is very complex and demands a high degree of refinement, placing stringent requirements on simulation technology. Second, complex body movement must take muscle changes fully into account; if this problem is not solved well, the final character will appear stiff and rigid [6-8]. Finally, animated characters need a certain expressiveness: if they differ too greatly from real people, they lose authenticity and appeal. Character modelling techniques therefore need continuous improvement to overcome these problems and open up broader room for development [9-10].
The domestic animation industry continues to develop, and more and more animation works are sought after and loved by children and even adults, which is inseparable from the hard work of animation producers [11-12]. In film and television animation, the characters must be portrayed effectively: an animation prototype is created according to the story, and the character's inner life, expression, language and movement are shaped at the same time, outlining a vivid, living animated figure. Effectively shaping animated characters further enhances the appeal of the whole work, much like a star effect; in other words, a story scenario needs a character with a successful image, and this holds for film and television animation as well, where a distinctive animated character can carry the main line of the whole work [13-16].
The characters that carry a story forward in animation are fictional, conceived by the author, but this conception is not designed out of thin air: the author needs a certain amount of life experience and must constantly look for prototypes of characters in real life [17-18]. Film and television animation characters need to uphold the concept of drawing prototypes from life, further highlight the characters' traits, and increase their recognizability, so that a viewer forms a deep impression at first sight; this expands the role the characters play in the work. Effective construction of animated characters requires exaggerating and virtualizing many aspects of life, so that the quality of film and television animation works can be further improved [19-20].
In this study, the process and key technologies of animation character creation are analyzed on the basis of deep learning, and an automatic generation workflow for animated characters based on a graph generation network is proposed. Subsequently, an action recognition model built on behavioral feature extraction is established to detect and evaluate abnormal behavior of the generated animated characters. The image generation network model and the action recognition method of this paper are verified through experiments to study how well the proposed method assists the creation of animated characters using graph generation networks.
There are many types and forms of computer animation. From the application perspective, there are cartoons, game animation, information dissemination animation, teaching animation, and decorative animation. From the circulation perspective, there are network animations and non-network animations. From the production perspective, there are manual animation, mechanism animation, and programmed animation. From the visual perspective, there are two-dimensional animations and three-dimensional animations. From the perspective of image formats, there are bitmap animation and vector animation. From the perspective of production software, there are FLASH animation, 3DS MAX animation, MAYA animation, and so on.
The principle of animation lies in the "persistence of vision" of the human eye: the image of an object seen by the eye does not disappear for roughly 1/24 of a second. Exploiting this, still pictures are played in quick succession; the next frame is displayed before the previous one has faded from view, producing a smooth dynamic visual change that forms the animation effect. Each picture represents a moment, with different content drawn in different frames; the picture changes with time, and continuous playback forms the animation. Producing an animation therefore means changing the content of successive frames. In animation production, movie-level animation generally adopts a playback speed of 30 frames per second.
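As a small worked example of the timing described above, the following sketch (with an illustrative helper name) maps elapsed playback time to the frame that should be on screen at a movie-level rate of 30 frames per second.

```python
# Minimal sketch: mapping elapsed playback time to a frame index at 30 fps.
FPS = 30  # movie-level playback speed in frames per second

def frame_index(elapsed_seconds: float) -> int:
    """Return the index of the still picture that should be on screen."""
    return int(elapsed_seconds * FPS)

# Each frame stays on screen for 1/30 s (~33 ms), shorter than the ~1/24 s
# persistence of vision, so successive pictures fuse into continuous motion.
print(frame_index(2.5))  # frame 75 is shown 2.5 seconds into playback
```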
With the development of technology and society, the field of animation faces both technological disruption and growing demand. In this context, traditional methods of animation production and character creation can no longer meet contemporary design needs. As an emerging way of design thinking and practice, generative aided design is gradually attracting attention in the field of animation. This study explores a generative animation character creation method, providing guidance for using graph generative adversarial networks to assist animated character creation.
In the traditional animation production process, the generation of design schemes relies on the brainstorming of animation designers. However, this method is limited by personal experience and imagination, cannot generate a large number of creative solutions, and its design efficiency is low.
The evolution of the animation design process is shown in Figure 1. With the advancement of science and technology, the role of computers has become increasingly prominent. In the early days, computers served mainly as auxiliary tools to improve designers' efficiency in constructing physical models and visual effects; although this brought efficiency gains, it did not fundamentally change the overall framework of the design process. With the continued advance of computer technology, however, the role of computers has gradually shifted from assisting design to generating design. Given the design goals as input, computers can automatically generate a large number of visualized graphics and help designers find excellent design solutions they could hardly imagine on their own. In this way, animation design has moved from the traditional, labor-dependent design mode, through the computer-aided design stage, and finally toward the frontier of generative design that attracts so much attention today.

Animation design process evolution
Computer graphics emerged and developed along with computers and their peripherals. It is the result of the convergence of computer technology with television and graphic image processing technology, and it has a wide range of applications. This wide range of application areas also greatly promotes the rapid development of computer graphics technology, including both hardware and software.
Geometric transformation of graphics generally refers to transforming the geometric information of a graphic to produce a new graphic defined by new coordinate values. Geometric transformations of a graphic include scaling, shear, rotation, translation, and their composites. The coordinate relationship between points on the graphic before and after a geometric transformation can generally be obtained by analytic geometry.
Translation refers to relocating a point along a straight-line path from one coordinate position to another. The translation transformation is:

$$x' = x + T_x, \qquad y' = y + T_y$$

where $(x, y)$ are the original coordinates, $(x', y')$ are the transformed coordinates, and $T_x$, $T_y$ are the translation amounts along the $x$- and $y$-axes.
Proportional (scaling) transformation refers to scaling a point relative to the coordinate origin along the $x$- and $y$-directions by factors $S_x$ and $S_y$: $x' = S_x x$, $y' = S_y y$.
Rotation is the process of relocating a point around the coordinate origin by an angle $\theta$: $x' = x\cos\theta - y\sin\theta$, $y' = x\sin\theta + y\cos\theta$.
In graphics applications, it is sometimes necessary to deform an object along a single direction, which uses the shear transformation, also known as the miscut or dislocation transformation. Taking a shear along the $x$-direction as an example, the formula is $x' = x + c\,y$, $y' = y$, where $c$ is the shear factor.
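The following sketch illustrates these 2D geometric transformations with homogeneous 3x3 matrices, so that composite transformations reduce to matrix products; the specific parameter values are for illustration only.

```python
# Illustrative sketch of 2D geometric transformations as homogeneous matrices.
import numpy as np

def translate(tx, ty):
    return np.array([[1, 0, tx],
                     [0, 1, ty],
                     [0, 0, 1]], dtype=float)

def scale(sx, sy):
    return np.array([[sx, 0, 0],
                     [0, sy, 0],
                     [0,  0, 1]], dtype=float)

def rotate(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0],
                     [s,  c, 0],
                     [0,  0, 1]], dtype=float)

def shear(shx, shy):
    # shear (miscut/dislocation) transformation along x and y
    return np.array([[1, shx, 0],
                     [shy, 1, 0],
                     [0,   0, 1]], dtype=float)

# Composite transformation: rotate, then scale, then translate a point (2, 1).
p = np.array([2.0, 1.0, 1.0])
M = translate(3, 4) @ scale(2, 2) @ rotate(np.pi / 2)
print(M @ p)  # transformed point in homogeneous coordinates
```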
In animation production, keyframes can be played directly, while the ordinary frames between two keyframes require complementary (tween) animation to realize picture movement, such as character action effects and scene changes. Tween animation is calculated by the program and does not require the user to draw or edit it. It is a very important function: with it, the size, position, color, transparency, rotation and other attributes of symbols can be set, producing a wide variety of striking animation effects with natural, smooth motion and deformation. When the animation runs, every frame contains an image. An image is a bitmap composed of pixels; the intensity and color of each pixel can be described digitally, so the object is rasterized at a certain resolution and the information of each point is presented numerically, allowing it to be displayed on screen directly and quickly.
A frame, the smallest unit in animation, contains a single image; it corresponds to one cell on the timeline of the animation software and represents a point in time. A keyframe is a sudden change point of the graphic image, the qualitative change point of the picture, equivalent to the original drawing in 2D animation; it is the frame where a key action in the movement or change of a component or graphic occurs. The animation between two keyframes can be calculated automatically by the program; these frames are known as ordinary or intermediate frames, and the result is called tween animation. The concept of keyframes and in-betweens has long existed in traditional cartoon production, where skilled painters design the keyframes and average painters draw the intermediate frames. In short, tween animation is a gradual, quantitative process from one keyframe to the next.
The calculation process of motion tween animation is in fact an interpolation process: the symbols in the ordinary frames are interpolated according to the difference between the symbols in the first and last keyframes, so that the symbol changes smoothly from the first keyframe to the second. This is also the significance of the transformation matrix stored in the symbol. Motion tween frames are generated by interpolating between the transformation matrices of the symbol in the two keyframes according to the position of the ordinary frame, yielding the corresponding symbol in each ordinary frame. The interpolation is a uniform process: the amount of change accumulated at the $i$-th of $n$ intermediate frames is $i/n$ of the total change between the two keyframes.
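The sketch below illustrates this uniform keyframe interpolation for a small set of symbol attributes; the attribute names are illustrative rather than the actual data layout used by any animation software.

```python
# Minimal sketch of uniform (linear) interpolation between two keyframes: the
# attributes of an ordinary frame are computed from the surrounding keyframes.
def interpolate_keyframes(key_a, key_b, frame, total_frames):
    """Linearly interpolate every attribute between keyframe A and keyframe B.

    frame runs from 0 (first keyframe) to total_frames (second keyframe).
    """
    t = frame / total_frames  # uniform interpolation parameter in [0, 1]
    return {name: (1 - t) * key_a[name] + t * key_b[name] for name in key_a}

key_a = {"x": 0.0, "y": 0.0, "scale": 1.0, "alpha": 1.0}
key_b = {"x": 100.0, "y": 50.0, "scale": 2.0, "alpha": 0.5}

# Attributes of the ordinary frame halfway between the two keyframes.
print(interpolate_keyframes(key_a, key_b, frame=12, total_frames=24))
```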
During such an animation the graphic may, for example, gradually grow to twice the size of the first keyframe, and the size changes monotonically; it does not shrink and then grow again in the middle. The variables in the transformation matrix therefore also change gradually, either increasing or decreasing. But the speed of change can differ from frame to frame, which leads to the concept of tween_ease, the acceleration between in-betweens.
Acceleration in physics is the ratio of the change in velocity to the time taken for that change; it is a physical quantity describing how quickly an object's velocity changes, with the formula $a = \Delta v / \Delta t$.
However, this physical formula is not applicable to the tween acceleration tween_ease, because the total amount of change between two keyframes is fixed and the time of the change is also fixed, which is not sufficient for calculating the tween acceleration; a dedicated formula is therefore used to calculate it, in which $\Delta$ denotes the amount of change of the motion tween between the two keyframes and $number$ denotes the number of intermediate frames between the two keyframes.
From these quantities, the amount of change allocated to each intermediate frame can be derived, with the tween acceleration determining how the fixed total change $\Delta$ is distributed over the $number$ intermediate frames.
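Since the paper's exact easing formula is not reproduced above, the following hedged sketch uses a common quadratic ease-in/ease-out curve to show the idea: the total change between two keyframes stays fixed while the per-frame amount of change varies.

```python
# Hedged sketch of tween easing with a standard quadratic ease-in/ease-out
# curve (an assumption, not the paper's formula) applied to the parameter t.
def ease_in_out(t: float) -> float:
    """Map a uniform parameter t in [0, 1] to an eased parameter."""
    return 2 * t * t if t < 0.5 else 1 - 2 * (1 - t) * (1 - t)

def eased_value(start, end, frame, total_frames):
    t = ease_in_out(frame / total_frames)
    return start + (end - start) * t

# Per-frame change is small near the keyframes and largest in the middle,
# while the total change (0 -> 100) over 24 in-betweens remains fixed.
values = [eased_value(0.0, 100.0, f, 24) for f in range(25)]
deltas = [round(b - a, 2) for a, b in zip(values, values[1:])]
print(deltas)
```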
Deep learning derives from artificial neural networks and is characterized by deeper layers and more complex structures than the original artificial neural networks. When artificial neural networks were first proposed, the number of layers was generally small owing to limited computing power and storage, and the data processing capacity of these networks was correspondingly weak. With the further development of core hardware such as CPUs and GPUs in scientific computing, neural networks have been able to improve their data processing capability through layer stacking and complex structural design.
The original generative adversarial network is based on ideas from game theory. Its overall framework has two parts, a generator and a discriminator, each built from a separate deep neural network.
In the adversarial training process the goal of the generator is to deceive the discriminator, and its loss during training can be written as:

$$L_G = \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
The goal of the discriminator is to distinguish the generated fake images from real images; its input is an image and its output is a judgment of the image's authenticity. Its loss function is:

$$L_D = -\mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] - \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

where $x$ denotes a real image drawn from the data distribution $p_{data}$, $z$ denotes a noise vector drawn from the prior $p_z$, $G(\cdot)$ is the generator, and $D(\cdot)$ is the discriminator's estimate that its input is real.
The two loss functions can be described jointly as the minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
To realize image generation with generative adversarial networks, the deep convolutional generative adversarial network (DCGAN) was proposed. In the DCGAN network model, all fully connected hidden layers present in the original generative adversarial network are removed. Most of the structural designs suggested by DCGAN have been adopted by a large number of subsequent GAN variants.
To improve training stability and efficiency, batch normalization is added to all network layers except the output layer of the generator and the input layer of the discriminator. For activation functions, various types were tested and a better design was finally arrived at: in the generator, the ReLU activation is used for all layers except the output layer, which uses Tanh; in the discriminator, LeakyReLU is used for all layers. Compared with the original GAN, DCGAN improves image generation quality, has fewer training parameters, occupies less memory, and trains more efficiently.
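A minimal DCGAN-style sketch in PyTorch is given below, following the design choices just described (batch normalization except on the generator output and discriminator input, ReLU/Tanh in the generator, LeakyReLU in the discriminator); the channel sizes and the 32x32 output resolution are assumptions, not the paper's configuration.

```python
# Minimal DCGAN-style generator/discriminator sketch (PyTorch), 32x32 images.
import torch.nn as nn

def generator(z_dim=100):
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
        nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
        nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
        nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),       # output layer: Tanh, no BatchNorm
    )

def discriminator():
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),   # input layer: no BatchNorm
        nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
        nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
        nn.Conv2d(256, 1, 4, 1, 0),                           # real/fake score
    )
```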
The Wasserstein distance, also known as the Earth Mover's (EM) distance, is defined as:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big]$$

where $\Pi(P_r, P_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are the real distribution $P_r$ and the generated distribution $P_g$.
For any joint distribution $\gamma$, a pair of samples $(x, y)$ can be drawn, and $\lVert x - y \rVert$ is the cost of moving mass from $x$ to $y$; the Wasserstein distance is the minimum expected transport cost over all such couplings.
The KL divergence becomes meaningless when the supports of the two distributions do not overlap, or overlap almost nowhere, and the JS divergence is then a constant; neither is a valid measure of the difference between the two distributions in that case. The Wasserstein distance, in contrast, can still portray how far apart the two distributions are under this condition.
Although the Wasserstein distance is clearly superior for measuring the difference between two distributions, it cannot be computed directly in practice, because its expression contains an infimum over all joint distributions, which is intractable. By the Kantorovich-Rubinstein duality it can be rewritten as:

$$W(P_r, P_g) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim P_r}\big[f(x)\big] - \mathbb{E}_{x \sim P_g}\big[f(x)\big]$$

Thus it is only necessary to find a 1-Lipschitz function $f$ that (approximately) attains this supremum in order to estimate the Wasserstein distance.
In this way the original Wasserstein GAN (i.e., WGAN) is designed.
A deep neural network with parameters $w$ is used to construct and fit the function $f_w$, and the discriminator (critic) is trained to maximize:

$$\mathbb{E}_{x \sim P_r}\big[f_w(x)\big] - \mathbb{E}_{x \sim P_g}\big[f_w(x)\big]$$

At the optimum, this objective is equivalent to the approximate Wasserstein distance between the true and generated distributions.
The generator, on the other hand, aims to minimize the Wasserstein distance. Considering that the first term of the above expression is not directly related to the generator, the corresponding loss function of the generator is:

$$L_G = -\mathbb{E}_{x \sim P_g}\big[f_w(x)\big]$$
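The following sketch shows how the WGAN critic and generator losses derived above look in code, using the weight-clipping variant to enforce the Lipschitz constraint; the networks `critic` and `gen` and the clipping constant are placeholders, not the paper's actual configuration.

```python
# Sketch of WGAN critic/generator losses and weight clipping (PyTorch).
import torch

def critic_loss(critic, real, fake):
    # Minimizing E[f_w(fake)] - E[f_w(real)] maximizes the dual objective,
    # so the negative of this loss approximates the Wasserstein distance.
    return critic(fake).mean() - critic(real).mean()

def generator_loss(critic, fake):
    # L_G = -E_{x~P_g}[f_w(x)]: the generator ignores the real-data term.
    return -critic(fake).mean()

def clip_weights(critic, c=0.01):
    # Enforce (approximately) the 1-Lipschitz constraint by clipping weights.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```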
Animation production has now entered the stage of generative design. For movie-level animation in particular, the traditional production method is extremely time-consuming and labor-intensive, so combining intelligent algorithms with animation production is a major trend. The generation and creation of animated characters are carried out using a graph generative adversarial network. The animated character generation and creation process is shown in Figure 2 and is divided into six steps.

The generation and creation process of animated characters
The first step is to construct the animated character dataset, which provides training data for the generative adversarial network and a data reference for extracting the animated character imagery vocabulary.
The second step is to construct the graph generation network model, comparing the advantages, disadvantages and application areas of various generative adversarial networks, and adding visdom to the constructed model for real-time feedback.
The third step is to train the model with the constructed training dataset. During training, the generator continuously generates new animated characters while the discriminator judges their authenticity; through repeated iterative training, the performance of both the generator and the discriminator gradually improves.
The fourth step establishes selection criteria and can be carried out simultaneously with the third step: the data are classified while the network is trained, the animated character intention vocabulary is extracted, the relationship between morphology and emotion vocabulary is matched, and the design intention of the animated character is determined, so as to assist the producer in animation creation.
The fifth step automatically generates animated characters: characters are generated with the trained model, and the parameters are then adjusted to control the number of generated characters and the generation quality of the network, so as to output high-quality candidate schemes.
The sixth step is screening, optimization and determination of the final animated character: initial schemes meeting the requirements are selected according to the established selection criteria, an expert group then scores them to determine the final optimized scheme, and the optimal scheme serves as the reference for the final generation, design and production of the animated character.
Four datasets of different sizes and connectivity are considered in this chapter. Each dataset is divided into a test set (20%) and a training set (80%). The experiments are validated on the 20% test split, and the number of generated samples equals the number of samples in the test split of each dataset. To assess the quality of the generated graphs, the distribution of the generated animated-character graphs is compared with the distribution of the real graphs by measuring the Maximum Mean Discrepancy (MMD) of graph statistics, which captures how close the two distributions are. The graph statistics used to evaluate the generated character graphs are the node degree distribution (Degree), the clustering coefficient distribution (Clustering), the mean orbit count distribution (Orbit), the mean number of nodes (Node), and the mean number of edges (Edge). These statistics evaluate the structural distribution difference between the generated graph data and the real graph dataset in a comprehensive and quantitative way, reflecting the strengths and weaknesses of each model.
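To make the evaluation concrete, the sketch below computes an MMD between two sets of graph-statistic vectors (for example, normalized degree histograms of generated versus real graphs); the Gaussian kernel and the toy data are assumptions, since the paper does not specify the kernel.

```python
# Illustrative sketch of Maximum Mean Discrepancy (MMD) between graph statistics.
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mmd(samples_p, samples_q, sigma=1.0):
    """Squared MMD estimate between two sets of statistic vectors (rows = samples)."""
    k = lambda a, b: np.mean([gaussian_kernel(x, y, sigma) for x in a for y in b])
    return k(samples_p, samples_p) + k(samples_q, samples_q) - 2 * k(samples_p, samples_q)

# Example: normalized degree histograms of 3 generated vs. 3 real graphs.
generated = np.array([[0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]])
real = np.array([[0.45, 0.35, 0.2], [0.5, 0.4, 0.1], [0.55, 0.25, 0.2]])
print(mmd(generated, real))  # smaller values mean closer distributions
```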
In order to verify the effectiveness of the graph generation network improvement in this paper, the traditional GAN, CGAN, ACGAN and the generative network model of this paper are compared. The maximum mean discrepancy (MMD) comparison results are shown in Fig. 3, where panels (a) to (c) give the MMD score comparisons for the node degree distribution, the clustering coefficient distribution and the average orbit distribution, respectively. The average MMD values of this paper's model for the node degree distribution, clustering coefficient distribution and average orbit distribution are 0.748, 0.503 and 0.523, respectively, still higher than those of the ACGAN model, the strongest baseline, by 0.116, 0.035 and 0.076, which shows that the improvement and optimization of this paper's model are effective.

MMD comparison results
To further validate the effectiveness of this paper's model within the overall algorithm, the four models are tested on a new synthetic COKK dataset using the node degree distribution (Degree), clustering coefficient distribution (Clustering) and mean orbit count distribution (Orbit); the test results are shown in Table 1. The node degree distribution, clustering coefficient distribution and average orbit count distribution of this paper's model on the COKK dataset are 0.178, 0.185 and 0.076, respectively, which taken together are the best results and further illustrate the superiority of this paper's model.
Model test results (MMD on the COKK dataset)
| Model | Degree | Clustering | Orbit |
|---|---|---|---|
| GAN | 0.315 | 0.274 | 0.105 |
| CGAN | 0.275 | 0.244 | 0.078 |
| ACGAN | 0.189 | 0.201 | 0.065 |
| Ours | 0.178 | 0.185 | 0.076 |
Since real-life graphs typically have thousands or even tens of thousands of nodes, the scalability or generalizability of any generation method is crucial. Generalizability is one of the key aspects of graph generation algorithms and refers to the ability of an algorithm to handle larger and more complex datasets while maintaining the high performance and quality of the generated graph. Generalizable algorithms can effectively handle increasing amounts of data, making them suitable for applications that require large-scale graph generation or applications that generate large graphs. Therefore, ensuring the generalizability of graph generation algorithms is important for their practical use in various industries.
Since there is no standardized metric for generalizability, in this experiment the metric is set to the time required to generate graphs with different numbers of nodes. An untrained model is initialized to generate multiple random graphs, and the number of generated nodes n is continuously increased during the measurement. The GAN, CGAN and ACGAN methods and the proposed model were tested in the same device environment. The generalization test results are shown in Fig. 4. It can be seen that once the number of nodes exceeds 11,000, the graph generation time of GAN, CGAN and ACGAN exceeds 10 s, while the generation time of this paper's model stays below 10 s, a significant reduction compared with the other models.

Generalization test results
The characters and objects involved in a behavior in animation are the entities where the behavior occurs. Traditional recognition algorithms neglect to model the entities involved in the behavior and the interactions between them, resulting in poor performance in complex scenes.
Visual semantics can describe entity information more effectively. Usually, two types of method can model the entities involved in behavior: entity description based on attention mechanisms and entity description based on visual semantics. Attention-based entity description can capture the appearance characteristics of, for example, two drums, but when the two drums look very similar the resulting behavioral features are also very similar and cannot be distinguished by a visual classification network. With visual semantic information, however, the drums are labeled by a target detection network, so the two different types of behavior can be distinguished directly.
The behavior recognition task can be defined as solving for the behavior category of a given video, where the input is the sequence of video frames and the output is the predicted behavior label.
Semantic Feature Extraction:
The method in this chapter uses the YOLOv3 network as the target extractor for video frames. As a target detection network, YOLOv3 extracts features from each video frame, labels the target locations, and classifies the detected targets. Using these properties of the detection network, the entities involved in the behavior, their category information, and their location information can be extracted for each frame. Specifically, for each detected entity, the detector outputs its category label and its bounding-box location.
After the category information of the entities in the video is obtained, it needs to be transformed into semantic information; Word2Vec is used to represent each entity category as a word vector. The word vectors of the entities detected in each frame are then aggregated, and the per-frame semantic features are averaged over the video, where $Fra$ denotes the total number of frames of the video.
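The sketch below illustrates this semantic-feature step: detected entity category labels are mapped to word vectors and averaged over the frames of a video. The tiny training corpus and the gensim model are illustrative stand-ins for a pretrained Word2Vec embedding.

```python
# Hedged sketch: averaging Word2Vec embeddings of detected entity categories.
import numpy as np
from gensim.models import Word2Vec

corpus = [["person", "drum", "stick"], ["person", "ball", "court"]]  # toy corpus
w2v = Word2Vec(sentences=corpus, vector_size=32, min_count=1, window=2, seed=0)

def video_semantic_feature(per_frame_labels):
    """Average the word vectors of detected entity categories over all frames."""
    frame_vectors = [np.mean([w2v.wv[label] for label in labels], axis=0)
                     for labels in per_frame_labels if labels]
    return np.mean(frame_vectors, axis=0)

detections = [["person", "drum"], ["person", "drum", "stick"]]  # two frames
print(video_semantic_feature(detections).shape)  # (32,)
```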
The construction of the person-object relationship graph involves constructing nodes and constructing edges. For the nodes, the method uses the word vectors of the extracted entities as the nodes of the graph.
The connection operation represents the calculation of the relationship (distance) between two entities; by defining this connection operation, the relationship between two nodes is represented, and the value of each entry of the adjacency matrix is given by the connection operation applied to the corresponding pair of nodes.
Through this formulation, the values on the edges of the person-object relationship graph are constructed; the human node is connected to the other object nodes in the graph, considering that behavior arises mainly from the interaction between the person and the objects.
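A hedged sketch of this graph construction is shown below: nodes carry entity word vectors, and edges connect the human node to object nodes with a weight that decays with the spatial distance between bounding-box centers. The distance-based weighting is an assumption for illustration, not the paper's exact connection operation.

```python
# Hedged sketch: building the person-object adjacency matrix from detections.
import numpy as np

def build_adjacency(centers, is_person, sigma=100.0):
    """centers: (N, 2) box centers in pixels; is_person: boolean mask of length N."""
    n = len(centers)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and (is_person[i] or is_person[j]):
                dist = np.linalg.norm(centers[i] - centers[j])
                adj[i, j] = np.exp(-dist / sigma)  # closer entities interact more
    return adj

centers = np.array([[320.0, 240.0], [350.0, 260.0], [600.0, 100.0]])
print(build_adjacency(centers, is_person=[True, False, False]))
```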
The method in this paper feeds the constructed person-object relationship graph into a graph convolutional network for inference. Unlike an ordinary convolutional network, graph convolution aggregates information from neighboring nodes to the current node; it therefore allows information to propagate through the graph and update each node. The output of a graph convolution layer can be represented as:

$$Z = \sigma\big(A X W\big)$$

where $A$ is the adjacency matrix of the person-object relationship graph, $X$ is the matrix of node features (the entity word vectors), $W$ is the learnable weight matrix of the layer, and $\sigma(\cdot)$ is a nonlinear activation function.
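A minimal PyTorch sketch of such a layer follows, matching the aggregation described above (each node gathers information from its neighbors through the adjacency matrix before a learned linear map); the row normalization and self-loops are common conveniences, not necessarily the paper's exact formulation.

```python
# Minimal graph convolution layer of the form Z = sigma(A X W).
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, x):
        # adj: (N, N) person-object adjacency, x: (N, in_dim) node features
        adj = adj + torch.eye(adj.size(0))                  # keep each node's own feature
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.weight((adj / deg) @ x))     # row-normalized aggregation

layer = GraphConv(in_dim=32, out_dim=16)
out = layer(torch.rand(4, 4), torch.rand(4, 32))
print(out.shape)  # torch.Size([4, 16])
```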
Behavior recognition of animated characters requires processing and analyzing a large amount of animation video data. The early-stage pre-processing of video by deep learning algorithms places high demands on computer configuration, and suitable hardware is needed to accelerate the processing of video image data.
In this paper, animated human behavior is fitted through behavioral feature extraction. The human behavioral features and their confidence levels are taken as training data, and a neural network is built for identification and classification: the feature data corresponding to the labeled abnormal behaviors of animated characters, together with the corresponding confidence information, are fed into the network, the behavior class with the largest probability value is selected as the result through calculation, and the result is fed back to the original animation video, finally completing the detection of abnormal behavior of animated characters in animation videos.
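The following sketch shows one way such a classification step could look; the network architecture, feature dimension and class count are assumptions, not the paper's exact configuration.

```python
# Hedged sketch: behavioral feature vectors plus a confidence value are fed to
# a small network whose softmax output gives the probability of each
# abnormal-behavior class (e.g. M1-M4); the largest probability is the result.
import torch
import torch.nn as nn

class BehaviorClassifier(nn.Module):
    def __init__(self, feature_dim=64, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + 1, 128), nn.ReLU(),   # +1 for the confidence value
            nn.Linear(128, num_classes),
        )

    def forward(self, features, confidence):
        x = torch.cat([features, confidence.unsqueeze(-1)], dim=-1)
        return torch.softmax(self.net(x), dim=-1)

model = BehaviorClassifier()
probs = model(torch.rand(8, 64), torch.rand(8))   # batch of 8 clips
predicted = probs.argmax(dim=-1)                  # class with the largest probability
print(predicted)
```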
The experiments in this chapter are based on an abnormal-behavior detection dataset of animated characters, together with a simulated abnormal-behavior dataset. The resolution of the animation videos is 1080p, and the abnormal behavior detection and recognition method designed in this paper identifies character movement error (M1), movement stiffness (M2), movement anomaly (M3) and interaction anomaly (M4) on the dataset. The feature vector data are used for training and validated on the test set; the training results are shown in Fig. 5, where (a) and (b) are the model loss curve and the model accuracy curve, respectively. The model accuracy is 93.40%, and the model works well.

Training results
The results of abnormal behavior recognition of animated characters are shown in Table 2. The experiments show that the behavior recognition method designed in this paper achieves recognition accuracies of 97.18%, 96.52%, 97.51% and 95.85% for the animated character behaviors of movement error (M1), movement stiffness (M2), movement anomaly (M3) and interaction anomaly (M4), respectively. The average recognition accuracy for abnormal behaviors of animated characters is 96.76%, and these experimental data verify that the abnormal behavior detection method for animated characters designed in this paper is effective and feasible.
Abnormal behavior recognition results for animated characters
| Character behavior | Number of videos | Correct detections | Incorrect detections | Accuracy (%) |
|---|---|---|---|---|
| M1 | 354 | 344 | 10 | 97.18 |
| M2 | 546 | 527 | 19 | 96.52 |
| M3 | 482 | 470 | 12 | 97.51 |
| M4 | 337 | 323 | 14 | 95.85 |
| Mean | — | 416 | 13.75 | 96.76 |
To further validate the effectiveness of the proposed method in detecting abnormal behavior of animated characters in animation videos, the regularity scores of the frames of the 13th test video in the animation dataset are analyzed; the results are shown in Fig. 6, where the horizontal axis is the frame number and the vertical axis is the regularity score of each frame. The yellow curve is the regularity score computed for the test video frames: a higher score corresponds to frames with normal behavior, a lower score to frames with abnormal behavior, and the light blue region marks the frames in which the abnormal behavior of the animated character occurs. It can be seen that for most of the time the regularity score is relatively stable and high, while at frame 430, where the abnormal behavior event occurs, the score drops sharply to as low as 0.198, effectively detecting the occurrence of the abnormal behavior. The incoherent movement of the animated character between frames 200 and 300 can also be effectively detected. This further verifies that the method proposed in this chapter has good detection performance.

Regularity score test results
Based on deep learning, this paper constructs a method for automatically generating animated characters using a graph generation network, and identifies abnormal behaviors of the generated characters through behavioral feature extraction so that the characters can be adjusted in time.
The node degree distribution, clustering coefficient distribution and average orbit count distribution of the graph generation network model constructed in this paper on the COKK dataset are 0.178, 0.185 and 0.076, respectively, giving it a clear advantage over the other models. In the generalization test, the image generation times of this paper's model are all under 10 s, a significant reduction compared with GAN, CGAN and ACGAN, indicating better generalizability.
In the detection of abnormal behavior of animated characters, the accuracy of this paper's model is 93.40%, and it works well. The recognition accuracies for the animated character's movement error, movement stiffness, movement anomaly and interaction anomaly are 97.18%, 96.52%, 97.51% and 95.85%, respectively, and the average recognition accuracy for abnormal behaviors is as high as 96.76%, showing that the method has high recognition accuracy. Additionally, the regularity-score analysis of video frames can effectively detect abnormal behaviors and incoherent movements of animated characters. These results comprehensively verify the effectiveness of the proposed method in assisting the generation and creation of animated characters.
