Digital Intelligence Technology-Driven Transformation and Innovation Research on Choreography Design
Published online: 05 Jun 2025
Received: 09 Jan 2025
Accepted: 25 Apr 2025
DOI: https://doi.org/10.2478/amns-2025-1033
© 2025 Wei Miao, published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the continuous innovation of contemporary art, audiences' aesthetic concepts have become increasingly diverse; to offer audiences genuine artistic enjoyment, it is crucial to strengthen choreography design. Choreography design is a unique art form that organically integrates products of time and space into a unified presentation; it places high demands on both materials and technique and exhibits diversified design characteristics [1-3]. Contemporary choreography design encompasses actors' costumes, props, decorative sets, and more. To present a beautiful stage effect, stage lighting, costumes, props, and sets must coordinate with one another, and the design must be optimized around the theme of the work and the aesthetic needs of contemporary audiences, so as to create a high-quality audio-visual environment and give the audience a more three-dimensional, richer aesthetic experience [4-8].
Choreography design is a typical expression of the combination of art and technology, and choreographers have long pursued innovation in its artistic expression. Traditional choreography design relies mainly on handmade props, physical sets, and conventional lighting; this approach is easily limited by production processes and spatial constraints, making it difficult to satisfy modern audiences' expectations for visual effects [9-12]. In this context, the emergence of digital intelligence technology has revolutionized choreography design. With its strong visual presentation, real-time interactivity, and high flexibility, digital intelligence technology opens nearly unlimited possibilities for the stage [13-16]. Using these technologies, choreographers can create more vivid and realistic stage effects and bring a more striking visual experience to the audience [17-19]. In addition, digital media technology enables real-time interaction between the stage and the audience, enhancing the audience's sense of participation and immersion and making choreography more creative and dynamic [20-22]. Exploring the application of digital intelligence technology in choreography design therefore has important theoretical significance and practical value for promoting the innovation and development of the field.
In the choreography design process, the light, sound, and electrical technologies within digital intelligence technology can also enrich the stage's expressive power and bring the audience an all-round, multi-dimensional audio-visual experience. Stage lighting, as the visual focus of choreography design, likewise depends on the support of digital intelligence technology. Gao, J. et al. developed an intelligent stage lighting control system that automatically tracks performers; based on a deep convolutional neural network tracking algorithm, it automatically identifies a performer's position and tracks the target in real time, solving the problem that manual lighting control cannot track actors accurately and promptly [23]. Hsiao, S.W. et al. introduced music emotion recognition and machine learning into a stage lighting adjustment system that automatically adjusts the lighting mode and optimizes the stage effect by detecting the intensity and emotion of a music clip [24]. Qu, S. noted that stage lighting control systems suffer from poor control performance and weak energy efficiency, and proposed an intelligent stage lighting control system based on a support vector machine with improved histogram-of-oriented-gradients feature extraction (HOG-SVM), which effectively recognizes human body contours and reduces control energy consumption while meeting the needs of stage performance [25].
Sound technology is an important part of choreography design, bringing the audience a clear and realistic auditory experience through advanced sound equipment and precise audio processing. Liu, Y. showed that the stage sound system plays an important role in enhancing the audience's auditory experience and the expressiveness of stage art, and that configuring and optimizing the sound system with digital technology according to stage design principles significantly enhances the performance atmosphere [26]. Optoelectronic technology can likewise make stage props and sets present more splendid and varied effects. Jung, H. et al. introduced the application of projection mapping in stage design: using the outer wall of a building or the surface of a specific object as a screen extends the stage presentation from a small screen into real three-dimensional space, greatly enhancing the tension of the performance [27]. Nakatsu, R. et al. developed a projection mapping system that can project images onto non-rigid moving objects such as performers, realizing real-time 3D projection mapping of multiple moving targets with the help of depth sensors and multiple projectors, thereby innovating and expanding the forms of stage performance [28]. Sun, F. et al. analyzed the form and function of movable stage set design from the perspective of intelligent control, which helps improve the presentation of theatrical art [29].
In addition, virtual interactive systems can ensure that all elements on stage are presented to the audience coherently through precise synchronized control and data interaction, creating a more polished stage effect. Samur, S.X. constructed virtual, literal modes of performance presence through novel head-mounted virtual reality technology, helping audiences achieve a digital sense of presence in stage performances [30]. Yoo, Y., addressing the incompleteness and limited diversity of visual content caused by the occupied field of vision in traditional stage performances, proposed using digital image technology to design virtual reality performances suited to the visual space of the stage and realizing cross-media stage performances through virtual reality [31]. Yan, S. et al. emphasized that social interaction is an important element of stage performance and designed a new approach to social interaction based on virtual reality technology to evoke the audience's social awareness, opening new possibilities for enhancing the viewing experience [32].
Summarizing the above studies, we find that the application of digital intelligence technology in choreography design has brought epoch-making changes to stage art: it not only significantly enhances the visual effect of the stage but also greatly enriches its expressive power through innovative means. As digital media technology continues to develop, choreography design can be expected to show an increasingly diversified and personalized artistic style.
This paper clarifies the artistic goals and visual style, develops a deep understanding of the choreography creative process, analyzes a large corpus of choreography images and related text descriptions with the Stable Diffusion model, and determines the tone and style of the generated choreography designs. The parameters of the LoRA model are then fine-tuned to adjust the style of the choreography design. The first- and second-layer CGADM architectures are designed in turn to improve the SD scene generation model. Stage performers are detected and recognized with the YOLOv5 algorithm, and the KNN and RANSAC algorithms are used to extract human key points and feature points and form matching point groups, improving the positioning and tracking of stage lighting within the choreography design. Finally, the presentation effect of the proposed choreography design is evaluated jointly through physiological measurements and the collection of subjective opinions.
The choreography design process begins with an in-depth understanding of the project's theme, defining the artistic goals and visual style. AIGC techniques such as Stable Diffusion are used to input textual cues related to the theme and initiate image generation; this step is critical in determining the tone and style of the generated images. Next, the style is fine-tuned by selecting and adjusting the large model and by fine-tuning the parameters of the LoRA model. This stage requires close collaboration between the creator and the technology to ensure that the detail and style of the images match the creative vision. Finally, video is output frame by frame using plug-ins.
Large model selection. Stable Diffusion (SD) is a deep learning model based on the latent diffusion architecture, applied specifically to the field of image generation, and is a powerful authoring tool [33]. WebUI is the web interface for SD. By analyzing a large amount of image data and the associated textual descriptions, SD learns to transform textual information into visual images, opening a whole new dimension of digital creation and making text-to-image conversion practical. The Mov2Mov plug-in is an extension for SD that converts a video file into a series of textual cues, which SD then uses to generate images corresponding to the original video frames. Its innovation is to shift the transfer of the video stream from traditional video data compression to text-cue compression, significantly reducing the required bitrate while maintaining image quality. In this way, the Mov2Mov plug-in provides a novel solution for video editing and transmission.
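As a concrete illustration of this text-to-image step, the following minimal Python sketch uses the open-source `diffusers` library rather than the WebUI; the checkpoint name and the prompt are illustrative assumptions, not the configuration used in this paper.

```python
# Minimal text-to-image sketch with the open-source `diffusers` library.
# The checkpoint name and prompt are illustrative, not from this paper.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed publicly available SD 1.5 checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a GPU is strongly recommended

prompt = ("theatre stage set design, dramatic lighting, "
          "flowing costumes, wide shot, highly detailed")
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("stage_concept.png")
```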
LoRA is a fine-tuning technique for tuning large language models (LLMs) [34]. It significantly reduces the number of trainable parameters, the memory requirements, and the training time by introducing low-rank matrices at each layer of the Transformer architecture and training only these matrices while keeping the original model weights unchanged. LoRA is a parameter-efficient fine-tuning (PEFT) technique that adapts a model by introducing a small number of parameters while leaving the original model intact. With a LoRA model, customized tuning can be performed to fit a specific style while maintaining high efficiency when generating images with SD tools. Like ControlNet, LoRA uses a small amount of data to train a painting style or character without modifying the SD model, and it requires far fewer training resources than training the SD model itself. This is expressed in equation (1) as follows:

$$W' = W_0 + \Delta W = W_0 + BA \tag{1}$$

where $W_0$ is the frozen pre-trained weight matrix, and $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank matrices with rank $r \ll \min(d, k)$.
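To make Eq. (1) concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer; the class name, rank, and scaling are illustrative assumptions, not the actual fine-tuning code used here.

```python
# Minimal sketch of the LoRA idea in Eq. (1): the frozen weight W0 is kept,
# and only the low-rank factors B (d x r) and A (r x k) are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze W0
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)  # r x k
        self.B = nn.Parameter(torch.zeros(d, rank))         # d x r, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # y = W0 x + (B A) x, i.e. the low-rank update from Eq. (1)
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```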
The WebUI image generation workflow is: select a model, input prompt words, adjust parameters, and click to generate images. The actual machine pipeline is: prompt words → lexical tokens → text encoder → parameter setting → noise predictor → variational auto-encoder (VAE) → output image. The core principle of LoRA model training is to embed "repeated impressions" into the noise predictor network. In the field of stagecraft, for lighting designers the dataset can be fixture specifications and documentation; for stage designers and costume designers, it can be costume styles and the works of well-known designers across different stylistic genres.
A generative adversarial network consists of a generator and a discriminator. The generator is responsible for learning the sample distribution and generating images as realistic as possible [35], while the discriminator is responsible for distinguishing real data from generated data. Training ends when the data produced by the generator successfully deceives the discriminator. The loss function of the generative adversarial network can be expressed as equation (2):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2}$$

where $G$ is the generator, $D$ is the discriminator, $x$ is a real sample drawn from the data distribution $p_{data}$, and $z$ is a noise vector drawn from the prior distribution $p_z$.
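As a hedged sketch of how the objective in Eq. (2) is typically optimized in practice, the following PyTorch fragment computes the discriminator and generator losses with binary cross-entropy on logits; the function names are illustrative.

```python
# Sketch of the adversarial objective in Eq. (2) using binary cross-entropy.
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # d_real: D(x) logits on real images; d_fake: D(G(z)) logits on generated images
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def generator_loss(d_fake):
    # the generator tries to make the discriminator label fakes as real
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```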
The diffusion model constructs a Markov chain that uses a diffusion process to transform a known distribution (e.g., Gaussian) into the target data distribution. The diffusion model is shown in Figure 1.

Diffusion model
The latent diffusion model (LDM) belongs to the class of conditionally controlled models and significantly reduces computational requirements compared with pixel-space diffusion models. It can not only be trained on limited computational resources but also ensures the quality and flexibility of the generated images.
The Stable Diffusion (SD) model was proposed on the basis of the latent diffusion model. The difference between the two is that the stable diffusion model uses CLIP as its text encoder and a variational auto-encoder (VAE) as its image encoder, and it was trained on the LAION-5B dataset. The two models are similar in that both use a U-Net network.
Attention mechanism. To further enhance the expressive ability of the object embedding vectors, an attention mechanism is added to the graph convolutional network: when aggregating the feature vectors output by neighboring nodes, attention coefficients are assigned to those neighbors, strengthening each object's ability to perceive its neighborhood. A shared parameter matrix is first used to linearly transform the node features before the attention coefficients are computed.

Scene layout. This paper uses a scene layout network based on a regional convolutional network, in which each object's bounding box is described by four parameters (the box center coordinates together with its width and height). A border is first predicted for each object, and the predicted boxes are then assembled into the scene layout.

CLIP text encoder. The attention mechanism here is multi-head attention, which introduces the three concepts of Query, Key, and Value, denoted $Q$, $K$, and $V$. The attention output is the weighted sum of the values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors.
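The scaled dot-product attention above can be written compactly; the following PyTorch sketch is a generic implementation of the formula, not the CLIP source code.

```python
# Minimal scaled dot-product attention matching the formula above.
import torch

def attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of queries and keys
    weights = torch.softmax(scores, dim=-1)        # attention coefficients
    return weights @ V                             # weighted sum of the values
```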
The text embedding vectors output by the text encoder are high-dimensional; when data are limited, the text descriptions cannot fill the hidden space and the data manifold is discontinuous, which harms training. A conditional augmentation technique is therefore introduced to obtain more hidden variables by random sampling from a Gaussian distribution, whose expression is shown in Eq. (15) [36]:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{15}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
Residual network generation. The residual network is used to deeply model the multi-peak features of text and images, and it alleviates both the vanishing-gradient and overfitting problems. The first layer of the network downsamples the generated image through a convolutional layer with a 4 × 4 kernel window and stride 2, a batch normalization layer, and a LeakyReLU activation, yielding the image features. The image features combined with the hidden variables are fed into the residual network to learn the multimodal representation between image and hidden variables and improve image accuracy. The residual network consists of a series of residual blocks; this paper uses four, and each residual block can be expressed by equation (16):

$$x_{l+1} = h(x_l) + \mathcal{F}(x_l, W_l) \tag{16}$$
The residual block consists of two parts, the direct mapping and the residual, where $h(x_l) = x_l$ is the direct (identity) mapping, $\mathcal{F}(x_l, W_l)$ is the residual part, and $x_l$ and $x_{l+1}$ are the input and output of the $l$th block.
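A minimal sketch of one such residual block in PyTorch, assuming 3 × 3 convolutions inside the residual branch (the exact kernel sizes within the blocks are not specified in the text):

```python
# Sketch of one residual block from Eq. (16): output = identity + residual branch.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(x + self.branch(x))  # direct mapping + residual
```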
Match-aware discrimination. The discriminator of the sub-layer network is a match-aware discriminator. During training, the discriminator divides the real image, the generated image, and the corresponding text description into positive and negative samples: only the combination of a real image with its matching text is a positive sample, and all other combinations are negative. The discriminator uses a deep convolutional neural network containing two downsampling layers, each consisting of a convolutional layer with stride 2 and a 4 × 4 kernel window, batch normalization, and a Leaky ReLU. The objective function of the sub-layer discriminator network is expressed as equation (18):

$$\mathcal{L}_D = \mathbb{E}_{(x,t)}[\log D(x, t)] + \tfrac{1}{2}\,\mathbb{E}_{(\hat{x},t)}[\log(1 - D(\hat{x}, t))] + \tfrac{1}{2}\,\mathbb{E}_{(x,\bar{t})}[\log(1 - D(x, \bar{t}))] \tag{18}$$

where $x$ is a real image, $\hat{x}$ a generated image, $t$ the matching text description, and $\bar{t}$ a mismatched text description.
To realize stage lighting localization and tracking with deep neural network technology, the target is first detected and identified with the YOLO algorithm; key points and feature points of the human body are then extracted from each image with a key point detection algorithm and formed into matching point groups for matching.
First, the Mosaic data enhancement technique significantly improves the system's adaptability to various scenes and the accuracy of target detection by randomly cropping, scaling, and rearranging images. In addition, all images are resized to a uniform size at the input side to keep the training data consistent and improve the model's processing speed. This step standardizes the data input and lays a solid foundation for model training.
The Backbone integrates three key structures: Focus, CSP, and SPP. The Focus module specializes in image slicing and generates finer-grained feature maps through convolutional operations. The CSP structure introduces a varying number of residual components to further optimize feature extraction. SPP (spatial pyramid pooling) enhances the model's adaptability to changes in image size by applying max-pooling over feature maps of different sizes.
The Neck module connects the Backbone to the Head and contains both FPN and PAN structures. FPN efficiently conveys feature information in a top-down manner, while PAN employs a bottom-up approach that enhances the detailed representation of features.
Finally, the Head module resolves the mismatch between predicted boxes and target boxes through the loss function and non-maximum suppression (NMS), and removes redundant predicted boxes through a weighting method, improving the accuracy and clarity of detection.
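For illustration, a pretrained YOLOv5 model can be loaded through `torch.hub` to detect performers in a captured stage frame; the file name is a placeholder and this is a generic usage sketch, not the system's production code.

```python
# Sketch: person detection on a stage frame with a pretrained YOLOv5 model
# loaded via torch.hub (requires internet access the first time).
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.classes = [0]                 # COCO class 0 = person; track performers only

results = model("stage_frame.jpg")  # path to a captured stage image (illustrative)
boxes = results.xyxy[0]             # tensor rows: [x1, y1, x2, y2, confidence, class]
for x1, y1, x2, y2, conf, cls in boxes.tolist():
    print(f"performer at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}), conf={conf:.2f}")
```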
During matching, each feature point is assigned its corresponding 3D coordinates one by one. By applying the nearest-neighbor classification matching algorithm (KNN), the distance between samples and the difference in their directions can be calculated to minimize the error in the matching process.
The calculation of the distances follows equation (19):

$$d(p, q) = \sqrt{(x_p - x_q)^2 + (y_p - y_q)^2 + (z_p - z_q)^2} \tag{19}$$

where $p$ and $q$ are two feature points and $(x_p, y_p, z_p)$ and $(x_q, y_q, z_q)$ are their coordinates.
To measure the difference in direction, Eq. (20) can be used to calculate the direction cosine between two feature points:

$$\cos\theta = \frac{\vec{p} \cdot \vec{q}}{\|\vec{p}\|\,\|\vec{q}\|} \tag{20}$$

where $\vec{p}$ and $\vec{q}$ are the direction vectors of the two feature points and $\theta$ is the angle between them.
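A hedged OpenCV sketch of this matching stage, combining ORB features, KNN matching with Lowe's ratio test, and RANSAC-based outlier rejection (here via the fundamental matrix); the image file names and thresholds are illustrative.

```python
# Sketch of the matching stage: ORB features, KNN matching with a ratio test,
# and RANSAC to reject outlier correspondences (OpenCV APIs).
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # illustrative file names
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
pairs = matcher.knnMatch(des1, des2, k=2)  # 2 nearest neighbours per descriptor
good = [p[0] for p in pairs
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]  # ratio test

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
# RANSAC keeps only matches consistent with a single epipolar geometry
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
```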
During calculation, the system takes the center of the stage as the origin and establishes a coordinate system with a grid spacing of 0.5 m. The 3D coordinates of each matching point group are computed from the built-in parameters and position of each camera device; the system then converts these 3D coordinates into the corresponding 2D coordinates projected onto the stage by perspective projection.
Cross-dataset training means training the localization algorithm with additionally collected data as a supplementary training set. Collecting data under different environments, lighting conditions, and actor postures effectively improves the stability and generality of the system, enhancing its adaptability to situations that may arise during improvisation.
Motion posture prediction can be divided into two cases. The first is uniform motion, i.e., the actor's route over a given time is uniform. In this case, the system can extrapolate the next coordinate point from the 2D projection coordinates of the previous feature points, as shown in Eq. (21) and Eq. (22):

$$x_{n+1} = x_n + (x_n - x_{n-1}) \tag{21}$$

$$y_{n+1} = y_n + (y_n - y_{n-1}) \tag{22}$$

where $(x_{n-1}, y_{n-1})$ and $(x_n, y_n)$ are the two most recent projected positions of the feature point and $(x_{n+1}, y_{n+1})$ is the predicted next position.
With the above formulas, the actor's next route can be effectively predicted.
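Under the uniform-motion assumption of Eqs. (21) and (22), the prediction reduces to linear extrapolation, as in this small sketch (function name illustrative):

```python
# Linear extrapolation of the next stage position under uniform motion,
# following Eqs. (21)-(22).
def predict_next(p_prev, p_curr):
    """p_prev, p_curr: (x, y) stage coordinates at the two most recent frames."""
    return (2 * p_curr[0] - p_prev[0], 2 * p_curr[1] - p_prev[1])
```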
The second case is non-uniform motion, i.e., the actor moves irregularly and the speed changes at any time on the stage. In this case, the system relies on its real-time tracking technology and high-speed computation to output images within a very short time and thus maintain real-time performance.
First, the server receives images from multiple synchronized cameras, locates human bodies, and creates detection boxes using a human detection algorithm. For each human body, multiple key points and ORB (Oriented FAST and Rotated BRIEF) feature points are extracted from the image. Combining these features, the algorithm server calculates the 3D coordinates of the body's key points and corrects reprojection errors to improve accuracy. The performer's 2D stage coordinates are then computed. Finally, these coordinates are converted into DMX commands, which the console receives and uses to direct the lights to illuminate the performer accurately.
On top of this workflow, the system's camera setup applies flexible mechanisms to meet the challenges of the stage's complex environment. Fig. 2 shows the schematic diagram of two-view triangulation for stage light tracking. Pixel depth cannot be obtained from a single picture, so two or more cameras must be aimed at the same target; the system then locates the target accurately by multi-view triangulation.

Two-view triangulation for stage light tracking
When two camera devices $C_1$ and $C_2$ observe the same target point, let $x_1$ and $x_2$ be the normalized image coordinates of the point in the two views, $s_1$ and $s_2$ the corresponding depths, and $R$, $t$ the rotation and translation between the two cameras. The projection relation between the two views is

$$s_1 x_1 = s_2 R x_2 + t$$

According to the epipolar constraint, Eq. (25) is obtained by multiplying both the left and right sides by $x_1^{\wedge}$, the antisymmetric (cross-product) matrix of $x_1$, which eliminates the left-hand term since $x_1^{\wedge} x_1 = 0$:

$$s_2\, x_1^{\wedge} R x_2 + x_1^{\wedge} t = 0 \tag{25}$$

Finally, Eq. (25) is solved for the depth $s_2$, after which $s_1$ follows from the projection relation, giving the 3D position of the target.

When more than two cameras focus on one target, the target point is written in homogeneous coordinates as $X = (X, Y, Z, 1)^{\top}$, and for each camera $i$ the projection satisfies equation (27):

$$s_i \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix} = P_i X = \begin{pmatrix} p_{i1}^{\top} \\ p_{i2}^{\top} \\ p_{i3}^{\top} \end{pmatrix} X \tag{27}$$

where $(u_i, v_i)$ are the pixel coordinates of the target in camera $i$, $s_i$ is the depth, and $p_{i1}^{\top}, p_{i2}^{\top}, p_{i3}^{\top}$ are the rows of the projection matrix $P_i$.

Expanding equation (27) gives:

$$s_i u_i = p_{i1}^{\top} X, \qquad s_i v_i = p_{i2}^{\top} X, \qquad s_i = p_{i3}^{\top} X$$

This can be derived from the third line of the above equation:

$$s_i = p_{i3}^{\top} X$$

where $s_i$ is exactly the depth of the target in camera $i$. Substituting the result of the third line into the first two lines subsequently gives:

$$u_i\, p_{i3}^{\top} X - p_{i1}^{\top} X = 0, \qquad v_i\, p_{i3}^{\top} X - p_{i2}^{\top} X = 0$$

The above two equations are the results of one observation. However, in actual performances multiple observations are required, so after assuming that $n$ observations are collected, each observation contributes two such equations.

Let $A$ denote the $2n \times 4$ coefficient matrix obtained by stacking these equations, so that the system takes the form $AX = 0$.

Similarly, in order to account for errors due to noise, the system solves this overdetermined system by least squares to figure out the target position.
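The multi-view least-squares solution described above corresponds to the standard direct linear transform (DLT); the following NumPy sketch stacks two rows per camera and solves the homogeneous system by SVD. It is a generic illustration under the stated assumptions, not the system's implementation.

```python
# Sketch of multi-view triangulation by linear least squares (DLT), as in the
# derivation above: each camera contributes two rows to the system A X = 0.
import numpy as np

def triangulate(points_2d, proj_mats):
    """points_2d: list of (u, v) pixel coords; proj_mats: list of 3x4 matrices."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])  # u * p3^T X - p1^T X = 0
        rows.append(v * P[2] - P[1])  # v * p3^T X - p2^T X = 0
    A = np.stack(rows)
    # least-squares solution: right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize to (X, Y, Z)
```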
Color theory provides a set of guiding principles on how color combinations influence the viewing experience, such as color harmony, contrast, and color symbolism. In stage design, color combinations can guide the direction of the audience's emotions: warm tones are often used to create a warm, exciting, or tense atmosphere, while cool tones express calm, sadness, or contemplation.
Lighting design is a core component of choreography design. Lighting not only provides the necessary illumination but is also a key tool for constructing the atmosphere of a scene, guiding the audience's emotions, and reinforcing visual impact. The core of lighting design lies in mastering the quality, direction, color, and intensity of light, and in how these elements interact with the stage space and the actors. Lighting design must be closely integrated with the content of the script and the director's visual intent, supporting the development of the plot and deepening emotional expression through changes in lighting.
Spatial layout concerns how the stage space is utilized, including the depth, width, and height of the stage and the configuration of the virtual and the real within the scene. Composition concerns how the visual elements of the stage are distributed in space, including the positions of the actors, the arrangement of props, and the design of the backdrop. Effective use of space can guide the audience's eyes and create a visual focus, concentrating attention on the key elements or actions on stage.
Eye movement. In previous studies, the presentation time of choreography stimulus materials was usually controlled between 7 s and 20 s depending on the number and type of materials. In this experiment, a Tobii Pro Glasses 2.0 eye-tracking device was used to collect data. Samples were presented on a screen with a resolution of 1,920 × 1,080; each choreography sample was shown for 10 s with a 2 s gap between photographs, all photographs were shown in random order, and only one subject was allowed in the experimental site at a time. The sampling rate of the eye movement data exceeded 80%, meeting the collection standard, so the data are reliable. After the eye movement experiment, subjective data were collected for the 20 samples.

Table 1 shows the analysis of sample gaze time. By dividing eye movement areas of interest (AOIs), this paper counts subjects' gaze time in each region of the samples and quantifies the eye movement indices to analyze the eye-catchingness, attention, and attractiveness of the spatial elements of each choreography design. After analyzing all samples, the spatial elements were divided into six categories: spatial interface elements, costume elements, prop elements, choreographic composition, lighting elements, and set elements, and the average gaze time of each AOI was computed and analyzed. To reduce experimental error, the region boundaries followed the original element boundaries as far as possible and did not overlap.

Among the six categories, choreographic composition received significantly weaker attention than the other five, indicating that composition elements are less eye-catching and less attractive but relatively recognizable and comprehensible, with a total gaze duration of 3.95 s. Prop elements are widely present in the choreography scenes and account for a large proportion of some samples, but in subjects' observation they are less eye-catching than the other major elements; their recognizability is high while their attractiveness is slightly weak. The spatial elements that received the most attention were the spatial interface elements and set elements, which were more visible and attractive than the other four categories, with total gaze durations of 104.624 s and 63.552 s; however, because these elements carry a large amount of information, subjects' comprehensibility and recognizability of them were lower. As the primary component of spatial form, spatial interface elements influence the perception of space; set elements, as secondary elements, are more eye-catching in the choreography but weaker in attractiveness.
Sample fixation time analysis
| Region | Count | Time before first gaze: mean (s) | Time before first gaze: total (s) | First fixation duration: mean (s) | First fixation duration: total (s) | Total fixation duration: mean (s) | Total fixation duration: total (s) |
|---|---|---|---|---|---|---|---|
| Spatial interface elements | 28 | 1.846 | 51.688 | 0.745 | 20.86 | 1.615 | 45.22 |
| Costume elements | 26 | 1.036 | 26.936 | 1.788 | 46.488 | 4.024 | 104.624 |
| Prop elements | 25 | 2.563 | 64.075 | 0.469 | 11.725 | 1.498 | 37.45 |
| Choreographic composition | 10 | 3.615 | 36.15 | 0.318 | 3.18 | 0.395 | 3.95 |
| Lighting elements | 10 | 2.036 | 20.36 | 0.428 | 4.28 | 0.495 | 4.95 |
| Set elements | 24 | 1.129 | 27.096 | 0.915 | 21.96 | 2.648 | 63.552 |
| All spatial elements | 1214 | 1.978 | 2401.292 | 1.348 | 1636.472 | 3.379 | 4102.106 |
The normality of the overall eye movement data of the scene and of each AOI was tested with SPSS 26.0. Table 2 shows the S-W (Shapiro-Wilk) test results for the overall eye movement indices: for each index the absolute kurtosis is less than 10 and the absolute skewness is less than 3, with kurtosis values of 0.02796, 1.03498, and 0.63486 and skewness values of 0.31486, 0.57964, and -0.39486, respectively. Figure 3 shows the histograms of the overall normal distribution of the eye movement indices; the distributions satisfy normality, and the mean values of the three indices are 1.97909, 1.3409, and 3.38583, respectively.
The results of the S-W test of the overall eye movement index
| Variable name | Median | Mean value | Standard deviation | Skewness | Kurtosis | S-W test |
|---|---|---|---|---|---|---|
| Time before first gaze | 2.00184 | 1.97909 | 0.56514 | 0.31486 | 0.02796 | 0.94862 |
| First fixation duration | 1.34854 | 1.3409 | 0.22196 | 0.57964 | 1.03498 | 0.98348 |
| Total fixation duration | 3.39706 | 3.38583 | 0.61046 | -0.39486 | 0.63486 | 0.98764 |

Histograms of the overall normal distribution of the eye movement indices
Heart rate variation. Because human physiological signals are complex and measuring instruments have limited precision, the values fluctuate considerably; local analysis of small stretches of data is of little use, and such signals are suitable only for analyzing the general trend of a large amount of data. In this paper, Euclidean distances are computed between the physiological signals of subjects in different color environments of the simulated choreography scene to analyze the differences in human physiological state across color environments. Specifically, Euclidean distances were calculated between 40 groups of physiological signals under simulated test environments of light blue, white, warm white, orange, red, and green, and 40 groups of physiological signals under natural conditions of the choreographic environment, as shown in Eq. (33):

$$d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \tag{33}$$

where $x_i$ and $y_i$ are the $i$th samples of the two physiological signal sequences and $n$ is the sequence length.
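A one-function NumPy sketch of the distance in Eq. (33), assuming two equally long heart rate sequences:

```python
# Euclidean distance of Eq. (33) between two equally long heart-rate records.
import numpy as np

def signal_distance(hr_a, hr_b):
    hr_a, hr_b = np.asarray(hr_a, float), np.asarray(hr_b, float)
    return float(np.sqrt(np.sum((hr_a - hr_b) ** 2)))
```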
Heart rate is the number of heart beats per minute and is the most common measure of cardiac activity. A normal adult's resting heart rate is about 70 beats per minute, while an athlete's may be about 50 beats per minute. In a choreographic environment, it is desirable for the audience to maintain smooth heart rate activity.
In the heart rate measurement experiment conducted while watching stage performances in the simulated test environment, a T-Sens heart rate sensor with a sampling frequency of 16 Hz was used; data were read from the SD card of a T-Log wireless data logger and processed with CAPTIV software to obtain 10-minute heart rate records for 40 subjects. Because the sample size is large, only four typical convergent acquisition results are shown. Fig. 4 shows the heart rate signals of four subjects, with panels (a)-(d) corresponding to subjects 1-4; the mean heart rates of the four subjects over 10 minutes range from 70 to 86 beats/min.

Heart rate signal acquisition results of four subjects
Color. The Euclidean distances between heart rate signals under the six simulated choreography design colors and those under the natural environment were compared; Figure 5 shows the results. Warm white has the smallest Euclidean distance among the six colors, ranging from 800 to 1,500, so the heart rate signals in the warm-white simulated environment are closest to those in the natural indoor environment.

Light and shadow and lighting. Figure 6 shows the tracking algorithm solution. A 5 m × 8 m simulated stage positioning environment was built on a simulation platform based on the KNN and RANSAC algorithms to realize light tracking of stage characters. After rotation and translation of the luminaire coordinate system, the lamp is located at L(5,4) in the positioning-point coordinate system, and positioning points in the environment are selected for simulated illumination. First, the known positioning points are solved by the stage positioning algorithm; then the stage light tracking algorithm performs light tracking calculations on the positioning points and the positioning results. With 100 positioning points tested in the simulated stage environment, the correct rate of the luminaires' rotating tracking direction reaches 97%, and the rotation error lies within [-18.72%, 22.3%], which meets the performance requirements for stage light tracking.

Spatial scene composition. By measuring in the field the aspect ratio D/H of the external space corresponding to the stage collection points, the D/H of the four choreography scenes is assigned by interval grading, identical value intervals are counted as one class, and D/H is divided into four segments, 0~1, 1~2, 2~4, and >4, with reference to the values of different spatial types. Figure 7 shows the D/H ratio of the choreographic space characteristics. After statistical sorting, most space aspect ratios in the four stage scenes are concentrated in the interval 0.5~1.5. The average D/H of the external space of scene 1 is 0.842, slightly less than 1; the scale ratio of scene 2 is uniform, all within 1 ± 0.38; in scene 3, most external-space D/H values are close to or slightly greater than 1, with a few between 0.4 and 0.8; the external space of scene 4 is very wide and open, with D/H values close to or greater than 1 and mostly greater than 2, an average D/H of about 3.093, and an appropriate proportional scale.

Test result

Tracking algorithm solution

D/H ratio of the choreographic space characteristics
The visual entropy of the designed choreography is calculated with a MATLAB program: read in the choreography image, convert it to grayscale, obtain the grayscale histogram, compute the probability of each gray level appearing in the image, and calculate the entropy by the definition of entropy. Figure 8 shows the visual entropy statistics of the choreography scenes. The visual entropy of each scene lies between 7 and 8.5, mainly concentrated between 7.5 and 8, with a few values below 7.5 or above 8, and the differences appear at the second decimal place. On the one hand, this indicates that the collected choreography image samples are very uniform in type and carry similar amounts of information, which helps concentrate the focus of the research; on the other hand, it shows that there are differences in the visual information carried by the visual elements of the external choreography space, reflected in details of color composition, material texture, and so on. The visual entropy data were standardized and segmented into three bands: low visual entropy (7.30~7.50), medium visual entropy (7.50~7.70), and high visual entropy (7.70~7.90).
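Although the paper's computation was done in MATLAB, the same grayscale-histogram entropy can be sketched in a few lines of Python (file path illustrative):

```python
# Sketch of the visual-entropy computation described above: grayscale the
# image, build the grey-level histogram, and apply the entropy definition.
import numpy as np
from PIL import Image

def visual_entropy(path: str) -> float:
    grey = np.asarray(Image.open(path).convert("L"))     # greyscale image
    hist, _ = np.histogram(grey, bins=256, range=(0, 256))
    p = hist / hist.sum()                                # grey-level probabilities
    p = p[p > 0]                                         # ignore empty bins
    return float(-(p * np.log2(p)).sum())                # H = -sum p * log2(p)
```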

Visual entropy statistics of the choreography scenes
Table 3 shows the statistics of the audience's overall perception. The analysis covers two aspects: the communication mode of the stage performance and the overall feeling it produced. The survey results indicate that evaluations of the communication mode were mostly "very good" and "good", accounting for 62.5% and 17.5% of respondents, respectively. Meanwhile, 65% of the audience felt that the overall impression of the stage performance was very good, and no audience member rated it very bad.
Statistics of the audience's overall perception
| Audience perception | Group | Number of people | Proportion |
|---|---|---|---|
| Stage performance communication mode | Very good | 25 | 0.625 |
| | Good | 7 | 0.175 |
| | Average | 3 | 0.075 |
| | Poor | 3 | 0.075 |
| | Very bad | 2 | 0.05 |
| Stage performance overall feeling | Very good | 26 | 0.65 |
| | Good | 9 | 0.225 |
| | Average | 3 | 0.075 |
| | Poor | 2 | 0.05 |
| | Very bad | 0 | 0 |
The audience’s satisfaction with the stage performance elements of the survey is divided into four aspects of the overall stage modeling, stage lighting effects, stage performance and stage scheduling form of generalization, satisfaction scores in accordance with the level of scoring, very satisfied with 5 points, more satisfied with 4 points, and so on. Table 4 for the audience on the performance of the elements of satisfaction analysis, the audience stage lighting effect satisfaction is the highest (score of 178), the rest of the satisfaction in descending order: the form of stage scheduling (score of 171), the overall stage modeling and stage performance scores are equal, are 170 points.
Audience satisfaction with the stage performance elements
| Performance element | Satisfaction | Number of people | Score | Proportion |
|---|---|---|---|---|
| Overall stage modeling | Very satisfied | 22 | 110 | 0.55 |
| | Relatively satisfied | 10 | 40 | 0.25 |
| | Normal | 5 | 15 | 0.125 |
| | Not satisfied | 2 | 4 | 0.05 |
| | Very dissatisfied | 1 | 1 | 0.025 |
| Stage lighting effect | Very satisfied | 26 | 130 | 0.65 |
| | Relatively satisfied | 9 | 36 | 0.225 |
| | Normal | 3 | 9 | 0.075 |
| | Not satisfied | 1 | 2 | 0.025 |
| | Very dissatisfied | 1 | 1 | 0.025 |
| Stage performance | Very satisfied | 23 | 115 | 0.575 |
| | Relatively satisfied | 9 | 36 | 0.225 |
| | Normal | 4 | 12 | 0.1 |
| | Not satisfied | 3 | 6 | 0.075 |
| | Very dissatisfied | 1 | 1 | 0.025 |
| Stage scheduling form | Very satisfied | 24 | 120 | 0.6 |
| | Relatively satisfied | 8 | 32 | 0.2 |
| | Normal | 5 | 15 | 0.125 |
| | Not satisfied | 1 | 2 | 0.025 |
| | Very dissatisfied | 2 | 2 | 0.05 |
Table 5 shows audience satisfaction with the performance music. Ranked from high to low: the coordination between music and the stage picture, the sense of rhythm between music and lighting, the combination of music and performance style, and the music and sound production effect, with total satisfaction scores of 184, 179, 176, and 174, respectively.
Audience satisfaction with the performance music

| Performance element | Satisfaction | Number of people | Score | Proportion |
|---|---|---|---|---|
| Music and sound production effect | Very satisfied | 23 | 115 | 0.575 |
| | Relatively satisfied | 11 | 44 | 0.275 |
| | Normal | 4 | 12 | 0.1 |
| | Not satisfied | 1 | 2 | 0.025 |
| | Very dissatisfied | 1 | 1 | 0.025 |
| Sense of rhythm between music and lighting | Very satisfied | 26 | 130 | 0.65 |
| | Relatively satisfied | 10 | 40 | 0.25 |
| | Normal | 2 | 6 | 0.05 |
| | Not satisfied | 1 | 2 | 0.025 |
| | Very dissatisfied | 1 | 1 | 0.025 |
| Coordination between music and stage picture | Very satisfied | 27 | 135 | 0.675 |
| | Relatively satisfied | 11 | 44 | 0.275 |
| | Normal | 1 | 3 | 0.025 |
| | Not satisfied | 1 | 2 | 0.025 |
| | Very dissatisfied | 0 | 0 | 0 |
| Combination of music and performance style | Very satisfied | 25 | 125 | 0.625 |
| | Relatively satisfied | 9 | 36 | 0.225 |
| | Normal | 4 | 12 | 0.1 |
| | Not satisfied | 1 | 2 | 0.025 |
| | Very dissatisfied | 1 | 1 | 0.025 |
Table 6 shows the other sensory experiences. A "perfume rain" segment was added to the stage performance, and satisfaction with this segment was surveyed. In descending order: the fit between the perfume rain and the stage atmosphere (satisfaction score 181) and the olfactory experience of the perfume rain (satisfaction score 176).
Other sensory experiences
| Performance element | Satisfaction | Number of people | Score | Proportion |
|---|---|---|---|---|
| Perfume rain olfactory experience | Very satisfied | 24 | 120 | 0.6 |
| | Relatively satisfied | 10 | 40 | 0.25 |
| | Normal | 4 | 12 | 0.1 |
| | Not satisfied | 2 | 4 | 0.05 |
| | Very dissatisfied | 0 | 0 | 0 |
| Fit between perfume rain and stage atmosphere | Very satisfied | 25 | 125 | 0.625 |
| | Relatively satisfied | 11 | 44 | 0.275 |
| | Normal | 4 | 12 | 0.1 |
| | Not satisfied | 0 | 0 | 0 |
| | Very dissatisfied | 0 | 0 | 0 |
In this paper, AIGC technology is used to input textual cues related to choreography and generate its tonality, and LoRA model parameters are fine-tuned to adjust the style. Based on SD scene generation combined with a generative adversarial network, the choreography scene generation model is improved; meanwhile, a deep neural network realizes stage lighting positioning. Simulation experiments and empirical investigation are combined to evaluate the effect of the choreography design. In the physiological tests, the spatial elements that received the most attention from subjects were the spatial interface elements and set elements, with total gaze durations of 104.624 s and 63.552 s, respectively, and the heart rates of the four subjects while enjoying the stage performances ranged from 70 to 86 beats/min. Among the simulated choreography design colors, warm white had the smallest Euclidean distance to the natural indoor environment, ranging from 800 to 1,500, so this color can serve as the main color in subsequent designs. In the audience satisfaction survey, overall perception was analyzed in terms of the communication mode and the overall feeling of the stage performance: evaluations of the communication mode were mostly "very good" and "good", accounting for 62.5% and 17.5% of respondents, respectively, and 65% of the audience felt that the overall impression of the performance was very good. Overall, the audience affirmed the stage effect presented by the design in this paper.
