Open Access

Research on three-dimensional modeling and motion capture technology for accordion playing hand posture

17 March 2025


Introduction

Accordion playing rests mainly on three groups of skills: playing posture, key touch, and bellows control. Posture embodies the varied physical forms of accordion playing, ranging from relatively static positions to full-body rhythmic movements, and covers both stationary poses and movement in motion. Key touch is a technique shared by all keyboard instruments: the player's skilled movement of any part of the arm transmits force to the fingertips and acts on the keys. In playing, changes in the contact state between the fingers and the keys (or button knobs) determine tone quality, evenness of articulation, and the execution of special effects; it can fairly be called the foundation of performance. Bellows technique, the most central skill in accordion playing, is the left hand's pushing and pulling of the bellows to generate the airflow that sets the internal reeds vibrating and sounding [1-2].

When playing the accordion, the appropriate height for the instrument places its upper end exactly level with the clavicle; smaller instruments may sit slightly below it. If the instrument rides too high, movement of the neck, head, and shoulders is constrained and the right hand has greater difficulty reaching the low register; if it rides too low, the left hand cannot fully extend the bellows and the right hand loses its natural position in the high register. The appropriate height is therefore a compromise between the two [3]. Even so, the low-register keys behind the accordion's back sit fairly high, so the right arm must bend sharply to reach them; when playing the lowest few notes the arm is bent almost to its limit, and as the bend increases the scapula inevitably rises to some degree. In this state, the muscle groups that flex the arm, such as the biceps brachii, the rhomboids, and the muscles around the scapula, are relatively tense compared with playing in the middle and high registers. The arm thus loses its natural position, and this loss inevitably hampers wrist and finger dexterity, affecting finger relaxation and ease of key touch [4].

If only individual notes or individual bars are played in this state, the difficulty is not yet prominent, but for pieces such as "Dance of the Wild Bees" the difficulty of playing becomes great. Keeping certain muscles tense while others relax during movement depends on the regulatory action of the nervous system and is closely related to the player's psychological quality and degree of technical skill. To overcome the difficulties described above, the following points deserve attention in practice. First, let the bent arm rise and fall with the rhythm of the piece, like "breathing" while singing: the relative tension of the arm muscles is not the same as rigidity, and the height of the arm is not mechanically fixed but oscillates elastically up and down with the rhythm of the music [5-6]. This slight swinging is very helpful in reducing arm-muscle fatigue. Second, it should be clear that relaxing the fingers begins with relaxing the wrist, which links the arm above to the fingers below. Only when the wrist is relaxed can finger relaxation be discussed, and this point should receive sufficient attention from the beginning of practice until it becomes a habit over time [7].

Finger motion tracking and capture during instrumental performance is an effective channel for understanding and analyzing the playing process, and related research has proposed different types of hand motion recognition and tracking methods with good results. Kuanishuly, T. B. introduces the concept of motion capture technology and its applicable scenarios, which commonly include score following during instrument performance, and focuses on analyzing the Leap Motion application, pointing out that it can assist composers and performers with fine finger-gesture tracking at high accuracy [8]. Jakubowski, K. et al. applied three kinds of motion capture techniques to a video collection of instrumental ensemble performances; a comparative analysis revealed significant differences between the three techniques in performance motion tracking, and they point out that, subject to targeted constraints, computer vision techniques can capture postural movements and quantify body motion in music videos [9]. Freire, S. et al. evaluated a system for capturing playing gestures built on inertial sensors and a six-string nylon guitar as the underlying architecture, introducing a compensated integration strategy to balance oscillations during playing; evaluation showed the advantages of portability and low cost [10]. Ancillao, A. et al. proposed an optoelectronic system with infrared reflective marker strategies for violin playing motion capture and tracking, and practical evaluation confirmed that the proposed strategies help players record the playing process, which can then be used for improvement and training adjustment [11]. Bayd, H. et al. devised an innovative application to explore the linkage between hand motions and musical beats, tested it on a music library using a dynamic time warping algorithm, and found that hand movements matched different musical rhythms [12].

The accordion, an instrument with a long history and a strongly artistic character, attracts many learners, and many scholars have therefore studied accordion teaching strategies and curriculum design, covering innovative teaching methods, digital teaching technology, and intelligent instruction. Haragova, P. discusses the difficulties in teaching the accordion, such as score memorization, knowledge of rhythm, control of the instrument, and coordinated control of the left and right hands, and introduces and analyzes a code-notation teaching method that supports beginners in overcoming these obstacles [13]. Ivkov, V. screens and analyzes interview data from formal and non-formal accordion pedagogues to tally the differences in accordion teaching modes and pedagogical assistive technologies in and out of school, providing an important reference for understanding and optimizing accordion teaching strategies [14]. Berlin, B. D. attempted to build a unified curriculum and objective standards for teaching keyboard instruments such as accordion, electronic keyboard, organ, and piano, and a corresponding teaching practice confirmed the feasibility of the idea [15]. Lu, H. attempted to reform and optimize accordion teaching management based on big data technology in order to enhance students' interest and enthusiasm in accordion learning, which in turn improves teaching effectiveness and realizes personalized accordion teaching [16]. Lu, H. also attempted intelligent recognition and computer-assisted instruction of accordion art features from a multivariate data-aware perspective, making a positive contribution to the development of accordion art [17].

In this paper, personalized 3D modeling of hand posture for accordion playing is accomplished from multiple views. Each image is fed into a neural network for joint point detection, and the correspondence of the same joint points across views is used to improve the accuracy of the 2D hand keypoint estimates. Using a method based on thermal diffusion equilibrium, the skin weights of the joint points with respect to the mesh vertices are obtained as the temperature distribution at thermal conduction equilibrium according to the heat equilibrium equations, yielding continuous and smooth bone skin weights. The pose parameters deform the 3D skeleton of the template model, and optimization minimizes the error between the projected 3D joint points and the 2D joint points. A number of personalized hand models with their corresponding pose parameters are then used as inputs, and a personalized neutral 3D hand model for accordion playing is obtained by solving a system of linear equations. The attitude and position of the moving hand are computed from data collected by motion capture sensors worn on the hand. Based on the forward kinematic equations of the human hand, a hand joint motion projection algorithm suited to hand posture capture is designed, and the computed hand posture information drives the hand model to capture real hand movements in real time. A boosting model integrates three algorithms, namely support vector machine, K-nearest neighbor, and feedforward neural network, to classify the hand gestures captured during accordion playing. Different experiments investigate the effectiveness of the 3D hand construction and motion capture algorithms in this paper, and the good performance of the accordion playing hand gesture classification model is analyzed against the actual results obtained.

3D modeling of hand posture
Personalized hand 3D modeling based on multiple views

Personalized hand models play an important role in hand-object interaction scenarios. In accordion playing, for example, personalized hand modeling can better show the hand gestures used in performance and thus accurately express accordion playing skills. For scenarios requiring personalized interaction, such as virtual reality, augmented reality, and intelligent interaction, a generic hand model can lower recognition accuracy and degrade the interaction experience, whereas a personalized model can be customized to the individual's hand characteristics, improving interaction accuracy and efficiency and enhancing the user experience.

Hand 2D joint points and mask acquisition

In this paper, a neural network is used to estimate the coordinates of two-dimensional joint points of the hand from an image, and a threshold segmentation method is used to obtain a hand mask.

Each image is fed into the neural network in turn for joint point detection. Experiments show that the network obtains accurate estimates for unoccluded joint points, but may fail for views in which joints are heavily occluded. To address this, the correspondence of the same joint points across multiple views is used to improve the accuracy of the 2D hand keypoint estimates. First, the 3D joint points are obtained by optimizing over the calibrated camera parameters together with the initially estimated 2D joint points, exploiting the cross-product property of the camera projection formula. For each 3D joint point, the optimization equation is:

$$x = \arg\min_x \sum_{v \in V} \left\| [X_v]_\times K_v (R_v x + t_v) \right\|$$

where $x$ is the 3D joint point, $V$ is the set of cameras, $K_v$ is the intrinsic matrix of the $v$-th camera and $[R_v, t_v]$ its extrinsics, $X_v$ is the 2D joint point in the $v$-th camera, and $[X_v]_\times$ is the skew-symmetric matrix of the homogeneous coordinates of $X_v$.

After solving this optimization, the 3D coordinates of each joint are obtained. The 3D joints are then reprojected into the multiple views to update the 2D joint positions, which in turn refine the 3D joint positions; iterating this projection and refinement yields progressively more accurate results.
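As an illustration, the least-squares triangulation behind the optimization above can be sketched as follows. The constraint $[X_v]_\times K_v (R_v x + t_v) = 0$ is linear in $x$, so the rows from all views stack into one system; the function names and the toy two-camera setup below are our own, not the paper's implementation:

```python
import numpy as np

def skew(pt2d):
    """Cross-product (skew-symmetric) matrix of a 2D point in homogeneous form."""
    x, y = pt2d
    return np.array([[0.0, -1.0, y],
                     [1.0, 0.0, -x],
                     [-y, x, 0.0]])

def triangulate(pts2d, Ks, Rs, ts):
    """Minimize sum_v ||[X_v]_x K_v (R_v x + t_v)|| over the 3D point x.
    Each view contributes linear rows S K R x = -S K t."""
    A, b = [], []
    for X, K, R, t in zip(pts2d, Ks, Rs, ts):
        S = skew(X)
        A.append(S @ K @ R)
        b.append(-S @ K @ t)
    x, *_ = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)
    return x

# Toy two-camera setup (made-up intrinsics and extrinsics for illustration).
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1.0]])
x_true = np.array([0.1, -0.2, 2.0])
Rs = [np.eye(3),
      np.array([[0.0, 0, 1], [0, 1, 0], [-1, 0, 0]])]  # 90 deg about y
ts = [np.zeros(3), np.array([-2.0, 0.0, 0.5])]
pts2d = []
for R, t in zip(Rs, ts):
    p = K @ (R @ x_true + t)        # project the ground-truth point
    pts2d.append(p[:2] / p[2])

x_est = triangulate(pts2d, [K, K], Rs, ts)
```

With exact 2D observations the stacked system has an exact solution, so the least-squares estimate recovers the ground-truth point; with noisy detections the same solve gives the minimizer of the residual.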

Rigid deformation based on 2D joint points

With the 2D joint points of the hand obtained in the previous subsection, this subsection rigidly deforms the predefined hand template mesh to fit the 2D joint points in each view using linear blend skinning (LBS), so that the deformation result contains only the change in the hand's pose [18]. LBS is widely used in the animation and gaming industries; its core idea is to control the deformation of a model's surface (the skin) with skeleton joint points, and it is the standard technique in skeleton-driven skinning animation. Its general formula is:

$$v_i' = \sum_{k=1}^{K} \omega_{i,k} T_k v_i$$

where $v_i'$ is the vertex position after LBS deformation, $i$ is the vertex index, $K$ is the number of joints, $\omega_{i,k}$ is the skin weight indicating the degree of influence of the $k$-th joint on the $i$-th vertex, subject to $\sum_k \omega_{i,k} = 1$, and $T_k$ is the transformation matrix of the $k$-th joint. The computation of the skin weights and transformation matrices is introduced below.
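The LBS formula above can be sketched in a few lines of numpy; the toy vertices, weights, and 4x4 homogeneous joint transforms below are our own illustrative choices:

```python
import numpy as np

def lbs(vertices, weights, transforms):
    """v_i' = sum_k w_{i,k} T_k v_i with 4x4 homogeneous joint transforms T_k;
    each row of `weights` sums to 1 across the joints."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])  # (n, 4)
    out = np.zeros_like(vertices)
    for k, T in enumerate(transforms):
        out += weights[:, k:k + 1] * (homo @ T.T)[:, :3]       # blend per joint
    return out

# Two joints: identity and a unit translation along x.
T_id = np.eye(4)
T_shift = np.eye(4)
T_shift[0, 3] = 1.0
verts = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0]])
w = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
deformed = lbs(verts, w, [T_id, T_shift])
```

A vertex weighted 50/50 between the two joints moves half the translation, which is exactly the smooth blending behavior LBS is used for.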

There are many methods for computing skin weights; this subsection uses the classical method based on thermal diffusion equilibrium [19], which treats the skin weights of the joints with respect to the mesh vertices as the temperature distribution at thermal conduction equilibrium, governed by the heat equilibrium equations, and thereby obtains continuous and smooth skeleton skin weights. Fig. 1 shows a schematic of the 3D skeleton and skin weights of the 3D template mesh model, where (a) is the skeleton structure with joint point numbers and (b) shows the skin weights of joint points 2 and 5 relative to the surrounding mesh vertices. For a given template model, once the skin weights are determined they remain unchanged in subsequent LBS deformations, and the only quantity that still needs to be solved is the transformation matrix.

Figure 1.

Three-dimensional skeleton and skin weights of the template mesh model

The 3D skeleton of the hand is treated as a kinematic chain, and the positions of all skeleton joint points can be computed by the kinematic chain rule. The joint points rotate around the skeleton, and the transformation matrix $T_k$ involves the rotation of the joint points and is a rigid transformation. The rotation matrix [20] is a $3 \times 3$ orthogonal matrix; the constraints imposed by orthogonality make it difficult to optimize the rotation matrix directly, so it is not solved in that form. A 3D rotation can instead be expressed as a rotation by an angle $\theta$ around a rotation axis $n$, i.e., in axis-angle representation. If a 3D vector $r$ represents the rotation, with its unit vector as the rotation axis and its norm as the counterclockwise rotation angle about that axis, i.e., $n = r/\|r\|$, $\theta = \|r\|$, then the rotation matrix $R$ is obtained by the Rodrigues transform:

$$R(n, \theta) = \cos\theta\, I + (1 - \cos\theta)\, n n^T + \sin\theta\, [n]_\times$$
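The Rodrigues formula above translates directly into code; this is a minimal sketch with an illustrative quarter turn:

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector r -> rotation matrix via the Rodrigues formula:
    R = cos(t) I + (1 - cos(t)) n n^T + sin(t) [n]_x, with t = ||r||, n = r/t."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)                  # zero vector: no rotation
    n = r / theta
    n_x = np.array([[0.0, -n[2], n[1]],
                    [n[2], 0.0, -n[0]],
                    [-n[1], n[0], 0.0]])  # skew-symmetric [n]_x
    return (np.cos(theta) * np.eye(3)
            + (1 - np.cos(theta)) * np.outer(n, n)
            + np.sin(theta) * n_x)

# A quarter turn about z maps the x axis onto the y axis.
R = rodrigues(np.array([0.0, 0.0, np.pi / 2]))
```

Because the three axis-angle parameters are unconstrained, they can be optimized freely while the resulting matrix is always a valid rotation, which is why this parameterization is preferred over optimizing the nine matrix entries.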

When the skeleton structure of the hand template mesh model was defined, the rotation axes of the joints were also determined; every joint has at least one rotation axis except the root joint and the fingertip joints, which have none. The template model has 21 joint points and 20 rotation angles in total. In addition to the rotation angles, seven global parameters need to be optimized, namely a global scale, a global rotation vector, and a global translation vector, giving 27 parameters in total as the hand pose parameters $\Theta = [s, r_1, r_2, r_3, t_1, t_2, t_3, \theta_1, \theta_2, \ldots, \theta_{20}]$.

The 3D skeleton of the template model is deformed by the pose parameters $\Theta$, and the optimization minimizes the error between the projected 3D joint points and the 2D joint points. To construct the optimization criterion, Eq. (1) is modified by replacing the 3D joint point $x$ with a function of $\Theta$:

$$\Theta = \arg\min_\Theta \sum_{v \in V} \left\| [X_v]_\times K_v \left( R_v x(\Theta) + t_v \right) \right\|$$

Each child node estimates only its rotation angle parameter, but computing the transformation matrix requires tracing back to the root node, and the root node itself carries a global transformation that includes translation. The rigid transformation of a child node therefore includes translation as well as rotation, and the general LBS formula is finally modified to:

$$v_i' = \sum_k \omega(v_i, J_k) \left[ R_k (v_i - J_k) + J_k + t_k \right]$$

where $J_k$ denotes the position of the $k$-th joint point after the rigid transformation, $T_k = \{R_k, t_k\}$ is the rigid transformation of the $k$-th joint, and $\omega(v_i, J_k)$ is the corresponding skin weight.

Personalized hand model construction

As mentioned earlier, $T_k$ is a rigid transformation determined entirely by the pose parameters $\Theta$; define $T_k = [R_k, t_k]$ with $R_k$ the rotation and $t_k$ the translation. The rigid transformation of the $i$-th vertex $v_i$ can then be defined through $R_i(\Theta)$ and $T_i(\Theta)$:

$$R_i(\Theta) = \sum_{k=1}^{K} \omega_{ki} R_k, \qquad T_i(\Theta) = \sum_{k=1}^{K} \omega_{ki} t_k$$

Assuming the hand model has $n$ vertices in total, a block-diagonal matrix $R(\Theta)$ of dimension $3n \times 3n$ can be constructed:

$$R(\Theta) = \begin{bmatrix} R_1(\Theta) & & & \\ & R_2(\Theta) & & \\ & & \ddots & \\ & & & R_n(\Theta) \end{bmatrix}$$

All the $T_i(\Theta)$ are stacked into a column vector $T(\Theta) \in \mathbb{R}^{3n}$, and the LBS deformation formula is rewritten as:

$$v' = R(\Theta)\, v + T(\Theta)$$

Here $v$ and $v'$ are column vectors of the coordinates of all model vertices before and after deformation. The personalized template hand model is then obtained from:

$$v = \arg\min_v \sum_j \left\| R(\Theta_j)\, v + T(\Theta_j) - v_j' \right\|$$

where $\Theta_j$ is the pose parameter of the $j$-th personalized hand model and $v_j'$ is the column vector of that model's vertex coordinates.

To further ensure the smoothness of the personalized neutral hand model, a Laplacian smoothing term is added to the loss above, and Eq. (10) is updated to:

$$v = \arg\min_v \sum_j \left\| R(\Theta_j)\, v + T(\Theta_j) - v_j' \right\| + \left\| L v \right\|$$

All $R(\Theta_j)$ and $L$ are sparse matrices, and the corresponding normal equations are solved with a sparse linear solver:

$$\left( \sum_j R(\Theta_j)^T R(\Theta_j) + L^T L \right) v = \sum_j R(\Theta_j)^T \left( v_j' - T(\Theta_j) \right)$$
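Assembling and solving these normal equations with a sparse solver can be sketched as follows. This is a toy stand-in, not the paper's setup: a 4-vertex chain mesh supplies the Laplacian, and identity rotations with pure translations play the role of the per-pose block matrices:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_neutral(R_list, T_list, v_obs_list, L):
    """Solve (sum_j R_j^T R_j + L^T L) v = sum_j R_j^T (v_j' - T_j)."""
    A = (L.T @ L).tocsc()
    rhs = np.zeros(L.shape[1])
    for R, T, v_obs in zip(R_list, T_list, v_obs_list):
        A = A + (R.T @ R).tocsc()
        rhs += R.T @ (v_obs - T)
    return spla.spsolve(A, rhs)

n = 4  # vertices in a toy chain mesh
Lg = sp.csr_matrix(np.array([[1., -1, 0, 0], [-1, 2, -1, 0],
                             [0, -1, 2, -1], [0, 0, -1, 1]]))
L = sp.kron(Lg, sp.identity(3)).tocsr()      # 3n x 3n vertex Laplacian
v_true = np.tile([0.1, 0.2, 0.3], n)         # a Laplacian-smooth "neutral" model
R_list = [sp.identity(3 * n, format="csr")] * 2
T_list = [np.full(3 * n, 0.5), np.full(3 * n, -0.2)]
v_obs_list = [v_true + T for T in T_list]    # observations in two poses
v = solve_neutral(R_list, T_list, v_obs_list, L)
```

With identity rotations the right-hand side reduces to the pose count times the true model, and since the constant model lies in the Laplacian's null space, the sparse solve recovers it exactly; with real block-rotation matrices the same assembly applies unchanged.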

Experimental results and analysis

The proposed method is evaluated on two challenging hand depth sequences. The first (Sequence 1) is the hand depth sequence provided with Guo's method, containing 450 frames of RGB images recorded with an Intel IVCam camera and a deformable hand shape template of about 8000 vertices. The second (Sequence 2) contains 450 frames of RGB images extracted from the NYU hand pose dataset. Both sequences contain fast hand movements with severe occlusion; more information is given in Table 1. After roughly aligning the hand shape template with the first depth frame, the method ran at 30 frames per second on these sequences, implemented in C++ on a computer with a 3.20 GHz 4-core CPU and 8 GB of RAM.

Table 1. Statistics of the two test sequences in the experiments

Sequence | Frame number | Anchor points | Vertex number | Occlusion | Capture device
1        | 450          | 13            | 40~68k        | Yes       | IVCam
2        | 450          | 16            | 8~12k         | Yes       | Kinect

To analyze the experimental results quantitatively, accurate annotation data were generated by feeding these sequences into a Vicon optical motion capture system and manually aligning the system marker points with the corresponding vertices on the hand template. Fig. 2 compares the error distributions of this paper's method and two other methods on an unoccluded input: (a) shows the input depth frame, and (b)~(d) show the error distributions of this paper's method, Method A, and Method B, respectively, where the Euclidean distance between each vertex and its ground-truth label encodes the vertex color. As Fig. 2 shows, the modeling error distribution of this paper's method is stable and free of outliers in the unoccluded case.

Figure 2.

Comparison of error distributions on sequence 1

Fig. 3 shows the comparative error distributions of this paper's method, Method A, and Method B on another input sequence with occlusion: (a) is the input depth frame, and (b)~(d) are the error distributions of this paper's method, Method A, and Method B, respectively. Fig. 3 shows that when the scanned object is partially occluded, both Method A and Method B produce outliers at the web between thumb and index finger, while this paper's method still obtains accurate modeling results.

Figure 3.

Comparison of error distribution on a case with occlusion

Fig. 4 quantitatively compares the single-frame average error of this paper's method, Method A, and Method B on sequence 1. The average error of a single frame is defined as the mean Euclidean distance over all vertex pairs between each modeled mesh and the ground-truth labels. The curves in Fig. 4 represent the average error of this paper's method, Method A, and Method B; compared with the other two methods, this paper's method attains a lower average error, i.e., it is more effective for 3D modeling of the accordion playing hand posture.

Figure 4.

Comparison of per-frame mean numerical errors on sequence 1

Hand posture motion capture
Algorithm for hand pose solution

The core of capturing hand motion information is the interpretation of attitude and position: position information is usually computed from the data collected by the motion capture sensors worn on the hand.

Velocity information solving

According to the attitude solution method, the current attitude $(\gamma, \theta, \psi)$ of the moving carrier, i.e., the attitude matrix between the carrier and the reference coordinate system, can be computed conveniently. The specific force sensed by the acceleration sensor is then transformed from the carrier's own coordinate system into the reference navigation coordinate system:

$$\begin{bmatrix} f_x^t \\ f_y^t \\ f_z^t \end{bmatrix} = C_b^t \begin{bmatrix} f_x^b \\ f_y^b \\ f_z^b \end{bmatrix}$$

In Eq. (13), $C_b^t = [C_t^b]^T$ and $[f_x^b, f_y^b, f_z^b]$ is the accelerometer measurement. The carrier velocity is then obtained from the specific force equation:

$$\dot{\bar{V}}_e^t = \bar{f} - \left( 2\bar{\omega}_{ie} + \bar{\omega}_{et} \right) \times \bar{V}_e^t + \bar{g}$$

Writing Eq. (14) in component form in the reference coordinate system gives:

$$\begin{bmatrix} \dot{V}_{ex}^t \\ \dot{V}_{ey}^t \\ \dot{V}_{ez}^t \end{bmatrix} = \begin{bmatrix} f_x^t \\ f_y^t \\ f_z^t \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ -g \end{bmatrix} - \begin{bmatrix} 0 & -(2\omega_{iez}^t + \omega_{etz}^t) & 2\omega_{iey}^t + \omega_{ety}^t \\ 2\omega_{iez}^t + \omega_{etz}^t & 0 & -(2\omega_{iex}^t + \omega_{etx}^t) \\ -(2\omega_{iey}^t + \omega_{ety}^t) & 2\omega_{iex}^t + \omega_{etx}^t & 0 \end{bmatrix} \begin{bmatrix} V_{ex}^t \\ V_{ey}^t \\ V_{ez}^t \end{bmatrix}$$

Given the velocity of the carrier relative to the reference coordinates, let its component in the horizontal plane be $V$, expressed by Eq. (16):

$$V = \sqrt{V_x^2 + V_y^2}$$

The carrier's own geographic coordinate system and the reference coordinate system are in relative motion and constrain each other. First, the Earth rotation rate of the reference frame expressed in the natural coordinate system is:

$$\begin{bmatrix} \omega_{iex}^t \\ \omega_{iey}^t \\ \omega_{iez}^t \end{bmatrix} = C_e^t \begin{bmatrix} 0 \\ 0 \\ \omega_{ie} \end{bmatrix} = \begin{bmatrix} \omega_{ie}\cos\varphi \\ 0 \\ \omega_{ie}\sin\varphi \end{bmatrix}$$

Second, the angular velocity of the carrier's motion reference frame relative to the Earth frame, expressed in the natural coordinate system, is:

$$\begin{bmatrix} \omega_{etx}^t \\ \omega_{ety}^t \\ \omega_{etz}^t \end{bmatrix} = \begin{bmatrix} -\dfrac{V_{ety}^t}{R_y^t} \\[6pt] \dfrac{V_{etx}^t}{R_x^t} \\[6pt] \dfrac{V_{ety}^t}{R_y^t}\tan\varphi \end{bmatrix}$$

Substituting Eqs. (17) and (18) into Eq. (15) gives:

$$\begin{cases} \dot{V}_{etx}^t = f_x + \left( 2\omega_{ie}\sin\varphi + \dfrac{V_{ety}^t}{R_y^t}\tan\varphi \right) V_{ety}^t - \dfrac{V_{etx}^t}{R_x^t} V_{etz}^t \\[8pt] \dot{V}_{ety}^t = f_y - \left( 2\omega_{ie}\sin\varphi + \dfrac{V_{ety}^t}{R_y^t}\tan\varphi \right) V_{etx}^t + \left( 2\omega_{ie}\cos\varphi + \dfrac{V_{ety}^t}{R_y^t} \right) V_{etz}^t \\[8pt] \dot{V}_{etz}^t = f_z + \dfrac{V_{etx}^t}{R_x^t} V_{etx}^t - \left( 2\omega_{ie}\cos\varphi + \dfrac{V_{ety}^t}{R_y^t} \right) V_{ety}^t - g \end{cases}$$

During hand posture capture, the hand's velocity along the direction of gravity is small, so $V_{etz}^t \approx 0$ can be taken, and Eq. (19) reduces to:

$$\begin{cases} \dot{V}_{etx}^t = f_x + \left( 2\omega_{ie}\sin\varphi + \dfrac{V_{ety}^t}{R_y^t}\tan\varphi \right) V_{ety}^t \\[8pt] \dot{V}_{ety}^t = f_y - \left( 2\omega_{ie}\sin\varphi + \dfrac{V_{ety}^t}{R_y^t}\tan\varphi \right) V_{etx}^t \end{cases}$$
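A single Euler integration step of these simplified horizontal equations might look like the sketch below; the step function, the use of a mean Earth radius for $R_y^t$, and the sample values are our own illustrative assumptions:

```python
import numpy as np

OMEGA_IE = 7.292115e-5   # Earth rotation rate (rad/s)
R_EARTH = 6.371e6        # mean Earth radius (m), stand-in for R_y^t

def step_velocity(vx, vy, fx, fy, phi, dt):
    """One Euler step of the simplified horizontal equations: the bracketed
    Coriolis/transport term couples the two components antisymmetrically."""
    k = 2.0 * OMEGA_IE * np.sin(phi) + (vy / R_EARTH) * np.tan(phi)
    vx_new = vx + (fx + k * vy) * dt
    vy_new = vy + (fy - k * vx) * dt
    return vx_new, vy_new

# At the equator (phi = 0) with zero specific force the velocity is unchanged;
# a pure x specific force accumulates linearly over the step.
vx1, vy1 = step_velocity(1.0, 0.5, 0.0, 0.0, 0.0, 0.01)
vx2, vy2 = step_velocity(0.0, 0.0, 2.0, 0.0, 0.0, 0.1)
```

Because the coupling term is antisymmetric, it redistributes velocity between the two horizontal components without adding energy, which matches the physical role of the Coriolis correction.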

Location information solving and data preprocessing

After the carrier's attitude matrix is obtained from the attitude algorithm, the carrier's position $(\lambda, \varphi)$ can be computed by combining it with the Earth reference coordinate system, where $\lambda$ is the longitude and $\varphi$ the latitude:

$$\begin{cases} C_1 = \cos\lambda\sin\varphi \\ C_2 = \sin\lambda \\ C_3 = \cos\lambda\cos\varphi \end{cases}$$

The hand position can also be obtained by double integration of the accelerometer measurements in the reference navigation coordinate system, a simple process but one with slightly lower accuracy.

The signal measured by the acceleration sensor is first preprocessed, and the double integration method is then used to compute the position of the carrier. Since the sampled signal is discrete, the position can be computed directly by double summation over the sampling instants:

$$\begin{cases} v_i = v_{i-1} + f_i\, \delta_t \\ s_i = s_{i-1} + v_i\, \delta_t \end{cases}$$

In Eq. (22), $f_i$, $v_i$, $s_i$ are the acceleration, velocity, and position at moment $i$, respectively, with $f_0 = 0$, $v_0 = 0$, $s_0 = 0$ at $i = 0$, and $\delta_t$ is the sampling interval.
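Eq. (22) amounts to the following discrete double summation (a minimal sketch with made-up constant-acceleration samples):

```python
def integrate_position(accels, dt):
    """Discrete double summation of Eq. (22):
    v_i = v_{i-1} + f_i * dt, s_i = s_{i-1} + v_i * dt, with v_0 = s_0 = 0."""
    v = s = 0.0
    velocities, positions = [], []
    for f in accels:
        v += f * dt          # first summation: acceleration -> velocity
        s += v * dt          # second summation: velocity -> position
        velocities.append(v)
        positions.append(s)
    return velocities, positions

# Constant acceleration of 1 m/s^2 sampled at 10 Hz.
vels, poss = integrate_position([1.0, 1.0, 1.0], 0.1)
```

Any bias in the acceleration samples is integrated twice, so position error grows quadratically with time, which is exactly the drift problem the next paragraphs address.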

In practice, when the hand is at rest, the cumulative effect of the sensor's own errors drives the output acceleration toward zero while the computed velocity does not also return to zero, so the calculated position accumulates a large error. If accurate positioning is required at this point, auxiliary positioning methods are needed. However, even a rough assessment of position can still be obtained by adding some judgment algorithms on top of this method.

During hand movement it is difficult for the hand joints to achieve coordinated uniform motion, whereas this is easy when the whole hand is stationary. When the measured net acceleration is zero, the hand cannot be in uniform motion, so the state can be judged directly as stationary; even if the computed velocity is not zero, it is automatically reset to zero, which to a large extent corrects velocity drift over time in a timely manner. At the same time, because the acceleration sensor responds to the slightest change in movement, external interference also causes data fluctuations, so the measured data must be preprocessed to suppress position drift and ensure data reliability.
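One simple judgment algorithm of this kind is a zero-velocity reset: when a short window of net acceleration is both near zero and nearly constant, the hand is declared stationary and the drifting velocity is zeroed. The window length and thresholds below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def zero_velocity_update(accel_window, velocity,
                         mean_thresh=0.05, std_thresh=0.02):
    """Reset velocity to zero when the recent net acceleration is
    near zero and nearly constant (hand judged stationary)."""
    w = np.asarray(accel_window, dtype=float)
    if np.abs(w).mean() < mean_thresh and w.std() < std_thresh:
        return np.zeros_like(velocity)   # stationary: cancel drift
    return velocity                      # moving: keep the estimate

v = np.array([0.30, 0.10, 0.05])                    # drifted velocity estimate
still = zero_velocity_update([0.0] * 10, v)          # quiet window -> reset
moving = zero_velocity_update([1.0, -1.0] * 5, v)    # active window -> kept
```

Checking both the mean and the spread of the window guards against misclassifying brief zero-crossings during fast motion as rest.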

A common data preprocessing method is software digital filtering, i.e., applying a filtering algorithm to the raw measurements to remove noise and interference and recover data that match the true information. In this paper, amplitude-limiting filtering and mean filtering are combined. The motion data collected by the sensor are first passed through the limiting filter to remove random pulses and other external interference signals, and the mean filter then takes the arithmetic average of each window of samples as the filtered output, completing the preprocessing and ensuring the reliability and accuracy of the system's subsequent computations.
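The two-stage preprocessing can be sketched as follows; the step limit and window size are illustrative choices, not the paper's tuned values:

```python
def limit_filter(samples, max_step):
    """Amplitude-limiting filter: a jump larger than max_step is treated as
    an interference pulse and replaced by the previous accepted value."""
    out = [samples[0]]
    for s in samples[1:]:
        out.append(s if abs(s - out[-1]) <= max_step else out[-1])
    return out

def mean_filter(samples, window):
    """Arithmetic mean over consecutive non-overlapping windows."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]

raw = [0.0, 0.1, 5.0, 0.2, 0.3, 0.2]   # 5.0 is a spurious pulse
clean = mean_filter(limit_filter(raw, max_step=1.0), window=2)
```

The limiting stage removes isolated spikes that would dominate an average, and the averaging stage then suppresses the remaining small-amplitude noise.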

Motion Algorithm for Hand Posture

The forward kinematic equation of the human hand describes the affiliation and motion relationships between the hand joints; in the description matrix of hand joint motion used by this method, the index of each joint's parent must be known. Based on the forward kinematic equation of the human hand, a hand joint motion projection algorithm suitable for hand posture capture is developed, as follows.

The data collected by the sensors yield angle information via the attitude solution algorithm, but because of the complexity of hand joint motion and the limited number of sensors in the system, conventions for the hand posture must first be fixed: in the initial state the Z-axis of the sensor reference frame is zeroed and the palm is perpendicular to the horizontal plane, and the postures of the other joints are obtained from the corresponding coordinate transformations. The simplified structure of the system is shown in Fig. 5: the black squares are MEMS inertial sensors, and the motion of the finger joints relative to their parent joints is denoted by ∠A, ∠B, ∠C. The algorithm computes the attitude data, which are then used to reconstruct the hand motion.

The motion relationship between the distal and proximal finger joints is approximately linear with a coefficient of roughly 2/3, as verified by extensive observation and experimental data in the article Virtual Hand Drive. The hand joint motion algorithm adopts the following method to restore real hand movements:

Figure 5.

Schematic of the motion structure of the hand joint

In Fig. 5, the posture angle computed from the accelerometer measurements is ∠G(x,g)^j, where (x,g) is the angle between the accelerometer's component along the horizontal X-axis and the direction of gravity, and j indexes the different finger joints of the hand. The angle between Y_i and Y_{i+1} is denoted ∠Y(i,i+1). Three cases are distinguished according to the computed range:

If ∠G(x,g)^j ∈ (0°, 90°), the values of ∠A, ∠B, ∠C are computed as:

$$\begin{cases} \angle A:\ \angle Y_{(1,2)}^j = \dfrac{\angle G_{(x,g)}^j}{3} \\[6pt] \angle B:\ \angle Y_{(2,3)}^j = \dfrac{\angle G_{(x,g)}^j - \angle Y_{(1,2)}^j}{3} \\[6pt] \angle C:\ \angle Y_{(3,4)}^j = \dfrac{\angle G_{(x,g)}^j}{2} \end{cases}$$

If ∠G(x,g)^j ∈ (90°, +∞), ∠A, ∠B, ∠C are computed as:

$$\begin{cases} \angle A:\ \angle Y_{(1,2)}^j = \dfrac{\angle G_{(x,g)}^j}{2} \\[6pt] \angle B:\ \angle Y_{(2,3)}^j = \dfrac{\angle G_{(x,g)}^j - \angle Y_{(1,2)}^j}{2} \\[6pt] \angle C:\ \angle Y_{(3,4)}^j = \dfrac{\angle G_{(x,g)}^j}{2} \end{cases}$$

If ∠G(x,g)^j ∈ (−∞, 0°), the values of ∠A, ∠B, ∠C are set to zero.
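Collected into one function, the three cases above can be sketched as follows. This is our own reading of the garbled formulas, with the range-dependent divisors as stated; it is an illustration, not the paper's verified implementation:

```python
def distribute_joint_angles(g_angle_deg):
    """Split the measured posture angle G(x,g) of one finger into the three
    joint angles A, B, C according to its range (angles in degrees)."""
    if g_angle_deg <= 0:                     # third case: all angles zero
        return 0.0, 0.0, 0.0
    d = 3.0 if g_angle_deg < 90 else 2.0     # divisor depends on the range
    a = g_angle_deg / d                      # proximal joint angle A
    b = (g_angle_deg - a) / d                # middle joint angle B
    c = g_angle_deg / 2.0                    # distal joint angle C
    return a, b, c

angles = distribute_joint_angles(60.0)
```

Spreading one measured angle over three joints in fixed ratios is what lets a single sensor per finger drive a full three-joint finger model.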

To track and capture real hand movements, OpenGL can be used to render the constructed hand model; combined with the computed hand pose information, the hand model recovers real hand movements in real time.

A model for categorizing hand gestures in accordion playing
Support Vector Machines

The essence of the support vector machine is to find a separating hyperplane that maximizes the margin. For nonlinear problems, the training samples must be mapped from the original space into a high-dimensional space in which they become linearly separable. Let $\phi(x)$ denote the feature vector obtained by mapping $x$; the model corresponding to the separating hyperplane is given by Eq. (25):

$$f(x) = w^T \phi(x) + b$$

where $w$ is the normal vector of the separating hyperplane. The optimization problem is:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2, \qquad \text{s.t. } y_i \left( w^T \phi(x_i) + b \right) \geq 1, \quad i = 1, 2, \ldots, m$$
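As a concrete illustration of the mapping $\phi(x)$, an RBF-kernel SVM separates XOR-like data that no hyperplane in the original 2D space can split. This scikit-learn sketch uses toy data and hyperparameters of our own choosing, not the paper's configuration:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original 2D space.
X = np.array([[0.0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# The RBF kernel implicitly supplies the mapping phi(x) into a
# high-dimensional space where a maximum-margin hyperplane exists.
clf = SVC(kernel="rbf", C=100.0, gamma=2.0)
clf.fit(X, y)
pred = clf.predict(X)
```

A linear kernel would be unable to fit these four points; the kernel trick makes the margin maximization tractable without ever computing $\phi(x)$ explicitly.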

K-nearest neighbors

The core idea of the K-nearest neighbor method is that a sample belongs to a class if most of the K samples most similar to it in the feature space belong to that class [21]. The method presupposes that the selected neighboring points have already been classified. Similarity between sample points is realized by computing the distance between objects, usually the Euclidean distance or the Manhattan distance, with the following expressions:

Euclidean distance: $$d(x,y) = \sqrt{\sum_{k=1}^{n}(x_k - y_k)^2}$$

Manhattan distance: $$d(x,y) = \sum_{k=1}^{n}|x_k - y_k|$$ where $n$ denotes the dimension of the feature vector.
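The two distances can be computed directly; a minimal sketch for feature vectors of arbitrary dimension:

```python
import numpy as np

def euclidean(x, y):
    """d(x,y) = sqrt(sum_k (x_k - y_k)^2)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    """d(x,y) = sum_k |x_k - y_k|"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(np.abs(x - y)))

# e.g. x = (1, 2, 3), y = (4, 6, 3): Euclidean 5.0, Manhattan 7.0
```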

Feedforward Neural Networks

A feedforward neural network is a multilayer directed graph with full connections between the nodes of each layer and the next [22]. The feedforward neural network in this paper consists of 3 layers: the input, hidden, and output layers.

Let $N_i$, $N_h$ and $N_o$ be the numbers of nodes in the input, hidden, and output layers, respectively. The output of the $k$-th node in the output layer is:

$$y_k = f\left(\theta_k + \sum_{h=1}^{N_h} w_{kh}\, f\left(\lambda_h + \sum_{i=1}^{N_i} v_{hi} x_i\right)\right)$$

where $x_i$ is the $i$-th node of the input layer, $v_{hi}$ is the weight between input-layer and hidden-layer nodes, $w_{kh}$ is the weight between hidden-layer and output-layer nodes, $\theta_k$ and $\lambda_h$ are biases, and $k = 1, \cdots, N_o$. Apart from the input-layer nodes, every node is a neuron with a nonlinear activation function, which in this paper is the logistic function:

$$f(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the weighted sum of the input synapses.
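The forward pass above can be written directly in vector form. The layer sizes below ($N_i = 3$, $N_h = 4$, $N_o = 2$) are illustrative only; the paper's actual network uses 68 inputs and 9 outputs:

```python
import numpy as np

def logistic(z):
    """f(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, lam, W, theta):
    """One forward pass of the 3-layer network:
    y_k = f(theta_k + sum_h w_kh * f(lambda_h + sum_i v_hi * x_i))."""
    h = logistic(lam + V @ x)        # hidden activations, shape (Nh,)
    return logistic(theta + W @ h)   # output activations, shape (No,)

# Illustrative shapes: Ni=3, Nh=4, No=2, with random weights and biases.
rng = np.random.default_rng(1)
x = rng.normal(size=3)
V = rng.normal(size=(4, 3)); lam = rng.normal(size=4)
W = rng.normal(size=(2, 4)); theta = rng.normal(size=2)
y = forward(x, V, lam, W, theta)
```

Since the logistic function maps into (0, 1), every output component lies strictly between 0 and 1.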

Model integration

For the results recognized by the three classification algorithms, this paper adopts a Boosting model for integration. The classification-model weights adapt according to the previous round's error rate, and the number of iterations is set to 100 in order to avoid overfitting while preserving the real-time behavior required for hand-posture classification.
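A simplified sketch of this integration idea follows: each of the three classifiers is fitted, and an AdaBoost-style weight derived from its error rate scales its vote. This is an assumption-laden illustration, not the paper's exact scheme (the 100-iteration adaptive reweighting is omitted, and the toy data is synthetic):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the gesture feature data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

models = [SVC(),
          KNeighborsClassifier(n_neighbors=5),
          MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                        random_state=0)]
alphas = []
for m in models:
    m.fit(X, y)
    err = max(1.0 - m.score(X, y), 1e-6)          # training error rate
    alphas.append(0.5 * np.log((1.0 - err) / err))  # AdaBoost-style weight

def predict(x_row):
    """Weighted vote of the three classifiers for one sample."""
    votes = np.zeros(2)
    for m, a in zip(models, alphas):
        votes[int(m.predict(x_row.reshape(1, -1))[0])] += a
    return int(np.argmax(votes))
```

A classifier with a lower error rate receives a larger weight, so its vote dominates when the three models disagree.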

Experimental results and analysis
Experimental data

Experimental data were collected from 10 volunteers (5 males and 5 females), whose hands were fitted with motion capture sensors while they executed predefined accordion playing gestures. To fully evaluate the performance of the proposed motion capture and hand gesture classification method, each volunteer repeated each gesture 30 times: 10 executions at a slower speed (each taking no more than 3 s), 10 at normal speed (no more than 2 s), and 10 at a faster speed (no more than 1 s). During data collection, a camera synchronously recorded the user's hand movements, and the collected data were manually segmented and labeled by comparing the sensor data against the video in a visualization tool.
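Segmenting a continuous sensor stream into fixed-length windows, as done before labeling, can be sketched as follows. The window length of 10 matches the input window used later in the experiments; the step size is an assumption:

```python
import numpy as np

def sliding_windows(stream, length=10, step=5):
    """Split a (T, channels) sensor stream into overlapping
    fixed-length windows, returned as (num_windows, length, channels)."""
    stream = np.asarray(stream)
    starts = range(0, len(stream) - length + 1, step)
    return np.stack([stream[i:i + length] for i in starts])
```

For a 30-sample, 3-channel stream this yields windows starting at samples 0, 5, 10, 15, and 20.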

Experimental environment and parameters

The model in this paper is built on the Ubuntu system and implemented in Python. The experimental environment is configured as follows: the operating system is Ubuntu 16.04.3 LTS, the CPU is an Intel i7-7700 with 8 GB of RAM, and the Python libraries used are Tensorflow 1.14.0, sklearn 0.23.1, pandas 1.0.5, numpy 1.18.5, matplotlib 3.2.2, and keras 2.2.5. In the experiment, 68 features are first extracted from the data stream over an input window of length 10 and fed to the feedforward neural network, whose nonlinear units use the logistic function. L2 regularization and the Adam optimizer are then set up to train the model parameters by gradient descent. Finally, the softmax function outputs one of the 9 predefined accordion playing hand gesture classes.
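The network described above can be sketched in Keras roughly as follows. The hidden-layer width (64) and the L2 coefficient are assumptions, as the paper does not state them; only the 68 input features, the sigmoid (logistic) activation, L2 regularization, the Adam optimizer, and the 9-class softmax output come from the text:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(68,)),                  # 68 features per window
    layers.Dense(64, activation="sigmoid",     # logistic hidden units
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(9, activation="softmax"),     # 9 gesture classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
out = model.predict(np.zeros((1, 68)), verbose=0)  # shape (1, 9)
```

Training would then proceed with `model.fit` on the labeled feature windows.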

Experimental evaluation and results

In the evaluation phase, three experiments are designed to evaluate the proposed hand gesture motion capture and classification method. The first experiment evaluates the recognition performance of the algorithm in the user-dependent case. The second experiment, which recruits 3 testers, verifies the recognition performance of the algorithm in the user-independent case. The third experiment evaluates the real-time performance of the algorithm. User-dependent means that the test and training data come from the same person, while user-independent means that they come from different people; this distinction mainly serves to verify the robustness of the system and the generalization ability of the model.

Recognition results in the user-dependent case

In the experiments, ten-fold cross-validation was used to evaluate the capture and classification of accordion playing hand gesture motion for each volunteer. Three evaluation metrics (Accuracy A, Precision P, and Recall R) were used to compare the proposed integrated model with the individual support vector machine, K-nearest neighbor method, and feedforward neural network.
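This evaluation protocol maps directly onto scikit-learn's multi-metric cross-validation. The sketch below uses a synthetic stand-in dataset (the real data has 9 classes and 68 features) and an SVM as an example estimator:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the labeled gesture windows.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, n_classes=3, random_state=0)

# Ten-fold cross-validation with the three metrics used in the paper.
scores = cross_validate(SVC(), X, y, cv=10,
                        scoring=("accuracy", "precision_macro",
                                 "recall_macro"))
acc = scores["test_accuracy"].mean()   # mean accuracy over the 10 folds
```

Macro averaging treats all gesture classes equally, which is appropriate here since each gesture was executed the same number of times.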

Figure 6 shows the overall performance comparison over the nine accordion playing hand postures, including natural relaxation and the basic hand position. The integrated method proposed in this paper achieves an accuracy of 97.54% in the user-dependent case, and its overall performance is superior to that of the other three methods. This is mainly because the integrated model combines the advantages of the three classifiers and, by limiting the number of iterations, prevents overfitting and similar problems.

Figure 6.

Hand gesture recognition performance under dependence

The accuracy of each gesture in the user-dependent case is shown in Table 2. In most cases, the proposed method obtains satisfactory results, especially for simple gestures, with accuracy above 98%. However, finger independence and finger curvature are easily confused because these gestures are highly similar and the testers' joint angles differ considerably, so their accuracy is about 2% lower than that of the other gestures.

Table 2. Accuracy of each hand gesture recognition

Hand gesture A/%
Natural relaxation 98.32
Basic position 98.27
Finger independence 96.49
Finger curvature 96.54
Span 98.83
Weight and speed 98.47
Wrist position 98.09
Shoulder position 98.52
Arm position 98.13
Recognition results in the user-independent case

The experiment is conducted by recruiting three testers to perform user-independent hand gesture motion capture and classification using the trained recognition algorithm. The method proposed in this paper and the direct hand gesture motion capture and classification with raw sensor data in the user-independent case are compared in Figure 7. Thanks to the effective integration of different classification methods, the method proposed in this paper is more stable and robust in the user-independent case.

Figure 7.

Hand gesture recognition performance under user independence

Method real-time assessment

Given the practical application scenario of hand positioning for accordion playing, the timeliness of the method is an important factor affecting the user experience. Table 3 shows the latency of each method, and the results show that the real-time performance of the proposed method is satisfactory: its latency is even lower than that of the feedforward neural network and comparable to that of the support vector machine. Feature extraction reduces data redundancy and yields distinct features, which improves the efficiency of the classifier.

Table 3. Delay of each model

Models Delay/ms
SVM 114
KNN 138
FNN 153
Boosting 109
Conclusion

In this paper, the construction of a 3D model of the accordion playing hand posture is realized from multiple views of the playing hand, and motion capture of the playing hand is accomplished using the hand posture solving algorithm. A Boosting model integrating support vector machine, K-nearest neighbor, and feedforward neural network algorithms is used to classify the hand posture of the accordion player. The 3D construction algorithm achieves the minimum error both with and without occlusion of the hand, and its error is more stable than that of Method A and Method B. In the user-dependent case, the integrated model reaches a classification accuracy of 97.54% over the nine hand postures of accordion playing, seven of which exceed 98%. In the user-independent case, the proposed method is more stable and robust than the support vector machine and the other methods used alone. At the same time, the method has a classification delay of only 109 ms, giving it stronger timeliness than the other methods and effectively improving the classification efficiency of hand gestures for accordion playing.

Funding:

This research was supported by the Shanxi Provincial Bureau of Cultural Heritage project in 2024: Research on digital display and transmission path of musical cultural relic - taking Chime of Jin State in Eastern Zhou Dynasty as an example (Project No.: 2024KT21).
