
Research on the construction of virtual playing environment and its human-computer interaction performance improvement in piano teaching

  
21 March 2025


Introduction

Against the backdrop of today's comprehensive promotion of quality education, music education, as an important component of quality education, plays a pivotal role in promoting students' physical and mental development and the formation of a well-rounded personality, a role that cannot be replaced by other subjects in school education [1-2]. Strengthening music education is of long-term significance for improving the musical and cultural literacy of the whole nation and for cultivating high-quality innovative talents [3-4]. However, as an art subject, music faces considerable constraints in traditional teaching: its implementation requires not only substantial investment in musical instruments and equipment but also specialized music teachers and dedicated teaching venues, which undoubtedly restricts the reach and promotion of music teaching [5-6]. In recent years, with the continuous penetration of computer and multimedia technology into the field of music, music teaching is no longer limited to hardware environments under specific conditions, which provides an opportunity for its popularization and promotion [7].

At the same time, in the information age, the discipline of music needs to make full use of new achievements in information technology in order to develop further. The traditional music teaching system is bound to undergo a great transformation in concepts, methods and contents, which also opens up new ways for teachers to carry out music teaching creatively [8]. However, existing music teaching software is still dominated by foreign products, especially those from the United States and Japan. Domestic computer-based music teaching software is relatively scarce and is mostly used to assist teachers in the preliminary production of courseware [9]. Music teaching itself is still based on traditional methods, with the computer acting most of the time merely as a player. Therefore, the development of new music teaching tools that participate directly in classroom teaching has important practical significance and far-reaching application value [10].

Virtual reality technology has been successfully used in medical training, military aircraft navigation, entertainment, and education. Its main advantage is that it can create a perceptual and cognitive bridge between physical objects (e.g., instruments) and users [11-13]. Applying virtual reality technology to piano teaching can build a virtual playing environment for teachers and students and improve human-computer interaction performance, fully engaging students' auditory, visual and tactile senses. This increases students' sensitivity to music and, at the same time, plays an important role in deepening teachers' teaching concepts, improving teaching methods, and increasing the efficiency of music teaching [14-17].

This paper mainly uses somatosensory interaction technology and mixed reality technology to construct a virtual playing environment for piano teaching and to improve human-computer interaction performance. Specifically, this paper designs a gesture recognition method based on the Hidden Markov Model and uses the Baum-Welch algorithm to train the model so as to improve recognition accuracy. On the basis of gesture recognition, a virtual piano playing scheme based on mixed reality is designed and a virtual piano playing environment is constructed. The model and the scheme are then verified experimentally. Finally, through research on users' interaction experience during virtual piano performance, strategies to improve the model's human-computer interaction performance are proposed.

The application of somatosensory interaction technology in piano teaching

This chapter discusses the application of somatosensory interaction technology in virtual piano teaching tools, i.e., human-computer interaction in virtual playing environments through somatosensory interaction technology.

Development environment and tools
Hardware environment

The computer used in this study runs a 64-bit operating system on an x64-based processor, with 8.00 GB of system memory, more than 120 GB of free hard-disk space, and a graphics card at the level of an NVIDIA GeForce GTX 1650 or higher.

Software environment

The system architecture and business logic of the virtual piano teaching tool are implemented mainly in the programming language C#, developed and packaged for distribution with Visual Studio 2017. The front-end interface is built with Unity 2018.4.8.

Equipment tools

The virtual piano teaching tool primarily uses the Leap Motion controller for gesture interaction; the required resources consist of several core resource packages from Leap Motion's official website that are used to build the tool's functions.

Leap Motion Gesture Recognition Technology

The essence of the virtual piano teaching tool is that the user uses the virtual hand provided by Leap Motion to interact with the function buttons and virtual objects in the system. The first problem to be solved before creating the teaching tool is therefore the connection between the Leap Motion controller and the virtual hand in the virtual scene, which is handled in the following steps:

1) Import the Leap Motion Core Assets into Unity and locate the required Prefabs in the resource pack. Add the LeapHandController, which controls the hand, from the resource pack to the scene.

2) Insert the capsule hand models Capsule Hand Right and Capsule Hand Left provided in the resource pack into an empty object (Create Empty) to form a parent-child relationship; at this point the scene shows two hand models. Then drag RigidRoundHand_R and RigidRoundHand_L from the HandModelPhysical file into the empty object, and all the necessary files are added to the scene.

3) Set the Model Pool value in LeapHandController: set the size value to 2, assign the four hand models identified in the previous step to the two elements, and check the Is Enabled and Can Duplicate options. Running the project again, the two virtual hands now follow the movement of the user's hands. The position of the Main Camera can be adjusted according to the specific scene to obtain a suitable field of view and an appropriate apparent size.

Gesture Interaction Technology

To use Leap Motion to grab objects, the Interaction Engine resource pack must be imported; it also provides many free examples and code that users can learn from and use directly. The Capsule Hands (Desktop) scene provided in its examples can be used directly and has the same function as the previous setup. After running, a warning is issued because the gravity setting and the Timestep value are not reasonable; these need to be set in Project Settings, after which the virtual hands run normally.

Use the virtual hand function to grab objects to create a new scene, and perform a simple grabbing operation as follows:

1) Create a new Cube as the object to be grasped and a new Plane as the carrier, and attach different materials to the two in order to distinguish them. Set the relative position and size of the virtual hand and the object to be grasped appropriately, because the data collection range of Leap Motion is limited and the interactive object must lie within the controllable range of the device.

2) Search for Interaction Manager in the search box and add it to the scene. Replace the Interaction Hand (Left & Right) in the Interaction Manager subset with the rigid-body hands RigidRoundHand in HandModels.

3) Add the interaction code to the Cube that needs to be interacted with: add the Rigidbody component and the Interaction Behaviour script, and check Use Gravity to apply gravity to the object. Run the program to observe the interaction between the virtual hand and the object.

Piano virtual performance model based on gesture interaction

In this chapter, a Hidden Markov Model (HMM)-based gesture interaction recognition method is proposed, and the 3D modeling and virtual playing of a piano is realized by combining mixed reality technology on the basis of gesture recognition.

Hidden Markov-based recognition of gesture interactions
Hidden Markov Models (HMM)

Hidden Markov Models (HMMs) are probabilistic models of temporal sequences. An HMM describes a process in which a hidden Markov chain randomly generates an unobservable sequence of states, and each state in turn generates an observation, thereby producing a random sequence of observations [18]. The sequence generated by the hidden Markov chain is the state sequence; the observations generated by the states form the observation sequence; and each position in the sequence can be regarded as a moment in time.

The structure of the HMM is shown in Fig. 1. Its variables can be divided into two groups. One group is the state variables I = {i1, i2, …, it, …, iT}, which are generally assumed to be hidden and unobservable and are therefore also known as hidden variables; the other group is the observation variables O = {o1, o2, …, ot, …, oT}. In the HMM, the system transitions among N states Q = {q1, q2, …, qN}, so the state variable it takes values in a discrete space of N possible values, while the observation variable ot takes values in V = {v1, v2, …, vM}.

Figure 1.

Diagram structure of HMM

An HMM is determined by the initial probability distribution, the state transition probability distribution and the observation probability distribution, which are defined as follows:

State transition probability matrix: usually denoted as $A = [a_{ij}]_{N \times N}$, where: \[{a_{ij}} = P({i_{t+1}} = {q_j} \mid {i_t} = {q_i}), \quad 1 \le i \le N,\ 1 \le j \le N \tag{1}\]

In Equation (1), qi denotes the state at the current moment t and qj denotes the state at the next moment t + 1; aij is the probability of transitioning from qi to qj at any moment.

Observation probability matrix: usually denoted as $B = [b_j(k)]_{N \times M}$, where: \[{b_j}(k) = P({o_t} = {v_k} \mid {i_t} = {q_j}), \quad 1 \le j \le N,\ 1 \le k \le M\]

This formula gives the probability of obtaining the observation vk at moment t when the state is qj.

Initial state probability vector: the probability of each state of the HMM at the initial moment, usually denoted as π = {π1, π2, …, πN}, where πi = P(i1 = qi), 1 ≤ i ≤ N. The complete parameter set of the HMM is denoted by λ = (A, B, π).

Three basic problems of HMMs

There are three main basic problems in HMM:

1) The evaluation problem: given an HMM, find the probability of an arbitrary observation sequence O = {o1, o2, …, oT}, i.e., estimate the value of P(O|λ).

2) The decoding problem: given the HMM and an observation sequence O, find the state sequence Q = {q1, q2, …, qT} that maximizes the probability of producing the observation sequence O.

3) The learning problem: given the HMM and observation sequences, readjust the model parameters λ = (A, B, π) so that the probability P(O|λ) of the observation sequence under the adjusted model is maximized. In training, the observation sequences are used as training sequences; the more accurately the model parameters are estimated, the more accurate the resulting gesture model. The most widely used training algorithm is the Baum-Welch algorithm.

The Baum-Welch algorithm is a specialization of the EM (Expectation Maximization) algorithm, an iterative algorithm for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probabilistic models containing hidden variables [19]. In the course of the Baum-Welch algorithm, the forward probability and the backward probability need to be computed to obtain the quantities required for re-estimation.

First consider the forward probability. Given the HMM, the forward probability αt(i) = P(o1, o2, …, ot, it = qi | λ) is the probability that the partial observation sequence up to moment t is o1, o2, …, ot and the state at moment t is qi. The forward probability αt(i) can be computed recursively, and the observation sequence probability P(O|λ) is finally obtained. The specific steps are as follows (a short code sketch follows the steps):

1) Initialize the forward probability, i.e., compute the product of the initial state probability and the observation probability of o1: \[{\alpha_1}(i) = {\pi_i}{b_i}({o_1}), \quad 1 \le i \le N\]

2) Recursive computation of the forward probability: \[{\alpha_{t+1}}(j) = \left[ \sum\limits_{i=1}^{N} {\alpha_t}(i)\,{a_{ij}} \right] {b_j}({o_{t+1}}), \quad 1 \le t \le T-1,\ 1 \le j \le N\] where the bracketed sum accumulates, over all N possible previous states, the probability of generating the first t observations and ending in state qi, multiplied by the transition probability aij to state qj; and bj(ot+1) is the probability of generating the (t + 1)-th observation in state qj at moment t + 1.

3) Termination: \[P(O|\lambda) = \sum\limits_{i=1}^{N} {\alpha_T}(i)\] where αT(i) = P(o1, o2, …, oT, iT = qi | λ) is the probability of generating the entire observation sequence and terminating in state qi; summing over all possible terminal states gives P(O|λ).
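The forward recursion above maps directly onto a few lines of code. The following is a minimal Python sketch (the system described in this paper is implemented in C#/Unity, so this is illustrative only); the model parameters A, B, pi and the observation sequence are made-up toy values, not data from the paper.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(O | lambda) and the table of alpha_t(i)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for t in range(T - 1):
        # recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha[-1].sum(), alpha                # termination: P(O|lambda) = sum_i alpha_T(i)

# Toy model: 2 hidden states, 3 observation symbols (illustrative values only).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, obs=[0, 1, 2, 1])[0])
```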

Next consider the backward probability. Given the HMM, under the condition that the state at moment t is qi, the probability that the partial observation sequence from moment t + 1 to the final moment T is ot+1, ot+2, …, oT is the backward probability βt(i) = P(ot+1, ot+2, …, oT | it = qi, λ). The backward probability βt(i), and hence the observation sequence probability P(O|λ), can be computed recursively as follows (a short code sketch follows the steps):

1) Backward probability initialization, for all states qi at the final moment: \[{\beta_T}(i) = 1, \quad 1 \le i \le N\]

2) Recursion of the backward probability: for each of the N possible states qj at moment t + 1, multiply the transition probability aij, the observation probability bj(ot+1) of observation ot+1 in that state, and the backward probability βt+1(j), and sum: \[{\beta_t}(i) = \sum\limits_{j=1}^{N} {a_{ij}}\,{b_j}({o_{t+1}})\,{\beta_{t+1}}(j), \quad t = T-1, T-2, \ldots, 1,\ 1 \le i \le N\]

3) Termination: \[P(O|\lambda) = \sum\limits_{i=1}^{N} {\pi_i}\,{b_i}({o_1})\,{\beta_1}(i)\]
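The backward recursion can be sketched in the same way. With the same toy parameters as in the forward example, both routines return the same P(O|λ), which is a useful sanity check; again, this is an illustrative Python sketch, not the paper's C# implementation.

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward algorithm: returns P(O | lambda) and the table of beta_t(i)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                               # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # termination: P(O|lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return (pi * B[:, obs[0]] * beta[0]).sum(), beta

# Same toy model as in the forward sketch (illustrative values only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(backward(A, B, pi, obs=[0, 1, 2, 1])[0])   # matches the forward probability
```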

Given the HMM and the full observation sequence O, the probability of being in state qi at moment t and in state qj at moment t + 1 is: \[{\xi_t}(i,j) = P({i_t} = {q_i}, {i_{t+1}} = {q_j} \mid O, \lambda)\]

It can be computed from the forward and backward probabilities: \[{\xi_t}(i,j) = \frac{{\alpha_t}(i)\,{a_{ij}}\,{b_j}({o_{t+1}})\,{\beta_{t+1}}(j)}{\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{N} {\alpha_t}(i)\,{a_{ij}}\,{b_j}({o_{t+1}})\,{\beta_{t+1}}(j)}\]

In this expression, αt(i) accounts for generating the first t observations and being in state qi at moment t; aij is the probability of transitioning to state qj; bj(ot+1) generates the (t + 1)-th observation; and βt+1(j) accounts for generating the remaining observations from state qj at moment t + 1. Taking the marginal probability over all possible next states gives the probability that the system is in state qi at moment t: \[{\gamma_t}(i) = \sum\limits_{j=1}^{N} {\xi_t}(i,j)\]

The steps of the Baum-Welch algorithm can then be stated as follows (a sketch of one re-estimation step is given after the steps):

1) Initialization: for n = 0, choose initial values $a_{ij}^{(0)}, b_j(k)^{(0)}, \pi_i^{(0)}$ to obtain the model $\lambda^{(0)} = (A^{(0)}, B^{(0)}, \pi^{(0)})$.

2) Recursive computation: \[\overline{a_{ij}} = \frac{\sum\limits_{t=1}^{T-1} {\xi_t}(i,j)}{\sum\limits_{t=1}^{T-1} {\gamma_t}(i)}\] \[\overline{b_j(k)} = \frac{\sum\limits_{t=1,\,{o_t}={v_k}}^{T} {\gamma_t}(j)}{\sum\limits_{t=1}^{T} {\gamma_t}(j)}\] \[{\bar{\pi}_i} = P({i_1} = {q_i}) = {\gamma_1}(i)\]

3) Termination: the updated model parameters are $\lambda^{(n+1)} = (A^{(n+1)}, B^{(n+1)}, \pi^{(n+1)})$.
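Putting the forward/backward quantities and the re-estimation formulas together, one iteration of Baum-Welch can be sketched as follows. This is an illustrative Python sketch under the same assumptions as the previous examples (a single observation sequence and toy initial parameters); the paper's system is implemented in C#.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step for a discrete-observation HMM."""
    obs = np.asarray(obs)
    N, M = B.shape
    T = len(obs)
    # forward and backward tables
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()                       # P(O | lambda) for the current model
    # xi_t(i, j) and gamma_t(i) from the formulas above
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] /= xi[t].sum()
    gamma = alpha * beta / prob                  # gamma_t(i) = P(i_t = q_i | O, lambda)
    # re-estimation of A, B and pi
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.array([gamma[obs == k].sum(axis=0) for k in range(M)]).T / gamma.sum(axis=0)[:, None]
    new_pi = gamma[0]
    return new_A, new_B, new_pi, prob

# Illustrative training loop with toy values: the printed likelihood should not decrease.
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 2, 2, 1, 0]
for _ in range(10):
    A, B, pi, prob = baum_welch_step(A, B, pi, obs)
    print(prob)
```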

HMM dynamic gesture recognition process
Types of dynamic gestures for virtual piano

For piano performance, the five basic discrete gestures shown in Table 1 were selected as the input gesture data in this study.

Table 1. Dynamic gesture classification

Gestures 1-5 (the corresponding gesture illustrations are omitted)
Dynamic Gesture Tracking

This study uses the second-generation Kinect for gesture tracking. Skeletal tracking is a very important part of Kinect: it extracts the desired human body from complex background images, i.e., recognizes the various parts of the human body from the depth image and tracks them in real time. In the second-generation Kinect, enhanced camera fidelity and software improvements have made skeletal tracking both more accurate and more efficient.

From the skeletal tracking data, the position information of the 25 key joints of the human body recognized by Kinect can be obtained, including the joints of the palm, so hand tracking can be computed from the palm coordinates. The 25 joints recognized by Kinect are shown in Fig. 2.

Figure 2.

The 25 skeletal joints tracked by Kinect

Dynamic gesture feature extraction

Dynamic hand gestures have three basic features: hand position, direction angle, and movement rate. During hand movement, they are effectively extracted in real time from the hand movement trajectory, and then the gesture is recognized from these three basic features in a comprehensive manner.

Positional features: the coordinates of the center of gravity of the valid points of the gesture trajectory, and the distance from that center of gravity to the current gesture position. (Cx, Cy) denotes the center of gravity of all valid gesture positions and Lt denotes the distance from the center of gravity to the current position on the gesture trajectory, i.e.: \[({C_x},{C_y}) = \frac{1}{n}\left( \sum\limits_{t=1}^{n} {x_t},\ \sum\limits_{t=1}^{n} {y_t} \right)\] \[{L_t} = \sqrt{{({x_{t+1}} - {C_x})}^2 + {({y_{t+1}} - {C_y})}^2}\]

Direction angles: two direction angles are calculated. One is the direction angle θ1 between the current gesture coordinates and the center of gravity of all valid points of the gesture trajectory; the other is the direction angle θ2 between two adjacent valid points of the gesture movement, i.e.: \[{\theta_1} = \arctan\left( \frac{{y_{t+1}} - {C_y}}{{x_{t+1}} - {C_x}} \right)\] \[{\theta_2} = \arctan\left( \frac{{y_{t+1}} - {y_t}}{{x_{t+1}} - {x_t}} \right)\]

Rate: the rate Vt plays an important role in trajectory recognition and represents the distance between two neighboring points of the gesture trajectory. It differs significantly at the beginning and end of the gesture and at corners. It is calculated as: \[{V_t} = \sqrt{{({x_{t+1}} - {x_t})}^2 + {({y_{t+1}} - {y_t})}^2}\]

The combination of three features, position, orientation angle, and rate, can improve the final result of dynamic gesture recognition. In this study the combination of three basic features is used for feature extraction of dynamic gesture motion trajectories collected by Kinect, a somatosensory interaction device.
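As an illustration of how these three features might be computed from a tracked palm trajectory, the following Python sketch evaluates (Cx, Cy), Lt, θ1, θ2 and Vt for a list of 2D positions. The sample trajectory is made up, and atan2 is used instead of a plain arctangent to avoid division by zero, which is a small deviation from the printed formulas.

```python
import math

def trajectory_features(points):
    """Compute position, direction-angle, and rate features for a 2D gesture trajectory.

    points: list of (x, y) palm positions sampled frame by frame (illustrative input).
    Returns one feature tuple (L_t, theta1, theta2, V_t) per consecutive pair of points.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n            # center of gravity (C_x, C_y)
    cy = sum(y for _, y in points) / n
    features = []
    for (x_t, y_t), (x_n, y_n) in zip(points, points[1:]):
        L_t = math.hypot(x_n - cx, y_n - cy)      # distance from the center of gravity
        theta1 = math.atan2(y_n - cy, x_n - cx)   # angle w.r.t. the center of gravity
        theta2 = math.atan2(y_n - y_t, x_n - x_t) # angle between adjacent points
        V_t = math.hypot(x_n - x_t, y_n - y_t)    # rate (distance per frame)
        features.append((L_t, theta1, theta2, V_t))
    return features

# Illustrative trajectory of palm positions
print(trajectory_features([(0, 0), (1, 0.2), (2, 0.8), (2.5, 1.6)]))
```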

HMM Improvement and Dynamic Gesture Recognition

There are two main state transition topologies for an HMM, as shown in Fig. 3. In Fig. 3(a), a state can only transition to itself or to the adjacent state with the next higher index, so states can only move from left to right; this is known as the left-right structure. In Fig. 3(b), every state can transition to any other state; this is known as the fully connected structure. The left-right structure is better suited to time-series problems with ordering constraints. Since a dynamic gesture action is exactly such an order-constrained time-series problem, the left-right structure is chosen as the state transition topology for dynamic gesture recognition.

Figure 3.

HMM topology structure

The position and direction angles of the gesture obtained from the analysis above yield a discrete gesture trajectory. To represent the gesture trajectory better, this study uses the Freeman chain code method. When determining boundaries and line segments, the current point is taken as the center and its surroundings are searched in 8 directions, which yields a discrete data sequence of 8 direction angles. To reduce errors caused by the user or the device during hand movement, the vector from the current hand position to the next hand position on the trajectory is mapped onto the 8 direction angles of the 8-direction Freeman chain code, and the direction with the largest match is taken as the final quantized result.
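One way to realize this quantization is to snap the movement vector between successive hand positions to the nearest of the 8 Freeman chain-code directions, as in the short Python sketch below (the trajectory values are illustrative, and the exact mapping rule used in the paper may differ in detail).

```python
import math

def freeman_direction(p_curr, p_next):
    """Quantize the movement vector between two hand positions to an 8-direction
    Freeman chain code (0 = east, counting counter-clockwise in 45-degree steps)."""
    dx = p_next[0] - p_curr[0]
    dy = p_next[1] - p_curr[1]
    angle = math.atan2(dy, dx) % (2 * math.pi)        # angle of the movement vector
    return int(round(angle / (math.pi / 4))) % 8      # nearest of the 8 chain-code directions

def encode_trajectory(points):
    """Map a hand trajectory to the observation sequence fed to the HMM."""
    return [freeman_direction(a, b) for a, b in zip(points, points[1:])]

# Illustrative trajectory: roughly rightward, then upward movement
print(encode_trajectory([(0, 0), (1, 0), (2, 0.1), (2.1, 1), (2.2, 2)]))
```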

The initialization of the HMM is very important for gesture recognition. The state transition matrix A is initialized as: \[A = \left( \begin{matrix} 0.5 & 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 0.5 & 0.5 \\ 0 & 0 & 0 & 0 & 0.5 \\ \end{matrix} \right)\]

The observation probability matrix B = {bjk} is initialized uniformly, with all values $b_{jk} = \frac{1}{M}$, and the initial state distribution is π = (1, 0, 0, 0, 0). Training this HMM yields a new HMM, which is ultimately the HMM used for gesture recognition in the experiments.
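The initialization described above can be written as a small helper, shown here as a Python sketch. Note that the last row of A is copied verbatim from the matrix above and therefore sums to 0.5; in practice that row is often normalized so that A is row-stochastic, which is an assumption on our part rather than something stated in the paper.

```python
import numpy as np

def init_left_right_hmm(n_states=5, n_symbols=8):
    """Initialize a left-right HMM as described above: each state can stay or move
    one step to the right (probability 0.5 each), B is uniform, and the chain starts in state 1."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = 0.5
        if i + 1 < n_states:
            A[i, i + 1] = 0.5
        # NOTE: the last row is left at 0.5 to match the printed matrix;
        # setting A[-1, -1] = 1.0 would make A strictly row-stochastic.
    B = np.full((n_states, n_symbols), 1.0 / n_symbols)   # b_jk = 1/M
    pi = np.zeros(n_states)
    pi[0] = 1.0                                           # pi = (1, 0, 0, 0, 0)
    return A, B, pi

A, B, pi = init_left_right_hmm()
print(A)
```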

Virtual Piano Playing Based on Mixed Reality

This section describes the virtual piano playing scheme, which supports users to play piano music and interact with the computer through gestures in mixed reality scenarios.

Mixed Reality Interface
Construction of a three-dimensional model

When wearing a HoloLens headset, the interface the user sees is a mixed reality scene that superimposes virtual objects on the real world, in which the central element is a three-dimensional piano model. To ensure that the 3D model of the piano is as close as possible to the shape of a real piano, a combination of 3D scanning and post-processing is used to build the model.

The 3D scan ensures the realism of the experience, but it has some drawbacks. Because the piano is a large object, the collected data contains millions of facets, which is not conducive to rendering and display on the HoloLens device, so further processing is needed. The original point cloud data is imported into Geomagic Studio, which automatically converts the point cloud into polygons; operations such as hole patching, surface reduction and registration are then performed, followed by manual repair of details. To keep the basic shape of the model while keeping its memory footprint as small as possible and ensuring smooth loading, the model was optimized with the optimization modifier in 3ds Max to remove empty splines and free points.

After the model is built, materials are added to make it more realistic. In the software, a material can be assigned to the model; here, a wood material was chosen. The wood material alone cannot display the different textures of each part of the piano, so a texture map needs to be attached. UV unwrapping disassembles a 3D model and spreads it out as a flat 2D image so that texture maps fit evenly and without distortion on the surface of the model, making the texture mapping more realistic. The texture maps are produced from the photographs taken by the camera; instead of the preset textures in the texture library, diffuse reflection maps are used, and the images are saved in JPG format at a resolution of 72 dpi and a quality of 8. The processed blank model is then pasted with the corresponding texture maps, completing a realistic, lightweight and complete piano model.

Coordinate mapping within Unity and HoloLens scenes

This system uses Kinect to capture the user’s position and movement, and updates the position of the 3D piano model in HoloLens in real time, so that it can follow the user to move and rotate instead of being fixed in the screen. It also performs the functions of displaying the position of the hands, highlighting the virtual piano keys that are pressed, and playing audio based on other transmitted data. Before transmission, the user’s skeletal coordinates obtained through Kinect should be converted to the coordinate system of HoloLens, i.e., the mapping of the coordinate systems of the two devices should be carried out.

Seven-parameter model

When two different 3D spatial Cartesian coordinate systems are converted into each other, a seven-parameter model is usually used. In this paper, a seven-parameter model is used to map the coordinate systems of the two devices; the transformation is described by seven parameters, i.e., three translation parameters, three rotation parameters, and one scale factor. Knowing the coordinates of three points in both coordinate systems, the transformation relationship between the two coordinate systems can be derived. The coordinate transformation is expressed as: \[{\left[ \begin{matrix} X \\ Y \\ Z \\ \end{matrix} \right]_A} = \lambda \left( \left[ \begin{matrix} \Delta X \\ \Delta Y \\ \Delta Z \\ \end{matrix} \right] + R{\left[ \begin{matrix} X \\ Y \\ Z \\ \end{matrix} \right]_B} \right) \tag{21}\]

Here λ is the scale parameter, R the rotation matrix, and (ΔX, ΔY, ΔZ) the translation vector; the subscript A denotes coordinates in the target coordinate system and B coordinates in the source coordinate system. The scale parameter is calculated first: if |P1P2|A and |P1P2|B denote the distance between points P1 and P2 in the two coordinate systems, then: \[\lambda = \frac{|{P_1}{P_2}{|_A}}{|{P_1}{P_2}{|_B}} = \frac{|{P_2}{P_3}{|_A}}{|{P_2}{P_3}{|_B}} = \frac{|{P_3}{P_1}{|_A}}{|{P_3}{P_1}{|_B}}\]

The rotation matrix R is a 3×3 orthogonal matrix with 3 degrees of freedom and can be constructed from an antisymmetric matrix S that contains only three variables a, b, and c; solving for these determines R: \[S = \left[ \begin{matrix} 0 & -c & -b \\ c & 0 & -a \\ b & a & 0 \\ \end{matrix} \right] \tag{23}\] \[R = (I + S){(I - S)^{-1}} \tag{24}\]

Knowing the coordinates of the three points P1, P2, P3 in the two coordinate systems A and B, substituting the coordinates of P1 and P2 into Equation (21) and subtracting eliminates (ΔX, ΔY, ΔZ). Substituting Equations (23)-(24) and rearranging gives: \[\left[ \begin{matrix} {X_{A12}} - \lambda {X_{B12}} \\ {Y_{A12}} - \lambda {Y_{B12}} \\ {Z_{A12}} - \lambda {Z_{B12}} \\ \end{matrix} \right] = \left[ \begin{matrix} -c(\lambda {Y_{B12}} + {Y_{A12}}) - b(\lambda {Z_{B12}} + {Z_{A12}}) \\ c(\lambda {X_{B12}} + {X_{A12}}) - a(\lambda {Z_{B12}} + {Z_{A12}}) \\ b(\lambda {X_{B12}} + {X_{A12}}) + a(\lambda {Y_{B12}} + {Y_{A12}}) \\ \end{matrix} \right] \tag{25}\] where XA12 denotes the x-coordinate of point P1 in coordinate system A minus the x-coordinate of point P2 in coordinate system A, and the other difference terms are defined analogously. Substituting the coordinates of P1 and P3 into the same form as Equation (25), combining the equations, and rearranging gives: \[\left[ \begin{matrix} a \\ b \\ c \\ \end{matrix} \right] = {\left[ \begin{matrix} 0 & -(\lambda {Z_{B12}} + {Z_{A12}}) & -(\lambda {Y_{B12}} + {Y_{A12}}) \\ -(\lambda {Z_{B12}} + {Z_{A12}}) & 0 & (\lambda {X_{B12}} + {X_{A12}}) \\ (\lambda {Y_{B13}} + {Y_{A13}}) & (\lambda {X_{B13}} + {X_{A13}}) & 0 \\ \end{matrix} \right]}^{-1} \left[ \begin{matrix} {X_{A12}} - \lambda {X_{B12}} \\ {Y_{A12}} - \lambda {Y_{B12}} \\ {Z_{A13}} - \lambda {Z_{B13}} \\ \end{matrix} \right]\]

Substituting a, b, and c into Equation (24) yields the rotation matrix R, and substituting the coordinates of any common point into Equation (21) yields the translation vector: \[\left[ \begin{matrix} \Delta X \\ \Delta Y \\ \Delta Z \\ \end{matrix} \right] = \frac{1}{\lambda }{\left[ \begin{matrix} X \\ Y \\ Z \\ \end{matrix} \right]_A} - R{\left[ \begin{matrix} X \\ Y \\ Z \\ \end{matrix} \right]_B}\]
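The derivation above translates into a short numerical routine: compute λ from point distances, solve the 3×3 linear system for (a, b, c), build R through the construction R = (I+S)(I−S)⁻¹, and finally recover the translation. The Python sketch below checks itself on synthetic, self-consistent data; the point coordinates and parameter values are illustrative, not measurements from the paper.

```python
import numpy as np

def cayley(a, b, c):
    """Rotation matrix from the antisymmetric matrix S via R = (I+S)(I-S)^(-1)."""
    S = np.array([[0.0, -c, -b],
                  [c,  0.0, -a],
                  [b,   a, 0.0]])
    I = np.eye(3)
    return (I + S) @ np.linalg.inv(I - S)

def solve_seven_parameters(pts_A, pts_B):
    """Recover scale, rotation and translation from three common points.

    pts_A, pts_B: 3x3 arrays; row i is the same physical point P_(i+1) expressed
    in coordinate systems A (HoloLens) and B (Kinect), following the derivation above.
    """
    P1A, P2A, P3A = pts_A
    P1B, P2B, P3B = pts_B
    lam = np.linalg.norm(P1A - P2A) / np.linalg.norm(P1B - P2B)   # scale parameter
    dA12, dB12 = P1A - P2A, P1B - P2B          # coordinate differences for P1P2
    dA13, dB13 = P1A - P3A, P1B - P3B          # coordinate differences for P1P3
    v12 = lam * dB12 + dA12
    v13 = lam * dB13 + dA13
    # linear system for (a, b, c): x and y components of P1P2, z component of P1P3
    M = np.array([[0.0,     -v12[2], -v12[1]],
                  [-v12[2],  0.0,     v12[0]],
                  [ v13[1],  v13[0],  0.0   ]])
    rhs = np.array([dA12[0] - lam * dB12[0],
                    dA12[1] - lam * dB12[1],
                    dA13[2] - lam * dB13[2]])
    a, b, c = np.linalg.solve(M, rhs)
    R = cayley(a, b, c)
    delta = pts_A[0] / lam - R @ pts_B[0]      # translation from any common point
    return lam, R, delta

# Illustrative check: generate consistent data and recover the parameters.
R_true = cayley(0.1, -0.05, 0.2)
lam_true, delta_true = 1.5, np.array([0.3, -0.2, 1.0])
pts_B = np.array([[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [0.4, 1.1, 0.5]])
pts_A = np.array([lam_true * (delta_true + R_true @ p) for p in pts_B])
print(solve_seven_parameters(pts_A, pts_B))
```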

The realization of coordinate mapping

To determine the mapping between the two coordinate systems, the coordinates of at least three common points must be obtained in both systems. The origin of the Kinect coordinate system in Unity is located at the center of Kinect's infrared camera, from which the human skeletal coordinates are acquired. Skeletal tracking can accurately calibrate the 25 skeletal joints of the human body and track their spatial positions in real time, so it is most convenient to choose the three points of the Kinect coordinate system on the human body. The coordinate system of HoloLens is a coordinate system relative to the user: it takes the user's body as the origin and moves with the user. When the user's head moves while the body stays still, HoloLens redraws the scene in the current field of view, and the objects in the scene remain stable relative to the user's body. The position of the generated 3D piano model changes with the user and there is no other fixed reference, so it is most appropriate to select three skeletal points on the user's own body for the calculation. In this paper, the skeletal points of the head, the center of the hips, and the left hand were chosen. The unit in HoloLens is the meter, so the positions of these points in the HoloLens coordinate system can be obtained from the dimensions of the human body in the real world. Combined with the coordinates obtained from Kinect, this gives the coordinate values of the three common points in both coordinate systems; the transformation parameters are then calculated from these three pairs of coordinates, which yields the mapping between the two coordinate systems.

Mixed Reality Scene Construction

The construction of the mixed reality scene and the logical implementation of the system are done in Unity3D. When the user makes a specified gesture, a piano model appears in the mixed reality scene, and the user can interact with it by touching it with both hands. However, because the mixed reality image is superimposed on the real scene, the user's hands are inevitably occluded, and with no physical object to touch, the user loses the sense of proportion and does not know how to act on the virtual piano. Therefore, the system displays the position of the user's hands on the 3D piano model in real time, so that the user can follow their own interaction by looking at the concentric markers. To help the user see more clearly which "keys" are pressed, the "keys" triggered by the user's gestures on the piano model are highlighted and the corresponding piano sound is played, giving the user the experience of playing the piano in mid-air.

One drawback of HoloLens is its small field of view, a consequence of the holographic diffractive waveguide design. The display is confined to a rectangular viewing area, which may cause users to ignore content outside it, or to become confused when viewing a large object of which only a part is visible. The user interface therefore provides a hint mode: if users are unfamiliar with the structure of the piano, or do not even know where the playing area is, these hints assure them that their left and right hands are in the correct positions. Users who are already familiar with the piano can turn this guidance off to keep the display uncluttered. Arrows always point to the current positions of the left and right hands, together with hints at each marker, to avoid confusion when the user cannot see both hands in the current view.

Implementation of the audio section

When the user’s movement is captured by Kinect, the system gets the user’s intention through the gesture recognition algorithm, finds the corresponding clip from the sound library to synthesize and play it, and realizes the effect that the user can conjure up the piano melody by waving his hands in the air. The process of mapping action to audio is as follows:

Sound library construction

The most important quality of a musical instrument is its sound, so the sound library is the core of the virtual performance system. The sound library is constructed by combining the Kong Audio sound library with actual recordings. Kong Audio provides software sound sources for folk music, offering very professional sounds for musicians engaged in arranging and composing, and covers a variety of ethnic sound sources. In this study, the piano timbre exported from Kong Audio is used as the basic timbre library, and the missing parts are supplemented by actual recording.

Mapping of Motion and Audio

The user’s action data are collected to obtain the position of the hands, the rate of movement of the hands, and the result of gesture recognition, and the information of the audio to be called can be obtained by comprehensively analyzing these three.

First, the real-time movement of the hands indicates which key is currently being pressed; however, several keys are sometimes touched one after another, so the movement must be analyzed together with the hand gestures. The gesture recognition result tells the system the direction of movement of the user's hands, but since individual fingers are not recognized, no distinction in timbre is made between the same operation performed with different fingers. With this information, the corresponding audio file can be called from the sound library, after which the playback speed is modified according to the rate of the two hands; depending on the type of sound, the rates of the left and right hands affect the result to different extents, which is controlled by coefficients.
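A minimal sketch of such a mapping is given below. All names here are hypothetical: the sound-library layout, the gesture labels and the speed coefficients are assumptions made for illustration; the actual system selects clips from the Kong Audio based library inside the C# application.

```python
def select_audio(key_id, gesture, left_rate, right_rate,
                 sound_library, speed_coeff=(0.3, 0.7)):
    """Pick an audio clip for the pressed key and derive a playback speed.

    sound_library: dict mapping (key_id, gesture) -> audio file path (hypothetical layout).
    speed_coeff: weights of the left/right hand rates on playback speed (assumed values).
    """
    clip = sound_library.get((key_id, gesture))
    if clip is None:                       # fall back to the key's default timbre
        clip = sound_library.get((key_id, None))
    wl, wr = speed_coeff
    speed = 1.0 + wl * left_rate + wr * right_rate   # faster hand motion -> faster playback
    return clip, speed

# Illustrative usage with a toy library
library = {(60, None): "sounds/c4.wav", (60, "press"): "sounds/c4_press.wav"}
print(select_audio(60, "press", left_rate=0.1, right_rate=0.4, sound_library=library))
```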

Virtual Piano Playing Experiment and Result Analysis

In this chapter, the proposed piano virtual performance model based on interactive gesture recognition is applied to perform gesture recognition experiments and virtual performance experiments respectively, and the results are analyzed to evaluate the performance of the model.

Gesture Recognition Experiment

In this paper, gesture recognition is carried out by a fitting method: the virtual hand is first fitted to the current gesture, and the gesture is then determined by comparing the differences in curvature between the fingers of the fitted virtual hand.

The 3D virtual hand is created from two basic geometries, cuboids and capsules. Each finger is divided into three segments, the base, middle and tip joints, represented by three capsule bodies connected end to end. Each finger joint has a parametric representation of its curvature, from which the total curvature of each finger can be obtained. During piano playing the thumb flexes to a lesser extent, so this paper compares only the curvatures of the remaining four fingers when determining the gesture. By repeatedly changing gestures and recording the curvature of each finger, the average curvatures shown in Table 2 were obtained.

Table 2. Average curvature of the fingers under each gesture

Finger          宀     one    two    three   X
Index finger    12     96     15     5       8
Middle finger   14     12     76     6       5
Ring finger     8      8      40     76      57
Little finger   20     3      4      8       104

Note: "宀" represents the gesture state in which no finger is bent.

By analyzing and comparing the finger curvatures in each gesture state, the classification parameters for gesture judgment can be determined. In the "one", "two", "three", and "X" gesture states, the index finger, middle finger, ring finger, and little finger, respectively, show a high degree of curvature, with values all greater than 75. In the "宀" gesture state, the average curvature of the four fingers is not 0. This is because the gesture-fitting initialization determines a fixed virtual hand size from the current gesture area, while the distance between the hand and the camera may change slightly as the gesture changes; when the hand moves slightly farther from the camera than before, the area-based fitting can misjudge this as a slight bending of the fingers even though the four fingers are not bent. Experimental observation shows that in this case the curvature of the four fingers never exceeds 55, even in the worst case, because the distance between the hand and the camera does not change much during playing. A threshold can therefore be set: when the curvature of all four fingers is below this threshold, it is judged that no finger is bent.
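The threshold rules described above can be summarized in a small classification routine. The following Python sketch uses the thresholds of 75 (a finger counted as strongly bent) and 55 (upper bound of the fitting error when no finger is bent) reported above; the handling of ambiguous readings is our own assumption.

```python
BEND_THRESHOLD_HIGH = 75   # a finger with curvature above this is considered strongly bent
BEND_THRESHOLD_NONE = 55   # below this for all four fingers: no finger is bent

GESTURES = ["one", "two", "three", "X"]   # gesture i <-> finger i strongly bent

def classify_gesture(curvatures):
    """Classify a gesture from the curvatures of (index, middle, ring, little) fingers,
    following the threshold rules described above."""
    if all(c < BEND_THRESHOLD_NONE for c in curvatures):
        return "宀"                                    # no finger bent
    bent = [i for i, c in enumerate(curvatures) if c > BEND_THRESHOLD_HIGH]
    if len(bent) == 1:
        return GESTURES[bent[0]]
    return None                                        # ambiguous reading: no decision

# Illustrative readings (cf. Table 2)
print(classify_gesture([96, 12, 8, 3]))   # -> "one"
print(classify_gesture([12, 14, 8, 20]))  # -> "宀"
```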

Experiments show that the algorithm in this paper completes the gesture-fitting process fairly accurately at a frame rate of about 15 Hz. Although this falls short of the conventional real-time requirement of 30 frames per second, the current processing performance is sufficient for the gesture recognition used in virtual playing, because gestures do not change particularly quickly during actual piano playing.

In the actual playing process, the Kinect camera captures the gesture image, the contour of the segmented hand is extracted and its Hu moments are calculated; the gesture recognition results obtained are shown in Table 3.

Table 3. Results of gesture recognition

Gesture              Correct count   Errors   Total   Accuracy
Fingers spread       294             6        300     98.00%
Half-clenched fist   293             7        300     97.67%

Table 3 lists the results of recognizing the two gestures, fingers spread and half-clenched fist; the recognition rates reach 98.00% and 97.67% respectively, both greater than 95%. The recognition rate achieved in this experiment is very high mainly because only two gestures are recognized, one with the fingers spread and one with the hand half clenched into a fist. The contours extracted from the two gestures have obvious feature differences, and the program uses an either-or decision rule, so unrecognizable cases do not occur. Secondly, the hand region segmentation combines depth and color images, so the obtained gesture images are more accurate and less affected by environmental conditions, and the error in the computed Hu moments is therefore low.

Virtual Performance Experiment

A common problem in virtual piano playing systems is setting the size of the virtual piano model, i.e., keeping the player's hands in proportion to the piano so that the scene looks natural. This requires adjusting the drawing scale of the piano model according to the size of the player's hands. With a regular camera, the method is to segment the hands, calculate the hand area, and adjust the drawing parameters of the piano based on that area. With the Kinect device, the adjustment is based on the depth of the hand. In the Kinect coordinate system, the Kinect is located at the origin and the Z axis is aligned with the camera's viewing direction. The user's hand is obtained through NITE and the z value of the hand's current 3D coordinates is recorded, so the z value roughly represents the distance of the hand from the Kinect. The drawing scale of the piano model is then adjusted according to this z value. The relationship between the hand z value and the piano scale is shown in Table 4.

Table 4. Relationship between hand z value and piano drawing scale

Hand z value   Piano ratio     Hand z value   Piano ratio     Hand z value   Piano ratio
600            0.60            850            0.48            1100           0.45
650            0.60            900            0.45            1150           0.48
700            0.60            950            0.43            1200           0.30
750            0.53            1000           0.40            1250           0.30
800            0.50            1050           0.43            1300           0.30

By analyzing the data, a simple functional relationship between the hand z value and the piano drawing scale can be obtained, as shown in Equation (28): \[scaleRate = \left\{ \begin{array}{ll} 0.6, & z \le 700 \\ 0.4 - (z - 1000) \times 0.1/200, & 700 < z < 1200 \\ 0.3, & z \ge 1200 \\ \end{array} \right. \tag{28}\]

When the drawing scale exceeds 0.6, the piano model extends beyond the window and the hand is too close to the Kinect sensor, so the drawing scale is fixed at 0.6 when z is smaller than 700. When the hand z value is larger than 1200, the proportional change between the hand size and the piano model becomes small, and the hand area captured by the Kinect is too small to be used reliably for gesture recognition; the drawing scale is therefore fixed at 0.3 when z is greater than 1200.
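Equation (28) is straightforward to implement; the sketch below assumes the hand z value is expressed in the same units as Table 4 (our reading is millimeters from the Kinect sensor).

```python
def piano_scale_rate(z):
    """Map the hand depth z (distance from the Kinect) to the piano drawing scale, Eq. (28)."""
    if z <= 700:
        return 0.6                       # hand too close: clamp the scale
    if z >= 1200:
        return 0.3                       # hand too far: clamp the scale
    return 0.4 - (z - 1000) * 0.1 / 200  # linear segment between the two clamps

# Illustrative values (compare Table 4)
for z in (700, 800, 1000, 1200):
    print(z, round(piano_scale_rate(z), 3))
```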

Piano Virtual Performance Interaction Experience Hierarchical Analysis and Quantitative Research

This chapter aims to gain an in-depth understanding of users' human-computer interaction experience in the virtual piano performance environment, divided into five layers of user experience: function, content, interface, emotion and value. A quantitative questionnaire survey was conducted to collect extensive user feedback, including frequency of use and experience ratings, and reliability and related tests were performed in SPSS in order to gain a comprehensive understanding of users' habits and feelings towards products supporting virtual piano performance.

Questionnaire design

The questionnaire design in this study innovatively uses function, content, interface, emotion, and value as the division of the tiers, and these five tiers provide more specific elements that are related to actual user interactions, which can be more easily translated into specific survey questions and assessment indicators.

Functionality is defined as the features and services offered by a product that support the user’s human-computer interactions and operations. Content refers to the information contained in the product and how it fulfills the user’s purpose and needs. Interface refers to the visual design and page architecture of the product, and how it affects the user’s usage and perception. Emotion refers to the emotional expression of the product and how it stimulates the user’s feelings and emotions. Value cuts across all tiers and refers to the value and benefits a product can bring to users, as well as how it meets their expectations and preferences.

The specificity of this segmentation makes it easy to apply and comprehensive to observe in real-world research, as researchers are able to categorize user feedback directly into well-defined tiers that can be observed in detail and understood in a holistic manner. Therefore, the selection of function, content, interface, emotion, and value as the hierarchy provides a clear, practical, and targeted analytical framework for this study.

Research Results and Analysis

A total of 124 questionnaires were collected in this study, and the results of the collected questionnaires were analyzed for reliability, and the Cronbach’s alpha coefficient was calculated to be 0.904, which indicated that the collected questionnaire data possessed a high degree of internal consistency and reliability.
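For reference, Cronbach's alpha can be computed from an item-score matrix as shown in the Python sketch below. The response matrix here is randomly generated for illustration only; the reported value of 0.904 comes from the authors' SPSS analysis of the real questionnaire data.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Illustrative random 124 x 20 response matrix on a 7-point scale
rng = np.random.default_rng(1)
scores = rng.integers(1, 8, size=(124, 20))
print(cronbach_alpha(scores))
```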

In the results and analysis section, the questionnaire is first compared comprehensively across the interaction levels, and then analyzed along multiple dimensions, namely the function, content, interface, emotion, and value levels. Given that this paper focuses on the mixed reality interface and the human-computer interaction effects of virtual piano playing, the analysis concentrates on the parts related to human-computer interaction performance at the interaction, function, interface, and emotion levels.

Comparison of Interaction Levels

A one-sample t-test was conducted on the users' experience scores at each level; the results are shown in Table 5, where * indicates p<0.05 and ** indicates p<0.01.

Table 5. One-sample t-test analysis results

Name                    Sample size   Min    Max    Mean    SD      t       p
Functional experience   124           1.00   7.00   4.721   1.579   3.517   0.002**
Content experience      124           1.00   7.00   4.602   1.591   3.251   0.001**
Interface experience    124           1.00   7.00   4.685   1.684   3.243   0.000**
Emotional experience    124           1.00   7.00   4.806   1.607   4.019   0.001**
Value experience        124           1.00   7.00   4.643   1.825   2.858   0.016*

From the analysis results in Table 5, the p-values for functional, content, interface, emotional, and value experience are 0.002, 0.001, 0.000, 0.001, and 0.016 respectively, all less than 0.05, indicating significance. Comparing the means with the neutral scale score of 4 shows that all experience scores are significantly higher than the neutral score, i.e., users hold a favorable attitude.
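The comparison against the neutral score of 4 corresponds to a standard one-sample t-test, sketched below in Python with synthetic ratings; the actual analysis was carried out in SPSS on the real survey data, so the numbers printed here are illustrative only.

```python
import numpy as np
from scipy import stats

# Illustrative synthetic ratings (n = 124, 7-point scale); the real data come from the survey.
rng = np.random.default_rng(0)
functional = np.clip(rng.normal(4.72, 1.58, 124), 1, 7)

# One-sample t-test against the neutral scale score of 4
t_stat, p_value = stats.ttest_1samp(functional, popmean=4.0)
print(t_stat, p_value)
```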

Next, the descriptive analysis of the scores for each interaction level is shown in Figure 4. Combined with Table 5, the users' emotional experience (M=4.806, SD=1.607, n=124) scored the highest, indicating that users usually receive an emotional experience that meets their expectations and preferences during virtual piano playing and human-computer interaction. The functional experience (M=4.721, SD=1.579, n=124) scored next highest, indicating that the functions provided by the virtual piano playing companion product perform well. The interface experience (M=4.685, SD=1.684, n=124) also shows a high level of satisfaction with the design and ease of use of the mixed reality interface. The value experience (M=4.643, SD=1.825, n=124) scored in the middle, showing that users can derive some value from the product. The content experience (M=4.602, SD=1.591, n=124) scored the lowest, indicating that users are somewhat satisfied with the content of the product but that there is room for improvement.

Figure 4.

Descriptive analysis results of each level of experience score

Functional Hierarchy Analysis

In this paper, we focus on analyzing the product’s function in terms of interaction gestures at the function level, and the usage frequency and expectation scores of various interaction gestures are shown in Figure 5.

Figure 5.

Interactive gesture current frequency and expected score

As can be seen from Figure 5, the long-press gesture has a low frequency of use (F=1.94) but the highest expectation score (F′=6.09); users want the long-press gesture to provide more functions or options. Zoom gestures are used moderately (F=4.27) but are highly desired (F′=5.53); users may perceive them as effective for resizing views or content. Sliding gestures are used moderately often (F=3.51) but have a low expectation score (F′=2.64), while dragging gestures are used most frequently (F=6.20) with the lowest expectation score (F′=1.41), suggesting that the product may be prone to mis-operation or insensitivity. The tap gesture also has a high frequency of use (F=5.49) and a relatively low expectation score (F′=4.34), indicating a high level of user satisfaction with this function, though there is still room for improvement. The rotation gesture has the lowest frequency of use (F=1.88) while its expectation score (F′=3.51) is comparatively high, suggesting that users would like more novel functions to be added.

Considering the results of the analysis, the piano virtual performance product can improve the user experience and satisfaction by improving the interaction gestures and operations. In addition, the problem of mis-touch needs to be paid attention to, and mis-touch can be reduced by adjusting the interface sensitivity and providing clearer hints, so as to improve user participation and loyalty, and better satisfy the user’s human-computer interaction needs.

Interface Hierarchy Analysis

The results of the questionnaire study at the interface level are shown in Figure 6, where the values indicate the percentage of different levels of satisfaction. The specific analysis is as follows:

Figure 6.

User reviews at the interface level

In terms of clarity and ease of use, the overall level of satisfaction was high, and users considered the interface design clear and easy to use. However, a portion of users suggested improvements, emphasizing that there is room to do better; the suggestions were to meet users' diverse needs by optimizing interface elements, providing personalized options, or otherwise improving the user experience.

In terms of aesthetics and comfort, there is a polarization of user evaluations, which may reflect differences in how different users feel about visual design and comfort. It is recommended that this piano virtual performance product study the feedback of different user groups in the mixed reality interface design and personalize the design to better meet the expectations of different users and improve the overall quality of user experience.

Emotional Hierarchy Analysis

The results of the questionnaire study in terms of emotional resonance at the emotional level are shown in Figure 7, which shows that there is a portion of users who do not have a strong feeling of emotional resonance, indicating that there is potential room for improvement. The emotional expression of the content needs to be improved to better meet the emotional needs of a wider range of users. This can be achieved by improving the piano’s 3D model and increasing user immersion.

Figure 7.

User ratings of emotional resonance

Conclusion

In this paper, we constructed a virtual performance environment for piano teaching, designed a gesture recognition method based on the Hidden Markov Model, and improved the human-computer interaction performance of the virtual piano playing scheme. The results are as follows:

In the "one", "two", "three", and "X" gesture states, the index, middle, ring, and little finger, respectively, had high curvature, all greater than 75. In the worst case the curvature of the four unbent fingers did not exceed 55, so a threshold can be set: when the curvature of all four fingers is below this threshold, it is determined that no finger is bent. In addition, the recognition rates for the two gestures, fingers spread and half-clenched fist, reach 98.00% and 97.67% respectively, both greater than 95%. This shows that the Hidden Markov-based gesture interaction recognition method in this paper has a high recognition rate and is suitable for human-computer interaction tasks in virtual piano performance.

The following improvement strategies can be derived from the quantitative study of hierarchical analysis and questionnaire research on the interaction experience of virtual piano performance:

1) The development of a personalized interface is the key to improving the user interaction experience, allowing users to customize the color theme, font size, and even button positions to better enhance their interface experience. And integrating user feedback into the platform makes it part of continuous improvement.

2) In terms of richness and diversity, platforms can leverage AI technology to further personalize the user experience by providing more relevant content and suggestions based on user interests and behaviors. This will increase user loyalty and retention.

3) In terms of interactivity and engagement: First, in terms of interface-operation interaction, ensure smooth transitions and real-time feedback to provide a seamless user experience, simplify operations and navigation, shorten the user learning curve, and keep interactions predictable so that users have a clear understanding of the results of their operations. Second, enhance social integration and functionality expansion, including the introduction of more social interaction features, optimization and refinement of friend interactions and private messaging systems to facilitate connections and interactions between users. Third, improve emotional design and resonance by optimizing content and audio-visual effects to enhance the user’s sense of immersion and emotional resonance, which will attract deeper user participation.
