Open Access

A study of auditory-associative musical emotion based on multidimensional signal processing techniques

17 March 2025


Introduction

Audio-visual association refers to the interrelationship between the visual and auditory senses, akin to synesthesia (the “general sense”). In music appreciation, it is applied mainly to integrate students’ visual and auditory senses so that they gain a richer musical experience and improve their appreciation ability [1-3]. Music is non-representational by nature, yet current music teaching is often shallow: teachers neglect students’ emotional experience of music, so students’ understanding remains on the surface [4-5]. To improve students’ musical aesthetics and achieve the teaching goals, teachers can introduce audio-visual mechanisms into music teaching so that students receive both visual and auditory sensory stimulation.

Traditional music teaching is based on “listening”, but in terms of teaching effect, the listening experience alone cannot achieve the best appreciation or present the full charm of music [6-8]. Audio-visual association fully mobilizes students’ visual and auditory senses, which interact to produce sensory stimulation and create a good channel for music appreciation [9-11]. Audio-visual association arose to help students understand the connotation and emotion of music; music usually has its own emotional theme, which can evoke emotional resonance [12-15]. Teachers should let students perceive musical emotions deeply on the basis of hearing and vision, which also soothes students’ moods and plays a positive role in their mental health [16-19]. By creating diversified music appreciation contexts in a variety of ways, teachers enable students to have a more realistic contextual experience and to obtain stronger audio-visual sensations through a vivid and enthusiastic atmosphere [20-21].

This paper uses the Hevner emotion model to study audio-visual association and to build video and music datasets. For video signals, an attention mechanism is introduced into the VGG16 network, which is trained by computing the core regions of video images so that feature data can be screened and compared. For the auditory channel, the audio is first segmented into multiple parts and low-level features are extracted from each segment; a CNN network model is then constructed to characterize the emotional influence of these low-level features. The extracted video and audio features are then fed into a fusion module, where cosine similarity is used for emotional association labeling. Finally, experiments on the RML and SAVEE databases demonstrate the effectiveness and accuracy of the proposed algorithms.

Research on multi-dimensional signal recognition of musical emotion
Physiological basis of synesthesia

Synesthesia is a physiological phenomenon shared by humans. Table 1 describes the different sensory results produced when external stimuli act on the various senses of the human body. The human cerebral cortex, like that of other animals, has regions whose structure and function each have their own characteristics. When human sensory organs encounter external stimuli, the stimuli are transmitted by afferent nerves to the cerebral cortex and then enter the corresponding brain regions, where they cause excitation. Although the sensory organs have their own roles, the various regions of the cerebral cortex are not completely isolated from each other; there are many “overlapping” parts at the edges of the cortical areas. In other words, there are influences that connect, coordinate, and communicate. During “excitation differentiation” there is also “excitation generalization”, resulting in “sensory displacement”. The human central nervous system therefore carries out a variety of coordinated functional activities as an organic whole. The “interconnection of the five senses” is the physiological basis of synesthesia.

Table 1. Physiological basis of synesthesia

External stimulus→ Brain region acted upon→ Information formed
Light waves→ Occipital lobe→ Visual information
Sound waves→ Temporal lobe→ Auditory information
Odours→ Medial temporal lobe→ Olfactory information
Tangible objects→ Parietal lobe→ Tactile and kinesthetic information

Music appreciation is a practical activity in which people perceive, experience, and understand the art of music; it is an indispensable link in the whole of musical practice. According to the Marxist principle that art is a form of human social ideology, music as an artistic activity contains human thoughts and emotions in addition to musical sound. From the structure of music we can see that when people appreciate music, the first thing they come into contact with is its sound, which is also the most superficial layer. To appreciate a piece of music, we should not stay at the surface but go deeper to understand the author’s thoughts and feelings, the real connotation being expressed, and the attitude towards life. Therefore, in the appreciation process, we have to follow the melody and rhythm of the music and engage in association; to achieve this, we need to understand the state of the music. This is the real way to appreciate music.

The correspondence between the hierarchical structure of musical works and the psychological elements of music appreciation is shown in Figure 1. From the illustration we can see the relationship between the structural hierarchy of a musical work and the psychological elements of its appreciation. A musical work contains two layers: the surface layer and the inner layer, the latter being the emotional connotation that the music carries. When we only hear the sound of the music, we remain at the surface layer; only when we experience it through emotion, imagination, and association can we truly reach the level of understanding and realization, which is the realm of thought shown in the diagram.

Figure 1. Elements of music emotion appreciation

Emotional analysis of music

With the development of networks and information technology, intelligent audio and video acquisition and processing technologies are widely used, and pictures, audio, and video have become common ways for people to publish news, communicate, exchange ideas, and share. Research has shown that different kinds of multimedia information carry content that helps the understanding of high-level semantics, and the visual and auditory information of these different modalities often complement each other. When a viewer faces a vast scene, such as beautiful grassland scenery, soothing music played on instruments unique to the grasslands, such as the horse-head fiddle, often brings a more harmonious emotional experience than the picture alone; but what kind of emotional experience would people have if an exciting piece of music were played instead? From this phenomenon we infer that, while watching a video, the image that produces visual stimulation and the music that produces auditory stimulation inevitably influence each other in their effect on people, and thus jointly affect human emotion.

The analysis of visual and auditory multidimensional signals also shows that this complementary audio-visual information has important value for improving the understanding of multimedia semantics. Especially in fields such as cultural creativity, realizing the matching, recommendation, and correlation evaluation of images and music with auxiliary tools has broad application value. After being processed by the human brain, the images, sounds, and other information presented to an audience can cause changes in their emotional state. It is important to note that emotional experience is a complex phenomenon that cannot be explained by a single mechanism; no single mechanism covers all instances of emotional evocation. Because multiple mechanisms are involved in human emotional responses, it remains challenging to summarize a comprehensive theoretical framework from the existing theoretical hypotheses. On the one hand, an individual’s emotional experience, whether visual or auditory, is influenced by the culture and education he or she has experienced; on the other hand, cross-channel audio-visual associations are more complex than single-channel emotional experiences, involving both physiological and psychological associations and being closely related to the learning and memory systems. Therefore, this paper uses information science methods, combined with theories from psychology, aesthetics, brain science, neuroscience, and computer science, to establish an emotion-semantics-based associative model of the human audio-visual association experience.

An associative model of emotional semantics

Building an associative model based on emotional semantics requires a detailed definition and model of emotion. Emotion is a complex state that integrates feelings, thoughts, and behaviors; it is a person’s psychophysiological response to internal or external stimuli.

Among the many emotion models available, this paper adopts the Hevner emotion model to study the audio-visual associative model and uses the Hevner emotion ring to label the video and music datasets constructed in this paper.

Through experiments, Hevner selected a total of 67 adjectives to describe musical emotion. These 67 adjectives are categorized into eight classes according to their emotional differences and arranged in a ring structure, forming an emotion ring model with eight emotion categories. Emotion categories with a progressive relationship are placed in adjacent cells; that is, in this ring structure two adjacent emotion categories are similar but differ in degree, so the transition between the eight categories is smooth. The emotion vectors of video images and music can therefore be expressed as Eq. (1) and Eq. (2): $E_{video} = \{X_V, X_{Di}, X_{Sa}, X_{Dr}, X_{So}, X_G, X_J, X_E\}$, $E_{music} = \{Y_V, Y_{Di}, Y_{Sa}, Y_{Dr}, Y_{So}, Y_G, Y_J, Y_E\}$

Here $X_i$ and $Y_i$ denote the image and music emotion components for power, solemnity, sadness, dreaminess, soothing, elegance, joy, and excitement, respectively.
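
As a concrete illustration of these emotion vectors, the following minimal Python sketch (the scores and variable names are hypothetical, not taken from the paper’s datasets) represents one video clip and one music clip as 8-dimensional vectors over the Hevner categories:

```python
import numpy as np

# The eight Hevner categories, in the order used in Eqs. (1)-(2)
HEVNER_CATEGORIES = ["power", "solemnity", "sadness", "dreaminess",
                     "soothing", "elegance", "joy", "excitement"]

# Hypothetical annotation scores for one video clip and one music clip
E_video = np.array([0.7, 0.1, 0.0, 0.0, 0.1, 0.2, 0.6, 0.8])
E_music = np.array([0.6, 0.2, 0.0, 0.1, 0.0, 0.1, 0.7, 0.9])

# Each component X_i / Y_i corresponds to one emotion category
for name, xv, ym in zip(HEVNER_CATEGORIES, E_video, E_music):
    print(f"{name:>11}: video={xv:.1f}  music={ym:.1f}")
```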

Visual Emotional Feature Representation

In general, the story in a video may contain different stages, and emotions are mainly evoked by a few keyframes and their discriminative regions; only a limited number of keyframes can directly convey and determine emotion. However, existing visual emotion recognition methods mainly focus on using image features to bridge the emotion gap, ignoring the fact that emotion may be determined by only a few keyframes in discrete segments, while the other video frames merely provide background and contextual information. Therefore, this paper seeks to identify the regions of emotional expression in video images. Visual attention in deep learning directs the network to focus on regions of interest that are relevant to a particular recognition task and to avoid computing features from irrelevant image regions, yielding better performance. In this paper, the attention mechanism is therefore introduced into the VGG16 network to screen the core regions of an image for training; through the computation of the attention layers, feature data are screened and compared, and a subset of the features is used instead of the whole feature set for training, which reduces redundant data and improves training efficiency and experimental accuracy. Specifically, a channel attention network and a spatial attention network are added to the VGG16 backbone, the softmax layer of VGG16 is discarded, and the output of the fc2 layer of VGG16 is used as the final image feature.

VGG16 has a powerful capability for extracting image features through its convolutional layers. Therefore, this paper uses the feature maps generated by the convolutional layers to study the correlation between image features and emotions. Specifically, for VGG layer $l$, the feature map extracted from the original image is represented as Eq. (3): $F^l = [f_1^l, f_2^l, \ldots, f_{x^2}^l] \in \mathbb{R}^{c \times x^2}$

where $f_i^l$ denotes the pixel-wise vector of the layer-$l$ convolution output at spatial location $i$, with $i$ ranging from 1 to $x^2$; $x^2$ is the total number of spatial locations in the layer and $c$ is the number of feature channels. Thus, for each hidden layer $l$, the feature map covers $x \times x$ spatial regions, each described by a $c$-dimensional vector.

The spatial attention module can automatically explore the different contributions of different regions in keyframes for better emotion characterization of video images. To enhance the contrast between the influence of different regions on emotion, this paper introduces a spatial attention mechanism into the VGG16 network: the local features of selected intermediate convolutional layers of VGG16 are fused with the global features of the fully connected layer FC-1 and cascaded with the fully connected layer FC-2, generating the final saliency map $S$. To this end, a compatibility score $C(F^l, S) = \{c_i\}_{i=1}^{n}$ is defined, where $n = x^2$, and $c_i$ is computed as shown in Eq. (4): $c_i = \langle u, f_i^l + s_i \rangle,\ i \in \{1, \ldots, n\}$

where $\langle \cdot, \cdot \rangle$ denotes the dot product and the weight vector $u$ is a learnable parameter. The resulting compatibility scores are then normalized using the softmax operation, as shown in Eq. (5): $\alpha_i = \frac{\exp(c_i)}{\sum_{j=1}^{n} \exp(c_j)},\ i \in \{1, \ldots, n\}$

The $\{\alpha_i\}$ are called attention coefficients; they highlight emotional regions and suppress irrelevant ones. The elements of each feature map are then weighted by their corresponding attention coefficients $\alpha_i$, and the vector $f_\alpha^l$ produced by each layer is computed by summing the weighted feature maps, as given in Eq. (6): $f_\alpha^l = \sum_{i=1}^{n} \alpha_i f_i^l$
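
The following PyTorch sketch illustrates the computation in Eqs. (4)-(6). It is a minimal, hypothetical module rather than the paper’s released code; the tensor shapes and the way the global FC-1 feature is broadcast to every spatial location are assumptions.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Sketch of Eqs. (4)-(6): compatibility c_i = <u, f_i + s>,
    softmax normalization, and the attention-weighted sum."""

    def __init__(self, channels: int):
        super().__init__()
        # learnable weight vector u from Eq. (4)
        self.u = nn.Parameter(torch.randn(channels))

    def forward(self, local_feats: torch.Tensor, global_feat: torch.Tensor):
        # local_feats: (batch, n, c) local conv features f_i^l, n = x*x locations
        # global_feat: (batch, c) global FC feature, broadcast to every location
        combined = local_feats + global_feat.unsqueeze(1)           # f_i^l + s_i
        scores = torch.einsum("bnc,c->bn", combined, self.u)        # Eq. (4)
        alpha = torch.softmax(scores, dim=1)                        # Eq. (5)
        attended = (alpha.unsqueeze(-1) * local_feats).sum(dim=1)   # Eq. (6)
        return attended, alpha


# usage with hypothetical shapes: a 14x14 spatial grid and 512 channels
if __name__ == "__main__":
    feats = torch.randn(2, 14 * 14, 512)
    glob = torch.randn(2, 512)
    out, weights = SpatialAttention(512)(feats, glob)
    print(out.shape, weights.shape)  # torch.Size([2, 512]) torch.Size([2, 196])
```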

Auditory music emotion representations

Similar to images, the selection and characterization of music features is one of the key steps in cross-modal retrieval tasks. For computers, musical emotion is not directly accessible and must be represented by features of the music, so characterizing music emotion requires analyzing and studying musical features; constructing such features allows a deeper analysis of the influence of music on emotion. To represent local musical content, this paper splits the music into shorter segments, called local frames (or window excerpts), and extracts multiple low-level features from each frame. The low-level features of the music are extracted first, and a CNN network is then constructed to map the extracted features to emotional characteristics.
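
As an illustration of this segment-level extraction step, the sketch below splits a track into fixed-length frames and computes per-frame MFCC statistics. It assumes the librosa library; the segment length, sampling rate, and the choice of MFCCs as the low-level descriptor are assumptions, since the paper does not list its exact frame settings here.

```python
import librosa
import numpy as np


def extract_frame_features(path: str, segment_s: float = 3.0, n_mfcc: int = 13):
    """Split an audio file into fixed-length segments and summarise each
    segment by the mean and standard deviation of its MFCCs (illustrative)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    hop = int(segment_s * sr)
    features = []
    for start in range(0, len(y) - hop + 1, hop):
        segment = y[start:start + hop]
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)
        features.append(np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]))
    return np.stack(features) if features else np.empty((0, 2 * n_mfcc))
```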

Audio-visual integration module

After the video image features and music features are extracted, the extracted features are fed into the fusion module for similarity metrics, which consists of two parts: a similarity metric layer and a fully connected layer.

The cosine similarity between image-music pairs is used as the emotional association label for the similarity prediction model. This metric is widely used in text mining, natural language processing, and information retrieval to measure the similarity between two given data items. Drawing on geometry, cosine distance measures the similarity of two vectors through the cosine of the angle between them. The cosine value lies in the closed interval [-1, 1]: values close to 1 indicate higher similarity and values close to -1 indicate lower similarity, making the measure suitable for tasks that require a bounded distance. The specific expression is shown in Eq. (7): $\mathrm{Sim} = \frac{x_1 \cdot x_2}{\|x_1\|\,\|x_2\|} = \frac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^2}\,\sqrt{\sum_{k=1}^{n} x_{2k}^2}}$

where $\mathrm{Sim}$ is the cosine similarity, $x_{1k}$ are the components of the image feature vector, and $x_{2k}$ are the components of the music feature vector.
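
A minimal NumPy implementation of Eq. (7) might look as follows. This is a sketch only; in the paper the fusion module additionally passes the result through a fully connected layer.

```python
import numpy as np


def cosine_similarity(x1: np.ndarray, x2: np.ndarray) -> float:
    """Cosine similarity of Eq. (7) between an image feature vector x1
    and a music feature vector x2; the result lies in [-1, 1]."""
    denom = np.linalg.norm(x1) * np.linalg.norm(x2)
    return float(np.dot(x1, x2) / denom) if denom > 0 else 0.0
```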

Experimental results and analysis
Analysis of the results of the emotion recognition experiment

The results of emotion recognition based on video features have been discussed in detail above, so this section focuses on emotion recognition based on audio features and on the fusion of audio-visual features. Following the audio feature extraction method described above, the most discriminative features, i.e., the 25 prosodic features and the MFCCs, are used for emotion recognition experiments on the RML and SAVEE speech libraries, which correspond one-to-one with the video samples. A total of 720 speech samples from the RML database are used, of which 480 are used for training and 240 for testing; the SAVEE database contains 360 speech samples, of which 240 are selected for training and 120 for testing. The relationship between the emotion recognition rate and the number of reduced feature dimensions, for each feature extracted separately on the RML and SAVEE audio libraries (single audio modality), is shown in Figure 2. As the figure shows, when the 25 prosodic features and the MFCC features are extracted separately, the experiments on both RML and SAVEE indicate that the recognition rate peaks when the features are reduced to around 200 dimensions, with the four feature-database combinations reaching maximum recognition rates of 56.62%, 61.02%, 68.5%, and 71.28%, respectively; as the number of dimensions increases further, the overall recognition rate shows a decreasing trend. Therefore, considering all features, the best effect is achieved when the number of dimensions is reduced to about 200. For convenience, all extracted features are reduced to 200 dimensions, which also facilitates subsequent feature fusion and helps improve the overall performance of the system.

Figure 2. Relationship between the recognition rate and the number of reduced feature dimensions
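
The paper does not name the dimensionality reduction method used here; the sketch below assumes PCA, purely to illustrate reducing the pooled audio descriptors to 200 dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_to_200(features: np.ndarray, n_components: int = 200) -> np.ndarray:
    """features: (n_samples, n_dims) matrix of prosodic + MFCC descriptors.
    Returns the samples projected onto at most 200 principal components."""
    n_components = min(n_components, features.shape[0], features.shape[1])
    return PCA(n_components=n_components).fit_transform(features)
```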

The relationship between the individual features and the recognition rate for each emotion on the RML and SAVEE speech libraries is shown in Table 2. As the table shows, the average recognition rates for the 25 prosodic features, the MFCC features, and their fusion are 51.74%, 58.26%, and 59.96% on the RML speech library, and 62.17%, 63.95%, and 66.07% on the SAVEE speech library, respectively. It can be clearly concluded that fusing the two feature types yields better recognition than either type alone, which also indicates, to some extent, that prosodic features and MFCC features are complementary.

Table 2. Six-emotion recognition rates on the RML and SAVEE speech libraries

RML
Feature             Pleasure   Anger    Disgust   Sadness   Surprise   Fear     Average
Prosodic features   50.37%     51.27%   54.63%    50.14%    50.21%     53.84%   51.74%
MFCCs               55.96%     56.78%   60.28%    57.84%    58.42%     60.28%   58.26%
Fused features      58.03%     58.42%   60.8%     61.34%    60.24%     60.94%   59.96%

SAVEE
Feature             Pleasure   Anger    Disgust   Sadness   Surprise   Fear     Average
Prosodic features   60.28%     60.94%   62.35%    64.32%    62.58%     62.54%   62.17%
MFCCs               61.75%     61.75%   65.38%    65.94%    65.37%     63.48%   63.95%
Fused features      65.23%     63.45%   66.28%    67.38%    67.3%      66.78%   66.07%

The results of six-class emotion recognition on the two audio-visual libraries are shown in Table 3. It can be concluded that the recognition effect varies across dimensions and emotional states, although the recognition rates of the various emotional states do not differ greatly. After fusing the audio-visual features, the best average recognition rate on the RML database is 79.00%, while fusing audio-visual features on the SAVEE library yields the best performance of the whole system, with an average recognition rate of 88.07%. Fusing the two audio-visual dimensions for emotion recognition is significantly better than using either dimension alone, and the overall recognition rate based on video features is higher than that based on audio features. Again, the system achieves the best recognition results only when the video and audio dimensions are fused. This shows that the two types of dimensional features are complementary for overall emotion recognition, and that fusing multidimensional information benefits the whole audio-visual emotion recognition system. The experimental analysis shows that, compared with a single dimension, fusing features across the two audio-visual dimensions improves the emotion recognition rate.

Table 3. Emotion recognition rates for audio, video, and fused audio-visual features

Feature                Pleasure   Anger    Disgust   Sadness   Surprise   Fear     Average
RML audio features     58.24%     58.45%   61.2%     61.28%    60.28%     61.28%   60.12%
RML video features     75.2%      76.24%   74.25%    74.89%    78.02%     73.6%    75.37%
RML audio-visual       78.21%     80.25%   80.28%    81.27%    78.85%     75.14%   79.00%
SAVEE audio features   64.22%     63.58%   65.87%    67.23%    67.29%     65.74%   65.66%
SAVEE video features   83.47%     85.27%   86.77%    85.36%    82.67%     84.22%   84.63%
SAVEE audio-visual     87.36%     87.52%   88.39%    87.57%    90.28%     87.28%   88.07%

Emotion Recognition Results for Different Lengths of Scores

For this experiment, a collection of MIDI music files was gathered, all obtained from websites specializing in MIDI music. Before the experiment, all music files were screened: the length of every piece was restricted to under five minutes, since overly long music contains more emotions, which is unfavorable for the experiment. In the end, 500 MIDI files were selected. All 500 pieces were then categorized according to the categorization described earlier, 324 of them were selected as the training set, and the music emotion recognition model constructed with the method proposed in this paper was used to judge the emotions of the 176 MIDI pieces in the test set. The results are shown in Figure 3. In general, the longer a piece of music is, the more emotional information it contains and the greater the fluctuation of emotion within it, so recognition accuracy decreases for longer music. The recognition accuracy is 87% when the piece is 60 s long and 80% when the piece is longer than 240 s.
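
The length-screening step described above could be scripted roughly as follows. This is a hypothetical helper assuming the pretty_midi library; the folder path and the exact threshold are illustrative, not taken from the paper.

```python
import os
from typing import List

import pretty_midi


def select_short_midi(folder: str, max_seconds: float = 300.0) -> List[str]:
    """Return the paths of MIDI files in `folder` shorter than five minutes."""
    selected = []
    for name in os.listdir(folder):
        if not name.lower().endswith((".mid", ".midi")):
            continue
        path = os.path.join(folder, name)
        try:
            duration = pretty_midi.PrettyMIDI(path).get_end_time()
        except Exception:
            continue  # skip unreadable files
        if duration < max_seconds:
            selected.append(path)
    return selected
```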

Figure 3. Emotion recognition accuracy for music of different lengths

Emotion Recognition Results for Different Rhythms of the Scores

The score of Western music belongs to the category of pitch-based notation, from which it is difficult to analyze emotional connotation, so emotion analysis has mostly started from hearing. However, musical emotion is the unity of self-regulated formal emotion and other-regulated social emotion, and it is difficult for a listener to accurately reproduce the composer’s (or performer’s) original emotion, which poses a great difficulty for music emotion cognition. The guqin reduced-character score (jianzipu) belongs to the category of fingering notation: it does not record pitch directly but records how each tone is produced, i.e., the fingering, and different fingerings produce different tones; the rich timbres of the guqin depend largely on variations of left- and right-hand fingering. Although note durations are not strictly defined in the score, the use of left-hand glissandi, rhythmic marks, and right-hand compound fingerings can determine the pacing of the performance. Therefore, studying music emotion from reduced-character notation is an effective approach. Combined with the traditional Chinese cultural background, the emotions of guqin compositions are divided into eight basic subcategories: indignation, sadness, longing, calmness, pleasure, purity, magnificence, and complexity. Through in-depth analysis of the major guqin compositions, combined with the characteristics of the music itself, six main features are extracted for emotion computation: modulation, large interval jumps, tempo, glissandi, compound fingering, and timbre. The variation curves of the different features are shown in Figure 4. As the figure shows, the amplitude of the tempo (rhythm) curve is very small, because relatively little rhythmic information is recorded in the guqin score and rhythm is mainly shaped during performance in combination with the player’s breathing, so the tempo feature has little influence on emotion recognition; in contrast, large interval jumps and glissandi vary more strongly and can be seen to play a larger role in emotion recognition.

Figure 4. Variation curves of the different features

Conclusion

In this paper, an audio-visual associative emotion recognition model based on emotional semantics is constructed from a visual attention mechanism, an auditory CNN network, and a fusion module, and a series of experiments is conducted to test the proposed model. The results show that feature fusion is most effective when the music emotion features are reduced to 200 dimensions, which improves the overall performance of the system. Feature fusion across the two audio-visual dimensions is far from negligible: it significantly improves the recognition rate of the emotion recognition model. The shorter a piece of music, the less emotional information it carries and the smaller its emotional fluctuations, which is more favorable for emotion recognition; among the score features, large interval jumps and glissandi contribute more to recognition than the sparsely notated rhythm.
