Research on robotic mechanical power sensing model based on multimodal sensor fusion
Published: 17 Mar 2025
Received: 06 Nov 2024
Accepted: 18 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0312
© 2025 Jianjia Qi, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Many tasks in daily life are simple and repetitive, and in exploratory activities humans often encounter situations that exceed their physical limits, which restricts the scope of human activity [1-2]. This motivated the use of machines in place of people for such repetitive or dangerous jobs, and so the study of robotics began. Robotics is a comprehensive discipline involving bionics, mechanics, materials science, computer science, and control science, and it is precisely this interdisciplinary nature that has jointly driven its development [3-4].
Humans perceive the world by obtaining external information through smell, touch, vision, and other senses; likewise, robots need to perceive external information for feedback control [5]. Robot sensors play a role similar to human eyes, ears, and nose: using known physical laws, they convert the detected quantities into physical quantities the robot can recognize, analyze, and compute. The measured signal data are sent to the central processor, which executes the corresponding action to realize the desired function. Sensors therefore play a very important role in the motion control of robots [6-7].
Currently, in common multi-sensor robot interaction scenarios, force sensors can be used to detect contact forces, cameras to obtain external visual image information, proximity sensors to perceive objects approaching and moving away, and acceleration sensors to capture the motion and vibration amplitude of objects. de Gea Fernández, J. et al. presented the development of a two-armed robotic system for human-robot collaboration in industrial production, focusing on the analysis of the robot’s sensor system and the arm control system [8]. Din, S et al. explored the design and fabrication of multimodal sensor fusion and confirmed, through theoretical analysis and experiments, that flexible printed circuit board substrates can be converted into stretchable circuits integrating multimodal sensors using current PCB fabrication and laser processing techniques [9]. Xue, T et al. synthesized the research literature on multimodal sensors, summarized current breakthroughs and the obstacles multimodal sensors face, and provided an outlook on future research directions related to multimodal sensor fusion [10]. Park, S et al., in order to allow wearable robotic rehabilitation devices to adapt to a wide range of upper limb injury conditions, proposed introducing multimodal sensing and interaction technologies into such devices, which effectively extends their scope of practical application [11]. Wang, Z et al. illustrated the seamless integration of multi-material systems designed to enable robots to sense temperature, haptics (i.e., material recognition), and electrochemical stimuli, pointing out that magnetic soft robots with multimodal sensing capabilities can serve as the basis for research and innovation in next-generation magnetic soft robots [12]. Research across these robot sensing fields has revealed the importance of robot sensing as a key research direction for innovative robotics.
The robot drive system is an important component of the robot as a whole, and research in this area involves motion patterns, drive principles, and dynamics analysis; however, most work has been theoretical and experimental, while practical applications remain scarce. He, J et al. comprehensively compared and analyzed recent multi-limbed robot designs, especially their drive systems and dynamic control, and also looked ahead to practical application trends of multi-limbed robots [13]. Goldberg, B et al. envisioned an insect-like robot with autonomously controlled dynamics, introducing microcontrollers and customized drive electronics to improve the robot’s flexibility and maneuverability [14]. Pal, A et al. explored the differences between soft and rigid robots and proposed a new drive approach that exploits mechanical instability to enhance drive speed and output power [15]. Farrell Helbling, E et al. presented cutting-edge research on the design of a small flapping-wing aerial vehicle, in particular its drive technology and flight motion control system, contributing positively to the optimization and innovation of small flapping-wing aerial robots [16]. Yandell, M. B et al. combined motion capture and force measurement methods as the technical basis for designing wearable walking-assistance devices, and their analysis revealed the power transmission process between the assistive device and the human body [17].
In this paper, we construct a cross-modal generation model based on audio-visual and haptic multimodal co-representation, which fully exploits the complementarity and common distribution of multimodal data to achieve cross-modal generation from the audio-visual modalities to the haptic modality. Specifically, the model first encodes the inputs with audio-visual encoders, mapping the different input modalities into a common feature space. The model then uses a decoder in that feature space to generate the target-modality image. At the same time, a haptic self-encoding network is used to retain haptic reconstruction information and capture the semantic coherence of the haptic signal itself. Finally, two discriminative models simultaneously impose intra-modal constraints on the high-dimensional data and inter-modal constraints on the low-dimensional features. Compared with current mainstream cross-modal generation methods, the model in this paper uses generative adversarial networks to optimize multimodal co-perception for improved accuracy.
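As an illustration of this architecture, the following is a minimal PyTorch sketch of the components just described (audio-visual encoder, haptic decoder, haptic self-encoding network, and a discriminator used once on high-dimensional haptic images and once on low-dimensional shared features). Module names, layer sizes, and the latent dimension are assumptions for illustration, not the paper’s actual configuration.

```python
import torch
import torch.nn as nn

LATENT_DIM = 128  # assumed size of the common feature space

class AudioVisualEncoder(nn.Module):
    """Maps a concatenated audio-visual feature vector into the common space."""
    def __init__(self, in_dim=1024, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))
    def forward(self, x):
        return self.net(x)

class HapticDecoder(nn.Module):
    """Generates a haptic image from a point in the common feature space."""
    def __init__(self, latent_dim=LATENT_DIM, out_side=32):
        super().__init__()
        self.out_side = out_side
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_side * out_side), nn.Tanh())
    def forward(self, z):
        return self.net(z).view(-1, 1, self.out_side, self.out_side)

class HapticAutoencoder(nn.Module):
    """Retains haptic reconstruction information (encoder + decoder)."""
    def __init__(self, side=32, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(side * side, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = HapticDecoder(latent_dim, side)
    def forward(self, x):
        z = self.encoder(x.flatten(1))
        return self.decoder(z), z

class Discriminator(nn.Module):
    """Used twice: on high-dimensional haptic images (intra-modal constraint)
    and on low-dimensional shared features (inter-modal constraint)."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x.flatten(1))
```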
Typically, when a robot uses multiple sensing devices to acquire information in several modalities, each device perceives the surrounding environment in isolation, severing the intrinsic correlations between the modalities and losing some key information about the physical world. In terms of performance, this approach has clear advantages over unimodal sensing, but it also has drawbacks that limit the intelligent development of robots. On the one hand, the multimodal information obtained from multiple sensing devices differs greatly in structural settings, time scales, and spatial dimensions. How to fuse the simultaneous measurements from force, tactile, visual, and other sensors, and how to reconcile the differences in the spatial and temporal scales of the modal information so as to determine the law of data exchange between the information world and the physical world, is a major difficulty in the cognitive computation and inference of perceptual data, and the demands it places on algorithm performance and processing equipment are extremely high. On the other hand, when the robot coordinates its sensing devices, there is a time difference in the processing and conversion of information between the modalities, which makes the robot appear less responsive and is also one of the important factors affecting the assessment of robot intelligence. Therefore, opening up new methods for obtaining multimodal information is especially important for the intelligent development of robots.
The purpose of this paper is to design a sensing model for robot multimodal information perception, to improve the robot’s intelligence, and to enhance its sensory prediction ability.
Autoencoders have been discussed for decades; early related models are the Boltzmann machines, which have a structure similar to the neural organization of the brain and were primarily used to solve combinatorial and optimization problems. Later it was shown that nonlinear principal component analysis can be used to discover and eliminate nonlinearly correlated components in the data, reducing its dimensionality by removing redundant information. A typical autoencoder is a feed-forward neural network composed mainly of an encoder network and a decoder network; its structure is shown in Fig. 1. The encoder compresses the high-dimensional input data into a low-dimensional bottleneck representation, and the decoder tries to reconstruct the input from the bottleneck as closely as possible. The L2 norm (Euclidean distance) is used to measure the reconstruction loss.

Figure 1: Autoencoder structure
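As an illustration, the sketch below shows a basic feed-forward autoencoder with an L2 (mean squared error) reconstruction loss, as described above; the input size, layer sizes, and bottleneck dimension are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, bottleneck=32):
        super().__init__()
        # Encoder: compress the high-dimensional input to a low-dimensional bottleneck
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        # Decoder: reconstruct the input from the bottleneck representation
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)                     # a dummy input batch
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)      # L2 (Euclidean) reconstruction loss
loss.backward()
```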
The variational autoencoder (VAE) has a structure very similar to the autoencoder (AE). However, unlike the AE, the VAE regularizes the latent representation and can generate new data rather than merely reconstructing the input. It consists of two neural networks, an inference network and a generative network, connected through a latent variable: the inference network performs variational inference on the original input data to obtain the probability distribution of the latent variable, and the generative network approximates the original data distribution from samples drawn from that latent distribution. Figure 2 illustrates the distinction between the classical autoencoder and the variational autoencoder.

Figure 2: The difference between a simple autoencoder and a variational autoencoder
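A minimal sketch of that difference on the encoder side: the inference network outputs the parameters of a distribution over the latent variable, a sample is drawn with the reparameterization trick, and a KL term regularizes the latent space. All dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)       # mean of the latent distribution q(z|x)
        self.logvar = nn.Linear(256, latent)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

x = torch.randn(16, 784)
x_hat, mu, logvar = VAE()(x)
recon = nn.functional.mse_loss(x_hat, x, reduction="sum")            # reconstruction term
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())         # regularizes the latent space
loss = recon + kl
```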
A Generative Adversarial Network (GAN) consists of two parts: a generator G and an adversary (discriminator) D. The generator maps a noise vector z, drawn from a prior distribution, into the data space as G(z), while the adversary outputs the probability D(x) that a sample x comes from the real data rather than from the generator.
The principle of the generative adversarial network is to take a vector sampled from a Gaussian distribution and map it into the generated modality space; the generating function usually takes the form of a neural network, so that the generated image or text can closely approximate real images or text. The cost function of the GAN adversary is shown in equation (2):

$$J^{(D)} = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] - \frac{1}{2}\,\mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right] \tag{2}$$
It was mentioned earlier that the generator and the adversary play a zero-sum game, so the sum of their costs must equal zero. It can therefore be deduced that the generator’s cost function satisfies equation (3):

$$J^{(G)} = -\,J^{(D)} \tag{3}$$
Therefore, a value function V(D, G) can be set to represent the payoff of this two-player game, which the adversary seeks to maximize and the generator seeks to minimize.
The transformation of the GAN cost function is shown in Eqs. (4) to (6) below:

$$V(D,G) = \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right] \tag{4}$$

$$V(D,G) = \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{x \sim p_{g}}\left[\log\left(1 - D(x)\right)\right] \tag{5}$$

$$V(D,G) = \int_{x} \left[\, p_{data}(x)\log D(x) + p_{g}(x)\log\left(1 - D(x)\right) \,\right] dx \tag{6}$$
Currently, the problem translates into finding a suitable discriminator D that maximizes V(D, G) and a suitable generator G that minimizes it.
According to the definition of the Nash equilibrium in game theory, neither player of the game can gain by unilaterally changing its behaviour. The same holds in a GAN, which needs to seek an equilibrium point that minimizes the cost of both sides. That is, the problem can be defined as a minimax problem, as shown in equation (7) below:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right] \tag{7}$$
The so-called minimax value means that the function is minimized in one direction while the maximum is taken in the other direction.
So, after the above derivation, the ideal generator and adversary of a generative adversarial network satisfy equation (8) below:

$$G^{*} = \arg\min_{G}\max_{D} V(D,G) \tag{8}$$
For a fixed generator G, training the adversary amounts to maximizing V(D, G) with respect to D. Now it is just a matter of finding the D that maximizes the integrand in Eq. (6) pointwise, which gives the optimal discriminator

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}$$

It can be seen that this is a value ranging from 0 to 1. This is also in line with the standard behaviour of the discriminator: ideally the discriminator should output 1 when it receives real data and 0 for generated data, and when the generated data distribution is very close to the real data distribution it should output 1/2.
After finding D*, it can be substituted back into V(D, G) to express the generator’s objective in terms of the divergence between the real and generated data distributions.
In probability statistics, the JS divergence, like the previously mentioned KL divergence, measures the degree of similarity between two probability distributions. It is computed from the KL divergence and inherits its properties such as non-negativity, with one important difference: the JS divergence is symmetric. The relationship between the JS divergence and the KL divergence is shown in Eq. (13), and the formula for the JS divergence is given in Eq. (14):

$$\mathrm{JS}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right) \tag{13}$$

$$\mathrm{JS}(P \,\|\, Q) = \frac{1}{2}\sum_{x} P(x)\log\frac{2P(x)}{P(x)+Q(x)} + \frac{1}{2}\sum_{x} Q(x)\log\frac{2Q(x)}{P(x)+Q(x)} \tag{14}$$
For the generative adversarial network, substituting the optimal discriminator D* back into V(D, G) shows that the generator’s objective is, up to a constant, the JS divergence between the real data distribution and the generated data distribution; training the generator therefore amounts to minimizing this JS divergence.
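As a numerical sanity check of the derivation above, the snippet below computes the optimal discriminator and the JS divergence for two small discrete distributions (the example distributions are arbitrary): when the generated distribution matches the real one, D*(x) equals 1/2 everywhere and the JS divergence is 0.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions with the same support."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """JS divergence, Eq. (13): average of the two KL terms against the mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.1, 0.4, 0.5])   # "real" distribution (illustrative)
p_g    = np.array([0.3, 0.3, 0.4])   # "generated" distribution (illustrative)

d_star = p_data / (p_data + p_g)     # optimal discriminator, values in (0, 1)
print(d_star)                        # e.g. [0.25  0.571 0.556]
print(js(p_data, p_g))               # > 0 while the distributions differ
print(js(p_data, p_data))            # 0.0 when p_g equals p_data
print(p_data / (p_data + p_data))    # D* is exactly 1/2 everywhere at equilibrium
```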
The model designed in this paper consists of three main parts, namely a cross-modal generative network, a haptic self-encoding network, and the discriminative networks.
The multimodal dataset is denoted as
Given visual and auditory signal pairs {
Where
For a given haptic modal real image
The encoded feature vector
For the adversarial-loss discriminative model, the inputs to the discriminator
The generative model aims to uncover the intrinsic structure and characteristics of the data, thus enabling the generation of multimodal data.
In this,
Cross-modal generative model for audio-visual co-representation
A discriminator network
Then discriminator
The loss function
In the discriminator
Unlike the traditional feature matching loss, the algorithm proposed in this paper uses a feature-level supervised loss function during the generation of haptic signals by
For the cross-modal generative model, the feature vectors
The stochastic gradient is calculated as follows:
The stochastic gradient is calculated as follows:
The input to the decoder is a mapping matrix from the target modality to the common representation space, and the output is a reconstructed image of the target modality in the common representation space. The aim is to minimize the objective function to fit the true correlation distribution. Its stochastic gradient descent is given in the following equation:
Similarly, for the self-encoding network model, the stochastic gradient is computed so that the spatial distribution of the reconstructed signal fits that of the real signal.
The training process for the generative and discriminative models involves iterating them until they reach a stable equilibrium. In this process, the generative model tries to generate samples that are similar to the real samples, while the discriminative model tries to distinguish between real and generated samples. As a result, the heterogeneity gap between the different modalities gradually decreases and the modalities come to share a common representation space.
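A generic sketch of this alternating adversarial update is given below (a minimal illustration, not the paper’s exact networks, data, or optimizer settings): the discriminator is updated to push real samples toward 1 and generated samples toward 0, then the generator is updated to fool the discriminator.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64
generator = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, 784))
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                              nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()
real_data = torch.randn(1000, 784)            # stand-in for real training samples

for step in range(100):                        # iterate until training stabilizes
    real = real_data[torch.randint(0, 1000, (32,))]
    ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

    # Discriminator step: real samples -> 1, generated samples -> 0
    fake = generator(torch.randn(32, LATENT_DIM))
    d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make generated samples be judged as real
    g_loss = bce(discriminator(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```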
The experimental platform is a UR3 robotic arm equipped with a Barrett hand dexterous hand. In this experiment only one finger is used to predict the local attitude of the object. There are two experimental objects, a water bottle and a cube. Each object is placed on a flat surface; given an initial position of the dexterous hand relative to the object, the hand, equipped with a fingertip tactile sensor, grasps the object from open to closed, and the sensor outputs the data of its proximity unit during the approach. The collected dataset is processed with outlier removal and normalization, and the proximity sensing data of the two objects are then input into the trained model; the prediction curves obtained after model fitting are shown in Fig. 3 and Fig. 4. Where the parameter

Figure 3: Local attitude prediction curve of the water bottle

Figure 4: Local attitude prediction curve of the cube
Observing the parameter
The prediction curve of the cube shows an overall decreasing trend, with small fluctuations in the first half. Observing the parameters
In order to analyze the effect of the dictionary size K on the algorithm, the value of K was varied with the pooling mode set to average pooling, and the object recognition accuracies of OSL-SR and of the algorithm proposed in this paper were observed; the results are shown in Figure 5.

Figure 5: Relationship between recognition accuracy and dictionary size
The object recognition results depend not only on the specific algorithm but also on the dictionary size parameter it uses. As the dictionary size K increases, the object recognition results of OSL-SR and of this paper’s model under single-sample learning generally increase. As K increases from 30 to 80, the recognition accuracies of the OSL-SR model are 84%, 86%, 87%, 90%, 91%, and 91%, respectively, while those of this paper’s model are 89%, 90%, 92%, 94%, 95%, and 92%. The figure shows that the recognition accuracy of this paper’s model is higher than that of OSL-SR at every K stage, which directly indicates better generalization ability under different parameters and reflects that taking into account the temporal characteristics of reconstructed data with coupling properties helps improve the efficiency of the algorithm.
Figure 6 shows the F1 scores for perceptual state recognition obtained by this paper’s model and the other two algorithms. At every sparsity level, the recognition performance of this paper’s model is significantly better than that of JKSC and AMDL. When T=5, its maximum recognition result is 0.953, higher than the recognition performance of the other two models. When T>5, the performance of this paper’s model starts to decrease, but it remains higher than that of the other algorithms. It was further found that AMDL is more sensitive to sparsity than JKSC because it considers the force association between multiple fingers. It can be concluded that the model in this paper is also superior in the overall multimodal perception of objects.

Figure 6: Comparison of the algorithms at different sparsity levels
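For reference, a hedged sketch of how per-state F1 scores of the kind reported in Figure 6 can be computed with scikit-learn; the labels below are dummy data for illustration only, not the experimental results.

```python
from sklearn.metrics import f1_score

# Dummy ground-truth and predicted perceptual-state labels (illustrative only)
y_true = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]
y_pred = [0, 1, 2, 1, 0, 1, 2, 1, 0, 2]

# Macro-averaged F1 treats every perceptual state equally
print(f1_score(y_true, y_pred, average="macro"))
```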
In this paper, a sensing model for robotic multimodal information perception is designed that can fuse the multimodal heterogeneous data acquired by multiple sensing devices across structural settings, time scales, and spatial dimensions, ultimately enhancing the robot’s perceptual prediction capability. It is verified that the model reasonably predicts the overall trend of the local poses of different objects, and that different object surfaces produce corresponding prediction effects. Comparing this paper’s model with OSL-SR, JKSC, and AMDL shows that the robot’s perceptual prediction performance is related to the dictionary size parameter set by the algorithm. As the dictionary size K increases, the object recognition results of OSL-SR and of this paper’s model under single-sample learning also improve, and the recognition accuracy of this paper’s model is higher than that of OSL-SR at every K stage, which demonstrates better generalization ability under different parameters and reflects that considering the temporal features of reconstructed data with coupling characteristics helps improve the algorithm’s efficiency. Under any sparsity, the recognition performance of this paper’s model is significantly better than that of JKSC and AMDL. At T=5, the maximum recognition result is 0.953, superior to the other two models. When T>5, the performance of this paper’s model starts to decrease but remains higher than that of the other algorithms. In summary, this paper’s model achieves the design goal of intelligently enhancing the robot’s perceptual prediction ability.
This work was supported by the Heilongjiang Institute of Technology Horizontal Research Project (No. 220124038): Development of a Web-based Product Selection Platform for S Enterprise’s Gear Reducers.
