Research on robotic mechanical power sensing model based on multimodal sensor fusion
Published: 17 Mar 2025
Received: 06 Nov 2024
Accepted: 18 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0312
© 2025 Jianjia Qi, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Many tasks in daily life are simple and repetitive, and in exploratory activities humans often encounter situations that exceed their physical limits, which restricts the scope of human activity [1-2]. This motivated the use of machines in place of people for such repetitive or dangerous jobs, and so the study of robotics began. Robotics is a comprehensive discipline involving bionics, mechanics, materials science, computer science, and control science, and it is precisely this interdisciplinary nature that has jointly driven its development [3-4].
Humans perceive the world by obtaining external information through smell, touch, vision, and other senses; likewise, robots need to perceive external information for feedback control [5]. Robot sensors play a role similar to human eyes, ears, and nose: using known physical laws, they convert the detected quantities into physical quantities the robot can recognize, analyze, and compute. The measured signal data are sent to the central processor, which executes the corresponding action to realize the desired function. Sensors therefore play a very important role in the motion control of robots [6-7].
Currently, in common multi-sensor robot interaction scenarios, force sensors can be used to detect contact forces, cameras to obtain external visual image information, proximity sensors to perceive objects approaching and moving away, and acceleration sensors to capture the motion and vibration amplitude of objects. de Gea Fernández, J. et al. presented the development of a two-armed robotic system for human-robot collaboration in industrial production, focusing on the analysis of the robot’s sensor system and the arm control system [8]. Din, S et al. explored the design and fabrication of multimodal sensor fusion and confirmed, through theoretical analysis and experiments, that flexible printed circuit board substrates can be converted into stretchable circuits integrating multimodal sensors using current PCB fabrication and laser processing techniques [9]. Xue, T et al. synthesized the research literature on multimodal sensors, summarized current breakthroughs and the obstacles multimodal sensors face, and provided an outlook on future research directions related to multimodal sensor fusion [10]. Park, S et al., in order to allow wearable robotic rehabilitation devices to adapt to a wide range of upper limb injury conditions, proposed introducing multimodal sensing and interaction technologies into such devices, which effectively extends their scope of practical application [11]. Wang, Z et al. illustrated the seamless integration of multi-material systems designed to enable robots to sense temperature, haptics (i.e., material recognition), and electrochemical stimuli, pointing out that magnetic soft robots with multimodal sensing capabilities can serve as the basis for research and innovation in next-generation magnetic soft robots [12]. Research across these robot sensing fields has revealed the importance of robot sensing as a key research direction for innovative robotics.
The robot drive system is an important component of the robot as a whole, and research in this area involves motion patterns, drive principles, and dynamics analysis; however, most work has been theoretical and experimental, while practical applications remain scarce. He, J et al. comprehensively compared and analyzed recent multi-limbed robot designs, especially their drive systems and dynamic control, and also looked ahead to practical application trends of multi-limbed robots [13]. Goldberg, B et al. envisioned an insect-like robot with autonomously controlled dynamics, introducing microcontrollers and customized drive electronics to improve the robot’s flexibility and maneuverability [14]. Pal, A et al. explored the differences between soft and rigid robots and proposed a new drive approach that exploits mechanical instability to enhance drive speed and output power [15]. Farrell Helbling, E et al. presented cutting-edge research on the design of a small flapping-wing aerial vehicle, in particular its drive technology and flight motion control system, contributing positively to the optimization and innovation of small flapping-wing aerial robots [16]. Yandell, M. B et al. combined motion capture and force measurement methods as the technical basis for designing wearable walking-assistance devices, and their analysis revealed the power transmission process between the assistive device and the human body [17].
In this paper, we construct a cross-modal generation model based on audio-visual and haptic multimodal co-representation, which fully exploits the complementarity and common distribution of multimodal data to achieve cross-modal generation from the audio-visual modalities to the haptic modality. Specifically, the model first encodes the inputs with audio-visual encoders, mapping the different input modalities into a common feature space. The model then uses a decoder in that feature space to generate the target-modality image. At the same time, a haptic self-encoding network is used to retain haptic reconstruction information and capture the semantic coherence of the haptic signal itself. Finally, two discriminative models simultaneously impose intra-modal constraints on the high-dimensional data and inter-modal constraints on the low-dimensional features. Compared with current mainstream cross-modal generation methods, the model in this paper uses generative adversarial networks to optimize multimodal co-perception for improved accuracy.
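As an illustration of this architecture, the following is a minimal PyTorch sketch of the components just described (audio-visual encoder, haptic decoder, haptic self-encoding network, and a discriminator used once on high-dimensional haptic images and once on low-dimensional shared features). Module names, layer sizes, and the latent dimension are assumptions for illustration, not the paper’s actual configuration.

```python
import torch
import torch.nn as nn

LATENT_DIM = 128  # assumed size of the common feature space

class AudioVisualEncoder(nn.Module):
    """Maps a concatenated audio-visual feature vector into the common space."""
    def __init__(self, in_dim=1024, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))
    def forward(self, x):
        return self.net(x)

class HapticDecoder(nn.Module):
    """Generates a haptic image from a point in the common feature space."""
    def __init__(self, latent_dim=LATENT_DIM, out_side=32):
        super().__init__()
        self.out_side = out_side
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_side * out_side), nn.Tanh())
    def forward(self, z):
        return self.net(z).view(-1, 1, self.out_side, self.out_side)

class HapticAutoencoder(nn.Module):
    """Retains haptic reconstruction information (encoder + decoder)."""
    def __init__(self, side=32, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(side * side, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = HapticDecoder(latent_dim, side)
    def forward(self, x):
        z = self.encoder(x.flatten(1))
        return self.decoder(z), z

class Discriminator(nn.Module):
    """Used twice: on high-dimensional haptic images (intra-modal constraint)
    and on low-dimensional shared features (inter-modal constraint)."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x.flatten(1))
```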
Typically, when a robot uses multiple sensing devices to acquire information in several modalities, each device perceives the surrounding environment in isolation, severing the intrinsic correlations between the modalities and losing some key information about the physical world. In terms of performance, this approach has clear advantages over unimodal sensing, but it also has drawbacks that limit the intelligent development of robots. On the one hand, the multimodal information obtained from multiple sensing devices differs greatly in structural settings, time scales, and spatial dimensions. How to fuse the simultaneous measurements from force, tactile, visual, and other sensors, and how to reconcile the differences in the spatial and temporal scales of the modal information so as to determine the law of data exchange between the information world and the physical world, is a major difficulty in the cognitive computation and inference of perceptual data, and the demands it places on algorithm performance and processing equipment are extremely high. On the other hand, when the robot coordinates its sensing devices, there is a time difference in the processing and conversion of information between the modalities, which makes the robot appear less responsive and is also one of the important factors affecting the assessment of robot intelligence. Therefore, opening up new methods for obtaining multimodal information is especially important for the intelligent development of robots.
The purpose of this paper is to design a sensing model for robot multimodal information perception, to improve the robot’s intelligence, and to enhance its sensory prediction ability.
Autoencoders have been discussed for decades; early related models are the Boltzmann machines, which have a structure similar to the neural organization of the brain and were primarily used to solve combinatorial and optimization problems. Later it was shown that nonlinear principal component analysis can be used to discover and eliminate nonlinearly correlated components in the data, reducing its dimensionality by removing redundant information. A typical autoencoder is a feed-forward neural network composed mainly of an encoder network and a decoder network; its structure is shown in Fig. 1. The encoder compresses the high-dimensional input data into a low-dimensional bottleneck representation, and the decoder tries to reconstruct the input from the bottleneck as closely as possible. The L2 norm (Euclidean distance) is used to measure the reconstruction loss.

Figure 1: Autoencoder structure
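As an illustration, the sketch below shows a basic feed-forward autoencoder with an L2 (mean squared error) reconstruction loss, as described above; the input size, layer sizes, and bottleneck dimension are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, bottleneck=32):
        super().__init__()
        # Encoder: compress the high-dimensional input to a low-dimensional bottleneck
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        # Decoder: reconstruct the input from the bottleneck representation
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)                     # a dummy input batch
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)      # L2 (Euclidean) reconstruction loss
loss.backward()
```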
The variational autoencoder (VAE) has a structure very similar to the autoencoder (AE). However, unlike the AE, the VAE regularizes the latent representation and can generate new data rather than merely reconstructing the input. It consists of two neural networks, an inference network and a generative network, connected through a latent variable: the inference network performs variational inference on the original input data to obtain the probability distribution of the latent variable, and the generative network approximates the original data distribution from samples drawn from that latent distribution. Figure 2 illustrates the distinction between the classical autoencoder and the variational autoencoder.

Figure 2: The difference between a simple autoencoder and a variational autoencoder
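A minimal sketch of that difference on the encoder side: the inference network outputs the parameters of a distribution over the latent variable, a sample is drawn with the reparameterization trick, and a KL term regularizes the latent space. All dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)       # mean of the latent distribution q(z|x)
        self.logvar = nn.Linear(256, latent)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

x = torch.randn(16, 784)
x_hat, mu, logvar = VAE()(x)
recon = nn.functional.mse_loss(x_hat, x, reduction="sum")            # reconstruction term
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())         # regularizes the latent space
loss = recon + kl
```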
A Generative Adversarial Network (GAN) consists of two parts: a generator G and an adversary (discriminator) D. The generator maps a noise vector z, drawn from a prior distribution, into the data space as G(z), while the adversary outputs the probability D(x) that a sample x comes from the real data rather than from the generator.
The principle of the generative adversarial network is to take a vector sampled from a Gaussian distribution and map it into the generated modality space; the generating function usually takes the form of a neural network, so that the generated image or text can closely approximate real images or text. The cost function of the GAN adversary is shown in equation (2):

$$J^{(D)} = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] - \frac{1}{2}\,\mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right] \tag{2}$$
It was mentioned earlier that the generator and the adversary play a zero-sum game, so the sum of their costs must equal zero. It can therefore be deduced that the generator’s cost function satisfies equation (3):

$$J^{(G)} = -\,J^{(D)} \tag{3}$$
Therefore, a value function V(D, G) can be set to represent the payoff of this two-player game, which the adversary seeks to maximize and the generator seeks to minimize.
The transformation of the GAN cost function is shown in Eqs. (4) to (6) below:

$$V(D,G) = \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right] \tag{4}$$

$$V(D,G) = \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{x \sim p_{g}}\left[\log\left(1 - D(x)\right)\right] \tag{5}$$

$$V(D,G) = \int_{x} \left[\, p_{data}(x)\log D(x) + p_{g}(x)\log\left(1 - D(x)\right) \,\right] dx \tag{6}$$
Currently, the problem translates into finding a suitable discriminator D that maximizes V(D, G) and a suitable generator G that minimizes it.
According to the definition of the Nash equilibrium in game theory, neither player of the game can gain by unilaterally changing its behaviour. The same holds in a GAN, which needs to seek an equilibrium point that minimizes the cost of both sides. That is, the problem can be defined as a minimax problem, as shown in equation (7) below:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right] \tag{7}$$
The so-called minimax value means that the function is minimized in one direction while the maximum is taken in the other direction.
So, after the above derivation, the ideal generator and adversary of a generative adversarial network satisfy equation (8) below:

$$G^{*} = \arg\min_{G}\max_{D} V(D,G) \tag{8}$$
For a fixed generator G, training the adversary amounts to maximizing V(D, G) with respect to D. Now it is just a matter of finding the D that maximizes the integrand in Eq. (6) pointwise, which gives the optimal discriminator

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}$$

It can be seen that this is a value ranging from 0 to 1. This is also in line with the standard behaviour of the discriminator: ideally the discriminator should output 1 when it receives real data and 0 for generated data, and when the generated data distribution is very close to the real data distribution it should output 1/2.
After finding D*, it can be substituted back into V(D, G) to express the generator’s objective in terms of the divergence between the real and generated data distributions.
In probability statistics, the JS divergence, like the previously mentioned KL divergence, measures the degree of similarity between two probability distributions. It is computed from the KL divergence and inherits its properties such as non-negativity, with one important difference: the JS divergence is symmetric. The relationship between the JS divergence and the KL divergence is shown in Eq. (13), and the formula for the JS divergence is given in Eq. (14):

$$\mathrm{JS}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right) \tag{13}$$

$$\mathrm{JS}(P \,\|\, Q) = \frac{1}{2}\sum_{x} P(x)\log\frac{2P(x)}{P(x)+Q(x)} + \frac{1}{2}\sum_{x} Q(x)\log\frac{2Q(x)}{P(x)+Q(x)} \tag{14}$$
For the generative adversarial network, substituting the optimal discriminator D* back into V(D, G) shows that the generator’s objective is, up to a constant, the JS divergence between the real data distribution and the generated data distribution; training the generator therefore amounts to minimizing this JS divergence.
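As a numerical sanity check of the derivation above, the snippet below computes the optimal discriminator and the JS divergence for two small discrete distributions (the example distributions are arbitrary): when the generated distribution matches the real one, D*(x) equals 1/2 everywhere and the JS divergence is 0.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions with the same support."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """JS divergence, Eq. (13): average of the two KL terms against the mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.1, 0.4, 0.5])   # "real" distribution (illustrative)
p_g    = np.array([0.3, 0.3, 0.4])   # "generated" distribution (illustrative)

d_star = p_data / (p_data + p_g)     # optimal discriminator, values in (0, 1)
print(d_star)                        # e.g. [0.25  0.571 0.556]
print(js(p_data, p_g))               # > 0 while the distributions differ
print(js(p_data, p_data))            # 0.0 when p_g equals p_data
print(p_data / (p_data + p_data))    # D* is exactly 1/2 everywhere at equilibrium
```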
The model designed in this paper consists of three main parts, namely a cross-modal generative network, a haptic self-encoding network, and the discriminative networks.
The multimodal dataset is denoted as
Given visual and auditory signal pairs {
Where
For a given haptic modal real image
The encoded feature vector
For the adversarial-loss discriminative model, the inputs to the discriminator
The generative model aims to uncover the intrinsic structure and characteristics of the data, thus enabling the generation of multimodal data.
In this,
Cross-modal generative model for audio-visual co-representation
A discriminator network
Then discriminator
The loss function
In the discriminator
Unlike the traditional feature matching loss, the algorithm proposed in this paper uses a feature-level supervised loss function during the generation of haptic signals by
For the cross-modal generative model, the feature vectors
The stochastic gradient is calculated as follows:
The stochastic gradient is calculated as follows:
The input to the decoder is a mapping matrix from the target modality to the common representation space, and the output is a reconstructed image of the target modality in the common representation space. The aim is to minimize the objective function to fit the true correlation distribution. Its stochastic gradient descent is given in the following equation:
Similarly, for the self-encoding network model, the stochastic gradient is computed so that the spatial distribution of the reconstructed signal fits that of the real signal.
The training process for the generative and discriminative models involves iterating them until they reach a stable equilibrium. In this process, the generative model tries to generate samples that are similar to the real samples, while the discriminative model tries to distinguish between real and generated samples. As a result, the heterogeneity gap between the different modalities gradually decreases and the modalities come to share a common representation space.
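A generic sketch of this alternating adversarial update is given below (a minimal illustration, not the paper’s exact networks, data, or optimizer settings): the discriminator is updated to push real samples toward 1 and generated samples toward 0, then the generator is updated to fool the discriminator.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64
generator = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, 784))
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                              nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()
real_data = torch.randn(1000, 784)            # stand-in for real training samples

for step in range(100):                        # iterate until training stabilizes
    real = real_data[torch.randint(0, 1000, (32,))]
    ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

    # Discriminator step: real samples -> 1, generated samples -> 0
    fake = generator(torch.randn(32, LATENT_DIM))
    d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make generated samples be judged as real
    g_loss = bce(discriminator(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```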
The experimental platform is a UR3 robotic arm equipped with a Barrett hand dexterous hand. In this experiment only one finger is used to predict the local attitude of the object. There are two experimental objects, a water bottle and a cube. Each object is placed on a flat surface; given an initial position of the dexterous hand relative to the object, the hand, equipped with a fingertip tactile sensor, grasps the object from open to closed, and the sensor outputs the data of its proximity unit during the approach. The collected dataset is processed with outlier removal and normalization, and the proximity sensing data of the two objects are then input into the trained model; the prediction curves obtained after model fitting are shown in Fig. 3 and Fig. 4. Where the parameter

Figure 3: Local attitude prediction curve of the water bottle

Figure 4: Local attitude prediction curve of the cube
Observing the parameter
The prediction curve of the cube shows an overall decreasing trend, with small fluctuations in the first half. Observing the parameters
In order to analyze the effect of the dictionary size K on the algorithm, the value of K was varied with the pooling mode set to average pooling, and the object recognition accuracies of OSL-SR and of the algorithm proposed in this paper were observed; the results are shown in Figure 5.

Figure 5: Relationship between recognition accuracy and dictionary size
The object recognition results depend not only on the specific algorithm but also on the dictionary size parameter it uses. As the dictionary size K increases, the object recognition results of OSL-SR and of this paper’s model under single-sample learning generally increase. As K increases from 30 to 80, the recognition accuracies of the OSL-SR model are 84%, 86%, 87%, 90%, 91%, and 91%, respectively, while those of this paper’s model are 89%, 90%, 92%, 94%, 95%, and 92%. The figure shows that the recognition accuracy of this paper’s model is higher than that of OSL-SR at every K stage, which directly indicates better generalization ability under different parameters and reflects that taking into account the temporal characteristics of reconstructed data with coupling properties helps improve the efficiency of the algorithm.
Figure 6 shows the F1 scores for perceptual state recognition obtained by this paper’s model and the other two algorithms. At every sparsity level, the recognition performance of this paper’s model is significantly better than that of JKSC and AMDL. When T=5, its maximum recognition result is 0.953, higher than the recognition performance of the other two models. When T>5, the performance of this paper’s model starts to decrease, but it remains higher than that of the other algorithms. It was further found that AMDL is more sensitive to sparsity than JKSC because it considers the force association between multiple fingers. It can be concluded that the model in this paper is also superior in the overall multimodal perception of objects.

Figure 6: Comparison of the algorithms at different sparsity levels
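For reference, a hedged sketch of how per-state F1 scores of the kind reported in Figure 6 can be computed with scikit-learn; the labels below are dummy data for illustration only, not the experimental results.

```python
from sklearn.metrics import f1_score

# Dummy ground-truth and predicted perceptual-state labels (illustrative only)
y_true = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]
y_pred = [0, 1, 2, 1, 0, 1, 2, 1, 0, 2]

# Macro-averaged F1 treats every perceptual state equally
print(f1_score(y_true, y_pred, average="macro"))
```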
In this paper, a sensing model for robotic multimodal information perception is designed that can fuse the multimodal heterogeneous data acquired by multiple sensing devices across structural settings, time scales, and spatial dimensions, ultimately enhancing the robot’s perceptual prediction capability. It is verified that the model reasonably predicts the overall trend of the local poses of different objects, and that different object surfaces produce corresponding prediction effects. Comparing this paper’s model with OSL-SR, JKSC, and AMDL shows that the robot’s perceptual prediction performance is related to the dictionary size parameter set by the algorithm. As the dictionary size K increases, the object recognition results of OSL-SR and of this paper’s model under single-sample learning also improve, and the recognition accuracy of this paper’s model is higher than that of OSL-SR at every K stage, which demonstrates better generalization ability under different parameters and reflects that considering the temporal features of reconstructed data with coupling characteristics helps improve the algorithm’s efficiency. Under any sparsity, the recognition performance of this paper’s model is significantly better than that of JKSC and AMDL. At T=5, the maximum recognition result is 0.953, superior to the other two models. When T>5, the performance of this paper’s model starts to decrease but remains higher than that of the other algorithms. In summary, this paper’s model achieves the design goal of intelligently enhancing the robot’s perceptual prediction ability.
This work was supported by the Heilongjiang Institute of Technology Horizontal Research Project (No. 220124038): Development of a Web-based Product Selection Platform for S Enterprise’s Gear Reducers.
