
Recursive neural network-based design of unmanned aircraft swarm collaborative mission execution and autonomous navigation system

24 March 2025


Introduction

A recurrent neural network is a neural network with recurrent connections: each node can take the output of the previous time step as an additional input. These recurrent connections allow the network to process sequential data such as natural language and time series [1-3]. Compared with a traditional feed-forward neural network, a recurrent neural network has memory: previously seen information influences the current output, so temporal relationships in sequence data are captured more faithfully, and the architecture has found wide application across science and technology [4-6].
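This memory effect can be illustrated with a minimal scalar recurrent cell (a toy sketch, not the network used later in the paper): the hidden state h_t = tanh(w_h·h_{t−1} + w_x·x_t + b) carries information forward, so identical inputs can produce different outputs depending on what came before. The weight values here are arbitrary illustrative choices.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One step of a minimal scalar recurrent cell: the previous
    hidden state h_prev feeds back in, giving the network memory."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def rnn_run(xs, h0=0.0):
    """Unroll the cell over a whole input sequence."""
    h = h0
    hs = []
    for x in xs:
        h = rnn_step(h, x)
        hs.append(h)
    return hs

# The same input value yields different outputs depending on history.
out = rnn_run([1.0, 1.0, 0.0, 0.0])
```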

The UAV is an important achievement of modern science and technology: it is unmanned, efficient, and low-cost, and can be applied in many fields. Among these applications, collaborative mission execution by UAV swarms is one of the most important directions [7-10]. An autonomous UAV navigation system allows a UAV to perceive its surroundings independently, analyze the gathered information, plan a path according to its mission characteristics, execute the flight mission, and react to abnormal situations [11-13]. Given the importance of swarm collaborative mission execution, the need for UAVs to cooperate to accomplish a mission, and the importance of autonomous navigation, designing a recurrent neural network-based system for UAV swarm collaborative mission execution and autonomous navigation is of great significance for UAV applications [14-16].

In this paper, we first analyze the process by which a UAV realizes autonomous navigation and select the Yolov5 target detection model as the network model for visual reconstruction, performing single-stage target detection for the UAV. Environmental information is then collected by fusing distance information with the visual perception results. Building on the SAC algorithm, an LSTM-enhanced Layered-RSAC algorithm is proposed. Next, the UAV autonomous navigation system is built, and randomly selected localization data are tested for accuracy. Fixed-point autonomous takeoff and landing tests are used to analyze autonomous landing accuracy and derive operational indices for the UAV. Finally, the performance of the Layered-RSAC algorithm is compared with other algorithms under adaptive attenuation of the a priori strategy.

Recursive neural network-based UAV navigation system
Definition of autonomous drone navigation

Autonomous navigation of unmanned aerial vehicles (UAVs) in complex environments refers to the generation of control commands by UAVs based on their own sensors’ observations of the surrounding environment to fly from a starting position to a target position in complex environments. The environment (e.g., urban environment with tall buildings, forest environment with regular trees) is unknown, obstacles in the environment appear randomly, and the starting position and target position are generated randomly.

Generally speaking, UAV control design involves three aspects: speed, direction, and height. For simplicity, this paper assumes that the UAV flies at a fixed altitude during autonomous navigation, so it only needs to control changes of direction and speed. In addition, this paper ignores the physical limitations of the UAV dynamics model and assumes that control commands take effect instantaneously. Based on these simplifications and assumptions, let ψ = [ω, v, x, y] represent the state of the UAV, where ω ∈ [0, 2π], v, and [x, y] represent the first-view direction, velocity, and horizontal position of the UAV, respectively; let ϑt represent the command controlling the UAV's first-view direction, and vt the acceleration command.

In practical applications, a UAV's observation of the surrounding environment can be realized with visual cameras, radars, and similar sensors, while localization of its own position can be realized with the Global Positioning System (GPS). In the simulation environment, however, it is assumed that the UAV only uses ranging sensors deployed in a limited number of directions to perceive the surrounding local environment, and its position information is given directly by the simulation environment. This paper proposes a deployment scheme for the virtual ranging sensors carried by the UAV: taking the first-view direction ω of the UAV as the reference, the nine ranging sensors deployed in the horizontal plane point to $$\left\{ \left( \omega + i \times {\theta _s} \right)\ \bmod \ 2\pi \right\}_{i = -4}^{4}$$, where θs = 0.125π.
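The sensor deployment scheme can be sketched directly: θs = 0.125π and the nine indices i = −4…4 are taken from the text, while the function name is our own illustration.

```python
import math

THETA_S = 0.125 * math.pi  # angular spacing between adjacent ranging sensors

def sensor_directions(omega):
    """Headings of the nine virtual ranging sensors, taking the UAV's
    first-view direction omega as the reference:
    (omega + i * theta_s) mod 2*pi for i = -4..4."""
    return [(omega + i * THETA_S) % (2 * math.pi) for i in range(-4, 5)]

# With omega = 0, the sensors span a half-circle in front of the UAV.
dirs = sensor_directions(0.0)
```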

Based on the above definition, the navigation problem of the UAV in a complex environment can be further described as follows: the first-view direction ω and speed v of the UAV are controlled through the commands ϑt and vt so that the UAV, flying at a fixed altitude and aided by the environmental observations fed back by its onboard sensors, flies from any starting position $$\left[ x_0, y_0 \right]$$ to any target position $$\left[ x_d, y_d \right]$$ in a complex environment of randomly distributed obstacles while avoiding the obstacles en route. The criterion for the UAV reaching the target is defined as: $$Success = \left\{ {\begin{array}{c} {True}&{if\ {{\left\| {\left( {x_t},{y_t} \right) - \left( {x_d},{y_d} \right)} \right\|}_2} \leq C} \\ {False}&{otherwise} \end{array}} \right.$$

where C is the preset distance threshold.
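The success criterion reduces to a Euclidean distance check; a minimal sketch follows, where the threshold value is illustrative since the paper does not state a concrete C.

```python
import math

def reached_target(pos, target, c=0.5):
    """Success test from the navigation definition: True when the Euclidean
    distance between the UAV position and the target is within the preset
    threshold C. The value c=0.5 is an illustrative assumption."""
    dx, dy = pos[0] - target[0], pos[1] - target[1]
    return math.hypot(dx, dy) <= c

# Close to the target -> success; far away -> not yet.
near = reached_target((1.0, 1.0), (1.2, 1.1))
far = reached_target((0.0, 0.0), (5.0, 5.0))
```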

Design and realization of UAV navigation system

This paper proposes an algorithmic framework for collaborative task execution and autonomous navigation of UAV swarms based on recurrent neural networks. The basic idea is as follows: multiple UAVs and targets exist in the environment. First, each UAV obtains raw data from its onboard sensors, and the cooperative perception module reconstructs, stitches, grayscales, and downscales the raw visual data; the inter-UAV distance information and each UAV's own motion information are jointly mapped onto the visual data, and the joint perception results from multiple moments are stacked into a joint state description, which serves as the input to the collaborative decision-making module. The collaborative decision-making part uses this joint state description to generate the optimal action for each UAV end-to-end. By building a simulated navigation environment and training interactively in a centralized manner, the joint rewards are optimized and a stable collaborative strategy is finally obtained. The proposed algorithm is described in detail below in terms of the design and implementation of the joint state description and the structure of the collaborative decision-making network.

Target detection based on Yolo algorithm

Neural network-based target detection methods can be divided into two main families: two-stage target detection algorithms represented by Fast R-CNN, and single-stage target detection algorithms represented by the Yolo series. The two-stage approach first generates candidate regions and then classifies or regresses them through a convolutional neural network; it usually has higher accuracy but slower detection speed. The single-stage approach treats target detection as a bounding-box regression problem, extracting image features directly to predict object category and location; it usually has faster detection speed and is better at learning generalized object features. Given the real-time requirements of UAV control, this paper selects a single-stage detection algorithm. The Yolo series has released nine versions since its introduction, of which Yolov5 has clear advantages in ecosystem maturity, accuracy, and real-time performance; it is especially stable and robust with small samples and offers better real-time performance on lower-performance embedded GPU platforms. Therefore, this paper chooses the Yolov5 target detection model as the network model for visual reconstruction, taking into account both real-time UAV control and the convenience of future deployment on edge computing devices.

The basic principle of the YOLO family of algorithms is to directly predict and localize multiple targets in an image using a single neural network model with a single forward pass. This class of algorithms divides the input image into an S × S grid; each grid cell is responsible for predicting whether it contains a target as well as the target's location and category. Anchor boxes obtained by clustering on the data characteristics are introduced to assist prediction, with multiple anchor boxes per grid cell whose sizes are correlated with the sizes of the targets. When the center point of an object falls into a grid cell, that cell is responsible for predicting the object; each cell predicts the location information of B bounding boxes and compares it with the real labeled box, calculated as (2): $$C_i^j = {P_{i,j}}(object) \times IOU_{pred}^{truth}$$

In Eq. (2):

$$C_i^j$$ - confidence score of the j-th bounding box in the i-th grid cell;

$${P_{i,j}}\left( {object} \right)$$ - probability that the j-th bounding box in the i-th grid cell contains an object;

$$IOU_{pred}^{truth}$$ - intersection-over-union of the predicted box with the real labeled box.
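Eq. (2) can be sketched as follows, with a standard axis-aligned IoU helper; the (x1, y1, x2, y2) box convention is an assumption made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def confidence(p_object, pred_box, truth_box):
    """Eq. (2): C_i^j = P_ij(object) * IOU(pred, truth)."""
    return p_object * iou(pred_box, truth_box)

# Two 2x2 boxes overlapping in a 1x2 strip: IoU = 2 / (4 + 4 - 2) = 1/3.
c = confidence(1.0, (0, 0, 2, 2), (1, 0, 3, 2))
```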

Convolutional neural networks with multiple layers, branches, and different kernel sizes are used to extract feature maps at different scales, which are then fused; supervised learning with samples containing ground truth completes network training. The loss function usually consists of a localization error (quadratic cost on the center coordinates, width, and height), a confidence error (cross-entropy loss), and a category error (binary cross-entropy or quadratic cost). Overlapping bounding boxes are subsequently eliminated and the false-detection rate reduced by post-processing such as non-maximum suppression, so that the location and category of each object in the input image can be inferred directly from the trained network. The confidence error is calculated as (3): $${E_1} = -\sum\limits_{i = 0}^{{S^2}} {\sum\limits_{j = 0}^B {W_{ij}^{obj}} } \left[ {\hat C_i^j\log \left( {C_i^j} \right) + \left( {1 - \hat C_i^j} \right)\log \left( {1 - C_i^j} \right)} \right]$$

In Eq. (3):

S² - the number of grid cells the image is divided into;

B - number of bounding boxes;

$$C_i^j$$ - predicted-box confidence score;

$$\hat C_i^j$$ - true-box confidence score;

$$W_{ij}^{obj}$$ - positive/negative sample indicator.
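A minimal sketch of the confidence error of Eq. (3), written in the standard binary cross-entropy form with the W_ij^obj mask flattened into a list alongside the predictions:

```python
import math

def confidence_loss(pred_c, true_c, weights):
    """Confidence error of Eq. (3): binary cross-entropy over all grid
    cells and boxes, masked by W_ij^obj (1 for boxes assigned to an
    object, 0 otherwise)."""
    total = 0.0
    for c, c_hat, w in zip(pred_c, true_c, weights):
        total += -w * (c_hat * math.log(c) + (1 - c_hat) * math.log(1 - c))
    return total

# One confident positive and one confident negative give a small loss.
loss = confidence_loss([0.9, 0.2], [1.0, 0.0], [1, 1])
```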

The position of the bounding box is predicted as follows: $${b_x} = \left( {scal{e_{xy}} \cdot \sigma \left( {{t_x}} \right) - \frac{{scal{e_{xy}} - 1}}{2}} \right) + {c_x}$$ $${b_y} = \left( {scal{e_{xy}} \cdot \sigma \left( {{t_y}} \right) - \frac{{scal{e_{xy}} - 1}}{2}} \right) + {c_y}$$ $${b_w} = {p_w}{e^{{t_w}}}$$ $${b_h} = {p_h}{e^{{t_h}}}$$

In Eqs. (4)-(7):

$$\left( {b_x},{b_y} \right)$$ - pixel coordinates of the center of the prediction box;

$$\left( {c_x},{c_y} \right)$$ - coordinates of the upper-left corner of the grid cell;

$$\left( {t_x},{t_y} \right)$$ - offset of the center point;

scalexy - a scaling factor greater than 1, ensuring the prediction box can reach the extreme values;

$$\sigma \left( \cdot \right)$$ - the sigmoid function;

$$\left( {b_w},{b_h} \right)$$ - width and height of the prediction box;

$$\left( {p_w},{p_h} \right)$$ - width and height of the a priori (anchor) box obtained by clustering;

$$\left( {t_w},{t_h} \right)$$ - scaling of the a priori box.
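The box decoding of Eqs. (4)-(7) can be sketched directly; the concrete value scale_xy = 2.0 is a common choice and merely illustrates the paper's requirement that the factor exceed 1.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, scale_xy=2.0):
    """Decode raw network outputs into a box, following Eqs. (4)-(7):
    scaled-sigmoid offsets for the center, exponential scaling of the
    prior width and height."""
    bx = (scale_xy * sigmoid(tx) - (scale_xy - 1) / 2) + cx
    by = (scale_xy * sigmoid(ty) - (scale_xy - 1) / 2) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero offsets place the center mid-cell and keep the prior box size.
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=1.5)
```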

The localization box error is calculated with the squared loss function: $$\begin{array}{rcl} {E_2} &=& \sum\limits_{i = 0}^{{S^2}} {\sum\limits_{j = 0}^B {W_{ij}^{obj}} } \left[ {{{\left( {\sigma \left( {{t_x}} \right)_i^j - \sigma \left( {{{\hat t}_x}} \right)_i^j} \right)}^2} + {{\left( {\sigma \left( {{t_y}} \right)_i^j - \sigma \left( {{{\hat t}_y}} \right)_i^j} \right)}^2}} \right] \\ &+& \sum\limits_{i = 0}^{{S^2}} {\sum\limits_{j = 0}^B {W_{ij}^{obj}} } \left[ {{{\left( {\sigma \left( {{t_w}} \right)_i^j - \sigma \left( {{{\hat t}_w}} \right)_i^j} \right)}^2} + {{\left( {\sigma \left( {{t_h}} \right)_i^j - \sigma \left( {{{\hat t}_h}} \right)_i^j} \right)}^2}} \right] \end{array}$$

In Eq. (8):

$$\left( {{{\hat t}_x},{{\hat t}_y}} \right)$$ - center coordinates of the real labeled box;

$$\left( {{{\hat t}_w},{{\hat t}_h}} \right)$$ - width and height of the real labeled box.

The overall loss function is the sum of (3) and (8). The dataset is built by collecting and labeling data, and the neural network is trained with the Adam optimizer until the error falls below a set threshold.
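The localization term of Eq. (8) can be sketched in the same flattened style, with each entry holding the four offsets (σ(t_x), σ(t_y), σ(t_w), σ(t_h)) as the paper writes them:

```python
def localization_loss(pred, truth, weights):
    """Localization error of Eq. (8): squared error between the predicted
    and true offset tuples, masked by W_ij^obj."""
    total = 0.0
    for p, t, w in zip(pred, truth, weights):
        total += w * sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total

# One positive box differing by 0.1 in x and 0.2 in h: loss = 0.01 + 0.04.
loss = localization_loss([(0.5, 0.5, 0.1, 0.2)], [(0.4, 0.5, 0.1, 0.0)], [1])
```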

Distance fusion based information acquisition

When a UAV collects environmental information using only its onboard forward-facing RGB camera, the limited field of view restricts the information it can capture, and in a cooperative UAV navigation task this can lead to collisions between UAVs that cannot sense each other. Introducing inter-UAV distances into the sensing data has the following advantages:

When UAVs come too close to each other, the decision-making part can adjust its strategy in time, e.g., by changing direction, to avoid collision.

Combining inter-UAV distance information with the visual perception results gives a better sense of the relative positions of the UAVs; the agent gains a better understanding of the spatial layout of the surrounding environment, which benefits path planning.

Distance information also helps coordinate the motion of the UAVs: the agent can dynamically adjust the target allocation according to inter-UAV distances and visual perception, ensuring more efficient cooperative navigation.

RSAC-based sub-network structure

For the autonomous navigation task of cellular-connected UAVs in highly dynamic environments, the algorithm in this study is designed to satisfy the two basic requirements of UAV navigation first: safety and efficiency. The complex decision-making process of UAV navigation is therefore decomposed into three more tractable subproblems: avoiding obstacles, approaching the destination, and choosing a specific action from the solutions of the first two subproblems so as to maintain the UAV's air-ground communication connectivity. For the first two subproblems, this study designs an RSAC-based sub-network architecture consisting of an Evade Network and an Approach Network, each generating actions matched to its subtask: the Evade Network generates actions suitable for obstacle avoidance, and the Approach Network generates actions suitable for approaching the destination.

The elaboration of the sub-network structure begins with the composition of its input vector. Since this input relates to the obstacle-avoidance subproblem and must contain information on the historical motion characteristics of the dynamic obstacles, the input of this sub-network structure, shown in (9), still adopts real-time-updated historical information as the input vector; the difference is that this history contains the sub-actions generated by the sub-network from the earlier history: $${h_t} = \left( {{s_0},{a_0},{s_1},{a_1},{s_2}, \ldots ,{a_{t - 1}},{s_t}} \right)$$

ht is the historical information updated in real time with the state transitions of the Markov decision process; it plays an important role in subsequent UAV decision iterations and is updated as shown in (10): $${h_{t + 1}} \leftarrow {h_t} \cup \left( {{a_t},{s_{t + 1}}} \right)$$
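The history update of Eqs. (9)-(10) reduces to appending the latest action and next state to the running trajectory; a minimal sketch with placeholder states and actions follows.

```python
def update_history(h, action, next_state):
    """Eq. (10): extend the running history h_t with (a_t, s_{t+1}) to
    obtain h_{t+1}. Tuples keep each history immutable between steps."""
    return h + (action, next_state)

# Build h_2 = (s0, a0, s1, a1, s2) step by step from h_0 = (s0,).
h0 = ("s0",)
h1 = update_history(h0, "a0", "s1")
h2 = update_history(h1, "a1", "s2")
```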

Once the input vectors were determined, the sub-network architecture was designed around the classical LSTM network structure. The hierarchical sub-network uses an Actor-Critic architecture similar to the classical SAC algorithm: each sub-network consists of an Actor πθ and two Critics Qφ1 and Qφ2. Using two value networks avoids overestimating the Q values of state-action pairs, and each value network follows the same Eval-Target structure, consisting of an Eval Network with parameter vector φ and a Target Network with parameter vector φ′; the target parameters φ′ are updated much less frequently than φ. It is worth mentioning that, unlike the input action information of the value network, the input action information of the policy network starts from a−1, which is assigned a default value of 0 in order to keep the network structure uniform.

Then, following the overall algorithm flow, the action-selection mode of the sub-networks and the iterative updates of their parameters are described. During the interaction between each cellular-connected UAV and the simulated complex dynamic urban environment, the policy networks of the Evade Network and the Approach Network output the estimated probability distribution $${\pi _\theta }\left( {{\omega _t}\left| {{h_t}} \right.} \right)$$ over the angular velocity ωt and the estimated distribution $${\pi _\theta }\left( {{v_t}\left| {{h_t}} \right.} \right)$$ over the linear velocity vt based on the current historical trajectory ht; the two sub-networks then generate the actions $$a_{evade}^t$$ and $$a_{approach}^t$$ respectively through equation (11): $${a_t} = \left[ {{\omega _t},{v_t}} \right] = \left[ {f_\theta ^1\left( {{\varepsilon _t};{h_t}} \right),f_\theta ^2\left( {{\varepsilon _t};{h_t}} \right)} \right]$$

where $$f_\theta ^1\left( {{\varepsilon _t};{h_t}} \right)$$ and $$f_\theta ^2\left( {{\varepsilon _t};{h_t}} \right)$$ represent the results of sampling the first- and second-dimension output distributions of the policy network after adding noise, respectively. The specific architecture of the designed RSAC-based policy-value network is shown in Fig. 1.

Figure 1.

The specific structure of the designed RSAC-based policy-value network

The Integrated Network then selects one of the two sub-actions as $$a_{Integrated}^t$$. When the drone finishes executing the current action $$a_{Integrated}^t$$ in the environment, it receives the rewards $$r_{evade}^t$$ and $$r_{approach}^t$$ for avoiding obstacles and approaching the destination, and the history trajectory is updated from ht to ht+1. At the same time, the sequences $$seq_{evade}^t:\left( {{h_t},a_{evade}^t,{h_{t + 1}},r_{evade}^t} \right)$$ and $$seq_{approach}^t:\left( {{h_t},a_{approach}^t,{h_{t + 1}},r_{approach}^t} \right)$$ are stored in the experience replay pool $$\mathcal{D}$$, after which K trajectories are drawn from $$\mathcal{D}$$ to iteratively update the neural network parameters φ and θ of the RSAC algorithm.
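The experience replay pool D and the drawing of K trajectories can be sketched as follows; the capacity and K values are illustrative choices, not values from the paper.

```python
import random
from collections import deque

class ReplayPool:
    """Experience pool D holding (h_t, a_t, h_{t+1}, r_t) sequences for
    one sub-network; old entries are evicted once capacity is reached."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def store(self, seq):
        self.buf.append(seq)

    def sample(self, k):
        """Draw K stored trajectories uniformly without replacement."""
        return random.sample(list(self.buf), k)

# Fill the pool with placeholder sequences, then draw a training batch.
pool = ReplayPool()
for t in range(100):
    pool.store((f"h{t}", f"a{t}", f"h{t + 1}", float(t)))
batch = pool.sample(8)
```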

As mentioned before, the value networks of the sub-networks are constructed to better train their policy networks, and their output Q-value estimates $${Q_\varphi }\left( {{h_t},{a_t}} \right)$$ evaluate the decision to perform action at after observing the current historical trajectory ht. For each sub-network, the objective function for iteratively updating Qφ becomes, with some fine-tuning, (12): $$J_Q(\varphi ) = \mathop {\mathbb{E}}\limits_{\left( {{h_t},{a_t}} \right) \sim \mathcal{D}} \left[ {\frac{1}{2}{{\left( {{Q_\varphi }\left( {{h_t},{a_t}} \right) - \left( {r\left( {{h_t},{a_t}} \right) + \gamma \mathop {\mathbb{E}}\limits_{{h_{t + 1}} \sim p\left( {{h_t},{a_t}} \right)} \left[ {{V_{\varphi '}}\left( {{h_{t + 1}}} \right)} \right]} \right)} \right)}^2}} \right]$$

where $${V_{\varphi '}}\left( {{h_{t + 1}}} \right)$$ is computed by the target network via the Monte Carlo estimate of Eq. (13): $${V_{\varphi '}}\left( {{h_t}} \right) = \mathop {\mathbb{E}}\limits_{{a_t}\sim \pi } \left[ {{Q_{\varphi '}}\left( {{h_t},{a_t}} \right) - \alpha \log \left( {\pi \left( {{a_t}|{h_t}} \right)} \right)} \right]$$

The iterative updating of parameters φ and φ′ is repeated over several epochs until the value of $${Q_\varphi }\left( {{h_t},{a_t}} \right)$$ stabilizes and converges.

Similarly, small changes can be made to the objective function for training the policy network, as shown in (14): $${J_\pi }(\theta ) = \mathop {\mathbb{E}}\limits_{{h_t} \sim \mathcal{D}} \mathop {\mathbb{E}}\limits_{{a_t} \sim {\pi _\theta }} \left[ {\alpha \log \left( {{\pi _\theta }\left( {{a_t}|{h_t}} \right)} \right) - {Q_\varphi }\left( {{h_t},{a_t}} \right)} \right]$$

After this, stochastic gradient descent is performed to minimize (14). Finally, the loss function of the temperature coefficient is calculated by Eq. (15): $$J(\alpha ) = \mathop {\mathbb{E}}\limits_{{a_t} \sim {\pi _t}} \left[ { - \alpha \left( {\log \left( {{\pi _t}\left( {{a_t}|{h_t}} \right)} \right) + \overline {\mathcal{H}} } \right)} \right]$$
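The soft value of Eq. (13) and the temperature loss of Eq. (15) can both be estimated by averaging over sampled actions; a numeric sketch follows, with made-up Q values, log-probabilities, alpha, and target entropy.

```python
import math

def soft_value(qs, log_probs, alpha):
    """Monte Carlo estimate of the soft state value of Eq. (13):
    V = mean(Q(h, a) - alpha * log pi(a | h)) over sampled actions."""
    return sum(q - alpha * lp for q, lp in zip(qs, log_probs)) / len(qs)

def temperature_loss(log_probs, alpha, target_entropy):
    """Eq. (15): J(alpha) = mean(-alpha * (log pi(a | h) + H_bar))."""
    return sum(-alpha * (lp + target_entropy)
               for lp in log_probs) / len(log_probs)

# Two sampled actions with illustrative Q values and probabilities.
v = soft_value([1.0, 2.0], [math.log(0.5), math.log(0.25)], alpha=0.2)
tl = temperature_loss([math.log(0.5)], alpha=0.2, target_entropy=1.0)
```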

From the above introduction, the integrated complex strategy for autonomous UAV navigation in this algorithm evolves from three simpler strategies: avoidance, approach, and selection. The algorithm is named Layered-RSAC because it is based on the LSTM-enhanced SAC algorithm, uses a layered neural network framework to solve the complex optimization problem, and applies a layer-by-layer optimization approach.

Experimental results and analysis
System testing and operational results
System construction and accuracy testing

To examine the application effect of the proposed autonomous UAV navigation system, the system was built. The space occupied by autonomous UAV navigation was set as a quadrilateral of 4 km², with its four vertices chosen as localization points. Localization point 1 is the origin of the local coordinate system, the direction from point 1 to point 2 is the x-axis, and the direction from point 1 to point 3 is the y-axis. Nine positions in the space were randomly selected and their positioning data obtained; the positioning errors in the three directions, compared with the true positions measured with laser ranging equipment, are shown in Fig. 2.

Figure 2.

System positioning error

Analysis of the data shows that the average localization error of the autonomous navigation system is 9.35 cm in the x-direction and 9.23 cm in the y-direction, while the average localization error along the z-axis is 17.59 cm.

Autonomous take-off and landing of drones at fixed points

To improve stability during takeoff and landing, ultrasonic altimetry data were fused into the UAV altitude control in the near-ground stage. The landing point set in the experiment was (8, 8, 0). The UAV was unlocked and took off from an arbitrary position, rose to a height of 3 m, flew to the pre-landing position (8, 8, -3), and then entered landing mode.

A total of 18 flight tests were conducted, each time with the UAV taking off from a random position; the actual landing positions of the UAV are shown in Table 1. The average landing deviation was 12.26 cm in the x-direction and 9.72 cm in the y-direction.

Table 1. Analysis of UAV autonomous landing accuracy

Experiment Landing position/m Landing deviation/m
1 (7.813,7.930) (0.187,0.070)
2 (8.105,8.098) (0.105,0.098)
3 (7.983,8.021) (0.017,0.021)
4 (7.892,7.928) (0.108,0.072)
5 (8.221,8.097) (0.221,0.097)
6 (7.779,8.006) (0.221,0.006)
7 (8.201,8.195) (0.201,0.195)
8 (8.112,8.099) (0.112,0.099)
9 (8.250,8.192) (0.250,0.192)
10 (8.004,7.995) (0.004,0.005)
11 (7.921,8.107) (0.079,0.107)
12 (8.112,8.099) (0.112,0.099)
13 (7.911,8.024) (0.089,0.024)
14 (7.899,8.133) (0.101,0.133)
15 (7.980,8.210) (0.020,0.210)
16 (7.821,7.905) (0.179,0.095)
17 (8.199,8.207) (0.199,0.207)
18 (8.002,8.019) (0.002,0.019)
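The reported landing averages can be reproduced directly from the deviation column of Table 1:

```python
# Landing deviations (x, y) in metres, copied from Table 1.
deviations = [
    (0.187, 0.070), (0.105, 0.098), (0.017, 0.021), (0.108, 0.072),
    (0.221, 0.097), (0.221, 0.006), (0.201, 0.195), (0.112, 0.099),
    (0.250, 0.192), (0.004, 0.005), (0.079, 0.107), (0.112, 0.099),
    (0.089, 0.024), (0.101, 0.133), (0.020, 0.210), (0.179, 0.095),
    (0.199, 0.207), (0.002, 0.019),
]

# Average deviations over the 18 flights, converted to centimetres.
mean_dx_cm = 100 * sum(d[0] for d in deviations) / len(deviations)
mean_dy_cm = 100 * sum(d[1] for d in deviations) / len(deviations)
```

Rounding to two decimals recovers the 12.26 cm and 9.72 cm figures quoted in the text.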
Indicators of drone operations

The metrics of the UAV's operation during autonomous navigation are shown in Figure 3. Figure 3(a) shows the variation of the UAV's memory pool utilization against the exploration completion rate. The memory pool never became full: its utilization was effectively controlled, growing rapidly at the moments when the UAV flew into large amounts of unknown space. More than 85% of the space was explored.

Figure 3.

Navigation health indicators

An octree is a tree-based data structure for describing three-dimensional space, which effectively avoids wasted storage and excessive computational complexity. Figure 3(b) shows the change in the number of known octree nodes during the autonomous navigation of the UAV. Given the characteristics of the environmental obstacles, idle nodes trigger pruning as exploration continues, so their number decreases in some periods and the overall growth trend is smoother. Occupied nodes, in contrast, rarely trigger pruning because only their surfaces can be explored, so their number grows continuously. At t = 150 s, the counts of idle and occupied nodes flatten, indicating that most of the space has been explored and most of the corresponding nodes are known.

Superiority of Layered-RSAC algorithm

The UAV autonomous navigation algorithm Layered-RSAC proposed in this paper is simulated and analyzed; the algorithm parameters used in this study are shown in Table 2. The space occupied by autonomous UAV navigation is set to about 4 km², the UAV flight altitude is assumed constant, the maximum number of training steps is set to 186,000, and the influence of the natural environment is ignored for the time being. The environment and equipment used for the design and simulation are: Intel® Core™ i7-9700K CPU, 32 GB dual-channel memory, Windows 10 64-bit operating system, Python 3.8, and PyTorch 1.9.0.

Table 2. Layered-RSAC algorithm parameters

Argument Value
Learning rate 0.0001
Batch learning scale 22
Multithread scale 15
Discount factor 0.99
Maximum training step size 186000
Prior Policy experience attenuation 0.00005
Prior Policy initial variance 0.15

The Layered-RSAC algorithm reduces the number of detours during autonomous UAV navigation by better matching the learning model's capability with the a priori policy weighting in the early stage of training. To verify the impact of this algorithm on the final autonomous navigation training effect, the extra distance ratio is designed to evaluate path optimization when navigating between a randomly generated start and end point at a given training step; it also reflects the degree of detouring during navigation. The extra distance ratio is defined as the actual path length minus the straight-line path length, divided by the straight-line path length.
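The extra distance ratio reduces to a one-line computation; the path lengths below are made-up numbers for illustration only.

```python
def extra_distance_ratio(actual_path_length, direct_path_length):
    """(actual - direct) / direct: 0 means the UAV flew the straight-line
    path; larger values mean more detouring during navigation."""
    return (actual_path_length - direct_path_length) / direct_path_length

# A 1300 m flight between points 1000 m apart detoured by 30%.
r = extra_distance_ratio(1300.0, 1000.0)
```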

Owing to the large number of training steps, this paper averages the extra distance ratio over every 5000 training steps, so the first 2.5×10⁵ training steps yield a curve of 50 points. To better demonstrate the algorithm's training effect, the autonomous navigation success-rate curves under different prior-strategy standard deviations are shown in Figure 4.

Figure 4.

Success rate curve of autonomous navigation under different prior strategies

As can be seen from Fig. 4, when the standard deviation is very small, the navigation success rate of the model is very low, peaking at about 30%. The reason is that when the initial standard deviation is too small, the Prior-Policy's guidance of the UAV is too strong, i.e., the Prior-Policy accounts for too large a proportion of the executed behavioral strategy; the learning strategy then finds no motivation to improve its own ability, so the training results tend to match the capability of the Prior-Policy itself. When the standard deviation of the Prior-Policy is too large, its guidance in the early training period is too weak and inexperienced learning strategies make up a relatively high proportion of the executed behavior, making it difficult for the learning strategy to absorb experience; the model then oscillates, and although the navigation success rate is higher, performance is not stable. The network with a standard deviation of 0.35 guides the decision-making network more strongly than the one with 0.45, so its Prior-Policy decays more slowly and the model converges more slowly, but its converged navigation success rate is comparable to that with a standard deviation of 0.45.

Apart from the cases where the standard deviation is too low or too high, in which the model shows lower success rates and difficulty converging, the model converges to high navigation success rates for most initial Prior-Policy standard deviations within a reasonable range. This indicates that the Layered-RSAC algorithm is relatively robust to this hyperparameter. Therefore, this paper chooses the better-performing 0.45 as the initial Prior-Policy standard deviation, and the autonomous navigation success-rate curves of the different algorithms with σ = 0.45 are shown in Fig. 5.

Figure 5.

Autonomous navigation success rates of different algorithms at σ = 0.45

As can be seen in Fig. 5, the Layered-RSAC algorithm proposed in this paper significantly outperforms the classical Deep Deterministic Policy Gradient (DDPG) algorithm and Prior-Policy, which navigates using only a priori policies without a learning network, both in navigation success rate and in the number of steps needed to converge, reaching a success rate of 90% or more. Layered-RSAC first reaches a 90% navigation success rate at 50 training steps and stabilizes at 90% to 100% after 100 training steps. The SAC algorithm, although it attains a high navigation success rate of 95% to 100%, converges too slowly, first reaching 90% success at 150 training steps and stabilizing there. Prior-Policy, which relies solely on a priori policies and no reinforcement learning, converges fastest, reaching convergence at 75 training steps, but ultimately stabilizes at only a 25% to 30% success rate. The classical DDPG algorithm has an almost negligible navigation success rate because it fails to learn effective experience. The simulation results show that the Layered-RSAC algorithm effectively improves both training convergence speed and navigation success rate.

Conclusion

This paper proposes a UAV swarm collaborative task execution and autonomous navigation system based on recurrent neural network, and conducts system operation tests and performance evaluation of the Layered-RSAC algorithm.

The data analysis shows that the average positioning error of the autonomous navigation system is 9.35 cm in the x-direction, 9.23 cm in the y-direction, and 17.59 cm in the z-direction. A total of 18 fixed-point autonomous takeoff and landing tests were conducted, each time with the UAV taking off from a random position; the average landing deviation is 12.26 cm in the x-direction and 9.72 cm in the y-direction. During autonomous navigation, at t = 150 s the exploration completion reaches 85%, most of the space has been explored, and most of the corresponding nodes are in a known state.

When the standard deviation of the a priori strategy is 0.45, the model's navigation success rate is optimal. With σ = 0.45 as the initial value, the Layered-RSAC algorithm first reaches a 90% navigation success rate at 50 training steps and stabilizes at 90% to 100% after 100 training steps, significantly ahead of the Prior-Policy, DDPG, and SAC algorithms, effectively verifying the superiority of the Layered-RSAC algorithm.
