
Research on unmanned delivery path optimization strategy based on reinforcement learning in intelligent logistics system

Sep 26, 2025


Introduction

With the rapid development of China’s economy, the rise of e-commerce and consumers’ increasing demand for logistics services, the logistics industry is facing unprecedented challenges. Traditional logistics and distribution methods can no longer meet the needs of modern society in terms of efficiency, cost and service quality [1-4]. The concept of intelligent logistics therefore came into being, aiming to make the logistics distribution process intelligent, automated and efficient through the introduction of advanced information technology, the Internet of Things, artificial intelligence and other means [5-7]. As an unmanned, fast and flexible distribution method, UAV delivery has unique advantages, but because UAV delivery path planning and optimization is a complex problem, how to effectively plan UAV delivery paths and maximize delivery efficiency has become a key issue [8-11].

The goal of path optimization is to find the shortest, safest, and most energy-efficient flight path for UAVs from the origin to the destination. Obstacles such as buildings, power lines, and communication towers in the urban environment can pose a threat to UAV flight, so these obstacles must be avoided when planning the path [12-15]. Considering air traffic control regulations, the flight altitude and airspace of UAVs also need to be strictly limited. Weather conditions in different regions, such as wind strength and direction and rainfall, also affect the flight efficiency and safety of UAVs, so these factors should likewise be taken into account in path planning [16-19]. To achieve path optimization, advanced techniques and algorithms such as reinforcement learning can be applied. Reinforcement learning is a machine learning method in which an agent learns through interaction with its environment [20-23]. In reinforcement learning, the agent interacts with the environment by choosing actions and receiving feedback from it, and thereby learns an optimal path [24-25].

Literature [26] explored the application of UAVs in logistics operations, pointing out that datasets, constraint distributions, and features have a significant impact on the selection of optimal delivery schemes and methods. Literature [27] proposed an attention-based pointer network model to solve the UAV delivery trajectory optimization problem, which adapts effectively and automatically to new UAV trajectory data and outperforms heuristic algorithms in UAV trajectory analysis and optimization. Literature [28] provides an overview of route planning concepts, logics, trends, and opportunities, noting that a proper look at the tasks of operations management can identify development strategies or possible system introductions, including methods and tools. Literature [29] designed a factory logistics route planning model based on a grid environment and the ant colony optimization algorithm, which enables accurate logistics route planning and provides a competitive solution for implementing smart logistics. Literature [30] proposed an intelligent logistics UAV that replaces the courier in delivering small goods, detects surrounding obstacles in flight, and feeds information back to a ground control station so that emergency actions can be prepared. Literature [31] investigated the effects of wind speed and direction on the UAV flight state, as well as the logistics UAV path planning problem with UAV energy constraints, customer time windows and wind conditions, and merged LNS with GA to form GA-LNS, showing that the algorithm produces superior solutions. Literature [32] analyzed the application of the Q-learning reinforcement learning algorithm to optimizing logistics distribution paths, showed that the algorithm performs well in path optimization and time saving and improves distribution efficiency, and revealed the potential of reinforcement learning in logistics distribution optimization.

Literature [33] proposed an unmanned vehicle path planning strategy based on the ant colony algorithm, introduced an optimization strategy by analyzing the advantages and disadvantages of the ant colony algorithm, and provided results of important reference value for actual unmanned vehicle path planning. Literature [34] outlined the problems of wasteful logistics data supervision and high logistics and distribution costs in the development of the cold chain, and proposed a path optimization strategy for cold chain unmanned vehicles. Literature [35] proposed a dynamic path planning strategy based on fuzzy logic and improved ant colony optimization, revealing that the FLACO algorithm can identify the most cost-effective and efficient paths and emphasizing that it can find the most efficient and safe paths for unmanned vehicles. Literature [36] proposed a path planning method that combines the BAS and SA algorithms, showing that the method excels in path length, number of path points, and safe obstacle avoidance, and can satisfy the needs of low-altitude logistics path planning for UAVs. Literature [37] examined the multi-vehicle-UAV cooperative distribution problem considering congestion, proposed a heuristic column generation algorithm, designed a dynamic programming algorithm to solve the pricing problem, and carried out a sensitivity analysis to derive managerial implications. The above research explores path optimization methods for drones and unmanned vehicles based on attention models, ant colony optimization, reinforcement learning and other algorithms. It reveals the importance of path optimization for various industries, especially in the field of science and technology, and also shows that there are many ways to realize path optimization, which is significant for logistics, drones and related fields.

The study first introduces the principles of deep reinforcement learning algorithms and their applications in navigation, including the basic principles of reinforcement learning and the principles of deep reinforcement learning based on the value function, the policy gradient, and the Actor-Critic (AC) framework, which lays the theoretical foundation for the design of the subsequent UAV algorithms and comparative experiments. Subsequently, an improved end-to-end deep reinforcement learning method incorporating an attention mechanism is proposed to solve the vehicle routing problem with time windows (VRPTW). The model combines the decoder with the attention mechanism to parameterize the stochastic strategy for solving the VRPTW. Drawing on the Markov decision process formulation, a reward function is designed that tells the model how to adjust its parameters to improve solution quality, and the model is trained with a REINFORCE algorithm that fuses the Actor-Critic idea with round-based updating, within a reinforcement learning environment defined by a suitable state transition function. Finally, the trained deep reinforcement learning model is used to solve the problem, and the solution results are analyzed.

Principles of deep reinforcement learning algorithms
Principles of Reinforcement Learning Algorithm

The feedback framework of the reinforcement learning algorithm is shown in Fig. 1. Each time an action is executed, the environment gives reward feedback to the unmanned delivery fleet, and the behavior of the unmanned delivery fleet is influenced by these environmental rewards. When the unmanned delivery fleet obtains the state observation st at moment t, its action strategy function π generates the probability of each action A according to the current observation, where A ~ π(·|st); the action at selected by the unmanned delivery fleet is then obtained by sampling from this probability and applied to the environment, which arrives at the new state st+1 with a transition probability $$P_{{S_t}{S_{t + 1}}}^a$$, where st+1 ~ P(·|st, at), while the environment scores the behavior of the unmanned delivery fleet and feeds back an immediate reward rt. In a round, it is assumed that the unmanned delivery fleet undergoes T interactions, each of which is represented by the quadruple 〈st, at, rt, st+1〉, and the sequence of quadruples throughout the round constitutes a trajectory. The rewards over the entire trajectory are accumulated to obtain the return $$U = \sum\limits_{t = 0}^{T - 1} {{r_{t + 1}}}$$, and since the importance of the reward rt at the current moment differs from that of the reward rt+1 at a future moment, a discount factor γ is introduced to obtain the discounted return $$R = \sum\limits_{t = 0}^{T - 1} {{\gamma^t}} {r_{t + 1}}$$, where γ ∈ (0, 1) [38]. The unmanned delivery fleet aims to obtain the maximum discounted return, i.e., $$\max\left( {E\left(\sum\limits_{t = 0}^{T - 1} {{\gamma^{t}}} R({S_t},{A_t},{S_{t + 1}})\,|\,\pi \right)} \right)$$, and thus learns the optimal strategy.

Figure 1.

Perception-action-reward feedback framework for reinforcement learning
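As a concrete illustration of this loop, the following is a minimal sketch, assuming a generic environment object with reset() and step() methods and a policy(s) function returning action probabilities (names chosen here for illustration, not taken from the paper), of how one round of interaction is collected and the discounted return R = Σ γ^t r is accumulated:

```python
import random

def run_episode(env, policy, gamma=0.99, max_steps=100):
    """Collect one round of interaction and its discounted return.
    env.reset(), env.step(a) and policy(s) are assumed interfaces."""
    s = env.reset()
    trajectory, discounted_return = [], 0.0
    for t in range(max_steps):
        probs = policy(s)                                   # A ~ pi(.|s_t)
        a = random.choices(range(len(probs)), weights=probs)[0]
        s_next, r, done = env.step(a)                       # environment feedback
        trajectory.append((s, a, r, s_next))                # quadruple <s_t, a_t, r_t, s_{t+1}>
        discounted_return += (gamma ** t) * r               # R = sum_t gamma^t r
        s = s_next
        if done:
            break
    return trajectory, discounted_return
```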

Partially Observable Markov Decision Processes

Reinforcement learning is usually modeled as a Markov decision process (MDP), denoted by the quintuple M = 〈S, A, P, R, γ〉, where S is the state space, A is the action space, P is the state transition matrix, $$P_{{s_t}{s_{t + 1}}}^{{a_t}} = P[{S_{t + 1}} = {s_{t + 1}}\,|\,{S_t} = {s_t},{A_t} = {a_t}]$$, R is the reward function, denoted as R(st, at) = E[Rt+1 | St = st, At = at], and γ is the discount factor. However, the MDP usually models a fully observable environment, which is an idealized mathematical model for practical reinforcement learning applications [39]. In real-world environments, unmanned delivery fleets often cannot observe all the information about the environment, so the whole process is modeled as a more general partially observable Markov decision process (POMDP), usually denoted by PM = 〈S, A, P, R, Ω, O, γ〉, where Ω is the observation space and O is a set of conditional observation probabilities [40].

Principles of Deep Reinforcement Learning Algorithms

The most representative solution methods for reinforcement learning can be divided into three categories: value function methods, policy gradient methods, and Actor-Critic methods that combine the two. Before introducing the three categories of methods, some concepts or symbolic representations are first explained:

Value function: i.e., the action-value function Q(s, a), which denotes the evaluation of action a in state s. Optimal action-value function: when the state is s, all actions in the action space are evaluated and the highest value is selected, denoted as Q*(s, a) = maxπQπ(s, a), where π is the strategy function, denoting the probability of selecting action a in state s.

State-value function: the expectation of the action value, used to reflect how good the current state is, denoted as Vπ(s) = EA[Qπ(s, A)], where A is a random variable over the action space.

Optimal action: denotes the action when Q*(s, a) is obtained, denoted as a* = argmaxaQ*(s, a).

Value function approach

Q-learning is the most representative value-function-based approach among traditional reinforcement learning algorithms. The Q value in the algorithm represents the cumulative reward estimate, denoted as Q(st, at), for taking a certain action in a given environment state. The core of Q-learning is to continuously update the state-action value function to obtain the optimal strategy. When state st is observed, an optimal action $$a_t^*$$ is selected based on the action values; it is then applied to the environment to obtain a new state st+1 according to the state transition probability and an immediate reward rt. At this point, the Q value contains an estimate of the future discounted reward, denoted as: $$Q({s_t},{a_t}) = {r_t} + \gamma \;{\max _{a \in A}}Q({s_{t + 1}},a)$$

Its Q value is updated as follows:
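Under the standard tabular rule with learning rate α (assumed here, as is conventional for Q-learning), the update drives Q(st, at) toward this target: $$Q({s_t},{a_t}) \leftarrow Q({s_t},{a_t}) + \alpha \left[ {r_t} + \gamma \,{\max _{a \in A}}Q({s_{t + 1}},a) - Q({s_t},{a_t}) \right]$$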

Since Q-learning cannot handle overly large state-action spaces, it is difficult to apply to complex tasks. By combining neural networks with the Q-learning algorithm, the well-known DQN algorithm was proposed. Its core is the use of a convolutional neural network to fit the value function: the state s of the environment is input into the neural network, and the network scores each action. The method can be viewed as replacing the Q table with a neural network, which makes it better suited to tasks with high-dimensional state spaces. $$Q(s,a) = f(s,a;\theta )$$

Here the function f is the function fitted by the neural network, θ is the parameter of f, i.e., the weights of the network, and the training of the network is the updating of these parameters. The optimal Q value is therefore expressed as: $$Q(s,a;\theta ) = E[r + \gamma \,{\max_{a'}}Q(s',a';\theta )\,|\:s,a]$$

The quadratic cost function is used to calculate the error, as in equation (4): $$L(\theta ) = E\{ {[r + \gamma {\max_{a'}}{\text{Q}}(s',a';\theta ) - Q(s,a;\theta )]^2}\}$$

where $$\max_{{a^\prime }}Q({s^\prime },{a^\prime };\theta )$$ is the value of the optimal action in the next state.

Subsequently, the gradient of the objective function is derived as in Eq. (5), and the network parameters are updated with the back-propagation algorithm as in Eq. (6). $$\frac{{\partial L(\theta )}}{{\partial \theta }} = E\left[\left(r + \gamma \,{\max_{{a^\prime }}}Q({s^\prime },{a^\prime };\theta ) - Q(s,a;\theta )\right)\frac{{\partial Q(s,a;\theta )}}{{\partial \theta }}\right]$$ $${\theta_{t + 1}} = {\theta_{t}} + \alpha \left[ {r + \gamma \,{{\max }_{a'}}Q(s',a';\theta ) - Q(s,a;\theta )} \right]\nabla Q(s,a;\theta )$$

The training flow of the DQN algorithm is shown in Fig. 2.

Figure 2.

DQN algorithm training flowchart

To alleviate the Q-value overestimation problem of DQN, the DDQN algorithm introduces a dual-network design that separates the Q-value calculation used for action selection from the Q-value calculation used for action evaluation. Two action-value functions are established: one selects the action based on the current state, and the other evaluates the value of each action in that state. Its TD target is calculated as in (7): $${y_t} = {r_t} + \gamma Q({s_{t + 1}},\arg\max_a Q({s_{t+1}},a;\theta );{\theta^ - })$$

where θ is the parameter of the action-selection network and θ⁻ is the parameter of the state-evaluation network.

The TD error is calculated according to equation (8): $$L(\theta ) = \frac{1}{{batch}}\sum\limits_{i = 1}^{batch} {\left( {r_i} + \gamma Q({s_{t + 1}},\arg {\max_a}Q({s_{t+1}},a;\theta );{\theta^ - }) - Q({s_t},{a_t};\theta )\right)^2}$$
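To make the difference between the two targets concrete, the following numpy sketch computes the DQN target alongside the Double-DQN target of Eq. (7); q_online and q_target stand in for the two value networks (illustrative names, not from the paper):

```python
import numpy as np

def td_targets(r, s_next, q_online, q_target, gamma=0.99):
    """q_online(s) / q_target(s) are assumed to return a vector of action values."""
    q_next_online = q_online(s_next)
    q_next_target = q_target(s_next)
    # DQN: the target network both selects and evaluates the next action
    y_dqn = r + gamma * np.max(q_next_target)
    # Double DQN (Eq. 7): the online network selects, the target network evaluates
    a_star = int(np.argmax(q_next_online))
    y_ddqn = r + gamma * q_next_target[a_star]
    return y_dqn, y_ddqn
```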

Strategy gradient

The value-function-based methods use an action-value network to estimate the value of all actions in the current state and select the optimal action according to the action-selection strategy and the estimates. In contrast, in the strategy-gradient-based method, a neural network is used to directly approximate the strategy function π, as shown in (9), where the input of the strategy function is the state s and the output is the probability P of taking action a. $$\pi (a|s) \approx \pi (a|s;\theta ) = P\{ {A_t} = a|{S_t} = s,{\theta_t} = \theta \}$$

The unmanned delivery fleet obtains from the strategy network the probability of executing each action in the current state; after an action is executed, the environment gives a reward, which drives the updating of the network. In practice, the state-value function is usually used to evaluate the current state and the strategy network, as shown in (10): $${V_\pi }({s_t}) = \sum\limits_a \pi (a|{s_t}) \cdot {Q_\pi }({s_t},a)$$

The better the state st, the larger its value V. Similarly, using a state-value network to approximate the state-value function yields (11): $$V({s_t};\theta ) = \sum\limits_a \pi (a|{s_t};\theta ) \cdot {Q_\pi }({s_t},a)$$

The goal is to maximize V by modifying the network parameter θ, so the state-value function is differentiated and the network parameters are updated using gradient ascent, i.e.: $$\theta \leftarrow \theta + \beta \frac{{\partial V(s;\theta )}}{{\partial \theta }}$$

where $$\frac{{\partial V(s;\theta )}}{{\partial \theta }}$$ is the strategy gradient.

The specific expression of the strategy gradient can be derived from Eq. (11). Treating Qπ(st, a) as the reward value given by the environment, the derivation proceeds as follows: $$\begin{array}{rcl} \frac{{\partial V({s_t};\theta )}}{{\partial \theta }} &=&\frac{{\partial \sum\limits_a \pi (a|{s_t};\theta ) \cdot {Q_\pi }({s_t},a)}}{{\partial \theta }} \\ &=&\sum\limits_a {\frac{{\partial \pi (a|{s_t};\theta )}}{{\partial \theta }}} {Q_\pi }({s_t},a) \\ &=&\sum\limits_a \pi (a|{s_t};\theta )\frac{{\partial \log (\pi (a|{s_t};\theta ))}}{{\partial \theta }}{Q_\pi }({s_t},a) \\ \end{array}$$

From this, its discrete and continuous expectation forms can be obtained: $$\frac{{\partial V({s_t};\theta )}}{{\partial \theta }} = {E_A}\left[ {\frac{{\partial {\text{log}}\left( {\pi (A|{s_t};\theta )} \right)}}{{\partial \theta }}{Q_\pi }({s_t},A)} \right]$$ $$\frac{{\partial V({s_t};\theta )}}{{\partial \theta }} = {E_{A\sim \pi ( \cdot |{s_t};\theta )}}\left[ {\frac{{\partial \log (\pi (A|{s_t};\theta ))}}{{\partial \theta }}{Q_\pi }({s_t},A)} \right]$$

where A is a random variable over the action space. The strategy gradient is usually approximated via Monte Carlo sampling, i.e., by randomly sampling an action $$\hat a$$ according to the strategy function, and the approximation is expressed as: $$g(\hat a;\theta ) = \frac{{\partial \log (\pi (\hat a|{s_t};\theta ))}}{{\partial \theta }}{Q_\pi }({s_t},\hat a)$$

where $$g(\hat a;\theta )$$ is an unbiased estimate of the strategy gradient.
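For illustration, the following is a minimal sketch of this Monte Carlo estimate, assuming a simple linear-softmax strategy π(a|s;θ) = softmax(θs) and a given vector of action values Qπ(st,·); both the parameterization and the names are assumptions of the sketch, not specifics from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_sample_gradient(theta, s, q_values, rng=None):
    """Monte Carlo estimate g(a_hat; theta) for a linear-softmax strategy.
    theta: (n_actions, dim), s: (dim,), q_values: (n_actions,)."""
    rng = rng or np.random.default_rng(0)
    pi = softmax(theta @ s)
    a_hat = rng.choice(len(pi), p=pi)          # sample an action from the strategy
    # d log pi(a|s) / d theta[k] = (1{k == a} - pi[k]) * s for this parameterization
    grad_log_pi = -np.outer(pi, s)
    grad_log_pi[a_hat] += s
    return grad_log_pi * q_values[a_hat]       # unbiased estimate of the strategy gradient
```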

Actor-Critic

Reinforcement learning methods based on the AC framework usually contain two networks: the Actor network approximates the strategy function and is used for decision making, while the Critic network approximates the value function and is used to evaluate the actions taken in the current state. The framework of the Actor-Critic algorithm is shown in Fig. 3.

Figure 3.

Actor-Critic algorithm framework

Unmanned logistics fleet path optimization model with time windows
Unmanned Distribution Fleet Path Planning Model
Customer Status Information Fusion Module

The first part of the model maps the local and global state information from reinforcement learning into a high-dimensional vector space via a 1D convolution. For customer i, its local information $$X_i^t = ({x_i},{y_i},{e_i},{l_i},d_i^t)$$ is mapped by a one-dimensional convolution into a vector $${\hat X_i}$$ of dimension ζ; to reduce the number of parameters, this study shares the weights of the one-dimensional convolution across all customers [41]. The global variable Gt = {ηt, ct} is mapped into a vector $${\hat G^t}$$ of dimension ζ by another one-dimensional convolutional layer. The dynamic and static elements contained in the local and global information thus have their features learned in two different convolutional layers, and the features are finally fused, in the spirit of the multilayer information fusion in the supervised-learning network ResNet50: $${\hat G^t}$$ and $${\hat X^t}$$ are combined in a weighted linear combination after a ReLU activation to obtain the fused information $$\mu_i^t$$ of customer node i.
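A minimal PyTorch sketch of this fusion step, under assumed dimensions (five local features per customer, two global features, embedding size ζ = 128); the class name, variable names and the final linear mixing layer are illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn

class StateFusion(nn.Module):
    """Embed local and global state with shared 1-D convolutions, then fuse."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.local_emb = nn.Conv1d(5, emb_dim, kernel_size=1)   # shared over all customers
        self.global_emb = nn.Conv1d(2, emb_dim, kernel_size=1)
        self.mix = nn.Linear(2 * emb_dim, emb_dim)              # weighted linear combination

    def forward(self, local, global_):
        # local: (batch, 5, n_customers); global_: (batch, 2, 1)
        x_hat = torch.relu(self.local_emb(local))               # (batch, emb, n)
        g_hat = torch.relu(self.global_emb(global_))            # (batch, emb, 1)
        g_hat = g_hat.expand(-1, -1, x_hat.size(-1))            # broadcast to every customer
        fused = torch.cat([x_hat, g_hat], dim=1)                # (batch, 2*emb, n)
        return self.mix(fused.transpose(1, 2))                  # (batch, n, emb) = mu_i^t
```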

Attention mechanism module

The attention mechanism can effectively handle growth in the number of customer nodes: when decoding customer node i, more attention is paid to the part of the input that is relevant to that customer. First, the attention weights are computed. The hidden state ht of the recurrent neural network and the fused information $$\mu_i^t$$ of each node i are used to compute a similarity score $$v_i^t$$; the similarity scores of all nodes are then passed through a Softmax normalization to obtain the attention weights $$a_i^t$$; the attention weights are used to form a weighted sum of the input information $$\mu_i^t$$, giving the context vector ct, from which the probability distribution over the customer nodes i is then obtained.

The calculation formulas are as follows: $$v_i^t = {\theta_v}\tanh ({\theta_u}[\mu_i^t ;{h^t}])$$ $$a_i^t = {\text{Softmax}}({v^t})$$ $${c^t} = \sum\limits_{i = 0}^{|{V_c}| + 1} {a_i^t} \mu_i^t$$

where ht is the hidden-state output of the recurrent neural network (LSTM), $$v_i^t$$ is the ith element of the vector vt, tanh is the nonlinear activation function, θv and θu are trainable variables, and |Vc| is the number of customer nodes, with: $$\tanh (x) = \frac{{{e^x} - {e^{ - x}}}}{{{e^x} + {e^{ - x}}}}$$ $${\text{Softmax}}({x_i}) = \frac{{{e^{{x_i}}}}}{{\sum\limits_k {{e^{{x_k}}}} }}$$

Next, the probability estimate for each customer is computed, where $$g_i^t$$ is the ith element of gt and θg, θc are trainable variables: $$g_i^t = {\theta_g}{\text{tanh}}({\theta_c}[\mu_i^t ;{c^t}])$$ $$p_i^t = {\text{Softmax}}({g^t})$$
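The following numpy sketch walks through these attention and probability computations for one decoding step, under assumed weight shapes (θ_u and θ_c of shape (d, 2d), θ_v and θ_g of shape (d,)); it illustrates the formulas above rather than reproducing the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_attention(mu, h, theta_v, theta_u, theta_g, theta_c):
    """mu: (n, d) fused customer embeddings; h: (d,) LSTM hidden state."""
    n, d = mu.shape
    # similarity v_i^t = theta_v . tanh(theta_u [mu_i ; h])
    concat = np.concatenate([mu, np.tile(h, (n, 1))], axis=1)     # (n, 2d)
    v = np.tanh(concat @ theta_u.T) @ theta_v                     # (n,)
    a = softmax(v)                                                # attention weights a_i^t
    c = a @ mu                                                    # context vector c^t, (d,)
    # probability estimate per customer node
    concat_c = np.concatenate([mu, np.tile(c, (n, 1))], axis=1)   # (n, 2d)
    g = np.tanh(concat_c @ theta_c.T) @ theta_g                   # (n,)
    return softmax(g)                                             # p_i^t
```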

Long and short-term memory neural network module

The input to a long short-term memory (LSTM) neural network has three components: the hidden-state output ht−1 of the previous step, the cell state Ct−1 of the previous step, and the input xt at step t. In the vehicle routing problem with time windows, at decoding step t the input to the LSTM is the input Xt containing the information of the customer node yt, together with the previous hidden state ht−1, and the output hidden state ht is used as the input to the attention mechanism to compute the attention weights.

Value network model

A separate network is used to obtain the Critic value estimate that fits the trajectory reward. The input to the Critic value-estimation network is the sample instance X[i], all of its trainable variables are denoted φ, and the parameter φ is updated using the loss function $$\frac{1}{M}\sum\limits_{i = 1}^M {{{[R({Y_{[i]}}) - {v_\varphi }({X_{[i]}})]}^2}}$$. Here X[i] denotes the ith training sample, i.e., one delivery task for the unmanned fleet; this sample contains all the information about the n customers, including the customer location distribution, the delivery time interval required by each customer, and the customer goods demand. All the customer data are fed into the network through a one-dimensional convolution operation to obtain the values x1, x2, …, xn, and these n values are then passed through two fully connected layers to output the predicted value vφ(X[i]) used to evaluate and improve the strategy network.
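A minimal PyTorch sketch of such a value network, under assumed sizes (five input features per customer, mean pooling over customers before the two fully connected layers; names and dimensions are illustrative). The network would then be fitted by minimizing the mean squared error to the observed returns given above:

```python
import torch
import torch.nn as nn

class CriticValueNet(nn.Module):
    """Shared 1-D convolution over customers, pooling, two FC layers -> v_phi(X)."""
    def __init__(self, in_dim=5, emb_dim=128, hidden=128):
        super().__init__()
        self.embed = nn.Conv1d(in_dim, emb_dim, kernel_size=1)
        self.fc = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, x):
        # x: (batch, in_dim, n_customers)
        e = torch.relu(self.embed(x))       # (batch, emb_dim, n)
        pooled = e.mean(dim=-1)             # aggregate over customers
        return self.fc(pooled).squeeze(-1)  # v_phi(X), shape (batch,)
```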

Customer Shielding Program

A masking function is set up in the model; it shields the nodes that do not satisfy the feasibility conditions and thereby speeds up training. If, at decoding step t, the fleet is currently at customer node i and a customer node j, ∀j ≠ i, meets one of the following conditions, a very large negative number is assigned to the corresponding $$v_j^t$$ and $$g_j^t$$, so that the computed weight $$a_j^t$$ and probability $$p_j^t$$ are very close to 0, thereby shielding customer node j:

where Vd and Vc are the sets of distribution centers and customers, respectively. This completes the design of the masking function, allowing the model to be trained faster while satisfying the feasibility conditions.
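The effect of the mask on the node-selection probabilities can be sketched as follows; the specific feasibility tests (e.g. demand exceeding the remaining capacity, or time-window violations) are assumed to have already been evaluated into a boolean vector, and the code is an illustration rather than the paper's implementation:

```python
import numpy as np

def masked_softmax(logits, feasible):
    """Infeasible nodes (feasible[j] == False) receive a very large negative
    logit, so their probability after the softmax is effectively zero."""
    masked = np.where(feasible, logits, -1e9)
    e = np.exp(masked - masked.max())
    return e / e.sum()

# Example: node 2 violates a feasibility condition and is shielded
p = masked_softmax(np.array([1.0, 0.5, 2.0]), np.array([True, True, False]))
```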

Decoding Strategies

At each decoding step, the next customer node to be visited is determined based on the probability estimates of all customer nodes. The model uses two decoding strategies: the greedy decoding strategy selects, at each step, the customer node with the highest probability given by the model, while the stochastic decoding strategy samples the next customer node at random from the estimated probability distribution.
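A small sketch of the two strategies, assuming the per-node probabilities p_i^t have already been computed:

```python
import numpy as np

rng = np.random.default_rng(0)

def decode_step(probs, strategy="greedy"):
    """Greedy picks the most probable customer node; stochastic samples one
    according to the estimated distribution."""
    if strategy == "greedy":
        return int(np.argmax(probs))
    return int(rng.choice(len(probs), p=probs))
```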

Deep Reinforcement Learning Environment Design
Reward function design

The reward function is defined on the sequence of customer nodes $$Y^{t_m} = \{ {y^t},t = 0,1,2,...,{t_m}\}$$ decoded by the decoder, and a good decoded sequence corresponds to a relatively high reward signal. Since the objective of the vehicle routing problem with time windows is to minimize the path cost, i.e., the total distance, the term $$ - \sum\limits_{t = 0}^{{t_m}} {w({y^t},{y^{t+1}})}$$ is included in the reward function, so that the smaller the distance, the larger the reward signal. The remaining terms penalize violations of the problem constraints: the portion by which the arrival time exceeds the upper limit of the time window is added to the reward function as a penalty, and the third term is 0 in most cases owing to the masking function. The reward function is given by: $$\begin{array}{rcl} R({Y^{{t_m}}}) &=& - \sum\limits_{t = 0}^{{t_m}} w ({y^t},{y^{t + 1}}) + {\beta_1}\left[ {\sum\limits_{t = 0}^{{t_m}} {\max } \left({e_{{y^{t+1}}}} - ({\eta^t} + w({y^t},{y^{t + 1}})),0\right)} \right] \\ &&+ {\beta_2}\left[ {\sum\limits_{t = 0}^{{t_m}} {\max } \left({\eta^t} + w({y^t},{y^{t + 1}}) - {l_{{y^{t + 1}}}},0\right)} \right] \\ \end{array}$$
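The following sketch evaluates such a reward for a fixed route, assuming Euclidean travel times and taking the penalty coefficients negative so that time-window violations reduce the reward (an assumption of this sketch; the paper only states that the β terms act as penalties):

```python
import numpy as np

def route_reward(route, coords, e, l, beta1=-1.0, beta2=-1.0, speed=1.0):
    """route: node indices; coords: (n, 2) array; e, l: earliest/latest times."""
    reward, eta = 0.0, 0.0                                   # eta: current system time
    for t in range(len(route) - 1):
        i, j = route[t], route[t + 1]
        w = np.linalg.norm(coords[i] - coords[j]) / speed    # travel time w(y^t, y^{t+1})
        reward -= w                                          # distance term
        arrival = eta + w
        reward += beta1 * max(e[j] - arrival, 0.0)           # early-arrival term
        reward += beta2 * max(arrival - l[j], 0.0)           # late-arrival term
        eta = max(arrival, e[j])                             # wait until the window opens
    return reward
```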

State Transition Functions

For the VRPTW reinforcement learning formulation, the system state is described by Xt and Gt. The action is to append a customer node to the end of the current sequence. Let yt denote the vertex selected at step t and Yt the sequence of customer nodes formed before step t. The procedure terminates at step tm, when no customer has remaining demand, i.e., all customer nodes have zero demand. At each decoding step t, the decoder is provided with the local and global information Xt, Gt together with the historical trajectory Yt; the conditional probability distribution P(yt+1 = i | Xt, Gt, Yt) over the customer nodes is computed by the model, and based on this distribution the next customer node yt+1 is selected, randomly or greedily, and appended to the tail of the decoding sequence. The system state is then updated by the following transition functions:

The global variable Gt = {ηt, ct} is updated as follows. First, the system time: $${\eta^{t + 1}} = \left\{ {\begin{array}{*{20}{l}} {w({y^t},{y^{t + 1}}),{\text{if }}{y^t}{\text{ is the distribution center}}} \\ {\max ({e_{{y^t}}},{\eta^t}) + w({y^t},{y^{t + 1}}),{\text{if }}{y^t}{\text{ is a customer}}} \end{array}} \right.$$

where w(yt, yt+1) is the travel time from vertex yt to vertex yt+1, calculated from the Euclidean distance and the vehicle speed. Euclidean distance: $${d_{12}} = \sqrt {{{({x_1} - {x_2})}^2} + {{({y_1} - {y_2})}^2}}$$

The vehicle’s remaining loading capacity is updated next: $${c^{t + 1}} = \left\{ {\begin{array}{*{20}{l}} {{c^t} - d_{{y^t}}^t,{\text{if }}{y^t}{\text{ is a customer}}} \\ {C,{\text{if }}{y^t}{\text{ is the distribution center}}} \end{array}} \right.$$

There is no need to consider the case where $${c^t} - d_{{y^t}}^t$$ is negative: the masking function set up earlier masks the nodes whose demand exceeds the loading-capacity limit of the unmanned vehicle, assigning a very low probability to such customer nodes.

Meanwhile, the update formula for the demand is as follows: $$d_i^{t + 1} = \left\{ {\begin{array}{*{20}{l}} {0,{\text{if }}{y^t} = i} \\ {d_i^t,{\text{if }}{y^t} \ne i} \end{array}} \right.$$
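Taken together, one decoding step updates the system time, the remaining capacity, and the demands as in the following sketch (the function signature and data structures are illustrative assumptions, not the paper's code):

```python
def transition(y_t, y_next, eta, cap, demand, e, w, depot, C):
    """y_t, y_next: current / next node; eta: system time; cap: remaining load;
    demand: dict of remaining demands; e: earliest service times; w: travel-time fn."""
    # system time update
    if y_t == depot:
        eta_next = w(y_t, y_next)
    else:
        eta_next = max(e[y_t], eta) + w(y_t, y_next)
    # loading-capacity update: reload at the depot, otherwise serve the demand
    cap_next = C if y_t == depot else cap - demand[y_t]
    # demand update: the visited customer's demand becomes 0
    demand_next = {i: (0 if i == y_t else d) for i, d in demand.items()}
    return eta_next, cap_next, demand_next
```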

Training of deep reinforcement learning algorithms
Objective function

The objective function is the target for updating the parameters of the reinforcement learning algorithm and is expressed as a function of the policy parameters. In the model it is set to the expected total return of the trajectory Y sampled by the stochastic policy πθ, which is to be maximized: $$J(\theta ) = {E_{{\pi_\theta }}}[R(Y)]$$

Gradient estimation

The gradient of the objective with respect to the policy network parameters, estimated by Monte Carlo sampling over M instances of the VRPTW distribution problem, is shown in the following equations: $${\nabla_\theta }J(\theta ) = \frac{1}{M}\sum\limits_{i = 1}^M {[R({Y_{[i]}}) - {v_\varphi }({X_{[i]}})]\nabla \log {P_\theta }({Y_{[i]}}|{X_{[i]}})}$$ $$\begin{array}{l} {P_\theta }({Y_{[i]}}|{X_{[i]}}) = {P_\theta }(y_{[i]}^1|X_{[i]}^0,G_{[i]}^0,Y_{[i]}^0){P_\theta }(y_{[i]}^2|X_{[i]}^1,G_{[i]}^1,Y_{[i]}^1) \\ \cdots {P_\theta }(y_{[i]}^{|Y_{(i)}|}|X_{[i]}^{|Y_{(i)}| - 1},G_{[i]}^{|Y_{(i)}| - 1},Y_{[i]}^{|Y_{(i)}| - 1}) \\ \end{array}$$

The chain-probability decomposition above is built from the terms $${P_\theta }(y_{[i]}^{t + 1}|X_{[i]}^t,G_{[i]}^t,Y_{[i]}^t)$$ produced by the model, where M is the batch size, θ denotes the strategy network parameters, X[i] is the ith instance in the batch, R(Y[i]) is the realized trajectory return of the strategy on the ith instance, and vφ is the Critic network’s estimate of R(Y[i]). The parameters are then updated by gradient ascent on the strategy network and gradient descent on the Critic: $$\theta = \theta + {\alpha_1}\frac{1}{M}\sum\limits_{i = 1}^M {[R({Y_{[i]}}) - {v_\varphi }({X_{[i]}})]\nabla \log {P_\theta }({Y_{[i]}}|{X_{[i]}})}$$ $$d\varphi = \frac{1}{M}\sum\limits_{i = 1}^M {{\nabla_\varphi }} {[R({Y_{[i]}}) - {v_\varphi }({X_{[i]}})]^2}$$ $$\varphi = \varphi - {\alpha_2}\,d\varphi$$
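A compact PyTorch sketch of one such update step, assuming actor(batch) returns the sampled-route returns together with the summed log-probabilities log Pθ(Y|X) and critic(batch) returns vφ(X); these interfaces are assumptions of the sketch, not the paper's code:

```python
import torch

def train_step(actor, critic, batch, actor_opt, critic_opt):
    rewards, log_probs = actor(batch)          # R(Y_[i]), log P_theta(Y_[i]|X_[i])
    baseline = critic(batch)                   # v_phi(X_[i])
    advantage = rewards - baseline
    # Actor update: gradient ascent on the advantage-weighted log-likelihood
    actor_loss = -(advantage.detach() * log_probs).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Critic update: minimize the mean squared error to the observed return
    critic_loss = (rewards.detach() - baseline).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```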

Example simulation experiments
Analysis of example results

In this section, solution experiments are conducted on an example with 100 customer points. In the training instances, there is no effect of time-varying characteristics or customer time windows, i.e., the unmanned delivery fleet travels at a constant speed throughout the delivery process and customers can receive service at any time. In the test example, time-varying characteristics and customer time windows affect the entire delivery scheme of the unmanned delivery fleet with drones, i.e., the speed of the unmanned delivery fleet changes over time, and serving a customer outside its time window incurs a time-window penalty cost. The 100-scale example consists of 101 nodes, whose information is shown in Table 1. Node 100 is the distribution center (Depot), nodes 0-9 are drone-delivery customer points (DC), nodes 90-99 are unmanned-delivery-fleet customer points (TC), and nodes 10-89 are joint fleet-and-drone delivery customer points (FC). The node distribution of the 100-scale example is shown in Figure 4, and the customer time window distribution is shown in Figure 5.

Table 1. 100-scale node information

Number Coordinates Demand Attribute Number Coordinates Demand Attribute
0 (21,42) 2 DC 51 (34,14) 14 FC
1 (21,22) 0 DC 52 (25,40) 19 FC
2 (21,10) 2 DC 53 (58,29) 18 FC
3 (54,12) 1 DC 54 (30,71) 24 FC
4 (53,38) 5 DC 55 (64,56) 18 FC
5 (14,22) 2 DC 56 (43,53) 16 FC
6 (57,48) 8 DC 57 (54,58) 23 FC
7 (32,13) 0 DC 58 (3,61) 20 FC
8 (40,28) 1 DC 59 (43,60) 20 FC
9 (7,68) 1 DC 60 (19,23) 7 FC
10 (29,45) 16 FC 61 (54,42) 23 FC
11 (53,13) 12 FC 62 (46,47) 18 FC
12 (50,74) 14 FC 63 (19,34) 20 FC
13 (30,68) 19 FC 64 (31,41) 7 FC
14 (16,18) 17 FC 65 (19,72) 4 FC
15 (56,66) 22 FC 66 (19,21) 24 FC
16 (28,42) 20 FC 67 (35,69) 17 FC
17 (36,24) 11 FC 68 (46,9) 15 FC
18 (24,25) 20 FC 69 (46,53) 24 FC
19 (29,23) 11 FC 70 (17,24) 16 FC
20 (70,62) 17 FC 71 (14,56) 12 FC
21 (29,48) 10 FC 72 (32,2) 14 FC
22 (55,41) 18 FC 73 (24,31) 18 FC
23 (25,63) 18 FC 74 (11,27) 12 FC
24 (61,28) 11 FC 75 (20,30) 16 FC
25 (19,10) 17 FC 76 (44,27) 17 FC
26 (19,49) 23 FC 77 (60,38) 20 FC
27 (64,11) 17 FC 78 (23,23) 15 FC
28 (23,21) 19 FC 79 (40,13) 18 FC
29 (14,50) 6 FC 80 (43,37) 14 FC
30 (64,74) 17 FC 81 (45,34) 12 FC
31 (10,19) 20 FC 82 (3,4) 8 FC
32 (8,58) 16 FC 83 (3,56) 15 FC
33 (47,35) 16 FC 84 (42,16) 15 FC
34 (49,52) 7 FC 85 (5,18) 22 FC
35 (42,29) 16 FC 86 (17,11) 17 FC
36 (7,14) 21 FC 87 (18,25) 19 FC
37 (15,40) 19 FC 88 (49,10) 16 FC
38 (36,57) 22 FC 89 (6,18) 10 FC
39 (12,28) 14 FC 90 (5,47) 8 TC
40 (23,1) 18 FC 91 (53,9) 31 TC
41 (21,28) 17 FC 92 (19,47) 19 TC
42 (45,67) 19 FC 93 (65,17) 23 TC
43 (43,14) 17 FC 94 (68,48) 32 TC
44 (54,48) 17 FC 95 (54,52) 30 TC
45 (21,63) 11 FC 96 (41,42) 21 TC
46 (16,24) 17 FC 97 (69,32) 26 TC
47 (73,2) 11 FC 98 (31,12) 32 TC
48 (40,39) 22 FC 99 (60,66) 18 TC
49 (34,31) 24 FC 100 (32,38) 0 Distribution center
50 (14,41) 2 FC
Figure 4.

Node distribution of the 100-scale example

Figure 5.

Customer time window distribution for the 100-scale example

The example is solved using the model trained under ideal-speed, no-time-window conditions together with the random-sampling decoding strategy. The result is that the total distribution cost of the UAV-assisted unmanned-delivery-fleet case with customer service limitations at the 100-customer scale is 2135; the path scheme for the 100-scale example is shown in Table 2. As can be seen from the table, at the 100-customer scale the UAV-assisted delivery case with customer service limitations requires four unmanned delivery fleets of capacity 400, each completing its delivery task with the assistance of one of four drones. The first group of unmanned delivery fleet and drone provides logistics and distribution services to 26 customer points, and the loading rate of unmanned delivery fleet 1 is 99.2%. Among them, 23 customer points are served by the unmanned delivery fleet and 3 customer points by UAV 1. Overall, the distribution tasks of the unmanned delivery fleet and the drones under this scheme are reasonably distributed and meet the vehicle loading-capacity requirements.

Table 2. 100-scale example path scheme

Numbering Path Total demand Loading rate
Truck 1 100→62→51→72→47→89→90→34→94→31→30→52→72→24→69→9→67→60→37→42→26→16→18→64→100 386 99.2%
Uav 1 100→0→ 47, 38→ 83→ 27, 65→ 14→ 69 12
Truck 2 100→56→75→72→61→82→33→45→17→24→70→87→35→26→87→42→69→93→17→40→100 322 98.3%
Uav 2 100→ 21→ 54, 64→ 0→ 32, 24→ 70→ 16, 36→80→88, 84→ 0→ 39, 76→ 1→ 99 69
Truck 3 100→50→96→54→59→40→13→98→19→18→54→48→67→35→60→59→23→97→95→76→76→100 352 99.6%
Uav 3 11→ 35→ 18, 47→ 57→ 38, 57→ 6→ 13 50
Truck 4 100→49→16→22→47→39→86→85→38→74→7→89→88→19→92→29→93→53→81→35→100 322 97.5%
Uav 4 100→ 91→ 15, 54→ 8→ 34, 78→ 82→ 61, 88→27→25, 26→5→20 61

The path diagram of the solution is shown in Fig. 6, which shows the customer points and access paths served by the unmanned delivery fleet and the UAVs during the delivery process; from it, it can be seen that the unmanned delivery fleet and the UAVs complete the service of all customer points in this delivery task.

Figure 6.

The path diagram of the example

Additional experiments and analysis
Single fleet vs. heterogeneous fleet

In the above example, low vehicle loading rates often occur. In this section, the same case is used and a heterogeneous fleet is set up for solving: the single fleet adopts the traditional distribution path scheme, while the heterogeneous fleet adopts the distribution path scheme designed in this paper, so that the impact of different unmanned distribution schemes on solving the multi-trip path planning problem can be better understood. Next, the results of solving the heterogeneous fleet model at different scales are shown and compared with the results for the single fleet. The 25-scale path schemes are shown in Table 3, which compares the single-fleet and heterogeneous-fleet multi-trip solutions at the 25+1-node scale. It can be clearly seen that in the single-fleet case the loading rate of both the 1-2 and 2-2 trips is below 50%, while in the heterogeneous-fleet case the loading rate of every trip is over 85%. The total cost of this example is 915, with a transportation cost of 875 and a fixed cost of 40.

Table 3. 25-scale example path schemes

Single fleet, scale 25 (C=150) Heterogeneous fleet, scale 25
Number Total demand Loading rate Total distance Distance utilization Number Total demand Loading rate Total distance Distance utilization
1-1 142 97% 221 98.5% 1-1(C=200) 195 98.5% 202 87.6%
1-2 65 44.6% 2-1(C=150) 132 93.6% 213 92.6%
2-1 132 87.6% 192 82.6% 2-2(C=100) 80 86.7%
2-2 45 29.6%

The 50-scale example path schemes are shown in Table 4. In the single-fleet case, the utilization of vehicle loading is relatively low. In contrast, in the heterogeneous-fleet case, a total of four vehicles make deliveries, and each vehicle’s loading rate on every trip exceeds 80%, effectively saving vehicle transportation costs. The total cost of this example is 1522, of which the transportation cost is 1432 and the fixed cost is 90.

Table 4. 50-scale example path schemes

Single fleet, scale 50 (C=150) Heterogeneous fleet, scale 50
Number Total demand Loading rate Total distance Distance utilization Number Total demand Loading rate Total distance Distance utilization
1-1 149 99.8% 220 94.2% 1-1(200) 193 95.1% 213 89.6%
1-2 93 60.3% 1-2(100) 96 96.9%
2-1 146 99.7% 241 99.3% 2-1(150) 143 96.5% 186 77.6%
2-2 38 23.1% 3-1(100) 96 97.4% 235 97.6%
3-1 148 96.4% 240 100% 3-2(100) 93 95.4%
3-2 66 40.5% 4-1(100) 84 82.2% 133 56%
4-1 68 48.3% 178 75.6%

The 100-scale example path schemes are shown in Table 5. In the single-fleet case, the loading rates of the 2-3, 5-2, and 6-2 trips are 46.5%, 46%, and 32.8%, respectively. In contrast, in the heterogeneous-fleet case, the lowest loading rate is 75.8% for the 7-2 trip, which is much higher than in the single-fleet case. The total cost is 3522, with a transportation cost of 3239 and a fixed cost of 283.

Table 5. 100-scale example path schemes

Single fleet, scale 100 (C=150) Heterogeneous fleet, scale 100
Number Total demand Loading rate Total distance Distance utilization Number Total demand Loading rate Total distance Distance utilization
1-1 144 93.8% 230 97.9% 1-1(200) 191 97.2% 233 95.5%
1-2 150 99.1% 2-1(200) 184 95.8% 230 75.8%
2-1 147 98.6% 215 88.6% 3-1(200) 199 99.6% 239 96.5%
2-2 130 91.2% 4-1(200) 196 97.3% 229 99.6%
2-3 80 46.5% 5-1(200) 198 85.5% 245 100%
3-1 150 95.7% 239 99.6% 6-1(200) 187 98.2%
3-2 136 88.3% 7-1(200) 138 92.2% 222 93.6%
4-1 149 95.2% 235 96.2% 7-2(200) 138 97.2%
4-2 131 87.5%
5-1 144 92.5% 232 98.2%
5-2 66 46%
6-1 147 98.3% 150 58.6%
6-2 57 32.8%
Analysis of the impact of work hour adjustments on route planning

To better understand and analyze how the combination of working hours and unmanned-delivery-fleet loading capacity influences the solution of multi-trip path planning problems with different node scales, this section randomly selects one case from the 10 test cases and solves it for different combinations of fleet loading capacity and working hours, the latter corresponding to the distance the unmanned delivery fleet can travel. The solution results are as follows:

For the case of 25+1 scale nodes, the total cost reaches the minimum value when the unmanned delivery fleet loading capacity is 200 and the working duration is 7.5 hours, while the total cost under all three unmanned delivery fleet types reaches the maximum value when the working duration is 4 hours.

For the case of 50+1 scale nodes, we observe that, among the different vehicle capacities, the cost at a capacity of 200 is consistently lower than the other combinations of capacity and working hours for every working duration. Specifically, the total cost reaches its highest value when the unmanned delivery fleet capacity is 100 and the working time is 11 hours, and its lowest value when the unmanned delivery fleet capacity is 110 and the working hours are 6.5 hours.

For the case of 100+1 scale nodes, we observe that as the number of nodes increases, the cost tends to decrease with larger on-board capacity. Specifically, the total cost is minimized when the on-board capacity is 200 and the working duration is 8 hours. However, when the on-board capacity is 100, the solution costs for the different working hours are all higher than in the 200-capacity case, peaking at 6.5 working hours. The node path planning results are shown in Fig. 7 (panels a-c correspond to the 25+1, 50+1, and 100+1 scales, respectively).

Figure 7.

Node path planning

Conclusion

This paper focuses on the path optimization problem of the unmanned-aircraft delivery mode considering customer service constraints and time windows, designs a deep reinforcement learning solution framework that can effectively solve this problem according to its characteristics, and designs numerical experiments to validate the effectiveness of the proposed method.

The study uses the deep reinforcement learning model to solve the problem and finds that the first group of unmanned delivery fleet and UAVs provides logistics and distribution services for 26 customer points in total, with a loading rate of 99.2% for unmanned delivery fleet 1. Overall, the distribution tasks of the unmanned delivery fleet and UAVs under this scheme are reasonably distributed and meet the vehicle loading-capacity requirements.

In the comparison between the single fleet and the heterogeneous fleet, the loading rates of the 1-2 and 2-2 trips are below 50% in the single-fleet case, while under the unmanned distribution path scheme designed in this paper the heterogeneous fleet achieves a loading rate above 85% on every trip. This demonstrates the superiority of the intelligent distribution logistics model designed in this paper.
