Automation is becoming increasingly important for achieving high efficiency and productivity in manufacturing facilities, and there has been a large increase in the use of autonomous mobile robots (AMRs) for factory automation. As the number of AMRs grows, scheduling them optimally and in a timely manner, so that a large fleet of AMRs can finish all assigned tasks within the shortest time, presents a significant challenge for control engineers. Exhaustive search can provide an optimal solution, but its computational time is far too long for real-time control. This paper introduces a novel two-step algorithm for fast scheduling of AMRs that perform prioritized tasks involving transportation of tools/materials from a pick-up location to a drop-off point on the factory floor. The proposed two-step algorithm first clusters the tasks such that each cluster is assigned to a single AMR, and then schedules the tasks within each cluster using a model-based learning technique. For the purpose of clustering and scheduling, a task space is defined. The results from the clustering and scheduling algorithms are compared with other widely used heuristic techniques. Both the clustering and the scheduling algorithms are shown to perform better on task sets of relevant sizes and to generate real-time solutions for the scheduling of multiple AMRs under task space constraints with priorities.

## Introduction

Automation has become increasingly important in manufacturing, and using autonomous mobile robots (AMRs) to automate tasks on the factory floor leads to more efficient and productive operations. There are several challenges in this domain. First, prioritized tasks need to be continuously assigned to the entire fleet of AMRs. Second, routes must be planned for the AMRs to finish their assigned tasks. Third, all of the scheduling and routing should be completed quickly enough that the reconfiguration of the AMRs can be implemented in real time.

In this research effort, we explore algorithms for the timely scheduling of prioritized tasks for multiple AMRs within a manufacturing facility. Specifically, we focus on problems where 20–40 AMRs need to travel to about 1000 points to complete prioritized tasks. The prioritized tasks involve transporting materials from a specified pick-up location to a target drop-off location. Here, each task is associated with a start point *s*, an end point *e*, and a priority value *p*. Because the tasks are predefined on a “point space” and the start and end points are not interchangeable, the search for an optimal schedule is asymmetric. In other words, the cost of moving between two tasks differs in the opposite directions.

The problem under consideration is a variation of the vehicle routing problem (VRP) [1]. However, unlike most research in VRP, which focuses on the actual point space, the focus in this paper is on the “task space,” i.e., the tasks, rather than the points on the factory floor, are to be assigned to vehicles without repetition. The search in the task space is significantly more challenging than the search in the point space, as the asymmetry associated with the task space leads to a significantly increased computational burden. Since its conception, the VRP has been extensively studied in operations research and computer science, with various formulations presented over the years [2–6].

Exact algorithms for solving the VRP have been presented in detail by Semet et al. [7]. However, exact algorithms for solving the VRP are extremely computationally expensive, and therefore, research in heuristic methods has taken precedence over the years.

Research in heuristic methods has focused on two types of algorithms: classical (constructive or improvement) heuristics and metaheuristics. Classical constructive heuristics start from an empty solution set and build the solution step-by-step while minimizing the defined cost. There are many popular constructive heuristics designed for the VRP in the literature; the most popular of these algorithms are as follows. The Clarke–Wright savings algorithm [8] and its variations merge multiple routes using a savings criterion. The Sweep algorithm [9] divides the VRP into smaller subproblems using polar coordinates. Petal algorithms [5,10] generate multiple feasible VRP subroutes and recombine them to form the final solution. The Fisher–Jaikumar algorithm [11] uses a generalized assignment formulation to break down the VRP into smaller subproblems. The min-min method [12] can be used to generate fast, close-to-optimal solutions for each vehicle in the VRP. However, these algorithms focus on symmetric versions of the VRP and need to be extended to generalize to asymmetric instances.

Improvement heuristics, on the other hand, take a feasible solution, often generated by constructive methods, and improve upon it using internal edge switches. Improvement heuristics such as the Lin–Kernighan method [13], which are designed for the traveling salesman problem (TSP), can be used for edge exchanges internal to a particular VRP route. For exchanges between routes, operators defined by Savelsbergh [14] such as Swap, Relocate, and 2-Opt* can be used. Thompson and Psaraftis [15] also define an inter-route improvement method, where *k* vertices are exchanged between *b* routes in a cyclic fashion. However, as mentioned earlier, improvement heuristics need a feasible solution as a starting point and are therefore applicable only in conjunction with constructive heuristic techniques.

Metaheuristics, which are also commonly used for generating approximate solutions to the VRP, are general heuristic algorithms that can be adapted to solve specific combinatorial optimization problems [16]. Metaheuristics involving local search techniques define a neighborhood of the current feasible solution, e.g., simulated and deterministic annealing [17,18], Tabu Search [19,20], iterated local search [21], guided local search [22], and variable neighborhood search [23] algorithms. Population-based metaheuristics use a set of solutions, called a population, for finding close-to-optimal solutions. Population-based methods such as ant colony optimization [24], genetic algorithms [25–27], and adaptive memory procedures [28,29] have been used to solve the VRP. While metaheuristics have been studied extensively, there is no single general heuristic algorithm that performs well over a wide range of cases; they depend on significant parameter tuning, or even on choosing the correct heuristic, since most heuristics perform well only on specific problem types [30]. Additionally, many of the heuristic methods presented previously are not suitable for real-time applications as they are computationally expensive. Variations of the VRP related to the problem being explored include the VRP with Time Windows [31,32], the dial-a-ride problem [33,34], and the stacker crane problem [35,36], but most of the research exploring these problems does not include task priorities. Therefore, there is a need to develop new optimization algorithms that search the prioritized task space with high efficiency and low computational expense, so that effective scheduling can happen in real time.

Learning techniques have also been used for combinatorial optimization, including the TSP and the VRP. Unlike other heuristic methods, learning techniques do not depend on predefined rules and can therefore provide a general structure that is applicable over a wide range of problem cases without modification; these techniques effectively learn their own rules on a case-by-case basis. Starting with the artificial neural networks designed by Hopfield and Tank [37], various neural network methods have been used for the TSP, such as elastic nets [38] and self-organizing maps [39–42]. Extensions of elastic nets and self-organizing maps have also been designed to handle the VRP [43–46]. The use of learning methods for the TSP/VRP has not been very popular in recent times, especially due to a lack of results [47], but with the advent of sequence-to-sequence learning [48] and the use of pointer networks [49], some research efforts have been made in this direction. Since generating labeled training data for combinatorial problems is difficult, Bello et al. [50] propose using reinforcement learning, which requires no training data and can lead to fast solutions; optimal labeled data are especially difficult to obtain for the problem being explored here. However, recent advances in learning methods focus only on the symmetric singular TSP, and prioritized tasks are yet to be explored in this context. It is therefore desirable to explore learning methods for asymmetric TSP applications using the reinforcement learning paradigm, as these methods have the potential to generate good-quality solutions in real time for a wide variety of problem cases.

This research effort is directed at developing a two-step scheduling method for AMRs in a task space with priorities, with a focus on real-time performance using heuristic methods. The idea behind the proposed algorithm is to first cluster the tasks into *k* separate clusters, where *k* is the number of AMRs available, followed by deciding the task completion sequence for every AMR. The rest of the paper is organized as follows: Sec. 2 describes the proposed two-step algorithm, with a mathematical formulation of the task space in Sec. 2.1, the clustering algorithm in the asymmetric task space in Sec. 2.2, and the scheduling algorithm for deciding the task sequence for every AMR in Sec. 2.3. For both of the proposed steps, results are presented in Sec. 3 for example cases, along with comparisons with widely used heuristic techniques. Concluding remarks are presented in Sec. 4.

## Proposed Method

We propose to solve the multi-AMR routing problem with task priorities by first clustering the tasks according to the number of active AMRs, followed by using a learning-based method to determine the order of tasks to be completed by each vehicle.

### Defining the Task Space.

Since the objective is to determine the order of tasks, a task space is first defined from the task list and the set of points on the factory floor. Given any two tasks, “moving” from task 1 to task 2 is defined as the AMR traveling from the end point of the first task to the start point of the second task, and then completing the second task, and vice versa, which is similar to the formulation used by Treleaven et al. [51]. A schematic is shown in Fig. 1, where *A* and *B* are the points defining task 1, and *C* and *D* are the points defining task 2/3. The directions of the arrows show the corresponding start and end points. According to this definition of moving, the “distance” between any two tasks is the sum of (i) the cost of travel between the end point of the first task to the start point of the second task, and (ii) the cost of travel for the completion of the second task. A new “distance matrix” *D*_{tasks} in the task space can therefore be derived from the original distance matrix *D*_{points} between the given set of points on the factory floor. It can be seen from the figure that moving between tasks, as defined, is not symmetric, i.e., the distances between the two tasks are not the same in opposite directions. Therefore, the task space is nonmetric.
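As an illustration, this construction of *D*_{tasks} from *D*_{points} can be sketched as follows (the function name and the toy point set are our own, not from the paper):

```python
import numpy as np

def task_distance_matrix(D_points, tasks):
    """Derive the task-space distance matrix from the point-space one.

    `tasks` is a list of (start, end) point indices. The "distance" from
    task i to task j is the travel from i's end point to j's start point
    plus the cost of completing task j itself, so the matrix is asymmetric.
    """
    n = len(tasks)
    D_tasks = np.zeros((n, n))
    for i, (_, e_i) in enumerate(tasks):
        for j, (s_j, e_j) in enumerate(tasks):
            if i != j:
                D_tasks[i, j] = D_points[e_i, s_j] + D_points[s_j, e_j]
    return D_tasks

# Example: three points on a line at x = 0, 1, 3 (Euclidean distances).
pts = np.array([0.0, 1.0, 3.0])
D_points = np.abs(pts[:, None] - pts[None, :])
tasks = [(0, 1), (2, 0)]          # task 0: point 0 -> 1; task 1: point 2 -> 0
D_tasks = task_distance_matrix(D_points, tasks)
# Moving task 0 -> task 1: travel 1->2 (2.0) plus completing task 1 (3.0) = 5.0
# Moving task 1 -> task 0: travel 0->0 (0.0) plus completing task 0 (1.0) = 1.0
```

Even in this tiny example, the resulting matrix is asymmetric, which is exactly the property that makes the task-space search harder than the point-space search.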

To account for task priorities, an additional cost is added to *D*_{tasks} as defined earlier. As stated, there are three priority levels, with 1 being the highest priority and 3 the lowest. The additional cost that is added to *D*_{tasks} is given by

where *d*_{mean} is the mean distance in the task distance matrix *D*_{tasks}, and *p*_{1} and *p*_{2} are the priority values of the first and second task, respectively. This extra penalty that is added is zero only when the two priority values are equal (*p*_{1} = *p*_{2}) and exponentially increases with the difference in priorities between the two tasks. This cost is asymmetric with the assumption that it should be easier to move from a lower priority task (higher value of *p*) to a higher priority task (lower value of *p*), i.e., *ν* > *ρ*. The mean value is multiplied in order to keep the orders of magnitude same. A negative cost is avoided in moving from a lower to a higher priority task in order to prevent continuous switching between priorities.
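The paper's exact penalty equation is not reproduced here; the sketch below implements one plausible form that satisfies the stated properties (zero when *p*_{1} = *p*_{2}, exponential growth with the priority gap, asymmetric via *ν* > *ρ*, and scaled by *d*_{mean}). The functional form and the parameter values are our assumptions:

```python
import math

def priority_penalty(d_mean, p1, p2, nu=1.0, rho=0.5):
    """One plausible priority penalty (an assumed form, not the paper's
    exact equation): zero when p1 == p2, grows exponentially with the
    priority gap, and is larger when moving from a higher-priority task
    (low p1) to a lower-priority one (high p2), enforced by nu > rho.
    The result is always non-negative and scaled by d_mean to keep the
    orders of magnitude comparable with the travel costs.
    """
    if p2 >= p1:   # moving toward lower (or equal) priority: steeper penalty
        return d_mean * (math.exp(nu * (p2 - p1)) - 1.0)
    else:          # moving toward higher priority: milder penalty
        return d_mean * (math.exp(rho * (p1 - p2)) - 1.0)
```

With *ν* > *ρ*, moving from priority 1 to priority 3 costs more than moving from 3 to 1, matching the asymmetry described above, while both directions remain non-negative to prevent continuous priority switching.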

### Clustering.

The first step toward solving the proposed problem is to generate clusters of tasks to be completed by each vehicle. If there are *k* active AMRs on the factory floor, the clustering algorithm should generate *k* clusters of tasks. The challenge in clustering for the proposed problem is twofold. First, there should ideally be a good mix of tasks of all priorities, i.e., tasks of a single priority should not all be completed by a single vehicle. Second, the task space is a nonmetric asymmetric space, which makes clustering difficult.

#### Priority Separation.

To ensure that tasks of a single priority are not concentrated on a single AMR, we define the priority value of each task as a third coordinate. For each of the priority values, all tasks of that priority value are separated into *k* clusters using the clustering algorithm described later, after which the “centroids” of these clusters are matched across the different priority values. This ensures that all AMRs have a mix of tasks from all three priorities.

A representation of priority values as an extra coordinate is shown in Fig. 2 for 300 tasks. The arrows join the start and end points for each task. The task space is asymmetric, and thus, modifications have to be made before simple clustering techniques like *k*-center methods [52], which require a space with a metric distance measure, can be applied.

#### Intra-Priority Clustering.

Each task with the same priority can be represented in the task space as (*s*, *e*), where *s* and *e* are the start and end points of the task, respectively. As defined earlier, the cost of moving between tasks is defined as the cost of traveling from the end point of the first task to the start point of the second task and then completing the second task. Since this cost is directional, we try to minimize the sum of the cost of travel in either direction while clustering, i.e., tasks are assumed to be “near” each other when the sum of the cost of travel between two tasks in either direction is small. This is to ensure that only those tasks which are near to each other in either direction are grouped together. Now, since the tasks are already given, the objective of clustering is to find *k* clusters such that if a pair of tasks is chosen from the same cluster, the sum of the cost of travel from the end point of the first task to the start point of the second task, and vice versa, should be small relative to tasks in other clusters.

Examples are shown in Figs. 3 and 4 to demonstrate this “nearness” measure. *A* and *B* are the start and end points for task 1, whereas *C* and *D* are the start and end points for task 2. The two arrangements of tasks are shown in Figs. 3 and 4, respectively. The sum of the cost of travel between the tasks in both directions is then given by *AD* + *AB* + *BC* + *CD*. However, since the tasks are already determined, the pairing of tasks has to be determined by the clustering algorithm, which minimizes the sum of the cross-distances, i.e., the sum of the distances from the start point of one task to the end point of the other. This can be represented as the sum of the diagonals of the quadrilateral in Fig. 3, and as the sum *AD* + *BC* in Fig. 4, since each of the points *A*–*D* lies in the two-dimensional Euclidean ($\mathbb{R}^2$) space.

The sum of cross-distances (SCD), as defined previously, is the sum of the distances from the start point of one task to the end point of the other. It is symmetric, since the SCD does not change with the direction of travel, i.e., from task 1 to task 2 or vice versa. However, the SCD does not satisfy the triangle inequality. This is clear from the simple counterexample shown in Fig. 5, where there are three tasks with start points (*s*_{1}, *s*_{2}, *s*_{3}) and end points (*e*_{1}, *e*_{2}, *e*_{3}). The SCD between tasks 1 and 3 (SCD = 5 + 5, marked by the blue lines) is greater than the sum of the same distance measure between tasks 1 and 2 (SCD = 1 + 1) and tasks 2 and 3 (SCD = 1 + 1), thus violating the triangle inequality. Therefore, the sum of cross-distances is not a distance metric. Thus, a modification is proposed, which allows us to define a distance metric between any two tasks, on which *k*-center methods can be applied. To understand this modification, which is presented later in this section, we first look at the two example cases presented in Figs. 3 and 4.
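A small numerical check (with 1-D task coordinates of our own choosing, not those of Fig. 5) illustrates both the SCD violation and the behavior of the start-plus-end-separation metric proposed later in this section:

```python
def scd(t1, t2):
    """Sum of cross-distances between two 1-D tasks t = (start, end)."""
    return abs(t1[1] - t2[0]) + abs(t2[1] - t1[0])

def d_metric(t1, t2):
    """Proposed metric: start-point separation plus end-point separation."""
    return abs(t1[0] - t2[0]) + abs(t1[1] - t2[1])

# Three 1-D tasks chosen so that the SCD violates the triangle inequality.
t1, t2, t3 = (10.0, 0.0), (1.0, 9.0), (8.0, 2.0)
print(scd(t1, t3))                 # 16.0
print(scd(t1, t2) + scd(t2, t3))   # 4.0 -> SCD(1,3) > SCD(1,2) + SCD(2,3)
print(d_metric(t1, t3) <= d_metric(t1, t2) + d_metric(t2, t3))  # True
```

Here SCD(1, 3) = 16 exceeds SCD(1, 2) + SCD(2, 3) = 4, so the SCD cannot serve as a metric, whereas the start/end separation sum always satisfies the triangle inequality.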

where (*s*_{1}, *e*_{1}) and (*s*_{2}, *e*_{2}) are coordinates of the two tasks. In Eq. (3), the constant terms representing the predefined task costs are denoted as Γ.

Equation (4) is only valid when the start points and end points of the two tasks lie along the diagonals of the quadrilateral, as shown in Fig. 4.

Any pair of tasks, (*i*, *j*), in the task set can be categorized using one of the two equations mentioned previously. Therefore, a combined inequality is presented as follows:

Using this inequality, a distance measure can be defined for any pair of tasks (*i*, *j*) in the task set as follows:

$d_{ij} = \|s_i - s_j\| + \|e_i - e_j\|$

where *i*, *j* represent the tasks, and *s* and *e* are the start and end point vectors (in $\mathbb{R}^2$) for each task. Unless specified otherwise, the norm used is the Euclidean norm throughout this paper. It is clear that the defined distance measure, *d*_{ij}, is a metric: it is always non-negative (a sum of vector norms); it is zero if and only if *i* = *j* (it is assumed that there are no repeated tasks of the same priority); it is symmetric; and it follows the triangle inequality. The last property can be proved by using the fact that the start and end point spaces both follow the triangle inequality in $\mathbb{R}^2$, and therefore, a distance measure which is the sum of the separation between the start points and the end points of two tasks will also satisfy the triangle inequality. A similar method can also be used in cases where the start and end points are in three dimensions.

The objective of intra-priority clustering is to generate *k* clusters of tasks within a priority level, one for each AMR, with each task within a cluster near to the other tasks in the same cluster. This nearness can be defined by the SCD measure. However, as shown in Sec. 2.2.2, the SCD measure is nonmetric, and therefore, we propose using the defined distance metric, *d*_{ij}, instead. Using this distance metric, a *k*-center clustering technique similar to Lloyd's algorithm [53] (which requires a distance metric) can be defined on the task space in $\mathbb{R}^4$, where each task is represented as $\delta = (x_s, y_s, x_e, y_e)$, and distances between tasks can be quantified using the defined metric. Mathematically, we define the intrapriority clustering problem as minimizing the cost function, *J*(*μ*, *L*), given by

$J(\mu, L) = \sum_{\delta_i \in \Delta} d\big(\delta_i, \mu_j(L_i)\big)^2$

where *μ* is the set of cluster “centroids.” The centroid of each cluster is defined as the point in the task space that minimizes the sum of squared distances from all the tasks in that cluster, thus representing a center for that particular cluster. *L* is the cluster label of each task in the set (indicating the cluster to which it belongs), Δ is the entire task set within the current priority level, and $\mu_j(L_i)$ is the centroid of the cluster to which the task $\delta_i \in \Delta$ is assigned. The motivation behind choosing the cost function *J*(*μ*, *L*) is that all points within a cluster should be nearer to the “central point” of that cluster than to the central points of other clusters. Using Eq. (5), solving the clustering problem with the proposed distance metric reduces the upper bound of the SCD measures within each cluster.

Since solving the joint minimization problem shown in Eq. (7) is difficult, the proposed algorithm uses two alternating steps in every iteration to simplify the problem: (a) given a set of labels (indicating the cluster to which each task belongs) for all the tasks, update the centroid position for each cluster (centroid update step), and (b) given the new set of *k* centroid locations (one for each cluster), update the labels of each task such that each task is assigned to the nearest cluster (label update step).

The centroid update step is the first step of the proposed *k*-center algorithm. In this step, for each cluster (defined from the label update step), we want to update the centroid position by choosing the point which minimizes the sum of squared distances from all the tasks in that cluster, as mentioned earlier. If the label set is fixed, this is equivalent to writing the minimization problem presented in Eq. (7) separately for each cluster as follows:

$\min_{(\mu_s, \mu_e)_j} \sum_{(s_i, e_i) \in C_j} \big( \|s_i - \mu_s\| + \|e_i - \mu_e\| \big)^2$

where *C*_{j} is the cluster under consideration and $(\mu_s, \mu_e)_j$ are the variables in the minimization problem representing the location of the centroid of the cluster $j$, $j = \{1, 2, 3, \ldots, k\}$. Therefore, the centroid of each cluster can be represented as a pseudo-task with its start and end points as $(\mu_s, \mu_e)_j$, respectively, and represented in the $\mathbb{R}^4$ space as $(x_{\mu_s}, y_{\mu_s}, x_{\mu_e}, y_{\mu_e})_j$. The points $(s_i, e_i)$ represent the tasks in the cluster (there are $|C_j|$ tasks in the cluster, $\forall (s_i, e_i) \in C_j$). Now, since $(a+b)^2 \le 2(a^2 + b^2)$ for real numbers *a*, *b*:

$\sum_{(s_i, e_i) \in C_j} \big( \|s_i - \mu_s\| + \|e_i - \mu_e\| \big)^2 \le 2 \sum_{(s_i, e_i) \in C_j} \big( \|s_i - \mu_s\|^2 + \|e_i - \mu_e\|^2 \big)$

The right-hand side is minimized by taking $\mu_s$ and $\mu_e$ as the means of the start and end points, respectively, of the tasks in the cluster. Although $(\mu_s, \mu_e)_j$ is not the true centroid of the cluster *j* (as defined earlier in the section), it is used as an approximate position of the true centroid, as this solution minimizes the upper bound of the cost function for a fixed label set, as shown in Eq. (8).

The label update step assigns each task to its nearest centroid:

$\bar{L}_i = \operatorname{arg\,min}_j \, d(\delta_i, \mu_j)$

where *L*_{i} is the label of the *i*th task and *μ*_{j} is the *j*th cluster centroid, with $j = \{1, 2, 3, \ldots, k\}$.

Now that both steps of the proposed *k*-center algorithm, termed the modified *k*-means (MKM) algorithm, have been defined, the complete algorithm is shown in Algorithm 1 (with an initial set of labels *L*, *k* cluster centroids, $\{\mu_1, \mu_2, \ldots, \mu_k\}$, and the cost function *J*(*μ*, *L*)). In Algorithm 1, the parameter *α* is a tunable acceptance limit of the cost function after each iteration, i.e., the algorithm allows the cost function to increase by a maximum of (1 + *α*) times, possibly generating a better final solution by permitting local increases in the cost function after an iteration. This allows the clustering algorithm to escape local minima. The centroid update step is shown in lines 3–7, and the label update step in lines 8–10 of Algorithm 1. The rationale behind lines 11–15 is explained in Sec. 2.2.3.

**Algorithm 1** Modified *k*-means (MKM) clustering

Input: Δ, set of all tasks to be clustered within the same priority level; *k*, number of AMRs
Output: $\{\mu_1, \mu_2, \ldots, \mu_k\} \in \mathbb{R}^4$, set of cluster centroids; *L*, cluster label of each task in Δ

1: initialize $\{\mu_1, \mu_2, \ldots, \mu_k\}, L$
2: repeat
3: &nbsp;&nbsp;for each cluster *j* do
4: &nbsp;&nbsp;&nbsp;&nbsp;$\bar{\mu}_{j,s} = \frac{1}{|C_j|} \sum_{i=1}^{|C_j|} s_i$
5: &nbsp;&nbsp;&nbsp;&nbsp;$\bar{\mu}_{j,e} = \frac{1}{|C_j|} \sum_{i=1}^{|C_j|} e_i$
6: &nbsp;&nbsp;&nbsp;&nbsp;$\bar{\mu}_j = (\bar{\mu}_{j,s}, \bar{\mu}_{j,e})$
7: &nbsp;&nbsp;end for
8: &nbsp;&nbsp;for *i* = 1 to $|\Delta|$ do
9: &nbsp;&nbsp;&nbsp;&nbsp;$\bar{L}_i = \operatorname{arg\,min}_j \, d(\delta_i, \bar{\mu}_j), \; \delta_i \in \Delta$
10: &nbsp;&nbsp;end for
11: &nbsp;&nbsp;if $J(\bar{\mu}, \bar{L}) < (1 + \alpha)\, J(\mu, L)$ then
12: &nbsp;&nbsp;&nbsp;&nbsp;$\mu_{j,s} = \bar{\mu}_{j,s}$
13: &nbsp;&nbsp;&nbsp;&nbsp;$\mu_{j,e} = \bar{\mu}_{j,e}$
14: &nbsp;&nbsp;&nbsp;&nbsp;$L = \bar{L}$
15: &nbsp;&nbsp;end if
16: until convergence
17: return $\{\mu_1, \mu_2, \ldots, \mu_k\}, L$

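For concreteness, Algorithm 1 can be sketched in Python as follows. The initialization strategy (random tasks, or a user-supplied seed set) and the convergence check are our own choices where the pseudocode leaves them open:

```python
import numpy as np

def d(task, cent):
    """Metric d_ij: start-point separation plus end-point separation."""
    return (np.linalg.norm(task[:2] - cent[:2])
            + np.linalg.norm(task[2:] - cent[2:]))

def assign(tasks, mu):
    """Label update: index of the nearest centroid for each task."""
    return np.array([int(np.argmin([d(t, m) for m in mu])) for t in tasks])

def cost(tasks, mu, L):
    """Cost J(mu, L): sum of squared metric distances to assigned centroids."""
    return sum(d(t, mu[l]) ** 2 for t, l in zip(tasks, L))

def mkm(tasks, k, alpha=0.05, max_iter=50, init=None, seed=0):
    """Minimal MKM sketch following Algorithm 1. `tasks` is an (n, 4)
    array of rows (x_s, y_s, x_e, y_e).
    """
    rng = np.random.default_rng(seed)
    mu = (np.asarray(init, dtype=float) if init is not None
          else tasks[rng.choice(len(tasks), k, replace=False)].astype(float))
    L = assign(tasks, mu)
    for _ in range(max_iter):
        # Centroid update: component-wise means of start and end points.
        mu_bar = np.array([tasks[L == j].mean(axis=0) if np.any(L == j)
                           else mu[j] for j in range(k)])
        L_bar = assign(tasks, mu_bar)          # label update
        # Accept only if the cost grows by no more than (1 + alpha)x.
        if cost(tasks, mu_bar, L_bar) < (1 + alpha) * cost(tasks, mu, L):
            converged = np.array_equal(L_bar, L)
            mu, L = mu_bar, L_bar
            if converged:                      # labels stable: stop
                break
    return mu, L
```

Setting *α* = 0 recovers a strictly non-increasing variant; small positive values allow the local cost increases discussed above.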

#### Convergence Analysis.

The convergence of the intrapriority clustering algorithm can be shown by representing the MKM algorithm as an expectation–maximization style algorithm, as shown for the vanilla *k*-means algorithm by Bottou and Bengio [54]. Expectation–maximization is an iterative algorithm which uses latent parameters (the labels *L* in our problem) in addition to the actual optimization parameters (the set of cluster centroids $\{\mu_1, \mu_2, \ldots, \mu_k\}$ in our case) to simplify the optimization problem. Each iteration then consists of two steps: (a) given the values of the current set of optimization parameters, choose the best set of latent parameters in terms of the cost function (equivalent to assigning a label to each task in our case) and (b) given the updated set of latent parameters, update the optimization parameters (the centroid locations for the above-mentioned problem).

The first two terms in Eq. (15) show the change in the cost function for the label update step, and their sum is guaranteed to be less than or equal to zero since $\bar{L}$ is defined as the best label assignment for each task under the distance metric $d(\delta_i, \bar{\mu}(L))$ (using Eq. (13)), thus guaranteeing a decrease in cost (or the cost remaining the same). The third and the fourth terms of Eq. (15) show the change in cost function for the centroid update step. However, as shown in Eq. (11), the updated centroids $\bar{\mu}$, given by Eq. (12), only minimize the upper bound of the cost function defined in Eq. (8), given a label set *L*. Therefore, a check has to be introduced to ensure that a decrease in cost is guaranteed in every step. Once the cost function $J(\bar{\mu}, \bar{L})$ is computed for the new set of parameters, it is checked against the previous cost $J(\mu, L)$. This check, which is also shown in Algorithm 1 (lines 11–15), ensures the decrease of the cost function after every iteration until the solution converges, i.e., the label set *L* does not change. Thus, the MKM algorithm generates a locally optimal solution for the intrapriority clustering problem.

As the sum of the squares of the metric distances between the points in a cluster and the cluster centroid decreases, it can be shown that the upper bound of the sum of the cross-distances also decreases. However, the sum of the cross-distances is not guaranteed to decrease with successive iterations. Therefore, the solution with the minimum sum of cross-distances between points in the cluster and the cluster centroid is chosen for each priority level.

#### Cluster Recombination.

Once the clustering is complete at each priority level, each of these clusters is represented by its pseudo-task centroid as described earlier. Since there are 3*k* clusters (*k* for each priority level), the next step is to form a cluster of clusters, with *k* final clusters, each containing three separate clusters, one from each priority level. The clustering is determined according to the nearness of the pair of centroids according to the SCD measure defined earlier.

A simple simulated annealing (SA) [55] based algorithm is used for cluster recombination. During the recombination phase, the simulated annealing algorithm does a local search around the current solution to get a better combination of clusters if possible. At every iteration, a random new solution is generated, and its quality (in terms of the cost) is compared with the current solution. If the new solution is better, it is always accepted. Worse solutions are also accepted with a probability (often restricted to the range [0, 1/2]), with a higher probability of acceptance if the cost difference is lower, thus allowing the algorithm to escape from locally optimal solutions. Additionally, the standard SA procedure introduces a “temperature” variable which also directly affects the probability of accepting worse solutions, i.e., the probability of accepting such solutions decreases with decreasing temperature. This is motivated from the physical process of annealing, which involves a controlled rate of temperature decrease, starting from a high initial temperature, to control specific material properties. The controlled temperature decrease rate in the physical process is analogous to a gradual decrease in the probability of worse solutions being accepted with increasing iterations. To summarize, in the standard SA, the temperature variable is used as a parameter to control the rate of acceptance of worse solutions with increasing iterations. The initial value, decrease rate and stopping conditions can be tuned according to the requirements of the problem under consideration.
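The acceptance rule described above can be sketched as follows. The cost function, the neighborhood move (a pairing swap), and the geometric cooling schedule are illustrative choices of ours, not the paper's exact settings:

```python
import math
import random

def simulated_annealing(state, cost, neighbor, t0=1.0, cooling=0.95,
                        t_min=1e-3, seed=0):
    """Generic SA skeleton of the kind used for cluster recombination:
    improvements are always accepted, and worse candidates are accepted
    with probability exp(-delta / T), which shrinks as the cost gap grows
    and as the temperature T drops.
    """
    rng = random.Random(seed)
    current, best = state, state
    t = t0
    while t > t_min:
        cand = neighbor(current, rng)
        delta = cost(cand) - cost(current)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current = cand
            if cost(current) < cost(best):
                best = current
        t *= cooling                       # geometric cooling schedule
    return best

# Toy stand-in for centroid matching: pair "centroid" a[i] with b[perm[i]]
# (1-D values here) so that the total pairing cost is small.
a = [0.0, 5.0, 9.0]
b = [8.8, 0.2, 5.1]

def pair_cost(perm):
    return sum(abs(a[i] - b[perm[i]]) for i in range(len(a)))

def swap_two(perm, rng):
    p = list(perm)
    i, j = rng.sample(range(len(p)), 2)
    p[i], p[j] = p[j], p[i]
    return p

best = simulated_annealing([0, 1, 2], pair_cost, swap_two)
```

In the actual recombination step, the state would be an assignment of one cluster per priority level to each AMR, and the cost would be the SCD between the matched pseudo-task centroids.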

### Task Ordering for Single Autonomous Mobile Robots.

Once the task set has been clustered, each cluster of tasks can be handled by a single AMR. Using the task distance matrix, *D*_{tasks}, defined in Sec. 2.1, a new model-based learning technique is proposed for task ordering with priorities for a single AMR.

#### Model Structure.

Following the pointer network architecture of Vinyals and coworkers [49], the model structure consists of two recurrent neural network (RNN) sections with long short-term memory (LSTM) units [56]. Recurrent neural networks use a defined network structure over multiple time steps (the parameters are shared over time steps) and are mainly used to predict time series data. LSTM units are particularly good for long-term dependence in sequential learning, as compared to vanilla RNN implementations, and are therefore chosen for the proposed problem. This is because long-term dependency will help the system avoid converging to the greedy solution in the proposed problem, i.e., the LSTM structure will not select the task nearest to the previously selected task after every iteration. The first section of the model is the encoder network, into which the task list for each vehicle is inputted one at a time. It is followed by the decoder network, which uses the encoded form of the task list to generate a sequential list of tasks to be carried out by that particular vehicle.

#### Encoder Network.

As described before, the encoder network receives an input at each step which consists of the normalized start and end point coordinates, as well as the task priority, and passes on this information as the cell and hidden states of the LSTM unit. A single LSTM unit is shown in Fig. 6 along with the various parameters that determine the output of the LSTM unit. In Fig. 6, *σ* is the sigmoid activation function, *c*_{t} and *h*_{t} are the cell states and output states, respectively, *i*, *f*, and *o* are the activation vectors of the input, forget, and output gate operations, respectively, and *a* is the activation vector for the selector function which chooses the proportion of the input gate operation which affects the cell state. The cell state keeps a “record” of the past information after each time-step and is modified by the input gate (activation vector *i*) and forget gate (activation vector *f*) operations, which help in adding new information to the system and forgetting old information, respectively. As shown in Fig. 6, the updated cell state is a combination of the old cell state weighted by the forget gate activation vector, *f*, and the input gate activation vector, *i*, weighted by *a*. The next output state is just a function of the updated cell state, as shown in Fig. 6. The parameters $W_{\{c,i,f,o\}}$ and $U_{\{c,i,f,o\}}$ are tunable parameters of the unit, and *u*_{t} is the external input to the system at each time-step.

The objective of the encoder network is to transform all the information from the task space into an encoding of a chosen dimension, which can then be used in the decoder network for task assignment as described in Sec. 2.3.3. It is to be noted that the size of the encoder network (number of LSTM units) is equal to the number of tasks to be completed by the vehicle in order to ensure that there is an encoded version of all the tasks.
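For reference, the standard LSTM update equations consistent with the gates and parameters named above are as follows (the use of tanh for the selector and output nonlinearities is the usual convention and may differ from the exact form in Fig. 6):

```latex
\begin{aligned}
i_t &= \sigma(W_i u_t + U_i h_{t-1}), \quad
f_t = \sigma(W_f u_t + U_f h_{t-1}), \quad
o_t = \sigma(W_o u_t + U_o h_{t-1}), \\
a_t &= \tanh(W_c u_t + U_c h_{t-1}), \\
c_t &= f_t \odot c_{t-1} + i_t \odot a_t, \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The third line is the weighted combination of the old cell state (via *f*) and the new input (via *i* and *a*) described in the text, and the last line shows the output state as a function of the updated cell state.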

#### Decoder Network.

The decoder network selects the next task through an attention mechanism over the encoded tasks. Following the pointer network formulation [49], the selection probabilities at step *t* are computed as

$$u_{j}^{t} = v^{\top} \tanh\left(W_{1} e_{j} + W_{2} h_{t}\right), \quad j = 1, \ldots, n \qquad (16)$$

$$P\left(\Omega_{t+1} = j \mid \Omega, e\right) = \operatorname{softmax}\left(u^{t}\right)_{j} \qquad (17)$$

where *v*, *W*_{1}, and *W*_{2} are learnable parameters, Ω is the cumulative sequence of tasks after each step, and *e* is the set of encoded vectors for each task, derived from the encoder network. The task selector only selects tasks which have not yet been completed: the next task is chosen according to the probability distribution generated by the softmax function, with the probabilities of already-completed tasks set to zero, so the selector effectively acts as a single-layer network. The first input is the encoded vector representing the depot, which is simply a task represented as $(z_{\text{depot}}, z_{\text{depot}}, 1)$, assigned the highest priority (*p* = 1), where *z*_{depot} is the point representing the depot on the factory floor.
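The masking behaviour of the task selector can be sketched as follows. This is an illustrative Python snippet, not the paper's code; it shows greedy selection for determinism, whereas during training the next task is sampled from the masked distribution:

```python
import numpy as np

def select_next_task(scores, completed):
    """Softmax over the selector's scores, with already-completed tasks
    masked out (their probabilities set to zero) before renormalizing."""
    p = np.exp(np.asarray(scores, dtype=float) - np.max(scores))  # stable softmax
    p[list(completed)] = 0.0                                      # never repeat a task
    p /= p.sum()
    return int(np.argmax(p)), p
```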

The entire architecture is shown in Fig. 7, where both the encoder and decoder sections along with the task selector are shown. After one epoch (one forward pass) of the encoder is complete, the encoder network generates an encoded form of all the tasks to be completed, which includes positions of the start and end points, as well as the priority of the tasks. This is used during the decoding stage as a feedforward input to the task selector along with the output state from the previous step to decide the next task to be completed. After one epoch of the decoder, the task sequence can be derived by tracing back the inputs generated by the task selector after each step.

#### Representation as a Control System.

The decoder section of the network presented in Fig. 7 can be represented as a discrete-time control system since a recurrent neural network structure has been used. As in a discrete-time feedback control system, the control input for the next time-step is computed from the previous states/output (with an additional feedforward term if required) using a probabilistic feedback model. Similarly, in the task sequencing network, the states are the cell states (*c*_{t}) of the LSTM unit, the output corresponds to the output states (*h*_{t}), the "time" variable is the counter of the number of tasks completed (number of steps completed), and the control input is the next task to be completed, which is decided according to Eqs. (16) and (17), which, in turn, depend on the output states. This comparison is shown in Fig. 8, where, similar to the feedforward term in the control system block diagram, the task selector also takes the encoded task list (*e*) as an input.

#### Reward Function and Optimization.

The reward for a generated task sequence Ω is chosen as the negative of the priority-adjusted travel cost, where the travel distances *D*_{tasks} have been defined in Sec. 2.1. The last leg of travel between the last task and the depot is not priority adjusted. Since the generated task list is probabilistic, the expected value of the reward can be written as

$$J(\Phi) = \mathbb{E}_{\Omega \sim p_{\Phi}}\left[R(\Omega)\right] = \sum_{\Omega} p_{\Phi}(\Omega)\, R(\Omega)$$

where *p*_{Φ}(Ω) is the probability of obtaining the task sequence Ω given the parameter set Φ and takes into account the recurrent nature of the network. Policy gradient methods can be used to maximize the expected value of the reward, using the REINFORCE algorithm [57]

$$\nabla_{\Phi} J(\Phi) = \mathbb{E}_{\Omega \sim p_{\Phi}}\left[\left(R(\Omega) - b\right) \nabla_{\Phi} \log p_{\Phi}(\Omega)\right]$$

where *b* is a baseline reward value, chosen as the exponential moving average. The expected value can be approximated using a Monte Carlo-based sampling using forward passes of the network at each step.

It is to be noted that only the given set of tasks is used for the optimization process over repeated iterations, and thus, no training examples are required in this process.
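A minimal sketch of this optimization step is shown below (Python, for illustration only). Here `grads_logp` stands for the per-sample gradients ∇_Φ log *p*_Φ(Ω), assumed to be computed by backpropagation through the network, and the baseline is updated as an exponential moving average of the sampled rewards:

```python
import numpy as np

def reinforce_step(grads_logp, rewards, baseline, lr=0.01, ema=0.9):
    """Monte Carlo REINFORCE update with an exponential-moving-average
    baseline: average (R - b) * grad log p over the sampled sequences."""
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - baseline
    grad = np.mean([a * g for a, g in zip(advantages, grads_logp)], axis=0)
    baseline = ema * baseline + (1.0 - ema) * rewards.mean()  # update baseline
    return lr * grad, baseline   # parameter increment and updated baseline
```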

## Results

### Results of Asymmetric Clustering.

A sample result of intrapriority clustering using the MKM method is shown in Fig. 9. The base point set is a 1000-point randomly generated set, on which 700 tasks are defined by picking a pair of points and assigning the task a random priority. Since there are three vehicles, each priority level has three clusters. It can be seen that the sum of squares of the metric distance measure between points in a cluster and the cluster centroid decreases with every iteration.

The MKM method for intrapriority clustering is tested on task sets of different sizes, as shown in Fig. 10, with detailed results in Table 1. The point sets are randomly generated in the $\mathbb{R}^{2}$ space (ten randomly generated sets of approximately the same point density for each size), and tasks are defined as random pairs of points in each point set. The mean average SCD between two points within a cluster is shown for varying numbers of vehicles (3 V, 5 V, 10 V) over 200 total instances of each case, along with the standard deviations as error bars. The results are also compared to the nonhierarchical asymmetric *k*-medoid algorithm (AKMD) for clustering in asymmetric spaces [58]. The MKM algorithm performs better than the AKMD algorithm: its mean average SCD values over 200 instances are smaller, meaning that, on average, the points in a cluster are closer to each other under the SCD measure when the MKM algorithm is used. The variability of the results over the instances, measured by the standard deviation, is also lower for the MKM algorithm, which matters for probabilistic heuristic algorithms because higher variability may lead to worse results when a single instance is considered. From Fig. 10, the worst-case solutions generated by MKM are also better than those of the AKMD algorithm. Additionally, the definition of a distance metric allows a complete task space to be defined, so the centroids of the intrapriority clusters are not restricted to being defined tasks. These cluster centroids can then be used for cluster recombination in Sec. 2.2.4.
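The comparison metric used above can be sketched as follows. This is a Python illustration in which a caller-supplied, possibly asymmetric distance function stands in for the SCD measure defined earlier in the paper:

```python
def mean_avg_cross_distance(clusters, dist):
    """Mean over clusters of the average distance between distinct members;
    dist(a, b) may be asymmetric, so both orderings of each pair are counted."""
    per_cluster = []
    for members in clusters:
        n = len(members)
        pair_dists = [dist(members[i], members[j])
                      for i in range(n) for j in range(n) if i != j]
        per_cluster.append(sum(pair_dists) / len(pair_dists))
    return sum(per_cluster) / len(per_cluster)
```

A lower value of this quantity means that, on average, the tasks grouped together are closer under the chosen distance measure, which is the sense in which MKM outperforms AKMD above.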

**Table 1.** Mean average SCD (Std. Dev.) for the MKM and AKMD algorithms

| No. of points | Task space | MKM, 3 V | MKM, 5 V | MKM, 10 V | AKMD, 3 V | AKMD, 5 V | AKMD, 10 V |
|---|---|---|---|---|---|---|---|
| 250 | 200 | 365.68 (9.47) | 373.45 (8.41) | 373.82 (10.06) | 369.34 (14.44) | 377.41 (13.31) | 378.22 (13.19) |
| | 150 | 368.21 (10.34) | 379.82 (10.25) | 380.25 (12.33) | 373.80 (14.96) | 383.91 (15.38) | 387.48 (14.79) |
| | 100 | 372.07 (12.50) | 371.98 (13.02) | 379.15 (13.00) | 377.35 (17.87) | 378.32 (18.43) | 387.27 (15.20) |
| 500 | 350 | 360.63 (8.48) | 361.96 (6.40) | 364.19 (7.35) | 367.09 (13.57) | 367.70 (12.29) | 368.25 (10.39) |
| | 250 | 358.27 (9.52) | 363.29 (8.51) | 366.59 (8.78) | 366.15 (14.72) | 371.26 (12.67) | 373.50 (12.02) |
| | 150 | 355.43 (10.66) | 359.72 (9.87) | 354.19 (10.64) | 365.33 (15.16) | 365.44 (14.89) | 359.26 (13.50) |
| 1000 | 700 | 363.62 (6.71) | 365.14 (4.99) | 368.00 (5.39) | 370.46 (13.32) | 372.68 (10.82) | 373.72 (9.88) |
| | 500 | 364.21 (7.44) | 364.60 (5.95) | 368.28 (6.37) | 370.52 (13.92) | 371.41 (11.43) | 373.00 (10.22) |
| | 300 | 356.16 (8.30) | 361.97 (6.47) | 361.05 (8.50) | 365.78 (14.27) | 369.70 (12.79) | 368.28 (12.72) |


The computation time for this intrapriority clustering process is almost negligible, with even the largest task sets (700 tasks) taking less than 1 s on a laptop personal computer.

### Results of Encoder-Decoder Method.

To test the encoder–decoder algorithm, a test case is presented in Fig. 11. The test case has eight tasks with three priority levels (tasks {1, 3, 5} have *p* = 1, shown in red; tasks {6, 7, 8} have *p* = 2, shown in blue; and tasks {2, 4} have *p* = 3, shown in green), with *p* = 1 denoting the highest priority. The optimal solution for this test case, starting from task 1 (derived using exhaustive search), is the task sequence Ω = {1, 3, 5, 7, 8, 6, 4, 2} before returning to the start point of task 1. From Fig. 11, it can be seen that the algorithm starts from an arbitrary solution and reaches the optimal solution while accounting for task priorities. An intermediate solution is also shown in Fig. 11.
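For small instances like the eight-task case above, an exhaustive-search baseline can be sketched as below. This Python illustration makes one assumption consistent with the optimal sequence shown, namely that higher-priority tasks strictly precede lower-priority ones; it enumerates permutations within each priority level and scores full sequences with a caller-supplied cost function (the names `exhaustive_schedule` and `cost` are ours, not the paper's):

```python
from itertools import permutations

def exhaustive_schedule(tasks, cost):
    """Brute-force baseline over (task, priority) pairs: enumerate all
    orderings in which priority levels appear in strict order and
    keep the cheapest complete sequence."""
    levels = sorted({p for _, p in tasks})
    groups = [[t for t, p in tasks if p == lvl] for lvl in levels]
    best_seq, best_cost = None, float("inf")

    def extend(prefix, remaining):
        nonlocal best_seq, best_cost
        if not remaining:                       # full sequence built: score it
            c = cost(prefix)
            if c < best_cost:
                best_seq, best_cost = prefix, c
            return
        for perm in permutations(remaining[0]): # all orders within this level
            extend(prefix + list(perm), remaining[1:])

    extend([], groups)
    return best_seq, best_cost
```

Such a search is only feasible for small task sets, which motivates the learning-based scheduler for larger instances.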

The results on task sets of different sizes, using the encoder–decoder network model trained with Adam [59], are presented in Fig. 12. It can be observed that, during training, the reward value increases and the standard deviation of the reward over task sequences generated by forward passes decreases. For the 100-task set, the computation time on a laptop PC was ≈30 s for the Adam implementation for a single AMR. The task set sizes are chosen according to the number of tasks generally performed by a single AMR, selected to be from 5 to 100. As the reward value increases, the travel cost per robot decreases, since the reward is defined as the negative of the travel cost.

Figure 13 and Table 2 show the costs for the encoder–decoder model compared to those generated by local search using simulated annealing. For these results, the encoder–decoder model is run ten times for each task set, and the means and standard deviations (as error bars) are shown, along with the total simulation times. For a fair comparison, the simulated annealing with local search is run for approximately the same total time as the encoder–decoder algorithm, and the resulting means and standard deviations are also reported. The base point spaces, which represent the coordinates of the pick-up and drop-off locations on the factory floor, are either randomly generated, derived from TSPLIB [60] (PCB442, PR1002), or taken from MATLAB (MTB7), and represent a variety of cases (e.g., different numbers of points, clustering of points, and distances between locations). The task sets are randomly chosen pairs of points in base point sets of different sizes. For each of the randomly generated base point sets, the reported values are averages over ten different point sets in the same range. Since multiple AMRs collaborate in a manufacturing facility, each AMR is not expected to perform very many tasks, and therefore the task set sizes are kept between 30 and 70 in the simulations.
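The simulated-annealing baseline can be sketched as follows. This is a generic Python illustration with a swap-move local search and geometric cooling; the paper does not specify its neighbourhood or cooling schedule here, so those choices are assumptions:

```python
import math
import random

def simulated_annealing(seq, cost, iters=2000, t0=1.0, alpha=0.995, seed=0):
    """Local search over task orderings: propose a random two-task swap,
    always accept improvements, accept worse moves with probability
    exp(-delta/T), and cool the temperature geometrically."""
    rng = random.Random(seed)
    cur, cur_cost = list(seq), cost(seq)
    best, best_cost = list(cur), cur_cost
    t = t0
    for _ in range(iters):
        i, j = rng.sample(range(len(cur)), 2)
        cand = list(cur)
        cand[i], cand[j] = cand[j], cand[i]         # swap two tasks
        delta = cost(cand) - cur_cost
        if delta < 0 or rng.random() < math.exp(-delta / t):
            cur, cur_cost = cand, cur_cost + delta  # accept the move
        if cur_cost < best_cost:
            best, best_cost = list(cur), cur_cost   # track the incumbent
        t *= alpha                                  # geometric cooling
    return best, best_cost
```

In the comparison above, the iteration count of each method is tuned so that both run for approximately the same wall-clock time.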

**Table 2.** Mean average tour lengths, standard deviations, and total computation times: encoder–decoder versus simulated annealing (SA)

| Data set size | Task set size | Enc–dec: mean avg. length | Enc–dec: mean std. dev. | Enc–dec: total time (s) (iterations) | SA: mean avg. length | SA: mean std. dev. | SA: total time (s) (iterations) |
|---|---|---|---|---|---|---|---|
| 7 (MTB7) | 30 | 39,355 | 2983 | 54.46 (10) | 42,641 | 4963 | 58.67 (100) |
| | 50 | 58,384 | 4278 | 101.77 (10) | 61,409 | 6033 | 106.87 (130) |
| | 70 | 83,595 | 4797 | 164.43 (10) | 88,232 | 6746 | 153.29 (150) |
| 250 | 30 | 11,964 | 232 | 64.95 (10) | 12,038 | 855 | 65.45 (100) |
| | 50 | 19,751 | 461 | 104.17 (10) | 20,081 | 1533 | 105.19 (130) |
| | 70 | 26,265 | 488 | 162.16 (10) | 26,594 | 1883 | 164.23 (150) |
| 442 (PCB442) | 30 | 124,827 | 14,897 | 60.13 (10) | 131,476 | 16,404 | 59.12 (100) |
| | 50 | 187,790 | 8817 | 104.57 (10) | 199,782 | 18,507 | 105.05 (130) |
| | 70 | 266,838 | 11,080 | 159.04 (10) | 292,385 | 24,875 | 165.38 (150) |
| 500 | 30 | 11,911 | 297 | 62.83 (10) | 11,944 | 1089 | 62.88 (100) |
| | 50 | 18,643 | 367 | 105.53 (10) | 18,658 | 1392 | 105.43 (130) |
| | 70 | 25,990 | 513 | 162.06 (10) | 26,068 | 1857 | 162.06 (150) |
| 750 | 30 | 13,233 | 540 | 57.42 (10) | 13,489 | 1472 | 57.47 (100) |
| | 50 | 18,957 | 512 | 104.93 (10) | 19,230 | 1611 | 105.67 (130) |
| | 70 | 26,297 | 533 | 163.53 (10) | 26,585 | 1819 | 164.00 (150) |
| 1000 | 30 | 12,253 | 298 | 62.87 (10) | 12,534 | 1203 | 63.51 (100) |
| | 50 | 19,228 | 515 | 106.58 (10) | 19,331 | 1473 | 105.09 (130) |
| | 70 | 25,884 | 420 | 162.74 (10) | 26,043 | 1878 | 161.19 (150) |
| 1002 (PR1002) | 30 | 506,308 | 24,691 | 53.92 (10) | 580,935 | 66,739 | 59.21 (100) |
| | 50 | 658,911 | 33,058 | 106.04 (10) | 769,055 | 76,465 | 98.57 (130) |
| | 70 | 964,590 | 41,282 | 163.26 (10) | 1,059,651 | 87,471 | 161.30 (150) |


It can be seen that the encoder–decoder network compares favorably with simulated annealing with local search, achieving lower mean average tour lengths and smaller deviations in all cases. The lower variability of the encoder–decoder results means that its worst-case results are better than those of the simulated annealing method in all presented cases. As mentioned earlier, low variability is an important feature, especially if only a single instance is used in the final implementation due to real-time constraints. The speed of the algorithm can be controlled by changing either the total number of training iterations or the number of samples in the Monte Carlo-based sampling procedure.

All results presented in this section were computed on a laptop computer (Intel Core i7-7700HQ central processing unit @ 2.8 GHz, 16 GB RAM) in MATLAB. The computation time could be reduced further by using other programming platforms.

## Conclusions

This paper proposes a novel two-step algorithm for real-time scheduling of AMRs, given a set of prioritized tasks which involve transportation of tools or materials from one point to another on the factory floor. Given the set of points and their intrapoint distances, a new task space is defined, with a proposed distance measure between any two tasks. The two-step algorithm follows a “cluster-first order-second” approach, where first, the tasks are assigned to each vehicle using an asymmetric modified *k*-means clustering method using a defined distance metric, and convergence to local optimum for this modified *k*-means algorithm is proved. This proposed clustering method is also compared with another nonhierarchical method for asymmetric clustering using medoids (AKMD) and is shown to perform better than the AKMD algorithm in terms of mean average cross-distances between points in a cluster.

The second part of the paper presents a model-based learning technique for prioritized task scheduling of a single AMR in this asymmetric task space, and for task sets of relevant sizes, this learning technique is shown to perform better in most cases when compared to a simulated annealing algorithm with local search, a widely used optimization scheme for generating real-time solutions. A reinforcement learning paradigm is used to improve the model iteratively without the need for any training data.