Abstract

In human–robot collaboration, robots and humans must work together in shared, overlapping workspaces to accomplish tasks. If human and robot motion can be coordinated, then collisions between robot and human can seamlessly be avoided without requiring either of them to stop work. A key part of this coordination is anticipating humans’ future motion so robot motion can be adapted proactively. In this work, a generative neural network predicts a multi-step sequence of human poses for tabletop reaching motions. The multi-step sequence is mapped to a time-series based on a human speed versus motion distance model. The input to the network is the human’s reaching target relative to the current pelvis location combined with the current human pose. A dataset was generated of human motions to reach various positions on or above the table in front of the human starting from a wide variety of initial human poses. After training the network, experiments showed that the predicted sequences generated by this method matched the actual recordings of human motion within an L2 joint error of 7.6 cm and an L2 link roll–pitch–yaw error of 0.301 rad on average. This method predicts motion for an entire reach motion without suffering from the exponential propagation of prediction error that limits the horizon of prior works.


1 Introduction

A challenge in human–robot collaboration (HRC) is coordinating human and robot motion. In HRC, humans and robots share a common workspace and work together in close proximity to accomplish tasks, e.g., in manufacturing. In an HRC cell, the less coordinated human and robot motion is, the more likely production delays and/or human discomfort become. In the case of suboptimal coordination, the robot may have to stop and wait for the human to back away, causing production delay. The robot may also take trajectories that bring it close to the human, causing human discomfort and distrust in the robotic system. To improve human–robot coordination and avoid these problems, humans’ trajectories must be predicted so robot motion can be adapted ahead of potential disruptions. In a manufacturing setting, the location of parts is known or easily determined, which provides the target for human reaching motions. Therefore, the prediction is the sequence of human poses generated from interpolations between the current pose and reaching target. This work presents a method for predicting a sequence of human poses based only on the human’s current pose and the reaching target for the human’s left or right wrist.

Human motion can be predicted at high and low levels within an HRC system. At the low level of prediction, a time sequence of human poses is predicted. At the high level, coarse human actions are classified, and the end point of human motion can be predicted, but without time dependence. Zhang et al. developed a recurrent neural network (RNN) architecture including units for independent human parts for predicting the end-point of human motion for a robot–human part handover [1]. Liu and Wang used a hidden Markov model to generate a motion transition probability matrix for predicting the human’s next coarse actions [2]. Maeda et al. proposed probabilistic motion primitives to predict human intent and generate a corresponding robot motion primitive [3,4]. These methods can only provide the end-point for human motion at best. They do not provide details about the motion between start and end, limiting their potential to adapt robot motions to avoid predicted disruptions.

Previous works on predicting time sequences of human poses, meaning the motion between start and end, in a manufacturing domain have used filters and/or RNNs. Mainprice and Berenson fit Gaussian mixture models (GMMs) to many recordings of human reaching motions and predicted future motion with the GMMs [5]. Wang et al. used an autoregressive integrated moving average model applied to the elbow and wrist angles to predict human tabletop reaching motions [6]. Kanazawa et al. used Gaussian mixture regression with expectation maximization to learn a model of human motion online [7]. Liu and Liu used an RNN to model human motion and used a modified Kalman filter to adapt RNN layer weights online [8]. Li et al. used a multi-step Gaussian process regression and previously recorded human trajectories to predict human reaching motion online [9]. Callens et al. developed a database of human motion models using probabilistic principal component analysis (PPCA) [10]. In many of these methods, the model predicts the next step based on the current step, and then subsequent predictions are based on the previous prediction. Therefore, if there is a small error in predicting a relatively immediate step, that error will propagate through the prediction and result in exponential increase in error as the prediction horizon increases. Some of these works, such as those using GMMs and PPCA, require a database of human trajectories, which in turn requires computation time to compare the current motion to recorded ones.

Other state-of-the-art human motion prediction methods have demonstrated good results for predicting human motion in general activities, such as walking and eating, represented in the Human3.6M dataset. Martinez et al. utilized a sequence-to-sequence architecture, which is an RNN with gated recurrent units that takes a sequence of recent poses and generates a predicted sequence of future poses [11]. Mao et al. encoded human pose trajectories using the discrete cosine transform (DCT) and then used a graph convolutional network which predicts future DCT coefficients based on a sequence of DCT coefficients [12]. Li et al. utilized a neural network consisting of encoders that use convolutional layers to generate hidden states based on long and short-term input sequences and then use two fully connected layers to decode hidden states into pose sequences [13]. These methods generate predictions iteratively, causing exponential divergence of the prediction from the true trajectory over the time horizon. Therefore, they limit error analysis to predictions within a 1 s horizon. Such a short prediction horizon is infeasible for a manufacturing HRC setting where human motions are typically many seconds in duration.

The method herein uses a neural network to predict a sequence of human poses considering only the current human pose and reaching target as input. This method is designed to prevent the problem of error propagation over a long prediction horizon. The first step of this method is to warp the time scale of human motion observations in the training data to a dimensionless phase scale so each training sample shows consistent timing of changes in human pose elements. After conditioning the training data, a generative neural network is fit to the training data. The neural network assembled in this work is inspired by generator networks in generative adversarial networks (GANs) [14,15]. Once the network is trained, it is used to predict a multi-step sequence of human poses. To use the prediction in the time domain, linear interpolation is used to match the multi-step prediction to a sequence having duration based on the anticipated human average speed.

The novelty of this work is development of a neural network and data pre/post-conditioning to generate a predicted human pose sequence over a horizon of multiple seconds based only on the current human pose and relative reach target. This method utilizes the repetitiveness of human motion in manufacturing by considering the reaching target as an input. This method is unique in representing the human pose with quaternions so human link dimensions are preserved and the neural network inputs are continuous, enabling better network fit. Other representations are either not continuous or allow link lengths to change instantaneously. The method herein can also generate a prediction in real-time (faster than 30 Hz) over a long horizon without suffering from exponential propagation of error that occurs with other works. This method is also trained and predicts based on data collected with a depth camera-based skeleton tracking system. Other methods utilize a more precise motion capture system for human tracking but require wearable sensing equipment, making them infeasible for a manufacturing setting.

The output of the method herein can be used as an input to proactive-and-reactive robot algorithms to avoid anticipated, time-varying delays. Figure 1 shows how this human motion prediction method, shown by the large block in the lower left, fits into the control scheme for a robotic system in an HRC workcell. The robot considers the task goal and a predicted time sequence of human motion to plan robot motion that accomplishes the goal while avoiding the human. The robotic system then uses a safety controller to adjust robot speed or stop the robot along the planned path if the robot gets too close to the human. In the robotic system, real-time human pose is captured by the workcell sensor suite. A new human motion prediction can then be generated based on the human’s reaching target (e.g., part to pick up) and current human pose. Targets can be determined by existing methods that locate objects based on image inputs [16]. The remaining sections of this paper are organized as follows: (2) methods, (3) results, and (4) conclusions. Section 2 is further divided into subsections: (2.1) human pose representation, (2.2) collection of training data, (2.3) preconditioning of the training data, (2.4) network architecture, and (2.5) post-conditioning output to a time sequence of poses. Section 3 is divided into subsections: (3.1) training results, (3.2) prediction accuracy, and (3.3) implementation into an HRC system.

Fig. 1
Control block diagram for a robotic system in an HRC workcell

2 Methods

The method in this work is composed of five parts. First, human pose is represented as pelvis Cartesian location and a set of quaternions that relate each human link to the world z-axis. Second, a dataset is collected in which many iterations of various human motions are recorded as the human reaches to various target locations. Third, the recorded data are conditioned to have a consistent phase scale instead of a time scale to improve neural network training. Fourth, a neural network is created to predict a multi-step sequence of human poses the human will pass through to reach for a target, given the target and current human pose as input. Fifth, the network output is post-processed to have a time scale that matches the estimated duration of human motion based on average motion velocities from the recorded data.

2.1 Human Representation.

The human pose in this work is the stacked vector of the human pelvis location and the seven quaternions that define the axis for the torso, neck, shoulders, upper arms, and forearms
h = [P_p^T, q_t, q_n, q_s, q_uL, q_fL, q_uR, q_fR]^T    (1)
The Cartesian pelvis location, P_p, is defined in the world coordinate frame. The q_t, …, q_fR are the human link quaternions, shown in Table 1 and Fig. 2(a). Human link lengths and radii are required to fully define the volume each human link occupies. The method herein considers those parameters as constants determined a priori by the tracking system. The quaternions define a rotation about a vector to align the world z-axis with each human link’s axis
q_i = ⟨w_i, x_i, y_i, z_i⟩    (2)
where the subscript i indicates one of the human link quaternions, as in Eq. (1) and Table 1. The quaternion elements are determined from a rotation angle (θi) and the unit vector (vi) about which rotation is defined
q_i = ⟨cos(θ_i/2), v_i,x sin(θ_i/2), v_i,y sin(θ_i/2), v_i,z sin(θ_i/2)⟩    (3)
Fig. 2
(a) Links of the human kinematic chain, and (b) rotation angle and axis for the right forearm quaternion and cylinder link representation
Table 1
Links of the human kinematic chain

Link | Description | Proximal joint | Distal joint | Link quaternion
1 | Torso/spine | Pelvis | Spine | q_t
2 | Neck | Spine | Shoulder MP | q_n
3 | Shoulder–shoulder | Shoulder MP | Shoulders | q_s
4 | Left upper arm | Left shoulder | Left elbow | q_uL
5 | Left forearm | Left elbow | Left wrist | q_fL
6 | Right upper arm | Right shoulder | Right elbow | q_uR
7 | Right forearm | Right elbow | Right wrist | q_fR

The world frame z-axis would align with human link i if it were rotated by θ_i about v_i. Figure 2(b) shows an example of the rotation angle and vector to align the world z-axis with the right forearm. Figure 2(b) also shows the arm generalized to a cylinder for use by collision detection and path planning algorithms. Considering other possible angle/axis rotation representations, such as roll–pitch–yaw, the quaternion representation seems best suited for use with a neural network because quaternion elements are bounded between −1 and 1. The quaternion elements will also be continuous as the human moves. In contrast, the roll–pitch–yaw representation is either bounded but discontinuous at ±π radians, or continuous but with angles that can tend toward ±∞. An example of this problem would occur if the human extended an arm outward and repeated full rotations of the arm about the shoulder. While it is physically impossible for a human joint to complete a full revolution due to muscle and joint limits, sensing systems can perceive multiple full revolutions. Therefore, the quaternion representation is used to accommodate the perception of full revolutions of human joints.
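To make the representation concrete, the following sketch converts a rotation angle and axis into the quaternion of Eq. (3) and stacks the pose vector of Eq. (1). It is a minimal sketch assuming NumPy; the function and variable names are illustrative, not from the paper’s implementation.

```python
import numpy as np

def axis_angle_to_quaternion(theta, v):
    """Eq. (3): quaternion <w, x, y, z> for a rotation of theta about unit vector v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return np.concatenate(([np.cos(theta / 2.0)], np.sin(theta / 2.0) * v))

def link_quaternion(link_axis):
    """Quaternion aligning the world z-axis with a link axis (Fig. 2(b))."""
    z = np.array([0.0, 0.0, 1.0])
    u = np.asarray(link_axis, dtype=float)
    u = u / np.linalg.norm(u)
    theta = np.arccos(np.clip(z @ u, -1.0, 1.0))   # rotation angle theta_i
    axis = np.cross(z, u)                          # rotation vector v_i
    if np.linalg.norm(axis) < 1e-9:                # link parallel to z-axis
        axis = np.array([1.0, 0.0, 0.0])           # any axis is valid in this case
    return axis_angle_to_quaternion(theta, axis)

# Eq. (1): pose h stacks the pelvis location and the seven link quaternions
# (3 + 7*4 = 31 values); each arm-specific network later uses the pelvis plus
# five of these quaternions (23 values).
pelvis = np.array([0.0, -0.8, 1.0])                # example pelvis location [m]
link_axes = [np.array([0.1, 0.0, 1.0])] * 7        # placeholder link axis vectors
h = np.concatenate([pelvis] + [link_quaternion(a) for a in link_axes])
assert h.shape == (31,)
```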

2.2 Collection of the Training Dataset.

To amass a dataset of human motions, human pose was recorded while one person performed a variety of tabletop reaching tasks. Over 1750 motion sequences were recorded per arm. Each reaching motion had a target wrist Cartesian location. An array of targets of known 3D positions was selected to cover the workspace in front of the human in the robotic cell, shown in Fig. 3. Tabletop-level targets are shown as small, filled circles. The tips of rods extending upward from the table at each circle by 15, 30, and 45 cm were used to create elevated targets, shown by tall rectangles. Human joint locations were tracked using the two depth cameras circled by ovals in Fig. 3 and converted into the quaternion representation via the method in Ref. [17]. The targets are in the range x ∈ [−0.6, 0.6] m, y ∈ [0, 0.6] m, z ∈ [0, 0.5] m, with the human standing near the edge of the table near [x, y, z] = [0, −0.8, 0] m. For reference, the tabletop height is z = 0 m. For each arm, motions included reaches to targets on both left and right sides of the pelvis to include some cross-body motions. For reaches over long distances, motions could require the human to walk more than one step to reach the target. In this case, the prediction of motion may become significantly less accurate because the human has many more options for potential trajectories. Therefore, this work applies to human reaching motions in a tabletop setting in manufacturing where objects the human will interact with are within about 1 m of the pelvis. For motions to objects more than 1 m away, another algorithm such as in Ref. [18] could be used to generate relatively coarse predictions of occupancy at the expense of precision.

Fig. 3
HRC workcell used in collecting data and validation

2.3 Conditioning of Training Data.

The time required for humans to complete tasks likely varies over each iteration of the task, possibly due to distractions or tiring as work shifts progress. As humans reach for the same target over many iterations, the poses in the recorded time-series may be very similar but will likely occur at different times through the motion. If the time scale of the training data were warped to have a consistent number of steps per sequence, then the effect of varying timing would be minimized, making each training record for a particular task as similar as possible. Trial and error showed that matching the time scale of all records to a common phase scale reduced the prediction error at the end of network training. Therefore, dynamic time warping (DTW), specifically the FastDTW algorithm, was used to match all training records to a common phase scale [19].

Consider two time sequences of human poses, h1 and h2, both having the same time-step. DTW outputs a mapping of time-steps in h1 to time-steps in h2, called a warp path which is denoted W. DTW uses dynamic programming to find the shortest warp path through a 2D grid where one dimension is the time-step index of h1, denoted i, and the other dimension is the time-step index of h2, denoted j. The path search starts from the first index of h1 and first index of h2 and must end at the last index of h1 and last index of h2. Dynamic programming iteratively determines a matrix the same size as the 2D h1 by h2 grid which indicates the distance of the shortest warp path to reach cell i,j from the start cell (i=0,j=0). The distance matrix is updated iteratively according to
Γ(i, j) = D(i, j) + min[Γ(i−1, j), Γ(i, j−1), Γ(i−1, j−1)]    (4)
where D(i,j) is the Euclidean distance between the ith data point in h1 and the jth data point in h2
D(i, j) = ||h_1(i) − h_2(j)||_2    (5)
Dynamic programming iterations stop when the distance matrix reaches steady-state values. Then a greedy search of that matrix finds the lowest cost warp path from start to end of h1 and h2
W = [w_1(i, j), w_2(i, j), …, w_K(i, j)]    (6)
where w_k(i, j) is the kth step of warp path W, indicating the mapping between the ith element of h_1 and the jth element of h_2. Therefore, w_k(i) indicates the i component of w_k and w_k(j) indicates the j component of w_k. FastDTW improves on the speed of standard DTW by using a multilevel approach, recursively coarsening the sequences, projecting the warp path found at a coarse resolution onto the next finer resolution, and refining it, which reduces the size of the search grid [19].
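The following sketch implements the exact dynamic-programming recurrence of Eqs. (4)–(6) in NumPy for clarity; the paper itself uses the FastDTW approximation [19], so treat this as a reference version rather than the authors’ code.

```python
import numpy as np

def dtw_warp_path(h1, h2):
    """h1: (n, d) and h2: (m, d) sequences; returns the lowest-cost warp path W."""
    n, m = len(h1), len(h2)
    D = np.linalg.norm(h1[:, None, :] - h2[None, :, :], axis=2)  # Eq. (5)
    G = np.full((n, m), np.inf)        # cumulative-distance matrix, Eq. (4)
    G[0, 0] = D[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(G[i - 1, j] if i > 0 else np.inf,
                            G[i, j - 1] if j > 0 else np.inf,
                            G[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            G[i, j] = D[i, j] + best_prev
    # Eq. (6): greedy traceback from (n-1, m-1) to (0, 0)
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: G[c])
        path.append((i, j))
    return path[::-1]
```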
To pre-condition the input data, the Euclidean error between the reaching wrist and reaching target was determined across all time-steps of each recorded motion sequence
e_w^t = ||P_w^t − P_tgt||_2    (7)
where P_w^t is the position of the reaching wrist (left or right) at time-step t, e_w^t is the wrist position error at time-step t, and P_tgt is the target wrist location. Then, for each recording, the error was scaled to be one at the start of motion and zero at the end of motion
E_w^unit(t) = (e_w^t − e_w^{t_f}) / (e_w^0 − e_w^{t_f})    (8)
where E_w^unit is the unitized error for the series of poses and t_f is the final time-step of the recording. Plots of E_w^unit versus time revealed most errors followed a decreasing sigmoid-shaped curve. Therefore, a desired error curve, E_desired, was taken to be approximately the average of the unitized error curves from all recorded motions, given by
(9)
FastDTW was then used to determine the optimal mapping between the time-steps in the unitized error curve for each motion sample (E_w^unit) and the desired error curve (E_desired). Figure 4 illustrates the warping of a sample’s time scale by FastDTW. The dotted line is the desired error curve, the dashed line is the observed error curve from the raw motion sample, and the solid line is the error curve after warping the time scale of the raw motion to match the desired curve. The arrows show how FastDTW matches points in the raw error curve to points in the desired error curve. After the records’ time scales were warped, each record was downsampled to ten points evenly spaced along the warped time scale, which is called the phase scale herein. The phase scale indicates the percentage of reach motion completion at which a pose occurs. Trial and error showed that a sequence of ten poses nearly matched the original sequence, but more poses did not improve accuracy. Each training record now presents changes in pose elements on a more consistent scale than the raw samples.
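A sketch of this pre-conditioning step follows, assuming NumPy and the dtw_warp_path() sketch above; the desired error curve E_desired of Eq. (9) is taken as a given input here, and all names are illustrative.

```python
import numpy as np

def unitized_wrist_error(wrist_xyz, target_xyz):
    """Eqs. (7)-(8): wrist-to-target distance scaled to 1 at start, 0 at end."""
    e = np.linalg.norm(wrist_xyz - target_xyz, axis=1)   # e_w^t per time-step
    return (e - e[-1]) / (e[0] - e[-1])

def to_phase_scale(poses, E_unit, E_desired, n_steps=10):
    """Warp a recording onto the desired error curve, then keep n_steps poses."""
    path = dtw_warp_path(E_unit[:, None], E_desired[:, None])
    match = {}                        # desired-curve index j -> recording index i
    for i, j in path:
        match.setdefault(j, i)
    warped = poses[[match[j] for j in sorted(match)]]    # poses on the phase scale
    keep = np.linspace(0, len(warped) - 1, n_steps).round().astype(int)
    return warped[keep]               # ten poses evenly spaced along the phase scale
```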
Fig. 4
Warping actual wrist/target error to the desired error curve

2.4 Neural Network Architecture and Training.

A neural network was assembled to predict a ten-step sequence of future human poses based on the current pose and human reaching target
H_p = f(z, ϕ)    (10)
where H_p denotes the set of ten sequential human poses for the prediction and ϕ are the network parameters. The prediction is ten steps long because ten poses nearly matched the actual sequence and increasing the number of poses did not significantly improve error. The z is the vector input to the network given by
z = [P_tgt^T, h^T]^T    (11)
where Ptgt is the reaching target relative to the human’s current pelvis location and h is the current human pose represented by pelvis location and quaternions. A separate neural network is used for predicting left arm reaches and right arm reaches. Trial and error led to the conclusion that separate networks reduced the difference between actual and predicted motions. The left arm network predicts pelvis location and quaternions for the torso, neck, shoulders, left upper arm, and left forearm. The right arm network predicts pelvis location and quaternions for the torso, neck, shoulders, right upper arm, and right forearm. When using either network, it is assumed that pose elements not predicted by the network (opposing upper arm and forearm quaternions) are held constant throughout the motion. Implementation of the left and right networks is discussed in Sec. 3.3. Each neural network outputs a 10×23 matrix where each row is a phase step in the motion prediction. The columns correspond to the pelvis coordinates or quaternion elements in the human pose representation required for reaches with either left or right arm.

The network architecture is inspired by the generator network in a GAN [14,15]. The generator neural network in this method is a sequence of five transposed convolution layers, as shown in Table 2 and Fig. 5. In Fig. 5, the input vector consisting of the Cartesian reaching target and current human pose is shown on the left. The 3D blocks indicate the relative shape of the output of each network layer, with the text over the blocks indicating matrix size in the order: channels, height, and width. The layer operation is indicated in the text below the 3D blocks. The transposed convolution layers indicate the size of the convolution kernel, kernel stride, and input padding in the format: height by width. The layer operations also indicate the activation function used by the layer, further explained below. A hyperbolic tangent activation function is applied to the output of the final convolution layer, followed by normalization of the human pose quaternions. Each quaternion in the pose must have an L2 norm of 1, so the human pose quaternions in each phase step of the network output are normalized.

Fig. 5
Neural network architecture for predicting human pose sequences
Table 2
Neural network layers

Layer | Description | Input size (c, h, w) | Kernel size (h, w) | Stride (h, w) | Padding (h, w) | Output size (c, h, w)
1 | Convolution transpose | 26, 1, 1 | (3, 6) | (1, 1) | None | 256, 3, 6
2 | Convolution transpose | 256, 3, 6 | (3, 3) | (1, 1) | (1, 1) | 128, 3, 6
3 | Convolution transpose | 128, 3, 6 | (3, 3) | (2, 2) | (1, 1) | 64, 5, 11
4 | Convolution transpose | 64, 5, 11 | (3, 3) | (2, 2) | None | 32, 10, 23
5 | Convolution transpose | 32, 10, 23 | (3, 3) | (1, 1) | (1, 1) | 1, 10, 23
6 | Tanh, L2 quat. norm | 1, 10, 23 | — | — | — | 1, 10, 23
Transposed convolutional layers generate output of larger height and/or width than that of the layer input. This property allows the network to generate a predicted pose sequence of higher dimension than the input vector. The transposed convolution layers convolve a filter kernel over the layer input to produce the layer output. The filter kernel is a matrix having height and width indicated in Fig. 5. Filter kernel elements are learned dynamically by backpropagating network output error. The kernel is convolved with the input by shifting the kernel by the stride width across columns and stride height across rows and performing the sum of elementwise multiplication between the kernel and input at each kernel position. Each element of the transpose convolution layer output is determined by
o(c_o, n_i, n_j) = Σ_{m=1}^{c_i} Σ_i Σ_j k_{c_o}(i, j, m) · a(m, n_i·s_i + i, n_j·s_j + j)    (12)
where k_{c_o}(i, j, m) is the element at row i, column j, channel m of the kernel for output channel c_o [20]. The s_i and s_j indicate the stride along the height (rows) and width (columns), respectively. The n_i and n_j indicate the number of kernel strides that have occurred along the height and width of the input, respectively. The a(m, n_i·s_i + i, n_j·s_j + j) indicates the layer input at row n_i·s_i + i, column n_j·s_j + j for input channel m, and c_i is the number of input channels. The kernel sizes, stride, and padding were selected to expand the network input into a matrix having shape 10×23 for output sequences with ten phase steps. The number of channels in the output of each layer was found by trial and error. The number of channels in each subsequent convolution layer was selected to be half that of the preceding convolution layer. Adding channels to the output of the first convolution layer reduced test set loss, but with diminishing returns beyond 256.

Batch normalization is applied after each convolution layer to improve the stability of network parameters while training [21]. The scaled exponential linear unit (SELU) is applied after each batch normalization operation [22]. The SELU activation function was selected to prevent exploding gradients and vanishing gradients by including a term for a positive gradient when its input is less than zero. Exploding gradients cause network parameters and outputs to tend toward infinite values. Vanishing gradients cause the gradients resulting from network output error to diminish as the error is backpropagated, so the gradient becomes too small to train layers near the input. Other activation functions, such as the rectified linear unit (ReLU) and LeakyReLU, were also tested, but the SELU produced lower network loss at the end of network training.
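A PyTorch sketch of the generator of Table 2 and Fig. 5 is shown below. PyTorch’s ConvTranspose2d needs an explicit padding/output_padding choice to reproduce the 5×11 to 10×23 step that Table 2 lists with “None” padding, so the values used in layer 4 below are an assumption chosen only to make the shapes match; everything else follows the table and the batch normalization/SELU scheme described above.

```python
import torch
import torch.nn as nn

class PoseSequenceGenerator(nn.Module):
    """z = [relative reach target (3); current pose (23)] -> (10, 23) pose sequence."""
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, k, s=1, p=0, op=0):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, k, stride=s, padding=p,
                                   output_padding=op),
                nn.BatchNorm2d(c_out),   # stabilizes training [21]
                nn.SELU())               # avoids vanishing/exploding gradients [22]
        self.body = nn.Sequential(
            block(26, 256, (3, 6)),                       # layer 1 -> 256 x 3 x 6
            block(256, 128, 3, s=1, p=1),                 # layer 2 -> 128 x 3 x 6
            block(128, 64, 3, s=2, p=1),                  # layer 3 -> 64 x 5 x 11
            block(64, 32, 3, s=2, p=(1, 0), op=(1, 0)))   # layer 4 -> 32 x 10 x 23 (padding assumed)
        self.head = nn.Sequential(
            nn.ConvTranspose2d(32, 1, 3, stride=1, padding=1),  # layer 5 -> 1 x 10 x 23
            nn.Tanh())                                    # layer 6, part 1

    def forward(self, z):
        H = self.head(self.body(z.view(-1, 26, 1, 1))).squeeze(1)  # (batch, 10, 23)
        # Layer 6, part 2: renormalize the five quaternions in each phase step.
        pelvis, quats = H[..., :3], H[..., 3:].reshape(-1, 10, 5, 4)
        quats = quats / torch.linalg.norm(quats, dim=-1, keepdim=True)
        return torch.cat([pelvis, quats.reshape(-1, 10, 20)], dim=-1)
```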

When training the network weights, the prediction output by the network is compared to the actual sequences of future human poses from the conditioned training data. Since the network predicts a ten-step sequence of poses, ten poses evenly spaced along the phase scale are taken from each training sample for comparison with the network output. The L1 (absolute) error between the predicted sequence (Hp after training iteration t) and actual sequence is used as the network loss function
e_p(z_t, ϕ_t) = Σ |H_p(z_t, ϕ_t) − H_act|    (13)
where the summation is over all matrix elements, meaning across all channels and the height and width of each channel. The network parameters are adjusted based on the error using the Adam gradient-based optimizer and backpropagation [23]. For each network layer, the partial derivative of the layer output with respect to the network parameters is determined by the chain rule: the product of the partial derivative of the layer output with respect to each SELU activation function and the partial derivative of the SELU activation function with respect to each network parameter. Backpropagation uses the partial derivatives of layer outputs with respect to network weights to determine the effect the network loss should have on adjusting the network parameters using a form of gradient descent, such as the Adam optimizer.
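A minimal training-step sketch for the loss of Eq. (13) under the Adam optimizer [23] follows; the data loader yielding (z, H_act) pairs and the learning rate are assumptions, since the paper reports only the batch size (32) and epoch count (3000).

```python
import torch

model = PoseSequenceGenerator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate assumed

for epoch in range(3000):
    for z, H_act in train_loader:         # batches of 32 conditioned samples (assumed loader)
        H_p = model(z)                    # predicted 10-step pose sequence
        loss = (H_p - H_act).abs().sum()  # Eq. (13): L1 error over all elements
        optimizer.zero_grad()
        loss.backward()                   # backpropagation through all layers
        optimizer.step()
```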

The proposed generative neural network for human motion prediction simultaneously predicts the human pose that places his/her hand at the target location (P_tgt) at the end of the prediction and the sequence of poses the human will take when traversing between start and end poses. In order to prevent exponential growth of prediction error over the prediction horizon, the prediction method must consider a target location for the human’s hand, such as P_tgt in the method herein. That necessitates a method for estimating the human pose that places the hand at the target location. Since the proposed method uses a neural network for prediction, it estimates a final human pose that matches realistic human poses as closely as possible based on the motion recordings used to train the network.

2.5 Post-Processing for Time-Series Prediction.

The output of the neural network is the predicted sequence of human poses evenly spaced throughout a human task, but having a phase scale, not a time scale. Therefore, the ten-step prediction is interpolated to generate a prediction with a duration matching the anticipated duration of the human reaching motion. From the dataset of reaching motion human pose sequences, a relationship is observed between the average wrist speed during a reach motion and the distance between start and end wrist positions, shown as dots in Fig. 6. Figure 6 also shows the mean (μ, dashed line) and the mean ± two standard deviations (±2σ) (upper and lower lines, shaded area) of the wrist speed as a function of reach distance. The line for mean wrist speed shows slight curvature, indicating a quadratic function will likely fit the speed (v_est) versus reach distance relationship better than a straight line. The quadratic best-fit line was
v_est = c_2·d² + c_1·d + c_0    (14)
where d is the distance in meters between start and end wrist positions, and c_0, c_1, and c_2 are the fitted coefficients. The best-fit line is shown as a solid line in the middle of the shaded region in Fig. 6. To ensure the best-fit line was not skewed by outliers, the modified Thompson–Tau method was used to omit outlier points. Speed-versus-distance points whose standardized residual error was greater than two and whose neighboring points had residual errors at least 0.5 lower were rejected one at a time until no outliers met that criterion. This criterion is defined in terms of
S_xy = sqrt( Σ_{i=1}^{n} (v_act,i − v_est,i)² / (n − 2) )    (15)
where Sxy is the standardized residual error of the estimate, subscript i indicates the ith data point, vacti is the observed speed, and n is the number of data points less previously rejected outliers. Since wrist speed can be predicted as vest by Eq. (14), the estimated duration of the motion can be predicted according to
t_est = d / v_est    (16)
The number of time-steps generated by interpolating the ten-step prediction is then N = t_est/dt, where dt is the desired prediction time resolution.
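A sketch of this post-processing step, assuming NumPy, is shown below; the quadratic coefficients stand in for the fitted values of Eq. (14), which are not reproduced here, and the function names are illustrative.

```python
import numpy as np

def to_time_series(H_p, wrist_start, target, coeffs, dt=1.0 / 30.0):
    """Interpolate a (10, 23) phase-scale prediction onto a time scale."""
    c2, c1, c0 = coeffs                            # placeholder fit coefficients
    d = np.linalg.norm(target - wrist_start)       # reach distance [m]
    v_est = c2 * d**2 + c1 * d + c0                # Eq. (14): estimated wrist speed
    t_est = d / v_est                              # Eq. (16): estimated duration
    N = max(2, int(round(t_est / dt)))             # N = t_est/dt time-steps
    phase = np.linspace(0.0, 1.0, len(H_p))        # phase of each predicted pose
    query = np.linspace(0.0, 1.0, N)               # phase at each output time-step
    return np.stack([np.interp(query, phase, H_p[:, c])
                     for c in range(H_p.shape[1])], axis=1)  # (N, 23) prediction
```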
Fig. 6
Average wrist speed during reach motion versus Euclidean distance between wrist start and end positions

3 Results

3.1 Network Training.

To evaluate the success of training the neural network, the collected samples of pose sequences during reaching motions were randomly divided among a train set and a test set, having 80% and 20% of the total samples, respectively. To reiterate, the train and test sets consisted of motions from one person. The neural network was allowed 3000 epochs of training. In each epoch, the network parameters were adjusted based on the error in Eq. (13). During training, a batch size of 32 samples was used to reduce training time. The lower two curves (labeled “train left” and “train right”) in Fig. 7 correspond to the sum of L1 error between the predicted and actual sequences in the train set averaged over each epoch for the left and right arm neural networks, respectively. At the end of each epoch, the network inferred a predicted pose sequence based on the start pose from each sample of the test set. The sum of L1 errors between the predicted and actual sequences of the test set are shown as the lines labeled “test left” and “test right” in Fig. 7 for the left and right arm neural networks, respectively.

Fig. 7
Network output loss on the train and test sets over epochs of training

Figure 7 shows that loss on the train set continues to decrease with diminishing returns after 3000 epochs, but loss on the test set stopped decreasing after about 450 epochs. This indicates the left and right networks overfit to the samples in the train set. The models can accurately predict a sequence that has been used for training but are much less accurate when predicting a sequence from the test set. Therefore, the error between actual and predicted pose elements was evaluated separately to determine if the test set loss could be attributed to a particular pose element.

Figure 8 shows the L2 error between the actual and predicted pelvis location, averaged across all time-steps of all samples in the train and test sets for each epoch, for the network for left arm reaches. The line labeled “test” that plateaus at about 4 cm corresponds to the test set, the line labeled “train” that continues to decrease is from the train set, and the dotted line is the average of the test set loss over the last ten epochs. This figure shows that the test set loss due to pelvis location reaches a minimum after about 350 epochs and then rises to a slightly higher steady-state loss. The increase in loss after the minimum indicates that the pelvis location features contributed significantly to the plateau in test set loss.

Fig. 8
L2 error between predicted pelvis and actual pelvis location on the train and test sets with the left arm network

Figure 9 shows the L2 norm of the difference in predicted and actual quaternion elements for the torso, neck, left upper arm, and left forearm, averaged over all time-steps of all test set samples for each epoch with the left arm network. Figure 9 shows a moving average of train set losses and test set losses as dashed and solid lines, respectively, for each quaternion. To reiterate, Figs. 8 and 9 show the network loss from Fig. 7 broken out into individual pose elements. The test set losses for all pose quaternions in Fig. 9 plateaued after more epochs than the pelvis location in Fig. 8, with the exception of the neck quaternion (q_n). Figure 9 even shows that quaternion test set losses are still slightly decreasing after 3000 epochs. This means that the plateau in combined test set loss after 450 epochs shown in Fig. 7 can be attributed to the challenge of learning pelvis displacements (Fig. 8) and neck orientation (Fig. 9) for reaching motions. Pelvis displacement prediction is an anticipated challenge considering the options a human has for reaching for a target relatively far away, such as 1 m away. This type of reach could be accomplished by bending at the hips and extending an arm toward the target without moving one’s feet, so the pelvis barely moves. Another option is to take a step toward the target and extend the arm so the torso can remain more upright, resulting in a pelvis displacement of many centimeters. Since reaches toward targets over about 0.7 m away have these two options, the network less accurately predicts pelvis displacement. Neck orientation prediction is also a challenge since the human does not always look at the reach target. Plots of pelvis displacement prediction error and quaternion element prediction error such as Figs. 8 and 9 for the right arm are nearly the same as for the left arm, resulting in the same conclusion about the plateau in network loss on the test set.

Fig. 9
L2 error between predicted quaternion elements and actual quaternion elements, per human link, on the train and test sets with the left arm network

3.2 Prediction Accuracy.

The goal of this work was development of a method of predicting human motion sequences. Therefore, the accuracy of the predicted sequence relative to actual motion sequences is a primary concern. One metric for assessing prediction accuracy is the Euclidean error between the predicted reaching wrist location and the target at the end of each motion, denoted reach position error herein. Table 3 shows the reach position error averaged over samples from the test set for the left and right networks and combined results for the final step of the sensed and predicted sequences. The second and third columns show error in sensed and predicted wrist position, each relative to the true reach target described in Sec. 2.2. The fourth column shows error between sensed and predicted position. The 80% trimmed mean was used to exclude data in the lower and upper 10% which may be outliers due to error in sensing. These statistics show 10.6 cm error between the final wrist position in predicted sequences and the reach target. The recorded data also show average sensor wrist measurement error of 10.2 cm from the tracking system. Since the average sensor wrist errors from the recorded dataset are nearly as large as the error of the predicted motion, inaccuracy of the sensing system used to generate the datasets contributes significantly to inaccuracy in predictions generated by the networks. As mentioned earlier, the difficulty in predicting pelvis displacement is another source of prediction error. Figure 10 shows an example of an actual and predicted human pose sequence. The lower five subplots show the prediction of quaternion elements closely matches the actual. The top subplot shows pelvis displacement, indicating a prediction error of about 3 cm in the x-direction at the end of the sequence.

Fig. 10
Predicted and actual sequences for the human pose elements for a left arm reach
Table 3
Trimmed mean (80%) position error between wrist and reach target at the end of reach motions, averaged over all samples

Side | Sensor (camera) wrist error (cm) | Predicted wrist error (cm) | Predicted/sensed difference (cm)
Left | 9.2 | 9.8 | 7.7
Right | 11.3 | 11.4 | 8.3
Combined | 10.2 | 10.6 | 8.0

Prediction error along entire motion sequences, not just the final sequence step, also provides insight into the usefulness of this method for motion prediction. Table 4 shows errors between the predicted and actual pose, averaged over all human joints/links, all time-steps, and all sequences in the test set, considering the post-processed network output as the predicted and the raw test set samples as the actual, meaning predicted and actual both use a time scale and not a phase scale. To reiterate, the test set included motions from one person. The second column shows the Euclidean norm joint errors averaged over all human joints. The third column shows the Euclidean norm between quaternion elements per human link quaternion, and the fourth column shows the Euclidean norm between link roll–pitch–yaw, averaged over all human links. The quaternions in the prediction output by the method herein, which define orientation of each human link, were converted to roll–pitch–yaw only for the purpose of comparing prediction error with other methods. Other state-of-the-art methods report prediction error for link orientation using the roll–pitch–yaw format [11–13].

Table 4
Trimmed mean (80%) of L2 norm of difference between predicted and actual joint locations, quaternions, and roll–pitch–yaw over all human links and all time-steps of all test set sequences for post-processed network output versus raw test set samples

Side | Avg. joint error (cm) | Avg. quaternion error | Avg. RPY error (rad)
Left | 8.3 | 0.113 | 0.325
Right | 7.6 | 0.105 | 0.274
Combined | 7.6 | 0.105 | 0.301

Table 5 shows the error comparison between the network output without post-processing and the test set samples after pre-processing, meaning the predicted and actual sequences both use the phase scale instead of a time scale. The average error between predicted and actual joint locations averaged over all joint locations and all sequences in the test set was 7.6 cm with the time scale and 5.8 cm with the phase scale. The error in Euclidean norm of roll–pitch–yaw averaged over all links, time-steps, and test set sequences was 0.301 rad when using the time scale and 0.198 rad with the phase scale. Smaller phase scale errors than time scale errors indicate that timing of motions makes the poses less predictable. However, the time scale results are more realistic since the method output is a time-series of predicted human poses. Tables 4 and 5 also show error in the L2 norm of quaternion differences, but it is less interpretable than roll–pitch–yaw since quaternion elements have mixed units.

Table 5
Trimmed mean (80%) of L2 norm of difference between predicted and actual joint locations, quaternions, and roll–pitch–yaw over all human links and all time-steps of all test set sequences for raw network output versus time-warped test set samples

Side | Avg. joint error (cm) | Avg. quaternion error | Avg. RPY error (rad)
Left | 6.3 | 0.074 | 0.205
Right | 5.8 | 0.071 | 0.190
Combined | 5.8 | 0.071 | 0.198

While the position and orientation errors are larger than desired, they are comparable to the results in a recent work on human motion prediction for activities in the Human3.6M dataset over a 1 s horizon [12]. The method herein provides a significant advantage in that generated predictions cover a considerably longer time horizon than other recent works. Prior works predict the next iteration based on the prediction at the current iteration. This causes prediction error to increase exponentially as the prediction horizon increases, so prior works do not report prediction error beyond a 1 s horizon. The method herein predicts over the entire duration required for a human to reach for a target location, which was up to 3.5 s for recorded motions.

Prediction accuracy was also evaluated using recorded motions from five participants, not including the person recorded for the train and test set. Participants ranged in torso lengths over [0.506,0.580] m, upper arm lengths over [0.261,0.295] m, and forearm lengths over [0.220,0.261] m. The participants performed motions to touch targets sequentially in four sequences with each arm. Each sequence consisted of a subset of the tabletop and elevated targets shown in Fig. 3. The recorded motion sequences were then separated into individual motions from one target to another. Then predictions for the motion from one target to another target could be compared to the recorded motions. Table 6 shows the accuracy results from motions recorded in the five-participant study, averaged over both left and right arm motions. Table 6 can be compared to results in Table 4 as they show the same statistics. The difference between the two tables is that Table 4 shows results from one person’s motions in the test set and Table 6 shows results averaged from the five participants. The prediction network was also trained with the train set consisting of motions from one person who was not part of the five participant study. Therefore, a comparison of accuracy results from the five-person study (Table 6) to the single-person test set results (Table 4) will indicate if the predictions are biased toward the person that performed motions for the datasets used to train and test the prediction network. On the other hand, comparison of results will show if the prediction network trained and tested with one person’s motions can generalize to a variety of people’s motions.

Table 6
Trimmed mean (80%) of L2 norm of difference between predicted and actual joint locations, quaternions, and roll–pitch–yaw over all human links and all time-steps of recorded motions from five participants

Avg. joint error (cm) | Avg. quaternion error | Avg. RPY error (rad)
7.5 | 0.125 | 0.409

The left-most column of Table 6 shows the L2 norm of the difference between predicted and recorded joint positions. It shows that the difference in joint positions between recordings and predictions was slightly less for the five participants than for the single-person test dataset from Table 4. Most importantly, the joint position error with the five participants was not greater than with the test dataset from one person. The right column of Table 6 shows the error in human link roll–pitch–yaw, like the right column of the “combined” row of Table 4. The link roll–pitch–yaw error with five participants (in Table 6) was 0.108 rad greater than that of the test dataset. The motions recorded in the five-participant study also showed participants performed motions 15.9% slower than estimated by the wrist velocity versus reach distance model from Sec. 2.5. These results indicate that the prediction accuracy of the network could be improved if a more comprehensive dataset consisting of motions from multiple people were amassed and used for training and testing the network. A more comprehensive dataset from multiple people would permit the network to generalize to people with a greater variety of link lengths and motion preferences.

All procedures performed for studies involving human participants were in accordance with the ethical standards stated in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. Informed consent was obtained from all participants. Documentation provided upon request.

3.3 Implementation Into Human–Robot Collaboration System.

When the method herein is part of a larger HRC system, such as in Fig. 1, it can be triggered to generate a prediction once a reaching target is received. This method can also predict continuously while motions are in progress. Since there is a separate network to predict left and right arm motions, this method must be accompanied by an algorithm that predicts which arm will perform the reach motion. One possibility is to assume that if the reach target is to the right of the pelvis, the motion will be performed with the right arm, and otherwise with the left arm, as sketched below. If the person is currently holding a piece with one arm, then it can be assumed that the reach will be performed with that arm. An advantage of the method herein is prediction inference in less than 2 ms. This was the performance on an Intel i9 computer with an NVIDIA RTX 3070 GPU, using the GPU for model inference and the CPU for pre/post-processing of data. This means that if the actual human motion starts to deviate from the prediction while reaching for a target, this method can rapidly generate a new, more accurate prediction. Deviations requiring rapid updates could include changes in poses, timing, right/left arm usage, or a change of target. The generative neural network herein performs a fixed number of computations to infer a prediction. Therefore, variations in inference time are not due to variation in the generative neural network inputs, the duration of the predictions, or the distance covered by the predicted motions. Variations in inference time are due to other processes that may be running in the robotic system, such as robot path planning.
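The arm-selection rule suggested above is not a validated component of the paper; the following sketch merely illustrates it, assuming the world x-axis points to the human’s right.

```python
def select_arm(target_rel_pelvis, holding_arm=None):
    """Pick which arm's network should predict the next reach."""
    if holding_arm is not None:       # a hand already holding a part keeps the task
        return holding_arm
    # Otherwise assume targets right of the pelvis are reached with the right arm.
    return "right" if target_rel_pelvis[0] > 0.0 else "left"
```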

Figures 11(b)–11(e) show a more realistic scenario of moving a piston-rod assembly from a fixture on the table to an engine block, in chronological order. They show an initial, a final, and two intermediate actual and predicted human poses. The human is represented as collections of cylinders corresponding to the actual and predicted poses, labeled “a” and “p”, respectively. The engine parts are shown as small and large shaded boxes corresponding to the piston-rod assembly and engine block, respectively. Figure 11(a) shows the engine block in the lower part of the image and the piston-rod assembly in the upper right corner, from the human’s perspective. The reaching target for the right arm is shown as the sphere over the engine block. Figure 11(a) also shows the right wrist trajectory as a dashed arrow. Figure 11 shows the method herein predicted a sequence that closely matched both the pose and timing of the actual sequence.

Fig. 11
(a) Engine block and piston-rod assembly, and (b)–(e) four time-steps from a predicted (with reaching wrist labeled “p”) and actual (with reaching wrist labeled “a”) motion sequence in increasing time from (b) to (e)

The human motion prediction method was integrated into the spatio-temporal planning and prediction framework (STAP-PPF) in Ref. [24]. In STAP-PPF, the human motion prediction method herein was used to predict motions for a sequence of human actions. For example, the human could perform a sequence of motions such as in the previous paragraph. In STAP-PPF, the prediction method herein predicts each motion in the sequence based on the location of parts, tools, or workpieces the human will manipulate as well as the human’s current pose or the final pose of the previous prediction. This allows STAP-PPF to generate a prediction of human motion for entire manufacturing sequences. Then robot motions can be planned so the robot avoids the human predictions at the necessary times to prevent production disruptions. However, it was necessary for STAP-PPF to also update the robot’s motion in real-time since the human’s real-time motions likely deviate from the predictions in timing or position. Therefore, as the robot and human were performing manufacturing tasks, the prediction method herein updated the prediction of the human’s motion at 30 Hz (the update rate of the sensor suite cameras) based on the human’s current pose and target. That permitted STAP-PPF to update robot motions after every prediction update so the robot motion could accommodate the real-time deviations in human motion. When using the online re-planning feature of STAP-PPF, the human motion prediction method herein performed inference on the Intel i9 CPU of the aforementioned computer, without using a GPU, and updated the predictions at 30 Hz.

Since the method was trained on motions from a single person, this work demonstrates the effectiveness of this method for predicting that person’s motions over the prediction horizons necessary to complete motions. Results from the five-person study in Sec. 3.2 show that motion recordings from a variety of people are necessary for the proposed human motion prediction method to generalize to unforeseen individuals. For the proposed approach to be utilized in a realistic manufacturing environment, it must generalize well to unforeseen people who were not part of the training motion recordings. It would be too time consuming to collect data and train a generative neural network on all people that may work in an HRC manufacturing cell. Therefore, for the proposed approach to be used in a real manufacturing setting, a much larger dataset of motion recordings must be amassed from a large variety of people and then used to train the generative neural network and the motion velocity model.

4 Conclusions

In summary, this work developed a neural network and pre/post-processing to predict sequences of human poses for reaching motions in an HRC workcell. The input to the method is the human’s current pose and the reaching target for either the left or right wrist. A key advantage of this method is its ability to predict motion for the entire duration of a reach motion, which was a prediction horizon of up to 3.5 s in train/test set motions. The method herein does not suffer from the exponential propagation of errors experienced by prior works, which limited them to a horizon of 1 s. Additionally, the method herein infers predictions in less than 2 ms per predicted motion, permitting real-time updates to the predictions of human motion. The method herein provides predicted human pose sequences as input to other robot control algorithms to permit proactive avoidance of humans. Proactive responses can mitigate production delays and reduce human discomfort in HRC. Future work will incorporate this method with proactive robot path planning and seek ways to reduce prediction error, such as improved sensing technology.

Acknowledgment

Funding was provided by the NSF/NRI: INT: COLLAB:Manufacturing USA: Intelligent Human–Robot Collaboration for Smart Factory (Award ID #:1830383). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the National Science Foundation.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

Nomenclature

z = feature vector input to the neural network

e_w^t = Euclidean error between the reaching wrist (left or right) and the target at time t in the sequence

h, H_p, H_act = single human pose, sequence of predicted human poses, and actual human pose sequence

P_p, P_tgt = human pelvis location and reaching target relative to P_p in the world coordinate frame

q_t, q_n, q_s, q_uL, q_fL, q_uR, q_fR = quaternions relating the world z-axis to the human torso, neck, shoulder–shoulder vector, left upper arm, left forearm, right upper arm, and right forearm, respectively

q_i = ⟨w_i, x_i, y_i, z_i⟩ = quaternion i of the human, where w_i denotes the rotation angle and x_i, y_i, z_i denote the elements of the rotation vector

θ_i, v_i = rotation angle and vector that determine the ith quaternion of the human

P_w^t = location of the reaching wrist (left or right) in the world frame at time t

D(i, j) = Euclidean distance between the ith point in one sequence and the jth point in another sequence

W, w_k, w_k(i, j) = warp path that maps elements of one sequence to elements of another; an element of W, which has an i and a j component, one per sequence; and either the i or j component of w_k

a(c, i, j) = filter kernel input from node i, j in channel c from the previous neural network layer

o(c, i, j) = output of node i, j in channel c in a layer of the neural network

ϕ = set of all learned neural network parameters

v_est, v_act, S_xy = wrist speed estimate, actual speed, and standardized residual error of the estimate

e_p(z_t, ϕ_t) = L1 error at time-step t between the predicted pose sequence generated by the neural network and the actual sequence from train/test samples

References

1. Zhang, J., Liu, H., Chang, Q., Wang, L., and Gao, R. X., 2020, “Recurrent Neural Network for Motion Trajectory Prediction in Human-Robot Collaborative Assembly,” CIRP Ann. Manuf. Technol., 69(1), pp. 9–12.
2. Liu, H., and Wang, L., 2017, “Human Motion Prediction for Human–Robot Collaboration,” ASME J. Manuf. Syst., 44(2), pp. 287–294.
3. Maeda, G., Ewerton, M., Lioutikov, R., Ben Amor, H., Peters, J., and Neumann, G., 2014, “Learning Interaction for Collaborative Tasks With Probabilistic Movement Primitives,” IEEE/RAS International Conference on Humanoid Robots, Madrid, Spain, Nov. 18–20, pp. 527–534.
4. Maeda, G., Neumann, G., Ewerton, M., Lioutikov, R., Kroemer, O., and Peters, J., 2017, “Probabilistic Movement Primitives for Coordination of Multiple Human-Robot Collaborative Tasks,” Auton. Rob., 41(3), pp. 593–612.
5. Mainprice, J., and Berenson, D., 2013, “Human-Robot Collaborative Manipulation Planning Using Early Prediction of Human Motion,” IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, Nov. 3–8, pp. 299–306.
6. Wang, Y., Sheng, Y., Wang, J., and Zhang, W., 2018, “Optimal Collision-Free Robot Trajectory Generation Based on Time Series Prediction of Human Motion,” IEEE Rob. Autom. Lett., 3(1), pp. 226–233.
7. Kanazawa, A., Kinugawa, J., and Kosuge, K., 2019, “Adaptive Motion Planning for a Collaborative Robot Based on Prediction Uncertainty to Enhance Human Safety and Work Efficiency,” IEEE Trans. Rob., 35(4), pp. 817–832.
8. Liu, R., and Liu, C., 2021, “Human Motion Prediction Using Adaptable Recurrent Neural Networks and Inverse Kinematics,” IEEE Contr. Syst. Lett., 5(5), pp. 1651–1656.
9. Li, Q., Zhang, Z., You, Y., Mu, Y., and Feng, C., 2020, “Data Driven Models for Human Motion Prediction in Human-Robot Collaboration,” IEEE Access, 8, pp. 227690–227702.
10. Callens, T., der Have, T. V., Rossom, S. V., Schutter, J. D., and Aertbeliën, E., 2020, “A Framework for Recognition and Prediction of Human Motions in Human-Robot Collaboration Using Probabilistic Motion Models,” IEEE Rob. Autom. Lett., 5(4), pp. 5151–5158.
11. Martinez, J., Black, M. J., and Romero, J., 2017, “On Human Motion Prediction Using Recurrent Neural Networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4674–4683.
12. Mao, W., Liu, M., Salzmann, M., and Li, H., 2019, “Learning Trajectory Dependencies for Human Motion Prediction,” IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9488–9496.
13. Li, C., Zhang, Z., Lee, W. S., and Lee, G. H., 2018, “Convolutional Sequence to Sequence Model for Human Dynamics,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, June 18–23.
14. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y., 2014, “Generative Adversarial Nets,” Conference on Neural Information Processing Systems, Montreal, Canada, Dec. 8–13.
15. Radford, A., Metz, L., and Chintala, S., 2016, “Unsupervised Representation Learning With Deep Convolutional Generative Adversarial Networks,” International Conference on Learning Representations, San Juan, Puerto Rico, May 2–4.
16. Papadaki, A., and Pateraki, M., 2023, “6D Object Localization in Car-Assembly Industrial Environment,” J. Imag., 9(3), pp. 72–94.
17. Flowers, J. T., and Wiens, G. J., 2022, “Comparison of Human Skeleton Trackers Paired With a Novel Skeleton Fusion Algorithm,” Manufacturing Science and Engineering Conference, West Lafayette, IN, June 27–July 1.
18. Pellegrinelli, S., Moro, F. L., Pedrocchi, N., Molinari Tosatti, L., and Tolio, T., 2016, “A Probabilistic Approach to Workspace Sharing for Human–Robot Cooperation in Assembly Tasks,” CIRP Ann., 65(1), pp. 57–60.
19. Salvador, S., and Chan, P., 2007, “FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space,” Intell. Data Anal., 11(5), pp. 561–580.
20. Dumoulin, V., and Visin, F., 2018, “A Guide to Convolution Arithmetic for Deep Learning,” arXiv, https://arxiv.org/abs/1603.07285
21. Ioffe, S., and Szegedy, C., 2015, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” International Conference on Machine Learning, Lille, France, July 7–9, pp. 448–456.
22. Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S., 2017, “Self-Normalizing Neural Networks,” Conference on Neural Information Processing Systems, Long Beach, CA, Dec. 4–9.
23. Kingma, D. P., and Ba, J., 2015, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations, San Diego, CA.
24. Flowers, J., and Wiens, G., 2023, “A Spatio-Temporal Prediction and Planning Framework for Proactive Human–Robot Collaboration,” ASME J. Manuf. Sci. Eng., 145(12), p. 121011.