Abstract
In human–robot collaboration, robots and humans must work together in shared, overlapping workspaces to accomplish tasks. If human and robot motion can be coordinated, then collisions between robot and human can be seamlessly avoided without requiring either of them to stop work. A key part of this coordination is anticipating humans’ future motion so robot motion can be adapted proactively. In this work, a generative neural network predicts a multi-step sequence of human poses for tabletop reaching motions. The multi-step sequence is mapped to a time-series based on a human speed versus motion distance model. The input to the network is the human’s reaching target relative to the current pelvis location combined with the current human pose. A dataset was generated of human motions reaching to various positions on or above the table in front of the human, starting from a wide variety of initial human poses. After training the network, experiments showed that the predicted sequences generated by this method matched the actual recordings of human motion within an L2 joint error of 7.6 cm and an L2 link roll–pitch–yaw error of 0.301 rad on average. This method predicts motion for an entire reach motion without suffering from the exponential propagation of prediction error that limits the horizon of prior works.
1 Introduction
A challenge in human–robot collaboration (HRC) is coordinating human and robot motion. In HRC, humans and robots share a common workspace and work together in close proximity to accomplish tasks, e.g., in manufacturing. In an HRC cell, the less coordinated human and robot motion is, the more likely production delays and/or human discomfort become. In the case of suboptimal coordination, the robot may have to stop and wait for the human to back away, causing production delay. The robot may also take trajectories that bring it close to the human, causing human discomfort and distrust in the robotic system. To improve human–robot coordination and avoid these problems, humans’ trajectories must be predicted so robot motion can be adapted ahead of potential disruptions. In a manufacturing setting, the location of parts is known or easily determined, which provides the target for human reaching motions. Therefore, the prediction is the sequence of human poses generated from interpolations between the current pose and the reaching target. This work presents a method for predicting a sequence of human poses based only on the human’s current pose and the reaching target for the human’s left or right wrist.
Human motion can be predicted at high and low levels within an HRC system. At the low level of prediction, a time sequence of human poses is predicted. At the high level, coarse human actions are classified, and the end point of human motion can be predicted, but without time dependence. Zhang et al. developed a recurrent neural network (RNN) architecture including units for independent human parts for predicting the end-point of human motion for a robot–human part handover [1]. Liu and Wang used a hidden Markov model to generate a motion transition probability matrix for predicting next human coarse actions [2]. Maeda et al. proposed probabilistic motion primitives to predict human intent and generate a corresponding robot motion primitive [3,4]. These methods can only provide the end-point for human motion at best. They do not provide details about the motion between start and end, limiting their potential to adapt robot motions to avoid predicted disruptions.
Previous works on predicting time sequences of human poses, meaning the motion between start and end, in a manufacturing domain have used filters and/or RNNs. Mainprice and Berenson fit Gaussian mixture models (GMMs) to many recordings of human reaching motions and predicted future motion with the GMMs [5]. Wang et al. used an autoregressive integrated moving average model applied to the elbow and wrist angles to predict human tabletop reaching motions [6]. Kanazawa et al. used Gaussian mixture regression with expectation maximization to learn a model of human motion online [7]. Liu and Liu used an RNN to model human motion and used a modified Kalman filter to adapt RNN layer weights online [8]. Li et al. used multi-step Gaussian process regression and previously recorded human trajectories to predict human reaching motion online [9]. Callens et al. developed a database of human motion models using probabilistic principal component analysis (PPCA) [10]. In many of these methods, the model predicts the next step based on the current step, and subsequent predictions are then based on the previous prediction. Therefore, if there is a small error in predicting a relatively immediate step, that error will propagate through the prediction and result in an exponential increase in error as the prediction horizon increases. Some of these works, such as the GMM and PPCA approaches, require a database of human trajectories, which in turn requires computation time to compare the current motion to a record.
Other state-of-the-art human motion prediction methods have demonstrated good results for predicting human motion in general activities, such as walking and eating, represented in the Human3.6M dataset. Martinez et al. utilized a sequence-to-sequence architecture, which is an RNN with gated recurrent units that takes a sequence of recent poses and generates a predicted sequence of future poses [11]. Mao et al. encoded human pose trajectories using the discrete cosine transform (DCT) and then used a graph convolutional network which predicts future DCT coefficients based on a sequence of DCT coefficients [12]. Li et al. utilized a neural network consisting of encoders that use convolutional layers to generate hidden states based on long and short-term input sequences and then use two fully connected layers to decode hidden states into pose sequences [13]. These methods generate predictions iteratively, causing exponential divergence of the prediction from the true trajectory over the time horizon. Therefore, they limit error analysis to predictions within a 1 s horizon. Such a short prediction horizon is infeasible for a manufacturing HRC setting where human motions are typically many seconds in duration.
The method herein uses a neural network to predict a sequence of human poses considering only the current human pose and reaching target as input. This method is designed to prevent the problem of error propagation over a long prediction horizon. The first step of this method is to warp the time scale of human motion observations in the training data to a dimensionless phase scale so each training sample shows consistent timing of changes in human pose elements. After conditioning the training data, a generative neural network is fit to the training data. The neural network assembled in this work is inspired by generator networks in generative adversarial networks (GANs) [14,15]. Once the network is trained, it is used to predict a multi-step sequence of human poses. To use the prediction in the time domain, linear interpolation is used to match the multi-step prediction to a sequence having duration based on the anticipated human average speed.
The novelty of this work is the development of a neural network and data pre/post-conditioning to generate a predicted human pose sequence over a horizon of multiple seconds based only on the current human pose and relative reach target. This method utilizes the repetitiveness of human motion in manufacturing by considering the reaching target as an input. This method is unique in representing the human pose with quaternions so human link dimensions are preserved and the neural network inputs are continuous, enabling a better network fit. Other representations are either not continuous or allow link lengths to change instantaneously. The method herein can also generate a prediction in real time (faster than 30 Hz) over a long horizon without suffering from the exponential propagation of error that occurs in other works. This method is also trained and predicts based on data collected with a depth camera-based skeleton tracking system. Other methods utilize a more precise motion capture system for human tracking but require wearable sensing equipment, making them infeasible for a manufacturing setting.
The output of the method herein can be used as an input to proactive-n-reactive robot algorithms to avoid anticipated, time-varying delays. Figure 1 shows how this human motion prediction method, shown by the large block in the lower left, fits into the control scheme for a robotic system in an HRC workcell. The robot considers the task goal and a predicted time sequence of human motion to plan robot motion that accomplishes the goal while avoiding the human. The robotic system then uses a safety controller to adjust robot speed or stop the robot along the planned path if the robot gets too close to the human. In the robotic system, real-time human pose is captured by the workcell sensor suite. A new human motion prediction can then be generated based on the human’s reaching target (e.g., part to pick up) and current human pose. Targets can be determined by existing methods that locate objects based on image inputs [16]. The remaining sections of this paper are organized as follows: (2) methods, (3) results, and (4) conclusions. Section 2 is further divided into subsections: (2.1) human pose representation, (2.2) collection of training data, (2.3) preconditioning of the training data, (2.4) network architecture, and (2.5) post-conditioning output to a time sequence of poses. Section 3 is divided into subsections: (3.1) training results, (3.2) prediction accuracy, and (3.3) implementation into an HRC system.
2 Methods
The method in this work is composed of five parts. First, human pose is represented as pelvis Cartesian location and a set of quaternions that relate each human link to the world z-axis. Second, a dataset is collected in which many iterations of various human motions are recorded as the human reaches to various target locations. Third, the recorded data are conditioned to have a consistent phase scale instead of a time scale to improve neural network training. Fourth, a neural network is created to predict a multi-step sequence of human poses the human will pass through to reach for a target, given the target and current human pose as input. Fifth, the network output is post-processed to have a time scale that matches the estimated duration of human motion based on average motion velocities from the recorded data.
2.1 Human Representation.

(a) Links of the human kinematic chain, and (b) rotation angle and axis for the right forearm quaternion and cylinder link representation
Links of the human kinematic chain
Link | Description | Proximal joint | Distal joint | Link quaternion |
---|---|---|---|---|
1 | Torso/Spine | Pelvis | Spine | |
2 | Neck | Spine | Shoulder MP | |
3 | Shoulder–shoulder | Shoulder MP | Shoulders | |
4 | Left upper arm | Left shoulder | Left elbow | |
5 | Left Forearm | Left elbow | Left wrist | |
6 | Right upper arm | Right shoulder | Right elbow | |
7 | Right forearm | Right elbow | Right wrist |
The world frame z-axis would align with human link i if it were rotated by the rotation angle θ about the rotation axis v. Figure 2(b) shows an example of the rotation angle and vector to align the world z-axis with the right forearm. Figure 2(b) also shows the arm generalized to a cylinder for use by collision detection and path planning algorithms. Considering other possible angle/axis rotation representations, such as roll–pitch–yaw, for example, the quaternion representation seems best suited for use with a neural network because quaternion elements are bounded between −1 and 1. The quaternion elements will also be continuous as the human moves. In contrast, roll–pitch–yaw is either bounded but with a discontinuity at ±π radians, or free of discontinuity but allowing angles to tend toward ±∞. An example of this problem would occur if the human extended an arm outward and repeated full rotations of the arm about the shoulder. While it is physically impossible for a human joint to make a full revolution due to muscle and joint limits, sensing systems can perceive multiple full revolutions. Therefore, the quaternion representation is used to accommodate the perception of full revolutions of human joints.
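For illustration, the following Python sketch computes such a link quaternion from two tracked joint positions. The function name, joint values, and the handling of the degenerate parallel case are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def link_quaternion(proximal, distal):
    """Quaternion (w, x, y, z) that rotates the world z-axis onto the link
    direction defined by two joint positions (a sketch of the Sec. 2.1
    representation; joints are assumed to be 3D world-frame points)."""
    u = np.asarray(distal, dtype=float) - np.asarray(proximal, dtype=float)
    u /= np.linalg.norm(u)                      # unit vector along the link
    z = np.array([0.0, 0.0, 1.0])               # world z-axis

    axis = np.cross(z, u)                       # rotation axis
    s = np.linalg.norm(axis)
    theta = np.arctan2(s, np.dot(z, u))         # rotation angle in [0, pi]
    if s < 1e-8:                                # link (anti)parallel to z-axis
        axis = np.array([1.0, 0.0, 0.0])        # any perpendicular axis works
    else:
        axis = axis / s

    # Quaternion elements are all bounded in [-1, 1], as noted above.
    return np.concatenate(([np.cos(theta / 2.0)], axis * np.sin(theta / 2.0)))

# Example: right forearm quaternion from elbow (proximal) and wrist (distal)
q_forearm = link_quaternion(proximal=[0.3, -0.2, 1.1], distal=[0.5, -0.1, 1.0])
```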
2.2 Collection of the Training Dataset.
To amass a dataset of human motions, human pose was recorded while one person performed a variety of tabletop reaching tasks. Over 1750 motion sequences were recorded per arm. Each reaching motion had a target wrist Cartesian location. An array of targets of known 3D positions was selected to cover the workspace in front of the human in the robotic cell, shown in Fig. 3. Tabletop-level targets are shown as small, filled circles. The tips of rods extending upward from the table at each circle by 15, 30, and 45 cm were used to create elevated targets, shown by tall rectangles. Human joint locations were tracked using the two depth cameras circled by ovals in Fig. 3 and converted into the quaternion representation via the method in Ref. [17]. The targets are in the range , , m, with the human standing near the edge of the table near . For reference, the tabletop height is z = 0 m. For each arm, motions included reaches to targets on both the left and right sides of the pelvis to include some cross-body motions. Reaches over long distances could require the human to walk more than one step to reach the target. In this case, the prediction of motion may become significantly less accurate because the human has many more options for potential trajectories. Therefore, this work applies to human reaching motions in a tabletop manufacturing setting where objects the human will interact with are within about 1 m of the pelvis. For motions to objects more than 1 m away, another algorithm such as that in Ref. [18] could be used to generate relatively coarse predictions of occupancy at the expense of precision.
2.3 Conditioning of Training Data.
The time required for humans to complete tasks likely varies over each iteration of the task, possibly due to distractions or tiring as work shifts progress. As humans reach for the same target over many iterations, the poses in the recorded time-series may be very similar, but will likely occur at different times through the motion. If the time scale of the training data was warped to have a consistent number of steps per sequence, then the effect of varying timing would be minimized, making each training record for a particular task as similar as possible. Trial and error showed that matching the time scale of all records to a common phase scale reduced the prediction error at the end of network training. Therefore, dynamic time warping (DTW), specifically the FastDTW algorithm, was used to match all training records to a common phase scale [19].
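As a rough illustration of this conditioning step, the sketch below uses the fastdtw Python package to align one recorded pose sequence to a reference phase scale. The choice of reference and the per-phase-step averaging are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

def warp_to_reference(record, reference):
    """Warp a recorded pose sequence (steps x features) onto the phase scale
    of a reference sequence using FastDTW, producing one pose per phase step."""
    _, path = fastdtw(reference, record, dist=euclidean)

    # Average the record samples that DTW matched to each reference phase step.
    warped = np.zeros(reference.shape)
    counts = np.zeros(len(reference))
    for ref_idx, rec_idx in path:
        warped[ref_idx] += record[rec_idx]
        counts[ref_idx] += 1
    return warped / counts[:, None]
```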
2.4 Neural Network Architecture and Training.
The network architecture is inspired by the generator network in a GAN [14,15]. The generator neural network in this method is a sequence of five transposed convolution layers, as shown in Table 2 and Fig. 5. In Fig. 5, the input vector consisting of the Cartesian reaching target and current human pose is shown on the left. The 3D blocks indicate the relative shape of the output of each network layer, with the text over the blocks indicating matrix size in the order: channels, height, and width. The layer operation is indicated in the text below the 3D blocks. The transposed convolution layers indicate the size of the convolution kernel, kernel stride, and input padding in the format: height by width. The layer operations also indicate the activation function used by the layer, further explained below. A hyperbolic tangent activation function, followed by normalization of the output human pose quaternions, is applied to the output of the final convolution layer. Each quaternion in the pose must have an L2 norm of 1, so the human pose quaternions in each phase step of the network output are normalized.
Neural network layers
Layer | Description | Input size (c, h, w) | Kernel size (h, w) | Stride (h, w) | Padding (h, w) | Output size (c, h, w) |
---|---|---|---|---|---|---|
1 | Convolution transpose | 26, 1, 1 | (3,6) | (1,1) | None | 256, 3, 6 |
2 | Convolution transpose | 256, 3, 6 | (3,3) | (1,1) | (1,1) | 128, 3, 6 |
3 | Convolution transpose | 128, 3, 6 | (3,3) | (2,2) | (1,1) | 64, 5, 11 |
4 | Convolution transpose | 64, 5, 11 | (3,3) | (2,2) | None | 32, 10, 23 |
5 | Convolution transpose | 32, 10, 23 | (3,3) | (1,1) | (1,1) | 1, 10, 23 |
6 | Tanh, L2 quat. norm | 1, 10, 23 | — | — | — | 1, 10, 23 |
Batch normalization is applied after each convolution layer to improve the stability of network parameters while training [21]. The scaled exponential linear unit (SELU) is applied after each batch normalization operation [22]. The SELU activation function was selected to prevent exploding gradients and vanishing gradients by including a term for a positive gradient when its input is less than zero. Exploding gradients cause network parameters and outputs to tend toward infinite values. Vanishing gradients cause the gradients resulting from network output error to diminish as the error is backpropagated, so the gradient becomes too small to train layers near the input. Other activation functions, such as the rectified linear unit (ReLU) and LeakyReLU, were also tested, but the SELU produced lower network loss at the end of network training.
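A minimal PyTorch sketch of a generator following the layer parameters in Table 2 is shown below. The feature layout (pelvis location followed by link quaternions), the crop to 10 phase steps, and other unstated details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ReachPoseGenerator(nn.Module):
    """Generator sketch per Table 2/Fig. 5: input (batch, 26, 1, 1) holding the
    reach target and current pose; output (batch, 1, 10, 23), i.e., 10 phase
    steps by 23 pose features (assumed: 3 pelvis values + five 4-element quaternions)."""

    def __init__(self):
        super().__init__()
        def block(c_in, c_out, kernel, stride, padding):
            # Transposed convolution -> batch normalization -> SELU (Sec. 2.4)
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel, stride, padding),
                nn.BatchNorm2d(c_out),
                nn.SELU(),
            )
        self.layers = nn.Sequential(
            block(26, 256, (3, 6), (1, 1), 0),
            block(256, 128, (3, 3), (1, 1), (1, 1)),
            block(128, 64, (3, 3), (2, 2), (1, 1)),
            block(64, 32, (3, 3), (2, 2), 0),
            nn.ConvTranspose2d(32, 1, (3, 3), (1, 1), (1, 1)),  # no norm/SELU here
        )

    def forward(self, x):
        out = torch.tanh(self.layers(x))        # final tanh activation
        out = out[:, :, :10, :]                 # keep 10 phase steps (assumed crop)
        # Re-normalize each link quaternion to unit L2 norm (assumed layout:
        # columns 0:3 are pelvis location, remaining columns are quaternions).
        pelvis, quats = out[..., :3], out[..., 3:]
        b, c, steps, qdim = quats.shape
        quats = quats.reshape(b, c, steps, qdim // 4, 4)
        quats = quats / quats.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        return torch.cat([pelvis, quats.reshape(b, c, steps, qdim)], dim=-1)
```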
The proposed generative neural network for human motion prediction simultaneously predicts the human pose that places his/her hand at the target location at the end of the prediction and the sequence of poses that the human will take when traversing between the start and end poses. In order to prevent exponential growth of prediction error over the prediction horizon, the prediction method must consider a target location for the human’s hand, as in the method herein. That necessitates a method for estimating the human pose that places the hand at the target location. Since the proposed method uses a neural network for prediction, it estimates a final human pose that matches realistic human poses as closely as possible based on the motion recordings used to train the network.
2.5 Post-Processing for Time-Series Prediction.

Average wrist speed during reach motion versus Euclidean distance between wrist start and end positions
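The sketch below illustrates this post-processing step: the reach duration is estimated from the wrist speed versus reach distance model, and the phase-scale prediction is linearly interpolated onto 30 Hz time stamps. Function and parameter names are hypothetical, and the speed model is assumed to be a fitted one-dimensional function of reach distance.

```python
import numpy as np

def phase_to_time_series(phase_poses, wrist_start, wrist_target,
                         speed_model, dt=1.0 / 30.0):
    """Resample a predicted phase-scale pose sequence (phase steps x features)
    onto a 30 Hz time scale using an average-speed-versus-distance model."""
    distance = np.linalg.norm(np.asarray(wrist_target) - np.asarray(wrist_start))
    avg_speed = speed_model(distance)            # m/s from the fitted model
    duration = distance / max(avg_speed, 1e-6)   # estimated reach duration, s

    n_phase = phase_poses.shape[0]
    phase_times = np.linspace(0.0, duration, n_phase)   # phase steps spread in time
    times = np.arange(0.0, duration + dt, dt)           # 30 Hz output time stamps

    # Linear interpolation of every pose feature onto the output time stamps.
    timed = np.column_stack(
        [np.interp(times, phase_times, phase_poses[:, j])
         for j in range(phase_poses.shape[1])]
    )
    return times, timed
```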
3 Results
3.1 Network Training.
To evaluate the success of training the neural network, the collected samples of pose sequences during reaching motions were randomly divided among a train set and a test set, having 80% and 20% of the total samples, respectively. To reiterate, the train and test sets consisted of motions from one person. The neural network was allowed 3000 epochs of training. In each epoch, the network parameters were adjusted based on the error in Eq. (13). During training, a batch size of 32 samples was used to reduce training time. The lower two curves (labeled “train left” and “train right”) in Fig. 7 correspond to the sum of L1 error between the predicted and actual sequences in the train set averaged over each epoch for the left and right arm neural networks, respectively. At the end of each epoch, the network inferred a predicted pose sequence based on the start pose from each sample of the test set. The sum of L1 errors between the predicted and actual sequences of the test set are shown as the lines labeled “test left” and “test right” in Fig. 7 for the left and right arm neural networks, respectively.
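A minimal sketch of this training setup is shown below for concreteness, with placeholder tensors in place of the real dataset; the Adam optimizer and the sum-of-L1 loss form are assumptions (the paper's loss is defined by Eq. (13)).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for the warped training samples:
# 26-element start-pose/target inputs and 10 x 23 pose-sequence labels.
train_inputs = torch.randn(800, 26, 1, 1)
train_targets = torch.randn(800, 1, 10, 23)

model = ReachPoseGenerator()                      # generator sketch from Sec. 2.4
optimizer = torch.optim.Adam(model.parameters())  # optimizer choice is an assumption
loss_fn = torch.nn.L1Loss(reduction="sum")        # sum of L1 errors, as plotted in Fig. 7

loader = DataLoader(TensorDataset(train_inputs, train_targets),
                    batch_size=32, shuffle=True)
for epoch in range(3000):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```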
Figure 7 shows that loss on the train set continues to decrease with diminishing returns after 3000 epochs, but loss on the test set stopped decreasing after about 450 epochs. This indicates the left and right networks overfit to the samples in the train set. The models can accurately predict a sequence that has been used for training but are much less accurate when predicting a sequence from the test set. Therefore, the error between actual and predicted pose elements was evaluated separately to determine if the test set loss could be attributed to a particular pose element.
Figure 8 shows the L2 error between the actual and predicted pelvis location, averaged across all time-steps of all samples in the train and test sets for each epoch for the network for left arm reaches. The line labeled “test” that plateaus at about 4 cm corresponds to the test set, the line labeled “train” that continues to decrease is from the train set, and the dotted line is the average of the test set loss over the last ten epochs. This figure shows the test set loss due to pelvis location reaches a minimum after about 350 epochs and then rises to a slightly higher steady-state loss. The increase in loss after the minimum indicates that the pelvis location features contributed significantly to the plateau in test set loss.

L2 error between predicted pelvis and actual pelvis location on the train and test sets with the left arm network
Figure 9 shows the L2 norm of the difference in predicted and actual quaternion elements for the torso, neck, left upper arm, and left forearm, averaged over all time-steps of all test set samples for each epoch with the left arm network. Figure 9 shows a moving average of train set losses and test set losses as dashed and solid lines, respectively, for each quaternion. To reiterate, Figs. 8 and 9 show the network loss from Fig. 7 broken out into individual pose elements. The test set losses for all pose quaternions in Fig. 9 plateaued after more epochs than the pelvis location in Fig. 8, with the exception of the neck quaternion. Figure 9 even shows quaternion test set losses still slightly decreasing after 3000 epochs. This means that the plateau in combined test set loss after 450 epochs shown in Fig. 7 can be attributed to the challenge of learning pelvis displacements (Fig. 8) and neck orientation (Fig. 9) for reaching motions. Pelvis displacement prediction is an anticipated challenge considering the options a human has when reaching for a target relatively far away, such as 1 m away. This type of reach could be accomplished by bending at the hips and extending an arm toward the target without moving one’s feet, so the pelvis barely moves. Another option is to take a step toward the target and extend the arm so the torso can remain more upright, resulting in a pelvis displacement of many centimeters. Since reaches toward targets over about 0.7 m away have these two options, the network less accurately predicts pelvis displacement. Neck orientation prediction is also a challenge since the human does not always look at the reach target. Plots of pelvis displacement prediction error and quaternion element prediction error such as Figs. 8 and 9 for the right arm are nearly the same as for the left arm, resulting in the same conclusion about the plateau in network loss on the test set.

L2 error between predicted quaternion elements and actual quaternion elements, per human link, on the train and test sets with the left arm network
3.2 Prediction Accuracy.
The goal of this work was the development of a method for predicting human motion sequences. Therefore, the accuracy of the predicted sequence relative to actual motion sequences is a primary concern. One metric for assessing prediction accuracy is the Euclidean error between the predicted reaching wrist location and the target at the end of each motion, denoted the reach position error herein. Table 3 shows the reach position error averaged over samples from the test set for the left and right networks and combined results for the final step of the sensed and predicted sequences. The second and third columns show error in sensed and predicted wrist position, each relative to the true reach target described in Sec. 2.2. The fourth column shows error between sensed and predicted position. The 80% trimmed mean was used to exclude data in the lower and upper 10%, which may be outliers due to sensing error. These statistics show 10.6 cm error between the final wrist position in predicted sequences and the reach target. The recorded data also show an average sensor wrist measurement error of 10.2 cm from the tracking system. Since the average sensor wrist errors from the recorded dataset are nearly as large as the error of the predicted motion, inaccuracy of the sensing system used to generate the datasets contributes significantly to inaccuracy in predictions generated by the networks. As mentioned earlier, the difficulty in predicting pelvis displacement is another source of prediction error. Figure 10 shows an example of an actual and predicted human pose sequence. The lower five subplots show the prediction of quaternion elements closely matches the actual. The top subplot shows pelvis displacement, indicating a prediction error of about 3 cm in the x-direction at the end of the sequence.
Trimmed mean (80%) position error between wrist and reach target at the end of reach motions, averaged over all samples
Side | Sensor (camera) wrist error (cm) | Predicted wrist error (cm) | Predicted/ Sensed difference (cm) |
---|---|---|---|
Left | 9.2 | 9.8 | 7.7 |
Right | 11.3 | 11.4 | 8.3 |
Combined | 10.2 | 10.6 | 8.0 |
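For concreteness, the reach position error statistic reported in Table 3 can be computed as in the sketch below, assuming arrays of final wrist positions and true targets over the test motions; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import trim_mean

def reach_position_error(final_wrist_positions, reach_targets, cut=0.10):
    """80% trimmed mean of Euclidean error between final wrist positions and
    reach targets over N motions (inputs assumed to be N x 3 arrays).
    proportiontocut=0.10 drops the lowest and highest 10% of errors."""
    errors = np.linalg.norm(final_wrist_positions - reach_targets, axis=1)
    return trim_mean(errors, proportiontocut=cut)
```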
Prediction error along entire motion sequences, not just the final sequence step, also provides insight into the usefulness of this method for motion prediction. Table 4 shows errors between the predicted and actual pose, averaged over all human joints/links, all time-steps, and all sequences in the test set, considering the post-processed network output as the predicted and the raw test set samples as the actual, meaning predicted and actual both use a time scale and not a phase scale. To reiterate, the test set included motions from one person. The second column shows the Euclidean norm joint errors averaged over all human joints. The third column shows the Euclidean norm between quaternion elements per human link quaternion, and fourth column shows Euclidean norm between link roll–pitch–yaw, averaged over all human links. The quaternions in the prediction output by the method herein, which define orientation of each human link, were converted to roll–pitch–yaw only for the purpose of comparing prediction error with other methods. Other state-of-the-art methods report prediction error for link orientation using the roll–pitch–yaw format [11–13].
Trimmed mean (80%) of L2 norm of difference between predicted and actual joint locations, quaternions, and roll–pitch–yaw over all human links and all time-steps of all test set sequences for post-processed network output versus raw test set samples
Side | Avg. joint error (cm) | Avg. quaternion error | Avg. RPY error (rad) |
---|---|---|---|
Left | 8.3 | 0.113 | 0.325 |
Right | 7.6 | 0.105 | 0.274 |
Combined | 7.6 | 0.105 | 0.301 |
Table 5 shows the error comparison between the network output without post-processing and the test set samples after pre-processing, meaning the predicted and actual sequences both use the phase scale instead of a time scale. The average error between predicted and actual joint locations averaged over all joint locations and all sequences in the test set was 7.6 cm with the time scale and 5.8 cm with the phase scale. The error in Euclidean norm of roll–pitch–yaw averaged over all links, time-steps, and test set sequences was 0.301 rad when using the time scale and 0.198 rad with the phase scale. Smaller phase scale errors than time scale errors indicate that timing of motions makes the poses less predictable. However, the time scale results are more realistic since the method output is a time-series of predicted human poses. Tables 4 and 5 also show error in the L2 norm of quaternion differences, but it is less interpretable than roll–pitch–yaw since quaternion elements have mixed units.
Trimmed mean (80%) of L2 norm of difference between predicted and actual joint locations, quaternions, and roll–pitch–yaw over all human links and all time-steps of all test set sequences for raw network output versus time-warped test set samples
Side | Avg. joint error (cm) | Avg. quaternion error | Avg. RPY error (rad) |
---|---|---|---|
Left | 6.3 | 0.074 | 0.205 |
Right | 5.8 | 0.071 | 0.190 |
Combined | 5.8 | 0.071 | 0.198 |
While the position and orientation errors are larger than desired, they are comparable to the result of a recent work on human motion prediction for activities in the Human3.6M dataset over a 1 s horizon [12]. The method herein provides a significant advantage in that generated predictions cover a considerably longer time horizon than other recent works. Prior works predict the next iteration based on the prediction at the current iteration. This causes prediction error to increase exponentially as the prediction horizon increases, so prior works do not report prediction error beyond a 1 s horizon. The method herein predicts over the entire duration required for a human to reach for a target location, which was up to 3.5 s for recorded motions.
Prediction accuracy was also evaluated using recorded motions from five participants, not including the person recorded for the train and test set. Participants ranged in torso lengths over m, upper arm lengths over m, and forearm lengths over m. The participants performed motions to touch targets sequentially in four sequences with each arm. Each sequence consisted of a subset of the tabletop and elevated targets shown in Fig. 3. The recorded motion sequences were then separated into individual motions from one target to another. Then predictions for the motion from one target to another target could be compared to the recorded motions. Table 6 shows the accuracy results from motions recorded in the five-participant study, averaged over both left and right arm motions. Table 6 can be compared to results in Table 4 as they show the same statistics. The difference between the two tables is that Table 4 shows results from one person’s motions in the test set and Table 6 shows results averaged from the five participants. The prediction network was also trained with the train set consisting of motions from one person who was not part of the five participant study. Therefore, a comparison of accuracy results from the five-person study (Table 6) to the single-person test set results (Table 4) will indicate if the predictions are biased toward the person that performed motions for the datasets used to train and test the prediction network. On the other hand, comparison of results will show if the prediction network trained and tested with one person’s motions can generalize to a variety of people’s motions.
Trimmed mean (80%) of L2 norm of difference between predicted and actual joint locations, quaternions, and roll–pitch–yaw over all human links and all time-steps of recorded motions from five participants
Avg. joint error (cm) | Avg. quaternion error | Avg. RPY error (rad) |
---|---|---|
7.5 | 0.125 | 0.409 |
The left-most column of Table 6 shows the L2 norm of the difference between predicted and recorded joint positions. It shows that the difference in joint positions between recordings and predictions was slightly less for the five participants than for the single-person test dataset from Table 4. Most importantly, the joint position error with the five participants was not greater than with the test dataset from one person. The right column of Table 6 shows the error in human link roll–pitch–yaw, like the right column of the “Combined” row of Table 4. The link roll–pitch–yaw error with five participants (Table 6) was 0.108 rad greater than that of the test dataset. The motions recorded in the five-participant study also showed participants performed motions 15.9% slower than estimated by the wrist velocity versus reach distance model from Sec. 2.5. These results indicate that the prediction accuracy of the network could be improved if a more comprehensive dataset consisting of motions from multiple people could be amassed and used for training and testing the network. A more comprehensive dataset from multiple people would permit the network to generalize to people with a greater variety of link lengths and motion preferences.
All procedures performed for studies involving human participants were in accordance with the ethical standards stated in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. Informed consent was obtained from all participants. Documentation provided upon request.
3.3 Implementation Into a Human–Robot Collaboration System.
When the method herein is part of a larger HRC system, such as in Fig. 1, it can be triggered to generate a prediction once a reaching target is received. This method can also predict continuously while motions are in progress. Since there is a separate network to predict left and right arm motions, this method must be accompanied by an algorithm that predicts which arm will perform the reach motion. One possibility is to assume that if the reach target is to the right of the pelvis, the right arm will reach; otherwise, the left arm will reach. If the person is currently holding a piece with one arm, then it can be assumed that the reach will be performed with that arm. An advantage of the method herein is prediction inference in less than 2 ms. This was the performance on an Intel i9 computer with an NVIDIA RTX 3070 GPU, using the GPU for model inference and the CPU for pre/post-processing of data. This means that if the actual human motion starts to deviate from the prediction while reaching for a target, this method can rapidly generate a new, more accurate prediction. Deviations requiring rapid updates could include changes in poses, timing, right/left arm usage, or change of target. The generative neural network herein performs a fixed number of computations to infer a prediction. Therefore, variations in inference time are not due to variation in the generative neural network inputs, duration of the predictions, or distance covered by the predicted motions. Variations in inference time are due to other processes that may be happening in the robotic system, such as robot path planning.
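A minimal sketch of such an arm-selection heuristic is shown below; the function name and the sign convention (positive y taken as the human's left) are illustrative assumptions.

```python
def choose_reaching_arm(target, pelvis, holding_arm=None):
    """Pick which arm network to query for a reach prediction.
    If an arm is already holding a piece, keep it; otherwise choose the arm
    on the same side of the pelvis as the reach target (y-axis convention assumed)."""
    if holding_arm is not None:
        return holding_arm
    return "left" if (target[1] - pelvis[1]) > 0.0 else "right"

# Example: a target 20 cm to the assumed-left of the pelvis selects the left arm
arm = choose_reaching_arm(target=[0.6, 0.2, 0.1], pelvis=[0.0, 0.0, 1.0])
```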
Figures 11(b)–11(e) show a more realistic scenario of moving a piston-rod assembly from a fixture on the table to an engine block, in chronological order. They show an initial, a final, and two intermediate actual and predicted human poses. The human is represented as collections of cylinders corresponding to the actual and predicted poses, labeled “a” and “p”, respectively. The engine parts are shown as small and large shaded boxes corresponding to a piston-rod assembly and engine block, respectively. Figure 11(a) shows the engine block in the lower part of the image and the piston-rod assembly in the upper right corner, from the human’s perspective. The reaching target for the right arm is shown as the sphere over the engine block. Figure 11(a) also shows the right wrist trajectory as a dashed arrow. Figure 11 shows the method herein predicted a sequence that closely matched both the pose and timing of the actual sequence.

(a) Engine block and piston-rod assembly, and (b)–(e) four time-steps from a predicted (with reaching wrist labeled “p”) and actual (with reaching wrist labeled “a”) motion sequence in increasing time from (b) to (e)
The human motion prediction method was integrated into the spatio-temporal planning and prediction framework (STAP-PPF) in Ref. [24]. In STAP-PPF, the human motion prediction method herein was used to predict motions for a sequence of human actions. For example, the human could perform a sequence of motions such as in the previous paragraph. In STAP-PPF, the prediction method herein predicts each motion in the sequence based on the location of parts, tools, or workpieces the human will manipulate as well as the human’s current pose or the final pose of the previous prediction. This allows STAP-PPF to generate a prediction of human motion for entire manufacturing sequences. Then robot motions can be planned so the robot avoids the human predictions at the necessary times to prevent production disruptions. However, it was necessary for STAP-PPF to also update the robot’s motion in real time because the human’s real-time motions likely deviate from the predictions in timing or position. Therefore, as the robot and human were performing manufacturing tasks, the prediction method herein updated the prediction of the human’s motion at 30 Hz (the update rate of the sensor suite cameras) based on the human’s current pose and target. That permitted STAP-PPF to update robot motions after every prediction update so the robot motion could accommodate the real-time deviations in human motion. When using the online re-planning feature of STAP-PPF, the human motion prediction method herein performed inference on the Intel i9 CPU of the aforementioned computer, not using a GPU, and updated the predictions at 30 Hz.
Since the method was trained on motions from a single person, this work demonstrates the effectiveness of this method in predicting that person’s motions over the prediction horizons necessary to complete motions. Results from the five-person study in Sec. 3.2 show that motion recordings from a variety of people are necessary for the proposed human motion prediction method to generalize to unforeseen individuals. For the proposed approach to be utilized in a realistic manufacturing environment, it must generalize well to unforeseen people who were not part of the training motion recordings. It would be too time-consuming to collect data and train a generative neural network on all people who may work in an HRC manufacturing cell. Therefore, for the proposed approach to be used in a real manufacturing setting, a much larger dataset of motion recordings must be amassed from a large variety of people and then used to train the generative neural network and the motion velocity model.
4 Conclusions
In summary, this work developed a neural network and pre/post-processing to predict sequences of human poses for reaching motions in an HRC workcell. The input to the method is the human’s current pose and the reaching target for either the left or right wrist. A key advantage of this method is its ability to predict motion for the entire duration of a reach motion, which was a prediction horizon up to 3.5 s in train/test set motions. The method herein does not suffer from exponential propagation of errors experienced by prior works, which limited them to a horizon of 1 second. Additionally, the method herein infers predictions in less than 2 ms per predicted motion, permitting real-time updates to the predictions of human motion. The method herein provides predicted human pose sequences for input to other robot control algorithms to permit proactive avoidance of humans. Proactive responses can mitigate production delays and reduce human discomfort in HRC. Future work will incorporate this method with proactive robot path planning and seek ways to reduce prediction error, such as improved sensing technology.
Acknowledgment
Funding was provided by the NSF/NRI: INT: COLLAB:Manufacturing USA: Intelligent Human–Robot Collaboration for Smart Factory (Award ID #:1830383). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the National Science Foundation.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Nomenclature
- = feature vector input to the neural network
- = Euclidean error between the reaching wrist (left or right) and the target at time t in the sequence
- = single human pose, sequence of predicted human poses, and actual human pose sequence
- = human pelvis location and reaching target relative to in the world coordinate frame
- , , , , , , = quaternions relating the world z-axis to the human torso, neck, shoulder–shoulder vector, left upper arm, left forearm, right upper arm, and right forearm, respectively
- = quaternion of the human, where denotes rotation angle and , , components to denote elements of rotation vector
- = rotation angle and vector that determine quaternion of the human
- = location of the reaching wrist (left or right) in the world frame at time
- = Euclidean distance between the point in one sequence and the point in another sequence
- , , = warp path that maps elements of one sequence to elements of another, an element of which has an and component for each sequence, and either the or component of
- = filter kernel input from node , in channel from the previous neural net layer
- = output of node , in a channel in a layer of the neural network
- = set of all learned neural network parameters
- = wrist speed estimate, actual speed, and standardized residual error of the estimate
- = L1 error at time-step t between the predicted pose sequence generated by the neural network and actual sequence from train/test samples