Pose estimation of human–machine interactions such as bicycling plays an important role in understanding and studying human motor skills. In this paper, we report the development of a human whole-body pose estimation scheme with application to rider–bicycle interactions. The pose estimation scheme is built on the fusion of measurements from a monocular camera on the bicycle and a set of small wearable gyroscopes attached to the rider's upper- and lower-limbs and the trunk. A single feature point is collocated with each wearable gyroscope and is also placed on each body segment link to which no gyroscope is attached. An extended Kalman filter (EKF) is designed to fuse the visual-inertial measurements to obtain drift-free whole-body poses. The pose estimation design also incorporates a set of constraints from human anatomy and the physical rider–bicycle interactions. The performance of the estimation design is validated through riding experiments with ten subjects. The results illustrate that the maximum errors for all joint angle estimations by the proposed scheme are within 3 deg. The pose estimation scheme can be further extended and used in other types of physical human–machine interactions.
Whole-body pose or gait information is commonly used to study human movement science and biomechanics. Accurate and reliable human whole-body pose information benefits both clinical disease diagnosis and rehabilitation treatment. Most studies of pose estimation address human activities such as walking, running, and stance. For human–machine interactions, such as bicycling, driving, kayaking, and riding segways, pose or gait estimation plays an important role in understanding and studying human motor skills. However, it is not straightforward to estimate the whole-body poses in these activities because of the complex, machine-coupled dynamic movements, and few studies have been reported on pose estimation in these interactions.
Optical or magnetic marker-based human motion capture systems are commercially available and commonly used. However, in addition to their high cost, these motion capture systems are limited to laboratory usage and are difficult to use for personal activities at home or in outdoor environments. Wearable sensors, such as small-size gyroscopes, accelerometers, and inertial measurement units (IMU), have been attached to the human body to obtain motion, segment orientation, or position information [1,2]. It is well known that directly integrating the measurements of these sensors produces unacceptable results because of drift. The measurements from these inertial sensors are therefore commonly integrated with those of other complementary sensors (e.g., global positioning system, magnetometers, and ultrasonic sensors) to enhance the accuracy of the pose or position estimation.
We propose a visual-inertial fusion framework to simultaneously estimate the poses of the human whole-body and the moving platform. Several attractive features motivate us to use bicycling as an application example to demonstrate the proposed pose estimation scheme. First, bicycling provides a good application example for the above-mentioned physical human–machine interactions. Sitting on the moving bicycle, the rider actively reacts to balance the unstable platform through whole-body movements. The multilimb and body movements on a moving bicycle make it a rich and challenging problem to identify and extract the human motion from the platform motion. Second, recent clinical studies show that bicycling helps diagnose or treat patients with postural balancing disorders, Parkinson's disease, and mental retardation. These bicycle-based rehabilitation and clinical applications need accurate pose estimation in outdoor environments.
One of the major applications of visual-inertial fusion is robotic localization and navigation [6–11]. The popular fusion methods used in these applications include the extended Kalman filter (EKF) and the unscented Kalman filter. Many existing developments focus on the estimation and calibration of a rigidly mounted camera-IMU pair for robot navigation and mapping applications [12–14]. Integration of visual and inertial measurements is also applied to human motion tracking and localization. For example, in Ref. , measurements from an IMU mounted on the human body are fused with those of a camera located on the body or on off-body fixtures. In Ref. , a real-time hybrid visual-inertial fusion scheme is presented to track only the three-dimensional articulated human arm motion for home-based rehabilitation. To estimate human walking, in Ref.  monocular visual odometry is fused with IMU measurements and a human walking model. In Ref. , the use of monocular video with multibody dynamic models is reported to estimate the lower-limb poses for real-time human tracking. The above-mentioned visual-inertial integrations consider the cases of inertial frame-fixed cameras or rigidly connected camera-inertial sensor pairs. An inertial frame-fixed camera confines the workspace and is not suitable for outdoor usage. Rigidly connected camera-inertial sensor pairs cannot be used as wearable sensors for the pose estimation of human activities because they are neither lightweight nor small and may interfere significantly with human motion.
In applications such as rider–bicycle interactions, we propose to mount the camera on the moving platform rather than rigidly fixing it together with the inertial sensors. The monocular camera provides only relative distances between the body segments and the moving platform. There are several reasons for pairing the bicycle-mounted camera with the body-mounted gyroscope sensors. First, for whole-body human pose estimation in physical human–machine interactions such as bicycling, the attitudes of a large number of body segments need to be measured or estimated. It is desirable to separate the camera from the inertial sensors so that the relative positions of multiple locations on the body segments are obtained simultaneously and only one camera is needed. Second, without attaching a relatively large number of cameras to human body segments, the small-size wearable sensors such as miniature gyroscopes and the point markers disturb human movements only minimally. Finally, the setup procedure of the proposed system is simple and the cost of the system is low. The noncollocated camera-gyroscope approach is attractive for human-in-the-loop or human–machine interaction applications.
In Refs. [19–22], the measurements from the body IMU and force sensors are fused to estimate the bicycle rider trunk orientation. With the help of those force sensors, the absolute angles are indirectly captured and used to compensate for the drift in the inertial-sensor-based estimates of the trunk and the bicycle angles. The limitation of such fusion is that it is not applicable to complex structures formed by limb segments. In Ref. , a visual-inertial integration scheme is presented to estimate the real-time upper-limb pose when riding a bicycle. This paper extends the approaches and the results in Ref.  in several aspects. First, we improve the visual-inertial fusion framework to estimate the orientations of the rider's whole-body segments using a simpler system configuration than that in Ref. . Second, the feature point used in this work is a single dot and much simpler than the rectangular marker used in Ref. . The feature point simplification brings challenges to the visual-inertial fusion. Finally, we conduct extensive multiple-subject experiments to validate and demonstrate the design. This paper is an extension of the conference publications [23,24] with significant additional analyses and experiments. The proposed framework can be easily extended to other human–machine interaction activities (e.g., driving, kayaking) as long as they can be treated as similar human-moving platform systems.
The rest of the paper is organized as follows: In Sec. 2, we describe the system configuration and setup. Section 3 presents the visual measurements and gyroscope models. Section 4 discusses the visual-inertial fusion design. The experimental results and discussions are presented in Sec. 5. Finally, we summarize the concluding remarks in Sec. 6.
Rider–Bicycle Systems and Configuration
Figure 1(a) shows the instrumented bicycle and the wearable sensors. Two types of gyroscope units (both from Motion Sense Inc., Hangzhou, China) are used in the experiments. One tri-axial gyroscope unit (model 605) is mounted on each of the bicycle frame and the trunk, and four small-size tri-axial gyroscope units (model slimAHRS) are attached to the rider's forearms and thighs, as shown in Fig. 1(b). A small-size battery and a wireless transmitter module are integrated and packed with each wearable gyroscope unit. A feature dot attached to the outside of the gyroscope box provides the position information. Using a single dot as the feature point reduces the image processing time and enables real-time applications. Around the feature point, a rectangular marker is used to provide the ground truth poses for outdoor experiments. A total of seven feature points are attached to the rider's body: four on the upper-limbs (on each arm, one on the forearm collocated with the gyroscope and one on the upper arm), one on the trunk, and two on the lower-limbs (one on each thigh) (see Fig. 2).
A commercial mountain bicycle was modified and equipped with various sensors. A high-resolution monocular camera (model Manta G-145 from Allied Vision Technologies, 1392 × 1040 pixels, 16 fps) is mounted on an extended rod rigidly connected to the bicycle frame. The image acquisition and processing are conducted on the onboard computer through a Gigabit Ethernet connection. The wireless receivers of the bicycle and wearable gyroscopes are connected to a real-time embedded system (NI cRIO 9082-RT) mounted on the rear rack. Two optical encoders (Grayhill, Series 63R 256 lines) are mounted on the crank and the steering fork, respectively, to measure their absolute angles. All sensor measurements are sampled at 100 Hz except for the images, which are sampled at 16 Hz. The ground truth of both the rider and the bicycle poses is obtained by an optical motion capture system (8 Bonita cameras from Vicon Inc., Oxford, UK) for indoor experiments. The data collections among the Vicon system, cRIO, and the camera computer are synchronized through wireless network connections.
Figure 1(b) shows the configuration of the visual-inertial system. We define the inertial frame with the Z-axis downward, the camera frame , the image frame , and the feature frames , . As shown in Fig. 1(b), and are for the features attached to left upper-limb, and for the right upper-limb, for the trunk, and finally and for the features on the left and right thighs, respectively.
We simultaneously estimate the joint angles of the upper- and lower-limbs, the trunk, and the bicycle roll angle. The upper-limb can be modeled as a three-joint link: the wrist (three degrees of freedom (3DOF)), the elbow (1DOF), and the shoulder (3DOF). During bicycling, the hands are assumed to be firmly attached to the handlebar. Therefore, the limb-trunk system constitutes an articulated chain and the two ends of the chain are the handlebar and seat, respectively. As long as the trunk angles, the wrist angles, and the elbow angles are known, the shoulder angles are uniquely determined. Therefore, in this study we do not estimate the shoulder angles and focus only on the estimation of the trunk angles, the wrist angles, and the elbow angles. The elbow is considered as a 1DOF joint  with joint angle γie, the wrist as a 3DOF joint with orientation by the Euler angles θiw, , and , for the left and right limbs, respectively. Euler angles θh, , and are used to capture the 3DOF trunk orientation. Similarly, each lower-limb is considered as a 5DOF articulated link: the hip (thigh) joint is a 3DOF link whose orientation is captured by Euler angles , the 1DOF knee joint with joint angle γik, and the 1DOF ankle joint with angle δia, , for the left and right limbs, respectively. Figure 2 shows the details of the limb pose configurations.
The pose estimation problem is to obtain real-time estimates of these joint angles and the bicycle roll angle during dynamic bicycle riding motion. The estimation scheme uses measurements from the six gyroscopes on the bicycle and the human body and the camera images of the seven single-dot feature points.
Vision Measurements and Gyroscope Kinematic Models
Figure 1(a) shows the feature markers on the human body. We treat the center of the black dot as the location of the feature point. Five of these markers are collocated with the five wearable gyroscopes (see Fig. 1(b)). The feature mark is segmented from the image by thresholding. Similar to the approach in Ref. , the Otsu algorithm  is chosen to calculate the optimum separating threshold by minimizing the intraclass variance of two classes (i.e., foreground and background pixels) to distinguish markers from the environment. The Blob analysis method  is then used to extract image features and to generate the size and position of the target. A run-length encoding method is used to process the image before conducting the Blob analysis so that the features are located quickly.
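The thresholding and feature-location steps described above can be illustrated with a minimal sketch. It assumes an 8-bit grayscale image containing one dark dot on a lighter background, and it replaces the run-length encoding and full Blob analysis with a plain centroid of the below-threshold pixels. The function names `otsu_threshold` and `dark_blob_centroid` are hypothetical and are not taken from the paper.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu threshold of an 8-bit image.

    Maximizing the between-class variance is equivalent to minimizing
    the intraclass variance of the two pixel classes.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)              # weight of the class with values <= t
    w1 = 1.0 - w0                     # weight of the class with values > t
    m = np.cumsum(prob * levels)      # cumulative first moment
    m_total = m[-1]
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros(256)
    between[valid] = (m_total * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return int(np.argmax(between))

def dark_blob_centroid(gray):
    """Return (threshold, (row, col) centroid, pixel area) of the dark region.

    Assumption: the image holds a single dark marker, so no connected-
    component labeling is performed.
    """
    t = otsu_threshold(gray)
    rows, cols = np.nonzero(gray <= t)
    return t, (float(rows.mean()), float(cols.mean())), int(rows.size)

# Synthetic example: a 5 x 5 dark dot on a bright background.
image = np.full((40, 60), 200, dtype=np.uint8)
image[10:15, 30:35] = 20
threshold, centroid, area = dark_blob_centroid(image)
print(threshold, centroid, area)
```

With several markers in view, as in the seven-point setup, a connected-component labeling step would be needed to separate the blobs before taking centroids.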
where frame is attached to the bicycle. For the feature frames on the upper-limb, is chosen as the handlebar frame and for the feature frames on the trunk and the lower-limb, is attached on the bicycle seat.
Gyroscope Kinematic Models.
Due to the length of the functions in Eq. (12), we omit their explicit forms here. Substituting Eqs. (11) and (12) into Eq. (8), we obtain the kinematic relationship between , and . Similar relationships are obtained for and with the gyroscope measurements and . These kinematic relationships will be used in the EKF design in Sec. 4.2.
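Since the explicit kinematic equations are not reproduced in this text, the following sketch only illustrates the generic relationship between a body-frame angular rate measured by a gyroscope and a unit quaternion, namely q_dot = 0.5 q ⊗ (0, ω). The scalar-first convention, the forward-Euler step, and all function names are assumptions made for illustration; they are not the paper's Eqs. (8), (11), and (12).

```python
import numpy as np

def quat_multiply(p, q):
    """Hamilton product of two scalar-first quaternions [w, x, y, z]."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

def propagate(q, omega_body, dt):
    """One Euler step of q_dot = 0.5 * q ⊗ (0, omega), then renormalize.

    Renormalizing enforces the unit-norm quaternion constraint after
    each step.
    """
    omega_quat = np.concatenate(([0.0], omega_body))
    q_new = q + 0.5 * quat_multiply(q, omega_quat) * dt
    return q_new / np.linalg.norm(q_new)

def integrate_gyro(q0, omegas, dt):
    """Propagate an initial quaternion through a sequence of gyro rates (rad/s)."""
    q = np.asarray(q0, dtype=float)
    for omega in omegas:
        q = propagate(q, np.asarray(omega, dtype=float), dt)
    return q

# Example: 1 s of rotation about the z-axis at pi/2 rad/s, sampled at 100 Hz.
rates = np.tile([0.0, 0.0, np.pi / 2], (100, 1))
print(integrate_gyro([1.0, 0.0, 0.0, 0.0], rates, 0.01))
```

Used alone, this open-loop propagation accumulates any gyroscope bias, which is why the design fuses it with the visual measurements.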
Rider–Bicycle Interactions Constraints and Extended Kalman Filter-Based Pose Estimation
In this section, we first present the constraints imposed by the physical rider–bicycle interactions and then discuss the EKF-based whole-body pose estimation design.
Human Anatomy and Rider–Bicycle Interaction Constraints
Lower-Limb Pedaling Motion Constraint.
in the hip frame as a function of the quaternion coordinates of the lower-limb (through a transformation similar to Eq. (11)).
where is a function of crank angle .
Upper-Limb Anatomical Constraints.
Quaternion Coordinate Constraint.
In summary, a total of 15 constraints are obtained, i.e., , and .
Extended Kalman Filter-Based Pose Estimation.
as the third set of the state equations.
The observability condition is checked for the EKF design (using matrices F and H) and the rank condition is satisfied.
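A rank test of this kind can be sketched as follows. The two-state system used here (an angle and a gyroscope bias) is a toy stand-in chosen for illustration; it is not the paper's F and H, and the helper names are hypothetical. For an EKF the same test would be applied to the matrices linearized about the current estimate.

```python
import numpy as np

def observability_matrix(F, H):
    """Stack H, HF, ..., HF^(n-1) for an n-state linear(ized) system."""
    n = F.shape[0]
    blocks = [H]
    for _ in range(n - 1):
        blocks.append(blocks[-1] @ F)
    return np.vstack(blocks)

def is_observable(F, H):
    """True when the observability matrix has full column rank n."""
    return bool(np.linalg.matrix_rank(observability_matrix(F, H)) == F.shape[0])

# Toy model: state = [angle, gyro bias], angle_{k+1} = angle_k + (gyro - bias) * dt.
dt = 0.01
F = np.array([[1.0, -dt],
              [0.0, 1.0]])
H_angle = np.array([[1.0, 0.0]])   # an absolute angle is measured
H_bias = np.array([[0.0, 1.0]])    # only the bias is measured

print(is_observable(F, H_angle))
print(is_observable(F, H_bias))
```

In the toy model, measuring the angle makes the bias observable through its effect on the angle, whereas measuring only the bias leaves the angle unobservable.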
Remark 1. The EKF design contains the bicycle roll kinematics (20) and uses measurements from the bicycle gyroscope and the wearable body gyroscopes. None of these sensors gives absolute bicycle roll angle information. As shown in Sec. 5, it is interesting that the EKF-based sensing fusion helps maintain a bounded estimation error for the bicycle roll angle while the direct integration of the bicycle gyroscope measurements results in a diverging estimate.
Remark 2. In the EKF design, we do not consider any noise models for the gyroscope measurements. As demonstrated in Ref. , the inclusion of a first-order colored noise model could help further reduce the estimation errors. However, the improvement of the estimation performance with the noise model is marginal and therefore, we do not present the noise model here.
We conduct the human riding experiments in both an indoor laboratory, as shown in Fig. 1(a), and an outdoor setting. Due to the space constraint, in the indoor environment the subjects ride the bicycle along a circular trajectory with a radius of 2 m at a speed of around 1–1.5 m/s. For outdoor riding experiments, the subjects ride along a large circle with a radius of around 7 m at a speed of around 2.5–3 m/s.
The physical parameters for the bicycle are as follows: deg, δ = 10 deg, and m. We recruited ten healthy and experienced bicycle riders (eight male and two female with age: 27 ± 3 years, height: 176 ± 4 cm, and weight: 70 ± 7 kg) to conduct the experiments. The duration of each riding experiment run was around 2 min. Before conducting experiments, we measured all subjects' biomechanical parameters and the locations of the gyroscopes and feature markers on their body segments for pose calculations. All the subjects gave their informed consent under a protocol approved by the Institutional Review Board (IRB) at Rutgers University. In the following, we first describe the indoor experimental results and then present the outdoor experiments.
Figure 3 shows the experimental results of the pose estimation for one subject. We present the Euler angle representation for better visualization. For comparison purposes, we also plot the pose estimates obtained by directly integrating the strapdown gyroscope measurements. Figures 3(a) and 3(b) show the pose estimates for the right upper-limb, namely the wrist angles and the elbow angle, Fig. 3(c) shows the trunk pose estimation, and Figs. 3(d) and 3(e) show the right lower-limb's pose estimation, including the thigh's Euler angles, the knee angle, and the ankle angle. Finally, Fig. 3(f) shows the estimates of the bicycle roll angle. The results shown in Fig. 3 clearly demonstrate that the visual-inertial fusion results closely follow the ground truth of the rider and the bicycle poses. The fusion scheme clearly outperforms the direct integration of the strapdown gyroscope measurements, whose estimates drift over time with errors that increase dramatically after 25 s. The corresponding estimation error comparisons are shown in Fig. 4. The maximum errors for all joint angle estimations by the visual-inertial fusion are within 2–3 deg. Compared with other fusion-based pose estimation designs for regular human activities, the proposed approach for human–machine interactions shows a comparable accuracy level. For example, mean root-mean-square (RMS) orientation estimation errors of 2.4–3.2 deg for different segments are demonstrated by using the fusion of the inertial and the magnetic sensors in Ref. . A 2.8 deg mean RMS error is shown by fusing the accelerometers and the gyroscopes in Ref. .
To further demonstrate the performance of the EKF-based design for indoor testing, we compute the statistics of the pose estimation errors for all ten subjects. Figure 5 shows the calculated statistics of the estimation errors of all limb and trunk joint angles over time. We plot the estimation error statistics obtained from both the visual-inertial fusion and the gyroscope integration schemes. It is clearly observed that for all subject runs, the estimation errors by the EKF fusion are near zero. Table 1 further lists the mean and one standard deviation (SD) of the RMS errors for all subjects. The third row in Table 1 shows the EKF estimation errors for the upper-limb pose without using constraints and . It is interesting to observe that incorporating and into the EKF slightly improves the estimation performance. The results shown in Table 1 and Fig. 5 confirm the consistently superior performance of the EKF-based pose estimation over that of the direct integration of gyroscope measurements.
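The summary statistic reported in Table 1 (a mean and one SD of per-subject RMS errors) can be computed as in the short sketch below. The error values are made-up placeholders rather than experimental data, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which SD estimator was used.

```python
import numpy as np

def rms(errors):
    """Root-mean-square of one subject's estimation error sequence (deg)."""
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def rms_summary(errors_by_subject):
    """Mean and sample SD (ddof=1, an assumption) of the per-subject RMS errors."""
    values = np.array([rms(e) for e in errors_by_subject])
    return float(values.mean()), float(values.std(ddof=1))

# Placeholder error traces for two hypothetical subjects (deg).
example = [[3.0, -3.0, 3.0, -3.0],
           [1.0, -1.0, 1.0, -1.0]]
mean_rms, sd_rms = rms_summary(example)
print(mean_rms, sd_rms)
```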
In Fig. 4(b), no integration results are plotted because only one gyroscope is attached to each forearm and no gyroscope measurements are available for the upper arm. For the lower-limb, although only one gyroscope is attached to each thigh, we still obtain the knee and ankle angles accurately through the gyroscope measurements and constraint (16). Clearly, these constraints enable the design to use a small number of wearable gyroscopes. The pose estimation in the outdoor experiments shows results consistent with the above-discussed indoor results. Figure 6 shows the rider pose and bicycle roll angle estimation. The rider pose estimates shown in Figs. 6(a)–6(c) demonstrate that the EKF-based scheme produces accurate estimation results and that the performance is comparable to that in the indoor experiments. The estimation results of the bicycle roll angle shown in Fig. 6(d) also demonstrate superior performance over the IMU integration.
Figure 7 shows the time trajectories of the mean and one standard deviation of the pose estimation errors of the major body segments and the trunk of the ten bicycle riders in outdoor experiments. From the plots, we clearly observe that within the roughly 50 s riding experiments, the estimation errors of all pose angles by the EKF-based fusion scheme are all around zero, and the one SD values are mostly within a range of 2 deg. Therefore, the proposed visual-inertial fusion scheme can be used to estimate the human body motion in bicycle riding in outdoor environments. On the other hand, from the plots in Fig. 7, we also notice that the SDs for most joint angles of the human body segments slowly grow over time, though the growth is much smaller than that obtained by directly integrating the gyroscope signals (see comparison results in Fig. 4). The slow growth of the estimation errors is primarily due to the vision estimation errors.
The main contribution of the work is the development of a low-cost, wearable visual-inertial design for whole-body human pose estimation in physical human–machine interactions with a moving and dynamic platform. Compared with the commercially available inertial-based whole-body sensor systems (e.g., Ref. ), our approach has several attractive features. The integration of the gyroscopes with visual measurements as well as the physical constraints is robust and reliable, while the accuracy of the magnetometers used in the sensor systems  is potentially vulnerable to highly dynamic motion and environmental disturbances, such as the presence of metallic objects nearby. Additionally, we focus on using the least possible measurement information while maintaining consistent estimation performance in all joint angles. Finally, the cost of the wearable sensors (only gyroscopes and one camera) in our approach is much less than that of most commercial whole-body sensor systems.
One potential limitation of using visual measurements in practice is the difficulty of obtaining high-quality, reliable images, especially for rapid human movement. Image occlusion, features moving out of the field of view of the camera, or blurred images reduce the vision-based pose estimation accuracy. Figure 8 shows the estimation errors of the EKF fusion results for the right wrist angles under a random loss of camera images for all ten subjects. We calculated the EKF estimation errors by randomly dropping the image frames following a given loss percentage. With increasing percentages of lost camera images, the EKF fusion performance deteriorates. Table 2 lists the statistics of the RMS errors of the pose estimation for each limb and the trunk among all ten subjects. Clearly, these results confirm that the performance of the visual-inertial fusion scheme is robust to a certain amount of vision frame loss. For example, with a 20% image loss, the average estimation error of each limb is still less than 2.5 deg.
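The frame-dropping procedure can be mimicked on a toy problem. The sketch below is not the paper's EKF: it is a scalar linear Kalman filter with state [angle, gyro bias], driven by a simulated 100 Hz gyroscope and corrected by a simulated camera angle every sixth sample (roughly the 16 Hz image rate). The motion profile, the noise levels, the bias value, and the function name `simulate` are all assumptions chosen for illustration.

```python
import numpy as np

def simulate(loss_fraction, seed=0, duration=60.0, dt=0.01, cam_every=6):
    """Return (fused RMS error, direct-integration RMS error) in rad."""
    rng = np.random.default_rng(seed)
    n = int(duration / dt)
    t = np.arange(n) * dt
    true_angle = 0.3 * np.sin(0.5 * t)
    true_rate = 0.15 * np.cos(0.5 * t)
    gyro_sigma, cam_sigma, gyro_bias = 0.005, 0.02, 0.02   # assumed values
    gyro = true_rate + gyro_bias + gyro_sigma * rng.standard_normal(n)
    cam = true_angle + cam_sigma * rng.standard_normal(n)
    lost = rng.random(n) < loss_fraction                  # randomly dropped frames

    x = np.zeros(2)                                       # [angle, bias] estimate
    P = np.diag([0.1, 0.01])
    F = np.array([[1.0, -dt], [0.0, 1.0]])
    Q = np.diag([(gyro_sigma * dt) ** 2, 1e-10])
    H = np.array([[1.0, 0.0]])
    R = cam_sigma ** 2

    fused = np.empty(n)
    integrated = np.empty(n)
    angle_int = 0.0
    for k in range(n):
        # Measurement update only when a camera frame exists and was not lost.
        if k % cam_every == 0 and not lost[k]:
            S = (H @ P @ H.T)[0, 0] + R
            K = (P @ H.T)[:, 0] / S
            x = x + K * (cam[k] - x[0])
            P = (np.eye(2) - np.outer(K, H[0])) @ P
        fused[k] = x[0]
        integrated[k] = angle_int
        # Prediction with the bias-corrected gyro rate; plain integration for comparison.
        x = np.array([x[0] + (gyro[k] - x[1]) * dt, x[1]])
        P = F @ P @ F.T + Q
        angle_int += gyro[k] * dt

    def rms(e):
        return float(np.sqrt(np.mean(e ** 2)))

    return rms(fused - true_angle), rms(integrated - true_angle)

for loss in (0.0, 0.2, 0.5, 0.8):
    fused_rms, integrated_rms = simulate(loss)
    print(f"loss {loss:.0%}: fused {fused_rms:.4f} rad, integration {integrated_rms:.4f} rad")
```

In this toy setting the camera corrections keep the bias estimated, so the fused error stays far below that of plain integration even when many frames are dropped. With every frame dropped the filter reduces to direct integration.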
Besides the bicycle riding example, the proposed method could also be applicable to accurate pose estimation in other human exercises or human–machine interactions, such as car driving, kayaking, or flying an airplane. The proposed pose estimation could also potentially be used for monitoring the activities of elderly and disabled patients (whether or not on moving platforms) in a nonlaboratory environment. The system uses miniaturized wearable gyroscopes and collocated point markers, and these devices are easily attached to human body segments without significantly interfering with human activities in daily personal life. The setup procedure of the proposed system is simple since the gyroscope bias and noise levels and the camera calibration can be conveniently obtained before the system is used.
We presented a visual-inertial fusion scheme for human whole-body pose estimation in human–machine interactions with applications to bicycle riding. The fusion scheme was built on the measurements of a calibrated monocular camera and a set of gyroscopes attached to the rider's body segments. With two feature points on each upper-limb and one feature point on each thigh of the lower-limb, the EKF-based fusion scheme estimated the whole-body poses when the subjects rode a bicycle. The EKF design also incorporated the constraints from human anatomy and the physical rider–bicycle interactions. The performance of the pose estimation design was extensively tested with multiple subjects in both indoor and outdoor experiments. The results showed that the maximum errors for all joint angle estimations by the proposed scheme were within 3 deg and that this performance was similar to that of wearable IMUs for regular human activities. The proposed scheme provides a low-cost, high-accuracy, and wearable design for whole-body human pose estimation in human–machine interactions and can potentially be used in other types of human–machine interactions for outdoor or personal activities.
This work was supported in part by the U.S. National Science Foundation under Award Nos. CMMI-0954966 and CMMI-1334389, the Shanghai Eastern Scholarship Program through Shanghai University (J. Yi), the National Natural Science Foundation of China under Award No. 61403307 (Y. Zhang), and fellowships from Chinese Scholarship Council (K. Yu and X. Lu). The authors also thank K. Chen, M. Trkov, P. Wang, and other members at the Robotics, Automation, and Mechatronics (RAM) Lab at Rutgers University for their helpful discussions and experimental help.