Pose estimation in human–machine interactions such as bicycling plays an important role in understanding and studying human motor skills. In this paper, we report the development of a human whole-body pose estimation scheme with application to rider–bicycle interactions. The pose estimation scheme is built on the fusion of measurements from a monocular camera on the bicycle and a set of small wearable gyroscopes attached to the rider's upper-limbs, lower-limbs, and trunk. A single feature point is collocated with each wearable gyroscope, and additional feature points are placed on the body segment links where no gyroscope is attached. An extended Kalman filter (EKF) is designed to fuse the visual-inertial measurements and obtain drift-free whole-body poses. The pose estimation design also incorporates a set of constraints from human anatomy and the physical rider–bicycle interactions. The performance of the estimation design is validated through riding experiments with ten subjects. The results illustrate that the maximum errors of all joint angle estimates obtained by the proposed scheme are within 3 deg. The pose estimation scheme can be further extended and used in other types of physical human–machine interactions.

## Introduction

Whole-body pose or gait information is commonly used to study human movement science and biomechanics. Accurate and reliable human whole-body pose information benefits both clinical disease diagnosis and rehabilitation treatment. Most pose estimation studies target human activities such as walking, running, and standing. For human–machine interactions, such as bicycling, driving, kayaking, and riding Segways, pose or gait estimation plays an important role in understanding and studying human motor skills. However, it is not straightforward to estimate whole-body poses in these activities due to complex, machine-coupled dynamic movements, and few studies have been reported on pose estimation in these interactions.

Optical or magnetic marker-based human motion capture systems are commercially available and commonly used. However, in addition to their high cost, these motion capture systems are limited to laboratory use and are difficult to deploy for personal activities at home or in outdoor environments. Wearable sensors, such as small-size gyroscopes, accelerometers, and inertial measurement units (IMUs), can be attached to the human body to obtain motion, segment orientation, or position information [1,2]. It is well known that directly integrating the measurements of these sensors produces unacceptable results due to drifting noise. Integrating the measurements of these inertial sensors with complementary sensors (e.g., global positioning system, magnetometers, and ultrasonic sensors) is a common approach to enhance the accuracy of pose or position estimation.

We propose a visual-inertial fusion framework to simultaneously estimate the poses of the human whole body and the moving platform. Several attractive features motivate us to use bicycling as an application example to demonstrate the proposed pose estimation scheme. First, bicycling is a good example of the above-mentioned physical human–machine interactions. Sitting on the moving bicycle, the rider actively reacts to balance the unstable platform through whole-body movements. The multilimb and body movements on a moving bicycle provide a rich and challenging problem of identifying and extracting the human motion from the platform motion. Second, recent clinical studies show that bicycling helps diagnose or treat patients with postural balancing disorders [3], Parkinson's disease [4], and mental retardation [5]. These bicycle-based rehabilitation and clinical applications need accurate pose estimation in outdoor environments.

One of the major applications of visual-inertial fusion is robotic localization and navigation [6–11]. The popular fusion methods in these applications include the extended Kalman filter (EKF) and the unscented Kalman filter. Many existing developments focus on the estimation and calibration of a rigidly mounted camera-IMU pair for robot navigation and mapping applications [12–14]. Integration of visual and inertial measurements has also been applied to human motion tracking and localization. For example, in Ref. [15], measurements from an IMU mounted on the human body are fused with a camera located on the body or on off-body fixtures. In Ref. [16], a real-time hybrid visual-inertial fusion scheme is presented to track only the three-dimensional articulated human arm motion for home-based rehabilitation. To estimate human walking, monocular visual odometry is fused with IMU measurements and a human walking model in Ref. [17]. In Ref. [18], monocular video is combined with multibody dynamic models to estimate the lower-limb poses for real-time human tracking. The above-mentioned visual-inertial integrations consider either inertial frame-fixed cameras or rigidly connected camera-inertial sensor pairs. For an inertial frame-fixed camera, the workspace is confined and the setup is not suitable for outdoor use. Rigidly connected camera-inertial sensor pairs cannot be used as wearable sensors for pose estimation of human activities because they are neither lightweight nor small and may cause significant intrusive effects on human motions.

In applications such as the rider–bicycle interactions, we propose to mount the camera on the moving platform rather than rigidly fixing it to the inertial sensors. The monocular camera provides only relative distances between the body segments and the moving platform. The rationale for pairing the bicycle-mounted camera with the body-mounted gyroscopes is threefold. First, for whole-body human pose estimation in physical human–machine interactions such as bicycling, the attitudes of a large number of body segments need to be measured or estimated. Separating the camera from the inertial sensors allows a single camera to simultaneously observe the relative positions of multiple locations on the body segments. Second, without attaching a large number of cameras to the body segments, the small wearable sensors such as miniature gyroscopes and point markers are minimally intrusive to human movements. Finally, the setup procedure of the proposed system is simple and the cost of the system is low. The noncollocated camera-gyroscope approach is therefore attractive for human-in-the-loop or human–machine interaction applications.

In Refs. [19–22], the measurements from body-mounted IMUs and force sensors are fused to estimate the bicycle rider's trunk orientation. With the help of the force sensors, the absolute angles are indirectly captured and used to compensate for the drift in the trunk and bicycle angle estimates obtained from the inertial sensors. The limitation of such a fusion is that it is not applicable to complex structures formed by limb segments. In Ref. [23], a visual-inertial integration scheme is presented to estimate the real-time upper-limb pose when riding a bicycle. This paper extends the approaches and results in Ref. [23] in several aspects. First, we improve the visual-inertial fusion framework to estimate the orientations of the rider's whole-body segments using a simpler system configuration than that in Ref. [23]. Second, the feature point used in this work is a single dot and much simpler than the square marker used in Ref. [23]. This feature point simplification brings challenges to the visual-inertial fusion. Finally, we conduct extensive multiple-subject experiments to validate and demonstrate the design. This paper is an extension of the conference publications [23,24] with significant additional analyses and experiments. The proposed framework can be easily extended to other human–machine interaction activities (e.g., driving, kayaking) as long as they can be treated as similar human-moving platform systems.

The rest of the paper is organized as follows: In Sec. 2, we describe the system configuration and setup. Section 3 presents the visual measurements and gyroscope models. Section 4 discusses the visual-inertial fusion design. The experimental results and discussions are presented in Sec. 5. Finally, we summarize the concluding remarks in Sec. 6.

## Rider–Bicycle Systems and Configuration

### Rider–Bicycle Systems.

Figure 1(a) shows the instrumented bicycle and the wearable sensors. Two types of gyroscope units (both from Motion Sense Inc., Hangzhou, China) are used in the experiments. One tri-axial gyroscope unit (model 605) is mounted on the bicycle frame and another on the trunk, and four small tri-axial gyroscope units (model slimAHRS) are attached to the rider's forearms and thighs, as shown in Fig. 1(b). A small battery and a wireless transmitter module are integrated and packed with each wearable gyroscope unit. A feature dot attached to the outside of the gyroscope box provides the position information. Using a single dot as the feature point reduces the image processing time and enables real-time applications. A square marker around the feature point is used to provide the ground-truth poses for outdoor experiments [20]. A total of seven feature points are attached to the rider's body: four on the upper-limbs (on each arm, one on the forearm collocated with the gyroscope and one on the upper arm), one on the trunk, and two on the lower-limbs (one on each thigh) (see Fig. 2).

A commercial mountain bicycle was modified and equipped with various sensors. A high-resolution monocular camera (model Manta G-145 from Allied Vision Technologies, 1392 × 1040 pixels, 16 fps) is mounted on an extended rod rigidly connected to the bicycle frame. The image acquisition and processing are conducted on the onboard computer through a Gigabit Ethernet connection. The wireless receivers of the bicycle and wearable gyroscopes are connected to a real-time embedded system (NI cRIO 9082-RT) mounted on the rear rack. Two optical encoders (Grayhill, Series 63R, 256 lines) are mounted on the crank and the steering fork to measure their absolute angles, respectively. All sensor measurements are sampled at 100 Hz except the images, which are sampled at 16 Hz. The ground truth of both the rider and bicycle poses is obtained by an optical motion capture system (8 Bonita cameras from Vicon Inc., Oxford, UK) for indoor experiments. The data collections among the Vicon system, the cRIO, and the camera computer are synchronized through wireless network connections.

### System Configuration.

Figure 1(b) shows the configuration of the visual-inertial system. We define the inertial frame $N$ with the Z-axis pointing downward, the camera frame $C$, the image frame $I$, and the feature frames $F_i$, $i=1,\dots,7$. As shown in Fig. 1(b), $F_1$ and $F_2$ are the feature frames attached to the left upper-limb, $F_3$ and $F_4$ to the right upper-limb, $F_5$ to the trunk, and finally $F_6$ and $F_7$ to the left and right thighs, respectively.

We simultaneously estimate the joint angles of the upper- and lower-limbs, the trunk, and the bicycle roll angle. The upper-limb is modeled as a three-joint link: the wrist (three degrees of freedom (3DOF)), the elbow (1DOF), and the shoulder (3DOF). During bicycling, the hands are assumed to be firmly attached to the handlebar. Therefore, the limb–trunk system constitutes an articulated chain whose two ends are the handlebar and the seat, respectively. As long as the trunk angles, the wrist angles, and the elbow angles are known, the shoulder angles are uniquely determined. Therefore, in this study we neglect the shoulder angles and focus only on estimating the trunk, wrist, and elbow angles. The elbow is considered a 1DOF joint [25] with joint angle $\gamma_{ie}$, and the wrist a 3DOF joint whose orientation is given by the Euler angles $\theta_{iw}$, $\varphi_{iw}$, and $\phi_{iw}$, $i=l,r$, for the left and right limbs, respectively. Euler angles $\theta_h$, $\varphi_h$, and $\phi_h$ capture the 3DOF trunk orientation. Similarly, each lower-limb is considered a 5DOF articulated link: the hip (thigh) joint is a 3DOF joint whose orientation is captured by the Euler angles $(\theta_{il},\varphi_{il},\phi_{il})$, the knee is a 1DOF joint with angle $\gamma_{ik}$, and the ankle is a 1DOF joint with angle $\delta_{ia}$, $i=l,r$, for the left and right limbs, respectively. Figure 2 shows the details of the limb pose configurations.

Although the Euler angle representation is used, we can calculate the pronation–supination, radial–ulnar, and flexion–extension angles in the anatomical convention. Moreover, for each 3DOF joint (i.e., the wrist, trunk, and hip joints), we introduce the quaternion representation to avoid the potential singularity issue with the use of Euler angles. We define the quaternion coordinates $q_h=[q_{h0}\; q_{h1}\; q_{h2}\; q_{h3}]^T$, $q_{iw}=[q_{iw0}\; q_{iw1}\; q_{iw2}\; q_{iw3}]^T$, and $q_{il}=[q_{il0}\; q_{il1}\; q_{il2}\; q_{il3}]^T$ for the trunk, wrist, and thigh joints, $i=l,r$, for the left and right limbs, respectively. For presentation convenience, we also define $x_h=q_h$, $x_{iw}=q_{iw}$, $x_e=[\gamma_{le}\; \gamma_{re}]^T$, and $x_{il}=[q_{il}^T\; \gamma_{ik}\; \delta_{ia}]^T$, and the state variable for the EKF design is defined as
$X(t)=[\varphi_b\;\; x_h^T\;\; x_{lw}^T\;\; x_{rw}^T\;\; x_{ll}^T\;\; x_{rl}^T\;\; x_e^T\;\; \dot{x}_e^T]^T\in\mathbb{R}^{29}$
(1)

The pose estimation problem is to obtain a real-time estimate of $X(t)$ during dynamic bicycle riding. The estimation scheme uses the measurements from the six gyroscopes on the bicycle and the human body and the camera images of the seven single-dot feature points.

## Vision Measurements and Gyroscope Kinematic Models

### Vision-Based Measurements.

Figure 1(a) shows the feature markers on the human body. We treat the center of the black dot as the location of the feature point. Five of these markers are collocated with the five body-mounted gyroscopes (see Fig. 1(b)). The feature marker is segmented from the image by thresholding. Similar to the approach in Ref. [23], the Otsu algorithm [26] is chosen to calculate the optimal separating threshold by minimizing the intraclass variance of the two classes (i.e., foreground and background pixels) to distinguish the markers from the environment. The blob analysis method [27] is then used to extract the image features and to generate the size and position of the target. A run-length encoding method is used to preprocess the image before the blob analysis to quickly locate the features.
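As a concrete illustration of this two-step segmentation, the following Python sketch implements an Otsu threshold and a centroid extraction in numpy. It is a simplification: the function and variable names are ours, and the actual pipeline additionally uses run-length encoding and blob-size filtering.

```python
import numpy as np

def otsu_threshold(img):
    """Return the gray level that minimizes the intraclass variance
    (equivalently, maximizes the between-class variance) of an 8-bit image."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                    # class-0 (background) probability
    mu = np.cumsum(p * np.arange(256))      # class-0 cumulative mean
    mu_t = mu[-1]                           # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b2 = np.nan_to_num(sigma_b2)      # discard degenerate splits
    return int(np.argmax(sigma_b2))

def dark_dot_centroid(img):
    """Segment the dark feature dot and return its pixel centroid (u, v)."""
    t = otsu_threshold(img)
    v, u = np.nonzero(img <= t)             # dark pixels are the foreground
    return u.mean(), v.mean()
```

For a synthetic image with a dark square dot on a bright background, `dark_dot_centroid` recovers the dot center to sub-pixel accuracy.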

A pinhole model is considered for the monocular camera as follows:
$[u\;\; v\;\; 1]^T=K_I\,[x_c/z_c\;\; y_c/z_c\;\; 1]^T$
(2)
where $(u, v)$ in $I$ is the image of point $(x_c,y_c,z_c)$ in $C$, and $K_I$ is the camera intrinsic parameter matrix obtained through the calibration process. We denote the transformation matrix between $C$ and $F_i$ as ${}^{C}_{F_i}T$, $i=1,\dots,7$. For the $i$th feature point $(x_{fi},y_{fi},z_{fi})$, its coordinates $(x_{ci},y_{ci},z_{ci})$ in $C$ satisfy
$[x_{ci}\;\; y_{ci}\;\; z_{ci}\;\; 1]^T={}^{C}_{F_i}T\,[x_{fi}\;\; y_{fi}\;\; z_{fi}\;\; 1]^T$
(3)
where ${}^{C}_{F_i}T$ is further decomposed as
${}^{C}_{F_i}T={}^{C}_{B}T\;{}^{B}_{F_i}T$
(4)

where frame $B$ is attached to the bicycle. For the feature frames on the upper-limb, $B$ is chosen as the handlebar frame, and for the feature frames on the trunk and the lower-limb, $B$ is attached to the bicycle seat.

Using Eqs. (2)–(4) and setting the feature point as the origin of its own frame, i.e., $x_{fi}=y_{fi}=z_{fi}=0$, we obtain
$z_{ci}\,[u_i\;\; v_i\;\; 1]^T=K_I\left({}^{C}_{B}R\,{}^{B}_{F_i}p+{}^{C}_{B}p\right)=:I_i(X)\in\mathbb{R}^3$
(5)
where ${}^{C}_{B}p$ is a constant vector determined by the camera mounting parameters. Both ${}^{C}_{B}R$ and ${}^{B}_{F_i}p$ are functions of the attitude of feature frame $F_i$, i.e., of $X$. From Eq. (5), we have
$u_i=I_{i1}(X)/I_{i3}(X),\;\; v_i=I_{i2}(X)/I_{i3}(X),\;\; i=1,\dots,7$
where $I_{ij}(X)$, $j=1,2,3$, is the $j$th element of $I_i(X)$. The image positions of all seven feature points are concatenated as
$h(X)=[u_1\;\; v_1\;\; \cdots\;\; u_7\;\; v_7]^T$
(6)
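The projection in Eq. (5) is simple to evaluate once the rotation and translation terms are available. A minimal Python sketch follows; the numeric intrinsic values and frame parameters are hypothetical placeholders, not the calibrated values of the actual camera.

```python
import numpy as np

def project_feature(K_I, R_cb, p_bf, p_cb):
    """Eq. (5): z_c [u, v, 1]^T = K_I (R_cb @ p_bf + p_cb).
    Returns the pixel coordinates (u, v) of one feature point."""
    I_i = K_I @ (R_cb @ p_bf + p_cb)
    return I_i[0] / I_i[2], I_i[1] / I_i[2]

# Hypothetical intrinsics for a 1392 x 1040 sensor, and a feature point
# 2 m in front of the camera, slightly off the optical axis.
K_I = np.array([[800.0,   0.0, 696.0],
                [  0.0, 800.0, 520.0],
                [  0.0,   0.0,   1.0]])
u, v = project_feature(K_I, np.eye(3), np.array([0.1, 0.2, 2.0]), np.zeros(3))
```

Stacking seven such $(u_i, v_i)$ pairs produces the measurement vector $h(X)$ of Eq. (6).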

### Gyroscope Kinematic Models.

For the bicycle gyroscope, we have the following relationship for the measurements $\omega_b$ [20]:
$\omega_b=R_y^T(\delta)R_x^T(\varphi_b)[0\;\; 0\;\; \dot\psi]^T+R_y^T(\delta)[\dot\varphi_b\;\; 0\;\; 0]^T$
(7)
where $R_i(\vartheta)$ represents the three-dimensional rotation matrix about the $i$-axis by angle $\vartheta$, $i=x,y,z$, and $\delta$ is the angle between the bicycle gyroscope and the horizontal plane (see Fig. 1(b)). Similarly, the forearm gyroscope measurements $\omega_{ia}$, $i=l,r$, for the left and right arms are written as
$\omega_{ia}=[\dot\phi_{iw}\;\; 0\;\; 0]^T+R_x^T(\phi_{iw})[0\;\; \dot\varphi_{iw}\;\; 0]^T+R_x^T(\phi_{iw})R_y^T(\varphi_{iw})[0\;\; 0\;\; \dot\theta_{iw}]^T+R_x^T(\phi_{iw})R_y^T(\varphi_{iw})R_z^T(\theta_{iw})[0\;\; 0\;\; \dot\xi]^T+R_x^T(\phi_{iw})R_y^T(\varphi_{iw})R_z^T(\theta_{iw})R_z^T(\beta)R_y^T(\xi)[\dot\varphi_b\;\; 0\;\; 0]^T+R_x^T(\phi_{iw})R_y^T(\varphi_{iw})R_z^T(\theta_{iw})R_z^T(\beta)R_y^T(\xi)R_x^T(\varphi_b)[0\;\; 0\;\; \dot\psi]^T$
(8)
where $\beta$ and $\xi$ are the bicycle caster and steering angles, respectively. The trunk gyroscope measurements $\omega_h$ are written as
$\omega_h=R_x^T(\theta_h)R_y^T(\varphi_h)[0\;\; 0\;\; \dot\phi_h]^T+R_x^T(\theta_h)[0\;\; \dot\varphi_h\;\; 0]^T+[\dot\theta_h\;\; 0\;\; 0]^T+R_x^T(\theta_h)R_y^T(\varphi_h)R_z^T(\phi_h)[\dot\varphi_b\;\; 0\;\; 0]^T+R_x^T(\theta_h)R_y^T(\varphi_h)R_z^T(\phi_h)R_x^T(\varphi_b)[0\;\; 0\;\; \dot\psi]^T$
(9)
and the lower-limb gyroscope measurements $\omega_{il}$, $i=l,r$, as
$\omega_{il}=R_x^T(\theta_{il})R_y^T(\varphi_{il})[0\;\; 0\;\; \dot\phi_{il}]^T+R_x^T(\theta_{il})[0\;\; \dot\varphi_{il}\;\; 0]^T+[\dot\theta_{il}\;\; 0\;\; 0]^T+R_x^T(\theta_{il})R_y^T(\varphi_{il})R_z^T(\phi_{il})[\dot\varphi_b\;\; 0\;\; 0]^T+R_x^T(\theta_{il})R_y^T(\varphi_{il})R_z^T(\phi_{il})R_x^T(\varphi_b)[0\;\; 0\;\; \dot\psi]^T$
(10)
The kinematic equations in Eqs. (8)–(10) give the relationship between the attitude rates and the gyroscope measurements through the Euler angle representation. It is desirable to express these kinematic relationships in the quaternion representation, which is achieved through the transformation between the Euler angle and quaternion representations. For example, the Euler angles of the wrist are written as
$\begin{bmatrix}\phi_{iw}\\ \varphi_{iw}\\ \theta_{iw}\end{bmatrix}=\begin{bmatrix}\arctan\dfrac{2(q_{iw0}q_{iw1}+q_{iw2}q_{iw3})}{1-2(q_{iw1}^2+q_{iw2}^2)}\\ \arcsin 2(q_{iw0}q_{iw2}-q_{iw1}q_{iw3})\\ \arctan\dfrac{2(q_{iw0}q_{iw3}+q_{iw1}q_{iw2})}{1-2(q_{iw2}^2+q_{iw3}^2)}\end{bmatrix}$
(11)
Taking the time derivative of Eq. (11), we obtain
$[\dot\phi_{iw}\;\; \dot\varphi_{iw}\;\; \dot\theta_{iw}]^T=[f_1^{iw}(x_{iw},\dot x_{iw})\;\; f_2^{iw}(x_{iw},\dot x_{iw})\;\; f_3^{iw}(x_{iw},\dot x_{iw})]^T,\;\; i=l,r$
(12)

Due to the length of the functions in Eq. (12), we omit their explicit forms. Plugging Eqs. (11) and (12) into Eq. (8), we obtain the kinematic relationship among $\dot x_{iw}$, $x_{iw}$, and $\omega_{ia}$. Similar relationships are obtained for $x_h$ and $q_{il}$ with the gyroscope measurements $\omega_h$ and $\omega_{il}$. These kinematic relationships will be used in the EKF design in Sec. 4.2.
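Equation (11) is the standard quaternion-to-Euler conversion; a direct transcription in Python (the helper name is ours) reads:

```python
import numpy as np

def wrist_euler(q):
    """Eq. (11): Euler angles (phi, varphi, theta) of the wrist from the
    unit quaternion q = [q0, q1, q2, q3]."""
    q0, q1, q2, q3 = q
    phi = np.arctan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1**2 + q2**2))
    varphi = np.arcsin(2.0 * (q0 * q2 - q1 * q3))
    theta = np.arctan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2**2 + q3**2))
    return phi, varphi, theta
```

For a pure rotation of 0.6 rad about the x-axis, $q=[\cos 0.3,\ \sin 0.3,\ 0,\ 0]$, the function returns $(0.6, 0, 0)$, as expected.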

## Rider–Bicycle Interactions Constraints and Extended Kalman Filter-Based Pose Estimation

In this section, we first present the constraints imposed by the physical rider–bicycle interactions and then discuss the EKF-based whole-body pose estimation design.

### Human Anatomy and Rider–Bicycle Interaction Constraints

#### Lower-Limb Pedaling Motion Constraint.

Let us first consider the lower-limb constraints due to pedaling [28]. Figure 2(b) illustrates the lower-limb pose during pedaling. The hip joint, the knee joint, and the ankle joint are denoted as A, B, and C, respectively. We also denote the foot–pedal contact point as D. Without causing confusion, we use A, B, C, and D to indicate the corresponding frames attached to hip, knee, ankle, and pedal joints. It is then straightforward to obtain that the homogeneous transformation from D to A is
${}^{A}_{D}T_i=\begin{bmatrix}{}^{A}_{B}R_i&\rho_{Bi}\\ 0&1\end{bmatrix}\begin{bmatrix}{}^{B}_{C}R_i&\rho_{Ci}\\ 0&1\end{bmatrix}\begin{bmatrix}{}^{C}_{D}R_i&\rho_{Di}\\ 0&1\end{bmatrix}$
(13)
for $i=l,r$, where ${}^{A}_{B}R_i=R_x^T(\theta_{il})R_y^T(\varphi_{il})R_z^T(\phi_{il})$, ${}^{B}_{C}R_i=R_y(\gamma_{ik})$, ${}^{C}_{D}R_i=R_y(\delta_{ia})$, $\rho_{Bi}=[0\;\; 0\;\; l_{ilu}]^T$, $\rho_{Ci}=[0\;\; 0\;\; l_{ill}]^T$, $\rho_{Di}=[0\;\; 0\;\; l_{if}]^T$, and $l_{ilu}$, $l_{ill}$, and $l_{if}$ are the lengths of the segment links of the lower-limb. From Eq. (13), we obtain the position vector ${}^{A}r_{Di}$ of point D
${}^{A}r_{Di}={}^{A}_{B}R_i\,{}^{B}_{C}R_i\,\rho_{Di}+{}^{A}_{B}R_i\,\rho_{Ci}+\rho_{Bi},\;\; i=l,r$
(14)

in the hip frame as a function of the quaternion coordinates of the lower-limb $x_{il}$ (through a transformation similar to Eq. (11)).

On the other hand, when pedaling, the feet are always on the pedals and the position of D can be obtained through the crank angle $\varphi_c(t)$ (measured by the crank encoder). Thus, we obtain
${}^{A}r_{Di}=r_O+r_{ODi}(\varphi_c(t)),\;\; i=l,r$
(15)
where the constant vector $r_O$ represents the location of the crank center O in the hip frame, and $r_{ODi}(\varphi_c(t))=[\pm r_c\cos\varphi_c(t)\;\; 0\;\; r_c\sin\varphi_c(t)]^T$ (positive for $i=l$ and negative for $i=r$), with $r_c$ the crank radius. Combining Eqs. (14) and (15), we have
$c_{1i}^C(q_{il},\gamma_{ik},\delta_{ia}):={}^{A}_{B}R_i\,{}^{B}_{C}R_i\,\rho_{Di}+{}^{A}_{B}R_i\,\rho_{Ci}=d_i(t)$
(16)

where $d_i(t):=r_O-\rho_{Bi}+r_{ODi}(\varphi_c(t))$ is a function of the crank angle $\varphi_c(t)$.
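The two sides of constraint (16) are straightforward to evaluate numerically. The Python sketch below (with made-up link lengths and crank parameters) computes the lower-limb forward kinematics of Eq. (14) and the pedal position of Eq. (15); their difference is the residual that the EKF output equation drives to zero.

```python
import numpy as np

def Rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def foot_in_hip_frame(theta, varphi, phi, gamma_k, l_u, l_l, l_f):
    """Eq. (14): position of the foot-pedal contact point D in the hip
    frame A. Note that the ankle rotation does not move the origin of D."""
    R_ab = Rx(theta).T @ Ry(varphi).T @ Rz(phi).T
    rho_B = np.array([0.0, 0.0, l_u])
    rho_C = np.array([0.0, 0.0, l_l])
    rho_D = np.array([0.0, 0.0, l_f])
    return R_ab @ Ry(gamma_k) @ rho_D + R_ab @ rho_C + rho_B

def pedal_in_hip_frame(phi_c, r_O, r_c, left=True):
    """Eq. (15): the same point located through the crank angle phi_c."""
    sign = 1.0 if left else -1.0
    return r_O + np.array([sign * r_c * np.cos(phi_c), 0.0, r_c * np.sin(phi_c)])
```

With all joint angles set to zero, Eq. (14) collapses to the straight-leg position $[0\;\; 0\;\; l_u+l_l+l_f]^T$, a quick sanity check of the implementation.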

#### Upper-Limb Anatomical Constraints.

The second set of constraints comes from the location of the rider's neck E as shown in Fig. 1(b). We obtain the position vector of E, denoted as ${}^{A}r_E(x_{iw},x_e)$ (only the state variables in the function arguments are written explicitly), through the handlebar and upper-limb linkage from the bicycle seat. On the other hand, we calculate the position vector of E in the trunk frame as
${}^{A}r_E(x_h)=R(x_h)\rho_E$
where $R(x_h)$ is the rotation matrix of quaternion $x_h$ and the constant vector $\rho_E=[0\;\; l_t\;\; 0]^T$ is the relative position of E in the trunk frame (see Fig. 1(b)). Therefore, the second set of constraints is
$c_2^C(x_{iw},x_e,x_h):={}^{A}r_E(x_{iw},x_e)=R(x_h)\rho_E$
(17)
The third constraint comes from human anatomy: the shoulder length, denoted as $l_s$, is constant. We calculate the two shoulder points from the left and right elbows, and thus express this constraint as
$c_3^C=c_3(x_e)=l_s$
(18)

#### Quaternion Coordinate Constraint.

The last set of constraints is the unity of the quaternion representations for the wrist, trunk, and thigh joints, i.e.,
$c_4^C:\;\; \|x_h\|^2=\|x_{iw}\|^2=\|q_{il}\|^2=1,\;\; i=l,r$
(19)

In summary, a total of 15 constraints are obtained, i.e., $c_{1i}^C(q_{il},\gamma_{ik},\delta_{ia})\in\mathbb{R}^3$ for $i=l,r$, $c_2^C(x_{iw},x_e,x_h)\in\mathbb{R}^3$, $c_3^C\in\mathbb{R}$, and $c_4^C\in\mathbb{R}^5$.

### Extended Kalman Filter-Based Pose Estimation.

From Eq. (7), we solve for $\dot\varphi_b$ and obtain
$\dot\varphi_b=g_b(\omega_b)=[\cos\delta\;\; 0\;\; \sin\delta]\,\omega_b$
(20)
as the first state equation. Similarly, from Eqs. (8), (11), and (12), we solve for $\dot x_{lw}$ and $\dot x_{rw}$ and obtain (the closed-form expressions in $X$ are omitted here due to their length)
$\dot x_{lw}=g_{lw}(X,\omega_{la},\omega_b),\;\; \dot x_{rw}=g_{rw}(X,\omega_{ra},\omega_b)$
(21)
as the second set of state equations. From Eq. (9), we solve for $\dot x_h$ (through transformations (11) and (12)) and obtain
$\dot x_h=g_h(X,\omega_h,\omega_b)$
(22)

as the third set of state equations.
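The roll-rate extraction in Eq. (20) can be checked numerically: multiplying the gyroscope model of Eq. (7) by the row vector $[\cos\delta\;\; 0\;\; \sin\delta]$ cancels the yaw-rate term exactly and returns $\dot\varphi_b$. A short Python sketch (the helper names are ours):

```python
import numpy as np

def Rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def bicycle_gyro(phi_b, dphi_b, dpsi, delta):
    """Eq. (7): gyro reading from the roll angle/rate and the yaw rate."""
    return (Ry(delta).T @ Rx(phi_b).T @ np.array([0.0, 0.0, dpsi])
            + Ry(delta).T @ np.array([dphi_b, 0.0, 0.0]))

def roll_rate(omega_b, delta):
    """Eq. (20): g_b(omega_b) = [cos(delta)  0  sin(delta)] omega_b."""
    return np.array([np.cos(delta), 0.0, np.sin(delta)]) @ omega_b
```

Feeding a simulated reading from Eq. (7) through Eq. (20) recovers the roll rate regardless of the yaw rate, since $[\cos\delta\;\; 0\;\; \sin\delta]R_y^T(\delta)=[1\;\; 0\;\; 0]$ and the first row of $R_x^T(\varphi_b)$ has a zero third entry.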

Since only one gyroscope is attached to each lower-limb, unlike the wrist, trunk, or thigh joints, the knee angle rate $\dot\gamma_{ik}$ and the ankle angle rate $\dot\delta_{ia}$ cannot be written in a form similar to the kinematic relationships in Eqs. (8)–(10). To overcome this measurement deficiency and to include $\gamma_{ik}$ and $\delta_{ia}$ in the EKF state equation, we take the time derivative of Eq. (16) and obtain
$\dfrac{\partial c_{1i}^C}{\partial q_{il}}\dot q_{il}+\dfrac{\partial c_{1i}^C}{\partial \gamma_{ik}}\dot\gamma_{ik}+\dfrac{\partial c_{1i}^C}{\partial \delta_{ia}}\dot\delta_{ia}=\dot d_i(t)$
(23)
Equation (23) contains three scalar equations; combining them with Eq. (10), we solve for $\dot x_{il}=[\dot q_{il}^T\;\; \dot\gamma_{ik}\;\; \dot\delta_{ia}]^T$ and obtain
$\dot x_{ll}=g_{ll}(X,\omega_{ll},\omega_b),\;\; \dot x_{rl}=g_{rl}(X,\omega_{rl},\omega_b)$
(24)
We form the state function $f(X(t),\Omega(t))=[g_b\;\; g_{lw}^T\;\; g_{rw}^T\;\; g_{ll}^T\;\; g_{rl}^T\;\; g_h^T]^T\in\mathbb{R}^{29}$, where the measurements are $\Omega(t)=[\omega_b^T\;\; \omega_{la}^T\;\; \omega_{ra}^T\;\; \omega_h^T\;\; \omega_{ll}^T\;\; \omega_{rl}^T\;\; \dot d^T(t)\;\; \xi(t)]^T\in\mathbb{R}^{22}$. We rewrite the EKF state equation in discrete time as
$X(k)=X(k-1)+\Delta T\,f(X(k-1),\Omega(k-1))+w(k)$
(25)
at the $k$th sampling time, where $\Delta T=0.01$ s is the data sampling period and $w(k)\sim N(0,\sigma_u)$ is mean-zero white noise with variance matrix $\sigma_u$. The EKF outputs use the vision-based measurements in Eq. (6) and the constraints discussed in Sec. 4.1, namely,
$Y(k)=\begin{bmatrix}h(X(k))\\ c_{1l}^C(X(k))\\ c_{1r}^C(X(k))\\ c_2^C(X(k))\\ c_3^C(X(k))\\ c_4^C(X(k))\end{bmatrix}+n_Y(k)=h_Y(X(k))+n_Y(k)$
where $Y(k)\in\mathbb{R}^{29}$ and $n_Y(k)\sim N(0,\sigma_Y)$ is a white noise vector with variance matrix $\sigma_Y$. We obtain the Jacobian matrices $F(k)=I_n+\Delta T\,\partial f/\partial X|_{X(k),\Omega(k)}$ for the state dynamics and $H(k)=\partial h_Y/\partial X|_{X(k)}$ for the outputs. The EKF is calculated as [29]
$\hat X(k|k-1)=\hat X(k-1|k-1)+\Delta T\,f(\hat X(k-1|k-1),\Omega(k-1))$
$P(k|k-1)=F(k)P(k-1|k-1)F^T(k)+\sigma_u$
$S(k)=H(k)P(k|k-1)H^T(k)+\sigma_Y,\;\; W(k)=P(k|k-1)H^T(k)S^{-1}(k)$
$\hat X(k|k)=\hat X(k|k-1)+W(k)\left[Y(k)-h_Y(\hat X(k|k-1))\right]$
$P(k|k)=(I_n-W(k)H(k))P(k|k-1)$

The observability condition is checked for the EKF design (using matrices F and H) and the rank condition is satisfied.
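The recursion is the standard EKF predict/update cycle with an Euler-discretized state equation. A generic, problem-independent numpy sketch (the function and argument names are ours) is:

```python
import numpy as np

def ekf_step(x, P, omega, y, f, h_Y, jac_f, jac_hY, dT, sigma_u, sigma_Y):
    """One EKF cycle: Euler-discretized prediction followed by the
    measurement update, mirroring the recursion in the text."""
    n = len(x)
    # Prediction with the discretized state equation X(k) = X(k-1) + dT f(.)
    x_pred = x + dT * f(x, omega)
    F = np.eye(n) + dT * jac_f(x, omega)
    P_pred = F @ P @ F.T + sigma_u
    # Measurement update with output function h_Y and its Jacobian H
    H = jac_hY(x_pred)
    S = H @ P_pred @ H.T + sigma_Y
    W = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + W @ (y - h_Y(x_pred))
    P_new = (np.eye(n) - W @ H) @ P_pred
    return x_new, P_new
```

On a trivial one-dimensional example with a direct observation, one step pulls the estimate toward the measurement and shrinks the covariance, as expected of the update.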

Remark 1. The EKF design contains the bicycle roll kinematics (20) and uses measurements from the bicycle gyroscope and the wearable body gyroscopes. None of these sensors gives absolute bicycle roll angle information. As shown in Sec. 5, it is interesting that the EKF-based sensor fusion maintains a bounded estimation error for the bicycle roll angle, whereas direct integration of the bicycle gyroscope measurements results in a diverging estimate.

Remark 2. In the EKF design, we do not consider any noise models for the gyroscope measurements. As demonstrated in Ref. [20], the inclusion of a first-order colored noise model could further reduce the estimation errors. However, the resulting improvement in estimation performance is marginal, and therefore we do not present the noise model here.

## Experiments

### Experimental Results.

We conduct the human riding experiments both in an indoor laboratory, as shown in Fig. 1(a), and in an outdoor setting. Due to the space constraint, the subjects ride the bicycle along a circular trajectory with a radius of 2 m at a speed of around 1–1.5 m/s in the indoor environment. In the outdoor riding experiments, the subjects ride along a larger circle with a radius of around 7 m at a speed of around 2.5–3 m/s.

The physical parameters of the bicycle are as follows: $\beta=16.7$ deg, $\delta=10$ deg, and $r_c=0.17$ m. We recruited ten healthy and experienced bicycle riders (eight male and two female; age: 27 ± 3 years, height: 176 ± 4 cm, and weight: 70 ± 7 kg) to conduct the experiments. The duration of each riding experiment run was around 2 min. Before conducting the experiments, we measured all subjects' biomechanical parameters and the locations of the gyroscopes and feature markers on their body segments for the pose calculations. All subjects signed informed consent under a protocol approved by the Institutional Review Board (IRB) at Rutgers University. In the following, we first describe the indoor experimental results and then present the outdoor experiments.

Figure 3 shows the experimental results of the pose estimation for one subject. We present the Euler angle representation for better visualization. For comparison purposes, we also plot the pose estimates obtained by directly integrating the strapdown gyroscope measurements. Figures 3(a) and 3(b) show the pose estimates for the right upper-limb, namely the wrist angles and the elbow angle, Fig. 3(c) shows the trunk pose estimates, and Figs. 3(d) and 3(e) show the right lower-limb pose estimates, including the thigh Euler angles, the knee angle, and the ankle angle. Finally, Fig. 3(f) shows the estimates of the bicycle roll angle. The results in Fig. 3 clearly demonstrate that the visual-inertial fusion results closely follow the ground truth of the rider and bicycle poses. The fusion scheme clearly outperforms the direct integration of the strapdown gyroscope measurements, which drifts over time with errors that increase dramatically after 25 s. The corresponding estimation error comparisons are shown in Fig. 4. The maximum errors of all joint angle estimates by the visual-inertial fusion are within 2–3 deg. Compared with other fusion-based pose estimation designs for regular human activities, the proposed approach for human–machine interactions achieves a comparable accuracy level. For example, mean root-mean-square (RMS) orientation estimation errors of 2.4–3.2 deg for different segments are reported using the fusion of inertial and magnetic sensors in Ref. [30], and a 2.8 deg mean RMS error is reported by fusing accelerometers and gyroscopes in Ref. [31].

To further demonstrate the performance of the EKF-based design for indoor testing, we compute the statistics of the pose estimation errors for all ten subjects. Figure 5 shows the calculated statistics of the estimation errors of all limb and trunk joint angles over time. We plot the estimation error statistics obtained from both the visual-inertial fusion and the gyroscope integration schemes. It is clearly observed that for all subject runs, the estimation errors of the EKF fusion are near zero. Table 1 further lists the mean and one standard deviation (SD) of the RMS errors for all subjects. The third row in Table 1 shows the EKF estimation errors for the upper-limb pose without using constraints $c_2^C$ and $c_3^C$. It is interesting to observe that incorporating $c_2^C$ and $c_3^C$ into the EKF slightly improves the estimation performance. The results in Table 1 and Fig. 5 confirm the consistently superior performance of the EKF-based pose estimation over the direct integration of gyroscope measurements.

In Fig. 4(b), no integration results are plotted because only one gyroscope is attached to each forearm and there are no gyroscope measurements for the upper arm. For the lower-limb, although only one gyroscope is attached to each thigh, we still obtain the knee and ankle angles accurately through the gyroscope measurements and constraint (16). Clearly, these constraints enable the design to use a small number of wearable gyroscopes. The pose estimation in the outdoor experiments shows results consistent with the above-discussed indoor experiments. Figure 6 shows the rider pose and bicycle roll angle estimation. The rider pose estimates in Figs. 6(a)–6(c) demonstrate that the EKF-based scheme produces accurate estimation results with performance comparable to that of the indoor experiments. The bicycle roll angle estimates in Fig. 6(d) also demonstrate superior performance compared with the IMU integration.

Figure 7 shows the time trajectories of the mean and one standard deviation of the pose estimation errors of the major body segments and the trunk for the ten bicycle riders in the outdoor experiments. From the plots, we clearly observe that within the roughly 50 s riding experiments, the estimation errors of all pose angles by the EKF-based fusion scheme are around zero, and the one-SD values are mostly within a range of 2 deg. Therefore, the proposed visual-inertial fusion scheme can be used to estimate the human body motion in outdoor bicycle riding. On the other hand, we also notice in Fig. 7 that the SDs of most joint angles slowly grow over time, though the growth is much smaller than that of the direct integration of the gyroscope signals (see the comparison results in Fig. 4). The slow growth of the estimation errors is primarily due to the vision estimation errors.

### Discussions.

The main contribution of this work is the development of a low-cost, wearable visual-inertial design for whole-body human pose estimation in physical human–machine interactions with a moving, dynamic platform. Compared with commercially available inertial whole-body sensor systems (e.g., Ref. [2]), our approach has several attractive features. The integration of the gyroscopes with the visual measurements and the physical constraints is robust and reliable, while the accuracy of the magnetometers used in those sensor systems [2] is potentially vulnerable to highly dynamic motion and environmental disturbances, such as nearby metallic objects. Additionally, we focus on using the least possible measurement information while maintaining consistent estimation performance for all joint angles. Finally, the cost of the wearable sensors (only gyroscopes and one camera) in our approach is much lower than that of most commercial whole-body sensor systems.

One potential limitation of using visual measurements in practice is obtaining high-quality, reliable images, especially for rapid human movement. Image occlusion, features leaving the camera's field of view, and blurred images reduce the vision-based pose estimation accuracy. Figure 8 shows the estimation errors of the EKF fusion results for the right wrist angles under a random loss of camera images for all ten subjects. We calculated the EKF estimation errors by randomly dropping image frames at a given loss percentage. With increasing percentages of lost camera images, the EKF fusion performance deteriorates. Table 2 lists the statistics of the RMS pose estimation errors for each limb and the trunk among all ten subjects. Clearly, these results confirm that the performance of the visual-inertial fusion scheme is robust to a certain amount of vision frame loss. For example, with a 20% image loss, the average estimation error of each limb is still less than 2.5 deg.

Besides the bicycle riding example, the proposed method is also applicable to accurate pose estimation in other human exercises or human–machine interactions, such as car driving, kayaking, or flying an airplane. It could also potentially be used for monitoring the activities of elderly and disabled patients (on or off moving platforms) in a nonlaboratory environment. The system uses miniaturized wearable gyroscopes and collocated point markers, and these devices are easily attached to human body segments without significantly interfering with daily personal activities. The setup procedure is simple, since the gyroscope bias and noise levels as well as the camera calibration can be conveniently obtained before using the system.

## Conclusion

We presented a visual-inertial fusion scheme for human whole-body pose estimation in human–machine interactions, with application to bicycle riding. The fusion scheme was built on the measurements of a calibrated monocular camera and a set of gyroscopes attached to the rider's body segments. With two feature points on each segment of the upper limbs and one feature point on each thigh of the lower limbs, the EKF-based fusion scheme estimated the whole-body poses while the subjects rode a bicycle. The EKF design also incorporated constraints from human anatomy and the physical rider–bicycle interactions. The performance of the pose estimation design was extensively tested with multiple subjects in both indoor and outdoor experiments. The results showed that the maximum errors of all joint angle estimations by the proposed scheme were within 3 deg, a performance similar to that of wearable IMUs for regular human activities. The proposed scheme provides a low-cost, high-accuracy, wearable design for whole-body human pose estimation and can potentially be used in other types of human–machine interactions and in outdoor or personal activities.

## Acknowledgment

This work was supported in part by the U.S. National Science Foundation under Award Nos. CMMI-0954966 and CMMI-1334389, the Shanghai Eastern Scholarship Program through Shanghai University (J. Yi), the National Natural Science Foundation of China under Award No. 61403307 (Y. Zhang), and fellowships from the China Scholarship Council (K. Yu and X. Lu). The authors also thank K. Chen, M. Trkov, P. Wang, and other members of the Robotics, Automation, and Mechatronics (RAM) Lab at Rutgers University for their helpful discussions and experimental help.

## References

1. Bonato, P., 2010, "Wearable Sensors and Systems," IEEE Eng. Med. Biol. Mag., 29(3), pp. 25–36.
2. Xsens, 2016, "Xsens," Xsens, Enschede, The Netherlands, accessed Oct. 15, 2016, http://www.xsens.com
3. Song, C.-G., Kim, J.-Y., and Kim, N.-G., 2004, "A New Postural Balance Control System for Rehabilitation Training Based on Virtual Cycling," IEEE Trans. Inform. Technol. Biomed., 8(2), pp. 200–207.
4. Aerts, M. B., Abdo, W. F., and Bloem, B. R., 2011, "The 'Bicycle Sign' for Atypical Parkinsonism," Lancet, 377(9760), pp. 125–126.
5. Burt, T. L., Porretta, D. L., and Klein, R. E., 2007, "Use of Adapted Bicycles on the Learning of Conventional Cycling by Children With Mental Retardation," Edu. Train. Dev. Disabil., 42(3), pp. 364–379.
6. You, S., Neumann, U., and Azuma, R., 1999, "Hybrid Inertial and Vision Tracking for Augmented Reality Registration," Virtual Reality Conference, Houston, TX, Mar. 13–17, pp. 260–267.
7. Strelow, D., and Singh, S., 2004, "Motion Estimation From Image and Inertial Measurements," Int. J. Rob. Res., 23(12), pp. 1157–1195.
8. Lobo, J., and Dias, J., 2003, "Vision and Inertial Sensor Cooperation Using Gravity as a Vertical Reference," IEEE Trans. Pattern Anal. Mach. Intell., 25(12), pp. 1597–1608.
9. Jones, E. S., and Soatto, S., 2011, "Visual-Inertial Navigation, Mapping and Localization: A Scalable Real-Time Causal Approach," Int. J. Rob. Res., 30(4), pp. 407–430.
10. Lupton, T., and Sukkarieh, S., 2012, "Visual-Inertial-Aided Navigation for High-Dynamic Motion in Built Environments Without Initial Conditions," IEEE Trans. Rob., 28(1), pp. 61–76.
11. Li, M., and Mourikis, A. I., 2012, "High-Precision, Consistent EKF-Based Visual-Inertial Odometry," Int. J. Rob. Res., 32(6), pp. 690–711.
12. Armesto, L., Tornero, J., and Vincze, M., 2007, "Fast Ego-Motion Estimation With Multi-Rate Fusion of Inertial and Vision," Int. J. Rob. Res., 26(6), pp. 577–589.
13. Mirzaei, F., and Roumeliotis, S., 2008, "A Kalman Filter-Based Algorithm for IMU-Camera Calibration: Observability Analysis and Performance Evaluation," IEEE Trans. Rob., 24(5), pp. 1143–1156.
14. Martinelli, A., 2012, "Vision and IMU Data Fusion: Closed-Form Solutions for Attitude, Speed, Absolute Scale, and Bias Determination," IEEE Trans. Rob., 28(1), pp. 44–60.
15. Foxlin, E., Altshuler, Y., Naimark, L., and Harrington, M., 2004, "FlightTracker: A Novel Optical/Inertial Tracker for Cockpit Enhanced Vision," 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, Arlington, VA, Nov. 2–5, pp. 212–221.
16. Tao, Y., Hu, H., and Zhou, H., 2007, "Integration of Vision and Inertial Sensors for 3D Arm Motion Tracking in Home-Based Rehabilitation," Int. J. Rob. Res., 26(6), pp. 607–624.
17. Hu, J.-S., Tseng, C.-Y., Chen, M.-Y., and Sun, K.-C., 2013, "IMU-Assisted Monocular Visual Odometry Including the Human Walking Model for Wearable Applications," IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, May 6–10, pp. 2879–2884.
18. Agarwal, P., Kumar, S., Ryde, J., Corso, J. J., and Krovi, V. N., 2014, "Estimating Dynamics On-the-Fly Using Monocular Video for Vision-Based Robotics," IEEE/ASME Trans. Mechatronics, 19(4), pp. 1412–1423.
19. Zhang, Y., Liu, R., Trkov, M., and Yi, J., 2012, "Rider/Bicycle Pose Estimation With Integrated IMU/Seat Force Sensor Measurements," IEEE/ASME International Conference on Advanced Intelligent Mechatronics, Kaohsiung, Taiwan, July 11–14, pp. 604–609.
20. Zhang, Y., Chen, K., and Yi, J., 2013, "Rider Trunk and Bicycle Pose Estimation With Fusion of Force/Inertial Sensors," IEEE Trans. Biomed. Eng., 60(9), pp. 2541–2551.
21. Zhang, Y., Chen, K., Yi, J., and Liu, L., 2014, "Pose Estimation in Physical Human-Machine Interactions With Application to Bicycle Riding," IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, Sept. 14–18, pp. 3333–3338.
22. Zhang, Y., Chen, K., Yi, J., Liu, T., and Pan, Q., 2016, "Whole-Body Pose Estimation in Human Bicycle Riding Using a Small Set of Wearable Sensors," IEEE/ASME Trans. Mechatronics, 21(1), pp. 163–174.
23. Lu, X., Zhang, Y., Yu, K., Yi, J., and Liu, J., 2013, "Upper Limb Pose Estimation in Rider-Bicycle Interactions With an Un-Calibrated Monocular Camera and Wearable Gyroscopes," ASME Paper No. DSCC2013-3839.
24. Lu, X., Yu, K., Zhang, Y., Yi, J., and Liu, J., 2014, "Whole-Body Pose Estimation in Physical Rider-Bicycle Interactions With a Monocular Camera and Wearable Gyroscopes," IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, Sept. 14–18, pp. 4124–4129.
25. Kim, H., Miller, L. M., Byl, N., Abrams, G. M., and Rosen, J., 2012, "Redundancy Resolution of the Human Arm and an Upper Limb Exoskeleton," IEEE Trans. Biomed. Eng., 59(6), pp. 1770–1779.
26. Otsu, N., 1979, "A Threshold Selection Method From Gray Level Histograms," IEEE Trans. Syst., Man, Cybern., 9(1), pp. 62–66.
27. González, R. C., and Woods, R. E., 2008, Digital Image Processing, 3rd ed., Prentice Hall.
28. Chen, K., Zhang, Y., and Yi, J., 2013, "Modeling Rider/Bicycle Interactions With Learned Dynamics on Constrained Embedding Manifolds," IEEE/ASME International Conference on Advanced Intelligent Mechatronics, Wollongong, Australia, pp. 442–447.
29. Yi, J., Wang, H., Zhang, J., Song, D., Jayasuriya, S., and Liu, J., 2009, "Kinematic Modeling and Analysis of Skid-Steered Mobile Robots With Applications to Low-Cost Inertial-Measurement-Unit-Based Motion Estimation," IEEE Trans. Rob., 25(5), pp. 1087–1097.
30. Roetenberg, D., Slycke, P. J., and Veltink, P. H., 2007, "Ambulatory Position and Orientation Tracking Fusing Magnetic and Inertial Sensing," IEEE Trans. Biomed. Eng., 54(5), pp. 883–890.
31. Luinge, H. J., and Veltink, P. H., 2005, "Measuring Orientation of Human Body Segments Using Miniature Gyroscopes and Accelerometers," Med. Biol. Eng. Comput., 43(2), pp. 273–282.