Abstract
Augmented reality (AR) has been applied to facilitate human–robot collaboration in manufacturing. As a new interface paradigm, it enhances real-time communication and interaction between humans and robots. This research conducts an experimental study to systematically evaluate and compare various input modality designs based on hand gestures, eye gaze, head movements, and voice in industrial robot programming. These modalities allow users to perform common robot planning tasks from a distance through an AR headset, including pointing, tracing, 1D rotation, 3D rotation, and switch state. Statistical analyses of both objective and subjective measures collected from the experiment reveal the relative effectiveness of each modality design in assisting individual tasks in terms of positional deviation, operational efficiency, and usability. A verification test on programming a robot to complete a pick-and-place procedure not only demonstrates the practicality of these modality designs but also confirms their cross-comparison results. Significant findings from the experimental study provide design guidelines for AR input modalities that assist in planning robot motions.
1 Introduction
Industry 4.0 has emerged as an effective approach to implementing the concept of smart manufacturing. Modern manufacturing systems have thus become highly automated and intelligent, capable of performing complex tasks while maintaining production efficiency, flexibility, and quality. Industrial robots are key technical elements in this transition and no longer operate as stand-alone units in most factories. However, fully autonomous robot operations without human intervention are not always practical on the shop floor [1]. A more practical approach is human–robot collaboration (HRC), which combines the superior accuracy and repeatability of robots with the high flexibility and adaptability of humans [2]. A straightforward method to implement HRC is to have robots perform specific tasks by following motion plans created by humans. Robot programming by demonstration (PbD) technology has been developed to fulfill this need. A typical PbD process involves a human planner manipulating a robot in an actual manufacturing environment using a teaching pendant or a similar instructive device. The process can become tedious, error-prone, and unintuitive for the planner when maneuvering an industrial robot within a complex manufacturing environment. Collision avoidance through correct robot positioning in 3D space incurs a high mental workload. New user interfaces and interaction methods for robot programming are required to overcome this problem in current PbD technology.
Augmented reality (AR) overlays virtual content onto real-world scenes to enhance the user's perception and situation awareness of the environment. AR is the only human-centric technology among the nine key enabling elements in Industry 4.0, allowing real-time user interaction with the environment through various sensory modalities [3]. Previous studies [4–6] have verified that AR, as an intuitive interface, can enhance interactions among humans, robots, and the environment in HRC. AR-assisted robot motion planning has been applied to various manufacturing applications, such as pick-and-place [4], welding [5,7], spraying [8], and machining [9]. Most applications chose head-mounted display devices, such as goggles and smart glasses, over handheld devices to free the user's hands during planning [3]. Advanced AR headsets provide multimodal sensory inputs for real-time interaction that are more flexible and intuitive than traditional keyboards and touchscreens. However, input modalities based on hand ray, head movement, and eye gaze produced different positional precision and usability for pointing and tracing tasks in AR-assisted robot PbD [7]. It is therefore necessary to examine how effectively different input modalities and interface designs assist in robot motion planning in 3D space. The results may provide insights into which input modality design is best suited for specific planning tasks. To address this research gap, this study conducts a systematic analysis of interaction modality designs for industrial robot programming using an AR headset. AR interfaces are developed using various sensory modalities, including hand gestures, eye gaze, head movement, and voice. Evaluation experiments are conducted to compare their relative effectiveness in pointing, tracing, rotation, and switch state, which are basic planning tasks in industrial robot programming, based on both objective and subjective measures. Statistical analyses of the experimental results reveal the suitability of each modality design for specific tasks. These findings may serve as design guidelines for enhancing interaction modalities in AR-based robot programming.
2 Related Studies
Previous research has developed AR interfaces that enable users to plan robot motions using various input methods. The effectiveness of these interfaces has been mainly validated through their implementation in specific manufacturing operations. The following reviews summarize related studies with a focus on the interface design and evaluation for industrial robot programming.
Ostanin and Klimchik [10] implemented an AR-based robot programming system using HoloLens. Users can specify virtual waypoints using a head cursor combined with an air tap gesture, and then translate or rotate these waypoints to plan the robot's positions and orientations. The robot's moving trajectory can be created from these waypoints using various methods such as point-to-point, linear, arc, or freeform curves. Their later work [11] improved the input interface design by adding virtual gizmos, allowing users to easily grasp and drag waypoints in 3D space for fine adjustments. A control interface was introduced for switching the gripper between its open and closed states. The system was then applied to robot pick-and-place tasks and drawing tasks to validate its effectiveness.
Rudorfer et al. [12] integrated 3D object recognition into AR to estimate the initial pose of an object for robotic manipulation. Users can direct a head ray at an object, select it using an air tap gesture, and specify a target position by moving their head. A robot then automatically grasps the object and moves it to the target position, completing the pick-and-place operation. Soares et al. [13] developed an AR interface that tracks the position of the user's fingertip using HoloLens 2. The interface enables users to draw trajectories in 3D space by specifying the starting and ending positions with an air tap gesture. The drawn trajectories were converted into robot motion commands through coordinate transformation between the HoloLens and robot coordinate systems. Fang et al. [9] proposed using a handheld marker for robot motion programming. A stereo vision camera was used to recognize the handheld marker, allowing the input of point positions and orientations. Robot path-following tasks demonstrated the feasibility of the proposed idea.
This study aims to enhance AR interface design for industrial robot motion planning, focusing on five common input tasks: pointing, tracing, 1D rotation, 3D rotation, and switch state. The efficacy and usability of various input modalities for similar tasks have been investigated, but research findings are scattered across different studies [6,14,15]. Table 1 provides a comparative summary of their important conclusions.
A summary of previous studies on modality designs in augmented reality
| Input task | Near distance | Far distance | | | |
|---|---|---|---|---|---|
| | Head gesture/Touch | Hand ray | Eye gaze | Head movement | Voice |
| Pointing | | | | | |
| Tracing | | | | | N/A |
| Rotation | N/A | N/A | | | |
| Switch state | | N/A | | | |
The literature review shows that many studies have effectively utilized AR interfaces for industrial robot planning, incorporating various interaction modalities to facilitate user input tasks. Simulation of robot motions by overlaying virtual cues onto real scenes demonstrates the effectiveness of these interfaces in guiding robots to complete a wide range of manufacturing tasks [7–12]. However, there are still deficiencies in those studies that require further investigation.
Most input modalities for pointing were designed either to select an object or to click on a button. The pointing location only needs to fall within the object's or button's range, allowing for considerable positional deviation. The input point was estimated as the intersection between a ray and a virtual object or plane, leading to limited 3D input capabilities.
As shown in Table 1, existing input modalities such as head movement and voice are mostly unsuitable for rotating objects. In contrast, hand ray works well at close distances, but its effectiveness for rotating tasks beyond arm's reach has not been thoroughly investigated.
Previous studies indicated that hand gestures [28], voice [19], and head movement [30] are more suitable for switching state, although each conclusion was derived from different applications. It is necessary to cross-compare them within the same context to obtain generalized conclusions.
This research aims to systematically study input modality designs in AR for industrial robot planning in manufacturing. AR interfaces are developed with various sensory modalities to complete basic planning tasks, including pointing, tracing, rotation, and switch state. We conduct evaluation experiments to analyze and compare the performance of each modality design across different tasks based on both objective and subjective measures. Findings obtained from analyzing the experimental results may provide practical guidelines for AR interface design from the perspective of human–computer interactions. The remainder of this paper is organized as follows. Section 3 describes the design of interaction modalities for all input tasks and their implementation in AR. Section 4 introduces the experimental design for evaluating the performance of the interaction modalities and presents the statistical analysis of the results. Section 5 summarizes the significant findings derived from the analysis and proposes possible explanations. Section 6 describes a test scenario that validates the effectiveness of the proposed modality designs and their cross-comparison results. Section 7 provides concluding remarks and suggestions for future research.
3 Design of Input Modalities
3.1 Planning Tasks in Industrial Robot Programming.
Multi-axis industrial robots have been applied to perform various manufacturing operations such as pick-and-place, assembly, welding, and machining [5]. Although these operations can vary significantly in their process characteristics, planning a robot to complete them mainly involves a sequence of motion commands that determine the position and orientation of the robot's end-effector. The corresponding motion commands for each robot joint are then determined by the post-processor specific to the robot configuration. A human planner may need to modify the commands based on the simulation results of the robot motions to avoid collisions or optimize the overall operation. This research focuses on five basic input tasks identified from the planning procedure described above.
Pointing: The user specifies a position on a target in 3D space using a virtual cursor provided by an AR interface. The specified position may serve as the center of the robot's end-effector or as an end or intermediate point of a moving trajectory.
Tracing: The user continuously specifies a series of positions in 3D space using a virtual cursor provided by an AR interface. These positions may serve as a moving trajectory of the robot's end-effector, such as a welding or machining path. In practice, tracing is not perfectly continuous, as specifying consecutive positions takes time and introduces small delays. The trajectory can then be generated from these positions through various interpolation methods (a minimal interpolation sketch follows these task definitions).
1D rotation: Users rotate an object around its axis to a target position in 3D space, referred to as 1D rotation due to the single degree of rotational freedom. The input determines the orientation of the robot's end-effector or the rotation of a robot joint around its axis.
3D rotation: Users adjust the pose of an object to a target orientation in 3D space, which involves simultaneous manipulation of three degrees of rotational freedom. This task is frequently used to position the robot's end-effector to avoid collisions or maintain specific machining conditions.
Switch state: Switching the state of a robot function is a common input command in robot planning. For instance, users may want to start or stop the robot's motion and open or close the end-effector.
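To illustrate the interpolation step mentioned in the tracing task, the following sketch resamples a sequence of recorded cursor positions into an approximately evenly spaced trajectory by linear interpolation along the accumulated arc length. It is a minimal example; the function name, spacing value, and waypoint coordinates are illustrative only, and other interpolation schemes (e.g., splines) could be substituted.

```python
import numpy as np

def resample_polyline(points, spacing=0.01):
    """Resample recorded cursor positions (n x 3, meters) into a trajectory
    with roughly uniform spacing using linear interpolation over arc length."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)     # segment lengths
    s = np.concatenate(([0.0], np.cumsum(seg)))               # arc length at each point
    s_new = np.append(np.arange(0.0, s[-1], spacing), s[-1])  # uniform samples plus endpoint
    return np.stack([np.interp(s_new, s, points[:, k]) for k in range(3)], axis=1)

# example: three recorded points forming an L-shaped path, resampled at 1 cm spacing
waypoints = [[0.0, 0.0, 0.0], [0.10, 0.0, 0.0], [0.10, 0.05, 0.0]]
trajectory = resample_polyline(waypoints)
```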
3.2 Input Modalities in Each Task.
An advanced AR headset, such as the HoloLens 2, provides various sensory modalities for real-time user interaction with AR scenes, including haptic, visual, auditory, and multimodal feedback [33]. These modalities have been applied to facilitate human–robot collaborations in smart manufacturing for purposes of robot programming [5,7,13], safety awareness [34,35], and mutual understanding [36,37]. The input methods developed by those studies include finger touch, hand ray, gesture, head gaze, eye gaze, voice, and their combinations. Building on these methods, this research proposes the following input modalities for the basic planning tasks described in Sec. 3.1. Table 2 shows the schematics of the modality designs for each input task.
Modality designs for each planning task
| Pointing | Tracing | 1D rotation | 3D rotation | Switch state |
|---|---|---|---|---|
| Hand ray | Hand ray | Disk gizmo | Cube gizmo | Hand gesture |
| Finger point | Finger point | Hand ray | Finger direction | Voice |
| Head movement | Head movement | Finger rotation | U-gesture | Head movement |
| | | Thumbs up-gesture | Hold-rotate | Eye gaze |
| | | U-gesture | | |
3.2.1 Pointing
Hand ray: A line segment extends from the center of the user's palm, with its end point serving as a cursor in 3D space. The user moves the cursor to a target point by rotating and translating the hand. The line is always displayed in AR to indicate the cursor position.
Finger point: A cursor is created by offsetting a fixed distance from the index fingertip of the user's dominant hand. Unlike hand ray pointing, the user manipulates the cursor only by translating the fingertip, without the need to rotate it. This method improves on traditional finger touch by enabling interaction beyond arm's reach. A circle is shown at the cursor to indicate its current position.
Head movement: The cursor position is determined by projecting a fixed distance forward from the main camera of the AR headset. The user manipulates the cursor in 3D space by moving and rotating the head. A circle is shown at the cursor to indicate its current position.
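As a concrete illustration of the three pointing designs above, the following sketch computes each cursor position from headset tracking data. The vector names, offset distances, and helper function are illustrative assumptions and not part of the MRTK API; in the actual implementation these quantities come from the headset's hand and head tracking.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def hand_ray_cursor(palm_pos, palm_dir, ray_length=2.0):
    """Cursor at the end of a fixed-length ray cast from the palm center."""
    return palm_pos + ray_length * normalize(palm_dir)

def finger_point_cursor(index_tip_pos, offset_dir=(0.0, 0.0, 1.0), offset=0.5):
    """Cursor offset a fixed distance from the index fingertip along a constant
    direction, so the cursor follows fingertip translation without rotation."""
    return np.asarray(index_tip_pos, dtype=float) + offset * normalize(np.asarray(offset_dir, dtype=float))

def head_movement_cursor(camera_pos, camera_forward, distance=2.0):
    """Cursor projected a fixed distance ahead of the headset's main camera."""
    return camera_pos + distance * normalize(camera_forward)

# example: a palm at the origin pointing along +z places the hand-ray cursor 2 m ahead
print(hand_ray_cursor(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])))
```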
3.2.2 Tracing.
The modality designs for hand ray, finger point, and head movement remain the same as in the pointing task. The user needs to continuously move the cursor along a specified trajectory, and all points generated during the movement are recorded to represent the trajectory.
3.2.3 1D Rotation.
Previous studies have distinguished two interaction paradigms for bare-hand object manipulation in mixed reality [15,38]. Metaphoric mapped interaction draws on mental models generated from repeated patterns in daily experiences, while isomorphic mapped interaction is characterized by one-to-one literal spatial relations between input actions and their resulting effects. Experimental results suggest that interaction using hand manipulation is more suitable for rotating objects [15]. Therefore, we propose both metaphoric and isomorphic bare-hand interaction designs for 1D rotation input.
Disk gizmo: A virtual disk is created as a gizmo to rotate an object around a specific axis. The user needs to grasp the disk using a pinch hand gesture and drag it around its normal to make a rotation.
Hand ray: This input method is the same as that used for the pointing task. The user conducts the rotation process by clicking on a button using the cursor controlled by the hand ray. The “increase” and “decrease” buttons correspond to rotating in two opposite directions.
Finger rotation: The user initiates the rotation process by extending the index finger of the dominant hand. Turning the finger clockwise and counterclockwise corresponds to rotating in two opposite directions. The process stops when the finger is retracted.
Thumbs up-gesture: The user initiates the rotation process by showing a thumbs up-gesture. Turning the other four fingers clockwise and counterclockwise corresponds to rotating in two opposite directions. The process stops when the gesture is not maintained.
U-gesture: The user forms a U-shape gesture with the thumb and index finger of their dominant hand. Turning the two fingers clockwise and counterclockwise corresponds to rotating in two opposite directions. The process stops when the gesture is not maintained.
3.2.4 3D Rotation
Cube gizmo: A virtual cube is created as a gizmo to represent the orientation of the object to be rotated. The user needs to grasp the cube using a pinch hand gesture and change its orientation to a specific direction in 3D space.
Finger direction: The user controls the orientation of an object directly using the index finger of their dominant hand.
U-gesture: The user forms a U-shape gesture with the thumb and index finger of their dominant hand. The normal of the plane determined by the U-shape works as the orientation of the object to be rotated.
Hold-rotate: The user first wraps the four fingers of the dominant hand, excluding the thumb, around the rotational axis, then moves the axis to a specific direction in 3D space. The process is controlled by simultaneously rotating the wrist and moving the arm.
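The finger direction, U-gesture, and hold-rotate designs all map a tracked direction to the orientation of the virtual object. A minimal sketch of this mapping is given below: it uses the Rodrigues formula to compute the rotation that aligns the object's reference axis with the tracked direction. The function name and axis conventions are illustrative; note that aligning a single axis leaves the roll about that axis as a remaining degree of freedom, which the actual gestures resolve through the full tracked hand pose.

```python
import numpy as np

def align_axis_rotation(current_axis, target_dir):
    """Rotation matrix aligning current_axis with target_dir (3-vectors), e.g., to
    point a virtual end-effector along a tracked finger direction or gesture normal."""
    a = np.asarray(current_axis, dtype=float)
    b = np.asarray(target_dir, dtype=float)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):                      # opposite directions: 180 deg rotation
        p = np.eye(3)[np.argmin(np.abs(a))]      # pick a basis vector not parallel to a
        v = np.cross(a, p)
        v /= np.linalg.norm(v)
        return 2.0 * np.outer(v, v) - np.eye(3)  # 180 deg rotation about v
    k = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])           # skew-symmetric cross-product matrix
    return np.eye(3) + k + (k @ k) / (1.0 + c)   # Rodrigues rotation formula

# example: align the end-effector's +z axis with a direction pointed by the finger
R = align_axis_rotation([0.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```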
3.2.5 Switch State.
The switching action between states normally does not require high positional accuracy in a graphical user interface. Eye gaze has been implemented as an efficient input modality for quick selection and activation in AR applications [39]. Therefore, an input modality using eye gaze is developed for switching state in addition to the other hand gesture-based designs.
U-gesture: The user forms a U-shape gesture with the thumb and index finger of their dominant hand to indicate the “open” state. Closing the U-shape by bringing the two fingers together indicates the “closed” state.
Voice: The user speaks the words “open” and “close” to specify the current state.
Eye gaze: The user chooses between the “open” and “close” states by staring at the corresponding button for a short period of time.
Head movement: The initial neutral state corresponds to the natural position of the user's head. Rotating the head more than 15 deg to the right or left corresponds to the "open" and "close" states, respectively.
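A minimal sketch of this head-movement switching logic is given below, assuming the head yaw angle (in degrees) is available from head tracking each frame. The threshold value follows the 15 deg criterion; the sign convention (positive yaw for a right turn) and the function name are assumptions.

```python
THRESHOLD_DEG = 15.0   # head rotation beyond this angle triggers a state change

def head_switch_state(neutral_yaw_deg, current_yaw_deg):
    """Return 'open' for a right turn beyond the threshold, 'close' for a left
    turn beyond the threshold, and None while the head remains near neutral."""
    delta = current_yaw_deg - neutral_yaw_deg
    delta = (delta + 180.0) % 360.0 - 180.0      # wrap the difference into (-180, 180]
    if delta > THRESHOLD_DEG:
        return "open"                            # head rotated to the right (assumed +yaw)
    if delta < -THRESHOLD_DEG:
        return "close"                           # head rotated to the left
    return None

# example: a head turn from a 5 deg neutral pose to 25 deg selects "open"
print(head_switch_state(5.0, 25.0))
```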
4 Experiment Design
4.1 Experimental Procedure.
A within-group experiment is designed to assess the performance of the different modality designs for each planning task at a distance using HoloLens 2. The headset offers real-time interaction features such as hand gesture recognition, eye tracking, head tracking, and voice recognition, which can be integrated into AR applications using the Microsoft Mixed Reality Toolkit (MRTK) v2. In this experiment, subjects need to confirm (commit) after entering the specified point or rotation. For this purpose, they press the button of a Logitech R800 laser pointer held in their non-dominant hand; the confirmation signal is wirelessly transmitted to the HoloLens to minimize potential delays. This design avoids the pinch (air tap) gesture commonly used in AR applications, which can cause body imbalance or prolong the task. Twenty college students, evenly divided by gender and aged between 20 and 24, were recruited to participate in the experiment. All participants had previously used HoloLens 2. Wearing the AR headset, they needed to complete five planning tasks using all the developed modalities. An experimenter supervised the entire process as each participant conducted the experiment.
As shown in Fig. 1, the experiment begins with participants signing an informed consent form, followed by the experimenter explaining the experimental procedure, its purpose, and the guidelines. During the subsequent preparation session, the virtual and real worlds are first aligned, and the HoloLens 2 adjusts its display using the built-in calibration procedure to optimize performance for each participant. A task sequence is then randomly generated to minimize potential learning effects. The AR application developed for the experiment is configured according to the current input task. The experimenter explains and demonstrates how each input modality functions for that task. Afterward, participants practice each modality design using the AR headset. The actual experiment begins after a break that follows the practice session. The experiment session starts by generating a random sequence of all input modalities. Participants repeatedly conduct the input task, using a different modality each time according to the sequence. Once all modalities have been tested, participants complete a questionnaire for each modality design they experienced. The time required to complete an input task varies by individual and depends on the number of modality designs under investigation. A 5-minute break is arranged before proceeding to the next input task.
An AR application is developed for a specific input task and provides all the input modalities required for that task. The application sequentially activates the modality interfaces based on the random sequence generated prior to performing the task (refer to Fig. 1). Switching between different input tasks requires the experimenter to manually load different AR applications. This process takes place during the break following the completion of each input task.
4.2 Assessment Criteria.
The performance of the developed modalities is evaluated and compared using both objective and subjective measures collected from the experiment. The objective measures include positional accuracy, precision (where applicable), and completion time, while the subjective measures are obtained through a proprietary questionnaire. These measures are described for each individual task in the following.
For pointing, the accuracy is estimated as the average deviation of multiple input points from the target position prompt in AR. The precision is defined as the standard deviation derived from the same group of input points. For tracing, the accuracy and precision are estimated relative to the target trajectory prompt in AR and the trajectory interpolated from the input points, respectively. In the two rotation tasks, participants rotate a virtual end-effector to match the target pose prompt in the AR scene. The accuracy and precision are calculated from multiple inputs, similar to the methods used for pointing. These two assessments are not applicable to state switching. Instead, the error count is used as the performance index. An error occurs when participants fail to switch to the correct state. The completion time is averaged from multiple trials for each task.
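One plausible computation of the pointing (and rotation) measures defined above is sketched below: accuracy as the mean deviation of repeated inputs from the target, and precision as the standard deviation over the same group of inputs. The array shapes and example values are illustrative only.

```python
import numpy as np

def accuracy_precision(inputs, target):
    """Accuracy: mean Euclidean deviation of the input points from the target.
    Precision: standard deviation of those deviations across repeated inputs
    (the spread of the points about their own centroid is a possible alternative)."""
    deviations = np.linalg.norm(np.asarray(inputs, dtype=float) - np.asarray(target, dtype=float), axis=1)
    return deviations.mean(), deviations.std(ddof=1)

# example: three repeated inputs (meters) around a target at (0.30, 0.20, 0.50)
inputs = [[0.302, 0.199, 0.503], [0.297, 0.202, 0.498], [0.305, 0.196, 0.501]]
accuracy, precision = accuracy_precision(inputs, [0.30, 0.20, 0.50])
```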
In addition, a simplified proprietary questionnaire was developed by deriving relevant questions from the standard NASA-TLX and SUS forms to gather subjective feedback on the modality designs. Participants respond on a five-point Likert scale for each aspect, including physical load, mental load, complexity, intuitiveness, and overall preference. Five points indicate that the design is extremely ideal in that aspect, while one point indicates that it is extremely undesirable. Note that, unlike the NASA-TLX or SUS forms, the score on the five-point Likert scale directly reflects participants' subjective perception of physical or mental "ease," indicating their preference.
5 Results and Discussion
5.1 Pointing.
Each participant specifies a point at four target positions in AR using the three modality designs, completing the task twice. The accuracy data estimated from the experimental results pass Levene's homogeneity test but do not pass the Kolmogorov–Smirnov normality test. Therefore, the non-parametric Kruskal–Wallis test is applied, which shows significant differences among the modalities (p-value < 0.01). As shown in Fig. 2, the post-hoc Dunn's test indicates that the accuracy of the finger point modality is significantly higher than that of hand ray and head movement. Moreover, the accuracy of hand ray is significantly higher than that of head movement.
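The test sequence used throughout this section (homogeneity, normality, Kruskal–Wallis, and Dunn's post-hoc test) can be reproduced with standard statistical packages. The sketch below assumes SciPy and the scikit-posthocs package are available and uses randomly generated deviation values purely as placeholders for the measured data.

```python
import numpy as np
from scipy import stats
import scikit_posthocs as sp          # assumed available for Dunn's post-hoc test

def compare_modalities(samples, labels):
    """Levene (homogeneity), Kolmogorov-Smirnov per group (normality),
    Kruskal-Wallis (overall difference), and Dunn's pairwise post-hoc test."""
    print("Levene:", stats.levene(*samples))
    for name, x in zip(labels, samples):
        x = np.asarray(x, dtype=float)
        print(name, stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))))
    print("Kruskal-Wallis:", stats.kruskal(*samples))
    print(sp.posthoc_dunn(list(samples), p_adjust="bonferroni"))

# placeholder data: 40 deviation values (mm) per pointing modality
rng = np.random.default_rng(0)
compare_modalities(
    [rng.normal(m, 2.0, 40) for m in (5.0, 7.0, 9.0)],
    ["finger point", "hand ray", "head movement"],
)
```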
The precision data estimated from the experimental results pass neither Levene's homogeneity test nor the Kolmogorov–Smirnov normality test. No significant differences exist among the three modality designs according to the Kruskal–Wallis analysis (p-value = 0.722). The average precision for each modality is shown in Fig. 3.
The completion time data pass Levene's homogeneity test but do not pass the Kolmogorov–Smirnov normality test. No significant differences exist among the three modality designs according to the Kruskal–Wallis analysis (p-value = 0.485). The average time for each modality is shown in Fig. 4.
The Kruskal–Wallis method is applied to analyze the results for each of the five aspects of the questionnaire separately. Significant differences among the modality designs are shown in Fig. 5. Hand ray and finger point both have lower physical and mental loads compared to head movement (p-value < 0.01). Hand ray is considered more intuitive than finger point and head movement (p-value < 0.05). Participants prefer using hand ray and finger point over head movement (p-value < 0.01).
5.2 Tracing.
Each participant traces three target line segments in AR using the three modality designs, completing the task twice. We do not distinguish between the three lines in the subsequent analysis and instead average their measures. The accuracy is estimated from the distances of the input points to the target line, while the precision is estimated relative to the line fitted to the discrete input points by linear regression.
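A minimal sketch of the two tracing measures is given below: accuracy as the mean perpendicular distance of the traced points to the target line, and precision as the spread of the points about their own best-fit line. The line fit here uses the principal axis of the points (a total least-squares fit), which is one reasonable realization of the linear regression mentioned above; names and shapes are illustrative.

```python
import numpy as np

def point_line_distances(points, line_point, line_dir):
    """Perpendicular distances from each point to the line through line_point
    with direction line_dir."""
    d = np.asarray(line_dir, dtype=float)
    d = d / np.linalg.norm(d)
    v = np.asarray(points, dtype=float) - np.asarray(line_point, dtype=float)
    return np.linalg.norm(v - np.outer(v @ d, d), axis=1)

def tracing_measures(points, target_point, target_dir):
    """Accuracy: mean distance of the traced points to the target line.
    Precision: standard deviation of the distances to their own fitted line."""
    points = np.asarray(points, dtype=float)
    accuracy = point_line_distances(points, target_point, target_dir).mean()
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)      # principal direction = fitted line
    precision = point_line_distances(points, centroid, vt[0]).std(ddof=1)
    return accuracy, precision
```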
The accuracy data calculated from the experimental results pass neither Levene's homogeneity test (p-value = 0.013) nor the Kolmogorov–Smirnov normality test (p-value < 0.01). The Kruskal–Wallis analysis indicates significant differences among the modality designs. As shown in Fig. 6, the post-hoc Dunn's test indicates that finger point is more accurate than the other two modalities, with hand ray being more accurate than head movement. The precision data pass Levene's homogeneity test (p-value = 0.691) but do not pass the Kolmogorov–Smirnov normality test (p-value = 0.003). The Kruskal–Wallis analysis indicates significant differences among the modality designs. As shown in Fig. 7, the post-hoc Dunn's test indicates that head movement has lower precision than the other modalities.
The completion time data pass Levene's homogeneity test (p-value = 0.323) but do not pass the Kolmogorov–Smirnov normality test (p-value < 0.01). The Kruskal–Wallis analysis indicates significant differences among the modality designs. As shown in Fig. 8, the post-hoc Dunn's test indicates that finger point takes longer than the other two modalities.
The Kruskal–Wallis analysis shows significant differences among the subjective ratings of the modality designs, as shown in Fig. 9. Hand ray and finger point both have lower physical and mental loads compared to head movement (p-value < 0.01). Head movement is considered more complex than hand ray (p-value < 0.05). Hand ray is considered more intuitive than head movement (p-value < 0.05). Participants prefer using hand ray and finger point over head movement (p-value < 0.01).
5.3 1D Rotation.
Each participant rotates a virtual end-effector to a specific position around a rotational axis using the five modality designs, completing the task three times. Since each independent variable has an equal amount of data and the sample size is sufficiently large, a normal distribution can be assumed according to the central limit theorem. Therefore, normality and homogeneity tests are not required. The analysis of variance (ANOVA) test indicates significant differences in rotation deviation among the modalities (p-value < 0.01). The post-hoc Tukey Honestly Significant Difference (HSD) analysis indicates that U-gesture has a larger deviation than hand ray, finger rotation, and disk gizmo, as shown in Fig. 10.
The ANOVA test indicates significant differences in completion time among the modalities (p-value < 0.01). The post-hoc Tukey HSD analysis indicates that disk gizmo takes longer than hand ray, finger rotation, and thumbs up-gesture, as shown in Fig. 11. The Kruskal–Wallis analysis shows significant differences among the modality designs (see Fig. 12). Both U-gesture and disk gizmo have higher physical load than hand ray and finger rotation (p-value < 0.01). Finger rotation has lower mental load than U-gesture, thumbs up-gesture, and disk gizmo (p-value < 0.01). Finger rotation is considered less complex than U-gesture, thumbs up-gesture, and disk gizmo (p-value < 0.05). It is considered more intuitive than thumbs up-gesture and disk gizmo (p-value < 0.05). Finally, participants prefer using hand ray and finger rotation over U-gesture and disk gizmo (p-value < 0.05).
5.4 3D Rotation.
Each participant rotates a virtual end-effector to the target pose in AR space using the four modality designs, completing the task three times. Since each independent variable has an equal amount of data and the sample size is sufficiently large, a normal distribution can be assumed according to the central limit theorem. Therefore, normality and homogeneity tests are not required. The ANOVA test indicates significant differences in orientation deviation from the target among the modalities (p-value < 0.01). The post-hoc Tukey HSD analysis indicates that the cube gizmo has a smaller deviation than the other modalities, as shown in Fig. 13. In contrast, the ANOVA test indicates no significant differences in completion time among the four modalities (p-value = 0.349). Figure 14 shows the average time for each modality.
The Kruskal–Wallis test is applied to analyze the results of the five aspects of the questionnaire separately. Significant differences among the modality designs are shown in Fig. 15. Cube gizmo has lower physical load than all the other modalities (p-value < 0.01). No significant differences exist in mental load, complexity, or intuitiveness. Participants prefer using the cube gizmo over the other modalities for this task (p-value < 0.01).
5.5 Switch State.
Each participant switches a virtual end-effector between the open and closed states using the four modality designs, completing the task three times. Since each independent variable has an equal amount of data and the sample size is sufficiently large, a normal distribution can be assumed according to the central limit theorem. Therefore, normality and homogeneity tests are not required. Every switch was conducted correctly in the experiment. The ANOVA test indicates significant differences in completion time among the modalities (p-value < 0.01). The post-hoc Tukey HSD analysis indicates that voice takes longer than the other modalities, as shown in Fig. 16.
The Kruskal–Wallis test is applied to analyze the results of the five aspects of the questionnaire separately. Significant differences among the modality designs are shown in Fig. 17. Head movement has higher physical load than voice and hand gesture. Head movement has higher mental load and is considered more complex and less intuitive than the other modalities. Finally, participants prefer using voice and hand gesture over eye gaze and head movement for this task. All conclusions are based on p-value < 0.01.
5.6 Discussions.
This section summarizes the key findings obtained from the statistical analyses of the experimental results described in the previous section. Possible explanations and implications of these findings are also discussed.
The hand ray and finger point modalities are more accurate than head movement for specifying points in AR space. Participants prefer using the two modalities because of their lower physical and mental loads, as well as better usability. Previous studies [7,20] reached similar conclusions when comparing hand ray with head movement.
The finger point modality is the most accurate for tracing a trajectory in AR space. However, this modality takes longer to complete the tracing task than the others. Participants prefer using finger point and hand ray over head movement because of their lower physical and mental loads, as well as better usability. Previous studies [7,20] reached similar conclusions when comparing hand ray with head movement.
The disk gizmo modality produces less deviation than the other modalities, although it takes longer to complete the 1D rotation. Participants prefer using hand ray and finger rotation over U-gesture and disk gizmo.
The cube gizmo modality results in less deviation when positioning a virtual object to a specified pose in AR space. Manipulating the gizmo in HoloLens consists of: (1) holding a control point using the pinch gesture and (2) conducting the rotation. This two-step operation may require more time compared to modalities that involve only one step. Cube gizmo is the most preferred modality design, with a lower physical load. In contrast, other hand or gesture-based modalities require intricate motor skills of the human hand.
The voice modality takes longer time to switch states, as processing voice input and recognition is slower than other sensory modalities. Participants prefer using voice and head gesture over eye gaze and head movement to switch states in AR.
In addition, similar to previous findings [7,22], input modalities based on head or eye gaze are too sensitive for tasks demanding high positional precision. Involuntary physiological actions in the human body, such as breathing, eye micro-saccades, and head jitter, introduce noise into gaze signals and focal points. Hand-based input modalities function more effectively under such circumstances. As the AR applications developed for the experiment are similar, the calibration error results estimated in our previous work [7] are directly referenced here. The highest accuracy achieved in pointing with a hand gesture interface is 4.47 mm, while the finest precision is 1.95 mm. For pointing with a head gaze interface, the highest accuracy and finest precision are 4.98 mm and 2.62 mm, respectively. Note that the final accuracy or precision for input tasks performed through an AR interface is not simply the numerical sum of the individual calibration and perception errors.
Hand or finger-based modalities outperform the others in pointing and tracing based on both objective and subjective measures. However, their positional accuracy or precision in the current implementation may only support robot programming tasks that permit generous tolerances. The disk gizmo input is the most precise design for 1D rotation but receives lower usability ratings than the hand ray and finger rotation designs. The cube gizmo input performs best in terms of positional accuracy in 3D rotation and is considered the most preferred design. Modality designs using gizmos are more accurate in indicating a rotational amount in both rotation tasks in AR space. An explanation derived from observation is that users tended to keep adjusting the input to improve accuracy using the instant visual feedback provided by the gizmos, which may prolong the input process. In contrast, the feedback is less explicit in the other modalities. In fact, the time required to rotate an object to a specific pose in the current implementation is too lengthy, regardless of the input modality.
Multiple sources of sensory information across various modalities provide practical value for complex applications in human–computer interaction. The design of multimodal inputs is closely linked to the cross-comparison and baseline performance of individual input modalities. Previous studies [40–42] have indicated the complexity of integrating multiple modality cues to facilitate human interaction with the environment, requiring these cues to merge into a coherent and unambiguous perception of the human body and its surroundings. The challenge of sensory combination can be modeled and evaluated using signal detection theory [42], which treats each sensory perception as a probabilistic process. The goal is to minimize the variance in the final perceptual estimate of an environmental property by appropriately weighting the individual sensory inputs, which can vary across individuals. The baseline assessment conducted in this work offers design references for achieving this goal. Alternatively, the combination must adapt to the dynamic nature of human awareness in response to changes in external conditions. This is particularly important when the combined sensory cues are designed to aid the receiver in improving situational awareness.
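The variance-minimizing combination referred to above is commonly formalized as inverse-variance weighting of independent cues, where the weight of each sensory estimate is proportional to its reliability. The sketch below shows this standard formulation with illustrative numbers; the baseline accuracy and precision values measured in this work could serve as the per-modality variances.

```python
import numpy as np

def fuse_cues(estimates, variances):
    """Minimum-variance linear fusion of independent sensory estimates:
    weights are proportional to the inverse variance of each cue, and the fused
    variance is the reciprocal of the summed inverse variances."""
    inv_var = 1.0 / np.asarray(variances, dtype=float)
    weights = inv_var / inv_var.sum()
    fused = weights @ np.asarray(estimates, dtype=float)
    return fused, 1.0 / inv_var.sum()

# example: combining a precise hand-based cue with a noisier gaze-based cue
fused_estimate, fused_variance = fuse_cues(estimates=[5.0, 8.0], variances=[2.0**2, 6.0**2])
```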
Interaction through finger rotation in the experiment is designed by integrating two hand gesture recognition features offered by Microsoft MRTK as follows. The user's finger is considered to be rotating when the following two conditions are detected simultaneously: (1) the index finger is extended, and (2) the wrist is rotating. Strictly speaking, finger rotation is thus recognized indirectly by detecting wrist rotation. A perceivable delay occurs in hand tracking with HoloLens 2; however, this delay is not expected to significantly impact the experimental results relative to the average completion time.
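The recognition logic described above can be summarized by the sketch below, in which the two MRTK-provided signals are abstracted as a per-frame flag for the extended index finger and a per-frame change in wrist roll. The class and parameter names are illustrative and do not correspond to actual MRTK identifiers.

```python
class FingerRotationInput:
    """Accumulate a rotation command while the index finger is extended and the
    wrist is turning, mirroring the two-condition logic described above."""

    def __init__(self):
        self.active = False
        self.accumulated_deg = 0.0

    def update(self, index_extended, wrist_delta_deg):
        """Call once per tracking frame.
        index_extended: flag from the gesture recognizer (index finger extended).
        wrist_delta_deg: signed change in wrist roll since the previous frame."""
        if not index_extended:
            self.active = False              # retracting the finger stops the rotation
            return 0.0
        self.active = True
        self.accumulated_deg += wrist_delta_deg
        return wrist_delta_deg               # rotation applied to the virtual object this frame

# example: three frames of clockwise wrist rotation with the finger extended
finger_input = FingerRotationInput()
for delta in (1.5, 2.0, 1.0):
    finger_input.update(index_extended=True, wrist_delta_deg=delta)
print(finger_input.accumulated_deg)          # 4.5 deg accumulated
```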
The performance of an input modality is largely limited by its implementation in AR. Under ideal conditions, the relative performance of the different modality designs might differ from the current results in both objective and subjective measures. However, the experimental findings still offer some insight into comparisons among the designs. For example, input modalities involving a high physical load due to body motion are likely to receive lower scores in subjective assessments, regardless of how effectively they are implemented. A challenging question yet to be addressed is the extent to which imperfect implementation influences the performance measures used for comparison. We found the technical capabilities of current AR technology insufficient to support real-time interaction for all the modality designs examined in this work, although the associated delay might not substantially impact the experimental results for some of them. The insufficiency arises from a combination of factors, including devices, algorithms, and interface design. Unless AR interaction can be executed in real time, understanding the relative weights of these factors becomes highly difficult, if not impossible.
6 Verification Test
A verification test was conducted to demonstrate the practicality of the proposed modalities in robot programming by comparing the performance of different combinations of modalities in a pick-and-place procedure. As shown in Fig. 18, a UR3 robot starts from the origin position, picks up a part from a given location, moves it through a narrow passage, and places it into a pocket. Three subjects were recruited to plan the corresponding robot motions in AR. A practice session was arranged for each modality combination before the actual test. The procedure involves nine planning tasks using four different input modalities (see Fig. 19). Table 3 lists the frequency of occurrence of each planning task in the verification test. Table 4 shows the different combinations of modality designs being compared. The optimal and worst combinations consist of the modality designs with the highest and lowest positional accuracy, respectively, where applicable. The three random combinations consist of modality designs randomly selected from Table 2 for each task. Each participant completed three trials with the optimal and worst combinations as well as one trial with each random combination. Performance was estimated by averaging the results from the three trials of each combination.
Frequency of occurrences for each planning task in the verification test
| Task | Frequency of occurrences |
|---|---|
| Pointing | 4 |
| Tracing | 1 |
| 1D rotation | 2 |
| Switch state | 2 |
Modality combinations in the verification test
| Combination | # of trials | Input modality | | | |
|---|---|---|---|---|---|
| | | Pointing | Tracing | 1D rotation | Switch state |
| Optimal | 3 | Finger point | Finger point | Disk gizmo | Head movement |
| Random 1 | 1 | Hand ray | Hand ray | Hand ray | Hand gesture |
| Random 2 | 1 | Hand ray | Hand ray | Finger rotation | Hand gesture |
| Random 3 | 1 | Hand ray | Hand ray | Thumbs up-gesture | Eye gaze |
| Worst | 3 | Head movement | Head movement | U-gesture | Voice |
In the test, the number of failures in correctly completing the planning tasks is used as the performance measure instead of positional deviation. These failures typically result from excessive positional deviations caused by the input modalities. For instance, the robot's end-effector must be correctly rotated according to the part's orientation to pick it up, and a similar condition holds for placing the part into the pocket. The end-effector can grasp the part only when its center is accurately aligned with the part height. Moreover, the moving trajectory specified through the tracing modality needs to be closely aligned with the passage. Figure 20(a) shows the average number of failures per trial for the optimal, random, and worst combinations. The optimal and worst combinations yield the fewest and most failures, respectively. This result reflects the performance ranking of the modality designs determined by the experiment. Figure 20(b) compares the completion time of the combinations. The optimal combination consumed the most time because its modality designs (finger point and disk gizmo) require more time to perform the input tasks. This observation is also consistent with the findings on the performance ranking.
7 Conclusion and Future Work
AR is the only enabling technology in Industry 4.0 designed specifically from a human-centric perspective. It has been implemented as an intuitive interface in human–robot collaboration to enhance real-time bidirectional interactions. A human planner can program the motions of an industrial robot directly in the real manufacturing environment through AR interfaces, overcoming the limitations of traditional robot PbD. However, most related studies have developed proprietary modality designs based on various sensory cues for specific robot applications. The design of interaction modalities in AR-assisted robot programming still lacks generalized guidelines for practical use. Therefore, this research conducted an experimental study to systematically evaluate and compare various input modality designs for common planning tasks in industrial robot programming. The focus was to analyze the relative effectiveness of these designs based on hand gestures, eye gaze, head movement, and voice in performing tasks such as pointing, tracing, 1D rotation, 3D rotation, and switch state. Statistical analyses of positional accuracy, precision, completion time, and usability from the experiment led to the following findings:
Hand gesture-based modality designs result in less deviation and lower physical loads for pointing and tracing compared to the other designs. However, the positional accuracy or precision offered by the hand or finger-based modalities in the current implementation is not able to support robot programming applications that require tight tolerances. A possible reason is the lack of instant depth cues in the holographic display of the AR headset in use.
Due to uncontrolled physiological actions in the human body, head or eye gaze-based input modalities are too sensitive for most input tasks that require positional precision. They also received lower subjective ratings from participants in the experiment.
Modality designs using gizmos (disk and cube) are more accurate for specifying a rotational amount in 1D rotation and the pose of an object in 3D rotation. Users can continue adjusting the input to improve accuracy with the instant visual feedback offered by the gizmos, although this may prolong the input process.
Finally, a test scenario of a pick-and-place robotic operation verified the effectiveness of the modality designs proposed in this work. Performance assessments of different combinations of modality designs in the test also support the cross-comparison results obtained from the experiment. The experimental findings of this study provide systematic guidelines for designing AR input modalities using sensory cues for planning robot motions in real manufacturing environments. They thus enhance human–robot collaboration from a human–computer interaction perspective. Note that objective measures such as positional deviation and completion time may not necessarily align with the usability assessment results. Hence, future work can focus on developing modality designs using multimodal sensory cues to close this gap. We found the technical capabilities of current AR technology insufficient to support real-time interaction for all the modality designs tested in the experiment. A challenging question that remains to be addressed is the extent to which imperfect implementation affects the performance measures used for comparison. Further reducing the positional deviation of input modalities can increase the practical value of AR technology in industrial settings. This can be achieved by integrating external metrology devices or providing depth cues in the AR display.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
No data, models, or code were generated or used for this article.