Abstract
One of the expectations for the next generation of industrial robots is to work collaboratively with humans as robotic co-workers. Robotic co-workers must be able to communicate with human collaborators intelligently and seamlessly. However, prevailing industrial robots are not good at understanding human intentions and decisions. We demonstrate a steady-state visual evoked potential (SSVEP)-based brain-computer interface (BCI) that can directly deliver human cognition to robots through a headset. The BCI is applied to a part-picking robot and sends decisions to the robot while the operator visually inspects the quality of parts. The BCI is verified through a human subject study. In the study, a camera by the side of the conveyor takes a photo of each industrial part and presents it to the operator automatically. When the operator looks at the photo, his/her electroencephalography (EEG) is collected through the BCI. The inspection decision is extracted from SSVEPs in the EEG. When the operator identifies a defective part, the decision is communicated to the robot, which locates the defective part with a second camera and removes it from the conveyor. The robot can grasp various parts with our random grasp planning algorithm (2FRG). We have developed a CNN-CCA model for SSVEP extraction. The model is trained on a dataset collected in our offline experiment and outperforms the existing CCA, CCA-SVM, and PSD-SVM models. The CNN-CCA model is further validated in an online experiment and achieves 93% accuracy in identifying and removing defective parts.
1 Introduction
Robots assist humans in various fields, including the manufacturing industry, surgery, and activities of daily living. As they become more and more intelligent, robots are engaging in many new tasks in collaboration with humans. Human-robot collaboration refers to a collaborative process in which humans and robots work together to achieve a common goal. By collaborating with humans, robots can handle automation problems that require a large amount of human knowledge, complex robot motion plans, or reprogramming for various objects. However, in current human-robot collaborative interactions, human operators have to operate the robots while also handling the human part of the task. Ideally, we hope robots can maintain a tacit understanding with operators as “robotic co-workers.” When operators make decisions, the robots should directly understand those decisions and take appropriate actions. In this way, robots will function seamlessly with operators without extra manual operations.
A direct way to obtain human decisions is to read information from the brain. Human brain activities are accompanied by bioelectrical signals. These signals can be measured from the scalp and are known as electroencephalography (EEG). EEG measures voltage fluctuations resulting from ionic currents within the neurons of the brain [1]. By analyzing and processing EEG, humans can control external devices with brain activities. Since 1988, brain-computer interfaces (BCIs) have been built for controlling robots [2]. When performing tasks, human decisions can be sent directly to robots through BCIs. Therefore, by combining EEG-based BCIs with robots, robotic co-workers can obtain the decisions of operators through EEG while the operators are concentrating on the task, without any physical operation of the robot. While working with EEG-based robotic co-workers on industrial tasks, EEG can reveal the intentions, decisions, and mental status of workers, thereby reducing the labor and knowledge required to operate the robots. Meanwhile, EEG-based robotic co-workers can provide work opportunities for people with disabilities who want to engage in industrial jobs and realize their social value.
In this study, we demonstrate an EEG-based BCI for a part-picking robotic co-worker. The robot picks up defective parts based on decisions collected directly from the brain activities of the operator while the operator inspects the quality of the parts. The development details of object extraction, robot motion planning, EEG acquisition, and EEG processing are described in the technical approach section. We propose a CNN-CCA method to improve the SSVEP classification accuracy. The model is validated and compared with existing methods in an offline experiment, and the BCI is further tested in an online experiment.
2 Background
The challenges for the development of robotic co-workers come from two aspects: physical human-robot interaction and cognitive human-robot interaction. The most critical problem to be solved by physical human-robot interaction is how to ensure human safety while collaborating with robots. Work in hardware design (e.g., lightweight robots) [3] and safety actuation [4,5] has greatly reduced the possible damage caused by human-robot collisions. In terms of software, robots are programmed to avoid collisions or to react actively in collisions [6,7]. Other contributions include safety and production optimization [8,9], human safety quantification [10–12], etc. Regarding cognitive human-robot interaction, research mainly focuses on new human-robot interfaces and human intention prediction. Gemignani et al. [13] developed a robot-operator dialogue interface that allows non-expert operators to interact with the robot through voice without knowing any internal representation of the robot. Sheikholeslami et al. [14] and Gleeson et al. [15] studied the intuitions of human observers about gestures to explore human-robot interaction gestures. However, with voice and gesture interfaces, operators still need to operate the robot during collaboration. To eliminate human operation, Beetz et al. [16] applied artificial intelligence to analyze the operator’s intentions by predicting the operator’s motion. Oshin et al. [17] predicted a set of future actions of a human from a Kinect video stream using a convolutional neural network (CNN) model.
Studies show that humans can achieve multi-dimensional robot control through BCIs with many different strategies and input modalities. Invasive BCIs can be used to implement accurate and complicated robot controls. Vogel et al. [18] demonstrated that human subjects could continuously control a robot arm to retrieve a drink container through an invasive BCI named BrainGate. However, invasive BCIs require surgery to place a chip on the brain. EEG-based non-invasive BCIs are capable of controlling various devices with a wearable headset. Edlinger et al. [19] built a virtual smart home where devices such as a TV, an MP3 player, and a phone can be controlled through a BCI using P300 and steady-state visually evoked potentials (SSVEPs). Riaz et al. [20] and Yin et al. [21] developed BCI language communication tools using a P300 speller and speech imagery. In robot control, Ying et al. [22] built an online robot grasp planning framework using a BCI to select the grasping target and grasping pose. Hortal et al. [23] trained a robot to touch one of four target areas by detecting four different mental tasks. Gandhi et al. [24] and LaFleur et al. [25] proposed mobile robot and quadcopter control interfaces for 2D and 3D navigation through motor imagery.
In our preliminary study [26], we demonstrated an EEG-based BCI for a robotic co-worker to pick up defective parts from a conveyor while the operator checks the quality of the parts. In this paper, we summarize the previous work and extend it by improving the robot grasp planning, EEG classification, and robot control.
3 Technical Approach
3.1 Overview.
The part-picking robotic co-worker integrates a manipulator (a 5-DOF KUKA youBot arm with a two-finger gripper), two cameras (Logitech HD cameras), a DC motor conveyor, and an EEG-driven BCI, as shown in Fig. 1(a). The BCI includes an EEG headset (B-Alert X24, 20 EEG channels) and an LED monitor for stimuli generation. The camera beside the front end of the conveyor (referred to as the front camera hereafter) is used to detect the moment when a new industrial part is loaded onto the conveyor and to take a photo of it at that moment. The monitor displays the photo to the operator as the visual stimulus. The operator inspects the quality of the part through the photo. The inspection result, collected and analyzed from the EEG, is sent to the robot as a removal decision. If notified that the part is defective, the robot will pick the part off the conveyor. The camera beside the rear end of the conveyor (called the rear camera) helps the robot find and pick up the defective part.
The detailed workflow is presented in Fig. 1(b). Once the front camera detects a new part, the part is registered in the Log thread as the ith part. The front camera takes a photo of it and stores it as P(i). The monitor is programmed with four blocks arranged as a 2 × 2 matrix to display photos. P(i) is displayed in one of the unoccupied blocks; if all blocks are occupied, the monitor clears a block to make it available. The photo is displayed for 10 s and then automatically cleared from the block. While the photo is displayed, the operator can inspect the quality of the part through the monitor. Meanwhile, the operator’s neural signal, i.e., the EEG, is collected by the EEG headset. The EEG is interpreted as a binary decision d(i) and stored in the Log thread as well. Here, d(i) = 0 represents a qualified part and d(i) = 1 a defective part. When the part moves to the rear end of the conveyor, and if d(i) = 1, the rear camera extracts its position μ(i) = (x, y) in real time to provide closed-loop feedback for the youBot to pick it up. The robot picks the part following a grasping plan G(i), which is generated when the part passes the front camera.
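As an illustration of the bookkeeping above, the following is a hypothetical sketch of the per-part record kept by the Log thread; the class and field names are ours, not taken from the implementation.

```python
# Hypothetical per-part record for the Log thread (names are illustrative only).
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class PartRecord:
    index: int                                      # i, assigned when the front camera sees the part
    photo: np.ndarray                               # P(i), photo taken by the front camera
    grasp_plan: Tuple[float, Tuple[float, float]]   # G(i) = (phi, delta), planned at the front camera
    decision: int = 0                               # d(i): 0 = qualified, 1 = defective
    position: Optional[Tuple[float, float]] = None  # mu(i) = (x, y), tracked by the rear camera
```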
The operator sits in front of the monitor and inspects the quality of the parts through the photos. Once he/she identifies a defective part, the operator should stare at its photo until it is marked with a green square (detection succeeded) or until the photo vanishes (detection failed). If the part is qualified, the operator should avoid staring at its photo for more than 2 s; otherwise, it may lead to a false positive in defective part identification.
3.2 Part-Detection.
We use a threshold method to extract parts from the background in the photos taken by the front and rear cameras. In pre-processing, a morphological opening with a disk-shaped structuring element (10-pixel radius) is applied to eliminate lighting effects. Pixels with intensity lower than 0.2 are clustered as the background. Then, we remove connected regions with an area of less than 20 pixels to reduce noise. The remaining regions are considered the extracted parts.
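The following is a minimal sketch of this extraction step, assuming scikit-image is used for the morphological operations (the paper does not name an image-processing library); the disk radius, intensity threshold, and minimum area follow the values stated above.

```python
# Threshold-based part extraction (sketch; the library choice is an assumption).
from skimage.measure import label
from skimage.morphology import disk, opening, remove_small_objects


def extract_parts(gray_image):
    """gray_image: 2D float array in [0, 1] from the front or rear camera."""
    smoothed = opening(gray_image, disk(10))     # morphological opening, 10-pixel disk
    foreground = smoothed >= 0.2                 # pixels with intensity < 0.2 -> background
    foreground = remove_small_objects(foreground, min_size=20)  # drop regions < 20 pixels
    return label(foreground)                     # labeled connected regions = extracted parts
```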
We use a binary signal to perceive when a part is loaded onto the conveyor and when a part passes the end of the conveyor. The binary signal is defined as the projection of the extracted regions along the conveyor’s moving direction. In the front camera, when a new step-down occurs, we record that a new part has been loaded onto the conveyor. Similarly, in the rear camera, if there is a new step-down in the binary signal, we record that a part has passed the end of the conveyor. Additionally, to split multiple parts in the same photo, we cut the photo into small segments at the midpoint between each step-down and the following step-up, so that the extracted regions are segmented into several sub-regions with each sub-region containing only one part. We denote the sub-region that contains only the ith part as S(i). This is a simple way to track the number of parts on the conveyor, but it requires that the parts be placed with some space between each other along the conveyor’s moving direction.
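A sketch of the projection signal and the splitting rule described above; it assumes the conveyor moves along the image columns and that `labels` is the labeled foreground from the extraction step (both assumptions for illustration).

```python
import numpy as np


def binary_projection(labels):
    """1 where any extracted region occupies a column, 0 otherwise (projection
    along the conveyor's moving direction, assumed to be the column axis)."""
    return (labels > 0).any(axis=0).astype(int)


def split_points(signal):
    """Cut columns at the midpoint between each step-down and the following
    step-up, so that each segment contains only one part."""
    edges = np.diff(signal)
    downs = np.where(edges == -1)[0]     # a part ends here
    ups = np.where(edges == 1)[0]        # the next part starts here
    cuts = []
    for d in downs:
        later_ups = ups[ups > d]
        if later_ups.size:               # there is another part further along
            cuts.append(int((d + later_ups[0]) // 2))
    return cuts
```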
3.3 Grasping Planning.
The geometric center μ(i) of each part can be easily tracked by averaging all the pixels in S(i). To obtain a generalized grasping plan for various industrial parts, we developed our own grasping algorithm, called the two-finger gripper random grasp (2FRG). It generates robust 3-DOF picking gestures for the two-finger gripper to pick various objects. It randomly samples multiple lines as possible finger-moving directions. Along each line, all the possible finger grasping positions are searched. We define a grasping position as a position where both fingers touch the object and there is enough space to insert the fingers. Among the grasping positions, only those that construct firm grasps are kept. A firm grasp requires each finger to have at least two points on both edges touching the object, or at least one point in the middle touching the object. We score each grasp by the distance between the midpoint of the gripper fingers and μ(i), with closer grasps scoring higher, and 2FRG returns the firm grasp with the highest score. A demonstration of a firm grasp is shown in Fig. 2.
In Fig. 3, grasping plans are calculated for five different parts with 200 grasping samples for each part. The optimal firm grasp for each part is shown in Fig. 3(c). The grasping positions are close to the geometric centers, and the gripper grasps the objects in comfortable directions. The pseudocode for 2FRG is presented in Algorithm 1. The grasping plan G(i) = (φ, δ) is constructed from the grasping direction φ and the offset δ from the grasping position to μ(i). Here, φ is the angle of the vector g1 − g2 and δ = (g1 + g2)/2 − μ(i), where g1 and g2 are the midpoints of the inner sides of the fingers.
Algorithm 1 Two-finger Gripper Random Grasp (2FRG)
Inputs:
    S(i): extracted region of one object, stored in the binary image I
    N: maximum number of sampling iterations
    L: geometric constraints of the gripper
Output:
    (g1, g2): positions of the gripper fingers

Repeat N times:
    Uniformly sample two points q1 and q2 in I
    Create the line l passing through q1 and q2
    {lf} ← the collection of all segments of l along which the gripper constraints L are satisfied (enough free space to insert the fingers)
    If {lf} is not empty:
        {(g1, g2)} ← the collection of finger positions on segments in {lf} such that (g1, g2) is a grasping position
        {(g1, g2)f} ← the subset of {(g1, g2)} such that (g1, g2) is a firm grasp
Return the pair in {(g1, g2)f} with the highest score, where the score ranks grasps by the distance between the grasp midpoint (g1 + g2)/2 and μ(i) (closer grasps score higher)
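As a small illustration of how the grasp plan is assembled from the output of 2FRG, the following sketch computes G(i) = (φ, δ) from the finger positions and the part center; the function and variable names are our own.

```python
import numpy as np


def grasp_plan(g1, g2, mu):
    """Compute G(i) = (phi, delta) from the finger midpoints g1, g2 and the
    part center mu (all 2D pixel coordinates)."""
    g1, g2, mu = (np.asarray(v, dtype=float) for v in (g1, g2, mu))
    d = g1 - g2
    phi = np.arctan2(d[1], d[0])        # grasping direction: angle of the vector g1 - g2
    delta = (g1 + g2) / 2.0 - mu        # offset from the grasping position to mu(i)
    return phi, delta
```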
3.4 Robot Motion Planning and Control.
3.5 Electroencephalography Acquisition.
Our EEG-based BCI is a non-invasive implementation. Non-invasive BCIs yield lower performance than invasive BCIs, but they are easy to wear and require no surgery. In our system, EEGs of operators are collected through a B-Alert X24 headset (Advanced Brain Monitoring, Carlsbad, CA), which has 20 electrodes positioned following the 10–20 system and a pair of reference channels. The sampling frequency is 256 Hz. The device is minimalistic and can be comfortably worn for an hour at a time without rewetting or reseating of the electrodes.
The decisions of the operator are identified through SSVEPs. SSVEPs are natural EEG responses to visual stimulation at specific frequencies. The signal can be triggered by looking at a flicker that flashes at a constant frequency; the EEG response will then contain a signal component at the same frequency as the flicker. Conversely, by monitoring the frequency components of the operator’s EEG, the system can recognize the flashing frequency of the flicker the operator is looking at. Based on this property of SSVEPs, we flash the photos of parts at different frequencies on the monitor, and from the EEG we can recognize which photo the operator is staring at. When the operator does not see any defective part on the monitor, he/she should avoid staring at a single photo so that there is no significant frequency component corresponding to any of the displayed photos. We call this situation the idle state. On the other hand, when a defective part is detected, the operator should stare at its photo until the photo is marked. In this case, a significant frequency component will be found in the EEG at the same frequency as the flashing frequency of the photo.
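To illustrate the principle (this is not the classifier used in the paper), the sketch below compares the EEG power at the four stimulus frequencies and declares the idle state when no frequency stands out; the channel averaging, the Welch estimator, and the decision margin are assumptions.

```python
import numpy as np
from scipy.signal import welch

FS = 256                                 # headset sampling rate (Hz)
STIM_FREQS = [6.0, 6.67, 7.5, 8.57]      # block flashing frequencies (Hz)


def dominant_stimulus(eeg_window, power_margin=2.0):
    """Return the index of the attended block, or None for the idle state.

    eeg_window: (n_samples, n_channels) EEG segment; power_margin is an assumed
    ratio of the peak power to the median power required to accept a detection.
    """
    f, pxx = welch(eeg_window, fs=FS, nperseg=min(len(eeg_window), 2 * FS), axis=0)
    pxx = pxx.mean(axis=1)                                   # average PSD over channels
    powers = [pxx[np.argmin(np.abs(f - fk))] for fk in STIM_FREQS]
    best = int(np.argmax(powers))
    if powers[best] > power_margin * np.median(powers):
        return best
    return None                                              # no clear SSVEP -> idle state
```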
The monitor is programmed to display the visual stimuli for generating SSVEPs. It displays photos in four square blocks. As shown in Fig. 4, the size of each block is 300 pixels × 300 pixels. The four blocks flash at 6 Hz, 6.67 Hz, 7.5 Hz, and 8.57 Hz, respectively. Once a new part is observed by the front camera, its photo is displayed in one of the blocks and flashes at the frequency corresponding to that block. The photo of the next observed part is displayed in the next available block. Each photo remains on the monitor for 10 s, and the monitor can display up to four photos at the same time. When a new part is observed but all the blocks are currently occupied, the first displayed photo is cleared to make a block available for the new photo. The flashing is programmed with Windows DirectX.
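These four frequencies are exact divisors of a 60 Hz refresh rate (60/10, 60/9, 60/8, and 60/7). Assuming a 60 Hz monitor, one common way to realize such flicker is frame-wise sinusoidal luminance modulation, as sketched below; the paper implements its flicker in Windows DirectX and does not state the exact scheme, so this is an assumption.

```python
import numpy as np

REFRESH_HZ = 60.0                                  # assumed monitor refresh rate
BLOCK_FREQS = [60 / 10, 60 / 9, 60 / 8, 60 / 7]    # 6, 6.67, 7.5, 8.57 Hz


def block_luminance(freq_hz, frame_index):
    """Per-frame luminance in [0, 1] for a block flashing at freq_hz, using
    sampled sinusoidal modulation of the photo's brightness."""
    return 0.5 * (1.0 + np.sin(2.0 * np.pi * freq_hz * frame_index / REFRESH_HZ))
```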
3.6 Electroencephalography Processing.
In our previous study, we demonstrated five SSVEP classification methods: canonical correlation analysis (CCA) [27], individual template-based CCA (IT-CCA) [28], the support vector machine (SVM) method, the power spectral density-based SVM method (PSD-SVM) [29], and the CCA-based SVM method (CCA-SVM) [26]. The validation dataset was collected in our offline experiment. The study showed that our proposed CCA-SVM method significantly outperformed the other comparison methods. However, the accuracy of the CCA-SVM method was still not satisfactory, especially for industrial applications. To further improve performance, we consider neural network models. In a previous study, Nik et al. [30] showed that convolutional neural networks (CNNs) and linear models perform significantly better than long short-term memory (LSTM) for SSVEP classification. Because EEG has a low signal-to-noise ratio and few training samples are available, pure LSTM and CNN models are easily overfitted. In our previous study, our model-driven CNN model Conv-CA [31], which combines a CNN structure and CCA, achieved the best performance on a 40-target SSVEP benchmark dataset [32]. However, the offline dataset we collected for this application does not have the phase consistency required by the Conv-CA model. Thus, we introduce a new SSVEP classification method, CNN-CCA, as a derivative of the Conv-CA model to address the phase issue in the dataset. CNN-CCA provides a substantial performance boost over the previously tested methods. The performance of our newly proposed CNN-CCA method is verified by comparison with the CCA, PSD-SVM, and CCA-SVM methods.
3.6.1 Canonical Correlation Analysis Method.
3.6.2 SVM-Based Methods.
3.6.3 Convolutional Neural Network Canonical Correlation Analysis Method (CNN-CCA).
Because the CCA-layer provides non-linear operations, the CNN layers use linear activation functions. The first layer of the CNN has 16 filters with 16 × 4 kernels. It convolves the EEG across all input channels (Nc = 4) over a short local time period (16 sampling points, or 23.4 ms). The second layer combines the 16 filters of the first layer together; it uses 1 × 4 kernels to weight the EEG from different channels. The third layer applies a 1 × 4 kernel with no padding (the first and second layers use zero padding to keep the outputs the same size as the inputs) to transform the data into a one-dimensional signal. At the end of the CNN layers, we apply dropout with a dropping rate of 5% for regularization. The detailed structure is shown in Fig. 5.
The CNN-CCA is implemented in Python (Keras with the TensorFlow backend). We use categorical cross-entropy as the loss function. The optimization uses the Adam algorithm (learning rate 1e-4, beta1 = 0.9, beta2 = 0.999, gradient clipping at 5) with a batch size of 32.
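A minimal Keras sketch of the CNN front end and the training setup described above. The input window length, the number of filters in the second and third layers, and the head of the network are assumptions; in particular, the custom CCA layer is not reproduced here, and a plain softmax layer stands in for it only to make the sketch runnable.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_C = 4          # EEG channels fed to the network
N_T = 512        # samples per window (e.g., 2.0 s at 256 Hz; assumed)
N_CLASSES = 5    # idle state plus the four stimulus frequencies

inputs = layers.Input(shape=(N_T, N_C, 1))
# Layer 1: 16 filters, 16 x 4 kernels, linear activation, zero padding.
x = layers.Conv2D(16, (16, 4), padding="same", activation="linear")(inputs)
# Layer 2: combines the 16 filters; 1 x 4 kernels weight the channels (zero padding).
x = layers.Conv2D(1, (1, 4), padding="same", activation="linear")(x)
# Layer 3: 1 x 4 kernel, no padding, collapsing the channel axis to one signal.
x = layers.Conv2D(1, (1, 4), padding="valid", activation="linear")(x)
x = layers.Dropout(0.05)(x)                    # 5% dropout for regularization
# In the actual CNN-CCA, a CCA layer correlates this signal with the SSVEP
# reference signals; the dense softmax below is only a placeholder head.
outputs = layers.Dense(N_CLASSES, activation="softmax")(layers.Flatten()(x))

model = models.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                                       beta_2=0.999, clipvalue=5.0),
    loss="categorical_crossentropy",
    metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, ...)
```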
4 Experimental Setup
We established an offline experiment to test the performance of the above SSVEP classification methods. In each trial, subjects were required to stare at a flashing photo for 15 s. The photo flashed at one of the frequencies 0 Hz, 6 Hz, 6.67 Hz, 7.5 Hz, and 8.57 Hz. The experiment consisted of five runs with five trials in each run. During the experiment, subjects wearing the EEG headset were required to stay still and blink as little as possible. Five subjects (age 25–35 years, four males, one female) participated in the experiment.
After the offline experiment, subjects 1, 2, and 3 participated in the online experiment. The online experiment required subjects to select 2 defective industrial parts out of 10 parts that were manually placed on the conveyor by another operator in random order. The online experiment consisted of three runs. All subjects successfully accomplished the task. Figure 6 shows the user interface and hardware during the online experiment.
4.1 Results.
Classification accuracies of all the methods at four different data lengths (also called time windows), i.e., 0.5 s, 1.0 s, 1.5 s, and 2.0 s, were used to evaluate the performance. The data were extracted with a step of 0.15 × the time window length. We compared our proposed CNN-CCA method with CCA, PSD-SVM, and CCA-SVM using leave-one-out cross validation. Specifically, one of the five trials of the EEG data was used as test data and the other four trials were used as the training dataset. We repeated the process five times so that every trial was tested.
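For reference, a short sketch of how windows can be cut from each 15 s trial with a step of 0.15 × the window length; the values follow the text, while the array layout is an assumption.

```python
import numpy as np

FS = 256  # sampling rate (Hz)


def sliding_windows(trial_eeg, window_s, step_fraction=0.15):
    """trial_eeg: (n_samples, n_channels) array for one trial; returns an array
    of shape (n_windows, window_samples, n_channels)."""
    win = int(round(window_s * FS))
    step = max(1, int(round(step_fraction * win)))
    starts = range(0, trial_eeg.shape[0] - win + 1, step)
    return np.stack([trial_eeg[s:s + win] for s in starts])

# Example: windows = sliding_windows(trial, window_s=1.0)  # 1.0 s windows, ~0.15 s step
```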
As shown in Fig. 7, the performance of the CNN-CCA was superior to the other three comparison methods across all five subjects and all tested time window lengths. Table 1 lists the average classification accuracies across all subjects. Compared to the most commonly used CCA method, the CNN-CCA improved the average classification accuracies in the time windows of 0.5 s, 1.0 s, 1.5 s, and 2.0 s by 31.43%, 23.00%, 16.25%, and 12.92%, respectively. In particular, for the 0.5 s time window of subject 5, the classification accuracy increased from 45.86% to 96.00%. Compared to the CCA-SVM method, which was the best method in our previous study, the classification accuracies improved by 16.26%, 9.84%, 6.96%, and 6.83% in the 0.5 s, 1.0 s, 1.5 s, and 2.0 s time windows, respectively. For subject 4, whose SSVEP had the lowest classification accuracy among the five subjects, the accuracy in the 2.0 s time window increased from 72.27% to 88.37%.
Table 1 Average classification accuracy (%) across all subjects for different time window lengths

| Method | 0.5 s | 1.0 s | 1.5 s | 2.0 s |
|---|---|---|---|---|
| CCA-SVM | 63.98 | 80.14 | 86.67 | 89.73 |
| PSD-SVM | 74.22 | 81.71 | 82.98 | 82.07 |
| CCA | 48.81 | 66.98 | 77.38 | 83.64 |
| CNN-CCA | 80.24 | 89.98 | 93.63 | 96.56 |
In our implementation, the most frequent class was the idle state (i.e., n = 1), since in most cases the industrial parts on the conveyor were good parts. Thus, the classification accuracy of the idle state was more important than that of the other classes. To check the performance of idle state classification, we calculated the confusion matrices and marked the accuracies of the idle state classifications in Fig. 8. When the time window is 0.5 s or 1.0 s, the PSD-SVM method had the best idle state classification among the four methods. However, its classification accuracy over all five classes was only 81.71% in the 1.0 s time window. As the length of the time window increased, the CNN-CCA method became the best idle state classification method. In the 2.0 s time window, its idle state classification accuracy was 95% and its average classification accuracy over all five classes was 96.56%.
Figure 9 shows the CNN-CCA applied to the recorded data of our previous online experiment, in which the SSVEPs were classified in 2 s time windows. Subjects 1 and 3 completed the experiment with all parts detected correctly. Subject 2 got one false positive case in the first and second runs. In the offline experiment, subject 2 had higher classification accuracy than subject 3; however, subject 2 had worse idle state classification, which caused more false positive cases.
4.2 Conclusions.
We developed an EEG-based BCI for a part-picking robotic co-worker, where an operator is able to collaborate with the robot and communicate defective parts without manually operating the robot. The robot removes defective parts from the conveyor based on mental commands from the operator. The decisions were extracted through SSVEPs and sent to the robot. We proposed a new CNN-CCA method to classify SSVEPs. Its performance was verified on our offline experiment data and compared with the existing CCA, CCA-SVM, and PSD-SVM methods on 0.5 s, 1.0 s, 1.5 s, and 2.0 s windows of EEG data. Our CNN-CCA was found to be better than all other tested methods for every time window length. The average classification accuracies across all five subjects were 80.24%, 89.98%, 93.63%, and 96.56% for the 0.5 s, 1.0 s, 1.5 s, and 2.0 s time window lengths, respectively. We then established an online experiment with a 2.0 s time window length. The average defective part inspection success rate was 93.33% using the CNN-CCA method. BCI-based systems have the potential to become a new communication pathway between humans and robots in many manufacturing applications in the future.
Acknowledgment
This work was supported by the National Science Foundation Award Number: 1464737.
Data Availability Statement
The authors attest that all data for this study are included in the paper.