Appearance-based localization is a robot self-navigation technique that integrates visual appearance and kinematic information. To analyze the visual appearance, we need to build a regression model based on extracted visual features from raw images as predictors to estimate the robot's location in two-dimensional (2D) coordinates. Given the training data, our first problem is to find the optimal subset of the features that maximize the localization performance. To achieve appearance-based localization of a mobile robot, we propose an integrated localization model that consists of two main components: the group least absolute shrinkage and selection operator (LASSO) regression and sequential Bayesian filtering. We project the output of the LASSO regression onto the kinematics of the mobile robot via sequential Bayesian filtering. In particular, we examine two candidates for the Bayesian estimator: the extended Kalman filter (EKF) and particle filter (PF). Our method is implemented in both indoor mobile robot and outdoor vehicle equipped with an omnidirectional camera. The results validate the effectiveness of our proposed approach.
Localization of a robot relative to its environment using visual information has received extensive attention over past few decades from the robotic and computer vision communities . Autonomous unmanned aerial vehicles with vision-based navigation and localization [2,3] to cope with global positioning system (GPS)-denied environments play major roles in tasks such as search and rescue, environment monitoring, and surveillance [4,5]. Self-driving cars could benefit from GPS-denied geo-localization using visual odometry . The appearance-based localization could be useful in indoor navigation , as well as in minimizing levels of location uncertainties in mobile sensor networks for regression of environmental fields [8,9].
The appearance-based method is based on a supervised learning framework that utilizes collections of location coordinates and visual observations. Models are learned from a large number of images at the training step, then newly collected images are recognized . In related studies, gradient orientation histograms  and low dimensional representation of vision data  are used to localize mobile robots. In Ref. , a visual potential is used for navigating a mobile robot. A method for recognizing scene categories by comparing the histograms of local features is presented in Ref. . Additionally, the speeded up robust feature (SURF) has been often used for classification type problems such as topological localization (i.e., image matching) . In Ref. , the authors developed an X-ray vision system to build a map of a totally unseen territory by utilizing the Wi-Fi signal. While the careful feature engineering for such applications determines the ultimate performance, optimal feature selection for appearance-based localization has not been fully investigated to date. Bellotto et al. proposed an appearance-based localization using digital zoom and visual compass in order to integrate the visual appearance with odometry information .
In most of the cases, utilizing the full set of features or manual feature engineering is unlikely to yield optimal outcomes. Therefore, selecting the most effective features is critical in two ways: (1) to maximize the performance of the tasks and (2) to eliminate redundant features. For example, in localization tasks, the selected features must be robust to local noise and sensitive to different locations over the spatial field. One example of widely used feature selection technique is the principal component analysis, which is utilized to find the mapping from image features to the robot locations . There are a number of methods to determine the significant variables in principal component analysis. However, since those methods are unsupervised learning, the selection process might not be optimized for the localization performance. Thus, we focus our investigation on optimal feature selection in a supervised learning fashion.
Appearance-based localization shall take into account kinematics of the mobile robots. Sequential Bayesian filtering approaches such as the Kalman filter and the particle filter (PF) have been applied extensively in vehicle localization. The Kalman filter  is based on the assumption that the system is linear, and the noise processes are distributed as Gaussian distributions. Thus, tracking the state distribution of a vehicle can be computed efficiently by propagating only the first and second moments of the distribution. For nonlinear systems, an approximation is made by an extended Kalman filter (EKF) in which the nonlinear system is linearized at every time-step such that the Kalman filtering can be implemented. On the other hand, the PF  approximates the real distribution via a Monte Carlo (MC) approximation using particles. Recently, these methods have been widely applied in GPS-based localization applications  or in cases where there are explicit observation models, for instance, mapping from Wi-Fi signal to positions . However, the visual observation is high-dimensional and nonlinear, which makes directly applying filtering approaches impossible. Thus, a common solution is to extract low-dimensional features from the raw images then fit these features to the position of the vehicle . Hence, in this paper, we plan to utilize the least absolute shrinkage and selection operator (LASSO) technique to select localization features from raw images in order to project onto the kinematics of mobile robots. The LASSO regression is chosen to perform both variable selection and regularization in order to enhance the prediction quality .
The recent research efforts that are closely related to our problem are summarized as follows. The location for a set of image features from new observations is inferred by comparing new features with the calculated map [23,24]. There exists effort on automatically finding the transformation that maximizes the mutual information between two random variables . Using Gaussian process regression, Brooks et al.  and Schairer et al.  present effective approaches to build a map from a sparse set of noisy feature observations taken from known locations using an omnidirectional camera to infer the location. Building on Brook's approach , the work in Ref.  expands it more to the purpose of feature extraction and selection in order to improve the quality of localization. The localization is then based on a maximum likelihood estimation with a feature map from optimally selected visual features using Gaussian process regression .
In this paper, we first unwrap the omnidirectional images into panoramic ones and extract three types of visual features—fast Fourier transform (FFT), color histogram, and SURF. We concatenate all different types of features together and normalize the feature vectors to eliminate dominancy in values among different features. We then treat the robot's coordinates as multivariate targets and train the group LASSO  regression model using the training data set. Once the model is trained, we feed the newly captured image into the model to obtain an estimated position. Next, we treat the estimated position as a noisy observation and apply sequential Bayesian filtering, e.g., EKF or PF, in order to project onto vehicle's kinematics and to filter out the noise from the group LASSO regression outcomes. Preliminary results for indoor and outdoor experiments were reported in Refs.  and , respectively.
The contribution of our paper is as follows. In contrast to approaches used in Refs.  and , we focus on building a direct mapping from features space to the coordinates instead of solving for locations from the feature map via maximum likelihood estimation. Therefore, the test phase is executed in a short period of time since our method does not require any image matching or optimization steps. Second, our method is robust to visual noise and does not need high resolution images, which allows it to be applicable in low-cost embedded systems. For instance, we use relatively low quality and noise-prone images captured from a regular webcam (Logitech C270, Logitech, Newark, CA), yet the system yields reasonable results. Furthermore, in contrast to Ref. , our proposed method does not require the depth measurements of the SURF points.
The overall structure of this paper is as follows: We introduce the visual features that are used in this study in Sec. 2. We then discuss the LASSO regression in Sec. 3 and the group LASSO regression in Sec. 4. In Sec. 5, we show how to integrate the EKF into the LASSO-based localization performance. Experiment results are discussed in Sec. 6. For notations, we use for the entry of row i, column j, for the ith row of the matrix A, and as the normal distribution with zero mean and variance .
Inspired by a biological observation (e.g., insects and arthropods), recently omnidirectional cameras have been studied for navigation . In contrast to conventional cameras with restricted fields of view, panoramic cameras can provide the entire surrounding environmental view in a single image . In order to extract useful features from an omnidirectional image, we first unwrap it as follows. The raw image captured by the omnidirectional camera is a nonlinear mapping of a panoramic view onto a doughnut shape. We recover the panoramic view by applying a reverse mapping of the pixels from the wrapped view onto a panoramic view . Figures 1(a) and 1(b) show the wrapped and unwrapped images, respectively.
From the panoramic images, we extract three types of visual features. We will use the notation generally for the collection of all types of the image features that are extracted from image i. In particular, we use the FFT coefficients, the histogram, and SURF as the image features . These feature types and their properties are briefly explained below:
where is the ith two-dimensional realized image and j is the imaginary unit. To use the FFT, we convert the panoramic color images to gray scale 128 × 128 pixel images, i.e., . After applying the FFT on the resized spatial image, we obtain the same-sized image in the frequency domain, i.e., . Each complex point on the frequency domain is characterized by its magnitude and phase. However, it has been shown that the magnitude of the Fourier transformed image captures most of the information of the repeated geometric patterns in the spatial domain image . Furthermore, the magnitude is robust to the rotation of the vehicle in the yaw angle. Finally, it was shown that the first 15 components of the FFT frequency spectrum carry enough information to correctly match a pair of images . Based on the above knowledge, we specify the magnitude of the first 16 FFT components of the x axis, i.e., to be our FFT features. Figure 1(c) shows the two-dimensional FFT magnitude plot of an image.
Histogram: The image histogram  is a type of histogram that acts as a graphical representation of the tonal distribution in a digital image. The number of pixels in each tonal bin of the image histogram is used as an image feature. A histogram provides a global descriptor for an image and is also robust to rotation in yaw angle of the robot.
Speeded-up robust features: SURF  is a powerful scale- and rotation-invariant descriptor that is obtained from Haar wavelet responses. It is widely used for computer vision tasks such as object recognition, image registration, and classification. For a point of interest (namely, a SURF point), the feature is a 64-dimensional vector that is calculated locally based on the neighborhood region around the point. To form a feature vector, a patch of 16 × 16 pixels is divided into groups of pixels (4 × 4 grid). For each element of the 4 × 4 cell, the sum of gradients and their absolute values are computed for each direction, i.e., ∑dx, , ∑dy, and . Thus, for each SURF point, there is a feature vector of dimension .
However, since the number of SURF points vary across images, we apply an additional step to achieve the SURF feature for each image as follows. The SURF points from the whole data set are accumulated and put into the k-means clustering . Each centroid is defined as a codeword and the collection of centroids is defined as the codebook. Each SURF point is mapped into the index of the nearest centroid in the codebook. Therefore, we obtain a histogram of codewords for each image that indicates the appearance frequency of all codewords in the image. The number of bins of the histogram is the size of visual features for the image. In this paper, we use a 100 bins histogram.
For the indoor environment, we utilize the FFT and histogram, since they are more robust to the dynamic change of the surrounding compared to the SURF. For the outdoor experiment, since a large portion of the image is covered by uniform tonal pixels of the sky, the FFT and histogram features are not sensitive to the change in surrounding scenes. The SURF features capture the local properties of points of interest in an image other than the global properties like the other two. Therefore, we use the SURF for the outdoor experiment. Features from different types are concatenated and normalized.
The Least Absolute Shrinkage and Selection Operator
where is a penalty parameter. The level of shrinkage applied to the estimated , i.e., the number of zero entries of , is determined by λ. For instance in Fig. 2(a), the lines show the evolution of elements of the vector as we change λ from 0 (no penalty applied) to a final value (when all entries are zeros). Increasing λ makes sparser as more of its elements are driven toward 0 as shown in Fig. 2(a). The collection of vectors 's that correspond to different values of λ's is obtained from the training data set. It is then applied to the validation data set to form the validation error curve. An example of such an error curve obtained for a range of λ is shown in Fig. 2(b). Note that the curve has a minimum near λ = 27, which is the optimal value of λ. This value achieves a good balance between the goodness of fit from the least squares estimation and the sparseness of for model selection. In this study, we use the root-mean-squared error (RMSE) to compare two trajectories so the optimal λ is chosen at the minimum of the RMSE curve.
The Group Least Absolute Shrinkage and Selection Operator
In Sec. 3, the estimated target yi is a scalar quantity. However, since the two coordinates of the robot's location need to be estimated simultaneously, we introduce the group LASSO regression  to estimate a multivariate target.
Define operator as .
where ⊗ is the Kronecker product and is the weight matrix,3 i.e., , .
Note that the regularization penalizes the norm of the rows of estimate matrix in Eq. (5), which are corresponding to the two coordinates. Thus, features that are not significant for the localization of both coordinates are more likely to be eliminated.
The Extended Kalman Filter
In this section, we show how to implement the EKF on the group LASSO-based localization as a postprocessing step. We describe the bicycle kinematics of a front wheel-driven vehicle for the outdoor experiment. For the indoor experiment, we utilize the unicycle kinematics, and the rest of the EKF shall stay unchanged.
where ηi is the robot's orientation.
The initial value for the covariance matrices in the EKF is determined as follows. For the observation covariance matrix, we set it to be , where σobs is obtained from the RMSE of the group LASSO regression and I is a 2 × 2 identity matrix. This shows that the output of the group LASSO regression is used as the input to the filter. Similarly, we set the initial value for the process covariance matrix to be , where σprocess is obtained from the RMSE of the open-loop-based localization.
The Particle Filter.
The PF uses a sampling approach, with a set of particles to compute the posterior distribution of a stochastic process given noisy observations without any assumptions of linearity or Gaussianity. We construct a separate study using the PF to replace EKF step using the outcomes of group LASSO regression based localization.
In addition, the observation density can be computed from Eq. (12) as follows: Note that even we started with an attempt to find the state's posterior distribution for the whole time horizon, i.e., , Eq. (20) suggests that information about and is “absorbed” in the weight and there is no need to store them. In other words, we only need to propagate to iteration i + 1.
In this section, we show how we implement our proposed method in an indoor experimental setup. The mobile robot and the testing environment are shown in Fig. 3(a).
We control the robot remotely via wireless communication units, while the panoramic vision of the surrounding scene is recorded. The average speed of the robot is . In order to track the true positions of the robot, we utilize an overhead camera (Logitech HD webcam C910, Logitech, Newark, CA) that is mounted on the ceiling. Note that the positions obtained from the ceiling camera are only used for evaluating the method's performance. In summary, we collect the following three data sets all sampled at 0.5 Hz: the command inputs, turn rates, and collapsed time between samplings ; the surrounding scene recorded in the panoramic vision; and the positions of the mobile robot (ground truth).
We compare the localization results performed by the following three techniques:
Open-loop based localization: We naively apply the recorded command inputs to the robot's kinematics, which is described in Eq. (8) to obtain the robot's position.
Group LASSO-based localization: We randomly sample without replacement of the whole data set and categorize them into three labels: training, validation, and testing by corresponding percentages: 50%, 25%, and 25%. Then, we apply the group LASSO regression to the labeled data and compare the performance on the test data with other techniques.
Group LASSO-based and EKF localization: We apply the EKF as a postprocessing to the test result of the group LASSO-based localization as described in Sec. 5.
Group LASSO-based and PF localization: We apply the PF  as a postprocessing to the test result of the group LASSO-based localization.
For the indoor experiment, the test data are the collection of locations that are randomly chosen from the original trajectory. Therefore, when we run the EKF and the PF to estimate the trajectory of the robot, we assume that the robot does not take measurements at the nontest locations, i.e., estimation with missing observations. If an observation is made, the EKF executes the update step by Eq. (13). Otherwise, it predicts the state vector by open-loop kinematics and propagates the covariance matrix by Eq. (10). Similarly, for the PF, the weights of the filter density are updated once an observation is made. Otherwise, a uniform probability is assigned for all particles .
To illustrate the estimation with missing observation by the EKF, Fig. 4 shows the evolutions of the first and second entries of the diagonal of matrix Pi ( and ) that represent the uncertainties in the x and y axes. Notice that the variance of position coordinates drops whenever an observation, which is indicated by a vertical dashed line in Fig. 4, is made. This simulates a practical situation where the localization observations are mostly missing and very sparse due to line-of-sight or GPS-denied environments.
Table 1 compares the localization results performed by open-loop localization, group LASSO-based localization, group LASSO-based with EKF localization, and group LASSO-based with PF localization which are denoted as “Open-loop,” “Group LASSO,” “Group LASSO + EKF,” and “Group LASSO + PF,” respectively.
Table 1 shows the RMSEs of the four methods. The open-loop prediction yields the worst performance, since there is no feedback in the prediction, the error is accumulated. The group LASSO-based localization reduces 55% features and yields 29% lower RMSE compared to the open-loop method. By applying the EKF, the group LASSO-based and EKF localization results in the lowest RMSE with 55% reduction compared to the open-loop case. Figure 5 shows the true trajectory (solid line), group LASSO-based localization (square dots), group LASSO and EKF localization (red dotted dashed line), and group LASSO and PF localization (dotted line). Table 1 has indicated the excellent performance of our proposed method.
In this section, we demonstrate the effectiveness of our proposed method using an outdoor experimental study.
Figure 6 shows the data acquisition circuit and the surveillance vehicle. We drive a vehicle (Acura RSX 2003) in Michigan State University campus and record the panoramic scene via a omnidirectional camera installed on the top of the vehicle. The average speed of the vehicle is . We collect three data sets, which are sampled at 1 Hz: (1) the estimated control inputs and sampling times , (2) the scene recorded by the omnidirectional camera that is stored in the Raspberry Pi (Raspberry Pi model B+, Raspberry Pi Foundation, Cambridge, UK), and (3) the positions of the vehicle that are tracked by the Xsens GPS (MTi-G-700, Xsens Technologies B.V., Enschede, The Netherlands, position accuracy 2.5 m).
We compare the localization performances based on the following three techniques:
Open-loop based localization: We naively apply the open-loop prediction by using the kinematics described in Eq. (8) and the command inputs.
Group LASSO-based localization: We record two separated trajectories for training and testing, as shown in Fig. 7. The original data are divided for training and validation. The group LASSO regression uses the training data to train the estimate matrix , the validation data to compute the optimal matrix , and then applies to estimate the test data .
Group LASSO and EKF localization: We apply the EKF as a postprocessing to the test result of the group LASSO-based localization as described in Sec. 5.
Group LASSO and PF localization: We apply the PF as a postprocessing to the test result of the group LASSO-based localization.
For this outdoor experiment, we also use the RMSE to evaluate the performance of the three techniques described earlier.
Table 1 shows the RMSEs of the three methods (column 2) and the number of used features over the initial total (column 3). The group LASSO-based localization reduces 75.5% of the whole features. By applying the EKF, the group LASSO-based localization with the EKF results in the best RMSE with 30.6% reduction compared to the open-loop method. Figure 8 shows the true trajectory (solid line), group LASSO-based localization (square dots), group LASSO and EKF localization (dotted dashed line), and group LASSO and PF localization (dotted line).
The estimation step of the group LASSO-based localization shown in Eq. (4) is linear regression. Thus, its estimation error has a smaller bias compared to the open-loop prediction, which gradually deviates from the true trajectory, but exhibits high variance. Therefore, we utilize the EKF (as well as PF) as a low-pass filter to smooth the estimated trajectory by projecting the noisy outputs of the group LASSO-based localization onto the vehicle kinematics. Figure 9(a) shows the evolution of the group LASSO regression estimate matrix . Each pair of features that is corresponding to one location (two coordinates) is plotted in the same color. The optimal is plotted in Fig. 9(b). Note that the matrix is sparse with a large number of entries having values of zero.
For comparison, we also implement the PF  that takes the group LASSO output as measurements. Since the PF involves random generation of the particles, by adapting the evaluation criteria from Ref. , we evaluate the performance of the PF by the MC simulation with 100 runs. The statistics of the MC are reported in Table 1 with the mean and standard variation denoted by . The estimated path by the group LASSO and PF localization with lowest RMSE among the MC simulations is plotted in dotted line in Fig. 8. A snapshot of the PF at sampling time t = 50 is shown in Fig. 10. The weight densities of 800 particles are plotted in gray-scale colored dots. Note that the sum of weight densities over all particles equals to 1. The PF estimated location is computed by averaging the locations of all particles with the corresponding weight densities.
The computational time is reported in Table 1 in seconds. Since the filtering processes (EKF and PF) are deployed as the postprocessing after the group LASSO, we report their test phase computational times as an additional part to the group LASSO regression, i.e., after the “+” sign in the last column of Table 1. Generally, the train phase requires 30 min for both indoor and outdoor data, while the test phase is significantly shorter. The prediction process of the group LASSO includes one matrix multiplication in Eq. (4) and the visual features extraction. In general for one data point, extracting the SURF, FFT, and histogram features take 0.68, 0.25, and 0.17 s, respectively. Since the SURF feature involves clustering, it requires more time to extract compared to the FFT and histogram. Therefore, the test phase computational time in the outdoor experiment is longer than in the indoor case.
A visual summary of the RMSEs of the two experiments is shown in Fig. 11. In the indoor experiment, the open-loop localization RMSE level is higher than that of the vision-based localization. The group LASSO and Kalman filtering obtained the best RMSE result. On the other hand, in the outdoor experiment, the open-loop localization RMSE level is quite small. This is due to the fact that (in contrast to the indoor mobile robot with known control inputs) we estimated the control inputs by fitting the vehicle model to the collected trajectory data optimizing the goodness-of-the fit between the model and real trajectories. In this case, the group LASSO and Kalman filtering also outperforms others. The RMSE of the group LASSO in the indoor case outperforms that of the outdoor case. This may be due to the fact that the sky takes a large portion of the outdoor scene providing not much dependable features for localization.
The paper presented a novel appearance-based approach based on group LASSO regression. We have shown the effectiveness of our method in feature selection and localization enhancement by combining the group LASSO regression and the sequential Bayesian filters. The experiment study shows the significant reduction (75.5%) in the features while improving the localization performance. The performance of the system can be further improved by adapting a better vision camera as well as larger training data. Additionally, it would be possible to improve the performance if we use a Bayesian filter considering the vehicle dynamics for the real car outdoor experiments.
The linearity assumption used in group LASSO regression could restrict the localization performance due to the nonlinear nature of the map from visual features to the robot positions. Therefore, a potential future study is to apply variable selection with a nonlinear regression technique, such as Gaussian process regression, to relax the linearity condition.
Our method could be used potentially as another locational sensor in a sensor fusion part for lightly equipped unmanned vehicles that are often facing GPS-denied environments.
National Research Foundation of Korea, Ministry of Science and ICT (2016R1A2B4008237).
Ministry of Trade, Industry and Energy, MOTIE, South Korea (Technology Innovation Program 10073129).
National Science Foundation (CAREER Award CMMI-0846547).
Vietnam Education Foundation Fellowship.
Yonsei University (New Faculty Research and Facility Grant).
The Frobenius norm of a matrix A is defined by .
and are the identity and zero matrices of size k × k, respectively.