Customization is a growing trend in the fashion product industry, reflecting individual lifestyles. Previous studies have examined virtual footwear try-on in augmented reality (AR) using a depth camera. However, the reliance on a depth camera restricts the practical deployment of this technology. To address this limitation, this research proposes estimating the 6-DoF pose of a human foot from a color image using deep learning models. We construct a training dataset consisting of synthetic and real foot images that are automatically annotated. Three convolutional neural network models (DOPE, DOPE2, and YOLO6d) are trained on the dataset to predict the foot pose in real time. Model performance is evaluated in terms of accuracy, computational efficiency, and training time. A prototype system implementing the best model demonstrates the feasibility of virtual footwear try-on using an RGB camera. Test results also indicate that real training data are necessary to bridge the reality gap in estimating the human foot pose.

