Autonomous driving and other computer vision applications require high-performance three-dimensional (3D) human pose estimation, the task of estimating the joint locations of the human body from an image or video. Because the field is still maturing and lacks sufficiently large 3D in-the-wild datasets, progress toward an optimized solution has been limited. Most existing methods struggle with pose estimation on in-the-wild images because they are trained on synthetic datasets that are far from real-life scenarios.
Even the otherwise compelling Mo2Cap2 and xR-EgoPose methods suffer severe accuracy degradation for the same reasons: synthetic training data and the lack of large-scale in-the-wild images. The accuracy of these methods drops when human body parts interact with the surroundings. The image below shows a significant error in the predicted body pose during interaction with an external object: although the input image, together with the external reference, provides enough information to recover an accurate egocentric pose, the Mo2Cap2 output is visibly distorted, signifying the drop in performance. The proposed method, by contrast, is more accurate in such real-life situations because of its improved in-the-wild training data.
A team of researchers from MPI Informatics, Saarland Informatics Campus and Facebook Reality Labs has proposed a new egocentric pose estimation method trained on a new in-the-wild dataset with weak external supervision. To bring the data closer to real-life scenarios, the in-the-wild egocentric dataset used in this work is captured by a head-mounted fisheye camera together with an external camera; the resulting dataset is named Egocentric Poses in the Wild (EgoPW).
Did the in-the-wild egocentric dataset help with the estimation method?
In the paper “Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision,” the researchers introduce a new optimization method that generates pseudo labels for the new egocentric dataset by adding supervision from an external camera. Trained under this weak supervision from the external view, the method outperforms existing approaches on in-the-wild egocentric data, especially when the surroundings interfere severely with the human body parts.
Building the new method started with the creation of the EgoPW dataset, which is then labeled using the multi-view-based optimization method. Once the dataset is ready, the network is trained on both the synthetic dataset from the Mo2Cap2 research paper and the real EgoPW data, using a domain classifier to minimize the difference between the synthetic and real domains. The complete overview of the method can be seen in the image below.
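The article does not reproduce the training code, but domain adaptation with a domain classifier is commonly implemented with a gradient reversal layer: the classifier learns to distinguish synthetic from real features, while the reversed gradient pushes the shared feature extractor toward domain-invariant features. A minimal PyTorch sketch under that assumption (the feature dimension and layer sizes below are illustrative, not values from the paper):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so the feature extractor is trained to *confuse*
    the domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Reverse and scale the gradient flowing back to the features.
        return -ctx.lam * grad_out, None

class DomainClassifier(nn.Module):
    """Predicts a synthetic-vs-real logit from pose features.
    feat_dim and hidden size are illustrative assumptions."""
    def __init__(self, feat_dim=256, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, feats):
        return self.net(GradReverse.apply(feats, self.lam))
```

In training, the classifier's binary cross-entropy loss is simply added to the pose loss; the reversal layer takes care of making that loss adversarial for the backbone.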
For evaluation, the researchers also include real-world datasets from Mo2Cap2 and Wang et al.: the Mo2Cap2 test set consists of 2.7k frames of two people performing indoor and outdoor activities, while the latter dataset contains 12k frames of two people in a studio. “To measure the accuracy of our pseudo labels, we evaluate our optimization method only on the dataset from Wang et al., since the Mo2Cap2 dataset does not include the external view,” the researchers explain. “To evaluate our method on the in-the-wild data, we also conduct a qualitative evaluation on the test set of the EgoPW dataset,” they add.
Results are reported on two metrics, PA-MPJPE and BA-MPJPE, which measure the error of a single body pose after alignment (lower is better). In the first stage of the pipeline, pseudo-label generation, more accurate labels mean better downstream network performance. Evaluated on Wang et al.'s dataset, the pseudo labels show a substantial improvement in accuracy thanks to leveraging both the egocentric and external views during optimization. The researchers then compared 3D pose estimation performance on Wang et al.'s dataset and the Mo2Cap2 dataset. Among single-frame-based methods on Wang et al.'s test set, the proposed method outperforms Mo2Cap2 by 20.1% and xR-EgoPose by 27.0%. On the Mo2Cap2 test set, the researchers note, “our method performs better than Mo2Cap2 and xR-egopose by 8.8% and 4.2%, respectively.”
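PA-MPJPE (Procrustes-aligned mean per-joint position error) factors out global scale, rotation, and translation before measuring joint error, so it scores the shape of the pose itself. A small NumPy sketch of the standard computation (the joint count in the usage example is illustrative):

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: find the similarity transform
    (scale, rotation, translation) that best aligns the predicted
    pose to ground truth, then return the mean per-joint Euclidean
    error. pred, gt: (J, 3) arrays of joint coordinates."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g          # center both poses
    U, S, Vt = np.linalg.svd(p.T @ g)       # cross-covariance SVD
    R = Vt.T @ U.T                          # optimal rotation
    if np.linalg.det(R) < 0:                # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()        # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

By construction, a prediction that differs from the ground truth only by a rigid transform and uniform scale scores (near) zero, which is why PA-MPJPE isolates articulation error from global placement error.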
Under severe occlusions, the method outperforms existing state-of-the-art methods both quantitatively and qualitatively. For future work, the researchers aim to develop a video-based method for estimating temporally consistent egocentric poses on an in-the-wild video dataset, and to improve the capture setup with additional sensors such as IMUs and depth cameras.
The research paper is publicly available on arXiv and can be accessed here.
 Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, Diogo Luvizon, Christian Theobalt: Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision. arXiv:2201.07929 [cs.CV]
 Weipeng Xu, Avishek Chatterjee, Michael Zollhoefer, Helge Rhodin, Pascal Fua, Hans-Peter Seidel, Christian Theobalt: Mo2Cap2: Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera. arXiv:1803.05959 [cs.CV]
 D. Tome, P. Peluse, L. Agapito and H. Badino: "xR-EgoPose: Egocentric 3D Human Pose from an HMD Camera," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7727-7737. DOI: 10.1109/ICCV.2019.00782
 Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, Christian Theobalt: Estimating Egocentric 3D Human Pose in Global Space. arXiv:2104.13454 [cs.CV]