Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation


Abstract in English

In the presence of annotated data, deep human pose estimation networks yield impressive performance. Nevertheless, annotating new data is extremely time-consuming, particularly in real-world conditions. Here, we address this by leveraging contrastive self-supervised (CSS) learning to extract rich latent vectors from single-view videos. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally-distant ones as negative pairs as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and away features, yields a rich latent space, well-suited for human pose estimation. Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques.

Download