CoKe: Localized Contrastive Learning for Robust Keypoint Detection


Abstract in English

Todays most popular approaches to keypoint detection involve very complex network architectures that aim to learn holistic representations of all keypoints. In this work, we take a step back and ask: Can we simply learn a local keypoint representation from the output of a standard backbone architecture? This will help make the network simpler and more robust, particularly if large parts of the object are occluded. We demonstrate that this is possible by looking at the problem from the perspective of representation learning. Specifically, the keypoint kernels need to be chosen to optimize three types of distances in the feature space: Features of the same keypoint should be similar to each other, while differing from those of other keypoints, and also being distinct from features from the background clutter. We formulate this optimization process within a framework, which we call CoKe, which includes supervised contrastive learning. CoKe needs to make several approximations to enable representation learning process on large datasets. In particular, we introduce a clutter bank to approximate non-keypoint features, and a momentum update to compute the keypoint representation while training the feature extractor. Our experiments show that CoKe achieves state-of-the-art results compared to approaches that jointly represent all keypoints holistically (Stacked Hourglass Networks, MSS-Net) as well as to approaches that are supervised by detailed 3D object geometry (StarMap). Moreover, CoKe is robust and performs exceptionally well when objects are partially occluded and significantly outperforms related work on a range of diverse datasets (PASCAL3D+, MPII, ObjectNet3D).

Download