A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

91 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Christoph Feichtenhofer

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Christoph Feichtenhofer - Haoqi Fan - Bo Xiong

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at https://github.com/facebookresearch/SlowFast

قيم البحث

105 - Xiaohua Zhai , Joan Puigcerver , Alexander Kolesnikov 2019

Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, reconstruction error). We present the Visual Task Adaptation Benchmark (VTAB), which defines good representations as those that adapt to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale study of many popular publicly-available representation learning algorithms. We carefully control confounders such as architecture and tuning budget. We address questions like: How effective are ImageNet representations beyond standard natural datasets? How do representations trained via generative and discriminative models compare? To what extent can self-supervision replace labels? And, how close are we to general visual representations?

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي التعلم الالي

MVCNet: Multiview Contrastive Network for Unsupervised Representation Learning for 3D CT Lesions

165 - Penghua Zhai , Huaiwei Cong , Gangming Zhao 2021

emph{Objective and Impact Statement}. With the renaissance of deep learning, automatic diagnostic systems for computed tomography (CT) have achieved many successful applications. However, they are mostly attributed to careful expert annotations, whic h are often scarce in practice. This drives our interest to the unsupervised representation learning. emph{Introduction}. Recent studies have shown that self-supervised learning is an effective approach for learning representations, but most of them rely on the empirical design of transformations and pretext tasks. emph{Methods}. To avoid the subjectivity associated with these methods, we propose the MVCNet, a novel unsupervised three dimensional (3D) representation learning method working in a transformation-free manner. We view each 3D lesion from different orientations to collect multiple two dimensional (2D) views. Then, an embedding function is learned by minimizing a contrastive loss so that the 2D views of the same 3D lesion are aggregated, and the 2D views of different lesions are separated. We evaluate the representations by training a simple classification head upon the embedding layer. emph{Results}. Experimental results show that MVCNet achieves state-of-the-art accuracies on the LIDC-IDRI (89.55%), LNDb (77.69%) and TianChi (79.96%) datasets for emph{unsupervised representation learning}. When fine-tuned on 10% of the labeled data, the accuracies are comparable to the supervised learning model (89.46% vs. 85.03%, 73.85% vs. 73.44%, 83.56% vs. 83.34% on the three datasets, respectively). emph{Conclusion}. Results indicate the superiority of MVCNet in emph{learning representations with limited annotations}.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي التعلم الآلي

Large-Scale Unsupervised Deep Representation Learning for Brain Structure

138 - Ayush Jaiswal , Dong Guo , Cauligi S. Raghavendra 2018

Machine Learning (ML) is increasingly being used for computer aided diagnosis of brain related disorders based on structural magnetic resonance imaging (MRI) data. Most of such work employs biologically and medically meaningful hand-crafted features calculated from different regions of the brain. The construction of such highly specialized features requires a considerable amount of time, manual oversight and careful quality control to ensure the absence of errors in the computational process. Recent advances in Deep Representation Learning have shown great promise in extracting highly non-linear and information-rich features from data. In this paper, we present a novel large-scale deep unsupervised approach to learn generic feature representations of structural brain MRI scans, which requires no specialized domain knowledge or manual intervention. Our method produces low-dimensional representations of brain structure, which can be used to reconstruct brain images with very low error and exhibit performance comparable to FreeSurfer features on various classification tasks.

التعلم الآلي التعلم الالي

Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting

91 - Lingbo Liu , Jiaqi Chen , Hefeng Wu 2020

Crowd counting is a fundamental yet challenging task, which desires rich information to generate pixel-wise crowd density maps. However, most previous methods only used the limited information of RGB images and cannot well discover potential pedestri ans in unconstrained scenarios. In this work, we find that incorporating optical and thermal information can greatly help to recognize pedestrians. To promote future researches in this field, we introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people. Furthermore, to facilitate the multimodal crowd counting, we propose a cross-modal collaborative representation learning framework, which consists of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM) to capture the complementary information of different modalities fully. Specifically, our IADM incorporates two collaborative information transfers to dynamically enhance the modality-shared and modality-specific representations with a dual information propagation mechanism. Extensive experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting. Moreover, the proposed approach is universal for multimodal crowd counting and is also capable to achieve superior performance on the ShanghaiTechRGBD dataset. Finally, our source code and benchmark are released at {url{http://lingboliu.com/RGBT_Crowd_Counting.html}}.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي

Spatiotemporal Contrastive Video Representation Learning

88 - Rui Qian , Tianjian Meng , Boqing Gong 2020

We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the sa me short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. We also propose a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time. On Kinetics-600, a linear classifier trained on the representations learned by CVRL achieves 70.4% top-1 accuracy with a 3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated R3D-50. The performance of CVRL can be further improved to 72.9% with a larger R3D-152 (2x filters) backbone, significantly closing the gap between unsupervised and supervised video representation learning. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي