A class of recent semi-supervised learning (SSL) methods relies heavily on domain-specific data augmentations. In contrast, generative SSL methods perform unsupervised learning with generative models, via either joint-training or pre-training, and are more appealing from a domain-agnostic perspective since they do not inherently require data augmentations. Joint-training estimates the joint distribution of observations and labels, while pre-training models the observations only. Recently, energy-based models (EBMs) have achieved promising results for generative modeling. Joint-training via EBMs for SSL has been explored with encouraging results across different data modalities. In this paper, we make two contributions. First, we explore pre-training via EBMs for SSL and compare it to joint-training. Second, we conduct a suite of experiments over the domains of image classification and natural language labeling to give a realistic overall picture of the performance of EBM-based SSL methods. We find that joint-training EBMs outperform pre-training EBMs marginally but nearly consistently.
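To make the joint-training setup concrete, the following is a minimal, hedged sketch (not the authors' code) of a JEM-style joint-training loss: a classifier's logits define both p(y|x) and, via logsumexp, an energy for p(x), with the generative term trained on unlabeled data. The names (f, sgld_sample, steps, alpha, sigma) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def energy(f, x):
    # E(x) = -logsumexp_y f(x)[y]; lower energy = higher unnormalized density p(x)
    return -torch.logsumexp(f(x), dim=-1)

def sgld_sample(f, x_init, steps=20, alpha=1.0, sigma=0.01):
    # Crude SGLD chain drawing approximate samples from the p(x) defined by the energy.
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(energy(f, x).sum(), x)[0]
        x = (x - alpha * grad + sigma * torch.randn_like(x)).detach().requires_grad_(True)
    return x.detach()

def joint_training_loss(f, x_lab, y_lab, x_unlab):
    # Supervised term: standard cross-entropy (-log p(y|x)) on the labeled batch.
    ce = F.cross_entropy(f(x_lab), y_lab)
    # Unsupervised term: maximize log p(x) on unlabeled data via contrastive divergence,
    # pushing energy down on real data and up on model samples.
    x_neg = sgld_sample(f, torch.randn_like(x_unlab))
    gen = energy(f, x_unlab).mean() - energy(f, x_neg).mean()
    return ce + gen
```

Pre-training, by contrast, would optimize only the generative term on observations before fitting the classifier.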
We investigate a strategy for improving the efficiency of contrastive learning of visual representations by leveraging a small amount of supervised information during pre-training. We propose a semi-supervised loss, SuNCEt, based on noise-contrastive estimation and neighbourhood component analysis, that aims to distinguish examples of different classes in addition to the self-supervised instance-wise pretext tasks. On ImageNet, we find that SuNCEt can be used to match the semi-supervised learning accuracy of previous contrastive approaches while using less than half the amount of pre-training and compute. Our main insight is that leveraging even a small amount of labeled data during pre-training, and not only during fine-tuning, provides an important signal that can significantly accelerate contrastive learning of visual representations. Our code is available online at github.com/facebookresearch/suncet.
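For illustration, here is a hedged sketch of a SuNCEt-style supervised contrastive term (an interpretation of the abstract, not the official facebookresearch/suncet code): same-class examples in the batch act as positives and all other examples as negatives, in an NCA/NCE-style softmax over cosine similarities. The function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def suncet_like_loss(z, y, temperature=0.1):
    """z: (N, D) embeddings of labeled examples; y: (N,) class labels."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    mask_self = torch.eye(len(y), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float('-inf'))     # exclude self-similarity
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~mask_self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability of picking a same-class neighbour (NCA-style)
    loss = -(log_prob.masked_fill(~pos, 0).sum(1) / pos.sum(1).clamp(min=1))
    return loss.mean()
```

In the paper's setting, this supervised term is added to the instance-wise self-supervised contrastive objective during pre-training rather than replacing it.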
This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While training recipes for standard convolutional networks are by now highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are in fact partial failures, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.
Co-training, extended from self-training, is one of the frameworks for semi-supervised learning. Without a natural split of features, single-view co-training works at the cost of training extra classifiers, and the algorithm must be carefully designed to prevent the individual classifiers from collapsing into each other. To remove these obstacles, which deter the adoption of single-view co-training, we present a simple and efficient algorithm, Multi-Head Co-Training. By integrating the base learners into a multi-head structure, the model adds only a minimal number of extra parameters. Every classification head in the unified model interacts with its peers through a Weak and Strong Augmentation strategy, in which diversity arises naturally from the strong data augmentation. Therefore, the proposed method facilitates single-view co-training by (1) promoting diversity implicitly and (2) requiring only a small extra computational overhead. The effectiveness of Multi-Head Co-Training is demonstrated in an empirical study on standard semi-supervised learning benchmarks.
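The following is a hedged sketch of one Multi-Head Co-Training-style update as the abstract describes it (an interpretation, not the authors' implementation): a shared backbone feeds several heads, and each head is trained on strongly augmented inputs with pseudo-labels produced by its peer heads on weakly augmented inputs. The names (backbone, heads, x_weak, x_strong) and the confidence threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def co_training_step(backbone, heads, x_weak, x_strong, threshold=0.95):
    """heads: list of classification heads (nn.Module) sharing `backbone` features."""
    with torch.no_grad():
        feat_w = backbone(x_weak)
        probs = [F.softmax(h(feat_w), dim=1) for h in heads]   # peer predictions on weak views
    feat_s = backbone(x_strong)
    loss = 0.0
    for i, head in enumerate(heads):
        # Peer heads (all except head i) provide the pseudo-label for head i.
        peer = torch.stack([p for j, p in enumerate(probs) if j != i]).mean(0)
        conf, pseudo = peer.max(dim=1)
        mask = conf.ge(threshold).float()          # keep only confident pseudo-labels
        per_sample = F.cross_entropy(head(feat_s), pseudo, reduction='none')
        loss = loss + (per_sample * mask).mean()
    return loss / len(heads)
```

Because the heads share one backbone, the extra cost over a single classifier is limited to the additional head parameters and forward passes.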
Recently, learning a model that generalizes well on out-of-distribution (OOD) data has attracted great attention in the machine learning community. In this paper, after defining OOD generalization via Wasserstein distance, we theoretically show that a model robust to input perturbation generalizes well on OOD data. Inspired by previous findings that adversarial training helps improve input-robustness, we theoretically show that adversarially trained models have converged excess risk on OOD data, and empirically verify it on both image classification and natural language understanding tasks. Moreover, in the paradigm of first pre-training and then fine-tuning, we theoretically show that a pre-trained model that is more robust to input perturbation provides a better initialization for generalization on downstream OOD data. Empirically, after fine-tuning, the better-initialized model obtained from adversarial pre-training also generalizes better on OOD data.
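For orientation, the following is a hedged sketch of the two standard objects the abstract connects, written in our own notation rather than the paper's: an OOD generalization gap over distributions within a Wasserstein ball around the training distribution, and the adversarial training objective that enforces input-robustness.

```latex
% Notation is illustrative: P is the training distribution, W_p the p-Wasserstein
% distance, \ell the loss, f_\theta the model, \rho and \epsilon radius parameters.
\[
  \mathcal{E}_{\mathrm{OOD}}(\theta)
    \;=\; \sup_{Q:\, W_p(P, Q) \le \rho}
          \Big( \mathbb{E}_{(x,y)\sim Q}\,\ell(f_\theta(x), y)
                - \mathbb{E}_{(x,y)\sim P}\,\ell(f_\theta(x), y) \Big),
\]
\[
  \min_\theta \; \mathbb{E}_{(x,y)\sim P}
     \max_{\|\delta\| \le \epsilon} \ell\big(f_\theta(x+\delta), y\big).
\]
```

The paper's claim, informally, is that minimizing the adversarial (input-robust) objective controls the Wasserstein-ball generalization gap.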
In this work, we use a span-based approach for Vietnamese constituency parsing. Our method follows a self-attention encoder architecture with a chart decoder that uses a CKY-style inference algorithm. We compare our method using the pre-trained models XLM-Roberta and PhoBERT on two Vietnamese datasets, VietTreebank and NIIVTB1, and analyze the experimental results. The results show that our model with XLM-Roberta achieves a significantly better F1-score than the other pre-trained models: 81.19% on VietTreebank and 85.70% on NIIVTB1.
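To illustrate the chart decoder, here is a hedged sketch of CKY-style decoding for a span-based parser (generic dynamic programming, not the authors' code): given per-span scores produced by the encoder, it finds the best binary bracketing. `span_score[i][j]` is an assumed precomputed score for the span covering words i..j-1, with 0 <= i < j <= n.

```python
def cky_decode(span_score, n):
    best = [[0.0] * (n + 1) for _ in range(n + 1)]    # best[i][j]: best tree score over span (i, j)
    split = [[None] * (n + 1) for _ in range(n + 1)]  # backpointers for recovering the tree
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 1:
                best[i][j] = span_score[i][j]
                continue
            # choose the split point k that maximizes the scores of the two subtrees
            k_best = max(range(i + 1, j), key=lambda k: best[i][k] + best[k][j])
            split[i][j] = k_best
            best[i][j] = span_score[i][j] + best[i][k_best] + best[k_best][j]
    return best[0][n], split
```

In the span-based setting, label assignment per span is handled by the scoring model, while the chart search above recovers the highest-scoring tree structure.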