Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative

59 0 0.0 ( 0 )

Download Cite

Added by Lucio Dery

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Lucio M. Dery - Paul Michel - Ameet Talwalkar

Machine Learning Computation and Language

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Pre-training, where models are trained on an auxiliary objective with abundant data before being fine-tuned on data from the downstream task, is now the dominant paradigm in NLP. In general, the pre-training step relies on little to no direct knowledge of the task on which the model will be fine-tuned, even when the end-task is known in advance. Our work challenges this status-quo of end-task agnostic pre-training. First, on three different low-resource NLP tasks from two domains, we demonstrate that multi-tasking the end-task and auxiliary objectives results in significantly better downstream task performance than the widely-used task-agnostic continued pre-training paradigm of Gururangan et al. (2020). We next introduce an online meta-learning algorithm that learns a set of multi-task weights to better balance among our multiple auxiliary objectives, achieving further improvements on end task performance and data efficiency.

rate research

Why Should we Combine Training and Post-Training Methods for Out-of-Distribution Detection?

109 - Aristotelis-Angelos Papadopoulos , Nazim Shaikh , Mohammad Reza Rajati 2019

Deep neural networks are known to achieve superior results in classification tasks. However, it has been recently shown that they are incapable to detect examples that are generated by a distribution which is different than the one they have been trained on since they are making overconfident prediction for Out-Of-Distribution (OOD) examples. OOD detection has attracted a lot of attention recently. In this paper, we review some of the most seminal recent algorithms in the OOD detection field, we divide those methods into training and post-training and we experimentally show how the combination of the former with the latter can achieve state-of-the-art results in the OOD detection task.

Machine Learning Computer Vision and Pattern Recognition Machine Learning

Weighted Training for Cross-Task Learning

102 - Shuxiao Chen , Koby Crammer , Hangfeng He 2021

In this paper, we introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning based on minimizing a representation-based task distance between the source and target tasks. We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees. The effectiveness of TAWT is corroborated through extensive experiments with BERT on four sequence tagging tasks in natural language processing (NLP), including part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning, such as the choice of the source data and the impact of fine-tuning

Machine Learning Computation and Language Machine Learning

Revisiting Locally Supervised Learning: an Alternative to End-to-end Training

357 - Yulin Wang , Zanlin Ni , Shiji Song 2021

Due to the need to store the intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from high GPUs memory footprint. This paper aims to address this problem by revisiting the locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible, while progressively discard task-irrelevant information. As InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that InfoPro is capable of achieving competitive performance with less than 40% memory footprint compared to E2E training, while allowing using training data with higher-resolution or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously for potential training acceleration. Code is available at: https://github.com/blackfeather-wang/InfoPro-Pytorch.

Computer Vision and Pattern Recognition Machine Learning Machine Learning

Speech Model Pre-training for End-to-End Spoken Language Understanding

197 - Loren Lugosch , Mirco Ravanelli , Patrick Ignoto 2019

Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the models ability to generalize to new phrases not heard during training.

Audio and Speech Processing Computation and Language Machine Learning

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

114 - Luyu Gao , Jamie Callan 2021

Recent research demonstrates the effectiveness of using fine-tuned language models~(LM) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i)~fragility to training data noise and ii)~requiring large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Retrieval experiments on MS-MARCO, Natural Question, and Trivia QA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, as well as the need for large batch training. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small batch fine-tuning.

Information Retrieval Computation and Language

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions