Rethinking Why Intermediate-Task Fine-Tuning Works

Added by Ting-Yun Chang

Publication date 2021

Field Informatics Engineering

Language English

Authors Ting-Yun Chang - Chi-Jen Lu

Computation and Language

Supplementary Training on Intermediate Labeled-data Tasks (STILTs) is a widely applied technique that first fine-tunes a pretrained language model on an intermediate task before fine-tuning it on the target task of interest. While STILTs can further improve the performance of pretrained language models, it is still unclear why and when it works. Previous research shows that intermediate tasks involving complex inference, such as commonsense reasoning, work especially well for RoBERTa. In this paper, we discover that the improvement from an intermediate task can be orthogonal to whether the task involves reasoning or other complex skills: a simple real-fake discrimination task synthesized by GPT2 can benefit diverse target tasks. We conduct extensive experiments to study the impact of different factors on STILTs. These findings suggest rethinking the role of intermediate fine-tuning in the STILTs pipeline.
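
The two-stage order described in the abstract above can be sketched with a toy model. This is only an illustration under loud assumptions: a three-weight logistic regression stands in for the pretrained language model, the two synthetic tasks are invented for the example, and the helper names (`sgd_logistic`, `make_task`, `accuracy`) are hypothetical. It shows the order of the pipeline, not the paper's experiments or results.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(weights, data, lr=0.1, epochs=50):
    """Fine-tune logistic-regression weights on (features, label) pairs with plain SGD."""
    w = list(weights)  # copy, so the starting checkpoint is left untouched
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
    return w


def accuracy(w, data):
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) > 0) == bool(y) for x, y in data)
    return hits / len(data)


def make_task(rule, n, rng):
    """Build a synthetic binary task; each example has two features plus a bias term."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]
        data.append((x, 1 if rule(x) else 0))
    return data


rng = random.Random(0)
intermediate = make_task(lambda x: x[0] > 0, 200, rng)           # large intermediate task (assumed)
target = make_task(lambda x: x[0] + 0.5 * x[1] > 0, 20, rng)     # small target task (assumed)

pretrained = [0.0, 0.0, 0.0]                                     # stand-in for a pretrained checkpoint
after_intermediate = sgd_logistic(pretrained, intermediate)      # stage 1: intermediate task
stilts_model = sgd_logistic(after_intermediate, target)          # stage 2: target task
baseline_model = sgd_logistic(pretrained, target)                # direct fine-tuning, for comparison

print("STILTs-style:", accuracy(stilts_model, target))
print("direct:      ", accuracy(baseline_model, target))
```

The only difference between the two final models is the checkpoint that target-task fine-tuning starts from, which is the variable the abstract asks readers to rethink.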

Rethinking the Hyperparameters for Fine-tuning

Hao Li, Pratik Chaudhari, Hao Yang 2020

Fine-tuning from pre-trained ImageNet models has become the de facto standard for various computer vision tasks. Current practices for fine-tuning typically involve an ad hoc choice of hyperparameters that are kept fixed to values normally used for training from scratch. This paper re-examines several common practices of setting hyperparameters for fine-tuning. Our findings are based on extensive empirical evaluation for fine-tuning on various transfer learning benchmarks. (1) While prior works have thoroughly investigated learning rate and batch size, momentum for fine-tuning is a relatively unexplored parameter. We find that the value of momentum also affects fine-tuning performance and connect it with previous theoretical findings. (2) Optimal hyperparameters for fine-tuning, in particular the effective learning rate, are not only dataset dependent but also sensitive to the similarity between the source domain and target domain. This is in contrast to hyperparameters for training from scratch. (3) Reference-based regularization that keeps models close to the initial model does not necessarily apply for dissimilar datasets. Our findings challenge common practices of fine-tuning and encourage deep learning practitioners to rethink the hyperparameters for fine-tuning.

Computer Vision and Pattern Recognition Machine Learning
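
The abstract above links momentum to the effective learning rate without giving a formula. A minimal sketch, assuming the commonly used definition for SGD with momentum, effective lr = lr / (1 - momentum); both this definition and the function name `effective_learning_rate` are assumptions for illustration, not taken from the abstract.

```python
import math


def effective_learning_rate(lr, momentum):
    """Effective step size of SGD with momentum, under the assumed lr / (1 - momentum) definition."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must be in [0, 1)")
    return lr / (1.0 - momentum)


# Three hypothetical settings that look different but share roughly the same effective step size.
configs = [(0.1, 0.0), (0.01, 0.9), (0.001, 0.99)]
for lr, momentum in configs:
    eff = effective_learning_rate(lr, momentum)
    print(f"lr={lr}, momentum={momentum} -> effective lr ~ {eff:.4f}")
```

Under this definition, changing momentum while holding the learning rate fixed also changes the effective step size, which is one way to read the abstract's point that momentum should not simply be copied from from-scratch training recipes.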

Why Indexing Works

J. B. Heaton, N. G. Polson, J. H. Witte 2015

We develop a simple stock selection model to explain why active equity managers tend to underperform a benchmark index. We motivate our model with the empirical observation that the best performing stocks in a broad market index often perform much better than the other stocks in the index. Randomly selecting a subset of securities from the index may dramatically increase the chance of underperforming the index. The relative likelihood of underperformance by investors choosing active management is likely much more important than the loss to those same investors from the higher fees for active management relative to passive index investing. Thus, active management may be even more challenging than previously believed, and the stakes for finding the best active managers may be larger than previously assumed.

Portfolio Management Statistical Finance
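
The argument in the abstract above, that a few outsized winners make random subsets likely to trail the index, can be checked by exact enumeration on a toy index. The five returns below, the equal weighting of both the index and the subsets, and the function name `underperformance_fraction` are assumptions chosen for illustration.

```python
from itertools import combinations
from statistics import mean


def underperformance_fraction(returns, subset_size):
    """Fraction of equal-weighted subsets whose return falls below the equal-weighted index return."""
    index_return = mean(returns)
    subsets = list(combinations(returns, subset_size))
    below = sum(1 for subset in subsets if mean(subset) < index_return)
    return below / len(subsets)


# Hypothetical index: four stocks return 10%, one outsized winner returns 50%.
toy_returns = [0.10, 0.10, 0.10, 0.10, 0.50]

for size in (1, 2):
    share = underperformance_fraction(toy_returns, size)
    print(f"subsets of {size} stock(s): {share:.0%} underperform the index")
```

In this toy index every subset that misses the single winner earns 10% against an index return of 18%, so most small subsets underperform even before fees are considered.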

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani 2021

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing information across tasks. In this paper, we show that we can learn adapter parameters for all layers and tasks by generating them using shared hypernetworks, which condition on task, adapter position, and layer id in a transformer model. This parameter-efficient multi-task learning framework allows us to achieve the best of both worlds by sharing knowledge across tasks via hypernetworks while enabling the model to adapt to each individual task through task-specific adapters. Experiments on the well-known GLUE benchmark show improved performance in multi-task learning while adding only 0.29% parameters per task. We additionally demonstrate substantial performance improvements in few-shot domain generalization across a variety of tasks. Our code is publicly available at https://github.com/rabeehk/hyperformer.

Computation and Language
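
The idea in the abstract above, one shared network that emits adapter weights conditioned on task, adapter position, and layer id, can be sketched in plain Python. This is a simplified toy and not the authors' implementation: the class name `SharedHypernetwork`, the single linear projection, and all dimensions are assumptions made for the example.

```python
import random


class SharedHypernetwork:
    """Toy hypernetwork: one shared linear map generates adapter weights for any (task, layer, position)."""

    def __init__(self, n_tasks, n_layers, n_positions, emb_dim=4, hidden=8, bottleneck=2, seed=0):
        rng = random.Random(seed)

        def table(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden, self.bottleneck = hidden, bottleneck
        # Embeddings for the three conditioning signals named in the abstract.
        self.task_emb = table(n_tasks, emb_dim)
        self.layer_emb = table(n_layers, emb_dim)
        self.pos_emb = table(n_positions, emb_dim)
        # Shared projection: conditioning vector -> flattened down- and up-projection weights.
        self.proj = table(2 * hidden * bottleneck, 3 * emb_dim)

    def adapter_weights(self, task, layer, position):
        # Concatenate the three embeddings into one conditioning vector.
        cond = self.task_emb[task] + self.layer_emb[layer] + self.pos_emb[position]
        flat = [sum(w * c for w, c in zip(row, cond)) for row in self.proj]
        split = self.hidden * self.bottleneck
        # Down-projection: bottleneck x hidden. Up-projection: hidden x bottleneck.
        down = [flat[i * self.hidden:(i + 1) * self.hidden] for i in range(self.bottleneck)]
        up = [flat[split + i * self.bottleneck:split + (i + 1) * self.bottleneck]
              for i in range(self.hidden)]
        return down, up


hyper = SharedHypernetwork(n_tasks=3, n_layers=2, n_positions=2)
down, up = hyper.adapter_weights(task=0, layer=1, position=0)
print(len(down), len(down[0]), len(up), len(up[0]))
```

In this sketch the projection and the layer and position embeddings are shared by every task, so the only parameters that grow with the number of tasks are the rows of `task_emb`, which is the intuition behind a small per-task parameter cost.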

Why ResNet Works? Residuals Generalize

Fengxiang He, Tongliang Liu 2019

Residual connections significantly boost the performance of deep neural networks. However, there are few theoretical results that address the influence of residuals on the hypothesis complexity and the generalization ability of deep neural networks. This paper studies the influence of residual connections on the hypothesis complexity of the neural network in terms of the covering number of its hypothesis space. We prove that the upper bound of the covering number is the same as chain-like neural networks, if the total numbers of the weight matrices and nonlinearities are fixed, no matter whether they are in the residuals or not. This result demonstrates that residual connections may not increase the hypothesis complexity of the neural network compared with the chain-like counterpart. Based on the upper bound of the covering number, we then obtain an $\mathcal{O}(1 / \sqrt{N})$ margin-based multi-class generalization bound for ResNet, as an exemplary case of any deep neural network with residual connections. Generalization guarantees for similar state-of-the-art neural network architectures, such as DenseNet and ResNeXt, are straightforward. From our generalization bound, a practical implementation is summarized: to approach a good generalization ability, we need to use regularization terms to control the magnitude of the norms of weight matrices not to increase too much, which justifies the standard technique of weight decay.

Machine Learning

Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation

Alexander R. Fabbri, Simeng Han, Haoyuan Li 2020

Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks. However, these models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains. In this work, we introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner. WikiTransfer fine-tunes pretrained models on pseudo-summaries, produced from generic Wikipedia data, which contain characteristics of the target dataset, such as the length and level of abstraction of the desired summaries. WikiTransfer models achieve state-of-the-art, zero-shot abstractive summarization performance on the CNN-DailyMail dataset and demonstrate the effectiveness of our approach on three additional diverse datasets. These models are more robust to noisy data and also achieve better or comparable few-shot performance using 10 and 100 training examples when compared to few-shot transfer from other summarization datasets. To further boost performance, we employ data augmentation via round-trip translation as well as introduce a regularization term for improved few-shot transfer. To understand the role of dataset aspects in transfer performance and the quality of the resulting output summaries, we further study the effect of the components of our unsupervised fine-tuning data and analyze few-shot performance using both automatic and human evaluation.

Computation and Language