To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks

Added by Sebastian Ruder

Publication date 2019

Field: Informatics Engineering

Research language: English

Authors Matthew E. Peters - Sebastian Ruder - Noah A. Smith

Computation and Language Machine Learning

While most previous work has focused on different pretraining objectives and architectures for transfer learning, we ask how to best adapt the pretrained model to a given target task. We focus on the two most common forms of adaptation, feature extraction (where the pretrained weights are frozen), and directly fine-tuning the pretrained model. Our empirical results across diverse NLP tasks with two state-of-the-art models show that the relative performance of fine-tuning vs. feature extraction depends on the similarity of the pretraining and target tasks. We explore possible explanations for this finding and provide a set of adaptation guidelines for the NLP practitioner.
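The difference between the two adaptation modes can be made concrete with a toy sketch. This is not the authors' setup: the scalar "encoder" weight, the `train` function, and the data below are all hypothetical stand-ins, chosen only to show that feature extraction leaves the pretrained weight untouched while fine-tuning updates it together with the task head.

```python
def train(xs, ys, enc, head, freeze_encoder, lr=0.01, steps=200):
    """Fit y ~ head * (enc * x) by gradient descent on mean squared error.

    enc plays the role of a pretrained weight, head of a task-specific layer.
    Both are single floats here, which is an assumption made for brevity.
    """
    n = len(xs)
    for _ in range(steps):
        g_enc = 0.0
        g_head = 0.0
        for x, y in zip(xs, ys):
            feat = enc * x               # "pretrained representation" of x
            err = head * feat - y        # prediction error
            g_head += 2 * err * feat / n
            g_enc += 2 * err * head * x / n
        head -= lr * g_head              # the task head is always trained
        if not freeze_encoder:
            enc -= lr * g_enc            # fine-tuning also updates the encoder
    return enc, head


xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]                     # toy target task: y = 3x
PRETRAINED_ENC = 1.0                     # hypothetical pretrained value

# Feature extraction: encoder frozen, only the head is learned.
fe_enc, fe_head = train(xs, ys, PRETRAINED_ENC, 0.5, freeze_encoder=True)

# Fine-tuning: encoder and head are learned jointly.
ft_enc, ft_head = train(xs, ys, PRETRAINED_ENC, 0.5, freeze_encoder=False)

print("feature extraction:", fe_enc, fe_head)
print("fine-tuning:       ", ft_enc, ft_head)
```

Both runs fit this toy task, but only the fine-tuned run moves the encoder away from its pretrained value. The sketch says nothing about which mode performs better on real tasks, which is the question the abstract addresses.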

To tune or not to tune the number of trees in random forest?

Philipp Probst, Anne-Laure Boulesteix, 2017

The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that more trees are better, in practice the classification error rate sometimes reaches a minimum before increasing again as the number of trees grows. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonic function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonic patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting T to a computationally feasible large number, depending on convergence properties of the desired performance measure.
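A simplified worked example can show how the error rate might dip and then rise as trees are added. The sketch below is not taken from the paper: it assumes each tree votes independently, that each observation has a fixed probability p of being misclassified by a single tree, and that T is odd so majority votes have no ties. The mixture in `groups` is a made-up illustration.

```python
from math import comb


def majority_error(p, T):
    """Probability that a majority of T independent votes is wrong,
    when each vote is wrong with probability p. T is assumed odd."""
    return sum(
        comb(T, k) * p**k * (1 - p) ** (T - k)
        for k in range(T // 2 + 1, T + 1)
    )


def expected_error(groups, T):
    """Average majority-vote error over groups of (weight, p) observations."""
    return sum(w * majority_error(p, T) for w, p in groups)


# Hypothetical data set: 80% "easy" observations (p = 0.10) and
# 20% "hard" observations that a single tree gets wrong more often than not.
groups = [(0.8, 0.10), (0.2, 0.55)]

curve = {T: expected_error(groups, T) for T in (1, 3, 11, 101)}
for T, err in curve.items():
    print(T, round(err, 4))
```

Under these assumptions the easy observations are classified correctly almost surely after a few trees, while the hard ones drift slowly toward being always wrong, so the averaged error first falls and then climbs toward 0.2.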

Machine Learning

To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks

Sinong Wang, Madian Khabsa, Hao Ma, 2020

Pretraining NLP models with variants of Masked Language Model (MLM) objectives has recently led to significant improvements on many tasks. This paper examines the benefits of pretrained models as a function of the number of training samples used in the downstream task. On several text classification tasks, we show that as the number of training examples grows into the millions, the accuracy gap between fine-tuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%. Our findings indicate that MLM-based models might reach a diminishing return point as the supervised data size increases significantly.

Computation and Language Machine Learning

To Share or not to Share: Predicting Sets of Sources for Model Transfer Learning

Lukas Lange, Jannik Strotgen, Heike Adel, 2021

In low-resource settings, model transfer can help to overcome a lack of labeled data for many tasks and domains. However, predicting useful transfer sources is a challenging problem, as even the most similar sources might lead to unexpected negative transfer results. Thus, ranking methods based on task and text similarity may not be sufficient to identify promising sources. To tackle this problem, we propose a method to automatically determine which and how many sources should be exploited. For this, we study the effects of model transfer on sequence labeling across various domains and tasks and show that our methods based on model similarity and support vector machines are able to predict promising sources, resulting in performance increases of up to 24 F1 points.

Computation and Language Machine Learning

Tune smarter not harder: A principled approach to tuning learning rates for shallow nets

Thulasi Tholeti, Sheetal Kalyani, 2020

Effective hyper-parameter tuning is essential to guarantee the performance that neural networks have come to be known for. In this work, a principled approach to choosing the learning rate is proposed for shallow feedforward neural networks. We associate the learning rate with the gradient Lipschitz constant of the objective to be minimized while training. An upper bound on this constant is derived and a search algorithm, which always results in non-divergent traces, is proposed to exploit the derived bound. It is shown through simulations that the proposed search method significantly outperforms existing tuning methods such as Tree Parzen Estimators (TPE). The proposed method is applied to three different existing applications: a) channel estimation in OFDM systems, b) prediction of currency exchange rates and c) offset estimation in OFDM receivers, and it is shown to pick better learning rates than the existing methods using the same or less compute power.
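The link between the learning rate and the gradient Lipschitz constant can be illustrated on a much simpler problem than the one the paper treats. The sketch below does not reproduce the paper's bound for shallow networks or its search algorithm. It assumes one-dimensional least squares, where the constant L is known exactly, and the function names are hypothetical.

```python
def mse(w, xs, ys):
    """Mean squared error of the 1-D linear model y ~ w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(w, xs, ys):
    """Derivative of mse with respect to w."""
    return 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def lipschitz_constant(xs):
    """For this quadratic objective the second derivative is constant,
    so the gradient Lipschitz constant is L = 2 * mean(x^2)."""
    return 2 * sum(x * x for x in xs) / len(xs)


def run_gd(xs, ys, lr, steps=10, w=0.0):
    """Run plain gradient descent and return the loss after every step."""
    losses = [mse(w, xs, ys)]
    for _ in range(steps):
        w -= lr * grad(w, xs, ys)
        losses.append(mse(w, xs, ys))
    return losses


xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]
L = lipschitz_constant(xs)

safe = run_gd(xs, ys, lr=1.0 / L)    # step size tied to L: loss never increases
risky = run_gd(xs, ys, lr=2.5 / L)   # step size above 2 / L: loss blows up

print("lr = 1/L  :", safe[0], "->", safe[-1])
print("lr = 2.5/L:", risky[0], "->", risky[-1])
```

On this objective any step size below 2/L gives a non-divergent trace, which is the kind of guarantee a bound on L makes available. For a neural network L is not known in closed form, which is presumably why an upper bound is needed.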

Machine Learning Optimization and Control

To Seal or Not To Seal

Javad Eshraghi, Sunghwan Jung, Pavlos P. Vlachos, 2019

When an object impacts the free surface of a liquid, it ejects a splash curtain upwards and creates an air cavity below the free surface. As the object descends into the liquid, the air cavity eventually closes under the action of hydrostatic pressure (deep seal). In contrast, the surface curtain may splash outwards or dome over and close, creating a surface seal. In this paper we experimentally investigate how the splash curtain dynamics are governed by the interplay of cavity pressure difference, gravity, and surface tension and how they control the occurrence, or not, of surface seal. Based on the experimental observations and measurements, we develop an analytical model to describe the trajectory and dynamics of the splash curtain. The model enables us to reveal the scaling relationship for the dimensionless surface seal time and discover the existence of a critical dimensionless number that predicts the occurrence of surface seal. This scaling indicates that the most significant parameter governing the occurrence of surface seal is the velocity of the airflow rushing into the cavity. This is in contrast to the current understanding which considers the impact velocity as the determinant parameter.

Fluid Dynamics
