ترغب بنشر مسار تعليمي؟ اضغط هنا

Large-batch training has been essential in leveraging large-scale datasets and models in deep learning. While it is computationally beneficial to use large batch sizes, it often requires a specially designed learning rate (LR) schedule to achieve a c omparable level of performance as in smaller batch training. Especially, when the number of training epochs is constrained, the use of a large LR and a warmup strategy is critical in the final performance of large-batch training due to the reduced number of updating steps. In this work, we propose an automated LR scheduling algorithm which is effective for neural network training with a large batch size under the given epoch budget. In specific, the whole schedule consists of two phases: adaptive warmup and predefined decay, where the LR is increased until the training loss no longer decreases and decreased to zero until the end of training. Here, whether the training loss has reached the minimum value is robustly checked with Gaussian process smoothing in an online manner with a low computational burden. Coupled with adaptive stochastic optimizers such as AdamP and LAMB, the proposed scheduler successfully adjusts the LRs without cumbersome hyperparameter tuning and achieves comparable or better performances than tuned baselines on various image classification benchmarks and architectures with a wide range of batch sizes.
Unsupervised representation learning has recently received lots of interest due to its powerful generalizability through effectively leveraging large-scale unlabeled data. There are two prevalent approaches for this, contrastive learning and generati ve pre-training, where the former learns representations from instance-wise discrimination tasks and the latter learns them from estimating the likelihood. These seemingly orthogonal approaches have their own strengths and weaknesses. Contrastive learning tends to extract semantic information and discards details irrelevant for classifying objects, making the representations effective for discriminative tasks while degrading robustness to out-of-distribution data. On the other hand, the generative pre-training directly estimates the data distribution, so the representations tend to be robust but not optimal for discriminative tasks. In this paper, we show that we could achieve the best of both worlds by a hybrid training scheme. Specifically, we demonstrated that a transformer-based encoder-decoder architecture trained with both contrastive and generative losses can learn highly discriminative and robust representations without hurting the generative performance. We extensively validate our approach on various tasks.
A meta-model is trained on a distribution of similar tasks such that it learns an algorithm that can quickly adapt to a novel task with only a handful of labeled examples. Most of current meta-learning methods assume that the meta-training set consis ts of relevant tasks sampled from a single distribution. In practice, however, a new task is often out of the task distribution, yielding a performance degradation. One way to tackle this problem is to construct an ensemble of meta-learners such that each meta-learner is trained on different task distribution. In this paper we present a method for constructing a mixture of meta-learners (MxML), where mixing parameters are determined by the weight prediction network (WPN) optimized to improve the few-shot classification performance. Experiments on various datasets demonstrate that MxML significantly outperforms state-of-the-art meta-learners, or their naive ensemble in the case of out-of-distribution as well as in-distribution tasks.
While recent continual learning methods largely alleviate the catastrophic problem on toy-sized datasets, some issues remain to be tackled to apply them to real-world problem domains. First, a continual learning model should effectively handle catast rophic forgetting and be efficient to train even with a large number of tasks. Secondly, it needs to tackle the problem of order-sensitivity, where the performance of the tasks largely varies based on the order of the task arrival sequence, as it may cause serious problems where fairness plays a critical role (e.g. medical diagnosis). To tackle these practical challenges, we propose a novel continual learning method that is scalable as well as order-robust, which instead of learning a completely shared set of weights, represents the parameters for each task as a sum of task-shared and sparse task-adaptive parameters. With our Additive Parameter Decomposition (APD), the task-adaptive parameters for earlier tasks remain mostly unaffected, where we update them only to reflect the changes made to the task-shared parameters. This decomposition of parameters effectively prevents catastrophic forgetting and order-sensitivity, while being computation- and memory-efficient. Further, we can achieve even better scalability with APD using hierarchical knowledge consolidation, which clusters the task-adaptive parameters to obtain hierarchically shared parameters. We validate our network with APD, APD-Net, on multiple benchmark datasets against state-of-the-art continual learning methods, which it largely outperforms in accuracy, scalability, and order-robustness.
We present a personalized and reliable prediction model for healthcare, which can provide individually tailored medical services such as diagnosis, disease treatment, and prevention. Our proposed framework targets at making personalized and reliable predictions from time-series data, such as Electronic Health Records (EHR), by modeling two complementary components: i) a shared component that captures global trend across diverse patients and ii) a patient-specific component that models idiosyncratic variability for each patient. To this end, we propose a composite model of a deep neural network to learn complex global trends from the large number of patients, and Gaussian Processes (GP) to probabilistically model individual time-series given relatively small number of visits per patient. We evaluate our model on diverse and heterogeneous tasks from EHR datasets and show practical advantages over standard time-series deep models such as pure Recurrent Neural Network (RNN).
Hyperparameter optimization aims to find the optimal hyperparameter configuration of a machine learning model, which provides the best performance on a validation dataset. Manual search usually leads to get stuck in a local hyperparameter configurati on, and heavily depends on human intuition and experience. A simple alternative of manual search is random/grid search on a space of hyperparameters, which still undergoes extensive evaluations of validation errors in order to find its best configuration. Bayesian optimization that is a global optimization method for black-box functions is now popular for hyperparameter optimization, since it greatly reduces the number of validation error evaluations required, compared to random/grid search. Bayesian optimization generally finds the best hyperparameter configuration from random initialization without any prior knowledge. This motivates us to let Bayesian optimization start from the configurations that were successful on similar datasets, which are able to remarkably minimize the number of evaluations. In this paper, we propose deep metric learning to learn meta-features over datasets such that the similarity over them is effectively measured by Euclidean distance between their associated meta-features. To this end, we introduce a Siamese network composed of deep feature and meta-feature extractors, where deep feature extractor provides a semantic representation of each instance in a dataset and meta-feature extractor aggregates a set of deep features to encode a single representation over a dataset. Then, our learned meta-features are used to select a few datasets similar to the new dataset, so that hyperparameters in similar datasets are adopted as initializations to warm-start Bayesian hyperparameter optimization.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا