Driven by the tremendous effort put into researching novel deep learning (DL) algorithms, the cost of training new models has increased staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute to gain insight into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely under-utilizing the hardware. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models across jobs often share the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively improve the hardware utilization of their novel DL training workloads, we propose Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to the operator level and then trains them simultaneously on a shared accelerator. To show the generality of our solution, we apply HFTA to the training of six DL models on state-of-the-art accelerators (GPUs and TPUs). Our results indicate that HFTA is highly effective in improving hardware utilization, achieving up to $15.1\times$ higher training throughput compared with the standard practice of running each job on a separate accelerator.
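To make characteristic (ii) concrete, below is a minimal PyTorch sketch (not HFTA's actual API; all names, shapes, and the job count B are illustrative assumptions) showing how B independent Conv2d operators with identical hyper-parameters, such as those arising from B hyper-parameter-tuning jobs, can be horizontally fused into a single grouped convolution that is mathematically equivalent to running them separately:

```python
import torch
import torch.nn as nn

B = 4                      # number of fused jobs/models (assumption)
in_c, out_c, k = 3, 16, 3  # per-model conv shape (assumption)

# B separate per-job convolutions (the unfused baseline).
convs = [nn.Conv2d(in_c, out_c, k) for _ in range(B)]

# One horizontally fused convolution: channels scaled by B, groups=B,
# so each group computes exactly one job's convolution.
fused = nn.Conv2d(in_c * B, out_c * B, k, groups=B)

# Copy each job's parameters into its group slice of the fused operator.
with torch.no_grad():
    fused.weight.copy_(torch.cat([c.weight for c in convs], dim=0))
    fused.bias.copy_(torch.cat([c.bias for c in convs], dim=0))

# Each job's input batch, concatenated along the channel dimension.
xs = [torch.randn(8, in_c, 32, 32) for _ in range(B)]
y_fused = fused(torch.cat(xs, dim=1))

# The fused output matches running the B jobs independently
# (up to floating-point tolerance).
y_ref = torch.cat([c(x) for c, x in zip(convs, xs)], dim=1)
assert torch.allclose(y_fused, y_ref, atol=1e-4)
```

The same equivalence argument extends to other common operators, e.g., B Linear layers with the same shape can be fused into one batched matrix multiplication, which is what lets the fused array run as a single, well-optimized kernel on a shared accelerator.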
Huge neural network models have shown unprecedented performance in real-world applications. However, due to memory constraints, model parallelism must be utilized to host large models that would otherwise not fit into the memory of a single device.
A major challenge in training deep learning models is the lack of high-quality and complete datasets. In this paper, we present a masking approach for training deep learning models from a publicly available but incomplete dataset.
Neural networks trained with class-imbalanced data are known to perform poorly on minor classes, for which training data are scarce. Several recent works attribute this to over-fitting to minor classes. In this paper, we provide a novel explanation of this issue.
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader).
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper, we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.