The recent trend towards increasingly large machine learning models requires both training and inference tasks to be distributed. Given the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in deep learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction and considering both holistically enables many optimizations that improve the performance of distributed workloads. Manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is time-consuming and error-prone. Therefore, we present CoCoNeT, which provides a DSL to express a program with both computation and communication. CoCoNeT contains several machine-learning-aware transformations to optimize a program and a compiler to generate high-performance kernels. Providing both computation and communication as first-class constructs allows users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of computation and communication. CoCoNeT enables us to optimize data-, model-, and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show that CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.
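CoCoNeT's DSL is not reproduced here; as a minimal sketch of the overlap optimization the abstract describes, the following PyTorch snippet issues asynchronous all-reduces so that communication of one gradient bucket can proceed while other kernels run. It assumes an already-initialized torch.distributed process group (e.g. launched with torchrun); overlapped_allreduce and grad_buckets are illustrative names, not CoCoNeT's API.

```python
# Minimal sketch (not CoCoNeT's DSL) of overlapping communication with
# computation via PyTorch's real async all-reduce API.
import torch
import torch.distributed as dist

def overlapped_allreduce(grad_buckets):
    """All-reduce each gradient bucket asynchronously, so NCCL traffic
    for earlier buckets overlaps with work on later ones."""
    handles = []
    for g in grad_buckets:
        # async_op=True returns immediately with a work handle, letting
        # subsequent GPU kernels run while the reduction is in flight.
        handles.append(dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True))
    # ... independent computation could be scheduled here ...
    for h in handles:
        h.wait()  # block only when the reduced values are actually needed
    world = dist.get_world_size()
    for g in grad_buckets:
        g.div_(world)  # fused-style epilogue: average after the sum
```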
This paper explores serverless cloud computing for double machine learning. Because it is based on repeated cross-fitting, double machine learning is particularly well suited to exploiting the high level of parallelism achievable with serverless computing. It a
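The parallelism claim follows from cross-fitting: each fold's nuisance models are fit independently, so folds map naturally onto separate serverless invocations. Below is a minimal partialling-out sketch in which ProcessPoolExecutor stands in for a FaaS backend; fit_nuisance and dml_effect are hypothetical names, not the paper's implementation.

```python
# Minimal double-ML (partialling-out) sketch: one independent job per fold.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def fit_nuisance(args):
    """Fit outcome/treatment nuisance models on the training fold and
    return out-of-fold residuals (the cross-fitting step of DML)."""
    X, y, d, train, test = args
    m_y = RandomForestRegressor(n_estimators=50).fit(X[train], y[train])
    m_d = RandomForestRegressor(n_estimators=50).fit(X[train], d[train])
    return y[test] - m_y.predict(X[test]), d[test] - m_d.predict(X[test])

def dml_effect(X, y, d, n_folds=5):
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X)
    jobs = [(X, y, d, tr, te) for tr, te in folds]
    with ProcessPoolExecutor() as pool:   # one "function invocation" per fold
        residuals = list(pool.map(fit_nuisance, jobs))
    ry = np.concatenate([r for r, _ in residuals])
    rd = np.concatenate([r for _, r in residuals])
    return float(rd @ ry / (rd @ rd))     # final-stage OLS of ry on rd
```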
Machine learning (ML) tasks are becoming ubiquitous in today's network applications. Federated learning has emerged recently as a technique for training ML models at the network edge by leveraging processing capabilities across the nodes that collect
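As a minimal sketch of the edge-training pattern described (not this paper's method), the following FedAvg-style snippet has each node run local gradient steps on a linear model while a coordinator averages the resulting weights by sample count; local_update and federated_round are illustrative names.

```python
# Minimal FedAvg-style sketch: local SGD on each client, then
# sample-weighted averaging of model weights at the server.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=1):
    """One client's local gradient steps on a linear model (squared loss)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w, len(y)

def federated_round(weights, clients):
    """Aggregate client updates by sample-weighted averaging."""
    updates = [local_update(weights, X, y) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(n * w for w, n in updates) / total
```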
The increased use of deep learning (DL) in academia, government, and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations to utilize expensive resources effectively,
Machine Learning applications on HPC systems have been gaining popularity in recent years. Upcoming large-scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and thi
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial ben