Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management

92 0 0.0 ( 0 )

Download Cite

Added by Jiarui Fang

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jiarui Fang - Yang Yu - Shenggui Li

Machine Learning Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The pre-trained model (PTM) is revolutionizing Artificial intelligence (AI) technology. It learns a model with general language features on the vast text and then fine-tunes the model using a task-specific dataset. Unfortunately, PTM training requires prohibitively expensive computing devices, especially fine-tuning, which is still a game for a small proportion of people in the AI community. Enabling PTMs training on low-quality devices, PatrickStar now makes PTM accessible to everyone. PatrickStar reduces memory requirements of computing platforms by using the CPU-GPU heterogeneous memory space to store model data, consisting of parameters, gradients, and optimizer states. We observe that the GPU memory available for model data changes regularly, in a tide-like pattern, decreasing and increasing iteratively. However, the existing heterogeneous training works do not take advantage of this pattern. Instead, they statically partition the model data among CPU and GPU, leading to both memory waste and memory abuse. In contrast, PatrickStar manages model data in chunks, which are dynamically distributed in heterogeneous memory spaces. Chunks consist of stateful tensors which run as finite state machines during training. Guided by the runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory and generate lower CPU-GPU data transmission volume. Symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs using data parallelism, with the lowest communication bandwidth requirements and more efficient bandwidth utilization. Experimental results show PatrickStar trains a 12 billion parameters GPT model, 2x larger than the STOA work, on an 8-V100 and 240GB CPU memory node, and is also more efficient on the same model size.

rate research

Memory-Efficient Pipeline-Parallel DNN Training

98 - Deepak Narayanan , Amar Phanishayee , Kaiyu Shi 2020

Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20$times$ with similar final model accuracy.

Machine Learning Distributed Parallel and Cluster Computing Machine Learning

Parallel Corpus Filtering via Pre-trained Language Models

118 - Boliang Zhang , Ajay Nagesh , 2020

Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus Filtering shared task, and on our own web-crawled Japanese-Chinese parallel corpus. Our method significantly outperforms baselines and achieves a new state-of-the-art. In an unsupervised setting, our method achieves comparable performance to the top-1 supervised method. We also evaluate on a web-crawled Japanese-Chinese parallel corpus that we make publicly available.

Computation and Language Machine Learning

Integrating Pre-trained Model into Rule-based Dialogue Management

114 - Jun Quan , Meng Yang , Qiang Gan 2021

Rule-based dialogue management is still the most popular solution for industrial task-oriented dialogue systems for their interpretablility. However, it is hard for developers to maintain the dialogue logic when the scenarios get more and more complex. On the other hand, data-driven dialogue systems, usually with end-to-end structures, are popular in academic research and easier to deal with complex conversations, but such methods require plenty of training data and the behaviors are less interpretable. In this paper, we propose a method to leverages the strength of both rule-based and data-driven dialogue managers (DM). We firstly introduce the DM of Carina Dialog System (CDS, an advanced industrial dialogue system built by Microsoft). Then we propose the model-trigger design to make the DM trainable thus scalable to scenario changes. Furthermore, we integrate pre-trained models and empower the DM with few-shot capability. The experimental results demonstrate the effectiveness and strong few-shot capability of our method.

Computation and Language Artificial Intelligence

An empirical study of domain-agnostic semi-supervised learning via energy-based models: joint-training and pre-training

73 - Yunfu Song , Huahuan Zheng , Zhijian Ou 2020

A class of recent semi-supervised learning (SSL) methods heavily rely on domain-specific data augmentations. In contrast, generative SSL methods involve unsupervised learning based on generative models by either joint-training or pre-training, and are more appealing from the perspective of being domain-agnostic, since they do not inherently require data augmentations. Joint-training estimates the joint distribution of observations and labels, while pre-training is taken over observations only. Recently, energy-based models (EBMs) have achieved promising results for generative modeling. Joint-training via EBMs for SSL has been explored with encouraging results across different data modalities. In this paper, we make two contributions. First, we explore pre-training via EBMs for SSL and compare it to joint-training. Second, a suite of experiments are conducted over domains of image classification and natural language labeling to give a realistic whole picture of the performances of EBM based SSL methods. It is found that joint-training EBMs outperform pre-training EBMs marginally but nearly consistently.

Machine Learning

PI : a Parallel in-memory skip list based Index

146 - Zhongle Xie , Qingchao Cai , H.V. Jagadish 2016

Due to the coarse granularity of data accesses and the heavy use of latches, indices in the B-tree family are not efficient for in-memory databases, especially in the context of todays multi-core architecture. In this paper, we present PI, a Parallel in-memory skip list based Index that lends itself naturally to the parallel and concurrent environment, particularly with non-uniform memory access. In PI, incoming queries are collected, and disjointly distributed among multiple threads for processing to avoid the use of latches. For each query, PI traverses the index in a Breadth-First-Search (BFS) manner to find the list node with the matching key, exploiting SIMD processing to speed up the search process. In order for query processing to be latch-free, PI employs a light-weight communication protocol that enables threads to re-distribute the query workload among themselves such that each list node that will be modified as a result of query processing will be accessed by exactly one thread. We conducted extensive experiments, and the results show that PI can be up to three times as fast as the Masstree, a state-of-the-art B-tree based index.

Databases Distributed Parallel and Cluster Computing

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions