Do you want to publish a course? Click here

LightSeq: A High Performance Inference Library for Transformers

77   0   0.0 ( 0 )
 Added by Yang Wei
 Publication date 2020
and research's language is English




Ask ChatGPT about the research

Transformer, BERT and their variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose LightSeq, a highly efficient inference library for models in the Transformer family. LightSeq includes a series of GPU optimization techniques to to streamline the computation of neural layers and to reduce memory footprint. LightSeq can easily import models trained using PyTorch and Tensorflow. Experimental results on machine translation benchmarks show that LightSeq achieves up to 14x speedup compared with TensorFlow and 1.4x compared with FasterTransformer, a concurrent CUDA implementation. The code is available at https://github.com/bytedance/lightseq.



rate research

Read More

134 - Akira SaiToh 2011
A C++ library, named ZKCM, has been developed for the purpose of multiprecision matrix calculations, which is based on the GNU MP and MPFR libraries. It is especially convenient for writing programs involving tensor-product operations, tracing-out operations, and singular-value decompositions. Its extension library, ZKCM_QC, for simulating quantum computing has been developed using the time-dependent matrix-product-state simulation method. This report gives a brief introduction to the libraries with sample programs.
61 - Oleg Smirnov 2021
The adoption of neural networks and deep learning in non-Euclidean domains has been hindered until recently by the lack of scalable and efficient learning frameworks. Existing toolboxes in this space were mainly motivated by research and education use cases, whereas practical aspects, such as deploying and maintaining machine learning models, were often overlooked. We attempt to bridge this gap by proposing TensorFlow RiemOpt, a Python library for optimization on Riemannian manifolds in TensorFlow. The library is designed with the aim for a seamless integration with the TensorFlow ecosystem, targeting not only research, but also streamlining production machine learning pipelines.
In this paper, we develop face.evoLVe -- a comprehensive library that collects and implements a wide range of popular deep learning-based methods for face recognition. First of all, face.evoLVe is composed of key components that cover the full process of face analytics, including face alignment, data processing, various backbones, losses, and alternatives with bags of tricks for improving performance. Later, face.evoLVe supports multi-GPU training on top of different deep learning platforms, such as PyTorch and PaddlePaddle, which facilitates researchers to work on both large-scale datasets with millions of images and low-shot counterparts with limited well-annotated data. More importantly, along with face.evoLVe, images before & after alignment in the common benchmark datasets are released with source codes and trained models provided. All these efforts lower the technical burdens in reproducing the existing methods for comparison, while users of our library could focus on developing advanced approaches more efficiently. Last but not least, face.evoLVe is well designed and vibrantly evolving, so that new face recognition approaches can be easily plugged into our framework. Note that we have used face.evoLVe to participate in a number of face recognition competitions and secured the first place. The version that supports PyTorch is publicly available at https://github.com/ZhaoJ9014/face.evoLVe.PyTorch and the PaddlePaddle version is available at https://github.com/ZhaoJ9014/face.evoLVe.PyTorch/tree/master/paddle. Face.evoLVe has been widely used for face analytics, receiving 2.4K stars and 622 forks.
We present cudaclaw, a CUDA-based high performance data-parallel framework for the solution of multidimensional hyperbolic partial differential equation (PDE) systems, equations describing wave motion. cudaclaw allows computational scientists to solve such systems on GPUs without being burdened by the need to write CUDA code, worry about thread and block details, data layout, and data movement between the different levels of the memory hierarchy. The user defines the set of PDEs to be solved via a CUDA- independent serial Riemann solver and the framework takes care of orchestrating the computations and data transfers to maximize arithmetic throughput. cudaclaw treats the different spatial dimensions separately to allow suitable block sizes and dimensions to be used in the different directions, and includes a number of optimizations to minimize access to global memory.
Dijkstra observed that verifying correctness of a program is difficult and conjectured that derivation of a program hand-in-hand with its proof of correctness was the answer. We illustrate this goal-oriented approach by applying it to the domain of dense linear algebra libraries for distributed memory parallel computers. We show that algorithms that underlie the implementation of most functionality for this domain can be systematically derived to be correct. The benefit is that an entire family of algorithms for an operation is discovered so that the best algorithm for a given architecture can be chosen. This approach is very practical: Ideas inspired by it have been used to rewrite the dense linear algebra software stack starting below the Basic Linear Algebra Subprograms (BLAS) and reaching up through the Elemental distributed memory library, and every level in between. The paper demonstrates how formal methods and rigorous mathematical techniques for correctness impact HPC.

suggested questions

comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا