ﻻ يوجد ملخص باللغة العربية
Recently, large-scale transformer-based models have been proven to be effective over a variety of tasks across many domains. Nevertheless, putting them into production is very expensive, requiring comprehensive optimization techniques to reduce inference costs. This paper introduces a series of transformer inference optimization techniques that are both in algorithm level and hardware level. These techniques include a pre-padding decoding mechanism that improves token parallelism for text generation, and highly optimized kernels designed for very long input length and large hidden size. On this basis, we propose a transformer inference acceleration library -- Easy and Efficient Transformer (EET), which has a significant performance improvement over existing libraries. Compared to Faster Transformer v4.0s implementation for GPT-2 layer on A100, EET achieves a 1.5-4.5x state-of-art speedup varying with different context lengths. EET is available at https://github.com/NetEase-FuXi/EET. A demo video is available at https://youtu.be/22UPcNGcErg.
We present skweak, a versatile, Python-based software toolkit enabling NLP developers to apply weak supervision to a wide range of NLP tasks. Weak supervision is an emerging machine learning paradigm based on a simple idea: instead of labelling data
The literature has witnessed the success of leveraging Pre-trained Language Models (PLMs) and Transfer Learning (TL) algorithms to a wide range of Natural Language Processing (NLP) applications, yet it is not easy to build an easy-to-use and scalable
We probe pre-trained transformer language models for bridging inference. We first investigate individual attention heads in BERT and observe that attention heads at higher layers prominently focus on bridging relations in-comparison with the lower an
Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in Transformer makes it inefficient in handling long sequences. Many existing works explore to accelerate Transformers by computing spars
Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed on hardware platforms have impeded the popularity of pre