Improving Transformer efficiency has become increasingly attractive recently. A wide range of methods has been proposed, e.g., pruning, quantization, and new architectures. But these methods are either sophisticated to implement or dependent on specific hardware. In this paper, we show that the efficiency of the Transformer can be improved by combining simple, hardware-agnostic methods, including hyper-parameter tuning, better design choices, and training strategies. On the WMT news translation tasks, we improve the inference efficiency of a strong Transformer system by 3.80x on CPU and 2.52x on GPU.
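As an illustration of the kind of simple, hardware-agnostic design choice the abstract refers to, the sketch below rebalances a Transformer toward a deep encoder and a shallow decoder, since the decoder dominates autoregressive inference cost. This is a hypothetical example, not the paper's actual configuration: the layer counts and model dimensions are illustrative assumptions, and it uses PyTorch's stock nn.Transformer rather than the authors' system.

```python
# Minimal sketch (assumed configuration, not the paper's): trading decoder
# depth for encoder depth is a purely architectural choice that speeds up
# autoregressive inference on any hardware.
import torch
import torch.nn as nn

# Baseline: a balanced 6-encoder / 6-decoder Transformer.
baseline = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, batch_first=True,
)

# Rebalanced: move capacity into the encoder (run once per sentence)
# and thin out the decoder (run once per generated token).
fast = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=12, num_decoder_layers=1,
    dim_feedforward=2048, batch_first=True,
)

src = torch.rand(16, 40, 512)  # (batch, source length, d_model)
tgt = torch.rand(16, 38, 512)  # (batch, target length, d_model)
out = fast(src, tgt)           # same interface, cheaper per decoding step
print(out.shape)               # torch.Size([16, 38, 512])
```

The design intuition is that each generated token re-runs the full decoder stack, so shrinking the decoder cuts per-token latency, while the encoder's cost is paid only once per input sentence.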