Very deep transformers have recently outperformed conventional bidirectional long short-term memory networks by a large margin in speech recognition. However, inference cost remains a serious concern when deploying them in production. In this paper, we study two non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, in both frameworks, the input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens at the masked positions by taking both the unmasked context and the input speech into consideration. During inference, we start from all mask tokens, and the network iteratively predicts the missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right decoding. As an example, we propose a new decoding strategy that proceeds from the easiest predictions to the most difficult ones. Results on Mandarin (Aishell) and Japanese (CSJ) ASR benchmarks show that it is possible to train such a non-autoregressive network for ASR. In particular, on Aishell the proposed method outperforms the Kaldi ASR system and matches the performance of the state-of-the-art autoregressive transformer with a 7x speedup. Pretrained models and code will be made available after publication.
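The masking and easy-first decoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MASK` symbol, the function names, the `predict` interface (one `(token, confidence)` guess per position), the fixed number of positions filled per pass, and the toy predictor with made-up confidences are all assumptions introduced for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def random_mask(tokens, rng):
    """Training-side step: replace a random subset of tokens with MASK."""
    n_mask = rng.randint(1, len(tokens))  # mask at least one position
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in hidden else t for i, t in enumerate(tokens)]


def easy_first_decode(predict, length, n_iters):
    """Start from all MASK tokens and fill the most confident positions first.

    `predict(tokens)` is an assumed interface standing in for the network:
    it returns one (token, confidence) pair per position.
    Returns the decoded tokens and the order in which positions were filled.
    """
    tokens = [MASK] * length
    order = []
    per_iter = -(-length // n_iters)  # ceiling division: positions per pass
    for _ in range(n_iters):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = predict(tokens)
        # Easiest (highest-confidence) masked positions come first.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_iter]:
            tokens[i] = guesses[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens):
    """Stand-in for the network: fixed guesses with made-up confidences."""
    return [("a", 0.9), ("b", 0.5), ("c", 0.7), ("d", 0.6), ("e", 0.8)]


decoded, fill_order = easy_first_decode(toy_predict, length=5, n_iters=3)
print(decoded, fill_order)
```

With the toy predictor, the five positions are filled over three passes in the order 0, 4, 2, 3, 1, that is, from the most to the least confident guess. A left-to-right strategy would correspond to sorting the masked positions by index instead of by confidence.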
Non-autoregressive transformer models have achieved extremely fast inference speed and comparable performance with autoregressive sequence-to-sequence models in neural machine translation. Most of the non-autoregressive transformers decode the target
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in automatic speech recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global i
Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for impr
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown
The Listen, Attend and Spell (LAS) model and other attention-based automatic speech recognition (ASR) models have known limitations when operated in a fully online mode. In this paper, we analyze the online operation of LAS models to demonstrate that