Hardware Aware Training for Efficient Keyword Spotting on General Purpose and Specialized Hardware

346 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Aaron Voelker

تاريخ النشر 2020

مجال البحث هندسة إلكترونية الهندسة المعلوماتية

والبحث باللغة English

تأليف Peter Blouw - Gurshaant Malik - Benjamin Morcos

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Keyword spotting (KWS) provides a critical user interface for many mobile and edge applications, including phones, wearables, and cars. As KWS systems are typically always on, maximizing both accuracy and power efficiency are central to their utility. In this work we use hardware aware training (HAT) to build new KWS neural networks based on the Legendre Memory Unit (LMU) that achieve state-of-the-art (SotA) accuracy and low parameter counts. This allows the neural network to run efficiently on standard hardware (212$mu$W). We also characterize the power requirements of custom designed accelerator hardware that achieves SotA power efficiency of 8.79$mu$W, beating general purpose low power hardware (a microcontroller) by 24x and special purpose ASICs by 16x.

قيم البحث

480 - Peter Blouw , Xuan Choo , Eric Hunsberger 2018

Using Intels Loihi neuromorphic research chip and ABRs Nengo Deep Learning toolkit, we analyze the inference speed, dynamic power consumption, and energy cost per inference of a two-layer neural network keyword spotter trained to recognize a single p hrase. We perform comparative analyses of this keyword spotter running on more conventional hardware devices including a CPU, a GPU, Nvidias Jetson TX1, and the Movidius Neural Compute Stick. Our results indicate that for this inference application, Loihi outperforms all of these alternatives on an energy cost per inference basis while maintaining equivalent inference accuracy. Furthermore, an analysis of tradeoffs between network size, inference speed, and energy cost indicates that Loihis comparative advantage over other low-power computing devices improves for larger networks.

التعلم الآلي التعلم الالي

Training Keyword Spotters with Limited and Synthesized Speech Data

121 - James Lin , Kevin Kilgour , Dominik Roblek 2020

With the rise of low power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts in the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small, spoken term detection models of around 400k parameters. Instead of training such models directly on the audio or low level features such as MFCCs, we use a pre-trained speech embedding model trained to extract useful features for keyword spotting models. Using this speech embedding, we show that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples. We also show that a model without our speech embeddings would need to be trained on over 4000 real examples to reach the same accuracy.

معالجة الصوت والكلام التعلم الآلي أنظمة الصوت في الحاسوب

Feature exploration for almost zero-resource ASR-free keyword spotting using a multilingual bottleneck extractor and correspondence autoencoders

149 - Raghav Menon , Herman Kamper , Ewald van der Westhuizen 2018

We compare features for dynamic time warping (DTW) when used to bootstrap keyword spotting (KWS) in an almost zero-resource setting. Such quickly-deployable systems aim to support United Nations (UN) humanitarian relief efforts in parts of Africa wit h severely under-resourced languages. Our objective is to identify acoustic features that provide acceptable KWS performance in such environments. As supervised resource, we restrict ourselves to a small, easily acquired and independently compiled set of isolated keywords. For feature extraction, a multilingual bottleneck feature (BNF) extractor, trained on well-resourced out-of-domain languages, is integrated with a correspondence autoencoder (CAE) trained on extremely sparse in-domain data. On their own, BNFs and CAE features are shown to achieve a more than 2% absolute performance improvement over baseline MFCCs. However, by using BNFs as input to the CAE, even better performance is achieved, with a more than 11% absolute improvement in ROC AUC over MFCCs and more than twice as many top-10 retrievals for two evaluated languages, English and Luganda. We conclude that integrating BNFs with the CAE allows both large out-of-domain and sparse in-domain resources to be exploited for improved ASR-free keyword spotting.

معالجة الصوت والكلام التعلم الآلي أنظمة الصوت في الحاسوب

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

324 - Hanrui Wang , Zhanghao Wu , Zhijian Liu 2020

Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to be deployed on hardware due to the intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to desi gn Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with $textit{arbitrary encoder-decoder attention}$ and $textit{heterogeneous layers}$. Then we train a $textit{SuperTransformer}$ that covers all candidates in the design space, and efficiently produces many $textit{SubTransformers}$ with weight sharing. Finally, we perform an evolutionary search with a hardware latency constraint to find a specialized $textit{SubTransformer}$ dedicated to run fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). When running WMT14 translation task on Raspberry Pi-4, HAT can achieve $textbf{3}times$ speedup, $textbf{3.7}times$ smaller size over baseline Transformer; $textbf{2.7}times$ speedup, $textbf{3.6}times$ smaller size over Evolved Transformer with $textbf{12,041}times$ less search cost and no performance loss. HAT code is https://github.com/mit-han-lab/hardware-aware-transformers.git

الحساب واللغة التعلم الآلي الحوسبة العصبية والتطورية

Efficient keyword spotting using dilated convolutions and gating

304 - Alice Coucke , Mohammed Chlieh , Thibault Gisselbrecht 2018

We explore the application of end-to-end stateless temporal modeling to small-footprint keyword spotting as opposed to recurrent networks that model long-term temporal dependencies using internal states. We propose a model inspired by the recent succ ess of dilated convolutions in sequence modeling applications, allowing to train deeper architectures in resource-constrained configurations. Gated activations and residual connections are also added, following a similar configuration to WaveNet. In addition, we apply a custom target labeling that back-propagates loss from specific frames of interest, therefore yielding higher accuracy and only requiring to detect the end of the keyword. Our experimental results show that our model outperforms a max-pooling loss trained recurrent neural network using LSTM cells, with a significant decrease in false rejection rate. The underlying dataset - Hey Snips utterances recorded by over 2.2K different speakers - has been made publicly available to establish an open reference for wake-word detection.

التعلم الآلي الحساب واللغة أنظمة الصوت في الحاسوب