Text Anchor Based Metric Learning for Small-footprint Keyword Spotting

163 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Li Wang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Li Wang - Rongzhi Gu - Nuo Chen

أنظمة الصوت في الحاسوب الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Keyword Spotting (KWS) remains challenging to achieve the trade-off between small footprint and high accuracy. Recently proposed metric learning approaches improved the generalizability of models for the KWS task, and 1D-CNN based KWS models have achieved the state-of-the-arts (SOTA) in terms of model size. However, for metric learning, due to data limitations, the speech anchor is highly susceptible to the acoustic environment and speakers. Also, we note that the 1D-CNN models have limited capability to capture long-term temporal acoustic features. To address the above problems, we propose to utilize text anchors to improve the stability of anchors. Furthermore, a new type of model (LG-Net) is exquisitely designed to promote long-short term acoustic feature modeling based on 1D-CNN and self-attention. Experiments are conducted on Google Speech Commands Dataset version 1 (GSCDv1) and 2 (GSCDv2). The results demonstrate that the proposed text anchor based metric learning method shows consistent improvements over speech anchor on representative CNN-based models. Moreover, our LG-Net model achieves SOTA accuracy of 97.67% and 96.79% on two datasets, respectively. It is encouraged to see that our lighter LG-Net with only 74k parameters obtains 96.82% KWS accuracy on the GSCDv1 and 95.77% KWS accuracy on the GSCDv2.

قيم البحث

352 - Shenghua Hu , Jing Wang , Yujun Wang 2021

Keyword spotting (KWS) on mobile devices generally requires a small memory footprint. However, most current models still maintain a large number of parameters in order to ensure good performance. To solve this problem, this paper proposes a separable temporal convolution neural network with attention, it has a small number of parameters. Through the time convolution combined with attention mechanism, a small number of parameters model (32.2K) is implemented while maintaining high performance. The proposed model achieves 95.7% accuracy on the Google Speech Commands dataset, which is close to the performance of Res15(239K), the state-of-the-art model in KWS at present.

أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Metric Learning for Keyword Spotting

315 - Jaesung Huh , Minjae Lee , Heesoo Heo 2020

The goal of this work is to train effective representations for keyword spotting via metric learning. Most existing works address keyword spotting as a closed-set classification problem, where both target and non-target keywords are predefined. There fore, prevailing classifier-based keyword spotting systems perform poorly on non-target sounds which are unseen during the training stage, causing high false alarm rates in real-world scenarios. In reality, keyword spotting is a detection problem where predefined target keywords are detected from a variety of unknown sounds. This shares many similarities to metric learning problems in that the unseen and unknown non-target sounds must be clearly differentiated from the target keywords. However, a key difference is that the target keywords are known and predefined. To this end, we propose a new method based on metric learning that maximises the distance between target and non-target keywords, but also learns per-class weights for target keywords `a la classification objectives. Experiments on the Google Speech Commands dataset show that our method significantly reduces false alarms to unseen non-target keywords, while maintaining the overall classification accuracy.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training Data

97 - Menglong Xu , Shengqiang Li , Chengdong Liang 2021

Deep neural networks provide effective solutions to small-footprint keyword spotting (KWS). However, if training data is limited, it remains challenging to achieve robust and highly accurate KWS in real-world scenarios where unseen sounds that are ou t of the training data are frequently encountered. Most conventional methods aim to maximize the classification accuracy on the training set, without taking the unseen sounds into account. To enhance the robustness of the deep neural networks based KWS, in this paper, we introduce a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC). The proposed method not only maximizes the classification accuracy of keywords on the closed training set, but also maximizes the AUC score for optimizing the performance of non-keyword segments detection. Experimental results on the Google Speech Commands dataset v1 and v2 show that our method achieves new state-of-the-art performance in terms of most evaluation metrics.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

Small-footprint Keyword Spotting Using Deep Neural Network and Connectionist Temporal Classifier

236 - Zhiming Wang , Xiaolong Li , Jun Zhou 2017

Mainly for the sake of solving the lack of keyword-specific data, we propose one Keyword Spotting (KWS) system using Deep Neural Network (DNN) and Connectionist Temporal Classifier (CTC) on power-constrained small-footprint mobile devices, taking ful l advantage of general corpus from continuous speech recognition which is of great amount. DNN is to directly predict the posterior of phoneme units of any personally customized key-phrase, and CTC to produce a confidence score of the given phoneme sequence as responsive decision-making mechanism. The CTC-KWS has competitive performance in comparison with purely DNN based keyword specific KWS, but not increasing any computational complexity.

الحساب واللغة

Separable Temporal Convolution plus Temporally Pooled Attention for Lightweight High-performance Keyword Spotting

99 - Shenghua Hu , Jing Wang , Yujun Wang 2021

Keyword spotting (KWS) on mobile devices generally requires a small memory footprint. However, most current models still maintain a large number of parameters in order to ensure good performance. In this paper, we propose a temporally pooled attentio n module which can capture global features better than the AveragePool. Besides, we design a separable temporal convolution network which leverages depthwise separable and temporal convolution to reduce the number of parameter and calculations. Finally, taking advantage of separable temporal convolution and temporally pooled attention, a efficient neural network (ST-AttNet) is designed for KWS system. We evaluate the models on the publicly available Google speech commands data sets V1. The number of parameters of proposed model (48K) is 1/6 of state-of-the-art TC-ResNet14-1.5 model (305K). The proposed model achieves a 96.6% accuracy, which is comparable to the TC-ResNet14-1.5 model (96.6%).

أنظمة الصوت في الحاسوب معالجة الصوت والكلام