Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Query-by-Example Keyword Spotting system using Multi-head Attention and Softtriple Loss

138 0 0.0 ( 0 )

Download Cite

Added by Jinmiao Huang

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jinmiao Huang - Waseem Gharbieh - Han Suk Shim

Computation and Language Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper proposes a neural network architecture for tackling the query-by-example user-defined keyword spotting task. A multi-head attention module is added on top of a multi-layered GRU for effective feature extraction, and a normalized multi-head attention module is proposed for feature aggregation. We also adopt the softtriple loss - a combination of triplet loss and softmax loss - and showcase its effectiveness. We demonstrate the performance of our model on internal datasets with different languages and the public Hey-Snips dataset. We compare the performance of our model to a baseline system and conduct an ablation study to show the benefit of each component in our architecture. The proposed work shows solid performance while preserving simplicity.

rate research

DONUT: CTC-based Query-by-Example Keyword Spotting

79 - Loren Lugosch , Samuel Myer , Vikrant Singh Tomar 2018

Keyword spotting--or wakeword detection--is an essential feature for hands-free operation of modern voice-controlled devices. With such devices becoming ubiquitous, users might want to choose a personalized custom wakeword. In this work, we present DONUT, a CTC-based algorithm for online query-by-example keyword spotting that enables custom wakeword detection. The algorithm works by recording a small number of training examples from the user, generating a set of label sequence hypotheses from these training examples, and detecting the wakeword by aggregating the scores of all the hypotheses given a new audio recording. Our method combines the generalization and interpretability of CTC-based keyword spotting with the user-adaptation and convenience of a conventional query-by-example system. DONUT has low computational requirements and is well-suited for both learning and inference on embedded systems without requiring private user data to be uploaded to the cloud.

Machine Learning Sound Audio and Speech Processing

A hybrid text normalization system using multi-head self-attention for mandarin

466 - Junhui Zhang , Junjie Pan , Xiang Yin 2019

In this paper, we propose a hybrid text normalization system using multi-head self-attention. The system combines the advantages of a rule-based model and a neural model for text preprocessing tasks. Previous studies in Mandarin text normalization usually use a set of hand-written rules, which are hard to improve on general cases. The idea of our proposed system is motivated by the neural models from recent studies and has a better performance on our internal news corpus. This paper also includes different attempts to deal with imbalanced pattern distribution of the dataset. Overall, the performance of the system is improved by over 1.5% on sentence-level and it has a potential to improve further.

Computation and Language

Generating Diverse Translation by Manipulating Multi-Head Attention

66 - Zewei Sun , Shujian Huang , Hao-Ran Wei 2019

Transformer model has been widely used on machine translation tasks and obtained state-of-the-art results. In this paper, we report an interesting phenomenon in its encoder-decoder multi-head attention: different attention heads of the final decoder layer align to different word translation candidates. We empirically verify this discovery and propose a method to generate diverse translations by manipulating heads. Furthermore, we make use of these diverse translations with the back-translation technique for better data augmentation. Experiment results show that our method generates diverse translations without severe drop in translation quality. Experiments also show that back-translation with these diverse translations could bring significant improvement on performance on translation tasks. An auxiliary experiment of conversation response generation task proves the effect of diversity as well.

Computation and Language

Cascaded Head-colliding Attention

154 - Lin Zheng , Zhiyong Wu , Lingpeng Kong 2021

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. At the cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, leading to the problem that many of the heads are redundant in practice, which greatly wastes the capacity of the model. To improve parameter efficiency, we re-formulate the MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline, by $0.6$ perplexity on texttt{Wikitext-103} in language modeling, and by $0.6$ BLEU on texttt{WMT14 EN-DE} in machine translation, due to its improvements on the parameter efficiency.footnote{Our implementation is publicly available at url{https://github.com/LZhengisme/CODA}.}

Computation and Language Machine Learning

Small-footprint Keyword Spotting Using Deep Neural Network and Connectionist Temporal Classifier

236 - Zhiming Wang , Xiaolong Li , Jun Zhou 2017

Mainly for the sake of solving the lack of keyword-specific data, we propose one Keyword Spotting (KWS) system using Deep Neural Network (DNN) and Connectionist Temporal Classifier (CTC) on power-constrained small-footprint mobile devices, taking full advantage of general corpus from continuous speech recognition which is of great amount. DNN is to directly predict the posterior of phoneme units of any personally customized key-phrase, and CTC to produce a confidence score of the given phoneme sequence as responsive decision-making mechanism. The CTC-KWS has competitive performance in comparison with purely DNN based keyword specific KWS, but not increasing any computational complexity.

Computation and Language

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Query-by-Example Keyword Spotting system using Multi-head Attention and Softtriple Loss

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions