This paper focuses on wake-on-intent (WOI) techniques for platforms with limited compute and memory. Our utterance-level intent classification is based on a sequence of keywords in the utterance rather than a single fixed key phrase. The keyword sequence is transformed into four types of input features, namely acoustic, phone, word2vec, and speech2vec features, for per-feature intent learning followed by fused decision making. If a wake intent is detected, it triggers the power-hungry ASR downstream. The system is trained and tested on AMIE, a newly collected internal Intel dataset reported in this paper for the first time. We demonstrate that our key-phrase representation achieves noise-robust intent classification across domains, including in-car human-machine communication. The wake-on-intent system is low-power and low-complexity, which makes it suitable for always-on operation in real-world hardware applications.
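The late-fusion scheme described above — several per-feature intent classifiers whose scores are combined into a single wake decision — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature-stream names mirror the four feature types in the abstract, but the score-averaging fusion rule, the `threshold` value, and the convention that intent index 0 is the wake intent are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_wake_decision(per_feature_logits, threshold=0.5):
    """Late fusion over per-feature intent classifiers.

    per_feature_logits: dict mapping feature-stream name -> logits of
        shape (n_intents,), one entry per stream. The upstream
        classifiers producing these logits are assumed, not shown.
    Returns (wake: bool, fused_probs: np.ndarray).
    Assumes intent index 0 is the "wake" intent (illustrative).
    """
    probs = np.stack([softmax(l) for l in per_feature_logits.values()])
    fused = probs.mean(axis=0)  # score averaging across streams
    return bool(fused[0] >= threshold), fused

# Example: three streams favor the wake intent, one does not.
logits = {
    "acoustic":   np.array([2.0, 0.1, -1.0]),
    "phone":      np.array([1.5, 0.3,  0.0]),
    "word2vec":   np.array([-0.5, 1.0, 0.2]),
    "speech2vec": np.array([2.2, -0.3, 0.1]),
}
wake, fused = fuse_wake_decision(logits)
```

Averaging posteriors is only one common fusion rule; a weighted combination or a small learned combiner could replace the mean without changing the interface.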
A confidence measure is a performance index of particular importance for automatic speech recognition (ASR) systems deployed in real-world scenarios. In the present study, utterance-level neural confidence measure (NCM) in end-to-end automatic speech r
Knowledge Distillation (KD) is a popular area of research for reducing the size of large models while still maintaining good performance. The outputs of larger teacher models are used to guide the training of smaller student models. Given the repetit
Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to imp
In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. T
Human emotional speech is, by its very nature, a time-varying signal. This introduces dynamics intrinsic to automatic speech-based emotion classification. In this work, we explore a spectral decomposition method stemming from fluid dynamics, known as D