ﻻ يوجد ملخص باللغة العربية
The goal of this work is to train effective representations for keyword spotting via metric learning. Most existing works address keyword spotting as a closed-set classification problem, where both target and non-target keywords are predefined. Therefore, prevailing classifier-based keyword spotting systems perform poorly on non-target sounds which are unseen during the training stage, causing high false alarm rates in real-world scenarios. In reality, keyword spotting is a detection problem where predefined target keywords are detected from a variety of unknown sounds. This shares many similarities to metric learning problems in that the unseen and unknown non-target sounds must be clearly differentiated from the target keywords. However, a key difference is that the target keywords are known and predefined. To this end, we propose a new method based on metric learning that maximises the distance between target and non-target keywords, but also learns per-class weights for target keywords `a la classification objectives. Experiments on the Google Speech Commands dataset show that our method significantly reduces false alarms to unseen non-target keywords, while maintaining the overall classification accuracy.
Smart audio devices are gated by an always-on lightweight keyword spotting program to reduce power consumption. It is however challenging to design models that have both high accuracy and low latency for accurate and fast responsiveness. Many efforts
Deep neural networks provide effective solutions to small-footprint keyword spotting (KWS). However, if training data is limited, it remains challenging to achieve robust and highly accurate KWS in real-world scenarios where unseen sounds that are ou
Keyword Spotting (KWS) remains challenging to achieve the trade-off between small footprint and high accuracy. Recently proposed metric learning approaches improved the generalizability of models for the KWS task, and 1D-CNN based KWS models have ach
We consider feature learning for efficient keyword spotting that can be applied in severely under-resourced settings. The objective is to support humanitarian relief programmes by the United Nations in parts of Africa in which almost no language reso
The objective of this paper is open-set speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance.