In many multi-microphone, multi-device processing applications, synchronization among input channels can be compromised by the lack of a common clock and by isolated sample drops. In this work, we address sample drop detection in a conversational speech scenario recorded by a set of microphones distributed in space. The goal is to design a neural model that, given a short window in the time domain, detects whether one or more devices have been subject to a sample drop event. Candidate time windows are selected from a set of large time intervals, possibly including a sample drop, by a preprocessing step based on the normalized cross-correlation between signals acquired by different devices. The neural network architecture relies on a CNN-LSTM encoder followed by multi-head attention. Experiments are conducted on both artificial and real data. Our proposed approach obtained an F1 score of 88% on an evaluation set extracted from the CHiME-5 corpus, with comparable performance in a larger set of experiments on multi-channel artificial scenes.
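The pipeline above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration, not the authors' implementation: the function names, window lengths, layer sizes, and the 0.5 threshold are all assumptions. It computes the normalized cross-correlation between two channels to flag candidate windows, then scores a flagged window with a CNN-LSTM encoder followed by multi-head attention, written in PyTorch.

```python
# Minimal sketch (assumed details, not the paper's code): normalized
# cross-correlation (NCC) to pre-select candidate windows, plus a
# CNN-LSTM encoder followed by multi-head attention as the detector.
import numpy as np
import torch
import torch.nn as nn

def normalized_xcorr(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Full normalized cross-correlation between two 1-D signals."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return np.correlate(a, b, mode="full") / len(a)

class DropDetector(nn.Module):
    """CNN-LSTM encoder followed by multi-head attention (sizes assumed)."""
    def __init__(self, n_mics: int = 2, hidden: int = 128, heads: int = 4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mics, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(64, hidden, kernel_size=9, stride=4), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # P(sample drop in this window)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_mics, samples), a short candidate time window
        h = self.cnn(x).transpose(1, 2)   # (batch, frames, hidden)
        h, _ = self.lstm(h)               # temporal encoding
        h, _ = self.attn(h, h, h)         # self-attention over frames
        return torch.sigmoid(self.head(h.mean(dim=1))).squeeze(-1)

# Usage: flag window pairs with a low NCC peak, then score them.
fs = 16_000
a, b = np.random.randn(fs), np.random.randn(fs)  # two devices, 1 s each
if normalized_xcorr(a, b).max() < 0.5:           # threshold is assumed
    model = DropDetector()
    window = torch.randn(1, 2, fs)               # (batch, mics, samples)
    print("drop probability:", model(window).item())
```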
We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in one-pass decoding for large-vocabulary continuous speech recognition. Unlike standard one-pass decoding with on-the-fly composition …
Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings, where run-time performance can be a sign…
Generative adversarial networks (GANs) have shown potential in learning emotional attributes and generating new data samples. However, their performance is usually hindered by the unavailability of large speech emotion recognition (SER) datasets. …
Speech emotion recognition is a challenging and important research topic that plays a critical role in human-computer interaction. Multimodal inputs can improve performance, as more emotional information is used for recognition. However, existing …
Speech emotion recognition is a crucial problem manifesting in a multitude of applications, such as human-computer interaction and education. Although several advancements have been made in recent years, especially with the advent of Deep Neural Networks …