Deep gated convolutional networks have proven very effective in single-channel speech separation. However, current state-of-the-art frameworks typically train the gated convolutional networks in the time-frequency (TF) domain. Such an approach imposes an upper bound on perceptual scores, such as the signal-to-distortion ratio (SDR), of the separated utterances, and also fails to exploit an end-to-end framework. In this paper we present a simple and effective integrated end-to-end approach to monaural speech separation, consisting of deep gated convolutional neural networks (GCNN) that take the mixed utterance of two speakers and map it to two separated utterances, each containing only one speaker's voice. In addition, long short-term memory (LSTM) is employed for long-term temporal modeling. As the objective, we propose to train the network by directly optimizing utterance-level SDR in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix corpus demonstrate that this scheme produces more discriminative separated utterances and leads to improved performance on the speaker separation task.
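To make the training objective concrete, here is a minimal sketch of utterance-level SDR maximization under two-speaker permutation invariant training. The PyTorch framing, the plain SDR definition, and the names sdr and pit_sdr_loss are illustrative assumptions, not the paper's exact implementation.

```python
import itertools
import torch

def sdr(est, ref, eps=1e-8):
    # Utterance-level SDR in dB; est, ref: (batch, speakers, samples).
    num = torch.sum(ref ** 2, dim=-1)
    den = torch.sum((ref - est) ** 2, dim=-1)
    return 10 * torch.log10(num / (den + eps) + eps)

def pit_sdr_loss(est, ref):
    # Score every speaker ordering and keep, per utterance, the one
    # with the highest mean SDR; negate so maximizing SDR = minimizing loss.
    perms = itertools.permutations(range(ref.shape[1]))
    scores = torch.stack(
        [sdr(est[:, list(p)], ref).mean(dim=1) for p in perms], dim=1)
    best, _ = scores.max(dim=1)
    return -best.mean()
```

With two speakers there are only 2! = 2 permutations, so the exhaustive search is cheap; PIT resolves the ambiguity of which network output should be matched to which reference speaker.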
Deep dilated temporal convolutional networks (TCN) have proven very effective in sequence modeling. In this paper we propose several improvements to the TCN for an end-to-end approach to monaural speech separation, consisting of 1) multi-scal
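The abstract is cut off before listing the improvements, but the building block it starts from is standard: a residual block around a dilated 1-D convolution, stacked with exponentially growing dilation to enlarge the receptive field. A minimal PyTorch sketch, where the layer sizes and normalization choice are assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class DilatedTCNBlock(nn.Module):
    # One residual TCN block: dilated conv -> PReLU -> norm, plus skip.
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # keep time length fixed
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.act = nn.PReLU()
        self.norm = nn.GroupNorm(1, channels)  # layer norm over channels

    def forward(self, x):  # x: (batch, channels, time)
        return x + self.norm(self.act(self.conv(x)))

# Exponentially growing dilations (1, 2, 4, ...) give the multi-scale
# receptive field that dilated TCNs rely on.
tcn = nn.Sequential(*[DilatedTCNBlock(64, dilation=2 ** i) for i in range(6)])
```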
Dialect identification (DID) is a special case of general language identification (LID), but a more challenging problem due to the linguistic similarity between dialects. In this paper, we propose an end-to-end DID system and a Siamese neural network
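The abstract stops before describing the Siamese network, so the following is only a generic sketch of the Siamese idea, namely a shared encoder that embeds two utterances whose similarity indicates a shared dialect; the encoder layout and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SiameseDID(nn.Module):
    # Twin branches share one encoder; a distance on the embeddings
    # scores whether two utterances come from the same dialect.
    def __init__(self, feat_dim=40, emb_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, x1, x2):  # x1, x2: (batch, feat_dim, frames)
        e1, e2 = self.encoder(x1), self.encoder(x2)
        return nn.functional.cosine_similarity(e1, e2)
```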
Recently, the connectionist temporal classification (CTC) model, coupled with recurrent (RNN) or convolutional neural networks (CNN), has made it easier to train speech recognition systems in an end-to-end fashion. However, in real-valued models, time fram
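For reference, CTC training reduces to a single loss call once the network emits per-frame log-probabilities over the label alphabet plus a blank symbol. The sketch below uses PyTorch's nn.CTCLoss with made-up shapes; the tensor sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

T, N, C = 120, 4, 30  # frames, batch size, classes (incl. blank at index 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)       # network output
targets = torch.randint(1, C, (N, 20), dtype=torch.long)   # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)  # marginalizes over all frame-label alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```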
Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidde
End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as being superior