ﻻ يوجد ملخص باللغة العربية
Formant tracking is one of the most fundamental problems in speech processing. Traditionally, formants are estimated using signal processing methods. Recent studies showed that generic convolutional architectures can outperform recurrent networks on temporal tasks such as speech synthesis and machine translation. In this paper, we explored the use of Temporal Convolutional Network (TCN) for formant tracking. In addition to the conventional implementation, we modified the architecture from three aspects. First, we turned off the causal mode of dilated convolution, making the dilated convolution see the future speech frames. Second, each hidden layer reused the output information from all the previous layers through dense connection. Third, we also adopted a gating mechanism to alleviate the problem of gradient disappearance by selectively forgetting unimportant information. The model was validated on the open access formant database VTR. The experiment showed that our proposed model was easy to converge and achieved an overall mean absolute percent error (MAPE) of 8.2% on speech-labeled frames, compared to three competitive baselines of 9.4% (LSTM), 9.1% (Bi-LSTM) and 8.9% (TCN).
Speech recognition is a well developed research field so that the current state of the art systems are being used in many applications in the software industry, yet as by today, there still does not exist such robust system for the recognition of wor
Multi-stage learning is an effective technique to invoke multiple deep-learning modules sequentially. This paper applies multi-stage learning to speech enhancement by using a multi-stage structure, where each stage comprises a self-attention (SA) blo
This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker i
Recent efforts have been made on domestic activities classification from audio recordings, especially the works submitted to the challenge of DCASE (Detection and Classification of Acoustic Scenes and Events) since 2018. In contrast, few studies were
In this paper, we study a discriminatively trained deep convolutional network for the task of visual tracking. Our tracker utilizes both motion and appearance features that are extracted from a pre-trained dual stream deep convolution network. We sho