No Arabic abstract
In this paper, we propose a sub-utterance unit selection framework to remove acoustic segments in audio recordings that carry little information for acoustic scene classification (ASC). Our approach is built upon a universal set of acoustic segment units covering the overall acoustic scene space. First, those units are modeled with acoustic segment models (ASMs) used to tokenize acoustic scene utterances into sequences of acoustic segment units. Next, paralleling the idea of stop words in information retrieval, stop ASMs are automatically detected. Finally, acoustic segments associated with the stop ASMs are blocked, because of their low indexing power in retrieval of most acoustic scenes. In contrast to building scene models with whole utterances, the ASM-removed sub-utterances, i.e., acoustic utterances without stop acoustic segments, are then used as inputs to the AlexNet-L back-end for final classification. On the DCASE 2018 dataset, scene classification accuracy increases from 68%, with whole utterances, to 72.1%, with segment selection. This represents a competitive accuracy without any data augmentation, and/or ensemble strategy. Moreover, our approach compares favourably to AlexNet-L with attention.
Automatic speech recognition (ASR) tasks are resolved by end-to-end deep learning models, which benefits us by less preparation of raw data, and easier transformation between languages. We propose a novel end-to-end deep learning model architecture namely cascaded CNN-resBiLSTM-CTC. In the proposed model, we add residual blocks in BiLSTM layers to extract sophisticated phoneme and semantic information together, and apply cascaded structure to pay more attention mining information of hard negative samples. By applying both simple Fast Fourier Transform (FFT) technique and n-gram language model (LM) rescoring method, we manage to achieve word error rate (WER) of 3.41% on LibriSpeech test clean corpora. Furthermore, we propose a new batch-varied method to speed up the training process in length-varied tasks, which result in 25% less training time.
In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging upon ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes, and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage upon a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-based architectures. On Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500KB. Code is available: https://github.com/MihawkHu/DCASE2020_task1.
Acoustic scene classification systems using deep neural networks classify given recordings into pre-defined classes. In this study, we propose a novel scheme for acoustic scene classification which adopts an audio tagging system inspired by the human perception mechanism. When humans identify an acoustic scene, the existence of different sound events provides discriminative information which affects the judgement. The proposed framework mimics this mechanism using various approaches. Firstly, we employ three methods to concatenate tag vectors extracted using an audio tagging system with an intermediate hidden layer of an acoustic scene classification system. We also explore the multi-head attention on the feature map of an acoustic scene classification system using tag vectors. Experiments conducted on the detection and classification of acoustic scenes and events 2019 task 1-a dataset demonstrate the effectiveness of the proposed scheme. Concatenation and multi-head attention show a classification accuracy of 75.66 % and 75.58 %, respectively, compared to 73.63 % accuracy of the baseline. The system with the proposed two approaches combined demonstrates an accuracy of 76.75 %.
Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns. For example, a cooking scene may contain several sound sources including silverware clinking, chopping, frying, etc. What complicates ASC more is that classes of different activities could have overlapping sounds patterns (e.g. both cooking and dishwashing could have silverware clinking sound). In this paper, we propose a multi-head attention network to model the complex temporal input structures for ASC. The proposed network takes the audios time-frequency representation as input, and it leverages standard VGG plus LSTM layers to extract high-level feature representation. Further more, it applies multiple attention heads to summarize various patterns of sound events into fixed dimensional representation, for the purpose of final scene classification. The whole network is trained in an end-to-end fashion with back-propagation. Experimental results confirm that our model discovers meaningful sound patterns through the attention mechanism, without using explicit supervision in the alignment. We evaluated our proposed model using DCASE 2018 Task 5 dataset, and achieved competitive performance on par with previous winners results.
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub500 conversational evaluation datasets.