Frame-based overlapping speech detection using Convolutional Neural Networks

Published by: Midia Yousefi

Publication date: 2020

Research field: Electronic engineering

Language: English

Authors: Midia Yousefi, John H.L. Hansen

Audio and speech processing, Signal processing

Naturalistic speech recordings usually contain speech signals from multiple speakers. This phenomenon can degrade the performance of speech technologies due to the complexity of tracing and recognizing individual speakers. In this study, we investigate the detection of overlapping speech on segments as short as 25 ms using Convolutional Neural Networks. We evaluate the detection performance using different spectral features, and show that pyknogram features outperform other commonly used speech features. The proposed system can predict overlapping speech with an accuracy of 84% and an F-score of 88% on a dataset of mixed speech generated from the GRID dataset.
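As an illustration of the frame-level setup and the two reported metrics, the following minimal Python sketch converts a 25 ms frame length to samples and computes accuracy and F-score from binary per-frame labels. The 16 kHz sample rate, the function names, and the toy labels are assumptions made for this example; they are not taken from the paper.

```python
def frame_length_in_samples(sample_rate_hz, frame_ms=25):
    """Number of samples in one frame (integer arithmetic avoids float rounding)."""
    return sample_rate_hz * frame_ms // 1000


def accuracy_and_f_score(reference, predicted):
    """Accuracy and F-score for binary frame labels (1 = overlapping speech)."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("label sequences must be non-empty and of equal length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Toy example (assumed labels, one value per 25 ms frame).
reference = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
accuracy, f_score = accuracy_and_f_score(reference, predicted)
```

Because the F-score ignores true negatives, it can differ noticeably from accuracy when overlapping and non-overlapping frames are not evenly balanced.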

358 - Tom Sercu, Neil Mallinar 2019

We introduce Multi-Frame Cross-Entropy training (MFCE) for convolutional neural network acoustic models. Recognizing that, like RNNs, CNNs are by nature sequence models that take variable-length inputs, we propose to take as input to the CNN a part of an utterance long enough that multiple labels are predicted at once, therefore getting cross-entropy loss signal from multiple adjacent frames. This drastically increases the amount of label information at a small marginal computational cost. We show large WER improvements on hub5 and rt02 after training on the 2000-hour Switchboard benchmark.

Audio and speech processing, Computation and language, Machine learning

Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network

116 - Jee-weon Jung, Hee-Soo Heo, Youngki Kwon 2021

In this work, we propose an overlapped speech detection system trained as a three-class classifier. Unlike conventional systems that perform binary classification as to whether or not a frame contains overlapped speech, the proposed approach classifies into three classes: non-speech, single speaker speech, and overlapped speech. By training a network with the more detailed label definition, the model can learn a better notion of the number of speakers included in a given frame. A convolutional recurrent neural network architecture is explored to benefit from both the convolutional layers' capability to model local patterns and the recurrent layers' ability to model sequential information. The proposed overlapped speech detection model establishes a state-of-the-art performance with a precision of 0.6648 and a recall of 0.3222 on the DIHARD II evaluation set, showing a 20% increase in recall along with higher precision. In addition, we also introduce a simple approach to utilize the proposed overlapped speech detection model for speaker diarization, which ranked third in Track 1 of the DIHARD III challenge.
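The three-class label definition can be sketched as follows. This is a minimal illustration that assumes per-speaker binary activity is already available for each frame; the function name, class constants, and input layout are hypothetical and not taken from the paper.

```python
NON_SPEECH, SINGLE_SPEAKER, OVERLAPPED = 0, 1, 2


def three_class_labels(speaker_activity):
    """Map per-speaker frame activity to non-speech / single / overlapped labels.

    speaker_activity: one list per speaker, each holding 0/1 per frame
    (assumed input format for this sketch).
    """
    labels = []
    for frame in zip(*speaker_activity):
        active = sum(frame)  # number of speakers active in this frame
        if active == 0:
            labels.append(NON_SPEECH)
        elif active == 1:
            labels.append(SINGLE_SPEAKER)
        else:
            labels.append(OVERLAPPED)
    return labels


# Toy example: three speakers, four frames.
activity = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
labels = three_class_labels(activity)
```

A binary overlap detector would merge the first two classes; keeping them apart is what gives the classifier the finer-grained target described in the abstract.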

Audio and speech processing, Machine learning, Computer sound systems

MIMO Speech Compression and Enhancement Based on Convolutional Denoising Autoencoder

121 - You-Jin Li, Syu-Siang Wang, Yu Tsao 2020

For speech-related applications in IoT environments, identifying effective methods to handle interference noises and compress the amount of data in transmissions is essential to achieve high-quality services. In this study, we propose a novel multi-input multi-output speech compression and enhancement (MIMO-SCE) system based on a convolutional denoising autoencoder (CDAE) model to simultaneously improve speech quality and reduce the dimensions of transmission data. Compared with conventional single-channel and multi-input single-output systems, MIMO systems can be employed in applications where multiple acoustic signals need to be handled. We investigated two CDAE models, a fully convolutional network (FCN) and a Sinc FCN, as the core models in MIMO systems. The experimental results confirm that the proposed MIMO-SCE framework effectively improves speech quality and intelligibility while reducing the amount of recording data by a factor of 7 for transmission.

Audio and speech processing, Signal processing

Monaural Speech Enhancement Using a Multi-Branch Temporal Convolutional Network

93 - Qiquan Zhang, Aaron Nicolson, Mingjiang Wang 2019

Deep learning has achieved substantial improvement on single-channel speech enhancement tasks. However, the performance of multi-layer perceptron (MLP)-based methods is limited by their ability to capture long-term history information. Recurrent neural networks (RNNs), e.g., the long short-term memory (LSTM) model, are able to capture long-term temporal dependencies, but come with the issues of high latency and training complexity. To address these issues, the temporal convolutional network (TCN) was proposed to replace RNNs in various sequence modeling tasks. In this paper we propose a novel TCN model that employs a multi-branch structure, called multi-branch TCN (MB-TCN), for monaural speech enhancement. The MB-TCN exploits a split-transform-aggregate design, which is expected to obtain strong representational power at a low computational complexity. Inspired by the TCN, the MB-TCN model incorporates one-dimensional causal dilated CNNs and residual learning to expand receptive fields for capturing long-term temporal contextual information. Our extensive experimental investigation suggests that the MB-TCNs outperform the residual long short-term memory networks (ResLSTMs), temporal convolutional networks (TCNs), and the CNN networks that employ dense aggregations in terms of speech intelligibility and quality, while providing superior parameter efficiency. Furthermore, our experimental results demonstrate that our proposed MB-TCN model is able to outperform multiple state-of-the-art deep learning-based speech enhancement methods in terms of five widely used objective metrics.
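To make the link between dilation and temporal context concrete, this sketch computes the receptive field of a stack of one-dimensional causal dilated convolutions. It assumes one stride-1 convolution per layer with a shared kernel size; the kernel size and dilation schedule below are illustrative assumptions, not the MB-TCN configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1-D convolutions.

    Each layer with dilation d extends the context by (kernel_size - 1) * d frames.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Doubling the dilation at each layer grows the context exponentially with depth.
context = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])
```

With causal padding, all of that context lies in the past, so the output at a given frame never depends on future frames.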

Audio and speech processing, Signal processing

FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition

297 - Bonaventure F. P. Dossou, Yeno K. S. Gbenou 2021

Using mel-spectrograms rather than conventional MFCC features, we assess the abilities of convolutional neural networks to accurately recognize and classify emotions from speech data. We introduce FSER, a speech emotion recognition model trained on four valid speech databases, achieving a high classification accuracy of 95.05% over 8 different emotion classes: anger, anxiety, calm, disgust, happiness, neutral, sadness, surprise. On each benchmark dataset, FSER outperforms the best models introduced so far, achieving state-of-the-art performance. We show that FSER stays reliable, independently of the language, sex identity, and any other external factor. Additionally, we describe how FSER could potentially be used to improve mental and emotional health care and how our analysis and findings serve as guidelines and benchmarks for further works in the same direction.

Audio and speech processing, Computer vision and pattern recognition, Human-computer interaction
