Half-Truth: A Partially Fake Audio Detection Dataset


Published by: Jiangyan Yi

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Jiangyan Yi, Ye Bai, Jianhua Tao


Abstract

Diverse promising datasets, such as the ASVspoof databases, have been designed to advance the development of fake audio detection. However, previous datasets ignore an attack scenario in which the attacker hides a few small fake clips in real speech audio. This poses a serious threat, since it is difficult to distinguish a small fake clip from the rest of the utterance. Therefore, this paper develops such a dataset for half-truth audio detection (HAD). Partially fake audio in the HAD dataset involves changing only a few words in an utterance. The audio of these words is generated with the latest state-of-the-art speech synthesis technology. Using this dataset, we can not only detect fake utterances but also localize the manipulated regions in the speech. Some benchmark results are presented on this dataset. The results show that partially fake audio is much more challenging than fully fake audio for fake audio detection.
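To make the two tasks in the abstract concrete (utterance-level detection and localization of manipulated regions), here is a minimal sketch of how frame-level labels could be derived from annotated fake regions. The function names, the 10 ms frame shift, and the region format are assumptions for illustration only; the abstract does not describe the HAD annotation format.

```python
def frame_labels(num_frames, frame_shift_ms, fake_regions_ms):
    """Return one label per frame: 1 if the frame start time falls inside
    any (start_ms, end_ms) fake region, else 0. Hypothetical helper."""
    labels = []
    for i in range(num_frames):
        t = i * frame_shift_ms  # start time of frame i, in milliseconds
        is_fake = any(start <= t < end for start, end in fake_regions_ms)
        labels.append(1 if is_fake else 0)
    return labels


def utterance_is_fake(labels):
    """An utterance counts as (partially) fake if any frame is fake."""
    return any(labels)


# Six frames at a 10 ms shift, with one replaced word spanning 20-40 ms.
labels = frame_labels(6, 10, [(20, 40)])
print(labels, utterance_is_fake(labels))
```

In this framing, a fully fake utterance is the special case where every frame is labeled 1, which suggests why a few short fake regions are harder to spot.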


Haoxin Ma, Jiangyan Yi, Jianhua Tao (2021)

Fake audio attacks have become a major threat to speaker verification systems. Although current detection approaches have achieved promising results in dataset-specific scenarios, they encounter difficulties on unseen spoofing data. Fine-tuning and retraining from scratch have been applied to incorporate new data. However, fine-tuning leads to performance degradation on previous data, and retraining takes a lot of time and computational resources. Besides, previous data are unavailable in some situations due to privacy. To solve the above problems, this paper proposes detecting fake without forgetting, a continual-learning-based method, to make the model learn new spoofing attacks incrementally. A knowledge distillation loss is added to the loss function to preserve the memory of the original model. Assuming that the distribution of genuine voice is consistent across different scenarios, an extra embedding similarity loss is used as another constraint to further align positive samples. Experiments are conducted on the ASVspoof2019 dataset. The results show that our proposed method outperforms fine-tuning, with a relative reduction in average equal error rate of up to 81.62%.
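For readers unfamiliar with the metric, the following is a generic sketch of how an equal error rate (EER) and a relative reduction can be computed. It is not the paper's evaluation code; the score convention (higher means more likely genuine), the threshold sweep, and the example numbers are assumptions for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.
    labels: 1 = genuine, 0 = spoof; higher score = more likely genuine."""
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        far = sum(s >= t for s in spoof) / len(spoof)     # spoof accepted
        frr = sum(s < t for s in genuine) / len(genuine)  # genuine rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - proposed) / baseline


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))
print(relative_reduction(10.0, 2.0))  # illustrative numbers, not from the paper
```

A relative reduction of 81.62% therefore means the proposed method's average EER is roughly 18% of the fine-tuning baseline's, whatever the absolute values are.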

Computer Audio Systems, Machine Learning, Audio and Speech Processing

Identification of fake stereo audio

Tianyun Liu, Diqun Yan (2021)

The number of channels is one of the important criteria for digital audio quality. Generally, two-channel stereo audio provides better perceptual quality than mono audio. To seek illegal commercial benefit, one might convert mono audio to stereo audio with fake quality. Identifying fake stereo audio is still a less-investigated audio forensic issue. In this paper, a fake stereo corpus, created with the Haas effect technique, is first presented. Then the effect of stereo faking on Mel Frequency Cepstral Coefficients (MFCC) is analyzed to find the difference between real and fake stereo audio. Finally, an effective algorithm for identifying fake stereo audio is proposed, in which 80-dimensional MFCC features and a Support Vector Machine (SVM) classifier are adopted. The experimental results on three datasets with five different cut-off frequencies show that the proposed algorithm can effectively detect fake stereo audio and achieves good robustness.
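As a rough illustration of the Haas effect technique mentioned above, pseudo-stereo can be produced from a mono signal by pairing it with a slightly delayed copy of itself. The sketch below assumes a plain sample delay on the right channel; the function name and the 20 ms default are hypothetical, and the abstract does not state the parameters used to build the corpus.

```python
def haas_fake_stereo(mono, sample_rate, delay_ms=20.0):
    """Build a two-channel signal from mono samples: the left channel is
    the original, the right channel is the same signal delayed by delay_ms.
    Returns a list of (left, right) sample pairs."""
    delay = int(round(sample_rate * delay_ms / 1000.0))  # delay in samples
    delayed = ([0.0] * delay + list(mono))[: len(mono)]  # pad, then trim
    return list(zip(mono, delayed))


# Five samples at 1 kHz with a 2 ms delay, i.e. a 2-sample shift.
stereo = haas_fake_stereo([1, 2, 3, 4, 5], sample_rate=1000, delay_ms=2.0)
print(stereo)
```

Because both channels carry the same content up to a shift, they are far more correlated than genuinely recorded stereo, which is the kind of regularity a feature-based detector could exploit.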

Computer Audio Systems, Multimedia, Audio and Speech Processing

LSSED: a large-scale dataset and benchmark for speech emotion recognition

Weiquan Fan, Xiangmin Xu, Xiaofen Xing (2021)

Speech emotion recognition is a vital contributor to the next generation of human-computer interaction (HCI). However, existing small-scale databases have limited the development of related research. In this paper, we present LSSED, a challenging large-scale English speech emotion dataset, which has data collected from 820 subjects to simulate real-world distribution. In addition, we release some pre-trained models based on LSSED, which can not only promote the development of speech emotion recognition, but can also be transferred to related downstream tasks such as mental health analysis, where data is extremely difficult to collect. Finally, our experiments show the necessity of large-scale datasets and the effectiveness of pre-trained models. The dataset will be released at https://github.com/tobefans/LSSED.

Computer Audio Systems, Artificial Intelligence, Computation and Language

Cough Detection Using Selected Informative Features from Audio Signals

Xinru Chen, Menghan Hu, Guangtao Zhai (2021)

Cough is a common symptom of respiratory and lung diseases. Cough detection is important for preventing, assessing and controlling epidemics such as COVID-19. This paper proposes a model to detect cough events from cough audio signals. The models are trained on a dataset that combines the ESC-50 dataset with self-recorded cough recordings. The test dataset contains cough recordings collected from inpatients of the respiratory disease department in Ruijin Hospital. In total, we build 15 cough detection models based on different numbers of features selected by the Random Frog, Uninformative Variable Elimination (UVE), and Variable Influence on Projection (VIP) algorithms, respectively. The optimal model is based on 20 features selected from Mel Frequency Cepstral Coefficients (MFCC) features by the UVE algorithm and classified with a linear two-class Support Vector Machine (SVM) classifier. The best cough detection model achieves an accuracy, recall, precision and F1-score of 94.9%, 97.1%, 93.1% and 0.95, respectively. Its excellent performance with a low-dimensional feature vector shows its potential for use on mobile devices such as smartphones, thus making cough detection remote and non-contact.
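The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the figures quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision 93.1% and recall 97.1%, as reported in the abstract.
f1 = f1_score(0.931, 0.971)
print(round(f1, 2))  # 0.95
```

The result rounds to 0.95, which agrees with the F1-score stated in the abstract.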

Computer Audio Systems, Artificial Intelligence, Multimedia

Automatic non-invasive Cough Detection based on Accelerometer and Audio Signals

Madhurananda Pahar, Igor Miranda, Andreas Diacon (2021)

We present an automatic non-invasive way of detecting cough events based on both accelerometer and audio signals. The acceleration signals are captured by a smartphone firmly attached to the patient's bed, using its integrated accelerometer. The audio signals are captured simultaneously by the same smartphone using an external microphone. We have compiled a manually annotated dataset containing such simultaneously captured acceleration and audio signals for approximately 6000 cough and 68000 non-cough events from 14 adult male patients in a tuberculosis clinic. LR, SVM and MLP are evaluated as baseline classifiers and compared with deep architectures such as CNN, LSTM, and Resnet50 using a leave-one-out cross-validation scheme. We find that the studied classifiers can use either acceleration or audio signals to distinguish between coughing and other activities, including sneezing, throat-clearing, and movement on the bed, with high accuracy. However, in all cases, the deep neural networks outperform the shallow classifiers by a clear margin, and the Resnet50 offers the best performance, achieving an AUC exceeding 0.98 and 0.99 for acceleration and audio signals respectively. While audio-based classification consistently offers better performance than acceleration-based classification, we observe that the difference is very small for the best systems. Since the acceleration signal requires less processing power, since the need to record audio is sidestepped and thus privacy is inherently secured, and since the recording device is attached to the bed and not worn, an accelerometer-based, highly accurate, non-invasive cough detector may represent a more convenient and readily accepted method for long-term cough monitoring.
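The leave-one-out scheme mentioned above can be sketched as a simple fold generator. This assumes that the unit left out in each fold is a patient, which the abstract does not state explicitly; the function name and the integer patient IDs are hypothetical.

```python
def leave_one_out_splits(patient_ids):
    """Yield (train_ids, held_out_id) pairs, holding out each ID once.
    Assumes the held-out unit is a patient (not stated in the abstract)."""
    ids = list(patient_ids)
    for held_out in ids:
        train = [p for p in ids if p != held_out]
        yield train, held_out


# With 14 patients this gives 14 folds, each training on the other 13.
splits = list(leave_one_out_splits(range(1, 15)))
print(len(splits), splits[0])
```

Splitting by patient rather than by event would keep all recordings from the held-out person out of training, so the reported scores would reflect generalization to unseen patients.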

Computer Audio Systems, Artificial Intelligence, Machine Learning