بحث متقدم مدعوم من الذكاء الصنعي

مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

113 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Andreas Stolcke

تاريخ النشر 2021

مجال البحث هندسة إلكترونية الهندسة المعلوماتية

والبحث باللغة English

تأليف Zhenning Tan - Yuguang Yang - Eunjung Han

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterance and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified using nearest neighbor according to the scoring metric. To better distinguish speakers sharing a device within the same household, we propose a household-adapted nonlinear mapping to a low dimensional space to complement the global scoring metric. The combined scoring function is optimized on labeled or pseudo-labeled speaker utterances. With input dropout, the proposed scoring model reduces EER by 45-71% in simulated households with 2 to 7 hard-to-discriminate speakers per household. On real-world internal data, the EER reduction is 49.2%. From t-SNE visualization, we also show that clusters formed by household-adapted speaker embeddings are more compact and uniformly distributed, compared to clusters formed by global embeddings before adaptation.

قيم البحث

83 - Youngki Kwon , Jee-weon Jung , Hee-Soo Heo 2021

The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have direct ly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques that can be used to better adapt the speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. A wide range of experiments is performed on various challenging datasets. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system achieving an average relative improvement of 25.07% in terms of diarisation error rate over the baseline.

معالجة الصوت والكلام التعلم الآلي أنظمة الصوت في الحاسوب

Emotion Invariant Speaker Embeddings for Speaker Identification with Emotional Speech

114 - Biswajit Dev Sarma , Rohan Kumar Das 2020

Emotional state of a speaker is found to have significant effect in speech production, which can deviate speech from that arising from neutral state. This makes identifying speakers with different emotions a challenging task as generally the speaker models are trained using neutral speech. In this work, we propose to overcome this problem by creation of emotion invariant speaker embedding. We learn an extractor network that maps the test embeddings with different emotions obtained using i-vector based system to an emotion invariant space. The resultant test embeddings thus become emotion invariant and thereby compensate the mismatch between various emotional states. The studies are conducted using four different emotion classes from IEMOCAP database. We obtain an absolute improvement of 2.6% in accuracy for speaker identification studies using emotion invariant speaker embedding against average speaker model based framework with different emotions.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

Learning to fool the speaker recognition

94 - Jiguo Li , Xinfeng Zhang , Jizheng Xu 2020

Due to the widespread deployment of fingerprint/face/speaker recognition systems, attacking deep learning based biometric systems has drawn more and more attention. Previous research mainly studied the attack to the vision-based system, such as finge rprint and face recognition. While the attack for speaker recognition has not been investigated yet, although it has been widely used in our daily life. In this paper, we attempt to fool the state-of-the-art speaker recognition model and present textit{speaker recognition attacker}, a lightweight model to fool the deep speaker recognition model by adding imperceptible perturbations onto the raw speech waveform. We find that the speaker recognition system is also vulnerable to the attack, and we achieve a high success rate on the non-targeted attack. Besides, we also present an effective method to optimize the speaker recognition attacker to obtain a trade-off between the attack success rate with the perceptual quality. Experiments on the TIMIT dataset show that we can achieve a sentence error rate of $99.2%$ with an average SNR $57.2text{dB}$ and PESQ 4.2 with speed rather faster than real-time.

معالجة الصوت والكلام التشفير والأمن أنظمة الصوت في الحاسوب

Universal Adversarial Perturbations Generative Network for Speaker Recognition

119 - Jiguo Li , Xinfeng Zhang , Chuanmin Jia 2020

Attacking deep learning based biometric systems has drawn more and more attention with the wide deployment of fingerprint/face/speaker recognition systems, given the fact that the neural networks are vulnerable to the adversarial examples, which have been intentionally perturbed to remain almost imperceptible for human. In this paper, we demonstrated the existence of the universal adversarial perturbations~(UAPs) for the speaker recognition systems. We proposed a generative network to learn the mapping from the low-dimensional normal distribution to the UAPs subspace, then synthesize the UAPs to perturbe any input signals to spoof the well-trained speaker recognition model with high probability. Experimental results on TIMIT and LibriSpeech datasets demonstrate the effectiveness of our model.

معالجة الصوت والكلام التشفير والأمن أنظمة الصوت في الحاسوب

Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

269 - Naoyuki Kanda , Yashesh Gaur , Xiaofei Wang 2020

We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with at tention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend SOT by introducing a speaker inventory as an auxiliary input to produce speaker labels as well as multi-speaker transcriptions. All model parameters are optimized by speaker-attributed maximum mutual information criterion, which represents a joint probability for overlapped speech recognition and speaker identification. Experiments on LibriSpeech corpus show that our proposed method achieves significantly better speaker-attributed word error rate than the baseline that separately performs overlapped speech recognition and speaker identification.

معالجة الصوت والكلام الحساب واللغة أنظمة الصوت في الحاسوب

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة الجزيرة الخاصة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً