
A Multimodal Machine Learning Framework for Teacher Vocal Delivery Evaluation

Published by: Zitao Liu
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





The quality of vocal delivery is one of the key indicators for evaluating teacher enthusiasm, which is widely accepted to be connected to overall course quality. However, existing evaluation of vocal delivery is mainly conducted with manual ratings, which face two core challenges: subjectivity and time consumption. In this paper, we present a novel machine learning approach that utilizes pairwise comparisons and a multimodal orthogonal fusing algorithm to generate large-scale objective evaluation results of teacher vocal delivery in terms of fluency and passion. We collect two datasets from real-world education scenarios, and the experimental results demonstrate the effectiveness of our algorithm. To encourage reproducible results, we make our code publicly available at https://github.com/tal-ai/ML4VocalDelivery.git.
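The pairwise-comparison idea can be illustrated with a short sketch: instead of regressing absolute ratings, a scorer is trained so that the clip judged better in a human pairwise comparison receives the higher score. The model shape, embedding size, and logistic loss below are illustrative assumptions, not the authors' released implementation (their code is at the GitHub link above).

```python
# Minimal sketch of pairwise-comparison training for a vocal-delivery scorer.
# All names and dimensions are hypothetical illustrations.
import torch
import torch.nn as nn

class DeliveryScorer(nn.Module):
    """Maps a fused multimodal clip embedding to a scalar fluency/passion score."""
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.head(x).squeeze(-1)

def pairwise_loss(score_a, score_b):
    # Bradley-Terry style logistic loss: clip A was judged better than clip B,
    # so the loss pushes score_a above score_b.
    return nn.functional.softplus(score_b - score_a).mean()

# Toy usage: each row is a fused embedding of one classroom audio clip,
# with clips in batch A preferred over the paired clips in batch B.
scorer = DeliveryScorer()
emb_a, emb_b = torch.randn(8, 256), torch.randn(8, 256)
loss = pairwise_loss(scorer(emb_a), scorer(emb_b))
loss.backward()
```

At inference time, the learned scalar scores can be used directly to rank or grade clips without collecting further pairwise annotations.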


Read also

The development of pathological speech systems is currently hindered by the lack of a standardised objective evaluation framework. In this work, (1) we utilise existing detection and analysis techniques to propose a general framework for the consistent evaluation of synthetic pathological speech. This framework evaluates the voice quality and the intelligibility aspects of speech, and our experiments show the two to be complementary. (2) Using the proposed evaluation framework, we develop and test a dysarthric voice conversion (VC) system using CycleGAN-VC and a PSOLA-based speech rate modification technique. We show that the developed system is able to synthesise dysarthric speech with different levels of speech intelligibility.
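A PSOLA-based rate change is not available out of the box in common Python audio libraries, so the sketch below uses librosa's phase-vocoder time stretch purely as a stand-in for the speech-rate modification described above. The file names and stretch rate are illustrative assumptions.

```python
# Slow down an utterance to mimic a reduced (dysarthric-style) speaking rate.
# Note: librosa.effects.time_stretch is a phase vocoder, not PSOLA; it is used
# here only as an approximate stand-in.
import librosa
import soundfile as sf

y, sr = librosa.load("healthy_utterance.wav", sr=16000)  # hypothetical input file
y_slow = librosa.effects.time_stretch(y, rate=0.7)        # rate < 1 slows speech down
sf.write("rate_modified.wav", y_slow, sr)
```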
Keunwoo Choi, Yuxuan Wang (2021)
We propose a multimodal singing language classification model that uses both audio content and textual metadata. LRID-Net, the proposed model, takes an audio signal and a language probability vector estimated from the metadata and outputs the probabi lities of the target languages. Optionally, LRID-Net is facilitated with modality dropouts to handle a missing modality. In the experiment, we trained several LRID-Nets with varying modality dropout configuration and tested them with various combinations of input modalities. The experiment results demonstrate that using multimodal input improves performance. The results also suggest that adopting modality dropout does not degrade the performance of the model when there are full modality inputs while enabling the model to handle missing modality cases to some extent.
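Modality dropout of the kind described above can be sketched in a few lines: during training, one input branch is occasionally zeroed so the classifier does not become dependent on always receiving both modalities. The layer sizes, dropout probability, and fusion by concatenation are illustrative assumptions, not LRID-Net's actual architecture.

```python
# Minimal sketch of modality dropout for an audio + metadata classifier.
import torch
import torch.nn as nn

class TwoModalityClassifier(nn.Module):
    def __init__(self, audio_dim=128, meta_dim=10, n_classes=10, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.audio_enc = nn.Linear(audio_dim, 64)
        self.meta_enc = nn.Linear(meta_dim, 64)
        self.clf = nn.Linear(128, n_classes)

    def forward(self, audio, meta):
        h_a = torch.relu(self.audio_enc(audio))
        h_m = torch.relu(self.meta_enc(meta))
        if self.training and torch.rand(()) < self.p_drop:
            # Randomly zero out one modality so the model learns to cope
            # when that input is missing at test time.
            if torch.rand(()) < 0.5:
                h_a = torch.zeros_like(h_a)
            else:
                h_m = torch.zeros_like(h_m)
        return self.clf(torch.cat([h_a, h_m], dim=-1))
```

At test time, a genuinely missing modality can be fed as the same zero vector the model saw during training.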
In this manuscript, the topic of multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, EmoSet, is assembled from a number of existing SER corpora. In total, EmoSet contains 84,181 audio recordings from 26 SER corpora with a total duration of over 65 hours. The corpus is then utilised to create a novel framework for multi-corpus speech emotion recognition, namely EmoNet. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on EmoSet. Compared against two suitable baselines and more traditional training and transfer settings for the ResNet, the residual adapter approach enables parameter-efficient training of a multi-domain SER model on all 26 corpora. A shared model with only $3.5$ times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 corpora in EmoSet. Measured by McNemar's test, these improvements are significant for ten datasets at $p<0.05$, while only two corpora see significant decreases across the residual adapter transfer experiments. Finally, we make our EmoNet framework publicly available for users and developers at https://github.com/EIHW/EmoNet. EmoNet provides an extensive command line interface which is comprehensively documented and can be used in a variety of multi-corpus transfer learning settings.
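The residual-adapter idea behind this parameter efficiency can be summarised compactly: a small corpus-specific bottleneck is added on top of a shared (frozen) representation, so only a handful of parameters need training per corpus. The dimensions below are illustrative assumptions, not the EmoNet configuration.

```python
# Minimal sketch of a residual adapter module in the spirit of multi-domain transfer.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Only these adapter weights are trained per corpus; the shared backbone
        # stays fixed, which keeps the per-corpus parameter cost small.
        return x + self.up(torch.relu(self.down(x)))
```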
Bradley He, Martin Radfar (2021)
In order to evaluate the performance of attention-based neural ASR under noisy conditions, the current trend is to present hours of various noisy speech data to the model and measure the overall word/phoneme error rate (W/PER). In general, it is unclear how these models perform when exposed to a cocktail-party setup in which two or more speakers are active. In this paper, we present mixtures of speech signals to a popular attention-based neural ASR, known as Listen, Attend, and Spell (LAS), at different target-to-interference ratios (TIR) and measure the phoneme error rate. In particular, we investigate in detail which phoneme is predicted when two phonemes are mixed; in this fashion, we build a model that gives the most probable predictions for each phoneme. We found a 65% relative increase in PER when LAS was presented with mixed speech signals at TIR = 0 dB, and the performance approaches the unmixed scenario at TIR = 30 dB. Our results show that, when presented with mixed phoneme signals, the model tends to predict those phonemes that have higher accuracies when the original, unmixed phoneme signals are evaluated.
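Creating such mixtures amounts to scaling the interfering utterance so that the power ratio between target and interference matches the desired TIR in dB; a minimal sketch with illustrative variable names is given below.

```python
# Mix a target and an interfering utterance at a given TIR (in dB).
import numpy as np

def mix_at_tir(target, interference, tir_db):
    # Scale the interference so that 10*log10(P_target / P_interference) == tir_db.
    p_t = np.mean(target ** 2)
    p_i = np.mean(interference ** 2) + 1e-12
    gain = np.sqrt(p_t / (p_i * 10 ** (tir_db / 10.0)))
    n = min(len(target), len(interference))
    return target[:n] + gain * interference[:n]
```

With tir_db = 0 both utterances have equal power, and large positive values approach the unmixed target.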
We introduce Surfboard, an open-source Python library for extracting audio features with application to the medical domain. Surfboard is written with the aim of addressing pain points of existing libraries and facilitating joint use with modern machine learning frameworks. The package can be accessed both programmatically in Python and via its command line interface, allowing it to be easily integrated within machine learning workflows. It builds on state-of-the-art audio analysis packages and offers multiprocessing support for processing large workloads. We review similar frameworks and describe Surfboard's architecture, including the clinical motivation for its features. Using the mPower dataset, we illustrate Surfboard's application to a Parkinson's disease classification task, highlighting common pitfalls in existing research. The source code is released to the research community to facilitate future audio research in the clinical domain.
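As a quick orientation, loading a recording and computing one feature programmatically looks roughly like the snippet below; the import path, class name, and method name are assumptions that should be verified against the Surfboard repository before use.

```python
# Rough sketch of programmatic Surfboard usage (names assumed, verify against docs).
from surfboard.sound import Waveform

sound = Waveform(path="clip.wav", sample_rate=44100)  # hypothetical input file
mfcc = sound.mfcc()                                    # one of the library's feature components
print(mfcc.shape)
```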