Question answering (QA) systems are now available through numerous commercial applications for a wide variety of domains, serving millions of users who interact with them via speech interfaces. However, current benchmarks in QA research do not account for the errors that speech recognition models might introduce, nor do they consider the language variations (dialects) of the users. To address this gap, we augment an existing QA dataset to construct a multi-dialect, spoken QA benchmark on five languages (Arabic, Bengali, English, Kiswahili, Korean) with more than 68k audio prompts in 24 dialects from 255 speakers. We provide baseline results showcasing the real-world performance of QA systems and analyze the effect of language variety and other sensitive speaker attributes on downstream performance. Finally, we study the fairness of the ASR and QA models with respect to the underlying user populations.
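To make the cascaded setup described above concrete, the following is a minimal sketch of how a spoken-QA pipeline might be evaluated: an audio prompt is transcribed by an off-the-shelf ASR model, the (possibly noisy) transcript is fed to an extractive QA model, and word error rate is computed against the written question. The model checkpoints, file path, and example texts are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch of cascaded spoken-QA evaluation, assuming Hugging Face
# `transformers` pipelines and `jiwer` for word error rate (WER).
from transformers import pipeline
import jiwer

# 1) Transcribe the spoken question with an off-the-shelf ASR model.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
# 2) Answer the transcribed question with an extractive QA model.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def evaluate_prompt(audio_path, reference_question, context):
    """Run one audio prompt through ASR, then QA on the transcript."""
    transcript = asr(audio_path)["text"]
    # ASR quality relative to the written form of the question (lower is better).
    wer = jiwer.wer(reference_question, transcript)
    # Downstream QA runs on the transcript, so ASR errors propagate here.
    answer = qa(question=transcript, context=context)
    return {"transcript": transcript, "wer": wer,
            "answer": answer["answer"], "qa_score": answer["score"]}

result = evaluate_prompt(
    "prompts/speaker_001.wav",  # hypothetical audio prompt from the benchmark
    "Who wrote the novel Season of Migration to the North?",
    "Season of Migration to the North is a novel by Tayeb Salih.",
)
print(result)
```

Aggregating `wer` and `qa_score` per dialect (and per speaker attribute) over many such prompts is one straightforward way to surface the downstream-performance and fairness gaps the benchmark is designed to measure.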