Evaluating Coherence in Dialogue Systems using Entailment

105 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ehsan Kamalloo

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Nouha Dziri - Ehsan Kamalloo - Kory W. Mathewson

الحساب واللغة التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers. Automatic metrics such as BLEU correlate weakly with human annotations, resulting in a significant bias across different models and datasets. Some researchers resort to human judgment experimentation for assessing response quality, which is expensive, time consuming, and not scalable. Moreover, judges tend to evaluate a small number of dialogues, meaning that minor differences in evaluation configuration may lead to dissimilar results. In this paper, we present interpretable metrics for evaluating topic coherence by making use of distributed sentence representations. Furthermore, we introduce calculable approximations of human judgment based on conversational coherence by adopting state-of-the-art entailment techniques. Results show that our metrics can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate for the quality of the responses.

قيم البحث

191 - Nouha Dziri , Hannah Rashkin , Tal Linzen 2021

Knowledge-grounded dialogue agents are systems designed to conduct a conversation based on externally provided background information, such as a Wikipedia page. Such dialogue agents, especially those based on neural network language models, often pro duce responses that sound fluent but are not justified by the background information. Progress towards addressing this problem requires developing automatic evaluation metrics that can quantify the extent to which responses are grounded in background information. To facilitate evaluation of such metrics, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN). BEGIN consists of 8113 dialogue turns generated by language-model-based dialogue systems, accompanied by humans annotations specifying the relationship between the systems response and the background information. These annotations are based on an extension of the natural language inference paradigm. We use the benchmark to demonstrate the effectiveness of adversarially generated data for improving an evaluation metric based on existing natural language inference datasets.

الحساب واللغة

Evaluating Document Coherence Modelling

171 - Aili Shen , Meladel Mistica , Bahar Salehi 2021

While pretrained language models (LM) have driven impressive gains over morpho-syntactic and semantic tasks, their ability to model discourse and pragmatic phenomena is less clear. As a step towards a better understanding of their discourse modelling capabilities, we propose a sentence intrusion detection task. We examine the performance of a broad range of pretrained LMs on this detection task for English. Lacking a dataset for the task, we introduce INSteD, a novel intruder sentence detection dataset, containing 170,000+ documents constructed from English Wikipedia and CNN news articles. Our experiments show that pretrained LMs perform impressively in in-domain evaluation, but experience a substantial drop in the cross-domain setting, indicating limited generalisation capacity. Further results over a novel linguistic probe dataset show that there is substantial room for improvement, especially in the cross-domain setting.

الحساب واللغة الذكاء الاصطناعي

Attention over Parameters for Dialogue Systems

296 - Andrea Madotto , Zhaojiang Lin , Chien-Sheng Wu 2020

Dialogue systems require a great deal of different but complementary expertise to assist, inform, and entertain humans. For example, different domains (e.g., restaurant reservation, train ticket booking) of goal-oriented dialogue systems can be viewe d as different skills, and so does ordinary chatting abilities of chit-chat dialogue systems. In this paper, we propose to learn a dialogue system that independently parameterizes different dialogue skills, and learns to select and combine each of them through Attention over Parameters (AoP). The experimental results show that this approach achieves competitive performance on a combined dataset of MultiWOZ, In-Car Assistant, and Persona-Chat. Finally, we demonstrate that each dialogue skill is effectively learned and can be combined with other skills to produce selective responses.

الحساب واللغة التعلم الآلي

Adversarial Examples for Evaluating Reading Comprehension Systems

95 - Robin Jia , Percy Liang 2017

Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of $75%$ F1 score to $36%$; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to $7%$. We hope our insights will motivate the development of new models that understand language more precisely.

الحساب واللغة التعلم الآلي

Conditioned Query Generation for Task-Oriented Dialogue Systems

190 - Stephane dAscoli , Alice Coucke , Francesco Caltagirone 2019

Scarcity of training data for task-oriented dialogue systems is a well known problem that is usually tackled with costly and time-consuming manual data annotation. An alternative solution is to rely on automatic text generation which, although less a ccurate than human supervision, has the advantage of being cheap and fast. In this paper we propose a novel controlled data generation method that could be used as a training augmentation framework for closed-domain dialogue. Our contribution is twofold. First we show how to optimally train and control the generation of intent-specific sentences using a conditional variational autoencoder. Then we introduce a novel protocol called query transfer that allows to leverage a broad, unlabelled dataset to extract relevant information. Comparison with two different baselines shows that our method, in the appropriate regime, consistently improves the diversity of the generated queries without compromising their quality.

الحساب واللغة التعلم الآلي التعلم الالي