Though generative dialogue modeling is widely seen as a language modeling task, the task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the agent's holistic interaction, and such metrics have previously been shown not to correlate with human judgment. In this work, we observe that human evaluation of dialogue agents can be inconclusive due to the lack of sufficient information for appropriate evaluation. Automatic metrics are deterministic yet shallow, while human evaluation can be relevant yet inconclusive. To bridge this gap in evaluation, we propose designing a set of probing tasks to evaluate dialogue models. These hand-crafted tasks are aimed at quantitatively evaluating a generative dialogue model's understanding beyond token-level evaluation of the generated text. The probing tasks are deterministic like automatic metrics and require human judgment in their design, benefiting from the best of both worlds. Through experiments on probe tasks we observe that, unlike RNN-based architectures, the transformer model may not be learning to comprehend its input text despite its generated text having higher overlap with the target text.
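To make the probe-task idea concrete, the following is a minimal sketch (not the paper's actual setup) of how a hand-crafted probe can be scored deterministically: a lightweight classifier is trained on representations from the frozen dialogue model, and its accuracy on the probe labels quantifies what the model has learned beyond surface token overlap. The representation and label arrays here are hypothetical stand-ins.

```python
# Minimal probe-task sketch, assuming utterance representations have already
# been extracted from the frozen dialogue model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins: `reps` would be the frozen encoder's utterance
# representations, `labels` the probe-task annotation (e.g., speaker intent).
reps = rng.normal(size=(1000, 768))      # [num_utterances, hidden_dim]
labels = rng.integers(0, 4, size=1000)   # 4 hypothetical intent classes

X_train, X_test, y_train, y_test = train_test_split(reps, labels, random_state=0)

# The probe itself is deliberately shallow so that its accuracy reflects
# what the frozen representations encode, not the probe's own capacity.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```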
Evaluating the quality of a dialogue system is an understudied problem. The recent evolution of evaluation methods motivated this survey, in which an explicit and comprehensive analysis of the existing methods is sought. We are the first to divide the eva
Reading and writing research papers is one of the most essential abilities that a qualified researcher should master. However, it is difficult for new researchers (e.g., students) to fully grasp this ability. It would be fascinating if we could trai
A dialogue is essentially a multi-turn interaction among interlocutors. Effective evaluation metrics should reflect the dynamics of such interaction. Existing automatic metrics focus heavily on turn-level quality while ignoring such dyna
Standard automatic metrics (such as BLEU) are problematic for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones nor identify the specific discourse phenomen
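As a small illustration of this limitation (our own sketch, not drawn from the paper), corpus-level BLEU as computed by the sacrebleu library is invariant to shuffling a document's sentence order, because it aggregates n-gram statistics over an unordered bag of sentence pairs; the example sentences below are invented.

```python
# Sketch: corpus BLEU cannot see document-level (cross-sentence) structure.
import random
import sacrebleu

hyps = ["He went to the bank.", "Then he withdrew money.", "Finally he left."]
refs = ["He went to the bank.", "He then withdrew money.", "He left afterwards."]

baseline = sacrebleu.corpus_bleu(hyps, [refs]).score

# Permute the document's sentence order (same permutation for hypothesis and
# reference): the "translation" is now incoherent as a document, yet BLEU is
# unchanged, since it treats the document as an unordered bag of sentence pairs.
perm = list(range(len(hyps)))
random.Random(0).shuffle(perm)
shuffled = sacrebleu.corpus_bleu([hyps[i] for i in perm],
                                 [[refs[i] for i in perm]]).score

print(baseline, shuffled)  # identical scores
```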
In this paper, we propose to formulate the task-oriented dialogue system purely as a natural language generation task, so as to fully leverage large-scale pre-trained models like GPT-2 and simplify the complicated delexicalization preprocessing. Ho
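For concreteness, here is a minimal sketch of treating a dialogue turn as pure language generation with GPT-2 via the Hugging Face transformers library; the prompt serialization below is an illustrative assumption, not the paper's actual format.

```python
# Sketch: a dialogue turn as plain left-to-right language generation.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical serialization of the dialogue history into a single prompt.
history = "User: I need a cheap restaurant in the city centre. System:"
inputs = tokenizer(history, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,                        # greedy decoding for determinism
    pad_token_id=tokenizer.eos_token_id,    # GPT-2 has no dedicated pad token
)
# Strip the prompt tokens and print only the generated system response.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```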