Research papers and master's and doctoral theses published by Daniel Khashabi

Reframing Instructional Prompts to GPTk's Language

101 - Swaroop Mishra , Daniel Khashabi , Chitta Baral 2021

How can model designers turn task instructions into effective prompts for language models? Backed by extensive empirical analysis on GPT3, we observe important features for successful instructional prompts, and propose several reframing techniques for model designers to create such prompts. For example, a complex task can be decomposed into multiple simpler tasks. We experiment over 12 NLP tasks across 6 diverse categories (question generation, classification, etc.). Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity over existing few-shot baselines. The performance gains are particularly important on large language models, such as GPT3, where tuning models or prompts on large datasets is not feasible. Furthermore, we observe that such gains are not limited to GPT3; the reframed tasks remain superior over raw instructions across different model architectures, underscoring the cross-model generality of these guidelines. We hope these empirically driven techniques will pave the way for more effective ways to prompt LMs in the future.
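
The decomposition idea mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `run_decomposed` helper, the `{input}`/`{previous}` placeholders, and the two sub-task prompts are hypothetical and are not taken from the paper; `model` stands in for any function that maps a prompt string to a completion.

```python
from typing import Callable, List


def run_decomposed(
    subtask_prompts: List[str],
    task_input: str,
    model: Callable[[str], str],
) -> str:
    """Run a chain of simpler prompts instead of one complex prompt.

    Each template may refer to {input} (the original task input) and
    {previous} (the model output of the preceding sub-task).
    """
    previous = ""
    for template in subtask_prompts:
        prompt = template.format(input=task_input, previous=previous)
        previous = model(prompt)
    return previous


# Hypothetical two-step decomposition of a question-generation task.
SUBTASKS = [
    "Passage: {input}\nName one entity mentioned in the passage.",
    "Passage: {input}\nEntity: {previous}\n"
    "Write a question whose answer is the entity.",
]
```

Any language-model client can be passed as `model`; the chain returns the output of the last sub-task.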

Computation and Language, Artificial Intelligence, Machine Learning

Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?

341 - Jieyu Zhao , Daniel Khashabi , Tushar Khot 2021

Is it possible to use natural language to intervene in a model's behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model's unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system's social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today's powerful neural language models are extremely poor ethical-advice takers, that is, they respond surprisingly little to ethical interventions even though these interventions are stated as simple sentences. Few-shot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.

Computation and Language, Artificial Intelligence, Machine Learning

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

127 - Swaroop Mishra , Daniel Khashabi , Chitta Baral 2021

Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. NLP models built with the conventional paradigm, however, often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that is equipped with the understanding of human-readable instructions that define the tasks, and can generalize to new tasks. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions and 193k task instances. The instructions are obtained from crowdsourcing instructions used to collect existing NLP datasets and mapped to a unified schema. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models can benefit from instructions to generalize across tasks. These models, however, are far behind supervised task-specific models, indicating significant room for more progress in this direction.
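
As a rough illustration of encoding an instruction together with an input for a generative model, the sketch below flattens a task definition, a few examples, and one instance into a single text prompt. The `TaskInstruction` fields and the prompt layout are assumptions chosen for this example; the abstract does not spell out the dataset's actual schema or encoding.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TaskInstruction:
    """Hypothetical, simplified instruction record (not the real schema)."""
    definition: str
    positive_examples: List[Tuple[str, str]] = field(default_factory=list)


def encode(instruction: TaskInstruction, instance_input: str) -> str:
    """Flatten an instruction and one task instance into a text prompt."""
    parts = [f"Definition: {instruction.definition}"]
    for i, (x, y) in enumerate(instruction.positive_examples, start=1):
        parts.append(f"Example {i} input: {x}")
        parts.append(f"Example {i} output: {y}")
    parts.append(f"Input: {instance_input}")
    parts.append("Output:")  # the model is expected to continue from here
    return "\n".join(parts)
```

The same `encode` function can then be applied to unseen tasks, which is the setting in which cross-task generalization is measured.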

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition

GooAQ: Open Question Answering with Diverse Answer Types

385 - Daniel Khashabi , Amos Ng , Tushar Khot 2021

While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GooAQ and observe that: (a) in line with recent work, LMs' strong performance on GooAQ's short-answer questions heavily benefits from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as "how" and "why" questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types.

Computation and Language, Artificial Intelligence

Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge

259 - Sumithra Bhakthavatsalam , Daniel Khashabi , Tushar Khot 2021

We present the ARC-DA dataset, a direct-answer (open response, freeform) version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. While ARC has been influential in the community, its multiple-choice format is unrepresentative of real-world questions, and multiple-choice formats can be particularly susceptible to artifacts. The ARC-DA dataset addresses these concerns by converting questions to direct-answer format using a combination of crowdsourcing and expert review. The resulting dataset contains 2985 questions with a total of 8436 valid answers (questions typically have more than one valid answer). ARC-DA is one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves. We describe the conversion approach taken, appropriate evaluation metrics, and several strong models. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We hope the dataset spurs further advances in complex question-answering by the community. ARC-DA is available at https://allenai.org/data/arc-da
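
Because each question typically has more than one valid answer, an overlap metric such as F1 is commonly taken as the best score over all valid answers. The sketch below shows one common token-level formulation; it is an assumption for illustration, and the exact tokenization and normalization used for the reported ARC-DA scores may differ.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, valid_answers: List[str]) -> float:
    """Score a prediction against the closest of several valid answers."""
    return max(token_f1(prediction, answer) for answer in valid_answers)
```

Note that `best_f1` expects at least one valid answer per question.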

Computation and Language, Artificial Intelligence

GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

398 - Daniel Khashabi , Gabriel Stanovsky , Jonathan Bragg 2021

Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency) and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.

Computation and Language, Artificial Intelligence

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

230 - Mor Geva , Daniel Khashabi , Elad Segal 2021

A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce StrategyQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in StrategyQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of $\sim$66%.

Computation and Language

ParsiNLU: A Suite of Language Understanding Challenges for Persian

687 - Daniel Khashabi , Arman Cohan , Siamak Shakeri 2020

Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on the Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this rich language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in the Persian language that includes a range of high-level tasks -- Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5$k$ new instances across 6 distinct NLU tasks. Besides, we present the first results on state-of-the-art monolingual and multi-lingual pre-trained language-models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.

Computation and Language, Artificial Intelligence

UnQovering Stereotyping Biases via Underspecified Questions

232 - Tao Li , Tushar Khot , Daniel Khashabi 2020

While language embeddings have been shown to have stereotyping biases, how these biases affect downstream question answering (QA) models remains unexplored. We present UNQOVER, a general framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors: positional dependence and question independence. We design a formalism that isolates the aforementioned errors. As case studies, we use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion. We probe five transformer-based QA models trained on two QA datasets, along with their underlying language models. Our broad study reveals that (1) all these models, with and without fine-tuning, have notable stereotyping biases in these classes; (2) larger models often have higher bias; and (3) the effect of fine-tuning on bias varies strongly with the dataset and the model size.
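
To make positional dependence concrete: if a model's score for a subject changes merely because that subject is mentioned first, one simple remedy is to average the score over both mention orders. The sketch below shows only this averaging step under a hypothetical `Scorer` signature; it is a simplified illustration and not the formalism defined in the paper, which also handles question independence.

```python
from typing import Callable

# Hypothetical signature: (first_subject, second_subject, candidate) -> score,
# where the two subjects are inserted into the context in the given order.
Scorer = Callable[[str, str, str], float]


def order_averaged_score(scorer: Scorer, a: str, b: str, candidate: str) -> float:
    """Average a candidate's score over both orderings of the two subjects."""
    return 0.5 * (scorer(a, b, candidate) + scorer(b, a, candidate))
```

With a scorer that always favors whichever subject appears first, both candidates receive the same averaged score, so the purely positional preference cancels out.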

Computation and Language

Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models

95 - Tushar Khot , Daniel Khashabi , Kyle Richardson 2020

We propose a general framework called Text Modular Networks (TMNs) for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models. To ensure solvability of simpler tasks, TMNs learn the textual input-output behavior (i.e., language) of existing models through their datasets. This differs from prior decomposition-based approaches which, besides being designed specifically for each complex task, produce decompositions independent of existing sub-models. Specifically, we focus on Question Answering (QA) and show how to train a next-question generator to sequentially produce sub-questions targeting appropriate sub-models, without additional human annotation. These sub-questions and answers provide a faithful natural language explanation of the model's reasoning. We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator. Our experiments show that ModularQA is more versatile than existing explainable systems for DROP and HotpotQA datasets, is more robust than state-of-the-art blackbox (uninterpretable) systems, and generates more understandable and trustworthy explanations compared to prior work.
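
The control flow described in the abstract, a next-question generator that repeatedly picks a sub-model and a sub-question, can be sketched as a simple loop. All names below (`answer_by_decomposition`, the `"qa"` and `"calc"` keys, the scripted generator, and the toy sub-models) are hypothetical stand-ins for illustration; in the paper the generator is a trained model, not a script.

```python
from typing import Callable, Dict, List, Optional, Tuple

SubModel = Callable[[str], str]
History = List[Tuple[str, str]]  # (sub-question, answer) pairs so far
# Returns (sub_model_name, sub_question), or None when decomposition is done.
NextQuestion = Callable[[str, History], Optional[Tuple[str, str]]]


def answer_by_decomposition(
    question: str,
    next_question: NextQuestion,
    sub_models: Dict[str, SubModel],
    max_steps: int = 5,
) -> Tuple[str, History]:
    """Answer a complex question via a chain of sub-questions."""
    history: History = []
    for _ in range(max_steps):
        step = next_question(question, history)
        if step is None:
            break
        name, sub_q = step
        history.append((sub_q, sub_models[name](sub_q)))
    final = history[-1][1] if history else ""
    return final, history  # the history doubles as the explanation


# Toy stand-ins: a lookup "QA model" and a subtraction-only "calculator".
def toy_qa(q: str) -> str:
    return {"When was A born?": "1950", "When was B born?": "1940"}[q]


def toy_calculator(expr: str) -> str:
    left, right = (int(t) for t in expr.split("-"))
    return str(left - right)


def scripted_generator(question: str, history: History) -> Optional[Tuple[str, str]]:
    if len(history) == 0:
        return ("qa", "When was A born?")
    if len(history) == 1:
        return ("qa", "When was B born?")
    if len(history) == 2:
        return ("calc", f"{history[0][1]}-{history[1][1]}")
    return None
```

Returning the full history alongside the final answer reflects the point made above: the sub-questions and their answers serve as the natural-language explanation of how the answer was reached.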

Computation and Language, Artificial Intelligence
