أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Rodrigo Nogueira

On the ability of monolingual models to learn language-agnostic representations

42 - Leandro Rodrigues de Souza , Rodrigo Nogueira , Roberto Lotufo 2021

Pretrained multilingual models have become a de facto default approach for zero-shot cross-lingual transfer. Previous work has shown that these models are able to achieve cross-lingual representations when pretrained on two or more languages with sha red parameters. In this work, we provide evidence that a model can achieve language-agnostic representations even when pretrained on a single language. That is, we find that monolingual models pretrained and finetuned on different languages achieve competitive performance compared to the ones that use the same target language. Surprisingly, the models show a similar performance on a same task regardless of the pretraining language. For example, models pretrained on distant languages such as German and Portuguese perform similarly on English tasks.

الحساب واللغة

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

110 - Rodrigo Nogueira , Zhiying Jiang , Jimmy Lin 2021

The ability to perform arithmetic tasks is a remarkable trait of human intelligence and might form a critical component of more complex reasoning tasks. In this work, we investigate if the surface form of a number has any influence on how sequence-to -sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values. We find that how a number is represented in its surface form has a strong influence on the models accuracy. In particular, the model fails to learn addition of five-digit numbers when using subwords (e.g., 32), and it struggles to learn with character-level representations (e.g., 3 2). By introducing position tokens (e.g., 3 10e1 2), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation. This result bolsters evidence that subword tokenizers and positional encodings are components in current transformer designs that might need improvement. Moreover, we show that regardless of the number of parameters and training examples, models cannot learn addition rules that are independent of the length of the numbers seen during training. Code to reproduce our experiments is available at https://github.com/castorini/transformers-arithmetic

الحساب واللغة الذكاء الاصطناعي التعلم الآلي

The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models

143 - Ronak Pradeep , Rodrigo Nogueira , 2021

We propose a design pattern for tackling text ranking problems, dubbed Expando-Mono-Duo, that has been empirically validated for a number of ad hoc retrieval tasks in different domains. At the core, our design relies on pretrained sequence-to-sequenc e models within a standard multi-stage ranking architecture. Expando refers to the use of document expansion techniques to enrich keyword representations of texts prior to inverted indexing. Mono and Duo refer to components in a reranking pipeline based on a pointwise model and a pairwise model that rerank initial candidates retrieved using keyword search. We present experimental results from the MS MARCO passage and document ranking tasks, the TREC 2020 Deep Learning Track, and the TREC-COVID challenge that validate our design. In all these tasks, we achieve effectiveness that is at or near the state of the art, in some cases using a zero-shot approach that does not exploit any training data from the target task. To support replicability, implementations of our design pattern are open-sourced in the Pyserini IR toolkit and PyGaggle neural reranking library.

استرجاع المعلومات الحساب واللغة

Scientific Claim Verification with VERT5ERINI

233 - Ronak Pradeep , Xueguang Ma , Rodrigo Nogueira 2020

This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain. We propose VERT5ERINI that exploits T5 for abstract retrieval, sentence selection and label predictio n, which are three critical sub-tasks of claim verification. We evaluate our pipeline on SCIFACT, a newly curated dataset that requires models to not just predict the veracity of claims but also provide relevant sentences from a corpus of scientific literature that support this decision. Empirically, our pipeline outperforms a strong baseline in each of the three steps. Finally, we show VERT5ERINIs ability to generalize to two new datasets of COVID-19 claims using evidence from the ever-expanding CORD-19 corpus.

الحساب واللغة استرجاع المعلومات

Pretrained Transformers for Text Ranking: BERT and Beyond

85 - Jimmy Lin , Rodrigo Nogueira , 2020

The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processi ng applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.

استرجاع المعلومات الحساب واللغة

Can questions summarize a corpus? Using question generation for characterizing COVID-19 research

162 - Gabriela Surita , Rodrigo Nogueira , Roberto Lotufo 2020

What are the latent questions on some textual data? In this work, we investigate using question generation models for exploring a collection of documents. Our method, dubbed corpus2question, consists of applying a pre-trained question generation mode l over a corpus and aggregating the resulting questions by frequency and time. This technique is an alternative to methods such as topic modelling and word cloud for summarizing large amounts of textual data. Results show that applying corpus2question on a corpus of scientific articles related to COVID-19 yields relevant questions about the topic. The most frequent questions are what is covid 19 and what is the treatment for covid. Among the 1000 most frequent questions are what is the threshold for herd immunity and what is the role of ace2 in viral entry. We show that the proposed method generated similar questions for 13 of the 27 expert-made questions from the CovidQA question answering dataset. The code to reproduce our experiments and the generated questions are available at: https://github.com/unicamp-dl/corpus2question

استرجاع المعلومات الحساب واللغة التعلم الآلي

Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting

76 - Sheng-Chieh Lin , Jheng-Hong Yang , Rodrigo Nogueira 2020

Conversational search plays a vital role in conversational information seeking. As queries in information seeking dialogues are ambiguous for traditional ad-hoc information retrieval (IR) systems due to the coreference and omission resolution problem s inherent in natural language dialogue, resolving these ambiguities is crucial. In this paper, we tackle conversational passage retrieval (ConvPR), an important component of conversational search, by addressing query ambiguities with query reformulation integrated into a multi-stage ad-hoc IR system. Specifically, we propose two conversational query reformulation (CQR) methods: (1) term importance estimation and (2) neural query rewriting. For the former, we expand conversational queries using important terms extracted from the conversational context with frequency-based signals. For the latter, we reformulate conversational queries into natural, standalone, human-understandable queries with a pretrained sequence-tosequence model. Detailed analyses of the two CQR methods are provided quantitatively and qualitatively, explaining their advantages, disadvantages, and distinct behaviors. Moreover, to leverage the strengths of both CQR methods, we propose combining their output with reciprocal rank fusion, yielding state-of-the-art retrieval effectiveness, 30% improvement in terms of NDCG@3 compared to the best submission of TREC CAsT 2019.

الحساب واللغة الذكاء الاصطناعي استرجاع المعلومات

Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models

141 - Sheng-Chieh Lin , Jheng-Hong Yang , Rodrigo Nogueira 2020

This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs). We leverage PLMs to address the strong token-to-token independence assumption made in the common objective, maximum likelihood estimation, for the CQR task. In CQR benchmarks of task-oriented dialogue systems, we evaluate fine-tuned PLMs on the recently-introduced CANARD dataset as an in-domain task and validate the models using data from the TREC 2019 CAsT Track as an out-domain task. Examining a variety of architectures with different numbers of parameters, we demonstrate that the recent text-to-text transfer transformer (T5) achieves the best results both on CANARD and CAsT with fewer parameters, compared to similar transformer architectures.

الحساب واللغة الذكاء الاصطناعي التعلم الآلي

TTTTTackling WinoGrande Schemas

95 - Sheng-Chieh Lin , Jheng-Hong Yang , Rodrigo Nogueira 2020

We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the entailment token as a score of the hypothesis. Our first (and only) submission to the official leaderboard yielded 0.7673 AUC on March 13, 2020, which is the best known result at this time and beats the previous state of the art by over five points.

الحساب واللغة التعلم الآلي

Document Ranking with a Pretrained Sequence-to-Sequence Model

95 - Rodrigo Nogueira , Zhiying Jiang , Jimmy Lin 2020

This work proposes a novel adaptation of a pretrained sequence-to-sequence model to the task of document ranking. Our approach is fundamentally different from a commonly-adopted classification-based formulation of ranking, based on encoder-only pretr ained transformer architectures such as BERT. We show how a sequence-to-sequence model can be trained to generate relevance labels as target words, and how the underlying logits of these target words can be interpreted as relevance probabilities for ranking. On the popular MS MARCO passage ranking task, experimental results show that our approach is at least on par with previous classification-based models and can surpass them with larger, more-recent models. On the test collection from the TREC 2004 Robust Track, we demonstrate a zero-shot transfer-based approach that outperforms previous state-of-the-art models requiring in-dataset cross-validation. Furthermore, we find that our approach significantly outperforms an encoder-only model in a data-poor regime (i.e., with few training examples). We investigate this observation further by varying target words to probe the models use of latent knowledge.

استرجاع المعلومات التعلم الآلي

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد