أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Jimmy Lin

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

110 - Rodrigo Nogueira , Zhiying Jiang , Jimmy Lin 2021

The ability to perform arithmetic tasks is a remarkable trait of human intelligence and might form a critical component of more complex reasoning tasks. In this work, we investigate if the surface form of a number has any influence on how sequence-to -sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values. We find that how a number is represented in its surface form has a strong influence on the models accuracy. In particular, the model fails to learn addition of five-digit numbers when using subwords (e.g., 32), and it struggles to learn with character-level representations (e.g., 3 2). By introducing position tokens (e.g., 3 10e1 2), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation. This result bolsters evidence that subword tokenizers and positional encodings are components in current transformer designs that might need improvement. Moreover, we show that regardless of the number of parameters and training examples, models cannot learn addition rules that are independent of the length of the numbers seen during training. Code to reproduce our experiments is available at https://github.com/castorini/transformers-arithmetic

الحساب واللغة الذكاء الاصطناعي التعلم الآلي

Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard

96 - Jimmy Lin , Daniel Campos , Nick Craswell 2021

Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the state of the art (SOTA). Due to the rapid progress being made in information retrieval today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as improvements in the state of the art. Such pronouncements, however, are almost never qualified with significance testing. In the context of the MS MARCO document ranking leaderboard, we pose a specific question: How do we know if a run is significantly better than the current SOTA? We ask this question against the backdrop of recent IR debates on scale types: in particular, whether commonly used significance tests are even mathematically permissible. Recognizing these potential pitfalls in evaluation methodology, our study proposes an evaluation framework that explicitly treats certain outcomes as distinct and avoids aggregating them into a single-point metric. Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be significantly better than another that are obscured by the current official evaluation metric (MRR@100).

استرجاع المعلومات

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

50 - Jimmy Lin , Xueguang Ma , Sheng-Chieh Lin 2021

Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, rele vance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. We also describe how our group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.

استرجاع المعلومات

Pretrained Transformers for Text Ranking: BERT and Beyond

85 - Jimmy Lin , Rodrigo Nogueira , 2020

The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processi ng applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.

استرجاع المعلومات الحساب واللغة

Generalized and Scalable Optimal Sparse Decision Trees

72 - Jimmy Lin , Chudi Zhong , Diane Hu 2020

Decision tree optimization is notoriously difficult from a computational perspective but essential for the field of interpretable machine learning. Despite efforts over the past 40 years, only recently have optimization breakthroughs been made that h ave allowed practical algorithms to find optimal decision trees. These new techniques have the potential to trigger a paradigm shift where it is possible to construct sparse decision trees to efficiently optimize a variety of objective functions without relying on greedy splitting and pruning heuristics that often lead to suboptimal solutions. The contribution in this work is to provide a general framework for decision tree optimization that addresses the two significant open problems in the area: treatment of imbalanced data and fully optimizing over continuous variables. We present techniques that produce optimal decision trees over a variety of objectives including F-score, AUC, and partial area under the ROC convex hull. We also introduce a scalable algorithm that produces provably optimal results in the presence of continuous variables and speeds up decision tree construction by several orders of magnitude relative to the state-of-the art.

التعلم الآلي التعلم الالي

Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2

301 - He Bai , Peng Shi , Jimmy Lin 2020

The semantics of a text is manifested not only by what is read, but also by what is not read. In this article, we will study how the implicit not read information such as end-of-paragraph (eop) and end-of-sequence (eos) affect the quality of text gen eration. Specifically, we find that the pre-trained language model GPT2 can generate better continuations by learning to generate the eop in the fine-tuning stage. Experimental results on English story generation show that eop can lead to higher BLEU score and lower eos perplexity. We also conduct experiments on a self-collected Chinese essay dataset with Chinese-GPT2, a character level LM without eop or eos during pre-training. Experimental results show that the Chinese GPT2 can generate better essay endings with eop.

الحساب واللغة

Document Ranking with a Pretrained Sequence-to-Sequence Model

95 - Rodrigo Nogueira , Zhiying Jiang , Jimmy Lin 2020

This work proposes a novel adaptation of a pretrained sequence-to-sequence model to the task of document ranking. Our approach is fundamentally different from a commonly-adopted classification-based formulation of ranking, based on encoder-only pretr ained transformer architectures such as BERT. We show how a sequence-to-sequence model can be trained to generate relevance labels as target words, and how the underlying logits of these target words can be interpreted as relevance probabilities for ranking. On the popular MS MARCO passage ranking task, experimental results show that our approach is at least on par with previous classification-based models and can surpass them with larger, more-recent models. On the test collection from the TREC 2004 Robust Track, we demonstrate a zero-shot transfer-based approach that outperforms previous state-of-the-art models requiring in-dataset cross-validation. Furthermore, we find that our approach significantly outperforms an encoder-only model in a data-poor regime (i.e., with few training examples). We investigate this observation further by varying target words to probe the models use of latent knowledge.

استرجاع المعلومات التعلم الآلي

Runtime Optimizations for Prediction with Tree-Based Models

75 - Nima Asadi , Jimmy Lin , 2012

Tree-based models have proven to be an effective solution for web ranking as well as other problems in diverse domains. This paper focuses on optimizing the runtime performance of applying such models to make predictions, given an already-trained mod el. Although exceedingly simple conceptually, most implementations of tree-based models do not efficiently utilize modern superscalar processor architectures. By laying out data structures in memory in a more cache-conscious fashion, removing branches from the execution flow using a technique called predication, and micro-batching predictions using a technique called vectorization, we are able to better exploit modern processor architectures and significantly improve the speed of tree-based models over hard-coded if-else blocks. Our work contributes to the exploration of architecture-conscious runtime implementations of machine learning algorithms.

قواعد البيانات استرجاع المعلومات التعلم الآلي

The Unified Logging Infrastructure for Data Analytics at Twitter

231 - George Lee , Jimmy Lin , Chuang Liu 2012

In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are colle cted and structured to begin with. In this paper, we present Twitters production logging infrastructure and its evolution from application-specific logging to a unified client events log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize session sequences, which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.

قواعد البيانات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد