Dense retrieval systems conduct first-stage retrieval using embedded representations and simple similarity metrics to match a query to documents. Their effectiveness depends on the encoded embeddings capturing the semantics of queries and documents, a challenging task due to the shortness and ambiguity of search queries. This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval. ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels. It also keeps the document index unchanged to reduce overhead. ANCE-PRF significantly outperforms ANCE and other recent dense retrieval systems on several datasets. Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
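A minimal sketch of the PRF query encoder described in the abstract above: the query and the top retrieved documents are concatenated, encoded with BERT, and the [CLS] vector is used as the refined query embedding against the unchanged document index. The checkpoint name, number of feedback documents, and pooling choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

def prf_query_embedding(query: str, feedback_docs: list[str]) -> torch.Tensor:
    # The query and the feedback documents share one input sequence, separated by [SEP].
    text = query + "".join(f" {tokenizer.sep_token} {d}" for d in feedback_docs)
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # The [CLS] representation serves as the refined query embedding.
    return outputs.last_hidden_state[:, 0]

# The document index stays untouched: the refined query is scored against the
# existing document embeddings with the same dot-product similarity.
doc_embeddings = torch.randn(1000, 768)  # placeholder index
q = prf_query_embedding("what is dense retrieval", ["feedback doc one ...", "feedback doc two ..."])
scores = q @ doc_embeddings.T
top10 = scores.topk(10).indices
```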
114 - Luyu Gao, Jamie Callan, 2021
Recent research demonstrates the effectiveness of using fine-tuned language models (LM) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) requiring large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Retrieval experiments on the MS-MARCO, Natural Questions, and TriviaQA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, as well as the need for large batch training. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small batch fine-tuning.
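A minimal sketch of a corpus-level contrastive warm-up loss in the spirit of coCondenser: two spans sampled from the same document act as positives, and all other spans in the batch act as negatives. The temperature, batch construction, and the symmetric InfoNCE form are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def span_contrastive_loss(span_a: torch.Tensor, span_b: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """span_a, span_b: [batch, dim] embeddings of paired spans from the same documents."""
    a = F.normalize(span_a, dim=-1)
    b = F.normalize(span_b, dim=-1)
    logits = a @ b.T / temperature                         # [batch, batch] similarity matrix
    labels = torch.arange(a.size(0), device=a.device)      # the matching span sits on the diagonal
    # Symmetric InfoNCE: each span should retrieve its partner among the in-batch negatives.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```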
334 - Luyu Gao, Jamie Callan, 2021
Pre-trained language models (LM) have become go-to text representation encoders. Prior research used deep LMs to encode text sequences such as sentences and passages into single dense vector representations. These dense representations have been used in efficient text comparison and embedding-based retrieval. However, dense encoders suffer in low resource situations. Many techniques have been developed to solve this problem. Despite their success, not much is known about why this happens. This paper shows that one cause lies in the readiness of the LM to expose its knowledge through dense representation in fine-tuning, which we term Optimization Readiness. To validate the theory, we present Condenser, a general pre-training architecture based on Transformer LMs, to improve dense optimization readiness. We show that fine-tuning from Condenser significantly improves performance for small and/or noisy training sets.
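A minimal sketch of the Condenser idea described above: during pre-training, a short head of Transformer layers sees only the late [CLS] vector plus early-layer token states, so masked-token prediction has to route sequence information through the dense [CLS] representation. The layer counts, dimensions, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CondenserHead(nn.Module):
    def __init__(self, dim: int = 768, num_layers: int = 2, vocab_size: int = 30522):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.head = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.mlm = nn.Linear(dim, vocab_size)

    def forward(self, late_cls: torch.Tensor, early_tokens: torch.Tensor) -> torch.Tensor:
        """late_cls: [batch, 1, dim] from the last backbone layer;
        early_tokens: [batch, seq-1, dim] from an early backbone layer (without [CLS])."""
        # Prepend the late CLS so token predictions depend on the condensed vector.
        hidden = torch.cat([late_cls, early_tokens], dim=1)
        hidden = self.head(hidden)
        return self.mlm(hidden)  # MLM logits, used with the standard masked-LM loss
```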
Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with an inverted list index. Recent neural IR models shift towards soft semantic matching of all query and document terms, but they lose the computation efficiency of exact match systems. This paper presents COIL, a contextualized exact match retrieval architecture that brings semantic lexical matching. COIL scoring is based on the contextualized representations of overlapping query-document tokens. The new architecture stores contextualized token representations in inverted lists, bringing together the efficiency of exact match and the representation power of deep language models. Our experimental results show COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency.
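A minimal sketch of COIL-style scoring: only exact lexical matches contribute, but each match is scored with contextualized token vectors rather than term frequency. The inputs here are placeholders; the full model also adds a [CLS] dense component and stores the token vectors in inverted lists for efficient retrieval.

```python
import torch

def coil_score(query_ids: torch.Tensor, query_vecs: torch.Tensor,
               doc_ids: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """query_ids: [q], query_vecs: [q, dim]; doc_ids: [d], doc_vecs: [d, dim]."""
    score = query_vecs.new_zeros(())
    for i, tok in enumerate(query_ids.tolist()):
        mask = doc_ids == tok                          # exact-match positions in the document
        if mask.any():
            sims = doc_vecs[mask] @ query_vecs[i]      # contextualized similarity per matching occurrence
            score = score + sims.max()                 # keep the best-matching occurrence per query token
    return score
```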
Pre-trained deep language models (LM) have advanced the state of the art of text retrieval. Rerankers fine-tuned from deep LMs estimate candidate relevance based on rich contextualized matching signals. Meanwhile, deep LMs can also be leveraged to improve the search index, building retrievers with better recall. One would expect a straightforward combination of both in a pipeline to yield an additive performance gain. In this paper, we discover otherwise: the popular reranker cannot fully exploit the improved retrieval results. We therefore propose Localized Contrastive Estimation (LCE) for training rerankers and demonstrate that it significantly improves deep two-stage models.
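A minimal sketch of an LCE-style training objective: for each query, the reranker scores one positive passage against hard negatives drawn from the same retriever's top results, and a softmax cross-entropy is taken over that local group. The group size and negative sampling strategy are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lce_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: [num_queries, group_size] reranker scores, where index 0 in each group
    is the labeled positive and the rest are negatives retrieved for the same query."""
    targets = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, targets)
```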
Our work aimed at experimentally assessing the benefits of model ensembling within the context of neural methods for passage reranking. Starting from relatively standard neural models, we use a previous technique named Fast Geometric Ensembling to generate multiple model instances from particular training schedules, and then focus our attention on different types of approaches for combining the results from the multiple model instances (e.g., averaging the ranking scores, using fusion methods from the IR literature, or using supervised learning-to-rank). Tests with the MS-MARCO dataset show that model ensembling can indeed benefit ranking quality, particularly with supervised learning-to-rank, although also with unsupervised rank aggregation.
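A minimal sketch of two unsupervised ways to combine rankings from multiple ensemble members, as mentioned above: averaging the ranking scores and reciprocal rank fusion (a standard IR fusion method). The inputs are illustrative; the supervised learning-to-rank combination studied in the paper is not shown.

```python
from collections import defaultdict

def average_scores(runs: list[dict[str, float]]) -> dict[str, float]:
    """Each run maps passage id -> score produced by one ensemble member."""
    totals = defaultdict(float)
    for run in runs:
        for pid, score in run.items():
            totals[pid] += score / len(runs)
    return dict(totals)

def reciprocal_rank_fusion(runs: list[list[str]], k: int = 60) -> dict[str, float]:
    """Each run is a ranked list of passage ids; standard RRF with smoothing constant k."""
    fused = defaultdict(float)
    for run in runs:
        for rank, pid in enumerate(run, start=1):
            fused[pid] += 1.0 / (k + rank)
    return dict(fused)
```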
Most research on pseudo relevance feedback (PRF) has been done in vector space and probabilistic retrieval models. This paper shows that Transformer-based rerankers can also benefit from the extra context that PRF provides. It presents PGT, a graph-based Transformer that sparsifies attention between graph nodes to enable PRF while avoiding the high computational complexity of most Transformer architectures. Experiments show that PGT improves upon a non-PRF Transformer reranker, and that it is at least as accurate as Transformer PRF models that use full attention, but with lower computational costs.
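A minimal sketch of graph-sparsified attention in the spirit of PGT: instead of full attention over the query, the feedback documents, and the candidate, a boolean mask only allows token pairs whose segments are connected in the graph. The graph layout assumed here (every segment connected to the query, documents not connected to each other) is an illustrative simplification, not the paper's exact construction.

```python
import torch

def build_sparse_attention_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """segment_ids: [seq] with 0 = query tokens, 1..k = feedback/candidate document tokens.
    Returns a boolean mask [seq, seq] where True marks pairs that may NOT attend to each other,
    matching the attn_mask convention of torch.nn.MultiheadAttention."""
    same_segment = segment_ids.unsqueeze(0) == segment_ids.unsqueeze(1)
    involves_query = (segment_ids.unsqueeze(0) == 0) | (segment_ids.unsqueeze(1) == 0)
    allowed = same_segment | involves_query   # tokens attend within their segment or to/from the query
    return ~allowed
```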
Category systems are central components of knowledge bases, as they provide a hierarchical grouping of semantically related concepts and entities. They are a unique and valuable resource that is utilized in a broad range of information access tasks. To aid knowledge editors in the manual process of expanding a category system, this paper presents a method of generating categories for sets of entities. First, we employ neural abstractive summarization models to generate candidate categories. Next, the location within the hierarchy is identified for each candidate. Finally, structure-, content-, and hierarchy-based features are used to rank candidates and identify the most promising ones (measured in terms of specificity, hierarchy, and importance). We develop a test collection based on Wikipedia categories and demonstrate the effectiveness of the proposed approach.
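A minimal sketch of the first step described above: using an off-the-shelf abstractive summarization model to produce a short candidate category label for a set of entities. The checkpoint, the input format, and the length limits are illustrative assumptions; the paper trains dedicated models and then places and ranks the candidates with additional features.

```python
from transformers import pipeline

# Placeholder summarization checkpoint; the paper's candidate generators differ.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

entities = ["Dense retrieval", "BM25", "Learning to rank", "Query expansion"]
text = "; ".join(entities)
candidate = summarizer(text, max_length=8, min_length=2, do_sample=False)
print(candidate[0]["summary_text"])  # a short phrase serving as one candidate category
```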
Given a natural language query, teaching machines to ask clarifying questions is of immense utility in practical natural language processing systems. Such interactions could help fill information gaps for better machine comprehension of the query. For the task of ranking clarification questions, we hypothesize that determining whether a clarification question pertains to a missing entry in a given post (on QA forums such as StackExchange) can be considered a special case of Natural Language Inference (NLI), where both the post and the most relevant clarification question point to a shared latent piece of information or context. We validate this hypothesis by incorporating representations from a Siamese BERT model fine-tuned on the NLI and Multi-NLI datasets into our models, and demonstrate that our best performing model obtains relative performance improvements of 40 percent and 60 percent (on the key metric of Precision@1) over the state-of-the-art baseline(s) on the two evaluation sets of the StackExchange dataset, significantly surpassing the state of the art.
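A minimal sketch of using an NLI-fine-tuned Siamese BERT to rank candidate clarification questions for a post by embedding similarity. The checkpoint name and the plain cosine-similarity ranking are illustrative assumptions; the paper's full models combine such representations with additional components.

```python
from sentence_transformers import SentenceTransformer, util

# An NLI-trained Siamese BERT checkpoint; used here as an illustrative stand-in.
model = SentenceTransformer("bert-base-nli-mean-tokens")

def rank_clarification_questions(post: str, questions: list[str]) -> list[tuple[str, float]]:
    post_emb = model.encode(post, convert_to_tensor=True)
    q_embs = model.encode(questions, convert_to_tensor=True)
    sims = util.cos_sim(post_emb, q_embs)[0]
    # Higher similarity suggests the question points to the information missing from the post.
    return sorted(zip(questions, sims.tolist()), key=lambda x: x[1], reverse=True)
```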
Deep language models such as BERT, pre-trained on large corpora, have given a huge performance boost to state-of-the-art information retrieval ranking systems. Knowledge embedded in such models allows them to pick up complex matching signals between passages and queries. However, the high computation cost during inference limits their deployment in real-world search scenarios. In this paper, we study whether and how the knowledge for search within BERT can be transferred to a smaller ranker through distillation. Our experiments demonstrate that it is crucial to use a proper distillation procedure, which produces up to a nine-fold speedup while preserving state-of-the-art performance.
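A minimal sketch of one common ranker distillation objective: a small student is trained to reproduce the relevance scores of a large BERT teacher on the same query-passage pairs. Using MSE on raw scores is one of several possible distillation procedures, shown here only as an assumption; the paper studies which procedures actually preserve ranking quality.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    """Both tensors hold relevance scores for the same query-passage pairs; the teacher is frozen."""
    return F.mse_loss(student_scores, teacher_scores.detach())
```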