
Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval


Publication date: 2021
Research language: English





We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call "mDPR". Experiments show that although the effectiveness of mDPR is much lower than BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse-dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at https://github.com/castorini/mr.tydi.
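
As a concrete illustration of the sparse-dense hybrid mentioned above, the Python sketch below interpolates normalized BM25 and dense scores per document. The min-max normalization and the alpha weight are illustrative assumptions, not the exact fusion used in the Mr. TyDi baselines.

# Minimal sketch of sparse-dense score fusion (illustrative; the exact
# normalization and weighting in the Mr. TyDi baselines may differ).

def hybrid_scores(bm25_scores, dense_scores, alpha=0.5):
    """Combine per-document BM25 and dense scores for one query.

    bm25_scores, dense_scores: dicts mapping doc_id -> score.
    alpha: weight on the dense score (assumed value, for illustration).
    """
    def min_max(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    bm25 = min_max(bm25_scores)
    dense = min_max(dense_scores)
    docs = set(bm25) | set(dense)
    # A document missing from one ranked list contributes 0 for that score.
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * bm25.get(d, 0.0)
            for d in docs}

if __name__ == "__main__":
    bm25 = {"d1": 12.3, "d2": 9.8, "d3": 4.1}
    dense = {"d2": 71.0, "d3": 68.5, "d4": 60.2}
    ranked = sorted(hybrid_scores(bm25, dense).items(),
                    key=lambda kv: kv[1], reverse=True)
    print(ranked)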




Related research

Pre-trained Transformer language models (LM) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient text comparison and retrieval. However, dense encoders require a lot of data and sophisticated techniques to effectively train, and suffer in low data situations. This paper finds a key reason is that standard LMs' internal attention structure is not ready-to-use for dense encoders, which need to aggregate text information into the dense representation. We propose to pre-train towards dense encoder with a novel Transformer architecture, Condenser, where LM prediction CONditions on DENSE Representation. Our experiments show Condenser improves over standard LM by large margins on various text retrieval and similarity tasks.
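
To make "aggregate text information into the dense representation" concrete, here is a minimal single-vector encoder sketch: it pools the [CLS] vector of a standard pre-trained LM and compares texts by dot product. The checkpoint name is a placeholder (not a Condenser model), and Condenser itself changes the pre-training objective rather than this inference-time code; the sketch assumes the Hugging Face transformers and torch packages.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder checkpoint, not Condenser
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def encode(texts):
    """Encode texts into single dense vectors via [CLS] pooling."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)
    return hidden[:, 0]  # the [CLS] vector serves as the aggregate representation

query_vec = encode(["what is dense retrieval"])
passage_vecs = encode(["Dense retrieval encodes queries and passages into "
                       "vectors and ranks passages by vector similarity.",
                       "BM25 is a classic sparse ranking function."])
scores = query_vec @ passage_vecs.T  # dot-product relevance scores
print(scores)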
Dense retrieval has shown great success for passage ranking in English. However, its effectiveness for non-English languages remains unexplored due to limitations in training resources. In this work, we explore different transfer techniques for document ranking from English annotations to non-English languages. Our experiments reveal that zero-shot model-based transfer using mBERT improves search quality. We find that weakly-supervised target language transfer is competitive compared to generation-based target language transfer, which requires translation models.
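
A minimal sketch of the zero-shot model-based transfer idea discussed above: fine-tune mBERT as a query-passage relevance classifier on English judgments only, then apply the same model unchanged to non-English pairs. The toy training pairs, single optimization pass, and hyper-parameters below are illustrative assumptions, not the setup evaluated in the paper.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# 1) Fine-tune on English relevance judgments only (toy examples).
english_pairs = [
    ("what causes rain", "Rain forms when water vapour condenses into droplets.", 1),
    ("what causes rain", "The stock market closed higher today.", 0),
]
model.train()
for query, passage, label in english_pairs:
    batch = tokenizer(query, passage, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# 2) Apply the same model, unchanged, to a non-English pair (zero-shot transfer).
# The Arabic query and passage below mean "what causes rain" and
# "rain forms when water vapour condenses."
model.eval()
batch = tokenizer("ما الذي يسبب المطر",
                  "يتكون المطر عندما يتكثف بخار الماء.",
                  return_tensors="pt", truncation=True)
with torch.no_grad():
    relevance = model(**batch).logits.softmax(-1)[0, 1]
print(float(relevance))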
Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming text corpora are semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BeamDR) that iteratively forms an evidence chain through beam search in dense representations. When evaluated on multi-hop question answering, BeamDR is competitive to state-of-the-art systems, without using any semi-structured information. Through query composition in dense space, BeamDR captures the implicit relationships between evidence in the reasoning chain. The code is available at https://github.com/henryzhao5852/BeamDR.
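
The sketch below shows the general shape of beam search in dense representation space: at each hop, the current query vector is composed with a retrieved passage vector and only the best-scoring evidence chains are kept. The averaging-based composition and the random stand-in embeddings are assumptions for illustration; BeamDR's actual query composition and scoring are learned.

import numpy as np

def compose(query_vec, evidence_vec):
    """Toy query composition in dense space: average query and evidence vectors."""
    return (query_vec + evidence_vec) / 2.0

def beam_retrieve(query_vec, passage_vecs, hops=2, beam_size=2):
    """Return the top evidence chains (lists of passage indices) with scores."""
    beams = [([], query_vec, 0.0)]  # (chain, current query vector, running score)
    for _ in range(hops):
        candidates = []
        for chain, q_vec, score in beams:
            sims = passage_vecs @ q_vec  # dense similarity to every passage
            for idx in np.argsort(-sims)[:beam_size]:
                if idx in chain:
                    continue  # do not revisit evidence already in the chain
                new_q = compose(q_vec, passage_vecs[idx])
                candidates.append((chain + [int(idx)], new_q, score + float(sims[idx])))
        beams = sorted(candidates, key=lambda b: b[2], reverse=True)[:beam_size]
    return [(chain, score) for chain, _, score in beams]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    passages = rng.normal(size=(100, 64))  # stand-in passage embeddings
    query = rng.normal(size=64)            # stand-in query embedding
    print(beam_retrieve(query, passages))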
We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model. Specifically, we propose to transfer the knowledge from a bi-encoder teacher to a student by distilling knowledge from ColBERT's expressive MaxSim operator into a simple dot product. The advantage of the bi-encoder teacher--student setup is that we can efficiently add in-batch negatives during knowledge distillation, enabling richer interactions between teacher and student models. In addition, using ColBERT as the teacher reduces training cost compared to a full cross-encoder. Experiments on the MS MARCO passage and document ranking tasks and data from the TREC 2019 Deep Learning Track demonstrate that our approach helps models learn robust representations for dense retrieval effectively and efficiently.
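
A minimal sketch of that distillation objective: the teacher scores each in-batch passage with a ColBERT-style MaxSim over token vectors, the student scores it with a single dot product, and a KL-divergence loss pulls the student's score distribution toward the teacher's. The random tensors stand in for real model outputs; the actual training couples full ColBERT and bi-encoder models.

import torch
import torch.nn.functional as F

def maxsim(query_tok, passage_tok):
    """ColBERT-style MaxSim: sum over query tokens of the max token similarity."""
    # query_tok: (q_len, dim), passage_tok: (p_len, dim)
    return (query_tok @ passage_tok.T).max(dim=1).values.sum()

def kd_loss(query_tok, passage_toks, query_vec, passage_vecs):
    """KL divergence between teacher (MaxSim) and student (dot product)
    score distributions over the in-batch passages."""
    teacher = torch.stack([maxsim(query_tok, p) for p in passage_toks])
    student = passage_vecs @ query_vec
    return F.kl_div(F.log_softmax(student, dim=0),
                    F.softmax(teacher, dim=0), reduction="batchmean")

if __name__ == "__main__":
    dim, batch_size = 32, 4
    query_tok = torch.randn(6, dim)                    # teacher query token vectors
    passage_toks = [torch.randn(50, dim) for _ in range(batch_size)]
    query_vec = torch.randn(dim, requires_grad=True)   # student single-vector encodings
    passage_vecs = torch.randn(batch_size, dim, requires_grad=True)
    loss = kd_loss(query_tok, passage_toks, query_vec, passage_vecs)
    loss.backward()
    print(float(loss))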
Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on a heavily curated, language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to date for evaluating question answering. We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero-shot and translation settings. Results indicate this dataset is challenging even in English, but especially in low-resource languages.
