أوراق بحثية, رسائل ماجستير ودكتوراه حول استرجاع المعلومات

Phrase Retrieval Learns Passage Retrieval, Too

276 - Jinhyuk Lee , Alexander Wettig , Danqi Chen 2021

Dense retrieval methods have shown great promise over sparse retrieval methods in a range of NLP problems. Among them, dense phrase retrieval-the most fine-grained retrieval unit-is appealing because phrases can be directly used as the output for que stion answering and slot filling tasks. In this work, we follow the intuition that retrieving phrases naturally entails retrieving larger text blocks and study whether phrase retrieval can serve as the basis for coarse-level retrieval including passages and documents. We first observe that a dense phrase-retrieval system, without any retraining, already achieves better passage retrieval accuracy (+3-5% in top-5 accuracy) compared to passage retrievers, which also helps achieve superior end-to-end QA performance with fewer passages. Then, we provide an interpretation for why phrase-level supervision helps learn better fine-grained entailment compared to passage-level supervision, and also show that phrase retrieval can be improved to achieve competitive performance in document-retrieval tasks such as entity linking and knowledge-grounded dialogue. Finally, we demonstrate how phrase filtering and vector quantization can reduce the size of our index by 4-10x, making dense phrase retrieval a practical and versatile solution in multi-granularity retrieval.

الحساب واللغة استرجاع المعلومات

Zero-Shot Open Information Extraction using Question Generation and Reading Comprehension

150 - Himanshu Gupta , Amogh Badugu , Tamanna Agrawal 2021

Typically, Open Information Extraction (OpenIE) focuses on extracting triples, representing a subject, a relation, and the object of the relation. However, most of the existing techniques are based on a predefined set of relations in each domain whic h limits their applicability to newer domains where these relations may be unknown such as financial documents. This paper presents a zero-shot open information extraction technique that extracts the entities (value) and their descriptions (key) from a sentence, using off the shelf machine reading comprehension (MRC) Model. The input questions to this model are created using a novel noun phrase generation method. This method takes the context of the sentence into account and can create a wide variety of questions making our technique domain independent. Given the questions and the sentence, our technique uses the MRC model to extract entities (value). The noun phrase corresponding to the question, with the highest confidence, is taken as the description (key). This paper also introduces the EDGAR10-Q dataset which is based on publicly available financial documents from corporations listed in US securities and exchange commission (SEC). The dataset consists of paragraphs, tagged values (entities), and their keys (descriptions) and is one of the largest among entity extraction datasets. This dataset will be a valuable addition to the research community, especially in the financial domain. Finally, the paper demonstrates the efficacy of the proposed technique on the EDGAR10-Q and Ade corpus drug dosage datasets, where it obtained 86.84 % and 97% accuracy, respectively.

استرجاع المعلومات الحساب واللغة التعلم الآلي

FOMO: Topics versus documents in legal eDiscovery

199 - Herbert Roitblat 2021

In the United States, the parties to a lawsuit are required to search through their electronically stored information to find documents that are relevant to the specific case and produce them to their opposing party. Negotiations over the scope of th ese searches often reflect a fear that something will be missed (Fear of Missing Out: FOMO). A Recall level of 80%, for example, means that 20% of the relevant documents will be left unproduced. This paper makes the argument that eDiscovery is the process of identifying responsive information, not identifying documents. Documents are the carriers of the information; they are not the direct targets of the process. A given document may contain one or more topics or factoids and a factoid may appear in more than one document. The coupon collectors problem, Heaps law, and other analyses provide ways to model the problem of finding information from among documents. In eDiscovery, however, the parties do not know how many factoids there might be in a collection or their probabilities. This paper describes a simple model that estimates the confidence that a fact will be omitted from the produced set (the identified set), while being contained in the missed set. Two data sets are then analyzed, a small set involving microaggressions and larger set involving classification of web pages. Both show that it is possible to discover at least one example of each available topic within a relatively small number of documents, meaning the further effort will not return additional novel information. The smaller data set is also used to investigate whether the non-random order of searching for responsive documents commonly used in eDiscovery (called continuous active learning) affects the distribution of topics-it does not.

استرجاع المعلومات

335 - Zihao Zhao , Jiawei Chen , Sheng Zhou 2021

Recommender system usually suffers from severe popularity bias -- the collected interaction data usually exhibits quite imbalanced or even long-tailed distribution over items. Such skewed distribution may result from the users conformity to the group , which deviates from reflecting users true preference. Existing efforts for tackling this issue mainly focus on completely eliminating popularity bias. However, we argue that not all popularity bias is evil. Popularity bias not only results from conformity but also item quality, which is usually ignored by existing methods. Some items exhibit higher popularity as they have intrinsic better property. Blindly removing the popularity bias would lose such important signal, and further deteriorate model performance. To sufficiently exploit such important information for recommendation, it is essential to disentangle the benign popularity bias caused by item quality from the harmful popularity bias caused by conformity. Although important, it is quite challenging as we lack an explicit signal to differentiate the two factors of popularity bias. In this paper, we propose to leverage temporal information as the two factors exhibit quite different patterns along the time: item quality revealing item inherent property is stable and static while conformity that depends on items recent clicks is highly time-sensitive. Correspondingly, we further propose a novel Time-aware DisEntangled framework (TIDE), where a click is generated from three components namely the static item quality, the dynamic conformity effect, as well as the user-item matching score returned by any recommendation model. Lastly, we conduct interventional inference such that the recommendation can benefit from the benign popularity bias while circumvent the harmful one. Extensive experiments on three real-world datasets demonstrated the effectiveness of TIDE.

استرجاع المعلومات

A Qualitative Evaluation of User Preference for Link-based vs. Text-based Recommendations of Wikipedia Articles

144 - Malte Ostendorff , Corinna Breitinger , Bela Gipp 2021

Literature recommendation systems (LRS) assist readers in the discovery of relevant content from the overwhelming amount of literature available. Despite the widespread adoption of LRS, there is a lack of research on the user-perceived recommendation characteristics for fundamentally different approaches to content-based literature recommendation. To complement existing quantitative studies on literature recommendation, we present qualitative study results that report on users perceptions for two contrasting recommendation classes: (1) link-based recommendation represented by the Co-Citation Proximity (CPA) approach, and (2) text-based recommendation represented by Lucenes MoreLikeThis (MLT) algorithm. The empirical data analyzed in our study with twenty users and a diverse set of 40 Wikipedia articles indicate a noticeable difference between text- and link-based recommendation generation approaches along several key dimensions. The text-based MLT method receives higher satisfaction ratings in terms of user-perceived similarity of recommended articles. In contrast, the CPA approach receives higher satisfaction scores in terms of diversity and serendipity of recommendations. We conclude that users of literature recommendation systems can benefit most from hybrid approaches that combine both link- and text-based approaches, where the users information needs and preferences should control the weighting for the approaches used. The optimal weighting of multiple approaches used in a hybrid recommendation system is highly dependent on a users shifting needs.

استرجاع المعلومات المكتبات الرقمية

Evaluating Music Recommendations with Binary Feedback for Multiple Stakeholders

112 - Sasha Stoikov , Hongyi Wen 2021

High quality user feedback data is essential to training and evaluating a successful music recommendation system, particularly one that has to balance the needs of multiple stakeholders. Most existing music datasets suffer from noisy feedback and sel f-selection biases inherent in the data collected by music platforms. Using the Piki Music dataset of 500k ratings collected over a two-year time period, we evaluate the performance of classic recommendation algorithms on three important stakeholders: consumers, well-known artists and lesser-known artists. We show that a matrix factorization algorithm trained on both likes and dislikes performs significantly better compared to one trained only on likes for all three stakeholders.

استرجاع المعلومات

Quoka Atlas of Scholarly Knowledge Production: An Interactive Sensemaking Tool for Exploring the Outputs of Research Institutions

255 - Benjamin Adams , Richard Hosking 2021

The vast amount of research produced at institutions world-wide is extremely diverse, and coarse-grained quantitative measures of impact often obscure the individual contributions of these institutions to specific research fields and topics. We show that by applying an information retrieval model to index research articles which are faceted by institution and time, we can develop tools to rank institutions given a keyword query. We present an interactive atlas, Quoka, designed to enable a user to explore these rankings contextually by geography and over time. Through a set of use cases we demonstrate that the atlas can be used to perform sensemaking tasks to learn and collect information about the relationships between institutions and scholarly knowledge production.

المكتبات الرقمية استرجاع المعلومات

On-the-Fly Ensemble Pruning in Evolving Data Streams

122 - Sanem Elbasi , Alican Buyukc{c}ak{i}r , Hamed Bonab 2021

Ensemble pruning is the process of selecting a subset of componentclassifiers from an ensemble which performs at least as well as theoriginal ensemble while reducing storage and computational costs.Ensemble pruning in data streams is a largely unexpl ored area ofresearch. It requires analysis of ensemble components as they arerunning on the stream, and differentiation of useful classifiers fromredundant ones. We present CCRP, an on-the-fly ensemble prun-ing method for multi-class data stream classification empoweredby an imbalance-aware fusion of class-wise component rankings.CCRP aims that the resulting pruned ensemble contains the bestperforming classifier for each target class and hence, reduces the ef-fects of class imbalance. The conducted experiments on real-worldand synthetic data streams demonstrate that different types of en-sembles that integrate CCRP as their pruning scheme consistentlyyield on par or superior performance with 20% to 90% less averagememory consumption. Lastly, we validate the proposed pruningscheme by comparing our approach against pruning schemes basedon ensemble weights and basic rank fusion methods.

التعلم الآلي استرجاع المعلومات

Matching with Transformers in MELT

182 - Sven Hertling , Jan Portisch , Heiko Paulheim 2021

One of the strongest signals for automated matching of ontologies and knowledge graphs are the textual descriptions of the concepts. The methods that are typically applied (such as character- or token-based comparisons) are relatively simple, and the refore do not capture the actual meaning of the texts. With the rise of transformer-based language models, text comparison based on meaning (rather than lexical features) is possible. In this paper, we model the ontology matching task as classification problem and present approaches based on transformer models. We further provide an easy to use implementation in the MELT framework which is suited for ontology and knowledge graph matching. We show that a transformer-based filter helps to choose the correct correspondences given a high-recall alignment and already achieves a good result with simple alignment post-processing methods.

الحساب واللغة الذكاء الاصطناعي استرجاع المعلومات

FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining

159 - Zhoujun Cheng , Haoyu Dong , Fan Cheng 2021

Tables store rich numerical data, but numerical reasoning over tables is still a challenge. In this paper, we find that the spreadsheet formula, which performs calculations on numerical values in tables, is naturally a strong supervision of numerical reasoning. More importantly, large amounts of spreadsheets with expert-made formulae are available on the web and can be obtained easily. FORTAP is the first method for numerical-reasoning-aware table pretraining by leveraging large corpus of spreadsheet formulae. We design two formula pretraining tasks to explicitly guide FORTAP to learn numerical reference and calculation in semi-structured tables. FORTAP achieves state-of-the-art results on two representative downstream tasks, cell type classification and formula prediction, showing great potential of numerical-reasoning-aware pretraining.

استرجاع المعلومات التعلم الآلي

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد