Extraction of Salient Sentences from Labelled Documents

386 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Misha Denil

تاريخ النشر 2014

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Misha Denil - Alban Demiraj - Nando de Freitas

الحساب واللغة استرجاع المعلومات التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We present a hierarchical convolutional document model with an architecture designed to support introspection of the document structure. Using this model, we show how to use visualisation techniques from the computer vision literature to identify and extract topic-relevant sentences. We also introduce a new scalable evaluation technique for automatic sentence extraction systems that avoids the need for time consuming human annotation of validation data.

قيم البحث

153 - Kaihao Guo , Tianpei Jiang , Haipeng Zhang 2021

Event extraction is a classic task in natural language processing with wide use in handling large amount of yet rapidly growing financial, legal, medical, and government documents which often contain multiple events with their elements scattered and mixed across the documents, making the problem much more difficult. Though the underlying relations between event elements to be extracted provide helpful contextual information, they are somehow overlooked in prior studies. We showcase the enhancement to this task brought by utilizing the knowledge graph that captures entity relations and their attributes. We propose a first event extraction framework that embeds a knowledge graph through a Graph Neural Network and integrates the embedding with regular features, all at document-level. Specifically, for extracting events from Chinese financial announcements, our method outperforms the state-of-the-art method by 5.3% in F1-score.

الحساب واللغة استرجاع المعلومات التعلم الآلي

Rapid Adaptation of BERT for Information Extraction on Domain-Specific Business Documents

153 - Ruixue Zhang , Wei Yang , Luyun Lin 2020

Techniques for automatically extracting important content elements from business documents such as contracts, statements, and filings have the potential to make business operations more efficient. This problem can be formulated as a sequence labeling task, and we demonstrate the adaption of BERT to two types of business documents: regulatory filings and property lease agreements. There are aspects of this problem that make it easier than standard information extraction tasks and other aspects that make it more difficult, but on balance we find that modest amounts of annotated data (less than 100 documents) are sufficient to achieve reasonable accuracy. We integrate our models into an end-to-end cloud platform that provides both an easy-to-use annotation interface as well as an inference interface that allows users to upload documents and inspect model outputs.

الحساب واللغة استرجاع المعلومات التعلم الآلي

On a Topic Model for Sentences

321 - Georgios Balikas , Massih-Reza Amini , Marianne Clausel 2016

Probabilistic topic models are generative models that describe the content of documents by discovering the latent topics underlying them. However, the structure of the textual input, and for instance the grouping of words in coherent text spans such as sentences, contains much information which is generally lost with these models. In this paper, we propose sentenceLDA, an extension of LDA whose goal is to overcome this limitation by incorporating the structure of the text in the generative and inference processes. We illustrate the advantages of sentenceLDA by comparing it with LDA using both intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on different text collections.

الحساب واللغة استرجاع المعلومات التعلم الآلي

Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles

78 - M. Tarik Altuncu , Sophia N. Yaliraki , Mauricio Barahona 2020

Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructur ed corpora of text into `topics that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformers based model Bert. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.

الحساب واللغة الذكاء الاصطناعي التعلم الآلي

Using Neighborhood Context to Improve Information Extraction from Visual Documents Captured on Mobile Phones

70 - Kalpa Gunaratna , Vijay Srinivasan , Sandeep Nama 2021

Information Extraction from visual documents enables convenient and intelligent assistance to end users. We present a Neighborhood-based Information Extraction (NIE) approach that uses contextual language models and pays attention to the local neighb orhood context in the visual documents to improve information extraction accuracy. We collect two different visual document datasets and show that our approach outperforms the state-of-the-art global context-based IE technique. In fact, NIE outperforms existing approaches in both small and large model sizes. Our on-device implementation of NIE on a mobile platform that generally requires small models showcases NIEs usefulness in practical real-world applications.

التعلم الآلي استرجاع المعلومات