New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic

استخراج البيانات الوصفية متعددة اللوزتين: عرض في السيريلية

111 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

bootstrapping multilingual metadata multilingual metadata extraction bootstrapping multilingual bootstraping البيانات الوصفية متعددة اللغات استخراج البيانات الوصفية متعددة اللغات bootstraping متعددة اللغات صناعة حمض الفوسفور

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Applications based on scholarly data are of ever increasing importance. This results in disadvantages for areas where high-quality data and compatible systems are not available, such as non-English publications. To advance the mitigation of this imbalance, we use Cyrillic script publications from the CORE collection to create a high-quality data set for metadata extraction. We utilize our data for training and evaluating sequence labeling models to extract title and author information. Retraining GROBID on our data, we observe significant improvements in terms of precision and recall and achieve even better results with a self developed model. We make our data set covering over 15,000 publications as well as our source code freely available.

References used

https://aclanthology.org/

rate research

Beyond Text: Incorporating Metadata and Label Structure for Multi-Label Document Classification using Heterogeneous Graphs

193 - Association for Computation Linguistics 2021 مقالة

Multi-label document classification, associating one document instance with a set of relevant labels, is attracting more and more research attention. Existing methods explore the incorporation of information beyond text, such as document metadata or label structure. These approaches however either simply utilize the semantic information of metadata or employ the predefined parent-child label hierarchy, ignoring the heterogeneous graphical structures of metadata and labels, which we believe are crucial for accurate multi-label document classification. Therefore, in this paper, we propose a novel neural network based approach for multi-label document classification, in which two heterogeneous graphs are constructed and learned using heterogeneous graph transformers. One is metadata heterogeneous graph, which models various types of metadata and their topological relations. The other is label heterogeneous graph, which is constructed based on both the labels' hierarchy and their statistical dependencies. Experimental results on two benchmark datasets show the proposed approach outperforms several state-of-the-art baselines.

multi-label document classification incorporating metadata تصنيف المستندات متعددة الملصقات دمج البيانات الوصفية صناعة حمض الفوسفور

Multilingual Paraphrase Generation For Bootstrapping New Features in Task-Oriented Dialog Systems

259 - Association for Computation Linguistics 2021 مقالة

The lack of labeled training data for new features is a common problem in rapidly changing real-world dialog systems. As a solution, we propose a multilingual paraphrase generation model that can be used to generate novel utterances for a target feat ure and target language. The generated utterances can be used to augment existing training data to improve intent classification and slot labeling models. We evaluate the quality of generated utterances using intrinsic evaluation metrics and by conducting downstream evaluation experiments with English as the source language and nine different target languages. Our method shows promise across languages, even in a zero-shot setting where no seed data is available.

task-oriented dialog systems dialog systems multilingual paraphrase generation أنظمة الحوار إعادة صياغة إعادة صياغة متعددة اللغات صناعة حمض الفوسفور

Relation Extraction Using Multiple Pre-Training Models in Biomedical Domain

242 - Association for Computation Linguistics 2021 مقالة

The number of biomedical documents is increasing rapidly. Accordingly, a demand for extracting knowledge from large-scale biomedical texts is also increasing. BERT-based models are known for their high performance in various tasks. However, it is oft en computationally expensive. A high-end GPU environment is not available in many situations. To attain both high accuracy and fast extraction speed, we propose combinations of simpler pre-trained models. Our method outperforms the latest state-of-the-art model and BERT-based models on the GAD corpus. In addition, our method shows approximately three times faster extraction speed than the BERT-based models on the ChemProt corpus and reduces the memory size to one sixth of the BERT ones.

multiple pre-training models multiple pre-training biomedical domain نماذج متعددة التدريب مسبقا متعددة ما قبل التدريب النطاق الطبي الطبيعي صناعة حمض الفوسفور المزيد..

Machine Reading Comprehension as Data Augmentation: A Case Study on Implicit Event Argument Extraction

412 - Association for Computation Linguistics 2021 مقالة

Implicit event argument extraction (EAE) is a crucial document-level information extraction task that aims to identify event arguments beyond the sentence level. Despite many efforts for this task, the lack of enough training data has long impeded th e study. In this paper, we take a new perspective to address the data sparsity issue faced by implicit EAE, by bridging the task with machine reading comprehension (MRC). Particularly, we devise two data augmentation regimes via MRC, including: 1) implicit knowledge transfer, which enables knowledge transfer from other tasks, by building a unified training framework in the MRC formulation, and 2) explicit data augmentation, which can explicitly generate new training examples, by treating MRC models as an annotator. The extensive experiments have justified the effectiveness of our approach --- it not only obtains state-of-the-art performance on two benchmarks, but also demonstrates superior results in a data-low scenario.

استخراج العلاقات منخفض الموارد implicit event argument event argument حجة الحدث الضمني حجة الحدث صناعة حمض الفوسفور

A Web Scale Entity Extraction System

281 - Association for Computation Linguistics 2021 مقالة

Understanding the semantic meaning of content on the web through the lens of entities and concepts has many practical advantages. However, when building large-scale entity extraction systems, practitioners are facing unique challenges involving findi ng the best ways to leverage the scale and variety of data available on internet platforms. We present learnings from our efforts in building an entity extraction system for multiple document types at large scale using multi-modal Transformers. We empirically demonstrate the effectiveness of multi-lingual, multi-task and cross-document type learning. We also discuss the label collection schemes that help to minimize the amount of noise in the collected data.

entity extraction system web scale entity entity extraction نظام استخراج الكيان كيان مقياس الويب استخراج الكيان صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic

استخراج البيانات الوصفية متعددة اللوزتين: عرض في السيريلية

Ask ChatGPT about the research

Read More

suggested questions