Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

StoryDB: Broad Multi-language Narrative Dataset

StoryDB: مجموعة بيانات سرد متعددة اللغات

779 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

broad multi-language narrative broad multi-language dataset broad multi-language السرد متعدد اللغات مجموعة البيانات متعددة اللغات صناعة حمض الفوسفور

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper presents StoryDB --- a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.

References used

https://aclanthology.org/

rate research

MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

800 - Association for Computation Linguistics 2021 مقالة

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BITFIT, LNFIT, originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.

multi-label legal document multi-lingual and multi-label وثيقة قانونية متعددة العلامات متعدد اللغات ومتعددة التسمية صناعة حمض الفوسفور

UniteD-SRL: A Unified Dataset for Span- and Dependency-Based Multilingual and Cross-Lingual Semantic Role Labeling

693 - Association for Computation Linguistics 2021 مقالة

Multilingual and cross-lingual Semantic Role Labeling (SRL) have recently garnered increasing attention as multilingual text representation techniques have become more effective and widely available. While recent work has attained growing success, re sults on gold multilingual benchmarks are still not easily comparable across languages, making it difficult to grasp where we stand. For example, in CoNLL-2009, the standard benchmark for multilingual SRL, language-to-language comparisons are affected by the fact that each language has its own dataset which differs from the others in size, domains, sets of labels and annotation guidelines. In this paper, we address this issue and propose UniteD-SRL, a new benchmark for multilingual and cross-lingual, span- and dependency-based SRL. UniteD-SRL provides expert-curated parallel annotations using a common predicate-argument structure inventory, allowing direct comparisons across languages and encouraging studies on cross-lingual transfer in SRL. We release UniteD-SRL v1.0 at https://github.com/SapienzaNLP/united-srl.

تمثيل التفاعل متعدد الوسائط cross-lingual semantic role الدور الدلالي عبر اللغات صناعة حمض الفوسفور

An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model

1088 - Association for Computation Linguistics 2021 مقالة

Multi-turn response selection models have recently shown comparable performance to humans in several benchmark datasets. However, in the real environment, these models often have weaknesses, such as making incorrect predictions based heavily on super ficial patterns without a comprehensive understanding of the context. For example, these models often give a high score to the wrong response candidate containing several keywords related to the context but using the inconsistent tense. In this study, we analyze the weaknesses of the open-domain Korean Multi-turn response selection models and publish an adversarial dataset to evaluate these weaknesses. We also suggest a strategy to build a robust model in this adversarial environment.

multi-turn response selection building robust multi-turn response selection models اختيار الاستجابة المتعددة بناء قوي الدوران نماذج اختيار الاستجابة صناعة حمض الفوسفور المزيد..

MFAQ: a Multilingual FAQ Dataset

626 - Association for Computation Linguistics 2021 مقالة

In this paper, we present the first multilingual FAQ dataset publicly available. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its o wn challenges: duplication of content and uneven distribution of topics. We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower resources languages seem to learn from one another as a multilingual model achieves a higher MRR than language-specific ones. Our qualitative analysis reveals the brittleness of the model on simple word changes. We publicly release our dataset, model, and training script.

multilingual faq dataset multilingual faq mfaq مجموعة بيانات متعددة اللغات DataSet. التعليمات العامة متعددة اللغات MFAQ. صناعة حمض الفوسفور المزيد..

``Let Your Characters Tell Their Story'': A Dataset for Character-Centric Narrative Understanding

1026 - Association for Computation Linguistics 2021 مقالة

When reading a literary piece, readers often make inferences about various characters' roles, personalities, relationships, intents, actions, etc. While humans can readily draw upon their past experiences to build such a character-centric view of the narrative, understanding characters in narratives can be a challenging task for machines. To encourage research in this field of character-centric narrative understanding, we present LiSCU -- a new dataset of literary pieces and their summaries paired with descriptions of characters that appear in them. We also introduce two new tasks on LiSCU: Character Identification and Character Description Generation. Our experiments with several pre-trained language models adapted for these tasks demonstrate that there is a need for better models of narrative comprehension.

character-centric narrative understanding character description generation فهم السرد المرحل بالشخصية توليد وصف الشخصية صناعة حمض الفوسفور

StoryDB: Broad Multi-language Narrative Dataset

StoryDB: مجموعة بيانات سرد متعددة اللغات

Ask ChatGPT about the research

Read More

suggested questions