تقدم هذه الورقة StoryDB --- مجموعة بيانات واسعة متعددة اللغات من الروايات.StoryDB هي جثة من النصوص التي تضم قصص في 42 لغة مختلفة.تتضمن كل لغة 500+ قصص.تشمل بعض اللغات أكثر من 20 ألف قصة.يتم فهرسة كل قصة عبر اللغات والمسمى مع العلامات مثل النوع أو الموضوع.يعرض Corpus تباين موضعي ولغوي غني ويمكن أن يكون بمثابة مورد لدراسة دور السرد في معالجة اللغة الطبيعية في مختلف اللغات بما في ذلك الموارد المنخفضة.نوضح أيضا كيف يمكن استخدام مجموعة البيانات لقياس ثلاث نماذج متعددة اللغات الحديثة، وهي mdistillbert و mbert و xlm-roberta.
This paper presents StoryDB --- a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.
References used
https://aclanthology.org/
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We
Multilingual and cross-lingual Semantic Role Labeling (SRL) have recently garnered increasing attention as multilingual text representation techniques have become more effective and widely available. While recent work has attained growing success, re
Multi-turn response selection models have recently shown comparable performance to humans in several benchmark datasets. However, in the real environment, these models often have weaknesses, such as making incorrect predictions based heavily on super
In this paper, we present the first multilingual FAQ dataset publicly available. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its o
When reading a literary piece, readers often make inferences about various characters' roles, personalities, relationships, intents, actions, etc. While humans can readily draw upon their past experiences to build such a character-centric view of the