WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Language: English

Created by Shamra Editor

Abstract: Aspect-based summarization is the task of generating focused summaries based on specific points of interest. Such summaries aid efficient analysis of text, such as quickly understanding reviews or opinions from different angles. However, due to large differences in the type of aspects for different domains (e.g., sentiment, product features), the development of previous models has tended to be domain-specific. In this paper, we propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization that attempts to spur research in the direction of open-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. We propose several straightforward baseline models for this task and conduct experiments on the dataset. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
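The idea of using section titles and boundaries as a proxy for aspect annotation can be illustrated with a minimal sketch. This is not the authors' pipeline: the `== Title ==` heading format, the function name `split_by_section`, and the sample text are all assumptions made for illustration only.

```python
import re

# Assumption: top-level headings look like "== Title ==" on their own line.
HEADING = re.compile(r"^==\s*(.+?)\s*==\s*$")


def split_by_section(article: str) -> dict:
    """Map each section title (treated as an aspect label) to its body text."""
    sections = {}
    current = None  # text before the first heading is ignored in this sketch
    for line in article.splitlines():
        match = HEADING.match(line)
        if match:
            # A heading closes the previous section and opens a new one.
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {title: " ".join(body) for title, body in sections.items()}


# Hypothetical sample article, not taken from the dataset.
sample = (
    "Lead paragraph.\n"
    "== History ==\n"
    "Founded in 1900.\n"
    "Grew fast.\n"
    "== Geography ==\n"
    "Near a river.\n"
)
aspects = split_by_section(sample)
```

Each key of `aspects` plays the role of an aspect, and the text under it plays the role of the aspect-specific summary target.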

References used

https://aclanthology.org/

MM-AVS: A Full-Scale Dataset for Multi-modal Summarization

Association for Computational Linguistics, 2021 (article)

Multimodal summarization becomes increasingly significant as it is the basis for question answering, Web search, and many other downstream tasks. However, its learning materials have been lacking a holistic organization by integrating resources from various modalities, thereby lagging behind the research progress of this field. In this study, we release a full-scale multimodal dataset comprehensively gathering documents, summaries, images, captions, videos, audios, transcripts, and titles in English from CNN and Daily Mail. To our best knowledge, this is the first collection that spans all modalities and nearly comprises all types of materials available in this community. In addition, we devise a baseline model based on the novel dataset, which employs a newly proposed Jump-Attention mechanism based on transcripts. The experimental results validate the important assistance role of the external information for multimodal summarization.

Keywords: multi-modal summarization, multi-modal, web search

MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

Association for Computational Linguistics, 2021 (article)

This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.

Keywords: large-scale media interview, media interview dataset, large-scale media

A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization

Association for Computational Linguistics, 2021 (article)

Cross-lingual summarization is a challenging task for which there are no cross-lingual scientific resources currently available. To overcome the lack of a high-quality resource, we present a new dataset for monolingual and cross-lingual summarization considering the English-German pair. We collect high-quality, real-world cross-lingual data from Spektrum der Wissenschaft, which publishes human-written German scientific summaries of English science articles on various subjects. The generated Spektrum dataset is small; therefore, we harvest a similar dataset from the Wikipedia Science Portal to complement it. The Wikipedia dataset consists of English and German articles, which can be used for monolingual and cross-lingual summarization. Furthermore, we present a quantitative analysis of the datasets and results of empirical experiments with several existing extractive and abstractive summarization models. The results suggest the viability and usefulness of the proposed dataset for monolingual and cross-lingual summarization.

Keywords: cross-lingual summarization, monolingual and cross-lingual

Association for Computational Linguistics, 2021 (article)

ROUGE is a widely used evaluation metric in text summarization. However, it is not suitable for the evaluation of abstractive summarization systems as it relies on lexical overlap between the gold standard and the generated summaries. This limitation becomes more apparent for agglutinative languages with very large vocabularies and high type/token ratios. In this paper, we present semantic similarity models for Turkish and apply them as evaluation metrics for an abstractive summarization task. To achieve this, we translated the English STSb dataset into Turkish and presented the first semantic textual similarity dataset for Turkish as well. We showed that our best similarity models have better alignment with average human judgments compared to ROUGE in both Pearson and Spearman correlations.

Keywords: similarity based evaluation, semantic similarity based, based evaluation
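The abstract's point that ROUGE relies on lexical overlap can be seen in a simplified unigram-overlap F1, in the spirit of ROUGE-1. This is a rough sketch, not the official ROUGE implementation: it uses plain whitespace tokenization with no stemming, and the function name `unigram_f1` and the example sentences are assumptions for illustration.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: a faithful paraphrase shares no tokens with the reference.
reference = "the film was excellent"
paraphrase_score = unigram_f1(reference, "a great movie")
copy_score = unigram_f1(reference, "the film was excellent")
```

A paraphrase with the same meaning but different words scores zero here, while a verbatim copy scores one, which is the kind of gap a semantic similarity metric is meant to close.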

Transfer-based Enrichment of a Hungarian Named Entity Dataset

Association for Computational Linguistics, 2021 (article)

In this paper, we present a major update to the first Hungarian named entity dataset, the Szeged NER corpus. We used zero-shot cross-lingual transfer to initialize the enrichment of entity types annotated in the corpus using three neural NER models: two of them based on the English OntoNotes corpus and one based on the Czech Named Entity Corpus finetuned from multilingual neural language models. The output of the models was automatically merged with the original NER annotation, and automatically and manually corrected and further enriched with additional annotation, like qualifiers for various entity types. We present the evaluation of the zero-shot performance of the two OntoNotes-based models and a transformer-based new NER model trained on the training part of the final corpus. We release the corpus and the trained model.

Keywords: Hungarian named entity, named entity dataset, Czech named entity
