New community

Subscribe to the gold package and get unlimited access to Shamra Academy

MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

MediaSum: مجموعة بيانات مقابلة وسائط واسعة النطاق لتلخيص الحوار

442 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

large-scale media interview media interview dataset large-scale media مقابلة وسائط واسعة النطاق بيانات مقابلة مع وسائل الإعلام وسائل الإعلام واسعة النطاق صناعة حمض الفوسفور

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.

References used

https://aclanthology.org/

rate research

A Large-Scale Dataset for Empathetic Response Generation

321 - Association for Computation Linguistics 2021 مقالة

Recent development in NLP shows a strong trend towards refining pre-trained models with a domain-specific dataset. This is especially the case for response generation where emotion plays an important role. However, existing empathetic datasets remain small, delaying research efforts in this area, for example, the development of emotion-aware chatbots. One main technical challenge has been the cost of manually annotating dialogues with the right emotion labels. In this paper, we describe a large-scale silver dataset consisting of 1M dialogues annotated with 32 fine-grained emotions, eight empathetic response intents, and the Neutral category. To achieve this goal, we have developed a novel data curation pipeline starting with a small seed of manually annotated data and eventually scaling it to a satisfactory size. We compare its quality against a state-of-the-art gold dataset using both offline experiments and visual validation methods. The resultant procedure can be used to create similar datasets in the same domain as well as in other domains.

قواعد المنطق empathetic response استجابة متعاطف صناعة حمض الفوسفور

WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia

295 - Association for Computation Linguistics 2021 مقالة

Cross-document event coreference resolution is a foundational task for NLP applications involving multi-text processing. However, existing corpora for this task are scarce and relatively small, while annotating only modest-size clusters of documents belonging to the same topic. To complement these resources and enhance future research, we present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia, where coreference links are not restricted within predefined topics. We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset. Notably, our dataset creation method is generic and can be applied with relatively little effort to other Wikipedia languages. To set baseline results, we develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting. Our model is suitably efficient and outperforms previously published state-of-the-art results for the task.

المهام تحسين صفر النار event coreference cross-document event الحدث comeference. حدث عبر المستندات صناعة حمض الفوسفور

MM-AVS: A Full-Scale Dataset for Multi-modal Summarization

312 - Association for Computation Linguistics 2021 مقالة

Multimodal summarization becomes increasingly significant as it is the basis for question answering, Web search, and many other downstream tasks. However, its learning materials have been lacking a holistic organization by integrating resources from various modalities, thereby lagging behind the research progress of this field. In this study, we release a full-scale multimodal dataset comprehensively gathering documents, summaries, images, captions, videos, audios, transcripts, and titles in English from CNN and Daily Mail. To our best knowledge, this is the first collection that spans all modalities and nearly comprises all types of materials available in this community. In addition, we devise a baseline model based on the novel dataset, which employs a newly proposed Jump-Attention mechanism based on transcripts. The experimental results validate the important assistance role of the external information for multimodal summarization.

multi-modal summarization multi-modal web search تلخيص متعددة الوسائط متعددة مشروط البحث في الويب صناعة حمض الفوسفور المزيد..

WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

243 - Association for Computation Linguistics 2021 مقالة

Abstract Aspect-based summarization is the task of generating focused summaries based on specific points of interest. Such summaries aid efficient analysis of text, such as quickly understanding reviews or opinions from different angles. However, due to large differences in the type of aspects for different domains (e.g., sentiment, product features), the development of previous models has tended to be domain-specific. In this paper, we propose WikiAsp,1 a large-scale dataset for multi-domain aspect- based summarization that attempts to spur research in the direction of open-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. We propose several straightforward baseline models for this task and conduct experiments on the dataset. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.

aspect-based summarization multi-domain aspect-based summarization abstract aspect-based summarization تلخيص القائم على الجانب تلخيص القائم على الجانب متعدد المجالات التلخيص القائم على الجانب المجردة صناعة حمض الفوسفور المزيد..

Itihasa: A large-scale corpus for Sanskrit to English translation

328 - Association for Computation Linguistics 2021 مقالة

This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.

work introduces itihasa english translation large-scale translation dataset العمل يقدم Itihasa. الترجمة إلى الإنجليزية مجموعة بيانات الترجمة على نطاق واسع صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

MediaSum: مجموعة بيانات مقابلة وسائط واسعة النطاق لتلخيص الحوار

Ask ChatGPT about the research

Read More

suggested questions