Do you want to publish a course? Click here

A New Dataset and Efficient Baselines for Document-level Text Simplification in German

مجموعة بيانات جديدة وأنظمة أساس فعالة لتبسيط النص على مستوى المستند باللغة الألمانية

413   0   0   0.0 ( 0 )
 Publication date 2021
and research's language is English
 Created by Shamra Editor




Ask ChatGPT about the research

The task of document-level text simplification is very similar to summarization with the additional difficulty of reducing complexity. We introduce a newly collected data set of German texts, collected from the Swiss news magazine 20 Minuten (20 Minutes') that consists of full articles paired with simplified summaries. Furthermore, we present experiments on automatic text simplification with the pretrained multilingual mBART and a modified version thereof that is more memory-friendly, using both our new data set and existing simplification corpora. Our modifications of mBART let us train at a lower memory cost without much loss in performance, in fact, the smaller mBART even improves over the standard model in a setting with multiple simplification levels.



References used
https://aclanthology.org/
rate research

Read More

Text simplification is a valuable technique. However, current research is limited to sentence simplification. In this paper, we define and investigate a new task of document-level text simplification, which aims to simplify a document consisting of m ultiple sentences. Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia and perform analysis and human evaluation on it to show that the dataset is reliable. Then, we propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task. Finally, we select several representative models as baseline models for this task and perform automatic evaluation and human evaluation. We analyze the results and point out the shortcomings of the baseline models.
As the world continues to fight the COVID-19 pandemic, it is simultaneously fighting an infodemic' -- a flood of disinformation and spread of conspiracy theories leading to health threats and the division of society. To combat this infodemic, there i s an urgent need for benchmark datasets that can help researchers develop and evaluate models geared towards automatic detection of disinformation. While there are increasing efforts to create adequate, open-source benchmark datasets for English, comparable resources are virtually unavailable for German, leaving research for the German language lagging significantly behind. In this paper, we introduce the new benchmark dataset FANG-COVID consisting of 28,056 real and 13,186 fake German news articles related to the COVID-19 pandemic as well as data on their propagation on Twitter. Furthermore, we propose an explainable textual- and social context-based model for fake news detection, compare its performance to black-box'' models and perform feature ablation to assess the relative importance of human-interpretable features in distinguishing fake news from authentic news.
Reviewing contracts is a time-consuming procedure that incurs large expenses to companies and social inequality to those who cannot afford it. In this work, we propose document-level natural language inference (NLI) for contracts'', a novel, real-wor ld application of NLI that addresses such problems. In this task, a system is given a set of hypotheses (such as Some obligations of Agreement may survive termination.'') and a contract, and it is asked to classify whether each hypothesis is entailed by'', contradicting to'' or not mentioned by'' (neutral to) the contract as well as identifying evidence'' for the decision as spans in the contract. We annotated and release the largest corpus to date consisting of 607 annotated contracts. We then show that existing models fail badly on our task and introduce a strong baseline, which (a) models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens, and (b) employs more sophisticated context segmentation for dealing with long documents. We also show that linguistic characteristics of contracts, such as negations by exceptions, are contributing to the difficulty of this task and that there is much room for improvement.
Automatic personalized corrective feedback can help language learners from different backgrounds better acquire a new language. This paper introduces a learner English dataset in which learner errors are accompanied by information about possible erro r sources. This dataset contains manually annotated error causes for learner writing errors. These causes tie learner mistakes to structures from their first languages, when the rules in English and in the first language diverge. This new dataset will enable second language acquisition researchers to computationally analyze a large quantity of learner errors that are related to language transfer from the learners' first language. The dataset can also be applied in personalizing grammatical error correction systems according to the learners' first language and in providing feedback that is informed by the cause of an error.
The paper introduces a new resource, CoDeRooMor, for studying the morphology of modern Swedish word formation. The approximately 16.000 lexical items in the resource have been manually segmented into word-formation morphemes, and labeled for their ca tegories, such as prefixes, suffixes, roots, etc. Word-formation mechanisms, such as derivation and compounding have been associated with each item on the list. The article describes the selection of items for manual annotation and the principles of annotation, reports on the reliability of the manual annotation, and presents tools, resources and some first statistics. Given the''gold'' nature of the resource, it is possible to use it for empirical studies as well as to develop linguistically-aware algorithms for morpheme segmentation and labeling (cf statistical subword approach). The resource will be made freely available.

suggested questions

comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا