
The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek


Publication date: 2021
Language: English
Created by: Shamra Editor





This paper describes the GLAUx project ("the Greek Language Automated"), an ongoing effort to develop a large long-term diachronic corpus of Greek, covering sixteen centuries of literary and non-literary material annotated with NLP methods. After providing an overview of related corpus projects and discussing the general architecture of the corpus, it zooms in on a number of larger methodological issues in the design of historical corpora. These include the encoding of textual variants, handling extralinguistic variation and annotating linguistic ambiguity. Finally, the long- and short-term perspectives of this project are discussed.
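To make these design issues concrete, the sketch below shows one way a multi-layered corpus token could record textual variants and unresolved morphological ambiguity. This is a purely hypothetical data model for illustration; the abstract does not specify GLAUx's actual encoding, and all field names, witness sigla, and tag labels are invented.

```python
from dataclasses import dataclass, field

# Hypothetical token record for a multi-layered historical corpus:
# the surface form can differ across manuscript witnesses (textual
# variants), and the morphology layer can keep several candidate
# analyses instead of forcing a single one (annotated ambiguity).
@dataclass
class Token:
    form: str                                     # reading adopted in the base text
    lemma: str
    variants: dict = field(default_factory=dict)  # witness id -> alternative reading
    analyses: list = field(default_factory=list)  # all plausible morphological tags

token = Token(
    form="δῶρον",
    lemma="δῶρον",
    variants={"ms_B": "δῶρα"},                    # invented witness siglum
    analyses=["noun.sg.neut.nom", "noun.sg.neut.acc"],  # genuinely ambiguous case
)
```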

Related research

Recently, the Machine Translation (MT) community has become more interested in document-level evaluation, especially in light of reactions to claims of "human parity", since examining quality at the level of the document rather than at the sentence level allows for the assessment of suprasentential context, providing a more reliable evaluation. This paper presents a document-level corpus annotated in English with context-aware issues that arise when translating from English into Brazilian Portuguese, namely ellipsis, gender, lexical ambiguity, number, reference, and terminology, across six different domains. The corpus can be used as a challenge test set for evaluation and as a training/testing corpus for MT as well as for deep linguistic analysis of context issues. To the best of our knowledge, this is the first corpus of its kind.
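Purely as an illustration of what such a context-aware test item could look like (the paper's actual annotation scheme and domain names are not given here; every field and value below is invented):

```python
# Hypothetical record for one document-level test item. The issue labels
# are the six categories listed in the abstract; everything else is assumed.
example = {
    "domain": "subtitles",                        # invented domain name
    "source": "I saw the doctor. She was very helpful.",
    "target_pt": "Eu vi a médica. Ela foi muito prestativa.",
    "issue": "gender",                            # ellipsis | gender | lexical ambiguity
                                                  # | number | reference | terminology
    "context_needed": True,                       # gender of "doctor" needs sentence 2
}
```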
Style transfer has been widely explored in natural language generation with non-parallel corpora by directly or indirectly extracting a notion of style from source and target domain corpora. A common shortcoming of existing approaches is the prerequisite of joint annotations across all the stylistic dimensions under consideration. The availability of such datasets across a combination of styles limits the extension of these setups to multiple style dimensions. While cascading single-dimensional models across multiple styles is a possibility, it suffers from content loss, especially when the style dimensions are not completely independent of each other. In our work, we relax this requirement of jointly annotated data across multiple styles by using independently acquired data across different style dimensions without any additional annotations. We initialize an encoder-decoder setup with a transformer-based language model pre-trained on a generic corpus and enhance its re-writing capability towards multiple target style dimensions by employing multiple style-aware language models as discriminators. Through quantitative and qualitative evaluation, we show the ability of our model to control styles across multiple style dimensions while preserving the content of the input text. We compare it against baselines involving cascaded state-of-the-art uni-dimensional style transfer models.
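A rough sketch of the core training signal as described: one discriminator term per style dimension added to the reconstruction objective. The function below is an assumption about how such losses might be combined, not the paper's implementation; the loss type and weighting scheme are invented.

```python
import torch
import torch.nn.functional as F

# Illustrative only: combine the encoder-decoder reconstruction loss with
# one feedback term per style dimension, each supplied by an independently
# trained style-aware model acting as a discriminator.
def multi_style_objective(rec_loss, disc_logits, style_targets, weights):
    """rec_loss: scalar tensor; disc_logits/style_targets: dim name -> tensor."""
    total = rec_loss
    for dim, logits in disc_logits.items():
        total = total + weights[dim] * F.binary_cross_entropy_with_logits(
            logits, style_targets[dim]
        )
    return total
```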
This paper presents a data set of German fairy tales, manually annotated with character networks which were obtained with high inter-rater agreement. The release of this corpus provides an opportunity for training and comparing different algorithms for the extraction of character networks, which so far was barely possible due to the heterogeneous interests of previous researchers. We demonstrate the usefulness of our data set by providing baseline experiments for the automatic extraction of character networks, applying a rule-based pipeline as well as a neural approach, and find the neural approach outperforming the rule-based approach in most evaluation settings.
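For orientation, a minimal rule-based baseline for character network extraction (a generic co-occurrence heuristic, not the paper's pipeline): link two characters whenever they appear in the same sentence, weighting edges by co-occurrence counts.

```python
import itertools
import networkx as nx

def cooccurrence_network(sentences, characters):
    """Build a character network from sentence-level co-occurrence."""
    g = nx.Graph()
    g.add_nodes_from(characters)
    for sent in sentences:
        present = [c for c in characters if c in sent]
        for a, b in itertools.combinations(present, 2):
            weight = g.get_edge_data(a, b, default={"weight": 0})["weight"]
            g.add_edge(a, b, weight=weight + 1)
    return g

tale = ["Hansel took Gretel by the hand.", "The witch watched Hansel closely."]
net = cooccurrence_network(tale, ["Hansel", "Gretel", "witch"])
```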
Multilingual pretrained language models are rapidly gaining popularity in NLP systems for non-English languages. Most of these models feature an important corpus sampling step in the process of accumulating training data in different languages, to ensure that the signal from better resourced languages does not drown out poorly resourced ones. In this study, we train multiple multilingual recurrent language models, based on the ELMo architecture, and analyse both the effect of varying corpus size ratios on downstream performance, and the performance difference between monolingual models for each language and broader multilingual language models. As part of this effort, we also make these trained models available for public use.
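The sampling step typically works along the following lines. The sketch shows the widely used exponent-smoothed sampling recipe (raise each language's empirical share to an exponent below one, then renormalise); the study's actual corpus size ratios are its experimental variable, so the numbers here are placeholders.

```python
# Common multilingual corpus sampling: smooth each language's empirical
# share with an exponent alpha < 1, then renormalise, so low-resource
# languages are upsampled relative to their raw corpus size.
def sampling_probs(corpus_sizes, alpha=0.7):
    total = sum(corpus_sizes.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in corpus_sizes.items()}
    z = sum(smoothed.values())
    return {lang: s / z for lang, s in smoothed.items()}

# Placeholder sizes: low-resource languages get a larger share than raw counts give.
print(sampling_probs({"en": 1_000_000, "fi": 50_000, "sme": 1_000}))
```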
In this thesis proposal, we explore the application of event extraction to literary texts. Considering the length of literary documents, modeling events at different granularities may be more adequate for extracting meaningful information, as individual elements contribute little to the overall semantics. We adopt the concept of schemas as sequences of events that all describe a single process, connected through shared participants, and extend it to allow for multiple schemas in a document. Segmentation of event sequences into schemas is approached by modeling event sequences on tasks such as the narrative cloze task, the prediction of missing events in a sequence. We propose building on sequences of event embeddings to form schema embeddings, thereby summarizing sections of documents using a single representation. This approach will allow for comparisons between different sections of documents and entire literary works. Literature is a challenging domain given its variety of genres, yet the representation of literary content has received relatively little attention.
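A minimal sketch of the final step as described: summarizing a section's sequence of event embeddings into one schema embedding. Mean pooling is assumed here purely for illustration; the abstract does not commit to a pooling method.

```python
import numpy as np

def schema_embedding(event_embeddings):
    """Collapse a section's event embeddings into a single schema vector."""
    return np.mean(np.stack(event_embeddings), axis=0)

events = [np.random.rand(128) for _ in range(6)]  # one embedding per extracted event
section_repr = schema_embedding(events)           # 128-d schema representation
```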
