
The Corpora They Are a-Changing: a Case Study in Italian Newspapers



Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English


Abstract

The use of automatic methods for the study of lexical semantic change (LSC) has led to the creation of evaluation benchmarks. Benchmark datasets, however, are intimately tied to the corpus used for their creation, which calls into question their reliability as well as the robustness of automatic methods. This contribution investigates these aspects, showing the impact of unforeseen social and cultural dimensions. We also identify a set of additional issues (OCR quality, named entities) that affect the performance of the automatic methods, especially when used to discover LSC.

References used

https://aclanthology.org/


Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

187 - Association for Computational Linguistics 2021 (article)

Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.

documenting large webtext, large webtext corpora, clean crawled corpus

The Use of Corpora in an Interdisciplinary Approach to Localization

344 - Association for Computational Linguistics 2021 (article)

Translation Studies, and more specifically its subfield Descriptive Translation Studies [Holmes 1988/2000], is, according to many scholars [Gambier, 2009; Nenopoulou, 2007; Munday, 2001/2008; Hermans, 1999; Snell-Hornby et al., 1994, etc.], a highly interdisciplinary field of study. The aim of the present paper is to describe the role of polysemiotic corpora in the study of university website localization from a multidisciplinary perspective. More specifically, the paper gives an overview of ongoing postdoctoral research on the identity formation of Greek university websites on the web, focusing on the methodology adopted for corpora compilation, which draws on methodological tools and concepts from various fields such as Translation Studies, social semiotics, cultural studies, critical discourse analysis and marketing. The objects of comparative analysis are Greek and French original and translated (into English) university websites as well as original British and American university website versions. Research findings so far have shown that polysemiotic corpora can be a valuable tool for both quantitative and qualitative analysis of website localization, for scholars as well as translation professionals working with multimodal genres.

interdisciplinary approach, descriptive translation studies, translation studies

Evaluate the efficiency of the BIM in managing design changes: A Case Study

1696 - Al-Baath University 2016 (research paper)

This paper describes a case study in which we documented six examples of design changes. These changes are then analyzed in two directions: the first assesses the efficiency of BIM tools in handling design changes and identifies their weaknesses, while the second analyzes the characteristics of these changes within the BIM information model and presents them in the modeling of the change process.

Change Management, Design Changes, Building Information Models (BIM)

A Case Study of In-House Competition for Ranking Constructive Comments in a News Service

151 - Association for Computational Linguistics 2021 (article)

Ranking the user comments posted on a news article is important for online news services because comment visibility directly affects the user experience. Research on ranking comments with different metrics to measure comment quality has shown that "constructiveness", as used in argument analysis, is promising from a practical standpoint. In this paper, we report a case study in which this constructiveness is examined in the real world. Specifically, we examine an in-house competition to improve the performance of ranking constructive comments and demonstrate the effectiveness of the best obtained model for a commercial service.

ranking constructive comments, ranking constructive

Crosslingual Embeddings are Essential in UNMT for distant languages: An English to IndoAryan Case Study

316 - Association for Computational Linguistics 2021 (article)

Recent advances in Unsupervised Neural Machine Translation (UNMT) have minimized the gap between supervised and unsupervised machine translation performance for closely related language pairs. However, the situation is very different for distant language pairs. Lack of overlap in lexicon and low syntactic similarity, such as between English and IndoAryan languages, lead to poor translation quality in existing UNMT systems. In this paper, we show that initialising the embedding layer of UNMT models with cross-lingual embeddings leads to significant BLEU score improvements over existing UNMT models where the embedding layer weights are randomly initialized. Further, freezing the embedding layer weights leads to better gains compared to updating the embedding layer weights during training. We experimented using Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs. The proposed cross-lingual embedding initialization yields BLEU score improvements of as much as ten times over the baseline for English-Hindi, English-Bengali, and English-Gujarati. Our analysis shows that initialising the embedding layer with a static cross-lingual embedding mapping is essential for training UNMT models for distant language pairs.

indoaryan case study
