Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Added by: Association for Computational Linguistics (article)

Publication date: 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Keywords: documenting large webtext, large webtext corpora, clean crawled corpus

Abstract

Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.
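The blocklist analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes a document-level filter that drops any document containing at least one blocklisted word, and it compares removal rates across groups of documents. The function names, the placeholder blocklist, the group labels, and the toy documents are all hypothetical; they are not taken from the paper or from the C4 pipeline.

```python
import re
from collections import Counter


def is_blocked(text, blocklist):
    """Return True if any blocklisted word appears as a whole token in text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return not tokens.isdisjoint(blocklist)


def removal_rate_by_group(docs, blocklist):
    """Compute the fraction of documents removed for each group label.

    docs is an iterable of (group_label, text) pairs.
    """
    total, removed = Counter(), Counter()
    for group, text in docs:
        total[group] += 1
        if is_blocked(text, blocklist):
            removed[group] += 1
    return {group: removed[group] / total[group] for group in total}


# Hypothetical placeholder data, for illustration only.
BLOCKLIST = {"badword"}
DOCS = [
    ("source_a", "A plain page about gardening."),
    ("source_a", "This page quotes a badword while discussing it."),
    ("source_b", "A plain page about cooking."),
    ("source_b", "Another plain page about travel."),
]

rates = removal_rate_by_group(DOCS, BLOCKLIST)
print(rates)
```

A filter of this kind removes a whole document whenever a listed word appears, regardless of context, so a page that merely quotes or discusses a term is dropped along with pages that use it abusively. Comparing per-group removal rates is one simple way to check whether a filter affects some sources more than others.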

References used

https://aclanthology.org/

The Corpora They Are a-Changing: a Case Study in Italian Newspapers

Association for Computational Linguistics, 2021 (article)

The use of automatic methods for the study of lexical semantic change (LSC) has led to the creation of evaluation benchmarks. Benchmark datasets, however, are intimately tied to the corpus used for their creation, which calls into question their reliability as well as the robustness of automatic methods. This contribution investigates these aspects, showing the impact of unforeseen social and cultural dimensions. We also identify a set of additional issues (OCR quality, named entities) that impact the performance of the automatic methods, especially when used to discover LSC.

Keywords: Italian newspapers, study in Italian

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

Association for Computational Linguistics, 2021 (article)

Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks. However, these models often have billions of parameters, and thus are too resource-hungry and computation-intensive to suit low-capability devices or applications with strict latency requirements. One potential remedy for this is model compression, which has attracted considerable research attention. Here, we summarize the research in compressing Transformers, focusing on the especially popular BERT model. In particular, we survey the state of the art in compression for BERT, we clarify the current best practices for compressing large-scale Transformer models, and we provide insights into the workings of various methods. Our categorization and analysis also shed light on promising future research directions for achieving lightweight, accurate, and generic NLP models.

Keywords: future language models, pre-trained transformer-based

Synthetic Examples Improve Cross-Target Generalization: A Study on Stance Detection on a Twitter corpus.

Association for Computational Linguistics, 2021 (article)

Cross-target generalization is a known problem in stance detection (SD), where systems tend to perform poorly when exposed to targets unseen during training. Given that data annotation is expensive and time-consuming, finding ways to leverage abundant unlabeled in-domain data can offer great benefits. In this paper, we apply a weakly supervised framework to enhance cross-target generalization through synthetically annotated data. We focus on Twitter SD and show experimentally that integrating synthetic data is helpful for cross-target generalization, leading to significant improvements in performance, with gains in F1 scores ranging from +3.4 to +5.1.

Keywords: improve cross-target generalization, improve cross-target

Classification of hierarchical text using geometric deep learning: the case of clinical trials corpus

Association for Computational Linguistics, 2021 (article)

We consider the hierarchical representation of documents as graphs and use geometric deep learning to classify them into different categories. While graph neural networks can efficiently handle the variable structure of hierarchical documents using the permutation-invariant message passing operations, we show that we can gain extra performance improvements using our proposed selective graph pooling operation, which arises from the fact that some parts of the hierarchy are invariable across different documents. We applied our model to classify clinical trial (CT) protocols into completed and terminated categories. We use bag-of-words-based as well as pre-trained transformer-based embeddings to featurize the graph nodes, achieving F1 scores around 0.85 on a publicly available large-scale CT registry of around 360K protocols. We further demonstrate how the selective pooling can add insights into the CT termination status prediction. We make the source code and dataset splits accessible.

Keywords: geometric deep learning, clinical trials corpus, geometric deep

Bilingual Terminology Extraction Using Neural Word Embeddings on Comparable Corpora

Association for Computational Linguistics, 2021 (article)

Term and glossary management are vital steps in the preparation of every language specialist, and they play a very important role in the education of translation professionals. The growing trend of efficient time management and the constant time constraints we may observe in every job sector increase the necessity of automatic glossary compilation. Many well-performing bilingual AET systems are based on processing parallel data; however, such parallel corpora are not always available for a specific domain or a language pair. Domain-specific, bilingual access to information and its retrieval based on comparable corpora is a very promising area of research that requires a detailed analysis of both available data sources and the possible extraction techniques. This work focuses on domain-specific automatic terminology extraction from comparable corpora for the English-Russian language pair by utilizing neural word embeddings.

Keywords: neural word embeddings, comparable corpora, neural word
