Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

تمثيلات الأصناف اللغوية موثوقة بإجراء تدابير تشابه Corpus

455 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.

References used

https://aclanthology.org/

rate research

Measuring Biases of Word Embeddings: What Similarity Measures and Descriptive Statistics to Use?

727 - Association for Computation Linguistics 2021 مقالة

Word embeddings are widely used in Natural Language Processing (NLP) for a vast range of applications. However, it has been consistently proven that these embeddings reflect the same human biases that exist in the data used to train them. Most of the introduced bias indicators to reveal word embeddings' bias are average-based indicators based on the cosine similarity measure. In this study, we examine the impacts of different similarity measures as well as other descriptive techniques than averaging in measuring the biases of contextual and non-contextual word embeddings. We show that the extent of revealed biases in word embeddings depends on the descriptive statistics and similarity measures used to measure the bias. We found that over the ten categories of word embedding association tests, Mahalanobis distance reveals the smallest bias, and Euclidean distance reveals the largest bias in word embeddings. In addition, the contextual models reveal less severe biases than the non-contextual word embedding models.

متطلبات المراجعة كلمة embeddings. كلمة صناعة حمض الفوسفور

Inducing Language-Agnostic Multilingual Representations

470 - Association for Computation Linguistics 2021 مقالة

Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world. However, they currently require large pretraining corpora or access to typologically similar languages. In this work, we address these obstacles by removing language identity signals from multilingual embeddings. We examine three approaches for this: (i) re-aligning the vector spaces of target languages (all together) to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. We evaluate on XNLI and reference-free MT evaluation across 19 typologically diverse languages. Our findings expose the limitations of these approaches---unlike vector normalization, vector space re-alignment and text normalization do not achieve consistent gains across encoders and languages. Due to the approaches' additive effects, their combination decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2 points (XLM-R) on average across all tasks and languages, however.

inducing language-agnostic multilingual language-agnostic multilingual representations inducing language-agnostic حث لغة متعددة اللغات متعددة اللغات التعليم متعدد اللغات متعددة اللغات تحفيز اللغة الأذرع صناعة حمض الفوسفور المزيد..

Tracing Source Language Interference in Translation with Graph-Isomorphism Measures

564 - Association for Computation Linguistics 2021 مقالة

Previous research has used linguistic features to show that translations exhibit traces of source language interference and that phylogenetic trees between languages can be reconstructed from the results of translations into the same language. Recent research has shown that instances of translationese (source language interference) can even be detected in embedding spaces, comparing embeddings spaces of original language data with embedding spaces resulting from translations into the same language, using a simple Eigenvector-based divergence from isomorphism measure. To date, it remains an open question whether alternative graph-isomorphism measures can produce better results. In this paper, we (i) explore Gromov-Hausdorff distance, (ii) present a novel spectral version of the Eigenvector-based method, and (iii) evaluate all approaches against a broad linguistic typological database (URIEL). We show that language distances resulting from our spectral isomorphism approaches can reproduce genetic trees on a par with previous work without requiring any explicit linguistic information and that the results can be extended to non-Indo-European languages. Finally, we show that the methods are robust under a variety of modeling conditions.

source language interference tracing source language source language مصدر لغة المصدر لغة المصدر تتبع لغة المصدر صناعة حمض الفوسفور المزيد..

On the Role of Corpus Ordering in Language Modeling

583 - Association for Computation Linguistics 2021 مقالة

Language models pretrained on vast corpora of unstructured text using self-supervised learning framework are used in numerous natural language understanding and generation tasks. Many studies show that language acquisition in humans follows a rather structured simple-to-complex pattern and guided by this intuition, curriculum learning, which enables training of computational models in a meaningful order, such as processing easy samples before hard ones, has been shown to potentially reduce training time. The question remains whether curriculum learning can benefit pretraining of language models. In this work, we perform comprehensive experiments involving multiple curricula strategies varying the criteria for complexity and the training schedules. Empirical results of training transformer language models on English corpus and evaluating it intrinsically as well as after fine-tuning across eight tasks from the GLUE benchmark, show consistent improvement gains over conventional vanilla training. Interestingly, in our experiments, when evaluated on one epoch, the best model following a document-level hard-to-easy curriculum, outperforms the vanilla model by 1.7 points (average GLUE score) and it takes the vanilla model twice as many training steps to reach comparable performance.

دراسة منهجية corpus ordering role of corpus ترتيب Corpus. دور الجثة صناعة حمض الفوسفور

How to Obtain Reliable Labels for MBTI Classification from Texts?

751 - Association for Computation Linguistics 2021 مقالة

Automatic detection of the Myers-Briggs Type Indicator (MBTI) from short posts attracted noticeable attention in the last few years. Recent studies showed that this is quite a difficult task, especially on commonly used Twitter data. Obtaining MBTI l abels is also difficult, as human annotation requires trained psychologists, and automatic way of obtaining them is through long questionnaires of questionable usability for the task. In this paper, we present a method for collecting reliable MBTI labels via only four carefully selected questions that can be applied to any type of textual data.

classification from texts obtain reliable labels mbti classification تصنيف من النصوص الحصول على ملصقات موثوقة MBTI تصنيف صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

تمثيلات الأصناف اللغوية موثوقة بإجراء تدابير تشابه Corpus

Ask ChatGPT about the research

Read More

suggested questions