This paper measures similarity both within and between 84 language varieties across nine languages. The corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. Using frequency-based corpus similarity measures, the paper shows that agreement between these sources is consistent. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
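As a rough illustration of what a frequency-based corpus similarity measure can look like, the sketch below compares two tokenized corpora by the Spearman rank correlation of their word frequencies. The abstract does not specify the exact measure, feature type, or feature count, so the choice of Spearman correlation over the most frequent words, the function name `corpus_similarity`, and the `top_n` parameter are all assumptions made for this example only.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when it is undefined."""
    n = len(xs)
    if n == 0:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def corpus_similarity(tokens_a, tokens_b, top_n=1000):
    """Spearman correlation of word frequencies (hypothetical helper).

    Features are the top_n most frequent words in the two corpora
    combined; this feature choice is an assumption, not the paper's.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    features = [word for word, _ in (freq_a + freq_b).most_common(top_n)]
    ranks_a = average_ranks([freq_a[w] for w in features])
    ranks_b = average_ranks([freq_b[w] for w in features])
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(ranks_a, ranks_b)
```

Under these assumptions, the stability check described above would amount to computing this score between the web and tweet corpora for each country and comparing the scores across countries and languages.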
References used
https://aclanthology.org/