NorDial: A Preliminary Corpus of Written Norwegian Dialect Use

75 0 0.0 ( 0 )

Download Cite

Added by Jeremy Barnes

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jeremy Barnes - Petter M{ae}hlum - Samia Touileb

Computation and Language

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Norway has a large amount of dialectal variation, as well as a general tolerance to its use in the public sphere. There are, however, few available resources to study this variation and its change over time and in more informal areas, eg on social media. In this paper, we propose a first step to creating a corpus of dialectal variation of written Norwegian. We collect a small corpus of tweets and manually annotate them as Bokm{aa}l, Nynorsk, any dialect, or a mix. We further perform preliminary experiments with state-of-the-art models, as well as an analysis of the data to expand this corpus in the future. Finally, we make the annotations and models available for future work.

rate research

The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus

58 - Samia Touileb , Jeremy Barnes 2021

Recent years have seen a rise in interest for cross-lingual transfer between languages with similar typology, and between languages of various scripts. However, the interplay between language similarity and difference in script on cross-lingual transfer is a less studied problem. We explore this interplay on cross-lingual transfer for two supervised tasks, namely part-of-speech tagging and sentiment analysis. We introduce a newly annotated corpus of Algerian user-generated comments comprising parallel annotations of Algerian written in Latin, Arabic, and code-switched scripts, as well as annotations for sentiment and topic categories. We perform baseline experiments by fine-tuning multi-lingual language models. We further explore the effect of script vs. language similarity in cross-lingual transfer by fine-tuning multi-lingual models on languages which are a) typologically distinct, but use the same script, b) typologically similar, but use a distinct script, or c) are typologically similar and use the same script. We find there is a delicate relationship between script and typology for part-of-speech, while sentiment analysis is less sensitive.

Computation and Language

Sexism detection: The first corpus in Algerian dialect with a code-switching in Arabic/ French and English

134 - Imane Guellil , Ahsan Adeel , Faical Azouaou 2021

In this paper, an approach for hate speech detection against women in Arabic community on social media (e.g. Youtube) is proposed. In the literature, similar works have been presented for other languages such as English. However, to the best of our knowledge, not much work has been conducted in the Arabic language. A new hate speech corpus (Arabic_fr_en) is developed using three different annotators. For corpus validation, three different machine learning algorithms are used, including deep Convolutional Neural Network (CNN), long short-term memory (LSTM) network and Bi-directional LSTM (Bi-LSTM) network. Simulation results demonstrate the best performance of the CNN model, which achieved F1-score up to 86% for the unbalanced corpus as compared to LSTM and Bi-LSTM.

Computation and Language

Whats in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus

82 - Alexandra Sasha Luccioni , Joseph D. Viviano 2021

Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures. We discuss the potential impacts of this content on language models and conclude with future research directions and a more mindful approach to corpus collection and analysis.

Computation and Language

Large-Scale Contextualised Language Modelling for Norwegian

73 - Andrey Kutuzov , Jeremy Barnes , Erik Velldal 2021

We present the ongoing NorLM initiative to support the creation and use of very large contextualised language models for Norwegian (and in principle other Nordic languages), including a ready-to-use software environment, as well as an experience report for data preparation and training. This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks. In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian. For additional background and access to the data, models, and software, please see http://norlm.nlpl.eu

Computation and Language

Learning to Recognize Dialect Features

75 - Dorottya Demszky , Devyani Sharma , Jonathan H. Clark 2020

Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in He {} running. In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.

Computation and Language