Research papers, master and doctoral theses about mixed

Navigating the Kaleidoscope of COVID-19 Misinformation Using Deep Learning

404 - Association for Computation Linguistics 2021 مقالة

Irrespective of the success of the deep learning-based mixed-domain transfer learning approach for solving various Natural Language Processing tasks, it does not lend a generalizable solution for detecting misinformation from COVID-19 social media da ta. Due to the inherent complexity of this type of data, caused by its dynamic (context evolves rapidly), nuanced (misinformation types are often ambiguous), and diverse (skewed, fine-grained, and overlapping categories) nature, it is imperative for an effective model to capture both the local and global context of the target domain. By conducting a systematic investigation, we show that: (i) the deep Transformer-based pre-trained models, utilized via the mixed-domain transfer learning, are only good at capturing the local context, thus exhibits poor generalization, and (ii) a combination of shallow network-based domain-specific models and convolutional neural networks can efficiently extract local as well as global context directly from the target data in a hierarchical fashion, enabling it to offer a more generalizable solution.

كلمة توليد مشكلة navigating the kaleidoscope mixed-domain transfer learning التنقل في المشكال التعلم نقل المجال المختلط صناعة حمض الفوسفور

HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text

326 - Association for Computation Linguistics 2021 مقالة

Text generation is a highly active area of research in the computational linguistic community. The evaluation of the generated text is a challenging task and multiple theories and metrics have been proposed over the years. Unfortunately, text generat ion and evaluation are relatively understudied due to the scarcity of high-quality resources in code-mixed languages where the words and phrases from multiple languages are mixed in a single utterance of text and speech. To address this challenge, we present a corpus (HinGE) for a widely popular code-mixed language Hinglish (code-mixing of Hindi and English languages). HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences. In addition, we demonstrate the in- efficacy of widely-used evaluation metrics on the code-mixed data. The HinGE dataset will facilitate the progress of natural language generation research in code-mixed languages.

المتغيرات المستهدفة code-mixed hinglish text النص مختلط النص Hinglish صناعة حمض الفوسفور

``Something Something Hota Hai!'' An Explainable Approach towards Sentiment Analysis on Indian Code-Mixed Data

261 - Association for Computation Linguistics 2021 مقالة

The increasing use of social media sites in countries like India has given rise to large volumes of code-mixed data. Sentiment analysis of this data can provide integral insights into people's perspectives and opinions. Code-mixed data is often noisy in nature due to multiple spellings for the same word, lack of definite order of words in a sentence, and random abbreviations. Thus, working with code-mixed data is more challenging than monolingual data. Interpreting a model's predictions allows us to determine the robustness of the model against different forms of noise. In this paper, we propose a methodology to integrate explainable approaches into code-mixed sentiment analysis. By interpreting the predictions of sentiment analysis models we evaluate how well the model is able to adapt to the implicit noises present in code-mixed data.

hota hai indian code-mixed data explainable approach هوتا هاي البيانات المزدجة التعليمية الهندي النهج القادم صناعة حمض الفوسفور المزيد..

Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification

224 - Association for Computation Linguistics 2021 مقالة

Recent multilingual pre-trained models, like XLM-RoBERTa (XLM-R), have been demonstrated effective in many cross-lingual tasks. However, there are still gaps between the contextualized representations of similar words in different languages. To solve this problem, we propose a novel framework named Multi-View Mixed Language Training (MVMLT), which leverages code-switched data with multi-view learning to fine-tune XLM-R. MVMLT uses gradient-based saliency to extract keywords which are the most relevant to downstream tasks and replaces them with the corresponding words in the target language dynamically. Furthermore, MVMLT utilizes multi-view learning to encourage contextualized embeddings to align into a more refined language-invariant space. Extensive experiments with four languages show that our model achieves state-of-the-art results on zero-shot cross-lingual sentiment classification and dialogue state tracking tasks, demonstrating the effectiveness of our proposed model.

mixed language training saliency-based multi-view mixed multi-view mixed language تدريب لغة مختلطة متعددة مقرها متعدد الألوان مختلطة لغة مختلطة متعددة صناعة حمض الفوسفور المزيد..

MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation

452 - Association for Computation Linguistics 2021 مقالة

Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text. Due to the high linguistic diversity, code-mixing presents several challenges in evaluating standard natural language generat ion (NLG) tasks. Various widely popular metrics perform poorly with the code-mixed NLG tasks. To address this challenge, we present a metric in- dependent evaluation pipeline MIPE that significantly improves the correlation between evaluation metrics and human judgments on the generated code-mixed text. As a use case, we demonstrate the performance of MIPE on the machine-generated Hinglish (code-mixing of Hindi and English languages) sentences from the HinGE corpus. We can extend the proposed evaluation strategy to other code-mixed language pairs, NLG tasks, and evaluation metrics with minimal to no effort.

effective code-mixed nlg metric independent pipeline independent pipeline كود فعال مختلط NLG خط أنابيب مستقلة متري خط أنابيب مستقلة صناعة حمض الفوسفور المزيد..

Towards Code-Mixed Hinglish Dialogue Generation

248 - Association for Computation Linguistics 2021 مقالة

Code-mixed language plays a crucial role in communication in multilingual societies. Though the recent growth of web users has greatly boosted the use of such mixed languages, the current generation of dialog systems is primarily monolingual. This in crease in usage of code-mixed language has prompted dialog systems in a similar language. We present our work in Code-Mixed Dialog Generation, an unexplored task in code-mixed languages, generating utterances in code-mixed language rather than a single language that is more often just English. We present a new synthetic corpus in code-mix for dialogs, CM-DailyDialog, by converting an existing English-only dialog corpus to a mixed Hindi-English corpus. We then propose a baseline approach where we show the effectiveness of using mBART like multilingual sequence-to-sequence transformers for code-mixed dialog generation. Our best performing dialog models can conduct coherent conversations in Hindi-English mixed language as evaluated by human and automatic metrics setting new benchmarks for the Code-Mixed Dialog Generation task.

hinglish dialogue generation code-mixed hinglish dialogue hinglish dialogue توليد الحوار الهندي حوار هانشروي مختلط الحوار الهندي صناعة حمض الفوسفور المزيد..

Corpus Creation and Language Identification in Low-Resource Code-Mixed Telugu-English Text

166 - Association for Computation Linguistics 2021 مقالة

Code-Mixing (CM) is a common phenomenon in multilingual societies. CM plays a significant role in technology and medical fields where terminologies in the native language are not available or known. Language Identification (LID) of the CM data will h elp solve NLP tasks such as Spell Checking, Named Entity Recognition, Part-Of-Speech tagging, and Semantic Parsing. In the current era of machine learning, a common problem to the above-mentioned tasks is the availability of Learning data to train models. In this paper, we introduce two Telugu-English CM manually annotated datasets (Twitter dataset and Blog dataset). The Twitter dataset contains more romanization variability and misspelled words than the blog dataset. We compare across various classification models and perform extensive bench-marking using both Classical and Deep Learning Models for LID compared to existing models. We propose two architectures for language classification (Telugu and English) in CM data: (1) Word Level Classification (2) Sentence Level word-by-word Classification and compare these approaches presenting two strong baselines for LID on these datasets.

code-mixed telugu-english text corpus creation low-resource code-mixed telugu-english نص خلط رمز التيلجو إنشاء كوربوس Low-Resource Code-Mixed Telugu English صناعة حمض الفوسفور المزيد..

Classification of Code-Mixed Text Using Capsule Networks

281 - Association for Computation Linguistics 2021 مقالة

A major challenge in analysing social me-dia data belonging to languages that use non-English script is its code-mixed nature. Recentresearch has presented state-of-the-art contex-tual embedding models (both monolingual s.a.BERT and multilingual s.a. XLM-R) as apromising approach. In this paper, we showthat the performance of such embedding mod-els depends on multiple factors, such as thelevel of code-mixing in the dataset, and thesize of the training dataset. We empiricallyshow that a newly introduced Capsule+biGRUclassifier could outperform a classifier built onthe English-BERT as well as XLM-R just witha training dataset of about 6500 samples forthe Sinhala-English code-mixed data.

capsule networks code-mixed text text using capsule شبكات كبسولة النص المختلط النص باستخدام كبسولة صناعة حمض الفوسفور المزيد..

Neural Machine Translation for Sinhala-English Code-Mixed Text

205 - Association for Computation Linguistics 2021 مقالة

Code-mixing has become a moving method of communication among multilingual speakers. Most of the social media content of the multilingual societies are written in code-mixed text. However, most of the current translation systems neglect to convert co de-mixed texts to a standard language. Most of the user written code-mixed content in social media remains unprocessed due to the unavailability of linguistic resource such as parallel corpus. This paper proposes a Neural Machine Translation(NMT) model to translate the Sinhala-English code-mixed text to the Sinhala language. Due to the limited resources available for Sinhala-English code-mixed(SECM) text, a parallel corpus is created with SECM sentences and Sinhala sentences. Srilankan social media sites contain SECM texts more frequently than the standard languages. The model proposed for code-mixed text translation in this study is a combination of Encoder-Decoder framework with LSTM units and Teachers Forcing Algorithm. The translated sentences from the model are evaluated using BLEU(Bilingual Evaluation Understudy) metric. Our model achieved a remarkable BLEU score for the translation.

الخطأ النحوي التلقائي sinhala-english code-mixed text Sinhala-English النص المختلط صناعة حمض الفوسفور

A Pre-trained Transformer and CNN Model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed Social-Media Text

263 - Association for Computation Linguistics 2021 مقالة

Code-mixing (CM) is a frequently observed phenomenon that uses multiple languages in an utterance or sentence. There are no strict grammatical constraints observed in code-mixing, and it consists of non-standard variations of spelling. The linguistic complexity resulting from the above factors made the computational analysis of the code-mixed language a challenging task. Language identification (LI) and part of speech (POS) tagging are the fundamental steps that help analyze the structure of the code-mixed text. Often, the LI and POS tagging tasks are interdependent in the code-mixing scenario. We project the problem of dealing with multilingualism and grammatical structure while analyzing the code-mixed sentence as a joint learning task. In this paper, we jointly train and optimize language detection and part of speech tagging models in the code-mixed scenario. We used a Transformer with convolutional neural network architecture. We train a joint learning method by combining POS tagging and LI models on code-mixed social media text obtained from the ICON shared task.

pre-trained transformer cnn model code-mixed social-media text محول مدرب مسبقا نموذج سي إن إن نص البيانات الاجتماعية المختلطة صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد