Research papers, master's and doctoral theses about low resources

Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining

175 - Association for Computational Linguistics 2021 article

With the rapid increase in the volume of dialogue data from daily life, there is a growing demand for dialogue summarization. Unfortunately, training a large summarization model is generally infeasible due to the inadequacy of dialogue data with annotated summaries. Most existing works for low-resource dialogue summarization directly pretrain models in other domains, e.g., the news domain, but they generally neglect the huge difference between dialogues and conventional articles. To bridge the gap between out-of-domain pretraining and in-domain fine-tuning, in this work, we propose a multi-source pretraining paradigm to better leverage the external summary data. Specifically, we exploit large-scale in-domain non-summary data to separately pretrain the dialogue encoder and the summary decoder. The combined encoder-decoder model is then pretrained on the out-of-domain summary data using adversarial critics, aiming to facilitate domain-agnostic summarization. The experimental results on two public datasets show that with only limited training data, our approach achieves competitive performance and generalizes well in different dialogue scenarios.


Unsupervised Paraphrasing Consistency Training for Low Resource Named Entity Recognition

201 - Association for Computational Linguistics 2021 article

Unsupervised consistency training is a way of semi-supervised learning that encourages consistency in model predictions between the original and augmented data. For Named Entity Recognition (NER), existing approaches augment the input sequence with token replacement, assuming annotations on the replaced positions unchanged. In this paper, we explore the use of paraphrasing as a more principled data augmentation scheme for NER unsupervised consistency training. Specifically, we convert Conditional Random Field (CRF) into a multi-label classification module and encourage consistency on the entity appearance between the original and paraphrased sequences. Experiments show that our method is especially effective when annotations are limited.
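
The idea of encouraging consistency on entity appearance can be illustrated with a small sketch. This is not the authors' implementation: it assumes independent per-token probabilities in place of CRF marginals, and the function names and the squared-difference loss are illustrative choices. Because a sentence and its paraphrase may differ in length, the comparison is made on whether each entity type appears at all, not token by token.

```python
from math import prod


def appearance_probs(token_probs, entity_types):
    """Probability that each entity type appears at least once in a sequence.

    token_probs is a list with one dict per token, mapping an entity type to
    the (assumed independent) probability that the token carries that type.
    """
    return {
        t: 1.0 - prod(1.0 - tok.get(t, 0.0) for tok in token_probs)
        for t in entity_types
    }


def consistency_loss(orig_token_probs, para_token_probs, entity_types):
    """Mean squared difference between sentence-level appearance probabilities.

    The squared difference is an assumption made for this sketch; other
    divergences could be substituted.
    """
    p = appearance_probs(orig_token_probs, entity_types)
    q = appearance_probs(para_token_probs, entity_types)
    return sum((p[t] - q[t]) ** 2 for t in entity_types) / len(entity_types)


# Hypothetical predictions for a 3-token sentence and a 2-token paraphrase.
original = [{"PER": 0.9}, {"LOC": 0.1}, {}]
paraphrase = [{"PER": 0.6}, {"LOC": 0.1}]
loss = consistency_loss(original, paraphrase, ["PER", "LOC"])
```

The loss is zero when both sequences predict the same entity types with the same confidence, so minimizing it on unlabeled sentence and paraphrase pairs pushes the model toward consistent predictions.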


Language Model Pretraining and Transfer Learning for Very Low Resource Languages

230 - Association for Computational Linguistics 2021 article

This paper describes our submission for the shared task on Unsupervised MT and Very Low Resource Supervised MT at WMT 2021. We submitted systems for two language pairs: German ↔ Upper Sorbian (de ↔ hsb) and German ↔ Lower Sorbian (de ↔ dsb). For de ↔ hsb, we pretrain our system using the MASS (Masked Sequence to Sequence) objective and then finetune using iterative back-translation. Final finetuning is performed using the parallel data provided for the translation objective. For de ↔ dsb, no parallel data is provided in the task, so we use the final de ↔ hsb model to initialize the de ↔ dsb model and train it further using iterative back-translation, with the same vocabulary as the de ↔ hsb model.
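
The data flow of iterative back-translation can be sketched as follows. This is only an illustration under strong assumptions: `ToyTranslator` is a hypothetical stand-in that memorizes sentence pairs, not a neural model, and the sentences are placeholders. In each round, each direction is trained on synthetic inputs produced by the opposite direction, paired with real monolingual outputs.

```python
class ToyTranslator:
    """Hypothetical stand-in for an NMT model: it memorizes training pairs
    and copies unknown sentences through unchanged."""

    def __init__(self, pairs=None):
        self.table = dict(pairs or {})

    def train(self, pairs):
        # pairs maps an input sentence to its reference output.
        self.table.update(pairs)

    def translate(self, sentence):
        return self.table.get(sentence, sentence)


def iterative_back_translation(src2tgt, tgt2src, mono_src, mono_tgt, rounds=2):
    """Alternate between the two directions for a fixed number of rounds."""
    for _ in range(rounds):
        # Back-translate real target sentences into synthetic sources, then
        # train src->tgt on (synthetic source, real target) pairs.
        src2tgt.train({tgt2src.translate(t): t for t in mono_tgt})
        # Mirror step: train tgt->src on (synthetic target, real source).
        tgt2src.train({src2tgt.translate(s): s for s in mono_src})
    return src2tgt, tgt2src


# Placeholder sentences; tgt2src starts with one known pair (its "initialization").
forward = ToyTranslator()
backward = ToyTranslator({"T1": "S1"})
forward, backward = iterative_back_translation(forward, backward, ["S1"], ["T1"])
```

The key design point is that the output side of every training pair is genuine monolingual text, so errors in the synthetic input side do not degrade the fluency of what each model learns to produce.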


Automatic Sentence Simplification in Low Resource Settings for Urdu

158 - Association for Computational Linguistics 2021 article

Building automated simplification systems starts with corpora of complex sentences and their simplified versions, which make it possible to understand sentence complexity and to develop automatic text simplification systems. We present a lexically and syntactically simplified Urdu simplification corpus with a detailed analysis of the various simplification operations and human evaluation of corpus quality. We further analyze our corpora using text readability measures and present a comparison of the original, lexically simplified and syntactically simplified corpora. In addition, we compare our corpus with other existing simplification corpora by building simplification systems and evaluating these systems using BLEU and SARI scores. Our system achieves the highest BLEU score and a comparable SARI score in comparison to other systems. We release our simplification corpora for the benefit of the research community.


Data augmentation for low-resource grapheme-to-phoneme mapping

414 - Association for Computational Linguistics 2021 article

In this paper we explore a very simple neural approach to mapping orthography to phonetic transcription in a low-resource context. The basic idea is to start from a baseline system and focus all efforts on data augmentation. We will see that some techniques work, but others do not.


NICT-2 Translation System at WAT-2021: Applying a Pretrained Multilingual Encoder-Decoder Model to Low-resource Language Pairs

80 - Association for Computational Linguistics 2021 article

In this paper, we present the NICT system (NICT-2) submitted to the NICT-SAP shared task at the 8th Workshop on Asian Translation (WAT-2021). A feature of our system is that we used a pretrained multilingual BART (Bidirectional and Auto-Regressive Transformer; mBART) model. Because publicly available models do not support some languages in the NICT-SAP task, we added these languages to the mBART model and then trained it using monolingual corpora extracted from Wikipedia. We fine-tuned the expanded mBART model using the parallel corpora specified by the NICT-SAP task. The BLEU scores greatly improved in comparison with those of systems without the pretrained model, including the additional languages.


Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages

399 - Association for Computational Linguistics 2021 article

Neural Machine Translation (NMT) for Low Resource Languages (LRL) is often limited by the lack of available training data, making it necessary to explore additional techniques to improve translation quality. We propose the use of the Prefix-Root-Postfix-Encoding (PRPE) subword segmentation algorithm to improve translation quality for LRLs, using two agglutinative languages as case studies: Quechua and Indonesian. During the course of our experiments, we reintroduce a parallel corpus for Quechua-Spanish translation that was previously unavailable for NMT. Our experiments show the importance of appropriate subword segmentation, which can go as far as improving translation quality over systems trained on much larger quantities of data. We show this by achieving state-of-the-art results for both languages, obtaining higher BLEU scores than large pre-trained models with much smaller amounts of data.
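
To make the idea of prefix-root-postfix segmentation concrete, here is a deliberately simplified sketch. It is not the PRPE algorithm itself: the affix lists are hand-written assumptions for illustration, and the longest-match rule and the `min_root` parameter are choices made for this example only.

```python
def segment(word, prefixes, postfixes, min_root=3):
    """Split a word into [prefix, root, postfix] using longest-match affixes.

    An affix is stripped only if at least min_root characters remain, so
    short words are left whole. Affix lists are supplied by the caller.
    """
    prefix = next(
        (p for p in sorted(prefixes, key=len, reverse=True)
         if word.startswith(p) and len(word) - len(p) >= min_root),
        "",
    )
    rest = word[len(prefix):]
    postfix = next(
        (s for s in sorted(postfixes, key=len, reverse=True)
         if rest.endswith(s) and len(rest) - len(s) >= min_root),
        "",
    )
    root = rest[:len(rest) - len(postfix)]
    # Drop empty parts so unaffixed words come back as a single segment.
    return [part for part in (prefix, root, postfix) if part]


# Hypothetical, hand-picked Indonesian affix lists for illustration only.
PREFIXES = ["me", "mem", "di"]
POSTFIXES = ["kan", "an", "i"]

pieces = segment("membacakan", PREFIXES, POSTFIXES)
```

With these assumed lists, `membacakan` is split into `mem`, `baca`, and `kan`, so the root `baca` is shared with other inflected forms instead of being scattered across unrelated subword units.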


Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text

377 - Association for Computational Linguistics 2021 article

In this shared task, we ask participating teams to investigate the factors influencing the quality of code-mixed text generation systems. We synthetically generate code-mixed Hinglish sentences using two distinct approaches and employ human annotators to rate the generation quality. We propose two subtasks on the synthetic Hinglish dataset: quality rating prediction and annotators' disagreement prediction. The proposed subtasks will put forward the reasoning and explanation of the factors influencing the quality and human perception of the code-mixed text.


A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

488 - Association for Computational Linguistics 2021 article

Deep neural networks and huge language models are becoming omnipresent in natural language applications. As they are known for requiring large amounts of training data, there is a growing body of work to improve the performance in low-resource settings. Motivated by the recent fundamental changes towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing. After a discussion about the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data like data augmentation and distant supervision as well as transfer learning settings that reduce the need for target supervision. A goal of our survey is to explain how these methods differ in their requirements as understanding them is essential for choosing a technique suited for a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.


Creating and Evaluating Resources for Sentiment Analysis in the Low-resource Language: Sindhi

209 - Association for Computational Linguistics 2021 article

In this paper, we develop a Sindhi subjective lexicon by merging existing English resources (the NRC lexicon, a list of opinion words, and SentiWordNet) with a Sindhi-English bilingual dictionary and a collection of Sindhi modifiers. A positive or negative sentiment score is assigned to each Sindhi opinion word. Afterwards, we determine the coverage of the proposed lexicon with subjectivity analysis. Moreover, we crawl a multi-domain tweet corpus covering news, sports, and finance. The crawled corpus is annotated by experienced annotators using the Doccano text annotation tool. The sentiment-annotated corpus is evaluated by employing a support vector machine (SVM), recurrent neural network (RNN) variants, and a convolutional neural network (CNN).

