HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

augmenting back-translation, hints, augmenting

Abstract

Back-translation (BT) of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT), especially for low-resource language pairs. To improve the effectiveness of the available BT data, we introduce HintedBT, a family of techniques that provide hints (through tags) to the encoder and decoder. First, we propose a novel method of using both high- and low-quality BT data by providing hints (as source tags on the encoder) to the model about the quality of each source-target pair. We do not filter out low-quality data but instead show that these hints enable the model to learn effectively from noisy data. Second, we address the problem of predicting whether a source token needs to be translated or transliterated to the target language, which is common in cross-script translation tasks (i.e., where source and target do not share the written script). For such cases, we propose training the model with additional hints (as target tags on the decoder) that provide information about the operation required on the source (translation, or both translation and transliteration). We conduct experiments and detailed analyses on standard WMT benchmarks for three cross-script low/medium-resource language pairs: Hindi-, Gujarati-, and Tamil-to-English. Our methods compare favorably with five strong and well-established baselines. We show that using these hints, both separately and together, significantly improves translation quality and leads to state-of-the-art performance in all three language pairs in the corresponding bilingual settings.
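The two kinds of hints described in the abstract can be illustrated with a minimal sketch. The abstract does not specify tag names, how quality is scored, how many quality bins are used, or how transliteration is detected, so everything below is an assumption made for illustration: the `<bin_k>`, `<txn>`, and `<both>` tag strings, the `BTPair.quality` score, the bin boundaries, and the toy transliteration lookup are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class BTPair:
    source: str     # back-translated (synthetic) source sentence
    target: str     # original monolingual target sentence
    quality: float  # hypothetical quality score in [0, 1]; higher is better


def quality_bin(score: float, boundaries: Sequence[float]) -> int:
    """Return the index of the quality bin (0 = lowest) for `score`."""
    return sum(score >= b for b in boundaries)


def needs_transliteration(source: str, target: str,
                          transliterate: Callable[[str], str]) -> bool:
    """True if any source token, once transliterated, occurs in the target."""
    target_tokens = {tok.lower() for tok in target.split()}
    return any(transliterate(tok).lower() in target_tokens
               for tok in source.split())


def add_hints(pair: BTPair, boundaries: Sequence[float],
              transliterate: Callable[[str], str]) -> Tuple[str, str]:
    """Prepend a quality tag to the source and an operation tag to the target."""
    # Source-side hint: which quality bin this synthetic pair belongs to.
    q_tag = f"<bin_{quality_bin(pair.quality, boundaries)}>"
    # Target-side hint: translation only, or translation plus transliteration.
    if needs_transliteration(pair.source, pair.target, transliterate):
        op_tag = "<both>"
    else:
        op_tag = "<txn>"
    return f"{q_tag} {pair.source}", f"{op_tag} {pair.target}"


# Toy stand-in for a real transliteration system (assumption, illustration only).
TOY_TABLE = {"दिल्ली": "Delhi"}


def toy_transliterate(token: str) -> str:
    return TOY_TABLE.get(token, token)


if __name__ == "__main__":
    pairs = [
        BTPair("मैं दिल्ली में रहता हूँ", "I live in Delhi", quality=0.9),
        BTPair("यह अच्छा है", "This is good", quality=0.3),
    ]
    for p in pairs:
        print(add_hints(p, boundaries=[0.5, 0.8], transliterate=toy_transliterate))
```

In this sketch no pair is discarded: a low-scoring pair simply receives a lower bin tag, which mirrors the abstract's point that noisy data is kept and labeled instead of filtered. In a real setup the tag strings would presumably need to be reserved as single tokens in the model vocabulary.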

References used

https://aclanthology.org/

Approaching SMM4H with auto-regressive language models and back-translation

238 - Association for Computational Linguistics, 2021 (article)

We describe our submissions to the 6th edition of the Social Media Mining for Health Applications (SMM4H) shared task. Our team (OGNLP) participated in the sub-task: Classification of tweets self-reporting potential cases of COVID-19 (Task 5). For our submissions, we employed systems based on auto-regressive transformer models (XLNet) and back-translation for balancing the dataset.

machine learning, health applications

Augmenting Knowledge-grounded Conversations with Sequential Knowledge Transition

471 - Association for Computational Linguistics, 2021 (article)

Knowledge data are massive and widespread in the real world and can serve as good external sources to enrich conversations. However, in knowledge-grounded conversations, current models still lack fine-grained control over knowledge selection and integration with dialogues, which leads to knowledge-irrelevant response generation problems: 1) knowledge selection relies merely on the dialogue context, ignoring the inherent knowledge transitions along with conversation flows; 2) the models often over-fit during training, resulting in incoherent responses that refer to unrelated tokens from specific knowledge content in the testing phase; 3) although the response is generated from the dialogue history and knowledge, the models often overlook the selected knowledge and hence generate knowledge-irrelevant responses. To address these problems, we propose to explicitly model the knowledge transition in sequential multi-turn conversations by abstracting knowledge into topic tags. In addition, to fully utilize the selected knowledge in the generative process, we propose pre-training a knowledge-aware response generator to pay more attention to the selected knowledge. In particular, a sequential knowledge transition model equipped with a pre-trained knowledge-aware response generator (SKT-KG) formulates the high-level knowledge transition and fully utilizes the limited knowledge data. Experimental results on both structured and unstructured knowledge-grounded dialogue benchmarks indicate that our model achieves better performance than baseline models.

augmenting knowledge-grounded conversations, sequential knowledge transition

Segment, Mask, and Predict: Augmenting Chinese Word Segmentation with Self-Supervision

683 - Association for Computational Linguistics, 2021 (article)

Recent state-of-the-art (SOTA) neural network methods and fine-tuning methods based on pre-trained models (PTM) have been used in Chinese word segmentation (CWS), and they achieve great results. However, previous works focus on training the models with a fixed corpus at every iteration, although the intermediate generated information is also valuable. In addition, the robustness of previous neural methods is limited by their reliance on large-scale annotated data, and the annotated corpus contains some noise. Previous studies have made only limited efforts to deal with such problems. In this work, we propose a self-supervised CWS approach with a straightforward and effective architecture. First, we train a word segmentation model and use it to generate the segmentation results. Then, we use a revised masked language model (MLM) to evaluate the quality of the segmentation results based on the predictions of the MLM. Finally, we leverage the evaluations to aid the training of the segmenter by improved minimum risk training. Experimental results show that our approach outperforms previous methods on 9 different CWS datasets with single criterion training and multiple criteria training and achieves better robustness.

augmenting Chinese word, augmenting Chinese

Augmenting Transformers with KNN-Based Composite Memory for Dialog

485 - Association for Computational Linguistics, 2021 (article)

Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialog modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge required for knowledgeable but engaging dialog from Wikipedia, images, and human-written dialog utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.

KNN-based composite memory, composite memory, KNN-based composite

On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation

370 - Association for Computational Linguistics, 2021 (article)

Pre-training (PT) and back-translation (BT) are two simple and powerful methods to utilize monolingual data for improving the model performance of neural machine translation (NMT). This paper takes the first step to investigate the complementarity between PT and BT. We introduce two probing tasks for PT and BT respectively and find that PT mainly contributes to the encoder module while BT brings more benefits to the decoder. Experimental results show that PT and BT are nicely complementary to each other, establishing state-of-the-art performances on the WMT16 English-Romanian and English-Russian benchmarks. Through extensive analyses on sentence originality and word frequency, we also demonstrate that combining Tagged BT with PT is more helpful to their complementarity, leading to better translation quality. Source code is freely available at https://github.com/SunbowLiu/PTvsBT.

explanatory efforts
