Do you want to publish a course? Click here

Towards Precise Lexicon Integration in Neural Machine Translation

نحو تكامل المعجم الدقيقة في الترجمة الآلية العصبية

343   0   0   0.0 ( 0 )
 Publication date 2021
and research's language is English
 Created by Shamra Editor




Ask ChatGPT about the research

Terminological consistency is an essential requirement for industrial translation. High-quality, hand-crafted terminologies contain entries in their nominal forms. Integrating such a terminology into machine translation is not a trivial task. The MT system must be able to disambiguate homographs on the source side and choose the correct wordform on the target side. In this work, we propose a simple but effective method for homograph disambiguation and a method of wordform selection by introducing multi-choice lexical constraints. We also propose a metric to measure the terminological consistency of the translation. Our results have a significant improvement over the current SOTA in terms of terminological consistency without any loss of the BLEU score. All the code used in this work will be published as open-source.



References used
https://aclanthology.org/
rate research

Read More

The paper presents experiments in neural machine translation with lexical constraints into a morphologically rich language. In particular and we introduce a method and based on constrained decoding and which handles the inflected forms of lexical ent ries and does not require any modification to the training data or model architecture. To evaluate its effectiveness and we carry out experiments in two different scenarios: general and domain-specific. We compare our method with baseline translation and i.e. translation without lexical constraints and in terms of translation speed and translation quality. To evaluate how well the method handles the constraints and we propose new evaluation metrics which take into account the presence and placement and duplication and inflectional correctness of lexical terms in the output sentence.
One key ingredient of neural machine translation is the use of large datasets from different domains and resources (e.g. Europarl, TED talks). These datasets contain documents translated by professional translators using different but consistent tran slation styles. Despite that, the model is usually trained in a way that neither explicitly captures the variety of translation styles present in the data nor translates new data in different and controllable styles. In this work, we investigate methods to augment the state of the art Transformer model with translator information that is available in part of the training data. We show that our style-augmented translation models are able to capture the style variations of translators and to generate translations with different styles on new data. Indeed, the generated variations differ significantly, up to +4.5 BLEU score difference. Despite that, human evaluation confirms that the translations are of the same quality.
Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular.
The neural machine translation approach has gained popularity in machine translation because of its context analysing ability and its handling of long-term dependency issues. We have participated in the WMT21 shared task of similar language translati on on a Tamil-Telugu pair with the team name: CNLP-NITS. In this task, we utilized monolingual data via pre-train word embeddings in transformer model based neural machine translation to tackle the limitation of parallel corpus. Our model has achieved a bilingual evaluation understudy (BLEU) score of 4.05, rank-based intuitive bilingual evaluation score (RIBES) score of 24.80 and translation edit rate (TER) score of 97.24 for both Tamil-to-Telugu and Telugu-to-Tamil translations respectively.
Interactive-predictive translation is a collaborative iterative process and where human translators produce translations with the help of machine translation (MT) systems interactively. Various sampling techniques in active learning (AL) exist to upd ate the neural MT (NMT) model in the interactive-predictive scenario. In this paper and we explore term based (named entity count (NEC)) and quality based (quality estimation (QE) and sentence similarity (Sim)) sampling techniques -- which are used to find the ideal candidates from the incoming data -- for human supervision and MT model's weight updation. We carried out experiments with three language pairs and viz. German-English and Spanish-English and Hindi-English. Our proposed sampling technique yields 1.82 and 0.77 and 0.81 BLEU points improvements for German-English and Spanish-English and Hindi-English and respectively and over random sampling based baseline. It also improves the present state-of-the-art by 0.35 and 0.12 BLEU points for German-English and Spanish-English and respectively. Human editing effort in terms of number-of-words-changed also improves by 5 and 4 points for German-English and Spanish-English and respectively and compared to the state-of-the-art.

suggested questions

comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا