Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Chinese Character Decomposition for Neural MT with Multi-Word Expressions

التحلل الطابع الصيني ل MT العصبي مع تعبيرات متعددة الكلمات

662 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

chinese character decomposition enhance machine translation chinese character التحلل الطابع الصيني تعزيز ترجمة الآلات شخصية صينية صناعة حمض الفوسفور

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embedding. However, questions remain about different decomposition levels of Chinese character representations, radical and strokes, best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate if the combination of decomposed Multiword Expressions (MWEs) can enhance the model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs has not previously been explored.

References used

https://aclanthology.org/

rate research

Discovery of Multiword Expressions with Loanwords and Their Equivalents in the Persian Language

502 - Association for Computation Linguistics 2021 مقالة

This paper presents an attempt at multiword expressions (MWEs) discovery in the Persian language. It focuses on extracting MWEs containing lemmas of a particular group: loanwords in Persian and their equivalents proposed by the Academy of Persian Lan guage and Literature. In order to discover such MWEs, four association measures (AMs) are used and evaluated. Finally, the list of extracted MWEs is analyzed, and a comparison between expressions with loanwords and equivalents is presented. To our knowledge, this is the first time such analysis was provided for the Persian language.

سورانيا multiword expressions persian تعبيرات متعددة الكلمات صناعة حمض الفوسفور

Glyph Enhanced Chinese Character Pre-Training for Lexical Sememe Prediction

676 - Association for Computation Linguistics 2021 مقالة

Sememes are defined as the atomic units to describe the semantic meaning of concepts. Due to the difficulty of manually annotating sememes and the inconsistency of annotations between experts, the lexical sememe prediction task has been proposed. How ever, previous methods heavily rely on word or character embeddings, and ignore the fine-grained information. In this paper, we propose a novel pre-training method which is designed to better incorporate the internal information of Chinese character. The Glyph enhanced Chinese Character representation (GCC) is used to assist sememe prediction. We experiment and evaluate our model on HowNet, which is a famous sememe knowledge base. The experimental results show that our method outperforms existing non-external information models.

lexical sememe prediction enhanced chinese character glyph enhanced chinese تنبؤات نظرية معجمية تعزيز الشخصية الصينية glyph عزز الصينية صناعة حمض الفوسفور المزيد..

Character-based Thai Word Segmentation with Multiple Attentions

929 - Association for Computation Linguistics 2021 مقالة

Character-based word-segmentation models have been extensively applied to agglutinative languages, including Thai, due to their high performance. These models estimate word boundaries from a character sequence. However, a character unit in sequences has no essential meaning, compared with word, subword, and character cluster units. We propose a Thai word-segmentation model that uses various types of information, including words, subwords, and character clusters, from a character sequence. Our model applies multiple attentions to refine segmentation inferences by estimating the significant relationships among characters and various unit types. The experimental results indicate that our model can outperform other state-of-the-art Thai word-segmentation models.

character-based thai word thai thai word-segmentation الكلمة التايلاندية القائمة على الأحرف التايلاندية التايلاندية تجزئة صناعة حمض الفوسفور المزيد..

The IICT-Yverdon System for the WMT 2021 Unsupervised MT and Very Low Resource Supervised MT Task

859 - Association for Computation Linguistics 2021 مقالة

In this paper, we present the systems submitted by our team from the Institute of ICT (HEIG-VD / HES-SO) to the Unsupervised MT and Very Low Resource Supervised MT task. We first study the improvements brought to a baseline system by techniques such as back-translation and initialization from a parent model. We find that both techniques are beneficial and suffice to reach performance that compares with more sophisticated systems from the 2020 task. We then present the application of this system to the 2021 task for low-resource supervised Upper Sorbian (HSB) to German translation, in both directions. Finally, we present a contrastive system for HSB-DE in both directions, and for unsupervised German to Lower Sorbian (DSB) translation, which uses multi-task training with various training schedules to improve over the baseline.

الاستغلال المباشر resource supervised الإشراف على الموارد صناعة حمض الفوسفور

Segment, Mask, and Predict: Augmenting Chinese Word Segmentation with Self-Supervision

1229 - Association for Computation Linguistics 2021 مقالة

Recent state-of-the-art (SOTA) effective neural network methods and fine-tuning methods based on pre-trained models (PTM) have been used in Chinese word segmentation (CWS), and they achieve great results. However, previous works focus on training the models with the fixed corpus at every iteration. The intermediate generated information is also valuable. Besides, the robustness of the previous neural methods is limited by the large-scale annotated data. There are a few noises in the annotated corpus. Limited efforts have been made by previous studies to deal with such problems. In this work, we propose a self-supervised CWS approach with a straightforward and effective architecture. First, we train a word segmentation model and use it to generate the segmentation results. Then, we use a revised masked language model (MLM) to evaluate the quality of the segmentation results based on the predictions of the MLM. Finally, we leverage the evaluations to aid the training of the segmenter by improved minimum risk training. Experimental results show that our approach outperforms previous methods on 9 different CWS datasets with single criterion training and multiple criteria training and achieves better robustness.

augmenting chinese word augmenting chinese زيادة الكلمة الصينية زيادة الصينية صناعة حمض الفوسفور

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Chinese Character Decomposition for Neural MT with Multi-Word Expressions

التحلل الطابع الصيني ل MT العصبي مع تعبيرات متعددة الكلمات

Ask ChatGPT about the research

Read More

suggested questions