Bilingual Terminology Extraction Using Neural Word Embeddings on Comparable Corpora

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

neural word embeddings, comparable corpora, neural word

Term and glossary management are vital steps in the preparation of every language specialist, and they play a very important role in the education of translation professionals. The growing emphasis on efficient time management and the constant time constraints observed in every job sector increase the need for automatic glossary compilation. Many well-performing bilingual AET systems are based on processing parallel data; however, such parallel corpora are not always available for a specific domain or language pair. Domain-specific bilingual access to information and its retrieval based on comparable corpora is a very promising area of research that requires a detailed analysis of both the available data sources and the possible extraction techniques. This work focuses on domain-specific automatic terminology extraction from comparable corpora for the English-Russian language pair by utilizing neural word embeddings.
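The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of one common way embeddings are used for this task: ranking target-language words by cosine similarity to a source term. It assumes the English and Russian vectors have already been mapped into one shared space; the function names and the toy three-dimensional vectors are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def translation_candidates(term, src_vecs, tgt_vecs, k=2):
    """Return the k target words closest to `term` in the shared space."""
    query = src_vecs[term]
    ranked = sorted(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]), reverse=True)
    return ranked[:k]

# Toy vectors, ASSUMED to live in a shared cross-lingual space already.
EN = {"engine": [0.9, 0.1, 0.0], "valve": [0.1, 0.9, 0.1]}
RU = {
    "двигатель": [0.8, 0.2, 0.1],  # "engine"
    "клапан": [0.2, 0.8, 0.0],     # "valve"
    "книга": [0.0, 0.1, 0.9],      # "book" (out-of-domain distractor)
}

for term in EN:
    print(term, "->", translation_candidates(term, EN, RU, k=1))
```

In practice the shared space would come from trained embeddings and a cross-lingual mapping step, and candidates would typically be filtered by a term-hood measure before being added to a glossary.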

References used

https://aclanthology.org/

Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

Association for Computational Linguistics, 2021 (article)

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well in high-resource conditions, achieving state-of-the-art performance on a German-English word-alignment task.

learning contextualised cross-lingual, contextualised cross-lingual word

Identify Bilingual Patterns and Phrases from a Bilingual Sentence Pair

Association for Computational Linguistics, 2021 (article)

This paper presents a method for automatically identifying bilingual grammar patterns and extracting bilingual phrase instances from a given English-Chinese sentence pair. In our approach, the English-Chinese sentence pair is parsed to identify English grammar patterns and their Chinese counterparts. The method involves generating translations of each English grammar pattern and calculating the translation probability of words from a word-aligned parallel corpus. The results allow us to extract the most probable English-Chinese phrase pairs in the sentence pair. We present a prototype system that applies the method to extract grammar patterns and phrases in parallel sentences. An evaluation on randomly selected examples from a dictionary shows that our approach has reasonably good performance. We use human judges to assess the bilingual phrases generated by our approach. The results have the potential to assist language learning and machine translation research.
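The step of calculating word translation probabilities from word-aligned data can be illustrated with a short sketch. It assumes a plain relative-frequency estimate, p(t | s) = count(s, t) / count(s), which may differ from the estimator the authors actually use; the function name and the toy alignment links are hypothetical.

```python
from collections import Counter, defaultdict

def translation_probs(links):
    """Estimate p(target | source) by relative frequency over alignment links.

    `links` is a list of (source_word, target_word) pairs, one per aligned
    word token. This estimator is an assumption made for illustration.
    """
    pair_counts = Counter(links)
    src_counts = Counter(src for src, _ in links)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        probs[src][tgt] = n / src_counts[src]
    return probs

# Toy English-Chinese alignment links (hypothetical data).
LINKS = [("play", "玩"), ("play", "玩"), ("play", "播放"), ("role", "角色")]

PROBS = translation_probs(LINKS)
best = max(PROBS["play"], key=PROBS["play"].get)
print(best, PROBS["play"][best])
```

Picking the highest-probability target for each word in a pattern is one simple way to rank candidate phrase pairs; a real system would also need smoothing for rare words.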

sentence pair, grammar patterns, english grammar patterns

Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora

Association for Computational Linguistics, 2021 (article)

Large web-crawled corpora represent an excellent resource for improving the performance of Neural Machine Translation (NMT) systems across several language pairs. However, since these corpora are typically extremely noisy, their use is fairly limited. Current approaches to deal with this problem mainly focus on filtering using heuristics or single features such as language model scores or bi-lingual similarity. This work presents an alternative approach which learns weights for multiple sentence-level features. These feature weights, which are optimized directly for the task of improving translation performance, are used to score and filter sentences in the noisy corpora more effectively. We provide results of applying this technique to building NMT systems using the Paracrawl corpus for Estonian-English and show that it beats strong single-feature baselines and hand-designed combinations. Additionally, we analyze the sensitivity of this method to different types of noise and explore whether the learned weights generalize to other language pairs using the Maltese-English Paracrawl corpus.
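The scoring-and-filtering step described above can be sketched as a weighted sum of sentence-level features compared against a threshold. This is only an illustration: the weights below are set by hand, whereas the paper learns them through reward modeling, and the feature names, threshold, and sample pairs are all hypothetical.

```python
def score(features, weights):
    """Weighted sum of sentence-level feature values."""
    return sum(weights[name] * value for name, value in features.items())

def filter_pairs(pairs, weights, threshold):
    """Keep the ids of sentence pairs whose combined score meets the threshold."""
    return [p["id"] for p in pairs if score(p["features"], weights) >= threshold]

# Hand-set weights for illustration only (the paper LEARNS these).
WEIGHTS = {"lm_score": 0.4, "bilingual_sim": 0.6}

# Hypothetical sentence pairs with features already scaled to [0, 1].
PAIRS = [
    {"id": "a", "features": {"lm_score": 0.9, "bilingual_sim": 0.8}},
    {"id": "b", "features": {"lm_score": 0.2, "bilingual_sim": 0.3}},
    {"id": "c", "features": {"lm_score": 0.7, "bilingual_sim": 0.1}},
]

KEPT = filter_pairs(PAIRS, WEIGHTS, threshold=0.5)
print(KEPT)
```

Pair "c" shows why combining features matters: it is fluent according to the language-model feature but is still dropped because its bilingual similarity is low.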

denoising parallel corpora, reward modeling, modeling for denoising

DuRecDial 2.0: A Bilingual Parallel Corpus for Conversational Recommendation

Association for Computational Linguistics, 2021 (article)

In this paper, we provide a bilingual parallel human-to-human recommendation dialog dataset (DuRecDial 2.0) to enable researchers to explore a challenging task of multilingual and cross-lingual conversational recommendation. The difference between DuRecDial 2.0 and existing conversational recommendation datasets is that the data item (Profile, Goal, Knowledge, Context, Response) in DuRecDial 2.0 is annotated in two languages, both English and Chinese, while other datasets are built with the setting of a single language. We collect 8.2k dialogs aligned across English and Chinese languages (16.5k dialogs and 255k utterances in total) that are annotated by crowdsourced workers under a strict quality control procedure. We then build monolingual, multilingual, and cross-lingual conversational recommendation baselines on DuRecDial 2.0. Experimental results show that the use of additional English data can bring performance improvement for Chinese conversational recommendation, indicating the benefits of DuRecDial 2.0. Finally, this dataset provides a challenging testbed for future studies of monolingual, multilingual, and cross-lingual conversational recommendation.

bilingual parallel corpus, parallel corpus, bilingual parallel

Arabic Handwritten Word Recognition using CBIR

Al-Baath University, 2017 (research paper)

This research describes a system for recognizing handwritten Arabic words without prior segmentation of the word into characters. In this system, recognition takes place at two levels. The system builds on OCR (Optical Character Recognition), Hidden Markov Models, and CBIR (Content-Based Image Retrieval), and it also involves mathematical morphology.

optical character recognition, hidden Markov model, morphological functions, content-based image retrieval
