This paper presents the application of the bilingual term alignment approach developed by Repar et al. (2019) to a dataset of unaligned Estonian and Russian keywords that journalists manually assigned to describe article topics. We first separated the dataset into Estonian and Russian tags according to whether they are written in Latin or Cyrillic script. We then selected the available language-specific resources that the alignment system requires. Although the domains of these resources (subtitles and environment) do not match the domain of the dataset (news articles), we achieved respectable results: manual evaluation indicates that almost three quarters of the aligned keyword pairs are at least partial matches.
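The script-based separation step can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `detect_script` and `split_tags`, the sample tags, and the choice to set aside tags that mix scripts or contain no letters are all assumptions made for illustration.

```python
import unicodedata


def detect_script(tag: str) -> str:
    """Classify a tag as 'latin', 'cyrillic', 'mixed', or 'other'.

    Only alphabetic characters are inspected; digits, spaces, and
    punctuation are ignored. (Hypothetical helper, not from the paper.)
    """
    scripts = set()
    for ch in tag:
        if not ch.isalpha():
            continue
        # Unicode character names start with the script name,
        # e.g. 'LATIN SMALL LETTER O WITH TILDE' or 'CYRILLIC SMALL LETTER PE'.
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "other"
    if len(scripts) > 1:
        return "mixed"
    return scripts.pop()


def split_tags(tags):
    """Split tags into (estonian, russian, leftover) lists by script.

    Assumption: Latin-script tags are treated as Estonian and
    Cyrillic-script tags as Russian; anything else is set aside.
    """
    estonian, russian, leftover = [], [], []
    for tag in tags:
        script = detect_script(tag)
        if script == "latin":
            estonian.append(tag)
        elif script == "cyrillic":
            russian.append(tag)
        else:
            leftover.append(tag)
    return estonian, russian, leftover


# Made-up sample tags, for illustration only.
sample = ["põllumajandus", "сельское хозяйство", "2019", "Tallinn", "НАТО", "IT-сектор"]
estonian, russian, leftover = split_tags(sample)
```

With these sample tags, the Latin-script entries land in `estonian`, the Cyrillic-script entries in `russian`, and the numeric and mixed-script entries in `leftover`. How such edge cases were treated in the original work is not stated in the abstract, so the third bucket is only one possible design.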
References used
https://aclanthology.org/
Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for example, to train statistical machine translation, learn bilingual dictionaries, or perform quality estimation. Subword token
Acquisition of multilingual training data continues to be a challenge in word sense disambiguation (WSD). To address this problem, unsupervised approaches have been proposed to automatically generate sense annotations for training supervised WSD syst
In this paper, we study the abstractive sentence summarization. There are two essential information features that can influence the quality of news summarization, which are topic keywords and the knowledge structure of the news text. Besides, the exi
Automatic news recommendation has gained much attention from the academic community and industry. Recent studies reveal that the key to this task lies within the effective representation learning of both news and users. Existing works typically encod
We find that the requirement of model interpretations to be faithful is vague and incomplete. With interpretation by textual highlights as a case study, we present several failure cases. Borrowing concepts from social science, we identify th