تعد التطبيع المعجمي، بالإضافة إلى تقسيم الكلمات وعلامات جزء من الكلام، مهمة أساسية لمعالجة النصية اليابانية التي أنشأها المستخدم.في هذه الورقة، نقترح نموذج تحرير النصوص لحل المهمة الثلاثة المشتركة وطرق توليد البيانات المسمى Pseudo للتغلب على مشكلة نقص البيانات.أظهرت تجاربنا أن النموذج المقترح حقق أداء تطبيع أفضل عند التدريب على بيانات أكثر تنوعا المصممة بالقطرات.
Lexical normalization, in addition to word segmentation and part-of-speech tagging, is a fundamental task for Japanese user-generated text processing. In this paper, we propose a text editing model to solve the three task jointly and methods of pseudo-labeled data generation to overcome the problem of data deficiency. Our experiments showed that the proposed model achieved better normalization performance when trained on more diverse pseudo-labeled data.
References used
https://aclanthology.org/
Domain adaption for word segmentation and POS tagging is a challenging problem for Chinese lexical processing. Self-training is one promising solution for it, which struggles to construct a set of high-quality pseudo training instances for the target
Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comp
The task of converting a nonstandard text to a standard and readable text is known as lexical normalization. Almost all the Natural Language Processing (NLP) applications require the text data in normalized form to build quality task-specific models.
Dravidian languages, such as Kannada and Tamil, are notoriously difficult to translate by state-of-the-art neural models. This stems from the fact that these languages are morphologically very rich as well as being low-resourced. In this paper, we fo
Character-based word-segmentation models have been extensively applied to agglutinative languages, including Thai, due to their high performance. These models estimate word boundaries from a character sequence. However, a character unit in sequences