تحليل مورفولوجي (MA) والتطبيع المعجمي (LN) هي مهام مهمة للنص الياباني الذي تم إنشاؤه بواسطة المستخدمين (UGT).لتقييم ومقارنة أنظمة MA / LN المختلفة، قمنا ببناء كوربوس UGT اليابانية المتاحة للجمهور.يشتمل كوربوس لدينا على 929 جمل مشروحة مع معلومات مورفولوجية وتطبيعا، إلى جانب معلومات الفئة المصنفة لظواهر خاصة بوحشية UGT.أظهرت التجارب على الجثة أداء أداء منخفضة من أساليب MA / LN الحالية للكلمات غير العامة والنماذج غير القياسية، مما يشير إلى أن الكائن ستكون معيارا صعبا لمزيد من البحث حول UGT.
Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.
References used
https://aclanthology.org/
Lexical normalization, in addition to word segmentation and part-of-speech tagging, is a fundamental task for Japanese user-generated text processing. In this paper, we propose a text editing model to solve the three task jointly and methods of pseud
We present an algorithm based on multi-layer transformers for identifying Adverse Drug Reactions (ADR) in social media data. Our model relies on the properties of the problem and the characteristics of contextual word embeddings to extract two views
In this study, we proposed a novel Lexicon-based pseudo-labeling method utilizing explainable AI(XAI) approach. Existing approach have a fundamental limitation in their robustness because poor classifier leads to inaccurate soft-labeling, and it lead
User-generated texts include various types of stylistic properties, or noises. Such texts are not properly processed by existing morpheme analyzers or language models based on formal texts such as encyclopedias or news articles. In this paper, we pro
WARNING: This article contains contents that may offend the readers. Strategies that insert intentional noise into text when posting it are commonly observed in the online space, and sometimes they aim to let only certain community users understand t