User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Keywords: evaluating Japanese morphological; Japanese user-generated text; Japanese morphological analysis

Abstract

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.

References used

https://aclanthology.org/

A Text Editing Approach to Joint Japanese Word Segmentation, POS Tagging, and Lexical Normalization

Association for Computational Linguistics, 2021 (article)

Lexical normalization, in addition to word segmentation and part-of-speech tagging, is a fundamental task for Japanese user-generated text processing. In this paper, we propose a text editing model to solve the three tasks jointly and methods of pseudo-labeled data generation to overcome the problem of data deficiency. Our experiments showed that the proposed model achieved better normalization performance when trained on more diverse pseudo-labeled data.

Keywords: joint Japanese word; approach to joint; text editing approach

View Distillation with Unlabeled Data for Extracting Adverse Drug Effects from User-Generated Data

Association for Computational Linguistics, 2021 (article)

We present an algorithm based on multi-layer transformers for identifying Adverse Drug Reactions (ADR) in social media data. Our model relies on the properties of the problem and the characteristics of contextual word embeddings to extract two views from documents. Then a classifier is trained on each view to label a set of unlabeled documents to be used as an initializer for a new classifier in the other view. Finally, the initialized classifier in each view is further trained using the initial training examples. We evaluated our model in the largest publicly available ADR dataset. The experiments testify that our model significantly outperforms the transformer-based models pretrained on domain-specific data.

Keywords: extracting adverse drug; adverse drug effects; adverse drug reactions

Semi-Supervised Learning Based on Auto-generated Lexicon Using XAI in Sentiment Analysis

Association for Computational Linguistics, 2021 (article)

In this study, we propose a novel lexicon-based pseudo-labeling method that uses an explainable AI (XAI) approach. Existing approaches have a fundamental limitation in their robustness because a poor classifier leads to inaccurate soft-labeling, which in turn leads to a poor classifier, and the cycle repeats. In contrast, we generate a lexicon consisting of sentiment words based on the explainability score. We then calculate the confidence of unlabeled data with the lexicon and add them to the labeled dataset for a robust pseudo-labeling approach. Our proposed method has three contributions. First, the proposed methodology automatically generates a lexicon based on XAI and performs independent pseudo-labeling, thereby guaranteeing higher performance and robustness compared to the existing one. Second, since lexicon-based pseudo-labeling is performed without re-learning in most models, time efficiency is considerably increased. Third, the generated high-quality lexicon can be used for sentiment analysis of data from similar domains. The effectiveness and efficiency of our proposed method were verified through quantitative comparison with the existing pseudo-labeling method and qualitative review of the generated lexicon.

Keywords: semi-supervised learning based; semi-supervised learning; learning based

The Korean Morphologically Tight-Fitting Tokenizer for Noisy User-Generated Texts

Association for Computational Linguistics, 2021 (article)

User-generated texts include various types of stylistic properties, or noises. Such texts are not properly processed by existing morpheme analyzers or language models based on formal texts such as encyclopedias or news articles. In this paper, we propose a simple morphologically tight-fitting tokenizer (K-MT) that can better process proper nouns, coinages, and internet slang among other types of noise in Korean user-generated texts. We tested our tokenizer by performing classification tasks on Korean user-generated movie reviews and hate speech datasets, and the Korean Named Entity Recognition dataset. Through our tests, we found that K-MT is better fit to process internet slang, proper nouns, and coinages, compared to a morpheme analyzer and a character-level WordPiece tokenizer.

Keywords: noisy user-generated texts; noisy user-generated; morphologically tight-fitting tokenizer

Google-trickers, Yaminjeongeum, and Leetspeak: An Empirical Taxonomy for Intentionally Noisy User-Generated Text

Association for Computational Linguistics, 2021 (article)

WARNING: This article contains contents that may offend the readers. Strategies that insert intentional noise into text when posting it are commonly observed in the online space, and sometimes they aim to let only certain community users understand the genuine semantics. In this paper, we explore the purpose of such actions by categorizing them into tricks, memes, fillers, and codes, and organize the linguistic strategies that are used for each purpose. Through this, we identify that such strategies can be conducted by authors for multiple purposes, regarding the presence of stakeholders such as 'Peers' and 'Others'. We finally analyze how these strategies appear differently in each circumstance, along with the unified taxonomy accompanying examples.

Keywords: natural language processing; text classification; intentionally noisy user-generated; noisy user-generated text; intentionally noisy; noisy text
