Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

The Korean Morphologically Tight-Fitting Tokenizer for Noisy User-Generated Texts

المماثلون الضيقون الكورية ضيقة من الناحية المورفية للنصوص التي تم إنشاؤها من قبل صاخبة

484 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

noisy user-generated texts noisy user-generated morphologically tight-fitting tokenizer النصوص التي أنشأها المستخدم صاخبة صاخبة المستخدم مظلمة ضيقة مورفولوجية صناعة حمض الفوسفور

visit our facebook page

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

User-generated texts include various types of stylistic properties, or noises. Such texts are not properly processed by existing morpheme analyzers or language models based on formal texts such as encyclopedias or news articles. In this paper, we propose a simple morphologically tight-fitting tokenizer (K-MT) that can better process proper nouns, coinages, and internet slang among other types of noise in Korean user-generated texts. We tested our tokenizer by performing classification tasks on Korean user-generated movie reviews and hate speech datasets, and the Korean Named Entity Recognition dataset. Through our tests, we found that K-MT is better fit to process internet slangs, proper nouns, and coinages, compared to a morpheme analyzer and a character-level WordPiece tokenizer.

References used

https://aclanthology.org/

rate research

View Distillation with Unlabeled Data for Extracting Adverse Drug Effects from User-Generated Data

394 - Association for Computation Linguistics 2021 مقالة

We present an algorithm based on multi-layer transformers for identifying Adverse Drug Reactions (ADR) in social media data. Our model relies on the properties of the problem and the characteristics of contextual word embeddings to extract two views from documents. Then a classifier is trained on each view to label a set of unlabeled documents to be used as an initializer for a new classifier in the other view. Finally, the initialized classifier in each view is further trained using the initial training examples. We evaluated our model in the largest publicly available ADR dataset. The experiments testify that our model significantly outperforms the transformer-based models pretrained on domain-specific data.

extracting adverse drug adverse drug effects adverse drug reactions استخراج المخدرات السلبية تأثيرات المخدرات الضارة التفاعلات الدوائية الضارة صناعة حمض الفوسفور المزيد..

Understanding Model Robustness to User-generated Noisy Texts

554 - Association for Computation Linguistics 2021 مقالة

Sensitivity of deep-neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially nois ed data. However, the amount and type of generated noise has so far been determined arbitrarily. We therefore propose to model the errors statistically from grammatical-error-correction corpora. We present a thorough evaluation of several state-of-the-art NLP systems' robustness in multiple languages, with tasks including morpho-syntactic analysis, named entity recognition, neural machine translation, a subset of the GLUE benchmark and reading comprehension. We also compare two approaches to address the performance drop: a) training the NLP models with noised data generated by our framework; and b) reducing the input noise with external system for natural language correction. The code is released at https://github.com/ufal/kazitext.

user-generated noisy texts noisy texts user-generated noisy النصوص الناتجة عن المستخدم نصوص صاخبة صاخبة التي تم إنشاؤها صناعة حمض الفوسفور المزيد..

User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

458 - Association for Computation Linguistics 2021 مقالة

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comp rises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.

evaluating japanese morphological japanese user-generated text japanese morphological analysis تقييم المورفولوجية اليابانية النص الياباني الناتج عن المستخدم التحليل المورفولوجي الياباني صناعة حمض الفوسفور المزيد..

Evaluating the Morphosyntactic Well-formedness of Generated Texts

352 - Association for Computation Linguistics 2021 مقالة

Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L'AMBRE -- a metric to evaluate the morphosyn tactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various rules governing morphosyntax directly from dependency treebanks. To tackle the noisy outputs from text generation systems, we propose a simple methodology to train robust parsers. We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.

well-formedness of generated generated texts morphosyntactic well-formedness شكل جيد من ولدت النصوص التي تم إنشاؤها مورفوسنكتاكيتش شكل جيد صناعة حمض الفوسفور المزيد..

Google-trickers, Yaminjeongeum, and Leetspeak: An Empirical Taxonomy for Intentionally Noisy User-Generated Text

559 - Association for Computation Linguistics 2021 مقالة

WARNING: This article contains contents that may offend the readers. Strategies that insert intentional noise into text when posting it are commonly observed in the online space, and sometimes they aim to let only certain community users understand t he genuine semantics. In this paper, we explore the purpose of such actions by categorizing them into tricks, memes, fillers, and codes, and organize the linguistic strategies that are used for each purpose. Through this, we identify that such strategies can be conducted by authors for multiple purposes, regarding the presence of stakeholders such as Peers' and Others'. We finally analyze how these strategies appear differently in each circumstance, along with the unified taxonomy accompanying examples.

معالجة اللغات الطبيعية تصنيف النصوص intentionally noisy user-generated noisy user-generated text intentionally noisy النص الفوضوي النص الضّاج المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

The Korean Morphologically Tight-Fitting Tokenizer for Noisy User-Generated Texts

المماثلون الضيقون الكورية ضيقة من الناحية المورفية للنصوص التي تم إنشاؤها من قبل صاخبة

Ask ChatGPT about the research

Read More

suggested questions