Hierarchical Character Tagger for Short Text Spelling Error Correction

Added by Association for Computational Linguistics (article)

Publication date: 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

spelling error correction, text spelling error, hierarchical character tagger

Abstract

State-of-the-art approaches to the spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference; and sequence labeling models based on Transformer encoders like BERT, which use a token-level label space and therefore need a large pre-defined vocabulary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits that transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach to alleviate the issue of long-tail label distribution without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is accurate and much faster than many existing models.
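To make the idea of character-level edits concrete, the following is a minimal sketch of how a sequence of per-character edit tags could be applied to a misspelled string. The tag names (`KEEP`, `DELETE`, `REPLACE`, `APPEND`), the `(operation, argument)` layout, and the function `apply_char_edits` are illustrative assumptions, not the paper's actual label scheme. The tags are written by hand here; in a real system they would come from the tagger's predictions.

```python
from typing import List, Tuple

# Hypothetical tag scheme (an assumption, not taken from the paper):
# every input character receives one (operation, argument) pair.
KEEP, DELETE, REPLACE, APPEND = "KEEP", "DELETE", "REPLACE", "APPEND"

Tag = Tuple[str, str]


def apply_char_edits(text: str, tags: List[Tag]) -> str:
    """Apply one edit tag per input character and return the edited text."""
    if len(text) != len(tags):
        raise ValueError("need exactly one tag per input character")
    out = []
    for ch, (op, arg) in zip(text, tags):
        if op == KEEP:
            out.append(ch)            # copy the character unchanged
        elif op == DELETE:
            continue                  # drop the character
        elif op == REPLACE:
            out.append(arg)           # substitute the character with arg
        elif op == APPEND:
            out.append(ch + arg)      # keep the character, then insert arg
        else:
            raise ValueError(f"unknown operation: {op}")
    return "".join(out)


# Hand-written tags for the misspelling "helo wrold".
example_tags: List[Tag] = [
    (KEEP, ""), (KEEP, ""), (APPEND, "l"), (KEEP, ""), (KEEP, ""),
    (KEEP, ""), (REPLACE, "o"), (REPLACE, "r"), (KEEP, ""), (KEEP, ""),
]
corrected = apply_char_edits("helo wrold", example_tags)
print(corrected)
```

Splitting each tag into an operation and a character argument keeps the label space close to the number of operations plus the alphabet size, rather than a full word vocabulary. This loosely mirrors the smaller label space described in the abstract, though the paper's hierarchical decoding may be organised differently.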

References used

https://aclanthology.org/

An Alignment-Agnostic Model for Chinese Text Error Correction

Association for Computational Linguistics, 2021 (article)

This paper investigates how to correct Chinese text errors involving mistaken, missing, and redundant characters, which are common among native Chinese speakers. Most existing models based on the detect-correct framework can correct mistaken characters, but cannot handle missing or redundant characters because the model inputs and outputs differ in length. Although Seq2Seq-based or sequence tagging methods provide solutions to all three error types and have achieved relatively good results in English, they do not perform well in Chinese according to our experiments. In our work, we propose a novel alignment-agnostic detect-correct framework that can handle both aligned and non-aligned text and can serve as a cold-start model when no annotated data are provided. Experimental results on three datasets demonstrate that our method is effective and performs better than most recently published models.

text error correction, Chinese text error

Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction

Association for Computational Linguistics, 2021 (article)

Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance. We analyze the distributions of different error classes using two unsupervised tasks as testbeds: converting informally romanized text into the native script of its language (for Russian, Arabic, and Kannada) and translating between a pair of closely related languages (Serbian and Bosnian). Finally, we investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.

comparative error analysis, analysis in neural, unsupervised character-level transduction

Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification

Association for Computational Linguistics, 2021 (article)

Short text classification is a fundamental task in natural language processing. It is hard because context information and labeled data are scarce in practice. In this paper, we propose a new method called SHINE, which is based on graph neural networks (GNNs), for short text classification. First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs, which introduce more semantic and syntactic information. Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts. Thus, compared with existing GNN-based methods, SHINE can better exploit interactions between nodes of the same type and capture similarities between short texts. Extensive experiments on various benchmark short text datasets show that SHINE consistently outperforms state-of-the-art methods, especially with fewer labels.

graph representation learning, short text classification

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Association for Computational Linguistics, 2021 (article)

Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, which are often said to degrade the performance of NLP systems. Correcting these errors manually is time-consuming, and a large share of automatic approaches have relied on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model for OCR error correction, originally designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.

OCR post-correction, spelling normalisation

Visual Cues and Error Correction for Translation Robustness

Association for Computational Linguistics, 2021 (article)

Neural Machine Translation models are sensitive to noise in the input texts, such as misspelled words and ungrammatical constructions. Existing robustness techniques generally fail when faced with unseen types of noise and their performance degrades on clean texts. In this paper, we focus on three types of realistic noise that are commonly generated by humans and introduce the idea of visual context to improve translation robustness for noisy texts. In addition, we describe a novel error correction training regime that can be used as an auxiliary task to further improve translation robustness. Experiments on English-French and English-German translation show that both multimodal and error correction components improve model robustness to noisy texts, while still retaining translation quality on clean texts.

visual cues, cues and error
