An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Keywords: OCR post-correction, spelling normalisation

Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in digitization, and these errors are often said to degrade the performance of NLP systems. Correcting them manually is time-consuming, and a large share of automatic approaches have relied on rules or supervised machine learning. We build on previous work on fully automatic, unsupervised extraction of parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model for OCR error correction. That approach was designed for English; we adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
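To make the idea of unsupervised parallel-data extraction concrete, here is a minimal sketch in Python. It is not the pipeline from the paper: it assumes a hypothetical lexicon of known-correct word forms (including inflected forms, which matters for a morphologically rich language like Finnish) and pairs each out-of-lexicon token with its nearest lexicon word by plain edit distance. The names `extract_pairs`, `to_char_sequence`, and the `max_distance` threshold are illustrative assumptions.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def extract_pairs(tokens: Iterable[str], lexicon: set[str],
                  max_distance: int = 2) -> list[tuple[str, str]]:
    """Pair each out-of-lexicon token with its closest lexicon word, if close enough."""
    pairs = []
    for token in sorted(set(tokens)):
        if token in lexicon:
            continue  # already a known word form, nothing to correct
        # Ties on distance are broken alphabetically to keep the output stable.
        best = min(lexicon, key=lambda word: (levenshtein(token, word), word))
        if levenshtein(token, best) <= max_distance:
            pairs.append((token, best))
    return pairs


def to_char_sequence(word: str) -> str:
    """Character-level seq2seq toolkits commonly expect space-separated symbols."""
    return " ".join(word)


# Hypothetical toy data: a tiny lexicon with inflected forms and noisy OCR tokens.
lexicon = {"talo", "talossa", "kirja", "kirjassa"}
ocr_tokens = ["talo", "ta1ossa", "kirja", "kiija", "xyzxyz"]
pairs = extract_pairs(ocr_tokens, lexicon)
for noisy, clean in pairs:
    print(to_char_sequence(noisy), "->", to_char_sequence(clean))
```

The resulting pairs could serve as source and target lines for a character-based NMT model. Tokens with no sufficiently close lexicon entry (here `xyzxyz`) are left out rather than forced into a pair.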

References used

https://aclanthology.org/

An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages

Association for Computational Linguistics, 2021 (article)

Parallel sentence simplification (SS) data is scarce for neural SS modeling. We propose an unsupervised method to build SS corpora from large-scale bilingual translation corpora, alleviating the need for supervised SS corpora. Our method is motivated by two findings: neural machine translation models tend to generate more high-frequency tokens, and text complexity levels differ between the source and target languages of a translation corpus. By pairing the source sentences of a translation corpus with the translations of their references in a bridge language, we can construct large-scale pseudo-parallel SS data. We then keep the sentence pairs with a higher complexity difference as SS sentence pairs. Building SS corpora with this unsupervised approach satisfies the expectations that aligned sentences preserve the same meaning and differ in text complexity. Experimental results show that SS methods trained on our corpora achieve state-of-the-art results and significantly outperform the results on the English benchmark WikiLarge.
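The filtering step described above (keeping pseudo-parallel pairs with a larger complexity difference) can be sketched as follows. The complexity measure here, mean word length, is a crude stand-in chosen only for illustration; the abstract does not say which measure the authors use. The `min_gap` threshold and the choice to orient each kept pair as (complex, simple) are likewise assumptions.

```python
def complexity(sentence: str) -> float:
    """Crude, hypothetical complexity proxy: mean word length in characters."""
    words = sentence.split()
    return sum(len(word) for word in words) / len(words) if words else 0.0


def select_ss_pairs(candidates, min_gap=1.0):
    """Keep pairs whose complexity gap is at least min_gap, as (complex, simple)."""
    selected = []
    for first, second in candidates:
        gap = complexity(first) - complexity(second)
        if gap >= min_gap:
            selected.append((first, second))
        elif -gap >= min_gap:
            selected.append((second, first))  # assumption: orient as (complex, simple)
    return selected


# Hypothetical candidate pairs with the same meaning but different wording.
candidates = [
    ("The committee postponed deliberations indefinitely",
     "The group put off talks for now"),
    ("He went home", "He walked home"),
]
selected = select_ss_pairs(candidates)
for complex_side, simple_side in selected:
    print(complex_side, "=>", simple_side)
```

Only the first pair survives the filter, since the second pair differs too little in the chosen proxy.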

Keywords: building sentence simplification, sentence simplification corpora

Hierarchical Character Tagger for Short Text Spelling Error Correction

Association for Computational Linguistics, 2021 (article)

State-of-the-art approaches to the spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference; and sequence labeling models based on Transformer encoders like BERT, which involve a token-level label space and therefore a large pre-defined vocabulary dictionary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach to alleviate the issue of long-tail label distribution without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is an accurate and much faster approach than many existing models.
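To illustrate what "predicting character-level edits" can look like, the sketch below applies one edit tag per input character to produce corrected text. The tag inventory (`KEEP`, `DELETE`, `REPLACE_x`, `APPEND_x`) is a hypothetical one chosen for this example, not HCTagger's actual label set, and the tags are written by hand rather than predicted by a model. The hierarchical multi-task decoding is not shown.

```python
def apply_edits(text: str, tags: list[str]) -> str:
    """Apply one edit tag per character and return the edited string."""
    if len(text) != len(tags):
        raise ValueError("one tag per input character is required")
    output = []
    for char, tag in zip(text, tags):
        if tag == "KEEP":
            output.append(char)
        elif tag == "DELETE":
            continue  # drop this character
        elif tag.startswith("REPLACE_"):
            output.append(tag[len("REPLACE_"):])  # substitute the character
        elif tag.startswith("APPEND_"):
            output.append(char + tag[len("APPEND_"):])  # keep it, then insert
        else:
            raise ValueError(f"unknown tag: {tag}")
    return "".join(output)


# Hand-written tags standing in for model predictions.
fixed_insert = apply_edits(
    "speling", ["KEEP", "KEEP", "KEEP", "APPEND_l", "KEEP", "KEEP", "KEEP"])
fixed_swap = apply_edits(
    "recieve", ["KEEP", "KEEP", "KEEP", "REPLACE_e", "REPLACE_i", "KEEP", "KEEP"])
fixed_delete = apply_edits(
    "helllo", ["KEEP", "KEEP", "KEEP", "KEEP", "DELETE", "KEEP"])
print(fixed_insert, fixed_swap, fixed_delete)
```

Because each label describes an edit to a single character, the label space stays small compared with predicting whole output tokens.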

Keywords: spelling error correction, text spelling error, hierarchical character tagger

Capturing Speaker Incorrectness: Speaker-Focused Post-Correction for Abstractive Dialogue Summarization

Association for Computational Linguistics, 2021 (article)

In this paper, we focus on improving the quality of the summaries generated by neural abstractive dialogue summarization systems. Even though pre-trained language models generate well-constructed and promising results, it is still challenging to summarize a conversation among multiple participants, since the summary should include a description of the overall situation and the actions of each speaker. This paper proposes self-supervised strategies for speaker-focused post-correction in abstractive dialogue summarization. Specifically, our model first discriminates which type of speaker correction is required in a draft summary and then generates a revised summary according to the required type. Experimental results show that our proposed method adequately corrects the draft summaries, and the revised summaries are significantly improved in both quantitative and qualitative evaluations.

Keywords: abstractive dialogue summarization, capturing speaker incorrectness, abstractive dialogue

The Post-Editing Workflow: Training Challenges for LSPs, Post-Editors and Academia

Association for Computational Linguistics, 2021 (article)

Language technology has already been widely adopted by most Language Service Providers (LSPs) and integrated into their traditional translation processes. In this context, there are many different approaches to applying Post-Editing (PE) to machine-translated text, involving different workflow processes and steps that can be more or less effective and favorable. In the present paper, we propose a 3-step Post-Editing Workflow (PEW). Drawing on industry insight, this paper aims to provide a basic framework for LSPs and Post-Editors on how to streamline Post-Editing workflows in order to improve quality, achieve higher profitability and a better return on investment, and standardize and facilitate internal processes in terms of management and linguist effort when it comes to PE services. We argue that a comprehensive PEW consists of three essential tasks: Pre-Editing, Post-Editing and Annotation/Machine Translation (MT) evaluation processes (Guerrero, 2018), supported by three essential roles: Pre-Editor, Post-Editor and Annotator (Gene, 2020). Furthermore, the present paper demonstrates the training challenges arising from this PEW, supported by empirical research results, as reflected in a digital survey among language industry professionals (Gene, 2020), which was conducted in the context of a Post-Editing Webinar. Its sample comprised 51 representatives of LSPs and 12 representatives of SLVs (Single Language Vendors).

Keywords: language service providers, post-editors and academia, post-editing workflow

Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Association for Computational Linguistics, 2021 (article)

Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes in the corpora. In this paper, we consider the issue of deduplication in the presence of optical character recognition (OCR) errors. We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset and 96,635 texts from the HathiTrust Library. We demonstrate that improvements in language models now enable the detection and correction of OCR errors without consideration of the scanning image itself. The inconsistencies found by aligning pairs of scans of the same underlying work provide training data to build models for detecting and correcting errors. We identify the canonical version for each of 17,136 repeatedly-scanned books from 58,808 scans. Finally, we investigate methods to detect and correct errors in single-copy texts. We show that on average, our method corrects over six times as many errors as it introduces. We also provide an analysis of the relation between scanning quality and other factors such as location and publication year.
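The idea of mining inconsistencies between two scans of the same work can be sketched with Python's standard `difflib` module. This is an illustrative approximation, not the authors' alignment method: it aligns whitespace-separated tokens and collects one-to-one token disagreements. The function name `scan_disagreements` and the sample text are assumptions made for the example.

```python
from difflib import SequenceMatcher


def scan_disagreements(scan_a: str, scan_b: str) -> list[tuple[str, str]]:
    """Align two scans token by token and return the tokens on which they differ."""
    tokens_a, tokens_b = scan_a.split(), scan_b.split()
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        # Only keep equal-length replacements so tokens can be paired one-to-one.
        if op == "replace" and (a_end - a_start) == (b_end - b_start):
            pairs.extend(zip(tokens_a[a_start:a_end], tokens_b[b_start:b_end]))
    return pairs


# Two hypothetical scans of the same sentence, each with a different OCR error.
scan_a = "It was the best of tirnes it was the worst of times"
scan_b = "It was the best of times it was the w0rst of times"
disagreements = scan_disagreements(scan_a, scan_b)
print(disagreements)
```

A disagreement alone does not say which scan is right. Deciding that would need a further signal, for example a language model or a vote across more than two scans.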

Keywords: cleaning dirty books, previously scanned texts, processing for previously
