A Novel Machine Learning Based Approach for Post-OCR Error Detection

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Abstract

Post-processing is the most conventional approach for correcting errors caused by Optical Character Recognition (OCR) systems. OCR errors are usually corrected in two steps: detection and correction. For the first task, supervised machine learning methods have shown state-of-the-art performance. Previously proposed approaches have focused most prominently on combining lexical, contextual and statistical features for detecting errors. In this study, we report a novel system for error detection which is based merely on the n-gram counts of a candidate token. In addition to being simple and computationally less expensive, our proposed system beats previous systems reported in the ICDAR2019 competition on OCR-error detection by notable margins. We achieved state-of-the-art F1-scores for eight out of the ten involved European languages. The maximum improvement is for Spanish, which improved from 0.69 to 0.90, and the minimum is for Polish, from 0.82 to 0.84.
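To make the idea of detecting errors from n-gram counts concrete, here is a minimal sketch. It is not the authors' system: the abstract does not say which n-grams are counted, where the counts come from, or which classifier is trained on them. The sketch assumes character bigrams and trigrams counted over a small list of clean tokens, and it uses a simple threshold rule in place of a supervised classifier. All function names (`char_ngrams`, `build_counts`, `ngram_features`, `looks_erroneous`) and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def char_ngrams(token, n):
    """Return the character n-grams of a token, padded with boundary marks."""
    padded = f"^{token.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_counts(clean_tokens, n_values=(2, 3)):
    """Count character n-grams over a list of tokens assumed to be error-free."""
    counts = Counter()
    for tok in clean_tokens:
        for n in n_values:
            counts.update(char_ngrams(tok, n))
    return counts


def ngram_features(token, counts, n_values=(2, 3)):
    """Feature vector for one candidate token: for each n, the minimum and
    the mean reference count of the token's n-grams."""
    feats = []
    for n in n_values:
        values = [counts[g] for g in char_ngrams(token, n)] or [0]
        feats.append(min(values))
        feats.append(sum(values) / len(values))
    return feats


def looks_erroneous(token, counts, min_count=1):
    """Assumed stand-in for a trained classifier: flag the token if any of
    its bigrams or trigrams is rarer than min_count in the reference counts."""
    min_bigram, _, min_trigram, _ = ngram_features(token, counts)
    return min_bigram < min_count or min_trigram < min_count


if __name__ == "__main__":
    reference = ["the", "quick", "brown", "fox", "jumps",
                 "over", "lazy", "dog", "there", "then"]
    counts = build_counts(reference)
    for candidate in ["the", "tbe"]:
        print(candidate, looks_erroneous(candidate, counts))
```

In a supervised setting, the vector returned by `ngram_features` would be fed to a classifier trained on tokens labeled as correct or erroneous. The threshold rule only illustrates why rare n-grams are a useful signal.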

References used

https://aclanthology.org/

Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Association for Computational Linguistics, 2021 (article)

Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes in the corpora. In this paper, we consider the issue of deduplication in the presence of optical character recognition (OCR) errors. We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset and 96,635 texts from the HathiTrust Library. We demonstrate that improvements in language models now enable the detection and correction of OCR errors without consideration of the scanning image itself. The inconsistencies found by aligning pairs of scans of the same underlying work provide training data to build models for detecting and correcting errors. We identify the canonical version for each of 17,136 repeatedly-scanned books from 58,808 scans. Finally, we investigate methods to detect and correct errors in single-copy texts. We show that on average, our method corrects over six times as many errors as it introduces. We also provide an analysis of the relation between scanning quality and other factors such as location and publication year.

Mistake Captioning: A Machine Learning Approach for Detecting Mistakes and Generating Instructive Feedback

Association for Computational Linguistics, 2021 (article)

Giving feedback to students is not just about marking their answers as correct or incorrect, but also about finding the mistakes in their thought process that led them to an incorrect answer. In this paper, we introduce a machine learning technique for mistake captioning, a task that attempts to identify mistakes and provide feedback meant to help learners correct these mistakes. We do this by training a sequence-to-sequence network to generate this feedback based on domain experts. To evaluate this system, we explore how it can be used on a Linguistics assignment studying Grimm's Law. We show that our approach generates feedback that outperforms a baseline on a set of automated NLP metrics. In addition, we perform a series of case studies in which we examine successful and unsuccessful system outputs.

Ranking Online Reviews Based on Their Helpfulness: An Unsupervised Approach

Association for Computational Linguistics, 2021 (article)

Online reviews are an essential aspect of online shopping for both customers and retailers. However, many reviews found on the Internet lack quality, informativeness or helpfulness. In many cases, they lead customers towards positive or negative opinions without providing any concrete details (e.g., "very poor product, I would not recommend it"). In this work, we propose a novel unsupervised method for quantifying helpfulness that leverages the availability of a corpus of reviews. In particular, our method exploits three characteristics of the reviews, viz., relevance, emotional intensity and specificity, towards quantifying helpfulness. We perform three rankings (one for each feature above), which are then combined to obtain a final helpfulness ranking. To evaluate our method empirically, we use reviews from four product categories on Amazon. The experimental evaluation demonstrates the effectiveness of our method in comparison to a recent state-of-the-art baseline.

Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis

Association for Computational Linguistics, 2021 (article)

This paper offers a comparative evaluation of four commercial ASR systems, which are evaluated according to the post-editing effort required to reach "publishable" quality and according to the number of errors they produce. For the error annotation task, an original error typology for transcription errors is proposed. This study also seeks to examine whether there is a difference in the performance of these systems between native and non-native English speakers. The experimental results suggest that among the four systems, Trint obtains the best scores. It is also observed that most systems perform noticeably better with native speakers and that all systems are most prone to fluency errors.

Cross-Lingual Transfer Learning for Hate Speech Detection

Association for Computational Linguistics, 2021 (article)

We address the task of automatic hate speech detection for low-resource languages. Rather than collecting and annotating new hate speech data, we show how to use cross-lingual transfer learning to leverage already existing data from higher-resource languages. Using classifiers based on bilingual word embeddings, we achieve good performance on the target language by training only on the source dataset. Using our transferred system, we bootstrap on unlabeled target language data, improving the performance of standard cross-lingual transfer approaches. We use English as a high-resource language and German as the target language, for which only a small amount of annotated corpora is available. Our results indicate that cross-lingual transfer learning, together with our approach to leveraging additional unlabeled data, is an effective way of achieving good performance on low-resource target languages without the need for any target-language annotations.
