
Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts


Publication date: 2021
Language: English

Substantial work is required to clean large collections of digitized books for NLP analysis, both because of errors in the scanned text and because of duplicate volumes in the corpora. In this paper, we consider the issue of deduplication in the presence of optical character recognition (OCR) errors. We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset and 96,635 texts from the HathiTrust Library. We demonstrate that improvements in language models now enable the detection and correction of OCR errors without consideration of the scanned image itself. The inconsistencies found by aligning pairs of scans of the same underlying work provide training data to build models for detecting and correcting errors. We identify the canonical version of each of 17,136 repeatedly scanned books from 58,808 scans. Finally, we investigate methods to detect and correct errors in single-copy texts. We show that on average, our method corrects over six times as many errors as it introduces. We also analyze the relationship between scanning quality and factors such as location and publication year.
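
As a rough sketch of the alignment idea, the snippet below uses Python's difflib to align two hypothetical scans of the same sentence and collect the spans where they disagree. The function name and example strings are invented for illustration; this is not the paper's implementation.

```python
# Illustrative sketch: harvesting candidate error pairs from two OCR
# scans of the same underlying work by aligning them token-by-token.
from difflib import SequenceMatcher

def harvest_inconsistencies(scan_a: str, scan_b: str):
    """Align two scans and return (a_span, b_span) pairs wherever the
    scans disagree; these disagreements are candidate OCR errors that
    can serve as training data for detection/correction models."""
    tokens_a, tokens_b = scan_a.split(), scan_b.split()
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":  # the two scans read this span differently
            pairs.append((" ".join(tokens_a[i1:i2]),
                          " ".join(tokens_b[j1:j2])))
    return pairs

print(harvest_inconsistencies(
    "It was the best of tirnes",   # classic rn/m OCR confusion
    "It was the best of times",
))
# -> [('tirnes', 'times')]
```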

Related research

Post-processing is the most common approach for correcting errors produced by optical character recognition (OCR) systems. Correction usually proceeds in two steps: detection and correction. For the detection task, supervised machine learning methods have shown state-of-the-art performance. Previously proposed approaches have focused most prominently on combining lexical, contextual, and statistical features for detecting errors. In this study, we report a novel error-detection system based solely on the n-gram counts of a candidate token. In addition to being simple and computationally inexpensive, our proposed system beats previous systems reported in the ICDAR 2019 competition on OCR error detection by notable margins. We achieved state-of-the-art F1-scores for eight of the ten European languages involved. The maximum improvement is for Spanish, which improved from 0.69 to 0.90, and the minimum is for Polish, which improved from 0.82 to 0.84.
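
A minimal sketch of count-based detection, assuming a unigram frequency table stands in for real n-gram counts: a candidate token is flagged when its corpus count falls below a threshold. The toy corpus, function name, and threshold are invented and do not reproduce the competition system.

```python
# Toy count-based OCR error detection: rare tokens are suspicious.
from collections import Counter

# Hypothetical background counts; a real system would use a large corpus.
background = Counter(
    "the quick brown fox jumps over the lazy dog the fox".split()
)

def is_suspicious(token: str, counts: Counter, min_count: int = 1) -> bool:
    """Flag a candidate token whose corpus count is below the threshold."""
    return counts[token.lower()] < min_count

for tok in "the qui1ck brown f0x".split():
    print(tok, "->", "ERROR?" if is_suspicious(tok, background) else "ok")
```
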
In this paper, we place ourselves in a classification scenario in which the target classes and data type are not accessible during training. We use a meta-learning approach to determine whether meta-trained information from common social network data with fine-grained emotion labels can achieve competitive performance on messages labeled with different emotion categories. We leverage few-shot learning to match the classification scenario and consider metric-learning-based meta-learning, setting up Prototypical Networks with a Transformer encoder trained in an episodic fashion. This approach proves effective for capturing meta-information from a source emotional tag set to predict previously unseen emotional tags. Even though shifting the data type triggers an expected performance drop, our meta-learning approach achieves decent results compared to the fully supervised one.
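
The metric-learning core of a Prototypical Network can be sketched as follows: each class prototype is the mean embedding of its support examples, and a query is assigned to the nearest prototype. Here random vectors stand in for the Transformer encoder's output, and all names are illustrative.

```python
# Nearest-prototype classification over pre-computed embeddings.
import numpy as np

def prototypes(support):
    """support: dict mapping each emotion label to an (n_examples, dim)
    array of support embeddings; returns label -> mean embedding."""
    return {label: emb.mean(axis=0) for label, emb in support.items()}

def classify(query, protos):
    """Assign the query embedding to the nearest prototype (Euclidean)."""
    return min(protos, key=lambda lab: np.linalg.norm(query - protos[lab]))

rng = np.random.default_rng(0)
support = {"joy": rng.normal(0.0, 1.0, (5, 8)),     # toy embeddings
           "anger": rng.normal(3.0, 1.0, (5, 8))}
protos = prototypes(support)
print(classify(rng.normal(3.0, 1.0, 8), protos))    # expected: "anger"
```
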
The aim of this clinical study is to determine the efficiency of miswak (cleaning or chewing sticks) compared with a toothbrush in removing dental plaque from facial, lingual, and interproximal surfaces and in reducing gingivitis. A total of 56 dental students, divided into two groups, were included in the study, which proceeded in two stages. In the first stage, after experimental plaque accumulation, the volunteers cleaned their teeth (with toothbrush or miswak) for five minutes, and clinical measurements were recorded. In the second stage, the volunteers used, according to their group, either a toothbrush and dentifrice or a miswak twice a day for five minutes over three weeks. In the first stage, there was no difference between the two groups in gingivitis. The values of Turesky's index for plaque on the facial and lingual surfaces were slightly higher for miswak users, but the difference was statistically insignificant, while the cleaning of interproximal spaces, both facial and lingual, was more evident in toothbrush users, and the difference was significant (P < 0.01). Neither miswak nor toothbrush produced complete interproximal dental health, and cleaning of facial surfaces with either toothbrush or miswak was better than that of lingual surfaces (P < 0.05). In the second stage, plaque and gingival values were higher in miswak users, but the difference was not significant. In conclusion, miswak cannot remove dental plaque completely, but plaque levels on the facial and lingual surfaces were similar to those among toothbrush users. The toothbrush showed clear superiority in cleaning interproximal spaces, making it the oral hygiene aid of choice.
The use of Named Entity Recognition (NER) on archaic Arabic texts is steadily increasing. However, most tools have been developed for modern English or trained on English-language documents, and they perform poorly on historical Arabic text. Even Arabic NER tools are often trained on modern web-sourced text, making their fit for a historical task questionable. To mitigate the scarcity of historical Arabic NER resources, we propose a dynamic ensemble model utilizing several learners. The dynamic aspect is achieved through predictors and features, computed over the NER algorithms' results, that identify in real time which learners have performed better on a specific task. We evaluate our approach against state-of-the-art Arabic NER and static ensemble methods on a novel historical Arabic NER task we have created. Our results show that our approach improves upon the state of the art and reaches a 0.8 F-score on this challenging task.
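
The dynamic-ensemble idea can be sketched as selecting, per input, the output of the learner with the highest quality estimate, rather than taking a fixed vote. The toy taggers and hard-coded scores below stand in for the learned predictors described above; all names are invented.

```python
# Per-input (dynamic) selection among several toy NER taggers.
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], Tuple[List[str], float]]

def capitalization_tagger(tokens: List[str]) -> Tuple[List[str], float]:
    # Toy rule: capitalized tokens are person names.
    tags = ["B-PER" if t[:1].isupper() else "O" for t in tokens]
    # Toy quality estimate: fraction of tokens the rule can judge at all.
    return tags, sum(t[:1].isalpha() for t in tokens) / len(tokens)

def abstaining_tagger(tokens: List[str]) -> Tuple[List[str], float]:
    return ["O"] * len(tokens), 0.5  # conservative fallback

def dynamic_ensemble(tokens: List[str], taggers: List[Tagger]) -> List[str]:
    """Run every tagger; keep the output whose quality estimate is
    highest for this particular input."""
    outputs = [tagger(tokens) for tagger in taggers]
    best_tags, _ = max(outputs, key=lambda out: out[1])
    return best_tags

print(dynamic_ensemble("Ibn Khaldun wrote histories".split(),
                       [capitalization_tagger, abstaining_tagger]))
# -> ['B-PER', 'B-PER', 'O', 'O']
```
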
Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, errors that degrade the performance of NLP systems. Correcting them manually is time-consuming, and most automatic approaches have relied on rules or supervised machine learning. We build on previous work on fully automatic, unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to perform OCR error correction designed for English, and we adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
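
One common way to prepare character-level training pairs for such a sequence-to-sequence corrector is to space-separate the characters of each noisy/clean line pair, encoding real spaces with a placeholder token. The sketch below illustrates the idea; the authors' exact preprocessing may differ, and the example strings are invented.

```python
# Turn line pairs into character sequences for a char-level seq2seq model.
def to_char_seq(line: str) -> str:
    """Space-separate characters, encoding real spaces as '_' so they
    survive as standalone tokens."""
    return " ".join("_" if c == " " else c for c in line)

noisy, clean = "the qnick brown f0x", "the quick brown fox"
print(to_char_seq(noisy))   # source side of the training pair
print(to_char_seq(clean))   # target side of the training pair
```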
