Convolutional Recurrent Neural Networks (CRNNs) excel at scene text recognition. Unfortunately, they are prone to vanishing/exploding gradient problems when processing long text images, which are commonly found in scanned documents. This poses a major challenge to the goal of completely solving the Optical Character Recognition (OCR) problem. Inspired by recently proposed memory-augmented neural networks (MANNs) for long-term sequential modeling, we present a new architecture dubbed Convolutional Multi-way Associative Memory (CMAM) to tackle the limitations of current CRNNs. By leveraging recent memory access mechanisms in MANNs, our architecture demonstrates superior performance against other CRNN counterparts on three real-world long-text OCR datasets.
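As a point of reference, the following is a minimal sketch of the standard CRNN pipeline (convolutional features, bidirectional LSTM, CTC head) whose recurrent stage is the source of the gradient problems described above; the layer sizes and `num_classes` are illustrative assumptions, not CMAM's configuration.

```python
# Minimal CRNN sketch (CNN -> BiLSTM -> CTC head) in PyTorch.
# Layer sizes and num_classes are illustrative assumptions, not CMAM's config.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes=80, hidden=256):
        super().__init__()
        # Convolutional feature extractor: collapses height, keeps width as time.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # -> (B, 256, 1, W')
        )
        # Recurrent stage: this is where long sequences destabilize gradients.
        self.rnn = nn.LSTM(256, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)  # per-step CTC logits

    def forward(self, images):                  # images: (B, 1, H, W)
        f = self.cnn(images).squeeze(2)         # (B, 256, W')
        f = f.transpose(1, 2)                   # (B, W', 256), width as time
        out, _ = self.rnn(f)                    # (B, W', 2*hidden)
        return self.head(out).log_softmax(-1)   # CTC expects log-probs

logits = CRNN()(torch.randn(2, 1, 32, 320))     # (2, W', num_classes)
```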
Many studies on (Offline) Handwritten Text Recognition (HTR) systems have focused on building state-of-the-art models for line recognition on small corpora. However, adding HTR capability to a large scale multilingual OCR system poses new challenges. This paper addresses three problems in building such systems: data, efficiency, and integration. Firstly, one of the biggest challenges is obtaining sufficient amounts of high quality training data. We address the problem by using online handwriting data collected for a large scale production online handwriting recognition system. We describe our image data generation pipeline and study how online data can be used to build HTR models. We show that this data significantly improves the models when only a small number of real images is available, which is usually the case for HTR models, and enables us to support new scripts at substantially lower cost. Secondly, we propose a line recognition model based on neural networks without recurrent connections. The model achieves accuracy comparable to LSTM-based models while allowing for better parallelism in training and inference. Finally, we present a simple way to integrate HTR models into an OCR system. Together, these constitute a solution for bringing HTR capability into a large scale OCR system.
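As an illustration of how online handwriting can feed an image-based pipeline, here is a minimal sketch that rasterizes ink strokes (lists of (x, y) points) into a grayscale line image with Pillow; the stroke format, canvas size, and pen width are assumptions for illustration, not the paper's actual rendering pipeline.

```python
# Sketch: rasterize online handwriting (lists of (x, y) stroke points)
# into a grayscale line image. Stroke format, canvas size, and pen width
# are illustrative assumptions, not the paper's actual pipeline.
from PIL import Image, ImageDraw

def render_ink(strokes, height=64, pad=4, pen=2):
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    scale = (height - 2 * pad) / max(max(ys) - min(ys), 1)
    width = int((max(xs) - min(xs)) * scale) + 2 * pad
    img = Image.new("L", (max(width, 1), height), color=255)
    draw = ImageDraw.Draw(img)
    for stroke in strokes:
        pts = [(pad + (x - min(xs)) * scale,
                pad + (y - min(ys)) * scale) for x, y in stroke]
        if len(pts) > 1:
            draw.line(pts, fill=0, width=pen)  # dark ink on white background
    return img

# Two strokes forming a rough "T"; in practice strokes come from a pen device.
img = render_ink([[(0, 0), (40, 0)], [(20, 0), (20, 60)]])
```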
The CNN is a popular model for image analysis and can therefore be utilized to recognize handwritten digits from the MNIST dataset. To achieve higher recognition accuracy, we train CNN models with different fully connected layer sizes to figure out the relationship between the fully connected layer size and recognition accuracy. Inspired by previous pruning work, we apply distinctiveness-based pruning to CNN models and compare the pruning performance with NN models. To improve pruning performance on CNNs, we explore the effect of the angle threshold on pruning performance. The evaluation results show that: for the fully connected layer size there is a threshold, such that as the layer size increases, recognition accuracy grows while the size is below the threshold and falls once it exceeds the threshold; pruning performs worse on CNNs than on NNs; and as the pruning angle threshold increases, both the fully connected layer size and the recognition accuracy decrease. This paper also shows that CNN models trained on the MNIST dataset are capable of handwritten digit recognition and achieve the highest recognition accuracy with a fully connected layer size of 400. In addition, on the same MNIST dataset, CNN models outperform the big, deep, simple NN models of a previously published paper.
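To make the pruning criterion concrete, the following sketch applies distinctiveness pruning in its classic form: hidden units are compared by the angle between their activation vectors over the data, and near-duplicate units are merged. The 15-degree default and the merge rule are assumptions taken from the general method, not this paper's exact settings.

```python
# Sketch of distinctiveness pruning: compare hidden units by the angle
# between their activation vectors over a dataset; merge near-duplicates.
# The 15-degree default and merge rule are assumptions from the classic
# method, not this paper's exact settings.
import numpy as np

def distinctiveness_prune(acts, w_out, angle_deg=15.0):
    """acts: (patterns, units) activations in [0, 1];
    w_out: (units, outputs) outgoing weights of the layer."""
    v = acts - 0.5                      # center activations so angles are meaningful
    keep = list(range(v.shape[1]))
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            a, b = v[:, keep[i]], v[:, keep[j]]
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
            if angle < angle_deg:       # near-duplicate units: merge j into i
                w_out[keep[i]] += w_out[keep[j]]
                keep.pop(j)
            else:
                j += 1
        i += 1
    return keep, w_out[keep]            # surviving units and their weights
```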
Line segmentation from handwritten text images is a challenging task due to diversity and unknown variations such as undefined spaces, styles, orientations, stroke heights, overlapping, and alignments. Despite abundant research, improvement is still needed to achieve robustness and higher segmentation rates. In the present work, an adaptive approach is used for line segmentation from handwritten text images, merging the alignment of connected-component coordinates with the text height. A mathematical justification is provided for measuring the text height relative to the image size; the novelty of the work lies in calculating the text height dynamically. The experiments are run on a dataset provided by a Chinese company for the project. The proposed scheme is tested on two different types of datasets: document pages with baselines and plain pages. The dataset is highly complex and consists of abundant and uncommon variations in handwriting patterns. The performance of the proposed method is evaluated on our datasets, achieving a 98.01% detection rate on average, as well as on the benchmark IAM and ICDAR09 datasets, where it achieves 91.99% and 96% detection rates, respectively.
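A minimal sketch of the connected-component idea follows: label components, estimate the text height from their bounding boxes, and group components whose vertical centers fall within that height. The binarization threshold and grouping rule are illustrative assumptions, not the paper's adaptive formulation.

```python
# Sketch of connected-component line segmentation: estimate text height
# from component bounding boxes, then group components whose vertical
# centers are closer than that height. Thresholds are illustrative
# assumptions, not the paper's adaptive formulation.
import numpy as np
from scipy import ndimage

def segment_lines(gray):                      # gray: 2-D uint8 image
    binary = gray < 128                       # dark ink on light paper
    labels, n = ndimage.label(binary)
    boxes = ndimage.find_objects(labels)
    comps = [(0.5 * (sl[0].start + sl[0].stop), sl) for sl in boxes if sl]
    heights = [sl[0].stop - sl[0].start for _, sl in comps]
    text_h = np.median(heights)               # dynamic text-height estimate
    lines = []
    for cy, sl in sorted(comps, key=lambda c: c[0]):   # sweep top to bottom
        if lines and cy - lines[-1][-1][0] < text_h:
            lines[-1].append((cy, sl))        # same line: centers are close
        else:
            lines.append([(cy, sl)])          # start a new line
    return lines
```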
In this paper, we study the problem of text line recognition. Unlike most approaches, which target specific domains such as scene text or handwritten documents, we investigate the general problem of developing a universal architecture that can extract text from any image, regardless of source or input modality. We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs), and conduct extensive experiments to compare their accuracy and performance on widely used public datasets of scene and handwritten text. We find that a combination that has so far received little attention in the literature, namely a Self-Attention encoder coupled with the CTC decoder, when combined with an external language model and trained on both public and internal data, outperforms all the others in accuracy and computational complexity. Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length, a requirement for universal line recognition. Using an internal dataset collected from multiple sources, we also expose the limitations of current public datasets in evaluating the accuracy of line recognizers: their relatively narrow image-width and sequence-length distributions do not allow one to observe the quality degradation of the Transformer approach when it is applied to the transcription of long lines.
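To make the winning combination concrete, here is a minimal PyTorch sketch of a Self-Attention encoder with a CTC head over per-column visual features; the dimensions, vocabulary size, and upstream feature extraction are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of the Self-Attention encoder + CTC decoder combination:
# a Transformer encoder over per-column visual features, then a linear
# CTC head. Dimensions and vocab size are illustrative assumptions.
import torch
import torch.nn as nn

class SelfAttnCTC(nn.Module):
    def __init__(self, feat_dim=256, num_classes=100, nhead=4, layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(feat_dim, num_classes)  # class 0 = CTC blank

    def forward(self, feats):            # feats: (B, T, feat_dim), any T
        h = self.encoder(feats)          # contextualize across the full width
        return self.head(h).log_softmax(-1)

model = SelfAttnCTC()
log_probs = model(torch.randn(2, 120, 256))           # (2, 120, 100)
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 100, (2, 20))              # labels exclude blank
loss = ctc(log_probs.transpose(0, 1),                 # CTC wants (T, B, C)
           targets,
           input_lengths=torch.full((2,), 120, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
```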
We investigate a new method to augment recurrent neural networks with extra memory without increasing the number of network parameters. The system has an associative memory based on complex-valued vectors and is closely related to Holographic Reduced Representations and Long Short-Term Memory networks. Holographic Reduced Representations have limited capacity: as they store more information, each retrieval becomes noisier due to interference. Our system, in contrast, creates redundant copies of stored information, which enables retrieval with reduced noise. Experiments demonstrate faster learning on multiple memorization tasks.
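A minimal numpy sketch of the underlying Holographic Reduced Representation operations follows, including the redundant-copy averaging that the abstract credits with reducing retrieval noise; the vector size, copy count, and noise model are illustrative assumptions.

```python
# Sketch of Holographic Reduced Representations: bind a key and value by
# circular convolution (elementwise product in the Fourier domain) and
# retrieve by circular correlation. Storing redundant copies of the value
# under independent keys and averaging the retrievals reduces interference
# noise, as the abstract describes. Sizes and counts are illustrative.
import numpy as np

def bind(key, value):    # circular convolution
    return np.fft.ifft(np.fft.fft(key) * np.fft.fft(value)).real

def unbind(memory, key): # circular correlation (approximate inverse)
    return np.fft.ifft(np.conj(np.fft.fft(key)) * np.fft.fft(memory)).real

rng = np.random.default_rng(0)
d, copies = 1024, 4
unit = lambda: rng.normal(size=d) / np.sqrt(d)
value, keys = unit(), [unit() for _ in range(copies)]
# Each copy holds the value plus 20 interfering key-value pairs of its own.
memories = [bind(k, value) + sum(bind(unit(), unit()) for _ in range(20))
            for k in keys]
single = unbind(memories[0], keys[0])                       # noisy retrieval
avg = np.mean([unbind(m, k) for m, k in zip(memories, keys)], axis=0)
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(single, value), cos(avg, value))  # averaged read is cleaner
```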