ﻻ يوجد ملخص باللغة العربية
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.
Some historical and more recent printed documents have been scanned or stored at very low resolutions, such as 60 dpi. Though such scans are relatively easy for humans to read, they still present significant challenges for optical character recogniti
Automatic affect recognition is a challenging task due to the various modalities emotions can be expressed with. Applications can be found in many domains including multimedia retrieval and human computer interaction. In recent years, deep neural net
Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ traditional two-stage shallow pipeline, i.e. extracting visual and/or audio features and training classifiers. In this pa
Leveraging the advances of natural language processing, most recent scene text recognizers adopt an encoder-decoder architecture where text images are first converted to representative features and then a sequence of characters via `direct decoding.
The HGR is a quite challenging task as its performance is influenced by various aspects such as illumination variations, cluttered backgrounds, spontaneous capture, etc. The conventional CNN networks for HGR are following two stage pipeline to deal w