أوراق بحثية, رسائل ماجستير ودكتوراه حول الرؤية الحاسوبية وتمييز الأنماط

Fast Segmentation of Left Ventricle in CT Images by Explicit Shape Regression using Random Pixel Difference Features

244 - Peng Sun , Haoyin Zhou , Devon Lundine 2015

Recently, machine learning has been successfully applied to model-based left ventricle (LV) segmentation. The general framework involves two stages, which starts with LV localization and is followed by boundary delineation. Both are driven by supervi sed learning techniques. When compared to previous non-learning-based methods, several advantages have been shown, including full automation and improved accuracy. However, the speed is still slow, in the order of several seconds, for applications involving a large number of cases or case loads requiring real-time performance. In this paper, we propose a fast LV segmentation algorithm by joint localization and boundary delineation via training explicit shape regressor with random pixel difference features. Tested on 3D cardiac computed tomography (CT) image volumes, the average running time of the proposed algorithm is 1.2 milliseconds per case. On a dataset consisting of 139 CT volumes, a 5-fold cross validation shows the segmentation error is $1.21 pm 0.11$ for LV endocardium and $1.23 pm 0.11$ millimeters for epicardium. Compared with previous work, the proposed method is more stable (lower standard deviation) without significant compromise to the accuracy.

الرؤية الحاسوبية وتمييز الأنماط

Face Search at Scale: 80 Million Gallery

387 - Dayong Wang , Charles Otto , Anil K. Jain 2015

Due to the prevalence of social media websites, one challenge facing computer vision researchers is to devise methods to process and search for persons of interest among the billions of shared photos on these websites. Facebook revealed in a 2013 whi te paper that its users have uploaded more than 250 billion photos, and are uploading 350 million new photos each day. Due to this humongous amount of data, large-scale face search for mining web images is both important and challenging. Despite significant progress in face recognition, searching a large collection of unconstrained face images has not been adequately addressed. To address this challenge, we propose a face search system which combines a fast search procedure, coupled with a state-of-the-art commercial off the shelf (COTS) matcher, in a cascaded framework. Given a probe face, we first filter the large gallery of photos to find the top-k most similar faces using deep features generated from a convolutional neural network. The k candidates are re-ranked by combining similarities from deep features and the COTS matcher. We evaluate the proposed face search system on a gallery containing 80 million web-downloaded face images. Experimental results demonstrate that the deep features are competitive with state-of-the-art methods on unconstrained face recognition benchmarks (LFW and IJB-A). Further, the proposed face search system offers an excellent trade-off between accuracy and scalability on datasets consisting of millions of images. Additionally, in an experiment involving searching for face images of the Tsarnaev brothers, convicted of the Boston Marathon bombing, the proposed face search system could find the younger brothers (Dzhokhar Tsarnaev) photo at rank 1 in 1 second on a 5M gallery and at rank 8 in 7 seconds on an 80M gallery.

الرؤية الحاسوبية وتمييز الأنماط

Efficient Face Alignment via Locality-constrained Representation for Robust Recognition

416 - Yandong Wen , Weiyang Liu , Meng Yang 2015

Practical face recognition has been studied in the past decades, but still remains an open challenge. Current prevailing approaches have already achieved substantial breakthroughs in recognition accuracy. However, their performance usually drops dram atically if face samples are severely misaligned. To address this problem, we propose a highly efficient misalignment-robust locality-constrained representation (MRLR) algorithm for practical real-time face recognition. Specifically, the locality constraint that activates the most correlated atoms and suppresses the uncorrelated ones, is applied to construct the dictionary for face alignment. Then we simultaneously align the warped face and update the locality-constrained dictionary, eventually obtaining the final alignment. Moreover, we make use of the block structure to accelerate the derived analytical solution. Experimental results on public data sets show that MRLR significantly outperforms several state-of-the-art approaches in terms of efficiency and scalability with even better performance.

الرؤية الحاسوبية وتمييز الأنماط

Manitest: Are classifiers really invariant?

497 - Alhussein Fawzi , Pascal Frossard 2015

Invariance to geometric transformations is a highly desirable property of automatic classifiers in many image recognition tasks. Nevertheless, it is unclear to which extent state-of-the-art classifiers are invariant to basic transformations such as r otations and translations. This is mainly due to the lack of general methods that properly measure such an invariance. In this paper, we propose a rigorous and systematic approach for quantifying the invariance to geometric transformations of any classifier. Our key idea is to cast the problem of assessing a classifiers invariance as the computation of geodesics along the manifold of transformed images. We propose the Manitest method, built on the efficient Fast Marching algorithm to compute the invariance of classifiers. Our new method quantifies in particular the importance of data augmentation for learning invariance from data, and the increased invariance of convolutional neural networks with depth. We foresee that the proposed generic tool for measuring invariance to a large class of geometric transformations and arbitrary classifiers will have many applications for evaluating and comparing classifiers based on their invariance, and help improving the invariance of existing classifiers.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي التعلم الالي

Deep Fishing: Gradient Features from Deep Nets

716 - Albert Gordo , Adrien Gaidon , Florent Perronnin 2015

Convolutional Networks (ConvNets) have recently improved image recognition performance thanks to end-to-end learning of deep feed-forward models from raw pixels. Deep learning is a marked departure from the previous state of the art, the Fisher Vecto r (FV), which relied on gradient-based encoding of local hand-crafted features. In this paper, we discuss a novel connection between these two approaches. First, we show that one can derive gradient representations from ConvNets in a similar fashion to the FV. Second, we show that this gradient representation actually corresponds to a structured matrix that allows for efficient similarity computation. We experimentally study the benefits of transferring this representation over the outputs of ConvNet layers, and find consistent improvements on the Pascal VOC 2007 and 2012 datasets.

الرؤية الحاسوبية وتمييز الأنماط

Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

407 - Serena Yeung , Olga Russakovsky , Ning Jin 2015

Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we exte nd the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory (LSTM) deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.

الرؤية الحاسوبية وتمييز الأنماط

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

346 - Baoguang Shi , Xiang Bai , Cong Yao 2015

Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognit ion. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

الرؤية الحاسوبية وتمييز الأنماط

Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries

423 - Yuke Zhu , Ce Zhang , Christopher Re 2015

The complexity of the visual world creates significant challenges for comprehensive visual understanding. In spite of recent successes in visual recognition, todays vision systems would still struggle to deal with visual queries that require a deeper reasoning. We propose a knowledge base (KB) framework to handle an assortment of visual queries, without the need to train new classifiers for new tasks. Building such a large-scale multimodal KB presents a major challenge of scalability. We cast a large-scale MRF into a KB representation, incorporating visual, textual and structured data, as well as their diverse relations. We introduce a scalable knowledge base construction system that is capable of building a KB with half billion variables and millions of parameters in a few hours. Our system achieves competitive results compared to purpose-built models on standard recognition and retrieval tasks, while exhibiting greater flexibility in answering richer visual queries.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

394 - Shicong Liu , Hongtao Lu 2015

We introduce a novel dictionary optimization method for high-dimensional vector quantization employed in approximate nearest neighbor (ANN) search. Vector quantization methods first seek a series of dictionaries, then approximate each vector by a sum of elements selected from these dictionaries. An optimal series of dictionaries should be mutually independent, and each dictionary should generate a balanced encoding for the target dataset. Existing methods did not explicitly consider this. To achieve these goals along with minimizing the quantization error (residue), we propose a novel dictionary optimization method called emph{Dictionary Annealing} that alternatively heats up a single dictionary by generating an intermediate dataset with residual vectors, cools down the dictionary by fitting the intermediate dataset, then extracts the new residual vectors for the next iteration. Better codes can be learned by DA for the ANN search tasks. DA is easily implemented on GPU to utilize the latest computing technology, and can easily extended to an online dictionary learning scheme. We show by experiments that our optimized dictionaries substantially reduce the overall quantization error. Jointly used with residual vector quantization, our optimized dictionaries lead to a better approximate nearest neighbor search performance compared to the state-of-the-art methods.

الرؤية الحاسوبية وتمييز الأنماط

Fine-grained Recognition Datasets for Biodiversity Analysis

444 - Erik Rodner , Marcel Simon , Gunnar Brehm 2015

In the following paper, we present and discuss challenging applications for fine-grained visual classification (FGVC): biodiversity and species analysis. We not only give details about two challenging new datasets suitable for computer vision researc h with up to 675 highly similar classes, but also present first results with localized features using convolutional neural networks (CNN). We conclude with a list of challenging new research directions in the area of visual classification for biodiversity research.

الرؤية الحاسوبية وتمييز الأنماط

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد