RAPiD: Rotation-Aware People Detection in Overhead Fisheye Images

121 0 0.0 ( 0 )

Download Cite

Added by Zhihao Duan

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Zhihao Duan - M. Ozan Tezcan - Hayato Nakamura

Computer Vision and Pattern Recognition

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Recent methods for people detection in overhead, fisheye images either use radially-aligned bounding boxes to represent people, assuming people always appear along image radius or require significant pre-/post-processing which radically increases computational complexity. In this work, we develop an end-to-end rotation-aware people detection method, named RAPiD, that detects people using arbitrarily-oriented bounding boxes. Our fully-convolutional neural network directly regresses the angle of each bounding box using a periodic loss function, which accounts for angle periodicities. We have also created a new dataset with spatio-temporal annotations of rotated bounding boxes, for people detection as well as other vision tasks in overhead fisheye videos. We show that our simple, yet effective method outperforms state-of-the-art results on three fisheye-image datasets. Code and dataset are available at http://vip.bu.edu/rapid .

rate research

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images

96 - Anna Konrad , Ciaran Eising , Ganesh Sistu 2021

Keypoint detection and description is a commonly used building block in computer vision systems particularly for robotics and autonomous driving. Recently CNN based approaches have surpassed classical methods in a number of perception tasks. However, the majority of techniques to date have focused on standard cameras with little consideration given to fisheye cameras which are commonly used in autonomous driving. In this paper, we propose a novel training and evaluation pipeline for fisheye images. We make use of SuperPoint as our baseline which is a self-supervised keypoint detector and descriptor that has achieved state-of-the-art results on homography estimation. We introduce a fisheye adaptation pipeline to enable training on undistorted fisheye images. We evaluate the performance on the HPatches benchmark, and, by introducing a fisheye based evaluation methods for detection repeatability and descriptor matching correctness on the Oxford RobotCar datasets.

Computer Vision and Pattern Recognition Robotics

SHARP: Shape-Aware Reconstruction of People In Loose Clothing

127 - Sai Sagar Jinka , Rohan Chacko , Astitva Srivastava 2021

3D human body reconstruction from monocular images is an interesting and ill-posed problem in computer vision with wider applications in multiple domains. In this paper, we propose SHARP, a novel end-to-end trainable network that accurately recovers the detailed geometry and appearance of 3D people in loose clothing from a monocular image. We propose a sparse and efficient fusion of a parametric body prior with a non-parametric peeled depth map representation of clothed models. The parametric body prior constraints our model in two ways: first, the network retains geometrically consistent body parts that are not occluded by clothing, and second, it provides a body shape context that improves prediction of the peeled depth maps. This enables SHARP to recover fine-grained 3D geometrical details with just L1 losses on the 2D maps, given an input image. We evaluate SHARP on publicly available Cloth3D and THuman datasets and report superior performance to state-of-the-art approaches.

Computer Vision and Pattern Recognition

Captioning Images Taken by People Who Are Blind

65 - Danna Gurari , Yinan Zhao , Meng Zhang 2020

While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly-available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind that are each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly-share the dataset with captioning challenge instructions at https://vizwiz.org

Computer Vision and Pattern Recognition

Whos Waldo? Linking People Across Text and Images

143 - Claire Yuqing Cui , Apoorv Khandelwal , Yoav Artzi 2021

We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Whos Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.

Computer Vision and Pattern Recognition Computation and Language Machine Learning

End-to-end people detection in crowded scenes

587 - Russell Stewart , Mykhaylo Andriluka 2015

Current people detectors operate either by scanning an image in a sliding window fashion or by classifying a discrete set of proposals. We propose a model that is based on decoding an image into a set of people detections. Our system takes an image as input and directly outputs a set of distinct detection hypotheses. Because we generate predictions jointly, common post-processing steps such as non-maximum suppression are unnecessary. We use a recurrent LSTM layer for sequence generation and train our model end-to-end with a new loss function that operates on sets of detections. We demonstrate the effectiveness of our approach on the challenging task of detecting people in crowded scenes.

Computer Vision and Pattern Recognition