Bridging the Gap Between Object Detection and User Intent via Query-Modulation

107 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Marco Fornoni

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Marco Fornoni - Chaochao Yan - Liangchen Luo

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

When interacting with objects through cameras, or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as lack of a high-confidence detection on the object of interest, or detection with a wrong class label. In this paper we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve for both query-modulated detection and standard object detection.

قيم البحث

93 - Xin Lu , Quanquan Li , Buyu Li 2020

Modern object detection methods can be divided into one-stage approaches and two-stage ones. One-stage detectors are more efficient owing to straightforward architectures, but the two-stage detectors still take the lead in accuracy. Although recent w ork try to improve the one-stage detectors by imitating the structural design of the two-stage ones, the accuracy gap is still significant. In this paper, we propose MimicDet, a novel and efficient framework to train a one-stage detector by directly mimic the two-stage features, aiming to bridge the accuracy gap between one-stage and two-stage detectors. Unlike conventional mimic methods, MimicDet has a shared backbone for one-stage and two-stage detectors, then it branches into two heads which are well designed to have compatible features for mimicking. Thus MimicDet can be end-to-end trained without the pre-train of the teacher network. And the cost does not increase much, which makes it practical to adopt large networks as backbones. We also make several specialized designs such as dual-path mimicking and staggered feature pyramid to facilitate the mimicking process. Experiments on the challenging COCO detection benchmark demonstrate the effectiveness of MimicDet. It achieves 46.1 mAP with ResNeXt-101 backbone on the COCO test-dev set, which significantly surpasses current state-of-the-art methods.

الرؤية الحاسوبية وتمييز الأنماط

S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration

338 - Zhiqiang Shen , Zechun Liu , Jie Qin 2021

Previous studies dominantly target at self-supervised learning on real-valued networks and have achieved many promising results. However, on the more challenging binary neural networks (BNNs), this task has not yet been fully explored in the communit y. In this paper, we focus on this more difficult scenario: learning networks where both weights and activations are binary, meanwhile, without any human annotated labels. We observe that the commonly used contrastive objective is not satisfying on BNNs for competitive accuracy, since the backbone network contains relatively limited capacity and representation ability. Hence instead of directly applying existing self-supervised methods, which cause a severe decline in performance, we present a novel guided learning paradigm from real-valued to distill binary networks on the final prediction distribution, to minimize the loss and obtain desirable accuracy. Our proposed method can boost the simple contrastive learning baseline by an absolute gain of 5.5~15% on BNNs. We further reveal that it is difficult for BNNs to recover the similar predictive distributions as real-valued models when training without labels. Thus, how to calibrate them is key to address the degradation in performance. Extensive experiments are conducted on the large-scale ImageNet and downstream datasets. Our method achieves substantial improvement over the simple contrastive learning baseline, and is even comparable to many mainstream supervised BNN methods. Code is available at https://github.com/szq0214/S2-BNN.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي التعلم الآلي

Bridging Gap between Image Pixels and Semantics via Supervision: A Survey

78 - Jiali Duan , C.-C. Jay Kuo 2021

The fact that there exists a gap between low-level features and semantic meanings of images, called the semantic gap, is known for decades. Resolution of the semantic gap is a long standing problem. The semantic gap problem is reviewed and a survey o n recent efforts in bridging the gap is made in this work. Most importantly, we claim that the semantic gap is primarily bridged through supervised learning today. Experiences are drawn from two application domains to illustrate this point: 1) object detection and 2) metric learning for content-based image retrieval (CBIR). To begin with, this paper offers a historical retrospective on supervision, makes a gradual transition to the modern data-driven methodology and introduces commonly used datasets. Then, it summarizes various supervision methods to bridge the semantic gap in the context of object detection and metric learning.

الرؤية الحاسوبية وتمييز الأنماط

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

280 - Cheng Chi , Fangyun Wei , Han Hu 2020

Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While t hese different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine these representations in a single framework to make good use of each strength, due to the heterogeneous or non-grid feature extraction by different representations. This paper presents an attention-based decoder module similar as that in Transformer~cite{vaswani2017attention} to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of emph{key} instances to strengthen the main emph{query} representation features in the vanilla detectors. Novel techniques are proposed towards efficient computation of the decoder module, including a emph{key sampling} approach and a emph{shared location embedding} approach. The proposed module is named emph{bridging visual representations} (BVR). It can perform in-place and we demonstrate its broad effectiveness in bridging other representations into prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS, where about $1.5sim3.0$ AP improvements are achieved. In particular, we improve a state-of-the-art framework with a strong backbone by about $2.0$ AP, reaching $52.7$ AP on COCO test-dev. The resulting network is named RelationNet++. The code will be available at https://github.com/microsoft/RelationNet2.

الرؤية الحاسوبية وتمييز الأنماط

Motion Planning Networks: Bridging the Gap Between Learning-based and Classical Motion Planners

99 - Ahmed H. Qureshi , Yinglong Miao , Anthony Simeonov 2019

This paper describes Motion Planning Networks (MPNet), a computationally efficient, learning-based neural planner for solving motion planning problems. MPNet uses neural networks to learn general near-optimal heuristics for path planning in seen and unseen environments. It takes environment information such as raw point-cloud from depth sensors, as well as a robots initial and desired goal configurations and recursively calls itself to bidirectionally generate connectable paths. In addition to finding directly connectable and near-optimal paths in a single pass, we show that worst-case theoretical guarantees can be proven if we merge this neural network strategy with classical sample-based planners in a hybrid approach while still retaining significant computational and optimality improvements. To train the MPNet models, we present an active continual learning approach that enables MPNet to learn from streaming data and actively ask for expert demonstrations when needed, drastically reducing data for training. We validate MPNet against gold-standard and state-of-the-art planning methods in a variety of problems from 2D to 7D robot configuration spaces in challenging and cluttered environments, with results showing significant and consistently stronger performance metrics, and motivating neural planning in general as a modern strategy for solving motion planning problems efficiently.

علم الروبوتات الذكاء الاصطناعي التعلم الآلي