Do you want to publish a course? Click here

DAP: Detection-Aware Pre-training with Weak Supervision

100   0   0.0 ( 0 )
 Added by Yuanyi Zhong
 Publication date 2021
and research's language is English




Ask ChatGPT about the research

This paper presents a detection-aware pre-training (DAP) approach, which leverages only weakly-labeled classification-style datasets (e.g., ImageNet) for pre-training, but is specifically tailored to benefit object detection tasks. In contrast to the widely used image classification-based pre-training (e.g., on ImageNet), which does not include any location-related training tasks, we transform a classification dataset into a detection dataset through a weakly supervised object localization method based on Class Activation Maps to directly pre-train a detector, making the pre-trained model location-aware and capable of predicting bounding boxes. We show that DAP can outperform the traditional classification pre-training in terms of both sample efficiency and convergence speed in downstream detection tasks including VOC and COCO. In particular, DAP boosts the detection accuracy by a large margin when the number of examples in the downstream task is small.



rate research

Read More

Given a sufficiently large training dataset, it is relatively easy to train a modern convolution neural network (CNN) as a required image classifier. However, for the task of fish classification and/or fish detection, if a CNN was trained to detect or classify particular fish species in particular background habitats, the same CNN exhibits much lower accuracy when applied to new/unseen fish species and/or fish habitats. Therefore, in practice, the CNN needs to be continuously fine-tuned to improve its classification accuracy to handle new project-specific fish species or habitats. In this work we present a labelling-efficient method of training a CNN-based fish-detector (the Xception CNN was used as the base) on relatively small numbers (4,000) of project-domain underwater fish/no-fish images from 20 different habitats. Additionally, 17,000 of known negative (that is, missing fish) general-domain (VOC2012) above-water images were used. Two publicly available fish-domain datasets supplied additional 27,000 of above-water and underwater positive/fish images. By using this multi-domain collection of images, the trained Xception-based binary (fish/not-fish) classifier achieved 0.17% false-positives and 0.61% false-negatives on the projects 20,000 negative and 16,000 positive holdout test images, respectively. The area under the ROC curve (AUC) was 99.94%.
Training convolutional networks for semantic segmentation with strong (per-pixel) and weak (per-bounding-box) supervision requires a large amount of weakly labeled data. We propose two methods for selecting the most relevant data with weak supervision. The first method is designed for finding visually similar images without the need of labels and is based on modeling image representations with a Gaussian Mixture Model (GMM). As a byproduct of GMM modeling, we present useful insights on characterizing the data generating distribution. The second method aims at finding images with high object diversity and requires only the bounding box labels. Both methods are developed in the context of automated driving and experimentation is conducted on Cityscapes and Open Images datasets. We demonstrate performance gains by reducing the amount of employed weakly labeled images up to 100 times for Open Images and up to 20 times for Cityscapes.
The high cost of pixel-level annotations makes it appealing to train saliency detection models with weak supervision. However, a single weak supervision source usually does not contain enough information to train a well-performing model. To this end, we propose a unified framework to train saliency detection models with diverse weak supervision sources. In this paper, we use category labels, captions, and unlabelled data for training, yet other supervision sources can also be plugged into this flexible framework. We design a classification network (CNet) and a caption generation network (PNet), which learn to predict object categories and generate captions, respectively, meanwhile highlight the most important regions for corresponding tasks. An attention transfer loss is designed to transmit supervision signal between networks, such that the network designed to be trained with one supervision source can benefit from another. An attention coherence loss is defined on unlabelled data to encourage the networks to detect generally salient regions instead of task-specific regions. We use CNet and PNet to generate pixel-level pseudo labels to train a saliency prediction network (SNet). During the testing phases, we only need SNet to predict saliency maps. Experiments demonstrate the performance of our method compares favourably against unsupervised and weakly supervised methods and even some supervised methods.
Although a polygon is a more accurate representation than an upright bounding box for text detection, the annotations of polygons are extremely expensive and challenging. Unlike existing works that employ fully-supervised training with polygon annotations, we propose a novel text detection system termed SelfText Beyond Polygon (SBP) with Bounding Box Supervision (BBS) and Dynamic Self Training (DST), where training a polygon-based text detector with only a limited set of upright bounding box annotations. For BBS, we firstly utilize the synthetic data with character-level annotations to train a Skeleton Attention Segmentation Network (SASN). Then the box-level annotations are adopted to guide the generation of high-quality polygon-liked pseudo labels, which can be used to train any detectors. In this way, our method achieves the same performance as text detectors trained with polygon annotations (i.e., both are 85.0% F-score for PSENet on ICDAR2015 ). For DST, through dynamically removing the false alarms, it is able to leverage limited labeled data as well as massive unlabeled data to further outperform the expensive baseline. We hope SBP can provide a new perspective for text detection to save huge labeling costs. Code is available at: github.com/weijiawu/SBP.
Training object class detectors typically requires a large set of images with objects annotated by bounding boxes. However, manually drawing bounding boxes is very time consuming. In this paper we greatly reduce annotation time by proposing center-click annotations: we ask annotators to click on the center of an imaginary bounding box which tightly encloses the object instance. We then incorporate these clicks into existing Multiple Instance Learning techniques for weakly supervised object localization, to jointly localize object bounding boxes over all training images. Extensive experiments on PASCAL VOC 2007 and MS COCO show that: (1) our scheme delivers high-quality detectors, performing substantially better than those produced by weakly supervised techniques, with a modest extra annotation effort; (2) these detectors in fact perform in a range close to those trained from manually drawn bounding boxes; (3) as the center-click task is very fast, our scheme reduces total annotation time by 9x to 18x.
comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا