Searching for Efficient Multi-Scale Architectures for Dense Image Prediction

138 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Liang-Chieh Chen

تاريخ النشر 2018

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Liang-Chieh Chen - Maxwell D. Collins - Yukun Zhu

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7% on Cityscapes (street scene parsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.

قيم البحث

100 - Michael S. Ryoo , AJ Piergiovanni , Mingxing Tan 2019

Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, u sing modules such as 3D convolutions, or by using two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity and spatio-temporal interactions for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning. Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin. We obtain 58.6% mAP on Charades and 34.27% accuracy on Moments-in-Time.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي الحوسبة العصبية والتطورية

MixSearch: Searching for Domain Generalized Medical Image Segmentation Architectures

104 - Luyan Liu , Zhiwei Wen , Songwei Liu 2021

Considering the scarcity of medical data, most datasets in medical image analysis are an order of magnitude smaller than those of natural images. However, most Network Architecture Search (NAS) approaches in medical images focused on specific dataset s and did not take into account the generalization ability of the learned architectures on unseen datasets as well as different domains. In this paper, we address this point by proposing to search for generalizable U-shape architectures on a composited dataset that mixes medical images from multiple segmentation tasks and domains creatively, which is named MixSearch. Specifically, we propose a novel approach to mix multiple small-scale datasets from multiple domains and segmentation tasks to produce a large-scale dataset. Then, a novel weaved encoder-decoder structure is designed to search for a generalized segmentation network in both cell-level and network-level. The network produced by the proposed MixSearch framework achieves state-of-the-art results compared with advanced encoder-decoder networks across various datasets.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي

Searching for Efficient Multi-Stage Vision Transformers

339 - Yi-Lun Liao , Sertac Karaman , Vivienne Sze 2021

Vision Transformer (ViT) demonstrates that Transformer for natural language processing can be applied to computer vision tasks and result in comparable performance to convolutional neural networks (CNN), which have been studied and adopted in compute r vision for years. This naturally raises the question of how the performance of ViT can be advanced with design techniques of CNN. To this end, we propose to incorporate two techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS). First, we propose residual spatial reduction to decrease sequence lengths for deeper layers and utilize a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize training deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and utilize its sub-networks to define a search space. A super-network covering all sub-networks is then trained for fast evaluation of their performance. To efficiently train the super-network, we propose to sample and train multiple sub-networks with one forward-backward pass. After that, evolutionary search is performed to discover high-performance network architectures. Experiments on ImageNet demonstrate that ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than the original DeiT and other strong baselines of ViT. Code is available at https://github.com/yilunliao/vit-search.

الرؤية الحاسوبية وتمييز الأنماط

Efficient and Accurate Multi-scale Topological Network for Single Image Dehazing

79 - Qiaosi Yi , Juncheng Li , Faming Fang 2021

Single image dehazing is a challenging ill-posed problem that has drawn significant attention in the last few years. Recently, convolutional neural networks have achieved great success in image dehazing. However, it is still difficult for these incre asingly complex models to recover accurate details from the hazy image. In this paper, we pay attention to the feature extraction and utilization of the input image itself. To achieve this, we propose a Multi-scale Topological Network (MSTN) to fully explore the features at different scales. Meanwhile, we design a Multi-scale Feature Fusion Module (MFFM) and an Adaptive Feature Selection Module (AFSM) to achieve the selection and fusion of features at different scales, so as to achieve progressive image dehazing. This topological network provides a large number of search paths that enable the network to extract abundant image features as well as strong fault tolerance and robustness. In addition, ASFM and MFFM can adaptively select important features and ignore interference information when fusing different scale representations. Extensive experiments are conducted to demonstrate the superiority of our method compared with state-of-the-art methods.

الرؤية الحاسوبية وتمييز الأنماط

FaPN: Feature-aligned Pyramid Network for Dense Image Prediction

94 - Shihua Huang , Zhichao Lu , Ran Cheng 2021

Recent advancements in deep neural networks have made remarkable leap-forwards in dense image prediction. However, the issue of feature alignment remains as neglected by most existing approaches for simplicity. Direct pixel addition between upsampled and local features leads to feature maps with misaligned contexts that, in turn, translate to mis-classifications in prediction, especially on object boundaries. In this paper, we propose a feature alignment module that learns transformation offsets of pixels to contextually align upsampled higher-level features; and another feature selection module to emphasize the lower-level features with rich spatial details. We then integrate these two modules in a top-down pyramidal architecture and present the Feature-aligned Pyramid Network (FaPN). Extensive experimental evaluations on four dense prediction tasks and four datasets have demonstrated the efficacy of FaPN, yielding an overall improvement of 1.2 - 2.6 points in AP / mIoU over FPN when paired with Faster / Mask R-CNN. In particular, our FaPN achieves the state-of-the-art of 56.7% mIoU on ADE20K when integrated within Mask-Former. The code is available from https://github.com/EMI-Group/FaPN.

الرؤية الحاسوبية وتمييز الأنماط