Updatable Siamese Tracker with Two-stage One-shot Learning


Published by Xinglong Sun

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Xinglong Sun, Guangliang Han, Lihong Guo

Computer Vision and Pattern Recognition, Artificial Intelligence


Abstract

Offline Siamese networks have achieved very promising tracking performance, especially in accuracy and efficiency. However, they often fail to track an object in complex scenes because they cannot be updated online. Traditional updaters struggle to handle the irregular variations and sampling noise of objects, so adopting them to update Siamese networks is quite risky. In this paper, we first present a two-stage one-shot learner, which can predict the local parameters of the primary classifier from object samples taken at diverse stages. Then, an updatable Siamese network (SiamTOL) is proposed based on the learner, which is able to carry out online updates by itself. Concretely, we introduce an extra input branch to sequentially capture the latest object features, and design a residual module to update the initial exemplar using these features. In addition, an effective multi-aspect training loss is designed for our network to avoid overfitting. Extensive experimental results on several popular benchmarks, including OTB100, VOT2018, VOT2019, LaSOT, UAV123 and GOT10k, show that the proposed tracker achieves leading performance and outperforms other state-of-the-art methods.
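
The abstract does not spell out the form of the residual module, so the following is only a minimal sketch of the general idea of updating an initial exemplar with the latest object features. It assumes a simple additive residual scaled by a fixed `gate`; the function name `residual_update`, the `gate` parameter, and the toy feature vectors are hypothetical and stand in for whatever the learned module computes.

```python
from typing import List


def residual_update(initial: List[float], latest: List[float], gate: float) -> List[float]:
    """Blend the latest features into the initial exemplar through a residual term.

    Hypothetical stand-in: `gate` replaces whatever a learned residual
    module would compute; it is an assumption, not taken from the paper.
    """
    if len(initial) != len(latest):
        raise ValueError("feature vectors must have the same length")
    # The residual is the scaled difference between latest and initial features.
    residual = [gate * (new - old) for old, new in zip(initial, latest)]
    # Adding the residual keeps the initial exemplar as the base of the update.
    return [old + r for old, r in zip(initial, residual)]


# Toy usage: every frame updates the same initial exemplar.
exemplar = [1.0, 2.0]
for frame_features in ([3.0, 2.0], [5.0, 0.0]):
    print(residual_update(exemplar, frame_features, gate=0.5))
```

With `gate=0.0` the sketch returns the initial exemplar unchanged, which mirrors the behavior of a purely offline tracker.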


Qi Feng, Vitaly Ablavsky, Qinxun Bai (2019)

We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the task of tracking by natural language (NL) descriptions. The proposed SNLT is applicable to a wide range of Siamese trackers, providing a new class of baselines for the tracking-by-NL task and promising future improvements from the advancements of Siamese trackers. The carefully designed architecture of the Siamese Natural Language Region Proposal Network (SNL-RPN), together with the Dynamic Aggregation of vision and language modalities, is introduced to perform the tracking-by-NL task. Empirical results over tracking benchmarks with NL annotations show that the proposed SNLT improves Siamese trackers by 3 to 7 percentage points with a slight tradeoff in speed. The proposed SNLT outperforms all NL trackers to date and is competitive among state-of-the-art real-time trackers on LaSOT benchmarks while running at 50 frames per second on a single GPU.

Computer Vision and Pattern Recognition

SiamReID: Confuser Aware Siamese Tracker with Re-identification Feature

Abu Md Niamul Taufique, Andreas Savakis, Michael Braun (2021)

Siamese deep-network trackers have received significant attention in recent years due to their real-time speed and state-of-the-art performance. However, Siamese trackers suffer from similar-looking confusers, which are prevalent in aerial imagery and create challenging conditions during prolonged occlusions, where the tracked object re-appears under a different pose and illumination. Our work proposes SiamReID, a novel re-identification framework for Siamese trackers that incorporates confuser rejection during prolonged occlusions and is well suited for aerial tracking. The re-identification feature is trained using both a triplet loss and a class-balanced loss. Our approach achieves state-of-the-art performance on the UAVDT single object tracking benchmark.
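
For readers unfamiliar with the triplet loss mentioned above, here is a minimal sketch of its standard margin-based form on plain feature vectors. It is not SiamReID's training code: the Euclidean distance, the `margin` default, and the function name `triplet_loss` are assumptions for illustration, and the class-balanced loss is not covered.

```python
import math
from typing import Sequence


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Standard margin-based triplet loss with Euclidean distance."""
    # Distance from the anchor to a sample of the same identity.
    d_pos = math.dist(anchor, positive)
    # Distance from the anchor to a sample of a different identity (a confuser).
    d_neg = math.dist(anchor, negative)
    # The loss is zero once the negative is farther than the positive by `margin`.
    return max(0.0, d_pos - d_neg + margin)


# Toy usage: the confuser is far from the anchor, so no penalty is incurred.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

Swapping the positive and negative samples in the call makes the loss positive, since the confuser is then closer to the anchor than the true match.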

Computer Vision and Pattern Recognition

One-Shot Texture Retrieval with Global Context Metric

Kai Zhu, Wei Zhai, Zheng-Jun Zha (2019)

In this paper, we tackle one-shot texture retrieval: given an example of a new reference texture, detect and segment all the pixels of the same texture category within an arbitrary image. To address this problem, we present an OS-TR network that encodes both the reference and the query image and segments the query according to the reference category. Unlike existing texture encoding methods that integrate CNNs with orderless pooling, we propose a directionality-aware module to capture the texture variations in each direction, resulting in a spatially invariant representation. To segment new categories given only a few examples, we incorporate a self-gating mechanism into the relation network to exploit global context information for adjusting the per-channel modulation weights of local relation features. Extensive experiments on benchmark texture datasets and real scenarios demonstrate the above-par segmentation performance of our proposed method and its robust generalization across domains.
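
The idea of using global context to set per-channel modulation weights can be illustrated with a small, parameter-free sketch. The abstract does not give the gating function, so this assumes a sigmoid of each channel's mean as the gate; the name `self_gate` and the list-of-channels layout are hypothetical, and a real network would learn the mapping from context to weights.

```python
import math
from typing import List


def self_gate(channels: List[List[float]]) -> List[List[float]]:
    """Scale each channel by a weight derived from its own global context.

    Assumption: the context is the channel mean and the gate is a plain
    sigmoid, with no learned parameters.
    """
    gated = []
    for channel in channels:
        # Global context: one summary value per channel.
        context = sum(channel) / len(channel)
        # Squash the context into a modulation weight in (0, 1).
        weight = 1.0 / (1.0 + math.exp(-context))
        # Apply the same weight to every position in the channel.
        gated.append([weight * value for value in channel])
    return gated


# Toy usage: two channels, each with two spatial positions.
print(self_gate([[2.0, -2.0], [4.0, 4.0]]))
```

Channels with a stronger average response receive a weight closer to 1, while channels whose responses average out near zero are roughly halved.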

Computer Vision and Pattern Recognition, Artificial Intelligence

One-Vote Veto: Semi-Supervised Learning for Low-Shot Glaucoma Diagnosis

Rui Fan, Christopher Bowd, Nicole Brye (2020)

Convolutional neural networks (CNNs) are a promising technique for automated glaucoma diagnosis from images of the fundus, and these images are routinely acquired as part of an ophthalmic exam. Nevertheless, CNNs typically require a large amount of well-labeled data for training, which may not be available in many biomedical image classification applications, especially when diseases are rare and where labeling by experts is costly. This paper makes two contributions to address this issue: (1) it extends the conventional twin neural network and introduces a training method for low-shot learning when labeled data are limited and imbalanced, and (2) it introduces a novel semi-supervised learning strategy that uses additional unlabeled training data to achieve greater accuracy. Our proposed multi-task twin neural network (MTTNN) can employ any backbone CNN, and we demonstrate with four backbone CNNs that its accuracy with limited training data approaches the accuracy of backbone CNNs trained with a dataset that is 50 times larger. We also introduce One-Vote Veto (OVV) self-training, a semi-supervised learning strategy that is designed specifically for MTTNNs. By taking both self-predictions and contrastive predictions of the unlabeled training data into account, OVV self-training provides additional pseudo labels for fine-tuning a pretrained MTTNN. Using a large (imbalanced) dataset with 66715 fundus photographs acquired over 15 years, extensive experimental results demonstrate the effectiveness of low-shot learning with MTTNN and semi-supervised learning with OVV self-training. Three additional, smaller clinical datasets of fundus images acquired under different conditions (cameras, instruments, locations, populations) are used to demonstrate the generalizability of the proposed methods. Source code and pretrained models will be publicly available upon publication.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

ScaleNAS: One-Shot Learning of Scale-Aware Representations for Visual Recognition

Hsin-Pai Cheng, Feng Liang, Meng Li (2020)

Scale variance among different sizes of body parts and objects is a challenging problem for visual recognition tasks. Existing works usually design a dedicated backbone or apply Neural Architecture Search (NAS) for each task to tackle this challenge. However, existing works impose significant limitations on the design or search space. To solve these problems, we present ScaleNAS, a one-shot learning method for exploring scale-aware representations. ScaleNAS solves multiple tasks at a time by searching multi-scale feature aggregation. ScaleNAS adopts a flexible search space that allows an arbitrary number of blocks and cross-scale feature fusions. To cope with the high search cost incurred by the flexible space, ScaleNAS employs one-shot learning for a multi-scale supernet driven by grouped sampling and evolutionary search. Without further retraining, ScaleNet can be directly deployed for different visual recognition tasks with superior performance. We use ScaleNAS to create high-resolution models for two different tasks, ScaleNet-P for human pose estimation and ScaleNet-S for semantic segmentation. ScaleNet-P and ScaleNet-S outperform existing manually crafted and NAS-based methods in both tasks. When ScaleNet-P is applied to bottom-up human pose estimation, it surpasses the state-of-the-art HigherHRNet. In particular, ScaleNet-P4 achieves 71.6% AP on COCO test-dev, a new state-of-the-art result.

Computer Vision and Pattern Recognition, Artificial Intelligence