ﻻ يوجد ملخص باللغة العربية
Pool-based sampling in active learning (AL) represents a key framework for an-notating informative data when dealing with deep learning models. In this paper, we present a novel pipeline for pool-based Active Learning. Unlike most previous works, our method exploits accessible unlabelled examples during training to estimate their co-relation with the labelled examples. Another contribution of this paper is to adapt Visual Transformer as a sampler in the AL pipeline. Visual Transformer models non-local visual concept dependency between labelled and unlabelled examples, which is crucial to identifying the influencing unlabelled examples. Also, compared to existing methods where the learner and the sampler are trained in a multi-stage manner, we propose to train them in a task-aware jointly manner which enables transforming the latent space into two separate tasks: one that classifies the labelled examples; the other that distinguishes the labelling direction. We evaluated our work on four different challenging benchmarks of classification and detection tasks viz. CIFAR10, CIFAR100,FashionMNIST, RaFD, and Pascal VOC 2007. Our extensive empirical and qualitative evaluations demonstrate the superiority of our method compared to the existing methods. Code available: https://github.com/razvancaramalau/Visual-Transformer-for-Task-aware-Active-Learning
Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (Em
Recently proposed fine-grained 3D visual grounding is an essential and challenging task, whose goal is to identify the 3D object referred by a natural language sentence from other distractive objects of the same category. Existing works usually adopt
Scale variance among different sizes of body parts and objects is a challenging problem for visual recognition tasks. Existing works usually design dedicated backbone or apply Neural architecture Search(NAS) for each task to tackle this challenge. Ho
In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a q
Generative modeling has recently shown great promise in computer vision, but it has mostly focused on synthesizing visually realistic images. In this paper, motivated by multi-task learning of shareable feature representations, we consider a novel pr