Existing state-of-the-art saliency detection methods rely heavily on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to obtain high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.
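The decoder above hinges on learnable task tokens and a patch-task-attention mechanism. As a rough, non-authoritative sketch of the idea (the paper's exact formulation and dimensions differ; see the linked repository), each patch token can be compared against a learned saliency token to read out a dense per-patch prediction:

```python
import torch
import torch.nn as nn

class PatchTaskAttention(nn.Module):
    """Illustrative sketch: patch tokens attend to a single learned task
    token (e.g. a saliency token), and the attention response itself is
    read out as a dense per-patch prediction. Not the paper's exact design."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.task_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # hypothetical saliency token
        self.q = nn.Linear(dim, dim)  # queries from patch tokens
        self.k = nn.Linear(dim, dim)  # key from the task token
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, C), one token per image patch
        B = patch_tokens.shape[0]
        q = self.q(patch_tokens)                       # (B, N, C)
        k = self.k(self.task_token.expand(B, -1, -1))  # (B, 1, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, 1): patch-task similarity
        return torch.sigmoid(attn.squeeze(-1))         # per-patch saliency in [0, 1]
```

Reshaping the (B, N) output back onto the patch grid and upsampling it would then yield a full-resolution saliency map; a second task token handles boundary detection in the same way.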
A saliency-guided hierarchical visual tracking (SHT) algorithm comprising global and local search phases is proposed in this paper. In the global search phase, a top-down saliency model is newly developed to handle abrupt motion and appearance variation problems.
Traditional gaze estimation methods typically require explicit user calibration to achieve high accuracy. This process is cumbersome, and recalibration is often required when factors such as illumination and pose change.
Using only a model that was trained to predict where people look at images, and no additional training data, we can produce a range of powerful editing effects for reducing distraction in images. Given an image and a mask specifying the region to edit, we use the saliency model's predictions to guide edits that make the masked region less likely to draw the viewer's eye.
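Although this abstract is cut short, the recipe it sets up, editing an image with a frozen saliency predictor as the only supervision, can be sketched as gradient descent on the masked pixels. This is a reconstruction under assumptions rather than the authors' exact editing operator; `saliency_model` stands for any differentiable network mapping an image to a saliency map:

```python
import torch

def reduce_distraction(image, mask, saliency_model, steps=200, lr=0.02):
    """Sketch: optimize only the pixels inside `mask` so that the frozen
    `saliency_model` predicts low saliency there. `image` is (B, 3, H, W),
    `mask` is (B, 1, H, W) with 1 marking the editable region."""
    edit = image.clone().requires_grad_(True)             # editable copy of the pixels
    opt = torch.optim.Adam([edit], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        composite = mask * edit + (1 - mask) * image      # change pixels only inside the mask
        loss = (saliency_model(composite) * mask).mean()  # suppress predicted saliency there
        loss.backward()
        opt.step()
    return (mask * edit + (1 - mask) * image).detach()
```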
Template-based discriminative trackers are currently the dominant tracking methods due to their robustness and accuracy, and Siamese-network-based methods depend on a cross-correlation operation between features extracted from the template and the search image.
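For concreteness, this cross-correlation step is commonly implemented (as in SiamFC-style trackers) by sliding the template's feature map over the search region's feature map as a convolution kernel; the peak of the resulting response map localizes the target. A minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def siamese_xcorr(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Cross-correlate template features (B, C, h, w) with search features
    (B, C, H, W); returns a (B, 1, H-h+1, W-w+1) response map per sample."""
    B, C, h, w = template_feat.shape
    # Grouped convolution treats each sample's template as its own kernel.
    search = search_feat.reshape(1, B * C, *search_feat.shape[2:])
    out = F.conv2d(search, template_feat, groups=B)  # (1, B, H-h+1, W-w+1)
    return out.reshape(B, 1, *out.shape[2:])
```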
Human eyes concentrate on different facial regions during distinct cognitive activities. We study utilising facial visual saliency maps to classify facial expressions into different emotions. Our results show the promise of this novel approach, which merely uses saliency maps rather than the face images themselves.