We present a simple yet effective progressive self-guided loss function to facilitate deep learning-based salient object detection (SOD) in images. The saliency maps produced by the most relevant works still suffer from incomplete predictions due to the internal complexity of salient objects. Our proposed progressive self-guided loss simulates a morphological closing operation on the model predictions to progressively create auxiliary training supervisions that guide the training process step by step. We demonstrate that this new loss function can guide the SOD model to highlight more complete salient objects and, meanwhile, helps to uncover the spatial dependencies of salient object pixels in a region-growing manner. Moreover, a new feature aggregation module is proposed to capture multi-scale features and aggregate them adaptively with a branch-wise attention mechanism. Benefiting from this module, our SOD framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively. Experimental results on several benchmark datasets show that our loss function not only advances the performance of existing SOD models without any architecture modification but also helps our proposed framework achieve state-of-the-art performance.
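As a rough illustration of the idea, the PyTorch-style sketch below approximates a morphological closing on the soft prediction with max-pooling and turns the result into an auxiliary supervision term. The kernel schedule and the combination with the ground truth are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def morphological_closing(pred, kernel_size=5):
    """Approximate grayscale closing (dilation then erosion) with max-pooling.

    `pred` is a soft saliency map in [0, 1] with shape (B, 1, H, W).
    """
    pad = kernel_size // 2
    dilated = F.max_pool2d(pred, kernel_size, stride=1, padding=pad)
    eroded = -F.max_pool2d(-dilated, kernel_size, stride=1, padding=pad)
    return eroded

def progressive_self_guided_loss(pred, gt, stage, base_kernel=3):
    """Sketch of a self-guided auxiliary loss.

    The kernel grows with the training stage so the auxiliary target
    expands in a region-growing manner (an assumed schedule), and the
    closed prediction is clipped by the ground truth so the target
    never grows outside the true object (also an assumption).
    """
    kernel = base_kernel + 2 * stage                 # odd kernel, grows per stage
    with torch.no_grad():
        closed = morphological_closing(pred.detach(), kernel)
        aux_target = (closed * gt).clamp(0, 1)       # keep the target inside the GT mask
    bce_main = F.binary_cross_entropy(pred, gt)
    bce_aux = F.binary_cross_entropy(pred, aux_target)
    return bce_main + bce_aux
```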
Salient object detection (SOD) aims at locating the most significant object within a given image. In recent years, great progress has been made in applying SOD to many vision tasks. The depth map can provide additional spatial priors and boundary cues to boost performance, and combining depth information with image data obtained from standard visual cameras has been widely adopted in recent SOD works; however, introducing depth information through a suboptimal fusion strategy may negatively affect SOD performance. In this paper, we discuss the advantages of the so-called progressive multi-scale fusion method and propose a mask-guided feature aggregation module (MGFA). The proposed framework can effectively combine the features of the two modalities and, furthermore, alleviate the impact of erroneous depth features, which are inevitably caused by variations in depth quality. We further introduce a mask-guided refinement module (MGRM) to complement the high-level semantic features and reduce irrelevant features from multi-scale fusion, leading to an overall refinement of detection. Experiments on five challenging benchmarks demonstrate that the proposed method outperforms 11 state-of-the-art methods under different evaluation metrics.
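The PyTorch sketch below shows one plausible single-scale step of mask-guided cross-modal fusion in this spirit: a coarse saliency mask gates the depth features before they are merged with the RGB features, so unreliable depth regions contribute less. The module name, layer layout, and gating rule are illustrative assumptions, not the authors' exact MGFA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedFusion(nn.Module):
    """Hypothetical single-scale fusion step of an MGFA-style module."""

    def __init__(self, channels):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, depth_feat, coarse_mask):
        # Resize the coarse mask to the feature resolution and use it as a gate.
        gate = torch.sigmoid(F.interpolate(coarse_mask, size=rgb_feat.shape[2:],
                                           mode='bilinear', align_corners=False))
        gated_depth = depth_feat * gate              # suppress unreliable depth regions
        fused = self.merge(torch.cat([rgb_feat, gated_depth], dim=1))
        return F.relu(fused + rgb_feat)              # residual path keeps the RGB information
```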
Fully convolutional networks (FCNs) have recently been widely applied to salient object detection by virtue of their high-level semantic feature extraction, but existing FCN-based methods still suffer from repeated striding and pooling operations, which lead to loss of spatial structure and blurred edges. To maintain the clear edge structure of salient objects, we propose a novel Edge-guided Non-local FCN (ENFNet) to perform edge-guided feature learning for accurate salient object detection. Specifically, we extract hierarchical global and local information in the FCN to incorporate non-local features for effective feature representations. To preserve good boundaries of salient objects, we propose a guidance block to embed edge prior knowledge into the hierarchical feature maps. The guidance block performs not only feature-wise manipulation but also spatial-wise transformation for effective edge embeddings. Our model is trained on the MSRA-B dataset and tested on five popular benchmark datasets. Compared with state-of-the-art methods, the proposed method achieves the best performance on all datasets.
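A minimal sketch of such a guidance block is given below, assuming the edge prior is injected via a channel-wise (feature-wise) modulation plus a spatial modulation; the real ENFNet block may be structured differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeGuidanceBlock(nn.Module):
    """Illustrative guidance block that embeds an edge prior into a feature map."""

    def __init__(self, channels):
        super().__init__()
        self.channel_fc = nn.Linear(channels, channels)                 # feature-wise scaling
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)   # spatial transformation

    def forward(self, feat, edge_map):
        b, c, h, w = feat.shape
        edge = F.interpolate(edge_map, size=(h, w), mode='bilinear', align_corners=False)
        # Feature-wise: derive per-channel scales from edge-weighted global pooling.
        pooled = (feat * edge).mean(dim=(2, 3))                         # (B, C)
        channel_scale = torch.sigmoid(self.channel_fc(pooled)).view(b, c, 1, 1)
        # Spatial-wise: turn the edge prior into a spatial attention map.
        spatial_scale = torch.sigmoid(self.spatial_conv(edge))
        return feat * channel_scale * spatial_scale + feat              # residual edge embedding
```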
In this paper, we aim to develop an efficient and compact deep network for RGB-D salient object detection, where the depth image provides complementary information to boost performance in complex scenarios. Starting from a coarse initial prediction produced by a multi-scale residual block, we propose a progressively guided alternate refinement network to refine it. Instead of using an ImageNet pre-trained backbone network, we first construct a lightweight depth stream learned from scratch, which can extract complementary features more efficiently with less redundancy. Then, different from existing fusion-based methods, RGB and depth features are fed into the proposed guided residual (GR) blocks alternately to reduce their mutual degradation. By assigning progressive guidance to the stacked GR blocks within each side-output, false detections and missing parts can be well remedied. Extensive experiments on seven benchmark datasets demonstrate that our model outperforms existing state-of-the-art approaches by a large margin, and also shows superiority in efficiency (71 FPS) and model size (64.9 MB).
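Below is a minimal sketch of what one guided residual refinement step might look like, assuming each block consumes a modality feature (RGB or depth, fed alternately) plus the current coarse prediction and outputs a refined prediction. The layer layout and the alternation loop are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedResidualBlock(nn.Module):
    """Sketch of a guided-residual refinement step (internal layout assumed)."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, feat, coarse_pred):
        guide = F.interpolate(coarse_pred, size=feat.shape[2:],
                              mode='bilinear', align_corners=False)
        residual = self.body(torch.cat([feat, guide], dim=1))
        return guide + residual                       # refined prediction at this stage

# Alternating use inside one side-output (illustrative):
# pred = coarse_pred
# for rgb_block, depth_block in zip(rgb_gr_blocks, depth_gr_blocks):
#     pred = rgb_block(rgb_feat, pred)
#     pred = depth_block(depth_feat, pred)
```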
Existing CNN-based RGB-D salient object detection (SOD) networks all require pre-training on ImageNet to learn hierarchical features that provide a good initialization. However, the collection and annotation of large-scale datasets are time-consuming and expensive. In this paper, we utilize Self-Supervised Representation Learning (SSL) to design two pretext tasks: cross-modal auto-encoding and depth-contour estimation. Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which enables the network to capture rich semantic contexts and reduces the gap between the two modalities, thereby providing an effective initialization for the downstream task. In addition, to address the inherent problem of cross-modal fusion in RGB-D SOD, we propose a consistency-difference aggregation (CDA) module that splits a single feature fusion into multi-path fusion to achieve an adequate perception of consistent and differential information. The CDA module is general and suitable for both cross-modal and cross-level feature fusion. Extensive experiments on six benchmark RGB-D SOD datasets show that our model, pre-trained on an RGB-D dataset ($6,392$ images without any annotations), performs favorably against most state-of-the-art RGB-D methods pre-trained on ImageNet ($1,280,000$ images with image-level annotations).
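The sketch below illustrates one way a consistency-difference aggregation could split fusion into a consistent path (element-wise product) and a differential path (signed residuals) before merging; the concrete paths and layers in the paper may differ.

```python
import torch
import torch.nn as nn

class ConsistencyDifferenceAggregation(nn.Module):
    """A minimal sketch of a CDA-style multi-path fusion (layers assumed)."""

    def __init__(self, channels):
        super().__init__()
        self.consistent = nn.Conv2d(channels, channels, 3, padding=1)
        self.difference = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.merge = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, feat_a, feat_b):
        # feat_a / feat_b can be cross-modal (RGB vs. depth) or cross-level features.
        cons = self.consistent(feat_a * feat_b)                              # consistent path
        diff = self.difference(torch.cat([feat_a - feat_b, feat_b - feat_a], dim=1))  # differential path
        return self.merge(torch.cat([cons, diff], dim=1))
```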
Video salient object detection (VSOD) aims to locate and segment the most attractive object by exploiting both spatial cues and temporal cues hidden in video sequences. However, spatial and temporal cues are often unreliable in real-world scenarios, such as low-contrast foreground, fast motion, and multiple moving objects. To address these problems, we propose a new framework to adaptively capture available information from spatial and temporal cues, which contains Confidence-guided Adaptive Gate (CAG) modules and Dual Differential Enhancement (DDE) modules. For both RGB features and optical flow features, CAG estimates confidence scores supervised by the IoU between predictions and the ground truths to re-calibrate the information with a gate mechanism. DDE captures the differential feature representation to enrich the spatial and temporal information and generate the fused features. Experimental results on four widely used datasets demonstrate the effectiveness of the proposed method against thirteen state-of-the-art methods.
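As a rough sketch, a CAG-style gate for one branch (RGB or optical flow) might look like the snippet below: a small head predicts a scalar confidence that re-calibrates the branch feature, and at training time the confidence is regressed toward a soft IoU between that branch's prediction and the ground truth. All layer choices and the loss form are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceGuidedGate(nn.Module):
    """Illustrative confidence-guided gate for one modality branch."""

    def __init__(self, channels):
        super().__init__()
        self.conf_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid(),
        )

    def forward(self, feat):
        conf = self.conf_head(feat)                       # (B, 1) confidence score
        gated = feat * conf.view(-1, 1, 1, 1)             # re-calibrate the branch feature
        return gated, conf

def soft_iou(pred, gt, eps=1e-6):
    """Soft IoU between a soft prediction and the ground truth mask."""
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    return (inter + eps) / (union + eps)

# Training-time supervision of the confidence (sketch):
# gated_rgb, conf_rgb = cag_rgb(rgb_feat)
# conf_loss = F.mse_loss(conf_rgb.squeeze(1), soft_iou(rgb_pred, gt))
```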