Image co-segmentation is important because it alleviates the ill-posed nature of image segmentation by exploiting the correlation between related images. Many automatic image co-segmentation algorithms have been developed in the last decade, and this paper investigates them comprehensively. We first analyze the visual/semantic cues used to guide image co-segmentation, including object cues and correlation cues. We then describe traditional methods in three categories: those based on object elements, those based on object regions/contours, and those based on common object models. Next, deep learning based methods are reviewed. Furthermore, widely used test datasets and evaluation criteria are introduced, and the reported performances of the surveyed algorithms are compared with each other. Finally, we discuss current challenges and possible future directions and conclude the paper. We hope this comprehensive investigation will benefit the development of image co-segmentation techniques.
Recent advances in the joint processing of images have clearly demonstrated its advantages over individual processing. Different from existing work geared towards co-segmentation or co-localization, in this paper we explore a new joint processing topic: image co-skeletonization, defined as the joint skeleton extraction of objects in an image collection. Object skeletonization in a single natural image is a challenging problem because there is hardly any prior knowledge about the object. Therefore, we resort to the idea of object co-skeletonization, hoping that the commonness prior that exists across the images may help, just as it does for other joint processing problems such as co-segmentation. We observe that the skeleton can provide good scribbles for segmentation, and skeletonization, in turn, needs good segmentation. Therefore, we propose a coupled framework for the co-skeletonization and co-segmentation tasks so that they are well informed by each other and benefit each other synergistically. Since it is a new problem, we also construct a benchmark dataset by annotating nearly 1.8k images spread across 38 categories. Extensive experiments demonstrate that the proposed method achieves promising results in all three possible joint-processing scenarios: weakly supervised, supervised, and unsupervised.
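The coupling described above (skeleton pixels serving as segmentation scribbles, and the refined segmentation then being re-skeletonized) can be illustrated with a minimal single-image sketch. The mask initialization, the GrabCut-based segmenter, and the iteration count are assumptions for illustration only; the paper's actual method operates jointly over an image collection with a commonness prior.

```python
# Hedged single-image illustration of the skeleton <-> segmentation coupling.
import numpy as np
import cv2
from skimage.morphology import skeletonize

def coupled_refine(image_bgr, init_mask, n_iters=3):
    mask = init_mask.astype(bool)
    for _ in range(n_iters):
        skel = skeletonize(mask)                          # skeleton from the current mask
        gc_mask = np.full(mask.shape, cv2.GC_PR_BGD, np.uint8)
        gc_mask[mask] = cv2.GC_PR_FGD
        gc_mask[skel] = cv2.GC_FGD                        # skeleton pixels act as foreground scribbles
        bgd = np.zeros((1, 65), np.float64)
        fgd = np.zeros((1, 65), np.float64)
        cv2.grabCut(image_bgr, gc_mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
        mask = np.isin(gc_mask, (cv2.GC_FGD, cv2.GC_PR_FGD))  # refined segmentation
    return mask, skeletonize(mask)
```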
Automatic liver segmentation in 3D medical images is essential in many clinical applications, such as pathological diagnosis of hepatic diseases, surgical planning, and postoperative assessment. However, it remains a very challenging task due to the complex background, fuzzy boundaries, and varied appearance of the liver. In this paper, we propose an automatic and efficient algorithm to segment the liver from 3D CT volumes. A deep image-to-image network (DI2IN) is first deployed to generate the liver segmentation, employing a convolutional encoder-decoder architecture combined with multi-level feature concatenation and deep supervision. An adversarial network is then used during training to discriminate the output of DI2IN from the ground truth, which further boosts the performance of DI2IN. The proposed method is trained on an annotated dataset of 1000 CT volumes with various scanning protocols (e.g., contrast and non-contrast, varying resolution and position) and large variations in population (e.g., age and pathology). Our approach outperforms state-of-the-art solutions in terms of segmentation accuracy and computational efficiency.
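A minimal sketch of the training setup described above is given below: an encoder-decoder segmenter with a skip concatenation, trained with a voxel-wise loss plus an adversarial term from a mask discriminator. All layer sizes, module names, and the loss weighting are illustrative assumptions, not the authors' DI2IN implementation.

```python
# Hedged sketch: tiny 3D encoder-decoder segmenter plus adversarial discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegNet(nn.Module):                 # stand-in for the DI2IN-style segmenter
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv3d(1, 8, 3, padding=1)
        self.down = nn.Conv3d(8, 16, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose3d(16, 8, 2, stride=2)
        self.out = nn.Conv3d(16, 1, 1)       # applied after concatenating the skip feature

    def forward(self, x):
        e = F.relu(self.enc(x))
        d = F.relu(self.up(F.relu(self.down(e))))
        return torch.sigmoid(self.out(torch.cat([e, d], dim=1)))  # multi-level concatenation

disc = nn.Sequential(nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 1))

seg, bce = TinySegNet(), nn.BCELoss()
opt_s = torch.optim.Adam(seg.parameters(), 1e-3)
opt_d = torch.optim.Adam(disc.parameters(), 1e-3)

volume = torch.rand(1, 1, 16, 32, 32)                    # toy CT volume
gt = (torch.rand(1, 1, 16, 32, 32) > 0.5).float()        # toy ground-truth mask

pred = seg(volume)
# discriminator step: ground truth -> real (1), prediction -> fake (0)
d_loss = bce(torch.sigmoid(disc(gt)), torch.ones(1, 1)) + \
         bce(torch.sigmoid(disc(pred.detach())), torch.zeros(1, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
# segmenter step: voxel-wise loss plus an adversarial term that tries to fool the discriminator
s_loss = bce(pred, gt) + 0.1 * bce(torch.sigmoid(disc(pred)), torch.ones(1, 1))
opt_s.zero_grad(); s_loss.backward(); opt_s.step()
```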
Deep Metric Learning (DML) is helpful in computer vision tasks. In this paper, we first introduce DML into image co-segmentation. We propose a novel triplet loss for image segmentation, called the IS-Triplet loss for short, and combine it with traditional image segmentation losses. Unlike the general DML task, which learns a metric between images, we treat each pixel as a sample and use its embedded features in a high-dimensional space to form triplets; by optimizing the IS-Triplet loss, we force the distance between pixels of different categories to be greater than that between pixels of the same category, so that pixels from different categories become easier to distinguish in the high-dimensional feature space. We further present an efficient triplet sampling strategy that makes the computation of the IS-Triplet loss feasible. Finally, the IS-Triplet loss is combined with three traditional image segmentation losses to perform image segmentation. We apply the proposed approach to image co-segmentation and test it on the SBCoseg dataset and the Internet dataset. The experimental results show that our approach effectively improves the discrimination of pixel categories in the high-dimensional space and thus helps traditional losses achieve better segmentation performance with fewer training epochs.
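Since the loss is the core contribution, a hedged sketch of a pixel-level triplet term combined with a standard segmentation loss is given below. The margin, the random triplet sampling, and the weighting are illustrative assumptions rather than the paper's exact IS-Triplet formulation and sampling strategy.

```python
# Hedged sketch of a pixel-level triplet loss combined with cross-entropy.
import torch
import torch.nn.functional as F

def is_triplet_loss(embed, labels, margin=1.0, n_triplets=256):
    """embed: (N, C, H, W) per-pixel embeddings; labels: (N, H, W) class ids."""
    feat = embed.permute(0, 2, 3, 1).reshape(-1, embed.size(1))   # every pixel is a sample
    lab = labels.reshape(-1)
    anchors = torch.randint(0, feat.size(0), (n_triplets,))
    loss, used = feat.new_zeros(()), 0
    for a in anchors:
        pos = (lab == lab[a]).nonzero(as_tuple=True)[0]
        neg = (lab != lab[a]).nonzero(as_tuple=True)[0]
        if len(pos) < 2 or len(neg) == 0:
            continue
        p = pos[torch.randint(0, len(pos), (1,))]
        n = neg[torch.randint(0, len(neg), (1,))]
        d_ap = (feat[a] - feat[p]).pow(2).sum()
        d_an = (feat[a] - feat[n]).pow(2).sum()
        loss = loss + F.relu(d_ap - d_an + margin)   # push different-category pixels farther apart
        used += 1
    return loss / max(used, 1)

def total_loss(logits, embed, labels, alpha=0.5):
    # combined objective: standard segmentation loss + weighted triplet term
    return F.cross_entropy(logits, labels) + alpha * is_triplet_loss(embed, labels)
```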
Deep learning methods have achieved promising performance in many areas, but they still struggle with noisily labeled images during training. Since annotation quality depends heavily on expertise, the problem is even more crucial in the medical image domain. How to eliminate the disturbance of noisy labels in segmentation tasks without further annotation remains a significant challenge. In this paper, we introduce a label quality evaluation strategy in which a deep neural network automatically assesses the quality of each label, which is not explicitly provided, and trains on the cleanly annotated ones. We propose a solution in which the network automatically evaluates the relative quality of the labels in the training set and uses the good ones to tune the network parameters. We also design an overfitting control module to let the network learn as much as possible from the precise annotations during training. Experiments on a public biomedical image segmentation dataset show that the method outperforms baseline methods and retains both high accuracy and good generalization at different noise levels.
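A hedged sketch of the general idea, scoring each sample's label by how well the current network agrees with it and updating only on the higher-quality fraction, is shown below. Using the per-sample loss as a quality proxy and the fixed keep ratio are illustrative assumptions, not the paper's exact evaluation strategy or overfitting control module.

```python
# Hedged sketch: rank labels by per-sample loss (a proxy for label quality)
# and update only on the cleaner-looking fraction of the batch.
import torch
import torch.nn.functional as F

def quality_weighted_step(model, optimizer, images, noisy_masks, keep_ratio=0.7):
    logits = model(images)                                          # (B, K, H, W)
    per_sample = F.cross_entropy(logits, noisy_masks,
                                 reduction='none').mean(dim=(1, 2)) # (B,) mean loss per image
    k = max(1, int(keep_ratio * per_sample.size(0)))
    keep = torch.topk(-per_sample, k).indices                       # lowest loss = likely cleaner labels
    loss = per_sample[keep].mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```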
Recently, referring image segmentation has attracted widespread interest. Previous methods perform the multi-modal fusion between language and vision at the decoding side of the network, and the linguistic features interact with the visual features of each scale separately, which ignores the continuous guidance of language over multi-scale visual features. In this work, we propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network and uses language to refine the multi-modal features progressively. Moreover, a co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features, which promotes the consistency of the cross-modal information representation in the semantic space. Finally, we propose a boundary enhancement module (BEM) to make the network pay more attention to fine structures. Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance under different evaluation metrics without any post-processing.
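A minimal sketch of the kind of co-attention update described above, where language and visual features attend to each other and are refined in parallel, is given below. The module layout, feature dimensions, and residual update are illustrative assumptions, not the EFN implementation.

```python
# Hedged sketch of cross-modal co-attention with a parallel residual update.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.vis_from_lang = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, vis, lang):
        """vis: (B, HW, C) flattened visual features; lang: (B, L, C) word features."""
        vis_new, _ = self.vis_from_lang(query=vis, key=lang, value=lang)
        lang_new, _ = self.lang_from_vis(query=lang, key=vis, value=vis)
        return vis + vis_new, lang + lang_new        # both modalities updated in parallel

coatt = CoAttention()
vis = torch.rand(2, 32 * 32, 256)    # one encoder stage, flattened spatially
lang = torch.rand(2, 12, 256)        # 12-word referring expression
vis, lang = coatt(vis, lang)
```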