The Random Walks (RW) algorithm is one of the most efficient and easy-to-use probabilistic segmentation methods. By combining contrast terms with prior terms, it provides accurate segmentations of medical images in a fully automated manner. However, one of the main drawbacks of using the RW algorithm is that its parameters have to be hand-tuned. We propose a novel discriminative learning framework that estimates the parameters using a training dataset. The main challenge we face is that the training samples are not fully supervised. Specifically, they provide a hard segmentation of the images, instead of a probabilistic segmentation. We overcome this challenge by treating the optimal probabilistic segmentation that is compatible with the given hard segmentation as a latent variable. This allows us to employ the latent support vector machine formulation for parameter estimation. We show that our approach significantly outperforms the baseline methods on a challenging dataset consisting of real clinical 3D MRI volumes of skeletal muscles.
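For reference, the latent support vector machine formulation invoked above is typically written as follows, where $x_i$ is a training image, $y_i$ its given hard segmentation, $h$ the latent probabilistic segmentation, $\Phi$ a joint feature map, $\Delta$ a loss between segmentations, and $C$ a regularization constant; this is the standard objective, not necessarily the paper's exact instantiation:

$$\min_{w}\; \frac{1}{2}\|w\|^{2} + C \sum_{i}\left[\max_{\bar{y},\bar{h}}\left(w^{\top}\Phi(x_i,\bar{y},\bar{h}) + \Delta(y_i,\bar{y})\right) - \max_{h}\, w^{\top}\Phi(x_i,y_i,h)\right]$$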
Almost all existing deep learning approaches for semantic segmentation tackle this task as a pixel-wise classification problem. Yet humans understand a scene not in terms of pixels, but by decomposing it into perceptual groups and structures that are the basic building blocks of recognition. This motivates us to propose an end-to-end pixel-wise metric learning approach that mimics this process. In our approach, the optimal visual representation determines the right segmentation within individual images and associates segments with the same semantic classes across images. The core visual learning problem is therefore to maximize the similarity within segments and minimize the similarity between segments. Given a model trained this way, inference is performed consistently by extracting pixel-wise embeddings and clustering, with each segment's semantic label determined by the majority vote of its nearest neighbors from an annotated set. As a result, we present SegSort, a first attempt at unsupervised semantic segmentation with deep learning, achieving $76\%$ of the performance of its supervised counterpart. When supervision is available, SegSort shows consistent improvements over conventional approaches based on pixel-wise softmax training. Additionally, our approach produces more precise boundaries and consistent region predictions. The proposed SegSort further produces an interpretable result, as each choice of label can be easily understood from the retrieved nearest segments.
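The nearest-neighbor labeling step described above can be sketched in a few lines. The snippet below is an illustrative assumption about the inference procedure (names, shapes, and the choice of cosine similarity are ours), not the paper's implementation:

```python
import numpy as np

def knn_majority_label(query, ref_embed, ref_labels, k=5):
    """Label one query embedding by k-NN majority vote under cosine similarity."""
    q = query / np.linalg.norm(query)
    r = ref_embed / np.linalg.norm(ref_embed, axis=1, keepdims=True)
    sims = r @ q                        # cosine similarity to every reference segment
    topk = np.argsort(-sims)[:k]        # indices of the k most similar segments
    votes = ref_labels[topk]
    return np.bincount(votes).argmax()  # most frequent label among the neighbors

# Toy usage: 100 annotated reference segments, 64-d embeddings, 21 classes.
rng = np.random.default_rng(0)
ref_embed = rng.normal(size=(100, 64))
ref_labels = rng.integers(0, 21, size=100)
print(knn_majority_label(rng.normal(size=64), ref_embed, ref_labels))
```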
EventNet is a large-scale video corpus and event ontology consisting of 500 events associated with event-specific concepts. In order to improve the quality of the current EventNet, we take the following steps and introduce EventNet version 1.1: (1) manually verify the correctness of event labels for all videos; (2) remove YouTube user bias by limiting the maximum number of videos per event from the same YouTube user to 3; (3) remove videos that are no longer accessible online; (4) remove videos belonging to multiple event categories. After the above procedure, some events may contain only a small number of videos, so we crawl more videos for those events to ensure that every event contains more than 50 videos. Finally, EventNet version 1.1 contains 67,641 videos, 500 events, and 5,028 event-specific concepts. In addition, we train a Convolutional Neural Network (CNN) model for event classification by fine-tuning AlexNet on EventNet version 1.1. We then use the trained CNN model to extract FC7-layer features and train a binary linear SVM classifier for each event-specific concept. We believe this new version of EventNet will significantly facilitate research in computer vision and multimedia, and we will put it online for public download in the future.
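The per-concept classifier stage admits a short sketch: given precomputed FC7 features (4096-dimensional for AlexNet) and binary concept annotations, one linear SVM is trained per concept. The arrays below are synthetic stand-ins for illustration, not EventNet's actual data or tooling:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 4096))             # stand-in for extracted FC7 features
concept_labels = rng.integers(0, 2, size=(500, 3))  # stand-in annotations, 3 concepts

classifiers = []
for c in range(concept_labels.shape[1]):
    clf = LinearSVC(C=1.0)                          # one binary linear SVM per concept
    clf.fit(features, concept_labels[:, c])
    classifiers.append(clf)

# Score a new video's FC7 feature against every concept classifier.
scores = [clf.decision_function(features[:1])[0] for clf in classifiers]
print(scores)
```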
In this technical report, we present key details of our winning panoptic segmentation architecture EffPS_b1bs4_RVC. Our network is a lightweight version of our state-of-the-art EfficientPS architecture that consists of our proposed shared backbone with a modified EfficientNet-B5 model as the encoder, followed by the 2-way FPN to learn semantically rich multi-scale features. It consists of two task-specific heads, a modified Mask R-CNN instance head and our novel semantic segmentation head that processes features of different scales with specialized modules for coherent feature refinement. Finally, our proposed panoptic fusion module adaptively fuses logits from each of the heads to yield the panoptic segmentation output. The Robust Vision Challenge 2020 benchmarking results show that our model is ranked #1 on Microsoft COCO, VIPER and WildDash, and is ranked #2 on Cityscapes and Mapillary Vistas, thereby achieving the overall rank #1 for the panoptic segmentation task.
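One way to read "adaptively fuses logits from each of the heads" is a gated combination in which each head's confidence attenuates or amplifies the other's. The sigmoid-gated sum below is our illustrative assumption about that idea, not a faithful reimplementation of the EfficientPS panoptic fusion module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_logits(ml_a, ml_b):
    """Fuse two logit maps so each one gates the combined response."""
    return (sigmoid(ml_a) + sigmoid(ml_b)) * (ml_a + ml_b)

rng = np.random.default_rng(0)
ml_a = rng.normal(size=(64, 64))  # instance-head logits for one candidate mask
ml_b = rng.normal(size=(64, 64))  # semantic-head logits for the same class
print(fuse_logits(ml_a, ml_b).shape)
```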
Current benchmarks for optical flow algorithms evaluate estimation quality by comparing the predicted flow field with the ground truth, and may additionally compare interpolated frames, based on these predictions, with the correct frames from the actual image sequences. For the latter comparisons, objective measures such as the mean squared error are applied. However, for applications like image interpolation, the users' expected quality of experience cannot be fully deduced from such simple quality measures. Therefore, we conducted a subjective quality assessment study by crowdsourcing for the interpolated images provided in one of the optical flow benchmarks, the Middlebury benchmark. We used paired comparisons with forced choice and reconstructed absolute quality scale values according to Thurstone's model using the classical least squares method. The results give rise to a re-ranking of the 141 participating algorithms with respect to the visual quality of their interpolated frames, which is mostly determined by the underlying optical flow estimation. Our re-ranking demonstrates the necessity of visual quality assessment as another evaluation metric for optical flow and frame interpolation benchmarks.
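The scale reconstruction named above (Thurstone's model with classical least squares) follows a standard recipe: convert empirical win proportions to z-scores via the inverse normal CDF, then solve a least squares problem for the scale values. The count matrix below is synthetic, and the study's actual data and solver details may differ:

```python
import numpy as np
from scipy.stats import norm

wins = np.array([[0, 14, 18],   # wins[i, j]: times item i was preferred over item j
                 [6, 0, 12],
                 [2, 8, 0]], dtype=float)
n = wins + wins.T                                            # comparisons per pair
p = np.divide(wins, n, out=np.full_like(wins, 0.5), where=n > 0)
p = np.clip(p, 0.01, 0.99)                                   # avoid infinite z-scores
z = norm.ppf(p)                                              # z[i, j] estimates s_i - s_j

# Least squares: find s minimizing the sum over pairs of (s_i - s_j - z_ij)^2.
m = z.shape[0]
rows, targets = [], []
for i in range(m):
    for j in range(m):
        if i != j:
            row = np.zeros(m)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            targets.append(z[i, j])
s, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
print(s - s.mean())                                          # zero-centered scale values
```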