ﻻ يوجد ملخص باللغة العربية
We propose an efficient inference framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video. Our method performs inference on selected keyframes and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream. Specifically, we propose a new motion vector-based warping method for propagating segmentation masks from keyframes to other frames in a multi-reference manner. Additionally, we propose a residual-based refinement module that can correct and add detail to the block-wise propagated segmentation masks. Our approach is flexible and can be added on top of existing video object segmentation algorithms. With STM with top-k filtering as our base model, we achieved highly competitive results on DAVIS16 and YouTube-VOS with substantial speedups of up to 4.9X with little loss in accuracy.
Recently, several Space-Time Memory based networks have shown that the object cues (e.g. video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit t
Object detection in videos has drawn increasing attention since it is more practical in real scenarios. Most of the deep learning methods use CNNs to process each decoded frame in a video stream individually. However, the free of charge yet valuable
This paper presents a novel approach for segmenting moving objects in unconstrained environments using guided convolutional neural networks. This guiding process relies on foreground masks from independent algorithms (i.e. state-of-the-art algorithms
Annotating videos with object segmentation masks typically involves a two stage procedure of drawing polygons per object instance for all the frames and then linking them through time. While simple, this is a very tedious, time consuming and expensiv
This paper investigates how to realize better and more efficient embedding learning to tackle the semi-supervised video object segmentation under challenging multi-object scenarios. The state-of-the-art methods learn to decode features with a single