
VIDOSAT: High-dimensional Sparsifying Transform Learning for Online Video Denoising

Published by Bihan Wen
Publication date: 2017
Research field: Informatics Engineering
Language: English





Techniques exploiting the sparsity of images in a transform domain have been effective for various applications in image and video processing. Transform learning methods involve cheap computations and have been demonstrated to perform well in applications such as image denoising and medical image reconstruction. Recently, we proposed methods for online learning of sparsifying transforms from streaming signals, which enjoy good convergence guarantees and involve lower computational costs than online synthesis dictionary learning. In this work, we apply online transform learning to video denoising. We present a novel framework for online video denoising based on high-dimensional sparsifying transform learning for spatio-temporal patches. The patches are constructed either from corresponding 2D patches in successive frames or using an online block matching technique. The proposed online video denoising requires little memory and offers efficient processing. Numerical experiments on several video data sets compare the proposed methods against the same denoising scheme with the transform fixed to the 3D DCT, against prior schemes such as dictionary learning-based methods, and against the state-of-the-art VBM3D and VBM4D, demonstrating the promising performance of the proposed methods.
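As a rough illustration of the fixed-transform baseline mentioned in the abstract, the following sketch stacks co-located 2D patches from successive frames into spatio-temporal patches, applies a 3D DCT, hard-thresholds the coefficients, and inverts the transform. It is not the authors' implementation: the function names, the non-overlapping patch grid, the synthetic test video, and the threshold of three noise standard deviations are all assumptions made for this example, and the learned online transform itself is not shown.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d


def extract_patches(frames, size):
    """Stack co-located size x size patches across all frames.

    Returns one vectorized spatio-temporal patch per column
    (non-overlapping grid, an assumption made for brevity).
    """
    _, h, w = frames.shape
    cols = []
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            cols.append(frames[:, r:r + size, c:c + size].reshape(-1))
    return np.stack(cols, axis=1)


def denoise_patches(patches, transform, threshold):
    """Hard-threshold transform coefficients, then invert the orthonormal transform."""
    coeffs = transform @ patches
    coeffs[np.abs(coeffs) < threshold] = 0.0
    return transform.T @ coeffs


def run_demo(sigma=0.1, size=8, seed=0):
    """Denoise a synthetic, temporally static video; return (noisy MSE, denoised MSE)."""
    rng = np.random.default_rng(seed)
    t, h, w = 4, 32, 32
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    frame = np.sin(2 * np.pi * rows / 16) + np.cos(2 * np.pi * cols / 16)
    clean = np.repeat(frame[None, :, :], t, axis=0)
    noisy = clean + sigma * rng.standard_normal(clean.shape)

    # The 3D DCT of a (time, row, column) patch flattened in C order is a
    # Kronecker product of 1D DCT matrices.
    d_s, d_t = dct_matrix(size), dct_matrix(t)
    transform = np.kron(d_t, np.kron(d_s, d_s))

    reference = extract_patches(clean, size)
    noisy_patches = extract_patches(noisy, size)
    denoised = denoise_patches(noisy_patches, transform, 3 * sigma)
    noisy_mse = np.mean((noisy_patches - reference) ** 2)
    denoised_mse = np.mean((denoised - reference) ** 2)
    return noisy_mse, denoised_mse
```

Calling `run_demo()` returns the mean squared error before and after denoising on the synthetic video. In a learned-transform variant, the fixed `transform` matrix would presumably be replaced by one that is updated as new frames arrive.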




Read also

Graph-based representations play a key role in machine learning. The fundamental step in these representations is the association of a graph structure to a dataset. In this paper, we propose a method that aims at finding a block sparse representation of the graph signal leading to a modular graph whose Laplacian matrix admits the found dictionary as its eigenvectors. The role of sparsity here is to induce a band-limited representation or, equivalently, a modular structure of the graph. The proposed strategy is composed of two optimization steps: i) learning an orthonormal sparsifying transform from the data; ii) recovering the Laplacian, and then topology, from the transform. The first step is achieved through an iterative algorithm whose alternating intermediate solutions are expressed in closed form. The second step recovers the Laplacian matrix from the sparsifying transform through a convex optimization method. Numerical results corroborate the effectiveness of the proposed methods over both synthetic data and real brain data, used for inferring the brain functionality network through experiments conducted over patients affected by epilepsy.
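The first step described above, learning an orthonormal sparsifying transform by alternating closed-form updates, can be sketched as follows. This is a simplified stand-in rather than the paper's algorithm: it assumes plain per-column sparsity (keeping the largest-magnitude coefficients) instead of block sparsity, it omits the Laplacian recovery step, and all function names are hypothetical. The transform update is the standard orthogonal Procrustes solution obtained from an SVD.

```python
import numpy as np


def hard_threshold(coeffs, sparsity):
    """Keep the `sparsity` largest-magnitude entries in each column."""
    out = np.zeros_like(coeffs)
    idx = np.argsort(-np.abs(coeffs), axis=0)[:sparsity]
    cols = np.arange(coeffs.shape[1])
    out[idx, cols] = coeffs[idx, cols]
    return out


def sparsification_error(transform, data, sparsity):
    """Squared Frobenius distance between W X and its sparse approximation."""
    coeffs = transform @ data
    return np.sum((coeffs - hard_threshold(coeffs, sparsity)) ** 2)


def learn_orthonormal_transform(data, sparsity, iterations=20):
    """Alternate closed-form sparse coding and orthonormal transform updates."""
    transform = np.eye(data.shape[0])
    for _ in range(iterations):
        # Sparse coding step: exact minimizer for a fixed transform.
        codes = hard_threshold(transform @ data, sparsity)
        # Transform step: with X Z^T = U S V^T, the orthonormal W minimizing
        # ||W X - Z||_F is W = V U^T (orthogonal Procrustes).
        u, _, vt = np.linalg.svd(data @ codes.T)
        transform = vt.T @ u.T
    return transform
```

Because each step exactly minimizes the same objective over one variable, the sparsification error cannot increase from one iteration to the next, which makes `sparsification_error` a convenient quantity to monitor.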
Modeling temporal visual context across frames is critical for video instance segmentation (VIS) and other video understanding tasks. In this paper, we propose a fast online VIS model named CrossVIS. For temporal information modeling in VIS, we present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames. Different from previous schemes, crossover learning does not require any additional network parameters for feature enhancement. By integrating with the instance segmentation loss, crossover learning enables efficient cross-frame instance-to-pixel relation learning and brings cost-free improvement during inference. Besides, a global balanced instance embedding branch is proposed for more accurate and more stable online instance association. We conduct extensive experiments on three challenging VIS benchmarks, i.e., YouTube-VIS-2019, OVIS, and YouTube-VIS-2021, to evaluate our methods. To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy. Code will be available to facilitate future research.
This paper proposes a novel memory-based online video representation that is efficient, accurate and predictive. This is in contrast to prior works that often rely on computationally heavy 3D convolutions, ignore actual motion when aligning features over time, or operate in an off-line mode to utilize future frames. In particular, our memory (i) holds the feature representation, (ii) is spatially warped over time to compensate for observer and scene motions, (iii) can carry long-term information, and (iv) enables predicting feature representations in future frames. By exploring a variant that operates at multiple temporal scales, we efficiently learn across even longer time horizons. We apply our online framework to object detection in videos, obtaining a large 2.3 times speed-up and losing only 0.9% mAP on ImageNet-VID dataset, compared to prior works that even use future frames. Finally, we demonstrate the predictive property of our representation in two novel detection setups, where features are propagated over time to (i) significantly enhance a real-time detector by more than 10% mAP in a multi-threaded online setup and to (ii) anticipate objects in future frames.
Within the field of image and video recognition, the traditional approach is a dataset split into fixed training and test partitions. However, the labelling of the training set is time-consuming, especially as datasets grow in size and complexity. Furthermore, this approach is not applicable to the home user, who wants to intuitively group their media without tirelessly labelling the content. Our interactive approach is able to iteratively cluster classes of images and video. Our approach is based around the concept of an image signature which, unlike a standard bag of words model, can express co-occurrence statistics as well as symbol frequency. We efficiently compute metric distances between signatures despite their inherent high dimensionality and provide discriminative feature selection, to allow common and distinctive elements to be identified from a small set of user labelled examples. These elements are then accentuated in the image signature to increase similarity between examples and pull correct classes together. By repeating this process in an online learning framework, the accuracy of similarity increases dramatically despite labelling only a few training examples. To demonstrate that the approach is agnostic to media type and features used, we evaluate on three image datasets (15 scene, Caltech101 and FG-NET), a mixed text and image dataset (ImageTag), a dataset used in active learning (Iris) and on three action recognition datasets (UCF11, KTH and Hollywood2). On the UCF11 video dataset, the accuracy is 86.7% despite using only 90 labelled examples from a dataset of over 1200 videos, instead of the standard 1122 training videos. The approach is both scalable and efficient, with a single iteration over the full UCF11 dataset of around 1200 videos taking approximately 1 minute on a standard desktop machine.
Achieving high-quality reconstructions from low-dose computed tomography (LDCT) measurements is of much importance in clinical settings. Model-based image reconstruction methods have been proven to be effective in removing artifacts in LDCT. In this work, we propose an approach to learn a rich two-layer clustering-based sparsifying transform model (MCST2), where image patches and their subsequent feature maps (filter residuals) are clustered into groups with different learned sparsifying filters per group. We investigate a penalized weighted least squares (PWLS) approach for LDCT reconstruction incorporating learned MCST2 priors. Experimental results show the superior performance of the proposed PWLS-MCST2 approach compared to other related recent schemes.