Layered Image Compression using Scalable Auto-encoder

Published by Chuanmin Jia

Publication date: 2019

Research field: Informatics Engineering

Paper language: English

Authors: Chuanmin Jia, Zhaoyi Liu, Yao Wang

Multimedia

This paper presents a novel convolutional neural network (CNN) based image compression framework built on a scalable auto-encoder (SAE). Specifically, our SAE-based deep image codec consists of hierarchical coding layers, each of which is an end-to-end optimized auto-encoder. The coarse image content and texture are encoded by the first (base) layer, while the subsequent (enhancement) layers iteratively code the pixel-level reconstruction errors between the original image and the previous reconstruction. The proposed SAE structure removes the need to train multiple models for different bit-rate points, as recently proposed auto-encoder based codecs require. The SAE layers can be combined to realize multiple rate points, or to produce a scalable stream. On a standard public image dataset, the proposed method achieves rate-distortion performance in the low-to-medium rate range similar to that of the state-of-the-art CNN based image codec (which uses different optimized networks to realize different bit rates). Furthermore, the proposed codec generates better perceptual quality in this bit rate range.
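The layered scheme described in the abstract, a base layer followed by enhancement layers that each code the remaining reconstruction error, can be illustrated with a minimal sketch. This is not the authors' network: each layer here is a plain uniform scalar quantizer standing in for a trained auto-encoder, and the names `encode_layers`, `decode_layers`, and the step sizes are assumptions chosen only for illustration.

```python
# Toy stand-in for layered residual coding. Each "layer" is a uniform
# scalar quantizer (an assumption), not a trained auto-encoder.

def quantize(values, step):
    """Encode: map each value to an integer index (this layer's code)."""
    return [round(v / step) for v in values]

def dequantize(indices, step):
    """Decode: map integer indices back to reconstructed values."""
    return [i * step for i in indices]

def encode_layers(signal, steps):
    """Base layer codes the signal; each later layer codes the residual left so far."""
    codes = []
    residual = list(signal)
    for step in steps:
        idx = quantize(residual, step)
        recon = dequantize(idx, step)
        residual = [r - q for r, q in zip(residual, recon)]
        codes.append(idx)
    return codes

def decode_layers(codes, steps, num_layers):
    """Sum the reconstructions of the first num_layers layers (scalable decoding)."""
    recon = [0.0] * len(codes[0])
    for idx, step in zip(codes[:num_layers], steps[:num_layers]):
        recon = [a + b for a, b in zip(recon, dequantize(idx, step))]
    return recon

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

signal = [12.3, 200.7, 55.1, 91.9, 130.4, 7.8]  # hypothetical pixel values
steps = [32.0, 8.0, 2.0]                        # coarse base layer, finer enhancement layers
codes = encode_layers(signal, steps)
errors = [mse(signal, decode_layers(codes, steps, k)) for k in range(1, len(steps) + 1)]
print(errors)  # distortion after decoding 1, 2, and 3 layers
```

Decoding only the first layer gives a coarse reconstruction; adding each further layer lowers the distortion for this input, which mirrors how a single set of layers can serve several rate points.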

Zhizheng Zhang, Zhibo Chen, Jianxin Lin (2018)

In this paper, we propose a learned scalable/progressive image compression scheme based on deep neural networks (DNN), named Bidirectional Context Disentanglement Network (BCD-Net). To learn hierarchical representations, we first adopt bit-plane decomposition to decompose the information coarsely before the deep-learning-based transformation. However, the information carried by different bit-planes is not only unequal in entropy but also of different importance for reconstruction. We thus take the hidden features corresponding to different bit-planes as the context and design a network topology with bidirectional flows to disentangle the contextual information for more effective compressed representations. Our proposed scheme enables us to obtain compressed codes with scalable rates via one-pass encoding and decoding. Experimental results demonstrate that our proposed model outperforms the state-of-the-art DNN-based scalable image compression methods in both PSNR and MS-SSIM metrics. In addition, our proposed model achieves higher performance in the MS-SSIM metric than conventional scalable image codecs. The effectiveness of our technical components is also verified through ablation experiments.
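The bit-plane decomposition mentioned in the abstract can be sketched in a few lines. This shows only the generic decomposition step, assuming 8-bit unsigned pixel values; it says nothing about how BCD-Net itself processes the planes, and the function names are hypothetical.

```python
def to_bit_planes(pixels, bits=8):
    """Split unsigned integers into bit-planes, most significant plane first."""
    return [[(p >> b) & 1 for p in pixels] for b in range(bits - 1, -1, -1)]

def from_bit_planes(planes, bits=8):
    """Rebuild values from the given leading planes; missing low planes count as zero."""
    values = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        weight = 1 << (bits - 1 - k)  # place value of the k-th plane
        values = [v + bit * weight for v, bit in zip(values, plane)]
    return values

pixels = [0, 37, 128, 200, 255]          # hypothetical 8-bit samples
planes = to_bit_planes(pixels)
coarse = from_bit_planes(planes[:3])     # keep only the 3 most significant planes
exact = from_bit_planes(planes)          # all 8 planes restore the input
print(coarse, exact)
```

Keeping only the leading planes yields a coarse approximation, and each additional plane refines it, which is why the planes carry unequal importance for reconstruction.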

Multimedia

Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation

Jing Yi, Yaochen Zhu, Jiayi Xie (2021)

In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the alignment of the two corresponding embeddings of a matched video-music pair is achieved by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle, where the semantic information in the visual and textual modalities of the micro-video is weighted according to their variance estimates, such that the modality with a lower noise level is given more weight. Therefore, the micro-video latent variables contain less irrelevant information, which results in more robust model generalization. Furthermore, we establish a large-scale content-based micro-video background music recommendation dataset, TT-150k, composed of approximately 3,000 different background music clips associated with 150,000 micro-videos from different users. Extensive experiments on the established TT-150k dataset demonstrate the effectiveness of the proposed method. A qualitative assessment of CMVAE by visualizing some recommendation results is also included.
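The variance-based weighting in product-of-experts fusion can be made concrete with a small sketch. It assumes each modality contributes a one-dimensional Gaussian expert and omits any prior expert; whether CMVAE uses exactly this form is not stated in the abstract, and `product_of_gaussians` is a hypothetical helper name.

```python
def product_of_gaussians(means, variances):
    """Fuse 1-D Gaussian experts: precisions add, and the mean is precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance

# Hypothetical experts: the visual estimate is less noisy than the textual one.
visual_mean, visual_var = 2.0, 0.5
textual_mean, textual_var = 4.0, 2.0

mean, var = product_of_gaussians([visual_mean, textual_mean], [visual_var, textual_var])
print(mean, var)
```

The fused mean lands closer to the visual estimate because its variance is lower, and the fused variance is smaller than either input variance.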

Multimedia, Information Retrieval

Training Auto-encoder-based Optimizers for Terahertz Image Reconstruction

Tak Ming Wong, Matthias Kahl, Peter Haring Bolivar (2019)

Terahertz (THz) sensing is a promising imaging technology for a wide variety of applications. Extracting the interpretable and physically meaningful parameters for such applications, however, requires solving an inverse problem in which a model function determined by these parameters needs to be fitted to the measured data. Since the underlying optimization problem is nonconvex and very costly to solve, we propose learning the prediction of suitable parameters from the measured data directly. More precisely, we develop a model-based autoencoder in which the encoder network predicts suitable parameters and the decoder is fixed to a physically meaningful model function, such that we can train the encoding network in an unsupervised way. We illustrate numerically that the resulting network is more than 140 times faster than classical optimization techniques while making predictions with only slightly higher objective values. Using such predictions as starting points of local optimization techniques allows us to converge to better local minima about twice as fast as optimization without the network-based initialization.

Computer Vision and Pattern Recognition, Machine Learning, Image and Video Processing

Multiview Navigation based on Extended Layered Depth Image Representation

Uday Takyar, Thomas Maugey, Pascal Frossard (2013)

Emerging applications in multiview streaming aim to provide interactive navigation services to video players. The user can ask for information from any viewpoint with a minimum transmission delay. The purpose is to provide the user with as much information as possible with the least redundancy. The recent concept of navigation segment representation consists of regrouping a given number of viewpoints in one signal and transmitting them to the users according to their navigation path. The question of the best description strategy for these navigation segments is, however, still open. In this paper, we propose to represent and code navigation segments by a method that extends the recent layered depth image (LDI) format. It consists of describing the scene from a viewpoint with multiple images organized in layers corresponding to the different levels of occluded objects. The notion of extended LDI comes from the fact that the size of this image is adapted to also take into account the sides of the scene, in contrast to classical LDI. The obtained results show a significant rate-distortion gain compared to classical multiview compression approaches in the navigation scenario.

Multimedia

Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model

Ren Yang, Fabian Mentzer, Luc Van Gool (2020)

The past few years have witnessed increasing interest in applying deep learning to video compression. However, the existing approaches compress a video frame using only a small number of reference frames, which limits their ability to fully exploit the temporal correlation among video frames. To overcome this shortcoming, this paper proposes a Recurrent Learned Video Compression (RLVC) approach with the Recurrent Auto-Encoder (RAE) and Recurrent Probability Model (RPM). Specifically, the RAE employs recurrent cells in both the encoder and decoder. As such, the temporal information in a large range of frames can be used for generating latent representations and reconstructing compressed outputs. Furthermore, the proposed RPM network recurrently estimates the Probability Mass Function (PMF) of the latent representation, conditioned on the distribution of previous latent representations. Due to the correlation among consecutive frames, the conditional cross entropy can be lower than the independent cross entropy, thus reducing the bit-rate. The experiments show that our approach achieves state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM. Moreover, our approach outperforms the default Low-Delay P (LDP) setting of x265 on PSNR, and also has better performance on MS-SSIM than the SSIM-tuned x265 and the slowest setting of x265. The code is available at https://github.com/RenYang-home/RLVC.git.
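The claim that conditioning on previous latents can lower the bit cost can be illustrated with empirical entropies on a toy symbol sequence. This is a sketch under strong assumptions: it uses counted frequencies instead of a learned probability model, conditions only on the immediately preceding symbol, and the sequence `latents` is invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Empirical entropy in bits per symbol, treating symbols as independent."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(symbols):
    """Empirical H(X_t | X_{t-1}) in bits per symbol, from consecutive pairs."""
    pairs = list(zip(symbols[:-1], symbols[1:]))
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    n = len(pairs)
    return -sum(c / n * log2(c / prev_counts[prev])
                for (prev, _), c in pair_counts.items())

# Hypothetical quantized latents with strong temporal correlation.
latents = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1]

independent_bits = entropy(latents[1:])          # cost without any context
conditional_bits = conditional_entropy(latents)  # cost given the previous symbol
print(independent_bits, conditional_bits)
```

Because neighboring symbols tend to repeat, the conditional estimate is lower than the independent one for this sequence, which is the effect the abstract attributes to the recurrent probability model.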

Image and Video Processing, Computer Vision and Pattern Recognition