
Learned Scalable Image Compression with Bidirectional Context Disentanglement Network

Posted by Zhizheng Zhang
Publication date: 2018
Research field: Informatics Engineering
Paper language: English





In this paper, we propose a learned scalable/progressive image compression scheme based on deep neural networks (DNN), named Bidirectional Context Disentanglement Network (BCD-Net). For learning hierarchical representations, we first adopt bit-plane decomposition to decompose the information coarsely before the deep-learning-based transformation. However, the information carried by different bit-planes is not only unequal in entropy but also of different importance for reconstruction. We thus take the hidden features corresponding to different bit-planes as the context and design a network topology with bidirectional flows to disentangle the contextual information for more effective compressed representations. Our proposed scheme enables us to obtain compressed codes with scalable rates in a single encoding-decoding pass. Experimental results demonstrate that our proposed model outperforms the state-of-the-art DNN-based scalable image compression methods in both PSNR and MS-SSIM metrics. In addition, our proposed model achieves higher performance in the MS-SSIM metric than conventional scalable image codecs. The effectiveness of our technical components is also verified through ablation experiments.
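As a concrete illustration of the bit-plane decomposition step, the NumPy sketch below splits an 8-bit image into binary planes and shows how keeping only the most significant planes yields a coarse, lower-rate reconstruction. The function names and the rescaling of the partial decode are illustrative assumptions, not the paper's actual pipeline, which applies a learned transform and bidirectional context disentanglement on top of the planes.

```python
import numpy as np

def bitplane_decompose(image_u8: np.ndarray, num_planes: int = 8) -> np.ndarray:
    """Split an 8-bit image into binary bit-planes, most significant plane first."""
    planes = [(image_u8 >> b) & 1 for b in range(num_planes - 1, -1, -1)]
    return np.stack(planes, axis=0).astype(np.uint8)

def bitplane_reconstruct(planes: np.ndarray) -> np.ndarray:
    """Recombine bit-planes; feeding only the first k planes gives a coarse image."""
    weights = 2 ** np.arange(planes.shape[0] - 1, -1, -1, dtype=np.int32)
    return np.tensordot(weights, planes.astype(np.int32), axes=1)

if __name__ == "__main__":
    img = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
    planes = bitplane_decompose(img)
    assert np.array_equal(bitplane_reconstruct(planes), img)
    # Progressive decode: the four most significant planes recover the top 4 bits.
    coarse = bitplane_reconstruct(planes[:4]) << 4
    assert np.array_equal(coarse, (img.astype(np.int32) >> 4) << 4)
```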




Read also

This paper presents a novel convolutional neural network (CNN) based image compression framework via a scalable auto-encoder (SAE). Specifically, our SAE-based deep image codec consists of hierarchical coding layers, each of which is an end-to-end optimized auto-encoder. The coarse image content and texture are encoded through the first (base) layer, while the consecutive (enhancement) layers iteratively code the pixel-level reconstruction errors between the original and previously reconstructed images. The proposed SAE structure alleviates the need, present in recently proposed auto-encoder based codecs, to train multiple models for different bit-rate points. The SAE layers can be combined to realize multiple rate points, or to produce a scalable stream. The proposed method has similar rate-distortion performance in the low-to-medium rate range as the state-of-the-art CNN based image codec (which uses different optimized networks to realize different bit rates) over a standard public image dataset. Furthermore, the proposed codec generates better perceptual quality in this bit-rate range.
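To make the layered residual coding concrete, here is a minimal sketch of the base-plus-enhancement structure: the base layer codes the image, each further layer codes the residual of the current reconstruction, and decoding only the first k layers gives a lower-rate picture. The ToyLayerCodec quantizer is a stand-in assumption for the end-to-end trained auto-encoder layers of the actual SAE.

```python
import numpy as np

class ToyLayerCodec:
    """Stand-in for one coding layer; the real SAE uses a trained auto-encoder."""
    def __init__(self, step: float):
        self.step = step
    def encode(self, x: np.ndarray) -> np.ndarray:
        return np.round(x / self.step).astype(np.int32)   # layer "bitstream"
    def decode(self, code: np.ndarray) -> np.ndarray:
        return code.astype(np.float32) * self.step

def scalable_encode(image: np.ndarray, layers: list) -> list:
    """Base layer codes the image; each enhancement layer codes the residual."""
    codes, recon = [], np.zeros_like(image, dtype=np.float32)
    for layer in layers:
        code = layer.encode(image - recon)     # residual w.r.t. current reconstruction
        codes.append(code)
        recon = recon + layer.decode(code)     # refine the reconstruction
    return codes

def scalable_decode(codes: list, layers: list, k: int) -> np.ndarray:
    """Using only the first k layers yields a coarser, lower-rate reconstruction."""
    recon = np.zeros_like(codes[0], dtype=np.float32)
    for code, layer in zip(codes[:k], layers[:k]):
        recon = recon + layer.decode(code)
    return recon

if __name__ == "__main__":
    img = np.random.rand(8, 8).astype(np.float32)
    layers = [ToyLayerCodec(step) for step in (0.25, 0.05, 0.01)]
    codes = scalable_encode(img, layers)
    for k in range(1, 4):
        print(k, np.abs(scalable_decode(codes, layers, k) - img).mean())
```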
For learned image compression, the autoregressive context model has proved effective in improving rate-distortion (RD) performance, because it helps remove spatial redundancies among latent representations. However, the decoding process must then be done in a strict scan order, which breaks parallelization. We propose a parallelizable checkerboard context model (CCM) to solve this problem. Our two-pass checkerboard context calculation eliminates such limitations on spatial locations by re-organizing the decoding order. Speeding up the decoding process more than 40 times in our experiments, it achieves significantly improved computational efficiency with almost the same rate-distortion performance. To the best of our knowledge, this is the first exploration of a parallelization-friendly spatial context model for learned image compression.
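The two-pass idea can be sketched as follows: positions on one colour of a checkerboard (anchors) are decoded first without spatial context, then every remaining position is decoded in parallel using the already-decoded anchors as its context. The neighbour-mean predictor and the explicit residual coding below are simplifying assumptions that stand in for the learned, context-conditioned entropy model; they only illustrate why both passes are fully parallel.

```python
import numpy as np

def checkerboard_masks(h: int, w: int):
    """Two interleaved position sets; each set can be processed fully in parallel."""
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    anchors = (yy + xx) % 2 == 0
    return anchors, ~anchors

def neighbour_mean(x: np.ndarray) -> np.ndarray:
    """Average of the 4-connected neighbours (zero padding at the borders)."""
    p = np.pad(x, 1)
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0

def encode(latents: np.ndarray):
    """Pass 1 stores anchors as-is; pass 2 stores only residuals against the
    anchor-based prediction (a toy stand-in for context-conditioned coding)."""
    anchors, non_anchors = checkerboard_masks(*latents.shape)
    anchor_vals = latents * anchors
    residuals = (latents - neighbour_mean(anchor_vals)) * non_anchors
    return anchor_vals, residuals

def decode(anchor_vals: np.ndarray, residuals: np.ndarray) -> np.ndarray:
    """Both passes are whole-plane element-wise operations, hence parallelizable."""
    anchors, non_anchors = checkerboard_masks(*anchor_vals.shape)
    pred = neighbour_mean(anchor_vals)         # pass-2 context from decoded anchors
    return anchor_vals + (pred + residuals) * non_anchors

if __name__ == "__main__":
    y = np.random.randn(6, 6)
    assert np.allclose(decode(*encode(y)), y)
```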
Learning semantic correspondence between image and text is significant, as it bridges the semantic gap between vision and language. The key challenge is to accurately find and correlate the shared semantics in image and text. Most existing methods achieve this goal by representing the shared semantic as a weighted combination of all the fragments (image regions or text words), where fragments relevant to the shared semantic obtain more attention and the others less. However, although relevant fragments contribute more to the shared semantic, irrelevant ones still disturb it to some degree, and thus lead to semantic misalignment in the correlation phase. To address this issue, we present a novel Bidirectional Focal Attention Network (BFAN), which not only attends to relevant fragments but also diverts all the attention to these relevant fragments so as to concentrate on them. The main difference from existing works is that they mostly focus on learning attention weights, while our BFAN focuses on eliminating irrelevant fragments from the shared semantic. The focal attention is achieved by pre-assigning attention based on the inter-modality relation, identifying relevant fragments based on the intra-modality relation, and reassigning attention. Furthermore, the focal attention is applied jointly in both the image-to-text and text-to-image directions, which avoids a preference for long text or complex images. Experiments show our simple but effective framework significantly outperforms the state-of-the-art, with relative Recall@1 gains of 2.2% on both the Flickr30K and MSCOCO benchmarks.
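A rough sketch of one direction of the focal attention is shown below: attention is pre-assigned from cross-modal similarities, fragments with above-average attention are kept as the relevant set, and all attention mass is then re-assigned to that set. The above-average rule and the feature shapes are illustrative assumptions, not the paper's exact intra-modality criterion; the image-to-text direction would apply the same step symmetrically.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def focal_attention(query: np.ndarray, fragments: np.ndarray) -> np.ndarray:
    """Simplified focal attention for one query against a set of fragments."""
    pre_attn = softmax(fragments @ query)        # 1) pre-assign attention
    relevant = pre_attn >= pre_attn.mean()       # 2) identify relevant fragments
    focal = np.where(relevant, pre_attn, 0.0)
    focal = focal / focal.sum()                  # 3) re-assign attention to them only
    return focal @ fragments                     # attended shared-semantic vector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    word = rng.normal(size=16)                   # a text-word embedding (assumed dim)
    regions = rng.normal(size=(36, 16))          # image-region features
    attended = focal_attention(word, regions)    # text-to-image direction
```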
Hui Cui, Lei Zhu, Jingjing Li (2021)
Hashing learns compact binary codes to store and retrieve massive data efficiently. In particular, unsupervised deep hashing is supported by powerful deep neural networks and has the desirable advantage of label independence. It is a promising technique for scalable image retrieval. However, deep models introduce a large number of parameters, which are hard to optimize due to the lack of explicit semantic labels and bring considerable training cost. As a result, the retrieval accuracy and training efficiency of existing unsupervised deep hashing methods are still limited. To tackle these problems, in this paper we propose a simple and efficient Lightweight Augmented Graph Network Hashing (LAGNH) method with a two-pronged strategy. For one thing, we extract the inner structure of the image as auxiliary semantics to enhance the semantic supervision of the unsupervised hash learning process. For another, we design a lightweight network structure with the assistance of the auxiliary semantics, which greatly reduces the number of network parameters that need to be optimized and thus greatly accelerates the training process. Specifically, we design a cross-modal attention module based on the auxiliary semantic information to adaptively mitigate the adverse effects in the deep image features. Besides, the hash codes are learned by multi-layer message passing within an adversarially regularized graph convolutional network. Simultaneously, the semantic representation capability of the hash codes is further enhanced by reconstructing the similarity graph.
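The graph-based part of the hash learning can be sketched very loosely as message passing over an image similarity graph followed by a similarity-graph reconstruction objective on the relaxed codes. The adversarial regularization and the cross-modal attention module are omitted here, and all shapes and the loss form below are assumptions made only for illustration.

```python
import numpy as np

def gcn_layer(adj_norm: np.ndarray, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One round of message passing: aggregate neighbours, then transform."""
    return np.tanh(adj_norm @ feats @ weight)

def graph_reconstruction_loss(sim: np.ndarray, codes: np.ndarray) -> float:
    """Encourage codes whose inner products reproduce the similarity graph."""
    return float(np.mean((sim - codes @ codes.T / codes.shape[1]) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, f, d = 32, 64, 16                         # images, feature dim, code length
    feats = rng.normal(size=(n, f))
    adj = (rng.random((n, n)) > 0.8).astype(float)
    adj = np.maximum(adj, adj.T)                 # symmetric neighbourhood graph
    adj_norm = adj / np.maximum(adj.sum(1, keepdims=True), 1.0)
    w1, w2 = 0.1 * rng.normal(size=(f, 32)), 0.1 * rng.normal(size=(32, d))
    relaxed = gcn_layer(adj_norm, gcn_layer(adj_norm, feats, w1), w2)
    hash_codes = np.sign(relaxed)                # binarize at retrieval time
    sim = (feats @ feats.T > 0).astype(float)    # toy similarity graph
    print(graph_reconstruction_loss(sim, relaxed), hash_codes.shape)
```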
Thierry Dumas (2018)
This paper describes a set of neural network architectures, called Prediction Neural Networks Set (PNNS), based on both fully-connected and convolutional neural networks, for intra image prediction. The choice of neural network for predicting a given image block depends on the block size, and hence does not need to be signalled to the decoder. It is shown that, while fully-connected neural networks give good performance for small block sizes, convolutional neural networks provide better predictions in large blocks with complex textures. Thanks to the use of masks of random sizes during training, the neural networks of PNNS adapt well to the available context, which may vary depending on the position of the image block to be predicted. When integrating PNNS into an H.265 codec, PSNR-rate performance gains ranging from 1.46% to 5.20% are obtained. These gains are on average 0.99% larger than those of prior neural network based methods. Unlike the H.265 intra prediction modes, each of which is specialized in predicting a specific texture, the proposed PNNS can model a large set of complex textures.
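Because the predictor is chosen from the block size alone, which the decoder already knows, the choice never has to be written into the bitstream. The stub predictors and the size threshold below are placeholders (assumptions) for the trained fully-connected and convolutional networks of PNNS; only the selection logic is the point of the sketch.

```python
import numpy as np

def fc_predict(context: np.ndarray, block_size: int) -> np.ndarray:
    """Placeholder for a fully-connected predictor used on small blocks."""
    return np.full((block_size, block_size), context.mean(), dtype=np.float32)

def cnn_predict(context: np.ndarray, block_size: int) -> np.ndarray:
    """Placeholder for a convolutional predictor used on large, textured blocks."""
    return np.full((block_size, block_size), np.median(context), dtype=np.float32)

def predict_block(context: np.ndarray, block_size: int) -> np.ndarray:
    """Pick the network from the block size; no signalling to the decoder needed."""
    if block_size <= 8:
        return fc_predict(context, block_size)
    return cnn_predict(context, block_size)

if __name__ == "__main__":
    context = np.random.randint(0, 256, size=(16, 16)).astype(np.float32)
    print(predict_block(context, 8).shape, predict_block(context, 32).shape)
```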
