Rate-Distortion Optimized Quantization (RDOQ) has played an important role in the coding performance of recent video compression standards such as H.264/AVC, H.265/HEVC, VP9 and AV1. This scheme yields significant reductions in bit-rate at the expense of relatively small increases in distortion. Typically, RDOQ algorithms are prohibitively expensive to implement on real-time hardware encoders due to their sequential nature and their need to frequently obtain entropy coding costs. This work addresses this limitation with a neural network-based approach, which learns to trade off rate and distortion during offline supervised training. Because these networks rely solely on standard arithmetic operations that can be executed on existing neural network hardware, no additional area-on-chip needs to be reserved for dedicated RDOQ circuitry. We train two classes of neural networks, a fully-convolutional network and an auto-regressive network, and evaluate each as a post-quantization step designed to refine cheap quantization schemes such as scalar quantization (SQ). Both network architectures are designed to have a low computational overhead. After training, they are integrated into the HM 16.20 implementation of HEVC, and their video coding performance is evaluated on a subset of the H.266/VVC SDR common test sequences. Comparisons are made to the RDOQ and SQ implementations in HM 16.20. Our method achieves 1.64% BD-rate savings on the luma component compared to the HM SQ anchor, and on average reaches 45% of the performance of the iterative HM RDOQ algorithm.
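To make the post-quantization refinement idea concrete, the following PyTorch-style sketch shows how a small fully-convolutional network could take scalar-quantized levels together with the original transform coefficients and propose per-coefficient level adjustments. The layer sizes, the two-channel input and the {keep, decrement} action set are illustrative assumptions, not the trained architecture used in the paper.

```python
import torch
import torch.nn as nn

class SQRefinementCNN(nn.Module):
    """Illustrative fully-convolutional refiner for scalar-quantized (SQ) levels.

    Assumed input channels: SQ coefficient levels and the pre-quantization
    transform coefficients, stacked per transform block.  Output: a per-
    coefficient decision in {keep, decrement} applied to the SQ levels,
    mimicking the kind of level adjustments an RDOQ-like refiner makes.
    """

    def __init__(self, hidden=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2, kernel_size=1),  # logits for {keep, decrement}
        )

    def forward(self, sq_levels, coeffs):
        x = torch.stack([sq_levels, coeffs], dim=1).float()  # (B, 2, H, W)
        logits = self.body(x)                                 # (B, 2, H, W)
        decrement = logits.argmax(dim=1)                      # 1 where a level is lowered
        refined = sq_levels - decrement * sq_levels.sign()    # move level one step towards zero
        return refined, logits
```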
In modern video encoders, rate control is a critical and heavily engineered component. It decides how many bits to spend to encode each frame, in order to optimize the rate-distortion trade-off over all video frames. This is a challenging constrained planning problem because of the complex dependency among decisions for different video frames and the bitrate constraint defined at the end of the episode. We formulate the rate control problem as a Partially Observable Markov Decision Process (POMDP) and apply imitation learning to learn a neural rate control policy. We demonstrate that by learning from optimal video encoding trajectories obtained through evolution strategies, our learned policy achieves better encoding efficiency with minimal constraint violations. In addition to imitating the optimal actions, we find that auxiliary losses, data augmentation/refinement and inference-time policy improvements are critical for learning a good rate control policy. We evaluate the learned policy against the rate control policy in libvpx, a widely adopted open source VP9 codec library, in the two-pass variable bitrate (VBR) mode. We show that over a diverse set of real-world videos, our learned policy achieves an 8.5% median bitrate reduction without sacrificing video quality.
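As an illustration of the imitation-learning setup, the sketch below shows a minimal per-frame policy with an auxiliary bitrate-prediction head and a single supervised update towards expert (e.g. ES-searched) QP decisions. The observation features, action space, network size and loss weight are assumptions made for this example, not the published design.

```python
import torch
import torch.nn as nn

class RateControlPolicy(nn.Module):
    """Illustrative per-frame rate-control policy trained by imitation.

    Assumed observation: frame statistics plus the remaining bit budget.
    Assumed action: a discrete quantization parameter (QP) index.
    """

    def __init__(self, obs_dim=32, num_qp=256, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.qp_head = nn.Linear(hidden, num_qp)  # imitate the expert QP choice
        self.rate_head = nn.Linear(hidden, 1)     # auxiliary: predict frame bits

    def forward(self, obs):
        h = self.trunk(obs)
        return self.qp_head(h), self.rate_head(h)


def imitation_step(policy, optimizer, obs, expert_qp, frame_bits, aux_weight=0.1):
    """One supervised update towards the expert QP decisions."""
    qp_logits, rate_pred = policy(obs)
    loss = nn.functional.cross_entropy(qp_logits, expert_qp)
    loss = loss + aux_weight * nn.functional.mse_loss(rate_pred.squeeze(-1), frame_bits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```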
Rate distortion theory is concerned with optimally encoding a given signal class $\mathcal{S}$ using a budget of $R$ bits, as $R\to\infty$. We say that $\mathcal{S}$ can be compressed at rate $s$ if we can achieve an error of $\mathcal{O}(R^{-s})$ for encoding $\mathcal{S}$; the supremal compression rate is denoted $s^\ast(\mathcal{S})$. Given a fixed coding scheme, there usually are elements of $\mathcal{S}$ that are compressed at a higher rate than $s^\ast(\mathcal{S})$ by the given coding scheme; we study the size of this set of signals. We show that for certain nice signal classes $\mathcal{S}$, a phase transition occurs: We construct a probability measure $\mathbb{P}$ on $\mathcal{S}$ such that for every coding scheme $\mathcal{C}$ and any $s > s^\ast(\mathcal{S})$, the set of signals encoded with error $\mathcal{O}(R^{-s})$ by $\mathcal{C}$ forms a $\mathbb{P}$-null-set. In particular, our results apply to balls in Besov and Sobolev spaces that embed compactly into $L^2(\Omega)$ for a bounded Lipschitz domain $\Omega$. As an application, we show that several existing sharpness results concerning function approximation using deep neural networks are generically sharp. We also provide quantitative and non-asymptotic bounds on the probability that a random $f\in\mathcal{S}$ can be encoded to within accuracy $\varepsilon$ using $R$ bits. This result is applied to the problem of approximately representing $f\in\mathcal{S}$ to within accuracy $\varepsilon$ by a (quantized) neural network that is constrained to have at most $W$ nonzero weights and is generated by an arbitrary learning procedure. We show that for any $s > s^\ast(\mathcal{S})$ there are constants $c,C$ such that, no matter how we choose the learning procedure, the probability of success is bounded from above by $\min\big\{1,\, 2^{C\cdot W\lceil\log_2(1+W)\rceil^2 - c\cdot\varepsilon^{-1/s}}\big\}$.
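For readability, the two main statements can also be written in display form; the shorthand $\mathcal{C}_R(f)$ for the reconstruction produced by coding scheme $\mathcal{C}$ from $R$ bits is introduced here only for this display.

```latex
% Phase transition: for every coding scheme C and every rate s above s^*(S),
% the set of s-compressible signals is a P-null set.
\[
  \forall\,\mathcal{C},\ \forall\, s > s^{\ast}(\mathcal{S}):\qquad
  \mathbb{P}\Bigl(\bigl\{ f \in \mathcal{S} :
      \|f - \mathcal{C}_R(f)\|_{L^2} = \mathcal{O}(R^{-s}) \bigr\}\Bigr) = 0 .
\]

% Non-asymptotic bound: probability that a random f in S is represented to
% accuracy epsilon by a quantized network with at most W nonzero weights.
\[
  \mathbb{P}\bigl(\text{success}\bigr)
  \;\le\; \min\Bigl\{ 1,\; 2^{\,C \cdot W \lceil \log_2(1+W) \rceil^{2}
      \;-\; c\,\varepsilon^{-1/s}} \Bigr\}.
\]
```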
Limiting failures of machine learning systems is vital for safety-critical applications. In order to improve the robustness of machine learning systems, Distributionally Robust Optimization (DRO) has been proposed as a generalization of Empirical Risk Minimization (ERM), aiming at addressing this need. However, its use in deep learning has been severely restricted due to the relative inefficiency of the optimizers available for DRO in comparison to the widespread variants of Stochastic Gradient Descent (SGD) optimizers for ERM. We propose SGD with hardness weighted sampling, a principled and efficient optimization method for DRO in machine learning that is particularly suited to deep learning. Similar in essence and in practice to a hard example mining strategy, the proposed algorithm is straightforward to implement and computationally as efficient as SGD-based optimizers used for deep learning, requiring minimal overhead computation. In contrast to typical ad hoc hard mining approaches, and exploiting recent theoretical results in deep learning optimization, we prove the convergence of our DRO algorithm for over-parameterized deep learning networks with ReLU activations and a finite number of layers and parameters. Our experiments on brain tumor segmentation in MRI demonstrate the feasibility and usefulness of our approach. Using our hardness weighted sampling leads to a 2% decrease in the interquartile range of the Dice scores for the enhanced tumor and tumor core regions. The code for the proposed hardness weighted sampler will be made publicly available.
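A minimal sketch of the hardness-weighted sampling idea is given below, assuming stale per-example loss estimates and a softmax weighting with temperature `beta`; both are assumptions of this example rather than the exact scheme in the paper.

```python
import numpy as np

def hardness_weighted_indices(per_example_loss, batch_size, beta=1.0, rng=None):
    """Sample mini-batch indices with probability proportional to a softmax of
    the (possibly stale) per-example losses, so harder examples are drawn
    more often than under uniform sampling."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = beta * (per_example_loss - per_example_loss.max())  # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(per_example_loss), size=batch_size, replace=True, p=probs)


# Sketch of use inside an otherwise standard SGD loop:
# losses = np.ones(num_examples)              # running per-example loss estimates
# idx = hardness_weighted_indices(losses, batch_size=32)
# batch = dataset[idx]                        # forward/backward as usual
# losses[idx] = batch_losses                  # refresh losses of the sampled examples
```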
Deep neural networks often consist of a great number of trainable parameters for extracting powerful features from given datasets. On one hand, massive trainable parameters significantly enhance the performance of these deep networks. On the other hand, they bring the problem of over-fitting. To address this, dropout-based methods disable some elements in the output feature maps during the training phase to reduce the co-adaptation of neurons. Although the generalization ability of the resulting models can be enhanced by these approaches, the conventional binary dropout is not the optimal solution. Therefore, we investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks and propose a feature distortion method (Disout) to address the aforementioned problem. In the training period, randomly selected elements in the feature maps are replaced with specific values derived by exploiting the generalization error bound. The superiority of the proposed feature map distortion for producing deep neural networks with higher testing performance is analyzed and demonstrated on several benchmark image datasets.
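The following sketch illustrates the general shape of such a feature-distortion layer; the uniform noise model and the `alpha` scaling are stand-ins for the bound-derived replacement values described above, not the method's actual formula.

```python
import torch

def feature_distortion(x, drop_prob=0.1, alpha=1.0, training=True):
    """Disout-style distortion of an intermediate feature map `x`.

    Randomly selects elements of `x` and, instead of zeroing them as in
    binary dropout, replaces them with a perturbed value whose magnitude
    is scaled by `alpha` (illustrative replacement rule).
    """
    if not training or drop_prob == 0.0:
        return x
    mask = (torch.rand_like(x) < drop_prob).float()
    noise = alpha * x.std() * (2 * torch.rand_like(x) - 1)  # uniform in [-alpha*std, alpha*std]
    return x * (1 - mask) + noise * mask
```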
Federated learning facilitates learning across clients without transferring local data on these clients to a central server. Despite the success of federated learning, there remains room for improvement in communicating only the most critical information for model updates under limited communication conditions, which would extend this learning scheme to a wider range of application scenarios. In this work, we propose a nonlinear quantization for compressed stochastic gradient descent, which can be easily adopted in federated learning. Based on the proposed quantization, our system reduces the communication cost by up to three orders of magnitude, while largely maintaining the convergence and accuracy of the training process. Extensive experiments on image classification and brain tumor semantic segmentation using the MNIST, CIFAR-10 and BraTS datasets show state-of-the-art effectiveness and impressive communication efficiency.
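One simple way to realize a nonlinear quantizer for gradient tensors is a sign/log-magnitude codebook, sketched below; the codebook, level count and scaling are assumptions made for illustration and may differ from the scheme proposed in the paper.

```python
import numpy as np

def log_quantize(grad, num_levels=16, eps=1e-8):
    """Nonlinear (logarithmic) quantization of a gradient tensor.

    Maps each value to a sign and a small integer log-magnitude level before
    transmission, so small gradients keep relative precision while the payload
    shrinks to a few bits per entry.
    """
    sign = np.sign(grad).astype(np.int8)
    mag = np.abs(grad) + eps
    scale = mag.max()
    levels = np.clip(np.round(np.log2(mag / scale) + num_levels), 0, num_levels)
    return sign, levels.astype(np.uint8), scale


def log_dequantize(sign, levels, scale, num_levels=16):
    """Reconstruct an approximate gradient from the quantized representation."""
    return sign * scale * np.power(2.0, levels.astype(np.float64) - num_levels)
```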