Disentanglement is a highly desirable property of representations owing to its similarity to human understanding and reasoning. Many works achieve disentanglement via information bottlenecks (IB). Despite their elegant mathematical foundations, IB-based approaches usually exhibit lower performance. To provide insight into this problem, we develop an annealing test that computes the information freezing point (IFP), the transition state at which information is frozen into the latent variables. We also explore clues, or inductive biases, for separating the entangled factors according to differences in their IFP distributions. We find that existing approaches suffer from an information diffusion problem, whereby the increased information diffuses across all latent variables. Based on this insight, we propose a novel disentanglement framework, termed distilling entangled factors (DEFT), which addresses the information diffusion problem by scaling the backward information. DEFT applies a multistage training strategy, including multigroup encoders with different learning rates and piecewise disentanglement pressure, to disentangle the factors stage by stage. We evaluate DEFT on three variants of dSprites and on SmallNORB, obtaining high disentanglement scores with low variance. Furthermore, an experiment with correlated factors shows that TC-based approaches are incapable of handling them. DEFT also exhibits competitive performance in the unsupervised setting.
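The annealing test itself is not specified here, but as a rough illustration of how an IFP could be read off from such a test, the sketch below assumes a β-VAE-style IB objective whose pressure β is annealed while the per-latent KL term is logged; each latent's IFP is then taken as the last β at which it still carries non-negligible information. The threshold, schedule, and toy measurements are all assumptions, not the authors' procedure.

```python
# Illustrative post-processing only (not the authors' exact annealing test):
# given per-latent KL measurements logged while annealing the IB pressure beta,
# estimate each latent's information freezing point (IFP) as the largest beta
# at which that latent still carries more than `eps` nats of information.
import numpy as np

def estimate_ifp(betas, kl_per_latent, eps=0.05):
    """betas: (T,) increasing schedule; kl_per_latent: (T, d) KL(q(z_i|x)||p(z_i)) per latent."""
    betas = np.asarray(betas)
    kl = np.asarray(kl_per_latent)
    ifp = np.full(kl.shape[1], np.nan)
    for i in range(kl.shape[1]):
        active = np.where(kl[:, i] > eps)[0]  # anneal steps where latent i is still informative
        if active.size:
            ifp[i] = betas[active[-1]]        # last pressure before the information freezes out
    return ifp

# toy log: latent 0 freezes under mild pressure, latent 1 survives much higher pressure
betas = np.linspace(0.1, 8.0, 80)
kl = np.stack([np.clip(2.0 - betas, 0, None), np.clip(5.0 - betas, 0, None)], axis=1)
print(estimate_ifp(betas, kl))  # -> approximately [1.9, 4.9]
```

Clearly separated per-factor IFP estimates of this kind are the sort of clue the abstract refers to for deciding which latent variable should absorb which factor.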
Disentangling data into interpretable and independent factors is critical for controllable generation tasks. With the availability of labeled data, supervision can help enforce the separation of specific factors as expected. However, it is often expensive or even impossible to label every single factor to achieve fully supervised disentanglement. In this paper, we adopt a general setting where all factors that are hard to label or identify are encapsulated as a single unknown factor. Under this setting, we propose a flexible weakly supervised multi-factor disentanglement framework, DisUnknown, which Distills Unknown factors to enable multi-conditional generation with respect to both labeled and unknown factors. Specifically, a two-stage training approach is adopted: we first disentangle the unknown factor with an effective and robust training method, and then train the final generator, with proper disentanglement of all labeled factors, by utilizing the distilled unknown factor. To demonstrate the generalization capacity and scalability of our method, we evaluate it qualitatively and quantitatively on multiple benchmark datasets and further apply it to various real-world applications on complicated datasets.
The transfer of knowledge from one policy to another is an important tool in Deep Reinforcement Learning. This process, referred to as distillation, has been used to great success, for example by enhancing the optimisation of agents, leading to stronger performance faster on harder domains [26, 32, 5, 8]. Despite the widespread use and conceptual simplicity of distillation, many different formulations are used in practice, and the subtle variations between them can often drastically change the performance and the resulting objective that is being optimised. In this work, we rigorously explore the entire landscape of policy distillation, comparing the motivations and strengths of each variant through theoretical and empirical analysis. Our results point to three distillation techniques that are preferred depending on the specifics of the task. Specifically, a newly proposed expected entropy regularised distillation allows for quicker learning in a wide range of situations, while still guaranteeing convergence.
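For orientation, one common way to write an on-policy distillation objective with entropy regularisation (a generic form under standard conventions, not necessarily the exact variant analysed in the paper) is

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta}}\Big[\, \mathrm{KL}\big(\pi_T(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big) \;-\; \alpha\,\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) \Big],
$$

where $\pi_T$ is the teacher, $\pi_\theta$ the student, $d^{\pi_\theta}$ the student's state visitation distribution, and $\alpha \ge 0$ the entropy weight. Distillation variants of this kind typically differ in which policy generates the trajectories, the direction of the KL term, and how (and under which expectation) the entropy regulariser enters.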
Let $X$ and $Y$ be dependent random variables. This paper considers the problem of designing a scalar quantizer for $Y$ to maximize the mutual information between the quantizer's output and $X$, and develops fundamental properties and bounds for this form of quantization, which is connected to the log-loss distortion criterion. The main focus is the regime of low $I(X;Y)$, where it is shown that, if $X$ is binary, a constant fraction of the mutual information can always be preserved using $\mathcal{O}(\log(1/I(X;Y)))$ quantization levels, and there exist distributions for which this many quantization levels are necessary. Furthermore, for larger finite alphabets $2 < |\mathcal{X}| < \infty$, it is established that an $\eta$-fraction of the mutual information can be preserved using roughly $(\log(|\mathcal{X}|/I(X;Y)))^{\eta\cdot(|\mathcal{X}|-1)}$ quantization levels.
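As a concrete illustration of the quantity being optimised (an illustrative computation on a toy joint pmf, not material from the paper), the snippet below evaluates $I(X;Q(Y))$ for a given partition of the $Y$ alphabet into quantization cells; the example distribution and cells are arbitrary.

```python
# Illustrative computation (not from the paper): mutual information I(X; Q(Y))
# retained when a scalar quantizer Q maps the alphabet of Y onto a few levels.
import numpy as np

def mutual_information(p_joint):
    """Mutual information between the row and column variables of a joint pmf (in nats)."""
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (px @ py)[mask])))

def quantized_mi(p_xy, cells):
    """cells: list of index sets partitioning the Y alphabet (the quantizer)."""
    p_xq = np.stack([p_xy[:, list(c)].sum(axis=1) for c in cells], axis=1)
    return mutual_information(p_xq)

# toy joint pmf over binary X and |Y| = 4, quantized to 2 levels
p_xy = np.array([[0.20, 0.15, 0.10, 0.05],
                 [0.05, 0.10, 0.15, 0.20]])
print(mutual_information(p_xy))              # I(X;Y)
print(quantized_mi(p_xy, [{0, 1}, {2, 3}]))  # I(X;Q(Y)) for a 2-level quantizer
```

Designing the quantizer then amounts to choosing the cells that maximize this retained information subject to a budget on the number of levels.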
We investigate a class of aggregation-diffusion equations with strongly singular kernels and weak (fractional) dissipation in the presence of an incompressible flow. Without the flow the equations are supercritical in the sense that the tendency to concentrate dominates the strength of diffusion and solutions emanating from sufficiently localised initial data may explode in finite time. The main purpose of this paper is to show that under suitable spectral conditions on the flow, which guarantee good mixing properties, for any regular initial datum the solution to the corresponding advection-aggregation-diffusion equation is global if the prescribed flow is sufficiently fast. This paper can be seen as a partial extension of Kiselev and Xu (Arch. Rat. Mech. Anal. 222(2), 2016), and our arguments show in particular that the suppression mechanism for the classical 2D parabolic-elliptic Keller-Segel model devised by Kiselev and Xu also applies to the fractional Keller-Segel model (where $\Delta$ is replaced by $-\Lambda^{\gamma}$) requiring only that $\gamma>1$. In addition, we remove the restriction to dimension $d<4$.
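For concreteness (a representative form under standard conventions, not quoted from the paper), the advected fractional parabolic-elliptic Keller-Segel model referred to above can be written as

$$
\partial_t \rho + A\, u \cdot \nabla \rho + \Lambda^{\gamma}\rho + \nabla \cdot \big(\rho\, \nabla (-\Delta)^{-1}\rho\big) = 0, \qquad \nabla \cdot u = 0,
$$

where $\Lambda = (-\Delta)^{1/2}$, $u$ is the prescribed incompressible flow and $A$ its amplitude. The suppression statement is that, under the spectral (mixing) conditions on $u$, taking $A$ sufficiently large yields a global solution for any regular initial datum, with only $\gamma>1$ required of the dissipation.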
Knowledge distillation is a critical technique for transferring knowledge between models, typically from a large model (the teacher) to a smaller one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher's and the student's output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios: 1) the teacher and student share the same factorization form of the output structure scoring function; 2) the student factorization produces more fine-grained substructures than the teacher factorization; 3) the teacher factorization produces more fine-grained substructures than the student factorization; 4) the factorization forms of the teacher and the student are incompatible.
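For reference, the objective in question is the standard distillation cross-entropy taken over the full structured output space $\mathcal{Y}(x)$ (written here in its generic form; the factorized derivation is the paper's contribution):

$$
\mathcal{L}_{\mathrm{KD}}(x) \;=\; -\sum_{y \in \mathcal{Y}(x)} P_{\mathrm{teacher}}(y \mid x)\, \log P_{\mathrm{student}}(y \mid x).
$$

Because $|\mathcal{Y}(x)|$ grows exponentially with the input length (all label sequences, or all parse trees), this sum cannot be enumerated directly; tractability hinges on decomposing it over the substructures that the teacher and student scoring functions factorize into, which is exactly what the four scenarios above vary.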