No Arabic abstract
Generative Adversarial Networks (GANs) have become a popular method to learn a probability model from data. In this paper, we aim to provide an understanding of some of the basic issues surrounding GANs including their formulation, generalization and stability on a simple benchmark where the data has a high-dimensional Gaussian distribution. Even in this simple benchmark, the GAN problem has not been well-understood as we observe that existing state-of-the-art GAN architectures may fail to learn a proper generative distribution owing to (1) stability issues (i.e., convergence to bad local solutions or not converging at all), (2) approximation issues (i.e., having improper global GAN optimizers caused by inappropriate GANs loss functions), and (3) generalizability issues (i.e., requiring large number of samples for training). In this setup, we propose a GAN architecture which recovers the maximum-likelihood solution and demonstrates fast generalization. Moreover, we analyze global stability of different computational approaches for the proposed GAN optimization and highlight their pros and cons. Finally, we outline an extension of our model-based approach to design GANs in more complex setups than the considered Gaussian benchmark.
We investigate the training and performance of generative adversarial networks using the Maximum Mean Discrepancy (MMD) as critic, termed MMD GANs. As our main theoretical contribution, we clarify the situation with bias in GAN loss functions raised by recent work: we show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a discriminator based on samples leads to biased gradients for the generator parameters. We also discuss the issue of kernel choice for the MMD critic, and characterize the kernel corresponding to the energy distance used for the Cramer GAN critic. Being an integral probability metric, the MMD benefits from training strategies recently developed for Wasserstein GANs. In experiments, the MMD GAN is able to employ a smaller critic network than the Wasserstein GAN, resulting in a simpler and faster-training algorithm with matching performance. We also propose an improved measure of GAN convergence, the Kernel Inception Distance, and show how to use it to dynamically adapt learning rates during GAN training.
In this paper, we consider a discrete time linear quadratic Gaussian (LQG) control problem in which state information of the plant is encoded in a variable-length binary codeword at every time step, and a control input is determined based on the codewords generated in the past. We derive a lower bound of the rate achievable by the class of prefix-free codes attaining the required LQG control performance. This lower bound coincides with the infimum of a certain directed information expression, and is computable by semidefinite programming (SDP). Based on a technique by Silva et al., we also provide an upper bound of the best achievable rate by constructing a controller equipped with a uniform quantizer with subtractive dither and Shannon-Fano coding. The gap between the obtained lower and upper bounds is less than $0.754r+1$ bits per time step regardless of the required LQG control performance, where $r$ is the rank of a signal-to-noise ratio matrix obtained by SDP, which is no greater than the dimension of the state.
In a noiseless linear estimation problem, one aims to reconstruct a vector x* from the knowledge of its linear projections y=Phi x*. There have been many theoretical works concentrating on the case where the matrix Phi is a random i.i.d. one, but a number of heuristic evidence suggests that many of these results are universal and extend well beyond this restricted case. Here we revisit this problematic through the prism of development of message passing methods, and consider not only the universality of the l1 transition, as previously addressed, but also the one of the optimal Bayesian reconstruction. We observed that the universality extends to the Bayes-optimal minimum mean-squared (MMSE) error, and to a range of structured matrices.
Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using the fewest features but still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This in turn allows finding the most informative distance measure out of a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications are wide ranging in many branches of science.
Rectified linear units, or ReLUs, have become the preferred activation function for artificial neural networks. In this paper we consider two basic learning problems assuming that the underlying data follow a generative model based on a ReLU-network -- a neural network with ReLU activations. As a primarily theoretical study, we limit ourselves to a single-layer network. The first problem we study corresponds to dictionary-learning in the presence of nonlinearity (modeled by the ReLU functions). Given a set of observation vectors $mathbf{y}^i in mathbb{R}^d, i =1, 2, dots , n$, we aim to recover $dtimes k$ matrix $A$ and the latent vectors ${mathbf{c}^i} subset mathbb{R}^k$ under the model $mathbf{y}^i = mathrm{ReLU}(Amathbf{c}^i +mathbf{b})$, where $mathbf{b}in mathbb{R}^d$ is a random bias. We show that it is possible to recover the column space of $A$ within an error of $O(d)$ (in Frobenius norm) under certain conditions on the probability distribution of $mathbf{b}$. The second problem we consider is that of robust recovery of the signal in the presence of outliers, i.e., large but sparse noise. In this setting we are interested in recovering the latent vector $mathbf{c}$ from its noisy nonlinear sketches of the form $mathbf{v} = mathrm{ReLU}(Amathbf{c}) + mathbf{e}+mathbf{w}$, where $mathbf{e} in mathbb{R}^d$ denotes the outliers with sparsity $s$ and $mathbf{w} in mathbb{R}^d$ denote the dense but small noise. This line of work has recently been studied (Soltanolkotabi, 2017) without the presence of outliers. For this problem, we show that a generalized LASSO algorithm is able to recover the signal $mathbf{c} in mathbb{R}^k$ within an $ell_2$ error of $O(sqrt{frac{(k+s)log d}{d}})$ when $A$ is a random Gaussian matrix.