No Arabic abstract
The Contrastive Divergence (CD) algorithm has achieved notable success in training energy-based models including Restricted Boltzmann Machines and played a key role in the emergence of deep learning. The idea of this algorithm is to approximate the intractable term in the exact gradient of the log-likelihood function by using short Markov chain Monte Carlo (MCMC) runs. The approximate gradient is computationally-cheap but biased. Whether and why the CD algorithm provides an asymptotically consistent estimate are still open questions. This paper studies the asymptotic properties of the CD algorithm in canonical exponential families, which are special cases of the energy-based model. Suppose the CD algorithm runs $m$ MCMC transition steps at each iteration $t$ and iteratively generates a sequence of parameter estimates ${theta_t}_{t ge 0}$ given an i.i.d. data sample ${X_i}_{i=1}^n sim p_{theta_star}$. Under conditions which are commonly obeyed by the CD algorithm in practice, we prove the existence of some bounded $m$ such that any limit point of the time average $left. sum_{s=0}^{t-1} theta_s right/ t$ as $t to infty$ is a consistent estimate for the true parameter $theta_star$. Our proof is based on the fact that ${theta_t}_{t ge 0}$ is a homogenous Markov chain conditional on the data sample ${X_i}_{i=1}^n$. This chain meets the Foster-Lyapunov drift criterion and converges to a random walk around the Maximum Likelihood Estimate. The range of the random walk shrinks to zero at rate $mathcal{O}(1/sqrt[3]{n})$ as the sample size $n to infty$.
In our recent paper, we showed that in exponential family, contrastive divergence (CD) with fixed learning rate will give asymptotically consistent estimates cite{wu2016convergence}. In this paper, we establish consistency and convergence rate of CD with annealed learning rate $eta_t$. Specifically, suppose CD-$m$ generates the sequence of parameters ${theta_t}_{t ge 0}$ using an i.i.d. data sample $mathbf{X}_1^n sim p_{theta^*}$ of size $n$, then $delta_n(mathbf{X}_1^n) = limsup_{t to infty} Vert sum_{s=t_0}^t eta_s theta_s / sum_{s=t_0}^t eta_s - theta^* Vert$ converges in probability to 0 at a rate of $1/sqrt[3]{n}$. The number ($m$) of MCMC transitions in CD only affects the coefficient factor of convergence rate. Our proof is not a simple extension of the one in cite{wu2016convergence}. which depends critically on the fact that ${theta_t}_{t ge 0}$ is a homogeneous Markov chain conditional on the observed sample $mathbf{X}_1^n$. Under annealed learning rate, the homogeneous Markov property is not available and we have to develop an alternative approach based on super-martingales. Experiment results of CD on a fully-visible $2times 2$ Boltzmann Machine are provided to demonstrate our theoretical results.
This paper investigates maximizers of the information divergence from an exponential family $E$. It is shown that the $rI$-projection of a maximizer $P$ to $E$ is a convex combination of $P$ and a probability measure $P_-$ with disjoint support and the same value of the sufficient statistics $A$. This observation can be used to transform the original problem of maximizing $D(cdot||E)$ over the set of all probability measures into the maximization of a function $Dbar$ over a convex subset of $ker A$. The global maximizers of both problems correspond to each other. Furthermore, finding all local maximizers of $Dbar$ yields all local maximizers of $D(cdot||E)$. This paper also proposes two algorithms to find the maximizers of $Dbar$ and applies them to two examples, where the maximizers of $D(cdot||E)$ were not known before.
We consider stochastic gradient descent and its averaging variant for binary classification problems in a reproducing kernel Hilbert space. In the traditional analysis using a consistency property of loss functions, it is known that the expected classification error converges more slowly than the expected risk even when assuming a low-noise condition on the conditional label probabilities. Consequently, the resulting rate is sublinear. Therefore, it is important to consider whether much faster convergence of the expected classification error can be achieved. In recent research, an exponential convergence rate for stochastic gradient descent was shown under a strong low-noise condition but provided theoretical analysis was limited to the squared loss function, which is somewhat inadequate for binary classification tasks. In this paper, we show an exponential convergence of the expected classification error in the final phase of the stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions. As for the averaged stochastic gradient descent, we show that the same convergence rate holds from the early phase of training. In experiments, we verify our analyses on the $L_2$-regularized logistic regression.
Fitting a graphical model to a collection of random variables given sample observations is a challenging task if the observed variables are influenced by latent variables, which can induce significant confounding statistical dependencies among the observed variables. We present a new convex relaxation framework based on regularized conditional likelihood for latent-variable graphical modeling in which the conditional distribution of the observed variables conditioned on the latent variables is given by an exponential family graphical model. In comparison to previously proposed tractable methods that proceed by characterizing the marginal distribution of the observed variables, our approach is applicable in a broader range of settings as it does not require knowledge about the specific form of distribution of the latent variables and it can be specialized to yield tractable approaches to problems in which the observed data are not well-modeled as Gaussian. We demonstrate the utility and flexibility of our framework via a series of numerical experiments on synthetic as well as real data.
We show how to convert divergent series, which typically occur in many applications in physics, into rapidly convergent inverse factorial series. This can be interpreted physically as a novel resummation of perturbative series. Being convergent, these new series allow rigorous extrapolation from an asymptotic region with a large parameter, to the opposite region where the parameter is small. We illustrate the method with various physical examples, and discuss how these convergent series relate to standard methods such as Borel summation, and also how they incorporate the physical Stokes phenomenon. We comment on the relation of these results to Dysons physical argument for the divergence of perturbation theory. This approach also leads naturally to a wide class of relations between bosonic and fermionic partition functions, and Klein-Gordon and Dirac determinants.