Convergence of Sparse Variational Inference in Gaussian Processes Regression


Published by David Burt

Publication date: 2020

Research field: Mathematical statistics, Informatics engineering

Language: English

Authors: David R. Burt, Carl Edward Rasmussen, Mark van der Wilk

Machine learning


Abstract (English)

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, $N$, due to the cubic (in $N$) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on $M \ll N$ inducing variables to form an approximation at a cost of $\mathcal{O}(NM^2)$. While the computational cost appears linear in $N$, the true complexity depends on how $M$ must scale with $N$ to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how $M$ needs to grow with $N$ to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with $M \ll N$. Specifically, for the popular squared exponential kernel and $D$-dimensional Gaussian distributed covariates, $M=\mathcal{O}((\log N)^D)$ suffice and a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
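To make the quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes the standard collapsed evidence lower bound for sparse variational GP regression with a Gaussian likelihood, a unit-variance squared exponential kernel, and inducing inputs chosen as a random subset of the training inputs. For that bound, the difference between the exact log marginal likelihood and the lower bound equals the KL-divergence from the approximate posterior to the exact posterior. All function names, the toy data, and the jitter value are illustrative assumptions.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel with unit variance (an assumption of this sketch)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def exact_log_marginal(x, y, noise_var):
    """Exact GP regression log marginal likelihood; costs O(N^3)."""
    n = x.shape[0]
    L = np.linalg.cholesky(se_kernel(x, x) + noise_var * np.eye(n))
    alpha = np.linalg.solve(L, y)
    return -0.5 * alpha @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

def collapsed_elbo(x, y, z, noise_var, jitter=1e-6):
    """Collapsed sparse variational lower bound; costs O(N M^2)."""
    n, m = x.shape[0], z.shape[0]
    Kuu = se_kernel(z, z) + jitter * np.eye(m)
    Kuf = se_kernel(z, x)
    Luu = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(Luu, Kuf) / np.sqrt(noise_var)   # M x N
    LB = np.linalg.cholesky(np.eye(m) + A @ A.T)
    c = np.linalg.solve(LB, A @ y) / np.sqrt(noise_var)
    # log N(y | 0, Qff + noise_var * I), with Qff = Kfu Kuu^{-1} Kuf
    log_det = 2.0 * np.sum(np.log(np.diag(LB))) + n * np.log(noise_var)
    quad = y @ y / noise_var - c @ c
    fit = -0.5 * (n * np.log(2.0 * np.pi) + log_det + quad)
    # trace penalty: tr(Kff - Qff) / (2 * noise_var); tr(Kff) = n for unit variance
    trace = n - noise_var * np.sum(A**2)
    return fit - 0.5 * trace / noise_var

# Toy data: one-dimensional Gaussian distributed covariates (illustrative only).
rng = np.random.default_rng(0)
N, noise_var = 300, 0.1
x = rng.normal(size=(N, 1))
y = np.sin(3.0 * x[:, 0]) + np.sqrt(noise_var) * rng.normal(size=N)

perm = rng.permutation(N)          # nested random subsets as inducing inputs
exact = exact_log_marginal(x, y, noise_var)
gaps = {}
for M in (5, 10, 20, 40):
    gaps[M] = exact - collapsed_elbo(x, y, x[perm[:M]], noise_var)
    print(f"M = {M:2d}   KL gap = {gaps[M]:.6f}")
```

In this sketch the gap should shrink as $M$ grows while the bound stays at $\mathcal{O}(NM^2)$ cost. Random subset selection is only a stand-in for whatever initialisation scheme is actually analysed in the paper.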


397 - Ayush Jain , Department of Computer Science 2021

Deep Gaussian Processes (DGPs) are multi-layer, flexible extensions of Gaussian processes but their training remains challenging. Sparse approximations simplify the training but often require optimization over a large number of inducing inputs and their locations across layers. In this paper, we simplify the training by setting the locations to a fixed subset of data and sampling the inducing inputs from a variational distribution. This reduces the trainable parameters and computation cost without significant performance degradations, as demonstrated by our empirical results on regression problems. Our modifications simplify and stabilize DGP training while making it amenable to sampling schemes for setting the inducing inputs.

Machine learning

Sparse Gaussian Process Variational Autoencoders

120 - Matthew Ashman , Jonathan So , Will Tebbutt 2020

Large, multi-dimensional spatio-temporal datasets are omnipresent in modern science and engineering. An effective framework for handling such data are Gaussian process deep generative models (GP-DGMs), which employ GP priors over the latent variables of DGMs. Existing approaches for performing inference in GP-DGMs do not support sparse GP approximations based on inducing points, which are essential for the computational efficiency of GPs, nor do they handle missing data -- a natural occurrence in many spatio-temporal datasets -- in a principled manner. We address these shortcomings with the development of the sparse Gaussian process variational autoencoder (SGP-VAE), characterised by the use of partial inference networks for parameterising sparse GP approximations. Leveraging the benefits of amortised variational inference, the SGP-VAE enables inference in multi-output sparse GPs on previously unobserved data with no additional training. The SGP-VAE is evaluated in a variety of experiments where it outperforms alternative approaches including multi-output GPs and structured VAEs.

Machine learning, Neural and evolutionary computing

Connections and Equivalences between the Nystrom Method and Sparse Variational Gaussian Processes

92 - Veit Wild , Motonobu Kanagawa , Dino Sejdinovic 2021

We investigate the connections between sparse approximation methods for making kernel methods and Gaussian processes (GPs) scalable to massive data, focusing on the Nystrom method and the Sparse Variational Gaussian Processes (SVGP). While sparse approximation methods for GPs and kernel methods share some algebraic similarities, the literature lacks a deep understanding of how and why they are related. This is a possible obstacle for the communications between the GP and kernel communities, making it difficult to transfer results from one side to the other. Our motivation is to remove this possible obstacle, by clarifying the connections between the sparse approximations for GPs and kernel methods. In this work, we study the two popular approaches, the Nystrom and SVGP approximations, in the context of a regression problem, and establish various connections and equivalences between them. In particular, we provide an RKHS interpretation of the SVGP approximation, and show that the Evidence Lower Bound of the SVGP contains the objective function of the Nystrom approximation, revealing the origin of the algebraic equivalence between the two approaches. We also study recently established convergence results for the SVGP and how they are related to the approximation quality of the Nystrom method.

Machine learning, Statistics theory
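One algebraic equivalence of this kind can be checked numerically. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's results: it compares the predictive mean of a sparse variational GP (with the optimal Gaussian variational distribution) against kernel ridge regression restricted to the span of the kernel functions at the landmark points, with the regularisation constant set to `noise_var / n`. The function names, the toy data, and this matching of the regulariser to the noise variance are assumptions of the sketch.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel with unit variance."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def nystrom_krr_mean(x, y, z, x_test, lam):
    """Kernel ridge regression restricted to span{k(., z_j)}.

    Minimises (1/n) * ||y - Kfu a||^2 + lam * a^T Kuu a over coefficients a.
    """
    n = x.shape[0]
    Kuu, Kuf = se_kernel(z, z), se_kernel(z, x)
    coef = np.linalg.solve(Kuf @ Kuf.T + n * lam * Kuu, Kuf @ y)
    return se_kernel(x_test, z) @ coef

def svgp_mean(x, y, z, x_test, noise_var):
    """Predictive mean of sparse variational GP regression with optimal q(u)."""
    Kuu, Kuf = se_kernel(z, z), se_kernel(z, x)
    Sigma = Kuu + Kuf @ Kuf.T / noise_var
    return se_kernel(x_test, z) @ np.linalg.solve(Sigma, Kuf @ y) / noise_var

# Toy regression problem (illustrative data only).
rng = np.random.default_rng(1)
n, noise_var = 150, 0.05
x = rng.normal(size=(n, 2))
y = np.sin(x[:, 0]) + np.sqrt(noise_var) * rng.normal(size=n)
z = x[rng.choice(n, size=8, replace=False)]    # landmarks / inducing inputs
x_test = rng.normal(size=(20, 2))

mean_nystrom = nystrom_krr_mean(x, y, z, x_test, lam=noise_var / n)
mean_svgp = svgp_mean(x, y, z, x_test, noise_var)
print("max abs difference:", np.max(np.abs(mean_nystrom - mean_svgp)))
```

Both functions solve the same linear system up to a scalar factor, which is why the two predictors coincide when the landmark points equal the inducing inputs. The predictive variances are not compared here.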

Sparse Algorithms for Markovian Gaussian Processes

203 - William J. Wilkinson , Arno Solin , Vincent Adam 2021

Approximate Bayesian inference methods that scale to very large datasets are crucial in leveraging probabilistic models for real-world time series. Sparse Markovian Gaussian processes combine the use of inducing variables with efficient Kalman filter-like recursions, resulting in algorithms whose computational and memory requirements scale linearly in the number of inducing points, whilst also enabling parallel parameter updates and stochastic optimisation. Under this paradigm, we derive a general site-based approach to approximate inference, whereby we approximate the non-Gaussian likelihood with local Gaussian terms, called sites. Our approach results in a suite of novel sparse extensions to algorithms from both the machine learning and signal processing literature, including variational inference, expectation propagation, and the classical nonlinear Kalman smoothers. The derived methods are suited to large time series, and we also demonstrate their applicability to spatio-temporal data, where the model has separate inducing points in both time and space.

Machine learning

Scalable Exact Inference in Multi-Output Gaussian Processes

110 - Wessel P. Bruinsma , Eric Perim , Will Tebbutt 2019

Multi-output Gaussian processes (MOGPs) leverage the flexibility and interpretability of GPs while capturing structure across outputs, which is desirable, for example, in spatio-temporal modelling. The key problem with MOGPs is their computational scaling $O(n^3 p^3)$, which is cubic in the number of both inputs $n$ (e.g., time points or locations) and outputs $p$. For this reason, a popular class of MOGPs assumes that the data live around a low-dimensional linear subspace, reducing the complexity to $O(n^3 m^3)$. However, this cost is still cubic in the dimensionality of the subspace $m$, which is still prohibitively expensive for many applications. We propose the use of a sufficient statistic of the data to accelerate inference and learning in MOGPs with orthogonal bases. The method achieves linear scaling in $m$ in practice, allowing these models to scale to large $m$ without sacrificing significant expressivity or requiring approximation. This advance opens up a wide range of real-world tasks and can be combined with existing GP approximations in a plug-and-play way. We demonstrate the efficacy of the method on various synthetic and real-world data sets.

Machine learning