
Revisiting Random Binning Features: Fast Convergence and Strong Parallelizability

Posted by: Lingfei Wu
Publication date: 2018
Research field: Informatics Engineering
Paper language: English





Kernel methods have been developed as one of the standard approaches to nonlinear learning; they do not, however, scale to large data sets due to their quadratic complexity in the number of samples. A number of kernel approximation methods have thus been proposed in recent years, among which the random features method has gained much popularity due to its simplicity and its direct reduction of a nonlinear problem to a linear one. The Random Binning (RB) feature, proposed in the first random-feature paper \cite{rahimi2007random}, has drawn much less attention than the Random Fourier (RF) feature. In this work, we observe that RB features, with the right choice of optimization solver, can be orders of magnitude more efficient than other random features and kernel approximation methods under the same accuracy requirement. We thus propose the first analysis of RB from the perspective of optimization, which, by interpreting RB as Randomized Block Coordinate Descent in an infinite-dimensional space, gives a faster convergence rate than that of other random features. In particular, we show that by drawing $R$ random grids with at least $\kappa$ non-empty bins per grid in expectation, the RB method achieves a convergence rate of $O(1/(\kappa R))$, which not only sharpens its $O(1/\sqrt{R})$ rate from Monte Carlo analysis, but also shows a $\kappa$-times speedup over other random features under the same analysis framework. In addition, we demonstrate another advantage of RB in the L1-regularized setting, where, unlike other random features, an RB-based Coordinate Descent solver can be parallelized with guaranteed speedup proportional to $\kappa$. Our extensive experiments demonstrate the superior performance of RB features over other random features and kernel approximation methods. Our code and data are available at \url{https://github.com/teddylfwu/RB_GEN}.
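To make the construction concrete, the following is a minimal sketch of RB feature generation for the Laplacian kernel (one of the kernels RB admits), following the construction in \cite{rahimi2007random}; the function name and defaults are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def random_binning_features(X, sigma=1.0, R=32, seed=0):
    """Sketch of RB features for the Laplacian kernel
    k(x, y) = exp(-||x - y||_1 / sigma) (Rahimi & Recht, 2007)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    rows, cols, vals = [], [], []
    offset = 0  # running column offset: each grid contributes its own bins
    for _ in range(R):
        # Per-dimension grid widths delta ~ Gamma(2, sigma) and shifts u ~ U[0, delta):
        # the density p(delta) = delta * k''(delta) recovers the kernel in expectation.
        delta = rng.gamma(shape=2.0, scale=sigma, size=d)
        u = rng.uniform(0.0, delta)
        # Joint integer bin coordinates; points in the same bin share one feature.
        bins = np.floor((X - u) / delta).astype(np.int64)
        _, ids = np.unique(bins, axis=0, return_inverse=True)
        ids = ids.ravel()
        rows.append(np.arange(n))
        cols.append(ids + offset)
        vals.append(np.full(n, 1.0 / np.sqrt(R)))  # averages the R one-hot grids
        offset += int(ids.max()) + 1
    Z = csr_matrix((np.concatenate(vals),
                    (np.concatenate(rows), np.concatenate(cols))),
                   shape=(n, offset))
    return Z  # E[(Z @ Z.T)_ij] = k(x_i, x_j); each row has exactly R non-zeros
```

Each sample lands in exactly one bin per grid, so two points share a non-zero coordinate exactly when they co-occur in a bin; the blocks of columns indexed by the grids are what the paper's block-coordinate-descent view operates on.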


Read also

Spectral clustering is one of the most effective clustering approaches for capturing hidden cluster structure in data. However, it does not scale well to large problems due to its quadratic complexity in constructing the similarity graph and computing the subsequent eigendecomposition. Although a number of methods have been proposed to accelerate spectral clustering, most of them trade considerable information loss in the original data for reduced computational bottlenecks. In this paper, we present a novel scalable spectral clustering method that uses Random Binning features (RB) to simultaneously accelerate both similarity-graph construction and the eigendecomposition. Specifically, we implicitly approximate the graph similarity (kernel) matrix by the inner product of a large sparse feature matrix generated by RB. We then introduce a state-of-the-art SVD solver to efficiently compute the eigenvectors of this large matrix for spectral clustering. Using these two building blocks, we reduce the computational cost from quadratic to linear in the number of data points while achieving similar accuracy. Our theoretical analysis shows that spectral clustering via RB converges faster to exact spectral clustering than the standard random-feature approximation. Extensive experiments on 8 benchmarks show that the proposed method either outperforms or matches state-of-the-art methods in both accuracy and runtime. Moreover, our method exhibits linear scalability in both the number of data samples and the number of RB features.
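A minimal sketch of this pipeline, under our own naming: it assumes a sparse RB feature matrix Z such as the one generated above, and uses SciPy's sparse SVD in place of the paper's specific solver.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def rb_spectral_clustering(Z, k, seed=0):
    """Spectral clustering on an n x D sparse RB feature matrix Z, never
    forming the n x n similarity matrix W = Z @ Z.T explicitly."""
    n = Z.shape[0]
    deg = Z @ (Z.T @ np.ones(n))                    # node degrees of the implicit graph
    Zn = diags(1.0 / np.sqrt(deg)) @ Z              # D^{-1/2} Z
    # Eigenvectors of D^{-1/2} W D^{-1/2} are the left singular vectors of D^{-1/2} Z.
    U, _, _ = svds(Zn, k=k)
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize (Ng-Jordan-Weiss step)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```

The key saving is that the SVD touches only the sparse n x D matrix, so the cost is linear in n rather than quadratic.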
Random features provide a practical framework for large-scale kernel approximation and supervised learning. It has been shown that data-dependent sampling of random features using leverage scores can significantly reduce the number of features required to achieve optimal learning bounds. Leverage scores introduce an optimized distribution over features based on an infinite-dimensional integral operator (depending on the input distribution), which is impractical to sample from. Focusing on empirical leverage scores, we establish an out-of-sample performance bound, revealing an interesting trade-off between the approximated kernel and the eigenvalue decay of another kernel, defined on the domain of the random features, that is based on the data distribution. Our experiments verify that the empirical algorithm consistently outperforms vanilla Monte Carlo sampling, and that with a minor modification the method is even competitive with supervised data-dependent kernel learning, without using the output (label) information.
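As a rough illustration of the empirical scheme (our own simplification, not the paper's exact estimator), ridge leverage scores of the columns of a feature matrix can be computed from the m x m Gram matrix and used to resample features:

```python
import numpy as np

def leverage_resample(Z, lam, m_out, seed=0):
    """Resample the columns of an n x m random-feature matrix Z (assumed to
    already carry the usual 1/sqrt(m) scaling) by empirical ridge leverage
    scores, keeping the kernel estimate Z @ Z.T unbiased."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    G = Z.T @ Z
    # Ridge leverage score of feature j: j-th diagonal entry of G (G + n*lam*I)^{-1}.
    scores = np.diag(G @ np.linalg.inv(G + n * lam * np.eye(m)))
    p = scores / scores.sum()
    idx = rng.choice(m, size=m_out, replace=True, p=p)
    # Importance weights correct for the non-uniform sampling probabilities.
    return Z[:, idx] / np.sqrt(m_out * p[idx])
```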
Fast Incremental Expectation Maximization (FIEM) is a version of the EM framework for large datasets. In this paper, we first recast FIEM and other incremental EM-type algorithms in the \emph{Stochastic Approximation within EM} framework. Then, we provide non-asymptotic bounds for the convergence in expectation as a function of the number of examples $n$ and of the maximal number of iterations $k_{\max}$. We propose two strategies for achieving an $\epsilon$-approximate stationary point, with $k_{\max} = O(n^{2/3}/\epsilon)$ and $k_{\max} = O(\sqrt{n}/\epsilon^{3/2})$ respectively, both relying on a random termination rule before $k_{\max}$ and on a constant step size in the Stochastic Approximation step. Our bounds improve on the literature in two ways. First, they allow $k_{\max}$ to scale as $\sqrt{n}$, which is better than the $n^{2/3}$ rate that was the best obtained so far, at the cost of a larger dependence on the tolerance $\epsilon$; this makes the control relevant for small-to-medium accuracy relative to the number of examples $n$. Second, for the $n^{2/3}$ rate, numerical illustrations show that, thanks to an optimized choice of the step size and to bounds expressed in terms of quantities characterizing the optimization problem at hand, our results suggest a less conservative choice of the step size and provide better control of the convergence in expectation.
We study the statistical and computational aspects of kernel principal component analysis using random Fourier features and show that, under mild assumptions, $O(\sqrt{n} \log n)$ features suffice to achieve $O(1/\epsilon^2)$ sample complexity. Furthermore, we give a memory-efficient streaming algorithm based on the classical Oja's algorithm that achieves this rate.
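A minimal sketch of such a streaming procedure, combining random Fourier features for the Gaussian kernel with Oja's rule; the names and the 1/t step-size schedule are illustrative assumptions, not the paper's tuned algorithm.

```python
import numpy as np

def streaming_rff_kpca(stream, d, m=512, k=5, sigma=1.0, lr=1.0, seed=0):
    """Streaming kernel PCA: map each sample to m random Fourier features for
    the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)), then track the top-k
    principal subspace of the feature covariance with Oja's rule."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(d, m))   # RFF frequencies ~ N(0, I/sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)       # RFF phases
    Q, _ = np.linalg.qr(rng.normal(size=(m, k)))    # random orthonormal start
    for t, x in enumerate(stream, start=1):
        z = np.sqrt(2.0 / m) * np.cos(W.T @ x + b)  # feature map z(x)
        Q += (lr / t) * np.outer(z, z @ Q)          # Oja update toward top eigenvectors
        Q, _ = np.linalg.qr(Q)                      # re-orthonormalize
    return Q  # columns approximate the top-k kernel principal directions
```

Memory use is O(mk) regardless of the stream length, which is the point of the streaming formulation.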
Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner-product kernels in $\mathbb{R}^d$, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). Correspondingly, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm -- which is equivalent to a weighted $\ell_2$ norm -- is replaced by a weighted functional $\ell_p$ norm, which we refer to as the $\mathcal{F}_p$ norm. Unfortunately, the tractability of these approaches is unclear: the kernel trick is not available, and minimizing these norms requires solving an infinite-dimensional convex problem. We study random-feature approximations to these norms and show that, for $p > 1$, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with $\mathcal{F}_p$ norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models.
