ﻻ يوجد ملخص باللغة العربية
We show how to approximate a data matrix $mathbf{A}$ with a much smaller sketch $mathbf{tilde A}$ that can be used to solve a general class of constrained k-rank approximation problems to within $(1+epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover a good subspace for $bv{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(log k/epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
This paper considers $k$-means clustering in the presence of noise. It is known that $k$-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called $k$
In applications such as natural language processing or computer vision, one is given a large $n times d$ matrix $A = (a_{i,j})$ and would like to compute a matrix decomposition, e.g., a low rank approximation, of a function $f(A) = (f(a_{i,j}))$ appl
Given a data set of size $n$ in $d$-dimensional Euclidean space, the $k$-means problem asks for a set of $k$ points (called centers) so that the sum of the $ell_2^2$-distances between points of a given data set of size $n$ and the set of $k$ centers
We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias).
A distance matrix $A in mathbb R^{n times m}$ represents all pairwise distances, $A_{ij}=mathrm{d}(x_i,y_j)$, between two point sets $x_1,...,x_n$ and $y_1,...,y_m$ in an arbitrary metric space $(mathcal Z, mathrm{d})$. Such matrices arise in various