We give a new construction of a small-space summary of a data set that satisfies the coreset guarantee with respect to the $k$-means objective function. The number of points required in an offline construction is in $\tilde{O}(k \epsilon^{-2} \min(d, k\epsilon^{-2}))$, which is minimal among all available constructions. Aside from two constructions with exponential dependence on the dimension, all known coresets are maintained in data streams via the merge-and-reduce framework, which incurs a large space dependency on $\log n$. Instead, our construction crucially relies on Johnson-Lindenstrauss-type embeddings, which, combined with results from online algorithms, give us a new technique for efficiently maintaining coresets in data streams without relying on merge and reduce. The final number of points stored by our algorithm in a data stream is in $\tilde{O}(k^2 \epsilon^{-2} \log^2 n \min(d, k\epsilon^{-2}))$.
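As a point of reference for the dimension-reduction ingredient above, the following is a minimal sketch of a Johnson-Lindenstrauss-style random Gaussian projection; the function name `jl_project` and the constant in the target dimension are illustrative and not taken from the paper's construction.

```python
import numpy as np

def jl_project(X, eps, rng=None):
    """Reduce the dimension of the rows of X with a random Gaussian
    Johnson-Lindenstrauss map; pairwise distances are preserved up to a
    (1 +/- eps) factor with high probability."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = min(d, int(np.ceil(8.0 * np.log(n) / eps ** 2)))  # illustrative constant
    G = rng.standard_normal((d, m)) / np.sqrt(m)
    return X @ G
```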
We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.
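To make the balance requirement concrete, here is a minimal sketch of the fairlet idea from Chierichetti et al. for a 1:1 balance over two groups; it is illustrative only and is not the coreset construction or the Lloyd's variant proposed in this abstract, and the function name `fairlet_centers` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fairlet_centers(X_red, X_blue):
    """Pair points of the two sensitive groups 1:1 by a min-cost matching
    and replace every pair ('fairlet') by its mean; clustering the means
    with plain k-means and assigning both members of a fairlet to the same
    cluster then yields balanced clusters."""
    cost = np.linalg.norm(X_red[:, None, :] - X_blue[None, :, :], axis=2)
    r, b = linear_sum_assignment(cost)
    return (X_red[r] + X_blue[b]) / 2.0
```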
This paper considers $k$-means clustering in the presence of noise. It is known that $k$-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called $k$-means clustering with outliers. The goal of $k$-means clustering with outliers is to discard up to a specified number $z$ of points as noise/outliers and then find a $k$-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or an inherent loss in solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst-case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any $k$-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from $k$-means with outliers to $k$-means without outliers. Secondly, we show how to construct a coreset of size $O(k \log n)$. When combined with our greedy algorithm, we obtain a scalable, near-linear-time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.
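The following is a minimal sketch of the generic "cluster, then discard the $z$ farthest points" heuristic in the spirit of the abstract; it is not the paper's exact greedy preprocessing, and the function name `kmeans_minus_z` and the number of rounds are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_minus_z(X, k, z, rounds=3, seed=0):
    """Alternate between fitting k-means on the current inliers and
    discarding the z points farthest from their nearest center."""
    inliers = np.arange(len(X))
    km = None
    for _ in range(rounds):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[inliers])
        d = np.min(np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :],
                                   axis=2), axis=1)
        inliers = np.argsort(d)[: len(X) - z]  # keep the n - z closest points
    return km, inliers
```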
Euler k-means (EulerK) first maps the data onto the surface of a unit hypersphere in a space of equal dimension via a complex mapping that induces the robust Euler kernel, and then applies the popular $k$-means algorithm. Consequently, besides enjoying the virtues of k-means such as simplicity and scalability to large data sets, EulerK is also robust to noise and outliers. However, the centroids computed by EulerK deviate from the unit hypersphere surface and thus, in a strict distributional sense, are themselves outliers. This counterintuitive phenomenon also occurs in some generic kernel clustering methods. Intuitively, using such outlier-like centroids is questionable, yet the issue has received little attention. To eliminate the deviation, we propose two Rectified Euler k-means methods, REK1 and REK2, which retain the merits of EulerK while obtaining genuine centroids that reside in the mapped space, thereby better characterizing the data structure. Specifically, REK1 rectifies EulerK by imposing a constraint on the centroids, while REK2 views each centroid as the mapped image of a pre-image in the original space and optimizes these pre-images in the Euler-kernel-induced space. The proposed REKs can also be extended methodologically to other problems of this kind. Finally, the experiments validate the effectiveness of REK1 and REK2.
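For intuition, here is a minimal sketch assuming the standard Euler feature map $\phi(x) = e^{i\alpha\pi x}/\sqrt{2}$ applied elementwise; the rectification shown (renormalizing each centroid coordinate back to the sphere) is only one plausible reading of REK1's constraint, and both function names are hypothetical.

```python
import numpy as np

def euler_map(X, alpha=1.0):
    """Elementwise Euler feature map phi(x) = exp(i*alpha*pi*x) / sqrt(2);
    every mapped point lies on a hypersphere (each complex coordinate has
    magnitude 1/sqrt(2))."""
    return np.exp(1j * alpha * np.pi * X) / np.sqrt(2)

def rectify_centroid(c, eps=1e-12):
    """Push a complex centroid back onto the mapped hypersphere by
    renormalizing each coordinate to magnitude 1/sqrt(2)."""
    return c / (np.sqrt(2) * np.abs(c) + eps)
```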
Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no coresets of strongly sublinear size exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1 \pm \varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state-of-the-art methods in the area. The experiments are conducted on real-world benchmark data for logistic regression.
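As background for the sampling step, the following is a minimal sketch of generic sensitivity sampling with importance weights; it assumes the sensitivity upper bounds (the hard part for logistic regression, and the paper's actual contribution) are supplied, and the function name `sensitivity_sample` is hypothetical.

```python
import numpy as np

def sensitivity_sample(X, scores, m, rng=None):
    """Draw m points with probability proportional to the supplied
    sensitivity upper bounds and attach importance weights 1/(m * p_i),
    so that weighted losses are unbiased estimates of the full loss."""
    rng = np.random.default_rng(rng)
    p = scores / scores.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    return X[idx], 1.0 / (m * p[idx])
```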
We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\tilde{\mathbf{A}}$ that can be used to solve a general class of constrained $k$-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low-rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative-error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for $\mathbf{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
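The final claim can be illustrated with a minimal sketch: project the points with a random Gaussian map to roughly $O(\log k/\epsilon^2)$ dimensions and cluster there, reusing the resulting partition for the original points. The constants, the function name `cluster_in_low_dim`, and the use of a plain Gaussian projection are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_in_low_dim(X, k, eps, seed=0):
    """Johnson-Lindenstrauss-project the points to about O(log k / eps^2)
    dimensions and cluster them there; the returned labels partition the
    original high-dimensional points."""
    rng = np.random.default_rng(seed)
    m = min(X.shape[1], max(1, int(np.ceil(4.0 * np.log(k + 1) / eps ** 2))))
    G = rng.standard_normal((X.shape[1], m)) / np.sqrt(m)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X @ G)
```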