Coresets for Clustering with Missing Values

127 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Shaofeng Jiang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Vladimir Braverman - Shaofeng H.-C. Jiang - Robert Krauthgamer andn Xuan Wu

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We provide the first coreset for clustering points in $mathbb{R}^d$ that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like $k$-Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an $epsilon$-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that $(1+epsilon)$-approximates the clustering objective for every possible center set. Our coresets for $k$-Means and $k$-Median clustering have size $(jk)^{O(min(j,k))} (epsilon^{-1} d log n)^2$, where $n$ is the number of data points, $d$ is the dimension and $j$ is the maximum number of missing coordinates for each data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for $k$-Means with missing values [Eiben et al., SODA 2021] to near-linear time. We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniform-sampling baseline. Furthermore, it significantly speeds up a Lloyds-style heuristic for $k$-Means with missing values.

قيم البحث

103 - Daniel Baker , Vladimir Braverman , Lingxiao Huang 2019

We initiate the study of coresets for clustering in graph metrics, i.e., the shortest-path metric of edge-weighted graphs. Such clustering problems are essential to data analysis and used for example in road networks and data visualization. A coreset is a compact summary of the data that approximately preserves the clustering objective for every possible center set, and it offers significant efficiency improvements in terms of running time, storage, and communication, including in streaming and distributed settings. Our main result is a near-linear time construction of a coreset for k-Median in a general graph $G$, with size $O_{epsilon, k}(mathrm{tw}(G))$ where $mathrm{tw}(G)$ is the treewidth of $G$, and we complement the construction with a nearly-tight size lower bound. The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(mathrm{tw}(G))$ on the shattering dimension under any point weights. We validate our coreset on real-world road networks, and our scalable algorithm constructs tiny coresets with high accuracy, which translates to a massive speedup of existing approximation algorithms such as local search for graph k-Median.

بنى وهياكل البيانات والخوارزميات

Fair Coresets and Streaming Algorithms for Fair k-Means Clustering

137 - Melanie Schmidt , Chris Schwiegelshohn , Christian Sohler 2018

We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyds algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.

بنى وهياكل البيانات والخوارزميات

On Coresets for Logistic Regression

157 - Alexander Munteanu , Chris Schwiegelshohn , Christian Sohler andn David P. Woodruff 2018

Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1pmvarepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.

بنى وهياكل البيانات والخوارزميات التعلم الآلي التعلم الالي

Sparse Coresets for SVD on Infinite Streams

327 - Vladimir Braverman , Dan Feldman , Harry Lang 2020

In streaming Singular Value Decomposition (SVD), $d$-dimensional rows of a possibly infinite matrix arrive sequentially as points in $mathbb{R}^d$. An $epsilon$-coreset is a (much smaller) matrix whose sum of square distances of the rows to any hyper plane approximates that of the original matrix to a $1 pm epsilon$ factor. Our main result is that we can maintain a $epsilon$-coreset while storing only $O(d log^2 d / epsilon^2)$ rows. Known lower bounds of $Omega(d / epsilon^2)$ rows show that this is nearly optimal. Moreover, each row of our coreset is a weighted subset of the input rows. This is highly desirable since it: (1) preserves sparsity; (2) is easily interpretable; (3) avoids precision errors; (4) applies to problems with constraints on the input. Previous streaming results for SVD that return a subset of the input required storing $Omega(d log^3 n / epsilon^2)$ rows where $n$ is the number of rows seen so far. Our algorithm, with storage independent of $n$, is the first result that uses finite memory on infinite streams. We support our findings with experiments on the Wikipedia dataset benchmarked against state-of-the-art algorithms.

بنى وهياكل البيانات والخوارزميات

Randomized Composable Coresets for Matching and Vertex Cover

132 - Sepehr Assadi , Sanjeev Khanna 2017

A common approach for designing scalable algorithms for massive data sets is to distribute the computation across, say $k$, machines and process the data using limited communication between them. A particularly appealing framework here is the simulta neous communication model whereby each machine constructs a small representative summary of its own data and one obtains an approximate/exact solution from the union of the representative summaries. If the representative summaries needed for a problem are small, then this results in a communication-efficient and round-optimal protocol. While many fundamental graph problems admit efficient solutions in this model, two prominent problems are notably absent from the list of successes, namely, the maximum matching problem and the minimum vertex cover problem. Indeed, it was shown recently that for both these problems, even achieving a polylog$(n)$ approximation requires essentially sending the entire input graph from each machine. The main insight of our work is that the intractability of matching and vertex cover in the simultaneous communication model is inherently connected to an adversarial partitioning of the underlying graph across machines. We show that when the underlying graph is randomly partitioned across machines, both these problems admit randomized composable coresets of size $widetilde{O}(n)$ that yield an $widetilde{O}(1)$-approximate solution. This results in an $widetilde{O}(1)$-approximation simultaneous protocol for these problems with $widetilde{O}(nk)$ total communication when the input is randomly partitioned across $k$ machines. We further prove the optimality of our results. Finally, by a standard application of composable coresets, our results also imply MapReduce algorithms with the same approximation guarantee in one or two rounds of communication

بنى وهياكل البيانات والخوارزميات النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات