Locally Private $k$-Means Clustering with Constant Multiplicative Approximation and Near-Optimal Additive Error

88 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Anamay Chaturvedi

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Anamay Chaturvedi - Matthew Jones - Huy L. Nguyen

بنى وهياكل البيانات والخوارزميات التشفير والأمن التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Given a data set of size $n$ in $d$-dimensional Euclidean space, the $k$-means problem asks for a set of $k$ points (called centers) so that the sum of the $ell_2^2$-distances between points of a given data set of size $n$ and the set of $k$ centers is minimized. Recent work on this problem in the locally private setting achieves constant multiplicative approximation with additive error $tilde{O} (n^{1/2 + a} cdot k cdot max {sqrt{d}, sqrt{k} })$ and proves a lower bound of $Omega(sqrt{n})$ on the additive error for any solution with a constant number of rounds. In this work we bridge the gap between the exponents of $n$ in the upper and lower bounds on the additive error with two new algorithms. Given any $alpha>0$, our first algorithm achieves a multiplicative approximation guarantee which is at most a $(1+alpha)$ factor greater than that of any non-private $k$-means clustering algorithm with $k^{tilde{O}(1/alpha^2)} sqrt{d n} mbox{poly}log n$ additive error. Given any $c>sqrt{2}$, our second algorithm achieves $O(k^{1 + tilde{O}(1/(2c^2-1))} sqrt{d n} mbox{poly} log n)$ additive error with constant multiplicative approximation. Both algorithms go beyond the $Omega(n^{1/2 + a})$ factor that occurs in the additive error for arbitrarily small parameters $a$ in previous work, and the second algorithm in particular shows for the first time that it is possible to solve the locally private $k$-means problem in a constant number of rounds with constant factor multiplicative approximation and polynomial dependence on $k$ in the additive error arbitrarily close to linear.

قيم البحث

83 - Anamay Chaturvedi , Huy Nguyen , Eric Xu 2020

We introduce a new $(epsilon_p, delta_p)$-differentially private algorithm for the $k$-means clustering problem. Given a dataset in Euclidean space, the $k$-means clustering problem requires one to find $k$ points in that space such that the sum of s quares of Euclidean distances between each data point and its closest respective point among the $k$ returned is minimised. Although there exist privacy-preserving methods with good theoretical guarantees to solve this problem [Balcan et al., 2017; Kaplan and Stemmer, 2018], in practice it is seen that it is the additive error which dictates the practical performance of these methods. By reducing the problem to a sequence of instances of maximum coverage on a grid, we are able to derive a new method that achieves lower additive error then previous works. For input datasets with cardinality $n$ and diameter $Delta$, our algorithm has an $O(Delta^2 (k log^2 n log(1/delta_p)/epsilon_p + ksqrt{d log(1/delta_p)}/epsilon_p))$ additive error whilst maintaining constant multiplicative error. We conclude with some experiments and find an improvement over previously implemented work for this problem.

بنى وهياكل البيانات والخوارزميات التشفير والأمن التعلم الآلي

A note on differentially private clustering with large additive error

190 - Huy L. Nguyen 2020

In this note, we describe a simple approach to obtain a differentially private algorithm for k-clustering with nearly the same multiplicative factor as any non-private counterpart at the cost of a large polynomial additive error. The approach is the combination of a simple geometric observation independent of privacy consideration and any existing private algorithm with a constant approximation.

بنى وهياكل البيانات والخوارزميات التعلم الآلي

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

724 - Michael B. Cohen , Sam Elder , Cameron Musco 2014

We show how to approximate a data matrix $mathbf{A}$ with a much smaller sketch $mathbf{tilde A}$ that can be used to solve a general class of constrained k-rank approximation problems to within $(1+epsilon)$ error. Importantly, this class of problem s includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover a good subspace for $bv{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(log k/epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.

بنى وهياكل البيانات والخوارزميات التعلم الآلي

Locally Private Hypothesis Selection

172 - Sivakanth Gopi , Gautam Kamath , Janardhan Kulkarni 2020

We initiate the study of hypothesis selection under local differential privacy. Given samples from an unknown probability distribution $p$ and a set of $k$ probability distributions $mathcal{Q}$, we aim to output, under the constraints of $varepsilon $-local differential privacy, a distribution from $mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution. This is a generalization of the classic problem of $k$-wise simple hypothesis testing, which corresponds to when $p in mathcal{Q}$, and we wish to identify $p$. Absent privacy constraints, this problem requires $O(log k)$ samples from $p$, and it was recently shown that the same complexity is achievable under (central) differential privacy. However, the naive approach to this problem under local differential privacy would require $tilde O(k^2)$ samples. We first show that the constraint of local differential privacy incurs an exponential increase in cost: any algorithm for this problem requires at least $Omega(k)$ samples. Second, for the special case of $k$-wise simple hypothesis testing, we provide a non-interactive algorithm which nearly matches this bound, requiring $tilde O(k)$ samples. Finally, we provide sequentially interactive algorithms for the general case, requiring $tilde O(k)$ samples and only $O(log log k)$ rounds of interactivity. Our algorithms are achieved through a reduction to maximum selection with adversarial comparators, a problem of independent interest for which we initiate study in the parallel setting. For this problem, we provide a family of algorithms for each number of allowed rounds of interaction $t$, as well as lower bounds showing that they are near-optimal for every $t$. Notably, our algorithms result in exponential improvements on the round complexity of previous methods.

بنى وهياكل البيانات والخوارزميات التشفير والأمن نظرية المعلومات

Near-Optimal Algorithms for Differentially-Private Principal Components

611 - Kamalika Chaudhuri , Anand D. Sarwate , Kaushik Sinha 2012

Principal components analysis (PCA) is a standard tool for identifying good low-dimensional approximations to data in high dimension. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on s uch data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method.

التعلم الالي التشفير والأمن التعلم الآلي