Approximation Algorithms for Socially Fair Clustering

180 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ali Vakilian

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Yury Makarychev - Ali Vakilian

بنى وهياكل البيانات والخوارزميات التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We present an $(e^{O(p)} frac{log ell}{loglogell})$-approximation algorithm for socially fair clustering with the $ell_p$-objective. In this problem, we are given a set of points in a metric space. Each point belongs to one (or several) of $ell$ groups. The goal is to find a $k$-medians, $k$-means, or, more generally, $ell_p$-clustering that is simultaneously good for all of the groups. More precisely, we need to find a set of $k$ centers $C$ so as to minimize the maximum over all groups $j$ of $sum_{u text{ in group }j} d(u,C)^p$. The socially fair clustering problem was independently proposed by Ghadiri, Samadi, and Vempala [2021] and Abbasi, Bhaskara, and Venkatasubramanian [2021]. Our algorithm improves and generalizes their $O(ell)$-approximation algorithms for the problem. The natural LP relaxation for the problem has an integrality gap of $Omega(ell)$. In order to obtain our result, we introduce a strengthened LP relaxation and show that it has an integrality gap of $Theta(frac{log ell}{loglogell})$ for a fixed $p$. Additionally, we present a bicriteria approximation algorithm, which generalizes the bicriteria approximation of Abbasi et al. [2021].

قيم البحث

131 - Ali Vakilian , Mustafa Yalc{c}{i}ner 2021

We consider the $k$-clustering problem with $ell_p$-norm cost, which includes $k$-median, $k$-means and $k$-center cost functions, under an individual notion of fairness proposed by Jung et al. [2020]: given a set of points $P$ of size $n$, a set of $k$ centers induces a fair clustering if for every point $vin P$, $v$ can find a center among its $n/k$ closest neighbors. Recently, Mahabadi and Vakilian [2020] showed how to get a $(p^{O(p)},7)$-bicriteria approximation for the problem of fair $k$-clustering with $ell_p$-norm cost: every point finds a center within distance at most $7$ times its distance to its $(n/k)$-th closest neighbor and the $ell_p$-norm cost of the solution is at most $p^{O(p)}$ times the cost of an optimal fair solution. In this work, for any $varepsilon>0$, we present an improved $(16^p +varepsilon,3)$-bicriteria approximation for the fair $k$-clustering with $ell_p$-norm cost. To achieve our guarantees, we extend the framework of [Charikar et al., 2002, Swamy, 2016] and devise a $16^p$-approximation algorithm for the facility location with $ell_p$-norm cost under matroid constraint which might be of an independent interest. Besides, our approach suggests a reduction from our individually fair clustering to a clustering with a group fairness requirement proposed by Kleindessner et al. [2019], which is essentially the median matroid problem [Krishnaswamy et al., 2011].

بنى وهياكل البيانات والخوارزميات الذكاء الاصطناعي أجهزة الكمبيوتر والمجتمع

Approximation Algorithms for Bregman Co-clustering and Tensor Clustering

370 - Stefanie Jegelka , Suvrit Sra , Arindam Banerjee 2009

In the past few years powerful generalizations to the Euclidean k-means problem have been made, such as Bregman clustering [7], co-clustering (i.e., simultaneous clustering of rows and columns of an input matrix) [9,18], and tensor clustering [8,34]. Like k-means, these more general problems also suffer from the NP-hardness of the associated optimization. Researchers have developed approximation algorithms of varying degrees of sophistication for k-means, k-medians, and more recently also for Bregman clustering [2]. However, there seem to be no approximation algorithms for Bregman co- and tensor clustering. In this paper we derive the first (to our knowledge) guaranteed methods for these increasingly important clustering settings. Going beyond Bregman divergences, we also prove an approximation factor for tensor clustering with arbitrary separable metrics. Through extensive experiments we evaluate the characteristics of our method, and show that it also has practical impact.

بنى وهياكل البيانات والخوارزميات التعلم الآلي

Fair Coresets and Streaming Algorithms for Fair k-Means Clustering

137 - Melanie Schmidt , Chris Schwiegelshohn , Christian Sohler 2018

We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyds algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.

بنى وهياكل البيانات والخوارزميات

Approximation Algorithms for Bayesian Multi-Armed Bandit Problems

798 - Sudipto Guha , Kamesh Munagala 2013

In this paper, we consider several finite-horizon Bayesian multi-armed bandit problems with side constraints which are computationally intractable (NP-Hard) and for which no optimal (or near optimal) algorithms are known to exist with sub-exponential running time. All of these problems violate the standard exchange property, which assumes that the reward from the play of an arm is not contingent upon when the arm is played. Not only are index policies suboptimal in these contexts, there has been little analysis of such policies in these problem settings. We show that if we consider near-optimal policies, in the sense of approximation algorithms, then there exists (near) index policies. Conceptually, if we can find policies that satisfy an approximate version of the exchange property, namely, that the reward from the play of an arm depends on when the arm is played to within a constant factor, then we have an avenue towards solving these problems. However such an approximate version of the idling bandit property does not hold on a per-play basis and are shown to hold in a global sense. Clearly, such a property is not necessarily true of arbitrary single arm policies and finding such single arm policies is nontrivial. We show that by restricting the state spaces of arms we can find single arm policies and that these single arm policies can be combined into global (near) index policies where the approximate version of the exchange property is true in expectation. The number of different bandit problems that can be addressed by this technique already demonstrate its wide applicability.

بنى وهياكل البيانات والخوارزميات التعلم الآلي

Data Structures & Algorithms for Exact Inference in Hierarchical Clustering

76 - Craig S. Greenberg , Sebastian Macaluso , Nicholas Monath 2020

Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. Typically approximate alg orithms are used for inference due to the combinatorial number of possible hierarchical clusterings. In contrast to existing methods, we present novel dynamic-programming algorithms for emph{exact} inference in hierarchical clustering based on a novel trellis data structure, and we prove that we can exactly compute the partition function, maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and clusters. Our algorithms scale in time and space proportional to the powerset of $N$ elements which is super-exponentially more efficient than explicitly considering each of the (2N-3)!! possible hierarchies. Also, for larger datasets where our exact algorithms become infeasible, we introduce an approximate algorithm based on a sparse trellis that compares well to other benchmarks. Exact methods are relevant to data analyses in particle physics and for finding correlations among gene expression in cancer genomics, and we give examples in both areas, where our algorithms outperform greedy and beam search baselines. In addition, we consider Dasguptas cost with synthetic data.

بنى وهياكل البيانات والخوارزميات التعلم الآلي تحليل البيانات والإحصاءات والاحتمال