ترغب بنشر مسار تعليمي؟ اضغط هنا

Coresets for Clustering in Graphs of Bounded Treewidth

104   0   0.0 ( 0 )
 نشر من قبل Shaofeng Jiang
 تاريخ النشر 2019
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We initiate the study of coresets for clustering in graph metrics, i.e., the shortest-path metric of edge-weighted graphs. Such clustering problems are essential to data analysis and used for example in road networks and data visualization. A coreset is a compact summary of the data that approximately preserves the clustering objective for every possible center set, and it offers significant efficiency improvements in terms of running time, storage, and communication, including in streaming and distributed settings. Our main result is a near-linear time construction of a coreset for k-Median in a general graph $G$, with size $O_{epsilon, k}(mathrm{tw}(G))$ where $mathrm{tw}(G)$ is the treewidth of $G$, and we complement the construction with a nearly-tight size lower bound. The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(mathrm{tw}(G))$ on the shattering dimension under any point weights. We validate our coreset on real-world road networks, and our scalable algorithm constructs tiny coresets with high accuracy, which translates to a massive speedup of existing approximation algorithms such as local search for graph k-Median.



قيم البحث

اقرأ أيضاً

The maximum/minimum bisection problems are, given an edge-weighted graph, to find a bipartition of the vertex set into two sets whose sizes differ by at most one, such that the total weight of edges between the two sets is maximized/minimized. Althou gh these two problems are known to be NP-hard, there is an efficient algorithm for bounded-treewidth graphs. In particular, Jansen et al. (SIAM J. Comput. 2005) gave an $O(2^tn^3)$-time algorithm when given a tree decomposition of width $t$ of the input graph, where $n$ is the number of vertices of the input graph. Eiben et al. (ESA 2019) improved the dependency of $n$ in the running time by giving an $O(8^tt^5n^2log n)$-time algorithm. Moreover, they showed that there is no $O(n^{2-varepsilon})$-time algorithm for trees under some reasonable complexity assumption. In this paper, we show an $O(2^t(tn)^2)$-time algorithm for both problems, which is asymptotically tight to their conditional lower bound. We also show that the exponential dependency of the treewidth is asymptotically optimal under the Strong Exponential Time Hypothesis. Finally, we discuss the (in)tractability of both problems with respect to special graph classes.
We give polynomial-time approximation schemes for monotone maximization problems expressible in terms of distances (up to a fixed upper bound) and efficiently solvable in graphs of bounded treewidth. These schemes apply in all fractionally treewidth- fragile graph classes, a property that is true for many natural graph classes with sublinear separators. We also provide quasipolynomial-time approximation schemes for these problems in all classes with sublinear separators.
We provide the first coreset for clustering points in $mathbb{R}^d$ that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like $k$ -Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an $epsilon$-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that $(1+epsilon)$-approximates the clustering objective for every possible center set. Our coresets for $k$-Means and $k$-Median clustering have size $(jk)^{O(min(j,k))} (epsilon^{-1} d log n)^2$, where $n$ is the number of data points, $d$ is the dimension and $j$ is the maximum number of missing coordinates for each data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for $k$-Means with missing values [Eiben et al., SODA 2021] to near-linear time. We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniform-sampling baseline. Furthermore, it significantly speeds up a Lloyds-style heuristic for $k$-Means with missing values.
The Subgraph Isomorphism problem is of considerable importance in computer science. We examine the problem when the pattern graph H is of bounded treewidth, as occurs in a variety of applications. This problem has a well-known algorithm via color-cod ing that runs in time $O(n^{tw(H)+1})$ [Alon, Yuster, Zwick95], where $n$ is the number of vertices of the host graph $G$. While there are pattern graphs known for which Subgraph Isomorphism can be solved in an improved running time of $O(n^{tw(H)+1-varepsilon})$ or even faster (e.g. for $k$-cliques), it is not known whether such improvements are possible for all patterns. The only known lower bound rules out time $n^{o(tw(H) / log(tw(H)))}$ for any class of patterns of unbounded treewidth assuming the Exponential Time Hypothesis [Marx07]. In this paper, we demonstrate the existence of maximally hard pattern graphs $H$ that require time $n^{tw(H)+1-o(1)}$. Specifically, under the Strong Exponential Time Hypothesis (SETH), a standard assumption from fine-grained complexity theory, we prove the following asymptotic statement for large treewidth $t$: For any $varepsilon > 0$ there exists $t ge 3$ and a pattern graph $H$ of treewidth $t$ such that Subgraph Isomorphism on pattern $H$ has no algorithm running in time $O(n^{t+1-varepsilon})$. Under the more recent 3-uniform Hyperclique hypothesis, we even obtain tight lower bounds for each specific treewidth $t ge 3$: For any $t ge 3$ there exists a pattern graph $H$ of treewidth $t$ such that for any $varepsilon>0$ Subgraph Isomorphism on pattern $H$ has no algorithm running in time $O(n^{t+1-varepsilon})$. In addition to these main results, we explore (1) colored and uncolored problem variants (and why they are equivalent for most cases), (2) Subgraph Isomorphism for $tw < 3$, (3) Subgraph Isomorphism parameterized by pathwidth, and (4) a weighted problem variant.
We focus on counting the number of labeled graphs on $n$ vertices and treewidth at most $k$ (or equivalently, the number of labeled partial $k$-trees), which we denote by $T_{n,k}$. So far, only the particular cases $T_{n,1}$ and $T_{n,2}$ had been s tudied. We show that $$ left(c cdot frac{kcdot 2^k cdot n}{log k} right)^n cdot 2^{-frac{k(k+3)}{2}} cdot k^{-2k-2} leq T_{n,k} leq left(k cdot 2^k cdot nright)^n cdot 2^{-frac{k(k+1)}{2}} cdot k^{-k}, $$ for $k > 1$ and some explicit absolute constant $c > 0$. The upper bound is an immediate consequence of the well-known number of labeled $k$-trees, while the lower bound is obtained from an explicit algorithmic construction. It follows from this construction that both bounds also apply to graphs of pathwidth and proper-pathwidth at most $k$.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا