أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Michael Saks

Hellinger volume and number-on-the-forehead communication complexity

171 - Troy Lee , Nikos Leonardos , Michael Saks 2014

Information-theoretic methods have proven to be a very powerful tool in communication complexity, in particular giving an elegant proof of the linear lower bound for the two-party disjointness function, and tight lower bounds on disjointness in the m ulti-party number-in-the-hand (NIH) model. In this paper, we study the applicability of information theoretic methods to the multi-party number-on-the-forehead model (NOF), where determining the complexity of disjointness remains an important open problem. There are two basic parts to the NIH disjointness lower bound: a direct sum theorem and a lower bound on the one-bit AND function using a beautiful connection between Hellinger distance and protocols revealed by Bar-Yossef, Jayram, Kumar and Sivakumar [BYJKS04]. Inspired by this connection, we introduce the notion of Hellinger volume. We show that it lower bounds the information cost of multi-party NOF protocols and provide a small toolbox that allows one to manipulate several Hellinger volume terms and lower bound a Hellinger volume when the distributions involved satisfy certain conditions. In doing so, we prove a new upper bound on the difference between the arithmetic mean and the geometric mean in terms of relative entropy. We then apply these new tools to obtain a lower bound on the informational complexity of the AND_k function in the NOF setting. Finally, we discuss the difficulties of proving a direct sum theorem for information cost in the NOF model.

التعقيد الحسابي نظرية المعلومات نظرية المعلومات

Composition limits and separating examples for some Boolean function complexity measures

70 - Justin Gilmer , Michael Saks , Srikanth Srinivasan 2013

Block sensitivity ($bs(f)$), certificate complexity ($C(f)$) and fractional certificate complexity ($C^*(f)$) are three fundamental combinatorial measures of complexity of a boolean function $f$. It has long been known that $bs(f) leq C^{ast}(f) leq C(f) =O(bs(f)^2)$. We provide an infinite family of examples for which $C(f)$ grows quadratically in $C^{ast}(f)$ (and also $bs(f)$) giving optimal separations between these measures. Previously the biggest separation known was $C(f)=C^{ast}(f)^{log_{4.5}5}$. We also give a family of examples for which $C^{ast}(f)=Omega(bs(f)^{3/2})$. These examples are obtained by composing boolean functions in various ways. Here the composition $f circ g$ of $f$ with $g$ is obtained by substituting for each variable of $f$ a copy of $g$ on disjoint sets of variables. To construct and analyse these examples we systematically investigate the behaviour under function composition of these measures and also the sensitivity measure $s(f)$. The measures $s(f)$, $C(f)$ and $C^{ast}(f)$ behave nicely under composition: they are submultiplicative (where measure $m$ is submultiplicative if $m(f circ g) leq m(f)m(g)$) with equality holding under some fairly general conditions. The measure $bs(f)$ is qualitatively different: it is not submultiplicative. This qualitative difference was not noticed in the previous literature and we correct some errors that appeared in previous papers. We define the composition limit of a measure $m$ at function $f$, $m^{lim}(f)$ to be the limit as $k$ grows of $m(f^{(k)})^{1/k}$, where $f^{(k)}$ is the iterated composition of $f$ with itself $k$-times. For any function $f$ we show that $bs^{lim}(f) = (C^*)^{lim}(f)$ and characterize $s^{lim}(f), (C^*)^{lim}(f)$, and $C^{lim}(f)$ in terms of the largest eigenvalue of a certain set of $2times 2$ matrices associated with $f$.

التعقيد الحسابي

A Polynomial Time Algorithm for Lossy Population Recovery

80 - Ankur Moitra , Michael Saks 2013

We give a polynomial time algorithm for the lossy population recovery problem. In this problem, the goal is to approximately learn an unknown distribution on binary strings of length $n$ from lossy samples: for some parameter $mu$ each coordinate of the sample is preserved with probability $mu$ and otherwise is replaced by a `?. The running time and number of samples needed for our algorithm is polynomial in $n$ and $1/varepsilon$ for each fixed $mu>0$. This improves on algorithm of Wigderson and Yehudayoff that runs in quasi-polynomial time for any $mu > 0$ and the polynomial time algorithm of Dvir et al which was shown to work for $mu gtrapprox 0.30$ by Batman et al. In fact, our algorithm also works in the more general framework of Batman et al. in which there is no a priori bound on the size of the support of the distribution. The algorithm we analyze is implicit in previous work; our main contribution is to analyze the algorithm by showing (via linear programming duality and connections to complex analysis) that a certain matrix associated with the problem has a robust local inverse even though its condition number is exponentially small. A corollary of our result is the first polynomial time algorithm for learning DNFs in the restriction access model of Dvir et al.

بنى وهياكل البيانات والخوارزميات التعلم الآلي

Clustering is difficult only when it does not matter

91 - Amit Daniely , Nati Linial , Michael Saks 2012

Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets {em that can be clustered well}. More generally, despite the ubiquity and the great import ance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care for data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of good clustering. We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that contrary to popular belief, clustering should not be considered a hard task.

التعلم الآلي بنى وهياكل البيانات والخوارزميات

A Tail Bound for Read-k Families of Functions

51 - Dmitry Gavinsky , Shachar Lovett , Michael Saks 2012

We prove a Chernoff-like large deviation bound on the sum of non-independent random variables that have the following dependence structure. The variables $Y_1,...,Y_r$ are arbitrary Boolean functions of independent random variables $X_1,...,X_m$, mod ulo a restriction that every $X_i$ influences at most $k$ of the variables $Y_1,...,Y_r$.

الرياضيات المتقطعة الاحتمالات

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

81 - Michael Saks , C. Seshadhri 2012

Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any $delta > 0$, given streaming access to an array of length $n$ provides a $(1+delta)$-multiplicative approximation to the emph{distance to monotonicity} ($n$ minus the length of the LIS), and uses only $O((log^2 n)/delta)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. Our algorithm can be used to estimate the length of the LIS to within an additive $delta n$ for any $delta >0$ while previous algorithms could only achieve additive error $n(1/2-o(1))$. Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS. We also give a streaming algorithm for approximating $LCS(x,y)$, the length of the longest common subsequence between strings $x$ and $y$, each of length $n$. Our algorithm works in the asymmetric setting (inspired by cite{AKO10}), in which we have random access to $y$ and streaming access to $x$, and runs in small space provided that no single symbol appears very often in $y$. More precisely, it gives an additive-$delta n$ approximation to $LCS(x,y)$ (and hence also to $E(x,y) = n-LCS(x,y)$, the edit distance between $x$ and $y$ when insertions and deletions, but not substitutions, are allowed), with space complexity $O(k(log^2 n)/delta)$, where $k$ is the maximum number of times any one symbol appears in $y$.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

Tight lower bounds for online labeling problem

71 - Jan Bulanek , Michal Koucky , Michael Saks 2011

We consider the file maintenance problem (also called the online labeling problem) in which n integer items from the set {1,...,r} are to be stored in an array of size m >= n. The items are presented sequentially in an arbitrary order, and must be st ored in the array in sorted order (but not necessarily in consecutive locations in the array). Each new item must be stored in the array before the next item is received. If r<=m then we can simply store item j in location j but if r>m then we may have to shift the location of stored items to make space for a newly arrived item. The algorithm is charged each time an item is stored in the array, or moved to a new location. The goal is to minimize the total number of such moves done by the algorithm. This problem is non-trivial when n=<m<r. In the case that m=Cn for some C>1, algorithms for this problem with cost O(log(n)^2) per item have been given [IKR81, Wil92, BCD+02]. When m=n, algorithms with cost O(log(n)^3) per item were given [Zha93, BS07]. In this paper we prove lower bounds that show that these algorithms are optimal, up to constant factors. Previously, the only lower bound known for this range of parameters was a lower bound of Omega(log(n)^2) for the restricted class of smooth algorithms [DSZ05a, Zha93]. We also provide an algorithm for the sparse case: If the number of items is polylogarithmic in the array size then the problem can be solved in amortized constant time per item.

بنى وهياكل البيانات والخوارزميات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد