No Arabic abstract
We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $Sigma$), we want to select $m-ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k,B,|Sigma|)cdot m^{g(k,|Sigma|)}cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|Sigma|$ and $k$ are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. We also complement our algorithmic findings with complexity lower bounds.
In the Categorical Clustering problem, we are given a set of vectors (matrix) A={a_1,ldots,a_n} over Sigma^m, where Sigma is a finite alphabet, and integers k and B. The task is to partition A into k clusters such that the median objective of the clustering in the Hamming norm is at most B. That is, we seek a partition {I_1,ldots,I_k} of {1,ldots,n} and vectors c_1,ldots,c_kinSigma^m such that sum_{i=1}^ksum_{jin I_i}d_h(c_i,a_j)leq B, where d_H(a,b) is the Hamming distance between vectors a and b. Fomin, Golovach, and Panolan [ICALP 2018] proved that the problem is fixed-parameter tractable (for binary case Sigma={0,1}) by giving an algorithm that solves the problem in time 2^{O(Blog B)} (mn)^{O(1)}. We extend this algorithmic result to a popular capacitated clustering model, where in addition the sizes of the clusters should satisfy certain constraints. More precisely, in Capacitated Clustering, in addition, we are given two non-negative integers p and q, and seek a clustering with pleq |I_i|leq q for all iin{1,ldots,k}. Our main theorem is that Capacitated Clustering is solvable in time 2^{O(Blog B)}|Sigma|^B(mn)^{O(1)}. The theorem not only extends the previous algorithmic results to a significantly more general model, it also implies algorithms for several other variants of Categorical Clustering with constraints on cluster sizes.
We investigate the parameterized complexity of the following edge coloring problem motivated by the problem of channel assignment in wireless networks. For an integer q>1 and a graph G, the goal is to find a coloring of the edges of G with the maximum number of colors such that every vertex of the graph sees at most q colors. This problem is NP-hard for q>1, and has been well-studied from the point of view of approximation. Our main focus is the case when q=2, which is already theoretically intricate and practically relevant. We show fixed-parameter tractable algorithms for both the standard and the dual parameter, and for the latter problem, the result is based on a linear vertex kernel.
Given a graph $G=(V,E)$, two vertices $s,tin V$, and two integers $k,ell$, the Short Secluded Path problem is to find a simple $s$-$t$-path with at most $k$ vertices and $ell$ neighbors. We study the parameterized complexity of the problem with respect to four structural graph parameters: the vertex cover number, treewidth, feedback vertex number, and feedback edge number. In particular, we completely settle the question of the existence of problem kernels with size polynomial in these parameters and their combinations with $k$ and $ell$. We also obtain a $2^{O(w)}cdot ell^2cdot n$-time algorithm for graphs of treewidth $w$, which yields subexponential-time algorithms in several graph classes.
The feature-selection problem is formulated from an information-theoretic perspective. We show that the problem can be efficiently solved by an extension of the recently proposed info-clustering paradigm. This reveals the fundamental duality between feature selection and data clustering,which is a consequence of the more general duality between the principal partition and the principal lattice of partitions in combinatorial optimization.
The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has been recently proposed is the $k$-anonymity. This approach requires that the rows of a table are partitioned in clusters of size at least $k$ and that all the rows in a cluster become the same tuple, after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be APX-hard even when the records values are over a binary alphabet and $k=3$, and when the records have length at most 8 and $k=4$ . In this paper we study how the complexity of the problem is influenced by different parameters. In this paper we follow this direction of research, first showing that the problem is W[1]-hard when parameterized by the size of the solution (and the value $k$). Then we exhibit a fixed parameter algorithm, when the problem is parameterized by the size of the alphabet and the number of columns. Finally, we investigate the computational (and approximation) complexity of the $k$-anonymity problem, when restricting the instance to records having length bounded by 3 and $k=3$. We show that such a restriction is APX-hard.