ﻻ يوجد ملخص باللغة العربية
We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $Sigma$), we want to select $m-ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k,B,|Sigma|)cdot m^{g(k,|Sigma|)}cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|Sigma|$ and $k$ are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. We also complement our algorithmic findings with complexity lower bounds.
In the Categorical Clustering problem, we are given a set of vectors (matrix) A={a_1,ldots,a_n} over Sigma^m, where Sigma is a finite alphabet, and integers k and B. The task is to partition A into k clusters such that the median objective of the clu
We investigate the parameterized complexity of the following edge coloring problem motivated by the problem of channel assignment in wireless networks. For an integer q>1 and a graph G, the goal is to find a coloring of the edges of G with the maximu
Given a graph $G=(V,E)$, two vertices $s,tin V$, and two integers $k,ell$, the Short Secluded Path problem is to find a simple $s$-$t$-path with at most $k$ vertices and $ell$ neighbors. We study the parameterized complexity of the problem with respe
The feature-selection problem is formulated from an information-theoretic perspective. We show that the problem can be efficiently solved by an extension of the recently proposed info-clustering paradigm. This reveals the fundamental duality between
The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has been recently proposed is the $k$-anonymity. This approach requires that the rows of a table are partitioned i