Parameterized Complexity of Feature Selection for Categorical Data Clustering

Added by Kirill Simonov

Publication date 2021

Field Informatics Engineering

Language English

Authors Sayan Bandyapadhyay - Fedor V. Fomin - Petr A. Golovach

Data Structures and Algorithms Discrete Mathematics

We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $\ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $\Sigma$), we want to select $m-\ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($\ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|\Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k,B,|\Sigma|)\cdot m^{g(k,|\Sigma|)}\cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|\Sigma|$ and $k$ are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. We also complement our algorithmic findings with complexity lower bounds.
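For intuition, the problem statement in the abstract can be checked by brute force on a tiny instance. The sketch below only illustrates the definition and is not the paper's algorithm: it tries every set of $m-\ell$ features and every assignment of points to at most $k$ non-empty clusters, so it is exponential in both $m$ and $n$. The function names and the toy data are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations, product


def cluster_cost(rows, features):
    """Hamming cost of one cluster restricted to the chosen features.

    An optimal center takes the most frequent value in each column, so
    each column contributes (cluster size - highest value frequency).
    """
    cost = 0
    for f in features:
        counts = Counter(row[f] for row in rows)
        cost += len(rows) - max(counts.values())
    return cost


def best_clustering_cost(points, features, k):
    """Minimum total cost over all assignments of points to k labels."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        total = 0
        for c in range(k):
            rows = [p for p, lab in zip(points, labels) if lab == c]
            if rows:  # empty clusters cost nothing
                total += cluster_cost(rows, features)
        if best is None or total < best:
            best = total
    return best


def select_features(points, k, ell, budget):
    """Return (features, cost) for some set of m - ell features whose
    optimal k-clustering cost is at most budget, or None if none exists."""
    m = len(points[0])
    for features in combinations(range(m), m - ell):
        cost = best_clustering_cost(points, features, k)
        if cost <= budget:
            return features, cost
    return None


# Toy instance (hypothetical): the third feature is noise.
points = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
print(select_features(points, k=2, ell=1, budget=0))
```

On this toy instance, dropping the noisy third feature lets two clusters reach cost 0, while keeping all three features forces a cost of 2.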

Parameterized Complexity of Categorical Clustering with Size Constraints

Fedor V. Fomin, Petr A. Golovach, 2021

In the Categorical Clustering problem, we are given a set of vectors (matrix) $A=\{a_1,\ldots,a_n\}$ over $\Sigma^m$, where $\Sigma$ is a finite alphabet, and integers $k$ and $B$. The task is to partition $A$ into $k$ clusters such that the median objective of the clustering in the Hamming norm is at most $B$. That is, we seek a partition $\{I_1,\ldots,I_k\}$ of $\{1,\ldots,n\}$ and vectors $c_1,\ldots,c_k\in\Sigma^m$ such that $\sum_{i=1}^k\sum_{j\in I_i}d_H(c_i,a_j)\leq B$, where $d_H(a,b)$ is the Hamming distance between vectors $a$ and $b$. Fomin, Golovach, and Panolan [ICALP 2018] proved that the problem is fixed-parameter tractable (for binary case $\Sigma=\{0,1\}$) by giving an algorithm that solves the problem in time $2^{O(B\log B)}(mn)^{O(1)}$. We extend this algorithmic result to a popular capacitated clustering model, where in addition the sizes of the clusters should satisfy certain constraints. More precisely, in Capacitated Clustering, in addition, we are given two non-negative integers $p$ and $q$, and seek a clustering with $p\leq |I_i|\leq q$ for all $i\in\{1,\ldots,k\}$. Our main theorem is that Capacitated Clustering is solvable in time $2^{O(B\log B)}|\Sigma|^B(mn)^{O(1)}$. The theorem not only extends the previous algorithmic results to a significantly more general model, it also implies algorithms for several other variants of Categorical Clustering with constraints on cluster sizes.
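The objective and the size constraints above can be made concrete by checking one candidate partition. The following is a minimal sketch under the assumption that vectors are given as equal-length strings and clusters as lists of 0-based indices; it verifies a given clustering rather than searching for one, and the helper names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def median_center(rows):
    """Column-wise most frequent symbol; this choice minimizes the
    sum of Hamming distances from the center to the rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def capacitated_cost(vectors, clusters, p, q):
    """Cost of the given partition, or None if some cluster size is
    outside [p, q]. Raises ValueError if clusters is not a partition."""
    if sorted(j for idx in clusters for j in idx) != list(range(len(vectors))):
        raise ValueError("clusters must partition the index set")
    total = 0
    for idx in clusters:
        if not p <= len(idx) <= q:
            return None
        rows = [vectors[j] for j in idx]
        center = median_center(rows)
        total += sum(hamming(center, r) for r in rows)
    return total


# Toy binary instance (hypothetical data).
A = ["0011", "0010", "1100", "1101", "1111"]
print(capacitated_cost(A, [[0, 1], [2, 3, 4]], p=2, q=3))
```

A partition is a valid solution for budget $B$ exactly when the returned cost is not None and is at most $B$.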

Data Structures and Algorithms Discrete Mathematics

On the Parameterized Complexity of the Maximum Edge Coloring Problem

Prachi Goyal, Vikram Kamat, Neeldhara Misra, 2013

We investigate the parameterized complexity of the following edge coloring problem motivated by the problem of channel assignment in wireless networks. For an integer q>1 and a graph G, the goal is to find a coloring of the edges of G with the maximum number of colors such that every vertex of the graph sees at most q colors. This problem is NP-hard for q>1, and has been well-studied from the point of view of approximation. Our main focus is the case when q=2, which is already theoretically intricate and practically relevant. We show fixed-parameter tractable algorithms for both the standard and the dual parameter, and for the latter problem, the result is based on a linear vertex kernel.
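The constraint that every vertex sees at most $q$ colors is easy to state in code. The sketch below assumes a coloring is given as a dictionary from edges (pairs of vertices) to colors; it only verifies a coloring and counts its colors, and does not implement the fixed-parameter algorithms mentioned above.

```python
from collections import defaultdict


def is_valid_coloring(edge_colors, q):
    """True if every vertex is incident to edges of at most q colors."""
    seen = defaultdict(set)
    for (u, v), color in edge_colors.items():
        seen[u].add(color)
        seen[v].add(color)
    return all(len(colors) <= q for colors in seen.values())


def colors_used(edge_colors):
    """Number of distinct colors in the coloring (the value to maximize)."""
    return len(set(edge_colors.values()))


# A path a-b-c-d can use three colors with q = 2 (hypothetical example).
path = {("a", "b"): 0, ("b", "c"): 1, ("c", "d"): 2}
print(is_valid_coloring(path, q=2), colors_used(path))
```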

Data Structures and Algorithms Discrete Mathematics

Parameterized algorithms and data reduction for the short secluded $s$-$t$-path problem

Rene van Bevern, Till Fluschnik, Oxana Yu. Tsidulko, 2018

Given a graph $G=(V,E)$, two vertices $s,t\in V$, and two integers $k,\ell$, the Short Secluded Path problem is to find a simple $s$-$t$-path with at most $k$ vertices and $\ell$ neighbors. We study the parameterized complexity of the problem with respect to four structural graph parameters: the vertex cover number, treewidth, feedback vertex number, and feedback edge number. In particular, we completely settle the question of the existence of problem kernels with size polynomial in these parameters and their combinations with $k$ and $\ell$. We also obtain a $2^{O(w)}\cdot \ell^2\cdot n$-time algorithm for graphs of treewidth $w$, which yields subexponential-time algorithms in several graph classes.
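As a small illustration of the problem definition, the sketch below checks whether a given vertex sequence is a short secluded path. It assumes an undirected graph stored as a dictionary of neighbor sets and counts as neighbors the vertices outside the path that are adjacent to it; the function name is hypothetical, and this is a verifier, not one of the algorithms from the paper.

```python
def is_short_secluded_path(adj, path, s, t, k, ell):
    """Check that path is a simple s-t-path with at most k vertices
    and at most ell neighbors outside the path."""
    if not path or path[0] != s or path[-1] != t:
        return False
    if len(set(path)) != len(path):  # a simple path repeats no vertex
        return False
    if any(v not in adj[u] for u, v in zip(path, path[1:])):
        return False  # consecutive vertices must be adjacent
    on_path = set(path)
    neighbors = {w for u in path for w in adj[u]} - on_path
    return len(path) <= k and len(neighbors) <= ell


# Toy graph (hypothetical): path s-a-t with pendant vertices x and y.
adj = {
    "s": {"a", "x"},
    "a": {"s", "t", "y"},
    "t": {"a"},
    "x": {"s"},
    "y": {"a"},
}
print(is_short_secluded_path(adj, ["s", "a", "t"], "s", "t", k=3, ell=2))
```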

Data Structures and Algorithms Discrete Mathematics

Duality between Feature Selection and Data Clustering

Chung Chan, Ali Al-Bashabsheh, Qiaoqiao Zhou, 2016

The feature-selection problem is formulated from an information-theoretic perspective. We show that the problem can be efficiently solved by an extension of the recently proposed info-clustering paradigm. This reveals the fundamental duality between feature selection and data clustering, which is a consequence of the more general duality between the principal partition and the principal lattice of partitions in combinatorial optimization.

Information Theory Machine Learning

Parameterized Complexity of the k-anonymity Problem

Stefano Beretta, Paola Bonizzoni, Gianluca Della Vedova, 2009

The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has been recently proposed is $k$-anonymity. This approach requires that the rows of a table are partitioned into clusters of size at least $k$ and that all the rows in a cluster become the same tuple after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be APX-hard even when the record values are over a binary alphabet and $k=3$, and when the records have length at most 8 and $k=4$. In this paper we study how the complexity of the problem is influenced by different parameters. We first show that the problem is W[1]-hard when parameterized by the size of the solution (and the value $k$). Then we exhibit a fixed-parameter algorithm when the problem is parameterized by the size of the alphabet and the number of columns. Finally, we investigate the computational (and approximation) complexity of the $k$-anonymity problem when restricting the instance to records having length bounded by 3 and $k=3$. We show that the problem remains APX-hard under this restriction.
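The suppression objective can be illustrated for one fixed partition of the rows. In the sketch below, which assumes rows are equal-length strings and clusters are lists of row indices, a column in which the rows of a cluster disagree must be suppressed in every row of that cluster, so each such column costs the cluster size. The function name and data are hypothetical, and finding the best partition is the hard part that this example does not attempt.

```python
def suppression_cost(table, clusters, k):
    """Number of suppressed entries needed to make each cluster's rows
    identical, or None if some cluster has fewer than k rows."""
    total = 0
    for idx in clusters:
        if len(idx) < k:
            return None
        rows = [table[i] for i in idx]
        # Columns where the cluster's rows do not all agree.
        bad_columns = sum(len(set(col)) > 1 for col in zip(*rows))
        total += bad_columns * len(rows)
    return total


# Toy binary table (hypothetical data) with k = 3.
table = ["0101", "0100", "0111", "1000", "1001", "1011"]
print(suppression_cost(table, [[0, 1, 2], [3, 4, 5]], k=3))
```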

Data Structures and Algorithms Databases Discrete Mathematics
