Biclustering with Alternating K-Means

102 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zichao Li

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية الاحصاء الرياضي

والبحث باللغة English

تأليف Nicolas Fraiman - Zichao Li

التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-diagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.

قيم البحث

143 - Suyun Liu , Luis Nunes Vicente 2021

In the application of data clustering to human-centric decision-making systems, such as loan applications and advertisement recommendations, the clustering outcome might discriminate against people across different demographic groups, leading to unfa irness. A natural conflict occurs between the cost of clustering (in terms of distance to cluster centers) and the balance representation of all demographic groups across the clusters, leading to a bi-objective optimization problem that is nonconvex and nonsmooth. To determine the complete trade-off between these two competing goals, we design a novel stochastic alternating balance fair $k$-means (SAfairKM) algorithm, which consists of alternating classical mini-batch $k$-means updates and group swap updates. The number of $k$-means updates and the number of swap updates essentially parameterize the weight put on optimizing each objective function. Our numerical experiments show that the proposed SAfairKM algorithm is robust and computationally efficient in constructing well-spread and high-quality Pareto fronts both on synthetic and real datasets. Moreover, we propose a novel companion algorithm, the stochastic alternating bi-objective gradient descent (SA2GD) algorithm, which can handle a smooth version of the considered bi-objective fair $k$-means problem, more amenable for analysis. A sublinear convergence rate of $mathcal{O}(1/T)$ is established under strong convexity for the determination of a stationary point of a weighted sum of the two functions parameterized by the number of steps or updates on each function.

التعلم الآلي التعلم الالي

Deep clustering with concrete k-means

125 - Boyan Gao , Yongxin Yang , Henry Gouk 2019

We address the problem of simultaneously learning a k-means clustering and deep feature representation from unlabelled data, which is of interest due to the potential of deep k-means to outperform traditional two-step feature extraction and shallow-c lustering strategies. We achieve this by developing a gradient-estimator for the non-differentiable k-means objective via the Gumbel-Softmax reparameterisation trick. In contrast to previous attempts at deep clustering, our concrete k-means model can be optimised with respect to the canonical k-means objective and is easily trained end-to-end without resorting to alternating optimisation. We demonstrate the efficacy of our method on standard clustering benchmarks.

التعلم الآلي التعلم الالي

$t$-$k$-means: A Robust and Stable $k$-means Variant

364 - Yiming Li , Yang Zhang , Qingtao Tang 2019

$k$-means algorithm is one of the most classical clustering methods, which has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, $k$-means algorithm suffers from relatively poor performance on the dataset containing heavy-tailed data or outliers. Besides, standard $k$-means algorithm also has relatively weak stability, $i.e.$ its results have a large variance, which reduces its credibility. In this paper, we propose a robust and stable $k$-means variant, dubbed the $t$-$k$-means, as well as its fast version to alleviate those problems. Theoretically, we derive the $t$-$k$-means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. Extensive experiments are also conducted, which verify the effectiveness and efficiency of the proposed method. The code for reproducing main results is available at url{https://github.com/THUYimingLi/t-k-means}.

التعلم الآلي التعلم الالي

Recombinator-k-means: A population based algorithm that exploits k-means++ for recombination

89 - Carlo Baldassi 2019

We present a simple heuristic algorithm for efficiently optimizing the notoriously hard minimum sum-of-squares clustering problem, usually addressed by the classical k-means heuristic and its variants. The algorithm, called recombinator-k-means, is v ery similar to a genetic algorithmic scheme: it uses populations of configurations, that are optimized independently in parallel and then recombined in a next-iteration population batch by exploiting a variant of the k-means++ seeding algorithm. An additional reweighting mechanism ensures that the population eventually coalesces into a single solution. Extensive tests measuring optimization objective vs computational time on synthetic and real-word data show that it is the only choice, among state-of-the-art alternatives (simple restarts, random swap, genetic algorithm with pairwise-nearest-neighbor crossover), that consistently produces good results at all time scales, outperforming competitors on large and complicated datasets. The only parameter that requires tuning is the population size. The scheme is rather general (it could be applied even to k-medians or k-medoids, for example). Our implementation is publicly available at https://github.com/carlobaldassi/RecombinatorKMeans.jl.

التعلم الآلي التعلم الالي

Generalization of k-means Related Algorithms

101 - Yiwei Li 2019

This article briefly introduced Arthur and Vassilvitshiis work on textbf{k-means++} algorithm and further generalized the center initialization process. It is found that choosing the most distant sample point from the nearest center as new center can mostly have the same effect as the center initialization process in the textbf{k-means++} algorithm.

التعلم الآلي التعلم الالي