
K-medoids Clustering of Data Sequences with Composite Distributions

Added by Qunwei Li
Publication date: 2018
Language: English





This paper studies clustering of data sequences using the k-medoids algorithm. All data sequences are assumed to be generated from unknown continuous distributions, which form clusters, with each cluster containing a composite set of closely located distributions (based on a certain distance metric between distributions). The maximum intra-cluster distance is assumed to be smaller than the minimum inter-cluster distance, and both values are assumed to be known. The goal is to group data sequences together if their underlying (unknown) generative distributions belong to the same cluster. K-medoids algorithms based on distribution distance metrics are proposed for both known and unknown numbers of distribution clusters. Upper bounds on the error probability and convergence results in the large-sample regime are also provided. It is shown that the error probability decays exponentially fast as the number of samples in each data sequence goes to infinity. The error exponent has a simple form regardless of the distance metric applied when certain conditions are satisfied. In particular, the error exponent is characterized when either the Kolmogorov-Smirnov distance or the maximum mean discrepancy is used as the distance metric. Simulation results are provided to validate the analysis.
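To make the approach concrete, the following is a minimal sketch (not the authors' implementation) of distance-based k-medoids over data sequences, using the two-sample Kolmogorov-Smirnov statistic between empirical distributions as the distance; the function names and the PAM-style update are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance(x, y):
    # Two-sample Kolmogorov-Smirnov statistic between the empirical
    # distributions of sequences x and y.
    return ks_2samp(x, y).statistic

def k_medoids_sequences(sequences, k, max_iter=100, seed=0):
    n = len(sequences)
    # Precompute all pairwise distances between the data sequences.
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = ks_distance(sequences[i], sequences[j])
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assign each sequence to its nearest medoid.
        labels = np.argmin(d[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member minimizing total intra-cluster distance.
                costs = d[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids
```

Swapping ks_distance for an empirical maximum mean discrepancy estimate gives the MMD variant; per the paper's analysis, the error exponent takes the same simple form under either metric once the separation conditions hold.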



Related research


We address the problem of simultaneously learning a k-means clustering and a deep feature representation from unlabelled data, which is of interest due to the potential of deep k-means to outperform traditional two-step feature extraction and shallow-clustering strategies. We achieve this by developing a gradient estimator for the non-differentiable k-means objective via the Gumbel-Softmax reparameterisation trick. In contrast to previous attempts at deep clustering, our concrete k-means model can be optimised with respect to the canonical k-means objective and is easily trained end-to-end without resorting to alternating optimisation. We demonstrate the efficacy of our method on standard clustering benchmarks.
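A hedged sketch of the central trick, assuming PyTorch: the hard cluster assignment is relaxed with the Gumbel-Softmax reparameterisation so the k-means objective becomes differentiable with respect to both features and centroids; names and the exact loss form are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def concrete_kmeans_loss(z, centroids, tau=1.0):
    # z: (batch, dim) deep features; centroids: (k, dim) cluster centres.
    sq_dists = torch.cdist(z, centroids) ** 2          # (batch, k)
    # Relaxed one-hot assignment sampled via the Gumbel-Softmax trick;
    # gradients flow through the reparameterised sample.
    assign = F.gumbel_softmax(-sq_dists, tau=tau, hard=False)
    # Expected k-means cost under the relaxed assignment.
    return (assign * sq_dists).sum(dim=1).mean()
```

Because the loss is differentiable, the same backward pass can update the encoder producing z and the centroids jointly, avoiding the alternating optimisation the abstract mentions.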
Xia Yue, Wang Man, Jun Yue (2016)
Clustering analysis has received considerable attention in spatial data mining for several years. With the rapid development of geospatial information technologies, the volume of spatial data is growing exponentially, which makes clustering massive spatial data a challenging task. To improve the efficiency of spatial clustering for large-scale data, many researchers have proposed efficient parallel clustering algorithms. In this paper, a new K-Medoids++ spatial clustering algorithm based on MapReduce is proposed for clustering massive spatial data. An initialization step that decreases the number of iterations is combined with the MapReduce framework. Comparative experiments conducted over different datasets and different numbers of nodes indicate that the proposed K-Medoids spatial clustering algorithm provides better efficiency than traditional K-Medoids and scales well when processing massive spatial data on commodity hardware.
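The initialization step is in the k-means++ family; a minimal single-machine sketch of such a seeding rule (names assumed, with the pairwise distance matrix d precomputed) might look like:

```python
import numpy as np

def medoids_pp_init(d, k, seed=0):
    # d: (n, n) precomputed pairwise distance matrix.
    rng = np.random.default_rng(seed)
    n = d.shape[0]
    medoids = [int(rng.integers(n))]
    for _ in range(k - 1):
        # Each point's distance to its nearest already-chosen medoid.
        nearest = d[:, medoids].min(axis=1)
        # Next medoid sampled proportionally to squared distance, so
        # far-away regions are likely to receive a seed.
        probs = nearest ** 2 / (nearest ** 2).sum()
        medoids.append(int(rng.choice(n, p=probs)))
    return np.array(medoids)
```

Better-spread seeds mean fewer swap iterations before convergence, which is the source of the speedup the paper combines with MapReduce parallelism.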
Most existing clustering algorithms are proposed without considering selection bias in the data. In many real applications, however, one cannot guarantee that the data are unbiased. Selection bias may introduce unexpected correlations between features, and ignoring those correlations will hurt the performance of clustering algorithms. Therefore, how to remove the unexpected correlations induced by selection bias is extremely important yet largely unexplored for clustering. In this paper, we propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias. Specifically, the decorrelation regularizer aims to learn global sample weights that balance the sample distribution, so as to remove unexpected correlations among features. The learned weights are then combined with k-means, so that the reweighted k-means clusters on the inherent data distribution, free of the influence of unexpected correlations. Moreover, we derive updating rules to effectively infer the parameters of DCKM. Extensive experimental results on real-world datasets demonstrate that our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias when clustering.
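Once the decorrelation regularizer has produced sample weights, the clustering step reduces to a weighted k-means. A sketch of the reweighted centroid update under that assumption (the regularizer itself is not reproduced here, and all names are illustrative):

```python
import numpy as np

def weighted_centroid_update(X, labels, w, k):
    # X: (n, dim) data; labels: (n,) cluster assignments;
    # w: (n,) learned sample weights from the decorrelation step.
    centroids = np.zeros((k, X.shape[1]))
    for c in range(k):
        m = labels == c
        if m.any():
            # Each sample contributes proportionally to its weight, so the
            # update follows the reweighted (bias-balanced) distribution.
            centroids[c] = np.average(X[m], axis=0, weights=w[m])
    return centroids
```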
Clustering is one of the most common unsupervised learning tasks in machine learning and data mining. Clustering algorithms have been used in a plethora of applications across several scientific fields. However, there has been limited research in the clustering of point patterns - sets or multi-sets of unordered elements - that are found in numerous applications and data sources. In this paper, we propose two approaches for clustering point patterns. The first is a non-parametric method based on novel distances for sets. The second is a model-based approach, formulated via random finite set theory, and solved by the Expectation-Maximization algorithm. Numerical experiments show that the proposed methods perform well on both simulated and real data.
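The set distances proposed in the paper are its own contribution; purely for illustration, a classic distance between point patterns of possibly different cardinalities is the Hausdorff distance, sketched below.

```python
import numpy as np

def hausdorff(A, B):
    # A: (m, dim), B: (n, dim) point patterns (unordered sets of points).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (m, n)
    # Largest distance from any point in one set to its nearest
    # neighbour in the other set, taken symmetrically.
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Any such set distance can be plugged into a medoid-based clusterer, since medoid methods only require pairwise distances, not a vector-space mean.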
Jianguo Chen, Philip S. Yu (2019)
As one type of efficient unsupervised learning method, clustering algorithms have been widely used in data mining and knowledge discovery with noticeable advantages. However, density-peak-based clustering algorithms have limited effect on data with varying density distribution (VDD), equilibrium distribution (ED), and multiple domain-density maximums (MDDM), leading to the problems of sparse-cluster loss and cluster fragmentation. To address these problems, we propose a Domain-Adaptive Density Clustering (DADC) algorithm, which consists of three steps: domain-adaptive density measurement, cluster-center self-identification, and cluster self-ensemble. For data with VDD features, clusters in sparse regions are often neglected when uniform density-peak thresholds are used, which results in the loss of sparse clusters. We define a domain-adaptive density measurement method based on K-Nearest Neighbors (KNN) to adaptively detect the density peaks of regions with different densities, treating each data point and its KNN neighborhood as a subgroup to better reflect its density distribution from a domain view. In addition, for data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. We propose a cluster-center self-identification and cluster self-ensemble method to automatically extract the initial cluster centers and merge the fragmented clusters. Experimental results demonstrate that, compared with other algorithms, the proposed DADC algorithm obtains more reasonable clustering results on data with VDD, ED, and MDDM features. Benefiting from few parameter requirements and a non-iterative nature, DADC achieves low computational complexity and is suitable for large-scale data clustering.
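A hedged sketch of the first step, domain-adaptive density measurement in the KNN-subgroup spirit described above (the paper's exact formula may differ; scikit-learn is assumed for neighbour search):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_local_density(X, k=10):
    # Fit on X and query X itself; the first neighbour of each point
    # is the point itself, so ask for k + 1 neighbours and drop it.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    # Density scored within each point's KNN subgroup: the inverse of
    # the mean distance to its k nearest neighbours, so sparse regions
    # still produce comparable local peaks.
    return 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)
```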
