ترغب بنشر مسار تعليمي؟ اضغط هنا

A Domain Adaptive Density Clustering Algorithm for Data with Varying Density Distribution

155   0   0.0 ( 0 )
 نشر من قبل Jianguo Chen
 تاريخ النشر 2019
والبحث باللغة English




اسأل ChatGPT حول البحث

As one type of efficient unsupervised learning methods, clustering algorithms have been widely used in data mining and knowledge discovery with noticeable advantages. However, clustering algorithms based on density peak have limited clustering effect on data with varying density distribution (VDD), equilibrium distribution (ED), and multiple domain-density maximums (MDDM), leading to the problems of sparse cluster loss and cluster fragmentation. To address these problems, we propose a Domain-Adaptive Density Clustering (DADC) algorithm, which consists of three steps: domain-adaptive density measurement, cluster center self-identification, and cluster self-ensemble. For data with VDD features, clusters in sparse regions are often neglected by using uniform density peak thresholds, which results in the loss of sparse clusters. We define a domain-adaptive density measurement method based on K-Nearest Neighbors (KNN) to adaptively detect the density peaks of different density regions. We treat each data point and its KNN neighborhood as a subgroup to better reflect its density distribution in a domain view. In addition, for data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. We propose a cluster center self-identification and cluster self-ensemble method to automatically extract the initial cluster centers and merge the fragmented clusters. Experimental results demonstrate that compared with other comparative algorithms, the proposed DADC algorithm can obtain more reasonable clustering results on data with VDD, ED and MDDM features. Benefitting from a few parameter requirements and non-iterative nature, DADC achieves low computational complexity and is suitable for large-scale data clustering.


قيم البحث

اقرأ أيضاً

Time Projection Chambers (TPCs) working in combination with Gas Electron Multipliers (GEMs) produce a very sensitive detector capable of observing low energy events. This is achieved by capturing photons generated during the GEM electron multiplicati on process by means of a high-resolution camera. The CYGNO experiment has recently developed a TPC Triple GEM detector coupled to a low noise and high spatial resolution CMOS sensor. For the image analysis, an algorithm based on an adapted version of the well-known DBSCAN was implemented, called iDBSCAN. In this paper a description of the iDBSCAN algorithm is given, including test and validation of its parameters, and a comparison with DBSCAN itself and a widely used algorithm known as Nearest Neighbor Clustering (NNC). The results show that the adapted version of DBSCAN is capable of providing full signal detection efficiency and very good energy resolution while improving the detector background rejection.
The effectiveness and performance of artificial neural networks, particularly for visual tasks, depends in crucial ways on the receptive field of neurons. The receptive field itself depends on the interplay between several architectural aspects, incl uding sparsity, pooling, and activation functions. In recent literature there are several ad hoc proposals trying to make receptive fields more flexible and adaptive to data. For instance, different parameterizations of convolutional and pooling layers have been proposed to increase their adaptivity. In this paper, we propose the novel theoretical framework of density-embedded layers, generalizing the transformation represented by a neuron. Specifically, the affine transformation applied on the input is replaced by a scalar product of the input, suitably represented as a piecewise constant function, with a density function associated with the neuron. This density is shown to describe directly the receptive field of the neuron. Crucially, by suitably representing such a density as a linear combination of a parametric family of functions, we can efficiently train the densities by means of any automatic differentiation system, making it adaptable to the problem at hand, and computationally efficient to evaluate. This framework captures and generalizes recent methods, allowing a fine tuning of the receptive field. In the paper, we define some novel layers and we experimentally validate them on the classic MNIST dataset.
Dimensionality reduction and clustering are often used as preliminary steps for many complex machine learning tasks. The presence of noise and outliers can deteriorate the performance of such preprocessing and therefore impair the subsequent analysis tremendously. In manifold learning, several studies indicate solutions for removing background noise or noise close to the structure when the density is substantially higher than that exhibited by the noise. However, in many applications, including astronomical datasets, the density varies alongside manifolds that are buried in a noisy background. We propose a novel method to extract manifolds in the presence of noise based on the idea of Ant colony optimization. In contrast to the existing random walk solutions, our technique captures points which are locally aligned with major directions of the manifold. Moreover, we empirically show that the biologically inspired formulation of ant pheromone reinforces this behavior enabling it to recover multiple manifolds embedded in extremely noisy data clouds. The algorithms performance is demonstrated in comparison to the state-of-the-art approaches, such as Markov Chain, LLPD, and Disperse, on several synthetic and real astronomical datasets stemming from an N-body simulation of a cosmological volume.
Data clustering with uneven distribution in high level noise is challenging. Currently, HDBSCAN is considered as the SOTA algorithm for this problem. In this paper, we propose a novel clustering algorithm based on what we call graph of density topolo gy (GDT). GDT jointly considers the local and global structures of data samples: firstly forming local clusters based on a density growing process with a strategy for properly noise handling as well as cluster boundary detection; and then estimating a GDT from relationship between local clusters in terms of a connectivity measure, givingglobal topological graph. The connectivity, measuring similarity between neighboring local clusters, is based on local clusters rather than individual points, ensuring its robustness to even very large noise. Evaluation results on both toy and real-world datasets show that GDT achieves the SOTA performance by far on almost all the popular datasets, and has a low time complexity of O(nlogn). The code is available at https://github.com/gaozhangyang/DGC.git.
Perhaps surprisingly, recent studies have shown probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than in-distribution data. To ameliorate this issue we propo se DoSE, the density of states estimator. Drawing on the statistical physics notion of ``density of states, the DoSE decision rule avoids direct comparison of model probabilities, and instead utilizes the ``probability of the model probability, or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSEs state-of-the-art performance against other unsupervised OOD detectors on previously established ``hard benchmarks.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا