No Arabic abstract
Clustering data into meaningful subsets is a major task in scientific data analysis. To date, various strategies ranging from model-based approaches to data-driven schemes, have been devised for efficient and accurate clustering. One important class of clustering methods that is of a particular interest is the class of exemplar-based approaches. This interest primarily stems from the amount of compressed information encoded in these exemplars that effectively reflect the major characteristics of the respective clusters. Affinity propagation (AP) has proven to be a powerful exemplar-based approach that refines the set of optimal exemplars by iterative pairwise message updates. However, a critical limitation is its inability to capitalize on known networked relations between data points often available for various scientific datasets. To mitigate this shortcoming, we propose geometric-AP, a novel clustering algorithm that effectively extends AP to take advantage of the network topology. Geometric-AP obeys network constraints and uses max-sum belief propagation to leverage the available network topology for generating smooth clusters over the network. Extensive performance assessment reveals a significant enhancement in the quality of the clustering results when compared to benchmark clustering schemes. Especially, we demonstrate that geometric-AP performs extremely well even in cases where the original AP fails drastically.
Affinity propagation is an exemplar-based clustering algorithm that finds a set of data-points that best exemplify the data, and associates each datapoint with one exemplar. We extend affinity propagation in a principled way to solve the hierarchical clustering problem, which arises in a variety of domains including biology, sensor networks and decision making in operational research. We derive an inference algorithm that operates by propagating information up and down the hierarchy, and is efficient despite the high-order potentials required for the graphical model formulation. We demonstrate that our method outperforms greedy techniques that cluster one layer at a time. We show that on an artificial dataset designed to mimic the HIV-strain mutation dynamics, our method outperforms related methods. For real HIV sequences, where the ground truth is not available, we show our method achieves better results, in terms of the underlying objective function, and show the results correspond meaningfully to geographical location and strain subtypes. Finally we report results on using the method for the analysis of mass spectra, showing it performs favorably compared to state-of-the-art methods.
In most cases deep learning architectures are trained disregarding the amount of operations and energy consumption. However, some applications, like embedded systems, can be resource-constrained during inference. A popular approach to reduce the size of a deep learning architecture consists in distilling knowledge from a bigger network (teacher) to a smaller one (student). Directly training the student to mimic the teacher representation can be effective, but it requires that both share the same latent space dimensions. In this work, we focus instead on relative knowledge distillation (RKD), which considers the geometry of the respective latent spaces, allowing for dimension-agnostic transfer of knowledge. Specifically we introduce a graph-based RKD method, in which graphs are used to capture the geometry of latent spaces. Using classical computer vision benchmarks, we demonstrate the ability of the proposed method to efficiently distillate knowledge from the teacher to the student, leading to better accuracy for the same budget as compared to existing RKD alternatives.
We propose a new clustering algorithm, Extended Affinity Propagation, based on pairwise similarities. Extended Affinity Propagation is developed by modifying Affinity Propagation such that the desirable features of Affinity Propagation, e.g., exemplars, reasonable computational complexity and no need to specify number of clusters, are preserved while the shortcomings, e.g., the lack of global structure discovery, that limit the applicability of Affinity Propagation are overcome. Extended Affinity Propagation succeeds not only in achieving this goal but can also provide various additional insights into the internal structure of the individual clusters, e.g., refined confidence values, relative cluster densities and local cluster strength in different regions of a cluster, which are valuable for an analyst. We briefly discuss how these insights can help in easily tuning the hyperparameters. We also illustrate these desirable features and the performance of Extended Affinity Propagation on various synthetic and real world datasets.
We consider a new kind of clustering problem in which clusters need not be independent of each other, but rather can have compositional relationships with other clusters (e.g., an image set consists of rectangles, circles, as well as combinations of rectangles and circles). This task is motivated by recent work in few-shot learning on compositional embedding models that structure the embedding space to distinguish the label sets, not just the individual labels, assigned to the examples. To tackle this clustering problem, we propose a new algorithm called Compositional Affinity Propagation (CAP). In contrast to standard Affinity Propagation as well as other algorithms for multi-view and hierarchical clustering, CAP can deduce compositionality among clusters automatically. We show promising results, compared to several existing clustering algorithms, on the MultiMNIST, OmniGlot, and LibriSpeech datasets. Our work has applications to multi-object image recognition and speaker diarization with simultaneous speech from multiple speakers.
The accelerated evolution and explosion of the Internet and social media is generating voluminous quantities of data (on zettabyte scales). Paramount amongst the desires to manipulate and extract actionable intelligence from vast big data volumes is the need for scalable, performance-conscious analytics algorithms. To directly address this need, we propose a novel MapReduce implementation of the exemplar-based clustering algorithm known as Affinity Propagation. Our parallelization strategy extends to the multilevel Hierarchical Affinity Propagation algorithm and enables tiered aggregation of unstructured data with minimal free parameters, in principle requiring only a similarity measure between data points. We detail the linear run-time complexity of our approach, overcoming the limiting quadratic complexity of the original algorithm. Experimental validation of our clustering methodology on a variety of synthetic and real data sets (e.g. images and point data) demonstrates our competitiveness against other state-of-the-art MapReduce clustering techniques.