Semi-Supervised Active Clustering with Weak Oracles


Abstract in English

Semi-supervised active clustering (SSAC) utilizes the knowledge of a domain expert to cluster data points by interactively making pairwise same-cluster queries. However, it is impractical to ask human oracles to answer every pairwise query. In this paper, we study the influence of allowing not-sure answers from a weak oracle and propose algorithms to efficiently handle uncertainties. Different types of model assumptions are analyzed to cover realistic scenarios of oracle abstraction. In the first model, random-weak oracle, an oracle randomly abstains with a certain probability. We also proposed two distance-weak oracle models which simulate the case of getting confused based on the distance between two points in a pairwise query. For each weak oracle model, we show that a small query complexity is adequate for the effective $k$ means clustering with high probability. Sufficient conditions for the guarantee include a $gamma$-margin property of the data, and an existence of a point close to each cluster center. Furthermore, we provide a sample complexity with a reduced effect of the clusters margin and only a logarithmic dependency on the data dimension. Our results allow significantly less number of same-cluster queries if the margin of the clusters is tight, i.e. $gamma approx 1$. Experimental results on synthetic data show the effective performance of our approach in overcoming uncertainties.

Download