Skip to main content

On Potts Model Clustering, Kernel K-means, and Density Estimation

Clustering high-dimensional data sets is becoming a common problem in emerging fields such as bioinformatics (e.g. finding subtypes of cancer in microarray gene expression data to deliver patient-specific treatment), and text mining (e.g. document clustering for fast and relevant search). Hence reliable clustering methods whose performances do not depend on the dimensionality of the data are needed. In this work, we introduce Potts model clustering as a powerful kernel-based clustering method. We built on the work of Blatt, Wiseman and Domany (1996) who, borrowing from known algorithms in physics, used Potts models as a general tool for data clustering. A great advantage of our Potts model clustering methodology over other kernel-based clustering methods, such as kernel K-means, is that it estimates both the clusters and their number simultaneously.

We argue that the key to the success of kernel-based methods lies in their ability to perform clustering through an implicit non-parametric modeling of the density underlying the data. We also show that a slightly modified version of both Potts model clustering and kernel K-means (a penalized Potts model clustering and a weighted kernel K-means, respectively), solve the same clustering problem. Furthermore we introduce an algorithm, a penalized version of the Wolff algorithm, to uncover the cluster structure suggested by penalized Potts model clustering.

The link between kernel-based methods and non-parametric density estimation allows us to propose several estimates of the kernel bandwidths in order to improve the performance of the algorithms. The Markov Chain Monte Carlo nature of Potts model clustering makes it possible to estimate kernel bandwidths under diverse bandwidth smoothing constraints. As a by-product the bandwidths derived from this procedure may be used to obtain a kernel density estimate of the data.

We applied Potts model clustering to gene expression data, and compared our results to those obtained by two competing methods: Gaussian model based clustering, and a non-parametric dendogram sharpening method.

Part of this work has been done in collaboration with Larissa Stanberry and Werner Stuetzle.