Mixture Model Clustering with the Multimix Program

Authors:

Murray A. Jorgensen
Department of Statistics
University of Waikato
Private Bag 3105
Hamilton, New Zealand
E-mail: maj@waikato.ac.nz
Phone: +64-7-838-4773
Fax: +64-7-838-4155

Lynette A. Hunt
Department of Statistics
University of Waikato
Private Bag 3105
Hamilton, New Zealand
E-mail: maj@waikato.ac.nz
Phone: +64-7-856-2889
Fax: +64-7-838-4155

Abstract:

Hunt (1996) has implemented the finite mixture model approach to clustering in a program called Multimix. The program is designed to cluster multivariate data with categorical and continuous variables and possibly containing missing values. The model fitted generalises both the Latent Class model and the mixture of multivariate normals model. As with either of these models, Multimix can be used to form clusters by the Bayes allocation rule. This is the intended use of the program, although the parameter estimates can also be used to give a succinct description of the clusters.
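
The Bayes allocation rule mentioned above assigns each observation to the cluster with the largest posterior probability. The following is a minimal Python sketch of that rule, not Multimix's own code. It assumes one categorical and one continuous variable that are independent within each cluster; the function names and the parameter values in `clusters` are hypothetical and chosen only for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probs(obs, clusters):
    """Posterior cluster probabilities for obs = (category, value).

    Assumes the two variables are independent within each cluster,
    so the within-cluster density is a product of two factors.
    """
    category, value = obs
    joint = [
        c["weight"] * c["cat_probs"][category] * normal_pdf(value, c["mean"], c["sd"])
        for c in clusters
    ]
    total = sum(joint)
    return [j / total for j in joint]

def bayes_allocate(obs, clusters):
    """Index of the cluster with the largest posterior probability."""
    post = posterior_probs(obs, clusters)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical fitted parameters for two clusters (illustration only).
clusters = [
    {"weight": 0.5, "cat_probs": {"a": 0.8, "b": 0.2}, "mean": 0.0, "sd": 1.0},
    {"weight": 0.5, "cat_probs": {"a": 0.3, "b": 0.7}, "mean": 4.0, "sd": 1.0},
]

print(bayes_allocate(("a", 0.2), clusters))
print(bayes_allocate(("b", 3.9), clusters))
```

In a real analysis the parameters would be the maximum likelihood estimates from the fitted mixture rather than hand-chosen values.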

Use of the EM algorithm, with its view of the observed data as being notionally augmented by missing information to form the `complete data', gives a broad framework for estimation which is able to handle two types of missing information: unknown cluster assignment and missing data. By using the methodology of Little and Rubin (1987) in this way, Multimix is able to handle missing data in a less ad hoc way than many clustering algorithms. The program runs in acceptable time with large data matrices (say hundreds of observations on tens of variables). Use of the missing-data facility increases execution time somewhat. In this presentation we describe the approach taken to the design of Multimix and how some of the statistical problems were dealt with. As examples of the use of the program we cluster a large medical dataset and a version of Fisher's Iris data in which a third of the values are randomly made `missing'.
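
To illustrate how EM can treat cluster labels and missing values as two kinds of missing information, here is a minimal Python sketch. It is not the Multimix implementation: it assumes continuous variables only, independence of variables within each cluster, and values missing at random (coded as `None`). All names, the toy data, and the starting values are hypothetical.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def e_step(data, weights, means, variances):
    """Posterior cluster probabilities from the observed entries of each row."""
    resp = []
    for row in data:
        logs = []
        for k in range(len(weights)):
            lp = math.log(weights[k])
            for j, x in enumerate(row):
                if x is not None:  # a missing entry contributes no factor
                    lp += log_normal_pdf(x, means[k][j], variances[k][j])
            logs.append(lp)
        top = max(logs)  # subtract the maximum for numerical stability
        probs = [math.exp(lp - top) for lp in logs]
        total = sum(probs)
        resp.append([p / total for p in probs])
    return resp

def m_step(data, resp, means, variances):
    """Update parameters from expected complete-data sufficient statistics."""
    n, p, g = len(data), len(data[0]), len(resp[0])
    new_w, new_m, new_v = [], [], []
    for k in range(g):
        nk = max(sum(r[k] for r in resp), 1e-12)  # guard against empty clusters
        new_w.append(nk / n)
        mk, vk = [], []
        for j in range(p):
            s1 = s2 = 0.0
            for row, r in zip(data, resp):
                if row[j] is None:
                    # E[x] and E[x^2] given cluster k under current parameters
                    ex = means[k][j]
                    ex2 = means[k][j] ** 2 + variances[k][j]
                else:
                    ex, ex2 = row[j], row[j] ** 2
                s1 += r[k] * ex
                s2 += r[k] * ex2
            mu = s1 / nk
            mk.append(mu)
            vk.append(max(s2 / nk - mu * mu, 1e-6))  # floor keeps variance positive
        new_m.append(mk)
        new_v.append(vk)
    return new_w, new_m, new_v

def fit(data, weights, means, variances, iterations=50):
    for _ in range(iterations):
        resp = e_step(data, weights, means, variances)
        weights, means, variances = m_step(data, resp, means, variances)
    return weights, means, variances, e_step(data, weights, means, variances)

# Toy data: two well-separated groups, with some values missing (None).
data = [
    (0.1, 0.2), (-0.3, None), (0.4, -0.1), (None, 0.3), (-0.2, -0.4),
    (5.1, 4.8), (4.7, None), (5.3, 5.2), (None, 5.1), (4.9, 4.6),
]
weights, means, variances, resp = fit(
    data, [0.5, 0.5], [[-1.0, -1.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]
)
labels = [max(range(2), key=r.__getitem__) for r in resp]
print(labels)
```

The point of the sketch is that incomplete rows are neither deleted nor filled in beforehand: in the E-step they are allocated using whatever variables were observed, and in the M-step their missing entries enter through conditional expectations.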

Keywords:

Cluster analysis, EM algorithm, Latent class analysis, Local independence, Multivariate Normal distribution, Location model, Prostate cancer data, Missing data

Availability:

PostScript

Other information:

A version of the software is available on the Multimix FTP site.