Semi‐supervised clustering methods

Cluster analysis methods seek to partition a data set into homogeneous subgroups. Such methods are useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable, nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as ‘semi‐supervised clustering’ methods) that can be applied in these situations. The majority of these methods are modifications of the popular k‐means clustering method, and several of them will be described in detail. A brief description of some other semi‐supervised clustering algorithms is also provided.

This article is categorized under:
Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification
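To make the k‐means connection concrete, below is a minimal sketch of one such modification, often called seeded k‐means, in which the known labels of a few 'seed' observations initialize the centers and are never reassigned. The function name, update rule, and stopping criterion here are illustrative assumptions, not the specific algorithms reviewed in the article.

```python
import numpy as np

def seeded_kmeans(X, k, seed_idx, seed_labels, n_iter=100):
    """k-means in which the observations in seed_idx have known labels.

    Assumes every cluster 0..k-1 has at least one seed observation.
    """
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    # Initialize each center from the labeled seeds of that cluster.
    centers = np.vstack([X[seed_idx[seed_labels == j]].mean(axis=0)
                         for j in range(k)])
    for _ in range(n_iter):
        # Assign every observation to its nearest center ...
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # ... but keep the seeds fixed at their known labels.
        labels[seed_idx] = seed_labels
        new_centers = np.vstack([X[labels == j].mean(axis=0)
                                 for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```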
This figure illustrates how hierarchical clustering would partition a simple data set. In the first two steps, the two pairs of adjacent points would each be combined into a single cluster. In the third step, these two clusters would be combined into a larger cluster. In the final step, the remaining point would be merged into this cluster. All the data points are now combined into a single cluster, so the algorithm terminates.
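The merge sequence described above can be reproduced with SciPy's agglomerative clustering; the five coordinates below are invented to form two tight pairs plus one more distant point.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.0, 0.0], [0.1, 0.0],   # first adjacent pair
                [1.0, 0.0], [1.1, 0.0],   # second adjacent pair
                [3.0, 0.0]])              # remaining, more distant point
Z = linkage(pts, method="single")
# Each row of Z records one merge: (cluster a, cluster b, height, new size).
for step, (a, b, height, size) in enumerate(Z, start=1):
    print(f"step {step}: merge {int(a)} and {int(b)} "
          f"at height {height:.2f} (new size {int(size)})")
```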
In the above figure, there are two clusters whose means differ with respect to x but not with respect to y. If 2‐means clustering is applied to both x and y, it fails to identify the correct clusters, whereas 2‐means clustering applied to x alone produces satisfactory results.
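A small simulation of this point, assuming scikit-learn is available: when the uninformative feature y has much higher variance than x, 2-means on (x, y) splits largely along y, while 2-means on x alone recovers the true groups. All values below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 100
x = np.concatenate([rng.normal(0, 1, n), rng.normal(6, 1, n)])  # informative
y = rng.normal(0, 10, 2 * n)             # uninformative, high-variance feature
truth = np.repeat([0, 1], n)

def accuracy(labels):
    agree = (labels == truth).mean()
    return max(agree, 1 - agree)         # labels are defined only up to swapping

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print("x and y:", accuracy(km.fit_predict(np.column_stack([x, y]))))
print("x only: ", accuracy(km.fit_predict(x.reshape(-1, 1))))
```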
An example of a data set where two different sets of clusters exist and only one set is associated with the outcome of interest. In the above figure, darker shades of blue correspond to higher values of the features and lighter shades to lower values. Suppose that observations 1–100 have a disease of interest and observations 101–200 are controls. In this case, we would be interested in identifying the clusters formed by features 1–50. However, conventional clustering algorithms will identify the clusters formed by features 51–150, since the distance between the centers of those two clusters is greater than the distance between the centers of the clusters formed by features 1–50.
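One way to target the outcome-associated clusters, sketched below on invented data, is to screen features by their association with the outcome (here a two-sample t-statistic) and then cluster on the retained features. The screening rule, cutoff, and all simulation parameters are illustrative assumptions, not the article's specific procedure.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.normal(size=(n, p))
outcome = np.repeat([1, 0], 100)        # observations 1-100 diseased, 101-200 controls
X[:100, :50] += 1.0                     # outcome-related clusters (features 1-50)
other_split = rng.permutation(n)[:100]  # a split of the observations unrelated to the outcome
X[other_split, 50:150] += 2.0           # stronger, outcome-unrelated clusters (features 51-150)

# Screen: keep the 50 features most associated with the outcome, then cluster.
tstat, _ = ttest_ind(X[outcome == 1], X[outcome == 0], axis=0)
keep = np.argsort(-np.abs(tstat))[:50]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, keep])
agree = (labels == outcome).mean()
print("agreement with disease status:", max(agree, 1 - agree))
```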
This figure shows an example of a situation where an (observed) outcome variable (namely survival) is a “noisy surrogate” for two unobserved clusters. Suppose there are two subtypes of cancer, and patients with the first subtype (cluster) tend to have lower survival than patients with the second subtype. However, there is considerable overlap in the distributions of the survival times, so although a patient with a low survival time is more likely to be in cluster 1, it is not possible to assign each patient to a cluster based only on his or her survival time.
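The “noisy surrogate” idea can be made numeric with a simple two-component normal mixture for (log) survival, a modeling assumption made here only for illustration: an observed survival time shifts the probability of cluster membership without determining it.

```python
import numpy as np
from scipy.stats import norm

mu = np.array([2.0, 3.0])      # mean (log) survival in clusters 1 and 2
sigma = 0.8                    # common spread: the two distributions overlap heavily
prior = np.array([0.5, 0.5])   # assume equal cluster sizes

def prob_cluster1(t):
    """Posterior probability of cluster 1 given observed (log) survival t."""
    likelihood = norm.pdf(t, loc=mu, scale=sigma) * prior
    return likelihood[0] / likelihood.sum()

for t in (1.5, 2.5, 3.5):
    # Low survival makes cluster 1 more likely, but never certain.
    print(f"survival {t}: P(cluster 1) = {prob_cluster1(t):.2f}")
```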
Hierarchical clustering was applied to the five data points plotted in the left panel. The resulting dendrogram is shown in the right panel. Note that point 3 is much more distant from (and hence dissimilar to) the remaining four points. Thus, the height of the node where point 3 is merged with the remaining points is greater than the height of the other nodes in the graph.
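The dendrogram and its node heights can be computed with SciPy; the coordinates below are made up so that point 3 lies far from the other four, and the final merge height is correspondingly large (plotting assumes matplotlib is available).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

pts = np.array([[0.0, 0.0],    # point 1
                [0.2, 0.1],    # point 2
                [4.0, 4.0],    # point 3: far from the other four
                [0.1, 0.9],    # point 4
                [0.3, 1.0]])   # point 5
Z = linkage(pts, method="complete")
print(Z[:, 2])   # merge heights; the final merge (point 3) is by far the highest
dendrogram(Z, labels=[1, 2, 3, 4, 5])
plt.show()
```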