How to cite this WIREs title:
WIREs Data Mining Knowl Discov

Clustering high dimensional data

High‐dimensional data, i.e., data described by a large number of attributes, pose specific challenges to clustering. The so‐called ‘curse of dimensionality’, a term coined originally to describe the general increase in complexity of various computational problems as dimensionality increases, is known to render traditional clustering algorithms ineffective. Among other effects, the curse of dimensionality means that, as the number of dimensions increases, meaningful differentiation between similar and dissimilar objects is lost. Because high‐dimensional objects appear almost alike, new approaches to clustering are required. Consequently, recent research has focused on developing techniques and clustering algorithms specifically for high‐dimensional data. Still, open research issues remain.

Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Each cluster groups objects that are similar to one another, whereas dissimilar objects are assigned to different clusters, possibly separating out noise. In this way, clusters describe the structure of the data in an unsupervised manner, i.e., without the need for class labels. A number of clustering paradigms exist that provide different cluster models and different algorithmic approaches for cluster detection. Common to all approaches is that they require some underlying assessment of similarity between data objects.

In this article, we provide an overview of the effects of high‐dimensional spaces and their implications for different clustering paradigms. We review models and algorithms that address clustering in high dimensions, with pointers to the literature, and sketch open research issues. We conclude with a summary of the state of the art. © 2012 Wiley Periodicals, Inc.
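
The loss of differentiation between similar and dissimilar objects described in the abstract can be illustrated with a small experiment. The following is a minimal sketch, not the authors' method: it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function name `relative_contrast` is a hypothetical helper introduced here. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim.

    Assumption for illustration: data are uniform in the unit hypercube.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast is expected to shrink as the dimensionality grows.
    for dim in (2, 10, 100, 1000):
        print(dim, relative_contrast(500, dim))
```

Under these assumptions the printed contrast should fall sharply as `dim` grows, which is one way to see why nearest and farthest neighbors become hard to tell apart. Real data with correlated or irrelevant attributes may behave differently.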

Figure 1.

Clustering: finding groups of data objects based on mutual similarity; dissimilar objects may be separated as noise.

Figure 2.

Dendrogram: visualizing hierarchies of clusters.

Figure 3.

High dimensions: change in density.

Figure 4.

Data clustered in each axis, but spread out in their combination.

Figure 5.

Clusters in different subspace projections.

Figure 6.

Grids (left) and hyperplanes (right) for clustering in high‐dimensional spaces.

Figure 7.

High‐dimensional time series: dimensionality reduction and discretization.

Figure 8.

Document clustering: term frequency per document and inverse document frequency across documents form high‐dimensional vectors.
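
How documents turn into high‐dimensional vectors can be sketched with the common tf-idf weighting, which combines a per-document term frequency with a frequency measured across documents. This is a minimal sketch under stated assumptions: whitespace tokenization, non-empty documents, term frequency normalized by document length, and an unsmoothed logarithmic inverse document frequency. The function name `tfidf_vectors` is hypothetical, and the article's exact weighting may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one tf-idf vector per document.

    Assumptions for illustration: documents are non-empty strings,
    tokens are split on whitespace, and idf = log(N / df) without smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[term] / len(tokens) * idf[term] for term in vocab])
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = tfidf_vectors(["data clustering data", "subspace clustering"])
    print(vocab)
    print(vectors)
```

Each vector has one dimension per vocabulary term, so even a modest collection yields thousands of dimensions, most of them zero for any given document.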

