
Statistical data mining



Abstract

Data mining is widely used in modern science to extract signal from complex data sets. This article summarizes some of the key intellectual issues in the development of this field, largely from a historical perspective. There is particular emphasis on the Curse of Dimensionality and its implications for non‐parametric regression, classification, and cluster analysis. Copyright © 2009 John Wiley & Sons, Inc.

This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification

The relationship between a recursive partition and a regression tree. These two representations are equivalent.
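As a sketch of this equivalence (notation assumed here, not taken from the article), the fitted tree can be written as a piecewise-constant function over the cells of the recursive partition:

    \hat{f}(x) = \sum_{m=1}^{M} c_m \, \mathbf{1}\{x \in R_m\}

where R_1, ..., R_M are the rectangular regions produced by the splits and c_m is the average response within R_m; each terminal node (leaf) of the tree corresponds to exactly one region.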

A logistic or sigmoidal curve, as used for weighting linear combinations of explanatory variables in neural network regression.
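The logistic function plotted here is usually written as

    \sigma(z) = \frac{1}{1 + e^{-z}}

which maps any real-valued linear combination z of the explanatory variables onto the interval (0, 1).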

A representation of a neural network with one hidden layer. Linear combinations of explanatory variables at the bottom layer are combined in the hidden layer and summed to produce the output, a prediction for the response variable.
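A hedged sketch of the corresponding model, with notation assumed rather than taken from the figure: with K hidden units, the prediction is

    \hat{y} = \beta_0 + \sum_{k=1}^{K} \beta_k \, \sigma\!\left(\alpha_{k0} + \alpha_k^{\top} x\right)

where \sigma is the sigmoidal function above, the \alpha_k define the linear combinations of the explanatory variables x formed at the bottom layer, and the \beta_k weight the hidden-layer outputs in the final sum.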

A representation of the projection of an observation onto the subspace determined by a given vector.
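Assuming the projection is onto the span of a single vector a (the symbol itself is not recoverable from the caption), the projected observation is

    \mathrm{proj}_a(x) = \frac{a^{\top} x}{a^{\top} a} \, a

so each observation x is summarized by one coordinate along the direction a.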

The data shown in the first panel of Figure 2, smoothed by a first‐order regression spline with knots at the integers.
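In the usual truncated-power notation (assumed here), a first-order regression spline with knots at the integers has the form

    f(x) = \beta_0 + \beta_1 x + \sum_{k} \theta_k (x - k)_{+}, \qquad (x - k)_{+} = \max(0, x - k)

so the fit is continuous and piecewise linear, with the slope allowed to change only at the integer knots.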

The top left panel shows the true function, and noisy observations from that function are generated by sampling x‐values uniformly, then adding an error ϵ to the corresponding true y‐value, where the errors are independent with normal distribution N(0, (0.5)²). The next four panels show four different methods for univariate non‐parametric regression: fixed‐bin smoothing (middle left), a moving average (middle right), a running line smoother (bottom left), and a loess fit (bottom right).
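As a minimal sketch of one of these smoothers, the moving average below estimates the regression function at each point by averaging the responses of the nearest x-values; the data-generating function, noise level, and window size are assumptions for illustration, not the figure's actual settings.

    import numpy as np

    # Illustrative data: the true function and noise level are assumed, not the figure's.
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 200))
    y = np.sin(x) + rng.normal(0, 0.5, size=x.size)

    def moving_average_smoother(x, y, window=15):
        """Smooth y by averaging the responses of the `window` nearest x-values."""
        fitted = np.empty_like(y)
        for i, xi in enumerate(x):
            nearest = np.argsort(np.abs(x - xi))[:window]  # indices of closest observations
            fitted[i] = y[nearest].mean()
        return fitted

    y_hat = moving_average_smoother(x, y)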

A plot of the volume of a cube in p dimensions as a function of the length of one side, for p = 1, 2, 8.
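The underlying arithmetic is simply that a cube of side s in p dimensions has volume s^p. For example, a sub-cube with side s = 0.9 contains only 0.9^8 ≈ 0.43 of the unit cube when p = 8, even though it spans 90% of each axis; this is one face of the Curse of Dimensionality emphasized in the article.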

The left panel shows a simple model for clusters, in which each cluster has the same covariance matrix but different locations. The middle panel illustrates the case in which each cluster has the same covariance matrix, except with different scaling factors. The right panel is a case in which the three clusters have completely unrelated covariance matrices.
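In practice, clusters with these covariance structures can be fitted with a Gaussian mixture model whose component covariances are constrained. The sketch below uses scikit-learn's GaussianMixture (a library choice assumed here, not named in the article); its 'tied' and 'full' covariance types roughly match the left and right panels, while the middle panel's proportional-covariance model has no direct scikit-learn counterpart.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic stand-in for the figure's three clusters (assumed, for illustration only).
    rng = np.random.default_rng(1)
    centers = ([0, 0], [4, 0], [2, 3])
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

    # Left panel: every cluster shares one covariance matrix, only the locations differ.
    gm_tied = GaussianMixture(n_components=3, covariance_type="tied").fit(X)

    # Right panel: each cluster has its own, unrestricted covariance matrix.
    gm_full = GaussianMixture(n_components=3, covariance_type="full").fit(X)

    labels = gm_full.predict(X)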

The parallel coordinate plot shows two distinct sets of clusters. One depends only on the first three explanatory variables, whereas the second depends only on the last two explanatory variables.
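A plot of this kind can be drawn with pandas' parallel_coordinates helper; the five-variable data below are synthetic and assumed purely to illustrate the two cluster structures described in the caption.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    # Assumed synthetic data: cluster structure in x1-x3 for group A, in x4-x5 for group B.
    rng = np.random.default_rng(2)
    n = 50
    df = pd.DataFrame(rng.normal(size=(2 * n, 5)), columns=[f"x{i}" for i in range(1, 6)])
    df.loc[:n - 1, ["x1", "x2", "x3"]] += 3
    df.loc[n:, ["x4", "x5"]] += 3
    df["cluster"] = ["A"] * n + ["B"] * n

    parallel_coordinates(df, class_column="cluster")
    plt.show()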

The data on the left are connected by a minimum spanning tree on the right. To perform a cluster analysis with this tree, one simply removes the two longest edges to produce three tight clusters.
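A minimal sketch of this procedure using SciPy (the data and the library choice are assumptions): build the minimum spanning tree over pairwise distances, drop the two longest edges, and read the clusters off the connected components.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    # Assumed synthetic data with three well-separated groups.
    rng = np.random.default_rng(3)
    centers = ([0, 0], [5, 0], [2.5, 4])
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in centers])

    # Minimum spanning tree over pairwise Euclidean distances.
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).toarray()

    # Remove the two longest edges; each remaining connected component is a cluster.
    edges = np.argwhere(mst > 0)
    longest = edges[np.argsort(mst[mst > 0])[-2:]]
    for i, j in longest:
        mst[i, j] = 0
    n_clusters, labels = connected_components(mst, directed=False)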

A simple cluster tree. Two observations are the first to be joined, and a third joins that cluster at the same time that the remaining two observations are joined. Ultimately, all five observations merge into a single cluster.
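Cluster trees like this are produced by agglomerative hierarchical clustering; the sketch below uses SciPy's linkage and dendrogram on five assumed observations, since the figure's actual values are not given.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # Five assumed two-dimensional observations standing in for the figure's data.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [0.3, 0.0], [2.0, 2.0], [2.1, 2.2]])

    # Single-linkage agglomerative clustering; Z encodes the cluster tree.
    Z = linkage(X, method="single")
    dendrogram(Z)
    plt.show()

    # Cutting the tree at a chosen height yields flat cluster labels.
    labels = fcluster(Z, t=1.0, criterion="distance")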

This classification tree assigns patients to low risk (F) or high risk (G) of myocardial infarction.
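A classification tree of this form can be fitted as sketched below with scikit-learn; the variable names, thresholds, and synthetic risk rule are stand-ins, not the article's myocardial infarction data.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical predictors and a synthetic risk rule, purely for illustration.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 3))                    # e.g. age, blood pressure, heart rate
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # 1 = high risk, 0 = low risk

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=["age", "bp", "hr"]))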

A tensor spline basis function, or hockey-stick function, as used in MARS (multivariate adaptive regression splines).
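For reference, MARS builds its basis from hinge (hockey-stick) functions and their products; with notation assumed here, a univariate hinge is

    (x_j - t)_{+} = \max(0, \; x_j - t)

and a tensor (two-way) basis function is the product (x_j - t)_{+} (x_k - s)_{+} of hinges in two different variables, which is the kind of surface the figure displays.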
