How to cite this WIREs title:
WIREs Comp Stat

Coping with high dimensionality in massive datasets

Abstract

A massive dataset is characterized by its size and complexity. In its most basic form, such a dataset can be represented as a collection of n observations on p variables. Aggravation or even impasse can result if either number is huge. The more difficult challenge is usually associated with the case of very high dimensionality or ‘big p’. There is a fast growing literature on how to handle such challenges, but most of it is in a supervised learning context involving a specific objective function, as in regression or classification. Much less is known about effective strategies for more exploratory data analytic activities. The purpose of this article is to put into historical perspective much of the recent research on dimensionality reduction and variable selection in such problems. Examples of applications that have stimulated this research are discussed along with a sampling of the latest methodologies to illustrate the onslaught of creative ideas that have surfaced. From a practitioner's perspective, the most effective strategy may be to emphasize the role of interdisciplinary teamwork with decisions on how best to grapple with high dimensionality emerging from a mixture of statistical thinking and consideration of the circumstances of the application.

WIREs Comp Stat 2011, 3:95–103. DOI: 10.1002/wics.141

This article is categorized under:
Statistical Learning and Exploratory Methods of the Data Sciences > Exploratory Data Analysis
Data: Types and Structure > Massive Data
Statistical and Graphical Methods of Data Analysis > Multivariate Analysis

Smoothed numbers of papers with titles that include one of the phrases shown in the inset, 1990–2008.

Counts of papers by subject area, 1990 to April 29, 2010. Some papers may be counted twice.

Smoothed numbers of papers with titles that include one of the phrases shown in the two color‐coded insets, 1990–2008. Red dots are for data analysis and data mining.

Related Articles

Principal component analysis
