How to cite this WIREs title:
WIREs Comp Stat

Nonparametric density estimation for high‐dimensional data—Algorithms and applications


Density estimation is one of the central areas of statistics; its purpose is to estimate the probability density function underlying the observed data. It serves as a building block for many tasks in statistical inference, visualization, and machine learning, and it is widely used in unsupervised learning, especially for clustering. As big data become pervasive in almost every area of data science, analyzing high‐dimensional data with many features and variables has become a major focus in both academia and industry. High‐dimensional data pose challenges not only for the theory of statistical inference but also for the algorithmic and computational side of machine learning and data analytics. This paper reviews a selection of nonparametric density estimation algorithms for high‐dimensional data, some of which were published recently and provide interesting mathematical insights. An important application domain of nonparametric density estimation, modal clustering, is also covered. The authors suggest several research directions related to density estimation and high‐dimensional data analysis.

This article is categorized under:
Statistical and Graphical Methods of Data Analysis > Multivariate Analysis
Statistical and Graphical Methods of Data Analysis > Density Estimation
Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification
Example of four modes but only three data points (shown in red at vertices of an equilateral triangle) in two dimensions. The surface is very flat, which is highlighted by the blue contours around the origin. See the text for further details
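The four-mode configuration can be checked numerically. The following is a minimal sketch, assuming an equal-weight Gaussian kernel estimate, vertices at unit distance from the origin, and a bandwidth of h = 0.72; none of these values are given in the caption, and the helper names are hypothetical. Under these assumptions the origin is a local maximum only when h² > 1/2, and the modes near the vertices appear to persist only over a narrow range of bandwidths above that threshold, which is consistent with the very flat surface described.

```python
import math

H = 0.72  # assumed bandwidth; with unit-distance vertices the origin is a mode only if H**2 > 0.5
# Assumed data: vertices of an equilateral triangle at unit distance from the origin.
ANGLES = [2 * math.pi * k / 3 for k in range(3)]
POINTS = [(math.cos(a), math.sin(a)) for a in ANGLES]

def density(x, y, h=H):
    """Equal-weight Gaussian kernel density estimate at (x, y)."""
    norm = len(POINTS) * 2 * math.pi * h * h
    return sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * h * h))
        for px, py in POINTS
    ) / norm

def is_local_max(x, y, eps=1e-3, n_dirs=24):
    """True if the density at (x, y) exceeds every value on a small circle around it."""
    f0 = density(x, y)
    return all(
        f0 > density(x + eps * math.cos(t), y + eps * math.sin(t))
        for t in (2 * math.pi * j / n_dirs for j in range(n_dirs))
    )

def mode_radius_on_ray(angle, lo=0.3, hi=1.0, iters=100):
    """Ternary search for the density maximum along the ray from the origin toward a vertex."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        f1 = density(m1 * math.cos(angle), m1 * math.sin(angle))
        f2 = density(m2 * math.cos(angle), m2 * math.sin(angle))
        if f1 < f2:
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Candidate modes: the origin plus one point on each ray toward a vertex.
candidates = [(0.0, 0.0)]
for a in ANGLES:
    r = mode_radius_on_ray(a)
    candidates.append((r * math.cos(a), r * math.sin(a)))

modes = [p for p in candidates if is_local_max(*p)]
print(len(modes), "local maxima found among", len(candidates), "candidates")
```

The check only tests the four candidate locations suggested by symmetry; it does not search the whole plane, so it confirms the presence of these modes rather than the absence of others.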
Illustration of the mode tree (left) and the dendrogram clustering tree (right) of the geyser eruption times dataset. Note that the dendrogram (right) is created by hierarchical clustering based on the average linkage between modes
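A mode tree tracks how the modes of a kernel estimate change with the bandwidth. As a minimal sketch, assuming a univariate Gaussian kernel and a small made-up sample (not the geyser data), the following counts the modes of the estimate on a grid for a few bandwidths. The function names and the grid settings are hypothetical.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    c = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def find_modes(data, h, n_grid=2000):
    """Return the grid locations where the estimate has a local maximum."""
    lo, hi = min(data) - 1.0, max(data) + 1.0
    xs = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    fs = [kde(x, data, h) for x in xs]
    return [
        xs[i]
        for i in range(1, n_grid - 1)
        if fs[i] > fs[i - 1] and fs[i] >= fs[i + 1]
    ]

# Hypothetical sample with two well-separated groups.
sample = [1.8, 1.9, 2.0, 2.1, 4.3, 4.4, 4.5, 4.6, 4.7]

# One row of a mode tree per bandwidth: the bandwidth and the mode locations.
for h in (0.02, 0.3, 2.0):
    modes = find_modes(sample, h)
    print(h, len(modes), [round(m, 2) for m in modes])
```

For this sample the mode count drops as the bandwidth grows, from one mode per observation at the smallest bandwidth to a single mode at the largest. Plotting the mode locations against the bandwidth would give the traces of a mode tree.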
Sequential importance sampling (SIS) to generate weighted samples of partition paths. Here, four partition samples are illustrated, and their corresponding weights are w1, w2, w3, w4, respectively (Wong, )
Recursive sequential binary partitioning, where t = 1, 2, 3, 4 represent the level of partition, and the partition is performed sequentially. At each level, there are a variety of different ways to perform binary partition (Wong, )
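A partition-based density estimate of this kind can be sketched as follows. This is a simplified illustration, not the sampling scheme cited in the caption: it assumes data in the unit square, midpoint cuts only, and a greedy rule that keeps the cut with the highest log-likelihood at each level. All function names and the simulated data are hypothetical.

```python
import math
import random

def count(box, data):
    """Number of points inside a half-open box given as [(lo, hi), ...]."""
    return sum(all(lo <= v < hi for v, (lo, hi) in zip(p, box)) for p in data)

def volume(box):
    return math.prod(hi - lo for lo, hi in box)

def log_likelihood(boxes, data):
    """Log-likelihood of the piecewise-constant density n_j / (n * volume_j)."""
    n = len(data)
    total = 0.0
    for box in boxes:
        n_j = count(box, data)
        if n_j > 0:  # empty boxes contribute nothing
            total += n_j * math.log(n_j / (n * volume(box)))
    return total

def halve(box, axis):
    """Cut a box at the midpoint of one axis and return both halves."""
    lo, hi = box[axis]
    mid = (lo + hi) / 2
    left, right = list(box), list(box)
    left[axis], right[axis] = (lo, mid), (mid, hi)
    return left, right

def greedy_partition(data, levels):
    """Apply one binary cut per level, keeping the best-scoring cut each time."""
    dim = len(data[0])
    boxes = [[(0.0, 1.0)] * dim]
    for _ in range(levels):
        best = None
        for i, box in enumerate(boxes):
            for axis in range(dim):
                trial = boxes[:i] + boxes[i + 1:] + list(halve(box, axis))
                score = log_likelihood(trial, data)
                if best is None or score > best[0]:
                    best = (score, trial)
        boxes = best[1]
    return boxes

rng = random.Random(0)
# Hypothetical data: a tight cluster plus uniform background in [0, 1)^2.
data = [(rng.uniform(0.1, 0.3), rng.uniform(0.6, 0.8)) for _ in range(150)]
data += [(rng.random(), rng.random()) for _ in range(50)]

boxes = greedy_partition(data, levels=4)
for box in boxes:
    print(box, round(count(box, data) / (len(data) * volume(box)), 3))
```

Each level adds one region, so four levels yield five boxes, and the estimated density on each box is its share of the points divided by its volume. Replacing the greedy rule with weighted random choices of cuts would move the sketch closer to a sampling-based approach.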
Real images and images generated by masked autoregressive flow (MAF) for the Modified National Institute of Standards and Technology (MNIST) dataset. (a) Real images from the MNIST dataset; (b) images generated by MAF from the MNIST dataset (Papamakarios et al., )
The architecture of a neural autoregressive density estimation (NADE) model. The input vector x is an N‐dimensional binary vector; units with value 0 are shown in black and units with value 1 in white. The N input units represent the N dimensions of the vector xo. Each conditional probability p(x_d = 1 ∣ x_{s<d}) is modeled by a single‐layer feed‐forward neural network, so there are N hidden layers for the N conditional probabilities, where h_d denotes the dth hidden layer (d = 1, …, N). The output of each hidden layer is calculated via Equation 9. In this example, the output vector has dimensions i = 1, …, N, each being the output of the corresponding hidden layer h_i. Each input unit connects to the hidden layers through a weight‐sharing scheme, which is highlighted in the figure by matching colors (Uria et al., )
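The weight-sharing scheme can be made concrete with a small sketch. It assumes the standard NADE parameterization for binary data, in which the dth hidden layer is sigmoid(c + W[:, <d] x[<d]) and the dth conditional is sigmoid(b[d] + V[d] · h_d); this may differ in notation from the Equation 9 referenced in the caption. The parameter names W, V, b, c and the random parameter values are assumptions for illustration only.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nade_log_prob(x, W, V, b, c):
    """log p(x) as the sum over d of log p(x_d | earlier inputs), for binary x.

    W: n_hidden x N input-to-hidden weights, shared by all hidden layers
    V: N x n_hidden hidden-to-output weights; b, c: output and hidden biases
    """
    n_hidden = len(c)
    a = list(c)  # running pre-activation: c plus the columns of W for inputs seen so far
    log_p = 0.0
    for d, x_d in enumerate(x):
        h = [sigmoid(a_k) for a_k in a]  # dth hidden layer
        p_one = sigmoid(b[d] + sum(V[d][k] * h[k] for k in range(n_hidden)))
        log_p += math.log(p_one if x_d == 1 else 1.0 - p_one)
        for k in range(n_hidden):  # weight sharing: reuse column d of W for later layers
            a[k] += W[k][d] * x_d
    return log_p

rng = random.Random(0)
N, n_hidden = 4, 3  # hypothetical sizes
W = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(n_hidden)]
V = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(N)]
b = [rng.gauss(0, 1) for _ in range(N)]
c = [rng.gauss(0, 1) for _ in range(n_hidden)]

# The conditionals define a valid distribution, so the probabilities of all
# 2**N binary vectors should sum to 1 up to rounding.
total = sum(
    math.exp(nade_log_prob(x, W, V, b, c))
    for x in itertools.product([0, 1], repeat=N)
)
print(total)
```

Because the running pre-activation is updated one input at a time, all N hidden layers are computed in a single pass over x rather than by N separate matrix products.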
Bivariate scatter diagram and bivariate mode tree for the lagged geyser dataset. Kernel estimates were computed for 201 choices of the logarithm of the bandwidth h (scaled to (0, 1)), and the sample modes were located. The data have three obvious clusters, which are visible in the scatter diagram and as the three long modal traces in the right frame
