How to cite this WIREs title:
WIREs Data Mining Knowl Discov
Impact Factor: 1.939

There and back again: Outlier detection between statistical reasoning and data mining algorithms

Outlier detection has been a topic in statistics for centuries. Mainly over the last two decades, the database and data mining community has also shown increasing interest in developing scalable methods for outlier detection. Although initially based on statistical reasoning, these methods soon lost the direct probabilistic interpretability of the derived outlier scores. Here, from a joint point of view of data mining and statistics, we detail the roots and the path of development of statistical outlier detection and of database‐related data mining methods for outlier detection. We discuss their inherent meaning, review approaches to recovering a statistically meaningful interpretation of outlier scores, and sketch related current research topics.

This article is categorized under:

  • Algorithmic Development > Statistics
  • Algorithmic Development > Scalable Statistical Methods
  • Technologies > Machine Learning
The histogram (blue) of human gestation periods (based on 13,634 cases, as reported by Barnett) with a fitted normal distribution (green), describing the hypothesis of the judges, and the alternative distribution (red) according to Mr. Hadlum's conjecture, which assumes not an unusual value of the normal distribution but a different distribution shifted by around 10 weeks, that is, a later start of the gestation period. This assumption of a totally different “generating mechanism” accommodates the alleged outlier perfectly.
A distribution model (green density contours) computed for the inliers (green points) reveals the outlier (red point) as far off. If, however, the outlier is taken into account when fitting the distribution model to the data (red density contours), the outlier itself might be well covered by the model (it is masked), while some inlier might now appear to be too far off (the lower right inlier is swamped).
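The masking effect described in this caption can be illustrated with a minimal sketch. It assumes a Gaussian model scored by Mahalanobis distance, and the toy data (a grid of inliers and one distant point) and the helper name `mahalanobis` are hypothetical, not taken from the article.

```python
import numpy as np


def mahalanobis(x, data):
    """Mahalanobis distance of x from a Gaussian fitted to data (one point per row)."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)  # sample covariance (ddof=1)
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


# Hypothetical toy data: a 5x5 grid of inliers and one distant point.
ticks = np.linspace(-1.0, 1.0, 5)
inliers = np.array([(a, b) for a in ticks for b in ticks])
outlier = np.array([4.0, 4.0])

# Model fitted to the inliers only: the outlier lies far from the model.
d_clean = mahalanobis(outlier, inliers)

# Model fitted to all points: the outlier pulls the mean towards itself and
# inflates the covariance, so its own distance shrinks (it is partly masked).
d_contaminated = mahalanobis(outlier, np.vstack([inliers, outlier]))

print(f"fit on inliers only: {d_clean:.2f}")
print(f"fit on all points:   {d_contaminated:.2f}")
```

In this toy setting the distance of the suspicious point drops substantially once it takes part in the fit. This is one reason why robust estimators of location and scatter are commonly preferred when a distribution model is used for outlier detection.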
A simple outlier detection scenario: There is a maximum level of true positives that any outlier detection method can reach, depending on the overlap of the outlier distribution (here the blue uniform distribution) and the inlier distribution (here the red Gaussian distribution).
Is some data point an outlier or is the model wrong? (a) Linear model and some outlier and (b) more points and adapted (more complex) model