
Core‐sets: An updated survey

Abstract

In optimization or machine learning problems we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems the input is a set of points in some metric space, and a common goal is to compute a set of centers in some other space (points, lines) that will minimize the sum of distances to these points. In database queries, we may need to compute such a sum for a specific query set of k centers. However, traditional algorithms cannot handle modern systems that require parallel, real-time computations on infinite distributed streams from sensors such as GPS, audio, or video that arrive at a cloud, or on networks of weaker devices such as smartphones or robots. A core-set is a “small data” summarization of the input “big data,” where every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset maintenance for streaming, distributed, and dynamic data. Traditional algorithms can then be applied to these coresets to maintain the approximated optimal solutions. The challenge is to design coresets with a provable tradeoff between their size and approximation error. This survey summarizes such constructions in a retrospective way that aims to unify and simplify the state of the art.

This article is categorized under:
- Algorithmic Development > Structure Discovery
- Fundamental Concepts of Data and Knowledge > Big Data Mining
- Technologies > Machine Learning
- Algorithmic Development > Scalable Statistical Methods
(Left) A function (algorithm) f gets a point set and outputs its clustering into linear segments. Applying f to the coreset of the data would ideally yield the same result faster, using fewer resources, without changing f. (Middle) An ε-coreset (in red) for 1-center queries far(P, q) = max_{p ∈ P} ‖p − q‖_2 of a set P of (blue) points and a query q, both on the plane, where OPT is the minimum cost. (Right) In k-center clustering the query is replaced by a set Q of k centers, far(P, Q) = max_{p ∈ P} min_{q ∈ Q} ‖p − q‖_2, and a grid is constructed around each of the optimal centers
(Left) For every pair of lines, one in blue and one in red, there is a center c (in green) and a weight w ≥ 0 such that the distance from every point p on one line to the other line is the same as its distance to c multiplied by w. (Middle) An arbitrary input point c₁ is the first center and coreset point. Its farthest input point p₂ is the second coreset point, and c₂ is the closest point to the origin on the segment between them. (Right) The next coreset point p₃ is the farthest input point from c₂, and c₃ is the new closest point to the origin. After i = 1/ε iterations we have errorᵢ = ‖cᵢ‖ ≤ ε
(Left) A robust approximation to the optimal clustering (the stars) is computed on a small εn‐sample. About half of the closest points in the full set are then removed. (Middle) The process continues recursively on the remaining half, and a new robust approximation (the new stars) is produced, until no points are left. (Right) The output (α, β)‐approximation is the union of the robust estimators from the O(log n) iterations