Thinking by classes in data science: the symbolic data analysis paradigm

Data science, considered as a science in its own right, is, in general terms, the extraction of knowledge from data. Symbolic data analysis (SDA) offers a new way of thinking in data science by extending the standard input to a set of classes of individual entities. Classes of a given population are thus treated as the units of a higher-level population to be studied, and such classes often represent the real units of interest. To take the variability among the members of each class into account, classes are described by intervals, distributions, sets of categories or numbers (sometimes weighted), and the like. This yields new kinds of data, called ‘symbolic’ because they cannot be reduced to numbers without losing much information. The first step in SDA is to build the symbolic data table, in which the rows are classes and the variables can take symbolic values. The second step is to study these new kinds of data and extract new knowledge from them, at the least by extending computer statistics and data mining to symbolic data. SDA is a new paradigm that opens up a vast domain of research and applications by giving results complementary to those of classical methods applied to standard data. SDA also addresses big data and complex data challenges: big data can be reduced and summarized by classes, and complex data with multiple unstructured data tables and unpaired variables can be transformed into a structured data table with paired symbolic‐valued variables. WIREs Comput Stat 2016, 8:172–205. doi: 10.1002/wics.1384
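
The first step described in the abstract, building a symbolic data table from individual-level data, can be illustrated with a minimal sketch. It is not the authors' implementation or the API of any SDA software. The player records, the variable names, and the function `build_symbolic_table` are all hypothetical, and the choice of a min-max interval for numeric variables and relative frequencies for categorical ones is only one of several possible symbolic descriptions.

```python
from collections import Counter, defaultdict

# Hypothetical ground data: one record per individual (a player),
# with a class variable ("team") and three standard variables.
players = [
    {"team": "A", "age": 24, "weight": 78.0, "position": "forward"},
    {"team": "A", "age": 31, "weight": 85.5, "position": "defender"},
    {"team": "A", "age": 27, "weight": 80.0, "position": "forward"},
    {"team": "B", "age": 22, "weight": 70.0, "position": "goalkeeper"},
    {"team": "B", "age": 29, "weight": 74.5, "position": "defender"},
]


def build_symbolic_table(rows, class_key, interval_vars, categorical_vars):
    """Aggregate individuals into classes described by symbolic values.

    Numeric variables become (min, max) intervals; categorical variables
    become relative-frequency distributions over the observed categories.
    """
    # Group the individuals by their class label.
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_key]].append(row)

    table = {}
    for label, members in groups.items():
        description = {"size": len(members)}
        for var in interval_vars:
            values = [m[var] for m in members]
            description[var] = (min(values), max(values))
        for var in categorical_vars:
            counts = Counter(m[var] for m in members)
            description[var] = {c: n / len(members) for c, n in counts.items()}
        table[label] = description
    return table


symbolic_table = build_symbolic_table(
    players, "team", interval_vars=["age", "weight"], categorical_vars=["position"]
)

for team, description in symbolic_table.items():
    print(team, description)
```

Each row of the result is a class (here a team) rather than an individual, and each cell keeps part of the within-class variability that a single mean would discard.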

From a standard data table (X, Y) describing a set of individuals X by a set of standard variables Y, to a symbolic data table (X′, Y′) describing a set of teams X′ by a set of symbolic variables Y′.
An example of RSDA package output for a principal component analysis of interval‐valued variables.
A NETSYR output of a PCA extended to symbolic data.
A symbolic data table provided by the SYR software.
Output of some symbolic data analysis tools.
From a relational database to symbolic data.
Building a symbolic data table from several ground populations described by different sets of variables and a unique class variable.
The biplot of histogram‐valued variables, which requires copula models.
The ground data table where seven individuals are described by three binary variables.
The first cell of this table means that if y1y2z1z2 and Y has better explanatory power than Z, it also has better discriminatory power than Z.
The explanatory power of Y is much higher than that of Z, and the discriminatory power of Z′ is higher than that of Y′.
The tables X/U and U/X.
Graphical representation of the variability inside symbolic data by four numeric and two symbolic variables.
Individuals are uniformly distributed inside the circle. Therefore, there is no correlation between Y1 and Y2.

Browse by Topic

Statistical Methods > Statistical Theory and Applications
Data Mining > Exploratory Data Analysis
Data Mining > Clustering and Classification
