How to cite this WIREs title:
WIREs Data Mining Knowl Discov
Impact Factor: 2.111

Phylogenetic networks: a new form of multivariate data summary for data mining and exploratory data analysis


Exploratory data analysis (EDA), involving both graphical displays and numerical summaries of data, is intended to evaluate the characteristics of the data as well as to provide a form of data mining. For multivariate data, the best‐known visual summaries include discriminant analysis, ordination, and clustering, particularly metric ordinations such as principal components analysis. However, these techniques have limiting mathematical assumptions that are not always realistic. Recently, network techniques have been developed in the biological field of phylogenetics that address some of these limitations. They are now widely used in biology under the name phylogenetic networks, but they are actually of general applicability to any multivariate dataset. Phylogenetic networks are fast and relatively easy to calculate, which makes them ideal as a tool for EDA. This review provides an overview of the field, with particular reference to the use of what are called splits graphs. There are several types of splits graph, which summarize the multivariate data in different ways. Example analyses are presented based on the neighbor‐net graph, which seems to be the most generally useful of the available algorithms. This should encourage the more widespread use of these networks whenever a summary of a multivariate dataset is required.

A phylogenetic network of the text of the third sentence of the Bible (Genesis 1:3) as it appears in various published editions. The network was constructed using the reduced‐median network algorithm (based on r = 2). There are five versions of the Bible superimposed at label (Webster's Bible 1833, English Revised Version 1885, American Standard Version 1901, King James Version 1611, Blayney Revision 1769), and two are superimposed at label (New King James Version 1982, New American Standard Bible 1971). The data were collated from various sources on the Internet.
A phylogenetic network of the acoustic characteristics of selected concert halls and opera houses. The network was constructed using the neighbor‐net algorithm, with the Manhattan distance. The top 11 ranked performance halls are highlighted in red (and their rank is shown), and the other halls in the same network neighborhood are marked in purple. (The data are available from Ref .)
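As a rough illustration of the input that a distance-based method such as neighbor-net works from, the sketch below builds a pairwise Manhattan distance matrix from numeric feature vectors. The hall names and measurements are hypothetical placeholders, not the acoustic data from the figure, and the helper names `manhattan` and `distance_matrix` are assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence

def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))

def distance_matrix(
    data: Dict[str, Sequence[float]],
    metric: Callable[[Sequence[float], Sequence[float]], float],
) -> List[List[float]]:
    """Symmetric matrix of pairwise distances, in the insertion order of `data`."""
    names = list(data)
    return [[metric(data[a], data[b]) for b in names] for a in names]

# Hypothetical measurements (invented for illustration only).
halls = {
    "Hall A": [1.0, 2.0, 3.0],
    "Hall B": [2.0, 0.0, 3.0],
    "Hall C": [1.0, 2.0, 5.0],
}

matrix = distance_matrix(halls, manhattan)
for name, row in zip(halls, matrix):
    print(name, row)
```

When the variables are measured on different scales, the Manhattan distance is dominated by the variables with the largest numeric ranges, so some form of standardization beforehand is usually worth considering.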
A phylogenetic network of the results from the FIFA World Cup soccer competition, 1930–2010. The network was constructed using the neighbor‐net algorithm, with the distance measured as the Steinhaus dissimilarity, which ignores the so‐called negative matches. The countries are color‐coded by neighborhood within the network. (The data are available from Ref .)
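The sketch below shows what "ignoring negative matches" means in practice. It uses the common definition of the Steinhaus dissimilarity for non-negative data, one minus twice the sum of the pairwise minima divided by the sum of all values. How the World Cup results were encoded is not stated here, so the vectors are invented and the function name `steinhaus_dissimilarity` is an assumption for this example.

```python
from typing import Sequence

def steinhaus_dissimilarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Steinhaus dissimilarity for non-negative vectors.

    Computed as 1 - 2*W / (A + B), where W is the sum of element-wise
    minima and A, B are the totals of each vector.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(x) + sum(y)
    if total == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return 1.0 - 2.0 * shared / total

# Invented example vectors.
x = [1, 0, 2]
y = [0, 0, 2]
d_short = steinhaus_dissimilarity(x, y)

# Appending variables where both objects score zero (negative matches)
# leaves the dissimilarity unchanged.
d_padded = steinhaus_dissimilarity(x + [0, 0, 0], y + [0, 0, 0])
print(d_short, d_padded)
```

A measure such as the Manhattan distance divided by the number of variables would shrink as joint zeros are added, which is the behavior that this kind of coefficient avoids.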
A phylogenetic network of the amino acid sequences of type‐I interferon of various mammal species. The nine different recognized subtypes are labeled (with Greek letters), although the individual samples are not labeled, and the relevant part of the network is color‐coded for each subtype. The network was constructed using the neighbor‐net algorithm, with the Hamming distance. (The data are available from Ref .)
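For categorical data such as aligned sequences, the Hamming distance simply counts the positions at which two objects differ. The following is a minimal sketch with made-up sequence fragments, not the interferon data from the figure; it assumes the sequences are already aligned to equal length.

```python
from typing import Dict, List

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(ca != cb for ca, cb in zip(a, b))

def hamming_matrix(seqs: Dict[str, str]) -> List[List[int]]:
    """Pairwise Hamming distances, in the insertion order of `seqs`."""
    names = list(seqs)
    return [[hamming(seqs[p], seqs[q]) for q in names] for p in names]

# Invented aligned fragments (illustration only).
seqs = {
    "sample_1": "ACDEFG",
    "sample_2": "ACDEFH",
    "sample_3": "ACQEYH",
}

for name, row in zip(seqs, hamming_matrix(seqs)):
    print(name, row)
```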
Phylogenetic network of the chart positions of five Simon & Garfunkel albums in eight countries. (a) The neighbor‐net analysis based on the Manhattan distance. (b) The split separating Sweden from the other countries. (c) The split (in red) separating Japan + Netherlands from the other countries. (d) The split (in red) separating Netherlands + New Zealand + France from the other countries. (e) The network distance (in bold) separating New Zealand from the United Kingdom. (The data are available from Ref .)
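A split is a partition of the objects into two groups, and in a splits graph the distance between two objects can be read as the sum of the weights of the splits that separate them. The sketch below encodes the three splits named in the caption and computes such a distance. The split weights are invented for illustration and only the six countries named in the caption are included, so the numbers do not reproduce the figure.

```python
from typing import Dict, FrozenSet

# Only the countries named in the caption; the full dataset has eight.
taxa = frozenset(
    {"Sweden", "Japan", "Netherlands", "New Zealand", "France", "United Kingdom"}
)

# Each split is stored by one of its two sides; the other side is the
# complement within `taxa`. The weights are invented placeholders.
splits: Dict[FrozenSet[str], float] = {
    frozenset({"Sweden"}): 1.0,
    frozenset({"Japan", "Netherlands"}): 2.0,
    frozenset({"Netherlands", "New Zealand", "France"}): 0.5,
}

def separates(side: FrozenSet[str], a: str, b: str) -> bool:
    """True if the split places a and b on opposite sides."""
    return (a in side) != (b in side)

def network_distance(a: str, b: str, splits: Dict[FrozenSet[str], float]) -> float:
    """Sum of the weights of all splits that separate a from b."""
    return sum(w for side, w in splits.items() if separates(side, a, b))

print(network_distance("New Zealand", "United Kingdom", splits))
print(network_distance("Sweden", "Netherlands", splits))
```

The second and third splits overlap in the Netherlands without either being nested in the other, so they cannot both appear in a single tree. This kind of conflict is what a splits graph displays as a box of parallel edges.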
The effect of a single‐gradient dataset on multivariate data summaries. (a) The dataset, with 20 objects (Taxon 1–Taxon 20) and 24 characters, each of which has two possible states (A or C). (b) The principal components ordination of the data (not all of the objects are labeled). (c) The UPGMA hierarchical clustering of the data. (d) The median network analysis of the data. (e) The neighbor‐net analysis of the data.
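To make the single-gradient idea concrete, the sketch below builds a hypothetical dataset of the same shape (20 objects, 24 two-state characters) in which a block of state C slides one position per object, and then applies a plain UPGMA (average-linkage) clustering to its Hamming distances. The sliding-window construction is an assumption, since the actual dataset appears only in the figure, and `make_gradient` and `upgma` are illustrative names.

```python
from typing import List, Tuple

def make_gradient(n_taxa: int = 20, n_chars: int = 24, window: int = 5) -> List[str]:
    """Hypothetical gradient: taxon i has state C at positions i..i+window-1."""
    return [
        "".join("C" if i <= k < i + window else "A" for k in range(n_chars))
        for i in range(n_taxa)
    ]

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def upgma(dist: List[List[float]], labels: List[str]) -> List[Tuple[List[str], float]]:
    """Naive UPGMA: repeatedly merge the two clusters with the smallest
    mean pairwise distance. Returns (members, merge distance) per merge."""
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([labels[i] for i in clusters[a] + clusters[b]], d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

rows = make_gradient()
labels = [f"Taxon {i + 1}" for i in range(len(rows))]
dist = [[hamming(a, b) for b in rows] for a in rows]
merges = upgma(dist, labels)

for members, height in merges[:5]:
    print(f"{height:5.2f}  {members}")
```

In this construction every adjacent pair of objects is equally distant, so the groups that the clustering reports depend on how ties are broken rather than on any real discontinuity in the data.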
A phylogenetic network of various sensory characteristics of single‐malt Scotch whiskies. The network was constructed using the neighbor‐net algorithm, with the weighted Bray‐Curtis similarity, which ignores so‐called negative matches. The label colors represent different geographical regions within Scotland. (The data are available from Ref .)
Two phylogenetic networks of the same dataset concerning the genotypes of three species. There are three observed nodes (labeled by the species name) in both cases (filled circles), but five and one inferred nodes (open circles), respectively. The edges represent genetic similarity between the species; the numbers count the observed character differences between them. (The data are available from Ref .)
A phylogenetic network of the genetic relationships (measured using microsatellite data) among 255 populations of humans and chimpanzees. The edges leading to the leaf nodes (which are unlabeled) are color‐coded by their source. The network was constructed using the neighbor‐net algorithm, with the similarity measured as the proportion of shared alleles. (The data are available from Ref .)
A phylogenetic network of morphological features of Thai Buddha statues. The network was constructed using the neighbor‐net algorithm, with the Hamming distance. The labels refer to the museum catalog numbers of the statues, and the colors represent different culture‐historical groupings. (The data are available from Ref .)

Browse by Topic

Algorithmic Development > Biological Data Mining
Application Areas > Data Mining Software Tools
Technologies > Structure Discovery and Clustering
Technologies > Visualization
