How to cite this WIREs title:
WIREs Comput Mol Sci
Impact Factor: 8.836

Chemoinformatics applications of cluster analysis


Chemoinformatics applications of cluster analysis over the past 35 years include chemical diversity for compound acquisition, analysis of HTS results for lead discovery, 2D and 3D chemical similarity searching for virtual screening, and hypothesis generation for lead hopping using molecular shape and pharmacophore descriptors. These applications still account for the majority of cluster analysis usage, but ever greater computational resources have allowed researchers to tackle applications of ever increasing scale and complexity. In the past few years, a far broader array of clustering methods has come into use: some entirely new, some common to other disciplines, and others adapted to specific chemoinformatic applications. The chemoinformatic applications have also broadened to include more of the biological information commonly associated with bioinformatics. Indeed, clustering techniques commonly found in bioinformatics, such as coclustering or self‐organizing trees, are beginning to find use in chemoinformatics. Issues such as visualization and validation of clustering results continue to present challenging problems, especially given that the scale of many problems now attempted has increased enormously. Some new validation techniques introduced in the chemoinformatics literature now both allow a better understanding of the clustering results and help point to methods of greater efficacy. Effective validation and visualization of clustering results for large data sets have proven to be more problematic. WIREs Comput Mol Sci 2014, 4:34–48. doi: 10.1002/wcms.1152

This article is categorized under:

  • Computer and Information Science > Chemoinformatics
Voronoi diagram of a clustering of dimension‐reduced data, where the original dimension is N‐1, as obtained from the N×N dissimilarity matrix computed from N binary fingerprints with the BUB measure. Clusters (regions bounded by dotted lines) and cluster centroids (stars) are found with the K‐means algorithm.
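The caption does not expand "BUB". As a minimal sketch, assuming it denotes the Baroni‐Urbani/Buser coefficient, the N×N dissimilarity matrix could be built from binary fingerprints roughly as follows. The function names and the toy fingerprints are illustrative only, and the subsequent dimension reduction and K‐means steps are not shown.

```python
from math import sqrt


def bub_similarity(x, y):
    """Baroni-Urbani/Buser similarity of two equal-length binary fingerprints.

    a = bits set in both, b and c = bits set in only one, d = bits set in neither.
    S = (sqrt(a*d) + a) / (sqrt(a*d) + a + b + c)
    """
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if q and not p)
    d = len(x) - a - b - c
    root = sqrt(a * d)
    denom = root + a + b + c
    # Assumption: two all-zero fingerprints are treated as identical.
    return (root + a) / denom if denom else 1.0


def bub_dissimilarity_matrix(fingerprints):
    """Symmetric N x N matrix of 1 - S for N fingerprints."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = 1.0 - bub_similarity(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Toy 4-bit fingerprints (made up for illustration).
toy_fingerprints = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
toy_matrix = bub_dissimilarity_matrix(toy_fingerprints)
```

Unlike the Tanimoto coefficient, this measure gives partial credit for bits that are off in both fingerprints, through the `sqrt(a * d)` term.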
Clustering of 91 compounds, represented by binary fingerprints and projected onto 2D with MDS, with the activity surface shown in 3D. Points represent compounds, where colored points in 2D correspond to colored points in 3D. 2D and 3D convex hulls represent four clusters.
Tree map visualization of a clustering showing additional class information (active and inactive). Rectangles are scaled to the size of the clusters, showing twelve clusters. Colors distinguish active and inactive classes within each cluster. Small subdivisions lose labeling accuracy, such that some group classes have to be inferred (see clusters 1, 3, 6, 5, and 11).
A cut (dotted blue line) through a dendrogram with hundreds of leaves creates a partition of 12 clusters. Visualizing the clusters with this many leaves or even knowing how large each cluster is becomes difficult without further tables and tools.
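The kind of supporting table the caption alludes to can be sketched in a few lines. The example below is a naive, illustrative agglomerative procedure (complete linkage is an assumption; the caption does not name the linkage) that merges clusters until k remain, which corresponds to cutting the dendrogram where it has k branches, and then tabulates the cluster sizes. Function names and data are hypothetical; for real data sets a library routine would be preferable to this cubic‐time loop.

```python
from collections import Counter


def cut_into_k(dist, k):
    """Merge clusters bottom-up (complete linkage) until k remain.

    dist is a symmetric matrix (list of lists) of pairwise dissimilarities.
    Returns one integer cluster label per item.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                link = max(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for member in members:
            labels[member] = label
    return labels


def cluster_sizes(labels):
    """Map each cluster label to the number of items it contains."""
    return Counter(labels)


# Toy one-dimensional data (made up); dissimilarity is absolute difference.
points = [0, 1, 10, 11, 12, 50]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_labels = cut_into_k(toy_dist, 3)
toy_sizes = cluster_sizes(toy_labels)
```

Printing `sorted(toy_sizes.items())` gives a compact size table that stays readable even when the dendrogram itself has too many leaves to inspect.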
Protein binding site of PDB number 11ba from the SCPDB. Alpha shapes (a generalization of a 3D convex hull, shown here as red facets) enclose the protein atoms of the binding site. The crystal ligand or conformations of other compounds can then be compared with the binding site void.
Schematic representation of the iterative nature of the application of cluster analysis.
Schematic representation of the decision‐making process for choosing the clustering method and the dimension reduction when applying cluster analysis.
Hierarchical clustering of the same data as in the Voronoi diagram figure, showing a cut of the dendrogram that partitions the data into nine clusters. The clusters found using K‐means on the dimension‐reduced data and those found in this figure are only modestly similar, because both the feature space and the measures used to compute the clusters differ between the two cases.
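The caption does not say how the similarity between the two clusterings was judged. One common way to quantify agreement between two partitions of the same items is the adjusted Rand index, sketched below as an illustration only; the function name and the toy labels are hypothetical and not taken from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items.

    1.0 means identical partitions (up to relabeling); values near 0 mean
    agreement no better than expected by chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items placed together in both, in A, and in B respectively.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Trivial case: both partitions are all-in-one or all singletons.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy labelings (made up): the same grouping under different label names,
# and a grouping that splits every pair.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which items share a cluster, it can compare a K‐means partition with a dendrogram cut even when the two use different numbers of clusters.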
