How to cite this WIREs title:
WIREs Comput Mol Sci
Impact Factor: 8.127

Similarity searching

Abstract

Similarity searching is one of the traditional and most widely applied approaches in chemical and pharmaceutical research to select compounds with desired properties from databases. The computational efficiency of many (but not all) similarity search techniques has further increased their popularity as compound databases began to rapidly grow in size. Different methods have been developed for small molecule similarity searching. However, foundations and intrinsic limitations of similarity searching are often not well understood, although a number of similarity methods are rather simplistic. Regardless of methodological details, all similarity search approaches depend on how molecular similarity is evaluated and quantified. In its essence, molecular similarity is a subjective concept and much dependent on how we represent and view molecular structures. Moreover, trying to understand the relationship between molecular similarity, however assessed, and structure‐dependent properties including, first and foremost, biological activity continues to be a challenging problem. Consequently, although similarity searching usually provides a quantitative readout and a ranking of compounds relative to chosen reference molecules, predicting structure–activity relationships on the basis of calculated similarity values often involves subjective criteria and chemical intuition. Thus, similarity searching is still far from being a routine application in database mining. In this review, we first discuss important principles underlying similarity searching, describe its tasks, and introduce major categories of search methods. Then, we focus on molecular fingerprints, the design and application of which can be regarded as a paradigm for the similarity search field.

© 2011 John Wiley & Sons, Ltd. WIREs Comput Mol Sci 2011, 1, 260‐282. DOI: 10.1002/wcms.23

This article is categorized under: Computer and Information Science > Databases and Expert Systems

The structures of hits from two different virtual screens using 2D fingerprints are shown on the right (Ref 113). On the left, the reference molecule most similar to each hit is displayed and the MACCS Tc similarity is reported.

The outcome of an exemplary similarity search trial for 40 active molecules and 2000 database decoys is shown. The recall of active compounds is monitored in different ways.
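One common way to monitor recall in such a trial is to rank all database compounds by similarity score and track the fraction of known actives recovered at each rank. The following is a minimal sketch of that idea only; the function names and the boolean label encoding are illustrative assumptions, not the evaluation protocol used in the article.

```python
def rank_labels(scores, labels):
    """Order activity labels by decreasing similarity score (best first)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def recall_curve(ranked_labels):
    """Cumulative recall of actives after each position of the ranking.

    ranked_labels: sequence of booleans, True for an active compound.
    """
    total_actives = sum(1 for is_active in ranked_labels if is_active)
    if total_actives == 0:
        raise ValueError("the ranking contains no active compounds")
    found = 0
    curve = []
    for is_active in ranked_labels:
        if is_active:
            found += 1
        curve.append(found / total_actives)
    return curve


# Toy ranking: two actives among four compounds (hypothetical data).
ranked = rank_labels([0.2, 0.9, 0.5, 0.7], [False, True, False, True])
curve = recall_curve(ranked)
```

Reading the curve at a fixed rank (for example, the top 1% of the database) gives the recall at that selection size.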

The construction of a hybrid fingerprint comprises two steps: fingerprint reduction and recombination.
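The two steps can be sketched in a few lines. The caption does not say how bit positions are chosen during reduction, so the sketch below simply assumes that a list of positions to keep is supplied for each source fingerprint; both function names are hypothetical.

```python
def reduce_fingerprint(bits, keep_positions):
    """Reduction: keep only the selected bit positions of one fingerprint."""
    return [bits[i] for i in keep_positions]


def recombine(*reduced_fingerprints):
    """Recombination: concatenate reduced fingerprints into one hybrid."""
    hybrid = []
    for reduced in reduced_fingerprints:
        hybrid.extend(reduced)
    return hybrid


# Hypothetical source fingerprints and selected positions.
fp_a = [1, 0, 1, 1, 0]
fp_b = [0, 1, 1, 0]
hybrid = recombine(
    reduce_fingerprint(fp_a, [0, 3]),
    reduce_fingerprint(fp_b, [1, 2, 3]),
)
```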

In n nearest neighbor (n‐NN) searching, the similarities between a database compound and its n‐NN reference compounds (yellow) are calculated and averaged.
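A minimal sketch of this scoring scheme follows, assuming fingerprints are stored as sets of on-bit positions and that the Tanimoto coefficient is the similarity measure. The function names, and the convention that two empty fingerprints score 0.0, are assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nn_score(db_fp, reference_fps, n):
    """Average similarity of a database compound to its n most similar
    reference compounds."""
    if not reference_fps or n < 1:
        raise ValueError("need at least one reference compound and n >= 1")
    similarities = sorted(
        (tanimoto(db_fp, ref) for ref in reference_fps), reverse=True
    )
    nearest = similarities[:n]
    return sum(nearest) / len(nearest)


# Hypothetical fingerprints: one database compound, three references.
database_fp = {1, 2, 3}
references = [{1, 2, 3}, {1, 2}, {9}]
score_1nn = nn_score(database_fp, references, 1)
score_2nn = nn_score(database_fp, references, 2)
```

Setting n to 1 reduces the scheme to scoring each database compound by its single most similar reference; setting n to the number of references averages over all of them.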

Two modified versions of the Tanimoto and Tversky coefficients, MTc(β) and wTv, are shown that are designed to balance or eliminate complexity effects (top). In addition, exemplary ‘hit rate landscapes’ of similarity calculations with wTv under systematic variation of the two variables α and β are shown (bottom). From the left to the right, the complexity of reference molecules increases.
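For orientation, the standard (unmodified) Tversky coefficient can be sketched as below; the MTc(β) and wTv variants from the figure are not reproduced here because their formulas are not given in this text. The function name and the set-based fingerprint encoding are assumptions.

```python
def tversky(ref_fp, db_fp, alpha, beta):
    """Standard Tversky coefficient for fingerprints given as sets of on-bits.

    alpha weights bits set only in the reference molecule,
    beta weights bits set only in the database molecule.
    """
    ref, db = set(ref_fp), set(db_fp)
    common = len(ref & db)
    denominator = alpha * len(ref - db) + beta * len(db - ref) + common
    return common / denominator if denominator else 0.0


# Hypothetical case: the database compound is a "sub-fingerprint"
# of a more complex reference molecule.
reference = {1, 2, 3, 4}
database = {1, 2}
symmetric = tversky(reference, database, 1.0, 1.0)  # equals Tanimoto
ignore_ref_only = tversky(reference, database, 0.0, 1.0)
```

With alpha = beta = 1 the expression equals the Tanimoto coefficient, and with alpha = beta = 0.5 it equals the Dice coefficient. Unequal weights make the measure asymmetric, which is the lever used to study how molecular complexity influences search results.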

Six molecules of increasing size are shown (color code: red, blue, black, purple, green, and brown). The number of MACCS bit positions set on in each case is reported, together with the distribution of MACCS Tc values relative to ZINC database (Ref 68) compounds. The larger the molecules are, the more bit positions are set on, and the apparent similarity to database compounds increases. (Reprinted with permission from Ref 62. Copyright 1988 ACS Publications.)

The formulas of five similarity coefficients are shown.
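The figure itself is not reproduced in this text, so the five coefficients it lists cannot be confirmed here. As a point of reference, three widely used coefficients for binary fingerprints are written out below in their standard forms, under the assumed notation that a and b are the numbers of bits set on in molecules A and B, and c is the number of bits set on in both.

```latex
\begin{align}
  \mathrm{Tc}(A,B)     &= \frac{c}{a + b - c}  && \text{(Tanimoto)} \\
  \mathrm{Dice}(A,B)   &= \frac{2c}{a + b}     && \text{(Dice)} \\
  \mathrm{Cosine}(A,B) &= \frac{c}{\sqrt{a\,b}} && \text{(cosine)}
\end{align}
```

All three range from 0 (no shared bits) to 1 (identical bit patterns).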

On the left, the PDR‐FP design is illustrated. Several consecutive bit positions code for the screening database value ranges of each descriptor (e.g., five bits code for the value range of the descriptor ‘diameter’). On the right, a search string resulting from individual fingerprints of multiple reference molecules is compared with the fingerprint of a database compound. The similarity of the nonbinary search string and the binary fingerprint is assessed by calculating the dot product divided by a normalization factor (i.e., the sum of the maximum frequency values per descriptor).
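The scoring step described above can be sketched as follows. The sketch assumes that the search string stores, per bit position, the number of reference molecules that set that bit, and that the bit ranges belonging to each descriptor are known; the function names and the toy layout (two descriptors with three and two bits) are hypothetical.

```python
def build_search_string(reference_fps):
    """Count, per bit position, how many reference fingerprints set the bit."""
    return [sum(column) for column in zip(*reference_fps)]


def pdr_similarity(search_string, db_fp, descriptor_ranges):
    """Dot product of the nonbinary search string and a binary fingerprint,
    divided by the sum of the maximum frequency value per descriptor.

    descriptor_ranges: list of (start, end) bit index pairs, end exclusive.
    """
    dot = sum(weight * bit for weight, bit in zip(search_string, db_fp))
    norm = sum(max(search_string[start:end]) for start, end in descriptor_ranges)
    return dot / norm if norm else 0.0


# Hypothetical layout: descriptor 1 uses bits 0-2, descriptor 2 uses bits 3-4.
ranges = [(0, 3), (3, 5)]
references = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
]
search_string = build_search_string(references)  # [2, 1, 0, 1, 2]
best_case = pdr_similarity(search_string, [1, 0, 0, 0, 1], ranges)
weaker_case = pdr_similarity(search_string, [0, 1, 0, 1, 0], ranges)
```

If each database compound sets exactly one bit per descriptor, as the value range encoding suggests, the normalization keeps the score at or below 1, which is reached only when the compound falls into the most populated value range of every descriptor.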

Starting with a central carbon atom colored in red, the calculation of two atom environment layers is illustrated. The first layer is colored in blue and the second in green.
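Layer-by-layer atom environments of this kind can be collected with a breadth-first traversal of the molecular graph. The sketch below assumes the molecule is given as an adjacency dictionary of atom indices; it collects only which atoms belong to each layer and does not attempt the atom typing or hashing that a real atom environment fingerprint would add.

```python
def environment_layers(adjacency, center, n_layers):
    """Return the atoms in each bond-distance layer around a central atom.

    adjacency: dict mapping an atom index to the indices of bonded atoms.
    """
    visited = {center}
    frontier = [center]
    layers = []
    for _ in range(n_layers):
        next_frontier = []
        for atom in frontier:
            for neighbor in adjacency[atom]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        layers.append(sorted(next_frontier))
        frontier = next_frontier
    return layers


# Hypothetical heavy-atom graph: atom 0 is the central carbon.
molecule = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 4, 5],
    3: [1],
    4: [2],
    5: [2],
}
layers = environment_layers(molecule, 0, 2)
```

The visited set ensures that atoms in rings are assigned to the first layer in which they are reached and are not counted twice.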

A model substructure‐type fingerprint is shown. Bit positions are set on (i.e., to 1, gray) if the corresponding substructure is present in a molecule. Otherwise they are set off (i.e., to 0).
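The bit-setting rule can be illustrated with a short sketch. It assumes that substructure detection (for example, by pattern matching against the molecular graph) has already been done elsewhere and that each molecule arrives as a collection of substructure labels; the four-entry dictionary is a made-up example, not a real key set such as MACCS.

```python
# Hypothetical substructure dictionary; each entry owns one bit position.
SUBSTRUCTURE_KEYS = ["carboxylic acid", "aromatic ring", "amide", "halogen"]


def substructure_fingerprint(present_substructures, keys=SUBSTRUCTURE_KEYS):
    """Set a bit to 1 if the corresponding substructure is present, else 0."""
    present = set(present_substructures)
    return [1 if key in present else 0 for key in keys]


# A molecule found to contain an aromatic ring and a halogen.
fingerprint = substructure_fingerprint(["halogen", "aromatic ring"])
```

Because every bit has a fixed meaning, fingerprints of different molecules can be compared position by position.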

Examples of reduced graphs and corresponding SMARTS representations are shown for a test molecule.

A three‐point pharmacophore query derived from a single reference molecule is shown. This pharmacophore is compared with individual pharmacophores of database compounds.
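A simple way to compare such a query with a database pharmacophore is to check whether the three feature types agree and the three inter-feature distances agree within a tolerance, trying every assignment of features. The sketch below assumes a pharmacophore is stored as a tuple of three feature types plus the distances (d01, d02, d12) between them; the representation, the function name, and the tolerance value are illustrative assumptions, and practical methods often bin distances instead.

```python
from itertools import permutations

PAIRS = ((0, 1), (0, 2), (1, 2))


def pharmacophores_match(query, candidate, tolerance=0.5):
    """Check whether two three-point pharmacophores agree.

    Each pharmacophore is (types, distances) with types = (t0, t1, t2) and
    distances = (d01, d02, d12); the distance for the pair (i, j) is stored
    at index i + j - 1.
    """
    query_types, query_dists = query
    cand_types, cand_dists = candidate
    for mapping in permutations(range(3)):
        # mapping[k] is the candidate feature assigned to query feature k.
        if any(query_types[k] != cand_types[mapping[k]] for k in range(3)):
            continue
        if all(
            abs(query_dists[i + j - 1] - cand_dists[mapping[i] + mapping[j] - 1])
            <= tolerance
            for i, j in PAIRS
        ):
            return True
    return False


# Hypothetical query and two database pharmacophores (distances in angstroms).
query = (("donor", "acceptor", "aromatic"), (3.0, 5.0, 6.0))
similar = (("aromatic", "donor", "acceptor"), (5.2, 6.1, 2.8))
different = (("aromatic", "donor", "acceptor"), (9.0, 6.1, 2.8))
hit = pharmacophores_match(query, similar)
miss = pharmacophores_match(query, different)
```

Trying all six assignments makes the comparison independent of the order in which features are listed, including cases where two features share the same type.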

Two exemplary cathepsin inhibitors (1 and 2) are shown in different molecular representations.
