How to cite this WIREs title:
WIREs Comput Mol Sci
Impact Factor: 8.127

Formatting biological big data for modern machine learning in drug discovery

Biological data is accumulating at an unprecedented rate, escalating the role of data‐driven methods in computational drug discovery. This scenario is favored by recent advances in machine learning algorithms, which are optimized for huge datasets and consistently beat the predictive performance of prior art, rapidly approaching human expert reasoning. The urge to couple biological data to cutting‐edge machine learning has spurred developments in data integration and knowledge representation, especially in the form of heterogeneous, multiplex and semantically‐rich biological networks. Today, thanks to the timely rise of knowledge embedding techniques, these large and complex biological networks can be converted to a vector format that suits the majority of machine learning implementations. Here, we explain why this can be particularly transformative for drug discovery where, for decades, customary chemoinformatics methods have employed vector descriptors of compound structures as the standard input of their prediction tasks. A common vector format to represent biology and chemistry may push biological information into most of the existing steps of the drug discovery pipeline, boosting the accuracy of predictions and uncovering connections between small molecules and other biological entities such as targets or diseases.

This article is categorized under: Computer and Information Science > Databases and Expert Systems; Computer and Information Science > Chemoinformatics
Deep learning in chemoinformatics. (a) A classical multitarget prediction exercise based on chemogenomics (ChemoGx) data. Deep neural networks can read a molecular structure as a graph (e.g., graph convolutional networks) and be trained to perform a multitask classification. An inner (usually the last) layer of the network corresponds to the chemical embedding. (b) An autoencoder is a type of neural network that includes an encoder and a decoder, compressing and decompressing the data, respectively. The encoder maps the input to a latent space (embedding), and the decoder maps the embedding back to the original representation. The embedding is a continuous vector that can be optimized for a certain property of interest “Z”. Vectors interpolated in the latent space can then be decoded to generate new molecules. (c) MoleculeNet offers a number of benchmark datasets at different levels of resolution (from quantum properties to physiological properties of the molecules). For a brief explanation of the datasets, please visit http://moleculenet.ai/datasets-1
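
As a rough illustration of the interpolation step described in panel (b), the sketch below walks along a straight line between two latent vectors. The function name, the three-dimensional vectors and their values are hypothetical, and the trained encoder and decoder that would produce and consume such vectors are assumed to exist elsewhere; linear interpolation is only one of several possible ways to traverse a latent space.

```python
from typing import List


def interpolate_latent(z_start: List[float], z_end: List[float],
                       steps: int) -> List[List[float]]:
    """Return `steps` evenly spaced points on the segment z_start -> z_end.

    Both endpoints are included. Each returned point is a candidate
    latent vector that a trained decoder (not shown) could decode.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimensionality")
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical embeddings of two molecules (made-up values, 3 dimensions).
z_a = [0.0, 1.0, -2.0]
z_b = [1.0, 0.0, 2.0]
latent_path = interpolate_latent(z_a, z_b, steps=5)
```

Each vector in `latent_path` would then be passed to the decoder; whether the decoded output is a valid molecule depends entirely on the model and is not addressed by this sketch.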
Biological embeddings. Given a heterogeneous network, a random walk can be run along a chosen meta‐path. This produces a text‐like “corpus” of node sequences that can be processed with word2vec (using the skip‐gram model or the continuous bag of words model). As a result, each node visited by the random walker is mapped to an embedding space, that is, each node is assigned a vector representation. Compound embeddings can then be used in subsequent supervised learning, for example, to predict a clinical property (y) of the molecules, given training data. Alternatively, embeddings of different node types can be compared with (connected to) one another to discover, for example, compound‐disease relationships.
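
The corpus-generation step can be sketched as follows. This is a minimal example under stated assumptions, not the procedure used for the figure: the toy network, the node names (`compound_A`, `gene_1`, and so on), the function `metapath_walk` and the compound-gene-disease-gene-compound meta-path are all hypothetical. The resulting list of walks plays the role of sentences that a word2vec implementation could consume; that training step is not shown.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy heterogeneous network: every node has a type.
NODE_TYPE: Dict[str, str] = {
    "compound_A": "compound", "compound_B": "compound",
    "gene_1": "gene", "gene_2": "gene",
    "disease_X": "disease", "disease_Y": "disease",
}
EDGES: List[Tuple[str, str]] = [
    ("compound_A", "gene_1"), ("compound_A", "gene_2"),
    ("compound_B", "gene_2"),
    ("gene_1", "disease_X"), ("gene_2", "disease_X"),
    ("gene_2", "disease_Y"),
]

# Undirected adjacency lists, sorted so results are reproducible.
ADJACENCY: Dict[str, List[str]] = {node: [] for node in NODE_TYPE}
for u, v in EDGES:
    ADJACENCY[u].append(v)
    ADJACENCY[v].append(u)
for node in ADJACENCY:
    ADJACENCY[node].sort()


def metapath_walk(start: str, meta_path: List[str], walk_length: int,
                  rng: random.Random) -> List[str]:
    """Random walk whose node types follow a cyclic meta-path."""
    if meta_path[0] != meta_path[-1]:
        raise ValueError("meta-path must start and end with the same type")
    if NODE_TYPE[start] != meta_path[0]:
        raise ValueError("start node type must match the meta-path")
    cycle = meta_path[:-1]  # drop the repeated closing type
    walk = [start]
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        candidates = [n for n in ADJACENCY[walk[-1]]
                      if NODE_TYPE[n] == wanted]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk


META_PATH = ["compound", "gene", "disease", "gene", "compound"]
rng = random.Random(0)
# The "corpus": several walks started from every compound node.
corpus = [metapath_walk(node, META_PATH, walk_length=9, rng=rng)
          for node, kind in NODE_TYPE.items() if kind == "compound"
          for _ in range(3)]
```

Constraining the walker by node type is what distinguishes this from a plain random walk: only paths that match the chosen semantics contribute co-occurrence statistics to the embedding.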
Network embedding example. The aim of network embedding is to represent graph entities (typically nodes) as numerical vectors (embeddings) that preserve graph properties, such as local distances, modularity and global organization. Here, we have embedded a fraction (~1%) of the yeast interactome using a standard network embedding algorithm (node2vec; 128 dimensions), and projected the corresponding embeddings onto a two‐dimensional plane using t‐Distributed Stochastic Neighbor Embedding (t‐SNE).
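
For readers unfamiliar with node2vec, its distinguishing ingredient is a second-order random walk biased by a return parameter p and an in-out parameter q. The sketch below shows only that sampling step on a hypothetical toy graph (node names `P1` to `P6` are invented). It does not train the skip-gram model that turns walks into 128-dimensional vectors, nor does it perform the t-SNE projection, and it is not the code used to produce the figure.

```python
import random
from typing import Dict, List, Set


def node2vec_walk(graph: Dict[str, Set[str]], start: str, length: int,
                  p: float, q: float, rng: random.Random) -> List[str]:
    """Sample one node2vec-style biased walk on an unweighted graph.

    For a walker that moved previous -> current, each neighbor x of
    `current` gets an unnormalized weight of:
      1/p  if x is `previous` (returning),
      1    if x is also a neighbor of `previous` (staying close),
      1/q  otherwise (moving outward).
    """
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: no previous node yet, so choose uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)
            elif candidate in graph[previous]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy interaction graph (undirected, stored symmetrically).
toy_graph: Dict[str, Set[str]] = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3", "P4"},
    "P3": {"P1", "P2"},
    "P4": {"P2", "P5", "P6"},
    "P5": {"P4", "P6"},
    "P6": {"P4", "P5"},
}
rng = random.Random(42)
walks = [node2vec_walk(toy_graph, node, length=10, p=1.0, q=0.5, rng=rng)
         for node in sorted(toy_graph)]
```

With q below 1 the walker is nudged away from the neighborhood it came from, which tends to emphasize global organization; with q above 1 it stays local, which tends to emphasize modularity.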
Heterogeneous network of biology. (a) A meta‐graph of an in‐house heterogeneous network, mostly inspired by Hetionet and complemented with the Harmonizome. For simplicity, only the most representative edge types are shown. “Is a” and “has” relationships typically refer to ontologies. (b) A view of the nodes and edges composing the network. To obtain a representative network, we sub‐sampled 500 edges of each type. Different colors denote different types of edges, and the sizes of the circles are proportional to the number of nodes.
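
The per-type sub-sampling mentioned in panel (b) could look like the sketch below. The function name, the triple format (source, edge type, target) and the toy edges are hypothetical, and keeping all edges of a type that has fewer than the requested number is an assumption, since the caption does not say how such types were handled.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, edge_type, target)


def subsample_edges(edges: List[Edge], max_per_type: int,
                    seed: int = 0) -> List[Edge]:
    """Keep at most `max_per_type` randomly chosen edges of each type."""
    rng = random.Random(seed)
    by_type: Dict[str, List[Edge]] = defaultdict(list)
    for edge in edges:
        by_type[edge[1]].append(edge)
    sampled: List[Edge] = []
    for edge_type in sorted(by_type):
        group = by_type[edge_type]
        # Assumption: types with fewer edges than the cap are kept whole.
        k = min(max_per_type, len(group))
        sampled.extend(rng.sample(group, k))
    return sampled


# Hypothetical toy edge list: 10 "binds" edges and 3 "treats" edges.
toy_edges: List[Edge] = (
    [(f"compound_{i}", "binds", f"gene_{i}") for i in range(10)]
    + [(f"compound_{i}", "treats", f"disease_{i}") for i in range(3)]
)
subnetwork = subsample_edges(toy_edges, max_per_type=5)
```

Sampling within each type, rather than over all edges at once, prevents abundant edge types from crowding out rare ones in the resulting view.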

