
In silico toxicology: comprehensive benchmarking of multi‐label classification methods applied to chemical toxicity data


One goal of toxicity testing, among others, is identifying harmful effects of chemicals. Given the high demand for toxicity tests, it is necessary to conduct them for multiple toxicity endpoints of the same compound. Current computational toxicology methods mainly develop models that predict a single toxicity endpoint. When chemicals cause several toxic effects, one model is generated per endpoint, which can be labor and computationally intensive when the number of endpoints is large. Moreover, this approach ignores possible correlations between the endpoints. Computational toxicity studies have therefore recently shifted toward predictive models that cover several toxicity endpoints at once by exploiting correlations between them. Using such correlations jointly with compounds' features may improve models' performance and reduce the number of required models. This can be achieved through multi‐label classification methods, which have not yet undergone comprehensive benchmarking in the domain of predictive toxicology. We therefore performed extensive benchmarking and analysis of over 19,000 multi‐label classification models generated using combinations of state‐of‐the‐art methods. The methods were evaluated from different perspectives using various metrics to assess their effectiveness, and we illustrate how their performance varies under several conditions. This review will help researchers select the most suitable method for the problem at hand and provides a baseline for evaluating new approaches. Based on this analysis, we provide recommendations for potential future directions in this area.
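To make the contrast concrete, here is a minimal sketch, using scikit‐learn on synthetic data rather than the benchmark's actual pipeline, of a binary relevance baseline (one independent classifier per endpoint) next to a classifier chain, one of the multi‐label methods benchmarked here, which feeds earlier endpoints' labels into later classifiers so that correlations between endpoints are exploited.

    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

    # Synthetic stand-in for a compound-feature matrix X and endpoint matrix Y.
    X, Y = make_multilabel_classification(n_samples=500, n_features=64,
                                          n_classes=4, random_state=0)
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

    # Binary relevance: one independent classifier per endpoint.
    br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)

    # Classifier chain: each classifier also sees the preceding endpoints' labels.
    cc = ClassifierChain(LogisticRegression(max_iter=1000),
                         random_state=0).fit(X_tr, Y_tr)

    print("BR macro-F1:", f1_score(Y_te, br.predict(X_te), average="macro"))
    print("CC macro-F1:", f1_score(Y_te, cc.predict(X_te), average="macro"))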

This article is categorized under:

  • Computer and Information Science > Chemoinformatics
  • Computer and Information Science > Computer Algorithms and Programming
Predictability of compounds' toxicity in (a) internal and (b) external validation. The heat maps show models' performance in predicting the toxicity of each compound. Each row corresponds to a model, and each column corresponds to a compound. Cells represent each model's performance in predicting the toxicity of each compound. Models are numbered from 0 to 19,185. Performance is calculated using the mean absolute error metric and ranges from 0.0 (best performance) to 1.0 (worst performance). The compounds were clustered into three groups according to the models' performance in predicting their toxicities: compounds with high predictability (green clusters), medium predictability (magenta clusters), and low predictability (orange clusters).
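A per‐compound heat‐map cell of this kind can be computed as the mean absolute error between a model's predicted toxicity probabilities and the compound's known labels. The sketch below is a hypothetical illustration (our own variable names), assuming unknown labels are coded as NaN and ignored.

    import numpy as np

    def per_compound_mae(y_true, y_prob):
        """Mean absolute error per compound; NaN in y_true marks unknown labels.

        y_true, y_prob: arrays of shape (n_compounds, n_endpoints).
        Returns one score per compound, 0.0 (best) to 1.0 (worst).
        """
        err = np.abs(y_prob - y_true)     # NaN wherever the true label is unknown
        return np.nanmean(err, axis=1)    # average over the known endpoints only

    y_true = np.array([[1.0, 0.0, np.nan],
                       [0.0, np.nan, 1.0]])
    y_prob = np.array([[0.9, 0.2, 0.5],
                       [0.4, 0.6, 0.7]])
    print(per_compound_mae(y_true, y_prob))   # -> [0.15 0.35]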
Predictability of endpoints in (a) internal and (b) external validation. The heat maps show models' performance in predicting each toxicity endpoint. Each row corresponds to a model, and each column corresponds to a toxicity endpoint. Cells represent each model's performance in predicting each endpoint. Models are numbered from 0 to 19,185. Performance is calculated using the mean absolute error metric and ranges from 0.0 (best performance) to 1.0 (worst performance). The endpoints were clustered into two groups according to the models' performance in predicting them: endpoints with high predictability (green clusters) and endpoints with low predictability (orange clusters).
Performance of estimating the toxicity of a given endpoint using the average toxicity values of the other endpoints in (a) internal and (b) external validation. Each row corresponds to a performance metric, and each column corresponds to an endpoint. Each cell shows the calculated score per endpoint. Scores range from 0.0 (worst performance) to 1.0 (best performance).
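This baseline estimates one endpoint from the remaining ones without using compound features at all. A hypothetical NumPy sketch of that idea (assuming NaN codes unknown labels):

    import numpy as np

    def other_endpoint_mean(Y, j):
        """Estimate endpoint j from the mean of each compound's other known labels.

        Y: (n_compounds, n_endpoints) label matrix; NaN marks unknown labels.
        """
        others = np.delete(Y, j, axis=1)   # drop the target endpoint's column
        return np.nanmean(others, axis=1)  # average the remaining known labels

    Y = np.array([[1.0, 0.0, 1.0],
                  [0.0, np.nan, 0.0],
                  [1.0, 1.0, np.nan]])
    print(other_endpoint_mean(Y, 0))       # -> [0.5 0.  1. ]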
Area Under the Receiver Operating Characteristic curve (AUROC) scores of the top‐ranked models generated by each multi‐label classification method and the binary relevance method per endpoint in (a) internal and (b) external validation. Rows correspond to the multi‐label classification methods and the binary relevance method; columns correspond to endpoints. Each cell shows the AUROC score of each method per endpoint. Scores range from 0.0 (worst performance) to 1.0 (best performance); an AUROC score of 0.5 indicates random prediction. BR, binary relevance; CC, classifier chains; DL, deep learning; LP, label powerset; MLC‐BMaD, multi‐label Boolean matrix decomposition; MLDT, multi‐label decision tree; MLKNN, multi‐label K nearest neighbor; RAkEL, random K labelset; SSL, semi‐supervised learning.
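When the label matrix is sparse, per‐endpoint AUROC must be computed over only the compounds whose label for that endpoint is known. A hypothetical sketch of such masked scoring with scikit‐learn (assuming NaN codes unknown labels and both classes occur among the known ones):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def per_endpoint_auroc(y_true, y_score):
        """AUROC per endpoint over compounds whose label is known (non-NaN).

        Assumes both classes occur among the known labels of every endpoint.
        """
        scores = []
        for j in range(y_true.shape[1]):
            known = ~np.isnan(y_true[:, j])          # mask out unknown labels
            scores.append(roc_auc_score(y_true[known, j], y_score[known, j]))
        return scores

    y_true = np.array([[1, 0, np.nan], [0, 1, 1],
                       [1, np.nan, 0], [0, 1, 1]], dtype=float)
    y_score = np.array([[.8, .3, .5], [.2, .7, .9],
                        [.6, .4, .1], [.3, .9, .8]])
    print(per_endpoint_auroc(y_true, y_score))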
Accuracy scores per endpoint of the top‐ranked models generated by each multi‐label classification method and the top‐ranked binary relevance model in (a) internal and (b) external validation. Rows correspond to the multi‐label classification methods and the binary relevance method; columns correspond to endpoints. Each cell shows the accuracy score of each method per endpoint. Scores range from 0.0 (worst performance) to 1.0 (best performance). BR, binary relevance; CC, classifier chains; DL, deep learning; LP, label powerset; MLC‐BMaD, multi‐label Boolean matrix decomposition; MLDT, multi‐label decision tree; MLKNN, multi‐label K nearest neighbor; RAkEL, random K labelset; SSL, semi‐supervised learning.
Comparison of macro‐average performances of models in internal and external validation. The scatter plots show model performance via five metrics: accuracy, F1‐score, precision, recall, and specificity. The x‐axis and y‐axis show model performance in internal and external validation, respectively. The closer the models are to the diagonal of the scatter plots (from point (0,0) to point (1,1)), the more similar their performance in internal and external validation. Models with high variability between internal and external performance appear below or above the diagonal region and are marked in orange and blue, respectively.
Comparison of macro‐average performances of multi‐label and binary relevance models in (a) internal and (b) external validation. Bar graphs show model performance via five metrics: accuracy, F1‐score, precision, recall, and specificity. Models are numbered from 0 to 19,185. The gray areas in the bar graphs show the performance range of the binary relevance models. BR, binary relevance; CC, classifier chains; LP, label powerset; MLC‐BMaD, multi‐label Boolean matrix decomposition; MLDT, multi‐label decision tree; MLKNN, multi‐label K nearest neighbors; RAkEL, random K labelset.
Data set description. (a) Toxicity profiles of 6644 compounds for 17 toxicity endpoints. Each row corresponds to a compound, each column corresponds to a toxicity endpoint, and each cell represents a compound's activity per endpoint. Compounds are numbered from 0 to 6643. Red cells indicate active/toxic compounds, blue cells indicate inactive/nontoxic compounds, and gray cells denote unknown toxicity. (b) A bar graph of the number of toxic and nontoxic compounds associated with each toxicity endpoint. (c) A bar graph of the number of known toxicity effects per compound. (d) A bar graph of the percentage of positive and negative toxicity effects per compound.
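The summaries in panels (b)-(d) follow directly from a label matrix in which unknown toxicity is coded as NaN; the toy sketch below illustrates the bookkeeping (the variable names are ours, not the authors').

    import numpy as np

    # Toy stand-in for the 6644 x 17 matrix: 1 = toxic, 0 = nontoxic, NaN = unknown.
    Y = np.array([[1, 0, np.nan],
                  [np.nan, 1, 1],
                  [0, 0, np.nan]], dtype=float)

    toxic_per_endpoint = np.sum(Y == 1, axis=0)        # panel (b), positives
    nontoxic_per_endpoint = np.sum(Y == 0, axis=0)     # panel (b), negatives
    known_per_compound = np.sum(~np.isnan(Y), axis=1)  # panel (c)
    pct_positive = 100 * np.nanmean(Y, axis=1)         # panel (d), % toxic per compound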
Illustrations of some multi‐label classification methods. (a) X is the matrix of features of compounds C1, …, Cn, where n is the number of compounds, and their features F1, …, Fm, where m is the number of features. (b) L is the label matrix, which consists of four labels in this example. Positive and negative labels are denoted by ‘1’ and ‘0’, respectively, while ‘?’ indicates missing labels. (c) Classifier chains method. Matrix X′ consists of the feature matrix X from part (a) extended with the label L1 from matrix L in part (b). The missing labels of L1 are imputed. X′ is used to train a model M to predict a second label, L2. (d) Label powerset method. Matrix L′ consists of the transformed multi‐class labels. Each unique label combination is a distinct class. For example, l1 indicates that L1 is positive, while ~l2 indicates that L2 is negative. Missing labels are not encoded. (e) Random K labelset method. Matrix L′ consists of two labelsets of length K = 2, and each labelset is represented using the label powerset method. In this example, the first labelset consists of labels L1 and L2, and the second labelset consists of labels L3 and L4. (f) Multi‐label Boolean matrix decomposition method. L′ is the decomposed matrix, which consists of three latent labels in this example: L′1, L′2, and L′3. (g) Matrix Y′ is the second matrix from the decomposition based on the multi‐label Boolean matrix decomposition method.
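Of these transformations, the label powerset in panel (d) is the easiest to demonstrate end to end: each distinct combination of labels becomes one class of an ordinary multi‐class problem. A hypothetical sketch with NumPy and scikit‐learn (toy data; the benchmark's own implementation is not shown here):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    L = np.array([[1, 0, 1, 0],        # toy label matrix; label powerset does not
                  [1, 0, 1, 0],        # encode missing labels, so none appear here
                  [0, 1, 0, 0],
                  [1, 1, 0, 1]])

    # Each unique row, e.g. (1, 0, 1, 0) ~ "l1, ~l2, l3, ~l4", becomes a class id.
    combos, y_lp = np.unique(L, axis=0, return_inverse=True)

    X = np.random.rand(4, 8)                       # toy feature matrix
    clf = RandomForestClassifier(random_state=0).fit(X, y_lp)

    # A predicted class id decodes back into a full label vector.
    print(combos[clf.predict(X[:1])])              # e.g. [[1 0 1 0]]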
Overview of modeling approaches. (a) Three categories of computational methods: feature selection, multi‐label classification, and base classifiers. MLDT, multi‐label decision tree; MLKNN, multi‐label K nearest neighbors; MLC‐BMaD, multi‐label Boolean matrix decomposition. (b) A list of base classifiers along with their corresponding kernels, solvers, splitting criteria, and distance metrics (when applicable). CD, Coordinate Descent; CG, Conjugate Gradient; LBFGS, Limited‐memory quasi‐Newton; SAG, Stochastic Average Gradient; RBF, Radial Basis Function. (c) Three feature selection methods. L1, L2, and L3: labels; X: the original feature set; X1, X2, X3: selected feature sets for labels L1, L2, and L3, respectively; xi: a single feature; Xs: the combined feature set; M1, M2, and M3: models for endpoints L1, L2, and L3, respectively; t: variance threshold.
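The three feature selection routes in panel (c) differ in what information they use: none (variance only), the combined labels, or each label separately. A hypothetical scikit‐learn sketch (toy binary fingerprints; treating "toxic on any endpoint" as the combined target is our simplification):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

    X = np.random.randint(0, 2, size=(100, 50)).astype(float)  # toy fingerprints
    L = np.random.randint(0, 2, size=(100, 3))                 # toy labels, 3 endpoints

    # Unsupervised (UFS): drop near-constant features using a variance threshold t.
    X_ufs = VarianceThreshold(threshold=0.1).fit_transform(X)

    # Supervised (SFS): score features against one combined target; "toxic on any
    # endpoint" is our simplification, not necessarily the benchmark's choice.
    X_sfs = SelectKBest(chi2, k=20).fit_transform(X, L.max(axis=1))

    # Label-specific (LSFS): a separate feature set X_j per endpoint L_j.
    X_per_label = [SelectKBest(chi2, k=20).fit_transform(X, L[:, j])
                   for j in range(L.shape[1])]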
Illustrations of single‐label classification and multi‐label classification. X is the data set in which feature vectors describe compounds C1, …, Cn; n is the number of compounds; F1, …, Fm are features; m is the number of features. Y is the label vector (in single‐label classification) or the label matrix (in multi‐label classification). (a) Binary classification. (b) Multi‐class classification. (c) Multi‐label classification. Missing labels are denoted with ‘?’; ‘1’ and ‘0’ are known labels.
Effect of feature selection on models’ performance in (a) internal and (b) external validation. Bar graphs show models’ macro‐average performance via five metrics: accuracy, F1‐score, precision, recall, and specificity. Models are numbered from 0 to 19,185. SFS, supervised feature selection; UFS, unsupervised feature selection; LSFS, label‐specific feature selection; None, no feature selection applied.
The relationship between compounds’ predictability and the number of known toxicity effects per compound. The histograms show the probability distribution of the number of known toxicity endpoints per compound for compounds with high, medium, and low predictability in (a) internal and (b) external validation.

Related Articles

In silico toxicology: computational methods for the prediction of chemical toxicity
Computational toxicology: a tool for all industries
Machine learning methods in chemoinformatics

