How to cite this WIREs title:
WIREs Comput Mol Sci
Impact Factor: 8.127

Making machine learning a useful tool in the accelerated discovery of transition metal complexes


Abstract

As machine learning (ML) has matured, it has opened a new frontier in theoretical and computational chemistry by offering the promise of simultaneous paradigm shifts in accuracy and efficiency. Nowhere is this advance more needed, but also more challenging to achieve, than in the discovery of open‐shell transition metal complexes. Here, localized d or f electrons exhibit variable bonding that is challenging to capture even with the most computationally demanding methods. Thus, despite great promise, clear obstacles remain in constructing ML models that can supplement or even replace explicit electronic structure calculations. In this article, I outline the recent advances in building ML models in transition metal chemistry, including the ability to approach sub‐kcal/mol accuracy on a range of properties with tailored representations, to discover and enumerate complexes in large chemical spaces, and to reveal opportunities for design through analysis of feature importance. I discuss unique considerations that have been essential to enabling ML in open‐shell transition metal chemistry, including (a) the relationship of data set size/diversity, model complexity, and representation choice, (b) the importance of quantitative assessments of both theory and model domain of applicability, and (c) the need to enable autonomous generation of reliable, large data sets both for ML model training and in active learning or discovery contexts. Finally, I summarize the next steps toward making ML a mainstream tool in the accelerated discovery of transition metal complexes.

This article is categorized under:
Electronic Structure Theory > Density Functional Theory
Software > Molecular Modeling
Computer and Information Science > Chemoinformatics

A depiction of the interplay between three principles in machine learning model development: Data set size (i.e., number of data points), model complexity (i.e., linear models, LR, vs. nonlinear neural networks, NNs), and the detail that the representation captures ranging from highly local heuristic connectivity‐derived properties (e.g., local electronegativity differences) all the way to 3D representations (e.g., all atomic coordinates) and those that incorporate QM descriptors (e.g., the d‐band center used in catalysis). Three of our group's models for predicting ΔE_H–L: Full revised autocorrelations (RACs) with an artificial neural network (RAC‐155/NN), selected universal RAC set with a kernel ridge regression model (URAC‐26/KRR), and the ad hoc mixed continuous/discrete local (MCDL‐25) features with an NN, are shown qualitatively in this data/model/representation space as red symbols. The translucent region represents an example of the size and complexity of conventional studies used to develop structure–property relationships.
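
The caption above names revised autocorrelations (RACs) without defining them. As a rough illustration only, the sketch below computes a generic graph product autocorrelation: the sum of products of an atomic property over atom pairs separated by a fixed number of bonds. The function names, the choice to count each unordered pair once, and the CO2 toy graph are assumptions made for this example; this is not the author's RAC-155 feature set.

```python
from collections import deque


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist


def autocorrelation(adjacency, prop, depth):
    """Sum prop[i] * prop[j] over unordered atom pairs exactly `depth` bonds apart.

    Assumption: each pair is counted once (i <= j); depth 0 pairs an atom with itself.
    """
    total = 0.0
    for i in adjacency:
        for j, d in bond_distances(adjacency, i).items():
            if d == depth and i <= j:
                total += prop[i] * prop[j]
    return total


# Toy molecular graph for CO2 (O=C=O): atom 0 is O, atom 1 is C, atom 2 is O.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
nuclear_charge = {0: 8, 1: 6, 2: 8}

# One feature per bond depth 0, 1, 2.
features = [autocorrelation(adjacency, nuclear_charge, d) for d in range(3)]
print(features)
```

Descriptors of this kind depend only on connectivity and tabulated atomic properties, not on 3D coordinates, which places them toward the connectivity-derived end of the representation axis described in the caption.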
Comparison of relationships between bond length and chemical composition in organic molecules (top) and inorganic complexes (bottom) obtained from first‐principles calculation with values obtained from an available force field shown in circle symbols. For organic chemistry, the length of C–C bonds in acetylene, ethylene, benzene, and ethane are compared. For inorganic chemistry, the Fe–C bonds in four spin and oxidation states of the homoleptic Fe(CO)6 complex are compared. The relative bond lengths obtained from first‐principles calculation are shown to scale across the two data sets: The organic molecules span a 0.34 Å range from acetylene (1.20 Å) to ethane (1.54 Å), and the inorganic complex bond lengths span a 0.37 Å range from singlet Fe(II) (1.94 Å) to quintet Fe(II) (2.31 Å).
Comparison of typical workflow steps for chemical discovery by a human (computational chemist, left) and autonomous workflow (computational workflow, right). The dotted lines indicate a step in the computational workflow that is not yet established or completed, whereas the solid lines indicate steps that are routine or recently established
Depiction of uncertainty in property space due to method reference choice (left) and in machine learning (ML) model training (right). (Left) Representation of the fitness landscape in chemical space as the method is changed from semi‐local DFT (GGA) to hybrid DFT to correlated wavefunction theory (WFT). (Right) Depiction of how an ML model representation of hybrid DFT chemical space (middle, right, labeled “reference”) would be limited both by estimated high uncertainty regions (top) and by actual residual errors in model fitting (bottom). On both sides, high fitness (i.e., compounds matching a target property) is indicated in yellow, whereas low fitness regions are in purple or black.

