How to cite this WIREs title:
WIREs Comp Stat

Analyzing complex mathematical model behavior by partial least squares regression‐based multivariate metamodeling

The increasing complexity of mathematical models of complex systems like living cells has created a need for methods to reduce computational demand, maintain an overview of the capabilities and feasibility of the models, compare alternative models, and obtain more reliable and effective fitting of models to experimental data. Metamodeling (statistical modeling of the behavior of complex mathematical models, also called ‘surrogate modeling’) is well established in many scientific disciplines, such as mechanical engineering and process simulation, and has recently also found use in computational biology, as well as other fields of bioscience. Many metamodeling approaches are based on partial least squares regression (PLSR) and various nonlinear and N‐way extensions of PLSR. This is a versatile family of multivariate data modeling methods that combines a simple, flexible model structure (low‐rank bilinear subspace regression) and an intuitively attractive optimization criterion (maximized explained input–output covariance) to provide both predictive ability and graphical insight. This review summarizes the background for PLSR‐based metamodeling, and the use of PLSR and related methods in the main application areas of metamodeling: reduction of computational demand, sensitivity analysis, model comparison, and parameterization of models in relation to measured data. The methodology is generic, but is illustrated here by examples from computational biology. The advantages and limitations of metamodeling for analyzing complex model behavior are discussed.

WIREs Comput Stat 2014, 6:440–475. doi: 10.1002/wics.1325

This article is categorized under:
Applications of Computational Statistics > Computational and Molecular Biology
Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods

Classical and inverse multivariate metamodeling of a mathematical model M(.). While the classical metamodel C(.) has the same input–output direction as the original model, the inverse metamodel I(.) has the opposite direction. Once established, the two types of metamodels can be used for predicting outputs from inputs and inputs from outputs, respectively. (Reprinted with permission from Ref . Copyright 2014 BioMed Central)
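
The two metamodel directions described in this caption can be sketched in a few lines. The example below is only an illustration under stated assumptions: the toy model `M`, its matrix `A`, the sample sizes, and the helper names `pls_fit`, `pls_predict`, and `r_squared` are all hypothetical and do not come from the article. The PLSR itself is a plain NIPALS-style implementation in NumPy, not the specific PLSR variants the review covers.

```python
import numpy as np

def pls_fit(X, Y, n_components):
    """Fit PLSR by successive deflation; return (x_mean, y_mean, B)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    E, F = X - x_mean, Y - y_mean              # centered residual matrices
    W, P, Q = [], [], []
    for _ in range(n_components):
        # The weight vector maximizes the covariance between E @ w and F:
        # it is the dominant left singular vector of E.T @ F.
        w = np.linalg.svd(E.T @ F, full_matrices=False)[0][:, 0]
        t = E @ w                              # score vector
        p = E.T @ t / (t @ t)                  # X-loadings
        q = F.T @ t / (t @ t)                  # Y-loadings
        E = E - np.outer(t, p)                 # deflate X
        F = F - np.outer(t, q)                 # deflate Y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.column_stack(W), np.column_stack(P), np.column_stack(Q)
    B = W @ np.linalg.inv(P.T @ W) @ Q.T       # coefficients for centered data
    return x_mean, y_mean, B

def pls_predict(model, X):
    x_mean, y_mean, B = model
    return (X - x_mean) @ B + y_mean

def r_squared(Y_true, Y_pred):
    ss_res = np.sum((Y_true - Y_pred) ** 2)
    ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy model M(.): 3 inputs -> 5 outputs, mildly nonlinear.
A = np.array([[1.0, 0.0, 0.0, 1.0, 0.5],
              [0.0, 1.0, 0.0, -1.0, 0.5],
              [0.0, 0.0, 1.0, 0.5, -1.0]])

def M(inputs):
    return (inputs + 0.1 * inputs ** 2) @ A

# "Computer experiment": run M(.) on sampled inputs (a simple random design).
rng = np.random.default_rng(0)
train_in = rng.uniform(-1, 1, size=(200, 3))
test_in = rng.uniform(-1, 1, size=(100, 3))
train_out, test_out = M(train_in), M(test_in)

classical = pls_fit(train_in, train_out, n_components=3)  # C(.): inputs -> outputs
inverse = pls_fit(train_out, train_in, n_components=3)    # I(.): outputs -> inputs

r2_classical = r_squared(test_out, pls_predict(classical, test_in))
r2_inverse = r_squared(test_in, pls_predict(inverse, test_out))
print(f"classical R2 = {r2_classical:.4f}, inverse R2 = {r2_inverse:.4f}")
```

Both metamodels are fitted on the same simulation table; only the roles of X and Y are swapped. For a strongly nonlinear M(.), a linear PLSR like this one would probably need the nonlinear extensions that the abstract mentions.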
Metamodel simplification of an isolated part of a larger mathematical model. An optical model of light scattering in synchrotron FTIR spectra of prostate cancer cells simplified by principal component analysis (PCA)‐based metamodeling. (a) Measured spectra; absorbance versus wavenumber, for several 5 repeated spectra of two different cells (red and blue). (b) Spectra after metamodel‐based preprocessing, i.e., corrected for unwanted Mie scattering. (c) Basis for metamodeling: baseline spectra (absorbance versus wavenumber) obtained from a nonlinear model M(.) of Mie scattering by systematic simulations with different theoretical interferences (varying cell diameters and densities). Metamodeling of these data (PCA with six PCs) greatly simplified an important part of the nonlinear model of the optical system. (Reprinted with permission from Ref . Copyright 2009 Elsevier)
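
A PCA‐based metamodel of simulated baseline curves, as in panel (c), can be sketched as follows. This is a minimal illustration, not the cited study's procedure: the scattering curves come from the van de Hulst approximation of extinction efficiency, used here only as a stand‐in for the nonlinear model M(.), and the radius range, refractive index range, wavenumber grid, and sample size are all assumed values. Only the choice of six PCs is taken from the caption.

```python
import numpy as np

def q_ext(rho):
    """van de Hulst approximation of the extinction efficiency of a sphere."""
    return 2.0 - (4.0 / rho) * np.sin(rho) + (4.0 / rho ** 2) * (1.0 - np.cos(rho))

# Assumed simulation settings (hypothetical, for illustration only).
wavenumber = np.linspace(1000.0, 4000.0, 300)        # cm^-1
rng = np.random.default_rng(1)
radius_cm = rng.uniform(2e-4, 1e-3, size=100)        # radii of 2 to 10 micrometres
ref_index = rng.uniform(1.1, 1.5, size=100)          # refractive index of the sphere

# Phase-shift parameter rho = 4*pi*r*(n - 1)*wavenumber, one row per simulation.
rho = 4.0 * np.pi * radius_cm[:, None] * (ref_index[:, None] - 1.0) * wavenumber[None, :]
curves = q_ext(rho)                                  # 100 simulated baselines x 300 channels

# PCA by singular value decomposition of the mean-centered simulation table.
mean_curve = curves.mean(axis=0)
U, s, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)

n_pcs = 6                                            # number of PCs, as in the caption
loadings = Vt[:n_pcs]                                # orthonormal basis spectra
scores = U[:, :n_pcs] * s[:n_pcs]                    # coordinates of each simulation
explained = np.sum(s[:n_pcs] ** 2) / np.sum(s ** 2)  # fraction of variance captured
reconstructed = mean_curve + scores @ loadings       # low-rank metamodel of the curves

print(f"variance captured by {n_pcs} PCs in this toy setup: {explained:.4f}")
```

The mean curve and the six loading vectors replace the nonlinear scattering formula with a small linear basis. The fraction of variance captured is computed rather than assumed, since it depends on the simulated parameter ranges.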
The mathematical sloppiness of a nonlinear dynamic model. The structure within a fully neutral parameter set for a simple nonlinear ordinary differential equations (ODE) model revealed by metamodeling. The model M(.) was the one‐dimensional S‐system dx/dt = αx^g − βx^h. (a) Data and model predictions: A curve was generated from the model by numerical integration for a certain (‘true’) combination of parameters α, β, g, and h and then used as ‘perfectly measured data with unknown parameters’. On top of it, 84 different but indistinguishable output curves are plotted, which were obtained by fitting the nonlinear ODE to the data by simplex optimization from different parameter starting values. (b) Sloppiness structure for the data: Bilinear principal component analysis (PCA) summary of the neutral parameter set (84 optimized solutions) for this curve, consisting of the ‘true’ (red star) and the 84 estimated parameter sets (blue dots) seen in the two principal components (PCs) that were required for perfect approximation of the neutral parameter set. The many equivalent parameter combinations fall along a simple manifold. (Reprinted with permission from Ref . Copyright 2014 Wiley)
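
Generating an output curve from this model by numerical integration can be sketched as below, assuming the S‐system takes its usual one‐dimensional form dx/dt = αx^g − βx^h. The parameter values, initial state, step size, and the fixed‐step Runge–Kutta integrator are illustrative assumptions; the caption gives neither the ‘true’ parameters nor the integration scheme.

```python
def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side of the one-dimensional S-system dx/dt = alpha*x^g - beta*x^h."""
    return alpha * x ** g - beta * x ** h

def integrate_rk4(x0, t_end, dt, alpha, beta, g, h):
    """Integrate the S-system with the classical fixed-step Runge-Kutta scheme."""
    n_steps = round(t_end / dt)
    x, trajectory = x0, [x0]
    for _ in range(n_steps):
        k1 = s_system_rhs(x, alpha, beta, g, h)
        k2 = s_system_rhs(x + 0.5 * dt * k1, alpha, beta, g, h)
        k3 = s_system_rhs(x + 0.5 * dt * k2, alpha, beta, g, h)
        k4 = s_system_rhs(x + dt * k3, alpha, beta, g, h)
        x = x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        trajectory.append(x)
    return trajectory

# Hypothetical parameter combination (not the values used in the cited study).
alpha, beta, g, h = 2.0, 1.0, 0.5, 1.0
trajectory = integrate_rk4(x0=1.0, t_end=40.0, dt=0.01,
                           alpha=alpha, beta=beta, g=g, h=h)

# Setting dx/dt = 0 gives the nonzero steady state x* = (alpha/beta)^(1/(h - g)).
steady_state = (alpha / beta) ** (1.0 / (h - g))
print(f"x(40) = {trajectory[-1]:.6f}, steady state = {steady_state:.6f}")
```

A curve like `trajectory` plays the role of the ‘perfectly measured data’ in panel (a). Refitting it from many starting values, as in the caption, would need an optimizer on top of this integrator.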
Example of a highly reduced factorial design for a computer experiment. The mechanistic model M(.) had 18 different input parameters (named along the plot's diagonal), and each was to be assessed at four different values in order to pick up possible nonlinear input–output patterns. A reduced factorial OMBR design was chosen, with only 128 different parameter value combinations. The figure shows that each of the 18 × 17/2 = 153 parameter pairs was sampled at all 16 (4 × 4) value combinations; only for the last pair (named FRO × OOC‐P) were 8 of the 16 possibilities left untested. (Reprinted with permission from Ref . Copyright 2014 Wiley)
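
The counting behind this caption, and the kind of pairwise coverage check the figure displays, can be reproduced with a short script. The sketch below does not construct an OMBR design. The helper `pair_coverage` and the small stand‐in design (a full four‐level factorial in three factors) are hypothetical and serve only to show how the check works.

```python
from itertools import combinations, product
from math import comb

n_params, n_levels, n_runs = 18, 4, 128

n_pairs = comb(n_params, 2)                  # 18 * 17 / 2 parameter pairs
combos_per_pair = n_levels ** 2              # 4 x 4 value combinations per pair
full_factorial_runs = n_levels ** n_params   # size of the unreduced design
runs_per_combo = n_runs / combos_per_pair    # average, if the design is balanced

def pair_coverage(design):
    """For each pair of columns, count the distinct level combinations present."""
    n_cols = len(design[0])
    return {(i, j): len({(run[i], run[j]) for run in design})
            for i, j in combinations(range(n_cols), 2)}

# Stand-in design (hypothetical): a full 4-level factorial in 3 factors.
toy_design = list(product(range(n_levels), repeat=3))
toy_coverage = pair_coverage(toy_design)

print(n_pairs, combos_per_pair, full_factorial_runs, runs_per_combo)
print(toy_coverage)
```

Applied to a real 128‐run design matrix, `pair_coverage` would return 16 for every fully sampled pair and a smaller count for any pair with untested combinations.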
Comparison of two mechanistic models via multivariate metamodeling. Two nonlinear spatiotemporal models (the Guccione and Costa laws of heart wall elasticity) were assessed: Simulation experiments were performed for each of the models separately, using parameter combinations based on experimental data from the literature. The input parameters of one of the models were a subset of the parameters of the other model. Because the two models mimicked the same spatiotemporal process, their outputs could also be merged. The joint input–output dataset was metamodeled by N‐way partial least squares regression (N‐PLSR). The figure shows the scores for the first three PLS components. The red and green surfaces are regions of PLS scores for the two laws. The input parameters and output phenotypes that contribute the most to the PLS components are shown as lines. (Reprinted with permission from Ref . Copyright 2013 Norwegian University of Life Sciences)
Metamodeling of a model with highly nonlinear input–output relations. The mechanistic model M(.) describes the mammalian circadian clock. The figure shows the global metamodel score plot for the first three partial least squares regression (PLSR) score vectors t1, t2, and t3, obtained by polynomial PLSR from the simulation data covering the full range of relevant parameter combinations. Two clusters (regions of different types of input–output relationships) are evident. In HC‐PLSR, subsequent local modeling of each cluster separately simplified the PLS regressions and allowed the highly nonlinear input–output relationships to be modeled more accurately. (Reprinted with permission from Ref . Copyright 2011 BioMed Central)
Model overview and model reduction by metamodeling. A two‐dimensional spatiotemporal model of cell differentiation was studied by image analysis and sensory descriptive analysis. The figure shows the result of inverse PLSR‐based metamodeling, relating model outputs X (cell configurations quantified by computerized image analysis) to model input parameters Y, for the first two PLS components. (a) Score plot showing the main systematic patterns among the simulations. The first two and most important PLSR components t1 and t2 represent the main X/Y covariance patterns. They are here seen in the X‐variables, in this case the image analysis output variables. Some typical examples of model output images illustrate the corresponding systematic output variation patterns in the cell differentiation. (b) The corresponding correlation loading plot, showing the two main patterns of intercorrelations within and between the input and output variables. The input variables (the model parameters Y) and the output cell differentiation descriptors (a number of variables obtained by automatic image analysis, X) are represented by red and blue symbols, respectively. The correlation loadings are the dimension‐free correlation coefficients of the individual X‐ and Y‐variables to PLSR score vectors t1 and t2. (Reprinted with permission from Ref , Supplementary material. Copyright 2009 BioMed Central)
A massive number of measured time series parameterized via multivariate metamodeling. A video camera was used for complex kinetic measurements of color development during silver staining of an electrophoresis gel. (a) Measurements: Time series measurements from a few of the >100,000 pixels showing color development in protein spots. (Reprinted with permission from Ref . Copyright 2012 Elsevier) (b) Simulations: Examples from the thousands of output curves generated by 40 alternative nonlinear growth curve models in extensive computer experiments. By direct look‐up (DLU) metamodeling, each of the >100,000 measured curves was quickly parameterized by each of the 40 nonlinear mathematical curve models. The overall best model was identified, and its parameter estimates were then used for quantitative spatiotemporal characterization of the whole video recording of the proteomic color development process. (Reprinted with permission from Ref . Copyright 2012 Elsevier)
Sensitivity analysis of a large, nonlinear dynamic model with input–output relationships that vary between regions in the parameter space: combined global and local sensitivity analysis by Hierarchical Cluster‐based PLSR. The effects of variations in the input parameters of a heart cell model on 104 selected model output phenotypes are shown for four local regions of the parameter space. These regions had been identified by cluster analysis of the score space of simulations from a preliminary global PLSR model. For each of the 104 selected output phenotypes (Y‐variables), the effects (represented by the regression coefficients in B_A) are compared vertically by connected curves, for region clusters 1, 2, 3, and 4. This is shown for each of the regressor terms (X‐variables) found to be most interesting (the four input parameters gKr, Cao, Nao, and Ko and the two pair‐wise interactions Ko × gK1 and Ko × gNa). Four of the output phenotypes (apd25, ctttp, ctdecayrate, and apttp) are highlighted by colored curves. Summarizing the sensitivities across the 104 phenotypes within each term and cluster, the middle dot shows the median, the thick horizontal line shows the interquartile range (IQR = Q3 − Q1), the thin line shows ‘whiskers’ extending to the smallest and the largest data point within 1.5 × IQR of the quartiles, while data points beyond that are marked by red dots. (Reprinted with permission from Ref . Copyright 2013 Elsevier)
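
The box‐plot style summary described at the end of this caption (median, IQR, whiskers, and outlying points) can be computed as in the sketch below. The function name `box_summary`, the example values, and the quartile convention (the ‘inclusive’ method of Python's `statistics.quantiles`) are assumptions; the caption does not say which quartile definition the figure uses.

```python
from statistics import median, quantiles

def box_summary(values, whisker=1.5):
    """Median, quartiles, IQR, whisker ends and outliers for one set of values."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")  # assumed quartile convention
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        # Whiskers reach the most extreme data points still inside the fences.
        "whiskers": (inside[0], inside[-1]),
        # Points beyond the fences correspond to the red dots in the figure.
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Hypothetical regression coefficients for one regressor term in one cluster.
example_sensitivities = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_summary(example_sensitivities)
print(summary)
```

In the figure, one such summary would be computed per regressor term and cluster, across the 104 output phenotypes.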
Faster computation of a complex model. (a) Slow mathematical model Outputs = M(Inputs): A finite element model of facial expressions, representing biomechanical simulations of facial expressions (from left to right) joy, sadness, snarl and the kissing gesture, as controlled by 18 input parameters. Average computation time: 2 h. (b) Fast classical metamodel Outputs ≈ C(Inputs) based on simulations according to the design in Figure . Average computation time for different versions of PLSR: <0.1 s. (Reprinted with permission from Ref . Copyright 2014 Wiley)