A review of automatic differentiation and its efficient implementation


Derivatives play a critical role in computational statistics, examples being Bayesian inference using Hamiltonian Monte Carlo sampling and the training of neural networks. Automatic differentiation (AD) is a powerful tool to automate the calculation of derivatives and is preferable to more traditional methods, especially when differentiating complex algorithms and mathematical functions. The implementation of AD, however, requires some care to ensure efficiency. Modern differentiation packages deploy a broad range of computational techniques to improve applicability, run time, and memory management. Among these techniques are operator overloading, region-based memory, and expression templates. There also exist several mathematical techniques which can yield high performance gains when applied to complex algorithms. For example, semi-analytical derivatives can reduce by orders of magnitude the run time required to numerically solve and differentiate an algebraic equation. Open and practical problems include the extension of current packages to provide more specialized routines, and finding optimal methods to perform higher-order differentiation.

This article is categorized under: Algorithmic Development > Scalable Statistical Methods
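As an illustration of operator overloading, the recording mechanism named in the abstract, here is a minimal reverse-mode AD sketch in Python. The Var class and the global tape are hypothetical teaching devices, not the API of any package reviewed in the article; production libraries support many more operators and manage their tape with techniques such as region-based memory.

```python
TAPE = []  # operation record, appended to during the forward sweep

class Var:
    """A scalar that records every operation applied to it (illustrative)."""
    def __init__(self, value):
        self.value, self.adjoint = value, 0.0

    def __add__(self, other):
        out = Var(self.value + other.value)
        # Store the new node together with its local partial derivatives.
        TAPE.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        TAPE.append((out, [(self, other.value), (other, self.value)]))
        return out

def backward(output):
    """Reverse sweep: walk the tape backwards and accumulate adjoints."""
    output.adjoint = 1.0
    for node, parents in reversed(TAPE):
        for parent, partial in parents:
            parent.adjoint += partial * node.adjoint

# f(x1, x2) = x1 * x2 + x1, so df/dx1 = x2 + 1 and df/dx2 = x1.
x1, x2 = Var(2.0), Var(3.0)
f = x1 * x2 + x1
backward(f)
print(f.value, x1.adjoint, x2.adjoint)  # 8.0 4.0 2.0
```

Because the tape is appended to in evaluation order, iterating over it in reverse visits each node only after all of its adjoint contributions have accumulated.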
Run time to solve and differentiate a system of algebraic equations. All three solvers deploy Powell's dogleg method, an iterative algorithm that uses gradients to find the root, y*, of a function f = f(y, θ). The standard solver (green) uses an analytical expression for J_y and automatic differentiation (AD) to differentiate the iterative algorithm. The super node solver (blue) also uses an analytical expression for J_y, but applies the implicit function theorem to compute derivatives. The built-in solver (red) uses AD to compute J_y and the implicit function theorem. The computer experiment is run 100 times, and the shaded areas represent the region encompassed by the 5th and 95th quantiles. The plot on the right shows the run time on a log scale for clarity. Plot generated with ggplot2 (Wickham)
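The super node strategy in the caption above can be sketched in a few lines: once a solver has found the root y* of f(y, θ) = 0, the implicit function theorem gives the sensitivities as dy*/dθ = -J_y^{-1} J_θ evaluated at y*, with no need to differentiate through the solver's iterations. The system below is a hypothetical stand-in for the one used in the article's experiment; scipy's 'hybr' method is a MINPACK implementation of Powell's hybrid method.

```python
import numpy as np
from scipy.optimize import root

# Hypothetical algebraic system f(y, theta) = y - theta * cos(y) = 0.
def f(y, theta):
    return y - theta * np.cos(y)

def J_y(y, theta):      # analytical Jacobian of f with respect to y
    return np.diag(1.0 + theta * np.sin(y))

def J_theta(y, theta):  # analytical Jacobian of f with respect to theta
    return np.diag(-np.cos(y))

theta = np.array([0.5, 1.0, 1.5])
y0 = np.zeros_like(theta)

# Solve f(y, theta) = 0 with a Powell-type hybrid method.
sol = root(lambda y: f(y, theta), y0,
           jac=lambda y: J_y(y, theta), method='hybr')
y_star = sol.x

# Implicit function theorem: dy*/dtheta = -J_y^{-1} J_theta at the root.
dy_dtheta = -np.linalg.solve(J_y(y_star, theta), J_theta(y_star, theta))
print(y_star)
print(dy_dtheta)
```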
Checkpointing with the store-all approach. Here, unlike in the recompute-all approach, the forward sweep records intermediate values at each checkpoint. The forward sweep can then be rerun between two checkpoints, as opposed to restarting from the input
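A minimal sketch of the store-all strategy, assuming the target function is supplied as a list of (function, vector-Jacobian product) segment pairs; the names grad_store_all, f, and vjp are hypothetical. Here vjp(x_in, g) stands in for re-running the forward sweep over one segment from its recorded input while taping, then reverse-sweeping that local tape.

```python
import math

def grad_store_all(segments, x, g_out):
    """Store-all checkpointing over f = segments[-1] o ... o segments[0]."""
    checkpoints = []
    y = x
    for f, _ in segments:       # forward sweep: record the state entering
        checkpoints.append(y)   # each segment (the checkpoints)
        y = f(y)
    g = g_out
    # Reverse sweep: each segment restarts from its own checkpoint,
    # never from the original input.
    for (_, vjp), x_in in zip(reversed(segments), reversed(checkpoints)):
        g = vjp(x_in, g)
    return g

# Tiny usage example: f(x) = exp(x**2), so df/dx = 2 * x * exp(x**2).
segments = [(lambda x: x * x, lambda x, g: 2 * x * g),
            (math.exp,        lambda x, g: math.exp(x) * g)]
print(grad_store_all(segments, 1.5, 1.0))  # 3 * exp(2.25), about 28.46
```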
Checkpointing with the recompute-all approach. In this sketch, the target function is broken into four segments, using three checkpoints (white nodes). The yellow and red nodes, respectively, represent the input and output of the function. During the forward sweep, we record only when the arrow is thick. The dotted arrow represents a reverse automatic differentiation (AD) sweep. After each reverse sweep, we start a new forward evaluation sweep from the input
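The recompute-all strategy can be sketched with the same hypothetical (function, vector-Jacobian product) segment convention as the store-all example above; the only state kept is the input itself.

```python
def grad_recompute_all(segments, x, g_out):
    """Recompute-all checkpointing: only the input x is stored."""
    g = g_out
    for i in reversed(range(len(segments))):
        # Fresh forward evaluation sweep from the input, with no taping,
        # to recover the state entering segment i.
        y = x
        for f, _ in segments[:i]:
            y = f(y)
        # Tape and reverse-sweep segment i alone.
        _, vjp = segments[i]
        g = vjp(y, g)
    return g
```

Relative to store-all, this trades extra forward evaluations for a smaller memory footprint.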
Expression graph for the log-normal density. The graph is generated by the computer code for Equation 5. Each node represents a variable, labeled v1 through v10, which is calculated by applying a mathematical operator to variables lower on the expression graph. The top node (v10) is the output variable and the gray nodes are the input variables for which we require sensitivities. The arrows represent the flow of information when we numerically evaluate the log-normal density. (Reprinted with permission from a figure of Carpenter et al. Copyright: https://arxiv.org/licenses/nonexclusive-distrib/1.0/license.html)
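To make the forward and reverse sweeps over such a graph concrete, the sketch below evaluates the log of the normal density, f(y, μ, σ) = -(y - μ)²/(2σ²) - log σ - log(2π)/2, node by node and then propagates adjoints back to the inputs. The decomposition into intermediate variables is illustrative and need not match the article's v1 through v10 labeling.

```python
import math

y, mu, sigma = 1.3, 0.5, 1.2

# Forward evaluation sweep: each node applies one operator to nodes
# lower in the graph.
v1 = y - mu                               # subtraction
v2 = v1 / sigma                           # division
v3 = v2 * v2                              # square
v4 = -0.5 * v3
v5 = math.log(sigma)
v6 = v4 - v5
v7 = v6 - 0.5 * math.log(2 * math.pi)     # output node

# Reverse sweep: propagate adjoints from the output to the inputs.
a7 = 1.0
a6 = a7                                   # v7 = v6 - const
a5 = -a6                                  # v6 = v4 - v5
a4 = a6
a3 = -0.5 * a4                            # v4 = -0.5 * v3
a2 = 2.0 * v2 * a3                        # v3 = v2 * v2
a1 = a2 / sigma                           # v2 = v1 / sigma

d_y = a1                                  # v1 = y - mu
d_mu = -a1
d_sigma = -a2 * v1 / sigma**2 + a5 / sigma
print(d_y, d_mu, d_sigma)  # d_mu matches (y - mu)/sigma**2, etc.
```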
Run time to solve and differentiate a system of algebraic equations. Here we compare only the solvers that use the implicit function theorem. The “super node + analytical” solver uses an analytical expression for J_y both to solve the equation and to compute sensitivities. The built-in solver uses automatic differentiation (AD) to compute J_y. The former is approximately two to three times faster. The computer experiment is run 100 times, and the shaded areas represent the region encompassed by the 5th and 95th quantiles. Plot generated with ggplot2 (Wickham)
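For the built-in variant, where J_y comes from AD rather than a hand-derived expression, a minimal forward-mode sketch using dual numbers suffices; the Dual class and helper names are hypothetical, and f is the same stand-in system as in the earlier solver sketch.

```python
import math

class Dual:
    """Minimal forward-mode AD scalar: a value paired with a derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def cos_d(x):  # cosine extended to dual numbers
    return Dual(math.cos(x.val), -math.sin(x.val) * x.dot)

def jacobian_y(y, theta):
    """Assemble J_y column by column: seed one component of y at a time."""
    n = len(y)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        yd = [Dual(y[i], 1.0 if i == j else 0.0) for i in range(n)]
        fd = [yd[i] - Dual(theta[i]) * cos_d(yd[i]) for i in range(n)]
        for i in range(n):
            J[i][j] = fd[i].dot
    return J

# For f(y, theta) = y - theta * cos(y), this reproduces the analytical
# Jacobian diag(1 + theta * sin(y)).
print(jacobian_y([0.7, 0.9], [0.5, 1.0]))
```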
