Miscellaneous Ops

The pyro.ops module implements tensor utilities that are mostly independent of the rest of Pyro.

Utilities for HMC

class DualAveraging(prox_center=0, t0=10, kappa=0.75, gamma=0.05)[source]

Bases: object

Dual Averaging is a scheme to solve convex optimization problems. It belongs to a class of subgradient methods which uses subgradients to update parameters (in primal space) of a model. Under some conditions, the averages of generated parameters during the scheme are guaranteed to converge to an optimal value. However, a counter-intuitive aspect of traditional subgradient methods is “new subgradients enter the model with decreasing weights” (see \([1]\)). Dual Averaging scheme solves that phenomenon by updating parameters using weights equally for subgradients (which lie in a dual space), hence we have the name “dual averaging”.

This class implements a dual averaging scheme which is adapted for Markov chain Monte Carlo (MCMC) algorithms. To be more precise, we will replace subgradients by some statistics calculated during an MCMC trajectory. In addition, introducing some free parameters such as t0 and kappa is helpful and still guarantees the convergence of the scheme.

References

[1] Primal-dual subgradient methods for convex problems, Yurii Nesterov

[2] The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo, Matthew D. Hoffman, Andrew Gelman

Parameters:
  • prox_center (float) – A “prox-center” parameter introduced in \([1]\) which pulls the primal sequence towards it.
  • t0 (float) – A free parameter introduced in \([2]\) that stabilizes the initial steps of the scheme.
  • kappa (float) – A free parameter introduced in \([2]\) that controls the weights of steps of the scheme. For a small kappa, the scheme will quickly forget states from early steps. This should be a number in \((0.5, 1]\).
  • gamma (float) – A free parameter which controls the speed of the convergence of the scheme.
reset()[source]
step(g)[source]

Updates states of the scheme given a new statistic/subgradient g.

Parameters:g (float) – A statistic calculated during an MCMC trajectory or subgradient.
get_state()[source]

Returns the latest \(x_t\) and average of \(\left\{x_i\right\}_{i=1}^t\) in primal space.

velocity_verlet(z, r, potential_fn, inverse_mass_matrix, step_size, num_steps=1, z_grads=None)[source]

Second order symplectic integrator that uses the velocity verlet algorithm.

Parameters:
  • z (dict) – dictionary of sample site names and their current values (type Tensor).
  • r (dict) – dictionary of sample site names and corresponding momenta (type Tensor).
  • potential_fn (callable) – function that returns potential energy given z for each sample site. The negative gradient of the function with respect to z determines the rate of change of the corresponding sites’ momenta r.
  • inverse_mass_matrix (torch.Tensor) – a tensor \(M^{-1}\) which is used to calculate kinetic energy: \(E_{kinetic} = \frac{1}{2}z^T M^{-1} z\). Here \(M\) can be a 1D tensor (diagonal matrix) or a 2D tensor (dense matrix).
  • step_size (float) – step size for each time step iteration.
  • num_steps (int) – number of discrete time steps over which to integrate.
  • z_grads (torch.Tensor) – optional gradients of potential energy at current z.
Return tuple (z_next, r_next, z_grads, potential_energy):
 

next position and momenta, together with the potential energy and its gradient w.r.t. z_next.

class WelfordCovariance(diagonal=True)[source]

Bases: object

Implements Welford’s online scheme for estimating (co)variance (see \([1]\)). Useful for adapting diagonal and dense mass structures for HMC.

References

[1] The Art of Computer Programming, Donald E. Knuth

reset()[source]
update(sample)[source]
get_covariance(regularize=True)[source]

Newton Optimizers

newton_step(loss, x, trust_radius=None)[source]

Performs a Newton update step to minimize loss on a batch of variables, optionally constraining to a trust region [1].

This is especially usful because the final solution of newton iteration is differentiable wrt the inputs, even when all but the final x is detached, due to this method’s quadratic convergence [2]. loss must be twice-differentiable as a function of x. If loss is 2+d-times differentiable, then the return value of this function is d-times differentiable.

When loss is interpreted as a negative log probability density, then the return values mode,cov of this function can be used to construct a Laplace approximation MultivariateNormal(mode,cov).

Warning

Take care to detach the result of this function when used in an optimization loop. If you forget to detach the result of this function during optimization, then backprop will propagate through the entire iteration process, and worse will compute two extra derivatives for each step.

Example use inside a loop:

x = torch.zeros(1000, 2)  # arbitrary initial value
for step in range(100):
    x = x.detach()          # block gradients through previous steps
    x.requires_grad = True  # ensure loss is differentiable wrt x
    loss = my_loss_function(x)
    x = newton_step(loss, x, trust_radius=1.0)
# the final x is still differentiable
[1] Yuan, Ya-xiang. Iciam. Vol. 99. 2000.
“A review of trust region algorithms for optimization.” ftp://ftp.cc.ac.cn/pub/yyx/papers/p995.pdf
[2] Christianson, Bruce. Optimization Methods and Software 3.4 (1994)
“Reverse accumulation and attractive fixed points.” http://uhra.herts.ac.uk/bitstream/handle/2299/4338/903839.pdf
Parameters:
  • loss (torch.Tensor) – A scalar function of x to be minimized.
  • x (torch.Tensor) – A dependent variable of shape (N, D) where N is the batch size and D is a small number.
  • trust_radius (float) – An optional trust region trust_radius. The updated value mode of this function will be within trust_radius of the input x.
Returns:

A pair (mode, cov) where mode is an updated tensor of the same shape as the original value x, and cov is an esitmate of the covariance DxD matrix with cov.shape == x.shape[:-1] + (D,D).

Return type:

tuple

newton_step_1d(loss, x, trust_radius=None)[source]

Performs a Newton update step to minimize loss on a batch of 1-dimensional variables, optionally regularizing to constrain to a trust region.

See newton_step() for details.

Parameters:
  • loss (torch.Tensor) – A scalar function of x to be minimized.
  • x (torch.Tensor) – A dependent variable with rightmost size of 1.
  • trust_radius (float) – An optional trust region trust_radius. The updated value mode of this function will be within trust_radius of the input x.
Returns:

A pair (mode, cov) where mode is an updated tensor of the same shape as the original value x, and cov is an esitmate of the covariance 1x1 matrix with cov.shape == x.shape[:-1] + (1,1).

Return type:

tuple

newton_step_2d(loss, x, trust_radius=None)[source]

Performs a Newton update step to minimize loss on a batch of 2-dimensional variables, optionally regularizing to constrain to a trust region.

See newton_step() for details.

Parameters:
  • loss (torch.Tensor) – A scalar function of x to be minimized.
  • x (torch.Tensor) – A dependent variable with rightmost size of 2.
  • trust_radius (float) – An optional trust region trust_radius. The updated value mode of this function will be within trust_radius of the input x.
Returns:

A pair (mode, cov) where mode is an updated tensor of the same shape as the original value x, and cov is an esitmate of the covariance 2x2 matrix with cov.shape == x.shape[:-1] + (2,2).

Return type:

tuple

newton_step_3d(loss, x, trust_radius=None)[source]

Performs a Newton update step to minimize loss on a batch of 3-dimensional variables, optionally regularizing to constrain to a trust region.

See newton_step() for details.

Parameters:
  • loss (torch.Tensor) – A scalar function of x to be minimized.
  • x (torch.Tensor) – A dependent variable with rightmost size of 2.
  • trust_radius (float) – An optional trust region trust_radius. The updated value mode of this function will be within trust_radius of the input x.
Returns:

A pair (mode, cov) where mode is an updated tensor of the same shape as the original value x, and cov is an esitmate of the covariance 3x3 matrix with cov.shape == x.shape[:-1] + (3,3).

Return type:

tuple

Tensor Contraction

contract_expression(equation, *shapes, **kwargs)[source]

Wrapper around opt_einsum.contract_expression() that optionally uses Pyro’s cheap optimizer and optionally caches contraction paths.

Parameters:cache_path (bool) – whether to cache the contraction path. Defaults to True.
contract(equation, *operands, **kwargs)[source]

Wrapper around opt_einsum.contract() that optionally uses Pyro’s cheap optimizer and optionally caches contraction paths.

Parameters:cache_path (bool) – whether to cache the contraction path. Defaults to True.
ubersum(equation, *operands, **kwargs)[source]

Generalized batched sum-product algorithm via tensor message passing.

This generalizes contract() in two ways:

  1. Multiple outputs are allowed, and intermediate results can be shared.
  2. Inputs and outputs can be batched along symbols given in batch_dims; reductions along batch_dims are product reductions.

The best way to understand this function is to try the examples below, which show how ubersum() calls can be implemented as multiple calls to contract() (which is generally more expensive).

To illustrate multiple outputs, note that the following are equivalent:

z1, z2, z3 = ubersum('ab,bc->a,b,c', x, y)  # multiple outputs

backend = 'pyro.ops.einsum.torch_log'
z1 = contract('ab,bc->a', x, y, backend=backend)
z2 = contract('ab,bc->b', x, y, backend=backend)
z3 = contract('ab,bc->c', x, y, backend=backend)

To illustrate batched inputs, note that the following are equivalent:

assert len(x) == 3 and len(y) == 3
z = ubersum('ab,ai,bi->b', w, x, y, batch_dims='i')

z = contract('ab,a,a,a,b,b,b->b', w, *x, *y, backend=backend)

When a sum dimension a always appears with a batch dimension i, then a corresponds to a distinct symbol for each slice of a. Thus the following are equivalent:

assert len(x) == 3 and len(y) == 3
z = ubersum('ai,ai->', x, y, batch_dims='i')

z = contract('a,b,c,a,b,c->', *x, *y, backend=backend)

When such a sum dimension appears in the output, it must be accompanied by all of its batch dimensions, e.g. the following are equivalent:

assert len(x) == 3 and len(y) == 3
z = ubersum('abi,abi->bi', x, y, batch_dims='i')

z0 = contract('ab,ac,ad,ab,ac,ad->b', *x, *y, backend=backend)
z1 = contract('ab,ac,ad,ab,ac,ad->c', *x, *y, backend=backend)
z2 = contract('ab,ac,ad,ab,ac,ad->d', *x, *y, backend=backend)
z = torch.stack([z0, z1, z2])

Note that each batch slice through the output is multilinear in all batch slices through all inptus, thus e.g. batch matrix multiply would be implemented without batch_dims, so the following are all equivalent:

xy = ubersum('abc,acd->abd', x, y, batch_dims='')
xy = torch.stack([xa.mm(ya) for xa, ya in zip(x, y)])
xy = torch.bmm(x, y)

Among all valid equations, some computations are polynomial in the sizes of the input tensors and other computations are exponential in the sizes of the input tensors. This function raises NotImplementedError whenever the computation is exponential.

Parameters:
  • equation (str) – An einsum equation, optionally with multiple outputs.
  • operands (torch.Tensor) – A collection of tensors.
  • batch_dims (str) – An optional string of batch dims.
  • cache (dict) – An optional shared_intermediates() cache.
  • modulo_total (bool) – Optionally allow ubersum to arbitrarily scale each result batch, which can significantly reduce computation. This is safe to set whenever each result batch denotes a nonnormalized probability distribution whose total is not of interest.
Returns:

a tuple of tensors of requested shape, one entry per output.

Return type:

tuple

Raises:
  • ValueError – if tensor sizes mismatch or an output requests a batched dim without that dim’s batch dims.
  • NotImplementedError – if contraction would have cost exponential in the size of any input tensor.

Statistical Utilities

gelman_rubin(input, chain_dim=0, sample_dim=1)[source]

Computes R-hat over chains of samples. It is required that input.size(sample_dim) >= 2 and input.size(chain_dim) >= 2.

Parameters:
  • input (torch.Tensor) – the input tensor.
  • chain_dim (int) – the chain dimension.
  • sample_dim (int) – the sample dimension.
Returns torch.Tensor:
 

R-hat of input.

split_gelman_rubin(input, chain_dim=0, sample_dim=1)[source]

Computes R-hat over chains of samples. It is required that input.size(sample_dim) >= 4.

Parameters:
  • input (torch.Tensor) – the input tensor.
  • chain_dim (int) – the chain dimension.
  • sample_dim (int) – the sample dimension.
Returns torch.Tensor:
 

split R-hat of input.

autocorrelation(input, dim=0)[source]

Computes the autocorrelation of samples at dimension dim.

Reference: https://en.wikipedia.org/wiki/Autocorrelation#Efficient_computation

Parameters:
  • input (torch.Tensor) – the input tensor.
  • dim (int) – the dimension to calculate autocorrelation.
Returns torch.Tensor:
 

autocorrelation of input.

autocovariance(input, dim=0)[source]

Computes the autocovariance of samples at dimension dim.

Parameters:
  • input (torch.Tensor) – the input tensor.
  • dim (int) – the dimension to calculate autocorrelation.
Returns torch.Tensor:
 

autocorrelation of input.

effective_sample_size(input, chain_dim=0, sample_dim=1)[source]

Computes effective sample size of input.

Reference:

[1] Introduction to Markov Chain Monte Carlo,
Charles J. Geyer
[2] Stan Reference Manual version 2.18,
Stan Development Team
Parameters:
  • input (torch.Tensor) – the input tensor.
  • chain_dim (int) – the chain dimension.
  • sample_dim (int) – the sample dimension.
Returns torch.Tensor:
 

effective sample size of input.

resample(input, num_samples, dim=0, replacement=False)[source]

Draws num_samples samples from input at dimension dim.

Parameters:
  • input (torch.Tensor) – the input tensor.
  • num_samples (int) – the number of samples to draw from input.
  • dim (int) – dimension to draw from input.
Returns torch.Tensor:
 

samples drawn randomly from input.

quantile(input, probs, dim=0)[source]

Computes quantiles of input at probs. If probs is a scalar, the output will be squeezed at dim.

Parameters:
  • input (torch.Tensor) – the input tensor.
  • probs (list) – quantile positions.
  • dim (int) – dimension to take quantiles from input.
Returns torch.Tensor:
 

quantiles of input at probs.

pi(input, prob, dim=0)[source]

Computes percentile interval which assigns equal probability mass to each tail of the interval.

Parameters:
  • input (torch.Tensor) – the input tensor.
  • prob (float) – the probability mass of samples within the interval.
  • dim (int) – dimension to calculate percentile interval from input.
Returns torch.Tensor:
 

quantiles of input at probs.

hpdi(input, prob, dim=0)[source]

Computes “highest posterior density interval” which is the narrowest interval with probability mass prob.

Parameters:
  • input (torch.Tensor) – the input tensor.
  • prob (float) – the probability mass of samples within the interval.
  • dim (int) – dimension to calculate percentile interval from input.
Returns torch.Tensor:
 

quantiles of input at probs.

waic(input, log_weights=None, pointwise=False, dim=0)[source]

Computes “Widely Applicable/Watanabe-Akaike Information Criterion” (WAIC) and its corresponding effective number of parameters.

Reference:

[1] WAIC and cross-validation in Stan, Aki Vehtari, Andrew Gelman

Parameters:
  • input (torch.Tensor) – the input tensor, which is log likelihood of a model.
  • log_weights (torch.Tensor) – weights of samples along dim.
  • dim (int) – the sample dimension of input.
Returns tuple:

tuple of WAIC and effective number of parameters.