Miscellaneous Ops¶
The pyro.ops
module implements tensor utilities
that are mostly independent of the rest of Pyro.
Utilities for HMC¶

class
DualAveraging
(prox_center=0, t0=10, kappa=0.75, gamma=0.05)[source]¶ Bases:
object
Dual Averaging is a scheme to solve convex optimization problems. It belongs to a class of subgradient methods which uses subgradients to update parameters (in primal space) of a model. Under some conditions, the averages of generated parameters during the scheme are guaranteed to converge to an optimal value. However, a counterintuitive aspect of traditional subgradient methods is “new subgradients enter the model with decreasing weights” (see \([1]\)). Dual Averaging scheme solves that phenomenon by updating parameters using weights equally for subgradients (which lie in a dual space), hence we have the name “dual averaging”.
This class implements a dual averaging scheme which is adapted for Markov chain Monte Carlo (MCMC) algorithms. To be more precise, we will replace subgradients by some statistics calculated during an MCMC trajectory. In addition, introducing some free parameters such as
t0
andkappa
is helpful and still guarantees the convergence of the scheme.References
[1] Primaldual subgradient methods for convex problems, Yurii Nesterov
[2] The NoUturn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo, Matthew D. Hoffman, Andrew Gelman
Parameters:  prox_center (float) – A “proxcenter” parameter introduced in \([1]\) which pulls the primal sequence towards it.
 t0 (float) – A free parameter introduced in \([2]\) that stabilizes the initial steps of the scheme.
 kappa (float) – A free parameter introduced in \([2]\)
that controls the weights of steps of the scheme.
For a small
kappa
, the scheme will quickly forget states from early steps. This should be a number in \((0.5, 1]\).  gamma (float) – A free parameter which controls the speed of the convergence of the scheme.

velocity_verlet
(z, r, potential_fn, inverse_mass_matrix, step_size, num_steps=1, z_grads=None)[source]¶ Second order symplectic integrator that uses the velocity verlet algorithm.
Parameters:  z (dict) – dictionary of sample site names and their current values
(type
Tensor
).  r (dict) – dictionary of sample site names and corresponding momenta
(type
Tensor
).  potential_fn (callable) – function that returns potential energy given z
for each sample site. The negative gradient of the function with respect
to
z
determines the rate of change of the corresponding sites’ momentar
.  inverse_mass_matrix (torch.Tensor) – a tensor \(M^{1}\) which is used to calculate kinetic energy: \(E_{kinetic} = \frac{1}{2}z^T M^{1} z\). Here \(M\) can be a 1D tensor (diagonal matrix) or a 2D tensor (dense matrix).
 step_size (float) – step size for each time step iteration.
 num_steps (int) – number of discrete time steps over which to integrate.
 z_grads (torch.Tensor) – optional gradients of potential energy at current
z
.
Return tuple (z_next, r_next, z_grads, potential_energy): next position and momenta, together with the potential energy and its gradient w.r.t.
z_next
. z (dict) – dictionary of sample site names and their current values
(type

potential_grad
(potential_fn, z)[source]¶ Gradient of potential_fn w.r.t. parameters z.
Parameters:  potential_fn – python callable that takes in a dictionary of parameters and returns the potential energy.
 z (dict) – dictionary of parameter values keyed by site name.
Returns: tuple of (z_grads, potential_energy), where z_grads is a dictionary with the same keys as z containing gradients and potential_energy is a torch scalar.
Newton Optimizers¶

newton_step
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of variables, optionally constraining to a trust region [1].
This is especially usful because the final solution of newton iteration is differentiable wrt the inputs, even when all but the final
x
is detached, due to this method’s quadratic convergence [2].loss
must be twicedifferentiable as a function ofx
. Ifloss
is2+d
times differentiable, then the return value of this function isd
times differentiable.When
loss
is interpreted as a negative log probability density, then the return valuesmode,cov
of this function can be used to construct a Laplace approximationMultivariateNormal(mode,cov)
.Warning
Take care to detach the result of this function when used in an optimization loop. If you forget to detach the result of this function during optimization, then backprop will propagate through the entire iteration process, and worse will compute two extra derivatives for each step.
Example use inside a loop:
x = torch.zeros(1000, 2) # arbitrary initial value for step in range(100): x = x.detach() # block gradients through previous steps x.requires_grad = True # ensure loss is differentiable wrt x loss = my_loss_function(x) x = newton_step(loss, x, trust_radius=1.0) # the final x is still differentiable
 [1] Yuan, Yaxiang. Iciam. Vol. 99. 2000.
 “A review of trust region algorithms for optimization.” ftp://ftp.cc.ac.cn/pub/yyx/papers/p995.pdf
 [2] Christianson, Bruce. Optimization Methods and Software 3.4 (1994)
 “Reverse accumulation and attractive fixed points.” http://uhra.herts.ac.uk/bitstream/handle/2299/4338/903839.pdf
Parameters:  loss (torch.Tensor) – A scalar function of
x
to be minimized.  x (torch.Tensor) – A dependent variable of shape
(N, D)
whereN
is the batch size andD
is a small number.  trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance DxD matrix withcov.shape == x.shape[:1] + (D,D)
.Return type:

newton_step_1d
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of 1dimensional variables, optionally regularizing to constrain to a trust region.
See
newton_step()
for details.Parameters:  loss (torch.Tensor) – A scalar function of
x
to be minimized.  x (torch.Tensor) – A dependent variable with rightmost size of 1.
 trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance 1x1 matrix withcov.shape == x.shape[:1] + (1,1)
.Return type:  loss (torch.Tensor) – A scalar function of

newton_step_2d
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of 2dimensional variables, optionally regularizing to constrain to a trust region.
See
newton_step()
for details.Parameters:  loss (torch.Tensor) – A scalar function of
x
to be minimized.  x (torch.Tensor) – A dependent variable with rightmost size of 2.
 trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance 2x2 matrix withcov.shape == x.shape[:1] + (2,2)
.Return type:  loss (torch.Tensor) – A scalar function of

newton_step_3d
(loss, x, trust_radius=None)[source]¶ Performs a Newton update step to minimize loss on a batch of 3dimensional variables, optionally regularizing to constrain to a trust region.
See
newton_step()
for details.Parameters:  loss (torch.Tensor) – A scalar function of
x
to be minimized.  x (torch.Tensor) – A dependent variable with rightmost size of 2.
 trust_radius (float) – An optional trust region trust_radius. The
updated value
mode
of this function will be withintrust_radius
of the inputx
.
Returns: A pair
(mode, cov)
wheremode
is an updated tensor of the same shape as the original valuex
, andcov
is an esitmate of the covariance 3x3 matrix withcov.shape == x.shape[:1] + (3,3)
.Return type:  loss (torch.Tensor) – A scalar function of
Tensor Indexing¶

vindex
(tensor, args)[source]¶ Vectorized advanced indexing with broadcasting semantics.
See also the convenience wrapper
Vindex
.This is useful for writing indexing code that is compatible with batching and enumeration, especially for selecting mixture components with discrete random variables.
For example suppose
x
is a parameter withx.dim() == 3
and we wish to generalize the expressionx[i, :, j]
from integeri,j
to tensorsi,j
with batch dims and enum dims (but no event dims). Then we can write the generalize version usingVindex
xij = Vindex(x)[i, :, j] batch_shape = broadcast_shape(i.shape, j.shape) event_shape = (x.size(1),) assert xij.shape == batch_shape + event_shape
To handle the case when
x
may also contain batch dimensions (e.g. ifx
was sampled in a plated context as when using vectorized particles),vindex()
uses the special convention thatEllipsis
denotes batch dimensions (hence...
can appear only on the left, never in the middle or in the right). Supposex
has event dim 3. Then we can write:old_batch_shape = x.shape[:3] old_event_shape = x.shape[3:] xij = Vindex(x)[..., i, :, j] # The ... denotes unknown batch shape. new_batch_shape = broadcast_shape(old_batch_shape, i.shape, j.shape) new_event_shape = (x.size(1),) assert xij.shape = new_batch_shape + new_event_shape
Note that this special handling of
Ellipsis
differs from the NEP [1].Formally, this function assumes:
 Each arg is either
Ellipsis
,slice(None)
, an integer, or a batchedtorch.LongTensor
(i.e. with empty event shape). This function does not support Nontrivial slices ortorch.BoolTensor
masks.Ellipsis
can only appear on the left asargs[0]
.  If
args[0] is not Ellipsis
thentensor
is not batched, and its event dim is equal tolen(args)
.  If
args[0] is Ellipsis
thentensor
is batched and its event dim is equal tolen(args[1:])
. Dims oftensor
to the left of the event dims are considered batch dims and will be broadcasted with dims of tensor args.
Note that if none of the args is a tensor with
.dim() > 0
, then this function behaves like standard indexing:if not any(isinstance(a, torch.Tensor) and a.dim() for a in args): assert Vindex(x)[args] == x[args]
References
 [1] https://www.numpy.org/neps/nep0021advancedindexing.html
 introduces
vindex
as a helper for vectorized indexing. The Pyro implementation is similar to the proposed notationx.vindex[]
except for slightly different handling ofEllipsis
.
Parameters:  tensor (torch.Tensor) – A tensor to be indexed.
 args (tuple) – An index, as args to
__getitem__
.
Returns: A nonstandard interpetation of
tensor[args]
.Return type:  Each arg is either

class
Vindex
(tensor)[source]¶ Bases:
object
Convenience wrapper around
vindex()
.The following are equivalent:
Vindex(x)[..., i, j, :] vindex(x, (Ellipsis, i, j, slice(None)))
Parameters: tensor (torch.Tensor) – A tensor to be indexed. Returns: An object with a special __getitem__()
method.
Tensor Contraction¶

contract_expression
(equation, *shapes, **kwargs)[source]¶ Wrapper around
opt_einsum.contract_expression()
that optionally uses Pyro’s cheap optimizer and optionally caches contraction paths.Parameters: cache_path (bool) – whether to cache the contraction path. Defaults to True.

contract
(equation, *operands, **kwargs)[source]¶ Wrapper around
opt_einsum.contract()
that optionally uses Pyro’s cheap optimizer and optionally caches contraction paths.Parameters: cache_path (bool) – whether to cache the contraction path. Defaults to True.

einsum
(equation, *operands, **kwargs)[source]¶ Generalized plated sumproduct algorithm via tensor variable elimination.
This generalizes
contract()
in two ways: Multiple outputs are allowed, and intermediate results can be shared.
 Inputs and outputs can be plated along symbols given in
plates
; reductions alongplates
are product reductions.
The best way to understand this function is to try the examples below, which show how
einsum()
calls can be implemented as multiple calls tocontract()
(which is generally more expensive).To illustrate multiple outputs, note that the following are equivalent:
z1, z2, z3 = einsum('ab,bc>a,b,c', x, y) # multiple outputs z1 = contract('ab,bc>a', x, y) z2 = contract('ab,bc>b', x, y) z3 = contract('ab,bc>c', x, y)
To illustrate plated inputs, note that the following are equivalent:
assert len(x) == 3 and len(y) == 3 z = einsum('ab,ai,bi>b', w, x, y, plates='i') z = contract('ab,a,a,a,b,b,b>b', w, *x, *y)
When a sum dimension a always appears with a plate dimension i, then a corresponds to a distinct symbol for each slice of a. Thus the following are equivalent:
assert len(x) == 3 and len(y) == 3 z = einsum('ai,ai>', x, y, plates='i') z = contract('a,b,c,a,b,c>', *x, *y)
When such a sum dimension appears in the output, it must be accompanied by all of its plate dimensions, e.g. the following are equivalent:
assert len(x) == 3 and len(y) == 3 z = einsum('abi,abi>bi', x, y, plates='i') z0 = contract('ab,ac,ad,ab,ac,ad>b', *x, *y) z1 = contract('ab,ac,ad,ab,ac,ad>c', *x, *y) z2 = contract('ab,ac,ad,ab,ac,ad>d', *x, *y) z = torch.stack([z0, z1, z2])
Note that each plate slice through the output is multilinear in all plate slices through all inptus, thus e.g. batch matrix multiply would be implemented without
plates
, so the following are all equivalent:xy = einsum('abc,acd>abd', x, y, plates='') xy = torch.stack([xa.mm(ya) for xa, ya in zip(x, y)]) xy = torch.bmm(x, y)
Among all valid equations, some computations are polynomial in the sizes of the input tensors and other computations are exponential in the sizes of the input tensors. This function raises
NotImplementedError
whenever the computation is exponential.Parameters:  equation (str) – An einsum equation, optionally with multiple outputs.
 operands (torch.Tensor) – A collection of tensors.
 plates (str) – An optional string of plate symbols.
 backend (str) – An optional einsum backend, defaults to ‘torch’.
 cache (dict) – An optional
shared_intermediates()
cache.  modulo_total (bool) – Optionally allow einsum to arbitrarily scale each result plate, which can significantly reduce computation. This is safe to set whenever each result plate denotes a nonnormalized probability distribution whose total is not of interest.
Returns: a tuple of tensors of requested shape, one entry per output.
Return type: Raises:  ValueError – if tensor sizes mismatch or an output requests a plated dim without that dim’s plates.
 NotImplementedError – if contraction would have cost exponential in the size of any input tensor.
Gaussian Contraction¶

class
Gaussian
(log_normalizer, info_vec, precision)[source]¶ Bases:
object
Nonnormalized Gaussian distribution.
This represents an arbitrary semidefinite quadratic function, which can be interpreted as a rankdeficient scaled Gaussian distribution. The precision matrix may have zero eigenvalues, thus it may be impossible to work directly with the covariance matrix.
Parameters:  log_normalizer (torch.Tensor) – a normalization constant, which is mainly used to keep track of normalization terms during contractions.
 info_vec (torch.Tensor) – information vector, which is a scaled version of the mean
info_vec = precision @ mean
. We use this represention to make gaussian contraction fast and stable.  precision (torch.Tensor) – precision matrix of this gaussian.

log_density
(value)[source]¶ Evaluate the log density of this Gaussian at a point value:
0.5 * value.T @ precision @ value + value.T @ info_vec + log_normalizer
This is mainly used for testing.

condition
(value)[source]¶ Condition this Gaussian on a trailing subset of its state. This should satisfy:
g.condition(y).dim() == g.dim()  y.size(1)
Note that since this is a nonnormalized Gaussian, we include the density of
y
in the result. Thuscondition()
is similar to afunctools.partial
binding of arguments:left = x[..., :n] right = x[..., n:] g.log_density(x) == g.condition(right).log_density(left)

marginalize
(left=0, right=0)[source]¶ Marginalizing out variables on either side of the event dimension:
g.marginalize(left=n).event_logsumexp() = g.logsumexp() g.marginalize(right=n).event_logsumexp() = g.logsumexp()
and for data
x
: g.condition(x).event_logsumexp()
 = g.marginalize(left=g.dim()  x.size(1)).log_density(x)

class
AffineNormal
(matrix, loc, scale)[source]¶ Bases:
object
Represents a conditional diagonal normal distribution over a random variable
Y
whose mean is an affine function of a random variableX
. The likelihood ofX
is thus:AffineNormal(matrix, loc, scale).condition(y).log_density(x)
which is equivalent to:
Normal(x @ matrix + loc, scale).to_event(1).log_prob(y)
Parameters:  matrix (torch.Tensor) – A transformation from
X
toY
. Should have rightmost shape(x_dim, y_dim)
.  loc (torch.Tensor) – A constant offset for
Y
’s mean. Should have rightmost shape(y_dim,)
.  scale (torch.Tensor) – Standard deviation for
Y
. Should have rightmost shape(y_dim,)
.

condition
(value)[source]¶ Condition on a
Y
value.Parameters: value (torch.Tensor) – A value of Y
.Return Gaussian: A gaussian likelihood over X
.
 matrix (torch.Tensor) – A transformation from

mvn_to_gaussian
(mvn)[source]¶ Convert a MultivaiateNormal distribution to a Gaussian.
Parameters: mvn (MultivariateNormal) – A multivariate normal distribution. Returns: An equivalent Gaussian object. Return type: Gaussian

matrix_and_mvn_to_gaussian
(matrix, mvn)[source]¶ Convert a noisy affine function to a Gaussian. The noisy affine function is defined as:
y = x @ matrix + mvn.sample()
Parameters:  matrix (Tensor) – A matrix with rightmost shape
(x_dim, y_dim)
.  mvn (MultivariateNormal) – A multivariate normal distribution.
Returns: A Gaussian with broadcasted batch shape and
.dim() == x_dim + y_dim
.Return type:  matrix (Tensor) – A matrix with rightmost shape

gaussian_tensordot
(x, y, dims=0)[source]¶ Computes the integral over two gaussians:
(x @ y)(a,c) = log(integral(exp(x(a,b) + y(b,c)), b)),where x is a gaussian over variables (a,b), y is a gaussian over variables (b,c), (a,b,c) can each be sets of zero or more variables, and dims is the size of b.
Parameters:  x – a Gaussian instance
 y – a Gaussian instance
 dims – number of variables to contract
Statistical Utilities¶

gelman_rubin
(input, chain_dim=0, sample_dim=1)[source]¶ Computes Rhat over chains of samples. It is required that
input.size(sample_dim) >= 2
andinput.size(chain_dim) >= 2
.Parameters:  input (torch.Tensor) – the input tensor.
 chain_dim (int) – the chain dimension.
 sample_dim (int) – the sample dimension.
Returns torch.Tensor: Rhat of
input
.

split_gelman_rubin
(input, chain_dim=0, sample_dim=1)[source]¶ Computes Rhat over chains of samples. It is required that
input.size(sample_dim) >= 4
.Parameters:  input (torch.Tensor) – the input tensor.
 chain_dim (int) – the chain dimension.
 sample_dim (int) – the sample dimension.
Returns torch.Tensor: split Rhat of
input
.

autocorrelation
(input, dim=0)[source]¶ Computes the autocorrelation of samples at dimension
dim
.Reference: https://en.wikipedia.org/wiki/Autocorrelation#Efficient_computation
Parameters:  input (torch.Tensor) – the input tensor.
 dim (int) – the dimension to calculate autocorrelation.
Returns torch.Tensor: autocorrelation of
input
.

autocovariance
(input, dim=0)[source]¶ Computes the autocovariance of samples at dimension
dim
.Parameters:  input (torch.Tensor) – the input tensor.
 dim (int) – the dimension to calculate autocorrelation.
Returns torch.Tensor: autocorrelation of
input
.

effective_sample_size
(input, chain_dim=0, sample_dim=1)[source]¶ Computes effective sample size of input.
Reference:
 [1] Introduction to Markov Chain Monte Carlo,
 Charles J. Geyer
 [2] Stan Reference Manual version 2.18,
 Stan Development Team
Parameters:  input (torch.Tensor) – the input tensor.
 chain_dim (int) – the chain dimension.
 sample_dim (int) – the sample dimension.
Returns torch.Tensor: effective sample size of
input
.

resample
(input, num_samples, dim=0, replacement=False)[source]¶ Draws
num_samples
samples frominput
at dimensiondim
.Parameters:  input (torch.Tensor) – the input tensor.
 num_samples (int) – the number of samples to draw from
input
.  dim (int) – dimension to draw from
input
.
Returns torch.Tensor: samples drawn randomly from
input
.

quantile
(input, probs, dim=0)[source]¶ Computes quantiles of
input
atprobs
. Ifprobs
is a scalar, the output will be squeezed atdim
.Parameters:  input (torch.Tensor) – the input tensor.
 probs (list) – quantile positions.
 dim (int) – dimension to take quantiles from
input
.
Returns torch.Tensor: quantiles of
input
atprobs
.

pi
(input, prob, dim=0)[source]¶ Computes percentile interval which assigns equal probability mass to each tail of the interval.
Parameters:  input (torch.Tensor) – the input tensor.
 prob (float) – the probability mass of samples within the interval.
 dim (int) – dimension to calculate percentile interval from
input
.
Returns torch.Tensor: quantiles of
input
atprobs
.

hpdi
(input, prob, dim=0)[source]¶ Computes “highest posterior density interval” which is the narrowest interval with probability mass
prob
.Parameters:  input (torch.Tensor) – the input tensor.
 prob (float) – the probability mass of samples within the interval.
 dim (int) – dimension to calculate percentile interval from
input
.
Returns torch.Tensor: quantiles of
input
atprobs
.

waic
(input, log_weights=None, pointwise=False, dim=0)[source]¶ Computes “Widely Applicable/WatanabeAkaike Information Criterion” (WAIC) and its corresponding effective number of parameters.
Reference:
[1] WAIC and crossvalidation in Stan, Aki Vehtari, Andrew Gelman
Parameters:  input (torch.Tensor) – the input tensor, which is log likelihood of a model.
 log_weights (torch.Tensor) – weights of samples along
dim
.  dim (int) – the sample dimension of
input
.
Returns tuple: tuple of WAIC and effective number of parameters.

fit_generalized_pareto
(X)[source]¶ Given a dataset X assumed to be drawn from the Generalized Pareto Distribution, estimate the distributional parameters k, sigma using a variant of the technique described in reference [1], as described in reference [2].
References [1] ‘A new and efficient estimation method for the generalized Pareto distribution.’ Zhang, J. and Stephens, M.A. (2009). [2] ‘Pareto Smoothed Importance Sampling.’ Aki Vehtari, Andrew Gelman, Jonah Gabry
Parameters: torch.Tensor – the input data X Returns tuple: tuple of floats (k, sigma) corresponding to the fit parameters

crps_empirical
(pred, truth)[source]¶ Computes negative Continuous Ranked Probability Score CRPS* [1] between a set of samples
pred
and true datatruth
. This uses ann log(n)
time algorithm to compute a quantity equal that would naively have complexity quadratic in the number of samplesn
:CRPS* = Epred  truth  1/2 Epred  pred' = (pred  truth).abs().mean(0)  (pred  pred.unsqueeze(1)).abs().mean([0, 1]) / 2
Note that for a single sample this reduces to absolute error.
References [1] Strictly Proper Scoring Rules, Prediction, and Estimation Tilmann Gneiting, Adrian E. Raftery (2007) https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf
Parameters:  pred (torch.Tensor) – A set of sample predictions batched on rightmost dim.
This should have shape
(num_samples,) + truth.shape
.  truth (torch.Tensor) – A tensor of true observations.
Returns: A tensor of shape
truth.shape
.Return type:  pred (torch.Tensor) – A set of sample predictions batched on rightmost dim.
This should have shape