Distributions¶
PyTorch Distributions¶
Most distributions in Pyro are thin wrappers around PyTorch distributions.
For details on the PyTorch distribution interface, see torch.distributions.distribution.Distribution.
For differences between the Pyro and PyTorch interfaces, see TorchDistributionMixin.
Bernoulli¶

class Bernoulli(probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.bernoulli.Bernoulli with TorchDistributionMixin.
Beta¶

class Beta(concentration1, concentration0, validate_args=None)¶
Wraps torch.distributions.beta.Beta with TorchDistributionMixin.
Binomial¶

class Binomial(total_count=1, probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.binomial.Binomial with TorchDistributionMixin.
Categorical¶

class Categorical(probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.categorical.Categorical with TorchDistributionMixin.
Cauchy¶

class Cauchy(loc, scale, validate_args=None)¶
Wraps torch.distributions.cauchy.Cauchy with TorchDistributionMixin.
Chi2¶

class Chi2(df, validate_args=None)¶
Wraps torch.distributions.chi2.Chi2 with TorchDistributionMixin.
Dirichlet¶

class Dirichlet(concentration, validate_args=None)¶
Wraps torch.distributions.dirichlet.Dirichlet with TorchDistributionMixin.
Exponential¶

class Exponential(rate, validate_args=None)¶
Wraps torch.distributions.exponential.Exponential with TorchDistributionMixin.
ExponentialFamily¶

class ExponentialFamily(batch_shape=torch.Size([]), event_shape=torch.Size([]), validate_args=None)¶
Wraps torch.distributions.exp_family.ExponentialFamily with TorchDistributionMixin.
FisherSnedecor¶

class FisherSnedecor(df1, df2, validate_args=None)¶
Wraps torch.distributions.fishersnedecor.FisherSnedecor with TorchDistributionMixin.
Gamma¶

class Gamma(concentration, rate, validate_args=None)¶
Wraps torch.distributions.gamma.Gamma with TorchDistributionMixin.
Geometric¶

class Geometric(probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.geometric.Geometric with TorchDistributionMixin.
Gumbel¶

class Gumbel(loc, scale, validate_args=None)¶
Wraps torch.distributions.gumbel.Gumbel with TorchDistributionMixin.
HalfCauchy¶

class HalfCauchy(scale, validate_args=None)¶
Wraps torch.distributions.half_cauchy.HalfCauchy with TorchDistributionMixin.
HalfNormal¶

class HalfNormal(scale, validate_args=None)¶
Wraps torch.distributions.half_normal.HalfNormal with TorchDistributionMixin.
Independent¶

class Independent(base_distribution, reinterpreted_batch_ndims, validate_args=None)¶
Wraps torch.distributions.independent.Independent with TorchDistributionMixin.
Laplace¶

class Laplace(loc, scale, validate_args=None)¶
Wraps torch.distributions.laplace.Laplace with TorchDistributionMixin.
LogNormal¶

class LogNormal(loc, scale, validate_args=None)¶
Wraps torch.distributions.log_normal.LogNormal with TorchDistributionMixin.
LogisticNormal¶

class LogisticNormal(loc, scale, validate_args=None)¶
Wraps torch.distributions.logistic_normal.LogisticNormal with TorchDistributionMixin.
LowRankMultivariateNormal¶

class LowRankMultivariateNormal(loc, cov_factor, cov_diag, validate_args=None)¶
Wraps torch.distributions.lowrank_multivariate_normal.LowRankMultivariateNormal with TorchDistributionMixin.
Multinomial¶

class Multinomial(total_count=1, probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.multinomial.Multinomial with TorchDistributionMixin.
MultivariateNormal¶

class MultivariateNormal(loc, covariance_matrix=None, precision_matrix=None, scale_tril=None, validate_args=None)¶
Wraps torch.distributions.multivariate_normal.MultivariateNormal with TorchDistributionMixin.
NegativeBinomial¶

class NegativeBinomial(total_count, probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.negative_binomial.NegativeBinomial with TorchDistributionMixin.
Normal¶

class Normal(loc, scale, validate_args=None)¶
Wraps torch.distributions.normal.Normal with TorchDistributionMixin.
OneHotCategorical¶

class OneHotCategorical(probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.one_hot_categorical.OneHotCategorical with TorchDistributionMixin.
Pareto¶

class Pareto(scale, alpha, validate_args=None)¶
Wraps torch.distributions.pareto.Pareto with TorchDistributionMixin.
Poisson¶

class Poisson(rate, validate_args=None)¶
Wraps torch.distributions.poisson.Poisson with TorchDistributionMixin.
RelaxedBernoulli¶

class RelaxedBernoulli(temperature, probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.relaxed_bernoulli.RelaxedBernoulli with TorchDistributionMixin.
RelaxedOneHotCategorical¶

class RelaxedOneHotCategorical(temperature, probs=None, logits=None, validate_args=None)¶
Wraps torch.distributions.relaxed_categorical.RelaxedOneHotCategorical with TorchDistributionMixin.
StudentT¶

class StudentT(df, loc=0.0, scale=1.0, validate_args=None)¶
Wraps torch.distributions.studentT.StudentT with TorchDistributionMixin.
TransformedDistribution¶

class TransformedDistribution(base_distribution, transforms, validate_args=None)¶
Wraps torch.distributions.transformed_distribution.TransformedDistribution with TorchDistributionMixin.
Uniform¶

class Uniform(low, high, validate_args=None)¶
Wraps torch.distributions.uniform.Uniform with TorchDistributionMixin.
Weibull¶

class Weibull(scale, concentration, validate_args=None)¶
Wraps torch.distributions.weibull.Weibull with TorchDistributionMixin.
Pyro Distributions¶
Abstract Distribution¶

class Distribution¶
Bases: object
Base class for parameterized probability distributions.
Distributions in Pyro are stochastic function objects with sample() and log_prob() methods. Distributions are stochastic functions with fixed parameters:
    d = dist.Bernoulli(param)
    x = d()                # Draws a random sample.
    p = d.log_prob(x)      # Evaluates log probability of x.
Implementing New Distributions:
Derived classes must implement the methods sample() and log_prob().
Examples:
Take a look at the examples to see how they interact with inference algorithms.

__call__(*args, **kwargs)¶
Samples a random value (just an alias for .sample(*args, **kwargs)).
For tensor distributions, the returned tensor should have the same .shape as the parameters.
Returns: A random value.
Return type: torch.Tensor

enumerate_support(expand=True)¶
Returns a representation of the parametrized distribution’s support, along the first dimension. This is implemented only by discrete distributions.
Note that this returns support values of all the batched RVs in lockstep, rather than the full cartesian product.
Parameters: expand (bool) – whether to expand the result to a tensor of shape (n,) + batch_shape + event_shape. If false, the return value has unexpanded shape (n,) + (1,)*len(batch_shape) + event_shape which can be broadcasted to the full shape.
Returns: An iterator over the distribution’s discrete support.
Return type: iterator

has_enumerate_support = False¶

has_rsample = False¶

log_prob(x, *args, **kwargs)¶
Evaluates log probability densities for each of a batch of samples.
Parameters: x (torch.Tensor) – A single value or a batch of values batched along axis 0.
Returns: log probability densities as a one-dimensional Tensor with same batch size as value and params. The shape of the result should be self.batch_size.
Return type: torch.Tensor

sample(*args, **kwargs)¶
Samples a random value.
For tensor distributions, the returned tensor should have the same .shape as the parameters, unless otherwise noted.
Parameters: sample_shape (torch.Size) – the size of the iid batch to be drawn from the distribution.
Returns: A random value or batch of random values (if parameters are batched). The shape of the result should be self.shape().
Return type: torch.Tensor

score_parts(x, *args, **kwargs)¶
Computes ingredients for stochastic gradient estimators of ELBO.
The default implementation is correct both for non-reparameterized and for fully reparameterized distributions. Partially reparameterized distributions should override this method to compute correct .score_function and .entropy_term parts.
Parameters: x (torch.Tensor) – A single value or batch of values.
Returns: A ScoreParts object containing parts of the ELBO estimator.
Return type: ScoreParts

TorchDistributionMixin¶

class TorchDistributionMixin¶
Bases: pyro.distributions.distribution.Distribution
Mixin to provide Pyro compatibility for PyTorch distributions.
You should instead use TorchDistribution for new distribution classes.
This is mainly useful for wrapping existing PyTorch distributions for use in Pyro. Derived classes must first inherit from torch.distributions.distribution.Distribution and then inherit from TorchDistributionMixin.
__call__(sample_shape=torch.Size([]))¶
Samples a random value.
This is reparameterized whenever possible, calling rsample() for reparameterized distributions and sample() for non-reparameterized distributions.
Parameters: sample_shape (torch.Size) – the size of the iid batch to be drawn from the distribution.
Returns: A random value or batch of random values (if parameters are batched). The shape of the result should be self.shape().
Return type: torch.Tensor

shape(sample_shape=torch.Size([]))¶
The tensor shape of samples from this distribution.
Samples are of shape:
    d.shape(sample_shape) == sample_shape + d.batch_shape + d.event_shape
Parameters: sample_shape (torch.Size) – the size of the iid batch to be drawn from the distribution.
Returns: Tensor shape of samples.
Return type: torch.Size

expand(batch_shape, _instance=None)¶
Returns a new ExpandedDistribution instance with batch dimensions expanded to batch_shape.
Parameters:
 batch_shape (tuple) – batch shape to expand to.
 _instance – unused argument for compatibility with torch.distributions.Distribution.expand()
Returns: an instance of ExpandedDistribution.
Return type: ExpandedDistribution

expand_by(sample_shape)¶
Expands a distribution by adding sample_shape to the left side of its batch_shape.
To expand internal dims of self.batch_shape from 1 to something larger, use expand() instead.
Parameters: sample_shape (torch.Size) – The size of the iid batch to be drawn from the distribution.
Returns: An expanded version of this distribution.
Return type: ExpandedDistribution

to_event(reinterpreted_batch_ndims=None)¶
Reinterprets the n rightmost dimensions of this distribution’s batch_shape as event dims, adding them to the left side of event_shape.
Example:
>>> [d1.batch_shape, d1.event_shape]
[torch.Size([2, 3]), torch.Size([4, 5])]
>>> d2 = d1.to_event(1)
>>> [d2.batch_shape, d2.event_shape]
[torch.Size([2]), torch.Size([3, 4, 5])]
>>> d3 = d1.to_event(2)
>>> [d3.batch_shape, d3.event_shape]
[torch.Size([]), torch.Size([2, 3, 4, 5])]
Parameters: reinterpreted_batch_ndims (int) – The number of batch dimensions to reinterpret as event dimensions.
Returns: A reshaped version of this distribution.
Return type: pyro.distributions.torch.Independent

mask(mask)¶
Masks a distribution by a boolean or boolean-valued tensor that is broadcastable to the distribution’s batch_shape.
Parameters: mask (bool or torch.Tensor) – A boolean or boolean valued tensor.
Returns: A masked copy of this distribution.
Return type: MaskedDistribution

TorchDistribution¶

class TorchDistribution(batch_shape=torch.Size([]), event_shape=torch.Size([]), validate_args=None)¶
Bases: torch.distributions.distribution.Distribution, pyro.distributions.torch_distribution.TorchDistributionMixin
Base class for PyTorch-compatible distributions with Pyro support.
This should be the base class for almost all new Pyro distributions.
Note
Parameters and data should be of type Tensor and all methods return type Tensor unless otherwise noted.
Tensor Shapes:
TorchDistributions provide a method .shape() for the tensor shape of samples:
    x = d.sample(sample_shape)
    assert x.shape == d.shape(sample_shape)
Pyro follows the same distribution shape semantics as PyTorch. It distinguishes between three different roles for tensor shapes of samples:
 sample shape corresponds to the shape of the iid samples drawn from the distribution. This is taken as an argument by the distribution’s sample method.
 batch shape corresponds to non-identical (independent) parameterizations of the distribution, inferred from the distribution’s parameter shapes. This is fixed for a distribution instance.
 event shape corresponds to the event dimensions of the distribution, which is fixed for a distribution class. These are collapsed when we try to score a sample from the distribution via d.log_prob(x).
These shapes are related by the equation:
assert d.shape(sample_shape) == sample_shape + d.batch_shape + d.event_shape
Distributions provide a vectorized log_prob() method that evaluates the log probability density of each event in a batch independently, returning a tensor of shape sample_shape + d.batch_shape:
    x = d.sample(sample_shape)
    assert x.shape == d.shape(sample_shape)
    log_p = d.log_prob(x)
    assert log_p.shape == sample_shape + d.batch_shape
Implementing New Distributions:
Derived classes must implement the methods sample() (or rsample() if .has_rsample == True) and log_prob(), and must implement the properties batch_shape and event_shape. Discrete classes may also implement the enumerate_support() method to improve gradient estimates and set .has_enumerate_support = True.
expand(batch_shape, _instance=None)¶
Returns a new ExpandedDistribution instance with batch dimensions expanded to batch_shape.
Parameters:
 batch_shape (tuple) – batch shape to expand to.
 _instance – unused argument for compatibility with torch.distributions.Distribution.expand()
Returns: an instance of ExpandedDistribution.
Return type: ExpandedDistribution
AVFMultivariateNormal¶

class AVFMultivariateNormal(loc, scale_tril, control_var)¶
Bases: pyro.distributions.torch.MultivariateNormal
Multivariate normal (Gaussian) distribution with transport equation inspired control variates (adaptive velocity fields).
A distribution over vectors in which all the elements have a joint Gaussian density.
Parameters:
 loc (torch.Tensor) – D-dimensional mean vector.
 scale_tril (torch.Tensor) – Cholesky of Covariance matrix; D x D matrix.
 control_var (torch.Tensor) – 2 x L x D tensor that parameterizes the control variate; L is an arbitrary positive integer. This parameter needs to be learned (i.e. adapted) to achieve lower variance gradients. In a typical use case this parameter will be adapted concurrently with the loc and scale_tril that define the distribution.
Example usage:
    control_var = torch.tensor(0.1 * torch.ones(2, 1, D), requires_grad=True)
    opt_cv = torch.optim.Adam([control_var], lr=0.1, betas=(0.5, 0.999))
    for _ in range(1000):
        d = AVFMultivariateNormal(loc, scale_tril, control_var)
        z = d.rsample()
        cost = torch.pow(z, 2.0).sum()
        cost.backward()
        opt_cv.step()
        opt_cv.zero_grad()

arg_constraints = {'control_var': Real(), 'loc': Real(), 'scale_tril': LowerTriangular()}¶
BetaBinomial¶

class BetaBinomial(concentration1, concentration0, total_count=1, validate_args=None)¶
Bases: pyro.distributions.torch_distribution.TorchDistribution
Compound distribution comprising a beta-binomial pair. The probability of success (probs for the Binomial distribution) is unknown and randomly drawn from a Beta distribution prior to a certain number of Bernoulli trials given by total_count.

arg_constraints = {'concentration0': GreaterThan(lower_bound=0.0), 'concentration1': GreaterThan(lower_bound=0.0), 'total_count': IntegerGreaterThan(lower_bound=0)}¶

concentration0¶

concentration1¶

has_enumerate_support = True¶

mean¶

support¶

variance¶

ConditionalDistribution¶
ConditionalTransformedDistribution¶
Delta¶

class Delta(v, log_density=0.0, event_dim=0, validate_args=None)¶
Bases: pyro.distributions.torch_distribution.TorchDistribution
Degenerate discrete distribution (a single point).
Discrete distribution that assigns probability one to the single element in its support. A Delta distribution parameterized by a random choice should not be used with MCMC-based inference, as doing so produces incorrect results.
Parameters:
 v (torch.Tensor) – The single support element.
 log_density (torch.Tensor) – An optional density for this Delta. This is useful to keep the class of Delta distributions closed under differentiable transformation.
 event_dim (int) – Optional event dimension, defaults to zero.

arg_constraints = {'log_density': Real(), 'v': Real()}¶

has_rsample = True¶

mean¶

support = Real()¶

variance¶
DirichletMultinomial¶

class DirichletMultinomial(concentration, total_count=1, is_sparse=False, validate_args=None)¶
Bases: pyro.distributions.torch_distribution.TorchDistribution
Compound distribution comprising a Dirichlet-multinomial pair. The probability of classes (probs for the Multinomial distribution) is unknown and randomly drawn from a Dirichlet distribution prior to a certain number of Categorical trials given by total_count.
Parameters:
 concentration (float or torch.Tensor) – concentration parameter (alpha) for the Dirichlet distribution.
 total_count (int or torch.Tensor) – number of Categorical trials.
 is_sparse (bool) – Whether to assume value is mostly zero when computing log_prob(), which can speed up computation when data is sparse.

arg_constraints = {'concentration': GreaterThan(lower_bound=0.0), 'total_count': IntegerGreaterThan(lower_bound=0)}¶

concentration¶

mean¶

support¶

variance¶
DiscreteHMM¶

class DiscreteHMM(initial_logits, transition_logits, observation_dist, validate_args=None)¶
Bases: pyro.distributions.torch_distribution.TorchDistribution
Hidden Markov Model with discrete latent state and arbitrary observation distribution. This uses [1] to parallelize over time, achieving O(log(time)) parallel complexity.
The event_shape of this distribution includes time on the left:
    event_shape = (num_steps,) + observation_dist.event_shape
This distribution supports any combination of homogeneous/heterogeneous time dependency of transition_logits and observation_dist. However, because time is included in this distribution’s event_shape, the homogeneous+homogeneous case will have a broadcastable event_shape with num_steps = 1, allowing log_prob() to work with arbitrary length data:
    # homogeneous + homogeneous case:
    event_shape = (1,) + observation_dist.event_shape
References:
[1] Simo Sarkka, Angel F. Garcia-Fernandez (2019) “Temporal Parallelization of Bayesian Filters and Smoothers” https://arxiv.org/pdf/1905.13002.pdf
Parameters:
 initial_logits (Tensor) – A logits tensor for an initial categorical distribution over latent states. Should have rightmost size state_dim and be broadcastable to batch_shape + (state_dim,).
 transition_logits (Tensor) – A logits tensor for transition conditional distributions between latent states. Should have rightmost shape (state_dim, state_dim) (old, new), and be broadcastable to batch_shape + (num_steps, state_dim, state_dim).
 observation_dist (Distribution) – A conditional distribution of observed data conditioned on latent state. The .batch_shape should have rightmost size state_dim and be broadcastable to batch_shape + (num_steps, state_dim). The .event_shape may be arbitrary.

arg_constraints = {'initial_logits': Real(), 'transition_logits': Real()}¶

filter(value)¶
Compute posterior over final state given a sequence of observations.
Parameters: value (Tensor) – A sequence of observations.
Returns: A posterior distribution over latent states at the final time step. result.logits can then be used as initial_logits in a sequential Pyro model for prediction.
Return type: Categorical
EmpiricalDistribution¶

class Empirical(samples, log_weights, validate_args=None)¶
Bases: pyro.distributions.torch_distribution.TorchDistribution
Empirical distribution associated with the sampled data. Note that the shape requirement for log_weights is that its shape must match the leftmost shape of samples. Samples are aggregated along the aggregation_dim, which is the rightmost dim of log_weights.
Example:
>>> emp_dist = Empirical(torch.randn(2, 3, 10), torch.ones(2, 3))
>>> emp_dist.batch_shape
torch.Size([2])
>>> emp_dist.event_shape
torch.Size([10])

>>> single_sample = emp_dist.sample()
>>> single_sample.shape
torch.Size([2, 10])
>>> batch_sample = emp_dist.sample((100,))
>>> batch_sample.shape
torch.Size([100, 2, 10])

>>> emp_dist.log_prob(single_sample).shape
torch.Size([2])
>>> # Vectorized samples cannot be scored by log_prob.
>>> with pyro.validation_enabled():
...     emp_dist.log_prob(batch_sample).shape
Traceback (most recent call last):
...
ValueError: ``value.shape`` must be torch.Size([2, 10])
Parameters:
 samples (torch.Tensor) – samples from the empirical distribution.
 log_weights (torch.Tensor) – log weights (optional) corresponding to the samples.

arg_constraints = {}¶

enumerate_support(expand=True)¶
See pyro.distributions.torch_distribution.TorchDistribution.enumerate_support()

event_shape¶
See pyro.distributions.torch_distribution.TorchDistribution.event_shape()

has_enumerate_support = True¶

log_prob(value)¶
Returns the log of the probability mass function evaluated at value. Note that this currently only supports scoring values with empty sample_shape.
Parameters: value (torch.Tensor) – scalar or tensor value to be scored.

log_weights¶

mean¶
See pyro.distributions.torch_distribution.TorchDistribution.mean()

sample(sample_shape=torch.Size([]))¶
See pyro.distributions.torch_distribution.TorchDistribution.sample()

sample_size¶
Number of samples that constitute the empirical distribution.
Return int: number of samples collected.

support = Real()¶

variance¶
See pyro.distributions.torch_distribution.TorchDistribution.variance()
FoldedDistribution¶

class FoldedDistribution(base_dist, validate_args=None)¶
Bases: pyro.distributions.torch.TransformedDistribution
Equivalent to TransformedDistribution(base_dist, AbsTransform()), but additionally supports log_prob().
Parameters: base_dist (Distribution) – The distribution to reflect.

support = GreaterThan(lower_bound=0.0)¶

GammaPoisson¶

class GammaPoisson(concentration, rate, validate_args=None)¶
Bases: pyro.distributions.torch_distribution.TorchDistribution
Compound distribution comprising a gamma-poisson pair, also referred to as a gamma-poisson mixture. The rate parameter for the Poisson distribution is unknown and randomly drawn from a Gamma distribution.
Note
This can be treated as an alternate parametrization of the NegativeBinomial (total_count, probs) distribution, with concentration = total_count and rate = (1 - probs) / probs.

arg_constraints = {'concentration': GreaterThan(lower_bound=0.0), 'rate': GreaterThan(lower_bound=0.0)}¶

concentration¶

mean¶

rate¶

support = IntegerGreaterThan(lower_bound=0)¶

variance¶

GaussianHMM¶

class GaussianHMM(initial_dist, transition_matrix, transition_dist, observation_matrix, observation_dist, validate_args=None)¶
Bases: pyro.distributions.torch_distribution.TorchDistribution
Hidden Markov Model with Gaussians for initial, transition, and observation distributions. This adapts [1] to parallelize over time to achieve O(log(time)) parallel complexity; however, it differs in that it tracks the log normalizer to ensure log_prob() is differentiable.
This corresponds to the generative model:
    z = initial_dist.sample()
    x = []
    for t in range(num_steps):
        z = z @ transition_matrix + transition_dist.sample()
        x.append(z @ observation_matrix + observation_dist.sample())
The event_shape of this distribution includes time on the left:
    event_shape = (num_steps,) + observation_dist.event_shape
This distribution supports any combination of homogeneous/heterogeneous time dependency of transition_dist and observation_dist. However, because time is included in this distribution’s event_shape, the homogeneous+homogeneous case will have a broadcastable event_shape with num_steps = 1, allowing log_prob() to work with arbitrary length data:
    event_shape = (1, obs_dim)  # homogeneous + homogeneous case
References:
[1] Simo Sarkka, Angel F. Garcia-Fernandez (2019) “Temporal Parallelization of Bayesian Filters and Smoothers” https://arxiv.org/pdf/1905.13002.pdf
Parameters:
 initial_dist (MultivariateNormal) – A distribution over initial states. This should have batch_shape broadcastable to self.batch_shape. This should have event_shape (hidden_dim,).
 transition_matrix (Tensor) – A linear transformation of hidden state. This should have shape broadcastable to self.batch_shape + (num_steps, hidden_dim, hidden_dim) where the rightmost dims are ordered (old, new).
 transition_dist (MultivariateNormal) – A process noise distribution. This should have batch_shape broadcastable to self.batch_shape + (num_steps,). This should have event_shape (hidden_dim,).
 observation_matrix (Tensor) – A linear transformation from hidden to observed state. This should have shape broadcastable to self.batch_shape + (num_steps, hidden_dim, obs_dim).
 observation_dist (MultivariateNormal or Normal) – An observation noise distribution. This should have batch_shape broadcastable to self.batch_shape + (num_steps,). This should have event_shape (obs_dim,).

arg_constraints = {}¶

filter(value)¶
Compute posterior over final state given a sequence of observations.
Parameters: value (Tensor) – A sequence of observations.
Returns: A posterior distribution over latent states at the final time step. result can then be used as initial_dist in a sequential Pyro model for prediction.
Return type: MultivariateNormal
GaussianMRF¶

class
GaussianMRF
(initial_dist, transition_dist, observation_dist, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Temporal Markov Random Field with Gaussian factors for initial, transition, and observation distributions. This adapts [1] to parallelize over time to achieve O(log(time)) parallel complexity, however it differs in that it tracks the log normalizer to ensure
log_prob()
is differentiable.The event_shape of this distribution includes time on the left:
event_shape = (num_steps,) + observation_dist.event_shape
This distribution supports any combination of homogeneous/heterogeneous time dependency of
transition_dist
andobservation_dist
. However, because time is included in this distribution’s event_shape, the homogeneous+homogeneous case will have a broadcastable event_shape withnum_steps = 1
, allowinglog_prob()
to work with arbitrary length data:event_shape = (1, obs_dim) # homogeneous + homogeneous case
References:
 [1] Simo Sarkka, Angel F. GarciaFernandez (2019)
 “Temporal Parallelization of Bayesian Filters and Smoothers” https://arxiv.org/pdf/1905.13002.pdf
Variables: Parameters:  initial_dist (MultivariateNormal) – A distribution
over initial states. This should have batch_shape broadcastable to
self.batch_shape
. This should have event_shape(hidden_dim,)
.  transition_dist (MultivariateNormal) – A joint
distribution factor over a pair of successive time steps. This should
have batch_shape broadcastable to
self.batch_shape + (num_steps,)
. This should have event_shape(hidden_dim + hidden_dim,)
(old+new).  observation_dist (MultivariateNormal) – A joint
distribution factor over a hidden and an observed state. This should
have batch_shape broadcastable to
self.batch_shape + (num_steps,)
. This should have event_shape(hidden_dim + obs_dim,)
.

arg_constraints
= {}¶
GaussianScaleMixture¶

class
GaussianScaleMixture
(coord_scale, component_logits, component_scale)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Mixture of Normal distributions with zero mean and diagonal covariance matrices.
That is, this distribution is a mixture with K components, where each component distribution is a Ddimensional Normal distribution with zero mean and a Ddimensional diagonal covariance matrix. The K different covariance matrices are controlled by the parameters coord_scale and component_scale. That is, the covariance matrix of the k’th component is given by
Sigma_ii = (component_scale_k * coord_scale_i) ** 2 (i = 1, …, D)
where component_scale_k is a positive scale factor and coord_scale_i are positive scale parameters shared between all K components. The mixture weights are controlled by a Kdimensional vector of softmax logits, component_logits. This distribution implements pathwise derivatives for samples from the distribution. This distribution does not currently support batched parameters.
See reference [1] for details on the implementations of the pathwise derivative. Please consider citing this reference if you use the pathwise derivative in your research.
[1] Pathwise Derivatives for Multivariate Distributions, Martin Jankowiak & Theofanis Karaletsos. arXiv:1806.01856
Note that this distribution supports both even and odd dimensions, but the former should be more a bit higher precision, since it doesn’t use any erfs in the backward call. Also note that this distribution does not support D = 1.
Parameters:  coord_scale (torch.tensor) – Ddimensional vector of scales
 component_logits (torch.tensor) – Kdimensional vector of logits
 component_scale (torch.tensor) – Kdimensional vector of scale multipliers

arg_constraints
= {'component_logits': Real(), 'component_scale': GreaterThan(lower_bound=0.0), 'coord_scale': GreaterThan(lower_bound=0.0)}¶

has_rsample
= True¶
InverseGamma¶

class
InverseGamma
(concentration, rate, validate_args=None)[source]¶ Bases:
pyro.distributions.torch.TransformedDistribution
Creates an inversegamma distribution parameterized by concentration and rate.
X ~ Gamma(concentration, rate) Y = 1/X ~ InverseGamma(concentration, rate)Parameters:  concentration (torch.Tensor) – the concentration parameter (i.e. alpha).
 rate (torch.Tensor) – the rate parameter (i.e. beta).

arg_constraints
= {'concentration': GreaterThan(lower_bound=0.0), 'rate': GreaterThan(lower_bound=0.0)}¶

concentration
¶

has_rsample
= True¶

rate
¶

support
= GreaterThan(lower_bound=0.0)¶
LKJCorrCholesky¶

class
LKJCorrCholesky
(d, eta, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Generates cholesky factors of correlation matrices using an LKJ prior.
The expected use is to combine it with a vector of variances and pass it to the scale_tril parameter of a multivariate distribution such as MultivariateNormal.
E.g., if theta is a (positive) vector of covariances with the same dimensionality as this distribution, and Omega is sampled from this distribution, scale_tril=torch.mm(torch.diag(sqrt(theta)), Omega)
Note that the event_shape of this distribution is [d, d]
Note
When using this distribution with HMC/NUTS, it is important to use a step_size such as 1e4. If not, you are likely to experience LAPACK errors regarding positivedefiniteness.
For example usage, refer to pyro/examples/lkj.py.
Parameters:  d (int) – Dimensionality of the matrix
 eta (torch.Tensor) – A single positive number parameterizing the distribution.

arg_constraints
= {'eta': GreaterThan(lower_bound=0.0)}¶

has_rsample
= False¶

support
= CorrCholesky()¶
MaskedMixture¶

class
MaskedMixture
(mask, component0, component1, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
A masked deterministic mixture of two distributions.
This is useful when the mask is sampled from another distribution, possibly correlated across the batch. Often the mask can be marginalized out via enumeration.
Example:
change_point = pyro.sample("change_point", dist.Categorical(torch.ones(len(data) + 1)), infer={'enumerate': 'parallel'}) mask = torch.arange(len(data), dtype=torch.long) >= changepoint with pyro.plate("data", len(data)): pyro.sample("obs", MaskedMixture(mask, dist1, dist2), obs=data)
Parameters:  mask (torch.Tensor) – A byte tensor toggling between component0 and component1.
 component0 (pyro.distributions.TorchDistribution) – a distribution for batch elements mask == 0.
 component1 (pyro.distributions.TorchDistribution) – a distribution for batch elements mask == 1.

arg_constraints
= {}¶

has_rsample
¶

support
¶
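A minimal sketch of how a masked mixture scores data, assuming each element is scored under component1 where the mask is set and under component0 elsewhere. It uses plain Python lists instead of tensors; normal_log_prob and masked_mixture_log_prob are hypothetical helpers, not Pyro functions.

```python
import math

def normal_log_prob(x, loc, scale):
    """Log density of a univariate Normal(loc, scale)."""
    return (-0.5 * ((x - loc) / scale) ** 2
            - math.log(scale) - 0.5 * math.log(2.0 * math.pi))

def masked_mixture_log_prob(values, mask, log_prob0, log_prob1):
    """Score each element with log_prob1 where mask is true, else log_prob0."""
    return [log_prob1(v) if m else log_prob0(v) for v, m in zip(values, mask)]

data = [0.1, -0.2, 5.1, 4.8]
change_point = 2
# Same rule as in the example above: positions at or after the change point
# belong to the second component.
mask = [i >= change_point for i in range(len(data))]

lp = masked_mixture_log_prob(
    data, mask,
    lambda v: normal_log_prob(v, 0.0, 1.0),   # stands in for component0
    lambda v: normal_log_prob(v, 5.0, 1.0),   # stands in for component1
)
```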
MixtureOfDiagNormals¶

class
MixtureOfDiagNormals
(locs, coord_scale, component_logits)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Mixture of Normal distributions with arbitrary means and arbitrary diagonal covariance matrices.
That is, this distribution is a mixture with K components, where each component distribution is a D-dimensional Normal distribution with a D-dimensional mean parameter and a D-dimensional diagonal covariance matrix. The K different component means are gathered into the K x D dimensional parameter locs and the K different scale parameters are gathered into the K x D dimensional parameter coord_scale. The mixture weights are controlled by a K-dimensional vector of softmax logits, component_logits. This distribution implements pathwise derivatives for samples from the distribution.
See reference [1] for details on the implementations of the pathwise derivative. Please consider citing this reference if you use the pathwise derivative in your research. Note that this distribution does not support dimension D = 1.
[1] Pathwise Derivatives for Multivariate Distributions, Martin Jankowiak & Theofanis Karaletsos. arXiv:1806.01856
Parameters:  locs (torch.Tensor) – K x D mean matrix
 coord_scale (torch.Tensor) – K x D scale matrix
 component_logits (torch.Tensor) – K-dimensional vector of softmax logits

arg_constraints
= {'component_logits': Real(), 'coord_scale': GreaterThan(lower_bound=0.0), 'locs': Real()}¶

has_rsample
= True¶
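The density implied by the description above can be written as a log-sum-exp over components. The following is a pure-Python sketch under that assumption; it does not reproduce the pathwise-derivative machinery, and mixture_log_prob and log_sum_exp are illustrative names.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mixture_log_prob(x, locs, coord_scale, component_logits):
    """Log density at a D-vector x of a K-component diagonal-Normal mixture."""
    log_norm = log_sum_exp(component_logits)
    terms = []
    for loc_k, scale_k, logit_k in zip(locs, coord_scale, component_logits):
        log_weight = logit_k - log_norm          # log softmax(logits)_k
        log_density = sum(
            -0.5 * ((xd - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
            for xd, m, s in zip(x, loc_k, scale_k))
        terms.append(log_weight + log_density)
    return log_sum_exp(terms)

locs = [[0.0, 0.0], [3.0, 3.0]]             # K x D means
coord_scale = [[1.0, 1.0], [0.5, 2.0]]      # K x D scales
component_logits = [0.0, 0.0]               # equal mixture weights
lp = mixture_log_prob([0.0, 0.0], locs, coord_scale, component_logits)

# With a single component the mixture reduces to one diagonal Normal.
single = mixture_log_prob([0.0, 0.0], [[0.0, 0.0]], [[1.0, 1.0]], [0.7])
```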
MultivariateStudentT¶

class
MultivariateStudentT
(df, loc, scale_tril, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Creates a multivariate Student’s t-distribution parameterized by degrees of freedom df, mean loc and scale scale_tril.
arg_constraints
= {'df': GreaterThan(lower_bound=0.0), 'loc': RealVector(), 'scale_tril': LowerCholesky()}¶

has_rsample
= True¶

mean
¶

support
= RealVector()¶

variance
¶

OMTMultivariateNormal¶

class
OMTMultivariateNormal
(loc, scale_tril)[source]¶ Bases:
pyro.distributions.torch.MultivariateNormal
Multivariate normal (Gaussian) distribution with OMT gradients w.r.t. both parameters. Note the gradient computation w.r.t. the Cholesky factor has cost O(D^3), although the resulting gradient variance is generally expected to be lower.
A distribution over vectors in which all the elements have a joint Gaussian density.
Parameters:  loc (torch.Tensor) – Mean.
 scale_tril (torch.Tensor) – Cholesky of Covariance matrix.

arg_constraints
= {'loc': Real(), 'scale_tril': LowerTriangular()}¶
RelaxedBernoulliStraightThrough¶

class
RelaxedBernoulliStraightThrough
(temperature, probs=None, logits=None, validate_args=None)[source]¶ Bases:
pyro.distributions.torch.RelaxedBernoulli
An implementation of RelaxedBernoulli with a straight-through gradient estimator.
This distribution has the following properties:
 The samples returned by the rsample() method are discrete/quantized.
 The log_prob() method returns the log probability of the relaxed/unquantized sample using the Gumbel-Softmax distribution.
 In the backward pass the gradient of the sample with respect to the parameters of the distribution uses the relaxed/unquantized sample.
References:
 [1] The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, Chris J. Maddison, Andriy Mnih, Yee Whye Teh
 [2] Categorical Reparameterization with Gumbel-Softmax, Eric Jang, Shixiang Gu, Ben Poole
RelaxedOneHotCategoricalStraightThrough¶

class
RelaxedOneHotCategoricalStraightThrough
(temperature, probs=None, logits=None, validate_args=None)[source]¶ Bases:
pyro.distributions.torch.RelaxedOneHotCategorical
An implementation of RelaxedOneHotCategorical with a straight-through gradient estimator.
This distribution has the following properties:
 The samples returned by the rsample() method are discrete/quantized.
 The log_prob() method returns the log probability of the relaxed/unquantized sample using the Gumbel-Softmax distribution.
 In the backward pass the gradient of the sample with respect to the parameters of the distribution uses the relaxed/unquantized sample.
References:
 [1] The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, Chris J. Maddison, Andriy Mnih, Yee Whye Teh
 [2] Categorical Reparameterization with Gumbel-Softmax, Eric Jang, Shixiang Gu, Ben Poole
Rejector¶

class
Rejector
(propose, log_prob_accept, log_scale)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Rejection sampled distribution given an acceptance rate function.
Parameters:  propose (Distribution) – A proposal distribution that samples batched proposals via propose(). rsample() supports a sample_shape arg only if propose() supports a sample_shape arg.
 log_prob_accept (callable) – A callable that inputs a batch of proposals and returns a batch of log acceptance probabilities.
 log_scale – Total log probability of acceptance.

has_rsample
= True¶
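To illustrate how the three arguments relate, here is a pure-Python sketch of rejection sampling. It is not the Rejector implementation: the proposal is Exponential(1), proposals are accepted with probability exp(-x), and the resulting target is therefore Exponential(2) with total acceptance probability 1/2. All function names are hypothetical.

```python
import math
import random

def rejection_sample(propose, log_prob_accept, rng):
    """Propose repeatedly until one proposal is accepted, then return it."""
    while True:
        x = propose(rng)
        # 1 - rng.random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < log_prob_accept(x):
            return x

def propose(rng):
    return rng.expovariate(1.0)      # proposal density exp(-x) on x >= 0

def log_prob_accept(x):
    return -x                        # accept with probability exp(-x)

log_scale = math.log(0.5)            # log of the integral of exp(-x) * exp(-x)

def target_log_prob(x):
    """Normalized log density: proposal + acceptance - log_scale."""
    return -x + log_prob_accept(x) - log_scale

rng = random.Random(0)
samples = [rejection_sample(propose, log_prob_accept, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

Under these assumptions target_log_prob(x) equals log(2) - 2x, the Exponential(2) log density, and sample_mean should be close to 0.5.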
SpanningTree¶

class
SpanningTree
(edge_logits, sampler_options=None, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Distribution over spanning trees on a fixed number V of vertices.
A tree is represented as a torch.LongTensor edges of shape (V-1, 2) satisfying the following properties:
 The edges constitute a tree, i.e. are connected and cycle free.
 Each edge (v1, v2) = edges[e] is sorted, i.e. v1 < v2.
 The entire tensor is sorted in colexicographic order.
Use validate_edges() to verify edges are correctly formed.
The edge_logits tensor has one entry for each of the V*(V-1)//2 edges in the complete graph on V vertices, where edges are each sorted and the edge order is colexicographic:
(0,1), (0,2), (1,2), (0,3), (1,3), (2,3), (0,4), (1,4), (2,4), ...
This ordering corresponds to the size-independent pairing function:
k = v1 + v2 * (v2 - 1) // 2
where k is the rank of the edge (v1,v2) in the complete graph. To convert a matrix of edge logits to the linear representation used here:
assert my_matrix.shape == (V, V)
i, j = make_complete_graph(V)
edge_logits = my_matrix[i, j]
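The edge ordering and pairing function above can be checked with a few lines of pure Python. The helpers edge_rank and complete_graph_edges are illustrative names, not part of Pyro.

```python
def edge_rank(v1, v2):
    """Rank k of the sorted edge (v1, v2), v1 < v2, in colexicographic order."""
    assert 0 <= v1 < v2
    return v1 + v2 * (v2 - 1) // 2

def complete_graph_edges(V):
    """All V*(V-1)//2 sorted edges of the complete graph, in colexicographic order."""
    return [(v1, v2) for v2 in range(V) for v1 in range(v2)]

edges = complete_graph_edges(4)
ranks = [edge_rank(v1, v2) for v1, v2 in edges]
```

Because the rank of an edge depends only on its endpoints, adding vertices appends new edges without renumbering existing ones.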
Parameters:  edge_logits (torch.Tensor) – A tensor of length V*(V-1)//2 containing logits (aka negative energies) of all edges in the complete graph on V vertices. See above comment for edge ordering.
 sampler_options (dict) – An optional dict of sampler options including: mcmc_steps defaulting to a single MCMC step (which is pretty good); initial_edges defaulting to a cheap approximate sample; backend one of “python” or “cpp”, defaulting to “python”.

arg_constraints
= {'edge_logits': Real()}¶

enumerate_support
(expand=True)[source]¶ This is implemented for trees with up to 6 vertices (and 5 edges).

has_enumerate_support
= True¶

sample
(sample_shape=torch.Size([]))[source]¶ This sampler is implemented using MCMC run for a small number of steps after being initialized by a cheap approximate sampler. This sampler is approximate and cubic time. This is faster than the classic Aldous-Broder sampler [1,2], especially for graphs with large mixing time. Recent research [3,4] proposes samplers that run in sub-matrix-multiply time but are more complex to implement.
References
 [1] Generating random spanning trees
 Andrei Broder (1989)
 [2] The Random Walk Construction of Uniform Spanning Trees and Uniform Labelled Trees,
 David J. Aldous (1990)
 [3] Sampling Random Spanning Trees Faster than Matrix Multiplication,
 David Durfee, Rasmus Kyng, John Peebles, Anup B. Rao, Sushant Sachdeva (2017) https://arxiv.org/abs/1611.07451
 [4] An almost-linear time algorithm for uniform random spanning tree generation,
 Aaron Schild (2017) https://arxiv.org/abs/1711.06455

support
= IntegerGreaterThan(lower_bound=0)¶

validate_edges
(edges)[source]¶ Validates a batch of edges tensors, as returned by sample() or enumerate_support() or as input to log_prob().
Parameters: edges (torch.LongTensor) – A batch of edges.
Raises: ValueError
Returns: None
Stable¶

class
Stable
(stability, skew, scale=1.0, loc=0.0, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Levy \(\alpha\)-stable distribution. See [1] for a review.
This uses Nolan’s parametrization [2] of the loc parameter, which is required for continuity and differentiability. This corresponds to the notation \(S^0_\alpha(\beta,\sigma,\mu_0)\) of [1], where \(\alpha\) = stability, \(\beta\) = skew, \(\sigma\) = scale, and \(\mu_0\) = loc.
This implements a reparametrized sampler rsample(), but does not implement log_prob(). Use in inference is thus limited to likelihood-free algorithms such as EnergyDistance.
 [1] S. Borak, W. Hardle, R. Weron (2005). Stable distributions. https://edoc.hu-berlin.de/bitstream/handle/18452/4526/8.pdf
 [2] J.P. Nolan (1997). Numerical calculation of stable densities and distribution functions.
 [3] Rafal Weron (1996). On the Chambers-Mallows-Stuck Method for Simulating Skewed Stable Random Variables.
 [4] J.P. Nolan (2017). Stable Distributions: Models for Heavy Tailed Data. http://fs2.american.edu/jpnolan/www/stable/chap1.pdf
Parameters:  stability (Tensor) – Levy stability parameter \(\alpha\in(0,2]\) .
 skew (Tensor) – Skewness \(\beta\in[-1,1]\) .
 scale (Tensor) – Scale \(\sigma > 0\) . Defaults to 1.
 loc (Tensor) – Location \(\mu_0\) in Nolan’s parametrization [2]. Defaults to 0.

arg_constraints
= {'loc': Real(), 'scale': GreaterThan(lower_bound=0.0), 'skew': Interval(lower_bound=-1, upper_bound=1), 'stability': Interval(lower_bound=0, upper_bound=2)}¶

has_rsample
= True¶

support
= Real()¶
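For intuition about sampling without a density, the sketch below draws from the symmetric case only (skew = 0, scale = 1, loc = 0) using the Chambers-Mallows-Stuck construction discussed in [3]. It is an assumption-laden illustration, not Pyro's rsample(): it ignores skew entirely, and sample_symmetric_stable is a hypothetical name.

```python
import math
import random

def sample_symmetric_stable(stability, rng):
    """One draw from a symmetric, unit-scale, zero-location stable law."""
    a = stability
    v = rng.uniform(-math.pi / 2.0, math.pi / 2.0)   # uniform angle
    w = rng.expovariate(1.0)                         # unit exponential
    return (math.sin(a * v) / math.cos(v) ** (1.0 / a)
            * (math.cos(v - a * v) / w) ** ((1.0 - a) / a))

rng = random.Random(0)
# stability = 2 gives a Gaussian with variance 2 * scale**2 = 2.
gaussian_like = [sample_symmetric_stable(2.0, rng) for _ in range(20000)]
mean = sum(gaussian_like) / len(gaussian_like)
variance = sum((x - mean) ** 2 for x in gaussian_like) / len(gaussian_like)
```

At stability = 1 the same expression reduces to tan(v), a standard Cauchy draw.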
Unit¶

class
Unit
(log_factor, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Trivial non-normalized distribution representing the unit type.
The unit type has a single value with no data, i.e. value.numel() == 0.
This is used for pyro.factor() statements.
arg_constraints
= {'log_factor': Real()}¶

support
= Real()¶

VonMises¶

class
VonMises
(loc, concentration, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
A circular von Mises distribution.
This implementation uses polar coordinates. The loc and value args can be any real number (to facilitate unconstrained optimization), but are interpreted as angles modulo 2 pi.
See VonMises3D for a 3D cartesian coordinate cousin of this distribution.
Parameters:  loc (torch.Tensor) – an angle in radians.
 concentration (torch.Tensor) – concentration parameter

arg_constraints
= {'concentration': GreaterThan(lower_bound=0.0), 'loc': Real()}¶

has_rsample
= False¶

mean
¶ The provided mean is the circular one.

sample
(sample_shape=torch.Size([]))[source]¶ The sampling algorithm for the von Mises distribution is based on the following paper: Best, D. J., and Nicholas I. Fisher. “Efficient simulation of the von Mises distribution.” Applied Statistics (1979): 152-157.

support
= Real()¶

variance
¶ The provided variance is the circular one.
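A short pure-Python sketch of the von Mises log density, assuming the standard form concentration * cos(value - loc) - log(2 pi I0(concentration)). It shows why any real value is acceptable: the density is periodic in 2 pi. The helpers bessel_i0 and von_mises_log_prob are illustrative, not Pyro functions.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total

def von_mises_log_prob(value, loc, concentration):
    """Log density of an angle (radians); periodic with period 2*pi."""
    return (concentration * math.cos(value - loc)
            - math.log(2.0 * math.pi * bessel_i0(concentration)))

lp = von_mises_log_prob(0.3, 0.1, 2.0)
lp_wrapped = von_mises_log_prob(0.3 + 2.0 * math.pi, 0.1, 2.0)

# Riemann sum over one full period; should be close to 1.
n = 2000
total_mass = sum(
    math.exp(von_mises_log_prob(-math.pi + 2.0 * math.pi * i / n, 0.1, 2.0))
    for i in range(n)) * (2.0 * math.pi / n)
```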
VonMises3D¶

class
VonMises3D
(concentration, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Spherical von Mises distribution.
This implementation combines the direction parameter and concentration parameter into a single combined parameter that contains both direction and magnitude. The value arg is represented in cartesian coordinates: it must be a normalized 3-vector that lies on the 2-sphere.
See VonMises for a 2D polar coordinate cousin of this distribution.
Currently only log_prob() is implemented.
Parameters: concentration (torch.Tensor) – A combined location-and-concentration vector. The direction of this vector is the location, and its magnitude is the concentration.
arg_constraints
= {'concentration': Real()}¶

support
= Real()¶

ZeroInflatedPoisson¶

class
ZeroInflatedPoisson
(gate, rate, validate_args=None)[source]¶ Bases:
pyro.distributions.zero_inflated.ZeroInflatedDistribution
A Zero Inflated Poisson distribution.
Parameters:  gate (torch.Tensor) – probability of extra zeros.
 rate (torch.Tensor) – rate of poisson distribution.

support
= IntegerGreaterThan(lower_bound=0)¶
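The role of gate can be made concrete with a pure-Python sketch of the zero-inflated Poisson log probability, assuming the usual definition: with probability gate the outcome is an extra zero, otherwise it is drawn from Poisson(rate). zip_log_prob is an illustrative helper, not a Pyro function.

```python
import math

def zip_log_prob(k, gate, rate):
    """Log probability of count k under a zero-inflated Poisson."""
    poisson_lp = k * math.log(rate) - rate - math.lgamma(k + 1)
    if k == 0:
        # A zero can come from the gate or from the Poisson itself.
        return math.log(gate + (1.0 - gate) * math.exp(poisson_lp))
    return math.log1p(-gate) + poisson_lp

gate, rate = 0.3, 2.0
p_zero = math.exp(zip_log_prob(0, gate, rate))
total_mass = sum(math.exp(zip_log_prob(k, gate, rate)) for k in range(101))
```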
ZeroInflatedNegativeBinomial¶

class
ZeroInflatedNegativeBinomial
(gate, total_count, probs=None, logits=None, validate_args=None)[source]¶ Bases:
pyro.distributions.zero_inflated.ZeroInflatedDistribution
A Zero Inflated Negative Binomial distribution.
Parameters:  gate (torch.Tensor) – probability of extra zeros.
 total_count (float or Tensor) – non-negative number of negative Bernoulli trials
 probs (Tensor) – Event probabilities of success in the half-open interval [0, 1)
 logits (Tensor) – Event log-odds for probabilities of success

support
= IntegerGreaterThan(lower_bound=0)¶
ZeroInflatedDistribution¶

class
ZeroInflatedDistribution
(gate, base_dist, validate_args=None)[source]¶ Bases:
pyro.distributions.torch_distribution.TorchDistribution
Base class for a Zero Inflated distribution.
Parameters:  gate (torch.Tensor) – probability of extra zeros given via a Bernoulli distribution.
 base_dist (TorchDistribution) – the base distribution.

arg_constraints
= {'gate': Interval(lower_bound=0.0, upper_bound=1.0)}¶
Transforms¶
ConditionalTransform¶
CorrLCholeskyTransform¶

class
CorrLCholeskyTransform
(cache_size=0)[source]¶ Bases:
torch.distributions.transforms.Transform
Transforms a vector into the cholesky factor of a correlation matrix.
The input should have shape [batch_shape] + [d * (d-1)/2]. The output will have shape [batch_shape] + [d, d].
Reference:
[1] Cholesky Factors of Correlation Matrices, Stan Reference Manual v2.18, Section 10.12

bijective
= True¶

codomain
= CorrCholesky()¶

domain
= Real()¶

event_dim
= 1¶

sign
= 1¶
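One way to map d*(d-1)/2 unconstrained reals to a Cholesky factor of a correlation matrix is the tanh-plus-stick-breaking construction described in the Stan reference. The sketch below follows that idea in pure Python; the row-by-row fill order is an assumption and may differ from the order Pyro uses, and corr_cholesky_from_vector is a hypothetical name.

```python
import math

def corr_cholesky_from_vector(x, d):
    """Map d*(d-1)//2 reals to a d x d lower-triangular factor with unit-norm rows."""
    assert len(x) == d * (d - 1) // 2
    z = [math.tanh(v) for v in x]            # squash to (-1, 1)
    L = [[0.0] * d for _ in range(d)]
    idx = 0
    for i in range(d):
        remaining = 1.0                      # 1 - sum of squares placed so far in row i
        for j in range(i):
            L[i][j] = z[idx] * math.sqrt(remaining)
            remaining = max(remaining - L[i][j] ** 2, 0.0)
            idx += 1
        L[i][i] = math.sqrt(remaining)       # positive diagonal completes the unit norm
    return L

L = corr_cholesky_from_vector([0.5, -1.0, 2.0], d=3)
row_norms = [sum(v * v for v in row) for row in L]
```

Unit-norm rows guarantee that L times its transpose has ones on the diagonal, which is what makes it a correlation matrix.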

ELUTransform¶
LeakyReLUTransform¶
LowerCholeskyAffine¶

class
LowerCholeskyAffine
(loc, scale_tril)[source]¶ Bases:
torch.distributions.transforms.Transform
A bijection of the form \(\mathbf{y} = \mathbf{L} \mathbf{x} + \mathbf{r}\) where \(\mathbf{L}\) is a lower triangular matrix and \(\mathbf{r}\) is a vector.
Parameters:  loc (torch.tensor) – the fixed D-dimensional vector to shift the input by.
 scale_tril (torch.tensor) – the D x D lower triangular matrix used in the transformation.

bijective
= True¶

codomain
= RealVector()¶

event_dim
= 1¶

log_abs_det_jacobian
(x, y)[source]¶ Calculates the elementwise determinant of the log Jacobian, i.e. log(abs(dy/dx)).

volume_preserving
= False¶
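A pure-Python sketch of the affine map, its inverse by forward substitution, and its log absolute Jacobian determinant (the sum of the log absolute diagonal entries of a triangular matrix). The function names are illustrative and the code operates on nested lists rather than tensors.

```python
import math

def affine_forward(x, loc, scale_tril):
    """y = L x + r for lower-triangular L."""
    return [loc[i] + sum(scale_tril[i][j] * x[j] for j in range(i + 1))
            for i in range(len(x))]

def affine_inverse(y, loc, scale_tril):
    """Solve L x = y - r by forward substitution."""
    x = []
    for i in range(len(y)):
        acc = y[i] - loc[i] - sum(scale_tril[i][j] * x[j] for j in range(i))
        x.append(acc / scale_tril[i][i])
    return x

def log_abs_det_jacobian(scale_tril):
    """log|det L| equals the sum of log|L_ii| for a triangular matrix."""
    return sum(math.log(abs(scale_tril[i][i])) for i in range(len(scale_tril)))

loc = [1.0, -1.0]
scale_tril = [[2.0, 0.0],
              [0.5, 3.0]]
x = [0.4, -0.6]
y = affine_forward(x, loc, scale_tril)
x_back = affine_inverse(y, loc, scale_tril)
log_det = log_abs_det_jacobian(scale_tril)
```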
Permute¶

class
Permute
(permutation)[source]¶ Bases:
torch.distributions.transforms.Transform
A bijection that reorders the input dimensions, that is, multiplies the input by a permutation matrix. This is useful in between AffineAutoregressive transforms to increase the flexibility of the resulting distribution and stabilize learning. Whilst not being an autoregressive transform, the log absolute determinant of the Jacobian is easily calculable as 0. Note that reordering the input dimension between two layers of AffineAutoregressive is not equivalent to reordering the dimension inside the MADE networks that those IAFs use; using a Permute transform results in a distribution with more flexibility.
Example usage:
>>> from pyro.nn import AutoRegressiveNN
>>> from pyro.distributions.transforms import AffineAutoregressive, Permute
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> iaf1 = AffineAutoregressive(AutoRegressiveNN(10, [40]))
>>> ff = Permute(torch.randperm(10, dtype=torch.long))
>>> iaf2 = AffineAutoregressive(AutoRegressiveNN(10, [40]))
>>> flow_dist = dist.TransformedDistribution(base_dist, [iaf1, ff, iaf2])
>>> flow_dist.sample()  # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
Parameters: permutation (torch.LongTensor) – a permutation ordering that is applied to the inputs. 
bijective
= True¶

codomain
= Real()¶

event_dim
= 1¶

log_abs_det_jacobian
(x, y)[source]¶ Calculates the elementwise determinant of the log Jacobian, i.e. log(abs([dy_0/dx_0, …, dy_{N-1}/dx_{N-1}])). Note that this type of transform is not autoregressive, so the log Jacobian is not the sum of the previous expression. However, it turns out it’s always 0 (since the determinant is -1 or +1), and so returning a vector of zeros works.

volume_preserving
= True¶
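The claim that a permutation has determinant -1 or +1 can be checked in pure Python. In this sketch, which uses hypothetical helper names and lists in place of tensors, the sign is computed from the number of inversions.

```python
def permute(x, permutation):
    """Reorder x so that output position i takes input position permutation[i]."""
    return [x[p] for p in permutation]

def inverse_permutation(permutation):
    """Permutation that undoes `permutation` when passed to permute()."""
    inv = [0] * len(permutation)
    for i, p in enumerate(permutation):
        inv[p] = i
    return inv

def permutation_sign(permutation):
    """Determinant of the permutation matrix: +1 for even parity, -1 for odd."""
    n = len(permutation)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if permutation[i] > permutation[j])
    return -1 if inversions % 2 else 1

perm = [2, 0, 1]
x = [10.0, 20.0, 30.0]
y = permute(x, perm)
x_back = permute(y, inverse_permutation(perm))
```

Either way the absolute determinant is 1, so its log is 0 and the transform is volume preserving.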

TransformModules¶
AffineAutoregressive¶

class
AffineAutoregressive
(autoregressive_nn, log_scale_min_clip=-5.0, log_scale_max_clip=3.0, sigmoid_bias=2.0, stable=False)[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
An implementation of the bijective transform of Inverse Autoregressive Flow (IAF), using by default Eq (10) from Kingma et al., 2016,
\(\mathbf{y} = \mu_t + \sigma_t\odot\mathbf{x}\)
where \(\mathbf{x}\) are the inputs, \(\mathbf{y}\) are the outputs, \(\mu_t,\sigma_t\) are calculated from an autoregressive network on \(\mathbf{x}\), and \(\sigma_t>0\).
If the stable keyword argument is set to True then the transformation used is,
\(\mathbf{y} = \sigma_t\odot\mathbf{x} + (1-\sigma_t)\odot\mu_t\)
where \(\sigma_t\) is restricted to \((0,1)\). This variant of IAF is claimed by the authors to be more numerically stable than one using Eq (10), although in practice it leads to a restriction on the distributions that can be represented, presumably since the input is restricted to rescaling by a number on \((0,1)\).
Together with TransformedDistribution this provides a way to create richer variational approximations.
Example usage:
>>> from pyro.nn import AutoRegressiveNN
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> transform = AffineAutoregressive(AutoRegressiveNN(10, [40]))
>>> pyro.module("my_transform", transform)  # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample()  # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse of the Bijector is required when, e.g., scoring the log density of a sample with TransformedDistribution. This implementation caches the inverse of the Bijector when its forward operation is called, e.g., when sampling from TransformedDistribution. However, if the cached value isn’t available, either because it was overwritten during sampling a new value or an arbitrary value is being scored, it will calculate it manually. Note that this is an operation that scales as O(D) where D is the input dimension, and so should be avoided for large dimensional uses. So in general, it is cheap to sample from IAF and score a value that was sampled by IAF, but expensive to score an arbitrary value.
Parameters:  autoregressive_nn (nn.Module) – an autoregressive neural network whose forward call returns a real-valued mean and logit-scale as a tuple
 log_scale_min_clip (float) – The minimum value for clipping the log(scale) from the autoregressive NN
 log_scale_max_clip (float) – The maximum value for clipping the log(scale) from the autoregressive NN
 sigmoid_bias (float) – A term to add the logit of the input when using the stable transform.
 stable (bool) – When true, uses the alternative “stable” version of the transform (see above).
References:
1. Improving Variational Inference with Inverse Autoregressive Flow [arXiv:1606.04934] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling
2. Variational Inference with Normalizing Flows [arXiv:1505.05770] Danilo Jimenez Rezende, Shakir Mohamed
3. MADE: Masked Autoencoder for Distribution Estimation [arXiv:1502.03509] Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle
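The asymmetry between forward and inverse cost can be seen in a toy pure-Python version of the default transform. Here toy_autoregressive_params is a hypothetical stand-in for the autoregressive network: it returns a mean and a positive scale for coordinate i from the inputs before i only.

```python
import math

def toy_autoregressive_params(x_prefix):
    """Stand-in for an autoregressive NN: (mu_i, sigma_i) from x_0..x_{i-1}."""
    s = sum(x_prefix)
    return 0.5 * s, math.exp(math.tanh(s))    # sigma_i > 0

def forward(x):
    # All of x is known, so each output could be computed independently.
    out = []
    for i in range(len(x)):
        mu, sigma = toy_autoregressive_params(x[:i])
        out.append(mu + sigma * x[i])
    return out

def inverse(y):
    # x_i needs x_0..x_{i-1}, so the D steps must run one after another.
    x = []
    for i in range(len(y)):
        mu, sigma = toy_autoregressive_params(x)
        x.append((y[i] - mu) / sigma)
    return x

def log_abs_det_jacobian(x):
    # The Jacobian is lower triangular with diagonal sigma_i.
    return sum(math.log(toy_autoregressive_params(x[:i])[1])
               for i in range(len(x)))

x = [0.3, -1.2, 0.7, 2.0]
y = forward(x)
x_back = inverse(y)
log_det = log_abs_det_jacobian(x)
```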

autoregressive
= True¶

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶
AffineCoupling¶

class
AffineCoupling
(split_dim, hypernet, log_scale_min_clip=-5.0, log_scale_max_clip=3.0)[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
An implementation of the affine coupling layer of RealNVP (Dinh et al., 2017) that uses the bijective transform,
\(\mathbf{y}_{1:d} = \mathbf{x}_{1:d}\)
\(\mathbf{y}_{(d+1):D} = \mu + \sigma\odot\mathbf{x}_{(d+1):D}\)
where \(\mathbf{x}\) are the inputs, \(\mathbf{y}\) are the outputs, e.g. \(\mathbf{x}_{1:d}\) represents the first \(d\) elements of the inputs, and \(\mu,\sigma\) are shift and scale parameters calculated as the output of a function inputting only \(\mathbf{x}_{1:d}\).
That is, the first \(d\) components remain unchanged, and the subsequent \(D-d\) are shifted and scaled by a function of the previous components.
Together with TransformedDistribution this provides a way to create richer variational approximations.
Example usage:
>>> from pyro.nn import DenseNN
>>> input_dim = 10
>>> split_dim = 6
>>> base_dist = dist.Normal(torch.zeros(input_dim), torch.ones(input_dim))
>>> hypernet = DenseNN(split_dim, [10*input_dim], [input_dim-split_dim, input_dim-split_dim])
>>> transform = AffineCoupling(split_dim, hypernet)
>>> pyro.module("my_transform", transform)  # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample()  # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse of the Bijector is required when, e.g., scoring the log density of a sample with TransformedDistribution. This implementation caches the inverse of the Bijector when its forward operation is called, e.g., when sampling from TransformedDistribution. However, if the cached value isn’t available, either because it was overwritten during sampling a new value or an arbitrary value is being scored, it will calculate it manually.
This is an operation that scales as O(1), i.e. constant in the input dimension. So in general, it is cheap to sample and score (an arbitrary value) from AffineCoupling.
Parameters:  split_dim (int) – Zero-indexed dimension \(d\) upon which to perform input/output split for transformation.
 hypernet (callable) – a neural network whose forward call returns a real-valued mean and logit-scale as a tuple. The input should have final dimension split_dim and the output final dimension input_dim-split_dim for each member of the tuple.
 log_scale_min_clip (float) – The minimum value for clipping the log(scale) from the NN
 log_scale_max_clip (float) – The maximum value for clipping the log(scale) from the NN
References:
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. ICLR 2017.
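A toy pure-Python coupling layer shows why the inverse is as cheap as the forward pass: the first split_dim coordinates pass through unchanged, so the same conditioner call recovers the shift and scale. toy_hypernet is a hypothetical stand-in for the hypernet and returns a mean and a log-scale for each remaining coordinate.

```python
import math

def toy_hypernet(x_head, out_dim):
    """Stand-in for the hypernet: (mean, log_scale) lists of length out_dim."""
    s = sum(x_head)
    mean = [0.1 * s * (k + 1) for k in range(out_dim)]
    log_scale = [math.tanh(s + k) for k in range(out_dim)]
    return mean, log_scale

def forward(x, split_dim):
    head, tail = x[:split_dim], x[split_dim:]
    mean, log_scale = toy_hypernet(head, len(tail))
    return head + [m + math.exp(ls) * t for m, ls, t in zip(mean, log_scale, tail)]

def inverse(y, split_dim):
    head, tail = y[:split_dim], y[split_dim:]
    # head is untouched by forward(), so one conditioner call is enough.
    mean, log_scale = toy_hypernet(head, len(tail))
    return head + [(t - m) * math.exp(-ls) for m, ls, t in zip(mean, log_scale, tail)]

x = [0.2, -0.4, 1.5, 0.0, -2.0]
split_dim = 2
y = forward(x, split_dim)
x_back = inverse(y, split_dim)
```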

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶
BatchNorm¶

class
BatchNorm
(input_dim, momentum=0.1, epsilon=1e-05)[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
A type of batch normalization that can be used to stabilize training in normalizing flows. The inverse operation is defined as
\(x = (y - \hat{\mu}) \oslash \sqrt{\hat{\sigma^2}} \otimes \gamma + \beta\)
that is, the standard batch norm equation, where \(x\) is the input, \(y\) is the output, \(\gamma,\beta\) are learnable parameters, and \(\hat{\mu}\)/\(\hat{\sigma^2}\) are smoothed running averages of the sample mean and variance, respectively. The constraint \(\gamma>0\) is enforced to ease calculation of the log-det-Jacobian term.
This is an elementwise transform, and when applied to a vector, learns two parameters (\(\gamma,\beta\)) for each dimension of the input.
When the module is set to training mode, the moving averages of the sample mean and variance are updated every time the inverse operator is called, e.g., when a normalizing flow scores a minibatch with the log_prob method.
Also, when the module is set to training mode, the sample mean and variance on the current minibatch are used in place of the smoothed averages, \(\hat{\mu}\) and \(\hat{\sigma^2}\), for the inverse operator. For this reason it is not the case that \(x=g(g^{-1}(x))\) during training, i.e., that the inverse operation is the inverse of the forward one.
Example usage:
>>> from pyro.nn import AutoRegressiveNN
>>> from pyro.distributions.transforms import AffineAutoregressive
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> iafs = [AffineAutoregressive(AutoRegressiveNN(10, [40])) for _ in range(2)]
>>> bn = BatchNorm(10)
>>> flow_dist = dist.TransformedDistribution(base_dist, [iafs[0], bn, iafs[1]])
>>> flow_dist.sample()  # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
References:
[1] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, 2015. https://arxiv.org/abs/1502.03167
[2] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation using Real NVP. In International Conference on Learning Representations, 2017. https://arxiv.org/abs/1605.08803
[3] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked Autoregressive Flow for Density Estimation. In Neural Information Processing Systems, 2017. https://arxiv.org/abs/1705.07057
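The inverse equation above, and the training-mode mismatch it describes, can be sketched in pure Python. This assumes epsilon is added to the variance inside the square root, as in standard batch norm; that detail and the function names bn_inverse and bn_forward are assumptions rather than a description of the Pyro source.

```python
import math

def bn_inverse(y, mean, var, gamma, beta, epsilon=1e-5):
    """x = (y - mean) / sqrt(var + epsilon) * gamma + beta, elementwise."""
    return [(yi - m) / math.sqrt(v + epsilon) * g + b
            for yi, m, v, g, b in zip(y, mean, var, gamma, beta)]

def bn_forward(x, mean, var, gamma, beta, epsilon=1e-5):
    """Undo bn_inverse for the same statistics."""
    return [(xi - b) / g * math.sqrt(v + epsilon) + m
            for xi, m, v, g, b in zip(x, mean, var, gamma, beta)]

running_mean, running_var = [0.0, 1.0], [1.0, 4.0]   # smoothed averages
gamma, beta = [1.5, 0.5], [0.1, -0.2]                # gamma > 0
y = [0.3, 2.0]

# Evaluation mode: both directions use the running averages and round-trip.
x_eval = bn_inverse(y, running_mean, running_var, gamma, beta)
y_roundtrip = bn_forward(x_eval, running_mean, running_var, gamma, beta)

# Training mode: the inverse uses minibatch statistics instead, so applying
# the forward operation with the running averages does not return y.
batch_mean, batch_var = [0.2, 1.5], [0.8, 3.0]
x_train = bn_inverse(y, batch_mean, batch_var, gamma, beta)
y_mismatch = bn_forward(x_train, running_mean, running_var, gamma, beta)
```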

bijective
= True¶

codomain
= Real()¶

constrained_gamma
¶

domain
= Real()¶

event_dim
= 0¶

BlockAutoregressive¶

class
BlockAutoregressive
(input_dim, hidden_factors=[8, 8], activation='tanh', residual=None)[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
An implementation of Block Neural Autoregressive Flow (block-NAF) (De Cao et al., 2019) bijective transform. Block-NAF uses a similar transformation to deep dense NAF, building the autoregressive NN into the structure of the transform, in a sense.
Together with TransformedDistribution this provides a way to create richer variational approximations.
Example usage:
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> naf = BlockAutoregressive(input_dim=10)
>>> pyro.module("my_naf", naf)  # doctest: +SKIP
>>> naf_dist = dist.TransformedDistribution(base_dist, [naf])
>>> naf_dist.sample()  # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse operation is not implemented. This would require numerical inversion, e.g., using a root-finding method, which is a possibility for a future implementation.
Parameters:  input_dim (int) – The dimensionality of the input and output variables.
 hidden_factors (list) – Hidden layer i has hidden_factors[i] hidden units per input dimension. This corresponds to both \(a\) and \(b\) in De Cao et al. (2019). The elements of hidden_factors must be integers.
 activation (string) – Activation function to use. One of ‘ELU’, ‘LeakyReLU’, ‘sigmoid’, or ‘tanh’.
 residual (string) – Type of residual connections to use. Choices are “None”, “normal” for \(\mathbf{y}+f(\mathbf{y})\), and “gated” for \(\alpha\mathbf{y} + (1 - \alpha)f(\mathbf{y})\) for learnable parameter \(\alpha\).
References:
Block Neural Autoregressive Flow [arXiv:1904.04676] Nicola De Cao, Ivan Titov, Wilker Aziz

autoregressive
= True¶

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶
ConditionalPlanar¶

class
ConditionalPlanar
(nn)[source]¶ Bases:
pyro.distributions.conditional.ConditionalTransformModule
A conditional ‘planar’ bijective transform using the equation,
\(\mathbf{y} = \mathbf{x} + \mathbf{u}\tanh(\mathbf{w}^T\mathbf{x}+b)\)
where \(\mathbf{x}\) are the inputs with dimension \(D\), \(\mathbf{y}\) are the outputs, and the pseudo-parameters \(b\in\mathbb{R}\), \(\mathbf{u}\in\mathbb{R}^D\), and \(\mathbf{w}\in\mathbb{R}^D\) are the output of a function, e.g. a NN, with input \(z\in\mathbb{R}^{M}\) representing the context variable to condition on. For this to be an invertible transformation, the condition \(\mathbf{w}^T\mathbf{u}>-1\) is enforced.
Together with ConditionalTransformedDistribution this provides a way to create richer variational approximations.
Example usage:
>>> from pyro.nn.dense_nn import DenseNN
>>> input_dim = 10
>>> context_dim = 5
>>> batch_size = 3
>>> base_dist = dist.Normal(torch.zeros(input_dim), torch.ones(input_dim))
>>> hypernet = DenseNN(context_dim, [50, 50], param_dims=[1, input_dim, input_dim])
>>> transform = ConditionalPlanar(hypernet)
>>> z = torch.rand(batch_size, context_dim)
>>> flow_dist = dist.ConditionalTransformedDistribution(base_dist, [transform]).condition(z)
>>> flow_dist.sample(sample_shape=torch.Size([batch_size]))  # doctest: +SKIP
The inverse of this transform does not possess an analytical solution and is left unimplemented. However, the inverse is cached when the forward operation is called during sampling, and so samples drawn using the planar transform can be scored.
Parameters: nn (callable) – a function inputting the context variable and outputting a triplet of real-valued parameters of dimensions \((1, D, D)\).
References: Variational Inference with Normalizing Flows [arXiv:1505.05770] Danilo Jimenez Rezende, Shakir Mohamed

bijective
= True¶

codomain
= Real()¶

condition
(context)[source]¶ See
pyro.distributions.conditional.ConditionalTransformModule.condition()

domain
= Real()¶

event_dim
= 1¶

ConditionalTransformModule¶

class
ConditionalTransformModule
(*args, **kwargs)[source]¶ Bases:
pyro.distributions.conditional.ConditionalTransform
,torch.nn.modules.module.Module
Conditional transforms with learnable parameters such as normalizing flows should inherit from this class rather than ConditionalTransform so they are also a subclass of Module and inherit all the useful methods of that class.
Householder¶

class
Householder
(input_dim, count_transforms=1)[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
Represents multiple applications of the Householder bijective transformation. A single Householder transformation takes the form,
\(\mathbf{y} = (I - 2*\frac{\mathbf{u}\mathbf{u}^T}{||\mathbf{u}||^2})\mathbf{x}\)
where \(\mathbf{x}\) are the inputs, \(\mathbf{y}\) are the outputs, and the learnable parameters are \(\mathbf{u}\in\mathbb{R}^D\) for input dimension \(D\).
The transformation represents the reflection of \(\mathbf{x}\) through the plane passing through the origin with normal \(\mathbf{u}\).
\(D\) applications of this transformation are able to transform i.i.d. standard Gaussian noise into a Gaussian variable with an arbitrary covariance matrix. With \(K<D\) transformations, one is able to approximate a full-rank Gaussian distribution using a linear transformation of rank \(K\).
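A single Householder reflection is easy to verify in pure Python: it preserves the Euclidean norm, it is its own inverse, and it maps the normal vector to its negative. The helper name householder is illustrative, and lists stand in for tensors.

```python
import math

def householder(x, u):
    """Reflect x through the hyperplane through the origin with normal u."""
    scale = 2.0 * sum(ui * xi for ui, xi in zip(u, x)) / sum(ui * ui for ui in u)
    return [xi - scale * ui for xi, ui in zip(x, u)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

x = [1.0, 2.0, -0.5]
u = [0.3, -1.0, 2.0]
y = householder(x, u)
x_back = householder(y, u)      # reflecting twice returns the input
flipped = householder(u, u)     # the normal itself is mapped to -u
```

Norm preservation is the reason the log absolute Jacobian determinant of the transform is zero.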
Together with TransformedDistribution this provides a way to create richer variational approximations.
Example usage:
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> transform = Householder(10, count_transforms=5)
>>> pyro.module("my_transform", transform)  # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample()  # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
References:
Improving Variational AutoEncoders using Householder Flow, [arXiv:1611.09630] Tomczak, J. M., & Welling, M.

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶

log_abs_det_jacobian
(x, y)[source]¶ Calculates the elementwise determinant of the log jacobian. Householder flow is measure preserving, so \(\log(detJ) = 0\)

volume_preserving
= True¶

NeuralAutoregressive¶

class
NeuralAutoregressive
(autoregressive_nn, hidden_units=16, activation='sigmoid')[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
An implementation of the deep Neural Autoregressive Flow (NAF) bijective transform of the “IAF flavour” that can be used for sampling and scoring samples drawn from it (but not arbitrary ones).
Example usage:
>>> from pyro.nn import AutoRegressiveNN
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> arn = AutoRegressiveNN(10, [40], param_dims=[16]*3)
>>> transform = NeuralAutoregressive(arn, hidden_units=16)
>>> pyro.module("my_transform", transform)  # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample()  # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse operation is not implemented. This would require numerical inversion, e.g., using a root-finding method, which is a possibility for a future implementation.
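To illustrate what such a numerical inversion could look like, here is a minimal sketch using bisection on a scalar, strictly increasing map. The map is a simplified stand-in loosely modelled on a sigmoid mixture followed by a logit; it is an assumption made for illustration and is not the transform implemented by this class. The names `monotone_map` and `invert_by_bisection`, and all parameter values, are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def monotone_map(x, weights, slopes, biases):
    """Strictly increasing scalar map: logit of a convex mix of sigmoids.

    Requires positive weights summing to 1 and positive slopes.
    """
    mix = sum(w * sigmoid(a * x + b) for w, a, b in zip(weights, slopes, biases))
    return math.log(mix) - math.log1p(-mix)

def invert_by_bisection(f, y, lo=-20.0, hi=20.0, tol=1e-10):
    """Find x in [lo, hi] with f(x) = y, assuming f is increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return 0.5 * (lo + hi)

weights = [0.2, 0.5, 0.3]   # hypothetical mixture weights
slopes = [1.5, 0.7, 2.0]    # hypothetical positive slopes
biases = [-1.0, 0.3, 0.8]   # hypothetical biases

def f(x):
    return monotone_map(x, weights, slopes, biases)

x_true = 0.8
y = f(x_true)
x_recovered = invert_by_bisection(f, y)
```

In an autoregressive flow this one-dimensional solve would have to be repeated dimension by dimension, since each output depends on the previously recovered inputs, which is part of why inversion is costly.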
Parameters:  autoregressive_nn (nn.Module) – an autoregressive neural network whose forward call returns a tuple of three real-valued tensors, whose last dimension is the input dimension, and whose penultimate dimension is equal to hidden_units.
 hidden_units (int) – the number of hidden units to use in the NAF transformation (see Eq (8) in reference)
 activation (string) – Activation function to use. One of ‘ELU’, ‘LeakyReLU’, ‘sigmoid’, or ‘tanh’.
Reference:
Neural Autoregressive Flows [arXiv:1804.00779] Chin-Wei Huang, David Krueger, Alexandre Lacoste, Aaron Courville

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶
Planar¶

class
Planar
(input_dim)[source]¶ Bases:
pyro.distributions.transforms.planar.ConditionedPlanar
,pyro.distributions.torch_transform.TransformModule
A ‘planar’ bijective transform with equation,
\(\mathbf{y} = \mathbf{x} + \mathbf{u}\tanh(\mathbf{w}^T\mathbf{x}+b)\)
where \(\mathbf{x}\) are the inputs, \(\mathbf{y}\) are the outputs, and the learnable parameters are \(b\in\mathbb{R}\), \(\mathbf{u}\in\mathbb{R}^D\), \(\mathbf{w}\in\mathbb{R}^D\) for input dimension \(D\). For this to be an invertible transformation, the condition \(\mathbf{w}^T\mathbf{u}>-1\) is enforced.
Together with
TransformedDistribution
this provides a way to create richer variational approximations.
Example usage:
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> transform = Planar(10)
>>> pyro.module("my_transform", transform) # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample() # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse of this transform does not possess an analytical solution and is left unimplemented. However, the inverse is cached when the forward operation is called during sampling, and so samples drawn using the planar transform can be scored.
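As a rough illustration of the planar transform and its log-determinant, the sketch below evaluates \(\mathbf{y} = \mathbf{x} + \mathbf{u}\tanh(\mathbf{w}^T\mathbf{x}+b)\) and \(\log|\det J| = \log|1 + (1-\tanh^2(\mathbf{w}^T\mathbf{x}+b))\,\mathbf{w}^T\mathbf{u}|\) in plain Python. The invertibility condition is imposed with the reparameterization \(\hat{\mathbf{u}} = \mathbf{u} + [m(\mathbf{w}^T\mathbf{u}) - \mathbf{w}^T\mathbf{u}]\,\mathbf{w}/\|\mathbf{w}\|^2\), \(m(a) = -1 + \log(1+e^a)\), taken from the referenced paper; that this class enforces the condition in exactly this way is an assumption. All function names and parameter values are hypothetical.

```python
import math

def softplus(a):
    return math.log1p(math.exp(a))

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def constrain_u(u, w):
    """Shift u along w so that w.u_hat = -1 + softplus(w.u) > -1."""
    w_dot_u = dot(w, u)
    shift = (-1.0 + softplus(w_dot_u) - w_dot_u) / dot(w, w)
    return [ui + shift * wi for ui, wi in zip(u, w)]

def planar_forward(x, u_hat, w, b):
    """y = x + u_hat * tanh(w.x + b)."""
    act = math.tanh(dot(w, x) + b)
    return [xi + ui * act for xi, ui in zip(x, u_hat)]

def planar_log_abs_det(x, u_hat, w, b):
    """log|det J| = log|1 + (1 - tanh^2(w.x + b)) * (w.u_hat)|."""
    act = math.tanh(dot(w, x) + b)
    return math.log(abs(1.0 + (1.0 - act * act) * dot(w, u_hat)))

w = [1.0, 0.8]
b = 0.1
u_raw = [-2.0, -1.0]            # w.u_raw = -2.8 violates the condition
u_hat = constrain_u(u_raw, w)   # after the shift, w.u_hat > -1
x = [0.2, -0.4]

y = planar_forward(x, u_hat, w, b)
log_det = planar_log_abs_det(x, u_hat, w, b)
```

The log-determinant costs only a couple of dot products because the Jacobian is the identity plus a rank-one matrix.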
Parameters: input_dim (int) – the dimension of the input (and output) variable.
References:
Variational Inference with Normalizing Flows [arXiv:1505.05770] Danilo Jimenez Rezende, Shakir Mohamed

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶

Polynomial¶

class
Polynomial
(autoregressive_nn, input_dim, count_degree, count_sum)[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
An autoregressive bijective transform as described in Jaini et al. (2019), applying the following equation elementwise,
\(y_n = c_n + \int^{x_n}_0\sum^K_{k=1}\left(\sum^R_{r=0}a^{(n)}_{r,k}u^r\right)du\)
where \(x_n\) is the \(n\)th input, \(y_n\) is the \(n\)th output, and \(c_n\in\mathbb{R}\), \(\left\{a^{(n)}_{r,k}\in\mathbb{R}\right\}\) are learnable parameters that are the output of an autoregressive NN inputting \(x_{\prec n}=\{x_1,x_2,\ldots,x_{n-1}\}\).
Together with
TransformedDistribution
this provides a way to create richer variational approximations.
Example usage:
>>> from pyro.nn import AutoRegressiveNN
>>> input_dim = 10
>>> count_degree = 4
>>> count_sum = 3
>>> base_dist = dist.Normal(torch.zeros(input_dim), torch.ones(input_dim))
>>> arn = AutoRegressiveNN(input_dim, [input_dim*10], param_dims=[(count_degree + 1)*count_sum])
>>> transform = Polynomial(arn, input_dim=input_dim, count_degree=count_degree, count_sum=count_sum)
>>> pyro.module("my_transform", transform) # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample() # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse of this transform does not possess an analytical solution and is left unimplemented. However, the inverse is cached when the forward operation is called during sampling, and so samples drawn using a polynomial transform can be scored.
Parameters:  autoregressive_nn (nn.Module) – an autoregressive neural network whose forward call returns a tensor of real-valued numbers of size (batch_size, (count_degree+1)*count_sum, input_dim)
 count_degree (int) – The degree of the polynomial to use for each elementwise transformation.
 count_sum (int) – The number of polynomials to sum in each elementwise transformation.
References:
Sum-of-squares polynomial flow. [arXiv:1905.02325] Priyank Jaini, Kira A. Selby, Yaoliang Yu

autoregressive
= True¶

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶
Radial¶

class
Radial
(input_dim)[source]¶ Bases:
pyro.distributions.torch_transform.TransformModule
A ‘radial’ bijective transform using the equation,
\(\mathbf{y} = \mathbf{x} + \beta h(\alpha,r)(\mathbf{x} - \mathbf{x}_0)\)
where \(\mathbf{x}\) are the inputs, \(\mathbf{y}\) are the outputs, and the learnable parameters are \(\alpha\in\mathbb{R}^+\), \(\beta\in\mathbb{R}\), \(\mathbf{x}_0\in\mathbb{R}^D\), for input dimension \(D\), \(r=\|\mathbf{x}-\mathbf{x}_0\|_2\), \(h(\alpha,r)=1/(\alpha+r)\). For this to be an invertible transformation, the condition \(\beta>-\alpha\) is enforced.
Together with
TransformedDistribution
this provides a way to create richer variational approximations.
Example usage:
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> transform = Radial(10)
>>> pyro.module("my_transform", transform) # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample() # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse of this transform does not possess an analytical solution and is left unimplemented. However, the inverse is cached when the forward operation is called during sampling, and so samples drawn using the radial transform can be scored.
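A minimal plain-Python sketch of the radial transform may help make the formula concrete. It evaluates \(\mathbf{y} = \mathbf{x} + \beta h(\alpha,r)(\mathbf{x}-\mathbf{x}_0)\) with \(h(\alpha,r)=1/(\alpha+r)\), and uses the log-determinant from the referenced paper, \(\log|\det J| = (D-1)\log(1+\beta h) + \log(1+\beta h+\beta h' r)\) with \(h' = -1/(\alpha+r)^2\). The function names and parameter values are hypothetical, and the sketch assumes \(\beta > -\alpha\) has already been imposed rather than showing how the class enforces it.

```python
import math

def _radius(x, x0):
    diff = [xi - ci for xi, ci in zip(x, x0)]
    return diff, math.sqrt(sum(d * d for d in diff))

def radial_forward(x, x0, alpha, beta):
    """y = x + beta * h(alpha, r) * (x - x0), with h = 1 / (alpha + r)."""
    diff, r = _radius(x, x0)
    h = 1.0 / (alpha + r)
    return [xi + beta * h * d for xi, d in zip(x, diff)]

def radial_log_abs_det(x, x0, alpha, beta):
    """(D - 1) * log(1 + beta*h) + log(1 + beta*h + beta*h'*r)."""
    _, r = _radius(x, x0)
    h = 1.0 / (alpha + r)
    h_prime = -1.0 / (alpha + r) ** 2
    dim = len(x)
    return ((dim - 1) * math.log(abs(1.0 + beta * h))
            + math.log(abs(1.0 + beta * h + beta * h_prime * r)))

alpha = 1.0          # must be positive
beta = -0.5          # satisfies beta > -alpha
x0 = [0.1, 0.2]      # hypothetical reference point
x = [0.7, -0.3]

y = radial_forward(x, x0, alpha, beta)
log_det = radial_log_abs_det(x, x0, alpha, beta)
```

With a negative \(\beta\), as here, points are pulled toward \(\mathbf{x}_0\); a positive \(\beta\) pushes them away.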
Parameters: input_dim (int) – the dimension of the input (and output) variable.
References:
Variational Inference with Normalizing Flows [arXiv:1505.05770] Danilo Jimenez Rezende, Shakir Mohamed

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶

Sylvester¶

class
Sylvester
(input_dim, count_transforms=1)[source]¶ Bases:
pyro.distributions.transforms.householder.Householder
An implementation of the Sylvester bijective transform of the Householder variety (Van den Berg et al., 2018),
\(\mathbf{y} = \mathbf{x} + QR\tanh(SQ^T\mathbf{x}+\mathbf{b})\)
where \(\mathbf{x}\) are the inputs, \(\mathbf{y}\) are the outputs, \(R,S\sim D\times D\) are upper triangular matrices for input dimension \(D\), \(Q\sim D\times D\) is an orthogonal matrix, and \(\mathbf{b}\sim D\) is a learnable bias term.
The Sylvester transform is a generalization of
Planar
. In the Householder type of the Sylvester transform, the orthogonality of \(Q\) is enforced by representing it as the product of Householder transformations.
Together with
TransformedDistribution
it provides a way to create richer variational approximations.
Example usage:
>>> base_dist = dist.Normal(torch.zeros(10), torch.ones(10))
>>> transform = Sylvester(10, count_transforms=4)
>>> pyro.module("my_transform", transform) # doctest: +SKIP
>>> flow_dist = dist.TransformedDistribution(base_dist, [transform])
>>> flow_dist.sample() # doctest: +SKIP
tensor([0.4071, 0.5030, 0.7924, 0.2366, 0.2387, 0.1417, 0.0868, 0.1389, 0.4629, 0.0986])
The inverse of this transform does not possess an analytical solution and is left unimplemented. However, the inverse is cached when the forward operation is called during sampling, and so samples drawn using the Sylvester transform can be scored.
References:
Rianne van den Berg, Leonard Hasenclever, Jakub M. Tomczak, Max Welling. Sylvester Normalizing Flows for Variational Inference. In proceedings of The 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018).

bijective
= True¶

codomain
= Real()¶

domain
= Real()¶

event_dim
= 1¶

TransformModule¶

class
TransformModule
(*args, **kwargs)[source]¶ Bases:
torch.distributions.transforms.Transform
,torch.nn.modules.module.Module
Transforms with learnable parameters such as normalizing flows should inherit from this class rather than Transform so they are also a subclass of nn.Module and inherit all the useful methods of that class.
Transform Factories¶
Each Transform
and TransformModule
has a corresponding lower-case helper function that takes, at minimum, the input dimension of the transform, and possibly additional arguments to customize the transform. These helper functions hide from the user whether the transform requires constructing a hypernet and, if so, the input and output dimensions of that hypernet.
affine_autoregressive¶

affine_autoregressive
(input_dim, hidden_dims=None, **kwargs)[source]¶ A helper function to create an
AffineAutoregressive
object that takes care of constructing an autoregressive network with the correct input/output dimensions.
Parameters:  input_dim (int) – Dimension of input variable
 hidden_dims (list[int]) – The desired hidden dimensions of the autoregressive network. Defaults to using [3*input_dim + 1]
 log_scale_min_clip (float) – The minimum value for clipping the log(scale) from the autoregressive NN
 log_scale_max_clip (float) – The maximum value for clipping the log(scale) from the autoregressive NN
 sigmoid_bias (float) – A term to add to the logit of the input when using the stable transform.
 stable (bool) – When true, uses the alternative “stable” version of the transform (see above).
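To show what the two clipping parameters do, here is a minimal sketch of a single elementwise affine step \(y_i = \mu_i + \exp(s_i)\,x_i\) in which the log-scale \(s_i\) is clamped before use. In the real transform the means and log-scales would come from the autoregressive network; here they are hard-coded, and the clip values \(-5.0\) and \(3.0\) are illustrative choices rather than documented defaults. The names `clamp` and `affine_forward` are hypothetical.

```python
import math

def clamp(value, lower, upper):
    """Restrict value to the interval [lower, upper]."""
    return max(lower, min(upper, value))

def affine_forward(x, mean, log_scale, log_scale_min_clip, log_scale_max_clip):
    """Elementwise y = mean + exp(clipped log_scale) * x.

    Returns the outputs and log|det J|, which is the sum of the
    clipped log-scales because the Jacobian is triangular.
    """
    clipped = [clamp(s, log_scale_min_clip, log_scale_max_clip)
               for s in log_scale]
    y = [m + math.exp(s) * xi for xi, m, s in zip(x, mean, clipped)]
    return y, sum(clipped)

x = [1.0, 2.0, -1.0]
mean = [0.0, 0.5, 1.0]          # stand-in for network output
log_scale = [10.0, -10.0, 0.5]  # stand-in for network output; two values are extreme

y, log_det = affine_forward(x, mean, log_scale,
                            log_scale_min_clip=-5.0,
                            log_scale_max_clip=3.0)
```

Clipping keeps `exp(log_scale)` within a bounded range, so an extreme network output cannot blow up or collapse a dimension.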
affine_coupling¶

affine_coupling
(input_dim, hidden_dims=None, split_dim=None, **kwargs)[source]¶ A helper function to create an
AffineCoupling
object that takes care of constructing a dense network with the correct input/output dimensions.
Parameters:  input_dim (int) – Dimension of input variable
 hidden_dims (list[int]) – The desired hidden dimensions of the dense network. Defaults to using [10*input_dim]
 split_dim (int) – The dimension to split the input on for the coupling transform. Defaults to using input_dim // 2
 log_scale_min_clip (float) – The minimum value for clipping the log(scale) from the dense NN
 log_scale_max_clip (float) – The maximum value for clipping the log(scale) from the dense NN
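The bookkeeping that such a helper hides can be sketched in a few lines. The function below resolves the documented defaults for `hidden_dims` and `split_dim`. It also derives hypernet dimensions under the usual coupling-layer assumption that the network reads the first `split_dim` coordinates and emits a mean and a log-scale for each of the remaining ones; that layout is an assumption, not something stated above. The function `affine_coupling_defaults` is hypothetical and is not part of Pyro.

```python
def affine_coupling_defaults(input_dim, hidden_dims=None, split_dim=None):
    """Hypothetical helper: resolve the defaults documented above."""
    if split_dim is None:
        split_dim = input_dim // 2          # documented default
    if hidden_dims is None:
        hidden_dims = [10 * input_dim]      # documented default
    transformed = input_dim - split_dim     # coordinates that get rescaled
    return {
        "split_dim": split_dim,
        "hidden_dims": hidden_dims,
        # Assumed layout: condition on the first block, parameterize the rest.
        "hypernet_input_dim": split_dim,
        "hypernet_output_dims": [transformed, transformed],
    }

config = affine_coupling_defaults(7)
```

For an odd `input_dim` such as 7, integer division leaves the larger block (4 coordinates) as the transformed part.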
batchnorm¶
block_autoregressive¶

block_autoregressive
(input_dim, **kwargs)[source]¶ A helper function to create a
BlockAutoregressive
object for consistency with other helpers.
Parameters:  input_dim (int) – Dimension of input variable
 hidden_factors (list) – Hidden layer i has hidden_factors[i] hidden units per input dimension. This corresponds to both \(a\) and \(b\) in De Cao et al. (2019). The elements of hidden_factors must be integers.
 activation (string) – Activation function to use. One of ‘ELU’, ‘LeakyReLU’, ‘sigmoid’, or ‘tanh’.
 residual (string) – Type of residual connections to use. Choices are “None”, “normal” for \(\mathbf{y}+f(\mathbf{y})\), and “gated” for \(\alpha\mathbf{y} + (1 - \alpha\mathbf{y})\) for learnable parameter \(\alpha\).
conditional_planar¶

conditional_planar
(input_dim, context_dim, hidden_dims=None)[source]¶ A helper function to create a
ConditionalPlanar
object that takes care of constructing a dense network with the correct input/output dimensions.
Parameters:
elu¶
householder¶

householder
(input_dim, count_transforms=None)[source]¶ A helper function to create a
Householder
object for consistency with other helpers.
Parameters:
leaky_relu¶

leaky_relu
()[source]¶ A helper function to create a
LeakyReLUTransform
object for consistency with other helpers.
neural_autoregressive¶

neural_autoregressive
(input_dim, hidden_dims=None, activation='sigmoid', width=16)[source]¶ A helper function to create a
NeuralAutoregressive
object that takes care of constructing an autoregressive network with the correct input/output dimensions.
Parameters:  input_dim (int) – Dimension of input variable
 hidden_dims (list[int]) – The desired hidden dimensions of the autoregressive network. Defaults to using [3*input_dim + 1]
 activation (string) – Activation function to use. One of ‘ELU’, ‘LeakyReLU’, ‘sigmoid’, or ‘tanh’.
 width (int) – The width of the “multilayer perceptron” in the transform (see paper). Defaults to 16
permute¶

permute
(input_dim, permutation=None)[source]¶ A helper function to create a
Permute
object for consistency with other helpers.
Parameters:  input_dim (int) – Dimension of input variable
 permutation (torch.LongTensor) – Torch tensor of integer indices representing permutation. Defaults to a random permutation.
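The idea behind a permutation transform and its inverse can be sketched without any tensor library. The example below draws a random permutation of the indices, reorders an input with it, and undoes the reordering with the inverse permutation. Plain Python lists stand in for tensors, and the function names are hypothetical rather than part of Pyro.

```python
import random

def random_permutation(input_dim, seed=None):
    """Return a random ordering of the indices 0..input_dim-1."""
    rng = random.Random(seed)
    permutation = list(range(input_dim))
    rng.shuffle(permutation)
    return permutation

def apply_permutation(x, permutation):
    """Output position k takes the value at input position permutation[k]."""
    return [x[i] for i in permutation]

def inverse_permutation(permutation):
    """Return inv such that inv[permutation[k]] == k for every k."""
    inverse = [0] * len(permutation)
    for position, index in enumerate(permutation):
        inverse[index] = position
    return inverse

x = [10.0, 11.0, 12.0, 13.0, 14.0]
perm = random_permutation(len(x), seed=0)
y = apply_permutation(x, perm)
x_back = apply_permutation(y, inverse_permutation(perm))
```

A permutation only reorders coordinates, so its Jacobian is a permutation matrix with absolute determinant 1 and it contributes nothing to the log-density.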
planar¶
polynomial¶

polynomial
(input_dim, hidden_dims=None)[source]¶ A helper function to create a
Polynomial
object that takes care of constructing an autoregressive network with the correct input/output dimensions.
Parameters:  input_dim (int) – Dimension of input variable
 hidden_dims – The desired hidden dimensions of the autoregressive network. Defaults to using [input_dim * 10]
radial¶
sylvester¶
tanh¶

tanh
()[source]¶ A helper function to create a
TanhTransform
object for consistency with other helpers.