estimators

Interfaces and classes for estimating expectations

See also

Estimator

The base class for all estimators

class pydrobert.torch.estimators.DirectEstimator(proposal, func, mc_samples, cv=None, cv_mean=None, is_log=False)[source]

Direct MC estimate using REINFORCE gradient estimate

The expectation \(v = \mathbb{E}_{b \sim P}[f(b)]\) is estimated by drawing \(N\) samples \(b^{(1:N)}\) i.i.d. from \(P\) and taking the sample average:

\[v \approx \frac{1}{N} \sum_{n=1}^N f\left(b^{(n)}\right).\]

An optional control variate \(c\) can be specified:

\[v \approx \frac{1}{N} \sum_{n=1}^N f\left(b^{(n)}\right) - c\left(b^{(n)}\right) + \mu_c\]

which is unbiased when \(\mathbb{E}_{b \sim P}[c(b)] = \mu_c\).

In the backward pass, the gradient of the expectation is estimated using REINFORCE [williams1992]:

\[\nabla v \approx \frac{1}{N} \sum_{n=1}^N \nabla \left(f\left(b^{(n)}\right) - c\left(b^{(n)}\right) + \mu_c\right)\log P(b).\]

With the control variate terms excluded if they were not specified.

Parameters:
Returns:

v (torch.Tensor)

class pydrobert.torch.estimators.EnumerateEstimator(proposal, func, is_log=False)[source]

Calculate expectation exactly by enumerating the support of the distribution

An unbiased, zero-variance “estimate” of an expectation over a discrete variable may be calculated brute force by enumerating the support and taking the product of function values with their probabilities under the distribution.

\[v = \mathbb{E}_{b \sim P}[f(b)] = \sum_b P(b) f(b).\]

When called, the instance does just that.

Parameters:
  • proposal (Distribution) – The distribution over which the expectation is taken, \(P\). Must be able to enumerate its support through torch.distributions.Distribution.enumerate_support() (proposal.has_enumerate_support == True).

  • func (Callable[[Tensor], Tensor]) –

  • is_log (bool) –

Returns:

v (torch.Tensor)

Warning

The call may be both compute- and memory-intensive, depending on the size of the support.

class pydrobert.torch.estimators.Estimator(proposal, func, is_log=False)[source]

Computes an estimate of an expectation

An estimator estimates the value of a function \(f\) integrated over a probability density \(P\)

\[v = \mathbb{E}_{b \sim P}\left[f(b)\right] = \int_{b \in \mathrm{supp}(P)} f(b) \mathrm{d}P(b).\]

The value of \(v\) can be estimated in many ways. This base class serves as the common foundation for those estimators. The usage pattern is as follows:

def func(b):
    # return the value of f(b) here

# ...
# training loop
for epoch in range(num_epochs):
    # ...
    # 1. Determine parameterization (e.g. logits) from inputs.
    # 2. Initialize the distribution and estimator in the training loop.
    dist = torch.distributions.SomeDistribution(logits=logits)
    estimator = pydrobert.torch.estimators.SomeEstimator(dist, func, ...)
    v = estimator()  # of shape dist.batch_shape
    # 3. calculate loss as a function of v
    loss.backwards()
    # ...
Parameters:
  • proposal (Distribution) – The distribution over which the expectation is taken. This is usually but not always \(P\) (see ImportanceSamplingEstimator for a counterexample).

  • func (Callable[[Tensor], Tensor]) – The function \(f\). A callable (such as a pydrobert.torch.Module) which accepts a sample tensor as input of shape (num_samples,) + proposal.batch_shape + proposal.event_shape and returns a tensor of shape (num_samples,) + proposal.batch_shape.

  • is_log (bool) – If True, the estimator operates in log space. func defines \(\log f\) instead of \(f\) and the return value v represents an estimate of \(\log v\). Estimators will often be more numerically stable in log space.

Returns:

v (torch.Tensor) – An estimate of \(v\). Of shape proposal.batch_shape.

Notes

An estimator is not a torch.nn.Module and is not in general safe to be JIT scripted or traced. The parameterization of the proposal distribution is usually output

class pydrobert.torch.estimators.ImportanceSamplingEstimator(proposal, func, mc_samples, density, self_normalize=False, is_log=False)[source]

Importance Sampling MC estimate

The expectation \(v = \mathbb{E}_{b \sim P}[f(b)]\) is estimated by drawing \(N\) samples \(b^{(1:N)}\) i.i.d. from the proposal distribution \(Q\), weighing the values \(f(b)\) according to the likelihood ratio of \(P(b)\) over \(Q(b)\), and taking the sample average:

\[\begin{split}v \approx \frac{1}{N} \sum_{n=1}^N w_n f\left(b^{(n)}\right) \\ w_n = \frac{P\left(b^{(n)}\right)}{Q\left(b^{(n)}\right)}.\end{split}\]

The estimate is unbiased iff \(Q\) dominates \(P\), that is

\[\forall b \quad P(b) > 0 \implies Q(b) > 0.\]

The gradient is estimated as

\[\nabla v \approx \frac{1}{N} \sum_{n=1}^N \frac{1}{Q\left(b^{(n)}\right)} \nabla P\left(b^{(n)}\right)f\left(b^{(n)}\right).\]

If self_normalized is set to True, \(v\) is instead estimated as:

\[\begin{split}v \approx \sum_{n=1}^N \omega_n f\left(b^{(n)}\right) \\ \omega_n = \frac{w_n}{\sum_{n'=1}^N w_{n'}}\end{split}\]

with gradients defined as

\[\nabla v \approx \nabla \sum_{n=1}^N \omega_n f\left(b^{(n)}\right).\]

The self-normalized estimate is biased but with decreasing bias (assuming the proposal dominates) as \(N \to \infty\). This property holds even if \(P\) is not a probability density (i.e. \(\sum_b P(b) \neq 1\)).

Parameters:
  • proposal (Distribution) – The distribution over which the expectation is taken. In this case, proposal has probability density \(Q\), not \(P\).

  • func (Callable[[Tensor], Tensor]) –

  • mc_samples (int) –

  • density (Density) – The density \(P\). Can be unnormalized.

  • self_normalize (bool) – Whether to use the self-normalized estimator.

  • is_log (bool) –

Returns:

v (torch.Tensor)

Notes

The gradient with respect to the proposal’s parameters is always set to \(0\). This is an unbiased estimate for \(v\) and its unnormalized IS estimator. It is a biased estimate of the gradient of the self-normalized estimate, but only because the self-normalized estimate is biased itself.

class pydrobert.torch.estimators.IndependentMetropolisHastingsEstimator(proposal, func, mc_samples, density, burn_in=0, initial_sample=None, initial_sample_tries=1000, is_log=False)[source]

Independent Metropolis Hastings MCMC estimator

Independent Metropolis Hastings (IMH) is a Markov Chain Monte Carlo (MCMC) technique for estimating the value \(v = \mathbb{E}_{b \sim P}[f(b)] < \infty\). A Markov Chain of \(N\) samples \(b^{(1:N)}\) is contructed sequentially by iteratively drawing samples from a proposal \(b' \sim Q\) and either accepting it or rejecting it and taking \(b^{(n-1)}\) as the next sample in the chain, \(b^{(n)}\), according to the following rules:

\[\begin{split}u \sim \mathrm{Uniform}([0, 1]) \\ b^{(n)} = \begin{cases} b' & \alpha(b', b^{(n-1)}) > u \\ b^{(n-1)} & \mathrm{otherwise} \end{cases} \\ \alpha(b', b^{(n-1)}) = \min\left( \frac{P(b')Q(b^{(n-1)})}{P(b^{(n-1)}Q(b'))}, 1\right).\end{split}\]

The sample estimate from the Markov Chain

\[v \approx \frac{1}{N - M} \sum_{n=M + 1}^N f\left(b^{(n)}\right)\]

for a fixed number of burn-in samples \(M \in [0, N)\) is biased but converges asymptotically (\(\lim N \to \infty\)) to \(v\) with strong guarantees as long as there exists some constant \(\epsilon\) such that [mengerson1996]

\[P(b) > 0 \implies \frac{P(b)}{Q(b)} \leq \epsilon.\]
Parameters:
  • proposal (Distribution) – The proposal distribution \(Q\).

  • func (Callable[[Tensor], Tensor]) –

  • mc_samples (int) –

  • density (Density) – The density \(P\). Does not have to be a probability distribution (can be unnormalized).

  • burn_in (int) – The number of samples in the chain discarded from the estimate, \(M\).

  • initial_sample (Optional[Tensor]) – If specified, initial_sample is used as the value \(b^{(0)}\) to start the chain. Of size either proposal.batch_size + proposal.event_size or (1,) + proposal.batch_size + proposal.event_size. A ValueError will be thrown if any elements are outside the support of \(P\) (density). If unspecified, \(b^{(0)}\) will be decided by randomly drawing from proposal until all elements are in the support of density.

  • initial_sample_tries (int) – If initial_sample is unspecified, initial_sample_tries dictates the maximum number of draws from proposal allowed in order to find elements in the support of density before a RuntimeError is thrown.

  • is_log (bool) –

Returns:

v (torch.Tensor)

Warning

The resulting estimate has no gradient attached to it and therefore cannot be backpropagated through.

find_initial_sample(tries=None)[source]

Find an initial sample by randomly sampling from the proposal

class pydrobert.torch.estimators.MonteCarloEstimator(proposal, func, mc_samples, is_log=False)[source]

A Monte Carlo estimator base class

A Monte Carlo estimator estimates an expectation with some form of random sampling from a proposal. Letting \(N\) represent some number of Monte Carlo samples, \(b^{(1:N)}\) represent N samples drawn from a proposal (usually but not necessarily \(P\)), the estimator G() is unbiased iff

\[\mathbb{E}_{b^{(1:N)} \sim P}[G(b)] = \mathbb{E}_{b \sim P}[f(b)].\]

A toy example can be found in the source repository under the test name test_benchmark. It can be run from the repository root as

DO_MC_BENCHMARK=1 pytest tests/test_mc.py -k benchmark -s
Parameters:
  • proposal (Distribution) –

  • func (Callable[[Tensor], Tensor]) –

  • mc_samples (int) – The number of samples to draw from proposal, \(N\).

  • is_log (bool) – If True, operate in log space. func defines \(\log f\) instead of \(f\) and the return value v represents an estimate of \(\log v\). The estimate of \(\log v\) is simply the log of the MC estimate of \(v\), which is a biased estimate. The estimate can be proven to be a lower bound of \(\log v\) via Jensen’s inequality. To recover an unbiased estimate of \(v\), one may exponentiate the return value.

Returns:

v (torch.Tensor)

class pydrobert.torch.estimators.RelaxEstimator(proposal, func, mc_samples, cv, proposal_params=(), cv_params=(), is_log=False)[source]

RELAX estimator

The RELAX estimator [grathwohl2017] estimates the expectation \(v = \mathbb{E}_{b \sim P}[f(b)]\) for discrete \(b\) via MC estimation, attempting to minimize the variance of the estimator using control variates over their continuous relaxations.

Let \(z^{(1:N)}\) be \(N\) continous relaxation variables drawn i.i.d. \(z^{(n)} \sim P(\cdot)\) and \(b^{(1:N)}\) be their discretizations \(H(z^{(n)}) = b^{(n)}\). Let \(P(z|b)\) be a conditional s.t. the joint distribution satisfies \(P(b)P(z|b) = P(z, b) = P(z)I[H(z) = b]\) and \(\tilde{z}^{(1:N)}\) be samples drawn from those conditionals \(\tilde{z}^{(n)} \sim P(\cdot|b^{(n)})\). Let \(c\) be some control variate accepting relaxed samples. Then the (unbiased) RELAX estimator is:

\[v \approx \frac{1}{N} \sum^N_{n=1} f\left(b^{(n)}\right) - c\left(\tilde{z}^{(n)}\right) + c\left(z^{(n)}\right).\]

Pairing this estimator with one of the REBAR control variates from pydrobert.torch.modules yields the REBAR estimator [tucker2017].

We offer two ways of estimating the gradient \(\nabla z\). The first is a REINFORCE-style estimate:

\[\nabla v \approx \frac{1}{N} \sum^N_{n=1} \nabla \left( \left(f\left(b^{(n)}\right) - c\left(\tilde{z}^{(n)}\right)\right)\log P(b) + c\left(z^{(n)}\right)\right).\]

The above estimate requires no special consideration for any variable for which the gradient is being calculated. The second, following [grathwohl2017], specially optimizes the control variate parameters to minimize the variance of the gradient estimates of the parameters involved in drawing \(z\). Let \(\theta_{1:K}\) be the set of such parameters, \(g_{\theta_k} \approx \nabla_{\theta_k} v\) be a REINFORCE-style estimate of the \(k\)-th \(z\) parameter using the equation above, and let \(\gamma\) be a control variate parameter. Then the variance-minimizing loss can be approximated by:

\[\nabla_\gamma \mathrm{Var}(v) \approx \frac{1}{K} \nabla_\gamma \left(\sum_{k=1}^K g^2_{\theta_k}\right).\]

The remaining parameters are calculated with the REINFORCE-style estimator above. The proposal parameters proposal_params and control variate parameters cv_params must be specified to use this loss function.

Parameters:
  • proposal (ConditionalStraightThrough) – The distribution over which the expectation is taken, \(P\). Must implement pydrobert.torch.distributions.ConditionalStraightThrough.

  • func (Callable[[Tensor], Tensor]) –

  • mc_samples (int) –

  • cv (Callable[[Tensor], Tensor]) –

  • proposal_params (Sequence[Tensor]) – A sequence of parameters used in the computation of \(z\) and \(P(H(z)\). Does not have to be specified unless using the variance-minimizing control variate objective. If non-empty, cv_params must be non-empty as well.

  • cv_params (Sequence[Tensor]) – A sequence of parameters used in the computation of control variate values. Does not have to be specified unless using the variance-minimizing control variate objective. If non-empty, proposal_params must be non-empty as well.

  • is_log (bool) –

Returns:

v (torch.Tensor)

Warning

The current implmentation does not support auxiliary loss functions for the control variate parameters when the variance-minimizing objective is used (proposal_params and cv_params are specified). Auxiliary loss functions for parameters other than cv_params are fine.

class pydrobert.torch.estimators.ReparameterizationEstimator(proposal, func, mc_samples, is_log=False)[source]

MC estimation of continuous variables with reparameterization gradient

This estimator applies to distributions over continuous random variables \(z \sim P\) whose values can be decomposed into the sum of a deterministic, learnable (i.e. with gradient) part \(\theta\) and a random, unlearnable part \(\epsilon\):

\[\begin{split}z = \theta + \epsilon,\>\epsilon \sim P' \\ \nabla P(z) = \nabla P'(\epsilon) = 0\end{split}\]

The expectation \(v = \mathbb{E}_{z \sim P}[f(z)]\) is estimated by drawing \(N\) samples \(z^{(1:N)}\) i.i.d. from \(P\) and taking the sample average:

\[v \approx \frac{1}{N} \sum^N_{n=1} f\left(z^{(n)}\right).\]

We can ignore the probabilities in the bacward direction because \(\nabla P(z) = 0\), leaving the unbiased estimate of the gradient:

\[\nabla v \approx \frac{1}{N} \sum^N_{n=1} \nabla f\left(z^{(n)}\right).\]
Parameters:
  • proposal (Distribution) – The distribution over which the expectation is taken, \(P\) (not \(P'\)). proposal must implement the Distribution.rsample() method (proposal.has_rsample == True).

  • func (Callable[[Tensor], Tensor]) –

  • mc_samples (int) –

  • is_log (bool) –

Returns:

v (torch.Tensor)

class pydrobert.torch.estimators.StraightThroughEstimator(proposal, func, mc_samples, is_log=False)[source]

MC estimation of discrete variables with continuous relaxation’s reparam grad

A straight-through estimator [bengio2013] is like a ReparameterizationEstimator but fudges the fact that the samples are actually discrete to compute the gradient. To estimate \(v = \mathbb{E}_{b \sim P}[f(b)]\), we need a distribution over discrete samples’ continuous relaxations,

\[z = \theta + \epsilon,\>\epsilon \sim P'\]

and a threshold function \(H(z) = b\) such that \(P(H(z)) = P(b)\). The estimate of \(v\) is computed by drawing \(N\) relaxed values \(z^{(1:N)}\) and taking the sample average on thresholded values:

\[v \approx \frac{1}{N} \sum^N_{n=1} f\left(H\left(z^{(n)}\right)\right).\]

This estimate is unbiased. In the backward direction, we approximate

\[\nabla P(H(z)) \approx \nabla P(z) = \nabla P'(\epsilon) = 0\]

and end up with a biased estimate of the gradient resembling that of ReparameterizationEstimator:

\[\nabla v \approx \frac{1}{N} \sum^N_{n=1} \nabla \left(H\left(z^{(n)}\right)\right).\]
Parameters:
Returns:

v (torch.Tensor)