estimators
Interfaces and classes for estimating expectations
See also
Estimator – The base class for all estimators
- class pydrobert.torch.estimators.DirectEstimator(proposal, func, mc_samples, cv=None, cv_mean=None, is_log=False)[source]
Direct MC estimate using REINFORCE gradient estimate
The expectation \(v = \mathbb{E}_{b \sim P}[f(b)]\) is estimated by drawing \(N\) samples \(b^{(1:N)}\) i.i.d. from \(P\) and taking the sample average:
\[v \approx \frac{1}{N} \sum_{n=1}^N f\left(b^{(n)}\right).\]
An optional control variate \(c\) can be specified:
\[v \approx \frac{1}{N} \sum_{n=1}^N f\left(b^{(n)}\right) - c\left(b^{(n)}\right) + \mu_c,\]
which is unbiased when \(\mathbb{E}_{b \sim P}[c(b)] = \mu_c\).
In the backward pass, the gradient of the expectation is estimated using REINFORCE [williams1992]:
\[\nabla v \approx \frac{1}{N} \sum_{n=1}^N \nabla \left(f\left(b^{(n)}\right) - c\left(b^{(n)}\right) + \mu_c\right)\log P\left(b^{(n)}\right),\]
with the control variate terms excluded if they were not specified.
- Parameters:
  - proposal (Distribution) – The distribution over which the expectation is taken, \(P\).
  - mc_samples (int) –
  - cv – The control variate, \(c\), if any.
  - cv_mean – The mean of the control variate under \(P\), \(\mu_c\).
  - is_log (bool) –
- Returns:
  v (torch.Tensor)
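For illustration, a minimal sketch of the documented call pattern, estimating \(\mathbb{E}[b^2]\) over a batch of Bernoulli variables. The toy function func below is ours, not part of the library:

import torch
from pydrobert.torch.estimators import DirectEstimator

def func(b):
    # f(b) = b^2 elementwise; in: (num_samples, 10), out: (num_samples, 10)
    return b ** 2

logits = torch.randn(10, requires_grad=True)
proposal = torch.distributions.Bernoulli(logits=logits)
v = DirectEstimator(proposal, func, mc_samples=128)()  # shape (10,)
v.sum().backward()  # REINFORCE estimate of the gradient w.r.t. logits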
- class pydrobert.torch.estimators.EnumerateEstimator(proposal, func, is_log=False)[source]
Calculate expectation exactly by enumerating the support of the distribution
An unbiased, zero-variance “estimate” of an expectation over a discrete variable may be calculated by brute force: enumerate the support and sum the products of the function values with their probabilities under the distribution,
\[v = \mathbb{E}_{b \sim P}[f(b)] = \sum_b P(b) f(b).\]
When called, the instance does just that.
- Parameters:
  - proposal (Distribution) – The distribution over which the expectation is taken, \(P\). Must be able to enumerate its support through torch.distributions.Distribution.enumerate_support() (proposal.has_enumerate_support == True).
  - is_log (bool) –
- Returns:
  v (torch.Tensor)
Warning
The call may be both compute- and memory-intensive, depending on the size of the support.
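Since a Bernoulli's support is just \(\{0, 1\}\), enumeration is cheap. A minimal sketch (func is our own toy function):

import torch
from pydrobert.torch.estimators import EnumerateEstimator

def func(b):
    # for b in {0, 1}, E[b^2] = P(b = 1)
    return b ** 2

logits = torch.randn(10, requires_grad=True)
proposal = torch.distributions.Bernoulli(logits=logits)
assert proposal.has_enumerate_support
v = EnumerateEstimator(proposal, func)()  # exact; shape (10,)
v.sum().backward()  # exact gradients through the enumerated probabilities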
- class pydrobert.torch.estimators.Estimator(proposal, func, is_log=False)[source]
Computes an estimate of an expectation
An estimator estimates the value of a function \(f\) integrated over a probability density \(P\)
\[v = \mathbb{E}_{b \sim P}\left[f(b)\right] = \int_{b \in \mathrm{supp}(P)} f(b) \mathrm{d}P(b).\]
The value of \(v\) can be estimated in many ways. This base class serves as the common foundation for those estimators. The usage pattern is as follows:
def func(b):
    # return the value of f(b) here
    ...

# training loop
for epoch in range(num_epochs):
    # ...
    # 1. Determine parameterization (e.g. logits) from inputs.
    # 2. Initialize the distribution and estimator in the training loop.
    dist = torch.distributions.SomeDistribution(logits=logits)
    estimator = pydrobert.torch.estimators.SomeEstimator(dist, func, ...)
    v = estimator()  # of shape dist.batch_shape
    # 3. Calculate the loss as a function of v.
    loss.backward()
    # ...
- Parameters:
  - proposal (Distribution) – The distribution over which the expectation is taken. This is usually but not always \(P\) (see ImportanceSamplingEstimator for a counterexample).
  - func (Callable[[Tensor], Tensor]) – The function \(f\). A callable (such as a pydrobert.torch.Module) which accepts a sample tensor of shape (num_samples,) + proposal.batch_shape + proposal.event_shape as input and returns a tensor of shape (num_samples,) + proposal.batch_shape.
  - is_log (bool) – If True, the estimator operates in log space: func defines \(\log f\) instead of \(f\), and the return value v represents an estimate of \(\log v\). Estimators will often be more numerically stable in log space.
- Returns:
  v (torch.Tensor) – An estimate of \(v\). Of shape proposal.batch_shape.
Notes
An estimator is not a torch.nn.Module and is not in general safe to be JIT scripted or traced. The parameterization of the proposal distribution is usually recomputed from the network on every training step, which is why the distribution and estimator are re-created inside the training loop in the usage pattern above.
- class pydrobert.torch.estimators.ImportanceSamplingEstimator(proposal, func, mc_samples, density, self_normalize=False, is_log=False)[source]
Importance Sampling MC estimate
The expectation \(v = \mathbb{E}_{b \sim P}[f(b)]\) is estimated by drawing \(N\) samples \(b^{(1:N)}\) i.i.d. from the proposal distribution \(Q\), weighing the values \(f(b)\) according to the likelihood ratio of \(P(b)\) over \(Q(b)\), and taking the sample average:
\[\begin{split}v \approx \frac{1}{N} \sum_{n=1}^N w_n f\left(b^{(n)}\right) \\ w_n = \frac{P\left(b^{(n)}\right)}{Q\left(b^{(n)}\right)}.\end{split}\]
The estimate is unbiased iff \(Q\) dominates \(P\), that is,
\[\forall b \quad P(b) > 0 \implies Q(b) > 0.\]
The gradient is estimated as
\[\nabla v \approx \frac{1}{N} \sum_{n=1}^N \frac{1}{Q\left(b^{(n)}\right)} \nabla P\left(b^{(n)}\right)f\left(b^{(n)}\right).\]
If self_normalize is set to True, \(v\) is instead estimated as:
\[\begin{split}v \approx \sum_{n=1}^N \omega_n f\left(b^{(n)}\right) \\ \omega_n = \frac{w_n}{\sum_{n'=1}^N w_{n'}}\end{split}\]
with gradients defined as
\[\nabla v \approx \nabla \sum_{n=1}^N \omega_n f\left(b^{(n)}\right).\]
The self-normalized estimate is biased, but its bias decreases (assuming the proposal dominates) as \(N \to \infty\). This property holds even if \(P\) is not a probability density (i.e. \(\sum_b P(b) \neq 1\)).
- Parameters:
  - proposal (Distribution) – The distribution over which the expectation is taken. In this case, proposal has probability density \(Q\), not \(P\).
  - mc_samples (int) –
  - density (Density) – The density \(P\). Can be unnormalized.
  - self_normalize (bool) – Whether to use the self-normalized estimator.
  - is_log (bool) –
- Returns:
  v (torch.Tensor)
Notes
The gradient with respect to the proposal’s parameters is always set to \(0\). For the unnormalized IS estimate of \(v\), this is unbiased. For the self-normalized estimate it is a biased estimate of the gradient, but only because the self-normalized estimate is itself biased.
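As a sketch, one can estimate a second moment of a Normal target \(P\) by sampling from a wider Normal proposal \(Q\) that dominates it. Passing a plain torch Distribution as density is an assumption here; check pydrobert.torch.distributions.Density for the exact interface expected:

import torch
from pydrobert.torch.estimators import ImportanceSamplingEstimator

def func(b):
    return b ** 2

density = torch.distributions.Normal(1.0, 1.0)   # target P: E[b^2] = 2
proposal = torch.distributions.Normal(0.0, 2.0)  # wider Q dominates P
estimator = ImportanceSamplingEstimator(
    proposal, func, mc_samples=10_000, density=density, self_normalize=True
)
v = estimator()  # scalar, close to 2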
- class pydrobert.torch.estimators.IndependentMetropolisHastingsEstimator(proposal, func, mc_samples, density, burn_in=0, initial_sample=None, initial_sample_tries=1000, is_log=False)[source]
Independent Metropolis Hastings MCMC estimator
Independent Metropolis Hastings (IMH) is a Markov Chain Monte Carlo (MCMC) technique for estimating the value \(v = \mathbb{E}_{b \sim P}[f(b)] < \infty\). A Markov chain of \(N\) samples \(b^{(1:N)}\) is constructed sequentially by iteratively drawing a sample from a proposal, \(b' \sim Q\), and either accepting it as the next sample in the chain, \(b^{(n)}\), or rejecting it and taking \(b^{(n-1)}\) as \(b^{(n)}\) instead, according to the following rules:
\[\begin{split}u \sim \mathrm{Uniform}([0, 1]) \\ b^{(n)} = \begin{cases} b' & \alpha(b', b^{(n-1)}) > u \\ b^{(n-1)} & \mathrm{otherwise} \end{cases} \\ \alpha(b', b^{(n-1)}) = \min\left( \frac{P(b')Q(b^{(n-1)})}{P(b^{(n-1)})Q(b')}, 1\right).\end{split}\]
The sample estimate from the Markov chain,
\[v \approx \frac{1}{N - M} \sum_{n=M + 1}^N f\left(b^{(n)}\right),\]
for a fixed number of burn-in samples \(M \in [0, N)\), is biased but converges asymptotically (as \(N \to \infty\)) to \(v\) with strong guarantees, as long as there exists some constant \(\epsilon\) such that [mengerson1996]
\[P(b) > 0 \implies \frac{P(b)}{Q(b)} \leq \epsilon.\]
- Parameters:
  - proposal (Distribution) – The proposal distribution, \(Q\).
  - mc_samples (int) –
  - density (Density) – The density \(P\). Does not have to be a probability distribution (can be unnormalized).
  - burn_in (int) – The number of samples in the chain discarded from the estimate, \(M\).
  - initial_sample (Optional[Tensor]) – If specified, initial_sample is used as the value \(b^{(0)}\) to start the chain. Of shape either proposal.batch_shape + proposal.event_shape or (1,) + proposal.batch_shape + proposal.event_shape. A ValueError will be thrown if any elements are outside the support of \(P\) (density). If unspecified, \(b^{(0)}\) will be decided by randomly drawing from proposal until all elements are in the support of density.
  - initial_sample_tries (int) – If initial_sample is unspecified, initial_sample_tries dictates the maximum number of draws from proposal allowed in order to find elements in the support of density before a RuntimeError is thrown.
  - is_log (bool) –
- Returns:
  v (torch.Tensor)
Warning
The resulting estimate has no gradient attached to it and therefore cannot be backpropagated through.
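A sketch with a heavier-tailed proposal, so that \(P/Q\) is bounded as required. As with ImportanceSamplingEstimator, using a plain torch Distribution as density is an assumption to check against the Density interface:

import torch
from pydrobert.torch.estimators import IndependentMetropolisHastingsEstimator

def func(b):
    return b

density = torch.distributions.Normal(1.0, 1.0)   # target P
proposal = torch.distributions.Normal(0.0, 3.0)  # heavier tails: P/Q bounded
estimator = IndependentMetropolisHastingsEstimator(
    proposal, func, mc_samples=5_000, density=density, burn_in=500
)
v = estimator()  # scalar, close to E_{b ~ P}[b] = 1; carries no gradient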
- class pydrobert.torch.estimators.MonteCarloEstimator(proposal, func, mc_samples, is_log=False)[source]
A Monte Carlo estimator base class
A Monte Carlo estimator estimates an expectation with some form of random sampling from a proposal. Letting \(N\) represent some number of Monte Carlo samples and \(b^{(1:N)}\) represent \(N\) samples drawn from a proposal (usually but not necessarily \(P\)), the estimator \(G\) is unbiased iff
\[\mathbb{E}_{b^{(1:N)} \sim P}\left[G\left(b^{(1:N)}\right)\right] = \mathbb{E}_{b \sim P}[f(b)].\]
A toy example can be found in the source repository under the test name test_benchmark. It can be run from the repository root as
DO_MC_BENCHMARK=1 pytest tests/test_mc.py -k benchmark -s
- Parameters:
  - proposal (Distribution) –
  - mc_samples (int) – The number of samples to draw from proposal, \(N\).
  - is_log (bool) – If True, operate in log space: func defines \(\log f\) instead of \(f\), and the return value v represents an estimate of \(\log v\). The estimate of \(\log v\) is simply the log of the MC estimate of \(v\), which is a biased estimate. The estimate can be proven to be a lower bound of \(\log v\) via Jensen’s inequality. To recover an unbiased estimate of \(v\), one may exponentiate the return value.
- Returns:
  v (torch.Tensor)
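To make the log-space caveat concrete, here is a toy check of our own (not from the library) that the log-space estimate sits below \(\log v\) on average, using DirectEstimator as the concrete subclass:

import torch
from pydrobert.torch.estimators import DirectEstimator

def log_func(b):
    return b  # log f(b) = b, i.e. f(b) = exp(b)

proposal = torch.distributions.Bernoulli(probs=torch.tensor([0.5]))
log_v = torch.tensor(0.5 + 0.5 * torch.e).log()  # exact log E[exp(b)]
with torch.no_grad():
    ests = torch.stack(
        [DirectEstimator(proposal, log_func, 16, is_log=True)()
         for _ in range(1_000)]
    )
print(ests.mean() <= log_v)  # expected True: the estimate is biased low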
- class pydrobert.torch.estimators.RelaxEstimator(proposal, func, mc_samples, cv, proposal_params=(), cv_params=(), is_log=False)[source]
RELAX estimator
The RELAX estimator [grathwohl2017] estimates the expectation \(v = \mathbb{E}_{b \sim P}[f(b)]\) for discrete \(b\) via MC estimation, attempting to minimize the variance of the estimator using control variates over continuous relaxations of the samples.
Let \(z^{(1:N)}\) be \(N\) continuous relaxation variables drawn i.i.d. \(z^{(n)} \sim P(\cdot)\) and \(b^{(1:N)}\) be their discretizations \(H(z^{(n)}) = b^{(n)}\). Let \(P(z|b)\) be a conditional s.t. the joint distribution satisfies \(P(b)P(z|b) = P(z, b) = P(z)I[H(z) = b]\), and let \(\tilde{z}^{(1:N)}\) be samples drawn from those conditionals, \(\tilde{z}^{(n)} \sim P(\cdot|b^{(n)})\). Let \(c\) be some control variate accepting relaxed samples. Then the (unbiased) RELAX estimator is:
\[v \approx \frac{1}{N} \sum^N_{n=1} f\left(b^{(n)}\right) - c\left(\tilde{z}^{(n)}\right) + c\left(z^{(n)}\right).\]
Pairing this estimator with one of the REBAR control variates from pydrobert.torch.modules yields the REBAR estimator [tucker2017].
We offer two ways of estimating the gradient \(\nabla v\). The first is a REINFORCE-style estimate:
\[\nabla v \approx \frac{1}{N} \sum^N_{n=1} \nabla \left( \left(f\left(b^{(n)}\right) - c\left(\tilde{z}^{(n)}\right)\right)\log P\left(b^{(n)}\right) + c\left(z^{(n)}\right)\right).\]
The above estimate requires no special consideration for any variable for which the gradient is being calculated. The second, following [grathwohl2017], specially optimizes the control variate parameters to minimize the variance of the gradient estimates of the parameters involved in drawing \(z\). Let \(\theta_{1:K}\) be the set of such parameters, \(g_{\theta_k} \approx \nabla_{\theta_k} v\) be a REINFORCE-style estimate of the \(k\)-th \(z\) parameter using the equation above, and let \(\gamma\) be a control variate parameter. Then the variance-minimizing loss can be approximated by:
\[\nabla_\gamma \mathrm{Var}(v) \approx \frac{1}{K} \nabla_\gamma \left(\sum_{k=1}^K g^2_{\theta_k}\right).\]
The remaining parameters are calculated with the REINFORCE-style estimator above. The proposal parameters proposal_params and control variate parameters cv_params must both be specified to use this loss function.
- Parameters:
  - proposal (ConditionalStraightThrough) – The distribution over which the expectation is taken, \(P\). Must implement pydrobert.torch.distributions.ConditionalStraightThrough.
  - mc_samples (int) –
  - cv – The control variate, \(c\), accepting relaxed samples.
  - proposal_params (Sequence[Tensor]) – A sequence of parameters used in the computation of \(z\) and \(P(H(z))\). Does not have to be specified unless using the variance-minimizing control variate objective. If non-empty, cv_params must be non-empty as well.
  - cv_params (Sequence[Tensor]) – A sequence of parameters used in the computation of control variate values. Does not have to be specified unless using the variance-minimizing control variate objective. If non-empty, proposal_params must be non-empty as well.
  - is_log (bool) –
- Returns:
  v (torch.Tensor)
Warning
The current implementation does not support auxiliary loss functions for the control variate parameters when the variance-minimizing objective is used (i.e. when proposal_params and cv_params are specified). Auxiliary loss functions for parameters other than cv_params are fine.
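A rough sketch of the wiring with a toy trainable control variate. It assumes some ConditionalStraightThrough relaxation of the Bernoulli exists in pydrobert.torch.distributions; the name LogisticBernoulli below is an assumption, so substitute whichever implementation you actually use:

import torch
from pydrobert.torch.estimators import RelaxEstimator
# assumption: a ConditionalStraightThrough relaxed Bernoulli under this name
from pydrobert.torch.distributions import LogisticBernoulli

def func(b):
    return b ** 2

logits = torch.randn(10, requires_grad=True)
proposal = LogisticBernoulli(logits=logits)
cv_weight = torch.randn(1, requires_grad=True)  # toy control variate parameter

def cv(z):
    # toy control variate over relaxed samples z
    return cv_weight * torch.sigmoid(z)

estimator = RelaxEstimator(
    proposal, func, mc_samples=4, cv=cv,
    proposal_params=(logits,), cv_params=(cv_weight,),
)
v = estimator()
v.sum().backward()  # logits: RELAX gradient; cv_weight: variance-minimizing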
- class pydrobert.torch.estimators.ReparameterizationEstimator(proposal, func, mc_samples, is_log=False)[source]
MC estimation of continuous variables with reparameterization gradient
This estimator applies to distributions over continuous random variables \(z \sim P\) whose values can be decomposed into the sum of a deterministic, learnable (i.e. with gradient) part \(\theta\) and a random, unlearnable part \(\epsilon\):
\[\begin{split}z = \theta + \epsilon,\>\epsilon \sim P' \\ \nabla P(z) = \nabla P'(\epsilon) = 0.\end{split}\]
The expectation \(v = \mathbb{E}_{z \sim P}[f(z)]\) is estimated by drawing \(N\) samples \(z^{(1:N)}\) i.i.d. from \(P\) and taking the sample average:
\[v \approx \frac{1}{N} \sum^N_{n=1} f\left(z^{(n)}\right).\]
We can ignore the probabilities in the backward direction because \(\nabla P(z) = 0\), leaving the unbiased estimate of the gradient:
\[\nabla v \approx \frac{1}{N} \sum^N_{n=1} \nabla f\left(z^{(n)}\right).\]
- Parameters:
  - proposal (Distribution) – The distribution over which the expectation is taken, \(P\).
- Returns:
  v (torch.Tensor)
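For instance, a Normal proposal decomposes as \(z = \mu + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, \sigma^2)\), so it fits this estimator. A minimal sketch:

import torch
from pydrobert.torch.estimators import ReparameterizationEstimator

def func(z):
    return z ** 2

loc = torch.zeros(5, requires_grad=True)
proposal = torch.distributions.Normal(loc, 1.0)  # z = loc + eps
v = ReparameterizationEstimator(proposal, func, mc_samples=1024)()
v.sum().backward()  # pathwise gradient flows through z = loc + eps
# v ≈ E[z^2] = loc^2 + 1 = 1 elementwise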
- class pydrobert.torch.estimators.StraightThroughEstimator(proposal, func, mc_samples, is_log=False)[source]
MC estimation of discrete variables with continuous relaxation’s reparam grad
A straight-through estimator [bengio2013] is like a ReparameterizationEstimator but fudges the fact that the samples are actually discrete in order to compute the gradient. To estimate \(v = \mathbb{E}_{b \sim P}[f(b)]\), we need a distribution over the discrete samples’ continuous relaxations,
\[z = \theta + \epsilon,\>\epsilon \sim P',\]
and a threshold function \(H(z) = b\) such that \(P(H(z)) = P(b)\). The estimate of \(v\) is computed by drawing \(N\) relaxed values \(z^{(1:N)}\) and taking the sample average of the thresholded values:
\[v \approx \frac{1}{N} \sum^N_{n=1} f\left(H\left(z^{(n)}\right)\right).\]
This estimate is unbiased. In the backward direction, we approximate
\[\nabla P(H(z)) \approx \nabla P(z) = \nabla P'(\epsilon) = 0\]
and end up with a biased estimate of the gradient resembling that of ReparameterizationEstimator:
\[\nabla v \approx \frac{1}{N} \sum^N_{n=1} \nabla f\left(H\left(z^{(n)}\right)\right).\]
- Parameters:
  - proposal (StraightThrough) – The distribution over which the expectation is taken, \(P\) (not \(P'\)). proposal must implement pydrobert.torch.distributions.StraightThrough.
  - mc_samples (int) –
  - is_log (bool) –
- Returns:
  v (torch.Tensor)
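The wiring mirrors RelaxEstimator’s, minus the control variate. As before, the relaxed-Bernoulli class name is an assumption; substitute whichever pydrobert.torch.distributions class implements StraightThrough:

import torch
from pydrobert.torch.estimators import StraightThroughEstimator
# assumption: a StraightThrough relaxed Bernoulli under this name
from pydrobert.torch.distributions import LogisticBernoulli

def func(b):
    return b ** 2

logits = torch.randn(10, requires_grad=True)
proposal = LogisticBernoulli(logits=logits)
v = StraightThroughEstimator(proposal, func, mc_samples=16)()  # unbiased
v.sum().backward()  # biased straight-through gradient w.r.t. logits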