Working with probabilities involves multiplying and normalizing their values. Since these values can be extremely small, this can lead to underflow problems. The problem is especially evident with particle filters: we have to multiply very small likelihood values, which eventually vanish. The log-sum-exp trick allows us to work around this problem.

Approach Link to heading

Log-likelihoods Link to heading

Since the likelihood values can be extremely small, it is more convenient to work with the log-likelihood instead of the likelihood:

$$ \log(\mathcal{L}). $$

Now, if we have independent measurements $i=1,\dots,n$ and associated likelihoods $\mathbf{L} = [\mathcal{L}_1, \dots, \mathcal{L}_n]$ we can calculate the joint likelihood as follows:

$$ w = \prod_{i=1}^n\mathcal{L}_i, $$

which is, obviously, numerically unstable, or as the joint log-likelihood:

$$ \tau = \log\prod\mathcal{L}_{i} = \log(\mathcal{L}_1\cdot \mathcal{L}_2\cdot ~\dots~ \cdot\mathcal{L}_n) = \log{\mathcal{L}_1} + \log{\mathcal{L}_2}+~\dots~+\log{\mathcal{L}_n} = \sum_{i=1}^n\log\mathcal{L}_i, $$

By doing so we convert the product into a sum, which does not suffer from the numerical underflow issue.
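
For instance (a minimal sketch with made-up likelihood values, assuming NumPy), the direct product underflows while the sum of logs stays finite:

>>> import numpy as np

>>> likelihoods = np.full(10, 1e-50)  # ten tiny made-up likelihood values
>>> np.prod(likelihoods)              # direct product underflows to zero
0.0
>>> round(np.sum(np.log(likelihoods)), 2)  # joint log-likelihood stays finite
-1151.29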

Quite often we are interested in the relative likelihoods, or odds, for our particles (since we normalize the probabilities afterwards). Assuming that each particle has a certain weight, all the weights, or log-weights, are stacked into the vectors:

$$ \begin{gather} \mathbf{w} = [w_1,\dots, w_m],\\\\ \boldsymbol{\tau} = [\tau_1, \dots, \tau_m] \end{gather} $$

where $m$ is the number of particles.

To calculate the relative weight of each particle $j$ we can rewrite it in terms of log-weights:

$$ \begin{aligned} \hat{w}_j &= \frac{w_j}{\max(\mathbf{w})}\\\\ &= \exp\log{\frac{w_j}{\max(\mathbf{w})}}\\\\ &= \exp\left( \log w_j - \max(\log\mathbf{w}) \right)\\\\ &= \exp(\tau_j - \max\boldsymbol{\tau})\\\\ &= \exp\left(\sum_{i=1}^n\log\mathcal{L}_i - \max\boldsymbol{\tau}\right). \end{aligned} $$
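
In code this max-shift is a one-liner; here is a minimal sketch with some made-up log-weights (the exact values do not matter):

>>> import numpy as np

>>> tau = np.array([-1132.87, -1123.66, -1123.66, -1137.48])  # made-up log-weights
>>> w_hat = np.exp(tau - tau.max())  # relative weights; the largest one equals 1
>>> w_hat.max()
1.0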

Probability normalization Link to heading

Now, having these relative weights, we can normalize them more conveniently:

$$ p_j = \frac{\hat{w}_j}{\sum_{k=1}^m \hat{w}_k},~\sum_{j=1}^m{p_j} = 1. $$

Taking the logarithm of both sides:

$$ \log p_j = \log\hat{w}_j - \log\sum_{k=1}^m\hat{w}_k, $$

and making the aforementioned substitution $\hat{w}_j = \exp(\tau_j - \max\boldsymbol{\tau})$, after some rewriting we end up with:

$$ \begin{align} \log p_j &= \log\exp(\tau_j - \max\boldsymbol{\tau}) - \log\sum_{k=1}^m\exp(\tau_k - \max\boldsymbol{\tau})\\\\ \log p_j &= \tau_j \underbrace{ - \max\boldsymbol{\tau} -\log\sum_{k=1}^m\exp(\tau_k - \max\boldsymbol{\tau})}_{-\log\sum_k\exp(\tau_k)~-~\text{the log-sum-exp trick}} \\\\ p_j &= \exp\left( \tau_j - \max\boldsymbol{\tau} -\log\sum_{k=1}^m\exp(\tau_k - \max\boldsymbol{\tau}) \right) \end{align} $$

We have ended up with the famous log-sum-exp trick1: the second-to-last line gives the normalized log-weights, and the last one the normalized weights. The meaning of this trick is now apparent: it scales the weights before normalization, avoiding numerical under- or overflow.
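
As a side note, if SciPy is available, scipy.special.logsumexp implements exactly this max-shifted computation, so the normalization can be written directly in terms of it (a sketch reusing the same made-up log-weights as above):

>>> import numpy as np
>>> from scipy.special import logsumexp

>>> tau = np.array([-1132.87, -1123.66, -1123.66, -1137.48])  # made-up log-weights
>>> p = np.exp(tau - logsumexp(tau))  # normalized weights via log-sum-exp
>>> round(p.sum(), 10)
1.0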

Effective sample size Link to heading

Quite often the following equation is used to estimate the effective sample size:

$$ N_{eff} = \frac{1}{\sum_{j=1}^{m} p_j^2}. $$

Since we work in the log domain, we substitute $p_j$ with $\exp(\log p_j)$:

$$ \begin{align} \log N_{eff} &= -\log\sum_{j=1}^m \exp\left(\log p_j^2\right) \\\\ \log N_{eff} &= -\log\sum_{j=1}^m \exp(2\cdot\log p_j). \end{align} $$

Resulting in the following equation for $N_{eff}$:

$$ N_{eff} = \exp\left(-\log\sum_{j=1}^m \exp(2\cdot\log p_j) \right). $$
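
As a quick sanity check (a sketch with made-up, not-too-small probabilities), the log-domain formula agrees with the direct $1/\sum_j p_j^2$:

>>> import numpy as np

>>> p = np.array([0.5, 0.3, 0.2])  # made-up normalized weights
>>> round(1 / np.sum(p**2), 6)     # direct definition
2.631579
>>> round(np.exp(-np.log(np.sum(np.exp(2 * np.log(p))))), 6)  # log-domain form
2.631579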

Code examples Link to heading

Assume that we have two sets of likelihoods, one value per particle. Calculating the joint likelihood by direct multiplication, we obviously experience numerical underflow and end up with zeros:

>>> import numpy as np

>>> likelihoods1 = np.array([1e-246, 1e-244, 1e-244, 1e-247])
>>> likelihoods2 = np.array([1e-246, 1e-244, 1e-244, 1e-247])

>>> likelihoods1 * likelihoods2
array([0., 0., 0., 0.])

Let’s calculate the log-likelihoods and normalize the probabilities using the naive method:

>>> log_likelihood1 = np.log(likelihoods1)
>>> log_likelihood2 = np.log(likelihoods2)

>>> tau = log_likelihood1 + log_likelihood2
>>> tau
array([-1132.87186575, -1123.66152538, -1123.66152538, -1137.47703594])

>>> np.exp(tau)/np.sum(np.exp(tau))
array([nan, nan, nan, nan])

And now let’s normalize using the derived equation:

>>> max_tau = tau.max()
>>> tau_normed = tau - max_tau - np.log(np.sum(np.exp(tau - max_tau)))
>>> w_normed = np.exp(tau_normed)
>>> w_normed
array([0.00005   , 0.49997475, 0.49997475, 0.0000005 ])

If we are interested in the effective number of particles:

>>> def logSumExp(tau):
...     """Numerically stable log-sum-exp of a vector of log-weights."""
...     max_tau = np.max(tau)
...     return max_tau + np.log(np.sum(np.exp(tau - max_tau)))

>>> tau_normed = tau - logSumExp(tau)
>>> effective_sample_size = np.exp(-logSumExp(2 * tau_normed))
>>> effective_sample_size
2.00020199509849

Success! Works perfectly!

References Link to heading