Bayesian regression

Let’s consider the case where we know the form of the conditional distribution $p(\mathbf{y}_k \mid \boldsymbol{\theta})$ of the measurements, but the parameter $\boldsymbol{\theta} \in \mathbb{R}^d$ is unknown.

The classical statistical approach to estimating the parameter is maximum likelihood estimation (MLE), where we maximize the joint probability of the measurements, also called the likelihood function:

$$ \mathcal{L}(\boldsymbol{\theta}) = \prod_{k=1}^T p(\mathbf{y}_k\mid\boldsymbol{\theta}). $$

Maximizing the likelihood function with respect to $\boldsymbol{\theta}$ gives the maximum likelihood estimate:

$$ \hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}). $$
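For example, for i.i.d. Gaussian measurements $y_k \sim \mathcal{N}(\theta, \sigma^2)$ with a scalar mean parameter $\theta$, setting the derivative of the log-likelihood to zero recovers the familiar sample mean:

$$ \frac{\partial \log\mathcal{L}(\theta)}{\partial \theta} = \frac{1}{\sigma^2}\sum_{k=1}^T (y_k - \theta) = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{1}{T}\sum_{k=1}^T y_k. $$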

The difference between Bayesian inference and maximum likelihood is that in the former we treat the parameter $\boldsymbol{\theta}$ as a random variable. The posterior distribution of $\boldsymbol{\theta}$ can then be computed using Bayes’ rule:

$$ \begin{equation} \text{posterior} = \frac{\text{likelihood}\, \times \, \text{prior}}{\text{evidence}} = p(\boldsymbol{\theta} \mid \mathbf{y}_{1:T}) = \frac{p(\mathbf{y}_{1:T}\mid\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathbf{y}_{1:T})}, \end{equation} $$

where $p(\boldsymbol{\theta})$ is the prior distribution, which models our belief about the parameter before we have seen any data, and the denominator is a normalization constant that does not depend on $\boldsymbol{\theta}$ and can be left out. The posterior distribution can then be written as:

$$ p(\boldsymbol{\theta} \mid \mathbf{y}_{1:T}) \propto p(\boldsymbol{\theta}) \prod_{k=1}^T p(\mathbf{y}_k\mid\boldsymbol{\theta}). $$

Now, if we are interested in the most probable value of the parameter, we can compute the maximum a posteriori (MAP) estimate (the MLE can be seen as a MAP estimate with a uniform prior on $\boldsymbol{\theta}$). The optimal estimate in the mean squared error sense, however, is the posterior mean of the parameters.
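To make this concrete, here is a minimal sketch (with made-up numbers, separate from the regression example below) that evaluates the posterior of a scalar parameter on a grid under a standard normal prior and reads off both the MAP estimate and the posterior mean:

import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
y = rng.normal(loc=1.3, scale=sigma, size=20)    # measurements y_k ~ N(theta, sigma^2)

theta = np.linspace(-3, 3, 2001)                 # grid over the scalar parameter
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2)                  # N(0, 1) prior, unnormalized
lik = np.exp(-0.5 * ((y[:, None] - theta[None, :]) / sigma) ** 2).prod(axis=0)

post = prior * lik
post /= post.sum() * dtheta                      # normalize on the grid

theta_map = theta[np.argmax(post)]               # most probable value (MAP)
theta_mean = (theta * post).sum() * dtheta       # posterior mean (optimal in the MSE sense)
print(theta_map, theta_mean)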

The Bayesian model components

If we take a look at Bayes’ rule, we can clearly identify the basic building blocks of a Bayesian model:

$$ \text{posterior} = \frac{\text{likelihood}\, \times \, \text{prior}}{\text{evidence}}. $$
  • Prior model $p(\boldsymbol{\theta})$. This is the information about the parameter $\boldsymbol{\theta}$ before seeing any observations.
  • Measurement model $p(\mathbf{y}\mid\boldsymbol{\theta})$. Usually there is a simple but inaccurate or noisy relationship between the parameters and the observations. This relationship is described mathematically by the measurement model, often called the likelihood of the measurements given our model.
  • Posterior distribution $p(\boldsymbol{\theta}|\mathbf{y})$. The posterior distribution is the conditional distribution of the parameters given the observations.

Note: quite often the integrals involved (the evidence, posterior expectations) become computationally intractable. There are several options to tackle this problem:

  1. Use a conjugate prior: a prior for which, given the likelihood, the posterior belongs to the same distribution family as the prior, so it has a closed-form expression. For example, with a Gaussian prior and a Gaussian likelihood we end up with a Gaussian posterior too.
  2. Use numerical methods: Gaussian approximations, multi-dimensional integration methods (if the dimensionality of the integral is moderate), Monte Carlo methods (drawing samples from the posterior and calculating a sample average, as sketched below), or importance sampling.
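As a small illustration of the Monte Carlo option (a sketch with a made-up posterior, not tied to the regression example below), we can draw samples from a known posterior and approximate a posterior expectation by a sample average:

import numpy as np

rng = np.random.default_rng(2)

# Assume, purely for illustration, that the posterior is N(0.8, 0.04).
post_mean, post_var = 0.8, 0.04
samples = rng.normal(post_mean, np.sqrt(post_var), size=100_000)

# Posterior expectation of g(theta) = exp(theta), approximated by the sample average
mc_estimate = np.exp(samples).mean()
closed_form = np.exp(post_mean + 0.5 * post_var)   # exact value for a Gaussian
print(mc_estimate, closed_form)                     # the two estimates closely agree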

Batch and recursive Bayesian estimation

Let’s consider the linear regression model:

$$ \begin{equation} y_k = \theta_1 + \theta_2 t_k + \epsilon_k, \end{equation} $$

assuming that the measurement noise is zero-mean Gaussian with a known variance, $\epsilon_k \sim \mathcal{N}(0, \sigma^2)$, and that the prior distribution of the parameters $\boldsymbol{\theta} = [\theta_1, \theta_2]^T$ is also Gaussian with known mean and covariance, $\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{m}_0, \mathbf{P}_0)$. In the classical formulation we are interested in estimating the parameters $\boldsymbol{\theta}$ from a set of measurement data $\mathcal{D} = \{(t_1, y_1), \dots, (t_T, y_T)\}$.

We can also rewrite this problem in probabilistic notation:

$$ \begin{align} p(y_k \mid \boldsymbol{\theta}) &= \mathcal{N}(y_k \mid \mathbf{H}_k\boldsymbol{\theta}, \sigma^2) \\ p(\boldsymbol{\theta}) &= \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{m}_0, \mathbf{P}_0), \end{align} $$

where $\mathbf{H}_k = [1 \;\; t_k]$ is a row vector and $\mathcal{N}(\cdot)$ is the Gaussian probability density function. If our measurement were a vector, $\mathbf{H}_k$ would be a matrix.

Now, applying Bayes’ rule, we get:

$$ \begin{align} p(\boldsymbol{\theta} \mid y_{1:T}) &\propto p(\boldsymbol{\theta}) \prod_{k=1}^T p(y_k \mid \boldsymbol{\theta}) \\ &=\mathcal{N}(\boldsymbol{\theta} \mid \mathbf{m}_0, \mathbf{P}_0) \prod_{k=1}^T \mathcal{N}(y_k \mid \mathbf{H}_k\boldsymbol{\theta}, \sigma^2). \end{align} $$

Since both the prior and the likelihood are Gaussian, the posterior is Gaussian as well:

$$ p(\boldsymbol{\theta} \mid y_{1:T}) = \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{m}_T, \mathbf{P}_T), $$

which means that its parameters $\mathbf{m}_T$ and $\mathbf{P}_T$ can be derived in closed form. They are known to be:

$$ \begin{align} \mathbf{m}_T &= \left[ \mathbf{P}_0^{-1} + \frac{1}{\sigma^2}\mathbf{H}^T\mathbf{H}\right]^{-1} \left[ \frac{1}{\sigma^2}\mathbf{H}^T\mathbf{y} + \mathbf{P}_0^{-1}\mathbf{m}_0\right], \\ \mathbf{P}_T &= \left[\mathbf{P}_0^{-1} + \frac{1}{\sigma^2}\mathbf{H}^T\mathbf{H}\right]^{-1}, \end{align} $$

where the matrix $\mathbf{H}$ is built by stacking the row vectors $\mathbf{H}_k$ and the column vector $\mathbf{y}$ is obtained by stacking all the measurements $y_k$.

It can also be shown that the solution to this problem can be obtained recursively. The idea is quite simple: at every step we turn the previous posterior into the new prior. Doing so, we can perform online learning. Its biggest advantage is that we do not need all the data at once and the model can learn little by little, as new data samples arrive.

If we plug $\mathbf{H}_k$ in place of $\mathbf{H}$ into the equations above and use the previous posterior $\mathcal{N}(\mathbf{m}_{k-1}, \mathbf{P}_{k-1})$ as the prior, we end up with:

$$ \begin{align} \mathbf{m}_k &= \left[ \mathbf{P}_{k-1}^{-1} + \frac{1}{\sigma^2}\mathbf{H}_k^T\mathbf{H}_k\right]^{-1} \left[ \frac{1}{\sigma^2}\mathbf{H}_k^T y_k + \mathbf{P}_{k-1}^{-1}\mathbf{m}_{k-1}\right], \\ \mathbf{P}_k &= \left[\mathbf{P}_{k-1}^{-1} + \frac{1}{\sigma^2}\mathbf{H}_k^T\mathbf{H}_k\right]^{-1}. \end{align} $$

Now, we can use the Woodbury matrix identity to show that:

$$ \mathbf{P}_k = \mathbf{P}_{k-1} - \mathbf{P}_{k-1}\mathbf{H}_k^T \left[ \mathbf{H}_k \mathbf{P}_{k-1} \mathbf{H}_k^T + \sigma^2 \right]^{-1}\mathbf{H}_k \mathbf{P}_{k-1}. $$
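This identity is easy to check numerically. The following minimal sketch (with arbitrary made-up values) compares the direct inverse with the Woodbury form for a random positive-definite $\mathbf{P}_{k-1}$ and a single row vector $\mathbf{H}_k$:

import numpy as np

rng = np.random.default_rng(3)
sigma = 0.15
A = rng.normal(size=(2, 2))
P_prev = A @ A.T + np.eye(2)          # random symmetric positive-definite prior covariance
H_k = rng.normal(size=(1, 2))         # a single row vector

# Direct form: invert the 2x2 information matrix
P_direct = np.linalg.inv(np.linalg.inv(P_prev) + H_k.T @ H_k / sigma**2)

# Woodbury form: only a scalar (1x1 matrix) is inverted
S = H_k @ P_prev @ H_k.T + sigma**2
P_woodbury = P_prev - P_prev @ H_k.T @ np.linalg.inv(S) @ H_k @ P_prev

print(np.allclose(P_direct, P_woodbury))   # True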

The equations for the mean and covariance can then be reduced to:

$$ \begin{align} S_k &= \mathbf{H}_k \mathbf{P}_{k-1} \mathbf{H}_k^T + \sigma^2 \\ \mathbf{K}_k &= \mathbf{P}_{k-1} \mathbf{H}_k^T S_k^{-1} \\ \mathbf{m}_k &= \mathbf{m}_{k-1} + \mathbf{K}_k \left[ y_k - \mathbf{H}_k \mathbf{m}_{k-1} \right] \\ \mathbf{P}_k &= \mathbf{P}_{k-1} - \mathbf{K}_k S_k \mathbf{K}_k^T. \end{align} $$

We apply these equations sequentially to all datapoints. Since $S_k$ is a scalar, no matrix inversion is required to perform these computations. Note: these equations are a special case of the Kalman filter update equations, where the estimated parameters are assumed to be constant: there is no dynamic model for the parameters $\boldsymbol{\theta}$.

$S_k$ here is sometimes called the predictive variance. It consists of two summands: the prior covariance of the parameters $\boldsymbol{\theta}$ mapped into the $y_k$ space ($\mathbf{H}_k \mathbf{P}_{k-1} \mathbf{H}_k^T$) and the noise variance $\sigma^2$.

Code example

import numpy as np
m_0 = np.array([[0.], [0.]])            # prior mean of the parameters
P_0 = np.array([[1., 0.0], [0.0, 1.]])  # prior covariance of the parameters

sigma = 0.15                        # measurement noise standard deviation
Theta = np.array([[-0.5], [0.25]])  # true parameter values used to generate the data

# sample points
t = np.random.uniform(-2, 2, size = 16)
H = np.vstack([np.ones_like(t), t]).T
# generate true samples from our model
y_true = H @ Theta
# add noise
y_measured = y_true + np.random.normal(0, sigma, size=len(t)).reshape(-1, 1)
# perform batch estimation
# according to (3.4) from Sarkka book
P_T = np.linalg.inv(np.linalg.inv(P_0) + 1 / sigma**2 * H.T @ H)
m_T = P_T @ (1 / sigma**2 * H.T @ y_measured + np.linalg.inv(P_0) @ m_0)
print("Batch estimation of parameters:", m_T, P_T)
Batch estimation of parameters: [[-0.4668537]
 [ 0.2654166]] [[ 1.41013733e-03 -7.06345966e-05]
 [-7.06345966e-05  8.51102741e-04]]
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'
from scipy.stats import multivariate_normal

def visualize_state(t: np.ndarray, y_measured: np.ndarray, m_k: np.ndarray, P_k: np.ndarray):
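    """Plot the current belief N(m_k, P_k) in three panels: contours of the
    parameter distribution, the predictive mean of y with a one-standard-deviation
    band, and regression lines drawn from sampled parameters together with the
    measurements seen so far."""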
    theta_1, theta_2 = np.meshgrid(
        np.linspace(-2, 2, 101),
        np.linspace(-2, 2, 101))
    pdf = multivariate_normal.pdf(np.dstack((theta_1, theta_2)), m_k.flatten(), P_k)
    sampled_theta_from_model = multivariate_normal.rvs(m_k.flatten(), P_k, size = 5)
    t_ticks = np.linspace(-2, 2, 21)
    H = np.vstack([np.ones_like(t_ticks), t_ticks]).T
    mean_y = (H @ m_k).flatten()
    std_y = np.sqrt(np.diag(H @ P_k @ H.T))
    
    fig, ax = plt.subplots(1, 3, figsize = (6, 3))
    ax[0].contour(theta_1, theta_2, pdf)
    ax[0].set(aspect="equal", xlabel="$\\theta_1$", ylabel="$\\theta_2$", xlim=[-2, 2], ylim=[-2, 2], title="prior/posterior")

    ax[1].plot(t_ticks, mean_y)
    ax[1].fill_between(t_ticks, mean_y - std_y, mean_y + std_y, color="orange", alpha=0.5)
    ax[1].set(aspect="equal", xlabel="$t_1$", ylabel="$\\hat{y}$", xlim=[-2, 2], ylim=[-2, 2], title="$\\hat{y} = \\mathbf{H}\\mathbf{m}_k \\pm \\text{std}$")
    
    for sampled_theta in sampled_theta_from_model:
        y_predictions_from_sampled_model = H @ sampled_theta
        ax[2].plot(t_ticks, y_predictions_from_sampled_model, alpha = 0.1, c="tab:blue")
    ax[2].scatter(t, y_measured, marker='o', facecolor="none", edgecolor="tab:blue", label="samples")
    ax[2].set(xlabel="$t_1$", ylabel="$y$", aspect="equal", xlim=[-2, 2], ylim=[-2, 2], title="$\\boldsymbol{\\theta} \\sim \\mathcal{N}(\\mathbf{m}_k, \\mathbf{P}_k)$")
    ax[2].legend()
    fig.tight_layout()


# perform recursive estimation
# according to (3.7) - (3.8) from Sarkka book
m_k, P_k = m_0, P_0
visualize_state([], [], m_k, P_k)

for i, (H_k, y_k) in enumerate(zip(H, y_measured)):
    H_k = H_k.reshape((1, -1))
    S_k = H_k @ P_k @ H_k.T + sigma**2   # predictive variance (1x1 array)
    K_k = P_k @ H_k.T / S_k              # gain vector
    m_k = m_k + K_k * (y_k - H_k @ m_k)  # mean update with the innovation y_k - H_k m_{k-1}
    P_k = P_k - S_k * K_k @ K_k.T        # covariance update

    if i % 5 == 0:
        visualize_state(t[0:i+1], y_measured[0:i+1], m_k, P_k)

print("Recursive estimation of parameters:", m_T, P_T)
Recursive estimation of parameters: [[-0.4668537]
 [ 0.2654166]] [[ 1.41013733e-03 -7.06345966e-05]
 [-7.06345966e-05  8.51102741e-04]]

(Five sets of figures, one per visualized step: the left panel shows the prior/posterior contours over $(\theta_1, \theta_2)$, the middle panel the predictive mean with a one-standard-deviation band, and the right panel regression lines drawn from $\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{m}_k, \mathbf{P}_k)$ together with the measurements processed so far.)

Now let’s take a look at the set of images above, which demonstrate online learning for the linear model. Before processing any samples, our prior model:

  • assumes that both parameters $\theta_1$ and $\theta_2$ are zero-mean and have unit variance. This is shown in the left-most image.
  • the predictions of our model for each $t$ in the range $(-2, 2)$ are shown in the middle image with the associated predictive variance. We can note that the variance is lowest around $t = 0$ and rises as we move away from it.
  • if we draw actual parameters from our model, we end up with randomly oriented lines (shown on the right side).

Now, as we add new datapoints (circles in the right-hand images):

  • our prior is updated using the likelihood and forms the new posterior (again, the left images). With each newly added datapoint our model becomes more and more accurate, and we can see that the covariance decreases.
  • if we now draw model parameters and plot the corresponding regression lines, we can see that the predictions correctly reflect the true model.

Conclusion & resources Link to heading

The Bayesian approach allows us to compute the posterior distribution of a set of parameters, which quantifies their compatibility with the data and the prior distribution. It produces both a prediction of the mean and the associated uncertainty.

A lot of the material was cited and adapted from Simo Särkkä’s book Bayesian Filtering and Smoothing (the equation numbers in the code comments refer to it).

I have also found the following post quite useful; it derives all the equations in detail.