Let’s assume we want to predict whether a person is male or female based on their height.

import numpy as np
import matplotlib.pyplot as plt

heights = np.array([120, 135, 145, 150, 151, 155, 165, 170, 172, 175, 180, 190])
labels = np.array([0, 0, 0, 0., 0, 0, 1, 1, 1, 1, 1, 1.])
females = labels == 0
males = labels == 1

fig, ax = plt.subplots(1, 1, figsize = (6, 2))
ax.scatter(heights[females], labels[females])
ax.scatter(heights[males], labels[males])
ax.set(xlabel = 'height, cm', ylabel = 'class')
ax.legend(['female', 'male'])
ax.grid()

(Figure: scatter of height, cm vs. class for the female and male samples)

Our goal is to find a function that approximates $p(\mathbf{Y}|\mathbf{X})$.

Linear regression

Let’s say we are given $N$ inputs and outputs. We want to find the linear function that maps our input $x$ to our output $y$:

$$ y = f(x) = ax + b, $$

or, in the general case,

$$ y = f(\boldsymbol{x}) = \boldsymbol{\theta}^T \boldsymbol{x}, $$

if our prediction depends not on a single feature but on $n$ of them: $\boldsymbol{x} \in \mathcal{R}^n$.

What we want to minimize is a loss function that describes the discrepancy between our predictions $\hat{y}$ and the actual labels $y$ over all samples $\boldsymbol{x}_i$:

$$ L(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \sum_{i=0}^{N-1}L(y_i, \hat{y}_i) = \sum_{i=0}^{N-1}L(y_i, \hat{\boldsymbol{\theta}}^T \boldsymbol{x}_i). $$

Least squares

If we want to minimize the sum of squared residuals, we end up with:

$$ L(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \sum_{i=0}^{N-1}(y_i - \hat{\boldsymbol{\theta}}^T \boldsymbol{x}_i)^2. $$

If we stack all the inputs (adding an extra column of ones, which models the constant, or intercept, term) and all the outputs into two matrices $\mathbf{X}$ and $\mathbf{Y}$,

$$ \begin{align*} \mathbf{X} = \begin{bmatrix} \boldsymbol{x}_0^T \\\\ \vdots \\\\ \boldsymbol{x}_{N-1}^T \end{bmatrix} & & \mathbf{Y} = \begin{bmatrix} y_0 \\\\ \vdots \\\\ y_{N-1} \end{bmatrix} \end{align*}, $$

we can write our loss function as:

$$ \begin{align*} L(\boldsymbol{y}, \hat{\boldsymbol{y}}) &= ||\mathbf{Y} - \mathbf{X} \boldsymbol{\theta}||^2 \\\\ &= (\mathbf{Y} - \mathbf{X} \boldsymbol{\theta})^T(\mathbf{Y} - \mathbf{X} \boldsymbol{\theta})\\\\ &= \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T\mathbf{X}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{Y} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}\\\\ &= \mathbf{Y}^T\mathbf{Y} - 2\mathbf{Y}^T\mathbf{X}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}. \end{align*} $$

During the derivation we use the fact that $\mathbf{Y}^T\mathbf{X}\boldsymbol{\theta}$ is a scalar, so it equals its own transpose $\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{Y}$.

Finding the derivative w.r.t. parameters:

$$ \begin{align*} \frac{\partial L}{\partial {\boldsymbol {\theta }}} &= \frac{\partial(\mathbf{Y}^T\mathbf{Y} - 2\mathbf{Y}^T\mathbf{X}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta})}{\partial{\boldsymbol {\theta }}} \\\\ &= -2\mathbf{Y}^T\mathbf{X} + 2\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X} \\\\ &= \left(2\boldsymbol{\theta}^T\mathbf{X}^T - 2\mathbf{Y}^T\right)\mathbf{X} \end{align*} $$

Then set the derivative to zero and solve for $\boldsymbol{\theta}$:

$$ \begin{align*} -2\mathbf{Y}^T\mathbf{X} + 2\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X} &= \boldsymbol{0} \\\\ \mathbf{X}^T\mathbf{X}\boldsymbol{\theta} &= \mathbf{X}^T\mathbf{Y}\\\\ \boldsymbol{\theta} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \end{align*} $$

By doing so and plugging the available data into $\mathbf{X}$ and $\mathbf{Y}$, we find the best-fit line and thus perform linear regression.
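
As a side note: in code it is often preferable to solve the normal equations (or the least-squares problem itself) rather than form the inverse explicitly. A minimal sketch, with a tiny made-up dataset of my own:

X_toy = np.array([[1., 1.], [2., 1.], [3., 1.], [4., 1.]])   # each row is [feature, 1]
Y_toy = np.array([[2.1], [3.9], [6.2], [8.1]])

theta_toy = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ Y_toy)     # solves X^T X theta = X^T Y
theta_lstsq, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond = None)     # same least-squares solution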

Maximum Likelihood

The probability of multiple independent events is known as the joint probability. It can be found by multiplying the probabilities of the individual events:

$$ p(\mathbf{Y}| \boldsymbol{\theta}, \mathbf{X}) = \prod_{i=0}^{N-1}p(y_i|\boldsymbol{\theta}, x_i). $$

If we assume that the noise is Gaussian, we can write:

$$ \mathcal{L} = p(\mathbf{Y}| \boldsymbol{\theta}, \mathbf{X}) = \prod_{i=0}^{N-1}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left( \frac{y_i - \hat{y}_i}{\sigma} \right)^2} $$

This probability $\mathcal{L}$ is called the likelihood. Applying $\log$ gives the log-likelihood, turning the product into a sum:

$$ \begin{align*} \mathcal{LL} &= \sum_{i=0}^{N-1}\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left( \frac{y_i - \hat{y}_i}{\sigma} \right)^2}\right) \\\\ &= \sum_{i=0}^{N-1}\log\frac{1}{\sqrt{2\pi\sigma^2}} + \sum_{i=0}^{N-1}\log\left(e^{-\frac{1}{2}\left( \frac{y_i - \hat{y}_i}{\sigma} \right)^2}\right) \\\\ &= \log\frac{1}{\sqrt{2\pi\sigma^2}} \sum_{i=0}^{N-1}1 + \sum_{i=0}^{N-1}{-\frac{1}{2}\left( \frac{y_i - \hat{y}_i}{\sigma} \right)^2} \\\\ &= -\frac{N}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=0}^{N-1}(y_i - \hat{y}_i)^2. \end{align*} $$

Now, if we view $\mathcal{LL}$ as a function of $\boldsymbol{\theta}$ alone, the first term does not depend on $\boldsymbol{\theta}$, so maximizing $\mathcal{LL}$ amounts to minimizing $\sum_i(y_i - \hat{y}_i)^2$, and differentiating with respect to $\boldsymbol{\theta}$ gives exactly the same expression as in the least-squares case. That is why the least-squares estimate can be viewed as the MLE under a Gaussian noise model:

$$ y = y_{true} + \epsilon, \quad\text{ where }\epsilon\thicksim N(0,\sigma^2). $$
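
As a quick sanity check, we can verify numerically that maximizing the Gaussian log-likelihood gives the same $\boldsymbol{\theta}$ as the normal equations. A rough sketch (the toy data and the use of scipy.optimize.minimize are my own choices):

from scipy.optimize import minimize

# hypothetical noisy line: y = 2x + 1 + Gaussian noise
rng = np.random.default_rng(0)
x_toy = np.linspace(0, 10, 50)
y_toy = 2 * x_toy + 1 + rng.normal(0, 0.5, size = x_toy.shape)
X_toy = np.column_stack((x_toy, np.ones_like(x_toy)))

theta_ls = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)   # closed-form least squares

def neg_ll(theta):
    # negative Gaussian log-likelihood with sigma fixed; constant terms dropped
    return 0.5 * np.sum((y_toy - X_toy @ theta) ** 2)

theta_mle = minimize(neg_ll, x0 = np.zeros(2)).x
# theta_ls and theta_mle agree up to the optimizer tolerance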

Fitting a line to our data

# linear regression
X = np.hstack((heights.reshape((-1, 1)), np.ones((len(heights), 1))))  # add a bias column of ones
Y = labels.reshape((-1, 1))
theta = np.linalg.inv(X.T @ X) @ X.T @ Y  # normal equations

fig, ax = plt.subplots(1, 1, figsize = (6, 2))
ax.scatter(heights[females], labels[females])
ax.scatter(heights[males], labels[males])
xrange = np.arange(np.min(heights), np.max(heights))
ax.plot(xrange, theta[0][0] * xrange + theta[1][0])
ax.set(xlabel = 'height, cm', ylabel = 'class')
ax.legend(['female', 'male'])
ax.grid()

print(f'theta: {*theta.flatten(),}')
theta: (0.022082018927444665, -3.011041009463699)

(Figure: linear regression fit over the height vs. class scatter)

Now, if we want to predict the gender of a person from their height, we just multiply the feature vector by $\boldsymbol{\theta}$ and look at the result: if it is greater than 0.5 we predict male, otherwise female. Done! But if we look at this problem more closely, we will find that this “probability” can sometimes be negative or greater than one, which is counter-intuitive, since we want to stay in probability space. Another problem is that linear regression is sensitive to imbalances in the data; the example after the following sketch demonstrates this.
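
First, a minimal sketch of the thresholding rule just described (the helper name is my own; theta is the vector fitted above):

def predict_gender_linear(height, theta, threshold = 0.5):
    score = theta[0][0] * height + theta[1][0]   # theta^T x with x = [height, 1]
    return 1 if score > threshold else 0         # 1 = male, 0 = female

print(predict_gender_linear(150, theta), predict_gender_linear(185, theta))   # 0 1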

heights = np.array([120, 121, 125, 135, 160, 161, 162, 163, 164, 165, 166, 167, 168, 170, 172, 175, 180])
labels = np.array([0, 0, 0, 0, 0, 0., 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
females = labels == 0
males = labels == 1

X = np.hstack((heights.reshape((-1, 1)), np.ones((len(heights), 1))))
Y = labels.reshape((-1, 1))
theta = np.linalg.inv(X.T @ X) @ X.T @ Y

fig, ax = plt.subplots(1, 1, figsize = (6, 2))
ax.scatter(heights[females], labels[females])
ax.scatter(heights[males], labels[males])
xrange = np.arange(np.min(heights), np.max(heights))
ax.plot(xrange, theta[0][0] * xrange + theta[1][0])
ax.set(xlabel = 'height, cm', ylabel = 'class')
ax.legend(['female', 'male'], loc = 'upper left')
ax.grid()

print(f'theta: {*theta.flatten(),}')
theta: (0.013266157882184739, -1.7925709515859998)

(Figure: linear regression fit on the imbalanced dataset)

Now we can see that the classification is clearly wrong for the males shorter than about 173 cm.
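
To make that concrete, we can compute the height at which the fitted line crosses the 0.5 threshold, using the theta just printed:

boundary = (0.5 - theta[1][0]) / theta[0][0]    # solves theta[0]*h + theta[1] = 0.5
print(f'decision boundary: {boundary:.1f} cm')  # ≈ 172.8 cm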

Logistic regression

Can we find a better function that approximates our data? Let’s have a look at the sigmoid function and plot its graph.

$$ \sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1). $$
def sigmoid(z):
    return 1 / (1. + np.exp(-z))

fig, ax = plt.subplots(figsize = (6, 2))
inputs = np.linspace(-10, 10, 100)
ax.plot(inputs, sigmoid(inputs))
ax.grid()
ax.set(xlabel = 'x', ylabel = r'$\sigma(x)$')

(Figure: the sigmoid function $\sigma(x)$)

The sigmoid takes an input $z \in \mathcal{R}$ and maps it into the range $(0, 1)$, so its output can be interpreted as a probability. Since we have only two classes, we can use the Bernoulli distribution. The likelihood of the data can be written as:

$$ \mathcal{L} = p(\mathbf{Y}| \boldsymbol{\theta}, \mathbf{X}) = \prod_{i=0}^{N-1} \sigma\left(\hat{y}_i\right)^{y_i} \left(1 - \sigma\left(\hat{y}_i\right)\right)^{1-y_i},~~~\hat{y}_i=\boldsymbol{\theta}^T\boldsymbol{x}_i. $$

Taking the negative of the logarithm, we obtain:

$$ \mathcal{LL} = -\sum_{i=0}^{N-1}\left(y_i\log\sigma\left(\hat{y}_i\right) + \left(1 - y_i\right)\log\left(1 - \sigma\left(\hat{y}_i\right)\right)\right) $$

That is the function we want to minimize. Let’s look at the penalties that this log-likelihood generates. To do so, we plot two graphs: one for the case when the true label is $0$ and the other for the case when the true label is $1$. If our prediction is close to the correct label, the penalization term approaches zero; if the prediction is completely wrong, the penalization term becomes very large. That is desirable behaviour for a classification problem.

def ll(label, prediction):
    # per-sample penalization term (binary cross-entropy)
    return - label * np.log(prediction) - (1 - label) * np.log(1 - prediction)

fig, ax = plt.subplots(figsize = (6, 2))
ax.plot(sigmoid(inputs), ll(1, sigmoid(inputs)), label = '$y = 1$')
ax.plot(sigmoid(inputs), ll(0, sigmoid(inputs)), label = '$y = 0$')
ax.grid()
ax.legend()
ax.set(xlabel = 'Probability', ylabel = 'Penalization term')

(Figure: penalization terms for $y = 1$ and $y = 0$ as functions of the predicted probability)

Writing the log likelihood $\mathcal{LL}$ in matrix notation:

$$ \mathcal{LL} = -\mathbf{Y}^T\log\sigma\left( \mathbf{X}\boldsymbol{\theta} \right) - (\mathbf{1} - \mathbf{Y})^T\log\left(\mathbf{1} - \sigma\left( \mathbf{X}\boldsymbol{\theta} \right) \right) $$
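
In numpy this matrix form translates almost directly; a small sketch (the function name is mine, sigmoid is the function defined above):

def negative_log_likelihood(theta, X, Y):
    # -Y^T log(sigma(X theta)) - (1 - Y)^T log(1 - sigma(X theta))
    p = sigmoid(X @ theta)
    return (-(Y.T @ np.log(p)) - ((1 - Y).T @ np.log(1 - p))).item()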

The derivative w.r.t. $\boldsymbol{\theta}$ can be found to be [1]:

$$ \frac{\partial\mathcal{LL}}{\partial\boldsymbol{\theta}} = \mathbf{X}^T\left( \sigma(\mathbf{X}\boldsymbol{\theta}) - \mathbf{Y} \right) $$

Since there is no closed-form solution, we have to use a numerical optimization technique. Let’s use gradient descent. I also tried Newton’s method, but wasn’t able to produce robust results for all tested cases.

def logistic_regression_gradient_descent(
    X, Y, step_size = 1e-4, iters = 1000000
):
    theta = np.zeros((X.shape[1], 1))
    for i in range(iters):
        # gradient step: grad = X^T (sigma(X theta) - Y)
        theta = theta - step_size * X.T @ (sigmoid(X @ theta) - Y)
    return theta

Y = labels.reshape((-1, 1))
X = np.hstack((heights.reshape((-1, 1)), np.ones((len(Y), 1))))
theta = logistic_regression_gradient_descent(X, Y, step_size = 1e-2)
print(f'theta: {*theta.flatten(),}')
theta: (17.057175819704746, -2857.0739642526546)
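
For completeness, a Newton (IRLS) step would look roughly like the sketch below. This is my own illustration rather than the code used for the plots, and on (nearly) separable data like ours the Hessian can become ill-conditioned, which matches the robustness issues mentioned above.

def logistic_regression_newton(X, Y, iters = 20):
    theta = np.zeros((X.shape[1], 1))
    for _ in range(iters):
        p = sigmoid(X @ theta)
        W = np.diagflat(p * (1 - p))                   # weights p(1 - p) on the diagonal
        grad = X.T @ (p - Y)                           # gradient of the negative log-likelihood
        hessian = X.T @ W @ X                          # its Hessian
        theta = theta - np.linalg.solve(hessian, grad)
    return theta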

And let’s look at how our function approximates the probability of a point belonging to one class or the other:

inputs = np.arange(100, 200)
outputs = sigmoid(np.hstack((inputs.reshape((-1, 1)), np.ones((len(inputs), 1)))) @ theta)

fig, ax = plt.subplots(1, 1, figsize = (6, 2))
ax.scatter(heights[females], labels[females])
ax.scatter(heights[males], labels[males])
ax.plot(inputs, outputs)
ax.set(xlabel = 'height, cm', ylabel = 'class')
ax.legend(['female', 'male'])
ax.grid()

(Figure: logistic regression fit, predicted probability vs. height)

Much better now!

Logarithm roots

The key thing in logistic regression is the following sequence of operations:

  1. converting the probability to odds (probabilities in $(0, 1)$ are mapped to $(0, \infty)$);
  2. converting the odds to log-odds (which maps $(0, \infty)$ to $(-\infty, \infty)$);
  3. the last step allows us to run a regression on an unconstrained space: $$ \log \frac{p_i}{1-p_i} = \boldsymbol{\theta}^T\mathbf{x_i} $$

After these transforms we have a linear model for logistic regression.

Of course, we can now go the other way and recover the sigmoid function that we use during the optimization to map between the domains:

$$ \begin{align*} \log \frac{p}{1-p} &= \boldsymbol{\theta}^T\mathbf{x} \\\\ \frac{p}{1-p} &= e^{~\boldsymbol{\theta}^T\mathbf{x}} \\\\ p &= \frac{e^{~\boldsymbol{\theta}^T\mathbf{x}}}{1 + e^{~\boldsymbol{\theta}^T\mathbf{x}}} \\\\ p &= \frac{1}{1 + e^{-\boldsymbol{\theta}^T\mathbf{x}}} \end{align*} $$
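
A quick numeric round trip with the sigmoid defined earlier shows that the two mappings are indeed inverses of each other:

p = 0.8
log_odds = np.log(p / (1 - p))   # ≈ 1.386
print(sigmoid(log_odds))         # ≈ 0.8 again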

References