Machine Learning Part 4: Maximum Likelihood Estimation

Jyotir Sai, Engineering Student

There's another way we can approach linear regression, and that's from a probabilistic point of view. If you've ever dealt with logistic regression or neural networks, then you've probably heard of maximum likelihood estimation. It's another way to estimate the parameters of our model. The goal of this blog post is to introduce maximum likelihood estimation and show that it's equivalent to estimating the parameters by minimizing the mean-squared error. First, let's start with a small review of some relevant probability concepts.

Probability Review

Probability is the formal study of uncertainty. The sample space, Ω, is the set of all possible outcomes of an experiment. For example, the sample space of tossing a die is

\Omega = \left \{1, 2, 3, 4, 5, 6 \right \}

In our probability space, we have a probability distribution, P, that maps an event to a value between 0 and 1. A random variable, X, represents the outcome of an event.

P(X = A) = P(A)

The above represents the distribution of a variable and is read "the probability of random variable X taking on a value of A". For example, the probability of tossing a die and getting a 5 is

P(X = 5) = \frac{1}{6}

The distribution over more than one variable is known as the joint distribution or joint probability. If A and B are independent, the joint probability is simply the product of the individual probabilities.

P(A, B) = P(A) \cdot P(B)

For example, let's say we have two separate dice. What is the probability of rolling a 3 with die 1 and rolling a 4 with die 2?

P(A = 3, B = 4) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}

The conditional probability is the probability of event A occurring given that event B has occurred.

P(A|B) = \frac{P(A, B)}{P(B)}

Suppose two dice are rolled. What is the probability that one of the dice shows a 5, given that the total sum is 10?

P(A = 5 \mid B = 10)

Let's first think about the probability of the total sum being 10, P(B). I can roll either {5,5}, {6,4}, or {4,6}. There are 36 total outcomes when rolling two dice, and 3 of them result in a sum of 10. Therefore, P(B) = 3/36. Now, for the joint probability: what is the probability of one of the dice being a 5 AND the total sum being 10? This is simply the probability of rolling {5,5}, which is 1/36. Plugging these probabilities into our formula, we have

P(A|B) = \frac{1/36}{3/36} = \frac{1}{3}

The probability that one of the dice shows a 5, given that the total sum is 10, is 1/3.
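
As a quick sanity check (not from the original post), here's a short Python snippet that enumerates all 36 outcomes of two dice and verifies both the joint probability from the earlier example and this conditional probability.

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))

# Joint probability from the earlier example: die 1 shows a 3 AND die 2 shows a 4
p_3_and_4 = Fraction(sum(1 for a, b in outcomes if a == 3 and b == 4), len(outcomes))
print(p_3_and_4)  # 1/36

# P(B): the total sum is 10
b = [o for o in outcomes if sum(o) == 10]
p_b = Fraction(len(b), len(outcomes))  # 3/36

# P(A, B): one of the dice shows a 5 AND the sum is 10 (only {5, 5})
p_a_and_b = Fraction(sum(1 for o in b if 5 in o), len(outcomes))  # 1/36

# Conditional probability P(A|B) = P(A, B) / P(B)
print(p_a_and_b / p_b)  # 1/3
```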

If A is independent of B, then knowing that event B occurred has no effect on the probability of A.

P(A|B) = P(A)

The marginal probability of a variable is obtained by summing (or integrating) the joint distribution over all other variables.

P(X) = \sum_{Y} P(X, Y) = \int P(X, Y)\, dy
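
For a concrete discrete case, the sketch below uses a small made-up joint table P(X, Y) and marginalizes out Y by summing across its values.

```python
import numpy as np

# A made-up joint distribution P(X, Y): rows are values of X, columns are values of Y
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.25, 0.20]])

# Marginal P(X): sum the joint probabilities over every value of Y
p_x = joint.sum(axis=1)
print(p_x)                          # [0.4 0.6]
print(np.isclose(p_x.sum(), 1.0))   # True: the marginal still sums to 1
```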

The expectation, E(X), is the expected or average value of the variable X.

E(X) = \sum_{X} P(X)\, X

What is the expected value of rolling a die? We have

E(X) = \frac{1}{6} \cdot 1 + \frac{1}{6} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{6} \cdot 4 + \frac{1}{6} \cdot 5 + \frac{1}{6} \cdot 6 = 3.5
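
The same calculation in a couple of lines of Python, using exact fractions for a fair six-sided die:

```python
from fractions import Fraction

# Expected value of a fair six-sided die: sum of (value * probability)
expected_value = sum(Fraction(1, 6) * face for face in range(1, 7))
print(expected_value)         # 7/2
print(float(expected_value))  # 3.5
```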

Bayes' theorem is used to calculate the probability of an event, given prior information related to the event.

P(X|Y) = \frac{P(Y|X)\, P(X)}{P(Y)}

The P(X|Y) term is known as the posterior probability. We want to know the outcome of X, given the evidence from Y. The P(Y|X) term is called the likelihood. It's the probability of seeing the evidence Y for a particular outcome X. Bayes' theorem can also be written as

P(X|Y) = \frac{P(Y|X)\, P(X)}{P(X)\, P(Y|X) + P(\neg X)\, P(Y|\neg X)}

Where the probability P(Y) is written in its full form (the law of total probability).
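
As a small illustration (with made-up numbers, not from the original post), the snippet below plugs a prior P(X), a likelihood P(Y|X), and P(Y|¬X) into this expanded form of Bayes' theorem.

```python
# Made-up numbers for illustrating Bayes' theorem
p_x = 0.01              # prior P(X)
p_not_x = 1 - p_x       # P(not X)
p_y_given_x = 0.95      # likelihood P(Y | X)
p_y_given_not_x = 0.10  # P(Y | not X)

# P(Y) written in its full form (law of total probability)
p_y = p_x * p_y_given_x + p_not_x * p_y_given_not_x

# Posterior P(X | Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 4))  # 0.0876
```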

Maximum Likelihood Estimation

Before getting into maximum likelihood estimation, I want to talk about the two major schools of thought in probability: frequentist and Bayesian. In frequentist reasoning, we apply probabilities to data and treat our parameters as fixed quantities, not random variables. In Bayesian thinking, we assign probabilities to hypotheses and treat our parameters as random variables. In terms of linear regression, the parameters are the weights. Maximum likelihood estimation takes a frequentist approach: we assume our data comes from a probability distribution and estimate the weights as fixed quantities.

When estimating our weights, our objective was to minimize the mean-squared error loss function. In maximum likelihood estimation, we instead look for the parameters that maximize the likelihood of our data. Let's say we want to fit a linear regression model

y^{(m)} = \mathbf{w}^{T} x^{(m)} + \epsilon^{(m)}

Where the ϵ term is the noise, which we will assume has a mean value of 0. We're also going to assume that our data comes from a normal distribution. The normal distribution is given by the following formula.

f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2}}

σ is the standard deviation and μ is the mean. Under these assumptions, the distributions of the noise and of the output are

P(\epsilon^{(m)}) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{\epsilon^{(m)}}{\sigma}\right)^{2}}
P(y^{(m)} | x^{(m)}) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{y^{(m)} - \mathbf{w}^{T} x^{(m)}}{\sigma}\right)^{2}}

Where

\epsilon^{(m)} \sim N(0, \sigma^2)
y^{(m)} \sim N(\mathbf{w}^T x^{(m)}, \sigma^2)

The output, y^{(m)}, is assumed to be normally distributed with a mean of \mathbf{w}^T x^{(m)}. We want to find the parameter, \mathbf{w}, that maximizes the fit of the distribution P(y^{(m)} | x^{(m)}) to our data. In math terms, we want

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}}\, P(D; \mathbf{w})

where D = \left\{ (x^{(m)}, y^{(m)}) \right\}_{m=1}^{M} is our data. The above is the definition of maximum likelihood estimation. We assume the data points are independent, with each y^{(m)} depending only on its respective x^{(m)}, so the joint probability of the whole dataset factorizes into a product

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \prod_{m=1}^{M} P(y^{(m)} | x^{(m)}; \mathbf{w})
\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \prod_{m=1}^{M} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{y^{(m)} - \mathbf{w}^{T} x^{(m)}}{\sigma}\right)^{2}}
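
To make the product concrete, here's a small Python sketch (with a made-up one-dimensional dataset, a fixed noise level σ, and two hypothetical weight vectors) that evaluates this likelihood directly as a product of Gaussian densities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D dataset: y = 2x + 1 plus Gaussian noise
M, sigma = 20, 0.5
x = rng.uniform(-1, 1, size=M)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=M)
X = np.column_stack([np.ones(M), x])  # column of ones for the bias term

def likelihood(w):
    """Product over the data points of the Gaussian density P(y^(m) | x^(m); w)."""
    residuals = y - X @ w
    densities = np.exp(-0.5 * (residuals / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.prod(densities)

# Weights close to the ones that generated the data give a much larger likelihood
print(likelihood(np.array([1.0, 2.0])))  # relatively large
print(likelihood(np.array([0.0, 0.0])))  # many orders of magnitude smaller
```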

In the expressions above, we replaced P(D; \mathbf{w}) with the likelihood function, which is the product of the probabilities of the individual data points, and this is the function we want to maximize. Since the logarithm is monotonically increasing, maximizing the log of the likelihood gives the same \hat{\mathbf{w}}, and the log turns the product into a regular sum because \log(AB) = \log(A) + \log(B).

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \sum_{m=1}^{M} \log \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{y^{(m)} - \mathbf{w}^{T} x^{(m)}}{\sigma}\right)^{2}}

The log cancels the exponential, and the remaining \log \frac{1}{\sigma \sqrt{2\pi}} term is a constant that does not depend on \mathbf{w}, so it can be dropped. The expression reduces to

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \sum_{m=1}^{M} \frac{-1}{2\sigma^{2}} \left(y^{(m)} - \mathbf{w}^T x^{(m)}\right)^{2}

The positive constant \frac{1}{2\sigma^{2}} in front does not change which \mathbf{w} attains the maximum, so we can replace it with any convenient positive constant, such as \frac{1}{2M}

\hat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \sum_{m=1}^{M} \frac{-1}{2M} \left(y^{(m)} - \mathbf{w}^T x^{(m)}\right)^{2}

We can get rid of the negative sign by changing argmax to argmin.

\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} \frac{1}{2M} \sum_{m=1}^{M} \left(y^{(m)} - \mathbf{w}^T x^{(m)}\right)^{2}

We can see that the above expression is exactly the mean-squared error loss. Thus we've shown that minimizing the mean-squared error is the same as maximizing the likelihood function, assuming our data is normally distributed.
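
To close, here's a short numerical sketch (with synthetic data and an assumed known noise level σ) that estimates the weights both ways: once by ordinary least squares, i.e. minimizing the mean-squared error, and once by numerically maximizing the Gaussian log-likelihood with scipy. The two estimates agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Synthetic data from a made-up linear model with Gaussian noise
M, sigma = 100, 0.3
x = rng.uniform(-2, 2, size=M)
y = 1.5 - 0.7 * x + rng.normal(0.0, sigma, size=M)
X = np.column_stack([np.ones(M), x])  # bias column plus the feature

# 1) Minimize the mean-squared error (closed-form least squares)
w_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Maximize the likelihood by minimizing the negative Gaussian log-likelihood
def neg_log_likelihood(w):
    residuals = y - X @ w
    return np.sum(0.5 * (residuals / sigma) ** 2 + np.log(sigma * np.sqrt(2.0 * np.pi)))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print(w_mse)  # roughly [ 1.5, -0.7]
print(w_mle)  # the same values, up to optimizer tolerance
```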