Maximum likelihood estimates

Jose M Sallan 2021-05-28 5 min read

A common problem of statistics is to make inferences about the parameters of a probability distribution. By inference, or statistical inference, we mean to deduce properties of a population from a sample. An example of inference is to estimate the mean height of the inhabitants of a country from a sample or subset of individuals of that country.

To make statistical inferences, we need to make assumptions about the joint probability distribution of the observations. This joint probability distribution is the probability of observing a sample given fixed values of the parameters of the distribution. Following the mean height example, the central limit theorem asserts that the mean height of a sample of randomly picked individuals follows a normal distribution, with mean equal to the population mean. So we make the assumption that the sample mean follows a normal distribution.

When we make statistical inferences, we have a fixed set of observations, and our job is to obtain estimators of the parameters. Then we turn the joint probability distribution into a likelihood function. The likelihood function gives, for some candidate values of the parameters, the probability of the fixed, observed values of the random variables. We often consider that the maximum likelihood estimates of the parameters are the best values we can choose in statistical inference.

Maximum likelihood estimates of a binomial event

Let’s suppose that we have a population of red and white balls, and that a ball is red with an unknown probability p. This p is the parameter of a binomial probability distribution, which gives us the probability that k out of n balls are red as:

$$P\left[(n,k) \mid p\right] = \binom{n}{k} p^k (1-p)^{n-k}$$

Let’s suppose now that we take 5 balls from the population, and that two are red and three are white. We are observing that n=5 and k=2, so the likelihood function for this sample is:

$$L\left[p \mid (5,2)\right] = \binom{5}{2} p^2 (1-p)^3$$

The value of p that maximizes L(p) will be the maximum likelihood estimator of the probability of the population. Let’s represent the likelihood function:
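
As one way to do this, here is a minimal Python sketch (assuming numpy, scipy and matplotlib are available; the grid of 501 candidate values of p is an arbitrary choice) that evaluates the likelihood and marks its maximum:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Observed sample: n = 5 balls drawn, k = 2 of them red
n, k = 5, 2

# Evaluate the likelihood L[p | (5, 2)] on a grid of candidate values of p
p_grid = np.linspace(0, 1, 501)
likelihood = binom.pmf(k, n, p_grid)

# The grid value with the highest likelihood is the maximum likelihood estimate
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)  # 0.4

# Plot the likelihood function and mark its maximum
plt.plot(p_grid, likelihood)
plt.axvline(p_hat, linestyle="--")
plt.xlabel("p")
plt.ylabel("likelihood")
plt.show()
```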

It is likely that you are not surprised to learn that the maximum likelihood estimate of p is $2/5 = 0.4$.

Maximum likelihood estimates of a normal distribution

Let’s suppose now that we have a sample of n independent observations $x = \{x_1, \dots, x_n\}$ from a normal distribution with unknown population mean $\mu$ and population variance $\sigma^2$. The probability density function of this variable is the Gaussian function:

$$P\left[x \mid (\mu, \sigma)\right] = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

If the observations of x are independent and come from the same normal distribution $N(\mu, \sigma)$, the probability of their joint occurrence is equal to the product of the values of the Gaussian function for each observation. Then, we can define the likelihood function as:

$$L\left[(\mu, \sigma) \mid x\right] = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]$$

Finding the maximum of this function can be hard. A way of making this easier is to maximize the logarithm of the likelihood function instead. This arises frequently when we are dealing with likelihood functions of normal distributions, and it is known as the log likelihood function $l\left[(\mu, \sigma) \mid x\right]$. We can use the log likelihood instead of the likelihood because the logarithm is a monotonic function, so both reach their maximum at the same parameter values.
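
Concretely, taking the logarithm turns the product of Gaussian densities into a sum of logarithms:

$$\ln L\left[(\mu, \sigma) \mid x\right] = \sum_{i=1}^{n} \left[ -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$$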

The log of the above likelihood function is:

$$l\left[(\mu, \sigma) \mid x\right] = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

To obtain the maximum likelihood estimate of the mean $\hat{\mu}$, we set the partial derivative of the log likelihood with respect to $\mu$ equal to zero:

$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$$

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

We proceed in a similar way to obtain the maximum likelihood estimator of the variance:

$$\frac{\partial l}{\partial \sigma^2} = \frac{1}{2\sigma^2}\left[\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 - n\right] = 0$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$
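
To check these formulas numerically, here is a minimal Python sketch (the simulated sample, seed and starting values are arbitrary illustrative choices) that compares the closed-form estimates with a direct numerical maximization of the log likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated sample from a normal distribution (illustrative values)
rng = np.random.default_rng(1)
x = rng.normal(loc=170, scale=10, size=200)
n = len(x)

# Closed-form maximum likelihood estimates derived above
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

# Negative log likelihood of (mu, sigma2), to be minimized numerically
def neg_log_lik(theta):
    mu, sigma2 = theta
    if sigma2 <= 0:  # keep the optimizer inside the valid region
        return np.inf
    return (0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(sigma2)
            + ((x - mu) ** 2).sum() / (2 * sigma2))

res = minimize(neg_log_lik, x0=[150.0, 50.0], method="Nelder-Mead")
print(mu_hat, sigma2_hat)  # closed-form estimates
print(res.x)               # numerical maximizer, should be very close
```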

Maximum likelihood estimates in linear regression

Let’s move now to the linear regression model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$

Coefficients $\beta_0, \dots, \beta_p$ are population coefficients, which we can estimate through the estimators $b_0, \dots, b_p$. Let’s consider the residuals obtained when using those estimators:

$$e_i = y_i - b_0 - b_1 x_{i1} - \dots - b_p x_{ip}$$

Let’s make some assumptions about residuals:

  • observations are independent: this means that the residuals of an observation do not depend on other observations, or on exogenous variables (e.g., time) not considered in the model.
  • residuals follow a normal distribution $e_i \sim N(0, \sigma)$: a normal distribution with population mean zero and constant variance $\sigma^2$.

Given these assumptions, the likelihood function of the residuals is:

$$L\left[(\sigma, \mu) \mid e\right] = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{e_i^2}{2\sigma^2}\right)$$

And its log likelihood:

$$l\left[(\sigma, \mu) \mid e\right] = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} e_i^2$$

If the above assumptions about the residuals are valid, maximizing the log likelihood is the same as minimizing the sum of squared residuals. This means that the ordinary least squares (OLS) estimates are the maximum likelihood estimates of the coefficients of the linear regression model if $e_i \sim N(0, \sigma)$, that is, if the residuals follow a normal distribution with constant variance.
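
As an illustrative sketch of this equivalence (the simulated data, seed and starting values are arbitrary assumptions), the following Python code fits the same one-predictor model by OLS and by numerically maximizing the log likelihood; both approaches should return essentially the same coefficients:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data for a one-predictor linear model (illustrative values)
rng = np.random.default_rng(2)
n = 100
x1 = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x1 + rng.normal(scale=1.5, size=n)

# Ordinary least squares estimates of (b0, b1)
X = np.column_stack([np.ones(n), x1])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Negative log likelihood assuming e_i ~ N(0, sigma)
def neg_log_lik(theta):
    b0, b1, sigma = theta
    if sigma <= 0:  # keep the optimizer inside the valid region
        return np.inf
    e = y - b0 - b1 * x1
    return (0.5 * n * np.log(2 * np.pi) + n * np.log(sigma)
            + (e ** 2).sum() / (2 * sigma ** 2))

res = minimize(neg_log_lik, x0=[0.0, 0.0, 1.0], method="Nelder-Mead")
print(b_ols)      # OLS coefficients (b0, b1)
print(res.x[:2])  # maximum likelihood coefficients, essentially identical
```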

References