Maximum Likelihood Estimation
In this lesson, you will learn about a method for finding an optimal estimator for an unknown parameter.
Suppose #x_1,...,x_n# are the result of #n# independent measurements of a random variable whose probability distribution depends on some parameter #\theta#, with pmf #p(x;\theta)# or pdf #f(x;\theta)#. We would like to obtain an estimator #\hat{\theta}# of #\theta# that depends on #x_1,...,x_n# and has the smallest possible MSE.
The most popular method for finding such an estimator is the Method of Maximum Likelihood, and the estimator it produces is called the Maximum Likelihood Estimator, or MLE.
Method of Maximum Likelihood
The idea behind the Method of Maximum Likelihood is to find the value for the parameter that provides the best explanation for the data we observed.
Process of Maximum Likelihood Estimation
For simplicity, assume we have a continuous random variable #X# with pdf #f(x;\theta)# depending on an unknown parameter #\theta#, and #x_1,...,x_n# are the result of independent observations of #X# on a random sample of size #n#. If #X# is discrete with pmf #p(x;\theta)# the procedure is identical.
First we form the likelihood function:
\[L(\theta | x_1,...,x_n) = f(x_1;\theta) \cdots f(x_n;\theta) = \displaystyle\prod_{i=1}^n f(x_i;\theta)\]
We want the value of #\theta# that maximizes this function, if it exists. If the domain of #\theta# involves one or more endpoints, we know that the maximum might occur at one of the endpoints. However, the maximum might also occur at a point where the derivative of this function is zero, or where the derivative does not exist.
This requires differentiating the likelihood function. However, computing the derivative of a product with #n# factors is not efficient. Since the logarithm (regardless of base) is an increasing function, any value of #\theta# that maximizes the logarithm of #L(\theta|x_1,...,x_n)# will also maximize #L(\theta|x_1,...,x_n)#. Hence we take the natural logarithm to form the log-likelihood function:
\[l(\theta|x_1,...,x_n) = \ln L(\theta | x_1,...,x_n) = \ln \Bigg(\prod_{i=1}^n f(x_i;\theta)\Bigg) = \displaystyle\sum_{i=1}^n \ln f(x_i;\theta)\]
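Because the logarithm is increasing, the likelihood and the log-likelihood attain their maximum at the same value of #\theta#. The following minimal sketch illustrates this numerically; the curve used here is an arbitrary made-up positive function standing in for a likelihood, evaluated on a grid of parameter values:
```python
import numpy as np

# Grid of candidate parameter values.
theta = np.linspace(0.1, 5.0, 1000)

# An arbitrary positive curve standing in for a likelihood function.
L = 0.37 * np.exp(-(theta - 2.0) ** 2)

# The curve and its logarithm peak at the same grid point (about 2.0).
print(theta[np.argmax(L)], theta[np.argmax(np.log(L))])
```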
Then we find all critical values of #\theta# at which the derivative \[\cfrac{\mathrm{d}l}{\mathrm{d}\theta} = \displaystyle\sum_{i=1}^n\cfrac{\frac{\partial}{\partial\theta}f(x_i;\theta)}{f(x_i;\theta)}\] equals #0#, and all singular values at which the derivative fails to exist.
After this we must use either the first derivative test or the second derivative test to determine whether or not there is a local maximum at any of the critical values or singular values we found. Comparing the value of the likelihood function at all endpoints, critical values and singular values allows us to determine the MLE #\hat{\theta}# of #\theta#.
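In practice, when the equation #\cfrac{\mathrm{d}l}{\mathrm{d}\theta} = 0# has no closed-form solution, the maximization can be carried out numerically. The sketch below assumes SciPy is available and minimizes the negative log-likelihood over a bounded interval; the sample data and the exponential log-density (anticipating Example 1 below) are only illustrative choices:
```python
import numpy as np
from scipy.optimize import minimize_scalar

def numerical_mle(log_pdf, data, bounds):
    """Return the theta in `bounds` maximizing sum_i log_pdf(x_i, theta)."""
    neg_log_likelihood = lambda theta: -np.sum(log_pdf(data, theta))
    result = minimize_scalar(neg_log_likelihood, bounds=bounds, method="bounded")
    return result.x

# Illustration with the exponential log-density ln f(x; lam) = ln(lam) - lam * x
# (see Example 1): the numerical optimum should match 1 / xbar closely.
x = np.array([0.8, 1.4, 0.3, 2.1, 0.9])
lam_hat = numerical_mle(lambda data, lam: np.log(lam) - lam * data, x, (1e-6, 50.0))
print(lam_hat, 1 / x.mean())
```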
Note that the MLE is not necessarily an unbiased estimator. For instance, the MLE of #\sigma^2# for a normal distribution is: \[\cfrac{1}{n}\displaystyle\sum_{i=1}^n(x_i - \bar{x})^2\] We have already shown in the previous lesson that this estimator is biased. However, under mild regularity conditions the bias (if any) of the MLE converges to #0# as the sample size #n# increases, and furthermore its variance converges to a theoretical minimum (the Cramér–Rao lower bound). Moreover, the probability distribution of the MLE converges to a normal distribution. Hence the MLE is generally an excellent estimator of a parameter.
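A small simulation can illustrate this behaviour for the variance estimator above, assuming normally distributed data; the true variance and the sample sizes below are arbitrary choices made for the sketch:
```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # variance of the simulated normal data

# Estimate the expected value of the MLE (1/n) * sum (x_i - xbar)^2
# by averaging it over many simulated samples of each size.
for n in (5, 20, 100, 1000):
    samples = rng.normal(0.0, np.sqrt(true_var), size=(5000, n))
    mle_values = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
    print(n, mle_values.mean())  # approaches 4.0 as n grows, i.e. the bias vanishes
```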
We now provide some examples. Note that, since #\bar{x} = \cfrac{1}{n}\displaystyle\sum_{i=1}^n x_i#, we will replace #\displaystyle\sum_{i=1}^n x_i# with #n\bar{x}# whenever it occurs in the examples, which is handy for simplifying the computations.
Example 1: Suppose #x_1,...,x_n# are the result of independent measurements on a continuous random variable whose probability distribution has the following pdf:
\[f(x;\lambda) = \begin{cases}
\lambda e^{-\lambda x}& \text{when }x > 0\\\\
0& \text{otherwise}
\end{cases}\]
for some unknown parameter #\lambda > 0#.
Find the MLE #\hat{\lambda}# of #\lambda#.
Solution: The domain of #\lambda# is #(0,\infty)#, which has no endpoints. The log-likelihood function is:
\[\begin{array}{rcl}
l(\lambda|x_1,...,x_n) &=& \displaystyle\sum_{i = 1}^n \ln \Big(\lambda e^{-\lambda x_i}\Big) \\\\
&=& \displaystyle\sum_{i=1}^n (\ln \lambda - \lambda x_i)\\\\
&=& n \ln \lambda - \lambda \displaystyle\sum_{i = 1}^n x_i \\\\
&=& n \ln \lambda - n \lambda \bar{x}
\end{array}\]
Then the derivative of the log-likelihood function is: \[\cfrac{\mathrm{d}l}{\mathrm{d}\lambda} = \cfrac{n}{\lambda} - n\bar{x}\] This is defined for all #\lambda > 0#, so there are no singular points.
Moreover #\cfrac{n}{\lambda} - n\bar{x} = 0# when #\lambda = \cfrac{1}{\bar{x}}#, so we have one critical point.
Note that the second derivative \[\cfrac{\mathrm{d}^2l}{\mathrm{d}\lambda^2} = -\cfrac{n}{\lambda^2}\] is negative for all #\lambda > 0#, so the log-likelihood function is concave and the critical point is a global maximum.
Consequently, the MLE of #\lambda# is #\hat{\lambda} = \cfrac{1}{\bar{x}}#.
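As a quick check, we can simulate exponential data with a known rate and see that #\hat{\lambda} = \cfrac{1}{\bar{x}}# recovers it increasingly well as #n# grows; the true rate #\lambda = 2# and the sample sizes in this sketch are arbitrary:
```python
import numpy as np

rng = np.random.default_rng(1)
true_lam = 2.0  # rate of the simulated exponential data

for n in (10, 100, 10_000):
    # numpy parameterizes the exponential by the scale 1/lambda.
    x = rng.exponential(scale=1.0 / true_lam, size=n)
    print(n, 1.0 / x.mean())  # the MLE approaches 2.0 as n grows
```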
Example 2: Suppose #x_1,...,x_n# denote the result of #n# independent measurements of a discrete random variable whose probability mass function is: \[p(x;\pi) = \pi(1 - \pi)^{x-1}\] for #x=1,2,3,...#, where #\pi# is an unknown parameter in the interval #(0,1)#.
Find the MLE #\hat{\pi}# of #\pi#.
Solution: The log-likelihood function is:
\[\begin{array}{rcl}
l(\pi|x_1,...,x_n) &=& \displaystyle\sum_{i=1}^n \ln \Big(\pi(1-\pi)^{x_i - 1}\Big)\\\\
&=& \displaystyle\sum_{i=1}^n\Big(\ln \pi +(x_i - 1)\ln(1 - \pi)\Big)\\\\
&=& n\ln\pi + \ln(1 - \pi)\displaystyle\sum_{i=1}^n(x_i - 1)\\\\
&=& n\ln\pi + n(\bar{x} - 1)\ln(1 - \pi)
\end{array}\]
Then the derivative of the log-likelihood function is:
\[\begin{array}{rcl}
\cfrac{\mathrm{d}l}{\mathrm{d}\pi} &=& \cfrac{n}{\pi} - \cfrac{n(\bar{x} -1)}{1 - \pi}\\\\
&=& n\left( \cfrac{1 - \pi - \bar{x}\pi + \pi}{\pi(1 - \pi)}\right)\\\\
&=& n\left(\cfrac{1 - \bar{x}\pi}{\pi(1 - \pi)}\right)
\end{array}\]
which equals #0# when #\pi = \cfrac{1}{\bar{x}}#.
Note that:
- when #\pi < \cfrac{1}{\bar{x}}# we have #1 - \bar{x}\pi > 0# so the derivative is positive and the log-likelihood function is increasing
- when #\pi > \cfrac{1}{\bar{x}}# we have #1 - \bar{x}\pi < 0# so the derivative is negative and the log-likelihood function is decreasing
So we have a global maximum at #\pi = \cfrac{1}{\bar{x}}#.
There are no singular points and no other critical points, and the domain of #\pi# has no endpoints.
Therefore the MLE of #\pi# is #\hat{\pi} = \cfrac{1}{\bar{x}}#.
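The pmf above is that of the geometric distribution (the number of trials up to and including the first success), so the result can be checked against simulated geometric data; the success probability #\pi = 0.3# and the sample sizes in this sketch are arbitrary:
```python
import numpy as np

rng = np.random.default_rng(2)
true_pi = 0.3  # success probability of the simulated geometric data

for n in (10, 100, 10_000):
    # numpy's geometric returns the number of trials until the first success,
    # matching the pmf p(x; pi) = pi * (1 - pi)**(x - 1) for x = 1, 2, 3, ...
    x = rng.geometric(p=true_pi, size=n)
    print(n, 1.0 / x.mean())  # the MLE approaches 0.3 as n grows
```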