The Central Limit Theorem
In this lesson, you will learn about the Central Limit Theorem, one of the most important theorems in statistics.
Suppose you have a variable (discrete or continuous) that you plan to measure on a random sample of size #n# taken from some population. Let #X_1,X_2,...,X_n# denote these future measurements, and let
\[\bar{X} = \frac{1}{n}\displaystyle\sum^n_{i=1}X_i\]
denote the sample mean.
We already know that if the variable has a #N\big(\mu,\sigma^2\big)# distribution on the population, then
\[\bar{X} \sim N\Big(\mu,\frac{\sigma^2}{n}\Big)\]
which allows us to compute probabilities or quantiles involving the sample mean.
But suppose we know that the variable does not have a normal distribution (even if we don't have a name for its actual distribution). If we then want to compute probabilities or quantiles involving the sample mean, it seems to be impossible.
When we talk about the probability distribution of the sample mean, we are talking about the distribution of the values of the sample mean if every possible random sample of size #n# could be selected and measured. For every random sample, the sample mean will fall somewhere in the middle of the #n# measurements.
We can simulate the distribution of the sample mean using #\mathrm{R}#. We can't take every possible random sample of size #n#, because there could be far too many of them, but we can simulate a very large number of such random samples (say, #100# thousand) from some non-normal distribution, and for each one we can compute the sample mean. Then we can make a histogram of all #100# thousand sample means and see what shape we get.
Simulation
Let's use the #B(35,0.1)# distribution, and let's take random samples of size #10#.
The #\mathrm{R}# code consists of #3# lines:
> Sample.means=rep(0,100000)
> for(i in 1:100000)Sample.means[i] = mean(rbinom(10,35,0.1))
> hist(Sample.means)
From one run of this code, we get the following histogram:
As you can see, the shape of the histogram resembles the density curve of a normal distribution. Moreover, the mean of the #B(35,0.1)# distribution is #(35)(0.1) = 3.5#, and the peak of the histogram is right around #3.5#.
But the #B(35,0.1)# distribution is not a normal distribution, so how can we get a normal shape for the distribution of the sample mean?
Let us repeat the #\mathrm{R}# simulation, but now we will take random samples of size #30#. The histogram of the distribution of the sample mean now looks like:
The shape is even more like a normal density curve, still centered at the mean of #3.5#. But now the histogram is mainly between #2.5# and #4.5#, whereas previously the histogram was mainly between #2# and #5#. So it has become narrower.
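Both simulations can be reproduced with a small variation of the code shown earlier. The sketch below is one possible way to do this, not necessarily the code used to produce the histograms: it uses `replicate()` instead of a loop, and the helper name `simulate.means` and the seed are assumptions introduced here for illustration. It also compares the spread of the simulated sample means with the standard deviation of the #B(35,0.1)# distribution divided by #\sqrt{n}#.

```r
# Hypothetical helper: simulate sample means of size-n samples from B(35, 0.1)
simulate.means <- function(n, reps = 100000) {
  replicate(reps, mean(rbinom(n, 35, 0.1)))
}

set.seed(1)  # arbitrary seed, only so that a rerun gives the same numbers
means.10 <- simulate.means(10)
means.30 <- simulate.means(30)

# Both histograms on the same horizontal scale, for easier comparison
par(mfrow = c(1, 2))
hist(means.10, xlim = c(1, 6), main = "n = 10")
hist(means.30, xlim = c(1, 6), main = "n = 30")

# Centre and spread of the simulated sample means
c(mean(means.10), mean(means.30))
c(sd(means.10), sd(means.30))

# Standard deviation of B(35, 0.1) divided by sqrt(n), for comparison
sigma <- sqrt(35 * 0.1 * 0.9)
c(sigma / sqrt(10), sigma / sqrt(30))
```

The last two vectors should be close to each other, which gives a numerical version of the observation that the histogram becomes narrower as the sample size grows.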
These simulations suggest the result known as the Central Limit Theorem.
Central Limit Theorem
The Central Limit Theorem (CLT) states:
Let #X_1,X_2,...,X_n# denote independent measurements of a quantitative variable on a planned random sample of size #n# from a population, and suppose the population distribution of that variable has a mean of #\mu# and a finite variance of #\sigma^2#.
Let #S = \sum^n_{i=1}X_i# and let #\bar{X} = \frac{S}{n}#.
Then as #n \rightarrow \infty#, the distribution of #S# converges to the #N\big(n\mu,n\sigma^2\big)# distribution, and the distribution of #\bar{X}# converges to the #N\big(\mu,\frac{\sigma^2}{n}\big)# distribution.
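An equivalent way to read this statement, often convenient in calculations, is in terms of the standardized sample mean. Subtracting the mean and dividing by the standard deviation gives the same quantity for #\bar{X}# and for #S#:

\[Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{S - n\mu}{\sigma\sqrt{n}}\]

and as #n \rightarrow \infty# the distribution of #Z# converges to the standard normal distribution #N\big(0,1\big)#.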
Minimal Required Sample Size
In practical terms, this means that if you plan to measure a quantitative variable #X# on a random sample of size #n#, and #n# is “large enough”, then you can proceed as if the sample mean #\bar{X}# of your measurements has a normal distribution with a mean of #\mu_{\bar{X}} = \mu# and a standard deviation of #\sigma_{\bar{X}} = \tfrac{\sigma}{\sqrt n}#, where #\mu# and #\sigma# are the mean and standard deviation of #X# on the population. But how large is “large enough”?
The answer depends on how much the distribution of #X# on the population differs from the normal distribution. A sample size of #10# might be large enough, or you might need a sample size of #50#. As a general rule of thumb, a sample size of #30# is typically considered large enough.
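How quickly the normal shape appears can be explored by simulation. The sketch below is an illustration only: the choice of the exponential distribution (which is strongly right-skewed), the sample sizes, and the helper `skewness` are assumptions made here and are not part of the lesson. A skewness near #0# indicates a symmetric shape, like that of the normal distribution.

```r
# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)  # arbitrary seed for reproducibility
reps  <- 20000
sizes <- c(5, 10, 30, 50)

# Skewness of the simulated sample means for each sample size,
# drawing from the exponential distribution with rate 1 (an assumed example)
skews <- sapply(sizes, function(n) {
  skewness(replicate(reps, mean(rexp(n, rate = 1))))
})

data.frame(n = sizes, skewness = round(skews, 3))
```

For this particular distribution the skewness of the sample mean is #2/\sqrt{n}# in theory, so it shrinks only slowly as #n# grows. This is why a sample size of #30# is best treated as a guideline rather than a guarantee.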
As an example of applying the CLT, suppose we are interested in estimating the number of eggs laid by a typical fertile female fruit fly during its lifetime. We know that the possible values for this quantitative variable are the non-negative integers #\{0,1,2,...\}#, so it is not a continuous variable, and thus it cannot have a normal distribution on the population of all fertile female fruit flies.
We will select a random sample of #100# pre-fertile female fruit flies and then count the number of eggs laid by each one of them over the course of their lifetimes. What is the probability that the mean number of eggs for our sample will be larger than #25#?
To use the CLT, we would need to know the mean and standard deviation of this variable over the entire population of fertile female fruit flies. For now, let us assume that we know from a reliable source that #\mu = 24# and #\sigma = 10#.
Then, based on the CLT, we can say that
\[\bar{X} \sim N\big(24, \cfrac{10^2}{100}\big) = N\big(24,1\big)\]
at least approximately, since the sample size is quite large (#n = 100#).
So to approximate #P(\overline{X} > 25)# in #\mathrm{R}# we use:
> 1 - pnorm(25, 24,1)
or
> pnorm(25, 24, 1, lower.tail = FALSE)
to get #0.1587#.
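The quality of this approximation can be checked by simulation once a specific population distribution is assumed. The sketch below assumes, purely for illustration, that the egg counts follow a negative binomial distribution with mean #24# and standard deviation #10#. The lesson does not specify the actual distribution, so this choice is hypothetical.

```r
# ASSUMPTION: egg counts follow a negative binomial distribution.
# In rnbinom's (size, mu) parametrization the variance is mu + mu^2 / size,
# so size = mu^2 / (sigma^2 - mu) gives the required standard deviation.
mu    <- 24
sigma <- 10
size  <- mu^2 / (sigma^2 - mu)

set.seed(1)  # arbitrary seed for reproducibility
reps <- 100000

# Mean number of eggs in each of many simulated samples of 100 flies
sample.means <- replicate(reps, mean(rnbinom(100, size = size, mu = mu)))

# Simulated P(mean > 25), next to the CLT approximation
sim.prob <- mean(sample.means > 25)
clt.prob <- pnorm(25, mu, sigma / sqrt(100), lower.tail = FALSE)
c(simulated = sim.prob, CLT = clt.prob)
```

If the two numbers are close, the normal approximation is working well for this assumed population at #n = 100#.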
Suppose #67\%# of adults in a city have a driving licence. We will randomly select #n# adults from the city and count how many of them have a driving licence.
If we labelled the participants in this study with the numbers #1,2,...,n# and let
\[X_i = \begin{cases}
1 &\text{if participant } i \text{ has a driving licence}\\\\
0 &\text{if participant } i \text{ does not have a driving licence}
\end{cases}\]
then #S = \sum^n_{i = 1}X_i# is the total number of participants that have a driving licence.
In this situation, each #X_i# is a Bernoulli random variable with #P(X_i = 1) = 0.67# (because #67\%# of the adults in the city have a driving licence):
\[X_i \sim B(1,0.67)\]
So the mean of each #X_i# is #0.67# and the variance of each #X_i# is #(0.67)(1 - 0.67) = 0.2211#.
Suppose the sample size is large (e.g., #n = 70#). Then the sample mean #\bar{X} = \frac{S}{70}# approximately has the following distribution:
\[\bar{X} \sim N\big(0.67,\frac{0.2211}{70}\big) \approx N\big(0.67,0.0032\big)\]
Let #\hat{p}# (“p-hat”) denote the proportion of adults in the sample that have a driving licence. Thus #\hat{p} = \frac{S}{n}#. This is exactly the formula for the sample mean #\bar{X}#, so the result of the CLT also applies to #\hat{p}#.
Therefore we can conclude in general that if the sample size #n# is sufficiently large, the distribution of the sample proportion is approximately:
\[\hat{p} \sim N\Bigg(p,\frac{p(1-p)}{n}\Bigg)\]
This result will be useful when we are collecting sample data for a Bernoulli variable with “success” coded as a #1# and “failure” coded as a #0#.
Suppose we want to compute the probability that the percentage of people who have a driving licence in a sample of size #70# is less than #60\%#. Then we have #p = 0.67# and #n = 70#, and #P(\hat{p} < 0.6)# can be approximated in #\mathrm{R}# using:
> pnorm(0.6,0.67,sqrt(0.67*(1-0.67)/70))
to be about #0.1065#.
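Because #S = n\hat{p}# has exactly a #B(70,0.67)# distribution in this example, the normal approximation can be compared with the exact binomial probability. The sketch below is offered as a check rather than as part of the lesson's method. Note that #\hat{p} < 0.6# means #S < 42#, that is #S \leq 41#, whereas #\hat{p} \leq 0.6# means #S \leq 42#.

```r
n <- 70
p <- 0.67

# Normal approximation from the CLT
approx.prob <- pnorm(0.6, p, sqrt(p * (1 - p) / n))

# Exact binomial probabilities for the count S = n * p-hat
exact.less    <- pbinom(41, n, p)  # P(p-hat <  0.6) = P(S <= 41)
exact.at.most <- pbinom(42, n, p)  # P(p-hat <= 0.6) = P(S <= 42)

c(approx = approx.prob, exact.less = exact.less, exact.at.most = exact.at.most)
```

The normal approximation gives the same value for the strict and the non-strict inequality, while the exact binomial probabilities differ because #S# is discrete.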