[A, SfS] Chapter 5: Confidence Intervals: 5.2: CI for a mean
Confidence Interval for the Population Mean of a Quantitative Variable
In this lesson, you will learn how to compute a confidence interval for the population mean #\mu# of a variable.
In section 3.2 we defined #z_p# as the quantile of the standard normal distribution which has probability #p# to its left, such that #\mathbb{P}(Z \leq z_p) = p#. This is the notation used in probability. But in statistics, it will be more convenient to use this notation for the quantile which has probability #p# to its right, such that #\mathbb{P}(Z \geq z_p) = p#.
So from here on in this course, #z_{\alpha/2}# denotes the quantile of the standard normal distribution which has probability #\alpha/2# to its right, in the right tail of the standard normal distribution. Then, because the standard normal distribution is symmetric about zero, #-z_{\alpha/2}# denotes the quantile of the standard normal distribution which has probability #\alpha/2# to its left, in the left tail of the standard normal distribution. So if #\alpha = 0.05# then #z_{0.05/2} = 1.96# and #-z_{0.05/2} = -1.96#. (In R, this distinction is made by adding the setting #\mathrm{lower.tail=FALSE}# or #\mathrm{low=F}# in the #\mathrm{qnorm()}# command.)
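The right-tail convention can be checked directly in R. The short sketch below is only an illustration; the variable names `z_right`, `z_left` and `z_alt` are chosen here and are not part of the course material.

```r
alpha <- 0.05

# Quantile with probability alpha/2 to its RIGHT (the convention used from here on)
z_right <- qnorm(alpha / 2, lower.tail = FALSE)

# Quantile with probability alpha/2 to its LEFT (the probability convention)
z_left <- qnorm(alpha / 2)

# Equivalent way to obtain the right-tail quantile
z_alt <- qnorm(1 - alpha / 2)

# By symmetry about zero, z_left equals -z_right
print(round(c(z_left, z_right), 2))
```

Both ways of calling `qnorm()` give the same right-tail quantile, so you can use whichever you find easier to read.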
Suppose #X# is a quantitative variable which, if #X# could be measured on every element of some population of interest, would have a mean of #\mu#. Because #\mu# represents the value of #X# for a “typical” element of the population, we would like to have a precise estimate of its value, based on a preferred level of confidence.
So we plan to select a random sample of size #n#, and measure #X# on each element of the sample, to obtain #X_1,..., X_n#. From these measurements we will compute the sample mean #\bar{X}# and the sample standard deviation #s#.
There are several options for the method of computing a #(1 - \alpha)100\%# confidence interval for #\mu#, based on the answers to three questions. These questions are:
- Can we assume that #X# has a normal distribution on the population?
- Do we know the value of the population variance #\sigma^2# of #X#?
- Can we consider the sample size #n# to be large?
We first consider the situation in which the answer to questions 1 and 2 is “Yes”. If so, the answer to question 3 is irrelevant.
We assume #X \sim N(\mu,\sigma^2)# and the value of #\sigma^2# is known. Then \[Z = \cfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)\] and the middle #(1 - \alpha)100\%# of the distribution of #Z# falls between the quantiles #-z_{\alpha /2}# and #z_{\alpha /2}#.
This means that: \[\mathbb{P}\bigg(-z_{\alpha /2} \leq \cfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \leq z_{\alpha /2}\bigg) = 1 - \alpha\]
If we rearrange this inequality with some algebraic manipulation, we have:
\[\mathbb{P}\bigg(\bar{X} - z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}}\bigg) = 1 - \alpha\]
Hence the random interval #(L,U)# contains #\mu# with probability #1 - \alpha#, where \[L = \bar{X} - z_{\alpha /2} \cfrac{\sigma}{\sqrt{n}} \,\,\,\,\,\,\,\,\,\,\,\,\,\, \text{and} \,\,\,\,\,\,\,\,\,\,\,\,\,\, U = \bar{X} + z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}}\]
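The probability statement above can be made tangible with a small simulation. The sketch below assumes illustrative values (#\mu = 50#, #\sigma = 2#, #n = 100#, and #10000# repetitions) that are chosen here only for demonstration. It draws many samples, computes #L# and #U# for each, and records how often the interval contains #\mu#.

```r
set.seed(1)  # fixed seed so the simulation is reproducible

# Illustrative (assumed) population parameters and design
mu <- 50
sigma <- 2
n <- 100
alpha <- 0.05
reps <- 10000

z <- qnorm(alpha / 2, lower.tail = FALSE)

# Sample mean of each simulated sample of size n
xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Lower and upper bounds for every simulated sample
L <- xbar - z * sigma / sqrt(n)
U <- xbar + z * sigma / sqrt(n)

# Proportion of intervals that contain the true mean
coverage <- mean(L <= mu & mu <= U)
print(coverage)
```

The proportion should be close to #1 - \alpha = 0.95#, though not exactly equal to it, because the simulation is itself random.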
Confidence Interval for a Population Mean
Suppose #X \sim N(\mu,\sigma^2)# and the value of the population variance #\sigma^2# is known.
Then the formula for computing a #(1 - \alpha)100\%# confidence interval for the population mean #\mu#, based on a random sample of size #n# for which the sample mean is #\bar{x}#, is:
\[(l,u) = \bigg(\bar{x} - z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}},\,\,\,\,\,\bar{x} + z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}}\bigg)\]
In this setting, #z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}}# is the margin of error of the #(1 - \alpha)100\%# CI, and the CI can be reported as: \[\bar{x} \pm z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}}\]
Controlling the Margin of Error
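Because #n# appears in the denominator of the margin of error, increasing #n# decreases the margin of error. The margin shrinks in proportion to #1/\sqrt{n}#, so halving it requires four times the sample size.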
If you are planning a study and you prefer a CI for #\mu# to have a margin of error no larger than some #K > 0 # while maintaining the same confidence level #1 - \alpha#, you must determine the minimum sample size #n# necessary to achieve this goal.
That is, if you want #z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}} \leq K# then you need: \[n \geq \bigg(\cfrac{z_{\alpha /2}\sigma}{K}\bigg)^2\] Since #n# must be an integer, you round this value up to the next integer.
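The sample-size rule can be wrapped in a small helper. This is a minimal sketch: the function name `min_sample_size` and its arguments are hypothetical, introduced here only for illustration, and the values #K = 0.5# and #\sigma = 2# in the call are assumed.

```r
# Hypothetical helper: smallest n for which the margin of error is at most K,
# assuming the population standard deviation sigma is known
min_sample_size <- function(K, sigma, alpha = 0.05) {
  z <- qnorm(alpha / 2, lower.tail = FALSE)
  ceiling((z * sigma / K)^2)  # round up to the next integer
}

# Illustrative call: margin of error at most 0.5 with sigma = 2, 95% confidence
n_min <- min_sample_size(K = 0.5, sigma = 2)
print(n_min)
```

Using `ceiling()` rather than `round()` matters here: rounding down would give a margin of error slightly larger than #K#.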
Consider the population of all steel wires of a certain type. Let #X# denote the breaking strength (in kN) of a wire from this population.
Suppose that we know that #X \sim N(\mu,4)#, but the value of the mean breaking strength #\mu# of this type of steel wire is unknown. We want to estimate #\mu# using a #95\%# CI. Suppose that for a particular sample with #n=100# we find #\bar{x} = 50# kN.
For a #95\%# CI, #\alpha = 0.05#, so #\alpha /2 = 0.025#. Then the middle #95\%# of the #N(0,1)# density falls between #-z_{0.025}# and #z_{0.025}#.
In #\mathrm{R}# we can find #z_{0.025}# using the command:
> qnorm(0.025, low=F)
or
> qnorm(0.975)
We find #z_{0.025} \approx 1.96#.
Then a #95\%# CI for #\mu# is:
\[\begin{array}{rcl}
(l,u)&=&\bigg(\bar{x} - z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}},\,\,\,\,\,\bar{x} + z_{\alpha /2}\cfrac{\sigma}{\sqrt{n}}\bigg)\\\\
&=&\Big(50 - (1.96)(2)/\sqrt{100},\,\,\,\,\,50 + (1.96)(2)/\sqrt{100}\Big)\\\\
&=&(49.608,\,\,\,\,\,50.392)
\end{array}\]
So the mean breaking strength #\mu# of this type of steel wire is estimated to be a value between #49.608# kN and #50.392# kN, with #95\%# confidence.
In this example, the margin of error of the #95\%# CI is: \[(1.96)(2)/\sqrt{100} = 0.392\]
To achieve a margin of error no larger than #0.3# kN at the same confidence level, the sample size must satisfy: \[n \geq \bigg(\cfrac{(1.96)(2)}{0.3}\bigg)^2 \approx 170.738\] i.e., the sample must contain at least #171# steel wires.
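The interval in this example can also be computed in R. The sketch below uses only the numbers given above (#\bar{x} = 50#, #\sigma = 2#, #n = 100#); the variable names are chosen here for illustration.

```r
# Values from the steel wire example
xbar <- 50     # sample mean (kN)
sigma <- 2     # known population standard deviation, since sigma^2 = 4
n <- 100       # sample size
alpha <- 0.05  # for a 95% CI

z <- qnorm(alpha / 2, lower.tail = FALSE)

# Margin of error and the two bounds of the CI
margin <- z * sigma / sqrt(n)
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(round(margin, 3))
print(round(ci, 3))
```

R works with the unrounded quantile rather than #1.96#, but to three decimals the margin of error and the bounds agree with the hand calculation.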
Going back to the three questions, suppose the answer to the first question is either “no” or “I’m not sure”, but the answers to the second and third questions are both “yes”. That is, we don’t know if #X# has a normal distribution on the population, but the sample size #n# is large and the value of #\sigma^2# is known.
The Central Limit Theorem tells us that, for large #n#, #\bar{X}# has an approximate #N\Big(\mu,\sigma^2/n\Big)# distribution, so we can assume \[\cfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)\] approximately. Then the formula for computing the upper and lower bounds of the #(1 - \alpha)100\%# CI for #\mu# is exactly the same as in the previous setting.
Now suppose the answer to the first question is still either “no” or “I’m not sure”, the answer to the third question is still “yes”, but the answer to the second question is “no”. That is, we don’t know the value of #\sigma^2#, which is the far more realistic situation.
If the sample size is large, then the value of the sample variance #s^2# should be close to the actual value of #\sigma^2#. Combined with the Central Limit Theorem, we can say that \[\cfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim N(0,1)\] approximately.
Confidence Interval for a Population Mean, Unknown Population Variance, Large Sample
Suppose #X# has a non-normal or otherwise unknown distribution, the value of the population variance #\sigma^2# is unknown, but the sample size is large enough for the Central Limit Theorem to apply.
Then the formula for computing the upper and lower bounds of the #(1 - \alpha)100\%# CI for #\mu# is:
\[(l,u)=\bigg(\bar{x} - z_{\alpha /2}\cfrac{s}{\sqrt{n}},\,\,\,\,\,\,\bar{x} + z_{\alpha /2}\cfrac{s}{\sqrt{n}}\bigg)\]
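When the raw measurements are available in R, this interval can be computed from the data vector. The sketch below uses simulated, deliberately skewed data (an exponential sample of size #200#), which is an assumption made here purely for illustration and is not data from the course.

```r
set.seed(42)  # fixed seed so the simulated data are reproducible

# Simulated, non-normal measurements (illustrative only)
x <- rexp(200, rate = 1 / 5)

n <- length(x)
alpha <- 0.05
z <- qnorm(alpha / 2, lower.tail = FALSE)

# The sample standard deviation s replaces the unknown sigma
margin <- z * sd(x) / sqrt(n)
ci <- c(lower = mean(x) - margin, upper = mean(x) + margin)

print(ci)
```

Note that `sd()` computes the sample standard deviation with #n - 1# in the denominator, which is the #s# used in the formula.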
Finally, suppose the answer to the first question is “yes” and the answers to the second and third questions are “no”. So #X \sim N(\mu,\sigma^2)# but we don’t know the value of #\sigma^2# and the sample size #n# is not large.
In this case, we have to work with #\cfrac{\bar{X} - \mu}{s/\sqrt{n}}# again, but this random variable does not have an approximate #N(0,1)# distribution.
However, it can be shown in this setting that #\cfrac{\bar{X} - \mu}{s/\sqrt{n}}# has the Student’s t-distribution with #n - 1# degrees of freedom. So we would write:
\[\cfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}\]
Now the structure of the CI is essentially the same, but in place of #z_{\alpha /2}# we use #t_{n-1,\alpha /2}#, the quantile of the #t_{n-1}# distribution for which #1 - \cfrac{\alpha}{2}# of the area below the density curve falls to its left (and thus #\cfrac{\alpha}{2}# of the area falls to its right).
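To get a feeling for how #t_{n-1,\alpha /2}# relates to #z_{\alpha /2}#, you can compare the two in R. The sketch below uses an arbitrary selection of degrees of freedom, chosen here only for illustration.

```r
alpha <- 0.05

# An arbitrary selection of degrees of freedom
df <- c(2, 5, 10, 30, 100, 1000)

# Right-tail quantiles of the t-distribution and of the standard normal
t_quant <- qt(alpha / 2, df, lower.tail = FALSE)
z_quant <- qnorm(alpha / 2, lower.tail = FALSE)

print(round(cbind(df, t_quant), 3))
print(round(z_quant, 3))
```

The t-quantile is always larger than #z_{\alpha /2}#, so the resulting CI is wider, and the difference shrinks as the degrees of freedom increase.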
Confidence Interval for a Population Mean, Unknown Population Variance, Small Sample
Suppose #X \sim N(\mu,\sigma^2)#, but the value of the population variance #\sigma^2# is unknown, and the sample size is not large enough for the Central Limit Theorem to apply.
Then the formula for computing the upper and lower bounds of the #(1 - \alpha)100\%# CI for #\mu# is:
\[(l,u)=\bigg(\bar{x} - t_{n-1,\alpha/2}\cfrac{s}{\sqrt{n}},\,\,\,\,\,\,\bar{x} + t_{n-1,\alpha/2}\cfrac{s}{\sqrt{n}}\bigg)\]
Calculating Quantiles of the Student's t-Distribution in R
To calculate the quantile #t_{n-1,\alpha /2}# of the Student's t-distribution with #n-1# degrees of freedom in R, use:
> qt(1-α/2, n-1)
or
> qt(α/2, n-1, low=F)
Consider again the population of all steel wires of a certain type, where #X# denotes the breaking strength (in kN) of a wire from this population. We assume #X \sim N(\mu,\sigma^2)# with #\sigma^2# unknown.
Let’s estimate the mean breaking strength #\mu# of this type of steel wire using a #99\%# CI. Suppose that for a particular sample with #n = 10# we find #\bar{x} = 50# kN and #s = 3# kN.
For a #99\%# CI, #\alpha = 0.01#, so #\alpha /2 = 0.005#. Then the middle #99\%# of the #t_9# distribution falls between #-t_{9,0.005}# and #t_{9,0.005}#.
In R we can find #t_{9,0.005}# using the command:
> qt(0.005,9,low=F)
or
> qt(0.995,9)
We find #t_{9,0.005} \approx 3.25#.
Then a #99\%# CI for #\mu# is:
\[\begin{array}{rcl}
(l,u) &=& \bigg(\bar{x} - t_{n-1,\alpha/2}\cfrac{s}{\sqrt{n}},\,\,\,\,\,\,\bar{x} + t_{n-1,\alpha/2}\cfrac{s}{\sqrt{n}}\bigg)\\\\
&=& \Big(50 - (3.25)(3)/\sqrt{10},\,\,\,\,\, 50 + (3.25)(3)/\sqrt{10}\Big)\\\\
&=& (46.92,\,\,\,\,\,53.08)
\end{array}\] So the mean breaking strength #\mu# of this type of steel wire is estimated to be a value between #46.92# kN and #53.08# kN, with #99\%# confidence.
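The same interval can be computed in R from the summary statistics of this example (#\bar{x} = 50#, #s = 3#, #n = 10#). The sketch below uses variable names chosen here for illustration.

```r
# Values from the steel wire example
xbar <- 50     # sample mean (kN)
s <- 3         # sample standard deviation (kN)
n <- 10        # sample size
alpha <- 0.01  # for a 99% CI

# Right-tail quantile of the t-distribution with n - 1 degrees of freedom
t_crit <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)

# Margin of error and the two bounds of the CI
margin <- t_crit * s / sqrt(n)
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(round(t_crit, 2))
print(round(ci, 2))
```

If the raw measurements were available in a vector `x`, the call `t.test(x, conf.level = 0.99)$conf.int` would compute this type of interval directly.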
We didn’t address every possible situation from our three questions (such as if the answers to all three questions are “no”), but there are methods available for those situations as well. Those are not covered in this course.