Variance and Standard Deviation
This lesson will address:
- The rationale for measuring the dispersion of data.
- How it is carried out in practice.
#\text{}#
We have discussed the mean and the median as tools for describing the center of measurements on a quantitative variable, whether measured on an entire population or on a sample from that population. A measure of center provides a sense of what the measurement is on that variable for a “typical” member of the population. If you select an element from the population and need to estimate what the value of the variable would be for that element, you would estimate that value with the mean or median, because without further information you would have to assume that the selected element is typical.
But the selected element could very well be far from typical, so the value of the variable measured on that element could lie quite far below or above the mean or median. We would thus like to have a sense of how far away from the center the value could be. We obtain this sense by means of a measure of dispersion.
Dispersion
Dispersion is the degree to which the values of a variable are spread out around the center of the distribution.
We have already been introduced to one measure of dispersion for a quantitative variable, the inter-quartile range (#IQR#), which reports the range of values covered by the middle #50\%# of the measurements when they are ranked in order. The larger the #IQR#, the more dispersed the measurements are around the median. The #IQR# has the advantage that it is not influenced by outliers, and it is especially useful for skewed data. However, it is rarely used, in part because it does not lend itself well to mathematical operations.
#\text{}#
To measure the dispersion of a variable, we average how far each measurement in the data set lies from the mean of the data set. How we compute this average depends on whether we have measurements for the entire population (and thus use the population mean #\mu#), or only for a sample from that population (and thus use the sample mean #\bar{x}#).
Mean Absolute Deviation
For the #i#th element in the data set, we let #x_i# denote the measurement of the variable on that element.
Then #x_i - \mu# is the deviation of that measurement from the population mean, which could be negative or positive. If we were simply to add up these deviations, the negative and positive values would cancel each other out; in fact, over the entire population #\sum_{i = 1}^{n} (x_i - \mu) = 0#.
We could instead compute the average of the absolute deviations #|x_i - \mu|#, which is called the mean absolute deviation.
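In symbols, the mean absolute deviation of the entire population is
#\text{MAD} = \frac{1}{n} \sum_{i = 1}^{n} |x_i - \mu|#
where #n# is the number of elements in the population.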
However, this measure is not popular, because it is not convenient to perform mathematical computations involving absolute values.
#\text{}#
Instead we use the squared deviation, #(x_i - \mu)^2#.
Population Variance
The average of the squared deviations is called the population variance, denoted #\sigma^2#.
That is,
#\sigma^2 = \frac{1}{n} \sum_{i = 1}^{n} (x_i - \mu)^2#
where #n# is the number of elements in the entire population.
Note that if #x_i# is an outlier, then #(x_i - \mu)^2# will be very large, and thus have a profound effect on the value of the variance. Thus the variance should only be used when the distribution of the measurements is fairly symmetric with no outliers.
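As a small worked example with made-up numbers, consider a population consisting of the four values #2, 4, 6, 8#. The population mean is #\mu = 5#, so
#\sigma^2 = \frac{(2-5)^2 + (4-5)^2 + (6-5)^2 + (8-5)^2}{4} = \frac{9 + 1 + 1 + 9}{4} = 5#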
#\text{}#
If we have measurements on the quantitative variable only for a sample from the population, we use a similar formula for computing the variance of the sample, but we replace #\mu# with #\bar{x}#. However, this creates a bias if we want to use the variance of the measurements in the sample to estimate the variance of the measurements on the entire population: since #\bar{x}# is computed from the sample itself, the deviations #x_i - \bar{x}# tend to be slightly smaller than the deviations #x_i - \mu#, so dividing by #n# would systematically underestimate the population variance. To compensate for this, we divide by #n - 1# rather than #n#. We will see later how this correction is justified mathematically.
Sample Variance
The variance of measurements in a sample consisting of #n# elements is denoted #s^2#, and is computed as,
#s^2 = \frac{1}{n - 1} \sum_{i = 1}^{n}(x_i - \bar{x})^2#
We thus refer to #s^2# as the sample variance, which is not to be confused with the population variance #\sigma^2#.
The larger the sample variance, the more dispersed the measurements are around the sample mean, and the less likely an arbitrary measurement is to be close to the center.
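Continuing the made-up example above, suppose the four values #2, 4, 6, 8# are instead a sample from a larger population. Then #\bar{x} = 5# and
#s^2 = \frac{9 + 1 + 1 + 9}{4 - 1} = \frac{20}{3} \approx 6.67#
which is larger than the population variance computed earlier, reflecting the division by #n - 1# instead of #n#.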
The variance (whether population or sample) is easy to use in mathematical computations, but it is not very practical as a descriptive measure. If the variable is measured in kilograms, then its variance is in squared kilograms; if the variable is measured in euros, then its variance is in squared euros. For scientific applications, we need a measure of dispersion that is in the same measurement units as the variable itself.
Hence we take the square root of the population variance to get population standard deviation #\sigma#, and we take the square root of the sample variance to get the sample standard deviation #s#.
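In symbols:
#\sigma = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (x_i - \mu)^2}#
#s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_i - \bar{x})^2}#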
#\text{}#
Using R
Standard Deviation and Variance
Suppose you have measurements on a quantitative variable stored in a vector named #\mathtt{Distance}# in your #\mathrm{R}# workspace.
In #\mathrm{R}# you can compute the sample variance and sample standard deviation using:
> var(Distance)
> sd(Distance)
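For instance, with made-up values for #\mathtt{Distance}#:
> Distance <- c(12.1, 9.8, 14.3, 11.0, 10.6)
> var(Distance)
> sd(Distance)
Here #\mathtt{sd(Distance)}# is simply the square root of #\mathtt{var(Distance)}#, and both divide by #n - 1#, so they compute the sample variance and sample standard deviation.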
#\text{}#
If you have a data frame in which each column is a numeric vector, you can find the variance of each column in one go.
If the data frame is named #\mathtt{MyData}#, you would use:
> diag(var(MyData))
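This works because #\mathtt{var}# applied to a data frame returns the covariance matrix of its columns, and #\mathtt{diag}# extracts the diagonal of that matrix, which holds the variances. A minimal sketch with a made-up data frame:
> MyData <- data.frame(Distance = c(12.1, 9.8, 14.3), Time = c(1.2, 0.9, 1.5))
> diag(var(MyData))
> sapply(MyData, sd)
The last line shows an alternative approach: #\mathtt{sapply}# applies a function (here #\mathtt{sd}#) to each column separately.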
#\text{}#
#\mathrm{R}# does not have a function for the population variance or population standard deviation, since we are almost always working with samples.
It is easy to convert the sample variance to the population variance: just multiply by #\frac{n - 1}{n}#, where #n# is the length of the vector (which equals the population size if the vector contains measurements for the entire population).
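A minimal sketch, assuming the vector #\mathtt{Distance}# holds the measurements for the entire population:
> n <- length(Distance)
> pop.var <- var(Distance) * (n - 1) / n
> pop.sd <- sqrt(pop.var)
Here #\mathtt{pop.var}# and #\mathtt{pop.sd}# are hypothetical variable names for the population variance #\sigma^2# and population standard deviation #\sigma#.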