The Mean


In this lesson you will learn:

What a measure of center is.
How to read the mathematical notation for the mean.
The interpretation of the mean in scientific applications.
How to use the formula to compute the mean, or to compute the sum of data values, or to determine the number of data values.
When the mean is not appropriate for describing the center of data.

The primary purpose of descriptive statistics is to summarize sample data in such a way that it becomes easier for the researcher to identify patterns in the data. One particularly useful way in which a distribution of scores can be summarized is to identify where the center of the distribution is located. To determine the location of the center of a distribution, a measure of center can be calculated.

Measure of Center

A measure of center is a single value that attempts to describe the center of a distribution.

To describe the center of a quantitative variable, whether measured on a population or on a sample, we most often use the average, also known as the mean.

The Mean

The mean represents the value of the variable for a typical member of the population or the sample.

The mean of a variable #x# measured on a sample is represented by #\bar{x}# (or if a different letter is used in place of #x#, it is that letter with a bar over it). This is called the sample mean.

The mean of a variable #x# measured on the entire population is represented with the Greek letter: #\mu# or #\mu_x#. This is called the population mean.

Formula for the Mean

To calculate the mean of a variable #x# when measured on a sample, use the following formula:

\[\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i\]

The formula for the mean of a variable measured on an entire population is the same, but with the #\bar{x}# replaced by #\mu#.

This formula uses the following symbols:

The value of the quantitative variable #x# for the #i#th element of the population or sample is #x_i# (or if a different letter is used in place of #x#, it is that letter with a subscript #i#).
The sum of all the values of #x# for the entire sample or population is denoted #\sum_{i = 1}^{n}x_i#.
The number of elements in the sample or population is #n#.
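
Because the formula links three quantities (the mean, the sum, and the number of values), knowing any two of them determines the third. As a brief illustration, multiplying both sides of the formula by #n# gives the sum, and dividing the sum by the mean gives the number of values (provided #\bar{x} \neq 0#):

\[\sum_{i = 1}^{n} x_i = n \cdot \bar{x} \qquad\qquad n = \frac{\sum_{i = 1}^{n} x_i}{\bar{x}}\]

For instance, with numbers chosen only for illustration: if a sample of #n = 5# values has mean #\bar{x} = 12#, then the values sum to #5 \cdot 12 = 60#. Conversely, if the values of a sample sum to #60# and the mean is #12#, the sample contains #60 / 12 = 5# values.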

Consider the following sample of scores:

\[9,\,\,\,4,\,\,\,3,\,\,\,15,\,\,\,2,\,\,\,8,\,\,\,12,\,\,\,11\]

Calculate the mean of this sample.

#\bar{x}=8#

To calculate the sample mean #\bar{x}#, use the following formula:\[\begin{array}{rcl}\bar{x} &=& \cfrac{1}{n} \sum\limits_{i = 1}^n{x_i}\\&&\color{blue}{\text{Formula for the sample mean}}\\&=& \cfrac{1}{n} (x_1 + x_2 + \ldots + x_n )\\&&\color{blue}{\text{Expanded the summation sign}}\\&=& \cfrac{1}{8}(9+4+3+15+2+8+12+11)\\&&\color{blue}{\text{Entered the values for }x_1 \text{ through } x_{8} \text{ and }n \text{ into the equation}}\\&=& \dfrac{64}{8}\\&&\color{blue}{\text{Added the scores}}\\&=& 8\\\end{array}\]
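
The same calculation can be checked in #\mathrm{R}#. This is a small sketch; the names #\mathtt{scores}#, #\mathtt{total}# and #\mathtt{x\_bar}# are chosen here for illustration.

```r
# The sample of scores from the example above
scores <- c(9, 4, 3, 15, 2, 8, 12, 11)

n <- length(scores)        # number of values: 8
total <- sum(scores)       # sum of the values: 64
x_bar <- total / n         # sample mean via the formula: 8

# The built-in function gives the same result
mean(scores)
```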

The value of the mean is strongly influenced by extreme values, which we call outliers. These are values either well above or well below the main part of the data.

For this reason, the mean is not advisable when the measurements of the variable do not have a roughly symmetric distribution, that is, when the values are not balanced above and below the center. In such a distribution, the mean is pulled toward the outliers and no longer describes a typical value.
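
To see how strongly a single extreme value can pull the mean, the sketch below changes one score in a small sample. The replacement value #150# is made up for illustration.

```r
# A small sample without extreme values
scores <- c(9, 4, 3, 15, 2, 8, 12, 11)
mean(scores)                 # 8

# Replace the largest score (15) with an outlier (150)
with_outlier <- scores
with_outlier[with_outlier == 15] <- 150
mean(with_outlier)           # 24.875
```

Seven of the eight values are unchanged, yet the mean roughly triples and now lies above every value except the outlier.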

One way of reducing or eliminating the influence of outliers is to compute the #p\%# trimmed mean instead.

Trimmed Mean

A #p\%# trimmed mean is a method of averaging that removes the #p\%# largest and the #p\%# smallest values before calculating the mean.

Because outliers are likely to be among the discarded values, the trimmed mean is less sensitive to them than the ordinary mean.

Computing the Trimmed Mean

To compute a #p\%# trimmed mean:

Order the measurements from smallest to largest.
Discard both the bottom #p\%# of the measurements and the top #p\%# of the measurements.
Compute the mean for the remaining measurements.
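
The three steps can be sketched in #\mathrm{R}# as follows. The data are made up for illustration. The sketch also assumes that when #p\%# of #n# is not a whole number, the number of values discarded from each end is rounded down. The steps above do not specify this, but it matches how the #\mathtt{trim}# argument of #\mathtt{mean()}# behaves.

```r
# Hypothetical sample of 10 measurements with one extreme value
x <- c(9, 4, 3, 15, 2, 8, 12, 11, 7, 40)
p <- 0.10                          # trim 10% from each end

# Step 1: order the measurements from smallest to largest
x_sorted <- sort(x)

# Step 2: discard the bottom and top p% (assumption: round down)
k <- floor(length(x_sorted) * p)   # number discarded per end: 1
kept <- x_sorted[(k + 1):(length(x_sorted) - k)]

# Step 3: compute the mean of the remaining measurements
trimmed <- mean(kept)              # 8.625

mean(x)                            # ordinary mean for comparison: 11.1
```

Here the smallest value (#2#) and the largest value (#40#) are discarded, so the extreme value no longer affects the result.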

Using R

Mean

If you have a vector in #\mathrm{R}# containing numeric data, computing the mean is simple.

For example, suppose the name of the vector is #\mathtt{Age}# and you want the mean.

> mean(Age)

This is the same as taking the sum of the data and dividing by the sample size:

> sum(Age)/length(Age)
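
As a quick check of this equivalence, here is a sketch with a made-up #\mathtt{Age}# vector (the values are for illustration only):

```r
# Hypothetical ages of five participants
Age <- c(23, 35, 41, 29, 52)

mean(Age)                  # 36
sum(Age) / length(Age)     # 36, the same value
```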

Means of all columns

If you have a data frame and all of its columns are numeric vectors, you can use the function #\mathtt{colMeans()}# to obtain the mean of each column in one go.

> colMeans(MyData)
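
A minimal sketch, using a small made-up data frame named #\mathtt{MyData}# with two numeric columns (the column names and values are assumptions for illustration):

```r
# Hypothetical data frame with two numeric columns
MyData <- data.frame(
  Age    = c(23, 35, 41, 29, 52),
  Height = c(160, 172, 181, 168, 175)
)

# One mean per column, returned as a named numeric vector
col_means <- colMeans(MyData)
col_means

# Each entry agrees with mean() applied to that column
all.equal(unname(col_means["Age"]), mean(MyData$Age))
```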

Trimmed mean

If you want the trimmed mean for the data, with #p\%# trimmed off each end of the ordered data, pass that proportion as a fraction between #0# and #0.5# to the #\mathtt{trim}# argument of #\mathtt{mean()}#.

For example, to calculate the #10\%# trimmed mean of the #\mathtt{Age}# vector:

> mean(Age, trim=0.10)