Simple Linear Regression
In this section, we will discuss how we can model the linear relationship between a continuous dependent variable and a quantitative independent variable using a linear function.
In many scientific studies, we have a continuous variable $Y$ which we hypothesize to have an association with some other quantitative variable $X$. This hypothesis could be based on theory, such as a physical law, or based on observations, in which case we develop an empirical model, i.e., a model which seems to be useful but which does not necessarily reflect some underlying objective truth. In this kind of model, we are thinking in terms of explaining the variation in $Y$ using the variable $X$, which could allow us to make predictions for the value of $Y$ when the value of $X$ is given.
If we believe that the association between two quantitative variables is linear, then we employ a linear model to represent that association, i.e., an equation of the form $$Y = \beta_0 + \beta_1 X,$$ where $\beta_0$ is the intercept and $\beta_1$ is the slope.
However, when looking at data, it is evident that observed values of $Y$ for different values of $X$ do not precisely conform to such a linear function, no matter what values are chosen for $\beta_0$ and $\beta_1$. The values deviate from the model, either in the positive or the negative direction, with some of these deviations much larger than others.
To incorporate this behavior into our linear model, we must include an error term $\varepsilon$, which we treat as a random variable with mean $0$ and some variance $\sigma^2$ that accounts for the variability of the observations around the linear function. Thus the linear model becomes: $$Y = \beta_0 + \beta_1 X + \varepsilon.$$
The IV $X$ is not a random variable, but since $\varepsilon$ is a random variable, the DV $Y$ is also random, with its mean dependent on $X$, i.e., $$E(Y \mid X) = \beta_0 + \beta_1 X.$$
The values of the coefficients $\beta_0$ and $\beta_1$ are based on the proposed linear association at the population level and are usually not known (unless derived from theory, as in a physics equation). Consequently, we must estimate their values using a random sample from the population. We will denote the estimates by $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively.
Linear Regression
The process of using data to obtain the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ is called linear regression.
When there is only one independent variable, the process is called simple linear regression.
There are different procedures for obtaining the estimates, including the method of maximum likelihood. But we will use a method called ordinary least-squares (OLS) regression, in which case $\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares coefficients, and the resulting linear equation is $$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$
This is understood to be the equation of the best-fit line which fits the points in the scatterplot better than any other line. Of course, "best fit" implies that there must be some criterion by which we decide what is indeed the best fit. That criterion is the least-squares criterion.
Least-squares Criterion
Given measurements of the variables $X$ and $Y$ on a random sample of size $n$ from the population, we obtain observed pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
Suppose for a moment that we already have the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. Then given any arbitrary observed pair $(x_i, y_i)$, we can use $x_i$ in the equation of the regression line to obtain the predicted value (also called the fitted value) $\hat{y}_i$, i.e., $$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.$$
This is the value that the estimated linear model predicts for the DV when $X = x_i$. The predicted value $\hat{y}_i$ is almost always different from the observed value $y_i$, so let $$e_i = y_i - \hat{y}_i$$ denote the $i$-th residual.
The best-fit model would make the residuals $e_i$ as small as possible. But because there are $n$ pairs $(x_i, y_i)$, and they do not all lie precisely on the same line, it will not be possible to make all the residuals equal to zero. Instead, we want to make them collectively as small as possible.
In the method of least-squares, we minimize the sum of the squared residuals, which we denote as $$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2.$$
Our goal is to choose $\hat{\beta}_0$ and $\hat{\beta}_1$ such that we minimize $SSE$.
To accomplish this, we find the derivative of $SSE$ with respect to each coefficient separately, set each derivative equal to $0$, and solve the two resulting equations simultaneously to obtain these values. We also check the second derivatives to be sure we have found a minimum.
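As a brief sketch of this step, setting the two partial derivatives of $SSE$ equal to zero gives the so-called normal equations:
$$\frac{\partial\, SSE}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0, \qquad \frac{\partial\, SSE}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0.$$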
In the end, we find the least-squares coefficients to be given by the formulas: $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$ where $\bar{x}$ and $\bar{y}$ are the sample means of the two variables.
Note that one can also compute the slope estimate using the Pearson sample correlation coefficient $r$, i.e., the slope of the least-squares regression line is: $$\hat{\beta}_1 = r\,\frac{s_y}{s_x},$$ where $s_x$ and $s_y$ are the sample standard deviations of $X$ and $Y$.
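As an illustration, the following small R sketch, using made-up data, computes the least-squares coefficients directly from these formulas and compares them with the coefficients returned by R's lm() function, which is introduced later in this section.
# Small made-up data set, for illustration only
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

# Least-squares coefficients from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Equivalent computation of the slope via the correlation coefficient
b1_alt <- cor(x, y) * sd(y) / sd(x)

c(b0 = b0, b1 = b1, b1_alt = b1_alt)
coef(lm(y ~ x))   # should agree with b0 and b1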
In the following plot, the data were simulated from a known linear model, and the least-squares coefficients were then computed from the simulated data. Although the intercept estimate is quite inaccurate, the slope estimate is very close to the truth.
Both the true line (solid) and the regression line (dotted) are plotted together with the data. As can be seen, the regression line is not far from the true line, and fits the scatterplot quite well.
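The exact model and coefficient values used for this figure are not reproduced here, but a simulation of this kind can be sketched in R as follows; the intercept, slope, error standard deviation, sample size, and range of the IV below are hypothetical choices for illustration only.
set.seed(1)                              # for reproducibility
n     <- 50                              # hypothetical sample size
beta0 <- 10; beta1 <- 2; sigma <- 5      # hypothetical true parameters
x     <- runif(n, min = 20, max = 60)    # hypothetical range of the IV
y     <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)

fit <- lm(y ~ x)
coef(fit)                                # compare the estimates with beta0 and beta1

plot(x, y, pch = 20)
abline(beta0, beta1, lty = 1)            # true line (solid)
abline(fit, lty = 3)                     # estimated regression line (dotted)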
The line $$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$ is the best-fit line to model the association implicit within the given data, and is an unbiased estimate of the hypothetical population line $Y = \beta_0 + \beta_1 X$.
However, it only describes that association over the range of observed values of the IV (in the above example, the range of $X$ values covered by the simulated data). It is unwise to extrapolate this model to values of the IV outside of that range. In the above example, if we extrapolate the model to values of $X$ close to $0$, we already know that the model will be very inaccurate, since the true intercept and the estimated intercept are very different.
Although the model is the best linear fit for the data, this does not mean that the fit is particularly good. It just means that no other line fits the data better, based on the least-squares criterion.
Determining the Quality of the Model's Fit
Let $SST$ denote the total sum of squares for the DV, defined as: $$SST = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2.$$
Meanwhile, the sum of the squared residuals $SSE$ defined above is also known as the error sum of squares, and represents the variability in our sample of the DV around the regression line.
Let $SSR$ denote the regression sum of squares, defined as: $$SSR = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2.$$ It can be shown that $SST = SSR + SSE$.
To measure the goodness of the fit of the model to the data, we define the coefficient of determination as: $$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.$$
Consequently, $R^2$ is always between $0$ and $1$. The closer $R^2$ is to $1$, the better the quality of the fit of the linear model to the data, and the more useful the model is for predicting the value of the DV for given values of the IV.
Note that there is no established threshold for distinguishing a "good fit" from a "bad fit". But if we compare two or more simple linear models for the DV, we can use this criterion to conclude which model has the best fit among them.
For the model derived from the previous figure, the value of $R^2$ is high, so we can say that the linear model explains a large proportion of the variability in the DV, which is quite good.
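As an illustration (using R's built-in cars data set rather than the simulated data above), the following sketch computes $SST$, $SSE$, $SSR$, and $R^2$ directly from their definitions and checks the result against the value reported by R.
# Fit a simple linear regression on R's built-in cars data (dist vs. speed)
fit  <- lm(dist ~ speed, data = cars)
y    <- cars$dist
yhat <- fitted(fit)

SST <- sum((y - mean(y))^2)      # total sum of squares
SSE <- sum((y - yhat)^2)         # error sum of squares
SSR <- sum((yhat - mean(y))^2)   # regression sum of squares

c(SST = SST, SSR_plus_SSE = SSR + SSE)        # SST should equal SSR + SSE
c(R2 = SSR / SST, R2_reported = summary(fit)$r.squared)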
Inference about the Parameters of a Simple Linear Model
We have assumed that the error $\varepsilon$ in the model $Y = \beta_0 + \beta_1 X + \varepsilon$ has mean $0$ and variance $\sigma^2$. We now assume further that $\varepsilon$ is normally distributed, i.e., $\varepsilon \sim N(0, \sigma^2)$.
Consequently, for any specified value of $X$, the following statements are true:
- An unbiased estimate of the error variance $\sigma^2$ is the mean-square error: $$MSE = \frac{SSE}{n-2}.$$ Since $R^2 = 1 - \frac{SSE}{SST}$, we have $SSE = (1 - R^2)\,SST$, so an alternative formula is: $$MSE = \frac{(1 - R^2)\,SST}{n-2}.$$
- The variance of $\hat{\beta}_1$, denoted $\sigma^2_{\hat{\beta}_1}$, is estimated by $$s^2_{\hat{\beta}_1} = \frac{MSE}{S_{xx}},$$ where $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$.
Its square root is the standard error of the slope estimate.
- The variance of $\hat{\beta}_0$, denoted $\sigma^2_{\hat{\beta}_0}$, is estimated by $$s^2_{\hat{\beta}_0} = MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right).$$ Its square root is the standard error of the intercept estimate.
- The statistics $$\frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} \quad\text{and}\quad \frac{\hat{\beta}_0 - \beta_0}{s_{\hat{\beta}_0}}$$ each have the $t_{n-2}$ distribution (Student's t-distribution with $n-2$ degrees of freedom).
Thus we can compute confidence intervals for $\beta_0$ and $\beta_1$, or conduct hypothesis tests regarding the values of $\beta_0$ and $\beta_1$, using the same formulations that we used when we made inference about the population mean of a quantitative variable using the Student's t-distribution. The null hypothesis value for the slope would be denoted $\beta_{1,0}$ and for the intercept would be denoted $\beta_{0,0}$.
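As a concrete sketch (using $t_{\alpha/2,\,n-2}$ to denote the critical value of the $t_{n-2}$ distribution with upper-tail area $\alpha/2$, notation not introduced above), a $100(1-\alpha)\%$ confidence interval for the slope is $$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\; s_{\hat{\beta}_1},$$ and the test statistic for $H_0: \beta_1 = \beta_{1,0}$ is $$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{s_{\hat{\beta}_1}},$$ with analogous formulas for the intercept $\beta_0$.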
Typically we are interested in testing whether the slope of the regression line is different from zero. Rejecting the null hypothesis of a zero slope indicates that the IV has a statistically significant (linear) effect on the DV. Most statistical software will report the P-value for this particular test (along with a test for the value of the intercept, which is rarely of interest).
Another topic of statistical inference is estimating the population mean of the DV for any specified value of the IV, i.e., $E(Y \mid X = x)$. We skip that topic in this course.
Also, we might want to use the linear model to predict the value of the DV for a single future observation having a specified value of the IV, using a prediction interval. We will also skip this topic in this course.
Using R
Simple Linear Regression
To conduct simple linear regression in R, suppose you have measurements on two quantitative variables stored in a data frame in your workspace.
For example, consider the following data frame, consisting of measurements on the variables Speed and Goodput, which you may paste into R:
Data = data.frame(
Speed = sort(rep(c(5,10,20,30,40),5)),
Goodput = c(95.111,94.577,94.734,94.317,94.644,90.800,90.183,91.341,91.321,92.104,72.422,82.089,84.937,
87.800,89.941,62.963,76.126,84.855,87.694,90.556,55.298,78.262,84.624,87.078,90.101)
)
Here, Speed is measured in meters per second, while Goodput is a percentage.
Suppose you want to fit a simple linear regression model with Speed as the independent variable and Goodput as the dependent variable. In R you should give an appropriate name to the linear model, which is created using the lm() function.
For instance, suppose you want to name the linear model Model. The corresponding command is:
> Model = lm(Goodput ~ Speed, data=Data)
To view the output of this linear model, you would then give the command:
> summary(Model)
This displays a summary of the fitted model.
In the summary you see summary statistics about the residuals, followed by the estimated regression coefficients (the Intercept and the slope for the IV Speed, in the Estimate column), their standard errors (in the Std. Error column), t-values for testing whether each coefficient is zero (in the t value column), and the corresponding P-values (in the Pr(>|t|) column). The significance codes indicate how small the P-values are.
Then you have the Residual standard error (which is the square root of the MSE, and is an estimate of $\sigma$) and its degrees of freedom, the value of $R^2$ (Multiple R-squared), and some additional information.
If you want to see the vector of the fitted values (the predictions of the dependent variable for each value of the independent variable based on the regression line, also called the "Y-hat" values) for this model, you would give the command:
> Model$fitted.values
Similarly, if you want to display the vector of the residuals (the differences between observed and predicted values of the dependent variable) for this model, you would give the command:
> Model$residuals
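As a quick check (not part of the original example), the residuals are simply the observed values minus the fitted values, which you can verify directly:
# Residuals are observed minus fitted values; both lines should print the same numbers
head(Data$Goodput - Model$fitted.values)
head(Model$residuals)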
To display the vector of coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ (the beta-hats) for the estimated regression function, you would give the command:
> Model$coefficients
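You can also use the fitted model to compute a prediction for a new value of the IV. For example (a hypothetical speed of 25, chosen only for illustration), R's predict() function applies the estimated regression equation for you:
# Predicted Goodput at a (hypothetical) Speed of 25, using the fitted model
predict(Model, newdata = data.frame(Speed = 25))

# Equivalent manual computation from the coefficients
Model$coefficients[1] + Model$coefficients[2] * 25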
Scatterplot
Suppose you want to make a scatterplot in R with the IV Speed on the horizontal axis and the DV Goodput on the vertical axis. The command is:
> plot(Data$Speed, Data$Goodput)
The first argument to the plot() function is the column corresponding to the variable on the horizontal axis, which should always be the IV, and the second argument is the column corresponding to the variable on the vertical axis, which should always be the DV.
The resulting scatterplot is inadequate. It has no title, and the axis labels aren't very informative. Also, the plotted points are open circles, while one might prefer dark filled-in circles. To fix this, we can add some additional settings to the command:
> plot(Data$Speed, Data$Goodput, main="Computer Networks", xlab="Speed(M/s)", ylab="Goodput(%)", pch=20)
Now we obtain a much nicer scatterplot.
You may also want to add a plot of the estimated regression line to the scatterplot of the data, since you have already obtained the least-squares estimates of the regression coefficients, which are stored in Model$coefficients.
To add the plot of the estimated regression function to the scatterplot, use the abline() command. This command takes the intercept of the line as its first argument, and the slope as the second argument.
> abline(Model$coefficients)
The line will appear superimposed over the data on the scatterplot.
Confidence Intervals
Let us return to our original model. To obtain 95% confidence intervals for the intercept and the slope in the above linear model named Model, the command would be:
> confint(Model)
Then R will display the lower bound and upper bound of the 95% confidence interval for each parameter.
To obtain confidence intervals with another confidence level, say, 99% confidence intervals, you would use:
> confint(Model, level=0.99)
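For comparison (not part of the original example), you can reproduce the 95% confidence interval for the slope by hand from the estimate, its standard error, and the appropriate t critical value, which illustrates the formula given earlier:
# Manual 95% confidence interval for the slope, for comparison with confint(Model)
est <- summary(Model)$coefficients    # columns: Estimate, Std. Error, t value, Pr(>|t|)
b1  <- est["Speed", "Estimate"]
se1 <- est["Speed", "Std. Error"]
df  <- Model$df.residual              # n - 2 for simple linear regression
b1 + c(-1, 1) * qt(0.975, df) * se1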
Two-sided Hypothesis Testing
If you want to conduct the hypothesis test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$, the test statistic is already provided for you in the model summary.
Notice that for the slope (in the row labeled Speed), the entry under t value is the test statistic. This is just the estimated slope divided by its standard error.
The two-sided P-value is given to you under Pr(>|t|), and it leads us to conclude in favor of $H_1$ at the 0.05 level, i.e., Speed has a significant effect on Goodput. (Note: the summary will not display a P-value smaller than 2e-16, but will simply display < 2e-16, which is essentially zero.) The two stars in the last column of the summary tell you that this P-value is between 0.001 and 0.01 (see the significance codes line).
You would test $H_0: \beta_0 = 0$ against $H_1: \beta_0 \neq 0$ in the same way. Note that in this example, the P-value for the intercept is essentially zero, so you would also reject $H_0$ at any typical significance level.
One-sided Hypothesis Testing
If you have a one-sided alternative hypothesis, like $H_1: \beta_1 > 0$ or $H_1: \beta_1 < 0$, you should divide the given two-sided P-value in half, but you must also make sure that the sign of the slope estimate is in agreement with the alternative hypothesis before you can conclude in its favor.
Test for a Nonzero Slope Coefficient
If you want to conduct a test in which the null value for the slope is not $0$, you will have to divide the difference between the parameter estimate and the null value by the standard error to get the test statistic, i.e., $$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{s_{\hat{\beta}_1}},$$ which has the $t_{n-2}$ distribution when $H_0: \beta_1 = \beta_{1,0}$ is true.
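A sketch of how this could be done in R, using a hypothetical null value for the slope (the value -0.5 below is chosen purely for illustration):
# Test H0: beta1 = beta1_0 against a two-sided alternative (beta1_0 is hypothetical)
beta1_0 <- -0.5
est     <- summary(Model)$coefficients
t_stat  <- (est["Speed", "Estimate"] - beta1_0) / est["Speed", "Std. Error"]
p_value <- 2 * pt(-abs(t_stat), df = Model$df.residual)   # two-sided P-value
c(t = t_stat, p = p_value)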
Coefficient of Determination
The value of the coefficient of determination $R^2$ is given in the original model output as Multiple R-squared.
The value of $R^2$ is the proportion of the variation in the dependent variable that is explained by the association with the independent variable in the regression model, which we may also express as a percentage.
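If you want to extract this value directly rather than reading it from the printed summary, it is stored in the summary object:
# Extract R-squared from the model summary
summary(Model)$r.squared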