[A, SfS] Chapter 8: Correlation and Regression: 8.4: Multiple Linear Regression
Multiple Linear Regression
Multiple Linear RegressionIn this section, we will discuss how we can model a linear association between a continuous dependent variable and a set of two or more independent variables.
The multiple linear regression model is an extension of the simple linear regression model to situations in which we have more than one quantitative IV which we believe to have a linear association with the continuous DV, and we want to analyze their collective effect on the DV.
Multiple Linear Regression ModelGiven #p# independent variables #x_1,...,x_p# and dependent variable #Y#, the multiple linear regression model is: \[Y = \beta_0 + \beta_1x_1 + \cdot\cdot\cdot + \beta_px_p + \varepsilon\] where #\varepsilon \sim N\Big(0,\sigma^2\Big)#.
For #j = 1,...,p#, each regression coefficient #\beta_j# represents the average change in the DV at the population level when the IV #x_j# increases by one unit and all other IVs remain constant.
Thus #\beta_j# represents the effect of #x_j# on #Y# when controlling for the effects of all the other IVs. It could be that in a simple linear regression model we find that an IV has a significant effect on the DV, but when we use a multiple regression model to control for the effects of one or more additional IVs, the effect of the original IV is no longer significant. When this happens, one or more of the new IVs will usually have a strong correlation with the original IV.
\[\text{}\]
Multiple Least-squares RegressionGiven observations \[(y_1,x_{11},...,x_{1p}),(y_2,x_{21},...,x_{2p}),...,(y_n,x_{n1},...,x_{np})\] of the DV and the #p# IVs measured on a random sample of size #n#, the #p+1# regression parameters \[\beta_0,\beta_1,...,\beta_p\] are estimated using least-squares regression to obtain the least-squares coefficients \[\Big(\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_p\Big)\]
These are the unique set of values that minimize the sum of the squared residuals #e_1^2,...,e_n^2#:
\[\begin{array}{rcl}
S &=& \displaystyle\sum_{i=1}^n e^2_i \\\\
&=& \displaystyle\sum_{i=1}^n\Big( Y_i - \hat{\beta}_0 - \hat{\beta}_1x_{i1} - \cdot\cdot\cdot - \hat{\beta}_px_{ip}\Big)^2
\end{array}\]
This minimization is performed using techniques from linear algebra that are typically implemented using software.
The least-squares regression function is then represented by the equation: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_{1} + \cdot\cdot\cdot + \hat{\beta}_px_{p}\]
and for the #i#th case in the data set the #i#th fitted value (or predicted value) is: \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \cdot\cdot\cdot + \hat{\beta}_px_{ip}\]
while the #i#th residual is: \[e_i = y_i - \hat{y}_i\]
\[\text{}\]
Determining the Quality of the Model's FitTo evaluate the quality of the fit of the multiple linear regression model to the data, we again define the total sum of squares (SST) as \[SST = \displaystyle\sum_{i=1}^n (y_i - \bar{y})^2\]
the error sum of squares (SSE) as \[SSE = \displaystyle\sum_{i=1}^n (y_i - \hat{y}_i)^2\]
and the the regression sum of squares (SSR) as \[SSR = SST - SSE = \displaystyle\sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]
The coefficient of determination is still written as #R^2#, and is still computed in the same way: \[R^2 = \cfrac{SSR}{SST} = 1 - \cfrac{SSE}{SST}\]
This is the square of the sample correlation coefficient #R# between the observed values of the DV #y_1,...,y_n# and the fitted values #\hat{y}_1,...,\hat{y}_n#.
Again it is a number between #0# and #1# and represents the proportion of the variance in the DV explained by the regression model, with values closer to #1# signaling a better fit.
Note that there is no established threshold for distinguishing a "good fit" from a "bad fit". But if we compare two or more linear models that have the same number of IVs, we can conclude which model has the best fit among them based on which model has a larger #R^2#.
However, a preferred criterion for comparing models which may have different numbers of IVs is the adjusted coefficient of determination, which penalizes models that include IVs that contribute very little toward prediction of the DV: \[R^2_{adj} = 1 - \cfrac{n-1}{n-p-1} \cdot \cfrac{SSE}{SST}\]
The #R^2# always increases when another IV is added, whereas the adjusted #R^2# will achieve a maximum after which adding further IVs causes it to decrease.
\[\text{}\]
Inference about the Parameters of a Multiple Linear ModelIn order to conduct statistical inference about the population parameters for a multiple linear model, the error variance #\sigma^2# is again estimated by the mean square error : \[MSE = \cfrac{SSE}{n - p - 1}\]
The estimated variances of #\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_p#, denoted #s_0^2,s_1^2,...,s_p^2#, are computed using complicated linear-algebra formulas best left to software.
Assuming again that our data are measured on a random sample, and that the residuals are consistent with a random sample from the #N\big(0,\sigma^2\big)# distribution, the statistics #T_0,T_1,...,T_p# each have a #t_{n-p-1}# distribution, where
Most often, we test whether or not #\beta_j \neq 0#, for #j = 1,...,p#, to determine whether the effect of a specific #IV# on the #DV# is statistically significant, so most statistical software reports the P-values for these tests.
Dealing with Nonsignificant Independent VariablesAs stated earlier, an IV could be found statistically significant in a simple linear model, but when combined with other IVs in a multiple linear regression model it could be found not significant. This occurs when that IV is strongly correlated with one or more of the other IVs, so that it does not provide any additional predictive power beyond what is already provided by the other IVs.
If any IVs are found to be not statistically significant (i.e., the P-value is large when testing #H_0:\beta_j=0# against #H_1:\beta_j\neq0# for the #j#th IV), we should drop the IV having the largest P-value and fit a new model with the remaining IVs.
We should continue dropping IVs one at a time until we have a final model in which each IV is statistically significant. Dropping insignificant IVs will cause the #R^2# to decrease in small amounts, while causing the adjusted #R^2# to increase.
\[\text{}\]However, in a multiple regression model, we should not assess the individual effects of individual IVs until we first assess whether or not the model itself is statistically significant. This is to avoid Type I errors due to multiple testing.
So we should first consider an omnibus test for the model itself, which incorporates all the information embedded within the data simultaneously to assess whether or not the multiple regression model is significant.
Assessing the Significance of the Multiple Regression ModelTo determine whether or not the multiple regression model is significant, we test the null hypothesis
#H_0 : \beta_1 = \cdot\cdot\cdot = \beta_p = 0# (all IVs are not significant, hence the model is not significant),
against the alternative hypothesis
#H_1: \beta_j \neq 0# for at least one #j# (the model is significant, even if some of the IVs are not)
If you conclude in favor of #H_1#, then you can proceed to assess the effect of the individual IVs.
To conduct this test, we compute the regression mean square as \[MSR = \cfrac{SSR}{p}\] and set \[F = \cfrac{MSR}{MSE}\] If the model is significant, the regression mean square should be large relative to the mean square error, so that #F# is larger than #1#. The #F#-statistic has the #F_{p,n-p-1}# distribution when the regression assumptions are valid, so the P-value is computed as #P(F_{p,n-p-1} > F)#.
Statistical software typically displays this information in an ANalysis Of VAriance (ANOVA) table, which includes the various sums of squares, their degrees of freedom, the mean squares, the value of the #F#-statistic, and its P-value. However, in #\mathrm{R}#, the value of the F-statistic, and its P-value, are provided in the summary of the model.
Checking the Assumptions of the Multiple Linear ModelThe assumptions in multiple linear regression are checked in the same way as in simple linear regression, i.e., with a residual plot and a normal quantile-quantile plot.
It is also wise to inspect a scatterplot of the DV against each of the #p# IVs separately to check the linearity assumption. These can often be obtained using a scatterplot matrix.
Violations of the assumptions can often be remedied using transformations on one or more of the IVs or on the DV.
\[\text{}\]
Using RTo conduct multiple linear regression in #\mathrm{R}#, suppose you have measurements on four quantitative variables stored in a data frame in your #\mathrm{R}# workspace.
For example, consider the following data frame, consisting of measurements on the variables Overhead, LCR, Pause, and Speed, which you may paste into #\mathrm{R}#:
Data = data.frame(
Overhead = c(428.90,443.68,452.38,461.24,475.07,446.06,465.89,477.07,488.73,498.77,452.24,475.97,499.67,
501.48,519.20,445.45,489.02,506.23,516.27,508.18,444.41,490.58,511.35,523.12,523.36),
LCR = c(9.426,8.318,7.366,6.744,6.059,16.456,13.281,11.155,9.506,8.310,26.314,19.013,14.725,12.117,
10.284,33.009,22.125,16.695,13.257,11.107,37.823,24.140,17.700,14.064,11.691),
Pause = rep(10*(1:5),5),
Speed = sort(rep(c(5,10,20,30,40),5))
)
Here Overhead will be the dependent variable, and Speed, Pause, and LCR will be independent variables.
Scatterplot Matrix
To obtain a scatterplot matrix showing scatterplots for all six pairs among the four variables, use the #\mathtt{pairs()}# command:
> pairs(Data, pch=20)
This matrix enables you to tell whether the dependent variable appears to have any non-linear association with any of the independent variables, and also if any pairs among the three independent variables appear to be correlated with each other:
None of the plots indicate a non-linear trend.
Correlation Matrix
To obtain a correlation matrix, use the #\mathrm{R}# command:
> cor(Data)
This matrix gives you the strength of the linear relationship between each pair of variables.
From this output we can see that Overhead is most strongly correlated with Pause (#r=0.74#) and Speed (#r=0.53#), and weakly correlated with LCR (#r=-0.24#). So we would expect that LCR might not have a significant effect on Overhead when controlling for the other two IVs.
We also see that LCR is strongly correlated with both Pause and Speed, but Pause is not correlated with Speed at all. This further strengthens our expectation that LCR will not have a significant effect on Overhead when controlling for the other #2# IVs.
Note that both the scatterplot matrix and the correlation matrix are symmetric, with no useful information on the diagonal. You only need information from either the cells above the diagonal or the cells below the diagonal, due to the symmetry.
Multiple Linear Regression
Now suppose you wish to fit a multiple regression model for the three independent variables Speed, Pause, and LCR and the dependent variable Overhead, and suppose you choose to name the model #\mathtt{Model}#.
Then the #\mathrm{R}# commands to obtain a model and display its results are:
> Model = lm(Overhead ~ Speed + Pause + LCR, data=Data)
> summary(Model)
#\mathrm{R}# displays the summary of the model:
The summary gives you:
- The parameter estimates for the intercept and the three slope coefficients (under #\mathtt{Estimate}#)
- Their respective standard errors (under #\mathtt{Std.\,Error}#)
- The test statistic for testing the individual null hypotheses #H_0: \beta_j = 0# against the corresponding alternative hypotheses #H_1 : \beta_j \neq 0# for #j = 0,1,2#, and #3# (under #\mathtt{t\,value}#)
- The corresponding P-values for each test (under #\mathtt{Pr(>|t|)}#)
Significance codes are supplied to give you a sense of how small the P-value is.
Note that in the summary, the line for the intercept estimate begins with #\mathtt{(Intercept)}# and the line for each additional slope estimate begins with the name of the independent variable associated with that slope.
In this example, we see that the P-value for LCR is #0.062#, so we should consider deleting this IV from the model, unless it is essential for our theory.
The summary also provides you with the #\mathtt{Residual\,standard\,error}#, which is #\sqrt{MSE}#, i.e., the square root of the mean square error, and estimates #\sigma#, along with its degrees of freedom, #n - p - 1#, where #p = 3# in this case since there are three independent variables.
The summary also provides you with the coefficient of determination #R^2#, which is here called #\mathtt{Multiple\,R}\text{-}\mathtt{squared}#, as well as the adjusted #R^2#, which is here called #\mathtt{Adjusted\,R}\text{-}\mathtt{squared}#.
In this example, both #R^2# and #R^2_{adj}# are very high, above #80\%#.
Finally, in the last line of the summary you find the #F#-statistic for testing the null hypothesis \[H_0 : \beta_1 = \beta_2 = \beta_3 = 0\] against its alternative \[H_1 : \beta_1 \neq 0 \text{ or } \beta_2 \neq 0 \text{ or } \beta_3 \neq 0\] as well as its degrees of freedom #p# and #n-p-1#, and the corresponding P-value.
In this example, we see that the model is significant, with a very small P-value for the omnibus test.
Confidence Intervals
If you want individual confidence intervals for the population regression coefficients, follow the same procedure described previously using the #\mathtt{confint()}# function.
For simultaneous #(1 - \alpha)100\%# confidence intervals using the Bonferroni method, set the level at #1 - \alpha/k# in the #\mathtt{confint()}# function, where #k# is the number of regression parameters for which you want simultaneous confidence intervals, and #\alpha# is the overall confidence level.
For example, to obtain #90\%# Bonferroni joint confidence interval for #\beta_1#, #\beta_2#, and #\beta_3#, note that \[1 - \alpha/k = 1 - 0.10/3 = 0.967\] so the #\mathrm{R}# command would be:
> confint(Model, level=0.967)
The output will include a #96.7\%# confidence interval for the intercept #\beta_0# as well, but you can ignore it.
Residuals, Fitted Values, and Plots
To obtain the residuals or the fitted values, and to create the various plots involving the residuals to assess the validity of the regression assumptions, follow the same procedures described in the Simple Linear Regression section.
Or visit omptest.org if jou are taking an OMPT exam.