Lab 08: Multiple Linear Models

(Stats) Learn how to interpret multiple linear models, make predictions, and use standard tests and diagnostics for evaluation, including making diagnostic plots. (R) Model summaries. Diagnostic plots. Prediction and plotting.

Published

April 18, 2023


Objectives

This lab will guide you through the process of

  1. Creating multiple linear models
  2. Evaluating model assumptions with diagnostic plots
  3. Checking for correlation in the predictors
    • Scatterplot matrix with pairs()
    • Pearson’s Correlation with cor()
    • Variance Inflation Factor with vif()

R Packages

We will be using the following packages:
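Based on the functions used in this lab, loading something like this should cover it (skimr, performance, and the tidyverse are assumed from the calls to skim(), check_model(), and read_csv() below):

library(tidyverse)    # read_csv(), rename_with(), select(), tibble(), ggplot()
library(skimr)        # skim()
library(performance)  # check_model()
library(car)          # vif()
library(ISLR2)        # Boston data for the exercises
library(archdata)     # Handaxes data for the homework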

You’ll want to install ISLR2, car, and archdata (used in the homework) with install.packages(c("ISLR2", "car", "archdata")).


Multiple Regression

In this section, we will walk through how to build a multiple linear model in R. This is not fundamentally different from building a simple linear model. The only difference is that we need to update the model formula. For the simple linear model, the formula is just y ~ x; for the multiple linear model, it’s y ~ x1 + x2 + … + xn. We simply add the covariates together with the plus sign.

Let’s work through an example with the Advertising data set used in the textbook An Introduction to Statistical Learning with Applications in R. We want to know whether investment in different advertising media increases sales of some product. The website for the ISLR book hosts several data sets, and the authors have also written companion R packages (ISLR and ISLR2, for the first and second editions, respectively). Now, as it happens, the path to a CSV file does not have to be a local path pointing to a location on your computer. It can also be a URL pointing to where a file is stored remotely on a website, so reading in the advertising data from the ISLR website is as easy as this.

adverts <- read_csv("https://www.statlearning.com/s/Advertising.csv") |> 
  rename_with(tolower) |> 
  select(sales, tv, radio, newspaper)

Before fitting the model, let’s have a look at some summaries of the data with skim().

skim(adverts)
Data summary

  Name                    adverts
  Number of rows          200
  Number of columns       4
  Column type frequency:
    numeric               4
  Group variables         None

Variable type: numeric

  skim_variable  n_missing  complete_rate   mean     sd   p0    p25    p50    p75   p100  hist
  sales                  0              1   14.0   5.22  1.6  10.38   12.9   17.4   27.0  ▁▇▇▅▂
  tv                     0              1  147.0  85.85  0.7  74.38  149.8  218.8  296.4  ▇▆▆▇▆
  radio                  0              1   23.3  14.85  0.0   9.97   22.9   36.5   49.6  ▇▆▆▆▆
  newspaper              0              1   30.6  21.78  0.3  12.75   25.8   45.1  114.0  ▇▆▃▁▁

And now the model.

adverts_lm <- lm(sales ~ tv + radio + newspaper, data = adverts)

summary(adverts_lm)
#> 
#> Call:
#> lm(formula = sales ~ tv + radio + newspaper, data = adverts)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -8.828 -0.891  0.242  1.189  2.829 
#> 
#> Coefficients:
#>             Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)  2.93889    0.31191    9.42 <0.0000000000000002 ***
#> tv           0.04576    0.00139   32.81 <0.0000000000000002 ***
#> radio        0.18853    0.00861   21.89 <0.0000000000000002 ***
#> newspaper   -0.00104    0.00587   -0.18                0.86    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.69 on 196 degrees of freedom
#> Multiple R-squared:  0.897,  Adjusted R-squared:  0.896 
#> F-statistic:  570 on 3 and 196 DF,  p-value: <0.0000000000000002

Note the change in how we interpret the coefficients! Each coefficient must be interpreted while holding the other covariates fixed. For example, the coefficient estimate for tv is 0.046. We interpret this as saying, “Holding investment in radio and newspaper advertising fixed, increasing tv advertising by $1000 will increase the number of units sold by approximately 46 units (because sales are measured in thousands of units, so 1000 x 0.046).”
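We can also use the fitted model to make predictions with predict(). As a quick sketch (the budget values here are made up for illustration):

# Hypothetical budgets, in thousands of dollars
new_budgets <- tibble(tv = 100, radio = 20, newspaper = 10)

# Expected sales (in thousands of units), with a confidence interval
predict(adverts_lm, newdata = new_budgets, interval = "confidence")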

Exercises

  1. Load the Boston dataset from the ISLR2 package with data().
  2. Subset the variables in the dataset using the select() function from dplyr. Choose all of the following variables:
    • medv = median home value (in thousands of dollars)
    • rm = average number of rooms per dwelling
    • crim = per capita crime rate
    • lstat = percent of households with low socioeconomic status
  3. Summarize the table with skim().
  4. Make a simple linear model of median home value (medv) as a function of average number of rooms (rm).
    • Call the model simple_lm.
  5. Now make a multiple linear model of median home value (medv) as a function of average number of rooms, per capita crime rate, and percent of households with low socioeconomic status (rm, crim, and lstat, respectively).
    • Call the model boston_lm.
  6. Summarize the model with summary().
    • Are all the coefficients significantly different than zero?
    • How much of the variance in house value is explained by these variables (ie what is the R-squared value)?
  7. How do you interpret the coefficient for number of rooms? What effect does increasing the number of rooms have?

Model Evaluation

As always, we can use the base plot() function to check model assumptions, using the which argument to specify the type of plot we want to make. For example, we can make a Q-Q plot like so:

plot(adverts_lm, which = 2)
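If you would rather see several diagnostic plots at once, one option is to set up a plotting grid first and then call plot() without the which argument, which draws the four default diagnostic plots:

# Draw the four default diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(adverts_lm)
par(mfrow = c(1, 1))  # reset the grid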

Or, we can use the check_model() function from the performance package.

check_model(
  adverts_lm,
  check = c("linearity", "homogeneity", "outliers", "qq")
)

How does it look? Have we met the assumptions of linear regression? Consider, for example, that Q-Q plot. It would seem to suggest that the residuals are skewed to the left. That’s not good, because the linear model assumes that the residuals are normally distributed and centered on zero. Another way to check this is to plot a histogram of the residuals and see whether it has the shape of a bell curve.

adverts_residuals <- tibble(residuals = residuals(adverts_lm))

ggplot(adverts_residuals, aes(residuals)) +
  geom_histogram() +
  labs(
    x = "Residual Sales",
    y = "Count"
  )

😬 You can see the left skew as indicated by the Q-Q plot.
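If you want a more formal check to go along with these visual ones, one option (not used elsewhere in this lab) is a Shapiro-Wilk test of the residuals, where a small p-value is evidence against normality:

# Shapiro-Wilk test for normality of the residuals
shapiro.test(residuals(adverts_lm))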

Exercises

  1. Plot a histogram of the residuals in boston_lm.
  2. Run check_model() on boston_lm and check for linearity, homogeneity, outliers, and qq.
  3. Does the model meet the assumptions of linear regression?

ANOVA

When evaluating the model, we should also check whether adding variables actually makes a significant improvement. To do that, we need to compare the complex model to a simpler model. In R, we do that with the anova() function. Here, we’re going to compare the full model to a set of simpler models that are nested within it. Note that in each case the simpler model is a subset of the more complex model (ie, these are nested models)! The ANOVA wouldn’t work otherwise!

m1 <- lm(sales ~ tv, data = adverts)
m2 <- lm(sales ~ tv + radio, data = adverts)
m3 <- lm(sales ~ tv + radio + newspaper, data = adverts)

anova(m1, m2, m3)
#> Analysis of Variance Table
#> 
#> Model 1: sales ~ tv
#> Model 2: sales ~ tv + radio
#> Model 3: sales ~ tv + radio + newspaper
#>   Res.Df  RSS Df Sum of Sq      F              Pr(>F)    
#> 1    198 2103                                            
#> 2    197  557  1      1546 544.05 <0.0000000000000002 ***
#> 3    196  557  1         0   0.03                0.86    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is a sequential ANOVA that does comparisons in order, first comparing the tv + radio model to the tv model, and then the full model to the tv + radio model. For each comparison, it reports the residual sum of squares (RSS, or the error in the model) and evaluates whether the difference in that value between the models is significant. The degrees of freedom (Df) represents the difference in residual degrees of freedom (Res.Df) between the simple and complex models (for row 2, that’s 198-197=1). The sum of squares (Sum of Sq) is the difference in the Residual Sum of Squares (RSS) between the two models (for row 2, that’s 2103-557=1546). The F statistic is then the ratio of (Sum of Sq/Df) to what is effectively the mean squared error of the full model (for row 2, that’s (1546/1)/(557/196) ≈ 544). The intuitive idea here is that an F-statistic close to zero means that adding the covariate does not reduce the model’s RSS or error to any meaningful degree. As always, the F-statistic is then compared to an F distribution with the corresponding degrees of freedom (here, 1 and 196) to determine how likely that particular value is, assuming the null hypothesis that there is no significant improvement. Notice that each of these covariates makes a significant contribution to the model of sales except newspaper spending. Why might that be?
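To make the arithmetic concrete, here is a quick sketch that reproduces the row-2 F statistic directly from the fitted models:

# Residual sum of squares for each model
rss_m1 <- sum(residuals(m1)^2)  # sales ~ tv
rss_m2 <- sum(residuals(m2)^2)  # sales ~ tv + radio
rss_m3 <- sum(residuals(m3)^2)  # sales ~ tv + radio + newspaper

# Row 2: (drop in RSS / Df) over the full model's mean squared error
((rss_m1 - rss_m2) / 1) / (rss_m3 / df.residual(m3))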

One last point before moving on. You do not have to explicitly generate a sequence of nested models. You can simply call anova() on the full model, and it will generate those comparisons for you. However, do be aware that the printed ANOVA table is slightly different (additional columns and different column names) in the two cases. Here it is for the full advertising model:

anova(adverts_lm)
#> Analysis of Variance Table
#> 
#> Response: sales
#>            Df Sum Sq Mean Sq F value              Pr(>F)    
#> tv          1   3315    3315 1166.73 <0.0000000000000002 ***
#> radio       1   1546    1546  544.05 <0.0000000000000002 ***
#> newspaper   1      0       0    0.03                0.86    
#> Residuals 196    557       3                                
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercises

  1. Use the anova() function to compare simple_lm to boston_lm.
    • State your null and alternate hypotheses.
  2. Is the complex model worth it?

Collinearity

Linear regression assumes that the predictor (or independent) variables are not strongly correlated with one another. However, this is often not the case. To evaluate this assumption, you can do one of two things. I actually recommend that you always do both. First, you can check for pairwise correlations visually with the pairs() function from base R or numerically with the cor() function. Both work with tables. Second, you can check for multicollinearity using the vif() function from the car package to estimate the Variance Inflation Factor for each covariate.

Check for collinearity

pairs(adverts)

cor(adverts)
#>            sales     tv  radio newspaper
#> sales     1.0000 0.7822 0.5762    0.2283
#> tv        0.7822 1.0000 0.0548    0.0566
#> radio     0.5762 0.0548 1.0000    0.3541
#> newspaper 0.2283 0.0566 0.3541    1.0000

Notice that pairs() generates a scatterplot matrix with all pairwise combinations of variables. If there’s no correlation, the scatterplot should look like a cloud of random points. If there is a correlation, the points will cluster along some line. The cor() function generates a correlation matrix (by default, a Pearson’s correlation matrix). Of course, every variable correlates perfectly with itself, hence the values of 1 along the diagonal. Values above and below the diagonal are simply mirrors of each other, as is the case with the scatterplot matrix. The fact that each variable correlates with sales is not problematic in this case, as that is the relationship we are trying to model. What we want to focus on here is the degree of correlation between the predictors tv, radio, and newspaper. It is preferable that the correlations between these be as close to zero as possible, but again, it will rarely be the case that they are exactly zero. Any correlation greater than 0.7 in absolute value is a cause for concern and should lead you to update your model to address it.
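Since the predictors are what matter here, a small sketch like this drops the response before computing the correlations:

# Correlations among the predictors only
adverts |>
  select(tv, radio, newspaper) |>
  cor()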

Check for multicollinearity

Generally, it’s a good idea to test for correlations between predictors before building a model. However, the vif() function only works on model objects, which makes sense, as the variance being inflated is the variance of the coefficient estimates made by the model. That means, of course, that you have to build a model first before evaluating its covariates for multicollinearity. Using the linear model we made with the advertising data, that looks like this:

vif(adverts_lm)
#>        tv     radio newspaper 
#>      1.00      1.14      1.15

This generates a named vector, with one value for each covariate. A general rule of thumb is that the VIF for a coefficient should be less than 5. Above that, you should consider making changes to your model.
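To demystify these numbers a little, here is a sketch of the VIF for tv computed by hand: regress tv on the other predictors and apply 1 / (1 - R²). It should match the vif() output above.

# VIF for tv by hand
r2_tv <- summary(lm(tv ~ radio + newspaper, data = adverts))$r.squared
1 / (1 - r2_tv)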

Exercises

  1. Use pairs() on the Boston data to visualize pairwise relationships between variables.
    • Do you see any potential trends?
  2. Use cor() on the Boston data to estimate the Pearson correlation coefficient for each pair of variables.
    • Are there any strong correlations?
  3. Use vif() on the boston_lm model to estimate the Variance Inflation Factor for each covariate.
    • Is the VIF greater than 5 for any of them?

Homework

  1. Load the Handaxes dataset from the archdata package using data().
  2. Now let’s explore this dataset.
    • Check the variables that it includes with names(). You should see maximum length (L), breadth (B), and thickness (T), among others.
    • Try renaming those to make them more informative and readable with rename().
    • Use select() to subset this table, taking only those three variables (length, breadth, and thickness).
    • Summarize this table with skim().
    • Visualize the pairwise relationships between variables with pairs().
    • Estimate Pearson’s correlation coefficient for each pair of variables with cor().
  3. Now make a simple linear model showing the relationship (if any) between the length and breadth of handaxes. Be sure to do all of the following:
    • Use summary() to report the model.
    • Use predict() and geom_line() to visualize the modeled relationship. Be sure to plot this over the data!
    • Add a confidence interval with geom_ribbon().
    • Use check_model() to visually inspect the model.
    • Does the model satisfy the assumptions of linear regression?
  4. Build a multiple linear model of handaxe length as a function of breadth and thickness. Be sure to do all of the following:
    • Use summary() to report the model.
    • Use check_model() to visually inspect the model.
    • Does the model satisfy the assumptions of linear regression?
  5. Use the vif() function on the multiple linear model.
    • Is the VIF for either variable greater than 5?
  6. Conduct an ANOVA using the anova() function to compare the simple and complex models.
    • Does the complex model significantly improve on the simpler model?