Multiple Regression in R

Introduction

No matter your exposure to data science & the world of statistics, it's likely that at some point you've at least heard of regression. As a precursor to this quick lesson on multiple regression, you should have some familiarity with simple linear regression. If you aren't familiar with it yet, you can start here! Otherwise, let's dive in with multiple linear regression.

The distinction we draw between simple linear regression and multiple linear regression is simply the number of explanatory variables that help us understand our dependent variable.
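In R's formula syntax, that distinction shows up as extra terms on the right-hand side of the ~. Here's a rough sketch; the names y, x1, x2, and df are placeholders for illustration, not part of our housing example:

# Simple linear regression: one explanatory variable
fit_simple <- lm(y ~ x1, data = df)

# Multiple linear regression: two or more explanatory variables
fit_multiple <- lm(y ~ x1 + x2, data = df)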

Multiple linear regression is an incredibly popular statistical technique for data scientists and is foundational to many of the more complex methodologies they use.

Multiple Linear Regression

In my post on simple linear regression, I gave the example of predicting home prices using a single numeric variable: square footage.

Let’s continue to build on some of what we’ve already done there. We’ll build that same model, only this time, we’ll include an additional variable.

fit <- lm(price ~ sqft_living + waterfront, data = housing)
summary(fit)

Similar to what you would've seen before, we're predicting price by square feet of living space, only now we're also including a waterfront variable. Take note of the data type of our new variable.
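If you want to confirm how R sees that variable, a quick check like the one below works. For a parallel slopes model we want waterfront treated as categorical, e.g. stored as a factor (levels() returns NULL when a variable isn't one):

class(housing$waterfront)
levels(housing$waterfront)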

Parallel Slopes Model

We’ve just created what is known as a parallel slopes model. A parallel slopes model is the result of a multiple linear regression model that has both one numeric explanatory variable and one categorical explanatory variable.

The formula derived from linear regression is the equation of a line.

y = mx + b

  • y is our dependent variable
  • m is the coefficient assigned to our explanatory variable
  • x is the value of the explanatory variable
  • b is the y intercept

With the equation of a line in mind: when we model home prices using square footage alone, we derive a coefficient for x and a y-intercept that best approximate price by minimizing error.

The question we're left with is… when we introduce a categorical variable alongside the current numeric predictor in our regression formula, how is it handled and reflected in the model's output?

If you've ever built a simple linear regression model using only a categorical explanatory variable, you may be familiar with the idea of group means across the different levels of a categorical variable informing the coefficients assigned. You can read a more detailed explanation of that here.

In a parallel slopes model, the inclusion of a categorical variable is now reflected in changes to the value of the y-intercept.
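Concretely, with waterfront coded as 0 or 1 and w as the coefficient it's assigned, our single model equation splits into two parallel lines:

price = b + m * sqft_living          (waterfront = 0)
price = (b + w) + m * sqft_living    (waterfront = 1)

The slope m is shared by both groups; only the intercept shifts, by the amount w.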

You may have asked yourself why these multiple regression models are called parallel slopes models.

Let’s create a visualization of our model and then break down the meaning!

First things first, let's build our parallel slopes model:

fit <- lm(price ~ sqft_living + waterfront, data = housing)
summary(fit)

Then we'll add a field to the housing dataset to hold our fitted values.

housing$pred_price <- predict(fit, housing)
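As a side note, because we're generating predictions on the same data the model was fit to, fitted(fit) would return the same values (assuming housing has no missing values in the modeled columns):

housing$pred_price <- fitted(fit)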

Now we can visualize!

ggplot(housing, aes(x = sqft_living, y = price, col = waterfront)) +
  geom_point() +
  geom_line(aes(y = pred_price))

Take note of the resulting visual: we see two lines representing our predicted price across the values of sqft_living, one for homes with a waterfront and one for homes without.

The key thing I want to highlight here is that every datapoint has the same coefficient assigned to sqft_living, or in other words, the same slope. This is apparent in the prediction lines themselves: they are parallel, which is exactly where the model gets its name.

What does differ is the height of the lines: the line for homes with a waterfront sits above the line for homes without one.

Let’s take a look at the model summary to gain some additional context.

summary(fit)

To understand what’s happening here, let’s think about the model without the waterfront piece first. All records will have the same y-intercept, 21,522, and all records’ value for sqft_living will be multiplied by the coefficient of 246.

What then distinguishes records with a value of "1" for waterfront is that their y-intercept increases by the waterfront1 estimate of 575,954. As such, two records with the same sqft_living but different values for waterfront would differ only by that incremental y-intercept.
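To make that concrete, here's a quick sketch using the estimates above for a hypothetical 2,000 sqft home (the home itself is made up; the coefficients are the ones from our summary):

# Coefficients from the model summary above
intercept <- 21522      # baseline y-intercept
b_sqft <- 246           # shared slope on sqft_living
b_waterfront <- 575954  # intercept shift when waterfront is "1"

sqft <- 2000  # a hypothetical home

# No waterfront: 21,522 + 246 * 2,000 = 513,522
intercept + b_sqft * sqft

# Waterfront: 21,522 + 575,954 + 246 * 2,000 = 1,089,476
intercept + b_waterfront + b_sqft * sqft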

Conclusion

In the last several minutes we’ve covered the following:

  • The definition of multiple regression
  • The difference between multiple regression & simple linear regression
  • The definition of a parallel slopes model
  • How to build your own parallel slopes model
Where the parallel slopes model gets its name

I hope you've found this post on multiple regression and parallel slopes models to be helpful.

Happy Data Science-ing!
