Understand Customer Churn With The Chi-squared Test Statistic

Introduction

The chi-square statistic is a useful tool for understanding the relationship between two categorical variables.

For the sake of example, let's say you work for a tech company that has rolled out a new product and you want to assess the relationship between this product and customer churn. In the age of data, many companies, tech or otherwise, run the risk of treating anecdotal evidence or a high-level visualization as proof of a given relationship. The chi-square statistic gives us a way to quantify and assess the strength of the relationship between a given pair of categorical variables.

Customer Churn

Let’s explore chi-square from this lens of customer churn.

You can download the customer churn dataset that we’ll be working with from kaggle. This dataset provides details for a variety of telecom customers and whether or not they “churned” or closed their account.
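If you'd like to follow along, here's a minimal sketch for loading the data, assuming the CSV has been downloaded into your working directory (the file name below is hypothetical; match it to whatever your download is called).

library(tidyverse)

churn <- read_csv("telco_customer_churn.csv")  # hypothetical file name
glimpse(churn)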

Regardless of the company, team, product, or industry you work in, the following example should generalize well.

Now that we have our dataset, let's quickly use dplyr's select command to pull out the fields we'll be working with, for simplicity's sake. I'll also be dropping the number of levels down to two. You can certainly run a chi-square test on categorical variables with more than two levels, but as we venture to understand the statistic from the ground up, we'll keep it simple.

churn <- churn %>%
  select(customerID, StreamingTV, Churn)%>%
  mutate(StreamingTV = ifelse(StreamingTV == 'Yes', 1, 0))

Churn is going to be classified as a Yes or a No. As you just saw, StreamingTV will be encoded with either a 1 or 0.

Exploratory Data Analysis

I won't go into great depth on exploratory data analysis here, but I will give you two quick tools for assessing the relationship between two categorical variables.

Proportion Tables

Proportion tables are a great way to establish some fundamental understanding of the relationship between two categorical variables.

table(churn$StreamingTV)
table(churn$Churn)
round(prop.table(table(churn$StreamingTV)),2)
round(prop.table(table(churn$Churn)),2)

table() gives us a quick idea of the counts in any given level; wrapping that in prop.table() lets us see the percentage breakdown.

Let’s now pass both variables to our table() function

table(churn$StreamingTV, churn$Churn)
round(prop.table(table(churn$StreamingTV, churn$Churn),1),2)

Once you pass a second variable into the proportion table, you can choose the margin over which to assess relative proportion. The second argument we pass to prop.table(), 1, specifies that we'd like to see the relative proportion of records across each row, i.e. each value of StreamingTV. As you can see in the table above, when a customer did not have streaming TV, they remained active 76% of the time; conversely, if they did have streaming TV, they stuck around less often, at 70%.

Now, before we get ahead of ourselves and claim that having streaming TV is certainly causing more people to churn, we need to assess whether we really have grounds to make such a claim. Yes, the proportion of retained customers is lower, but the difference could be random noise. More on this shortly.

Time to Visualize

This will give us similar information to what we just saw, but visualization tends to make relative values quicker to grasp.

Let’s start off with a quick bar plot with StreamingTV across the x-axis, and the fill as Churn.

churn %>%
  ggplot(aes(x = StreamingTV, fill = Churn))+
  geom_bar()

As you can see, nearly as many TV streamers churned as non-streamers, despite a substantially lower total customer count. Similar to what we saw with proportion tables, a 100% stacked bar helps assess the relative distribution across values of a categorical variable. All we have to do is pass position = 'fill' to geom_bar().

churn %>%
  ggplot(aes(x = StreamingTV, fill = Churn))+
  geom_bar(position = 'fill')

Diving into the Chi-square Statistic

Now, there appears to be some sort of relationship between the two variables, yet we don't have an assessment of its statistical significance. In other words, is there something about the streaming service itself driving the difference, e.g. did customers dislike it so much that they churn at a higher rate? Does their overall bill look way too high as a product of the streaming plan, such that they churn altogether?

All great questions, and we won’t have the answer to them just yet, but what we are doing is taking the first steps to assessing whether this larger investigative journey is worthwhile.

Chi-square Explanation

Before we dive into the depths of creating a chi-square statistic, it’s very important that you understand the purpose conceptually.

We can see two categorical variables that appear to be related; however, we don't definitively know whether the disparate proportions are a product of randomness or some other underlying effect. This is where chi-square comes in. The chi-square test statistic is effectively a comparison of our observed distribution to the distribution we would expect if the two variables were perfectly independent.
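For reference, base R can show you those expected-under-independence counts directly from the table margins; this is just a quick sketch to make the idea concrete (the permutation approach below gets at the same thing by shuffling the data).

# Expected cell counts under perfect independence
chisq.test(table(churn$StreamingTV, churn$Churn))$expected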

So first things first, we need a dataset to represent said independence.

Generating Our Sample Dataset

We will be making use of the infer package. This package is incredibly useful for creating sample data for hypothesis testing, creating confidence intervals, etc.

I won't break down all of the details of how to use infer, but at a high level, you're creating a new dataset. In this case, we want a dataset that looks a lot like our churn dataset, only this time we want to ensure an independent distribution, i.e. customers who stream TV should be no more likely to churn.

An easy way to think about infer is as following the steps specify, hypothesize, and generate. We specify the relationship we're modeling, we declare the intended null distribution (independence), and finally we specify the number of replicates we want to generate. A replicate in this case will mirror the row count of our original dataset. There are instances in which you would create many replicates of the same dataset and make calculations on top of them, but not for this part of the process.

churn_perm <- churn %>%
  specify(Churn ~ StreamingTV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1, type = "permute")

Let's quickly take a look at this dataset.

head(churn_perm)

As you can see, we have the two variables we specified, as well as replicate. All records in this table will have replicate 1, as we only generated a single replicate.

Sample Summaries

Let's quickly plot our independent dataset to see the relative proportions now.

churn_perm %>%
  ggplot(aes(x = StreamingTV, fill = Churn))+
  geom_bar(position = 'fill')

As desired, you can see that the relative proportions line up almost exactly. There is some randomness at play, so the two may not line up perfectly… but that's really the point. We're not doing this quite yet, but remember when I mentioned the idea of creating many replicates?

What might the purpose of that be?

If we create this sample dataset many times over, do we ever see a gap as wide as the 70% versus 76% retention we saw in our observed dataset? If so, how often do we see it? Is it so often that we have no grounds to chalk the difference up to anything more than random noise?

Alright enough of that rant… On to making an assessment of how much our observed data varies from our sample data.

Let’s Get Calculating

Now that we really understand our purpose, let's go ahead and calculate our statistic. Simply enough, our intent is to calculate the distance between each cell of our table of observed counts and the corresponding cell of our sample counts.

The formula for said “distance” looks like this:

sum(((obs - sample)^2)/sample)

  1. We take the difference between the observed count and the sample count in each cell,
  2. square the differences so that they don't cancel each other out,
  3. divide each by the sample count so that no single cell dominates purely due to its size,
  4. and finally take the sum.

The chi-square statistic that we get is: 20.1
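Here is one way that calculation might look in code, staying faithful to the description above; it's a sketch that compares the observed two-way counts against the counts from our single permuted replicate and stores the result as obs_chi_sq, which we'll reuse below when drawing a vertical line on our plot.

obs_counts <- table(churn$StreamingTV, churn$Churn)
sample_counts <- table(churn_perm$StreamingTV, churn_perm$Churn)

# sum of squared differences, each scaled by the corresponding sample count
obs_chi_sq <- sum((obs_counts - sample_counts)^2 / sample_counts)
obs_chi_sq

# The textbook chi-square measures the same kind of distance against the
# expected counts implied by the margins and should land in the same neighborhood:
# chisq.test(obs_counts, correct = FALSE)$statistic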

So, great. We understand the purpose of the chi-square statistic, we even have it… but what we still don’t know is… is a chi-square stat of 20.1 meaningful?

Hypothesis Testing

Earlier in the post, we spoke about how we can use the infer package to create many, many replicates. A hypothesis test is precisely the time for that type of sampling.

Let's use infer again; this time we'll generate 500 replicates and calculate a chi-square statistic for each replicate.

churn_null <- churn %>%
  specify(Churn ~ StreamingTV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "Chisq")
churn_null

Based on the above output, you can see that each replicate has its own stat.

Let’s use a density plot to see what our distribution of chi-square statistics looks like.

churn_null %>%
ggplot(aes(x = stat)) +
  # Add density layer
  geom_density()

At first glance, we can see that the distribution of chi-square statistics is heavily right-skewed. We can also see that our statistic of 20.1 isn't even on the plot.

Let’s add a vertical line to show how our observed chi-square compares to the permuted distribution.

churn_null %>%
ggplot(aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = obs_chi_sq, color = "red")

When it comes to having sufficient evidence to reject the null hypothesis, this is promising (the null hypothesis being that there is no relationship between the two variables).

Calculating P-value

As a final portion of this lesson on how to use the chi-square statistic, let's talk about how to calculate the p-value.

Earlier I mentioned the idea that we might want to know if our simulated chi-square stat was ever as large as our observed chi-square stat, and if so how often it might have occurred.

That is the essence of p-value.

When taking the chi-square stat of two variables that we know are independent of one another (the simulated case), what percentage of these replicates' chi-square stats are greater than or equal to our observed chi-square stat?

churn_null %>%
  summarise(p_value = mean(stat >= obs_chi_sq))

In the case of our sample, we get a p-value of 0, which is to say that over the course of 500 replicates, we never surpassed a chi-square stat of 20.1.
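If you prefer to lean on infer for this step as well, its get_p_value() helper produces the same kind of summary; a sketch, assuming obs_chi_sq holds our observed statistic. The direction is "greater" because a chi-square statistic only grows as the table departs from independence.

churn_null %>%
  get_p_value(obs_stat = obs_chi_sq, direction = "greater")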

As such, we would reject the null hypothesis that churn and streaming tv are independent.

Conclusion

We have done a lot in a short amount of time. It's easy to get lost when dissecting statistics concepts like the chi-square statistic. My hope is that a strong foundational understanding of the need for this statistic, and of the corresponding calculation, lends the right instinct for recognizing the right opportunity to put this tool to work.

In just a few minutes, we have covered:

  • A bit of EDA for pairs of categorical variables
    • Proportion tables
    • Bar Charts
    • 100% Stacked Bar
  • Chi-square explanation & purpose
  • How to calculate a chi-square statistic
  • Hypothesis testing with infer
  • Calculating p-value

If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!

How to Visualize Multiple Regression in 3D


Introduction

No matter your exposure to data science & the world of statistics, you've very likely at least heard of regression. In this post we'll be talking about multiple regression; as a precursor, you'll definitely want some familiarity with simple linear regression. If you aren't familiar, you can start here! Otherwise, let's dive in with multiple linear regression. I recently wrote about visualizing multiple linear regression with heatmaps; if you've already read that post, feel free to jump down to the modeling section of this post, where we'll build our new model and introduce the plotly package and 3-dimensional visualization. If you haven't read it, this is another helpful way to visualize multiple regression.

Multiple Linear Regression

The distinction we draw between simple linear regression and multiple linear regression is simply the number of explanatory variables that help us understand our dependent variable.

Multiple linear regression is an incredibly popular statistical technique for data scientists and is foundational to a lot of the more complex methodologies used by data scientists.

In my post on simple linear regression, I gave the example of predicting home prices using a single numeric variable — square footage.

This post is a part of a series of posts where we explore different implementations of linear regression. In a post where we explore the parallel slopes model, we create a model where we predict price using square footage and whether it’s a waterfront property or not. Here we’ll do something similar, but we’ll create our model using multiple numeric inputs.

Let’s Get Modeling

Similar to what we’ve built in the aforementioned posts, we’ll create a linear regression model where we add a new numeric variable.

The dataset we’re working with is a Seattle home prices dataset. The record level of the dataset is by home and details price, square footage, # of beds, # of baths, and so forth.

Through the course of this post, we’ll be trying to explain price through a function of other numeric variables in the dataset.

With that said, let’s dive in. Similar to what we’ve built previously we’re using sqft_living to predict price, only here we’ll add another variable: sqft_basement

fit <- lm(price ~  sqft_living + sqft_basement,    
          data = housing)
summary(fit)

The inclusion of additional numeric explanatory variables in a regression model is simple, both syntactically and mathematically.

Visualization Limitations

While you can technically layer numeric variables one after another into the same model, it can quickly become difficult to visualize and understand.

In the case of our model, we have three separate dimensions we’ll need to be able to assess.

As I mentioned previously, here we will be using plotly‘s 3d plotting tools to generate deeper understanding.

Let’s play around with plot_ly!

Let’s first visualize sqft_living and price to familiarize ourselves with the syntax.

plot_ly(data = housing, x = ~sqft_living, y = ~price, opacity = 0.5) %>%
  add_markers()

As you can see, the syntax isn't too different from ggplot. First specify the data, then jump into the aesthetics without having to explicitly declare them as aesthetics. The above visual is a simple 2-dimensional scatter plot.

Let’s visualize in 3 dimensions!

plot_ly(data = housing, z = ~price, x = ~sqft_living, y = ~sqft_basement, opacity = 0.5) %>%
  add_markers()

Similar to what we did before, we've just moved price to the z-axis and now included sqft_basement. What's fun about this plotting tool is that it's not static; you can click and drag to rotate the angle from which you're viewing the plot. Obviously here you're just seeing a screenshot, but get this running on your own machine to experience the full flexibility of plotly. The moment you run this command in RStudio, your Viewer pane will populate with a draggable, movable visual that lends itself well to interpreting a dataset of greater dimensions.

Adding a Plane

When moving from two dimensions to three dimensions, things change. If you have a background in linear algebra, this may resonate. To put it simply: in one dimension you have a point, in two dimensions you have a line, and in three dimensions… you have a plane.

With this in mind, let's visualize our multiple linear regression model with a plane.

First things first we need to create a matrix with all possible model inputs as well as the model prediction in each case.

Below I create a vector for our x and our y. We then pass them to the outer function, where the supplied function evaluates our fitted regression equation (the intercept plus the two coefficients) at every combination of the two inputs.

x <- seq(370, 15000, by = 10)
y <- seq(0, 15000, by = 10)

plane <- outer(x, y, function(a, b){fit$coef[1] + 
    fit$coef[2]*a + fit$coef[3]*b})

Now that we have our plane, let’s add it to our visual.

plot_ly(data = housing, z = ~price, x = ~sqft_living, y = ~sqft_basement, opacity = 0.5) %>%
  add_markers() %>%
  add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)

Again, you’ve got to jump in and play with plotly yourself.

Now that we have the plane added to our 3D scatter plot, what does that give us? Why is it helpful?

Have you ever added a regression line to your 2D scatter plot? If so, what was the intention?

You would add a line to your plot to give an indication of what the 'best fit' looks like, but it's also useful to be able to say, for a given value of x, what we would predict y to be. The plane gives us exactly that: for given values of x and y, what's z?
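To make that concrete, here is a quick sketch: the plane's height at any combination of inputs is simply the model's prediction, so predict() returns the z value the plane would show at that point (the input values below are purely hypothetical).

# hypothetical inputs, purely for illustration
predict(fit, newdata = data.frame(sqft_living = 2500, sqft_basement = 800))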

Conclusion

We have done a lot in a short amount of time. Multiple linear regression models can become complex very quickly. My hope is that by adding this functionality to your tool set, you'll be able to maintain a better understanding of the data and models you're working with. It's not difficult to load a model with every variable we have access to, but it does raise the question of whether doing so solves our objective. Does it lend the type of understanding that we set out to obtain when we engaged in the modeling process?

In a few short minutes, we’ve covered:

  • Multiple linear regression definition
  • Building a mlr model
  • Visualization/interpretation limitations
  • Using 3D plots and planes to interpret our data and models

If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!

Visualizing Multiple Linear Regression with Heatmaps


Introduction

No matter your exposure to data science & the world of statistics, it's likely that at some point you've at the very least heard of regression. As a precursor to this quick lesson on multiple regression, you should have some familiarity with simple linear regression. If you aren't familiar, you can start here! Otherwise, let's dive in with multiple linear regression.

The distinction we draw between simple linear regression and multiple linear regression is simply the number of explanatory variables that help us understand our dependent variable.

Multiple linear regression is an incredibly popular statistical technique for data scientists and is foundational to a lot of the more complex methodologies used by data scientists.

Multiple Linear Regression

In my post on simple linear regression, I gave the example of predicting home prices using a single numeric variable — square footage.

This post aligns very closely with another post I've written on multiple linear regression; the distinction lies in the data types of the variables explaining our dependent variable. That post explains multiple linear regression using one numeric & one categorical variable, also known as a parallel slopes model.

What we'll run through below will give us insight into a multiple linear regression model where we use multiple numeric variables to explain our dependent variable, and how we can effectively visualize it using a heatmap. Enjoy!

Let’s Build a Regression Model

Similar to what we’ve built in the aforementioned posts, we’ll create a linear regression model where we add another numeric variable.

The dataset we’re working with is a Seattle home prices dataset. The record level of the dataset is by home and details price, square footage, # of beds, # of baths, and so forth.

Through the course of this post, we’ll be trying to explain price through a function of other numeric variables in the dataset.

With that said, let’s dive in. Similar to the other posts, we’re using sqft_living to predict price, only here we’ll add another variable: bathrooms.

fit <- lm(price ~  sqft_living + bathrooms,    
          data = housing)
summary(fit)

The inclusion of additional numeric explanatory variables in a regression model is simple, both syntactically and mathematically.

Visualization Limitations

While you can technically layer numeric variables one after another into the same model, it can quickly become difficult to visualize and understand.

In the case of our model, we have three separate dimensions we’ll need to be able to assess.

Over the next bit, we’ll review different approaches to visualizing models with increasing complexity.

Break Out the Heatmap

The purpose of our visualization is to understand how the given variables relate to one another. A simple scatter plot is a very intuitive choice for two numeric variables; the moment we include a third variable, things get a bit more confusing.

The first option we’ll be reviewing is the heatmap. This form of visualization as an overlay to a scatter plot does a good job communicating how our model output changes as the combination of our explanatory variables change.

First things first, we need to create a grid that combines all of the unique combinations of our two variables. This will be key as we want to have an exhaustive view of how our model varies with respect to explanatory variables.

Once we do this, we can assign predictions to each of them giving us a clear indication of our prediction across all potential combinations of our numeric variables.

Below I'll use the table function to get an idea of the range of values for each variable, for the sake of creating the sequences you see in the code below. Alternatively, you could pass all of the unique occurrences of a given variable, e.g. data.frame(table(housing$sqft_living))[1], into the expand.grid function.

We use expand.grid to create a dataframe with all of the various variable combinations.

table(housing$bathrooms)
table(housing$sqft_living)

all_combinations <- expand.grid(sqft_living = seq(370, 13540, by = 10), bathrooms = seq(0.75, 8, by = 0.25))

Now that we have our dataframe, let’s generate predictions using broom‘s augment function.

combos_aug <- augment(fit, newdata = all_combinations)

Let’s move onto the visualization.

housing %>%
ggplot(aes(x = sqft_living, y = bathrooms))+
  geom_point(aes(color = price))

Here we see the scatter between our explanatory variables with the color gradient assigned to the dependent variable price.

Let’s add our tile. We see the same code as above, we’re just now including the geom_tile function with the model predictions, .fitted.

housing %>%
ggplot(aes(x = sqft_living, y = bathrooms))+
  geom_point(aes(color = price))+
  geom_tile(data = combos_aug, 
            aes(fill = .fitted), alpha = 0.5)

As you can see, there is a distinct gradient moving across sqft_living on the x-axis. With that said, we can also see some gradient across bathrooms on the y-axis. We can similarly see that the price, as visualized by the point color gradient, is far darker/lower on the bottom right of our chart.

Conclusion

Creating a model with tons of different explanatory variables can be very easy to do. Whether or not that creates deeper understanding of a given variable is the question. While this is a simple example, I hope that this proves helpful as you seek to make sense of some of your more complex multiple linear regression models.

In the course of the post we’ve covered the following:

  • Multiple linear regression definition
  • Building a mlr model
  • Visualization/interpretation limitations
  • Using Heatmaps in conjunction with scatter plots

If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!

The Intuitive Explanation of Logistic Regression

Introduction

Logistic regression can be pretty difficult to understand! As such I’ve put together a very intuitive explanation of the why, what, and how of logistic regression. We’ll start with some building blocks that should lend well to clearer understanding so hang in there! Through the course of the post, I hope to send you on your way to understanding, building, and interpreting logistic regression models. Enjoy!

What is Logistic Regression?

Logistic regression is a very popular approach to predicting or understanding a binary variable (hot or cold, big or small, this one or that one– you get the idea). Logistic regression falls into the machine learning category of classification.

One more example for you to distinguish between linear and logistic regression: rather than predicting how much something will sell for, you're predicting whether it will sell at all. Without further ado, let's dive right in!

Understanding Linear Regression as a Precursor

Let’s talk about the output of a linear regression. For those of you who aren’t familiar with linear regression it would be best to start there. You can visit this post to learn about Simple Linear Regression & this one for Multiple Linear Regression.

Now, knowing a bit about linear regression, you'd know that the linear regression output is equatable to the equation of a line. Simply enough, that's all we want: just a way to reinterpret one variable to lend insight into another.

So knowing this, let’s run a linear regression on a binary variable. Binary being yes or no, 1 or 0. This variable will be the candidate for our logistic regression, but we’ll get there shortly.

Building a Linear Regression to Understand Logistic Regression

Today we’ll work with the mtcars dataset. This is a classic dataset for data science learning that details fuel consumption, engine details, among other details for a variety of automobiles.

Quick glimpse at the dataset:

glimpse(mtcars)
head(mtcars)

Linear Regression for a Binary Variable

In this dataset, we have one binary variable: vs. Not knowing much about cars, I won't be able to give you a detailed explanation of what vs means, but at a high level it's representative of the engine configuration (V-shaped versus straight). I also know that the configuration has an impact on things like power and efficiency, which is something we'd be able to tease out through our models. So hopefully it will be easy to spot the difference!

Let’s build that regression model! We’ll seek to understand vs as a function of miles per gallon.

fit <- lm(vs ~ mpg
  ,data = mtcars)
summary(fit)

Here's what stands out in the regression output:

You can see an R-squared of .44, which means we can explain 44% of the variation in y, with variation in x. Not bad. We also see a p-value of less than .05. Two thumbs up there.

Now here's where things get tricky. Let's take a look at the y-intercept. We have a y-intercept of -.67. That seems odd, as vs isn't something that can go negative; in fact, it can only be 0 or 1. This is going to constitute a major issue for using linear regression on a binary variable, but more on that in a moment. First, we'll seek to better understand the data we're working with.

Visualizing Our Output

Let’s quickly visualize the two variables with a scatter plot and add the regression line to give us some additional insight.

mtcars %>%
  ggplot(aes(x = mpg, y = vs))+
  geom_point()+
  geom_smooth(method = 'lm', se = FALSE)

For starters, you can see that the y-axis is represented as a continuous variable, so all of the points sit along either 0 or 1.

As far as the x-axis goes, we can see that depending on whether the dependent variable was 1 or 0, there is a concentration towards the right or the left respectively. This lines up with what I mentioned earlier: cars with a vs of 1 get better fuel economy than those with a vs of 0.

As is very intuitive from the chart we can see the line cutting right through the two groups in linear fashion.

The Obvious Issue?

The obvious issue here is that the line goes on forever in either direction. The output at either extreme literally wouldn't make sense.

For this reason, we can't use linear regression as is. That doesn't mean it's useless; we just have to modify it such that our extremes of prediction aren't infinite.

“Generalizing” Linear Regression

The way we get to logistic regression is through what is called a "generalized linear model". Think about the function, or equation of a line, we just created through our simple linear regression. Through the linear model we have an understanding of y based on a function of x. For logistic regression, we take that function and effectively wrap it in an additional function that is responsible for "generalizing" the model.

What’s the Purpose of the Generalization or Generalizing Function?

What we're trying to do is classify a given data point, or in other words assign a vehicle with a given mpg to one of two groups: either V-shaped (0) or straight (1).

A helpful way to think about this is from the perspective of probability. For any given mpg, an automobile has some probability of being either V-shaped or straight. As such, it makes sense that the output of a model intended to shed light on that relationship would be expressed as a probability.

To sum up this idea, we want to generalize the linear output in a way that’s representative of probability.

How to Generalize

The process of generalization, as I mentioned earlier, has to do with wrapping our linear function in yet another function. This function is what's known as a link function. As I mentioned a moment ago, the link function will scale the linear output to a probability between 0 and 1. For logistic regression, a sub-category of generalized linear models, the logit link function is used.
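As a small sketch of what that link is doing, here are the two directions of the transformation; however extreme the linear output gets, the inverse link squeezes it back into the 0-to-1 range.

logit <- function(p) log(p / (1 - p))      # probability -> log-odds (the linear scale)
inv_logit <- function(x) 1 / (1 + exp(-x)) # linear output -> probability in (0, 1)

inv_logit(c(-10, 0, 10))
# roughly 0.000045, 0.5, and 0.999955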

Visualizing Logistic Regression

Let’s look at the same graph as before but fit the logistic curve this time.

mtcars %>%
  ggplot(aes(x = mpg, y = vs))+
  geom_point()+
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))

We have included the same code as before, only now our method is "glm", which specifies that we want a generalized linear model; secondly, we specify family = "binomial". This is what calls out which link function to use; in this case, "binomial" uses the logit function.

Let's interpret our new chart. As you can see, the line flattens out at either extreme such that it will never reach 0 or 1.

We can see that the line towards the middle is very straight and similar to that of our linear model.

Building Our First Logistic Regression Model

Let's go ahead and jump into building our own logistic regression model. This reads very similarly to the linear regression call, with two key differences: the first is that the call is glm; the second is that the family is "binomial", similar to what you saw in the geom_smooth call, and for the same reason.

fit <- glm(vs ~ mpg
          ,data = mtcars
          ,family = 'binomial')
summary(fit)

The first thing I want to call out with the glm function is that the dependent variable needs to be encoded as 0/1 (or stored as a two-level factor); a free-form character column won't work. To reiterate, logistic regression is a wrapping of a linear output, so the linear model that sits inside our link function needs a numeric representation of the outcome.

Logistic Regression Interpretation

Now it’s time to talk about interpretation. Also not an incredibly simple topic, but we’ll approach it as intuitively as possible.

There are three ways for one to think about logistic regression interpretation:

  • Probability
  • Odds
  • Log-odds

Each has different trade-offs when it comes to interpretability. But first… definitions.

Probability

Simply enough, probability is the measure of likelihood expressed between 0 and 1.

Odds

Odds, on the other hand, represent how frequently something happens (probability) relative to how often it doesn't (1 - probability).

The formula for that looks like this O = p/(1-p).

One thing to keep in mind here is that odds don't grow linearly with probability; they blow up as the probability approaches 1.

Let's write a bit of quick code to make that non-linear scale more intuitive. We'll create a sequence from 0 to 0.95, by 0.05, then create an odds field based on the above formula, and lastly plot the line!

probs_vs_odds <- data.frame(prob = seq(0, 0.95, by = 0.05))
probs_vs_odds <- probs_vs_odds %>% 
  mutate(inverse_prob = 1-prob,
         odds = prob / inverse_prob)
probs_vs_odds %>%
  ggplot(aes(x = prob, y = odds))+
  geom_line()

Printing probs_vs_odds gives some additional illustration. Hopefully this makes it pretty clear: when the probability is 5%, your odds are 1 to 19; conversely, when the probability is 95%, your odds are 19 to 1. Not a linear change.

Log odds

Log odds are very similar to odds, with one change: we take the log, which undoes that explosive curve. Once we take the log of the odds, we're able to visualize our model as a line once again. This is great for interpreting the function, but pretty horrible when it comes to interpreting the output.
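One practical upside of the log-odds scale is that it's the scale glm's coefficients live on, so exponentiating them turns them into odds ratios. A quick sketch, assuming fit is the logistic model we built above:

coef(fit)      # log-odds scale: additive change in the log-odds per 1 mpg
exp(coef(fit)) # odds scale: multiplicative change in the odds per 1 mpg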

Trade-offs

To sum up the spectrum of interpretability for the function and the output: probability's output is very simple to interpret, but its function is non-linear. Odds make sense, but aren't the easiest thing to wrap your mind around, and as a function the scale doesn't lend itself to interpretation. Finally, log-odds are just about impossible to interpret as an output, but the function is linear, which is great for interpretation.

No one of these is outright the best. While for the output, probability is the easiest to interpret, the probability function itself is non-linear. You may find yourself working with some combination to communicate predictions versus the function and so forth.

Performance Evaluation

For this last section, I’m going to set you up with a couple of tools that will be key in model performance evaluation.

We'll be using a simple holdout, the most basic form of cross-validation. All this means is that rather than training a model with all of your data points, you hold some of them out, train the model on the rest, generate predictions for the held-out points, and then compare the predictions with the actuals.

Below we break out the train and test groups, & generate the model.

n <- nrow(mtcars)
n_train <- round(0.8 * n) 
set.seed(123)
train_indices <- sample(1:n, n_train)
train <- mtcars[train_indices,]  
test <- mtcars[-train_indices,] 

fit <- glm(vs ~ mpg
  ,data = train
  ,family = "binomial")

From here, we'll generate predictions for our test group. If you were to look at the pred field, you would actually see the probability of vs being 1.

The challenge this leaves us with is that rather than saying these cars likely have configuration 1 and those have configuration 0, we're left with probabilities.

You'll see in the second line of the code below that I round the prediction: 0.5 and above becomes 1, and below 0.5 becomes 0.

.5 is used as a pretty standard classification threshold– although there are certainly situations that would necessitate a higher or lower threshold.

Finally we use caret‘s confusionMatrix function to visualize the confusion matrix and to also deliver a handful of performance evaluation metrics. Things like accuracy, p-value, sensitivity, specificity, and so forth.

test$pred <- predict(fit, test, type = 'response')
test$pred <- as.factor(round(test$pred))
test$vs <- as.factor(test$vs)
confusionMatrix(test$pred, test$vs)

In our case, p-value was high & accuracy was mid-tier.

Conclusion

If you’ve made it this far then hopefully you’ve learned a thing or two about logistic regression and will feel comfortable building, interpreting, & communicating your own logistic regression models.

Through the course of the post, we’ve run through the following:

  • Definition
  • Model building
  • Interpretation
  • Performance Evaluation

I hope this proves useful! Check out my other data science lessons at datasciencelessons.com. Happy Data Science-ing!

Multiple Regression in R

Introduction

No matter your exposure to data science & the world of statistics, it's likely that at some point you've at the very least heard of regression. As a precursor to this quick lesson on multiple regression, you should have some familiarity with simple linear regression. If you aren't familiar, you can start here! Otherwise, let's dive in with multiple linear regression.

The distinction we draw between simple linear regression and multiple linear regression is simply the number of explanatory variables that help us understand our dependent variable.

Multiple linear regression is an incredibly popular statistical technique for data scientists and is foundational to a lot of the more complex methodologies used by data scientists.

Multiple Linear Regression

In my post on simple linear regression, I gave the example of predicting home prices using a single numeric variable–square footage.

Let’s continue to build on some of what we’ve already done there. We’ll build that same model, only this time, we’ll include an additional variable.

fit <- lm(price ~  sqft_living + waterfront, 
   data = housing)
summary(fit)

Similar to what you would've seen before, we are predicting price by square feet of living space, only now we're also including a waterfront variable; take note of the data type of our new variable.

Parallel Slopes Model

We’ve just created what is known as a parallel slopes model. A parallel slopes model is the result of a multiple linear regression model that has both one numeric explanatory variable and one categorical explanatory variable.

The formula derived from linear regression is the equation of a line.

y = mx + b

  • y is our dependent variable
  • m is the coefficient assigned to our explanatory variable
  • x is the value of the explanatory variable
  • b is the y intercept

Having in mind the resemblance to the equation of a line: when we model home prices according to square footage alone, we derive a coefficient for x and a y-intercept that best approximate price by minimizing error.

The question we’re left with is… when we introduce a categorical variable in addition to the current numeric predictor into our regression formula, how is it handled or reflected in the model’s output?

If you’ve ever built a simple linear regression model using only a categorical explanatory variable, you may be familiar with the idea of group means across the different levels of a categorical informing the coefficients assigned. You can read a greater detailed explanation of that here.

In a parallel slopes model, the inclusion of a categorical variable is now reflected in changes to the value of the y-intercept.

You may have asked yourself why these multiple regression models are called parallel slopes models.

Let’s create a visualization of our model and then break down the meaning!

First things first, let’s build our parallel slopes model

fit <- lm(price ~  sqft_living + waterfront, 
   data = housing)
summary(fit)

Then we will add a field onto the housing dataset to represent our fitted value.

housing$pred_price <- predict(fit, housing)

Now we can visualize!

ggplot(housing, aes(x = sqft_living, y = price, col = waterfront)) + 
  geom_point()+
  geom_line(aes(y = pred_price))

Take note of the following visual. We see two lines that represent our predictions for each value of sqft_living, depending on whether the property is on the waterfront or not.

The key thing I want to highlight here is that every data point has the same coefficient assigned to sqft_living, in other words, the same slope. This is apparent from the prediction lines themselves: they are parallel, so we know the slope is the same.

What we do see is that the line for cases when waterfront was positive is higher than for those that are without a waterfront.

Let’s take a look at the model summary to gain some additional context.

summary(fit)

To understand what’s happening here, let’s think about the model without the waterfront piece first. All records will have the same y-intercept, 21,522, and all records’ value for sqft_living will be multiplied by the coefficient of 246.

What further distinguishes records with a value of "1" for waterfront is that their y-intercept increases by the waterfront1 estimate, 575,954. As such, two records with the same sqft_living but different values for waterfront would differ only by that incremental y-intercept.
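A quick sanity check of that interpretation, as a sketch (this assumes waterfront is stored as a factor with levels "0" and "1" in the housing data): score two homes that differ only on waterfront and compare the predictions.

same_size <- data.frame(sqft_living = 2000,
                        waterfront = factor(c("0", "1"), levels = c("0", "1")))

predict(fit, newdata = same_size)
# the gap between the two predictions equals the waterfront1 estimate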

Conclusion

In the last several minutes we’ve covered the following:

  • The definition of multiple regression
  • The difference between multiple regression & simple linear regression
  • The definition of a parallel slopes model
  • How to build your own parallel slopes model
  • How the parallel slopes model comes to be

I hope you’ve found this post on multivariate regression and parallel slopes models to be helpful.

Happy Data Science-ing!

Leverage Anti-joins

Introduction

Assuming you already have some background with the more common types of joins (inner, left, right, and outer), adding semi-joins and anti-joins to your toolkit can prove incredibly useful, saving you what might otherwise take multiple steps.

In a previous post, I outlined the benefits of semi-joins and how to use them. Here I’ll be following that up with a very similar explanation of anti-joins.

If you want to brush up on semi-joins first you can find that here.

Filtering Joins

Anti-joins & semi-joins are quite a bit different from the other four joins I just highlighted; the number one difference being that they are classified as what's known as filtering joins.

Syntactically they feel very similar to any other join, but the intention is not to enhance a dataset with additional columns or rows; the intent is to use these joins to perform filtering.

A filtering join is not characterized by the addition of new columns of information; rather, it facilitates keeping or reducing records in a given dataset.

The Anti-join

The anti-join is used to take a dataset and filter it down based on whether a common identifier is absent from another dataset. If the identifier shows up in both datasets… the record is excluded.

A good way to drill this idea home is to code the alternative.

Territory Analysis Example

Let's say you're a data analyst helping out your sales team, which has just hired a new account executive. This new rep needs accounts, and the sales leaders want to make sure those accounts aren't currently being worked. For this problem, let's say we have two datasets: one that contains all accounts, and another that logs all sales activity, with an activity id as well as an account id.

First dataset

| account_id | current_owner | account_name | revenue |

Second dataset

| activity_id | account_id | activity_type |

What you’ll need to do is filter the first dataset using the second dataset.

In a World Without anti-joins

Let’s first try to figure this out without the use of anti-joins

Based on what we know now around left joins… we could do the following:

accounts %>%
left_join(activity, by = 'account_id')%>%
filter(is.na(activity_type))

As you can see, we can left join the activity dataset to our accounts dataset. For accounts with no match in the activity dataset, activity_type will be NA. As such, if you want a list of the accounts not in the second dataset, you filter to where activity_type is NA.

This is fine and effectively performs the same functionality I explained above. Two annoying things: first, it gives you a new field, activity_type; second, if the same account shows up many times in the activity dataset, the join will create a new record for every match.

We could also throw a select statement on afterwards to drop that field, which just adds another line of code to achieve the same functionality.

accounts %>%
left_join(activity, by = 'account_id')%>%
filter(is.na(activity_type))%>%
select(-activity_type)

Simplifying with Anti-joins

Now let’s simplify things even more with an anti-join.

accounts %>%
anti_join(activity, by = 'account_id')

This will get us to the exact same output as each of the above examples. It will filter out accounts that show up in the activity dataset, so only accounts that are not being worked get moved to the new rep in our scenario. This approach won't add columns or rows to your dataset; it exists exclusively to filter.

Conclusion

There you have it, in just a few minutes we’ve covered a lot, and unlocked a bit of dplyr functionality that can simplify your code & workflow.

We’ve learned:

  • The difference between mutating joins and filtering joins
  • How to execute a “filtering-join” in the absence of anti-joins
  • The specific output and intent for an anti-join
  • How to use an anti-join

I hope this proves helpful in your day to day as a data professional.

Happy Data Science-ing!

Getting Started with Data Science

Introduction

When it comes to getting started in data science, things can be a bit overwhelming. You need to know statistics, programming, machine learning… and within each of those domains there are many, many sub-domains that can dominate a person's focus. Once someone is done reading everything there is to know about one thing, they may not feel any farther along than when they started.

Through the course of this post, I’ll talk a bit about some of the challenges to getting started, the best way to think about data science as a discipline, and how to get started.

Challenges of Getting Started

The Volume of Resources is Overwhelming

One of the amazing things about trying to get into data science today is that there are a ton of resources… that said, it can be a bit overwhelming to navigate it all. What should you start with, and what should you spend your time and money on? These are classic questions it seems no aspiring data scientist has the answer to.

There are a myriad of machine learning courses on Udemy, Udacity, and other learning sites, many of which cover machine learning from a very shallow perspective. I've seen that when individuals use this type of resource, they often jump from course to course trying to go deeper, but end up covering a lot of redundant material. That's not to say these resources are bad, but one requires direction when it comes to the timing of a course like that.

There are also very expensive paid programs that do very well at getting one onto their feet in the space. These programs can be excellent when it comes to the curation of content and the linear learning structure one needs to be successful, but all in all, they may not be absolutely necessary to get your start in the field of data science. In either case, gathering some fundamentals and familiarity with the space can help one be more successful and can also help validate one's interest.

The other thing that happens is one might do an ML course, then a basic Python course, then jump to an ML engineering course, and whatever else. This leads to a very disjointed education where the principles one is learning don't necessarily build off of one another.

Imposter Syndrome

An all too common phenomenon in data science is what’s known as imposter syndrome. Imposter syndrome is the feeling that maybe you don’t belong, or don’t deserve the title you have because there is so much you don’t know. Because the data science umbrella is so broad, it makes it difficult for a data scientist to have great depth in every single sub-discipline of the field. A data scientist is often looked to as the expert on all things data science, and as a result blind spots are frequently highlighted. The fact is, there is so much to know in this field, and you will struggle to learn it all. Something that is key to overcoming imposter syndrome is having a good understanding of what really falls under the domain of data science, what is essential, & what constitutes a nice-to-have.

Expensive

From master's programs, nano-degrees, & video lessons to newsletters, textbooks, & more; these resources can be expensive! While many of them undoubtedly have great content, the cost can constitute a barrier for many. Adding to the note on imposter syndrome and having a good understanding of what one really needs and in what priority: if you are brand new to the field and start off learning about deep learning, you may burn a chunk of change while feeling no closer to your first role as a data scientist.

What is Data Science, Really?

As I've alluded to, a key first step is really grasping the workflow of a data scientist and how different skills and technologies come into play.

In the following section we’ll break that down.

Data Collection

First and foremost, you need to be able to get your hands on data. Regardless of what you hope to do with it, having the skills to get it is a key first step.

If you’re not already familiar, get your feet wet with SQL. SQL stands for structured query language. It’s all about pulling data out of a database. It’s actually pretty simple as far as code goes, as the main purpose is to ask a database for data.

Let's say you're a data analyst at a B2B SaaS company. The company has all of its sales activity data in a series of database tables. Your boss wants to know which industries the company should focus on. As a first step, it would be up to you to find out which industries the company already sells to, and how well it performs in each of them. This could be a highly valuable analysis, and the first step is being able to write a SQL query.
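As a sketch of what that first step could look like from R, here's a hypothetical query run through DBI; the connection details, table, and column names are all made up for illustration.

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "sales.db")  # swap in your company's actual DBMS driver

industry_summary <- dbGetQuery(con, "
  SELECT industry,
         COUNT(*)     AS num_accounts,
         SUM(revenue) AS total_revenue
  FROM   accounts
  GROUP  BY industry
  ORDER  BY total_revenue DESC
")

dbDisconnect(con)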

If you're just starting out, I'd almost always say get started with SQL first. I won't break down the specifics of SQL here, but it is a great first step toward thinking about data, its structure, and how you'd pull it and use it in the right ways.

There are many flavors of SQL; which one you use depends on the DBMS (database management system) behind it. You'll likely hear of Redshift, MySQL, SQL Server, and MariaDB, among many others. The syntactical differences are often pretty minimal and are a simple Google search away. Don't feel a need to familiarize yourself with the unique flavors of each right off the bat; they aren't sufficiently different for that to be necessary, especially right at the start.

There are great SQL courses out there. Khan Academy has a free SQL course that's a great introduction. Codecademy & DataCamp also have excellent SQL courses that will give you the basics of what you need to get started with data collection.

Beyond SQL, there are many other places you can get data. Data can come from a csv, excel file, directly from websites, JSON, & more. While familiarization is helpful, I’d call those nice-to-haves. In your early days as a data scientist you need to know you can write SQL queries well.

Data Cleaning

Data cleaning is all about getting your data into a state where it’s usable for whatever analysis should follow.

There are many aspects to data cleaning; how do we handle missing values, are data types correct, is there any specific type of re-encoding of variables that needs to take place, and many other things– largely in consideration of the analysis to follow.

Data Wrangling

Data wrangling operates as an adjacent step to data cleaning. This also has to do with getting your data in the right format to be useful.

You may have a series of datasets that you need to combine into one; you might use what is called a join or a union to combine them. There is also the consideration of making your dataset wide versus long. I won't go into the specifics here (see the sketch just below for a taste), but having those few simple operations in your tool belt will go a long way.
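Here is a small sketch of the wide-versus-long idea using tidyr, with a made-up quarterly revenue table; the same information can be stored one column per quarter (wide) or one row per quarter (long), and you can move between the two.

library(dplyr)
library(tidyr)

wide <- tibble(account = c("A", "B"), q1 = c(10, 20), q2 = c(15, 25))

# wide -> long: one row per account per quarter
long <- wide %>%
  pivot_longer(cols = c(q1, q2), names_to = "quarter", values_to = "revenue")

# long -> wide: back to one column per quarter
back_to_wide <- long %>%
  pivot_wider(names_from = quarter, values_from = revenue)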

You can typically accomplish most data wrangling needs with SQL; with that said, R and Python provide a lot of additional functionality, pivoting being one example.

Exploratory Data Analysis

Exploratory data analysis, or EDA, is about familiarizing yourself with your data. This includes looking at samples of your dataset, looking at its data types, assessing the relationships between different combinations of variables through different charting options, and assessing individual variables using summary statistics.
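A minimal first pass might look something like this sketch, using mtcars as a stand-in for whatever dataset you're exploring.

library(dplyr)

glimpse(mtcars)            # column types and a peek at the values
summary(mtcars)            # summary statistics for every column
cor(mtcars$mpg, mtcars$wt) # a quick check on one pairwise relationship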

Prior to modeling, engaging in the EDA process helps one understand the patterns and relationships between different variables, and lends itself well to the analysis you will eventually conduct.

This is where skills with data visualization software come in. While you can create these visualizations with R or Python, it’s sometimes helpful to have them exist in a data visualization platform like Tableau, Domo, PowerBI, etc.

Everything up to this point is what I'd say might qualify you to be a data analyst. With that said, your typical data scientist might introduce more sophisticated methods or approaches to any given area we've already covered. This is not to say you won't find data analysts doing work beyond this scope either, but it's typically where I'd draw the distinction.

Statistical Analysis & Modeling

Once you have some good understanding of your data, this is where you get to flex your statistics muscles. This includes things like probability density functions, t-tests, linear regression, logistic regression, hypothesis testing and so forth.

There are many statistical tools and methods to be familiar with, and this is a major area in which to differentiate yourself as a data scientist. Data science as a field has major roots in statistics, yet statistics is often overlooked in favor of more complex machine learning approaches.

Having a strong statistics background will set you apart as a data scientist.

Machine Learning

Machine learning, or ML, largely constitutes abstractions of the more basic statistical principles you might use otherwise. A lot of the most useful ML algorithms represent different packagings of traditional linear regression, or amount to building many linear regression models with slight changes to the inputs and the data passed into them, eventually landing on an ideal model.

This is true all the way through neural networks and deep learning.

When you’re first getting your start, working to have a fundamental understanding of the statistics and math that supports more complex algorithms will prove helpful when it comes to having confidence in what you’re using, but also will help you better identify appropriate use of a given algorithm.

It's easy for ML to appear overwhelming. The label "artificial intelligence" typically gives cause for surprise when one realizes these methods are built on top of traditional statistical models.

My recommendation here is to learn the basics first. Give yourself enough time to master the data analyst track, get exposed to using a variety of statistics tools and analyses, don’t be overwhelmed by the notion of ML, and work to expose yourself to the broad expanse of machine learning algorithms and seek to understand what differentiates them from whatever else is out there.

ML Engineering

The last thing I’d like to highlight is what’s known as ML engineering.

For starters, I should clarify that this is not something I explicitly interpret as the data science domain. While many data scientists have these skills, it is not absolutely pertinent as a data scientist that you know everything there is to know about this area.

In fact, machine learning engineer is a very common job title where the entirety of one's work is in this area. ML engineering is all about deploying ML models: everything from standing up APIs where teams & individuals can interact with models, to setting up jobs that re-train models and re-run predictions, to storing these models and making them accessible.

Conclusion

Data science is a new and exciting field, but it can often appear overwhelming. Thinking of it through this lens will help you make the difficult decisions about which resources to use and in what order as you work on your data science education.

The Data Science Umbrella

  • Data Collection
  • Data Cleaning
  • Data Wrangling
  • Exploratory Data Analysis
  • Statistical Analysis
  • Machine Learning
  • Machine Learning Engineering

Thinking of your data science education in a linear fashion with these specific skills in mind should help inform your next step on your journey to become a data scientist.

Best of luck! Happy data science-ing!

Leverage Semi-joins in R

Introduction

Assuming you already have some background with the more common types of joins (inner, left, right, and outer), adding semi-joins and anti-joins to your toolkit can prove incredibly useful, saving you what might otherwise take multiple steps.

In this post, I’ll be focusing on just semi-joins; with that said, there is a lot of overlap between semi & anti, so get ready to learn a bit about both.

Filtering Joins

Semi & anti joins are quite a bit different from the other four I just highlighted; the number one difference being that they are classified as what's known as filtering joins.

Syntactically, a filtering join feels very similar to any other join, but the intention is not to enhance a dataset with additional columns or rows; the intent is to filter.

A filtering join is not characterized by the addition of new columns of information; rather, it facilitates keeping or reducing records in a given dataset.

Semi Join

The semi join is used with the intent of taking a dataset and filtering it down based on whether a common identifier is located in some additional dataset.

A good way to drill this idea home is to code the alternative.

Opportunity Dataset Example

Let's say we have a dataset from Salesforce that contains all deals or opportunities that we've worked on or are currently working on.

This opportunity dataset gives us a lot of really good information around the deal itself. Let’s suppose it looks something like this:

| opp_id | account_id | created_date | close_date | amount | stage |

Now let’s say we need to filter this dataset down such that we’re only including enterprise accounts. Well it just so happens that there is no segment field on our opportunity dataset that we can use to filter… we’ll have to leverage information from elsewhere.

Let’s pretend that the only place enterprise accounts have been tracked is in a random excel file. Let’s say the dataset looks like this:

| account_id | customer_segment |

In a World Without Semi-joins

Based on what we know now around left joins… we could do the following:

opportunities %>%
  left_join(enterprise_accounts, by = 'account_id') %>%
  filter(!is.na(customer_segment))

As you can see, we left join the enterprise accounts dataset to our main opportunities dataset; in the event that there is no matching value, customer_segment will be NA, so we can add a filter statement keeping only the non-NA cases.

This works and effectively delivers the functionality I described above. One annoying thing is that it leaves you with a new field, customer_segment, that is the same for every record.

We could also throw on a select statement thereafter to pull that field off, which just adds another line of code for you to write to meet this functionality.

opportunities %>%
  left_join(enterprise_accounts, by = 'account_id') %>%
  filter(!is.na(customer_segment)) %>%
  select(-customer_segment)

Assuming you’ve learned about inner joins as well, we could also achieve a similar functionality there with slightly simpler code.

opportunities %>%
  inner_join(enterprise_accounts, by = 'account_id') %>%
  select(-customer_segment)

Simplifying with Semi-joins

Now let’s simplify things even more with a semi-join.

opportunities %>%
  semi_join(enterprise_accounts, by = 'account_id')

This gets us to the exact same output as each of the above examples. It filters out any opportunity record that has no matching account_id in the enterprise accounts table. It won't add columns or rows to your dataset; it exists exclusively to filter.
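Since anti-joins were mentioned at the outset, it's worth noting that the same pattern flipped around gives you the complement. A quick sketch, assuming the same two datasets, that keeps only the opportunities whose account_id does not appear in enterprise_accounts:

opportunities %>%
  anti_join(enterprise_accounts, by = 'account_id')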

Conclusion

There you have it: in just a few minutes we've covered a lot and unlocked a bit of dplyr functionality that can simplify both your code and your workflow.

We’ve learned:

  • The difference between mutating joins and filtering joins
  • How to execute a “filtering join” in the absence of semi-joins
  • The specific output and intent for a semi-join
  • How to use a semi-join

I hope this proves helpful in your day to day as a data professional.

Happy Data Science-ing!

Kmeans clustering

Introduction

Clustering is a machine learning technique that falls into the unsupervised learning category. Without going into a ton of detail on different machine learning categories, I’ll give a high level description of unsupervised learning.

To put it simply, rather than pre-determining what we want our algorithm to find, we provide the algorithm little to no guidance ahead of time.

In other words, rather than explicitly telling our algorithm what we’d like to predict or explain, we kick back and hand the baton off to the algorithm to identify prominent patterns among given features.

If you’d like to learn more about that delineation, you can check out this post.

Clustering has a broad variety of applications and is an incredibly useful tool to have in your data science toolbox.

We will be talking about a very specific implementation of clustering, Kmeans. With that said, a foundational understanding of clustering lends well as a precursor to practical application. We won’t be diving into that here, but you can learn more from a conceptual level through this post.

Getting started with Kmeans

The K in Kmeans

K represents the number of groups or clusters you are seeking to identify. If you were performing a clustering analysis where you already knew there were three groupings, you would use that context to inform the algorithm that you need three groups. You won't always know how many natural groupings there are, but you may know how many groupings you need. In the examples below we'll use the iris dataset, which is a collection of measurements associated with various iris species, but there are many other use cases: grouping customers based on usage, size, and so on, or grouping prospects based on likelihood to buy and likely purchase amount. There is a wide variety of applications in business, biology, and elsewhere.

Learning about centroids

Once we've determined k, the algorithm will allocate k points at random; these points will operate as your "centroids". What is a centroid, you might ask? A centroid is what will become the center point of each cluster.

Once these center points or centroids have been randomly allocated, the Euclidean distance is calculated between each point and each centroid.

Once the distance is calculated, each point is assigned to the closest centroid.

We now have a first version of what these clusters could look like, but we're not done. Once we've arrived at this first version of our clusters, each centroid is moved to the absolute center of the group of points assigned to it. The Euclidean distances are then recalculated, and in the event that a given point is now closer to a different centroid, it is reassigned accordingly. This process repeats until the centroids stabilize and points are no longer being reassigned. The combination of points assigned to a given centroid is what comprises each cluster.
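If it helps to see that process spelled out, below is a minimal sketch of the loop just described. It is purely illustrative (the built-in kmeans() function we'll use shortly is far more robust), and X and k are hypothetical inputs: a numeric matrix and the chosen number of clusters.

kmeans_sketch <- function(X, k, max_iter = 100) {
  # randomly pick k rows of X to act as the starting centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (i in seq_len(max_iter)) {
    # assign each point to its closest centroid (squared Euclidean distance)
    new_assignment <- apply(X, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
    # stop once no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # move each centroid to the center of the points currently assigned to it
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}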

Let's build our first model

A bit of exploration to start

For this analysis, we’ll use the classic Iris dataset as I mentioned earlier. This dataset has been used time and time again to teach the concept of clustering. Edgar Anderson collected sepal & petal length/width data across three species of iris. If you’re interested in more information on this, check out the wikipedia explanation (https://en.wikipedia.org/wiki/Iris_flower_data_set)

Quick EDA

library(dplyr)    # glimpse() comes from dplyr
library(ggplot2)  # used for the plots below
head(iris)
glimpse(iris)

We are going to visualize petal length & width in a scatter plot with species overlayed as the color.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = factor(Species))) +
  geom_point()

Let’s do the same for a couple more variable combinations just for fun.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = factor(Species))) +
  geom_point()
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = factor(Species))) +
  geom_point()

We could go through this same exercise exhaustively for each combination of variables, but for the sake of brevity, we'll carry on.

Building our first clustering model

set.seed(3)
k_iris <- kmeans(iris[, 3:4], centers = 3)

When we set the seed to a given number, we ensure we can reproduce our results.

We call the kmeans function & pass the relevant data & columns. In this case, we are using the petal length & width to build our model. We declare 3 centers as we know there are three different species.

If you then print your cluster model object, k_iris, you will see a handful of useful pieces of information:

  • the number of clusters, as previously determined by the centers parameter
  • the cluster means, i.e. the average of each variable within each cluster
  • “Within cluster sum of squares” – the sum of squared distances between each point and its cluster’s centroid, reported per cluster
  • Available components – here the model delivers up a handful of other pieces of information. We’ll leverage the “cluster” component here shortly!
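Each of those pieces is also accessible programmatically from the model object; for example:

k_iris$centers   # the cluster means (centroids)
k_iris$withinss  # within-cluster sum of squares, one value per cluster
k_iris$size      # the number of points assigned to each cluster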

Performance Assessment

Here we are allowing an unsupervised machine learning method to identify naturally occurring groups within our data, but we also have access to the actual species; so let’s assess the algorithm’s performance!

A very simple way is to look at a table with species & assigned clusters. Here you’ll see that we’ll reference the model object and call the determined cluster by record as well.

table(k_iris$cluster, iris$Species)

Here we can see that all of the setosa observations were placed in the same cluster, with no other species assigned to that group. Kmeans accurately captured 48/50 of the versicolor in cluster 3 and 46/50 of the virginica in cluster 1.

We can very easily assign this classification to our iris dataset with the mutate function from the dplyr library.

iris <- mutate(iris, cluster = k_iris$cluster)

Now let's recreate our first scatter plot, swapping out species for cluster.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = factor(cluster))) +
  geom_point()

Above we can see a near-identical representation of the natural groupings of these two variables relative to the actual species.

We could certainly test a variety of variable combinations to arrive at an even better approximation of each species of iris, but now you have the foundational tools needed to do exactly that!

Conclusion

In a few short minutes we’ve learned how Kmeans clustering works and how to implement it.

  • Clustering is an unsupervised learning method
  • Kmeans is among the most popular clustering techniques
  • K is the predetermined number of clusters
  • How the algorithm actually works
  • How to create our own cluster model

I hope this quick post on the practical application of Kmeans proves useful to you in whatever your application of the technique.

Happy data science-ing!

What Every Data Scientist Needs to Know About Clustering

Introduction to Machine Learning

Machine learning is a frequently buzzed about term, yet there is often a lack of understanding into its different areas.

One of the first distinctions made with machine learning is between what’s called supervised and unsupervised learning.

Having a basic understanding of this distinction and the purposes/applications of either will be incredibly helpful as you pave your way in the world of machine learning and data science.

Supervised & Unsupervised Learning

Supervised learning is certainly the more famous category.

Supervised learning consists of classification & regression, which effectively means you determine a response variable and explanatory variables and use them to model a relationship, whether that's for the sake of explanation or prediction. (You can learn more about that distinction here.)

Classification effectively represents identifying the relationship between a categorical variable and some composition of other variables. This could be predicting which inbox an email belongs to, predicting whether someone will default on a loan, or predicting whether a sales lead will convert or not.

Regression also attempts to model a relationship between dependent and independent variables, however in this case, we’re trying to model a continuous variable. Regression is probably what you’ve heard about the most. This could be modeling home prices, revenue numbers, the age of a tree, the top speed of a car, customer product usage, etc.
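As a tiny illustration of the supervised flavor, a regression of home prices on a couple of explanatory variables might look like the sketch below; homes, price, sqft_living, and bedrooms are made-up names for this example.

# model a continuous response (price) with two explanatory variables
home_model <- lm(price ~ sqft_living + bedrooms, data = homes)
summary(home_model)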

Unsupervised learning

Now that we have all of that out of the way, let's talk about unsupervised learning.

To put it simply, rather than pre-determining what we want our algorithm to find, we provide the algorithm little to no guidance ahead of time.

In other words, rather than explicitly telling our algorithm what we’d like to predict or explain, we kick back and hand the baton off to the algorithm to identify prominent patterns among given features.

Let’s kick this off with an example of how you might use supervised and unsupervised learning methods on housing data.

One supervised approach could be to use the data to train a model to predict home prices.

An unsupervised approach could be to identify natural groupings of price or some combination of variables, e.g. price & number of rooms.

What I just explained is actually the technique we’re here to learn about. With that let’s dive in.

What Exactly Is Clustering or Cluster Analysis?

To facilitate the best understanding of clustering, a good start is to understand its fundamental purpose.

Purpose

If you’re a data scientist, one of the pre-requisites to any analysis is coming to understand your data in some capacity. One aspect of that understanding comes from the notion of similarity between records of a dataset.

To further define this idea, we want to understand the relative similarity or dissimilarity between any and all records.

One mechanism for accomplishing this is by identifying natural groupings of records. These groupings can be defined as the records that are most similar to one another and most dissimilar from records of other groupings.

If it’s not obvious, this is where clustering comes into play. It is the algorithmic approach to creating said groupings.

To further drive this point home, clustering analysis helps you answer one fundamental question: How similar or dissimilar are any two observations?

How Is it Measured?

We are trying to assess the similarity of two records, and we use the distance between those records to help define that.

The dissimilarity metric or distance is defined as 1 – similarity

The greater the distance, the greater the dissimilarity and vice-versa.

Let’s illustrate with the following hypothetical dataset.

chess <- data.frame(
  y = c(2, 4),
  x = c(5, 3))
row.names(chess) <- c('knight', 'king')

I created a little dataset that details the x & y coordinates of a chess board. I don't know if there is actually a name for each axis of a chess board (I do know one uses letters), but for simplicity's sake, go along with me.

We will use the dist function, which defaults to the Euclidean distance between two points. It puts the distance between the knight and the king at roughly 2.83.

dist_chess <- dist(chess)

You can also manually calculate the Euclidean distance between the two pieces; it also comes out to roughly 2.83.

knight <- chess[1,]
king <- chess[2,]

piece_distance <- sqrt((knight$y - king$y)^2 + (knight$x - king$x)^2)

Let’s throw up our pieces on a plot!

ggplot(chess, aes(x = x, y = y)) + 
  geom_point() +
  lims(x = c(0,8), y = c(0, 8))

Applications of Clustering

Now after all of that explanation, what in the world is clustering for? Why should you spend your time learning it?

Cluster analysis is incredibly useful any time you want an assessment of similarity.

You may work at a software company where you want to understand how different users are similar or dissimilar, possibly to alter the offering, messaging, etc.

The applications run far beyond business as well: analysis of plant & animal species, user behavior, weather, and just about anything where we can measure a pattern.

When Is the Best Time to Use it?

There may be many potentially appropriate times to use cluster analysis. One of the most prominent is during exploratory data analysis. If you're not familiar with exploratory data analysis, you can learn more about its fundamentals here.

Without diving too deep into the principles of exploratory data analysis (EDA), the key intention of EDA is to familiarize yourself with the dataset you’re working with.

Clustering can be immensely helpful during this process.

Clustering Prep

Let's jump into some of the pre-processing steps that are required before we can perform our analysis.

Don’t Forget to Scale!

Let's jump back to the chess example. Each one-unit increase in a value represents one cell in a given direction. This is a great example of when Euclidean distance makes perfect sense.

However, what if you are clustering with metrics that don't exist on the same scale, something like annual revenue and number of employees, or foot size and vertical jump?

The challenge comes when the values we’re using aren’t comparable to one another, as displayed in my previous example.

Think of the following two scenarios.

Scenario 1: you have two companies, both with the same number of employees, but one with $1,000 more in revenue than the other.

Scenario 2: now swap the differing variable; the two companies have the same revenue, but one has 1,000 more employees than the other.

The first scenario describes two companies that are likely incredibly similar; a thousand dollars of revenue is a minor difference and likely signifies two companies of very similar scale. The second scenario, on the other hand, highlights two companies that are vastly different; they could vary massively in industry, market segment, and so on.

While the difference in both scenarios was 1,000, that same difference signified two very different things in each scenario.

The underlying problem is that the variables have different averages and different amounts of variation, which is exactly the situation we just reviewed.

As such, when performing cluster analysis, it’s very important that we scale our metrics to have the same average and variability.

We will use an approach called standardization that effectively will bring the mean of our metric to 0 and the standard deviation to 1.

Technically speaking, we can manually scale a given variable as seen below.

scaled_var <- (var - mean(var)) / sd(var)

While it’s good to familiarize yourself with the logic of the calculation, it’s also very convenient to just use the scale function in R.
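As a quick sketch, here's the manual calculation side by side with scale(), using a made-up vector of revenue figures:

# hypothetical annual revenue figures
revenue <- c(50000, 120000, 75000, 300000)

# manual standardization: mean of 0, standard deviation of 1
manual_scaled <- (revenue - mean(revenue)) / sd(revenue)

# scale() produces the same values (returned as a one-column matrix)
scale(revenue)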

Scaling & Distancing Housing Data

Let’s go through the same exercise with housing data.

I pulled down this Seattle home prices dataset from kaggle. You can find that here.

Let's do a quick visualization of the first two datapoints in the dataset.

Here are the datapoints we’re working with:

# keep the full dataset intact; pull the first two rows into a subset
housing_subset <- housing[1:2, c('price', 'sqft_lot')]

ggplot(housing_subset, aes(x = sqft_lot, y = price)) +
  geom_point()

housing_dist <- dist(housing_subset)

# when I scale, I do so using the entire dataset, but
# when I take the distance I'll just use the two-row subset
housing_scaled <- scale(housing[, c('price', 'sqft_lot')])
housing_scaled_dist <- dist(housing_scaled[1:2, ])

Similarity Score for Categorical Data

We have spent the entirety of our time so far talking about the Euclidean distance between two points and using that as our proxy for dissimilarity/similarity.

What about in the case of categorical data?

For categorical data, instead of Euclidean distance, we use something called the Jaccard index.

Let me explain the Jaccard index.

Let's say you have two cases, a & b. The Jaccard index gives us the ratio of attributes present in both a & b relative to the number of attributes present in either.

You can also think of it as the ratio of the intersection of a & b to the union of a & b.

We use the same dist function we used earlier, but in this case we change the method to 'binary', which gives us 1 minus the Jaccard index as our measure of distance.

Let’s first create a dataset to play around with. Below you can see I’ve come up with two categorical variables for each company.

companies <- data.frame(
  industry = c('retail', 'retail', 'tech', 'finance', 'finance', 'retail'),
  segment = c('smb', 'smb', 'mid market', 'enterprise', 'mid market', 'enterprise'))
row.names(companies) <- c('a', 'b', 'c', 'd', 'e', 'f')
companies$industry <- as.factor(companies$industry)
companies$segment <- as.factor(companies$segment)

Make sure to declare your categoricals as factors!

Now we'll turn our categoricals into dummy variables; you may have also heard the term one-hot encoding. The idea is that each value of the categorical becomes a column, and the row value is populated with either a 1 or a 0. We'll use the dummy.data.frame function from the dummies package in R.

library(dummies)  # provides dummy.data.frame()
companies_dummy <- dummy.data.frame(companies)

As mentioned, you'll notice that for each value of industry, we now have a dedicated column: finance, retail, & tech. We see the same thing for segment.

Let’s now run our dist function.

companies_dist <- dist(companies_dummy, method = 'binary')

Here each company is compared to every other company. A & B have a distance of 0 because there is no distance between them; if you recall, they were both SMB retail. A distance of 1 would mean two companies share no attributes at all. You'll notice that C & E share only one attribute out of the three present in either, hence the distance of 0.67.
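To sanity-check that 0.67, here's the Jaccard arithmetic done by hand for companies C & E, assuming dummy.data.frame kept the rows in their original order:

c_row <- companies_dummy[3, ]  # company c: tech, mid market
e_row <- companies_dummy[5, ]  # company e: finance, mid market

both   <- sum(c_row == 1 & e_row == 1)  # attributes present in both (1)
either <- sum(c_row == 1 | e_row == 1)  # attributes present in either (3)

1 - both / either  # Jaccard distance: 1 - 1/3, roughly 0.67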

Conclusion

I hope you’ve enjoyed this breakdown of clustering.

We've covered:

  • The two main areas of machine learning: supervised & unsupervised
  • The definition, purpose, applications, and measurement of clustering
  • Preprocessing through scaling, and how to compute the distance between two records, whether numeric or categorical

Each of these lessons will prove incredibly foundational as you continue to learn and implement different clustering approaches in your analysis.

Happy Data Science-ing