Tired of Nested ifelse in Dplyr?

Using Mutate to Feature Engineer a New Categorical

Among the most helpful functions in dplyr is mutate; it allows you to create new variables, typically by layering some logic on top of the other variables in your dataset.
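As a minimal sketch of that idea, the snippet below adds one column derived from another. The data frame, its column names (user_id, app_actions, actions_per_day), and the 30-day window are all made up for illustration.

```r
library(dplyr)

# Hypothetical data: one row per user, with a count of in-app actions
df <- data.frame(
    user_id = 1:4,
    app_actions = c(3, 8, 12, 40)
)

# mutate() adds a new column computed from an existing one
df <- df %>%
    mutate(actions_per_day = app_actions / 30)  # assumes a 30-day window

print(df)
```

The original columns are left untouched; mutate simply appends the new one.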

Quick Example

Let’s say that you’re analyzing user data and you want to categorize users according to usage volume.

You decide that you want four tiers: inactive users, limited users, healthy users, and power users (labeled 'active user' in the code).

Your code might look something like this. To create the new user_type variable, we use the mutate function, declare the new variable name, then use ifelse to decide which value applies to each band. If actions within the app are 5 or fewer, the lowest threshold, we'll call the user inactive; if that condition is not met, we move into the next ifelse call to check the next threshold, and so on.

library(dplyr)

# Nested ifelse: every "else" branch opens another ifelse call
df %>%
    mutate(user_type = ifelse(
        app_actions <= 5, 'inactive user', ifelse(
            app_actions <= 10, 'limited user', ifelse(
                app_actions <= 15, 'healthy user', 'active user'
            )
        )
    ))

While ifelse is a staple and very useful, a couple of problems arise when you start nesting too many ifelse calls.

  1. Messy code that is hard to interpret & edit
  2. You write a lot of redundant code

I should add that what I wrote above wasn't too crazy, but you can very quickly end up needing double-digit ifelse statements, which create the exact problems we're talking about.

case_when to Save The Day

In many ways, R offers a lot more flexibility than SQL, but with that said, one SQL construct that many miss, unnecessarily, is CASE WHEN. Luckily for us, dplyr provides an equivalent: case_when.

Check out the exact same code snippet presented with a case_when.

df %>%
    mutate(user_type = case_when(
        app_actions <= 5 ~ 'inactive user',
        app_actions <= 10 ~ 'limited user',
        app_actions <= 15 ~ 'healthy user',
        TRUE ~ 'active user'
    ))

Again, this is a very simple example, but when you have to write twenty condition/value combinations, this saves a lot of time and adds clarity and readability. The main difference is that the left side is reserved for conditions, the ~ operates as the divider between condition and value, and the right side holds the value assigned when the condition is met. As a final note, TRUE acts as a catchall, akin to an else statement.
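To make the comparison concrete, here is a small sketch that runs both versions on a made-up data frame and checks that they agree. The sample values and the column names nested, flat, and no_catchall are assumptions for illustration. It also shows two behaviors worth knowing: case_when checks conditions in order and uses the first one that matches, and rows that match no condition come back as NA when the catchall is left out.

```r
library(dplyr)

# Hypothetical sample data, chosen to hit every band and its boundary
df <- data.frame(app_actions = c(0, 5, 6, 10, 11, 15, 16, 100))

result <- df %>%
    mutate(
        # Nested ifelse version
        nested = ifelse(
            app_actions <= 5, 'inactive user', ifelse(
                app_actions <= 10, 'limited user', ifelse(
                    app_actions <= 15, 'healthy user', 'active user'
                )
            )
        ),
        # case_when version: the first matching condition wins
        flat = case_when(
            app_actions <= 5 ~ 'inactive user',
            app_actions <= 10 ~ 'limited user',
            app_actions <= 15 ~ 'healthy user',
            TRUE ~ 'active user'
        ),
        # Same conditions without the TRUE catchall
        no_catchall = case_when(
            app_actions <= 5 ~ 'inactive user',
            app_actions <= 10 ~ 'limited user',
            app_actions <= 15 ~ 'healthy user'
        )
    )

# Both approaches label every row identically
stopifnot(identical(result$nested, result$flat))

# Rows above 15 match nothing in no_catchall, so they are NA there
print(result)
```

Because the first match wins, the order of the conditions matters: putting app_actions <= 15 first would label every user with 15 or fewer actions as 'healthy user'.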

Conclusion

In short, while ifelse statements have their place and are incredibly useful, case_when is a simple, easy-to-interpret alternative when you find yourself facing a long chain of nested ifelse statements.

I hope you find this useful in all of your feature engineering endeavors! If you found this useful and enjoyable, come check out some of our other data science posts at datasciencelessons.com! Happy data science-ing!
