Become a Master of Data Wrangling in R

The dplyr package has a rich set of tools & functions that you can use for data wrangling, exploratory data analysis, feature engineering, and the like.

In the next few minutes, we’ll run through the functions that are absolutely pivotal and that you’ll find yourself using every day as a data scientist.

Select: Surface the Variables That Matter

Functions available to you in dplyr relate very closely to what you might use with SQL or even Python depending on what tasks you are performing.

The select function in dplyr acts very similar to how you would specify a columns that you want in a SQL select statement,

The major difference here is that with sql your code might look like this… (and that you wont include aggregation functions in dplyr select, but more on that later)

SELECT mpg, cyl, carb
FROM mtcars

Conversely with dplyr we first start by specifying the dataset we want to manipulate, we then use what’s known as a pipe operator %>% to effectively say, “hey this is our dataset and the next action or function I call out should be done to it!”

So we’ll effectively see the function flip with a slight change in syntax. To continue with that example, “Hey mtcars is my dataset, and I’d like to select mpg, cyl, and carb from it.”

You can pipe from one function to another. It’s super convenient and easy to get used to. More on that in a moment.

mtcars %>%
select(mpg, cyl, carb)

I personally prefer the approach of dplyr when it comes to data manipulation. It’s a lot easier to logically breakdown any data manipulation you may be performing or reviewing because you pipe function to function effectively recreating your dataset in each line.

With R or SQL your output would look something like this:

Filter

You can think of filter as dplyr’s answer to SQL’s where statement. If you don’t think in SQL, this is pretty much how you get rid of anything that doesn’t fulfill your desired criteria.

For instance, let’s say you only want to look at cars that have 4 cylinders.

You would declare mtcars %>% then call the filter function containing the criteria you care about.

mtcars %>%
filter(cyl == 4)

Your output will look like this:

You’ll see that all of the records contain 4 cylinders.

Other Conditional Operators

You can use other conditional operators as well, like < or >. You can do this on strings, logicals, etc.

mtcars %>%
filter(cyl >= 4)

We can now see a variety of examples where the cyl column is greater than or equal to 4.

Combining Filters

Imagine a scenario where, yes you want to filter to records where cyl is greater than or equal to 4… great; but you also want to filter down to records where mpg is less than 20.

You can combine those criteria using & or |. & means that both criteria have to be fulfilled, while | will include records where at least one of the two criteria are fulfilled.

You aren’t limited to two or three or four criteria. You can chain together as many filters as you’d like!

Check out these examples!

mtcars %>%
  filter(cyl >= 4 & mpg < 20)
mtcars %>%
  filter(cyl >= 4 | mpg < 20)

Pop-quiz

Off the top of your head which code snippet do you think will return more records?

The second one will always return more (unless all vehicles with 4 cylinders or more all have less than 20 mpg, then they’d return the same number of records.)

This is because the filter is less strict. Only one of the two criteria have to be fulfilled, versus both having to be fulfilled with &.

You can also separate criteria using a comma and it will behave the same as the & operator.

Don’t Forget to Pipe Your Functions Together

So far you’ve learned about two key functions, select & filter. You very likely will want to combine the outputs in succession. The secret to this is the pipe operator.

After you perform the select operation, you now have a new dataset. It’s mtcars plus whatever operation you did on top. When you follow with the pipe operator (%>%), you effectively tell R that the next function will be acting as a continuation of that lines output! Check out below how we combine those last two actions.

mtcars %>%
  select(cyl, mpg, carb)%>%
  filter(cyl >= 4 & mpg < 20)

For instance, if you didn’t include the gear field in your select function and then piped that output into a filter that said gear == 4, it would give you this error:

It does this because mtcars is no longer the dataset you’re working with, it’s the version of mtcars that only includes the three columns listed in the select function.

Always Feel Free to Save What You’ve Done

Before we move onto other functions you can use, it’s also worth telling you how to effectively save your dataset.

You’ll do it the same way you’d declare any variable in R. “<-

mtcars_sub <- mtcars %>%
  select(cyl, mpg, carb)%>%
  filter(cyl >= 4 & mpg < 20)

Arrange: Let’s Get Orderly!

Alright, now that that’s out of the way.. lets jump into the arrange function.

To continue relating each function to its SQL counterpart, arrange is the ORDER BY equivalent.

mtcars %>%
  filter(cyl >= 4)%>%
  arrange(mpg)

After we filter we then pipe into an arrange function where we include mpg. We will now see everything sorted smallest to largest mpg.

Now lets say you want to reverse the order on this looking at mpg highest to lowest. All you will do is throw in a desc().

mtcars %>%
  filter(cyl >= 4)%>%
  arrange(desc(mpg))

We now see the same output sorted from greatest to least mpg.

Mutate

The mutate function is how we go about creating a new column. This could be adding multiple column values together, creating categorical buckets according to a columns values, binning categoricals to be less granular, or additional columns for whatever else.

I’m going to jump into an example here, but keep in mind that my familiarity with cars is limited… so my understanding of the utility of the metric I’m about to create is a little naive.

Let’s say you want to know what the horsepower per cylinder would be. You’d declare the new variable name, and then include the logic for how to calculate this new field.

mtcars %>%
  mutate(hp_per_cyl = hp / cyl)

Group_by & Summarize

To wrap this up, the final major dplyr operations to cover is its grouping & aggregating functionality.

In SQL when performing an aggregation, you use GROUP BY to declare your grouping variables. However the logic for aggregation (mean, sum, max, min, count, etc.) sits in the SELECT statement– which we visited at the beginning.

In dplyr, we first will pipe into the group_by() function specifying our grouping variables. A little different, but honestly a lot simpler to follow. Then all logic for aggregation follows in the summarize function that you pipe into right after.

For this example, lets group by the cylinder count & then look at the mean miles per gallon and horse power.

You’ll see we pipe into group_by specifying cyl, and then pipe into summarise— where we first declare the name of our new aggregate field and then perform the operation.

mtcars %>%
  group_by(cyl)%>%
  summarize(mean_mpg = mean(mpg),
            mean_hp = mean(hp))

As you may have guessed, higher cylinder count is inversely related to miles per gallon, and relates closely with horse power.

Realistically, you would probably just evaluate the correlation across these specific variables, but it serves to show operation of group_by & summarize.

Conclusion

Dplyr is an incredibly useful library that includes the functionality of other data wrangling languages and its format for piping one function to another allows you to keep track of what otherwise could be an incredibly confusing query.

Dplyr also offers much much more that we didn’t get a chance to detail here, but what we’ve covered will serve as an excellent foundation as you branch into some of dplyr‘s more complex functionality like mutate_each, summarise_if, and many others.

Let me know what you liked and what you’d like to learn more about!

Check out some of my other Medium posts at @datasciencelessons and follow me on Twitter at @data_lessons.

As always, happy data science-ing!

Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: