How to do feature engineering with categorical data in R

Purpose:

Machine learning models have difficulty interpreting categorical data; feature engineering lets us re-contextualize that data to improve the rigor of our models. It also adds layers of perspective to data analysis. The big question feature engineering answers is: how do I use my data in interesting & clever ways to make it more useful?

What it is & what it is not

Feature engineering is not about cleaning data, removing null values, or other similar tasks; it's about altering variables to improve the story they tell. It is about leveraging content knowledge & data exploration.

Methods for categorical data

Bins/Buckets

Often when you're preparing to use categorical data as a predictor, you might find that some of the levels of that variable occur very sparsely or that the variable's levels are seriously redundant.

Any decision you make to start grouping variable levels should be strategically driven.

A good start here for both approaches is the table() function in R.

I’ll be using the UCI Bank marketing dataset which details the demographics of customers & whether the marketing campaign was successful or not. The dataset can be found here: http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
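If you'd like to follow along, here's a minimal loading sketch; it assumes you've unzipped bank.zip into your working directory & uses bank.csv from the archive (the UCI files are semicolon-separated):

library(dplyr)    # mutate(), group_by(), case_when(), etc.
library(ggplot2)  # bar charts used below

# stringsAsFactors = TRUE treats text columns as factors, which some modeling helpers expect
bank.df <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)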

The idea here is to identify levels with too few records, or alternatively groupings that are more indicative of what the data is trying to tell us.
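For example, a quick frequency count across each level of job:

table(bank.df$job)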

Sometimes a table is a little bit more difficult to ingest, so throwing this into a bar chart works well also.

bank.df %>%
   group_by(job) %>%
   summarise(n = n()) %>%                # count records per job level
   ggplot(aes(x = job, y = n)) +
   geom_bar(stat = "identity") +
   theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate labels so they don't overlap

For the sake of this exercise, let's say we actually wanted to understand profession in terms of technology use within a given role. In that case we would begin to bin each of these job titles accordingly.

There is a nifty case_when() function you can use within dplyr's mutate() that is very handy when you are reassigning many different levels of a variable, rather than nesting ifelse() calls. This function is also very useful when converting numeric variables to categorical data. Please drop a comment if that's something you'd want to learn more about.

bank.df <- bank.df %>%
   mutate(technology_use = 
          # map each job title to an assumed level of technology use
          case_when(job == 'admin.' ~ "med",  # the raw data spells this level with a trailing period
                    job == 'blue-collar' ~ "low",
                    job == 'entrepreneur' ~ "high",
                    job == 'housemaid' ~ "low",
                    job == 'management' ~ "med",
                    job == 'retired' ~ "low",
                    job == 'self-employed' ~ "low",
                    job == 'services' ~ "med",
                    job == 'student' ~ "high",
                    job == 'technician' ~ "high",
                    job == 'unemployed' ~ "low",
                    job == 'unknown' ~ "low"))

As you can see above, I create a new field called technology_use and assign each job a level according to its technology use. I'm sure you could argue for different assignments, but for the sake of the example, I didn't think too much about it.

Now let's quickly review this new field.

table(bank.df$technology_use)                        # raw counts per level
round(prop.table(table(bank.df$technology_use)), 2)  # proportions, rounded to 2 decimals

How to decide how to bin

Binning should depend on what you're trying to understand. Let's say job were far more granular and we had several marketing-related titles: CMO, marketing analyst, digital marketing manager, etc. You might want to understand the impact of a department, a skill set, or just a slightly higher level of granularity.

Be sure to leverage table(), prop.table(), & bar charts to get a better sense of how variable levels could best be recategorized.
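case_when() is just as handy for the numeric-to-categorical conversion mentioned earlier. Here's a minimal sketch; the cut points & the life_stage name are hypothetical, purely for illustration:

bank.df <- bank.df %>%
   mutate(life_stage = 
          case_when(age < 30 ~ "young",           # hypothetical cut points,
                    age < 60 ~ "working-age",     # evaluated in order
                    TRUE     ~ "retirement-age")) # everything else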

Dummy Variables & One Hot Encoding

Let's say you have a two-level categorical variable like default, which machine learning models still won't like. Rather than leaving the levels as 'yes' & 'no', we will encode it as a dummy variable: the numeric representation of a categorical variable. Any time the value of default is yes, we'll encode it to 1, and 0 otherwise. For two mutually exclusive levels, this eliminates the need for an additional column for no, as it's implicit in the first column.

bank.df <- bank.df %>%
  mutate(
    defaulted = ifelse(default == "yes", 1, 0))  # 1 when default == "yes", 0 otherwise

We've talked about creating a single dummy-variable column, but we should also talk about one-hot encoding. One-hot encoding is effectively the same thing, but for variables with many levels: each level gets its own column containing 1 for the rows with that value and 0 everywhere else.

dummyVars from the caret package is very useful here.

library(caret)
dmy <- dummyVars(" ~ .", data = bank.df)                    # build the encoding scheme across all variables
bank.dummies <- data.frame(predict(dmy, newdata = bank.df)) # apply it to get the encoded frame
head(bank.dummies)                                          # the full frame is too large to print outright

Above, we load the caret package, run the dummyVars() function across all variables, then build a new dataframe from the one-hot encoded columns it produced.

Let's take a look at the new table.

str(bank.dummies)

I didn't include all of the columns, but you can see that it left age alone and then one-hot encoded job & marital. You can see that the datatype of every column is now numeric.

Be aware of sparse fields, and be ready to compare the binning approach with one-hot encoding to get a more effective result.
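One quick way to spot those sparse fields, sketched here with caret's nearZeroVar() run over the encoded frame from above:

# flag encoded columns that are nearly constant, i.e. levels that almost never occur
nzv <- nearZeroVar(bank.dummies, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # anything flagged here is a candidate for binning instead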

Combining Features or Feature Crossing

Feature crossing is where you combine multiple variables into a single feature. Sometimes the combination produces predictive performance that outperforms what the variables could've done in isolation.

Similar to how you might explore this for binning, group by the two variables you’re considering crossing and get a count of each combination.

bank.df %>% 
   group_by(job, marital) %>%
   summarise(n = n())  # record count for each job/marital combination

A visualization of this is often a lot easier to interpret.

bank.df %>% 
  group_by(job, marital) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = job, y = n, fill = marital))+
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
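If you just want the crossed feature itself as a new categorical column, here's a minimal sketch (the job_marital name is my own choice):

bank.df <- bank.df %>%
  mutate(job_marital = paste(job, marital, sep = "_"))  # e.g. "management_married"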

To knock out the crossing & encoding steps in one go, you can use the same dummyVars() function that I showed you earlier.

dmy <- dummyVars( ~ job:marital, data = bank.df)
bank.cross <- predict(dmy, newdata = bank.df)  # one column per job/marital combination

Keep in mind that when you combine multiple variables, some of the new values may be very sparse. For the same reason you would bin a single categorical variable, review the output of any feature crossing to verify whether the inputs should be binned first.
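A quick sketch of that review: since each crossed column is a 0/1 indicator, its column sum is the number of records with that combination.

cross.counts <- colSums(bank.cross)
head(sort(cross.counts), 10)  # the ten rarest job/marital combinations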

Conclusion

I hope this has been a helpful introduction to feature engineering with categorical variables in R. There are many additional methods for numeric variables & for combinations of numeric & categorical; we can use PCA, among other things, to improve the predictive power of explanatory variables. I'll jump into those in another post. For a lesson in PCA, check out my post at https://datasciencelessons.com/2019/07/20/principal-component-analysis-in-r/ or check out my blog at https://datasciencelessons.com/

Happy Data Science-ing!
