COVID-19: Data Visualization Mastery

I recently made a post where we explored the data recently put out by John Hopkins University on COVID-19; while we were able to make some interesting discoveries, it seemed pertinent to gather data that provided a more full picture.

In my search I came across the following dataset acquired and distributed by Tableau. This data actually leverages the JHU dataset, but they do some additional pre-cleaning for us.

Visit this page to get the data yourself: https://www.tableau.com/covid-19-coronavirus-data-resources

Let’s immediately dive right into some basic exploratory data analysis.

Data Familiarization

Download it and pull it in!

covid_tab <- read.csv('covid_tableau.csv')

Let’s take a look using the head function!

head(covid_tab)

My first call out here is that this is structured at a date, country, province, case type level. Similar to the JHU data, but with a critical difference. That being the case type. If we can understand the breakdown of active, deaths, and recovered in any give place and time, that does volumes for us making an assessment of the situation and its progress.

You’ll also note that the difference column represents the change of each case type day to day versus cases is the current total.

glimpse(covid_tab)

Glimpse also provides a bit of context into the dataset dimensions, data types, samples, etc.

summary(covid_tab)

This gives a handful of interesting stats depending on datatype. Number of occurrences of values for categoricals, min, max, quartile, mean, median for numerics.

Another quick note, I was seeing that we had the following values for case types:

I wanted to understand if active, deaths, and recovered added up to confirmed, if they were mutually exclusive, or how those different values would be leveraged.

covid_tab %>%
  group_by(Country_Region, Case_Type ) %>%
  summarise(max(Cases))

Grouping by the Country_Region & Case_type, since I new the Cases were cumulative, I just took the max and made an assessment of how they added up.

Take a look at this sample!

Using Afghanistan and Albania as examples, you can see that active, deaths, & recovered add up to confirmed.

One quick note before we move on, you may have noticed when we ran glimpse, that the Date column was being called out as a factor.

Why is this an issue?

Because any visualization will treat factor values as strings and order them accordingly.

Take a look at the below chart. Where we’re seeing that big drop in the middle actually represents days earlier in the month.

Use a basic command like this to clean it up. Be aware of the format you need to use. I wont dive into here, but that’s important to let the function know where the data is “coming from”.

covid_tab$Date <- as.Date(covid_tab$Date, format = "%m/%d/%Y")

If we ran that command again now, we get this

There are many other steps one could go through as part of exploratory data analysis; rather than continue that here, we are going to dive in deep with some data visualization.

Data Visualization

Now that we have a decent understanding of the data

Jumping in, I just showed this above, but I’ll show how we got there.

Here we want to see cumulative confirmed cases across the board.

I take the dateframe, filter to ‘confirmed’ cases, group by date, summarizing by cases and create a bar chart.

covid_tab %>%
  filter(Case_Type == 'Confirmed')%>%
  group_by(Date) %>%
  summarise(Cases = sum(Cases))%>%
  ggplot(aes(x = Date, y = Cases))+
  geom_bar(stat = 'identity')+ 
  theme(axis.text.x = element_text(angle = 45))

We could then do this again for each of the case types. That typically would make a lot of sense for an exploratory data analysis process.

covid_tab %>%
  filter(Case_Type == 'Deaths')%>%
  group_by(Date)%>%
  summarise(Deaths = sum(Cases))%>%
  ggplot(aes(x = Date, y = Deaths))+
  geom_bar(stat = 'identity')+ 
  theme(axis.text.x = element_text(angle = 45))

Similarly we can do this to represent daily change of deaths, just by changing aggregating Deaths by the Difference field.

covid_tab %>%
  filter(Case_Type == 'Deaths')%>%
  group_by(Date)%>%
  summarise(Deaths = sum(Difference))%>%
  ggplot(aes(x = Date, y = Deaths))+
  geom_bar(stat = 'identity')+ 
  theme(axis.text.x = element_text(angle = 45))

Visualizing Multiple Variables

Similar to what we created above, below we do the same now breaking totals out by their Case_Type.

As you can see in my code, I’ve removed all occurrences of the term ‘confirmed’ from Case_Type. I suppose we could include it and it would just look like a summing bar next to the more granular bars.

covid_tab %>%
  filter(!grepl('Confirmed', Case_Type)) %>%
  group_by(Date, Case_Type) %>%
  summarise(Cases = sum(Cases))%>%
  ggplot(aes(x = Date, y = Cases, fill = Case_Type))+
  geom_bar(stat = 'identity')+ 
  theme(axis.text.x = element_text(angle = 45))

While this view is ok, I actually prefer to include the position argument ‘dodge’. Then this same chart is going to look a bit more comparative across the different values of case_type.

covid_tab %>%
  filter(!grepl('Confirmed', Case_Type)) %>%
  group_by(Date, Case_Type) %>%
  summarise(Cases = sum(Cases))%>%
  ggplot(aes(x = Date, y = Cases, fill = Case_Type))+
  geom_bar(stat = 'identity', position = 'dodge')+ 
  theme(axis.text.x = element_text(angle = 45))

Now we have a better perspective into how these numbers are changing on a daily basis respective to one another. We saw a major dip to active cases early march and a resurgence in the last week or so.

Recovered is increasing at a consistent pace, and even eclipsed the total active cases only to trade the lead once again a couple weeks later.

Cases by Daily Difference

What we’re seeing here is a major drop to the new occurrence of active cases towards the end of February, but then a big spike towards mid march.

covid_tab %>%
  filter(!grepl('Confirmed', Case_Type)) %>%
  group_by(Date, Case_Type) %>%
  summarise(Difference = sum(Difference))%>%
  ggplot(aes(x = Date, y = Difference, fill = Case_Type))+
  geom_bar(stat = 'identity')+ 
  theme(axis.text.x = element_text(angle = 45))

What should help better inform where this came from is breaking this out by geographical variables. We’ll start higher level with country and then look at province levels as well.

A couple of things to explain with this code. The legend in RStudio was blown up because of the 100+ countries accounted for in this dataset. So I created country totals taking the highest case count at a given time. I then joined this into the familiar command we’ve been running with and used it to filter out countries who haven’t reached a given number of cases, which gives us the highest 16 or so countries.

From there we create a similar visualization, this time replacing case type with country. We can incorporate multiple categoricals into our visualization, but we’ll save that for later.

country_totals <- covid_tab %>%
  group_by(Country_Region)%>%
  summarise(max_cases = max(Cases))

covid_tab %>%
  left_join(country_totals, by = 'Country_Region') %>%
  filter(grepl('Confirmed', Case_Type) & max_cases >= 1000) %>%
  group_by(Date, Case_Type, Country_Region) %>%
  summarise(Cases = sum(Cases))%>%
  ggplot(aes(x = Date, y = Cases, fill = Country_Region))+
  geom_bar(stat = 'identity')+ 
  theme(axis.text.x = element_text(angle = 45))

As you’d guess, you can see that China has dominated this number for the majority of the outbreak, with Italy emerging since late February.

Lets now check this with the daily differences. Here we’re looking at the daily change to the number of active cases.

covid_tab %>%
  left_join(country_totals, by = 'Country_Region') %>%
  filter(grepl('Active', Case_Type) & max_cases >= 1000 & Date > '2020-02-15') %>%
  group_by(Date, Case_Type, Country_Region) %>%
  summarise(Difference = sum(Difference))%>%
  ggplot(aes(x = Date, y = Difference, fill = Country_Region))+
  geom_bar(stat = 'identity', position = 'dodge')+ 
  theme(axis.text.x = element_text(angle = 45))

This produces an interesting glimpse into the intensity of change at given times in either direction.

Incorporating multiple categoricals

This is not the most beautiful visualization given the many levels to country and the varying volume of each of the panes in the following faceted chart. With that said breaking this out across multiple categorical variables can provide an interesting perspective.

Similar to our previous command, here we add the facet_wrap command on the bottom which causes us to break the chart into a different pane for each value of the included categorical variable.

covid_tab %>%
  left_join(country_totals, by = 'Country_Region') %>%
  filter(!grepl('Confirmed', Case_Type) & max_cases >= 1000 & Date > '2020-02-15') %>%
  group_by(Date, Case_Type, Country_Region) %>%
  summarise(Cases = sum(Cases))%>%
  ggplot(aes(x = Date, y = Cases, fill = Country_Region))+
  geom_bar(stat = 'identity', position = 'dodge')+ 
  theme(axis.text.x = element_text(angle = 45))+
  facet_wrap(~Case_Type)

In the active pane, we can see the stark drop to China’s active count, while Italy quickly has risen to eclipse China for most cases in the last week.

For deaths we can see China holding still through this window of time, while Italy grows.

With the obvious lag behind China’s outbreak, other countries are still days or weeks behind their recovery curve.

We could do this same thing just excluding China to get a better perspective into the relative situation where the outbreak is more current.

What’s interesting here is what we see in the first pane. Not focusing on volume, rather the shape of any given country’s curve.

While we see some exponential growth, South Korea conversely has experienced this symmetrical distribution; where almost as quickly as cases rose, they have fallen.

Let’s again facet this chart to see the difference more clearly.

Keep in mind some of this pattern could be due to testing ramp. There is talk around the mass testing that South Korea rolled out very quickly. I don’t have specifics on how the testing rollouts/ramps differ country to country, so I cannot speak to that, but it’s something to keep in mind.

Facet by country

Let’s do that same type of analysis, now faceting by country.

covid_tab %>%
  filter(!grepl('Confirmed', Case_Type) & grepl('China|Korea|US|United Kingdom|Italy|Spain|Iran', Country_Region) & Date > '2020-02-15') %>%
  group_by(Date, Case_Type, Country_Region) %>%
  summarise(Cases = sum(Cases))%>%
  ggplot(aes(x = Date, y = Cases, fill = Case_Type))+
  geom_bar(stat = 'identity', position = 'dodge')+ 
  theme(axis.text.x = element_text(angle = 45))+
  facet_wrap(~Country_Region)

Below, I’ll remove China to improve the interpretability of the other lagging countries.

Head-to-head Comparison with Italy

The US has been compared to Italy when it comes to our current trajectory. It’s been said many times that we’re 10 days behind them. Lets look into it!

There are several ways one could do this, but what I wanted to do was identity the first case of coronavirus in a given country, and from there give each subsequent day and number of days since outbreak. This way we can try to come up with a perspective into the similar or dis-similar outcomes across each country.

The first thing I do is create a first day dataset, where by grouping down to the country and filtering to outbreak days, I take the first date.

I then join that into the dataset we’ve been working with and include that metric as our x-axis.

first_day <- covid_tab %>%
  filter(grepl('Active', Case_Type) & grepl('US|Italy', Country_Region) & Cases > 0) %>%
  group_by(Country_Region)%>%
  summarise(first_date = min(Date))

covid_tab %>%
  filter(grepl('Active', Case_Type) & grepl('US|Italy', Country_Region))%>%
  group_by(Date, Case_Type, Country_Region)%>%
  summarise(Cases = sum(Cases))%>%
  left_join(first_day, by = 'Country_Region')%>%
  mutate(days_since_outbreak = as.numeric(as.Date(Date) - as.Date(first_date))) %>%
  ggplot(aes(x = days_since_outbreak, y = Cases, fill = Country_Region))+
  geom_bar(stat = 'identity', position = 'dodge')

What’s interesting is the number of days since first case, and the fact that our first case was actually identified as early as 1/23/20. Growth in the US didn’t until about a month later.

Italy’s first day was 1/31/20, about a week after the US’s first date, yet they experienced that exponential growth far quicker.

Again you can attribute these numbers to reporting or any variety of things, but it’s still interesting.

I did this again, but selected the “first date” according to the date it appeared to very quickly pick up. Here we can see that from the time the outbreak really started to take hold, we’re seeing infection rates at a faster rate.

One could immediately speculate saying that maybe the US has more quickly mobilized testing capability, and that could be why we see quicker growth from the US. It could be “quicker discovery”.

South Korea is the country is getting the most credit for their speed to testing, let’s see how they fall in line comparatively.

As we can see, growth held similar for a couple of weeks until Korea’s curve started to flatten. This doesn’t necessarily tell us anything about testing capability nor speed to said testing, but it does produce an interesting picture of the comparative growth and decline.

Growth

One of the greatest concern is the growth of the virus. How quickly will the virus sweep across any given country. Similar to some of the other analysis we’ve done, I wanted to observe the growth rate by country of active cases.

I calculated growth rate as follows: Today’s total cases – yesterday’s cases/yesterday’s cases.

You’ll see in the code snippet below that I used the lag function after ordering by date and grouping by country to pull the prior day’s total onto the current day’s record. From there, I simply created a field with the previously defined formula.

I also revisited the idea of establishing day 0 as the first day the virus was active.

I then visualized using a line chart breaking out countries by col.

covid_growth <- covid_tab %>%
  filter(grepl('Active', Case_Type) & grepl('China|Korea|US|United 
  Kingdom|Italy|Spain|Iran', Country_Region))%>%
  group_by(Date, Case_Type, Country_Region) %>%
  summarise(Difference = sum(Difference), Cases = sum(Cases))%>%
  arrange(Date)%>%
  group_by(Country_Region) %>%
  mutate(lag_cases = lag(Cases),
         growth = round((Cases - lag_cases) / lag_cases,2))%>%
  ungroup()%>%
  left_join(first_day, by = 'Country_Region')%>%
  mutate(days_since_outbreak = as.numeric(as.Date(Date) - 
  as.Date(first_date)))
  

covid_growth%>%
  filter(days_since_outbreak > 0)%>%
  ggplot(aes(x = days_since_outbreak, y = growth, col = Country_Region))+
  geom_line(stat = 'identity')

The reason this looks promising is because of what’s known as the “doubling rate”. One metric many have assessed is the number of days it takes a given country to double the current day’s total. With a consistent line around 1, we could assume a doubling rate of 1 day. Clearly there have been days with even a 4X increase in the number of cases; the good thing is we’re seeing a consistent decline in the percentage growth day over day.

Conclusion

I hope this has proven useful as you seek to better understand the ongoings of the world, your company, or whatever you’re working on!

Happy data-science-ing!

Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: