6 Showing the Right Numbers, Part I

6.1 Goals

Understand where different errors in plotting come from.
Understand how you can use ggplot commands to transform data.

6.2 Making a data viz is an iterative process

The first time you write some code to produce your data viz, it won’t always look amazing. Sometimes you will try to tell R to do one thing, but the way you say it won’t be sensible or meaningful to R. You’ll have to write and rewrite your code until you get it right. I still have to do the same thing all the time.

You might also (correctly) tell R to do one thing and after the fact change your mind. Maybe there’s a typo in one of your labels. Maybe an entirely different geom would be a better choice. I personally find the process of tweaking my plots fun, but there’s no accounting for personal taste.

Here’s an example of what this iterative process can look like. From the {socviz} package we’ll access some data called county_data. This provides information at the county level in the US about different population demographics and election outcomes in 2018.

Let’s look at the relationship between population size and the share of the population that’s black. And let’s also try using geom_line(), which draws lines by connecting observations in order of the variable on the x-axis. Does this seem like a sensible choice?

library(tidyverse)
# open tools in the tidyverse

library(socviz)
# open socviz to access country_data

Data <- na.omit(county_data)

# here's the plot:
ggplot(Data) +
  aes(x = pop, y = black) +
  geom_line() + 
  scale_x_log10()

What went wrong? R did exactly what we told it to do. It just so happens that our choice of geom was not the most useful way to show the data. Maybe geom_point() would be a better option? Let’s see.

ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point() +
  scale_x_log10()

That looks much better! The problem with using a line plot to show the relationship between population and the share of the population that is black is that it creates the perception that the data points are related. The official term for this inference is connection, and it follows from a simple fact of human cognition: when things are visually tied or connected to each other, they are automatically interpreted as being related. Remember that a good data visualization is defined in reference to its goal. In this case, our goal is to show how the share of a population that is black is correlated with the overall county population size. We do not want to also convey the idea that one county is related to another.

Other geoms are consistent with our goal as well. We might also add a smoother geom to show how the average of our variable in the y-axis changes given the variable on our x-axis.

ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point() +
  geom_smooth() +
  scale_x_log10()

6.3 Grouping geometry

There are a lot of points in this data and there’s a lot of noise, too. We might wonder how reliable the overall sample trend is. For instance, we might wonder whether it makes sense to show the relationship between population size and the share that’s black all together. Counties are organized into states, and maybe the trend in one state is different than in another. As it turns out, we can easily group our geoms by certain groups or categories when we plot them.

The way we’ve done this before is to map an aesthetic, like color, to a particular category. We can do this with our data, but there’s one possible problem: there are 50 states! If we map color to state how will we even make sense of the plot, let alone have a legend that will fit in the data viz?

Thankfully we have another option for mapping aesthetics: grouping. Rather than map a color or something like that to state, we can just tell ggplot to draw a different smoothed trend per state but not to add different colors or a legend. Let’s try it out and see how it looks:

ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point() +
  geom_smooth(
    aes(group = state)
  ) +
  scale_x_log10()

What do you think? It’s a little hard to look at. Maybe it would be better if we got rid of the confidence intervals.

ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point() +
  stat_smooth(
    aes(group = state),
    se = F,
    size = 0.75,
    alpha = 0.5
  ) +
  scale_x_log10()

That’s better, but we don’t have a finished product yet. Another option to try is to add a couple of different smoothed layers, one where we map groups to states and another that just shows the overall trend. We’ll add some transparency to the points, too. We’ll also introduce a new function called stat_smooth(). This is a lot like geom_smooth() but it lets us customize a few extra things such as line transparency.

ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point(
    alpha = 0.5,
    color = "gray"
  ) +
  stat_smooth(
    aes(group = state),
    geom = "line",
    se = F,
    color = "gray30",
    linewidth = 0.5,
    alpha = 0.5
  ) +
  geom_smooth(
    color = "black",
    se = F
  ) +
  scale_x_log10()

Still not great, but getting better. We can clearly see that the overall trend is positive but that it doesn’t always hold up at the state level. In some states the relationship is actually negative and strongly so.

6.4 Small multiples

In addition to grouping geoms by categories, we can also create multiple panels that show relationships by different groups. This might be another useful approach to incorporate as we visualize the relationship between population size and the share that is black.

Enter facet_wrap(). This function lets us facet our data by a variable (or multiple variables). The result is what’s a small multiple plot.

Let’s see what this looks like by using the variable called census_region. This column in our data has four values indicating whether a state is in the Northeast, Midwest, South, or West.

ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point(
    alpha = 0.5,
    color = "grey"
  ) +
  geom_smooth(
    aes(group = state),
    se = F,
    size = 0.5,
    color = "gray30"
  ) +
  geom_smooth(
    se = F
  ) +
  scale_x_log10() +
  facet_wrap(~ census_region)

How does that look? What facet_wrap() has done is tell ggplot to make four different sub-plots, one for each census region in the data. We used the phrase ~ census_region inside the function to give it instructions to facet by the values in the census_region column in the data.

This way of splitting up the data helps us to see a number of things. For example, the South is really weird. Just about everywhere else, as county population goes up, so does the share of the population that’s black. But in the South the relationship goes in the opposite direction in several states.

We can add a few different customizations to our faceted plot. We might, for instance, want to give ggplot the freedom to fit the range of values shown in the different sub-plots to the data in a given region. Right now, it uses a range for the x-axis and y-axis that works for the lowest and highest values for the whole data. We can tell it scales = "free" to change this:

ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point(
    alpha = 0.5,
    color = "grey"
  ) +
  geom_smooth(
    aes(group = state),
    se = F,
    size = 0.5,
    color = "gray30"
  ) +
  geom_smooth(
    se = F
  ) +
  scale_x_log10() +
  facet_wrap(~ census_region,
             scales = "free")

We can also use facet_grid() similarly to facet_wrap(). This function let’s you easily facet by multiple categories at once.

There’s actually a very helpful version of facet_grid() that we can access by installing and then loading the {ggh4x} package:

install.packages("ggh4x")

The function we want to use is called facet_grid2(). What I like about this function is that it permits freeing up the axis scales, an option that the original facet_grid() doesn’t allow.

Using this function, let’s facet both by census region and whether or not the majority of people in a country voted for Trump or Clinton in the 2018 U.S. Presidential election:

library(ggh4x)
ggplot(Data) +
  aes(x = pop, y = black) +
  geom_point(
    alpha = 0.5,
    color = "grey"
  ) +
  geom_smooth(
    se = F
  ) +
  scale_x_log10() +
  facet_grid2(
    winner ~ census_region,
    scales = "free",
    independent = "y"
  )

Breaking the data up like this, the South continues to look weird. In particular, counties in the South that voted disproportionately in favor of Clinton in 2018 are the weirdest. Why do you think this would be?

6.5 Transforming data with geoms

A final thing to note about working with ggplot is that we can use it to transform data for us before it’s plotted.

We’ve already done something like that when using scale_x_log10(). This transforms the x variable to the log-10 scale. Some geoms do something like this, too. For example, geom_bar().

ggplot(Data) +
  aes(x = census_region) +
  geom_bar()

This function returns a count of the number of observations in the data per region. In this case, an observation is a county, so it tells us the number of counties in different regions.

When working with geoms like geom_bar(), we can give it extra instructions for how to summarize the data. For example, say we want proportions rather than counts:

ggplot(Data) +
  aes(x = census_region) +
  geom_bar(aes(y = ..prop..))

Oops! We forgot to do something. By default, geom_bar() takes a proportion by the x variable. What happens when you take a number then divide by that number? You get 1 of course. We need to be more explicit with geom_bar():

ggplot(Data) +
  aes(x = census_region) +
  geom_bar(aes(y = ..prop.., group = 1))

By adding group = 1 we’ve “tricked” geom_bar() into grouping the data by a variable where each value is the same (in this case, 1). This makes the function return a proportion where the count per region is divided by the total count since all regions are in our new “group” where each value is the same.

Say we also were feeling artistic and wanted to add some color to the above plot. We could map colors to regions, but remember what we talked about in the first week of class about redundancy. We’re going to produce a legend for our colors. That’s just silly, since the legend labels are going to be identical to our x-axis labels.

It turns out we can fix that:

ggplot(Data) +
  aes(x = census_region, fill = census_region) +
  geom_bar(show.legend = F)

We’ll introduce some new functions later on that let us do many of these transformations and more before we even give the data to ggplot. In my opinion, writing some code to prep the data before giving it to ggplot is better, but it’s worthwhile to know that you can do some transformations of your data under the hood with ggplot, too.