8  Drawing Maps

8.1 Goals

  • Learn how to draw maps to show distributions.
  • Get the details and scaling right.
  • Make small multiples with maps.
  • Introduce {geofacet}.

8.2 Maps are just ways to show distributions

You may think maps are a unique kind of data viz. In reality, they have a lot in common with histograms and density plots in that they’re just another way of showing distributions.

Consider the election dataset from the {socviz} package. If we want to check out the distribution of something like Donald Trump’s vote margin (r_points) in 2016, we could make a histogram. This returns information about common and less common margins observed across states. Notice in the code I’m using a new function called labs(). This is short for “labels.” As the name implies, this function lets us update the labels of our figure like the x-axis title, the y-axis title, and the overall title of the plot. We can also add captions and even include subtitles, too.

## open {tidyverse} and {socviz}
library(tidyverse)
library(socviz)

## make histogram of Trump's vote margin
ggplot(election) +
  aes(x = r_points) +
  geom_histogram(
    color = "black",
    fill = "gray"
  ) +
  labs(
    x = "Republican Margin",
    y = "Count",
    title = "Distribution of the Republican Vote Margin",
    subtitle = "2016 U.S. Presidential Election",
    caption = "Data: {socviz}"
  )

A histogram can be a nice first pass at our data if we want to simply and quickly summarize the distribution of a variable. However, histograms can only show us a limited amount of information. For instance, each of the data points for vote margin is connected to a specific state. A histogram doesn’t tell us which states fall where in this distribution. If we want to show this information as well, we could make a bar or column plot to connect specific vote margins to states. The below code creates a column plot showing state names on the x-axis and Trump’s vote margin on the y-axis. It uses geom_col() to specify that we want to use columns. Pay special attention to the use of the reorder() function that I’ve used inside aes(). This function tells ggplot that when it maps the state names to the x-axis it should list them in order based on Trump’s vote margin. Also notice the use of the theme() function. We’ll talk more about this function later. It’s a great multipurpose tool for customizing different features of your data viz. In the below code, I use it to customize the angle in which US state labels appear on the x-axis.

ggplot(election) +
  aes(x = reorder(st, r_points),
      y = r_points) +
  geom_col() +
  labs(
    x = NULL,
    y = NULL,
    title = "Trump's Vote Margin by State",
    subtitle = "2016 U.S. Presidential Election",
    caption = "Data: {socviz}"
  ) +
  theme(
    axis.text.x = element_text(
      angle = 45, hjust = 1
    )
  )

For data like this we aren’t restricted to using a column plot. We could also use a version of a “dot plot” called a “lollipop plot.” I’ve done this using geom_pointrange(). For some extra flourishes, we can make it a small multiple by country region and map the color aesthetic to an indicator for whether Trump won the majority of the votes in a given state. This is what I do in the below code. Notice that I switched the axes around, too. I personally think that dot/lollipop plots look better horizontally than vertically, but that’s just my own taste. To further help highlight when Trump’s margin is above versus below zero, I’ve used geom_vline() to hard code in a vertical line at 0. I also got rid of all vertical grid lines using options set in theme().

ggplot(election) +
  aes(
    x = r_points,
    y = reorder(st, r_points),
    color = r_points < 0
  ) +
  geom_vline(
    xintercept = 0,
    linetype = 2
  ) +
  geom_pointrange(
    aes(xmin = 0, xmax = r_points),
    size = 0.5,
    show.legend = F
  ) +
  facet_wrap(
    ~ census,
    scales = "free_y"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Trump's Vote Margin by State",
    subtitle = "2016 U.S. Presidential Election",
    caption = "Data: {socviz}"
  ) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )

This approach helps us to see specifically how geography relates to the distribution of Trump’s vote margin. However, because the data specifically deal with the distribution of some variable across space (U.S. states), we can also show the data using a map. To get us set up to plot a map we can use the map_data() function which comes from a package called {maps}. This package automatically gets installed and opened when you open the {tidyverse}. We can use this function to tell R to make a dataset that has all the latitude and longitude coordinates we need to draw the boundaries of U.S. states:

## make a us_states dataset
us_states <- map_data("state")

## look at the first five rows
us_states |>
  slice_head(n = 5)
       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>
4 -87.53076 30.33239     1     4 alabama      <NA>
5 -87.57087 30.32665     1     5 alabama      <NA>

As you can see, the us_states data frame we saved contains the longitude and latitude of state boundaries. To get ggplot to draw a map based on these values, we’ll give it to ggplot(), and then we’ll use the long and lat columns as our x and y values, respectively. To get it to draw a map, we’ll use geom_polygon(), which is a special geometry function for drawing different shapes:

ggplot(us_states) +
  aes(x = long, y = lat) +
  geom_polygon(
    color = "white"
  )

Whoa! That looks terrible! We forgot one step. To make sure that ggplot knows to properly connect the right longitude and latitude values as it draws a map, we need to make sure to group by states. There is a group column in the data that let’s us do this. All we need to do, then, is specify group = group inside the aes() function:

ggplot(us_states) +
  aes(x = long, y = lat, group = group) +
  geom_polygon(
    color = "white"
  )

And now we have a map of the continental US! It looks a little wonky though. We need to update some settings so that we get the proportions right. We can do that with the coord_map() function. We can also make the state boundaries thinner to improve the look of the figure.

ggplot(us_states) +
  aes(x = long, y = lat, group = group) +
  geom_polygon(
    color = "white",
    size = 0.05
  ) +
  coord_map(
    projection = "albers",
    lat0 = 39,
    lat1 = 45
  )

So we can draw a map. But how do we connect values in the election data to our us_state data? We need to cross-walk the datasets and then use a *_join() function to combine them together. There are a number of join functions. In our case, we’re going to use left_join(). We’ll talk more about joining later on. If you’re curious about what it’s doing, just run ?left_join in the console.

The main thing we need to do to merge or join the datasets together is to ensure that they have a common column with consistent identifiers for states. This process of making sure different datasets have common identifiers is “cross-walking,” because, as the name suggestions, we’re trying to connect data points between datasets so that we can then bring them together. The below code adds the necessary column to the election data and then merges the datasets together.

## cross-walk the data
# we need a region column in election to match 
# the column in us_states
election$region <- tolower(election$state)

## join the data
us_states_elec <- left_join(us_states, election)

If we look at the new dataset, us_states_elec, we can see that it includes all the variables in the election data alongside the data necessary to draw state boundaries from us_states:

glimpse(us_states_elec)
Rows: 15,537
Columns: 28
$ long         <dbl> -87.46201, -87.48493, -87.52503, -87.53076, -87.57087, -8…
$ lat          <dbl> 30.38968, 30.37249, 30.37249, 30.33239, 30.32665, 30.3266…
$ group        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ order        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ region       <chr> "alabama", "alabama", "alabama", "alabama", "alabama", "a…
$ subregion    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ state        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
$ st           <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
$ fips         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ total_vote   <dbl> 2123372, 2123372, 2123372, 2123372, 2123372, 2123372, 212…
$ vote_margin  <dbl> 588708, 588708, 588708, 588708, 588708, 588708, 588708, 5…
$ winner       <chr> "Trump", "Trump", "Trump", "Trump", "Trump", "Trump", "Tr…
$ party        <chr> "Republican", "Republican", "Republican", "Republican", "…
$ pct_margin   <dbl> 0.2773, 0.2773, 0.2773, 0.2773, 0.2773, 0.2773, 0.2773, 0…
$ r_points     <dbl> 27.72, 27.72, 27.72, 27.72, 27.72, 27.72, 27.72, 27.72, 2…
$ d_points     <dbl> -27.72, -27.72, -27.72, -27.72, -27.72, -27.72, -27.72, -…
$ pct_clinton  <dbl> 34.36, 34.36, 34.36, 34.36, 34.36, 34.36, 34.36, 34.36, 3…
$ pct_trump    <dbl> 62.08, 62.08, 62.08, 62.08, 62.08, 62.08, 62.08, 62.08, 6…
$ pct_johnson  <dbl> 2.09, 2.09, 2.09, 2.09, 2.09, 2.09, 2.09, 2.09, 2.09, 2.0…
$ pct_other    <dbl> 1.46, 1.46, 1.46, 1.46, 1.46, 1.46, 1.46, 1.46, 1.46, 1.4…
$ clinton_vote <dbl> 729547, 729547, 729547, 729547, 729547, 729547, 729547, 7…
$ trump_vote   <dbl> 1318255, 1318255, 1318255, 1318255, 1318255, 1318255, 131…
$ johnson_vote <dbl> 44467, 44467, 44467, 44467, 44467, 44467, 44467, 44467, 4…
$ other_vote   <dbl> 31103, 31103, 31103, 31103, 31103, 31103, 31103, 31103, 3…
$ ev_dem       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ev_rep       <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
$ ev_oth       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ census       <chr> "South", "South", "South", "South", "South", "South", "So…

Now we can make our map to show election outcomes. All we need to do is make a map like before, but this time also map fill to party, which is a column that tells us whether the Republicans or Democrats won the Presidential election in a given state.

ggplot(us_states_elec) +
  aes(
    x = long,
    y = lat,
    group = group,
    fill = party
  ) +
  geom_polygon(
    color = "white",
    size = 0.05
  ) +
  coord_map(
    projection = "albers",
    lat0 = 39,
    lat1 = 45
  )

The colors are off, of course. We’ll talk more about advanced customization options for color palettes in a few weeks. As a preview, I’ve made some of those updates below using the {coolorrr} package which you can install by writing devtools::install_github("milesdwilliams15/coolorrr"). In addition to updating the color palette, the below code customizes a few addition things. One of the major customizations is the use of theme_void(). This is one many different theme_*() functions that are essentially pre-customized versions of the basic theme() function. They come fully loaded with unique settings for how to change the look of your data viz. theme_void() gets rid of the gray background and grid, as well as all axis labels.

library(coolorrr)
set_palette(
  binary = c("blue", "red")
)

ggplot(us_states_elec) +
  aes(
    x = long,
    y = lat,
    group = group,
    fill = party
  ) +
  geom_polygon(
    color = "white",
    size = 0.1
  ) +
  coord_map(
    projection = "albers",
    lat0 = 39,
    lat1 = 45
  ) +
  ggpal(
    aes = "fill",
    type = "binary"
  ) +
  labs(
    title = "Election Results (2016)",
    fill = NULL
  ) +
  theme_void() +
  theme(
    legend.position = c(0.1, 0.1),
    plot.title = element_text(hjust = 0.5)
  )

Here’s another example mapping the fill aesthetic to a continuous variable (e.g., Trump’s vote margin in 2016):

ggplot(us_states_elec) +
  aes(
    x = long,
    y = lat,
    group = group,
    fill = r_points
  ) +
  geom_polygon(
    color = "black",
    size = 0.05
  ) +
  coord_map(
    projection = "albers",
    lat0 = 39,
    lat1 = 45
  ) +
  ## use a diverging color palette
  ggpal(
    type = "diverging",
    aes = "fill",
    breaks = c(-50, -25, 0, 25, 50),
    limits = c(-75, 75)
  ) +
  ## set labels
  labs(
    title = "Election Results (2016)",
    fill = "% Margin"
  ) +
  ## update theme and other options
  theme_void() +
  theme(
    legend.position = c(0.1, 0.1),
    plot.title = element_text(hjust = 0.5)
  )

8.3 Mapping US counties

We can get even more granular with our visualizations. Let’s merge county_map and county_data from the {socviz} package. Using this data we can draw a map to show the distribution of county-level variables.

## join county_map and county_data from {socviz}
county_map_data <- left_join(county_map, county_data, by = "id")

One of the variables in county_data is population density. Let’s visualize that. The below code creates a map of the US using county-level detail. Importantly, notice that we don’t have to use coord_map() to fix the proportions this time. The county_map data comes to us with already pre-adjusted coordinates to draw the lines of the map. It also has Alaska and Hawaii!

ggplot(county_map_data) +
  aes(x = long, y = lat, group = group,
      fill = pop / land_area) +
  geom_polygon(
    color = "black",
    size = 0.05
  ) +
  labs(
    title = "Population Density by County",
    fill = "Population\nDensity"
  ) +
  ggpal(
    type = "sequential",
    aes = "fill"
  ) +
  theme_void() 

Hmmm, not so great. That’s probably because population density has a pretty skewed distribution. We can confirm that by making a histogram:

ggplot(county_data) +
  aes(x = pop / land_area) +
  geom_histogram()

This is a case where finding a way to make the data discrete or changing its scale would be nice. Thankfully, we already have a column in the data that has a discrete version of population density. We can use that instead. Notice, however, that the default color scale isn’t great.

ggplot(county_map_data) +
  aes(
    x = long,
    y = lat,
    group = group,
    fill = pop_dens
  ) +
  geom_polygon(
    color = "white",
    size = 0.1
  ) 

Here’s an example making use of some more advanced tools to customize.

ggplot(county_map_data) +
  aes(
    x = long,
    y = lat,
    group = group,
    fill = pop_dens
  ) +
  geom_polygon(
    color = "black",
    size = 0.1
  ) +
  ggpal(
    type = "sequential",
    aes = "fill",
    ordinal = T,
    levels = 7,
    labels = c("0-10", "10-50", "50-100", "100-500", "500-1K", "1K-5K", "5K+")
  ) +
  labs(
    title = "Population Density",
    fill = "Population per\nsquare mile"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

8.4 Using geofacets

Maps are great, but sometimes the detail they provide is unnecessary or even distracting. At the same time, giving a sense for how things are organized spatially can help to communicate effectively with data. There’s a great package called {geofacet} that lets us find a happy medium. Let’s open the package and use the election data to make a small multiple with facet_geo():

## open {geofacet}
library(geofacet)

ggplot(election) +
  aes(
    x = st,
    y = 1,
    fill = pct_trump,
    label = round(pct_trump)
  ) +
  geom_tile(
    show.legend = F
  ) +
  geom_text(
    color = "white"
  ) +
  facet_geo(
    ~ st,
    scales = "free"
  ) +
  ggpal(
    type = "diverging",
    aes = "fill",
    midpoint = 50
  ) +
  labs(
    title = "Trump's vote shares (2016)"
  ) +
  theme_void() 

In the above, I used a geometry layer called geom_tile(). This function tells ggplot to draw tiles (hence the name). If we didn’t facet the plot, here’s what it would look like:

ggplot(election) +
  aes(
    x = st,
    y = 1,
    fill = pct_trump,
    label = round(pct_trump)
  ) +
  geom_tile(
    show.legend = F
  ) +
  geom_text(
    color = "white"
  ) +
  ggpal(
    type = "diverging",
    aes = "fill",
    midpoint = 50
  ) +
  labs(
    title = "Trump's vote shares (2016)"
  ) +
  theme_void()

Woof!

Here’s what it would look like if we just used a normal facet.

ggplot(election) +
  aes(
    x = st,
    y = 1,
    fill = pct_trump,
    label = round(pct_trump)
  ) +
  geom_tile(
    show.legend = F
  ) +
  geom_text(
    color = "white"
  ) +
  facet_wrap(
    ~ st,
    scales = "free"
  ) +
  ggpal(
    type = "diverging",
    aes = "fill",
    midpoint = 50
  ) +
  labs(
    title = "Trump's vote shares (2016)"
  ) +
  theme_void()

Geo-faceting is clearly the way to go for this data. You can think of facet_geo() as a way to update a faceted grid based on the geographical location of observations.

8.5 Use small-multiples to show change over time

Above, we used small-multiples with a geographically informed arrangement to make something like a map. We can also use small-multiples with actual maps to show how trends evolve over time.

Here’s an example using the gapminder dataset and a world map. Let’s access the data and look at the year column.

library(gapminder)
unique(gapminder$year)
 [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

We have a bunch of years in the data. To keeps things simple, let’s just pick three years:

years_to_keep <- c(1957, 1982, 2007)
gapminder_sm <- gapminder |>
  filter(year %in% years_to_keep)

Remember, this dataset is for countries, so instead of a map of only the U.S. we need a map of the world. The below code uses the map_data() function to get the coordinates to draw a world map. It then joins this data with the gapminder dataset. Notice the use of the countrycode() function from the {countrycode} package. This is great tool for taking a set of country names that might be of slightly different spellings and converting them to a standardized set of values for merging datasets.

## make and save world map data
world_map <- map_data("world")

## cross-walk the data
gapminder_sm <- gapminder_sm |>
  mutate(
    country = countrycode::countrycode(
      country, "country.name", "country.name"
    )
  )
world_map <- world_map |>
  mutate(
    country = countrycode::countrycode(
      region, "country.name", "country.name"
    )
  )

## merge
world_data <- left_join(
  world_map, gapminder_sm, by = "country"
) 

We now have what we need to draw three different maps, one for 1957, one for 1982, and one for 2007. Let’s show how life expectancy has changed across countries over time:

ggplot(drop_na(world_data, year)) +
  aes(
    x = long,
    y = lat,
    group = group
  ) +
  geom_polygon(
    data = world_map,
    color = "black",
    size = 0.1,
    fill = "white"
  ) +
  geom_polygon(
    aes(fill = lifeExp),
    color = "black",
    size = 0.1
  ) +
  facet_wrap(
    ~ year,
    ncol = 2
  ) +
  coord_fixed() +
  # Adding some special sauce...
  ggpal(
    type = "sequential",
    aes = "fill"
  ) +
  labs(
    title = "Life Expectancy over Time",
    fill = "Avg. Life Expectancy",
    caption = "Data: {gapminder}"
  ) +
  theme_void() +
  theme(
    legend.position = c(0.7, 0.3),
    legend.direction = "horizontal",
    legend.title = element_text(hjust = 0.5),
    plot.title = element_text(hjust = 0.5)
  ) +
  guides(
    fill = guide_legend(title.position = "top")
  )

And there we go!

8.6 When NOT to draw a map

The first question you should ask yourself when making a data visualization is: what do I want to show? If you want to show the spatial distribution of a variable (how some quantity differs across geographical locations) a map may be a good visualization choice. But, if you want to show other kinds of distributions, like how opinions on issues differ between Republicans and Democrats or how support for democratic institutions has changed over time, a map may be a poor choice.

A good rule of thumb is to base your choice on how you answer this simple question: Do you want to show a distribution or a relationship? If the former, a map might make sense as long as your data belong to distinct geographic units. If the latter, a scatter plot, box plot, column plot, etc. might be a better choice.