6  Drawing Maps

6.1 Goals

  • Learn how to draw maps to show distributions.
  • Get the details and scaling right.
  • Make small multiples with maps.
  • Introduce {geofacet}.

6.2 Maps are just ways to show distributions

You may think maps are a unique kind of data viz. In reality, they have a lot in common with histograms and density plots in that they’re just another way of showing distributions. While I personally think distributions are not the most interesting kind of graph you can produce, they are important because they show the central tendency and dispersion of your data. That information matters if we want to detect outliers in election outcomes to look for possible fraud or other irregularities.

Consider the election dataset from the {socviz} package. By using the glimpse() function from {dplyr} (a package in the {tidyverse}) we can quickly see what’s in the data. As you can see from the output below, the data consists of state-level observations from the 2016 U.S. Presidential election.

library(tidyverse)
library(socviz)
glimpse(election)
Rows: 51
Columns: 22
$ state        <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
$ st           <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL…
$ fips         <dbl> 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, …
$ total_vote   <dbl> 2123372, 318608, 2604657, 1130635, 14237893, 2780247, 164…
$ vote_margin  <dbl> 588708, 46933, 91234, 304378, 4269978, 136386, 224357, 50…
$ winner       <chr> "Trump", "Trump", "Trump", "Trump", "Clinton", "Clinton",…
$ party        <chr> "Republican", "Republican", "Republican", "Republican", "…
$ pct_margin   <dbl> 0.2773, 0.1473, 0.0350, 0.2692, 0.2999, 0.0491, 0.1364, 0…
$ r_points     <dbl> 27.72, 14.73, 3.50, 26.92, -29.99, -4.91, -13.64, -11.38,…
$ d_points     <dbl> -27.72, -14.73, -3.50, -26.92, 29.99, 4.91, 13.64, 11.38,…
$ pct_clinton  <dbl> 34.36, 36.55, 44.58, 33.65, 61.48, 48.16, 54.57, 53.09, 9…
$ pct_trump    <dbl> 62.08, 51.28, 48.08, 60.57, 31.49, 43.25, 40.93, 41.71, 4…
$ pct_johnson  <dbl> 2.09, 5.88, 4.08, 2.64, 3.36, 5.18, 2.96, 3.33, 1.58, 2.1…
$ pct_other    <dbl> 1.46, 6.29, 3.25, 3.13, 3.66, 3.41, 1.55, 1.88, 3.47, 1.8…
$ clinton_vote <dbl> 729547, 116454, 1161167, 380494, 8753792, 1338870, 897572…
$ trump_vote   <dbl> 1318255, 163387, 1252401, 684872, 4483814, 1202484, 67321…
$ johnson_vote <dbl> 44467, 18725, 106327, 29829, 478500, 144121, 48676, 14757…
$ other_vote   <dbl> 31103, 20042, 84762, 35440, 521787, 94772, 25457, 8327, 1…
$ ev_dem       <dbl> 0, 0, 0, 0, 55, 9, 7, 3, 3, 0, 0, 3, 0, 20, 0, 0, 0, 0, 0…
$ ev_rep       <dbl> 9, 3, 11, 6, 0, 0, 0, 0, 0, 29, 16, 0, 4, 0, 11, 6, 6, 8,…
$ ev_oth       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ census       <chr> "South", "West", "West", "South", "West", "West", "Northe…

If we want to check out the distribution of something like Donald Trump’s vote margin (r_points) in 2016, we could make a histogram. This shows us which margins for Trump were common and which were rare across states. It can also help us see whether there are any states where Trump’s margin was unusually high or low. The output below suggests there’s an observation in the data where Trump did really poorly compared to Hillary Clinton.

ggplot(election) +
  aes(x = r_points) +
  geom_histogram(
    color = "black",
    fill = "gray"
  ) +
  labs(
    x = "Republican Margin",
    y = "Count",
    title = "Distribution of the Republican Vote Margin",
    subtitle = "2016 U.S. Presidential Election",
    caption = "Data: {socviz}"
  )

A histogram can be a nice first pass at our data if we want to simply and quickly summarize the distribution of a variable. However, histograms can only show us a limited amount of information. For instance, each of the data points for vote margin is connected to a specific state, but a histogram doesn’t tell us which states fall where in the distribution. If we want to show this information as well, a better approach is to use a bar or column plot to connect specific vote margins to states.

The below code creates a column plot showing state names on the x-axis and Trump’s vote margin on the y-axis. It uses geom_col() to specify that we want to use columns. Pay special attention to the reorder() function I’ve used inside aes(). It tells ggplot that when it maps the state names to the x-axis, it should list them in order of Trump’s vote margin. Also notice the use of the theme() function. We’ll talk more about this function later; it’s a great multipurpose tool for customizing different features of your data viz. In the below code, I use it to set the angle at which U.S. state labels appear on the x-axis. If you look at the results, the state where Trump did especially badly isn’t a state at all. It’s the District of Columbia.

ggplot(election) +
  aes(x = reorder(st, r_points),
      y = r_points) +
  geom_col() +
  labs(
    x = NULL,
    y = NULL,
    title = "Trump's Vote Margin by State",
    subtitle = "2016 U.S. Presidential Election",
    caption = "Data: {socviz}"
  ) +
  theme(
    axis.text.x = element_text(
      angle = 45, hjust = 1
    )
  )

For data like this we aren’t restricted to using a column plot. We could also use a version of a “dot plot” called a “lollipop plot.” This is easy to produce using geom_pointrange(). For some extra flourishes, I made it a small multiple by Census region and mapped the color aesthetic to an indicator for whether Trump’s margin was negative in a given state. For the faceting, I used facet_grid() and included some options to make the width of the small multiples proportional to the number of states that fall into each region. To further help highlight when Trump’s margin is above versus below zero, I’ve used geom_hline() to hard-code a horizontal line at 0. I also got rid of all horizontal grid lines using options set in theme().

ggplot(election) +
  aes(
    x = reorder(st, r_points),
    y = r_points,
    color = r_points < 0
  ) +
  geom_hline(
    yintercept = 0,
    linetype = 2
  ) +
  geom_pointrange(
    aes(ymin = 0, ymax = r_points),
    size = 0.5,
    show.legend = F
  ) +
  facet_grid(
    ~ census,
    scales = "free_x",
    space = "free_x"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Trump's Vote Margin by State",
    subtitle = "2016 U.S. Presidential Election",
    caption = "Data: {socviz}"
  ) +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    axis.text.x = element_text(
      angle = 45,
      hjust = 1
    )
  )

This approach helps us to see specifically how geography relates to the distribution of Trump’s vote margin. However, because the data specifically deal with the distribution of some variable across space (U.S. states), we can also show the data using a map.

To do this, we need to take a few extra steps to ensure our data points are linked up with the relevant information for drawing a map of the United States. Thankfully, we have access to some helpful tools in R to make this happen.

To get us set up to plot a map, we can use the us_map() function, which comes from a package called {usmap}. To install it in R, just run install.packages("usmap") in the console. This function returns a dataset with the geographic boundary (or “shape”) information for U.S. states, which is all we need to draw state borders. In the below code, I use it to create a data object called us_states. I then use a function called slice_head() to look at the first five rows of the data.

## make a us_states dataset
library(usmap)
us_states <- us_map("states")

## look at the first five rows
us_states |>
  slice_head(n = 5)
Simple feature collection with 5 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2590847 ymin: -2608148 xmax: 1433830 ymax: -39564.73
Projected CRS: NAD27 / US National Atlas Equal Area
# A tibble: 5 × 4
  fips  abbr  full                                                          geom
  <chr> <chr> <chr>                                           <MULTIPOLYGON [m]>
1 02    AK    Alaska     (((-2396847 -2547721, -2393297 -2546391, -2391552 -254…
2 01    AL    Alabama    (((1093777 -1378535, 1093269 -1374223, 1092965 -134144…
3 05    AR    Arkansas   (((483065.2 -927788.2, 506062 -926263.3, 531512.5 -924…
4 04    AZ    Arizona    (((-1388676 -1254584, -1389181 -1251856, -1384522 -124…
5 06    CA    California (((-1719946 -1090033, -1709611 -1090026, -1700882 -110…

As you can see, the us_states data frame contains four columns. The first three are identifiers for states (FIPS codes, state abbreviations, and state names). The fourth column is labeled geom. The values in this column are simple features (“sf”) geometries, which store the coordinates needed to draw each state’s boundaries. We can give this information to ggplot to draw a map. Note that the code is somewhat different from the approach we’ve taken so far.

ggplot(us_states) +
  geom_sf()

Unlike the usual ggplot workflow, we don’t have to specify aesthetics to draw a map of the U.S. The geom column in the data contains all the relevant aesthetic mapping information, and geom_sf() automatically finds the geometry column of an sf data frame, so it knows what to draw.
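
If you ever work with an sf dataset whose geometry column has a different name, my understanding is that you can also map it explicitly using the geometry aesthetic. The snippet below is a quick sketch that should produce the same map:

## same map, but with the geometry column mapped explicitly
ggplot(us_states) +
  geom_sf(aes(geometry = geom))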

Now that we have data to draw a map, the next step is to connect values in the election data to our us_states data. For this to work, we need to cross-walk the datasets and then use a *_join() function to combine them. There are a number of join functions. In our case, we’re going to use left_join(), which keeps every row of the first dataset and adds matching columns from the second. We’ll talk more about joining later on. If you’re curious about what it’s doing, just run ?left_join in the console.
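
As a quick preview of what left_join() is doing, here’s a minimal sketch with two tiny made-up tibbles (the object names here are hypothetical, not part of our workflow):

## a toy example of left_join(): every row of the first table is kept,
## and matching columns from the second table are added
left_tbl  <- tibble(st = c("AL", "AK"), r_points = c(27.72, 14.73))
right_tbl <- tibble(st = c("AL", "AK"), census = c("South", "West"))
left_join(left_tbl, right_tbl, by = "st")

You can also join on columns that have different names in each dataset (e.g., by = c("full" = "state")), but below we’ll take the simpler route of creating a common column first.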

The main thing required to merge or join the datasets is a common column with consistent identifiers for states. This process of making sure different datasets have common identifiers is called “cross-walking” because, as the name suggests, we’re trying to connect data points between datasets so that we can then bring them together. The below code adds the necessary column to the us_states data and then merges the datasets.

## cross-walk the data
# we need a state column in us_states to match 
# the column in election
us_states$state <- us_states$full

## join the data
us_states_elec <- left_join(us_states, election, by = "state")

If we look at the new dataset, us_states_elec, we can see that it includes all the variables in the election data alongside the data necessary to draw state boundaries:

glimpse(us_states_elec)
Rows: 51
Columns: 26
$ fips.x       <chr> "02", "01", "05", "04", "06", "08", "09", "11", "10", "12…
$ abbr         <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL…
$ full         <chr> "Alaska", "Alabama", "Arkansas", "Arizona", "California",…
$ geom         <MULTIPOLYGON [m]> MULTIPOLYGON (((-2396847 -2..., MULTIPOLYGON…
$ state        <chr> "Alaska", "Alabama", "Arkansas", "Arizona", "California",…
$ st           <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL…
$ fips.y       <dbl> 2, 1, 5, 4, 6, 8, 9, 11, 10, 12, 13, 15, 19, 16, 17, 18, …
$ total_vote   <dbl> 318608, 2123372, 1130635, 2604657, 14237893, 2780247, 164…
$ vote_margin  <dbl> 46933, 588708, 304378, 91234, 4269978, 136386, 224357, 27…
$ winner       <chr> "Trump", "Trump", "Trump", "Trump", "Clinton", "Clinton",…
$ party        <chr> "Republican", "Republican", "Republican", "Republican", "…
$ pct_margin   <dbl> 0.1473, 0.2773, 0.2692, 0.0350, 0.2999, 0.0491, 0.1364, 0…
$ r_points     <dbl> 14.73, 27.72, 26.92, 3.50, -29.99, -4.91, -13.64, -86.77,…
$ d_points     <dbl> -14.73, -27.72, -26.92, -3.50, 29.99, 4.91, 13.64, 86.77,…
$ pct_clinton  <dbl> 36.55, 34.36, 33.65, 44.58, 61.48, 48.16, 54.57, 90.86, 5…
$ pct_trump    <dbl> 51.28, 62.08, 60.57, 48.08, 31.49, 43.25, 40.93, 4.09, 41…
$ pct_johnson  <dbl> 5.88, 2.09, 2.64, 4.08, 3.36, 5.18, 2.96, 1.58, 3.33, 2.1…
$ pct_other    <dbl> 6.29, 1.46, 3.13, 3.25, 3.66, 3.41, 1.55, 3.47, 1.88, 1.8…
$ clinton_vote <dbl> 116454, 729547, 380494, 1161167, 8753792, 1338870, 897572…
$ trump_vote   <dbl> 163387, 1318255, 684872, 1252401, 4483814, 1202484, 67321…
$ johnson_vote <dbl> 18725, 44467, 29829, 106327, 478500, 144121, 48676, 4906,…
$ other_vote   <dbl> 20042, 31103, 35440, 84762, 521787, 94772, 25457, 10809, …
$ ev_dem       <dbl> 0, 0, 0, 0, 55, 9, 7, 3, 3, 0, 0, 3, 0, 0, 20, 0, 0, 0, 0…
$ ev_rep       <dbl> 3, 9, 6, 11, 0, 0, 0, 0, 0, 29, 16, 0, 6, 4, 0, 11, 6, 8,…
$ ev_oth       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ census       <chr> "West", "South", "South", "West", "West", "West", "Northe…

Now we can make our map to show election outcomes. All we need to do is make a map like before, but this time also map the fill aesthetic to party, which is a column that tells us whether the Republicans or Democrats won the Presidential election in a given state.

ggplot(us_states_elec) +
  aes(fill = party) +
  geom_sf()

The colors are off, of course. We’ll talk more about advanced customization options for color palettes in a few weeks. For now, we can use scale_fill_manual(), which comes built into {ggplot2}, to set the palette to colors that better align with the U.S. political parties.

ggplot(us_states_elec) +
  aes(fill = party) +
  geom_sf() +
  scale_fill_manual(
    values = c("steelblue", "red3")
  ) +
  labs(
    title = "Election Results (2016)",
    fill = "Winning Party:"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Here’s another example mapping the fill aesthetic to a continuous variable, in this case Trump’s vote margin in 2016:

ggplot(us_states_elec) +
  aes(fill = r_points) +
  geom_sf() +
  scale_fill_gradient2(
    low = "steelblue",
    mid = "white",
    high = "red3",
    breaks = c(-50, -25, 0, 25, 50),
    labels = \(x) paste0(x, "%")
  ) +
  labs(
    title = "Election Results (2016)",
    fill = "Trump's Margin:"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

6.3 Mapping US counties

We can get even more granular with our visualizations by showing data at the county level. We can use the us_map() function to make a county-level dataset which we’ll call us_counties. We’ll then cross-walk and merge it with another dataset from {socviz} called county_data.

## make US county data
us_counties <- us_map("counties")

## cross-walk
us_counties$name <- us_counties$county
us_counties$state <- us_counties$abbr

## join with county_data from {socviz}
county_map_data <- left_join(us_counties, county_data, by = c("name", "state"))

One thing we can calculate from county_data is population density (population divided by land area). Let’s visualize that. The below code creates a county-level map of the U.S.

ggplot(county_map_data) +
  aes(fill = pop / land_area) +
  geom_sf() +
  labs(
    title = "Population Density by County",
    fill = "Population\nDensity"
  ) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_void()

Hmmm, not so great. That’s probably because population density has a pretty skewed distribution. We can confirm that by making a histogram:

ggplot(county_data) +
  aes(x = pop / land_area) +
  geom_histogram()

This is a case where it would help either to make the data discrete or to change the scale. Thankfully, we already have a column in the data, pop_dens, that provides a discrete version of population density. We can use that instead. To set the color palette for an ordered discrete variable, we’ll use scale_fill_brewer().

ggplot(county_map_data) +
  aes(fill = pop_dens) +
  geom_sf() +
  scale_fill_brewer() +
  labs(
    title = "Population Density",
    fill = "Population per\nsquare mile"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
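
If you’d rather keep population density continuous, another option is to change the scale of the fill rather than the data. Here’s a minimal sketch of that idea using a log10 transformation (this isn’t the approach we take above, just an alternative worth knowing about):

## keep density continuous but log-transform the fill scale
## note: newer versions of ggplot2 call this argument 'transform'
ggplot(county_map_data) +
  aes(fill = pop / land_area) +
  geom_sf() +
  scale_fill_gradient(
    low = "white",
    high = "steelblue",
    trans = "log10"
  ) +
  labs(
    title = "Population Density by County",
    fill = "Population\nDensity"
  ) +
  theme_void()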

6.4 Using geofacets

Maps are great, but sometimes the detail they provide is unnecessary or even distracting. At the same time, giving a sense of how things are organized spatially can help you communicate effectively with data. There’s a great package called {geofacet} that lets us find a happy medium. Let’s open the package and use the election data to make a small multiple with facet_geo():

## open {geofacet}
library(geofacet)

ggplot(socviz::election) +
  aes(
    x = st,
    y = 1,
    fill = pct_trump,
    label = round(pct_trump)
  ) +
  geom_tile(
    show.legend = F,
    color = "black"
  ) +
  geom_text() +
  facet_geo(
    ~ st,
    scales = "free"
  ) +
  scale_fill_gradient2(
    low = "steelblue",
    mid = "white",
    high = "red3",
    midpoint = 50
  ) +
  labs(
    title = "Trump's vote shares (2016)"
  ) +
  theme_void() 

In the above, I used a geometry layer called geom_tile(). This function tells ggplot to draw tiles (hence the name). If we didn’t facet the plot, here’s what it would look like:

ggplot(socviz::election) +
  aes(
    x = st,
    y = 1,
    fill = pct_trump,
    label = round(pct_trump)
  ) +
  geom_tile(
    show.legend = F,
    color = "black"
  ) +
  geom_text() +
  scale_fill_gradient2(
    low = "steelblue",
    mid = "white",
    high = "red3",
    midpoint = 50
  ) +
  labs(
    title = "Trump's vote shares (2016)"
  ) +
  theme_void() 

Woof!

Here’s what it would look like if we just used a normal facet.

ggplot(socviz::election) +
  aes(
    x = st,
    y = 1,
    fill = pct_trump,
    label = round(pct_trump)
  ) +
  geom_tile(
    show.legend = F,
    color = "black"
  ) +
  geom_text() +
  facet_wrap(
    ~ st,
    scales = "free"
  ) +
  scale_fill_gradient2(
    low = "steelblue",
    mid = "white",
    high = "red3",
    midpoint = 50
  ) +
  labs(
    title = "Trump's vote shares (2016)"
  ) +
  theme_void() 

Geo-faceting is clearly the way to go for this data. You can think of facet_geo() as a drop-in replacement for facet_wrap() that arranges the small multiples according to the geographical location of the observations.
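
Under the hood, facet_geo() positions the panels using a predefined layout (for U.S. states, the default is us_state_grid1). If I recall correctly, {geofacet} ships with a few alternative layouts that you can swap in via the grid argument; the sketch below assumes the us_state_grid2 layout is available:

## same plot as before, but with an alternative state layout
ggplot(socviz::election) +
  aes(
    x = st,
    y = 1,
    fill = pct_trump,
    label = round(pct_trump)
  ) +
  geom_tile(
    show.legend = FALSE,
    color = "black"
  ) +
  geom_text() +
  facet_geo(
    ~ st,
    grid = "us_state_grid2",
    scales = "free"
  ) +
  scale_fill_gradient2(
    low = "steelblue",
    mid = "white",
    high = "red3",
    midpoint = 50
  ) +
  labs(
    title = "Trump's vote shares (2016)"
  ) +
  theme_void()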

6.5 Use small-multiples to show change over time

Above, we used small-multiples with a geographically informed arrangement to make something like a map. We can also use small-multiples with actual maps to show how trends evolve over time.

Here’s an example using some county-level returns for U.S. Presidential elections from 2000 to 2020. I’m reading in a dataset stored in .rds format called countypres_2000-2020.rds and saving it as an object called us_pres. I then run some code to collapse it to the state level; we’ll talk more about those steps in the next chapter. Once I collapse it, I cross-walk and merge it with the us_states data we created earlier in this chapter.

us_pres <- read_rds("countypres_2000-2020.rds")

## aggregate to the state level
us_pres |>
  group_by(state, year) |>
  summarize(
    dem_margin = sum(democrat) / sum(democrat + republican) - 0.5,
    dem_win = ifelse(dem_margin > 0, "Democrat", "Republican")
  ) -> us_pres

us_states$state <- str_to_lower(us_states$state)

us_state_dt <- left_join(us_states, us_pres, by = c("state"))

When I collapsed the data, I created a column that indicates whether the Democratic Party won a particular state in a given election. I’ll make a small multiple map that shows which states the Democrats won in each election year from 2000 to 2020:

ggplot(us_state_dt) +
  aes(fill = dem_win) +
  geom_sf() +
  scale_fill_manual(
    values = c("steelblue", "red3")
  ) +
  theme_void() +
  labs(
    title = "The Democratic vote margin, 2000-2020",
    fill = "Winner:"
  ) +
  facet_wrap(~ year)

And there we go!

6.6 When NOT to draw a map

The first question you should ask yourself when making a data visualization is: what do I want to show? If you want to show the spatial distribution of a variable (how some quantity differs across geographical locations), a map may be a good visualization choice. If you want to show other kinds of distributions, like how opinions on issues differ between Republicans and Democrats or how support for democratic institutions has changed over time, a map may be a poor choice. A good rule of thumb is to consider whether you want to show a distribution or a relationship. If the former, a map might make sense as long as your data belong to distinct geographic units. If the latter, a scatter plot, box plot, column plot, etc. might be a better option.
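
For instance, if what you care about is how Trump’s vote share varied across Census regions rather than across precise locations, a simple box plot (just a sketch, using the same election data from {socviz}) communicates that comparison more directly than a map would:

## a non-map alternative: compare the distribution of Trump's
## vote share across Census regions
ggplot(socviz::election) +
  aes(x = census, y = pct_trump) +
  geom_boxplot() +
  labs(
    x = NULL,
    y = "Trump's Vote Share (%)",
    title = "Trump's Vote Share by Census Region",
    caption = "Data: {socviz}"
  )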