10  Layering Complexity and Adding Labels and Text

10.1 Goals

  • Learn about adding text and labels to figures.
  • Introduce the {geomtextpath} package.

Last time we talked about how to manipulate data with {dplyr} and {tidyr} tools. Now, we’re going to build on some of these skills while also honing our data viz.

First, let’s get our data ready to use. Last time we looked at a mix of international conflict data and democracy data. Today, let’s home in on conflict data specifically. The below code opens the R packages that we need for our analysis. We’ll use the {peacesciencer} R package to access our data. While the package has a number of functions for constructing and merging datasets together, it also provides some discrete datasets that we can work with as well. The one we’ll use today is called gml_mid_disps, and you can learn more about it by writing ?gml_mid_disps in the R console.

## Let's get ready to go...
library(tidyverse) 
library(peacesciencer)

## look at the data
glimpse(gml_mid_disps)
Rows: 2,174
Columns: 11
$ dispnum  <dbl> 2, 3, 4, 7, 8, 9, 11, 12, 13, 14, 15, 16, 19, 20, 21, 22, 23,…
$ styear   <dbl> 1902, 1913, 1946, 1951, 1856, 1889, 1938, 1938, 1863, 1895, 1…
$ stmon    <dbl> 5, 5, 5, 10, 7, 12, 3, 3, 4, 10, 2, 12, 1, 11, 11, 6, 9, 8, 3…
$ outcome  <dbl> 6, 4, 6, 1, 1, 4, 4, 4, 5, 4, 6, 4, 5, 3, 5, 1, 2, 2, 5, 5, 6…
$ settle   <dbl> 1, 3, 1, 3, 2, 2, 2, 1, 1, 3, 1, 3, 1, 1, 3, 2, 1, 1, 3, 3, 1…
$ fatality <dbl> 0, 0, 2, 2, 6, 0, 0, 0, 3, 0, 0, 0, 6, 0, 0, 2, 0, -9, 0, 0, …
$ mindur   <dbl> 1, 177, 183, 105, 237, 13, 2, 204, 250, 78, 29, 57, 573, 127,…
$ maxdur   <dbl> 31, 177, 183, 105, 237, 13, 2, 204, 250, 78, 29, 57, 573, 156…
$ hiact    <dbl> 7, 10, 16, 17, 20, 7, 14, 11, 16, 7, 15, 7, 20, 14, 10, 16, 1…
$ hostlev  <dbl> 3, 3, 4, 4, 5, 3, 4, 3, 4, 3, 4, 3, 5, 4, 3, 4, 4, 4, 4, 4, 4…
$ recip    <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1…

This dataset provides information on over 2,000 militarized interstate disputes (MIDs) that took place among countries in the international system from 1816 to 2010. For each dispute there’s information about how deadly it became, how long it lasted, and whether the dispute was reciprocated by the other side.

As the output above shows, the data is entirely numerical, but some of these numbers are stand-ins for categorical values. We can use the mutate() function to update the dataset accordingly. Notice the use of the functions ym(), ifelse(), and frcode(). The last function comes from the {socsci} package, which you can install from the developer’s GitHub page using the following code in the R console:

devtools::install_github("ryanburge/socsci")
library(socsci) ## for coding ordered categories
gml_mid_disps |>
  mutate(
    ## date
    date = ym(paste(styear, stmon, sep = "-")), ## "1902-5" parses unambiguously
    
    ## is the conflict an all-out war?
    all_out_war = ifelse(hostlev == 5, "War", "Low-level Dispute"),
    
    ## fatalities
    fatality_cat = frcode(
      fatality == 0 ~ "0",
      fatality == 1 ~ "1-25",
      fatality == 2 ~ "26-100",
      fatality == 3 ~ "101-250",
      fatality == 4 ~ "251-500",
      fatality == 5 ~ "501-999",
      fatality == 6 ~ "1,000 +",
      TRUE ~ "Unknown"
    ),
    
    ## how was the MID settled?
    ## (1 = negotiated, 3 = none; check the MID codebook to confirm)
    settle_cat = frcode(
      settle == 1 ~ "Negotiation",
      settle == 2 ~ "Imposed",
      settle == 3 ~ "None",
      settle == 4 ~ "Unclear",
      TRUE ~ "Unknown"
    )
  ) -> mid_data
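
To see what these helper functions are doing, here is a small sketch on made-up values (the toy tibble is hypothetical and not taken from gml_mid_disps). It parses year-month text with ym(), recodes hostility with ifelse(), and builds a factor with case_when() and factor(). That last step is only a rough stand-in for frcode() in case {socsci} is not installed; it is not how frcode() itself is implemented. The "-" separator passed to paste() is an assumption added here so that single-digit months parse unambiguously.

```r
library(dplyr)
library(lubridate)

## hypothetical toy values, not real MID records
toy <- tibble(
  styear   = c(1902, 1938, 1951),
  stmon    = c(5, 11, 10),
  hostlev  = c(3, 5, 4),
  fatality = c(0, 6, -9)
)

toy_coded <- toy |>
  mutate(
    ## ym() reads "year-month" text and returns the first day of that month
    date = ym(paste(styear, stmon, sep = "-")),

    ## ifelse(test, value_if_true, value_if_false)
    all_out_war = ifelse(hostlev == 5, "War", "Low-level Dispute"),

    ## rough stand-in for frcode(): recode, then fix the order of the levels
    fatality_cat = factor(
      case_when(
        fatality == 0 ~ "0",
        fatality == 6 ~ "1,000 +",
        TRUE ~ "Unknown"
      ),
      levels = c("0", "1,000 +", "Unknown")
    )
  )

toy_coded
```

Setting the levels explicitly is what keeps categories such as "1,000 +" from being sorted alphabetically in a plot or table.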

With this dataset, we can plot a number of trends over time. Let’s try some out below…
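
One simple starting point is to count how many disputes begin in each month and plot those counts as points. The sketch below uses a handful of made-up dates (toy_mids is hypothetical) rather than mid_data, so the numbers mean nothing. Only the structure of the code is of interest.

```r
library(dplyr)
library(ggplot2)

## hypothetical start dates standing in for mid_data$date
toy_mids <- tibble(
  date = as.Date(c(
    "1900-01-01", "1900-01-01",
    "1900-02-01",
    "1900-03-01", "1900-03-01", "1900-03-01"
  ))
)

## count() collapses the data to one row per date, with the tally in column n
toy_counts <- toy_mids |>
  count(date)

## a bare-bones scatter plot of new disputes per month
p_basic <- ggplot(toy_counts) +
  aes(x = date, y = n) +
  geom_point()

p_basic
```

Swapping toy_mids for mid_data should give the same kind of figure for the real disputes.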

10.3 Adding layers and text

Let’s do a few things to update the above figure. First, let’s include a smoothed trend for the number of new conflicts over time and update the color and transparency of the data points as well.

mid_data |>
  count(date) |>
  ggplot() +
  aes(x = date, y = n) +
  geom_point(
    color = "gray",
    alpha = 0.5
  ) +
  geom_smooth(
    method = "gam",
    se = FALSE,
    color = "navy"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Monthly Frequency of MIDs, 1816-2010"
  )

We can also use text and other plotting elements to highlight important events, like World War II. The code below uses geom_vline() to draw a dark red vertical line at September 1939. It then uses annotate() to add a label for the vertical line in the same color.

mid_data |>
  count(date) |>
  ggplot() +
  aes(x = date, y = n) +
  geom_point(
    color = "gray",
    alpha = 0.5
  ) +
  geom_smooth(
    method = "gam",
    se = FALSE,
    color = "navy"
  ) +
  geom_vline(
    xintercept = ym("193909"),
    color = "red4",
    linewidth = 1
  ) +
  annotate(
    geom = "text",
    x = ym("193909"),
    y = 9,
    label = "WWII",
    color = "red4",
    hjust = -0.2, ## control position (negative nudges the text right of x)
    fontface = 4  ## make bold and italic
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Monthly Frequency of MIDs, 1816-2010"
  )
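
The hjust and fontface arguments can be a little cryptic, so here is a minimal sketch on an otherwise empty plot (the y positions and axis limits are arbitrary choices for illustration). With hjust = 0 the text starts at x, with hjust = 1 it ends at x, and values outside 0 to 1 push it further away. fontface accepts the numbers 1 to 4 or the equivalent names "plain", "bold", "italic", and "bold.italic".

```r
library(ggplot2)
library(lubridate)

## ym() turns "193909" into the Date 1939-09-01
wwii_start <- ym("193909")

## an empty canvas, just to compare where each label lands
p_labels <- ggplot() +
  geom_vline(xintercept = wwii_start, color = "red4") +
  annotate("text", x = wwii_start, y = 3, label = "hjust = 0",
           hjust = 0, fontface = "plain") +
  annotate("text", x = wwii_start, y = 2, label = "hjust = 1",
           hjust = 1, fontface = "bold") +
  annotate("text", x = wwii_start, y = 1, label = "hjust = -0.2",
           hjust = -0.2, fontface = "bold.italic") +
  scale_x_date(limits = ym(c("193001", "195001")))

p_labels
```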

{geomtextpath} provides some nice functions for plotting text and labels as well. This package was specifically designed to add labels directly to different kinds of lines drawn with geom_*() functions. Here’s the above figure updated to have a label for both the smoothed trend line and the vertical line for WWII:

## open {geomtextpath}
# install.packages("geomtextpath")
library(geomtextpath)

## put it to use
mid_data |>
  count(date) |>
  ggplot() +
  aes(x = date, y = n) +
  geom_point(
    color = "gray",
    alpha = 0.5
  ) +
  geom_textsmooth(
    method = "gam",
    color = "navy",
    linewidth = 1,
    label = "Avg. Trend"
  ) +
  geom_textvline(
    xintercept = ym("193909"),
    color = "red4",
    linewidth = 1,
    label = "WWII"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Monthly Frequency of MIDs, 1816-2010"
  )
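
Where a label sits along its line can be adjusted too. In the sketch below, which uses a made-up curve rather than the MID counts, hjust moves the label along the path (0 is the start and 1 is the end) and vjust shifts it off the line. The particular values are arbitrary and only meant to show the idea.

```r
library(ggplot2)
library(geomtextpath)

## hypothetical toy series, not the MID counts
toy <- data.frame(x = 1:50, y = sqrt(1:50))

p_path <- ggplot(toy) +
  aes(x = x, y = y) +
  ## label placed 80% of the way along the curve, lifted above the line
  geom_textline(
    label = "label above the line",
    hjust = 0.8,
    vjust = -0.5,
    linewidth = 1
  ) +
  ## label placed near the top of the vertical line
  geom_textvline(
    xintercept = 25,
    label = "label near the top",
    hjust = 0.9,
    color = "red4"
  )

p_path
```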

The ability to add text directly to lines is a major bonus. In some cases, this can be a good alternative to using a legend if you want to show trends for different groups. Say, for instance, we wanted to show separate trends for all-out war versus lower-level conflicts. The below code shows what we could do. I’ve made some changes to the code compared to what I did above. First, instead of grouping by date I grouped by start year and whether a MID was an all-out war. There just aren’t enough unique all-out wars that started in the same month, so grouping by year gives us a little more variation. Also, note the changes inside geom_textsmooth(). I’ve mapped the linetype and label aesthetics to the values in the all_out_war column in the data.

mid_data |>
  group_by(all_out_war) |>
  count(styear) |> 
  ggplot() +
  aes(x = styear, y = n) +
  geom_point(
    color = "gray",
    alpha = 0.5
  ) +
  geom_textsmooth(
    aes(linetype = all_out_war, 
        label = all_out_war), 
    method = "gam",
    linewidth = 1,
    show.legend = FALSE
  ) +
  geom_textvline(
    xintercept = 1939,
    color = "red4",
    linewidth = 1,
    label = "WWII"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Yearly Frequency of MIDs, 1816-2010"
  )
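
As a side note on the counting step, grouping first and then calling count() should give the same tallies as passing both columns to count() directly. Here is a minimal sketch on made-up rows (the toy tibble is hypothetical):

```r
library(dplyr)

## hypothetical toy disputes
toy <- tibble(
  styear      = c(1900, 1900, 1900, 1901, 1901),
  all_out_war = c("War", "Low-level Dispute", "Low-level Dispute", "War", "War")
)

## version used above: group first, then count years within each group
by_group <- toy |>
  group_by(all_out_war) |>
  count(styear) |>
  ungroup()

## shortcut: count both columns at once
by_count <- toy |>
  count(all_out_war, styear)

by_count
```

Either way there is one row per combination of dispute type and start year, which is what lets geom_textsmooth() draw a separate trend for each type.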

10.4 Updating legend and label names

Speaking of layering complexity and adding labels, sometimes when we show trends over time we may want to show two or more related but distinct variables in the same figure. We covered how to do this with democracy measures in the previous notes. However, we didn’t deal specifically with how we can update the labels of the variable names in our plot.

To illustrate, let’s use some state-year data and bring in measures of democracy:

create_stateyears() |>
  add_democracy() -> dem_data

We have three measures of democracy in the data:

  • v2x_polyarchy: This comes from the Varieties of Democracy project and takes a value between 0 and 1 where 1 is the most democratic and 0 is the least.
  • polity2: This comes from the Polity Project and takes values between -10 and 10 where 10 is the most democratic and -10 is the least.
  • xm_qudsest: Xavier Marquez wrote a paper in 2016 where he developed a method to extend the Unified Democracy Scores, a democracy measure created by other scholars. This is a “normalized” measure of democracy set to have a mean of 0 and standard deviation of 1.
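
Because these measures run over such different ranges, it helps to see what standardizing with scale() does. The numbers below are made up for illustration (they are not actual democracy scores). After rescaling, each vector has a mean of 0 and a standard deviation of 1, so the two become directly comparable.

```r
## hypothetical scores on two very different scales
polity_like <- c(-10, -5, 0, 5, 10)
vdem_like   <- c(0.1, 0.3, 0.5, 0.7, 0.9)

## scale() subtracts the mean and divides by the standard deviation;
## as.numeric() drops the matrix wrapper that scale() returns
polity_z <- as.numeric(scale(polity_like))
vdem_z   <- as.numeric(scale(vdem_like))

round(polity_z, 2)
round(vdem_z, 2)
```

In this toy case the two rescaled vectors come out identical, since the original values differ only by a shift and a stretch.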

While each of these measures is an attempt to capture the same basic concept, they are produced using very different methods and sit on very different scales. We walked through how to deal with this in the previous chapter, but something we can add is new labels for each one of the variable names to show in the plot. The code below does this by using mutate() after the data has been pivoted using pivot_longer(). There, a function called case_when() is used to change the names of the democracy measures in the name column to something more intuitive. You can see the difference this makes by looking at the plot below. Note the use of {geomtextpath}, as well as the use of theme(legend.position = "none"). Since we’re using labels directly on the trend lines, we don’t need a legend.

# Update the data and summarize:
dem_data |>
  
  ## use mutate() to rescale the data
  mutate(
    across(
      c(v2x_polyarchy, polity2, xm_qudsest),
      scale
    )
  ) |>
  
  ## get the average of each democracy score/year
  group_by(year) |>
  summarize(
    across(
      c(v2x_polyarchy, polity2, xm_qudsest),
      ~ mean(.x, na.rm = TRUE)
    )
  ) |>
  
  ## pivot the data longer by the democracy measures
  pivot_longer(
    cols = v2x_polyarchy:xm_qudsest
  ) |>
  
  ## update the democracy measure variable names
  mutate(
    name = case_when(
      name == "v2x_polyarchy" ~ "V-Dem",
      name == "polity2" ~ "Polity",
      name == "xm_qudsest" ~ "Extended UDS"
    )
  ) |>
  
  ## visualize
  ggplot() +
  aes(x = year, y = value, color = name) +
  geom_point(
    alpha = 0.4
  ) +
  geom_labelsmooth(
    aes(label = name),
    hjust = 0.4
  ) +
  labs(
    x = NULL,
    y = "Quality of Democracy",
    title = "Democracy over time, 1816-2007"
  ) +
  theme(
    legend.position = "none"
  )

Using mutate() and case_when() after you’ve pivoted data is a nice strategy to keep in mind when you go from summarizing, to pivoting, to plotting.
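
To isolate that pivot-then-relabel step, here is a minimal sketch on a tiny made-up table (the values are hypothetical, not real democracy scores). After pivot_longer(), the old column names sit in a column called name, which is what case_when() then rewrites.

```r
library(dplyr)
library(tidyr)

## hypothetical yearly averages, not real democracy scores
toy_wide <- tibble(
  year          = c(2000, 2001),
  v2x_polyarchy = c(0.10, 0.20),
  polity2       = c(0.05, 0.15)
)

toy_long <- toy_wide |>
  ## one row per year and measure; names go to `name`, values to `value`
  pivot_longer(cols = v2x_polyarchy:polity2) |>
  ## swap the raw column names for reader-friendly labels
  mutate(
    name = case_when(
      name == "v2x_polyarchy" ~ "V-Dem",
      name == "polity2" ~ "Polity"
    )
  )

toy_long
```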