9  Data Wrangling

9.1 Goals

  • Understand the importance of data wrangling.
  • Apply the main {dplyr} functions to mutate and summarize a dataset.
  • Layer complexity by chaining together multiple {dplyr} functions.
  • Use {tidyr} functions to reshape datasets.

9.2 The importance of data wrangling

Data wrangling is the process of cleaning and reshaping a dataset to make it ready for data visualization or analysis. Many of the datasets you’ve used up to now have required only minimal data wrangling, but in many real-world settings you won’t be so lucky. You therefore need to know how to clean, collapse, and reshape data so you can use it to generate insights.

The {tidyverse} approach to statistical programming in R provides a number of helpful tools or “verbs” for performing data wrangling. We’ll walk through these verbs below and how to use them. While you probably won’t master them overnight, with time and practice these tools will become easier to use.

9.3 Using {dplyr} for data mutation and summaries

The {dplyr} package is a powerful tool for transforming data into what we need for analysis and data visualization. This package is part of the {tidyverse} and is automatically loaded when you use library(tidyverse).

There are 6 key {dplyr} functions that you need to know about:

  • filter(): Choose observations (rows) based on their values using comparison operators (==, !=, >, <, >=, <=)
  • arrange(): Reorder the rows of a dataset based on the values of a variable
  • select(): Choose variables (columns) based on their names
  • mutate(): Create new variables or modify existing ones
  • summarize(): Collapse many values into a single summary, or into summaries by group
  • group_by(): Specify the scope of the function that you want to perform (for example, calculate the average votes cast for a candidate by education level).

We’ll start by opening the tidyverse as always:

library(tidyverse)

The {dplyr} tools are just R functions, but we often call them “verbs” because we use them to perform certain operations on our data. A nice feature of working with these verbs is that we can chain them together to perform a succession of possibly highly complex data transformations. Just as we added layers to a ggplot with +, we can chain together multiple {dplyr} verbs using the |> pipe operator. We’ve already seen this in action in some previous code.

A quick note on the pipe: Once upon a time we used %>% for pipes. You can still use this version of the pipe, but recent versions of R (4.1 and later) provide |> as a native alternative. There are just a few minor differences in how they work, but these differences usually don’t matter. A nice feature of |> is that it is slightly faster than %>%, since it is handled directly by R’s parser rather than as a function call (in addition to being slightly quicker to type). If you have a lot of code to run and lots of computations to deal with, any small efficiency gain is a win.
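To see that the two pipes are interchangeable in everyday code, here is a minimal base R sketch. Both hand the left-hand side to the function on the right as its first argument:

```r
## Both pipes pass the left-hand side to the function on the right
## as its first argument, so these two lines do the same thing:
x <- c(1, 2, 3)

x |> sum()    # base R pipe (R 4.1 and later)
# x %>% sum() # {magrittr} pipe; works once the tidyverse is loaded
```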

Back to using {dplyr} verbs… Here’s how the workflow with these verbs generally works:

  1. First, we start with a data object.
  2. Then, we add verbs that describe what to do with the data using column (variable) names and other instructions. Note that we don’t need to put the column names in quotes. Because we’ve already specified the data object that contains them, R will be able to find them as “known” objects.
  3. The output is a new data frame, which we can save as a new object.

For example, say we wanted to create a smaller version of our dataset that only contains certain observations. We would write:

filtered_data <- data |>
    filter(column_name == "some text")

As we work with {dplyr} verbs, we need to know how to use various comparison operators:

  • Greater than >
  • Less than <
  • Greater than or equal to >=
  • Less than or equal to <=
  • Equal to ==
  • Not equal to !=
  • Is a value in a bigger set of values? %in%
  • Is NA? is.na()

We also need to know some mathematical operators (these follow the standard order of operations):

  • addition +
  • subtraction -
  • multiplication *
  • division /
  • exponentiation ^ (for example, x^2 squares x)

We also need to know the logical operators:

  • and &
  • or |
  • not !
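To see a few of these operators working together inside filter(), here is a small sketch using a made-up mini-dataset (the data and column names below are hypothetical, just for illustration):

```r
library(dplyr)

## A made-up mini-dataset just to illustrate the operators
toy <- tibble(
  country = c("A", "B", "C", "D"),
  year    = c(1990, 1990, 2000, 2000),
  dem     = c(0.2, 0.8, NA, 0.9)
)

## comparison (>=), "and" (&), and is.na() together
toy |>
  filter(year >= 2000 & !is.na(dem))

## %in% plus the "or" operator (|)
toy |>
  filter(country %in% c("A", "C") | dem > 0.5)
```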

In the following subsections we’ll walk through some examples of these verbs in action. To do this, we’ll use data from the {peacesciencer} package. This package has a lot of useful data for studying conflict, which is a core subject of study in the political science field called international relations or IR. It is different from many other R packages that provide access to data in that it contains a number of its own verbs for declaring a base dataset and then populating it with different variables that we’re interested in. You can read more about it here: https://svmiller.com/peacesciencer/

If you don’t have the package already, you’ll need to install it by writing install.packages("peacesciencer") in the R console. Once you do, open it:

library(peacesciencer)

The below code generates one of a large number of possible datasets we could construct using {peacesciencer}. In particular, it creates a base dataset of the universe of all countries that have existed from 1946 to 2000 with an observation for each year in that time frame that the country existed. This is what we would call a country-year dataset. The code then populates this dataset with information about a country’s involvement in a militarized interstate dispute (MID), whether the country is a “major power,” its quality of democracy, and its GDP and population.

## start with function that creates state-year data
create_stateyears(
  subset_years = 1946:2000
) |>
  ## add indicators for whether countries were involved in a MID
  add_gml_mids() |>
  ## then use this function to add major power indicators
  add_cow_majors() |>
  ## add information about the quality of democracy in countries
  add_democracy() |>
  ## add economic info
  add_sdp_gdp() -> Data

To get a sense for the size of this dataset we can use the dim() function to tell us the number of rows and columns:

dim(Data)
[1] 7421   15

Let’s use this data to work through some {dplyr} verb examples.

9.3.1 filter()

Let’s practice filter() first. Say we only want data for major powers. When an observation is a major power, the data codes it as 1. So if we write the following, we’ll only keep the rows in the data where a country is a major power. As you can see from the below code, when we do this the filtered data is printed out for us.

Data |>
  filter(cowmaj == 1)
# A tibble: 291 × 15
   ccode statenme              year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>                <dbl>         <dbl>       <dbl>              <dbl>
 1     2 United States of Am…  1946             1           1                  1
 2     2 United States of Am…  1947             0           0                  0
 3     2 United States of Am…  1948             1           1                  0
 4     2 United States of Am…  1949             1           1                  0
 5     2 United States of Am…  1950             1           1                  1
 6     2 United States of Am…  1951             1           0                  1
 7     2 United States of Am…  1952             1           0                  0
 8     2 United States of Am…  1953             1           1                  1
 9     2 United States of Am…  1954             1           1                  1
10     2 United States of Am…  1955             1           1                  1
# ℹ 281 more rows
# ℹ 9 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, polity2 <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>,
#   wbpopest <dbl>, sdpest <dbl>, wbgdppc2011est <dbl>

We can filter on more than one variable at a time. Say we wanted major powers that also are involved in a MID. We just need to include the “and” operator & and a new logical condition:

Data |>
  filter(cowmaj == 1 & gmlmidongoing == 1)
# A tibble: 253 × 15
   ccode statenme              year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>                <dbl>         <dbl>       <dbl>              <dbl>
 1     2 United States of Am…  1946             1           1                  1
 2     2 United States of Am…  1948             1           1                  0
 3     2 United States of Am…  1949             1           1                  0
 4     2 United States of Am…  1950             1           1                  1
 5     2 United States of Am…  1951             1           0                  1
 6     2 United States of Am…  1952             1           0                  0
 7     2 United States of Am…  1953             1           1                  1
 8     2 United States of Am…  1954             1           1                  1
 9     2 United States of Am…  1955             1           1                  1
10     2 United States of Am…  1956             1           1                  1
# ℹ 243 more rows
# ℹ 9 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, polity2 <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>,
#   wbpopest <dbl>, sdpest <dbl>, wbgdppc2011est <dbl>

OK, but what if we want to filter on two variables but make it either-or? Say we want major powers or countries involved in a MID:

Data |>
  filter(cowmaj == 1 | gmlmidongoing == 1)
# A tibble: 2,484 × 15
   ccode statenme              year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>                <dbl>         <dbl>       <dbl>              <dbl>
 1     2 United States of Am…  1946             1           1                  1
 2     2 United States of Am…  1947             0           0                  0
 3     2 United States of Am…  1948             1           1                  0
 4     2 United States of Am…  1949             1           1                  0
 5     2 United States of Am…  1950             1           1                  1
 6     2 United States of Am…  1951             1           0                  1
 7     2 United States of Am…  1952             1           0                  0
 8     2 United States of Am…  1953             1           1                  1
 9     2 United States of Am…  1954             1           1                  1
10     2 United States of Am…  1955             1           1                  1
# ℹ 2,474 more rows
# ℹ 9 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, polity2 <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>,
#   wbpopest <dbl>, sdpest <dbl>, wbgdppc2011est <dbl>

Note: filter() only includes rows where the condition is TRUE. So if we want to keep NA (missing) values, we need to ask for them explicitly. For example, what would we do if we wanted only rows where values in a column are blank (indicated by NA on the computer)? We might want to do this to figure out which observations in our data have missing data. Let’s look at the democracy measure polity2. At first blush, we might assume that to select rows for which polity2 is NA we need to write

Data |>
  filter(polity2 == NA)

This actually won’t work. Since NA is a special value that indicates missing data, any comparison with it, whether == or !=, returns NA rather than TRUE or FALSE, so filter() drops those rows. This means we can only include or exclude NA values using functions designed to do that. For example, we can use is.na() like so:

Data |>
  filter(is.na(polity2))
# A tibble: 598 × 15
   ccode statenme  year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>    <dbl>         <dbl>       <dbl>              <dbl>
 1    31 Bahamas   1973             0           0                  0
 2    31 Bahamas   1974             0           0                  0
 3    31 Bahamas   1975             0           0                  0
 4    31 Bahamas   1976             0           0                  0
 5    31 Bahamas   1977             0           0                  0
 6    31 Bahamas   1978             0           0                  0
 7    31 Bahamas   1979             0           0                  0
 8    31 Bahamas   1980             0           0                  0
 9    31 Bahamas   1981             0           0                  0
10    31 Bahamas   1982             0           0                  0
# ℹ 588 more rows
# ℹ 9 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, polity2 <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>,
#   wbpopest <dbl>, sdpest <dbl>, wbgdppc2011est <dbl>

9.3.2 arrange()

The arrange() verb is used to change the order in which observations appear in our data, kind of like using “sort” in Excel, except you know you won’t accidentally sort just one column and scramble all your data.

Let’s sort the dataset by year:

Data |>
  arrange(year)
# A tibble: 7,421 × 15
   ccode statenme              year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>                <dbl>         <dbl>       <dbl>              <dbl>
 1     2 United States of Am…  1946             1           1                  1
 2    20 Canada                1946             0           0                  0
 3    40 Cuba                  1946             0           0                  0
 4    41 Haiti                 1946             0           0                  0
 5    42 Dominican Republic    1946             0           0                  0
 6    70 Mexico                1946             0           0                  0
 7    90 Guatemala             1946             0           0                  0
 8    91 Honduras              1946             0           0                  0
 9    92 El Salvador           1946             0           0                  0
10    93 Nicaragua             1946             0           0                  0
# ℹ 7,411 more rows
# ℹ 9 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, polity2 <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>,
#   wbpopest <dbl>, sdpest <dbl>, wbgdppc2011est <dbl>

OK, cool, but what if we want to see the most recent years first? We can reverse the order using desc():

Data |>
  arrange(desc(year))
# A tibble: 7,421 × 15
   ccode statenme              year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>                <dbl>         <dbl>       <dbl>              <dbl>
 1     2 United States of Am…  2000             1           1                  1
 2    20 Canada                2000             1           1                  0
 3    31 Bahamas               2000             0           0                  0
 4    40 Cuba                  2000             0           0                  0
 5    41 Haiti                 2000             0           0                  0
 6    42 Dominican Republic    2000             0           0                  0
 7    51 Jamaica               2000             0           0                  0
 8    52 Trinidad and Tobago   2000             0           0                  0
 9    53 Barbados              2000             0           0                  0
10    54 Dominica              2000             0           0                  0
# ℹ 7,411 more rows
# ℹ 9 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, polity2 <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>,
#   wbpopest <dbl>, sdpest <dbl>, wbgdppc2011est <dbl>

The following would work, too, but this little hack only works for numerical data. If we wanted to sort a character column in reverse, we’d still need to use desc():

Data |>
  arrange(-year)

Speaking of arranging by characters, we can use the following code to arrange the data in alphabetical order by country names:

Data |>
  arrange(statenme)
# A tibble: 7,421 × 15
   ccode statenme     year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>       <dbl>         <dbl>       <dbl>              <dbl>
 1   700 Afghanistan  1946             0           0                  0
 2   700 Afghanistan  1947             0           0                  0
 3   700 Afghanistan  1948             0           0                  0
 4   700 Afghanistan  1949             1           1                  1
 5   700 Afghanistan  1950             1           1                  1
 6   700 Afghanistan  1951             1           1                  1
 7   700 Afghanistan  1952             0           0                  0
 8   700 Afghanistan  1953             0           0                  0
 9   700 Afghanistan  1954             0           0                  0
10   700 Afghanistan  1955             1           1                  1
# ℹ 7,411 more rows
# ℹ 9 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, polity2 <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>,
#   wbpopest <dbl>, sdpest <dbl>, wbgdppc2011est <dbl>

For reverse order, we’d just write the following instead:

Data |>
  arrange(desc(statenme))

9.3.3 select()

select() gives us another way to slice our data, but unlike filter(), which lets us pick which rows we want to keep, select() lets us pick which columns to keep. Dropping unnecessary columns from a dataset can be nice if we want to simplify our analysis and not confuse ourselves with dozens or hundreds of extra variables that we don’t need.

As we can tell by using ncol() on the data (below), we have 15 columns. This isn’t too unreasonable, but it does have some variables we might consider dropping.

ncol(Data)
[1] 15

Say that we just wanted to look at how democracy, major power status, and surplus domestic product (an adjusted measure of GDP that quantifies how much wiggle room a country has to spend on militarization) relate to how likely a country is to start fights with other countries. Let’s use select() to get just the data points we need. We’ll keep state names and the year, too, for good measure.

Data |>
  select(statenme, year, gmlmidonset_init, cowmaj, xm_qudsest, sdpest)
# A tibble: 7,421 × 6
   statenme                  year gmlmidonset_init cowmaj xm_qudsest sdpest
   <chr>                    <dbl>            <dbl>  <dbl>      <dbl>  <dbl>
 1 United States of America  1946                1      1       1.26   28.5
 2 United States of America  1947                0      1       1.26   28.5
 3 United States of America  1948                0      1       1.25   28.5
 4 United States of America  1949                0      1       1.25   28.5
 5 United States of America  1950                1      1       1.27   28.5
 6 United States of America  1951                0      1       1.26   28.6
 7 United States of America  1952                0      1       1.35   28.6
 8 United States of America  1953                1      1       1.35   28.7
 9 United States of America  1954                1      1       1.35   28.7
10 United States of America  1955                1      1       1.36   28.7
# ℹ 7,411 more rows

That was pretty easy, but what if we wanted the majority of the columns in the data but we wanted to drop one? Would we have to write out all the column names inside select() except the one we don’t want? Thankfully, the answer to this question is no. Here’s all we need to write to drop one column, say polity2:

Data |>
  select(-polity2)
# A tibble: 7,421 × 14
   ccode statenme              year gmlmidongoing gmlmidonset gmlmidongoing_init
   <dbl> <chr>                <dbl>         <dbl>       <dbl>              <dbl>
 1     2 United States of Am…  1946             1           1                  1
 2     2 United States of Am…  1947             0           0                  0
 3     2 United States of Am…  1948             1           1                  0
 4     2 United States of Am…  1949             1           1                  0
 5     2 United States of Am…  1950             1           1                  1
 6     2 United States of Am…  1951             1           0                  1
 7     2 United States of Am…  1952             1           0                  0
 8     2 United States of Am…  1953             1           1                  1
 9     2 United States of Am…  1954             1           1                  1
10     2 United States of Am…  1955             1           1                  1
# ℹ 7,411 more rows
# ℹ 8 more variables: gmlmidonset_init <dbl>, cowmaj <dbl>,
#   v2x_polyarchy <dbl>, xm_qudsest <dbl>, wbgdp2011est <dbl>, wbpopest <dbl>,
#   sdpest <dbl>, wbgdppc2011est <dbl>

We can even use some helper functions inside of select() to get even more specific. A few examples are:

  • starts_with("")
  • ends_with("")
  • contains("ijk")

For example, we have a few columns that contain “mid”. We can select all of those by writing:

Data |>
  select(contains("mid")) 
# A tibble: 7,421 × 4
   gmlmidongoing gmlmidonset gmlmidongoing_init gmlmidonset_init
           <dbl>       <dbl>              <dbl>            <dbl>
 1             1           1                  1                1
 2             0           0                  0                0
 3             1           1                  0                0
 4             1           1                  0                0
 5             1           1                  1                1
 6             1           0                  1                0
 7             1           0                  0                0
 8             1           1                  1                1
 9             1           1                  1                1
10             1           1                  1                1
# ℹ 7,411 more rows

9.3.4 mutate()

Sometimes we need to create new variables. We can do that using mutate().

Since this is our fourth {dplyr} function, let’s start chaining some of these together to see how they work. Here’s one that we did before, with an extra column wbpopest:

Data |>
  select(statenme, year, gmlmidonset_init, cowmaj, xm_qudsest, sdpest, wbpopest)

To chain on another command, we use the pipe |>, just like in ggplot where we added layers to our graphic using +. Let’s pipe this select verb into mutate and create a measure of surplus domestic product per capita. As we do this, note that sdpest and wbpopest are log-transformed versions of these variables. To make a measure of SDP per capita, we’ll need to take the exponent of each (this undoes the log-transformation) before we divide the former by the latter:

Data |>
  select(statenme, year, gmlmidonset_init, cowmaj, xm_qudsest, sdpest, wbpopest) |>
  mutate(
    sdppc = exp(sdpest) / exp(wbpopest)
  )
# A tibble: 7,421 × 8
   statenme       year gmlmidonset_init cowmaj xm_qudsest sdpest wbpopest  sdppc
   <chr>         <dbl>            <dbl>  <dbl>      <dbl>  <dbl>    <dbl>  <dbl>
 1 United State…  1946                1      1       1.26   28.5     18.7 16848.
 2 United State…  1947                0      1       1.26   28.5     18.8 16285.
 3 United State…  1948                0      1       1.25   28.5     18.8 16301.
 4 United State…  1949                0      1       1.25   28.5     18.8 15709.
 5 United State…  1950                1      1       1.27   28.5     18.8 16091.
 6 United State…  1951                0      1       1.26   28.6     18.8 17001.
 7 United State…  1952                0      1       1.35   28.6     18.8 17327.
 8 United State…  1953                1      1       1.35   28.7     18.9 17980.
 9 United State…  1954                1      1       1.35   28.7     18.9 17819.
10 United State…  1955                1      1       1.36   28.7     18.9 18106.
# ℹ 7,411 more rows

A nice thing about working with mutate() is that you can make new columns in succession and they’ll be treated as known objects for the next line in mutate(). Say for example that we wanted to undo the log-transformation of SDP and population in our data in addition to creating a measure of SDP per capita:

Data |>
  select(statenme, year, gmlmidonset_init, cowmaj, xm_qudsest, sdpest, wbpopest) |>
  mutate(
    sdpest = exp(sdpest),
    wbpopest = exp(wbpopest),
    sdppc = sdpest / wbpopest
  )
# A tibble: 7,421 × 8
   statenme      year gmlmidonset_init cowmaj xm_qudsest  sdpest wbpopest  sdppc
   <chr>        <dbl>            <dbl>  <dbl>      <dbl>   <dbl>    <dbl>  <dbl>
 1 United Stat…  1946                1      1       1.26 2.33e12   1.38e8 16848.
 2 United Stat…  1947                0      1       1.26 2.28e12   1.40e8 16285.
 3 United Stat…  1948                0      1       1.25 2.34e12   1.43e8 16301.
 4 United Stat…  1949                0      1       1.25 2.30e12   1.47e8 15709.
 5 United Stat…  1950                1      1       1.27 2.40e12   1.49e8 16091.
 6 United Stat…  1951                0      1       1.26 2.57e12   1.51e8 17001.
 7 United Stat…  1952                0      1       1.35 2.66e12   1.53e8 17327.
 8 United Stat…  1953                1      1       1.35 2.81e12   1.56e8 17980.
 9 United Stat…  1954                1      1       1.35 2.83e12   1.59e8 17819.
10 United Stat…  1955                1      1       1.36 2.93e12   1.62e8 18106.
# ℹ 7,411 more rows

Note that as things are getting more complex, it’s helpful to do things one step at a time rather than all at once. This is a good recipe for success and will help cut down on frustration.

You can use many other functions and operators to create new columns beyond exp() and /, including sum(x), y - mean(y), log(), log2(), log10(), lag(), and lead().
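For instance, here is a small sketch of a few of these at work on a made-up mini-dataset (the column names below are hypothetical, just for illustration):

```r
library(dplyr)

## A made-up mini-dataset (the column names are hypothetical)
toy <- tibble(year = 2001:2005, gdp = c(100, 110, 125, 120, 140))

toy |>
  mutate(
    log_gdp    = log(gdp),          # natural log
    gdp_change = gdp - lag(gdp),    # change from the previous row
    centered   = gdp - mean(gdp)    # deviation from the column mean
  )
```

Note that lag() returns NA for the first row, since there is no previous row to compare against.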

9.3.5 summarize() and group_by()

The functions summarize() and group_by() can be used in combination to collapse a dataframe: summarize() alone collapses it to a single row, while adding group_by() gives one row per group. For example, here’s the rate at which new MIDs are initiated in the data:

Data |>
  summarize(
    mean_mid_init = mean(gmlmidongoing_init)
  )
# A tibble: 1 × 1
  mean_mid_init
          <dbl>
1         0.215

Sometimes an average for the whole dataset is nice, but it can sometimes be useful to see how an average breaks down by different factors. For example, what is the average rate of MID initiation by year? If we combine summarize() with group_by() we can see:

Data |>
  group_by(year) |>
  summarize(
    mean_wars = mean(gmlmidongoing_init)
  )
# A tibble: 55 × 2
    year mean_wars
   <dbl>     <dbl>
 1  1946     0.106
 2  1947     0.103
 3  1948     0.181
 4  1949     0.227
 5  1950     0.32 
 6  1951     0.395
 7  1952     0.377
 8  1953     0.342
 9  1954     0.195
10  1955     0.262
# ℹ 45 more rows

We could also filter this down so it’s not just the total mean per year. Say we only want to look at the rate of MID initiation by major powers:

Data |>
  filter(cowmaj == 1) |>
  group_by(year) |>
  summarize(
    mean_wars = mean(gmlmidongoing_init)
  )
# A tibble: 55 × 2
    year mean_wars
   <dbl>     <dbl>
 1  1946      1   
 2  1947      0.25
 3  1948      0.75
 4  1949      0.5 
 5  1950      1   
 6  1951      1   
 7  1952      0.8 
 8  1953      1   
 9  1954      0.6 
10  1955      0.8 
# ℹ 45 more rows

Finally, we also have data on democracy. What if we only look at those that scored a 0.5 or higher on our democracy measure or that are major powers?

Data |>
  filter(
    cowmaj == 1 |
      xm_qudsest > 0.5
  ) |>
  group_by(year) |>
  summarize(
    mean_wars = mean(gmlmidongoing_init)
  )
# A tibble: 55 × 2
    year mean_wars
   <dbl>     <dbl>
 1  1946    0.167 
 2  1947    0.0968
 3  1948    0.219 
 4  1949    0.152 
 5  1950    0.371 
 6  1951    0.429 
 7  1952    0.429 
 8  1953    0.457 
 9  1954    0.229 
10  1955    0.256 
# ℹ 45 more rows

We can do more than one summary of the data at a time with summarize(). The below code reports, for each year, the number of unique country observations, the rate of MID initiation, and the standard deviation of MID initiation:

Data |>
  group_by(year) |>
  summarize(
    n = n(), 
    mean_wars = mean(gmlmidongoing_init),
    sd_wars = sd(gmlmidongoing_init)
  )
# A tibble: 55 × 4
    year     n mean_wars sd_wars
   <dbl> <int>     <dbl>   <dbl>
 1  1946    66     0.106   0.310
 2  1947    68     0.103   0.306
 3  1948    72     0.181   0.387
 4  1949    75     0.227   0.421
 5  1950    75     0.32    0.470
 6  1951    76     0.395   0.492
 7  1952    77     0.377   0.488
 8  1953    79     0.342   0.477
 9  1954    82     0.195   0.399
10  1955    84     0.262   0.442
# ℹ 45 more rows

There are a lot of different functions you can use in summarize(), including:

  • mean(): average
  • median(): midpoint
  • sd(): standard deviation
  • IQR(): interquartile range
  • mad(): median absolute deviation
  • min(): minimum
  • max(): maximum
  • quantile(x, 0.25): value in the data that is > 25% of the values and < 75% of the values
  • n(): count
  • sum(!is.na(x)): count all non-missing values (don’t count NAs)
  • n_distinct(): count all distinct (unique) values

Note that if your data has NA values, for functions like mean(), median(), or sd() you’ll want to use the option na.rm = T inside the function. This drops NA observations before these functions calculate the quantities of interest. Otherwise, the functions will return NA.
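Here’s a quick base R illustration of why na.rm matters:

```r
x <- c(1, 2, NA, 4)

mean(x)               # returns NA because of the missing value
mean(x, na.rm = TRUE) # drops the NA, then averages the rest
```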

9.4 From summarizing to plotting

One of the advantages of using summarize() and group_by() in combination is that it makes plotting different summaries of your data so much easier. In a previous chapter we talked about how certain geom_*() functions in ggplot() will count and summarize things for you under the hood. This is fine in some cases, but sometimes the syntax necessary to make this happen isn’t terribly intuitive. It will often make more sense to do a combination of summarize() and group_by() with your data before using ggplot() to plot relationships. The below code shows an example of what this can look like. Notice that you can go from your full data, make a summary, and then hand it off to ggplot(), all within a series of pipes:

Data |>
  group_by(year) |>
  summarize(
    conflict_rate = mean(gmlmidonset_init)
  ) |>
  ggplot() +
  aes(x = year, y = conflict_rate) +
  geom_col() +
  labs(
    x = NULL,
    y = "Rate of MID Initiation",
    title = "Conflict initiation over time, 1946-2000",
    caption = "Data: {peacesciencer}"
  )

9.5 Pivoting data with {tidyr}

There are some instances where we may want to pivot our data to make it longer or wider prior to plotting as well. Our dataset contains three different measures of democracy. These measures emphasize some different things in quantifying quality of democracy, so there may be some divergence in what they tell us about quality of democracy in the world over time. The code below creates an object called dem_per_year which has a row for each year in the dataset. For each year, it gives in three separate columns the average democracy score for countries in the world based on each of the three different measures of democracy.

It’s important to note that it does not create a yearly average of the raw scores. The reason is that each of the democracy measures is on a different scale: one goes from -10 to 10, another from 0 to 1. To fix that, the below code uses the scale() function, which transforms data into “standard deviation units” and centers it to have a mean of 0.

Since we want to apply the same transformation to multiple columns, we can use a handy function called across() inside mutate(). Note the syntax below. After the transformations have been applied, we can then group by year and then summarize to get the mean of the transformed democracy scores per year. Notice that we can also use across() inside summarize(). However, I do something slightly different here than when I used scale to transform the data in mutate(). There I just had to write across(v2x_polyarchy:xm_qudsest, scale). In summarize() I need to write across(v2x_polyarchy:xm_qudsest, ~ mean(.x, na.rm = T)).

Why couldn’t I just write across(v2x_polyarchy:xm_qudsest, mean)? The reason is that there are missing or NA values in some of the democracy scores for some countries. By default, mean() handles these by just returning NA. Since we want to change how mean() deals with missing values, we need to create what’s called a “local function.” This is the same thing as making a new mean function like so:

my_mean <- function(x) {
  mean(x, na.rm = T)
}

Instead of needing to create a new function like this, we can do so locally inside across() using slightly different syntax:

~ mean(.x, na.rm = T)

That’s much shorter. The ~ tells across() that you’re making a new function, and .x is the placeholder for whatever you give to the new function to operate on.
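A quick illustration of why the na.rm argument matters, using a made-up vector:

```r
x <- c(1, NA, 3)
mean(x)               ## NA -- a single missing value poisons the result
mean(x, na.rm = TRUE) ## 2  -- missing values are dropped first
```

This is exactly the behavior we need inside across(), which is why we write ~ mean(.x, na.rm = T) rather than just mean.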

Okay, here’s all the code put together:

dem_per_year <- Data |>
  ## first we need to put these measures on the same scale:
  mutate(
    across(v2x_polyarchy:xm_qudsest, scale)
  ) |>
  ## then group by year
  group_by(
    year
  ) |>
  ## then take the average of each per year
  summarize(
    across(v2x_polyarchy:xm_qudsest, ~ mean(.x, na.rm = T))
  ) 

If we look at the data we can see that it gives us a summary of the average democracy score per year in standard deviation units:

dem_per_year
# A tibble: 55 × 4
    year v2x_polyarchy  polity2 xm_qudsest
   <dbl>         <dbl>    <dbl>      <dbl>
 1  1946       -0.208   0.0897    0.00797 
 2  1947       -0.106   0.0608    0.00498 
 3  1948       -0.0889  0.0639    0.000867
 4  1949       -0.110   0.0194   -0.0233  
 5  1950       -0.0848  0.0541    0.0103  
 6  1951       -0.0843  0.0344   -0.0174  
 7  1952       -0.0568  0.0539    0.00652 
 8  1953       -0.0504  0.0417   -0.00285 
 9  1954       -0.0585 -0.00305  -0.0530  
10  1955       -0.0395  0.0374   -0.00428 
# ℹ 45 more rows

Now, without performing any other transformations of the data, we can give it directly to ggplot() for plotting. Since we want to see all three democracy measures side-by-side, we can just add a geom_line() layer for each:

ggplot(dem_per_year) +
  aes(x = year) +
  geom_line(
    aes(y = v2x_polyarchy),
    color = "navy"
  ) +
  geom_line(
    aes(y = polity2),
    color = "gray"
  ) +
  geom_line(
    aes(y = xm_qudsest),
    color = "red"
  )

That’s pretty simple, and it doesn’t look too bad. But one problem with this approach is that we don’t have a legend to indicate which line corresponds to which democracy measure. We could add one using a few hacks, but as a rule I like to avoid relying on too many of these. An alternative solution is to reshape our data.
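For completeness, here is what one such hack looks like: mapping a constant label to color inside aes() makes ggplot build a legend entry for each line. This is a minimal sketch with made-up data (the toy object and the "Measure A"/"Measure B" labels are hypothetical stand-ins for our dataset):

```r
library(ggplot2)

## toy stand-in for dem_per_year (values are made up)
toy <- data.frame(
  year = 1946:1950,
  a    = c(-0.2, -0.1, -0.09, -0.11, -0.08),
  b    = c(0.09, 0.06, 0.06, 0.02, 0.05)
)

p <- ggplot(toy, aes(x = year)) +
  ## a string constant inside aes() is treated as a legend label
  geom_line(aes(y = a, color = "Measure A")) +
  geom_line(aes(y = b, color = "Measure B")) +
  labs(y = "value", color = NULL)
p
```

It works, but you still need one geom_line() call per measure, and the labels live inside the plotting code rather than the data. Reshaping the data sidesteps both problems.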

Enter the pivot functions. These live in the {tidyr} package, which, like {dplyr}, is part of the {tidyverse} and is opened automatically when you run library(tidyverse).

To help with plotting democracy trends using each of the measures we have in our dataset, the function we need to use is pivot_longer(). Here’s what it does to the data if we pivot longer on the democracy measures:

dem_per_year |>
  pivot_longer(
    cols = v2x_polyarchy:xm_qudsest
  )
# A tibble: 165 × 3
    year name              value
   <dbl> <chr>             <dbl>
 1  1946 v2x_polyarchy -0.208   
 2  1946 polity2        0.0897  
 3  1946 xm_qudsest     0.00797 
 4  1947 v2x_polyarchy -0.106   
 5  1947 polity2        0.0608  
 6  1947 xm_qudsest     0.00498 
 7  1948 v2x_polyarchy -0.0889  
 8  1948 polity2        0.0639  
 9  1948 xm_qudsest     0.000867
10  1949 v2x_polyarchy -0.110   
# ℹ 155 more rows

As you can see above, I used pivot_longer() to reshape the dataset. In the process of doing this, I made it untidy. As a general rule of thumb, we like tidy data, which, if you remember, has three characteristics:

  1. Each row is an observation.
  2. Each column is a variable.
  3. Each cell has a single entry.

By pivoting the data, it no longer has characteristics 1 and 2. Though this is normally bad, it is actually ideal for plotting our democracy data side-by-side with ggplot. This is because, with the data pivoted, we can do this:

dem_per_year |>
  pivot_longer(
    v2x_polyarchy:xm_qudsest
  ) |>
  ggplot() +
  aes(
    x = year,
    y = value,
    color = name
  ) +
  geom_line()

By pivoting the data, we produced our plot with much less code, and ggplot automatically generated a legend with a different color for each measure of democracy.
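One side note on naming: by default, pivot_longer() calls the new columns name and value, as we saw in the output above. You can choose clearer names with its names_to and values_to arguments. A minimal sketch with a made-up tibble (toy and its values are hypothetical stand-ins for dem_per_year):

```r
library(tidyr)

## toy stand-in for dem_per_year (values are made up)
toy <- tibble::tibble(
  year          = c(1946, 1947),
  v2x_polyarchy = c(-0.21, -0.11),
  polity2       = c(0.09, 0.06)
)

toy |>
  pivot_longer(
    cols      = v2x_polyarchy:polity2,
    names_to  = "measure",
    values_to = "score"
  )
```

With those names, the aes() call in the plot would read color = measure and y = score, which makes the code more self-documenting.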

Importantly, there are few instances outside of the one shown above where pivoting the data like this makes sense. Most of the time, tidy data is best.
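If you have pivoted data and want to get back to the tidy layout, pivot_wider() reverses the operation. A minimal sketch with a made-up long tibble (long and its values are hypothetical stand-ins for the pivoted dem_per_year):

```r
library(tidyr)

## toy long-format data (values are made up)
long <- tibble::tibble(
  year  = c(1946, 1946, 1947, 1947),
  name  = c("v2x_polyarchy", "polity2", "v2x_polyarchy", "polity2"),
  value = c(-0.21, 0.09, -0.11, 0.06)
)

## spread the name column back out into one column per measure
wide <- long |>
  pivot_wider(names_from = name, values_from = value)
wide
```

This gives us back one row per year with one column per democracy measure.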

9.6 Conclusion

Knowing how to properly use the data wrangling verbs discussed in this chapter will take time and plenty of practice. Don’t be discouraged if you struggle at first. If you can focus on just remembering what these verbs are and their purpose, you can work out the details of your code later. Sometimes just knowing the name of the function you want to use is 95% of the battle in programming.

Another tip that I think will be helpful is to remember to save the changes you make to your data. Say you use mutate() to make a new column in your data. If you run something like the following, the original data will be printed out with the new column sdppc, but it won’t be saved.

Data |>
  mutate(
    sdppc = exp(sdpest) / exp(wbpopest)
  )

If you want to ensure that your change to your data is saved you need to make sure you assign the output back to the name of your original data object:

Data |>
  mutate(
    sdppc = exp(sdpest) / exp(wbpopest)
  ) -> Data ## added assignment!

Now the change to the data has been saved.
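The right-pointing assignment -> used above is equivalent to the more common left-pointing form. A minimal sketch with made-up data (the toy sdpest and wbpopest values are hypothetical stand-ins for the real columns):

```r
library(dplyr)

## toy stand-in for Data (values are made up logged quantities)
Data <- tibble::tibble(
  sdpest   = c(10, 11),
  wbpopest = c(8, 9)
)

## assigning back to Data overwrites the old object with the new column
Data <- Data |>
  mutate(sdppc = exp(sdpest) / exp(wbpopest))
```

Either direction works; what matters is that the result is assigned somewhere rather than just printed.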

Something similar is true for the other verbs you apply to your data, but be careful with how you save certain changes. Sometimes it makes sense to assign the changes using the same object name that you started with. Other times, say if you filter the data, select only certain columns, or collapse it in some way, you should save the changed data as a new object with a different name. Otherwise, your original dataset will be lost.
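For example, since filtering drops rows, assigning the result to a new name keeps the original intact. A minimal sketch with made-up data (Data_recent and the toy columns are hypothetical names chosen for illustration):

```r
library(dplyr)

## toy stand-in for the full dataset (values are made up)
Data <- tibble::tibble(
  year    = c(1946, 1990),
  polity2 = c(0.09, 0.5)
)

## save the filtered subset under a NEW name so Data survives unchanged
Data_recent <- Data |>
  filter(year >= 1980) |>
  select(year, polity2)
```

Had we assigned this back to Data, every pre-1980 row would be gone for good (short of re-reading the raw data).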

The big picture takeaway is (1) work on remembering what these different verbs are and what they do and (2) save changes to your data in a sensible way. If you can do these two things, you’ll go a long way in your ability to successfully wrangle all kinds of data.