3 Prerequisites II: Accessing Data and Making Plots

3.1 Goals

  • Get data into R
  • Make a data visualization
  • Cover some helpful resources if you want to go deeper into the details of using R

3.2 Reading data into R

Data analysis is impossible without data. The only way we can use data is to get it into R.

Base R and many add-on packages provide tools for reading data into R. In this class, we’ll primarily use the {peacesciencer} package to query and construct datasets for analysis. However, we’ll also use datasets from other sources. When this is the case, we’ll use additional tools, like the read_csv() function from the {readr} package, which is one of the many packages in the {tidyverse} ecosystem.

There are three main ways that I will have you access data in this class. The first is via .csv files I have stored on my GitHub. The second is local .csv files that you’ll download and store in your project “Data” folder. The third is using R packages (like {peacesciencer}) that have been developed to specifically pull data from different online databases.

Let’s go over these different approaches.

3.2.1 Way 1: From my GitHub

Let’s start with the first way. I put together a dataset that cross-references instances of conflict onset for countries over time with their quality of governance score from the Worldwide Governance Indicators database. Here’s the url to the raw .csv file on my GitHub: https://raw.githubusercontent.com/milesdwilliams15/Teaching/main/DPR%20101/Data/onset_and_wgi.csv

To read the data into R, we can use the read_csv() function like so:

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Data <- read_csv(
  "https://raw.githubusercontent.com/milesdwilliams15/Teaching/main/DPR%20101/Data/onset_and_wgi.csv"
)
Rows: 3047 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (3): year, sumonset1, wgi

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice that I used quotation marks ("") around the url. Some things in R require the use of quotation marks (like file paths, urls, and other character strings) and other things don’t (like the names of objects or functions).
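
To make the quoting rule concrete, here’s a tiny sketch. The object name my_url and the address it holds are made up for illustration; they don’t point to a real dataset.

```r
# A character string needs quotation marks...
my_url <- "https://example.com/some_file.csv"  # hypothetical url, not a real file

# ...but the object that stores it is written without them.
print(my_url)

# Function names are written without quotes, too.
nchar(my_url)  # counts the characters in the string
```

If you typed the url without quotes, R would go looking for an object with that name and throw an error.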

Importantly, when read_csv() reads the data into R, it will spit out a few messages when it’s done. These provide details about how read_csv() assigned variable classes to each of the columns in the data. In this case, it tells us that it assigned country to the character class (that means it’s stored as text, which is how R holds labels like country names) and it assigned year, sumonset1, and wgi the class double, which is R’s way of saying real numbers.
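
If you’d rather tell read_csv() what the column classes should be instead of letting it guess, you can use its col_types argument. Here’s a minimal sketch. To keep it self-contained, it reads a two-row stand-in typed directly into the code rather than the real file, so the object names toy_csv and toy_data and every value in them are made up.

```r
library(readr)

# A made-up, two-row stand-in for the real file (these are NOT real data).
# Wrapping the text in I() tells read_csv() to treat it as data, not a file path.
toy_csv <- I("country,year,sumonset1,wgi
Freedonia,2000,0,0.25
Sylvania,2000,1,-0.5
")

# Spell out the column classes ourselves instead of letting read_csv() guess.
toy_data <- read_csv(
  toy_csv,
  col_types = cols(
    country   = col_character(),
    year      = col_double(),
    sumonset1 = col_double(),
    wgi       = col_double()
  )
)

# Printing the data shows each column's class under its name.
toy_data
```

Because the classes are stated up front, read_csv() has nothing to report and the message goes away. Setting show_col_types = FALSE, as the message itself suggests, is the other way to quiet it.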

3.2.2 Way 2: Local Files in Your Project “Data” Folder

You can give read_csv() a url like I did above, or the location of a file stored elsewhere. For example, I’m writing this note on my laptop, and the same dataset I have on my GitHub also lives in my files on my computer. To access it locally, I could use the here() function from the {here} package to tell R where to pull the data from:

file_location <- here::here(
  "DPR 101", "Data", "onset_and_wgi.csv"
)
Data <- read_csv(file_location)

You don’t have to use here(), but I like using it because it builds the file path starting from my project folder and automatically adds all the pesky slashes I’d otherwise need to type to tell R where a file lives.
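
Here’s a small sketch of the same idea that you can run anywhere. Since here() depends on where your own project lives, this version uses base R’s file.path() and a temporary folder instead. The folder, file name, and data are all made up for illustration.

```r
library(readr)

# Build a path piece by piece; file.path() inserts the separators for us.
# (here::here() does the same job, but starts from your project folder.)
toy_folder <- file.path(tempdir(), "Data")          # hypothetical "Data" folder
dir.create(toy_folder, showWarnings = FALSE)
toy_file <- file.path(toy_folder, "toy_data.csv")   # hypothetical file name

# Save a tiny made-up dataset, then read it back in from that location.
write_csv(data.frame(country = c("A", "B"), year = c(2000, 2001)), toy_file)
toy_local <- read_csv(toy_file, show_col_types = FALSE)
toy_local
```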

3.2.3 Way 3: R Packages

Some datasets come pre-installed in R, and some are accessible with different R packages. For example, the mtcars data frame is automatically accessible in R the moment you open the Posit Workbench.

For other datasets, you’ll need to use certain R packages that have been created to make it possible to access, query, and load different datasets. Some examples include {democracyData} and, of course, {peacesciencer}, which we’ll cover in the next chapter.
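
As a quick check of the pre-installed case, you can look at mtcars without loading anything first:

```r
# mtcars comes with R (in the {datasets} package), so no library() call is needed.
head(mtcars, n = 3)   # show the first three rows

# Its dimensions: number of rows, then number of columns.
dim(mtcars)
```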

3.3 Looking at Data

One of the first steps in data analysis (once our data is cleaned up of course) is data visualization. Looking at our data can tell us a lot even before we do more formal analyses.

As an example, let’s work on some data visualizations using {peacesciencer} data. First, let’s open the {peacesciencer} package using the library() function. Let’s also open the {tidyverse}.

library(peacesciencer)
library(tidyverse)

Next, let’s make a very simple country-year dataset. This is a kind of time-series panel dataset that gives us units of observation (countries) over multiple periods of time (years). We can do this with the create_stateyears() function from {peacesciencer}.

data <- create_stateyears()

We now have an object called data in R. Let’s use glimpse() to take a look inside:

glimpse(data)
Rows: 17,316
Columns: 3
$ ccode    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ statenme <chr> "United States of America", "United States of America", "Unit…
$ year     <int> 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1…

The output from glimpse() tells us a few things about this object called data. It has 17,316 rows (state-year observations) and 3 columns, each indicating something about the country-year observation.

The ccode column is a numerical code called the “correlates of war” code. These are special codes created by the Correlates of War (CoW) Project, and they provide a standardized way of referring to specific countries in the international system. They make it easier to combine datasets based on country ID. While you could combine datasets using actual country names, spellings can differ from one dataset to the next, creating errors in the merging process. Having a standardized coding system for countries helps us avoid this problem. As we’ll discuss later, {peacesciencer} lets you use either of two coding systems: the CoW codes or, alternatively, the Gleditsch-Ward (GW) system membership codes.
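
To see why codes are safer than names, here’s a minimal sketch with two made-up datasets. The ccode values follow CoW (2 for the United States, 200 for the United Kingdom), but the object names and the x and y columns are invented for illustration.

```r
library(dplyr)

# Two made-up datasets that spell the same country differently.
toy_a <- tibble(
  ccode    = c(2, 200),
  statenme = c("United States of America", "United Kingdom"),
  x        = c(1, 2)
)
toy_b <- tibble(
  ccode    = c(2, 200),
  statenme = c("United States", "United Kingdom"),
  y        = c(10, 20)
)

# Joining on names leaves the U.S. row unmatched, so its y value is NA...
by_name <- left_join(toy_a, select(toy_b, -ccode), by = "statenme")

# ...while joining on the standardized code matches both rows.
by_code <- left_join(toy_a, select(toy_b, -statenme), by = "ccode")

by_name
by_code
```

Notice that the join on names doesn’t throw an error. It just quietly fills in a missing value, which is exactly the kind of problem that’s easy to overlook.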

The next column in the data is called statenme. This gives us the actual name of the country. There is also a year column that tells us the year we have an observation for a given country.

Most conflict studies run from the years 1816 to the early 2000s or the present. That’s because these are the years for which we have the highest quality data. We can look at the range of years covered in our data object by using the range() function on year:

range(data$year)
[1] 1816 2023

It goes all the way from 1816 to 2023. However, it does not have every possible year in this range for every country. We can check the number of years for each country in the data using some tools from the {tidyverse} (specifically from {dplyr}):

data |>
  group_by(statenme) |>
  summarize(
    n_years = n()
  )
# A tibble: 217 × 2
   statenme          n_years
   <chr>               <int>
 1 Afghanistan           105
 2 Albania               106
 3 Algeria                62
 4 Andorra                31
 5 Angola                 49
 6 Antigua & Barbuda      43
 7 Argentina             183
 8 Armenia                33
 9 Australia             104
10 Austria                89
# ℹ 207 more rows

Using a combination of the pipe operator |> and the functions group_by() and summarize(), I gave R instructions to calculate the number of times a country appears in the data. I can also get the same answer by writing the following instead:

data |>
  count(statenme)
# A tibble: 217 × 2
   statenme              n
   <chr>             <int>
 1 Afghanistan         105
 2 Albania             106
 3 Algeria              62
 4 Andorra              31
 5 Angola               49
 6 Antigua & Barbuda    43
 7 Argentina           183
 8 Armenia              33
 9 Australia           104
10 Austria              89
# ℹ 207 more rows

By looking at the first 10 rows in this data, it’s clear that not all countries have the same number of years. This is because not all countries existed as of 1816 or still existed by 2023. The create_stateyears() function only returns country-year observations for which a country was officially considered a country according to the folks at CoW.
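
Here’s a small sketch of the same logic on made-up data, where it’s easy to check the answers by eye. The object toy_years and its two countries are invented for illustration. It also shows one way you might find the first and last year each country appears.

```r
library(dplyr)

# Made-up state-year rows: "A" exists for three years, "B" for only two.
toy_years <- tibble(
  statenme = c("A", "A", "A", "B", "B"),
  year     = c(1816, 1817, 1818, 1900, 1901)
)

# The long way and the shortcut give the same counts.
long_way <- toy_years |> group_by(statenme) |> summarize(n = n())
shortcut <- toy_years |> count(statenme)

long_way
shortcut

# We can also ask for the first and last year each country appears.
toy_years |>
  group_by(statenme) |>
  summarize(first_year = min(year), last_year = max(year))
```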

Alright, with this data, let’s look at some of our first conflict data (civil wars). We can merge civil war data into data by using the add_cow_wars() function and telling it we want type = "intra". This is short for “intra-state” wars, which are wars that take place within a particular country involving the government and one or more non-government actors.

data <- data |>
  add_cow_wars(type = "intra")
Joining with `by = join_by(ccode, year)`

If we use glimpse() we can see that our data has a few extra columns now:

glimpse(data)
Rows: 17,316
Columns: 13
$ ccode           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ statenme        <chr> "United States of America", "United States of America"…
$ year            <int> 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, …
$ warnum          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ warname         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ wartype         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cowintraonset   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cowintraongoing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ intnl           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ outcome         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sideadeaths     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sidebdeaths     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ intrawarnums    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

We’ll cover what these columns mean later, but for right now let’s focus on the cowintraonset column. This column provides data for a binary variable that takes the value \(1\) when a civil war started in a given country in a given year. Otherwise, it’s \(0\).

Now, it’s important to confirm with our new data that we have valid observations for the time-series under study. Let’s check:

data |>
  group_by(year) |>
  summarize(
    wars_detected = any(cowintraonset == 1)
  ) |>
  filter(is.na(wars_detected))
# A tibble: 16 × 2
    year wars_detected
   <int> <lgl>        
 1  2008 NA           
 2  2009 NA           
 3  2010 NA           
 4  2011 NA           
 5  2012 NA           
 6  2013 NA           
 7  2014 NA           
 8  2015 NA           
 9  2016 NA           
10  2017 NA           
11  2018 NA           
12  2019 NA           
13  2020 NA           
14  2021 NA           
15  2022 NA           
16  2023 NA           

It looks like all years from 2008 onward don’t have valid observations (a value that’s either \(0\) or \(1\)). That means our civil war data only goes up to the year 2007. We should probably drop all the years from 2008 onward.
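
If it isn’t obvious why any() gives back NA here, this short base R sketch (with made-up values) shows how it treats missing data:

```r
# No onsets observed, but one value is missing:
# R can't rule out that the missing value is a 1, so it returns NA.
any(c(0, 0, NA) == 1)

# An onset is observed: the answer is TRUE no matter what the missing value is.
any(c(0, 1, NA) == 1)

# Every value is missing (like 2008 onward in our data): NA again.
any(c(NA, NA) == 1)

# na.rm = TRUE ignores missing values, so an all-missing year comes back FALSE.
# That would hide the gap in the data rather than flag it.
any(c(NA, NA) == 1, na.rm = TRUE)
```

That last case is why I left na.rm alone in the check above. For this purpose, we want the missing years to stand out.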

data <- data |>
  drop_na(cowintraonset)

The above code uses the drop_na() function to tell R we want to drop all rows in the data where cowintraonset does not have a valid value. To double check that it worked we can repeat the code from earlier:

data |>
  group_by(year) |>
  summarize(
    wars_detected = any(cowintraonset == 1)
  ) |>
  filter(is.na(wars_detected))
# A tibble: 0 × 2
# ℹ 2 variables: year <int>, wars_detected <lgl>

We see zero rows returned, which tells us we now have all the observations for which our civil war data is valid.
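
To see what drop_na() does on a small scale, here’s a sketch with three made-up rows. The object names are invented for illustration. It also shows an equivalent way to write the same step using filter().

```r
library(dplyr)
library(tidyr)

# Made-up rows: the 2008 value is missing.
toy_wars <- tibble(
  year          = c(2006, 2007, 2008),
  cowintraonset = c(0, 1, NA)
)

# drop_na() removes rows where the named column is missing.
dropped <- toy_wars |> drop_na(cowintraonset)

# filter() with !is.na() keeps the rows that are NOT missing, which is the same thing.
filtered <- toy_wars |> filter(!is.na(cowintraonset))

dropped
filtered
```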

One thing we can do with this data is summarize the frequency of civil wars over time. We can do this with some more tools from {dplyr}.

summary_data <- data |>
  group_by(year) |>
  summarize(
    n_wars = sum(cowintraonset)
  )

Using the above code, we’ve told R to create a new object called summary_data which contains for each given year a count of the number of civil wars that were newly started. We can look at it using glimpse().

glimpse(summary_data)
Rows: 193
Columns: 2
$ year   <int> 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 182…
$ n_wars <dbl> 0, 0, 1, 0, 1, 2, 0, 0, 0, 0, 1, 0, 1, 0, 4, 2, 1, 0, 2, 2, 2, …
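
The reason sum() works as a counter is that the onset column only holds 0s and 1s, so adding it up within a year counts the 1s. Here’s a sketch with made-up rows (the object names are invented for illustration):

```r
library(dplyr)

# Made-up country-year rows: two onsets in 1900, none in 1901.
toy_onsets <- tibble(
  year          = c(1900, 1900, 1900, 1901, 1901),
  cowintraonset = c(1, 0, 1, 0, 0)
)

# Summing a 0/1 column within each year counts the 1s.
toy_summary <- toy_onsets |>
  group_by(year) |>
  summarize(n_wars = sum(cowintraonset))

toy_summary
```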

With this data, we can now tell R to produce a plot. In this class, we’ll use plotting tools from the {ggplot2} package, which is another sub-package of the {tidyverse}. Here’s a scatter plot showing, per year, the number of civil wars started:

ggplot(summary_data) +
  aes(x = year, y = n_wars) +
  geom_point()

Really quickly, I want you to notice a few things about ggplot from the code used to produce the above figure. First, ggplot works by building figures in steps. We call these layers. Second, we add layers (literally) by using the + operator. While normally we use this for addition (e.g., 2 + 2), when we use ggplot the + acts a lot like the pipe operator (|>). Basically, the + in ggplot just tells R that we want to add a new set of commands or instructions for creating a plot.

More formally, the ggplot workflow looks like:

  1. Feed ggplot data.
  2. Map aesthetics (tell ggplot what relationships to show).
  3. Draw geometry (tell ggplot how to show these relationships).
  4. Customize.
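
The four steps above can be sketched in code like this, with each line labeled by the step it carries out. The data frame toy_counts is made up so that the example runs on its own.

```r
library(ggplot2)

# Made-up yearly counts (not real conflict data).
toy_counts <- data.frame(
  year   = 1900:1909,
  n_wars = c(0, 1, 0, 2, 1, 0, 3, 1, 0, 2)
)

toy_plot <- ggplot(toy_counts) +           # 1. feed ggplot data
  aes(x = year, y = n_wars) +              # 2. map aesthetics
  geom_point() +                           # 3. draw geometry
  labs(x = "Year", y = "New civil wars")   # 4. customize

toy_plot
```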

These steps are just a starting point. We can riff on them in various ways to make a more interesting plot. As we do this, another thing to note about making figures with ggplot() is that we can save them as objects in R. Check it out:

p <- ggplot(summary_data) +
  aes(x = year, y = n_wars) +
  geom_point()

The object p is our ggplot data viz. Now, every time we write p, it tells R to produce the figure:

p

This feature of working with ggplot is great. The biggest benefit is that it lets us build up a solid foundation for a data viz and then add new details later.

For example, say we want to add another geometry layer, like a regression smoother (basically a model that finds the best fit for the data). All we need to do is add a new layer to p like so:

p + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Because I saved the first plot as the object p, I didn’t have to re-write the code to produce the old plot before adding a new layer.
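
The same trick makes it easy to try out alternatives. As a sketch, here’s a saved plot with two different smoothers added to it: the default “loess” curve and a straight regression line (method = "lm"). The data and object names are made up for illustration.

```r
library(ggplot2)

# Made-up yearly counts (not real conflict data).
toy_counts <- data.frame(
  year   = 1900:1909,
  n_wars = c(0, 1, 0, 2, 1, 0, 3, 1, 0, 2)
)

# Save the foundation once...
base_plot <- ggplot(toy_counts) +
  aes(x = year, y = n_wars) +
  geom_point()

# ...then layer different smoothers on top without re-writing it.
loess_plot  <- base_plot + geom_smooth(method = "loess", formula = y ~ x)
linear_plot <- base_plot + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

linear_plot
```

Setting method and formula myself also stops geom_smooth() from printing its message about which defaults it picked. The se = FALSE option just turns off the shaded uncertainty band.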

In addition to the smoothed model we fit to the data, let’s update the plot labels as well to make them more informative:

p + 
  geom_smooth() +
  labs(
    x = "Year",
    y = "Number of New Civil Wars",
    title = "Civil Wars from 1816-2007"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The labs() function lets us update the labels of our plot.
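
Beyond x, y, and title, labs() also accepts a subtitle and a caption. And once you like how a plot looks, ggsave() will write it to a file. Here’s a sketch with made-up data; the file name is hypothetical, and it saves to a temporary folder so it won’t clutter your project.

```r
library(ggplot2)

# Made-up yearly counts (not real conflict data).
toy_counts <- data.frame(
  year   = 1900:1909,
  n_wars = c(0, 1, 0, 2, 1, 0, 3, 1, 0, 2)
)

labeled_plot <- ggplot(toy_counts) +
  aes(x = year, y = n_wars) +
  geom_point() +
  labs(
    x = "Year",
    y = "Number of New Civil Wars",
    title = "A Toy Example",
    subtitle = "Made-up counts for illustration",
    caption = "Source: invented data"
  )

# Save the plot as a PDF; width and height are in inches by default.
out_file <- file.path(tempdir(), "toy_plot.pdf")  # hypothetical file name
ggsave(out_file, plot = labeled_plot, width = 6, height = 4)
```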

As we work with datasets in subsequent chapters, we’ll produce all kinds of data visualizations to help us look at trends in conflicts.

3.4 Helpful Resources for Learning More

This class is not about how to use R in its entirety. Here are some resources for you to check out on your own time:

My personal favorite is swirlstats.com, because it lets you work at your own pace directly in R, and for free.

3.5 Wrapping up

Working with code is hard, and I promise you that you will run into problems. When you do, just remember that everyone has problems with their code. It’s normal, and it would be weird if you didn’t have any issues. Learning to troubleshoot and work through problems is an essential skill to develop.