3  Prerequisites II: Accessing Data and Making Plots

3.1 Goals

  • Get data into R
  • Make a data visualization
  • Cover some helpful resources if you want to go deeper into the details of using R

3.2 Reading data into R

Data analysis is impossible without data, and before we can analyze any data we first have to get it into R.

Many packages, along with base R, provide tools for reading data into R. In this class, we’ll primarily use the read_csv() function from the {readr} package, which is one of the many packages in the {tidyverse}.

There are four main ways that I will have you access data in this class. The first is via .csv files I have stored on my GitHub. The second is via files stored in your “Data” folder. The third is from Google Drive. The fourth is using R packages that have been developed to specifically pull data from different online databases.

Let’s go over these different approaches.

3.2.1 Way 1: From my GitHub

Let’s start with the first way. I put together a dataset that cross-references instances of conflict onset for countries over time with their quality of governance score from the Worldwide Governance Indicators database. Here’s the url to the raw .csv file on my GitHub: https://raw.githubusercontent.com/milesdwilliams15/Teaching/main/DPR%20101/Data/onset_and_wgi.csv

To read the data into R, we can use the read_csv() function like so:

library(tidyverse)
Data <- read_csv(
  "https://raw.githubusercontent.com/milesdwilliams15/Teaching/main/DPR%20101/Data/onset_and_wgi.csv"
)

Notice that I used quotation marks ("") around the url. Some things in R require quotation marks (like file locations, urls, and other character strings) and other things don’t (like the names of objects or functions).
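As a quick illustration of the quotation mark rule, here is a minimal sketch. The object name my_url is made up for this example; the url is the same one used above.

```r
# A character string (here, a url) needs quotation marks...
my_url <- "https://raw.githubusercontent.com/milesdwilliams15/Teaching/main/DPR%20101/Data/onset_and_wgi.csv"

# ...but the name of the object that stores it does not.
class(my_url)   # reports what kind of object my_url is
nchar(my_url)   # counts the characters in the url

# Because my_url holds the url, read_csv(my_url) would do the same thing
# as typing the quoted url directly inside read_csv().
```

If you typed the url without quotation marks, R would look for objects with those names, fail to find them, and return an error.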

3.2.2 Way 2: Local Files

You can give read_csv() a url like I did above, or the location of a file stored elsewhere. For example, I’m writing this note on my laptop, and the same dataset I have on my GitHub also lives in my files on my computer. To access it locally, I could use the here() function from the {here} package to tell R where to pull the data from:

file_location <- here::here(
  "DPR 101", "Data", "onset_and_wgi.csv"
)
Data <- read_csv(file_location)

You don’t have to use here(), but I like using it because it automatically builds the full file path, starting from the top-level folder of my project, and adds all the pesky slashes between folder names that R needs to find the file.
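If you’re curious what building a file path looks like, here is a small sketch using base R’s file.path() rather than here(). It is only meant to show the idea of joining folder names; unlike here(), it does not start from your project’s top-level folder. The object name relative_path is made up for this example.

```r
# file.path() joins folder and file names with a slash,
# so you never have to type the separators yourself.
relative_path <- file.path("DPR 101", "Data", "onset_and_wgi.csv")
relative_path

# file.exists() checks whether R can actually find a file at that
# location. It returns TRUE or FALSE.
file.exists(relative_path)
```

Checking file.exists() before calling read_csv() is a handy way to figure out whether an error is caused by a wrong file location or by something else.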

Importantly, when read_csv() reads the data into R, it will spit out a few messages when it’s done. These provide details about how read_csv() assigned variable classes to each of the columns in the data. In this case, it tells us that it assigned country to the character class (that means it stores text, which we can treat as a non-ordered category) and it assigned year, sumonset1, and wgi the class double, which is R’s way of storing real numbers (numbers that can have decimals).
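To see how column classes get assigned without downloading anything, here is a minimal sketch with a tiny made-up dataset. The values and the object names (toy_csv, toy_data, toy_data2) are invented for illustration and are not from the conflict dataset.

```r
library(readr)

# A tiny made-up dataset written as text. I() tells read_csv() to
# treat the string as data rather than as a file location.
toy_csv <- I("country,year,wgi\nA,2000,0.5\nB,2001,-1.2\n")

# show_col_types = FALSE silences the message about column classes.
toy_data <- read_csv(toy_csv, show_col_types = FALSE)

# spec() reports the class read_csv() guessed for each column.
spec(toy_data)

# col_types lets you set the classes yourself instead of letting
# read_csv() guess.
toy_data2 <- read_csv(
  toy_csv,
  col_types = cols(
    country = col_character(),
    year    = col_integer(),
    wgi     = col_double()
  )
)
```

Most of the time the guesses are fine, so you can treat col_types as an optional tool for the cases where a column gets read in as the wrong class.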

3.2.3 Way 3: From Google Drive

Some datasets for this class will come from Google Drive. Here’s an example pulling from a Google Sheets document that has voter turnout data from a field experiment done in New Haven, CT in 1998:

url <- "https://docs.google.com/spreadsheets/d/19RaIaVoJMChVsNGO45OrqZcSE3to-EwQ-vywngHLbks/edit#gid=817523709"
library(googlesheets4)
gs4_deauth()
turnout_data <- range_speedread(url)

To access Google Sheets files you need to load the {googlesheets4} package. It has a few functions for reading in data, but the one I recommend is range_speedread(), which is usually the fastest.

The workflow for reading in data this way is very similar to the approach I took for reading in .csv files from GitHub. The main differences are:

  1. You need to load the {googlesheets4} package.
  2. You need to run the function gs4_deauth() before you try reading in the data.

The gs4_deauth() function tells {googlesheets4} not to ask you to sign in with a Google account before accessing files from Drive. This only works for files that anyone with the link can view.

3.2.4 Way 4: R Packages

Some datasets come pre-installed in R, and some are accessible with different R packages. For example, the mtcars data frame is automatically accessible in R the moment you open the Posit Workbench.

For other datasets, you’ll need to use R packages that were created to make it possible to access, query, and attach different datasets. Some examples include {democracyData} and {peacesciencer}.
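For the pre-installed case, here is a quick sketch using mtcars. Nothing needs to be installed or loaded for this to run.

```r
# mtcars ships with R (in the {datasets} package), so no
# library() call or file location is needed.
head(mtcars)    # first six rows
dim(mtcars)     # number of rows, then number of columns
names(mtcars)   # column names
```

Datasets like this are useful for practice because everyone has the exact same copy.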

3.3 Making a figure

One of the first steps in data analysis (once our data is cleaned up of course) is data visualization. Looking at our data can tell us a lot even before we do more formal analyses.

As an example, let’s use the gapminder dataset from the {gapminder} package:

# install.packages("gapminder")
library(gapminder)

We now have an object called gapminder in R. This is a dataset that contains information for a bunch of countries over time about wealth and life expectancy. Below, I’m using the sample_n() function to look at 10 random rows from the data (notice the use of the pipe operator):

gapminder |>
  sample_n(10)
# A tibble: 10 × 6
   country            continent  year lifeExp      pop gdpPercap
   <fct>              <fct>     <int>   <dbl>    <int>     <dbl>
 1 Yemen, Rep.        Asia       1972    39.8  7407075     1265.
 2 Reunion            Africa     1977    67.1   492095     4320.
 3 Guatemala          Americas   1957    44.1  3640876     2617.
 4 Mexico             Americas   1982    67.4 71640904     9611.
 5 Romania            Europe     1992    69.4 22797027     6598.
 6 Gambia             Africa     2002    58.0  1457766      661.
 7 Lebanon            Asia       1957    59.5  1647412     6090.
 8 Eritrea            Africa     1962    40.2  1666618      381.
 9 Dominican Republic Americas   1982    63.7  5968349     2861.
10 Uganda             Africa     2007    51.5 29170398     1056.

Using this data, we can make a simple scatter plot showing how per capita GDP predicts life expectancy. We’ll do this using ggplot() from the {ggplot2} package, which is part of the {tidyverse}.

ggplot(gapminder) + 
  aes(x = gdpPercap, y = lifeExp) +
  geom_point()

You can notice a few things about ggplot from the code used to produce the above figure. First, ggplot works by building figures in steps. We call these layers. Second, we add layers (literally) by using the + operator. Normally we use + for addition (e.g., 2 + 2), but in ggplot it acts a lot like a pipe operator (%>% from the {magrittr} package, or the native |> available in R 4.1 and later). Basically, the + in ggplot just tells R that we want to add a new set of commands or instructions for creating a plot.

More formally, the ggplot workflow looks like:

  1. Feed ggplot data.
  2. Map aesthetics (tell ggplot what relationships to show).
  3. Draw geometry (tell ggplot how to show these relationships).
  4. Customize.
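The four steps above can be sketched in code like this. The axis labels, the title, and the object name workflow_plot are my own choices for illustration; any labels would work.

```r
library(ggplot2)
library(gapminder)

workflow_plot <- ggplot(gapminder) +      # 1. feed ggplot data
  aes(x = gdpPercap, y = lifeExp) +       # 2. map aesthetics
  geom_point() +                          # 3. draw geometry
  labs(                                   # 4. customize
    x = "GDP per capita",
    y = "Life expectancy (years)",
    title = "Wealth and life expectancy"
  )

workflow_plot
```

Each line ending in + hands off to the next step, so you can read the code from top to bottom as a recipe.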

Back to the figure we made: as a first pass, this isn’t too bad. We could obviously add a few more flourishes to make our data viz publication ready, but this is enough to get a sense for the data. Take a look at the figure. What does it tell us about life expectancy and GDP per capita?

Another thing to note about making figures with ggplot() is that we can save them as objects in R. Check it out:

p <- ggplot(gapminder) +
  aes(x = gdpPercap, y = lifeExp) + 
  geom_point()

The object p is our ggplot data viz. Now, every time we write p, it tells R to produce the figure:

p

This feature of working with ggplot is great. The biggest benefit is that it lets us build up a solid foundation for a data viz and then add new details later.

For example, say we want to compare the above figure to a version where the scale for GDP per capita is different (say using the log-10 scale). All we need to do is add a new layer to p like so:

p + scale_x_log10()

Because I saved the first plot as the object p, I didn’t have to re-write the code to produce the old plot before adding a new layer.
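Here is a short sketch of how this plays out when you stack more than one new layer. The object name p_log and the axis labels are made up for this example.

```r
library(ggplot2)
library(gapminder)

# The base plot, saved once.
p <- ggplot(gapminder) +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point()

# A new version that stacks two additions on top of p.
p_log <- p +
  scale_x_log10() +
  labs(x = "GDP per capita (log-10 scale)", y = "Life expectancy")

p      # still the original plot
p_log  # the customized version
```

Adding to p does not change p itself. The new version only sticks around if you save it as its own object, like p_log here.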

Speaking of this new layer, does the log10 scale change any conclusions you previously drew about the relationship between per capita GDP and life expectancy?

3.4 Helpful Resources for Learning More

This class is not about how to use R in its entirety. But, you may find it helpful to get more familiar with it. Here are some resources for you to check out on your own time:

My personal favorite is swirlstats.com, because it lets you work at your own pace directly in R, and for free.

3.5 Wrapping up

Working with code is hard, and I promise you that you will run into problems. When you do, just remember that everyone has problems with their code. It’s normal, and it would be weird if you didn’t have any issues.

We’ll get more into the details of working with ggplot in the coming weeks. But for now, you should have at minimum some helpful examples for how to:

  • Read data into R
  • Make a scatter plot using ggplot
  • Access other resources for working in R