library(tidyverse)
Data <- read_csv(
  "https://raw.githubusercontent.com/milesdwilliams15/Teaching/main/DPR%20101/Data/onset_and_wgi.csv"
)
3 Prerequisites II: Accessing Data and Making Plots
3.1 Goals
- Get data into R
- Make a data visualization
- Cover some helpful resources if you want to go deeper into the details of using R
3.2 Reading data into R
Data analysis is impossible without data. The only way we can use data is to get it into R.
Base R and many packages provide tools for reading data into R. In this class, we’ll primarily use the read_csv() function from the {readr} package, one of the many packages in the {tidyverse} ecosystem.
There are four main ways that I will have you access data in this class. The first is via .csv files I have stored on my GitHub. The second is via files stored in your “Data” folder. The third is from Google Drive. The fourth is using R packages that have been developed to specifically pull data from different online databases.
Let’s go over these different approaches.
3.2.1 Way 1: From my GitHub
Let’s start with the first way. I put together a dataset that cross-references instances of conflict onset for countries over time with their quality of governance score from the Worldwide Governance Indicators database. Here’s the url to the raw .csv file on my GitHub: https://raw.githubusercontent.com/milesdwilliams15/Teaching/main/DPR%20101/Data/onset_and_wgi.csv
To read the data into R, we can pass that URL to the read_csv() function and save the result as an object called Data, as in the code at the top of this chapter.
Notice that I used quotation marks ("") around the URL. Some things in R require quotation marks (like file locations and other character strings) and other things don’t (like the names of objects or functions).
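As a minimal sketch of the quoting rule, the example below writes a tiny made-up table to a temporary .csv file and reads it back. The file, its columns, and the object names (toy_path, toy_data) are hypothetical stand-ins so the code runs without an internet connection; the same pattern applies to the GitHub URL above.

```r
library(readr)

# A made-up two-row table, written to a temporary file for illustration only.
toy_path <- tempfile(fileext = ".csv")
write_csv(data.frame(country = c("A", "B"), year = c(2000, 2001)), toy_path)

# The file location is a character string. Here it is stored in an object,
# so the object name goes in WITHOUT quotation marks.
toy_data <- read_csv(toy_path)

# Typing a location directly requires quotation marks, for example:
# read_csv("my_folder/my_file.csv")   # hypothetical path

toy_data
```

Leaving the quotation marks off a location you type directly makes R look for an object with that name instead of a file, which usually ends in an "object not found" error.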
3.2.2 Way 2: Local Files
You can give read_csv() a URL like I did above, or the location of a file stored elsewhere. For example, I’m writing this note on my laptop, and the same dataset I have on my GitHub also lives in my files on my computer. To access it locally, I could use the here() function from the {here} package to tell R where to pull the data from:
file_location <- here::here(
  "DPR 101", "Data", "onset_and_wgi.csv"
)
Data <- read_csv(file_location)
You don’t have to use here(), but I like using it because it builds the full path from my project folder for me, adding all the pesky slashes between folder names that R needs to find a file.
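If you would rather not use {here}, base R’s file.path() is one alternative for gluing folder names together. This is only a sketch, not the approach used in the rest of these notes. Unlike here(), it does not anchor the path at your project folder, so the result is read relative to R’s current working directory.

```r
# file.path() joins its arguments with forward slashes, which R accepts
# on Windows, macOS, and Linux alike.
file_location <- file.path("DPR 101", "Data", "onset_and_wgi.csv")
file_location

# The path is relative: R starts looking for it from the working directory.
getwd()
```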
Importantly, when read_csv() reads the data into R, it prints a few messages when it’s done. These describe how read_csv() assigned a class to each column in the data. In this case, it tells us that it assigned country to the character class (text, which here works as an unordered category) and assigned year, sumonset1, and wgi the class double, which is R’s way of storing real numbers.
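To check column classes yourself, something like the sketch below can help. It uses a few made-up rows that only borrow the column names of the conflict dataset (the values and the names toy_csv, toy_data, and col_types are hypothetical), so it runs without downloading anything.

```r
library(readr)

# Made-up rows that mimic the column names of the conflict dataset.
# I() tells read_csv() to treat the text itself as the data.
toy_csv <- I("country,year,sumonset1,wgi\nA,2000,0,-1.25\nB,2001,1,0.5\n")
toy_data <- read_csv(toy_csv)

# The storage type of every column: text is "character", numbers are "double".
col_types <- vapply(toy_data, typeof, character(1))
col_types

# spec() reprints the column specification that read_csv() reported.
spec(toy_data)
```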
3.2.3 Way 3: From Google Drive
Some datasets for this class will come from Google Drive. Here’s an example pulling from a Google Sheets document that has voter turnout data from a field experiment done in New Haven, CT in 1998:
url <- "https://docs.google.com/spreadsheets/d/19RaIaVoJMChVsNGO45OrqZcSE3to-EwQ-vywngHLbks/edit#gid=817523709"
library(googlesheets4)
gs4_deauth()
turnout_data <- range_speedread(url)
To access Google Sheets files you need to load the {googlesheets4} package. It has a few functions for reading in data, but the best and fastest is range_speedread().
The workflow for reading in data this way is very similar to the approach I took for reading in .csv files from GitHub. The main differences are:
- You need to load the {googlesheets4} package.
- You need to run the function gs4_deauth() before you try reading in the data.
The gs4_deauth() function tells {googlesheets4} to skip signing in to a Google account. This works for sheets that are shared with anyone who has the link.
3.2.4 Way 4: R Packages
Some datasets come pre-installed in R, and others are accessible through R packages. For example, the mtcars data frame is available the moment you start an R session, including in Posit Workbench.
For other datasets you’ll need R packages that were created to access, query, and attach them. Some examples include {democracyData} and {peacesciencer}.
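As a quick illustration of a pre-installed dataset, the lines below look at mtcars without loading any package or file.

```r
# mtcars ships with base R (in the {datasets} package), so no library() call
# or download is needed.
head(mtcars)

# Number of rows and columns in the data frame.
dim(mtcars)
```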
3.3 Making a figure
One of the first steps in data analysis (once our data is cleaned up of course) is data visualization. Looking at our data can tell us a lot even before we do more formal analyses.
As an example, let’s use the gapminder dataset from the {gapminder} package:
# install.packages("gapminder")
library(gapminder)
We now have an object called gapminder in R. This is a dataset that contains information for a bunch of countries over time about wealth and life expectancy. Below, I’m using the sample_n() function to look at 10 random rows from the data (notice the use of the pipe operator):
gapminder |>
  sample_n(10)
# A tibble: 10 × 6
   country            continent  year lifeExp      pop gdpPercap
   <fct>              <fct>     <int>   <dbl>    <int>     <dbl>
 1 Yemen, Rep.        Asia       1972    39.8  7407075     1265.
 2 Reunion            Africa     1977    67.1   492095     4320.
 3 Guatemala          Americas   1957    44.1  3640876     2617.
 4 Mexico             Americas   1982    67.4 71640904     9611.
 5 Romania            Europe     1992    69.4 22797027     6598.
 6 Gambia             Africa     2002    58.0  1457766      661.
 7 Lebanon            Asia       1957    59.5  1647412     6090.
 8 Eritrea            Africa     1962    40.2  1666618      381.
 9 Dominican Republic Americas   1982    63.7  5968349     2861.
10 Uganda             Africa     2007    51.5 29170398     1056.
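Because the rows are drawn at random, your output will differ from mine each time you run the code. In current versions of {dplyr}, sample_n() is marked as superseded in favor of slice_sample(). The sketch below shows that newer function, with set.seed() added (my addition, with an arbitrary seed value) so the draw is repeatable. It uses the built-in mtcars data so it runs without {gapminder}.

```r
library(dplyr)

# Fix the random number seed so the same rows are drawn every time.
set.seed(101)

# Draw 10 random rows; slice_sample(n = 10) plays the role of sample_n(10).
random_rows <- mtcars |>
  slice_sample(n = 10)

random_rows
```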
Using this data, we can make a simple scatter plot showing how per capita GDP predicts life expectancy. We’ll do this using ggplot() from the {ggplot2} package, which is part of the {tidyverse}.
ggplot(gapminder) +
aes(x = gdpPercap, y = lifeExp) +
geom_point()
You can notice a few things about ggplot from the code used to produce the above figure. First, ggplot works by building figures in steps. We call these layers. Second, we add layers (literally) by using the + operator. While normally we use this for addition (e.g., 2 + 2), when we use ggplot the + acts a lot like the pipe operator (%>% or, in R 4.1 and later, the native |>). Basically, the + in ggplot just tells R that we want to add a new set of commands or instructions for creating a plot.
More formally, the ggplot workflow looks like:
- Feed ggplot data.
- Map aesthetics (tell ggplot what relationships to show).
- Draw geometry (tell ggplot how to show these relationships).
- Customize.
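The four steps above can be sketched in a single chain, with one comment per step. This example uses the built-in mtcars data rather than gapminder so it runs with only {ggplot2} loaded; the object name workflow_plot and the label text are my own choices for illustration.

```r
library(ggplot2)

# Step 1: feed ggplot data (the built-in mtcars data frame).
workflow_plot <- ggplot(mtcars) +
  # Step 2: map aesthetics (which columns go on which axis).
  aes(x = wt, y = mpg) +
  # Step 3: draw geometry (show each car as a point).
  geom_point() +
  # Step 4: customize (readable axis labels and a title).
  labs(
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon",
    title = "Car weight and fuel economy"
  )

workflow_plot
```

Swapping in gapminder, gdpPercap, and lifeExp at steps 1 and 2 gives the same structure as the scatter plot above.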
Back to the figure we made, as a first pass this isn’t too bad. We could obviously add a few more flourishes to make our data viz publication ready, but this is enough to get a sense for the data. Take a look at the figure. What does it tell us about life expectancy and GDP per capita?
Another thing to note about making figures with ggplot() is that we can save them as objects in R. Check it out:
p <- ggplot(gapminder) +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point()
The object p is our ggplot data viz. Now, every time we write p, it tells R to produce the figure:
p
This feature of working with ggplot is great. The biggest benefit is that it lets us build up a solid foundation for a data viz and then add new details later.
For example, say we want to compare the above figure to a version where the scale for GDP per capita is different (say using the log-10 scale). All we need to do is add a new layer to p like so:
p + scale_x_log10()
Because I saved the first plot as the object p, I didn’t have to re-write the code to produce the old plot before adding a new layer.
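One detail worth seeing in code: adding a layer to a saved plot does not change the saved plot unless you assign the result to a name. The sketch below uses the built-in mtcars data and hypothetical object names (base_plot, log_plot) so it runs with only {ggplot2} loaded.

```r
library(ggplot2)

# Save a base plot once (built-in mtcars data: horsepower vs. fuel economy).
base_plot <- ggplot(mtcars) +
  aes(x = hp, y = mpg) +
  geom_point()

# Adding a layer returns a NEW plot; base_plot itself is left unchanged.
log_plot <- base_plot + scale_x_log10()

# Both versions can be drawn whenever they are needed.
base_plot
log_plot

# Optionally write a plot to disk; the file name here is hypothetical.
# ggsave("hp_vs_mpg_log.png", plot = log_plot)
```

Keeping both objects around makes side-by-side comparisons easy, which is exactly what the log-scale question calls for.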
Speaking of this new layer, does the log10 scale change any conclusions you previously drew about the relationship between per capita GDP and life expectancy?
3.4 Helpful Resources for Learning More
This class is not about how to use R in its entirety. But, you may find it helpful to get more familiar with it. Here are some resources for you to check out on your own time:
My personal favorite is swirlstats.com, because it lets you work at your own pace directly in R, and for free.
3.5 Wrapping up
Working with code is hard, and I promise you that you will run into problems. When you do, just remember that everyone has problems with their code. It’s normal, and it would be weird if you didn’t have any issues.
We’ll get more into the details of working with ggplot in the coming weeks. But for now, you should have at minimum some helpful examples for how to:
- Read data into R
- Make a scatter plot using ggplot
- Access other resources for working in R