4  ggplot() Basics, Part I

4.1 Goals

  • Understand the ggplot workflow.
  • Know how to distinguish tidy data from untidy data, and understand why tidy data are important for doing data viz with ggplot.
  • Understand mapping.
  • Understand adding layers.

4.2 How ggplot() works

At its core, the ggplot workflow consists of a simple three-step process:

  1. Feed data to the ggplot() function.
  2. Tell ggplot() which variables you want to show relationships for using the aes() function.
  3. Tell ggplot() about the geometry (shapes, colors, points, etc.) that you want it to use to show the relationship(s) you’re interested in using a geom_*() function.

As you move from one step to the next, you’ll use the + symbol to connect your instructions.

In practice, this might look something like the code shown below. It starts by feeding a data object called Data to the ggplot() function. After the + operator, it uses the aes() function to give ggplot instructions about which variables to show. In this case, aes(x = var1, y = var2) specifies that I want values of var1 to be shown along the x-axis of the figure and values of var2 to be shown along the y-axis. Finally, after another + operator, the geom_point() function is used. This function tells ggplot that I want to show the relationship between var1 and var2 using points (i.e., I want a scatter plot). The output is shown below the code.

ggplot(Data) +
  aes(x = var1, y = var2) +
  geom_point()

You’ll notice that this figure is quite spartan in some respects, but you can see right off the bat that ggplot does quite a lot for you. As we’ll learn as we progress through this course, ggplot can do so much more. There’s a reason why it’s the state-of-the-art for data visualization. Right now, we’re crawling. Soon, we’ll be sprinting.
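If you want to try the three steps yourself, here is a minimal sketch you can run as-is. The Data object and its var1 and var2 columns are made up purely for illustration (the text above doesn’t define them), and the sketch assumes the {ggplot2} package is installed.

```r
library(ggplot2)

# Hypothetical stand-in for the `Data` object in the text;
# the values are invented for illustration only.
Data <- data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var2 = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Step 1: feed the data to ggplot().
# Step 2: map variables to the x- and y-axes with aes().
# Step 3: choose a geometry, here points.
p <- ggplot(Data) +
  aes(x = var1, y = var2) +
  geom_point()

# Printing the plot object draws the scatter plot.
p
```

Storing the plot in an object like p is optional. It just makes it easy to add more layers later with +.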

4.3 Good data viz starts with “tidy” data

Using ggplot itself is quite simple, but before we can use it, we need data. If we’re lucky, our data will already be fully processed and ready to go. If we’re not lucky (and usually we’re not), we’ll need to do some data wrangling. Data wrangling just refers to the process of cleaning and reshaping data to make it ready for visualization or analysis. Thankfully, the {tidyverse} family of packages provides some helpful tools for doing this. We’ll learn the specifics in coming chapters.

For most data viz or analysis purposes (no matter what tools or software you use), the goal is usually to have tidy data. “Tidy” in this context does not just mean “clean.” Tidy data refers to a specific set of characteristics about a dataset—its shape and its contents.

Tidy data have three key characteristics:

  1. Each row is an observation.
  2. Each column is a variable.
  3. Each cell is a single value.

Tidy data are always rectangular in shape. Specifically, they are long-format data as opposed to wide-format.

Here’s an example of data in wide-format. Can you tell why it’s wide? (Hint: look at how many observations we have per country in a single row.)

Wide Data
country      1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007
Afghanistan    29    30    32    34    36    38    40    41    42    42    42    44
Albania        55    59    65    66    68    69    70    72    72    73    76    76
Algeria        43    46    48    51    55    58    61    66    68    69    71    72
Angola         30    32    34    36    38    39    40    40    41    41    41    43
Argentina      62    64    65    66    67    68    70    71    72    73    74    75

This format is really inefficient for making a data viz with ggplot. Long-format, or tidy, data are much better. Here are those same data, but tidy:

Long Data
country      year  lifeExp
Afghanistan  1952       29
Afghanistan  1957       30
Afghanistan  1962       32
Afghanistan  1967       34
Afghanistan  1972       36

Do you see the difference? Each row is a single observation (a country in a single year). For each observation, we have three variables, each with its own column, and each cell in each column has just one value.
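To make the wide-versus-long distinction concrete, here is a rough sketch of one way to reshape a slice of the wide table into the long one. It uses pivot_longer() from the {tidyr} package (part of the {tidyverse}), which is my choice for illustration rather than something this chapter has introduced. The object names wide and long are made up, and only the first three countries and years are included to keep things short.

```r
library(tidyr)

# A small slice of the wide table: one row per country, one column per year.
wide <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  `1952` = c(29, 55, 43),
  `1957` = c(30, 59, 46),
  `1962` = c(32, 65, 48),
  check.names = FALSE  # keep the year column names exactly as written
)

# Stack the year columns so that each row is one country in one year.
long <- pivot_longer(
  wide,
  cols = -country,                           # reshape everything except country
  names_to = "year",                         # old column names become values of year
  values_to = "lifeExp",                     # old cell values become values of lifeExp
  names_transform = list(year = as.integer)  # store year as a number, not text
)

long
```

Three countries times three years gives nine rows, each with exactly three variables: country, year, and lifeExp.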

4.4 Building plots in layers

Once we have some tidy data, we then can start thinking about data visualization. As already summarized, there are three basic steps we need to follow. Let’s take a closer look at each of these steps to get a better sense for the logic behind the ggplot workflow.

When we use ggplot, we first feed the core ggplot() function some data. This is how we tell ggplot what our data is.

However, just giving ggplot our data isn’t enough. Look at what happens when we run ggplot() on its own with just the gapminder data:

ggplot(gapminder)

It gave us a lot of nothing! All we have is a blank canvas. Think of the ggplot workflow as a process of adding layers of complexity upon a very simple foundation. This is the secret sauce that makes ggplot so flexible. Rather than build a super complex plot all at once, you can take it one step at a time, letting you critique and revise your work as you fashion a beautiful data visualization. You’re more like a painter or sculptor than a scientist.

Once we have our blank canvas, the next step is to tell ggplot which variables we want to show. We do this with the aes() function.

ggplot(gapminder) +
  aes(x = year, y = lifeExp)

The aes() function accepts a lot of different arguments. In the above, I told it x = year to say I want the year column in the data to appear along the x-axis, and I told it y = lifeExp to tell it I want life expectancy to appear along the y-axis. I could also tell it to give some things different colors based on categories of the data (e.g., color = continent).
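Here is a quick sketch of what the color = continent idea might look like. A color mapping only becomes visible once there is some geometry to color, so I’ve borrowed geom_point() from the example at the start of the chapter. The sketch assumes the gapminder data come from the {gapminder} package and that both it and {ggplot2} are installed.

```r
library(ggplot2)
library(gapminder)  # assumed source of the gapminder data frame

# Same x and y mapping as before, plus color mapped to continent.
p <- ggplot(gapminder) +
  aes(x = year, y = lifeExp, color = continent) +
  geom_point()  # a geom is needed before any color can show up

p
```

Each continent should get its own point color, along with a legend that ggplot adds for you.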

You can use aes() after you use the core ggplot() function using +, or you can add it directly inside of ggplot() like so:

ggplot(
  gapminder, 
  mapping = aes(x = year, y = lifeExp)
)

Using it this way makes it more explicit that aes() is part of the mapping process with ggplot. You can do it this way if you prefer, or my way. All roads lead to Rome.

Our next step is to add some geometry to our canvas. To do this, we use “geoms” (short for geometry). There are a number of geom functions, like geom_point(), geom_col(), geom_boxplot(), and so on. Some geoms will make more sense than others depending on your data, and it’s up to you to make good judgments about which to use. Each provides a specific set of default instructions for how to connect aesthetics to different shapes, colors, and sizes in the data viz.

In the case of looking at life expectancy over time, geom_point() might be a sensible option.

ggplot(gapminder) + 
  aes(x = year, y = lifeExp) +
  geom_point()

Hmmm, it’s okay, but I think we can do better. What do you think this would look like if we tried geom_smooth(), geom_line(), or geom_boxplot()?
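If you’d like a starting point for experimenting, here is a tentative sketch of two of those alternatives. The group settings are my own assumption about what makes a sensible picture with these data, not the only option: without them, geom_line() would join every point into one zig-zagging path, and geom_boxplot() would treat the numeric year variable as a single group. The object names base, p_line, and p_box are made up for illustration.

```r
library(ggplot2)
library(gapminder)  # assumed source of the gapminder data frame

# Shared foundation: the data plus the x and y mapping.
base <- ggplot(gapminder) +
  aes(x = year, y = lifeExp)

# One line per country, tracing its life expectancy over time.
p_line <- base + geom_line(aes(group = country))

# One box per year, summarizing the spread across countries.
p_box <- base + geom_boxplot(aes(group = year))

p_line
p_box
```

Because the foundation is stored in base, swapping geometries is just a matter of adding a different geom_*() function.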

4.5 You can use multiple geometries

A nice thing about working with ggplot is that we can add to it ad infinitum, layer upon layer. You aren’t restricted to only one geom, for example. Let’s try a combo using geom_point() and geom_smooth():

ggplot(gapminder) +
  aes(x = year, y = lifeExp) +
  geom_point() +
  geom_smooth()

geom_smooth() adds a smoothed regression line to our plot. If you don’t have a background in statistics, just think of the smooth line as a summary of the mean of the variable on the y-axis depending on the value of the variable on the x-axis. In the above figure, we can see that average life expectancy has been increasing over time. By default, geom_smooth() provides 95% confidence intervals (CIs) around the mean. Think of these as a summary of how precisely the mean of the y-variable is estimated. Wider 95% CIs mean there’s more “noise” than “signal” in the data. Narrower 95% CIs mean there’s more “signal” than “noise.”
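As a small sketch of how you might adjust the smooth line and its confidence band, geom_smooth() has se and level arguments. Note that I’ve swapped the default smoother for a straight-line fit with method = "lm" to keep the example simple, so the line here will not match the curved one in the figure above. The object names are made up for illustration.

```r
library(ggplot2)
library(gapminder)  # assumed source of the gapminder data frame

base <- ggplot(gapminder) +
  aes(x = year, y = lifeExp) +
  geom_point()

# Straight-line fit with the 95% confidence band spelled out explicitly.
p_band <- base + geom_smooth(method = "lm", level = 0.95)

# Same fit, but with the confidence band switched off.
p_no_band <- base + geom_smooth(method = "lm", se = FALSE)

p_band
p_no_band
```

Changing level to something like 0.99 should widen the band, since a higher confidence level asks for a more cautious range.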

An interesting trick about working with geom layers is that we can specify aesthetics directly inside them. In fact, ggplot is super flexible about where you give it information about your data, too. Each of the ways of writing the code below will give you a figure identical to the one produced above. Try them out to see for yourself.

## Way 1:
ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_smooth()

## Way 2:
ggplot(gapminder) +
  geom_point(aes(x = year, y = lifeExp)) +
  geom_smooth(aes(x = year, y = lifeExp))

## Way 3:
ggplot() +
  geom_point(
    data = gapminder,
    aes(x = year, y = lifeExp)
  ) +
  geom_smooth(
    data = gapminder,
    aes(x = year, y = lifeExp)
  )

Whichever of the above approaches you choose, your output will be the same. However, as we start to consider conditioning our geometry on different groups in the data, we’ll need to be a little more specific about where we specify variables using aes(). We’ll deal with that in the coming chapters.
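As a brief preview of why placement can matter, here is a hedged sketch comparing two spots for a color mapping. An aesthetic given inside ggplot() is passed down to every layer, while one given inside a geom applies to that layer only. As an assumption to keep the example quick, I’ve used a straight-line fit (method = "lm") rather than the default smoother, and the object names are made up.

```r
library(ggplot2)
library(gapminder)  # assumed source of the gapminder data frame

# Color set inside geom_point(): only the points are split by continent,
# so geom_smooth() still draws one overall trend line.
p_layer <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  geom_smooth(method = "lm")

# Color set inside ggplot(): both layers inherit it,
# so geom_smooth() draws one trend line per continent.
p_global <- ggplot(gapminder, aes(x = year, y = lifeExp, color = continent)) +
  geom_point() +
  geom_smooth(method = "lm")

p_layer
p_global
```

The points look the same in both plots. Only the number of smooth lines changes.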

4.6 Wrapping up

The ggplot workflow follows a simple logic. Using this logic, you can produce a near infinite variety of data visualizations. We haven’t even covered the myriad ways you can customize the theme and overall look of your data viz. Before we get there, however, we first need to talk a little bit more about mapping aesthetics, which we’ll cover in the next chapter.