4 ggplot() Basics, Part I
4.1 Goals
- Understand the ggplot workflow.
- Know how to distinguish tidy data from untidy data, and understand why tidy data are important for doing data viz with ggplot.
- Understand mapping.
- Understand adding layers.
4.2 How ggplot() works
At its core, the ggplot workflow consists of a simple three-step process:
- Feed data to the ggplot() function.
- Tell ggplot() which variables you want to show relationships for using the aes() function.
- Tell ggplot() about the geometry (shapes, colors, points, etc.) that you want it to use to show the relationship(s) you’re interested in using a geom_*() function.
As you move from one step to the next, you’ll use the + symbol to connect your instructions.
In practice, this might look something like the code shown below. It starts by feeding a data object called Data to the ggplot() function. After the + operator, it then uses the aes() function to give ggplot instructions about which variables to show. In this case, aes(x = var1, y = var2) specifies that I want values of var1 to be shown along the x-axis of the figure and values of var2 to be shown along the y-axis. Finally, after another + operator, the geom_point() function is used. This function tells ggplot that I want to show the relationship between var1 and var2 using points (i.e., I want a scatter plot). The output is shown below the code.
ggplot(Data) +
aes(x = var1, y = var2) +
geom_point()
You’ll notice that this figure is quite spartan in some respects, but you can see right off the bat that ggplot does quite a lot for you. As we’ll learn as we progress through this course, ggplot can do so much more. There’s a reason why it’s the state-of-the-art for data visualization. Right now, we’re crawling. Soon, we’ll be sprinting.
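The three-step snippet uses placeholder names, so it will not run until an object called Data exists. Here is a minimal runnable sketch of the same three steps. The data frame and its values are made up purely for illustration; only the names Data, var1, and var2 come from the text.

```r
library(ggplot2)

# Made-up illustration data; only the names Data, var1, and var2 come from the text
Data <- data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var2 = c(2.1, 3.9, 6.2, 8.1, 9.8)
)

p <- ggplot(Data) +            # Step 1: feed data to ggplot()
  aes(x = var1, y = var2) +    # Step 2: map variables to the x- and y-axes
  geom_point()                 # Step 3: choose the geometry (points)

p  # printing the object draws the scatter plot
```

Storing the plot in an object (here p) is optional, but it makes it easy to add more layers later.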
4.3 Good data viz starts with “tidy” data
Using ggplot itself is quite simple, but before we can use it, we need data. If we’re lucky, our data will already be fully processed and ready to go. If we’re not lucky (and usually we’re not), we’ll need to do some data wrangling. Data wrangling just refers to the process of cleaning and reshaping data to make it ready for visualization or analysis. Thankfully, the {tidyverse} family of packages provides some helpful tools for doing this. We’ll learn the specifics in coming chapters.
For most data viz or analysis purposes (no matter what tools or software you use), the goal is usually to have tidy data. “Tidy” in this context does not just mean “clean.” Tidy data refers to a specific set of characteristics about a dataset—its shape and its contents.
Tidy data have three key characteristics:
- Each row is an observation.
- Each column is a variable.
- Each cell is a single value.
Tidy data are always rectangular in shape. Specifically, they are long-format data as opposed to wide-format.
Here’s an example of data in wide-format. Can you tell why it’s wide? (Hint: look at how many observations we have per country in a single row.)
country | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan | 29 | 30 | 32 | 34 | 36 | 38 | 40 | 41 | 42 | 42 | 42 | 44 |
Albania | 55 | 59 | 65 | 66 | 68 | 69 | 70 | 72 | 72 | 73 | 76 | 76 |
Algeria | 43 | 46 | 48 | 51 | 55 | 58 | 61 | 66 | 68 | 69 | 71 | 72 |
Angola | 30 | 32 | 34 | 36 | 38 | 39 | 40 | 40 | 41 | 41 | 41 | 43 |
Argentina | 62 | 64 | 65 | 66 | 67 | 68 | 70 | 71 | 72 | 73 | 74 | 75 |
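This format is awkward for making a data viz with ggplot because the years, which are really values of a single variable, are spread across the column names. Long-format, or tidy, data work much better. Here’s the same data in tidy form (first five rows shown):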
country | year | lifeExp |
---|---|---|
Afghanistan | 1952 | 29 |
Afghanistan | 1957 | 30 |
Afghanistan | 1962 | 32 |
Afghanistan | 1967 | 34 |
Afghanistan | 1972 | 36 |
Do you see the difference? Each row is a single observation (a country in a single year). For each observation, we have three variables, each with its own column, and each cell in each column has just one value.
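As a preview of the wrangling covered in later chapters, here is one possible way to reshape wide data into tidy data. This is only a sketch: it assumes the {tidyr} package (part of the {tidyverse}) and its pivot_longer() function, and the object names wide and long are made up for illustration. The values are a hand-copied subset of the wide table above (two countries, three years).

```r
library(tidyr)

# Hand-copied subset of the wide table above (two countries, three years)
wide <- data.frame(
  country = c("Afghanistan", "Albania"),
  `1952` = c(29, 55),
  `1957` = c(30, 59),
  `1962` = c(32, 65),
  check.names = FALSE  # keep the year column names exactly as written
)

# Gather every column except country into year/lifeExp pairs
long <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",        # old column names become values of year
  values_to = "lifeExp",    # old cell values become values of lifeExp
  names_transform = list(year = as.integer)
)

long
```

The result has one row per country-year pair (six rows here) and three columns: country, year, and lifeExp.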
4.4 Building plots in layers
Once we have some tidy data, we then can start thinking about data visualization. As already summarized, there are three basic steps we need to follow. Let’s take a closer look at each of these steps to get a better sense for the logic behind the ggplot workflow.
When we use ggplot, we first feed the core ggplot() function some data. This is how we tell ggplot what our data is.
However, just giving ggplot our data isn’t enough. Look at what happens when we run ggplot() on its own with just the gapminder data:
ggplot(gapminder)
It gave us a lot of nothing! All we have is a blank canvas. Think of the ggplot workflow as a process of adding layers of complexity upon a very simple foundation. This is the secret sauce that makes ggplot so flexible. Rather than build a super complex plot all at once, you can take it one step at a time, letting you critique and revise your work as you fashion a beautiful data visualization. You’re more like a painter or sculptor than a scientist.
Once we have our blank canvas, the next step is to select the data we want to show. We do this with the aes() function.
ggplot(gapminder) +
aes(x = year, y = lifeExp)
The aes() function accepts a lot of different arguments. In the above, I told it x = year to say I want the year column in the data to appear along the x-axis, and I told it y = lifeExp to tell it I want life expectancy to appear along the y-axis. I could also tell it to give some things different colors based on categories of the data (e.g., color = continent).
You can use aes() after you use the core ggplot() function using +, or you can add it directly inside of ggplot() like so:
ggplot(
gapminder, mapping = aes(x = year, y = lifeExp)
)
Using it this way makes it more explicit that aes() is part of the mapping process with ggplot. You can use this way if you prefer, or my way. All roads lead to Rome.
Our next step is to add some geometry to our canvas. To do this, we use “geoms” (short for geometry). There are a number of geom functions, like geom_point(), geom_col(), geom_boxplot(), and so on. Some geoms will make more sense than others depending on your data, and it’s up to you to make good judgments about which to use. Each provides a specific set of default instructions for how to connect aesthetics to different shapes, colors, and sizes in the data viz.
In the case of looking at life expectancy over time, geom_point() might be a sensible option.
ggplot(gapminder) +
aes(x = year, y = lifeExp) +
geom_point()
Hmmm, it’s okay, but I think we can do better. What do you think this would look like if we tried geom_smooth(), geom_line(), or geom_boxplot()?
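If you want to experiment with other geoms, one convenient pattern is to store the canvas and mapping once and then add a different geom each time. The sketch below assumes the gapminder data frame comes from the {gapminder} package. The object names are made up, and the group aesthetics are my own addition: without them, geom_line() would connect every point into a single line and geom_boxplot() would try to draw one box for all years.

```r
library(ggplot2)
library(gapminder)  # assumption: the gapminder data frame comes from this package

# Store the canvas plus the mapping once, then reuse it with different geoms
base <- ggplot(gapminder) +
  aes(x = year, y = lifeExp)

p_smooth <- base + geom_smooth()                      # smoothed trend over time
p_line   <- base + geom_line(aes(group = country))    # one line per country
p_box    <- base + geom_boxplot(aes(group = year))    # one box per year

# Print any of these objects to draw it, for example:
p_box
```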
4.5 You can use multiple geometries
A nice thing about working with ggplot is that we can add to it ad infinitum, layer upon layer. You aren’t restricted to only one geom, for example. Let’s try a combo using geom_point() and geom_smooth():
ggplot(gapminder) +
aes(x = year, y = lifeExp) +
geom_point() +
geom_smooth()
geom_smooth() adds a smoothed regression line to our plot. If you don’t have a background in statistics, just think of the smooth line as a summary of the mean of the variable on the y-axis depending on the value of the variable on the x-axis. In the above figure, we can see that average life expectancy has been increasing over time. By default, geom_smooth() provides 95% confidence intervals (CIs) around the mean. Think of these as a summary of how precisely the mean of the y-variable is estimated. Wider 95% CIs mean there’s more “noise” than “signal” in the data. Narrower 95% CIs mean there’s more “signal” than “noise.”
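The confidence band is a default you can change. As a small sketch, the se and level arguments of geom_smooth() control whether the band is drawn and how wide a confidence level it uses. The object names are made up, and the gapminder data frame is assumed to come from the {gapminder} package.

```r
library(ggplot2)
library(gapminder)  # assumption: the gapminder data frame comes from this package

# Same plot as above, but without the confidence band
p_no_band <- ggplot(gapminder) +
  aes(x = year, y = lifeExp) +
  geom_point() +
  geom_smooth(se = FALSE)

# Same plot, but with a 99% band instead of the default 95%
p_99 <- ggplot(gapminder) +
  aes(x = year, y = lifeExp) +
  geom_point() +
  geom_smooth(level = 0.99)

p_99
```

A higher confidence level gives a wider band for the same data, so the 99% band will be at least as wide as the default one.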
An interesting trick about working with geom layers is that we can specify aesthetics directly inside them. In fact, ggplot is super flexible about where you give it information about your data, too. Each of the below ways of writing the code will give you an identical figure to the one produced above. Try them out to see for yourself.
## Way 1:
ggplot(gapminder, aes(x = year, y = lifeExp)) +
geom_point() +
geom_smooth()
## Way 2:
ggplot(gapminder) +
geom_point(aes(x = year, y = lifeExp)) +
geom_smooth(aes(x = year, y = lifeExp))
## Way 3:
ggplot() +
  geom_point(
    data = gapminder,
    aes(x = year, y = lifeExp)
  ) +
  geom_smooth(
    data = gapminder,
    aes(x = year, y = lifeExp)
  )
Which of the above approaches you use makes no difference to your output. However, as we start to consider conditioning our geometry on different groups in the data, we’ll need to be a little more specific about where we specify variables using aes(). We’ll deal with that in the coming chapters.
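If you would rather check the equivalence in code than by eye, one option is to compare the data ggplot computes for each layer. This is only a sketch: it assumes the gapminder data frame comes from the {gapminder} package, uses ggplot2's layer_data() helper, and the object names are made up for illustration.

```r
library(ggplot2)
library(gapminder)  # assumption: the gapminder data frame comes from this package

way1 <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_smooth()

way2 <- ggplot(gapminder) +
  geom_point(aes(x = year, y = lifeExp)) +
  geom_smooth(aes(x = year, y = lifeExp))

way3 <- ggplot() +
  geom_point(data = gapminder, aes(x = year, y = lifeExp)) +
  geom_smooth(data = gapminder, aes(x = year, y = lifeExp))

# layer_data() returns the data computed for one layer (1 = the points layer)
pts1 <- layer_data(way1, 1)[, c("x", "y")]
pts2 <- layer_data(way2, 1)[, c("x", "y")]
pts3 <- layer_data(way3, 1)[, c("x", "y")]

# TRUE when the plotted point positions match across the three ways
same_points <- isTRUE(all.equal(pts1, pts2)) && isTRUE(all.equal(pts1, pts3))
same_points
```

The same comparison can be repeated with layer index 2 to check the smoothed line.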
4.6 Wrapping up
The ggplot workflow follows a simple logic. Using this logic, you can produce a near infinite variety of data visualizations. We haven’t even covered the myriad ways you can customize the theme and overall look of your data viz. Before we get there, however, we first need to talk a little bit more about mapping aesthetics, which we’ll cover in the next chapter.