# A tibble: 2 × 5
subject time age weight height
<chr> <dbl> <dbl> <dbl> <dbl>
1 John Smith 1 33 90 1.87
2 Mary Smith 1 NA NA 1.54
Module 2.5
Tidying Data
- Start a QMD file for this module.
- Review the concepts of tidy data and read through the tidyr documentation.
- Have a quick look at the
tidyr
cheatsheet to explore the capabilities of the package.
Overview
Earlier in the course, we introduced the idea of tidy data—a consistent way of structuring datasets that makes them easier to work with in R. In a tidy dataset, each variable forms a column, each observation forms a row, and each cell contains a single value. This structure is especially powerful when paired with the tidyverse
tools you’ve already seen, like ggplot2
and dplyr
.
In this module, we’ll focus on how to reshape or pivot messy data into tidy format using functions from the tidyr
package. We’ll start by examining example datasets included in the package, then work with real-world data from the World Bank. Along the way, you’ll learn to convert data between wide and long formats using pivot_longer()
and pivot_wider()
, and you’ll practice fixing common issues like inconsistent variable names and incorrect data types.
Are These Data Tidy?
Let’s start by loading the tidyr
package and examining a few of its built-in datasets. One of the datasets is called smiths
. You can access this dataset by calling:
Are these data tidy? The answer is “yes.” These data are tidy because each variable is in its own column, each observation is in its own row, and each cell contains a single value. It might not look like a typical dataset, because it is small (only two rows) and because it has missing values, but it meets the tidy data principles.
Here is another example of tidy data from the tidyr
package, called table1
.
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
What about these data? Are they tidy? Again the answer is “yes.” There are four columns, each of which represents a variable: country
, year
, cases
, and population
. Each row represents a single observation, and each cell contains a single value.
OK let’s try one more. This one is called table2
.
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
This one is a little tricky because, on the one hand we could say that it is tidy data in “long form” if we treat type
as a single variable. But if I told you that “cases” here refers to the number of case of tuberculosis and “population” refers to the population of the country, then we would say that this is not tidy data. The reason is that the type
variable is not a variable in its own right but rather a label for the two variables cases
and population
and, in that case, cases
and population
should be in separate columns (as in table1
).
A dataset is in long format when each row represents a single observation, and multiple categories or types that could be defined as separate variables are stacked into one column—often paired with a corresponding value column. It’s in wide format when different categories have their own columns.
Which format is better? It depends on how you define your variables.
Sometimes, long format is clearly appropriate. For example, if you’re comparing regions or species, it makes sense to keep those as a single column so you can facet or color by them in a ggplot
plot.
But when you want to compare values across categories, or analyze how the values of the categories relate to a third variable, wide format is more appropriate.
Now let’s look at another rendition of these same data.
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Can you tell what is going on here? Are these data tidy? No. In this case, the rate variable is not the actual rate of infection but rather two values (cases and population) separated by a slash. So here we have two values in one cell, which violates the tidy data principle that each cell should contain a single value.
Finally, let’s look at one more example. Here istable4a
from the tidyr
package.
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
What is the issue here? In this case, the data are not tidy because the year
variable is spread across multiple columns (1999
and 2000
). The years represent observations and should therefore be in their own rows in the context of a single column, not in separate columns.
Tidying Data
Let’s now talk about how we can use tidyr
to tidy our data. The tidyr
package provides several functions to help you reshape data into tidy format. The most commonly used functions are pivot_longer()
and pivot_wider()
. But we can also make use of separate()
when we need to split a single column into multiple columns.
pivot_longer()
is helpful when we have a situation like we saw in table4a
, where we have multiple columns representing different observations of the same variable. It allows us to “pivot” those columns into a single column with an additional column indicating the observation type (e.g., year).
pivot_longer()
takes three main arguments: cols
to identify which columns you want to pivot; names_to
to identify the name of the column where the old column names are going to (the identifier), and values_to
, the name of the column where the values are going to. Let’s give it a try with table4a
:
table4a_pivot <- table4a |>
pivot_longer(cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases")
table4a_pivot
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
This reshapes the data so that we have a single year
column and a single cases
column, with the observations (country-years) in the rows.
Now let’s take the case of table2
, which is in long format but not tidy because the type
variable is not a separate variable. We can use pivot_wider()
to reshape it into a tidy format.
pivot_wider()
takes three main arguments: names_from
to identify the column whose values will become new column names, values_from
to identify the column containing the values that will fill the new columns, and values_fill
, an optional argument that specifies what to use for missing values (e.g., NA
, 0
).
Let’s try reshaping table2
into a tidy format using pivot_wider()
:
table2_tidy <- table2 |>
pivot_wider(names_from = type,
values_from = count)
table2_tidy
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Finally, let’s look at how to use separate()
to split a single column into multiple columns. This is useful when you have a column that contains multiple pieces of information, like the rate
column in table3
.
table3_separated <- table3 |>
separate(rate, into = c("cases", "population"), sep = "/")
table3_separated
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
World Bank Worked Example
As a worked example, let’s look at the tidyr
world_bank_pop
dataset, which simulates a typical structure you might get from downloading World Bank data manually. Let’s take a closer look at the world_bank_pop
dataset. You can view it using:
world_bank_pop
# A tibble: 1,064 × 20
country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW SP.URB.TOTL 4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
2 ABW SP.URB.GROW 1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
3 ABW SP.POP.TOTL 8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
4 ABW SP.POP.GROW 2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
5 AFE SP.URB.TOTL 1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
6 AFE SP.URB.GROW 3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
7 AFE SP.POP.TOTL 4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
8 AFE SP.POP.GROW 2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
9 AFG SP.URB.TOTL 4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG SP.URB.GROW 1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
# `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
# `2016` <dbl>, `2017` <dbl>
At first glance, this dataset appears to be in a wide format. Each row represents a country and indicator (such as “SP.POP.TOTL” for total population), and each year from 2000 to 2017 is stored in its own column. While this might seem convenient, it violates the tidy data principle that each variable should form its own column. Here, the years are actually values of a single variable called “year”—but they are spread across multiple columns.
To tidy this data, our first step is to convert it from wide format to long format using the pivot_longer()
function from the tidyr
package. Specifically, we want to pivot all the year columns into a single “year” column, with a corresponding “pop” column for the population values.
Here’s how we do it:
long_pop_data <- world_bank_pop |>
pivot_longer(
cols = `2000`:`2017`, # The year columns to pivot
names_to = "year", # New column for year values
values_to = "pop" # New column for population values
)
long_pop_data
# A tibble: 19,152 × 4
country indicator year pop
<chr> <chr> <chr> <dbl>
1 ABW SP.URB.TOTL 2000 41625
2 ABW SP.URB.TOTL 2001 42025
3 ABW SP.URB.TOTL 2002 42194
4 ABW SP.URB.TOTL 2003 42277
5 ABW SP.URB.TOTL 2004 42317
6 ABW SP.URB.TOTL 2005 42399
7 ABW SP.URB.TOTL 2006 42555
8 ABW SP.URB.TOTL 2007 42729
9 ABW SP.URB.TOTL 2008 42906
10 ABW SP.URB.TOTL 2009 43079
# ℹ 19,142 more rows
This transformation makes the dataset much more flexible. Now, each row corresponds to a single observation: a country-indicator-year combination with a single value for population. This structure is ideal for most types of analysis and visualization. However, it’s worth asking—are we fully done tidying?
Not quite. The current dataset is in long format, and that’s great in many situations. But in this case, each country-year may have data for multiple indicators (e.g., total population, urban population), and in most cases we would prefer to have each indicator in its own column.
To do that, we can use pivot_wider()
to reshape the dataset so that the indicator codes become column names and each row represents a country-year combination.
tidy_pop_data <- long_pop_data |>
pivot_wider(
names_from = indicator,
values_from = pop
)
tidy_pop_data
# A tibble: 4,788 × 6
country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ABW 2000 41625 1.66 89101 2.54
2 ABW 2001 42025 0.956 90691 1.77
3 ABW 2002 42194 0.401 91781 1.19
4 ABW 2003 42277 0.197 92701 0.997
5 ABW 2004 42317 0.0946 93540 0.901
6 ABW 2005 42399 0.194 94483 1.00
7 ABW 2006 42555 0.367 95606 1.18
8 ABW 2007 42729 0.408 96787 1.23
9 ABW 2008 42906 0.413 97996 1.24
10 ABW 2009 43079 0.402 99212 1.23
# ℹ 4,778 more rows
Now, the dataset is both tidy and wide: one row per country and year, with separate columns for each indicator. This makes it easy to perform operations like plotting urban versus total population or calculating the share of the population that is urban.
This kind of workflow—going from wide to long and back to wide—is common when working with real-world datasets. You might start with a messy spreadsheet or export, use pivot_longer()
to get it into a form that’s easier to work with, and then use pivot_wider()
to tailor the structure to your specific analysis needs.