Tidying Data

October 7, 2024

Tidying Data

What is Tyding Data?

Earlier we talked about the concept of “tidy data”
- Each variable forms a column
- Each observation is in a row
- Each cell has a single value
The process of tidying data involves reshaping (or pivoting) data into a tidy format
We want to use the pivot_longer() or pivot_wider() functions from tidyr to do this

Query: Are these data in a tidy format?

library(tidyr)

smiths

# A tibble: 2 × 5
  subject     time   age weight height
  <chr>      <dbl> <dbl>  <dbl>  <dbl>
1 John Smith     1    33     90   1.87
2 Mary Smith     1    NA     NA   1.54

Tip

To get a list of data frames available in a package use the data() function, e.g. data(package = "tidyr").

How about these data?

world_bank_pop

# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>

Pivot Longer

pivot_longer() takes three arguments:

cols - which columns you want to pivot
names_to - the name of the column where the old column names are going to (identifier)
values_to - the name of the column where the values are going to

Example: WB Population Data

library(tidyr)
library(dplyr)

# Pivot using pivot_longer

long_pop_data <- world_bank_pop |>
  pivot_longer(
    cols = `2000`:`2017`,   # The columns you want to pivot (years)
    names_to = "year",      # New column name for the years
    values_to = "pop"       # New column name for the values
  ) 

# View the tidied data
long_pop_data

# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

This is better, usable even, but are we done if we want a tidy data frame?

# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Issue is that the data are in long form (which is OK for some purposes), but we want to make it wider. Wider, but tidy…

# A tibble: 19,152 × 4
   country indicator   year    pop
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Pivot Wider

pivot_wider() takes three main arguments:

names_from - the column whose values will become new column names (identifier)
values_from - the column containing the values that will fill the new columns
values_fill (optional) - specifies what to use for missing values (e.g., NA, 0)

`pivotwider()` the WB Data

# pivot wider
tidy_pop_data <- long_pop_data |>
  pivot_wider(
    names_from = indicator, 
    values_from = pop
  )

# view the data
tidy_pop_data

# A tibble: 4,788 × 6
   country year  SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
   <chr>   <chr>       <dbl>       <dbl>       <dbl>       <dbl>
 1 ABW     2000        41625      1.66         89101       2.54 
 2 ABW     2001        42025      0.956        90691       1.77 
 3 ABW     2002        42194      0.401        91781       1.19 
 4 ABW     2003        42277      0.197        92701       0.997
 5 ABW     2004        42317      0.0946       93540       0.901
 6 ABW     2005        42399      0.194        94483       1.00 
 7 ABW     2006        42555      0.367        95606       1.18 
 8 ABW     2007        42729      0.408        96787       1.23 
 9 ABW     2008        42906      0.413        97996       1.24 
10 ABW     2009        43079      0.402        99212       1.23 
# ℹ 4,778 more rows

`pivotwider()` the WB Data

# A tibble: 4,788 × 6
   country year  SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
   <chr>   <chr>       <dbl>       <dbl>       <dbl>       <dbl>
 1 ABW     2000        41625      1.66         89101       2.54 
 2 ABW     2001        42025      0.956        90691       1.77 
 3 ABW     2002        42194      0.401        91781       1.19 
 4 ABW     2003        42277      0.197        92701       0.997
 5 ABW     2004        42317      0.0946       93540       0.901
 6 ABW     2005        42399      0.194        94483       1.00 
 7 ABW     2006        42555      0.367        95606       1.18 
 8 ABW     2007        42729      0.408        96787       1.23 
 9 ABW     2008        42906      0.413        97996       1.24 
10 ABW     2009        43079      0.402        99212       1.23 
# ℹ 4,778 more rows

Exercise

Download Some Messy Data

Download some messy WB data
Make it multiple variables
Download as a CSV file
Save in your project /data folder

Messy Data Example

Read Data

# Load packages
library(readr) 
library(dplyr) 

# Read data from csv file into an object called "wb_data_messy"
wb_data_messy <- read_csv("data/your_file_name.csv")

# View the data
glimpse(wb_data_messy)

Pivot the Data

Use pivot_longer() to get the data in long form
Use pivot_wider() to get the series in the columns

Special Considerations

# Load tidyr
library(tidyr)

# Reshape the data
wb_data <- wb_data_messy |> 
  pivot_longer(             
    cols = `1973 [YR1973]`: `2022 [YR2022]`, # columns to pivot
    names_to = "year", # name the identifier column "year"
    values_to = "values" # name the numeric var column 
  ) 

# View the data
glimpse(wb_data)

Special Considerations

wb_data <- wb_data |>
  select(-`Series Name`)

# Reshape the data
tidy_data <- wb_data |> 
  pivot_wider(             
    names_from = `Series Code`,
    values_from = values
  ) 

# View the data
glimpse(wb_data)

Give it a Shot!

Try downloading some messy WB data and make it tidy
Refer to previous slides for help

10:00

Mutate

Mutating Variables

Anytime we want to change a variable, we are going to use the dplyr verbs mutate() or mutate_at()
mutate() is if you want to change on variable
Use across() for multiple variables

Let’s Fix Our Variables

# Fix year and flfp
wb_data <- wb_data |> 
  mutate(year = substring(year, 1, 4)) |>  # truncate year
  mutate(across(c("year", "flfp"), as.numeric))  # convert year and flfp to numeric

# View the data
glimpse(wb_data)

Now Try it With Multiple Varaibles

Go to the data frame with multiple variables that you created earlier
How would you modify this code to make sure the variables are in the right format?

Clean Variable Names

Very simple: use the janitor package!

# Load janitor
library(janitor)

# Apply clean_names() to wb_data, store in new data frame called wb_data_clean
wb_data_clean <- wb_data |>  
  clean_names() 

# Write wb_data_clean to a csv file
write_csv(wb_data_clean, "data/wb_data_clean.csv")

# View the data
glimpse(wb_data_clean)

Column Specifications

Can clean columns on backend
But can also specify data types in read_csv() or read_excel()

spec(wb_data_messy) # show column specifications (from `readr`)

wb_data_messy <- read_csv("data/your_file_name.csv", 
                          col_types = cols(
                           col_double(`1990 [YR1990]`) 
                          ))

Tidying Data

Tidying Data

What is Tyding Data?

Pivot Longer

Pivot Longer

Example: WB Population Data

Pivot Wider

Pivot Wider

pivotwider() the WB Data

pivotwider() the WB Data

Exercise

Download Some Messy Data

Messy Data Example

Read Data

Pivot the Data

Special Considerations

Special Considerations

Give it a Shot!

Mutate

Mutating Variables

Let’s Fix Our Variables

Now Try it With Multiple Varaibles

Clean Variable Names

Column Specifications

`pivotwider()` the WB Data

`pivotwider()` the WB Data