Module 2.6

Cleaning Data

Prework
  • Start a QMD file for this module. (At this point, I assume you know how to create a QMD file in a project and set the YAML header. So this will be the last time I mention it.)
  • Install the janitor package and have a look at the documentation.

Overview

In this module we are going to focus on how to clean up messy, real-world data. We are going to do this using data downloaded from the World Development Indicators. This will be our third iteration of working with World Bank data, so if you are a budding economist you are in luck! The first time we encountered World Bank data was through the WDI package, where the data were tidy just by virtue of being accessed through the API. In the last module, we left off with a worked example in which we used tidyr to reshape an untidy version of the World Bank's World Development Indicators (WDI) data into a tidy dataset. In this lesson, we are going to download data directly from the WDI interface that is not only untidy but also contains some other issues that we will need to address.

Downloading and Reading in the Data

Go to the World Development Indicators portal at the World Bank’s Data Bank.

Under Countries, select the Countries tab and then select the little check mark ☑️ to select all of the countries. Be sure to select the Countries tab first, though, or you will also be downloading aggregate data for regions and groups of countries.

Next, under Series, search for “labor force participation” and find labor force participation rates for women ages 15-64 (ILO modeled estimates). Check that series.

Now go to Time and select the last 50 years. Click Apply Changes, go to Download Options, and download as a .csv file. Place the .csv file in the data directory that you created for this module. Save it as "messy_wb_data.csv" or something like that.

Now we are going to read this messy World Bank data into R using the read_csv() function from the readr package. After we have read the data into R, we are going to have a look at it with glimpse().

Did You Know?

While comma delimited files are the most common kind of flat file, readr includes functions for parsing files with a wide range of delimiters, including tabs (read_tsv()), semicolons (read_csv2()), and whitespace (read_table()). There is also a Tidyverse package for reading in Excel files called readxl.
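For example, here is a rough sketch of what reading a few of these other formats might look like (the file names here are just placeholders):

# Each readr/readxl function handles a different delimiter or file type
library(readr)
library(readxl)

tab_data   <- read_tsv("data/example.tsv")     # tab-delimited
semi_data  <- read_csv2("data/example.csv")    # semicolon-delimited (common in Europe)
space_data <- read_table("data/example.txt")   # whitespace-delimited
excel_data <- read_excel("data/example.xlsx")  # Excel workbook via readxl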

# Load readr and dplyr
library(readr) 
library(dplyr) 

# Read the downloaded .csv file into R
wb_data_messy <- read_csv("data/messy_wb_data.csv")

glimpse(wb_data_messy)
Rows: 222
Columns: 54
$ `Country Name`  <chr> "Afghanistan", "Albania", "Algeria", "American Samoa",…
$ `Country Code`  <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", "ATG", "ARG"…
$ `Series Name`   <chr> "Labor force participation rate, female (% of female p…
$ `Series Code`   <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI…
$ `1972 [YR1972]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1973 [YR1973]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1974 [YR1974]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1975 [YR1975]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1976 [YR1976]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1977 [YR1977]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1978 [YR1978]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1979 [YR1979]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1980 [YR1980]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1981 [YR1981]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1982 [YR1982]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1983 [YR1983]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1984 [YR1984]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1985 [YR1985]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1986 [YR1986]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1987 [YR1987]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1988 [YR1988]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1989 [YR1989]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `1990 [YR1990]` <chr> "15.83", "60.63", "12.31", "..", "..", "76.73", "..", …
$ `1991 [YR1991]` <chr> "15.89", "65.54", "12.33", "..", "..", "76.69", "..", …
$ `1992 [YR1992]` <chr> "15.92", "66.56", "12.37", "..", "..", "76.66", "..", …
$ `1993 [YR1993]` <chr> "15.91", "65.01", "12.41", "..", "..", "76.68", "..", …
$ `1994 [YR1994]` <chr> "15.88", "63.64", "12.47", "..", "..", "76.64", "..", …
$ `1995 [YR1995]` <chr> "15.92", "61.59", "12.56", "..", "..", "76.57", "..", …
$ `1996 [YR1996]` <chr> "15.75", "60.28", "12.64", "..", "..", "76.55", "..", …
$ `1997 [YR1997]` <chr> "15.59", "61.91", "12.59", "..", "..", "76.53", "..", …
$ `1998 [YR1998]` <chr> "15.47", "60.62", "12.59", "..", "..", "76.53", "..", …
$ `1999 [YR1999]` <chr> "15.4", "58.87", "12.63", "..", "..", "76.51", "..", "…
$ `2000 [YR2000]` <chr> "15.35", "57.89", "12.71", "..", "..", "76.49", "..", …
$ `2001 [YR2001]` <chr> "15.5", "56.71", "12.85", "..", "..", "76.48", "..", "…
$ `2002 [YR2002]` <chr> "15.7", "56.06", "13.02", "..", "..", "76.44", "..", "…
$ `2003 [YR2003]` <chr> "15.92", "55.3", "13.24", "..", "..", "76.41", "..", "…
$ `2004 [YR2004]` <chr> "16.13", "54.57", "13.5", "..", "..", "76.38", "..", "…
$ `2005 [YR2005]` <chr> "16.33", "53.88", "13.79", "..", "..", "76.36", "..", …
$ `2006 [YR2006]` <chr> "16.12", "53.43", "14.12", "..", "..", "76.39", "..", …
$ `2007 [YR2007]` <chr> "15.91", "53.07", "14.47", "..", "..", "76.42", "..", …
$ `2008 [YR2008]` <chr> "15.74", "52.78", "14.87", "..", "..", "76.46", "..", …
$ `2009 [YR2009]` <chr> "15.65", "51.57", "15.31", "..", "..", "76.53", "..", …
$ `2010 [YR2010]` <chr> "15.65", "52.75", "15.49", "..", "..", "76.59", "..", …
$ `2011 [YR2011]` <chr> "16", "60.59", "16.45", "..", "..", "76.67", "..", "55…
$ `2012 [YR2012]` <chr> "16.44", "55.1", "17.48", "..", "..", "76.73", "..", "…
$ `2013 [YR2013]` <chr> "17.42", "50.58", "18.29", "..", "..", "76.79", "..", …
$ `2014 [YR2014]` <chr> "18.46", "50.18", "16.68", "..", "..", "76.83", "..", …
$ `2015 [YR2015]` <chr> "19.55", "54.05", "17.5", "..", "..", "76.87", "..", "…
$ `2016 [YR2016]` <chr> "20.7", "56.4", "18.33", "..", "..", "76.9", "..", "56…
$ `2017 [YR2017]` <chr> "21.91", "55.54", "19.19", "..", "..", "76.91", "..", …
$ `2018 [YR2018]` <chr> "22.32", "59.12", "18.95", "..", "..", "76.9", "..", "…
$ `2019 [YR2019]` <chr> "22.74", "61.46", "18.7", "..", "..", "76.88", "..", "…
$ `2020 [YR2020]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
$ `2021 [YR2021]` <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", …
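Notice that all of the year columns came in as character vectors (<chr>). That is because the World Bank uses ".." to mark missing values, so readr cannot parse those columns as numbers. We will clean this up below, but it is worth knowing that read_csv() also has an na argument that lets you declare such strings as missing at import time. A minimal sketch (same file as above):

# Optional: treat the World Bank's ".." codes as missing values while reading
wb_data_messy <- read_csv("data/messy_wb_data.csv", na = c("", ".."))
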
Your Turn!!

Follow along with the video and the steps described above to download the data and read it into R.

Reshaping the Data

Recall from the last module that in order for the data to be tidy, we want each column to represent a variable and each row to represent an observation.

But here again we see that the World Bank data are in wide form, meaning that each column represents a year and each row represents a country. As a result, each row contains multiple observations, which violates tidy data principles.

To rectify this, we need to reshape the data from wide form to long form using the [pivot_longer()](https://tidyr.tidyverse.org/reference/pivot_longer.html) function from the tidyr package. Recall that the pivot_longer() function takes three basic arguments:

  • cols - which columns you want to pivot
  • names_to - the name of the column where the old column names are going to
  • values_to - the name of the column where the values are going to

In our case, we want to reshape all of the year columns and have the years represented in the rows. We want the newly created column to be called "year" and the values to go into a column called "flfp", short for female labor force participation rate.

# Load tidyr
library(tidyr)

wb_data <- wb_data_messy |> 
  pivot_longer(                             # reshape from wide to long form
    cols = `1972 [YR1972]`:`2021 [YR2021]`, # pivot all of the year columns
    names_to = "year",                      # old column names go into a "year" column
    values_to = "flfp"                      # cell values go into an "flfp" column
  ) 

glimpse(wb_data)
Rows: 11,100
Columns: 6
$ `Country Name` <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
$ `Country Code` <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ `Series Name`  <chr> "Labor force participation rate, female (% of female po…
$ `Series Code`  <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.…
$ year           <chr> "1972 [YR1972]", "1973 [YR1973]", "1974 [YR1974]", "197…
$ flfp           <chr> "..", "..", "..", "..", "..", "..", "..", "..", "..", "…

Notice that when we specify the year columns in our pivot_longer() call we enclose them in backticks (``). This is because the column names, as imported from the WDI dataset, contain spaces. Typically we want to avoid this scenario by writing our variable names in snake_case.
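For instance, here is a quick illustration (not part of our cleaning pipeline) of what referring to one of these columns looks like while the names still contain spaces:

# Backticks are required because the column name contains a space
wb_data |> 
  select(`Country Name`, year, flfp)

Once we convert the names to snake_case below, the backticks will no longer be necessary.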

Your Turn!!

Follow along with the video and the steps described above to reshape the data. Remember to use backticks when specifying the year columns in the pivot_longer() call.

Cleaning up Our Data

Now that our data are transposed, we can start to clean up a few remaining issues. For example, the year variable is stored as a character string that includes both the year and a redundant label in brackets—e.g., "1972 [YR1972]". In addition, the variable flfp (female labor force participation) is stored as a character when it should be numeric.

To fix these issues, we’ll use the mutate() function from dplyr. First, we call mutate() along with substring() to extract just the first four characters from the year column. Then, we use the across() function inside mutate() to convert both year and flfp to numeric.

wb_data <- wb_data |> 
  mutate(year = substring(year, 1, 4)) |>          # keep only the first four characters of year
  mutate(across(c("year", "flfp"), as.numeric))    # convert year and flfp to numeric

glimpse(wb_data)
Rows: 11,100
Columns: 6
$ `Country Name` <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
$ `Country Code` <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ `Series Name`  <chr> "Labor force participation rate, female (% of female po…
$ `Series Code`  <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.…
$ year           <dbl> 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1…
$ flfp           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
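As an aside, readr's parse_number() offers an alternative to the substring() plus as.numeric() approach: it extracts the first number it finds in a string and returns it as numeric in one step.

# parse_number() pulls the first number out of a string
parse_number("1972 [YR1972]")   # returns 1972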

The last thing we are going to do is fix the variable names. Specifically, we want to remove the spaces from the remaining variables and convert them from title case to snake_case. To do this, we will use the clean_names() function from the janitor package.

As a final step, we can export our clean data to a new .csv file with the write_csv() function from readr.

# Load janitor
library(janitor)

wb_data_clean <- wb_data |>  
  clean_names() 

write_csv(wb_data_clean, "data/wb_data_clean.csv")

glimpse(wb_data_clean)
Rows: 11,100
Columns: 6
$ country_name <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan…
$ country_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "…
$ series_name  <chr> "Labor force participation rate, female (% of female popu…
$ series_code  <chr> "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE.ZS", "SL.TLF.ACTI.FE…
$ year         <dbl> 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 198…
$ flfp         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Your Turn!!

Follow along with the video and the steps described above to clean up the data. Use mutate() and across() to:

  • Truncate the year variable so that it only includes the four-digit year,
  • Convert year and flfp to numeric using across() inside mutate().

Then, use clean_names() from the janitor package to clean up the variable names—removing spaces and converting to snake_case. Finally, export your cleaned dataset to a new .csv file using write_csv() from the readr package.

💡 Challenge yourself!
Download a new dataset from the World Bank with more than one variable. Use pivot_longer() and pivot_wider() to reshape the data, then follow the steps above to clean and prepare it for analysis.
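If you get stuck, here is one way the challenge workflow could look. This is only a sketch: the file name, the series you select, and the pattern used to grab the year columns are all assumptions that you will need to adapt to your own download.

# A possible sketch for the challenge (file name and series are hypothetical)
library(readr)
library(dplyr)
library(tidyr)
library(janitor)

wb_challenge <- read_csv("data/messy_wb_data2.csv") |> 
  pivot_longer(
    cols = matches("\\[YR"),                       # every year column, e.g. `1972 [YR1972]`
    names_to = "year",
    values_to = "value"
  ) |> 
  mutate(year = substring(year, 1, 4)) |>          # keep just the four-digit year
  mutate(across(c(year, value), as.numeric)) |>    # ".." becomes NA (with a coercion warning)
  pivot_wider(
    id_cols = c(`Country Name`, `Country Code`, year),
    names_from = `Series Code`,                    # one column per indicator
    values_from = value
  ) |> 
  clean_names()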