Working With Data

September 29, 2024

How Do We Get Tidy/Clean Data?


  • Get lucky and find it (like on kaggle)
  • Wrangle it ourselves
  • Use a package where it has been wrangled for us
  • Download via an API

This Lesson

  • Practice with World Bank and V-Dem data
  • World Bank data through wbstats
    • There is another package called WDI
    • Both packages for accessing data through WB API
  • Varieties of Democracy (V-Dem) through vdemlite
    • There is also a package called vdemdata
    • vdemlite offers more functionality, works better in the cloud

filter(), select(), mutate()


Along the way we will practice some important dplyr verbs:


  • filter() is used to select observations based on their values
  • select() is used to select variables
  • mutate() is used to create new variables or modifying existing ones


As well as some helpful functions from the janitor package.

Data Frames

What is a Data Frame?


  • Special kind of tabular data used in data science
  • Each column can be a different data type
  • Data frames are the most common data structure in R

What is a Tibble?


  • Modern data frames in R
  • Offers better printing and subsetting behaviors
  • Does not convert character vectors to factors by default
  • Displays only the first 10 rows and as many columns as fit on screen
  • Column names are preserved exactly, even if they contain spaces

Creating a Tibble


  • When you read data into R with readr you automatically get a tibble
  • You can create a tibble using tibble() from the tibble package:
  library(tibble)
  
  # Create a tibble
  my_tibble <- tibble(
    name = c("Alice", "Bob", "Charlie"),
    age = c(25, 30, 35),
    height = c(160, 170, 180),
    is_student = c(TRUE, FALSE, FALSE)
  )
  
my_tibble  
# A tibble: 3 × 4
  name      age height is_student
  <chr>   <dbl>  <dbl> <lgl>     
1 Alice      25    160 TRUE      
2 Bob        30    170 FALSE     
3 Charlie    35    180 FALSE     

Common Data Types

  • <chr> (Character): Stores text strings
    • Example: "hello", "R programming"
  • <dbl> (Double): Stores decimal (floating-point) numbers
    • Example: 3.14, -1.0
  • <int> (Integer): Stores whole numbers (integers)
    • Example: 1, -100, 42
  • <lgl> (Logical): Stores boolean values (TRUE, FALSE, NA)
    • Example: TRUE, FALSE, NA
  • <fct> (Factor): Stores categorical variables with fixed levels
    • Example: factor(c("low", "medium", "high"))
  • <date> (Date): Stores dates in the “YYYY-MM-DD” format
    • Example: as.Date("2024-09-05")

Other Data Types


  • <dttm> (Date-Time or POSIXct): Stores date-time objects (both date and time).
    • Example: as.POSIXct("2024-09-05 14:30:00")
  • <time> (Time): Specifically stores time-of-day values (rarely seen without a date)
    • Example: "14:30:00"
  • <list> (List): Stores lists, where each entry can be a complex object.
    • Example: list(c(1, 2, 3), c("a", "b", "c"))

Dates and Times with lubridate

  • lubridate is an R package that makes it easier to work with dates and times

  • Use convenient functions to store dates in different formats

library(lubridate)
  
# Store a date
my_date <- ymd("2024-09-05")
my_date2 <- mdy("09-05-2024")
my_date3 <- dmy("05-09-2024")
  
# Print in long form
format(my_date, "%B %d, %Y")
[1] "September 05, 2024"

Your Turn


  • Create your own tibble
  • Make it on a topic you find interesting
  • Try to include at least three data types
05:00

APIs

APIs


  • API stands for “Application Programming Interface”
  • Way for two computers to talk to each other
  • In our case, we will use APIs to download social science data

APIs in R

  • APIs are accessed through packages in R
  • Sometimes there can be more than one package for an API
  • Much easier than reading in data from messy flat file!
  • We will use a few API packages in this course
    • World Bank data through wbstats (or WDI)
    • fredr for Federal Reserve Economic Data
    • tidycensus for US Census data
  • But there are many APIs out there (please explore!)

Searching for WB Indicators


flfp_indicators <- wb_search("female labor force") # store the list of indicators

print(flfp_indicators, n=26) # view the indicators

wbstats Example


# Load packages
library(wbstats) # for downloading WB data
library(dplyr) # for selecting, renaming and mutating
library(janitor) # for rounding

# Store the list of indicators in an object
indicators <- c("flfp" = "SL.TLF.CACT.FE.ZS", "women_rep" = "SG.GEN.PARL.ZS") 

# Download the data  
women_emp <- wb_data(indicators, mrv = 50) |> # download data for last 50 yrs
  select(!iso2c) |> # drop the iso2c code which we won't be using
  rename(year = date) |> # rename date to year 
  mutate(
    flfp = round_to_fraction(flfp, denominator = 100), # round to nearest 100th
    women_rep = round_to_fraction(women_rep, denominator = 100) 
  )

# View the data
glimpse(women_emp) 

Your Turn!


  • Search for a WB indicator
  • Download the data
05:00

V-Dem Data

The V-Dem Dataset


  • V-Dem stands for Varieties of Democracy
  • It is a dataset that measures democracy around the world
  • Based on expert assessments of the quality of democracy in each country
  • Two packages we will explore: vdemlite and vdemdata

vdemlite


  • Covers a few hundred commonly used indicators and indices from 1970 onward
  • Covers everything in this document
  • As opposed to 4000+ indicators from the 18th century onward
  • Adds some functionality for working with the data
  • Easier to work with in the cloud and apps

vdemlite fuctions


  • fetchdem() to download the data
  • summarizedem() provides searchable table of indicators with summary stats
  • searchdem() to search for specific indicators or all indicators used to construct an index
  • See the vdemlite documentation for more details

fetchdem()


# Load packages
library(vdemlite) # to download V-Dem data

# Polyarchy and clean elections index for USA and Sweden for 2000-2020
dem_indicators <- fetchdem(indicators = c("v2x_polyarchy", "v2xel_frefair"),
                           countries = c("USA", "SWE"))

# View the data
glimpse(dem_indicators)

summarizedem()


# Summary statistics for the polyarchy index
summarizedem(indicator = "v2x_polyarchy")

searchdem()


searchdem()

Your Turn


  • Look at the vdemlite documentation
  • Try using searchdem() to find an indicator you are interested in using
  • Use summarizedem() to get summary statistics for that variable
  • Use fetchdem() to download the data for that variable for a country or countries of interest
  • Try using mutate() to add region codes to the data
05:00