Working With Data

June 22, 2025

How Do We Get Tidy/Clean Data?

Get lucky and find it (like on Kaggle)
Wrangle it ourselves
Use a package where it has been wrangled for us
Download via an API

This Lesson

Practice with World Bank and V-Dem data
World Bank data through wbstats
- There is another package called WDI
- Both packages access data through the World Bank API
Varieties of Democracy (V-Dem) through vdemlite
- There is also a package called vdemdata
- vdemlite offers more functionality and works better in the cloud

`filter()`, `select()`, `mutate()`

Along the way we will practice some important dplyr verbs:

filter() is used to select observations based on their values
select() is used to select variables
mutate() is used to create new variables or modify existing ones

As well as some helpful functions from the janitor package.
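
A minimal sketch of the three verbs on a toy tibble, with clean_names() from janitor tidying the column names first. The data, column names, and the object names `raw` and `north` are invented for illustration.

```r
library(dplyr)
library(tibble)
library(janitor)

# Hypothetical toy data: names and numbers are made up for illustration
raw <- tibble(
  `Country Name` = c("A", "B", "C", "D"),
  Region = c("North", "South", "North", "South"),
  GDP = c(100, 250, 80, 400),
  Population = c(10, 25, 4, 50)
)

north <- raw |>
  clean_names() |>                            # janitor: snake_case column names
  filter(region == "North") |>                # keep only rows where region is North
  select(country_name, gdp, population) |>    # keep only these three columns
  mutate(gdp_per_capita = gdp / population)   # add a new column

north
```

Each verb takes a data frame and returns a data frame, which is why they chain cleanly with the pipe.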

Data Frames

What is a Data Frame?

Special kind of tabular data used in data science
Each column can be a different data type
Data frames are the most common structure for tabular data in R

What is a Tibble?

Modern data frames in R
Offers better printing and subsetting behaviors
Does not convert character vectors to factors by default
Displays only the first 10 rows and as many columns as fit on screen
Column names are preserved exactly, even if they contain spaces
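
A small sketch comparing a tibble with a base data frame on two of the points above: column names and subsetting. The data are made up, and `tb` and `df` are illustrative names.

```r
library(tibble)

# Same (made-up) data stored as a tibble and as a base data frame
tb <- tibble(`country name` = c("Sweden", "Ghana"), score = c(0.9, 0.7))
df <- data.frame(`country name` = c("Sweden", "Ghana"), score = c(0.9, 0.7))

names(tb)  # the space in the column name is kept
names(df)  # base R rewrites the name to a syntactic one

# Single-column subsetting: a tibble stays a tibble, a data frame drops to a vector
class(tb[, "score"])
class(df[, "score"])
```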

Creating a Tibble

When you read data into R with readr you automatically get a tibble
You can create a tibble using tibble() from the tibble package:

library(tibble)

# Create a tibble
my_tibble <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  height = c(160, 170, 180),
  is_student = c(TRUE, FALSE, FALSE)
)

my_tibble

# A tibble: 3 × 4
  name      age height is_student
  <chr>   <dbl>  <dbl> <lgl>     
1 Alice      25    160 TRUE      
2 Bob        30    170 FALSE     
3 Charlie    35    180 FALSE

Common Data Types

<chr> (Character): Stores text strings
- Example: "hello", "R programming"
<dbl> (Double): Stores decimal (floating-point) numbers
- Example: 3.14, -1.0
<int> (Integer): Stores whole numbers (integers)
- Example: 1L, -100L, 42L (the L suffix marks an integer literal in R; a bare 1 is stored as a double)
<lgl> (Logical): Stores boolean values (TRUE, FALSE, NA)
- Example: TRUE, FALSE, NA
<fct> (Factor): Stores categorical variables with fixed levels
- Example: factor(c("low", "medium", "high"))
<date> (Date): Stores dates in the “YYYY-MM-DD” format
- Example: as.Date("2024-09-05")

Other Data Types

<dttm> (Date-Time or POSIXct): Stores date-time objects (both date and time).
- Example: as.POSIXct("2024-09-05 14:30:00")
<time> (Time): Specifically stores time-of-day values (rarely seen without a date)
- Example: "14:30:00"
<list> (List): Stores lists, where each entry can be a complex object.
- Example: list(c(1, 2, 3), c("a", "b", "c"))
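
As a rough illustration, the sketch below builds one tibble with a column of each type listed above (except `<time>`) and checks what R reports for each. The column names and values are invented, and the time zone is set to UTC only to keep the example reproducible.

```r
library(tibble)

# Hypothetical tibble with one column per data type (values are made up)
types_demo <- tibble(
  label = c("hello", "R programming"),                 # <chr>
  ratio = c(3.14, -1.0),                               # <dbl>
  count = c(1L, 42L),                                  # <int>
  flag  = c(TRUE, NA),                                 # <lgl>
  level = factor(c("low", "high"),
                 levels = c("low", "high")),           # <fct>
  day   = as.Date(c("2024-09-05", "2024-09-06")),      # <date>
  stamp = as.POSIXct(c("2024-09-05 14:30:00",
                       "2024-09-06 09:00:00"),
                     tz = "UTC"),                      # <dttm>
  items = list(c(1, 2, 3), c("a", "b", "c"))           # <list>
)

# First class of each column
sapply(types_demo, function(col) class(col)[1])
```

Printing `types_demo` shows the abbreviations under each column name, just like the tibble output earlier.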

Dates and Times with `lubridate`

lubridate is an R package that makes it easier to work with dates and times
Use convenient functions such as ymd(), mdy(), and dmy() to parse dates written in different formats

library(lubridate)
  
# Store a date
my_date <- ymd("2024-09-05")
my_date2 <- mdy("09-05-2024")
my_date3 <- dmy("05-09-2024")
  
# Print in long form
format(my_date, "%B %d, %Y")

[1] "September 05, 2024"
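
Once a date is parsed, lubridate can also pull out its components and do date arithmetic. A short sketch that repeats the date from the example above (the object name `later` is illustrative):

```r
library(lubridate)

my_date <- ymd("2024-09-05")

# The three parsers read different input orders but yield the same date
my_date == mdy("09-05-2024")
my_date == dmy("05-09-2024")

# Pull out components
year(my_date)
month(my_date, label = TRUE)  # month as a labeled, ordered factor
wday(my_date, label = TRUE)   # day of the week

# Date arithmetic: add 30 days
later <- my_date + days(30)
later
```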

Your Turn

Create your own tibble
Make it on a topic you find interesting
Try to include at least three data types


APIs

API stands for “Application Programming Interface”
Way for two computers to talk to each other
In our case, we will use APIs to download social science data

APIs in R

APIs are accessed through packages in R
Sometimes there can be more than one package for an API
Much easier than reading in data from a messy flat file!
We will use a few API packages in this course
- World Bank data through wbstats (or WDI)
- fredr for Federal Reserve Economic Data
- tidycensus for US Census data
But there are many APIs out there (please explore!)

Searching for WB Indicators

library(wbstats) # for searching and downloading WB data

flfp_indicators <- wb_search("female labor force") # store the list of indicators

print(flfp_indicators, n = 26) # view the indicators

`wbstats` Example

# Load packages
library(wbstats) # for downloading WB data
library(dplyr) # for selecting, renaming and mutating
library(janitor) # for rounding

# Store the list of indicators in an object
indicators <- c("flfp" = "SL.TLF.CACT.FE.ZS", "women_rep" = "SG.GEN.PARL.ZS") 

# Download the data  
women_emp <- wb_data(indicators, mrv = 50) |> # download data for last 50 yrs
  select(!iso2c) |> # drop the iso2c code which we won't be using
  rename(year = date) |> # rename date to year 
  mutate(
    flfp = round_to_fraction(flfp, denominator = 100), # round to nearest 100th
    women_rep = round_to_fraction(women_rep, denominator = 100) 
  )

# View the data
glimpse(women_emp)
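
Once the data are downloaded, the same dplyr verbs can narrow them down. The sketch below avoids an API call by using a small hypothetical stand-in for `women_emp`. It assumes the columns left by the pipeline above (`iso3c`, `country`, `year`, `flfp`, `women_rep`), and all values are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for women_emp; the numbers are NOT real World Bank data
women_emp_demo <- tibble(
  iso3c     = c("SWE", "SWE", "USA", "USA"),
  country   = c("Sweden", "Sweden", "United States", "United States"),
  year      = c(2019, 2020, 2019, 2020),
  flfp      = c(61.5, 61.0, 57.4, 56.2),
  women_rep = c(47.3, 47.0, 23.6, NA)
)

sweden_recent <- women_emp_demo |>
  filter(iso3c == "SWE", year >= 2020) |>    # one country, recent years only
  select(country, year, flfp, women_rep) |>  # keep the columns of interest
  mutate(flfp_share = flfp / 100)            # percentage expressed as a proportion

sweden_recent
```

Replacing `women_emp_demo` with `women_emp` should apply the same steps to the downloaded data, provided the column names match.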

Your Turn!

Search for a WB indicator
Download the data


V-Dem Data

The V-Dem Dataset

V-Dem stands for Varieties of Democracy
It is a dataset that measures democracy around the world
Based on expert assessments of the quality of democracy in each country
Two packages we will explore: vdemlite and vdemdata

`vdemlite`

Covers a few hundred commonly used indicators and indices from 1970 onward
Covers everything in this document
The full V-Dem dataset, by contrast, has 4000+ indicators from the 18th century onward
Adds some functionality for working with the data
Easier to work with in the cloud and apps

`vdemlite` functions

fetchdem() to download the data
summarizedem() provides a searchable table of indicators with summary stats
searchdem() to search for specific indicators or all indicators used to construct an index
See the vdemlite documentation for more details

`fetchdem()`

# Load packages
library(vdemlite) # to download V-Dem data

# Polyarchy and clean elections index for USA and Sweden for 2000-2020
dem_indicators <- fetchdem(indicators = c("v2x_polyarchy", "v2xel_frefair"),
                           start_year = 2000, end_year = 2020,
                           countries = c("USA", "SWE"))

# View the data
glimpse(dem_indicators)

`summarizedem()`

# Summary statistics for the polyarchy index
summarizedem(indicator = "v2x_polyarchy")

`searchdem()`

searchdem()

Your Turn

Look at the vdemlite documentation
Try using searchdem() to find an indicator you are interested in using
Use summarizedem() to get summary statistics for that variable
Use fetchdem() to download the data for that variable for a country or countries of interest
Try using mutate() to add region codes to the data
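
For the last step, one possible approach is mutate() with case_when(). This is a sketch on a hypothetical stand-in for fetchdem() output. The identifier column name `country_text_id` is an assumption, the values are made up, and the region labels are assigned by hand rather than taken from V-Dem.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for fetchdem() output; column names and values are assumed
dem_demo <- tibble(
  country_text_id = c("USA", "SWE", "GHA"),
  year            = c(2020, 2020, 2020),
  v2x_polyarchy   = c(0.8, 0.9, 0.7)
)

dem_regions <- dem_demo |>
  mutate(
    region = case_when(
      country_text_id == "USA" ~ "North America",
      country_text_id == "SWE" ~ "Europe",
      country_text_id == "GHA" ~ "Africa",
      TRUE ~ NA_character_   # anything not listed stays missing
    )
  )

dem_regions
```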

