Module 2.3

Summarizing Data

Prework
  • Start a QMD file for this module.
  • Review the concepts of filter(), select(), and mutate() from our previous lesson.
  • Read about the vdemlite package.
  • vdemlite is not on CRAN, so you will need to install it from GitHub using the pak package.
    • First install pak by typing install.packages("pak") in your console.
    • Then, install the vdemlite package by typing pak::pkg_install("eteitelbaum/vdemlite") in your console.

Overview

In this lesson, you’ll learn how to summarize data by groups using the powerful group_by(), summarize(), and arrange() functions from the dplyr package. This sequence of operations is one of the most common and useful workflows in data science. We’ll apply it to real-world data from the Varieties of Democracy (V-Dem) project, a rich dataset that measures democratic attributes across countries and years. You’ll gain experience calculating summary statistics for different regions and time periods, and ranking countries or groups based on those statistics.

The V-Dem Dataset (and the vdemlite package)

V-Dem stands for Varieties of Democracy. The project provides detailed, expert-coded data on the quality of democracy across countries and years. The full dataset, accessible via the vdemdata package, includes over 4,000 variables dating back to the 18th century. But this dataset is quite large and complex, making it less practical for many applications.

For this class we are going to mainly rely on a package called vdemlite. This package includes several hundred widely used indicators from 1970 onward and is optimized for quick access and online teaching environments.

The vdemlite package comes with several convenience functions. The first, searchdem(), helps you look up specific indicators or find the underlying components of composite indices.

Variable Tags and Descriptions

When you call searchdem(), you get a table of all the indicators in the vdemlite package that you can search by indicator tag/label and description. Try it out!

Once you have identified a variable that you are interested in, you can call summarizedem(), which generates summary statistics for an indicator. Let’s summarize the polyarchy score, a widely used measure of electoral democracy.

library(vdemlite) # load the package so its functions are available

summarizedem(indicator = "v2x_polyarchy")
Summary of v2x_polyarchy by Country
Years: 1970 - 2023

Finally, you can use fetchdem() to retrieve a subset of the data filtered by indicators, countries, and years.

library(dplyr)

dem_indicators <- fetchdem(
  indicators = c("v2x_polyarchy", "v2xel_frefair"),                   
  start_year = 2000, end_year = 2020,
  countries = c("USA", "SWE")
  )

glimpse(dem_indicators)
Rows: 42
Columns: 6
$ country_name    <chr> "Sweden", "Sweden", "Sweden", "Sweden", "Sweden", "Swe…
$ country_text_id <chr> "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE"…
$ country_id      <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
$ year            <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, …
$ v2x_polyarchy   <dbl> 0.914, 0.914, 0.914, 0.915, 0.915, 0.915, 0.915, 0.916…
$ v2xel_frefair   <dbl> 0.962, 0.962, 0.962, 0.963, 0.963, 0.963, 0.963, 0.965…

Here, fetchdem() takes filtering arguments for indicators, countries, and years. We could also have used the dplyr select() and filter() verbs to narrow down the data after fetching it, and this is actually what is going on “under the hood.” The fetchdem() arguments just let you get the subset you want without writing that code yourself.
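To make the “under the hood” idea concrete, here is a minimal sketch of the equivalent dplyr steps. It is not the package's actual source code. The toy data frame and its values are made up for illustration, and only the column names mirror the ones shown in the glimpse() output above.

```r
library(dplyr)

# Toy stand-in for a country-year table (values are made up)
toy <- tibble(
  country_text_id = c("USA", "USA", "SWE", "SWE", "MEX", "MEX"),
  year            = c(1999, 2005, 1999, 2005, 1999, 2005),
  v2x_polyarchy   = c(0.85, 0.86, 0.91, 0.92, 0.55, 0.65),
  v2x_libdem      = c(0.80, 0.81, 0.88, 0.89, 0.35, 0.45)
)

subset_by_hand <- toy |>
  # keep only the countries and years we want (rows)
  filter(country_text_id %in% c("USA", "SWE"), year >= 2000, year <= 2020) |>
  # keep only the identifiers and the indicator we want (columns)
  select(country_text_id, year, v2x_polyarchy)

subset_by_hand
```

The filter() call plays the role of the countries, start_year, and end_year arguments, and select() plays the role of the indicators argument.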

Your Turn!
  • Use searchdem() to find an indicator of interest.
  • Use summarizedem() to view its summary statistics.
  • Use fetchdem() to download that variable for a country or set of countries.

Grouping and Summarizing

Let’s use the V-Dem data to illustrate how to group and summarize data. One of the most common sequences in data wrangling involves group_by(), followed by summarize(), and then arrange() to sort the results.

We can start by downloading some data for all of the countries. Let’s download the polyarchy score (v2x_polyarchy), along with V-Dem’s liberal democracy score (v2x_libdem), a women’s political empowerment index (v2x_gender), and GDP per capita (e_gdppc). Let’s also download the region of each country (e_regionpol_6C) so that we have something to group and summarize by.

We will save those data in a data frame called democracy and pipe them into rename() to give the variables more intuitive names.

democracy <- 
  fetchdem(
    indicators = c(
      "v2x_polyarchy", 
      "v2x_libdem", 
      "v2x_gender", 
      "e_gdppc", 
      "e_regionpol_6C"
      ) 
    ) |>
    rename(
      polyarchy = v2x_polyarchy,
      libdem = v2x_libdem,
      womens_emp = v2x_gender,
      gdp_pc = e_gdppc,
      region = e_regionpol_6C
    )

glimpse(democracy)
Rows: 9,170
Columns: 9
$ country_name    <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mex…
$ country_text_id <chr> "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX"…
$ country_id      <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ year            <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, …
$ polyarchy       <dbl> 0.250, 0.248, 0.249, 0.249, 0.251, 0.251, 0.262, 0.276…
$ libdem          <dbl> 0.111, 0.110, 0.111, 0.111, 0.111, 0.111, 0.115, 0.123…
$ womens_emp      <dbl> 0.421, 0.421, 0.421, 0.426, 0.426, 0.433, 0.446, 0.454…
$ gdp_pc          <dbl> 7.890, 8.082, 8.463, 8.845, 9.189, 9.480, 9.673, 9.914…
$ region          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …

Notice that we did not use the countries or year arguments here, so fetchdem() returns these variables for all of the countries and all of the years.

Also notice that the region variable is a series of region codes. This is going to make it hard to understand what the regions are, so we will need to do some additional work to make this more interpretable. Let’s call mutate() along with case_match() to create a new variable that classifies countries into named regions based on the e_regionpol_6C variable. We will save the new data frame as democracy again, overwriting the previous version.

democracy <- democracy |>
  mutate(
    region = case_match(region, # replace the numeric codes with region names
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
  )

glimpse(democracy)
Rows: 9,170
Columns: 9
$ country_name    <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mex…
$ country_text_id <chr> "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX"…
$ country_id      <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ year            <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, …
$ polyarchy       <dbl> 0.250, 0.248, 0.249, 0.249, 0.251, 0.251, 0.262, 0.276…
$ libdem          <dbl> 0.111, 0.110, 0.111, 0.111, 0.111, 0.111, 0.115, 0.123…
$ womens_emp      <dbl> 0.421, 0.421, 0.421, 0.426, 0.426, 0.433, 0.446, 0.454…
$ gdp_pc          <dbl> 7.890, 8.082, 8.463, 8.845, 9.189, 9.480, 9.673, 9.914…
$ region          <chr> "Latin America", "Latin America", "Latin America", "La…
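One thing to keep in mind with case_match() is what happens to values that match none of the listed cases. The sketch below uses a made-up vector of codes (the value 7 is hypothetical and is not a real V-Dem region code) and assumes dplyr 1.1.0 or later, where case_match() is available.

```r
library(dplyr)

codes <- c(1, 2, 6, 7) # 7 is a made-up code with no matching case

# Without .default, any unmatched value becomes NA
no_default <- case_match(
  codes,
  1 ~ "Eastern Europe",
  2 ~ "Latin America",
  6 ~ "Asia"
)

# With .default, unmatched values get a fallback label instead
with_default <- case_match(
  codes,
  1 ~ "Eastern Europe",
  2 ~ "Latin America",
  6 ~ "Asia",
  .default = "Unknown"
)

no_default
with_default
```

If you ever see unexpected NAs in a recoded column, a quick count() of that column is a simple way to check whether some codes were left out.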

Once we have the data ready, we can summarize these variables by region. The example below groups the data by region and then calculates the mean of polyarchy, the median of libdem, the standard deviation of womens_emp, and the minimum value of gdp_pc. We then arrange the results in descending order of the mean polyarchy score.

democracy |>
  group_by(region) |>
  summarize(
    polyarchy_mean = mean(polyarchy, na.rm = TRUE),
    libdem_median = median(libdem, na.rm = TRUE),
    womens_emp_sd = sd(womens_emp, na.rm = TRUE),
    gdp_pc_min = min(gdp_pc, na.rm = TRUE)
  ) |>
  arrange(desc(polyarchy_mean))
# A tibble: 6 × 5
  region         polyarchy_mean libdem_median womens_emp_sd gdp_pc_min
  <chr>                   <dbl>         <dbl>         <dbl>      <dbl>
1 The West                0.846         0.802         0.101      4.04 
2 Latin America           0.532         0.401         0.201      1.19 
3 Eastern Europe          0.475         0.338         0.161      1.47 
4 Asia                    0.354         0.212         0.193      0.726
5 Africa                  0.315         0.143         0.205      0.286
6 Middle East             0.213         0.118         0.183      1.49 

Code Detail: na.rm = TRUE

Note how we add na.rm = TRUE to the mean(), median(), sd(), and min() functions. This is important because it tells R to ignore any missing values (NAs) when calculating these statistics. If you don’t include this argument, R will return NA for that statistic in any group that contains even one missing value.
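A tiny example shows the difference. The vector below is made up for illustration and uses only base R.

```r
scores <- c(0.2, 0.4, NA, 0.9) # made-up values with one missing entry

mean_with_na <- mean(scores)               # NA, because one value is missing
mean_no_na   <- mean(scores, na.rm = TRUE) # average of the three non-missing values

mean_with_na
mean_no_na
```

Be aware that na.rm = TRUE does not rescue a group in which every value is missing. In that case mean() returns NaN, and min() returns Inf along with a warning.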

This pattern — group, summarize, arrange — is at the core of many descriptive analyses. You’re grouping the data by a categorical variable (region), summarizing one or more numeric variables (e.g., polyarchy, libdem), and then sorting the results to highlight interesting patterns.
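It can help to run the pattern on a data frame small enough to check by hand. The sketch below uses made-up regions "A" and "B" and made-up scores. The n() helper, which counts the rows in each group, is a dplyr function not used elsewhere in this lesson.

```r
library(dplyr)

# Toy data: two hypothetical regions with made-up polyarchy scores
toy <- tibble(
  region    = c("A", "A", "B", "B", "B"),
  polyarchy = c(0.2, 0.4, 0.8, NA, 0.6)
)

toy_summary <- toy |>
  group_by(region) |>
  summarize(
    n_obs = n(),                                    # rows per group, NAs included
    polyarchy_mean = mean(polyarchy, na.rm = TRUE)  # mean of non-missing values
  ) |>
  arrange(desc(polyarchy_mean)) # highest mean first

toy_summary
```

Region B has three rows but only two non-missing scores, so its mean is computed from 0.8 and 0.6, and B is sorted ahead of A.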

What if we wanted to get the same summary statistics for multiple variables at once? To do this, we can use the across() function within summarize(). This allows us to apply the same function (like mean(), median(), etc.) to multiple columns without repeating code.

democracy |>
  group_by(region) |>
  summarize(
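    across(
      c(polyarchy, libdem, womens_emp, gdp_pc),
      # wrap mean() in an anonymous function so we can set na.rm = TRUE;
      # passing extra arguments directly to across() is deprecated in dplyr 1.1.0+
      \(x) mean(x, na.rm = TRUE),
      .names = "mean_{.col}"
      )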
  ) |>
  arrange(desc(mean_polyarchy))
# A tibble: 6 × 5
  region         mean_polyarchy mean_libdem mean_womens_emp mean_gdp_pc
  <chr>                   <dbl>       <dbl>           <dbl>       <dbl>
1 The West                0.846       0.776           0.867       31.5 
2 Latin America           0.532       0.392           0.656        8.37
3 Eastern Europe          0.475       0.357           0.739       11.6 
4 Asia                    0.354       0.265           0.557        7.38
5 Africa                  0.315       0.215           0.537        3.83
6 Middle East             0.213       0.165           0.422       20.7 

Here we are grouping the data by region and then calculating the mean of several indicators: polyarchy, libdem, womens_emp, and gdp_pc. The .names argument builds the new column names from the original variable names, prefixed with “mean_”.
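across() can also apply more than one function at a time if you give it a named list of functions. The sketch below goes one step beyond the example above, using made-up data with hypothetical regions "A" and "B". In the .names specification, {.col} stands for the column name and {.fn} for the name given to each function in the list.

```r
library(dplyr)

# Toy data with made-up values; libdem has one missing entry
toy <- tibble(
  region    = c("A", "A", "B", "B"),
  polyarchy = c(0.2, 0.4, 0.6, 0.8),
  libdem    = c(0.1, 0.3, NA, 0.7)
)

toy_stats <- toy |>
  group_by(region) |>
  summarize(
    across(
      c(polyarchy, libdem),
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        max  = \(x) max(x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}" # e.g. polyarchy_mean, polyarchy_max
    )
  )

toy_stats
```

The result has one row per region and one column for each combination of variable and function.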

Your Turn!
  • Find a few indicators of interest in vdemlite using searchdem().
  • Use summarizedem() to view their summary statistics.
  • Use fetchdem() to download the data for those indicators and region codes that you can use to group the data, filtering for a subset of years if you like, but retaining all of the countries. Save those data in a data frame called democracy or something similar.
  • Use a group_by(), summarize(), and arrange() sequence to calculate some summary statistics for the data. Don’t forget to include na.rm = TRUE in your summary functions to handle missing values.
  • Then, use across() to compute one statistic (e.g., the mean or median) of several indicators.