Module 2.3
Summarizing Data
- Start a QMD file for this module.
- Review the concepts of filter(), select(), and mutate() from our previous lesson.
- Read about the vdemlite package.
- vdemlite is not on CRAN, so you will need to install it from GitHub using the pak package.
  - First install pak by typing install.packages("pak") in your console.
  - Then, install the vdemlite package by typing pak::pkg_install("eteitelbaum/vdemlite") in your console.
Overview
In this lesson, you’ll learn how to summarize data by groups using the group_by(), summarize(), and arrange() functions from the dplyr package. This sequence of operations is one of the most common and useful workflows in data science. We’ll apply it to real-world data from the Varieties of Democracy (V-Dem) project, a rich dataset that measures democratic attributes across countries and years. You’ll gain experience calculating summary statistics for different regions and time periods, and ranking countries or groups based on those statistics.
The V-Dem Dataset (and the vdemlite package)
V-Dem stands for Varieties of Democracy. The project provides detailed, expert-coded data on the quality of democracy across countries and years. The full dataset, accessible via the vdemdata package, includes over 4,000 variables dating back to the 18th century. But this dataset is quite large and complex, making it less practical for many applications.
For this class we will rely mainly on a package called vdemlite. This package includes several hundred widely used indicators from 1970 onward and is optimized for quick access and online teaching environments.
The vdemlite package comes with several convenient functions. searchdem() helps you look up specific indicators or find the underlying components of composite indices.
When you call searchdem() you get a searchable table of all the indicators in the vdemlite package that allows you to search by the indicator tag/label and descriptor. Try it out!
Once you have identified a variable that you are interested in, you can call summarizedem(), which generates summary statistics for an indicator. Let’s summarize the polyarchy score, a widely used measure of electoral democracy.
library(vdemlite)

summarizedem(indicator = "v2x_polyarchy")
Finally, you can use fetchdem()
to retrieve a subset of the data filtered by indicators, countries, and years.
library(dplyr)
dem_indicators <- fetchdem(
  indicators = c("v2x_polyarchy", "v2xel_frefair"),
  start_year = 2000, end_year = 2020,
  countries = c("USA", "SWE")
)
glimpse(dem_indicators)
Rows: 42
Columns: 6
$ country_name <chr> "Sweden", "Sweden", "Sweden", "Sweden", "Sweden", "Swe…
$ country_text_id <chr> "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE"…
$ country_id <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
$ year <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, …
$ v2x_polyarchy <dbl> 0.914, 0.914, 0.914, 0.915, 0.915, 0.915, 0.915, 0.916…
$ v2xel_frefair <dbl> 0.962, 0.962, 0.962, 0.963, 0.963, 0.963, 0.963, 0.965…
Here, fetchdem() takes filtering arguments for indicators (indicators), countries (countries), and years (start_year and end_year). We could also have used the dplyr select() and filter() verbs to narrow down the data after fetching it, and this is actually what is going on “under the hood.” These fetchdem() arguments just make it easier to work with the data without having to write a lot of code.
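To make the comparison with select() and filter() concrete, here is a minimal sketch of the same kind of subsetting done by hand. It is not the package's actual implementation: toy_vdem is a hypothetical stand-in for the full V-Dem table, and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the full V-Dem table (values are invented)
toy_vdem <- tibble(
  country_text_id = c("USA", "USA", "SWE", "SWE", "MEX", "MEX"),
  year            = c(1999, 2005, 1999, 2005, 1999, 2005),
  v2x_polyarchy   = c(0.85, 0.86, 0.91, 0.92, 0.60, 0.65),
  v2x_libdem      = c(0.80, 0.81, 0.88, 0.89, 0.40, 0.45)
)

subset_by_hand <- toy_vdem |>
  filter(
    country_text_id %in% c("USA", "SWE"), # plays the role of the countries argument
    year >= 2000, year <= 2020            # plays the role of start_year and end_year
  ) |>
  select(country_text_id, year, v2x_polyarchy) # plays the role of the indicators argument

subset_by_hand
```

Only the 2005 rows for the USA and Sweden survive the filter, and select() keeps the identifier columns plus the one requested indicator.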
Grouping and Summarizing
Let’s use the V-Dem data to illustrate how to group and summarize data. One of the most common sequences in data wrangling involves group_by(), followed by summarize(), and then arrange() to sort the results.
We can start by downloading some data for all of the countries. Let’s download the polyarchy score (v2x_polyarchy), along with V-Dem’s liberal democracy score (v2x_libdem), a women’s political empowerment index (v2x_gender), and GDP per capita (e_gdppc). Let’s also download the region of each country (e_regionpol_6C) so that we have something to group and summarize by.
Let’s save those data in a data frame called democracy, piping the result into rename() so that our variables have more intuitive names.
democracy <-
  fetchdem(
    indicators = c(
      "v2x_polyarchy",
      "v2x_libdem",
      "v2x_gender",
      "e_gdppc",
      "e_regionpol_6C"
    )
  ) |>
  rename(
    polyarchy = v2x_polyarchy,
    libdem = v2x_libdem,
    womens_emp = v2x_gender,
    gdp_pc = e_gdppc,
    region = e_regionpol_6C
  )
glimpse(democracy)
Rows: 9,170
Columns: 9
$ country_name <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mex…
$ country_text_id <chr> "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX"…
$ country_id <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ year <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, …
$ polyarchy <dbl> 0.250, 0.248, 0.249, 0.249, 0.251, 0.251, 0.262, 0.276…
$ libdem <dbl> 0.111, 0.110, 0.111, 0.111, 0.111, 0.111, 0.115, 0.123…
$ womens_emp <dbl> 0.421, 0.421, 0.421, 0.426, 0.426, 0.433, 0.446, 0.454…
$ gdp_pc <dbl> 7.890, 8.082, 8.463, 8.845, 9.189, 9.480, 9.673, 9.914…
$ region <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
Notice here that we are downloading these variables for all of the years and all of the countries. Because we do not supply the countries, start_year, or end_year arguments, we get all of the data.
Also notice that the region variable is a series of region codes. This is going to make it hard to understand what the regions are, so we will need to do some additional work to make this more interpretable. Let’s call mutate() along with case_match() to create a new variable that classifies countries into named regions based on the e_regionpol_6C variable. We will save the new data frame as democracy again, overwriting the previous version.
democracy <- democracy |>
  mutate(
    region = case_match(region, # replace the numeric codes with region names
      1 ~ "Eastern Europe",
      2 ~ "Latin America",
      3 ~ "Middle East",
      4 ~ "Africa",
      5 ~ "The West",
      6 ~ "Asia")
  )
glimpse(democracy)
Rows: 9,170
Columns: 9
$ country_name <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mex…
$ country_text_id <chr> "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX"…
$ country_id <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ year <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, …
$ polyarchy <dbl> 0.250, 0.248, 0.249, 0.249, 0.251, 0.251, 0.262, 0.276…
$ libdem <dbl> 0.111, 0.110, 0.111, 0.111, 0.111, 0.111, 0.115, 0.123…
$ womens_emp <dbl> 0.421, 0.421, 0.421, 0.426, 0.426, 0.433, 0.446, 0.454…
$ gdp_pc <dbl> 7.890, 8.082, 8.463, 8.845, 9.189, 9.480, 9.673, 9.914…
$ region <chr> "Latin America", "Latin America", "Latin America", "La…
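One detail of case_match() worth knowing when recoding: any value that matches none of the listed codes becomes NA unless you supply a .default. The sketch below uses a small made-up vector (the code 7 is hypothetical and does not correspond to a V-Dem region) to show both behaviors.

```r
library(dplyr)

codes <- c(1, 5, 6, 7) # 7 is a hypothetical code with no label

# Without .default, the unmatched code 7 is returned as NA
unlabeled <- case_match(
  codes,
  1 ~ "Eastern Europe",
  5 ~ "The West",
  6 ~ "Asia"
)

# With .default, unmatched codes get an explicit fallback label
labeled <- case_match(
  codes,
  1 ~ "Eastern Europe",
  5 ~ "The West",
  6 ~ "Asia",
  .default = "Unknown"
)

unlabeled
labeled
```

After a recode like this, checking for unexpected NA values in the new column is a quick way to confirm that every code was covered.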
Once we have the data ready, we can summarize these variables by region. The example below groups the data by region and computes the mean polyarchy score, the median libdem score, the standard deviation of womens_emp, and the minimum value of gdp_pc. We then arrange the results in descending order of the mean polyarchy score.
democracy |>
  group_by(region) |>
  summarize(
    polyarchy_mean = mean(polyarchy, na.rm = TRUE),
    libdem_median = median(libdem, na.rm = TRUE),
    womens_emp_sd = sd(womens_emp, na.rm = TRUE),
    gdp_pc_min = min(gdp_pc, na.rm = TRUE)
  ) |>
  arrange(desc(polyarchy_mean))
# A tibble: 6 × 5
region polyarchy_mean libdem_median womens_emp_sd gdp_pc_min
<chr> <dbl> <dbl> <dbl> <dbl>
1 The West 0.846 0.802 0.101 4.04
2 Latin America 0.532 0.401 0.201 1.19
3 Eastern Europe 0.475 0.338 0.161 1.47
4 Asia 0.354 0.212 0.193 0.726
5 Africa 0.315 0.143 0.205 0.286
6 Middle East 0.213 0.118 0.183 1.49
This pattern (group, summarize, arrange) is at the core of many descriptive analyses. You’re grouping the data by a categorical variable (region), summarizing one or more numeric variables (e.g., polyarchy, libdem), and then sorting the results to highlight interesting patterns.
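To see the mechanics on something small enough to check by hand, here is a minimal sketch of the same pattern on a made-up data frame. The scores table and its values are invented for illustration. It also counts rows and missing values per group, which helps when interpreting means computed with na.rm = TRUE.

```r
library(dplyr)

# Invented data: two groups, one missing score
scores <- tibble(
  region    = c("A", "A", "B", "B", "B"),
  polyarchy = c(0.2, 0.4, 0.9, NA, 0.7)
)

region_summary <- scores |>
  group_by(region) |>
  summarize(
    n_obs          = n(),                        # rows in each group
    n_missing      = sum(is.na(polyarchy)),      # missing scores in each group
    polyarchy_mean = mean(polyarchy, na.rm = TRUE)
  ) |>
  arrange(desc(polyarchy_mean))

region_summary
```

Group B sorts first because its mean is computed from the two non-missing scores (0.9 and 0.7), while group A averages 0.2 and 0.4.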
What if we wanted to get the same summary statistics for multiple variables at once? To do this, we can use the across() function within summarize(). This allows us to apply the same function (like mean(), median(), etc.) to multiple columns without repeating code.
democracy |>
  group_by(region) |>
  summarize(
    across(
      c(polyarchy, libdem, womens_emp, gdp_pc),
      # anonymous function: passing na.rm through across() is deprecated as of dplyr 1.1.0
      \(x) mean(x, na.rm = TRUE),
      .names = "mean_{col}"
    )
  ) |>
  arrange(desc(mean_polyarchy))
# A tibble: 6 × 5
region mean_polyarchy mean_libdem mean_womens_emp mean_gdp_pc
<chr> <dbl> <dbl> <dbl> <dbl>
1 The West 0.846 0.776 0.867 31.5
2 Latin America 0.532 0.392 0.656 8.37
3 Eastern Europe 0.475 0.357 0.739 11.6
4 Asia 0.354 0.265 0.557 7.38
5 Africa 0.315 0.215 0.537 3.83
6 Middle East 0.213 0.165 0.422 20.7
Here we are grouping the data by region and then summarizing the mean of several indicators: polyarchy, libdem, womens_emp, and gdp_pc. The .names = "mean_{col}" argument allows us to create new column names that include the original variable names, prefixed with “mean_”.
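across() can also apply several functions at once if you pass a named list. The sketch below is one way to do that on a made-up data frame (toy and its values are invented for illustration). It uses the {.fn} and {.col} placeholders from the across() documentation to build the column names.

```r
library(dplyr)

# Invented data: two groups, one missing polyarchy score
toy <- tibble(
  region    = c("A", "A", "B", "B", "B"),
  polyarchy = c(0.2, 0.4, 0.8, NA, 0.6),
  libdem    = c(0.1, 0.3, 0.7, 0.9, 0.5)
)

multi_summary <- toy |>
  group_by(region) |>
  summarize(
    across(
      c(polyarchy, libdem),
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd   = \(x) sd(x, na.rm = TRUE)
      ),
      .names = "{.fn}_{.col}" # function name first, then the column name
    )
  )

names(multi_summary)
```

Each selected column yields one output column per function, so two columns and two functions produce four summary columns alongside region.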