Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
Module 2.2
Transforming Data
- Start a new QMD file for this module in your Module 2 project folder.
- Install the
nycflights13
package:install.packages("nycflights13")
. - Familiarize yourself with the
dplyr
package. - Have a look at the
dplyr
cheatsheet.
Overview
Sometimes we want to do more than just explore a dataset in its raw form—we want to tailor it to answer specific questions. Whether we’re preparing data for a report, cleaning it for visualization, or building a model, it’s often necessary to narrow down to the most relevant rows, select only the variables we care about, or generate new variables that express relationships or transformations more clearly. In this module, we’ll learn how to do just that using three foundational functions from the dplyr package: filter(), select(), and mutate().
Exploring the nycflights13
Data
Let’s use the flights
dataset from the nycflights13
package to learn the grammar of data wrangling in R. This dataset contains detailed information on all flights departing from New York City airports in 2013. It includes a variety of variables, such as departure and arrival times, flight delays, air time, and more, making it an ideal dataset for practicing data wrangling techniques.
Filtering Rows with filter()
The filter()
function allows us to extract rows from a data frame that meet specific conditions. This is useful when we want to zoom in on a particular subset of the data. For example, we might be interested in analyzing only the flights that departed on a certain day, or perhaps only the flights from a specific carrier or airport.
Suppose we want to examine only the flights that departed on January 1st. We can use:
flights |>
filter(month == 1, day == 1)
# A tibble: 842 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Filtering is essential when we are interested in targeted analyses, like studying seasonal trends, isolating outliers, or preparing training and test sets for modeling.
In the above code chunk, we used filter(month == 1, day == 1)
to filter for flights departing on January 1. The ==
operator is used for comparison, meaning “equal to.” This is different from the =
operator, which is used for assignment in R. When filtering data, we always use ==
to specify conditions we want to match (not =
).
The filter()
function can handle multiple conditions using logical operators. Here we used ==
for “equal to.” But we could also use !=
(not equal to), <
(less than), >
(greater than), &
(and), and |
(or). We can also combine operators to create more complex conditions. For example, if we wanted to filter for flights that occurred in the month of June or earlier, we could write filter(month <= 6)
. Similarly, to find flights that were delayed by more than 30 minutes, we could use filter(dep_delay > 30)
. Try experimenting with these operators in the filter()
function in the code above to see how they affect the results.
Selecting Columns with select()
The select()
function is used to choose a subset of columns from a data frame. This is helpful when we are only interested in a few variables and want to avoid cluttering our analysis with unnecessary information.
For example, if we want to examine only the year, month, day, departure time, and flight number, we might write:
flights |>
select(year, month, day, dep_time, arr_time, carrier, flight)
# A tibble: 336,776 × 7
year month day dep_time arr_time carrier flight
<int> <int> <int> <int> <int> <chr> <int>
1 2013 1 1 517 830 UA 1545
2 2013 1 1 533 850 UA 1714
3 2013 1 1 542 923 AA 1141
4 2013 1 1 544 1004 B6 725
5 2013 1 1 554 812 DL 461
6 2013 1 1 554 740 UA 1696
7 2013 1 1 555 913 B6 507
8 2013 1 1 557 709 EV 5708
9 2013 1 1 557 838 B6 79
10 2013 1 1 558 753 AA 301
# ℹ 336,766 more rows
We can also rename variables while selecting them, which can be useful for clarity or consistency with other datasets:
flights |>
select(date = time_hour, airline = carrier, flight)
# A tibble: 336,776 × 3
date airline flight
<dttm> <chr> <int>
1 2013-01-01 05:00:00 UA 1545
2 2013-01-01 05:00:00 UA 1714
3 2013-01-01 05:00:00 AA 1141
4 2013-01-01 05:00:00 B6 725
5 2013-01-01 06:00:00 DL 461
6 2013-01-01 05:00:00 UA 1696
7 2013-01-01 06:00:00 B6 507
8 2013-01-01 06:00:00 EV 5708
9 2013-01-01 06:00:00 B6 79
10 2013-01-01 06:00:00 AA 301
# ℹ 336,766 more rows
Selecting variables makes our workflow cleaner, especially when preparing data for visualization or modeling.
Creating New Variables with mutate()
The mutate()
function allows us to add new columns to a data frame or transform existing ones. This is useful for deriving new insights from raw data. For example, we might want to calculate how much time was gained or lost during a flight:
flights |>
mutate(time_gain = arr_delay - dep_delay) |>
select(flight, dep_delay, arr_delay, time_gain)
# A tibble: 336,776 × 4
flight dep_delay arr_delay time_gain
<int> <dbl> <dbl> <dbl>
1 1545 2 11 9
2 1714 4 20 16
3 1141 2 33 31
4 725 -1 -18 -17
5 461 -6 -25 -19
6 1696 -4 12 16
7 507 -5 19 24
8 5708 -3 -14 -11
9 79 -3 -8 -5
10 301 -2 8 10
# ℹ 336,766 more rows
This calculation gives us a sense of whether flights tended to recover from delays during their journey. We can also use mutate()
to convert units (e.g., minutes to hours), flag particular conditions, or apply mathematical transformations.
We can use many different arithmetic operations within mutate()
, such as addition (+
), subtraction (-
), multiplication (*
), and division (/
) or exponentiation (^
or **
). For example, if we wanted to convert air_time
from minutes to hours, we could write mutate(air_time_hours = air_time / 60)
. Or, if we wanted to square the arr_delay
variable, we could write mutate(arr_delay_squared = arr_delay^2)
or mutate(arr_delay_squared = arr_delay ** 2)
. We will use arithmetic operators in the context of mutate a lot in the course, so it’s worth practicing with different operations to see how they work.
Combining filter()
, select()
, and mutate()
These three functions are most powerful when used together. A typical workflow might involve narrowing down the dataset to a subset of interest, selecting relevant variables, and then creating new variables to aid analysis.
Here’s an example:
flights |>
filter(month == 6, day == 15) |>
select(flight, origin, dest, dep_delay, arr_delay) |>
mutate(total_delay = dep_delay + arr_delay)
# A tibble: 801 × 6
flight origin dest dep_delay arr_delay total_delay
<int> <chr> <chr> <dbl> <dbl> <dbl>
1 1431 EWR CLT -4 -22 -26
2 327 EWR IAH 0 -3 -3
3 725 JFK BQN -4 34 30
4 1714 LGA IAH -3 -25 -28
5 1141 EWR SFO -2 -5 -7
6 5559 EWR DTW -9 -9 -18
7 635 LGA ORD -7 -23 -30
8 517 EWR MCO -6 -28 -34
9 731 LGA DTW -6 -16 -22
10 1535 LGA PHL -6 -7 -13
# ℹ 791 more rows
In this example, we look at flights on June 15, keep only columns that describe the flight and its delays, and compute the total delay time. This kind of pipeline helps us build a tidy and interpretable dataset suited for visualization or statistical analysis.
Applying the Concepts to the V-Dem Dataset
Now that we have a solid understanding of how to filter, select, and mutate data using dplyr, let’s apply these concepts to a new dataset: the Varieties of Democracy (V-Dem) dataset. This dataset contains a wealth of information about democracy and governance across countries and over time.
In this video walkthrough, we’ll revisit the filter()
, select()
, and mutate()
functions in the context of real-world political data. We’ll explore how to narrow our focus to particular years or countries, select the most relevant indicators of democracy, and construct new variables that help make sense of complex governance metrics.