Module 2.2

Transforming Data

Prework
  • Start a new QMD file for this module in your Module 2 project folder.
  • Install the nycflights13 package: install.packages("nycflights13").
  • Familiarize yourself with the dplyr package.
  • Have a look at the dplyr cheatsheet.

Overview

Sometimes we want to do more than just explore a dataset in its raw form—we want to tailor it to answer specific questions. Whether we’re preparing data for a report, cleaning it for visualization, or building a model, it’s often necessary to narrow down to the most relevant rows, select only the variables we care about, or generate new variables that express relationships or transformations more clearly. In this module, we’ll learn how to do just that using three foundational functions from the dplyr package: filter(), select(), and mutate().

Exploring the nycflights13 Data

Let’s use the flights dataset from the nycflights13 package to learn the grammar of data wrangling in R. This dataset contains detailed information on all flights departing from New York City airports in 2013. It includes a variety of variables, such as departure and arrival times, flight delays, air time, and more, making it an ideal dataset for practicing data wrangling techniques.

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Filtering Rows with filter()

The filter() function allows us to extract rows from a data frame that meet specific conditions. This is useful when we want to zoom in on a particular subset of the data. For example, we might be interested in analyzing only the flights that departed on a certain day, or perhaps only the flights from a specific carrier or airport.

Suppose we want to examine only the flights that departed on January 1st. We can use:

flights |>
  filter(month == 1, day == 1)
# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Filtering is essential when we are interested in targeted analyses, like studying seasonal trends, isolating outliers, or preparing training and test sets for modeling.

Code Detail: == vs =

In the above code chunk, we used filter(month == 1, day == 1) to filter for flights departing on January 1. The == operator is used for comparison, meaning “equal to.” This is different from the = operator, which is used for assignment in R. When filtering data, we always use == to specify conditions we want to match (not =).

The filter() function can handle multiple conditions using logical operators. Here we used == for “equal to.” But we could also use != (not equal to), < (less than), > (greater than), & (and), and | (or). We can also combine operators to create more complex conditions. For example, if we wanted to filter for flights that occurred in the month of June or earlier, we could write filter(month <= 6). Similarly, to find flights that were delayed by more than 30 minutes, we could use filter(dep_delay > 30). Try experimenting with these operators in the filter() function in the code above to see how they affect the results.

Your Turn!!
  1. Use the filter() function to find all flights that departed from JFK airport.
  2. Filter the flights that arrived on or after June 1.
  3. Find flights that arrived before May 1 and were delayed by more than 60 minutes.
  4. Filter for flights that were operated by American Airlines (carrier code “AA”) or Delta Airlines (carrier code “DL”).

Selecting Columns with select()

The select() function is used to choose a subset of columns from a data frame. This is helpful when we are only interested in a few variables and want to avoid cluttering our analysis with unnecessary information.

For example, if we want to examine only the year, month, day, departure time, and flight number, we might write:

flights |>
  select(year, month, day, dep_time, arr_time, carrier, flight)
# A tibble: 336,776 × 7
    year month   day dep_time arr_time carrier flight
   <int> <int> <int>    <int>    <int> <chr>    <int>
 1  2013     1     1      517      830 UA        1545
 2  2013     1     1      533      850 UA        1714
 3  2013     1     1      542      923 AA        1141
 4  2013     1     1      544     1004 B6         725
 5  2013     1     1      554      812 DL         461
 6  2013     1     1      554      740 UA        1696
 7  2013     1     1      555      913 B6         507
 8  2013     1     1      557      709 EV        5708
 9  2013     1     1      557      838 B6          79
10  2013     1     1      558      753 AA         301
# ℹ 336,766 more rows

We can also rename variables while selecting them, which can be useful for clarity or consistency with other datasets:

flights |>
  select(date = time_hour, airline = carrier, flight)
# A tibble: 336,776 × 3
   date                airline flight
   <dttm>              <chr>    <int>
 1 2013-01-01 05:00:00 UA        1545
 2 2013-01-01 05:00:00 UA        1714
 3 2013-01-01 05:00:00 AA        1141
 4 2013-01-01 05:00:00 B6         725
 5 2013-01-01 06:00:00 DL         461
 6 2013-01-01 05:00:00 UA        1696
 7 2013-01-01 06:00:00 B6         507
 8 2013-01-01 06:00:00 EV        5708
 9 2013-01-01 06:00:00 B6          79
10 2013-01-01 06:00:00 AA         301
# ℹ 336,766 more rows

Selecting variables makes our workflow cleaner, especially when preparing data for visualization or modeling.

Your Turn!!
  1. Use the select() function to keep only the year, month, day, dep_time, and arr_time columns.
  2. Select the flight, origin, dest, and air_time columns and rename them to flight_number, departure_airport, arrival_airport, and flight_duration.

Creating New Variables with mutate()

The mutate() function allows us to add new columns to a data frame or transform existing ones. This is useful for deriving new insights from raw data. For example, we might want to calculate how much time was gained or lost during a flight:

flights |>
  mutate(time_gain = arr_delay - dep_delay) |>
  select(flight, dep_delay, arr_delay, time_gain)
# A tibble: 336,776 × 4
   flight dep_delay arr_delay time_gain
    <int>     <dbl>     <dbl>     <dbl>
 1   1545         2        11         9
 2   1714         4        20        16
 3   1141         2        33        31
 4    725        -1       -18       -17
 5    461        -6       -25       -19
 6   1696        -4        12        16
 7    507        -5        19        24
 8   5708        -3       -14       -11
 9     79        -3        -8        -5
10    301        -2         8        10
# ℹ 336,766 more rows

This calculation gives us a sense of whether flights tended to recover from delays during their journey. We can also use mutate() to convert units (e.g., minutes to hours), flag particular conditions, or apply mathematical transformations.

We can use many different arithmetic operations within mutate(), such as addition (+), subtraction (-), multiplication (*), and division (/) or exponentiation (^ or **). For example, if we wanted to convert air_time from minutes to hours, we could write mutate(air_time_hours = air_time / 60). Or, if we wanted to square the arr_delay variable, we could write mutate(arr_delay_squared = arr_delay^2) or mutate(arr_delay_squared = arr_delay ** 2). We will use arithmetic operators in the context of mutate a lot in the course, so it’s worth practicing with different operations to see how they work.

Your Turn!!
  1. Use the mutate() function to create a new variable called total_delay that sums dep_delay and arr_delay.
  2. Create a new variable called air_time_hours that converts air_time from minutes to hours.
  3. Add a new variable called flight_speed that calculates the average speed of the flight in miles per hour, assuming the distance is given in miles and after converting air time into hours.

Combining filter(), select(), and mutate()

These three functions are most powerful when used together. A typical workflow might involve narrowing down the dataset to a subset of interest, selecting relevant variables, and then creating new variables to aid analysis.

Here’s an example:

flights |>
  filter(month == 6, day == 15) |>
  select(flight, origin, dest, dep_delay, arr_delay) |>
  mutate(total_delay = dep_delay + arr_delay)
# A tibble: 801 × 6
   flight origin dest  dep_delay arr_delay total_delay
    <int> <chr>  <chr>     <dbl>     <dbl>       <dbl>
 1   1431 EWR    CLT          -4       -22         -26
 2    327 EWR    IAH           0        -3          -3
 3    725 JFK    BQN          -4        34          30
 4   1714 LGA    IAH          -3       -25         -28
 5   1141 EWR    SFO          -2        -5          -7
 6   5559 EWR    DTW          -9        -9         -18
 7    635 LGA    ORD          -7       -23         -30
 8    517 EWR    MCO          -6       -28         -34
 9    731 LGA    DTW          -6       -16         -22
10   1535 LGA    PHL          -6        -7         -13
# ℹ 791 more rows

In this example, we look at flights on June 15, keep only columns that describe the flight and its delays, and compute the total delay time. This kind of pipeline helps us build a tidy and interpretable dataset suited for visualization or statistical analysis.

Applying the Concepts to the V-Dem Dataset

Now that we have a solid understanding of how to filter, select, and mutate data using dplyr, let’s apply these concepts to a new dataset: the Varieties of Democracy (V-Dem) dataset. This dataset contains a wealth of information about democracy and governance across countries and over time.

In this video walkthrough, we’ll revisit the filter(), select(), and mutate() functions in the context of real-world political data. We’ll explore how to narrow our focus to particular years or countries, select the most relevant indicators of democracy, and construct new variables that help make sense of complex governance metrics.

Your Turn!!
  1. Using the nycflights dataset, filter the flights for February only.
  2. Select carrier, origin, dest, and air_time.
  3. Create a new variable that converts air_time from minutes to hours.
  4. Explore other interesting combinations using filter(), select(), and mutate().
  5. Go to kaggle.com and find a dataset that interests you. Use the dplyr functions to filter, select, and mutate the data. Optional: glimpse() your data and share a screenshot of your results in the discussion forum.