AE 07: Using {duckdb} for NYC flights

Application exercise
Modified

August 25, 2025

Packages

We will use the following packages in this application exercise.

  • {tidyverse}: For data import, wrangling, and visualization.
  • {duckdb}: To query a DuckDB.
  • {dbplyr}: To translate our {dplyr} and {tidyr} functions to SQL.
  • {scales}: Formatting graph labels.

New York City flights data

The U.S. Department of Transportation publishes detailed statistics on flights originating in the United States. {nycflights13} is a highly popular R package containing information on all flights that departed NYC1 in 2013. The package was originally developed to provide practice wrangling and working with relational datasets.

That said, this package has become quite dated since it was originally developed in the mid-2010s. The same air travel data can be accessed for any airports in the United States using the {anyflights} package. The challenge is that data for more than a trivial handful of airports and years becomes significantly large and unwieldy to utilize.

To solve these issues, we will utilize a DuckDB that contains detailed air travel data for all NYC airports from 2015-2023. This dataset can be access from data/nyc-flights.duckdb.

Connect to the database

Demonstration: Connect to the local database and identify the tables.

con <- duckdb(dbdir = "data/nyc-flights.duckdb") |>
  dbConnect()

dbListTables(con)
[1] "airlines" "airports" "flights"  "planes"   "weather" 

The database has the following relational structure:

Connections between all five tables in the database. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows. Source: R4DS.

Connections between all five tables in the database. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows. Source: R4DS.

Demonstration: Use tbl() to create an object that represents each database table. This will allow us to use {dbplyr} to translate our code to the corresponding SQL query.

airlines <- tbl(con, "airlines")
airports <- tbl(con, "airports")
flights <- tbl(con, "flights")
planes <- tbl(con, "planes")
weather <- tbl(con, "weather")

Translate some basic SQL queries

Your turn: Explain what each of the following SQL queries does, and recreate them using {dbplyr}.

SELECT flights.*
FROM flights
WHERE (dep_delay < arr_delay)

Add response here.

# add code here
SELECT flights.*, distance / (air_time / 60.0) AS speed
FROM flights

Add response here.

# add code here

What causes flight delays?

Your turn: What is the relationship between weather events and departure delays? Calculate the average departure delay and join with the weather table. How do these R commands translate to SQL? What trends do you see?

TipHints
  • Before summarizing flights, examine the weather table and figure out its unit of analysis. This will clue you in to how best to summarize the flights data.
  • Use show_query() to examine the translated SQL code.
# add code here

Number of flights departing NYC airports over time

Now let’s examine potential patterns or trends in the volume of flights departing NYC-area airports over time.

Create a date variable

Your turn: Currently flights does not have a standalone date variable that identifies the date of the flight. Create one and store it at the beginning of the table.

TipMake a date from component columns

lubridate::make_date() can create a date variable from numeric columns which contain each of the date components.

NoteModifying contents of the database

tbl() provides a read-only interface to the table. You can modify the contents of the table and store them as an R object, but it does not actually change the contents of the table in the database. To modify the database directly, you need to use an appropriate method from {DBI}.

# add code here

Visualize the volume of flights over time

Demonstration: Calculate the daily number of departing flights from NYC and visualize as a line graph.

# add code here

Demonstration: Calculate a rolling seven-day average number of flights departing from NYC.

# add code here

Footnotes

  1. Specifically, John F. Kennedy International Airport (JFK), LaGuardia Airport (LGA), and Newark International Airport (EWR).↩︎