AE 00: Bechdel + data visualization
Suggested answers
These are suggested answers. This document should be used as a reference only; it is not designed to be an exhaustive key.
In this mini analysis we work with the data used in the FiveThirtyEight story titled “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”.
This analysis is about the Bechdel test, a measure of the representation of women in fiction.
Getting started
Packages
We start by loading the packages we’ll use: tidyverse for the majority of the analysis and scales for pretty plot labels later on.
library(tidyverse)
library(scales)
Data
The data are stored as a CSV (comma-separated values) file in the data folder of your repository. Let’s read it from there and save it as an object called bechdel.
bechdel <- read_csv("data/bechdel.csv")
Get to know the data
We can use the glimpse() function to get an overview (or “glimpse”) of the data.
glimpse(bechdel)
Rows: 1,615
Columns: 17
$ title <chr> "21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "…
$ year <dbl> 2013, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20…
$ gross_2013 <dbl> 67878146, 55078343, 211714070, 208105475, 190040426, 184…
$ budget_2013 <dbl> 13000000, 45658735, 20000000, 61000000, 40000000, 225000…
$ roi <dbl> 5.221396, 1.206305, 10.585703, 3.411565, 4.751011, 0.818…
$ binary <chr> "FAIL", "PASS", "FAIL", "FAIL", "FAIL", "FAIL", "FAIL", …
$ clean_test <chr> "notalk", "ok", "notalk", "notalk", "men", "men", "notal…
$ imdb <chr> "tt1711425", "tt1343727", "tt2024544", "tt1272878", "tt0…
$ test <chr> "notalk", "ok-disagree", "notalk-disagree", "notalk", "m…
$ budget <dbl> 1.30e+07, 4.50e+07, 2.00e+07, 6.10e+07, 4.00e+07, 2.25e+…
$ domgross <dbl> 25682380, 13414714, 53107035, 75612460, 95020213, 383624…
$ intgross <dbl> 42195766, 40868994, 158607035, 132493015, 95020213, 1458…
$ code <chr> "2013FAIL", "2012PASS", "2013FAIL", "2013FAIL", "2013FAI…
$ domgross_2013 <dbl> 25682380, 13611086, 53107035, 75612460, 95020213, 383624…
$ intgross_2013 <dbl> 42195766, 41467257, 158607035, 132493015, 95020213, 1458…
$ period_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ decade_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
- What does each observation (row) in the data set represent?
Each observation represents a different movie.
- How many observations (rows) are in the data set?
There are 1615 movies in the dataset.
- How many variables (columns) are in the data set?
There are 17 columns in the dataset.
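The row and column counts can also be retrieved with code instead of being read off the glimpse() output. Here is a minimal sketch using a small stand-in data frame (the name bechdel_small is hypothetical) built from the first three rows and three of the columns shown above; with the real data you would call the same functions on bechdel.

```r
# Stand-in for `bechdel`: the first three rows and three of the 17 columns
# from the glimpse() output above (the full data has 1,615 rows).
bechdel_small <- data.frame(
  title = c("21 & Over", "Dredd 3D", "12 Years a Slave"),
  year = c(2013, 2012, 2013),
  binary = c("FAIL", "PASS", "FAIL")
)

n_movies <- nrow(bechdel_small)  # number of observations (rows)
n_vars <- ncol(bechdel_small)    # number of variables (columns)
dim(bechdel_small)               # both at once: rows, then columns
```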
Variables of interest
The variables we’ll focus on are the following:
- budget_2013: Budget in 2013 inflation adjusted dollars.
- gross_2013: Gross (US and international combined) in 2013 inflation adjusted dollars.
- roi: Return on investment, calculated as the ratio of the gross to budget.
- clean_test: Bechdel test result:
  - ok = passes test
  - dubious
  - men = women only talk about men
  - notalk = women don’t talk to each other
  - nowomen = fewer than two women
- binary: Bechdel Test PASS vs FAIL binary
We will also use the year of release in data prep and the title of the movie to take a deeper look at some outliers.
There are a few other variables in the dataset, but we won’t be using them in this analysis.
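As a quick check on the definition of roi, we can recompute the ratio of gross to budget for the first three movies in the glimpse() output and compare it with the stored values. This is only a sketch: first_three and roi_check are hypothetical names introduced for illustration, and the stored roi values are the rounded ones printed above.

```r
library(dplyr)

# First three movies from the glimpse() output above.
first_three <- tibble(
  title = c("21 & Over", "Dredd 3D", "12 Years a Slave"),
  budget_2013 = c(13000000, 45658735, 20000000),
  gross_2013 = c(67878146, 55078343, 211714070),
  roi = c(5.221396, 1.206305, 10.585703)
)

# Recompute ROI as gross / budget; `roi_check` is a hypothetical column name.
roi_check <- first_three |>
  mutate(roi_check = gross_2013 / budget_2013)

roi_check |>
  select(title, roi, roi_check)
```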
Visualizing data with ggplot2
ggplot2 is the package and ggplot() is the function in this package that is used to create a plot.
ggplot() creates the initial base coordinate system, and we will add layers to that base. We first specify the data set we will use with data = bechdel.
ggplot(data = bechdel)
- The mapping argument is paired with an aesthetic (aes()), which tells us how the variables in our data set should be mapped to the visual properties of the graph.
ggplot(data = bechdel,
mapping = aes(x = budget_2013, y = gross_2013))
As we previously mentioned, we often omit the names of the first two arguments in R functions. So you’ll often see this written as:
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013))
Note that the result is exactly the same.
- The geom_xx function specifies the type of plot we want to use to represent the data. In the code below, we use geom_point(), which creates a plot where each observation is represented by a point.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013)) +
geom_point()
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Note that this results in a warning as well. What does the warning mean?
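One plausible reading of the warning is that some movies are missing a value for budget_2013 or gross_2013, so they cannot be placed on the plot. The sketch below shows how such rows could be counted. It assumes the dropped rows are missing values rather than values outside the scale range, and it uses a toy data frame with made-up numbers, not the real data.

```r
library(dplyr)

# Toy stand-in for `bechdel` (values are illustrative, not from the real data).
toy <- tibble(
  budget_2013 = c(10, 20, NA, 40),
  gross_2013 = c(15, NA, 30, 80)
)

# Rows that geom_point() cannot draw: either coordinate is missing.
n_dropped <- toy |>
  filter(is.na(budget_2013) | is.na(gross_2013)) |>
  nrow()

n_dropped
```

Running the same filter() on bechdel would show whether the count matches the number reported in the warning.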
Budget vs. gross revenue
Step 1 - Your turn
Modify the following plot to change the color of all points to a different color.
See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for many color options you can use by name in R or use the hex code for a color of your choice.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013)) +
geom_point(color = "blueviolet")
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
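If you prefer hex codes to color names, base R can translate between the two. The following sketch converts the named color used above into its hex code; rgb_parts and hex_code are names chosen for this illustration.

```r
# Look up the RGB components of a named R color and convert them to hex.
rgb_parts <- col2rgb("blueviolet")
hex_code <- rgb(t(rgb_parts), maxColorValue = 255)
hex_code
```

The resulting string can be passed to a geom in place of the name, for example geom_point(color = hex_code).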
Step 2 - Your turn
Add labels for the title and x and y axes.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013)) +
geom_point(color = "blueviolet") +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue"
)
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Step 3 - Your turn
An aesthetic is a visual property of one of the objects in your plot. Commonly used aesthetic options are:
- color
- fill
- shape
- size
- alpha (transparency)
Modify the plot below so that the color of the points is based on the variable binary.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013, color = binary)) +
geom_point() +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue, by Bechdel test"
)
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
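Steps 1 and 3 use color in two different ways: Step 1 sets a single fixed color outside aes(), while Step 3 maps color to a variable inside aes(). The sketch below contrasts the two on a toy data frame (made-up values, not the real data) and uses layer_data() to inspect the colors ggplot2 computes for each point.

```r
library(ggplot2)

# Toy stand-in for `bechdel` (illustrative values only).
toy <- data.frame(
  budget_2013 = c(1, 2, 3, 4),
  gross_2013 = c(2, 3, 5, 4),
  binary = c("FAIL", "PASS", "FAIL", "PASS")
)

# Mapped: color depends on a variable, so it goes inside aes().
p_mapped <- ggplot(toy, aes(x = budget_2013, y = gross_2013, color = binary)) +
  geom_point()

# Set: one fixed color for every point, so it goes outside aes().
p_set <- ggplot(toy, aes(x = budget_2013, y = gross_2013)) +
  geom_point(color = "blueviolet")

# layer_data() returns the values ggplot2 actually uses for drawing.
colors_mapped <- unique(layer_data(p_mapped)$colour)
colors_set <- unique(layer_data(p_set)$colour)
```

In the mapped version there is one color per level of binary, and ggplot2 adds a legend automatically; in the set version every point shares the same color and no legend is needed.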
Step 4 - Your turn
Expand on your plot from the previous step to make the size of your points based on roi.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013, color = binary, size = roi)) +
geom_point() +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue, by Bechdel test"
)
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Step 5 - Your turn
Expand on your plot from the previous step to make the transparency (alpha) of the points 0.5.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013, color = binary, size = roi)) +
geom_point(alpha = 0.5) +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue, by Bechdel test"
)
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Step 6 - Your turn
Expand on your plot from the previous step by using facet_wrap() to display the association between budget and gross for different values of clean_test.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013, color = binary, size = roi)) +
geom_point(alpha = 0.5) +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue, by Bechdel test"
) +
facet_wrap(facets = vars(clean_test))
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Step 7 - Demo
Improve your plot from the previous step by making the x and y scales more legible.
Make use of the scales package, specifically the scale_x_continuous() and scale_y_continuous() functions.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013, color = binary, size = roi)) +
geom_point(alpha = 0.5) +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue, by Bechdel test"
) +
facet_wrap(facets = vars(clean_test)) +
scale_x_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale()))
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
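To see what the labels argument is doing, it can help to call the labelling function directly. label_dollar() returns a function, and the sketch below applies it to the first three budgets from the glimpse() output; dollar_short and budget_labels are names chosen for this illustration.

```r
library(scales)

# Build the same labelling function used in the plot above.
dollar_short <- label_dollar(scale_cut = cut_short_scale())

# Apply it to the first three budgets from the glimpse() output.
budget_labels <- dollar_short(c(13000000, 45658735, 20000000))
budget_labels
```

The function returns one character label per input value, which is what ggplot2 prints at each axis break.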
Step 8 - Your turn
Expand on your plot from the previous step by using facet_grid() to display the association between budget and gross for different combinations of clean_test and binary. Comment on whether this was a useful update.
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013, color = binary, size = roi)) +
geom_point(alpha = 0.5) +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue, by Bechdel test"
) +
facet_grid(rows = vars(clean_test), cols = vars(binary)) +
scale_x_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale()))
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
It doesn’t seem particularly useful, since binary is a coarser version of clean_test: each value of clean_test maps onto exactly one value of binary, so half of the panels in the grid are empty.
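The redundancy between the two variables can be checked by counting how many distinct values of binary occur within each value of clean_test. The sketch below does this for the first six movies in the glimpse() output only, so it illustrates the check rather than proving it for the full data; first_six and binary_per_test are hypothetical names.

```r
library(dplyr)

# First six movies from the glimpse() output above.
first_six <- tibble(
  binary = c("FAIL", "PASS", "FAIL", "FAIL", "FAIL", "FAIL"),
  clean_test = c("notalk", "ok", "notalk", "notalk", "men", "men")
)

# Count how many distinct `binary` values occur within each `clean_test` value.
binary_per_test <- first_six |>
  group_by(clean_test) |>
  summarize(n_binary = n_distinct(binary))

binary_per_test
```

Replacing first_six with bechdel would run the same check on all movies.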
Step 9 - Demo
What other improvements could we make to this plot?
# Answers may vary
# Personally, I would label the legend titles and move the legends to the bottom
ggplot(bechdel,
aes(x = budget_2013, y = gross_2013, color = binary, size = roi)) +
geom_point(alpha = 0.5) +
labs(
x = "Budget (in 2013 dollars)",
y = "Gross revenue (in 2013 dollars)",
title = "Budget vs. gross revenue, by Bechdel test",
color = "Bechdel test",
size = "Return on investment"
) +
facet_wrap(facets = vars(clean_test)) +
scale_x_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
theme(legend.position = "bottom")
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Return-on-investment
Finally, let’s take a look at return-on-investment (ROI).
Step 1 - Your turn
Create side-by-side box plots of roi by clean_test where the boxes are colored by binary.
ggplot(bechdel,
aes(x = clean_test, y = roi, color = binary)) +
geom_boxplot() +
labs(
title = "Return on investment vs. Bechdel test result",
x = "Detailed Bechdel result",
y = "Return-on-investment (gross / budget)",
color = "Bechdel\nresult"
)
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_boxplot()`).
What are those movies with very high returns on investment?
bechdel |>
filter(roi > 400) |>
select(title, roi, budget_2013, gross_2013, year, clean_test)
# A tibble: 3 × 6
title roi budget_2013 gross_2013 year clean_test
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Paranormal Activity 671. 505595 339424558 2007 dubious
2 The Blair Witch Project 648. 839077 543776715 1999 ok
3 El Mariachi 583. 11622 6778946 1992 nowomen
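The filter above relies on a cutoff of 400 that we picked by looking at the plot. An alternative that needs no cutoff is to ask for the movies with the largest roi directly using slice_max(). The sketch below uses five movies taken from the outputs above, with roi recomputed as gross divided by budget; movies and top_roi are names chosen for this illustration.

```r
library(dplyr)

# Five movies taken from the outputs above; roi is recomputed as gross / budget.
movies <- tibble(
  title = c(
    "21 & Over", "Dredd 3D", "Paranormal Activity",
    "The Blair Witch Project", "El Mariachi"
  ),
  budget_2013 = c(13000000, 45658735, 505595, 839077, 11622),
  gross_2013 = c(67878146, 55078343, 339424558, 543776715, 6778946)
) |>
  mutate(roi = gross_2013 / budget_2013)

# Keep the three movies with the highest return on investment.
top_roi <- movies |>
  slice_max(roi, n = 3) |>
  select(title, roi)

top_roi
```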
Step 2 - Demo
Expand on your plot from the previous step to zoom in on movies with roi < 400 to get a better view of how the medians across the categories compare.
# coord_cartesian() zooms in on the y-axis without removing any data
ggplot(bechdel,
aes(x = clean_test, y = roi, color = binary)) +
geom_boxplot() +
labs(
title = "Return on investment vs. Bechdel test result",
x = "Detailed Bechdel result",
y = "Return-on-investment (gross / budget)",
color = "Bechdel\nresult"
) +
coord_cartesian(ylim = c(0, 18))
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_boxplot()`).
What does this plot say about return-on-investment on movies that pass the Bechdel test?
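A note on why the zoom uses coord_cartesian() rather than axis limits on the scale: the two behave differently for box plots. The sketch below compares them on toy data (made-up values) with one extreme observation. With coord_cartesian() all values still feed into the box plot statistics, while limits set on the scale drop the out-of-range values before the statistics are computed.

```r
library(ggplot2)

# Toy data (illustrative values): one group with a single extreme value.
toy <- data.frame(group = "a", roi = c(1, 2, 3, 4, 100))

base_plot <- ggplot(toy, aes(x = group, y = roi)) +
  geom_boxplot()

# Zoom: all five values still feed the box plot statistics.
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 18))

# Scale limits: values outside 0-18 are dropped before the statistics are computed.
p_limits <- base_plot + scale_y_continuous(limits = c(0, 18))

# `middle` is the median line of the box in the computed layer data.
median_zoom <- layer_data(p_zoom)$middle
median_limits <- suppressWarnings(layer_data(p_limits)$middle)
```

Here the median is 3 when zooming and 2.5 when limiting the scale, so the two approaches can lead to different boxes.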
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.6.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-08-29
pandoc 3.3 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P bit 4.0.5 2022-11-15 [?] CRAN (R 4.3.0)
P bit64 4.0.5 2020-08-30 [?] CRAN (R 4.3.0)
cli 3.6.3 2024-06-21 [1] RSPM (R 4.4.0)
P colorspace 2.1-0 2023-01-23 [?] CRAN (R 4.3.0)
P crayon 1.5.3 2024-06-20 [?] CRAN (R 4.4.0)
P digest 0.6.35 2024-03-11 [?] CRAN (R 4.3.1)
P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.3.1)
P evaluate 0.24.0 2024-06-10 [?] CRAN (R 4.4.0)
P fansi 1.0.6 2023-12-08 [?] CRAN (R 4.3.1)
P farver 2.1.2 2024-05-13 [?] CRAN (R 4.3.3)
P fastmap 1.2.0 2024-05-15 [?] CRAN (R 4.4.0)
P forcats * 1.0.0 2023-01-29 [?] CRAN (R 4.3.0)
P generics 0.1.3 2022-07-05 [?] CRAN (R 4.3.0)
P ggplot2 * 3.5.1 2024-04-23 [?] CRAN (R 4.3.1)
P glue 1.7.0 2024-01-09 [?] CRAN (R 4.3.1)
P gtable 0.3.5 2024-04-22 [?] CRAN (R 4.3.1)
P here 1.0.1 2020-12-13 [?] CRAN (R 4.3.0)
P hms 1.1.3 2023-03-21 [?] CRAN (R 4.3.0)
P htmltools 0.5.8.1 2024-04-04 [?] CRAN (R 4.3.1)
P htmlwidgets 1.6.4 2023-12-06 [?] CRAN (R 4.3.1)
P jsonlite 1.8.8 2023-12-04 [?] CRAN (R 4.3.1)
P knitr 1.47 2024-05-29 [?] CRAN (R 4.4.0)
P labeling 0.4.3 2023-08-29 [?] CRAN (R 4.3.0)
P lifecycle 1.0.4 2023-11-07 [?] CRAN (R 4.3.1)
P lubridate * 1.9.3 2023-09-27 [?] CRAN (R 4.3.1)
P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.3.0)
P munsell 0.5.1 2024-04-01 [?] CRAN (R 4.3.1)
P pillar 1.9.0 2023-03-22 [?] CRAN (R 4.3.0)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.3.0)
P purrr * 1.0.2 2023-08-10 [?] CRAN (R 4.3.0)
P R6 2.5.1 2021-08-19 [?] CRAN (R 4.3.0)
P readr * 2.1.5 2024-01-10 [?] CRAN (R 4.3.1)
renv 1.0.7 2024-04-11 [1] CRAN (R 4.4.0)
P rlang 1.1.4 2024-06-04 [?] CRAN (R 4.3.3)
P rmarkdown 2.27 2024-05-17 [?] CRAN (R 4.4.0)
P rprojroot 2.0.4 2023-11-05 [?] CRAN (R 4.3.1)
P rstudioapi 0.16.0 2024-03-24 [?] CRAN (R 4.3.1)
P scales * 1.3.0.9000 2024-05-07 [?] Github (r-lib/scales@c0f79d3)
P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.3.0)
P stringi 1.8.4 2024-05-06 [?] CRAN (R 4.3.1)
P stringr * 1.5.1 2023-11-14 [?] CRAN (R 4.3.1)
P tibble * 3.2.1 2023-03-20 [?] CRAN (R 4.3.0)
P tidyr * 1.3.1 2024-01-24 [?] CRAN (R 4.3.1)
P tidyselect 1.2.1 2024-03-11 [?] CRAN (R 4.3.1)
P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.3.0)
P timechange 0.3.0 2024-01-18 [?] CRAN (R 4.3.1)
P tzdb 0.4.0 2023-05-12 [?] CRAN (R 4.3.0)
P utf8 1.2.4 2023-10-22 [?] CRAN (R 4.3.1)
P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.3.1)
P vroom 1.6.5 2023-12-05 [?] CRAN (R 4.3.1)
withr 3.0.1 2024-07-31 [1] RSPM (R 4.4.0)
P xfun 0.45 2024-06-16 [?] CRAN (R 4.4.0)
P yaml 2.3.8 2023-12-11 [?] CRAN (R 4.3.1)
[1] /Users/soltoffbc/Projects/info-5001/course-site/renv/library/macos/R-4.4/aarch64-apple-darwin20
[2] /Users/soltoffbc/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────