Do not throw away your shot: Text mining and Hamilton
Before TikTok came for Lin-Manuel Miranda, there was Hamilton.
One of the nice things about the musical is that it is sung-through, so the lyrics contain essentially all of the dialogue. This provides an interesting opportunity to use the tidytext package to analyze the lyrics.
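library(tidyverse)
library(tidytext)
library(ggtext) # provides element_markdown() for the plot titles
hamilton <- read_csv(file = "data/hamilton.csv") |>
mutate(song_name = parse_factor(song_name))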
glimpse(hamilton)
Rows: 3,532
Columns: 5
$ song_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ song_name <fct> "Alexander Hamilton", "Alexander Hamilton", "Alexander Ham…
$ line_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ line <chr> "How does a bastard, orphan, son of a whore and a", "Scots…
$ speaker <chr> "Aaron Burr", "Aaron Burr", "Aaron Burr", "Aaron Burr", "J…
Along with the lyrics, we also know the singer (speaker) of each line of dialogue. This will be helpful if we want to perform analysis on a subset of singers.
Convert to tidytext format
Currently, hamilton is stored as one-row-per-line of lyrics. The definition of a single “line” is somewhat arbitrary. For substantive analysis, we will convert the corpus to a tidy-text data frame of one-row-per-token. Initially, we will use unnest_tokens() to tokenize all unigrams.
hamilton_tidy <- unnest_tokens(tbl = hamilton, output = word, input = line)
hamilton_tidy
# A tibble: 21,142 × 5
song_number song_name line_num speaker word
<dbl> <fct> <dbl> <chr> <chr>
1 1 Alexander Hamilton 1 Aaron Burr how
2 1 Alexander Hamilton 1 Aaron Burr does
3 1 Alexander Hamilton 1 Aaron Burr a
4 1 Alexander Hamilton 1 Aaron Burr bastard
5 1 Alexander Hamilton 1 Aaron Burr orphan
6 1 Alexander Hamilton 1 Aaron Burr son
7 1 Alexander Hamilton 1 Aaron Burr of
8 1 Alexander Hamilton 1 Aaron Burr a
9 1 Alexander Hamilton 1 Aaron Burr whore
10 1 Alexander Hamilton 1 Aaron Burr and
# ℹ 21,132 more rows
Remember that by default, unnest_tokens() automatically converts all text to lowercase and strips out punctuation.
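To see those defaults in action, here is a minimal sketch on a single made-up line. The toy tibble and its text are illustrative assumptions, not part of the Hamilton data.

```r
library(tibble)
library(tidytext)

# a one-row toy corpus; the text is made up for illustration
toy <- tibble(line_num = 1, line = "Talk less, Smile MORE!")

# one row per word, lowercased, with punctuation dropped
toy_tidy <- unnest_tokens(tbl = toy, output = word, input = line)
toy_tidy$word
```

If the original capitalization matters for an analysis, unnest_tokens() accepts to_lower = FALSE.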
Length of songs by words
An initial check reveals the length of each song in terms of the number of words in its lyrics.
hamilton_tidy |>
mutate(song_name = fct_rev(song_name)) |>
ggplot(mapping = aes(y = song_name)) +
geom_bar() +
labs(
title = "Length of songs in Hamilton",
x = "Song length (in words)",
y = NULL,
caption = "Source: Genius"
)
Measured by number of words, “Non-Stop” is the longest song in the musical.
Stop words
Of course, not all words are equally important. Consider the 10 most frequent words in the lyrics:
hamilton_tidy |>
count(word, sort = TRUE)
# A tibble: 2,929 × 2
word n
<chr> <int>
1 the 848
2 i 639
3 you 578
4 to 544
5 a 471
6 and 383
7 in 317
8 it 294
9 of 274
10 my 259
# ℹ 2,919 more rows
These are not particularly informative. We can load a list of stop words (the stop_words data frame that ships with tidytext) and then remove them via anti_join().
# get a set of stop words
stop_words
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
# remove stop words
hamilton_tidy <- anti_join(x = hamilton_tidy, y = stop_words)
hamilton_tidy |>
count(word) |>
slice_max(n = 20, order_by = n) |>
mutate(word = fct_reorder(.f = word, .x = n)) |>
ggplot(aes(x = n, y = word)) +
geom_col() +
labs(
title = "Frequency of Hamilton lyrics",
x = NULL,
y = NULL
)
Now the words seem more relevant to the specific story being told in the musical.
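The same filtering logic can be sketched on a toy example. Both the token table and the two-word stop list below are made up for illustration; the real analysis uses the much larger stop_words table.

```r
library(dplyr)
library(tibble)

# made-up tokens and a hypothetical two-word stop list
tokens <- tibble(word = c("my", "shot", "my", "shot", "the", "room", "the", "the"))
my_stops <- tibble(word = c("my", "the"))

kept <- tokens |>
  # keep only rows whose word has no match in the stop list
  anti_join(y = my_stops, by = "word") |>
  # tally what is left, most frequent first
  count(word, sort = TRUE)
kept
```

Here “the” is the most frequent raw token, but after the anti-join only “shot” and “room” remain. Naming the join column with by also avoids the message that anti_join() prints when it has to guess the key.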
Words used most by each cast member
Since we know which singer performs each line, we can examine the relative significance of different words to different characters. Term frequency-inverse document frequency (tf-idf) is a simple metric for measuring how important a word is to one document relative to the rest of a corpus. Here each character's lyrics serve as one document, and we calculate the top ten words for each member of the principal cast.
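As a rough sketch of what the metric computes, treat each speaker as one document. Term frequency is a word's share of that speaker's words, and inverse document frequency is the natural log of the number of speakers divided by the number of speakers who use the word. The counts below are made up for illustration, and the calculation is done by hand in base R rather than with bind_tf_idf().

```r
# made-up counts: one row per (speaker, word) pair
counts <- data.frame(
  speaker = c("Hamilton", "Hamilton", "Burr", "Burr"),
  word    = c("shot", "wait", "wait", "room"),
  n       = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# term frequency: the word's share of all words from that speaker
counts$tf <- counts$n / ave(counts$n, counts$speaker, FUN = sum)

# inverse document frequency: ln(speakers / speakers using the word)
n_speakers <- length(unique(counts$speaker))
speakers_using <- as.numeric(table(counts$word)[counts$word])
counts$idf <- log(n_speakers / speakers_using)

counts$tf_idf <- counts$tf * counts$idf
counts
```

In this toy corpus “wait” is used by both speakers, so its idf is log(2 / 2) = 0 and it scores zero for each of them. Words that only one speaker uses get the highest scores.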
# principal cast via Wikipedia
principal_cast <- c(
"Hamilton", "Eliza", "Burr", "Angelica", "Washington",
"Lafayette", "Jefferson", "Mulligan", "Madison",
"Laurens", "Philip", "Peggy", "Maria", "King George"
)
# calculate tf-idf scores for words sung by the principal cast
hamilton_tf_idf <- hamilton_tidy |>
filter(speaker %in% principal_cast) |>
mutate(speaker = parse_factor(x = speaker, levels = principal_cast)) |>
count(speaker, word) |>
bind_tf_idf(term = word, document = speaker, n = n)
# visualize the top N terms per character by tf-idf score
hamilton_tf_idf |>
group_by(speaker) |>
slice_max(n = 10, order_by = tf_idf, with_ties = FALSE) |>
ggplot(mapping = aes(x = tf_idf, y = word)) +
geom_col() +
labs(
title = "Most important words in *Hamilton*",
subtitle = "Principal cast only",
x = "tf-idf",
y = NULL,
caption = "Source: Genius"
) +
facet_wrap(facets = vars(speaker), scales = "free") +
theme(plot.title = element_markdown())
The plot is not very readable with the words sorted alphabetically. Let’s try sorting the words from highest to lowest tf-idf score.
# visualize the top N terms per character by tf-idf score
hamilton_tf_idf |>
group_by(speaker) |>
slice_max(n = 10, order_by = tf_idf, with_ties = FALSE) |>
# create word as a factor column ordered by tf-idf score
mutate(word = fct_reorder(.f = word, .x = tf_idf)) |>
ggplot(mapping = aes(x = tf_idf, y = word)) +
geom_col() +
labs(
title = "Most important words in *Hamilton*",
subtitle = "Principal cast only",
x = "tf-idf",
y = NULL,
caption = "Source: Genius"
) +
facet_wrap(facets = vars(speaker), scales = "free") +
theme(plot.title = element_markdown())
This still does not look right. The problem is that some tokens appear in multiple facets but with different tf-idf scores (and therefore different positions). We need to order the rows within each facet independently, but ggplot2 does not support this directly because a factor has a single set of levels shared across all facets. Instead we can use tidytext::reorder_within() and tidytext::scale_y_reordered().
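A small sketch of why this works: reorder_within() pastes the facet value onto each word (using "___" as the default separator), so the same word in two facets becomes two distinct factor levels that can be ordered separately. scale_y_reordered() then strips that suffix from the axis labels. The words and scores below are made up.

```r
library(tidytext)

# the same words appear for two speakers with different (made-up) scores
word    <- c("shot", "wait", "shot", "wait")
speaker <- c("Hamilton", "Hamilton", "Burr", "Burr")
tf_idf  <- c(0.9, 0.1, 0.2, 0.8)

reordered <- reorder_within(x = word, by = tf_idf, within = speaker)

# four distinct levels, one per word-speaker pair, ordered by tf_idf
levels(reordered)
```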
hamilton_tf_idf |>
group_by(speaker) |>
slice_max(n = 10, order_by = tf_idf, with_ties = FALSE) |>
# order words within each speaker; this also resolves ambiguities
# when the same word appears for different characters
ungroup() |>
mutate(word = reorder_within(x = word, by = tf_idf, within = speaker)) |>
ggplot(mapping = aes(x = tf_idf, y = word)) +
geom_col() +
scale_y_reordered() +
labs(
title = "Most important words in *Hamilton*",
subtitle = "Principal cast only",
x = "tf-idf",
y = NULL,
caption = "Source: Genius"
) +
facet_wrap(facets = vars(speaker), scales = "free") +
theme(plot.title = element_markdown())
Again, some expected results stick out. Hamilton is always singing about not throwing away his shot, Eliza is helplessly in love with Alexander, while Burr regrets not being “in the room where it happens”. And don’t forget King George’s love songs to his wayward children.
Sentiment analysis
Sentiment analysis uses the text of the lyrics to classify content as positive or negative. Dictionary-based methods use pre-generated lexicons of words independently coded as positive/negative. We can combine one of these dictionaries with the Hamilton tidy-text data frame using inner_join() to identify words with sentimental affect, and further analyze trends.
Here we use the AFINN dictionary, which scores 2,477 words on an integer scale of \([-5, +5]\).
# afinn dictionary
get_sentiments(lexicon = "afinn")
# A tibble: 2,477 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# ℹ 2,467 more rows
hamilton_afinn <- hamilton_tidy |>
# join with sentiment dictionary
inner_join(y = get_sentiments(lexicon = "afinn")) |>
# create row id and cumulative sentiment over the entire corpus
mutate(
cum_sent = cumsum(value),
id = row_number()
)
hamilton_afinn
# A tibble: 1,109 × 8
song_number song_name line_num speaker word value cum_sent id
<dbl> <fct> <dbl> <chr> <chr> <dbl> <dbl> <int>
1 1 Alexander Hamilton 1 Aaron Burr bast… -5 -5 1
2 1 Alexander Hamilton 1 Aaron Burr whore -4 -9 2
3 1 Alexander Hamilton 2 Aaron Burr forg… -1 -10 3
4 1 Alexander Hamilton 4 Aaron Burr hero 2 -8 4
5 1 Alexander Hamilton 7 John Laur… smar… 2 -6 5
6 1 Alexander Hamilton 11 Thomas Je… stru… -2 -8 6
7 1 Alexander Hamilton 12 Thomas Je… long… -1 -9 7
8 1 Alexander Hamilton 13 Thomas Je… steal -2 -11 8
9 1 Alexander Hamilton 17 James Mad… pain -2 -13 9
10 1 Alexander Hamilton 18 Burr insa… -2 -15 10
# ℹ 1,099 more rows
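The join-and-accumulate step can be sketched on a toy example. The mini-lexicon below is hypothetical; its scores are copied from the AFINN rows shown on this page, and the token stream is made up.

```r
library(dplyr)
library(tibble)

# hypothetical mini-lexicon with scores copied from the AFINN output on this page
mini_lexicon <- tibble(
  word  = c("bastard", "whore", "hero", "helpless", "satisfied"),
  value = c(-5, -4, 2, -2, 2)
)

# made-up token stream, in order of appearance
tokens <- tibble(word = c("bastard", "orphan", "whore", "hero", "scholar"))

scored <- tokens |>
  # words missing from the lexicon ("orphan", "scholar") drop out of the result
  inner_join(y = mini_lexicon, by = "word") |>
  # running total of sentiment plus a row id for plotting
  mutate(cum_sent = cumsum(value), id = row_number())
scored
```

Note that the inner join silently discards every word the lexicon does not cover, which is why the scored table is so much shorter than the full tidy-text data frame.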
First, we can examine the sentiment of each song individually by calculating the average sentiment of each word in the song.
# sentiment by song
hamilton_afinn |>
group_by(song_name) |>
summarize(sent = mean(value)) |>
mutate(song_name = fct_rev(song_name)) |>
ggplot(mapping = aes(x = sent, y = song_name, fill = sent)) +
geom_col() +
scale_fill_viridis_c() +
labs(
title = "Positive/negative sentiment in *Hamilton*",
subtitle = "By song",
x = "Average sentiment",
y = NULL,
fill = "Average\nsentiment",
caption = "Source: Genius"
) +
theme(
plot.title = element_markdown(),
legend.position = "none"
)
Again, the general themes of the songs come across in this analysis. “Alexander Hamilton” introduces Hamilton’s tragic backstory and difficult circumstances before emigrating to New York. “Dear Theodosia” is a love letter from Burr and Hamilton, promising to make the world a better place for their respective children.
However, this also illustrates some problems with dictionary-based sentiment analysis. Consider the back-to-back songs “Helpless” and “Satisfied”.
“Helpless” depicts Eliza and Alexander falling in love with one another and getting married, while “Satisfied” recounts these same events from the perspective of Eliza’s sister Angelica, who suppresses her own feelings for Hamilton out of a sense of duty to her sister. From the perspective of the listener, “Helpless” is the far more positive song of the pair. Why are they reversed in the textual analysis?
get_sentiments(lexicon = "afinn") |>
filter(word %in% c("helpless", "satisfied"))
# A tibble: 2 × 2
word value
<chr> <dbl>
1 helpless -2
2 satisfied 2
Herein lies the problem with dictionary-based methods. The AFINN lexicon codes “helpless” as a negative term and “satisfied” as a positive term. In isolation this makes sense, but in the context of the musical Eliza is clearly “helplessly” in love, while Angelica will in fact never be “satisfied” because she cannot be with Alexander. A dictionary-based sentiment classification scores each word independently of its context, so it will always miss these nuances in language.
We could also examine the general disposition of each speaker based on the sentiment of their lyrics. Consider the principal cast below:
hamilton_afinn |>
filter(speaker %in% principal_cast) |>
# calculate average sentiment by character with standard error
group_by(speaker) |>
summarize(
sent = mean(value),
se = sd(value) / sqrt(n())
) |>
# generate plot sorted from positive to negative
ggplot(mapping = aes(y = fct_reorder(speaker, sent), x = sent, fill = sent)) +
geom_pointrange(mapping = aes(
xmin = sent - 2 * se,
xmax = sent + 2 * se
)) +
labs(
title = "Positive/negative sentiment in *Hamilton*",
subtitle = "By speaker",
x = "Average sentiment",
y = NULL,
caption = "Source: Genius"
) +
theme(
plot.title = element_markdown(),
legend.position = "none"
)
Given his generally neutral sentiment, Aaron Burr clearly follows his own guidance.
Also, can we please note Peggy’s general pessimism?
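The interval drawn for each speaker is the mean plus or minus two standard errors, where the standard error of a mean is the standard deviation divided by the square root of the number of observations. A minimal base R sketch, using made-up scores for one hypothetical speaker:

```r
# made-up AFINN scores for one hypothetical speaker
scores <- c(2, -1, 3, -2, 1, 2, -3, 1, 2)

n   <- length(scores)
avg <- mean(scores)
se  <- sd(scores) / sqrt(n) # standard error of the mean

# approximate 95% interval: mean plus or minus two standard errors
interval <- c(lower = avg - 2 * se, mean = avg, upper = avg + 2 * se)
interval
```

Speakers with few scored words get wide intervals, so differences between minor characters should be read with caution.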
Tracking the cumulative sentiment across the entire musical, it’s easy to identify the high and low points.
# get first row for each song
hamilton_songs <- hamilton_afinn |>
group_by(song_number) |>
filter(id == min(id)) |>
select(song_number, id, song_name)
hamilton_songs
# A tibble: 45 × 3
# Groups: song_number [45]
song_number id song_name
<dbl> <int> <fct>
1 1 1 Alexander Hamilton
2 2 26 Aaron Burr, Sir
3 3 46 My Shot
4 4 84 The Story of Tonight
5 5 96 The Schuyler Sisters
6 6 123 Farmer Refuted
7 7 150 You’ll Be Back
8 8 169 Right Hand Man
9 9 205 A Winter’s Ball
10 10 216 Helpless
# ℹ 35 more rows
ggplot(data = hamilton_afinn, mapping = aes(x = id, y = cum_sent)) +
geom_line() +
# label the start of each song
scale_x_reverse(
breaks = pull(.data = hamilton_songs, id),
labels = pull(.data = hamilton_songs, song_name)
) +
labs(
title = "Positive/negative sentiment in *Hamilton*",
x = NULL,
y = "Cumulative sentiment",
caption = "Source: Genius"
) +
# transpose to be able to fit song titles on the graph
coord_flip() +
theme(
panel.grid.minor.y = element_blank(),
plot.title = element_markdown()
)
After the initial drop from “Alexander Hamilton”, the next peaks in the graph show several positive events in Hamilton’s life: meeting his friends, becoming Washington’s secretary, and meeting and marrying Eliza. The musical experiences a drop in tone during the rough years of the revolution and Hamilton’s dismissal back to New York, then rebounds as the revolutionaries close in on victory at Yorktown. Hamilton’s challenges as a member of Washington’s cabinet and his rivalry with Jefferson are captured in the up-and-down swings that follow. The line rises with “One Last Time” and Hamilton writing Washington’s Farewell Address, drops once again with “Hurricane” and the revelation of Hamilton’s affair, and rises as Alexander and Eliza reconcile, before finally descending once more upon Hamilton’s death in his duel with Burr.
Pairs of words
Finally, we can examine the collocation of pairs of words to look for common usage.
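To see what bigram tokenization yields before applying it to the whole musical, here is a minimal sketch on a single line. The one-row toy tibble is an illustrative assumption.

```r
library(tibble)
library(tidyr)
library(tidytext)

# a single line as a toy corpus
toy <- tibble(line = "I am not throwing away my shot")

toy_pairs <- toy |>
  # overlapping two-word windows: "i am", "am not", "not throwing", ...
  unnest_tokens(output = bigram, input = line, token = "ngrams", n = 2) |>
  # split each bigram into its two component words
  separate(col = bigram, into = c("word1", "word2"), sep = " ")
toy_pairs
```

A line of seven words produces six overlapping pairs. Splitting each pair into word1 and word2 is what makes it possible to drop pairs in which either word is a stop word.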
# calculate all pairs of words in the musical
hamilton_pair <- hamilton |>
unnest_tokens(
output = word,
input = line,
token = "ngrams",
n = 2
) |>
separate(
col = word,
into = c("word1", "word2"),
sep = " "
) |>
filter(
!word1 %in% stop_words$word,
!word2 %in% stop_words$word
) |>
drop_na(word1, word2) |>
count(word1, word2, sort = TRUE)
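# graph_from_data_frame() comes from igraph; ggraph() below comes from ggraph
library(igraph)
library(ggraph)
# filter for only relatively common combinations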
bigram_graph <- hamilton_pair |>
filter(n > 3) |>
graph_from_data_frame()
# draw a network graph
set.seed(1776) # New York City
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), show.legend = FALSE, alpha = .5) +
geom_node_point(color = "#0052A5", size = 3, alpha = .5) +
geom_node_text(aes(label = name), vjust = 1.5) +
ggtitle("Word Network in Lin-Manuel Miranda's *Hamilton*") +
theme_void() +
theme(plot.title = element_markdown())
It’s apparent there are several major themes detected through this approach, including the Hamilton/Jefferson relationship, “Aaron Burr, sir”, Philip’s song with his mother (un, deux, trois, quatre, …), the rising up of the colonies, and those young, scrappy, and hungry men.
Acknowledgments
- This page is derived in part from A Sentiment Analysis of Hamilton: The broom Where it Happens / When are these #rcatladies gonna rise up? and licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
- This page is derived in part from Alexander Hamilton: The Breakdown.
- This page is derived in part from Tidytext Analysis and licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.6.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-10-08
pandoc 3.4 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P backports 1.5.0 2024-05-23 [?] CRAN (R 4.4.0)
P broom 1.0.6 2024-05-17 [?] CRAN (R 4.4.0)
P cachem 1.1.0 2024-05-16 [?] CRAN (R 4.4.0)
cli 3.6.3 2024-06-21 [1] RSPM (R 4.4.0)
P colorspace 2.1-0 2023-01-23 [?] CRAN (R 4.3.0)
P digest 0.6.35 2024-03-11 [?] CRAN (R 4.3.1)
P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.3.1)
P evaluate 0.24.0 2024-06-10 [?] CRAN (R 4.4.0)
P fansi 1.0.6 2023-12-08 [?] CRAN (R 4.3.1)
P farver 2.1.2 2024-05-13 [?] CRAN (R 4.3.3)
P fastmap 1.2.0 2024-05-15 [?] CRAN (R 4.4.0)
P forcats * 1.0.0 2023-01-29 [?] CRAN (R 4.3.0)
P fs 1.6.4 2024-04-25 [?] CRAN (R 4.4.0)
P generics 0.1.3 2022-07-05 [?] CRAN (R 4.3.0)
P ggforce 0.4.2 2024-02-19 [?] CRAN (R 4.4.0)
P ggplot2 * 3.5.1 2024-04-23 [?] CRAN (R 4.3.1)
P ggraph * 2.2.1 2024-03-07 [?] CRAN (R 4.4.0)
P ggrepel 0.9.5 2024-01-10 [?] CRAN (R 4.3.1)
P ggtext * 0.1.2 2022-09-16 [?] CRAN (R 4.3.0)
P glue 1.7.0 2024-01-09 [?] CRAN (R 4.3.1)
P graphlayouts 1.1.1 2024-03-09 [?] CRAN (R 4.4.0)
P gridExtra 2.3 2017-09-09 [?] CRAN (R 4.3.0)
P gridtext 0.1.5 2022-09-16 [?] CRAN (R 4.3.0)
P gtable 0.3.5 2024-04-22 [?] CRAN (R 4.3.1)
P here 1.0.1 2020-12-13 [?] CRAN (R 4.3.0)
P hms 1.1.3 2023-03-21 [?] CRAN (R 4.3.0)
P htmltools 0.5.8.1 2024-04-04 [?] CRAN (R 4.3.1)
P htmlwidgets 1.6.4 2023-12-06 [?] CRAN (R 4.3.1)
P igraph * 2.0.3 2024-03-13 [?] CRAN (R 4.4.0)
P janeaustenr 1.0.0 2022-08-26 [?] CRAN (R 4.3.0)
P jsonlite 1.8.8 2023-12-04 [?] CRAN (R 4.3.1)
P knitr 1.47 2024-05-29 [?] CRAN (R 4.4.0)
P lattice 0.22-6 2024-03-20 [?] CRAN (R 4.4.0)
P lifecycle 1.0.4 2023-11-07 [?] CRAN (R 4.3.1)
P lubridate * 1.9.3 2023-09-27 [?] CRAN (R 4.3.1)
P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.3.0)
P MASS 7.3-61 2024-06-13 [?] CRAN (R 4.4.0)
P Matrix 1.7-0 2024-03-22 [?] CRAN (R 4.4.0)
P memoise 2.0.1 2021-11-26 [?] CRAN (R 4.3.0)
P munsell 0.5.1 2024-04-01 [?] CRAN (R 4.3.1)
P pillar 1.9.0 2023-03-22 [?] CRAN (R 4.3.0)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.3.0)
P polyclip 1.10-6 2023-09-27 [?] CRAN (R 4.3.1)
P purrr * 1.0.2 2023-08-10 [?] CRAN (R 4.3.0)
P R6 2.5.1 2021-08-19 [?] CRAN (R 4.3.0)
P rappdirs 0.3.3 2021-01-31 [?] CRAN (R 4.3.0)
P Rcpp 1.0.12 2024-01-09 [?] CRAN (R 4.3.1)
P readr * 2.1.5 2024-01-10 [?] CRAN (R 4.3.1)
renv 1.0.7 2024-04-11 [1] CRAN (R 4.4.0)
P rlang 1.1.4 2024-06-04 [?] CRAN (R 4.3.3)
P rmarkdown 2.27 2024-05-17 [?] CRAN (R 4.4.0)
P rprojroot 2.0.4 2023-11-05 [?] CRAN (R 4.3.1)
P scales 1.3.0.9000 2024-05-07 [?] Github (r-lib/scales@c0f79d3)
P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.3.0)
P SnowballC 0.7.1 2023-04-25 [?] CRAN (R 4.3.0)
P stringi 1.8.4 2024-05-06 [?] CRAN (R 4.3.1)
P stringr * 1.5.1 2023-11-14 [?] CRAN (R 4.3.1)
P textdata 0.4.5 2024-05-28 [?] CRAN (R 4.4.0)
P tibble * 3.2.1 2023-03-20 [?] CRAN (R 4.3.0)
P tidygraph 1.3.1 2024-01-30 [?] CRAN (R 4.4.0)
P tidyr * 1.3.1 2024-01-24 [?] CRAN (R 4.3.1)
P tidyselect 1.2.1 2024-03-11 [?] CRAN (R 4.3.1)
P tidytext * 0.4.2 2024-04-10 [?] CRAN (R 4.4.0)
P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.3.0)
P timechange 0.3.0 2024-01-18 [?] CRAN (R 4.3.1)
P tokenizers 0.3.0 2022-12-22 [?] CRAN (R 4.3.0)
P tweenr 2.0.3 2024-02-26 [?] CRAN (R 4.3.1)
P tzdb 0.4.0 2023-05-12 [?] CRAN (R 4.3.0)
P utf8 1.2.4 2023-10-22 [?] CRAN (R 4.3.1)
P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.3.1)
P viridis 0.6.5 2024-01-29 [?] CRAN (R 4.4.0)
P viridisLite 0.4.2 2023-05-02 [?] CRAN (R 4.3.0)
P widyr * 0.1.5 2022-09-13 [?] CRAN (R 4.3.0)
withr 3.0.1 2024-07-31 [1] RSPM (R 4.4.0)
P xfun 0.45 2024-06-16 [?] CRAN (R 4.4.0)
P xml2 1.3.6 2023-12-04 [?] CRAN (R 4.3.1)
P yaml 2.3.8 2023-12-11 [?] CRAN (R 4.3.1)
[1] /Users/soltoffbc/Projects/info-5001/course-site/renv/library/macos/R-4.4/aarch64-apple-darwin20
[2] /Users/soltoffbc/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────