library(tidyverse)
library(tidytext)
library(taylor)
library(tayloRswift)
library(ggridges)
library(scales)
theme_set(theme_minimal(base_size = 13))
Text analysis (Taylor’s Version)
Taylor Swift is one of the most recognizable and popular recording artists on the planet. She is also a prolific songwriter, having written or co-written every song on each of her nine studio albums. Currently she is smashing records on her Eras concert tour.
In this application exercise we will use the taylor package to analyze the lyrics of Taylor Swift’s songs. The package contains a data frame taylor_albums with information about each of her studio albums, including the release date, the number of tracks, and the album cover art. The package also contains a data frame taylor_album_songs with the lyrics of each song from her official studio albums.[1]
[1] This excludes singles released separately from an album as well as non-Taylor-owned albums that have a Taylor-owned alternative (e.g., Fearless is excluded in favor of Fearless (Taylor’s Version)).
Import Taylor Swift lyrics
We can load the relevant data files directly from the taylor package.
library(taylor)
data("taylor_album_songs")
data("taylor_albums")
taylor_album_songs
taylor_albums
Convert to tidytext format
Currently, taylor_album_songs
is stored as one-row-song, with the lyrics nested in a list-column where each element is a tibble with one-row-per-line. The definition of a single “line” is somewhat arbitrary. For substantial analysis, we will convert the corpus to a tidy-text data frame of one-row-per-token. Initially, we will use unnest_tokens()
to tokenize all unigrams.
Demonstration: Convert taylor_album_songs to a tidy-text data frame of one-row-per-token.
# tokenize taylor lyrics
taylor_lyrics <- taylor_album_songs |>
  # select relevant columns
  select(album_name, track_number, track_name, lyrics) |>
  # factor albums by release date
  mutate(album_name = fct(x = album_name, levels = taylor_albums$album_name)) |>
  # unnest the list-column to one-row-per-song-per-line
  unnest(col = lyrics) |>
  # now tokenize the lyrics
  ...
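To see what tokenizing does before applying it to the full corpus, here is a minimal sketch on a toy data frame. The song title and lyric text are made up for illustration, and the column names (lyric, word) are assumptions chosen to mirror the one-row-per-line layout described above.

```r
library(tibble)
library(tidytext)

# Hypothetical two-line "song"; the text is invented for illustration
toy_lines <- tibble(
  track_name = c("Toy Song", "Toy Song"),
  line = c(1L, 2L),
  lyric = c("Hello, hello again", "We meet at midnight")
)

# One row per word: by default unnest_tokens() lowercases the text,
# strips punctuation, and drops the input column
toy_tokens <- unnest_tokens(toy_lines, output = word, input = lyric)
toy_tokens
```

Each of the seven words ends up in its own row, with the song-level columns repeated alongside it.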
Length of songs by words
An initial check reveals the length of each song in terms of the number of words in its lyrics.
taylor_lyrics |>
  count(album_name, track_number, track_name) |>
  ggplot(mapping = aes(x = n)) +
  geom_histogram() +
  labs(
    title = "Length of songs by Taylor Swift",
    x = "Song length (in words)",
    y = NULL,
    caption = "Source: {taylor}"
  )
Stop words
Generic stop words
Of course not all words are equally important. Consider the 10 most frequent words in the lyrics:
taylor_lyrics |>
  count(word, sort = TRUE)
These are not particularly informative. We can identify a list of stop words and then remove them via anti_join().
Demonstration: Remove stop words from the data frame and reevaluate the most frequent words in the lyrics.
# get a set of stop words
get_stopwords(source = "______")
# remove stop words
taylor_tidy <- anti_join(x = taylor_lyrics, y = ______)
taylor_tidy
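The mechanics of anti_join() are easy to check on a toy example. The sketch below uses a hand-made stop word list (an assumption for illustration; real lexicons are far longer) and keeps only the tokens that have no match in that list.

```r
library(dplyr)
library(tibble)

# Toy tokens, one row per word
toy_tokens <- tibble(
  word = c("the", "night", "is", "the", "dream", "of", "you")
)

# Hypothetical stop word list, invented for this example
toy_stop_words <- tibble(word = c("the", "is", "of", "you"))

# anti_join() keeps rows of x with no matching row in y
toy_clean <- anti_join(x = toy_tokens, y = toy_stop_words, by = "word")
toy_clean
```

Only "night" and "dream" survive, because every other token appears in the toy stop word list.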
taylor_tidy |>
  count(word) |>
  slice_max(n = 20, order_by = n) |>
  mutate(word = fct_reorder(.f = word, .x = n)) |>
  ggplot(aes(x = n, y = word)) +
  geom_col() +
  labs(
    title = "Frequency of tokens in Taylor Swift lyrics",
    x = "Number of occurrences",
    y = NULL,
    caption = "Source: {taylor}"
  )
Domain-specific stop words
While this takes care of generic stop words, we can also identify domain-specific stop words. For example, Taylor Swift’s lyrics are full of interjections and exclamations that are not particularly informative. We can identify these and remove them from the corpus.
# domain-specific stop words
# source: https://rpubs.com/RosieB/642806
taylor_stop_words <- c(
  "oh", "ooh", "eh", "ha", "mmm", "mm", "yeah", "ah",
  "hey", "eeh", "uuh", "uh", "la", "da", "di", "ra",
  "huh", "hu", "whoa", "gonna", "wanna", "gotta", "em"
)
taylor_tidy <- taylor_lyrics |>
  anti_join(get_stopwords(source = "smart")) |>
  filter(!word %in% taylor_stop_words)
taylor_tidy
taylor_tidy |>
  count(word, sort = TRUE)
Words most relevant to each album
Since we know the songs for each album, we can examine the relative significance of different words to different albums. Term frequency-inverse document frequency (tf-idf) is a simple metric for measuring how important a word is to one document (here, an album) relative to the rest of the corpus. Here let’s calculate the top ten words for each album.
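To make the metric concrete, here is a minimal sketch that computes tf-idf by hand on made-up word counts and compares the result with tidytext's bind_tf_idf(). The albums, words, and counts are hypothetical. The hand calculation assumes the usual definition: term frequency is a word's share of its document's tokens, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical word counts for two made-up "albums"
toy_counts <- tribble(
  ~album, ~word,  ~n,
  "A",    "love", 3,
  "A",    "rain", 1,
  "B",    "love", 2,
  "B",    "city", 2
)
n_albums <- n_distinct(toy_counts$album)

by_hand <- toy_counts |>
  # tf: share of the album's tokens taken up by this word
  mutate(tf = n / sum(n), .by = album) |>
  # idf: log(number of albums / number of albums containing the word)
  mutate(idf = log(n_albums / dplyr::n()), .by = word) |>
  mutate(tf_idf = tf * idf)

# Same quantities computed by tidytext
with_tidytext <- bind_tf_idf(toy_counts, term = word, document = album, n = n)

by_hand
with_tidytext
```

A word that appears in every album ("love" here) gets an idf of zero, so its tf-idf is zero no matter how often it is used. Words unique to one album get the highest scores.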
Demonstration: Calculate the top ten words for each album.
taylor_tf_idf <- taylor_tidy |>
  count(______, ______) |>
  bind_tf_idf(term = ______, document = ______, n = ______)
taylor_tf_idf
# visualize the top N terms per album by tf-idf score
taylor_tf_idf |>
  group_by(album_name) |>
  slice_max(n = 10, order_by = tf_idf, with_ties = FALSE) |>
  ggplot(mapping = aes(x = tf_idf, y = word)) +
  geom_col() +
  facet_wrap(facets = vars(album_name), scales = "free")
Sentiment analysis
Sentiment analysis utilizes the text of the lyrics to classify content as positive or negative. Dictionary-based methods use pre-generated lexicons of words independently coded as positive/negative. We can combine one of these dictionaries with the Taylor Swift tidy-text data frame using inner_join() to identify words with sentimental affect, and further analyze trends.
Demonstration: Import the AFINN sentiment dictionary and combine with taylor_tidy.
# join with sentiment dictionary, drop words which are not defined
taylor_afinn <- taylor_tidy |>
  inner_join(y = ______)
taylor_afinn
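The effect of joining tokens to a sentiment dictionary can be sketched without downloading a real lexicon. The example below uses a tiny hand-made lexicon laid out like AFINN (a word column and a numeric value column). The words and scores are assumptions for illustration only, not the actual dictionary.

```r
library(dplyr)
library(tibble)

# Toy tokens for two made-up songs
toy_tokens <- tribble(
  ~track_name, ~word,
  "Song 1",    "love",
  "Song 1",    "window",
  "Song 1",    "happy",
  "Song 2",    "hate",
  "Song 2",    "road"
)

# Hypothetical lexicon in the AFINN layout; scores are illustrative only
toy_lexicon <- tribble(
  ~word,   ~value,
  "love",  3,
  "happy", 3,
  "hate",  -3
)

# inner_join() keeps only the tokens that appear in the lexicon
toy_scored <- inner_join(x = toy_tokens, y = toy_lexicon, by = "word")
toy_scored
```

Words with no entry in the lexicon ("window", "road") are dropped, so any later summary is based only on the scored words.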
Sentimental affect of each song
Your turn: Visualize the sentiment of each song individually by calculating the average sentiment of each word in the song.
taylor_afinn |>
  summarize(sent = ______, .by = c(______, ______)) |>
  mutate(track_name = fct_reorder(.f = track_name, .x = sent)) |>
  ggplot(mapping = aes(x = sent, y = track_name, fill = sent)) +
  geom_col() +
  scale_fill_viridis_c() +
  labs(
    title = "Sentimental affect of Taylor Swift songs",
    x = "Average sentiment",
    y = NULL,
    caption = "Source: {taylor}"
  ) +
  theme(
    legend.position = "none",
    plot.title.position = "plot"
  )
Shake It Off
# what's up with shake it off?
# your code here
Sentimental affect of each album
We could also examine the general disposition of each album based on its overall positive/negative affect.
# errorbar plot
taylor_afinn |>
  # calculate average sentiment by album with standard error
  summarize(
    sent = ______,
    se = ______ / ______,
    .by = album_name
  ) |>
  # reverse album order for vertical plot
  mutate(album_name = fct_rev(f = album_name)) |>
  # generate plot
  ggplot(mapping = aes(y = album_name, x = sent)) +
  geom_pointrange(mapping = aes(
    xmin = sent - 2 * se,
    xmax = sent + 2 * se
  )) +
  labs(
    title = "Emotional affect in Taylor Swift albums",
    x = "Average sentiment",
    y = NULL,
    caption = "Source: {taylor}"
  )
# ggridge plot with tayloRswift color palette
# add code here
Notice that we are visualizing the albums in chronological order, but all the “Taylor’s Version” albums are re-recordings of albums she made early in her career. We can stan artists owning their own music, but what about her career arc? Has Taylor Swift gotten more positive or negative across her career?
In order to assess that, we have to go back to her original album recordings. We can do this by rebuilding the tidy-text data frame from the taylor_all_songs data frame, which contains all of her original songs.
taylor_all_tidy <- taylor_all_songs |>
  select(album_name, track_number, track_name, lyrics) |>
  unnest(col = lyrics) |>
  unnest_tokens(output = word, input = lyric) |>
  # stop word removal
  anti_join(get_stopwords(source = "smart")) |>
  filter(!word %in% taylor_stop_words) |>
  # filter to full studio albums
  semi_join(y = taylor_albums |>
    filter(!ep)) |>
  # exclude rereleases
  filter(!str_detect(string = album_name, pattern = "Taylor's Version")) |>
  # order albums by release date
  mutate(album_name = factor(x = album_name, levels = taylor_albums$album_name) |>
    fct_rev())
# errorbar plot
taylor_all_tidy |>
  # join with sentiment dictionary
  inner_join(y = get_sentiments(lexicon = "afinn")) |>
  # calculate average sentiment by album with standard error
  summarize(
    sent = ______,
    se = ______ / ______,
    .by = album_name
  ) |>
  # generate plot sorted from positive to negative
  ggplot(mapping = aes(y = album_name, x = sent)) +
  geom_pointrange(mapping = aes(
    xmin = sent - 2 * se,
    xmax = sent + 2 * se
  )) +
  labs(
    title = "Emotional affect in Taylor Swift albums",
    subtitle = "Original studio albums",
    x = "Average sentiment",
    y = NULL,
    caption = "Source: {taylor}"
  )
# ggridge plot with tayloRswift color palette
# add code here
Varying types of sentiment
tidytext includes multiple sentiment dictionaries for different types of sentiment. We can use the nrc dictionary to examine the different types of sentiment in Taylor Swift’s lyrics.
Your turn: Visualize the different types of sentiment in Taylor Swift’s lyrics, by album. Normalize by the number of (sentimental) words per album.
taylor_all_tidy |>
  # join with sentiment dictionary
  inner_join(y = ______) |>
  filter(sentiment != "positive", sentiment != "negative") |>
  count(______, ______) |>
  # normalize for number of tokens per album
  left_join(y = taylor_all_tidy |>
    count(album_name, name = "n_all")) |>
  mutate(n_pct = n / n_all) |>
  # reverse album order for vertical plot
  mutate(album_name = fct_rev(f = album_name)) |>
  ggplot(mapping = aes(x = n_pct, y = album_name)) +
  geom_col() +
  scale_x_continuous(labels = percent_format()) +
  facet_wrap(
    facets = vars(sentiment)
  ) +
  labs(
    title = "Sentimental affect (by type) in Taylor Swift albums",
    subtitle = "Original studio albums",
    x = "Percentage of tokens in the album",
    y = NULL,
    caption = "Source: {taylor}"
  )
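The normalization step in the pipeline above (counts per album and sentiment divided by the album's total token count) can be checked on a toy example. The albums, sentiment labels, and counts below are made up for illustration. The denominator follows the code above, which divides by all tokens in the album after stop word removal.

```r
library(dplyr)
library(tibble)

# Hypothetical counts of sentiment-bearing tokens per album
toy_sentiment_counts <- tribble(
  ~album, ~sentiment, ~n,
  "A",    "joy",      10,
  "A",    "fear",     5,
  "B",    "joy",      4,
  "B",    "fear",     4
)

# Hypothetical total token counts per album
toy_album_totals <- tribble(
  ~album, ~n_all,
  "A",    100,
  "B",    40
)

# Attach each album's total, then convert counts to shares
toy_shares <- toy_sentiment_counts |>
  left_join(y = toy_album_totals, by = "album") |>
  mutate(n_pct = n / n_all)
toy_shares
```

Album B has fewer "joy" tokens than album A in absolute terms, but the same share (10%), which is why normalizing matters when albums differ in length.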