Text analysis (Taylor’s Version)

Application exercise
library(tidyverse)
library(tidytext)
library(taylor)
library(tayloRswift)
library(ggridges)
library(scales)

theme_set(theme_minimal(base_size = 13))

Taylor Swift is one of the most recognizable and popular recording artists on the planet. She is also a prolific songwriter, having written or co-written every song on each of her nine studio albums. Currently she is smashing records on her Eras concert tour.

In this application exercise we will use the taylor package to analyze the lyrics of Taylor Swift’s songs. The package contains a data frame taylor_albums with information about each of her studio albums, including the release date, the number of tracks, and the album cover art. The package also contains a data frame taylor_album_songs with the lyrics of each song from her official studio albums.[1]

[1] This excludes singles released separately from an album as well as non-Taylor-owned albums that have a Taylor-owned alternative (e.g., Fearless is excluded in favor of Fearless (Taylor’s Version)).

Import Taylor Swift lyrics

We can load the relevant data files directly from the taylor package.

library(taylor)

data("taylor_album_songs")
data("taylor_albums")
taylor_album_songs
taylor_albums

Convert to tidytext format

Currently, taylor_album_songs is stored as one-row-per-song, with the lyrics nested in a list-column where each element is a tibble with one-row-per-line. The definition of a single “line” is somewhat arbitrary. For substantial analysis, we will convert the corpus to a tidy-text data frame of one-row-per-token. Initially, we will use unnest_tokens() to tokenize all unigrams.

Demonstration: Convert taylor_album_songs to a tidy-text data frame of one-row-per-token.

# tokenize taylor lyrics
taylor_lyrics <- taylor_album_songs |>
  # select relevant columns
  select(album_name, track_number, track_name, lyrics) |>
  # factor albums by release date
  mutate(album_name = fct(x = album_name, levels = taylor_albums$album_name)) |>
  # unnest the list-column to one-row-per-song-per-line
  unnest(cols = lyrics) |>
  # now tokenize the lyrics
  ...
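
The unnest-then-tokenize pattern can be tried on a toy data frame first. This is a minimal sketch: toy_songs and its made-up lines are assumptions for illustration, laid out to mimic the structure described above (one row per song, with a list-column of tibbles holding a lyric column).

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# toy corpus (made-up lines, not real lyrics): one row per song,
# with the lines stored in a list-column of tibbles
toy_songs <- tibble(
  track_name = c("Song A", "Song B"),
  lyrics = list(
    tibble(line = 1:2, lyric = c("Hello, hello world", "Goodbye world")),
    tibble(line = 1L, lyric = "Hello again")
  )
)

toy_tokens <- toy_songs |>
  # one row per song per line
  unnest(cols = lyrics) |>
  # one row per token; words are lowercased and punctuation is dropped by default
  unnest_tokens(output = word, input = lyric)

toy_tokens
```

Each row of toy_tokens keeps the song and line identifiers, so tokens can still be counted or grouped by song afterwards.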

Length of songs by words

An initial check reveals the length of each song in terms of the number of words in its lyrics.

taylor_lyrics |>
  count(album_name, track_number, track_name) |>
  ggplot(mapping = aes(x = n)) +
  geom_histogram() +
  labs(
    title = "Length of songs by Taylor Swift",
    x = "Song length (in words)",
    y = NULL,
    caption = "Source: {taylor}"
  )

Stop words

Generic stop words

Of course not all words are equally important. Consider the 10 most frequent words in the lyrics:

taylor_lyrics |>
  count(word, sort = TRUE)

These are not particularly informative. We can identify a list of stop words then remove them via anti_join().

Demonstration: Remove stop words from the data frame and reevaluate the most frequent words in the lyrics.

# get a set of stop words
get_stopwords(source = "______")

# remove stop words
taylor_tidy <- anti_join(x = taylor_lyrics, y = ______)
taylor_tidy

taylor_tidy |>
  count(word) |>
  slice_max(n = 20, order_by = n) |>
  mutate(word = fct_reorder(.f = word, .x = n)) |>
  ggplot(aes(x = n, y = word)) +
  geom_col() +
  labs(
    title = "Frequency of tokens in Taylor Swift lyrics",
    x = "Number of occurrences",
    y = NULL,
    caption = "Source: {taylor}"
  )
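
To see what anti_join() does in isolation, here is a small sketch on made-up tokens. The hand-written toy_stop_words list is an assumption for illustration only; the exercise itself draws its list from get_stopwords().

```r
library(dplyr)
library(tibble)

# toy tokens (made-up), one row per token
toy_tokens <- tibble(
  word = c("i", "love", "the", "rain", "and", "the", "rain", "loves", "me")
)

# hypothetical hand-written stop list, standing in for get_stopwords()
toy_stop_words <- tibble(word = c("i", "the", "and", "me"))

toy_tidy <- toy_tokens |>
  # keep only the rows whose word has no match in the stop list
  anti_join(y = toy_stop_words, by = "word")

# frequency table of the remaining tokens
toy_counts <- toy_tidy |>
  count(word, sort = TRUE)
toy_counts
```

Supplying by explicitly avoids the message that anti_join() prints when it has to guess the join column.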

Domain-specific stop words

While this takes care of generic stop words, we can also identify domain-specific stop words. For example, Taylor Swift’s lyrics are full of interjections and exclamations that are not particularly informative. We can identify these and remove them from the corpus.

# domain-specific stop words
# source: https://rpubs.com/RosieB/642806
taylor_stop_words <- c(
  "oh", "ooh", "eh", "ha", "mmm", "mm", "yeah", "ah",
  "hey", "eeh", "uuh", "uh", "la", "da", "di", "ra",
  "huh", "hu", "whoa", "gonna", "wanna", "gotta", "em"
)

taylor_tidy <- taylor_lyrics |>
  anti_join(get_stopwords(source = "smart")) |>
  filter(!word %in% taylor_stop_words)
taylor_tidy

taylor_tidy |>
  count(word, sort = TRUE)

Words most relevant to each album

Since we know the songs for each album, we can examine the relative significance of different words to different albums. Term frequency-inverse document frequency (tf-idf) is a simple metric for measuring how important a word is to one document relative to the rest of the corpus; here, each album is treated as a document. Let’s calculate the top ten words for each album.

Demonstration: Calculate the top ten words for each album.

taylor_tf_idf <- taylor_tidy |>
  count(______, ______) |>
  bind_tf_idf(term = ______, document = ______, n = ______)
taylor_tf_idf

# visualize the top N terms per album by tf-idf score
taylor_tf_idf |>
  group_by(album_name) |>
  slice_max(n = 10, order_by = tf_idf, with_ties = FALSE) |>
  ggplot(mapping = aes(x = tf_idf, y = word)) +
  geom_col() +
  facet_wrap(facets = vars(album_name), scales = "free")
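
To see what bind_tf_idf() computes, it can help to run it on counts small enough to check by hand. In this sketch the toy_counts table is made up for illustration. tidytext defines tf as a term's count divided by the total count in its document, and idf as the natural log of the number of documents divided by the number of documents containing the term.

```r
library(dplyr)
library(tibble)
library(tidytext)

# toy counts (made-up): one row per document-term pair
toy_counts <- tribble(
  ~album,    ~word,   ~n,
  "Album A", "love",  3,
  "Album A", "truck", 1,
  "Album B", "love",  2,
  "Album B", "city",  2
)

toy_tf_idf <- toy_counts |>
  bind_tf_idf(term = word, document = album, n = n)

toy_tf_idf
```

Because "love" appears in both toy albums, its idf is log(2 / 2) = 0 and so is its tf-idf, no matter how often it occurs. Words found in only one album get idf = log(2), which is why tf-idf surfaces terms that distinguish one album from the others.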

Sentiment analysis

Sentiment analysis utilizes the text of the lyrics to classify content as positive or negative. Dictionary-based methods use pre-generated lexicons of words independently coded as positive/negative. We can combine one of these dictionaries with the Taylor Swift tidy-text data frame using inner_join() to identify words with sentimental affect, and further analyze trends.

Demonstration: Import the AFINN sentiment dictionary and combine with taylor_tidy.

# join with sentiment dictionary, drop words which are not defined
taylor_afinn <- taylor_tidy |>
  inner_join(y = ______)
taylor_afinn
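
The effect of the inner_join() can be seen on a toy example. This is a sketch under stated assumptions: toy_lexicon is a hypothetical mini-dictionary with the same two columns AFINN uses (word and value), and its scores are made up for illustration rather than copied from the real lexicon.

```r
library(dplyr)
library(tibble)

# toy tokens (made-up), one row per token
toy_tidy <- tibble(
  track_name = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  word = c("love", "rain", "hate", "happy", "car")
)

# hypothetical mini-lexicon with AFINN's column layout (word, value)
toy_lexicon <- tibble(
  word = c("love", "hate", "happy"),
  value = c(3, -3, 3)
)

toy_sentiment <- toy_tidy |>
  # tokens without a lexicon entry ("rain", "car") are dropped
  inner_join(y = toy_lexicon, by = "word")

toy_sentiment
```

Only three of the five tokens survive the join, so any later average describes the scored words, not every word in the song.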

Sentimental affect of each song

Your turn: Visualize the sentiment of each song individually by calculating the average sentiment of the words in each song.

taylor_afinn |>
  summarize(sent = ______, .by = c(______, ______)) |>
  mutate(track_name = fct_reorder(.f = track_name, .x = sent)) |>
  ggplot(mapping = aes(x = sent, y = track_name, fill = sent)) +
  geom_col() +
  scale_fill_viridis_c() +
  labs(
    title = "Sentimental affect of Taylor Swift songs",
    x = "Average sentiment",
    y = NULL,
    caption = "Source: {taylor}"
  ) +
  theme(
    legend.position = "none",
    plot.title.position = "plot"
  )

Shake It Off

# what's up with shake it off?

# your code here

Sentimental affect of each album

We could also examine the general disposition of each album based on its overall positive/negative affect.

# errorbar plot
taylor_afinn |>
  # calculate average sentiment by album with standard error
  summarize(
    sent = ______,
    se = ______ / ______,
    .by = album_name
  ) |>
  # reverse album order for vertical plot
  mutate(album_name = fct_rev(f = album_name)) |>
  # generate plot
  ggplot(mapping = aes(y = album_name, x = sent)) +
  geom_pointrange(mapping = aes(
    xmin = sent - 2 * se,
    xmax = sent + 2 * se
  )) +
  labs(
    title = "Emotional affect in Taylor Swift albums",
    x = "Average sentiment",
    y = NULL,
    caption = "Source: {taylor}"
  )

# ggridge plot with tayloRswift color palette

# add code here
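
The point ranges above span two standard errors on either side of the mean. As a minimal sketch, assuming the usual standard error of the mean (the standard deviation divided by the square root of the number of observations), the same summary can be computed on made-up scores. The toy_scores data and the lower and upper column names are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# toy word-level sentiment scores (made-up)
toy_scores <- tibble(
  album = c("A", "A", "A", "A", "B", "B", "B", "B"),
  value = c(1, 3, -1, 1, -2, -2, 2, -2)
)

toy_summary <- toy_scores |>
  summarize(
    sent = mean(value),
    # standard error of the mean: sd / sqrt(number of scored words)
    se = sd(value) / sqrt(n()),
    .by = album
  ) |>
  # interval endpoints matching sent - 2 * se and sent + 2 * se
  mutate(lower = sent - 2 * se, upper = sent + 2 * se)

toy_summary
```

Albums with fewer scored words get wider intervals, which is worth keeping in mind when comparing albums with different numbers of scored words.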

Notice that we are visualizing the albums in chronological order, but all the “Taylor’s Version” albums are re-recordings of albums she made early in her career. We can stan artists owning their own music, but what about her career arc? Has Taylor Swift gotten more positive or negative across her career?

In order to assess that, we have to go back to her original album recordings. We can do this by starting from the taylor_all_songs data frame, which includes her original recordings, and repeating the tokenization and stop word removal before joining with the sentiment dictionary.

taylor_all_tidy <- taylor_all_songs |>
  select(album_name, track_number, track_name, lyrics) |>
  unnest(cols = lyrics) |>
  unnest_tokens(output = word, input = lyric) |>
  # stop word removal
  anti_join(get_stopwords(source = "smart")) |>
  filter(!word %in% taylor_stop_words) |>
  # filter to full studio albums
  semi_join(y = taylor_albums |>
    filter(!ep)) |>
  # exclude rereleases
  filter(!str_detect(string = album_name, pattern = "Taylor's Version")) |>
  # order albums by release date
  mutate(album_name = factor(x = album_name, levels = taylor_albums$album_name) |>
           fct_rev())


# errorbar plot
taylor_all_tidy |>
  # join with sentiment dictionary
  inner_join(y = get_sentiments(lexicon = "afinn")) |>
  # calculate average sentiment by album with standard error
  summarize(
    sent = ______,
    se = ______ / ______,
    .by = album_name
  ) |>
  # generate plot with albums ordered by release date
  ggplot(mapping = aes(y = album_name, x = sent)) +
  geom_pointrange(mapping = aes(
    xmin = sent - 2 * se,
    xmax = sent + 2 * se
  )) +
  labs(
    title = "Emotional affect in Taylor Swift albums",
    subtitle = "Original studio albums",
    x = "Average sentiment",
    y = NULL,
    caption = "Source: {taylor}"
  )


# ggridge plot with tayloRswift color palette

# add code here

Varying types of sentiment

tidytext provides access to multiple sentiment dictionaries for different types of sentiment. We can use the NRC dictionary to examine the different types of sentiment in Taylor Swift’s lyrics.

Your turn: Visualize the different types of sentiment in Taylor Swift’s lyrics, by album. Normalize by the number of (sentimental) words per album.

taylor_all_tidy |>
  # join with sentiment dictionary
  inner_join(y = ______) |>
  filter(sentiment != "positive", sentiment != "negative") |>
  count(______, ______) |>
  # normalize for number of tokens per album
  left_join(y = taylor_all_tidy |>
    count(album_name, name = "n_all")) |>
  mutate(n_pct = n / n_all) |>
  # reverse album order for vertical plot
  mutate(album_name = fct_rev(f = album_name)) |>
  ggplot(mapping = aes(x = n_pct, y = album_name)) +
  geom_col() +
  scale_x_continuous(labels = percent_format()) +
  facet_wrap(
    facets = vars(sentiment)
  ) +
  labs(
    title = "Sentimental affect (by type) in Taylor Swift albums",
    subtitle = "Original studio albums",
    x = "Percentage of tokens in the album",
    y = NULL,
    caption = "Source: {taylor}"
  )
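
The normalization step (count per album and sentiment, then divide by the album's total token count) can be checked on a toy example. This is a sketch under stated assumptions: toy_tokens is made up, and toy_lexicon is a hypothetical stand-in that borrows NRC's column layout (word and sentiment) without reproducing its contents.

```r
library(dplyr)
library(tibble)

# toy tokens (made-up), one row per token after stop word removal
toy_tokens <- tibble(
  album = c(rep("Album A", 4), rep("Album B", 5)),
  word = c("love", "fear", "car", "rain", "joy", "joy", "fear", "road", "sky")
)

# hypothetical lexicon with NRC's column layout (word, sentiment)
toy_lexicon <- tibble(
  word = c("love", "fear", "joy"),
  sentiment = c("joy", "fear", "joy")
)

toy_pct <- toy_tokens |>
  inner_join(y = toy_lexicon, by = "word") |>
  # number of tokens per album and sentiment type
  count(album, sentiment) |>
  # attach each album's total token count
  left_join(y = toy_tokens |> count(album, name = "n_all"), by = "album") |>
  mutate(n_pct = n / n_all)

toy_pct
```

Dividing by n_all makes albums of different lengths comparable. Note that the denominator here counts every remaining token in the album, not only the ones that matched the lexicon.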