Text analysis: fundamentals and sentiment analysis

Lecture 22

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-11-13

Announcements

Announcements

  • Lab 06
  • Homework 05
  • Project (draft) deliverables

Core text data workflows

Basic workflow for text analysis

  • Obtain your text sources
  • Extract documents and move into a corpus
  • Transformation
  • Extract features
  • Perform analysis

Obtain your text sources

  • Web sites/APIs
  • Databases
  • PDF documents
  • Digital scans of printed materials

Extract documents and move into a corpus

  • Text corpus
  • Typically stores the text as a raw character string with metadata and details stored with the text

Transformation

  • Tag segments of speech for part-of-speech (nouns, verbs, adjectives, etc.) or entity recognition (person, place, company, etc.)
  • Standard text processing
    • Convert to lower case
    • Remove punctuation
    • Remove numbers
    • Remove stopwords
    • Remove domain-specific stopwords
    • Stemming

Extract features

  • Convert the text string into some sort of quantifiable measures
  • Bag-of-words model
    • Term frequency vector
    • Term-document matrix
    • Ignores context
  • Word embeddings

Word embeddings

Perform analysis

  • Basic
    • Word frequency
    • Collocation
    • Dictionary tagging
  • Advanced
    • Document classification
    • Corpora comparison
    • Topic modeling

tidytext

  • Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use
  • Learn more at tidytextmining.com
library(tidyverse)
library(tidytext)

What is tidy text?

text <- c(
  "Yeah, with a boy like that it's serious",
  "There's a boy who is so wonderful",
  "That girls who see him cannot find back home",
  "And the gigolos run like spiders when he comes",
  "'Cause he is Eros and he's Apollo",
  "Girls, with a boy like that it's serious",
  "Senoritas, don't follow him",
  "Soon, he will eat your hearts like cereals",
  "Sweet Lolitas, don't go",
  "You're still young",
  "But every night they fall like dominoes",
  "How he does it, only heaven knows",
  "All the other men turn gay wherever he goes (wow!)"
)
text
 [1] "Yeah, with a boy like that it's serious"           
 [2] "There's a boy who is so wonderful"                 
 [3] "That girls who see him cannot find back home"      
 [4] "And the gigolos run like spiders when he comes"    
 [5] "'Cause he is Eros and he's Apollo"                 
 [6] "Girls, with a boy like that it's serious"          
 [7] "Senoritas, don't follow him"                       
 [8] "Soon, he will eat your hearts like cereals"        
 [9] "Sweet Lolitas, don't go"                           
[10] "You're still young"                                
[11] "But every night they fall like dominoes"           
[12] "How he does it, only heaven knows"                 
[13] "All the other men turn gay wherever he goes (wow!)"

What is tidy text?

text_df <- tibble(line = 1:length(text), text = text)
text_df
# A tibble: 13 × 2
    line text                                              
   <int> <chr>                                             
 1     1 Yeah, with a boy like that it's serious           
 2     2 There's a boy who is so wonderful                 
 3     3 That girls who see him cannot find back home      
 4     4 And the gigolos run like spiders when he comes    
 5     5 'Cause he is Eros and he's Apollo                 
 6     6 Girls, with a boy like that it's serious          
 7     7 Senoritas, don't follow him                       
 8     8 Soon, he will eat your hearts like cereals        
 9     9 Sweet Lolitas, don't go                           
10    10 You're still young                                
11    11 But every night they fall like dominoes           
12    12 How he does it, only heaven knows                 
13    13 All the other men turn gay wherever he goes (wow!)

What is tidy text?

text_df |>
  unnest_tokens(output = word, input = text)
# A tibble: 91 × 2
    line word   
   <int> <chr>  
 1     1 yeah   
 2     1 with   
 3     1 a      
 4     1 boy    
 5     1 like   
 6     1 that   
 7     1 it's   
 8     1 serious
 9     2 there's
10     2 a      
# ℹ 81 more rows

Counting words

text_df |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)
# A tibble: 67 × 2
   word      n
   <chr> <int>
 1 he        5
 2 like      5
 3 a         3
 4 boy       3
 5 that      3
 6 and       2
 7 don't     2
 8 girls     2
 9 him       2
10 is        2
# ℹ 57 more rows

Application exercise

ae-20

  • Go to the course GitHub org and find your ae-20 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow

Recap

  • tidytext allows you to structure text data in a format conducive to exploratory analysis and wrangling/visualization with tidyverse
  • Tokenizing is a process of converting raw character strings to recognizable features
  • Remove non-informative stop words to reduce noise in the text data
  • tf-idf measures the importance of frequently occurring tokens, increasing the weight for tokens that are not used very much across the corpus
  • Dictionary-based sentiment analysis provides a rough classification of text into positive/negative sentiments

Only a few weeks left in the semester!