Text analysis: fundamentals and sentiment analysis

Lecture 22

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2023

2023-11-13

Announcements

Lab 06
Homework 05
Project (draft) deliverables

Core text data workflows

Basic workflow for text analysis

Obtain your text sources
Extract documents and move into a corpus
Transformation
Extract features
Perform analysis

Obtain your text sources

Web sites/APIs
Databases
PDF documents
Digital scans of printed materials

Extract documents and move into a corpus

Text corpus
Typically stores each document's text as a raw character string, alongside metadata about that document
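
As a rough illustration of this idea, a corpus can be held in a data frame with one row per document: a character column for the raw text plus columns for metadata. The document IDs, authors, dates, and text below are hypothetical placeholders, and a tibble is only one of several possible containers.

```r
library(tibble)

# hypothetical corpus: every value here is invented for illustration
corpus <- tibble(
  doc_id = c("speech-01", "speech-02"),             # unique document identifier
  author = c("Speaker A", "Speaker B"),             # metadata: who wrote it
  date   = as.Date(c("2023-01-15", "2023-02-20")),  # metadata: when it was written
  text   = c(
    "Raw text of the first document.",
    "Raw text of the second document."
  )                                                 # the raw character string
)

corpus
```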

Transformation

Tag tokens with their part of speech (noun, verb, adjective, etc.) or recognize named entities (person, place, company, etc.)
Standard text processing
- Convert to lower case
- Remove punctuation
- Remove numbers
- Remove stopwords
- Remove domain-specific stopwords
- Stemming
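
The standard processing steps above could be sketched in R roughly as follows. This is a minimal sketch, not a definitive pipeline: the two documents and the domain-specific stop word list are hypothetical, and it assumes the SnowballC package is installed for stemming. `unnest_tokens()` lowercases text and strips punctuation by default, so those two steps happen during tokenization.

```r
library(dplyr)
library(stringr)
library(tidytext)
library(SnowballC)

# hypothetical two-document corpus, invented for illustration
docs <- tibble(
  doc_id = 1:2,
  text = c(
    "The 3 senators debated the bill for 12 hours!",
    "Senators are debating; the bills were delayed."
  )
)

# hypothetical domain-specific stop words for this toy corpus
domain_stop_words <- tibble(word = c("senator", "senators"))

tokens <- docs |>
  # tokenize into words: converts to lower case and removes punctuation
  unnest_tokens(output = word, input = text) |>
  # remove tokens made up entirely of digits
  filter(!str_detect(word, "^[0-9]+$")) |>
  # remove general-purpose stop words shipped with tidytext
  anti_join(stop_words, by = "word") |>
  # remove domain-specific stop words
  anti_join(domain_stop_words, by = "word") |>
  # reduce each remaining token to its stem with the Porter stemmer
  mutate(stem = wordStem(word, language = "porter"))

tokens
```

Stemming is the most aggressive of these steps: inflected forms such as "debated" and "debating" collapse to a shared stem, which may or may not be a real word.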

Extract features

Convert the text into quantifiable measures
Bag-of-words model
- Term frequency vector
- Term-document matrix
- Ignores context
Word embeddings
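
To make the bag-of-words model concrete, here is a small sketch that builds term frequency counts and a term-document matrix (terms in rows, documents in columns). The two documents are hypothetical and chosen so that they contain the same words in a different order, which shows what "ignores context" means in practice.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# hypothetical documents: same words, opposite meaning
docs <- tibble(
  doc_id = c("doc1", "doc2"),
  text = c("dog bites man", "man bites dog")
)

# term frequencies: one row per document-term pair
term_counts <- docs |>
  unnest_tokens(output = word, input = text) |>
  count(doc_id, word)

# term-document matrix: one row per term, one column per document
term_document <- term_counts |>
  pivot_wider(names_from = doc_id, values_from = n, values_fill = 0)

term_document

# word order is discarded, so both documents get the same term frequency vector
identical(term_document$doc1, term_document$doc2)
```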

Perform analysis

Basic
- Word frequency
- Collocation
- Dictionary tagging
Advanced
- Document classification
- Corpora comparison
- Topic modeling
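
Of the basic analyses, dictionary tagging is perhaps the easiest to sketch: tokenize the text, then join the tokens against a dictionary that assigns each word a label. The example below uses the Bing sentiment lexicon, which tidytext makes available through `get_sentiments("bing")`; the two review texts are hypothetical.

```r
library(dplyr)
library(tidytext)

# hypothetical reviews, invented for illustration
reviews <- tibble(
  review_id = 1:2,
  text = c(
    "It was a wonderful and happy day",
    "It was a terrible and awful day"
  )
)

# dictionary of words labeled "positive" or "negative"
bing <- get_sentiments("bing")

sentiment_counts <- reviews |>
  # one row per word
  unnest_tokens(output = word, input = text) |>
  # keep only words found in the dictionary, attaching their label
  inner_join(bing, by = "word") |>
  # tally labels within each review
  count(review_id, sentiment)

sentiment_counts
```

Words missing from the dictionary are silently dropped by the inner join, so the result depends heavily on how well the dictionary matches the vocabulary of the corpus.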

`tidytext`

Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use
Learn more at tidytextmining.com

library(tidyverse)
library(tidytext)

What is tidy text?

text <- c(
  "Yeah, with a boy like that it's serious",
  "There's a boy who is so wonderful",
  "That girls who see him cannot find back home",
  "And the gigolos run like spiders when he comes",
  "'Cause he is Eros and he's Apollo",
  "Girls, with a boy like that it's serious",
  "Senoritas, don't follow him",
  "Soon, he will eat your hearts like cereals",
  "Sweet Lolitas, don't go",
  "You're still young",
  "But every night they fall like dominoes",
  "How he does it, only heaven knows",
  "All the other men turn gay wherever he goes (wow!)"
)
text

 [1] "Yeah, with a boy like that it's serious"           
 [2] "There's a boy who is so wonderful"                 
 [3] "That girls who see him cannot find back home"      
 [4] "And the gigolos run like spiders when he comes"    
 [5] "'Cause he is Eros and he's Apollo"                 
 [6] "Girls, with a boy like that it's serious"          
 [7] "Senoritas, don't follow him"                       
 [8] "Soon, he will eat your hearts like cereals"        
 [9] "Sweet Lolitas, don't go"                           
[10] "You're still young"                                
[11] "But every night they fall like dominoes"           
[12] "How he does it, only heaven knows"                 
[13] "All the other men turn gay wherever he goes (wow!)"

What is tidy text?

text_df <- tibble(line = seq_along(text), text = text)
text_df

# A tibble: 13 × 2
    line text                                              
   <int> <chr>                                             
 1     1 Yeah, with a boy like that it's serious           
 2     2 There's a boy who is so wonderful                 
 3     3 That girls who see him cannot find back home      
 4     4 And the gigolos run like spiders when he comes    
 5     5 'Cause he is Eros and he's Apollo                 
 6     6 Girls, with a boy like that it's serious          
 7     7 Senoritas, don't follow him                       
 8     8 Soon, he will eat your hearts like cereals        
 9     9 Sweet Lolitas, don't go                           
10    10 You're still young                                
11    11 But every night they fall like dominoes           
12    12 How he does it, only heaven knows                 
13    13 All the other men turn gay wherever he goes (wow!)

What is tidy text?

text_df |>
  unnest_tokens(output = word, input = text)

# A tibble: 91 × 2
    line word   
   <int> <chr>  
 1     1 yeah   
 2     1 with   
 3     1 a      
 4     1 boy    
 5     1 like   
 6     1 that   
 7     1 it's   
 8     1 serious
 9     2 there's
10     2 a      
# ℹ 81 more rows

Counting words

text_df |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)

# A tibble: 67 × 2
   word      n
   <chr> <int>
 1 he        5
 2 like      5
 3 a         3
 4 boy       3
 5 that      3
 6 and       2
 7 don't     2
 8 girls     2
 9 him       2
10 is        2
# ℹ 57 more rows
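
Many of the most frequent words in the counts above ("a", "that", "like") are function words that carry little meaning on their own. One common follow-up, sketched here on the first three lines of the lyrics, is to drop them with an anti-join against the `stop_words` data frame that ships with tidytext. Which words count as stop words is a judgment call, and this general-purpose list is only one option.

```r
library(dplyr)
library(tidytext)

# first three lines of the lyrics example above
text_df <- tibble(
  line = 1:3,
  text = c(
    "Yeah, with a boy like that it's serious",
    "There's a boy who is so wonderful",
    "That girls who see him cannot find back home"
  )
)

content_words <- text_df |>
  unnest_tokens(output = word, input = text) |>
  # drop every token that appears in the stop word list
  anti_join(stop_words, by = "word") |>
  count(word, sort = TRUE)

content_words
```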

Application exercise

`ae-20`

Go to the course GitHub org and find your ae-20 repo (the repo name will be suffixed with your GitHub username).
Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
Render, commit, and push your edits by the AE deadline (end of the day tomorrow).

Recap

tidytext allows you to structure text data in a format conducive to exploratory analysis and wrangling/visualization with tidyverse
Tokenizing is the process of splitting raw character strings into meaningful units (tokens), such as words, that can serve as features
Remove non-informative stop words to reduce noise in the text data
tf-idf measures how important a token is to a document: it rewards tokens that occur frequently within that document and increases the weight for tokens that are used in few documents across the corpus
Dictionary-based sentiment analysis provides a rough classification of text into positive/negative sentiments
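
As a small worked example of tf-idf, the sketch below applies tidytext's `bind_tf_idf()` to three hypothetical documents. The function computes term frequency as a token's share of its document and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the token.

```r
library(dplyr)
library(tidytext)

# hypothetical three-document corpus, invented for illustration
docs <- tibble(
  doc_id = c("doc1", "doc2", "doc3"),
  text = c(
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog"
  )
)

tf_idf_tbl <- docs |>
  unnest_tokens(output = word, input = text) |>
  # one row per document-token pair with its count
  count(doc_id, word) |>
  # add tf, idf, and tf_idf columns
  bind_tf_idf(term = word, document = doc_id, n = n) |>
  arrange(desc(tf_idf))

tf_idf_tbl
```

Here "the" appears in every document, so its idf is ln(3/3) = 0 and its tf-idf is 0 no matter how often it occurs. A token such as "mat" appears in only one document, so its idf is ln(3) and it rises to the top of the ranking.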

Only a few weeks left in the semester!