Text analysis: fundamentals and sentiment analysis

Lecture 21

Dr. Benjamin Soltoff

Cornell University
INFO 5001 - Fall 2024

November 12, 2024

Announcements

Announcements

  • Lab 05
  • Homework 05

Core text data workflows

Basic workflow for text analysis

  • Obtain your text sources
  • Extract documents and move into a corpus
  • Transformation
  • Extract features
  • Perform analysis

Obtain your text sources

  • Web sites/APIs
  • Databases
  • PDF documents
  • Digital scans of printed materials

Extract documents and move into a corpus

  • Text corpus
  • Typically stores each document as a raw character string, with metadata and other details stored alongside the text

Transformation

  • Tag segments of text for part-of-speech (nouns, verbs, adjectives, etc.) or named-entity recognition (person, place, company, etc.)
  • Standard text processing (a minimal sketch in code follows this list)
    • Convert to lower case
    • Remove punctuation
    • Remove numbers
    • Remove stopwords
    • Remove domain-specific stopwords
    • Stemming
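As an illustration, here is a minimal sketch of these standard transformations using tidytext, stringr, and SnowballC. The two example sentences and the domain-specific stop word ("cap") are invented for demonstration.

library(tidyverse)
library(tidytext)
library(SnowballC)  # wordStem() for stemming

docs <- tibble(
  doc = 1:2,
  text = c("The 2 dogs ATE the cheese!", "Mice are silly, no cap.")
)

docs |>
  # tokenize; unnest_tokens() lowercases and strips punctuation by default
  unnest_tokens(output = word, input = text) |>
  # remove tokens that are only digits
  filter(!str_detect(word, "^[0-9]+$")) |>
  # remove standard English stop words
  anti_join(stop_words, by = "word") |>
  # remove (invented) domain-specific stop words
  filter(!word %in% c("cap")) |>
  # stem the remaining tokens
  mutate(stem = wordStem(word))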

Extract features

Convert the text strings into quantifiable measures

Bag-of-words representation

This sentence is giving

Bet, I just lowkey vibed with this fire idea, and no cap, it’s giving major slay energy, so I’m finna drop it and let y’all stan!

Y’all this finna bet vibed drop it stan fire with giving major I just idea cap so and lowkey no and it’s I’m slay energy let

This idea giving y’all stan with bet energy no I’m vibed it cap and I slay lowkey and so it’s fire let major finna just drop

Giving major fire cap idea this and it y’all bet stan drop vibed lowkey energy with so slay I’m finna just it’s I no let and

Drop vibed cap this energy finna I’m it and stan y’all just major no and bet lowkey I slay so with idea fire giving it’s let

Order is meaningless.

Term frequency

Document                   are  ate  cat  cheese  delicious  dog  mice  mouse  silly  the  was
Mice are silly              1    0    0     0         0       0    1     0      1     0    0
The cat ate the mouse       0    1    1     0         0       0    0     1      0     2    0
The cheese was delicious    0    0    0     1         1       0    0     0      0     1    1
The dog ate the cat         0    1    1     0         0       1    0     0      0     2    0
  • Term frequency vector
  • Term-document matrix
  • Sparse data structure
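For illustration, a sketch of how the term-document matrix above could be built with tidytext, using the same four example sentences; the final cast_sparse() step yields the sparse data structure.

library(tidyverse)
library(tidytext)

docs <- tibble(
  doc = c("Mice are silly", "The cat ate the mouse",
          "The cheese was delicious", "The dog ate the cat"),
  text = doc
)

term_counts <- docs |>
  unnest_tokens(word, text) |>
  count(doc, word)            # term frequency vectors: one row per (document, term)

term_counts |>
  pivot_wider(names_from = word, values_from = n, values_fill = 0)  # term-document matrix

term_counts |>
  cast_sparse(doc, word, n)   # same matrix as a sparse data structure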

Term frequency-inverse document frequency

Term frequency: raw count of term in a document

Inverse document frequency:

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

tf-idf = term frequency \(\times\) inverse document frequency

Frequency of a term adjusted for how rarely it is used
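A sketch of computing these weights with tidytext's bind_tf_idf(), using the same four example sentences; this should reproduce values like those in the table on the next slide.

library(tidyverse)
library(tidytext)

docs <- tibble(
  doc = c("Mice are silly", "The cat ate the mouse",
          "The cheese was delicious", "The dog ate the cat"),
  text = doc
)

docs |>
  unnest_tokens(word, text) |>
  count(doc, word) |>
  # adds tf, idf = ln(n_documents / n_documents containing term), and tf_idf = tf * idf
  bind_tf_idf(term = word, document = doc, n = n)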

Term frequency-inverse document frequency

Document                    are    ate    cat    cheese  delicious  dog    mice   mouse  silly  the    was
Mice are silly             0.462  0.000  0.000  0.000     0.000    0.000  0.462  0.000  0.462  0.000  0.000
The cat ate the mouse      0.000  0.139  0.139  0.000     0.000    0.000  0.000  0.277  0.000  0.115  0.000
The cheese was delicious   0.000  0.000  0.000  0.347     0.347    0.000  0.000  0.000  0.000  0.072  0.347
The dog ate the cat        0.000  0.139  0.139  0.000     0.000    0.277  0.000  0.000  0.000  0.115  0.000

Word embeddings

Word embedding: a mathematical representation of a word in a continuous vector space

  • Dense data structure
  • Captures context
  • Semantic similarity
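Semantic similarity is commonly measured with cosine similarity between word vectors. A minimal sketch, assuming a hypothetical `embeddings` numeric matrix with one row per word (like the table on the following slides):

# cosine similarity between two numeric vectors
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# assuming `embeddings` is a numeric matrix with words as row names
cosine_sim(embeddings["cat", ], embeddings["dog", ])     # related words tend to score high
cosine_sim(embeddings["cat", ], embeddings["cheese", ])  # unrelated words tend to score lower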

Word embeddings

Word embeddings

word        d1      d2      d3      d4      d5      …    d100
are        -0.515   0.832   0.225  -0.739   0.187   …    0.348
ate        -0.080   0.659   0.353   0.035  -0.944   …    0.175
cat         0.231   0.283   0.632  -0.594  -0.586   …   -0.715
cheese     -0.637   0.605  -0.193   0.116  -0.411   …    0.036
delicious  -0.655   0.340   0.303  -0.149   0.177   …    0.723
dog         0.308   0.309   0.528  -0.925  -0.737   …   -0.521
mice        0.001   0.276   0.119  -0.587  -0.732   …   -1.163
mouse      -0.093   0.050   0.257  -0.525  -0.180   …   -0.670
silly      -0.081   0.060   0.779  -0.647  -0.616   …    0.194
the        -0.038  -0.245   0.728  -0.400   0.083   …    0.271
was         0.137  -0.543   0.194  -0.300   0.175   …   -0.238

(Each word is represented by a 100-dimensional vector; columns d6–d99 omitted here for readability.)

Word embeddings

Document                    d1      d2      d3      d4      d5      …    d100
The dog ate the cat        0.382   0.762   2.969  -2.284  -2.100   …   -0.519
The cat ate the mouse     -0.019   0.502   2.698  -1.883  -1.544   …   -0.669
Mice are silly            -0.596   1.167   1.123  -1.973  -1.161   …   -0.621
The cheese was delicious  -1.193   0.158   1.032  -0.732   0.025   …    0.792

(Each document is represented by a 100-dimensional vector; columns d6–d99 omitted here for readability.)
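The document vectors above appear to be the element-wise sums of their word vectors (e.g., dog + ate + 2 × the + cat for the first row). A sketch of that computation, assuming `word_vectors` is a tibble like the table on the previous slide (columns word, d1:d100):

library(tidyverse)
library(tidytext)

docs <- tibble(
  doc = c("Mice are silly", "The cat ate the mouse",
          "The cheese was delicious", "The dog ate the cat"),
  text = doc
)

docs |>
  unnest_tokens(word, text) |>
  left_join(word_vectors, by = "word") |>  # attach each token's 100-dimensional vector
  group_by(doc) |>
  summarize(across(d1:d100, sum))          # document vector = sum of its word vectors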

Consumer complaints to the CFPB

Consumer complaints to the CFPB

[1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."                                                                                                                                                                                                                                                                                                                                   
[2] "I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act."
[3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."                           
[4] "I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal. \n\nThis occured on XX/XX/2019, by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX"                                                 

Sparse matrix structure

Document-feature matrix of: 117,214 documents, 46,099 features (99.88% sparse) and 0 docvars.
         features
docs        account auto bank call charg chase dai date dollar
  3113204 1       1    2    2    1     1     1   3    1      1
  3113208 0       1    0    6    3     5     0   0    1      1
  3113804 0       0    0    0    0     0     0   2    2      0
  3113805 0       1    0    0    0     0     0   0    0      0
  3113807 0       2    0    0    0     1     0   0    0      0
  3113808 0       0    0    0    0     0     0   0    0      0
[ reached max_ndoc ... 117,208 more documents, reached max_nfeat ... 46,089 more features ]

Sparsity of text corpora

Generating word embeddings

  • Dimension reduction
    • Principal components analysis (PCA)
    • Singular value decomposition (SVD)
  • Probabilistic models
  • Neural networks
    • Word2Vec
    • GloVe
    • BERT
    • ELMo
  • Custom-generated or pre-trained

GloVe

  • Pre-trained word vector representations
  • Measured using co-occurrence statistics (how frequently words occur in proximity to each other)
  • Four versions
    • Wikipedia (2014) - 6 billion tokens, 400 thousand words
    • Twitter - 27 billion tokens, 2 billion tweets, 1.2 million words
    • Common Crawl - 42 billion tokens, 1.9 million words
    • Common Crawl - 840 billion tokens, 2.2 million words
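One way to load these pre-trained vectors in R is through the textdata package (a sketch; the first call downloads a large file and prompts for confirmation). This produces a tibble like the one shown on the next slide.

library(textdata)

# 100-dimensional vectors from the 6-billion-token (Wikipedia + Gigaword) release
glove6b <- embedding_glove6b(dimensions = 100)
glove6b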

GloVe 6B (100 dimensions)

# A tibble: 400,000 × 101
   token      d1      d2      d3      d4      d5      d6      d7      d8      d9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 "the" -0.0382 -0.245   0.728  -0.400   0.0832  0.0440 -0.391   0.334  -0.575 
 2 ","   -0.108   0.111   0.598  -0.544   0.674   0.107   0.0389  0.355   0.0635
 3 "."   -0.340   0.209   0.463  -0.648  -0.384   0.0380  0.171   0.160   0.466 
 4 "of"  -0.153  -0.243   0.898   0.170   0.535   0.488  -0.588  -0.180  -1.36  
 5 "to"  -0.190   0.0500  0.191  -0.0492 -0.0897  0.210  -0.550   0.0984 -0.201 
 6 "and" -0.0720  0.231   0.0237 -0.506   0.339   0.196  -0.329   0.184  -0.181 
 7 "in"   0.0857 -0.222   0.166   0.134   0.382   0.354   0.0129  0.225  -0.438 
 8 "a"   -0.271   0.0440 -0.0203 -0.174   0.644   0.712   0.355   0.471  -0.296 
 9 "\""  -0.305  -0.236   0.176  -0.729  -0.283  -0.256   0.266   0.0253 -0.0748
10 "'s"   0.589  -0.202   0.735  -0.683  -0.197  -0.180  -0.392   0.342  -0.606 
# ℹ 399,990 more rows
# ℹ 91 more variables: d10 <dbl>, d11 <dbl>, d12 <dbl>, d13 <dbl>, d14 <dbl>,
#   d15 <dbl>, d16 <dbl>, d17 <dbl>, d18 <dbl>, d19 <dbl>, d20 <dbl>,
#   d21 <dbl>, d22 <dbl>, d23 <dbl>, d24 <dbl>, d25 <dbl>, d26 <dbl>,
#   d27 <dbl>, d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>, d32 <dbl>,
#   d33 <dbl>, d34 <dbl>, d35 <dbl>, d36 <dbl>, d37 <dbl>, d38 <dbl>,
#   d39 <dbl>, d40 <dbl>, d41 <dbl>, d42 <dbl>, d43 <dbl>, d44 <dbl>, …

Fairness in word embeddings

Word embeddings learn semantics and meaning from human-generated text. If that text is biased, the embeddings will encode the same bias.

Perform analysis

  • Basic
    • Word frequency
    • Collocation
    • Dictionary tagging
  • Advanced
    • Document classification
    • Corpora comparison
    • Topic modeling

tidytext

tidytext

  • Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use
  • Learn more at tidytextmining.com
library(tidyverse)
library(tidytext)

What is tidy text?

text <- c(
  "Yeah, with a boy like that it's serious",
  "There's a boy who is so wonderful",
  "That girls who see him cannot find back home",
  "And the gigolos run like spiders when he comes",
  "'Cause he is Eros and he's Apollo",
  "Girls, with a boy like that it's serious",
  "Senoritas, don't follow him",
  "Soon, he will eat your hearts like cereals",
  "Sweet Lolitas, don't go",
  "You're still young",
  "But every night they fall like dominoes",
  "How he does it, only heaven knows",
  "All the other men turn gay wherever he goes (wow!)"
)
text
 [1] "Yeah, with a boy like that it's serious"           
 [2] "There's a boy who is so wonderful"                 
 [3] "That girls who see him cannot find back home"      
 [4] "And the gigolos run like spiders when he comes"    
 [5] "'Cause he is Eros and he's Apollo"                 
 [6] "Girls, with a boy like that it's serious"          
 [7] "Senoritas, don't follow him"                       
 [8] "Soon, he will eat your hearts like cereals"        
 [9] "Sweet Lolitas, don't go"                           
[10] "You're still young"                                
[11] "But every night they fall like dominoes"           
[12] "How he does it, only heaven knows"                 
[13] "All the other men turn gay wherever he goes (wow!)"

What is tidy text?

text_df <- tibble(line = 1:length(text), text = text)
text_df
# A tibble: 13 × 2
    line text                                              
   <int> <chr>                                             
 1     1 Yeah, with a boy like that it's serious           
 2     2 There's a boy who is so wonderful                 
 3     3 That girls who see him cannot find back home      
 4     4 And the gigolos run like spiders when he comes    
 5     5 'Cause he is Eros and he's Apollo                 
 6     6 Girls, with a boy like that it's serious          
 7     7 Senoritas, don't follow him                       
 8     8 Soon, he will eat your hearts like cereals        
 9     9 Sweet Lolitas, don't go                           
10    10 You're still young                                
11    11 But every night they fall like dominoes           
12    12 How he does it, only heaven knows                 
13    13 All the other men turn gay wherever he goes (wow!)

What is tidy text?

text_df |>
  unnest_tokens(output = word, input = text)
# A tibble: 91 × 2
    line word   
   <int> <chr>  
 1     1 yeah   
 2     1 with   
 3     1 a      
 4     1 boy    
 5     1 like   
 6     1 that   
 7     1 it's   
 8     1 serious
 9     2 there's
10     2 a      
# ℹ 81 more rows

Counting words

text_df |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)
# A tibble: 67 × 2
   word      n
   <chr> <int>
 1 he        5
 2 like      5
 3 a         3
 4 boy       3
 5 that      3
 6 and       2
 7 don't     2
 8 girls     2
 9 him       2
10 is        2
# ℹ 57 more rows
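Dictionary tagging works the same way: join the tokens against a lexicon. A minimal sketch of dictionary-based sentiment analysis using the Bing lexicon bundled with tidytext, continuing with text_df from above:

text_df |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("bing"), by = "word") |>  # keep only words found in the lexicon
  count(sentiment, sort = TRUE)                       # rough positive/negative classification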

Application exercise

ae-18

  • Go to the course GitHub org and find your ae-18 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

Recap

  • tidytext allows you to structure text data in a format conducive to exploratory analysis and wrangling/visualization with tidyverse
  • Tokenizing is the process of converting raw character strings into recognizable features (tokens)
  • Remove non-informative stop words to reduce noise in the text data
  • Dictionary-based sentiment analysis provides a rough classification of text into positive/negative sentiments