Wednesday, October 31, 2018

It's alive! (somewhat)!

I've been plugging away at sentiment analysis in RStudio and want to pause to share some results! This post will cover working with R scripts to generate "tidy" data frames and perform some sentiment analysis per chunk (instead of just per document).

Tidy Data

As I've mentioned in previous posts, a lot of the sentiment analysis packages out there make use of "tidy data." As Julia Silge explains in her blog post from 2016, tidy data takes the following format:
  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table

So, basically, the goal is to convert the SOTUs into really long, skinny tables, where each occurrence of a word gets its own row. This means the tidy table (stupidly called a "tibble," apparently) keeps track of where words are in relation to each other. This is an improvement (for my purposes) over the "document-term-matrix" I created in my initial text mining steps, which only gave each word a single row along with the number of times it appears in the corpus, in alphabetical order (not the order the words appear in the text).

a chart with SOTU filenames for rows and alphabetical list of words for column headings. the chart is populated with numbers showing how many times each term appears in each file.
The document-term-matrix. Every word is recorded and we can see which files they are in, but there's no way to track where they are in relation to one another within the files.

a chart with each word in the first sentence of Washington's first 1790 SOTU as its own row
The tidy data frame retains the words' relationships to each other - linenumber isn't working the way I want right now, but it does reset with each file and keeps the context of each token (word) intact.
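In tidytext the conversion is basically a one-liner (unnest_tokens()), but the underlying idea can be sketched in base R. Everything here is invented for illustration - the tidy_me() helper, the toy sotu vector, and its two snippets of text are stand-ins, not the real corpus or the real code:

```r
# Toy stand-in for a corpus: one element per file (made-up snippets, not real SOTUs)
sotu <- c(sotu_1790 = "I embrace with great satisfaction the opportunity",
          sotu_1792 = "It is some abatement of the satisfaction")

# Sketch of tidying: one row per token, keeping the file and the word order
tidy_me <- function(texts) {
  tokens <- lapply(texts, function(t) strsplit(tolower(t), "\\s+")[[1]])
  data.frame(
    file = rep(names(tokens), lengths(tokens)),
    linenumber = unlist(lapply(tokens, seq_along)),  # position within each file, resets per file
    word = unlist(tokens),
    row.names = NULL
  )
}

tidy_sketch <- tidy_me(sotu)
head(tidy_sketch)  # each occurrence of a word gets its own row, in document order
```

The key property is visible in the result: the same word ("satisfaction") can appear in multiple rows, and the row order preserves each word's context, which the document-term-matrix throws away.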

Analysis: Positive Words in Washington's 1792 SOTU

Once the "tidy_SOTUs" data frame was created, I could use it for all kinds of fun analysis that raised a lot of new questions. For example, the top "positive" words in Washington's 1792 SOTU were:
  1.  peace, 5 instances
  2.  present, 5 instances
  3.  found, 4 instances
Peace definitely seems positive, but you probably wouldn't be talking about it all the time if there were no threat of war. Most of Washington's mentions are really about the desire for, but absence of, peace.
  • A sanction commonly respected even among savages has been found in this instance insufficient to protect from massacre the emissaries of peace.
  • I can not dismiss the subject of Indian affairs without again recommending to your consideration the [. . .] restraining the commission of outrages upon the Indians, without which all pacific plans must prove nugatory. To enable [. . .] the employment of qualified and trusty persons to reside among them as agents would also contribute to the preservation of peace . . .
  • I particularly recommend to your consideration the means of preventing those aggressions by our citizens on the territory of other nations, and other infractions of the law of nations, which, furnishing just subject of complaint, might endanger our peace with them . . .
So, while it was fun to learn the word "nugatory," qualifying these mentions of "peace" as "positive sentiments" is probably a little misleading.

"Present" I assume is on the positive list from NRC (here, I think, but I need to look more into the exact list included with R, and the methodology for the "crowdsourcing" used to generate the list) as a synonym for "gift," but that's not how Washington is using the word at all (he's referring to the temporal present). 

"Found," again, seems like it would be on the "positive" list because of its sense as recovering something lost, but Washington is always using it in a legalistic context: 
  • a sanction that "has been found in this instance insufficient,"
  • considering the expense of future operations "which may be found inevitable,"
  • a reform of the judiciary that "will, it is presumed, be found worthy of particular attention," and
  • a question about the post office, which, if "upon due inquiry, be found to be the fact," a remedy would need to be considered.
None of these are really strictly negative sentiments, but they are not like the "I found my missing wallet" or "found a new hobby" types of sentiments that I expect led to the word's placement on the "positive" list.
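The "top positive words" counts above come from joining the tidy data frame against a sentiment lexicon and tallying the matches. A minimal base-R sketch of that join-and-count step, using a toy three-word lexicon and invented word counts (the real analysis used the NRC lexicon via tidytext, not this hand-built list):

```r
# Toy positive-word lexicon (a tiny stand-in for the NRC list)
positive_words <- data.frame(word = c("peace", "present", "found"),
                             sentiment = "positive")

# Toy tidy frame: one word per row, as in tidy_SOTUs (counts invented to match the post)
tidy_words <- data.frame(word = c(rep("peace", 5), rep("present", 5),
                                  rep("found", 4), rep("war", 2)))

# inner join + count in base R: merge keeps only words in the lexicon,
# then table/sort produce the ranked tally
matched <- merge(tidy_words, positive_words, by = "word")
top_positive <- sort(table(matched$word), decreasing = TRUE)
top_positive  # peace and present with 5, found with 4; "war" drops out
```

Note that this is exactly why the lexicon's mislabeled senses matter: the join matches on the word's spelling alone, with no idea whether "present" means a gift or a point in time.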

Sentiment Analysis Within Documents

I had already been able to run some sentiment analysis tasks on my corpus of SOTUs, but because they relied on document-term-matrices instead of tidy data, I couldn't run any analysis comparing words or contexts within documents - only between them. Even though the line numbers weren't working like I wanted (each token in each SOTU file was given its own line number instead of the numbering actually going line-by-line; I tried using as.integer() and division to count every ten words as a line, but that didn't work either), having the words in order still enabled analysis I couldn't do before.

The first visualization I generated with tidy text was a great feeling. Here's the code with some comments:

library(dplyr)    # load dplyr for inner_join(), count(), and mutate()
library(tidyr)    # load the tidyr library for spread()
library(tidytext) # load tidytext for get_sentiments()

SOTU_sentiment <- tidy_SOTUs %>%  # start from tidy_SOTUs, dump the result into SOTU_sentiment
  inner_join(get_sentiments("bing")) %>%  # keep only words in the "bing" sentiment lexicon (different from the NRC lexicon in the previous example)
  count(file, index = linenumber %/% 80, sentiment) %>% # count sentiment words per chunk of 80 "lines" - in this case, 80 words at a time, since each word got its own line number
  spread(sentiment, n, fill = 0) %>% # pivot the negative/positive counts into their own columns (this is why we wanted tidyr), filling missing counts with 0
  mutate(sentiment = positive - negative) # finally, calculate "sentiment" by subtracting instances of negative words from instances of positive words, per 80-token chunk

library(ggplot2) # load the ggplot2 library for visualization

ggplot(SOTU_sentiment, aes(index, sentiment, fill = file)) + # plot SOTU_sentiment
  geom_col(show.legend = FALSE) + # one column per chunk, don't show a legend
  facet_wrap(~file, ncol = 2, scales = "free_x") # one panel per file, in 2 columns; "free_x" lets each panel's x-axis fit its own number of chunks

and here's the resulting visualization from ggplot:
Organizing these file names with the month first was a little dumb and resulted in them being out of order, but ideally the file names shouldn't make a huge difference for the final product anyway.
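What the spread() and mutate() steps in the code above accomplish - separate positive/negative columns, then a net score per chunk - can be mimicked in a few lines of base R. The counts here are invented, not real SOTU numbers:

```r
# Toy per-chunk sentiment counts, in the "long" shape that count() produces
counts <- data.frame(index = c(0, 0, 1, 2),
                     sentiment = c("positive", "negative", "positive", "negative"),
                     n = c(10, 4, 3, 6))

# Pivot to one column per sentiment (what spread() does); absent combinations become 0
wide <- xtabs(n ~ index + sentiment, data = counts)

# Net sentiment per chunk (what mutate(sentiment = positive - negative) does)
net <- wide[, "positive"] - wide[, "negative"]
net  # chunk 0 nets +6, chunk 1 nets +3, chunk 2 nets -6
```

The fill = 0 matters: chunk 1 has no negative words at all, and without the fill the subtraction would propagate a missing value instead of a score.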

It's interesting that most addresses begin on a positive note. This seemed like a clear enough pattern (with notable enough exceptions in 1792 and 1794) that it was worth looking into before going much further - if the NRC list was seemingly so off-base about "peace" and "present," I wanted to see if these visualizations even meant anything.

Grabbing the first 160 words (two 80-word chunks) from the 1790 and 1794 SOTUs, then comparing them subjectively with their respective charts revealed the following (image contains lots of text, also available as a markdown file on the GitHub repo (raw markdown)):
please visit markdown file link to read text contained within image

I have to say, these definitely pass the smell-test. 1790 is all congratulatory and full of lofty, philosophic patriotism, while 1794 is about insurrection and taxation. I was especially pleased that, even though 1794 contains a lot of what I would think are positive words (gracious, indulgence, heaven, riches, power, happiness, expedient, stability), the chart still depicts a negative number that is in-line with my subjective reading of the text.

Comparing Sentiment Lexicons

As I mentioned, I have used two lexicons so far in these examples: "NRC" and "Bing" (not the search engine). It's not entirely clear (without more digging) how these lexicons were generated (NRC mentions "crowdsourcing" and Bing just says "compiled over many years starting from our first paper"), but for now I wanted to at least start by getting a feel for how they might differ. Especially since I'm dealing with texts where people say things like "nugatory" and "burthens" (or even just the difference in word choice between Bush 43 and Obama), it's definitely possible that these lexicons won't be a good fit across over 200 years of texts.

Fortunately, the "Sentiment Analysis with Tidy Data" process I was following had just the thing. I'm realizing that eventually I should make some kind of notebook for this code, so I'll omit it here, but basically I ended up with a chart comparing three different sentiment lexicons (AFINN, Bing et al., and NRC), each run over Washington's 1793 SOTU:

It's a good sign that the trends across the three analyses roughly track together. In general, it looks like AFINN skews more positive overall and is a little less volatile than the others, and obviously the three use different scales, but it's nice that the results weren't radically different. 
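The different scales come from how the lexicons label words: AFINN assigns each word an integer score (roughly -5 to +5) and the chunk score is the sum, while Bing and NRC are binary lists, so the score is just positives minus negatives. A toy comparison with invented words and scores (not actual lexicon entries):

```r
# Hypothetical chunk of words with made-up AFINN-style scores and Bing-style labels
words       <- c("peace", "massacre", "satisfaction", "complaint")
afinn_score <- c(2, -4, 3, -2)       # invented integers, for illustration only
bing_label  <- c("positive", "negative", "positive", "negative")

afinn_total <- sum(afinn_score)                 # graded: a word's magnitude matters
bing_total  <- sum(bing_label == "positive") -
               sum(bing_label == "negative")    # binary: only the counts matter

c(afinn = afinn_total, bing = bing_total)
```

Here the two methods even disagree on sign for the same four words, which is a plausible source of the different volatility in the three panels.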

Lots more to come, but that's plenty for one post.