Wednesday, October 24, 2018

Forays into Sentiment Analysis

I spent some time working with Rachael Tatman's "Sentiment Analysis in R" tutorial and wanted to blog here about some of the results.

Setting up

I had already done some basic text mining and tokenization of my State of the Union corpus (which, as I've mentioned before, is incomplete and inaccurate right now). Hilariously, Tatman's tutorial also used some State of the Union addresses as its corpus, although it only contains a few years' worth of SOTUs. I was working with about 215 - not the full set, but certainly a better sample size than the few in the tutorial. 

The first steps were just loading in the requisite R packages: tidyverse, tidytext, glue, and stringr. Tidyverse took quite a while to install and required several rounds of reading error messages, installing stuff with apt, trying to install again, and waiting for the next error. It would be cool if I could install these packages through apt, or if R could fetch the system libraries they depend on (or even just give a clearer error message!) but whatever.

With those packages loaded, I used glue to write a "fileName" vector (I guess that's what we're calling them in R?) that includes the full file path and name, and a "fileText" vector that we then turn into a data_frame and tokenize into the vector "tokens." I also learned the fun fact that dollar signs are a special character in R's regular expressions, and need to be escaped with \\ before you can operate on them. Hooray for learning! Really, I should have guessed/figured this out before, but I know now.
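Here is a minimal sketch of that setup step. It is not Tatman's exact code: the temporary directory and the one-sentence file are stand-ins I made up so the snippet runs on its own, and I'm using tibble() in place of data_frame().

```r
library(glue)
library(tibble)
library(tidytext)

# Stand-in corpus: one made-up sentence written to a temporary directory
corpusDir <- tempdir()
writeLines("We spent $5 million on roads.",
           file.path(corpusDir, "1801-Jefferson-12-8.txt"))

# Build the full path with glue, then read the file into a single string
fileName <- glue("{corpusDir}/1801-Jefferson-12-8.txt")
fileText <- paste(readLines(fileName), collapse = " ")

# In a regex, "$" means "end of string", so a literal dollar sign is "\\$"
fileText <- gsub("\\$", "", fileText)

# Tokenize: one row per word
tokens <- unnest_tokens(tibble(text = fileText), word, text)
```

Passing fixed = TRUE to gsub() should be another way around the escaping, since it treats the pattern as plain text rather than a regex.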

Counting Sentiments

The nice thing about tidytext was that it came with some sentiment lexicons ready to use (well, "some" in theory; only "bing" worked for me). Happily, the sentiment analysis function worked on my first try, and was pretty fun:
Even though I'm really just executing instructions other people wrote, running this the first few times was a pretty satisfying feeling
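Roughly, the counting step looks like the sketch below. The seven words are made up, and this is my paraphrase of the idea rather than the tutorial's code: join the tokens against the "bing" lexicon, then count how many matched words fall under each sentiment.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up tokens standing in for one tokenized speech
tokens <- tibble(word = c("a", "wonderful", "harvest",
                          "after", "a", "terrible", "war"))

# Keep only words that appear in the bing lexicon, then tally by sentiment
counts <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)
```

Words that aren't in the lexicon simply drop out of the join, so the counts only ever reflect the words the list knows about.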

But this also raised some questions. This was just one word list, so how would other lists change the results? What exactly was this spread() function doing? Tatman's comments say she "made data wide rather than narrow" which, sure. I don't know what that means. Nonetheless, I was looking for proof-of-concept at this point, not a robust and fully theorized methodology, so I continued on.
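As far as I can tell, "wide rather than narrow" just means turning rows into columns. Here is a small sketch with made-up counts, assuming the tidyr package is installed: the narrow table has one row per sentiment, and spread() gives each sentiment its own column so the two counts end up side by side.

```r
library(tidyr)
library(tibble)

# Made-up counts in "narrow" form: one row per sentiment
narrow <- tibble(sentiment = c("negative", "positive"),
                 n = c(117, 240))

# spread() makes one column per value of `sentiment`, filled from `n`
wide <- spread(narrow, sentiment, n, fill = 0)

# With both counts on the same row, the net score is a plain subtraction
wide$net <- wide$positive - wide$negative
```

That would explain why the step is there at all: you can't subtract "negative" from "positive" while they live on different rows.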

I ran into some fun issues with REGEX here - Tatman's tutorial used REGEX to pull attributes out of the filenames in the corpus and attach them to each text. My filenames were in a slightly different format from the tutorial's, and looked like this:

  • 1801-Jefferson-12-8.txt
  • 1945-FDR-1-6.txt

and so on. I couldn't quite figure out how to match the names, and my clumsy solution of using ([[:alpha:]]{4,}) to find strings of four or more letters (so as to avoid picking up "txt" in every file name) was actually missing FDR, which is only three letters long (I used it instead of "Roosevelt" since there are two Roosevelt presidents). I posted a "help wanted" issue on GitHub, and fortunately @RJP43 responded and helped me find a solution. Some other random guy actually replied, too, but he deleted his (perfectly helpful) response, which was a bummer.
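One way to sidestep the length problem is to match on position instead: four digits, a hyphen, then letters up to the next hyphen. The sketch below is my own illustration in base R, not necessarily the fix @RJP43 suggested, and it assumes every name is a single run of letters with no spaces or hyphens.

```r
files <- c("1801-Jefferson-12-8.txt", "1945-FDR-1-6.txt")

# The original idea: four or more letters in a row, so "FDR" never matches
hasLongName <- grepl("[[:alpha:]]{4,}", files)

# Position-based pattern: year, hyphen, name, hyphen, rest of the filename
pattern <- "^([[:digit:]]{4})-([[:alpha:]]+)-.*$"
year <- as.integer(sub(pattern, "\\1", files))
president <- sub(pattern, "\\2", files)
```

Because the name is identified by where it sits rather than how long it is, "txt" never gets a chance to match.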

I also had a tough time assigning the political party of each POTUS to their respective SOTUs - I actually ended up just encoding them into the filenames and using a different REGEX to find and extract them. When I tried to match and assign based on the presidents' names like in Tatman's tutorial:
democrats = sentiments %>%
  filter(president == c("Jackson", "Polk")) %>%
  mutate(party = "D")

It didn't work - it wouldn't grab all of Jackson and Polk's SOTUs, but just a few. Rerunning the function with different individual names did grab them all, but it was rewriting - not appending - and so still not doing the trick. In the end, encoding the parties into the filenames worked for now, but this is something that seems pretty basic in R and that I should work on understanding.
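My best guess at the culprit is the == in that filter. When one side of == is a two-item vector, R recycles it down the rows (row 1 against "Jackson", row 2 against "Polk", row 3 against "Jackson" again, and so on), so only the rows that happen to line up get kept. The %in% operator asks "is this name anywhere in the list?" instead. The sketch below uses made-up rows standing in for my real table, and the lookup-table join at the end is just one possible way to label every president in a single pass.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for the real sentiments table
sentiments <- tibble(
  president = c("Jackson", "Jackson", "Polk", "Polk", "Lincoln", "Lincoln"),
  year = c(1829, 1830, 1845, 1846, 1861, 1862)
)

# == compares row by row against the recycled vector: only 2 rows survive
withEquals <- filter(sentiments, president == c("Jackson", "Polk"))

# %in% tests membership: all 4 Jackson and Polk rows survive
withIn <- filter(sentiments, president %in% c("Jackson", "Polk"))

# A lookup table joined on president labels every row at once,
# instead of overwriting the result one party at a time
parties <- tibble(president = c("Jackson", "Polk", "Lincoln"),
                  party = c("D", "D", "R"))
labelled <- left_join(sentiments, parties, by = "president")
```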

Also, the "normalSentiment" was not exactly in Tatman's tutorial (though she did offer it as an extra challenge along with a hint about using the nrow() function). I'm using
normalSentiment = sentiment/nrow(tokens)
to get it right now, but is subtracting first and then normalizing the same as normalizing first and then subtracting? In other words, is
normalSentiment = (positive - negative) / tokens
the same as
normalSentiment = (positive/tokens) - (negative/tokens)?
This seems like an embarrassingly simple question, but I'm not sure without checking it. Anyway, below is a snippet of the table I ended up with:
It took some doing, but I was able to assign a party and generate a "normalSentiment" for each SOTU
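On the normalization question above: the two versions should be the same thing, because dividing a difference by a number is the same as dividing each piece by that number and then subtracting. A quick check with made-up counts (wordCount here is a hypothetical stand-in for nrow(tokens) for a single speech):

```r
# Made-up counts for a single speech
positive <- 240
negative <- 117
wordCount <- 5000

subtractFirst <- (positive - negative) / wordCount
normalizeFirst <- (positive / wordCount) - (negative / wordCount)

# all.equal() compares with a small tolerance, which is safer than ==
# for decimal numbers
sameAnswer <- isTRUE(all.equal(subtractFirst, normalizeFirst))
```

The one thing that does matter is that the word count is the count for that particular speech, not for the whole corpus.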

Visualizations and Analysis

The first visualization just showed the "sentiment" score of the texts over time. Here it is with "sentiment" (just the raw positive - negative score) and "normalSentiment" ((positive-negative)/word count):

Obviously, there's a lot going on under the hood here, but it's definitely notable that normalization hits Washington especially hard (Washington owns 2 of the 5 shortest SOTUs by word count, never topping 3,000 words). The normalized sentiment shows a totally different pattern from the "un-normalized" one (if nothing else, the dip in sentiment around the Civil War in the normalized chart makes more sense than the bump in the first chart).

Interestingly, the two charts also seem to suggest different outliers - the first chart shows sentiments tightly clustered (probably reflecting their short length more than actual sentiment) in the early years of the republic, then major outliers in the Taft, Truman, and Carter years. In the normalized version, it looks like Washington was all over the place, but there are fewer off-the-charts outliers overall. This makes sense - Washington's shorter speeches lead to a volatile normalized score because each word is essentially "worth more" in the score, and Carter's insanely positive speech is actually just his giant 33,000+ word written address that he dropped on his way out of office in 1981.
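To put rough numbers on the "worth more" idea, here is a tiny sketch. The swing of 10 words and the 2,000-word speech are made up for illustration; the 33,000 figure is just the approximate size of Carter's written address mentioned above.

```r
# The same swing of 10 net positive words in two speeches of different length
swing <- 10
shortSpeech <- 2000    # hypothetical short early address
longSpeech <- 33000    # roughly the size of Carter's 1981 written address

shortEffect <- swing / shortSpeech
longEffect <- swing / longSpeech
```

The same handful of words moves the short speech's normalized score more than sixteen times as far as the long one's.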

Next up: sentiment by party. The following two charts again show the "raw" and "normalized" sentiments, this time showing box plots by political party. The colors are a little confusing, so read the labels carefully!

Wow! A few things stood out to me here right away. First, note that only George Washington is considered "unaffiliated," and notice how much the size of the unaffiliated box changes depending on whether we normalize or not! John Adams, as our only Federalist, shows a similar effect. Also, there are only four Whig presidents and four Democratic-Republicans. Each president has more than one SOTU, but either way these sample sizes are fairly small. In any case, I was surprised to see the Democrats score lower on sentiment than Republicans in both versions of the chart! I am eager to run this again and break things up more by time - it's important to remember that today's Democratic and Republican parties bear little resemblance to those of the same names in the 19th century.

Moving Forward

First I just have to say, these visualizations are lovely. I'll learn to color-code them more appropriately (blue for Republicans and red for Democrats is simply too hard for my brain to get used to) and manipulate the data and the axes/labels in more detail, but hats off to the developers of these packages for making such inviting visualizations so easy.
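For the color-coding, ggplot2's scale_fill_manual() looks like the tool for the job: you hand it a named vector that maps each party to a color. This is a sketch with made-up scores and only two parties, not my actual chart code.

```r
library(ggplot2)

# Made-up scores, only here so the plot has something to draw
sentiments <- data.frame(
  party = c("D", "D", "D", "R", "R", "R"),
  normalSentiment = c(0.021, 0.034, 0.027, 0.030, 0.025, 0.041)
)

# Named vector: each party gets the color my brain expects
partyColors <- c(D = "blue", R = "red")

p <- ggplot(sentiments, aes(x = party, y = normalSentiment, fill = party)) +
  geom_boxplot() +
  scale_fill_manual(values = partyColors) +
  labs(x = "Party", y = "Normalized sentiment")
```

Printing p draws the chart; the remaining parties would just be more entries in partyColors.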

My next step is to figure out whether I can run sentiment analysis on particular words. This is my ultimate goal, though I'm beginning to worry it might require some more tagging and slicing and dicing of the texts themselves - perhaps breaking them into paragraph-level chunks would work. I also need to learn more about the algorithms, parameters, and word lists that I used in these examples. I would also be interested in trying to pull in some tweets and run some analyses on those, which is something I've seen folks using R for around the web.
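A very rough sketch of what the paragraph-level idea might look like, assuming paragraphs are separated by blank lines. The two-paragraph text and the keyword are made up, and this is only one guess at an approach: split the text into paragraphs, keep the ones that mention the word, and count bing sentiments inside just those paragraphs.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up two-paragraph text standing in for a speech
speech <- "The war was terrible and costly.\n\nOur farms enjoy a wonderful harvest."

# One row per paragraph, numbered in order
paragraphs <- tibble(text = strsplit(speech, "\n\n", fixed = TRUE)[[1]]) %>%
  mutate(paragraph = row_number())

# \\b marks a word boundary, so "war" won't also match "warm" or "toward"
keyword <- "\\bwar\\b"

scores <- paragraphs %>%
  filter(grepl(keyword, text, ignore.case = TRUE)) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(paragraph, sentiment)
```

This only says how positive or negative the surrounding paragraph is, not how the speaker feels about the word itself, so it's a starting point at best.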