Wednesday, October 31, 2018

It's alive! (somewhat)!

I've been plugging away at sentiment analysis in RStudio and want to pause to share some results! This post covers working with RStudio scripts to generate "tidy" data frames and perform some sentiment analysis per chunk (instead of just per document). 

Tidy Data

As I've mentioned in previous posts, a lot of the sentiment analysis packages out there make use of "tidy data." Like Julia Silge explains in her blog post from 2016, tidy data takes the following format:
  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table

So, basically, the goal is to convert the SOTUs into really long, skinny tables, where each occurrence of a word gets its own row. This means the tidy table (stupidly called a "tibble," apparently) keeps track of where words are in relation to each other. This is an improvement (for my purposes) over the "document-term-matrix" I created in my initial text mining steps, which only gave each word a single row along with the number of times it appears in the corpus, listed in alphabetical order (not the order the words appear in the text).

[Image: a chart with SOTU filenames for rows and an alphabetical list of words for column headings, populated with the number of times each term appears in each file.]
The document-term-matrix. Every word is recorded and we can see which files it appears in, but there's no way to track where words are in relation to one another within the files.

[Image: a chart with each word of the first sentence of Washington's first (1790) SOTU as its own row.]
The tidy data frame retains the words' relationships to each other - linenumber isn't working the way I want right now, but it does reset with each file and keep the context of each token (word) intact.
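
For reference, building a frame like the one pictured only takes a few lines with tidytext. Here's a sketch, not my actual script: the "corpus" path and object names are placeholders, and this version numbers actual lines within each file - which, as noted in the caption above, mine isn't quite doing yet.

library(dplyr)    # tibble(), mutate(), bind_rows()
library(readr)    # read_lines()
library(tidytext) # unnest_tokens()

sotu_files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)

tidy_SOTUs <- lapply(sotu_files, function(f) {
  tibble(file = basename(f), text = read_lines(f)) %>% # one row per line of text in the file
    mutate(linenumber = row_number()) %>%              # line position within this file
    unnest_tokens(word, text)                          # one word per row, order preserved
}) %>%
  bind_rows()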


Analysis: Positive Words in Washington's 1792 SOTU

Once the "tidy_SOTUs" data frame was created, I could use it for all kinds of fun analysis that raised a lot of new questions. For example, the top "positive" words in Washington's 1792 SOTU were:
  1.  peace, 5 instances
  2.  present, 5 instances
  3.  found, 4 instances
"Peace" definitely seems positive, but you probably wouldn't be talking about it all the time if there were no threat of war. Most of Washington's mentions are really about the desire for, but absence of, peace:
  • A sanction commonly respected even among savages has been found in this instance insufficient to protect from massacre the emissaries of peace.
  • I can not dismiss the subject of Indian affairs without again recommending to your consideration the [. . .] restraining the commission of outrages upon the Indians, without which all pacific plans must prove nugatory. To enable [. . .] the employment of qualified and trusty persons to reside among them as agents would also contribute to the preservation of peace . . .
  • I particularly recommend to your consideration the means of preventing those aggressions by our citizens on the territory of other nations, and other infractions of the law of nations, which, furnishing just subject of complaint, might endanger our peace with them . . .
So, while it was fun to learn the word "nugatory," qualifying these mentions of "peace" as "positive sentiments" is probably a little misleading.

"Present" I assume is on the positive list from NRC (here, I think, but I need to look more into the exact list included with R, and the methodology for the "crowdsourcing" used to generate the list) as a synonym for "gift," but that's not how Washington is using the word at all (he's referring to the temporal present). 

"Found," again, seems like it would be on the "positive" list because of its sense as recovering something lost, but Washington is always using it in a legalistic context: 
  • a sanction that "has been found in this instance insufficient,"
  • considering the expense of future operations "which may be found inevitable,"
  • a reform of the judiciary that "will, it is presumed, be found worthy of particular attention," and
  • a question about the post office, which, if "upon due inquiry, be found to be the fact," a remedy would need to be considered.
None of these are really strictly negative sentiments, but they are not like the "I found my missing wallet" or "found a new hobby" types of sentiments that I expect led to the word's placement on the "positive" list.
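
For reference, pulling counts like these out of the tidy frame looks something like the following - a sketch that assumes the tidy_SOTUs frame described above, with a guess at the 1792 file name:

library(dplyr)
library(tidytext)

nrc_positive <- get_sentiments("nrc") %>% # just the words NRC tags as "positive"
  filter(sentiment == "positive")

tidy_SOTUs %>%
  filter(file == "1792-Washington-11-6.txt") %>% # placeholder file name
  inner_join(nrc_positive, by = "word") %>%      # keep only the "positive" words
  count(word, sort = TRUE)                       # tally them, most frequent first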


Sentiment Analysis Within Documents

I had already been able to run some sentiment analysis tasks on my corpus of SOTUs, but because they relied on document-term-matrices instead of tidy data, I couldn't run any analysis comparing words or contexts within documents - only between them. The line numbers still weren't working like I wanted (each token in each SOTU file was given its own line number instead of the numbering actually going line-by-line; I tried using as.integer() and integer division to count every ten words as a "line," but that didn't work either). Even so, having the words in order still enabled analysis I couldn't do before.

The first visualization I generated with tidy text was a great feeling. Here's the code with some comments:

library(dplyr)    # provides %>%, inner_join(), count(), and mutate()
library(tidyr)    # provides spread()
library(tidytext) # provides get_sentiments()

SOTU_sentiment <- tidy_SOTUs %>%  # start from tidy_SOTUs and save the result as SOTU_sentiment
  inner_join(get_sentiments("bing")) %>%  # keep only words that appear in the "bing" sentiment lexicon (a different lexicon from the NRC one in the previous example)
  count(file, index = linenumber %/% 80, sentiment) %>% # count positive and negative words per file in chunks of 80 line numbers - since each token currently gets its own line number, that works out to 80 words at a time
  spread(sentiment, n, fill = 0) %>% # spread() (this is why we wanted tidyr) reshapes from long to wide: "negative" and "positive" become their own columns, with 0 filled in where a chunk has none of one or the other
  mutate(sentiment = positive - negative) # finally, calculate "sentiment" by subtracting the negative-word count from the positive-word count, per 80-token chunk

library(ggplot2) # load the ggplot2 library for visualization

ggplot(SOTU_sentiment, aes(index, sentiment, fill = file)) + # plot sentiment per chunk (index), colored by file
  geom_col(show.legend = FALSE) + # draw one column per chunk; don't show a legend
  facet_wrap(~file, ncol = 2, scales = "free_x") # one small plot per file, in 2 columns; "free_x" lets each file's x-axis run only over its own chunks


and here's the resulting visualization from ggplot:
Organizing these file names with month-first was a little dumb and resulted in these being out of order, but ideally the file-names shouldn't make a huge difference for the final product anyway.

It is interesting to note the pattern of most addresses beginning on a positive note. This seemed like a clear enough pattern (with notable enough exceptions in 1792 and 1794) that it was worth looking into before going much further - if the NRC list was seemingly so off-base about "peace" and "present," I wanted to see if these visualizations even meant anything.

Grabbing the first 160 words (two 80-word chunks) from the 1790 and 1794 SOTUs, then comparing them subjectively with their respective charts revealed the following (image contains lots of text, also available as a markdown file on the GitHub repo (raw markdown)):

I have to say, these definitely pass the smell test. 1790 is all congratulatory and full of lofty, philosophic patriotism, while 1794 is about insurrection and taxation. I was especially pleased that, even though 1794 contains a lot of what I would think are positive words (gracious, indulgence, heaven, riches, power, happiness, expedient, stability), the chart still depicts a negative number that is in line with my subjective reading of the text.


Comparing Sentiment Lexicons

Like I mentioned, I have used two lexicons so far in these examples: "NRC" and "Bing" (not the search engine). It's not entirely clear (without more digging) how these lexicons were generated (NRC mentions "crowdsourcing" and Bing just says "compiled over many years starting from our first paper"), but for now, I wanted to at least start by getting a feel for how they might differ. Especially since I'm dealing with texts where people say things like "nugatory" and "burthens" (or even just the difference in word choice between Bush 43 and Obama), it's definitely possible that these lexicons won't be a good fit across more than 200 years of texts.

Fortunately, the process from the "Sentiment Analysis with Tidy Data" chapter I was following had just the thing. I'm realizing that eventually I should make some kind of notebook for this code, so I'll save the full script for that and only sketch its shape below; basically, I ended up with a chart comparing three different sentiment lexicons - AFINN, Bing et al., and NRC - each run over Washington's 1793 SOTU:

It's a good sign that the trends across the three analyses roughly track together. In general, it looks like AFINN skews more positive overall and is a little less volatile than the others, and obviously the three use different scales, but it's nice that the results weren't radically different. 
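
The rough shape of the comparison code, adapted from the "Sentiment Analysis with Tidy Data" approach (this is a sketch rather than my exact script, and the 1793 file name is a placeholder):

library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

washington_1793 <- tidy_SOTUs %>%
  filter(file == "1793-Washington-12-3.txt") # placeholder file name

afinn <- washington_1793 %>%
  inner_join(get_sentiments("afinn")) %>% # AFINN scores each word from -5 to +5
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(score)) %>%   # newer versions of the lexicon call this column "value"
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  washington_1793 %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  washington_1793 %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")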

Lots more to come, but that's plenty for one post. 

Wednesday, October 24, 2018

Forays into Sentiment Analysis

I spent some time working with Rachael Tatman's "Sentiment Analysis in R" tutorial and wanted to blog here about some of the results.

Setting up

I had already done some basic text mining and tokenization of my State of the Union corpus (which, as I've mentioned before, is incomplete and inaccurate right now). Hilariously, Tatman's tutorial also used some State of the Union addresses as its corpus, although it only contains a few years' worth of SOTUs. I was working with about 215 - not the full set, but certainly a better sample size than the few in the tutorial. 

The first steps were just loading in the requisite R packages: tidyverse, tidytext, glue, and stringr. Tidyverse took quite a while to install and required several rounds of reading error messages, installing system dependencies with apt, trying the install again, and waiting for the next error. It would be nice if I could install these packages through apt, or if R could download its own system dependencies (or even just give a clearer error message!), but whatever. 

With those packages loaded, I used glue to write a "fileName" vector (I guess that's what we're calling them in R?) that includes the full file path and name, and a "fileText" vector that we then turn into a data_frame and tokenize into "tokens." I also learned the fun fact that dollar signs are a special character in regular expressions, so they need to be escaped with \\ in order to operate on them. Hooray for learning! Really, I should have guessed/figured this out before, but I know now.
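
Here's a rough sketch of what that setup looks like for a single file (the "corpus/" path is a placeholder, and the real script loops over every file in the folder):

library(glue)      # builds the file path strings
library(tidyverse) # read_file() and the rest of the pipeline
library(tidytext)  # unnest_tokens()

f = "1801-Jefferson-12-8.txt"
fileName = glue("corpus/{f}")        # full path to the file
fileText = read_file(fileName)       # the whole text as one string
fileText = gsub("\\$", "", fileText) # "$" is a regex special character, so it has to be escaped
tokens = tibble(text = fileText) %>% # a one-row data frame...
  unnest_tokens(word, text)          # ...tokenized into one word per row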

Counting Sentiments

The nice thing about tidytext was that I already had some sentiment lexicons loaded (well, "some" in theory, only "bing" worked for me) and ready to use. Happily, the sentiment analysis function worked on my first try, and was pretty fun:
Even though I'm really just executing instructions other people wrote, running this the first few times was a pretty satisfying feeling

But this also raised some questions. If this was just one list, how would other lists change the results? What exactly was this spread() function doing - Tatman's comments say she "made data wide rather than narrow" which, sure. I don't know what that means. Nonetheless, I was looking for proof-of-concept at this point, not a robust and fully theorized methodology, so I continued on.
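
For my own future reference, here's a toy illustration of what I think the "wide rather than narrow" thing means (the numbers are made up):

library(dplyr)
library(tidyr)

# "narrow"/long data: one row per file-sentiment pair
long = tibble(
  file      = c("1790.txt", "1790.txt", "1794.txt", "1794.txt"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(12, 3, 5, 9)
)

# "wide" data: spread() turns the sentiment labels into their own columns,
# so the positive and negative counts sit side by side and can be subtracted
wide = long %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)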

I ran into some fun issues with REGEX here - Tatman's tutorial used REGEX to pull attributes from the filenames in the corpus and assign them to attributes of the text. My filenames were in a slightly different format from the tutorial's, and looked like this:

  • 1801-Jefferson-12-8.txt
  • 1945-FDR-1-6.txt

and so on. I couldn't quite figure out how to match the names, and my clumsy solution of using ([[:alpha:]]{4,}) to find strings of letters longer than 3 characters (so as to avoid picking up "txt" in every file name) was actually missing FDR (used instead of "Roosevelt" since there are 2 Roosevelt presidents). I posted a "help wanted" issue on GitHub, and fortunately @RJP43 responded and helped me find a solution. Some other random guy actually replied, too, but he deleted his (perfectly helpful) response, which was a bummer.
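
For posterity, the working approach looks something like this (a sketch with placeholder object names, not the exact code from the repo):

library(stringr)

# filenames look like "1801-Jefferson-12-8.txt" or "1945-FDR-1-6.txt"
fileName = c("1801-Jefferson-12-8.txt", "1945-FDR-1-6.txt")
parts = str_match(basename(fileName), "^(\\d{4})-([[:alpha:]]+)-")
year = parts[, 2]      # "1801", "1945"
president = parts[, 3] # "Jefferson", "FDR" - no minimum length, so FDR isn't skipped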

I also had a tough time assigning the political party of each POTUS to their respective SOTUs - I actually ended up just encoding them into the filenames and using a different REGEX to find and extract them. When I tried to match and assign based on the presidents' names like in Tatman's tutorial:
democrats = sentiments %>%
  filter(president == c("Jackson", "Polk")) %>%
  mutate(party = "D")

It didn't work - it wouldn't grab all of Jackson and Polk's SOTUs, but just a few. Rerunning the function with different individual names did grab them all, but it was rewriting - not appending - and so still not doing the trick. In the end, encoding the parties into the filenames worked for now, but this is something that seems pretty basic in R and that I should work on understanding.
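
Looking at this again, I suspect the culprit is that == recycles the two names down the president column (so it only matches every other row), while %in% checks each row against the whole set. A sketch of what I probably should have written:

library(dplyr)

democrats = sentiments %>%
  filter(president %in% c("Jackson", "Polk")) %>% # %in% keeps any row whose president is in the set
  mutate(party = "D")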

Also, the "normalSentiment" was not exactly in Tatman's tutorial (though she did offer it as an extra challenge along with a hint about using the nrow() function). I'm using
normalSentiment = sentiment/nrow(tokens)
to get it right now, but is doing the sentiment subtraction without normalization first the same as doing the normalization before the subtraction? In other words:
normalSentiment = (positive - negative) / tokens,
or
normalSentiment = (positive/tokens)-(negative/tokens)?
This seems like an embarrassingly simple question, but I'm not sure without checking it. Anyway, below is a snippet of the table I ended up with:
It took some doing, but I was able to assign a party and generate a "normalSentiment" for each SOTU

Visualizations and Analysis

The first visualization just showed the "sentiment" score of the texts over time. Here it is with "sentiment" (just the raw positive - negative score) and "normalSentiment" ((positive-negative)/word count):






Obviously, there's a lot going on under the hood here, but it's definitely notable that normalization hits Washington especially hard (Washington owns 2 of the top 5 shortest SOTUs by word count, never topping 3,000 words). The normalized sentiment shows a totally different pattern than the "un-normalized" one (if nothing else, the dip in sentiment around the Civil War in the normalized chart makes more sense than the bump in the first chart).

Interestingly, the two charts also seem to suggest different outliers - the first chart shows sentiments tightly clustered (probably reflecting their short length more than actual sentiment) in the early years of the republic, then major outliers in the Taft, Truman, and Carter years. In the normalized version, it looks like Washington was all over the place, but there are fewer off-the-charts outliers overall. This makes sense - Washington's shorter speeches lead to a volatile normalized score because each word is essentially "worth more" in the score, and Carter's insanely positive speech is actually just his giant 33,000+ word written address that he dropped on his way out of office in 1981.
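
For reference, plots like these are only a few lines of ggplot2. Here's a sketch, assuming a summary frame (called "sentiments" here) with one row per SOTU and columns for year, president, sentiment, normalSentiment, and party - the names are my guesses, not the exact ones in my script:

library(ggplot2)

ggplot(sentiments, aes(x = year, y = normalSentiment)) +
  geom_point(aes(color = president), show.legend = FALSE) + # one point per SOTU
  geom_smooth(method = "auto") # swap in y = sentiment for the un-normalized chart

# and the by-party box plots below are essentially:
# ggplot(sentiments, aes(x = party, y = normalSentiment, fill = party)) +
#   geom_boxplot()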

Next up: sentiment by party. The following two charts again show the "raw" and "normalized" sentiments, this time as box plots by political party. The colors are a little confusing, so read the labels carefully!


Wow! A few things stood out to me here right away. First, note that only George Washington is considered "unaffiliated," and notice how much the size of the unaffiliated box changes depending on whether we normalize or not! John Adams, as our only Federalist, shows a similar effect. Also, there are only four Whigs and four Democratic-Republicans; each of these presidents has more than one SOTU, but either way the sample sizes are fairly small. In any case, I was surprised to see the Democrats score lower on sentiment than Republicans in both versions of the chart! I am eager to run this again and break things up more by time - it's important to remember that today's Democratic and Republican parties bear little resemblance to those of the same names in the 19th century.

Moving Forward

First I just have to say, these visualizations are lovely. I'll learn to color-code them more appropriately (blue for Republicans and red for Democrats is simply too hard for my brain to get used to) and manipulate the data and the axes/labels in more detail, but hats off to the developers of these packages for enabling such simple and inviting visualizations so easily. 

My next step is to find out whether I can run sentiment analysis on particular words. This is my ultimate goal, though I'm beginning to worry it might require some more tagging and slicing and dicing of the texts themselves - perhaps breaking them into paragraph-level chunks would work. I also need to learn more about the algorithms, parameters, and word lists that I used in these examples. I would also be interested in trying to pull in some tweets and run some analyses on those - something I've seen folks using R for around the web. 

Static Server is Up!

Another lengthy post about text mining and sentiment analysis is coming soon, but for now: I've successfully deployed a Tomcat server to my sotu-db.cs.luc.edu machine. Right now, the only thing hosted there is the extremely simple static search box, but you can access it live now at http://sotu-db.cs.luc.edu:8080/sotu-db/. The search button doesn't actually do anything, and I still have a lot of implementation details to work out. But this is a nice step.

I'm using Tomcat 9 for the server, and I followed the walkthrough here to get things set up. It seems like Tomcat has some flexibility in terms of hosting JS apps and stuff that could be useful for later phases of the project. Right now, however, I'm thinking the solution might look something like enter search term > pipe term into a new document with touch > feed that document to RStudio somehow and save results into new document > serve the new document as a static page. There's a lot I think I'm missing here in terms of security, but right now that's not a major concern for this project (people won't be creating usernames or passwords, and the corpus is public anyway).
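
A very rough R-side sketch of that idea (every path, file name, and object here is hypothetical):

library(dplyr)
library(readr)

# hypothetical: the servlet writes the submitted search term to a text file...
term = read_lines("/var/sotu-db/queries/latest.txt")[1]

# ...an R script filters the tidy corpus for that term and tallies it per SOTU...
results = tidy_SOTUs %>%
  filter(word == tolower(term)) %>%
  count(file, sort = TRUE)

# ...and writes out a static page that Tomcat can serve
html = c("<html><body><h1>SOTU-db results</h1><ul>",
         paste0("<li>", results$file, ": ", results$n, "</li>"),
         "</ul></body></html>")
write_lines(html, "/var/sotu-db/static/results.html")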

it's not much to look at, but this screen and the fact that anyone can see it by visiting this address are the products of a lot of work!

Tuesday, October 23, 2018

Progress Report: Scripting in R



Thanks to Dr. George Thiruvathukal of LUC's Computer Science department (my final DH project co-advisor), I have an RStudio (the IDE for R) server running on a machine hosted by LUC - which, as I'm writing this, I realize I should really point to from rserver.sotu-db.com or something like that (edit: compute-r.sotu-db.com now leads directly to the RStudio login page). For more on setting up the server itself, see my blog post "Setting up an RStudio Server" at blog.tylermonaghan.com. This post is about working with some RStudio scripts and packages that I found by following a few online tutorials. I'm already a day behind my Oct. 22 deadline, but I think it's worth taking time to blog about what I've found so far, because there's a lot going on.

Basic Text Mining in R

For this section, I followed the excellent tutorial "A gentle introduction to text mining using R" at the "Eight to Late" blog. The post is a few years old, but still works well and does a good job of explaining what's going on in the code.

One main way I deviated from the "gentle introduction" tutorial was by using my own corpus of "State of the Union" texts (1790-2006, plus the 2016 and 2018 annual presidential remarks to Congress - for more on which texts are technically "State of the Unions," which were delivered as verbal remarks versus written reports, and more, see UC-Santa Barbara's American Presidency Project, which, goodness, has changed a lot recently; I'll have to grab and archive a copy of the old version, which was more compact and data-dense). The tutorial used a corpus of the author's own blog posts, a clever idea which I would enjoy trying with the posts on this blog once I have accumulated enough to make it interesting! 

Another couple of things I did differently from the tutorial: I used equals signs (=) instead of arrows (<-) because I think those arrows are weird. Also, instead of just rewriting the same SOTUs corpus (called "docs" in the tutorial) at each step, I incremented the variable name so that I would end up with "SOTUs," "SOTUs2," "SOTUs3," etc. I did this to capture each transformation step by step, preserving the ability to reverse particular steps and document things along the way. Even though it cluttered up the environment a little, I thought it was well worth it.
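
The incrementing looks something like this (a trimmed sketch of the tutorial's tm pipeline, not my full script):

library(tm)

SOTUs = Corpus(DirSource("corpus")) # "corpus" is a placeholder path
SOTUs2 = tm_map(SOTUs, content_transformer(tolower))
SOTUs3 = tm_map(SOTUs2, removePunctuation) # the kind of step that fuses "fellow-citizens" into "fellowcitizens"
SOTUs4 = tm_map(SOTUs3, removeNumbers)
SOTUs5 = tm_map(SOTUs4, removeWords, stopwords("english"))
SOTUs6 = tm_map(SOTUs5, stripWhitespace)
# every intermediate object sticks around, so any step can be inspected or re-run later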


Transforming data step by step





These image galleries on Blogger really stink; I really need to migrate this blog somewhere else...

Anyway, you can already see how "fellow citizens" got combined somewhere between SOTUs2 and SOTUs3, and stayed that way for all future steps. Also, as I've written about earlier on this blog, removing stopwords will be appropriate for some textual analysis and distant reading, but I don't think it's necessary or wise when running sentiment analysis - something to keep in mind.

One pretty funny issue I ran into was with dollar signs. I didn't realize, but they're special characters in regular expressions, so they need to be escaped in pattern-matching functions like gsub(). I could have just stripped the dollar signs out, but talking about money could be an interesting part of analyzing State of the Union texts, so I wanted to replace "$" with "dollars." First I tried using the toSpace transformer I'd built for colons and hyphens:

SOTUs2 = tm_map(SOTUs2, toSpace, "$")


This didn't really do anything, so next I tried

toDollars = content_transformer(function(x,pattern) {return (gsub(pattern, "dollars", x))})
#use toDollars to change dollar signs to "dollars"
SOTUs3 = tm_map(SOTUs2, toDollars, "$")

which again, didn't do much besides add the word "dollarsdollars" to the end of each document.

Finally, I tried

replaceDollars = function(x) gsub("$","dollars",x)

Which again, didn't work - dollar signs were still in the texts! So, just to try, I replaced $ with ? in the pattern. Looking back, I'm not sure what I expected, but I was pretty entertained when I turned Thomas Jefferson's 1808 State of the Union report into an endless string of the word "dollars":


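For the record, the fix I eventually learned (see the more recent post above) is just to escape the dollar sign in the pattern - a sketch reusing the toDollars transformer:

SOTUs3 = tm_map(SOTUs2, toDollars, "\\$") # "\\$" matches a literal dollar sign
# or skip regex matching entirely:
# SOTUs3 = tm_map(SOTUs2, content_transformer(function(x) gsub("$", "dollars", x, fixed = TRUE)))
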
Ultimately, I will not be using any of these as my working texts for SOTU-db, so it's not as important to fix each error ("fellowcitizens") as it is to understand where and how errors are being introduced, and to ensure that I am thinking about and accounting for them. In the end, I did get a visualization out of it, which is fun:

This basic counting is a far cry from sentiment analysis or really anything particularly revelatory at all, but it's a good marker of progress for me on this project! Also, the fact that "will," a positive, constructive word, is the top word feels slightly encouraging to me in this time when patriotic fervor seems synonymous with a mistrust of all our democratic institutions. Don't ask me what it means that "state" is #2, though! Stay tuned for more R and sentiment analysis coming very soon!

Tuesday, October 9, 2018

Milestone: Frontend options

Today, 10/10, was my deadline for completing the SOTU-db frontend options, and I'm pleased that I have very basic HTML and React Native front pages. They each consist solely of a title, a search bar, and a submit button, but they exist. It's worth starting to think about some pros and cons as I move into my next milestone: working on the sentiment analysis portion of the project.

The HTML version has the added benefit of being able to use GET requests with the submit button. This would allow users to save URLs with their search terms inside them. The other limitations of GET (like the lack of URL privacy and the size limit) don't seem like they would be issues for this project. There's definitely an appealing brutalism to the simple HTML version I have now.

The React Native version has the benefit of being a newer tool that I could use some experience with. I like the way it's designed with mobile in mind, and I think this version would feel more fun and responsive on a mobile device. The downsides are that it's much more complicated for me to use (since I'm new to React Native and JS in general) and I'm not quite sure I could use the URL-saving that HTML GET would allow.

Fortunately I don't have to make a final decision on this for another month or so, but it's good to have these considerations in mind as I keep working!

Next stop, a home-baked R Server and getting some sentiment analysis up and running!