Tuesday, October 23, 2018

Progress Report: Scripting in R

Thanks to Dr. George Thiruvathukal of LUC's Computer Science department (my final DH project co-advisor), I have RStudio Server (the IDE for R) running on a server hosted by LUC. As I'm writing this, I realize I should really point to it from rserver.sotu-db.com or something like that (edit: compute-r.sotu-db.com now leads directly to the RStudio login page). For more on setting up the server itself, see my blog post "Setting up an RStudio Server" at blog.tylermonaghan.com. This post is about working with some R scripts and packages that I found by following a few online tutorials. I'm already a day behind my Oct. 22 deadline, but I think it's worth taking time to blog about what I've found so far, because there's a lot going on.

Basic Text Mining in R

For this section, I followed the excellent tutorial "A gentle introduction to text mining using R" at the "Eight to Late" blog. The post is a few years old, but still works well and does a good job of explaining what's going on in the code.

One main way I deviated from the "gentle introduction" tutorial was by using my own corpus of "State of the Union" texts (1790-2006, plus the 2016 and 2018 annual presidential remarks to Congress). For more on which texts are technically "States of the Union," which were delivered as verbal remarks versus written reports, and more, see UC-Santa Barbara's American Presidency Project (which, goodness, has changed a lot recently - I'll have to grab and archive a copy of the old version, which was more compact and data-dense). The tutorial used a corpus of the author's own blog posts, a clever idea which I would enjoy trying with the posts on this blog once I have accumulated enough to make it interesting!

Another couple of things I did differently from the tutorial: I used equals signs (=) instead of arrows (<-) because I think the arrows are weird. Also, instead of repeatedly overwriting the same SOTUs corpus (called "docs" in the tutorial), I often incremented the variable name so that I would end up with "SOTUs," "SOTUs2," "SOTUs3," etc. I did this to preserve each transformation step by step, keeping the ability to reverse particular steps and document things along the way. Even though this cluttered up the environment a little, I thought it was well worth it.
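To make that concrete, here's a minimal sketch of that step-by-step naming, using base R as a stand-in for the tm pipeline (the sample strings and steps here are just illustrations, not my actual corpus code):

```r
# Each cleaning step writes to a new variable, so every intermediate
# stage stays inspectable and any step can be revisited later.
# Note "=" and "<-" behave the same for top-level assignments like these.
SOTUs  = c("Fellow-Citizens of the Senate:", "The State of the Union is strong.")
SOTUs2 = tolower(SOTUs)                    # step 1: lowercase everything
SOTUs3 = gsub("[[:punct:]]", " ", SOTUs2)  # step 2: punctuation to spaces
SOTUs3  # SOTUs and SOTUs2 are still around if a step needs reversing
```

The trade-off is exactly the one mentioned above: a few extra objects in the environment in exchange for a documented, reversible history.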

Transforming data step by step

These image galleries on Blogger really stink, I really need to migrate this blog to somewhere else...

Anyway, you can already see how "fellow citizens" got combined somewhere between SOTUs2 and SOTUs3, and stayed that way for all future steps. Also, as I've written about earlier on this blog, removing stopwords will be appropriate for some textual analysis and distant reading, but I don't think it's necessary or wise when running sentiment analysis - something to keep in mind.

One pretty funny issue I ran into was with dollar signs. I didn't realize, but "$" is a special character in the regular expressions that R's pattern-matching functions (like gsub) use, so it needs to be escaped in any patterns you write. I wanted the dollar signs handled, and since talking about money could be an interesting part of analyzing State of the Union texts, I wanted to replace "$" with "dollars." First I tried building a toSpace transformer like I'd done with colons and hyphens:

SOTUs2 = tm_map(SOTUs2, toSpace, "$")

This didn't really do anything, so next I tried

toDollars = content_transformer(function(x,pattern) {return (gsub(pattern, "dollars", x))})
#use toDollars to change dollar signs to "dollars"
SOTUs3 = tm_map(SOTUs2, toDollars, "$")

which, again, didn't do much besides adding the word "dollarsdollars" to the end of each document.
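The culprit in both attempts, as I later worked out: in a regular expression, a bare "$" doesn't match a dollar sign at all - it anchors the end of the string. You can see the same behavior with plain gsub on a made-up sentence:

```r
# A bare "$" in a regex matches the (empty) end of the string,
# so gsub leaves the literal dollar sign alone and appends instead:
gsub("$", "dollars", "It cost $100.")
#> [1] "It cost $100.dollars"
```

That's exactly why each document picked up a stray "dollars" at the end while the actual dollar signs survived.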

Finally, I tried

replaceDollars = function(x) gsub("$","dollars",x)

which, again, didn't work - dollar signs were still in the texts! So, just to try, I replaced the $ in the pattern with a ?. Looking back, I'm not sure what I expected, but I was pretty entertained when I turned Thomas Jefferson's 1808 State of the Union report into an endless string of the word "dollars":
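For completeness, here's the fix that the escaping rule implies - a small sketch rather than my actual SOTU-db code: either escape the metacharacter with a double backslash, or tell gsub to treat the pattern as a literal string.

```r
# Escaping the "$" (or passing fixed = TRUE) makes gsub match a literal dollar sign:
gsub("\\$", "dollars", "It cost $100.")
#> [1] "It cost dollars100."
gsub("$", "dollars", "It cost $100.", fixed = TRUE)  # same result, no regex at all
```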

Ultimately, I will not be using any of these as my working texts for SOTU-db, so it's not as important to fix each individual error (like "fellowcitizens") as it is to understand where and how errors are being introduced, and to ensure that I am thinking about and accounting for them. In the end, I did get a visualization out of it, which is fun:

This basic counting is a far cry from sentiment analysis or really anything particularly revelatory at all, but it's a good marker of progress for me on this project! Also, the fact that "will," a positive, constructive word, is the top word feels slightly encouraging to me in this time when patriotic fervor seems synonymous with a mistrust of all our democratic institutions. Don't ask me what it means that "state" is #2, though! Stay tuned for more R with sentiment analysis coming very soon!
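As a footnote, the basic counting behind that visualization can be sketched in a few lines of base R - a toy stand-in for tm's term-document matrix, with a made-up sentence rather than a real SOTU:

```r
# Split a toy document into words and tabulate frequencies,
# the same basic counting that produces a top-words chart.
words = tolower(unlist(strsplit("The state of the Union is strong and the Union will endure", " ")))
freq  = sort(table(words), decreasing = TRUE)
head(freq, 3)  # the most frequent words in this toy document
```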