Thursday, March 15, 2018

Working with R

This post outlines my first forays into working with R, "a language and environment for statistical computing and graphics" (about R).

One of my secondary goals for this project is to enable users to perform textual analyses and visualizations using R. R seems to get a lot of buzz in the DH community right now, and I'm interested to learn what it can do. I wrote my first lines of (Processing) code just months ago, so I don't expect to become fluent in R by May; I just want to dabble and see what's possible.

Over the past couple of days I have been following along with various online tutorials, trying to do some basic analyses that I already understand pretty well, like word counts and frequency comparisons. The easiest tutorial for me to follow has been "A gentle introduction to text mining using R" on the "Eight to Late" blog.

Following this blog and the advice of many, many others, I'm using RStudio to work with R. I'm beginning to understand the process of loading packages and the syntax of R. It seems that most of what I'm doing is defining a variable on the left and, on the right of an equals sign (or a <- arrow, as in the tutorial, though that's a little confusing for me), the operations whose results I want to store in that variable. The tutorial works through a corpus provided by the blog (of archived blog posts), but I've been using my Gutenberg SOTU documents as my corpus instead.

So far I've successfully stripped punctuation and numbers, and converted the text to lowercase. But when I try to delete the default stopwords, RStudio just runs and runs, never finishing the operation. I plan to let it run overnight; perhaps my corpus really is large enough, and the task so much more complicated than removing numbers or punctuation, that it simply needs more time. We'll see! (Update: it took several hours, but it looks like removing stopwords was finally successful!) However, on examining a couple of SOTUs after this transformation, I am realizing certain stopwords could be of great interest and should probably be retained, especially pronouns like "we," "us," "they," and "them."

That realization raises some questions. How will differences in stopword lists and in algorithms for stemming and lemmatizing (which is my next step) affect the results of analytical operations on the text? How much of this is necessary to surface to the user, and how much should be hidden? Would it be feasible and advisable to give users the option to work with different corpora that have had different cleaning processes applied?
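For the record, here's a minimal sketch of what this cleaning pipeline looks like using the tm package (the one the tutorial relies on). The folder name is hypothetical; point DirSource at wherever your own plain-text files live.

    library(tm)

    # Build a corpus from a folder of plain-text files
    # (hypothetical path; substitute your own documents).
    sotu_raw <- VCorpus(DirSource("sotu_texts/", encoding = "UTF-8"))

    # Save each transformation as its own corpus, keeping a
    # record of every stage of the process.
    sotu_nopunct <- tm_map(sotu_raw, removePunctuation)
    sotu_nonums  <- tm_map(sotu_nopunct, removeNumbers)
    sotu_lower   <- tm_map(sotu_nonums, content_transformer(tolower))

    # Remove the default English stopwords, minus the pronouns
    # that could be analytically interesting.
    keep_words   <- c("we", "us", "they", "them")
    custom_stops <- setdiff(stopwords("english"), keep_words)
    sotu_nostops <- tm_map(sotu_lower, removeWords, custom_stops)

    # The next step, stemming, would look something like this
    # (it requires the SnowballC package):
    # sotu_stemmed <- tm_map(sotu_nostops, stemDocument)

Assigning each stage to a new object is also what fills the Environment pane with all those corpora, as in the screenshot below.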

[Screenshot: an RStudio window showing a console with various commands being entered, and the Environment pane listing the different SOTU corpora and values.]
I saved each step along the way as a separate corpus so I would have a record of each stage of the process.

I'm confident I'll be able to put R to use for some basic textual mining and analysis, but I'm not sure whether I'll be able to get into more advanced techniques like topic modeling (which interests me more). And it's doubtful I'll find a use for it that couldn't already be served by Voyant Tools or HTRC. Regardless, I feel confident that using R to export data and then using my Processing tool, Grapher, to visualize that data will be an excellent learning experience, right at my level as a budding developer. This will be the first time I've really been able to integrate data that holds interest for me as a humanist and historian into my work with programming.
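As a first experiment in that direction, here's a hypothetical sketch of the export step, continuing the object names from the cleaning sketch above: tally term frequencies across the corpus and write them to a CSV that Grapher (or any Processing sketch) could load.

    # Build a document-term matrix from the cleaned corpus and
    # sum each term's count across all of the SOTU addresses.
    dtm   <- DocumentTermMatrix(sotu_nostops)
    freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

    # Write word/count pairs to a CSV for visualization elsewhere,
    # e.g. via Processing's loadTable("sotu_frequencies.csv", "header").
    write.csv(
      data.frame(word = names(freqs), count = freqs),
      file = "sotu_frequencies.csv",
      row.names = FALSE
    )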

After a recent class with Dr. Thiruvathukal, my de facto faculty adviser for this project, I'm also interested in exploring whether I can use the Natural Language Toolkit (NLTK) to dive more deeply into the word choices American presidents have made in their SOTU addresses. Stay tuned for developments on that front, hopefully coming soon!

edited 3/16/2018 to show that stopword removal was eventually successful
