One of my secondary goals for this project is to enable users to perform textual analyses and visualizations using R. R seems to get a lot of buzz in the DH community right now, and I'm interested to learn what it can do. I wrote my first lines of (Processing) code just months ago, so I don't expect to become fluent in R by May; I just want to dabble and see what's possible.
Over the past couple of days I have been following along with various online tutorials, trying to do some basic analyses I already understand pretty well, like word counts and frequency comparisons. The easiest tutorial for me to follow has been "A gentle introduction to text mining using R" on the "Eight to Late" blog.
Following this blog and the advice of many, many others, I'm using RStudio to work with R. I'm beginning to understand the process of loading packages and the syntax of R. Most of what I'm doing seems to be defining a variable on the left and, on the right of an equals sign (or the <- arrow used in the tutorial, which I still find a little confusing), the operations whose results I want to store in that variable. The tutorial uses a corpus provided by the blog (of its own archived posts), but I've been substituting my Gutenberg SOTU documents as my corpus. I've successfully stripped punctuation and numbers and converted the text to lowercase.
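For anyone curious, here's a minimal sketch of what those steps look like, assuming the tm package the tutorial uses; the "sotu_texts" folder name is just a placeholder for wherever the Gutenberg files live:

```r
# A sketch of the cleaning steps, assuming the tm package from the tutorial;
# "sotu_texts" is a hypothetical folder of plain-text SOTU files.
library(tm)

docs <- Corpus(DirSource("sotu_texts"))             # load every file in the folder
docs <- tm_map(docs, removePunctuation)             # strip punctuation
docs <- tm_map(docs, removeNumbers)                 # strip numbers
docs <- tm_map(docs, content_transformer(tolower))  # convert to lowercase
```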
I saved each stage along the way as a separate corpus so I would have a record of every step of the process.
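In practice that just means assigning each stage to a new variable instead of overwriting the previous one; something like this sketch (the variable names are my own placeholders):

```r
# Keeping a separate corpus for each stage of the cleanup
# (variable names are hypothetical, not from the tutorial)
library(tm)

corpus_raw       <- Corpus(DirSource("sotu_texts"))
corpus_nopunct   <- tm_map(corpus_raw, removePunctuation)
corpus_nonumbers <- tm_map(corpus_nopunct, removeNumbers)
corpus_lower     <- tm_map(corpus_nonumbers, content_transformer(tolower))
```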
I'm confident I'll be able to put R to use for some basic text mining and analysis, but I'm not sure whether I'll be able to get into more advanced techniques like topic modeling (which interests me more). And it's doubtful I'll find a use for it that Voyant Tools or HTRC can't already handle. Regardless, I feel confident that using R to export data and then using my Processing tool, Grapher, to visualize that data will be an excellent learning experience, right at my level as a budding developer. This will be the first time I've really been able to integrate data that holds interest for me as a humanist and historian into my work with programming.
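The R-to-Grapher handoff would probably look something like the sketch below: compute word frequencies in R, then write them out as a CSV that the Processing sketch can read (the output file name is hypothetical):

```r
# Sketch: count word frequencies and export them as a CSV for Processing.
# Assumes "docs" is the cleaned corpus from the earlier steps.
library(tm)

dtm   <- DocumentTermMatrix(docs)        # documents as rows, words as columns
freqs <- colSums(as.matrix(dtm))         # total count for each word
freqs <- sort(freqs, decreasing = TRUE)  # most frequent words first

# write a two-column CSV (word, count) that Grapher could load
write.csv(data.frame(word = names(freqs), count = freqs),
          "sotu_word_counts.csv", row.names = FALSE)
```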
After a recent class with Dr. Thiruvathukal, my de facto faculty adviser for this project, I'm also interested in exploring whether I can use the Natural Language Toolkit (NLTK) to dive more deeply into the word choices American presidents have made in their SOTU addresses. Stay tuned for developments on that front, hopefully coming soon!
Edited 3/16/2018 to show that stopword removal was eventually successful.
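For reference, a minimal sketch of how that stopword-removal step typically looks in tm, assuming the same corpus as above:

```r
# Stopword removal using tm's built-in English stopword list
# (assumes "docs" is the cleaned corpus from the earlier steps)
library(tm)

docs <- tm_map(docs, removeWords, stopwords("english"))
```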