Thursday, February 28, 2019

February 2019 Development Update

It's been an eventful couple of weeks for me, in Chicago, and in State of the Union world, but not so much for SOTU-db.

My new position at the Chicago History Museum is in full swing, including some recent and upcoming weekend events. It's also been a wild few weeks of weather as the "polar vortex" and plenty of snow have blanketed Chicago. Together with a busy year in my personal life, this has prevented me from getting as much work done on SOTU-db in recent weeks as I would have liked.

Sadly, this work-slowdown coincided with one of the more interesting periods of SOTU news in recent memory. Originally scheduled for January 29, President Trump's 2019 "State of the Union" address was postponed when Speaker Pelosi delivered a Jan. 16 letter to Mr. Trump postponing the address until the conclusion of the then-ongoing partial federal government shutdown. By January 25, the federal government was fully re-opened (quite possibly temporarily - stay tuned) and the SOTU was quickly rescheduled for Tuesday, February 5.

"The Daily" podcast by the New York Times ran an episode about the State of the Union the morning after Trump's SOTU, which included this wonderful exchange:
Michael: "I really don't like this language people use: 'SOTU.'"
Mark: "Yeah, it's really awful --"
Michael [crosstalk]: "awful, so let's not use it..."
Mark: "... it's kind of a dismal evening anyway, and calling it SOTU just sort of deepens the gloom that hangs over the whole thing."
So that was fantastic - but the overall episode examined how the last four presidents have responded in their SOTU addresses to losing their party's majority in Congress. It was definitely worth a listen and speaks to one of the original questions SOTU-db was meant to address: do State of the Union addresses actually matter, and is there a correlation between what a president says in a SOTU and their actual policy decisions? The podcast suggests that they do matter, and that the correlation does exist.

Sadly that must be it for this post - it's already Feb 28 so I'm out of time for my monthly update. I'm hopeful that I can get back to a more regular SOTU-db update schedule once History Fair season wraps up for this year.


Tuesday, January 22, 2019

January 2019 Development Update

A quick update on SOTU-db:

  • SOTU-db's servers were offline for several weeks around the new year, but have now been restored
  • I successfully defended SOTU-db as my Master's capstone project, and have now completed my MA in Digital Humanities
  • I recently took a new full-time position working with the Chicago Metro History Fair (blog post about the new role here)
  • Due to a combination of the above three factors, development of SOTU-db has been at a standstill in recent weeks
I'm hoping to resume work on SOTU-db now at a pace that reflects its status as a hobby/passion project. I'm not committing to further development milestones right now, but I plan to post on this blog at least once a month with the latest updates. I will likely also be making much heavier use of commit messages and version tags on GitHub. I still fully intend to bring SOTU-db into a "full release," but it could be quite a while before that happens.

It's not realistic to think many more features will be completed or bugs will be squashed before the 2019 SOTU address, whether or not it ends up postponed by the partial government shutdown. Nevertheless, I would like to be able to promote the site via hashtags whenever the SOTU actually is delivered, so for now my main focus will be clarity and usability, so that new users can figure out what the site is doing.

Sunday, December 9, 2018

Alpha 3.0

SOTU-db version alpha 3.0 is now available at www.sotu-db.com.


  • Enter a string to perform a full-text search across a corpus of SOTUs from 1978-2017; sentiment values from four sentiment lexicons are returned for each occurrence of your string (by sentence)
  • Choose from a selection of 13 SOTUs to examine by year; SOTU-db returns a chart for each SOTU at the user's chosen chunk size (four options) along with the text used for processing
  • Viewer and Extras are present but need improvement and may not display correctly
  • Documentation and Credits likewise exist but need improvement

Thursday, November 29, 2018

Alpha Release

It's been a long night of working on SOTU-db so this will be brief:

SOTU-db is now in alpha release! Full-text searching is not yet enabled, but users can search for a limited set of SOTUs via text search or filter-list. Users can also select a "chunkSize," which determines how many words SOTU-db considers in each chunk of sentiment analysis. SOTU-db will (after a brief pause...) return the cleaned (lowercase, no punctuation) text of the selected SOTU along with a chart showing the relative sentiment (positive minus negative) of each chunk of text within that SOTU, according to the three different sentiment lexicons available in the tidytext package for R.
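For the curious, the heart of that chunked-sentiment pipeline looks roughly like this (a sketch rather than the production script; sotu_text and the chunk size of 500 are placeholder assumptions):

library(dplyr)
library(tidytext)
library(tidyr)

# sketch: split one cleaned SOTU into fixed-size chunks and score each chunk
chunk_size <- 500                                          # placeholder for the user's chunkSize
words <- tibble(word = unlist(strsplit(sotu_text, "\\s+")))

sentiment_by_chunk <- words %>%
  mutate(chunk = (row_number() - 1) %/% chunk_size) %>%    # assign each word to a chunk
  inner_join(get_sentiments("bing"), by = "word") %>%      # keep words found in the Bing lexicon
  count(chunk, sentiment) %>%                              # tally positive/negative per chunk
  spread(sentiment, n, fill = 0) %>%                       # one column per sentiment
  mutate(net = positive - negative)                        # relative sentiment per chunk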



SOTU-db is currently password protected. For access, just ask.

Tuesday, November 20, 2018

2 weeks from deadline: update!

An extremely brief post to catalog some of the work that's been done in the past couple of days:

First, I got graphics to output to PNG files. It was easier than I expected; I think learning about PHP helped me understand the write-to-file paradigm here. Now I have a script "simple-plotter.R" that will output a chart to "simpleplot.png." All I needed was png("nameOfFileToWrite") before the call to draw the chart (plotSentiment in this case, but it should work with any graphic). Then, calling dev.off() afterward closes the PNG device, sending subsequent graphics back to the default device.
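Boiled down, the whole pattern is just this (a minimal sketch; the plotting call in the middle is whatever chart you want to capture):

library(SentimentAnalysis)  # provides plotSentiment()

# simple-plotter.R, reduced to its essentials
png("simpleplot.png")       # open a PNG graphics device; plots now go to this file
plotSentiment(sentiment)    # draw the chart (any graphics call works here)
dev.off()                   # close the device, writing the file to disk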

Also, I learned a handy thing about R scripts: adding four or more pound-sign/hashtag symbols (####) to the end of a comment line sets it as an anchor that you can jump to within the RStudio editor.
notice how RStudio populates the menu to navigate within the script with all the headings that I placed between ####s
This is definitely helpful as one of my recent issues has been just getting lost within the R scripts. R still takes me more time and effort to read than, say, HTML or Java, so it's nice to find these little tricks. 
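For example, comment lines like these (a toy illustration) all become entries in that navigation menu:

# Load and clean the SOTU texts ####
#   ...loading code...

# Sentiment analysis by chunk ####
#   ...analysis code...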

I also learned about two R packages for sentiment analysis, sentimentr and SentimentAnalysis. I've just barely begun checking them out, but you can actually see the loading of SentimentAnalysis and the calling of its analyzeSentiment() function in the screenshot above. I'm not sure it adds much new to the toolbox, but it does seem to simplify some tasks (and if nothing else it adds more sentiment lexicons). I'll be checking into these more in the coming days, but it's nice to go from a bunch of text files in a directory to a line chart like the one below in three commands.
this is a weird sample of SOTUs and not very clear, but generating this was trivial with the SentimentAnalysis package
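Roughly, that three-command flow looks like this (a sketch; the directory name and the choice of lexicon column are my placeholder assumptions):

library(SentimentAnalysis)

# read each SOTU text file in a directory into one string per document
texts <- sapply(list.files("sotu-texts", full.names = TRUE),
                function(f) paste(readLines(f), collapse = " "))

sentiment <- analyzeSentiment(texts)   # score every document against the built-in lexicons
plotSentiment(sentiment$SentimentGI)   # line chart of the General Inquirer scores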
Aside from all that, I've added a bundle of SOTUs (including Carter's obnoxious separate written/verbal ones), begun cataloging my R script files, and rewritten the requirements inside the GitHub repo.

Sunday, November 18, 2018

A dip into Stanford CoreNLP

Last week I worked for the first time with Stanford's CoreNLP tools. They seemed popular at DHCS and well-documented, so I thought I would give them a try. I had already worked quite a bit in the R tidytext sentiment analysis "branch" (conceptual, not a Git branch) and needed to determine whether CoreNLP would warrant losing some of that work to switch over to a new branch, whether it would not be worth the switch, or (ideally) whether it would be simple enough to keep both pipelines and compare the two.

Setting up

I followed the instructions at the Stanford CoreNLP Download page for "steps to setup from the official release." The problem is, these steps don't actually work. This is a running theme for documentation of Linux software / stuff from GitHub repos. I get it - remembering to update the documentation each time the software changes is challenging on an active project. But there has to be something in the software engineering team's toolkit that can make this workflow easier.

By following the instructions, running the suggested command
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt
resulted in an error: "Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP." This seems like a Java CLASSPATH issue, which is frustrating since I followed the commands in the instructions to change the CLASSPATH, and echoing it seems to show that everything in the CoreNLP file structure has been added. In any case, the error was preventing me from making progress, so I poked around in the file structure and found "corenlp.sh." Looking inside, I noticed that it was using:
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP
So I think the change from -mx3g to -mx5g just gives Java more memory. This seems fine; I think my VM has 4GB of memory and then a bunch of swap, so I'm assuming Java will use swap if the 5g really means it gets 5GB of memory - but maybe not. Now that I think about it, maybe this has something to do with the slow results (below). The -cp "./*" flag then adds everything in the current directory (I run it from the corenlp directory) to the Java classpath. Again, I thought I had already done this, but in any case, I combined these two flags to run
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file 2012-obama-SOTU.txt
which basically runs the whole CoreNLP battery over the specified text file. I don't exactly know what I mean by "CoreNLP battery," but it must be a lot, because the process took forever to run: a total of 1285.7 seconds (21+ minutes), generating a JSON file of around 7.5MB (over 300,000 lines).

First-pass assessment of CoreNLP

It seems like CoreNLP is mainly figuring out the structure of sentences - which words and phrases refer to and depend upon one another. It records the character offset of each word within the file, and tags a "speaker," "pos," and "ner" for each word (not sure what all those mean yet).
The data on the left is pretty sparse, making for a lengthy file. The whole view of the left-hand JSON file only captures some of the data for two words of the original text in the right-hand file.
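To give a sense of the shape, a single token's entry looks something like this - reconstructed from memory, with invented values, though the field names follow the CoreNLP JSON output:

{
  "index": 1,
  "word": "Congress",
  "originalText": "Congress",
  "characterOffsetBegin": 0,
  "characterOffsetEnd": 8,
  "pos": "NNP",
  "ner": "ORGANIZATION",
  "speaker": "PER0"
}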
 So, this took forever and generated an enormous file that wasn't really doing much for me - it's here on GitHub in its entirety. It was time to shift to sentiment analysis alone, instead of the whole of CoreNLP.

I headed to the "sentiment page" of the CoreNLP project which seemed promising. It seemed to address the potential problem I noticed in my post from 10/31/2018 where the Bing lexicon seemed to tag mentions of "peace" as positive in Washington's 1790 SOTU, when the actual mentions were of peace being absent or in danger (hardly a positive sentiment). The CoreNLP sentiment package claims to "[compute] the sentiment based on how words compose the meaning of longer phrases" and handle things like negation better.

Thankfully, the "source code" page of the CoreNLP sentiment tool had the commands I needed, so running the sentiment analysis was easy. The default command output the results to the screen, and it was really fascinating watching each sentence go by along with a "positive," "negative," or "neutral" tag from the CoreNLP. Most of the tags seemed totally wrong. I took a screenshot at one point. It contained sixteen sentences. Personally, I would say eleven are positive, one negative, and four neutral. CoreNLP called only three positive, eight negative, and four neutral. You can just check out the screenshot, but one especially bizarre tag to me is the "negative" call for "And together, the entire industry added nearly 160,000 jobs." It's hard for me to see how any portion of that sentence could really be construed as negative, and that pattern mostly held for the other sentences in the screenshot:
some bizarre sentiment calls from Stanford CoreNLP

I encourage readers to check out the full results here on my GitHub. Apparently CoreNLP is not a fan of US troops withdrawing from Iraq, among other weird results.
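For anyone who wants to try it, the sentiment run was a single command along these lines (reconstructed from the CoreNLP sentiment documentation, so treat it as approximate):

java -cp "./*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -file 2012-obama-SOTU.txt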

Moving Forward

While I can see myself working more with CoreNLP in the future and comparing its results to those I'm getting from the R tidytext methods, CoreNLP isn't the way forward for SOTU-db right now. It's too complex to learn on the timeline I have, and the results don't seem worth it. In the immediate future, my goal is to continue working within RStudio to generate a script that can accept input, find matches in the corpus, and output them back to the user.

Tuesday, November 13, 2018

DHCS 2018

This past weekend I had the pleasure of spending some time at the 13th Annual Chicago Colloquium on the Digital Humanities and Computer Science, or #DHCS2018, for which I served as a steering committee member and volunteer. There was a lot of fantastic scholarship, but for the purposes of this blog post I wanted to highlight a few papers that I was able to hear about (due to concurrent panels and some scheduling conflicts, I couldn't see them all):

Circulation Modeling of Library Book Promotions, Robin Burke (DePaul University)

Dr. Burke showed some great work studying the Chicago Public Library's One Book, One Chicago program. Of course, what caught my attention in particular was his mention of sentiment analysis; their project searched the texts of the assigned One Book, One Chicago novels for place-names (toponyms), identified the sentiment of each mention, and mapped them. I caught up with him after his panel, where he told me that they used Stanford's NLP package and analyzed sentiment on a per-sentence basis, so each toponym was mapped along with the sentiment of the sentence in which it occurred. Robin cautioned that moving up to the paragraph level would be too much text for useful sentiment analysis, and suggested that in certain situations (such as some 18th- and 19th-century presidents' very long sentences) even shorter lengths might be more useful - a few words in either direction of the occurrence. Since finding word occurrences and recording their sentiment is exactly what my project does, this was really useful information. Dr. Burke noted that sentiment analysis might be a little more straightforward with political texts than with, for example, novels, and I shared that I had run across Stanford's time- and domain-specific sentiment lexicons. It was a great conversation, and it felt good to have something to both contribute and take away from it.

Analyzing the Effectiveness of Using Character n-grams to Perform Authorship Attribution in the English Language, David Berdik (Duquesne University)

David gave a great talk on using n-grams to identify the authors of texts. I was pretty surprised at his conclusion that 2- and 3-grams were the most useful for identifying authorship, and I don't think I was alone in this surprise based on the audience's questions after his talk. However, I think I also misunderstood a bit: I thought he meant words that were only 2 or 3 characters long, but it actually means any string of 2 or 3 characters, which can include word fragments and even spaces (see the toy example below). In any case, it gave me the idea of using SOTU-db as training data, then allowing users to run their own texts through a tool on SOTU-db and get an output of which POTUS their text most resembles! This could potentially be a really fun classroom tool, especially if combined with a kind of "SOTU template" or "SOTU authoring" tool, and the ability to constrain the output (so teachers can ensure their students are matched with a POTUS they are studying).
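To illustrate what a character 3-gram actually is, here's a toy example in R (every 3-character window counts, spaces included):

text <- "dear reader"
n <- 3
substring(text, 1:(nchar(text) - n + 1), n:nchar(text))
# [1] "dea" "ear" "ar " "r r" " re" "rea" "ead" "ade" "der"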

'This, reader, is no fiction': Examining the Correlation Between Reader Address and Author Identity in the Nineteenth- and Twentieth-Century Novel, Gabi Kirilloff (Texas Christian University)

Gabi's talk was an unexpected pleasure, and her discussions about how authors used reader address (such as "you" and "dear reader") to define their imagined audiences (among other things) had me unexpectedly drawing connections to SOTU-db. Analysis of how presidents have used pronouns like "we" and "you" could be quite revealing - something to remember as I think about cleaning the texts (both "we" and "you" would be stripped by the standard stopword lists).
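It's easy to confirm this against tidytext's built-in stopword list (a quick sketch):

library(tidytext)
data(stop_words)
c("we", "you") %in% stop_words$word
# [1] TRUE TRUE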

Dissenting Women: Assessing the Significance of Gender on Rhetorical Style in the Supreme Court, Rosamond Thalken (Washington State University)

Rosamond's project was right in my wheelhouse and I was excited to hear about it (and about some of the earlier projects she referenced). I can't find it on Google right now, but I'm going to keep looking; smart discussion of rhetoric and how it can be explored through computational techniques is a weak point of SOTU-db, but one that I find extremely interesting and important. Once I'm confident the technical side of the tool is stable and running, I hope SOTU-db can support the same types of analysis that Rosamond has done with her project.


Moving Forward

On a practical level, the conference has made me want to explore the Stanford NLP package to ensure I'm not making things harder on myself than they need to be with R and NLTK and everything. Stanford NLP popped up in multiple presentations, so it seems worth making sure I'm not neglecting the "industry standard" without good cause. Otherwise, the above-mentioned talks have mostly given me bigger-picture things to consider (which is great, because right now I don't need more major changes to my roadmap). It's wild how quickly the time is going - I felt like I was off to a very quick and sometimes even ahead-of-schedule start to the semester, and now I'm shocked that my MA defense is only 21 days away. It's a good thing I got off to that good start or I'd really be in trouble! Maybe I'll get "minimum viable product" tattooed on my fingers or something...