Sunday, November 18, 2018

A dip into Stanford CoreNLP

Last week I worked with Stanford's CoreNLP tools for the first time. They seemed popular at DHCS and seemed well-documented, so I thought I would give them a try. I had already done quite a bit of work in the R TidyText sentiment analysis "branch" (conceptual, not a Git branch), so I needed to determine which of three things was true: CoreNLP would warrant losing some of that work to switch over to a new branch, it would not be worth the switch, or (ideally) it would be simple enough to keep both pipelines and compare the two.

Setting up

I followed the instructions at the Stanford CoreNLP Download page for "steps to setup from the official release." The problem is, these steps don't actually work. This is a running theme for documentation of Linux software / stuff from GitHub repos. I get it - remembering to update the documentation each time the software changes is challenging on an active project. But there has to be something in the software engineering team's toolkit that can make this workflow easier.

By following the instructions, running the suggested command
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt
resulted in an error: "Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP." This seems like a Java CLASSPATH thing, which is frustrating since I followed the commands in the instructions to change the CLASSPATH, and echoing it seems to show that all the stuff in the CoreNLP file structure has been added. But in any case, the error was preventing me from making progress. I poked around in the file structure and found "corenlp.sh." When I ran it, I noticed that it was using:
java -mx5g -cp "./*" edu.stanford
So I think the change from -mx3g to -mx5g just lets Java use more memory. This seems fine; I think my VM has 4GB of memory and then a bunch of swap. I'm assuming it will use swap, since I think the 5g means Java can take up to 5GB of memory, but maybe not. Now that I think about it, maybe this has something to do with the slow results (below). The -cp "./*" option, meanwhile, puts the JAR files in the current directory (I run it from the corenlp directory) on the Java classpath. Again, I thought I had already done this, but in any case, I combined the two commands to run
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file 2012-obama-SOTU.txt
Which basically runs the whole CoreNLP battery over the specified text file. I don't exactly know what I mean by "CoreNLP battery," but it must be a lot, because the process took forever to run: a total of 1285.7 seconds (21+ minutes) that generated a JSON file around 7.5MB (over 300,000 lines).

First-pass assessment of CoreNLP

It seems like it's mainly figuring out the structure of sentences, figuring out which words and phrases refer to and depend upon one another. It seems to have the character-offset of each word within the file, and tags a "speaker," "pos," and "ner" for each word (not sure what all those mean yet).
The data on the left is pretty sparse, making for a lengthy file. The whole view of the left-hand JSON file only captures some of the data for two words of the original text in the right-hand file.
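For what it's worth, "pos" is a part-of-speech tag and "ner" is a named-entity tag. Here is a minimal Python sketch of how the sparse JSON could be boiled down to one compact row per word. The key names ("sentences", "tokens", "word", "characterOffsetBegin") are my assumptions about the layout of the output file, and the sample data is hand-written for illustration, not real CoreNLP output.

```python
import json

def flatten_tokens(doc):
    """Yield one compact row per token from a CoreNLP-style JSON document.

    Assumes the layout {"sentences": [{"index": ..., "tokens": [...]}]};
    any key that is missing simply comes back as None.
    """
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            yield (
                sentence.get("index"),
                token.get("word"),
                token.get("pos"),   # part-of-speech tag
                token.get("ner"),   # named-entity tag
                token.get("characterOffsetBegin"),
            )

# Tiny hand-written sample in the assumed layout (not real CoreNLP output).
sample = json.loads("""
{"sentences": [{"index": 0, "tokens": [
  {"index": 1, "word": "Mr.", "pos": "NNP", "ner": "O",
   "speaker": "PER0", "characterOffsetBegin": 0, "characterOffsetEnd": 3},
  {"index": 2, "word": "Speaker", "pos": "NNP", "ner": "O",
   "speaker": "PER0", "characterOffsetBegin": 4, "characterOffsetEnd": 11}
]}]}
""")

rows = list(flatten_tokens(sample))
for row in rows:
    # One tab-separated line per word instead of a dozen JSON lines.
    print("\t".join(str(field) for field in row))
```

To try this on the real file, the sample would be swapped for json.load() on the generated JSON output, assuming the keys match.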
So, this took forever and generated an enormous file that wasn't really doing much for me - it's here on GitHub in its entirety. It was time to shift to sentiment analysis alone, instead of the whole of CoreNLP.

I headed to the "sentiment page" of the CoreNLP project which seemed promising. It seemed to address the potential problem I noticed in my post from 10/31/2018 where the Bing lexicon seemed to tag mentions of "peace" as positive in Washington's 1790 SOTU, when the actual mentions were of peace being absent or in danger (hardly a positive sentiment). The CoreNLP sentiment package claims to "[compute] the sentiment based on how words compose the meaning of longer phrases" and handle things like negation better.
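To make the "peace" problem concrete, here is a rough Python stand-in for word-level lexicon scoring. It is not my actual R TidyText code, and the lexicon and both sentences are invented for this sketch (it is not the Bing lexicon). The point is just that a scorer that looks up each word on its own cannot tell "peace" from "no peace."

```python
import re

# Toy lexicon invented for this sketch; it is NOT the Bing lexicon.
TOY_LEXICON = {"peace": 1, "prosperity": 1, "danger": -1, "war": -1}

def lexicon_score(sentence, lexicon=TOY_LEXICON):
    """Sum per-word scores, ignoring word order and negation entirely."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(word, 0) for word in words)

# Two made-up sentences with opposite meanings.
present = lexicon_score("We enjoy peace.")
absent = lexicon_score("There is no peace.")

# Both get the same score, because only the word "peace" is matched.
print(present, absent)
```

A compositional model like the one CoreNLP describes is supposed to score the phrase rather than the individual word, which is why it looked promising here.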

Thankfully, the "source code" page of the CoreNLP sentiment tool had the commands I needed, so running the sentiment analysis was easy. The default command output the results to the screen, and it was really fascinating watching each sentence go by along with a "positive," "negative," or "neutral" tag from CoreNLP. Most of the tags seemed totally wrong. I took a screenshot at one point. It contained sixteen sentences. Personally, I would say eleven are positive, one negative, and four neutral. CoreNLP called only three positive, eight negative, and four neutral. You can just check out the screenshot, but one tag that seems especially bizarre to me is the "negative" call for "And together, the entire industry added nearly 160,000 jobs." It's hard for me to see how any portion of that sentence could really be construed as negative, and that pattern mostly held for the other sentences in the screenshot:
some bizarre sentiment calls from Stanford CoreNLP
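If I wanted to go beyond eyeballing a screenshot, a comparison like this could be tallied with a few lines of Python. This is only a sketch: compare_labels is a hypothetical helper, and the labels below are made up to show the mechanics, not transcribed from the screenshot.

```python
from collections import Counter

def compare_labels(mine, theirs):
    """Tally two lists of per-sentence labels and the share that match."""
    if len(mine) != len(theirs):
        # Catches a miscounted or skipped sentence before comparing.
        raise ValueError("label lists must be the same length")
    matches = sum(a == b for a, b in zip(mine, theirs))
    return {
        "mine": Counter(mine),
        "theirs": Counter(theirs),
        "agreement": matches / len(mine) if mine else 0.0,
    }

# Made-up labels for four imaginary sentences (not the real results).
mine = ["positive", "positive", "neutral", "negative"]
theirs = ["negative", "positive", "neutral", "negative"]

result = compare_labels(mine, theirs)
print(result["mine"])
print(result["theirs"])
print(result["agreement"])
```

The length check is there because a per-sentence comparison only makes sense when both lists cover exactly the same sentences.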

I encourage readers to check out the full results here on my GitHub. Apparently CoreNLP is not a fan of US troops withdrawing from Iraq, among other weird results.

Moving Forward

While I can see myself working more with CoreNLP in the future and comparing its results to those I'm getting from the R TidyText methods, for now CoreNLP isn't the way forward for SOTU-db. It's too complex to learn on the timeline I have, and the results don't seem worth it. In the immediate future, my goal is to continue working within RStudio to generate a script that can accept input, find matches in the corpus, and output them back to the user.