I followed the instructions at the Stanford CoreNLP Download page for "steps to setup from the official release." The problem is, these steps don't actually work. This is a running theme for documentation of Linux software / stuff from GitHub repos. I get it - remembering to update the documentation each time the software is challenging on an active project. But there has to be something in the software engineering team's toolkit that can make this workflow easier.
By following the instructions, running the suggested command
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txtresulted in an error: "Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP." This seems like a Java CLASSPATH thing, which is frustrating since I followed the commands in the instructions to change the CLASSPATH, and echoing it seems to show that all the stuff in the CoreNLP file structure has been added. But in any case, the error was preventing me from making progress. I poked around in the file structure and found "corenlp.sh." When run, I noticed that it was using:
Java –mx5g –cp "./*" edu.stanfordSo I think the change from -mx3g to -mx5g just gives Java more memory. This seems fine, I think my VM has 4GB of memory and then a bunch of swap. I'm assuming it will use swap since I think the 5g means Java gets 5GB of memory, but maybe not - now that I think about it, maybe this has something to do with the slow results (below). But then the -cp "./*" command associates the stuff in the directory (I run it from the corenlp directory) with the Java classpath. Again, I thought I already did this, but in any case, I combined these two files to run
Java –mx5g –cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file 2012-obama-SOTU.txtWhich basically runs the whole CoreNLP battery over the specified text file. I don't exactly know what I mean by "CoreNLP battery," but it must be a lot, because the process took forever to run: a total of 1285.7 seconds (21+ minutes) that generated a JSON file around 7.5MB (over 300,000 lines).
First-pass assessment of CoreNLPIt seems like it's mainly figuring out the structure of sentences, figuring out which words and phrases refer to and depend upon one another. It seems to have the character-offset of each word within the file, and tags a "speaker," "pos," and "ner" for each word (not sure what all those mean yet).
|The data on the left is pretty sparse, making for a lengthy file. The whole view of the left-hand JSON file only captures some of the data for two words of the original text in the right-hand file.|
I headed to the "sentiment page" of the CoreNLP project which seemed promising. It seemed to address the potential problem I noticed in my post from 10/31/2018 where the Bing lexicon seemed to tag mentions of "peace" as positive in Washington's 1790 SOTU, when the actual mentions were of peace being absent or in danger (hardly a positive sentiment). The CoreNLP sentiment package claims to "[compute] the sentiment based on how words compose the meaning of longer phrases" and handle things like negation better.
Thankfully, the "source code" page of the CoreNLP sentiment tool had the commands I needed, so running the sentiment analysis was easy. The default command output the results to the screen, and it was really fascinating watching each sentence go by along with a "positive," "negative," or "neutral" tag from the CoreNLP. Most of the tags seemed totally wrong. I took a screenshot at one point. It contained sixteen sentences. Personally, I would say eleven are positive, one negative, and four neutral. CoreNLP called only three positive, eight negative, and four neutral. You can just check out the screenshot, but one especially bizarre tag to me is the "negative" call for "And together, the entire industry added nearly 160,000 jobs." It's hard for me to see how any portion of that sentence could really be construed as negative, and that pattern mostly held for the other sentences in the screenshot:
|some bizarre sentiment calls from Stanford CoreNLP|
I encourage readers to check out the full results here on my GitHub. Apparently CoreNLP is not a fan of US troops withdrawing from Iraq, among other weird results.