Thursday, November 29, 2018

Alpha Release

It's been a long night of working on SOTU-db so this will be brief:

SOTU-db is now in alpha release! Full-text searching is not yet enabled, but users can search for a limited set of SOTUs via text search or filter-list. Users can also select a "chunkSize," which determines how many words SOTU-db considers in each chunk of sentiment analysis. SOTU-db will (after a brief pause...) return the cleaned (lowercase, no punctuation) text of the selected SOTU along with a chart showing the relative sentiment (positive minus negative) of each chunk of text within that SOTU, according to the three different sentiment lexicons available in the tidytext package for R.
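To make the chunkSize idea concrete, here is a minimal shell sketch of chunked net sentiment. It is not SOTU-db's actual code (that lives in R with tidytext); the tiny word lists, the sample sentence, and all the variable names are made up for illustration.

```shell
# Toy lexicon (hypothetical; a real lexicon such as Bing has thousands of entries).
positive="peace prosperity hope"
negative="war danger fear"

# Already-cleaned text: lowercase, no punctuation.
text="peace and hope for all but war and danger and fear remain"

# How many words go into each chunk.
chunk_size=4

chunk_scores=$(printf '%s\n' "$text" | awk -v size="$chunk_size" -v pos="$positive" -v neg="$negative" '
BEGIN {
    # Positive words score +1, negative words score -1, everything else 0.
    n = split(pos, p, " "); for (i = 1; i <= n; i++) score[p[i]] = 1
    n = split(neg, q, " "); for (i = 1; i <= n; i++) score[q[i]] = -1
}
{
    for (i = 1; i <= NF; i++) {
        chunk = int((i - 1) / size) + 1   # words 1-4 are chunk 1, 5-8 are chunk 2, ...
        net[chunk] += score[$i]           # running positive-minus-negative total
        if (chunk > last) last = chunk
    }
}
END { for (c = 1; c <= last; c++) printf "chunk %d: %d\n", c, net[c] }
')

printf '%s\n' "$chunk_scores"
```

With these toy inputs the three chunks come out at 2, -1, and -2, which is the kind of per-chunk series the chart plots. A smaller chunk size gives a longer, noisier series, and a larger one smooths it out.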



SOTU-db is currently password protected. For access, just ask.

Tuesday, November 20, 2018

2 weeks from deadline: update!

An extremely brief post to catalog some of the work that's been done in the past couple of days:

First, I got graphics to output to PNG files. It was easier than I expected. I think learning about PHP helped me understand the paradigm to write to files here. Now I have a script "simple-plotter.R" that will output a chart to "simpleplot.png." All I needed was png("nameOfFileToWrite") before the call to draw the chart (plotSentiment in this case, but it should work with any graphic). Then, dev.off() afterward closes the PNG device, which finishes writing the file so that later plots are no longer redirected into it.

Also, I learned a handy thing about R Scripts: adding four or more pound-sign comment hashtag symbol things (####) after a line sets it as an anchor that you can jump to within the RStudio editor. 
notice how RStudio populates the menu to navigate within the script with all the headings that I placed between ####s
This is definitely helpful as one of my recent issues has been just getting lost within the R scripts. R still takes me more time and effort to read than, say, HTML or Java, so it's nice to find these little tricks. 

I also learned about two R packages for sentiment analysis, sentimentr and SentimentAnalysis. I've just barely begun checking them out but actually you can see the loading of SentimentAnalysis and then the calling of its analyzeSentiment() function in the screenshot above. I'm not sure it adds much new to the toolbox, but it does seem to simplify some tasks (and if nothing else it adds more sentiment lexicons). I'll be checking into these more in the coming days, but it's nice to go from a bunch of text files in a directory to a line chart like the one below in three commands.
this is a weird sample of SOTUs and not very clear, but generating this was trivial with the SentimentAnalysis package
Aside from all that, I've added a bundle of SOTUs (including Carter's obnoxious separate written/verbal ones), begun cataloging my R script files, and re-wrote the requirements inside the GitHub repo. 

Sunday, November 18, 2018

A dip into Stanford CoreNLP

Last week I worked for the first time with Stanford's CoreNLP tools. They seemed popular at DHCS and seemed well-documented, so I thought I would give them a try. I had already worked quite a bit in the R TidyText sentiment analysis "branch" (conceptual, not a Git branch), so I needed to determine whether CoreNLP would be worth giving up some of that work to switch over, whether it would not be worth the switch at all, or (ideally) whether it would be simple enough to keep both pipelines and compare the two.

Setting up

I followed the instructions at the Stanford CoreNLP Download page for "steps to setup from the official release." The problem is, these steps don't actually work. This is a running theme for documentation of Linux software / stuff from GitHub repos. I get it - remembering to update the documentation each time the software changes is challenging on an active project. But there has to be something in the software engineering team's toolkit that can make this workflow easier.

By following the instructions, running the suggested command
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt
resulted in an error: "Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP." This seems like a Java CLASSPATH thing, which is frustrating since I followed the commands in the instructions to change the CLASSPATH, and echoing it seems to show that all the stuff in the CoreNLP file structure has been added. But in any case, the error was preventing me from making progress. I poked around in the file structure and found "corenlp.sh." When run, I noticed that it was using:
java -mx5g -cp "./*" edu.stanford
So I think the change from -mx3g to -mx5g just gives Java more memory (it raises the maximum heap size from 3GB to 5GB). This seems fine; I think my VM has 4GB of memory and then a bunch of swap. I'm assuming it will use swap, since 5GB is more than the VM's physical memory, but maybe not - now that I think about it, maybe this has something to do with the slow results (below). Then the -cp "./*" option adds all the JAR files in the current directory (I run it from the corenlp directory) to the Java classpath. Again, I thought I already did this, but in any case, I combined these two commands to run
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file 2012-obama-SOTU.txt
Which basically runs the whole CoreNLP battery over the specified text file. I don't exactly know what I mean by "CoreNLP battery," but it must be a lot, because the process took forever to run: a total of 1285.7 seconds (21+ minutes) that generated a JSON file around 7.5MB (over 300,000 lines).

First-pass assessment of CoreNLP

It seems like it's mainly figuring out the structure of sentences, figuring out which words and phrases refer to and depend upon one another. It seems to have the character-offset of each word within the file, and tags a "speaker," "pos," and "ner" for each word (not sure what all those mean yet).
The data on the left is pretty sparse, making for a lengthy file. The whole view of the left-hand JSON file only captures some of the data for two words of the original text in the right-hand file.
So, this took forever and generated an enormous file that wasn't really doing much for me - it's here on GitHub in its entirety. It was time to shift to sentiment analysis alone, instead of the whole of CoreNLP.

I headed to the "sentiment page" of the CoreNLP project which seemed promising. It seemed to address the potential problem I noticed in my post from 10/31/2018 where the Bing lexicon seemed to tag mentions of "peace" as positive in Washington's 1790 SOTU, when the actual mentions were of peace being absent or in danger (hardly a positive sentiment). The CoreNLP sentiment package claims to "[compute] the sentiment based on how words compose the meaning of longer phrases" and handle things like negation better.

Thankfully, the "source code" page of the CoreNLP sentiment tool had the commands I needed, so running the sentiment analysis was easy. The default command output the results to the screen, and it was really fascinating watching each sentence go by along with a "positive," "negative," or "neutral" tag from the CoreNLP. Most of the tags seemed totally wrong. I took a screenshot at one point. It contained sixteen sentences. Personally, I would say eleven are positive, one negative, and four neutral. CoreNLP called only three positive, eight negative, and four neutral. You can just check out the screenshot, but one especially bizarre tag to me is the "negative" call for "And together, the entire industry added nearly 160,000 jobs." It's hard for me to see how any portion of that sentence could really be construed as negative, and that pattern mostly held for the other sentences in the screenshot:
some bizarre sentiment calls from Stanford CoreNLP

I encourage readers to check out the full results here on my GitHub. Apparently CoreNLP is not a fan of US troops withdrawing from Iraq, among other weird results.

Moving Forward

While in the future I can see myself trying to work more with CoreNLP and comparing its results to those I'm getting from the R TidyText methods, CoreNLP isn't the way forward for SOTU-db right now. It's too complex to learn on the timeline I have, and the results don't seem worth it. In the immediate future, my goal is to continue working within RStudio to generate a script that can accept input, find matches in the corpus, and output them back to the user.

Tuesday, November 13, 2018

DHCS 2018

This past weekend I had the pleasure of spending some time at the 13th Annual Chicago Colloquium on the Digital Humanities and Computer Science, or #DHCS2018, for which I served as a steering committee member and volunteer. There was a lot of fantastic scholarship, but for the purposes of this blog post I wanted to highlight a few papers that I was able to hear about (due to concurrent panels and some scheduling conflicts, I couldn't see them all):

Circulation Modeling of Library Book Promotions, Robin Burke (DePaul University)

Dr. Burke showed some great work studying the Chicago Public Library's One Book, One Chicago program. Of course what caught my attention in particular was his mention of sentiment analysis; their project searched the texts of the assigned One Book, One Chicago novels for place-names (toponyms), identified the sentiment of each mention, and mapped them. I caught up with him after his panel where he told me that they used Stanford's NLP package and analyzed the sentiment on a per-sentence basis, so each toponym was being mapped along with the sentiment of the sentence in which it occurred. Robin cautioned that moving up to a paragraph-level would be too much text for useful sentiment analysis, and suggested that for certain situations (such as some 18th- and 19th-century POTUS' very long sentences) even shorter lengths might be more useful - a few words in either direction of the occurrence. Since finding word occurrences and recording their sentiment is exactly what my project was doing, this was really useful information to me. Dr. Burke expressed that sentiment analysis might be a little more straightforward with political texts than with, for example, novels, and I shared that I had run across Stanford's time- and domain-specific sentiment lexicons. It was a great conversation, and it was cool to feel like I really had something to contribute and take away from it.

Analyzing the Effectiveness of Using Character n-grams to Perform Authorship Attribution in the English Language, David Berdik (Duquesne University)

David gave a great talk on using n-grams to identify the authors of texts. I was pretty surprised at his conclusion that 2- and 3-grams were the most useful for identifying authorship, and I don't think I was alone in this surprise based on the audience's questions after his talk. However, I think I also misunderstood a bit; I thought he meant words that were only 2 and 3 characters long, but I think it actually means any string of 2 or 3 characters, which could include word fragments and even spaces. In any case, it gave me the idea of using the SOTU-db as training data, then allowing users to run their own texts through a tool on SOTU-db and get an output of which POTUS the user's text most resembles! This could potentially be a really fun classroom tool, especially if combined with a kind of "SOTU template" or "SOTU authoring" tool, and the ability to constrain the output (so teachers can ensure their students are matched with a POTUS they are studying). 
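Here is how I now understand "character n-grams," as a minimal shell sketch. This is just my own illustration, not David's method or code; the function name char_ngrams and the sample phrase are made up. The brackets are only there to make the spaces visible.

```shell
# Print every character n-gram of a line, including spaces and word fragments.
char_ngrams() {
    # $1 = n (the gram length), $2 = the text to split
    printf '%s\n' "$2" | awk -v n="$1" '{
        # Slide a window of n characters across the line, one position at a time.
        for (i = 1; i + n - 1 <= length($0); i++)
            print "[" substr($0, i, n) "]"
    }'
}

bigrams=$(char_ngrams 2 "we the")
printf '%s\n' "$bigrams"

# A longer window yields fewer grams from the same text.
trigram_count=$(char_ngrams 3 "we the" | wc -l | tr -d ' ')
echo "number of 3-grams: $trigram_count"
```

So "we the" gives [we], [e ], [ t], [th], and [he] as its 2-grams: two of the five straddle the space between words. Piping the output through sort and uniq -c would turn it into the kind of frequency profile that could be compared between authors.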

'This, reader, is no fiction': Examining the Correlation Between Reader Address and Author Identity in the Nineteenth- and Twentieth-Century Novel, Gabi Kirilloff (Texas Christian University)

Gabi's talk was an unexpected pleasure, and her discussions about how authors used reader address (such as "you" and "dear reader") to define their imagined audiences (among other things) had me unexpectedly drawing connections to SOTU-db. Analysis of how presidents have used pronouns like "we" and "you" could be quite revealing - something to remember as I think about cleaning the texts (both "we" and "you" would be stripped by the standard stopword lists).

Dissenting Women: Assessing the Significance of Gender on Rhetorical Style in the Supreme Court, Rosamond Thalken (Washington State University)

Rosamond's project was right in my wheelhouse and I was excited to hear about it (and some of the others that have come before that she referenced). I can't find it right now on Google, but I'm going to continue to look; smart discussion of rhetoric and how it can be explored through computational techniques is a weak point of SOTU-db but one that I find extremely interesting and important. Once I am confident the technical side of the tool is somewhat stable and running, I hope it can be a tool to do the same types of analysis that Rosamond has done with her project here.


Moving Forward

On a practical level, the conference has made me want to explore the Stanford NLP package to ensure that I'm not making things harder on myself than they need to be with R and NLTK and everything. Stanford NLP popped up in multiple presentations, so it seems worth making sure that I'm not neglecting the "industry standard" without good cause. Otherwise, the above-mentioned talks have mostly given me bigger-picture things to consider (which is great, because right now I don't need more major changes to my roadmap). It's wild how quickly the time is going - I felt like I was off to a very quick and even sometimes ahead-of-schedule start to the semester, and now I am shocked that my MA defense is only 21 days away, so it's a good thing I got off to that good start or I'd really be in trouble! Maybe I'll get "minimum viable product" tattooed on my fingers or something...

Thursday, November 8, 2018

PHPhun

Ah, how quaint to look back at my old Input/Output post and see the exuberant naivete of youth... sentences like "I'm excited to try to figure out more PHP and get things hopefully running as a start-to-finish 'use case' in the next day or two!" have now been confronted with the reality of many hours banging my head against the wall trying to figure out how to get PHP to work with the R scripts I have. The good news is that I think I've finally overcome the major obstacles, and now I really do feel confident in getting a "use case" demo up in the next couple of days at most.

The goal

The point of PHP is essentially to get the parts of the project to interact with each other. Here's a basic outline of what should be possible:

  • user selects a SOTU from a list, let's say Obama's 2016 SOTU 
  • user clicks a button to display sentiment for that SOTU 
  • the button-press activates a PHP script that: 
    • identifies the proper SOTU 
    • runs an R script on that SOTU 
    • the R script returns a few lines of sentiment (starting with "#A tibble" in the picture below for example) 
    • that output is written to a file 
    • that file is displayed back to the user in their browser



The problem(s)

I will copy here a long portion of an email I sent asking for help from a good friend and talented programmer, Olly, with some extra notes highlighted in yellow:

1. I know that I can do the following in the command line:
echo words >> output.txt
that works fine, so far so good.
the above commands will find output.txt (or create it if it doesn't exist) and append the word "words" to the file.

2. I also know that because I have R installed on my command line and a short script called myscript.R, I can type this into my terminal:

Rscript pathTo/myscript.R >> output.txt
which does what I expect. The output of the script, which is redirected (appended) to output.txt, looks like this:


the script "myscript.R" is running a really basic sentiment analysis on a text (a pre-defined text in this case; the results will be the same every time). By using the >> redirection again, the system sends those results to output.txt, resulting in a file like the one you see above, giving the number of "negative" and "positive" sentiment words found in the text.

3. What I want to have happen next is for PHP to do this script for me, triggered by a user clicking a button on the web page. Then the resulting file should be echoed back to the user so they can see the same results you see above. But I can't do it! Here's what I can do:

4. I can definitely call the PHP script (text-to-file.php), which can pipe arbitrary text to a file, then show the file to the user. I actually did this a couple different ways but I think the cleanest was at this commit, basically:

file_put_contents("../sentiments.md","Text is getting appended then a line break \n", FILE_APPEND);

$output = fopen("../sentiments.md","r");

echo fread($output,filesize("../sentiments.md"));

I could just keep hitting refresh and seeing new lines of "Text is getting appended then a line break" appear in the browser each time (which is what I was expecting).

5. I can also have the php script do:

exec("echo happy >> sentiments.txt");

$output = fopen("sentiments.txt","r");

echo fread($output, filesize("sentiments.txt"));


which again does what I want: I can F5 and keep appending "happy" to sentiments.txt, which shows in the browser, over and over.

But when I simply swap out "echo happy" with "Rscript myscript.R," it no longer works! The file (sentiments.txt) is created, but nothing is written to it, which makes the fread throw an error or at best display its contents: nothing. I'm guessing that the Rscript is just taking too long (it takes a good couple of seconds to execute and that time will only increase with longer documents) but I've had no luck using "sleep" or anything like that to try to introduce a delay between running the command and reading from the file.

Identifying and Overcoming

One really helpful factor in figuring out what was going wrong was redirecting my STDERR (standard error) to my STDOUT (standard output) so that instead of getting blank files, I was getting files that had the error printed to them. Thanks to Brian Storti for this great article explaining the 2>&1 idiom to redirect that output.
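A minimal sketch of why the 2>&1 idiom made the difference. The failing_script function is a hypothetical stand-in for my Rscript call (so this runs without R installed), and the error text and file names are made up.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for a script that fails the way mine did:
# it prints nothing to stdout and writes its complaint to stderr.
failing_script() {
    echo "error: something went wrong" >&2
    return 1
}

# Only stdout is redirected: the error goes to the terminal and the file stays empty.
failing_script > "$workdir/stdout_only.txt" || true

# stdout goes to the file first, then stderr is pointed at the same place.
failing_script > "$workdir/both.txt" 2>&1 || true

echo "contents of both.txt:"
cat "$workdir/both.txt"
```

The order matters: 2>&1 has to come after the > redirection, because it means "send stderr wherever stdout is pointing right now." Written the other way around, stderr would still go to the terminal.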

Error the First: Environment Variables

Once I was able to see the error output that I'd redirected to a file, I was surprised to see that I was actually generating a "the term RScript is not recognized as the name of a cmdlet, function, script file..." error. This was a weird problem since I knew for sure I had already added Rscript.exe to my PATH variable (after all, I could run it no problem just by typing into the terminal).

It turns out that Apache server has its own environment variables. From the documentation, it seemed like it should be able to use the system's PATH variables too, but obviously in practice that wasn't working. Doing phpinfo() showed the environment variables that Apache was using, and Rscript was nowhere to be found. Stupidly, I couldn't quite figure out how to update the environment variables for Apache, so my solution now is just to use the "f:/ully/qualified/path/to/Rscript.exe" in the PHP file. This seems like a silly way to do it, but it works for now, so I'm forging ahead.
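The PATH mismatch can be reproduced in miniature on a Unix-style shell. This is only a sketch under assumptions: fake-rscript is a made-up stand-in for Rscript.exe, and the "server" is simulated by running a child shell with a different, minimal PATH rather than by a real Apache process.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for Rscript.exe, placed in a directory of its own.
cat > "$workdir/fake-rscript" <<'EOF'
#!/bin/sh
echo "fake Rscript ran"
EOF
chmod +x "$workdir/fake-rscript"

# "My terminal": the tool's directory is on PATH, so the bare name resolves.
found_in_terminal=$(PATH="$workdir:$PATH" sh -c 'command -v fake-rscript')

# "The web server": a process with its own, shorter PATH cannot find it.
found_by_server=$(env PATH="/usr/bin:/bin" sh -c 'command -v fake-rscript' || echo "not found")

# The workaround: a fully qualified path works no matter what PATH says.
ran_with_full_path=$(env PATH="/usr/bin:/bin" "$workdir/fake-rscript")

echo "terminal sees:  $found_in_terminal"
echo "server sees:    $found_by_server"
echo "full path gives: $ran_with_full_path"
```

The same command name succeeds or fails depending only on the environment of the process that runs it, which is why the script worked from my terminal but not from PHP.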

Error the Second: Package Locations

Once the PHP file knew where to find Rscript.exe, I was a little surprised to see it still throwing errors instead of output. This time, the error was about missing packages in R. I had been following the default practice in the RStudio install, which installs packages to the individual profile. Instead, I wanted the packages put into a common library (the difference between these practices is explained in this RStudio support article). I just did a simple copy from the user folder to the common library, and that took care of that.

Error the Third: Permissions

Once I finally got the PHP to find Rscript.exe AND the R packages it needed, I was still getting weird results. From experimenting some more on the command line, I found that somewhere along the way my permissions had gotten messed up, and the PHP script no longer had permission to create or append to the text files it was trying to write. This makes sense: when I run commands from the terminal, it knows I am logged in on my own user account, with those permissions. But when the PHP script tries to access those same shell commands, what "user" is it doing so under? What permissions does it have? Honestly, I don't quite know the answer to this yet (security is a weak point of my own software knowledge and I don't think I'm alone among DHers in that regard) but by changing a few permissions and a few locations for where files are written and read from, I have a solution for now.

Results

It's not on the live, public server yet (I'll need to hand-edit the full paths to Rscript and stuff, since my development server at home is on Windows but the public one for the site is on Ubuntu, so obviously different file path schemes). But from the home server, I can now type in "obama," "washington," or anything else, and get taken to a real results page that shows:

  • the term entered in the search box,
  • whether that matches with "obama" or "washington,"
  • the full 2012 SOTU text if obama was searched, or the full 1790 text if washington was searched,
  • in the sentiment info box, the number of positive and negative (and the net positive-negative) sentiment word matches in the SOTU from the Bing sentiment lexicon:




The spacing is messed up, but as a proof of concept this reflects a ton of learning and work on my part in a short amount of time. From here, figuring out the SQL database is my only remaining big hurdle; the rest is about scaling up (which will not be a small job)! With this milestone, I'm feeling good about having a minimum viable product in the next few weeks. The next step will be figuring out SQL as well as leaving some of the web-dev stuff to the side to work on the R scripting.

Monday, November 5, 2018

HGSA 2018

On Saturday I had the chance to show SOTU-db as a poster at LUC's 15th Annual History Graduate Student Conference: "Building Bridges" (link to program here). This was a great opportunity for me to gather feedback on my work, but also a good deadline to meet in terms of motivation - the spike in commits on my GitHub repos the past week is noticeable, and I feel great about the progress I've made!

It was a good exercise to figure out what to include on the poster. First, this was a history conference; the audience would probably be more interested in the "so what" of my project than in the technical details. Second, since my project scope has changed over the summer, it was a good exercise to reconsider what my project really was about. Ultimately, I decided on a summary blurb of text, then four boxes (mainly because four 8.5x11 inch pages fit neatly in the space on the posterboard) highlighting these four aspects of SOTU-db:
  • The text corpus of SOTUs,
  • the SOTU-db.com website and user-facing experience,
  • the LAMP stack that makes up SOTU-db's "digital infrastructure,"
  • text mining and analytics (R, NLTK).
This was my first time presenting at any kind of conference like this, poster or otherwise, and I enjoyed it quite a bit more than I expected. I felt like most people were finding the poster useful, but I really wish my live demo on the laptop had worked. It wasn't fully functional because I never could get the scroll bar on my fake "results" page to work, so the visualizations were not even on the screen. I do think people appreciated when I could show them the RStudio visualizations, but even those should have been better organized so I could find the ones I wanted with people waiting and looking over my shoulder!

I was really hoping that I could provide a username and password to people and have them visit the site on their own devices, but even if the search demo had been working properly, I don't seem to be able to access the site from Loyola's campus for whatever reason.

a man speaking and gesturing toward a poster with "SOTU-db" and other detailed text printed on it
credit: Rebecca Parker @RJP43
A huge congratulations is due to the folks in Loyola's History Graduate Student Association for putting together such a great conference! Unfortunately I wasn't able to stay for the whole day but I'm looking forward to catching up with some of the people who were there to learn what I missed!

The next big deadline for SOTU-db is my presentation at the CTSDH in about four weeks. Until then I'll be working hard to get my corpora and index database squared away, get the PHP on the site working with my server, and create a working search interface!

Thursday, November 1, 2018

Input/Output

Over the summer, I took some courses on Codecademy, and one of them was just a command line interface (CLI) course. It didn't feel very useful except that I learned about standard input / output, and the >, >>, and | operators that allow inputs and outputs to be exchanged between commands and put to use. Even if I end up not using this exact technique, I'm really glad I had that lesson - it helped me to understand more of what the system is actually doing with files. Being used to finicky filetypes (like distinguishing between .doc and .docx), there's something very satisfying about working with the raw data and being able to transform and use it in different ways.
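For anyone who hasn't met those three operators, here is a minimal sketch of what each one does. The file name and the text being written are just placeholders, and everything happens in a throwaway temp directory.

```shell
workdir=$(mktemp -d)
out="$workdir/output.txt"

echo "first line" > "$out"     # > creates the file, or overwrites it if it exists
echo "second line" >> "$out"   # >> appends to the end instead
echo "third line" >> "$out"

# | feeds one command's standard output into the next command's standard input.
line_count=$(cat "$out" | wc -l | tr -d ' ')
echo "lines after appending: $line_count"

echo "replacement" > "$out"    # a single > throws away the earlier three lines
final_contents=$(cat "$out")
echo "contents after overwriting: $final_contents"
```

The difference between > and >> is the one to watch: appending keeps building up the file, while a single > silently replaces whatever was there.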

Earlier this evening I made a checklist of tasks I thought I would need to finish in order to get a viable working product ready to show for Saturday morning (plus two more that would need to get done at some point but weren't urgent). These were:

  • learn to issue commands to R through the CLI (done)
  • learn to pipe variables (like the user's query term) into those commands (not done yet with variables, but done with text - should be quick)
  • output whatever R outputs into a static document (done via | pipe, but might switch to php)
  • figure out how to automate serving this document (done, just fixed permissions, and now by piping right to the file on the server, the server automatically picks up those changes. Just need to trigger a refresh for the user, see below)
  • direct user to the newly generated page (not done)
  • listen for GET requests to server (not done)
  • extract the search term from the URL (not done, and may not even do it this way anymore)
  • figure out the user experience - loading bar? (not done yet, related to how I redirect the user to the new page)

For a few hours' work, I'm pretty excited about what I've done: decided on PHP for pretty much the rest of the minimum viable product, figured out how to take user input and run it through a script in R and output a new page for the user. From there, everything should be mostly a matter of tweaking, and scale, and finally directing my focus toward the actual humanities/historical questions on which this tool can hopefully help offer new perspectives. 

For now, it's very exciting that I was able to input a filename (the "-washington.md" file) and specify an output file (sentiments.txt) with this:
Rscript myscript.R < pathTo/1790-01-08-washington.md > /pathTo/sentiments.txt
and then throw in this fun command, inputting a .jpg of Trump and specifying an append to that same output sentiments.txt:
jp2a /pathTo/trump.jpg >> /pathTo/sentiments.txt
and ended up with this:
who knew I would ever be pumped to see this ascii-art face pop up on my screen!

By the way, here's the original image:


Pretty good for a night's work if I may say. From the command line, I can specify a couple of filenames and get a readout of the sentiment of that text and an ascii-art image from jp2a as a bonus. If the user knew to hit refresh, the readout is even hitting the live server automatically, which is cool. Even though the sentiments from 1790 and the image of Trump have pretty much nothing to do with each other, this was a worthwhile task. I'm excited to try to figure out more PHP and get things hopefully running as a start-to-finish "use case" in the next day or two! I'll leave this image of my updated view of how the project components will fit together below, and call it a night.
a UML diagram showing the data flow from the user's query to the returned static web page
still a bit of a mess and I'm sure nothing close to UML-standard, but closer than before, and finally beginning to feel like it is reflecting reality more than just my ideas!