Sunday, December 9, 2018

Alpha 3.0

SOTU-db version alpha 3.0 is now available at www.sotu-db.com.


  • Enter a string to run a full-text search over a corpus of SOTUs from 1978-2017 and return sentiment values from four sentiment lexicons for each occurrence of your string (by sentence)
  • Choose from a selection of 13 SOTUs to examine by year; charts for the selected SOTU at the chosen chunk size (one of four user-selectable options), along with the text used for processing, are returned to the user
  • Viewer and Extras are present but need improvement and may not display correctly
  • Documentation and Credits likewise exist but need improvement

Thursday, November 29, 2018

Alpha Release

It's been a long night of working on SOTU-db so this will be brief:

SOTU-db is now in alpha release! Full-text searching is not yet enabled, but users can search for a limited set of SOTUs via text search or filter-list. Users can also select a "chunkSize," which determines how many words SOTU-db considers in each chunk of sentiment analysis. SOTU-db will (after a brief pause...) return the cleaned (lowercase, no punctuation) text of the selected SOTU along with a chart showing the relative sentiment (positive - negative) of each chunk of text within that SOTU, according to the three different sentiment lexicons available in the tidytext package for R.



SOTU-db is currently password protected. For access, just ask.

Tuesday, November 20, 2018

2 weeks from deadline: update!

An extremely brief post to catalog some of the work that's been done in the past couple of days:

First, I got graphics to output to PNG files. It was easier than I expected; I think working with PHP helped me understand the write-to-file paradigm here. Now I have a script "simple-plotter.R" that will output a chart to "simpleplot.png." All I needed was png("nameOfFileToWrite") before the call to draw the chart (plotSentiment in this case, but it should work with any graphic). Then, dev.off() afterward closes the device, sending graphics output back to the default (on-screen) device.
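In other words, the pattern is just this (made-up numbers standing in for a real chart):

sentiment_by_chunk <- c(3, 5, -2, 0, 4)   # made-up per-chunk sentiment scores

png("simpleplot.png")                  # open a PNG graphics device; plots now go to this file
plot(sentiment_by_chunk, type = "l")   # draw the chart (plotSentiment or any other plotting call works here too)
dev.off()                              # close the device; plotting goes back to the default (on-screen) device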

Also, I learned a handy thing about R Scripts: adding four or more pound-sign comment hashtag symbol things (####) after a line sets it as an anchor that you can jump to within the RStudio editor. 
notice how RStudio populates the menu to navigate within the script with all the headings that I placed between ####s
This is definitely helpful as one of my recent issues has been just getting lost within the R scripts. R still takes me more time and effort to read than, say, HTML or Java, so it's nice to find these little tricks. 
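For example, comment lines like these become jump-to headings in that navigation menu:

# Load and clean the corpus ####
# ...code for this section...

# Sentiment analysis ####
# ...code for this section...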

I also learned about two R packages for sentiment analysis, sentimentr and SentimentAnalysis. I've just barely begun checking them out, but you can already see the loading of SentimentAnalysis and the call to its analyzeSentiment() function in the screenshot above. I'm not sure it adds much new to the toolbox, but it does seem to simplify some tasks (and if nothing else it adds more sentiment lexicons). I'll be checking into these more in the coming days, but it's nice to go from a bunch of text files in a directory to a line chart like the one below in three commands.
this is a weird sample of SOTUs and not very clear, but generating this was trivial with the SentimentAnalysis package
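Those "three commands" look roughly like this - a sketch with a made-up directory name, not my exact script:

library(SentimentAnalysis)

# read every .txt file in a (hypothetical) folder of SOTUs into a character vector
sotu_texts <- sapply(list.files("sotu-texts", pattern = "\\.txt$", full.names = TRUE),
                     function(f) paste(readLines(f), collapse = " "))

sentiments <- analyzeSentiment(sotu_texts)   # scores each document against several lexicons at once
plotSentiment(sentiments$SentimentGI)        # line chart of one of those scores (the General Inquirer one) across the corpus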
Aside from all that, I've added a bundle of SOTUs (including Carter's obnoxious separate written/verbal ones), begun cataloging my R script files, and rewritten the requirements inside the GitHub repo.

Sunday, November 18, 2018

A dip into Stanford CoreNLP

Last week I worked for the first time with Stanford's CoreNLP tools. They seemed popular at DHCS and appeared well-documented, so I thought I would give them a try. I had already worked quite a bit in the R TidyText sentiment analysis "branch" (conceptual, not a Git branch) and needed to determine whether CoreNLP would warrant losing some of that work to switch over to a new branch, would not be worth the switch, or (ideally) would be simple enough that I could keep both pipelines and compare the two.

Setting up

I followed the instructions at the Stanford CoreNLP Download page for "steps to setup from the official release." The problem is, these steps don't actually work. This is a running theme for documentation of Linux software / stuff from GitHub repos. I get it - remembering to update the documentation each time the software changes is challenging on an active project. But there has to be something in the software engineering team's toolkit that can make this workflow easier.

By following the instructions, running the suggested command
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt
resulted in an error: "Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP." This seems like a Java CLASSPATH thing, which is frustrating since I followed the commands in the instructions to change the CLASSPATH, and echoing it seems to show that all the stuff in the CoreNLP file structure has been added. But in any case, the error was preventing me from making progress. I poked around in the file structure and found "corenlp.sh." When I ran it, I noticed that it was using:
java -mx5g -cp "./*" edu.stanford 
So I think the change from -mx3g to -mx5g just gives Java more memory. This seems fine; I think my VM has 4GB of memory and then a bunch of swap. I'm assuming it will use swap, since I think the 5g means Java gets 5GB of memory - but maybe not; now that I think about it, maybe this has something to do with the slow results (below). Then the -cp "./*" option associates the stuff in the directory (I run it from the corenlp directory) with the Java classpath. Again, I thought I had already done this, but in any case, I combined these two findings to run
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file 2012-obama-SOTU.txt
which basically runs the whole CoreNLP battery over the specified text file. I don't exactly know what I mean by "CoreNLP battery," but it must be a lot, because the process took forever to run: a total of 1285.7 seconds (21+ minutes), generating a JSON file around 7.5MB (over 300,000 lines).

First-pass assessment of CoreNLP

It seems like it's mainly working out the structure of sentences - which words and phrases refer to and depend upon one another. It seems to record the character offset of each word within the file, and tags a "speaker," "pos," and "ner" for each word (not sure what all those mean yet).
The data on the left is pretty sparse, making for a lengthy file. The whole view of the left-hand JSON file only captures some of the data for two words of the original text in the right-hand file.
 So, this took forever and generated an enormous file that wasn't really doing much for me - it's here on GitHub in its entirety. It was time to shift to sentiment analysis alone, instead of the whole of CoreNLP.

I headed to the "sentiment page" of the CoreNLP project, which seemed promising. It appeared to address the potential problem I noticed in my post from 10/31/2018, where the Bing lexicon tagged mentions of "peace" as positive in Washington's 1790 SOTU even though the actual mentions were of peace being absent or in danger (hardly a positive sentiment). The CoreNLP sentiment package claims to "[compute] the sentiment based on how words compose the meaning of longer phrases" and handle things like negation better.

Thankfully, the "source code" page of the CoreNLP sentiment tool had the commands I needed, so running the sentiment analysis was easy. The default command output the results to the screen, and it was really fascinating watching each sentence go by along with a "positive," "negative," or "neutral" tag from CoreNLP. Most of the tags seemed totally wrong. I took a screenshot at one point containing sixteen sentences. Personally, I would say eleven are positive, one negative, and four neutral. CoreNLP called only three positive, eight negative, and four neutral. You can just check out the screenshot, but one especially bizarre tag to me is the "negative" call for "And together, the entire industry added nearly 160,000 jobs." It's hard for me to see how any portion of that sentence could really be construed as negative, and that pattern mostly held for the other sentences in the screenshot:
some bizarre sentiment calls from Stanford CoreNLP

I encourage readers to check out the full results here on my GitHub. Apparently CoreNLP is not a fan of US troops withdrawing from Iraq, among other weird results.

Moving Forward

While I can see myself working more with CoreNLP in the future and comparing its results to those I'm getting from the R TidyText methods, CoreNLP isn't the way forward for SOTU-db right now. It's too complex to learn on the timeline I have, and the results don't seem worth it. In the immediate future, my goal is to continue working within RStudio to generate a script that can accept input, find matches in the corpus, and output them back to the user.

Tuesday, November 13, 2018

DHCS 2018

This past weekend I had the pleasure of spending some time at the 13th Annual Chicago Colloquium on the Digital Humanities and Computer Science, or #DHCS2018, for which I served as a steering committee member and volunteer. There was a lot of fantastic scholarship, but for the purposes of this blog post I wanted to highlight a few papers that I was able to hear about (due to concurrent panels and some scheduling conflicts, I couldn't see them all):

Circulation Modeling of Library Book Promotions, Robin Burke (DePaul University)

Dr. Burke showed some great work studying the Chicago Public Library's One Book, One Chicago program. What caught my attention in particular, of course, was his mention of sentiment analysis; their project searched the texts of the assigned One Book, One Chicago novels for place-names (toponyms), identified the sentiment of each mention, and mapped them. I caught up with him after his panel, where he told me that they used Stanford's NLP package and analyzed the sentiment on a per-sentence basis, so each toponym was mapped along with the sentiment of the sentence in which it occurred. Dr. Burke cautioned that moving up to the paragraph level would be too much text for useful sentiment analysis, and suggested that for certain situations (such as some 18th- and 19th-century presidents' very long sentences) even shorter spans might be more useful - a few words in either direction of the occurrence. Since finding word occurrences and recording their sentiment is exactly what my project is doing, this was really useful information. Dr. Burke also noted that sentiment analysis might be a little more straightforward with political texts than with, for example, novels, and I shared that I had run across Stanford's time- and domain-specific sentiment lexicons. It was a great conversation, and it was cool to feel like I really had something to contribute to and take away from it.

Analyzing the Effectiveness of Using Character n-grams to Perform Authorship Attribution in the English Language, David Berdik (Duquesne University)

David gave a great talk on using n-grams to identify the authors of texts. I was pretty surprised at his conclusion that 2- and 3-grams were the most useful for identifying authorship, and I don't think I was alone in this surprise based on the audience's questions after his talk. However, I think I also misunderstood a bit; I thought he meant words that were only 2 and 3 characters long, but I think it actually means any string of 2 or 3 characters, which could include word fragments and even spaces. In any case, it gave me the idea of using the SOTU-db as training data, then allowing users to run their own texts through a tool on SOTU-db and get an output of which POTUS the user's text most resembles! This could potentially be a really fun classroom tool, especially if combined with a kind of "SOTU template" or "SOTU authoring" tool, and the ability to constrain the output (so teachers can ensure their students are matched with a POTUS they are studying). 

'This, reader, is no fiction': Examining the Correlation Between Reader Address and Author Identity in the Nineteenth- and Twentieth-Century Novel, Gabi Kirilloff (Texas Christian University)

Gabi's talk was an unexpected pleasure, and her discussions about how authors used reader address (such as "you" and "dear reader") to define their imagined audiences (among other things) had me unexpectedly drawing connections to SOTU-db. Analysis of how presidents have used pronouns like "we" and "you" could be quite revealing - something to remember as I think about cleaning the texts (both "we" and "you" would be stripped by the standard stopword lists).
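(For what it's worth, tidytext's stop_words list does include those pronouns, so keeping them is just a matter of trimming the list before the anti_join - a sketch, reusing the tidy one-word-per-row format from the posts below:)

library(dplyr)
library(tidytext)

data(stop_words)

# drop "we" and "you" (and relatives) from the stopword list so reader/audience address survives cleaning
my_stop_words <- stop_words %>%
  filter(!word %in% c("we", "you", "your", "our"))

# then: tidy_SOTUs %>% anti_join(my_stop_words, by = "word")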

Dissenting Women: Assessing the Significance of Gender on Rhetorical Style in the Supreme Court, Rosamond Thalken (Washington State University)

Rosamond's project was right in my wheelhouse and I was excited to hear about it (and about some of the earlier work she referenced). I can't find it right now on Google, but I'm going to keep looking; smart discussion of rhetoric and how it can be explored through computational techniques is a weak point of SOTU-db, but one that I find extremely interesting and important. Once I am confident the technical side of the tool is somewhat stable and running, I hope it can support the same types of analysis that Rosamond has done with her project here.


Moving Forward

On a practical level, the conference has made me want to explore the Stanford NLP package to ensure that I'm not making things harder on myself than they need to be with R and NLTK and everything. Stanford NLP popped up in multiple presentations, so it seems worth making sure I'm not neglecting the "industry standard" without good cause. Otherwise, the above-mentioned talks have mostly given me bigger-picture things to consider (which is great, because right now I don't need more major changes to my roadmap). It's wild how quickly the time is going - I felt like I was off to a very quick and even sometimes ahead-of-schedule start to the semester, and now I am shocked that my MA defense is only 21 days away. It's a good thing I got off to that good start or I'd really be in trouble! Maybe I'll get "minimum viable product" tattooed on my fingers or something...

Thursday, November 8, 2018

PHPhun

Ah, how quaint to look back at my old Input/Output post and see the exuberant naivete of youth... sentences like "I'm excited to try to figure out more PHP and get things hopefully running as a start-to-finish 'use case' in the next day or two!" have now been confronted with the reality of many hours banging my head against the wall trying to figure out how to get PHP to work with the R scripts I have. The good news is that I think I've finally overcome the major obstacles, and now I really do feel confident in getting a "use case" demo up in the next couple of days at most.

The goal

The point of PHP is essentially to get the parts of the project to interact with each other. Here's a basic outline of what should be possible:

  • user selects a SOTU from a list, let's say Obama's 2016 SOTU 
  • user clicks a button to display sentiment for that SOTU 
  • the button-press activates a PHP script that: 
    • identifies the proper SOTU 
    • runs an R script on that SOTU 
    • the R script returns a few lines of sentiment (starting with "#A tibble" in the picture below for example) 
    • that output is written to a file 
    • that file is displayed back to the user in their browser



The problem(s)

I will copy here a long portion of an email I sent asking for help from a good friend and talented programmer, Olly, with some extra notes highlighted in yellow:

1. I know that I can do the following in the command line:
echo words >> output.txt
that works fine, so far so good.
the above command will find output.txt (or create it if it doesn't exist) and append the word "words" to the file.

2. I also know that because I have R installed on my command line and a short script called myscript.R, I can type this into my terminal:

Rscript pathTo/myscript.R >> output.txt
which does what I expect. The output of the script, which is redirected (appended) to output.txt, looks like this:


the script "myscript.R" is running a really basic sentiment analysis on a text (a pre-defined text in this case; the results will be the same every time). By using the >> redirection again, the system sends those results to output.txt, resulting in a file like the one you see above, giving the number of "negative" and "positive" sentiment words found in the text.

3. What I want to have happen next is for PHP to do this script for me, triggered by a user clicking a button on the web page. Then the resulting file should be echoed back to the user so they can see the same results you see above. But I can't do it! Here's what I can do:

4. I can definitely call the PHP script (text-to-file.php), which can pipe arbitrary text to a file, then show the file to the user. I actually did this a couple different ways but I think the cleanest was at this commit, basically:

file_put_contents("../sentiments.md","Text is getting appended then a line break \n", FILE_APPEND);

$output = fopen("../sentiments.md","r");

echo fread($output,filesize("../sentiments.md"));

I could just keep hitting refresh and seeing new lines of "Text is getting appended then a line break" appear in the browser each time (which is what I was expecting).

5. I can also have the php script do:

exec("echo happy >> sentiments.txt");

$output = fopen("sentiments.txt","r");

echo fread($output, filesize("sentiments.txt"));


which again does what I want: I can F5 and keep appending "happy" to sentiments.txt, which shows in the browser, over and over.

But when I simply swap out "echo happy" with "Rscript myscript.R," it no longer works! The file (sentiments.txt) is created, but nothing is written to it, which makes the fread throw an error or at best display its contents: nothing. I'm guessing that the Rscript is just taking too long (it takes a good couple of seconds to execute and that time will only increase with longer documents) but I've had no luck using "sleep" or anything like that to try to introduce a delay between running the command and reading from the file.

Identifying and Overcoming

One really helpful factor in figuring out what was going wrong was redirecting my STDERR (standard error) to my STDOUT (standard output) so that instead of getting blank files, I was getting files that had the error printed to them. Thanks to Brian Storti for this great article explaining the 2>&1 idiom to redirect that output.

Error the First: Environmental Variables

Once I was able to see the error output that I'd redirected to a file, I was surprised to find I was actually getting a "the term Rscript is not recognized as the name of a cmdlet, function, script file..." error. This was a weird problem, since I knew for sure I had already added Rscript.exe to my PATH variable (after all, I could run it no problem just by typing into the terminal).

It turns out that the Apache server has its own environment variables. From the documentation, it seemed like it should be able to use the system's PATH variables too, but obviously in practice that wasn't working. Running phpinfo() showed the environment variables that Apache was using, and Rscript was nowhere to be found. Stupidly, I couldn't quite figure out how to update the environment variables for Apache, so my solution for now is just to use the "f:/ully/qualified/path/to/Rscript.exe" in the PHP file. This seems like a silly way to do it, but it works for now, so I'm forging ahead.

Error the Second: Package Locations

Once the PHP file knew where to find Rscript.exe, I was a little surprised to see it still throwing errors instead of output. This time, the error was about missing packages in R. I had been following the default practice in the RStudio install, which installs packages to the individual user's library. Instead, I wanted the packages put into a common library (the difference between these practices is explained in this RStudio support article). I just did a simple copy from the user folder to the common library, and that took care of that.
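For future reference, .libPaths() shows which libraries R searches (the per-user library first, then the common one on my setup), and install.packages() can target the common library directly - the path here is just a made-up example, not something to copy verbatim:

.libPaths()   # lists the library locations R searches, in order

# install straight into the common (site) library instead of the per-user one
install.packages("tidytext", lib = "C:/Program Files/R/R-3.5.1/library")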

Error the Third: Permissions

Once I finally got the PHP to find Rscript.exe AND the R packages it needed, I was still getting weird results. From experimenting some more on the command line, I found that somewhere along the way my permissions had gotten messed up, and the PHP script no longer had permission to create or edit the text files it was trying to create / append to. This makes sense: when I run commands from the terminal, the system knows I am logged in on my own user account, with those permissions. But when the PHP script tries to access those same shell commands, what "user" is it doing so under? What permissions does it have? Honestly, I don't quite know the answer to this yet (security is a weak point of my own software knowledge, and I don't think I'm alone among DHers in that regard), but by changing a few permissions and a few locations for where files are written and read from, I have a solution for now.

Results

It's not on the live, public server yet (I'll need to hand-edit the full paths to Rscript and such, since my development server at home is on Windows but the public one for the site is on Ubuntu - so, obviously, different file path schemes). But from the home server, I can now type in "obama," "washington," or anything else, and get taken to a real results page that shows:

  • the term entered in the search box,
  • whether that matches with "obama" or "washington,"
  • the full 2012 SOTU text if obama was searched, or the full 1790 text if washington was searched,
  • in the sentiment info box, the number of positive and negative (and the net positive-negative) sentiment word matches in the SOTU from the Bing sentiment lexicon:




The spacing is messed up, but as a proof of concept this reflects a ton of learning and work on my part in a short amount of time. From here, figuring out the SQL database is my only remaining big hurdle; the rest is about scaling up (which will not be a small job)! With this milestone, I'm feeling good about having a minimum viable product in the next few weeks. The next step will be figuring out SQL as well as leaving some of the web-dev stuff to the side to work on the R scripting.

Monday, November 5, 2018

HGSA 2018

On Saturday I had the chance to show SOTU-db as a poster at LUC's 15th Annual History Graduate Student Conference: "Building Bridges" (link to program here). This was a great opportunity for me to gather feedback on my work, but also a good deadline to meet in terms of motivation - the spike in commits on my GitHub repos over the past week is noticeable, and I feel great about the progress I've made!

It was a good exercise to figure out what to include on the poster. First, this was a history conference; the audience would probably be more interested in the "so what" of my project than in the technical details. Second, since my project scope has changed over the summer, it was a good exercise to reconsider what my project really was about. Ultimately, I decided on a summary blurb of text, then four boxes (mainly because four 8.5x11 inch pages fit neatly in the space on the posterboard) highlighting these four aspects of SOTU-db:
  • The text corpus of SOTUs,
  • the SOTU-db.com website and user-facing experience,
  • the LAMP stack that makes up SOTU-db's "digital infrastructure,"
  • text mining and analytics (R, NLTK).
This was my first time presenting at any kind of conference like this, poster or otherwise, and I enjoyed it quite a bit more than I expected. I felt like most people were finding the poster useful, but I really wish my live demo on the laptop had worked. It wasn't fully functional because I never could get the scroll bar on my fake "results" page to work, so the visualizations were not even on the screen. I do think people appreciated when I could show them the RStudio visualizations, but even those should have been better organized so I could find the ones I wanted with people waiting and looking over my shoulder!

I was really hoping that I could provide a username and password to people and have them visit the site on their own devices, but even if the search demo had been working properly, I don't seem to be able to access the site from Loyola's campus for whatever reason.

a man speaking and gesturing toward a poster with "SOTU-db" and other detailed text printed on it
credit: Rebecca Parker @RJP43
A huge congratulations is due to the folks in Loyola's History Graduate Student Association for putting together such a great conference! Unfortunately I wasn't able to stay for the whole day but I'm looking forward to catching up with some of the people who were there to learn what I missed!

The next big deadline for SOTU-db is my presentation at the CTSDH in about four weeks. Until then I'll be working hard to get my corpora and index database squared away, get the PHP on the site working with my server, and create a working search interface!

Thursday, November 1, 2018

Input/Output

Over the summer, I took some courses on CodeAcademy, and one of them was just a command line interface (CLI) course. It didn't feel very useful except that I learned about standard input / output, and the >, >>, and | operators that allow inputs and outputs to be exchanged between commands and put to use. Even if I end up not using this exact technique, I'm really glad I had that lesson - it helped me to understand more of what the system is actually doing with files. Being used to finicky filetypes (like distinguishing between .doc and .docx), there's something very satisfying about working with the raw data and being able to transform and use it in different ways.

Earlier this evening I made a checklist of tasks I thought I would need to finish in order to get a viable working product ready to show for Saturday morning (plus two more that would need to get done at some point but weren't urgent). These were:

  • learn to issue commands to R through the CLI (done)
  • learn to pipe variables (like the user's query term) into those commands (not done yet with variables, but done with text - should be quick)
  • output whatever R outputs into a static document (done via | pipe, but might switch to php)
  • figure out how to automate serving this document (done, just fixed permissions, and now by piping right to the file on the server, the server automatically picks up those changes. Just need to trigger a refresh for the user, see below)
  • direct user to the newly generated page (not done)
  • listen for GET requests to server (not done)
  • extract the search term from the URL (not done, and may not even do it this way anymore)
  • figure out the user experience - loading bar? (not done yet, related to how I redirect the user to the new page)

For a few hours' work, I'm pretty excited about what I've done: I decided on PHP for pretty much the rest of the minimum viable product, and figured out how to take user input, run it through a script in R, and output a new page for the user. From there, everything should be mostly a matter of tweaking and scale, and finally directing my focus toward the actual humanities/historical questions on which this tool can hopefully offer new perspectives. 

For now, it's very exciting that I was able to input a filename (the "-washington.md" file) and specify an output file (sentiments.txt) with this:
Rscript myscript.R < pathTo/1790-01-08-washington.md > /pathTo/sentiments.txt
and then throw in this fun command, inputting a .jpg of Trump and specifying an append to that same output sentiments.txt:
jp2a /pathTo/trump.jpg >> /pathTo/sentiments.txt
and ended up with this:
who knew I would ever be pumped to see this ascii-art face pop up on my screen!

By the way, here's the original image:


Pretty good for a night's work if I may say. From the command line, I can specify a couple of filenames and get a readout of the sentiment of that text and an ascii-art image from jp2a as a bonus. If the user knew to hit refresh, the readout is even hitting the live server automatically, which is cool. Even though the sentiments from 1790 and the image of Trump have pretty much nothing to do with each other, this was a worthwhile task. I'm excited to try to figure out more PHP and get things hopefully running as a start-to-finish "use case" in the next day or two! I'll leave this image of my updated view of how the project components will fit together below, and call it a night.
a UML diagram showing the data flow from the user's query to the returned static web page
still a bit of a mess and I'm sure nothing close to UML-standard, but closer than before, and finally beginning to feel like it is reflecting reality more than just my ideas!


Wednesday, October 31, 2018

It's alive! (somewhat)!

I've been plugging away at working with sentiment analysis in RStudio and want to pause to share some results! This post will cover working with RStudio scripts to generate "tidy" data frames and perform some sentiment analysis per-chunk (instead of just per-document). 

Tidy Data

As I've mentioned in previous posts, a lot of the sentiment analysis packages out there make use of "tidy data." As Julia Silge explains in her blog post from 2016, tidy data takes the following format:
  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table

So, basically, the goal is to convert the SOTUs into really long, skinny tables, where each occurrence of a word gets its own row. This means the tidy table (stupidly called a "tibble," apparently) keeps track of where words are in relation to each other. This is an improvement (for my purposes) over the "document-term-matrix" I created in my initial text mining steps, which only gave each word a single row along with the number of times it appears in the corpus, in alphabetical order (not the order the words appear in the text).

a chart with SOTU filenames for rows and alphabetical list of words for column headings. the chart is populated with numbers showing how many times each term appears in each file.
The document-term-matrix. Every word is recorded and we can see which files they are in, but there's no way to track where they are in relation to one another within the files.

a chart with each word in the first sentence of Washington's first 1790 SOTU as its own row
The tidy data frame retains the words' relationships to each other - linenumber isn't working the way I want right now, but it does reset with each file and keeps the context of each token (word) intact.
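In code, the conversion from raw text to that tidy format is basically a single unnest_tokens() call; here's a toy example with a made-up one-sentence "document" instead of my real corpus:

library(dplyr)
library(tidytext)

toy <- data.frame(file = "1790-washington.txt",
                  text = "I embrace with great satisfaction the opportunity which now presents itself",
                  stringsAsFactors = FALSE)

toy %>%
  unnest_tokens(word, text)
# result: a tibble with one row per word, in the order the words appear -
# file = "1790-washington.txt", word = "i", "embrace", "with", "great", ...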


Analysis: Positive Words in Washington's 1792 SOTU

Once the "tidy_SOTUs" data frame was created, I could use it for all kinds of fun analysis that raised a lot of new questions. For example, the top "positive" words in Washington's 1792 SOTU were (a sketch of the kind of pipeline that produces this count follows the list):
  1.  peace, 5 instances
  2.  present, 5 instances
  3.  found, 4 instances
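That count came out of a pipeline shaped roughly like this (a sketch, assuming the tidy_SOTUs frame described above and the NRC lexicon; the filename is just illustrative):

library(dplyr)
library(tidytext)

tidy_SOTUs %>%
  filter(file == "1792-washington.txt") %>%            # illustrative filename
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment == "positive"),        # keep only NRC's "positive" words
             by = "word") %>%
  count(word, sort = TRUE)                             # peace, present, found, ... at the top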
Peace seems definitely positive, but you probably wouldn't be talking about it all the time if there were no threat of war. Most of Washington's mentions are really about the desire for, but absence of, peace.
  • A sanction commonly respected even among savages has been found in this instance insufficient to protect from massacre the emissaries of peace.
  • I can not dismiss the subject of Indian affairs without again recommending to your consideration the [. . .] restraining the commission of outrages upon the Indians, without which all pacific plans must prove nugatory. To enable [. . .] the employment of qualified and trusty persons to reside among them as agents would also contribute to the preservation of peace . . .
  • I particularly recommend to your consideration the means of preventing those aggressions by our citizens on the territory of other nations, and other infractions of the law of nations, which, furnishing just subject of complaint, might endanger our peace with them . . .
So, while it was fun to learn the word "nugatory," qualifying these mentions of "peace" as "positive sentiments" is probably a little misleading.

"Present" I assume is on the positive list from NRC (here, I think, but I need to look more into the exact list included with R, and the methodology for the "crowdsourcing" used to generate the list) as a synonym for "gift," but that's not how Washington is using the word at all (he's referring to the temporal present). 

"Found," again, seems like it would be on the "positive" list because of its sense as recovering something lost, but Washington is always using it in a legalistic context: 
  • a sanction that "has been found in this instance insufficient,"
  • considering the expense of future operations "which may be found inevitable,"
  • a reform of the judiciary that "will, it is presumed, be found worthy of particular attention," and
  • a question about the post office, which, if "upon due inquiry, be found to be the fact," a remedy would need to be considered.
None of these are really strictly negative sentiments, but they are not like the "I found my missing wallet" or "found a new hobby" types of sentiments that I expect led to the word's placement on the "positive" list.


Sentiment Analysis Within Documents

I had already been able to run some sentiment analysis tasks on my corpus of SOTUs, but because they were relying on document-term-matrices instead of tidy data, I wasn't able to run any analysis comparing words or contexts within documents - only between them. The line number thing still isn't working like I want (each token in each SOTU file was given its own line number instead of actually going line-by-line; I tried using as.integer() and division to count every ten words as a "line," but it didn't work). Even so, having the words in order still enabled analysis I couldn't do before.

The first visualization I generated with tidy text was a great feeling. Here's the code with some comments:

library(tidyr)  #load the tidyr library

SOTU_sentiment <- tidy_SOTUs %>%  # dump tidy_SOTUs into SOTU_sentiment,
  inner_join(get_sentiments("bing")) %>%  # join each word against the "bing" sentiment lexicon (a different lexicon from the NRC one in the previous example)
  count(file, index = linenumber %/% 80, sentiment) %>% # group words into chunks of 80 (linenumber here is really a per-token counter, so index identifies each 80-word chunk) and count sentiment words per chunk
  spread(sentiment, n, fill = 0) %>% # spread() (this is why we wanted tidyr) pivots the counts wide: one "positive" and one "negative" column per chunk, with missing combinations filled as 0 so the subtraction below works
  mutate(sentiment = positive - negative) # finally, calculate "sentiment" by subtracting instances of negative words from instances of positive words, per 80-token chunk

library(ggplot2) # load the ggplot2 library for visualization

ggplot(SOTU_sentiment, aes(index, sentiment, fill = file)) + # plot SOTU_sentiment
  geom_col(show.legend = FALSE) + # don't show a legend
  facet_wrap(~file, ncol = 2, scales = "free_x") # one panel per file, in 2 columns; "free_x" lets each panel's x-axis fit its own range of chunk indices


and here's the resulting visualization from ggplot:
Organizing these file names with month-first was a little dumb and resulted in these being out of order, but ideally the file-names shouldn't make a huge difference for the final product anyway.

It is interesting to note the pattern of most addresses beginning on a positive note. This seemed like a clear enough pattern (with notable enough exceptions in 1792 and 1794) that it was worth looking into before going much further - if the NRC list was seemingly so off-base about "peace" and "present," I wanted to see if these visualizations even meant anything.

Grabbing the first 160 words (two 80-word chunks) from the 1790 and 1794 SOTUs, then comparing them subjectively with their respective charts revealed the following (image contains lots of text, also available as a markdown file on the GitHub repo (raw markdown)):
please visit markdown file link to read text contained within image

I have to say, these definitely pass the smell-test. 1790 is all congratulatory and full of lofty, philosophic patriotism, while 1794 is about insurrection and taxation. I was especially pleased that, even though 1794 contains a lot of what I would think are positive words (gracious, indulgence, heaven, riches, power, happiness, expedient, stability), the chart still depicts a negative number that is in-line with my subjective reading of the text.


Comparing Sentiment Lexicons

Like I mentioned, I have used two lexicons so far in these examples: "NRC" and "Bing" (not the search engine). It's not entirely clear (without more digging) how these lexicons were generated (NRC mentions "crowdsourcing" and Bing just says "compiled over many years starting from our first paper"), but for now, I wanted to at least start by getting a feel for how they might differ. Especially as I'm dealing with texts where people say things like "nugatory" and "burthens" (or even just the difference in word choices between Bush 43 and Obama), it's definitely possible that these lexicons won't be a good fit across over 200 years of texts.

Fortunately, the "Sentiment Analysis with Tidy Data" walkthrough I was following had just the thing. I'm realizing that eventually I should make some kind of notebook for this code, and I'll omit my exact script here (a rough sketch is at the end of this post), but basically I ended up with a chart comparing three different sentiment lexicons - AFINN, Bing et al., and NRC - each run over Washington's 1793 SOTU:

It's a good sign that the trends across the three analyses roughly track together. In general, it looks like AFINN skews more positive overall and is a little less volatile than the others, and obviously the three use different scales, but it's nice that the results weren't radically different. 
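The sketch I promised, reconstructed from memory of that walkthrough rather than copied from my script (the filename is illustrative, and the AFINN score column was called "score" in the version I had installed):

library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

washington_1793 <- tidy_SOTUs %>% filter(file == "1793-washington.txt")

afinn <- washington_1793 %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(index = linenumber %/% 80) %>%              # same 80-token chunks as above
  summarise(sentiment = sum(score)) %>%
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  washington_1793 %>% inner_join(get_sentiments("bing"), by = "word") %>%
    mutate(method = "Bing et al."),
  washington_1793 %>% inner_join(get_sentiments("nrc") %>%
                                   filter(sentiment %in% c("positive", "negative")),
                                 by = "word") %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")      # one panel per lexicon, each with its own y-scale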

Lots more to come, but that's plenty for one post. 

Wednesday, October 24, 2018

Forays into Sentiment Analysis

I spent some time working with Rachael Tatman's "Sentiment Analysis in R" tutorial and wanted to blog here about some of the results.

Setting up

I had already done some basic text mining and tokenization of my State of the Union corpus (which, as I've mentioned before, is incomplete and inaccurate right now). Hilariously, Tatman's tutorial also used some State of the Union addresses as its corpus, although it only contains a few years' worth of SOTUs. I was working with about 215 - not the full set, but certainly a better sample size than the few in the tutorial. 

The first steps were just loading in the requisite R packages: tidyverse, tidytext, glue, and stringr. Tidyverse took quite a while to install and required several rounds of reading error messages, installing stuff with apt, trying to install again, and waiting for the next error. It would be cool if I could install these packages through apt, or if R could fetch its own system dependencies (or even just give a clearer error message!), but whatever. 

With those packages loaded, I used glue to write a "fileName" vector (I guess that's what we're calling them in R?) that includes the full file path and name, and a "fileText" vector that we then turn into a data_frame and tokenize into the vector "tokens." I also learned the fun fact that dollar signs are a special character in the regular expressions R uses, and need to be escaped with \\ in order to operate on them. Hooray for learning! Really, I should have guessed/figured this out before, but I know now.
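The setup looks roughly like this (reconstructed from memory of the tutorial rather than copied verbatim, with a made-up directory name):

library(tidyverse)   # readr's read_file() comes along with this
library(tidytext)
library(glue)

files <- list.files("sotu-corpus")                     # made-up directory of .txt files

# for one file: build the full path, read the text in, and strip the dollar signs
fileName <- glue("sotu-corpus/", files[1])
fileName <- trimws(fileName)                           # the tutorial trims stray whitespace here
fileText <- glue(read_file(fileName))
fileText <- gsub("\\$", "", fileText)                  # $ is a regex special character, hence the \\

# tokenize: one word per row
tokens <- data_frame(text = fileText) %>%
  unnest_tokens(word, text)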

Counting Sentiments

The nice thing about tidytext was that I already had some sentiment lexicons loaded (well, "some" in theory, only "bing" worked for me) and ready to use. Happily, the sentiment analysis function worked on my first try, and was pretty fun:
Even though I'm really just executing instructions other people wrote, running this the first few times was a pretty satisfying feeling

But this also raised some questions. If this was just one list, how would other lists change the results? What exactly was this spread() function doing - Tatman's comments say she "made data wide rather than narrow," which, sure. I don't know what that means (a little illustration of what it's actually doing is below). Nonetheless, I was looking for proof-of-concept at this point, not a robust and fully theorized methodology, so I continued on.
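Coming back to this after the fact: as far as I can tell, spread() takes the long ("narrow") counts - one row per file-and-sentiment combination - and pivots them "wide," into one row per file with a separate column per sentiment, which is what makes subtracting negative from positive possible. A toy example with made-up counts:

library(tibble)
library(tidyr)

long <- tribble(~file, ~sentiment, ~n,
                "1790.txt", "negative", 15,
                "1790.txt", "positive", 54,
                "1791.txt", "negative", 23,
                "1791.txt", "positive", 61)

wide <- spread(long, sentiment, n, fill = 0)
# result: one row per file, one column per sentiment
#   file      negative  positive
#   1790.txt        15        54
#   1791.txt        23        61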

I ran into some fun issues with REGEX here - Tatman's tutorial used REGEX to pull attributes from the filenames in the corpus and assign them to attributes of the text. My filenames were in a slightly different format from the tutorial's, and looked like this:

  • 1801-Jefferson-12-8.txt
  • 1945-FDR-1-6.txt

and so on. I couldn't quite figure out how to match the names, and my clumsy solution of using ([[:alpha:]]{4,}) to find strings of letters longer than three characters (so as to avoid picking up "txt" in every file name) was actually missing FDR (used instead of "Roosevelt," since there are two Roosevelt presidents). I posted a "help wanted" issue on GitHub, and fortunately @RJP43 responded and helped me find a solution. Some other random guy actually replied, too, but he deleted his (perfectly helpful) response, which was a bummer.
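One approach that handles both cases - not necessarily the exact fix we landed on - is to anchor on the hyphens instead of on word length, and let a capture group grab whatever sits between them:

library(stringr)

examples <- c("1801-Jefferson-12-8.txt", "1945-FDR-1-6.txt")

# capture the year and whatever falls between the first and second hyphens,
# so a short name like "FDR" gets picked up along with "Jefferson"
matches <- str_match(examples, "^(\\d{4})-([A-Za-z]+)-")

matches[, 2]   # "1801" "1945"
matches[, 3]   # "Jefferson" "FDR"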

I also had a tough time assigning the political party of each POTUS to their respective SOTUs - I actually ended up just encoding them into the filenames and using a different REGEX to find and extract them. When I tried to match and assign based on the presidents' names like in Tatman's tutorial:
democrats = sentiments %>%
  filter(president == c("Jackson", "Polk")) %>%
  mutate(party = "D")

It didn't work - it wouldn't grab all of Jackson and Polk's SOTUs, but just a few. Rerunning the function with different individual names did grab them all, but it was rewriting - not appending - and so still not doing the trick. In the end, encoding the parties into the filenames worked for now, but this is something that seems pretty basic in R and that I should work on understanding.
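Coming back to this one later: I'm fairly sure the culprit is that == against a vector of two names gets recycled and compared element-by-element (row 1 against "Jackson," row 2 against "Polk," row 3 against "Jackson," and so on), which is exactly the "only grabs a few" behavior I saw. %in% is the operator that means "matches any of these":

democrats = sentiments %>%
  filter(president %in% c("Jackson", "Polk")) %>%   # %in% keeps every Jackson and Polk row
  mutate(party = "D")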

Also, the "normalSentiment" was not exactly in Tatman's tutorial (though she did offer it as an extra challenge along with a hint about using the nrow() function). I'm using
normalSentiment = sentiment/nrow(tokens)
to get it right now, but is doing the sentiment subtraction without normalization first the same as doing the normalization before the subtraction? In other words:
normalSentiment = (positive - negative) / tokens,
or
normalSentiment = (positive/tokens)-(negative/tokens)?
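(For what it's worth, a quick check with made-up numbers suggests the two come out identical, since the division distributes over the subtraction:)

positive = 54; negative = 15; tokens = 2000    # made-up numbers

(positive - negative) / tokens                 # 0.0195
(positive / tokens) - (negative / tokens)      # 0.0195 - same answer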
This seemed like an embarrassingly simple question, but the little check above settles it. Anyway, below is a snippet of the table I ended up with:
It took some doing, but I was able to assign a party and generate a "normalSentiment" for each SOTU

Visualizations and Analysis

The first visualization just showed the "sentiment" score of the texts over time. Here it is with "sentiment" (just the raw positive - negative score) and "normalSentiment" ((positive-negative)/word count):






Obviously, there's a lot going on under the hood here, but it's definitely notable that normalization hits Washington especially hard (Washington owns 2 of the top 5 shortest SOTUs by word count, never topping 3,000 words). The normalized sentiment shows a totally different pattern than the "un-normalized" one (if nothing else, the dip in sentiment around the Civil War in the normalized chart makes more sense than the bump in the first chart).

Interestingly, the two charts also seem to suggest different outliers - the first chart shows sentiments tightly clustered (probably reflecting their short length more than actual sentiment) in the early years of the republic, then major outliers in the Taft, Truman, and Carter years. In the normalized version, it looks like Washington was all over the place, but there are fewer off-the-charts outliers overall. This makes sense - Washington's shorter speeches lead to a volatile normalized score because each word is essentially "worth more" in the score, and Carter's insanely positive speech is actually just his giant 33,000+ word written address that he dropped on his way out of office in 1981.

Next up: sentiment by party. The following two charts again show the "raw" and "normalized" sentiments, this time showing box plots by political party. The colors are a little confusing, so read the labels carefully!


Wow! A few things stood out to me here right away. First, note that only George Washington is considered "unaffiliated," and notice how much the size of the unaffiliated box changes depending on whether we normalize or not! John Adams, as our only federalist, shows a similar effect. Also, there are only four whigs and four democratic-republicans. Each president has more than one SOTU, but either way these sample sizes are fairly small. In any case, I was surprised to see the Democrats score lower on sentiment than Republicans in both versions of the chart! I am eager to run this again and break things up more by time - it's important to remember that today's Democratic and Republican parties bear little resemblance to those of the same names in the 19th century.

Moving Forward

First I just have to say, these visualizations are lovely. I'll learn to color-code them more appropriately (blue for Republicans and red for Democrats is simply too hard for my brain to get used to) and manipulate the data and the axes/labels in more detail, but hats off to the developers of these packages for enabling such simple and inviting visualizations so easily. 
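(When I get around to it, ggplot2's scale_fill_manual() should handle the color-coding - a sketch, guessing at the party codes I encoded in the filenames; every party level in the data needs an entry:)

library(ggplot2)

party_colors = c("D" = "blue", "R" = "red", "DR" = "darkgreen",
                 "W" = "orange", "F" = "purple", "U" = "grey50")   # the level names here are guesses

ggplot(sentiments, aes(x = party, y = normalSentiment, fill = party)) +
  geom_boxplot() +
  scale_fill_manual(values = party_colors)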

My next step is to find out whether I can find a solution for running sentiment analysis on particular words. This is my ultimate goal, though I'm beginning to worry it might require some more tagging and slicing and dicing of the texts themselves - perhaps breaking them into paragraph-level chunks would work. I also need to learn more about the algorithms, parameters, and word lists that I used in these examples. I would also be interested in trying to pull in some tweets and run some analyses on those; something I've seen folks using R for around the web. 

Static Server is Up!

Another lengthy post about text mining and sentiment analysis is coming soon, but for now: I've successfully deployed a Tomcat server to my sotu-db.cs.luc.edu machine. Right now, the only thing hosted there is the extremely simple static search box, but you can access it live now at http://sotu-db.cs.luc.edu:8080/sotu-db/. The search button doesn't actually do anything, and I still have a lot of implementation details to work out. But this is a nice step.

I'm using Tomcat 9 for the server, and I followed the walkthrough here to get things set up. It seems like Tomcat has some flexibility in terms of hosting JS apps and stuff that could be useful for later phases of the project. Right now, however, I'm thinking the solution might look something like enter search term > pipe term into a new document with touch > feed that document to RStudio somehow and save results into new document > serve the new document as a static page. There's a lot I think I'm missing here in terms of security, but right now that's not a major concern for this project (people won't be creating usernames or passwords, and the corpus is public anyway).

it's not much to look at, but this screen and the fact that anyone can see it by visiting this address are the products of a lot of work!

Tuesday, October 23, 2018

Progress Report: Scripting in R



Thanks to Dr. George Thiruvathukal of LUC's Computer Science department (my final DH project co-advisor), I have an RStudio (the IDE for R) server running on a machine hosted by LUC - which, as I'm writing this, I realize I should really point to from rserver.sotu-db.com or something like that (edit: compute-r.sotu-db.com now leads directly to the RStudio login page). For more on setting up the server itself, see my blog post "Setting up an RStudio Server" at blog.tylermonaghan.com. This post is about working with some RStudio scripts and packages that I found by following a few online tutorials. I'm behind my Oct. 22 deadline by a day already, but I think it's worth taking time to blog about what I've found so far, because there's a lot going on.

Basic Text Mining in R

For this section, I followed the excellent tutorial "A gentle introduction to text mining using R" at the "Eight to Late" blog. The post is a few years old, but still works well and does a good job of explaining what's going on in the code.

One main way I deviated from the "gentle introduction" tutorial was by using my own corpus of "State of the Union" texts (1790-2006, plus the 2016 and 2018 annual presidential remarks to Congress). For more on which of these are technically "State of the Unions," which were given as verbal remarks versus written reports, and more, see UC-Santa Barbara's American Presidency Project - which, goodness, has changed a lot recently; I'll have to grab and archive a copy of the old version, which was more compact and data-dense. The tutorial used a corpus of the author's own blog posts, a clever idea which I would enjoy replicating with the posts on this blog once I have accumulated enough to make it interesting! 

Another couple of things I did differently from the tutorial: I used equals signs (=) instead of arrows (<-) because I think those arrows are weird. Also, instead of just rewriting the same SOTUs corpus (called "docs" in the tutorial), I often incremented the variable name so that I would end up with "SOTUs," "SOTUs2," "SOTUs3," etc. I did this to preserve each transformation step-by-step, keeping the ability to reverse particular steps and document things along the way. Even though this cluttered up the environment a little, I thought it was well worth it.


Transforming data step by step





These image galleries on Blogger really stink, I really need to migrate this blog to somewhere else...

Anyway, you can already see how "fellow citizens" got combined somewhere between SOTUs2 and SOTUs3, and stayed that way for all future steps. Also, as I've written about earlier on this blog, removing stopwords will be appropriate for some textual analysis and distant reading, but I don't think it's necessary or wise when running sentiment analysis - something to keep in mind.

One pretty funny issue I ran into was with dollar signs. I didn't realize it at first, but they're special characters in the regular expressions these functions use, so they need to be escaped. At first I just wanted the dollar signs stripped out, but talking about money could be an interesting part of analyzing State of the Union texts, so instead I wanted to replace "$" with "dollars." First I tried building a toSpace transformer like I'd done with colons and hyphens:

SOTUs2 = tm_map(SOTUs2, toSpace, "$")


This didn't really do anything, so next I tried

toDollars = content_transformer(function(x,pattern) {return (gsub(pattern, "dollars", x))})
#use toDollars to change dollar signs to "dollars"
SOTUs3 = tm_map(SOTUs2, toDollars, "$")

which again, didn't do much besides add the word "dollarsdollars" to the end of each document.
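(In hindsight, I think that appending is the giveaway: an unescaped $ in a regular expression doesn't match the dollar-sign character at all - it's the "end of string" anchor - so gsub was matching the empty position at the end of each document. Escaping it, or telling gsub to skip regex matching entirely, targets the literal character:)

x = "The Treasury holds $1,000,000."

gsub("$", "dollars", x)                 # "The Treasury holds $1,000,000.dollars" - matched the end-of-string position
gsub("\\$", "dollars", x)               # "The Treasury holds dollars1,000,000." - escaped, so the literal $ is replaced
gsub("$", "dollars", x, fixed = TRUE)   # same result, by treating the pattern as plain text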

Finally, I tried

replaceDollars = function(x) gsub("?","dollars",x)

Which again, didn't work. Dollar signs were still in the texts! So, just to try, I replaced $ with ? - looking back, I'm not sure what I expected, but I was pretty entertained when I turned Thomas Jefferson's 1808 State of the Union report into an endless string of the word "dollars":


Ultimately, I will not be using any of these as my working texts for SOTU-db so it's not as important to fix each error (fellowcitizens) as it is to understand where and how errors are being introduced, and ensure that I am thinking about and accounting for them. In the end, I did get a visualization out of it which is fun:

This basic counting is a far cry from sentiment analysis, or really anything particularly revelatory at all, but it's a good marker of progress for me on this project! Also, the fact that "will," a positive, constructive word, is the top word feels slightly encouraging to me in this time when patriotic fervor seems synonymous with a mistrust of all our democratic institutions. Don't ask me what it means that "state" is #2, though! Stay tuned for more R with sentiment analysis coming very soon!

Tuesday, October 9, 2018

Milestone: Frontend options

Today, 10/10, was my deadline for completing the SOTU-db frontend options, and I'm pleased to have very basic HTML and React Native front pages. They each consist solely of a title, a search bar, and a submit button, but they exist. It's worth starting to think about some pros and cons as I move into my next milestone, working on the sentiment analysis portion of the project.

The HTML version has the added benefit of being able to use GET requests with the submit button. This would allow users to save URLs with their search terms inside them. The other limitations of GET (like the lack of URL privacy and the size limit) don't seem like they would be issues for this project. There's definitely an appealing brutalism to the simple HTML version I have now.

The React Native version has the benefit of being a newer tool that I could use some experience with. I like the way it's designed with mobile in mind, and I think this version would feel more fun and responsive on a mobile device. The downsides are that it's much more complicated for me to use (since I'm new to React Native and JS in general) and I'm not quite sure I could use the URL-saving that HTML GET would allow.

Fortunately I don't have to make a final decision on this for another month or so, but it's good to have these considerations in mind as I keep working!

Next stop, a home-baked R Server and getting some sentiment analysis up and running!

Saturday, September 22, 2018

September 2018 Update: Refined project scope

It is my final semester at Loyola University Chicago, and that means getting SOTU-db into functional form! With that in mind, the scope of the project is being refined: SOTU-db will simply be a web interface for performing sentiment analysis on words or phrases that appear in the SOTU corpus. It's doubtful that full markup will be feasible for even one full address, and certainly not for the whole corpus. Stay tuned for more continuous and content-rich updates in the near future as the project kicks into gear during this fall semester!

Thursday, May 10, 2018

Summer 2018 Update

It's been a hectic spring semester (and one that has not felt like spring at all), but we are all finished up with spring classes at Loyola. That means a round of updates to the blog (you're reading it), the main page, and the launch of the SOTU-db web app prototype at prototype.sotu-db.com. There is a very high chance that the address for the prototype will not function, but you are welcome to try! There is also a prototype HTML interface in the works at https://tymonaghan.github.io/sotu-db/new-site/index.html. Stay tuned for another, more thorough update from the spring, including:

  • working with NLTK
  • getting the prototype developed with some thoughts on prototyping software tools
  • updates on the planned features for SOTU-db
Plans for the summer months include:
  • continuing to develop the prototype for the SOTU-db web app
  • finding a more stable digital host/home for the project
  • re-visiting and adjusting the project objectives and timeline
Even though things are changing and in many cases behind schedule, I'm really excited about the direction of this project; I'm hopeful we can even get the thing up and running in time for the next real-world State of the Union address in 2019! 

Tuesday, April 17, 2018

Working with Twarc



For the last few weeks I have been working with the command line tool Twarc. It essentially pulls data from Twitter and saves it all as a JSONL file. Now, I'll be frank: I barely know how to use this tool, and it took me quite a long time to even set it up. Even when I finally had it up and running, I couldn't figure out how to get to the data and see what I had collected. 

I downloaded it off GitHub, and the download itself was easy enough. The readme gave me information on how to do searches, so I was all good there. The issue came when I went searching for the files I had supposedly created. I looked through the twarc-master folder to see if they had been deposited there, and found nothing. After freaking out a bit, I realized by looking at the command prompt where I had run the search (seen below) that the files would go to my user folder on my laptop:

So you can see, the file path is shown clearly. I was able to find the JSONL file in my user folder. The big issue at that point was: how do I open the files to see the data? Eventually I figured out that Word was suitable, and I was able to open a document with all the information for me to peruse. You can see below what that information looks like:



All that data sure looks confusing! I wish I could give you a more detailed walkthrough of how to use the data and what it all means, but I'm still figuring that out for myself. What I do know is that the URLs will take you either to the tweet itself or to the media attached to it. That is what makes Twarc really awesome: you get access to the media, something not all Twitter scrapers give you. One limitation: searches can only reach back about seven days, since that is all that's available through Twitter's standard search API.
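
For anyone who would rather not open the JSONL file in Word, here is a rough sketch of how you could peek at it with a few lines of Python (Twarc itself is a Python tool). It assumes a file named tweets.jsonl created by a command along the lines of twarc search "your query" > tweets.jsonl, and the field names it uses (user, id_str, extended_entities) come from Twitter's standard API output; treat it as a starting point rather than a definitive recipe.

    # Rough sketch: read a Twarc JSONL file (one tweet's JSON per line) and print
    # a link to each tweet plus any attached media URLs.
    # Assumes tweets.jsonl exists, e.g. from: twarc search "state of the union" > tweets.jsonl
    import json

    with open("tweets.jsonl", encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            user = tweet["user"]["screen_name"]
            print("https://twitter.com/{}/status/{}".format(user, tweet["id_str"]))
            # Photos and video thumbnails, when present, live under extended_entities
            for media in tweet.get("extended_entities", {}).get("media", []):
                print("  media:", media["media_url_https"])

Nothing fancy, but it makes following those links a lot easier than scrolling through a giant Word document.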

As I learn more, I'll post updates here to try to make your experience with Twarc easier than mine!

Thursday, March 15, 2018

Project Trajectory

This post explores more of the timing and motivation behind SOTU-db.

The idea

Ultimately, I suppose the inspiration for SOTU-db goes back to my own high school career, when I certainly would not have found working with the text of a State of the Union address to be an engaging activity. More recently, though, my experience teaching high school social studies really made me treasure well-made online learning tools. All too often, I would get excited about an online resource and plan to take my class to the computer lab with a specific activity in mind; once in the lab with computers in front of them, students would immediately set to work logging into Facebook, Instagram, Twitter, etc.

I understand the appeal of these platforms, and I often found myself thankful that these services were not readily available via pocket-sized devices when I was a high school student -- I'm sure they would not have helped with my own engagement in my schoolwork. Though connecting with friends and being part of that social milieu is a key appeal of these products, it's also undeniable that the web interfaces for interacting with these platforms are far superior to many of the digital tools I was trying to push onto my students. The New York Times's 2010 census map tool could be an amazing learning opportunity - but the Flash interface made loading it on different devices unreliable and inconsistent, the page was heavy and often crashed or stopped working under load, and the interface was simply too different from the expected Google Maps-like experience to hold students' interest.

The New York Times 2010 Census Map online tool.
This was discouraging to me. What if students could log in to the census map, track their progress, and see notification badges letting them know they have tasks left undone and insights left unexplored? What if the map offered suggestions or nudges for users to explore in a particular way, or to try out a certain feature? Could student engagement be improved? Questions like these are a big part of how I ended up as a graduate student in Digital Humanities. Creating a tool that teachers can effectively use in diverse classrooms is a major goal for me, because it's something I would have appreciated more of when I was teaching.

The idea to work with State of the Union addresses in particular came about as I was watching President Trump's 2018 "State of the Union" address. Because they were the words of a Head of State, President Trump's remarks were being symbolically decoded and analyzed all over the world. Particularly during an administration marked by instability, whether President Trump could conform to the expectations of the office and deliver an address with clear, coherent messaging felt like an urgent question. At the conclusion of the address, I felt Trump had successfully conformed to the aesthetic and ceremonial expectations of the evening without saying much in the way of substantive policy or ideological goals. As someone looking for clues about the trajectory of this administration, I found this far from satisfying. Was there a way to cut through the noise and see what set Trump's speech apart from other State of the Union addresses? Could we compare his word choices and topic selections with those of previous presidents to learn more about the priorities of this administration?

From my perspective as a student in Digital Humanities, these questions seemed perfectly suited to the application of digital tools for scholarship and criticism. On a basic level, it seemed obvious to use tools for textual analysis (Voyant Tools, R) to analyze the specific word choices of the January 30, 2018 address. In fact, interesting and insightful analyses of this type had already been conducted and are available online.
I envisioned a project that could put some of this together into one platform and open up a kind of exploratory, playful engagement with the texts. Inspired by the simple, fun digital tools I've explored already in my career as a Digital Humanist (see the list of DH tools below), I wanted to create my own platform for searching, comparing, analyzing, and visualizing the text of these highly symbolic acts of communication: the annual address (or State of the Union). This, too, has already been done, at a site called "SOTU."
DH Tools:
How is my own project distinct from SOTU? At the most basic level, SOTU-db is different because I am creating it. As I outlined in my blog post about project goals, the goals for the SOTU-db product align nicely with my own goals as a developer and researcher. Therefore, even if SOTU-db offered no additional or superior functionality to "SOTU," it would still be a worthwhile project for me. But there are real differences between what I envision for SOTU-db and the "SOTU" site.

First, our goals and audiences appear different. "SOTU" dedicates one of its five major tabs to an "essay" entitled "The {Sorry} State We Are In." Provocative and subjective, the essay strikes a tone I hope to avoid on SOTU-db. I hope SOTU-db is equally useful as a classroom tool and as a resource for professional and academic research; I don't feel an essay of this nature would help SOTU-db achieve that goal.

Secondly, "SOTU" relies for its analysis on frequency counts of words (see its "Statistical Methods" appendix). I am certainly interested in word frequencies and plan to use them for major parts of SOTU-db, but my interest goes far beyond them. I am not only interested in which words are unique, but in which common words have been used in which contexts, how adjectives and connotations are used, and what patterns might be visible among and within presidential terms. For example, when presidents have used the word "Americans," what adjectives or actions have they ascribed to Americans? Does the answer change depending on historical period, political party affiliation, or whether the US was at war when the speech was delivered? These are the types of questions I hope to enable users of SOTU-db to answer -- and, importantly, the types of questions I want to encourage them to ask.

A mobile mockup of the main SOTU-db landing page: an Android phone screen with a top menu bar that says SOTU-db, a search box, and various cards featuring quotes from and names of presidents.
This brings me to a final, major difference between "SOTU" and SOTU-db. "SOTU" is a tool for users who bring their own curiosities and research questions to the site. I want SOTU-db to function this way as well, but I also want to encourage and guide user interactions -- if the user wants such guidance -- even for those who do not arrive with a research question or interest in mind. One online tool I've always found useful is the site for Federal Reserve Economic Data, or FRED. In many ways, I've tried to create a more contemporary and mobile-friendly version of the search-and-discovery functionality FRED offers on its main page. Like FRED, SOTU-db puts a search box front and center for users who already have a research topic in mind. But below its search box, FRED also includes a number of other ways to discover data: recent data, popular data, data in the news, and so on. Borrowing this functionality and moving it onto interactive, engaging "cards" should give SOTU-db the playful engagement I am after while also making the site visually appealing and mobile-friendly.

The Addresses

"State of the Union Addresses and Messages" at The American Presidency Project, by John Woolley and Gerhard Peters, appears to be the authoritative online resource for State of the Union addresses by US presidents. As they explain there, the traditions for delivering the "State of the Union" to Congress have changed over time, such that referring to them all as "speeches" or "addresses" is probably technically incorrect. Additionally, in recent decades, American presidents have often delivered an annual address at the very beginning of their terms. As Peters explains,
"For research purposes, it is probably harmless to categorize these as State of the Union messages. The impact of such a speech on public, media, and congressional perceptions of presidential leadership and power should be the same as if the address was an official State of the Union."
I concur, in part, and plan to include these "non-official" SOTU addresses in the project (and to continue referring to them as SOTUs - at this phase, if the American Presidency Project lists it on its "State of the Union Addresses and Messages" page, I include it in SOTU-db and call it a SOTU). But that nagging phrase, "should be the same," points to precisely the type of question SOTU-db should be able to help answer.

The practice of incoming presidents delivering a pseudo-SOTU at the beginning of their terms is relatively new (only since Reagan) and coincides with the end of the older tradition of outgoing presidents delivering a SOTU early in the year after an election that voted a new president into office. It would not be surprising to find material differences between the word choices of new presidents only weeks into office and those of addresses given years into a presidency. Likewise, it is not inconceivable that presidents about to leave office within weeks would speak differently than those just beginning or in the midst of their terms. Can we isolate these speeches and see what words, topics, and styles stand out? Questions like these help motivate me and help structure the project in a way that encourages users to ask them and find answers.

The Format

Though not originally conceived for a class assignment, this project has become the major piece of my DIGH-402 class at Loyola University Chicago, taught by Dr. George Thiruvathukal. Though I expect to continue development of SOTU-db beyond the semester, by the end of the term the goal is to have a "minimum viable product" and build from there. I greatly appreciate the interest and support of Dr. Thiruvathukal throughout this project. My goal is to have a minimum viable product operational by May 1.