Thursday, March 15, 2018

Project Trajectory

This post explores more of the timing and motivation behind SOTU-db.

The idea

Ultimately, I suppose the inspiration for SOTU-db goes back to my own high school career, when I certainly would not have found working with the text of a State of the Union address to be an engaging activity. More recently, though, my experience teaching high school social studies made me treasure well-made online learning tools. In a not-uncommon experience, I would get excited about an online resource and plan to take my class to the computer lab with a specific activity in mind. Once in the lab, though, students would immediately set to work logging into Facebook, Instagram, Twitter, etc.

I understand the appeal of these platforms, and often found myself thankful that these services were not readily available via pocket-sized devices when I was a high school student -- I'm sure they would not have helped with my own engagement in my schoolwork. Though connecting with friends and being part of that social milieu is a key appeal of these products, it's also undeniable that the web interfaces for these platforms are far superior to many of the digital tools I was trying to push onto my students. The New York Times's 2010 census map tool could be an amazing learning opportunity -- but the Flash interface made loading it unreliable and inconsistent across devices, the page was heavy and often crashed or stopped responding under load, and the interface was simply too different from the expected Google Maps-like experience to hold students' interest.

The New York Times 2010 Census Map online tool.
This was discouraging to me. What if students could log in to the census map, track their progress, see notification badges letting them know they have tasks left undone, insights left unexplored? What if the map offered suggestions or nudges for users to explore in a particular way, or try out a certain feature? Could student engagement be improved? Questions like this are a big part of how I found myself as a graduate student in Digital Humanities. Creating a tool that teachers can effectively use in diverse classrooms is a major goal for me, because it's something I would have appreciated more of when I was teaching.

The idea to work with State of the Union addresses in particular came about as I was watching President Trump's 2018 "State of the Union" address. As the words of a head of state, President Trump's remarks were being symbolically decoded and analyzed all over the world. Particularly during an administration marked by instability, President Trump's ability to conform to the expectations of the office and to deliver an address with clear, coherent messaging seemed urgent. At the conclusion of the address, I felt Trump had successfully conformed to the aesthetic and ceremonial expectations of the evening without saying much in the way of substantive policy or ideological goals. This was far from satisfying for someone looking for clues about the trajectory of the administration. Was there a way to cut through the noise and see what set Trump's speech apart from other State of the Union addresses? Could we compare his word choices and topic selections with previous presidents' to learn more about the priorities of this administration?

From my perspective as a student in Digital Humanities, these questions seemed perfectly suited to the application of digital tools for scholarship and criticism. On a basic level, it seemed obvious to use textual-analysis tools (Voyant Tools, R) to examine the specific word choices of the January 30, 2018 address. In fact, interesting and insightful analyses of this type had already been conducted and published online.
I envisioned a project that could bring some of this together into one platform and open up a kind of exploratory, playful engagement with the texts. Inspired by the simple, fun digital tools I've explored already in my career as a Digital Humanist (see the list of DH tools below), I wanted to create my own platform for searching, comparing, analyzing, and visualizing the text of these highly symbolic acts of communication: the annual address (or State of the Union). This, too, has already been done, at a site called "SOTU."
DH Tools:
How is my own project distinct from SOTU? At the most basic level, SOTU-db is different because I am creating it. As I outlined in my blog post about project goals, the goals for the SOTU-db product align nicely with my own goals as a developer and researcher. Therefore, even if SOTU-db offered no additional or superior functionality to "SOTU," it would still be a worthwhile project for me. But there are real differences between what I envision for SOTU-db and the "SOTU" site.

First, our goals and audiences appear different. "SOTU" dedicates one of its five major tabs to an "essay" entitled "The {Sorry} State We Are In." Provocative and subjective, the essay strikes a tone I hope to avoid on SOTU-db. I hope SOTU-db is equally useful as a classroom tool and as a resource for professional and academic research; I don't feel an essay of this nature would help SOTU-db achieve that goal.

Second, "SOTU" relies for its analysis on word-frequency counts ("Statistical Methods" appendix). I am certainly interested in frequency counts and plan to use them for major parts of SOTU-db. But my interest goes far beyond this. I am interested not only in which words are unique, but in which common words have been used in which contexts, how adjectives and connotations are used, and what patterns might be visible among and within presidential terms. For example, when presidents have used the word "Americans," what adjectives or actions have they ascribed to Americans? Does the answer change depending on historical period, political party affiliation, or whether the US was at war when the speech was delivered? These are the types of questions I hope to enable users of SOTU-db to answer -- and, importantly, the types of questions I want to encourage them to ask.
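As a rough illustration of the kind of context question I have in mind, a keyword-in-context (KWIC) pass can surface the words surrounding every occurrence of "Americans." This is only a sketch in Python with an invented sample sentence, not SOTU-db's actual implementation:

```python
def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context per side."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        if word.strip(".,;:!?").lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append(f"{left} [{word}] {right}")
    return hits

# Invented sample text, not a real SOTU excerpt:
sample = "Hardworking Americans deserve relief. Brave Americans answered the call."
for line in kwic(sample, "americans"):
    print(line)
# Hardworking [Americans] deserve relief. Brave
# deserve relief. Brave [Americans] answered the call.
```

Lining up windows like these across speeches is one simple way to start asking which adjectives and actions attach to "Americans" in different eras.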

a mockup of an Android phone screen with a top menu bar that says SOTU-db, a search box, and various cards featuring different quotes and names of presidents
A mobile-version mockup of the main SOTU-db landing page
This brings me to a final, major difference between "SOTU" and SOTU-db. "SOTU" is a tool for users who bring their own curiosities and research questions to the site. I want SOTU-db to function this way as well, but I also want to encourage and guide user interactions -- if the user wants such guidance -- even for those who do not arrive with a research question or interest in mind. One online tool I've always found useful is the site for Federal Reserve Economic Data, or FRED. In many ways, I've tried to create a more contemporary and mobile-friendly version of the search-and-discovery functionality FRED offers on its main page. Like FRED's, SOTU-db's front page centers on a search box for users who already have a research topic in mind. But below its search box, FRED also includes a number of other ways to discover data: recent data, popular data, data in the news, and so on. Borrowing this functionality and moving it onto interactive, engaging "cards" should give SOTU-db the playful engagement I am after while also making the site visually appealing and mobile-friendly.

The Addresses

"State of the Union Addresses and Messages" at The American Presidency Project by John Woolley and Gerhard Peters appears to be the authoritative online resource for State of the Union addresses by US presidents. As they explain there, the tradition of delivering the "State of the Union" to Congress has changed over time, such that referring to them all as "speeches" or "addresses" is probably technically incorrect. Additionally, in recent decades, American presidents have often delivered an annual address at the beginning of their terms. As Peters explains,
"For research purposes, it is probably harmless to categorize these as State of the Union messages. The impact of such a speech on public, media, and congressional perceptions of presidential leadership and power should be the same as if the address was an official State of the Union."
I concur, in part, and plan to include these "non-official" SOTU addresses in the project (and to continue referring to them as SOTUs -- at this phase, if the American Presidency Project lists an address on its "State of the Union Addresses and Messages" page, I include it in SOTU-db and call it a SOTU). But that nagging phrase, "should be the same," raises precisely the type of question SOTU-db should be able to help answer.

The rise of incoming presidents delivering a pseudo-SOTU at the beginning of their terms is relatively new (only since Reagan) and coincides with the end of the tradition of presidents delivering a SOTU early in the year after an election that voted a new president into office. It would not be surprising to find material differences between the word choices of new presidents only weeks into office and those of presidents years into their terms. Likewise, it is not inconceivable that presidents about to leave office within weeks would speak differently than those just beginning or in the midst of their terms. Can we isolate these speeches and see what words, topics, and styles stand out? Questions like these motivate me, and they shape the project in a way that encourages users to ask and answer such questions themselves.

The Format

Though not originally conceived for a class assignment, this project has become the major piece of my DIGH-402 class at Loyola University Chicago, taught by Dr. George Thiruvathukal, whose interest and support throughout this project I greatly appreciate. Though I expect to continue developing SOTU-db beyond the semester, my goal is to have a "minimum viable product" operational by May 1 and to build from there.

Working with R

This post outlines my first forays into working with R, "a language and environment for statistical computing and graphics" (about R).

One of my secondary goals for this project is to enable users to perform textual analyses and visualizations using R. R seems to get a lot of buzz in the DH community right now, and I'm interested to learn what it can do. I wrote my first lines of (Processing) code just months ago, so I don't expect to become fluent in R by May; I just want to dabble and see what's possible.

Over the past couple of days I have been following along with various online tutorials, trying to do some basic analyses that I already understand pretty well, like word counts and frequency comparisons. The easiest tutorial for me to follow has been "A gentle introduction to text mining using R" on the "Eight to Late" blog.

Following this blog and the advice of many, many others, I'm using RStudio to work with R. I'm beginning to understand the process of loading packages and the syntax of R. It seems that most of what I'm doing is defining a variable on the left and describing the operation whose result should be stored in that variable on the right of an equals sign (or <- arrow, as in the tutorial, though that's a little confusing for me). The tutorial uses a corpus provided by the blog (of archived blog posts), but I've been using my Gutenberg SOTU documents as my corpus instead.

I've successfully stripped punctuation and numbers and converted the text to lowercase. But when I tried to remove the default stopwords, RStudio just ran and ran, never finishing the operation. I plan to let it run overnight; perhaps my corpus really is large enough, and the task so much more complicated than removing numbers or punctuation, that it simply needs more time. We'll see! [Update: it took several hours, but it looks like removing stopwords was finally successful!]

However, on examining a couple of SOTUs after this transformation, I am realizing that certain stopwords could be of great interest and should probably be retained, especially pronouns like "we," "us," "they," and "them." How will differences in stopword lists and in algorithms for stemming and lemmatizing (my next step) impact the results of analytical operations on the text? How much of this is necessary to surface to the user, and how much should be hidden? Would it be feasible and advisable to allow users the option to work with different corpora with different cleaning processes applied?
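For the record, the same cleaning pipeline can be sketched in plain Python rather than R's tm package: lowercase, strip numbers and punctuation, then remove stopwords. The stopword list here is a tiny illustrative subset (not tm's default English list), and it deliberately omits the pronouns I want to retain:

```python
import re

# Tiny illustrative stopword list (NOT tm's default English list);
# pronouns like "we", "us", "they", "them" are deliberately omitted
# so they survive the cleaning step.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "be", "for"}

def clean(text):
    text = text.lower()                    # convert to lowercase
    text = re.sub(r"\d", "", text)         # strip numbers
    text = re.sub(r"[^\w\s]", "", text)    # strip punctuation
    return [t for t in text.split() if t not in STOPWORDS]

# Invented sample line, not an actual SOTU sentence:
print(clean("To the Congress of the United States: We meet in 1790."))
# ['congress', 'united', 'states', 'we', 'meet']
```

Swapping in a different stopword set is a one-line change here, which is exactly the kind of knob I'm wondering whether to expose to users.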

an Rstudio window showing a console with various commands being entered, and an "environment" window showing different SOTU corpora and values
I saved each step along the way as a separate corpus so I would have a record of each step of the process

I'm confident I'll be able to put R to use for some basic textual mining and analysis, but I'm not sure I'll be able to get into more advanced techniques like topic modeling (which interests me more). And it's doubtful I'll find a use for R that couldn't already be served by Voyant Tools or HTRC. Regardless, I feel confident that using R to export data and then using my Processing tool, Grapher, to visualize that data will be an excellent learning experience, right at my level as a budding developer. This will be the first time I've really been able to integrate data that holds interest for me as a humanist and historian into my work with programming.
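The export step could be as simple as writing one (year, count) row per speech to a CSV that Grapher -- or any plotting tool -- can read. A Python sketch with invented data and a hypothetical filename:

```python
import csv
from collections import Counter

# Invented example data standing in for cleaned speech tokens keyed by year:
speeches = {
    1790: ["congress", "union", "war"],
    1791: ["union", "union", "peace"],
}

term = "union"
# Hypothetical output filename; one row per speech so the trend can be graphed.
with open("union_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year", "count"])
    for year, tokens in sorted(speeches.items()):
        writer.writerow([year, Counter(tokens)[term]])
```

A flat year/count file like this keeps the analysis and visualization steps cleanly separated, so the same export could feed Grapher, a spreadsheet, or a web chart.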

After a recent class with Dr. Thiruvathukal, my de facto faculty adviser for this project, I'm also interested in exploring whether I can use the Natural Language Toolkit (NLTK) to dive more deeply into the word choices American presidents have made in their SOTU addresses. Stay tuned for developments on that front, hopefully coming soon!

edited 3/16/2018 to show that stopword removal was eventually successful

Sunday, March 11, 2018

SOTU-db Goals

Today's post will lay down some goals for the SOTU-db project and give a more thorough development update.


Goals for this blog:

  • document the development process
  • encourage self-reflection and accountability
  • practice explaining the project to a casual audience
This blog will serve as a resource for SOTU-db itself by documenting the design process and decisions that go into the product. At the same time the blog will help with my own development as a project leader and communicator by encouraging me to work reflectively, diligently, and transparently. On that note, see the "Development Update" section at the bottom of this post for the latest on the project's progress!

As for SOTU-db itself, the goals here can be split into "product goals" and "process goals." Product goals speak to what we want the finished product to do, or attributes we want it to have. Process goals describe processes, skills, and tasks that will go into the creation of SOTU-db and are more oriented toward my own development as a researcher and developer. These goals are further broken down by priority: goals that make up the "minimum viable product" are marked in bold text. Other goals are simply bulleted.

Product goals (objectives for finished digital project):

  • Create web interface with search at the forefront
  • Interactive cards to encourage user exploration and playful engagement
  • Search texts and generate basic visualizations of how frequently terms appear over time
  • Run more complex textual analyses, such as topic modeling
  • Interface allows many options and variables for interacting with data
  • Users can make documents that contain the text of SOTU addresses with certain topics, parts of speech, etc., highlighted
  • Users will have a way to visualize the audience reaction to recent SOTU addresses
  • Each address will have an authoritative, digital edition

Process goals (objectives for learning):

  • Create and publish an interactive digital project
    • including UI design, and incorporating HCI best practices
  • Encode a document into TEI with a custom schema
  • Document the process of creating/cleaning datasets and all major project milestones through a blog or other documentation
  • Practice universal and accessible design practices throughout the project lifecycle
  • Create visualizations for texts in Processing
  • Gain experience and comfort with using Git and GitHub, including:
    • Running project website from GitHub pages
    • Using command line and Git as part of regular workflow
    • Using GitHub as master repository for all digital resources
  • Gain experience with textual analysis and scholarship using familiar texts
  • Learn to outline and code the user interface for this website
  • Explore textual analysis through R
  • Learn more about topic modeling and how it can be made useful to the average user
  • Create a schema and encode a text according to TEI guidelines
  • Use digital tools (Voyant) and languages (R) to work with familiar texts
  • Create meaningful visualizations around trends within and across documents
  • Figure out how to host, and how to handle user interaction (searches)

Development Update

This will be a straightforward and concise update on progress over the past several days. A separate blog post will be made with more information about the project's overall trajectory.
  • Requirements document: draft complete, available on GitHub (in documentation folder)
  • Personas document: draft complete, available on GitHub (in documentation folder)
  • Prototype screens: worked through some more prototype screens in Justinmind (in wireframes folder)
  • Speech plain text: created .txt files of the text of each address, 1790-2006. Now have a total of 216 objects in the library (in speeches-gutenberg folder)
All of the above are available on the SOTU-db GitHub repository.

The next steps are to figure out how to get the web structure of the product built and published. I'm currently using GitHub Pages with Jekyll to run the project website, but I don't think this infrastructure will work for the actual project itself (I also need to update all the information and the timeline on the site -- that will be a goal for this week). It would probably be possible for me to create a basic HTML version of what I envision, but this might not be worth the effort, since the final product will need to rely on much more than static HTML pages. Figuring out how to put the pieces together will definitely be a learning process! The goal is to have a minimum viable product up and running by May 1! Stay tuned to this blog for more updates coming soon!