Tuesday, November 13, 2018

DHCS 2018

This past weekend I had the pleasure of spending some time at the 13th Annual Chicago Colloquium on the Digital Humanities and Computer Science, or #DHCS2018, for which I served as a steering committee member and volunteer. There was a lot of fantastic scholarship, but for the purposes of this blog post I wanted to highlight a few papers that I was able to hear about (due to concurrent panels and some scheduling conflicts, I couldn't see them all):

Circulation Modeling of Library Book Promotions, Robin Burke (DePaul University)

Dr. Burke showed some great work studying the Chicago Public Library's One Book, One Chicago program. Of course what caught my attention in particular was his mention of sentiment analysis; their project searched the texts of the assigned One Book, One Chicago novels for place-names (toponyms), identified the sentiment of each mention, and mapped them. I caught up with him after his panel where he told me that they used Stanford's NLP package and analyzed the sentiment on a per-sentence basis, so each toponym was being mapped along with the sentiment of the sentence in which it occurred. Robin cautioned that moving up to a paragraph-level would be too much text for useful sentiment analysis, and suggested that for certain situations (such as some 18th- and 19th-century POTUS' very long sentences) even shorter lengths might be more useful - a few words in either direction of the occurrence. Since finding word occurrences and recording their sentiment is exactly what my project was doing, this was really useful information to me. Dr. Burke expressed that sentiment analysis might be a little more straightforward with political texts than with, for example, novels, and I shared that I had run across Stanford's time and domain-specific sentiment lexicons, it was a great conversation and cool to feel like I really had something to contribute and take away from it.

Analyzing the Effectiveness of Using Character n-grams to Perform Authorship Attribution in the English Language, David Berdik (Duquesne Univeristy)

David gave a great talk on using n-grams to identify the authors of texts. I was pretty surprised at his conclusion that 2- and 3-grams were the most useful for identifying authorship, and I don't think I was alone in this surprise based on the audience's questions after his talk. However, I think I also misunderstood a bit; I thought he meant words that were only 2 and 3 characters long, but I think it actually means any string of 2 or 3 characters, which could include word fragments and even spaces. In any case, it gave me the idea of using the SOTU-db as training data, then allowing users to run their own texts through a tool on SOTU-db and get an output of which POTUS the user's text most resembles! This could potentially be a really fun classroom tool, especially if combined with a kind of "SOTU template" or "SOTU authoring" tool, and the ability to constrain the output (so teachers can ensure their students are matched with a POTUS they are studying). 

'This, reader, is no fiction': Examining the Correlation Between Reader Address and Author Identity in the Nineteenth- and Twentieth-Century Novel, Gabi Kirilloff (Texas Christian University)

Gabi's talk was an unexpected pleasure, and her discussions about how authors used reader address (such as "you" and "dear reader") to define their imagined audiences (among other things) had me unexpectedly drawing connections to SOTU-db. Analysis of how presidents have used pronouns like "we" and "you" could be quite revealing - something to remember as I think about cleaning the texts (both "we" and "you" would be stripped by the standard stopword lists).

Dissenting Women: Assessing the Significance of Gender on Rhetorical Style in the Supreme Court, Rosamond Thalken (Washington State University)

Rosamond's project was right in my wheelhouse and I was excited to hear about it (and some of the others that have come before that she referenced). I can't find it right now on Google, but I'm going to continue to look; smart discussions of rhetoric and how it can be explored through computational techniques is a weak point of SOTU-db but one that I find extremely interesting and important. Once I am confident the technical side of the tool is somewhat stable and running, I hope it can be a tool to do the same types of analysis that Rosamond has done with her project here.

Moving Forward

On a practical level, the conference has made me want to explore the Stanford NLP package to ensure that I'm not making things harder on myself than they need to be with R and NLTK and everything. Stanford NLP popped up in multiple presentations, so this seems worth being sure that I'm not neglecting the "industry standard" without good cause. Otherwise, the above-mentioned talks have mostly given me bigger-picture things to consider (which is great, because right now I don't need more major changes to my roadmap). It's wild how quickly the time is going - I felt like I was off to a very quick and even sometimes ahead-of-schedule start to the semester, and now I am shocked that my MA defense is only 21 days away, so it's a good thing I got off to that good start or I'd really be in trouble! Maybe I'll get "minimum viable product" tattooed on my fingers or something...