So, basically, the goal is to convert the SOTUs into really long, skinny tables, where each occurrence of a word gets its own row. This means the tidy table (stupidly called a "tibble," apparently) keeps track of where words are in relation to each other. This is an improvement (for my purposes) over the "document-term matrix" I created in my initial text mining steps, which only gave each word a single row along with the number of times it appears in the corpus, in alphabetical order (not the order in which the words appear in the text).
|The document-term-matrix. Every word is recorded and we can see which files they are in, but there's no way to track where they are in relation to one another within the files.|
|The tidy data frame retains the words' relationships to each other - linenumber isn't working the way I want right now, but it does reset with each file and keeps the context of each token (word) intact.|
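To make the one-token-per-row idea concrete, here's a minimal sketch in base R (the two snippets of text are invented, and the real workflow uses tidytext rather than this hand-rolled version) of what the tidy format preserves: each word keeps its file and its position, so order survives.

```r
# A minimal base-R sketch of the tidy one-token-per-row format.
# The text here is made up; the real pipeline uses tidytext::unnest_tokens().
text <- c("1790" = "fellow citizens of the senate",
          "1794" = "during the session of the congress")

tidy_sketch <- do.call(rbind, lapply(names(text), function(f) {
  words <- strsplit(text[[f]], " ")[[1]]
  data.frame(file = f,
             position = seq_along(words),  # word order survives per file
             word = words)
}))

head(tidy_sketch)  # one row per word, in the order the words occur
```

Collapsing `tidy_sketch` with `table(tidy_sketch$word)` would throw away `position` and give you something like the document-term matrix: counts, but no context.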
Analysis: Positive Words in Washington's 1792 SOTU
- peace, 5 instances
- present, 5 instances
- found, 4 instances
- A sanction commonly respected even among savages has been found in this instance insufficient to protect from massacre the emissaries of peace.
- I can not dismiss the subject of Indian affairs without again recommending to your consideration the [. . .] restraining the commission of outrages upon the Indians, without which all pacific plans must prove nugatory. To enable [. . .] the employment of qualified and trusty persons to reside among them as agents would also contribute to the preservation of peace . . .
- I particularly recommend to your consideration the means of preventing those aggressions by our citizens on the territory of other nations, and other infractions of the law of nations, which, furnishing just subject of complaint, might endanger our peace with them . . .
- a sanction that "has been found in this instance insufficient,"
- considering the expense of future operations "which may be found inevitable,"
- a reform of the judiciary that "will, it is presumed, be found worthy of particular attention," and
- a question about the post office, which, if "upon due inquiry, be found to be the fact," a remedy would need to be considered.
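Instance counts like the ones above fall straight out of the tidy format; a base-R sketch, with a toy token vector standing in for the tokenized 1792 address:

```r
# Tally word frequencies in a token vector (invented sample tokens,
# not the actual 1792 SOTU).
tokens <- c("peace", "present", "found", "peace", "found", "peace")
counts <- sort(table(tokens), decreasing = TRUE)
counts[["peace"]]  # number of times "peace" appears in the sample
```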
Sentiment Analysis Within Documents
Even though linenumber still isn't working the way I want (I tried as.integer() and division to count every ten words as a line, but it didn't work), having the words in order still enabled analysis I couldn't do before.
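The every-N-words bucketing I was after is really just integer division on word position; a quick base-R sketch (the positions and chunk size here are arbitrary):

```r
# Assign each word position to a 10-word chunk via integer division (%/%).
position <- 1:25                 # word positions within a document
chunk <- (position - 1) %/% 10   # 0-based chunk index: 0, 0, ..., 1, ..., 2
table(chunk)                     # chunk sizes: 10, 10, 5
```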
The first visualization I generated with tidy text was a great feeling. Here's the code with some comments:
library(tidyr) # load the tidyr library (for spread())
SOTU_sentiment <- tidy_SOTUs %>% # start from tidy_SOTUs; store the result in SOTU_sentiment
inner_join(get_sentiments("bing")) %>% # join the "bing" sentiment lexicon (different from the NRC lexicon in the previous example)
count(file, index = linenumber %/% 80, sentiment) %>% # integer division buckets every 80 lines - in this case, 80 words - into one chunk, then counts sentiment words per file, chunk, and sentiment
spread(sentiment, n, fill = 0) %>% # spread pivots the long data wide: one column each for negative and positive counts, with 0 filled in where a chunk has none (this is why we wanted tidyr)
mutate(sentiment = positive - negative) # finally, calculate "sentiment" by subtracting instances of negative words from instances of positive words, per 80-token chunk
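To see what spread() is actually doing in that pipeline, here's a tiny standalone example with toy counts (not real SOTU data). Base R's xtabs() accomplishes the same reshape for counts, so no packages are needed to run it:

```r
# What spread(sentiment, n, fill = 0) accomplishes: turn one row per
# (index, sentiment) into one row per index, with a column per sentiment,
# filling missing combinations with 0. Toy counts, shown via base R xtabs().
long <- data.frame(index = c(0, 0, 1),
                   sentiment = c("negative", "positive", "positive"),
                   n = c(2, 5, 3))
wide <- as.data.frame.matrix(xtabs(n ~ index + sentiment, data = long))
wide$sentiment <- wide$positive - wide$negative
wide
#   negative positive sentiment
# 0        2        5         3
# 1        0        3         3
```

Note how chunk 1 had no negative row at all in the long data, but gets an explicit 0 in the wide form, which is exactly what fill = 0 does and why the positive - negative subtraction works on every chunk.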
library(ggplot2) # load the ggplot2 library for visualization
ggplot(SOTU_sentiment, aes(index, sentiment, fill = file)) + # plot sentiment per 80-word chunk, colored by file
geom_col(show.legend = FALSE) + # bar chart; don't show a legend
facet_wrap(~file, ncol = 2, scales = "free_x") # one panel per file, in 2 columns; scales = "free_x" lets each panel's x-axis span just that file's chunks, since the addresses differ in length
and here's the resulting visualization from ggplot:
It is interesting to note the pattern of most addresses beginning on a positive note. This seemed like a clear enough pattern (with notable enough exceptions in 1792 and 1794) that it was worth looking into before going much further - if the NRC list was seemingly so off-base about "peace" and "present," I wanted to see if these visualizations even meant anything.
Grabbing the first 160 words (two 80-word chunks) from the 1790 and 1794 SOTUs and comparing them subjectively with their respective charts revealed the following (the image contains lots of text; it's also available as a markdown file on the GitHub repo (raw markdown)):
I have to say, these definitely pass the smell test. 1790 is all congratulatory and full of lofty, philosophic patriotism, while 1794 is about insurrection and taxation. I was especially pleased that, even though 1794 contains a lot of what I would think are positive words (gracious, indulgence, heaven, riches, power, happiness, expedient, stability), the chart still depicts a negative number that is in line with my subjective reading of the text.