Tuesday, April 17, 2018

Working with Twarc



For the last few weeks I have been working with the command-line tool Twarc. It essentially pulls tweets from Twitter's API and saves them all as a JSONL file. Now, I'll be frank: I barely know how to use this tool, and it took me quite a long time to even set it up. Even when I finally had it up and running, I couldn't figure out how to get to the data and see what I had collected.

I downloaded it off GitHub, and the download itself was easy enough. The README gave me information on how to do searches, so I was all good there. The issue came when I went searching for the files I had supposedly created. I looked through the twarc master folder to see if they had been deposited there, and found nothing. After freaking out a bit, I looked at the command prompt where I had run the search and realized the files would go to my user folder on my laptop, seen below:

As you can see, the file path is shown clearly. I was able to find the JSONL file in my user folder. The big issue at that point was: how do I open the file to see the data? Eventually I figured out that Word was suitable, and I was able to open a document with all the information for me to peruse. You can see below what that information looks like:



All that data sure looks confusing! I wish I could give you a more detailed walkthrough of how to use the data and what it all means, but I'm still figuring that out for myself. What I do know is that the URLs will take you to either the tweet itself or the media attached to it. That is what makes Twarc really awesome: you can get access to the media, something not all Twitter scrapers do. Either way, a search can only go back as far as seven days, since that is all that is available through Twitter's API.
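For anyone who would rather not read the raw file in a word processor, here is a rough Python sketch of one way to pull the tweet and media links out of a JSONL file. It assumes each line is one tweet in the standard Twitter API format, with fields such as `id_str`, `user.screen_name`, and `extended_entities.media`; the file name `tweets.jsonl` and the function names are made up for illustration, so adjust them to match your own data.

```python
import json


def read_jsonl(path):
    """Yield one tweet (a dict) for each non-empty line in a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def tweet_links(tweet):
    """Return (tweet_url, media_urls) for one tweet dict.

    The field names are assumptions about the tweet format; anything
    missing simply results in None or an empty list.
    """
    user = tweet.get("user") or {}
    name = user.get("screen_name")
    tweet_id = tweet.get("id_str")

    tweet_url = None
    if name and tweet_id:
        tweet_url = f"https://twitter.com/{name}/status/{tweet_id}"

    media = (tweet.get("extended_entities") or {}).get("media", [])
    media_urls = [m["media_url_https"] for m in media if "media_url_https" in m]
    return tweet_url, media_urls


def summarize(path):
    """Print the link to each tweet, followed by any media links."""
    for tweet in read_jsonl(path):
        tweet_url, media_urls = tweet_links(tweet)
        print(tweet_url)
        for media_url in media_urls:
            print("    media:", media_url)
```

Calling `summarize("tweets.jsonl")` from the folder that holds the file should print one tweet link per line, with any attached media indented underneath.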

As I learn more, I can post updates here to try and make your experience with Twarc easier than mine!