Libby Hemphill

Principal Investigator

We try to follow a standard approach to social media data in all our projects, no matter where we’re getting the data. It has four basic steps:

  1. collect
  2. cache
  3. parse
  4. analyze

We are also committed to open research and so push our code for collecting and analyzing data to our public GitHub repos as early and as often as we can.


Under the collect step live scripts for getting data in “raw” form. Here, raw means whatever the default format for the data is. Usually this means JSON dumped by an API, but for scrapers it’s whatever data structure and format we decided to use. We are greedy in collection, meaning we pull whatever data the API will let us have. For instance, in the Twitter projects, it means everything returned by the ever-changing Twitter API.


Once we have “raw” data, we cache it by storing a read-only copy somewhere accessible to the whole team. Usually this storage step is handled by the collection script and isn’t an extra scripting step. I call it out here, though, because it’s conceptually important: social media data changes all the time, and caching lets us keep track of what the data looked like at the time of collection (e.g., what was returned, what structure was standard then).
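To make the caching idea concrete, here is a minimal sketch of what a cache-on-collect helper might look like. It is not our actual collection code: the function name `cache_raw`, the `source` label, and the file naming scheme are all assumptions made for illustration. The payload is whatever the collection step got back, written out untouched, stamped with the collection time, and then marked read-only.

```python
import json
import stat
from datetime import datetime, timezone
from pathlib import Path


def cache_raw(payload, cache_dir, source="twitter_search"):
    """Write one raw API response to a timestamped, read-only JSON file.

    `payload` is stored exactly as received (no filtering or reshaping).
    `source` is a hypothetical label used only to build the file name.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # The UTC collection time goes in the file name so we know when the
    # data looked like this.
    collected_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = cache_dir / f"{source}_{collected_at}.json"

    # Mode "x" refuses to overwrite an existing cache file.
    with open(path, "x", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False)

    # Drop write permission so later scripts can read but not modify.
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

Pointing `cache_dir` at shared storage is what would make the copy accessible to the whole team; the read-only bit is a guard against accidents rather than a security measure.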


Next, we parse. Parsing scripts pull data from the read-only caches and put it in formats that are appropriate for analysis or whatever comes next. For instance, some of our Twitter tools collect data from the search API, cache it, then parse it into a human-readable CSV you can import into Excel or your favorite stats program. This leaves us with two related, but not identical, copies of the data: one in JSON from Twitter, and one in CSV. Parsing scripts also do any data transformations that are necessary for analysis (e.g., converting timestamps, calculating user stats).
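As an illustration of the parse step, the sketch below flattens tweet objects into CSV rows and converts the timestamp along the way. It assumes tweets shaped like the classic v1.1 Twitter JSON (`id_str`, `created_at`, `text`, and a nested `user` object); the API changes often, so treat the field names and the choice of columns as assumptions rather than a description of our scripts. The function names `parse_tweet` and `tweets_to_csv` are hypothetical.

```python
import csv
import io
from datetime import datetime

# Timestamp layout used by v1.1-style tweets, e.g.
# "Wed Oct 10 20:19:24 +0000 2018". Day and month names are parsed
# according to the current locale, so this assumes an English/C locale.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

FIELDNAMES = ["id", "created_at", "screen_name", "followers", "text"]


def parse_tweet(tweet):
    """Flatten one raw tweet dict into a row of analysis-friendly values."""
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    return {
        "id": tweet["id_str"],
        # ISO 8601 sorts correctly and is understood by most stats tools.
        "created_at": created.isoformat(),
        "screen_name": tweet["user"]["screen_name"],
        "followers": tweet["user"]["followers_count"],
        # Keep each tweet on one line so the CSV stays human-readable.
        "text": tweet["text"].replace("\n", " "),
    }


def tweets_to_csv(tweets, out):
    """Write parsed tweets to the open text file `out` as CSV with a header."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(parse_tweet(tweet))
```

Because the parser only reads from the cache, the CSV can be regenerated with different columns at any time without collecting again.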


Finally, we get to analyze the data. Often analysis is included in the same script as parsing, but sometimes analysis steps will live on their own. Some of the analysis will involve machine learning or natural language processing, but some will be simple word clouds or descriptive statistics. Many of our GitHub repos don’t publicly show our analysis steps yet. We’re working on it!
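For a sense of what the simpler end of analysis can look like, here is a small sketch that computes the word counts behind a word cloud plus a couple of descriptive statistics. It uses only the Python standard library; the function name `describe` and the tokenizing rule (lowercase runs of letters, digits, apostrophes, `#`, and `@`) are assumptions made for the example, not our published analysis code.

```python
import re
from collections import Counter
from statistics import mean


def describe(texts, top_n=5):
    """Return basic descriptive statistics for a list of post texts."""
    # A deliberately crude tokenizer that keeps hashtags and mentions intact.
    tokens_per_text = [re.findall(r"[a-z0-9#@']+", t.lower()) for t in texts]

    # Word frequencies across all texts: the raw material for a word cloud.
    counts = Counter(tok for toks in tokens_per_text for tok in toks)

    return {
        "n_texts": len(texts),
        "mean_tokens": mean(len(toks) for toks in tokens_per_text) if texts else 0,
        "top_words": counts.most_common(top_n),
    }
```

In practice the input would be the `text` column of a parsed CSV, and anything heavier (machine learning, natural language processing) would replace or build on this step.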