Carol Schmitz

Graduate Research Assistant

All of our GitHub repositories for social media data collection and analysis have a similar structure. We use a single Jupyter notebook to explain and run the code within the repo. This notebook calls scripts within the /scripts folder, and stores any files generated in the /files folder. The notebook also contains code to create a settings (aka config) file, where the user may modify a predefined set of configurable options used by the code, such as API credentials and filenames.
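As a rough illustration of the settings step, the sketch below writes and reads a config file with Python's standard `configparser` module. The section names (`API`, `FILES`), the option names, and the helper functions `write_settings` and `read_settings` are assumptions made for this example; they are not taken from any of our repos, whose notebooks may organize their options differently.

```python
import configparser


def write_settings(path, api_key, api_secret, output_file):
    """Write a settings file. Section and option names are hypothetical."""
    # interpolation=None keeps a literal '%' in a credential from being
    # treated as configparser interpolation syntax.
    config = configparser.ConfigParser(interpolation=None)
    config["API"] = {"key": api_key, "secret": api_secret}
    config["FILES"] = {"output_file": output_file}
    with open(path, "w") as handle:
        config.write(handle)


def read_settings(path):
    """Load a settings file, failing loudly if it does not exist."""
    config = configparser.ConfigParser(interpolation=None)
    # ConfigParser.read returns the list of files it parsed successfully,
    # so an empty list means the file was missing or unreadable.
    if not config.read(path):
        raise FileNotFoundError(f"No settings file found at {path}")
    return config
```

A notebook cell could then call `write_settings("settings.cfg", "YOUR_KEY", "YOUR_SECRET", "files/output.csv")` once during setup, and later cells or scripts could call `read_settings("settings.cfg")["FILES"]["output_file"]` to look up a value.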

Our standard repo structure is as follows:

/data_samples - This directory contains sample files produced by the code after each stage of the collect/cache/analyze/parse workflow, including raw data from the source.

/files - This directory contains any files needed or generated by the code. Note that some of our scripts download or generate extremely large files, and these files are not stored on GitHub. If you run our code locally, this is where such files will live on your computer.

/scripts - This directory contains the meat and potatoes of our codebase. A single repository may contain several scripts which can be run independently of each other as users work through the collect, cache, analyze, parse workflow.

[repo_name].ipynb - This is the Jupyter notebook that explains how the various pieces of the code in this repo work. At a minimum, it contains the following sections: Setup, Collect, Cache, Parse, and Analyze. The notebook contains both Markdown blocks, which explain the goals of each workflow phase, and Code blocks, which run the code contained in the repo. Code blocks call scripts in the /scripts directory and save files in the /files directory.

settings-example.cfg - This file contains a sample version of all configurable options used by the code, except for API keys.

environment.yml - This file can be used by a package manager, such as Anaconda, to automatically install the packages and libraries needed to run the code in this repo. Operating system requirements may apply.

requirements.txt - This file is a pip-compatible version of the list of required packages.

README.md - This file uses Markdown syntax and explains what users will find in the repo.
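The layout above can be checked mechanically. The following is a minimal sketch, assuming the directory and file names listed above; the function `missing_entries` is a hypothetical helper written for this example and is not part of our repos.

```python
from pathlib import Path


def missing_entries(repo_root, repo_name):
    """Return the expected directories and files that are absent from repo_root."""
    root = Path(repo_root)
    expected_dirs = ["data_samples", "files", "scripts"]
    expected_files = [
        f"{repo_name}.ipynb",  # the notebook is named after the repo
        "settings-example.cfg",
        "environment.yml",
        "requirements.txt",
        "README.md",
    ]
    # Directories are reported with a trailing slash to tell them apart from files.
    missing = [name + "/" for name in expected_dirs if not (root / name).is_dir()]
    missing += [name for name in expected_files if not (root / name).is_file()]
    return missing
```

Calling `missing_entries(".", "my_repo")` from the top of a checkout returns an empty list when every expected entry is present. Note that /files may legitimately be absent or empty in a fresh clone, since large generated files are not stored on GitHub.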

Why We Do This

We structure our repos this way for a few reasons, including:

  • Consistency helps developers know what to expect when they come to our code.
  • Consistency helps us maintain the code even when there’s turnover in the lab.
  • Jupyter makes it possible for people who don’t code to use our code, and it’s already popular among data scientists.
  • Being transparent about our process is better for science - you can hold us accountable or alert us to mistakes. You can also build on our work more easily.