Carol Schmitz

Graduate Research Assistant

Here at CASM Lab we primarily write code in Python. We like Python because it is very human-readable and because many developers write packages (also called modules or libraries) that add features on top of Python’s built-in standard library. Using packages is great because you don’t need to code a new solution to a problem that’s already been solved (and tested!), but it adds another layer of complexity to your development: package management.

Why do we use environments and a package manager?

Hello, internet. I am writing this post in June 2016. Let’s say you discover our GitHub repo in June 2017 and you want to run some of our code to reproduce results you read about in a paper we recently published. Thank you! We are flattered that you want to build on our work. The fact that we provide an environment.yml file will help you make sure your local system matches the system we used for development and testing.

To explain why this is exciting, consider a scenario where our code is completely undocumented. You might try to run one of our scripts, only to have it fail before it even begins.

ImportError: No module named 'selenium'

You know it’s a Python script, but you’ve never used the selenium package before. A little Googling will tell you that you can install this package using pip install selenium, which seems to work OK at first. But then some strange error pops up, and you can’t figure out what it means. At this point you discover that the selenium package has been updated to a new version that breaks a piece of our code. Tracking down the actual source of the problem and resolving the error is probably not a trivial task.
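Before chasing a strange error, it can help to ask Python which version of a package is actually installed. The snippet below is a minimal sketch, assuming Python 3.8 or newer (where importlib.metadata is part of the standard library); the helper name installed_version is ours, for illustration only.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("selenium")
    if version is None:
        print("selenium is not installed in this environment")
    else:
        print("selenium", version, "is installed")
```

Comparing that version against the one the code was written for is often the quickest way to confirm that an update is the culprit.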

Now, let’s say our code is partially documented. We’ve provided a requirements.txt file, which tells you exactly which version of each package is used in this code. No more random code breaking, right?! Not necessarily. If you don’t have separate environments, pip install -r requirements.txt will install all those packages in the same place on your system. Each package on the requirements list has its own dependencies, and sometimes those dependencies can conflict.
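To make "exactly which version" concrete, here is a rough sketch of reading the pinned versions out of a requirements file. It only understands simple name==version lines and ignores everything else (version ranges, environment markers, URLs). The function name parse_pins and the package versions in the example are hypothetical, chosen only for illustration.

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# Hypothetical pins, for illustration only.
example = """
# requirements for an imaginary project
selenium==2.53.5
requests==2.10.0
numpy
"""

print(parse_pins(example))
```

Note that numpy is skipped here because it has no pinned version, which is exactly the kind of entry that can change underneath you later.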

For example, maybe you were working on a mapping project last year, but you’re working on a text-processing project today. Let’s say the mapping project lists package_x version 1.1 as a dependency, and the text-processing project lists package_x version 1.2 as a dependency. If you were not using environments and installed all these packages at the root level of your system, you’d have to do a lot of upgrading and downgrading package versions in order to switch between running the code for different projects. BUT! If you create a mapping environment and a text-processing environment, all you need to do is activate the environment for the code you want to run, and the proper package version for each project will be used. Voilà!
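The conflict described above can be sketched in a few lines. This example assumes each project's pins are already available as a dictionary of package name to version; find_conflicts and package_y are hypothetical names added for illustration, while package_x and its two versions come from the scenario above.

```python
def find_conflicts(pins_a, pins_b):
    """Return {package: (version_a, version_b)} for packages pinned differently."""
    conflicts = {}
    for name in sorted(set(pins_a) & set(pins_b)):
        if pins_a[name] != pins_b[name]:
            conflicts[name] = (pins_a[name], pins_b[name])
    return conflicts


# Hypothetical pins for the two projects described above.
mapping_project = {"package_x": "1.1", "package_y": "0.9"}
text_project = {"package_x": "1.2", "package_y": "0.9"}

print(find_conflicts(mapping_project, text_project))
```

Any package that shows up in the result cannot satisfy both projects from a single shared install, which is the situation separate environments are meant to avoid.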

Anaconda

CASM Lab primarily uses Anaconda to manage our Python code, environments, and packages. It’s a great tool for those who want to dive into the code and analysis without too much setup. It offers a graphical interface, but can also be navigated via command line.

For the purposes of this guide, we’ll be discussing the process of using Anaconda to manage environments and packages on OS X. You may or may not get the same results on a different operating system. Also, Anaconda can be used for other data-sciency things, but we’re not talking about those things right now.

Environments

Anaconda provides official documentation on environments. It’s pretty comprehensive, but we’ll explain a little more here.

In each of our code repositories, we include an environment.yml file. If you are using a compatible operating system (most of our code has been tested on the latest version of OS X), you can create a copy of the environment on your local machine directly from this file using conda env create -f environment.yml.

If you aren’t able to install the environment automatically (perhaps you’re on an older version of OS X), you can create your own new environment, open environment.yml, and manually install all the package dependencies needed to run the code for that project. The file lists all libraries that can be installed with conda under dependencies: and all libraries that can be installed with pip under a nested pip: entry. Make sure you install the version of each library specified in environment.yml, because other versions may not behave the same way.
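If you go the manual route, the split between conda packages and pip packages looks roughly like this. The sketch assumes the file has already been loaded into a Python dictionary (for example with a YAML parser), where the pip packages appear as a nested mapping inside the dependencies list. The environment name and package versions below are hypothetical, not taken from any of our repositories.

```python
def split_dependencies(env):
    """Separate conda packages from pip packages in a parsed environment file."""
    conda_packages, pip_packages = [], []
    for entry in env.get("dependencies", []):
        if isinstance(entry, dict):
            # The nested "pip" mapping holds the pip-only packages.
            pip_packages.extend(entry.get("pip") or [])
        else:
            conda_packages.append(entry)
    return conda_packages, pip_packages


# Hypothetical contents of an environment.yml, already parsed into a dict.
env = {
    "name": "example-project",
    "dependencies": [
        "python=3.5",
        "numpy=1.11.0",
        {"pip": ["selenium==2.53.5"]},
    ],
}

conda_packages, pip_packages = split_dependencies(env)
print("conda install " + " ".join(conda_packages))
print("pip install " + " ".join(pip_packages))
```

The two printed lines are the commands you would then run inside your new, activated environment.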

I recommend creating a new environment each time you start a new project. It takes a little getting used to, since you’ll need to activate and deactivate your various environments frequently if you’re actively working on several projects. The advantages are two-fold. First, you’ll run into fewer dependency conflicts, as each environment contains only the libraries you’re actually using. Second, you’ll be able to share your code with collaborators more easily. Win, win!
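When you are switching between several environments, it is easy to lose track of which one is active. As a small sketch, a script can check for itself: conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated, and sys.prefix shows where the running Python lives. The helper name active_conda_environment is ours, and the exact value reported for the default environment may differ between conda versions.

```python
import os
import sys


def active_conda_environment(environ=None):
    """Return the name of the active conda environment, or None if there is none."""
    if environ is None:
        environ = os.environ
    return environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    name = active_conda_environment()
    print("active conda environment:", name if name else "(none detected)")
    print("Python is running from:", sys.prefix)
```

Printing this at the top of a long-running script is a cheap way to catch the "wrong environment" mistake before it costs you an afternoon.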