After having written a pretty negative assessment of the state of the field for workflow managers (those pieces of software which make it easier to run multiple other pieces of software in a controlled, coordinated manner), I’ve been feeling like I needed to put out an update. The field has changed a lot in the last few months, and I’d like to be less out of date.
A Few Good Options
It turns out that there are a few good options out there: workflow managers that don’t take too long to figure out how to use, which have some cloud computing support, and which have communities of users developing. The two best options I’ve seen so far are Cromwell and Nextflow. Nextflow is pretty popular in Europe and Cromwell is being adopted by the Broad, so they both are reasonable options to try out. I’ve been able to get them both up and running without too much work, but there are some inherent challenges with any workflow manager that I think will always present some stumbling blocks.
Issue 1 — Where do you execute your command?
Fundamentally, a workflow manager executes a set of commands, each of which consumes and produces files. However, the operation of executing a command is completely different whether you’re trying to run it on your laptop, your local HPC, the Google Cloud, AWS, or Azure. Each of those execution options comes with their own idiosyncratic settings for permissions, authentication, formatting, etc. A big part of getting up and running with any workflow manager is getting all of those settings configured in just the right way. It’s not glamorous, but it’s important and it takes time.
Issue 2 — When do you execute your command?
A good workflow manager only executes commands when it’s appropriate — when the inputs are available and the outputs haven’t been produced yet. Doing this properly means that you can restart and rerun workflows without duplicating effort, but that also requires that you can keep track of what commands have been run before. This can also require a bit of effort to configure. As an aside, the traditional training path for bioinformatics folks is to start with BASH scripting, where you run a command when the output files don’t already exist. This is not the method that provides the most reproducible results, and it is also not the method used by Nextflow or Cromwell. I believe that this is the Snakemake model, but I have less experience there. Lots of complexity is hidden inside this issue.
Issue 3 — Where does your data live?
One of the big attractions of a good workflow manager is being able to run the exact same analysis on my laptop, an HPC, or the cloud. However, you really need to have the data live next to the execution environment — it would be insane to download and upload files from my laptop for every single task that’s executed in the cloud. This means that getting up and running with a cloud based workflow manager is getting all of your data organized and accessible in the same system that you want to run the tasks in. This takes time and means that you really have to commit to a model for execution.
While this post is pretty meandering and vague, all I mean to add here is that the area of workflow managers is expanding rapidly and lots of good people are doing great development. That said, the endeavor is fundamentally challenging and it will require a good amount of time to configure everything and get up and running. I encourage you to try out the options that exist and share your experiences with the world. This is the way of the future, and it would be great if we built a better future together.