
Bioinformatics: Reproducibility, Portability, Transparency, and Technical Debt

I’ve been thinking a lot about what people mean when they talk about reproducibility. It has been helpful to break the terminology apart in order to distinguish between some distinct, albeit highly intertwined, concepts.

Bioinformatics: Strictly speaking, analysis of data for the purpose of biological research. In practice, the analysis of large files (GBs) with a series of compiled programs, each of which may have a different set of environmental dependencies and computational resource requirements.

Reproducibility: An overarching concept describing how easily a bioinformatic analysis performed at one time can be executed a second time, potentially by a different person, at a different institution, or on a different set of input data. There is also a strict usage of the term, which describes the computational property of an analysis in which an identical set of inputs will always produce an identical set of outputs. These two meanings are related, but not identical. Bioinformaticians tend to accept a lack of strict reproducibility (e.g., the order of alignment results may not be consistent when multithreading), but very clearly want general reproducibility, in which the biological conclusions drawn from an analysis will always be the same from identical inputs.
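To make the difference between the two meanings concrete, here is a minimal sketch that compares two runs of the same analysis. The alignment records and function names are made up for illustration. The first comparison is strict (any change in record order changes the digest), while the second ignores order and only asks whether the same records were produced.

```python
import hashlib


def strict_digest(lines):
    """Digest that changes if the records or their order change."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8") + b"\n")
    return h.hexdigest()


def order_insensitive_digest(lines):
    """Digest of the sorted records, so record order is ignored."""
    return strict_digest(sorted(lines))


# Hypothetical alignment records from two runs of the same multithreaded tool.
run_a = ["read1\tchr1\t100", "read2\tchr2\t250", "read3\tchr1\t900"]
run_b = ["read2\tchr2\t250", "read1\tchr1\t100", "read3\tchr1\t900"]

strictly_identical = strict_digest(run_a) == strict_digest(run_b)
same_content = order_insensitive_digest(run_a) == order_insensitive_digest(run_b)
```

Here the two runs fail the strict comparison but pass the order-insensitive one, which is the situation most of us are willing to live with.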

Portability: The ability of researchers at different institutions (or in different labs) to execute the same analysis. This aspect of reproducibility is useful to consider because it highlights the difficulties that are encountered when you move between computational environments. The dependencies, environment variables, file systems, permissions, hardware, etc., typically differ from one environment to the next and can cause endless headaches. Some people point to Docker as a primary solution to this problem, but Docker is typically prohibited on HPC systems because its daemon runs with root privileges. Operationally, the problem of portability is a huge one for bioinformaticians who are asked by their collaborators to execute analyses developed by other groups, and it is the reason why we sometimes start to feel like UNIX gurus more than anything else.

Transparency: The ability of researchers to inspect and understand what analyses are being performed. This is more of a global problem in concept than in practice: people like to talk about how they mistrust black box analyses, but I don’t know anybody who has read through the code for BWA searching for potential bugs. At the local level, I think that the transparency people actually need is at the level of the pipeline or workflow. We want to know which individual tools are being invoked, and with what parameters, even if we aren’t qualified (speaking for myself) to debug any Java or C code.

Technical Debt: The amount of work required to mitigate any of the challenges mentioned above. This is the world that we live in, and nobody talks about it. With infinite time and effort it is possible to implement almost any workflow on almost any infrastructure, but the real question is how much effort it will take. It is important to recognize when you are incurring technical debt that will have to be paid back by yourself or others in the field. My rule of thumb, for any analysis, is to think about how easily I will be able to re-run everything from scratch when reviewers ask what would be different if we changed a single parameter. If it’s difficult in the slightest for me to do this, it will be almost impossible for others to reproduce my analysis.

Final Thoughts

I’ve been spending a lot of time recently on workflow managers, and I have found that there are quite a number of systems which provide strict computational reproducibility with a high degree of transparency. The point where they fall down, through no fault of their own, is the ease with which they can be implemented on different computational infrastructures. It is just a complete mess to run an analysis in the exact same way in a diverse set of environments, and it requires that the development teams for those tools devote time and energy to account for all of those eventualities. In a world where very little funding goes to bioinformatics infrastructure, reproducibility will always be a challenge, but I am hopeful that things are getting better every day.

Quick note on workflow managers

After having written a pretty negative assessment of the state of the field for workflow managers (those pieces of software which make it easier to run multiple other pieces of software in a controlled, coordinated manner), I’ve been feeling like I needed to put out an update. The field has changed a lot in the last few months, and I’d like to be less out of date.

A Few Good Options

It turns out that there are a few good options out there: workflow managers that don’t take too long to figure out how to use, that have some cloud computing support, and that have growing communities of users. The two best options I’ve seen so far are Cromwell and Nextflow. Nextflow is pretty popular in Europe and Cromwell is being adopted by the Broad Institute, so they both are reasonable options to try out. I’ve been able to get them both up and running without too much work, but there are some inherent challenges with any workflow manager that I think will always present some stumbling blocks.

Issue 1 — Where do you execute your command?

Fundamentally, a workflow manager executes a set of commands, each of which consumes and produces files. However, the operation of executing a command is completely different depending on whether you’re trying to run it on your laptop, your local HPC, the Google Cloud, AWS, or Azure. Each of those execution options comes with its own idiosyncratic settings for permissions, authentication, formatting, etc. A big part of getting up and running with any workflow manager is getting all of those settings configured in just the right way. It’s not glamorous, but it’s important and it takes time.
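As a rough illustration of why the same command looks different in each place, here is a minimal sketch that builds (but does not run) the submission for one task on two backends. The `Task` class, the backend names, and the file names are my own inventions for this example, not how Nextflow or Cromwell are implemented. The only real flags are the standard Slurm `sbatch` options `--cpus-per-task`, `--mem`, and `--wrap`. Cloud backends are left out because they also need queues, credentials, and container settings.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One command plus the resources it needs (hypothetical structure)."""
    command: str
    cpus: int = 1
    memory_gb: int = 4


def local_argv(task):
    # On a laptop, just hand the command to a shell.
    return ["bash", "-c", task.command]


def slurm_argv(task):
    # On a Slurm-managed HPC, wrap the command in a batch submission.
    return [
        "sbatch",
        f"--cpus-per-task={task.cpus}",
        f"--mem={task.memory_gb}G",
        f"--wrap={task.command}",
    ]


BACKENDS = {"local": local_argv, "slurm": slurm_argv}


def build_submission(task, backend):
    """Return the argument list used to launch `task` on `backend`."""
    if backend not in BACKENDS:
        raise ValueError(f"no settings configured for backend {backend!r}")
    return BACKENDS[backend](task)


# Hypothetical file names; the task definition itself never changes.
task = Task(command="bwa mem ref.fa reads.fq > aln.sam", cpus=8, memory_gb=16)
local_cmd = build_submission(task, "local")
slurm_cmd = build_submission(task, "slurm")
```

The task is defined once, and everything that varies by environment lives in the backend functions. That separation is the part that takes time to configure.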

Issue 2 — When do you execute your command?

A good workflow manager only executes commands when it’s appropriate: when the inputs are available and the outputs haven’t been produced yet. Doing this properly means that you can restart and rerun workflows without duplicating effort, but it also requires that you keep track of what commands have been run before. This can take a bit of effort to configure. As an aside, the traditional training path for bioinformatics folks is to start with BASH scripting, where you run a command when the output files don’t already exist. This is not the method that provides the most reproducible results, and it is also not the method used by Nextflow or Cromwell. I believe that this is the Snakemake model, but I have less experience there. Lots of complexity is hidden inside this issue.
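The two ways of deciding when to run can be sketched side by side. This is a toy version under my own assumptions, not the actual caching logic of any particular workflow manager: one rule only checks whether the output file exists, and the other keeps a record of completed tasks keyed on a hash of the command and the contents of its inputs. The `count_reads` command and the file names are hypothetical.

```python
import hashlib
import os
import tempfile


def should_run_if_missing(output_path):
    """BASH-style rule: run only when the output file does not exist."""
    return not os.path.exists(output_path)


def task_key(command, input_paths):
    """Hash the command string plus the contents of every input file."""
    h = hashlib.sha256(command.encode("utf-8"))
    for path in sorted(input_paths):
        with open(path, "rb") as handle:
            h.update(handle.read())
    return h.hexdigest()


def should_run_if_uncached(command, input_paths, completed_keys):
    """Hash-based rule: run unless this exact command and input were done."""
    return task_key(command, input_paths) not in completed_keys


with tempfile.TemporaryDirectory() as workdir:
    reads = os.path.join(workdir, "reads.txt")
    output = os.path.join(workdir, "counts.txt")
    with open(reads, "w") as handle:
        handle.write("ACGT\n")
    with open(output, "w") as handle:
        handle.write("1\n")  # pretend the task already ran once

    command = "count_reads --min-length 4"
    completed = {task_key(command, [reads])}

    # Nothing changed: neither rule asks for a rerun.
    rerun_unchanged = should_run_if_uncached(command, [reads], completed)
    # A single parameter changed: only the hash-based rule can notice.
    rerun_new_param = should_run_if_uncached(
        "count_reads --min-length 8", [reads], completed
    )

    # The input file changed, but the old output is still sitting on disk.
    with open(reads, "w") as handle:
        handle.write("ACGT\nTTGA\n")
    rerun_by_existence = should_run_if_missing(output)
    rerun_by_hash = should_run_if_uncached(command, [reads], completed)
```

The existence check happily reuses a stale output after the input changes, while the hash-based rule asks for a rerun. The price is that you have to store and manage the record of completed tasks somewhere, which is part of the configuration effort.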

Issue 3 — Where does your data live?

One of the big attractions of a good workflow manager is being able to run the exact same analysis on my laptop, an HPC, or the cloud. However, you really need to have the data live next to the execution environment. It would be insane to download and upload files from my laptop for every single task that’s executed in the cloud. This means that a big part of getting up and running with a cloud-based workflow manager is getting all of your data organized and accessible in the same system that you want to run the tasks in. This takes time and means that you really have to commit to a model for execution.
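One small way to see the commitment involved is to check, before launching anything, which inputs live outside the storage system that sits next to the execution environment. The sketch below is an assumption-laden toy: the backend names, the mapping from backend to URI scheme, and the bucket and file names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mapping from execution backend to the URI scheme of the
# storage that sits next to it; plain file paths have an empty scheme.
NATIVE_SCHEME = {"local": "", "hpc": "", "aws": "s3", "google": "gs"}


def inputs_needing_transfer(input_uris, backend):
    """List the inputs that are not stored next to the given backend."""
    native = NATIVE_SCHEME[backend]
    return [uri for uri in input_uris if urlparse(uri).scheme != native]


inputs = [
    "s3://example-bucket/sample1.fastq.gz",
    "/home/me/sample2.fastq.gz",
]
to_move_for_aws = inputs_needing_transfer(inputs, "aws")
to_move_for_local = inputs_needing_transfer(inputs, "local")
```

Whichever backend you pick, some of these files have to be moved first, and that one-time reorganization is the cost of committing to an execution model.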

Wrapping Up

While this post is pretty meandering and vague, all I mean to add here is that the area of workflow managers is expanding rapidly and lots of good people are doing great development. That said, the endeavor is fundamentally challenging and it will require a good amount of time to configure everything and get up and running. I encourage you to try out the options that exist and share your experiences with the world. This is the way of the future, and it would be great if we built a better future together.