bioinformatics

Working with Nextflow at Fred Hutch

I’ve been putting in a bit of work recently trying to make it easier for other researchers at Fred Hutch to use Nextflow as a way to run their bioinformatics workflows, while also getting the benefits of cloud computing and Docker-based computational reproducibility.

You can see some slides describing some of that content here, including a description of the motivation for using workflow managers, as well as a more detailed walk-through of using Nextflow right here at Fred Hutch.

Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

  1. Start with a survey of people with and without a disease,

  2. Collect WGS data from microbiome samples,

  3. Find microbial CAGs that are associated with disease, and then

  4. Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Bioinformatics: Reproducibility, Portability, Transparency, and Technical Debt

I’ve been thinking a lot about what people are talking about when they talk about reproducibility. It has been helpful to start to break apart the terminology in order to distinguish between some conceptually distinct, albeit highly intertwined, concepts.

Bioinformatics: Strictly speaking, analysis of data for the purpose of biological research. In practice, the analysis of large files (GBs) with a series of compiled programs, each of which may have a different set of environmental dependencies and computational resource requirements.

Reproducibility: An overarching concept describing how easily a bioinformatic analysis performed at one time may be able to be executed a second time, potentially by a different person, at a different institution, or on a different set of input data. There is also a strict usage of the term which describes the computational property of an analysis in which the analysis of an identical set of inputs will always produce an identical set of outputs. These two meanings are related, but not identical. Bioinformaticians tend to accept a lack of strict reproducibility (e.g., the order of alignment results may not be consistent when multithreading), but very clearly want to have general reproducibility in which the biological conclusions drawn from an analysis will always be the same from identical inputs.

Portability: The ability of researchers at different institutions (or in different labs) to execute the same analysis. This aspect of reproducibility is useful to consider because it highlights the difficulties that are encountered when you move between computational environments. Each set of dependencies, environmental variables, file systems, permissions, hardware, etc., is typically quite different and can cause endless headaches. Some people point to Docker as a primary solution to this problem, but it is typical for Docker to be prohibited on HPCs because it requires root access. Operationally, the problem of portability is a huge one for bioinformaticians who are asked by their collaborators to execute analyses developed by other groups, and the reason why we sometimes start to feel like UNIX gurus more than anything else.

Transparency: The ability of researchers to inspect and understand what analyses are being performed. This is more of a global problem in concept than in practice — people like to talk about how they mistrust black box analyses, but I don’t know anybody who has read through the code for BWA searching for potential bugs. At the local level, I think that the level of transparency that people actually need is at the level of the pipeline or workflow. We want to know what each of the individual tools are that are being invoked, and with what parameters, even if we aren’t qualified (speaking for myself) to debug any Java or C code.

Technical Debt: The amount of work required to mitigate any of the challenges mentioned above. This is the world that we live in which nobody talks about. With infinite time and effort it is possible to implement almost any workflow on almost any infrastructure, but the real question is how much effort it will take. It is important to recognize when you are incurring technical debt that will have to be paid back by yourself or others in the field. My rule of thumb is to think about, for any analysis, how easily I will be able to re-run all of the analyses from scratch when reviewers ask what would be different if we changed a single parameter. If it’s difficult in the slightest for me to do this, it will be almost impossible for others to reproduce my analysis.

Final Thoughts

I’ve been spending a lot of time recently on workflow managers, and I have found that there are quite a number of systems which provide strict computational reproducibility with a high degree of transparency. The point where they fall down, at no fault of their own, is the ease with which they can be implemented on different computational infrastructures. It is just a complete mess to be able to run an analysis in the exact same way in a diverse set of environments, and it requires that the development teams for those tools devote time and energy to account for all of those eventualities. In a world where very little funding goes to bioinformatics infrastructure reproducibility will always be a challenge, but I am hopeful that things are getting better every day.

The Rise of the Machines: Workflow Managers for Bioinformatics

As with many things these days, it started with Twitter and it went further than I expected.

The other day I wrote a slightly snarky tweet

There were a handful of responses to this, almost all of them gently pointing out to me that there are a ton of workflow managers out there, some of which are quite good. So, rather than trying to dive further on Twitter (a fool’s errand), I thought I would explain myself in more detail here.

What is “being a bioinformatician”?

“Bioinformatics” is a term that is broadly used by people like me, and really quite poorly defined. In the most literal sense, it is the analysis of data relating to biology or biomedical science. However, the shade of meaning which has emerged emphasizes the scope and scale of the data being analyzed. In 2018, being a bioinformatician means dealing with large datasets (genome sequencing, flow cytometry, RNAseq, etc.) which is made up of a pretty large number of pretty large files. Not only is the data larger than you can fit into Excel (by many orders of magnitude), but it often cannot fit onto a single computer, and it almost always takes a lot of time and energy to analyze.

The aspect of this definition useful here is that bioinformaticians tend to

  1. keep track of and move around a large number of extremely large files (>1Gb individually, 100’s of Gbs in aggregate)

  2. analyze those files using a “pipeline” of analytical tools — input A is processed by algorithm 1 to produce file B, which is processed by algorithm 2 to produce file C, etc. etc.

Here’s a good counterpoint that was raised to the above:

Good point, now what is a “Workflow Manager”?

A workflow manager is a very specific thing that takes many different forms. At its core, a workflow manager will run a set of individual programs or “tasks” as part of a larger pipeline or “workflow,” automating a process that would typically be executed by (a) a human entering commands manually into the command line, or (b) a “script” containing a list of commands to be executed. There can be a number of differences between a “script” and a “workflow,” but generally speaking the workflow should be more sophisticated, more transportable between computers, and better able to handle the complexities of execution that would simply result in an error for a script.

This is a very unsatisfying definition, because there isn’t a hard and fast delineation between scripts and workflow, and scripts are practically the universal starting place for bioinformaticians as they learn how to get things done with the command line.

Examples of workflow managers (partially culled from the set of responses I got on Twitter):

My Ideal Workflow Manager

I was asked this question, and so I feel slightly justified in laying out my wishlist for a workflow manager:

  • Tasks consist of BASH snippets run inside Docker containers

  • Supports execution on a variety of computational resources: local computer, local clusters (SLURM, PBS), commercial clusters (AWS, Google Cloud, Azure)

  • The dependencies and outputs of a task can be defined by the output files created by the task (so a task isn’t re-run if the output already exists)

  • Support for file storage locally as well as object stores like AWS S3

  • Easy to read, write, and publish to a general computing audience (highly subjective)

  • Easy to set up and get running (highly subjective)

The goal here is to support reproducibility and portability, both to other researchers in the field, but also to your future self who wants to rerun the same analysis with different samples in a year’s time and doesn’t want to be held hostage to software dependency hell, not to mention the crushing insecurity of not knowing whether new results can be compared to previous ones.

Where are we now?

The state of the field at the moment is that we have about a dozen actively maintained projects that are working in this general direction. Ultimately I think the hardest thing to achieve is the last two bullets on my list. Adding support for services which are highly specialized (such as AWS) necessarily adds a ton of configuration and execution complexity that makes it even harder to a new user to pick up and use a workflow that someone hands to them.

Case in point — I like to run things inside Docker containers using AWS Batch, but this requires that all of the steps of a task (coping the files down from S3, running a long set of commands, checking the outputs, and uploading the results back to S3) be encapsulated in a single command. To that end, I have had to write wrapper scripts for each of my tools and bake them into the Docker image so that they can be invoked in a single command. As a result, I’m suck using the Docker containers that I maintain, instead of an awesome resource like BioContainers. This is highly suboptimal, and would be difficult for someone else to elaborate and develop further without completely forking the repo for every single command you want to tweak. Instead, I would much rather if we could all just contribute to and use BioContainers and use a workflow system that took care of all of the complex set of commands executed inside each container.

In the end, I have a lot of confidence that the developers of workflow managers are working towards exactly the end goals that I’ve outlined. This isn’t a highly controversial area, it just requires an investment in computational infrastructure that our R&D ecosystem has always underinvested in. If the NIH decided today that they were going to fund the development and ongoing maintenance of three workflow managers by three independent groups (and their associated OSS communities), we’d have a much higher degree of reproducibility in science, but that hasn’t happened (as far as I know — I am probably making a vast oversimplification here for dramatic effect).

Give workflow managers a try, give back to the community where you can, and let’s all work towards a world where no bioinformatician ever has to run BWA by hand and look up which flag sets the number of threads.

Hybrid Approach to Microbiome Research (to Culture, and Not to Culture)

I was rereading a great paper from the Huttenhower group (at Harvard) this week and I was struck by a common theme that it shared with another great paper from the Segre group (at NIH), which I think is a nice little window into how good scientists are approaching the microbiome these days. 

The paper I'm thinking about is Hall, et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients (2017) Genome Medicine. The paper is open access so you can feel free to go read it yourself, but my super short summary is: (1) they analyzed the gut microbiome from patients with (and without) IBD and found that a specific clade of Ruminococcus gnavus was enriched in IBD; and then (2) they took the extra step of growing up those bacteria in the lab and sequencing their genomes in order to figure out which specific genes were enriched in IBD. 

The basic result is fantastically interesting – they found enriched genes that were associated with oxidative stress, adhesion, iron acquisition, and mucus utilization, all of which make sense in terms of IBD – but I mostly want to talk about the way they figured this out. Namely, they took a combined approach of (1) analyzing the total DNA from stool samples with culture free genome sequencing, and then (2) they isolated and grew R. gnavus strains in culture from those same stool samples so that they could analyze their genomes.

Fig. 3:  R. gnavus  metagenomic strain phylogeny. 

Fig. 3: R. gnavus metagenomic strain phylogeny. 

Now, if you cast your mind back to the paper on pediatric atopic dermatitis from Drs. Segre and Kong (Byrd, et al. 2017 Science Translational Medicine) you will remember that they took a very similar approach. They did culture-free sequencing of skin samples, while growing Staph strains from those same skin samples in parallel. With the cultures in hand they were able to sequence the genomes of those strains as well as testing for virulence in a mouse model of dermatitis.

So, why do I think this is worth writing a post about? It helps tell the story of how microbiome research has been developing in recent years. At the start, all we could do was describe how different organisms were in higher and lower abundance in different body sites, disease states, etc. Now that the field has progressed, it is becoming clear that the strain-level differences within a given species may be very important to human health and disease. We know that although people may contain a similar set of common bacterial species, the exact strains in their gut (for example) are different between people and usually stick around for months and years. 

With this increased focus on strain-level diversity, we are coming up against the technological challenges of characterizing those differences between people, and how those differences track with health and disease. The two papers I've mentioned here are not the only ones to take this approach (it's also worth mentioning this great paper on urea metabolism in Crohn's disease from UPenn), which was to neatly interweave the complementary sets of information that can gleaned from culture-free whole-genome shotgun sequencing as well as culture-based strain isolation. Both of those techniques are difficult and they require extremely different sets of skills, so it's great to see collaborations come together to make these studies possible.

With such a short post, I've surely left out some important details from these papers, but I hope that the general reflection and point about the development of microbiome research has been of interest. It's certainly going to stay on my mind for the years to come.

Detecting viruses from short-read metagenomic data

My very first foray into the microbiome was in graduate school when my advisor (Dr. Rick Bushman) suggested that I try to use a high-throughput genome sequencing instrument (a 454 pyrosequencer) to characterize the viruses present in the healthy human gut. That project ended up delving deeply into the molecular evolution of bacteriophage populations, and gave me a deep appreciation for the unpredictable complexity of viral genomes. 

While I wasn't the first to publish in this area, the general approach has become more popular over the last decade and a number of different groups have put their own spin on it. 

Because of the diversity of viruses, you can always customize and enhance different aspects of the process, from sample collection, purification, and DNA isolation to metagenomic sequencing, bioinformatic analysis, and visualization. It's been very interesting to see how sophisiticated some of these methods have become.

On the bioinformatic analysis side of things, I ended up having the most success in those days by assembling each viral genome from scratch and measuring the evolution of that genome over time. More of the bespoke approach to bioinformatics, rather than the assembly line. 

In contrast, these days I am much more interested in computational approaches that allow me to analyze very large numbers of samples with a minimum of human input required. For that reason I don't find de novo assembly to be all that appealing. It's not that the computation can't be done, it's more than I have a hard time imagining how to wrap my brain around the results in a productive way. 

In contrast, one approach that I have been quite happy with is also much more simple minded. Instead of trying to assemble all of the new genomes in a sample, it's much easier to simply align your short reads against a reference database of viral genomes. One of the drawbacks is that you can only detect a virus that has been detected before. On the other hand, one of the advantages is also that you can only detect a virus that has been detected before, meaning that all samples can be rapidly and intuitively compared against each other. 

To account for the rapid evolution of viral genomes, I think it's best to do this alignment in protein space against the set of proteins contained in every viral genome. This makes the alignments a bit more robust. 

If you would like to perform this simple read alignment method for viral detection, you can use the code that I've put together at this GitHub repo. There is also a Quay repository hosting a Docker image that you can use to run this analysis without having to install any dependencies. 

This codebase is under active development (version 0.1 at time of writing) so please be sure to test carefully against your controls to make sure everything is behaving well. At some point I may end up publishing more about this method, but it may be just too simple to entice any journal. 

Lastly, I want to point out that straightforward alignment of reads does not account for any number of confounding factors, including but not limited to:

  1. presence of human or host DNA
  2. shared genome segments between extant viruses
  3. novel viral genomes
  4. complex viral quasi-species
  5. integration of prophages into host genomes

There are a handful of tools out there that do try to deal with some of those problems in different ways, quite likely to good effect. However, it's good to remember that with every additional optimization you add a potential confounding factor. For example, it sounds like a good idea to remove human sequences from your sample, but that runs the risk of also eliminating viral sequences that happen to be found with the human genome, such as lab contaminants or integrated viral genome fragments. There are even a few human genes with deep homology to existing viral genes, thought to be due to an ancient integration and subsequent repurposing. All I mean to say here is that sometimes it's better to remove the assumptions from your code, and instead include a good set of controls to your experiment that can be used to robustly eliminate signal coming from, for example, the human genome.