Quick note on workflow managers

After having written a pretty negative assessment of the state of the field for workflow managers (those pieces of software which make it easier to run multiple other pieces of software in a controlled, coordinated manner), I’ve been feeling like I needed to put out an update. The field has changed a lot in the last few months, and I’d like to be less out of date.

A Few Good Options

It turns out that there are a few good options out there: workflow managers that don’t take too long to figure out how to use, which have some cloud computing support, and which have communities of users developing. The two best options I’ve seen so far are Cromwell and Nextflow. Nextflow is pretty popular in Europe and Cromwell is being adopted by the Broad, so they both are reasonable options to try out. I’ve been able to get them both up and running without too much work, but there are some inherent challenges with any workflow manager that I think will always present some stumbling blocks.

Issue 1 — Where do you execute your command?

Fundamentally, a workflow manager executes a set of commands, each of which consumes and produces files. However, the operation of executing a command is completely different whether you’re trying to run it on your laptop, your local HPC, the Google Cloud, AWS, or Azure. Each of those execution options comes with their own idiosyncratic settings for permissions, authentication, formatting, etc. A big part of getting up and running with any workflow manager is getting all of those settings configured in just the right way. It’s not glamorous, but it’s important and it takes time.

Issue 2 — When do you execute your command?

A good workflow manager only executes commands when it’s appropriate — when the inputs are available and the outputs haven’t been produced yet. Doing this properly means that you can restart and rerun workflows without duplicating effort, but that also requires that you can keep track of what commands have been run before. This can also require a bit of effort to configure. As an aside, the traditional training path for bioinformatics folks is to start with BASH scripting, where you run a command when the output files don’t already exist. This is not the method that provides the most reproducible results, and it is also not the method used by Nextflow or Cromwell. I believe that this is the Snakemake model, but I have less experience there. Lots of complexity is hidden inside this issue.

Issue 3 — Where does your data live?

One of the big attractions of a good workflow manager is being able to run the exact same analysis on my laptop, an HPC, or the cloud. However, you really need to have the data live next to the execution environment — it would be insane to download and upload files from my laptop for every single task that’s executed in the cloud. This means that getting up and running with a cloud based workflow manager is getting all of your data organized and accessible in the same system that you want to run the tasks in. This takes time and means that you really have to commit to a model for execution.

Wrapping Up

While this post is pretty meandering and vague, all I mean to add here is that the area of workflow managers is expanding rapidly and lots of good people are doing great development. That said, the endeavor is fundamentally challenging and it will require a good amount of time to configure everything and get up and running. I encourage you to try out the options that exist and share your experiences with the world. This is the way of the future, and it would be great if we built a better future together.

The Blessing and the Curse of Dimensionality

A paper recently caught my eye, and I think it is a great excuse to talk about data scale and dimensionality.

Vatanen, T. et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature 562, 589–594 (2018). (link)

In addition to having a great acronym for study of child development, they also sequenced 10,913 metagenomes from 783 children.

This is a ton of data.

If you haven’t worked with a “metagenome,” it’s usually about 10-20 million short words, each corresponding to 100-300 bases of a microbial genome. It’s a text file with some combination of ATCG written out over tens of millions of lines, with each line being a few hundred letters long. A single metagenome is big. It won’t open in Word. Now imagine you have 10,000 of them. Now imagine you have to make sense out of 10,000 of them.

Now, I’m being a bit extreme – there are some ways to deal with the data. However, I would argue that it’s this problem, how to deal with the data, that we could use some help with.

Taxonomic classification

The most effective way to deal with the data is to take each metagenome and figure out which organisms are present. This process is called “taxonomic classification” and it’s something that people have gotten pretty good at recently. You take all of those short ATCG words, you match them against all of the genomes you know about, and you use that information to make some educated about what collection of organisms are present. This is a biologically meaningful reduction in the data that results in hundreds or thousands of observations per sample. You can also validate these methods by processing “mock communities” and seeing if you get the right answer. I’m a fan.

With taxonomic classification you end up with thousands of observations (in this case organisms) across X samples. In the TEDDY study they had >10,000 samples, and so this dataset has a lot of statistical power (where you generally want more samples than observations).

Metabolic reconstruction

The other main way that people analyze metagenomes these days is by quantifying the abundance of each biochemical pathway present in the sample. I won’t talk about this here because my opinions are controversial and it’s best left for another post.

Gene-level analysis

I spend most of my time these days on “gene-level analysis.” This type of analysis tries to quantify every gene present in every genome in every sample. The motivation here is that sometimes genes move horizontally between species, and sometimes different strains within the same species will have different collections of genes. So, if you want to find something that you can’t find with taxonomic analysis, maybe gene-level analysis will pick it up. However, that’s an entirely different can of worms. Let’s open it up.

Every microbial genome contains roughly 1,000 genes. Every metagenome contains a few hundred genomes. So every metagenome contains hundreds of thousands of genes. When you look across a few hundred samples you might find a few million unique genes. When you look across 10,000 samples I can only guess that you’d find tens of millions of unique genes.

Now the dimensionality of the data is all lopsided. We have tens of millions of genes, which are observed across tens of thousands of samples. A biostatistician would tell us that this is seriously underpowered for making sense of the biology. Basically, this is an approach that just doesn’t work for studies with 10,000 samples, which I find to be pretty daunting.

Dealing with scale

The way that we find success in science is that we take information that a human cannot comprehend, and we transform it into something that a human can comprehend. We cannot look at a text file with ten million lines and understand anything about that sample, but we can transform it into a list of organisms with names that we can Google. I’m spending a lot of my time trying to do the same thing with gene-level metagenomic analysis, trying to transform it into something that a human can comprehend. This all falls into the category of “dimensionality reduction,” trying to reduce the number of observations per sample, while still retaining the biological information we care about. I’ll tell you that this problem is really hard and I’m not sure I have the single best true angle on it. I would absolutely love to have more eyes on the problem.

It increasingly seems like the world is driven by people who try to make sense of large amounts of data, and I would humbly ask for anyone who cares about this to try to think about metagenomic analysis. The data is massive, and we have a hard time figuring out how to make sense of it. We have a lot of good starts to this, and there are a lot of good people working in this area (too many to list), but I think we could always use more help.

The authors of the paper who analyzed 10,000 metagenomes learned a lot about how the microbiome develops during early childhood, but I’m sure that there is even more we can learn from this data. And I am also sure that we are getting close to world where we have 10X the data per sample, and experiments with 10X the samples. That is a world that I think we are ill-prepared for, and I’m excited to try to build the tools that we will need for it.


The Rise of the Machines: Workflow Managers for Bioinformatics

As with many things these days, it started with Twitter and it went further than I expected.

The other day I wrote a slightly snarky tweet

There were a handful of responses to this, almost all of them gently pointing out to me that there are a ton of workflow managers out there, some of which are quite good. So, rather than trying to dive further on Twitter (a fool’s errand), I thought I would explain myself in more detail here.

What is “being a bioinformatician”?

“Bioinformatics” is a term that is broadly used by people like me, and really quite poorly defined. In the most literal sense, it is the analysis of data relating to biology or biomedical science. However, the shade of meaning which has emerged emphasizes the scope and scale of the data being analyzed. In 2018, being a bioinformatician means dealing with large datasets (genome sequencing, flow cytometry, RNAseq, etc.) which is made up of a pretty large number of pretty large files. Not only is the data larger than you can fit into Excel (by many orders of magnitude), but it often cannot fit onto a single computer, and it almost always takes a lot of time and energy to analyze.

The aspect of this definition useful here is that bioinformaticians tend to

  1. keep track of and move around a large number of extremely large files (>1Gb individually, 100’s of Gbs in aggregate)

  2. analyze those files using a “pipeline” of analytical tools — input A is processed by algorithm 1 to produce file B, which is processed by algorithm 2 to produce file C, etc. etc.

Here’s a good counterpoint that was raised to the above:

Good point, now what is a “Workflow Manager”?

A workflow manager is a very specific thing that takes many different forms. At its core, a workflow manager will run a set of individual programs or “tasks” as part of a larger pipeline or “workflow,” automating a process that would typically be executed by (a) a human entering commands manually into the command line, or (b) a “script” containing a list of commands to be executed. There can be a number of differences between a “script” and a “workflow,” but generally speaking the workflow should be more sophisticated, more transportable between computers, and better able to handle the complexities of execution that would simply result in an error for a script.

This is a very unsatisfying definition, because there isn’t a hard and fast delineation between scripts and workflow, and scripts are practically the universal starting place for bioinformaticians as they learn how to get things done with the command line.

Examples of workflow managers (partially culled from the set of responses I got on Twitter):

My Ideal Workflow Manager

I was asked this question, and so I feel slightly justified in laying out my wishlist for a workflow manager:

  • Tasks consist of BASH snippets run inside Docker containers

  • Supports execution on a variety of computational resources: local computer, local clusters (SLURM, PBS), commercial clusters (AWS, Google Cloud, Azure)

  • The dependencies and outputs of a task can be defined by the output files created by the task (so a task isn’t re-run if the output already exists)

  • Support for file storage locally as well as object stores like AWS S3

  • Easy to read, write, and publish to a general computing audience (highly subjective)

  • Easy to set up and get running (highly subjective)

The goal here is to support reproducibility and portability, both to other researchers in the field, but also to your future self who wants to rerun the same analysis with different samples in a year’s time and doesn’t want to be held hostage to software dependency hell, not to mention the crushing insecurity of not knowing whether new results can be compared to previous ones.

Where are we now?

The state of the field at the moment is that we have about a dozen actively maintained projects that are working in this general direction. Ultimately I think the hardest thing to achieve is the last two bullets on my list. Adding support for services which are highly specialized (such as AWS) necessarily adds a ton of configuration and execution complexity that makes it even harder to a new user to pick up and use a workflow that someone hands to them.

Case in point — I like to run things inside Docker containers using AWS Batch, but this requires that all of the steps of a task (coping the files down from S3, running a long set of commands, checking the outputs, and uploading the results back to S3) be encapsulated in a single command. To that end, I have had to write wrapper scripts for each of my tools and bake them into the Docker image so that they can be invoked in a single command. As a result, I’m suck using the Docker containers that I maintain, instead of an awesome resource like BioContainers. This is highly suboptimal, and would be difficult for someone else to elaborate and develop further without completely forking the repo for every single command you want to tweak. Instead, I would much rather if we could all just contribute to and use BioContainers and use a workflow system that took care of all of the complex set of commands executed inside each container.

In the end, I have a lot of confidence that the developers of workflow managers are working towards exactly the end goals that I’ve outlined. This isn’t a highly controversial area, it just requires an investment in computational infrastructure that our R&D ecosystem has always underinvested in. If the NIH decided today that they were going to fund the development and ongoing maintenance of three workflow managers by three independent groups (and their associated OSS communities), we’d have a much higher degree of reproducibility in science, but that hasn’t happened (as far as I know — I am probably making a vast oversimplification here for dramatic effect).

Give workflow managers a try, give back to the community where you can, and let’s all work towards a world where no bioinformatician ever has to run BWA by hand and look up which flag sets the number of threads.

Niche Theory and the Human Gut Microbiome

Without really having the time to write a full blog post, I want to mention two recent papers that have strongly influenced my understanding of the microbiome.

Niche Theory

The ecological concept of the “niche” is something that is discussed quite often in the field of the microbiome, namely that each bacterial species occupies a niche, and any incoming organism trying to use that same niche will be blocked from establishing itself. The mechanisms and physical factors that cause this “niche exclusion” is probably much more clearly described in the ecological study of plants and animals — in the case of the microbiome I have often wondered just what utility or value this concept really had.

That all changed a few weeks ago with a pair of papers from the Elinav group.

The Papers

Personalized Gut Mucosal Colonization Resistance to Empiric Probiotics Is Associated with Unique Host and Microbiome Features


Quick, Quick Summary

At the risk of oversimplifying, I’ll try to summarize the two biggest points I took home from these papers.

  1. Lowering the abundance and diversity of bacteria in the gut can increase the probability that a new strain of bacteria (from a probiotic) is able to grow and establish itself

  2. The ability of a new bacteria (from a probiotic) to grow and persist in the gut varies widely on a person-by-person basis

Basically, the authors showed quite convincingly that the “niche exclusion” effect does indeed happen in the microbiome, and that the degree of niche exclusion is highly dependent on what microbes are present, as well as a host of other unknown factors.

So many more questions

Like any good study, this raises more questions than it answers. What genetic factors determine whether a new strain of bacteria can grow in the gut? Is it even possible to design a probiotic that can grow in the gut of any human? Are the rules for “niche exclusion” consistent across bacterial species or varied?

As an aside, these studies demonstrate the consistent observation that probiotics generally don’t stick around after you take them. If you have to take a probiotic every day in order to sustain its effect, it’s not a real probiotic.

I invite you to read over these papers and take what you can from them. If I manage to put together a more lengthy or interesting summary, I’ll make sure to post it at some point.

Microbiome Research: Hope over Hype

The story of microbiome research is one of hope and hype, both elevated to an extreme and fraught with controversy. You need look no further than popular blogs and high profile review articles to see this conflict play out. 

As a passionate microbiome researcher, I like to highlight the hope that I see -- the hope that humans will be able to understand and harness the microbiome to improve human health. 

The story of hope I have for you today is a story of the heart. The human heart. Well, heart disease. Reducing heart disease, really.


When I was in graduate school I saw the most fascinating lecture. It was from Stanley Hazen, a researcher at the Cleveland Clinic, and he was describing an experiment in which it appeared that bacteria in the gut were responsible for converting a normal part of our diet into a molecule that promoted atherosclerosis (heart disease). With a combination of (1) molecular analysis of the blood of humans with heart disease and (2) experiments in mice varying the diet and microbes present in the gut, they showed pretty convincingly that bacteria were converting phosphatidylcholine from food into TMAO, which then promoted heart disease (Wang, et al. 2011 Nature). 


Fast forward 7 years, and the microbiome research field has advanced far enough to identify the exact bacterial genes involved in this process. Not only that, they are able to inhibit those specific bacterial enzymes and show in a mouse model that levels of harmful TMAO are pushed down as a result (Roberts, et al. 2018 Nature).

 Fig. 5: A microbial choline TMA lyase inhibitor reverses diet-induced changes in cecal microbial community composition associated with plasma TMAO levels, platelet responsiveness, and in vivo thrombosis potential. Schematic of the relationship between human gut commensal choline TMA lyase activity, TMA and TMAO generation, and enhanced platelet responsiveness and thrombosis risk in the host

Fig. 5: A microbial choline TMA lyase inhibitor reverses diet-induced changes in cecal microbial community composition associated with plasma TMAO levels, platelet responsiveness, and in vivo thrombosis potential. Schematic of the relationship between human gut commensal choline TMA lyase activity, TMA and TMAO generation, and enhanced platelet responsiveness and thrombosis risk in the host

Bioinformatics Aside

One aspect of this story that I'll point out for the bioinformatics folks in the audience is that the biological mechanism involved in choline -> TMAO is not a phylogenetically conserved one. It is mediated by a set of genes that are distributed sporadically across the bacterial tree of life. For that reason and others, I am a strong supporter of microbiome analysis tools that enable gene-level comparison between large sets of samples in order to identify mechanisms like these in the future. 


My hope, my dream, is that the entire human microbiome field is able to eventually follow this path. We observe that the microbiome influences some aspect of human health, we identify the biological mechanism responsible for this effect, and then we demonstrate our knowledge and mastery of this biology to such an extent that we can intentionally manipulate this system and eventually improve human health. 

We have a long way to go, but I believe that this is the path that we can follow, the example we can aspire to. I hope that this story gives you hope, and helps cut through the hype. 

Update on Making CAGs, or The Importance of Good Software

I wrote a little bit recently on why I am so interested in making CAGs from metagenomic data, and I wanted to provide a little update on that topic.

CAGs are "Co-Abundant Genes," which is to say the groups of genes which are found at similar abundances in metagenomic data across many different samples. The inference from the "co-abundance" observation is that those genes are likely present on the same bits of DNA (i.e. chromosomes) across those samples. I am particularly interested in these groups of genes because of the highly mosaic nature of microbial genomic evolution.

Having spent a bit of time on maximizing the computational efficiency of finding CAGs, I have been struck by just how hard it is. Naively, the core computational task requires calculating a similarity (or co-abundance) metric for every pair of genes, which scales exponentially with the number of genes. For 100 genes there are ~5,000 comparisons, for 1,000 genes there are ~500,000, for 10,000 genes there are ~50,000,000 comparisons, etc. etc.

Speaking practically, this means that it takes a long time on a very large computer to get this work done. There are all sorts of tricks to make it faster, including parallelization, code optimization, etc., but at the core it is just a hard task.

However, I happened to ask the Twitter hive mind for help, and Andrew Carroll happened to give me some great advice:

This lead me down the road of exploring the Approximate Nearest Neighbor algorithms. It turns out that there a number of different algorithms out there which approximate this task of finding the closest set of points in highly dimensional space, which is exactly what I needed to make CAGs. There's even a website showing you which of these pieces of code work the best.

Interestingly, one of these algorithms (called "annoy") was produced by Spotify, and can be used to rapidly find the nearest neighbor from a large index that can be mmapped for rapid access across multiple threads.

 Annoy ( Approximate Nearest Neighbors  Oh Yeah) 

In Closing

The point I want to reinforce is that Good Software matters. Implementing an exact solution was taking me 10 days with an expensive 72 core machine, which gave me little opportunity to iterate over different variables or analyze new datasets quickly. The advantage of using a better piece of software is not just that I can make pretty figures, it's that I can actually do science more quickly, spending less money, and getting more accurate answers. I didn't know that this software was out there, and so it wasn't until I asked some people on Twitter for me to find out about it. But now that I know I'm happy to tell the world to give it a try, and hopefully that will save you all some time, some money, and help you find out the answer to whatever questions are important.

Hybrid Approach to Microbiome Research (to Culture, and Not to Culture)

I was rereading a great paper from the Huttenhower group (at Harvard) this week and I was struck by a common theme that it shared with another great paper from the Segre group (at NIH), which I think is a nice little window into how good scientists are approaching the microbiome these days. 

The paper I'm thinking about is Hall, et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients (2017) Genome Medicine. The paper is open access so you can feel free to go read it yourself, but my super short summary is: (1) they analyzed the gut microbiome from patients with (and without) IBD and found that a specific clade of Ruminococcus gnavus was enriched in IBD; and then (2) they took the extra step of growing up those bacteria in the lab and sequencing their genomes in order to figure out which specific genes were enriched in IBD. 

The basic result is fantastically interesting – they found enriched genes that were associated with oxidative stress, adhesion, iron acquisition, and mucus utilization, all of which make sense in terms of IBD – but I mostly want to talk about the way they figured this out. Namely, they took a combined approach of (1) analyzing the total DNA from stool samples with culture free genome sequencing, and then (2) they isolated and grew R. gnavus strains in culture from those same stool samples so that they could analyze their genomes.

 Fig. 3:  R. gnavus  metagenomic strain phylogeny. 

Fig. 3: R. gnavus metagenomic strain phylogeny. 

Now, if you cast your mind back to the paper on pediatric atopic dermatitis from Drs. Segre and Kong (Byrd, et al. 2017 Science Translational Medicine) you will remember that they took a very similar approach. They did culture-free sequencing of skin samples, while growing Staph strains from those same skin samples in parallel. With the cultures in hand they were able to sequence the genomes of those strains as well as testing for virulence in a mouse model of dermatitis.

So, why do I think this is worth writing a post about? It helps tell the story of how microbiome research has been developing in recent years. At the start, all we could do was describe how different organisms were in higher and lower abundance in different body sites, disease states, etc. Now that the field has progressed, it is becoming clear that the strain-level differences within a given species may be very important to human health and disease. We know that although people may contain a similar set of common bacterial species, the exact strains in their gut (for example) are different between people and usually stick around for months and years. 

With this increased focus on strain-level diversity, we are coming up against the technological challenges of characterizing those differences between people, and how those differences track with health and disease. The two papers I've mentioned here are not the only ones to take this approach (it's also worth mentioning this great paper on urea metabolism in Crohn's disease from UPenn), which was to neatly interweave the complementary sets of information that can gleaned from culture-free whole-genome shotgun sequencing as well as culture-based strain isolation. Both of those techniques are difficult and they require extremely different sets of skills, so it's great to see collaborations come together to make these studies possible.

With such a short post, I've surely left out some important details from these papers, but I hope that the general reflection and point about the development of microbiome research has been of interest. It's certainly going to stay on my mind for the years to come.

Notes on identifying Co-Abundant Genes

One of the challenges in the field of microbial metagenomics (part of microbiome research) is going from the "bag-of-genes" stage to something more closely resembling biological reality. One of the general approaches that I have always found very compelling is to identify sets of genes that occur at a similar abundance across multiple samples, so-called Co-Abundant Genes (CAGs). In this post I will quickly describe why I like CAGs, and then talk about how I've been approaching the computational challenges associated with identifying them.

Why I like CAGs

The idea of CAGs is very appealing to me because it sits nicely between the level of a single gene and that of a complete genome. It is useful to track collections of genes because of the modular structure of bacterial genomes. As bacteria evolve they can frequently gain or lose large regions containing multiple genes, which may be due to a number of different biological mechanisms including plasmids, phage, transposons, or even just homologous recombination. Therefore it is useful to identify those cases when groups of genes are always observed together at a similar abundance, because it suggests that those genes are almost always found on the same chromosome and can be grouped together as a biologically meaningful entity.

Tackling the challenges of CAGs

The challenges of building CAGs are entirely computational, and are due to the extremely large number of genes that can be observed during metagenomic experiments.

The operation of finding CAGs can be broken into the following steps:

  1. Calculate a co-abundance metric for each pair of genes
  2. Identify those groups of genes which have a high degree of co-occurrence
  3. Validate that those CAGs are biologically meaningful

All of those steps are totally reasonable and work cleanly using off-the-shelf tools in Python or R when you have small numbers of genes (< 1,000). However, once you start having 100,000's and 1,000,000's of genes, the default tools for steps 1 and 2 will either instantly break or take forever. It's not surprising that making a pairwise distance matrix for ~1e6 by ~1e6 would break your computer -- it's got ~1e12 cells of data!

Here is how I have been tackling these challenges:

  1. Throw away most of the distance metrics. Instead of calculating all of the pairwise distances in a single matrix, iterate over all of the combinations and only store those values which are below some threshold.
  2. Use as many processors as possible. Distribute the process over many cores and run this on as big of a machine as you can find.
  3. Use iterators instead of making lists. This is more of a Python-specific note, but instead of iterating over a list of all of the pairwise combinations, make a function that yields each successive combination so that you never have to load the NxN list into memory.
  4. Cluster the quick and dirty way with single-linkage. There are many different clustering approaches that you could use, but single-linkage is incredibly simple and completely avoids the all-by-all comparison. It can be done with a streaming approach, which helps make this even halfway possible.

The last tip here is use other people's code. You're always going to have to write some code from scratch, but try to copy as much as you can from existing (open source) projects. To this end, I have been working on this problem in a public GitHub repository, and I'd be more than happy for anyone to swipe whatever bits of my code might be useful. I don't expect that anyone else has their data in exactly the way I expect, so it might be best to start with the two core functions, one which generates all of the pairwise connections from a Pandas DataFrame, and another which uses single-linkage clustering to identify CAGs from those connections. This code is not published, it's not heavily tested, and it might not even be terribly readable, but it might still be useful if this is a project you find yourself tackling.

Happy hunting.

Tracking outbreaks with whole-genome sequencing

What is "Genomic Epidemiology"?

The last ten years have seen a number of astonishing advances in our ability to track the spread of outbreaks using something called "genomic epidemiology." The basic concept here is that the pathogens that make us sick (bacteria, viruses, fungi, etc.) each have their own genome, which can be used to trace their ancestry in much the same way that we can use our own genomes to trace our own ancestry. 

Just like you can use the human genome to get an idea of how people are related to each other, you can use the microbial genomes to get an idea of how pathogens are related to each other, and therefore how they are spreading between people. 

Making sense of tons of data

Without doing a comprehensive survey, I will just say that a number of very smart and capable people have been working on all of the different steps of the process – isolating the pathogen's genome, sequencing it quickly, and comparing those genome sequences to each other. What I wanted to focus on was the final step: taking all of those genomes and getting an idea of what is actually happening. I was inspired here by the NextStrain project, which takes large numbers of genomes for something like Zika and provides an intuitive means for figuring out what's going on. 

For my own exploration, I used the data published by Roach, et al. (2015) as they sequenced every single isolate that came through an ICU over the course of a year. I should mention that the laboratory protocols necessary to achieve such a feat are very impressive.

Careful analysis of individual outbreaks involved a number of involved steps to reconstruct bacterial genomes and compare isolates at single-nucleotide resolution. Instead, my goal was to see how quickly I could take the raw data and get an idea of whether there were any clusters that would merit further in-depth investigation (were this an actual surveillance project). 

Since all of the raw data was available on SRA, I simply (1) downloaded the raw data with fastq-dump, (2) made MinHash sketches with Finch-rs, (3) calculated pairwise distances between all isolates, (4) performed Ward hierarchical clustering (the neighbor joining tree was too slow), and (5) output a Newick file with the resulting tree structure. I didn't keep track of how long each step took, but start to finish it all took me about 5 hours to finish (including the time spent reading documentation, as well as raw compute). 

So here's an example of what that dataset looks like, and here is an interactive link

 Overall view of 1262 isolates collected over 1 year in an ICU (Roach, et al. 2015)

Overall view of 1262 isolates collected over 1 year in an ICU (Roach, et al. 2015)

 Zoomed-in view of a clade including  Acinetobacter  (red),  Haemophillus  (green),  Moraxella  (teal), and  Neisseria  (blue). The rollover text shows metadata for sample 583, which was sampled from the same patient as its closest neighbor in the tree, sample 595.

Zoomed-in view of a clade including Acinetobacter (red), Haemophillus (green), Moraxella (teal), and Neisseria (blue). The rollover text shows metadata for sample 583, which was sampled from the same patient as its closest neighbor in the tree, sample 595.

Disclaimer: None of the clades shown above indicate bona fide outbreak clusters without further quality control, whole genome alignment, and inspection by a qualified professional.

I encourage you to follow the link and explore the tree – it's interesting to see how all of the data comes together. A bunch of different species are included on the same tree, which is one of the nice things about using MinHash sketches to compare genomes (you don't need to align to a common genome). When you roll your mouse over each branch you can see the metadata that the authors published for each isolate.

Take home

So what is the point here? (1) genome sequencing can be used to precisely identify closely-related bacterial isolates, (2) the community (FDA, PHE, CDC, etc.) has taken massive strides to integrate this technology into routine public health surveillance, and (3) relatively simple computational approaches can be used to quickly sift through the data to find groups that merit further inspection.

I also wanted to point out that all of the challenges associated with genome assembly and multiple sequence alignment can be entirely sidestepped by the MinHash approach, which makes it a lot easier to scale these processes up. The analysis methods also don't require a ton of compute – it's actually harder to just keep track of all the data and visualize it in a nice way than it is to compute the pairwise distances. 

So, if you're thinking about processing large collections of isolates to find potential outbreak clusters, ask a bioinformatician if MinHash might be right for you.


More Reading:

Application of whole-genome sequencing for bacterial strain typing in molecular epidemiology

Whole Genome Sequencing for the Retrospective Investigation of an Outbreak of Salmonella Typhimurium DT 8

Bacterial genome sequencing in clinical microbiology: a pathogen-oriented review 

Practical Value of Food Pathogen Traceability through Building a Whole-Genome Sequencing Network and Database

Infection control in the new age of genomic epidemiology

Some MinHash implementations:


I used to work for and have an interest in One Codex, the group which developed the open source implementation of MinHash (finch-rs) that I used in this little exercise.

Functional Analysis of Metagenomes by Likelihood Inference: FAMLI

I've been working for the last six months or so with my colleague Dr. Jonathan Golob on an algorithm to help analyze microbiome samples, and now it's ready to share with the world. 

If you'd like to read a preprint of the paper instead of this blog-based summary, you can do so on bioRxiv.

The problem

As we were analyzing a set of microbiome samples using shotgun genomes sequence data (WGS), we kept running into a tricky problem. Our goal was to identify the set of protein coding genes present in a sample, and so we aligned the WGS reads using DIAMOND against the UniRef90 database and took the best hit(s) for each read. However, we kept getting all sorts of false positives and false negatives, which we ultimately figured out were due to a) sequencing error and b) large regions of identical protein sequence shared between different proteins in the database. In fact, (b) was far more concerning to us than (a) because it meant that we were doing better or worse depending on the quality of annotation for an organism's genome. This was an analytical bias we didn't want in our research.

The solution

After lots of back-and-forth between the two of us, including discussion, arguments, coding, refactoring, and existential crises (on my part), we ended up putting together an algorithm that I'm very happy with. This is not the most efficient or accurate version that is theoretically possible, but it's the best one that we were able to put together after trying lots and lots of variations. The approach is illustrated here:

 Illustration of the FAMLI algorithm

Illustration of the FAMLI algorithm

The steps are generally as follows:

  1. Align every WGS read against a database of protein references (e.g. UniRef90)
  2. Identify the protein references with highly uneven coverage (standard deviation / mean > 1) and filter them out
  3. Iteratively assign each read to a single protein reference, that which is the most likely given the complete set of reads in the sample
  4. Filter the final set of protein references as in (2), using the deduplicated alignments from (3).

We did some work to test how this approach compares to other potential options, and we think it does quite well. You can read the paper for more details on that.

The software

We would like for this approach to be generally available for use by the research community. You can read the source code on GitHub, you can download a Docker image from Quay, and you can install directly from PyPI. Of the three, I suggest you just use the Docker image to minimize any confusion or unpredictability.

Thinking broadly about functional metagenomics

As we worked on benchmarking this algorithm we felt that it was necessary to compare it with other existing methods that are used by the community, but I didn't want to give the impression that this algorithm is intended to replace or supplant any previously existing tools. That's because the goal of this analysis is subtly different than that of other methods. To date, functional metagenomics has generally had the goal of identifying the metabolic pathways encoded by a community (e.g. HUMAnN2) or identifying the complete gene catalog that can be assembled from a community (e.g. de novo assembly). I think that both of those methods work very well for those useful purposes. In contrast, FAMLI attempts to quantify the set of protein-coding genes from a closed reference database (e.g. UniRef90), whether or not they have any annotated metabolic function. In fact, one of the more interesting possibilities would be to incorporate this open source algorithm into those existing tools, using it to optimize any alignments that are created as part of a larger pipeline. If you are a tool maker and are interested in working on that together, please be in touch.

Final note

Working on this project has been an extremely challenging and rewarding process, full of false starts, confusion, and discovery. I cannot be more thankful to Dr. Jonathan Golob for his equal contribution to this effort, and am also thankful for any readers who have been interested enough by our work to read this far. I hope that you find some utility from our efforts.