Notes on identifying Co-Abundant Genes

One of the challenges in the field of microbial metagenomics (part of microbiome research) is going from the "bag-of-genes" stage to something more closely resembling biological reality. One of the general approaches that I have always found very compelling is to identify sets of genes that occur at a similar abundance across multiple samples, so-called Co-Abundant Genes (CAGs). In this post I will quickly describe why I like CAGs, and then talk about how I've been approaching the computational challenges associated with identifying them.

Why I like CAGs

The idea of CAGs is very appealing to me because it sits nicely between the level of a single gene and that of a complete genome. It is useful to track collections of genes because of the modular structure of bacterial genomes. As bacteria evolve they can frequently gain or lose large regions containing multiple genes, which may be due to a number of different biological mechanisms including plasmids, phage, transposons, or even just homologous recombination. Therefore it is useful to identify those cases when groups of genes are always observed together at a similar abundance, because it suggests that those genes are almost always found on the same chromosome and can be grouped together as a biologically meaningful entity.

Tackling the challenges of CAGs

The challenges of building CAGs are entirely computational, and are due to the extremely large number of genes that can be observed during metagenomic experiments.

The operation of finding CAGs can be broken into the following steps:

  1. Calculate a co-abundance metric for each pair of genes
  2. Identify those groups of genes which have a high degree of co-occurrence
  3. Validate that those CAGs are biologically meaningful

All of those steps are totally reasonable and work cleanly using off-the-shelf tools in Python or R when you have small numbers of genes (< 1,000). However, once you start having 100,000's and 1,000,000's of genes, the default tools for steps 1 and 2 will either instantly break or take forever. It's not surprising that making a pairwise distance matrix for ~1e6 by ~1e6 would break your computer -- it's got ~1e12 cells of data!

Here is how I have been tackling these challenges:

  1. Throw away most of the distance metrics. Instead of calculating all of the pairwise distances in a single matrix, iterate over all of the combinations and only store those values which are below some threshold.
  2. Use as many processors as possible. Distribute the process over many cores and run this on as big of a machine as you can find.
  3. Use iterators instead of making lists. This is more of a Python-specific note, but instead of iterating over a list of all of the pairwise combinations, make a function that yields each successive combination so that you never have to load the NxN list into memory.
  4. Cluster the quick and dirty way with single-linkage. There are many different clustering approaches that you could use, but single-linkage is incredibly simple and completely avoids the all-by-all comparison. It can be done with a streaming approach, which helps make this even halfway possible.

The last tip here is use other people's code. You're always going to have to write some code from scratch, but try to copy as much as you can from existing (open source) projects. To this end, I have been working on this problem in a public GitHub repository, and I'd be more than happy for anyone to swipe whatever bits of my code might be useful. I don't expect that anyone else has their data in exactly the way I expect, so it might be best to start with the two core functions, one which generates all of the pairwise connections from a Pandas DataFrame, and another which uses single-linkage clustering to identify CAGs from those connections. This code is not published, it's not heavily tested, and it might not even be terribly readable, but it might still be useful if this is a project you find yourself tackling.

Happy hunting.

Tracking outbreaks with whole-genome sequencing

What is "Genomic Epidemiology"?

The last ten years have seen a number of astonishing advances in our ability to track the spread of outbreaks using something called "genomic epidemiology." The basic concept here is that the pathogens that make us sick (bacteria, viruses, fungi, etc.) each have their own genome, which can be used to trace their ancestry in much the same way that we can use our own genomes to trace our own ancestry. 

Just like you can use the human genome to get an idea of how people are related to each other, you can use the microbial genomes to get an idea of how pathogens are related to each other, and therefore how they are spreading between people. 

Making sense of tons of data

Without doing a comprehensive survey, I will just say that a number of very smart and capable people have been working on all of the different steps of the process – isolating the pathogen's genome, sequencing it quickly, and comparing those genome sequences to each other. What I wanted to focus on was the final step: taking all of those genomes and getting an idea of what is actually happening. I was inspired here by the NextStrain project, which takes large numbers of genomes for something like Zika and provides an intuitive means for figuring out what's going on. 

For my own exploration, I used the data published by Roach, et al. (2015) as they sequenced every single isolate that came through an ICU over the course of a year. I should mention that the laboratory protocols necessary to achieve such a feat are very impressive.

Careful analysis of individual outbreaks involved a number of involved steps to reconstruct bacterial genomes and compare isolates at single-nucleotide resolution. Instead, my goal was to see how quickly I could take the raw data and get an idea of whether there were any clusters that would merit further in-depth investigation (were this an actual surveillance project). 

Since all of the raw data was available on SRA, I simply (1) downloaded the raw data with fastq-dump, (2) made MinHash sketches with Finch-rs, (3) calculated pairwise distances between all isolates, (4) performed Ward hierarchical clustering (the neighbor joining tree was too slow), and (5) output a Newick file with the resulting tree structure. I didn't keep track of how long each step took, but start to finish it all took me about 5 hours to finish (including the time spent reading documentation, as well as raw compute). 

So here's an example of what that dataset looks like, and here is an interactive link

Overall view of 1262 isolates collected over 1 year in an ICU (Roach, et al. 2015)

Overall view of 1262 isolates collected over 1 year in an ICU (Roach, et al. 2015)

Zoomed-in view of a clade including  Acinetobacter  (red),  Haemophillus  (green),  Moraxella  (teal), and  Neisseria  (blue). The rollover text shows metadata for sample 583, which was sampled from the same patient as its closest neighbor in the tree, sample 595.

Zoomed-in view of a clade including Acinetobacter (red), Haemophillus (green), Moraxella (teal), and Neisseria (blue). The rollover text shows metadata for sample 583, which was sampled from the same patient as its closest neighbor in the tree, sample 595.

Disclaimer: None of the clades shown above indicate bona fide outbreak clusters without further quality control, whole genome alignment, and inspection by a qualified professional.

I encourage you to follow the link and explore the tree – it's interesting to see how all of the data comes together. A bunch of different species are included on the same tree, which is one of the nice things about using MinHash sketches to compare genomes (you don't need to align to a common genome). When you roll your mouse over each branch you can see the metadata that the authors published for each isolate.

Take home

So what is the point here? (1) genome sequencing can be used to precisely identify closely-related bacterial isolates, (2) the community (FDA, PHE, CDC, etc.) has taken massive strides to integrate this technology into routine public health surveillance, and (3) relatively simple computational approaches can be used to quickly sift through the data to find groups that merit further inspection.

I also wanted to point out that all of the challenges associated with genome assembly and multiple sequence alignment can be entirely sidestepped by the MinHash approach, which makes it a lot easier to scale these processes up. The analysis methods also don't require a ton of compute – it's actually harder to just keep track of all the data and visualize it in a nice way than it is to compute the pairwise distances. 

So, if you're thinking about processing large collections of isolates to find potential outbreak clusters, ask a bioinformatician if MinHash might be right for you.

 

More Reading:

Application of whole-genome sequencing for bacterial strain typing in molecular epidemiology

Whole Genome Sequencing for the Retrospective Investigation of an Outbreak of Salmonella Typhimurium DT 8

Bacterial genome sequencing in clinical microbiology: a pathogen-oriented review 

Practical Value of Food Pathogen Traceability through Building a Whole-Genome Sequencing Network and Database

Infection control in the new age of genomic epidemiology

Some MinHash implementations:

Disclosure:

I used to work for and have an interest in One Codex, the group which developed the open source implementation of MinHash (finch-rs) that I used in this little exercise.

Functional Analysis of Metagenomes by Likelihood Inference: FAMLI

I've been working for the last six months or so with my colleague Dr. Jonathan Golob on an algorithm to help analyze microbiome samples, and now it's ready to share with the world. 

If you'd like to read a preprint of the paper instead of this blog-based summary, you can do so on bioRxiv.

The problem

As we were analyzing a set of microbiome samples using shotgun genomes sequence data (WGS), we kept running into a tricky problem. Our goal was to identify the set of protein coding genes present in a sample, and so we aligned the WGS reads using DIAMOND against the UniRef90 database and took the best hit(s) for each read. However, we kept getting all sorts of false positives and false negatives, which we ultimately figured out were due to a) sequencing error and b) large regions of identical protein sequence shared between different proteins in the database. In fact, (b) was far more concerning to us than (a) because it meant that we were doing better or worse depending on the quality of annotation for an organism's genome. This was an analytical bias we didn't want in our research.

The solution

After lots of back-and-forth between the two of us, including discussion, arguments, coding, refactoring, and existential crises (on my part), we ended up putting together an algorithm that I'm very happy with. This is not the most efficient or accurate version that is theoretically possible, but it's the best one that we were able to put together after trying lots and lots of variations. The approach is illustrated here:

Illustration of the FAMLI algorithm

Illustration of the FAMLI algorithm

The steps are generally as follows:

  1. Align every WGS read against a database of protein references (e.g. UniRef90)
  2. Identify the protein references with highly uneven coverage (standard deviation / mean > 1) and filter them out
  3. Iteratively assign each read to a single protein reference, that which is the most likely given the complete set of reads in the sample
  4. Filter the final set of protein references as in (2), using the deduplicated alignments from (3).

We did some work to test how this approach compares to other potential options, and we think it does quite well. You can read the paper for more details on that.

The software

We would like for this approach to be generally available for use by the research community. You can read the source code on GitHub, you can download a Docker image from Quay, and you can install directly from PyPI. Of the three, I suggest you just use the Docker image to minimize any confusion or unpredictability.

Thinking broadly about functional metagenomics

As we worked on benchmarking this algorithm we felt that it was necessary to compare it with other existing methods that are used by the community, but I didn't want to give the impression that this algorithm is intended to replace or supplant any previously existing tools. That's because the goal of this analysis is subtly different than that of other methods. To date, functional metagenomics has generally had the goal of identifying the metabolic pathways encoded by a community (e.g. HUMAnN2) or identifying the complete gene catalog that can be assembled from a community (e.g. de novo assembly). I think that both of those methods work very well for those useful purposes. In contrast, FAMLI attempts to quantify the set of protein-coding genes from a closed reference database (e.g. UniRef90), whether or not they have any annotated metabolic function. In fact, one of the more interesting possibilities would be to incorporate this open source algorithm into those existing tools, using it to optimize any alignments that are created as part of a larger pipeline. If you are a tool maker and are interested in working on that together, please be in touch.

Final note

Working on this project has been an extremely challenging and rewarding process, full of false starts, confusion, and discovery. I cannot be more thankful to Dr. Jonathan Golob for his equal contribution to this effort, and am also thankful for any readers who have been interested enough by our work to read this far. I hope that you find some utility from our efforts.

Journal Club: GraftM, a tool for scalable, phylogenetically informed classification of genes within metagenomes

A paper recently caught my eye for two reasons: (1) it implemented an idea that someone smart recently brought up to me and (2) it perfectly illustrates the question of "functional resolution" that comes up in the functional analysis of metagenomes.

The paper is called GraftM: a tool for scalable, phylogenetically informed classification of genes within metagenomes, by Joel A. Boyd, Ben J. Woodcroft and GeneW. Tyson, published in Nucleic Acids Research.

The tool that they present is called GraftM, and it does something very clever (in my opinion). It analyzes DNA sequences from mixtures of microbes (so-called "metagenomes" or microbiome samples) and it identifies the allele of a collection of genes that is present. This is a nice extrapolation of the 16S classification process to include protein-coding sequences, and they even use the software package pplacer, which was developed for this purpose by Erick Matsen (Fred Hutch). Here's their diagram of the process:

Screen Shot 2018-03-28 at 12.36.25 PM.png

They present some nice validation data suggesting that their method works well, which I won't reproduce here except to say that I'm always happy when people present controls. 

Now instead of actually diving into the paper, I'm going to ruminate on the concept that this approach raises for the field of functional metagenomics.

Functional Analysis

When we study a collection of microbes, the "functional" approach to analysis considers the actual physical functions that are predicted to be carried out by those microbes. Practically, this means identifying sets of genes that are annotated with one function or another using a database like KEGG or GO. One way to add a level of specificity to functional analysis is to overlay taxonomic information with these measurements. In other words, not only is Function X encoded by this community, we can say that Function X is encoded by Organism Y in this community (while in another community Function X may be encoded by Organism Z). This combination of functional and taxonomic data is output by the popular software package HUMAnN2, for example. 

Let's take a step back. I originally told you that protein functions were important (e.g. KEGG or GO labels), and now I'm telling you that protein functions ~ organism taxonomy is important. The difference is one of "functional specificity."

Functional Specificity

At the physical level, microbial communities consist of cells with DNA encoding RNA and producing proteins, with those RNA and protein molecules doing some interesting biology. When we summarize those communities according to the taxa (e.g. species) or functions (e.g. KEGG) that are present, we are taking a bird's eye view of the complexity of the sample. Adding the overlay of taxonomy ~ function (e.g. HUMAnN2) increases the specificity of our analysis considerably.

Now take a look at the figure above. You can see that a single KEGG group actually consists of a whole family of protein sequences which have their own phylogenetic relationships and evolutionary history. This tool (GraftM) presents a means of increasing the level of functional resolution to differentiate between different protein sequences (alleles) that are contained within a single functional grouping (e.g. KEGG label). However, it should be noted that this approach is much more computationally costly than something like HUMAnN2, and is best run on only a handful of protein families at a time. 

Summary

Lastly, let me say that there is no reason why a higher or lower level of functional specificity is better or worse. There are advantages at both ends of the spectrum, for computational and analytical reasons that would be tiresome to go into here. Rather, the level of functional specificity should be appropriate to the biological effect that you are studying. Of course, it's impossible to know exactly what will be the best approach a priori, which is why it's great to have a whole collection of tools at your disposal to test your hypothesis in different ways. 

Detecting viruses from short-read metagenomic data

My very first foray into the microbiome was in graduate school when my advisor (Dr. Rick Bushman) suggested that I try to use a high-throughput genome sequencing instrument (a 454 pyrosequencer) to characterize the viruses present in the healthy human gut. That project ended up delving deeply into the molecular evolution of bacteriophage populations, and gave me a deep appreciation for the unpredictable complexity of viral genomes. 

While I wasn't the first to publish in this area, the general approach has become more popular over the last decade and a number of different groups have put their own spin on it. 

Because of the diversity of viruses, you can always customize and enhance different aspects of the process, from sample collection, purification, and DNA isolation to metagenomic sequencing, bioinformatic analysis, and visualization. It's been very interesting to see how sophisiticated some of these methods have become.

On the bioinformatic analysis side of things, I ended up having the most success in those days by assembling each viral genome from scratch and measuring the evolution of that genome over time. More of the bespoke approach to bioinformatics, rather than the assembly line. 

In contrast, these days I am much more interested in computational approaches that allow me to analyze very large numbers of samples with a minimum of human input required. For that reason I don't find de novo assembly to be all that appealing. It's not that the computation can't be done, it's more than I have a hard time imagining how to wrap my brain around the results in a productive way. 

In contrast, one approach that I have been quite happy with is also much more simple minded. Instead of trying to assemble all of the new genomes in a sample, it's much easier to simply align your short reads against a reference database of viral genomes. One of the drawbacks is that you can only detect a virus that has been detected before. On the other hand, one of the advantages is also that you can only detect a virus that has been detected before, meaning that all samples can be rapidly and intuitively compared against each other. 

To account for the rapid evolution of viral genomes, I think it's best to do this alignment in protein space against the set of proteins contained in every viral genome. This makes the alignments a bit more robust. 

If you would like to perform this simple read alignment method for viral detection, you can use the code that I've put together at this GitHub repo. There is also a Quay repository hosting a Docker image that you can use to run this analysis without having to install any dependencies. 

This codebase is under active development (version 0.1 at time of writing) so please be sure to test carefully against your controls to make sure everything is behaving well. At some point I may end up publishing more about this method, but it may be just too simple to entice any journal. 

Lastly, I want to point out that straightforward alignment of reads does not account for any number of confounding factors, including but not limited to:

  1. presence of human or host DNA
  2. shared genome segments between extant viruses
  3. novel viral genomes
  4. complex viral quasi-species
  5. integration of prophages into host genomes

There are a handful of tools out there that do try to deal with some of those problems in different ways, quite likely to good effect. However, it's good to remember that with every additional optimization you add a potential confounding factor. For example, it sounds like a good idea to remove human sequences from your sample, but that runs the risk of also eliminating viral sequences that happen to be found with the human genome, such as lab contaminants or integrated viral genome fragments. There are even a few human genes with deep homology to existing viral genes, thought to be due to an ancient integration and subsequent repurposing. All I mean to say here is that sometimes it's better to remove the assumptions from your code, and instead include a good set of controls to your experiment that can be used to robustly eliminate signal coming from, for example, the human genome. 

Journal Club: Discovering new antibiotics with SLAY

This paper is a bit of a departure for me, but even though it's not a microbiome paper it's still one of the most surprising and wonderful papers that I've seen in the last year, so bear with me.

We're looking at Tucker, et al. "Discovery of Next-Generation Antimicrobials through Bacterial Self-Screening of Surface- Displayed Peptide Libraries" Cell 172:3, 2018. (http://www.cell.com/cell/fulltext/S0092-8674(17)31451-4).

When I first learned about this project, I said, "That's a fun idea, but it won't possibly work," to which the PI responded, "We've already done it." 

The extremely cool idea here was to rapidly discover new antibiotics by quickly screening an immense library of novel peptides. The diagram below lays it out: they created a library of peptides, expressed those peptides on the surface of E. coli, and then used genome sequencing to figure out which of those peptides were killing the bacteria. The overall method is called SLAY, which exceedingly clever.

fx1.jpg

While this idea seems simple, there are more than a few parts that I thought would be impossibly difficult. They include (a) building a sufficiently large library of peptides, (b) expressing those peptides on the surface of the bacteria and not inside the cell, and (c) making sure that the peptides were only killing the cells they were tethered to and not any neighbors. 

I won't go through the entire paper, but I will say that the authors ended up doing quite a bit of work to convince the readers that they actually discovered new antimicrobial peptides, and that they weren't observing some artifact. At the end of the day it seems pretty irrefutable to me that they were able to use this entirely novel approach in order to identify a few new antibiotic candidates, which typically takes hundreds of millions of dollars and decades of work. 

In short, it looks like smart people are doing good work, even outside of the microbiome field. I'll definitely be keeping an eye on these authors to see what they come up with next!

Journal Club: Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation

Amid the holiday rush last month, I was gratified to see a publication describing a new computational method that had been on my mind. 

The paper is "Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation", published in Nature Biotechnology on December 11, 2017. 

The Concept

Bacteria do something funny to their DNA that you may not have heard of. It's called covalent DNA modification, and it often consists of adding a small methyl group to the DNA molecule itself. This added methyl group (or hydroxyl, or hydroxymethyl, etc.) doesn't interfere with the normal operation of DNA (making RNA, and more DNA). Instead it serves as a marker that differentiates self-DNA from invading DNA (such as from a phage, plasmid, etc.).

For example, one species may methylate the motif ACCGAG (at the bolded base), while another might methylate CTGCAG. If the first species encounters a ACCGAG motif that lacks the appropriate methyl, it treats it as invading DNA and chews it up.

The ongoing arms race of mobile DNA elements has helped maintain a diversity of DNA modifications, such that many different species of bacteria have a unique profile. 

The interesting trick here is that we now have a way to read out the methylation patterns in DNA in addition to the sequence. This is most notably available via PacBio sequencing, which typically will generate a smaller number of much longer sequence fragments than other genome sequencing methods. Bringing it all together, the authors of this paper were able to use the methylation patterns from PacBio sequencing to much more accurately reconstruct the microbes present in a set of environmental samples (such as the human microbiome). 

The Data

The approach of the authors was to first assemble the PacBio sequences, and then bin those larger genome fragments together based on a common epigenetic signature. 

Figure 2c shows the binning of genome fragments in a sample containing a known mixture of bacteria.

Figure 2c shows the binning of genome fragments in a sample containing a known mixture of bacteria.

Figure 2e shows the binning of genome fragments in a gut microbiome sample containing a mixture of unknown bacteria (and viruses, fungi, etc.).

Figure 2e shows the binning of genome fragments in a gut microbiome sample containing a mixture of unknown bacteria (and viruses, fungi, etc.).

In general, this seems like a very interesting approach. After my read of the paper, it appears that the bacteria present in the microbiome contain a distinct enough set of epigenetic patterns to enable the deconvolution of many different species and strains. I look forward to seeing how this method stacks up against other binning approaches in the future. 

Final note

For those of you interested in phage and mobile genetic elements, I wanted to point out that the authors also explore a topic that has been studied somewhat by others – the linkage of phage to their hosts via epigenetic signatures. The idea here is that it can be computationally difficult to match a phage genome or plasmid with its host. One experimental method that accomplishes this is Hi-C, which takes advantage of the physical proximity of the host genome within an intact cell. In contrast, the PacBio method does not require intact cells, and can be used to link phage or plasmids with their host based on a shared epigenetic signature. 

My hope is that this type of data starts to become more widely available. There are clearly a number of computational tools that need to be refined in order to get full use of all this information, but it does seem to hold a good deal of promise. 

Journal Club: A role for bacterial urease in gut dysbiosis and Crohn’s disease

It's nice when you see a new paper describing the type of science that you find the most exciting and relevant, and doubly so when it's written by a group of people that you know and respect. Full disclosure: this paper is from the group that I did my graduate work with at the University of Pennsylvania, and there's quite a lot to talk about. 

Ni, J. et al. (2017) ‘A role for bacterial urease in gut dysbiosis and Crohn’s disease’, Science Translational Medicine, 9(416), p. eaah6888. doi: 10.1126/scitranslmed.aah6888.

Building a model for microbiome research

Before talking about what the authors did, I want to describe how they did it. They started (in a previous study) with a cohort of patients being treated for a common disease with no known cure (Crohn's) and looked for differences in the genomic content of their microbiome compared to healthy control subjects. That initial analysis indicated that one particular metabolic pathway was enriched in patients with the disease. In this study they followed up on that finding by analyzing the metabolites produced by those microbes. The combination of those two orthogonal analyses (genomic DNA and fecal metabolites) provided some suggestion that microbes in the gut were producing one particular chemical which may have a role in disease. To test that hypothesis they moved into a mouse model where they could test that hypothesis with controlled experiments, not only adding and removing microbes but also tracking the passage of metabolites via isotopic labelling. That model system provided crucial data that supported the findings from the human clinical samples – that a specific bacterial enzyme may have a role in human disease. After this study, I can only imagine that the next step would be to move back into humans and see if that target can be used to generate a useful therapeutic. 

The microbiome field is relatively new, and I believe that its first batch of groundbreaking therapeutics is just over the horizon. I wouldn't be surprised if the most influential studies in this field follow a trajectory that is similar to what I describe above: generating hypotheses in humans, testing them in animal models, and moving back to human to test and deliver therapeutics. 

Ok, enough predictions, let's get into the paper.

The Microbiome – Who's Who and What's What

Fig. 2. Associations between bacterial taxa abundance ascertained by fecal shotgun metagenomic sequencing and the fecal metabolome in healthy pediatric subjects and those with Crohn’s disease.

Fig. 2. Associations between bacterial taxa abundance ascertained by fecal shotgun metagenomic sequencing and the fecal metabolome in healthy pediatric subjects and those with Crohn’s disease.

The figure above shows the association between Who's in the microbiome (bacterial genera, vertical axis) and What's (metabolites, horizontal axis) in the microbiome. The point here is that certain microbes are associated with certain metabolites, with a big focus on the amino acids that are being produced. The prior set of experiments suggested that the microbes in Crohn's patients had an increased capacity for producing amino acids, while this figure (and Figure 1) goes further to show that there is a subset of microbes associated with higher actual levels of amino acids in those subjects.

Tracking metabolism in the microbiome

Here's a deceptively simple figure describing a powerful finding. 

Fig. 3. In vivo heavy isotope assays using 15N-labeled urea to determine the effect of bacterial urease on nitrogen flux in the murine gut microbiota.

Fig. 3. In vivo heavy isotope assays using 15N-labeled urea to determine the effect of bacterial urease on nitrogen flux in the murine gut microbiota.

 

This experiment uses radiolabelled urea to measure the production of lysine by microbes in the gut. The central question here is what is producing the extra amino acids in Crohn's disease? In this experiment the authors added radiolabelled urea so that they could track how much lysine was being produced from that urea. Crucially, they found that adding either antibiotics or a defined set of microbes (called "ASF") reduced the amount of lysine that was produced from that urea, which supports the hypothesis that microbes in the gut are directly metabolizing urea. 

Tying it all together

Fig. 6. Effect of E. coli urease on colitis in a T cell adoptive transfer mouse model of colitis

Fig. 6. Effect of E. coli urease on colitis in a T cell adoptive transfer mouse model of colitis

I've skipped over a lot of interesting control experiments so that I could get to the grand finale. Everything I've told you up until now has established that (a) Crohn's patients have higher levels of certain amino acids in their stool, (b) those high amino acid levels are associated with particular bacteria, and (c) bacteria in the gut are able to produce amino acids from urea. To bring it all together the authors went to a mouse model of colitis to see whether adding or removing a single gene would have an effect on disease. They found that E. coli with urease (an enzyme that metabolizes urea) caused significantly more disease than E. coli without urease. This brings it all back to the action of a single gene on a model of human disease, which is the classic goal of reductionist molecular biology, but the gene of interest is encoded by the human microbiome. 

I think that's pretty cool. 

From here it's easy to imagine that people might be interested in designing a drug that targets microbial urease in order to reduce human disease, although that's got to be a pretty difficult task and I have no idea how diverse bacterial urease enzymes are. 

Bringing it back to my first point, this seems like the best case example of how to advance our understanding of the human microbiome with the goal of treating human disease. I hope to see many more like it in the years to come.

Making sparse DataFrames efficiently

Here's a little story about a computational challenge that I came across recently, and how a little bit of performance profiling went a long way to improve my analysis.

Background

I've been working with a large amount of data recently, and my principal bottleneck hasn't been the raw sequence analysis, but rather analyzing the results across large numbers of samples. For context, I'm working with thousands of samples, each of which contains 10,000 - 10,000,000 observations. The raw data was terabytes of raw genomic sequence data, and so I'm definitely working with a more compact data structure, but it's still quite a bit of data to work with in memory. 

There have been a few different iterations for this workflow, starting by working in memory (started swapping), then moving to a PostgreSQL database (which was too slow with read/write), and now working in memory on a larger machine using sparse matrices (following a helpful suggestion from Jonathan Golob).

What I've really liked about these sparse matrices is that it's incredibly fast to slice the data in different ways, which is what I need to be able to figure out both (a) how similar samples are to each other and (b) how different observations are distributed across samples. In other words, I need to be able to retrieve the data quickly in a column-wise or row-wise manner and sparse matrices really help with this.  Also, my underlying data is very sparse, which is to say that a vast majority of the cells in my table are 0 or N/A. Lastly, they're supported in Pandas, which is really great. 

The Challenge

While it's really quick to query these sparse matrices, I've been finding it difficult to make them. This morning I did a bit of performance profiling to compare a few different ways of constructing these tables and I found that not all construction methods are created equal. You can find my profiling code here, if you're interested (this was run in Pandas 0.20.3). 

The data that I'm working with is a set of observations that are originally constructed in longitudinal format, with every observation as an item in a list that includes the row name, column name, and cell value. What I wanted to see was what the most efficient method would be to make this into a sparse table in wide format, which could be indexed and sliced easily by row or column. 

Options

  1. Sparse DataFrame from nested dictionaries: Make a nested dictionary with the first level of keys as column labels and the second level of keys as row labels, and then convert directly to the SparseDataFrame.

  2. Column-wise addition to sparse DataFrame: Make an empty SparseDataFrame and then sequentially add in the data from each column. 

  3. Dense longitudinal, pivot, then make sparse: Make a dense DataFrame from the longitudinal data, pivot it in dense format, and then make it into a SparseDataFrame.

  4. Sparse longitudinal DataFrame, then pivot: Make a SparseDataFrame in longitudinal format and then pivot it in to wide format. 

Results

I tested each of the above options using four different matrices of varying sizes:

Matrix: 60 rows, 67 cols, 2.463% full

1. Testing sparse DataFrame from nested dictionaries
29.6 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Testing columnwise addition to sparse DataFrame
41.5 ms ± 813 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
33.1 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
19 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Matrix: 100 rows, 100 cols, 9.54% full

1. Testing sparse DataFrame from nested dictionaries
61.2 ms ± 8.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Testing columnwise addition to sparse DataFrame
97.7 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
57.3 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
101 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 632 cols, 1.574% full

1. Testing sparse DataFrame from nested dictionaries
1.29 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. Testing columnwise addition to sparse DataFrame
1.41 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
1.19 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
94 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 1000 cols, 9.526% full

1. Testing sparse DataFrame from nested dictionaries
2.82 s ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. Testing columnwise addition to sparse DataFrame
3.15 s ± 81.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
2.8 s ± 72.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
826 ms ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary

In the results above you can see that in general the fastest method was #4 – making the DataFrame sparse in longitudinal format, and then pivoting it into wide matrix format. However, I need to stress that the relative performance of each of these methods varied widely depending on how sparse the matrix was. In general, option #4 worked better with sparser datasets, but I haven't fully tested the range of parameter space varying the number of rows, columns, and values. 

A larger point that I could also make here is that picking the right data structure for your project is actually a pretty tricky problem. I found here that the advantages of sparse matrices was immense relative to relational databases, but that they come with an up-front cost of matrix construction. I'm a biologist, not a computational scientist, and so the only way that I find my way through is by talking to smart people and incorporating their suggestions wherever appropriate. If you come up against a similar problem dealing with largish data, just take a deep breath, ask for help, and then write some tests. 

Journal Club: Strains, functions and dynamics in the expanded Human Microbiome Project

I must have been really busy these last few weeks to have gone so long without posting about this paper. For anyone interested in the microbiome this is a hugely important paper to read (link). If you want a more comprehensive summary, you can find a few in the popular press

At the risk of rambling on and on, I want to talk about a few things from this paper that really caught my eye. Let's start with Figure 1a.

Figure 1a:&nbsp;Personalization, niche association, and reference genome coverage in strain-level metagenomic profiles. a, Mean phylogenetic divergences17 between strains of species with sufficient coverage at each targeted body site (minimum 2 strain pairs)

Figure 1a: Personalization, niche association, and reference genome coverage in strain-level metagenomic profiles. a, Mean phylogenetic divergences17 between strains of species with sufficient coverage at each targeted body site (minimum 2 strain pairs)

The statistic being plotted is the "mean distance of strains," meaning that they computed the exact genome sequence of each of the strains of all of the dominant organisms in every sample(!) and then calculated how different the strains within each species were between different samples. That process is (in my opinion) a difficult task to pull off well, and I find myself in the position once again of being very much in awe of the Huttenhower group for their skill and hard work. Ok, now how about the biology? This figure tells us that people harbor strains of microbes in their microbiome that are distinct from other people's strains, and that those strains stick around over time to some degree. Not only that, but the degree to which those strains stick around varies by body site, with the stool (and gut) likely having the most persistent set of strains. 

Ok, so now we've covered the fact that people have different strains of the same microbial species in their microbiomes, so let's go into a bit more depth with more of figure 1.

Figure 1, continued.&nbsp;b, Individuals tended to retain personalized strains, as visualized by a principal coordinates analysis (PCoA) plot for Actinomyces sp. oral taxon 448, in which lines connect samples from the same individual.&nbsp;d, PCoA showing niche association of Haemophilus parainfluenzae, showing subspecies specialization to three different body sites. e, PCoA for Eubacterium siraeum.&nbsp;

Figure 1, continued. b, Individuals tended to retain personalized strains, as visualized by a principal coordinates analysis (PCoA) plot for Actinomyces sp. oral taxon 448, in which lines connect samples from the same individual. d, PCoA showing niche association of Haemophilus parainfluenzae, showing subspecies specialization to three different body sites. e, PCoA for Eubacterium siraeum. 

Let's break this out:

  • Figure 1b: The exact strain of (one of the) bacteria in your dental plaque sticks around from day to day, even though people are (presumably) brushing their teeth!
  • Figure 1d: A bacterial species that is found all over the body, H. parainfluenzae, is genetically distinct (to some degree) by body site. Most intriguingly, it is only partially distinct by body site, which raises all kinds of questions about its evolutionary history.
  • Figure 1e: An organism that we call a single species (E. siraeum) seems to form three completely distinct genetic groupings. Note that the horizontal axis accounts for ~50% of the total genetic variation. That's huge. Is it a single species? Is it three? What is a "species"? Does it matter?

Reflections:

I don't want to try to sum up this paper with a single take-home message. I think this is a paper to read and reread and think about. However there are a few aspects of the methods used that I want to point out for those who may not think about this type of analysis very often. The first is that the authors defined a single strain for each sample (using StrainPhlAn). Do we think that there is only one strain of each species present at a single time? How could we test that hypothesis or even deal with a sample containing multiple, closely related strains? The next is that their most in-depth characterization of strain differences hinged on comparing the samples to known reference genomes. What about the variation that has never been captured in a reference genome? How would we even approach that data?

Lastly I'll say that I really think this work is important because I think that strain level variability in the microbiome is a crucial factor in human health and disease. This paper provides strong evidence that strain level variation is extensive, and the authors have provided powerful tools for characterizing that variation. The next question is, how do we apply this type of data in a way that uncovers the biological mechanisms underlying human health? In other words, how do we use microbiome profiling to generate some experimentally testable hypotheses? It seems so clear to me that this avenue of research is going to uncover important biological mechanisms of the human microbiome, but I can also see that we're going to need some creative, collaborative problem solving in order to realize this system's full potential. 

 

UPDATE:

On October 30, 2017, the Pollard Lab posted a preprint in which they specifically tackle the challenge of analyzing multiple strains per species in a given sample. You can read the preprint here. There's a lot of detail there that really deserves its own blog post, so stay tuned.