Publication

Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

  1. Start with a survey of people with and without a disease,

  2. Collect WGS data from microbiome samples,

  3. Find microbial CAGs that are associated with disease, and then

  4. Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Functional Analysis of Metagenomes by Likelihood Inference: FAMLI

I've been working for the last six months or so with my colleague Dr. Jonathan Golob on an algorithm to help analyze microbiome samples, and now it's ready to share with the world. 

If you'd like to read a preprint of the paper instead of this blog-based summary, you can do so on bioRxiv.

The problem

As we were analyzing a set of microbiome samples using shotgun genomes sequence data (WGS), we kept running into a tricky problem. Our goal was to identify the set of protein coding genes present in a sample, and so we aligned the WGS reads using DIAMOND against the UniRef90 database and took the best hit(s) for each read. However, we kept getting all sorts of false positives and false negatives, which we ultimately figured out were due to a) sequencing error and b) large regions of identical protein sequence shared between different proteins in the database. In fact, (b) was far more concerning to us than (a) because it meant that we were doing better or worse depending on the quality of annotation for an organism's genome. This was an analytical bias we didn't want in our research.

The solution

After lots of back-and-forth between the two of us, including discussion, arguments, coding, refactoring, and existential crises (on my part), we ended up putting together an algorithm that I'm very happy with. This is not the most efficient or accurate version that is theoretically possible, but it's the best one that we were able to put together after trying lots and lots of variations. The approach is illustrated here:

Illustration of the FAMLI algorithm

Illustration of the FAMLI algorithm

The steps are generally as follows:

  1. Align every WGS read against a database of protein references (e.g. UniRef90)
  2. Identify the protein references with highly uneven coverage (standard deviation / mean > 1) and filter them out
  3. Iteratively assign each read to a single protein reference, that which is the most likely given the complete set of reads in the sample
  4. Filter the final set of protein references as in (2), using the deduplicated alignments from (3).

We did some work to test how this approach compares to other potential options, and we think it does quite well. You can read the paper for more details on that.

The software

We would like for this approach to be generally available for use by the research community. You can read the source code on GitHub, you can download a Docker image from Quay, and you can install directly from PyPI. Of the three, I suggest you just use the Docker image to minimize any confusion or unpredictability.

Thinking broadly about functional metagenomics

As we worked on benchmarking this algorithm we felt that it was necessary to compare it with other existing methods that are used by the community, but I didn't want to give the impression that this algorithm is intended to replace or supplant any previously existing tools. That's because the goal of this analysis is subtly different than that of other methods. To date, functional metagenomics has generally had the goal of identifying the metabolic pathways encoded by a community (e.g. HUMAnN2) or identifying the complete gene catalog that can be assembled from a community (e.g. de novo assembly). I think that both of those methods work very well for those useful purposes. In contrast, FAMLI attempts to quantify the set of protein-coding genes from a closed reference database (e.g. UniRef90), whether or not they have any annotated metabolic function. In fact, one of the more interesting possibilities would be to incorporate this open source algorithm into those existing tools, using it to optimize any alignments that are created as part of a larger pipeline. If you are a tool maker and are interested in working on that together, please be in touch.

Final note

Working on this project has been an extremely challenging and rewarding process, full of false starts, confusion, and discovery. I cannot be more thankful to Dr. Jonathan Golob for his equal contribution to this effort, and am also thankful for any readers who have been interested enough by our work to read this far. I hope that you find some utility from our efforts.