Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.
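To make the clustering target concrete, here is a minimal sketch of exhaustive average linkage clustering on cosine distance, using SciPy on a toy table of six genes. This is the brute-force all-by-all version, not the Approximate Nearest Neighbor approach described above, and the function name, the toy abundances, and the 0.2 distance threshold are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_genes(abundance, max_cosine_distance=0.2):
    """Group genes (rows) with similar abundance profiles across samples (columns)."""
    # All-by-all comparison: n * (n - 1) / 2 distances, which is what
    # becomes infeasible once n reaches the millions.
    distances = np.clip(pdist(abundance, metric="cosine"), 0.0, None)
    tree = linkage(distances, method="average")
    # Cut the tree so that clusters merge only below the distance threshold.
    labels = fcluster(tree, t=max_cosine_distance, criterion="distance")
    return labels, len(distances)


# Toy data: two hypothetical genomes, each contributing three genes whose
# abundances rise and fall together across five samples.
genome_a = np.array([1.0, 5.0, 0.5, 3.0, 0.1])
genome_b = np.array([4.0, 0.2, 6.0, 0.3, 2.0])
abundance = np.vstack([
    genome_a * 1.0, genome_a * 2.0, genome_a * 0.5,
    genome_b * 1.0, genome_b * 3.0, genome_b * 0.7,
])

cags, n_comparisons = cluster_genes(abundance)
print(cags, n_comparisons)
```

Cosine distance ignores the overall scale of each gene, so genes that track each other at different copy numbers still land in the same CAG. The number of pairwise distances grows quadratically with the number of genes, which is why this direct version only works for small inputs.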

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.
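The logic of that two-cohort design can be sketched in a few lines. The example below is an illustration only: it uses Welch's t-test as a stand-in for whatever statistical model a real analysis would use, the abundance values are made up, and the function names are hypothetical. A real analysis would also correct for multiple testing across all CAGs.

```python
import numpy as np
from scipy import stats


def cag_associations(abundance, is_case):
    """Welch's t-test per CAG (row), comparing cases against controls (columns)."""
    is_case = np.asarray(is_case, dtype=bool)
    result = stats.ttest_ind(
        abundance[:, is_case], abundance[:, ~is_case], axis=1, equal_var=False
    )
    return result.statistic, result.pvalue


def replicated_cags(discovery, validation, is_case_disc, is_case_val, alpha=0.05):
    """Flag CAGs that are significant in both cohorts with the same direction."""
    t_disc, p_disc = cag_associations(discovery, is_case_disc)
    t_val, p_val = cag_associations(validation, is_case_val)
    same_direction = np.sign(t_disc) == np.sign(t_val)
    return (p_disc < alpha) & (p_val < alpha) & same_direction


# Made-up abundances: 3 CAGs x 12 people per cohort (6 controls, then 6 cases).
is_case = np.array([False] * 6 + [True] * 6)
discovery = np.array([
    [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 2.0, 2.3, 1.9, 2.2, 2.1, 1.8],  # higher in cases
    [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0],  # no difference
    [3.0, 3.2, 2.9, 3.1, 3.0, 2.8, 1.0, 1.2, 0.9, 1.1, 1.0, 0.8],  # lower in cases
])
validation = np.array([
    [0.5, 0.7, 0.6, 0.4, 0.5, 0.6, 1.5, 1.6, 1.4, 1.7, 1.5, 1.3],  # higher in cases
    [0.5, 0.7, 0.6, 0.4, 0.5, 0.6, 0.6, 0.5, 0.4, 0.7, 0.6, 0.5],  # no difference
    [2.0, 2.2, 1.9, 2.1, 2.0, 1.8, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0],  # no difference
])

replicated = replicated_cags(discovery, validation, is_case, is_case)
print(replicated)
```

Only the first toy CAG is flagged: the second shows no difference in either cohort, and the third is associated in the discovery cohort but fails to replicate.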

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.


Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.


Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

  1. Start with a survey of people with and without a disease,

  2. Collect WGS data from microbiome samples,

  3. Find microbial CAGs that are associated with disease, and then

  4. Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.
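The last step of that list, going from disease-associated genes to a prioritized list of isolates, can be sketched as a simple tally. The sketch below assumes the alignment of genes against isolate genomes has already been done and summarized as a set of gene names per isolate. All of the gene names, isolate names, coefficients, and the scoring rule are hypothetical.

```python
def rank_isolates(isolate_genes, gene_coefficients):
    """Tally disease-associated genes per isolate and rank the isolates.

    isolate_genes: dict of isolate name -> set of genes found in its genome.
    gene_coefficients: dict of gene -> estimated association with disease
        (positive = more abundant in disease, negative = less abundant).
    """
    rows = []
    for isolate, genes in isolate_genes.items():
        hits = [gene_coefficients[g] for g in genes if g in gene_coefficients]
        n_positive = sum(c > 0 for c in hits)
        n_negative = sum(c < 0 for c in hits)
        rows.append((isolate, n_positive, n_negative))
    # Assumed scoring rule: isolates enriched for genes that are lower in
    # disease come first, as candidates for preclinical testing.
    return sorted(rows, key=lambda row: row[2] - row[1], reverse=True)


# Hypothetical inputs for illustration only.
gene_coefficients = {
    "geneA": 0.8, "geneB": 0.6,                 # higher in disease
    "geneC": -0.7, "geneD": -0.5, "geneE": -0.9,  # lower in disease
}
isolate_genes = {
    "isolate_1": {"geneA", "geneB", "geneX"},
    "isolate_2": {"geneC", "geneD", "geneE"},
    "isolate_3": {"geneA", "geneC"},
}

ranking = rank_isolates(isolate_genes, gene_coefficients)
for isolate, n_positive, n_negative in ranking:
    print(isolate, n_positive, n_negative)
```

As noted above, a high rank here says nothing about causality. It only tells you which isolates are worth pulling out of the freezer first.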

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Massive unexplored genetic diversity of the human microbiome

When you analyze extremely large datasets, you tend to be guided by your intuition or predictions on how those datasets are composed, or how they will behave. Having studied the microbiome for a while, I would say that my primary rule of thumb for what to expect from any new sample is tons of novel diversity. This week saw the publication of another great paper showing just how true this is.

Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle


The Approach

If you are new to the microbiome, you may be interested to know that there are basically two approaches to figuring out what microbes (bacteria, viruses, etc.) are in a given sample (e.g. stool). You can either (1) compare all of the DNA in that sample to a reference database of microbial genomes, or (2) try to reassemble the genomes in each sample directly from the DNA.

The thesis of this paper is one that I strongly support: reference databases contain very little of the total genomic content of microbes out there in the world. By extension, the authors predict that (1) will perform poorly, while (2) will generate a much better representation of what microbes are present.

Testing this idea, the authors analyzed an immense amount of microbiome data (almost 10,000 biological samples!), performing the relatively computationally intensive task of reconstructing genomes (so-called _de novo_ assembly).

The Results

The authors found a lot of things, but the big message is that they were able to reconstruct a *ton* of new genomes from these samples — organisms that had never been sequenced before, and many that don’t really resemble any phyla that we know of. In other words, they found a lot more novel genomic content than even I expected, and I was sure that they would find a lot.


There’s a lot more content here for microbial genome aficionados, so feel free to dig in on your own (yum yum).

Take Home

When you think about what microbes are present in the microbiome, remember that there are many new microbes that we’ve never seen before. Some of those are new strains of clearly recognizable species (e.g. E. coli with a dozen new genes), but some will be novel organisms that have never been cultured or sequenced by any lab.

If you’re a scientist, keep that in mind when you are working in this area. If you’re a human, take hope and be encouraged by the fact that there is still a massive undiscovered universe within us, full of potential and amazing new things waiting to be discovered.

The Blessing and the Curse of Dimensionality

A paper recently caught my eye, and I think it is a great excuse to talk about data scale and dimensionality.

Vatanen, T. et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature 562, 589–594 (2018).

In addition to having a great acronym for a study of child development, they also sequenced 10,913 metagenomes from 783 children.

This is a ton of data.

If you haven’t worked with a “metagenome,” it’s usually about 10-20 million short words, each corresponding to 100-300 bases of a microbial genome. It’s a text file with some combination of ATCG written out over tens of millions of lines, with each line being a few hundred letters long. A single metagenome is big. It won’t open in Word. Now imagine you have 10,000 of them. Now imagine you have to make sense out of 10,000 of them.
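As a back-of-the-envelope check on those numbers, the arithmetic below multiplies out the ranges quoted above (10-20 million reads of 100-300 bases each) for a single metagenome and for all 10,913 samples. The function name is just for illustration.

```python
def metagenome_bases(n_reads, read_length):
    """Total number of sequenced bases in one metagenome."""
    return n_reads * read_length


n_samples = 10_913  # metagenomes in the TEDDY study

low = metagenome_bases(10_000_000, 100)   # small end of the quoted range
high = metagenome_bases(20_000_000, 300)  # large end of the quoted range

print(f"one metagenome: {low:.1e} to {high:.1e} bases")
print(f"whole study:    {low * n_samples:.1e} to {high * n_samples:.1e} bases")
```

That works out to somewhere between one and six billion letters per sample, and tens of trillions of letters for the study as a whole.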

Now, I’m being a bit extreme – there are some ways to deal with the data. However, I would argue that it’s this problem, how to deal with the data, that we could use some help with.

Taxonomic classification

The most effective way to deal with the data is to take each metagenome and figure out which organisms are present. This process is called “taxonomic classification” and it’s something that people have gotten pretty good at recently. You take all of those short ATCG words, you match them against all of the genomes you know about, and you use that information to make some educated guesses about which collection of organisms is present. This is a biologically meaningful reduction in the data that results in hundreds or thousands of observations per sample. You can also validate these methods by processing “mock communities” and seeing if you get the right answer. I’m a fan.
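Here is a toy sketch of the matching idea: break each reference genome into short substrings (k-mers), then assign each read to the reference that shares the most k-mers with it. Real classifiers are far more sophisticated than this, and the sequences, species names, and choice of k=8 are all made up for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """All distinct substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def classify_reads(reads, references, k=8):
    """Count reads per reference, assigning each read by shared k-mers."""
    # Index every k-mer in every reference genome.
    index = {}
    for name, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(name)

    counts = Counter()
    for read in reads:
        votes = Counter()
        for kmer in kmers(read, k):
            for name in index.get(kmer, ()):
                votes[name] += 1
        if votes:
            # Ties are broken arbitrarily in this toy version.
            best, _ = votes.most_common(1)[0]
            counts[best] += 1
        else:
            counts["unclassified"] += 1
    return counts


# Made-up reference "genomes" and reads.
references = {
    "species_A": "ATGGCGTACGTTAGCCTAGGATCCGATTACA",
    "species_B": "TTGACCGGTAAACCCGGGTTTAAAGGCCTTA",
}
reads = [
    references["species_A"][0:12],
    references["species_A"][10:25],
    references["species_B"][5:17],
    "GGGGGGGGGGGG",  # matches nothing in the reference set
]

counts = classify_reads(reads, references)
print(dict(counts))
```

The last read is the interesting one: anything that is missing from the reference database simply ends up unclassified, no matter how abundant it is in the sample.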

With taxonomic classification you end up with thousands of observations (in this case organisms) across X samples. In the TEDDY study they had >10,000 samples, and so this dataset has a lot of statistical power (where you generally want more samples than observations).

Metabolic reconstruction

The other main way that people analyze metagenomes these days is by quantifying the abundance of each biochemical pathway present in the sample. I won’t talk about this here because my opinions are controversial and it’s best left for another post.

Gene-level analysis

I spend most of my time these days on “gene-level analysis.” This type of analysis tries to quantify every gene present in every genome in every sample. The motivation here is that sometimes genes move horizontally between species, and sometimes different strains within the same species will have different collections of genes. So, if you want to find something that you can’t find with taxonomic analysis, maybe gene-level analysis will pick it up. However, that’s an entirely different can of worms. Let’s open it up.

Every microbial genome contains roughly 1,000 genes. Every metagenome contains a few hundred genomes. So every metagenome contains hundreds of thousands of genes. When you look across a few hundred samples you might find a few million unique genes. When you look across 10,000 samples I can only guess that you’d find tens of millions of unique genes.

Now the dimensionality of the data is all lopsided. We have tens of millions of genes, which are observed across tens of thousands of samples. A biostatistician would tell us that this is seriously underpowered for making sense of the biology. Basically, this is an approach that just doesn’t work for studies with 10,000 samples, which I find to be pretty daunting.
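To put rough numbers on that lopsidedness, the sketch below compares the number of features per sample for taxonomic and gene-level analysis. The specific values chosen for "a few hundred", "thousands", and "tens of millions" are assumptions, and the Bonferroni threshold is included only as one familiar way to see how the testing burden grows with the number of features.

```python
genes_per_genome = 1_000
genomes_per_metagenome = 300  # assumed value for "a few hundred"
genes_per_metagenome = genes_per_genome * genomes_per_metagenome

n_samples = 10_000
n_taxa = 2_000         # assumed value for "thousands" of organisms
n_genes = 20_000_000   # assumed value for "tens of millions" of genes

# Features per sample: below 1 means more samples than observations.
taxa_per_sample = n_taxa / n_samples
genes_per_sample = n_genes / n_samples

# Per-test significance threshold if every feature is tested at alpha = 0.05.
alpha = 0.05
bonferroni_taxa = alpha / n_taxa
bonferroni_genes = alpha / n_genes

print(f"genes per metagenome: {genes_per_metagenome:,}")
print(f"features per sample: taxa={taxa_per_sample}, genes={genes_per_sample}")
print(f"per-test threshold:  taxa={bonferroni_taxa:.1e}, genes={bonferroni_genes:.1e}")
```

Under these assumptions the taxonomic table has five samples for every feature, while the gene-level table has two thousand features for every sample.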

Dealing with scale

The way that we find success in science is that we take information that a human cannot comprehend, and we transform it into something that a human can comprehend. We cannot look at a text file with ten million lines and understand anything about that sample, but we can transform it into a list of organisms with names that we can Google. I’m spending a lot of my time trying to do the same thing with gene-level metagenomic analysis, trying to transform it into something that a human can comprehend. This all falls into the category of “dimensionality reduction,” trying to reduce the number of observations per sample, while still retaining the biological information we care about. I’ll tell you that this problem is really hard and I’m not sure I have the single best true angle on it. I would absolutely love to have more eyes on the problem.

It increasingly seems like the world is driven by people who try to make sense of large amounts of data, and I would humbly ask for anyone who cares about this to try to think about metagenomic analysis. The data is massive, and we have a hard time figuring out how to make sense of it. We have a lot of good starts to this, and there are a lot of good people working in this area (too many to list), but I think we could always use more help.

The authors of the paper who analyzed 10,000 metagenomes learned a lot about how the microbiome develops during early childhood, but I’m sure that there is even more we can learn from this data. And I am also sure that we are getting close to a world where we have 10X the data per sample, and experiments with 10X the samples. That is a world that I think we are ill-prepared for, and I’m excited to try to build the tools that we will need for it.