Shiny Microbiome Analysis

Are you a microbiome researcher? Did you do a microbiome experiment? Do you want a quick, first-pass analysis to find out what happened? This post is for you.

When I talk to my collaborators, I find that people often reach a point where they have a 16S dataset and just want a quick look at what happened. The goal is not to generate publication-quality figures or to take the place of a real-life statistician, but just to make a rough pass over the data. To help make this possible, I worked with a talented student named Will Frohlich and made a small Shiny app for this first-pass microbiome analysis.

Where To Find It

You can access the app at The code for the app, example data, and documentation can be found at

What It Does

The shinyMicrobiome app only does a few basic things:

  • Plot stacked bar graphs showing relative abundance of taxa over samples

  • Plot the number of reads per sample, or group of samples

  • Plot the estimated total number of taxa (using breakaway)

  • Calculate and plot differential abundance by sample group (using corncob)

Disclaimers: Note that the differential abundance calculation is a very naive implementation of a sophisticated tool (corncob), and you will almost certainly get a more accurate answer by running corncob yourself and selecting parameters as appropriate for your study design. Also note that there is no False Discovery Rate correction in the app.
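Since the app leaves multiple-testing correction to you, here is a minimal sketch of the Benjamini-Hochberg procedure in plain Python. The function name and the example p-values are made up for illustration; this is not code from the app or from corncob, just one common way to adjust a list of p-values after the fact.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    n = len(p_values)
    # Indices of the p-values, ordered from the largest p-value to the smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # rank 1 is the smallest p-value
        candidate = p_values[index] * n / rank
        # Enforce monotonicity: an adjusted value never exceeds
        # the adjusted value of any larger p-value.
        running_min = min(running_min, candidate)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values, one per taxon tested.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Taxa whose adjusted value falls below your chosen threshold (0.05 is a common choice) are the ones worth a closer look.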

What You Need

The full description of input data for the app can be found in the GitHub repo, but the short description is that you will need:

  • A metadata sheet (in CSV format) describing what groups (treatment/control, etc) each sample is in. You must have the first column labeled “name” with the sample name, and then you can have as many additional columns as you like.

  • A taxon table (in CSV format) with the number of 16S reads assigned to each taxon for each sample. Each taxon is a row and the first column must be named “tax_name” with the name of the taxon. Each sample is a column, and the name of those columns must match the sample names in the metadata sheet.

Example data with this format can be found in the GitHub repo.
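If you want to check your own files before uploading them, something like the following may help. It is only a sketch using Python's standard library: the function name and the inline example data are hypothetical, and the only rules it encodes are the ones described above (a first column called "name" in the metadata, a first column called "tax_name" in the taxon table, and matching sample names), plus the assumption that read counts are non-negative integers.

```python
import csv
import io


def check_inputs(metadata_csv, taxon_csv):
    """Return a list of problems found in the two CSV inputs (empty if none)."""
    meta_rows = [row for row in csv.reader(metadata_csv) if row]
    taxon_rows = [row for row in csv.reader(taxon_csv) if row]
    if not meta_rows or not taxon_rows:
        return ["one of the files is empty"]

    problems = []
    # The metadata sheet must start with a column labeled "name".
    if meta_rows[0][0] != "name":
        problems.append('metadata: first column must be "name"')
    # The taxon table must start with a column labeled "tax_name".
    if taxon_rows[0][0] != "tax_name":
        problems.append('taxon table: first column must be "tax_name"')

    # Sample names: first column of the metadata, header row of the taxon table.
    meta_samples = {row[0] for row in meta_rows[1:]}
    table_samples = set(taxon_rows[0][1:])
    for sample in sorted(table_samples - meta_samples):
        problems.append(f"sample {sample!r} is in the taxon table but not the metadata")
    for sample in sorted(meta_samples - table_samples):
        problems.append(f"sample {sample!r} is in the metadata but not the taxon table")

    # Assumption: every cell in the taxon table is a non-negative integer count.
    for row in taxon_rows[1:]:
        for value in row[1:]:
            if not value.strip().isdigit():
                problems.append(f"taxon {row[0]!r}: non-integer count {value!r}")
                break
    return problems


# Hypothetical example data, standing in for files opened with open(path, newline="").
metadata = io.StringIO("name,group\nS1,control\nS2,treatment\n")
taxa = io.StringIO("tax_name,S1,S2\nBacteroides,120,30\nPrevotella,5,88\n")
print(check_inputs(metadata, taxa))
```

An empty list means the two files agree with each other under these rules; it does not guarantee that the app will accept them.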

Need Help With Raw 16S Data?

Do you have raw 16S data which you would like to transform into these read-count taxon tables? The MaLiAmPi pipeline created by Jonathan Golob is a great way to process 16S data and make these sorts of tables. If you do use MaLiAmPi, the wide-form tables it produces (found in classify/tables/tallies_wide.genus.csv) are properly formatted for use as taxon tables by this app.

Having Problems With The App?

This app is very much a work-in-progress. Please don’t hesitate to reach out if you have any problems, or just file an issue if you think there are some bugs which might be impacting other people. However, there are many things that are really never going to work well for a simple app like this – axis labels will be misplaced, legends may overlap the plot area, statistical tests won’t be exactly suited for your experiment, etc. As long as you approach this app with those expectations, you may end up finding it to be useful.

3D structures of gut bacteria and the human immune system

When I talk to people about my work I sometimes get the question, “Do you really think that the microbiome has a direct effect on human health?” It’s a completely understandable question – the study-of-the-week which makes it into the news cycle tends to just confirm what we already know about the importance of diet and exercise. Then I come across these beautiful papers that show just how intimately connected we are with our gut bacteria. Here’s a good example, and it even comes with a video.

Ladinsky, M.S., et al. Endocytosis of commensal antigens by intestinal epithelial cells regulates mucosal T cell homeostasis. Science. 363(6431). DOI: 10.1126/science.aat4042.

There are some beautiful illustrations and graphics in this paper which I won’t reproduce here, but which I hope you can access from whichever side of the paywall you are on.

Background: Researchers are continuing to find evidence that the type of bacteria in your gut (if you are a mouse or a human) influences the type of your immune response. If you don’t study the immune system, just remember that the immune system responds in different ways to different kinds of pathogens – viruses are different from bacteria, which are different from parasites, etc. Mounting the correct type of response is essential, and it seems that which bacteria you have in your gut has some influence over the nature of those responses.

The Gist: This study focused on the how of the question, the specific molecular mechanism which would explain this observed relationship between bacteria and the immune system. They used one particular type of bacteria (segmented filamentous bacteria, or “SFB”) and showed that these bacteria get so close to human cells that bacterial proteins are actually taken up and can be found inside the human cells. In addition, this movement of bacterial proteins inside human cells causes a shift in the type of response mounted by the immune system.

What Caught My Eye: This paper has a video showing a protrusion of a bacterial cell pushing deep into a human cell, complete with a 3D reconstruction of the physical structure using electron tomography. If you can follow the link above and make it to the video, I highly recommend taking a look.

The biggest story for me in the microbiome these days is that there are a number of great researchers who are starting to figure out some of the specific molecular mechanisms by which the microbiome may influence human health. This makes me more and more optimistic and excited that we will see a day where microbiome-based therapeutics make it into the clinic, which could have a profound impact on a broad range of diseases, from inflammatory bowel disease to colorectal cancer and auto-inflammatory disease. It is exciting to be a part of this effort and try to help as we bring that day closer.

Molecules Mediating Microbial Manipulation of Mouse (and Human) Maladies

Sometime in the last ten years I gave up on the idea of truly keeping up with the microbiome field. In graduate school it was more reasonable because I had the luxury of focusing on viruses in the microbiome, but since then my interests have broadened and the size of the field has continued to expand. These days I try to focus on the subset of papers which are telling the story of either gene-level metagenomics, or the specific metabolites which mediate the biological effect of the microbiome on human health. The other day I happened across a paper which did both, and so I thought it might be worth describing it quickly here.

Brown, EM, et al. Bacteroides-Derived Sphingolipids Are Critical for Maintaining Intestinal Homeostasis and Symbiosis. Cell Host & Microbe 2019 25(5).

As a human, my interest is drawn by stories that confirm my general beliefs about the world, and do so with new specific evidence. Of course this is just confirmation bias, but it’s also an accurate description of why this paper caught my eye.

The larger narrative that I see this paper falling into is the one which says that microbes influence human health largely because they produce a set of specific molecules which interact with human cells. By extension, if you happen to have a set of microbes which cannot produce a certain molecule, then your health will be changed in some way. This narrative is attractive because it implies that if we understand which microbes are making which metabolites (molecules), and how those metabolites act on us, then we can design a therapeutic to improve human health.

Motivating This Study

Jumping into this paper, the authors describe a recently emerging literature (which I was unaware of) on how bacterially-produced sphingolipids have been predicted to influence intestinal inflammation like IBD. Very generally, sphingolipids are a diverse class of molecules that can be found in bacterial cell membranes, but which also can be produced by other organisms, and which also can have a signaling effect on human cells. The gist of the prior evidence going into this paper is that

  • people with IBD have lower levels of different sphingolipids in their stool, and

  • genomic analysis of the microbiome of people with IBD predicts that their bacteria are making fewer sphingolipids

Of course, those observations don’t go very far on their own, mostly because there are a ton of things that are different in the microbiome of people with IBD, and so it’s hard to point to any one bacterium or molecule from the bunch and say that it is having a causal role, and isn’t just a knock-on effect from some other cause.

The Big Deal Here

The hypothesis in this study is that one particular type of bacteria, Bacteroides, is producing sphingolipids which reduce inflammation in the host. The experimental system they used was mice that were born completely germ-free, and which were subsequently colonized with strains of Bacteroides that either did or did not have the genes required to make some particular types of sphingolipids. The really cool thing here was that they were able to knock out the gene for sphingolipid production in one specific species of Bacteroides, and so they could see what the effect was of that particular set of genes, while keeping everything else constant. They found a pretty striking result, which is that inflammation was much lower in the mice which were colonized with the strain which was able to make the sphingolipid.


To me, narrowing down the biological effect in an experiment to the difference of a single gene is hugely motivating, and really makes me think that this could plausibly have a role in the overall phenomenon of microbiome-associated inflammation.

The authors rightly point out that sphingolipids might not actually be the molecular messenger having an impact on host physiology — there are a lot of other things different in the sphingolipid-deficient bacteria used here, including carbohydrate metabolism and membrane composition, but it’s certainly a good place to keep looking.

Of course the authors did a bunch of other work in this paper to demonstrate that the experimental system was doing what they said, and they also went on to re-analyze the metabolites from human stool and identify specific sphingolipids that may be produced by these Bacteroides species, but I hope that my short summary gives you an idea of what they are getting at.

All About Those Genes

I think it can be difficult for non-microbiologists to appreciate just how much genetic diversity there is among bacteria. Strains which seem quite similar can have vastly different sets of genes (encoding, for example, a giant harpoon used to kill neighboring cells), and strains which seem quite different may in fact be sharing genes through exotic forms of horizontal gene transfer. With all of this complexity, I find it very comforting when scientists are able to conduct experiments which identify specific molecules and specific genes within the microbiome which have an impact on human health. I think we are moving closer to a world where we are able to use our knowledge of the microbiome to improve human health, and I think studies like this are bringing us closer.

Working with Nextflow at Fred Hutch

I’ve been putting in a bit of work recently trying to make it easier for other researchers at Fred Hutch to use Nextflow as a way to run their bioinformatics workflows, while also getting the benefits of cloud computing and Docker-based computational reproducibility.

You can see some slides describing some of that content here, including a description of the motivation for using workflow managers, as well as a more detailed walk-through of using Nextflow right here at Fred Hutch.

Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.
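To make the terms concrete, here is a toy sketch of average linkage clustering on cosine distance, written in plain Python. To be clear about what this is not: it is the naive, exhaustive all-by-all version (the one that cannot scale to millions of genes), not the Approximate Nearest Neighbor approach described above. The function names, the gene names, the abundance values, and the distance threshold are all made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity between two abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero gene as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage_cags(abundances, max_distance):
    """Greedy agglomerative clustering of genes by co-abundance.

    abundances: dict of gene name -> list of abundances (one per sample).
    The closest pair of clusters is merged repeatedly, stopping when the
    smallest average pairwise cosine distance exceeds max_distance.
    """
    genes = sorted(abundances)
    # All-by-all distances: this is the step that explodes for millions of genes.
    dist = {
        (g, h): cosine_distance(abundances[g], abundances[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }

    def linkage(c1, c2):
        # Average of all gene-to-gene distances between the two clusters.
        total = sum(dist[(min(g, h), max(g, h))] for g in c1 for h in c2)
        return total / (len(c1) * len(c2))

    clusters = [[g] for g in genes]
    while len(clusters) > 1:
        distance, i, j = min(
            ((linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if distance > max_distance:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical abundances of four genes across four samples.
toy = {
    "geneA": [10, 0, 5, 0],
    "geneB": [20, 0, 10, 0],
    "geneC": [0, 7, 0, 3],
    "geneD": [0, 14, 1, 6],
}
print(average_linkage_cags(toy, max_distance=0.1))
```

In this toy example geneA and geneB rise and fall together across samples, as do geneC and geneD, so they end up in two separate groups. Each group plays the role of a CAG.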

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.
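The logic of a discovery / validation design can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the function name, the 0.05 threshold, the requirement that the direction of the association match, and the example numbers. The actual statistical model and criteria are described in the preprint and are not reproduced here.

```python
def replicated_cags(discovery, validation, alpha=0.05):
    """CAGs significant in the discovery cohort that replicate in validation.

    discovery, validation: dict of CAG ID -> (coefficient, p_value),
    estimated independently in each cohort.
    """
    replicated = []
    for cag, (coef, p) in sorted(discovery.items()):
        # Only CAGs that pass in the discovery cohort are carried forward.
        if p >= alpha or cag not in validation:
            continue
        val_coef, val_p = validation[cag]
        same_direction = (coef > 0) == (val_coef > 0)
        if val_p < alpha and same_direction:
            replicated.append(cag)
    return replicated


# Hypothetical results from two independent cohorts.
discovery = {"CAG1": (1.2, 0.001), "CAG2": (-0.8, 0.01), "CAG3": (0.5, 0.3), "CAG4": (0.9, 0.02)}
validation = {"CAG1": (0.9, 0.004), "CAG2": (0.7, 0.02), "CAG3": (0.6, 0.01), "CAG4": (1.1, 0.2)}
print(replicated_cags(discovery, validation))
```

In this made-up example only CAG1 survives: CAG2 flips direction, CAG3 was never significant in discovery, and CAG4 fails in validation.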

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

  1. Start with a survey of people with and without a disease,

  2. Collect WGS data from microbiome samples,

  3. Find microbial CAGs that are associated with disease, and then

  4. Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.
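The last step, turning a list of disease-associated genes into a prioritized list of isolates, might look something like the sketch below. It assumes the alignment has already been done and summarized as a set of gene IDs per isolate. The function name, isolate names, gene IDs, and coefficients are all hypothetical.

```python
def rank_isolates(isolate_genes, gene_coefficients):
    """Rank isolates by how many disease-associated genes each one carries.

    isolate_genes: dict of isolate name -> set of gene IDs found in its genome.
    gene_coefficients: dict of gene ID -> estimated association with disease
        (positive = more abundant in disease, negative = less abundant).
    Returns (isolate, n_positive, n_negative) tuples, with the isolates
    carrying the most associated genes first.
    """
    rows = []
    for isolate, genes in isolate_genes.items():
        n_pos = sum(1 for g in genes if gene_coefficients.get(g, 0) > 0)
        n_neg = sum(1 for g in genes if gene_coefficients.get(g, 0) < 0)
        rows.append((isolate, n_pos, n_neg))
    # Most associated genes first; ties are broken alphabetically by isolate name.
    rows.sort(key=lambda row: (-(row[1] + row[2]), row[0]))
    return rows


# Hypothetical isolates and gene-level associations.
isolates = {
    "isolate_A": {"g1", "g2", "g3"},
    "isolate_B": {"g3"},
    "isolate_C": {"g4"},
}
coefficients = {"g1": 0.5, "g2": 0.2, "g3": -0.4}
for row in rank_isolates(isolates, coefficients):
    print(row)
```

Keeping the positive and negative counts separate matters, because a strain enriched for negatively associated genes is a very different kind of candidate than one enriched for positively associated genes. And as noted above, neither count says anything about causality.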

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Bioinformatics: Reproducibility, Portability, Transparency, and Technical Debt

I’ve been thinking a lot about what people are talking about when they talk about reproducibility. It has been helpful to start to break apart the terminology in order to distinguish between some conceptually distinct, albeit highly intertwined, concepts.

Bioinformatics: Strictly speaking, analysis of data for the purpose of biological research. In practice, the analysis of large files (GBs) with a series of compiled programs, each of which may have a different set of environmental dependencies and computational resource requirements.

Reproducibility: An overarching concept describing how easily a bioinformatic analysis performed at one time may be able to be executed a second time, potentially by a different person, at a different institution, or on a different set of input data. There is also a strict usage of the term which describes the computational property of an analysis in which the analysis of an identical set of inputs will always produce an identical set of outputs. These two meanings are related, but not identical. Bioinformaticians tend to accept a lack of strict reproducibility (e.g., the order of alignment results may not be consistent when multithreading), but very clearly want to have general reproducibility in which the biological conclusions drawn from an analysis will always be the same from identical inputs.
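One way to see the difference between the two meanings is to compare checksums of an output file across two runs. The sketch below uses Python's standard hashlib module; the function names and the tab-separated "alignment" lines are made up, and ignoring line order is only a crude stand-in for "the biological conclusions are the same".

```python
import hashlib


def strict_digest(lines):
    """Digest that changes if the content or the order of the lines changes."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8") + b"\n")
    return h.hexdigest()


def order_insensitive_digest(lines):
    """Digest that ignores line order but still depends on every line's content."""
    return strict_digest(sorted(lines))


# Hypothetical output of the same analysis run twice with multithreading:
# identical records, written in a different order.
run_1 = ["read1\tchr1\t100", "read2\tchr2\t200"]
run_2 = ["read2\tchr2\t200", "read1\tchr1\t100"]

print(strict_digest(run_1) == strict_digest(run_2))
print(order_insensitive_digest(run_1) == order_insensitive_digest(run_2))
```

The first comparison fails and the second passes, which is roughly the situation most of us are willing to live with.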

Portability: The ability of researchers at different institutions (or in different labs) to execute the same analysis. This aspect of reproducibility is useful to consider because it highlights the difficulties that are encountered when you move between computational environments. Each set of dependencies, environmental variables, file systems, permissions, hardware, etc., is typically quite different and can cause endless headaches. Some people point to Docker as a primary solution to this problem, but it is typical for Docker to be prohibited on HPCs because it requires root access. Operationally, the problem of portability is a huge one for bioinformaticians who are asked by their collaborators to execute analyses developed by other groups, and the reason why we sometimes start to feel like UNIX gurus more than anything else.

Transparency: The ability of researchers to inspect and understand what analyses are being performed. This is more of a global problem in concept than in practice: people like to talk about how they mistrust black box analyses, but I don’t know anybody who has read through the code for BWA searching for potential bugs. At the local level, I think that the transparency people actually need is at the level of the pipeline or workflow. We want to know which individual tools are being invoked, and with what parameters, even if we aren’t qualified (speaking for myself) to debug any Java or C code.

Technical Debt: The amount of work required to mitigate any of the challenges mentioned above. This is the world that we live in which nobody talks about. With infinite time and effort it is possible to implement almost any workflow on almost any infrastructure, but the real question is how much effort it will take. It is important to recognize when you are incurring technical debt that will have to be paid back by yourself or others in the field. My rule of thumb is to think about, for any analysis, how easily I will be able to re-run all of the analyses from scratch when reviewers ask what would be different if we changed a single parameter. If it’s difficult in the slightest for me to do this, it will be almost impossible for others to reproduce my analysis.

Final Thoughts

I’ve been spending a lot of time recently on workflow managers, and I have found that there are quite a number of systems which provide strict computational reproducibility with a high degree of transparency. The point where they fall down, through no fault of their own, is the ease with which they can be implemented on different computational infrastructures. It is just a complete mess to be able to run an analysis in the exact same way in a diverse set of environments, and it requires that the development teams for those tools devote time and energy to account for all of those eventualities. In a world where very little funding goes to bioinformatics infrastructure, reproducibility will always be a challenge, but I am hopeful that things are getting better every day.

Massive unexplored genetic diversity of the human microbiome

When you analyze extremely large datasets, you tend to be guided by your intuition or predictions on how those datasets are composed, or how they will behave. Having studied the microbiome for a while, I would say that my primary rule of thumb for what to expect from any new sample is tons of novel diversity. This week saw the publication of another great paper showing just how true this is.

Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle


The Approach

If you are new to the microbiome, you may be interested to know that there are basically two approaches to figuring out what microbes (bacteria, viruses, etc.) are in a given sample (e.g. stool). You can either (1) compare all of the DNA in that sample to a reference database of microbial genomes, or (2) try to reassemble the genomes in each sample directly from the DNA.

The thesis of this paper is one that I strongly support: reference databases contain very little of the total genomic content of microbes out there in the world. By extension, they predict that (1) would perform poorly, while (2) will generate a much better representation of what microbes are present.

Testing this idea, the authors analyzed an immense amount of microbiome data (almost 10,000 biological samples!), performing the relatively computationally intensive task of reconstructing genomes (so-called _de novo_ assembly).

The Results

The authors found a lot of things, but the big message is that they were able to reconstruct a *ton* of new genomes from these samples — organisms that had never been sequenced before, and many that don’t really resemble any phyla that we know of. In other words, they found a lot more novel genomic content than even I expected, and I was sure that they would find a lot.


There’s a lot more content here for microbial genome aficionados, so feel free to dig in on your own (yum yum).

Take Home

When you think about what microbes are present in the microbiome, remember that there are many new microbes that we’ve never seen before. Some of those are new strains of clearly recognizable species (e.g. E. coli with a dozen new genes), but some will be novel organisms that have never been cultured or sequenced by any lab.

If you’re a scientist, keep that in mind when you are working in this area. If you’re a human, take hope and be encouraged by the fact that there is still a massive undiscovered universe within us, full of potential and amazing new things waiting to be discovered.

Quick note on workflow managers

After having written a pretty negative assessment of the state of the field for workflow managers (those pieces of software which make it easier to run multiple other pieces of software in a controlled, coordinated manner), I’ve been feeling like I needed to put out an update. The field has changed a lot in the last few months, and I’d like to be less out of date.

A Few Good Options

It turns out that there are a few good options out there: workflow managers that don’t take too long to figure out how to use, which have some cloud computing support, and which have developing communities of users. The two best options I’ve seen so far are Cromwell and Nextflow. Nextflow is pretty popular in Europe and Cromwell is being adopted by the Broad, so they both are reasonable options to try out. I’ve been able to get them both up and running without too much work, but there are some inherent challenges with any workflow manager that I think will always present some stumbling blocks.

Issue 1 — Where do you execute your command?

Fundamentally, a workflow manager executes a set of commands, each of which consumes and produces files. However, the operation of executing a command is completely different depending on whether you’re trying to run it on your laptop, your local HPC, the Google Cloud, AWS, or Azure. Each of those execution options comes with its own idiosyncratic settings for permissions, authentication, formatting, etc. A big part of getting up and running with any workflow manager is getting all of those settings configured in just the right way. It’s not glamorous, but it’s important and it takes time.

Issue 2 — When do you execute your command?

A good workflow manager only executes commands when it’s appropriate — when the inputs are available and the outputs haven’t been produced yet. Doing this properly means that you can restart and rerun workflows without duplicating effort, but that also requires that you can keep track of what commands have been run before. This can also require a bit of effort to configure. As an aside, the traditional training path for bioinformatics folks is to start with BASH scripting, where you run a command when the output files don’t already exist. This is not the method that provides the most reproducible results, and it is also not the method used by Nextflow or Cromwell. I believe that this is the Snakemake model, but I have less experience there. Lots of complexity is hidden inside this issue.
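To illustrate why the "run it if the output is missing" rule falls short, here is a small Python sketch contrasting it with a rule keyed on a hash of the command and its inputs. This is not a description of how Nextflow, Cromwell, or Snakemake are implemented; the function names, the file names, and the command string are hypothetical, and the command is never actually executed.

```python
import hashlib
import os
import tempfile


def should_run_by_output(output_paths):
    """BASH-style rule: run only if some expected output file is missing."""
    return not all(os.path.exists(path) for path in output_paths)


def task_key(command, input_paths):
    """Hash of the command text plus the path and contents of every input file."""
    h = hashlib.sha256(command.encode("utf-8"))
    for path in sorted(input_paths):
        h.update(path.encode("utf-8"))
        with open(path, "rb") as handle:
            h.update(handle.read())
    return h.hexdigest()


def should_run_by_key(command, input_paths, completed_keys):
    """Run unless this exact command has already completed on these exact inputs."""
    return task_key(command, input_paths) not in completed_keys


with tempfile.TemporaryDirectory() as workdir:
    reads = os.path.join(workdir, "reads.fastq")
    hits = os.path.join(workdir, "hits.txt")
    with open(reads, "w") as handle:
        handle.write("@r1\nACGT\n+\nFFFF\n")
    with open(hits, "w") as handle:
        handle.write("r1\tchr1\t100\n")  # pretend the task already ran once

    command = "align reads.fastq > hits.txt"  # hypothetical command
    completed = {task_key(command, [reads])}

    # The input file changes after the first run.
    with open(reads, "a") as handle:
        handle.write("@r2\nTTTT\n+\nFFFF\n")

    rerun_by_output = should_run_by_output([hits])
    rerun_by_key = should_run_by_key(command, [reads], completed)
    print("output-based rule says rerun:", rerun_by_output)
    print("hash-based rule says rerun:", rerun_by_key)
```

The output-based rule sees that hits.txt exists and skips the task, even though the reads have changed. The hash-based rule notices the change and asks for a rerun, at the cost of having to store and look up the set of completed keys somewhere.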

Issue 3 — Where does your data live?

One of the big attractions of a good workflow manager is being able to run the exact same analysis on my laptop, an HPC, or the cloud. However, you really need to have the data live next to the execution environment: it would be insane to download and upload files from my laptop for every single task that’s executed in the cloud. This means that a big part of getting up and running with a cloud-based workflow manager is getting all of your data organized and accessible in the same system that you want to run the tasks in. This takes time and means that you really have to commit to a model for execution.

Wrapping Up

While this post is pretty meandering and vague, all I mean to add here is that the area of workflow managers is expanding rapidly and lots of good people are doing great development. That said, the endeavor is fundamentally challenging and it will require a good amount of time to configure everything and get up and running. I encourage you to try out the options that exist and share your experiences with the world. This is the way of the future, and it would be great if we built a better future together.