3D structures of gut bacteria and the human immune system

When I talk to people about my work I sometimes get the question, “Do you really think that the microbiome has a direct effect on human health?” It’s a completely understandable question – the study-of-the-week which makes it into the news cycle tends to just confirm what we already know about the importance of diet and exercise. Then I come across these beautiful papers that show just how intimately connected we are with our gut bacteria. Here’s a good example, and it even comes with a video.

Ladinsky, M.S., et al. Endocytosis of commensal antigens by intestinal epithelial cells regulates mucosal T cell homeostasis. Science. 363(6431). DOI: 10.1126/science.aat4042.

There are some beautiful illustrations and graphics in this paper which I won’t reproduce here, but which I hope you can access from whichever side of the paywall you are on.

Background: Researchers are continuing to find evidence that the type of bacteria in your gut (if you are a mouse or a human) influences the type of your immune response. If you don’t study the immune system, just remember that the immune system responds in different ways to different kinds of pathogens – viruses are different from bacteria, which are different from parasites, etc. Mounting the correct type of response is essential, and it seems that which bacteria you have in your gut has some influence over the nature of those responses.

The Gist: This study focused on the how of the question, the specific molecular mechanism which would explain this observed relationship between bacteria and the immune system. They used one particular type of bacteria (segmented filamentous bacteria, or “SFB”) and showed that this bacteria gets so close to human cells that bacterial proteins are actually taken up and can be found inside the human cells. In addition, this movement of bacterial proteins inside human cells causes a shift in the type of response mounted by the immune system.

What Caught My Eye: This paper has a video showing a protrusion of a bacterial cell pushing deep into a human cell, complete with a 3D reconstruction of the physical structure using electron tomography. If you can follow the link above and make it to the video, I highly recommend taking a look.

The biggest story for me in the microbiome these days is that there are a number of great researchers who are starting to figure out some of the specific molecular mechanisms by which the microbiome may influence human health. This makes me more and more optimistic and excited that we will see a day where microbiome-based therapeutics make it into the clinic, which could have a profound impact on a broad range of diseases, from inflammatory bowel disease to colorectal cancer and auto-inflammatory disease. It is exciting to be a part of this effort and try to help as we bring that day closer.

Molecules Mediating Microbial Manipulation of Mouse (and Human) Maladies

Sometime in the last ten years I gave up on the idea of truly keeping up with the microbiome field. In graduate school it was more reasonable because I had the luxury of focusing on viruses in the microbiome, but since then my interests have broadened and the size of the field has continued to expand. These days I try to focus on the subset of papers which are telling the story of either gene-level metagenomics, or the specific metabolites which mediate the biological effect of the microbiome on human health. The other day I happened across a paper which did both, and so I thought it might be worth describing it quickly here.

Brown, EM, et al. Bacteroides-Derived Sphingolipids Are Critical for Maintaining Intestinal Homeostasis and Symbiosis. Cell Host & Microbe 2019 25(5)

As a human, my interest is drawn by stories that confirm my general beliefs about the world, and do so with new specific evidence. Of course this is confirmation bias at work, but it’s also an accurate description of why this paper caught my eye.

The larger narrative that I see this paper falling into is the one which says that microbes influence human health largely because they produce a set of specific molecules which interact with human cells. By extension, if you happen to have a set of microbes which cannot produce a certain molecule, then your health will be changed in some way. This narrative is attractive because it implies that if we understand which microbes are making which metabolites (molecules), and how those metabolites act on us, then we can design a therapeutic to improve human health.

Motivating This Study

Jumping into this paper, the authors describe a recently emerging literature (which I was unaware of) on how bacterially-produced sphingolipids have been predicted to influence intestinal inflammation like IBD. Very generally, sphingolipids are a diverse class of molecules that can be found in bacterial cell membranes, but which also can be produced by other organisms, and which also can have a signaling effect on human cells. The gist of the prior evidence going into this paper is that

  • people with IBD have lower levels of different sphingolipids in their stool, and

  • genomic analysis of the microbiome of people with IBD predicts that their bacteria are making fewer sphingolipids

Of course, those observations don’t go very far on their own, mostly because there are a ton of things that are different in the microbiome of people with IBD, and so it’s hard to point to any one bacteria or molecule from the bunch and say that it is having a causal role, and isn’t just a knock-on effect from some other cause.

The Big Deal Here

The hypothesis in this study is that one particular type of bacteria, Bacteroides, is producing sphingolipids which reduce inflammation in the host. The experimental system they used was mice that were born completely germ-free, and which were subsequently colonized with strains of Bacteroides that either did or did not have the genes required to make some particular types of sphingolipids. The really cool thing here was that they were able to knock out the gene for sphingolipid production in one specific species of Bacteroides, and so they could see what the effect was of that particular set of genes, while keeping everything else constant. They found a pretty striking result, which is that inflammation was much lower in the mice which were colonized with the strain which was able to make the sphingolipid.


To me, narrowing down the biological effect in an experiment to the difference of a single gene is hugely motivating, and really makes me think that this could plausibly have a role in the overall phenomenon of microbiome-associated inflammation.

The authors rightly point out that sphingolipids might not actually be the molecular messenger having an impact on host physiology — there are a lot of other things different in the sphingolipid-deficient bacteria used here, including carbohydrate metabolism and membrane composition, but it’s certainly a good place to keep looking.

Of course the authors did a bunch of other work in this paper to demonstrate that the experimental system was doing what they said, and they also went on to re-analyze the metabolites from human stool and identify specific sphingolipids that may be produced by these Bacteroides species, but I hope that my short summary gives you an idea of what they are getting at.

All About Those Genes

I think it can be difficult for non-microbiologists to appreciate just how much genetic diversity there is among bacteria. Strains which seem quite similar can have vastly different sets of genes (encoding, for example, a giant harpoon used to kill neighboring cells), and strains which seem quite different may in fact be sharing genes through exotic forms of horizontal gene transfer. With all of this complexity, I find it very comforting when scientists are able to conduct experiments which identify specific molecules and specific genes within the microbiome which have an impact on human health. I think we are moving closer to a world where we are able to use our knowledge of the microbiome to improve human health, and I think studies like this are bringing us closer.

Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.
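
To make the idea of co-abundance clustering concrete, here is a minimal sketch of average linkage clustering on cosine distance. It is deliberately the naive, exhaustive all-by-all version that only works for a handful of genes, not the Approximate Nearest Neighbor approach described above and not the code from the preprint. The function names, the toy abundance table, and the `max_dist` threshold are all assumptions made up for illustration.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an all-zero profile as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage_cags(abundance, max_dist=0.2):
    """Naive agglomerative clustering of genes into CAGs.

    abundance: dict of gene name -> list of abundances (one value per sample).
    max_dist: hypothetical cutoff; merging stops once the closest pair of
              clusters is further apart than this.
    """
    genes = list(abundance)
    # All-by-all distances: fine for a toy example, infeasible for millions of genes.
    dist = {
        frozenset(pair): cosine_distance(abundance[pair[0]], abundance[pair[1]])
        for pair in combinations(genes, 2)
    }
    clusters = [[g] for g in genes]

    def linkage(c1, c2):
        # Average linkage: mean distance over every cross-cluster gene pair.
        total = sum(dist[frozenset((g1, g2))] for g1 in c1 for g2 in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_dist:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Toy data: four genes measured across four samples.
example = {
    "gene_a": [10, 0, 5, 0],
    "gene_b": [20, 0, 10, 0],
    "gene_c": [0, 7, 0, 3],
    "gene_d": [0, 14, 1, 6],
}
cags = average_linkage_cags(example)
```

In this toy table, gene_a and gene_b rise and fall together across samples, as do gene_c and gene_d, so they end up in two separate CAGs. The cost of the all-by-all distance table is exactly what makes this version unusable at the scale of millions of genes.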

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.
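
As a rough sketch of the discovery / validation logic, the snippet below estimates a simple effect for each CAG in two cohorts and keeps only the CAGs whose effect points the same way in both. The difference in mean abundance and the `min_effect` cutoff are stand-ins chosen for brevity, not the statistical model used in the preprint, and the data structure and names are hypothetical.

```python
from statistics import mean


def effect_size(abundances, is_case):
    """Difference in mean abundance between cases and controls (a stand-in statistic)."""
    cases = [a for a, c in zip(abundances, is_case) if c]
    controls = [a for a, c in zip(abundances, is_case) if not c]
    return mean(cases) - mean(controls)


def replicated_cags(discovery, validation, min_effect=1.0):
    """Return CAGs associated with disease in the same direction in both cohorts.

    Each cohort is a dict with:
      "is_case": list of booleans, one per person
      "cags":    dict of CAG name -> list of abundances, one per person
    min_effect is a hypothetical threshold applied in each cohort separately.
    """
    hits = {}
    for name, disc_abund in discovery["cags"].items():
        if name not in validation["cags"]:
            continue
        d = effect_size(disc_abund, discovery["is_case"])
        v = effect_size(validation["cags"][name], validation["is_case"])
        same_direction = d * v > 0
        if same_direction and abs(d) >= min_effect and abs(v) >= min_effect:
            hits[name] = (d, v)
    return hits


# Toy cohorts with made-up numbers.
discovery = {
    "is_case": [True, True, False, False],
    "cags": {"CAG1": [5, 6, 1, 2], "CAG2": [3, 1, 4, 5], "CAG3": [2, 2, 2, 3]},
}
validation = {
    "is_case": [True, False, True, False],
    "cags": {"CAG1": [7, 2, 6, 1], "CAG2": [4, 1, 5, 2], "CAG3": [1, 3, 2, 2]},
}
hits = replicated_cags(discovery, validation)
```

Here CAG1 is higher in cases in both cohorts and is kept, CAG2 flips direction between cohorts and is dropped, and CAG3 is too weak in the discovery cohort to pass the cutoff.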

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

  1. Start with a survey of people with and without a disease,

  2. Collect WGS data from microbiome samples,

  3. Find microbial CAGs that are associated with disease, and then

  4. Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.
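
The last step of that list can be sketched in a few lines. This assumes you already know which genes each isolate genome contains (in practice that would come from aligning the genes against the genomes) and that each disease-associated gene carries a signed coefficient. The names, inputs, and the ranking rule are illustrative assumptions, not the method from the preprint.

```python
def rank_isolates(isolate_genes, gene_coefficients):
    """Rank isolates by the disease-associated genes they carry.

    isolate_genes:     dict of isolate name -> set of gene IDs found in its genome
    gene_coefficients: dict of gene ID -> association coefficient
                       (positive = more abundant in disease, negative = less)
    Returns (isolate, n_positive, n_negative) tuples, with the isolates carrying
    the largest excess of negatively associated genes first.
    """
    rows = []
    for isolate, genes in isolate_genes.items():
        n_pos = sum(1 for g in genes if gene_coefficients.get(g, 0) > 0)
        n_neg = sum(1 for g in genes if gene_coefficients.get(g, 0) < 0)
        rows.append((isolate, n_pos, n_neg))
    # Hypothetical ranking rule: net count of negatively associated genes.
    rows.sort(key=lambda r: r[2] - r[1], reverse=True)
    return rows


# Made-up coefficients and isolate gene content.
coefficients = {"g1": 0.8, "g2": 0.5, "g3": -0.6, "g4": -0.9}
isolates = {
    "isolate_A": {"g1", "g2", "g9"},
    "isolate_B": {"g3", "g4"},
    "isolate_C": {"g1", "g3"},
}
ranked = rank_isolates(isolates, coefficients)
```

As noted above, a ranking like this says nothing about causality. It only turns the association results into an ordered list of isolates to pull out of the freezer first.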

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Massive unexplored genetic diversity of the human microbiome

When you analyze extremely large datasets, you tend to be guided by your intuition or predictions on how those datasets are composed, or how they will behave. Having studied the microbiome for a while, I would say that my primary rule of thumb for what to expect from any new sample is tons of novel diversity. This week saw the publication of another great paper showing just how true this is.

Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle


The Approach

If you are new to the microbiome, you may be interested to know that there are basically two approaches to figuring out what microbes (bacteria, viruses, etc.) are in a given sample (e.g. stool). You can either (1) compare all of the DNA in that sample to a reference database of microbial genomes, or (2) try to reassemble the genomes in each sample directly from the DNA.

The thesis of this paper is one that I strongly support: reference databases contain very little of the total genomic content of microbes out there in the world. By extension, they predict that (1) would perform poorly, while (2) will generate a much better representation of what microbes are present.

Testing this idea, the authors analyzed an immense amount of microbiome data (almost 10,000 biological samples!), performing the relatively computationally intensive task of reconstructing genomes (so-called _de novo_ assembly).

The Results

The authors found a lot of things, but the big message is that they were able to reconstruct a *ton* of new genomes from these samples — organisms that had never been sequenced before, and many that don’t really resemble any phyla that we know of. In other words, they found a lot more novel genomic content than even I expected, and I was sure that they would find a lot.


There’s a lot more content here for microbial genome aficionados, so feel free to dig in on your own (yum yum).

Take Home

When you think about what microbes are present in the microbiome, remember that there are many new microbes that we’ve never seen before. Some of those are new strains of clearly recognizable species (e.g. E. coli with a dozen new genes), but some will be novel organisms that have never been cultured or sequenced by any lab.

If you’re a scientist, keep that in mind when you are working in this area. If you’re a human, take hope and be encouraged by the fact that there is still a massive undiscovered universe within us, full of potential and amazing new things waiting to be discovered.

Niche Theory and the Human Gut Microbiome

Without really having the time to write a full blog post, I want to mention two recent papers that have strongly influenced my understanding of the microbiome.

Niche Theory

The ecological concept of the “niche” is something that is discussed quite often in the field of the microbiome, namely that each bacterial species occupies a niche, and any incoming organism trying to use that same niche will be blocked from establishing itself. The mechanisms and physical factors that cause this “niche exclusion” are probably much more clearly described in the ecological study of plants and animals. In the case of the microbiome, I have often wondered just what utility or value this concept really had.

That all changed a few weeks ago with a pair of papers from the Elinav group.

The Papers

Personalized Gut Mucosal Colonization Resistance to Empiric Probiotics Is Associated with Unique Host and Microbiome Features


Quick, Quick Summary

At the risk of oversimplifying, I’ll try to summarize the two biggest points I took home from these papers.

  1. Lowering the abundance and diversity of bacteria in the gut can increase the probability that a new strain of bacteria (from a probiotic) is able to grow and establish itself

  2. The ability of a new bacteria (from a probiotic) to grow and persist in the gut varies widely on a person-by-person basis

Basically, the authors showed quite convincingly that the “niche exclusion” effect does indeed happen in the microbiome, and that the degree of niche exclusion is highly dependent on what microbes are present, as well as a host of other unknown factors.

So many more questions

Like any good study, this raises more questions than it answers. What genetic factors determine whether a new strain of bacteria can grow in the gut? Is it even possible to design a probiotic that can grow in the gut of any human? Are the rules for “niche exclusion” consistent across bacterial species or varied?

As an aside, these studies demonstrate the consistent observation that probiotics generally don’t stick around after you take them. If you have to take a probiotic every day in order to sustain its effect, it’s not a real probiotic.

I invite you to read over these papers and take what you can from them. If I manage to put together a more lengthy or interesting summary, I’ll make sure to post it at some point.

Microbiome Research: Hope over Hype

The story of microbiome research is one of hope and hype, both elevated to an extreme and fraught with controversy. You need look no further than popular blogs and high profile review articles to see this conflict play out. 

As a passionate microbiome researcher, I like to highlight the hope that I see -- the hope that humans will be able to understand and harness the microbiome to improve human health. 

The story of hope I have for you today is a story of the heart. The human heart. Well, heart disease. Reducing heart disease, really.


When I was in graduate school I saw the most fascinating lecture. It was from Stanley Hazen, a researcher at the Cleveland Clinic, and he was describing an experiment in which it appeared that bacteria in the gut were responsible for converting a normal part of our diet into a molecule that promoted atherosclerosis (heart disease). With a combination of (1) molecular analysis of the blood of humans with heart disease and (2) experiments in mice varying the diet and microbes present in the gut, they showed pretty convincingly that bacteria were converting phosphatidylcholine from food into TMAO, which then promoted heart disease (Wang, et al. 2011 Nature). 


Fast forward 7 years, and the microbiome research field has advanced far enough to identify the exact bacterial genes involved in this process. Not only that, they are able to inhibit those specific bacterial enzymes and show in a mouse model that levels of harmful TMAO are pushed down as a result (Roberts, et al. 2018 Nature).

Fig. 5: A microbial choline TMA lyase inhibitor reverses diet-induced changes in cecal microbial community composition associated with plasma TMAO levels, platelet responsiveness, and in vivo thrombosis potential. Schematic of the relationship between human gut commensal choline TMA lyase activity, TMA and TMAO generation, and enhanced platelet responsiveness and thrombosis risk in the host

Bioinformatics Aside

One aspect of this story that I'll point out for the bioinformatics folks in the audience is that the biological mechanism involved in choline -> TMAO is not a phylogenetically conserved one. It is mediated by a set of genes that are distributed sporadically across the bacterial tree of life. For that reason and others, I am a strong supporter of microbiome analysis tools that enable gene-level comparison between large sets of samples in order to identify mechanisms like these in the future. 


My hope, my dream, is that the entire human microbiome field is able to eventually follow this path. We observe that the microbiome influences some aspect of human health, we identify the biological mechanism responsible for this effect, and then we demonstrate our knowledge and mastery of this biology to such an extent that we can intentionally manipulate this system and eventually improve human health. 

We have a long way to go, but I believe that this is the path that we can follow, the example we can aspire to. I hope that this story gives you hope, and helps cut through the hype. 

Hybrid Approach to Microbiome Research (to Culture, and Not to Culture)

I was rereading a great paper from the Huttenhower group (at Harvard) this week and I was struck by a common theme that it shared with another great paper from the Segre group (at NIH), which I think is a nice little window into how good scientists are approaching the microbiome these days. 

The paper I'm thinking about is Hall, et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients (2017) Genome Medicine. The paper is open access so you can feel free to go read it yourself, but my super short summary is: (1) they analyzed the gut microbiome from patients with (and without) IBD and found that a specific clade of Ruminococcus gnavus was enriched in IBD; and then (2) they took the extra step of growing up those bacteria in the lab and sequencing their genomes in order to figure out which specific genes were enriched in IBD. 

The basic result is fantastically interesting – they found enriched genes that were associated with oxidative stress, adhesion, iron acquisition, and mucus utilization, all of which make sense in terms of IBD – but I mostly want to talk about the way they figured this out. Namely, they took a combined approach of (1) analyzing the total DNA from stool samples with culture-free genome sequencing, and then (2) isolating and growing R. gnavus strains in culture from those same stool samples so that they could analyze their genomes.

Fig. 3:  R. gnavus  metagenomic strain phylogeny. 

Now, if you cast your mind back to the paper on pediatric atopic dermatitis from Drs. Segre and Kong (Byrd, et al. 2017 Science Translational Medicine) you will remember that they took a very similar approach. They did culture-free sequencing of skin samples, while growing Staph strains from those same skin samples in parallel. With the cultures in hand they were able to sequence the genomes of those strains as well as testing for virulence in a mouse model of dermatitis.

So, why do I think this is worth writing a post about? It helps tell the story of how microbiome research has been developing in recent years. At the start, all we could do was describe how different organisms were in higher and lower abundance in different body sites, disease states, etc. Now that the field has progressed, it is becoming clear that the strain-level differences within a given species may be very important to human health and disease. We know that although people may contain a similar set of common bacterial species, the exact strains in their gut (for example) are different between people and usually stick around for months and years. 

With this increased focus on strain-level diversity, we are coming up against the technological challenges of characterizing those differences between people, and how those differences track with health and disease. The two papers I've mentioned here are not the only ones to take this approach (it's also worth mentioning this great paper on urea metabolism in Crohn's disease from UPenn), which is to neatly interweave the complementary sets of information that can be gleaned from culture-free whole-genome shotgun sequencing and culture-based strain isolation. Both of those techniques are difficult and they require extremely different sets of skills, so it's great to see collaborations come together to make these studies possible.

With such a short post, I've surely left out some important details from these papers, but I hope that the general reflection and point about the development of microbiome research has been of interest. It's certainly going to stay on my mind for the years to come.

Functional Analysis of Metagenomes by Likelihood Inference: FAMLI

I've been working for the last six months or so with my colleague Dr. Jonathan Golob on an algorithm to help analyze microbiome samples, and now it's ready to share with the world. 

If you'd like to read a preprint of the paper instead of this blog-based summary, you can do so on bioRxiv.

The problem

As we were analyzing a set of microbiome samples using whole-genome shotgun sequence data (WGS), we kept running into a tricky problem. Our goal was to identify the set of protein coding genes present in a sample, and so we aligned the WGS reads using DIAMOND against the UniRef90 database and took the best hit(s) for each read. However, we kept getting all sorts of false positives and false negatives, which we ultimately figured out were due to a) sequencing error and b) large regions of identical protein sequence shared between different proteins in the database. In fact, (b) was far more concerning to us than (a) because it meant that we were doing better or worse depending on the quality of annotation for an organism's genome. This was an analytical bias we didn't want in our research.

The solution

After lots of back-and-forth between the two of us, including discussion, arguments, coding, refactoring, and existential crises (on my part), we ended up putting together an algorithm that I'm very happy with. This is not the most efficient or accurate version that is theoretically possible, but it's the best one that we were able to put together after trying lots and lots of variations. The approach is illustrated here:

Illustration of the FAMLI algorithm

The steps are generally as follows:

  1. Align every WGS read against a database of protein references (e.g. UniRef90)
  2. Identify the protein references with highly uneven coverage (standard deviation / mean > 1) and filter them out
  3. Iteratively assign each read to a single protein reference, namely the one that is most likely given the complete set of reads in the sample
  4. Filter the final set of protein references as in (2), using the deduplicated alignments from (3).
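
To give a flavor of steps (2) and (3), here is a heavily simplified sketch. It is not the FAMLI source code: the coverage filter follows the standard deviation / mean > 1 rule from the list above (using the population standard deviation, which is an assumption), and the iterative reassignment is a generic "share each read in proportion to alignment score times total support for the reference" scheme that I am assuming for illustration. All function names and the toy inputs are hypothetical.

```python
from statistics import mean, pstdev


def evenly_covered(depth_by_position, max_cv=1.0):
    """Step 2 sketch: keep a reference only if coverage SD / mean <= max_cv."""
    avg = mean(depth_by_position)
    return avg > 0 and pstdev(depth_by_position) / avg <= max_cv


def normalize(scores):
    """Scale positive scores so that they sum to 1."""
    total = sum(scores.values())
    return {ref: s / total for ref, s in scores.items()}


def assign_reads(candidates, n_iter=10):
    """Step 3 sketch: assign each read to its single most likely reference.

    candidates: dict of read ID -> dict of reference -> alignment score (> 0).
    Each round, a read's probability for a reference is proportional to its
    alignment score times the total support that reference currently has
    across all reads (an assumed, simplified likelihood).
    """
    probs = {read: normalize(scores) for read, scores in candidates.items()}
    for _ in range(n_iter):
        support = {}
        for per_ref in probs.values():
            for ref, p in per_ref.items():
                support[ref] = support.get(ref, 0.0) + p
        probs = {
            read: normalize({ref: score * support[ref] for ref, score in scores.items()})
            for read, scores in candidates.items()
        }
    return {read: max(per_ref, key=per_ref.get) for read, per_ref in probs.items()}


# Toy alignments: two reads align equally well to protA and protB,
# while two other reads align only to protA.
candidates = {
    "read1": {"protA": 50, "protB": 50},
    "read2": {"protA": 60},
    "read3": {"protA": 55},
    "read4": {"protA": 40, "protB": 40},
}
assignment = assign_reads(candidates)
```

In this toy case the ambiguous reads are pulled toward protA, because protA also has reads that align to it uniquely. A reference covered only in one small region, such as a depth profile of `[0, 0, 0, 0, 0, 0, 0, 0, 50, 50]`, fails the evenness check.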

We did some work to test how this approach compares to other potential options, and we think it does quite well. You can read the paper for more details on that.

The software

We would like for this approach to be generally available for use by the research community. You can read the source code on GitHub, you can download a Docker image from Quay, and you can install directly from PyPI. Of the three, I suggest you just use the Docker image to minimize any confusion or unpredictability.

Thinking broadly about functional metagenomics

As we worked on benchmarking this algorithm we felt that it was necessary to compare it with other existing methods that are used by the community, but I didn't want to give the impression that this algorithm is intended to replace or supplant any previously existing tools. That's because the goal of this analysis is subtly different than that of other methods. To date, functional metagenomics has generally had the goal of identifying the metabolic pathways encoded by a community (e.g. HUMAnN2) or identifying the complete gene catalog that can be assembled from a community (e.g. de novo assembly). I think that both of those methods work very well for those useful purposes. In contrast, FAMLI attempts to quantify the set of protein-coding genes from a closed reference database (e.g. UniRef90), whether or not they have any annotated metabolic function. In fact, one of the more interesting possibilities would be to incorporate this open source algorithm into those existing tools, using it to optimize any alignments that are created as part of a larger pipeline. If you are a tool maker and are interested in working on that together, please be in touch.

Final note

Working on this project has been an extremely challenging and rewarding process, full of false starts, confusion, and discovery. I cannot be more thankful to Dr. Jonathan Golob for his equal contribution to this effort, and am also thankful for any readers who have been interested enough by our work to read this far. I hope that you find some utility from our efforts.

Detecting viruses from short-read metagenomic data

My very first foray into the microbiome was in graduate school when my advisor (Dr. Rick Bushman) suggested that I try to use a high-throughput genome sequencing instrument (a 454 pyrosequencer) to characterize the viruses present in the healthy human gut. That project ended up delving deeply into the molecular evolution of bacteriophage populations, and gave me a deep appreciation for the unpredictable complexity of viral genomes. 

While I wasn't the first to publish in this area, the general approach has become more popular over the last decade and a number of different groups have put their own spin on it. 

Because of the diversity of viruses, you can always customize and enhance different aspects of the process, from sample collection, purification, and DNA isolation to metagenomic sequencing, bioinformatic analysis, and visualization. It's been very interesting to see how sophisticated some of these methods have become.

On the bioinformatic analysis side of things, I ended up having the most success in those days by assembling each viral genome from scratch and measuring the evolution of that genome over time. More of the bespoke approach to bioinformatics, rather than the assembly line. 

In contrast, these days I am much more interested in computational approaches that allow me to analyze very large numbers of samples with a minimum of human input required. For that reason I don't find de novo assembly to be all that appealing. It's not that the computation can't be done, it's more that I have a hard time imagining how to wrap my brain around the results in a productive way. 

One approach that I have been quite happy with is much more simple-minded. Instead of trying to assemble all of the new genomes in a sample, it's much easier to simply align your short reads against a reference database of viral genomes. One of the drawbacks is that you can only detect a virus that has been detected before. On the other hand, one of the advantages is also that you can only detect a virus that has been detected before, meaning that every sample is measured against the same fixed set of references and all samples can be rapidly and intuitively compared against each other. 
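To make the "closed reference" point concrete, here is a minimal sketch of how alignment hits could be tabulated. It assumes the aligner output has already been parsed into (read, reference) pairs; the function names and the sample and genome identifiers are hypothetical and are not taken from any particular tool.

```python
from collections import Counter

def count_hits(hits):
    """Count aligned reads per reference genome for a single sample."""
    return Counter(reference_id for _read_id, reference_id in hits)

def build_abundance_table(hits_by_sample, reference_ids):
    """Return one row per sample, with one column per reference genome.

    Because the reference list is fixed, every row has the same columns,
    which is what makes samples directly comparable.
    """
    table = {}
    for sample, hits in hits_by_sample.items():
        counts = count_hits(hits)
        table[sample] = [counts.get(ref, 0) for ref in reference_ids]
    return table

# Hypothetical identifiers, for illustration only.
reference_ids = ["phage_A", "phage_B", "virus_C"]
hits_by_sample = {
    "sample_1": [("read1", "phage_A"), ("read2", "phage_A"), ("read3", "virus_C")],
    "sample_2": [("read1", "phage_B")],
}

for sample, row in build_abundance_table(hits_by_sample, reference_ids).items():
    print(sample, dict(zip(reference_ids, row)))
```

A genome that is absent from a sample simply gets a zero, so there is no need to reconcile different sets of assembled contigs between samples.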

To account for the rapid evolution of viral genomes, I think it's best to do this alignment in protein space against the set of proteins contained in every viral genome. This makes the alignments a bit more robust. 
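As an illustration of why protein space is more forgiving, the sketch below translates a read in all six reading frames using the standard genetic code. It is a toy example written for this post, not the code from the repository mentioned below; in practice a translated-search aligner would do this step for you.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translations(read):
    """Return the translation of each of the six reading frames of a read."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

# Two reads that differ by a single synonymous base (GCC vs. GCA).
print(six_frame_translations("ATGGCCTAA")["+1"])
print(six_frame_translations("ATGGCATAA")["+1"])
```

The two reads differ at the nucleotide level but encode the same peptide, so a protein-level alignment treats them as identical.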

If you would like to perform this simple read alignment method for viral detection, you can use the code that I've put together at this GitHub repo. There is also a Quay repository hosting a Docker image that you can use to run this analysis without having to install any dependencies. 

This codebase is under active development (version 0.1 at time of writing) so please be sure to test carefully against your controls to make sure everything is behaving well. At some point I may end up publishing more about this method, but it may be just too simple to entice any journal. 

Lastly, I want to point out that straightforward alignment of reads does not account for any number of confounding factors, including but not limited to:

  1. presence of human or host DNA
  2. shared genome segments between extant viruses
  3. novel viral genomes
  4. complex viral quasi-species
  5. integration of prophages into host genomes

There are a handful of tools out there that do try to deal with some of those problems in different ways, quite likely to good effect. However, it's good to remember that with every additional optimization you add a potential confounding factor. For example, it sounds like a good idea to remove human sequences from your sample, but that runs the risk of also eliminating viral sequences that happen to be found within the human genome, such as lab contaminants or integrated viral genome fragments. There are even a few human genes with deep homology to existing viral genes, thought to be due to an ancient integration and subsequent repurposing. All I mean to say here is that sometimes it's better to remove the assumptions from your code, and instead include a good set of controls in your experiment that can be used to robustly eliminate signal coming from, for example, the human genome. 
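One way to use controls in this spirit is sketched below: rather than filtering reads up front, keep every alignment and then discard any reference whose signal in a sample is not clearly above what the controls produce. The function name, the fold-change threshold, and the identifiers are all assumptions made for illustration, and the right threshold would depend on your own experiment.

```python
def filter_with_controls(sample_counts, control_counts_list, min_fold=10):
    """Keep references whose read count clearly exceeds the control background.

    sample_counts: dict of reference id -> read count for one sample.
    control_counts_list: list of such dicts, one per control (for example a
        human-only or no-template control).
    min_fold: hypothetical threshold; a reference is kept only if its count
        is more than min_fold times the highest count seen in any control.
    """
    kept = {}
    for ref, count in sample_counts.items():
        background = max((c.get(ref, 0) for c in control_counts_list), default=0)
        if count > 0 and count > min_fold * background:
            kept[ref] = count
    return kept

# Hypothetical counts: 'virus_C' also lights up in the human-only control.
sample = {"phage_A": 500, "virus_C": 40}
controls = [{"virus_C": 35}, {}]
print(filter_with_controls(sample, controls))
```

The design choice here is that the controls, not a hard-coded assumption about which sequences are human, decide what gets thrown away.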

Journal Club: Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation

Amid the holiday rush last month, I was gratified to see a publication describing a new computational method that had been on my mind. 

The paper is "Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation", published in Nature Biotechnology on December 11, 2017. 

The Concept

Bacteria do something funny to their DNA that you may not have heard of. It's called covalent DNA modification, and it often consists of adding a small methyl group to the DNA molecule itself. This added methyl group (or hydroxyl, or hydroxymethyl, etc.) doesn't interfere with the normal operation of DNA (making RNA, and more DNA). Instead it serves as a marker that differentiates self-DNA from invading DNA (such as from a phage, plasmid, etc.).

For example, one species may methylate the motif ACCGAG (at one specific base within the motif), while another might methylate CTGCAG. If the first species encounters an ACCGAG motif that lacks the appropriate methyl, it treats it as invading DNA and chews it up.
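The self versus non-self logic can be sketched in a few lines. This is a deliberately simplified toy: it only scans one strand, and the position of the methylated base within ACCGAG (offset 4 below) is a hypothetical choice made for illustration rather than a statement about any real enzyme.

```python
def find_motif_sites(sequence, motif):
    """Return the start position of every occurrence of motif in sequence."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        start = sequence.find(motif, start + 1)
    return sites

def looks_foreign(sequence, methylated_positions, motif, methylated_offset):
    """Return True if any motif site lacks the expected methyl mark.

    methylated_positions: set of 0-based positions carrying a methyl group.
    methylated_offset: which base within the motif is expected to be
        methylated (hypothetical value in the example below).
    """
    for start in find_motif_sites(sequence, motif):
        if start + methylated_offset not in methylated_positions:
            return True
    return False

dna = "TTACCGAGTTACCGAGTT"  # two ACCGAG sites, at positions 2 and 10
print(looks_foreign(dna, {6, 14}, "ACCGAG", 4))  # both sites marked
print(looks_foreign(dna, {6}, "ACCGAG", 4))      # second site unmarked
```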

The ongoing arms race of mobile DNA elements has helped maintain a diversity of DNA modifications, such that many different species of bacteria have a unique profile. 

The interesting trick here is that we now have a way to read out the methylation patterns in DNA in addition to the sequence. This is most notably available via PacBio sequencing, which typically will generate a smaller number of much longer sequence fragments than other genome sequencing methods. Bringing it all together, the authors of this paper were able to use the methylation patterns from PacBio sequencing to much more accurately reconstruct the microbes present in a set of environmental samples (such as the human microbiome). 

The Data

The approach of the authors was to first assemble the PacBio sequences, and then bin those larger genome fragments together based on a common epigenetic signature. 
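To give a feel for what binning on an epigenetic signature means, here is a toy sketch. It assumes each assembled contig has already been summarized as a vector with one entry per motif, giving the fraction of that motif's sites that are methylated. The greedy grouping, the distance threshold, and the contig names are all hypothetical, and this is not the procedure used in the paper.

```python
import math

def bin_contigs(profiles, max_distance=0.3):
    """Greedily group contigs with similar methylation profiles.

    profiles: dict of contig name -> list of per-motif methylation fractions.
    A contig joins the first bin whose seed (first member) lies within
    max_distance (Euclidean); otherwise it starts a new bin.
    """
    bins = []
    for name, profile in profiles.items():
        for members in bins:
            if math.dist(profile, profiles[members[0]]) <= max_distance:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins

# Hypothetical methylation fractions for three motifs.
profiles = {
    "contig_1": [0.95, 0.02, 0.01],
    "contig_2": [0.90, 0.05, 0.00],
    "contig_3": [0.03, 0.97, 0.92],
    "contig_4": [0.00, 0.92, 0.95],
}
print(bin_contigs(profiles))
```

Contigs from the same organism should carry the same set of methylated motifs, so they end up close together even if their sequence composition differs.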

Figure 2c shows the binning of genome fragments in a sample containing a known mixture of bacteria.

Figure 2e shows the binning of genome fragments in a gut microbiome sample containing a mixture of unknown bacteria (and viruses, fungi, etc.).

In general, this seems like a very interesting approach. After my read of the paper, it appears that the bacteria present in the microbiome contain a distinct enough set of epigenetic patterns to enable the deconvolution of many different species and strains. I look forward to seeing how this method stacks up against other binning approaches in the future. 

Final note

For those of you interested in phage and mobile genetic elements, I wanted to point out that the authors also explore a topic that has been studied somewhat by others – the linkage of phage to their hosts via epigenetic signatures. The idea here is that it can be computationally difficult to match a phage genome or plasmid with its host. One experimental method that accomplishes this is Hi-C, which takes advantage of the physical proximity of the phage or plasmid DNA to the host genome within an intact cell. In contrast, the PacBio method does not require intact cells, and can be used to link phage or plasmids with their host based on a shared epigenetic signature. 
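A minimal sketch of that linking step, under the assumption that each host genome bin and each mobile element has been reduced to a vector of per-motif methylation fractions: the element is assigned to the host with the most similar profile, or left unassigned if nothing is close enough. The function name, threshold, and identifiers are hypothetical, and this is an illustration of the idea rather than the authors' implementation.

```python
import math

def assign_host(element_profile, host_profiles, max_distance=0.3):
    """Return the host whose methylation profile is closest to the element.

    Returns None when no host lies within max_distance (Euclidean), since a
    poor match is better reported as 'unknown' than forced onto a host.
    """
    best_host, best_distance = None, float("inf")
    for host, profile in host_profiles.items():
        distance = math.dist(element_profile, profile)
        if distance < best_distance:
            best_host, best_distance = host, distance
    return best_host if best_distance <= max_distance else None

# Hypothetical profiles for two host bins, one plasmid, and one orphan element.
hosts = {"bin_A": [0.95, 0.02, 0.01], "bin_B": [0.02, 0.95, 0.93]}
print(assign_host([0.90, 0.05, 0.00], hosts))
print(assign_host([0.50, 0.50, 0.50], hosts))
```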

My hope is that this type of data starts to become more widely available. There are clearly a number of computational tools that need to be refined in order to get full use of all this information, but it does seem to hold a good deal of promise.