Sam Minot

March 29, 2023

Practical Reproducibility in Bioinformatics

I recently had an opportunity to present as part of a seminar series on Rigor, Reproducibility, and Transparency. My goal was to give the talk that I would have liked to hear when I was starting out in graduate school. If you are also starting your journey with computational biology, you may find something useful here.

Sam Minot

December 14, 2021

Code Discussion

Microbial Pan-Genome Cartography

Many of the projects that I’ve been working on recently have led me to ask questions like, “what bacteria encode this group of genes?” and “what genes are shared across this group of bacteria?” Even for a single species, the patterns of genes across genomes can be complex and beautiful!

Update: All of the maps shown below can now be viewed interactively here.

To make it easier to generate interactive displays to explore these patterns (which I call “genes-in-genomes maps”), I ended up building out a collection of tools which I would love for any researcher to use.

To explain a bit more about this topic, I recorded a short talk based on a presentation I gave at a local microbiome interest group.

In the presentation I walk through a handful of these pan-genome maps, which you can also download and explore for yourself using the links below.

I’m excited to think that this way of displaying microbial pan-genomes might be useful to researchers who are interested in exploring the diversity of the microbial world. If you’d like to talk more about using this approach in your research, please don’t hesitate to be in touch.

Sam Minot

January 9, 2021

Musings, Journal Club

What's the Matter with Recombination?

Required reading for anyone who uses taxonomy or phylogeny in their #bioinformatics to study the #microbiome (thread, and paper) https://t.co/ucs8r7gLH7
— Sam Minot (@sminot) January 9, 2021

Why would I be so bold as to assign required reading on a Saturday morning via Twitter? Because the ideas laid out in this paper have practical implications for many of the computational tools that we use to understand the microbiome. Taxonomy and phylogeny lie at the heart of k-mer based methods for metagenomics, and they are the core justification for measuring any marker gene (e.g. 16S) with amplicon sequencing.

Don’t get me wrong, I am a huge fan of both taxonomy and phylogeny as one of the best ways for humans to understand the microbial world, and I’m going to keep using both for the rest of my career. For that reason, I think it’s very important to understand the ways in which these methods can be confounded (i.e. the ways in which these methods can mislead us) by mechanisms like genomic recombination.

What Is Recombination?

Bacteria are amazingly complex creatures, and they do a lot more than just grow and divide. During the course of a bacterial cell’s life, it may end up doing something exciting like:

Importing a piece of naked DNA from its environment and just pasting it into its genome;
Using a small needle to inject a plasmid (small chromosome) into an adjacent cell; or
Packaging up a small piece of its genome into a phage (protein capsule) which contains all of the machinery needed to travel a distance and then inject that DNA into another bacterium far away.

Not all bacteria do all of these things all of the time, but we know that they do happen. One common feature of these activities is that genetic material is exchanged between cells in a manner other than clonal reproduction (when a single cell splits into two). In other words, these are all forms of ‘recombination.’

How Do We Study the Microbiome?

Speaking for myself, I study the microbiome by analyzing data generated by genome sequencing instruments. We use those instruments to identify a small fraction of the genetic sequences contained in a microbiome sample, and then we draw inferences from those sequences. This may entail amplicon sequencing of the 16S gene, bulk WGS sequencing of a mixed population, or even single-cell sequencing of individual bacterial cells. Across all of these different technological approaches, we are collecting a small sample of the genomic sequences present in a much larger microbiome sample, and we are using that data to gain some understanding of the microbiome as a whole. In order to extrapolate from data to models, we rely on some key assumptions about how the world works, one of which is that bacteria do not frequently recombine.

What Does Recombination Mean to Me?

If you’ve read this far you are either a microbiome researcher or someone very interested in the topic, and so you should care about recombination. As an example, let’s walk through the logical progression of microbiome data analysis:

I have observed microbial genomic sequence S in specimen X. This may be an OTU, ASV, or WGS k-mer.
The same sequence S can also be observed in specimen Y, but not specimen Z. There may be some nuances of sequencing depth and the limit-of-detection, but I have satisfied myself that for this experiment marker S can be found in X and Y, but not Z.
Because bacteria infrequently recombine, I can infer that marker S represents a larger genomic region G which is similarly present in X and Y, but not Z. For 16S that genomic region would be the species- or genus-level core genome, and for WGS it could also be some accessory genetic elements like plasmids, etc. In the simplest rendering, we may give a name to genomic region G which corresponds to the taxonomic label for those organisms which share marker S (e.g. Escherichia coli).
When I compare a larger set of samples, I find that the marker S can be consistently found in samples obtained from individuals with disease D (like X and Y) but not in samples from healthy controls (like Z). Therefore I would propose the biological model that organisms containing the larger genomic region G are present at significantly higher relative abundance in the microbiome of individuals with disease D.
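The last step of that progression boils down to a test on a 2x2 table of marker presence against disease status. Here is a minimal Python sketch of that reasoning. The sample counts are made up, and the one-sided Fisher's exact test is simply a convenient choice for illustration, not a method named anywhere in this post or the paper.

```python
from math import comb

def contingency(samples):
    """Count samples by (disease status, marker presence).

    Each sample is a (diseased, has_marker) pair of booleans.
    """
    a = sum(1 for diseased, has_marker in samples if diseased and has_marker)
    b = sum(1 for diseased, has_marker in samples if diseased and not has_marker)
    c = sum(1 for diseased, has_marker in samples if not diseased and has_marker)
    d = sum(1 for diseased, has_marker in samples if not diseased and not has_marker)
    return a, b, c, d

def fisher_one_sided(a, b, c, d):
    """Probability of seeing >= a marker-positive cases under the null.

    Table layout:
                    marker S present   marker S absent
        disease D          a                  b
        healthy            c                  d
    """
    cases, marker_positive, n = a + b, a + c, a + b + c + d
    favourable = 0
    for k in range(a, min(cases, marker_positive) + 1):
        # k marker-positive samples among the cases, the rest among controls
        favourable += comb(cases, k) * comb(n - cases, marker_positive - k)
    return favourable / comb(n, marker_positive)

# Hypothetical cohort: marker S found in 8 of 10 cases and 1 of 10 controls
samples = (
    [(True, True)] * 8 + [(True, False)] * 2
    + [(False, True)] * 1 + [(False, False)] * 9
)
p_value = fisher_one_sided(*contingency(samples))
print(f"one-sided p = {p_value:.4f}")
```

Note that the test only speaks to marker S. Extending the conclusion to the larger region G is exactly the step that depends on how rarely the organisms recombine.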

In this simplistic rendering I’ve tried to make it clear that the degree to which bacteria recombine will have a practical impact on how much confidence we can have in inferences which rely on the concepts of taxonomy or phylogeny.

The key idea here is that when we observe any marker S, we tend to assume that there is a monophyletic group of organisms which share that marker sequence. Monophyly is one of the most important concepts in microbiome analysis, and it also happens to be a lot of fun to say, so it’s worth reading up on.

How Much Recombination Is There?

Getting back to the paper that started it all, the authors did a nice job of carefully estimating the frequency of recombination across a handful of bacterial species for which a reasonable amount of data is available. The answer they found is that recombination rates vary, and this answer matches our mechanistic understanding of recombination. The documented mechanisms of recombination vary widely across different organisms, and there is undoubtedly a lot more out there we haven’t characterized yet.

At the end of the day, we have only studied a small fraction of the organisms which are found in the microbiome. As such, we should approach them with a healthy dose of skepticism for any key assumption, like a lack of recombination, which we know is not universal.

In conclusion, I am going to continue to use taxonomy and phylogeny every single day that I study the microbiome, but I’m also going to stay alert for how recombination may be misleading me. On a practical note, I am also going to try to use methods like gene-level analysis which keep a tight constraint on the size of regions G which are inferred from any marker S.

Sam Minot

March 25, 2020

Code Discussion

Automating Your Code

It’s been far too long since I’ve posted, but I wanted to share a small piece of how my work has changed over the last year in case my experience ends up being helpful for anyone.

Automated Actions with Code

For a few years now I’ve gotten used to setting up projects as code repositories (by this I really mean GitHub repositories, but I assume there are other providers to host versioned code). The point of these repositories is to make it easy to track changes to a collection of code, even when a group of people is collaborating and updating different parts of the code. For a while I thought that my sophistication with this system would develop mostly in the areas of managing and tracking these code changes (pull requests, working on branches, etc.), but in the last few months my eyes have been opened to a whole new world of automated tests or “actions.”

Of course, providers like GitLab, CircleCI, TravisCI, etc. have been providing tools for automated code execution for some time, but I never ended up setting those systems up myself and so they seemed a bit too intimidating to start out with. Then, sometime last year GitHub introduced a new part of their website called “Actions” and I started to dive in.

The idea with these actions is that you can automatically execute some compute using the code in your repository. This compute has to be pretty limited (it can’t use many resources and can’t run for very long), but it’s more than enough capacity to do some really useful things.

Pipeline Validation

One task I spend time on is building small workflows to help people run bioinformatics. Something that I’ve found very useful with these workflows is that you can set up an action which will automatically run the entire pipeline and report if any errors are encountered. The prerequisite here is that you have to be able to generate testing data which will run in ~5 minutes, but the benefit is that you can test a range of conditions with your code, and not have to worry that your local environment is misleading you. This is also really nice in case you are making a minor change and don’t want to run any tests locally — all the tests run automatically and you just get a nice email if there are any problems. Peace of mind! Here is an example of the configuration that I used for running this type of testing with a recent repository.
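To make the idea concrete, here is a minimal Python sketch of the kind of smoke test an automated action could call. This is not the configuration referred to above. The function name `run_smoke_test`, the 300-second limit (chosen to match the ~5 minute budget), and the placeholder command are all assumptions for illustration.

```python
import subprocess
import sys

def run_smoke_test(command, timeout_seconds=300):
    """Run a pipeline command on small test data.

    Returns (passed, log), where passed is True only if the command
    exits with status 0 within the time limit.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_seconds} seconds"
    return result.returncode == 0, result.stdout + result.stderr

# Hypothetical stand-in for a real pipeline invocation on test data
passed, log = run_smoke_test([sys.executable, "-c", "print('pipeline finished')"])
print("PASS" if passed else "FAIL")
```

In a real repository the script would exit with a non-zero status when `passed` is False, since that is what most CI systems use to mark a run as failed and send the notification.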

Packaging for Distribution

I recently worked on a project for which I needed to wrap up a small script which would run on multiple platforms as a standalone executable. As common as this task is, it’s not something I’ve done frequently and I wasn’t particularly confident in my ability to cross-compile from my laptop for Windows, Ubuntu, and MacOS. Luckily, I was able to figure out how to configure actions which would package up my code for three different operating systems, and automatically attach those executables as assets to tagged releases. This means that all I have to do to release a new set of binaries is to push a tagged commit, and everything else is just taken care of for me.

In the end, I think that spending more time in bioinformatics means figuring out all of the things which you don’t actually have to do and automating them. If you are on the fence, I would highly recommend getting acquainted with some automated testing system like GitHub Actions to see what work you can take off your plate entirely to focus on more interesting things.

Sam Minot

May 7, 2019

Working with Nextflow at Fred Hutch

I’ve been putting in a bit of work recently trying to make it easier for other researchers at Fred Hutch to use Nextflow as a way to run their bioinformatics workflows, while also getting the benefits of cloud computing and Docker-based computational reproducibility.

You can see some slides describing some of that content here, including a description of the motivation for using workflow managers, as well as a more detailed walk-through of using Nextflow right here at Fred Hutch.

Sam Minot

March 5, 2019

Publication

Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.
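For readers who have not met these terms, here is a toy Python sketch of exhaustive average linkage clustering on cosine distance. It is deliberately naive: every pair of clusters is compared at every step, which is exactly the all-by-all cost that becomes impossible for millions of genes, and it does not use the Approximate Nearest Neighbor heuristics described above. The gene names, abundance values, and the `max_distance` threshold are made up, and the sketch assumes no gene has all-zero abundance.

```python
from math import sqrt

def cosine_distance(u, v):
    """1 - cosine similarity between two abundance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def average_linkage(cluster_a, cluster_b, abundances):
    """Mean pairwise distance between the genes of two clusters."""
    dists = [
        cosine_distance(abundances[g], abundances[h])
        for g in cluster_a
        for h in cluster_b
    ]
    return sum(dists) / len(dists)

def cluster_cags(abundances, max_distance=0.1):
    """Greedily merge the closest pair of clusters until none is close enough."""
    clusters = [[gene] for gene in abundances]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], abundances)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical abundances of four genes across four samples
abundances = {
    "gene_1": [1.0, 2.0, 0.0, 4.0],
    "gene_2": [2.0, 4.0, 0.0, 8.0],
    "gene_3": [0.0, 0.0, 5.0, 0.0],
    "gene_4": [0.0, 0.1, 10.0, 0.0],
}
cags = cluster_cags(abundances)
print(cags)
```

Cosine distance ignores overall magnitude, so `gene_1` and `gene_2` (one is exactly double the other) are treated as perfectly co-abundant.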

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.
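In code, the discovery / validation logic looks roughly like the Python sketch below. The CAG names, effect estimates, p-values, and the 0.05 cutoff are all hypothetical, and the sketch leaves out multiple-testing correction and everything else a real analysis would need, so treat it as an outline of the idea and not the analysis from the preprint.

```python
def validated_cags(discovery, validation, alpha=0.05):
    """Return CAGs significant in discovery that replicate in validation.

    Both inputs map a CAG name to (effect_estimate, p_value). A CAG
    replicates when it is also significant in the validation cohort and
    its estimate points in the same direction.
    """
    confirmed = []
    for cag, (estimate, p_value) in discovery.items():
        if p_value >= alpha or cag not in validation:
            continue
        val_estimate, val_p = validation[cag]
        same_direction = (estimate > 0) == (val_estimate > 0)
        if same_direction and val_p < alpha:
            confirmed.append(cag)
    return confirmed

# Hypothetical results from two independent cohorts
discovery = {"CAG_1": (1.2, 0.001), "CAG_2": (-0.8, 0.01), "CAG_3": (0.5, 0.2)}
validation = {"CAG_1": (0.9, 0.004), "CAG_2": (0.3, 0.03), "CAG_3": (0.6, 0.01)}
print(validated_cags(discovery, validation))
```

Here `CAG_2` is dropped because its direction flips between cohorts, and `CAG_3` is dropped because it was never significant in the discovery cohort to begin with.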

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

Start with a survey of people with and without a disease,
Collect WGS data from microbiome samples,
Find microbial CAGs that are associated with disease, and then
Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.
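The final step, going from disease-associated genes to a prioritized list of isolates, can be sketched in a few lines of Python. The sketch assumes the alignment has already been done and reduced to a set of gene names per isolate. The isolate names, gene names, and coefficients are made up, and ranking by the total count of associated genes is just one simple choice.

```python
def rank_isolates(isolate_genes, gene_coefficients):
    """Rank isolates by how many disease-associated genes they carry.

    isolate_genes maps an isolate name to the set of genes found in its
    genome. gene_coefficients maps a gene to its association with
    disease (positive or negative). Genes without a coefficient are
    ignored.
    """
    rows = []
    for isolate, genes in isolate_genes.items():
        positive = sum(1 for g in genes if gene_coefficients.get(g, 0) > 0)
        negative = sum(1 for g in genes if gene_coefficients.get(g, 0) < 0)
        rows.append((isolate, positive, negative))
    # Isolates with the most associated genes (either direction) come first
    return sorted(rows, key=lambda row: row[1] + row[2], reverse=True)

# Hypothetical gene content and association coefficients
isolate_genes = {
    "isolate_A": {"g1", "g2", "g3"},
    "isolate_B": {"g3"},
    "isolate_C": {"g4", "g5"},
}
gene_coefficients = {"g1": 0.4, "g2": 0.3, "g3": -0.2, "g4": -0.5, "g5": -0.1}
for isolate, positive, negative in rank_isolates(isolate_genes, gene_coefficients):
    print(isolate, positive, negative)
```

Keeping the positive and negative counts separate matters, since an isolate rich in negatively associated genes and one rich in positively associated genes suggest very different follow-up experiments.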

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Sam Minot

February 25, 2019

Musings

Bioinformatics: Reproducibility, Portability, Transparency, and Technical Debt

I’ve been thinking a lot about what people are talking about when they talk about reproducibility. It has been helpful to start to break apart the terminology in order to distinguish between some conceptually distinct, albeit highly intertwined, concepts.

Bioinformatics: Strictly speaking, analysis of data for the purpose of biological research. In practice, the analysis of large files (GBs) with a series of compiled programs, each of which may have a different set of environmental dependencies and computational resource requirements.

Reproducibility: An overarching concept describing how easily a bioinformatic analysis performed at one time may be able to be executed a second time, potentially by a different person, at a different institution, or on a different set of input data. There is also a strict usage of the term which describes the computational property of an analysis in which the analysis of an identical set of inputs will always produce an identical set of outputs. These two meanings are related, but not identical. Bioinformaticians tend to accept a lack of strict reproducibility (e.g., the order of alignment results may not be consistent when multithreading), but very clearly want to have general reproducibility in which the biological conclusions drawn from an analysis will always be the same from identical inputs.
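The gap between the strict and general meanings is easy to see with checksums. The Python sketch below compares two hypothetical alignment outputs that contain the same records in a different order, as can happen with multithreading. The record strings and function names are illustrative only.

```python
import hashlib

def strict_digest(lines):
    """Checksum that changes if the content or the order of lines changes."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode() + b"\n")
    return h.hexdigest()

def order_insensitive_digest(lines):
    """Checksum that only depends on which lines are present."""
    return strict_digest(sorted(lines))

# Hypothetical outputs from two runs of the same multithreaded aligner
run_1 = ["read_1\tchr1\t100", "read_2\tchr2\t250"]
run_2 = ["read_2\tchr2\t250", "read_1\tchr1\t100"]

print(strict_digest(run_1) == strict_digest(run_2))
print(order_insensitive_digest(run_1) == order_insensitive_digest(run_2))
```

The two runs fail the strict comparison but pass the order-insensitive one, which is closer to what most bioinformaticians mean when they ask whether an analysis reproduced.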

Portability: The ability of researchers at different institutions (or in different labs) to execute the same analysis. This aspect of reproducibility is useful to consider because it highlights the difficulties that are encountered when you move between computational environments. Each set of dependencies, environmental variables, file systems, permissions, hardware, etc., is typically quite different and can cause endless headaches. Some people point to Docker as a primary solution to this problem, but it is typical for Docker to be prohibited on HPCs because it requires root access. Operationally, the problem of portability is a huge one for bioinformaticians who are asked by their collaborators to execute analyses developed by other groups, and the reason why we sometimes start to feel like UNIX gurus more than anything else.

Transparency: The ability of researchers to inspect and understand what analyses are being performed. This is more of a global problem in concept than in practice: people like to talk about how they mistrust black box analyses, but I don’t know anybody who has read through the code for BWA searching for potential bugs. At the local level, I think that the transparency people actually need is at the level of the pipeline or workflow. We want to know which individual tools are being invoked, and with what parameters, even if we aren’t qualified (speaking for myself) to debug any Java or C code.

Technical Debt: The amount of work required to mitigate any of the challenges mentioned above. This is the world that we live in which nobody talks about. With infinite time and effort it is possible to implement almost any workflow on almost any infrastructure, but the real question is how much effort it will take. It is important to recognize when you are incurring technical debt that will have to be paid back by yourself or others in the field. My rule of thumb is to think about, for any analysis, how easily I will be able to re-run all of the analyses from scratch when reviewers ask what would be different if we changed a single parameter. If it’s difficult in the slightest for me to do this, it will be almost impossible for others to reproduce my analysis.

Final Thoughts

I’ve been spending a lot of time recently on workflow managers, and I have found that there are quite a number of systems which provide strict computational reproducibility with a high degree of transparency. The point where they fall down, through no fault of their own, is the ease with which they can be implemented on different computational infrastructures. It is just a complete mess to be able to run an analysis in the exact same way in a diverse set of environments, and it requires that the development teams for those tools devote time and energy to account for all of those eventualities. In a world where very little funding goes to bioinformatics infrastructure, reproducibility will always be a challenge, but I am hopeful that things are getting better every day.

Sam Minot

September 23, 2018

Code Discussion

The Rise of the Machines: Workflow Managers for Bioinformatics

As with many things these days, it started with Twitter and it went further than I expected.

The other day I wrote a slightly snarky tweet

It's amazing how much of "being a bioinformatician" can be replaced by a mediocre workflow manager.

Just imagine if there were a really _good_ workflow manager out there!
— Sam Minot (@sminot) September 21, 2018

There were a handful of responses to this, almost all of them gently pointing out to me that there are a ton of workflow managers out there, some of which are quite good. So, rather than trying to dive further on Twitter (a fool’s errand), I thought I would explain myself in more detail here.

What is “being a bioinformatician”?

“Bioinformatics” is a term that is broadly used by people like me, and really quite poorly defined. In the most literal sense, it is the analysis of data relating to biology or biomedical science. However, the shade of meaning which has emerged emphasizes the scope and scale of the data being analyzed. In 2018, being a bioinformatician means dealing with large datasets (genome sequencing, flow cytometry, RNAseq, etc.) which are made up of a pretty large number of pretty large files. Not only is the data larger than you can fit into Excel (by many orders of magnitude), but it often cannot fit onto a single computer, and it almost always takes a lot of time and energy to analyze.

The aspect of this definition useful here is that bioinformaticians tend to

keep track of and move around a large number of extremely large files (>1 GB individually, 100s of GB in aggregate)
analyze those files using a “pipeline” of analytical tools — input A is processed by algorithm 1 to produce file B, which is processed by algorithm 2 to produce file C, etc. etc.

Here’s a good counterpoint that was raised to the above:

I think being a bioinformatician it’s a lot more than *just* executing pipelines. It’s more building them, deciding and testing each tool, fine tuning everything and more importantly analyse the outputs. But for sure a good workflow manager helps a lot with the boring stuff :)
— Francesco Strozzi (@fstrozzi) September 22, 2018

Good point, now what is a “Workflow Manager”?

A workflow manager is a very specific thing that takes many different forms. At its core, a workflow manager will run a set of individual programs or “tasks” as part of a larger pipeline or “workflow,” automating a process that would typically be executed by (a) a human entering commands manually into the command line, or (b) a “script” containing a list of commands to be executed. There can be a number of differences between a “script” and a “workflow,” but generally speaking the workflow should be more sophisticated, more transportable between computers, and better able to handle the complexities of execution that would simply result in an error for a script.

This is a very unsatisfying definition, because there isn’t a hard and fast delineation between scripts and workflow, and scripts are practically the universal starting place for bioinformaticians as they learn how to get things done with the command line.

Examples of workflow managers (partially culled from the set of responses I got on Twitter):

My Ideal Workflow Manager

Sam - curious - what does a great workflow manager look like to you?
— Jonathan Jacobs (@bioinformer) September 21, 2018

I was asked this question, and so I feel slightly justified in laying out my wishlist for a workflow manager:

Tasks consist of BASH snippets run inside Docker containers
Supports execution on a variety of computational resources: local computer, local clusters (SLURM, PBS), commercial clusters (AWS, Google Cloud, Azure)
The dependencies and outputs of a task can be defined by the output files created by the task (so a task isn’t re-run if the output already exists)
Support for file storage locally as well as object stores like AWS S3
Easy to read, write, and publish to a general computing audience (highly subjective)
Easy to set up and get running (highly subjective)

The goal here is to support reproducibility and portability, both to other researchers in the field, but also to your future self who wants to rerun the same analysis with different samples in a year’s time and doesn’t want to be held hostage to software dependency hell, not to mention the crushing insecurity of not knowing whether new results can be compared to previous ones.
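To illustrate the third item on that wishlist (a task isn’t re-run if its output already exists), here is a minimal Python sketch. It is a toy, not a real workflow manager: there are no Docker containers, no cluster or cloud execution, and no object storage, and the function names and the two shell commands are made up. It assumes a POSIX shell is available.

```python
import os
import shlex
import subprocess
import tempfile

def run_task(command, outputs):
    """Run a shell command unless every declared output file already exists."""
    if outputs and all(os.path.exists(path) for path in outputs):
        return "skipped"
    subprocess.run(command, shell=True, check=True)
    missing = [path for path in outputs if not os.path.exists(path)]
    if missing:
        raise RuntimeError(f"task did not create: {missing}")
    return "ran"

def run_workflow(tasks):
    """Run (command, outputs) pairs in order and report what happened."""
    return [run_task(command, outputs) for command, outputs in tasks]

# Hypothetical two-step pipeline: step 1 makes file B, step 2 turns B into C
workdir = tempfile.mkdtemp()
file_b = os.path.join(workdir, "B.txt")
file_c = os.path.join(workdir, "C.txt")
tasks = [
    (f"echo input > {shlex.quote(file_b)}", [file_b]),
    (f"tr a-z A-Z < {shlex.quote(file_b)} > {shlex.quote(file_c)}", [file_c]),
]

first_run = run_workflow(tasks)   # nothing exists yet, so both tasks execute
second_run = run_workflow(tasks)  # outputs exist, so both tasks are skipped
print(first_run, second_run)
```

Checking only for file existence is the simplest possible rule. It will not notice that an input or a command has changed since the output was written, which is one of the things the real tools handle with timestamps or content hashes.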

Where are we now?

The state of the field at the moment is that we have about a dozen actively maintained projects that are working in this general direction. Ultimately I think the hardest thing to achieve is the last two bullets on my list. Adding support for services which are highly specialized (such as AWS) necessarily adds a ton of configuration and execution complexity that makes it even harder for a new user to pick up and use a workflow that someone hands to them.

Case in point: I like to run things inside Docker containers using AWS Batch, but this requires that all of the steps of a task (copying the files down from S3, running a long set of commands, checking the outputs, and uploading the results back to S3) be encapsulated in a single command. To that end, I have had to write wrapper scripts for each of my tools and bake them into the Docker image so that they can be invoked in a single command. As a result, I’m stuck using the Docker containers that I maintain, instead of an awesome resource like BioContainers. This is highly suboptimal, and would be difficult for someone else to elaborate and develop further without completely forking the repo for every single command they want to tweak. Instead, I would much rather we could all just contribute to and use BioContainers, and use a workflow system that took care of the complex set of commands executed inside each container.

In the end, I have a lot of confidence that the developers of workflow managers are working towards exactly the end goals that I’ve outlined. This isn’t a highly controversial area, it just requires an investment in computational infrastructure that our R&D ecosystem has always underinvested in. If the NIH decided today that they were going to fund the development and ongoing maintenance of three workflow managers by three independent groups (and their associated OSS communities), we’d have a much higher degree of reproducibility in science, but that hasn’t happened (as far as I know — I am probably making a vast oversimplification here for dramatic effect).

Give workflow managers a try, give back to the community where you can, and let’s all work towards a world where no bioinformatician ever has to run BWA by hand and look up which flag sets the number of threads.

Sam Minot

May 25, 2018

Journal Club

Hybrid Approach to Microbiome Research (to Culture, and Not to Culture)

I was rereading a great paper from the Huttenhower group (at Harvard) this week and I was struck by a common theme that it shared with another great paper from the Segre group (at NIH), which I think is a nice little window into how good scientists are approaching the microbiome these days.

The paper I'm thinking about is Hall, et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients (2017) Genome Medicine. The paper is open access so you can feel free to go read it yourself, but my super short summary is: (1) they analyzed the gut microbiome from patients with (and without) IBD and found that a specific clade of Ruminococcus gnavus was enriched in IBD; and then (2) they took the extra step of growing up those bacteria in the lab and sequencing their genomes in order to figure out which specific genes were enriched in IBD.

The basic result is fantastically interesting – they found enriched genes that were associated with oxidative stress, adhesion, iron acquisition, and mucus utilization, all of which make sense in terms of IBD – but I mostly want to talk about the way they figured this out. Namely, they took a combined approach of (1) analyzing the total DNA from stool samples with culture free genome sequencing, and then (2) they isolated and grew R. gnavus strains in culture from those same stool samples so that they could analyze their genomes.

Fig. 3: R. gnavus metagenomic strain phylogeny.

Now, if you cast your mind back to the paper on pediatric atopic dermatitis from Drs. Segre and Kong (Byrd, et al. 2017 Science Translational Medicine) you will remember that they took a very similar approach. They did culture-free sequencing of skin samples, while growing Staph strains from those same skin samples in parallel. With the cultures in hand they were able to sequence the genomes of those strains as well as testing for virulence in a mouse model of dermatitis.

So, why do I think this is worth writing a post about? It helps tell the story of how microbiome research has been developing in recent years. At the start, all we could do was describe how different organisms were in higher and lower abundance in different body sites, disease states, etc. Now that the field has progressed, it is becoming clear that the strain-level differences within a given species may be very important to human health and disease. We know that although people may contain a similar set of common bacterial species, the exact strains in their gut (for example) are different between people and usually stick around for months and years.

With this increased focus on strain-level diversity, we are coming up against the technological challenges of characterizing those differences between people, and how those differences track with health and disease. The two papers I've mentioned here are not the only ones to take this approach (it's also worth mentioning this great paper on urea metabolism in Crohn's disease from UPenn), which neatly interweaves the complementary sets of information that can be gleaned from culture-free whole-genome shotgun sequencing and from culture-based strain isolation. Both of those techniques are difficult and they require extremely different sets of skills, so it's great to see collaborations come together to make these studies possible.

With such a short post, I've surely left out some important details from these papers, but I hope that the general reflection and point about the development of microbiome research has been of interest. It's certainly going to stay on my mind for the years to come.

Sam Minot

February 15, 2018

Code Discussion

Detecting viruses from short-read metagenomic data

My very first foray into the microbiome was in graduate school when my advisor (Dr. Rick Bushman) suggested that I try to use a high-throughput genome sequencing instrument (a 454 pyrosequencer) to characterize the viruses present in the healthy human gut. That project ended up delving deeply into the molecular evolution of bacteriophage populations, and gave me a deep appreciation for the unpredictable complexity of viral genomes.

While I wasn't the first to publish in this area, the general approach has become more popular over the last decade and a number of different groups have put their own spin on it.

Because of the diversity of viruses, you can always customize and enhance different aspects of the process, from sample collection, purification, and DNA isolation to metagenomic sequencing, bioinformatic analysis, and visualization. It's been very interesting to see how sophisticated some of these methods have become.

On the bioinformatic analysis side of things, I ended up having the most success in those days by assembling each viral genome from scratch and measuring the evolution of that genome over time. More of the bespoke approach to bioinformatics, rather than the assembly line.

In contrast, these days I am much more interested in computational approaches that allow me to analyze very large numbers of samples with a minimum of human input required. For that reason I don't find de novo assembly to be all that appealing. It's not that the computation can't be done, it's more that I have a hard time imagining how to wrap my brain around the results in a productive way.

Instead, one approach that I have been quite happy with is much more simple-minded. Rather than trying to assemble all of the new genomes in a sample, it's much easier to simply align your short reads against a reference database of viral genomes. One of the drawbacks is that you can only detect a virus that has been detected before. On the other hand, one of the advantages is also that you can only detect a virus that has been detected before, meaning that all samples can be rapidly and intuitively compared against each other.
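To illustrate why a shared reference makes samples easy to compare, here is a minimal sketch in Python. It assumes each sample has already been reduced to a mapping of read name to best-matching reference virus; the function names and the toy data are hypothetical and are not taken from the code linked in this post.

```python
from collections import Counter


def count_reads_per_virus(read_to_virus):
    """Count how many reads were assigned to each reference virus."""
    return Counter(read_to_virus.values())


def build_abundance_table(samples):
    """Build a samples-by-viruses table of read counts.

    `samples` maps a sample name to its {read name: virus name} assignments.
    Returns the sorted list of virus names and, for each sample, a list of
    counts in that same order (0 where a virus was not observed).
    """
    per_sample = {
        name: count_reads_per_virus(hits) for name, hits in samples.items()
    }
    viruses = sorted({v for counts in per_sample.values() for v in counts})
    table = {
        name: [counts.get(v, 0) for v in viruses]
        for name, counts in per_sample.items()
    }
    return viruses, table


# Toy input (hypothetical read and virus names)
samples = {
    "sample_A": {"read1": "phage_X", "read2": "phage_X", "read3": "phage_Y"},
    "sample_B": {"read1": "phage_Y"},
}
viruses, table = build_abundance_table(samples)
```

Because every sample is scored against the same set of reference names, the columns line up automatically and no clustering or matching of assembled contigs between samples is needed.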

To account for the rapid evolution of viral genomes, I think it's best to do this alignment in protein space against the set of proteins contained in every viral genome. This makes the alignments a bit more robust.
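As a rough sketch of what a protein-space summary could look like, the Python below assumes the reads were aligned with a translated-search tool that writes the standard 12-column BLAST tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The `protein_to_genome` lookup, the e-value cutoff, and the function names are assumptions made for illustration; they are not necessarily what the code linked in this post does.

```python
from collections import defaultdict


def proteins_detected(tabular_lines, max_evalue=1e-5):
    """Return the set of reference protein IDs hit by at least one read.

    Each line is expected to hold 12 tab-separated BLAST-style fields;
    shorter lines are skipped. The e-value cutoff is an arbitrary example.
    """
    detected = set()
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        subject_id = fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            detected.add(subject_id)
    return detected


def genome_protein_coverage(detected, protein_to_genome):
    """Fraction of each genome's proteins that were detected."""
    total = defaultdict(int)
    found = defaultdict(int)
    for protein, genome in protein_to_genome.items():
        total[genome] += 1
        if protein in detected:
            found[genome] += 1
    return {genome: found[genome] / total[genome] for genome in total}


# Toy data (hypothetical identifiers)
protein_to_genome = {"protA1": "virusA", "protA2": "virusA", "protB1": "virusB"}
hits = [
    "read1\tprotA1\t95.0\t33\t1\t0\t1\t99\t10\t42\t1e-10\t70.1",
    "read2\tprotB1\t60.0\t30\t12\t0\t1\t90\t5\t34\t0.5\t25.0",
]
detected = proteins_detected(hits)
coverage = genome_protein_coverage(detected, protein_to_genome)
```

Summarizing by the fraction of a genome's proteins that were hit, rather than by raw read count alone, is one way to make a single conserved protein shared by many viruses less likely to be read as evidence for all of them.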

If you would like to perform this simple read alignment method for viral detection, you can use the code that I've put together at this GitHub repo. There is also a Quay repository hosting a Docker image that you can use to run this analysis without having to install any dependencies.

This codebase is under active development (version 0.1 at time of writing) so please be sure to test carefully against your controls to make sure everything is behaving well. At some point I may end up publishing more about this method, but it may be just too simple to entice any journal.

Lastly, I want to point out that straightforward alignment of reads does not account for any number of confounding factors, including but not limited to:

presence of human or host DNA
shared genome segments between extant viruses
novel viral genomes
complex viral quasi-species
integration of prophages into host genomes

There are a handful of tools out there that do try to deal with some of those problems in different ways, quite likely to good effect. However, it's good to remember that with every additional optimization you add a potential confounding factor. For example, it sounds like a good idea to remove human sequences from your sample, but that runs the risk of also eliminating viral sequences that happen to be found with the human genome, such as lab contaminants or integrated viral genome fragments. There are even a few human genes with deep homology to existing viral genes, thought to be due to an ancient integration and subsequent repurposing. All I mean to say here is that sometimes it's better to remove the assumptions from your code, and instead include a good set of controls in your experiment that can be used to robustly eliminate signal coming from, for example, the human genome.
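One way to picture the control-based approach is the small Python sketch below. It assumes you already have per-virus read counts for each sample and for each negative control (for example, a human-only or reagent-only library). The function name and the two-fold threshold are hypothetical choices for illustration, not a recommended cutoff.

```python
def filter_by_controls(sample_counts, control_counts_list, fold=2.0):
    """Keep viruses whose read count clearly exceeds the control background.

    A virus is kept only if its count in the sample is greater than `fold`
    times the highest count seen for that virus in any control. The default
    of 2.0 is an arbitrary example value.
    """
    kept = {}
    for virus, count in sample_counts.items():
        background = max(
            (controls.get(virus, 0) for controls in control_counts_list),
            default=0,
        )
        if count > fold * background:
            kept[virus] = count
    return kept


# Toy data (hypothetical virus names and counts)
sample = {"phage_X": 120, "herpes_like": 4}
controls = [{"herpes_like": 3}, {}]
kept = filter_by_controls(sample, controls)
```

Here the signal that also shows up in a control is dropped because of what was measured in the experiment, not because of a filtering step that assumes in advance which sequences are human.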