One of the challenges in the field of microbial metagenomics (part of microbiome research) is going from the "bag-of-genes" stage to something more closely resembling biological reality. One of the general approaches that I have always found very compelling is to identify sets of genes that occur at a similar abundance across multiple samples, so-called Co-Abundant Genes (CAGs). In this post I will quickly describe why I like CAGs, and then talk about how I've been approaching the computational challenges associated with identifying them.
Why I like CAGs
The idea of CAGs is very appealing to me because it sits nicely between the level of a single gene and that of a complete genome. It is useful to track collections of genes because of the modular structure of bacterial genomes. As bacteria evolve they can frequently gain or lose large regions containing multiple genes, which may be due to a number of different biological mechanisms including plasmids, phage, transposons, or even just homologous recombination. Therefore it is useful to identify those cases when groups of genes are always observed together at a similar abundance, because it suggests that those genes are almost always found on the same chromosome and can be grouped together as a biologically meaningful entity.
Tackling the challenges of CAGs
The challenges of building CAGs are entirely computational, and are due to the extremely large number of genes that can be observed during metagenomic experiments.
The operation of finding CAGs can be broken into the following steps:
- Calculate a co-abundance metric for each pair of genes
- Identify those groups of genes which have a high degree of co-occurrence
- Validate that those CAGs are biologically meaningful
All of those steps are totally reasonable and work cleanly using off-the-shelf tools in Python or R when you have small numbers of genes (< 1,000). However, once you start having 100,000's and 1,000,000's of genes, the default tools for steps 1 and 2 will either instantly break or take forever. It's not surprising that making a pairwise distance matrix for ~1e6 by ~1e6 would break your computer -- it's got ~1e12 cells of data!
Here is how I have been tackling these challenges:
- Throw away most of the distance metrics. Instead of calculating all of the pairwise distances in a single matrix, iterate over all of the combinations and only store those values which are below some threshold.
- Use as many processors as possible. Distribute the process over many cores and run this on as big of a machine as you can find.
- Use iterators instead of making lists. This is more of a Python-specific note, but instead of iterating over a list of all of the pairwise combinations, make a function that yields each successive combination so that you never have to load the NxN list into memory.
- Cluster the quick and dirty way with single-linkage. There are many different clustering approaches that you could use, but single-linkage is incredibly simple and completely avoids the all-by-all comparison. It can be done with a streaming approach, which helps make this even halfway possible.
The last tip here is use other people's code. You're always going to have to write some code from scratch, but try to copy as much as you can from existing (open source) projects. To this end, I have been working on this problem in a public GitHub repository, and I'd be more than happy for anyone to swipe whatever bits of my code might be useful. I don't expect that anyone else has their data in exactly the way I expect, so it might be best to start with the two core functions, one which generates all of the pairwise connections from a Pandas DataFrame, and another which uses single-linkage clustering to identify CAGs from those connections. This code is not published, it's not heavily tested, and it might not even be terribly readable, but it might still be useful if this is a project you find yourself tackling.