In an attempt to make all of our work as open as possible, I have added a section to the website for public presentation materials under Slides. Not everything will be posted here, but please don’t hesitate to inquire if you saw a presentation and would like to have the slides posted here.
I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.
Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.
When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.
I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.
That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.
Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.
Associating the Microbiome with Disease
We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.
Discovery / Validation Approach
The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.
Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.
Developing Microbiome Therapeutics: Linking Genes to Isolates
While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.
Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,
Start with a survey of people with and without a disease,
Collect WGS data from microbiome samples,
Find microbial CAGs that are associated with disease, and then
Identify isolates in the freezer containing those genes.
That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.
Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.
I’ve been thinking a lot about what people are talking about when they talk about reproducibility. It has been helpful to start to break apart the terminology in order to distinguish between some conceptually distinct, albeit highly intertwined, concepts.
Bioinformatics: Strictly speaking, analysis of data for the purpose of biological research. In practice, the analysis of large files (GBs) with a series of compiled programs, each of which may have a different set of environmental dependencies and computational resource requirements.
Reproducibility: An overarching concept describing how easily a bioinformatic analysis performed at one time may be able to be executed a second time, potentially by a different person, at a different institution, or on a different set of input data. There is also a strict usage of the term which describes the computational property of an analysis in which the analysis of an identical set of inputs will always produce an identical set of outputs. These two meanings are related, but not identical. Bioinformaticians tend to accept a lack of strict reproducibility (e.g., the order of alignment results may not be consistent when multithreading), but very clearly want to have general reproducibility in which the biological conclusions drawn from an analysis will always be the same from identical inputs.
Portability: The ability of researchers at different institutions (or in different labs) to execute the same analysis. This aspect of reproducibility is useful to consider because it highlights the difficulties that are encountered when you move between computational environments. Each set of dependencies, environmental variables, file systems, permissions, hardware, etc., is typically quite different and can cause endless headaches. Some people point to Docker as a primary solution to this problem, but it is typical for Docker to be prohibited on HPCs because it requires root access. Operationally, the problem of portability is a huge one for bioinformaticians who are asked by their collaborators to execute analyses developed by other groups, and the reason why we sometimes start to feel like UNIX gurus more than anything else.
Transparency: The ability of researchers to inspect and understand what analyses are being performed. This is more of a global problem in concept than in practice — people like to talk about how they mistrust black box analyses, but I don’t know anybody who has read through the code for BWA searching for potential bugs. At the local level, I think that the level of transparency that people actually need is at the level of the pipeline or workflow. We want to know what each of the individual tools are that are being invoked, and with what parameters, even if we aren’t qualified (speaking for myself) to debug any Java or C code.
Technical Debt: The amount of work required to mitigate any of the challenges mentioned above. This is the world that we live in which nobody talks about. With infinite time and effort it is possible to implement almost any workflow on almost any infrastructure, but the real question is how much effort it will take. It is important to recognize when you are incurring technical debt that will have to be paid back by yourself or others in the field. My rule of thumb is to think about, for any analysis, how easily I will be able to re-run all of the analyses from scratch when reviewers ask what would be different if we changed a single parameter. If it’s difficult in the slightest for me to do this, it will be almost impossible for others to reproduce my analysis.
I’ve been spending a lot of time recently on workflow managers, and I have found that there are quite a number of systems which provide strict computational reproducibility with a high degree of transparency. The point where they fall down, at no fault of their own, is the ease with which they can be implemented on different computational infrastructures. It is just a complete mess to be able to run an analysis in the exact same way in a diverse set of environments, and it requires that the development teams for those tools devote time and energy to account for all of those eventualities. In a world where very little funding goes to bioinformatics infrastructure reproducibility will always be a challenge, but I am hopeful that things are getting better every day.
When you analyze extremely large datasets, you tend to be guided by your intuition or predictions on how those datasets are composed, or how they will behave. Having studied the microbiome for a while, I would say that my primary rule of thumb for what to expect from any new sample is tons of novel diversity. This week saw the publication of another great paper showing just how true this is.
If you are new to the microbiome, you may be interested to know that there are basically two approaches to figuring out what microbes (bacteria, viruses, etc.) are in a given sample (e.g. stool). You can either (1) compare all of the DNA in that sample to a reference database of microbial genomes, or (2) try to reassemble the genomes in each sample directly from the DNA.
The thesis of this paper is one that I strongly support: reference databases contain very little of the total genomic content of microbes out there in the world. By extension, they predict that (1) would perform poorly, while (2) will generate a much better representation of what microbes are present.
Testing this idea, the authors analyzed an immense amount of microbiome data (almost 10,000 biological samples!), performing the relatively computationally intensive task of reconstructing genomes (so-called _de novo_ assembly).
The authors found a lot of things, but the big message is that they were able to reconstruct a *ton* of new genomes from these samples — organisms that had never been sequenced before, and many that don’t really resemble any phyla that we know of. In other words, they found a lot more novel genomic content than even I expected, and I was sure that they would find a lot.
There’s a lot more content here for microbial genome afficianados, so feel free to dig in on your own (yum yum).
When you think about what microbes are present in the microbiome, remember that there are many new microbes that we’ve never seen before. Some of those are new strains of clearly recognizable species (e.g. E. coli with a dozen new genes), but some will be novel organisms that have never been cultured or sequenced by any lab.
If you’re a scientist, keep that in mind when you are working in this area. If you’re a human, take hope and be encouraged by the fact that there is still a massive undiscovered universe within us, full of potential and amazing new things waiting to be discovered.
After having written a pretty negative assessment of the state of the field for workflow managers (those pieces of software which make it easier to run multiple other pieces of software in a controlled, coordinated manner), I’ve been feeling like I needed to put out an update. The field has changed a lot in the last few months, and I’d like to be less out of date.
A Few Good Options
It turns out that there are a few good options out there: workflow managers that don’t take too long to figure out how to use, which have some cloud computing support, and which have communities of users developing. The two best options I’ve seen so far are Cromwell and Nextflow. Nextflow is pretty popular in Europe and Cromwell is being adopted by the Broad, so they both are reasonable options to try out. I’ve been able to get them both up and running without too much work, but there are some inherent challenges with any workflow manager that I think will always present some stumbling blocks.
Issue 1 — Where do you execute your command?
Fundamentally, a workflow manager executes a set of commands, each of which consumes and produces files. However, the operation of executing a command is completely different whether you’re trying to run it on your laptop, your local HPC, the Google Cloud, AWS, or Azure. Each of those execution options comes with their own idiosyncratic settings for permissions, authentication, formatting, etc. A big part of getting up and running with any workflow manager is getting all of those settings configured in just the right way. It’s not glamorous, but it’s important and it takes time.
Issue 2 — When do you execute your command?
A good workflow manager only executes commands when it’s appropriate — when the inputs are available and the outputs haven’t been produced yet. Doing this properly means that you can restart and rerun workflows without duplicating effort, but that also requires that you can keep track of what commands have been run before. This can also require a bit of effort to configure. As an aside, the traditional training path for bioinformatics folks is to start with BASH scripting, where you run a command when the output files don’t already exist. This is not the method that provides the most reproducible results, and it is also not the method used by Nextflow or Cromwell. I believe that this is the Snakemake model, but I have less experience there. Lots of complexity is hidden inside this issue.
Issue 3 — Where does your data live?
One of the big attractions of a good workflow manager is being able to run the exact same analysis on my laptop, an HPC, or the cloud. However, you really need to have the data live next to the execution environment — it would be insane to download and upload files from my laptop for every single task that’s executed in the cloud. This means that getting up and running with a cloud based workflow manager is getting all of your data organized and accessible in the same system that you want to run the tasks in. This takes time and means that you really have to commit to a model for execution.
While this post is pretty meandering and vague, all I mean to add here is that the area of workflow managers is expanding rapidly and lots of good people are doing great development. That said, the endeavor is fundamentally challenging and it will require a good amount of time to configure everything and get up and running. I encourage you to try out the options that exist and share your experiences with the world. This is the way of the future, and it would be great if we built a better future together.
A paper recently caught my eye, and I think it is a great excuse to talk about data scale and dimensionality.
Vatanen, T. et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature 562, 589–594 (2018). (link)
In addition to having a great acronym for study of child development, they also sequenced 10,913 metagenomes from 783 children.
This is a ton of data.
If you haven’t worked with a “metagenome,” it’s usually about 10-20 million short words, each corresponding to 100-300 bases of a microbial genome. It’s a text file with some combination of ATCG written out over tens of millions of lines, with each line being a few hundred letters long. A single metagenome is big. It won’t open in Word. Now imagine you have 10,000 of them. Now imagine you have to make sense out of 10,000 of them.
Now, I’m being a bit extreme – there are some ways to deal with the data. However, I would argue that it’s this problem, how to deal with the data, that we could use some help with.
The most effective way to deal with the data is to take each metagenome and figure out which organisms are present. This process is called “taxonomic classification” and it’s something that people have gotten pretty good at recently. You take all of those short ATCG words, you match them against all of the genomes you know about, and you use that information to make some educated about what collection of organisms are present. This is a biologically meaningful reduction in the data that results in hundreds or thousands of observations per sample. You can also validate these methods by processing “mock communities” and seeing if you get the right answer. I’m a fan.
With taxonomic classification you end up with thousands of observations (in this case organisms) across X samples. In the TEDDY study they had >10,000 samples, and so this dataset has a lot of statistical power (where you generally want more samples than observations).
The other main way that people analyze metagenomes these days is by quantifying the abundance of each biochemical pathway present in the sample. I won’t talk about this here because my opinions are controversial and it’s best left for another post.
I spend most of my time these days on “gene-level analysis.” This type of analysis tries to quantify every gene present in every genome in every sample. The motivation here is that sometimes genes move horizontally between species, and sometimes different strains within the same species will have different collections of genes. So, if you want to find something that you can’t find with taxonomic analysis, maybe gene-level analysis will pick it up. However, that’s an entirely different can of worms. Let’s open it up.
Every microbial genome contains roughly 1,000 genes. Every metagenome contains a few hundred genomes. So every metagenome contains hundreds of thousands of genes. When you look across a few hundred samples you might find a few million unique genes. When you look across 10,000 samples I can only guess that you’d find tens of millions of unique genes.
Now the dimensionality of the data is all lopsided. We have tens of millions of genes, which are observed across tens of thousands of samples. A biostatistician would tell us that this is seriously underpowered for making sense of the biology. Basically, this is an approach that just doesn’t work for studies with 10,000 samples, which I find to be pretty daunting.
Dealing with scale
The way that we find success in science is that we take information that a human cannot comprehend, and we transform it into something that a human can comprehend. We cannot look at a text file with ten million lines and understand anything about that sample, but we can transform it into a list of organisms with names that we can Google. I’m spending a lot of my time trying to do the same thing with gene-level metagenomic analysis, trying to transform it into something that a human can comprehend. This all falls into the category of “dimensionality reduction,” trying to reduce the number of observations per sample, while still retaining the biological information we care about. I’ll tell you that this problem is really hard and I’m not sure I have the single best true angle on it. I would absolutely love to have more eyes on the problem.
It increasingly seems like the world is driven by people who try to make sense of large amounts of data, and I would humbly ask for anyone who cares about this to try to think about metagenomic analysis. The data is massive, and we have a hard time figuring out how to make sense of it. We have a lot of good starts to this, and there are a lot of good people working in this area (too many to list), but I think we could always use more help.
The authors of the paper who analyzed 10,000 metagenomes learned a lot about how the microbiome develops during early childhood, but I’m sure that there is even more we can learn from this data. And I am also sure that we are getting close to world where we have 10X the data per sample, and experiments with 10X the samples. That is a world that I think we are ill-prepared for, and I’m excited to try to build the tools that we will need for it.
As with many things these days, it started with Twitter and it went further than I expected.
The other day I wrote a slightly snarky tweet
There were a handful of responses to this, almost all of them gently pointing out to me that there are a ton of workflow managers out there, some of which are quite good. So, rather than trying to dive further on Twitter (a fool’s errand), I thought I would explain myself in more detail here.
What is “being a bioinformatician”?
“Bioinformatics” is a term that is broadly used by people like me, and really quite poorly defined. In the most literal sense, it is the analysis of data relating to biology or biomedical science. However, the shade of meaning which has emerged emphasizes the scope and scale of the data being analyzed. In 2018, being a bioinformatician means dealing with large datasets (genome sequencing, flow cytometry, RNAseq, etc.) which is made up of a pretty large number of pretty large files. Not only is the data larger than you can fit into Excel (by many orders of magnitude), but it often cannot fit onto a single computer, and it almost always takes a lot of time and energy to analyze.
The aspect of this definition useful here is that bioinformaticians tend to
keep track of and move around a large number of extremely large files (>1Gb individually, 100’s of Gbs in aggregate)
analyze those files using a “pipeline” of analytical tools — input A is processed by algorithm 1 to produce file B, which is processed by algorithm 2 to produce file C, etc. etc.
Here’s a good counterpoint that was raised to the above:
Good point, now what is a “Workflow Manager”?
A workflow manager is a very specific thing that takes many different forms. At its core, a workflow manager will run a set of individual programs or “tasks” as part of a larger pipeline or “workflow,” automating a process that would typically be executed by (a) a human entering commands manually into the command line, or (b) a “script” containing a list of commands to be executed. There can be a number of differences between a “script” and a “workflow,” but generally speaking the workflow should be more sophisticated, more transportable between computers, and better able to handle the complexities of execution that would simply result in an error for a script.
This is a very unsatisfying definition, because there isn’t a hard and fast delineation between scripts and workflow, and scripts are practically the universal starting place for bioinformaticians as they learn how to get things done with the command line.
Examples of workflow managers (partially culled from the set of responses I got on Twitter):
My Ideal Workflow Manager
I was asked this question, and so I feel slightly justified in laying out my wishlist for a workflow manager:
Tasks consist of BASH snippets run inside Docker containers
Supports execution on a variety of computational resources: local computer, local clusters (SLURM, PBS), commercial clusters (AWS, Google Cloud, Azure)
The dependencies and outputs of a task can be defined by the output files created by the task (so a task isn’t re-run if the output already exists)
Support for file storage locally as well as object stores like AWS S3
Easy to read, write, and publish to a general computing audience (highly subjective)
Easy to set up and get running (highly subjective)
The goal here is to support reproducibility and portability, both to other researchers in the field, but also to your future self who wants to rerun the same analysis with different samples in a year’s time and doesn’t want to be held hostage to software dependency hell, not to mention the crushing insecurity of not knowing whether new results can be compared to previous ones.
Where are we now?
The state of the field at the moment is that we have about a dozen actively maintained projects that are working in this general direction. Ultimately I think the hardest thing to achieve is the last two bullets on my list. Adding support for services which are highly specialized (such as AWS) necessarily adds a ton of configuration and execution complexity that makes it even harder to a new user to pick up and use a workflow that someone hands to them.
Case in point — I like to run things inside Docker containers using AWS Batch, but this requires that all of the steps of a task (coping the files down from S3, running a long set of commands, checking the outputs, and uploading the results back to S3) be encapsulated in a single command. To that end, I have had to write wrapper scripts for each of my tools and bake them into the Docker image so that they can be invoked in a single command. As a result, I’m suck using the Docker containers that I maintain, instead of an awesome resource like BioContainers. This is highly suboptimal, and would be difficult for someone else to elaborate and develop further without completely forking the repo for every single command you want to tweak. Instead, I would much rather if we could all just contribute to and use BioContainers and use a workflow system that took care of all of the complex set of commands executed inside each container.
In the end, I have a lot of confidence that the developers of workflow managers are working towards exactly the end goals that I’ve outlined. This isn’t a highly controversial area, it just requires an investment in computational infrastructure that our R&D ecosystem has always underinvested in. If the NIH decided today that they were going to fund the development and ongoing maintenance of three workflow managers by three independent groups (and their associated OSS communities), we’d have a much higher degree of reproducibility in science, but that hasn’t happened (as far as I know — I am probably making a vast oversimplification here for dramatic effect).
Give workflow managers a try, give back to the community where you can, and let’s all work towards a world where no bioinformatician ever has to run BWA by hand and look up which flag sets the number of threads.
Without really having the time to write a full blog post, I want to mention two recent papers that have strongly influenced my understanding of the microbiome.
The ecological concept of the “niche” is something that is discussed quite often in the field of the microbiome, namely that each bacterial species occupies a niche, and any incoming organism trying to use that same niche will be blocked from establishing itself. The mechanisms and physical factors that cause this “niche exclusion” is probably much more clearly described in the ecological study of plants and animals — in the case of the microbiome I have often wondered just what utility or value this concept really had.
That all changed a few weeks ago with a pair of papers from the Elinav group.
Quick, Quick Summary
At the risk of oversimplifying, I’ll try to summarize the two biggest points I took home from these papers.
Lowering the abundance and diversity of bacteria in the gut can increase the probability that a new strain of bacteria (from a probiotic) is able to grow and establish itself
The ability of a new bacteria (from a probiotic) to grow and persist in the gut varies widely on a person-by-person basis
Basically, the authors showed quite convincingly that the “niche exclusion” effect does indeed happen in the microbiome, and that the degree of niche exclusion is highly dependent on what microbes are present, as well as a host of other unknown factors.
So many more questions
Like any good study, this raises more questions than it answers. What genetic factors determine whether a new strain of bacteria can grow in the gut? Is it even possible to design a probiotic that can grow in the gut of any human? Are the rules for “niche exclusion” consistent across bacterial species or varied?
As an aside, these studies demonstrate the consistent observation that probiotics generally don’t stick around after you take them. If you have to take a probiotic every day in order to sustain its effect, it’s not a real probiotic.
I invite you to read over these papers and take what you can from them. If I manage to put together a more lengthy or interesting summary, I’ll make sure to post it at some point.
The story of microbiome research is one of hope and hype, both elevated to an extreme and fraught with controversy. You need look no further than popular blogs and high profile review articles to see this conflict play out.
As a passionate microbiome researcher, I like to highlight the hope that I see -- the hope that humans will be able to understand and harness the microbiome to improve human health.
The story of hope I have for you today is a story of the heart. The human heart. Well, heart disease. Reducing heart disease, really.
When I was in graduate school I saw the most fascinating lecture. It was from Stanley Hazen, a researcher at the Cleveland Clinic, and he was describing an experiment in which it appeared that bacteria in the gut were responsible for converting a normal part of our diet into a molecule that promoted atherosclerosis (heart disease). With a combination of (1) molecular analysis of the blood of humans with heart disease and (2) experiments in mice varying the diet and microbes present in the gut, they showed pretty convincingly that bacteria were converting phosphatidylcholine from food into TMAO, which then promoted heart disease (Wang, et al. 2011 Nature).
Fast forward 7 years, and the microbiome research field has advanced far enough to identify the exact bacterial genes involved in this process. Not only that, they are able to inhibit those specific bacterial enzymes and show in a mouse model that levels of harmful TMAO are pushed down as a result (Roberts, et al. 2018 Nature).
One aspect of this story that I'll point out for the bioinformatics folks in the audience is that the biological mechanism involved in choline -> TMAO is not a phylogenetically conserved one. It is mediated by a set of genes that are distributed sporadically across the bacterial tree of life. For that reason and others, I am a strong supporter of microbiome analysis tools that enable gene-level comparison between large sets of samples in order to identify mechanisms like these in the future.
My hope, my dream, is that the entire human microbiome field is able to eventually follow this path. We observe that the microbiome influences some aspect of human health, we identify the biological mechanism responsible for this effect, and then we demonstrate our knowledge and mastery of this biology to such an extent that we can intentionally manipulate this system and eventually improve human health.
We have a long way to go, but I believe that this is the path that we can follow, the example we can aspire to. I hope that this story gives you hope, and helps cut through the hype.
I wrote a little bit recently on why I am so interested in making CAGs from metagenomic data, and I wanted to provide a little update on that topic.
CAGs are "Co-Abundant Genes," which is to say the groups of genes which are found at similar abundances in metagenomic data across many different samples. The inference from the "co-abundance" observation is that those genes are likely present on the same bits of DNA (i.e. chromosomes) across those samples. I am particularly interested in these groups of genes because of the highly mosaic nature of microbial genomic evolution.
Having spent a bit of time on maximizing the computational efficiency of finding CAGs, I have been struck by just how hard it is. Naively, the core computational task requires calculating a similarity (or co-abundance) metric for every pair of genes, which scales exponentially with the number of genes. For 100 genes there are ~5,000 comparisons, for 1,000 genes there are ~500,000, for 10,000 genes there are ~50,000,000 comparisons, etc. etc.
Speaking practically, this means that it takes a long time on a very large computer to get this work done. There are all sorts of tricks to make it faster, including parallelization, code optimization, etc., but at the core it is just a hard task.
However, I happened to ask the Twitter hive mind for help, and Andrew Carroll happened to give me some great advice:
A solution of this nature is similar to Approximate Nearest Neighbor, which is worth a read just for general background alone - e.g. https://t.co/umqqFcaOAH— Andrew Carroll (@acarroll_ATG) July 6, 2018
I see this problem come up a lot as scale increases. Law of large numbers is a friend here. Using approx soln usually good
This lead me down the road of exploring the Approximate Nearest Neighbor algorithms. It turns out that there a number of different algorithms out there which approximate this task of finding the closest set of points in highly dimensional space, which is exactly what I needed to make CAGs. There's even a website showing you which of these pieces of code work the best.
Interestingly, one of these algorithms (called "annoy") was produced by Spotify, and can be used to rapidly find the nearest neighbor from a large index that can be mmapped for rapid access across multiple threads.
The point I want to reinforce is that Good Software matters. Implementing an exact solution was taking me 10 days with an expensive 72 core machine, which gave me little opportunity to iterate over different variables or analyze new datasets quickly. The advantage of using a better piece of software is not just that I can make pretty figures, it's that I can actually do science more quickly, spending less money, and getting more accurate answers. I didn't know that this software was out there, and so it wasn't until I asked some people on Twitter for me to find out about it. But now that I know I'm happy to tell the world to give it a try, and hopefully that will save you all some time, some money, and help you find out the answer to whatever questions are important.