Malaria molecular surveillance (MMS), sometimes called “genomic surveillance”, is a powerful approach for understanding what is going on inside a malaria-infected population. When we sequence the genomes of Plasmodium parasites in the blood of infected individuals, the information we get back can tell us something about the history of infections. For example, if we see two people carrying parasites with almost identical genomes then it is likely there is a short chain of infections connecting them. This is because genotypes undergo recombination inside the mosquito, meaning they tend to get shuffled together over subsequent infection events. Similarly, when we sequence a single person we might find evidence of not just one parasite genotype, but a whole population of multiple different genotypes (a high “complexity of infection”, or COI). This is a clue that the person may have been bitten multiple times by different mosquitoes, which in turn tells us something about transmission intensity in the region.
However, both of these examples are more complicated than they first seem. In low-transmission areas we would expect to see many more pairs of people with similar parasite genotypes than we would at high transmission, so we need to be careful when inferring transmission chains. Likewise, in the case of the COI analysis, if somebody is infected with multiple parasite genotypes then this is not necessarily because they received multiple mosquito bites; a single mosquito may have passed on several genotypes simultaneously.
In MMS we try to weigh up the different possible explanations for the observed genetic data so that we can extract useful information while also ensuring that our conclusions are robust. For this reason, MMS tends to be a highly quantitative discipline with a heavy reliance on statistical models. This makes the analysis plan less straightforward than simple prevalence or incidence measures, but in return we obtain a fundamentally different type of information that can be very useful for some malaria control questions.
Let’s say I’ve convinced you that MMS is a potentially useful avenue to pursue, and you want to start sequencing parasites from your study region. The problem becomes: how should I design my MMS study? There are a huge number of variables to pin down, including the number of clusters, the number of samples per cluster, the method of blood extraction, the method of amplification and sequencing, the number and distribution of loci, all the way down to the final analysis tools that will be used to generate results. Long analysis pipelines like this have multiple issues that can jeopardise a project. First, with long pipelines come more points of failure. Although we might be confident that our samples can flow from one step to the next, in reality we should be prepared for some loss of information at every step (for example samples failing QC), meaning we are left with a reduced dataset at the end. Second, it can be difficult to predict how design changes early in the pipeline will influence results later on. It may be that changing the number of targeted loci has a major effect on which software tools can ultimately be used. And finally, bringing this together with some statistical principles, it can be very difficult to power a study - by which we mean design a study in such a way that we have a good chance of detecting something interesting if it is there. For simple prevalence and incidence studies there are formulae and tables in textbooks that can be used to power a study, but for complex MMS pipelines no such simple formulae exist. In the worst case scenario this may mean we are almost guaranteed to get a non-significant result even before starting the study, which is a rather heartbreaking situation to be in, especially after going to all the effort of conducting field and lab work!
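To make the contrast concrete, here is the sort of closed-form calculation that works for a simple prevalence survey. This is the standard textbook formula, not part of SIMPLEGEN, and the numbers are purely illustrative:

```r
# Standard sample size formula for estimating a prevalence p to within a
# margin of error d at a given confidence level (z = 1.96 for 95%)
prevalence_sample_size <- function(p, d, z = 1.96) {
  ceiling(z^2 * p * (1 - p) / d^2)
}

# e.g. estimating a 20% prevalence to within +/- 5 percentage points
prevalence_sample_size(p = 0.2, d = 0.05)
#> [1] 246
```

No equivalent formula exists for the power of a multi-step MMS pipeline, and this is exactly the gap that simulation can fill.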
As a concrete example, let’s imagine you want to explore how fragmented the parasite population is in your region using the software STRUCTURE. This software requires that you know the COI of your samples, meaning you probably want to run something like THE REAL McCOIL beforehand to estimate COI (notice how we are chaining analysis methods together here, which is common in this field). If the number or diversity of loci is too low then you may struggle to reliably estimate COI greater than about 2 using THE REAL McCOIL. Therefore, in high transmission sites where polyclonal infections are common, you can expect to throw away upwards of 50% of your data before carrying out the STRUCTURE analysis. If you had anticipated this beforehand then you could have mitigated it by either 1) increasing the number of samples so that enough remain even after discarding, or 2) increasing the number of loci so that COI can be inferred more accurately. But if not, then you may be left with very weak evidence of population structure that doesn’t really answer the question, and instead begs for a further study with a larger sample size.
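A back-of-the-envelope version of mitigation option 1 can be written down directly. The numbers below are illustrative assumptions, with the 50% loss figure taken from the example above:

```r
# Inflate the number of collected samples so that enough survive the
# expected loss (e.g. polyclonal samples whose COI cannot be estimated)
n_needed      <- 200  # samples required for the downstream STRUCTURE analysis
expected_loss <- 0.5  # fraction of samples we expect to discard

ceiling(n_needed / (1 - expected_loss))
#> [1] 400
```

The harder part is knowing what value of `expected_loss` to plug in, which is where simulation comes in.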
Simulation offers a powerful route out of this problem. By starting with “synthetic data”, meaning data of exactly the same dimensions and file format as the real data but produced from a computer simulation model, we have the opportunity for a trial run of our entire analysis pipeline. Some of the simple points of failure will be identified immediately by this method, for example incompatible file formats, but we can also push the idea further. For example, by simulating from an epidemiological transmission model that does a reasonable job of capturing the sorts of dynamics we expect in our region, we can hopefully produce synthetic data with a similar composition to our final data in terms of e.g. the number of polyclonal infections or the relatedness between samples. We can also make our results more realistic by using an “observation model”, which includes things like introducing sequencing error, random sequencing failures (i.e. missing data) and merging mixed infections. This gives us a better idea of the sort of signal we can expect to see in the real data if we analyse it in exactly the same way. Finally, if we want to be completely thorough, we can simulate data under a model with a known effect of interest (e.g. population structure) and then systematically explore how likely we are to detect this effect under different study design choices - for example obtaining more samples vs. more loci. This gives us an estimate of “empirical power” that can be used to justify a particular study design.
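To give a flavour of what an observation model might look like, here is a minimal sketch in base R. The function name, the matrix encoding and the error rates are all illustrative assumptions rather than the SIMPLEGEN interface:

```r
# Apply a simple observation model to a matrix of simulated genotypes
# (rows = samples, columns = loci, values = allele indices starting at 0)
apply_observation_model <- function(genotypes, error_rate = 0.01,
                                    missing_rate = 0.05, n_alleles = 2) {
  n <- length(genotypes)
  
  # sequencing error: replace a small fraction of calls with a random allele
  is_error <- runif(n) < error_rate
  genotypes[is_error] <- sample(0:(n_alleles - 1), sum(is_error), replace = TRUE)
  
  # random sequencing failure: set a fraction of calls to missing
  genotypes[runif(n) < missing_rate] <- NA
  genotypes
}

set.seed(1)
raw_genotypes <- matrix(rbinom(5 * 10, size = 1, prob = 0.5), nrow = 5)
apply_observation_model(raw_genotypes)
```

Repeating the whole simulate-observe-analyse loop over many random seeds, and recording how often the analysis detects the effect that was built into the simulation, is what produces the empirical power estimate.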
Yes, you can! But for many applications this turns out to be quite inefficient. Some of the most interesting MMS questions actually revolve around neutral genetic data, by which we mean anything that is not under selection. This rules out things like loci responsible for drug resistance, but it still includes large swathes of the genome. If a locus is selectively neutral then by definition it does not interact with the transmission process; in other words, a parasite strain will have the same chance of being transmitted irrespective of the allele at this locus. From a simulation point of view, this means we can completely decouple the process of epidemiological simulation from the process of genetic simulation, i.e. we can simulate transmission first and then overlay genetics afterwards. Compare this with the naive approach of simulating genotypes at the same time as transmission: we would have to pass genotypes back and forth between mosquitoes and human hosts countless times, not to mention simulating recombination in the mosquito midgut, and still the vast majority of these mosquitoes would die before passing on infection. For this reason, we expect decoupled simulation to be faster than joint simulation most of the time, and it achieves its greatest advantage when the final sample is small relative to the total population.
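The following toy sketch illustrates the idea (it is not the SIMPLEGEN data structure): we first record only who infected whom, then trace back the ancestry of a small final sample, so that genetics would only ever need to be simulated on that pruned subset:

```r
set.seed(2)

# Step 1: simulate transmission only, recording who infected whom.
# Host 1 seeds the outbreak; every later host is infected by an earlier one.
n_hosts <- 1000
infected_by <- c(NA, vapply(2:n_hosts, function(i) sample.int(i - 1, 1),
                            integer(1)))

# Step 2: take a small final sample and trace its ancestry backwards
sampled  <- sample.int(n_hosts, 20)
ancestry <- frontier <- sampled
while (length(frontier) > 0) {
  parents  <- unique(infected_by[frontier])
  frontier <- setdiff(parents[!is.na(parents)], ancestry)
  ancestry <- c(ancestry, frontier)
}

# genetics only needs simulating over this subset of the population
length(ancestry) / n_hosts
```

For a sample of 20 hosts out of 1000 this typically retains only a small fraction of the population, which is where the speed advantage of decoupled simulation comes from.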
This is not to say that forwards simulation of genotypes is always a bad idea, and indeed for loci under selection it may be the simplest approach. There are several excellent simulators of this sort that have been published in recent years. But if our aim is to analyse neutral genetic markers in large populations then SIMPLEGEN is likely to be competitive in terms of speed.
SIMPLEGEN is a simulation pipeline. It breaks down the process described above into a series of distinct steps with well-defined inputs and outputs. Briefly, these steps are:
The pipeline can be accessed at any point - for example, if you already have a model that produces results in the pedigree (.ped) format used in step 4 then you can jump in at this stage. However, it is anticipated that most people will run through from start to finish. A series of models is distributed with the pipeline, but in general it is intended that anyone can use any model, as long as it produces results in compatible formats.
This may still feel a little abstract, so here are some concrete examples of things you will be able to do with the SIMPLEGEN pipeline:
SIMPLEGEN is an R package that uses C++ under the hood. Start by installing the package, then follow the series of tutorials to get up and running.
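A typical installation route for an R package developed on GitHub looks like the following; the repository path shown is our assumption here, so check the package README for the definitive instructions:

```r
# install the development version from GitHub (repository path assumed)
install.packages("devtools")
devtools::install_github("mrc-ide/SIMPLEGEN")

library(SIMPLEGEN)
```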
Yes and no. The ideas underpinning SIMPLEGEN are definitely not new; in fact, the ideas of efficient tree recording and of decoupling simulation processes form a growing area in population genetics with some fantastic recent developments. However, in our search for similar tools we have not found anything that is ideally suited for use in malaria. The existing tools have a slightly different focus, to the extent that even if they could be used to do something similar to what we are proposing, the activation energy for the user would be much higher.