Skip to contents

In previous pages we learned how to encode genetic data in STAVE. Now we turn to the problem of calculating prevalence from the encoded data.

In simple terms, prevalence is calculated as the number of times a variant is observed divided by the number of times it could have been observed. However, nuances in prevalence calculation warrant further discussion — particularly in cases of mixed (heterozygous) calls, where it may only be possible to estimate a prevalence range rather than an exact value.

Matching variant strings

Imagine the following data are loaded:

study_id survey_id variant_string variant_num total_num
study_01 study_01_survey_01 crt:76:T 4 20
study_01 study_01_survey_01 crt:75_76:ET 1 5

Prevalence calculation in STAVE starts by specifying the variant of interest. For example:

s$get_prevalence(target_variant = "crt:76:T")

This is then compared against all data loaded into the current object.

For example, crt:76:T is clearly a direct match to the first row, and so these values should be added to the numerator and denominator. It is also a match to the second row, which contains crt:76:T as a subset, meaning these values should also be added to the numerator and denominator. This gives prevalence = (4 + 1) / (20 + 5) = 20%.

Now imagine we want to know the prevalence of crt:76:E. This is only a match to the second row in terms of the variant, however, it is a match to both rows in terms of position. In other words, both rows could have produced allele E at position 76. Hence, we obtain prevalence = 1 / (20 + 5) = 4%.

Finally, imagine we want to know the prevalence of the crt:75_76:ET haplotype. This is only a match to the second row in both numerator and denominator. Hence, we obtain prevalence = 1 / 5 = 20%.

Dealing with unphased mixed calls

Now let’s look at an example where there are unphased mixed calls:

study_id survey_id variant_string variant_num total_num
study_02 study_02_survey_01 crt:72-76:CVIET 50 100
study_02 study_02_survey_01 crt:72-76:CVMNK 40 100
study_02 study_02_survey_01 crt:72-76:CV_I/M_E/N_T/K 10 100

Here we have two common pfcrt haplotypes, CVIET and CVMNK, as well as some mixed calls.

Imagine we want to know the prevalence of crt:72-74:CVI. This is a match to the first row, and interestingly it is also an unambiguous match to the third row. The third variant crt:72-76:CV_I/M_E/N_T/K contains multiple unphased heterozygous sites, meaning we cannot work out the exact component haplotypes that make up this mixture. However, looking solely at positions 72 to 74, we know for certain that the haplotypes CVI and CVM are present. Therefore we can be certain that this matches our target. We obtain prevalence = (50 + 10) / 100 = 60%.

Now imagine we want to know the prevalence of the full crt:72-76:CVMNK haplotype. This matches the second row, but because there are multiple unphased heterozygous sites in the third row, it is only an ambiguous match to this variant. Our target may be present in these samples, but equally it may not. This presents a problem - if we include this in the numerator then we risk overestimating the prevalence, but if we exclude it then we risk underestimating the prevalence. Faced with this dilemma, the approach taken in STAVE is to simply report the minimum and maximum possible prevalence implied by the data. This would give min_prevalence = 40 / 100 = 40%, max_prevalence = (40 + 10) / 100 = 50%. It is then up to the user to decide what to do with this information. Valid approaches include taking the min, the max, the midpoint, or doing more advanced statistical imputation based on per-locus prevalence. Whatever choice is made, it should be clearly documented.

Dealing with phased mixed calls

Finally, imagine the same data as before, but now the mixed haplotype is encoded as phased information:

study_id survey_id variant_string variant_num total_num
study_03 study_03_survey_01 crt:72-76:CVIET 50 100
study_03 study_03_survey_01 crt:72-76:CVMNK 40 100
study_03 study_03_survey_01 crt:72-76:CV_I|M_E|N_T|K 10 100

The mixtures are now clearly specified as a combination of the CVIET and CVMNK haplotypes. If our target is crt:72-76:CVMNK as before, this is now an unambiguous match against the second and third rows, giving prevalence = (40 + 10) / 100 = 50%. This is a strong argument for including phased data wherever possible, although in reality it can be challenging to reliably phase genomes.

More complex situations, such as partial phasing or more than two alleles at a locus, are allowed - see the variantstring package documentation for details.


The next page goes on to explain some of the ways that we must be careful when entering or interpreting data.