Calculating Prevalence
howto_calculating_prevalence.RmdIn previous pages we learned how to encode genetic data in STAVE. Now we turn to the problem of calculating prevalence from the encoded data.
In simple terms, prevalence is calculated as the number of times a variant is observed divided by the number of times it could have been observed. However, nuances in prevalence calculation warrant further discussion — particularly in cases of mixed (heterozygous) calls, where it may only be possible to estimate a prevalence range rather than an exact value.
Matching variant strings
Imagine the following data are loaded:
| study_id | survey_id | variant_string | variant_num | total_num |
|---|---|---|---|---|
| study_01 | study_01_survey_01 | crt:76:T | 4 | 20 |
| study_01 | study_01_survey_01 | crt:75_76:ET | 1 | 5 |
Prevalence calculation in STAVE starts by specifying the variant of interest. For example:
s$get_prevalence(target_variant = "crt:76:T")This is then compared against all data loaded into the current object.
For example, crt:76:T is clearly a direct match to the
first row, and so these values should be added to the numerator and
denominator. It is also a match to the second row, which contains
crt:76:T as a subset, meaning these values should also be
added to the numerator and denominator. This gives prevalence = (4 + 1)
/ (20 + 5) = 20%.
Now imagine we want to know the prevalence of crt:76:E.
This is only a match to the second row in terms of the variant, however,
it is a match to both rows in terms of position. In other words, both
rows could have produced allele E at position 76.
Hence, we obtain prevalence = 1 / (20 + 5) = 4%.
Finally, imagine we want to know the prevalence of the
crt:75_76:ET haplotype. This is only a match to the second
row in both numerator and denominator. Hence, we obtain prevalence = 1 /
5 = 20%.
Dealing with unphased mixed calls
Now let’s look at an example where there are unphased mixed calls:
| study_id | survey_id | variant_string | variant_num | total_num |
|---|---|---|---|---|
| study_02 | study_02_survey_01 | crt:72-76:CVIET | 50 | 100 |
| study_02 | study_02_survey_01 | crt:72-76:CVMNK | 40 | 100 |
| study_02 | study_02_survey_01 | crt:72-76:CV_I/M_E/N_T/K | 10 | 100 |
Here we have two common pfcrt haplotypes, CVIET and CVMNK, as well as some mixed calls.
Imagine we want to know the prevalence of crt:72-74:CVI.
This is a match to the first row, and interestingly it is also an
unambiguous match to the third row. The third variant
crt:72-76:CV_I/M_E/N_T/K contains multiple unphased
heterozygous sites, meaning we cannot work out the exact component
haplotypes that make up this mixture. However, looking solely at
positions 72 to 74, we know for certain that the haplotypes CVI and CVM
are present. Therefore we can be certain that this matches our target.
We obtain prevalence = (50 + 10) / 100 = 60%.
Now imagine we want to know the prevalence of the full
crt:72-76:CVMNK haplotype. This matches the second row, but
because there are multiple unphased heterozygous sites in the third row,
it is only an ambiguous match to this variant. Our target
may be present in these samples, but equally it may not. This
presents a problem - if we include this in the numerator then we risk
overestimating the prevalence, but if we exclude it then we risk
underestimating the prevalence. Faced with this dilemma, the approach
taken in STAVE is to simply report the minimum and maximum
possible prevalence implied by the data. This would give
min_prevalence = 40 / 100 = 40%,
max_prevalence = (40 + 10) / 100 = 50%. It is then up to
the user to decide what to do with this information. Valid approaches
include taking the min, the max, the midpoint, or doing more advanced
statistical imputation based on per-locus prevalence. Whatever choice is
made, it should be clearly documented.
Dealing with phased mixed calls
Finally, imagine the same data as before, but now the mixed haplotype is encoded as phased information:
| study_id | survey_id | variant_string | variant_num | total_num |
|---|---|---|---|---|
| study_03 | study_03_survey_01 | crt:72-76:CVIET | 50 | 100 |
| study_03 | study_03_survey_01 | crt:72-76:CVMNK | 40 | 100 |
| study_03 | study_03_survey_01 | crt:72-76:CV_I|M_E|N_T|K | 10 | 100 |
The mixtures are now clearly specified as a combination of the CVIET
and CVMNK haplotypes. If our target is crt:72-76:CVMNK as
before, this is now an unambiguous match against the second and third
rows, giving prevalence = (40 + 10) / 100 = 50%. This is a strong
argument for including phased data wherever possible, although in
reality it can be challenging to reliably phase genomes.
More complex situations, such as partial phasing or more than two alleles at a locus, are allowed - see the variantstring package documentation for details.
The next page goes on to explain some of the ways that we must be careful when entering or interpreting data.