Calculating Prevalence
calculating_prevalence.Rmd
In previous pages we have learned how to encode genetic data in STAVE. Now we turn to the problem of calculating prevalence from the encoded data.
In simple terms, prevalence is calculated as the number of times a variant is observed divided by the number of times it could have been observed. However, nuances in prevalence calculation warrant further discussion — particularly in cases of mixed (heterozygous) calls, where it may only be possible to estimate a prevalence range rather than an exact value.
Matching variant strings
Imagine the following data are loaded:
study_key | survey_key | variant_string | variant_num | total_num |
---|---|---|---|---|
study_01 | survey_01 | crt:76:T | 4 | 20 |
study_01 | survey_01 | crt:75_76:ET | 1 | 5 |
Prevalence calculation in STAVE starts by specifying the variant of interest in variant string format. This is then compared against all data loaded into the current object.
For example, imagine we want to know the prevalence of
crt:76:T
. This is clearly a direct match to the first row,
and so these values should be added to the numerator and denominator. It
is also a match to the second row, which contains crt:76:T
as a subset, meaning these values should also be added to the numerator
and denominator. This gives prevalence = (4 + 1) / (20 + 5) = 20%.
Now imagine we want to know the prevalence of crt:76:E
.
This is only a match to the second row in terms of the variant, however,
it is a match to both rows in terms of position. Hence, we obtain
prevalence = 1 / (20 + 5) = 4%.
Finally, imagine we want to know the prevalence of the
crt:75_76:ET
haplotype. This is only a match to the second
row in both numerator and denominator. We obtain prevalence = 1 / 5 =
20%.
Dealing with unphased mixed calls
Now let’s look at an example where there are unphased mixed calls:
study_key | survey_key | variant_string | variant_num | total_num |
---|---|---|---|---|
study_02 | survey_01 | crt:72_73_74_75_76:CVIET | 50 | 100 |
study_02 | survey_01 | crt:72_73_74_75_76:CVMNK | 40 | 100 |
study_02 | survey_01 | crt:72_73_74_75_76:CV_I/M_E/N_T/K | 10 | 100 |
Here we have two common pfcrt haplotypes, CVIET and CVMNK, as well as some mixed calls.
Imagine we want to know the prevalence of
crt:72_73_74:CVI
. This is a match to the first row, and
interestingly it is also an unambiguous match to the third row.
The third variant crt:72_73_74_75_76:CV_I/M_E/N_T/K
contains multiple unphased heterozygous sites, meaning we cannot work
out the exact component haplotypes that make up this mixture. However,
looking solely at positions 72 to 74, we know for certain that the
haplotypes CVI and CVM are present. Therefore we can be certain that
this matches our target. We obtain prevalence = (50 + 10) / 100 =
60%.
Now imagine we want to know the prevalence of the full
crt:72_73_74_75_76:CVMNK
haplotype. This matches the second
row, but because there are multiple unphased heterozygous sites in the
third row, it is only an ambiguous match to this variant. Our
target may be present in these samples, but equally it may not.
This presents a problem - if we include this in the numerator then we
risk overestimating the prevalence, but if we exclude it then we risk
underestimating. Faced with this dilemma, the approach taken in STAVE is
to simply report the minimum and maximum possible
prevalence implied by the data. This would give min_prevalence
= 40 / 100 = 40%, max_prevalence = (40 + 10) / 100 = 50%. It is then up
to the user to decide what to do with this.
Dealing with phased mixed calls
Finally, imagine the same data as before, but now the mixed haplotype is encoded as phased information:
study_key | survey_key | variant_string | variant_num | total_num |
---|---|---|---|---|
study_03 | survey_01 | crt:72_73_74_75_76:CVIET | 50 | 100 |
study_03 | survey_01 | crt:72_73_74_75_76:CVMNK | 40 | 100 |
study_03 | survey_01 | crt:72_73_74_75_76:CV_I|M_E|N_T|K | 10 | 100 |
The mixtures are now clearly specified as a combination of the CVIET
and CVMNK haplotypes. If our target is
crt:72_73_74_75_76:CVMNK
as before, this is now an
unambiguous match against the second and third rows, giving prevalence =
(40 + 10) / 100 = 50%. This is one argument for including phased data
wherever possible, although in reality it can be challenging to reliably
phase genomes.
More complex situations, such as partial phasing or more than two alleles at a locus, are allowed - see the variantstring package documentation.
(Old stuff)
Prevalence calculation is through the function
get_prevalence()
. It returns the prevalence point estimate
(%) along with lower and upper 95% CIs. It does this for every study in
the loaded data, and returns results appended together over the
Studies and Surveys table. An example is given below;
scroll to the right to see the prevalence estimates:
study_id | study_name | study_type | authors | publication_year | url | survey_id | country_name | site_name | latitude | longitude | spatial_notes | collection_start | collection_end | collection_day | time_notes | numerator | denominator | prevalence | prevalence_lower | prevalence_upper |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
study_01 | NA | other | NA | NA | example_data | site_01 | NA | NA | 0 | 0 | example data | NA | NA | 2020-01-01 | example data | 5 | 25 | 20 | 6.831146 | 40.70374 |
By now, you should have a good idea for how STAVE works. Jump to the Installation and Tutorials sections to start putting some of these ideas into practice.