Skip to contents

In previous pages we have learned how to encode genetic data in STAVE. Now we turn to the problem of calculating prevalence from the encoded data.

In simple terms, prevalence is calculated as the number of times a variant is observed divided by the number of times it could have been observed. However, nuances in prevalence calculation warrant further discussion — particularly in cases of mixed (heterozygous) calls, where it may only be possible to estimate a prevalence range rather than an exact value.

Matching variant strings

Imagine the following data are loaded:

study_key survey_key variant_string variant_num total_num
study_01 survey_01 crt:76:T 4 20
study_01 survey_01 crt:75_76:ET 1 5

Prevalence calculation in STAVE starts by specifying the variant of interest in variant string format. This is then compared against all data loaded into the current object.

For example, imagine we want to know the prevalence of crt:76:T. This is clearly a direct match to the first row, and so these values should be added to the numerator and denominator. It is also a match to the second row, which contains crt:76:T as a subset, meaning these values should also be added to the numerator and denominator. This gives prevalence = (4 + 1) / (20 + 5) = 20%.

Now imagine we want to know the prevalence of crt:76:E. This is only a match to the second row in terms of the variant, however, it is a match to both rows in terms of position. Hence, we obtain prevalence = 1 / (20 + 5) = 4%.

Finally, imagine we want to know the prevalence of the crt:75_76:ET haplotype. This is only a match to the second row in both numerator and denominator. We obtain prevalence = 1 / 5 = 20%.

Dealing with unphased mixed calls

Now let’s look at an example where there are unphased mixed calls:

study_key survey_key variant_string variant_num total_num
study_02 survey_01 crt:72_73_74_75_76:CVIET 50 100
study_02 survey_01 crt:72_73_74_75_76:CVMNK 40 100
study_02 survey_01 crt:72_73_74_75_76:CV_I/M_E/N_T/K 10 100

Here we have two common pfcrt haplotypes, CVIET and CVMNK, as well as some mixed calls.

Imagine we want to know the prevalence of crt:72_73_74:CVI. This is a match to the first row, and interestingly it is also an unambiguous match to the third row. The third variant crt:72_73_74_75_76:CV_I/M_E/N_T/K contains multiple unphased heterozygous sites, meaning we cannot work out the exact component haplotypes that make up this mixture. However, looking solely at positions 72 to 74, we know for certain that the haplotypes CVI and CVM are present. Therefore we can be certain that this matches our target. We obtain prevalence = (50 + 10) / 100 = 60%.

Now imagine we want to know the prevalence of the full crt:72_73_74_75_76:CVMNK haplotype. This matches the second row, but because there are multiple unphased heterozygous sites in the third row, it is only an ambiguous match to this variant. Our target may be present in these samples, but equally it may not. This presents a problem - if we include this in the numerator then we risk overestimating the prevalence, but if we exclude it then we risk underestimating. Faced with this dilemma, the approach taken in STAVE is to simply report the minimum and maximum possible prevalence implied by the data. This would give min_prevalence = 40 / 100 = 40%, max_prevalence = (40 + 10) / 100 = 50%. It is then up to the user to decide what to do with this.

Dealing with phased mixed calls

Finally, imagine the same data as before, but now the mixed haplotype is encoded as phased information:

study_key survey_key variant_string variant_num total_num
study_03 survey_01 crt:72_73_74_75_76:CVIET 50 100
study_03 survey_01 crt:72_73_74_75_76:CVMNK 40 100
study_03 survey_01 crt:72_73_74_75_76:CV_I|M_E|N_T|K 10 100

The mixtures are now clearly specified as a combination of the CVIET and CVMNK haplotypes. If our target is crt:72_73_74_75_76:CVMNK as before, this is now an unambiguous match against the second and third rows, giving prevalence = (40 + 10) / 100 = 50%. This is one argument for including phased data wherever possible, although in reality it can be challenging to reliably phase genomes.

More complex situations, such as partial phasing or more than two alleles at a locus, are allowed - see the variantstring package documentation.

(Old stuff)

Prevalence calculation is through the function get_prevalence(). It returns the prevalence point estimate (%) along with lower and upper 95% CIs. It does this for every study in the loaded data, and returns results appended together over the Studies and Surveys table. An example is given below; scroll to the right to see the prevalence estimates:

study_id study_name study_type authors publication_year url survey_id country_name site_name latitude longitude spatial_notes collection_start collection_end collection_day time_notes numerator denominator prevalence prevalence_lower prevalence_upper
study_01 NA other NA NA example_data site_01 NA NA 0 0 example data NA NA 2020-01-01 example data 5 25 20 6.831146 40.70374

By now, you should have a good idea for how STAVE works. Jump to the Installation and Tutorials sections to start putting some of these ideas into practice.