Cautionary Notes
Although STAVE rigorously defines the data structure, there are still ways of obtaining misleading results depending on how the data are entered. Given that STAVE encourages the use of exact spatial coordinates, this structure also has the potential to identify individuals, which has ethical implications. This section highlights some common pitfalls and how to avoid them.
Do I need to record wild-type samples?
The table below encodes not only the mutant haplotypes of interest
but also how often the wild-type pfcrt haplotype
(CVMNK) was observed.
| study_id | survey_id | variant_string | variant_num | total_num | notes |
|---|---|---|---|---|---|
| study_01 | study_01_site_01 | crt:72-76:CVIET | 20 | 100 | NA |
| study_01 | study_01_site_01 | crt:72-76:SVMNT | 10 | 100 | NA |
| study_01 | study_01_site_01 | crt:72-76:CVMNK | 70 | 100 | NA |
Notice that the sum of the numerators across all variants equals the
denominator. This tells us that the entire space of possible haplotypes
is covered. Because the data are complete, we can query the prevalence
of any allele at any locus and obtain a correct result. For example,
here is the prevalence of the wild-type allele (C) at codon
72:
| survey_id | numerator | denominator | prevalence | prevalence_lower | prevalence_upper |
|---|---|---|---|---|---|
| study_01_site_01 | 90 | 100 | 90 | 82.37774 | 95.09953 |
This is correct - the C allele was seen 20 times in the
CVIET mutant haplotype and 70 times in the WT
CVMNK haplotype. Therefore the prevalence is (20 + 70) /
100 = 90%.
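The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch of the counting logic, not STAVE's own implementation: the helper `allele_count` is hypothetical, and it assumes the simple `gene:start-end:RESIDUES` form of `variant_string` shown in the table, with one amino-acid letter per codon.

```python
def allele_count(rows, gene, codon, amino_acid):
    """Sum variant_num over haplotypes carrying amino_acid at codon of gene.

    Assumption: every variant_string has the simple 'gene:start-end:RESIDUES'
    form used in the table above (one letter per codon, contiguous range).
    """
    total = 0
    for variant_string, variant_num in rows:
        g, positions, residues = variant_string.split(":")
        start, end = (int(p) for p in positions.split("-"))
        if g != gene or not start <= codon <= end:
            continue  # this haplotype says nothing about the queried codon
        if residues[codon - start] == amino_acid:
            total += variant_num
    return total


# (variant_string, variant_num) pairs copied from the complete table
complete = [
    ("crt:72-76:CVIET", 20),
    ("crt:72-76:SVMNT", 10),
    ("crt:72-76:CVMNK", 70),
]
total_num = 100

numerator = allele_count(complete, "crt", 72, "C")
prevalence = 100 * numerator / total_num
print(numerator, prevalence)  # 90 90.0
```

Both CVIET and CVMNK contribute to the count because each carries C at codon 72, which is why the WT row matters for this query.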
However, many studies do not report how many samples were wild-type. Instead, they only report “interesting” haplotypes — typically those carrying resistance-associated mutations. In the dataset below, the only difference is that the WT observations have been omitted:
| study_id | survey_id | variant_string | variant_num | total_num | notes |
|---|---|---|---|---|---|
| study_02 | study_02_site_01 | crt:72-76:CVIET | 20 | 100 | NA |
| study_02 | study_02_site_01 | crt:72-76:SVMNT | 10 | 100 | NA |
What happens if we now attempt to estimate the prevalence of 72 C?
| numerator | denominator | prevalence | prevalence_lower | prevalence_upper |
|---|---|---|---|---|
| 20 | 100 | 20 | 12.66556 | 29.18427 |
The result is clearly wrong: the estimate of 20% is far too low. This
happens because, once the WT haplotype is omitted, the only remaining
records of the C allele at codon 72 come from samples that
were mutant somewhere else in the haplotype. All the true WT samples are
invisible to STAVE, so they cannot be counted.
This naturally raises the question: Should we always record WT haplotypes so that numerators sum to the denominator?
In an ideal world, yes. If the complete distribution of haplotypes is known, including all WT observations, many downstream queries become trivial and robust.
In practice, this is often impossible because:
- the authors focused on certain mutations, meaning we cannot assume all unreported haplotypes are WT,
- the reference genome used in analysis is not stated,
- the data have already been summarised before reaching you.
If WT samples were not recorded, STAVE cannot invent them. Prevalence estimates for any allele carried by the unrecorded haplotypes will therefore be too low, as in the 72 C example above, and uncertainty intervals will reflect only the partial data.
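One way to spot this situation before running any queries is to compare the sum of the numerators with the denominator for each survey. The sketch below does this in plain Python; `coverage_by_survey` is a hypothetical helper, not part of STAVE, and it assumes all rows of a survey describe the same locus and share the same `total_num`, as in the two tables above.

```python
from collections import defaultdict


def coverage_by_survey(rows):
    """Return {survey_id: (sum of variant_num, total_num)}.

    Assumption: every row of a survey refers to the same locus and
    repeats the same total_num, as in the example tables.
    """
    sums = defaultdict(int)
    totals = {}
    for survey_id, variant_num, total_num in rows:
        sums[survey_id] += variant_num
        totals[survey_id] = total_num
    return {s: (sums[s], totals[s]) for s in sums}


# (survey_id, variant_num, total_num) copied from the two example tables
rows = [
    ("study_01_site_01", 20, 100),
    ("study_01_site_01", 10, 100),
    ("study_01_site_01", 70, 100),
    ("study_02_site_01", 20, 100),
    ("study_02_site_01", 10, 100),
]

for survey_id, (observed, total) in coverage_by_survey(rows).items():
    if observed == total:
        print(survey_id, "complete")
    else:
        print(survey_id, f"{total - observed} samples unaccounted for")
```

A shortfall only flags that part of the haplotype distribution is missing. It does not show that the missing samples were WT, for the reasons listed above.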
Restricted or identifying data
STAVE is designed to handle both public and non-public datasets. The
access_level field allows users to mark each study as
public, restricted, or private, and STAVE itself places no limitations
on loading or analysing restricted material. This design choice allows
individual laboratories and surveillance teams to use STAVE with their
own internal datasets, even when those datasets cannot be shared
publicly.
However, this also means that users are entirely responsible for ensuring that STAVE objects containing restricted or private data do not enter the public domain. STAVE provides the structure, but it cannot enforce data-governance rules; those must be respected by the analysts who use it.
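As an illustration of the kind of check an analyst might run before sharing anything, the sketch below keeps only studies explicitly marked as public. It works on plain Python dictionaries rather than a real STAVE object; the helper `public_only` and the studies `study_03` and `study_04` are hypothetical, and only the `access_level` field and its three values come from the description above.

```python
def public_only(studies):
    """Keep only studies explicitly marked 'public'.

    Studies with a missing or unrecognised access_level are dropped,
    so an unlabelled study is never shared by accident.
    """
    return [s for s in studies if s.get("access_level") == "public"]


studies = [
    {"study_id": "study_01", "access_level": "public"},
    {"study_id": "study_02", "access_level": "restricted"},
    {"study_id": "study_03", "access_level": "private"},
    {"study_id": "study_04"},  # access level not recorded
]

shareable = public_only(studies)
print([s["study_id"] for s in shareable])  # ['study_01']
```

In a real workflow the surveys and counts linked to the excluded studies would need to be removed as well; this sketch covers only the study-level filter.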
A second consideration relates to identifiability. Aggregate genetic data are often assumed to be non-identifying because they represent groups rather than individuals. However, STAVE encourages high-resolution spatial recording of sampling locations, and at sufficiently fine granularity this assumption may break down. For example, if a single individual presents to a precisely geo-referenced health facility on a specific day, those data could be identifying even in aggregated form.
For this reason, users must ensure that the spatial and
temporal resolution of their data is ethically and legally
appropriate. This may require intentionally reducing precision
— for example, by jittering, rounding, or otherwise anonymising
coordinates before constructing a STAVE object. If such steps are taken,
they should be clearly documented in the location_notes
(and, if relevant, time_notes) fields so that any
downstream user understands the provenance and limitations of the
spatial information.
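A minimal sketch of the rounding option is shown below, assuming a survey record held as a plain Python dictionary. The field names `latitude` and `longitude`, the helper `coarsen_location`, and the coordinates themselves are made up for illustration; only `location_notes` is taken from the text above. Rounding to one decimal place corresponds to roughly 11 km in latitude, and the appropriate precision will depend on the setting.

```python
def coarsen_location(record, decimals=1):
    """Return a copy of record with rounded coordinates and a note of the change."""
    out = dict(record)  # leave the original record untouched
    out["latitude"] = round(record["latitude"], decimals)
    out["longitude"] = round(record["longitude"], decimals)
    note = (
        f"Coordinates rounded to {decimals} decimal place(s) "
        "to reduce identifiability."
    )
    existing = record.get("location_notes")
    out["location_notes"] = f"{existing} {note}" if existing else note
    return out


# Hypothetical record; the coordinates are invented for this example
site = {
    "survey_id": "study_01_site_01",
    "latitude": -6.81234,
    "longitude": 39.27981,
    "location_notes": None,
}

coarse = coarsen_location(site, decimals=1)
print(coarse["latitude"], coarse["longitude"])  # -6.8 39.3
print(coarse["location_notes"])
```

Writing the note in the same step as the rounding keeps the documentation from drifting out of sync with the data.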
If you have read all the documentation up to this point, then god bless you! You should have a good understanding of how STAVE works, and are ready to jump to the Installation and Tutorials sections to start putting some of these ideas into practice.