
Although STAVE rigorously defines the data structure, there are still ways of obtaining misleading results depending on how the data are entered. Given that STAVE encourages the use of exact spatial coordinates, this structure also has the potential to identify individuals, which has ethical implications. This section highlights some common pitfalls and how to avoid them.

Do I need to record wild type samples?

The table below encodes not only the mutant haplotypes of interest but also how often the wild-type pfcrt haplotype (CVMNK) was observed.

study_id  survey_id         variant_string   variant_num  total_num  notes
study_01  study_01_site_01  crt:72-76:CVIET  20           100        NA
study_01  study_01_site_01  crt:72-76:SVMNT  10           100        NA
study_01  study_01_site_01  crt:72-76:CVMNK  70           100        NA

Notice that the sum of the numerators across all variants equals the denominator. This tells us that the entire space of possible haplotypes is covered. Because the data are complete, we can query the prevalence of any allele at any locus and obtain a correct result. For example, here is the prevalence of the wild-type allele (C) at codon 72:

s$get_prevalence(target_variant = "crt:72:C") |>
  ... # (further filters to only show some columns)

survey_id         numerator  denominator  prevalence  prevalence_lower  prevalence_upper
study_01_site_01  90         100          90          82.37774          95.09953

This is correct: the C allele was observed 20 times in the mutant CVIET haplotype and 70 times in the wild-type (WT) CVMNK haplotype, so the prevalence is (20 + 70) / 100 = 90%.
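The same arithmetic can be reproduced outside STAVE as a sanity check. The base R sketch below is not STAVE's implementation: the helper allele_count() is hypothetical, and the interval is assumed to be an exact binomial (Clopper-Pearson) interval, which appears consistent with the bounds in the output above.

```r
# Haplotype counts copied from the table above.
counts <- data.frame(
  variant_string = c("crt:72-76:CVIET", "crt:72-76:SVMNT", "crt:72-76:CVMNK"),
  variant_num    = c(20, 10, 70),
  total_num      = c(100, 100, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of STAVE): sum the counts of all haplotypes
# that carry `allele` at `codon`. Assumes strings of the form
# gene:start-end:aminoacids, as in the table above.
allele_count <- function(df, codon, allele) {
  parts <- strsplit(df$variant_string, ":", fixed = TRUE)
  start <- vapply(
    parts,
    function(p) as.integer(strsplit(p[2], "-", fixed = TRUE)[[1]][1]),
    integer(1)
  )
  haplo    <- vapply(parts, function(p) p[3], character(1))
  offset   <- codon - start + 1
  observed <- substr(haplo, offset, offset)
  sum(df$variant_num[observed == allele])
}

numerator   <- allele_count(counts, codon = 72, allele = "C")
denominator <- counts$total_num[1]
prevalence  <- 100 * numerator / denominator

# Exact binomial interval, expressed as a percentage (an assumption about
# how the bounds were computed, not something the output states).
ci <- 100 * binom.test(numerator, denominator)$conf.int
```

Here numerator is 20 + 70 = 90, because both CVIET and CVMNK carry C at codon 72, while SVMNT does not.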

However, many studies do not report how many samples were wild-type. Instead, they only report “interesting” haplotypes — typically those carrying resistance-associated mutations. In the dataset below, the only difference is that the WT observations have been omitted:

study_id  survey_id         variant_string   variant_num  total_num  notes
study_02  study_02_site_01  crt:72-76:CVIET  20           100        NA
study_02  study_02_site_01  crt:72-76:SVMNT  10           100        NA

What happens if we now estimate the prevalence of the C allele at codon 72?

s$get_prevalence(target_variant = "crt:72:C") |>
  ... # (further filters to only show some columns)
numerator denominator prevalence prevalence_lower prevalence_upper
20 100 20 12.66556 29.18427

The result is clearly wrong: the estimate of 20% is far too low. This happens because, once the WT haplotype is omitted, the only remaining records of the C allele at codon 72 come from samples that were mutant somewhere else in the haplotype. All the true WT samples are invisible to STAVE, so they cannot be counted.

This naturally raises the question: Should we always record WT haplotypes so that numerators sum to the denominator?

In an ideal world, yes. If the complete distribution of haplotypes is known, including all WT observations, many downstream queries become trivial and robust.

In practice, this is often impossible because:

  • the authors focused on certain mutations, meaning we cannot assume all unreported haplotypes are WT,
  • the reference genome used in analysis is not stated,
  • the data have already been summarised before reaching you.

If WT samples were not recorded, STAVE cannot invent them. Prevalence estimates for any allele carried by the omitted haplotypes will therefore be biased downward, and uncertainty intervals will reflect only the partial data.

Practical implications

  • If the WT count is available, include it — doing so prevents many errors.
  • If it is not available, be cautious when querying allele-level prevalence, especially for alleles that match the wild-type reference (e.g. mdr1 N86).
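One way to spot surveys where this caution applies is to compare the summed numerators against the denominator before running any allele-level query. The base R sketch below does this for the two example datasets in this section. It is an illustration rather than a STAVE feature, and it assumes that all rows within a survey describe mutually exclusive haplotypes at the same locus and share one denominator.

```r
# Both example datasets from this section, stacked into one table.
counts <- data.frame(
  survey_id      = c(rep("study_01_site_01", 3), rep("study_02_site_01", 2)),
  variant_string = c("crt:72-76:CVIET", "crt:72-76:SVMNT", "crt:72-76:CVMNK",
                     "crt:72-76:CVIET", "crt:72-76:SVMNT"),
  variant_num    = c(20, 10, 70, 20, 10),
  total_num      = c(100, 100, 100, 100, 100),
  stringsAsFactors = FALSE
)

# Sum the numerators per survey and take the shared denominator.
observed <- tapply(counts$variant_num, counts$survey_id, sum)
total    <- tapply(counts$total_num, counts$survey_id, max)

coverage <- data.frame(
  survey_id   = names(observed),
  observed    = as.vector(observed),
  total       = as.vector(total),
  unaccounted = as.vector(total - observed),
  stringsAsFactors = FALSE
)

# Surveys in which some samples are not covered by any recorded haplotype.
incomplete <- coverage$survey_id[coverage$unaccounted > 0]
```

In this example only study_02_site_01 is flagged, with 70 samples unaccounted for. Those samples are not necessarily wild-type; the check only shows that allele-level queries for that survey rest on partial data.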

Restricted or identifying data

STAVE is designed to handle both public and non-public datasets. The access_level field allows users to mark each study as public, restricted, or private, and STAVE itself places no limitations on loading or analysing restricted material. This design choice allows individual laboratories and surveillance teams to use STAVE with their own internal datasets, even when those datasets cannot be shared publicly.

However, this also means that users are entirely responsible for ensuring that STAVE objects containing restricted or private data do not enter the public domain. STAVE provides the structure, but it cannot enforce data-governance rules; those must be respected by the analysts who use it.
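As one possible safeguard, the tables can be filtered on access_level before anything is exported or shared. The sketch below uses plain data frames as stand-ins for the study and count tables. The table names, study_03, and all values are made up for illustration, and none of this is enforced or provided by STAVE itself.

```r
# Stand-in tables with made-up values (hypothetical, for illustration only).
studies <- data.frame(
  study_id     = c("study_01", "study_02", "study_03"),
  access_level = c("public", "restricted", "private"),
  stringsAsFactors = FALSE
)
counts <- data.frame(
  study_id    = c("study_01", "study_01", "study_02", "study_03"),
  variant_num = c(20, 10, 5, 8),
  total_num   = c(100, 100, 50, 40),
  stringsAsFactors = FALSE
)

# Fail early if an unexpected access level has crept in.
allowed_levels <- c("public", "restricted", "private")
stopifnot(all(studies$access_level %in% allowed_levels))

# Keep only studies explicitly marked as public, and only their counts.
public_ids        <- studies$study_id[studies$access_level == "public"]
shareable_studies <- studies[studies$study_id %in% public_ids, ]
shareable_counts  <- counts[counts$study_id %in% public_ids, ]
```

Filtering on an explicit "public" label, rather than excluding "private", means that a study with a missing or mistyped access level is withheld by default.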

A second consideration relates to identifiability. Aggregate genetic data are often assumed to be non-identifying because they represent groups rather than individuals. However, STAVE encourages high-resolution spatial recording of sampling locations, and at sufficiently fine granularity this assumption may break down. For example, if a single individual presents to a precisely geo-referenced health facility on a specific day, those data could be identifying even in aggregated form.

For this reason, users must ensure that the spatial and temporal resolution of their data is ethically and legally appropriate. This may require intentionally reducing precision — for example, by jittering, rounding, or otherwise anonymising coordinates before constructing a STAVE object. If such steps are taken, they should be clearly documented in the location_notes (and, if relevant, time_notes) fields so that any downstream user understands the provenance and limitations of the spatial information.
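As a minimal sketch of one such step, the base R code below rounds coordinates to one decimal place and records the change in location_notes. The column names latitude and longitude and the coordinate values are assumptions made for illustration, and the appropriate precision depends on the ethical and legal context of the data.

```r
# Stand-in survey table with made-up coordinates (hypothetical values).
surveys <- data.frame(
  survey_id      = c("study_01_site_01", "study_02_site_01"),
  latitude       = c(-1.28634, 6.52438),
  longitude      = c(36.81722, 3.37921),
  location_notes = NA_character_,
  stringsAsFactors = FALSE
)

# Round to 1 decimal place; 0.1 degrees of latitude is roughly 11 km.
digits <- 1
surveys$latitude  <- round(surveys$latitude, digits)
surveys$longitude <- round(surveys$longitude, digits)

# Document the step so downstream users know the precision was reduced.
surveys$location_notes <- sprintf(
  "Coordinates rounded to %d decimal place(s) before import to reduce identifiability.",
  digits
)
```

Rounding is deterministic and easy to document. Random jittering (for example with runif()) is an alternative, but it should be applied once and stored, since re-jittering the same site repeatedly can let the true location be averaged back out.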


If you have read all the documentation up to this point, then god bless you! You should have a good understanding of how STAVE works, and are ready to jump to the Installation and Tutorials sections to start putting some of these ideas into practice.