Cautionary Notes • STAVE

Although STAVE rigorously defines the data structure, there are still ways of obtaining misleading results depending on how the data are entered. Given that STAVE encourages the use of exact spatial coordinates, this structure also has the potential to identify individuals, which has ethical implications. This section highlights some common pitfalls and how to avoid them.

Do I need to record wild-type samples?

The table below encodes not only the mutant haplotypes of interest but also how often the wild-type (WT) pfcrt haplotype (CVMNK) was observed.

study_id	survey_id	variant_string	variant_num	total_num	notes
study_01	study_01_site_01	crt:72-76:CVIET	20	100	NA
study_01	study_01_site_01	crt:72-76:SVMNT	10	100	NA
study_01	study_01_site_01	crt:72-76:CVMNK	70	100	NA

Notice that the sum of the numerators across all variants equals the denominator. This tells us that the entire space of possible haplotypes is covered. Because the data are complete, we can query the prevalence of any allele at any locus and obtain a correct result. For example, here is the prevalence of the wild-type allele (C) at codon 72:

s$get_prevalence(target_variant = "crt:72:C") |>
  ... # (further filters to only show some columns)

survey_id	numerator	denominator	prevalence	prevalence_lower	prevalence_upper
study_01_site_01	90	100	90	82.37774	95.09953

This is correct: the C allele was seen 20 times in the CVIET mutant haplotype and 70 times in the WT CVMNK haplotype, so the prevalence is (20 + 70) / 100 = 90%.
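The arithmetic above can be reproduced by hand in base R. This is a minimal sketch that does not call STAVE; it only re-derives the numbers from the table. The use of `binom.test()` assumes the interval columns are exact binomial (Clopper-Pearson) bounds, which is an inference from the printed values rather than something stated here.

```r
# Hand-rolled check of the query above; this does not use the STAVE API.
counts <- data.frame(
  variant_string = c("crt:72-76:CVIET", "crt:72-76:SVMNT", "crt:72-76:CVMNK"),
  variant_num    = c(20, 10, 70),
  total_num      = c(100, 100, 100)
)

# The haplotype is the text after the final colon; codon 72 is its first character.
haplotype <- sub(".*:", "", counts$variant_string)
allele_72 <- substr(haplotype, 1, 1)

# Sum the counts of every haplotype carrying C at codon 72.
numerator   <- sum(counts$variant_num[allele_72 == "C"])
denominator <- unique(counts$total_num)
prevalence  <- 100 * numerator / denominator

# Exact binomial interval (assumed method, see the note above).
ci <- 100 * as.numeric(binom.test(numerator, denominator)$conf.int)

print(c(numerator = numerator, denominator = denominator, prevalence = prevalence))
print(ci)
```

Because every sample appears in exactly one row, summing over the rows that carry C at codon 72 gives the full count for that allele.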

However, many studies do not report how many samples were wild-type. Instead, they report only “interesting” haplotypes, typically those carrying resistance-associated mutations. In the dataset below, the only difference is that the WT observations have been omitted:

study_id	survey_id	variant_string	variant_num	total_num	notes
study_02	study_02_site_01	crt:72-76:CVIET	20	100	NA
study_02	study_02_site_01	crt:72-76:SVMNT	10	100	NA

What happens if we now attempt to estimate the prevalence of the C allele at codon 72?

s$get_prevalence(target_variant = "crt:72:C") |>
  ... # (further filters to only show some columns)

numerator	denominator	prevalence	prevalence_lower	prevalence_upper
20	100	20	12.66556	29.18427

The result is clearly wrong: the estimate of 20% is far too low, given that the true prevalence is 90%. This happens because, once the WT haplotype is omitted, the only remaining records of the C allele at codon 72 come from samples that were mutant somewhere else in the haplotype. All the true WT samples are invisible to STAVE, so they cannot be counted.
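The gap can be made explicit with a few lines of base R. This sketch does not use STAVE; it repeats the calculation by hand on the reduced table and counts how many samples are not covered by any recorded haplotype.

```r
# Same hand calculation as before, but with the WT row missing.
partial <- data.frame(
  variant_string = c("crt:72-76:CVIET", "crt:72-76:SVMNT"),
  variant_num    = c(20, 10),
  total_num      = c(100, 100)
)

allele_72 <- substr(sub(".*:", "", partial$variant_string), 1, 1)

# Only the CVIET samples contribute a recorded C at codon 72.
observed_c       <- sum(partial$variant_num[allele_72 == "C"])
total            <- unique(partial$total_num)
naive_prevalence <- 100 * observed_c / total

# Samples that appear in the denominator but in no recorded haplotype.
unaccounted <- total - sum(partial$variant_num)

print(c(observed_c = observed_c, naive_prevalence = naive_prevalence,
        unaccounted = unaccounted))
```

The 70 unaccounted samples are exactly the ones that would have carried the WT haplotype in this example, but nothing in the reduced table says so, which is why they cannot be added back automatically.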

This naturally raises the question: Should we always record WT haplotypes so that numerators sum to the denominator?

In an ideal world, yes. If the complete distribution of haplotypes is known, including all WT observations, many downstream queries become trivial and robust.

In practice, this is often impossible because:

the authors focused on certain mutations, meaning we cannot assume all unreported haplotypes are WT,
the reference genome used in analysis is not stated,
the data have already been summarised before reaching you.

If WT samples were not recorded, STAVE cannot invent them. Prevalence estimates for wild-type alleles will therefore be biased downward, as in the example above, and uncertainty intervals will reflect only the partial data.

Practical implications

If the WT count is available, include it; doing so prevents many errors.
If it is not available, be cautious when querying allele-level prevalence, especially for positions that are wild-type in the reference haplotype (e.g. mdr1 N86).
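One way to spot incomplete records before querying is to compare the summed numerators with the denominator for each survey. The helper below is a hypothetical sketch in base R, not part of STAVE. It assumes that all rows for a survey describe the same locus range and that each sample contributes exactly one haplotype; where those assumptions fail, the comparison is not meaningful.

```r
# Hypothetical helper (not a STAVE function): flag surveys whose recorded
# haplotype counts do not add up to the number of samples tested.
check_completeness <- function(counts) {
  by_survey <- aggregate(variant_num ~ survey_id + total_num,
                         data = counts, FUN = sum)
  names(by_survey)[names(by_survey) == "variant_num"] <- "recorded_num"
  by_survey$missing_num <- by_survey$total_num - by_survey$recorded_num
  by_survey$complete    <- by_survey$missing_num == 0
  by_survey
}

# The two example tables from this section, combined.
counts <- data.frame(
  survey_id      = c(rep("study_01_site_01", 3), rep("study_02_site_01", 2)),
  variant_string = c("crt:72-76:CVIET", "crt:72-76:SVMNT", "crt:72-76:CVMNK",
                     "crt:72-76:CVIET", "crt:72-76:SVMNT"),
  variant_num    = c(20, 10, 70, 20, 10),
  total_num      = 100
)

completeness <- check_completeness(counts)
print(completeness)
```

Surveys flagged as incomplete are the ones where allele-level queries at wild-type positions deserve extra scrutiny.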

Restricted or identifying data

STAVE is designed to handle both public and non-public datasets. The access_level field allows users to mark each study as public, restricted, or private, and STAVE itself places no limitations on loading or analysing restricted material. This design choice allows individual laboratories and surveillance teams to use STAVE with their own internal datasets, even when those datasets cannot be shared publicly.

However, this also means that users are entirely responsible for ensuring that STAVE objects containing restricted or private data do not enter the public domain. STAVE provides the structure, but it cannot enforce data-governance rules; those must be respected by the analysts who use it.
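As an illustration of one safeguard, the sketch below filters a plain data frame of studies down to those marked public before anything is exported. It is not a STAVE function: the data frame, the third study, and the variable names are made up for the example, and only the access_level values come from the description above.

```r
# Illustrative studies table; study_03 is hypothetical.
studies <- data.frame(
  study_id     = c("study_01", "study_02", "study_03"),
  access_level = c("public", "restricted", "private"),
  stringsAsFactors = FALSE
)

# Keep only studies that are explicitly marked as public.
shareable <- studies[studies$access_level == "public", , drop = FALSE]

# Record what was held back so the omission is visible to the analyst.
withheld <- setdiff(studies$study_id, shareable$study_id)

print(shareable)
print(withheld)
```

Filtering on an explicit "public" value, rather than excluding "private", means that a study with a missing or misspelled access level is withheld by default.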

A second consideration relates to identifiability. Aggregate genetic data are often assumed to be non-identifying because they represent groups rather than individuals. However, STAVE encourages high-resolution spatial recording of sampling locations, and at sufficiently fine granularity this assumption may break down. For example, if a single individual presents to a precisely geo-referenced health facility on a specific day, those data could be identifying even in aggregated form.

For this reason, users must ensure that the spatial and temporal resolution of their data is ethically and legally appropriate. This may require intentionally reducing precision, for example by jittering, rounding, or otherwise anonymising coordinates before constructing a STAVE object. If such steps are taken, they should be clearly documented in the location_notes (and, if relevant, time_notes) fields so that any downstream user understands the provenance and limitations of the spatial information.
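A minimal sketch of the rounding approach is shown below, in base R. The column names latitude and longitude, the helper name, and the coordinates themselves are assumptions made for illustration; only location_notes is a field named above. Rounding to one decimal place corresponds to roughly 11 km in latitude, and the appropriate precision depends on the setting.

```r
# Illustrative survey table with made-up coordinates.
surveys <- data.frame(
  survey_id      = c("study_01_site_01", "study_02_site_01"),
  latitude       = c(-1.28333, 6.52438),
  longitude      = c(36.81667, 3.37921),
  location_notes = NA_character_,
  stringsAsFactors = FALSE
)

# Hypothetical helper: round coordinates and document the change.
coarsen_coords <- function(df, digits = 1) {
  df$latitude  <- round(df$latitude, digits)
  df$longitude <- round(df$longitude, digits)
  df$location_notes <- sprintf(
    "Coordinates rounded to %d decimal place(s) for anonymisation", digits
  )
  df
}

coarse <- coarsen_coords(surveys, digits = 1)
print(coarse)
```

Writing the note in the same step as the rounding keeps the documentation from drifting out of sync with the data.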

If you have read all the documentation up to this point, then god bless you! You should have a good understanding of how STAVE works, and are ready to jump to the Installation and Tutorials sections to start putting some of these ideas into practice.