3. Flexible encoding of genetic variants

Encoding Plasmodium genetic data can be challenging because the information involved is complex and variable. Sequence data may be recorded at the single-codon level or across multiple codons, and heterozygous calls (where more than one allele is observed) can occur at some codons but not others. These heterozygous calls may be phased or unphased, adding another layer of detail. Patterns of missingness can also vary between individuals, so the number of samples with a usable call, which is the denominator of any prevalence estimate, changes as we look along the genome. In many cases, particularly when working with aggregate counts extracted from published studies, some of this detail may be incomplete or entirely unavailable. We therefore require an encoding system that is both flexible and expressive: capable of capturing this complexity when the data are available, but not overly restrictive when certain aspects are missing.

A drawback of this flexible encoding is that calculating prevalence becomes less straightforward. This trade-off is largely unavoidable, because flexibility in data encoding and simplicity in prevalence calculation pull in opposite directions. In STAVE, we prioritize a clean and flexible data structure that captures all relevant information. Prevalence calculations are then performed using dedicated member functions that operate on the encoded data.
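To give a feel for why prevalence is not a simple count once mixed and missing calls are allowed, here is a minimal Python sketch. It does not use STAVE's encoding or its member functions: the toy `samples` structure, the `prevalence` helper, and the codon positions and alleles are all hypothetical, chosen only for illustration. The sketch also assumes that a mixed call counts towards the numerator if it contains the allele of interest, and that samples with a missing call are dropped from the denominator.

```python
# Hypothetical toy data, NOT STAVE's actual encoding. Each sample maps a codon
# position to the set of amino acids observed there. None marks a missing call;
# a set with more than one element is a heterozygous (mixed) call.
samples = [
    {72: {"C"}, 76: {"T"}},
    {72: {"C"}, 76: {"K", "T"}},  # mixed call at codon 76
    {72: None,  76: {"K"}},       # codon 72 is missing for this sample
    {72: {"S"}, 76: {"T"}},
    {72: {"C"}, 76: {"T"}},
]

def prevalence(samples, codon, allele):
    """Return (numerator, denominator) for `allele` at `codon`.

    Assumption: samples with a missing call at this codon are excluded from
    the denominator, and a mixed call counts if it contains the allele.
    """
    called = [s[codon] for s in samples if s.get(codon) is not None]
    numerator = sum(1 for call in called if allele in call)
    return numerator, len(called)

for codon, allele in [(72, "C"), (76, "T")]:
    num, den = prevalence(samples, codon, allele)
    print(f"codon {codon}, allele {allele}: {num}/{den}")
```

In this toy data set the denominator is 4 at codon 72 and 5 at codon 76, even though the same five samples are involved, which is the kind of locus-by-locus variation a flexible encoding has to accommodate.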

The next page explains how to calculate prevalence.