Encoding Variants as Strings
Storing genetic information presents unique challenges, particularly when dealing with varying levels of detail reported by different sources. For instance, some academic studies may only provide aggregate counts of specific amino acids observed at a single locus, whereas others might supply raw sequence data at the individual level, allowing us to perform our own aggregation. To address these differences, we need a data format that is both versatile and robust.
The solution in STAVE is a character string format that adheres to strict rules. While this approach may not be the most efficient for data compression, this trade-off is less significant when working with aggregate data, where the number of distinct observations is limited.
This page explains the variant string format in detail, providing clear examples of both compatible and incompatible data types.
1. Gene name, locus, and amino acid are separated by ;
These are the three key elements. For example, the string pfk13;580;Y specifies that in the pfk13 gene (the kelch 13 gene, implicated in Artemisinin resistance) at the 580th codon position, we observed the amino acid Y. Amino acids must follow the IUPAC single-letter format; in this case, Y corresponds to Tyrosine.
Another common notation would call this the C580Y variant. We refrain from this notation because:
- This notation does not scale well to multiple loci.
- Sources do not always report the wild type (Cysteine in this case), forcing us to do extra work tracking down the reference genome.
- Strictly, we cannot assume that all non-numerator observations are wild type. The denominator may include all other variants, in which case recording these as wild type would be a corruption of the original data.
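To make the three-field layout concrete, here is a minimal base R sketch of how a single-locus variant string could be split into its parts. The helper `parse_single_locus()` is a hypothetical name used for illustration only; it is not part of STAVE and does no validation beyond counting fields.

```r
# Hypothetical helper (not a STAVE function): split a single-locus variant
# string into gene name, codon position and amino acid.
parse_single_locus <- function(variant) {
  parts <- strsplit(variant, ";", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 3)  # expect exactly gene;locus;amino_acid
  list(
    gene       = parts[1],
    locus      = as.integer(parts[2]),
    amino_acid = parts[3]
  )
}

v <- parse_single_locus("pfk13;580;Y")
v$gene        # "pfk13"
v$locus       # 580
v$amino_acid  # "Y"
```

Note that the wild type never appears in the parsed result, which reflects the reasons given above for avoiding the C580Y style of notation.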
Gene names must consist solely of lowercase English letters and digits (0-9), without hyphens or other special characters. Otherwise, they do not follow a strict convention. We recognise that this creates the potential for ambiguity, for example if two sources use slightly different names for the same gene, such as pfk13 and k13. However, this can be avoided by manually listing all gene names in the existing data using the get_variants() function before appending new data.
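As a rough illustration of the naming rule, the check below uses a regular expression that accepts only lowercase English letters and digits. It also shows one way to pull the distinct gene names out of a set of variant strings so that near-duplicates stand out. Both `is_valid_gene_name()` and the `existing_variants` vector are hypothetical; the vector merely stands in for variants already held in the data, and nothing is assumed here about the exact output format of get_variants().

```r
# Hypothetical helper (not a STAVE function): TRUE if a gene name uses only
# lowercase English letters and digits.
is_valid_gene_name <- function(x) {
  grepl("^[a-z0-9]+$", x, perl = TRUE)
}

valid <- is_valid_gene_name(c("pfk13", "k13", "Pfk13", "pf-k13"))
valid  # TRUE, TRUE, FALSE, FALSE

# Illustrative variant strings, made up for this example
existing_variants <- c("pfk13;580;Y", "k13;580;Y", "pfdhfr;51;I:pfdhps;437;G")

# Split multi-gene strings on ":" and keep the text before the first ";"
per_gene   <- unlist(strsplit(existing_variants, ":", fixed = TRUE))
gene_names <- sort(unique(sub(";.*$", "", per_gene)))
gene_names
```

Both pfk13 and k13 pass the format check, which is exactly the ambiguity described above: the rule constrains the characters, not the choice of name, so a person still needs to read the list.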
2. Loci are separated by _
In some cases, sources may report variants over multiple loci. A good example is the pfcrt gene, implicated in resistance to Amodiaquine among other antimalarials, in which different haplotypes confer different degrees of resistance. Here, the variant string pfcrt;72_73_74_75_76;C_V_I_E_T specifies that at codon 72 we observed a C (Cysteine), at codon 73 we observed a V (Valine), etc.
In a more concise shorthand notation, the underscores between amino acids can be omitted, resulting in pfcrt;72_73_74_75_76;CVIET. However, it is essential to retain the underscores between codon positions to prevent ambiguity in the numbers.
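The equivalence between the full and shorthand forms can be sketched in base R as follows. `split_loci()` is a hypothetical helper, not a STAVE function, and it assumes that the shorthand is only used when every locus carries a single amino acid (no mixed calls).

```r
# Hypothetical helper (not a STAVE function): expand a single-gene variant
# string into one row per locus.
split_loci <- function(variant) {
  parts <- strsplit(variant, ";", fixed = TRUE)[[1]]
  loci  <- as.integer(strsplit(parts[2], "_", fixed = TRUE)[[1]])
  aa    <- parts[3]
  if (grepl("_", aa, fixed = TRUE)) {
    # full notation: amino acids separated by underscores
    aa <- strsplit(aa, "_", fixed = TRUE)[[1]]
  } else if (length(loci) > 1) {
    # shorthand notation: one letter per locus, no separators
    aa <- strsplit(aa, "")[[1]]
  }
  stopifnot(length(aa) == length(loci))
  data.frame(gene = parts[1], locus = loci, amino_acid = aa,
             stringsAsFactors = FALSE)
}

full  <- split_loci("pfcrt;72_73_74_75_76;C_V_I_E_T")
short <- split_loci("pfcrt;72_73_74_75_76;CVIET")
identical(full, short)  # TRUE
```

The codon positions cannot be given the same treatment: without underscores, 72_73 would collapse to 7273, which is why they must always be kept.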
When a data source provides a full haplotype, you should always store the haplotype as a whole. Do not manually break it down into individual loci, for example into one entry for pfcrt;72;C, a separate entry for pfcrt;73;V, and so on. When calculating prevalence, STAVE is smart enough to know that the pfcrt;72;C variant is contained within the pfcrt;72_73_74_75_76;CVIET haplotype. Even more importantly, do not store both the haplotype and the single-locus information, as this will lead to mistakes in prevalence calculation due to double-counting. A more detailed explanation of this issue, along with guidance on handling very long haplotypes derived from raw sequence data, is provided on the next page.
3. Genes are separated by :
We may want to capture information about mutations observed together across multiple genes. A key example is the antimalarial combination therapy sulfadoxine-pyrimethamine (SP), where resistance arises from mutations in two genes:
- Mutations in the pfdhfr gene reduce the parasite’s susceptibility to pyrimethamine.
- Mutations in the pfdhps gene reduce susceptibility to sulfadoxine.
This can be represented in the variant string format by listing multiple genes separated by a colon. For example: pfdhfr;51;I:pfdhps;437;G.
The order in which genes are listed does not matter, as STAVE will automatically sort them before calculating prevalence. Additionally, there is no restriction on the number of genes that can be included in a single variant string.
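The idea that gene order is irrelevant can be illustrated with a small base R sketch that puts the genes into a fixed order. `canonicalise_genes()` is a hypothetical helper, and sorting alphabetically by gene name is an assumption made for this example; the text above does not state which ordering STAVE uses internally.

```r
# Hypothetical helper (not a STAVE function): reorder the genes in a
# multi-gene variant string alphabetically by gene name.
canonicalise_genes <- function(variant) {
  genes      <- strsplit(variant, ":", fixed = TRUE)[[1]]
  gene_names <- sub(";.*$", "", genes)  # text before the first ";"
  # radix ordering is locale-independent, so results are reproducible
  paste(genes[order(gene_names, method = "radix")], collapse = ":")
}

a <- canonicalise_genes("pfdhps;437;G:pfdhfr;51;I")
b <- canonicalise_genes("pfdhfr;51;I:pfdhps;437;G")
a                # "pfdhfr;51;I:pfdhps;437;G"
identical(a, b)  # TRUE
```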
4. Unphased mixed calls are indicated by /
Polyclonal infections are common in malaria and can result in mixed, or “heterozygous,” calls at the DNA level. In some cases, these mixed calls also manifest at the amino acid level. When this occurs, all observed amino acids should be listed, separated by a forward slash.
For example, the string pfmdr1;76;N/Y indicates that in the pfmdr1 gene at codon position 76, both N (Asparagine) and Y (Tyrosine) were observed.
Any number of alleles can be listed, such as pfmdr1;76;N/Y/F, indicating that three amino acids, N, Y, and F (Phenylalanine), were detected at position 76. These mixed calls can also be incorporated into multi-locus notation. For instance, pfmdr1;76_184;N/Y_Y specifies a heterozygous call at position 76 (N/Y) and a homozygous call at position 184 (Y).
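As a sketch of how such strings decompose, the base R snippet below splits a single-gene variant string into the set of amino acids observed at each locus. `split_alleles()` is a hypothetical helper rather than a STAVE function, and it assumes the full notation with underscores between loci.

```r
# Hypothetical helper (not a STAVE function): list the amino acids observed
# at each locus of a single-gene, unphased variant string.
split_alleles <- function(variant) {
  parts <- strsplit(variant, ";", fixed = TRUE)[[1]]
  loci  <- strsplit(parts[2], "_", fixed = TRUE)[[1]]
  calls <- strsplit(parts[3], "_", fixed = TRUE)[[1]]
  stopifnot(length(loci) == length(calls))
  alleles <- strsplit(calls, "/", fixed = TRUE)
  names(alleles) <- loci
  alleles
}

a <- split_alleles("pfmdr1;76_184;N/Y_Y")
a[["76"]]       # "N" "Y"
a[["184"]]      # "Y"
lengths(a) > 1  # which loci carry a mixed call
```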
5. Phased mixed calls are indicated by |
The forward slash notation assumes that variants are unphased, meaning we do not know which variants go together. For example, in the string pfmdr1;76_184;N/Y_Y/F we do not know if this is caused by the haplotype combination N_Y + Y_F, or the combination N_F + Y_Y. There could also be more than two genotypes present in the sample, in which case more combinations are possible.
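The ambiguity can be made explicit by enumerating every haplotype that is compatible with an unphased call. The base R sketch below takes the cross product of the alleles at each locus; it is an illustration of the combinatorics only, not a description of how STAVE handles these strings.

```r
variant <- "pfmdr1;76_184;N/Y_Y/F"

# Take the amino acid field and split it into per-locus allele sets
aa_field <- strsplit(variant, ";", fixed = TRUE)[[1]][3]
calls    <- strsplit(aa_field, "_", fixed = TRUE)[[1]]
alleles  <- strsplit(calls, "/", fixed = TRUE)

# Every way of choosing one allele per locus
grid       <- expand.grid(alleles, stringsAsFactors = FALSE)
candidates <- unname(apply(grid, 1, paste, collapse = "_"))
candidates
```

This yields four candidate haplotypes (N_Y, Y_Y, N_F and Y_F). The unphased string is consistent with the pair N_Y + Y_F, with the pair N_F + Y_Y, or with three or more of the candidates being present together, and nothing in the string distinguishes these cases.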
In some cases, we may have phased information, meaning we know which variants go together. These can be indicated using the vertical bar symbol in place of the forward slash. When using this notation, the first letter listed at a locus always goes with the first letter at another locus, etc. For example, pfmdr1;76_184;N|Y_Y|F indicates the haplotype combination N_Y + Y_F.
Any number of alleles can be listed, but the same number of alleles must be present at all heterozygous loci. For example, if we have observed three distinct amino acids at position 76 and two distinct amino acids at position 184, then we must list three amino acids at locus 184 even if this means repeating the same letter. Otherwise it would not be possible to link up which amino acids go together.
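The positional rule can be sketched in base R as follows. `phased_haplotypes()` is a hypothetical helper, not a STAVE function. It assumes a single gene in full notation, and it treats a locus with a single letter as homozygous and shared by every haplotype. The three-allele string used below is made up to match the scenario described above.

```r
# Hypothetical helper (not a STAVE function): read the haplotypes off a
# phased, single-gene variant string.
phased_haplotypes <- function(variant) {
  parts   <- strsplit(variant, ";", fixed = TRUE)[[1]]
  calls   <- strsplit(parts[3], "_", fixed = TRUE)[[1]]
  alleles <- strsplit(calls, "|", fixed = TRUE)
  n       <- lengths(alleles)
  # all heterozygous loci must list the same number of alleles
  if (length(unique(n[n > 1])) > 1) {
    stop("heterozygous loci must list the same number of alleles")
  }
  # repeat homozygous calls so that every locus has one letter per haplotype
  padded <- lapply(alleles, rep_len, length.out = max(n))
  # the i-th letter at each locus belongs to the i-th haplotype
  do.call(paste, c(padded, sep = "_"))
}

two   <- phased_haplotypes("pfmdr1;76_184;N|Y_Y|F")
three <- phased_haplotypes("pfmdr1;76_184;N|Y|F_Y|F|F")
two    # "N_Y" "Y_F"
three  # "N_Y" "Y_F" "F_F"
```

Writing the second locus as Y|F instead of Y|F|F would trigger the error, because there would be no way to tell which of the three haplotypes carries which amino acid at position 184.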
There are some limitations to this format. For example, phased and unphased regions are not currently supported in the same variant. Similarly, partial phasing is not supported, by which we mean samples with more than two genomes where only a subset of genomes are phased together. As a rule of thumb, remember that the / and | symbols cannot be used together. While we recognise this as a limitation of the format, these are relatively rare edge cases. They will also not tend to impact prevalence calculation unless we are interested in the prevalence of a long haplotype. In both cases above, data can still be included in STAVE format with some information loss, for example by treating all loci as unphased.
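One possible way to apply this fallback is sketched below in base R. Both helpers are hypothetical and not part of STAVE. `mixes_phasing()` flags strings that use both symbols, and `to_unphased()` rewrites a single-gene string so that every locus is unphased. Dropping repeated letters at a locus is an assumption made here, on the basis that repeats only carry meaning when phase is known.

```r
# Hypothetical helper (not a STAVE function): TRUE if a variant string uses
# both "/" and "|", which the format does not allow.
mixes_phasing <- function(variant) {
  grepl("/", variant, fixed = TRUE) & grepl("|", variant, fixed = TRUE)
}

# Hypothetical helper (not a STAVE function): discard phase information from
# a single-gene variant string written in full notation.
to_unphased <- function(variant) {
  parts <- strsplit(variant, ";", fixed = TRUE)[[1]]
  calls <- strsplit(parts[3], "_", fixed = TRUE)[[1]]
  calls <- vapply(
    strsplit(calls, "[|/]"),  # split on either symbol
    function(a) paste(unique(a), collapse = "/"),
    character(1)
  )
  paste(parts[1], parts[2], paste(calls, collapse = "_"), sep = ";")
}

mixes_phasing("pfmdr1;76_184;N|Y_Y/F")            # TRUE
unphased <- to_unphased("pfmdr1;76_184;N|Y|F_Y|F|F")
unphased                                          # "pfmdr1;76_184;N/Y/F_Y/F"
```

The information lost is the link between positions: after conversion, the string no longer records that N at position 76 travelled with Y at position 184.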
What cannot be encoded?
As with any data format, there are inherent trade-offs in the variant string encoding. The following information cannot be encoded:
- Copy number variation
- More complex genomic rearrangements, including insertions and deletions
- Combinations of phased and unphased regions and partial phasing (see above)
- Within-sample allele frequencies
- Quality scores
Keep in mind that STAVE is designed to facilitate data extraction from diverse sources for the purposes of prevalence calculation. It is not intended to replace individual-level data or raw sequence data, which remain essential for other types of analyses.
The next page explains how prevalence is calculated from this encoding.