Encoding Genetic Data

STAVE is designed with the flexibility to encode full haplotypes, preserving information about linkage across multiple genes and multiple loci. But what happens when not all loci amplify in some samples? Or when published data only report locus-by-locus results, rather than full haplotypes? These situations are entirely manageable once we understand the underlying data structure.

Encoding single-locus data

Imagine you come across a paper detailing pfcrt mutations at codons 72 to 76 for a sample of 100 individuals. The CVIET haplotype, commonly observed at these positions, is a well-established marker of chloroquine resistance. However, rather than presenting haplotype-level data, the paper provides only site-specific information: the number of individuals successfully tested at each codon and the number carrying each observed variant:

study_id	survey_id	variant_string	variant_num	total_num	notes
study_01	study_01_site_01	crt:72:C	100	100	NA
study_01	study_01_site_01	crt:73:V	95	95	NA
study_01	study_01_site_01	crt:74:I	31	98	NA
study_01	study_01_site_01	crt:75:E	20	92	NA
study_01	study_01_site_01	crt:76:T	45	89	NA

Notice that the total_num is not consistent over loci. This is very common, and is usually caused by some loci failing to amplify in some samples.

This method of encoding genetic data is straightforward and easy to interpret. For instance, the prevalence of glutamic acid (E) at codon 75 can be directly calculated as 20 out of 92 samples, or approximately 22%. However, this approach sacrifices information about full haplotypes. Specifically, we cannot determine whether the 20 samples with crt:75:E are the same individuals who also carry crt:76:T. As a result, we are limited to estimating single-locus prevalences and cannot infer multi-locus patterns or linkage. While this encoding is valid, it is suboptimal for analyses that rely on haplotype structure.
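As a minimal sketch of the single-locus calculation described above, the following Python snippet divides variant_num by total_num for each row of the table. The rows list and the prevalence helper are illustrative names only and are not part of STAVE.

```python
# Rows copied from the single-locus table above:
# (variant_string, variant_num, total_num)
rows = [
    ("crt:72:C", 100, 100),
    ("crt:73:V", 95, 95),
    ("crt:74:I", 31, 98),
    ("crt:75:E", 20, 92),
    ("crt:76:T", 45, 89),
]


def prevalence(variant_num, total_num):
    """Proportion of successfully typed samples that carry the variant."""
    if total_num <= 0:
        raise ValueError("total_num must be positive")
    return variant_num / total_num


# Each locus uses its own denominator, because amplification
# success differs from one locus to the next.
prevalences = {v: prevalence(n, d) for v, n, d in rows}

for variant, p in prevalences.items():
    print(f"{variant}: {p:.1%}")
```

Nothing in this calculation links one row to another, which is exactly why joint patterns such as 75:E together with 76:T cannot be recovered from this encoding.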

Encoding multi-locus data

Now imagine a different study that encodes complete haplotypes:

study_id	survey_id	variant_string	variant_num	total_num	notes
study_02	study_02_site_01	crt:72-76:CVIET	23	65	NA
study_02	study_02_site_01	crt:72-76:CVMET	4	65	NA
study_02	study_02_site_01	crt:72-74:CVI	5	5	NA
study_02	study_02_site_01	crt:74-76:IET	3	3	NA

The first two rows show results for samples that amplified successfully over loci 72 to 76. Notice that the denominator of 65 is the same between these two rows. In general, the denominator must be identical over all rows that share the same combination of genes and positions. The same 65 samples could have produced either of these two variants, and so they are grouped together in the same denominator class.
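The shared-denominator rule can be checked mechanically. The sketch below is not STAVE's own validation: it assumes that the "gene:positions" part of a variant string identifies the denominator class, and the helper names are hypothetical.

```python
from collections import defaultdict

# Rows copied from the table above:
# (survey_id, variant_string, variant_num, total_num)
rows = [
    ("study_02_site_01", "crt:72-76:CVIET", 23, 65),
    ("study_02_site_01", "crt:72-76:CVMET", 4, 65),
    ("study_02_site_01", "crt:72-74:CVI", 5, 5),
    ("study_02_site_01", "crt:74-76:IET", 3, 3),
]


def denominator_class(variant_string):
    """Drop the amino-acid field of each gene, keeping 'gene:positions'."""
    genes = variant_string.split(";")
    return ";".join(g.rsplit(":", 1)[0] for g in genes)


def inconsistent_denominators(rows):
    """Return the classes that are listed with more than one total_num."""
    seen = defaultdict(set)
    for survey_id, variant_string, _variant_num, total_num in rows:
        seen[(survey_id, denominator_class(variant_string))].add(total_num)
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


problems = inconsistent_denominators(rows)
print(problems if problems else "all denominators are consistent")
```

Grouping by survey as well as by gene and position keeps different sites from being compared with each other.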

The next two rows describe samples that only amplified at a subset of loci. The denominator gives the number of samples that amplified at each subset. This does not have to match the other subsets as it relates to a different combination of genes and positions.

Importantly, there is no double counting in this table. The crt:72-76:CVIET samples are not included in the crt:72-74:CVI numbers, even though technically they did amplify at all these loci. If you’re ever unsure what to do, remember that every sample is present in this table only once.

For the same reason, it would be incorrect for this table to also list single-locus prevalences calculated from the full haplotypes. STAVE is smart enough to calculate single-locus prevalences from the full haplotype data, and so encoding the same information twice would result in double-counting. This encoding is generally preferable to the single-locus encoding described earlier, because everything that can be calculated from single-locus data, and more, can be calculated from haplotype data.
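To illustrate why single-locus rows would be redundant, here is a rough sketch of how a single-locus prevalence could be recovered from the haplotype table above. It relies only on the rule that each sample appears once, so numerators and denominators from different position subsets can be pooled. This is an assumption-laden illustration, not STAVE's implementation: parse_simple and single_locus_prevalence are hypothetical helpers, and the parser handles only single-gene strings without mixed calls.

```python
# Rows copied from the haplotype table above:
# (variant_string, variant_num, total_num)
rows = [
    ("crt:72-76:CVIET", 23, 65),
    ("crt:72-76:CVMET", 4, 65),
    ("crt:72-74:CVI", 5, 5),
    ("crt:74-76:IET", 3, 3),
]


def parse_simple(variant_string):
    """Parse a single-gene string with no mixed calls.

    Returns (gene, {position: amino_acid}).
    """
    gene, pos_field, aa_field = variant_string.split(":")
    positions = []
    for part in pos_field.split("_"):
        if "-" in part:
            start, end = map(int, part.split("-"))
            positions.extend(range(start, end + 1))
        else:
            positions.append(int(part))
    amino_acids = aa_field.replace("_", "")
    if len(positions) != len(amino_acids):
        raise ValueError(f"positions and amino acids differ in length: {variant_string}")
    return gene, dict(zip(positions, amino_acids))


def single_locus_prevalence(rows, gene, position, amino_acid):
    """Pool every row that covers the position; return (numerator, denominator)."""
    numerator = 0
    denominators = {}  # one entry per distinct 'gene:positions' class
    for variant_string, variant_num, total_num in rows:
        row_gene, calls = parse_simple(variant_string)
        if row_gene != gene or position not in calls:
            continue
        # Count each denominator class once, however many rows share it.
        denominators[variant_string.rsplit(":", 1)[0]] = total_num
        if calls[position] == amino_acid:
            numerator += variant_num
    return numerator, sum(denominators.values())


num, den = single_locus_prevalence(rows, "crt", 74, "I")
print(f"crt:74:I = {num}/{den}")
```

For crt:74:I, the rows CVIET, CVI and IET contribute 23 + 5 + 3 = 31 samples to the numerator, and the three denominator classes contribute 65 + 5 + 3 = 73 samples to the denominator. Adding an explicit crt:74:I row to the table would count some of these samples a second time.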

Encoding mixed calls

This third example has multi-locus haplotypes as well as mixed calls at some loci:

study_id	survey_id	variant_string	variant_num	total_num	notes
study_03	study_03_site_01	crt:72-74:CVI	78	100	NA
study_03	study_03_site_01	crt:72-74:C/S_VI	11	100	NA
study_03	study_03_site_01	crt:72-74:C/S_V_I/M	6	100	NA
study_03	study_03_site_01	crt:72-74:C|S_V_I|M	5	100	NA

See the variant string package for further details of how to encode mixed calls, but in short, the / symbol specifies an un-phased mixed call, and | a phased mixed call.

All of these samples amplified successfully at positions 72 to 74, therefore the denominator of 100 is shared over all rows. The various combinations of mixed calls are each given a different row. Samples that are crt:72-74:C/S_VI are not also contained in the numbers for crt:72-74:CVI. As always, samples appear only once.
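Because every sample appears only once, the numerators within a denominator class can never add up to more than the shared denominator. The sketch below checks this for the table above. It treats the amino-acid field as an opaque label, so no knowledge of the mixed-call syntax is needed. The function name is hypothetical, and the code assumes single-gene variant strings.

```python
from collections import defaultdict

# Rows copied from the mixed-call table above:
# (variant_string, variant_num, total_num)
rows = [
    ("crt:72-74:CVI", 78, 100),
    ("crt:72-74:C/S_VI", 11, 100),
    ("crt:72-74:C/S_V_I/M", 6, 100),
    ("crt:72-74:C|S_V_I|M", 5, 100),
]


def counts_by_class(rows):
    """Sum variant_num within each 'gene:positions' class.

    Returns {class: (summed_variant_num, total_num)}.
    """
    counted = defaultdict(int)
    totals = {}
    for variant_string, variant_num, total_num in rows:
        key = variant_string.rsplit(":", 1)[0]  # e.g. 'crt:72-74'
        counted[key] += variant_num
        totals[key] = total_num
    return {key: (counted[key], totals[key]) for key in counted}


for key, (counted, total) in counts_by_class(rows).items():
    status = "ok" if counted <= total else "possible double counting"
    print(f"{key}: {counted} of {total} samples accounted for ({status})")
```

Here the four rows account for 78 + 11 + 6 + 5 = 100 samples, exactly matching the denominator. A sum below the denominator is also valid, since unlisted variants may simply not have been reported.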

How many loci is too many?

Consider this final example, in which phased information is presented over multiple genes:

study_id	survey_id	variant_string	variant_num	total_num	notes
study_04	study_04_site_01	crt:72-76:CVIET;dhfr:51_59_108_164:NCSI;dhps:436_437_540_581_613:AGEGT	5	5	NA

This is a very long variant string. On the one hand, encoding the full haplotype retains all information. On the other hand, this notation starts to become quite cumbersome. If we took this approach for a large number of samples then we would reach a combinatorial explosion in the number of possible observed variants and positions, as each distinct variant would require a new row.

A different approach might be to break this into different genes:

study_id	survey_id	variant_string	variant_num	total_num	notes
study_05	study_05_site_01	crt:72-76:CVIET	5	5	NA
study_05	study_05_site_01	dhfr:51_59_108_164:NCSI	5	5	NA
study_05	study_05_site_01	dhps:436_437_540_581_613:AGEGT	5	5	NA

This reduces the number of possible combinations, as new rows are only needed to capture diversity in each gene. However, we have lost the ability to query the prevalence of haplotypes spanning multiple genes. For example, we may be interested in how frequently the dhfr and dhps mutations are found together, which is now lost to us because of how we have broken up the data.
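The size of this trade-off can be illustrated with a back-of-envelope count. The sketch below assumes, purely for illustration, that every position has two possible amino acids. Real loci may have more or fewer, and observed diversity is usually far below the theoretical maximum.

```python
from math import prod

# Number of encoded positions per gene in the example above.
positions_per_gene = {"crt": 5, "dhfr": 4, "dhps": 5}

# ASSUMPTION for illustration only: two possible amino acids per position.
ALLELES_PER_POSITION = 2

# Upper bound on distinct haplotypes within each gene.
per_gene = {gene: ALLELES_PER_POSITION ** n for gene, n in positions_per_gene.items()}

# Joint encoding: any per-gene haplotype can combine with any other.
joint_upper_bound = prod(per_gene.values())

# Split encoding: each gene gets its own rows.
split_upper_bound = sum(per_gene.values())

print("per gene:", per_gene)
print("joint encoding, max distinct rows:", joint_upper_bound)
print("split encoding, max distinct rows:", split_upper_bound)
```

Under this assumption the joint encoding allows up to 32 × 16 × 32 = 16,384 distinct variant strings, while the split encoding needs at most 32 + 16 + 32 = 80 rows. The price of the smaller table is the lost linkage between genes.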

Overall, there is a balance to strike between capturing everything and capturing enough. This balance will vary from one application to the next. In general, if your aim is to capture genome-wide haplotypes then STAVE is probably not the best solution, and you may be better off sticking with other common formats like VCF and PMO. If your aim is to capture key drug resistance combinations in a lightweight and flexible notation, then STAVE may be a good choice.

The next page goes into the details of how prevalence is calculated from the encoded data.