Whole genome sequencing (WGS) data
P. falciparum WGS data
Pf3k
All of the data within the subfolder wgs/pf3k
was derived from the Pf3k Project. Currently, there are three VCF files, with corresponding CSVs containing metadata, for samples from: - Democratic Republic of the Congo (\(n=113\)) - Vietnam (\(n=97\)) - In vitro mixtures of laboratory strains (\(n=25\))
Each VCF contains 247,496 high-quality (VQSLOD>6) biallelic SNPs across all fourteen somatic chromosomes. The VCFs are sorted and an index file is provided. The Fws statistics provided in the metadata CSVs were collected from the Pf7 data set, which contains the Pf3k samples. These were not calculated for the in vitro lab mixtures.
Simulated
All of the data within the subfolder wgs/simulated
was simulated. In brief, a simulated sample with a given complexity of infection (COI), \(K\), is created by randomly sampling \(K\) clonal haplotypes (\(F_{ws} > 0.95\)) from a given country within the Pf3k Project, assigning these haplotypes to \(j \leq K\) bites, simulating meiosis if \(j < K\), randomly sampling proportions for each haplotype, and then simulating read count data given the proportions and final genotypes. Sequencing error is simulated at a fixed rate and present in the read counts. No variant calling error is simulated; the genotypes are perfect. At present, there is only one VCF file with a corresponding CSV and BED file containing metadata, with samples simulated from: - Democratic Republic of the Congo (\(n=40\))
The COI of these samples ranges from one to four, and about half of them have within-host relatedness.
Lab isolates sub-setted
There are a set of bam files with vcf calls subsetting to just CSP (PF3D7_0304600), CELTOS (PF3D7_1133400), and AMA1 (PF3D7_1216600). These can be found within the wgs/labisolate_subset
directory. With metadata describing what is in each file wgs/labisolate_subset/allControlMixtures.tab.txt
, wgs/labisolate_subset/allControlSampNameToMixName.tab.txt