Whole genome sequencing (WGS) data

P. falciparum WGS data

Pf3k

All of the data within the subfolder wgs/pf3k was derived from the Pf3k Project. Currently, there are three VCF files, with corresponding CSVs containing metadata, for samples from:

- Democratic Republic of the Congo (\(n=113\))
- Vietnam (\(n=97\))
- In vitro mixtures of laboratory strains (\(n=25\))

Each VCF contains 247,496 high-quality (VQSLOD>6) biallelic SNPs across all fourteen nuclear chromosomes. The VCFs are sorted and an index file is provided. The Fws statistics provided in the metadata CSVs were taken from the Pf7 data set, which contains the Pf3k samples. Fws values were not calculated for the in vitro laboratory mixtures.
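As a quick sanity check on a VCF of this kind, the number of biallelic SNP records per chromosome can be tallied directly from the text records. The following is a minimal sketch using only the Python standard library; the function name `count_biallelic_snps` is hypothetical and not part of this repository, and the check only inspects the REF and ALT columns (it does not re-apply the VQSLOD filter).

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}

def count_biallelic_snps(vcf_lines):
    """Count biallelic SNP records per chromosome from VCF text lines.

    A record is counted when REF and ALT are each a single nucleotide,
    which excludes indels, multi-allelic sites (comma-separated ALT),
    and missing or spanning-deletion alleles ('.' or '*').
    """
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines, meta-information lines, and the column header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alt = fields[0], fields[3], fields[4]
        if ref in BASES and alt in BASES:
            counts[chrom] += 1
    return counts
```

For a bgzip-compressed VCF, the lines can be streamed by opening the file with `gzip.open(path, "rt")` and passing the file handle to the function, where `path` points to one of the VCFs in wgs/pf3k.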

Simulated

All of the data within the subfolder wgs/simulated was simulated. In brief, a simulated sample with a given complexity of infection (COI), \(K\), is created by randomly sampling \(K\) clonal haplotypes (\(F_{ws} > 0.95\)) from a given country within the Pf3k Project, assigning these haplotypes to \(j \leq K\) bites, simulating meiosis if \(j < K\), randomly sampling proportions for each haplotype, and then simulating read count data given the proportions and final genotypes. Sequencing error is simulated at a fixed rate and is present in the read counts. No variant calling error is simulated; the genotypes are perfect.

At present, there is only one VCF file, with a corresponding CSV and BED file containing metadata, with samples simulated from:

- Democratic Republic of the Congo (\(n=40\))

The COI of these samples ranges from one to four, and about half of them have within-host relatedness.
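The core of the simulation procedure described above can be illustrated with a simplified sketch. This is not the code used to generate the files in wgs/simulated: it omits the assignment of haplotypes to bites and the meiosis step, and it assumes flat Dirichlet proportions, a fixed read depth per site, and binomial read sampling, none of which are stated in the description. The function name `simulate_sample` and all parameter defaults are hypothetical.

```python
import random

def simulate_sample(haplotypes, k, depth=50, error_rate=0.01, seed=None):
    """Simulate read counts for a mixed infection with COI equal to k.

    haplotypes: list of equal-length lists of 0/1 alleles (0 = ref, 1 = alt).
    Returns (proportions, counts), where counts holds one
    (ref_reads, alt_reads) tuple per site.
    """
    rng = random.Random(seed)
    # Randomly sample k clonal haplotypes without replacement.
    chosen = rng.sample(haplotypes, k)
    # Assumed choice: proportions from a flat Dirichlet, obtained by
    # normalising independent Gamma(1, 1) draws.
    weights = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(weights)
    proportions = [w / total for w in weights]
    counts = []
    for site in range(len(chosen[0])):
        # Alternate-allele frequency of the mixture at this site.
        alt_freq = sum(p * h[site] for p, h in zip(proportions, chosen))
        # A fixed-rate sequencing error flips the allele seen on a read.
        p_alt = alt_freq * (1 - error_rate) + (1 - alt_freq) * error_rate
        # Binomial sampling of reads at a fixed depth (an assumption).
        alt = sum(rng.random() < p_alt for _ in range(depth))
        counts.append((depth - alt, alt))
    return proportions, counts
```

Calling `simulate_sample(haps, k=3, seed=1)` with a list `haps` of at least three haplotypes returns three proportions that sum to one and one pair of read counts per site. Because the genotypes themselves are taken as given, only the read counts carry error, matching the description above.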

Lab isolates sub-setted

There is a set of BAM files, with VCF calls, subset to just CSP (PF3D7_0304600), CELTOS (PF3D7_1133400), and AMA1 (PF3D7_1216600). These can be found within the wgs/labisolate_subset directory. Metadata describing the contents of each file is provided in wgs/labisolate_subset/allControlMixtures.tab.txt and wgs/labisolate_subset/allControlSampNameToMixName.tab.txt.
