This describes a basic IBD analysis of whole genome sequence data using hmmIBD. It assumes you already have a C compiler and Python installed.

The data

We can download sample Pf3k data from PGEforge data folder. For this demo, we will download the Vietnam data from the WGS directory. Let’s download two files - SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.Vietnam.vcf.gz and pf3k.metadata.Vietnam.csv - into a data subdirectory of our current working directory.

Based on the ‘fws’ values in pf3k.metadata.Vietnam.csv, we can create a list of probable monogenomic samples and put it in a file named mono_samples.txt. For this exercise, we can assume that any sample with fws > 0.95 is monogenomic and make a list of them in the file data/mono_samples.txt, one sample name per line.

Analysis

Assuming we have downloaded the hmmIBD package and have hmmIBD.c, vcf2hmm.py, and thin_sites.py in our current working directory, we start the analysis. First, we compile the C code:

cc -o hmmIBD -O3 -lm -Wall hmmIBD.c

This should compile (possibly with an unimportant warning about an unused variable).

Then we can use one of the Python scripts to extract data from the VCF file and look at the variants. The -s argument tells the script to pay attention only to the monogenomic samples we have listed in our file.

vcf2hmm.py data/SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.Vietnam.vcf.gz vietnam -s data/mono_samples.txt

This prints out the following:

Sample source: data/mono_samples.txt
SNP source: all                                 
samples: 97, of which 71 were kept
chromosome Pf3D7_01_v3 maps to 1
chromosome Pf3D7_02_v3 maps to 2
chromosome Pf3D7_03_v3 maps to 3
chromosome Pf3D7_04_v3 maps to 4
chromosome Pf3D7_05_v3 maps to 5
chromosome Pf3D7_06_v3 maps to 6 
chromosome Pf3D7_07_v3 maps to 7
chromosome Pf3D7_08_v3 maps to 8
chromosome Pf3D7_09_v3 maps to 9
chromosome Pf3D7_10_v3 maps to 10
chromosome Pf3D7_11_v3 maps to 11
chromosome Pf3D7_12_v3 maps to 12
chromosome Pf3D7_13_v3 maps to 13
chromosome Pf3D7_14_v3 maps to 14      
N indels killed 0                                                       
N failed/passed filter 0 247496
N dropped for low call rate 1092
N dropped for low minor allele freq 0

Here the ‘maps to’ lines show what integer is being assigned to each chromosome. The script also produces the files vietnam_seq.txt, vietnam_freq.txt, and vietnam_allele.txt.

Examination of vietnam_freq.seq file shows that there are hundreds of thousands of variants but that most of them are monomorphic, meaning that it is a good idea to thin these sites. This we can do using the vietnam_freq.txt file as follows:

thin_sites.py vietnam_freq.txt vietnam_thin_sites.txt
which produces a list in the file vietnam_thin_sites.txt of the sites (~20,000) to keep.
We can now re-extract the sequence data for just these sites, with the samples still restricted to the monogenomic samples:

vcf2hmm.py data/SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.Vietnam.vcf.gz vietnam_thinned -s data/mono_samples.txt -l vietnam_thin_sites.txt

This produces a final vietnam_thinned_seq.txt file with the genotype data to feed into hmmIBD. We can now run the program:

hmmIBD -i vietnam_thinned_seq.txt -o vietnam

The output should appear in the files vietnam.hmm.txt and vietnam.hmm_fract.txt.

Back to top