How to run estMOI

The data

To perform the analysis estMOI requires:

A bam file which should be sorted
A vcf file which can either be zipped or unzipped
The reference genome

For the purpose of hosting the data we have provided bam and vcf files containing data for the csp, celtos and ama1 genes. You can find these files at https://github.com/mrc-ide/PGEforge/tree/main/data/wgs/labisolate_subset. Download the calls.tar.gz file and unzip. You should now have a folder named calls which contains the bam and VCF files that we will user later in the practical.

We now need to download the reference genome. You can download this from PlasmoDB by clicking on this link

Running the tool

Let’s use run the analysis for the sample PG0402-C which is sample with a known MOI > 1. Let’s

Before we can run estMOI we need to index the reference genome. This can by done with

samtools faidx PlasmoDB-66_Pfalciparum3D7_Genome.fasta

Now let’s run the pipeline with default parameters. The three positional arguments should be:

The bam file
The VCF file
The reference genome

You can also customise the output file prefix with --out

estMOI PG0402-C.sorted.bam PG0402-C.vcf PlasmoDB-66_Pfalciparum3D7_Genome.fasta --out PG0402-C

After running it will print a summary of the analysis to the terminal and you should see the output below:

#   RUNNING estMOI version 1.03
#   ................
#   PRINT ESTIMATES...
#   MOI Count   %Total
#   1   319 66.88
#   2   158 100.00  MOI-estimate
    #DONE MOI-estimate using PG0402-C.sorted.bam

Output description

Running estMOI will produce two output files with the prefix that you specified before. Additionally, the values of the --maxsnp, --mindis, --maxdis and --minhap parameters will also be present in the output file names.

For example, you should have the files PG0402-C.moi.3.10.500.3.log and PG0402-C.moi.3.10.500.3.txt in your directory.

Log file

The file ending in .log contains the individual combinations of all haplotypes that were found. For each haplotype it will give you the locations of the SNPs together with the individual haplotype combinations. For example, here is an excerpt form the log file that contains this information for one haplotype:

PG0402-C.sorted.bam Pf3D7_03_v3 221435 221457 221554    2
    # PG0402-C.sorted.bam Pf3D7_03_v3 221435 221457 221554 Hapotype:     T A C  48
    # PG0402-C.sorted.bam Pf3D7_03_v3 221435 221457 221554 Hapotype:     T G C  10

Txt file

The result file will contain the the same information as printed to the terminal. It lists the numbers of haplotypes and for each MOI and provides an estimate of the MOI. The the example below there are 319 locus with a single haplotypes and 158 with two haplotypes.

#MOI    Count   %Total
1   319 66.88
2   158 100.00  MOI-estimate

Refining the results

Let’s run this on another sample:

estMOI PG0389-C.sorted.bam PG0389-C.vcf  PlasmoDB-66_Pfalciparum3D7_Genome.fasta --out PG0389-C

You should see that no MOI has been estimates:

#   RUNNING estMOI version 1.03
#   ................
#   PRINT ESTIMATES...
#   MOI Count   %Total
    #DONE MOI-estimate using PG0389-C.sorted.bam

To troubleshoot it is a good idea to look at the log. Run the following to find information about the haplotypes that have been analysed:

cat  PG0389-C.moi.3.10.500.3.log

You should see no output, indicating that no haplotypes were analysed and explains why there was no MOI estimate produced. estMOI loads the variant locations from the VCF and looks at combinations of those in the reads using the bam files.

Looking at the bam file it is clear that there are some variant haplotypes present.

Next thing to do is to check that we actually have variants in the VCF. You can do this with:

less PG0389-C.vcf

You should be able to see that there are many variant positions, however you might notice that the quality values are very low. estMOI has a hard-coded cutoff for the quality value and only analyses variants with quality values greater or equal to 30. This may explain why the estMOI didn’t analyse any haplotype positions.

The VCFs were created with freebayes using a diploid model and low frequency variants are assigned low quality values. We can use a different variant caller, lofreq, that is more suited to low frequency variants which will hopefully assign higher quality values to these variants. First let’s install lofreq:

mamba install -y -c conda-forge -c bioconda lofreq

Now we can use lofreq to call variants from the bam:

lofreq call -f PlasmoDB-66_Pfalciparum3D7_Genome.fasta PG0389-C.sorted.bam > PG0389-C.lofreq.vcf

Now we are ready to call run estMOI again using the new vcf you created:

estMOI PG0389-C.sorted.bam PG0389-C.lofreq.vcf  PlasmoDB-66_Pfalciparum3D7_Genome.fasta  --out PG0389-C

You shold hopefully see an MOI estimate of 2 now after the tools finishes running:

#   RUNNING estMOI version 1.03
#   ................
#   PRINT ESTIMATES...
#   MOI Count   %Total
#   1   98  68.06
#   2   46  100.00  MOI-estimate
    #DONE MOI-estimate using PG0389-C.sorted.bam

Summary

To summarise, we can use estMOI to estimate MOI using the read data stored in bams and variants stored in VCF format. The variants filtered by the pipeline based on a quality value so make sure you are happy with the variants your VCF before proceeding with the analysis.