In this analysis we will be using the SNP barcode data from the sanger 100 SNP Plasmodium falciparum barcode (Chang et al. 2019). This was generated by subsetting the WGS data (also described within the Data section of this website) and we will be using the data from the DRC. See the description in the Data section for more details on this dataset.
In this tutorial we will use PGEhammer to convert data from the VCF format to the format required by Dcifer. To install the package run the following command
# Install PGEhammer in R:install.packages('PGEhammer', repos =c('https://plasmogenepi.r-universe.dev', 'https://cloud.r-project.org'))
The downloaded binary packages are in
/var/folders/wx/rr171mzs0lj0mtflng6dwl7h0000gp/T//RtmpawfTNW/downloaded_packages
Moire requires input data in either long or wide format, which can be loaded using the load_long_form_data() or load_delimited_data() functions, respectively. The long format represents data with each observation on a separate row, while the wide format uses a separate column for each variable. In the following steps, we will convert the Variant Call Format (VCF) data to the required long format using the function vcf2long from PGEhammer.
# Load the vcf vcf <-read.vcfR('../../data/snp_barcode/sangerBarcode_SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.DRCongo.vcf.gz')
Scanning file to determine attributes.
File attributes:
meta lines: 136
header_line: 137
variant count: 91
column count: 122
Meta line 136 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 91
Character matrix gt cols: 122
skip: 0
nrows: 91
row_num: 0
Processed variant: 91
All variants processed
Prior to executing Moire, it is advisable to introduce a new column to identify alleles with an insufficient read count. In the following step, we apply a filter to isolate alleles with a read count below 10.
df$is_missing <- df$read_count <10
For the purpose of this tutorial we will subset the data to just include chromosome 1
subset_df <- df[grepl('^Pf3D7_01', df$locus), ]
Running the MCMC
Now that we have the data in the correct format we will run the Markov chain Monte Carlo (MCMC) with the default parameters.
From the MCMC results we can produce estimations for each sample and summarise the results.
First we will look at COI using summarize_coi.
# Estimate the COI for each samplecoi_summary <- moire::summarize_coi(mcmc_results)head(coi_summary) |>kable() |>kable_styling(bootstrap_options =c("striped", "hover", "condensed"))
sample_id
post_coi_lower
post_coi_med
post_coi_upper
post_coi_mean
naive_coi
offset_naive_coi
prob_polyclonal
QG0182-C
18
30
40.000
29.771
2
2
1
QG0182-C
19
30
39.000
29.390
2
2
1
QG0182-C
20
29
39.000
29.374
2
2
1
QG0182-C
18
28
37.000
27.604
2
2
1
QG0182-C
21
29
38.025
29.508
2
2
1
QG0183-C
20
30
40.000
29.987
2
2
1
This dataframe includes summaries of both the posterior distribution of COI for each biological sample and the naive estimates.
Below we use the summarize_he function to summarise locus heterozygosity from the posterior distribution of sampled allele frequencies.
The moire tool introduces a new metric called effective MOI, which adjusts for within-host relatedness. A detailed description of this metric and how to interpret it can be found here.
In summary, moire estimates allele frequencies, MOI, and within-host relatedness. We have shown how to generate basic results from a VCF. Moire has extensive documentation, including more details on other functionality available within the tool and a tutorial validating the outputs using simulated data.