Running Moire Example

The data

In this analysis we will be using the SNP barcode data from the sanger 100 SNP Plasmodium falciparum barcode (Chang et al. 2019). This was generated by subsetting the WGS data (also described within the Data section of this website) and we will be using the data from the DRC. See the description in the Data section for more details on this dataset.

In this tutorial we will use PGEhammer to convert data from the VCF format to the format required by MOIRE. To install the package run the following command

# Install PGEhammer in R:
install.packages('PGEhammer', repos = c('https://plasmogenepi.r-universe.dev', 'https://cloud.r-project.org'))

Now we can import libraries that we will need.

library(tidyverse)
library(here)
library(vcfR)
library(PGEhammer)

MOIRE requires input data in either long or wide format, which can be loaded using the load_long_form_data() or load_delimited_data() functions, respectively. The long format represents data with each observed allele for each sample on a separate row, while the wide format uses a row for each sample, and a separate column for each genomic target, with each cell containing a string indicating the collection of alleles that are observed. In the following steps, we will convert the Variant Call Format (VCF) data to the required long format using the function vcf2long from PGEhammer.

# Load the vcf  
vcf <- read.vcfR('../../data/snp_barcode/sangerBarcode_SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.DRCongo.vcf.gz')

Scanning file to determine attributes.
File attributes:
  meta lines: 136
  header_line: 137
  variant count: 91
  column count: 122

Meta line 136 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 91
  Character matrix gt cols: 122
  skip: 0
  nrows: 91
  row_num: 0

Processed variant: 91
All variants processed

# Convert to long format 
df <- vcf2long(vcf)

Converting from vcf to long format...

Reformatting complete.

head(df) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

sample_id	locus	allele	read_count
QG0182-C	Pf3D7_01_v3_145515	allele-1	36
QG0182-C	Pf3D7_01_v3_145515	allele-2	0
QG0182-C	Pf3D7_01_v3_179347	allele-1	0
QG0182-C	Pf3D7_01_v3_179347	allele-2	188
QG0182-C	Pf3D7_01_v3_180554	allele-1	0
QG0182-C	Pf3D7_01_v3_180554	allele-2	150

Prior to executing MOIRE, it is advisable to remove alleles that are likely to be erroneous. In the following step, we apply a filter to remove alleles with a read count below 10, however, how you filter your data should be informed by prior knowledge and experience with your data.

df <- df |>
  dplyr::filter(read_count >= 10)

For the purpose of this tutorial we will subset the data to just include chromosome 1

subset_df <- df[grepl('^Pf3D7_01', df$locus), ]

Running the MCMC

Now that we have the data in the correct format we will run the Markov chain Monte Carlo (MCMC) with the default parameters.

data <- moire::load_long_form_data(subset_df)
mcmc_results <- moire::run_mcmc(data, is_missing = data$is_missing, verbose = FALSE)

Summarising the results

From the MCMC results we can produce estimations for each sample and summarise the results.

First we will look at COI using summarize_coi.

# Estimate the COI for each sample
coi_summary <- moire::summarize_coi(mcmc_results)

head(coi_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

sample_id	post_coi_lower	post_coi_med	post_coi_upper	post_coi_mean	naive_coi	offset_naive_coi	prob_polyclonal
QG0182-C	1	1	1	1.000	1	1	0.000
QG0183-C	1	1	2	1.108	2	1	0.096
QG0184-C	1	1	1	1.018	1	1	0.018
QG0185-C	1	1	1	1.007	1	1	0.007
QG0186-C	1	1	2	1.039	1	1	0.036
QG0187-C	1	1	1	1.000	1	1	0.000

This dataframe includes summaries of both the posterior distribution of COI for each biological sample and the naive estimates.

Below we use the summarize_he function to summarise locus heterozygosity from the posterior distribution of sampled allele frequencies.

he_summary <- moire::summarize_he(mcmc_results)

head(he_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

locus	post_stat_lower	post_stat_med	post_stat_upper	post_stat_mean
Pf3D7_01_v3_145515	0.2670405	0.3736329	0.4515084	0.3683778
Pf3D7_01_v3_179347	0.0164229	0.0630886	0.1608248	0.0688653
Pf3D7_01_v3_180554	0.1696347	0.2652525	0.3638944	0.2646813
Pf3D7_01_v3_283144	0.2752269	0.3691431	0.4527259	0.3680179
Pf3D7_01_v3_535211	0.1172749	0.2383835	0.3372387	0.2354225

Using summarize_allele_freqs we can look at individual allele frequencies

allele_freq_summary <- moire::summarize_allele_freqs(mcmc_results)

head(allele_freq_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

post_allele_freqs_lower	post_allele_freqs_med	post_allele_freqs_upper	post_allele_freqs_mean	locus	allele
0.6557107	0.7513633	0.8412912	0.7520307	Pf3D7_01_v3_145515	allele-1
0.1587088	0.2486367	0.3442893	0.2479693	Pf3D7_01_v3_145515	allele-2
0.0082800	0.0326075	0.0881904	0.0361381	Pf3D7_01_v3_179347	allele-1
0.9118095	0.9673925	0.9917200	0.9638619	Pf3D7_01_v3_179347	allele-2
0.7608693	0.8425985	0.9064267	0.8409991	Pf3D7_01_v3_180554	allele-1
0.0935733	0.1574015	0.2391307	0.1590009	Pf3D7_01_v3_180554	allele-2

The summarize_relatedness function provides a dataframe of within-host relatedness

relatedness_summary <- moire::summarize_relatedness(mcmc_results)

head(relatedness_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

sample_id	post_relatedness_lower	post_relatedness_med	post_relatedness_upper	post_relatedness_mean
QG0182-C	NA	NA	NA	NaN
QG0183-C	0.0070207	0.5699971	0.8141899	0.4491186
QG0184-C	0.9993327	0.9993360	0.9993471	0.9993373
QG0185-C	0.9954492	0.9954609	0.9954958	0.9954689
QG0186-C	0.2001541	0.8753348	0.9416712	0.7888673
QG0187-C	NA	NA	NA	NaN

The moire tool introduces a new metric called effective MOI, which adjusts for within-host relatedness. A detailed description of this metric and how to interpret it can be found here.

effective_coi_summary <- moire::summarize_effective_coi(mcmc_results)

head(effective_coi_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

sample_id	post_effective_coi_lower	post_effective_coi_med	post_effective_coi_upper	post_effective_coi_mean
QG0182-C	1	1	1.000000	1.000000
QG0183-C	1	1	1.796385	1.057058
QG0184-C	1	1	1.000000	1.000012
QG0185-C	1	1	1.000000	1.000032
QG0186-C	1	1	1.078041	1.008387
QG0187-C	1	1	1.000000	1.000000

Summary

In summary, moire estimates allele frequencies, MOI, and within-host relatedness. We have shown how to generate basic results from a VCF. MOIRE has extensive documentation, including more details on other functionality available within the tool and a tutorial validating the outputs using simulated data.