Running Moire Example

The data

In this analysis we will be using the SNP barcode data from the sanger 100 SNP Plasmodium falciparum barcode (Chang et al. 2019). This was generated by subsetting the WGS data (also described within the Data section of this website) and we will be using the data from the DRC. See the description in the Data section for more details on this dataset.

In this tutorial we will use PGEhammer to convert data from the VCF format to the format required by MOIRE. To install the package run the following command

# Install PGEhammer in R:
install.packages('PGEhammer', repos = c('https://plasmogenepi.r-universe.dev', 'https://cloud.r-project.org'))

Now we can import libraries that we will need.

library(tidyverse)
library(here)
library(vcfR)
library(PGEhammer)

MOIRE requires input data in either long or wide format, which can be loaded using the load_long_form_data() or load_delimited_data() functions, respectively. The long format represents data with each observed allele for each sample on a separate row, while the wide format uses a row for each sample, and a separate column for each genomic target, with each cell containing a string indicating the collection of alleles that are observed. In the following steps, we will convert the Variant Call Format (VCF) data to the required long format using the function vcf2long from PGEhammer.

# Load the vcf  
vcf <- read.vcfR('../../data/snp_barcode/sangerBarcode_SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.DRCongo.vcf.gz')
Scanning file to determine attributes.
File attributes:
  meta lines: 136
  header_line: 137
  variant count: 91
  column count: 122

Meta line 136 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 91
  Character matrix gt cols: 122
  skip: 0
  nrows: 91
  row_num: 0

Processed variant: 91
All variants processed
# Convert to long format 
df <- vcf2long(vcf)
Converting from vcf to long format...
Reformatting complete.
head(df) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id locus allele read_count
QG0182-C Pf3D7_01_v3_145515 allele-1 36
QG0182-C Pf3D7_01_v3_145515 allele-2 0
QG0182-C Pf3D7_01_v3_179347 allele-1 0
QG0182-C Pf3D7_01_v3_179347 allele-2 188
QG0182-C Pf3D7_01_v3_180554 allele-1 0
QG0182-C Pf3D7_01_v3_180554 allele-2 150

Prior to executing MOIRE, it is advisable to remove alleles that are likely to be erroneous. In the following step, we apply a filter to remove alleles with a read count below 10, however, how you filter your data should be informed by prior knowledge and experience with your data.

df <- df |>
  dplyr::filter(read_count >= 10)

For the purpose of this tutorial we will subset the data to just include chromosome 1

subset_df <- df[grepl('^Pf3D7_01', df$locus), ]

Running the MCMC

Now that we have the data in the correct format we will run the Markov chain Monte Carlo (MCMC) with the default parameters.

data <- moire::load_long_form_data(subset_df)
mcmc_results <- moire::run_mcmc(data, is_missing = data$is_missing, verbose = FALSE)

Summarising the results

From the MCMC results we can produce estimations for each sample and summarise the results.

First we will look at COI using summarize_coi.

# Estimate the COI for each sample
coi_summary <- moire::summarize_coi(mcmc_results)

head(coi_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id post_coi_lower post_coi_med post_coi_upper post_coi_mean naive_coi offset_naive_coi prob_polyclonal
QG0182-C 1 1 1 1.000 1 1 0.000
QG0183-C 1 1 2 1.108 2 1 0.096
QG0184-C 1 1 1 1.018 1 1 0.018
QG0185-C 1 1 1 1.007 1 1 0.007
QG0186-C 1 1 2 1.039 1 1 0.036
QG0187-C 1 1 1 1.000 1 1 0.000

This dataframe includes summaries of both the posterior distribution of COI for each biological sample and the naive estimates.

Below we use the summarize_he function to summarise locus heterozygosity from the posterior distribution of sampled allele frequencies.

he_summary <- moire::summarize_he(mcmc_results)

head(he_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
locus post_stat_lower post_stat_med post_stat_upper post_stat_mean
Pf3D7_01_v3_145515 0.2670405 0.3736329 0.4515084 0.3683778
Pf3D7_01_v3_179347 0.0164229 0.0630886 0.1608248 0.0688653
Pf3D7_01_v3_180554 0.1696347 0.2652525 0.3638944 0.2646813
Pf3D7_01_v3_283144 0.2752269 0.3691431 0.4527259 0.3680179
Pf3D7_01_v3_535211 0.1172749 0.2383835 0.3372387 0.2354225

Using summarize_allele_freqs we can look at individual allele frequencies

allele_freq_summary <- moire::summarize_allele_freqs(mcmc_results)

head(allele_freq_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
post_allele_freqs_lower post_allele_freqs_med post_allele_freqs_upper post_allele_freqs_mean locus allele
0.6557107 0.7513633 0.8412912 0.7520307 Pf3D7_01_v3_145515 allele-1
0.1587088 0.2486367 0.3442893 0.2479693 Pf3D7_01_v3_145515 allele-2
0.0082800 0.0326075 0.0881904 0.0361381 Pf3D7_01_v3_179347 allele-1
0.9118095 0.9673925 0.9917200 0.9638619 Pf3D7_01_v3_179347 allele-2
0.7608693 0.8425985 0.9064267 0.8409991 Pf3D7_01_v3_180554 allele-1
0.0935733 0.1574015 0.2391307 0.1590009 Pf3D7_01_v3_180554 allele-2

The summarize_relatedness function provides a dataframe of within-host relatedness

relatedness_summary <- moire::summarize_relatedness(mcmc_results)

head(relatedness_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id post_relatedness_lower post_relatedness_med post_relatedness_upper post_relatedness_mean
QG0182-C NA NA NA NaN
QG0183-C 0.0070207 0.5699971 0.8141899 0.4491186
QG0184-C 0.9993327 0.9993360 0.9993471 0.9993373
QG0185-C 0.9954492 0.9954609 0.9954958 0.9954689
QG0186-C 0.2001541 0.8753348 0.9416712 0.7888673
QG0187-C NA NA NA NaN

The moire tool introduces a new metric called effective MOI, which adjusts for within-host relatedness. A detailed description of this metric and how to interpret it can be found here.

effective_coi_summary <- moire::summarize_effective_coi(mcmc_results)

head(effective_coi_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id post_effective_coi_lower post_effective_coi_med post_effective_coi_upper post_effective_coi_mean
QG0182-C 1 1 1.000000 1.000000
QG0183-C 1 1 1.796385 1.057058
QG0184-C 1 1 1.000000 1.000012
QG0185-C 1 1 1.000000 1.000032
QG0186-C 1 1 1.078041 1.008387
QG0187-C 1 1 1.000000 1.000000

Summary

In summary, moire estimates allele frequencies, MOI, and within-host relatedness. We have shown how to generate basic results from a VCF. MOIRE has extensive documentation, including more details on other functionality available within the tool and a tutorial validating the outputs using simulated data.

Back to top