Running Moire Example

The data

In this analysis we will be using the SNP barcode data from the sanger 100 SNP Plasmodium falciparum barcode (Chang et al. 2019). This was generated by subsetting the WGS data (also described within the Data section of this website) and we will be using the data from the DRC. See the description in the Data section for more details on this dataset.

In this tutorial we will use PGEhammer to convert data from the VCF format to the format required by Dcifer. To install the package run the following command

# Install PGEhammer in R:
install.packages('PGEhammer', repos = c('https://plasmogenepi.r-universe.dev', 'https://cloud.r-project.org'))

The downloaded binary packages are in
    /var/folders/wx/rr171mzs0lj0mtflng6dwl7h0000gp/T//RtmpawfTNW/downloaded_packages

Now we can import libraries that we will need.

library(tidyverse)
library(here)
library(vcfR)
library(PGEhammer)

Moire requires input data in either long or wide format, which can be loaded using the load_long_form_data() or load_delimited_data() functions, respectively. The long format represents data with each observation on a separate row, while the wide format uses a separate column for each variable. In the following steps, we will convert the Variant Call Format (VCF) data to the required long format using the function vcf2long from PGEhammer.

# Load the vcf  
vcf <- read.vcfR('../../data/snp_barcode/sangerBarcode_SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.DRCongo.vcf.gz')
Scanning file to determine attributes.
File attributes:
  meta lines: 136
  header_line: 137
  variant count: 91
  column count: 122

Meta line 136 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 91
  Character matrix gt cols: 122
  skip: 0
  nrows: 91
  row_num: 0

Processed variant: 91
All variants processed
# Convert to long format 
df <- vcf2long(vcf)
Converting from vcf to long format...
Reformatting complete.
head(df) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id locus allele read_count
QG0182-C Pf3D7_01_v3_145515 allele-1 36
QG0182-C Pf3D7_01_v3_145515 allele-2 0
QG0182-C Pf3D7_01_v3_179347 allele-1 0
QG0182-C Pf3D7_01_v3_179347 allele-2 188
QG0182-C Pf3D7_01_v3_180554 allele-1 0
QG0182-C Pf3D7_01_v3_180554 allele-2 150

Prior to executing Moire, it is advisable to introduce a new column to identify alleles with an insufficient read count. In the following step, we apply a filter to isolate alleles with a read count below 10.

df$is_missing <- df$read_count < 10

For the purpose of this tutorial we will subset the data to just include chromosome 1

subset_df <- df[grepl('^Pf3D7_01', df$locus), ]

Running the MCMC

Now that we have the data in the correct format we will run the Markov chain Monte Carlo (MCMC) with the default parameters.

data <- moire::load_long_form_data(subset_df)
Adding missing grouping variables: `locus`
mcmc_results <- moire::run_mcmc(data, is_missing=data$is_missing, verbose=FALSE)

Summarising the results

From the MCMC results we can produce estimations for each sample and summarise the results.

First we will look at COI using summarize_coi.

# Estimate the COI for each sample
coi_summary <- moire::summarize_coi(mcmc_results)

head(coi_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id post_coi_lower post_coi_med post_coi_upper post_coi_mean naive_coi offset_naive_coi prob_polyclonal
QG0182-C 18 30 40.000 29.771 2 2 1
QG0182-C 19 30 39.000 29.390 2 2 1
QG0182-C 20 29 39.000 29.374 2 2 1
QG0182-C 18 28 37.000 27.604 2 2 1
QG0182-C 21 29 38.025 29.508 2 2 1
QG0183-C 20 30 40.000 29.987 2 2 1

This dataframe includes summaries of both the posterior distribution of COI for each biological sample and the naive estimates.

Below we use the summarize_he function to summarise locus heterozygosity from the posterior distribution of sampled allele frequencies.

he_summary <- moire::summarize_he(mcmc_results)

head(he_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
locus post_stat_lower post_stat_med post_stat_upper post_stat_mean
Pf3D7_01_v3_145515 0.4109930 0.4874592 0.4999798 0.4792203
Pf3D7_01_v3_179347 0.4080085 0.4880050 0.4999290 0.4797910
Pf3D7_01_v3_180554 0.4193941 0.4903099 0.4999454 0.4807489
Pf3D7_01_v3_283144 0.4123817 0.4918062 0.4999624 0.4819307
Pf3D7_01_v3_535211 0.4262921 0.4895070 0.4999511 0.4819215

Using summarize_allele_freqs we can look at individual allele frequencies

allele_freq_summary <- moire::summarize_allele_freqs(mcmc_results)

head(allele_freq_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
post_allele_freqs_lower post_allele_freqs_med post_allele_freqs_upper post_allele_freqs_mean locus allele
0.3041579 0.5005537 0.6774932 0.4964165 Pf3D7_01_v3_145515 allele-1
0.3225068 0.4994463 0.6958421 0.5035835 Pf3D7_01_v3_145515 allele-2
0.3087308 0.4980584 0.6845234 0.4976799 Pf3D7_01_v3_179347 allele-1
0.3154766 0.5019416 0.6912692 0.5023201 Pf3D7_01_v3_179347 allele-2
0.3112285 0.5079918 0.6829471 0.5037133 Pf3D7_01_v3_180554 allele-1
0.3170529 0.4920082 0.6887715 0.4962867 Pf3D7_01_v3_180554 allele-2

The summarize_relatedness function provides a dataframe of within-host relatedness

relatedness_summary <- moire::summarize_relatedness(mcmc_results)

head(relatedness_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id post_relatedness_lower post_relatedness_med post_relatedness_upper post_relatedness_mean
QG0182-C 0.0197030 0.4604298 0.8541539 0.4249814
QG0182-C 0.0337680 0.3783443 0.8409948 0.4144203
QG0182-C 0.0191771 0.3811882 0.8407041 0.4019389
QG0182-C 0.0190355 0.3933736 0.8386370 0.4018551
QG0182-C 0.0135405 0.3737309 0.8285997 0.3866478
QG0183-C 0.0256464 0.4261874 0.8394141 0.4265783

The moire tool introduces a new metric called effective MOI, which adjusts for within-host relatedness. A detailed description of this metric and how to interpret it can be found here.

effective_coi_summary <- moire::summarize_effective_coi(mcmc_results)

head(effective_coi_summary) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id post_effective_coi_lower post_effective_coi_med post_effective_coi_upper post_effective_coi_mean
QG0182-C 5.135987 16.71772 32.38607 17.47504
QG0182-C 5.313454 17.67340 32.47499 17.59741
QG0182-C 5.074139 17.62940 31.90710 17.89686
QG0182-C 5.008999 16.24006 31.78271 16.96374
QG0182-C 5.553618 18.14092 34.68141 18.56370
QG0183-C 5.266727 17.06538 33.06818 17.52451

Summary

In summary, moire estimates allele frequencies, MOI, and within-host relatedness. We have shown how to generate basic results from a VCF. Moire has extensive documentation, including more details on other functionality available within the tool and a tutorial validating the outputs using simulated data.

Back to top