# Install PGEhammer in R:
install.packages('PGEhammer', repos = c('https://plasmogenepi.r-universe.dev', 'https://cloud.r-project.org'))
Running Moire Example
The data
In this analysis we will be using the SNP barcode data from the sanger 100 SNP Plasmodium falciparum barcode (Chang et al. 2019). This was generated by subsetting the WGS data (also described within the Data section of this website) and we will be using the data from the DRC. See the description in the Data section for more details on this dataset.
In this tutorial we will use PGEhammer
to convert data from the VCF format to the format required by MOIRE. To install the package run the following command
Now we can import libraries that we will need.
library(tidyverse)
library(here)
library(vcfR)
library(PGEhammer)
MOIRE
requires input data in either long or wide format, which can be loaded using the load_long_form_data()
or load_delimited_data()
functions, respectively. The long format represents data with each observed allele for each sample on a separate row, while the wide format uses a row for each sample, and a separate column for each genomic target, with each cell containing a string indicating the collection of alleles that are observed. In the following steps, we will convert the Variant Call Format (VCF) data to the required long format using the function vcf2long from PGEhammer.
# Load the vcf
<- read.vcfR('../../data/snp_barcode/sangerBarcode_SNP_INDEL_Pf3D7_ALL_v3.combined.filtered.vqslod6.biallelic_snp.DRCongo.vcf.gz') vcf
Scanning file to determine attributes.
File attributes:
meta lines: 136
header_line: 137
variant count: 91
column count: 122
Meta line 136 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 91
Character matrix gt cols: 122
skip: 0
nrows: 91
row_num: 0
Processed variant: 91
All variants processed
# Convert to long format
<- vcf2long(vcf) df
Converting from vcf to long format...
Reformatting complete.
head(df) |>
kable() |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id | locus | allele | read_count |
---|---|---|---|
QG0182-C | Pf3D7_01_v3_145515 | allele-1 | 36 |
QG0182-C | Pf3D7_01_v3_145515 | allele-2 | 0 |
QG0182-C | Pf3D7_01_v3_179347 | allele-1 | 0 |
QG0182-C | Pf3D7_01_v3_179347 | allele-2 | 188 |
QG0182-C | Pf3D7_01_v3_180554 | allele-1 | 0 |
QG0182-C | Pf3D7_01_v3_180554 | allele-2 | 150 |
Prior to executing MOIRE
, it is advisable to remove alleles that are likely to be erroneous. In the following step, we apply a filter to remove alleles with a read count below 10, however, how you filter your data should be informed by prior knowledge and experience with your data.
<- df |>
df ::filter(read_count >= 10) dplyr
For the purpose of this tutorial we will subset the data to just include chromosome 1
<- df[grepl('^Pf3D7_01', df$locus), ] subset_df
Running the MCMC
Now that we have the data in the correct format we will run the Markov chain Monte Carlo (MCMC) with the default parameters.
<- moire::load_long_form_data(subset_df)
data <- moire::run_mcmc(data, is_missing = data$is_missing, verbose = FALSE) mcmc_results
Summarising the results
From the MCMC results we can produce estimations for each sample and summarise the results.
First we will look at COI using summarize_coi
.
# Estimate the COI for each sample
<- moire::summarize_coi(mcmc_results)
coi_summary
head(coi_summary) |>
kable() |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id | post_coi_lower | post_coi_med | post_coi_upper | post_coi_mean | naive_coi | offset_naive_coi | prob_polyclonal |
---|---|---|---|---|---|---|---|
QG0182-C | 1 | 1 | 1 | 1.000 | 1 | 1 | 0.000 |
QG0183-C | 1 | 1 | 2 | 1.108 | 2 | 1 | 0.096 |
QG0184-C | 1 | 1 | 1 | 1.018 | 1 | 1 | 0.018 |
QG0185-C | 1 | 1 | 1 | 1.007 | 1 | 1 | 0.007 |
QG0186-C | 1 | 1 | 2 | 1.039 | 1 | 1 | 0.036 |
QG0187-C | 1 | 1 | 1 | 1.000 | 1 | 1 | 0.000 |
This dataframe includes summaries of both the posterior distribution of COI for each biological sample and the naive estimates.
Below we use the summarize_he
function to summarise locus heterozygosity from the posterior distribution of sampled allele frequencies.
<- moire::summarize_he(mcmc_results)
he_summary
head(he_summary) |>
kable() |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
locus | post_stat_lower | post_stat_med | post_stat_upper | post_stat_mean |
---|---|---|---|---|
Pf3D7_01_v3_145515 | 0.2670405 | 0.3736329 | 0.4515084 | 0.3683778 |
Pf3D7_01_v3_179347 | 0.0164229 | 0.0630886 | 0.1608248 | 0.0688653 |
Pf3D7_01_v3_180554 | 0.1696347 | 0.2652525 | 0.3638944 | 0.2646813 |
Pf3D7_01_v3_283144 | 0.2752269 | 0.3691431 | 0.4527259 | 0.3680179 |
Pf3D7_01_v3_535211 | 0.1172749 | 0.2383835 | 0.3372387 | 0.2354225 |
Using summarize_allele_freqs
we can look at individual allele frequencies
<- moire::summarize_allele_freqs(mcmc_results)
allele_freq_summary
head(allele_freq_summary) |>
kable() |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
post_allele_freqs_lower | post_allele_freqs_med | post_allele_freqs_upper | post_allele_freqs_mean | locus | allele |
---|---|---|---|---|---|
0.6557107 | 0.7513633 | 0.8412912 | 0.7520307 | Pf3D7_01_v3_145515 | allele-1 |
0.1587088 | 0.2486367 | 0.3442893 | 0.2479693 | Pf3D7_01_v3_145515 | allele-2 |
0.0082800 | 0.0326075 | 0.0881904 | 0.0361381 | Pf3D7_01_v3_179347 | allele-1 |
0.9118095 | 0.9673925 | 0.9917200 | 0.9638619 | Pf3D7_01_v3_179347 | allele-2 |
0.7608693 | 0.8425985 | 0.9064267 | 0.8409991 | Pf3D7_01_v3_180554 | allele-1 |
0.0935733 | 0.1574015 | 0.2391307 | 0.1590009 | Pf3D7_01_v3_180554 | allele-2 |
The summarize_relatedness
function provides a dataframe of within-host relatedness
<- moire::summarize_relatedness(mcmc_results)
relatedness_summary
head(relatedness_summary) |>
kable() |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id | post_relatedness_lower | post_relatedness_med | post_relatedness_upper | post_relatedness_mean |
---|---|---|---|---|
QG0182-C | NA | NA | NA | NaN |
QG0183-C | 0.0070207 | 0.5699971 | 0.8141899 | 0.4491186 |
QG0184-C | 0.9993327 | 0.9993360 | 0.9993471 | 0.9993373 |
QG0185-C | 0.9954492 | 0.9954609 | 0.9954958 | 0.9954689 |
QG0186-C | 0.2001541 | 0.8753348 | 0.9416712 | 0.7888673 |
QG0187-C | NA | NA | NA | NaN |
The moire tool introduces a new metric called effective MOI, which adjusts for within-host relatedness. A detailed description of this metric and how to interpret it can be found here.
<- moire::summarize_effective_coi(mcmc_results)
effective_coi_summary
head(effective_coi_summary) |>
kable() |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
sample_id | post_effective_coi_lower | post_effective_coi_med | post_effective_coi_upper | post_effective_coi_mean |
---|---|---|---|---|
QG0182-C | 1 | 1 | 1.000000 | 1.000000 |
QG0183-C | 1 | 1 | 1.796385 | 1.057058 |
QG0184-C | 1 | 1 | 1.000000 | 1.000012 |
QG0185-C | 1 | 1 | 1.000000 | 1.000032 |
QG0186-C | 1 | 1 | 1.078041 | 1.008387 |
QG0187-C | 1 | 1 | 1.000000 | 1.000000 |
Summary
In summary, moire estimates allele frequencies, MOI, and within-host relatedness. We have shown how to generate basic results from a VCF. MOIRE
has extensive documentation, including more details on other functionality available within the tool and a tutorial validating the outputs using simulated data.