install.package("vcfR")
Running MultiLociBiallelicModel on bialleic SNP data
The data
Data is generated by simulating sanger 100 barcode data for 100 samples with simulated COIs. main Data tab for further details.
R packages
Install R package vcfR for reading in vcf data.
library(tidyverse)
library(vcfR)
Converting from VCF to input data
The input data to MultiLociBiallelicModel requires a table with sample ID in first column and then a column for each marker present with a 0 == homozygous reference, 1 == homozygous alt, and 2 for heterozygote. This only works on biallelic SNPS and therefore should be limited to only these SNPs.
# read in vcf
= read.vcfR("../../data/snp_barcode/SpotMalariapfPanel_simData_sanger100.vcf.gz") SpotMalariapfPanel_simData_sanger100
Scanning file to determine attributes.
File attributes:
meta lines: 76
header_line: 77
variant count: 100
column count: 109
Meta line 76 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 100
Character matrix gt cols: 109
skip: 0
nrows: 100
row_num: 0
Processed variant: 100
All variants processed
# get the genotyping info
= SpotMalariapfPanel_simData_sanger100@gt %>%
SpotMalariapfPanel_simData_sanger100_gt_tibble as_tibble()
# get the genomic location info
= as_tibble(getFIX(SpotMalariapfPanel_simData_sanger100))
SpotMalariapfPanel_simData_sanger100_fixDf
# get a vector of whether or not the SNPs are biallelic
= is.biallelic(SpotMalariapfPanel_simData_sanger100)
snpIsBiallelic
# filter and combine the genotyping data
= SpotMalariapfPanel_simData_sanger100_fixDf[snpIsBiallelic, ]
SpotMalariapfPanel_simData_sanger100_fixDf_filt = SpotMalariapfPanel_simData_sanger100_fixDf_filt %>%
SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt bind_cols( SpotMalariapfPanel_simData_sanger100_gt_tibble[is.biallelic(SpotMalariapfPanel_simData_sanger100),]) %>%
gather(sample, call, 9:ncol(.)) %>%
mutate(FORMAT = strsplit(FORMAT, split = ":"),
call = strsplit(call, split = ":")) %>%
unnest(cols = c(FORMAT, call))
# get the genotyping calls
= SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt %>%
SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt filter(FORMAT == "GT") %>%
mutate(varID = paste0(CHROM, "-", POS)) %>%
mutate(genotypeCall = case_when(
"0/0" == call ~ "0",
"1/1" == call ~ "1",
"0/1" == call ~ "2"
%>%
)) mutate(genotypeCall = as.numeric(genotypeCall))
# create the matrix for the input into the mle data
= SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt %>%
SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_sp select(sample, varID, genotypeCall) %>%
rename(ID = sample) %>%
spread(varID, genotypeCall)
The mle function is found within the R script supplied by the publication[@Tsoungui_Obama2022-gz]. The number of loci used in the publication was limited to about 10 and the function is unable to do large number of loci due to the size of the matrix that would have to be created so would advise limiting number of loci used, here doing the 6 most diverse loci by he.
# calculating he
= SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt %>%
SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod mutate(genotypeCallExpand = case_when(
2 == genotypeCall ~ "ref/alt",
1 == genotypeCall ~ "alt",
0 == genotypeCall ~ "ref"
))
= SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod %>%
SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod_sel select(varID, genotypeCallExpand) %>%
mutate(genotypeCallExpand = strsplit(genotypeCallExpand, split = "/")) %>%
unnest(genotypeCallExpand) %>%
group_by(varID, genotypeCallExpand) %>%
count() %>%
group_by(varID) %>%
mutate(total = sum(n)) %>%
mutate(freq = n/total)
= SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod_sel %>%
SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod_sel_he group_by(varID) %>%
summarise(he = 1 - sum(freq^2)) %>%
arrange(desc(he))
MLE
# select just the top 6 loci
= SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod_sel_he %>%
SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod_sel_he_top head(n = 6)
# select out only the sample id and the varID
= c("ID", SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_mod_sel_he_top$varID)
selCols
= SpotMalariapfPanel_simData_sanger100_fixDf_filt_allgt_gt_sp[, selCols]
inputToMLE
# load mle functions
source("SNPModel.R")
# Find the MLEs
<- mle(inputToMLE, id=TRUE) est
prevalence
Several prevalences are called from the mle estimation results. These are the definition provided by the paper[@Tsoungui_Obama2022-gz].
- Unobservable prevalence
Since haplotype information is typically unavailable from molecular assays, haplotypes are per se not observable in molecular samples. To emphasize this fact we call the probability that a haplotype occurs in an infection “unobservable prevalence.”
- Conditional prevalence
Because of ambiguity of haplotype information in multiple infections, it is impossible to identify the number of samples containing haplotype h in a dataset. In practice, often only unambiguous samples are considered, to determine prevalence. Here, we derive the corresponding quantity in the underlying framework, i.e., the prevalence of haplotype h, conditioned on observing only unambiguous data. The quantity is referred to as “conditional prevalence.”
- Relative prevalence
Due to unobservable information, a statistical model is required to obtain estimates for frequencies. However, in practice, “ad-hoc” estimates are popular if statistical methods are not available. Frequency estimates can be obtained, by first disregarding all ambiguous observations, calculate the empirically observed unambiguous prevalence of all haplotypes, and finally normalizing them - here we refer to this as the “relative unambiguous prevalence.”
# Estimate prevalence
## Unobservable prevalence
<- estunobsprev(est)
unobsprev = tibble(
unobsprev_df haplotype = dimnames(unobsprev)[[2]],
unobsprevalence = c(unobsprev)
)
## Conditional prevalence
<- estcondprev(est)
condprev = tibble(
condprev_df haplotype = dimnames(condprev)[[2]],
condprevalence = c(condprev)
)
## Relative prevalence
<- estrelprev(inputToMLE, id=TRUE)
relprev = tibble(
relprev_df haplotype = dimnames(relprev)[[2]],
relprevalence = c(relprev)
)
= full_join(
prevalences
unobsprev_df,
condprev_df%>%
) full_join(relprev_df) %>%
arrange(desc(unobsprevalence))
Joining with `by = join_by(haplotype)`
Joining with `by = join_by(haplotype)`
::datatable(prevalences) DT
Sample MOI calls
Using the mle estimation results combined with the haplotype calls for the sample there are functions provided for making MOI estimates based on what’s the most probable MOI. The function samplwiseMOI gives you a list of probabilities for a range of MOIs and returns the most the probable MOI.
# convert to matrix to make it easier to provide the haplotype call to the function
= as.matrix(inputToMLE[, 2:ncol(inputToMLE)])
inputToMLE_mat rownames(inputToMLE_mat) = inputToMLE$ID
= tibble()
allSampleMOIs # iterate over each sample by row and get an estimate of the MOI for each sample by providing the haplotype call for that sample
for (row in 1:nrow(inputToMLE_mat)) {
# estimate MOI probabilites
= samplwiseMOI(inputToMLE_mat[row, ], est)
currentSampleMOI # gathering
= bind_rows(allSampleMOIs,
allSampleMOIs tibble(sample = rownames(inputToMLE_mat)[row],
MOI = currentSampleMOI$MOI))
}
::datatable(allSampleMOIs) DT
Since these samples are simualted, the known COI for the sample is known so can compare simulated MOI(COI) vs determined MOI by MultiLociBiallelicModel.
= readr::read_tsv("../../data/snp_barcode/SpotMalariapfPanel_simData_sanger100_simCOIs.tab.txt") %>%
simCOI rename(sample = Patient) %>%
left_join(allSampleMOIs)
Rows: 100 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Patient
dbl (1): initialCOI
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(sample)`
ggplot(simCOI) +
geom_count(aes(x = initialCOI,
y = MOI)) +
geom_abline(intercept = 0, slope = 1)
= simCOI %>%
simCOI_sum group_by(initialCOI, MOI) %>%
count()
ggplot(simCOI_sum) +
geom_col(aes(x = initialCOI, y = n, fill = factor(MOI)), color = "black", position = "dodge") +
scale_fill_brewer("Observed MOI", palette = "Dark2") +
theme_minimal() +
labs(x = "Simualted COI")