Running `coiaf`

Introduction

For this tutorial, we will be using the coiaf package to estimate the complexity of infection (COI) from SNP barcode data. Instructions to install this package may be found here.

The data

coiaf expects a matrix of within sample reference allele frequencies (WSAF), as well as an estimate of population level reference allele frequencies (PLAF). The WSAF matrix should have samples in rows and sites in columns. The PLAF should be a vector of length equal to the number of sites. To explore this, we will use simulated SNP barcoding data as described here. We will load the data using the vcfR package.

vcf_data <- vcfR::read.vcfR(
  here::here("data/snp_barcode/SpotMalariapfPanel_simData_sanger100.vcf.gz")
)

Scanning file to determine attributes.
File attributes:
  meta lines: 76
  header_line: 77
  variant count: 100
  column count: 109

Meta line 76 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 100
  Character matrix gt cols: 109
  skip: 0
  nrows: 100
  row_num: 0

Processed variant: 100
All variants processed

To calculate within host allele frequencies, we need to extract the depth of coverage and allele counts

coverage <- t(vcfR::extract.gt(vcf_data, element = "DP", as.numeric = TRUE))
counts_raw <- t(vcfR::extract.gt(vcf_data, element = "AD"))
counts <- vcfR::masplit(
  counts_raw, record = 1, sort = FALSE, decreasing = FALSE
)

# We then directly calculate the WSAF
wsaf <- counts / coverage

# and estimate the PLAF with the empirical mean across samples
plaf <- colMeans(wsaf, na.rm = TRUE)

Running `coiaf`

coiaf calcualtes the COI on a sample by sample basis, so we need to break up the WSAF matrix into a list of data frames

input_data <- purrr::map(seq_len(nrow(wsaf)), function(i) {
  tibble::tibble(wsmaf = wsaf[i, ], plmaf = plaf) |>
    tidyr::drop_na()
})

We can then run coiaf on each sample using purrr::map_dbl

results <- purrr::map_dbl(
  input_data, ~ coiaf::optimize_coi(.x, data_type = "real")
)

# We can then combine the results into a data frame with the sample names
res_df <- data.frame(Patient = rownames(wsaf), COI = results)

Plotting the results

Now that we have the estimated COI for each sample, we can plot the results.

library(ggplot2)
ggplot(res_df, aes(x = COI)) +
  geom_histogram() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Uncertainty quantification

coiaf also provides a function to estimate the uncertainty in the COI estimate. This function uses a non-parametric bootstrap to estimate the uncertainty in the COI estimate. Note that this function can take a long time to run, so we will only run it on the first 10 samples.

# We can use the same input data as before
bootstrap_results <- purrr::map(
  input_data[1:10], ~ coiaf::bootstrap_ci(.x, solution_method = "continuous")
)

[1] "All values of t are equal to  1 \n Cannot calculate confidence intervals"

# We can then combine the results into a data frame with the sample names
bootstrap_df <- data.frame(
  Patient = rownames(wsaf)[1:10], bootstrap_results |> dplyr::bind_rows()
)

We can then plot the results for each sample

ggplot(
  bootstrap_df,
  aes(x = Patient, y = coi, ymin = conf.low, ymax = conf.high)
) +
  geom_pointrange() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Summary

In this tutorial, we have shown how to use coiaf to estimate the complexity of infection from SNP barcode data. For more information on the package, please see the package documentation.

Introduction

The data

Running coiaf

Plotting the results

Uncertainty quantification

Summary

Running `coiaf`