Using `dcifer` to estimate relatedness from biallelic SNP data

Author

Kathryn Murie

Published

December 11, 2023

The data

Below we will demonstrate how to use dcifer using biallelic Sanger 100-SNP barcode data in .vcf format. We will use data created by simulating 100 polyclonal infections from Bangladesh (n=50) and Ghana (n=50). See the PGEforge website for further details.

In this tutorial we will use PGEhammer to convert data from the VCF format to the format required by Dcifer. To install the package run the following command

# Install PGEhammer in R:
install.packages('PGEhammer', repos = c('https://plasmogenepi.r-universe.dev', 'https://cloud.r-project.org'))


The downloaded binary packages are in
    /var/folders/wx/rr171mzs0lj0mtflng6dwl7h0000gp/T//Rtmp8DqrQV/downloaded_packages

Now we can load the packages we will need.

library(tidyverse)
library(here)
library(dcifer)
library(vcfR)
library(kableExtra)
library(PGEhammer)

Wrangling the data

Dcifer requires input data in long format, the long format represents data with each observation on a separate row. This data can be biallelic or multiallelic. In the following steps, we will convert the Variant Call Format (VCF) data to the required long format using the function vcf2long from PGEhammer.

# Read vcf
vcf <- read.vcfR(here('data/snp_barcode/SpotMalariapfPanel_simData_sanger100.vcf.gz'))

Scanning file to determine attributes.
File attributes:
  meta lines: 76
  header_line: 77
  variant count: 100
  column count: 109

Meta line 76 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 100
  Character matrix gt cols: 109
  skip: 0
  nrows: 100
  row_num: 0

Processed variant: 100
All variants processed

# Convert to long format 
df <- vcf2long(vcf)

Converting from vcf to long format...

Reformatting complete.

Warning in vcf2long(vcf): Your vcf is not all bi-allelic. Make sure to double
check if this is not expected.

head(df) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

sample_id	locus	allele	read_count
Ghana-45	Pf3D7_01_v3_145515	allele-1	889
Ghana-45	Pf3D7_01_v3_145515	allele-2	7354
Ghana-45	Pf3D7_01_v3_179347	allele-1	1061
Ghana-45	Pf3D7_01_v3_179347	allele-2	9661
Ghana-45	Pf3D7_01_v3_180554	allele-1	9564
Ghana-45	Pf3D7_01_v3_180554	allele-2	5

Calculate COI and allele frequencies

Before we calculate IBD we first need to calculate COI. Below we use the function getCOI that Dcifer provides which uses naive estimation, but you could use another tool for this.

lrank <- 2
coi   <- getCOI(dsmp, lrank = lrank)

The last thing we need to do before calculating IBD is to add in allele frequencies. Again we use a function within Dcifer for this, calcAfreq.

afreq <- calcAfreq(dsmp, coi, tol = 1e-5) 
str(afreq, list.len = 2)

List of 87
 $ t1  : Named num [1:5] 0.4239 0.2808 0.1116 0.0422 0.1415
  ..- attr(*, "names")= chr [1:5] "D10--D6--FCR3--V1-S.0" "HB3.0" "t1.0" "t1.1" ...
 $ t10 : Named num [1:4] 0.8539 0.00942 0.00951 0.12717
  ..- attr(*, "names")= chr [1:4] "D10--D6--HB3.0" "t10.0" "t10.2" "U659.0"
  [list output truncated]

dres0 <- ibdDat(dsmp, coi, afreq, pval = TRUE, confint = TRUE, rnull = 0, 
                alpha = 0.05, nr = 1e3)

Visualising the Data

Here we use plotRel to visualise the data, comparing samples on the axes. Significantly related samples are outlined in red.

alpha <- 0.05                          # significance level                    
dmat <- dres0[, , "estimate"]
# create symmetric matrix
dmat[upper.tri(dmat)] <- t(dmat)[upper.tri(t(dmat))]  
#  determine significant, reverse columns for upper triangle
isig <- which(dres0[, , "p_value"] <= alpha, arr.ind = TRUE)[, 2:1] 
plotRel(dmat, isig = isig, draw_diag = TRUE, lwd_diag = 0.5, idlab = TRUE, )

Summary

In summary, we have used Dcifer to estimate COI and allele frequencies before estimating IBD. Dcifer has extensive documentation, including more details on other functionality available within the tool and a tutorial using microhaplotype data.