Simulate biallelic data — sim_biallelic • MIPanalyzer

Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level allele frequencies (PLAF) and some parameters dicating skew and error distributions. Outputs include the phased haplotypes and the un-phased read count and coverage data.

sim_biallelic(
  COI = 3,
  PLAF = runif(10, 0, 0.5),
  coverage = 100,
  alpha = 1,
  overdispersion = 0,
  epsilon = 0
)

Arguments

COI: complexity of infection.
PLAF: vector of population-level allele frequencies at each locus.
coverage: coverage at each locus. If a single value then the same coverage is applied over all loci.
alpha: shape parameter of the symmetric Dirichlet prior on strain proportions.
overdispersion: the extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-binomially distributed, with the beta distribution having shape parameters p/overdispersion and (1-p)/overdispersion.
epsilon: the probability of a single read being mis-called as the other allele. Applies in both directions.

Details

Simulated data are drawn from a simple statistical model:

Strain proportions are drawn from a symmetric Dirichlet distribution with shape parameter alpha.
Phased haplotypes are drawn at every locus, one for each COI. The allele at each locus is drawn from a Bernoulli distribution with probability given by the PLAF.
The "true" within-sample allele frequency at every locus is obtained by multiplying haplotypes by their strain proportions, and summing over haplotypes. Errors are introduced through the equation $$wsaf_error = wsaf*(1-e) + (1-wsaf)*e$$where $wsaf$ is the WSAF without error and $e$ is the error parameter epsilon.
Final read counts are drawn from a beta-binomial distribution with expectation $w_error$. The raw number of draws is given by the coverage, and the skew of the distribution is given by the overdispersion parameter. If overdispersion = 0 then the distribution is binomial, rather than beta-binomial.