Estimate prevalence and intra-cluster correlation from raw counts

Takes raw counts of the number of positive samples per cluster (numerator) and the number of tested samples per cluster (denominator) and returns posterior estimates of the prevalence and intra-cluster correlation coefficient (ICC).

get_prevalence(
  n,
  N,
  alpha = 0.05,
  prev_thresh = 0.05,
  ICC = NULL,
  prior_prev_shape1 = 1,
  prior_prev_shape2 = 1,
  prior_ICC_shape1 = 1,
  prior_ICC_shape2 = 9,
  MAP_on = TRUE,
  post_mean_on = FALSE,
  post_median_on = FALSE,
  post_CrI_on = TRUE,
  post_thresh_on = TRUE,
  post_full_on = FALSE,
  post_full_breaks = seq(0, 1, l = 1001),
  CrI_type = "HDI",
  n_intervals = 20,
  round_digits = 2,
  use_cpp = TRUE,
  silent = FALSE
)

get_ICC(
  n,
  N,
  alpha = 0.05,
  prior_prev_shape1 = 1,
  prior_prev_shape2 = 1,
  prior_ICC_shape1 = 1,
  prior_ICC_shape2 = 9,
  MAP_on = TRUE,
  post_mean_on = FALSE,
  post_median_on = FALSE,
  post_CrI_on = TRUE,
  post_full_on = FALSE,
  post_full_breaks = seq(0, 1, l = 1001),
  CrI_type = "HDI",
  n_intervals = 20,
  round_digits = 4,
  use_cpp = TRUE
)

Arguments

n, N

the numerator (n, the number of positive samples) and denominator (N, the number of tested samples) per cluster. These are both integer vectors.

alpha

the significance level of the credible interval. For example, use alpha = 0.05 for a 95% interval. See also the CrI_type argument for how this is calculated.

prev_thresh

the prevalence threshold that we are comparing against. Can be a vector, in which case the return object contains one value for each input.

ICC

normally this should be set to NULL (the default), in which case the ICC is estimated from the data. However, a fixed value can be entered here, in which case this overrides the use of the prior distribution as specified by prior_ICC_shape1 and prior_ICC_shape2.

prior_prev_shape1, prior_prev_shape2, prior_ICC_shape1, prior_ICC_shape2

parameters that dictate the shape of the Beta priors on prevalence and the ICC. See the Wikipedia page on the Beta distribution for more detail. The default values of these parameters were chosen based on an analysis of historical pfhrp2/3 studies, although this does not guarantee that they will be suitable in all settings.

MAP_on, post_mean_on, post_median_on, post_CrI_on, post_thresh_on, post_full_on

a series of boolean values specifying which outputs to produce. The options are:

MAP_on: if TRUE then return the maximum a posteriori (MAP) estimate.
post_mean_on: if TRUE then return the posterior mean.
post_median_on: if TRUE then return the posterior median.
post_CrI_on: if TRUE then return the posterior credible interval at significance level alpha. See the CrI_type argument for how this is calculated.
post_thresh_on: if TRUE then return the posterior probability of being above the threshold(s) specified by prev_thresh.
post_full_on: if TRUE then return the full posterior distribution, produced using the adaptive quadrature approach, at breaks specified by post_full_breaks.

post_full_breaks

a vector of breaks at which to evaluate the full posterior distribution (only if post_full_on = TRUE). Defaults to 0.1% intervals from 0% to 100%.

CrI_type

which method to use when computing credible intervals. Options are "ETI" (equal-tailed interval) or "HDI" (highest-density interval). The ETI excludes a probability mass of alpha/2 from each tail of the distribution. The HDI is the narrowest interval that contains a proportion 1-alpha of the distribution. The HDI method is used by default as it guarantees that the MAP estimate is within the credible interval, which is not always the case for the ETI.

n_intervals

the number of intervals used in the adaptive quadrature method. Increasing this value gives a more accurate representation of the true posterior, but comes at the cost of reduced speed.

round_digits

the number of digits after the decimal point that are used when reporting estimates. This is to simplify results and to avoid giving the false impression of extreme precision.

use_cpp

if TRUE (the default) then use an Rcpp implementation of the adaptive quadrature approach that is much faster than the base R method.

silent

if TRUE then suppress all console output.
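
The difference between the two CrI_type options can be illustrated in base R without calling the package. The sketch below uses a Beta(1, 9) distribution as a stand-in for a posterior (an assumption made purely for illustration; it is not output from get_prevalence() or get_ICC()), and finds the HDI by a simple grid search rather than by whatever method the package uses internally.

```r
alpha <- 0.05
shape1 <- 1   # hypothetical stand-in distribution: Beta(1, 9),
shape2 <- 9   # whose mode (the "MAP") sits at 0

# ETI: exclude a probability mass of alpha/2 from each tail
eti <- qbeta(c(alpha / 2, 1 - alpha / 2), shape1, shape2)

# HDI: among all intervals holding 1 - alpha of the mass,
# pick the narrowest (grid search over the lower-tail mass)
p_lower <- seq(0, alpha, length.out = 501)
p_upper <- pmin(1, p_lower + (1 - alpha))
widths <- qbeta(p_upper, shape1, shape2) - qbeta(p_lower, shape1, shape2)
best <- which.min(widths)
hdi <- qbeta(c(p_lower[best], p_upper[best]), shape1, shape2)

map_value <- 0  # mode of Beta(1, 9)

eti  # lower bound is above 0, so the mode falls outside the ETI
hdi  # lower bound is 0, so the mode falls inside the HDI
```

For a distribution piled up against zero like this one, the ETI lower bound is strictly positive and therefore excludes the mode, whereas the HDI starts at zero and is also narrower.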

Details

There are two unknown quantities in the DRpower model: the prevalence and the intra-cluster correlation (ICC). These functions integrate over a prior on one quantity to arrive at the marginal posterior distribution of the other. Possible outputs include the maximum a posteriori (MAP) estimate, the posterior mean, posterior median, credible interval (CrI), probability of being above a set threshold, and the full posterior distribution. For speed, distributions are approximated using an adaptive quadrature approach in which the full distribution is split into intervals and each interval is approximated using Simpson's rule. The number of intervals used in quadrature (n_intervals) can be increased for more accurate results at the cost of slower speed.
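
To give a feel for the accuracy and speed trade-off, here is a minimal sketch of Simpson's rule applied interval by interval in base R. It is not the package's implementation: the function name simpson_intervals is hypothetical, the intervals are fixed and equal-width rather than adaptive, and the integrand is simply the Beta(1, 9) density that matches the default ICC prior.

```r
# Illustrative only: Simpson's rule on fixed, equal-width intervals.
# The package itself uses an adaptive scheme, which is not reproduced here.
simpson_intervals <- function(f, lower = 0, upper = 1, n_intervals = 20) {
  edges <- seq(lower, upper, length.out = n_intervals + 1)
  left <- edges[-length(edges)]
  right <- edges[-1]
  mid <- (left + right) / 2
  # Simpson's rule per interval: (width / 6) * (f(a) + 4 * f(m) + f(b))
  (right - left) / 6 * (f(left) + 4 * f(mid) + f(right))
}

# Beta(1, 9) density, matching the default prior_ICC_shape1 and prior_ICC_shape2
prior_ICC <- function(x) dbeta(x, shape1 = 1, shape2 = 9)

area_20 <- sum(simpson_intervals(prior_ICC, n_intervals = 20))
area_200 <- sum(simpson_intervals(prior_ICC, n_intervals = 200))

# A density integrates to 1, so these are the approximation errors
abs(area_20 - 1)
abs(area_200 - 1)
```

The error with 200 intervals is smaller than with 20, at the price of ten times as many function evaluations. This is the same trade-off that n_intervals controls.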

Examples

# basic example of estimating prevalence and
# ICC from observed counts
sample_size <- c(80, 110, 120)
deletions <- c(3, 5, 6)

get_prevalence(n = deletions, N = sample_size)
#>    MAP CrI_lower CrI_upper prob_above_threshold
#> 1 4.96       1.7     15.72               0.6739
get_ICC(n = deletions, N = sample_size)
#>   MAP CrI_lower CrI_upper
#> 1   0         0    0.1642
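
The optional arguments described above can be combined in the same way. The following sketch reuses the counts from the basic example and only uses arguments documented on this page; the second threshold (0.10) and the fixed ICC value (0.05) are arbitrary choices for illustration. Output is not shown, and the calls are skipped if DRpower is not installed.

```r
sample_size <- c(80, 110, 120)
deletions <- c(3, 5, 6)

if (requireNamespace("DRpower", quietly = TRUE)) {
  # compare against two thresholds at once, using an equal-tailed interval
  res_thresh <- DRpower::get_prevalence(n = deletions, N = sample_size,
                                        prev_thresh = c(0.05, 0.10),
                                        CrI_type = "ETI")
  print(res_thresh)

  # fix the ICC at an assumed value instead of estimating it from the data
  res_fixed <- DRpower::get_prevalence(n = deletions, N = sample_size,
                                       ICC = 0.05)
  print(res_fixed)
}
```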