Margin of error calculations when estimating prevalence from a clustered survey

Calculate the expected margin of error when estimating prevalence from a clustered survey, or calculate the sample size required to achieve a given target margin of error.

get_margin(N, n_clust, prevalence = 0.2, ICC = 0.05, alpha = 0.05)

get_sample_size_margin(
  MOE,
  n_clust,
  prevalence = 0.2,
  ICC = 0.05,
  alpha = 0.05
)

get_margin_CP(N, n_clust, prevalence = 0.2, ICC = 0.05, alpha = 0.05)

get_sample_size_margin_CP(
  MOE,
  n_clust,
  prevalence = 0.2,
  ICC = 0.05,
  alpha = 0.05,
  N_max = 2000
)

Arguments

N: the number of samples obtained from each cluster, assumed the same over all clusters.
n_clust: the number of clusters.
prevalence: the true prevalence of the marker in the population as a proportion between 0 and 1.
ICC: assumed true intra-cluster correlation (ICC) between 0 and 1.
alpha: the significance level of the CI, e.g. alpha = 0.05 gives a 95% CI.
MOE: the target margin of error, as a proportion between 0 and 1.
N_max: the largest value of $N$ to consider.

Value

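The functions get_margin() and get_margin_CP() return the expected lower and upper CI limits on the prevalence as percentages. Strictly speaking, these limits are not the MOE, which is the distance between each limit and the assumed prevalence. However, we feel the limits are a more useful and more intuitive output. The functions get_sample_size_margin() and get_sample_size_margin_CP() return the per-cluster sample size $N$ required to achieve the target MOE.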

Details

A very common approach when constructing confidence intervals (CIs) from prevalence data is to use the Wald interval:

$$\hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{N}}$$

where $\hat{p}$ is our estimate of the prevalence, $z$ is the critical value of the normal distribution ($z=1.96$ for a 95% interval) and $N$ is the sample size. When estimating prevalence from a clustered survey, we need to modify this formula as follows:

$$\hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{Nc}(1 + (N - 1) r)}$$

where $\hat{p}$ is the mean prevalence over clusters, $N$ is now the number of samples per cluster, $c$ is the number of clusters, and $r$ is the intra-cluster correlation (ICC, a value between 0 and 1). The term to the right of the $\pm$ symbol is called the margin of error (MOE), which we denote $d$. The function get_margin() returns the values $\hat{p}-d$ and $\hat{p}+d$, i.e. the expected lower and upper limits of our CI.
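The cluster-adjusted Wald interval above can be sketched in a few lines of base R. This is an illustration only: `wald_cluster_ci()` is a hypothetical helper, not a DRpower function, and it assumes that `N` is the per-cluster sample size.

```r
# Hypothetical helper (not part of DRpower): cluster-adjusted Wald interval.
# N is assumed to be the number of samples per cluster.
wald_cluster_ci <- function(N, n_clust, p_hat, r, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)        # critical value, about 1.96 for alpha = 0.05
  deff <- 1 + (N - 1) * r          # design effect due to clustering
  d <- z * sqrt(p_hat * (1 - p_hat) / (N * n_clust) * deff)  # margin of error
  c(lower = p_hat - d, upper = p_hat + d)
}

# Same inputs as the get_margin() call in the Examples, expressed as percentages
ci <- wald_cluster_ci(N = 60, n_clust = 3, p_hat = 0.2, r = 0.05) * 100
ci
```

With these inputs the limits come out at about 8.39 and 31.61, in agreement with the get_margin() output shown in the Examples.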

We can also rearrange this formula to get the sample size ($N$) required to achieve any given MOE:

$$ N = \frac{ z^2p(1-p)(1-r) }{ cd^2 - z^2p(1-p)r } $$
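As a check on the rearrangement, square the MOE (writing $p$ in place of $\hat{p}$ and taking $N$ to be the per-cluster sample size), multiply through by $Nc$, and collect the terms in $N$:

$$ d^2 = \frac{z^2 p(1-p)}{Nc}\bigl(1 + (N-1)r\bigr) $$

$$ Ncd^2 = z^2p(1-p)(1-r) + N z^2p(1-p)r $$

$$ N\bigl(cd^2 - z^2p(1-p)r\bigr) = z^2p(1-p)(1-r) $$

Dividing both sides by the bracketed term on the left recovers the expression for $N$.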

The function get_sample_size_margin() returns the value of $N$. Note that in some cases it is not possible to achieve the specified MOE for any finite sample size, because the ICC introduces too much variation. This happens when the denominator is not positive, i.e. when $cd^2 \le z^2p(1-p)r$. In that case the formula gives a negative or undefined value and the function returns an error.
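A minimal sketch of this calculation in base R follows. `wald_cluster_sample_size()` is a hypothetical name, and rounding up to a whole number of samples is an assumption made here rather than something stated above.

```r
# Hypothetical helper (not part of DRpower): per-cluster sample size needed
# to reach a target MOE under the cluster-adjusted Wald interval.
wald_cluster_sample_size <- function(MOE, n_clust, p, r, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  numer <- z^2 * p * (1 - p) * (1 - r)
  denom <- n_clust * MOE^2 - z^2 * p * (1 - p) * r
  if (denom <= 0) {
    # The ICC term alone exceeds the target variance, so no finite N works
    stop("target MOE cannot be reached for any finite N")
  }
  ceiling(numer / denom)  # assumption: round up to whole samples
}

wald_cluster_sample_size(MOE = 0.07, n_clust = 3, p = 0.2, r = 0.01)
#> [1] 72

# With a larger ICC the same target becomes unreachable
res <- try(wald_cluster_sample_size(MOE = 0.07, n_clust = 3, p = 0.2, r = 0.05),
           silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```

The second call fails because, with three clusters and an ICC of 0.05, the between-cluster variation alone is larger than the variance implied by a 0.07 margin.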

Although this is a very common approach, it has several weaknesses. First, notice that we sneakily replaced $\hat{p}$ with $p$ when moving to the sample size formula above. This implies that there is no uncertainty in our prevalence estimate, which is not true. Also note that the Wald interval assumes that the sampling distribution of our estimator is Gaussian, which is also not true. The difference between the Gaussian and the true distribution is particularly pronounced when prevalence is at the extremes of the range (near 0% or 100%). Here, the Wald interval can actually include values less than 0 or greater than 1, which are nonsensical.
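The out-of-range behaviour is easy to reproduce. The values below (2% prevalence, 20 samples in each of 3 clusters, ICC of 0.05) are chosen purely for illustration.

```r
# Illustrative inputs only: low prevalence and a small clustered sample
p <- 0.02
N <- 20        # samples per cluster
n_clust <- 3
r <- 0.05      # intra-cluster correlation

z <- qnorm(0.975)
d <- z * sqrt(p * (1 - p) / (N * n_clust) * (1 + (N - 1) * r))

wald_ci <- c(lower = p - d, upper = p + d)
wald_ci
```

Here the margin of error (about 0.05) is larger than the prevalence itself, so the lower limit is roughly -0.03, an impossible value for a proportion.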

An arguably better approach is to construct CIs using the method of Clopper and Pearson (1934). This confidence interval guarantees that the false positive rate is at most $\alpha$ (equivalently, that coverage is at least $1 - \alpha$), and in this sense is conservative. It can be asymmetric and does not suffer from the problem of allowing values outside the [0,1] range. To make the Clopper-Pearson interval apply to a multi-cluster survey, we can use the idea of effective sample size, $N_e$:

$$ D_{eff} = 1 + (N - 1)r $$

$$ N_e = \frac{Nc}{D_{eff}} $$

We then calculate the Clopper-Pearson CI using $N_e$ in place of the total sample size. The function get_margin_CP() returns the expected lower and upper CI limits using the Clopper-Pearson interval, and the function get_sample_size_margin_CP() returns the corresponding sample size needed to achieve a certain MOE. Because the interval can be asymmetric, the MOE here is taken as the larger of the two distances from the assumed prevalence to the lower and upper limits.
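The effective-sample-size idea can be sketched with the beta-quantile form of the Clopper-Pearson limits. Both functions below are hypothetical helpers, not the DRpower implementation. They assume that the effective sample size is the total number of samples over all clusters divided by the design effect, that the expected number of positives is $N_e p$ (generally not an integer), and that the sample size is found by a simple upward search, stopping at `N_max`.

```r
# Hypothetical helper (not part of DRpower): Clopper-Pearson limits computed
# on an effective sample size.
cp_cluster_ci <- function(N, n_clust, p, r, alpha = 0.05) {
  deff <- 1 + (N - 1) * r
  N_e <- N * n_clust / deff   # assumption: total sample size / design effect
  x <- N_e * p                # expected number of positives (may be non-integer)
  c(lower = qbeta(alpha / 2, x, N_e - x + 1),
    upper = qbeta(1 - alpha / 2, x + 1, N_e - x))
}

# Hypothetical helper: smallest per-cluster N whose worst-side margin meets
# the target, found by linear search (an assumed strategy).
cp_cluster_sample_size <- function(MOE, n_clust, p, r, alpha = 0.05,
                                   N_max = 2000) {
  for (N in seq_len(N_max)) {
    ci <- cp_cluster_ci(N, n_clust, p, r, alpha)
    if (max(p - ci[["lower"]], ci[["upper"]] - p) <= MOE) {
      return(N)
    }
  }
  stop("target MOE not reached for any N up to N_max")
}

cp_cluster_ci(N = 60, n_clust = 3, p = 0.2, r = 0.05) * 100
cp_cluster_sample_size(MOE = 0.14, n_clust = 3, p = 0.2, r = 0.01)
```

If these assumptions match the package's conventions, the results should be close to the get_margin_CP() and get_sample_size_margin_CP() outputs in the Examples. Any discrepancy would point to a difference in how the effective sample size or the expected count is defined.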

A third option is to use the DRpower Bayesian model to estimate the credible interval of prevalence. See ?get_margin_Bayesian() for how to do this.

References

Clopper, C.J. and Pearson, E.S., 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26, 404–413. doi: 10.2307/2331986.

Examples

get_margin(N = 60, n_clust = 3, prevalence = 0.2)
#>     lower     upper 
#>  8.386306 31.613694 

get_sample_size_margin(MOE = 0.07, n_clust = 3, prevalence = 0.2, ICC = 0.01)
#> [1] 72

get_margin_CP(N = 60, n_clust = 3, prevalence = 0.2)
#>     lower     upper 
#>  9.630688 34.487997 

get_sample_size_margin_CP(MOE = 0.14, n_clust = 3, prevalence = 0.2, ICC = 0.01)
#> [1] 19