The core of dust is a very fast parallel random number generation system. This vignette describes the details of the system and why it is needed.

We provide an interface to the “Xoshiro” / Blackman-Vigna family of generators (Xoshiro is derived from XOR/shift/rotate). These are designed to allow use in parallel by “jumping ahead” in the sequence and we use this below to interleave generators. The stream is unrelated to and unaffected by R’s random number generation. set.seed has no effect, for example. The random numbers are not cryptographically secure; for that see the excellent sodium package.

Ordinarily this is used from C++; the model as discussed in vignette("dust") uses rng_state_type and a function from dust::random to interact with the generator. However, an R interface is provided for debugging and testing purposes.

rng <- dust::dust_rng$new(seed = 1)
rng
#> <dust_rng>
#>   Public:
#>     binomial: function (n, size, prob, n_threads = 1L) 
#>     exponential: function (n, rate, n_threads = 1L) 
#>     gamma: function (n, shape, scale, n_threads = 1L) 
#>     hypergeometric: function (n, n1, n2, k, n_threads = 1L) 
#>     info: list
#>     initialize: function (seed = NULL, n_streams = 1L, real_type = "double", 
#>     jump: function () 
#>     long_jump: function () 
#>     multinomial: function (n, size, prob, n_threads = 1L) 
#>     nbinomial: function (n, size, prob, n_threads = 1L) 
#>     normal: function (n, mean, sd, n_threads = 1L, algorithm = "box_muller") 
#>     poisson: function (n, lambda, n_threads = 1L) 
#>     random_normal: function (n, n_threads = 1L, algorithm = "box_muller") 
#>     random_real: function (n, n_threads = 1L) 
#>     size: function () 
#>     state: function () 
#>     uniform: function (n, min, max, n_threads = 1L) 
#>   Private:
#>     float: FALSE
#>     n_streams: 1
#>     ptr: externalptr

Currently only a few distributions are supported, with an interface that somewhat mimics R’s interface (we avoid abbreviations so that rnorm becomes normal and there are no default arguments). For example, with the generator above we can generate 100 standard normal samples:

rng$normal(100, 0, 1)
#>   [1]  1.12910079  0.53780677 -1.26698746  0.95829321  3.21195234  0.09921154
#>   [7] -0.18600342 -0.29222356  2.08163389 -0.05121175  0.27770699  0.53856861
#>  [13]  0.03225818 -0.32777772  0.18042116  0.46339742 -0.38466512 -1.13577238
#>  [19]  0.91316208  0.03922728 -1.49714784  0.73104348 -0.83564659 -1.11930919
#>  [25]  0.02762058 -0.99778204  0.85397997  0.93862789  0.27005717 -0.22477757
#>  [31] -0.44778373  0.14367032 -0.06827267  0.68396692 -1.16518273 -0.89172407
#>  [37] -0.29950034  0.72347846 -1.14585657  0.34617660 -0.03401702  0.04667183
#>  [43]  0.09952394 -1.94032159  1.35699294 -0.34150243 -0.62118413  0.06133529
#>  [49]  1.18840972  0.56786089 -1.70569039  0.37351828  0.78885706  0.52182712
#>  [55] -0.82799139 -0.04261288  0.61557195  0.83137281 -1.21370978 -0.94812792
#>  [61] -0.12333629  0.52869359 -0.88974620 -0.17544455  1.80250618  1.12137595
#>  [67]  0.30129136 -0.07441952  1.03229412 -0.47172538 -0.97177969  0.82461084
#>  [73] -0.64920225 -0.00744135 -0.90106996  0.03281943 -1.57134280  0.46664117
#>  [79]  0.16352787  0.21926382  0.20193191 -1.84076636 -0.74116728 -0.99895652
#>  [85] -0.06497241  0.76942936 -0.67097495 -0.10362127  0.47417215 -0.77659258
#>  [91] -0.28248620 -1.09359110  0.61960546 -0.20527080  0.79773008  0.31563408
#>  [97]  0.53182428 -0.41118602  0.05391659  1.16452122

One feature that we use here is to allow multiple streams of random numbers within a rng object. If running in parallel (either via a dust model or by passing n_threads) these different random number streams can be given to different threads.

rng1 <- dust::dust_rng$new(seed = 1, n_streams = 1)
rng2 <- dust::dust_rng$new(seed = 1, n_streams = 2)
rng1$random_real(5)
#> [1] 0.45135139 0.07353466 0.21812941 0.20013953 0.01717274
rng2$random_real(5)
#>            [,1]      [,2]
#> [1,] 0.45135139 0.8989717
#> [2,] 0.07353466 0.3624481
#> [3,] 0.21812941 0.4729296
#> [4,] 0.20013953 0.3462430
#> [5,] 0.01717274 0.4180079

Notice here how in the output from rng2, the first column matches the series of numbers out of rng1.

This is achieved by “jumping” the random number streams forward. Here are the second column of numbers from the output of rng2:

rng3 <- dust::dust_rng$new(seed = 1)$jump()
rng3$random_real(5)
#> [1] 0.8989717 0.3624481 0.4729296 0.3462430 0.4180079

With the default generator (xoshoro256+, see below), a jump is equivalent to 2^128 draws from the random number generator (about 10^38). There are 2^128 of these non-overlapping subsequences in the generator, which is quite a lot. If this feels too close together, then the $long_jump() method jumps even further (2^192 draws, or about 10^57). There are 2^64 (10^20) of these sequences.

Supported distributions

We do not yet support the full set of distributions provided by R, and the distributions that we do support represent our immediate needs. Contributions of further distributions is welcome. Note that in the R version the first argument represents the number of draws requested, but in the C++ interface we always return a single number.

Uniform distribution between min and max:

rng$uniform(10, 3, 6)
#>  [1] 4.773239 5.654481 3.977174 4.026823 4.115612 3.788596 3.917003 5.702068
#>  [9] 5.720697 5.547363

Normal distribution with parameters mean and sd

rng$normal(10, 3, 6)
#>  [1] 12.2139218  4.3761453  0.9866168  4.5615065 -6.0847775  7.4186531
#>  [7] -4.9939206  5.0829829 -2.7531772  6.3068671

This function supports using the ziggurat algorithm, which will generally be faster:

rng$normal(10, 3, 6, algorithm = "ziggurat")
#>  [1] 12.679238  3.933790  8.843829  6.480052 -6.475361  5.767795 12.462605
#>  [8] 10.099802  7.808174 10.040571

Poisson distribution with mean lambda

rng$poisson(10, 4.5)
#>  [1] 3 6 2 4 7 9 2 4 3 4

Binomial distribution with mean size and prob

rng$binomial(10L, 10, 0.3)
#>  [1] 6 6 0 1 3 4 4 5 1 1

Negative binomial distribution with mean size and prob

rng$nbinomial(10L, 10, 0.3)
#>  [1] 14 16 16  8 10 27 42 18 14 23

Hypergeometric distribution with parameters n1 (number of white balls), n2 (number of black balls) and k (number of samples)

rng$hypergeometric(10L, 3, 10, 4)
#>  [1] 0 1 1 2 2 1 2 0 1 2

Exponential distribution with rate rate

rng$exponential(10L, 4)
#>  [1] 0.07361117 0.39088773 0.08186782 0.61963654 0.34793074 0.13892083
#>  [7] 0.46349529 0.49744941 1.39014645 0.26500585

Multinomial distribution with parameters size and prob

rng$multinomial(10L, 20, c(0.3, 0.5, 0.2))
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]    8    8    8    6    7    2    7    6    6     8
#> [2,]    7    9    8   11   11   14   10    9   11     9
#> [3,]    5    3    4    3    2    4    3    5    3     3

There are also two special case distributions, the Standard uniform distribution (between 0 and 1) - faster than using $uniform(n, 0, 1):

rng$random_real(10)
#>  [1] 0.242903786 0.049219283 0.916447682 0.008016532 0.056566402 0.123235690
#>  [7] 0.967806313 0.406906615 0.586602176 0.539672238
real_type u = dust::random::random_real<real_type>(state);

the Standard normal distribution (between mean 0 and sd 1) - faster than using $normal(n, 0, 1):

rng$random_normal(10)
#>  [1]  0.609695640  0.146842292 -0.131451864 -0.005175636  0.120022427
#>  [6] -0.464251232  0.054394819 -0.174200970 -0.019314869  1.315462849
real_type u = dust::random::random_normal<real_type>(state);

Gamma distribution parameterized by shape and scale

rng$gamma(10L, 2, 5)
#>  [1] 10.135325 11.303813  4.016615 27.730667  6.175187 15.813924  8.100044
#>  [8]  2.681611 11.177406  2.162054

All of these distributions are available in C++, and this is documented in the C++ reference documentation

Performance

Performance should be on par or better with R’s random number generator, though here the timings are likely to be mostly due to allocations and copies of memory:

bench::mark(
  rng1$random_real(1000),
  rng1$uniform(1000, 0, 1),
  runif(1000),
  time_unit = "us",
  check = FALSE)
#> # A tibble: 3 × 6
#>   expression                 min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 rng1$random_real(1000)    6.30   10.8    94139.    7.86KB     9.41
#> 2 rng1$uniform(1000, 0, 1)  9.30   12.1    80858.    7.86KB     8.09
#> 3 runif(1000)              29.2    30.4    31496.   12.77KB     6.30

For normally distributed random numbers, the different algorithms will perform differently:

bench::mark(
  rng1$random_normal(1000),
  rng1$random_normal(1000, algorithm = "ziggurat"),
  rng1$normal(1000, 0, 1),
  rnorm(1000),
  time_unit = "us",
  check = FALSE)
#> # A tibble: 4 × 6
#>   expression                                         min median itr/se…¹ mem_a…²
#>   <bch:expr>                                       <dbl>  <dbl>    <dbl> <bch:b>
#> 1 rng1$random_normal(1000)                          42.5   43.6   22791.  7.86KB
#> 2 rng1$random_normal(1000, algorithm = "ziggurat")  11.7   13.0   74161.  7.86KB
#> 3 rng1$normal(1000, 0, 1)                           51.0   52.9   18824.  7.86KB
#> 4 rnorm(1000)                                       63.5   65.3   15262. 12.77KB
#> # … with 1 more variable: `gc/sec` <dbl>, and abbreviated variable names
#> #   ¹​`itr/sec`, ²​mem_alloc

On reasonable hardware (10-core i9 at 2.8 GHz) we see throughput of ~1 billion U(0, 1) draws per second (1ns/draw) scaling linearly to 10 cores if the numbers are immediately discarded. Doing anything with the numbers or trying to store them comes with a non-trivial overhead.

The difference between random_real and uniform here is the cost of recycling the parameters, not the actual generation!

Binomial distribution, small n * p, which uses an inversion algorithm

rng1 <- dust::dust_rng$new(seed = 1)
n <- as.numeric(rep(9:10, length.out = 1000))
p <- rep(c(0.1, 0.11), length.out = 1000)
bench::mark(
  rng1$binomial(1000, n, p),
  rbinom(1000, n, p),
  time_unit = "us",
  check = FALSE)
#> # A tibble: 2 × 6
#>   expression                  min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 rng1$binomial(1000, n, p)  38.0   41.0    24251.    7.86KB     2.43
#> 2 rbinom(1000, n, p)         60.8   63.1    15812.    8.83KB     2.02

(note we vary n and p here as we’ve optimised things for random parameter access).

Large n * p uses a rejection sampling algorithm

n <- as.numeric(rep(9999:10000, length.out = 1000))
p <- rep(c(0.3, 0.31), length.out = 1000)
bench::mark(
  rng1$binomial(1000, n, p),
  rbinom(1000, n, p),
  time_unit = "us",
  check = FALSE)
#> # A tibble: 2 × 6
#>   expression                  min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 rng1$binomial(1000, n, p)  52.1   57.1    17454.    7.86KB     4.09
#> 2 rbinom(1000, n, p)         94.1   99.6    10024.    6.45KB     0

Practically, performance will vary based on the parameters of the distributions and the underlying algorithms, and the principle performance gain we hope to get here is driven by the ability to parallelise safely rather than the speed of the draws themselves.

Underlying random number engine

Under the hood, the random number generators work by first creating a random integer, which we then convert to a floating point number, then for the distributions above we apply algorithms that convert one or more uniformly distributed (i.e., U(0, 1)) floating point numbers to a given distribution through techniques such as inversion (for a random exponential) or rejection sampling (for a random binomial).

[random integer] -> [random real] -> [random draw from a distribution]

We include 12 different algorithms for creating the underlying random integer, from the same family of generators. These provide different underlying storage types (either 32 bit or 64 bit integers) and different state types.

Normally you do not need to worry about these details and declaring

using rng_state_type = dust::random::generator<real_type>;

in your model will select a reasonable generator.

If for some reason you want to try a different generator you can directly specify one of the types, for example

using rng_state_type = dust::random::xoshiro256starstar;

which means that whatever real type you use, you want to use the Xoshiro256** generator.

The supported types are:

  • xoshiro256starstar, xoshiro256plusplus, xoshiro256plus (4 x 64 bit state)
  • xoroshiro128starstar, xoroshiro128plusplus, xoroshiro128plus (2 x 64 bit state)
  • xoshiro128starstar, xoshiro128plusplus, xoshiro128plus (4 x 32 bit state)
  • xoshiro512starstar, xoshiro512plusplus, xoshiro512plus (8 x 64 bit state; this is far more state space than typically needed)

The “starstar”, “plusplus” or “plus” refers to the final scrambling operation (two multiplications, two additions or one addition, respectively); the speeds of these might vary depending on your platform. The “plus” version will be the fastest but produces slightly less randomness in the the lower bits of the underlying integers, which theoretically is not a problem when converting to a real number.

If you are generating single precision float numbers for a GPU, then you may want to use the xoshiro128 family as these will be faster than the 64 bit generators on that platform. On a CPU you will likely not see a difference

rng_f <- dust::dust_rng$new(seed = 1, real_type = "float")
rng_d <- dust::dust_rng$new(seed = 1, real_type = "double")
bench::mark(
  rng_f$random_real(1000),
  rng_d$random_real(1000),
  time_unit = "us",
  check = FALSE)
#> # A tibble: 2 × 6
#>   expression                min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 rng_f$random_real(1000)  6.40   6.90   141798.    7.86KB     28.4
#> 2 rng_d$random_real(1000)  6.30   6.80   143989.    7.86KB     14.4

If you do not need to run many parallel realisations, then the xoroshiro256 (note the extra ro) generators may be preferable as they carry half the state and may be slightly faster. The xoshiro512 generators are included for completeness and offer an excessive state space).

The table below summarises the properties of the underlying generators (each of these exists with three different scrambling options).

Name Size State Period Jump Long jump
xoshiro128 32 bits 2 uint32 2^128 2^64 2^96
xoroshiro128 64 bits 2 uint64 2^128 2^64 2^96
xoshiro256 64 bits 4 uint64 2^256 2^128 2^192
xoshiro512 64 bits 4 uint64 2^512 2^512 2^384

Size is the size of the returned random integer (32 or 64 bits, though with the + scrambler not all bits will be of high quality). The State is the number and size of the internal state of the generator. Period refers to the number of states that the generator will move through, Jump and Long jump are the number of steps (equivalent) that a jump operation will take through this sequence.

Note that the period is the two size of number of bits in the model state (e.g., xoshiro256 uses 4 x 64 bit integers, for a total of 256 bits of state, 2^256 possible combinations, each of which will be visited once in the cycle). The Jump coefficients have been computed to have size sqrt(Period).

Reusing the random random number generator in other projects

Our random number library can be reused in other projects without using anything else from dust; either in an R package or in a standalone project.

In a package

A minimal package looks like

#> .
#> ├── DESCRIPTION
#> ├── NAMESPACE
#> └── src
#>     └── rnguse.cpp

with the core of the C++ file containing a small program that uses the dust random number generator to draw a series of normally distributed random number with a single mean and standard deviation.

#include <cpp11.hpp>
#include <dust/random/random.hpp>

[[cpp11::register]]
cpp11::doubles random_normal(int n, double mu, double sd, int seed) {
  using rng_state_type = dust::random::generator<double>;
  auto state = dust::random::seed<rng_state_type>(seed);

  cpp11::writable::doubles ret(n);
  for (int i = 0; i < n; ++i) {
    ret[i] = dust::random::normal<double>(state, mu, sd);
  }

  return ret;
}

To complete the package, the DESCRIPTION includes dust and cpp11 in the LinkingTo section:

Package: rnguse
Title: Use 'dust' RNG From Elsewhere
Version: 0.1.0
Authors@R: c(person("Rich", "FitzJohn", role = c("aut", "cre"),
                    email = "rich.fitzjohn@gmail.com"))
Description: Simple example package showing how to use dust's random
    number library from your R package's C++ code.
License: CC0
Encoding: UTF-8
LinkingTo: cpp11, dust
SystemRequirements: C++11

You must remember to update your NAMESPACE to include useDynLib (either directly or via roxygen)

useDynLib(rnguse, .registration = TRUE)

Finally, run cpp11::cpp_register() before compiling your package so that the relevant interfaces are created (R/cpp11.R and cpp11/cpp11.cpp). A similar process would likely work with Rcpp without any dependency on cpp11.

The issue with this example is that every call to random_normal must provide its own seed argument, so the random number streams are not continued with each call to the function. This is is not very useful in practice and we describe more fully how to do this properly in vignette("rng_package.Rmd").

Standalone, parallel with OpenMP

This is somewhat more experimental, so let us know if you have success using the library this way.

It is possible to include the dust random library in a standalone C++ program (or one embedded from another language) without using any R support. The library is a header only library and <dust/random/random.hpp> is the main file to include.

The simplest way to get started is to copy the contents of inst/include/dust/random into your project. You can do this by downloading and unpacking the latest release of dust-random. Then after including <dust/random/random.hpp> you can use the contents of the random number library.

For example, below is a small program that computes the sum of a series of random numbers, in parallel using OpenMP to parallelise across a set of generators.

#include <iostream>
#include <iomanip>
#include <vector>

#ifndef NO_OPENMP
#include <omp.h>
#endif

#include <dust/random/random.hpp>

template <typename real_type>
std::vector<real_type> random_sum(int n_streams, int n_draws,
                                  int seed, int n_threads) {
  using rng_state_type = dust::random::generator<real_type>;
  dust::random::prng<rng_state_type> rng(n_streams, seed, false);

  std::vector<real_type> ret(n_streams, 0.0);

  #pragma omp parallel for schedule(static) num_threads(n_threads)
  for (int i = 0; i < n_streams; ++i) {
    for (size_t j = 0; j < n_draws; ++j) {
      ret[i] += dust::random::random_real<real_type>(rng.state(i));
    }
  }

  return ret;
}

int main(int argc, char* argv[]) {
  // Extremely simple but non-robust arg handling:
  if (argc < 2) {
    std::cout <<
      "Usage: rnguse <n_draws> [<n_streams> [<seed> [<n_threads>]]]" <<
      std::endl;
    return 1;
  }
  int n_draws   = atoi(argv[1]);
  int n_streams = argc < 3 ?  10 : atoi(argv[2]);
  int seed      = argc < 4 ?  42 : atoi(argv[3]);
  int n_threads = argc < 5 ?   1 : atoi(argv[4]);

  auto ans = random_sum<double>(n_streams, n_draws, seed, n_threads);

  std::cout << std::setprecision(10);
  for (int i = 0; i < n_streams; ++i) {
    std::cout << i << ": " << ans[i] << std::endl;
  }

  return 0;
}

This program can be compiled with

g++ -I$(PATH_DUST_INCLUDE) -O2 -std=c++11 -fopenmp -o rnguse rnguse.cpp

where $PATH_DUST_INCLUDE is the path to the header only library.

Standalone, parallel on a GPU

This is considerably more complicated and will depend on your aims with your GPU-accelerated program.

We include examples in a repository showing dust::random vs curand benchmarks. This covers setting up the RNG state so that it can be set from the host and retrieved, interleaving as appropriate, plus samples from the uniform, normal, exponential Poisson and binomial distributions. We overlap with curand only for uniform, normal, and Poisson. See the repository for more details.

Other packages with similar functionality

There are many packages offering similar functionality to dust:

  • sitmo offers several modern RNGs (sitmo, threefry, vandercorput) for generating standard uniform numbers (U(0, 1)), designed to be usable from an Rcpp-using package.
  • dqrng offers an overlapping set and generators for normal and exponential random numbers, also designed to be used with Rcpp and BH.
  • rTRNG exposes “Tina’s Random Number Generator”, among others, and support for several distributions. It can be used from other packages but requires linking (i.e., not header only) which does not work on macOS. Generator jumps typically require a known number of draws per thread.
  • miranda includes support for several underlying generators (including xoshiro256+) but with a single global state and no support for use from other packages’ compiled code.

Why does this package exist? We had a set of needs that meant using the a

  • generation of single precision float numbers with the same interface
  • absolutely no global state
  • nice interface for reproducible parallel random number generation where the generated number does not depend on the number of threads used
  • possible to use on GPUs; this exposes some exotic issues around what types of structures can be used
  • minimal dependencies and impact on the C++ code (e.g., no boost/BH) and header-only in order to satisfy the above
  • support for additional distributions, especially binomial, in a thread-safe way, optimised for rapidly changing parameters

On this last point; we noticed that many distribution packages, including Boost.Random (and to a degree R’s random functions) assume that you want to sample many random numbers from a distribution with fixed parameters. This is probably reasonable in many cases, but in the use case we had (stochastic simulation models) we expected that all parameters were likely to change at every iteration. These different approaches have important tradeoffs - if you will take many samples from a single distribution you might compute tables of coefficients once and use them many times which will make the sampling faster, but this will be wasteful if only a single sample is taken before the parameters change.

The situation is slightly more complex on a GPU where we need to make sure that different threads within a block do not get needlessly onto different branches of conditional logic. Things like early exits need to be avoided and we have carefully profiled to make sure that threads within a warp end up re-synchronised at the optimal spot (see for example this blog post on __syncwarp).