The core of dust
is a very fast parallel random number
generation system. This vignette describes the details of the system and
why it is needed.
We provide an interface to the “Xoshiro” / Blackman-Vigna family
of generators (Xoshiro is derived from XOR/shift/rotate). These are
designed to allow use in parallel by “jumping ahead” in the sequence and
we use this below to interleave generators. The stream is unrelated to
and unaffected by R’s random number generation. set.seed
has no effect, for example. The random numbers are not
cryptographically secure; for that see the excellent sodium package.
Ordinarily this is used from C++; the model as discussed in
vignette("dust")
uses rng_state_type
and a
function from dust::random
to interact with the generator.
However, an R interface is provided for debugging and testing
purposes.
rng <- dust::dust_rng$new(seed = 1)
rng
#> <dust_rng>
#> Public:
#> binomial: function (n, size, prob, n_threads = 1L)
#> cauchy: function (n, location, scale, n_threads = 1L)
#> exponential: function (n, rate, n_threads = 1L)
#> gamma: function (n, shape, scale, n_threads = 1L)
#> hypergeometric: function (n, n1, n2, k, n_threads = 1L)
#> info: list
#> initialize: function (seed = NULL, n_streams = 1L, real_type = "double",
#> jump: function ()
#> long_jump: function ()
#> multinomial: function (n, size, prob, n_threads = 1L)
#> nbinomial: function (n, size, prob, n_threads = 1L)
#> normal: function (n, mean, sd, n_threads = 1L, algorithm = "box_muller")
#> poisson: function (n, lambda, n_threads = 1L)
#> random_normal: function (n, n_threads = 1L, algorithm = "box_muller")
#> random_real: function (n, n_threads = 1L)
#> size: function ()
#> state: function ()
#> uniform: function (n, min, max, n_threads = 1L)
#> Private:
#> float: FALSE
#> n_streams: 1
#> ptr: externalptr
Currently only a few distributions are supported, with an interface
that somewhat mimics R’s interface (we avoid abbreviations so that
rnorm
becomes normal
and there are no default
arguments). For example, with the generator above we can generate 100
standard normal samples:
rng$normal(100, 0, 1)
#> [1] 1.12910079 0.53780677 -1.26698746 0.95829321 3.21195234 0.09921154
#> [7] -0.18600342 -0.29222356 2.08163389 -0.05121175 0.27770699 0.53856861
#> [13] 0.03225818 -0.32777772 0.18042116 0.46339742 -0.38466512 -1.13577238
#> [19] 0.91316208 0.03922728 -1.49714784 0.73104348 -0.83564659 -1.11930919
#> [25] 0.02762058 -0.99778204 0.85397997 0.93862789 0.27005717 -0.22477757
#> [31] -0.44778373 0.14367032 -0.06827267 0.68396692 -1.16518273 -0.89172407
#> [37] -0.29950034 0.72347846 -1.14585657 0.34617660 -0.03401702 0.04667183
#> [43] 0.09952394 -1.94032159 1.35699294 -0.34150243 -0.62118413 0.06133529
#> [49] 1.18840972 0.56786089 -1.70569039 0.37351828 0.78885706 0.52182712
#> [55] -0.82799139 -0.04261288 0.61557195 0.83137281 -1.21370978 -0.94812792
#> [61] -0.12333629 0.52869359 -0.88974620 -0.17544455 1.80250618 1.12137595
#> [67] 0.30129136 -0.07441952 1.03229412 -0.47172538 -0.97177969 0.82461084
#> [73] -0.64920225 -0.00744135 -0.90106996 0.03281943 -1.57134280 0.46664117
#> [79] 0.16352787 0.21926382 0.20193191 -1.84076636 -0.74116728 -0.99895652
#> [85] -0.06497241 0.76942936 -0.67097495 -0.10362127 0.47417215 -0.77659258
#> [91] -0.28248620 -1.09359110 0.61960546 -0.20527080 0.79773008 0.31563408
#> [97] 0.53182428 -0.41118602 0.05391659 1.16452122
One feature that we use here is to allow multiple streams of random
numbers within a rng object. If running in parallel (either via a dust
model or by passing n_threads
) these different random
number streams can be given to different threads.
rng1 <- dust::dust_rng$new(seed = 1, n_streams = 1)
rng2 <- dust::dust_rng$new(seed = 1, n_streams = 2)
rng1$random_real(5)
#> [1] 0.45135139 0.07353466 0.21812941 0.20013953 0.01717274
rng2$random_real(5)
#> [,1] [,2]
#> [1,] 0.45135139 0.8989717
#> [2,] 0.07353466 0.3624481
#> [3,] 0.21812941 0.4729296
#> [4,] 0.20013953 0.3462430
#> [5,] 0.01717274 0.4180079
Notice here how in the output from rng2
, the first
column matches the series of numbers out of rng1
.
This is achieved by “jumping” the random number streams forward. Here
are the second column of numbers from the output of
rng2
:
rng3 <- dust::dust_rng$new(seed = 1)$jump()
rng3$random_real(5)
#> [1] 0.8989717 0.3624481 0.4729296 0.3462430 0.4180079
With the default generator (xoshoro256+
, see below), a
jump is equivalent to 2^128 draws from the random number generator
(about 10^38). There are 2^128 of these non-overlapping subsequences in
the generator, which is quite a lot. If this feels too close together,
then the $long_jump()
method jumps even further (2^192
draws, or about 10^57). There are 2^64 (10^20) of these sequences.
We do not yet support the full set of distributions provided by R, and the distributions that we do support represent our immediate needs. Contributions of further distributions is welcome. Note that in the R version the first argument represents the number of draws requested, but in the C++ interface we always return a single number.
Uniform distribution between min
and
max
:
rng$uniform(10, 3, 6)
#> [1] 4.773239 5.654481 3.977174 4.026823 4.115612 3.788596 3.917003 5.702068
#> [9] 5.720697 5.547363
Normal distribution with parameters
mean
and sd
rng$normal(10, 3, 6)
#> [1] 12.2139218 4.3761453 0.9866168 4.5615065 -6.0847775 7.4186531
#> [7] -4.9939206 5.0829829 -2.7531772 6.3068671
This function supports using the ziggurat algorithm, which will generally be faster:
rng$normal(10, 3, 6, algorithm = "ziggurat")
#> [1] 12.679238 3.933790 8.843829 6.480052 -6.475361 5.767795 12.462605
#> [8] 10.099802 7.808174 10.040571
Poisson distribution with mean
lambda
rng$poisson(10, 4.5)
#> [1] 3 6 2 4 7 9 2 4 3 4
Binomial distribution with mean size
and prob
rng$binomial(10L, 10, 0.3)
#> [1] 6 6 0 1 3 4 4 5 1 1
Negative binomial distribution with mean
size
and prob
rng$nbinomial(10L, 10, 0.3)
#> [1] 14 16 16 8 10 27 42 18 14 23
Hypergeometric distribution with parameters
n1
(number of white balls), n2
(number of
black balls) and k
(number of samples)
rng$hypergeometric(10L, 3, 10, 4)
#> [1] 0 1 1 2 2 1 2 0 1 2
Exponential distribution with rate
rate
rng$exponential(10L, 4)
#> [1] 0.07361117 0.39088773 0.08186782 0.61963654 0.34793074 0.13892083
#> [7] 0.46349529 0.49744941 1.39014645 0.26500585
Multinomial distribution with parameters
size
and prob
rng$multinomial(10L, 20, c(0.3, 0.5, 0.2))
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 8 8 8 6 7 2 7 6 6 8
#> [2,] 7 9 8 11 11 14 10 9 11 9
#> [3,] 5 3 4 3 2 4 3 5 3 3
There are also two special case distributions, the Standard
uniform distribution (between 0 and 1) - faster than using
$uniform(n, 0, 1)
:
rng$random_real(10)
#> [1] 0.242903786 0.049219283 0.916447682 0.008016532 0.056566402 0.123235690
#> [7] 0.967806313 0.406906615 0.586602176 0.539672238
the Standard normal distribution (between mean 0 and
sd 1) - faster than using $normal(n, 0, 1)
:
rng$random_normal(10)
#> [1] 0.609695640 0.146842292 -0.131451864 -0.005175636 0.120022427
#> [6] -0.464251232 0.054394819 -0.174200970 -0.019314869 1.315462849
Gamma distribution parameterized by
shape
and scale
rng$gamma(10L, 2, 5)
#> [1] 10.135325 11.303813 4.016615 27.730667 6.175187 15.813924 8.100044
#> [8] 2.681611 11.177406 2.162054
All of these distributions are available in C++, and this is documented in the C++ reference documentation
Performance should be on par or better with R’s random number generator, though here the timings are likely to be mostly due to allocations and copies of memory:
bench::mark(
rng1$random_real(1000),
rng1$uniform(1000, 0, 1),
runif(1000),
time_unit = "us",
check = FALSE)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 rng1$random_real(1000) 4.59 8.17 134978. 7.86KB 27.0
#> 2 rng1$uniform(1000, 0, 1) 6.37 7.22 119412. 7.86KB 11.9
#> 3 runif(1000) 8.68 9.32 105123. 10.27KB 21.0
For normally distributed random numbers, the different algorithms will perform differently:
bench::mark(
rng1$random_normal(1000),
rng1$random_normal(1000, algorithm = "ziggurat"),
rng1$normal(1000, 0, 1),
rnorm(1000),
time_unit = "us",
check = FALSE)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 "rng1$random_normal(1000)" 34.5 35.8 27549. 7.86KB 5.51
#> 2 "rng1$random_normal(1000, algorithm… 8.97 9.83 99675. 7.86KB 9.97
#> 3 "rng1$normal(1000, 0, 1)" 37.9 39.2 25211. 7.86KB 5.04
#> 4 "rnorm(1000)" 35.3 36.9 26808. 10.27KB 2.68
On reasonable hardware (10-core i9 at 2.8 GHz) we see throughput of ~1 billion U(0, 1) draws per second (1ns/draw) scaling linearly to 10 cores if the numbers are immediately discarded. Doing anything with the numbers or trying to store them comes with a non-trivial overhead.
The difference between random_real
and
uniform
here is the cost of recycling the parameters, not
the actual generation!
Binomial distribution, small n * p
, which uses an
inversion algorithm
rng1 <- dust::dust_rng$new(seed = 1)
n <- as.numeric(rep(9:10, length.out = 1000))
p <- rep(c(0.1, 0.11), length.out = 1000)
bench::mark(
rng1$binomial(1000, n, p),
rbinom(1000, n, p),
time_unit = "us",
check = FALSE)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 rng1$binomial(1000, n, p) 27.9 30.1 32838. 7.86KB 6.57
#> 2 rbinom(1000, n, p) 37.6 39.8 24876. 6.34KB 0
(note we vary n
and p
here as we’ve
optimised things for random parameter access).
Large n * p
uses a rejection sampling algorithm
n <- as.numeric(rep(9999:10000, length.out = 1000))
p <- rep(c(0.3, 0.31), length.out = 1000)
bench::mark(
rng1$binomial(1000, n, p),
rbinom(1000, n, p),
time_unit = "us",
check = FALSE)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 rng1$binomial(1000, n, p) 43.0 47.7 20723. 7.86KB 4.15
#> 2 rbinom(1000, n, p) 60.1 65.8 15089. 3.95KB 2.01
Practically, performance will vary based on the parameters of the distributions and the underlying algorithms, and the principle performance gain we hope to get here is driven by the ability to parallelise safely rather than the speed of the draws themselves.
Under the hood, the random number generators work by first creating a random integer, which we then convert to a floating point number, then for the distributions above we apply algorithms that convert one or more uniformly distributed (i.e., U(0, 1)) floating point numbers to a given distribution through techniques such as inversion (for a random exponential) or rejection sampling (for a random binomial).
[random integer] -> [random real] -> [random draw from a distribution]
We include 12 different algorithms for creating the underlying random integer, from the same family of generators. These provide different underlying storage types (either 32 bit or 64 bit integers) and different state types.
Normally you do not need to worry about these details and declaring
in your model will select a reasonable generator.
If for some reason you want to try a different generator you can directly specify one of the types, for example
which means that whatever real type you use, you want to use the Xoshiro256** generator.
The supported types are:
xoshiro256starstar
, xoshiro256plusplus
,
xoshiro256plus
(4 x 64 bit state)xoroshiro128starstar
,
xoroshiro128plusplus
, xoroshiro128plus
(2 x 64
bit state)xoshiro128starstar
, xoshiro128plusplus
,
xoshiro128plus
(4 x 32 bit state)xoshiro512starstar
, xoshiro512plusplus
,
xoshiro512plus
(8 x 64 bit state; this is far more state
space than typically needed)The “starstar”, “plusplus” or “plus” refers to the final scrambling operation (two multiplications, two additions or one addition, respectively); the speeds of these might vary depending on your platform. The “plus” version will be the fastest but produces slightly less randomness in the the lower bits of the underlying integers, which theoretically is not a problem when converting to a real number.
If you are generating single precision float
numbers for
a GPU, then you may want to use the xoshiro128
family as
these will be faster than the 64 bit generators on that platform. On a
CPU you will likely not see a difference
rng_f <- dust::dust_rng$new(seed = 1, real_type = "float")
rng_d <- dust::dust_rng$new(seed = 1, real_type = "double")
bench::mark(
rng_f$random_real(1000),
rng_d$random_real(1000),
time_unit = "us",
check = FALSE)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 rng_f$random_real(1000) 4.74 5.21 178384. 7.86KB 17.8
#> 2 rng_d$random_real(1000) 4.67 5.16 187147. 7.86KB 37.4
If you do not need to run many parallel realisations, then the
xoroshiro256
(note the extra ro
) generators
may be preferable as they carry half the state and may be slightly
faster. The xoshiro512
generators are included for
completeness and offer an excessive state space).
The table below summarises the properties of the underlying generators (each of these exists with three different scrambling options).
Name | Size | State | Period | Jump | Long jump |
---|---|---|---|---|---|
xoshiro128 |
32 bits | 2 uint32 | 2^128 | 2^64 | 2^96 |
xoroshiro128 |
64 bits | 2 uint64 | 2^128 | 2^64 | 2^96 |
xoshiro256 |
64 bits | 4 uint64 | 2^256 | 2^128 | 2^192 |
xoshiro512 |
64 bits | 4 uint64 | 2^512 | 2^512 | 2^384 |
Size is the size of the returned random integer (32 or 64 bits,
though with the +
scrambler not all bits will be of high
quality). The State is the number and size of the internal state of the
generator. Period refers to the number of states that the generator will
move through, Jump and Long jump are the number of steps (equivalent)
that a jump operation will take through this sequence.
Note that the period is the two size of number of bits in the model
state (e.g., xoshiro256 uses 4 x 64 bit integers, for a total of 256
bits of state, 2^256 possible combinations, each of which will be
visited once in the cycle). The Jump coefficients have been computed to
have size sqrt(Period)
.
Our random number library can be reused in other projects without using anything else from dust; either in an R package or in a standalone project.
A minimal package looks like
#> .
#> ├── DESCRIPTION
#> ├── NAMESPACE
#> └── src
#> └── rnguse.cpp
with the core of the C++ file containing a small program that uses the dust random number generator to draw a series of normally distributed random number with a single mean and standard deviation.
#include <cpp11.hpp>
#include <dust/random/random.hpp>
[[cpp11::register]]
cpp11::doubles random_normal(int n, double mu, double sd, int seed) {
using rng_state_type = dust::random::generator<double>;
auto state = dust::random::seed<rng_state_type>(seed);
cpp11::writable::doubles ret(n);
for (int i = 0; i < n; ++i) {
ret[i] = dust::random::normal<double>(state, mu, sd);
}
return ret;
}
To complete the package, the DESCRIPTION
includes
dust
and cpp11
in the LinkingTo
section:
Package: rnguse
Title: Use 'dust' RNG From Elsewhere
Version: 0.1.0
Authors@R: c(person("Rich", "FitzJohn", role = c("aut", "cre"),
email = "rich.fitzjohn@gmail.com"))
Description: Simple example package showing how to use dust's random
number library from your R package's C++ code.
License: CC0
Encoding: UTF-8
LinkingTo: cpp11, dust
You must remember to update your NAMESPACE
to include
useDynLib
(either directly or via roxygen)
useDynLib(rnguse, .registration = TRUE)
Finally, run cpp11::cpp_register()
before compiling your
package so that the relevant interfaces are created
(R/cpp11.R
and cpp11/cpp11.cpp
). A similar
process would likely work with Rcpp without any dependency on cpp11.
The issue with this example is that every call to
random_normal
must provide its own seed
argument, so the random number streams are not continued with each call
to the function. This is is not very useful in practice and we describe
more fully how to do this properly in
vignette("rng_package.Rmd")
.
This is somewhat more experimental, so let us know if you have success using the library this way.
It is possible to include the dust random library in a standalone C++
program (or one embedded from another language) without using any R
support. The library is a header only library and
<dust/random/random.hpp>
is the main file to
include.
The simplest way to get started is to copy the contents of
inst/include/dust/random
into your project. You can do this
by downloading and unpacking the
latest release of dust-random. Then after including
<dust/random/random.hpp>
you can use the contents of
the random number library.
For example, below is a small program that computes the sum of a series of random numbers, in parallel using OpenMP to parallelise across a set of generators.
#include <iostream>
#include <iomanip>
#include <vector>
#ifndef NO_OPENMP
#include <omp.h>
#endif
#include <dust/random/random.hpp>
template <typename real_type>
std::vector<real_type> random_sum(int n_streams, int n_draws,
int seed, int n_threads) {
using rng_state_type = dust::random::generator<real_type>;
dust::random::prng<rng_state_type> rng(n_streams, seed, false);
std::vector<real_type> ret(n_streams, 0.0);
#pragma omp parallel for schedule(static) num_threads(n_threads)
for (int i = 0; i < n_streams; ++i) {
for (size_t j = 0; j < n_draws; ++j) {
ret[i] += dust::random::random_real<real_type>(rng.state(i));
}
}
return ret;
}
int main(int argc, char* argv[]) {
// Extremely simple but non-robust arg handling:
if (argc < 2) {
std::cout <<
"Usage: rnguse <n_draws> [<n_streams> [<seed> [<n_threads>]]]" <<
std::endl;
return 1;
}
int n_draws = atoi(argv[1]);
int n_streams = argc < 3 ? 10 : atoi(argv[2]);
int seed = argc < 4 ? 42 : atoi(argv[3]);
int n_threads = argc < 5 ? 1 : atoi(argv[4]);
auto ans = random_sum<double>(n_streams, n_draws, seed, n_threads);
std::cout << std::setprecision(10);
for (int i = 0; i < n_streams; ++i) {
std::cout << i << ": " << ans[i] << std::endl;
}
return 0;
}
This program can be compiled with
g++ -I$(PATH_DUST_INCLUDE) -O2 -std=c++11 -fopenmp -o rnguse rnguse.cpp
where $PATH_DUST_INCLUDE
is the path to the header only
library.
This is considerably more complicated and will depend on your aims with your GPU-accelerated program.
We include examples
in a repository showing dust::random
vs curand
benchmarks. This covers setting up the RNG state so that it can be
set from the host and retrieved, interleaving as appropriate, plus
samples from the uniform, normal, exponential Poisson and binomial
distributions. We overlap with curand
only for uniform,
normal, and Poisson. See the repository
for more details.
There are many packages offering similar functionality to
dust
:
sitmo
offers several modern RNGs (sitmo, threefry, vandercorput) for
generating standard uniform numbers (U(0, 1)), designed to be usable
from an Rcpp-using package.dqrng
offers an overlapping set and generators for normal and exponential
random numbers, also designed to be used with Rcpp and BH.rTRNG
exposes “Tina’s Random Number Generator”, among others, and support for
several distributions. It can be used from other packages but requires
linking (i.e., not header only) which does not work on macOS. Generator
jumps typically require a known number of draws per thread.miranda
includes support for several underlying generators (including
xoshiro256+
) but with a single global state and no support
for use from other packages’ compiled code.Why does this package exist? We had a set of needs that meant using the a
float
numbers with the
same interfaceOn this last point; we noticed that many distribution packages, including Boost.Random (and to a degree R’s random functions) assume that you want to sample many random numbers from a distribution with fixed parameters. This is probably reasonable in many cases, but in the use case we had (stochastic simulation models) we expected that all parameters were likely to change at every iteration. These different approaches have important tradeoffs - if you will take many samples from a single distribution you might compute tables of coefficients once and use them many times which will make the sampling faster, but this will be wasteful if only a single sample is taken before the parameters change.
The situation is slightly more complex on a GPU where we need to make
sure that different threads within a block do not get needlessly onto
different branches of conditional logic. Things like early exits need to
be avoided and we have carefully profiled to make sure that threads
within a warp end up re-synchronised at the optimal spot (see for
example this
blog post on __syncwarp
).