library(odin2)
library(dust2)
6 Data
Before we start on inference, we need to start thinking about how our models fit to data, and how we can get data into the models. This gets very close to the interface that we need to work with monty, which is the focus of the next section of the book.
6.1 The SIR model revisited
In a couple of chapters we’ll explore fitting a stochastic SIR model to data, so we’ll start with the model from Section 3.2:
sir <- odin({
  update(S) <- S - n_SI
  update(I) <- I + n_SI - n_IR
  update(R) <- R + n_IR
  update(incidence) <- incidence + n_SI

  p_SI <- 1 - exp(-beta * I / N * dt)
  p_IR <- 1 - exp(-gamma * dt)
  n_SI <- Binomial(S, p_SI)
  n_IR <- Binomial(I, p_IR)

  initial(S) <- N - I0
  initial(I) <- I0
  initial(R) <- 0
  initial(incidence, zero_every = 1) <- 0

  N <- parameter(1000)
  I0 <- parameter(10)
  beta <- parameter(0.2)
  gamma <- parameter(0.1)
})
Assuming our time unit is one day, we can run the model in time steps of a quarter of a day and plot incidence at daily intervals:
pars <- list(beta = 1, gamma = 0.6)
sys <- dust_system_create(sir, pars, dt = 0.25)
dust_system_set_state_initial(sys)
t <- seq(0, 20)
y <- dust_system_simulate(sys, t)
y <- dust_unpack_state(sys, y)
plot(t, y$incidence, type = "p", xlab = "Time", ylab = "Incidence")
6.2 The data cometh
We need a data set to fit to; we’ll use data/incidence.csv, which you can download.
d <- read.csv("data/incidence.csv")
head(d)
#> time cases
#> 1 1 12
#> 2 2 23
#> 3 3 25
#> 4 4 36
#> 5 5 30
#> 6 6 57
tail(d)
#> time cases
#> 15 15 27
#> 16 16 25
#> 17 17 15
#> 18 18 20
#> 19 19 11
#> 20 20 7
We have a column for time (time) and one for observed cases (cases), spanning times 1 to 20.
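Before fitting anything, it can help to look at the data itself; here is a quick base-R plot (not part of the original text):

plot(cases ~ time, d, pch = 19, col = "red",
     xlab = "Time (days)", ylab = "Cases")  # observed daily case counts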
6.3 Comparison to data
The next step is to tell our model about this data:
sir <- odin({
  update(S) <- S - n_SI
  update(I) <- I + n_SI - n_IR
  update(R) <- R + n_IR
  update(incidence) <- incidence + n_SI

  p_SI <- 1 - exp(-beta * I / N * dt)
  p_IR <- 1 - exp(-gamma * dt)
  n_SI <- Binomial(S, p_SI)
  n_IR <- Binomial(I, p_IR)

  initial(S) <- N - I0
  initial(I) <- I0
  initial(R) <- 0
  initial(incidence, zero_every = 1) <- 0

  N <- parameter(1000)
  I0 <- parameter(10)
  beta <- parameter(0.2)
  gamma <- parameter(0.1)

  cases <- data()
  cases ~ Poisson(incidence)
})
The last two lines here are doing the work for us:
First, cases <- data() says that cases is a special data variable. It will vary over time (there are different observations of cases at different times) and it will come from the data rather than from the model dynamics.
Second, cases ~ Poisson(incidence) describes the per-data-point likelihood calculation; the syntax may be familiar to you if you have read Richard McElreath’s Statistical Rethinking or used any number of Bayesian statistical frameworks.
A data variable cannot have the same name as a state variable; in situations where one directly corresponds to the other, you may have to think carefully about naming so that you can remember which refers to which.
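For example (a hypothetical fragment, not part of the model above), if your model itself tracked a state called cases, you would need a different name for the observed series, matching a correspondingly named column in your data:

# hypothetical fragment: the observed series gets its own name,
# matching a cases_observed column in the data
cases_observed <- data()
cases_observed ~ Poisson(incidence)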
The generator will advertise that this system can be compared to data:
sir
#>
#> ── <dust_system_generator: odin> ───────────────────────────────────────────────
#> ℹ This system has 'compare_data' support
#> ℹ This system runs in discrete time with a default dt of 1
#> ℹ This system has 4 parameters
#> → 'N', 'I0', 'beta', and 'gamma'
#> ℹ Use dust2::dust_system_create() (`?dust2::dust_system_create()`) to create a system with this generator
#> ℹ Use coef() (`?stats::coef()`) to get more information on parameters
Once we have our new model, we can see how the data comparison works.
sys <- dust_system_create(sir, list(), n_particles = 10)
dust_system_set_state_initial(sys)
dust_system_run_to_time(sys, d$time[[1]])
dust_system_compare_data(sys, d[1, ])
#> [1] -7.351682 -20.987214 -9.803867 -13.669448 -20.987214 -13.669448
#> [7] -7.351682 -13.669448 -9.803867 -9.803867
This has run the model to the point where we have the first observed data (time 1), this time without returning any data to R, and then used dust_system_compare_data with the first row of data. This returns a vector of 10 log-likelihoods, one per particle.
This is probably the first mention of a “particle”, and it is from this that the name dust
is derived. As a nod to the particle filter, which will be the focus of the next section, we refer to each realisation of the dynamical system as a “particle”.
To demystify this a little we can perform this calculation manually. First we extract the state from the system, unpacking it to make it easier to work with:
s <- dust_unpack_state(sys, dust_system_state(sys))
s
#> $S
#> [1] 986 989 987 988 989 988 986 988 987 987
#>
#> $I
#> [1] 12 10 11 12 9 12 13 12 11 13
#>
#> $R
#> [1] 2 1 2 0 2 0 1 0 2 0
#>
#> $incidence
#> [1] 4 1 3 2 1 2 4 2 3 3
We can then manually compute the likelihood, and bind this together with the calculation from dust to compare:
cbind(dpois(d$cases[[1]], s$incidence, log = TRUE),
      dust_system_compare_data(sys, d[1, ]))
#> [,1] [,2]
#> [1,] -7.351682 -7.351682
#> [2,] -20.987214 -20.987214
#> [3,] -9.803867 -9.803867
#> [4,] -13.669448 -13.669448
#> [5,] -20.987214 -20.987214
#> [6,] -13.669448 -13.669448
#> [7,] -7.351682 -7.351682
#> [8,] -13.669448 -13.669448
#> [9,] -9.803867 -9.803867
#> [10,] -9.803867 -9.803867
As you can see, these are the same; the expression
cases ~ Poisson(incidence)
has done the same sort of thing as we can do ourselves, asking “what is the (log) probability of the observed number of cases, given the modelled incidence?”. Note that this can be -Inf
(a probability of 0) in the case where the modelled incidence is zero.
dpois(10, 0, log = TRUE)
#> [1] -Inf
This is fine, so long as not all particles are impossible.
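To see why, note that the particle filter (introduced next) averages the per-particle weights on the natural scale, so a single zero weight just drags the average down, whereas all-zero weights leave nothing to average. A small base-R illustration (not from the original text):

# one impossible particle among possible ones still leaves a finite average weight
is.finite(log(mean(exp(c(-Inf, -9.8, -7.4)))))
#> [1] TRUE
# but if every particle is impossible, the resulting log-likelihood is -Inf
log(mean(exp(c(-Inf, -Inf, -Inf))))
#> [1] -Inf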
6.4 The particle filter
We include a simple bootstrap particle filter in dust
for estimating the marginal likelihood in this case: the probability of the entire data series conditional on our model, marginalised (averaged) over all stochastic realisations. Because our model is stochastic, this is only an estimate of the likelihood, but the estimate gets less noisy as the number of particles increases.
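To demystify what the filter does, here is a rough sketch of the bootstrap algorithm written with the dust2 functions used above, plus dust_system_state() and dust_system_set_state() for resampling (the latter is assumed here to accept a full state-by-particle matrix). This is an illustration of the idea only, not dust’s actual implementation; use dust_filter_create() below in practice.

# Sketch of a bootstrap particle filter: alternate between running the system
# to the next data point, weighting particles against that data point, and
# resampling particles in proportion to their weights.
filter_sketch <- function(generator, pars, d, n_particles, dt = 0.25) {
  sys <- dust_system_create(generator, pars, n_particles = n_particles, dt = dt)
  dust_system_set_state_initial(sys)
  ll <- 0
  for (i in seq_len(nrow(d))) {
    dust_system_run_to_time(sys, d$time[[i]])
    w <- dust_system_compare_data(sys, d[i, ])    # per-particle log-weights
    w_max <- max(w)
    ll <- ll + w_max + log(mean(exp(w - w_max)))  # average weights on the natural scale
    k <- sample.int(n_particles, replace = TRUE, prob = exp(w - w_max))
    s <- dust_system_state(sys)                   # matrix: state x particle
    dust_system_set_state(sys, s[, k, drop = FALSE])
  }
  ll
}

Running filter_sketch(sir, list(), d, 200) should produce log-likelihoods in the same ballpark as the real filter below, although the real filter is faster and handles awkward cases (such as every particle being impossible) more carefully.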
filter <- dust_filter_create(sir, time_start = 0, data = d, n_particles = 200)
Here, we’ve created a filter where the time series starts at 0, using our data set d, and we’ve picked 200 particles to start with. We run the filter with dust_likelihood_run, which returns a log-likelihood:
dust_likelihood_run(filter, list())
#> [1] -215.6063
Each time we run the filter, we’ll get a different answer though.
dust_likelihood_run(filter, list())
#> [1] -219.9024
For example, running it 100 times:
ll <- replicate(100, dust_likelihood_run(filter, list()))
mean(ll)
#> [1] -222.711
var(ll)
#> [1] 195.166
If we increase the number of particles, this variance will decrease, at the cost of taking longer to run:
filter2 <- dust_filter_create(sir, time_start = 0, data = d,
                              n_particles = 2000)
ll2 <- replicate(100, dust_likelihood_run(filter2, list()))
mean(ll2)
#> [1] -187.6805
var(ll2)
#> [1] 68.52534
The log-likelihoods from runs with different numbers of particles are not directly comparable, though: the filter is unbiased on the likelihood scale but biased downwards on the log scale, and that bias shrinks as the number of particles grows (which is why the mean estimate increased here).
You can extract trajectories from the filter after running it: this gives a tree showing the ancestry of the particles remaining in the sample. To enable this you must run with save_trajectories = TRUE, as saving trajectories incurs a runtime cost.
ll <- dust_likelihood_run(filter, list(), save_trajectories = TRUE)
h <- dust_likelihood_last_trajectories(filter)
The trajectories will be an array with 3 dimensions representing (in turn) state, particle and time:
dim(h)
#> [1] 4 200 20
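Since the states are packed in the order S, I, R, incidence, you can index this array directly; for example, the incidence series of the first particle (dust_unpack_state, used below, is usually more convenient):

h[4, 1, ]  # 4th state (incidence), particle 1, all 20 time points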
We can plot the trajectories of our incidence here:
matplot(d$time, t(dust_unpack_state(filter, h)$incidence), type = "l",
lty = 1, col = "#00000044", ylim = c(0, 75),
xlab = "Time", ylab = "Incidence")
points(cases ~ time, d, pch = 19, col = "red")
This model does not fit the data very well! The points don’t lie close to the modelled trajectories. We also see everything before time ~10 reduced to a single particle’s trajectories, which will have the effect of increasing the variance.
If we found better parameters, things would look much better. We just so happen to know that if beta is 1 and gamma is 0.6 then the model will fit better:
pars <- list(beta = 1, gamma = 0.6)
ll <- dust_likelihood_run(filter, pars, save_trajectories = TRUE)
h <- dust_likelihood_last_trajectories(filter)
matplot(d$time, t(dust_unpack_state(filter, h)$incidence), type = "l",
lty = 1, col = "#00000044", ylim = c(0, 75),
xlab = "Time", ylab = "Incidence")
points(cases ~ time, d, pch = 19, col = "red")
As the model fit improves and the log-likelihood increases, the variance in that estimator will also decrease as the model struggles less to describe the data:
ll <- replicate(100, dust_likelihood_run(filter, pars))
mean(ll)
#> [1] -80.72078
var(ll)
#> [1] 1.932347
Moving from the poor-fitting parameters to the better fitting ones is the process of inference, which will be the focus of most of the rest of this book.