This tutorial goes through the basic process of creating a SIMPLEGEN project and running the inbuilt epidemiological simulator.

Setting up the model

SIMPLEGEN works with projects, which are essentially just glorified lists that hold all inputs and outputs in a single convenient place. We start by creating a new project and specifying the parameters of the epidemiological model:

myproj <- simplegen_project()
myproj <- define_epi_model_parameters(myproj,
                                      H = 1e3,
                                      M = 1e4)

Alternatively, if you are familiar with the pipe operator %>% from the dplyr package then you can use this to chain together multiple commands:

library(dplyr)

myproj <- simplegen_project() %>%
  define_epi_model_parameters(H = 1e3,
                              M = 1e4)

Here we’ve just specified the human and mosquito population sizes and left all other parameters with their default values. A complete list of model parameters and their defaults can be found in the help for this function (see ?define_epi_model_parameters).

Taking a quick look at our project we get e a summary of the current setup:

myproj
#> Epidemiological model:
#>   demes: 1
#>   H:  1000
#>   M:  10000
#>   seed infections: 100
#> 

We can use our inbuilt plotting functions to explore some of the assumed distributions of our model. For example, we can look at the distribution of time spent in the acute state:

plot_epi_distribution(myproj, name = "duration_acute")

Many distributions can be explored in this way, and ?plot_epi_distribution gives a complete list.

Once we’re happy with the parameters of the model, the next step is to define the outputs that we expect it to produce. There are three possible types of output:

  1. daily output contains values for each day of simulation. This can be useful for visualising trends over time.
  2. sweeps are the same as daily output, but evaluated at specific points in time. This can be useful for getting a more detailed snap-shot of the population without overloading the daily output. For example, we might want to return the prevalence in 5-year age bands on day 100, but without returning these values every other day of the simulation.
  3. surveys represent samples taken from the population according to a predefined sampling strategy. Surveys are the most important type of output from the genetic perspective, as genetic data will only be generated from individuals in the survey.

We specify daily outputs using a basic dataframe. Table 1 gives the columns of this dataframe along with all possible combinations of values (Table 1 should be read from left to right, moving into any box not separated by a horizontal line - for example, state = "A", measure = "count" is a viable combination, but state = "A", measure = "EIR" is not).

Table 1 All possible options for daily outputs. Human outputs are shaded in red and mosquito outputs in green.

Any number of extra columns can be added to this basic dataframe, for example it can be useful to have a name column giving a short description of what each row contains. We can use any standard method to produce this dataframe; the code below uses the rbind() method:

daily_dataframe <- rbind(data.frame(name = "acute_prev", deme = 1, state = "A", measure = "prevalence", diagnostic = "microscopy", age_min = 0, age_max = 5),
                         data.frame(name = "chronic_prev", deme = 1, state = "C", measure = "prevalence", diagnostic = "microscopy", age_min = 0, age_max = 5),
                         data.frame(name = "acute_inc", deme = 1, state = "A", measure = "incidence", diagnostic = "PCR", age_min = 2, age_max = 10),
                         make.row.names = FALSE)

print(daily_dataframe)
#>           name deme state    measure diagnostic age_min age_max
#> 1   acute_prev    1     A prevalence microscopy       0       5
#> 2 chronic_prev    1     C prevalence microscopy       0       5
#> 3    acute_inc    1     A  incidence        PCR       2      10

Once we’re happy with this dataframe we can load it into the project using define_epi_sampling_parameters(). At this point any errors in the format will be flagged.

myproj <- define_epi_sampling_parameters(project = myproj,
                                         daily = daily_dataframe)

The second type of output is sweeps. These can be defined using a second dataframe identical in format to the daily output but with an added time column. For example, the following code specifies that we want to output acute prevalence and acute incidence in 5-year age bands at day 365:

sweep_dataframe <- rbind(data.frame(name = "prev_acute", time = 365, deme = 1, measure = "prevalence", state = "A", diagnostic = "true", age_min = seq(0, 100, 5), age_max = seq(0, 100, 5) + 4),
                         data.frame(name = "inc_acute", time = 365, deme = 1, measure = "incidence",  state = "A", diagnostic = "true", age_min = seq(0, 100, 5), age_max = seq(0, 100, 5) + 4),
                         make.row.names = FALSE)

print(head(sweep_dataframe))
#>         name time deme    measure state diagnostic age_min age_max
#> 1 prev_acute  365    1 prevalence     A       true       0       4
#> 2 prev_acute  365    1 prevalence     A       true       5       9
#> 3 prev_acute  365    1 prevalence     A       true      10      14
#> 4 prev_acute  365    1 prevalence     A       true      15      19
#> 5 prev_acute  365    1 prevalence     A       true      20      24
#> 6 prev_acute  365    1 prevalence     A       true      25      29

Again, this needs to be loaded into the project:

myproj <- define_epi_sampling_parameters(project = myproj,
                                         daily = daily_dataframe,
                                         sweeps = sweep_dataframe)

The final type of output is surveys, but these mainly relate to genetic data and so will be explored in a later vignette.

Running the model and extracting results

Once we have specified our model parameters and outputs we are ready to run the simulation. This is done using the sim_epi() function. Note that the argument pb_markdown = TRUE is used here just to avoid cluttering this tutorial with too much output, but you should run without this argument.

myproj <- sim_epi(myproj,
                  max_time = 365,
                  pb_markdown = TRUE)
#> Running simulation
#> 
  |                                                                            
  |======================================================================| 100%
#> 
#> completed in 0.027570 seconds

Taking a quick look at the project we can now see that we have some output:

myproj
#> Epidemiological model:
#>   demes: 1
#>   H:  1000
#>   M:  10000
#>   seed infections: 100
#> 
#> Sampling strategy:
#>   daily outputs: 3
#>   sweep timepoints: 1
#>   survey outputs: 0
#> 
#> Output:
#>   simulation days: 365

All output from the epidemiological model is stored within the epi_outputs element of the project. For example, we can look at the first few daily output values:

head(myproj$epi_output$daily)
#>   time         name deme state    measure diagnostic age_min age_max value
#> 1    1   acute_prev    1     A prevalence microscopy       0       5     0
#> 2    1 chronic_prev    1     C prevalence microscopy       0       5     0
#> 3    1    acute_inc    1     A  incidence        PCR       2      10     0
#> 4    2   acute_prev    1     A prevalence microscopy       0       5     0
#> 5    2 chronic_prev    1     C prevalence microscopy       0       5     0
#> 6    2    acute_inc    1     A  incidence        PCR       2      10     0

Notice that the format of this dataframe is identical to the daily dataframe we loaded in, but now replicated over all time points and with an added value column. We see a very similar thing when we look at the sweep output:

head(myproj$epi_output$sweeps)
#>         name time deme    measure state diagnostic age_min age_max      value
#> 1 prev_acute  365    1 prevalence     A       true       0       4 0.01111111
#> 2 prev_acute  365    1 prevalence     A       true       5       9 0.03488372
#> 3 prev_acute  365    1 prevalence     A       true      10      14 0.03174603
#> 4 prev_acute  365    1 prevalence     A       true      15      19 0.03409091
#> 5 prev_acute  365    1 prevalence     A       true      20      24 0.04109589
#> 6 prev_acute  365    1 prevalence     A       true      25      29 0.06349206

Having outputs in long format like this is not always the most efficient in terms of storage, but is very convenient when it comes to plotting and summarizing results as will be seen below.

Plotting results

One of the most basic plots that we might be interested in is the daily number of hosts in each state. We can specify the states that we want to plot as an input vector:

library(ggplot2)

myproj$epi_output$daily %>%
  dplyr::filter(name %in% c("acute_prev", "chronic_prev")) %>%
  ggplot(aes(x = time, y = value)) + theme_bw() +
  geom_line(aes(color = name))

This is also true of mosquito states, which are suffixed with “v”:

#plot_daily_states(myproj, states = c("Ev", "Iv"))

In addition to storing daily counts in each state, the sim_epi() function takes a slice of the population at various times, allowing us to plot age-distributions at these points in time. The sampling times are specified by the argument output_age_times, which defaults to the final timepoint only. Hence, we can plot:

#plot_age_states(myproj, state = "C")

Multiple demes and migration

SIMPLEGEN allows for multiple sub-populations (also called demes), which can be perfectly isolated or connected by migration. First, we specify the human and mosquito population sizes in each deme by using a vector of values for H, M, and seed_infections. Second, we specify a migration matrix giving the daily probability of each human moving from the source deme (rows) to the destination deme (columns).

# define parameters over multiple demes
n_demes <- 3
H <- rep(1e3, n_demes)
seed_infections <- c(1e2, 0, 0)
M <- c(1e3, 2e3, 3e3)

# define a migration matrix
m <- 0.01
mig_mat <- matrix(m / n_demes, n_demes, n_demes) + diag(1 - m, n_demes)

# load new paramters into the model
myproj <- define_epi_model_parameters(myproj,
                                      H = H,
                                      seed_infections = seed_infections,
                                      M = M,
                                      mig_mat = mig_mat)

myproj
#> Epidemiological model:
#>   demes: 3
#>   H:  1000, 1000, 1000
#>   M:  1000, 2000, 3000
#>   seed infections: 100, 0, 0
#> 
#> Sampling strategy:
#>   daily outputs: 3
#>   sweep timepoints: 1
#>   survey outputs: 0
#> 
#> Output:
#>   simulation days: 365

Then define sampling parameters

myproj <- define_epi_sampling_parameters(myproj, daily = daily_dataframe)

We then run the simulation as normal:

myproj <- sim_epi(myproj, pb_markdown = TRUE)
#> Running simulation
#> 
  |                                                                            
  |======================================================================| 100%
#> 
#> completed in 0.059445 seconds

Now when plotting results we can use the deme argument to specify which deme to plot:

#plot_daily_states(myproj, states = c("A", "C", "P"), deme = 1)
#plot_daily_states(myproj, states = c("A", "C", "P"), deme = 2)
#plot_daily_states(myproj, states = c("A", "C", "P"), deme = 3)

In the example above there are three demes, with only the first containing infected hosts initially. As time progresses we see infections seed into the other two demes via migration.