Skip to contents

Parallel computing on a cluster can be more challenging than running things locally because it’s often the first time that you need to package up code to run elsewhere, and when things go wrong it’s more difficult to get information on why things failed.

Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can’t physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.

The hipercow package aims to remove some of this pain, with the aim that running a task on the cluster should be (almost) as straightforward as running things locally, at least once some basic setup is done.

At the moment, this document assumes that we will be using the “Windows” cluster, which implies the existence of some future non-Windows cluster. Stay tuned.

This manual is structured in escalating complexity, following the chain of things that a hypothetical user might encounter as they move from their first steps on the cluster through to running enormous batches of tasks.

Installing prerequisites

Install the required packages from our “r-universe”. Be sure to run this in a fresh session.

install.packages(
  "hipercow",
  repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))

Once installed you can load the package with

or use the package by prefixing the calls below with hipercow::, as you prefer.

Follow any platform-specific instructions in vignettes("<cluster>"); this will depend on the cluster you intend to use:

Filesystems and paths

We need a concept of a “root”; the point in the filesystem we can think of everything relative to. This will feel familiar to you if you have used git or orderly, as these all have a root (and this root will be a fine place to put your cluster work). Typically all paths will be within this root directory, and paths above it, or absolute paths in general, effectively cease to exist. If your project works this way then it’s easy to move around, which is exactly what we need to do in order to run it on the cluster.

If you are using RStudio, then we strongly recommend using an RStudio project.

Initialising

Run

hipercow_init()
#>  Initialised hipercow at '.' (/home/runner/work/_temp/hv-20241209-31c1774eb1f4)
#>  Next, call 'hipercow_configure()'

which will write things to a new path hipercow/ within your working directory.

After initialisation you will typically want to configure a “driver”, which controls how tasks are sent to clusters. At the moment the only option is the windows cluster so for practical work you would write:

hipercow_configure(driver = "windows")

however, for this vignette we will use a special “example” driver which simulates what the cluster will do (don’t use this for anything yourself, it really won’t help):

hipercow_configure(driver = "example")
#>  Configured hipercow to use 'example'

You can run initialisation and configuration in one step by running

hipercow_init(driver = "windows")

After initialisation and configuration you can see the computed configuration by running hipercow_configuration():

hipercow_configuration()
#> 
#> ── hipercow root at /home/runner/work/_temp/hv-20241209-31c1774eb1f4 ───────────
#>  Working directory '.' within root
#>  R version 4.4.2 on Linux (runner@fv-az1913-706)
#> 
#> ── Packages ──
#> 
#>  This is hipercow 1.0.52
#>  Installed: conan2 (1.9.101), logwatch (0.1.1), rrq (0.7.22)
#>  hipercow.windows is not installed
#> 
#> ── Environments ──
#> 
#> ── default
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#> 
#> ── empty
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#> 
#> ── Drivers ──
#> 
#>  1 driver configured ('example')
#> 
#> ── example
#> (unconfigurable)

Here, you can see versions of important packages, information about where you are working, and information about how you intend to interact with the cluster. See vignette("windows") for example output you might expect on the Windows cluster, which includes information about mapping of your paths onto those of the cluster, the version of R you will use, and other information.

If you have issues with hipercow we will always want to see the output of hipercow_configuration().

Running your first task

The first time you use the tools (ever, in a while, or on a new machine) we recommend sending off a tiny task to make sure that everything is working as expected:

id <- task_create_expr(sessionInfo())
#>  Submitted task '9991985a4a73ec910c9f1b0d0617e1be' using 'example'

This creates a new task that will run the expression sessionInfo() on the cluster. The task_create_expr() function works by so-called “non standard evaluation” and the expression is not evaluated from your R session, but sent to run on another machine.

The id returned is just an ugly hex string:

id
#> [1] "9991985a4a73ec910c9f1b0d0617e1be"

Many other functions accept this id as an argument. You can get the status of the task, which will have finished now because it really does not take very long:

task_status(id)
#> [1] "success"

Once the task has completed you can inspect the result:

task_result(id)
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.2  cli_3.6.3       withr_3.0.2     hipercow_1.0.52
#> [5] rlang_1.1.4

Because we are using the “example” driver here, this is the same as the result that you’d get running sessionInfo() directly, just with more steps. See vignette("windows") for an example that runs on Windows.

Using functions you have written

It’s unlikely that the code you want to run on the cluster is one of the functions built into R itself; more likely you have written a simulation or similar and you want to run that instead. In order to do this, we need to tell the cluster where to find your code. There are two broad places where code that you want to run is likely to be found script files and packages; we start with the former here, and deal with packages in much more detail in vignette("packages").

Suppose you have a file simulation.R containing some simulation:

random_walk <- function(x, n_steps) {
  ret <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    x <- rnorm(1, x)
    ret[[i]] <- x
  }
  ret
}

We can’t run this on the cluster immediately, because the cluster does not know about the new function:

id <- task_create_expr(random_walk(0, 10))
#>  Submitted task '5784dfd8b6ae06288e98edd1dcb2fd7e' using 'example'
task_wait(id)
#> [1] FALSE
task_status(id)
#> [1] "failure"
task_result(id)
#> <simpleError in random_walk(0, 10): could not find function "random_walk">

(See vignette("troubleshooting") for more on failures.)

We need to tell hipercow to source() the file simulation.R before running the task. To do this we use hipercow_environment_create() to create an “environment” (not to be confused with R’s environments) in which to run things:

hipercow_environment_create(sources = "simulation.R")
#>  Created environment 'default'

Now we can run our simulation:

id <- task_create_expr(random_walk(0, 10))
#>  Submitted task '0641ee70e2734f66849bbd6896d3cbb5' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#>  [1] -0.7071049 -1.2740309 -0.4966545  0.2219471 -1.3205475 -2.2873065
#>  [7] -1.3199502 -1.0696891 -2.4250178 -1.8959445
  • You can have multiple environments and each task can be set to run in a different environment
  • Each environment can source any number of source files, and load any number of packages
  • This will become the mechanism by which environments on parallel workers (via parallel, future or rrq) will set up their environments

Read more about environments in vignette("environments")

Getting information about tasks

Once you have created (and submitted) tasks, they will be queued by the cluster and eventually run. The hope is that we surface enough information to make it easy for you to see how things are going and what has gone wrong.

Fetching information with task_info()

The primary function for fetching information about a task is task_info():

task_info(id)
#> 
#> ── task 0641ee70e2734f66849bbd6896d3cbb5 (success) ─────────────────────────────
#>  Submitted with 'example'
#>  Task type: expression
#>   • Expression: random_walk(0, 10)
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#>  Created at 2024-12-09 18:46:15.47919 (moments ago)
#>  Started at 2024-12-09 18:46:15.538477 (moments ago; waited 60ms)
#>  Finished at 2024-12-09 18:46:15.539579 (moments ago; ran for 2ms)

This prints out core information about the task; its identifier (0641ee70e2734f66849bbd6896d3cbb5) and status (success), along with information about what sort of task it was, what expression it had, variables it used, the environment it executed in and the time that key events happened for the task (when it was created, started and finished).

This display is meant to be friendly; if you need to compute on this information, you can access the times by reading the $times element of the task_info() return value:

task_info(id)$times
#>                   created                   started                  finished 
#> "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC"

Likewise, the information about the task itself is within $data. To work with the underling data you might just unclass the object to see the structure:

unclass(task_info(id))
#> $id
#> [1] "0641ee70e2734f66849bbd6896d3cbb5"
#> 
#> $status
#> [1] "success"
#> 
#> $data
#> $data$type
#> [1] "expression"
#> 
#> $data$id
#> [1] "0641ee70e2734f66849bbd6896d3cbb5"
#> 
#> $data$time
#> [1] "2024-12-09 18:46:15 UTC"
#> 
#> $data$path
#> [1] "."
#> 
#> $data$environment
#> [1] "default"
#> 
#> $data$envvars
#>            name value secret
#> 1 R_GC_MEM_GROW     3  FALSE
#> 
#> $data$parallel
#> NULL
#> 
#> $data$expr
#> random_walk(0, 10)
#> 
#> $data$variables
#> NULL
#> 
#> 
#> $driver
#> [1] "example"
#> 
#> $times
#>                   created                   started                  finished 
#> "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC" 
#> 
#> $retry_chain
#> NULL

but note that the exact structure is subject to (infrequent) change.

Fetching logs with task_log_show

Every task will produce some logs, and these can be an important part of understanding what they did and why they went wrong.

You can view the log with task_log_show()

task_log_show(id)
#>  No logs for task '0641ee70e2734f66849bbd6896d3cbb5' (yet?)

This prints the contents of the logs to the screen; you can access the values directly with task_log_value(id). The format of the logs will be generally the same for all tasks; after the header saying where we are running, some information about the task will be printed (its identifier, the time, details about the task itself), then any logs that come from calls to message() and print() within the queued function (within the “task logs” section; here that is empty because our task prints nothing). Finally, a summary will be printed with the final status, final time (and elapsed time), then any warnings that were produced will be flushed (see vignette("troubleshooting") for more on warnings).

There is a second log too, the “outer” log, which is generally less interesting so it is not the default. These logs come from the cluster scheduler itself and show the startup process that leads up to (and after) the code that hipercow itself runs. It will differ from driver to driver. In addition, this log may not be available forever; the windows cluster retains it only for a couple of weeks:

task_log_show(id, outer = TRUE)
#>  No logs for task '0641ee70e2734f66849bbd6896d3cbb5' (yet?)

The logs returned by task_log_show(id, outer = FALSE) are the logs generated by the statement containing Rscript -e.

Watching logs with task_log_watch

If your task is still running, you can stream logs to your computer using task_log_watch(); this will print new logs line-by-line as they arrive (with a delay of up to 1s by default). This can be useful while debugging something to give the illusion that you’re running it locally.

Using Ctrl-C (or ESC in RStudio) to escape will only stop log streaming and not the underlying task.

Running many tasks at once

Running one task on the cluster is nice, because it takes the load off your laptop, but it’s generally not why you’re going through this process. More likely, you have many, similar, tasks that you want to set running at once. You might be:

  • Fitting a model to a series of countries
  • Exploring uncertainty in a parameter
  • Running a series of stochastic processes

In all these cases, you would want to submit a group of related tasks, sharing a common function, but differing in the data passed into that function. We call this the “bulk interface”, and it is the simplest and usually most effective way of getting started with parallel computing.

This sort of problem is referred to as “embarrassingly parallel”; this is not a pejorative, it just means that your work decomposes into a bunch of independent chunks and all we have to do is start them. You are already familiar with things that can be run this way: anything that can be run with R’s lapply could be parallelised.

There are two similar bulk creation functions, which differ based on the way the data you have are structured:

  • task_create_bulk_call is used where you have a list, where each element represents the key input to some computation (this is similar to lapply())
  • task_create_bulk_expr is used where you have a data.frame, where each row represents the inputs to some computation (this is a little similar to dplyr’s rowwise support)

Bulk call, or “parallel map”

The bulk call interface is the one that might feel most familiar to you; it is modelled on ideas from functions like lapply (or, if you use purrr, its map() function). The idea is simple, we have a list of data and we apply some function to each element within it.

We’ll start with a reminder of how lapply works, then adapt this to run in parallel on a cluster. Imagine that we want to run some simple simulation with a different parameter. In this example we simulate n samples from a normal distribution and compute the observed mean and variance:

mysim <- function(mu, sd = 1, n = 1000) {
  r <- rnorm(n, mu, sd)
  c(mean(r), var(r))
}

We can run locally it like this:

source("simulation-bulk.R")
mysim(0, 1)
#> [1] 0.006475844 0.993635933

Suppose that we have a vector of means to run this with:

mu <- c(0, 1, 2, 3, 4)

We can apply mysim to each of the elements of mu by writing:

lapply(mu, mysim, sd = 1)
#> [[1]]
#> [1] -0.01562212  0.93114162
#> 
#> [[2]]
#> [1] 1.043111 1.006988
#> 
#> [[3]]
#> [1] 1.9583502 0.9799025
#> 
#> [[4]]
#> [1] 3.052979 0.994695
#> 
#> [[5]]
#> [1] 3.9999284 0.9695735

Of note here:

  • Only the mu argument was iterated over
  • We provided a sd argument that was passed through to every call to mysim
  • We get back a list in return, the same length as mu, with each element the result of applying mysim to that element.

Nothing in the above said anything about the order in which these calculations were carried out; one might assume that we applied myfun to the first element of mu at once, then the second, but that is just conjecture. This last point seems a bit silly, but is a useful condition to think about when considering what can be parallelised; if you can run a “loop” backwards and get the same answer (ignoring things like the specific draws from random number generating functions) then your problem is well suited to being parallelised.

Our function mysim is in a file called simulation-bulk.R, which we’ll add to our environment so that it’s available on the cluster (alongside random_walk from above):

hipercow_environment_create(sources = c("simulation.R", "simulation-bulk.R"))
#>  Updated environment 'default'

We can then submit tasks to the cluster using task_create_bulk_call:

bundle <- task_create_bulk_call(mysim, mu, args = list(sd = 1))
#>  Submitted 5 tasks using 'example'
#>  Created bundle 'bossy_mice' with 5 tasks
bundle
#> → <hipercow_bundle 'bossy_mice' with 5 tasks>

This creates a task bundle which groups together related tasks. There is a whole set of functions for working with bundles that behave similarly to the task query functions. So where task_status() retrieves the status for a single task, we can get the status for a bundle by running

hipercow_bundle_status(bundle)
#> [1] "success" "success" "success" "success" "success"

which returns a vector over all tasks included in the bundle. You can also “reduce” this status to the “worst” status over all tasks:

hipercow_bundle_status(bundle, reduce = TRUE)
#> [1] "success"

Similarly, you can wait for a whole bundle to complete

hipercow_bundle_wait(bundle)
#> [1] TRUE

And then get the results as a list

hipercow_bundle_result(bundle)
#> [[1]]
#> [1] 0.009423132 1.025663100
#> 
#> [[2]]
#> [1] 0.9934573 1.0691607
#> 
#> [[3]]
#> [1] 2.008034 1.003267
#> 
#> [[4]]
#> [1] 3.012224 1.026637
#> 
#> [[5]]
#> [1] 3.9623300 0.8999291

This flow (create, wait, result) is equivalent to lapply and produces data of the same shape in return, but the tasks will be carried out in parallel! Each task is submitted to the cluster and picked up by the first available node. You might submit 100 tasks and if the cluster is quiet, a few seconds later all of them will be running at the same time.

We might want to vary both mu and sd, in which case it might be convenient to keep track of our inputs in a data.frame:

pars <- data.frame(mu = mu, sd = sqrt(mu + 1))

We can use task_create_bundle_call with this, too:

b <- task_create_bulk_call(mysim, pars)
#>  Submitted 5 tasks using 'example'
#>  Created bundle 'roastable_racerunner' with 5 tasks
hipercow_bundle_wait(b)
#> [1] TRUE
hipercow_bundle_result(b)
#> [[1]]
#> [1] -0.01324526  0.99680800
#> 
#> [[2]]
#> [1] 1.012203 2.010859
#> 
#> [[3]]
#> [1] 2.062672 2.972183
#> 
#> [[4]]
#> [1] 3.093298 3.566746
#> 
#> [[5]]
#> [1] 4.005499 4.781374

This iterates over the data in a row-wise way. Note that this is very different to lapply which would iterate over columns (in practice we find that this is almost never what people want). The names of the columns must match the names of your function arguments and all columns must be used.

We could have passed additional arguments here too, for example changing n:

b <- task_create_bulk_call(mysim, pars, args = list(n = 40))
#>  Submitted 5 tasks using 'example'
#>  Created bundle 'bashful_warthog' with 5 tasks

Bulk expression

We also support a bulk expression interface, which can be clearer than the above.

b <- task_create_bulk_expr(mysim(mu, sd, n = 40), pars)
#>  Submitted 5 tasks using 'example'
#>  Created bundle 'distorted_pika' with 5 tasks

This would again work row-wise over pars but evaluate the expression in the first argument with the data found in the data.frame. This would allow you to use different column names if convenient:

pars <- data.frame(mean = mu, stddev = sqrt(mu + 1))
b <- task_create_bulk_expr(mysim(mean, stddev, n = 40), pars)
#>  Submitted 5 tasks using 'example'
#>  Created bundle 'incompatible_mollies' with 5 tasks

More on bundles

You can do most things to bundles that you can do to tasks:

Action Single task Bundle
Get result task_result hipercow_bundle_result
Wait for completion task_wait hipercow_bundle_wait
Retry failed tasks (see below) task_retry hipercow_bundle_retry
List task_list hipercow_bundle_list
Cancel task_cancel hipercow_bundle_cancel
Get log value task_log_value hipercow_bundle_log_value

There is no equivalent of task_log_watch or task_log_show because we can’t easily do this for multiple tasks at the same time in a satisfactory way.

hipercow_bundle_delete will delete bundles, but leave tasks alone. hipercow_purge will delete tasks, causing actual deletion of data.

Some of these functions naturally have slightly different semantics to the single-task function; for example, hipercow_bundle_result() returns a list of results and hipewcow_bundle_wait has an option fail_early to control if it shold return FALSE as soon as any task fails.

Picking bundles back up again later

You can use the hipercow_bundle_list() function to list known bundles:

hipercow_bundle_list()
#>                   name                time
#> 1 incompatible_mollies 2024-12-09 18:46:19
#> 2       distorted_pika 2024-12-09 18:46:18
#> 3      bashful_warthog 2024-12-09 18:46:18
#> 4 roastable_racerunner 2024-12-09 18:46:17
#> 5           bossy_mice 2024-12-09 18:46:17

Each bundle has a name (automatically generated by default) and the time that it was created. If you have launched a bundle and for some reason lost your session (e.g., windows update has rebooted your computer) you can use this to get your ids back.

name <- hipercow_bundle_list()$name[[1]]
bundle <- hipercow_bundle_load(name)

If you’re not sure what you launched, you can use task_info:

task_info(bundle$ids[[1]])
#> 
#> ── task d302fb75e66082790328a170cb11b1f4 (success) ─────────────────────────────
#>  Submitted with 'example'
#>  Task type: expression
#>   • Expression: mysim(mean, stddev, n = 40)
#>   • Locals: mean and stddev
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#>  Created at 2024-12-09 18:46:18.992137 (moments ago)
#>  Started at 2024-12-09 18:46:19.036948 (moments ago; waited 45ms)
#>  Finished at 2024-12-09 18:46:19.03823 (moments ago; ran for 2ms)

You can make this process a bit more friendly by setting your own name into the bundle when creating it using the bundle_name argument:

pars <- data.frame(mean = mu, stddev = sqrt(mu + 1))
b <- task_create_bulk_expr(mysim(mean, stddev, n = 40), pars, 
                           bundle_name = "final_runs_v2")
#>  Submitted 5 tasks using 'example'
#>  Created bundle 'final_runs_v2' with 5 tasks

Making bundles from tasks

You can also make a bundle yourself from a group of tasks; this may be convenient if you need to launch a number of tasks individually for some reason but want to then consider them together as a group.

id1 <- task_create_expr(mysim(1, 2))
#>  Submitted task '89d9603e060fee5dab74796cd9df7317' using 'example'
id2 <- task_create_expr(mysim(2, 2))
#>  Submitted task '52e091f0e0e041737b5c9f26befdf4db' using 'example'
b <- hipercow_bundle_create(c(id1, id2), "my_new_bundle")
#>  Created bundle 'my_new_bundle' with 2 tasks
b
#> → <hipercow_bundle 'my_new_bundle' with 2 tasks>

We can then use this bundle as above:

hipercow_bundle_status(b)
#> [1] "submitted" "submitted"
hipercow_bundle_wait(b)
#> [1] TRUE

Parallel tasks

So far, the tasks we submitted have been run using a single core on the cluster, with no special other requests made. Here is a simple example using two cores; we’ll use hipercow_resources() to specify we want two cores on the cluster, and hipercow_parallel() to say that we want to set up two processes on those cores, using the parallel package. (We also support future).

resources <- hipercow_resources(cores = 2)
id <- task_create_expr(
  parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5)),
  parallel = hipercow_parallel("parallel"),
  resources = resources)
#>  Submitted task '7a766a516541c84ef2279bc6372fd1fd' using 'example'
task_wait(id)
#> [1] TRUE
task_info(id)
#> 
#> ── task 7a766a516541c84ef2279bc6372fd1fd (success) ─────────────────────────────
#>  Submitted with 'example'
#>  Task type: expression
#>   • Expression: parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5))
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#>  Created at 2024-12-09 18:46:20.62808 (moments ago)
#>  Started at 2024-12-09 18:46:20.677778 (moments ago; waited 50ms)
#>  Finished at 2024-12-09 18:46:26.068083 (moments ago; ran for 5.4s)

Both of our parallel tasks are to sleep for 5 seconds. We use task_info() to report how long it took for those two runs to execute; if they ran one-by-one, we’d expect around 10 seconds, but we are seeing a much shorter time than that, so our pair of processes are running at the same time.

For details on specifying resources and launching different kinds of parallel tasks, see vignette("parallel").

Understanding where variables come from

Suppose our simulation started not from 0, but from some point that we have computed locally (say x, imaginatively)

x <- 100

You can use this value to start the simulation by running:

id <- task_create_expr(random_walk(x, 10))
#>  Submitted task 'dfaf412c14d94a60788dfade5ec59ac4' using 'example'

Here the x value has come from the environment where the expression passed into task_create_expr() was found (specifically, we use the rlang “tidy evaluation” framework you might be familiar with from dplyr and friends).

task_wait(id)
#> [1] TRUE
task_result(id)
#>  [1] 99.51397 98.64564 97.98732 97.61299 96.89123 96.67704 96.40233 97.27315
#>  [9] 95.84085 94.63125

If you pass in an expression that references a value that does not exist locally, you will get a (hopefully) informative error message when the task is created:

id <- task_create_expr(random_walk(starting_point, 10))
#> Error in `rlang::env_get_list()`:
#> ! Can't find `starting_point` in environment.

Cancelling tasks

You can cancel a task if it has been submitted and not completed, using task_cancel():

For example, here’s a task that will sleep for 10 seconds, which we submit to the cluster:

id <- task_create_expr(Sys.sleep(10))
#>  Submitted task 'b8b1066ed9afd2c6b0a130683505851b' using 'example'

Having decided that this is a silly idea, we can try and cancel it:

task_cancel(id)
#>  Did not manage to cancel 'b8b1066ed9afd2c6b0a130683505851b' which had status 'running'
#> [1] FALSE
task_status(id)
#> [1] "running"
task_info(id)
#> 
#> ── task b8b1066ed9afd2c6b0a130683505851b (running) ─────────────────────────────
#>  Submitted with 'example'
#>  Task type: expression
#>   • Expression: Sys.sleep(10)
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#>  Created at 2024-12-09 18:46:27.052863 (moments ago)
#>  Started at 2024-12-09 18:46:27.087537 (moments ago; waited 35ms)
#> ! Not finished yet (running for 85ms)

You can cancel a task that is submitted (waiting to be picked up by a cluster) or running (though not all drivers will support this; we need to add this to the example driver still, which will improve this example!).

You can cancel many tasks at once by passing a vector of identifiers at the same time. Tasks that have finished (successfully or not) cannot be cancelled.

Retrying tasks

There are lots of reasons why you might want to retry a task. For example:

  • it failed but you think it might work next time
  • you updated a package that it used, and want to try again with the new version
  • you don’t like the output from some stochastic function and want to generate new output
  • you cancelled the task but want to try again now

You can retry tasks with task_retry(), which is easier than submitting a new task with the same content, and also preserves a link between retried tasks.

Our random walk will give slightly different results each time we use it, so we demonstrate the idea with that:

id1 <- task_create_expr(random_walk(0, 10))
#>  Submitted task '16edcb834f183787968f566439230f24' using 'example'
task_wait(id1)
#> [1] TRUE
task_result(id1)
#>  [1]  0.3478105 -0.5730981 -0.6992461 -1.9641252 -1.1837832 -2.1647231
#>  [7] -3.0301028 -1.7880717 -2.5532617 -2.4678150

Here we ran a random walk and it got to -2.467815, which is clearly not what we were expecting. Let’s try it again:

id2 <- task_retry(id1)
#>  Submitted task '2a8087f61189016cc9497027e2e1b3fb' using 'example'

Running task_retry() creates a new task, with a new id 2a8087... compared with 16edcb....

Once this task has finished, we get a different result:

task_wait(id2)
#> [1] TRUE
task_result(id2)
#>  [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#>  [8] 0.9572328 1.5361760 2.0629778

Much better!

We get a hint that this is a retried task from the task_info()

task_info(id2)
#> 
#> ── task 2a8087f61189016cc9497027e2e1b3fb (success) ─────────────────────────────
#>  Submitted with 'example'
#>  Task type: expression
#>   • Expression: random_walk(0, 10)
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#>  Created at 2024-12-09 18:46:27.243542 (less than a minute ago)
#>  Started at 2024-12-09 18:46:37.407237 (moments ago; waited 10.2s)
#>  Finished at 2024-12-09 18:46:37.40864 (moments ago; ran for 2ms)
#>  Last of a chain of a task retried 1 time

You can see the full chain of retries here:

task_info(id2)$retry_chain
#> [1] "16edcb834f183787968f566439230f24" "2a8087f61189016cc9497027e2e1b3fb"

Once a task has been retried it affects how you interact with the previous ids; by default they follow through to the most recent element in the chain:

task_result(id1)
#>  [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#>  [8] 0.9572328 1.5361760 2.0629778
task_result(id2)
#>  [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#>  [8] 0.9572328 1.5361760 2.0629778

You can get the original result back by passing the argument follow = FALSE:

task_result(id1, follow = FALSE)
#>  [1]  0.3478105 -0.5730981 -0.6992461 -1.9641252 -1.1837832 -2.1647231
#>  [7] -3.0301028 -1.7880717 -2.5532617 -2.4678150
task_result(id2)
#>  [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#>  [8] 0.9572328 1.5361760 2.0629778

Only tasks that have been completed (success, failure or cancelled) can be retried, and doing so adds a new task to the end of the chain; there is no branching. Retrying the id1 here would create the chain id1 -> id2 -> id3, and following would select id3 for any of the three tasks in the chain.

You cannot currently change any property of a retried task, we may change this in future.