Parallel computing on a cluster can be more challenging than running things locally because it’s often the first time that you need to package up code to run elsewhere, and when things go wrong it’s more difficult to get information on why things failed.
Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can’t physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.
The hipercow
package aims to remove some of this pain,
with the aim that running a task on the cluster should be (almost) as
straightforward as running things locally, at least once some basic
setup is done.
At the moment, this document assumes that we will be using the “Windows” cluster, which implies the existence of some future non-Windows cluster. Stay tuned.
This manual is structured in escalating complexity, following the chain of things that a hypothetical user might encounter as they move from their first steps on the cluster through to running enormous batches of tasks.
Installing prerequisites
Install the required packages from our “r-universe”. Be sure to run this in a fresh session.
install.packages(
"hipercow",
repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))
Once installed you can load the package with
or use the package by prefixing the calls below with
hipercow::
, as you prefer.
Follow any platform-specific instructions in
vignettes("<cluster>")
; this will depend on the
cluster you intend to use:
- Windows:
vignette("windows")
Filesystems and paths
We need a concept of a “root”; the point in the filesystem we can think of everything relative to. This will feel familiar to you if you have used git or orderly, as these all have a root (and this root will be a fine place to put your cluster work). Typically all paths will be within this root directory, and paths above it, or absolute paths in general, effectively cease to exist. If your project works this way then it’s easy to move around, which is exactly what we need to do in order to run it on the cluster.
If you are using RStudio, then we strongly recommend using an RStudio project.
Initialising
Run
hipercow_init()
#> ✔ Initialised hipercow at '.' (/home/runner/work/_temp/hv-20241209-31c1774eb1f4)
#> ℹ Next, call 'hipercow_configure()'
which will write things to a new path hipercow/
within
your working directory.
After initialisation you will typically want to configure a “driver”, which controls how tasks are sent to clusters. At the moment the only option is the windows cluster so for practical work you would write:
hipercow_configure(driver = "windows")
however, for this vignette we will use a special “example” driver which simulates what the cluster will do (don’t use this for anything yourself, it really won’t help):
hipercow_configure(driver = "example")
#> ✔ Configured hipercow to use 'example'
You can run initialisation and configuration in one step by running
hipercow_init(driver = "windows")
After initialisation and configuration you can see the computed
configuration by running hipercow_configuration()
:
hipercow_configuration()
#>
#> ── hipercow root at /home/runner/work/_temp/hv-20241209-31c1774eb1f4 ───────────
#> ✔ Working directory '.' within root
#> ℹ R version 4.4.2 on Linux (runner@fv-az1913-706)
#>
#> ── Packages ──
#>
#> ℹ This is hipercow 1.0.52
#> ℹ Installed: conan2 (1.9.101), logwatch (0.1.1), rrq (0.7.22)
#> ✖ hipercow.windows is not installed
#>
#> ── Environments ──
#>
#> ── default
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#>
#> ── empty
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#>
#> ── Drivers ──
#>
#> ✔ 1 driver configured ('example')
#>
#> ── example
#> (unconfigurable)
Here, you can see versions of important packages, information about
where you are working, and information about how you intend to interact
with the cluster. See vignette("windows")
for example
output you might expect on the Windows cluster, which includes
information about mapping of your paths onto those of the cluster, the
version of R you will use, and other information.
If you have issues with hipercow
we will always want to
see the output of hipercow_configuration()
.
Running your first task
The first time you use the tools (ever, in a while, or on a new machine) we recommend sending off a tiny task to make sure that everything is working as expected:
id <- task_create_expr(sessionInfo())
#> ✔ Submitted task '9991985a4a73ec910c9f1b0d0617e1be' using 'example'
This creates a new task that will run the expression
sessionInfo()
on the cluster. The
task_create_expr()
function works by so-called “non standard
evaluation” and the expression is not evaluated from your R session,
but sent to run on another machine.
The id
returned is just an ugly hex string:
id
#> [1] "9991985a4a73ec910c9f1b0d0617e1be"
Many other functions accept this id
as an argument. You
can get the status of the task, which will have finished now because it
really does not take very long:
task_status(id)
#> [1] "success"
Once the task has completed you can inspect the result:
task_result(id)
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.2 cli_3.6.3 withr_3.0.2 hipercow_1.0.52
#> [5] rlang_1.1.4
Because we are using the “example” driver here, this is the same as
the result that you’d get running sessionInfo()
directly,
just with more steps. See vignette("windows")
for an
example that runs on Windows.
Using functions you have written
It’s unlikely that the code you want to run on the cluster is one of
the functions built into R itself; more likely you have written a
simulation or similar and you want to run that instead. In
order to do this, we need to tell the cluster where to find your code.
There are two broad places where code that you want to run is likely to
be found script files and packages; we
start with the former here, and deal with packages in much more detail
in vignette("packages")
.
Suppose you have a file simulation.R
containing some
simulation:
random_walk <- function(x, n_steps) {
ret <- numeric(n_steps)
for (i in seq_len(n_steps)) {
x <- rnorm(1, x)
ret[[i]] <- x
}
ret
}
We can’t run this on the cluster immediately, because the cluster does not know about the new function:
id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task '5784dfd8b6ae06288e98edd1dcb2fd7e' using 'example'
task_wait(id)
#> [1] FALSE
task_status(id)
#> [1] "failure"
task_result(id)
#> <simpleError in random_walk(0, 10): could not find function "random_walk">
(See vignette("troubleshooting")
for more on
failures.)
We need to tell hipercow to source()
the file
simulation.R
before running the task. To do this we use
hipercow_environment_create()
to create an “environment”
(not to be confused with R’s environments) in which to run things:
hipercow_environment_create(sources = "simulation.R")
#> ✔ Created environment 'default'
Now we can run our simulation:
id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task '0641ee70e2734f66849bbd6896d3cbb5' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] -0.7071049 -1.2740309 -0.4966545 0.2219471 -1.3205475 -2.2873065
#> [7] -1.3199502 -1.0696891 -2.4250178 -1.8959445
- You can have multiple environments and each task can be set to run in a different environment
- Each environment can source any number of source files, and load any number of packages
- This will become the mechanism by which environments on parallel
workers (via
parallel
,future
orrrq
) will set up their environments
Read more about environments in
vignette("environments")
Getting information about tasks
Once you have created (and submitted) tasks, they will be queued by the cluster and eventually run. The hope is that we surface enough information to make it easy for you to see how things are going and what has gone wrong.
Fetching information with task_info()
The primary function for fetching information about a task is
task_info()
:
task_info(id)
#>
#> ── task 0641ee70e2734f66849bbd6896d3cbb5 (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-12-09 18:46:15.47919 (moments ago)
#> ℹ Started at 2024-12-09 18:46:15.538477 (moments ago; waited 60ms)
#> ℹ Finished at 2024-12-09 18:46:15.539579 (moments ago; ran for 2ms)
This prints out core information about the task; its identifier
(0641ee70e2734f66849bbd6896d3cbb5
) and status
(success
), along with information about what sort of task
it was, what expression it had, variables it used, the environment it
executed in and the time that key events happened for the task (when it
was created, started and finished).
This display is meant to be friendly; if you need to compute on this
information, you can access the times by reading the $times
element of the task_info()
return value:
task_info(id)$times
#> created started finished
#> "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC"
Likewise, the information about the task itself is within
$data
. To work with the underling data you might just
unclass the object to see the structure:
unclass(task_info(id))
#> $id
#> [1] "0641ee70e2734f66849bbd6896d3cbb5"
#>
#> $status
#> [1] "success"
#>
#> $data
#> $data$type
#> [1] "expression"
#>
#> $data$id
#> [1] "0641ee70e2734f66849bbd6896d3cbb5"
#>
#> $data$time
#> [1] "2024-12-09 18:46:15 UTC"
#>
#> $data$path
#> [1] "."
#>
#> $data$environment
#> [1] "default"
#>
#> $data$envvars
#> name value secret
#> 1 R_GC_MEM_GROW 3 FALSE
#>
#> $data$parallel
#> NULL
#>
#> $data$expr
#> random_walk(0, 10)
#>
#> $data$variables
#> NULL
#>
#>
#> $driver
#> [1] "example"
#>
#> $times
#> created started finished
#> "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC" "2024-12-09 18:46:15 UTC"
#>
#> $retry_chain
#> NULL
but note that the exact structure is subject to (infrequent) change.
Fetching logs with task_log_show
Every task will produce some logs, and these can be an important part of understanding what they did and why they went wrong.
You can view the log with task_log_show()
task_log_show(id)
#> ✖ No logs for task '0641ee70e2734f66849bbd6896d3cbb5' (yet?)
This prints the contents of the logs to the screen; you can access
the values directly with task_log_value(id)
. The format of
the logs will be generally the same for all tasks; after the header
saying where we are running, some information about the task will be
printed (its identifier, the time, details about the task itself), then
any logs that come from calls to message()
and
print()
within the queued function (within the “task logs”
section; here that is empty because our task prints nothing). Finally, a
summary will be printed with the final status, final time (and elapsed
time), then any warnings that were produced will be flushed (see
vignette("troubleshooting")
for more on warnings).
There is a second log too, the “outer” log, which is generally less interesting so it is not the default. These logs come from the cluster scheduler itself and show the startup process that leads up to (and after) the code that hipercow itself runs. It will differ from driver to driver. In addition, this log may not be available forever; the windows cluster retains it only for a couple of weeks:
task_log_show(id, outer = TRUE)
#> ✖ No logs for task '0641ee70e2734f66849bbd6896d3cbb5' (yet?)
The logs returned by task_log_show(id, outer = FALSE)
are the logs generated by the statement containing
Rscript -e
.
Watching logs with task_log_watch
If your task is still running, you can stream logs to your computer
using task_log_watch()
; this will print new logs
line-by-line as they arrive (with a delay of up to 1s by default). This
can be useful while debugging something to give the illusion that you’re
running it locally.
Using Ctrl-C
(or ESC
in RStudio) to escape
will only stop log streaming and not the underlying task.
Running many tasks at once
Running one task on the cluster is nice, because it takes the load off your laptop, but it’s generally not why you’re going through this process. More likely, you have many, similar, tasks that you want to set running at once. You might be:
- Fitting a model to a series of countries
- Exploring uncertainty in a parameter
- Running a series of stochastic processes
In all these cases, you would want to submit a group of related tasks, sharing a common function, but differing in the data passed into that function. We call this the “bulk interface”, and it is the simplest and usually most effective way of getting started with parallel computing.
This sort of problem is referred to as “embarrassingly
parallel”; this is not a pejorative, it just means that your work
decomposes into a bunch of independent chunks and all we have to do is
start them. You are already familiar with things that can be run this
way: anything that can be run with R’s lapply
could be
parallelised.
There are two similar bulk creation functions, which differ based on the way the data you have are structured:
-
task_create_bulk_call
is used where you have alist
, where each element represents the key input to some computation (this is similar tolapply()
) -
task_create_bulk_expr
is used where you have adata.frame
, where each row represents the inputs to some computation (this is a little similar todplyr
’srowwise
support)
Bulk call, or “parallel map”
The bulk call interface is the one that might feel most familiar to
you; it is modelled on ideas from functions like lapply
(or, if you use purrr
, its map()
function).
The idea is simple, we have a list of data and we apply some function to
each element within it.
We’ll start with a reminder of how lapply
works, then
adapt this to run in parallel on a cluster. Imagine that we want to run
some simple simulation with a different parameter. In this example we
simulate n
samples from a normal distribution and compute
the observed mean and variance:
We can run locally it like this:
source("simulation-bulk.R")
mysim(0, 1)
#> [1] 0.006475844 0.993635933
Suppose that we have a vector of means to run this with:
mu <- c(0, 1, 2, 3, 4)
We can apply mysim
to each of the elements of
mu
by writing:
lapply(mu, mysim, sd = 1)
#> [[1]]
#> [1] -0.01562212 0.93114162
#>
#> [[2]]
#> [1] 1.043111 1.006988
#>
#> [[3]]
#> [1] 1.9583502 0.9799025
#>
#> [[4]]
#> [1] 3.052979 0.994695
#>
#> [[5]]
#> [1] 3.9999284 0.9695735
Of note here:
- Only the
mu
argument was iterated over - We provided a
sd
argument that was passed through to every call tomysim
- We get back a list in return, the same length as
mu
, with each element the result of applyingmysim
to that element.
Nothing in the above said anything about the order in which
these calculations were carried out; one might assume that we applied
myfun
to the first element of mu
at once, then
the second, but that is just conjecture. This last point seems a bit
silly, but is a useful condition to think about when considering what
can be parallelised; if you can run a “loop” backwards and get the same
answer (ignoring things like the specific draws from random number
generating functions) then your problem is well suited to being
parallelised.
Our function mysim
is in a file called
simulation-bulk.R
, which we’ll add to our environment so
that it’s available on the cluster (alongside random_walk
from above):
hipercow_environment_create(sources = c("simulation.R", "simulation-bulk.R"))
#> ✔ Updated environment 'default'
We can then submit tasks to the cluster using
task_create_bulk_call
:
bundle <- task_create_bulk_call(mysim, mu, args = list(sd = 1))
#> ✔ Submitted 5 tasks using 'example'
#> ✔ Created bundle 'bossy_mice' with 5 tasks
bundle
#> → <hipercow_bundle 'bossy_mice' with 5 tasks>
This creates a task bundle which groups together
related tasks. There is a whole set of functions for working with
bundles that behave similarly to the task query functions. So where
task_status()
retrieves the status for a single task, we
can get the status for a bundle by running
hipercow_bundle_status(bundle)
#> [1] "success" "success" "success" "success" "success"
which returns a vector over all tasks included in the bundle. You can also “reduce” this status to the “worst” status over all tasks:
hipercow_bundle_status(bundle, reduce = TRUE)
#> [1] "success"
Similarly, you can wait for a whole bundle to complete
hipercow_bundle_wait(bundle)
#> [1] TRUE
And then get the results as a list
hipercow_bundle_result(bundle)
#> [[1]]
#> [1] 0.009423132 1.025663100
#>
#> [[2]]
#> [1] 0.9934573 1.0691607
#>
#> [[3]]
#> [1] 2.008034 1.003267
#>
#> [[4]]
#> [1] 3.012224 1.026637
#>
#> [[5]]
#> [1] 3.9623300 0.8999291
This flow (create
, wait
,
result
) is equivalent to lapply
and produces
data of the same shape in return, but the tasks will be carried out in
parallel! Each task is submitted to the cluster and picked up by the
first available node. You might submit 100 tasks and if the cluster is
quiet, a few seconds later all of them will be running at the same
time.
We might want to vary both mu
and
sd
, in which case it might be convenient to keep track of
our inputs in a data.frame
:
pars <- data.frame(mu = mu, sd = sqrt(mu + 1))
We can use task_create_bundle_call
with this, too:
b <- task_create_bulk_call(mysim, pars)
#> ✔ Submitted 5 tasks using 'example'
#> ✔ Created bundle 'roastable_racerunner' with 5 tasks
hipercow_bundle_wait(b)
#> [1] TRUE
hipercow_bundle_result(b)
#> [[1]]
#> [1] -0.01324526 0.99680800
#>
#> [[2]]
#> [1] 1.012203 2.010859
#>
#> [[3]]
#> [1] 2.062672 2.972183
#>
#> [[4]]
#> [1] 3.093298 3.566746
#>
#> [[5]]
#> [1] 4.005499 4.781374
This iterates over the data in a row-wise way. Note
that this is very different to lapply
which would iterate
over columns (in practice we find that this is almost
never what people want). The names of the columns must match the names
of your function arguments and all columns must be used.
We could have passed additional arguments here too, for example
changing n
:
b <- task_create_bulk_call(mysim, pars, args = list(n = 40))
#> ✔ Submitted 5 tasks using 'example'
#> ✔ Created bundle 'bashful_warthog' with 5 tasks
Bulk expression
We also support a bulk expression interface, which can be clearer than the above.
b <- task_create_bulk_expr(mysim(mu, sd, n = 40), pars)
#> ✔ Submitted 5 tasks using 'example'
#> ✔ Created bundle 'distorted_pika' with 5 tasks
This would again work row-wise over pars
but evaluate
the expression in the first argument with the data found in the
data.frame
. This would allow you to use different column
names if convenient:
pars <- data.frame(mean = mu, stddev = sqrt(mu + 1))
b <- task_create_bulk_expr(mysim(mean, stddev, n = 40), pars)
#> ✔ Submitted 5 tasks using 'example'
#> ✔ Created bundle 'incompatible_mollies' with 5 tasks
More on bundles
You can do most things to bundles that you can do to tasks:
Action | Single task | Bundle |
---|---|---|
Get result | task_result |
hipercow_bundle_result |
Wait for completion | task_wait |
hipercow_bundle_wait |
Retry failed tasks (see below) | task_retry |
hipercow_bundle_retry |
List | task_list |
hipercow_bundle_list |
Cancel | task_cancel |
hipercow_bundle_cancel |
Get log value | task_log_value |
hipercow_bundle_log_value |
There is no equivalent of task_log_watch
or
task_log_show
because we can’t easily do this for multiple
tasks at the same time in a satisfactory way.
hipercow_bundle_delete
will delete bundles, but leave
tasks alone. hipercow_purge
will delete tasks, causing
actual deletion of data.
Some of these functions naturally have slightly different semantics
to the single-task function; for example,
hipercow_bundle_result()
returns a list of results and
hipewcow_bundle_wait
has an option fail_early
to control if it shold return FALSE
as soon as any task
fails.
Picking bundles back up again later
You can use the hipercow_bundle_list()
function to list
known bundles:
hipercow_bundle_list()
#> name time
#> 1 incompatible_mollies 2024-12-09 18:46:19
#> 2 distorted_pika 2024-12-09 18:46:18
#> 3 bashful_warthog 2024-12-09 18:46:18
#> 4 roastable_racerunner 2024-12-09 18:46:17
#> 5 bossy_mice 2024-12-09 18:46:17
Each bundle has a name (automatically generated by default) and the time that it was created. If you have launched a bundle and for some reason lost your session (e.g., windows update has rebooted your computer) you can use this to get your ids back.
name <- hipercow_bundle_list()$name[[1]]
bundle <- hipercow_bundle_load(name)
If you’re not sure what you launched, you can use
task_info
:
task_info(bundle$ids[[1]])
#>
#> ── task d302fb75e66082790328a170cb11b1f4 (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: mysim(mean, stddev, n = 40)
#> • Locals: mean and stddev
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-12-09 18:46:18.992137 (moments ago)
#> ℹ Started at 2024-12-09 18:46:19.036948 (moments ago; waited 45ms)
#> ℹ Finished at 2024-12-09 18:46:19.03823 (moments ago; ran for 2ms)
You can make this process a bit more friendly by setting your own
name into the bundle when creating it using the bundle_name
argument:
pars <- data.frame(mean = mu, stddev = sqrt(mu + 1))
b <- task_create_bulk_expr(mysim(mean, stddev, n = 40), pars,
bundle_name = "final_runs_v2")
#> ✔ Submitted 5 tasks using 'example'
#> ✔ Created bundle 'final_runs_v2' with 5 tasks
Making bundles from tasks
You can also make a bundle yourself from a group of tasks; this may be convenient if you need to launch a number of tasks individually for some reason but want to then consider them together as a group.
id1 <- task_create_expr(mysim(1, 2))
#> ✔ Submitted task '89d9603e060fee5dab74796cd9df7317' using 'example'
id2 <- task_create_expr(mysim(2, 2))
#> ✔ Submitted task '52e091f0e0e041737b5c9f26befdf4db' using 'example'
b <- hipercow_bundle_create(c(id1, id2), "my_new_bundle")
#> ✔ Created bundle 'my_new_bundle' with 2 tasks
b
#> → <hipercow_bundle 'my_new_bundle' with 2 tasks>
We can then use this bundle as above:
hipercow_bundle_status(b)
#> [1] "submitted" "submitted"
hipercow_bundle_wait(b)
#> [1] TRUE
Parallel tasks
So far, the tasks we submitted have been run using a single core on
the cluster, with no special other requests made. Here is a simple
example using two cores; we’ll use hipercow_resources()
to
specify we want two cores on the cluster, and
hipercow_parallel()
to say that we want to set up two
processes on those cores, using the parallel
package. (We
also support future
).
resources <- hipercow_resources(cores = 2)
id <- task_create_expr(
parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5)),
parallel = hipercow_parallel("parallel"),
resources = resources)
#> ✔ Submitted task '7a766a516541c84ef2279bc6372fd1fd' using 'example'
task_wait(id)
#> [1] TRUE
task_info(id)
#>
#> ── task 7a766a516541c84ef2279bc6372fd1fd (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5))
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-12-09 18:46:20.62808 (moments ago)
#> ℹ Started at 2024-12-09 18:46:20.677778 (moments ago; waited 50ms)
#> ℹ Finished at 2024-12-09 18:46:26.068083 (moments ago; ran for 5.4s)
Both of our parallel tasks are to sleep for 5 seconds. We use
task_info()
to report how long it took for those two runs
to execute; if they ran one-by-one, we’d expect around 10 seconds, but
we are seeing a much shorter time than that, so our pair of processes
are running at the same time.
For details on specifying resources and launching different kinds of
parallel tasks, see vignette("parallel")
.
Understanding where variables come from
Suppose our simulation started not from 0, but from some point that
we have computed locally (say x
, imaginatively)
x <- 100
You can use this value to start the simulation by running:
id <- task_create_expr(random_walk(x, 10))
#> ✔ Submitted task 'dfaf412c14d94a60788dfade5ec59ac4' using 'example'
Here the x
value has come from the environment where the
expression passed into task_create_expr()
was found
(specifically, we use the rlang
“tidy evaluation” framework you might be familiar with from
dplyr
and friends).
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] 99.51397 98.64564 97.98732 97.61299 96.89123 96.67704 96.40233 97.27315
#> [9] 95.84085 94.63125
If you pass in an expression that references a value that does not exist locally, you will get a (hopefully) informative error message when the task is created:
id <- task_create_expr(random_walk(starting_point, 10))
#> Error in `rlang::env_get_list()`:
#> ! Can't find `starting_point` in environment.
Cancelling tasks
You can cancel a task if it has been submitted and not completed,
using task_cancel()
:
For example, here’s a task that will sleep for 10 seconds, which we submit to the cluster:
id <- task_create_expr(Sys.sleep(10))
#> ✔ Submitted task 'b8b1066ed9afd2c6b0a130683505851b' using 'example'
Having decided that this is a silly idea, we can try and cancel it:
task_cancel(id)
#> ✖ Did not manage to cancel 'b8b1066ed9afd2c6b0a130683505851b' which had status 'running'
#> [1] FALSE
task_status(id)
#> [1] "running"
task_info(id)
#>
#> ── task b8b1066ed9afd2c6b0a130683505851b (running) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: Sys.sleep(10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-12-09 18:46:27.052863 (moments ago)
#> ℹ Started at 2024-12-09 18:46:27.087537 (moments ago; waited 35ms)
#> ! Not finished yet (running for 85ms)
You can cancel a task that is submitted (waiting to be picked up by a cluster) or running (though not all drivers will support this; we need to add this to the example driver still, which will improve this example!).
You can cancel many tasks at once by passing a vector of identifiers at the same time. Tasks that have finished (successfully or not) cannot be cancelled.
Retrying tasks
There are lots of reasons why you might want to retry a task. For example:
- it failed but you think it might work next time
- you updated a package that it used, and want to try again with the new version
- you don’t like the output from some stochastic function and want to generate new output
- you cancelled the task but want to try again now
You can retry tasks with task_retry()
, which is easier
than submitting a new task with the same content, and also preserves a
link between retried tasks.
Our random walk will give slightly different results each time we use it, so we demonstrate the idea with that:
id1 <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task '16edcb834f183787968f566439230f24' using 'example'
task_wait(id1)
#> [1] TRUE
task_result(id1)
#> [1] 0.3478105 -0.5730981 -0.6992461 -1.9641252 -1.1837832 -2.1647231
#> [7] -3.0301028 -1.7880717 -2.5532617 -2.4678150
Here we ran a random walk and it got to -2.467815, which is clearly not what we were expecting. Let’s try it again:
id2 <- task_retry(id1)
#> ✔ Submitted task '2a8087f61189016cc9497027e2e1b3fb' using 'example'
Running task_retry()
creates a new task, with a
new id 2a8087...
compared with 16edcb...
.
Once this task has finished, we get a different result:
task_wait(id2)
#> [1] TRUE
task_result(id2)
#> [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#> [8] 0.9572328 1.5361760 2.0629778
Much better!
We get a hint that this is a retried task from the
task_info()
task_info(id2)
#>
#> ── task 2a8087f61189016cc9497027e2e1b3fb (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-12-09 18:46:27.243542 (less than a minute ago)
#> ℹ Started at 2024-12-09 18:46:37.407237 (moments ago; waited 10.2s)
#> ℹ Finished at 2024-12-09 18:46:37.40864 (moments ago; ran for 2ms)
#> ℹ Last of a chain of a task retried 1 time
You can see the full chain of retries here:
task_info(id2)$retry_chain
#> [1] "16edcb834f183787968f566439230f24" "2a8087f61189016cc9497027e2e1b3fb"
Once a task has been retried it affects how you interact with the previous ids; by default they follow through to the most recent element in the chain:
task_result(id1)
#> [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#> [8] 0.9572328 1.5361760 2.0629778
task_result(id2)
#> [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#> [8] 0.9572328 1.5361760 2.0629778
You can get the original result back by passing the argument
follow = FALSE
:
task_result(id1, follow = FALSE)
#> [1] 0.3478105 -0.5730981 -0.6992461 -1.9641252 -1.1837832 -2.1647231
#> [7] -3.0301028 -1.7880717 -2.5532617 -2.4678150
task_result(id2)
#> [1] 0.6343730 2.9987212 2.4692061 1.0161391 0.2056873 1.3857663 0.6534240
#> [8] 0.9572328 1.5361760 2.0629778
Only tasks that have been completed (success
,
failure
or cancelled
) can be retried, and
doing so adds a new task to the end of the chain; there is no
branching. Retrying the id1
here would create the chain
id1 -> id2 -> id3
, and following would select
id3
for any of the three tasks in the chain.
You cannot currently change any property of a retried task, we may change this in future.