Parallel computing on a cluster can be more challenging than running things locally because it’s often the first time that you need to package up code to run elsewhere, and when things go wrong it’s more difficult to get information on why things failed.
Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can’t physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created: templates, scripts, output files, etc.
The hipercow package aims to remove some of this pain, so that running a task on the cluster is (almost) as straightforward as running things locally, at least once some basic setup is done.
At the moment, this document assumes that we will be using the “Windows” cluster, which implies the existence of some future non-Windows cluster. Stay tuned.
This manual is structured in escalating complexity, following the chain of things that a hypothetical user might encounter as they move from their first steps on the cluster through to running enormous batches of tasks.
Installing prerequisites
Install the required packages from our “r-universe”. Be sure to run this in a fresh session.
install.packages(
"hipercow",
repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))
Once installed you can load the package with library(hipercow), or use the package by prefixing the calls below with hipercow::, as you prefer.
Follow any platform-specific instructions in vignette("<cluster>"); this will depend on the cluster you intend to use:
- Windows: vignette("windows")
Filesystems and paths
We need a concept of a “root”; the point in the filesystem we can think of everything relative to. This will feel familiar to you if you have used git or orderly, as these all have a root (and this root will be a fine place to put your cluster work). Typically all paths will be within this root directory, and paths above it, or absolute paths in general, effectively cease to exist. If your project works this way then it’s easy to move around, which is exactly what we need to do in order to run it on the cluster.
If you are using RStudio, then we strongly recommend using an RStudio project.
Initialising
Run
hipercow_init()
#> ✔ Initialised hipercow at '.' (/tmp/RtmpsuE5XH/file29607d6b3f2a)
#> ℹ Next, call 'hipercow_configure()'
which will write things to a new path hipercow/ within your working directory.
After initialisation you will typically want to configure a “driver”, which controls how tasks are sent to clusters. At the moment the only option is the Windows cluster, so for practical work you would write:
hipercow_configure(driver = "windows")
however, for this vignette we will use a special “example” driver which simulates what the cluster will do (don’t use this for anything yourself, it really won’t help):
hipercow_configure(driver = "example")
#> ✔ Configured hipercow to use 'example'
You can run initialisation and configuration in one step by running
hipercow_init(driver = "windows")
After initialisation and configuration you can see the computed configuration by running hipercow_configuration():
hipercow_configuration()
#>
#> ── hipercow root at /tmp/RtmpsuE5XH/file29607d6b3f2a ───────────────────────────
#> ✔ Working directory '.' within root
#> ℹ R version 4.4.1 on Linux (runner@fv-az775-99)
#>
#> ── Packages ──
#>
#> ℹ This is hipercow 1.0.36
#> ℹ Installed: conan2 (1.9.101), logwatch (0.1.1)
#> ✖ hipercow.windows is not installed
#>
#> ── Environments ──
#>
#> ── default
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#>
#> ── empty
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#>
#> ── Drivers ──
#>
#> ✔ 1 driver configured ('example')
#>
#> ── example
#> (unconfigurable)
Here, you can see versions of important packages, information about where you are working, and information about how you intend to interact with the cluster. See vignette("windows") for example output you might expect on the Windows cluster, which includes information about the mapping of your paths onto those of the cluster, the version of R you will use, and other information.
If you have issues with hipercow we will always want to see the output of hipercow_configuration().
Running your first task
The first time you use the tools (ever, in a while, or on a new machine) we recommend sending off a tiny task to make sure that everything is working as expected:
id <- task_create_expr(sessionInfo())
#> ✔ Submitted task 'aaccfebf05a219837c9c066ef4de69c7' using 'example'
This creates a new task that will run the expression sessionInfo() on the cluster. The task_create_expr() function works by so-called “non-standard evaluation”: the expression is not evaluated in your R session, but sent to run on another machine.
The id returned is just an ugly hex string:
id
#> [1] "aaccfebf05a219837c9c066ef4de69c7"
Many other functions accept this id as an argument. You can get the status of the task, which will have finished by now because it really does not take very long:
task_status(id)
#> [1] "success"
Once the task has completed you can inspect the result:
task_result(id)
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.1 cli_3.6.3 withr_3.0.1 hipercow_1.0.36
#> [5] rlang_1.1.4
Because we are using the “example” driver here, this is the same as the result that you’d get running sessionInfo() directly, just with more steps. See vignette("windows") for an example that runs on Windows.
Using functions you have written
It’s unlikely that the code you want to run on the cluster is one of the functions built into R itself; more likely you have written a simulation or similar and you want to run that instead. In order to do this, we need to tell the cluster where to find your code. There are two broad places where code that you want to run is likely to be found: script files and packages. We start with the former here, and deal with packages in much more detail in vignette("packages").
Suppose you have a file simulation.R containing some simulation:
random_walk <- function(x, n_steps) {
ret <- numeric(n_steps)
for (i in seq_len(n_steps)) {
x <- rnorm(1, x)
ret[[i]] <- x
}
ret
}
We can’t run this on the cluster immediately, because the cluster does not know about the new function:
id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task '8c2935a911aa81fd303c9d3251d19a0a' using 'example'
task_wait(id)
#> [1] FALSE
task_status(id)
#> [1] "failure"
task_result(id)
#> <simpleError in random_walk(0, 10): could not find function "random_walk">
(See vignette("troubleshooting") for more on failures.)
We need to tell hipercow to source() the file simulation.R before running the task. To do this we use hipercow_environment_create() to create an “environment” (not to be confused with R’s environments) in which to run things:
hipercow_environment_create(sources = "simulation.R")
#> ✔ Created environment 'default'
Now we can run our simulation:
id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task '17846cc69d07a2b72134240497065cb0' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] -0.3635061 -0.2048345 0.1675290 0.2047832 -0.4900986 0.5197873
#> [7] -0.9751158 -2.2937897 -2.0265623 -1.5793458
We have more to write on environments but briefly:
- You can have multiple environments and each task can be set to run in a different environment
- Each environment can source any number of source files, and load any number of packages
- This will become the mechanism by which environments on parallel workers (via parallel, future or rrq) will set up their environments
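To make the first two points concrete, here is a sketch of creating a second, named environment and selecting it for a particular task. The "reports" name and the report.R file are hypothetical; the name and environment arguments are assumed here to work as in hipercow’s documented interface:

```r
# A second environment alongside "default", sourcing a different file
hipercow_environment_create(name = "reports", sources = "report.R")

# Run this particular task in the "reports" environment, rather than "default"
id <- task_create_expr(build_report(), environment = "reports")
```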
Getting information about tasks
Once you have created (and submitted) tasks, they will be queued by the cluster and eventually run. The hope is that we surface enough information to make it easy for you to see how things are going and what has gone wrong.
Fetching information with task_info()
The primary function for fetching information about a task is task_info():
task_info(id)
#>
#> ── task 17846cc69d07a2b72134240497065cb0 (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-10-08 15:27:34.415422 (moments ago)
#> ℹ Started at 2024-10-08 15:27:34.645239 (moments ago; waited 230ms)
#> ℹ Finished at 2024-10-08 15:27:34.88775 (moments ago; ran for 243ms)
This prints out core information about the task: its identifier (17846cc69d07a2b72134240497065cb0) and status (success), along with information about what sort of task it was, what expression it had, variables it used, the environment it executed in, and the times that key events happened for the task (when it was created, started and finished).
This display is meant to be friendly; if you need to compute on this information, you can access the times by reading the $times element of the task_info() return value:
task_info(id)$times
#> created started finished
#> "2024-10-08 15:27:34 UTC" "2024-10-08 15:27:34 UTC" "2024-10-08 15:27:34 UTC"
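Because these are ordinary R date-times, you can compute on them directly; for example, a sketch of recovering the queue wait and running time using the element names shown above:

```r
t <- task_info(id)$times
t[["started"]] - t[["created"]]   # how long the task waited in the queue
t[["finished"]] - t[["started"]]  # how long the task ran for
```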
Likewise, the information about the task itself is within $data. To work with the underlying data you might just unclass the object to see the structure:
unclass(task_info(id))
#> $id
#> [1] "17846cc69d07a2b72134240497065cb0"
#>
#> $status
#> [1] "success"
#>
#> $data
#> $data$type
#> [1] "expression"
#>
#> $data$id
#> [1] "17846cc69d07a2b72134240497065cb0"
#>
#> $data$time
#> [1] "2024-10-08 15:27:34 UTC"
#>
#> $data$path
#> [1] "."
#>
#> $data$environment
#> [1] "default"
#>
#> $data$envvars
#> name value secret
#> 1 R_GC_MEM_GROW 3 FALSE
#>
#> $data$parallel
#> NULL
#>
#> $data$expr
#> random_walk(0, 10)
#>
#> $data$variables
#> NULL
#>
#>
#> $driver
#> [1] "example"
#>
#> $times
#> created started finished
#> "2024-10-08 15:27:34 UTC" "2024-10-08 15:27:34 UTC" "2024-10-08 15:27:34 UTC"
#>
#> $retry_chain
#> NULL
but note that the exact structure is subject to (infrequent) change.
Fetching logs with task_log_show()
Every task will produce some logs, and these can be an important part of understanding what they did and why they went wrong.
You can view the log with task_log_show():
task_log_show(id)
#>
#> ── hipercow 1.0.36 running at '/tmp/RtmpsuE5XH/file29607d6b3f2a' ───────────────
#> ℹ library paths:
#> • /home/runner/work/_temp/Library
#> • /opt/R/4.4.1/lib/R/site-library
#> • /opt/R/4.4.1/lib/R/library
#> ℹ id: 17846cc69d07a2b72134240497065cb0
#> ℹ starting at: 2024-10-08 15:27:34.645239
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Loading environment 'default'...
#> • packages: (none)
#> • sources: simulation.R
#> • globals: (none)
#> ───────────────────────────────────────────────────────────────── task logs ↓ ──
#>
#> ───────────────────────────────────────────────────────────────── task logs ↑ ──
#> ✔ status: success
#> ℹ finishing at: 2024-10-08 15:27:34.645239 (elapsed: 0.2452 secs)
This prints the contents of the logs to the screen; you can access the values directly with task_log_value(id). The format of the logs will be generally the same for all tasks: after the header saying where we are running, some information about the task will be printed (its identifier, the time, details about the task itself), then any logs that come from calls to message() and print() within the queued function (within the “task logs” section; here that is empty because our task prints nothing). Finally, a summary will be printed with the final status, final time (and elapsed time), then any warnings that were produced will be flushed (see vignette("troubleshooting") for more on warnings).
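Since task_log_value() returns the log as a character vector with one element per line, you can process it with the usual string tools; for example, a sketch of pulling out any lines that mention a warning:

```r
logs <- task_log_value(id)
grep("warning", logs, value = TRUE, ignore.case = TRUE)
```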
There is a second log too, the “outer” log, which is generally less interesting, so it is not the default. These logs come from the cluster scheduler itself and show the startup process that leads up to (and follows) the code that hipercow itself runs. It will differ from driver to driver. In addition, this log may not be available forever; the Windows cluster retains it only for a couple of weeks:
task_log_show(id, outer = TRUE)
#> Running task 17846cc69d07a2b72134240497065cb0
#> Finished task 17846cc69d07a2b72134240497065cb0
The logs returned by task_log_show(id, outer = FALSE) are the logs generated by the statement containing Rscript -e.
Watching logs with task_log_watch()
If your task is still running, you can stream logs to your computer using task_log_watch(); this will print new logs line-by-line as they arrive (with a delay of up to 1s by default). This can be useful while debugging something to give the illusion that you’re running it locally. Using Ctrl-C (or ESC in RStudio) to escape will only stop log streaming and not the underlying task.
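A typical pattern is to submit a task and watch it straight away; a sketch, where slow_simulation() stands in for whatever long-running function you have defined in your environment:

```r
id <- task_create_expr(slow_simulation())
task_log_watch(id)  # streams logs live; Ctrl-C stops watching, not the task
```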
Parallel tasks
So far, the tasks we have submitted have run using a single core on the cluster, with no other special requests made. Here is a simple example using two cores; we’ll use hipercow_resources() to specify that we want two cores on the cluster, and hipercow_parallel() to say that we want to set up two processes on those cores, using the parallel package. (We also support future.)
resources <- hipercow_resources(cores = 2)
id <- task_create_expr(
parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5)),
parallel = hipercow_parallel("parallel"),
resources = resources)
#> ✔ Submitted task '68a1741e75f7b861f2d53f6c1a49507d' using 'example'
task_wait(id)
#> [1] TRUE
task_info(id)
#>
#> ── task 68a1741e75f7b861f2d53f6c1a49507d (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5))
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-10-08 15:27:35.888448 (moments ago)
#> ℹ Started at 2024-10-08 15:27:36.098322 (moments ago; waited 210ms)
#> ℹ Finished at 2024-10-08 15:27:41.701272 (moments ago; ran for 5.6s)
Both of our parallel tasks sleep for 5 seconds. We use task_info() to report how long it took for those two runs to execute; if they ran one-by-one, we’d expect around 10 seconds, but we are seeing a much shorter time than that, so our pair of processes are running at the same time.
For details on specifying resources and launching different kinds of parallel tasks, see vignette("parallel").
Understanding where variables come from
Suppose our simulation started not from 0, but from some point that we have computed locally (say x, imaginatively):
x <- 100
You can use this value to start the simulation by running:
id <- task_create_expr(random_walk(x, 10))
#> ✔ Submitted task 'a0b523bee04a51edc60c38839a12e76c' using 'example'
Here the x value has come from the environment where the expression passed into task_create_expr() was found (specifically, we use the rlang “tidy evaluation” framework you might be familiar with from dplyr and friends).
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] 100.4505 100.5432 101.5970 103.0494 104.2283 104.1224 104.7091 107.1502
#> [9] 106.8447 107.3183
If you pass in an expression that references a value that does not exist locally, you will get a (hopefully) informative error message when the task is created:
id <- task_create_expr(random_walk(starting_point, 10))
#> Error in `rlang::env_get_list()`:
#> ! Can't find `starting_point` in environment.
Cancelling tasks
You can cancel a task that has been submitted but not yet completed, using task_cancel().
For example, here’s a task that will sleep for 10 seconds, which we submit to the cluster:
id <- task_create_expr(Sys.sleep(10))
#> ✔ Submitted task '71ca5a1e491493bad8d20cc0c3966901' using 'example'
Having decided that this is a silly idea, we can try and cancel it:
task_cancel(id)
#> ✖ Did not manage to cancel '71ca5a1e491493bad8d20cc0c3966901' which had status 'submitted'
#> [1] FALSE
task_status(id)
#> [1] "submitted"
task_info(id)
#>
#> ── task 71ca5a1e491493bad8d20cc0c3966901 (submitted) ───────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: Sys.sleep(10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-10-08 15:27:43.297579 (moments ago)
#> ! Not started yet (waiting for 146ms)
#> ! Not finished yet (waiting to start)
You can cancel a task that is submitted (waiting to be picked up by a cluster) or running (though not all drivers will support this; we need to add this to the example driver still, which will improve this example!).
You can cancel many tasks at once by passing a vector of identifiers at the same time. Tasks that have finished (successfully or not) cannot be cancelled.
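For example, a sketch of submitting a couple of slow tasks and then cancelling them all in one call; as the single-task example above suggests, task_cancel() is assumed to report per task whether cancellation succeeded:

```r
ids <- c(task_create_expr(Sys.sleep(60)),
         task_create_expr(Sys.sleep(60)))
task_cancel(ids)  # one result per task in 'ids'
```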
Retrying tasks
There are lots of reasons why you might want to retry a task. For example:
- it failed but you think it might work next time
- you updated a package that it used, and want to try again with the new version
- you don’t like the output from some stochastic function and want to generate new output
- you cancelled the task but want to try again now
You can retry tasks with task_retry(), which is easier than submitting a new task with the same content, and also preserves a link between retried tasks.
Our random walk will give slightly different results each time we use it, so we demonstrate the idea with that:
id1 <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task 'c838b0b668ee2bd8ebe808c79ece632e' using 'example'
task_wait(id1)
#> [1] TRUE
task_result(id1)
#> [1] 0.1384072 0.2166016 -0.5040800 -1.8591285 -3.0762034 -0.7985341
#> [7] -0.5833057 1.2366931 0.1790221 -1.1875241
Here we ran a random walk and it got to -1.1875241, which is clearly not what we were expecting. Let’s try it again:
id2 <- task_retry(id1)
#> ✔ Submitted task 'd71e6ca35b87b103329ecf010f8fe423' using 'example'
Running task_retry() creates a new task, with a new id (d71e6c... compared with c838b0...).
Once this task has finished, we get a different result:
task_wait(id2)
#> [1] TRUE
task_result(id2)
#> [1] -0.5387175 0.3498383 -0.8364318 -2.1500931 -1.7593317 -2.8556558
#> [7] -2.9939287 -3.4006717 -4.8534850 -4.2236560
Much better!
We get a hint that this is a retried task from task_info():
task_info(id2)
#>
#> ── task d71e6ca35b87b103329ecf010f8fe423 (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-10-08 15:27:43.518564 (less than a minute ago)
#> ℹ Started at 2024-10-08 15:27:54.867067 (moments ago; waited 11.3s)
#> ℹ Finished at 2024-10-08 15:27:55.111883 (moments ago; ran for 245ms)
#> ℹ Last of a chain of a task retried 1 time
You can see the full chain of retries here:
task_info(id2)$retry_chain
#> [1] "c838b0b668ee2bd8ebe808c79ece632e" "d71e6ca35b87b103329ecf010f8fe423"
Once a task has been retried it affects how you interact with the previous ids; by default they follow through to the most recent element in the chain:
task_result(id1)
#> [1] -0.5387175 0.3498383 -0.8364318 -2.1500931 -1.7593317 -2.8556558
#> [7] -2.9939287 -3.4006717 -4.8534850 -4.2236560
task_result(id2)
#> [1] -0.5387175 0.3498383 -0.8364318 -2.1500931 -1.7593317 -2.8556558
#> [7] -2.9939287 -3.4006717 -4.8534850 -4.2236560
You can get the original result back by passing the argument follow = FALSE:
task_result(id1, follow = FALSE)
#> [1] 0.1384072 0.2166016 -0.5040800 -1.8591285 -3.0762034 -0.7985341
#> [7] -0.5833057 1.2366931 0.1790221 -1.1875241
task_result(id2)
#> [1] -0.5387175 0.3498383 -0.8364318 -2.1500931 -1.7593317 -2.8556558
#> [7] -2.9939287 -3.4006717 -4.8534850 -4.2236560
Only tasks that have been completed (success, failure or cancelled) can be retried, and doing so adds a new task to the end of the chain; there is no branching. Retrying id1 here would create the chain id1 -> id2 -> id3, and following would select id3 for any of the three tasks in the chain.
You cannot currently change any property of a retried task; we may change this in future.