Get yourself running R jobs on the cluster in 10 minutes or so.

Assumptions that we make here:

  • you are using R

  • your task can be represented as running a function on some inputs to create an output (a file based output is OK)

  • you are working on a network share and have this mounted on your computer

  • you know what packages your code depends on

  • your package dependencies are all on CRAN, and are all available in Windows binary form.

If any of these do not apply to you, you’ll probably need to read the full vignette, which contains much more information in any case.

Install a lot of packages

Install the packages using drat

# install.packages("drat") # if you don't have it already
drat:::add("mrc-ide")
install.packages("didehpc")

Describe your computer so we can find things

On Windows, if you are using a domain machine, you should only need to select the cluster you want to use:

options(didehpc.cluster = "fi--didemrchnb")

Otherwise, and on any other platform, you’ll need to provide your username:

options(didehpc.cluster = "fi--didemrchnb",
        didehpc.username = "yourusername")

You can see the default configuration with

didehpc::didehpc_config()
#> <didehpc_config>
#>  - cluster: fi--dideclusthn
#>  - credentials:
#>     - username: rfitzjoh
#>     - password: *******************
#>  - username: rfitzjoh
#>  - resource:
#>     - template: GeneralNodes
#>     - parallel: FALSE
#>     - count: 1
#>     - type: Cores
#>  - shares:
#>     - home: (local) /home/rich/net/home => \\fi--san03.dide.ic.ac.uk\homes\rfitzjoh => Q: (remote)
#>     - temp: (local) /home/rich/net/temp => \\fi--didef3.dide.ic.ac.uk\tmp => T: (remote)
#>  - use_workers: FALSE
#>  - use_rrq: FALSE
#>  - worker_timeout: 600
#>  - conan_bootstrap: TRUE
#>  - r_version: 4.0.3
#>  - use_java: FALSE
#>  - redis_host: fi--dideclusthn.dide.ic.ac.uk

If this is the first time you have run this package, it is best to try out the login procedure with:

didehpc::web_login()

because this exposes a number of problems early on.

Describe your project dependencies so we can recreate that on the cluster

Make a vector of packages that you use in your project:

packages <- c("dplyr", "tidyr")

And of files that define functions that you need to run things:

sources <- "mysources.R"

If you had a vector here that would be OK too.
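For illustration, here is a minimal sketch of what mysources.R might contain. The random_walk function is used later in this document, but its definition is not shown, so the body below is an assumption: a Gaussian random walk that starts at x and takes n steps.

```r
# mysources.R (hypothetical contents, for illustration only)

# Simulate a Gaussian random walk: start at `x` and take `n` steps,
# each drawn from a standard normal distribution.
# Returns a numeric vector holding the position after each step.
random_walk <- function(x, n) {
  x + cumsum(rnorm(n))
}
```

Any function defined at the top level of a sourced file like this should be available to your tasks in the same way.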

Then save these together to form a “context”.

ctx <- context::context_save("contexts", packages = packages, sources = sources)
#> [ open:db   ]  rds
#> [ save:id   ]  9a70cec48c3108b80503144f9b88cc8d
#> [ save:name ]  complexional_australiancurlew

If you have no packages or no sources, use NULL or omit them in the call above (NULL is the default anyway).

The first argument here, "contexts", is the name of a directory that we will use to hold a lot of information about your jobs. You don’t need (or particularly want) to know what is in here.

Build a queue, based on this context.

This will prompt you for your password, as it will try and log in.

It also installs Windows versions of all packages within the contexts directory: both the packages required to get this whole system working and the packages required for your particular jobs.

obj <- didehpc::queue_didehpc(ctx)
#> Loading context 9a70cec48c3108b80503144f9b88cc8d
#> [ context   ]  9a70cec48c3108b80503144f9b88cc8d
#> [ library   ]  dplyr, tidyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
#> [ namespace ]
#> [ source    ]  mysources.R
#> Running installation script on cluster
#>   ,:\      /:.
#>  //  \_()_/  \\
#> ||   |    |   ||  CONAN THE LIBRARIAN
#> ||   |    |   ||  Library:   Q:\didehpc\20210817-145020\contexts\lib\windows\4.0
#> ||   |____|   ||  Bootstrap: T:\conan\bootstrap\4.0
#>  \\  / || \  //   Cache:     Q:\didehpc\20210817-145020\contexts\conan\cache/pkg
#>   `:/  ||  \;'    Policy:    lazy
#>        ||         Repos:
#>        ||           * https://mrc-ide.github.io/didehpc-pkgs
#>        XX           * https://cloud.r-project.org
#>        XX         Packages:
#>        XX           * dplyr
#>        XX           * tidyr
#>        OO
#>        `'
#> i Loading metadata database
#> v Loading metadata database ... done
#> i Getting 17 pkgs (9.49 MB)
#> v Got ellipsis 0.3.2 (windows) (49.19 kB)
#> v Got generics 0.1.0 (windows) (70.74 kB)
#> v Got glue 1.4.2 (windows) (155.50 kB)
#> v Got lifecycle 1.0.0 (windows) (111.22 kB)
#> v Got fansi 0.5.0 (windows) (248.45 kB)
#> v Got cli 3.0.1 (windows) (758.73 kB)
#> v Got pkgconfig 2.0.3 (windows) (22.31 kB)
#> v Got magrittr 2.0.1 (windows) (234.90 kB)
#> v Got dplyr 1.0.7 (windows) (1.35 MB)
#> v Got purrr 0.3.4 (windows) (430.04 kB)
#> v Got tidyselect 1.1.1 (windows) (204.19 kB)
#> v Got tibble 3.1.3 (windows) (835.59 kB)
#> v Got utf8 1.2.2 (windows) (209.88 kB)
#> v Got rlang 0.4.11 (windows) (1.21 MB)
#> v Got vctrs 0.3.8 (windows) (1.25 MB)
#> v Got tidyr 1.1.3 (windows) (1.06 MB)
#> v Got pillar 1.6.2 (windows) (1.07 MB)
#> v Installed generics 0.1.0  (532ms)
#> v Installed cli 3.0.1  (1.3s)
#> v Installed ellipsis 0.3.2  (1.1s)
#> v Installed fansi 0.5.0  (1.3s)
#> v Installed glue 1.4.2  (1.4s)
#> v Installed lifecycle 1.0.0  (1.6s)
#> v Installed magrittr 2.0.1  (1.7s)
#> v Installed dplyr 1.0.7  (2.3s)
#> v Installed pkgconfig 2.0.3  (1.3s)
#> v Installed pillar 1.6.2  (1.8s)
#> v Installed purrr 0.3.4  (1.5s)
#> v Installed rlang 0.4.11  (1.4s)
#> v Installed tidyselect 1.1.1  (1.2s)
#> v Installed tibble 3.1.3  (1.6s)
#> v Installed utf8 1.2.2  (1.3s)
#> v Installed vctrs 0.3.8  (1.2s)
#> v Installed tidyr 1.1.3  (1.2s)
#> v Summary:   17 new   2 kept  in 23.7s
#> Done!

Once you get to this point we’re ready to start running things on the cluster. Let’s fire off a test to make sure that everything works OK:

t <- obj$enqueue(sessionInfo())

We can poll the job for a while, which will print a progress bar. If the job finishes in time, wait returns the result of running the function. Otherwise it throws an error.

t$wait(120)
#> (-) waiting for 2fa8770...608, giving up in 119.5 s (\) waiting for
#> 2fa8770...608, giving up in 119.0 s (|) waiting for 2fa8770...608, giving up in
#> 118.5 s (/) waiting for 2fa8770...608, giving up in 117.9 s
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows Server 2012 R2 x64 (build 9600)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] dplyr_1.0.7 tidyr_1.1.3
#>
#> loaded via a namespace (and not attached):
#>  [1] fansi_0.5.0      digest_0.6.27    utf8_1.2.2       crayon_1.4.1
#>  [5] context_0.3.0    R6_2.5.0         lifecycle_1.0.0  storr_1.2.5
#>  [9] magrittr_2.0.1   pillar_1.6.2     rlang_0.4.11     vctrs_0.3.8
#> [13] generics_0.1.0   ellipsis_0.3.2   glue_1.4.2       purrr_0.3.4
#> [17] compiler_4.0.3   pkgconfig_2.0.3  tidyselect_1.1.1 tibble_3.1.3

You can use t$result() to get the result straight away (throwing an error if it is not ready) or t$wait(Inf) to wait forever.

t$result()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows Server 2012 R2 x64 (build 9600)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] dplyr_1.0.7 tidyr_1.1.3
#>
#> loaded via a namespace (and not attached):
#>  [1] fansi_0.5.0      digest_0.6.27    utf8_1.2.2       crayon_1.4.1
#>  [5] context_0.3.0    R6_2.5.0         lifecycle_1.0.0  storr_1.2.5
#>  [9] magrittr_2.0.1   pillar_1.6.2     rlang_0.4.11     vctrs_0.3.8
#> [13] generics_0.1.0   ellipsis_0.3.2   glue_1.4.2       purrr_0.3.4
#> [17] compiler_4.0.3   pkgconfig_2.0.3  tidyselect_1.1.1 tibble_3.1.3
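Since wait throws an error when the timeout is reached, a long script may want to catch that error rather than stop. The helper below is a sketch, not part of didehpc: the name wait_or_status is hypothetical, and it assumes only the wait and status methods shown in this document. The fake task at the end is a stand-in so that the helper can be tried without a cluster.

```r
# Hypothetical helper: wait for a task, but report its status instead of
# stopping the script if wait() throws (for example on timeout).
# `task` is anything with $wait() and $status() methods, such as the
# object returned by obj$enqueue().
wait_or_status <- function(task, timeout = 120) {
  tryCatch(
    list(done = TRUE, value = task$wait(timeout)),
    error = function(e) list(done = FALSE, value = task$status())
  )
}

# Stand-in task whose wait() always times out (not a real cluster task)
fake <- list(
  wait = function(timeout) stop("task not returned in time"),
  status = function() "RUNNING"
)
res <- wait_or_status(fake, 1)
res$done   # FALSE, because wait() threw an error
res$value  # the status string reported by the stand-in
```

Returning a list with a done flag keeps a genuine result apart from a status string. Note that this sketch catches every error from wait, not only timeouts.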

Running a single task

This uses the enqueue function as above, but it also works with functions defined in the files passed in as sources; here, the function random_walk.

t <- obj$enqueue(random_walk(0, 10))
res <- t$wait(120)
#> (-) waiting for 66cc979...4f0, giving up in 119.5 s (\) waiting for
#> 66cc979...4f0, giving up in 119.0 s (|) waiting for 66cc979...4f0, giving up in
#> 118.5 s (/) waiting for 66cc979...4f0, giving up in 118.0 s
res
#>  [1] -1.973025 -2.823971 -2.880453 -2.392717 -1.782159 -2.923010 -2.981436
#>  [8] -3.403000 -2.978410 -3.989555

The t object has a number of other methods you can use:

t
#> <queuer_task>
#>   Public:
#>     clone: function (deep = FALSE)
#>     context_id: function ()
#>     expr: function (locals = FALSE)
#>     id: 66cc9796e23fa4489a41bb5cfdbef4f0
#>     initialize: function (id, root, check_exists = TRUE)
#>     log: function (parse = TRUE)
#>     result: function (allow_incomplete = FALSE)
#>     root: context_root
#>     status: function ()
#>     times: function (unit_elapsed = "secs")
#>     wait: function (timeout, time_poll = 0.5, progress = NULL)

Get the result from running a task

t$result()
#>  [1] -1.973025 -2.823971 -2.880453 -2.392717 -1.782159 -2.923010 -2.981436
#>  [8] -3.403000 -2.978410 -3.989555

Get the status of the task

t$status()
#> [1] "COMPLETE"

(might also be “PENDING”, “RUNNING” or “ERROR”)
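Because t$result() throws an error when the task is not ready, it can be handy to check the status first. The helper below is a sketch with a hypothetical name, result_if_complete; it relies only on the status and result methods shown here and returns NULL for any status other than "COMPLETE".

```r
# Hypothetical helper: fetch the result only once the task has completed.
# `task` is anything with $status() and $result() methods.
result_if_complete <- function(task) {
  if (identical(task$status(), "COMPLETE")) {
    task$result()
  } else {
    NULL
  }
}
```

With the task above you would call result_if_complete(t). A task whose real result is NULL cannot be told apart from an unfinished one with this sketch.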

Get the original expression:

t$expr()
#> random_walk(0, 10)

Find out how long everything took

t$times()
#>                            task_id           submitted             started
#> 1 66cc9796e23fa4489a41bb5cfdbef4f0 2021-08-17 14:53:04 2021-08-17 14:53:06
#>              finished  waiting    running      idle
#> 1 2021-08-17 14:53:06 2.134212 0.03126001 0.4418352

You may see negative numbers for “waiting” as the submitted time is based on your computer and started/finished are based on the cluster.
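To see how the clock mismatch arises, here is a small sketch with made-up timestamps (not real cluster output) in which the local clock runs two seconds ahead of the cluster. The column names follow the t$times() output above.

```r
# Hypothetical timestamps: "submitted" comes from the local clock,
# "started" from the cluster's clock, which here is 2 seconds behind.
times <- data.frame(
  submitted = as.POSIXct("2021-08-17 14:53:06", tz = "UTC"),
  started   = as.POSIXct("2021-08-17 14:53:04", tz = "UTC")
)

# Waiting time is started minus submitted, so the skew makes it negative
waiting <- as.numeric(difftime(times$started, times$submitted, units = "secs"))
waiting

# Clamp at zero if you only need a rough, non-negative figure
pmax(waiting, 0)
```

Clamping hides the skew rather than correcting it, so treat small waiting times as approximate.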

And get the log from running the task

t$log()
#> [ hello     ]  2021-08-17 14:53:04
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:53:05.042
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  3800
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  9a70cec48c3108b80503144f9b88cc8d
#> [ library   ]  dplyr, tidyr
#>
#>     Attaching package: 'dplyr'
#>
#>     The following objects are masked from 'package:stats':
#>
#>         filter, lag
#>
#>     The following objects are masked from 'package:base':
#>
#>         intersect, setdiff, setequal, union
#>
#> [ namespace ]
#> [ source    ]  mysources.R
#> [ parallel  ]  running as single core job
#> [ root      ]  Q:\didehpc\20210817-145020\contexts
#> [ context   ]  9a70cec48c3108b80503144f9b88cc8d
#> [ task      ]  66cc9796e23fa4489a41bb5cfdbef4f0
#> [ expr      ]  random_walk(0, 10)
#> [ start     ]  2021-08-17 14:53:06.199
#> [ ok        ]
#> [ end       ]  2021-08-17 14:53:06.261
#>     Warning messages:
#>     1: package 'tidyr' was built under R version 4.0.5
#>     2: package 'dplyr' was built under R version 4.0.5

There is also a bit of DIDE specific logging that happens before this point; if the job fails inexplicably the answer may be in:

obj$dide_log(t)
#>  [1] "generated on host: kea"
#>  [2] "generated on date: 2021-08-17"
#>  [3] "didehpc version: 0.3.6"
#>  [4] "context version: 0.3.0"
#>  [5] "running on: FI--DIDECLUST26"
#>  [6] "mapping Q: -> \\\\fi--san03.dide.ic.ac.uk\\homes\\rfitzjoh"
#>  [7] "The command completed successfully."
#>  [8] ""
#>  [9] "mapping T: -> \\\\fi--didef3.dide.ic.ac.uk\\tmp"
#> [10] "The command completed successfully."
#> [11] ""
#> [12] "Using Rtools at T:\\Rtools\\Rtools40"
#> [13] "working directory: Q:\\didehpc\\20210817-145020"
#> [14] "this is a single task"
#> [15] "logfile: Q:\\didehpc\\20210817-145020\\contexts\\logs\\66cc9796e23fa4489a41bb5cfdbef4f0"
#> [16] ""
#> [17] "Q:\\didehpc\\20210817-145020>Rscript \"Q:\\didehpc\\20210817-145020\\contexts\\bin\\task_run\" \"Q:\\didehpc\\20210817-145020\\contexts\" 66cc9796e23fa4489a41bb5cfdbef4f0  1>\"Q:\\didehpc\\20210817-145020\\contexts\\logs\\66cc9796e23fa4489a41bb5cfdbef4f0\" 2>&1"
#> [18] "Removing mapping Q:"
#> [19] "Q: was deleted successfully."
#> [20] ""
#> [21] "Removing mapping T:"
#> [22] "T: was deleted successfully."
#> [23] ""
#> [24] "Quitting"
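When scanning this log for problems, the drive mappings are a natural first thing to check. The sketch below uses a hypothetical helper, failed_mappings, and assumes that a successful mapping line is always followed by "The command completed successfully." as in the output above; what a failed mapping looks like in practice is not shown here, so treat the check as a heuristic.

```r
# A few lines in the shape returned by obj$dide_log(t), copied from the
# output above.
dide_lines <- c(
  "mapping Q: -> \\\\fi--san03.dide.ic.ac.uk\\homes\\rfitzjoh",
  "The command completed successfully.",
  "",
  "mapping T: -> \\\\fi--didef3.dide.ic.ac.uk\\tmp",
  "The command completed successfully.",
  ""
)

# Hypothetical helper: return the "mapping" lines that are not immediately
# followed by the success message (including a mapping on the last line).
failed_mappings <- function(lines) {
  i <- grep("^mapping ", lines)
  ok <- lines[i + 1] == "The command completed successfully."
  lines[i][is.na(ok) | !ok]
}

failed_mappings(dide_lines)  # empty: both mappings reported success
```

With a real task you would pass obj$dide_log(t) in place of dide_lines.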

Want more information? See vignette("didehpc") for more details.