Collects configuration information. Unfortunately, working out what goes where is a fairly complicated process, so fuller documentation will come later.
didehpc_config(
credentials = NULL,
home = NULL,
temp = NULL,
cluster = NULL,
shares = NULL,
template = NULL,
cores = NULL,
wholenode = NULL,
parallel = NULL,
workdir = NULL,
use_workers = NULL,
use_rrq = NULL,
worker_timeout = NULL,
worker_resource = NULL,
conan_bootstrap = NULL,
r_version = NULL,
use_java = NULL,
java_home = NULL
)
didehpc_config_global(..., check = TRUE)
credentials: Either a list with elements username and password, a path to a file containing the lines username=<username> and password=<password>, or just your username (in which case you will be prompted graphically for your password).
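As a rough sketch (the file name and values here are purely illustrative), the file-based form might look like this:

# Contents of a file such as ~/.didehpc-credentials (illustrative path):
#   username=yourusername
#   password=yourpassword
config <- didehpc_config(credentials = "~/.didehpc-credentials")
# Or pass a list directly:
config <- didehpc_config(
  credentials = list(username = "yourusername", password = "yourpassword"))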
home: Path to your network home directory, on the local system.
temp: Path to the network temp directory, on the local system.
cluster: Name of the cluster to use; one of valid_clusters() or one of the aliases (small/little/dide/ide; big/mrc).
shares: Optional additional share mappings. Can either be a single path mapping (as returned by path_mapping()) or a list of such mappings.
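As a sketch (the share paths and drive letter are illustrative, and this assumes the usual path_mapping(name, path_local, path_remote, drive_remote) form):

share <- path_mapping("malaria", "~/net/malaria", "//fi--didef3/malaria", "M:")
config <- didehpc_config(shares = share)
# or several mappings at once (other_share being another path_mapping):
config <- didehpc_config(shares = list(share, other_share))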
template: A job template. On fi--dideclusthn this can be "GeneralNodes" or "8Core". On "fi--didemrchnb" this can be "GeneralNodes", "12Core", "16Core", "12and16Core", "20Core", "24Core", "32Core", or "MEM1024" (for nodes with 1TB of RAM; we have three, two of which have 32 cores, and the other is the AMD EPYC with 64). On the new "wpia-hn" cluster, you should currently use "AllNodes". See the main cluster documentation before tweaking these parameters, as you may not have permission to use all templates (and if you use one that you don't have permission for, the job will fail). For training purposes there is also a "Training" template, but you will only need to use this when instructed to.
cores: The number of cores to request. If specified, then we will request this many cores from the windows queuer. If you request too many cores then your task will queue forever! 24 is the largest this can be on fi--dideclusthn. On fi--didemrchnb, the GeneralNodes template mostly has nodes with 20 cores or fewer, plus a single 64-core node, while the 32Core template has 32-core nodes. On wpia-hn, all nodes have 32 cores. If cores is omitted then a single core is assumed, unless wholenode is TRUE.
wholenode: If TRUE, request exclusive access to whichever compute node is allocated to the job. Your code will have access to all the cores and memory on that node.
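A minimal sketch of how template, cores and wholenode fit together (check that you have permission for whichever template you pick):

# 16 cores from the 32Core template on fi--didemrchnb:
config <- didehpc_config(cluster = "fi--didemrchnb", template = "32Core", cores = 16)
# or take an entire node, however many cores it has:
config <- didehpc_config(wholenode = TRUE)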
parallel: Should we set up the parallel cluster? Normally, if more than one core is implied (via the cores or wholenode arguments), a parallel cluster will be set up (see Details). If parallel is set to FALSE then this will not occur. This might be useful in cases where you want to manage your own job-level parallelism (e.g., using OpenMP) or if you're just after the whole node for the memory.
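For example, if your task manages its own parallelism (say via OpenMP) and you do not want a cluster set up for you:

config <- didehpc_config(cores = 8, parallel = FALSE)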
workdir: The path to work in on the cluster, if running out of place.
use_workers: Submit jobs to an internal queue, and run them on a set of workers submitted separately? If TRUE, then enqueue and the bulk submission commands no longer submit to the DIDE queue. Instead they create an internal queue that workers can poll. After queuing tasks, use submit_workers to submit workers that will process these tasks, terminating when they are done. You can use this approach to throttle the resources you need.
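A sketch of that workflow (ctx stands for a context object and slow_function for a function of yours; the names are illustrative):

config <- didehpc_config(use_workers = TRUE)
obj <- queue_didehpc(ctx, config = config)
t <- obj$enqueue(slow_function(x))  # goes onto the internal queue, not the DIDE queue
obj$submit_workers(2)               # two workers run the queued tasks, then exit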
use_rrq: Use rrq to run a set of workers on the cluster. This is an experimental option, and the interface here may change. For now, all this does is ensure a few additional packages are installed and tweak some environment variables in the generated batch files. Actual rrq workers are submitted with the submit_workers method of the object.
worker_timeout: When using workers (via use_workers or use_rrq), the length of time (in seconds) that workers should be willing to sit idle before exiting. If set to zero then workers will be added to the queue, run jobs, and immediately exit. If greater than zero, then the workers will wait at least this many seconds after running the last task before quitting. The value can be Inf, in which case the worker will never exit (but be careful to clean the worker up in this case!). The default of 600 seconds (10 minutes) should be more than enough to get your jobs up and running. Once workers are established you can extend or reset the timeout by sending the TIMEOUT_SET message (proper documentation will come for this soon).
worker_resource: Optionally, an object created by worker_resource(), which controls the resources used by workers where these differ from jobs submitted directly with $enqueue(). This is only meaningful if you are using use_rrq = TRUE.
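A sketch combining these worker options (the template and core count are illustrative, and this assumes worker_resource() takes the same kind of resource arguments as didehpc_config itself):

config <- didehpc_config(
  use_rrq = TRUE,
  worker_timeout = Inf,  # workers never time out; remember to clean them up
  worker_resource = worker_resource(template = "GeneralNodes", cores = 4))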
conan_bootstrap: Logical, indicating if we should use the shared conan "bootstrap" library stored on the temporary directory. Setting this to FALSE will first install all dependencies required to install packages into a temporary location (this may take a few minutes) before installation proceeds. Generally leave this as-is.
r_version: A string, or numeric_version object, describing the R version required. Not all R versions are known to be supported, so this will be checked against a list of R versions installed on the cluster you are using. If omitted then: if your R version matches a version on the cluster, that will be used; otherwise, the oldest cluster version that is newer than yours; otherwise, the most recent cluster version.
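For example (the version number is illustrative only, and will be checked against what is installed on the cluster):

config <- didehpc_config(r_version = "4.0.3")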
use_java: Logical, indicating if the script is going to require Java, for example via the rJava package.
java_home: A string, optionally giving the path of a custom Java Runtime Environment, which will be used if use_java is TRUE. If left blank, then the default cluster Java Runtime Environment will be used.
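For example (the path is illustrative; omit java_home to use the cluster default):

config <- didehpc_config(use_java = TRUE, java_home = "C:/path/to/custom/jre")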
...: Arguments passed through to didehpc_config.
check: Logical, indicating if we should check that the configuration object can be created.
If you need more than one core per task (i.e., you want each task to do some parallel processing in addition to the parallelism between tasks) you can do that through the configuration options here.
The template option chooses among templates defined on the cluster.
If you specify cores, the HPC will queue your job until an appropriate number of cores appears for the selected template. This can leave your job queuing forever (e.g., selecting 20 cores on a 16Core template) so be careful.
Alternatively, if you specify wholenode as TRUE, then you will have exclusive access to whichever compute node is allocated to your job, reserving all of its cores.
If more than one core is requested (either by choosing wholenode, or by specifying a cores value greater than 1), then on startup a parallel cluster will be started, using parallel::makePSOCKcluster, and this will be registered as the default cluster. The nodes will all have the appropriate context loaded and you can immediately use them with parallel::clusterApply and related functions by passing NULL as the first argument. The cluster will be shut down politely on exit, and logs will be output to the "workers" directory below your context root.
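As a sketch, a task submitted with cores greater than 1 can use the registered cluster straight away (obj and slow_thing are placeholders for your queue object and your own function):

parallel_task <- function(xs) {
  # NULL selects the default (pre-registered) cluster
  parallel::clusterApply(NULL, xs, slow_thing)
}
t <- obj$enqueue(parallel_task(1:4))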
The options use_workers and use_rrq interact and share some functionality, but are quite different.
With use_workers, jobs are never submitted when you run enqueue or one of the bulk submission commands in queuer. Instead you submit workers using submit_workers, and the submission commands push task ids onto a Redis queue that the workers monitor.
With use_rrq, enqueue etc. still work as before, but in addition you must submit workers with submit_workers. The difference is that any job may access the rrq_controller and push jobs onto a central pool of tasks.
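A sketch of the use_rrq workflow, assuming the queue object exposes an rrq controller (shown here as obj$rrq_controller(); the exact accessor may differ between versions):

config <- didehpc_config(use_rrq = TRUE)
obj <- queue_didehpc(ctx, config = config)
obj$submit_workers(4)            # rrq workers, alongside ordinary jobs
t <- obj$enqueue(big_job(x))     # still submitted to the DIDE queue as usual
rrq <- obj$rrq_controller()      # hypothetical accessor: push fine-grained tasks
                                 # onto the shared pool from any job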
I'm not sure at this point if it makes any sense for the two approaches to work together, so this is disabled for now. If you think you have a use case, please let me know.