Collects configuration information. Unfortunately, working out what goes where is a fairly complicated process, so fuller documentation will come later.
didehpc_config(
credentials = NULL,
home = NULL,
temp = NULL,
cluster = NULL,
shares = NULL,
template = NULL,
cores = NULL,
wholenode = NULL,
parallel = NULL,
workdir = NULL,
use_workers = NULL,
use_rrq = NULL,
worker_timeout = NULL,
worker_resource = NULL,
conan_bootstrap = NULL,
r_version = NULL,
use_java = NULL,
java_home = NULL
)
didehpc_config_global(..., check = TRUE)
credentials: Either a list with elements username and password, a path to a file containing the lines username=<username> and password=<password>, or just your username (in which case you will be prompted graphically for your password).
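As a rough sketch (the file name and values here are purely illustrative), the file-based form might look like this:

# Contents of a file such as ~/.didehpc-credentials (illustrative path):
#   username=yourusername
#   password=yourpassword
config <- didehpc_config(credentials = "~/.didehpc-credentials")
# Or pass a list directly:
config <- didehpc_config(
  credentials = list(username = "yourusername", password = "yourpassword"))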
home: Path to your network home directory, on the local system.
temp: Path to the network temp directory, on the local system.
cluster: Name of the cluster to use; one of valid_clusters() or one of the aliases (small/little/dide/ide; big/mrc).
shares: Optional additional share mappings. Can either be a single path mapping (as returned by path_mapping()) or a list of such mappings.
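As a sketch (the share paths and drive letter are illustrative, and this assumes the usual path_mapping(name, path_local, path_remote, drive_remote) form):

share <- path_mapping("malaria", "~/net/malaria", "//fi--didef3/malaria", "M:")
config <- didehpc_config(shares = share)
# or several mappings at once (other_share being another path_mapping):
config <- didehpc_config(shares = list(share, other_share))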
template: A job template. On fi--dideclusthn this can be "GeneralNodes" or "8Core". On "fi--didemrchnb" this can be "GeneralNodes", "12Core", "16Core", "12and16Core", "20Core", "24Core", "32Core", or "MEM1024" (for nodes with 1TB of RAM; we have three, two of which have 32 cores, and the other is the AMD EPYC with 64). On the new "wpia-hn" cluster, you should currently use "AllNodes". See the main cluster documentation before tweaking these parameters, as you may not have permission to use all templates (and if you use one that you don't have permission for, the job will fail). For training purposes there is also a "Training" template, but you will only need to use this when instructed to.
cores: The number of cores to request. If specified, then we will request this many cores from the windows queuer. If you request too many cores then your task will queue forever! 24 is the largest this can be on fi--dideclusthn. On fi--didemrchnb, the GeneralNodes template mostly has nodes with 20 cores or fewer, plus a single 64-core node, while the 32Core template has 32-core nodes. On wpia-hn, all nodes have 32 cores. If cores is omitted then a single core is assumed, unless wholenode is TRUE.
wholenode: If TRUE, request exclusive access to whichever compute node is allocated to the job. Your code will have access to all the cores and memory on that node.
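A minimal sketch of how template, cores and wholenode fit together (check that you have permission for whichever template you pick):

# 16 cores from the 32Core template on fi--didemrchnb:
config <- didehpc_config(cluster = "fi--didemrchnb", template = "32Core", cores = 16)
# or take an entire node, however many cores it has:
config <- didehpc_config(wholenode = TRUE)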
parallel: Should we set up the parallel cluster? Normally, if more than one core is implied (via the cores or wholenode arguments), a parallel cluster will be set up (see Details). If parallel is set to FALSE then this will not occur. This might be useful in cases where you want to manage your own job-level parallelism (e.g., using OpenMP) or if you're just after the whole node for the memory.
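For example, if your task manages its own parallelism (say via OpenMP) and you do not want a cluster set up for you:

config <- didehpc_config(cores = 8, parallel = FALSE)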
workdir: The path to work in on the cluster, if running out of place.
use_workers: Submit jobs to an internal queue, and run them on a set of workers submitted separately? If TRUE, then enqueue and the bulk submission commands no longer submit to the DIDE queue. Instead they create an internal queue that workers can poll. After queuing tasks, use submit_workers to submit workers that will process these tasks, terminating when they are done. You can use this approach to throttle the resources you need.
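A sketch of that workflow (ctx stands for a context object and slow_function for a function of yours; the names are illustrative):

config <- didehpc_config(use_workers = TRUE)
obj <- queue_didehpc(ctx, config = config)
t <- obj$enqueue(slow_function(x))  # goes onto the internal queue, not the DIDE queue
obj$submit_workers(2)               # two workers run the queued tasks, then exit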
use_rrq: Use rrq to run a set of workers on the cluster. This is an experimental option, and the interface here may change. For now, all this does is ensure a few additional packages are installed and tweak some environment variables in the generated batch files. Actual rrq workers are submitted with the submit_workers method of the object.
worker_timeout: When using workers (via use_workers or use_rrq), the length of time (in seconds) that workers should be willing to sit idle before exiting. If set to zero then workers will be added to the queue, run jobs, and immediately exit. If greater than zero, then the workers will wait at least this many seconds after running the last task before quitting. The value can be Inf, in which case the worker will never exit (but be careful to clean the worker up in this case!). The default of 600 seconds (10 minutes) should be more than enough to get your jobs up and running. Once workers are established you can extend or reset the timeout by sending the TIMEOUT_SET message (proper documentation will come for this soon).
worker_resource: Optionally, an object created by worker_resource(), which controls the resources used by workers where these differ from jobs submitted directly with $enqueue(). This is only meaningful if you are using use_rrq = TRUE.
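A sketch combining these worker options (the template and core count are illustrative, and this assumes worker_resource() takes the same kind of resource arguments as didehpc_config itself):

config <- didehpc_config(
  use_rrq = TRUE,
  worker_timeout = Inf,  # workers never time out; remember to clean them up
  worker_resource = worker_resource(template = "GeneralNodes", cores = 4))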
conan_bootstrap: Logical, indicating if we should use the shared conan "bootstrap" library stored on the temporary directory. Setting this to FALSE will first install all dependencies required to install packages into a temporary location (this may take a few minutes) before installation proceeds. Generally leave this as-is.
r_version: A string, or numeric_version object, describing the R version required. Not all R versions are known to be supported, so this will be checked against a list of R versions installed on the cluster you are using. If omitted then: if your R version matches a version on the cluster, that will be used; otherwise, the oldest cluster version that is newer than yours; otherwise, the most recent cluster version.
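For example (the version number is illustrative only, and will be checked against what is installed on the cluster):

config <- didehpc_config(r_version = "4.0.3")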
use_java: Logical, indicating if the script is going to require Java, for example via the rJava package.
java_home: A string, optionally giving the path of a custom Java Runtime Environment, which will be used if use_java is TRUE. If left blank, then the default cluster Java Runtime Environment will be used.
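For example (the path is illustrative; omit java_home to use the cluster default):

config <- didehpc_config(use_java = TRUE, java_home = "C:/path/to/custom/jre")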
...: Arguments passed through to didehpc_config.
check: Logical, indicating if we should check that the configuration object can be created.
If you need more than one core per task (i.e., you want each task to do some parallel processing in addition to the parallelism between tasks) you can do that through the configuration options here.
The template option chooses among templates defined on the cluster.
If you specify cores, the HPC will queue your job until an appropriate number of cores appears for the selected template. This can leave your job queuing forever (e.g., selecting 20 cores on a 16Core template) so be careful.
Alternatively, if you specify wholenode as TRUE, then you will have exclusive access to whichever compute node is allocated to your job, reserving all of its cores.
If more than one core is requested (either by choosing wholenode, or by specifying a cores value greater than 1), then on startup a parallel cluster will be started, using parallel::makePSOCKcluster, and this will be registered as the default cluster. The nodes will all have the appropriate context loaded and you can immediately use them with parallel::clusterApply and related functions by passing NULL as the first argument. The cluster will be shut down politely on exit, and logs will be output to the "workers" directory below your context root.
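As a sketch, a task submitted with cores greater than 1 can use the registered cluster straight away (obj and slow_thing are placeholders for your queue object and your own function):

parallel_task <- function(xs) {
  # NULL selects the default (pre-registered) cluster
  parallel::clusterApply(NULL, xs, slow_thing)
}
t <- obj$enqueue(parallel_task(1:4))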
The options use_workers and use_rrq interact and share some functionality, but are quite different.
With use_workers, jobs are never submitted when you run enqueue or one of the bulk submission commands in queuer. Instead you submit workers using submit_workers, and the submission commands push task ids onto a Redis queue that the workers monitor.
With use_rrq, enqueue etc. still work as before, but in addition you must submit workers with submit_workers. The difference is that any job may access the rrq_controller and push jobs onto a central pool of tasks.
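A sketch of the use_rrq workflow, assuming the queue object exposes an rrq controller (shown here as obj$rrq_controller(); the exact accessor may differ between versions):

config <- didehpc_config(use_rrq = TRUE)
obj <- queue_didehpc(ctx, config = config)
obj$submit_workers(4)            # rrq workers, alongside ordinary jobs
t <- obj$enqueue(big_job(x))     # still submitted to the DIDE queue as usual
rrq <- obj$rrq_controller()      # hypothetical accessor: push fine-grained tasks
                                 # onto the shared pool from any job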
I'm not sure at this point if it makes any sense for the two approaches to work together, so this is disabled for now. If you think you have a use case, please let me know.