Base queue object:

ctx <- context::context_save("contexts")
#> [ open:db   ]  rds
#> [ save:id   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ save:name ]  discretionary_stingray
obj <- didehpc::queue_didehpc(ctx)
#> Loading context 81b64478fe2182e45e83fec2156c6ec7
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]

My job has failed

My job status is ERROR

Caused by an error in your code

If your job status is ERROR, that probably indicates an error in your code. There are many possible reasons for this, and the first challenge is working out what happened.

t <- obj$enqueue(mysimulation(10))
#> (-) waiting for f416387...bac, giving up in 9.5 s (\) waiting for f416387...bac,
#> giving up in 9.0 s

This job will fail, and $status() will report ERROR

t$status()
#> [1] "ERROR"

The first place to look is the result of the job itself. Unlike an error in your console, an error that happens on the cluster can be returned and inspected:

t$result()
#> <context_task_error in mysimulation(10): could not find function "mysimulation">

In this case the error is because the function mysimulation does not exist.

The other place worth looking is the job log:

t$log()
#> [ hello     ]  2021-08-17 14:53:09
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:53:09.042
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  3672
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]
#> [ parallel  ]  running as single core job
#> [ root      ]  Q:\didehpc\20210817-145020\contexts
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ task      ]  f4163877615d5f6fa6fb6392b6739bac
#> [ expr      ]  mysimulation(10)
#> [ start     ]  2021-08-17 14:53:09.199
#> [ error     ]
#>     Error in mysimulation(10): could not find function "mysimulation"
#> [ end       ]  2021-08-17 14:53:09.308
#>     Error in context:::main_task_run() : Error while running task:
#>     Execution halted

Sometimes there will be additional diagnostic information there.

Here’s another example:

t <- obj$enqueue(read.csv("c:/myfile.csv"))
#> (-) waiting for 9df1d9f...c74, giving up in 9.5 s (\) waiting for 9df1d9f...c74,
#> giving up in 9.0 s

This job will fail, and $status() will report ERROR

t$status()
#> [1] "ERROR"

Here is the error, which is a bit less informative this time:

t$result()
#> <context_task_error in file(file, "rt"): cannot open the connection>

The log gives a better idea of what is going on: the file c:/myfile.csv cannot be found on the cluster, because that absolute path refers to your local machine. Using paths relative to the project directory is much preferred to absolute paths.

t$log()
#> [ hello     ]  2021-08-17 14:53:10
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:53:10.777
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  2000
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]
#> [ parallel  ]  running as single core job
#> [ root      ]  Q:\didehpc\20210817-145020\contexts
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ task      ]  9df1d9f60717796422e1e62c8e206c74
#> [ expr      ]  read.csv("c:/myfile.csv")
#> [ start     ]  2021-08-17 14:53:10.949
#> [ error     ]
#>     Error in file(file, "rt"): cannot open the connection
#> [ end       ]  2021-08-17 14:53:11.074
#>     Error in context:::main_task_run() : Error while running task:
#>     In addition: Warning message:
#>     In file(file, "rt") :
#>       cannot open file 'c:/myfile.csv': No such file or directory
#>     Execution halted

The real content of the error message is present in the warning! You can also get the warnings with:

t$result()$warnings
#> [[1]]
#> <simpleWarning in file(file, "rt"): cannot open file 'c:/myfile.csv': No such file or directory>

This will be a list of all warnings generated during the execution of your task (even if it succeeds). The traceback also shows what happened:

t$result()$trace
#>  [1] "context:::main_task_run()"
#>  [2] "task_run(task_id, ctx)"
#>  [3] "eval_safely(dat$expr, dat$envir, \"context_task_error\", 3)"
#>  [4] "tryCatch(withCallingHandlers(eval(expr, envir), warning = function(e) warni"
#>  [5] "tryCatchList(expr, classes, parentenv, handlers)"
#>  [6] "tryCatchOne(expr, names, parentenv, handlers[[1]])"
#>  [7] "doTryCatch(return(expr), name, parentenv, handler)"
#>  [8] "withCallingHandlers(eval(expr, envir), warning = function(e) warnings$add(e"
#>  [9] "eval(expr, envir)"
#> [10] "eval(expr, envir)"
#> [11] "read.csv(\"c:/myfile.csv\")"
#> [12] "read.table(file = file, header = header, sep = sep, quote = quote, dec = de"
#> [13] "file(file, \"rt\")"
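
As noted above, the underlying problem is an absolute local path (c:/myfile.csv) that the cluster cannot see. A minimal sketch of the preferred approach, assuming the data file has been copied into a data/ directory inside the project directory on the network share (the file name here is hypothetical):

# "data/myfile.csv" is relative to the project directory, which lives on
# the network share, so the same path resolves both locally and on the
# cluster node:
dat <- read.csv("data/myfile.csv")
t <- obj$enqueue(read.csv("data/myfile.csv"))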

Caused by an error during startup

These are harder to troubleshoot but we can still pull some information out. The example here was a real-world case and illustrates one of the issues with using a shared filesystem in the way that we do here.

Suppose you have a context that uses some code in mycode.R:

times2 <- function(x) {
  2 * x
}

You create a connection to the cluster:

ctx <- context::context_save("contexts", sources = "mycode.R")
#> [ open:db   ]  rds
#> [ save:id   ]  dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ save:name ]  fuzzy_bass
obj <- didehpc::queue_didehpc(ctx)
#> Loading context dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ context   ]  dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mycode.R

Everything seems to work fine:

t <- obj$enqueue(times2(10))
t$wait(10)
#> (-) waiting for d478d26...d70, giving up in 9.5 s (\) waiting for d478d26...d70,
#> giving up in 9.0 s
#> [1] 20

…but then you edit the file and save it in a state that is not syntactically correct:

times2 <- function(x) {
  2 * x
}
newfun <- function(x)

And then you either submit a job, or a job that you have previously submitted gets run (which could happen ages after you submit it if the cluster is busy).

t <- obj$enqueue(times2(10))
t$wait(10)
#> (-) waiting for ef79e3b...390, giving up in 9.5 s (\) waiting for ef79e3b...390,
#> giving up in 9.0 s
#> <context_task_error in source(s, envir): mycode.R:5:0: unexpected end of input
#> 3: }
#> 4: newfun <- function(x)
#>   ^>
t$status()
#> [1] "ERROR"

The error here happens before your code is even reached: it occurs when context loads the source files. The log makes this a bit clearer:

t$log()
#> [ hello     ]  2021-08-17 14:53:14
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:53:14.324
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  3528
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mycode.R
#>     Error in source(s, envir) : mycode.R:5:0: unexpected end of input
#>     3: }
#>     4: newfun <- function(x)
#>       ^
#>     Calls: <Anonymous> -> withCallingHandlers -> context_load -> source
#>     Execution halted
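
One way to catch this class of error early is to check that your source files parse before submitting anything. A minimal sketch using base R parse(), which raises the same "unexpected end of input" error locally that you would otherwise only see in the cluster log (this checks syntax only, not behaviour):

# Throws a parse error if any source file declared in the context is not
# syntactically valid R:
for (f in c("mycode.R")) {
  parse(file = f)
}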

My jobs are getting stuck at PENDING

This is the most annoying one, and it can happen for many reasons. You can see via the web interface or the Microsoft cluster tools that your job has failed, but didehpc reports it as PENDING. This happens when something fails in the script that runs before any didehpc code starts on the cluster.

Things that have triggered this situation in the past:

  • An error in the Microsoft cluster tools
  • A misconfigured node (sometimes they are missing particular software)
  • A networking issue
  • Gremlins
  • Network path mapping error

There are doubtless others. Here, I’ll simulate one so you can see how to troubleshoot it. I’m going to deliberately misconfigure the network share that this is running on, so that the cluster will not be able to map it and the job will fail to start.

home <- didehpc::path_mapping("home", getwd(),
                              "//fi--wronghost/path", "Q:")

The host fi--wronghost does not exist so things will likely fail on startup.

config <- didehpc::didehpc_config(home = home)
ctx <- context::context_save("contexts")
#> [ open:db   ]  rds
obj <- didehpc::queue_didehpc(ctx, config)
#> Loading context 81b64478fe2182e45e83fec2156c6ec7
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]

Submit a job:

t <- obj$enqueue(sessionInfo())

And wait…

t$wait(10)
#> (-) waiting for 36dece3...711, giving up in 9.5 s (\) waiting for 36dece3...711,
#> giving up in 9.0 s (|) waiting for 36dece3...711, giving up in 8.5 s (/) waiting
#> for 36dece3...711, giving up in 8.0 s (-) waiting for 36dece3...711, giving
#> up in 7.5 s (\) waiting for 36dece3...711, giving up in 7.0 s (|) waiting for
#> 36dece3...711, giving up in 6.4 s (/) waiting for 36dece3...711, giving up
#> in 5.9 s (-) waiting for 36dece3...711, giving up in 5.4 s (\) waiting for
#> 36dece3...711, giving up in 4.9 s (|) waiting for 36dece3...711, giving up
#> in 4.4 s (/) waiting for 36dece3...711, giving up in 3.9 s (-) waiting for
#> 36dece3...711, giving up in 3.4 s (\) waiting for 36dece3...711, giving up
#> in 2.9 s (|) waiting for 36dece3...711, giving up in 2.4 s (/) waiting for
#> 36dece3...711, giving up in 1.9 s (-) waiting for 36dece3...711, giving up
#> in 1.4 s (\) waiting for 36dece3...711, giving up in 0.9 s (|) waiting for
#> 36dece3...711, giving up in 0.3 s (/) waiting for 36dece3...711, giving up in
#> 0.0 s
#> Error in task_wait(self$root$db, self$id, timeout, time_poll, progress): task not returned in time

It’s never going to succeed, and yet its status will stay as PENDING:

t$status()
#> [1] "PENDING"

To get the log from the DIDE cluster you can run:

obj$dide_log(t)
#> [1] "Task failed during execution with exit code . Please check task's output for error details."
#> [2] "Output                          : The network path was not found."

which here indicates that the network path was not found (because it was wrong!)

You can also update any incorrect statuses by running:

obj$reconcile()
#> Fetching job status from the cluster...
#>   ...done
#> manually erroring task 36dece32d6c38f71464ed1fd9a9be711
#> Tasks have failed while context booting:
#>   - 36dece32d6c38f71464ed1fd9a9be711

This will print information about anything that was adjusted.

My job works on my computer but not on the cluster

In that case, something is different between how the cluster sees the world and how your computer sees it.

  • Look in the logs to try and find the reason why the failing jobs are doing so.
  • Are there variables in the global R environment on your local computer that your code relies upon, which won’t be present on the cluster? Do you have local R packages or sources loaded which you haven’t declared when initialising your context? (See the sketch after this list.)
  • Or any system variables, or other dependencies which enable your job to work locally, but won’t be set up on a cluster node?
  • Are you referring to any files visible to your local machine, but not on the cluster? Are you referring to C: for instance?
  • (Rarely:) Are you viewing a cached version of any network storage on your local computer, that has not been synced to the real network storage view that the cluster has?
  • Check that you have not run out of disk space. The Q: quota is normally 15GB.
  • If you are running C code, check for other causes of indeterminate failures, such as uninitialised variables, or array out-of-bounds errors. These are unpredictable errors by nature, but surprisingly often you might get away with it on a local computer, while a cluster node behaves differently.
  • If you are running stochastic code, check that you are really using the same random number seeds.
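
On the environment point above: packages loaded or files sourced only in your local session are not available on the cluster unless you declare them when creating the context. A minimal sketch of the failing pattern and the fix, reusing the sources mechanism shown earlier (mysimulation.R is a hypothetical file name):

# Sourcing a file locally makes mysimulation() available in your session,
# but the cluster worker knows nothing about it, so the task will likely
# fail with "could not find function", as in the first example above:
source("mysimulation.R")
t <- obj$enqueue(mysimulation(10))

# Instead, declare the file when creating the context so that it is also
# loaded on the cluster:
ctx <- context::context_save("contexts", sources = "mysimulation.R")
obj <- didehpc::queue_didehpc(ctx)
t <- obj$enqueue(mysimulation(10))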

Some of my jobs work on the cluster, but others fail.

  • Look in the logs to try and find the reason why the failing jobs do so.
  • Try rerunning the failed jobs. Is it the same set that passes and the same set that fails? If so, consider what makes those jobs different - perhaps job-specific input data or parameters cause the failures. (See the sketch after this list.)
  • If you find messages about “Error allocating a vector…” or “std::bad_alloc”, then try to work out the memory usage of a single job. Perhaps run it locally with Task Manager (Windows) or top (Linux) open, and watch its memory usage. If the job is single-core, consider the total memory used if 8 or 16 instances run on the same cluster machine. If the total exceeds the memory available, behaviour is undefined and some jobs will likely fail.
  • In the above example, note that someone else’s memory-hungry combination of jobs may affect your small-memory job if they run on the same node. We don’t enforce any memory limits on jobs, which is, on the whole, nice and convenient, but it carries the risk that the above can happen.
  • Always check that you’re not running out of disk space. The Q: quota is normally 15GB.
  • Find out which node your jobs were running on. If you consistently get errors on one node but not others, then get in touch with Wes, as we do get node failures from time to time where the fault is not obvious at first.
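
On the rerunning point above, a minimal sketch of summarising a batch of tasks and inspecting only the failures; tasks is a hypothetical list holding the task objects returned by your earlier $enqueue() calls:

# Summarise the statuses across the whole batch:
status <- vapply(tasks, function(t) t$status(), character(1))
table(status)

# Look at the result and log of each failed task to see whether the same
# inputs fail every time:
failed <- tasks[status == "ERROR"]
lapply(failed, function(t) t$result())
lapply(failed, function(t) t$log())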

My job is slower on the cluster than running locally!

  • This is expected, especially for single-core jobs. Cluster nodes aim to provide throughput rather than speed for any single job, so one job may run slower on a cluster node than on your own computer. But the cluster node might be able to run 8 or more such jobs at once without taking any longer, while you continue using your local computer for other things.
  • If that is still insufficient and you want to compare timings directly, check that the cluster is doing exactly the same work as your local computer, as in the sketch below.
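
A minimal sketch of such a comparison: run the identical expression under system.time() both locally and as a cluster task (my_analysis() is a hypothetical stand-in for your own code and must be defined in a source file declared in your context so the cluster can find it):

# Locally:
system.time(my_analysis())

# The same expression as a cluster task; the timing object is the result:
t <- obj$enqueue(system.time(my_analysis()))
t$wait(600)
t$result()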

Asking for help

If you need help, you can ask in the “Cluster” Teams channel or try your luck emailing Rich and Wes (they may or may not have time to respond, or may be on leave).

When asking for help it is really important that you make it as easy as possible for us to help you. This is surprisingly hard to do well, and we would ask that you first take a look at these two short articles:

Things we will need to know:

  • Your DIDE username (as this makes seeing job statuses much easier)
  • At a minimum, which cluster you are using, but better to post your entire didehpc config (obj$config, if you have managed to create an object)
  • What you’ve tried doing
  • The values of any errors (not just that they occurred!)
  • Logs of the offending job, if you have them (a snippet for collecting all of this is sketched after this list)
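
A minimal sketch of collecting that information in one place before you post (every call here appears earlier in this document):

obj$config        # which cluster, username and shares you are using
t$status()        # the status didehpc reports for the problem task
t$log()           # the task log, if the job started at all
obj$dide_log(t)   # the scheduler's own log; useful for stuck PENDING jobs
t$result()        # the error object itself, if the status is ERROR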

Too often, we get requests where we have no information about what was run, what packages or versions were installed, and so on. This means your message sits there until we see it; we ask for clarification; that message sits there until you see it; you respond with a little more information; and it may be days until we finally discover the root cause of your problem, by which point we’re both quite fed up. We will never complain if you provide “too much” information in a good effort to outline where your problem is.

Don’t say

Hi, I was running a cluster job, but it seems like it failed. I’m sure it worked the other day though! Do you know what the problem is?

Do say

Since yesterday, my cluster job has stopped working.

My DIDE username is alicebobson and my DIDE config is:

<didehpc_config>
 - cluster: fi--dideclusthn
 - username: rfitzjoh
 (etc)

I am working in the myproject directory of the malaria share (\\projects\malaria)

I have set up my cluster job with

# include short script here if you can!

The job 43333cbd79ccbf9ede79556b592473c8 is one that failed with an error, and the log says

# contents of t$log() here

With this sort of information the problem may just jump out at us, or we may be able to recreate the error ourselves - either way we can start working on the problem and get back to you with a solution rather than a request for more information.