Base queue object:
ctx <- context::context_save("contexts")
#> [ open:db ] rds
#> [ save:id ] 81b64478fe2182e45e83fec2156c6ec7
#> [ save:name ] discretionary_stingray
obj <- didehpc::queue_didehpc(ctx)
#> Loading context 81b64478fe2182e45e83fec2156c6ec7
#> [ context ] 81b64478fe2182e45e83fec2156c6ec7
#> [ library ]
#> [ namespace ]
#> [ source ]
ERROR
If your job status is ERROR, that probably indicates an error in your code. There are many possible reasons for this, and the first challenge is working out what happened. Consider this example:
t <- obj$enqueue(mysimulation(10))
#> (-) waiting for f416387...bac, giving up in 9.5 s (\) waiting for f416387...bac,
#> giving up in 9.0 s
This job will fail, and $status() will report ERROR:
t$status()
#> [1] "ERROR"
The first place to look is the result of the job itself. Unlike an error in your console, an error that happens on the cluster can be returned and inspected:
t$result()
#> <context_task_error in mysimulation(10): could not find function "mysimulation">
In this case the error is because the function mysimulation does not exist.
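If the function is one of your own, the usual fix is to put it in a source file and tell the context about it via the sources argument (the mycode.R example later in this section uses the same mechanism). A minimal sketch, assuming your function lives in a hypothetical file mysimulation.R:
# Hypothetical fix: define mysimulation() in mysimulation.R and list it in
# `sources` so it is loaded on the cluster before the task runs.
ctx <- context::context_save("contexts", sources = "mysimulation.R")
obj <- didehpc::queue_didehpc(ctx)
t <- obj$enqueue(mysimulation(10))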
The other place worth looking is the job log:
t$log()
#> [ hello ] 2021-08-17 14:53:09
#> [ wd ] Q:/didehpc/20210817-145020
#> [ init ] 2021-08-17 14:53:09.042
#> [ hostname ] FI--DIDECLUST26
#> [ process ] 3672
#> [ version ] 0.3.0
#> [ open:db ] rds
#> [ context ] 81b64478fe2182e45e83fec2156c6ec7
#> [ library ]
#> [ namespace ]
#> [ source ]
#> [ parallel ] running as single core job
#> [ root ] Q:\didehpc\20210817-145020\contexts
#> [ context ] 81b64478fe2182e45e83fec2156c6ec7
#> [ task ] f4163877615d5f6fa6fb6392b6739bac
#> [ expr ] mysimulation(10)
#> [ start ] 2021-08-17 14:53:09.199
#> [ error ]
#> Error in mysimulation(10): could not find function "mysimulation"
#> [ end ] 2021-08-17 14:53:09.308
#> Error in context:::main_task_run() : Error while running task:
#> Execution halted
Sometimes there will be additional diagnostic information there.
Here’s another example:
t <- obj$enqueue(read.csv("c:/myfile.csv"))
#> (-) waiting for 9df1d9f...c74, giving up in 9.5 s (\) waiting for 9df1d9f...c74,
#> giving up in 9.0 s
This job will fail, and $status() will report ERROR:
t$status()
#> [1] "ERROR"
Here is the error, which is a bit less informative this time:
t$result()
#> <context_task_error in file(file, "rt"): cannot open the connection>
The log gives a better idea of what is going on: the file c:/myfile.csv does not exist, because it is not found on the cluster (using relative paths is much preferred to absolute paths).
t$log()
#> [ hello ] 2021-08-17 14:53:10
#> [ wd ] Q:/didehpc/20210817-145020
#> [ init ] 2021-08-17 14:53:10.777
#> [ hostname ] FI--DIDECLUST26
#> [ process ] 2000
#> [ version ] 0.3.0
#> [ open:db ] rds
#> [ context ] 81b64478fe2182e45e83fec2156c6ec7
#> [ library ]
#> [ namespace ]
#> [ source ]
#> [ parallel ] running as single core job
#> [ root ] Q:\didehpc\20210817-145020\contexts
#> [ context ] 81b64478fe2182e45e83fec2156c6ec7
#> [ task ] 9df1d9f60717796422e1e62c8e206c74
#> [ expr ] read.csv("c:/myfile.csv")
#> [ start ] 2021-08-17 14:53:10.949
#> [ error ]
#> Error in file(file, "rt"): cannot open the connection
#> [ end ] 2021-08-17 14:53:11.074
#> Error in context:::main_task_run() : Error while running task:
#> In addition: Warning message:
#> In file(file, "rt") :
#> cannot open file 'c:/myfile.csv': No such file or directory
#> Execution halted
The real content of the error message is present in the warning! You can also get the warnings with
t$result()$warnings
#> [[1]]
#> <simpleWarning in file(file, "rt"): cannot open file 'c:/myfile.csv': No such file or directory>
This will be a list of all warnings generated during the execution of your task (even if it succeeds). The traceback also shows what happened:
t$result()$trace
#> [1] "context:::main_task_run()"
#> [2] "task_run(task_id, ctx)"
#> [3] "eval_safely(dat$expr, dat$envir, \"context_task_error\", 3)"
#> [4] "tryCatch(withCallingHandlers(eval(expr, envir), warning = function(e) warni"
#> [5] "tryCatchList(expr, classes, parentenv, handlers)"
#> [6] "tryCatchOne(expr, names, parentenv, handlers[[1]])"
#> [7] "doTryCatch(return(expr), name, parentenv, handler)"
#> [8] "withCallingHandlers(eval(expr, envir), warning = function(e) warnings$add(e"
#> [9] "eval(expr, envir)"
#> [10] "eval(expr, envir)"
#> [11] "read.csv(\"c:/myfile.csv\")"
#> [12] "read.table(file = file, header = header, sep = sep, quote = quote, dec = de"
#> [13] "file(file, \"rt\")"
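As a sketch of the relative-path approach mentioned above, assuming myfile.csv has been copied into the project directory that didehpc maps onto the cluster, the same path then resolves both locally and on the cluster:
# Assumes myfile.csv sits inside the shared project directory (the one
# containing the "contexts" directory), so the relative path is valid on
# your machine and on the cluster node.
t <- obj$enqueue(read.csv("myfile.csv"))
t$wait(10)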
Errors that happen during startup, before your code runs (for example while context loads your source files), are harder to troubleshoot, but we can still pull some information out. The example here was a real-world case and illustrates one of the issues with using a shared filesystem in the way that we do here.
Suppose you have a context that uses some code in mycode.R:
times2 <- function(x) {
  2 * x
}
You create a connection to the cluster:
ctx <- context::context_save("contexts", sources = "mycode.R")
#> [ open:db ] rds
#> [ save:id ] dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ save:name ] fuzzy_bass
obj <- didehpc::queue_didehpc(ctx)
#> Loading context dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ context ] dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ library ]
#> [ namespace ]
#> [ source ] mycode.R
Everything seems to work fine:
t <- obj$enqueue(times2(10))
t$wait(10)
#> (-) waiting for d478d26...d70, giving up in 9.5 s (\) waiting for d478d26...d70,
#> giving up in 9.0 s
#> [1] 20
…but then you edit the file and save a version that is not syntactically correct:
times2 <- function(x) {
  2 * x
}
newfun <- function(x)
And then you either submit a job, or a job that you have previously submitted gets run (which could happen ages after you submit it if the cluster is busy).
t <- obj$enqueue(times2(10))
t$wait(10)
#> (-) waiting for ef79e3b...390, giving up in 9.5 s (\) waiting for ef79e3b...390,
#> giving up in 9.0 s
#> <context_task_error in source(s, envir): mycode.R:5:0: unexpected end of input
#> 3: }
#> 4: newfun <- function(x)
#> ^>
t$status()
#> [1] "ERROR"
The error here has happened before getting to your code - it is happening when context loads the source files. The log makes this a bit clearer:
t$log()
#> [ hello ] 2021-08-17 14:53:14
#> [ wd ] Q:/didehpc/20210817-145020
#> [ init ] 2021-08-17 14:53:14.324
#> [ hostname ] FI--DIDECLUST26
#> [ process ] 3528
#> [ version ] 0.3.0
#> [ open:db ] rds
#> [ context ] dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ library ]
#> [ namespace ]
#> [ source ] mycode.R
#> Error in source(s, envir) : mycode.R:5:0: unexpected end of input
#> 3: }
#> 4: newfun <- function(x)
#> ^
#> Calls: <Anonymous> -> withCallingHandlers -> context_load -> source
#> Execution halted
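One way to catch this class of problem before submitting anything (a local sanity check, not something didehpc runs for you) is to make sure your source files parse:
# If this errors locally, context will not be able to load the file on the
# cluster either.
parse("mycode.R")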
PENDING
This is the most annoying one, and can happen for many reasons. You can see via the web interface or the Microsoft cluster tools that your job has failed, but didehpc is reporting it as pending. This happens when something has failed during the script that runs before any didehpc code runs on the cluster.
Things that have triggered this situation in the past include network shares that the cluster cannot access; there are doubtless others. Here, I'll simulate one so you can see how to troubleshoot it: I'm going to deliberately misconfigure the network share that this is running on, so that the cluster will not be able to map it and the job will fail to start.
home <- didehpc::path_mapping("home", getwd(),
"//fi--wronghost/path", "Q:")
The host fi--wronghost does not exist, so things will likely fail on startup.
config <- didehpc::didehpc_config(home = home)
ctx <- context::context_save("contexts")
#> [ open:db ] rds
obj <- didehpc::queue_didehpc(ctx, config)
#> Loading context 81b64478fe2182e45e83fec2156c6ec7
#> [ context ] 81b64478fe2182e45e83fec2156c6ec7
#> [ library ]
#> [ namespace ]
#> [ source ]
Submit a job:
t <- obj$enqueue(sessionInfo())
And wait…
t$wait(10)
#> (-) waiting for 36dece3...711, giving up in 9.5 s (\) waiting for 36dece3...711,
#> giving up in 9.0 s (|) waiting for 36dece3...711, giving up in 8.5 s (/) waiting
#> for 36dece3...711, giving up in 8.0 s (-) waiting for 36dece3...711, giving
#> up in 7.5 s (\) waiting for 36dece3...711, giving up in 7.0 s (|) waiting for
#> 36dece3...711, giving up in 6.4 s (/) waiting for 36dece3...711, giving up
#> in 5.9 s (-) waiting for 36dece3...711, giving up in 5.4 s (\) waiting for
#> 36dece3...711, giving up in 4.9 s (|) waiting for 36dece3...711, giving up
#> in 4.4 s (/) waiting for 36dece3...711, giving up in 3.9 s (-) waiting for
#> 36dece3...711, giving up in 3.4 s (\) waiting for 36dece3...711, giving up
#> in 2.9 s (|) waiting for 36dece3...711, giving up in 2.4 s (/) waiting for
#> 36dece3...711, giving up in 1.9 s (-) waiting for 36dece3...711, giving up
#> in 1.4 s (\) waiting for 36dece3...711, giving up in 0.9 s (|) waiting for
#> 36dece3...711, giving up in 0.3 s (/) waiting for 36dece3...711, giving up in
#> 0.0 s
#> Error in task_wait(self$root$db, self$id, timeout, time_poll, progress): task not returned in time
It's never going to succeed, and yet its status will stay as PENDING:
t$status()
#> [1] "PENDING"
To get the log from the DIDE cluster you can run:
obj$dide_log(t)
#> [1] "Task failed during execution with exit code . Please check task's output for error details."
#> [2] "Output : The network path was not found."
which here indicates that the network path was not found (because it was wrong!).
You can also update any incorrect statuses by running:
obj$reconcile()
#> Fetching job status from the cluster...
#> ...done
#> manually erroring task 36dece32d6c38f71464ed1fd9a9be711
#> Tasks have failed while context booting:
#> - 36dece32d6c38f71464ed1fd9a9be711
This will print information about anything that was adjusted.
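After reconciliation the task status should match what the cluster reports, so (to sketch the check) asking for the status of the stuck task again should now return an error state rather than PENDING:
# The stale PENDING status should now be reported as an error
t$status()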
If your job works on your own computer but not on the cluster, then something is different between how the cluster sees the world and how your computer sees it; an absolute path such as C:, for instance, only exists on your machine.
If you suspect memory problems, keep a monitor such as top (linux) running and watch to see what the memory usage is. If the job is single-core, consider the total memory used if you run 8 or 16 instances on the same cluster machine. If the total memory exceeds the available, then behaviour will be undefined, and some jobs will likely fail.
If you need help, you can ask in the "Cluster" teams channel or try your luck emailing Rich and Wes (they may or may not have time to respond, or may be on leave).
When asking for help it is really important that you make it as easy as possible for us to help you. This is surprisingly hard to do well, and we would ask that you first take a look at these two short articles:
Things we will need to know include your configuration (obj$config, if you have managed to create an object).
Too often, we get requests from people where we have no information about what was run, what packages or versions are being installed, etc. This means your message sits there until we see it; then we ask for clarification; that request sits there until you see it; you respond with a little more information; and it may be days until we finally discover the root cause of your problem, by which point we're both quite fed up. We will never complain if you provide "too much" information in a good effort to outline where your problem is.
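As a sketch of the sort of information worth gathering before asking (the names t and obj here are just the objects used in this article, and t$id is assumed to be the task identifier):
obj$config    # your didehpc configuration
t$id          # the id of the failing task
t$status()    # what didehpc currently reports for it
t$log()       # the task log, to paste into your message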
Don’t say
Hi, I was running a cluster job, but it seems like it failed. I’m sure it worked the other day though! Do you know what the problem is?
Do say
Since yesterday, my cluster job has stopped working.
My dide username is alicebobson and my dide config is:
<didehpc_config>
 - cluster: fi--dideclusthn
 - username: rfitzjoh
(etc)
I am working in the myproject directory of the malaria share (\\projects\malaria). I have set up my cluster job with:
# include short script here if you can!
The job 43333cbd79ccbf9ede79556b592473c8 is one that failed with an error, and the log says:
# contents of t$log() here
With this sort of information the problem may just jump out at us, or we may be able to recreate the error ourselves - either way we may be able to work on the problem and get back to you with a solution rather than a request for more information.