What does "Slurm error: QOSMinNode " mean in Slurm? - slurm

I got the following error log,
sbatch: error: QOSMinNode
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
What does it mean? I cannot find any "QOSMinNode" error in the Slurm manual.

QOSMinNode is probably the name of a QOS defined on the cluster. You can verify this with:
sacctmgr list qos
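To see which limits that QOS enforces and which QOS values your account is allowed to use, something like the following should work (a sketch; the exact format field names can vary between Slurm versions):
# per-QOS limits (wall time, TRES per user, submit limits, ...)
sacctmgr show qos format=Name,MaxWall,MaxTRESPU,MaxSubmitPU
# QOS values your association may use
sacctmgr show assoc where user=$USER format=User,Account,Partition,QOS,DefaultQOS
Comparing the job's requested resources against those limits usually reveals which one the submission violated.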

Related

What kind of error can I get when I have an executor request whose memory or CPU cannot fit in a node

What kind of error can I get when I have an executor request whose memory or CPU cannot fit in a node?
Do I get the error from the driver, e.g. that there are no resources available, or from the scheduler, stating that the request cannot fit in the cluster?
I think I have found the answer myself.
If you request an executor whose memory/CPU cannot fit in any of the nodes, i.e. the requested memory/CPU is greater than what any single node has, the following errors can be found in the scheduler log.
If the requested CPU core count is greater than the total CPU core count of the node, you will get an error similar to the one below. Pay attention to the keyword Insufficient cpu in the log:
I1115 18:02:02.371606 10 factory.go:339] "Unable to schedule pod; no fit; waiting" pod="dxxxs-xxxxx/xxxpycomputediffconnections3np-uoiqajbz-exec-1" err="0/433 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1668535262}, that the pod didn't tolerate, 431 node(s) didn't match Pod's node affinity/selector."
If the requested memory is greater than the total memory of the node, you will get an error similar to the one below. Pay attention to the keyword Insufficient memory in the log:
I1115 18:02:02.371606 10 factory.go:339] "Unable to schedule pod; no fit; waiting" pod="dxxxs-xxxxx/xxxpycomputediffconnections3np-uoiqajbz-exec-1" err="0/433 nodes are available: 1 Insufficient memory, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1668535262}, that the pod didn't tolerate, 431 node(s) didn't match Pod's node affinity/selector."
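If you do not have access to the scheduler log, the same "no fit" message is normally also surfaced as a FailedScheduling event on the pending executor pod; a sketch (namespace and pod name are placeholders):
# the Events section at the bottom of the describe output carries the message
kubectl -n <namespace> describe pod <executor-pod>
# or list the scheduling failures directly
kubectl -n <namespace> get events --field-selector reason=FailedScheduling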

YARN + SPARK + Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

I have a small cluster of 6 datanode machines and am experiencing total failures when running Spark jobs.
The failure:
ERROR [SparkListenerBus][driver][] [org.apache.spark.scheduler.LiveListenerBus] Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[42.3.44.157:50010,DS-87cdbf42-3995-4313-8fab-2bf6877695f6,DISK], DatanodeInfoWithStorage[42.3.44.154:50010,DS-60eb1276-11cc-4cb8-a844-f7f722de0e15,DISK]], original=[DatanodeInfoWithStorage[42.3.44.157:50010,DS-87cdbf42-3995-4313-8fab-2bf6877695f6,DISK], DatanodeInfoWithStorage[42.3.44.154:50010,DS-60eb1276-11cc-4cb8-a844-f7f722de0e15,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)
---T08:18:07.007 ERROR [SparkListenerBus][driver][] [STATISTICS] [onQueryTerminated] queryId:
I found the following workaround, setting these values in the HDFS configuration:
dfs.client.block.write.replace-datanode-on-failure.enable=true
dfs.client.block.write.replace-datanode-on-failure.policy=NEVER
The two properties dfs.client.block.write.replace-datanode-on-failure.policy and dfs.client.block.write.replace-datanode-on-failure.enable influence the client-side behavior of pipeline recovery, and they can be added as custom properties in the "hdfs-site" configuration.
Would setting those parameter values be a good solution?
dfs.client.block.write.replace-datanode-on-failure.enable (default: true)
If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.
dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT)
This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: let r be the replication number and n be the number of existing datanodes; add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n, or (2) r is greater than n and the block is hflushed/appended.
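Whether the workaround is appropriate depends mostly on how the replication factor compares with the number of datanodes that are actually alive; a quick check from the command line (a sketch, assuming the standard hdfs CLI is available on the cluster):
hdfs getconf -confKey dfs.replication            # the replication factor r
hdfs dfsadmin -report | grep 'Live datanodes'    # the number of healthy datanodes n
With the DEFAULT policy a replacement datanode is only sought when r is at least 3, so on a 6-node cluster a few unhealthy datanodes can leave no candidates and pipeline recovery fails with the error above.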

slurm: How can I prevent a job's information from being removed?

Using sacct I want to obtain information about my completed jobs.
This answer mentions how we can obtain a job's information.
I have submitted a job named jobName.sh which has jobID 176. After 12 hours and 200 new jobs, I want to check my job's (jobID=176) information and I get slurm_load_jobs error: Invalid job id specified.
scontrol show job 176
slurm_load_jobs error: Invalid job id specified
And the following command returns nothing: sacct --name jobName.sh
I assume there is a time limit on keeping a previously submitted job's information and that the old job records have somehow been removed. Is there such a limit? How could I set that limit to a very large value in order to prevent the records from being deleted?
Please note that JobRequeue=0 is set in slurm.conf.
Assuming that you are using MySQL to store that data, you can tune, among other things, the purge time in your database configuration file slurmdbd.conf. Here are some examples:
PurgeJobAfter=12hours
PurgeJobAfter=1month
PurgeJobAfter=24months
If not set (default), then job records are never purged.
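For context, the purge options sit next to the storage settings in slurmdbd.conf; a sketch with illustrative values:
# slurmdbd.conf fragment (values are examples, not recommendations)
StorageType=accounting_storage/mysql
PurgeJobAfter=24months
PurgeStepAfter=12months
Restart slurmdbd afterwards (e.g. systemctl restart slurmdbd) so the new purge settings take effect.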
The Slurm documentation mentions that:
MinJobAge: The minimum age of a completed job before its record is purged from Slurm's active database. Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources. The default value is 300 seconds. A value of zero prevents any job record purging. In order to eliminate some possible race conditions, the minimum non-zero value for MinJobAge recommended is 2.
In my slurm.conf file, MinJobAge was 300, which is 5 minutes. That's why each completed job's information was removed after 5 minutes. I increased the value of MinJobAge to prevent the records from being deleted so quickly.
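If the goal is only to keep scontrol show job working for longer, the change is a single line in slurm.conf plus a reconfigure; a sketch with an illustrative value (completed jobs retained this way stay in slurmctld's memory, so see also MaxJobCount):
# in slurm.conf: keep completed job records for one day (illustrative value)
MinJobAge=86400
# apply the change
scontrol reconfigure
# records purged from slurmctld's memory remain queryable through the accounting database, if one is configured
sacct -j 176 --format=JobID,JobName,State,Elapsed,ExitCode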

Access reason why slurm stopped a job

Is there a way to find out why a job was canceled by slurm? I would like to distinguish the cases where a resource limit was hit from all other reasons (like a manual cancellation). In case a resource limit was hit, I also would like to know which one.
The slurm log file contains that information explicitly. It is also written to the job's output file with something like:
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT
or
Job <jobid> exceeded <mem> memory limit, being killed:
or
JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE
etc.
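If the job has already finished, the accounting record usually makes the same distinction; a sketch, assuming accounting is enabled (the job ID is a placeholder, and the OUT_OF_MEMORY state requires a reasonably recent Slurm version):
# TIMEOUT, OUT_OF_MEMORY, NODE_FAIL and CANCELLED appear as distinct job states
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
# the cancellation message is also appended to the job's default output file
grep -i 'CANCELLED AT' slurm-<jobid>.out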

Slurm sinfo format

When I use "sinfo" in slurm, I see an asterik near one of the partition (like: RUNNING-CLUSTER*).
The partition look well and all nodes under it are idle.
When I run a simple script with "sleep 300" for example, I can see the jobs in the queue (using "squeue") but they run for a few seconds and end. No error message (I can see in the log that they failed. No more info there).
Any idea what the asterisk is for?
Couldn't find it in the manual.
Thanks.
The "*" following the partition name indicates that this is the default partition for submitted jobs. LLNL provides documentation that directly supports my findings:
LLNL Documentation
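If you want to see the flag explicitly, both sinfo's format string and scontrol expose it; a small sketch (the partition name is taken from the question):
# %P prints the partition name with a trailing '*' on the default partition
sinfo -o '%P %a %l %D %t'
# scontrol reports the default partition with Default=YES
scontrol show partition RUNNING-CLUSTER | grep -i default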
