Torque BankFailure (cannot debit job account) - pbs

I am new to the Torque scheduler and I am trying to understand this checkjob result: BankFailure (cannot debit job account).
It concerns a job marked "Q" that appears to be stuck.
When I type checkjob [job_id] I get this message:
State: Idle EState: Deferred
Creds: user:xxx group:xxx class:batch qos:DEFAULT
WallTime: 00:00:00 of 12:00:00
SubmitTime: Wed Jun 1 13:37:41
(Time Queued Total: 2:49:31 Eligible: 00:00:00)
StartDate: -2:49:29 Wed Jun 1 13:37:43
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [xxxxx]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 4
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: BankFailure (cannot debit job account)
Holds: Defer (hold reason: BankFailure)
PE: 1.00 StartPriority: 40
cannot select job xxxx for partition DEFAULT (job hold active)
According to the official Torque documentation, BankFailure (cannot debit job account) means: "If instead, you see the following as part of the checkjob output, it means that the job you are trying to run will exceed the allocation you have remaining. This may simply be because you did not specify a walltime as part of your job specification."
But the walltime value is set, and there are enough cores to run this job.
Does it mean that the walltime is not enough to run this job? Or does it mean that the user's accumulated compute time exceeds their quota?
Thank you for your help :)

I'm pretty sure this isn't a Torque problem. "BankFailure" isn't in the Torque source, and it's very likely qrun on this job would succeed. I would focus on examining your accounting manager configuration (e.g., make sure funds have been allocated for the appropriate credential).
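As a quick sanity check you can probe both sides separately. A rough sketch, assuming a Moab + Gold/Moab Accounting Manager setup (the usual pairing when checkjob reports BankFailure); exact command names vary by site and version:
qrun [job_id]          # force-start the job through Torque directly; if this works, Torque itself is fine
gbalance -u [user]     # remaining allocation for the user when Gold is the allocation manager
mam-balance -u [user]  # the equivalent query when Moab Accounting Manager is used
If the balance is zero or the user/account has no active fund, that is typically what Moab reports as BankFailure.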

Related

How to set total resource limit (core hours) per User in SLURM

Is there a way to set a limit on the total (not simultaneous) resources used (core hours/minutes) for a specific user or account in SLURM?
My total consumed resources are, for example, 109 seconds of thread usage. I want to cap that total just for my user, regardless of the sizes of the submitted jobs, until the limit is reached.
[root@cluster slurm]# sreport job SizesByAccount User=HPCUser -t Seconds start=2022-01-01 end=2022-07-01 Grouping=99999999 Partition=cpu
--------------------------------------------------------------------------------
Job Sizes 2022-01-01T00:00:00 - 2022-06-16T14:59:59 (14392800 secs)
Time reported in Seconds
--------------------------------------------------------------------------------
Cluster   Account  0-99999998 CPUs  >= 99999999 CPUs  % of cluster
--------- -------- ---------------- ----------------- ------------
cluster   root                  109                 0      100.00%
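One possible approach, sketched under the assumption that slurmdbd accounting is already in place (the 1000 below is illustrative): the association limit GrpTRESMins caps total accumulated TRES-minutes rather than simultaneous usage, but it is only enforced as a hard quota when usage decay is disabled:
# slurm.conf
AccountingStorageEnforce=limits
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
# cap user HPCUser at 1000 CPU-minutes in total (the limit is in minutes, not seconds)
sacctmgr modify user where name=HPCUser set GrpTRESMins=cpu=1000
Once the accumulated CPU-minutes reach the cap, new jobs from that association should stay pending with an association-limit reason in squeue.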

Slurm requeued jobs are given BeginTime that makes newer jobs skip queue order if higher priority job is quickly cancelled

I'm having an unexpected problem with Slurm when it comes to jobs that are requeued due to preemption.
Job ID 5 is queued with priority 1 and PreemptMode=requeue
Job ID 66 is queued with priority 1 and PreemptMode=requeue
Job ID 777 is queued with priority 3
Job ID 777 is cancelled shortly after being queued. Job ID 66 starts.
What appears to be happening is that when Job ID 5 is requeued, it is given a BeginTime roughly two minutes in the future. If Job ID 777 is cancelled before that BeginTime is reached, Job ID 66 will start because Job ID 5 is not yet eligible (it is queued for a later time).
How can I set up Slurm to get around this problem? I want it to respect queue order in situations like this and always start Job ID 5 before Job ID 66.
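I am not aware of a configuration knob for that delay, but as a manual workaround (a sketch only, not a scheduler-side fix) you can clear the deferred begin time of the requeued job so it becomes eligible immediately and keeps its place ahead of Job ID 66:
scontrol update JobId=5 StartTime=now
Here StartTime is scontrol's name for the job's earliest begin time, the same value squeue shows as BeginTime.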

slurm: How can I prevent job's information to be removed?

Using sacct I want to obtain information about my completed jobs.
This answer mentions how we can obtain a job's information.
I have submitted a job named jobName.sh which has jobID 176. After 12 hours and 200 new jobs coming in, I want to check my job's (jobID=176) information, but I obtain slurm_load_jobs error: Invalid job id specified.
scontrol show job 176
slurm_load_jobs error: Invalid job id specified
And the following command returns nothing: sacct --name jobName.sh
I assume there is a time limit for keeping previously submitted jobs' information and that older jobs' records have somehow been removed. Is there such a limit? How can I set that limit to a very large value in order to prevent the records from being deleted?
Please note that JobRequeue=0 is set in slurm.conf.
Assuming that you are using MySQL to store that data, you can tune, among other things, the purge times in your database daemon configuration file, slurmdbd.conf. Here are some examples:
PurgeJobAfter=12hours
PurgeJobAfter=1month
PurgeJobAfter=24months
If not set (default), then job records are never purged.
More info.
The Slurm documentation mentions that:
MinJobAge The minimum age of a completed job before its record is
purged from Slurm's active database. Set the values of MaxJobCount and
MinJobAge to ensure the slurmctld daemon does not exhaust its memory or
other resources. The default value is 300 seconds. A value of zero
prevents any job record purging. In order to eliminate some possible
race conditions, the minimum non-zero value for MinJobAge recommended is 2.
In my slurm.conf file, MinJobAge was 300, which is 5 minutes. That's why each completed job's information was removed after 5 minutes. I increased MinJobAge's value in order to prevent the records from being purged so quickly.
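For reference, a minimal sketch of that change (the values are illustrative):
# slurm.conf
MinJobAge=86400     # keep completed job records in slurmctld's memory for 24 hours
MaxJobCount=20000   # raise this together with MinJobAge so slurmctld does not exhaust memory
# apply it
scontrol reconfigure    # or restart slurmctld
Note that MinJobAge only affects what scontrol show job can still see; sacct reads from slurmdbd, where the PurgeJobAfter settings above apply.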

How to set the limit of cpu per user using maxTRESperuser on qos for slurm

I just set the QOS parameter MaxTRESPerUser to cpu=1 for testing purposes, but Slurm is still scheduling the jobs.
I used:
sacctmgr modify qos normal set maxtresperuser=cpu=1
and we can view it with:
sacctmgr show qos
Name    Priority GraceTime PreemptMode UsageFactor MaxTRESPU
normal         0  00:00:00     cluster    1.000000     cpu=1
(all other columns are empty and omitted here)
but all the jobs submitted by the same user were allocated, each job using 2 CPUs:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
370 teste script.s root R 0:11 1 slurmcomputenode2.novalocal
371 teste script.s root R 0:11 1 slurmcomputenode2.novalocal
372 teste teste.sh root R 0:07 1 slurmcomputenode1.novalocal
The Slurm documentation doesn't say anything else about it.
Do I need to change something on slurm.conf file?
Thanks
Make sure AccountingStorageEnforce is set to limits,qos. You also need proper accounting for limits to be enforced. See the documentation.
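A minimal sketch of what that looks like, assuming slurmdbd-based accounting is already running (restart slurmctld after editing):
# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=limits,qos
Also check that the QOS carrying the limit is actually attached to the association the jobs run under, e.g.:
sacctmgr modify user where name=root set qos=normal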

How Many Hive Dynamic Partitions are Needed?

I am running a large job that consolidates about 55 streams (tags) of samples (one sample per record) at irregular times over two years into 15-minute averages. There are about 1.1 billion records in 23k streams in the raw dataset, and these 55 streams make up about 33 million of those records.
I calculated a 15-minute index and am grouping by that to get the average value; however, I seem to have exceeded the maximum number of dynamic partitions on my Hive job in spite of cranking it way up to 20k. I can increase it further, I suppose, but it already takes a while to fail (about 6 hours, although I reduced that to 2 by reducing the number of streams to consider), and I don't actually know how to calculate how many I really need.
Here is the code:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions=50000;
SET hive.exec.max.dynamic.partitions.pernode=20000;
DROP TABLE IF EXISTS sensor_part_qhr;
CREATE TABLE sensor_part_qhr (
tag STRING,
tag0 STRING,
tag1 STRING,
tagn_1 STRING,
tagn STRING,
timestamp STRING,
unixtime INT,
qqFr2013 INT,
quality INT,
count INT,
stdev double,
value double
)
PARTITIONED BY (bld STRING);
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
min(tag),
min(tag0),
min(tag1),
min(tagn_1),
min(tagn),
min(timestamp),
min(unixtime),
qqFr2013,
min(quality),
count(value),
stddev_samp(value),
avg(value)
FROM sensor_part_subset
WHERE tag1='Energy'
GROUP BY tag,qqFr2013;
And here is the error message:
Error during job, obtaining debugging information...
Examining task ID: task_1442824943639_0044_m_000008 (and more) from job job_1442824943639_0044
Examining task ID: task_1442824943639_0044_r_000000 (and more) from job job_1442824943639_0044
Task with the most failures(4):
-----
Task ID:
task_1442824943639_0044_r_000000
URL:
http://headnodehost:9014/taskdetails.jsp?jobid=job_1442824943639_0044&tipid=task_1442824943639_0044_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:283)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException:
[Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions.
The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode.
Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:747)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
at org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:498)
at org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:521)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:232)
... 7 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 520 Reduce: 140 Cumulative CPU: 7409.394 sec HDFS Read: 0 HDFS Write: 393345977 SUCCESS
Job 1: Map: 9 Reduce: 1 Cumulative CPU: 87.201 sec HDFS Read: 393359417 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 days 2 hours 4 minutes 56 seconds 595 msec
Can anyone give me some ideas as to how to calculate how many dynamic partitions I might need for a job like this?
Or maybe I should be doing this differently? I am running Hive 0.13, by the way, on Azure HDInsight.
Update:
Corrected some of the numbers above.
Reduced it to 3 streams operating on 211k records and it finally succeeded.
Started experimenting, reduced the partitions per node to 5k, and then 1k, and it still succeeded.
So I am not blocked anymore, but I am thinking I would have needed millions of partitions to do the whole dataset in one go (which is what I really wanted to do).
Dynamic partition columns must be specified last among the columns in the SELECT statement when inserting into sensor_part_qhr.
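In the INSERT above, the partition column bld never appears in the SELECT list, and the SELECT lists both tag and min(tag), which shifts every subsequent expression one column to the right; Hive therefore maps the last expression, avg(value), onto bld positionally, creating one dynamic partition per distinct average, which would explain the explosion. A corrected sketch, assuming bld is (or can be derived from) a column of sensor_part_subset:
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
min(tag0),
min(tag1),
min(tagn_1),
min(tagn),
min(timestamp),
min(unixtime),
qqFr2013,
min(quality),
count(value),
stddev_samp(value),
avg(value),
min(bld)  -- dynamic partition column, listed last
FROM sensor_part_subset
WHERE tag1='Energy'
GROUP BY tag,qqFr2013;
The number of dynamic partitions needed then equals the number of distinct bld values in the selected data, rather than the number of distinct averages.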
