Is there a way to set a limit on the total (not simultaneous) resources used (core-hours/minutes) by a specific user or account in Slurm?
For example, my total usage so far is 109 CPU-seconds. I want to cap that total for my user alone, regardless of the sizes of the submitted jobs, until the limit is reached.
[root@cluster slurm]# sreport job SizesByAccount User=HPCUser -t Seconds start=2022-01-01 end=2022-07-01 Grouping=99999999 Partition=cpu
--------------------------------------------------------------------------------
Job Sizes 2022-01-01T00:00:00 - 2022-06-16T14:59:59 (14392800 secs)
Time reported in Seconds
--------------------------------------------------------------------------------
Cluster Account 0-99999998 CPUs >= 99999999 CPUs % of cluster
--------- --------- ------------- ------------- ------------
cluster root 109 0 100.00%
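For a cap on accumulated (not concurrent) usage, Slurm's association limits expressed in TRES-minutes are the intended mechanism. A minimal sketch, assuming accounting via slurmdbd is already set up and the user name matches the one above; the 10000 value is just an example:

```shell
# Cap the user's total accumulated usage at e.g. 10000 CPU-minutes.
# GrpTRESMins counts usage over time, unlike GrpTRES (concurrent).
sacctmgr modify user HPCUser set GrpTRESMins=cpu=10000

# Enforcement also requires, in slurm.conf:
#   AccountingStorageEnforce=limits
```

Note that if PriorityDecayHalfLife is non-zero, the usage counted against GrpTRESMins decays over time; a QoS with the NoDecay flag can be used if you want a hard, non-decaying total.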
I'm having an unexpected problem with Slurm when it comes to jobs that are requeued due to preemption.
Job ID 5 is queued with priority 1 and PreemptMode=requeue
Job ID 66 is queued with priority 1 and PreemptMode=requeue
Job ID 777 is queued with priority 3
Job ID 777 is cancelled within a short time of being queued, and Job ID 66 starts instead of Job ID 5.
What appears to be happening is that when Job ID 5 is requeued, it is given a BeginTime roughly two minutes in the future. If Job ID 777 is cancelled while that BeginTime is still in the future, Job ID 66 starts because Job ID 5 is not yet eligible to run.
How can I set up Slurm to get around this problem? I want it to respect queue order in situations like this and always start Job ID 5 before Job ID 66.
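I do not know of a configuration option that removes the requeue hold entirely, but as a manual workaround you can clear the assigned BeginTime so the requeued job becomes eligible immediately. A sketch, using the job ID from the example above:

```shell
# Make the requeued job eligible now instead of at the
# BeginTime Slurm assigned when it was requeued.
scontrol update JobId=5 StartTime=now
```

This is a per-job workaround rather than a policy fix, so it would need to be scripted if requeues are frequent.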
Using sacct, I want to obtain information about my completed jobs.
This answer mentions how we can obtain a job's information.
I submitted a job named jobName.sh with job ID 176. Twelve hours and about 200 new jobs later, I wanted to check my job's (jobID=176) information and got slurm_load_jobs error: Invalid job id specified.
scontrol show job 176
slurm_load_jobs error: Invalid job id specified
And the following line returns nothing: sacct --name jobName.sh
I assume there is a time limit for keeping previously submitted jobs' information, and that job 176's record has somehow been removed. Is there such a limit? How can I make that limit very large in order to prevent records from being deleted?
Please note that JobRequeue=0 is set in slurm.conf.
Assuming that you are using MySQL to store the accounting data, you can tune, among other things, the purge times in the database configuration file slurmdbd.conf. Here are some examples:
PurgeJobAfter=12hours
PurgeJobAfter=1month
PurgeJobAfter=24months
If not set (default), then job records are never purged.
More info.
On Slurm documentation mentioned that:
MinJobAge The minimum age of a completed job before its record is
purged from Slurm's active database. Set the values of MaxJobCount and
MinJobAge to ensure the slurmctld daemon does not exhaust its memory or
other resources. The default value is 300 seconds. A value of zero
prevents any job record purging. In order to eliminate some possible
race conditions, the minimum non-zero value for MinJobAge recommended
is 2.
In my slurm.conf file, MinJobAge was 300, which is 5 minutes. That is why each completed job's information was removed after 5 minutes. I increased MinJobAge's value to prevent the records from being deleted.
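Note that MinJobAge only controls how long slurmctld keeps finished jobs in memory for scontrol to show; if accounting storage is configured, sacct reads the slurmdbd database and can still report the job long after slurmctld has purged it. A sketch, using the job from the question:

```shell
# scontrol only sees jobs still in slurmctld's memory;
# sacct queries the accounting database instead.
sacct -j 176 --format=JobID,JobName,State,Elapsed,ExitCode

# sacct's default start time is 00:00 of the current day, so pass
# --starttime explicitly when searching for older jobs by name:
sacct --name=jobName.sh --starttime=2022-01-01
```

The default start time is also the likely reason "sacct --name jobName.sh" returned nothing for a job submitted 12 hours earlier.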
I just set the QoS parameter MaxTRESPerUser to cpu=1 for testing purposes, but Slurm is still scheduling the jobs.
I used:
sacctmgr modify qos normal set maxtresperuser=cpu=1
and we can view it with
sacctmgr show qos
      Name   Priority  GraceTime  PreemptMode  UsageFactor  MaxTRESPU
    normal          0   00:00:00      cluster     1.000000      cpu=1
(empty columns omitted)
but all the jobs submitted by the same user were allocated resources, each job using 2 CPUs:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
370 teste script.s root R 0:11 1 slurmcomputenode2.novalocal
371 teste script.s root R 0:11 1 slurmcomputenode2.novalocal
372 teste teste.sh root R 0:07 1 slurmcomputenode1.novalocal
The Slurm documentation doesn't say anything else about it.
Do I need to change something in the slurm.conf file?
Thanks
Make sure AccountingStorageEnforce is set to limits,qos in slurm.conf. You also need accounting properly configured for limits to be enforced. See the documentation.
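A minimal sketch of the relevant slurm.conf lines, assuming slurmdbd is already running (the host name is a placeholder):

```
# slurm.conf -- QoS limits are ignored unless enforcement is enabled
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost          # placeholder host name
AccountingStorageEnforce=limits,qos
```

After changing these, restart slurmctld so the new enforcement settings take effect.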
I am running a large job that consolidates about 55 streams (tags) of samples (one sample per record), taken at irregular times over two years, into 15-minute averages. There are about 1.1 billion records across 23k streams in the raw dataset, and these 55 streams make up about 33 million of those records.
I calculated a 15-minute index and am grouping by it to get the average value. However, I seem to have exceeded the maximum number of dynamic partitions on my Hive job, in spite of cranking the limit way up to 20k. I could increase it further, I suppose, but it already takes a while to fail (about 6 hours, although I reduced that to 2 by reducing the number of streams to consider), and I don't actually know how to calculate how many partitions I really need.
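One way to estimate the limit actually needed before running the insert: dynamic partitioning creates one partition per distinct value of the partition column, so counting those values gives the required number. A sketch, assuming bld is the column the insert is meant to partition by:

```sql
-- Each distinct partition-column value becomes one dynamic partition,
-- so this count is the number of partitions the INSERT will create.
SELECT COUNT(DISTINCT bld)
FROM sensor_part_subset
WHERE tag1 = 'Energy';
```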
Here is the code:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions=50000;
SET hive.exec.max.dynamic.partitions.pernode=20000;
DROP TABLE IF EXISTS sensor_part_qhr;
CREATE TABLE sensor_part_qhr (
tag STRING,
tag0 STRING,
tag1 STRING,
tagn_1 STRING,
tagn STRING,
timestamp STRING,
unixtime INT,
qqFr2013 INT,
quality INT,
count INT,
stdev double,
value double
)
PARTITIONED BY (bld STRING);
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
min(tag),
min(tag0),
min(tag1),
min(tagn_1),
min(tagn),
min(timestamp),
min(unixtime),
qqFr2013,
min(quality),
count(value),
stddev_samp(value),
avg(value)
FROM sensor_part_subset
WHERE tag1='Energy'
GROUP BY tag,qqFr2013;
And here is the error message:
Error during job, obtaining debugging information...
Examining task ID: task_1442824943639_0044_m_000008 (and more) from job job_1442824943639_0044
Examining task ID: task_1442824943639_0044_r_000000 (and more) from job job_1442824943639_0044
Task with the most failures(4):
-----
Task ID:
task_1442824943639_0044_r_000000
URL:
http://headnodehost:9014/taskdetails.jsp?jobid=job_1442824943639_0044&tipid=task_1442824943639_0044_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:283)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException:
[Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions.
The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode.
Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:747)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
at org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:498)
at org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:521)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:232)
... 7 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 520 Reduce: 140 Cumulative CPU: 7409.394 sec HDFS Read: 0 HDFS Write: 393345977 SUCCESS
Job 1: Map: 9 Reduce: 1 Cumulative CPU: 87.201 sec HDFS Read: 393359417 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 days 2 hours 4 minutes 56 seconds 595 msec
Can anyone give me some ideas as to how to calculate how many of these dynamic partitions I might need for a job like this?
Or maybe I should be doing this differently? I am running Hive 0.13, by the way, on Azure HDInsight.
Update:
Corrected some of the numbers above.
Reduced it to 3 streams operating on 211k records and it finally succeeded.
Started experimenting, reduced the partitions per node to 5k, and then 1k, and it still succeeded.
So I am not blocked anymore, but I am thinking I would have needed millions of partitions to do the whole dataset in one go (which is what I really wanted to do).
Dynamic partition columns must be specified last among the columns in the SELECT statement when inserting into sensor_part_qhr. Your SELECT never includes bld, so Hive takes the last expression, avg(value), as the partition value; that creates one partition per distinct average, which is why the partition count explodes.
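A hedged sketch of the corrected INSERT, assuming sensor_part_subset has a bld column: the select list is lined up with the table's twelve columns in order (dropping the stray duplicate tag expression, which shifted every column by one), and bld is moved to the last position so Hive uses it as the dynamic partition value:

```sql
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
       min(tag0),
       min(tag1),
       min(tagn_1),
       min(tagn),
       min(timestamp),
       min(unixtime),
       qqFr2013,
       min(quality),
       count(value),
       stddev_samp(value),
       avg(value),
       bld                  -- dynamic partition column must come last
FROM sensor_part_subset
WHERE tag1 = 'Energy'
GROUP BY tag, qqFr2013, bld;
```

Adding bld to the GROUP BY is safe if bld is functionally dependent on tag; if it is not, that changes the grouping, and selecting min(bld) last instead would be the alternative.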