Is there a way to set a limit on total (not simultaneous) used resources (core hours/minutes) for a specific user or account in Slurm?
For example, my total consumed resources so far are 109 CPU-seconds. I want to cap that total just for my user, regardless of the size of the submitted jobs, until the limit is reached.
[root@cluster slurm]# sreport job SizesByAccount User=HPCUser -t Seconds start=2022-01-01 end=2022-07-01 Grouping=99999999 Partition=cpu
--------------------------------------------------------------------------------
Job Sizes 2022-01-01T00:00:00 - 2022-06-16T14:59:59 (14392800 secs)
Time reported in Seconds
--------------------------------------------------------------------------------
  Cluster   Account  0-99999998 CPUs  >= 99999999 CPUs  % of cluster
--------- ---------  ---------------  ----------------  ------------
  cluster      root              109                 0       100.00%
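For context, the kind of thing I imagine is sketched below, using sacctmgr association/QOS limits. This is untested: GrpTRESMins counts CPU-minutes (not seconds), the QOS name "capped" is just a placeholder, and I'm not sure how such a limit interacts with usage decay.

# Cap the user's association at 2 CPU-minutes (the ~109 s measured above, rounded up)
sacctmgr modify user HPCUser set GrpTRESMins=cpu=2

# Or put the cap on a dedicated QOS so that accumulated usage never decays
sacctmgr add qos capped set GrpTRESMins=cpu=2 Flags=NoDecay,DenyOnLimit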
I'm using dsbulk 1.6.0 to unload data from cassandra 3.11.3.
Each unload results in wildly different counts of rows. Here are results from 3 invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to, and data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra nodes in the cluster and a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession, so the number of newly added rows would be in the hundreds (if any), not in the tens of thousands.
Run 1:
 total | failed | rows/s |     p50ms |     p99ms |    p999ms
10,937 |      7 |     97 | 15,935.46 | 20,937.97 | 20,937.97
Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
 total | failed | rows/s |     p50ms |     p99ms |    p999ms
60,558 |      3 |    266 | 12,551.34 | 21,609.05 | 21,609.05
Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
 total | failed | rows/s |     p50ms |     p99ms |    p999ms
45,404 |      4 |    211 | 16,664.92 | 30,870.08 | 30,870.08
Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data. Run 2 may be closer to complete and Run 3 is missing significant data.
I'm invoking unload as follows:
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE > $DATA_FILE
I'm assuming this isn't expected behaviour for dsbulk. How do I configure it to reliably unload a complete table without errors?
Data could be missing from a host if that host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. And because DSBulk reads with consistency level LOCAL_ONE by default, different hosts will provide different views of the data (the host that you're providing is just a contact point; after that the cluster topology is discovered, and DSBulk will select replicas based on the load balancing policy).
You can force DSBulk to read the data at another consistency level by using the -cl command line option (doc). You can compare the results with LOCAL_QUORUM or ALL: at these levels Cassandra will also "fix" inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the repair writes.
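For example, re-running the unload from the question at a stronger consistency level would look roughly like this (same command, just with -cl added):

dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE -cl LOCAL_QUORUM > $DATA_FILE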
I have a simulation that consists of N steps, run sequentially. Each of these steps modifies a global state in memory, until the final step which is the result. It is possible, after a step has run, to write to disk the intermediate state that this step just computed, and to load such an intermediate state instead of starting from scratch. Writing and loading intermediate states has a non-negligible cost.
I want to run many variations of a simulation on a Slurm cluster. Each variation will change the parameter of some of the steps.
Example
Simulation steps
S1 --> S2 --> S3 --> S4
Variations
run1: S2.speed=2, S3.height=12
run2: S2.speed=2, S3.height=20
run3: S2.speed=2, S3.height=40
run4: S2.speed=5, S3.height=12
run5: S2.speed=5, S3.height=80
What I want to do is for the various runs to share common computations, by dumping the intermediate state of the shared steps. This will form a tree of step runs:
S1
├─ S2 (speed=2)
│  ├─ S3 (height=12)
│  │  └─ S4
│  ├─ S3 (height=20)
│  │  └─ S4
│  └─ S3 (height=40)
│     └─ S4
└─ S2 (speed=5)
   ├─ S3 (height=12)
   │  └─ S4
   └─ S3 (height=80)
      └─ S4
I know I can get the result of the 5 runs by running 5 processes:
run1: S1 --> S2 (speed=2) --> S3 (height=12) --> S4
run2: (dump of run1.S2) --> S3 (height=20) --> S4
run3: (dump of run1.S2) --> S3 (height=40) --> S4
run4: (dump of run1.S1) --> S2 (speed=5) --> S3 (height=12) --> S4
run5: (dump of run4.S2) --> S3 (height=80) --> S4
This reduces the computation from 20 steps using a naive approach, to 13 steps with 3 dumps and 4 loads.
Now, my question is: how do I model this with Slurm to make the best use of the scheduler?
One solution I can think of is that each run is responsible for submitting the jobs of the runs that depend on it, after the dump of the intermediate state. Run1 would submit run4 after dumping S1, then submit run2 and run3 after dumping S2, and run4 would submit run5 after dumping S2. With this solution, is there any point in declaring the dependency when submitting the job to Slurm?
Another solution I can see is to break the long chains of computation into multiple, dependent jobs. The list of jobs to submit and their dependencies would basically be the tree I drew above (except the S3/S4 pairs would be merged into the same job). This is 8 jobs to submit instead of 5, but I can submit them all at once from the beginning, with the right dependencies. However, I am not sure what the advantages of this approach would be. Will Slurm do a better job as a scheduler if it knows the full list of jobs and their dependencies right from the start? Are there advantages from a user point of view to having all the jobs submitted and linked with dependencies (e.g., to cancel all the jobs that depend on the root job)? I know I can submit many jobs at once with a job array, but I don't see a way to declare dependencies between jobs of the same array. Is this possible, or even advisable?
Finally, are there other approaches I did not think about?
Edit
The example I gave is of course simplified a lot. The real simulations will contain hundreds of steps, with about a thousand variations to try. The scalability of the chosen solution is important.
One solution I can think of is that each run is responsible for submitting the jobs of the runs that depend on it, after the dump of the intermediate state. With this solution, is there any point in declaring the dependency when submitting the job to Slurm?
This is an approach often followed with simple workflows that involve long-running jobs that must checkpoint and restart.
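As a sketch, the script for run1 could dump its state and then submit its dependents itself. Here sbatch is real, but the ./simulate command line, the script names and the dump file names are invented for illustration:

#!/bin/bash
#SBATCH --job-name=run1

./simulate --steps S1 --dump state_S1.bin          # compute S1 and dump it
sbatch run4.sh                                     # run4 starts from state_S1.bin

./simulate --load state_S1.bin --steps S2 --param speed=2 --dump state_S2_speed2.bin
sbatch run2.sh                                     # run2 and run3 start from state_S2_speed2.bin
sbatch run3.sh

./simulate --load state_S2_speed2.bin --steps S3,S4 --param height=12

In this scheme a --dependency flag is largely redundant, since each child job is only submitted once the state it needs is already on disk.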
Another solution I can see is to break the long chains of computation into multiple, dependent jobs. Will Slurm do a better job as a scheduler if it knows the full list of jobs and their dependencies right from the start?
No. Slurm simply ignores jobs that are not eligible to start because the jobs they depend on have not finished.
Are there advantages from a user point of view to having all the jobs submitted and linked with dependencies (e.g., to cancel all the jobs that depend on the root job)?
Yes, but that is only marginally useful.
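For completeness, submitting the whole tree upfront would look something like the sketch below, capturing job IDs with --parsable and chaining them with --dependency=afterok (the script names are placeholders):

jid_s1=$(sbatch --parsable s1.sh)
jid_s2a=$(sbatch --parsable --dependency=afterok:$jid_s1 s2_speed2.sh)
jid_s2b=$(sbatch --parsable --dependency=afterok:$jid_s1 s2_speed5.sh)
sbatch --dependency=afterok:$jid_s2a s3s4_speed2_height12.sh
sbatch --dependency=afterok:$jid_s2a s3s4_speed2_height20.sh
sbatch --dependency=afterok:$jid_s2a s3s4_speed2_height40.sh
sbatch --dependency=afterok:$jid_s2b s3s4_speed5_height12.sh
sbatch --dependency=afterok:$jid_s2b s3s4_speed5_height80.sh

Note that if the root job fails or is cancelled, the dependents stay pending with reason DependencyNeverSatisfied unless they were submitted with --kill-on-invalid-dep=yes.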
I know I can submit many jobs at once with a job array, but I don't see a way to declare dependencies between jobs of the same array. Is this possible, or even advisable?
No, you cannot set dependencies between jobs of the same array.
Finally, are there other approaches I did not think about?
You could use a workflow management system.
One of the simplest solutions is Makeflow. It uses files that look like classical Makefiles to describe the dependencies between jobs. Then you simply run something like makeflow -T slurm makefile.mf
Another option is Bosco. Bosco offers a few more possibilities and is good for personal use. It is easy to set up and can submit jobs to multiple clusters.
Finally, Fireworks is a very powerful solution. It requires a MongoDB instance and is more suited for lab-wide use, but it can implement very complex logic for job submission and resubmission based on the outputs of jobs, and can handle errors in a clever way. You can, for instance, implement a workflow where a job is submitted with a given value for a given parameter, have Fireworks monitor convergence based on the output file, and cancel and re-submit the job with another value if convergence is not satisfactory.
Another possible solution is to use a pipeline tool. In the field of bioinformatics, SnakeMake is becoming really popular. SnakeMake is inspired by GNU Make but written in Python, hence the name. With SnakeMake you specify which outputs you want, and SnakeMake deduces which rules it has to run to produce them. One of the nice things about SnakeMake is that it scales easily from personal laptops to bigger machines and even clusters (for instance, Slurm clusters). Your example would look something like this:
rule all:
    input:
        ['S4_speed_2_height_12.out',
         'S4_speed_2_height_20.out',
         'S4_speed_2_height_40.out',
         'S4_speed_5_height_12.out',
         'S4_speed_5_height_80.out']

rule S1:
    output:
        "S1.out"
    shell:
        "touch {output}"  # do your heavy computations here

rule S2:
    input:
        "S1.out"
    output:
        "S2_speed_{speed}.out"
    shell:
        "touch {output}"

rule S3:
    input:
        "S2_speed_{speed}.out"
    output:
        "S3_speed_{speed}_height_{height}.out"
    shell:
        "touch {output}"

rule S4:
    input:
        "S3_speed_{speed}_height_{height}.out"
    output:
        "S4_speed_{speed}_height_{height}.out"
    shell:
        "touch {output}"
We can then ask snakemake to make a pretty image of how it would perform these computations:
Snakemake automatically figures out which outputs can be shared between different rules.
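That image can be produced with something like the following, assuming Graphviz is installed:

snakemake --dag | dot -Tsvg > dag.svg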
Running this on your local machine is as simple as executing snakemake, and submitting the jobs to Slurm is just snakemake --cluster "sbatch". The example I gave is obviously an oversimplification, but SnakeMake is highly customizable (number of threads per rule, memory usage, etc.) and has the advantage that it is based on Python. It takes a bit of figuring out how everything works in SnakeMake, but I can definitely recommend it.
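For reference, a slightly fuller cluster invocation might look like this (the resource flags are purely illustrative; --jobs caps how many jobs run at once):

snakemake --jobs 50 --cluster "sbatch --ntasks=1 --cpus-per-task={threads} --time=01:00:00"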
I have roughly 100 GB of data that I'm trying to process. The data has the form:
| timestamp | social_profile_id | json_payload     |
|-----------|-------------------|------------------|
| 123       | 1                 | {"json":"stuff"} |
| 124       | 2                 | {"json":"stuff"} |
| 125       | 3                 | {"json":"stuff"} |
I'm trying to split this data frame into folders in S3 by social_profile_id. There are roughly 430,000 social_profile_ids.
I've loaded the data into a Dataset with no problem. However, when I write it out and try to partition it, it takes forever! Here's what I've tried:
messagesDS
  .write
  .partitionBy("socialProfileId")
  .mode(sparkSaveMode)
  .parquet(outputPath) // terminal action; outputPath stands in for the S3 destination
I don't really care how many files are in each folder at the end of the job. My theory is that each node can group by the social_profile_id, then write out to its respective folder without having to do a shuffle or communicate with other nodes. But this isn't happening as evidenced by the long job time. Ideally the end result would look a little something like this:
├── social_id_1 (only two partitions had id_1 data)
│   ├── partition1_data.parquet
│   └── partition3_data.parquet
├── social_id_2 (more partitions have this data in it)
│   ├── partition3_data.parquet
│   ├── partition4_data.parquet
│   └── etc.
├── social_id_3
│   ├── partition2_data.parquet
│   ├── partition4_data.parquet
│   └── etc.
└── etc.
I've tried increasing the compute resources a few times, both by increasing instance sizes and the number of instances. From the Spark UI I can see that the majority of the time is taken by the write operation. It seems that all of the executors are being used, but they take an absurdly long time to execute (like 3-5 hours to write ~150 MB). Any help would be appreciated! Sorry if I mixed up some of the Spark terminology.
I tried to benchmark 3 methods of grouping data: native JS (with underscore), group, and aggregate with $group.
I used this dataset (genus/position of trees in Paris; 237,168 rows, 35 MB).
This is my test script, and the results are a bit surprising!
┌─────────────┬───────────────┐
│ Method │ avg time (ms) │
├─────────────┼───────────────┤
│ Pure js │ 897 │
├─────────────┼───────────────┤
│ Group │ 3863 │
├─────────────┼───────────────┤
│ Aggregation │ 364 │
└─────────────┴───────────────┘
Why is grouping with group 10x slower than aggregation?
What is group used for?
And how can I further optimise my query?
Thanks.
The group command uses the same framework as mapReduce, and there are many resources explaining why MR is slower than the aggregation framework. The main reason is that it runs in a separate JavaScript thread, whereas the aggregation framework runs natively on the server.
See details here: MongoDB aggregation comparison: group(), $group and MapReduce