I have a use case with Slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by Slurm, I use a queue with 2 nodes with 4 GPUs each (so I can restrict my batch script to one node).
Each job takes 1 GPU.
Problem: if I put everything in the queue, I will block all 4 GPUs of a node even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by using, say, only 2 GPUs.
How can I put the jobs in the queue without them taking all 4 GPUs?
Could I create a kind of sub-queue that would be limited to a subset of a node's resources, for example?
You can use Slurm's consumable trackable resources plugin (cons_tres, enabled in your slurm.conf file; more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the number of GPUs per task with --gpus-per-task=X
-or-
Request a total number of GPUs for the job with --gpus=X
-or-
Bind tasks to specific GPUs by ID with --gpu-bind=map_gpu:<list of GPU IDs>
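As a rough sketch of how this could look for the 60 jobs in the question: the helper below only builds an sbatch command line, it does not submit anything. The --gpus=1 flag comes from the options above. The job array with a %2 throttle (at most two array tasks running at once, hence at most two GPUs in use) is an extra assumption that this answer does not mention, and run_job.sh is a hypothetical script name.

```python
import shlex

def build_sbatch_command(n_jobs, max_running, script="run_job.sh"):
    """Build (but do not run) an sbatch command for a throttled job array.

    Assumptions: `script` is a hypothetical batch script that reads
    SLURM_ARRAY_TASK_ID to pick its input, and cons_tres is enabled.
    """
    if n_jobs < 1 or max_running < 1:
        raise ValueError("n_jobs and max_running must be positive")
    return [
        "sbatch",
        "--nodes=1",                              # each array task stays on one node
        "--ntasks=1",
        "--gpus=1",                               # one GPU per array task
        f"--array=0-{n_jobs - 1}%{max_running}",  # %N caps concurrently running tasks
        script,
    ]

command = build_sbatch_command(n_jobs=60, max_running=2)
print(shlex.join(command))
```

With max_running=2, the other GPUs of the node stay available to other users while the 60 jobs work their way through the queue.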
I have obtained task streams using distributed computing in Dask for different numbers of workers. I observe that as the number of workers increases (from 16 to 32 to 64), the white space in the task stream also increases, which reduces the efficiency of the parallel computation. Even when I increase the workload per worker (that is, more computation per worker), I see a similar trend. Can anyone suggest how to reduce the white space?
PS: I need to extend the computation to 1000s of workers, so reducing the number of workers is not an option for me.
Image for: No. of workers = 16
Image for: No. of workers = 32
Image for: No. of workers = 64
As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.
This can have many causes. I'll list a few below:
Very short tasks (sub-millisecond)
Algorithms that are not very parallelizable
Objects in the task graph that are expensive to serialize
...
Looking at your images I don't think that any of these apply to you.
Instead, I see periods of inactivity followed by periods of activity. My guess is that this is caused by some code that you are running locally, and that your code looks like the following:
for i in ...:
    results = dask.compute(...)  # do some dask work
    next_inputs = ...  # do some local work
So you're being blocked by doing some local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph) or maybe it's the fault of your code (maybe building the inputs for the next computation takes some time).
I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html
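One simple way to check this hypothesis is to time the two phases of each loop iteration separately. The sketch below uses only the standard library. run_on_cluster and prepare_next_inputs are hypothetical stand-ins for the dask.compute(...) call and the local work in your loop, so replace them with your real code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)  # accumulated wall-clock time per phase

@contextmanager
def timed(phase):
    """Add the wall-clock time spent inside the block to phase_seconds[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start

def run_on_cluster(inputs):
    # Hypothetical stand-in for dask.compute(...): replace with the real call.
    return [x * 2 for x in inputs]

def prepare_next_inputs(results):
    # Hypothetical stand-in for the local work done between Dask calls.
    time.sleep(0.01)
    return [r + 1 for r in results]

inputs = [1, 2, 3]
for _ in range(3):
    with timed("dask"):
        results = run_on_cluster(inputs)
    with timed("local"):
        inputs = prepare_next_inputs(results)

for phase, seconds in sorted(phase_seconds.items()):
    print(f"{phase}: {seconds:.3f} s")
```

If the "local" total is a large share of the loop time, the white space is coming from the client side rather than from the workers.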
I'm running a Kafka Streams application with three sub-topologies. The stages of activity are roughly as follows:
stream Topic A
selectKey and repartition Topic A to Topic B
stream Topic B
foreach Topic B to Topic C Producer
stream Topic C
Topic C to Topic D
Topics A, B, and C are each materialized, which means that if each topic has 40 partitions, my maximum parallelism is 120.
At first I was running 5 Streams applications with 8 threads apiece. With this setup I was experiencing inconsistent performance. It seemed like some sub-topologies sharing the same thread were hungrier for CPU than others, and after a while I'd get this error: Member [client_id] in group [consumer_group] has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator). Everything would get rebalanced, which could lead to decreased performance until the next failure and rebalance.
My questions are as follows:
How is it that multiple sub-topologies are able to be run on a single thread? A poll queue?
How does each thread decide how to allocate compute resources to each of its sub-topologies?
How do you optimize your thread to topic-partition ratio in such cases to avoid periodic consumer failures? e.g., will a 1:1 ratio ensure more consistent performance?
If you use a 1:1 ratio, how do you ensure that every thread gets assigned its own topic-partition and some threads aren't left idle?
The thread will poll() for all topics of the different sub-topologies and check each record's topic metadata to feed it into the correct task.
Each sub-topology is treated the same, i.e., the available resources are distributed evenly among them.
A 1:1 ratio is only useful if you have enough cores. I recommend monitoring your CPU utilization. If it's too high (above 80%), you should add more cores/threads.
Kafka Streams handles the assignment of topic-partitions to threads for you automatically.
A couple of general comments:
you might consider increasing the max.poll.interval.ms config to keep a consumer from dropping out of the group
you might consider decreasing max.poll.records to get fewer records per poll() call, and thus decrease the time between two consecutive calls to poll()
note that decreasing max.poll.records does not imply increased network/broker communication: if a single fetch request returns more records than the max.poll.records config, the data is just buffered within the consumer, and the next poll() will be served from the buffered data, avoiding a broker round trip
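The interaction between these two settings can be sketched with a little arithmetic: a consumer stays in the group only if it calls poll() again within max.poll.interval.ms, so the worst-case time to process one full batch of max.poll.records records has to fit in that window. The per-record processing time and the concrete numbers below are illustrative assumptions, not measurements or recommended settings, and the model ignores anything else the thread does between polls.

```python
def worst_case_batch_ms(max_poll_records, ms_per_record):
    """Upper bound on the time between two poll() calls for one full batch."""
    return max_poll_records * ms_per_record

def fits_poll_interval(max_poll_records, ms_per_record, max_poll_interval_ms):
    """True if a full batch can be processed before the poll deadline."""
    return worst_case_batch_ms(max_poll_records, ms_per_record) < max_poll_interval_ms

# Illustrative numbers only: a slow record (800 ms) and a 5-minute poll interval.
print(fits_poll_interval(1000, 800, 300_000))  # False: 800000 ms > 300000 ms
print(fits_poll_interval(250, 800, 300_000))   # True: 200000 ms < 300000 ms
```

Either knob moves the inequality in the right direction: a larger max.poll.interval.ms raises the deadline, a smaller max.poll.records shrinks the batch.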
In other words, what does a parallelism value of 5 and a priority value of 1000 mean?
They affect how and when your job can run. Priority determines the order in which a job runs relative to other queued jobs; parallelism sets the maximum number of parallel processes started for it (more can make it run faster but costs more).
https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-manage-use-portal
Priority
Lower number has higher priority. If two jobs are both queued, the one with the lower priority value runs first.
The default value is 1000 for this.
Parallelism
Max number of compute processes that can happen at the same time. Increasing this number can improve performance but can also increase cost.
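To make the ordering rule concrete, here is a small sketch of how a queue following the quoted rule would order jobs. It only illustrates "lower number runs first" and is not Azure's actual scheduler. The job names are made up, and breaking ties by submission order is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueuedJob:
    name: str              # hypothetical job name
    priority: int = 1000   # default value from the documentation quoted above
    parallelism: int = 1   # placeholder value for this sketch

def run_order(queued_jobs):
    """Order queued jobs so that the lowest priority number comes first.

    sorted() is stable, so jobs with equal priority keep their submission
    order (an assumption made for this sketch).
    """
    return sorted(queued_jobs, key=lambda job: job.priority)

queue = [
    QueuedJob("nightly-report", priority=1000, parallelism=5),
    QueuedJob("urgent-fix", priority=10, parallelism=5),
    QueuedJob("backfill"),
]
print([job.name for job in run_order(queue)])
# ['urgent-fix', 'nightly-report', 'backfill']
```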
JSR 352 - Batch Applications for the Java Platform provides a parallelism feature using partitions. The batch runtime can execute a step in different partitions in order to speed up processing. JSR 352 also introduces a threads setting: we can define the number of threads to use, such as
<step id="Step1">
    <chunk .../>
    <partition>
        <plan partitions="3" threads="2"/>
    </partition>
</step>
This leaves me confused: how do I come up with an appropriate partition plan so that every thread is kept busy and the CPU load is balanced?
For example, there are tables A, B, and C to process, with 1 billion, 1 million, and 1 thousand rows respectively. The step turns these entities into documents, one document per entity. The order in which documents are produced is not important. The CPU time per entity for these tables is 1s, 2s, and 5s respectively. The number of threads is 4.
If there are 3 partitions, one per table, then the step will take 1 * 10^9 seconds to finish, because:
Partition A will take 1 * 10^9 * 1s = 1 * 10^9s, run on thread 2
Partition B will take 1 * 10^6 * 2s = 2 * 10^6s, run on thread 3
Partition C will take 1 * 10^3 * 5s = 5 * 10^3s, run on thread 4
However, while thread 2 is still busy, thread 3 has been idle since 2 * 10^6s and thread 4 has been idle since 5 * 10^3s. So obviously, this is not a good partition plan.
My questions are:
Is there a better partition plan for the above example?
Can I think of partitions as a queue, with threads consuming from that queue?
In general, how many threads can I / should I use? Is it the same as the number of CPU cores?
In general, how do I come up with an appropriate partition plan so that every thread is kept busy and the CPU load is balanced?
Answers...
Is there a better partition plan for the above example?
Yes, there is. See answer 4...
Can I think of partitions as a queue, with threads consuming from that queue?
That is exactly what happens!
In general, how many threads can I / should I use? Is it the same as the number of CPU cores?
It depends. This question has many perspectives... From the JSR-352 Specification View, "threads":
Specifies the maximum number of threads on which to execute the partitions of this step. Note the batch runtime cannot guarantee the requested number of threads are available; it will use as many as it can up to the requested maximum. This is an optional attribute. The default is the number of partitions.
So, based only on this perspective, you can set this value as high as you want (the batch runtime will set the real limit according to its resources!).
From the Batch Runtime Perspective (JSR 352 implementation): any decent implementation will use a thread pool to execute the partitioned steps. So, if such a pool has a fixed size of N, no matter how high you set your threads number, you will never execute more than N partitions concurrently.
JBeret is an implementation of the JSR 352 specification, used by the WildFly server (it is the implementation that I've used). In WildFly, it has a default thread pool setting of max 10 threads. This pool is not only shared between partitioned steps, it is also shared between batch jobs. So, if you're running 2 jobs at the same time, you will have 2 fewer threads to use. In addition, when you partition, one thread takes the role of coordinator, assigning partitions to the other threads and waiting for results, so if your partition plan says that it uses 2 threads, it will in fact use 3 (two as workers, one as coordinator), and all these resources (threads) are taken from the same pool!
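Taking the description above at face value (one coordinator thread per partitioned step, all threads drawn from one shared pool), the number of partitions that can actually run at once can be estimated as follows. The function and its parameters are illustrative only; check your own runtime's documentation for how it really allocates threads.

```python
def concurrent_partitions(pool_size, requested_threads, partitions, other_busy_threads=0):
    """Estimate how many partitions of one step can run at the same time.

    Assumes the model described in the text: a shared pool, minus threads
    busy with other work, minus one thread acting as coordinator.
    """
    available = pool_size - other_busy_threads - 1  # 1 thread acts as coordinator
    return max(0, min(requested_threads, partitions, available))

# Pool of 10 threads, as in the default mentioned above.
print(concurrent_partitions(pool_size=10, requested_threads=4, partitions=3))     # 3
print(concurrent_partitions(pool_size=10, requested_threads=16, partitions=100))  # 9
print(concurrent_partitions(pool_size=10, requested_threads=4, partitions=100,
                            other_busy_threads=8))                                # 1
```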
Anyway, the important point in all this is: find out which implementation of JSR 352 you are using and configure it accordingly.
From the Hardware View, your CPU has a maximum number of hardware threads. From this perspective (and as a rule of thumb), set the "threads" value equal to that number.
From the Performance View, analyze the work that you are doing. If many threads access a shared resource (like a DB), you can create a bottleneck that causes threads to block. If you face that kind of problem, consider lowering the "threads" value.
In summary, set the "threads" value as high as the CPU's maximum thread count. Then check whether that value causes blocking issues; if it does, reduce it. Also, verify that the batch runtime is configured accordingly and allows you to execute as many threads as you want.
In general, how do I come up with an appropriate partition plan so that every thread is kept busy and the CPU load is balanced?
Avoid static partition plans (at least for your case). Instead, use a Partition Mapper. A Partition Mapper is a class that implements the javax.batch.api.partition.PartitionMapper interface and lets you define a partition plan (how many partitions, how many threads, the properties of each partition) programmatically. So for your case, take your tables (A, B, C) and split them into blocks of N rows (where N = 1000); each block will be a partition. Start with the partition of type C and round-robin between your entity partitions (tables): C0, B0, A0, B1, A1, ..., B999, A999, A1000, ..., A999999. Using this scheme, entity C will finish first, leaving one thread free to work on more A and B partitions. Later, B will finish, leaving more resources for the remaining A partitions.
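To get a feel for the difference, here is a small Python simulation (not JSR 352 code) that compares the static plan from the question with the block-based, round-robin plan described above. It assumes an idealized model in which a free thread always takes the next partition from the queue and there is no coordinator or startup overhead; all function names are made up for this sketch.

```python
import heapq
from itertools import zip_longest

def split_into_blocks(rows, seconds_per_row, block_rows):
    """Return the cost in seconds of each block of at most block_rows rows."""
    full, rest = divmod(rows, block_rows)
    costs = [block_rows * seconds_per_row] * full
    if rest:
        costs.append(rest * seconds_per_row)
    return costs

def round_robin(*queues):
    """Interleave lists: the first item of each, then the second of each, ..."""
    missing = object()
    return [item for group in zip_longest(*queues, fillvalue=missing)
            for item in group if item is not missing]

def makespan(partition_costs, threads):
    """Simulate `threads` workers taking partitions from a queue in order."""
    free_at = [0] * threads  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in partition_costs:
        start = heapq.heappop(free_at)  # earliest idle worker takes the next partition
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Tables from the question: name -> (rows, CPU seconds per row).
TABLES = {"C": (10**3, 5), "B": (10**6, 2), "A": (10**9, 1)}

def compare_plans(tables, threads, block_rows):
    """Return (static-plan makespan, block-plan makespan) in seconds."""
    static_plan = [rows * cost for rows, cost in tables.values()]
    block_plan = round_robin(
        *(split_into_blocks(rows, cost, block_rows) for rows, cost in tables.values())
    )
    return makespan(static_plan, threads), makespan(block_plan, threads)

static_s, blocks_s = compare_plans(TABLES, threads=4, block_rows=1000)
print(f"one partition per table: {static_s:.3e} s")
print(f"blocks of 1000 rows:     {blocks_s:.3e} s")
```

With the numbers from the question, the static plan is bounded by table A at 1 * 10^9 s, while the block plan comes out at roughly a quarter of the total work, about 2.5 * 10^8 s, because all 4 threads stay busy until the end.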
Hope this helps...