Apache Flink: Limit number of CPUs in a TaskManager

First, I am running in standalone mode!
I have been trying to find a configuration for this, but I haven't found anything.
In Spark there are configurations that let you limit the number of CPUs used on each worker:
SPARK_WORKER_CORES (worker configuration)
spark.executor.cores (cluster configuration)
But in Flink you can only set the maximum memory to use and the number of task slots (which just divide that memory), as stated in the official documentation:
taskmanager.numberOfTaskSlots: The number of parallel operator or user function instances that a single TaskManager can run (DEFAULT: 1). If this value is larger than 1, a single TaskManager takes multiple instances of a function or operator. That way, the TaskManager can utilize multiple CPU cores, but at the same time, the available memory is divided between the different operator or function instances. This value is typically proportional to the number of physical CPU cores that the TaskManager's machine has (e.g., equal to the number of cores, or half the number of cores).
And here, more relevant to my question:
Each task slot represents a fixed subset of resources of the TaskManager. A TaskManager with three slots, for example, will dedicate 1/3 of its managed memory to each slot. Slotting the resources means that a subtask will not compete with subtasks from other jobs for managed memory, but instead has a certain amount of reserved managed memory. Note that no CPU isolation happens here; currently slots only separate the managed memory of tasks.
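In other words, the only related knobs in flink-conf.yaml look roughly like this (a minimal sketch; the exact memory key name depends on the Flink version, so treat it as an assumption):
taskmanager.heap.size: 4096m
taskmanager.numberOfTaskSlots: 4
Neither setting limits CPU usage.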
Thanks!!

I was looking for an answer to the same question. As far as I understand, there is no configuration that sets the number of CPUs per slot. Setting the number of slots only divides the memory among the slots, reducing the memory per slot. My best guess is to set the number of slots to 1 and limit the CPUs available to the TaskManager process by running it in a container (e.g. Docker). You can then achieve the same parallelism by increasing the number of TaskManagers.
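For example, a minimal sketch with Docker (the image name and entrypoint are assumptions; the relevant part is the --cpus flag, which caps the CPU time the TaskManager process can use):
docker run --cpus="2" --memory="4g" flink:latest taskmanager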

I think this is covered in the Flink configuration documentation:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#yarn
yarn.containers.vcores (default: -1): The number of virtual cores (vcores) per YARN container. By default, the number of vcores is set to the number of slots per TaskManager, if set, or to 1, otherwise. In order for this parameter to be used your cluster must have CPU scheduling enabled. You can do this by setting the org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
David is correct above, but the reason is this setting, and I think it answers the OP's question more directly. So if you leave the default value, adjusting the number of task slots will also adjust the number of vcores requested per container.
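For example, on YARN you could pin the vcores explicitly in flink-conf.yaml (a sketch; as quoted above, it only takes effect if CPU scheduling is enabled in YARN):
yarn.containers.vcores: 4
taskmanager.numberOfTaskSlots: 4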

Related

Slurm: can I create a sub-queue using a subset of resources in a single node?

I have a use case with slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by Slurm, I use a queue with 2 nodes with 4 GPUs each (so I can restrict my batch script to one node).
Each job takes 1 GPU.
Problem: if I put everything in the queue, I will block all 4 GPUs even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, only 2 GPUs.
How can I put the jobs in the queue without them taking all 4 GPUs?
Could I create a kind of sub-queue that would be limited to a subset of the resources of a node, for example?
You can use the Slurm consumable trackable resources plug-in (cons_tres, enabled in your slurm.conf file; more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the --gpus-per-task=X
-or-
Request a specific number of GPUs for the job with --gpus=X
-or-
Bind the task to a specific gpu by its ID with --gpu-bind=GPUID
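A minimal sbatch sketch along those lines (time limit and script name are placeholders):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --time=03:00:00
srun ./my_job.sh
Submitting the 60 jobs this way lets Slurm pack several of them onto the same node while each job only reserves its single GPU.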

Prometheus: CPU process time total to percent

We started using Prometheus and Grafana as the main tools for monitoring our Service Fabric cluster. As the Prometheus target we use wmi_exporter with its predefined collectors: CPU, system, process, service, memory, etc. Our main goal was to monitor our product services on each instance of the node group in Azure Service Fabric.
For instance, we use this PromQL query to calculate total CPU usage in %:
100 - (avg by (hostname) (irate(wmi_cpu_time_total{scaleset="name",mode="idle"}[5m])) * 100)
and the metrics look more or less realistic.
Until we started to write queries for services.
For services we use sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100, and the metrics seem unrealistic from time to time; this becomes obvious when you compare them with the total CPU time %. I found an article about multiplying by 100 to turn CPU time into %, but this way I get values around 170% or more. Perhaps I need to divide by the number of CPU cores?
Regarding the query, I sum over the process because the exporter returns two different series for one process, one for user mode and one for privileged mode.
Can anyone please help me with the correct calculation for the CPU process time total metric and with converting it to a percentage?
Thank you, I would be grateful for any help!
I hope this will help!
The result is pretty much the same as the Windows performance manager.
So, for CPU % for running services (tasks, processes):
sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100 / 2
where 2 is the number of CPU cores on the host.
First, you sum all series for the running process: the exporter returns results per process for both user and kernel mode, so they need to be summed. The same goes for hostname (instance, etc.); in my case, I have Azure scale sets with 2 to 5 instances. The result must be multiplied by 100 to get % and divided by the number of CPU cores.
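If you prefer not to hard-code the core count, one sketch is to derive it from the metrics themselves by counting the per-core idle series of wmi_cpu_time_total (this assumes the CPU collector exposes one series per core, which is its usual behaviour):
sum by (process, hostname) (irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100
  / on (hostname) group_left
count by (hostname) (wmi_cpu_time_total{scaleset="name", mode="idle"})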
Cheers!

What do priority and parallelism value mean in Azure Data Lakes (Hadoop)?

In other words, what does a parallelism value of 5 and a priority value of 1000 mean?
They affect how and when your job can run. Priority determines the order in which a job runs relative to other queued jobs; parallelism sets how many parallel processes are started for it (more means it runs faster but costs more).
https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-manage-use-portal
Priority
Lower number has higher priority. If two jobs are both queued, the one with lower priority runs first
The default value is 1000 for this.
Parallelism
Max number of compute processes that can happen at the same time. Increasing this number can improve performance but can also increase cost.
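For reference, both values can also be set when submitting a job from the Azure CLI; a sketch, with the account name, job name and script path as placeholders (check az dla job submit --help for the exact parameters in your CLI version):
az dla job submit --account myadlaaccount --job-name myjob --script @./script.usql --degree-of-parallelism 5 --priority 1000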

How to define a good partition plan to ensure CPU balance in JSR 352?

JSR 352 - Batch Applications for the Java Platform provides parallelism using partitions. The batch runtime can execute a step in different partitions in order to accelerate progress. JSR 352 also introduces the threads attribute: we can define the number of threads to use, for example:
<step id="Step1">
<chunk .../>
<partition>
<plan partitions="3" threads="2"/>
</partition>
</chunk>
</step>
This confuses me: how do I define an appropriate partition plan so that every thread is busy and the CPU load stays balanced?
For example, there are tables A, B and C to process, with 1 billion, 1 million and 1 thousand rows respectively. The step turns each entity into a document, one entity per document; the order in which documents are produced does not matter. The CPU time per entity is 1 s, 2 s and 5 s for these tables respectively, and the number of threads is 4.
If there are 3 partitions, one per table, then the step will take 1 * 10^9 seconds to finish, because:
Partition A will take 1 * 10^9 * 1s = 1 * 10^9s, run on thread 2
Partition B will take 1 * 10^6 * 2s = 2 * 10^6s, run on thread 3
Partition C will take 1 * 10^3 * 5s = 5 * 10^3s, run on thread 4
However, while thread 2 stays busy, thread 3 becomes idle after 2 * 10^6 s and thread 4 after 5 * 10^3 s. So obviously, this is not a good partition plan.
My questions are:
Is there a better partition plan for the above example?
Can I think of the partitions as a queue that the threads consume?
In general, how many threads can I / should I use? Is it the same as the number of CPU cores?
In general, how do I define an appropriate partition plan so that every thread is busy and the CPU load stays balanced?
Answers...
Is there a better partition plan for the above example?
Yes, there is. See answer 4...
Can I think of the partitions as a queue that the threads consume?
That is exactly what happens!
In general, how many threads can I / should I use? Is it the same as the number of CPU cores?
It depends. This question has many perspectives... From the JSR 352 specification's point of view, "threads":
Specifies the maximum number of threads on which to execute the partitions of this step. Note the batch runtime cannot guarantee the requested number of threads are available; it will use as many as it can up to the requested maximum. This is an optional attribute. The default is the number of partitions.
So, based only on this perspective, you should set this value as high as you want (the batch runtime will enforce the real limit, according to its resources!).
From the batch runtime perspective (the JSR 352 implementation): any decent implementation will use a thread pool to execute the partitioned steps. So, if that pool has a fixed size of N, no matter how high you set your threads value, you will never execute more than N partitions concurrently.
JBeret is the JSR 352 implementation used by the WildFly server (it is the one I have used). In WildFly, its default thread pool has a maximum of 10 threads. This pool is not only shared between partitioned steps, it is also shared between batch jobs, so if you run 2 jobs at the same time you have 2 fewer threads available. On top of that, when you partition, one thread takes the role of coordinator, assigning partitions to the other threads and waiting for results; so if your partition plan says it uses 2 threads, it will in fact use 3 (two as workers, one as coordinator), and all these threads are taken from the same pool!
Anyway, the important point of all this is: find out which JSR 352 implementation you are using and configure it accordingly.
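For example, in WildFly that pool lives in the batch-jberet subsystem of standalone.xml; a sketch of raising it might look like this (the subsystem namespace version differs between WildFly releases, so treat the exact xmlns as an assumption):
<subsystem xmlns="urn:jboss:domain:batch-jberet:2.0">
    <default-job-repository name="in-memory"/>
    <default-thread-pool name="batch"/>
    <job-repository name="in-memory">
        <in-memory/>
    </job-repository>
    <thread-pool name="batch">
        <max-threads count="20"/>
        <keepalive-time time="30" unit="seconds"/>
    </thread-pool>
</subsystem>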
From the hardware point of view, your CPU has a maximum number of hardware threads. Under this perspective (and as a rule of thumb), set the "threads" value equal to that number.
From the performance point of view, analyze the work you are doing. If you access a shared resource (like a DB) from many threads, you can create a bottleneck that causes thread blocking. If you face that kind of problem, you should consider lowering the "threads" value.
In summary, set the "threads" value as high as the CPU's maximum thread count. Then check that this value does not cause blocking issues; if it does, reduce it. Also verify that the batch runtime is configured accordingly and allows you to execute as many threads as you want.
In general, how do I define an appropriate partition plan so that every thread is busy and the CPU load stays balanced?
Avoid static partition plans (at least for your case). Instead, use a partition mapper: a class that implements the javax.batch.api.partition.PartitionMapper interface and defines the partition plan (how many partitions, how many threads, the properties of each partition) programmatically. For your case, take your tables (A, B, C) and split them into blocks of N rows (say N = 1000); each block becomes a partition. Start with the partitions of table C and round-robin between your entity partitions (tables): C0, B0, A0, B1, A1, ..., B999, A999, A1000, ..., A999999. With this scheme, entity C finishes first, leaving one thread free to work on more A and B partitions. Later, B finishes, leaving more resources to attack the remaining A partitions.
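A minimal sketch of such a mapper, under the assumption that each partition is described by hypothetical table/firstRow/lastRow properties that the reader and processor would interpret (class and property names are made up for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import javax.batch.api.partition.PartitionMapper;
import javax.batch.api.partition.PartitionPlan;
import javax.batch.api.partition.PartitionPlanImpl;

// Hypothetical mapper: one partition per block of rows, tables interleaved
// so that the small tables finish early and free their threads.
public class RoundRobinPartitionMapper implements PartitionMapper {

    private static final long BLOCK_SIZE = 1000; // rows per partition; a real job
                                                 // would use a much larger block for table A

    @Override
    public PartitionPlan mapPartitions() throws Exception {
        // In a real job these row counts would be read from the database.
        String[] tables = { "C", "B", "A" };
        long[] rowCounts = { 1_000L, 1_000_000L, 1_000_000_000L };

        List<Properties> partitions = new ArrayList<>();
        boolean added = true;
        for (long block = 0; added; block++) {
            added = false;
            for (int t = 0; t < tables.length; t++) { // round-robin over the tables
                long firstRow = block * BLOCK_SIZE;
                if (firstRow < rowCounts[t]) {
                    Properties p = new Properties();
                    p.setProperty("table", tables[t]);
                    p.setProperty("firstRow", String.valueOf(firstRow));
                    p.setProperty("lastRow",
                            String.valueOf(Math.min(firstRow + BLOCK_SIZE, rowCounts[t]) - 1));
                    partitions.add(p);
                    added = true;
                }
            }
        }

        PartitionPlanImpl plan = new PartitionPlanImpl();
        plan.setPartitions(partitions.size());
        plan.setThreads(4); // bounded in practice by the runtime's thread pool
        plan.setPartitionProperties(partitions.toArray(new Properties[0]));
        return plan;
    }
}

The mapper is then referenced from the step's <partition> element with <mapper ref="..."/> instead of a static <plan>.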
Hope this helps...

Requesting integer multiple of "M" cores per node on SGE

I want to submit a multi-threaded MPI job to SGE, and the cluster I am running on has nodes with different numbers of cores. Let's say the number of threads per process is M (M == OMP_NUM_THREADS for OpenMP). How can I request that a job submitted to an SGE queue runs in such a way that on every node, an integer multiple of M cores is allocated to my job?
Let's say M=8, and the number of MPI tasks is 5 (so a total of 40 cores requested). And in this cluster, there are nodes with 4, 8, 12, and 16 cores. Then this combination is OK:
2*(8-core nodes) + 1*(16-core nodes) + 0.5*(16-core nodes)
but of course not any of these ones:
2*(4-core nodes) + 2*(8-core nodes) + 1*(16-core node)
2*(12-core nodes) + 1*(16-core node)
(3/8)*(8-core nodes) + (5/8)*(8-core nodes) + 2*(16-core node)
PS: There was another similar question, like this one: ( MPI & pthreads: nodes with different numbers of cores ), but mine is different since I have to run exactly M threads per MPI process (think hybrid MPI+OpenMP).
The best scenario is to run this job exclusively on the same kind of nodes. But to speed up the start time, I want to allow this job to run on different kind of nodes, provided that each node has integer*M cores allocated to the job.
The allocation policy in SGE is specified on a per parallel environment (PE) basis. Each PE can be configured to fill the slots available on the cluster nodes in a specific way. One requests a specific PE with the -pe pe_name num_slots parameter, and then SGE tries to find num_slots slots following the allocation policy of the pe_name PE. Unfortunately, there is no easy way to request slots in integer multiples per node.
In order to be able to request exactly M slots per host (and not a multiple of M), your SGE administrator (or you, in case you are the SGE administrator) must first create a new PE, let's call it mpi8ppn, set its allocation_rule to 8, and then assign the PE to each cluster queue. Then you have to submit the job to that PE with -pe mpi8ppn 40 and instruct the MPI runtime to start only one process per host, e.g. with -npernode 1 for Open MPI.
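A sketch of what such a PE definition could look like (as shown by qconf -sp mpi8ppn; every field except allocation_rule is just an illustrative default):
pe_name            mpi8ppn
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
followed by something like:
qsub -pe mpi8ppn 40 <other args> jobscript
mpirun -npernode 1 ./hybrid_app
where hybrid_app is a placeholder for your MPI+OpenMP binary.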
If the above is unlikely to happen, your other (unreliable) solution would be to request a very high amount of memory per slot, close to what each node has, e.g. -l h_vmem=23.5G. Assuming that the nodes are configured with h_vmem of 24 GiB, this request will ensure that SGE won't be able to fit more than one slot on each host. So, if you would like to start a hybrid job on 5 nodes, you will simply ask SGE for 5 slots and 23.5G vmem for each slot with:
qsub -pe whatever 5 -l h_vmem=23.5G <other args> jobscript
or
#$ -pe whatever 5
#$ -l h_vmem=23.5G
This method is unreliable since it does not allow you to select cluster nodes that have a specific number of cores and only works if all nodes are configured with h_vmem of less than 47 GB. h_vmem serves just as an example here - any other per-slot consumable attribute should do. The following command should give you an idea of what host complexes are defined and what their values are across the cluster nodes:
qhost -F | egrep '(^[^ ])|(hc:)'
The method works best for clusters where node_mem = k * #cores with k being constant across all nodes. If a node provides twice the number of cores but also has twice the memory, e.g. 48 GiB, then the above request will give you two slots on such nodes.
I don't claim to fully understand SGE, and my knowledge dates back to the SGE 6.2u5 era, so simpler solutions might exist nowadays.
