Execute (or queue) a set of tasks simultaneously - multithreading

I have a situation like this:
between 5 and 20 test environments, separated into groups of 5 VMs (1 set = 5 VMs usually)
hundreds of test cases which should be executed simultaneously on one VM set.
Celery with 5 workers (one worker per VM in a set: alpha, beta, charlie, delta, echo)
Test sets can run in a different order and take different amounts of time to execute.
Each worker should execute only one test case at a time, with no overlapping or concurrency.
Each worker runs tasks only from its own queue/consumer.
In a previous version I had a solution with multiprocessing and it worked fine. But with Celery I can't add all 100 test cases for all 5 VMs from one set: it only starts adding tasks for VM alpha and waits until they have all finished before starting tasks for the next VM, beta, and so on.
Now, when I tried to use multiprocessing to create separate threads for each worker, I got: AssertionError: daemonic processes are not allowed to have children
The problem is: how do I add 100 tests for all 5 workers at the same time?
So that each worker (alpha, beta, ...) runs its own set of 100 test cases simultaneously.

This problem can be solved using routing keys and queues tied to each consumer, like:
app.control.add_consumer(
    queue='alpha',
    exchange='local',
    exchange_type='topic',   # wildcard routing keys like 'alpha.*' need a topic exchange ('direct' matches keys literally)
    routing_key='alpha.*',
    destination=['worker_for_alpha@HOSTNAME'])
Now you can send any task to this consumer's dedicated worker using the routing key and queue name:
@app.task(queue='alpha', routing_key='alpha.task_for_something')
def any_task(arg_1, arg_2):
    # do something with arg_1 and arg_2
    ...
Now you can scale this to any number of workers, or to several consumers per worker. Just build a collection of queue/key names and iterate over it to dispatch to multiple workers/consumers, as sketched below.
The other issue can be solved with the --concurrency option of each worker.
You can set concurrency to 5 to get 5 simultaneous threads on one worker, or break the task flow into separate threads for each worker, each with a unique key and consumer (queue).
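As a rough sketch (the my_celery_app module, the test_case_id argument, and the dispatch_all helper are placeholders for illustration, not from the original code), the dispatch loop could look like this:
from my_celery_app import any_task  # hypothetical module holding the Celery app and task

VM_QUEUES = ['alpha', 'beta', 'charlie', 'delta', 'echo']

def dispatch_all(test_case_ids):
    for queue in VM_QUEUES:
        for test_case_id in test_case_ids:
            # apply_async can override the queue/routing key per call,
            # so one loop fans the same tasks out to all five VM queues at once
            any_task.apply_async(
                args=(test_case_id, queue),
                queue=queue,
                routing_key='%s.task_for_something' % queue)
Each worker is then started against a single queue with concurrency 1 so it never runs two test cases at once, e.g. celery -A my_celery_app worker -Q alpha -n worker_for_alpha@%h --concurrency=1.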

Related

Dataflow exceeds number_of_worker_harness_threads

I deployed a Dataflow job with the parameter --number_of_worker_harness_threads=5 (streaming mode).
Next I sent 20 Pub/Sub messages, each triggering the loading of a big CSV file from GCS and its processing.
In the logs I see that the job took 10 messages and processed them in parallel on 6-8 threads (I checked several times; sometimes it was 6, sometimes 8).
Nevertheless, it was always more than 5.
Any idea how this works? It does not seem to be the expected behavior.
Judging from the flag name, you are using the Beam Python SDK.
For Python streaming, the total number of threads running DoFns on one worker VM in the current implementation may be up to the value provided in --number_of_worker_harness_threads times the number of SDK processes running on the worker, which by default is the number of vCPU cores. There is a way to limit the number of processes to 1 regardless of the number of vCPUs: set --experiments=no_use_multiple_sdk_containers.
For example, if you are using --machine_type=n1-standard-2 and --number_of_worker_harness_threads=5, you may have up to 10 DoFn instances in different threads running concurrently on the same machine.
If --number_of_worker_harness_threads is not specified, up to 12 threads per process are used. See also: https://cloud.google.com/dataflow/docs/resources/faq#how_many_instances_of_dofn_should_i_expect_dataflow_to_spin_up_
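For reference, a minimal sketch of passing those flags to a Beam Python pipeline (the project, region, and bucket values are placeholders, not taken from the question):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Cap DoFn threads at 5 per SDK process and keep a single SDK process per
# worker VM, so at most 5 DoFn threads run concurrently on one machine.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--streaming',
    '--project=my-project',                    # placeholder
    '--region=us-central1',                    # placeholder
    '--temp_location=gs://my-bucket/tmp',      # placeholder
    '--machine_type=n1-standard-2',
    '--number_of_worker_harness_threads=5',
    '--experiments=no_use_multiple_sdk_containers',
])

with beam.Pipeline(options=options) as pipeline:
    ...  # read from Pub/Sub, load the CSV files from GCS, etc.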

Slurm: can I create a sub-queue using a subset of resources in a single node?

I have a use case with Slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by Slurm, I use a queue with 2 nodes with 4 GPUs each (so I can restrict my batch script to one node).
Each job takes 1 GPU.
Problem: if I put everything in the queue, I will block 4 GPUs even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, only 2 GPUs.
How can I put the jobs in the queue without them taking all 4 GPUs?
Could I create a kind of sub-file that would be limited to a subset of a node's resources, for example?
You can use the Slurm consumable trackable resources plugin (cons_tres, enabled in your slurm.conf file - more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify --gpus-per-task=X
-or-
Bind a specific number of GPUs to the task with --gpus=X
-or-
Bind the task to a specific GPU by its ID with --gpu-bind=GPUID
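As an illustration, a batch script along these lines requests a single GPU instead of a whole node (the job name, CPU count, and run command are placeholders):
#!/bin/bash
#SBATCH --job-name=one-gpu-job       # placeholder name
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1            # ask for 1 GPU, not the whole node (requires cons_tres)
#SBATCH --cpus-per-task=4            # placeholder CPU count
#SBATCH --time=03:00:00              # roughly 3h per job, per the question

srun ./my_job                        # placeholder command
Submitting the 60 jobs this way (or as a job array) lets Slurm pack them onto free GPUs without reserving entire nodes.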

How to limit/set parallelism of QueueTriggers that get executed with Azure WebJobs

I have 5 QueueTrigger jobs within a single Function.cs file. 3 jobs must execute sequentially (synchronously) and 2 can process up to 16 items at a time.
From what I can gather from the documentation, the AddAzureStorage queue configuration method only supports setting this parallelism for all the jobs:
.AddAzureStorage(queueConfig =>
{
    queueConfig.BatchSize = 1;
});
The above means that all jobs can process only one item at a time. If I set it to 16, then all jobs will run in parallel, which is not what I want either.
Is there a way to set the BatchSize per QueueTrigger WebJob, or will I have to set it to 16 and use locks on the ones I don't want to run in parallel to achieve the desired behaviour?

How to set agenda job concurrency properly

Here's an example job:
const Agenda = require('agenda')
const agenda = new Agenda({db: {address: process.env.MONGO_URL}})

agenda.define('example-job', (job) => {
  console.log('took a job -', job.attrs._id)
})
So now, let's say I queue up 11 agenda jobs like this:
const times = require('lodash/times')
times(11, () => agenda.now('example-job'))
Now if I look in the DB I can see that there are 11 jobs queued and ready to go (like I would expect).
So now I start one worker process:
agenda.on('ready', () => {
  require('./jobs/example_job')
  agenda.start()
})
When that process starts, I see 5 jobs get pulled off the queue. This makes sense because the defaultConcurrency is 5: https://github.com/agenda/agenda#defaultconcurrencynumber
So far so good, but if I start another worker process (same as the one above), I would expect 5 more jobs to be pulled off the queue, so there would be a total of 10 running (5 per process) and one left on the queue.
However, when the second worker starts, it doesn't pull down any more jobs; it just idles.
I would expect defaultConcurrency to be the number of jobs that can run at any given moment per process, but it looks like it is a setting that applies to the number of jobs running at any moment in aggregate, across all agenda processes.
What am I missing here, and what is the correct way to specify how many jobs can run per process without putting a limit on the number of jobs that can run across all the processes?
The problem is that defaultLockLimit needs to be set.
By default, the lock limit is 0, which means no limit, so one worker will lock up all the available jobs, leaving no other workers able to claim them.
Setting defaultLockLimit to the same value as defaultConcurrency ensures that a worker only locks the jobs it is actively processing.
See: https://github.com/agenda/agenda/issues/412#issuecomment-374430070
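For example, a small sketch of that configuration using the constructor options (the value 5 simply mirrors the default concurrency):
const Agenda = require('agenda')

const agenda = new Agenda({
  db: {address: process.env.MONGO_URL},
  defaultConcurrency: 5, // jobs this process runs at once
  defaultLockLimit: 5    // jobs this process may lock, so other workers can still claim the rest
})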

Requesting integer multiple of "M" cores per node on SGE

I want to submit a multi-threaded MPI job to SGE, and the cluster I am running on has nodes with different numbers of cores. Let's say the number of threads per process is M (M == OMP_NUM_THREADS for OpenMP). How can I request that a job submitted to an SGE queue be run in such a way that on every node, an integer multiple of M cores is allocated to my job?
Let's say M=8 and the number of MPI tasks is 5 (so a total of 40 cores requested). In this cluster there are nodes with 4, 8, 12, and 16 cores. Then this combination is OK:
2*(8-core nodes) + 1*(16-core nodes) + 0.5*(16-core nodes)
but of course none of these:
2*(4-core nodes) + 2*(8-core nodes) + 1*(16-core node)
2*(12-core nodes) + 1*(16-core node)
(3/8)*(8-core nodes) + (5/8)*(8-core nodes) + 2*(16-core node)
PS: There was a similar question ( MPI & pthreads: nodes with different numbers of cores ), but mine is different since I have to run exactly M threads per MPI process (think hybrid MPI+OpenMP).
The best scenario is to run this job exclusively on the same kind of nodes. But to speed up the start time, I want to allow this job to run on different kinds of nodes, provided that each node has integer*M cores allocated to the job.
The allocation policy in SGE is specified on a per-parallel-environment (PE) basis. Each PE can be configured to fill the slots available on the cluster nodes in a specific way. One requests a specific PE with the -pe pe_name num_slots parameter, and then SGE tries to find num_slots slots following the allocation policy of the pe_name PE. Unfortunately, there is no easy way to request slots in integer multiples per node.
In order to be able to request exactly M slots per host (and not a multiple of M), your SGE administrator (or you, in case you are the SGE administrator) must first create a new PE, let's call it mpi8ppn, set its allocation_rule to 8, and then assign the PE to each cluster queue. Then you have to submit the job to that PE with -pe mpi8ppn 40 and instruct the MPI runtime to start only one process per host, e.g. with -npernode 1 for Open MPI.
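For illustration, the PE definition (created with qconf -ap mpi8ppn) could look roughly like this; the fields other than allocation_rule are typical values that your administrator would adapt:
pe_name            mpi8ppn
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
With the PE attached to the relevant queues, the job is submitted with qsub -pe mpi8ppn 40 jobscript, and the job script launches one process per host, e.g. mpirun -npernode 1 ./my_hybrid_app for Open MPI.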
If the above is unlikely to happen, your other (unreliable) solution would be to request a very high amount of memory per slot, close to what each node has, e.g. -l h_vmem=23.5G. Assuming that the nodes are configured with h_vmem of 24 GiB, this request will ensure that SGE won't be able to fit more than one slot on each host. So, if you would like to start a hybrid job on 5 nodes, you will simply ask SGE for 5 slots and 23.5G vmem for each slot with:
qsub -pe whatever 5 -l h_vmem=23.5G <other args> jobscript
or
#$ -pe whatever 5
#$ -l h_vmem=23.5G
This method is unreliable since it does not allow you to select cluster nodes that have a specific number of cores and only works if all nodes are configured with h_vmem of less than 47 GB. h_vmem serves just as an example here - any other per-slot consumable attribute should do. The following command should give you an idea of what host complexes are defined and what their values are across the cluster nodes:
qhost -F | egrep '(^[^ ])|(hc:)'
The method works best for clusters where node_mem = k * #cores with k being constant across all nodes. If a node provides twice the number of cores but also has twice the memory, e.g. 48 GiB, then the above request will give you two slots on such nodes.
I don't claim to fully understand SGE and my knowledge dates back from the SGE 6.2u5 era, so simpler solutions might exist nowadays.

Resources