Tests running in parallel - optimal threads used? - multithreading

If a VM has 2 cores and 4 logical processors, then we should be able to run 4 tests in parallel optimally, making use of the 4 available threads.
If I have configured 5 tests to run in parallel in my build file, then ideally 4 run in parallel, and whichever test finishes first frees its thread to pick up the last test case.
So the optimal parallelism is 4 in this case. Is my understanding right?

Related

SLURM Schedule Tasks Without Node Constraints

I have to schedule jobs on a very busy GPU cluster. I don't really care about nodes, more about GPUs. The way my code is structured, each job can only use a single GPU at a time and then they communicate to use multiple GPUs. The way we generally schedule something like this is by doing gpus_per_task=1, ntasks_per_node=8, nodes=<number of GPUs you want / 8> since each node has 8 GPUs.
Since not everyone needs 8 GPUs, there are often nodes that have a few (<8) GPUs lying around, which with my parameters wouldn't be schedulable. Since I don't care about nodes, is there a way to tell Slurm I want 32 tasks and I don't care how many nodes it uses to do it?
For example, it could give me 2 tasks on one machine with 2 GPUs left and split the remaining 30 between completely free nodes, or anything else feasible that makes better use of the cluster.
I know there's an ntasks parameter which may do this, but the documentation is kind of confusing about it. It states:
"The default is one task per node, but note that the --cpus-per-task option will change this default."
What does cpus_per_task have to do with this?
I also saw:
"If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node"
but I'm also confused about this interaction. Does this mean that if I ask for --ntasks=32 --ntasks-per-node=8, it will put at most 8 tasks on a single machine, but could put fewer if it decides to? (That's basically what I want.)
Try --gpus-per-task 1 and --ntasks 32. No tasks per node or number of nodes specified. This allows slurm to distribute the tasks across the nodes however it wants and to use leftover GPUs on nodes that are not fully utilized.
And it won't place more than 8 tasks on a single node, as there are no more than 8 GPUs available.
Regarding ntasks vs. cpus-per-task: this should not matter in your case. By default a task gets one CPU. If you use --cpus-per-task x, it is guaranteed that the x CPUs are on one node. This is not the case if you just say --ntasks, where the tasks are spread however Slurm decides. There is an example for this in the documentation.
Caveat: This requires a version of slurm >= 19.05, as all the --gpu options have been added there.
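Putting the answer together, a minimal batch script might look like this (job name and executable are placeholders; requires Slurm >= 19.05 as noted above):

```shell
#!/bin/bash
#SBATCH --job-name=gpu32       # placeholder job name
#SBATCH --ntasks=32            # 32 tasks total, node count unspecified
#SBATCH --gpus-per-task=1      # one GPU per task
# No --nodes or --ntasks-per-node: Slurm is free to pack tasks onto
# partially used nodes, and it cannot exceed 8 tasks per node anyway,
# since each node only has 8 GPUs.

srun ./my_job                  # placeholder executable
```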

When 2 threads would be executed on a 1 physical CPU core with a multi-core CPU machine?

Let's say there's a machine with an 8-core CPU.
I'm creating 2 POSIX threads using the standard pthread_create(...) function.
As far as I know, there's no guarantee that these threads will always be executed by 2 different physical cores, but in practice they run simultaneously (in parallel) about 90% of the time. At least in my cases I've seen the top command show 2 CPUs running, thus around 160-180% CPU usage.
The question is:
What could be the scenario where 2 threads within a single process run on only 1 physical core?
Two cases:
1) The other physical cores are busy doing other stuff, so only one core gets used by this process. The two threads run in alternation on that core.
2) The physical core supports executing more than one thread concurrently using hyperthreading or something similar. The other physical cores are busy doing other stuff, so the best the scheduler can do is run both threads in a single physical core.

Hybrid parallelization with OpenMP and MPI

I'm trying to set up a program which runs across a cluster of 20 nodes, each with 12 cores. The idea is to have the head process distribute some data out to each node, and have each node perform some operations on the data using OpenMP to utilize the 12 cores. I'm relatively new to this and not sure about the best way to set it up.
We use PBS as the scheduler and my original plan was to create a single MPI process on each node, and let OpenMP create 12 threads per node.
#PBS -l nodes=20:ppn=1
But when I run this, OpenMP seems to only create 1 thread per process. How can I set this up so OpenMP will always create 12 threads per MPI process?
Edit: As soon as I specify more than 1 process per node in PBS, OpenMP will start using 6 threads per process; I can't figure out why using only 1 process per node isn't working.
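The thread above doesn't include an accepted answer, but one common cause (an assumption here, not stated in the original) is that OpenMP sizes its default thread team from the CPUs the scheduler actually assigned to the process, so with ppn=1 each MPI rank may only see one core. A sketch of the usual workaround: request all 12 cores per node, set OMP_NUM_THREADS explicitly, and launch one MPI rank per node.

```shell
#PBS -l nodes=20:ppn=12        # request all 12 cores on each node

# Force the OpenMP team size instead of relying on the default,
# which often reflects only the cores PBS handed the process.
export OMP_NUM_THREADS=12

# Launch one MPI rank per node; the flag name varies by MPI
# implementation (e.g. -npernode for Open MPI, -ppn for MPICH/Hydra).
mpiexec -npernode 1 ./my_hybrid_app   # placeholder executable
```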

Make Parallel Jobs Performance

I was using time to profile make builds, and I noticed that having -j 8 was several milliseconds slower than -j 4. I am compiling with gcc on an Intel Core2 Quad, so there are only four processor cores. Could this slowdown be due to resource limitations, with whatever make uses to schedule jobs adding some overhead?
If you have more processes running than processors, then the operating system will require some context switching. This isn't an issue with make; it's just how jobs are scheduled when there are insufficient resources.
Honestly, I would consider a difference of several milliseconds to be probably just statistical noise. Run the tests several times and see if the difference is repeatable before assuming it's significant.
That said, running 8 CPU-bound processes on 4 CPUs will usually run into more multitasking overhead than running two sets of 4 processes. If the make process involves a lot of I/O (and it usually does), there is some benefit to running more than 4 (say 5 or 6) to fill in the CPU queue when other processes are stalled on I/O, but 8 might be overkill.

Misunderstanding the difference between single-threading and multi-threading programming

I have a misunderstanding of the difference between single-threading and multi-threading programming, so I want an answer to the following question to make everything clear.
Suppose that there are 9 independent tasks and I want to accomplish them with a single-threaded program and a multi-threaded program. Basically it will be something like this:
Single-thread:
- Execute task 1
- Execute task 2
- Execute task 3
- Execute task 4
- Execute task 5
- Execute task 6
- Execute task 7
- Execute task 8
- Execute task 9
Multi-threaded:
Thread1:
- Execute task 1
- Execute task 2
- Execute task 3
Thread2:
- Execute task 4
- Execute task 5
- Execute task 6
Thread3:
- Execute task 7
- Execute task 8
- Execute task 9
As I understand, only ONE thread will be executed at a time (get the CPU), and once the quantum is finished, the thread scheduler will give the CPU time to another thread.
So, which program will finish earlier? Is it the multi-threaded program (logically)? Or is it the single-threaded program (since multi-threading involves a lot of context switching, which takes time)? And why? I need a good explanation, please. :)
It depends.
How many CPUs do you have? How much I/O is involved in your tasks?
If you have only 1 CPU, and the tasks involve no blocking I/O, then the single-threaded program will finish at the same time as or faster than the multi-threaded one, as there is overhead to switching threads.
If you have 1 CPU, but the tasks involve a lot of blocking I/O, you might see a speedup by using threading, assuming work can be done when I/O is in progress.
If you have multiple cpus, then you should see a speedup with the multi-threaded implementation over the single-threaded since more than 1 thread can execute in parallel. Unless of course the tasks are I/O dominated, in which case the limiting factor is your device speed, not CPU power.
As I understand, only ONE thread will be executed at a time
That would be the case if the CPU only had one core. Modern CPUs have multiple cores, and can run multiple threads in parallel.
The program running three threads would run almost three times faster. Even though the tasks are independent, there are still some resources in the computer that have to be shared between the threads, like memory access.
Well, this isn't entirely language-agnostic. Some interpreted programming languages don't support real threads. That is, threads of execution can be defined by the program, but the interpreter is single-threaded, so all execution happens on one core of the CPU.
For compiled languages and languages that support true multi-threading, a single CPU can have many cores. Actually, most desktop computers now have 2 or 4 cores. So a multi-threaded program executing truly independent tasks can finish 2-4 times faster, depending on the number of available cores in the CPU.
Assumption Set:
Single core with no hyperthreading;
tasks are CPU bound;
Each task takes 3 quanta of time;
Each scheduler allocation is limited to 1 quantum of time;
Nonpreemptive FIFO scheduler;
All threads hit the scheduler at the same time;
All context switches require the same amount of time;
Processes are delineated as follows:
Test 1: Single Process, single thread (contains all 9 tasks)
Test 2: Single Process, three threads (contain 3 tasks each)
Test 3: Three Processes, each single threaded (contain 3 tasks each)
Test 4: Three Processes, each with three threads (contain one task each)
With the above assumptions, they all finish at the same time. This is because there is an identical amount of time scheduled on the CPU, the context switches are identical, there is no interrupt handling, and nothing is waiting for I/O.
For more depth into the nature of this, please find this book.
The main difference between single-threading and multi-threading in Java is that in a single-threaded program one thread executes all the tasks of a process, while in a multi-threaded program multiple threads execute the tasks of a process.
A process is a program in execution. Process creation is a resource-consuming task, so it is possible to divide a process into multiple units called threads. A thread is a lightweight process. It is possible to divide a single process into multiple threads and assign tasks to them. When there is one thread in a process, it is called a single-threaded application. When there are multiple threads in a process, it is called a multi-threaded application.
Ruby vs. Python vs. Node.js: in web-app performance, where there is a lot of non-blocking I/O (REST calls, DB queries), the concurrency model has a big impact; and being the only multi-threaded runtime of the three, Node.js wins by a big margin.