Make Parallel Jobs Performance - Linux

I was using time to profile make builds, and I noticed that -j 8 was several milliseconds slower than -j 4. I am compiling with gcc on an Intel Core2 Quad, so there are only four processor cores. Could this slowdown be due to resource limitations, with whatever make uses to schedule jobs adding some overhead?

If you have more processes running than processors, then the operating system will require some context switching. This isn't an issue with make; it's just how jobs are scheduled when there are insufficient resources.

Honestly, I would consider a difference of several milliseconds to probably just be statistical noise. Run the tests several times and see if the difference is repeatable before assuming it's significant.
That said, running 8 CPU-bound processes on 4 CPUs will usually run into more multitasking overhead than running two sets of 4 processes. If the make process involves a lot of I/O (and it usually does), there is some benefit to running more than 4 (say 5 or 6) to fill in the CPU queue when other processes are stalled on I/O, but 8 might be overkill.
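If you want to check repeatability, here is a minimal benchmarking sketch (assuming a Makefile with a "clean" target in the current directory; the run count and job values are arbitrary):

# Hypothetical benchmark: time several clean builds at different -j values.
import subprocess
import time

def time_build(jobs, runs=5):
    timings = []
    for _ in range(runs):
        subprocess.run(["make", "clean"], check=True, capture_output=True)
        start = time.perf_counter()
        subprocess.run(["make", "-j", str(jobs)], check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    return timings

for jobs in (4, 8):
    results = time_build(jobs)
    print(f"-j{jobs}: min={min(results):.3f}s max={max(results):.3f}s")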

Related

What does it mean when we say "4 cores 8 threads"?

When I run lscpu on my host, it shows
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
My host has 4 physical cores but 8 logical CPUs, because there are 2 threads per core. OK, so "2 threads per core" means one core can execute 2 threads simultaneously, as if we had doubled the CPU capacity? So this is a parallelism concept?
Meanwhile, there is the other concept that "one process can have multiple threads". I believe this means one process can handle multiple threads concurrently by context switching, but not necessarily in parallel. In most cases one CPU can execute one thread at a time, right?
I'd like to confirm that my understanding above is correct. Thanks.
Ref for concurrent and parallel difference: What is the difference between concurrency and parallelism?
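For reference, those numbers can be cross-checked programmatically. A small sketch (Linux-specific, since it parses /proc/cpuinfo; the "physical id"/"core id" fields may be absent on some architectures):

# Logical CPUs vs physical cores on Linux.
import os

def physical_cores():
    cores = set()
    physical_id = None
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("physical id"):
                physical_id = line.split(":")[1].strip()
            elif line.startswith("core id"):
                cores.add((physical_id, line.split(":")[1].strip()))
    return len(cores) or 1

logical = os.cpu_count()      # 8 on the host above
physical = physical_cores()   # 4 on the host above
print(f"logical CPUs: {logical}, physical cores: {physical}, threads per core: {logical // physical}")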
This concept is called Simultaneous multithreading (SMT). It is implemented in many processors, from x86-64 (both AMD and Intel) to POWER. The idea is to execute 2 threads concurrently; some operations can actually run in parallel, depending on the specific target architecture.
one core can execute 2 threads simultaneously so as if we have doubled the CPU capacity?
No. Hardware threads (also called logical cores) are not equivalent to cores (i.e., as opposed to physical cores). Some processor units are statically partitioned between the hardware threads, while others are dynamically allocated between them, meaning the threads share the available resources.
The initial idea was to execute something useful when a core was stalling on some operations like memory reads. With 2 hardware threads, a core can execute the instructions of another thread if the current one is waiting on memory, for example due to a cache miss. Memory-bound parallel codes that are limited by the latency of the RAM like naive transpositions or linked-list traversals can benefit from this mechanism.
The SMT implementation has significantly improved over time, especially in recent x86-64 processors. Nowadays, the hardware threads of a modern processor can execute computing instructions truly in parallel. For example, an Intel Skylake core can execute up to 4 arithmetic instructions per cycle, thanks to its 4 ALUs. A single thread can only reach 4 instructions per cycle if the instructions are independent (during the cycles in question). This is not always possible, as some loops are inherently sequential and do not contain enough independent instructions per iteration (e.g., a cumulative sum). With 2-way SMT enabled, 2 software threads can be scheduled on the same core, and the core can execute 2 instructions from each thread fully in parallel in a given cycle. It can even balance the number of instructions according to the needs of each thread in real time (e.g., 1 vs 3 instructions per cycle).
In the end, latency-bound codes can be up to 2 times faster on a 2-way SMT processor like Skylake. That being said, SMT does not speed up code that can already fully use all the available computing units of the processor. For example, a parallel matrix multiplication using an optimized BLAS library will nearly always be slower with 2 software threads per core than with only 1 software thread per core. Execution can even be slower, because hardware threads share resources like caches and can conflict with each other when 2 threads per core run simultaneously. Put shortly, efficient code should not benefit from SMT, but people tend to write inefficient code, and it is not rare for compilers to fail to generate efficient code that saturates the computing units of a core (they often need some help).
While we have another concept that "one process can have multiple threads", I believe this means one process can handle multiple threads concurrently by switching context, but not necessarily in parallel.
I would like to set the record straight: software threads and hardware threads are two very different things despite the name.
A software thread is a logical OS unit that can be scheduled on a hardware thread. A hardware thread can be seen as a physical part of a processor core (admittedly a naive, simplistic view). A software thread is part of an OS process. The OS is responsible for scheduling the ready software threads. Processes are not scheduled; software threads are (at least on a modern OS). 2 software threads of 2 different processes can run in parallel on a processor with multiple cores (or even on a single 2-way SMT core).
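To make the software-thread side concrete, here is a minimal sketch (assuming Python 3.8+ for threading.get_native_id) showing one process containing several software threads, each with its own kernel-level thread ID that the OS scheduler dispatches onto hardware threads:

# One process, several software threads; the OS schedules the threads, not the process.
import os
import threading

def report(name):
    print(f"{name}: pid {os.getpid()}, native thread id {threading.get_native_id()}")

report("main thread")
workers = [threading.Thread(target=report, args=(f"worker {i}",)) for i in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()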
In most cases one CPU can execute one thread at a time, right?
The term "CPU" is not clear here: it can mean different things regarding the context.
If "one CPU" means a modern microprocessor chip that is typically a multicore one nowadays, then definitively no. Software threads can truly run in parallel on different cores for examples.
If "one CPU" means a core (like often in high-performance computing), then it depends: a 1-way SMT core can execute only 1 thread at a time while a 2-way SMT core can execute 2 thread at a time.
On old microprocessor chip with 1 core and no SMT, it was true that one thread was running at a time and context switches was used to execute thread concurrently from the user point-of-view but not in parallel. This time is long gone (since nearly 2 decades) except maybe on some embedded microprocessor chips.
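To check which case your own machine falls into, Linux exposes the SMT topology through sysfs. A small sketch (Linux-specific; the smt/active file may be missing on older kernels):

# Is SMT active, and which logical CPUs share a physical core?
from pathlib import Path

smt_active = Path("/sys/devices/system/cpu/smt/active")
if smt_active.exists():
    print("SMT active:", smt_active.read_text().strip() == "1")

for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    siblings = cpu / "topology" / "thread_siblings_list"
    if siblings.exists():
        print(f"{cpu.name}: shares a core with logical CPUs {siblings.read_text().strip()}")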
Is this...parallel?
Maybe.
Hyperthreading is Intel's trademark* for processor cores that have two complete sets of context registers. A hyperthreaded CPU can concurrently execute code on behalf of two threads without any intervention by the operating system (i.e., with no need for context switching.)
The extent to which those two concurrent executions actually are parallel executions varies from CPU model to model, and it depends on what the two threads actually are doing. For example (I'm just making this part up because it's been a few decades since I've needed to worry about any particular CPU architecture) if some "hyperthreaded" CPU has two integer ALUs per core, then the two threads might both be able to perform integer operations in parallel, but if it has only one FPU per core, then they would have to take turns using it.
Some Hyperthreaded CPU models have more duplicate execution units than others have, and so can parallelize more parts of the execution.
* AMD calls their similar capability, "2-way simultaneous multithreading."

Purpose of multiprocessors and multi-core processors

I want to clarify things in my head and build some concrete knowledge. In a dual-core, single-processor system, only two threads within the one process can be executed concurrently, one by each core. In a uni-core, two-processor system, two different processes can be executed, one by each CPU.
So can we say that each processor executes processes concurrently, while a multi-core processor executes threads within a process concurrently?
I think you have a fundamental misunderstanding of what a process and thread are and how they relate to the hardware itself.
As a simplification, a CPU core executes one machine-level instruction per clock cycle (so essentially, just one assembly instruction); real cores are superscalar and can issue several, but the simple model is enough for understanding scheduling. CPUs are typically measured by the number of clock cycles they go through in a second, so a 2.5 GHz core executes on the order of 2.5 billion instructions per second.
The OS (the operating system, like Windows, Linux, macOS, Android, iOS, etc.) is responsible for launching programs and giving them access to the hardware resources. Each program can be considered a "process".
Each process can launch multiple threads.
To ensure that multiple processes can share the same hardware resources, the idea of pre-emptive multitasking came about over 40 years ago.
In a nutshell, pre-emptive multitasking, or time-slicing, is a function of the OS. It basically gives a few milliseconds to each thread that is running, regardless of which process that thread is a part of, and keeps the "context" of each thread so that the state of each thread can be handled appropriately when it's time for that thread to run again; that is also known as a context switch.
A dual-, quad-, or even 128-core CPU does not change that, nor does the number of CPUs in the system (e.g., 4 CPUs each with 128 cores). Each core still executes only 1 instruction per clock cycle in this simplified model.
What changes is how many instructions can be run in true parallel. If my CPU has 16 cores, then that means it can execute 16 instructions per clock cycle, and thus run 16 separate threads of execution without any context switching being necessary (though it does still happen, but that's a different issue).
This doesn't cover hyper-threading, in which 1 core can run 2 hardware threads, essentially doubling your logical CPU count, and it doesn't cover cache misses or other low-level effects where extra cycles are spent on a thread, but it covers the general idea of CPU scheduling.
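A practical consequence of the above is to size a pool of CPU-bound workers to the core count and let the OS time-slice anything else. A minimal sketch (busy() is just a stand-in for real CPU-bound work):

# Match the number of CPU-bound workers to the number of logical CPUs.
import os
from concurrent.futures import ProcessPoolExecutor

def busy(n):
    return sum(i * i for i in range(n))   # stand-in for real CPU-bound work

if __name__ == "__main__":
    workers = os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(busy, [2_000_000] * workers))
    print(f"ran {workers} CPU-bound tasks on {workers} worker processes")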

Is it reasonable to use all the cores on the compute nodes?

I have a small question.
The compute node has 2 sockets, with 12 cores per socket, so it has 24 cores (24 CPUs in my case).
When I run a parallel computation, can I use all the CPUs? In other words, do we need to spare several CPUs for background programs?
BTW, I think using the CPUs on the same chip (same socket) avoids communication between sockets, which could speed up the run. So how do I determine how many CPUs should be used to get the fastest run?
Any general suggestions on this issue would be appreciated.
Best,
To answer your question: yes, you can use all cores for a parallel job or program. Depending on what programs you run in the background, you might see some performance drop during the execution of your job/program. The real way to determine the optimal number of cores is to do several runs with different numbers of cores and analyze the performance of your program and of the programs in the background. If you want to take full advantage of all of your cores, I would recommend just running your program on all cores with little to no background programs running.
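A sketch of that kind of measurement (./my_solver and its --threads flag are hypothetical placeholders for your actual program and its thread/process-count option):

# Run the same job with different core counts and record wall-clock time.
import subprocess
import time

for ncores in (6, 12, 18, 24):
    start = time.perf_counter()
    subprocess.run(["./my_solver", "--threads", str(ncores)], check=True)
    print(f"{ncores:2d} cores: {time.perf_counter() - start:.1f} s")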

Considerate, dynamic CPU load management

I am writing a CPU-intensive image processing library. To make the best use of the available CPU, I can detect the total number of cores on my machine and have my library run with that number of threads. When my library allocates one thread for each core, it performs optimally, using 100% of the available processor time.
The above approach works fine when mine is the only CPU-heavy process running. If another CPU-intensive process is running, or even another instance of my own code, then the OS allocates us only a fraction of the available cores, and my library ends up with too many threads running, which is both inefficient and inconsiderate to other processes.
So I would like to find a way to determine the "fair share" number of threads to run given a specific load. For example, if two instances of my process are running on an 8-core machine, each would run with 4 threads. Each would need a way to adapt thread count dynamically according to fluctuations in machine load.
So, my question:
Is there any OS feature or third-party library which allows my process to adapt thread count dynamically to use its fair share of the CPU?
My focus is Windows, but I'm interested in non-Windows solutions too.
Edit: to be clear, this is about optimization. I am trying to achieve peak efficiency by running the optimal number of threads appropriate to my fair share of the CPU.
In my eyes, the application shouldn't decide how many threads to spawn. That is information the caller should provide. On Linux, the "-j" or "--jobs" parameter is widely used for this (default: 1).
What about also setting the priority of the processing tasks? If the caller knows the processing is mission-critical, they can increase the priority (accepting that it may block the whole system). Your processing lib can never know how important the processing of a given image is.
If the caller doesn't care, then the default low priority is used, which shouldn't affect the rest of the system. If it does, you should look at what exactly is blocking the system (maybe writing image files to the HDD, or RAM usage causing swapping, ...). Once you have figured that out, you can optimize exactly that point.
If you start the processing with (CPU cores) * 2 threads at low to normal priority, your system should stay usable. No one would expect that to kill the system.
Just my 2 cents.
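As a sketch of the low-priority idea (Unix-only here, using os.nice in a pool initializer; on Windows you would set the process priority class instead, e.g. via the psutil library; process_image is a hypothetical stand-in for the real work):

# Run CPU-heavy workers at a lower priority so they yield to the rest of the system.
import os
from concurrent.futures import ProcessPoolExecutor

def lower_priority():
    os.nice(10)   # larger nice value = lower scheduling priority (Unix-only)

def process_image(path):
    return path   # placeholder for the real CPU-heavy image processing

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=os.cpu_count(), initializer=lower_priority) as pool:
        list(pool.map(process_image, ["a.png", "b.png", "c.png"]))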
Actually it's not a problem of multithreading but a problem of executing many programs simultaneously. This is hard on most PC operating systems because it conflicts with the idea of time-sharing.
Let's assume some workflow.
Suppose we have 8 cores and we create 8 threads to feed them; OK, that's easy. Next we choose to monitor core load to summarize how many tasks are running on each core; that needs some statistical assumptions, e.g., on Linux you can get a 1/5/15-minute load average, but it can be done. Say the statistics are clear and we now have a picture of how many CPU-bound processes are running; for example, we see 3 other CPU-intensive processes.
Then we come to the point: we have to put 3 redundant threads of ours to sleep, but which 3?
Usually we choose 3 threads arbitrarily, because the scheduler arranges the remaining CPU-bound threads automatically. In some cases, we explicitly put threads on heavily loaded cores to sleep, assign other threads to certain lightly loaded cores, and let the scheduler handle the rest. Most scheduling policies also try to "keep the CPU cache hot", which means they tend to avoid migrating threads between cores. We can reasonably expect our CPU-intensive threads to make good use of their cores' caches, since the other processes are scheduled onto the 3 crowded cores. Everything looks good.
However, this can fail for tightly synchronized computations. In that scenario we need to run our 5 threads simultaneously: they have to get the CPU and run at almost the same time. I don't know whether any scheduler on a PC can do this for us. In most low-load cases things still work fine, because the cost of waiting for simultaneity is trivial; but when a core's load is high and even 1 of our 5 threads is disturbed, we'll occasionally find ourselves spending many cycles waiting.
It may help to schedule your program as a real-time program, but it's not a perfect solution. Statistically, gaining more CPU priority widens the time window in which the threads can line up. I have to say, it's not guaranteed.
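For the "fair share" estimate itself, one rough heuristic is to poll the system load periodically and target only as many worker threads as there are cores not already busy with other work. A sketch (Unix-only, since os.getloadavg is unavailable on Windows; the formula is an assumption, not an OS facility):

# Estimate how many cores are busy with *other* work from the 1-minute load
# average, and target the remainder as our thread count.
import os
import time

def fair_share(current_workers):
    cores = os.cpu_count() or 1
    load = os.getloadavg()[0]                  # ~ runnable tasks, including our own
    other_load = max(0.0, load - current_workers)
    return max(1, cores - round(other_load))

workers = os.cpu_count() or 1
for _ in range(10):                            # real code would resize a thread pool here
    workers = fair_share(workers)
    print(f"targeting {workers} worker thread(s)")
    time.sleep(30)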

Running multiple threads on a CPU

We all know that the JVM schedules user threads on a single-CPU machine. Why can't a single CPU run multiple processes/threads in parallel? What is the constraint that prevents that capability?
Also, the JVM is just another piece of software running on the machine. There may be thousands of other programs waiting for CPU cycles at any given time; amid all this, how do the JVM's threads get scheduled onto the CPU? What parameter determines the speed/likelihood of cycles being allocated to any given process on any machine?
This is not really a Java question, but a CPU architecture question.
And some CPUs DO run multiple threads in parallel per core. Look at Intel and Hyper-Threading: a 4-core machine with 8 threads does the opposite of what you suggest.
Traditional single-core processors can only process one instruction at a time, meaning that they can only work on a single thread at any one point in time.
Multithreading support is achieved synthetically by giving threads 'turns' on the CPU so that they appear to be running concurrently.
Multi-core processors can process one instruction per core at any one point in time.
This question relates more to CPU hardware design than to programming, and certainly not to a single language such as Java, since the restriction applies across the board.

Resources