How can I make full use of a multithreaded CPU in MATLAB?

I just bought the MATLAB Parallel Computing Toolbox.
The command matlabpool open starts as many parallel workers as my CPU has cores.
But each of my CPU cores has two threads. According to Windows Task Manager, each worker can use only half the performance of one core, which suggests that one worker = one thread = "half a core".
Therefore, even after all workers are open, only half of the CPU's total power can be used.
Is there another command that could help with that?

By default, the local cluster type for matlabpool considers only "real" cores when choosing the default number of workers to launch. This is because for MATLAB workloads, hyperthreading often does not provide much benefit. However, this value is only a default - you can edit the cluster type and run anything up to 12 local workers.

You need to understand HyperThreading to answer this question.
Matlab launches a worker thread for every CPU. Suppose you now use a directive like parfor to distribute computation over multiple threads. Every thread will now be crunching numbers happily.
Suppose you are doing a sum of a large vector of numbers. What actually happens is the following:
sum = sum + a[0]
array a is not in my CPU cache yet
I will fetch a small part of a from main memory and put it in the CPU cache
sum = sum + a[1]
sum = sum + a[2]
...
During the fetch of a, the CPU stalls, waiting for the system memory. This is called a pipeline bubble, and it is not good for performance. Sometimes, a part of the array a was swapped out to the hard drive. The operating system will need to access the drive to put that part into main memory, after which it will be transferred to the CPU cache. When this happens, your operating system will not let the CPU wait for +200 ms. It will use that time to execute another task instead (like the backup running on your system, or refreshing your screen, or ...).
Switching tasks on a CPU results in a performance penalty. To switch to a different task, the operating system must save the CPU registers in main memory, and load the CPU registers of the other task back into the CPU first. This takes time.
With HyperThreading, the number of registers per CPU is doubled. This means that two processes can 'occupy' the CPU. Only one can be executed, but during a stall, the operating system can switch to the second process without any performance penalty.
Forget how Microsoft Windows reports CPU usage. It's misleading here. CPU usage is a lot more complicated than a single figure like 47%. The real question is rather: should MATLAB register two threads per core, or only one?
Arguments pro:
During a stall, the CPU can quickly switch to the other thread and continue executing.
Arguments contra:
There are more threads, and the problem is divided in smaller pieces. This may actually reduce performance, as you need to put more pieces together to get the final result.
A context switch will still 'poison' the L1 and L2 cache, loading in pieces of memory that are of no use to the other thread on the CPU.
If there are no stalls, you have more overhead.
On a desktop, the operating system will also want to run: redrawing the screen, moving your mouse, etc. When all logical CPUs are in use, the operating system is forced to do an actual (costly) context switch.
Your problem will only be complete if all pieces of the problem have been calculated. Using all the cores / threads increases the risk of one thread taking more time.
My guess is that the MATLAB developers considered the arguments contra to be more important than the arguments pro. My own performance tests certainly suggest that there is little performance gain from HyperThreading for CPU-intensive calculations.

Related

Multi threaded vs multi process design approach for cpu intensive applications

We have to design a system that runs parallel algorithms in iterations and synchronizes after certain steps, a kind of fork-join model. Synchronizing every few steps is required to exchange data via shared memory before continuing with the next iterations.
The loop(s) will continue until a user-specified time.
One loop acts as the controller that coordinates the sync points (a spinlock in our case).
The goal is also to run as many iterations as possible (no sleep) in these code paths.
When we modeled the above behavior with multiple processes versus multiple threads, the threads did not scale as well as the processes.
This is not a memory-intensive application. The C++ code shows a similar pattern on both Windows and Linux.
In the first design, the controller is in one application and manages the spinlock, and 3 other applications are launched, each waiting for its respective spinlock. In the second design, the same logic is deployed as multiple threads in one application.
The benchmark for our design is to maximize the number of sync points reached in a given time.
As I increase the number of processes or threads, performance degrades, but the degradation is worse for threads. Even though 5 cores are 100% loaded in both cases, threads perform badly beyond 4.
Our plan is to keep a maximum of 6 threads.
To eliminate context-switch overhead, we tried Boost fibers, but the results were not promising.
Why are threads not performing on par with multiple processes?
We ran the tests on an Intel i7 desktop with the same configuration for Windows and Linux.
You might want to check cache hit rate and context switches.
A process has its own memory space and therefore its own cache region near the processor that it is running on. It may be that threads, since they share memory space, have to deal with the fact that the leading cache is near one processor and further away from the other (L1 hits vs L2 hits vs L3 hits). Not all cache hits are the same.
You may also want to check how many context switches occur, that is, how often a process is scheduled and unscheduled. You will want to minimize that.
And then there is the problem that a rescheduled process may end up on a different processor, which then has "the wrong cache" in front of it. Some kernels have an "affinity" function to decide where a rescheduled process should be placed. But that may not work for threads. I am not sure about that.
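As a rough way to check the context-switch count mentioned above, here is a minimal Python sketch. It assumes a Unix-like system where the standard resource module is available (it is not on Windows), and the helper names context_switches and measure are made up for illustration. The original benchmark is in C++, so this only shows the idea of sampling the counters before and after a piece of work.

```python
try:
    import resource  # standard library, Unix-only
except ImportError:
    resource = None


def context_switches():
    """Return (voluntary, involuntary) context switches of this process, or None."""
    if resource is None:
        return None
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def measure(work):
    """Run work() and return how many context switches happened meanwhile."""
    before = context_switches()
    work()
    after = context_switches()
    if before is None or after is None:
        return None  # counters not available on this platform
    return after[0] - before[0], after[1] - before[1]


if __name__ == "__main__":
    # Hypothetical CPU-bound workload used only as a stand-in.
    print(measure(lambda: sum(range(1_000_000))))
```

A high involuntary count during a CPU-bound run would suggest the scheduler is preempting the work, which is the situation the answer recommends minimizing.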

Is the improvement in execution time from multi-threading limited by the number of physical cores?

I was doing some testing with multi-threading on a Linux virtual machine. I implemented a benchmark with 10 threads (in this application each instruction is executed 10 times as often as in the single-threaded scenario), and I varied the number of "physical cores" in the VM settings. In the single-threaded case I get 3s on average, independently of the number of cores. If the number of cores is set to 1 and I run the multi-threaded version, the execution time is 30s. With 2 cores I get 15s, and with 8 cores (the maximum I can set) I get 6s. Do I see this dependency because I am executing each instruction 10 times as often, or is it always like this?
If you have N threads running on N cores, and if they are all doing pure computation (i.e., not waiting for any I/O devices), and if they are all completely independent of each other, then they should be able to do N times as much work in a given amount of time as a single thread can do in the same amount of time.
But, that's if they are completely independent. That's a hard thing to achieve. For example, if the threads can't each do all of their work in their own, independent cache (e.g., in L1 cache,) then they will compete with each other for access to the main memory. They will sometimes have to wait for one another, because only one core can access main memory at any given moment. So, if the threads need to use memory, then the speedup will be somewhat less than N times.
If the threads need to share data in main memory, then it gets worse because then they will need to use mutual exclusion locks. One thread may keep a lock locked while it executes dozens of instructions, and any other thread that wants the same lock will have to wait until it is finished.
If the threads need to synchronize with each other/communicate with each other, then it gets worse still because unless their work loads are carefully balanced, a thread with less work to do may spend long periods of time awaiting signals from threads that have more work to do.
It's not unusual for a novice programmer to invent a multi-threaded version of some single-threaded algorithm, and find out that the multi-threaded version actually is slower than the single-threaded version.
There are some algorithms for which even an expert programmer can't get much speedup by throwing more threads at them.
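The timings reported in the question (30s, 15s, 6s) are consistent with a very simple model. The sketch below assumes that the 10 threads are independent, that each needs about 3s of CPU time, and that each core runs one thread to completion before taking the next. The function name and these assumptions are illustrative and not taken from the actual benchmark; a real scheduler may migrate threads and do somewhat better.

```python
import math


def estimated_runtime(num_threads, num_cores, seconds_per_thread):
    """Estimate wall-clock time if each core runs one thread at a time."""
    # Threads are processed in "rounds" of at most num_cores threads each.
    rounds = math.ceil(num_threads / num_cores)
    return rounds * seconds_per_thread


if __name__ == "__main__":
    for cores in (1, 2, 8):
        print(cores, "cores:", estimated_runtime(10, cores, 3.0), "s")
```

With 8 cores, 10 threads still need two rounds, which would explain why the result is 6s rather than something close to 3s. With 10 or more cores the same model predicts 3s.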

Context switch: what happens in a worst case scenario?

I want to understand how a certain worst case scenario of context switch happens. Say I have 10 CPU cores running a single process. Everything is CPU intensive, no thread is sleeping (waiting for I/O).
(I am mainly concerned with mainstream modern personal computer architectures and systems, typically x64 with Windows, Linux...)
Correct me if I'm wrong: running 10 CPU/RAM intensive independent threads is most often a near optimal situation. The amount of time spent in context switch is rather negligible. While the system may sometimes decide to re-attribute threads to different cores in a round-robin fashion causing a reset of RAM caches, it has a minor effect and works almost as if each thread was running on a single fixed core.
Only the main RAM bus may be a limitation since all threads share it, but it's not the point I'm interested in here. Reducing the number of threads will not increase the throughput anyway.
Now assume you still have 10 cores but run 1000 threads. The scheduler could theoretically decide to switch rarely (say every second) running 10 threads for a second, then 10 others... and the whole thing would still be close to optimal performance (throughput).
But it does not seem to be the case and it looks like threads are switched intensively causing a strongly suboptimal performance (throughput). Am I right about it? What is the main cause for this suboptimal performance? A few numbers would be nice if you have any idea of orders of magnitude of (for example): switches per second, performance loss caused by switching...
I'm going to answer my own question (after some search).
On Windows, the number of context switches can be measured with performance counters: https://technet.microsoft.com/en-us/library/cc938606.aspx
I measured it on my machine (Core i7/Windows 10) and the order of magnitude is around 1000/s per core when there are more running threads than cores (and these threads are fully CPU-bound).
The time needed for a context switch varies quite a bit depending on:
what registers need to be saved
if FPU registers need to be saved
the processor model (of course)
You can read: https://www.quora.com/How-long-does-a-context-switch-take or http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
A slightly pessimistic avg. order of magnitude seems to be 1000 ns. Thus the total time for all context switches on each core is 1ms per second, that is 0.1%.
This does not depend on the number of threads: whether you run 100 or 1000 threads, the number of switches does not change. In conclusion, the time spent in context switching itself is fairly negligible.
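The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic, using the measured 1000 switches per second per core and the assumed 1000 ns per switch; the function name is made up for illustration.

```python
def context_switch_overhead(switches_per_second, nanoseconds_per_switch):
    """Return the fraction of each second that one core spends switching."""
    seconds_lost_per_second = switches_per_second * nanoseconds_per_switch * 1e-9
    return seconds_lost_per_second


if __name__ == "__main__":
    overhead = context_switch_overhead(1000, 1000)
    print(f"{overhead:.1%}")  # 1 ms lost per second
```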
This reasoning is correct as long as the threads are pure CPU with only small memory read/write like a few local variables. I ran a test with full CPU threads and the difference between a few and 1000 threads is not noticeable.
But the situation changes when RAM is involved and switches make the CPU (memory) cache less efficient. A worst case is when:
computation can be split into 1000 independent "data" parts
each part of the data fits just into the memory cache (say L1 or L2) of a core
each part needs to be read many times
In this situation, running 10 threads to completion, then ten others, and so on, would take full advantage of the cache, while running 1000 threads at a time would cause the cache to be useful for only about 1ms at a time.
But if the data of several threads could fit into the cache, or if the threads read common data to some degree, or if each thread reads the data just once, then it is possible that running 1000 threads vs. running 10 threads a hundred times will have similar throughput.
It is more a matter of adapting parallelism to memory access, and it depends very much on how memory needs to be accessed.
The time spent in context switching is negligible. The time lost because of poor cache usage may or may not be a problem, depending on how memory is accessed and shared.
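One way to get the "10 threads to completion, then ten others" behaviour is to keep the number of worker threads equal to the number of cores and feed them the 1000 parts one at a time. The Python sketch below only illustrates that structure. The function process_part is a hypothetical stand-in for the real work, and in CPython the global interpreter lock keeps these threads from computing in parallel, so no cache or speed measurements should be read into it.

```python
from concurrent.futures import ThreadPoolExecutor


def process_part(part):
    """Hypothetical work item: reads the same chunk of data several times."""
    total = 0
    for _ in range(5):
        total += sum(part)
    return total


def run_with_bounded_workers(parts, max_workers):
    """Process all parts with at most max_workers threads alive at once.

    Each worker finishes one part before picking up the next, so at most
    max_workers parts compete for the caches at any moment.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_part, parts))


if __name__ == "__main__":
    parts = [list(range(i, i + 100)) for i in range(1000)]
    results = run_with_bounded_workers(parts, max_workers=10)
    print(len(results))
```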

Why does hyper threaded or Multi-threaded CPU matter?

A single CPU can execute only one instruction at a time. Basically, what a multi-threaded CPU does is switch back and forth between multiple threads within a single core. Since a single-threaded, single-core CPU can already multitask by context switching between processes, why does a multi-threaded CPU matter?
You're mixing up quite a few things here ...
First of all: hardware threads have next to nothing in common with software threads. As far as I know, there can only be n hardware threads on a CPU, where n is the number of real or virtual CPU cores (an ALU, for example).
Context switching is done to allow the illusion of parallelism on one single core.
Now: since there are hardly any CPUs without several cores anymore, practically every CPU supports MT, which effectively enables somewhat real parallelism. Multiple calculations can be done at the same time, yet the result has to be pipelined.
Modern CPUs even simulate additional cores. That's possible because there is a time gap between result delivery and command dispatch, AFAIR, and this gap can be used for additional calculations. That's called hyperthreading, and it can boost your performance a bit.

Considerate, dynamic CPU load management

I am writing a CPU-intensive image processing library. To make the best use of the available CPU, I can detect the total number of cores on my machine and have my library run with that number of threads. When my library allocates one thread per core, it performs optimally, using 100% of the available processor time.
The above approach works fine when mine is the only CPU-heavy process running. If another CPU-intensive process is running, or even another instance of my own code, then the OS allocates us only a fraction of the available cores and my library then has too many threads running which is both inefficient and inconsiderate to other processes.
So I would like to find a way to determine the "fair share" number of threads to run given a specific load. For example, if two instances of my process are running on an 8-core machine, each would run with 4 threads. Each would need a way to adapt thread count dynamically according to fluctuations in machine load.
So, my question:
Is there any OS feature or third-party library which allows my process to adapt thread count dynamically to use its fair share of the CPU?
My focus is Windows but interested in non-Windows solutions too.
Edit: to be clear, this is about optimization. I am trying to achieve peak efficiency by running the optimal number of threads appropriate to my fair share of the CPU.
In my eyes, the application shouldn't decide how many threads to spawn. This is information that the caller should know. On Linux, the "-j" or "--jobs" parameter is widely used (default: 1).
What about also setting the priority of the processing tasks? If the caller knows the processing is mission-critical, they can increase the priority (knowing that this may block the whole system). Your processing library would never know how important the processing of this image is.
If the caller doesn't care, then the default low priority is used, which shouldn't affect the rest of the system. If it does, you should look at what exactly is blocking the system (maybe writing image files to the HDD, or RAM usage that should be reduced to prevent swapping, ...). Once you have figured that out, you can optimize exactly that point.
If you start the processing with (cpu-cores)*2 threads at low to normal priority, your system should remain usable. No one would expect this to kill the system.
Just my 2 cents.
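A minimal sketch of the "-j"/"--jobs" convention, written in Python with the standard argparse module. The tool description and the build_parser name are hypothetical; the point is only that the caller chooses the thread count and the default stays at 1.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical image-processing command-line tool."""
    parser = argparse.ArgumentParser(description="Hypothetical image processor")
    parser.add_argument(
        "-j", "--jobs",
        type=int,
        default=1,
        help="number of worker threads to use (default: 1)",
    )
    return parser


if __name__ == "__main__":
    # The caller decides: here 4 jobs are requested explicitly.
    args = build_parser().parse_args(["-j", "4"])
    print(args.jobs)
```

The library itself would then accept the job count as a plain parameter and leave the policy to whoever calls it.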
Actually it's not a problem of multithreading but a problem of executing many programs simultaneously. This is hard on most PC operating systems because it conflicts with the idea of time-sharing.
Let's assume some workflow.
Suppose we have 8 cores and we create 8 threads to feed them; ok, that's easy. Next we choose to monitor the load on each core to estimate how many tasks are running on it. That needs some statistical assumptions (e.g. on Linux you can get 1/5/15-minute load averages), but it can be done. With those statistics we get a picture of how many CPU-bound processes are running; say we see 3 other CPU-intensive processes.
Then we come to the point: we have to make 3 redundant threads to sleep, but which 3?
Usually we choose 3 threads arbitrarily, because the scheduler arranges the remaining 8 CPU-bound threads automatically. In some cases, we explicitly put the threads on high-load cores to sleep, assign the other threads to certain low-load cores, and let the scheduler do the rest. Most scheduling policies also try to "keep the CPU cache hot", which means they tend to avoid moving threads between cores. We can reasonably expect our CPU-intensive threads to make good use of the core caches, since the other processes are scheduled onto the 3 crowded cores. Everything looks good.
However, this can fail in tightly synchronized computation. In this scenario we need to run our 5 threads simultaneously. Simultaneity here means the 5 threads have to get the CPU and run at almost the same time. I don't know if there is any scheduler on a PC that can do this for us. In most low-load cases, things still work fine because the cost of waiting for simultaneity is trivial. But when the load on a core is high and even 1 of our 5 threads is disturbed, we will occasionally find that we spend many cycles waiting.
It may help to schedule your program as a real-time program but it's not a perfect solution. Statistically it leads to a wider time window for simultaneity when it gains more CPU control priority. I have to say, it's not guaranteed.
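The workflow above (8 cores, 3 other CPU-bound processes, so 5 threads remain for us) could be sketched as follows. This is only an illustration under assumptions: the function names are made up, the 1-minute load average is used as a crude stand-in for "cores kept busy by others", and os.getloadavg is not available on every platform. Note that the load average also counts our own running threads, so a real implementation would have to subtract them.

```python
import os


def suggest_thread_count(total_cores, other_load):
    """Suggest how many threads to run, leaving other_load cores to others."""
    free_cores = total_cores - round(other_load)
    return max(1, free_cores)  # always keep at least one thread running


def current_suggestion():
    """Apply the heuristic to this machine, falling back to zero load."""
    cores = os.cpu_count() or 1
    try:
        one_minute_load, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        one_minute_load = 0.0  # load average not obtainable here
    return suggest_thread_count(cores, one_minute_load)


if __name__ == "__main__":
    print(suggest_thread_count(8, 3.0))  # the example from the text: 5 threads
    print(current_suggestion())
```

As the answer points out, such a heuristic does nothing for tightly synchronized threads: it only sizes the pool and cannot make the scheduler run all of them at the same moment.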

Resources