Increasing number of threads decreases cpu performance

Increasing number of threads decreases cpu performance - multithreading

I have written a program that tries to do some calculations. It can be run any number of cores.
Below it's a breakdown on how many calculations are performed if it's run on 1,2,3,4 cores (on a 4 logical processors laptop). The numbers in parenthesis show calculations per thread/core. My question is why the performance decreases very rapidly then number of threads/cores increases ? I dont expect the performance to double but it's significantly lower. I also observer the same issue when running 4 instance of the same program setup to run on one thread so i know that it's not an issue with the program it self.
The greatest improvement is going form 1 thread to 2 why is that ?
# Threads | calculations/sec
1 | 87000
2 | 129000 (65000,64000)
3 | 135000 (46000,45000,44000)
4 | 140000 (34000,34000,34000,32000)
One interesting thing is that exactly the same issue i can see on Google Cloud Platform in Compute Engine, when I run the program on 16 threads on 16 virtual cores platform the performance of each core drops down to like 8K states per second. However if I run 16 instances of the same program each instance does around 100K states /s. When I do the same test but on 4 cores on my home laptop i can still see the same drop of performance whether I run 4 separate .exe of one with 4 cores, I was expecting to see the same behavior on GCP but it's not the case, is it because of virtual cores ? do they behave differently.
Example code that reproduces this issue, just need to paste to your console application, but need to refresh the stats like 20 times for the performance to stabilize, not sure why it fluctuates so much. code example you can see that if you run the app with with 4 threads the you get significant performance impact compared to 4 instances with 1 thread each. I was hoping that that enabling gcServer will solver the problem but did not see any improvement

Related

What is the load on the system

I have a Red hat server where I can see the load average on the system is 23 24 23 (1min 5min 15min) using the top command. And i can see in /proc/cpuinfo there are 24 processors entry (0-23). But in each processor entry the cpu cores value is 6 and in each processor entry the physical id is either 1 or 0.
I want to know if my system is overloaded. Can anyone please tell me.

It seems you have a system with two processors, each with 6 cores. Each of the cores can likely run hyperthread => 2 x 6 x 2 = 24. In /proc/cpuinfo, top etc, you will see each hyperthread: that's the number of parallel processes or threads that your hardware can run.
The quick answer is that your system is probably not overloaded and that it processes a relatively stable amount of work over time (as the 1, 5 and 15 minute values are about the same). A rule of thumb is that the load average value should stay below the number of hyperthreads -- this is not, however, exactly true.
You'll find a more in-depth discussion here:
https://unix.stackexchange.com/questions/303699/how-is-the-load-average-interpreted-in-top-output-is-it-the-same-for-all-di
and here:
https://superuser.com/questions/23498/what-does-load-average-mean-on-unix-linux
and perhaps here:
https://linuxhint.com/load_average_linux/
However, please keep in mind that load average does not tell you everything about your system -- though it is usually a pretty good indicator in my experience. You'd have to check many other factors to determine overloadedness, like memory pressure, I/O wait times, I/O bandwidth utilization. It also depends on what kind of processing the system is performing.

Local and Global size influence on program execution - OpenCl

After reading a lot of definitions regarding global work size and local work size I still don't really understand what they are and how they work.
I think that global work size determine how many times kernel function will be called, but local work size?
I thought that local work size determine how many threads are gonna be used in the same time in parallel, but am I really correct?
Is local size a number of threads executing one kernel program per one global size value? I mean when we have global size = 1 and local size = 1, then kernel function will be called one time and only one thread will be working on it.
But when we have Global Size = 4096 and local size (if allowed that high) is 1024 then we have 4096 calls of kernel function and each call have 1024 threads working on it at the same time? Am I correct?
Here is some example code i found:
and my another question is: how local size change influence that code?
As i see it is clearly working on global_id's, no local one's so is local size change to bigger one than lets say 1 will influence time spent executing that algorithm?
And when we would have for loop in that algorithm, is it changing anything then regarding local size influence? Do we need to use local_id's to see any difference when changing local size?
I tested that on few of my programs, and even when I used only global_id's changing local work size gave me significantly shorter executing times.
So how does it work? I don't get it.
Thank you in advance!

I thought that local work size determine how many threads are gonna be
used in the same time in parallel, but am I really correct?
Correct but it is per compute unit, not whole device. If there are more compute units than local thread groups, then device is not fully used. When there are more thread groups than compute units but not exact multiple, some compute units wait for other at the end. When both values equal(or exact multiple), then "how many times" is important to fully occupy all ALUS.
For example a 8-core cpu could define 8 compute units(maybe +8 more with hardware multithreads). But a GPU with similar price can have 20 to 64 compute units. Then, even within a single compute unit, many groups of threads can be "in-flight" which is not explicitly tuned but changed by resource usage per thread and per compute unit and maybe per gpu.
how local size change influence that code? As i see it is clearly
working on global_id's, no local one's so is local size change to
bigger one than lets say 1 will influence time spent executing that
algorithm?
Vectorizable/parallelizable kernel codes could have advantage of distributing threads to ALUs, SIMDs of a core or wider SIMDs of a gpu compute unit. For a CPU, 8 scalar instructions could be issued at the same time. For a GPU, it could be as large as thousands. So when you decrease local size to 1, you limit width of parallel thread issue to 1 ALU which cripples performance for many architectures. When you make local size too big, resource per thread falls and performance takes a hit. If you don't have any idea, opencl api can tune local size for you if you give a null to its parameter.
And when we would have for loop in that algorithm, is it changing
anything then regarding local size influence? Do we need to use
local_id's to see any difference when changing local size?
For old and static scheduling architectures, loop unrolling is advised with a unroll step size equal to width of basic SIMD width. No, local id is just a query of a threads id in its compute unit so no need to query if you don't need it.
I tested that on few of my programs, and even when I used only
global_id's changing local work size gave me significantly shorter
executing times. So how does it work?
If kernel needs insane resources, you could think of 1 thread per local group. If kernel doesn't need any resource except immediate values, you should make it maximum local value. Resource allocation per thread(because of kernel codes) is important. New architectures have load balancing so it may not matter in future if you let api choose the optimum value.
To keep all ALUs busy, scheduler issues many threads per core, when one thread is waiting for memory operation, another thread can do ALU operation at the same time. This is good when resource usage is small. When you use %50 of all resources of a compute unit, it can have only 2 threads in flight. Threads share sharable resources such as L1 cache,local memory,register file.
Codes such as c[i]=a[i]+b[i] for scalar floats, are vectorizable. You can have better performance using float8,float16 and similar structs if compiler is not already doing it in background. This way it needs less threads to accomplish all work and also accesses to memory is faster. You can also add a loop in kernel to decrase number of threads even more, which is good for CPU since less thread dispatching is needed between 2 data blocks. For GPU, it may not matter.
Trivial example for a CPU:
4 core, local size = 10, global size = 100
core 1 and 2 have 3 thread groups each. Core 3 and 4 have only 2 thread groups.
1: 30 threads --> fully performant
2: 30 threads
3: 20 threads --> less performant, better preemption for other jobs
4: 20 threads
while instruction pipelining doesn't have much bubbles for cores 1 and 2, bubbles start after some time for cores 3 and 4 so they can be used for other jobs such as a second kernel running in parallel or operating system or some array copying. When you use all cores equally such as for 120 threads, then they finish more work per second but CPU cannot do array copies if kernels already using memory.(unless OS does preemption for other threads)

What Would Happen if CPU Load Average is High

I read some articles about the CPU load average. They were talking about the definition, the differences between the CPU usage, and the optimal value (roughly equals to the number of cores). They also mentioned that if the number is high, you will be in trouble (waking up at mid-night etc.), but what would actually be happening if the number is high?
For example, I have been running 4, 6 and 8 sessions on a 4 core Linux server. Although the time it took to finish the task were different (4 fasted, 8 slowest), the results seem OK. The CPU load averages were roughly 4, 8 and 10. I understand that 10 might not be a good number, but then what?

It's just that: if you run absurdly high load averages, the overall efficiency will suffer: the CPU processing power will go to waste.
This is caused by several factors; the most immediate being more CPU time needed for scheduling the competing tasks. One not at all insignificant factor is that several competing processes will also overutlize the CPU cache; each task switch effectively throwing out the cache contents and replacing them with new ones. Further choke points come in forms of bottlenecks in memory and storage bandwidths.

Matlabpool number of threads vs core

I have a laptop running Ubuntu on Intel(R) Core(TM) i5-2410M CPU # 2.30GHz. According to Intel website for the above processor (located here), this processor has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?

In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from #RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, therefore with one Matlab thread (ie one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a core-i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

Comparing CPU speed likely improvements for business hardware upgrade justification

I have c# Console app, Monte Carlo simulation entirely CPU bound, execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores/threads).
It currently runs daily on:
AMD Opteron 275 # 2.21 GHz (4 core)
The app is multithread using 3 threads, the 4th thread is for another Process Controller app.
It takes 15 hours per day to run.
I need to estimate as best I can how long the same work would take to run on a system configured with the following CPU's:
http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540
and compare the cases, I will recode it use the available threads. I want to justify that we need a Server with 2 x x5570 CPUs over the cheaper x5540 (they support 2 cpus on a single motherboard). This should make available 8 cores, 16 threads (that's how the Nehalem chips work I believe) to the operating system. So for my app that's 15 threads to the Monte Carlo Simulation.
Any ideas how to do this? Is there a website I can go and see benchmark data for all 3 CPUS involved for a single threaded benchmark? I can then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on if necessary.
Note the business are also dictating the workload for this app over the next 3 months will increase about 20 times and needs to complete in a 24 hour clock.
Any help much appreciated.
Have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 hopefully they can better explain their benchmarking so I can effectively get a score per core which would be much more helpful.

have you considered recreating the algorithm in cuda? It uses current day GPU's to increase calculations like these 10-100 fold. This way you just need to buy a fat videocard

Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp
The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml
Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?
You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.
It sounds like you've spent a lot of time optimizing the the calculation thread, but is every calculation being performed actually important to the final result?
Is there a way to shortcut calculations anywhere?
Is there a way to identify items which have negligible effects on the end result, and skip those calculations?
Can a lower resolution model be used for early iterations with detail added in progressive iterations?
Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.

tomshardware.com contains a comprehensive list of CPU benchmarks. However... you can't just divide them, you need to find as close to an apples to apples comparison as you can get and you won't quite get it because the mix of instructions on your workload may or may not depend.
I would guess please don't take this as official, you need to have real data for this that you're probably in the 1.5x - 1.75x single threaded speedup if work is cpu bound and not highly vectorized.
You also need to take into account that you are:
1) using C# and the CLR, unless you've taken steps to prevent it GC may kick in and serialize you.
2) the nehalems have hyperthreads so you won't be seeing perfect 16x speedup, more likely you'll see 8x to 12x speedup depending on how optimized your code is. Be optimistic here though (just don't expect 16x).
3) I don't know how much contention you have, getting good scaling on 3 threads != good scaling on 16 threads, there may be dragons here (and usually is).
I would envelope calc this as:
15 hours * 3 threads / 1.5 x = 30 hours of single threaded work time on a nehalem.
30 / 12 = 2.5 hours (best case)
30 / 8 = 3.75 hours (worst case)
implies a parallel run time if there is truly a 20x increase:
2.5 hours * 20 = 50 hours (best case)
3.74 hours * 20 = 75 hours (worst case)
How much have you profiled, can you squeeze 2x out of app? 1 server may be enough, but likely won't be.
And for gosh sakes try out the task parallel library in .Net 4.0 or the .Net 3.5 CTP it's supposed to help with this sort of thing.
-Rick

I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:
Current Workload
3 cores * 15 real-world-hours = 45 cpu-time-hours
Proposed 20X Workload
45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores
Thus, you would need the equivalent of 45 2.2GHz Opteron cores to achieve your goal (despite increasing processing time from 15 hours to 20 hours per day), assuming a completely linear scaling of performance. Even if the Nehalem CPUs are 3X faster per-thread you will still be at the outside edge of your performance envelop - with no room to grow. That also assumes that hyper-threading will even work for your application.
The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.
Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm

It'd be swinging big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.
I'd also be weary of other bottlenecks such as memory bandwidth. I don't know the performance characteristics of Monte Carlo Simulations, but ramping up one resource might reveal some other bottleneck.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string