Local and Global size influence on program execution - OpenCl

Local and Global size influence on program execution - OpenCl - multithreading

After reading a lot of definitions regarding global work size and local work size I still don't really understand what they are and how they work.
I think that global work size determine how many times kernel function will be called, but local work size?
I thought that local work size determine how many threads are gonna be used in the same time in parallel, but am I really correct?
Is local size a number of threads executing one kernel program per one global size value? I mean when we have global size = 1 and local size = 1, then kernel function will be called one time and only one thread will be working on it.
But when we have Global Size = 4096 and local size (if allowed that high) is 1024 then we have 4096 calls of kernel function and each call have 1024 threads working on it at the same time? Am I correct?
Here is some example code i found:
and my another question is: how local size change influence that code?
As i see it is clearly working on global_id's, no local one's so is local size change to bigger one than lets say 1 will influence time spent executing that algorithm?
And when we would have for loop in that algorithm, is it changing anything then regarding local size influence? Do we need to use local_id's to see any difference when changing local size?
I tested that on few of my programs, and even when I used only global_id's changing local work size gave me significantly shorter executing times.
So how does it work? I don't get it.
Thank you in advance!

I thought that local work size determine how many threads are gonna be
used in the same time in parallel, but am I really correct?
Correct but it is per compute unit, not whole device. If there are more compute units than local thread groups, then device is not fully used. When there are more thread groups than compute units but not exact multiple, some compute units wait for other at the end. When both values equal(or exact multiple), then "how many times" is important to fully occupy all ALUS.
For example a 8-core cpu could define 8 compute units(maybe +8 more with hardware multithreads). But a GPU with similar price can have 20 to 64 compute units. Then, even within a single compute unit, many groups of threads can be "in-flight" which is not explicitly tuned but changed by resource usage per thread and per compute unit and maybe per gpu.
how local size change influence that code? As i see it is clearly
working on global_id's, no local one's so is local size change to
bigger one than lets say 1 will influence time spent executing that
algorithm?
Vectorizable/parallelizable kernel codes could have advantage of distributing threads to ALUs, SIMDs of a core or wider SIMDs of a gpu compute unit. For a CPU, 8 scalar instructions could be issued at the same time. For a GPU, it could be as large as thousands. So when you decrease local size to 1, you limit width of parallel thread issue to 1 ALU which cripples performance for many architectures. When you make local size too big, resource per thread falls and performance takes a hit. If you don't have any idea, opencl api can tune local size for you if you give a null to its parameter.
And when we would have for loop in that algorithm, is it changing
anything then regarding local size influence? Do we need to use
local_id's to see any difference when changing local size?
For old and static scheduling architectures, loop unrolling is advised with a unroll step size equal to width of basic SIMD width. No, local id is just a query of a threads id in its compute unit so no need to query if you don't need it.
I tested that on few of my programs, and even when I used only
global_id's changing local work size gave me significantly shorter
executing times. So how does it work?
If kernel needs insane resources, you could think of 1 thread per local group. If kernel doesn't need any resource except immediate values, you should make it maximum local value. Resource allocation per thread(because of kernel codes) is important. New architectures have load balancing so it may not matter in future if you let api choose the optimum value.
To keep all ALUs busy, scheduler issues many threads per core, when one thread is waiting for memory operation, another thread can do ALU operation at the same time. This is good when resource usage is small. When you use %50 of all resources of a compute unit, it can have only 2 threads in flight. Threads share sharable resources such as L1 cache,local memory,register file.
Codes such as c[i]=a[i]+b[i] for scalar floats, are vectorizable. You can have better performance using float8,float16 and similar structs if compiler is not already doing it in background. This way it needs less threads to accomplish all work and also accesses to memory is faster. You can also add a loop in kernel to decrase number of threads even more, which is good for CPU since less thread dispatching is needed between 2 data blocks. For GPU, it may not matter.
Trivial example for a CPU:
4 core, local size = 10, global size = 100
core 1 and 2 have 3 thread groups each. Core 3 and 4 have only 2 thread groups.
1: 30 threads --> fully performant
2: 30 threads
3: 20 threads --> less performant, better preemption for other jobs
4: 20 threads
while instruction pipelining doesn't have much bubbles for cores 1 and 2, bubbles start after some time for cores 3 and 4 so they can be used for other jobs such as a second kernel running in parallel or operating system or some array copying. When you use all cores equally such as for 120 threads, then they finish more work per second but CPU cannot do array copies if kernels already using memory.(unless OS does preemption for other threads)

Related

Balancing blocks, threads and workgroups?

I have an application (did not create myself) that requires three parameters
Blocks
Threads
Points (number of calcs per thread I'm assuming)
It uses OpenCL and I have an RX 580...my current efficiency is low
The GPU has 2304 modules in 36 compute units
Now I have played around with different values but I have no idea what would be the most optimal starting point because I dont know how blocks and threads relate to the compute units.
Any help would be greatly appreciated in understanding how to decide #of blocks, #of threads per block and #of calcs per thread
Thank you so much

I'm going to make the same assumptions you have:
Blocks: Number of workgroups
Thread: Number of threads
Points: Some metric of work per thread
Its more important to set the correct workgroup size rather than the number of workgroups. You will want the group size to be a minimum of the SIMD width which is usually 32 on most GPUs. So blocks should be set to Threads / 32.
For "Points". This will depend on how much work is done per "calc". There is overhead with kicking off a workgroup so you want to make sure each thread has enough work to do. From experience ~16 instructions is usually enough. But if you can't see the kernel code then you will just have to experiment.
In summary:
Set "Points" so that you have at least 2304 threads for the work you need
Set Blocks to threads / 32
All of this is assuming you have at least 2304 work items otherwise you are not fully utilising your hardware.

How to prevent two processess from fighting for a common cache?

I was asked this question on an exam. We have two CPUs, or two cores in the same CPU, that share a common cache (for example, L3). On each CPU there is an MPI process (or a thread of one common process). How can we assure that these two processes don't interfere, meaning that they don't push each others entries out or use a half of the cache each or something similar. The goal is to improve the speed of memory access here.
The OS is some sort of Unix, if that is important.

Based on your comments, it seems that a "textbook answer" is expected, so I would suggest partitioning the cache between the processes. This way you guarantee that they don't compete over the same cache sets and thrash each other. This is assuming you don't want to actually share anything between the 2 processes, in which case this approach would fail (although a possible fix would be to split the cache space in 3 - one range for each process, and one for shared data).
Since you're probably not expected to redesign the cache and provide HW partitioning scheme (unless the question comes in the scope of computer architecture course), the simplest way to achieve this is simply by inspecting the cache size and associativity, figuring our the number of sets, and aligning the data sets of each process/thread to a different part.
For example, if your shared cache is 2MB big, and has 16 ways and 64B lines, you would have 2k sets. In such case, each process would want to align its physical addresses (assuming the cache is physically mapped) to a different half 1k sets, or a different 0x10000 out of each 0x20000. In other words, P0 would be free to use any physical address with bit 16 equals 0 , and P1 would use the addresses with bit 16 equals 1.
Note, that since that exceeds the size of a basic 4k page (alignment of 0x1000), you would either need to hack your OS to assign your pages to the appropriate physical addresses for each process, or simply use larger pages (2M would be enough).
Also note that by keeping a contiguous 0x10000 per allocation, we still enjoy spatial locality and efficient hardware prefetching (otherwise you could simply pick any other split, even even/odd sets by using bit 6, but that would leave your data fractured.
Last issue is for data sets larger than this 0x10000 quota - to make then align you'd simply have to break them into chunks up to 0x10000, and align each separately. There's also the issue of code/stack/pagemap and other types of OS/system data which you have less control over (actually code can also be aligned, or more likely in this case - shared) - I'm assuming this has negligible impact on thrashing.
Again - this attempts to answer without knowing what system you work with, what you need to achieve, or even what is the context of the course. With more context we can probably focus this to a simpler solution.

How large is a way in the cache?
For example, if you have a cache where each way is 128KiB in size, you partition your memory in such a way that for each address modulo 128KiB, process A uses the 0-64KiB region, and process B uses the lower 64KiB-128KiB region. (This assumes private L1-per-core).
If your physical page size is 4KiB (and your CPU uses physical addresses for caching, not virtual - which does occur on some CPUs), you can make this much nicer. Let's say you're mapping the same amount of memory into virtual address space for each core - 16KiB. Pages 0, 2, 4, 6 go to process A's memory map, and pages 1, 3, 5, 7 go to process B's memory map. As long as you only address memory in that carefully laid out region, the caches should never fight. Of course, you've effectively halved the size of your cache-ways by doing so, but you have multiple ways...

You'll want to utilize a lock in regards to multi-thread programming. It's hard to provide an example due to not knowing your specific situation.
When one process has access, lock all other processes out until the 'accessing' process is finished with the resource.

What Would Happen if CPU Load Average is High

I read some articles about the CPU load average. They were talking about the definition, the differences between the CPU usage, and the optimal value (roughly equals to the number of cores). They also mentioned that if the number is high, you will be in trouble (waking up at mid-night etc.), but what would actually be happening if the number is high?
For example, I have been running 4, 6 and 8 sessions on a 4 core Linux server. Although the time it took to finish the task were different (4 fasted, 8 slowest), the results seem OK. The CPU load averages were roughly 4, 8 and 10. I understand that 10 might not be a good number, but then what?

It's just that: if you run absurdly high load averages, the overall efficiency will suffer: the CPU processing power will go to waste.
This is caused by several factors; the most immediate being more CPU time needed for scheduling the competing tasks. One not at all insignificant factor is that several competing processes will also overutlize the CPU cache; each task switch effectively throwing out the cache contents and replacing them with new ones. Further choke points come in forms of bottlenecks in memory and storage bandwidths.

Severe multi-threaded memory bottleneck after reaching a specific number of cores

We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads-- the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it a cache issue. The evidence is as follows:
CPU usages continues to climb from 50 to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GBs, so we should be well below the threshold in which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the windows kernel goes from taking about 6% of the cpu time with 12 threads to 80-90% of CPU time when using 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?

I'm not sure that I understand the issues completely such that I can offer you a solution but from what you've explained I may have some alternative view points which may be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious but also standard deviation has its place in that it can help you decide whether a large maximum value should be researched or not. Standard deviation only needs to be calculated when it needs to be read out: until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantifiy the timing overhead, if only to be able to rule it out as insignificant.
When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which requires unnecessary memory access. Unnecessary memory access usually means as much inline code as possible and as litte OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate so I try to keep calls to it to a minimum. In the same manner but usually to a lesser performance impacting extent I try to avoid calling application functions: if they must be called I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcopy/memset which are usually both optimized for larger blocks and not especially limited in size.
I've only recently begun using spinlocks but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections and though they are fast local spinlocks are faster. Since they perform additional processing it means that they prevent application processing from being performed during that time. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
slp->lock_part=0;
}
inline char spinlock_failed (SPINLOCK *slp)
{
return (char) __xchg (&slp->lock_part,1);
}
Or more elaborate (but not overly so):
inline char spinlock_failed (SPINLOCK *slp)
{
if (__xchg (&slp->lock_part,1)==1) return 1;
slp->count_part=1;
return 0;
}
And to release
inline void spinlock_leave (SPINLOCK *slp)
{
slp->lock_part=0;
}
Or
inline void spinlock_leave (SPINLOCK *slp)
{
if (slp->count_part==0) __breakpoint ();
if (--slp->count_part==0) slp->lock_part=0;
}
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.

To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores.. I can run it with 1, 2, 3, etc threads and get linear speedup.. up to about 5.5x speed on a 6-core CPU on a Ubuntu Linux box.
I had an opportunity to run the program on a very high end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads..
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads runs about 1.7x faster than 1, but 3, 4, 8, 10, 16 threads all run at just net 1.9x! I can see all the threads are running (not stalled or sleeping), they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
My program does not access the disk or network. It's CPU limited. Its speed scales linearly on a
single CPU box in Ubuntu Linux with
a hexacore i7 for 1-6 threads. 6
threads is effectively 6x speedup.
My program never runs faster than
2x speedup on this 16 core Sunfire
Xeon box, for any number of threads
from 2-16.
Running 16 copies of
my program single threaded runs
perfectly, all 16 running at once at
full speed.
top shows 1600% of
CPUs allocated. /proc/cpuinfo shows
all 16 cores running at full 2.9GHz
speed (not low frequency idle speed
of 1.6GHz)
There's 48GB of RAM free, it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!

My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Redhat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64 bit SMP kernels across both tests.
It's probably not possible that the motherboard would peak at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyperthreading turned on with the new machine? (and how does that answer compare to the old machine?). You're not, by chance, running in a virtualized environment?
Overall, your evidence is pointing to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the hardware. Test one by changing the other, and you'll narrow down your possibilities quickly.

Do some research on rlimit - it's quite possible the shell/user acct you're running in has some RH-default or admin-set resource limits in place.

When you see this kind of odd scaling behaviour, especially if problems are seen with multiple threads, but not multiple processes, one thing to start looking at is the impacts of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their cache to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single CPU case likely isn't needing to hit main memory for locking operations at all - everything is likely being handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if the scaling up to 4 threads improves if the process affinity is set to lock it to a single physical CPU. Even better though would be to profile the application itself to see where it is spending it's time.As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string