Balancing blocks, threads and workgroups? - multithreading

I have an application (did not create myself) that requires three parameters
Blocks
Threads
Points (number of calcs per thread I'm assuming)
It uses OpenCL and I have an RX 580...my current efficiency is low
The GPU has 2304 stream processors across 36 compute units.
Now I have played around with different values, but I have no idea what a good starting point would be because I don't know how blocks and threads relate to the compute units.
Any help would be greatly appreciated in understanding how to decide #of blocks, #of threads per block and #of calcs per thread
Thank you so much

I'm going to make the same assumptions you have:
Blocks: Number of workgroups
Thread: Number of threads
Points: Some metric of work per thread
It's more important to set the correct workgroup size than the number of workgroups. You will want the group size to be at least the SIMD width, which is usually 32 on most GPUs. So Blocks should be set to Threads / 32.
For "Points": this will depend on how much work is done per "calc". There is overhead in kicking off a workgroup, so you want to make sure each thread has enough work to do. From experience, ~16 instructions is usually enough, but if you can't see the kernel code you will just have to experiment.
In summary:
Set "Points" so that you have at least 2304 threads for the work you need
Set Blocks to threads / 32
All of this assumes you have at least 2304 work items; otherwise you are not fully utilising your hardware.
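If the app simply forwards these numbers to clEnqueueNDRangeKernel, the mapping is probably something like the host-side sketch below (this is an assumption about an app we can't see; the names blocks and threads are the app's parameters, everything else is illustrative). Rather than guessing the SIMD width, you can also query the kernel's preferred work-group size multiple, which on GCN cards such as the RX 580 is typically 64 rather than 32:

#include <CL/cl.h>
#include <stdio.h>

/* Sketch only: assumes the app maps Blocks * Threads to the global size and
   Threads to the local (work-group) size; "Points" would then just be the
   amount of work each kernel invocation does internally.
   Error checking omitted for brevity. */
void launch (cl_command_queue queue, cl_kernel kernel, cl_device_id device,
             size_t blocks, size_t threads)
{
    size_t preferred = 0;
    /* Ask the runtime for the SIMD width instead of assuming 32;
       GCN parts such as the RX 580 usually report 64 here. */
    clGetKernelWorkGroupInfo (kernel, device,
                              CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                              sizeof (preferred), &preferred, NULL);
    printf ("preferred work-group size multiple: %zu\n", preferred);

    size_t global = blocks * threads;   /* total work items          */
    size_t local  = threads;            /* work items per work-group */
    clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &global, &local,
                            0, NULL, NULL);
}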

Related

What should be the minimum and maximum number of threads that I could run for a task?

I am new to CUDA and need help understanding some things. From this I understood 2 things.
6144 is the instantaneous capacity of my GPU.
So while trying some multiprocessing code, I could run only up to 6144 threads? Is that correct?
I could create a kernel of any size (in CUDA C programming), with any block size and number of threads.
The device query is shown here.
It would be a pleasure to get clear clarification on these two points.
I am still confused about how many threads and blocks there are in my GPU, and also how many threads I could use to run a task (max and min limits).

Local and Global size influence on program execution - OpenCL

After reading a lot of definitions regarding global work size and local work size I still don't really understand what they are and how they work.
I think that the global work size determines how many times the kernel function will be called, but what about the local work size?
I thought that the local work size determines how many threads run at the same time in parallel, but am I correct?
Is the local size the number of threads executing one kernel program per global size value? I mean, when we have global size = 1 and local size = 1, the kernel function will be called once and only one thread will work on it.
But when we have global size = 4096 and local size (if allowed that high) of 1024, do we then have 4096 calls of the kernel function, each with 1024 threads working on it at the same time? Am I correct?
Here is some example code I found:
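(The snippet itself is not reproduced here. Judging by the description below, it was a kernel that indexes only by global ID, presumably something along the lines of this sketch, which is a reconstruction and not the poster's actual code:)

/* Hypothetical reconstruction of the kind of kernel being discussed:
   each work item handles exactly one element, using only its global ID. */
__kernel void vec_add (__global const float *a,
                       __global const float *b,
                       __global float       *c)
{
    int i = get_global_id (0);
    c[i] = a[i] + b[i];
}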
And my other question is: how does changing the local size influence that code?
As far as I can see it clearly works on global IDs only, not local ones, so will changing the local size to something bigger than, let's say, 1 influence the time spent executing that algorithm?
And if we had a for loop in that algorithm, would that change anything regarding the influence of local size? Do we need to use local IDs to see any difference when changing the local size?
I tested this on a few of my programs, and even when I used only global IDs, changing the local work size gave me significantly shorter execution times.
So how does it work? I don't get it.
Thank you in advance!
I thought that the local work size determines how many threads run at the same time in parallel, but am I correct?
Correct, but it is per compute unit, not the whole device. If there are more compute units than thread groups, the device is not fully used. When there are more thread groups than compute units but not an exact multiple, some compute units wait for the others at the end. When the two are equal (or an exact multiple), then "how many times" matters for fully occupying all ALUs.
For example, an 8-core CPU could expose 8 compute units (maybe +8 more with hardware multithreading), but a GPU at a similar price can have 20 to 64 compute units. Then, even within a single compute unit, many groups of threads can be "in flight"; this is not explicitly tuned but depends on resource usage per thread, per compute unit and maybe per GPU.
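If you want to see how many compute units your own device reports, a minimal query looks like this (host-side C, error handling omitted):

#include <CL/cl.h>
#include <stdio.h>

void print_compute_units (cl_device_id device)
{
    cl_uint cu = 0;
    clGetDeviceInfo (device, CL_DEVICE_MAX_COMPUTE_UNITS,
                     sizeof (cu), &cu, NULL);
    printf ("compute units: %u\n", cu);   /* e.g. 36 on an RX 580 */
}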
How does changing the local size influence that code? As far as I can see it clearly works on global IDs only, not local ones, so will changing the local size to something bigger than, let's say, 1 influence the time spent executing that algorithm?
Vectorizable/parallelizable kernel code can take advantage of distributing threads across the ALUs and SIMD units of a core, or the wider SIMDs of a GPU compute unit. For a CPU, 8 scalar instructions could be issued at the same time; for a GPU it could be as large as thousands. So when you decrease the local size to 1, you limit the width of parallel thread issue to a single ALU, which cripples performance on many architectures. When you make the local size too big, the resources per thread fall and performance takes a hit. If you don't have any idea, the OpenCL API can tune the local size for you if you pass NULL for that parameter.
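Letting the implementation pick the local size is just a matter of passing NULL for that argument, as in this sketch (queue and kernel setup omitted):

#include <CL/cl.h>

/* Sketch: the implementation chooses the work-group size when the
   local_work_size argument is NULL. */
void enqueue_auto_local (cl_command_queue queue, cl_kernel kernel, size_t global)
{
    clEnqueueNDRangeKernel (queue, kernel, 1, NULL,
                            &global,   /* global work size             */
                            NULL,      /* local size chosen by the API */
                            0, NULL, NULL);
}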
And if we had a for loop in that algorithm, would that change anything regarding the influence of local size? Do we need to use local IDs to see any difference when changing the local size?
For older, statically scheduled architectures, loop unrolling is advised, with an unroll step equal to the basic SIMD width. No, the local ID is just a query of a thread's index within its work-group, so there is no need to query it if you don't need it.
I tested this on a few of my programs, and even when I used only global IDs, changing the local work size gave me significantly shorter execution times. So how does it work?
If the kernel needs huge resources per thread, you could consider 1 thread per local group. If the kernel doesn't need any resources except immediate values, you should use the maximum local size. Resource allocation per thread (driven by the kernel code) is important. New architectures have load balancing, so in the future it may not matter if you let the API choose the optimum value.
To keep all ALUs busy, the scheduler issues many threads per core: while one thread is waiting on a memory operation, another thread can do an ALU operation at the same time. This is good when resource usage per thread is small. When you use 50% of all the resources of a compute unit, it can have only 2 threads in flight. Threads share sharable resources such as the L1 cache, local memory and the register file.
Code such as c[i]=a[i]+b[i] on scalar floats is vectorizable. You can get better performance using float8, float16 and similar types if the compiler is not already doing it in the background; this way fewer threads are needed to accomplish all the work, and memory accesses are faster. You can also add a loop in the kernel to decrease the number of threads even more, which is good for a CPU since less thread dispatching is needed between two data blocks. For a GPU it may not matter.
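For illustration, the scalar c[i]=a[i]+b[i] pattern above could be rewritten so that each work item adds eight floats instead of one (a sketch; whether it actually beats the compiler's own vectorization depends on the hardware and driver). The global size would then be N / 8:

/* Sketch: same c[i] = a[i] + b[i] work, but eight floats per work item.
   Launch with global size = N / 8 (N assumed to be a multiple of 8). */
__kernel void vec_add8 (__global const float8 *a,
                        __global const float8 *b,
                        __global float8       *c)
{
    int i = get_global_id (0);
    c[i] = a[i] + b[i];
}

Adding a loop inside the kernel so that each work item processes several float8 elements reduces the thread count further, which, as noted above, mainly helps on CPUs.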
Trivial example for a CPU:
4 core, local size = 10, global size = 100
Cores 1 and 2 have 3 thread groups each; cores 3 and 4 have only 2 thread groups each.
1: 30 threads --> fully performant
2: 30 threads
3: 20 threads --> less performant, better preemption for other jobs
4: 20 threads
While the instruction pipelines have few bubbles on cores 1 and 2, bubbles start appearing after some time on cores 3 and 4, so those cores can be used for other jobs such as a second kernel running in parallel, the operating system, or some array copying. When you use all cores equally, such as with 120 threads, they finish more work per second, but the CPU cannot do array copies if the kernels are already using the memory (unless the OS preempts them for other threads).

CUDA: question about the active warps (active blocks) and how to choose the block size

Suppose a CUDA GPU can have 48 simultaneously active warps on one multiprocessor, that is, 48 blocks of one warp, or 24 blocks of 2 warps, and so on. Since all the active warps from multiple blocks are scheduled for execution, it seems the block size is not important for the occupancy of the GPU (of course it should be a multiple of 32); whether it is 32, 64, or 128 makes no difference, right? So the block size is just determined by the computation task and the resource limits (shared memory or registers)?
There are multiple factors worth considering that you omit.
There is a limit on the number of active blocks on an SM. The current limit is 8 (all devices), so if you want to achieve full occupancy, your blocks shouldn't be smaller than 3 warps (devices 1.0, 1.1), 4 warps (1.2, 1.3), or 6 warps (2.x).
Depending on the device, there are 8K, 16K or 32K registers available per multiprocessor. The bigger your blocks, the bigger the "granularity" of how many registers the block needs. For big blocks, if full occupancy cannot be achieved, you lose a lot; for smaller blocks, the loss may be smaller. That's why, personally, I prefer for example 2x256 rather than 1x512.
If you do need synchronisation between warps in a block, bigger blocks allow you to have wider synchronisation.
Single block is guaranteed to be scheduled on a single multiprocessor. If all its warps have some common data (e.g. control variables), you can reduce the number of global memory fetches. On the other hand, when you create lots of small blocks, each of them might need to load the same data separately. On Fermi, which has some caches, it is not as important as on GF-200 series. Keep in mind, however, that since there are so many multiprocessors, 1MB L2 cache is still very, very small!
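A quick back-of-the-envelope check of how the active-block limit interacts with block size, as a plain C sketch using the 48-active-warp and 8-active-block limits quoted above (substitute your own device's limits; register and shared-memory pressure are ignored, so this is only an upper bound):

#include <stdio.h>

#define MAX_WARPS_PER_SM  48   /* active-warp limit quoted in the question */
#define MAX_BLOCKS_PER_SM 8    /* active-block limit quoted in the answer  */
#define WARP_SIZE         32

int main (void)
{
    int block_sizes[] = { 32, 64, 128, 192, 256, 512 };
    for (int i = 0; i < 6; ++i) {
        int warps_per_block = block_sizes[i] / WARP_SIZE;
        int blocks = MAX_WARPS_PER_SM / warps_per_block;
        if (blocks > MAX_BLOCKS_PER_SM)
            blocks = MAX_BLOCKS_PER_SM;              /* block limit kicks in */
        int active_warps = blocks * warps_per_block;
        printf ("block size %3d -> %d active blocks, %2d active warps (%3.0f%% occupancy)\n",
                block_sizes[i], blocks, active_warps,
                100.0 * active_warps / MAX_WARPS_PER_SM);
    }
    return 0;
}

With these limits, 32-thread blocks cap out at 8 warps (about 17% occupancy), while anything from 192 threads upward can reach the full 48 warps, which is exactly why very small blocks hurt.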
No.
The blocksize does matter.
If you have a blocksize of 32 threads you have a very low occupancy.
If you have a blocksize of 256 you have high occupancy. That means that all 256 threads are concurrently active.
More than 256 threads per block rarely makes much difference.
As the architecture involved is complex, testing it with your software is always the best approach.

Severe multi-threaded memory bottleneck after reaching a specific number of cores

We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads-- the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it a cache issue. The evidence is as follows:
CPU usage continues to climb from 50% to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GBs, so we should be well below the threshold in which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the windows kernel goes from taking about 6% of the cpu time with 12 threads to 80-90% of CPU time when using 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?
I'm not sure I understand the issues completely enough to offer you a solution, but from what you've explained I may have some alternative viewpoints which may be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it, I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious, but standard deviation also has its place in that it can help you decide whether a large maximum value should be researched or not. Standard deviation only needs to be calculated when it needs to be read out; until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantify the timing overhead, if only to be able to rule it out as insignificant.
When I program multi-threaded, I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things that require unnecessary memory access. Avoiding unnecessary memory access usually means using as much inline code as possible and as little OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate, so I try to keep calls to it to a minimum. In the same manner, but usually to a lesser performance-impacting extent, I try to avoid calling application functions: if they must be called, I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcpy/memset, which are usually both optimized for larger blocks and not especially limited in size.
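On MSVC (an assumption here, given the Windows Server 2008 R2 setup) the rep movsb / rep stosb instructions are exposed as the __movsb and __stosb intrinsics, for example:

#include <stddef.h>
#include <intrin.h>   /* MSVC intrinsics for rep movsb / rep stosb */

/* Small copies/fills without the call overhead of memcpy/memset. */
static void copy_small (void *dst, const void *src, size_t n)
{
    __movsb ((unsigned char *) dst, (const unsigned char *) src, n);
}

static void fill_small (void *dst, unsigned char value, size_t n)
{
    __stosb ((unsigned char *) dst, value, n);
}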
I've only recently begun using spinlocks, but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections, and though they are fast, local spinlocks are faster. Since critical sections perform additional processing, they prevent application processing from being performed during that time. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
    slp->lock_part = 0;                       /* 0 = unlocked */
}

inline char spinlock_failed (SPINLOCK *slp)
{
    /* atomically exchange the lock word with 1: a non-zero return means
       the lock was already held (__xchg is a compiler-specific
       atomic-exchange intrinsic) */
    return (char) __xchg (&slp->lock_part, 1);
}
Or more elaborate (but not overly so):
inline char spinlock_failed (SPINLOCK *slp)
{
    if (__xchg (&slp->lock_part, 1) == 1) return 1;   /* already locked      */
    slp->count_part = 1;                              /* start nesting count */
    return 0;
}
And to release
inline void spinlock_leave (SPINLOCK *slp)
{
    slp->lock_part = 0;
}
Or
inline void spinlock_leave (SPINLOCK *slp)
{
    if (slp->count_part == 0) __breakpoint ();        /* not held: trap        */
    if (--slp->count_part == 0) slp->lock_part = 0;   /* release on last leave */
}
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
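A hypothetical caller-side sketch (not from the original post; update_shared and the counter are illustrative) of how these routines would typically be used:

/* Hypothetical usage of the spinlock above. */
void update_shared (SPINLOCK *slp, volatile long *shared_counter)
{
    while (spinlock_failed (slp))
        ;                        /* spin until the exchange returns 0 */
    ++*shared_counter;           /* critical section: keep it short   */
    spinlock_leave (slp);
}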
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.
To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.

Comparing likely CPU speed improvements for business hardware upgrade justification

I have a C# console app, a Monte Carlo simulation that is entirely CPU bound; execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores and threads).
It currently runs daily on:
AMD Opteron 275 @ 2.21 GHz (4 cores)
The app is multithreaded using 3 threads; the 4th thread is reserved for another Process Controller app.
It takes 15 hours per day to run.
I need to estimate as best I can how long the same work would take to run on a system configured with the following CPU's:
http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540
and compare the cases. I will recode it to use the available threads. I want to justify that we need a server with 2 x X5570 CPUs over the cheaper X5540s (they support 2 CPUs on a single motherboard). This should make 8 cores, 16 threads (that's how the Nehalem chips work, I believe) available to the operating system. So for my app that's 15 threads for the Monte Carlo simulation.
Any ideas how to do this? Is there a website I can go and see benchmark data for all 3 CPUS involved for a single threaded benchmark? I can then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on if necessary.
Note that the business is also dictating that the workload for this app will increase about 20 times over the next 3 months and needs to complete within a 24-hour window.
Any help much appreciated.
I have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 in the hope that they can better explain their benchmarking, so I can effectively get a score per core, which would be much more helpful.
Have you considered recreating the algorithm in CUDA? It uses current-day GPUs to speed up calculations like these 10-100 fold. That way you would just need to buy a fat video card.
Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp
The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml
Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?
You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.
It sounds like you've spent a lot of time optimizing the calculation thread, but is every calculation being performed actually important to the final result?
Is there a way to shortcut calculations anywhere?
Is there a way to identify items which have negligible effects on the end result, and skip those calculations?
Can a lower resolution model be used for early iterations with detail added in progressive iterations?
Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.
tomshardware.com contains a comprehensive list of CPU benchmarks. However, you can't just divide them; you need to find as close to an apples-to-apples comparison as you can get, and you won't quite get it, because the result depends on the mix of instructions in your workload.
My guess (please don't take this as official; you need real data for this) is that you're probably looking at a 1.5x - 1.75x single-threaded speedup if the work is CPU bound and not highly vectorized.
You also need to take into account that you are:
1) Using C# and the CLR: unless you've taken steps to prevent it, the GC may kick in and serialize you.
2) The Nehalems have hyper-threading, so you won't be seeing a perfect 16x speedup; more likely you'll see an 8x to 12x speedup depending on how optimized your code is. Be optimistic here, though (just don't expect 16x).
3) I don't know how much contention you have; good scaling on 3 threads != good scaling on 16 threads. There may be dragons here (and usually are).
I would envelope calc this as:
15 hours * 3 threads / 1.5 x = 30 hours of single threaded work time on a nehalem.
30 / 12 = 2.5 hours (best case)
30 / 8 = 3.75 hours (worst case)
implies a parallel run time if there is truly a 20x increase:
2.5 hours * 20 = 50 hours (best case)
3.75 hours * 20 = 75 hours (worst case)
How much have you profiled? Can you squeeze 2x out of the app? One server may be enough, but it likely won't be.
And for gosh sakes, try out the Task Parallel Library in .NET 4.0 or the .NET 3.5 CTP; it's supposed to help with this sort of thing.
-Rick
I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:
Current Workload
3 cores * 15 real-world-hours = 45 cpu-time-hours
Proposed 20X Workload
45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores
Thus, you would need the equivalent of 45 2.2 GHz Opteron cores to achieve your goal (despite increasing the daily processing window from 15 hours to 20 hours), assuming completely linear scaling of performance. Even if the Nehalem CPUs are 3x faster per thread, you will still be at the outside edge of your performance envelope, with no room to grow. That also assumes that hyper-threading will even work for your application.
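The same estimate as a small C sketch you can re-run with different assumptions (the per-core speedup of Nehalem over the Opteron 275 is the number to argue about; the other inputs are taken from the thread):

#include <stdio.h>

int main (void)
{
    double cpu_hours_today = 3.0 * 15.0;              /* 45 cpu-hours/day      */
    double cpu_hours_20x   = cpu_hours_today * 20.0;  /* 900 cpu-hours/day     */
    double budget_hours    = 20.0;                    /* usable hours/core/day */
    double speedups[]      = { 1.0, 2.0, 3.0 };       /* Nehalem vs Opteron    */

    for (int i = 0; i < 3; ++i) {
        double opteron_cores = cpu_hours_20x / budget_hours;   /* = 45 cores */
        double nehalem_cores = opteron_cores / speedups[i];
        printf ("per-core speedup %.1fx -> about %.0f cores needed\n",
                speedups[i], nehalem_cores);
    }
    return 0;
}

Even at a generous 3x per-core speedup this comes out around 15 cores, which is why a single 8-core box looks marginal for the 20x workload.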
The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.
Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm
It'd be swinging a big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.
I'd also be wary of other bottlenecks such as memory bandwidth. I don't know the performance characteristics of Monte Carlo simulations, but ramping up one resource might reveal some other bottleneck.
