How to quantify the processing tradeoffs of CUDA devices for C kernels? - linux

I recently upgraded from a GTX480 to a GTX680 in the hope that the tripled number of cores would manifest as significant performance gains in my CUDA code. To my horror, I've discovered that my memory intensive CUDA kernels run 30%-50% slower on the GTX680.
I realize that this is not strictly a programming question but it does directly impact on the performance of CUDA kernels on different devices. Can anyone provide some insight into the specifications of CUDA devices and how they can be used to deduce their performance on CUDA C kernels?

Not exactly an answer to your question, but some information that might be of help in understanding the performance of the GK104 (Kepler, GTX680) vs. the GF110 (Fermi, GTX580):
On Fermi, the cores run on double the frequency of the rest of the logic. On Kepler, they run at the same frequency. That effectively halves the number of cores on Kepler if one wants to do more of an apples to apples comparison to Fermi. So that leaves the GK104 (Kepler) with 1536 / 2 = 768 "Fermi equivalent cores", which is only 50% more than the 512 cores on the GF110 (Fermi).
Looking at the transistor counts, the GF110 has 3 billion transistors while the GK104 has 3.5 billion. So, even though the Kepler has 3 times as many cores, it only has slightly more transistors. So now, not only does the Kepler have only 50% more "Fermi equivalent cores" than Fermi, but each of those cores must be much simpler than the ones of Fermi.
So, those two issues probably explain why many projects see a slowdown when porting to Kepler.
Further, the GK104, being a version of Kepler made for graphics cards, has been tuned in such a way that cooperation between threads is slower than on Fermi (as such cooperation is not as important for graphics). Any potential potential performance gain, after taking the above facts into account, may be negated by this.
There is also the issue of double precision floating point performance. The version of GF110 used in Tesla cards can do double precision floating point at 1/2 the performance of single precision. When the chip is used in graphics cards, the double precision performance is artificially limited to 1/8 of single precision performance, but this is still much better than the 1/24 double precision performance of GK104.

One of the advances of new Kepler architecture is 1536 cores grouped into 8 192-core SMX'es but at the same time this number of cores is a big problem. Because shared memory is still limited to 48 kb. So if your application needs a lot of SMX resources then you can't execute 4 warps in parallel on single SMX. You can profile your code to find real occupancy of you GPU. The possible ways to improve you application:
use warp vote functions instead of shared memory communications;
increase a number of tread blocks and decrease a number threads in one block;
optimize global loads/stores. Kepler have 32 load/store modules for each SMX (twice more than on Kepler).

I am installing nvieuw and I use coolbits 2.0 to unlock your shader cores from default to max performance. Also, you must have both connectors of your device to 1 display, which can be enabled in nVidia control panel screen 1/2 and screen 2/2. Now you must clone this screen with the other, and Windows resolution config set screen mode to extended desktop.
With nVidia inspector 1.9 (BIOS level drivers), you can activate this mode by setting up a profile for the application (you need to add application's exe file to the profile). Now you have almost double performance (keep an eye on the temperature).
DX11 also features tesselation, so you want to override that and scale your native resolution.
Your native resolution can be achieved by rendering a lower like 960-540P and let the 3D pipelines do the rest to scale up to full hd (in nv control panel desktop size and position). Now scale the lower res to full screen with display, and you have full HD with double the amount of texture size rendering on the fly and everything should be good to rendering 3D textures with extreme LOD-bias (level of detail). Your display needs to be on auto zoom!
Also, you can beat sli config computers. This way I get higher scores than 3-way sli in tessmark. High AA settings like 32X mixed sample makes al look like hd in AAA quality (in tessmark and heavon benchies). There are no resolution setting in the endscore, so that shows it's not important that you render your native resolution!
This should give you some real results, so please read thoughtfully not literary.

I think the problem may lie in the number of Streaming Multiprocessors: The GTX 480 has 15 SMs, the GTX 680 only 8.
The number of SMs is important, since at most 8/16 blocks or 1536/2048 threads (compute capability 2.0/3.0) can reside on an single SM. The resources they share, e.g. shared memory and registers, can further limit the number of blocks per SM. Also, the higher number of cores per SM on the GTX 680 can only reasonably be exploited using instruction-level parallelism, i.e. by pipelining several independent operations.
To find out the number of blocks you can run concurrently per SM, you can use nVidia's CUDA Occupancy Calculator spreadsheet. To see the amount of shared memory and registers required by your kernel, add -Xptxas –v to the nvcc command line when compiling.

Related

Is therea way to count high-performance cores on heterogenous multicore CPUs

In a Heterogeneous MultiProcessing model, the different cores of a CPU or SoC don't have the same performance profiles. While HMP systems first got deployed a while ago (the wiki mentions Samsung started using this model back in 2013), they are coming to a head with apple's M-series (technically it was already an issue with HyperThreading, which could be considered a form of HMP I guess).
Parallelized tools generally try to guess the number of workers they should create by counting the number of cores on the system, however in an HMP model that can be counter-productive, especially for personal devices: while the "efficient" cores are technically available they have low performances, loading them will not be a huge performance gain, and it can drastically impact the interactivity / pleasantness of interacting with the system. This is usually configurable (so the user can set something more reasonable), but it seems a better default to segregate such tools to "high-performance" cores only (at least assuming they are CPU bound), leaving users the option to increase residency if they so choose.
And so is there a portable way to list and segregate "real" and "high-performance" cores from "virtual" and "efficient" cores?
Note: I've seen 68444429 "how can I distinguish between high- and low-performance cores in C++", while the title is similar the question and goals are rather different as the goal here is to avoid generating unnecessary and inefficient work by default.

openmp: performance decreases with multiple threads on my desktop, but the opposite over my server

I believe I read pretty much about StackOverflow threads regarding decreasing performance when increasing threads number with OpenMP. Mostly they were due to false sharing. My situation is quite different because my two machines are showing the opposite result.
I'm running STREAM benchmark, and information about my machines is written below:
Intel Xeon Gold-6148. 20 cores (40 threads), 2.4 GHz, 27.5MB LLC
Intel Core i5-9400. 6 cores (6 threads), 2.9 GHz, 9MB LLC
Memory information is similar between the two machines so I will omit it.
I ran the benchmark quite a lot of times and checked the variance between runs is small enough. The result is quite interesting.
Gold-6148(a.k.a server) gets enhanced results when increasing threads number with the OMP_NUM_THREADS option. However, the result of i5-9400(a.k.a desktop) decreases with multiple threads number.
I set the STREAM_ARRAY_SIZE with 20m, and double-checked with various sizes, so it won't affect the result.
Also, I doubted maybe because of any difference over glibc/gomp library between the two, but no difference.
Any idea why this is happening? I just absentmindedly watching the chart below again and again...
i5-9400 result
gold-6148 result
Memory information is similar between the two machines so I will omit it.
You cannot simply omit the most important information when talking about the STREAM benchmark. Xeon Gold 6148 has six DDR4-2666 memory channels split over two separate memory controllers while i5-9400 (assuming i5-9700 is a typo since i7-9700 is an octo-core i7 and not a hexa-core i5 CPU) has only two DDR4-2666 memory channels on a single memory controller. Therefore, 6148 with memory modules installed on all six channels is capable of delivering 3x the memory bandwidth of i5-9400 with the same type of memory modules on both channels. It can also handle more simultaneous memory requests and therefore provides better memory utilisation with more than one thread. Thus, the actual memory configuration is quite important.
Interpreting the STREAM results requires deep understanding of the underlying CPU architecture. There is a nice article by Georg Hager on that topic.

Understanding cpu frequency, thread selection and more

With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference. [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me

Intel MSR frequency scaling per - thread

I'm extending the Linux kernel in order to control the frequency of some threads: when they are scheduled onto a core (any core!), the core's frequency is changed by writing the proper p-state to the register IA32_PERF_CTL, as suggested in Intel's manual.
But when different threads with different "custom" frequencies are scheduled, it appears that the throughput of all the thread increases, as if all the cores run at the maximum set frequency.
I did many trials and measurements in different conditions of load and configuration, but the result is the same.
After some trials with CPUFreq (with no running app, I set different frequencies on each core, and finally the measured frequencies, with cpufreq-info -w, were equal), I wonder if the CPU cores can really run at different, independent frequencies, or if there are hardware policies or constraints.
Finally, is there a CPU model which makes this fine-grained frequency scaling feasible?
The CPU I am using is Intel Core i5 750
You cannot control individual core frequencies for active cores. You can, however, control frequencies of all active cores to be the same. The reasons are in the previous answers - all cores are on the same active voltage plane.
Hopefully, the next-gen Haswell processors will make it possible to control each core separately.
I think you're missing a big piece of the picture!
Read up on power and clocks domains. All processor cores within a domain run at the same P-state (i.e., the same frequency and voltage). The P-state that all cores will run at in that domain will always be the P-state of the core requesting the highest P-state in that domain. The MSRs don't reflect this at all, nor do the interfaces that the kernel exposes.
Anandtech has a good article on this:
http://www.anandtech.com/show/2658/2
"This is all very similar to AMD's Phenom, but where the two differ is in how they handle power management. While AMD will allow individual cores to request different clock speeds, Nehalem attempts to run all of its cores at the same frequency; if one core is idle then it's simply power gated and the core is effectively turned off."
I haven't hooked a power-meter up to SB/IB, but my guess is that the behavior is the same.
cpufreq-info will display information about which cores need to be synchronous in their P-states:
[root#navi ~]# cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq#vger.kernel.org, please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0 1 <---- THIS
CPUs which need to have their frequency coordinated by software: 0 <--- and THIS
maximum transition latency: 10.0 us.
At least because of that, I'd recommend going through cpufreq interfaces instead of directly setting registers, as well as making it possible to run on non-intel CPUs which might have uncommon requirements.
Also check on how to make kernel threads stick to specific core, to avoid unexpecteded switching, if you didn't do so already.
I want to thank everyone for the contribution!
Further investigating, I found other details I share with the community.
As suggested, Nehalem places all the cores in a single clock domain, so that the maximum frequency set among all the cores is applied to all of them; some tools may show different frequencies on idle cores, but it is sufficient to run any application to make the frequency raise to the maximum.
This, from my tests, also applies to Sandy Bridge, where cores and LLC slices all reside in the same frequency/voltage domain.
I assume that this behavior also happens with Ivy Bridge, as it is only a 'tick' iteration.
Instead, I believe that Haswell places cores and LLC slices in different, singular domains, thus enabling per-core frequencies. This is also advertized in several pages like
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4

Highly concurrent multi-threaded application requires hardware

I am looking for a hardware, which must run about 256 computationally intensive real-time concurrent tasks in 24 hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPs, so all tasks require about 10 GFLOPs. CPU-RAM speed is insignificant. All tasks must be managed by a Linux Kernel (32 bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such CPU exist). If such CPU doesn't exist, then I need one mulit-socket mainboard solution (with multiple CPUs).
Can you please recommend me any professional CPU/Mainboard solution which will satisfy such requirements? It is also very important that there are no issues with Linux Kernel (2.6.25). No virtualization, no needs in huge RAM or CPU cache. I also would prefer Intel architecture and well-proved stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.
UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
Rent some Amazon EC2 nodes.
Updated: How about PS3's then? The NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10GFlops isn't exactly to be sneezed at so in a single machine, it'll be expensive. There's also the problem what you do when the machine breaks, you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer intel, but if you need one chip, I will again suggest the cell processor -
its theoretical peak performance is arount 25GFlops - kernel 2.6.25 had support for it already.
You could try a pre-slim playstation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to re-write and fine tune your threads to take advabtage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single CELL (1 PPC core + 8 SPU's)
NB.: with a playstation 3, you'd have only 6 available co-processors - but you don't seen to be on a budget with this project -
So you could at least try IBM's cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
Thre are commercially available CELL products, both as stand-alone servers in blade form factory, and PCI Express add-on boards for PC workstations from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seens to be around the previoulsy mentioned U$8000.00 for these PCI Express cards.
A playstation 3 videogame can be purchased for about U$300.00 - and would allow you to prototype your application, and check if it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyst and have not, so far, used it for any calculations - I had also put together a Playstation-3 12 machinne cluster for Molecular simulations at the local University. The application they run did not take advantage of the multimedia SPU's, while I was in touch with then. But even so, clocked at 3.5GHz they performed better than standard ,s imlarly priced, PC's, even considering PS3's are priced 5x higher around here)

Resources