Is therea way to count high-performance cores on heterogenous multicore CPUs - multithreading

In a Heterogeneous MultiProcessing model, the different cores of a CPU or SoC don't have the same performance profiles. While HMP systems first got deployed a while ago (the wiki mentions Samsung started using this model back in 2013), they are coming to a head with apple's M-series (technically it was already an issue with HyperThreading, which could be considered a form of HMP I guess).
Parallelized tools generally try to guess the number of workers they should create by counting the number of cores on the system, however in an HMP model that can be counter-productive, especially for personal devices: while the "efficient" cores are technically available they have low performances, loading them will not be a huge performance gain, and it can drastically impact the interactivity / pleasantness of interacting with the system. This is usually configurable (so the user can set something more reasonable), but it seems a better default to segregate such tools to "high-performance" cores only (at least assuming they are CPU bound), leaving users the option to increase residency if they so choose.
And so is there a portable way to list and segregate "real" and "high-performance" cores from "virtual" and "efficient" cores?
Note: I've seen 68444429 "how can I distinguish between high- and low-performance cores in C++", while the title is similar the question and goals are rather different as the goal here is to avoid generating unnecessary and inefficient work by default.

Related

Will running a parallel process on twice the machines using half the cores result in a speed increase?

I'm designing very compute heavy algorithm, and I'm more constrained by time than access to remote machines on which to run the algorithm.
My question is the following:
Let's say each machine I have access to has 24 cores, and I have 48 tasks to run. Currently, I'm dispatching the algorithm to two machines, each of which use their 24 cores to handle 24 of the tasks.
If I instead dispatched the same process to 4 machines which each spawned 12 threads, would it (likely) result in the tasks being completed quicker? I'm curious if having some extra cores available on the machine means computations are performed faster than if every single core is occupied running an individual thread.
This is highly dependent of the actual algorithm, the actual dataset, the target hardware including the interconnection network if data communicate and the input/data data are heavy (or if the algorithm runs very quickly). Some applications scale better on many machines with few cores and some scale better on few machines with many cores. In high-performance computing researchers worked during decades so to understand the performance of hybrid applications and there is no clear answer to this: it depends (Note that the question is already quite hard to answer for a given well defined application with a well defined dataset so for people to write research papers on it).
If your tasks are memory-bound, then using more machine with less core is often better. If the amount of transferred data is big or the algorithm require a low-latency, then using fewer machines is often better (typically one big SMP). There are many others things to consider since machines are not just a bag of core. NUMA effect should be considered for example as well as caches, the storage device system and even the OS (not all subsystem scale on a given machine regarding the OS).

Understanding cpu frequency, thread selection and more

With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference. [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me

Regarding relationship between cores and ranks of a MPI program

As far as I know, in a multiprocessor environment any thread/process can be allocated to any core/processor so, what is meant by following line:
the number of MPI ranks used on an Intel Xeon Phi coprocessor should be substantially fewer than the number of cores in no small part because of limited memory on the coprocessor.
I mean, what are the issues if #cores <= #MPI Ranks ?
That quote is correct only when it is applied to a memory size constrained problem; in general it would be an incorrect statement. In general you should use more tasks than you have physical cores on the Xeon Phi in order to hide memory latency1.
To answer your question "What are the issues if the number of cores is fewer than the number of MPI ranks?": you run the risk of having too much context switching. On many problems it is advantageous to use more tasks than you have cores to hide memory latency2.
1. I don't even feel like I need to cite a reference for this because how loudly it is advertised; however, they do mention it in an article on the OpenCL design document: http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor
2. This advice applies to the Xeon Phi specifically, not necessarily other pieces of hardware.
Well if you make number of MPI tasks higher than number of cores it makes no sense, because you start to enforce 2 tasks on one processing unit, and therefore exhaustion of computing resources.
When it comes to preferred substantially lower number of tasks over cores on Xeon Phi. Maybe they prefer threads over processes. The architecture of Xeon Phi is quite peculiar and overhead introduced by maintaining an MPI task can seriously cripple computing performance. I will not hide that I do not know technical reason behind it. But maybe someone will fill it in.
If I recall correctly communication bus in there is a ring (or two rings), so maybe all to all communication and barriers are polluting bus and turns out to be ineffective.
Using threads or the native execution mode they provide has less overhead.
Also I think you should look at it more like a multicore CPU, not a multi-CPU machine. For greater performance you don't want to run 4 MPI tasks on a 4-core CPU either, you want to run one 4-threaded MPI task.

Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?

*Adding a second core or CPU might increase the performance of your parallel program, but it is unlikely to double it. Likewise, a
four-core machine is not going to execute your parallel program four
times as quickly— in part because of the overhead and coordination
described in the previous sections. However, the design of the
computer hardware also limits its ability to scale. You can expect a
significant improvement in performance, but it won’t be 100 percent
per additional core, and there will almost certainly be a point at
which adding additional cores or CPUs doesn’t improve the performance
at all.
*
I read the paragraph above from a book. But I don't get the last sentence.
So, Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?
If you take a serial program and a parallel version of the same program then the parallel program has to do some operations that the serial program does not, specifically operations concerned with coordinating the operations of the multiple processors. These contribute to what is often called 'parallel overhead' -- additional work that a parallel program has to do. This is one of the factors that makes it difficult to get 2x speed-up on 2 processors, 4x on 4 or 32000x on 32000 processors.
If you examine the code of a parallel program you will often find segments which are serial, that is which only use one processor while the others are idle. There are some (fragments of) algorithms which are not parallelisable, and there are some operations which are often not parallelised but which could be: I/O operations for instance, to parallelise these you need some sort of parallel I/O system. This 'serial fraction' provides an irreducible minimum time for your computation. Amdahl's Law explains this, and that article provides a useful starting point for your further reading.
Even when you do have a program which is well parallelised the scaling (ie the way speed-up changes as the number of processors increases) does not equal 1. For most parallel programs the size of the parallel overhead (or the amount of processor time which is devoted to operations which are only necessary for parallel computing) increases as some function of the number of processors. This often means that adding processors adds parallel overhead and at some point in the scaling of your program and jobs the increase in overhead cancels out (or even reverses) the increase in processor power. The article on Amdahl's Law also covers Gustafson's Law which is relevant here.
I've phrased this all in very general terms, no consideration of current processor and computer architectures; what I am describing are features of parallel computation (as currently understood) not of any particular program or computer.
I flat out disagree with #Daniel Pittman's assertion that these issues are of only theoretical concern. Some of us are working very hard to make our programs scale to very large numbers of processors (1000s). And almost all desktop and office development these days, and most mobile development too, targets multi-processor systems and using all those cores is a major concern.
Finally, to answer your question, at what point does adding processors no longer increase execution speed, now that is an architecture- and program-dependent question. Happily, it is one that is amenable to empirical investigation. Figuring out the scalability of parallel programs, and identifying ways of improving it, are a growing niche within the software engineering 'profession'.
#High Performance Mark is right. This happens when you are trying to solve a fixed size problem in the fastest possible way, so that Amdahl' law applies. It does not (usually) happen when you are trying to solve in a fixed time a problem. In the former case, you are willing to use the same amount of time to solve a problem
whose size is bigger;
whose size is exactly the same as before, but with a greeter accuracy.
In this situation, Gustafson's law applies.
So, let's go back to fixed size problems.
In the speedup formula you can distinguish these components:
Inherently sequential computations: σ(n)
Potentially parallel computations: ϕ(n)
Overhead (Communication operations etc): κ(n,p)
and the speedup for p processors for a problem size n is
Adding processors reduces the computation time but increases the communication time (for message-passing algorithms; it increases the synchronization overhead etcfor shared-memory algorithm); if we continue adding more processors, at some point the communication time increase will be larger than the corresponding computation time decrease.
When this happens, the parallel execution time begins to increase.
Speedup is inversely proportional to execution time, so that its curve begins to decline.
For any fixed problem size, there is an optimum number of processors that minimizes the overall parallel execution time.
Here is how you can compute exactly (analytical solution in closed form) the point at which you get no benefit by adding additional processors (or cores if you prefer).
The answer is, of course, "it depends", but in the current world of shared memory multi-processors the short version is "when traffic coordinating shared memory or other resources consumes all available bus bandwidth and/or CPU time".
That is a very theoretical problem, though. Almost nothing scales well enough to keep taking advantage of more cores at small numbers. Few applications benefit from 4, less from 8, and almost none from 64 cores today - well below any theoretical limitations on performance.
If we're talking x86 that architecture is more or less at its limits. # 3 GHz electricity travels 10 cm (actually somewhat less) per Hz, the die is about 1 cm square, components have to be able to switch states in that single Hz (1/3000000000 of a second). The current manufacturing process (22nm) gives interconnections that are 88 (silicon) atoms wide (I may have misunderstood this). With this in mind you realize that there isn't that much more that can be done with physics here (how narrow can an interconnection be? 10 atoms? 20?). At the other end the manufacturer, to be able to market a device as "higher performing" than its predecessor, adds a core which theoretically doubles the processing power.
"Theoretically" is not actually completely true. Some specially written applications will subdivide a large problem into parts that are small enough to be contained inside a single core and its exclusive caches (L1 & L2). A part is given to the core and it processes for a significant amount of time without accessing the L3 cache or RAM (which it shares with other cores and therefore will be where collisions/bottlenecks will occur). Upon completion it writes its results to RAM and receives a new part of the problem to work on.
If a core spends 99% of its time doing internal processing and 1% reading from and writing to shared memory (L3 cache and RAM) you could have an additional 99 cores doing the same thing because, in the end, the limiting factor will be the number of accesses the shared memory is capable of. Given my example of 99:1 such an application could make efficient use of 100 cores.
With more common programs - office, ie, etc - the extra processing power available will hardly be noticed. Some parts of the programs may have smaller parts written to take advantage of multiple cores and if you know which ones you may notice that those parts of the programs are much faster.
The 3 GHz was used as an example because it works well with the speed of light which is 300000000 meters/sec. I read recently that AMD's latest architecture was able to execute at 5 GHz but this was with special coolers and, even then, it was slower (processed less) than an intel i7 running at a significantly slower frequency.
It heavily depends on your program architecture/design. Adding cores improves parallel processing. If your program is not doing anything in parallel but only sequentially, adding cores would not improve its performance at all. It might improve other things though like framework internal processing (if you're using a framework).
So the more parallel processing is allowed in your program the better it scales with more cores. But if your program has limits on parallel processing (by design or nature of data) it will not scale indefinitely. It takes a lot of effort to make program run on hundreds of cores mainly because of growing overhead, resource locking and required data coordination. The most powerful supercomputers are indeed massively multi-core but writing programs that can utilize them is a significant effort and they can only show their power in an inherently parallel tasks.

Highly concurrent multi-threaded application requires hardware

I am looking for a hardware, which must run about 256 computationally intensive real-time concurrent tasks in 24 hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPs, so all tasks require about 10 GFLOPs. CPU-RAM speed is insignificant. All tasks must be managed by a Linux Kernel (32 bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such CPU exist). If such CPU doesn't exist, then I need one mulit-socket mainboard solution (with multiple CPUs).
Can you please recommend me any professional CPU/Mainboard solution which will satisfy such requirements? It is also very important that there are no issues with Linux Kernel (2.6.25). No virtualization, no needs in huge RAM or CPU cache. I also would prefer Intel architecture and well-proved stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.
UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
Rent some Amazon EC2 nodes.
Updated: How about PS3's then? The NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10GFlops isn't exactly to be sneezed at so in a single machine, it'll be expensive. There's also the problem what you do when the machine breaks, you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer intel, but if you need one chip, I will again suggest the cell processor -
its theoretical peak performance is arount 25GFlops - kernel 2.6.25 had support for it already.
You could try a pre-slim playstation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to re-write and fine tune your threads to take advabtage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single CELL (1 PPC core + 8 SPU's)
NB.: with a playstation 3, you'd have only 6 available co-processors - but you don't seen to be on a budget with this project -
So you could at least try IBM's cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
Thre are commercially available CELL products, both as stand-alone servers in blade form factory, and PCI Express add-on boards for PC workstations from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seens to be around the previoulsy mentioned U$8000.00 for these PCI Express cards.
A playstation 3 videogame can be purchased for about U$300.00 - and would allow you to prototype your application, and check if it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyst and have not, so far, used it for any calculations - I had also put together a Playstation-3 12 machinne cluster for Molecular simulations at the local University. The application they run did not take advantage of the multimedia SPU's, while I was in touch with then. But even so, clocked at 3.5GHz they performed better than standard ,s imlarly priced, PC's, even considering PS3's are priced 5x higher around here)

Resources