Highly concurrent multi-threaded application requires hardware - linux

I am looking for a hardware, which must run about 256 computationally intensive real-time concurrent tasks in 24 hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPs, so all tasks require about 10 GFLOPs. CPU-RAM speed is insignificant. All tasks must be managed by a Linux Kernel (32 bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such CPU exist). If such CPU doesn't exist, then I need one mulit-socket mainboard solution (with multiple CPUs).
Can you please recommend me any professional CPU/Mainboard solution which will satisfy such requirements? It is also very important that there are no issues with Linux Kernel (2.6.25). No virtualization, no needs in huge RAM or CPU cache. I also would prefer Intel architecture and well-proved stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.

UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)

Rent some Amazon EC2 nodes.
Updated: How about PS3's then? The NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.

Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.

Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64

Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10GFlops isn't exactly to be sneezed at so in a single machine, it'll be expensive. There's also the problem what you do when the machine breaks, you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.

MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.

I see you'd prefer intel, but if you need one chip, I will again suggest the cell processor -
its theoretical peak performance is arount 25GFlops - kernel 2.6.25 had support for it already.
You could try a pre-slim playstation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to re-write and fine tune your threads to take advabtage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single CELL (1 PPC core + 8 SPU's)
NB.: with a playstation 3, you'd have only 6 available co-processors - but you don't seen to be on a budget with this project -
So you could at least try IBM's cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
Thre are commercially available CELL products, both as stand-alone servers in blade form factory, and PCI Express add-on boards for PC workstations from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seens to be around the previoulsy mentioned U$8000.00 for these PCI Express cards.
A playstation 3 videogame can be purchased for about U$300.00 - and would allow you to prototype your application, and check if it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyst and have not, so far, used it for any calculations - I had also put together a Playstation-3 12 machinne cluster for Molecular simulations at the local University. The application they run did not take advantage of the multimedia SPU's, while I was in touch with then. But even so, clocked at 3.5GHz they performed better than standard ,s imlarly priced, PC's, even considering PS3's are priced 5x higher around here)

Related

How can I determine if a raspberry pi is powerful enough to run my code?

Perhaps I posted this with the wrong tags but hopefully someone can help me. I am an engineer finding myself deeper and deeper in automation. Recently I designed an automated system on a raspberry pi. I wrote a pretty simple code which was duplicated to read sensor values from different serial ports simultaneously. I did it this way so I could shut down one script without compromising the others if need be. It runs very well now but I had problems overloading my cpu when I first started (I believe it was because I opened all of the code at once rather than one at a time).
My question is:
How can I determine how much computing power is required by code I have written? How can I spec out a computer to run my code before I start building the robot?
The three resources you're likely to be bounded by on any computer are disk, RAM, and CPU (cores). MicroSD cards are cheap, and easily swapped, so the bigger concern is the latter two.
Depending on the language you're writing in, you'll have more or less control over memory usage. Python in particular "saves" the developer by "handling" memory automatically. There are a few good articles on memory management in Python, like this one. When running a simple script (e.g. activate these IO pins) on a machine with gigabytes of memory, this is rarely an issue. When running data intensive applications (e.g. do linear algebra on this gigantic array) then you have to worry about how much memory you need to do the computation and whether the interpreter actually frees it when you're done. This is not always easy to calculate but if you profile your software on another machine you may be able to estimate it.
CPU utilization is comparatively easy to prepare for. Reserve 1 core for OS and other functions and the rest are available to your software. If you write single threaded code this should be plenty. If you have parallel processing, then either stick to N-1 workers or you'll need to get creative with the software design.
Edit: all of this is with the Raspberry Pi in mind. The Pi is a full computer is a tiny form factor: OS, BIOS, boot time, etc. Many embedded problems can be solved with an Arduino or some other controller which has a different set of considerations.

Choices for shared memory system, MPI library, original RDMA or ULP over RDMA?

I am new on High Performance Computing (HPC), but I am going to have a HPC project, so I need some help to solve some fundamental problems.
The application scenario is simple: Several servers connected by the InfiniBand (IB) network, one server for Master, and others for slaves. only the master read/write in-memory data (size of the data ranges from 1KB to several hundred MBs) into slaves, while slaves just passively store the data in their memories ( and dump the in-memory data into disks at the right time). All computation are are performed in the Master, before the writing or after the reading the data to/from the slaves. The requirement of the system is low latency (small regions of data, such as 1KB-16KB) and high throughput (large regions of data, several hundred MBs).
So, My questions are
1. Which concrete way is more suitable for us? MPI, primitive IB/RDMA library or ULPs over RDMA.
As far as I know, existing Message Passing Interface (MPI) library, primitive IB/RDMA library such as libverbs and librdmacm and User Level Protocal (ULPs) over RDMA might be feasible choices, but I am not very sure of their applicable scopes.
2. Should I make some tunings for the OS or the IB network for better performance?
There is a paper [1] from Microsoft announces that
We improved performance
by up to a factor of eight with careful tuning and
changes to the operating system and the NIC drive
For my part, I will try to avoid such performance tuning as I can. However, if the tuning is unavoidable, I will try my best. The IB network of our environment is Mellanox InfiniBand QDR 40Gb/s and I can choose the Linux OS for servers freely.
If you have any ideas, comments and answers are welcome!
Thanks in advance!
[1] FaRM: Fast Remote Memory
If you use MPI, you will have the benefit of an interconnect-independent solution. It doesn't sound like this is going to be something you are going to keep around for 20 years, but software lasts longer than you ever think it will.
Using MPI also gives you the benefit of being able to debug on your (oversusbscribed, possibly) laptop or workstation before rolling it out onto the infiniband machines.
As to your second question about tuning the network, I am sure there is no end of tuning you can do but until you have some real workloads and hard numbers, you're wasting your time. Get things working first, then worry about optimizing the network. Maybe you need to tune for many tiny packages. Perhaps you need to worry about a few large transfers. The tuning will be pretty different depending on the case.

Understanding cpu frequency, thread selection and more

With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference. [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me

Moving threads across CPUs with clock_gettime(CLOCK_MONOTONIC)

I've heard people complain that the WinAPI functions QueryPerformanceFrequency() and QueryPerforamnceCounter() can behave erratically and unstably when the OS decides to move the calling thread to a new physical CPU.
Does anybody know if clock_gettime(CLOCK_MONOTONIC) suffers from similar issues? Or is it more guaranteed to be stable?
Also, are the worries about QPF/QPC on WinAPI just a thing of the past? Or are they still concerns even today?
OP:
I've heard people complain that the WinAPI functions QueryPerformanceFrequency() and QueryPerforamnceCounter() can behave erratically and unstably when the OS decides to move the calling thread to a new physical CPU.
I'm not sure what you mean by "erratic" or "unstable." If you mean drift, or variance between cores, these concerns are probably based on computers that shipped 12-15 years ago (XP and Win2000-based OS). From Microsoft:
QPC is available on Windows XP and Windows 2000 and works well on most systems. However, some hardware systems' BIOS didn't indicate the hardware CPU characteristics correctly (a non-invariant TSC), and some multi-core or multi-processor systems used processors with TSCs that couldn't be synchronized across cores. Systems with flawed firmware that run these versions of Windows might not provide the same QPC reading on different cores if they used the TSC as the basis for QPC.
That has pretty much become a non-issue for most current hardware (see link, below).
OP:
Does anybody know if clock_gettime(CLOCK_MONOTONIC) suffers from similar issues? Or is it more guaranteed to be stable?
Well, any high-frequency clock has to come from somewhere. On a Windows box (since you asked about QPC), where is that value going to come from? Adding an additional layer to any call to QueryPerformanceCounter essentially guarantees less precision, as there will be more CPU instructions in the mix between "now" and "now as reported back to you by the OS" (and, although extremely unlikely, there would be a small increase in the possibility of preemption, contributing further loss of precision).
This applies equally to any Intel-based Linux/BSD/whatever boxes, as they have to run with the same hardware characteristics. In the Intel-architecture world, the highest frequency you'll be able to get is going to be a value based on RDTSC and the operating system's best effort to keep any TSC values as close as possible across cores or processors (at least in the absence of specialized hardware).
This is why any benchmarking should be an average performance expectation based on a large number of data points.
Microsoft actually has a pretty good document outlining the implementation characteristics of high-frequency timers, including HPET and ACPI power management times, and touches briefly on multi-clock and virtualization: http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx.
OP:
Also, are the worries about QPF/QPC on WinAPI just a thing of the past? Or are they still concerns even today?
For most of the world, yes. I work on server-based code where microseconds count (high frequency market data), but most people who think a few hundred microseconds are going to make or break their program are kidding themselves. Do you know how long it takes to serve a web page? The CPU itself gets bored with all that waiting, even with thousands of users.

Direct Cpu Threads or OpenCL

I have search the various questions (and web) but did not find any satisfactory answer.
I am curious about whether to use threads to directly load the cores of the CPU or use an OpenCL implementation. Is OpenCl just there to make multi processors/cores just more portable, meaning porting the code to either GPU or CPU or is OpenCL faster and more efficient? I am aware that GPU's have more processing units but that is not the question. Is it indirect multi threading in code or using OpneCL?
Sorry I have another question...
If the IGP shares PCI lines with the Descrete Graphics Card and its drivers can not be loaded under Windows 7, I have to assume that it will not be available, even if you want to use the processing cores of the integrated GPU only. Is this correct or is there a way to access the IGP without drivers.
EDIT: As #Yann Vernier point out in the comment section, I haven't be strict enough with the terms I used. So in this post I use the term thread as a synonym of workitem. I'm not refering to the CPU threads.
I can’t really compare OCL with any other technologies that will allow using the different cores of a CPU as I only used OCL so far.
However I might bring some input about OCL especially that I don’t really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will run as well on a CPU it doesn’t mean that it’ll be efficient. The reason is simply that OCL doesn’t work the same way on CPU and GPU. To have a good understanding of how it differs, see the chap 6 of “heterogeneous computing with opencl”. To summary, while the GPU will launch a bunch of threads within a given workgroup at the same time, the CPU will execute on a core one thread after another within the same workgroup. See as well the point 3.4 of the standard about the two different types of programming models supported by OCL. This can explain why an OCL kernel could be less efficient on a CPU than a “classic” code: because it was design for a GPU. Whether a developer will target the CPU or the GPU is not a problem of “serious work” but is simply dependent of the type of programming model that suits best your need. Also, the fact that OCL support CPU as well is nice since it can degrade gracefully on computer not equipped with a proper GPU (though it must be hard to find such computer).
Regarding the AMD platform I’ve noticed some problem with the CPU as well on a laptop with an ATI. I observed low performance on some of my code and crashes as well. But the reason was due to the fact that the processor was an Intel. The AMD platform will declare to have a CPU device available even if it is an Intel CPU. However it won’t be able to use it as efficiently as it should. When I run the exact same code targeting the CPU but after installing (and using) the Intel platform all the issues were gone. That’s another possible reason for poor performance.
Regarding the iGPU, it does not share PCIe lines, it is on the CPU die (at least of Intel) and yes you need the driver to use it. I assume that you tried to install the driver and got a message like” your computer does not meet the minimum requirement…” or something similar. I guess it depends on the computer, but in my case, I have a desktop equipped with a NVIDIA and an i7 CPU (it has an HD4000 GPU). In order to use the iGPU I had first to enable it in the BIOS, which allowed me to install the driver. Of Course only one of the two GPU is used by the display at a time (depending on the BIOS setting), but I can access both with OCL.
In recent experiments using the Intel opencl tools we experienced that the opencl performance was very similar to CUDA and intrincics based AVX code on gcc and icc -- way better than earlier experiments (some years ago) where we saw opencl perform worse.

Resources