Choices for a shared-memory system: MPI library, raw RDMA, or ULPs over RDMA? (Linux)

I am new to High Performance Computing (HPC), but I am about to start an HPC project, so I need help with some fundamental decisions.
The application scenario is simple: several servers connected by an InfiniBand (IB) network, one acting as master and the others as slaves. Only the master reads/writes in-memory data (sizes range from 1 KB to several hundred MB) to/from the slaves, while the slaves just passively store the data in their memory (and dump it to disk at the right time). All computation is performed on the master, before writing the data to the slaves or after reading it back. The system must provide low latency for small regions of data (1 KB-16 KB) and high throughput for large regions (several hundred MB).
So, my questions are:
1. Which approach is the most suitable for us: MPI, a low-level IB/RDMA library, or ULPs over RDMA?
As far as I know, an existing Message Passing Interface (MPI) library, a low-level IB/RDMA library such as libibverbs and librdmacm, or Upper Layer Protocols (ULPs) over RDMA are all feasible choices, but I am not very sure of their applicable scopes.
2. Should I make some tunings for the OS or the IB network for better performance?
There is a paper [1] from Microsoft which states:
"We improved performance by up to a factor of eight with careful tuning and changes to the operating system and the NIC driver."
For my part, I would like to avoid such performance tuning as much as possible; however, if tuning is unavoidable, I will do my best. The IB network in our environment is Mellanox InfiniBand QDR 40 Gb/s, and I can freely choose the Linux OS for the servers.
Any ideas, comments, and answers are welcome!
Thanks in advance!
[1] A. Dragojević et al., "FaRM: Fast Remote Memory", NSDI 2014

If you use MPI, you will have the benefit of an interconnect-independent solution. It doesn't sound like this is something you are going to keep around for 20 years, but software lasts longer than you ever think it will.
Using MPI also gives you the benefit of being able to debug on your (possibly oversubscribed) laptop or workstation before rolling it out onto the InfiniBand machines.
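To make that concrete, here is a minimal sketch of the master/slave pattern from the question using plain MPI point-to-point calls; the 16 KB buffer size and the tag are illustrative only, and real code would add error handling, chunking of the multi-hundred-MB transfers, and non-blocking overlap (MPI_Isend):

    /* Master pushes a buffer to every slave; slaves passively store it.
     * Sketch only: no error handling, fixed illustrative buffer size. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const size_t nbytes = 16 * 1024;          /* e.g. a 16 KB region */
        char *buf = malloc(nbytes);

        if (rank == 0) {                          /* master */
            memset(buf, 'x', nbytes);
            for (int slave = 1; slave < size; ++slave)
                MPI_Send(buf, (int)nbytes, MPI_BYTE, slave, 0, MPI_COMM_WORLD);
        } else {                                  /* slave: passively store */
            MPI_Recv(buf, (int)nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* keep buf in memory; dump to disk when appropriate */
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

MPI's one-sided operations (MPI_Put/MPI_Get on an MPI_Win) map even more directly onto the "only the master reads/writes" pattern, and MPI libraries built for InfiniBand typically use RDMA underneath, so staying at the MPI level does not mean giving up the hardware's capabilities.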
As to your second question about tuning the network: I am sure there is no end of tuning you can do, but until you have some real workloads and hard numbers, you're wasting your time. Get things working first, then worry about optimizing the network. Maybe you need to tune for many tiny messages; perhaps you need to worry about a few large transfers. The tuning will be quite different depending on the case.

Related

Will running a parallel process on twice the machines using half the cores result in a speed increase?

I'm designing a very compute-heavy algorithm, and I'm more constrained by time than by access to remote machines on which to run it.
My question is the following:
Let's say each machine I have access to has 24 cores, and I have 48 tasks to run. Currently, I'm dispatching the algorithm to two machines, each of which uses its 24 cores to handle 24 of the tasks.
If I instead dispatched the same process to 4 machines, each spawning 12 threads, would the tasks (likely) be completed quicker? I'm curious whether having some spare cores on a machine means computations are performed faster than if every single core is occupied running an individual thread.
This is highly dependent on the actual algorithm, the actual dataset, and the target hardware, including the interconnection network if the processes communicate or the input/output data are heavy (or if the algorithm runs very quickly). Some applications scale better on many machines with few cores, and some scale better on few machines with many cores. In high-performance computing, researchers have worked for decades to understand the performance of hybrid applications, and there is still no clear answer: it depends. (Note that the question is already quite hard to answer for a given well-defined application with a well-defined dataset, enough so that people write research papers on it.)
If your tasks are memory-bound, then using more machines with fewer cores is often better. If the amount of transferred data is large or the algorithm requires low latency, then using fewer machines is often better (typically one big SMP). There are many other things to consider, since machines are not just a bag of cores: NUMA effects should be considered, for example, as well as caches, the storage subsystem, and even the OS (not all OS subsystems scale well on a given machine).

Is there a way to count high-performance cores on heterogeneous multicore CPUs?

In a Heterogeneous Multi-Processing (HMP) model, the different cores of a CPU or SoC don't have the same performance profiles. While HMP systems were first deployed a while ago (the wiki mentions Samsung started using this model back in 2013), they are coming to a head with Apple's M-series (technically it was already an issue with Hyper-Threading, which could be considered a form of HMP, I guess).
Parallelized tools generally try to guess the number of workers they should create by counting the cores on the system. In an HMP model, however, that can be counter-productive, especially on personal devices: while the "efficiency" cores are technically available, they have low performance, so loading them will not bring a huge performance gain, and it can drastically impact the interactivity / pleasantness of using the system. This is usually configurable (so the user can set something more reasonable), but it seems a better default to restrict such tools to the "high-performance" cores only (at least assuming they are CPU-bound), leaving users the option to increase residency if they so choose.
So is there a portable way to list and separate "real" and "high-performance" cores from "virtual" and "efficiency" cores?
Note: I've seen 68444429 ("how can I distinguish between high- and low-performance cores in C++"); while the title is similar, the question and goals are rather different, as the goal here is to avoid generating unnecessary and inefficient work by default.
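For illustration, one non-portable, Linux-only heuristic is to compare per-CPU attributes exposed in sysfs, such as cpufreq/cpuinfo_max_freq, and treat the slower cores as "efficiency" cores. A rough sketch follows; the sysfs paths and the frequency-based heuristic are assumptions rather than a guaranteed interface, and it says nothing about SMT siblings:

    /* Rough, Linux-only heuristic: treat CPUs whose cpuinfo_max_freq is below
     * the system-wide maximum as "efficiency" cores.  The sysfs paths and the
     * frequency heuristic are assumptions, not a portable or official API. */
    #include <stdio.h>
    #include <unistd.h>

    static long max_freq_khz(long cpu)
    {
        char path[128];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%ld/cpufreq/cpuinfo_max_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;                 /* cpufreq not exposed for this CPU */
        long khz = -1;
        if (fscanf(f, "%ld", &khz) != 1)
            khz = -1;
        fclose(f);
        return khz;
    }

    int main(void)
    {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        long best = 0;

        for (long c = 0; c < ncpu; ++c) {
            long f = max_freq_khz(c);
            if (f > best)
                best = f;
        }
        for (long c = 0; c < ncpu; ++c) {
            long f = max_freq_khz(c);
            printf("cpu%ld: %ld kHz -> %s\n", c, f,
                   (f > 0 && f < best) ? "efficiency (guess)" : "performance (guess)");
        }
        return 0;
    }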

How can I determine if a Raspberry Pi is powerful enough to run my code?

Perhaps I posted this with the wrong tags, but hopefully someone can help me. I am an engineer finding myself deeper and deeper in automation. Recently I designed an automated system on a Raspberry Pi. I wrote a fairly simple script that was duplicated to read sensor values from different serial ports simultaneously. I did it this way so I could shut down one script without compromising the others if need be. It runs very well now, but I had problems overloading my CPU when I first started (I believe it was because I started all of the scripts at once rather than one at a time).
My question is:
How can I determine how much computing power is required by code I have written? How can I spec out a computer to run my code before I start building the robot?
The three resources you're likely to be bounded by on any computer are disk, RAM, and CPU (cores). MicroSD cards are cheap, and easily swapped, so the bigger concern is the latter two.
Depending on the language you're writing in, you'll have more or less control over memory usage. Python in particular "saves" the developer by "handling" memory automatically. There are a few good articles on memory management in Python, like this one. When running a simple script (e.g. activate these IO pins) on a machine with gigabytes of memory, this is rarely an issue. When running data intensive applications (e.g. do linear algebra on this gigantic array) then you have to worry about how much memory you need to do the computation and whether the interpreter actually frees it when you're done. This is not always easy to calculate but if you profile your software on another machine you may be able to estimate it.
CPU utilization is comparatively easy to prepare for. Reserve one core for the OS and other functions, and the rest are available to your software. If you write single-threaded code, this should be plenty. If you use parallel processing, then either stick to N-1 workers (see the sketch below) or you'll need to get creative with the software design.
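A minimal sketch of that N-1 rule, written in C with POSIX threads for illustration (the worker body is a placeholder; in Python the same core count is available via os.cpu_count()):

    /* Spawn one worker per online core, minus one core left for the OS and
     * other processes.  Sketch only: the worker body is a placeholder. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        /* ... do the sensor-reading / processing work here ... */
        return NULL;
    }

    int main(void)
    {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        long nworkers = ncores > 1 ? ncores - 1 : 1;

        pthread_t tid[64];
        if (nworkers > 64)
            nworkers = 64;

        for (long i = 0; i < nworkers; ++i)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (long i = 0; i < nworkers; ++i)
            pthread_join(tid[i], NULL);

        printf("ran %ld workers on %ld cores\n", nworkers, ncores);
        return 0;
    }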
Edit: all of this is with the Raspberry Pi in mind. The Pi is a full computer in a tiny form factor: OS, BIOS, boot time, etc. Many embedded problems can be solved with an Arduino or some other microcontroller, which comes with a different set of considerations.

How to make the OS schedule disk accesses optimally?

Suppose that a process needs to access the file system in many (1000+) places, and the order is not important to the program logic. However, the order obviously matters for performance if the file system is stored on a (spinning) hard disk.
How can the application programmer communicate to the OS that it should schedule the accesses optimally? Launching 1000+ threads does not seem practical. Does database management software accomplish this, and if so, then how?
Additional details: I had a large (1 TB+) mmapped file from which I needed to read 1000+ chunks of about 1 KB, each time at new, unpredictable offsets.
In the early days, when parameters like Wikipedia: Hard disk drive performance characteristics → Seek time were very expensive and thus very important, database vendors paid attention to the on-disk data representation and layout, as can be seen e.g. in Oracle8i: Designing and Tuning for Performance → Tuning I/O.
The important optimization parameters changed with the appearance of solid-state drives (SSDs), where the seek time is zero (or at least constant) because there is nothing to rotate. Some of the new parameters are addressed by Wikipedia: Solid-state drive (SSD) → optimized file systems.
But even those optimization parameters go away with the use of Wikipedia: In-memory databases. The list of vendors is pretty long, and all the big players are on it.
So how to schedule your accesses optimally depends a lot on the use case ("1000 concurrent hits" is not a sufficient problem description); buying some RAM is one of the options, and "how can the programmer communicate with the OS" should be one of the last (not the first) questions.
Files and their transactions are cached in various devices in your computer; RAM and the HD cache are the most usual places. The file system driver may also implement IO transaction queues, defragmentation, and error-correction logic that makes things complicated for the developer who wants to control every aspect of file access. This level of complexity is ultimately designed to provide integrity, security, performance, and coordination of file access across all processes of your system.
Optimization efforts should not interfere with the system's own caching and prediction algorithms, not just for IO but for all caches. Trying to second-guess your system is a waste of your time and your processors' time.
Most probably your IO operations and data will stay on caches and later be committed to your storage devices when your OS sees fit.
That said, there are always options like database suites, mmap, readahead mechanisms, and direct IO to your drive. You will need to invest time benchmarking any of your efforts. I advise against multiple IO threads, because cache contention will make things even slower than one thread.
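For the direct-IO option mentioned above, here is a minimal sketch assuming Linux's O_DIRECT flag; the file name, block size, and alignment values are placeholders (actual alignment requirements depend on the file system and device), and you should benchmark before adopting anything like this:

    /* Direct IO sketch: bypasses the page cache, so the application takes
     * full responsibility for buffering and alignment.  "data.bin" and the
     * 4096-byte alignment are placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096;              /* typical logical block size */
        const size_t len   = 4096;

        void *buf;
        if (posix_memalign(&buf, align, len) != 0)
            return 1;

        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* offset, length, and buffer must all be aligned for O_DIRECT */
        ssize_t n = pread(fd, buf, len, 0);
        if (n < 0)
            perror("pread");

        close(fd);
        free(buf);
        return 0;
    }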
The kernel will already reorder the read/write requests (e.g. to fit the spin of a mechanical disk), if they come from various processes or threads. BTW, most of the reads & writes would go to the kernel file system cache, not to the disk.
You might consider using posix_fadvise(2) and perhaps (in a separate thread) readahead(2). If, instead of read(2)-ing, you use mmap(2) to map some portion of the file into virtual memory, you might also use madvise(2).
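For instance, a sketch of batching posix_fadvise hints before the actual reads, assuming an ordinary file descriptor (the file name and the offsets are placeholders):

    /* Hint the kernel about a batch of upcoming 1 KB reads so it can schedule
     * and merge them, then read in any convenient order.  "bigfile.dat" and
     * the offsets are illustrative placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t offsets[] = { 4096, 1048576, 734003200 };   /* placeholders */
        size_t n = sizeof offsets / sizeof offsets[0];
        const size_t chunk = 1024;

        /* declare the access pattern random, then prefetch each chunk */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
        for (size_t i = 0; i < n; ++i)
            posix_fadvise(fd, offsets[i], chunk, POSIX_FADV_WILLNEED);

        /* later: ordinary pread() calls; by then the data is often cached */
        char buf[1024];
        for (size_t i = 0; i < n; ++i)
            if (pread(fd, buf, chunk, offsets[i]) < 0)
                perror("pread");

        close(fd);
        return 0;
    }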
Of course, the file system does not usually guarantee that a sequential portion of a file is physically sequential on the disk (and even the disk firmware might reorder sectors). See the picture on the Ext2 wikipage, which is also relevant for Ext4. Some file systems might be better in that respect, and you can tune their block size (at mkfs time).
I would not recommend having thousands of threads (at most a few dozen).
Finally, it might be worth buying an SSD or some more RAM (for the file cache). See http://linuxatemyram.com/
Actual performance would depend a lot on the particular system and hardware.
Perhaps using an indexed-file library like GDBM, or a database library like SQLite (or a real database like PostgreSQL), might be worthwhile! Having fewer but bigger files could also help.
BTW, you are mmap-ing and reading small chunks of 1 KB (smaller than the 4 KB page size). You could use madvise (if possible, in advance), but you should try to read larger chunks, since every file access will bring in at least a whole page.
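A sketch of that madvise idea on the mmapped file, issuing page-aligned MADV_WILLNEED hints ahead of the reads (the file name and offsets are placeholders):

    /* Prefetch a batch of pages from a large mmapped file with MADV_WILLNEED,
     * then touch them.  Each hint covers a full page because the kernel
     * faults whole pages in anyway.  Offsets are placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        madvise(base, st.st_size, MADV_RANDOM);          /* access-pattern hint */

        off_t offsets[] = { 8192, 2097152, 536870912 };  /* placeholders */
        size_t n = sizeof offsets / sizeof offsets[0];

        /* issue all prefetch hints first ... */
        for (size_t i = 0; i < n; ++i) {
            off_t aligned = offsets[i] & ~(off_t)(page - 1);
            madvise(base + aligned, page, MADV_WILLNEED);
        }
        /* ... then read the 1 KB chunks once the pages are (likely) resident */
        volatile char sink = 0;
        for (size_t i = 0; i < n; ++i)
            sink ^= base[offsets[i]];

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }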
You really should benchmark!

Hardware for a highly concurrent multi-threaded application

I am looking for hardware that must run about 256 computationally intensive, real-time, concurrent tasks 24 hours a day (one multi-threaded C application). Each task takes about 40-50 MFLOPS, so all tasks together require about 10 GFLOPS. CPU-RAM speed is insignificant. All tasks must be managed by a Linux kernel (32-bit, with SMP).
I am looking for a single-mainboard solution with one multi-core CPU (if such a CPU exists). If it doesn't, then I need a multi-socket mainboard solution (with multiple CPUs).
Can you please recommend a professional CPU/mainboard solution that satisfies these requirements? It is also very important that there are no issues with the Linux kernel (2.6.25). No virtualization, no need for huge RAM or a big CPU cache. I would also prefer the Intel architecture and well-proven stability. I still have doubts that this is feasible at all.
Thank you in advance.
UPDATE:
I think I have found the right answer here and here.
UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
Rent some Amazon EC2 nodes.
Updated: How about PS3s, then? NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture, but these run Linux and have 64 cores on a single die:
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10 GFLOPS isn't exactly to be sneezed at, so in a single machine it'll be expensive. There's also the problem of what you do when the machine breaks; you're unlikely to have a second machine of a similar spec available. If you build a cluster using commodity hardware, you're a little more resilient, and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer Intel, but if you need one chip, I will again suggest the Cell processor: its theoretical peak performance is around 25 GFLOPS, and kernel 2.6.25 already had support for it.
You could try a pre-slim PlayStation 3 for experimenting (that would cost you little) or get a server-based solution at around US$8K. You will have to rewrite and fine-tune your threads to take advantage of the SPU co-processors, but you could meet your computational needs without breaking a sweat on a single Cell (1 PPC core + 8 SPUs).
NB: with a PlayStation 3 you'd have only 6 available co-processors, but you don't seem to be on a budget with this project.
So you could at least try IBM's Cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
There are commercially available Cell products, both as stand-alone servers in a blade form factor and as PCI Express add-on boards for PC workstations, from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seems to be around the previously mentioned US$8,000.00 for these PCI Express cards.
A PlayStation 3 console can be purchased for about US$300.00 and would allow you to prototype your application and check whether it is up to the needed performance. (I got one myself and have Fedora 9 running on it, although I did that as a hobbyist and have not, so far, used it for any calculations. I also put together a 12-machine PlayStation 3 cluster for molecular simulations at the local university. The application they ran did not take advantage of the multimedia SPUs while I was in touch with them, but even so, clocked at 3.5 GHz, they performed better than standard, similarly priced PCs, even considering that PS3s are priced 5x higher around here.)
