OpenMP program on different hosts - multithreading

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.

Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor that allows one to create a virtual NUMA machine on top of many networked hosts and then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on the data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but development has also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel 64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP built into the Intel compiler suite, i.e. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems mature enough for production HPC environments (and it's priced accordingly). It seems that most efforts now go into the development of PGAS languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC, or invest some time in learning MPI, as it is definitely not going away in the years to come.

Previously, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. source
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality source.
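To make that concrete, here is a minimal OpenMP sketch (ordinary C++, nothing Cluster-OpenMP-specific assumed) contrasting fine-grained synchronization with a coarse-grained reduction; under a page-based DSM the first pattern forces constant page traffic, while the second lets each thread reuse its own pages and synchronize only once:

```cpp
// Minimal sketch: the same summation written with fine-grained synchronization
// (expensive under a page-based DSM like Cluster OpenMP) and with a reduction
// that keeps each thread on its own data and synchronizes once at the end.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1 << 20, 1.0);
    const long n = static_cast<long>(data.size());
    double sum_atomic = 0.0, sum_reduction = 0.0;

    // Costly pattern: every iteration synchronizes on a shared location.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        #pragma omp atomic
        sum_atomic += data[i];
    }

    // Cheaper pattern: private accumulation, one synchronization at the end.
    #pragma omp parallel for reduction(+:sum_reduction)
    for (long i = 0; i < n; ++i)
        sum_reduction += data[i];

    std::printf("%f %f\n", sum_atomic, sum_reduction);
    return 0;
}
```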

Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs around data movement.
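To give a feel for those directives, here is a small sketch (a plain OpenMP target region; the remote-offloading plugin is just one possible backend for the "device") where the map clauses spell out which arrays move to and from the device:

```cpp
// Sketch of OpenMP target offloading with explicit data-movement clauses.
// The map() clauses describe which arrays travel to and from the device;
// with the LLVM remote-offloading plugin that device may live on another host.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    float *pa = a.data(), *pb = b.data(), *pc = c.data();

    #pragma omp target teams distribute parallel for \
            map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", pc[0]);
    return 0;
}
```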

Related

Running a single threaded app across multiple cores/cpu's

I have a single-threaded java application that I'd like to make run on multiple cores. Is there a software that'll allow me to essentially "combine" multiple cores into one logical unit for the software to run on, with minimal to no changes to the underlying software?
If not, how would I go about addressing this issue without changing too much of the java application?
No, this is basically a fundamental problem in hardware utilization. There is currently no uniform way of automatically parallelizing single-threaded applications (aside from running the application as multiple processes).
This is well established and has been the focus of compiler research for decades. That being said, it is probably not impossible, but evidently very hard to achieve.
More info:
Automatic parallelization
Does the JVM have the ability to detect opportunities for parallelization?

Fully utilizing HW accelerator

I would like to use OpenSSL for handling all our SSL communication (both client and server sides). We would like to use HW acceleration card for offloading the heavy cryptographic calculations.
We noticed that in the OpenSSL 'speed' test there are direct calls to the cryptographic functions (e.g. RSA_sign/decrypt, etc.). In order to fully utilize the HW capacity, multiple threads are needed (up to 128) to load the card with requests and make sure it is never idle.
We would like to use the high-level OpenSSL API for handling SSL connections (e.g. SSL_connect/read/write/accept), but this API doesn't expose the point where the actual cryptographic operation is done. For example, when calling SSL_connect, we are not aware of the point where the RSA operations are done, and we don't know in advance which calls will lead to heavy cryptographic calculations, so we cannot refer only those to the accelerator.
Questions:
How can I use the high level API while still fully utilizing the HW accelerator? Should I use multiple threads?
Is there a 'standard' way of doing this? (implementation example)
(Answered in UPDATE) Are you familiar with Intel's asynchronous OpenSSL? It seems that they were trying to solve this exact issue, but we cannot find the actual code or usage examples.
UPDATE
From Accelerating OpenSSL* Using Intel® QuickAssist Technology you can see that Intel also mentions the use of multiple threads/processes:
The standard release of OpenSSL is serial in nature, meaning it handles one connection within one context. From the point of view of cryptographic operations, the release is based on a synchronous/blocking programming model. A major limitation is throughput can be scaled higher only by adding more threads (i.e., processes) to take advantage of core parallelization, but this will also increase context management overhead.
Intel's OpenSSL branch can finally be found here.
More info can be found in the PDF contained here.
It looks like Intel changed the way the OpenSSL ENGINE works - it posts work to the driver and returns immediately, and the corresponding result is then polled for.
If you use another SSL accelerator, the corresponding OpenSSL ENGINE has to be modified in the same way.
According to Interpreting openssl speed output for rsa with multi option, -multi doesn't "parallelize" the work, it just runs multiple benchmarks in parallel.
So your HW card's load will essentially be limited by how much work is available at the moment (note that in industry in general, 80% of planned capacity is traditionally considered the optimal load, to leave headroom for spikes). Of course, running multiple server threads/processes will give you the same effect as multiple benchmarks.
OpenSSL supports multiple threads provided that you give it callbacks to lock shared data. For multiple processes, it warns about reusing data state inherited from parent.
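For reference, a minimal sketch of those locking callbacks, assuming the OpenSSL 1.0.x-era API (OpenSSL 1.1.0 and later handle locking internally, so this only applies to the older releases discussed here), might look like this:

```cpp
// Sketch of the locking callbacks OpenSSL 1.0.x needs before it can be used
// safely from multiple threads (newer OpenSSL versions handle this themselves).
#include <openssl/crypto.h>
#include <pthread.h>
#include <vector>

static std::vector<pthread_mutex_t> locks;

static void locking_cb(int mode, int n, const char * /*file*/, int /*line*/) {
    if (mode & CRYPTO_LOCK)
        pthread_mutex_lock(&locks[n]);
    else
        pthread_mutex_unlock(&locks[n]);
}

void setup_openssl_threading() {
    locks.resize(CRYPTO_num_locks());          // one mutex per internal lock
    for (auto &m : locks)
        pthread_mutex_init(&m, nullptr);
    CRYPTO_set_locking_callback(locking_cb);   // protect shared OpenSSL state
}
```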
That's it for scaling vertically. For scaling horizontally:
openssl supports asynchronous I/O through asynchronous BIOs
but, its elemental crypto operations and internal ENGINE calls are synchronous, and changing this would require a logic overhaul
private efforts to make them provide asynchronous operation have met severe criticism due to major design flaws
Intel announced some "Asynchronous OpenSSL" project (08.2014) to use with its hardware, but the linked white paper gives little details about its implementation and development state. One developer published some related code (10.2015), noting that it's "stable enough to get an overview".
As jww has mentioned in the comments, you should use the engine API to accomplish the task. There is an example in the above link on how to use that API. Usually, the hardware accelerator provider implements a library called an "ENGINE"; this engine provides cryptographic acceleration and can be used by OpenSSL internally. Assuming that the accelerator you want to use has an ENGINE implemented (for example "cswift"), you should get the engine by calling ENGINE *e = ENGINE_by_id("cswift");, then initialize it with ENGINE_init(e);, and set it as the default for the operations you want to offload, for example ENGINE_set_default_RSA(e);.
After calling these functions, you can use the high-level OpenSSL API for handling SSL connections (e.g. SSL_connect/read/write/accept).
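Put together, a minimal sketch of that sequence (the engine id "cswift" is just the example from above - substitute your accelerator's id, and add real error reporting) could look like this:

```cpp
// Minimal sketch: select a hardware-accelerator ENGINE before opening SSL
// connections, so the high-level SSL_* calls use it internally.
// The engine id "cswift" is only an example; use your accelerator's id.
#include <openssl/engine.h>
#include <openssl/ssl.h>
#include <cstdio>

bool use_accelerator_engine(const char *engine_id) {
    ENGINE_load_builtin_engines();           // make built-in engines visible
    ENGINE *e = ENGINE_by_id(engine_id);     // look the engine up by its id
    if (!e) return false;
    if (!ENGINE_init(e)) {                   // obtain a functional reference
        ENGINE_free(e);
        return false;
    }
    ENGINE_set_default_RSA(e);               // route RSA operations to it
    // From here on, SSL_connect/SSL_accept/SSL_read/SSL_write perform their
    // RSA work through the engine without further changes to the call sites.
    ENGINE_free(e);                          // drop the structural reference
    return true;
}

int main() {
    SSL_library_init();
    if (!use_accelerator_engine("cswift"))
        std::fprintf(stderr, "engine not available\n");
    return 0;
}
```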

Performance - multithreaded or multiprocess applications

In order to develop a highly network intensive server application on linux, what sort of architecture is preferred? The idea is that this app would typically run on machines with multiple cores (either virtual or physical). Considering that performance is the key criteria, is it better to go for a multi-threaded application or the one with multi-process design? I do know that sharing of resources and synchronization to access of such resources from multiple processes is a lot of programming overhead, but as mentioned earlier overall performance is the key requirement and so we can ignore those things. And the programming language would be C/C++.
I have heard that even the multi-threaded applications (single process) can take advantage of multiple cores and run each thread on a different core independently (as long as there is no sync issues). And this scheduling is done by the kernel. If so, is there not much difference in performance between multi-threaded applications and multi-process applications? Nginx uses a multi-process architecture and is really quick, but can one get the same performance with multi-threaded applications?
Thanks.
Processes and threads on Linux are very similar to each other - the main differences are that threads share the whole virtual memory, and that certain things like signal handling differ.
This makes for cheaper context switches between threads (no need for costly MMU reloads etc.) but doesn't necessarily cause much difference in speed (especially outside of thread creation).
For designing a highly network-intensive application, basically the only solution is to use an evented architecture (otherwise you'll bog down the system with a huge number of processes/threads and spend more time on their management than actually running work code), where you react to I/O on sockets and, based on which sockets exhibit activity, perform the appropriate operations.
A famous writeup about the problems faced in such situations is "The C10k problem", available from http://www.kegel.com/c10k.html - it describes different I/O approaches, so despite being a bit dated, it's a very good introduction.
Be careful before jumping deeply into reactor-like designs, though - they can get unwieldy and complex, so see if you can't use a library/language that provides a nicer abstraction over it (Erlang is my personal favourite here; languages with coroutines like Go can be useful too).
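As a rough illustration of the reactor idea (Linux-specific epoll, error handling stripped, a trivial echo instead of real protocol logic), a single-threaded event loop can be as small as this:

```cpp
// Bare-bones evented echo server: one thread, one epoll instance; the loop
// reacts only to descriptors that show activity instead of blocking per client.
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(8080);
    bind(listener, reinterpret_cast<sockaddr *>(&addr), sizeof addr);
    listen(listener, SOMAXCONN);

    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listener;
    epoll_ctl(ep, EPOLL_CTL_ADD, listener, &ev);

    epoll_event events[64];
    for (;;) {
        int ready = epoll_wait(ep, events, 64, -1);   // sleep until activity
        for (int i = 0; i < ready; ++i) {
            int fd = events[i].data.fd;
            if (fd == listener) {                     // new connection
                int client = accept(listener, nullptr, nullptr);
                if (client < 0) continue;
                fcntl(client, F_SETFL, O_NONBLOCK);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            } else {                                  // readable client: echo back
                char buf[4096];
                ssize_t n = read(fd, buf, sizeof buf);
                if (n <= 0) { close(fd); continue; }
                write(fd, buf, n);
            }
        }
    }
}
```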
If your threads do their job independently of one another, then under Linux there is simply no reason not to go with multiple processes instead. Multiple processes will increase your memory usage, since each process has its own private memory space, but on the other hand sharing a memory space between independent threads is the worse choice. Context switching between threads vs. processes is usually handled better for processes than for threads, although this is somewhat architecture- and code-dependent. Processes are safe from being serialized by locks and mutexes. Processes are easier to manage and interact with in Linux. Here is a good document you might find interesting (http://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf).

Managing multiple-processes: What are the common strategies?

While multithreading is faster in some cases, sometimes we just want to spawn multiple worker processes to do the work. This has the benefits of not crashing the main app if one of the workers crashes, and of the user not needing to worry much about inter-locking issues.
COM+'s Application Pooling seems like a good way to achieve this on Windows. The downside is that we need to write a COM+ wrapper for the worker process.
However, when I search for Application Pooling on Google, it seems like most of its usages are related to IIS. Don't other applications (such as scientific/graphics) find it useful to spawn multiple worker processes?
So there are several questions:
Why isn't COM+ more popular in areas other than IIS? If I write a non-IIS application and want to use process management on Windows, should I go with COM+ or are there better alternatives out there?
What would be the cross platform way to do it? Are there libraries out there that give me a "process pool" (worker processes will intelligently pick up work, can be managed, etc.)
I can't offer any answers to the COM aspect of your question, but it's worth noting there's another world (besides HPC MPI) where multi-processing (rather than the more common multi-threading approach) is apparently alive, well and thriving: Python.
Why? Python's GIL ("global interpreter lock") cripples most attempts to multithread Python code so badly that multiprocessing is the generally recommended approach to parallelising Python on SMP. The standard library includes process pools; there are various other options too.
Python certainly ought to satisfy any multi-platform requirement!
You might want to investigate how the apache web server manages process pools. From version 2.0 it runs natively on windows and one of the multi-processing models it supports are process pools. A part of apache is also APR (apache portable runtime), which handles platform-specific issues.
No one can really answer why something is not popular; maybe nobody is looking for what you are looking for. After .NET came into the picture, people shifted from COM to the managed environment. Before .NET, COM, ATL and other related technologies were quite painful to implement, they would crash, and they were also quite difficult to debug.
That is the reason the managed environment came into existence.
However, from .NET 4 onwards, the parallel libraries give the user much more power for parallel programming, and you can also spawn and control other processes.
For a multi-platform approach, you can look at zvrba's answer.
Yes, other applications--especially science applications--find it useful to spawn multiple processes. Since few super-computers run Microsoft Windows, scientists generally avoid using anything that ties them to a Microsoft platform. Nothing related to COM will help scientists leverage their enormous existing code base written in Fortran.
People who choose to run IIS have generally already drunk the Microsoft Koolaid, so they have fewer inhibitions to tying themselves to Microsoft's proprietary platforms, which is why COM-specific terminology will get lots of hits related to IIS.
One of the open standards for doing what you want is the Message Passing Interface. Several implementations exist and some of them run on supercomputers using Fortran. Some of them run on cheaper computers using sexier languages.
See http://en.wikipedia.org/wiki/Message_Passing_Interface
There hasn't been a mob rushing through the doors of COM application pooling primarily because of two factors:
COM is a pain in the ass to deal with compared to just about anything else
Threading can be a headache, but it's a lot easier and more convenient to manage than inter-process communication
COM application pooling was essentially created for IIS. It has one very specific benefit over normal multithreading: the multiple processes are fully isolated from each other. This is important for data security and for app stability when dealing with third party plugins of questionable stability.
Scientific computing generally doesn't need strong data security isolation between operations, and I would venture to guess that scientific computing doesn't rely much on third party plugins of questionable stability. When doing big math operations, you're either using a sexy numerics library that had better be rock solid to be taken seriously, or you're using your own code, in which case crashes should be fixed and repeat offenders should be spanked.
Oh, and all crashes except stack overflow can be trapped and dealt with within a multithreaded app, especially if it's your own code.
In short, COM app pooling is overkill for just about anything other than IIS.
Google's web browser Chrome is multi-process architecture software. It is open source, so you can check out its code and see how it manages processes.

Are there examples of programming languages that support automatic management of resources besides memory?

The idea of automatic memory management has gained big support with new programming languages. I'm interested in whether concepts exist for automatic management of other resources like files, network sockets, etc.
For single-threaded applications, the pattern of a resource being available for the extent of a block of code, with clean-up at the end, exists in several languages. Examples are the use of RAII in C++, or with-open-file in Common Lisp (and its equivalents in newer Lisp-influenced languages - the same exists in Dylan, C# and Python, and in Ruby you can pass a block to a file object).
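As a minimal sketch of the RAII version of that pattern (a toy file wrapper, not a production class), the clean-up is tied to the end of the block rather than written out by hand:

```cpp
// RAII sketch: the file is closed when the object leaves scope, whether the
// block ends normally or via an exception, mirroring with-open-file / "with".
#include <cstdio>
#include <stdexcept>
#include <string>

class ScopedFile {
public:
    ScopedFile(const std::string &path, const char *mode)
        : f_(std::fopen(path.c_str(), mode)) {
        if (!f_) throw std::runtime_error("cannot open " + path);
    }
    ~ScopedFile() { std::fclose(f_); }            // cleanup tied to scope exit
    ScopedFile(const ScopedFile &) = delete;      // exactly one owner per handle
    ScopedFile &operator=(const ScopedFile &) = delete;
    std::FILE *get() const { return f_; }
private:
    std::FILE *f_;
};

int main() {
    ScopedFile log("example.log", "w");           // resource acquired here
    std::fputs("hello\n", log.get());
}   // ...and released here, with no explicit close call
```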
I'm not aware of anything better suited for the multithreaded environments where modern garbage collection shines, short of combining RAII and reference counting or auto_ptr in C++, which isn't always a trivial combination.
One important distinction between automatic management of resources and automatic memory management is that memory management can often afford to be non-deterministic and only reclaimed when the process requires it, whereas often a resource is limited at an OS level, so should be reclaimed as soon as it is no longer used. Hence the choice of smart pointers rather than garbage collection as the management implementation. There's an intermediate level of resource - GDI objects, temporary file handles, threads - where an application wants to limit the total it uses, but doesn't care so much about releasing them to other processes - these are often pooled, which gets you some way towards automatic management.
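As a toy sketch of that pooling idea (a hypothetical Connection type standing in for a GDI object, temporary file handle or socket), a smart pointer's custom deleter can return the resource to the pool rather than destroy it, which bounds the total in use without manual bookkeeping:

```cpp
// Sketch of a tiny resource pool: a shared_ptr's custom deleter hands the
// resource back to the pool when the last user drops it. The pool must
// outlive every handle it gives out; that lifetime rule is not enforced here.
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Connection { /* hypothetical limited resource */ };

class ConnectionPool {
public:
    explicit ConnectionPool(std::size_t limit) {
        for (std::size_t i = 0; i < limit; ++i)
            free_.push_back(std::make_unique<Connection>());
    }
    // Hands out a connection; when the caller releases it, it returns to free_.
    std::shared_ptr<Connection> acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        if (free_.empty()) return nullptr;            // pool exhausted
        Connection *raw = free_.back().release();
        free_.pop_back();
        return std::shared_ptr<Connection>(raw, [this](Connection *c) {
            std::lock_guard<std::mutex> lock(mu_);    // deleter: recycle, don't delete
            free_.push_back(std::unique_ptr<Connection>(c));
        });
    }
private:
    std::mutex mu_;
    std::vector<std::unique_ptr<Connection>> free_;
};
```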
One of the reasons we can automatically manage memory allocation now is we have so much of it.
Back in the days when memory was tight, you had to squeeze the most out of every byte the system had.
Other resources such as file handles and sockets are far fewer, and still need to be handled by hand (pun intended).
Consider also the .NET Compact Framework: it's not uncommon for Windows Mobile devices to have 32 MB or 64 MB of volatile memory to play with, which - when you think about it - is still "lots".
I'm wondering what the footprint of the .NET Compact Framework is, and how it would perform on a Nokia phone with 4 MB of volatile memory.
Anyone any ideas?
(This is a wiki answer, feel free to correct or add more detail)
So, IMHO we can afford to be slow reclaiming memory, because we're not going to run out of it in a hurry, which isn't the case with other resources.
Object persistence and caching subsystems can be considered a form of automatic management of files and resources. If you apply a caching subsystem to a network connection, you don't have to care about opening files, deleting files, and so on.
A way to manage network connections automatically can be found in parallel computing environments (e.g. MPI): you set the shape of the processor interconnections programmatically, then you just send a message from one process to another, almost ignoring the way it's implemented. Sometimes those messages are translated into sockets.
If you have a function that lets you get a page from its URL, would you consider that a sort of automatic socket management?
