I would like to use OpenSSL for handling all our SSL communication (both client and server sides). We would like to use a HW acceleration card for offloading the heavy cryptographic calculations.
We noticed that in the OpenSSL 'speed' test, there are direct calls to the cryptographic functions (e.g. RSA_sign/decrypt, etc.). In order to fully utilize the HW capacity, multiple threads were needed (up to 128 threads) which load the card with requests and make sure the HW card is never idle.
We would like to use the high level OpenSSL API for handling SSL connections (e.g. SSL_connect/read/write/accept), but this API doesn't expose the point where the actual cryptographic operation is done. For example, when calling SSL_connect we are not aware of the point where the RSA operations are performed, and we don't know in advance which calls will lead to heavy cryptographic calculations, so we cannot refer only those to the accelerator.
Questions:
How can I use the high level API while still fully utilizing the HW accelerator? Should I use multiple threads?
Is there a 'standard' way of doing this? (implementation example)
(Answered in UPDATE) Are you familiar with Intel's asynchronous OpenSSL? It seems that they were trying to solve this exact issue, but we cannot find the actual code or usage examples.
UPDATE
From Accelerating OpenSSL* Using Intel® QuickAssist Technology you can see that Intel also mentions the utilization of multiple threads/processes:
The standard release of OpenSSL is serial in nature, meaning it handles one connection within one context. From the point of view of cryptographic operations, the release is based on a synchronous/blocking programming model. A major limitation is throughput can be scaled higher only by adding more threads (i.e., processes) to take advantage of core parallelization, but this will also increase context management overhead.
Intel's OpenSSL branch can finally be found here.
More info can be found in the PDF contained here.
It looks like Intel changed the way the OpenSSL ENGINE works: it posts work to the driver and returns immediately, while the corresponding result is polled for later.
If you use another SSL accelerator, then the corresponding OpenSSL ENGINE should be modified in the same way.
According to Interpreting openssl speed output for rsa with multi option, -multi doesn't "parallelize" the work itself; it just runs multiple benchmark processes in parallel.
So, your HW card's load will be essentially limited by how much work is available at the moment (note that in industry in general, 80% planned capacity load is traditionally considered optimal in case of load spikes). Of course, running multiple server threads/processes will give you the same effect as multiple benchmarks.
OpenSSL supports multiple threads provided that you give it callbacks to lock shared data. For multiple processes, it warns about reusing data state inherited from the parent.
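For reference, here is a minimal sketch of the locking callbacks that OpenSSL 1.0.x expects before it is safe to use from multiple threads (OpenSSL 1.1.0 and later handle locking internally, so this only applies to older versions; error handling is omitted):

    // Thread-safety setup for OpenSSL 1.0.x; not needed with OpenSSL 1.1.0+.
    #include <openssl/crypto.h>
    #include <pthread.h>
    #include <vector>

    static std::vector<pthread_mutex_t> locks;

    // Called by OpenSSL whenever it needs to take or release lock number n.
    static void locking_cb(int mode, int n, const char* /*file*/, int /*line*/) {
        if (mode & CRYPTO_LOCK)
            pthread_mutex_lock(&locks[n]);
        else
            pthread_mutex_unlock(&locks[n]);
    }

    static unsigned long id_cb() {
        return (unsigned long)pthread_self();   // identify the calling thread
    }

    void init_openssl_locking() {
        locks.resize(CRYPTO_num_locks());
        for (auto& m : locks)
            pthread_mutex_init(&m, nullptr);
        CRYPTO_set_locking_callback(locking_cb);
        CRYPTO_set_id_callback(id_cb);
    }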
That's it for scaling vertically. For scaling horizontally:
OpenSSL supports asynchronous I/O through asynchronous BIOs
but, its elemental crypto operations and internal ENGINE calls are synchronous, and changing this would require a logic overhaul
private efforts to make them provide asynchronous operation have met severe criticism due to major design flaws
Intel announced some "Asynchronous OpenSSL" project (08.2014) to use with its hardware, but the linked white paper gives little details about its implementation and development state. One developer published some related code (10.2015), noting that it's "stable enough to get an overview".
As jww has mentioned in the comments, you should use the ENGINE API to accomplish the task. There is an example in the above link on how to use that API. Usually, the hardware accelerator provider implements a library that is called an "ENGINE"; this engine provides cryptographic acceleration and can be used by OpenSSL internally. Assuming that the accelerator you want to use has an ENGINE implemented (for example "cswift"), you should get the engine by calling ENGINE *e = ENGINE_by_id("cswift");, then initialize it with ENGINE_init(e); and set it as the default for the operations you want it to handle, for example ENGINE_set_default_RSA(e);.
After calling these functions, you can use the high level API of OpenSSL (e.g. SSL_connect/read/write/accept)
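To illustrate, here is a minimal sketch of that setup. The engine id "cswift" is only an example; substitute the id of your accelerator's engine, and add proper error reporting in real code:

    // Minimal ENGINE setup sketch; assumes an engine with id "cswift" is installed.
    #include <openssl/engine.h>

    bool use_hw_engine() {
        ENGINE_load_builtin_engines();            // make built-in engines available
        ENGINE* e = ENGINE_by_id("cswift");       // look up the accelerator's engine
        if (!e)
            return false;
        if (!ENGINE_init(e)) {                    // obtain a functional reference
            ENGINE_free(e);
            return false;
        }
        ENGINE_set_default_RSA(e);                // route RSA operations to this engine
        // From here on SSL_connect/SSL_accept/SSL_read/SSL_write use the engine for
        // RSA without any further changes to the high-level code.
        ENGINE_free(e);                           // release the structural reference
        return true;
    }

Note that ENGINE_init() keeps its own functional reference, so the engine stays usable after ENGINE_free() releases the structural one obtained from ENGINE_by_id().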
Related
I am implementing a system composed of a collection of small systems, i.e. Raspberry, Yun, Beaglebone, the occasional PC. Crossbar.io has real promise ... but, as I understand it, doesn't currently support multiple nodes. Am I correct? Does anyone know when that might happen?
In the meantime it occurred to me that each individual node can offer an HTTP interface that I might be able to use for my purposes. My initial thought is to create workers that wrap access to the web interface on subsidiary nodes. This fits the overall architecture of the system I want to create - does it have any merit? Is it tractable? I'm new to websockets - any insight would be a great help.
Thanks for your time,
Al
In general that does sound like a fit for Crossbar.io.
There is no timeline on multi-node (i.e. multiple routers), but we hope to have at least hot-standby nodes for high availability ready in Q1. Other than for high availability, I think that a single instance should provide sufficient performance for most applications out there - on a single current (non-high-end) Xeon we're talking tens of thousands of events a second, and concurrent connections are mostly limited by RAM (and 100s of thousands on a single box are definitely not a problem). (If you need more than that then I'd be very interested in your specific use case - we want to learn more about our users.)
I don't completely understand the second part of your question: What precisely is the architecture you're planning here? If you're talking about the integrated Web server, then with recent optimizations (it can now use multiple cores) this should be enough for even moderately big sites, and with SPAs you're not likely to ever run into performance issues.
Hope this helps, and I'll be glad to answer in more detail once you've clarified the second part.
I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but its development has also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and was built into the Intel compiler suite, i.e. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems to be mature enough for production HPC environments (and it's priced accordingly). Seems like most efforts now go into development of co-array languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI as it is definitely not going away in the years to come.
Before that, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This advance had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. source
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality source.
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that consider data movement.
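For context, here is a small sketch of the kind of data-transfer directives meant above: a standard OpenMP target-offload loop in C++, where the map clause states explicitly which data moves to and from the device:

    // Standard OpenMP target offload: the map clause makes data movement explicit.
    #include <vector>

    void scale(std::vector<double>& v, double factor) {
        double* data = v.data();
        std::size_t n = v.size();
        // Copy data[0..n) to the device, run the loop there, copy the result back.
        #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= factor;
    }

Compile with an offloading-capable compiler (e.g. clang++ -fopenmp with an offload target configured); the remote offloading plugin mentioned above builds on this same target-offload model.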
In Xenomai's API of Posix skin, I find the following:
POSIX skin.
Clocks and timers services.
Condition variables services.
Interruptions management services.
Message queues services.
Mutex services.
Semaphores services.
Shared memory services.
Signals services.
Threads management services.
Thread cancellation.
Threads scheduling services.
Thread creation attributes.
Thread-specific data.
I can't see anything regarding file handling and socket programming, so I am guessing that perhaps file handling and sockets are not to be dealt with in real time? Is that guess wrong?
Please guide.
Xenomai and its origin, RTAI, both take control of your scheduler, handling the Linux kernel itself as a non-real-time thread.
They provide many services, most of which, as you can see, are related to threads and synchronization, and these services do NOT call the Linux API (in kernel space) or system calls (in user space). As you know, real-time is all about "guaranteeing deadlines", and calling into Linux violates that (because Linux doesn't guarantee anything).
Since drivers are also important in real-time systems, they have implemented the real-time driver model, or RTDM that helps both implementing and using device drivers in a real-time context.
File handling in the kernel is strongly frowned upon. If you are talking about user-space real-time applications, then you can have access to any driver that is implemented in RTDM. If you don't find one for file handling or sockets, then no, you can't use them. Note that even a printf uses Linux system calls and is forbidden.
Note that if you do use them, nothing breaks, you just lose your real-time-ness! I personally do use files for logging, but only call them in case of an error that means real-time is already ruined anyway.
I don't know about Xenomai, but at least in RTAI, if you call a Linux system call, then you get a warning like "RTAI: LXRT changed mode: syscall ..." in your kernel logs.
Real-time is a property of the ENTIRE SYSTEM. To achieve real-time behaviour, all of the system's components (including hardware, operating system, drivers, libraries, and applications) have to be designed with the requirements of real-time systems in mind. Such components (like an RTOS) can be used to build a real-time system, but their usage doesn't automatically mean that the final system will be real-time. In fact, if even one component of your system doesn't meet real-time requirements, your entire system won't be real-time!
Real-time systems usually have resources significantly exceeding the average requirements of the real-time tasks. Unconsumed resources can be used for performing useful but non-critical background tasks, such as logging, monitoring of the system state, statistics collection and analysis, etc. Applications that perform these tasks can be designed as non-real-time components which run on top of the real-time components. This design is safe as long as you are sure that all components participating in the real-time tasks meet real-time requirements. Hence the direct answer to your question:
It completely depends on the application. In general, all code that is not used in handling real-time tasks CAN BE written as non-real-time. All code which is used in handling real-time tasks MUST BE written as real-time.
What Xenomai does is isolate the non-real-time Linux and the activities used for handling non-real-time tasks in a special container, which runs on top of the RTOS kernel in parallel with the RTOS-based real-time tasks. To build a real-time system on top of Xenomai, your application should rely only on the Xenomai API and on other libraries and APIs which are known and proven to be real-time. All background activities which may be useful but are completely uncritical can be written as ordinary Linux applications.
Systems and services such as storage and network services are usually not used in real-time tasks, because the commonly used hardware is very non-deterministic and thus doesn't fit well into the real-time concept. It is hard to say a priori how much time it will take to send five packets over the network or write a file to the HDD. Because of this, real-time interfaces for such systems are not commonplace. But again, the application dictates what real-time services it needs. I can imagine real-time tasks which involve storage and network actions. For such tasks, the designer is forced to find system components which provide real-time storage and network services. As you can see, Xenomai is not a candidate.
We are still in the design phase of our project, but we are thinking of having three separate processes on an embedded Linux kernel. One of the processes will be a communications module which handles all communications to and from the device through various mediums.
The other two processes will need to be able to send/receive messages through the communication process. I am trying to evaluate the IPC techniques that Linux provides; the messages the other processes will be sending will vary in size, from debug logs to streaming media at a ~5 Mbit rate. Also, the media could be streaming in and out simultaneously.
Which IPC technique would you suggest for this application?
http://en.wikipedia.org/wiki/Inter-process_communication
Processor is running around 400-500 Mhz if that changes anything.
Does not need to be cross-platform, Linux only is fine.
Implementation in C or C++ is required.
When selecting your IPC you should consider causes for performance differences including transfer buffer sizes, data transfer mechanisms, memory allocation schemes, locking mechanism implementations, and even code complexity.
Of the available IPC mechanisms, the choice for performance often comes down to Unix domain sockets or named pipes (FIFOs). I read a paper on Performance Analysis of Various Mechanisms for Inter-process Communication that indicates Unix domain sockets for IPC may provide the best performance. I have seen conflicting results elsewhere which indicate pipes may be better.
When sending small amounts of data, I prefer named pipes (FIFOs) for their simplicity. This requires a pair of named pipes for bi-directional communication. Unix domain sockets take a bit more overhead to set up (socket creation, initialization and connection), but are more flexible and may offer better performance (higher throughput).
You may need to run some benchmarks for your specific application/environment to determine what will work best for you. From the description provided, it sounds like Unix domain sockets may be the best fit.
Beej's Guide to Unix IPC is good for getting started with Linux/Unix IPC.
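To make the comparison concrete, here is a minimal Unix domain socket server sketch; the socket path is only an example and error handling is trimmed:

    // Minimal Unix domain socket server sketch; path and buffer size are examples.
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main() {
        const char* path = "/tmp/ipc_demo.sock";

        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

        unlink(path);                                 // remove any stale socket file
        bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
        listen(srv, 1);

        int conn = accept(srv, nullptr, nullptr);     // blocks until a client connects
        char buf[256];
        ssize_t n = read(conn, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            std::printf("received: %s\n", buf);
        }
        close(conn);
        close(srv);
        unlink(path);
    }

The client side just calls socket(AF_UNIX, SOCK_STREAM, 0), connect()s to the same path and write()s its messages; for the streaming media case you would keep the connection open and loop over write()/read().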
I would go for Unix Domain Sockets: less overhead than IP sockets (i.e. no inter-machine comms) but same convenience otherwise.
Can't believe nobody has mentioned dbus.
http://www.freedesktop.org/wiki/Software/dbus
http://en.wikipedia.org/wiki/D-Bus
Might be a bit over the top if your application is architecturally simple, in which case - in a controlled embedded environment where performance is crucial - you can't beat shared memory.
If performance really becomes a problem you can use shared memory - but it's a lot more complicated than the other methods - you'll need a signalling mechanism to signal that data is ready (semaphore etc) as well as locks to prevent concurrent access to structures while they're being modified.
The upside is that you can transfer a lot of data without having to copy it in memory, which will definitely improve performance in some cases.
Perhaps there are usable libraries which provide higher level primitives via shared memory.
Shared memory is generally obtained by mmaping the same file using MAP_SHARED (which can be on a tmpfs if you don't want it persisted); a lot of apps also use System V shared memory (IMHO for stupid historical reasons; it's a much less nice interface to the same thing)
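As an illustration of that approach, here is a minimal POSIX shared-memory sketch; the name "/ipc_demo" is only an example, and the semaphore/locking that real code needs is deliberately left out:

    // Minimal POSIX shared memory sketch (writer side); link with -lrt on older glibc.
    // No synchronisation shown: a real design needs a semaphore or similar to signal
    // readiness and to guard concurrent access.
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>

    int main() {
        const char* name = "/ipc_demo";          // example name, shows up under /dev/shm
        const size_t size = 4096;

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, size);                      // set the size of the region

        void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        std::strcpy(static_cast<char*>(mem), "hello from the writer");

        // Another process shm_open()s and mmap()s the same name and reads the data
        // in place, with no copying through the kernel.

        munmap(mem, size);
        close(fd);
        shm_unlink(name);                         // remove the object when done
        return 0;
    }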
As of this writing (November 2014) Kdbus and Binder have left the staging branch of the linux kernel. There is no guarantee at this point that either will make it in, but the outlook is somewhat positive for both. Binder is a lightweight IPC mechanism in Android, Kdbus is a dbus-like IPC mechanism in the kernel which reduces context switch thus greatly speeding up messaging.
There is also "Transparent Inter-Process Communication" or TIPC, which is robust, useful for clustering and multi-node set ups; http://tipc.sourceforge.net/
Unix domain sockets will address most of your IPC requirements. You don't really need a dedicated communication process in this case since the kernel provides this IPC facility. Also, look at POSIX message queues, which in my opinion are one of the most under-utilized IPC mechanisms in Linux but come in very handy in many cases where n:1 communication is needed.
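As a quick taste of that API, a minimal POSIX message queue sketch (the queue name "/ipc_demo_mq" is only an example; link with -lrt on older glibc):

    // Minimal POSIX message queue sketch (sender side); error handling trimmed.
    #include <mqueue.h>
    #include <fcntl.h>
    #include <cstring>

    int main() {
        const char* qname = "/ipc_demo_mq";        // example queue name

        mq_attr attr{};
        attr.mq_maxmsg = 10;                       // queue capacity
        attr.mq_msgsize = 256;                     // maximum size of one message

        mqd_t q = mq_open(qname, O_CREAT | O_WRONLY, 0600, &attr);
        const char* msg = "debug: link up";
        mq_send(q, msg, std::strlen(msg) + 1, /*priority=*/0);

        // A receiver mq_open()s the same name with O_RDONLY and calls mq_receive();
        // several senders can feed a single receiver, which suits n:1 designs.

        mq_close(q);
        return 0;
    }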
The idea of automatic memory management has gained big support with new programming languages. I'm interested whether concepts exist for automatic management of other resources like files, network sockets, etc.?
For single threaded applications, the pattern of a resource being available for the extent of a block of code, with clean-up at the end, exists in several languages. Examples are the use of RAII in C++, or with-open-file in Common Lisp (with equivalents in newer Lisp-influenced languages - the same pattern exists in Dylan, C#, and Python, and in Ruby you can pass a block to a file object).
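For instance, a minimal C++ RAII sketch: the file is closed when the object goes out of scope, no matter how the block is left, which is the pattern referred to above:

    // RAII: the resource's lifetime is tied to a scope, so cleanup is automatic.
    #include <fstream>
    #include <stdexcept>
    #include <string>

    std::string first_line(const std::string& path) {
        std::ifstream file(path);                 // resource acquired on construction
        if (!file)
            throw std::runtime_error("cannot open " + path);

        std::string line;
        std::getline(file, line);
        return line;
        // file's destructor closes the handle here, even if an exception is thrown
    }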
I'm not aware of anything better suited for the multithreaded environments where modern garbage collection shines, short of combining RAII and reference counting or auto_ptr in C++, which isn't always a trivial combination.
One important distinction between automatic management of resources and automatic memory management is that memory management can often afford to be non-deterministic and only reclaimed when the process requires it, whereas often a resource is limited at an OS level, so should be reclaimed as soon as it is no longer used. Hence the choice of smart pointers rather than garbage collection as the management implementation. There's an intermediate level of resource - GDI objects, temporary file handles, threads - where an application wants to limit the total it uses, but doesn't care so much about releasing them to other processes - these are often pooled, which gets you some way towards automatic management.
One of the reasons we can automatically manage memory allocation now is that we have so much of it.
Back in the days when memory was tight, you had to squeeze the most out of every byte the system had.
Other resources such as file handles and sockets are far fewer, and still need to be handled by hand (pun intended).
Consider also the .NET Compact Framework: it's not uncommon for Windows Mobile devices to have 32 MB or 64 MB of volatile memory to play with, which - when you think about it - is still "lots".
I am wondering what the footprint of the .NET Compact Framework is, and how it would perform on a Nokia phone with 4 MB of volatile memory.
Anyone any ideas?
(This is a wiki answer, feel free to correct or add more detail)
So, IMHO we can afford to be slow reclaiming memory, because we're not going to run out of it in a hurry, which isn't the case with other resources.
Object persistence and caching subsystems can be considered an automatic allocation of files and resources. If you apply a caching subsystem to a network connection you don't have to care about file opening, file deleting, and so on.
A way to manage network connections automatically can be found in parallel computing environments (e.g. MPI): you programmatically set up the shape of the processor interconnections, then just send a message from one process to another, almost ignoring the way it's implemented. Sometimes those messages are translated into sockets.
If you have a function that lets you get a page from its URL, would you consider that a sort of automatic socket management?