Is ZeroMQ with RT Linux (RT-PREEMPT patch) real time? - linux

I am considering setting up ZeroMQ as message broker on a Linux kernel patched up with RT-PREEMPT (to make it real time).
Basically I want to publish/subscribe short events that are serialized using Google Protocol Buffers.
1. Event Model Object (App #1) --->
2. Serialize Google protobuf --->
3. ZeroMQ --->
4. Deserialize Google protobuf -->
5. Event Model object (App #2)
From #1 to #5 and perhaps back to #1, how will the real-time guarantees of Linux RT-PREEMPT be affected by ZeroMQ?
I am specifically looking for real-time features of ZeroMQ. Does it provide such guarantees?
To put the question in perspective, let's say I want to know whether ZeroMQ is worth deploying on time-critical systems such as a Ballistic Missile Defense system or a Boeing 777 autopilot.
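To make the pipeline concrete, here is a minimal publish-side sketch (steps 2-3) in C, assuming libzmq; the byte string stands in for the protobuf-serialized event and the endpoint is illustrative, not prescriptive.

```c
// Publish-side sketch for steps 2-3 of the pipeline, assuming libzmq (zmq.h).
// A fixed byte string stands in for the protobuf-serialized event.
#include <zmq.h>
#include <string.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *pub = zmq_socket(ctx, ZMQ_PUB);

    // The transport is chosen purely by the endpoint string; "tcp://*:5556"
    // could be swapped for ipc:// or inproc:// without touching the code.
    zmq_bind(pub, "tcp://*:5556");

    const char *payload = "serialized-protobuf-bytes";   // placeholder payload
    zmq_send(pub, payload, strlen(payload), 0);          // step 3: hand off to ZeroMQ

    // Note: a real publisher would wait for subscribers or loop here; a PUB
    // socket silently drops messages sent before any subscriber has joined.
    zmq_close(pub);
    zmq_ctx_term(ctx);
    return 0;
}
```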

Firstly, PREEMPT_RT reduces the maximum latency for a task, but overall system throughput will be lower and the average latency probably higher.
Real time is a broad subject; to avoid confusion, my definition of real time here is on the order of tens of milliseconds per frame, running at 30 Hz or higher.
Does it provide such (real time features) guarantees?
As already answered, no it does not, and that isn't what PREEMPT_RT is really about.
Is ZeroMQ worthy of deploying on time critical systems?
Time critical is a loose definition, but with a correctly designed protocol ZeroMQ is going to give you options in how messages are transported (in-memory, TCP, UDP/multicast), and rest assured that ZeroMQ does what it does really well.
In my experience ZeroMQ typically delivers quickly on a local high-speed network, but this will degrade over a wider network, and latency may rise as endpoints are added, depending on which messaging model you are using within ZeroMQ.
For real time systems it's not just about transmission speed, there is also latency and time synchronisation to consider.
It's worth reading the article 0MQ: Measuring messaging performance - note that there is a high (~1.5 ms) latency at the start of message transmission that settles quickly; this is probably fine if you need to transmit a lot of small messages at high frequency, but not as good if you are transmitting a few larger messages at a lower rate.
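If you want to see that settling behaviour on your own hardware, a rough round-trip probe is straightforward; the sketch below assumes libzmq and an echo-style REP server already bound at the (illustrative) endpoint.

```c
// Rough round-trip latency probe with a REQ socket, assuming libzmq and an
// echo REP server already listening at the (illustrative) endpoint below.
#include <zmq.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *req = zmq_socket(ctx, ZMQ_REQ);
    zmq_connect(req, "tcp://127.0.0.1:5555");

    char buf[64];
    for (int i = 0; i < 1000; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        zmq_send(req, "ping", 4, 0);
        zmq_recv(req, buf, sizeof buf, 0);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("round-trip %d: %.1f us\n", i, us);  // expect the first few to be slower
    }

    zmq_close(req);
    zmq_ctx_term(ctx);
    return 0;
}
```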
Is ZeroMQ worthy of deploying on time critical systems such as Ballistic Missile Defense or a Boeing 777 autopilot?
It's important to understand the topology of what you're connecting and also how the latency will affect things and design a protocol accordingly.
So to take the case of the 777 autopilot: almost certainly yes, ZeroMQ would be suitable, simply because there is a lot of inertia in a stable aircraft, so the airframe will take time to respond, and the cleverness is inside the autopilot rather than reliant on incoming sensor data at a high rate. On a 777 there is an ARINC 429 bus connecting the avionics, and this runs at a maximum of 100 kbit/s between the limited number of endpoints that can be on any given bus.

Q: Does it provide such (real time features) guarantees?
No, it does not, and it has never tried to. The Zen-of-Zero guarantees that either a complete message payload is delivered or nothing at all, which means your R/T-level code never needs to test for damaged message integrity (once a message has been delivered). Yet it does not guarantee delivery as such, and the application domain is free to build any additional service layer atop the smart and highly performant ZeroMQ layer.
ZeroMQ has been, since its earliest days, a smart, asynchronous, broker-less tool for designing almost linearly scalable distributed-system messaging of any kind, with modest resource needs and an excellent latency/performance envelope.
That said, anyone can build on this and design add-on features that match application-domain-specific needs (N+1 robustness, N+M failure resilience, adaptive node rediscovery, ...), which are deliberately neither hard-coded nor pre-wired into the minimum-footprint, low-latency, highly scalable smart-messaging core engine.
Any design in which the rules of the Zen-of-Zero, coded so wisely into the ZeroMQ core engine, can safely meet the defined hard real-time limits of the real-time-constrained system under review will enjoy ZeroMQ and its many-protocol support for inproc://, ipc://, tipc://, tcp://, pgm://, epgm://, udp://, vmci:// and other wire-level protocols.
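As a small illustration of that many-protocol support, a single socket can even be bound to several of those transports at once; a minimal sketch assuming libzmq, with illustrative endpoint names:

```c
// One socket, several transports: a PUB socket bound to multiple endpoints
// at once, assuming libzmq. Endpoint names are illustrative.
#include <zmq.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *pub = zmq_socket(ctx, ZMQ_PUB);

    zmq_bind(pub, "inproc://events");        // intra-process peers
    zmq_bind(pub, "ipc:///tmp/events.ipc");  // same-host processes
    zmq_bind(pub, "tcp://*:5556");           // remote peers

    zmq_send(pub, "tick", 4, 0);             // fanned out over all bound transports

    zmq_close(pub);
    zmq_ctx_term(ctx);
    return 0;
}
```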
Q: Is ZeroMQ worthy of deploying on time critical systems?
That depends on a great many things.
I remember the days when an F-16 avionics simulator, which simulated the onboard network, internally used an isolated, high-speed, deterministic and rather low-latency (due to static packet/payload sizes) 155+ Mbit/s ATM fabric for the simulated on-plane switching network, precisely to enjoy its benefits for the R/T-control-motivated needs; technology always matches some set of needs-to-have. Once your R/T-system design criteria are defined, anyone can confirm or deny whether a given tool is feasible for designing towards such R/T-system goals.
The factors go way beyond just the ZeroMQ smart features:
properties of your critical system
external constraints ( sector specific authoritative regulations )
Project's planning
Project's external ecosystem of cooperating parties ( greenfield systems have it too )
your Team's knowledge of ZeroMQ, or lack of experience with it in distributed-systems designs
...
and, last but not least, the ceilings on financing and on showing the RTO demonstrator in a live demo and rolling it out for acceptance tests.
Good luck with your forthcoming BMD or any other MCS. The ZeroMQ style of thinking, the Zen-of-Zero, can help you in many aspects of doing the right things right.

Related

Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?

First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.
In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.
This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).
However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.
My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.
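A stripped-down sketch of that busy-wait hand-off, using C11 atomics; thread creation, core pinning and core isolation (e.g. sched_setaffinity plus isolcpus on Linux) are platform-specific and omitted, and all names are illustrative.

```c
// Busy-wait hand-off between an audio callback and a dedicated worker core,
// sketched with C11 atomics. Thread creation, pinning and core isolation
// are omitted; worker() is assumed to run on the dedicated core.
#include <stdatomic.h>
#include <stddef.h>

#define FRAMES 128

static float in_buf[FRAMES], out_buf[FRAMES];
static atomic_int work_ready  = 0;   // audio thread -> worker
static atomic_int result_done = 0;   // worker -> audio thread

static void *worker(void *arg)       // runs forever on the dedicated core
{
    (void)arg;
    for (;;) {
        while (!atomic_load_explicit(&work_ready, memory_order_acquire))
            ;                                        // spin until work arrives
        for (size_t i = 0; i < FRAMES; i++)          // the "math routines"
            out_buf[i] = in_buf[i] * 0.5f;
        atomic_store_explicit(&work_ready, 0, memory_order_relaxed);
        atomic_store_explicit(&result_done, 1, memory_order_release);
    }
    return NULL;
}

static void hand_off_block(void)     // called from the real-time audio callback
{
    atomic_store_explicit(&work_ready, 1, memory_order_release);
    while (!atomic_load_explicit(&result_done, memory_order_acquire))
        ;                                            // audio thread polls for the result
    atomic_store_explicit(&result_done, 0, memory_order_relaxed);
}
```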
My question is, is this approach worth attempting under MacOS/X? (i.e. can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores, and if so, will big ugly busy-waiting/polling loops on those cores (including the polling-loops necessary to synchronize the CoreAudio callback-thread relative to their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?)
It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.
can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores
I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock free ring buffer, like TPCircularBuffer.
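For reference, the essential shape of a lock-free single-producer/single-consumer ring buffer looks like this (this is not TPCircularBuffer's API, just a sketch of the technique with C11 atomics):

```c
// Minimal single-producer/single-consumer ring buffer with C11 atomics.
// One thread calls rb_write(), one thread calls rb_read(); capacity is a
// power of two so indices can wrap with a mask.
#include <stdatomic.h>
#include <stddef.h>

#define RB_CAP 4096

typedef struct {
    float buf[RB_CAP];
    atomic_size_t head;   // advanced by the producer only
    atomic_size_t tail;   // advanced by the consumer only
} spsc_rb;

static int rb_write(spsc_rb *rb, const float *src, size_t n)   // producer side
{
    size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    if (RB_CAP - (head - tail) < n) return 0;                  // not enough space
    for (size_t i = 0; i < n; i++)
        rb->buf[(head + i) & (RB_CAP - 1)] = src[i];
    atomic_store_explicit(&rb->head, head + n, memory_order_release);
    return 1;
}

static int rb_read(spsc_rb *rb, float *dst, size_t n)          // consumer side
{
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
    if (head - tail < n) return 0;                             // not enough data
    for (size_t i = 0; i < n; i++)
        dst[i] = rb->buf[(tail + i) & (RB_CAP - 1)];
    atomic_store_explicit(&rb->tail, tail + n, memory_order_release);
    return 1;
}
```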
But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).
My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties or at least cooperate better with the audio thread.
The documents distinguish between threads working in parallel (this sounds like your case) and threads working asynchronously (this sounds like my proposal); I don't know which case is better for you.
I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.

Using TCP for memory sharing across processes

I made a mistake early on when working with nodejs by not utilizing Redis, Memcache, or other in-memory storage systems. Now it's far too late to rewrite everything to accommodate and correlate my code with those APIs.
However, I just recently found out about forking processes and how beneficial they can be; especially since I'm working on a gameserver.
The problem I have is: memory is not shared between cores in nodejs... until I found a TCP memory-sharing module called Amensia.
With all that said, I have some questions about it pertaining to nodejs and TCP in general:
1) The maximum size of a TCP packet is around 64k, so when using this module I can only share data up to 64k in size?
2) I use global GAMES and users objects to store player data. These objects are updated when a player moves on a map (x,y positions) and upon other actions. Would sending all this data across TCP turn into a bottleneck?
A minimum overhead approach
Equip all your localhost forked processes with an inter-process smart-messaging layer.
This way your "sharing" can be achieved both in the abstract sense and (very attractively in the ZeroMQ case) in the literal sense, since ZeroMQ allows you to avoid data duplication via a shared buffer (the zero-copy maxim).
If your OS allows it, the ipc:// and inproc:// transport classes are almost overhead-free, and inproc:// (thanks to the great architectural thinking of the ZeroMQ team) does not even require any additional thread (with its CPU/RAM overheads) once the context is created with zero I/O threads.
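For the curious, the zero-copy idea looks roughly like this in the underlying C API (the node.js bindings wrap the same engine; endpoint and buffer names are illustrative):

```c
// Zero-copy hand-off over inproc://, assuming libzmq. Ownership of the buffer
// is transferred to the message; ZeroMQ calls free_fn when it is done with it.
#include <zmq.h>
#include <stdlib.h>
#include <string.h>

static void free_fn(void *data, void *hint)
{
    (void)hint;
    free(data);
}

int main(void)
{
    void *ctx = zmq_ctx_new();
    zmq_ctx_set(ctx, ZMQ_IO_THREADS, 0);     // inproc:// needs no I/O thread

    void *a = zmq_socket(ctx, ZMQ_PAIR);
    void *b = zmq_socket(ctx, ZMQ_PAIR);
    zmq_bind(a, "inproc://state");
    zmq_connect(b, "inproc://state");

    size_t len = 1024;
    void *blob = malloc(len);                // e.g. serialized player state
    memset(blob, 0, len);

    zmq_msg_t msg;
    zmq_msg_init_data(&msg, blob, len, free_fn, NULL);   // no copy of blob
    zmq_msg_send(&msg, a, 0);

    zmq_msg_t rcv;
    zmq_msg_init(&rcv);
    zmq_msg_recv(&rcv, b, 0);                // same bytes, no duplication
    zmq_msg_close(&rcv);

    zmq_close(a);
    zmq_close(b);
    zmq_ctx_term(ctx);
    return 0;
}
```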
An approach with even less overhead (if your app fits nanomsg)
In case ZeroMQ seems too heavyweight for your particular goal, you may be interested in its younger sister, nanomsg, which Martin Sustrik, co-father of ZeroMQ, has spun off, and which also has a node.js port available.
Where to go for more details?
The best next step you can take, in either the ZeroMQ or nanomsg case, is to get a bit more of a global view, which may sound complicated for the first few things one tries to code with ZeroMQ; if you are not reading it step by step, at least jump to page 265 of Code Connected, Volume 1.
The fastest learning curve would be to first take a quick look at Fig. 60 (Republishing Updates) and Fig. 62 (HA Clone Server pair) for a possible high-availability approach, and then go back to the roots, elements, and details.
If you fall in love with this mode-of-thinking, you would love Martin Sustrik's blog posts - a smart man, indeed. It is worth the time to at least get inspired by his views and experience.
1) You should not have any problems with TCP packet size. Node will buffer/queue your data if it's too big and send it when the OS reports the socket's file descriptor as writable. You may hit performance issues only if you are writing more than your network bandwidth per second; at that point Node will also use more RAM to queue all these messages.
https://nodejs.org/api/net.html#net_socket_buffersize
2) Most games use TCP or UDP for real-time communication. It can be a bottleneck, as anything else (RAM, CPU, bandwidth, storage) can. At some point under stress, one or more resources will run out, fail, or perform badly. It's generally good practice to use an architecture that can grow horizontally (adding more machines) once all optimizations are done for your bottleneck and you still need to add more simultaneous users to your game server.
https://1024monkeys.wordpress.com/2014/04/01/game-servers-udp-vs-tcp/
You'll probably use TCP to connect to a Redis server (but you can also use a unix socket).
If you only need inter-process communication (and not inter-machine), you should take a look at the "cluster" Node.js core module. It has built-in IPC.

What bluetooth to use (2.1 or 4.0) and how?

The title seems to be too general (I can't think of a good title). I'll try to be specific in the question description.
I was required to make an industrial control box that collects data periodically (maybe 10-20 bytes of data per 5 seconds). The operator will use a laptop or mobile phone to collect the data (without opening the box) via Bluetooth, weekly or monthly or probably at even longer intervals.
I will be in charge of selecting proper modules/chips, doing the PCB, and doing the embedded software too. Because the box is not produced in high volume, I have freedom (in which modules/chips to use, prices, capabilities, etc.) when designing the different components.
The whole application requires a USART port to read in data when available (maybe every 5-10 seconds), an SPI port for data storage (SD card reader/writer), and several GPIO pins for LED indicators or maybe buttons (whether we need buttons and how many is up to my design).
I am new to Bluetooth. I read through the wiki and some Google results, learning about pairing, the Class 1 and Class 2 differences, and the 2.1 and 4.0 differences.
But several things are still unclear to me when deciding which Bluetooth module/chip to use.
A friend of mine mentioned the TI CC2540 to me. I checked and it only supports the 4.0 BLE mode. And from Google, BT 4.0 has a payload of at most 20 bytes per packet. Is BT 4.0 suitable for my application when bulk data will need to be collected every month or every several months? Or is it better to use BT 2.1 with EDR for this application? BT 4.0 BLE mode seems to have faster pairing but lower throughput?
I read through the CC2540 documentation and found that it is not a BT-only chip; it has several GPIO pins and UART pins (I am not sure about SPI). Can I say that the CC2540 itself is powerful enough to host the whole application? Including Bluetooth, data receiving via UART, and SD card reading/writing?
My original design was to use an ARM Cortex-M/AVR32 MCU. The program is just a loop serving each task/event in rounds (or I can even install Linux). There will be a Bluetooth module. The module will automatically take care of pairing; I will only need to tell the module what data to send to the other end. Of course there might be some other control, such as putting the module into a low-power mode, because Bluetooth will only be used once a month or so. But after some study of Bluetooth, I am not sure whether such a BT module exists. Is programming chips like the CC2540 a must?
In my understanding, my device will be a BT slave and the laptop/phone will be the master. My device will periodically probe (perhaps with a longer period to save power) for the existence of a master and pair with it. Once it's paired, it will start sending data. Is my understanding of the procedure correct? Is there any difference in pairing/data sending between 2.1 and 4.0?
How should authentication be designed? I of course want unlimited phones/laptops to be able to pair with the device, but only if they can prove they are the operator.
It's a bit messy; I would appreciate it if you have read through the above questions. The following is a summary:
2.1 or 4.0 to use?
Which one is the better choice, meaning suitable for the application and easy to develop?
ARM/avr32 + CC2540 (or the like)
CC2540 only or the like (if possible)
ARM/avr32 + some BT modules ( such as Bluegiga https://www.bluegiga.com/en-US/products/ )
Should Linux be used?
What should pairing and data sending look like for power saving? Are buttons useful to facilitate a sleep mode versus an active pairing/data-sending mode for power saving?
How should authentication be done? Only operators are allowed, but they can use any laptop/phone.
Thanks for reading.
My two cents (I've done a fair amount of Bluetooth work, and I've designed consumer products currently in the field)... Please note that I'll be glossing over A LOT of design decisions to keep this 'concise'.
2.1 or 4.0 to use?
Looking over your estimates, it sounds like you're looking at about 2MB of data per week, maybe 8MB per month. The question of what technology to use here comes down to how long people are willing to wait to collect data.
For BLE (BT 4.0), assume your data transfer rate is in the 2-4 kB/s range. For 2.1, assume it's in the 15-30 kB/s range. This depends on a ton of factors, but this has generally been my experience between Android/iOS and BLE devices.
At 2MB, either one of those is going to take a long time to transfer. Is it possible to get data more frequently? Or maybe have a wifi-connected base station somewhere that collects data frequently? (I'm not sure of the application)
Which one is the better choice, meaning suitable for the application and easy to develop?
ARM/avr32 + CC2540 (or the like)
CC2540 only or the like (if possible)
ARM/avr32 + some BT modules (such as Bluegiga https://www.bluegiga.com/en-US/products/)
Should Linux be used?
This is a tricky question to answer without knowing more about the complexity of the system, the sensors, the data storage capacity, BOM costs, etc...
In my experience, a Linux OS is HUGE overkill for a simple GPIO/UART/I2C based system (unless you're super comfortable with Linux). Chips that can run Linux, plus the additional RAM, are usually costly (whereas a cheap ARM Cortex-M0 is like 50 cents in decent volume and sounds like all you need).
The question usually comes down to 'external MCU or not' - as in, do you try to get an all-in-one BT module that has application space to program on. Using one is a size and cost saving, but it adds risks and unknowns versus a braindead BT module + external MCU.
One thing I would like to clarify is that you mention the TI CC2540 a few times (actually, the CC2541 is the newer version). Anyway, that chip is an IC-level component, which means doing your own antenna design and putting it through FCC intentional radiator certification (the certs are usually $1k-$10k - and I'm assuming you're in the US when I say FCC).
What I think you're looking for is a ready-made, certified module. For example, the BLE113 from Bluegiga is an FCC certified module which contains the CC2541 internally (plus some bells and whistles). That one specifically also has an interpreted language called BGScript to speed up development. It's great for very simple applications - and has nice, baked in low-power modes.
So, the BLE113 is an example of a module that can be used with/without an external MCU depending on application complexity.
If you're willing to go through FCC intentional radiator certification, then the TI CC2541 is common, as well as the Nordic NRF51822 (this chip has a built in ARM core that you can program on as well, so you don't need an external MCU).
An example of a BLE module which requires an external MCU would be the Bobcats (AMS001) from AckMe. They have a BLE stack running on the chip which communicates to an external MCU using UART.
As with a comment above, if you need iOS compatibility, using Bluetooth 2.1 (BT Classic) is a huge pain due to the MFi program (which I have gone through - pure misery). So, unless iOS is REALLY necessary, I would stick with BT Classic and Android/PC.
Some sample BT classic chips might be the Roving Networks RN42, the AmpedRF BT 33 (or 43 or 53). If you're interested, I did a throughput test on iOS devices with a Bluetooth classic device (https://stackoverflow.com/a/22441949/992509)
What should pairing and data sending look like for power saving? Are buttons useful to facilitate a sleep mode versus an active pairing/data-sending mode for power saving?
If the radio is only on every week or month when the operator is pulling data down, there's very little to do other than to put the BT module into reset mode to ensure there is no power used. BT Classic tends to use more power during transfer, but if you're streaming data constantly, the differences can be minimal if you choose the right modules (e.g. lower throughput on BLE for a longer period of time, vs higher throughput on BT 2.1 for less time - it works itself out in the wash).
The one thing I would do here is to have a button trigger the ability to pair to the BT modules, because that way they're not always on and advertising, and are just asleep or in reset.
How should authentication be done? Only operators are allowed, but they can use any laptop/phone.
Again, not knowing the environment, this can be tricky. If it's in a secure environment, that should be enough (e.g. behind locked doors that you need to be inside to press a button to wake up the BT module).
Otherwise, at a bare minimum, have the standard BT pairing code enabled, rather than allowing anyone to pair. Additionally, you could bake extra authentication and security in (e.g. you can't download data without a passcode or something).
If you wanted to get really crazy, you could encrypt the data (this might involve using Linux though) and make sure only trusted people had decryption keys.
We use both protocols on different products.
Bluetooth 4.0 BLE aka Smart
low battery consumption
low data rates (I got up to 20 bytes every 40 ms; as I remember, Apple's minimum connection interval is 18 ms and other handset makers adopted that interval)
you have to use Bluetooth's characteristics mechanism
you have to implement chaining if your data packets are longer
good range: 20-100 m
new technology with a lot of awful premature implementations. Getting better slowly.
we used a chip from Bluegiga that allows a scripting language for programming, but many limitations and bugs are still built in.
the learning curve to implement BLE was steeper than with 2.1
Bluetooth 2.1
good for high data rates, depending on the baud rate used. The bottleneck here was the buffer in the controller.
shorter range: 2-10 m
It's much easier to stream data
Did not notice a big time difference in pairing and connecting with both technologies.
Here are two examples of devices which clearly demand either 2.1 or BLE. Maybe your use case is closer to one of those examples:
Humidity sensors attached to trees in a forest. Each week the ranger walks through the forest and collects the data.
Wireless stereo headsets

High performance packet handling in Linux

I’m working on a packet reshaping project in Linux using the BeagleBone Black. Basically, packets are received on one VLAN, modified, and then are sent out on a different VLAN. This process is bidirectional - the VLANs are not designated as being input-only or output-only. It’s similar to a network bridge, but packets are altered (sometimes fairly significantly) in-transit.
I’ve tried two different methods for accomplishing this:
1. Creating a user-space application that opens raw sockets on both interfaces. All packet processing (including bridging) is handled in the application (see the raw-socket sketch after this list).
2. Setting up a software bridge (using the kernel bridge module) and adding a kernel module that installs a netfilter hook in post routing (NF_BR_POST_ROUTING). All packet processing is handled in the kernel.
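For reference, the first approach boils down to something like the sketch below (the VLAN interface name is illustrative; CAP_NET_RAW is required and error handling is trimmed):

```c
// Raw AF_PACKET socket bound to one VLAN interface, as in the user-space
// approach above. Interface name is illustrative; needs CAP_NET_RAW.
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_ll sll = {0};
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex("eth0.100");   // the ingress VLAN interface
    bind(fd, (struct sockaddr *)&sll, sizeof sll);

    unsigned char frame[2048];
    for (;;) {
        ssize_t n = recv(fd, frame, sizeof frame, 0);
        if (n <= 0) break;
        // ... modify the frame, then send() it out of the other VLAN's socket ...
    }
    close(fd);
    return 0;
}
```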
The second option appears to be around 4 times faster than the first option. I’d like to understand more about why this is. I’ve tried brainstorming a bit and wondered if there is a substantial performance hit in rapidly switching between kernel and user space, or maybe something about the socket interface is inherently slow?
I think the user application is fairly optimized (for example, I’m using PACKET_MMAP), but it’s possible that it could be optimized further. I ran perf on the application and noticed that it was spending a good deal of time (35%) in v7_flush_kern_dcache_area, so perhaps this is a likely candidate. If there are any other suggestions on common ways to optimize packet processing I can give them a try.
Context switches are expensive and kernel to user space switches imply a context switch. You can see this article for exact numbers, but the stated durations are all in the order of microseconds.
You can also use lmbench to benchmark the real cost of context switches on your particular cpu.
The performance of the user-space application also depends on which syscall is used to monitor the sockets. The fastest is epoll() when you need to handle a lot of sockets; select() will perform very poorly if you handle a lot of sockets.
See this post explaining it:
Why is epoll faster than select?
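To make the comparison concrete, a skeleton of an epoll()-based loop looks like this (it assumes listen_fd is already created, bound, listening and set non-blocking; error handling omitted):

```c
// Skeleton of an epoll-based socket loop; listen_fd is assumed to be a
// non-blocking listening socket. Error handling omitted for brevity.
#include <sys/epoll.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd)
{
    int ep = epoll_create1(0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(ep, events, MAX_EVENTS, -1);   // wakes only for ready fds
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                // accept() the new connection and EPOLL_CTL_ADD its fd
            } else {
                // read()/process/write() on events[i].data.fd
            }
        }
    }
}
```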

Is there any modern review of solutions to the 10000 client/sec problem

(Commonly called the C10K problem)
Is there a more contemporary review of solutions to the c10k problem (Last updated: 2 Sept 2006), specifically focused on Linux (epoll, signalfd, eventfd, timerfd..) and libraries like libev or libevent?
Something that discusses all the solved and still unsolved issues on a modern Linux server?
The C10K problem generally assumes you're trying to optimize a single server, but as your referenced article points out "hardware is no longer the bottleneck". Therefore, the first step to take is to make sure it isn't easiest and cheapest to just throw more hardware in the mix.
If we've got a $500 box serving X clients per second, it's a lot more efficient to just buy another $500 box to double our throughput instead of letting an employee gobble up who knows how many hours and dollars trying to figure out how to squeeze more out of the original box. Of course, that's assuming our app is multi-server friendly, that we know how to load balance, etc., etc...
Coincidentally, just a few days ago, Programming Reddit or maybe Hacker News mentioned this piece:
Thousands of Threads and Blocking IO
In the early days of Java, my C programming friends laughed at me for doing socket IO with blocking threads; at the time, there was no alternative. These days, with plentiful memory and processors it appears to be a viable strategy.
The article is dated 2008, so it pulls your horizon up by a couple of years.
To answer the OP's question, you could say that today the equivalent document is not about optimizing a single server for load, but about optimizing your entire online service for load. From that perspective, the number of combinations is so large that what you are looking for is not a document; it is a live website that collects such architectures and frameworks. Such a website exists, and it's called www.highscalability.com
Side Note 1:
I'd argue against the belief that throwing more hardware at it is a long term solution:
Perhaps the cost of an engineer who "gets" performance is high compared to the cost of a single server. But what happens when you scale out? Let's say you have 100 servers. A 10 percent improvement in server capacity can then save you the cost of 10 servers a month.
Even if you have just two machines, you still need to handle performance spikes. The difference between a service that degrades gracefully under load and one that breaks down is that someone spent time optimizing for the load scenario.
Side note 2:
The subject of this post is slightly misleading. The C10K document does not try to solve the problem of 10k clients per second. (The number of clients per second is irrelevant unless you also define a workload along with sustained throughput under bounded latency. I think Dan Kegel was aware of this when he wrote that doc.) Look at it instead as a compendium of approaches to building concurrent servers, and of micro-benchmarks for the same. Perhaps what has changed between then and now is that at one point in time you could assume the service was a website serving static pages; today the service might be a NoSQL datastore, a cache, a proxy, or one of hundreds of network infrastructure software pieces.
You can also take a look at this series of articles:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3
He shows a fair amount of performance data and the OS configuration work he had to do in order to support 10K and then 1M connections.
It seems like a system with 30GB of RAM could handle 1 million connected clients on a sort of social network type of simulation, using a libevent frontend to an Erlang based app server.
The libev project runs some benchmarks against itself and libevent...
I'd recommend reading Zed Shaw's poll, epoll, science, and superpoll[1]: why epoll isn't always the answer, why sometimes it's even better to go with poll, and how to bring the best of both worlds together.
[1] http://sheddingbikes.com/posts/1280829388.html
Have a look at the RamCloud project at Stanford: https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud
Their goal is 1,000,000 RPC operations/sec/server. They have numerous benchmarks and commentary on the bottlenecks that are present in a system which would prevent them from reaching their throughput goals.
