Using TCP for memory sharing across processes - node.js

I made a mistake when I started working with Node.js by not using Redis, Memcached, or another memory storage system. Now it's far too late to rewrite everything to fit my code around those APIs.
However, I just recently found out about forking processes and how beneficial they can be, especially since I'm working on a game server.
The problem I have is that memory is not shared between forked processes in Node.js, or so I thought until I found a TCP memory-sharing module called Amensia.
With all that said, I have some questions about it pertaining to Node.js and TCP in general:
1) The maximum size of a TCP packet is around 64k, so when using this module can I only share data up to 64k in size?
2) I use global GAMES and users objects to store player data. These objects are updated when a player moves in a map (x,y positions) and upon other actions. Would sending all this data across TCP become a bottleneck?

A minimum overhead approach
Equip all your localhost forked processes with an inter-process smart-messaging layer.
This way your "sharing" can be achieved both in the abstract sense and (in the ZeroMQ case, very attractively) in the literal sense, since ZeroMQ lets you avoid data duplication via a shared buffer (its zero-copy maxim).
If your OS allows it, the ipc:// and inproc:// transport classes are almost overhead-free, and inproc:// (thanks to the great architectural thinking of the ZeroMQ team) does not even require any additional thread (and its CPU/RAM overhead) once the Context is instantiated with zero I/O threads.
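To make that concrete, here is a minimal sketch using the libzmq C API (the node.js bindings expose the same socket types and transport strings); the endpoint name and the message text are placeholders, not anything from your code:

    /* Minimal sketch (libzmq C API). With ZMQ_IO_THREADS set to 0 the context
       spawns no background I/O thread at all, and inproc:// still works because
       it is a pure in-memory transport. Endpoint and message are placeholders.
       Build: cc zmq_inproc_demo.c -lzmq */
    #include <zmq.h>
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        void *ctx = zmq_ctx_new();
        zmq_ctx_set(ctx, ZMQ_IO_THREADS, 0);             /* zero extra threads */

        void *a = zmq_socket(ctx, ZMQ_PAIR);
        void *b = zmq_socket(ctx, ZMQ_PAIR);
        assert(zmq_bind(a, "inproc://game-state") == 0); /* inproc: bind before connect */
        assert(zmq_connect(b, "inproc://game-state") == 0);

        const char msg[] = "player:42 x:10 y:20";
        zmq_send(b, msg, sizeof msg, 0);                 /* queued in shared memory */

        char buf[64];
        int n = zmq_recv(a, buf, sizeof buf, 0);
        printf("received %d bytes: %s\n", n, buf);

        zmq_close(a);
        zmq_close(b);
        zmq_ctx_term(ctx);
        return 0;
    }

Swapping the transport string for ipc:// or tcp:// (with at least one I/O thread in the context) lets the same pattern span forked processes or machines.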
An approach with even less overhead (if your app fits nanomsg)
In case ZeroMQ seems too powerful for your particular goal, you may be interested in its younger sister, nanomsg, which Martin Sustrik, co-father of ZeroMQ, has spun off and which also has a node.js port available.
Where to go for more details?
The best next step you can take in either the ZeroMQ or the nanomsg case is to get a bit more of a global view. That may sound complicated for the first few things one tries to code with ZeroMQ, but if you are not going to read it step by step, at least jump to page 265 of Code Connected, Volume 1.
The fastest learning curve would be to first take a look at Fig. 60 (Republishing Updates) and Fig. 62 (HA Clone Server) for a possible high-availability approach, and only then go back to the roots, elements and details.
If you fall in love with this mode of thinking, you will love Martin Sustrik's blog posts; a smart man, indeed. It is worth the time to at least get inspired by his views and experience.

1) You should not have any problems with TCP packet size. Node will buffer/queue your data if it is too big and send it when the OS gives it a writable socket file descriptor. You may hit performance issues only if you are writing more than your network bandwidth allows per second; at that point Node will also use more RAM to queue all these messages.
https://nodejs.org/api/net.html#net_socket_buffersize
2) Most games use TCP or UDP for real-time communication. It can be a bottleneck, just as anything else (RAM, CPU, bandwidth, storage) can. At some level of stress, one or more resources will run out, fail, or perform badly. It's generally good practice to use an architecture that can grow horizontally (adding more machines) once all optimizations are done for your bottleneck and you still need to support more simultaneous users on your game server.
https://1024monkeys.wordpress.com/2014/04/01/game-servers-udp-vs-tcp/
You'll probably use TCP to connect to a Redis server (but you can also use a unix socket).
If you only need inter-process communication (and not inter-machine), you should take a look at the "cluster" Node.js core module. It has built-in IPC.

Related

Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?

First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.
In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.
This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).
However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.
My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.
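Roughly, the handshake he describes looks like the sketch below (C11 atomics; the names and the heavy_dsp() routine are placeholders, not his actual code):

    /* Sketch of the described busy-wait handshake using C11 atomics.
       Names and heavy_dsp() are placeholders, not real code from the app. */
    #include <stdatomic.h>
    #include <stdbool.h>

    void heavy_dsp(float *buf);               /* the extra math, defined elsewhere */

    static _Atomic bool work_ready = false;   /* set by the audio thread   */
    static _Atomic bool work_done  = true;    /* set by the dedicated core */
    static float scratch[128 * 64];           /* shared sample buffer      */

    /* Runs forever on the requisitioned core. */
    void dedicated_core_loop(void)
    {
        for (;;) {
            while (!atomic_load_explicit(&work_ready, memory_order_acquire))
                ;                                            /* busy-wait / poll */
            atomic_store_explicit(&work_ready, false, memory_order_relaxed);
            heavy_dsp(scratch);
            atomic_store_explicit(&work_done, true, memory_order_release);
        }
    }

    /* Called from the real-time audio callback. */
    void kick_and_wait(void)
    {
        atomic_store_explicit(&work_done, false, memory_order_relaxed);
        atomic_store_explicit(&work_ready, true, memory_order_release);
        while (!atomic_load_explicit(&work_done, memory_order_acquire))
            ;                                                /* busy-wait for the result */
    }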
My question is, is this approach worth attempting under MacOS/X? (i.e. can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores, and if so, will big ugly busy-waiting/polling loops on those cores (including the polling-loops necessary to synchronize the CoreAudio callback-thread relative to their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?)
It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.
can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores
I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock free ring buffer, like TPCircularBuffer.
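Conceptually, the single-producer/single-consumer hand-off works like the sketch below; this only illustrates the idea, it is not TPCircularBuffer's actual API, and the capacity and element type are made up:

    /* Minimal single-producer/single-consumer lock-free ring, sketching the
       idea behind a TPCircularBuffer-style hand-off (not its actual API). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_CAPACITY 4096                  /* samples; must be a power of two */

    typedef struct {
        float          data[RING_CAPACITY];
        _Atomic size_t head;                    /* written only by the producer */
        _Atomic size_t tail;                    /* written only by the consumer */
    } spsc_ring;

    /* Producer side: worker threads push processed samples. */
    static bool ring_push(spsc_ring *r, const float *src, size_t n)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (RING_CAPACITY - (head - tail) < n)
            return false;                       /* not enough free space */
        for (size_t i = 0; i < n; i++)
            r->data[(head + i) % RING_CAPACITY] = src[i];
        atomic_store_explicit(&r->head, head + n, memory_order_release);
        return true;
    }

    /* Consumer side: the IOProc callback pops without taking any lock. */
    static bool ring_pop(spsc_ring *r, float *dst, size_t n)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (head - tail < n)
            return false;                       /* not enough data yet */
        for (size_t i = 0; i < n; i++)
            dst[i] = r->data[(tail + i) % RING_CAPACITY];
        atomic_store_explicit(&r->tail, tail + n, memory_order_release);
        return true;
    }

The IOProc side only ever calls ring_pop(), which never blocks on a lock, so it stays real-time-safe.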
But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).
My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties or at least cooperate better with the audio thread.
The documents distinguish between the threads working in parallel (this sounds like your case) and working asynchronously (this sounds like my proposal), I don't know which case is better for you.
I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.

Is ZeroMQ with RT Linux (RT-PREEMPT patch) real time?

I am considering setting up ZeroMQ as message broker on a Linux kernel patched up with RT-PREEMPT (to make it real time).
Basically I want to publish/subscribe short events that are serialized using google protocol buffers.
1. Event Model Object (App #1) --->
2. Serialize with Google protobuf --->
3. ZeroMQ --->
4. Deserialize with Google protobuf --->
5. Event Model Object (App #2)
From #1 to #5 and perhaps back to #1, how will the real time guarantees of linux RT-PREEMPT be affected by ZeroMQ?
I am specifically looking for real time features of ZeroMQ. Does it provide such guarantees?
To put the question in perspective, let's say I want to know whether ZeroMQ is worthy of deploying on time-critical systems such as Ballistic Missile Defense or a Boeing 777 autopilot.
Firstly PREEMPT_RT reduces the maximum latency for a task but overall system throughput will be lower and the average latency probably higher.
Real time is a broad subject; to avoid confusion my definition of real time is in the order of tens of milliseconds per frame running at 30hz or higher.
Does it provide such (real time features) guarantees?
As already answered, no, it does not, and that isn't what PREEMPT_RT is really about.
Is ZeroMQ worthy of deploying on time critical systems?
Time-critical is a loose definition, but with a correctly designed protocol ZeroMQ is going to give you options in how messages are transported (memory, TCP, UDP/multicast), and rest assured that ZeroMQ does what it does really well.
In my experience ZeroMQ typically delivers fast on a local high-speed network, but this will drop off over a wide-area network and may degrade further as endpoints are added, depending on which model you are using within ZeroMQ.
For real time systems it's not just about transmission speed, there is also latency and time synchronisation to consider.
It's worth reading the article 0MQ: Measuring messaging performance; note that there is a high latency (1.5 ms) at the start of message transmission that settles quickly. That is probably fine if you need to transmit a lot of small messages at high frequency, but not as good if you are transmitting a few larger messages at a lower rate.
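To make the pipeline from the question concrete, a minimal PUB/SUB pair over tcp:// in the libzmq C API looks roughly like this; the payload is just an opaque stand-in for serialized protobuf bytes, and the endpoint/port are made up:

    /* Minimal PUB/SUB sketch (libzmq C API). The payload stands in for serialized
       protobuf bytes; endpoint and port are placeholders. Build: cc demo.c -lzmq */
    #include <zmq.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        if (fork() == 0) {                                  /* "App #2": subscriber */
            void *ctx = zmq_ctx_new();
            void *sub = zmq_socket(ctx, ZMQ_SUB);
            zmq_connect(sub, "tcp://127.0.0.1:5556");
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);      /* subscribe to everything */
            char buf[256];
            int n = zmq_recv(sub, buf, sizeof buf - 1, 0);  /* would be protobuf bytes */
            if (n >= 0) { buf[n] = '\0'; printf("App #2 got: %s\n", buf); }
            zmq_close(sub);
            zmq_ctx_term(ctx);
            return 0;
        }
        /* "App #1": publisher */
        void *ctx = zmq_ctx_new();
        void *pub = zmq_socket(ctx, ZMQ_PUB);
        zmq_bind(pub, "tcp://127.0.0.1:5556");
        sleep(1);                                /* crude workaround for the PUB/SUB slow-joiner problem */
        const char *payload = "serialized-event-bytes";     /* stand-in for protobuf output */
        zmq_send(pub, payload, strlen(payload), 0);
        wait(NULL);
        zmq_close(pub);
        zmq_ctx_term(ctx);
        return 0;
    }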
Is ZeroMQ worthy of deploying on time-critical systems such as Ballistic Missile Defense or a Boeing 777 autopilot?
It's important to understand the topology of what you're connecting and also how the latency will affect things and design a protocol accordingly.
So to take the case of the 777 autopilot: almost certainly yes, ZeroMQ would be suitable, simply because there is a lot of inertia in a stable aircraft, so the airframe takes time to respond and the cleverness is inside the autopilot rather than being reliant on incoming sensor data at a high rate. On a 777 there is an ARINC 429 bus connecting the avionics, and it runs at a maximum of 100 kbit/s between the limited number of endpoints that can be on any given bus.
Q: Does it provide such (real time features) guarantees?
No, it does not, and it never tried to. The Zen-of-Zero guarantees that either a complete message payload is delivered or nothing at all, which means your R/T-level code never needs to test a delivered message for damaged integrity. Yet it does not guarantee delivery as such, and the application domain is free to build any additional service layer atop the smart and highly performant ZeroMQ layer.
ZeroMQ has been, since its earliest days, a smart, asynchronous, broker-less tool for designing almost linearly scalable distributed-system messaging of any kind, with modest resource needs and an excellent latency/performance envelope.
That said, anyone can build on this and design add-on features that match their target application domain's specific needs (N+1 robustness, N+M failure resilience, adaptive node rediscovery, ...), which are deliberately neither hard-coded nor pre-wired into the minimal-footprint, low-latency, highly scalable smart-messaging core engine.
Any design in which the rules of the Zen-of-Zero, coded so wisely into the ZeroMQ core engine, can safely meet the defined hard real-time limits of the real-time-constrained system under review will enjoy ZeroMQ and its many-protocol support for inproc://, ipc://, tipc://, tcp://, pgm://, epgm://, udp://, vmci:// and other wire-level protocols.
Q: Is ZeroMQ worthy of deploying on time critical systems?
That indeed depends on many things.
I remember the days when F-16 avionics simulation used an isolated, high-speed, deterministic and rather low-latency (due to static packet/payload sizes) 155+ Mbit/s ATM fabric internally for the on-plane switching network, just to enjoy its benefits for the R/T-control-motivated needs; technology always matches some set of needs-to-have. Once your R/T-system design criteria are defined, anyone can confirm or deny whether a given tool is feasible for designing towards such R/T-system goals.
Factors will go way beyond just the ZeroMQ smart features:
properties of your critical system
external constraints ( sector specific authoritative regulations )
Project's planning
Project's external ecosystem of cooperating parties ( greenfield systems have it too )
your team's knowledge of ZeroMQ, or the absence of experience using it in distributed-system designs
...
and last but not least, the ceilings for financing and for showing the RTO demonstrator live demo and rolling it out for acceptance tests.
Good luck with your forthcoming BMD or any other MCS. ZeroMQ style of thinking, the Zen-of-Zero can help you in many aspects of doing right things right.

Is it possible to drop packets

I am trying to write some sort of very basic packet filtering in Linux (Ubuntu) user space.
Is it possible to drop packets in user space via a C program using a raw socket (AF_PACKET), without any kernel intervention (such as writing a kernel module) and without netfilter?
Thanks a lot
Tali
It is possible (assuming I understand what you're asking). There are a number of "zero-copy" driver implementations that allow user-space to obtain a large memory-mapped buffer into which (/ from which) packets are directly DMA'd.
That pretty much precludes having the kernel process those same packets though (possible but very difficult to properly coordinate user-space packet sniffing with kernel processing of the same packets). But it's fine if you're creating your own IDS/IPS or whatever and don't need to "terminate" connections on the local machine.
It would definitely not be the standard AF_PACKET; you have to either create your own or use an existing implementation: look into netmap, DPDK, and PF_RING (maybe PF_RING/ZC? not sure). I worked on a couple of proprietary implementations in a previous career so I know it's possible.
The basic idea is either (1) completely duplicate everything the driver is responsible for -- that is, move the driver implementation completely into user space (DPDK basically does this). This is straight-forward on paper, but is a lot of work and makes the driver pretty much fully custom.
Or (2) modify driver source so that key network buffer allocation requests get satisfied with an address that is also mmap'd by the user-space process. You then have the problem of communicating buffer life-cycles / reference counts between user-space and kernel. That's very messy but can be done and is probably less work overall. (I dunno -- there may be a way to automate this latter method if you're clever enough -- I haven't been in this space in some years.)
Whichever way you go, there are several pieces you need to put together to do this right. For example, if you want really high performance, you'll need to use the adapter's "RSS" type mechanisms to split the traffic into multiple queues and pin each to a particular CPU -- then make sure the corresponding application components are pinned to the same CPU.
All that being said, unless your need is pretty severe, you're best staying with plain old AF_PACKET.
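For reference, the plain AF_PACKET path looks like the sketch below ("eth0" and the frame count are placeholders); note it only ever receives copies of frames, so it can observe traffic but not stop it:

    /* Minimal AF_PACKET sketch (needs CAP_NET_RAW / root): receives a copy of
       every frame on one interface. The original packet still traverses the
       normal kernel path. "eth0" is a placeholder interface name. */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        /* Bind to a specific interface. */
        struct sockaddr_ll sll = {0};
        sll.sll_family   = AF_PACKET;
        sll.sll_protocol = htons(ETH_P_ALL);
        sll.sll_ifindex  = if_nametoindex("eth0");
        if (bind(fd, (struct sockaddr *)&sll, sizeof sll) < 0) { perror("bind"); return 1; }

        unsigned char frame[2048];
        for (int i = 0; i < 10; i++) {
            ssize_t n = recvfrom(fd, frame, sizeof frame, 0, NULL, NULL);
            if (n < 0) { perror("recvfrom"); break; }
            printf("got a %zd-byte frame (a copy; discarding it here drops nothing)\n", n);
        }
        close(fd);
        return 0;
    }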
You can use iptables rules to drop packets matching given criteria, but dropping them with packet filters is not possible, because a packet filter only gets a copy of the packet while the original flows through the usual path.

High performance packet handling in Linux

I’m working on a packet reshaping project in Linux using the BeagleBone Black. Basically, packets are received on one VLAN, modified, and then are sent out on a different VLAN. This process is bidirectional - the VLANs are not designated as being input-only or output-only. It’s similar to a network bridge, but packets are altered (sometimes fairly significantly) in-transit.
I’ve tried two different methods for accomplishing this:
1. Creating a user space application that opens raw sockets on both interfaces. All packet processing (including bridging) is handled in the application.
2. Setting up a software bridge (using the kernel bridge module) and adding a kernel module that installs a netfilter hook in post routing (NF_BR_POST_ROUTING). All packet processing is handled in the kernel.
The second option appears to be around 4 times faster than the first option. I’d like to understand more about why this is. I’ve tried brainstorming a bit and wondered if there is a substantial performance hit in rapidly switching between kernel and user space, or maybe something about the socket interface is inherently slow?
I think the user application is fairly optimized (for example, I’m using PACKET_MMAP), but it’s possible that it could be optimized further. I ran perf on the application and noticed that it was spending a good deal of time (35%) in v7_flush_kern_dcache_area, so perhaps this is a likely candidate. If there are any other suggestions on common ways to optimize packet processing I can give them a try.
Context switches are expensive, and kernel-to-user-space switches imply a context switch. You can see this article for exact numbers, but the stated durations are all on the order of microseconds.
You can also use lmbench to benchmark the real cost of context switches on your particular cpu.
The performance of the user space application also depends on the syscall used to monitor the sockets. The fastest is epoll() when you need to handle a lot of sockets; select() will perform very poorly if you handle a lot of sockets.
See this post explaining it:
Why is epoll faster than select?
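For example, the core of an epoll()-based loop looks like the sketch below; listen_fd is assumed to be an already created, non-blocking listening socket:

    /* Minimal epoll loop sketch. 'listen_fd' is assumed to be an existing,
       non-blocking listening socket; accept/read handling is left as stubs. */
    #include <sys/epoll.h>
    #include <stdio.h>
    #include <stdlib.h>

    void event_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        if (epfd < 0) { perror("epoll_create1"); exit(1); }

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) { perror("epoll_ctl"); exit(1); }

        struct epoll_event events[64];
        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);   /* block until something is ready */
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    /* accept() the new connection and EPOLL_CTL_ADD its fd here */
                } else {
                    /* read()/write() on the ready client socket here */
                }
            }
        }
    }

Unlike select(), the cost per epoll_wait() call does not grow with the total number of watched sockets, only with the number that are actually ready.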

Is there any modern review of solutions to the 10000 client/sec problem

(Commonly called the C10K problem)
Is there a more contemporary review of solutions to the c10k problem (Last updated: 2 Sept 2006), specifically focused on Linux (epoll, signalfd, eventfd, timerfd..) and libraries like libev or libevent?
Something that discusses all the solved and still unsolved issues on a modern Linux server?
The C10K problem generally assumes you're trying to optimize a single server, but as your referenced article points out, "hardware is no longer the bottleneck". Therefore, the first step is to make sure the easiest and cheapest solution isn't simply to throw more hardware into the mix.
If we've got a $500 box serving X clients per second, it's a lot more efficient to just buy another $500 box to double our throughput instead of letting an employee gobble up who knows how many hours and dollars trying to figure out how to squeeze more out of the original box. Of course, that's assuming our app is multi-server friendly, that we know how to load balance, etc., etc...
Coincidentally, just a few days ago, Programming Reddit or maybe Hacker News mentioned this piece:
Thousands of Threads and Blocking IO
In the early days of Java, my C programming friends laughed at me for doing socket IO with blocking threads; at the time, there was no alternative. These days, with plentiful memory and processors it appears to be a viable strategy.
The article is dated 2008, so it pulls your horizon up by a couple of years.
To answer the OP's question, you could say that today the equivalent document is not about optimizing a single server for load, but about optimizing your entire online service for load. From that perspective, the number of combinations is so large that what you are looking for is not a document; it is a live website that collects such architectures and frameworks. Such a website exists, and it's called www.highscalability.com.
Side Note 1:
I'd argue against the belief that throwing more hardware at it is a long-term solution:
Perhaps the cost of an engineer who "gets" performance is high compared to the cost of a single server. But what happens when you scale out? Let's say you have 100 servers. A 10 percent improvement in server capacity can save you 10 servers a month.
Even if you have just two machines, you still need to handle performance spikes. The difference between a service that degrades gracefully under load and one that breaks down is that someone spent time optimizing for the load scenario.
Side note 2:
The subject of this post is slightly misleading. The C10K document does not try to solve the problem of 10k clients per second. (The number of clients per second is irrelevant unless you also define a workload along with sustained throughput under bounded latency. I think Dan Kegel was aware of this when he wrote that doc.) Look at it instead as a compendium of approaches to building concurrent servers, and of micro-benchmarks for the same. Perhaps what has changed between then and now is that at one point you could assume the service was a website serving static pages; today the service might be a NoSQL datastore, a cache, a proxy, or one of hundreds of network infrastructure software pieces.
You can also take a look at this series of articles:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3
He shows a fair amount of performance data and the OS configuration work he had to do in order to support 10K and then 1M connections.
It seems like a system with 30GB of RAM could handle 1 million connected clients on a sort of social network type of simulation, using a libevent frontend to an Erlang based app server.
libev runs some benchmarks against itself and libevent...
I'd recommend reading Zed Shaw's poll, epoll, science, and superpoll [1]: why epoll isn't always the answer, why sometimes it's even better to go with poll, and how to bring the best of both worlds.
[1] http://sheddingbikes.com/posts/1280829388.html
Have a look at the RamCloud project at Stanford: https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud
Their goal is 1,000,000 RPC operations/sec/server. They have numerous benchmarks and commentary on the bottlenecks that are present in a system which would prevent them from reaching their throughput goals.

Resources