I've recently written a simple TCP server using epoll, but I want to explore other mechanisms for high-performance multiplexing. To that end I came across io_uring, and I'm planning to write another simple TCP server using it.
However, I read here (https://kernel.dk/io_uring.pdf) that the number of entries for io_uring is limited to 4096, which seems to imply that, in theory, I can't have more than that number of persistent connections.
To my understanding, where with epoll I'd normally use something like epoll_wait() to wait on an event, with io_uring I instead submit a specific request and am notified when that request has completed or failed. Does that mean I can submit up to 4096 read() requests, for example?
Have I misunderstood the use case of io_uring or have I misunderstood how to use it?
In the same document I linked, it says:
Normally an application would ask for a ring of a given size, and the assumption may be that this size corresponds directly to how many requests the application can have pending in the kernel. However, since the sqe lifetime is only that of the actual submission of it, it's possible for the application to drive a higher pending request count than the SQ ring size would indicate.
Which is precisely what you'd do when listening for messages on lots of sockets; 4096 is just the upper limit on how many submissions you can queue at once, not on how many requests can be pending.
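For illustration, here is a minimal sketch of that pattern using liburing (the io_uring helper library); the conn structure, buffer size and error handling are placeholders, and the client sockets are assumed to have been accept()ed elsewhere. The point to note is that each SQE only lives until it is submitted, so the total number of pending recv requests can exceed the ring size:

    #include <liburing.h>

    #define QUEUE_DEPTH 256   /* SQ ring size; pending requests may exceed this */
    #define BUF_SIZE    4096

    struct conn {
        int  fd;
        char buf[BUF_SIZE];
    };

    /* conns[] is assumed to hold already-accept()ed client sockets. */
    void serve(struct conn **conns, int nconns)
    {
        struct io_uring ring;
        if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0)
            return;

        /* Queue one recv per connection. An SQE is consumed at submit time,
         * so if the SQ fills up we just flush it and keep going. */
        for (int i = 0; i < nconns; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            if (!sqe) {                       /* SQ full: flush and retry */
                io_uring_submit(&ring);
                sqe = io_uring_get_sqe(&ring);
            }
            io_uring_prep_recv(sqe, conns[i]->fd, conns[i]->buf, BUF_SIZE, 0);
            io_uring_sqe_set_data(sqe, conns[i]);
        }
        io_uring_submit(&ring);

        /* Reap completions and re-arm the recv for each connection. */
        for (;;) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            struct conn *c = io_uring_cqe_get_data(cqe);
            if (cqe->res > 0) {
                /* c->buf holds cqe->res bytes; process, then resubmit. */
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                io_uring_prep_recv(sqe, c->fd, c->buf, BUF_SIZE, 0);
                io_uring_sqe_set_data(sqe, c);
                io_uring_submit(&ring);
            }
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
    }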
I am working on a project whose goal is to receive and store real-time data from financial exchanges, using websockets. I have some very general questions about the technology.
Suppose that I have two websocket connections open, receiving real-time data from two different servers. How do I make sure not to miss any messages? I have learned a bit of asynchronous programming (Python asyncio), but it does not seem to solve the problem: when I listen to one connection, I cannot listen to the other one at the same time, right?
I can think of two solutions: the first would require that the servers use a buffer system to send their data, but I do not think this is the case (Binance, Bitfinex...). The second solution I see is to listen to each websocket on a different core. If my laptop has 8 cores, I can listen to 8 connections and be sure not to miss any messages. I guess I can then scale up by using a cloud service.
Is that correct or am I missing something? Many thanks.
when I listen to one connection, I cannot listen to the other one at the same time, right?
Wrong.
When using an evented programming design, you will be using an IO "reactor" that adds IO related events to the event loop.
This allows your code to react to events from a number of connections.
It's true that the code reacts to the events in sequence, but as long as your code doesn't "block", these events could be handled swiftly and efficiently.
Blocking code should be avoided and big / complicated tasks should be fragmented into a number of "events". There should be no point at which your code is "blocking" (waiting) on an IO read or write.
This will allow your code to handle all the connections without significant delays.
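To make that concrete, here is a minimal single-threaded reactor sketch using epoll (the listening-socket setup and most error handling are omitted): one epoll instance watches every connection, and the loop simply reacts to whichever socket becomes readable.

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    /* listen_fd is an already bound + listening socket. */
    void reactor_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[MAX_EVENTS];
        char buf[4096];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    /* New connection: register it and return to the loop. */
                    int c = accept(listen_fd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = c };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, c, &cev);
                } else {
                    /* Readable connection: handle it without blocking. */
                    ssize_t r = recv(fd, buf, sizeof buf, 0);
                    if (r <= 0) {
                        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                        close(fd);
                    }
                    /* else: process r bytes; keep this work short */
                }
            }
        }
    }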
...the first one would require that the servers use a buffer system to send their data...
Many evented frameworks use an internal buffer that streams to the IO when "ready" events are raised. For example, look up the drain event in node.js (or the on_ready in facil.io).
This is a convenience feature rather than a requirement.
The event loop might as well add an "on ready" event and assume your code will handle buffering after partial write calls return EAGAIN / EWOULDBLOCK.
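A rough sketch of that approach, with a hypothetical outbuf structure (not from any particular framework): try the write, and if the socket is not ready or only accepts part of the data, keep the remainder until the event loop signals the socket writable again.

    #include <errno.h>
    #include <string.h>
    #include <sys/socket.h>

    struct outbuf {
        char   data[65536];
        size_t len;                          /* bytes still waiting to be sent */
    };

    /* Returns 0 on success (possibly with data left buffered), -1 on error. */
    int send_or_buffer(int fd, struct outbuf *ob, const char *msg, size_t n)
    {
        /* Append the new message to whatever is already pending. */
        if (ob->len + n > sizeof ob->data)
            return -1;                       /* buffer full: apply back-pressure */
        memcpy(ob->data + ob->len, msg, n);
        ob->len += n;

        ssize_t sent = send(fd, ob->data, ob->len, MSG_NOSIGNAL);
        if (sent < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;                    /* wait for the "writable" event */
            return -1;
        }
        /* Drop what was sent, keep the rest for the next writable event. */
        memmove(ob->data, ob->data + sent, ob->len - sent);
        ob->len -= sent;
        return 0;
    }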
The second solution I see is to listen each websocket using a different core.
No need. A single thread on a single core with an evented design should support thousands (and tens of thousands) of concurrent clients with reasonable loads (per-client load is a significant performance factor).
Attaching TCP/IP connections to a specific core can (sometimes) improve performance, but this is a many-to-one relationship. If we had to dedicate a CPU core per connection, server prices would shoot through the roof.
I need to send identical information to hundreds of clients over the Internet. I currently maintain a list of client connections and iterate over the list. Obviously, the longer the list gets, the more latency there is toward the end of the list.
I have looked at multicasting. However unless I am missing something it is only good for LAN-based communications at present. It requires routers that support multicasting and most routers do not. There is no mechanism that I can see where one requests an available multicast address to avoid broadcasting to an address already in use.
So my questions are:
1) Am I missing something and can I use multicasting to accomplish this? (have tried without success)
2) Other than multicasting, is there a short cut to sending identical packets to many recipients?
I solved the problem by multicasting between threads in the server. Every client connection results in the creation of an object. These objects are stored in a queue. Each object has its own thread and joins the multicast group. When the server multicasts a string to the client objects the delay that arose from the list iteration no longer occurs.
Every now and then there is huge latency (nearly a second). I suspect that this is a JVM thing.
If you need high-performance, low-latency IO, you should try http://nodejs.org/
You may also be interested in a cache such as http://memcached.org/
We have a "publisher" application that sends out data using multicast. The application is extremely performance sensitive (we are optimizing at the microsecond level). Applications that listen to this published data can be (and often are) on the same machine as the publishing application.
We recently noticed an interesting phenomenon: the time to do a sendto() increases proportionally to the number of listeners on the machine.
For example, let's say with no listeners the base time for our sendto() call is 5 microseconds. Each additional listener increases the time of the sendto() call by about 2 microseconds. So if we have 10 listeners, now the sendto() call takes 2*10+5 = 25 microseconds.
This to me suggests that the sendto() call blocks until the data has been copied to every single listener.
Analysis of the listening side supports this as well. If there are 10 listeners, each listener receives the data two microseconds later than the previous. (I.e., the first listener gets the data in about five microseconds, and the last listener gets the data in about 23--25 microseconds.)
Is there any way, either at the programmatic level or the system level, to change this behavior? Something like a non-blocking/asynchronous sendto() call? Or at least a call that blocks only until the message is copied into the kernel's memory, so it can return without waiting on all the listeners?
Multicast loop is incredibly inefficient and shouldn't be used for high performance messaging. As you noted for every send the kernel is copying the message to every local listener.
The recommended approach is to use a separate IPC method to distribute to other threads and processes on the same host, either shared memory or unix sockets.
For example, this can easily be implemented with ZeroMQ by adding an ipc:// connection alongside the PGM multicast connection on the same ZeroMQ socket.
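A rough sketch with the ZeroMQ C API (the ipc path, interface name and multicast group are placeholders, and the epgm transport requires libzmq built with PGM support): local listeners subscribe over ipc, remote ones over PGM, and a single send fans out to both.

    #include <string.h>
    #include <zmq.h>

    int main(void)
    {
        void *ctx = zmq_ctx_new();
        void *pub = zmq_socket(ctx, ZMQ_PUB);

        /* Same-host subscribers connect over ipc, avoiding the per-listener
         * copies that multicast loopback costs in the kernel. */
        zmq_bind(pub, "ipc:///tmp/marketdata.ipc");

        /* Off-host subscribers receive the same stream via PGM multicast. */
        zmq_connect(pub, "epgm://eth0;239.192.1.1:5555");

        /* One send fans out over both transports. (PUB drops messages sent
         * before subscribers have joined -- the usual slow-joiner caveat.) */
        const char msg[] = "tick|XYZ|123.45";
        zmq_send(pub, msg, strlen(msg), 0);

        zmq_close(pub);
        zmq_ctx_term(ctx);
        return 0;
    }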
Sorry for asking the obvious, but is the socket non-blocking? (Add O_NONBLOCK to the socket's flags -- see fcntl.)
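For reference, a small sketch of switching a socket to non-blocking mode:

    #include <fcntl.h>

    /* Switch an existing socket (or any fd) to non-blocking mode. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }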
I'm looking at using sendmsg() to transfer an accept()ed socket between processes. In short, I'm trying to build a simple load balancer that can deal with a large number of connections without having to buffer the stream data.
Is this a good idea when dealing with a large number (let's say hundreds) of concurrent TCP connections? If it matters, my system is Gentoo Linux.
You can share the file descriptor as per the previous answer here.
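For completeness, here is a sketch of the sendmsg()/SCM_RIGHTS part (error handling trimmed): the accepted descriptor travels as ancillary data over a Unix-domain socket connecting the balancer to the worker process.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send an already-accept()ed fd to another process over a connected
     * AF_UNIX socket, as ancillary data (SCM_RIGHTS). */
    int send_fd(int unix_sock, int fd_to_pass)
    {
        char dummy = 'F';                        /* at least one byte of payload */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

        char cmsgbuf[CMSG_SPACE(sizeof(int))];
        memset(cmsgbuf, 0, sizeof cmsgbuf);

        struct msghdr msg = {0};
        msg.msg_iov        = &iov;
        msg.msg_iovlen     = 1;
        msg.msg_control    = cmsgbuf;
        msg.msg_controllen = sizeof cmsgbuf;

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }

The receiving process does the mirror image with recvmsg() and a matching control buffer; the kernel installs a new descriptor referring to the same open socket.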
Personally, I've always implemented servers using pre-fork. The parent sets up the listening socket, spawns (pre-forks) children, and each child does a blocking accept. I used pipes for parent <-> child communication.
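A stripped-down sketch of that pre-fork pattern (the port and child count are arbitrary, and the parent <-> child pipes are left out):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NUM_CHILDREN 4

    static int setup_listener(int port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof addr);
        listen(fd, 128);
        return fd;
    }

    int main(void)
    {
        int listen_fd = setup_listener(8080);          /* port is arbitrary */

        for (int i = 0; i < NUM_CHILDREN; i++) {
            if (fork() == 0) {                         /* child */
                for (;;) {
                    int c = accept(listen_fd, NULL, NULL);  /* blocking accept */
                    if (c < 0)
                        continue;
                    /* ... serve the client ... */
                    close(c);
                }
            }
        }

        /* Parent: just reap children (control pipes omitted here). */
        while (wait(NULL) > 0)
            ;
        return 0;
    }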
Until someone does a benchmark and establishes how "hard" it is to send a file descriptor, this remains speculation (someone might pop up: "Hey, sending the descriptor like that is dirt-cheap"). But here goes.
You will (likely, read above) be better off if you just use threads. You can have the following workflow:
Start a pool of threads that just wait around for work. Alternatively you can just spawn a new thread when a request arrives (it's cheaper than you think)
Use epoll(7) to wait for traffic (wait for connections + interesting traffic)
When interesting traffic arrives you can just dispatch a "job" to one of the threads.
Now, this does circumvent the whole descriptor-sending part. So what's the catch? The catch is that if one of the threads crashes, the whole process crashes. So it is up to you to benchmark and decide what's best for your server.
Personally I would do it the way I outlined it above. Another point: if the workers are children of the process doing the accept, sending the descriptor is unnecessary.
Context: OS: Linux (Ubuntu), language: C (actually Lua, but this should not matter).
I would prefer a ZeroMQ-based solution, but will accept anything sane enough.
Note: For technical reasons I can not use POSIX signals here.
I have several identical long-living processes on a single machine ("workers").
From time to time I need to deliver a control message to each of processes via a command-line tool. Example:
$ command-and-control worker-type run-collect-garbage
Each of the workers on this machine should receive a run-collect-garbage message. Note: it would be perfect if the solution somehow worked for all workers on all machines in the cluster, but I can write that part myself.
This is easily done if I store some information about running workers. For example, keep their PIDs in a known location and open a control Unix domain socket on a known path with the PID somewhere in it. Or open a TCP socket and store the host and port somewhere.
But this would require careful management of the stored information — e.g. what if a worker process suddenly dies? (Nothing unmanageable, but, still, extra fuss.) Also, the information needs to be stored somewhere, thus adding an extra bit of complexity.
Is there a good way to do this in PUB/SUB style? That is, workers are subscribers, command-and-control tool is a publisher, and all they know is a single "channel url", so to say, on which to come for messages.
Additional requirements:
Messages to the control channel must wake up workers from the poll (select, whatever) loop.
Message delivery must be guaranteed, and it must reach each and every worker that is listening.
Worker should have a way to monitor for messages without blocking — ideally by the poll/select/whatever loop mentioned above.
Ideally, the worker process should be a "server" in a sense — it should not have to bother with keeping connections to the "channel server" (if any) persistent, etc. — or this should be done transparently by the framework.
Usually such a pattern requires a proxy for the publisher, i.e. you send to the proxy, which immediately accepts delivery and then reliably forwards to the end subscriber workers. The ZeroMQ guide covers a few different methods of implementing this.
http://zguide.zeromq.org/page:all
Given your requirements, Steve's suggestion does seem the simplest: run a daemon which listens on two known sockets - the workers connect to that and the command tool pushes to it which redistributes to connected workers.
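A sketch of such a daemon with the ZeroMQ C API (the ipc paths are placeholders): a PULL socket that the command tool pushes to, and a PUB socket that the workers subscribe to; whatever arrives on PULL is rebroadcast on PUB.

    #include <zmq.h>

    int main(void)
    {
        void *ctx  = zmq_ctx_new();
        void *pull = zmq_socket(ctx, ZMQ_PULL);   /* command tool pushes here */
        void *pub  = zmq_socket(ctx, ZMQ_PUB);    /* workers subscribe here */

        zmq_bind(pull, "ipc:///tmp/cnc-in.ipc");
        zmq_bind(pub,  "ipc:///tmp/cnc-out.ipc");

        for (;;) {
            char buf[256];
            int n = zmq_recv(pull, buf, sizeof buf, 0);
            if (n < 0)
                break;
            if (n > (int)sizeof buf)
                n = sizeof buf;                   /* message was truncated */
            zmq_send(pub, buf, n, 0);             /* fan out to every worker */
        }

        zmq_close(pull);
        zmq_close(pub);
        zmq_ctx_term(ctx);
        return 0;
    }

Each worker connects a SUB socket to the PUB endpoint; to fold that into an existing poll/select loop it can fetch the socket's underlying file descriptor with zmq_getsockopt(..., ZMQ_FD, ...) and wait on that.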
You could do something complicated that would probably work, by effectively nominating one of the workers. For example, on startup the workers attempt to bind() a PUB ipc:// socket somewhere accessible, like /tmp. The one that wins binds a second IPC as a PULL socket and acts as a forwarder device on top of its normal duties; the others connect() to the original IPC. The command-line tool connect()s to the second IPC and pushes its message. The risk there is that the winner dies, leaving a locked file. You could identify this in the command-line tool, rebind, then sleep (to allow the connections to be established). Still, that's all a little bit complex; I think I'd go with a proxy!
I think what you're describing would fit well with a gearmand/supervisord implementation.
Gearman is a great task queue manager and supervisord would allow you to make sure that the process(es) are all running. It's TCP based too so you could have clients/workers on different machines.
http://gearman.org/
http://supervisord.org/
I recently set something up with multiple gearmand nodes, linked to multiple workers, so that there's no single point of failure.
edit: Sorry - my bad, I just re-read and saw that this might not be ideal.
Redis has some nice and simple looking pub/sub functionality that I've not used yet but sounds promising.
Use a multicast PUB/SUB. You'll have to make sure the pgm option is compiled into your ZeroMQ distribution (man 7 zmq_pgm).