epoll and set multiple interests at once - linux

Interestingly, I cannot find any discussion on this other than some
old slides from 2004.
IMHO, the current scheme of epoll() usage is begging for something
like an epoll_ctlv() call. Although such a call does not make sense for
typical HTTP web servers, it does make sense in a game server where
we are sending the same data to multiple clients at once. This does not
seem hard to implement given that epoll_ctl() is already there.
Is there any reason for not having this functionality? Maybe there is
no room for optimization there?

You would typically only use epoll_ctl() to add and remove sockets from the epoll set as clients connect and disconnect, which doesn't happen very often.
Sending the same data to multiple sockets would rather require a version of send() (or write()) that takes a vector of file descriptors. The reason this hasn't been implemented is probably just that no one with sufficient interest in it has done so yet. Of course, there are lots of subtle issues: for example, what should happen if each destination file descriptor can only accept a different number of bytes?
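To make the partial-write issue concrete, here is a minimal Python sketch. It is not an existing kernel or library API: `broadcast_once` is a hypothetical helper, and it assumes every socket has already been put into non-blocking mode. It tries one send per socket and returns whatever each socket could not take.

```python
import socket


def broadcast_once(socks, payload):
    """Attempt one non-blocking send of payload on every socket.

    Hypothetical helper for illustration only. Returns a dict mapping
    each socket that could not take the whole payload to the bytes
    that still have to be sent on it.
    """
    leftovers = {}
    for s in socks:
        try:
            sent = s.send(payload)
        except BlockingIOError:
            # The send buffer is full, so nothing was written this time.
            sent = 0
        if sent < len(payload):
            # Each socket may accept a different number of bytes.
            leftovers[s] = payload[sent:]
    return leftovers
```

The caller would keep the leftovers per socket and flush them when the socket is reported writable again, which is exactly the per-destination bookkeeping a vectored send() would have to define.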


Is python3's BufferedProtocol an abstraction over TCP? Or is it so low-level that I have to implement the TCP things too?

I am referring to this: https://docs.python.org/3/library/asyncio-protocol.html#asyncio.BufferedProtocol
I haven't seen the answer to this question documented anywhere and I want to know the answer in advance of writing any code.
It seems to imply that it is a modification of asyncio.Protocol (for TCP) but seeing as though TCP is not mentioned for BufferedProtocol it's got me concerned that I'd have to contend with out of order packets etc.
Many thanks!
BufferedProtocol isn't a protocol based on TCP, it's an interface (base class) for custom implementation of asyncio protocols, specifically those that try to minimize the amount of copying. The docstring provides more details:
The idea of BufferedProtocol is that it allows to manually allocate and control the receive buffer. Event loops can then use the buffer provided by the protocol to avoid unnecessary data copies. This can result in noticeable performance improvement for protocols that receive big amounts of data. Sophisticated protocols can allocate the buffer only once at creation time.
Currently none of the protocols shipped with asyncio derive from BufferedProtocol, so the use case for this is user code that needs to achieve high throughput - see the BPO issue and the linked mailing list post for details.
seeing as though TCP is not mentioned for BufferedProtocol it's got me concerned that I'd have to contend with out of order packets etc.
Unless you are writing custom low-level asyncio code, you shouldn't care about BufferedProtocol at all. Regular asyncio TCP code calls functions such as open_connection or start_server, both of which provide a streaming abstraction on top of TCP sockets in the usual way (using a buffer, handling errors, etc.).
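For reference, a minimal sketch of what a BufferedProtocol subclass looks like. The class name `ChecksumReceiver` and its `total` and `crc` attributes are made up for this example; only `get_buffer`, `buffer_updated` and `eof_received` are the callbacks defined by asyncio. The buffer is allocated once, and the received bytes are processed in place instead of being copied out.

```python
import asyncio
import zlib


class ChecksumReceiver(asyncio.BufferedProtocol):
    """Illustrative protocol: counts and checksums incoming bytes in place."""

    def __init__(self, size=65536):
        # Allocate the receive buffer once, at creation time.
        self._view = memoryview(bytearray(size))
        self.total = 0
        self.crc = 0

    def get_buffer(self, sizehint):
        # The event loop writes incoming data directly into this buffer.
        return self._view

    def buffer_updated(self, nbytes):
        # Only the first nbytes of the buffer hold new data.
        chunk = self._view[:nbytes]
        self.total += nbytes
        self.crc = zlib.crc32(chunk, self.crc)

    def eof_received(self):
        # Returning a false value lets the transport close itself.
        return False
```

Such a class would be passed as the protocol factory to something like loop.create_connection(). The bytes still arrive as an ordered, reliable TCP stream; the only difference from asyncio.Protocol is who owns the receive buffer.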
I can confirm - BufferedProtocol is for TCP only. Not for files or anything else. And it gives you a handle on a zero copy buffer to work with. That's basically all I wanted to know.

what is the side effect of setting tcp_max_tw_buckets to a very small value?

I know it is quite normal to set tcp_max_tw_buckets to a relatively small number such as 30000 or 50000, to avoid the situation where a host has lots of connections in the TIME_WAIT state and the application fails to open new ones. It is mentioned quite a lot, for example in questions like this: How to reduce number of sockets in TIME_WAIT?
As far as I know, TIME_WAIT is a state that guards against out-of-order TCP packets, and it may be better to cope with it using some other approach. If you set the limit to a small number, things may go wrong.
I feel I'm stuck in a situation where I have to set tcp_max_tw_buckets to a small number, and I don't know the specific scenarios in which I should avoid doing so.
So my question is: what is the side effect of setting tcp_max_tw_buckets to a very small value, and can I set up a specific scenario in a lab environment in which a small tcp_max_tw_buckets causes trouble?
As you can see in this kernel source, that option prevents graceful termination of the socket. In terms of the socket state, you have reduced the TIME_WAIT duration for this connection to zero.
So what happens next? First off, you'll see the error message on your server. The rest is then a race condition for subsequent connections from your clients. Section 2 of RFC 1337 then covers what you may see. In short, some connections may show the following symptoms:
Corruption of your data stream (because the socket accepts an old transmission).
Infinite ACK loops (due to an old duplicate ACK being picked up).
Dropped connections (due to old data turning up in the SYN-SENT state).
However, proving this may be hard. As noted in the same RFC:
The three hazards H1, H2, and H3 have been demonstrated on a stock Sun OS 4.1.1 TCP running in an simulated environment that massively duplicates segments. This environment is far more hazardous than most real TCP's must cope with, and the conditions were carefully tuned to create the necessary conditions for the failures.
The real answer to your question is that the correct way to avoid TIME_WAIT states is to be the end that receives the first close.
In the case of a server, that means that after you've sent the response you should loop waiting for another request on the same socket, with a read timeout of course, so that it is normally the client end which will close first. That way the TIME_WAIT state occurs at the client, where it is fairly harmless, as clients don't have lots of outbound connections.
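A rough Python sketch of that server-side loop, under some loud assumptions: `serve_connection` and `handle` are hypothetical names, and one recv() is treated as one complete request, which a real protocol would replace with proper framing.

```python
import socket


def serve_connection(conn, handle, idle_timeout=5.0):
    """Answer requests on one connection until the client closes it.

    Hypothetical helper: handle(request_bytes) returns the response bytes.
    Returns the number of requests served.
    """
    conn.settimeout(idle_timeout)
    served = 0
    try:
        while True:
            try:
                request = conn.recv(4096)
            except socket.timeout:
                # Idle client: here the server closes first after all,
                # so TIME_WAIT lands on the server for this connection.
                break
            if not request:
                # The client closed first, so TIME_WAIT is on the client.
                break
            conn.sendall(handle(request))
            served += 1
    finally:
        conn.close()
    return served
```

The read timeout is the trade-off: a longer timeout keeps more idle sockets open, a shorter one means the server ends up closing first more often.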

"Resequencing" messages after processing them out-of-order

I'm working on what's basically a highly-available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, performs various transformations on them, and then sends them to one or more destinations (also using TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But, I don't want it to be written in Java, and I need the solution to be highly available (i.e. resistant to typical system failures like crashes or system restarts) which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and Postgresql for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, the featureset is too limited for the use-case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out of order processing", "parallelizing message processing", etc. turns up solutions that mostly just relax the "in-order" requirements based on partitions or topics or whatnot. Alternatively, they talk about parallelization on a single machine. I need a solution that:
Can handle processing on multiple messages simultaneously in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js
Can operate in a HA environment (i.e. multiple instances of it running on the same message queue at once w/o inconsistencies.)
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this:
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever there's the same message at the front of both the in-progress and ready-to-send sets, that message can be sent and it will be in order.
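The three steps above can be sketched in a few lines. This is an in-memory Python stand-in, not the Redis/Node implementation: the `Resequencer` class and its method names are made up for illustration, a heap plays the role of the in-progress set, and a dict plays the role of the ready-to-send set.

```python
import heapq


class Resequencer:
    """In-memory sketch of the in-progress / ready-to-send scheme."""

    def __init__(self):
        self._next_seq = 0
        self._in_progress = []  # min-heap of arrival sequence numbers
        self._ready = {}        # sequence number -> processed result

    def receive(self):
        """Register an arriving message and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        heapq.heappush(self._in_progress, seq)
        return seq

    def complete(self, seq, result):
        """Mark a message as processed; return results that can now be sent."""
        self._ready[seq] = result
        sendable = []
        # Release messages while the oldest in-progress one is also ready.
        while self._in_progress and self._in_progress[0] in self._ready:
            oldest = heapq.heappop(self._in_progress)
            sendable.append(self._ready.pop(oldest))
        return sendable
```

In a Redis version, the check-and-release loop is the part that would have to run atomically so that two instances never release the same message.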
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely:
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving this by resequencing them in Redis, which is an OK solution. The other one, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server crashes, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly when you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existence. A single Redis node can thus hold the guarantee that if a message is inserted into the to-process-set, and then into the done-set, no node will see the message in the done-set without it also being in the to-process-set.
How I would do it
Using Redis seems like too much fuss, assuming that the messages are not huge, that losing them is OK if a process crashes, and that running them more than once, or even running multiple copies of a single request at the same time, is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses and puts them back in order again before sending them on. You said you expected the processing to take 750ms. If a slave hasn't responded within say 2 seconds, dispatch it again to another node randomly within 0-1 seconds. The first one responding is the one we're going to use. Beware of duplicate responses.
If the retry request also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice (or any multiple greater than one) as long as the previous one, we probably have a permanent error, so we should probably ask for human intervention. This algorithm is called exponential backoff, and prevents a sudden spike in requests from taking down the entire cluster. Not using a random interval, and retrying after n seconds would probably cause a DOS-attack every n seconds until the cluster dies, if it ever gets a big enough load spike.
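A small Python sketch of the randomized schedule described above. The function name and its parameters (`base`, `factor`, `max_retries`, `rng`) are illustrative choices rather than values from the answer, apart from the "5 failures or so" and the doubling.

```python
import random


def backoff_delays(base=1.0, factor=2.0, max_retries=5, rng=random.random):
    """Yield one random wait per retry; the upper bound grows each time.

    rng must return a float in [0, 1]; it is injectable for testing.
    """
    max_wait = base
    for _ in range(max_retries):
        # Pick a random delay in [0, max_wait] so retries do not all
        # arrive at the same moment.
        yield rng() * max_wait
        max_wait *= factor
```

A supervisor would sleep for each yielded delay before re-dispatching, and escalate to a human once the generator is exhausted.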
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. JavaScript is by nature single-threaded, so this is slightly trickier than normal, but I'm confident you can do it.

inotify: are the events reported in strictly the same order as they have occured in the file system?

I'm using inotify to monitor various directories on various partitions (which are possibly located on different hard disks). To be sure to have collected all events which have occurred until a certain point in time T, I'm touching a special file in my home directory and wait for inotify to report this modification. Once I've received this notification, can I be sure that I've also received all events for all modifications before T (for all directories and all partitions)?
I'm uncertain about whether this works for watches on different filesystems on the same inotify instance, but can speak with authority that the technique does work in general: we use it in Watchman (we describe it here: https://facebook.github.io/watchman/docs/cookies.html)
We assumed that this wouldn't be ordered correctly across filesystem boundaries and create one instance per watched root; this makes it simpler for us to track and associate events properly. We also have to deal with fsevents, kqueue and other watching implementations, so we try to avoid coupling too closely to the underlying implementation.
Depending on what your precise use case is, you may be able to get away with one instance per filesystem and touch a special file in the root of each at your time T. Provided that you've observed both of your special file changes, you know you've seen everything up to time T, and perhaps a little more. If the "perhaps a little more" part isn't a deal breaker then you're golden.
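The "wait until every cookie has been observed" logic can be sketched independently of inotify itself. In this Python sketch, `drain_until_cookies` is a hypothetical helper, and the event stream is assumed to be a merged iterable of changed paths from all instances; a real program would read and decode inotify events instead.

```python
def drain_until_cookies(event_stream, cookie_paths):
    """Collect changed paths until every cookie file has been seen.

    Returns the non-cookie paths observed up to that point. Raises
    RuntimeError if the stream ends before all cookies show up.
    """
    pending = set(cookie_paths)
    seen = []
    for path in event_stream:
        if path in pending:
            pending.discard(path)
            if not pending:
                # Every watched filesystem has reported its cookie.
                return seen
        else:
            seen.append(path)
    raise RuntimeError("event stream ended before all cookies were observed")
```

Events that arrive after one filesystem's cookie but before the last cookie are still collected, which is the "perhaps a little more" part.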
The inotify documentation in the kernel says "that each [inotify] instance is associated with a unique, ordered queue." So, I think that events related to the watches added to a given instance (created with inotify_init()) are received in the same order they occur.

winsock 2. thread safety for simultaneous send's. tcp

is it possible to have multiple threads sending on the same socket? will there be interleaving of the streams or will the socket block on the first thread (assuming tcp)? the majority of opinions i've found seems to warn against doing this for obvious fears of interleaving, but i've also found a few comments that state the opposite. are interleaving fears a carryover from winsock1 and are they well-founded for winsock2? is there a way to setup a winsock2 socket that would allow for lack of local synchronization?
two of the contrary opinions below... who's right?
comment 1
"Winsock 2 implementations should be completely thread safe. Simultaneous reads / writes on different threads should succeed, or fail with WSAEINPROGRESS, depending on the setting of the overlapped flag when the socket is created. Anyway by default, overlapped sockets are created; so you don't have to worry about it. Make sure you don't use NT SP6, if ur on SP6a, you should be ok !"
comment 2
"The same DLL doesn't get accessed by multiple processes as of the introduction of Windows 95. Each process gets its own copy of the writable data segment for the DLL. The "all processes share" model was the old Win16 model, which is luckily quite dead and buried by now ;-)"
looking forward to your comments!
to clarify what i mean by interleaving. thread 1 sends the msg "Hello" thread 2 sends the msg "world!". recipient receives: "Hwoel lorld!". this assumes both messages were NOT sent in a while loop. is this possible?
I'd really advise against doing this in any case. The send functions might send less than you tell them to for various very legitimate reasons, and if another thread might enter and try to also send something, you're just messing up your data.
Now, you can certainly write to a socket from several threads, but you've no longer any control over what gets on the wire unless you've proper locking at the application level.
Consider sending some data with WSASend(): its bytes-sent output parameter (lpNumberOfBytesSent, called sent here) will hold the number of bytes actually sent, similar to the return value of the send() function. To send all the data in buf you will have to loop, calling WSASend until all the data has actually been sent.
If, say, the first WSASend sends all but the last 4 bytes, another thread might go and send something while you loop back and try to send the last 4 bytes.
With proper locking to ensure that can't happen, it should be no problem sending from several threads - I wouldn't do it anyway, just for the pure hell it will be to debug when something does go wrong.
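Here is a sketch of the "loop until everything is sent, under a lock" idea. It uses Python's socket module as a stand-in for WSASend, and `LockedSender` is a made-up name. The point is that the lock is held for the whole message, not for a single send call.

```python
import socket
import threading


class LockedSender:
    """Serializes whole messages from several threads onto one socket."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def send_message(self, data):
        view = memoryview(data)
        # Hold the lock across the whole loop so another thread cannot
        # slip its bytes in between two partial sends.
        with self._lock:
            while len(view):
                sent = self._sock.send(view)  # may send fewer bytes than asked
                view = view[sent:]
```

With this wrapper, "Hello" and "world!" can still arrive in either order, but never mixed into each other.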
is it possible to have multiple threads sending on the same socket?
Yes - although, depending on implementation this can be more or less visible. First, I'll clarify where I am coming from:
C# / .Net 3.5
The overall visibility (i.e. required management) of threading and the headaches incurred will be directly dependent on how the socket is implemented (synchronously or asynchronously). If you go the synchronous route then you have a lot of work to manually manage connecting, sending, and receiving over multiple threads. I highly recommend that this implementation be avoided. The efforts to correctly and efficiently perform the synchronous methods in a threaded model simply are not worth the comparable efforts to implement the asynchronous methods.
I have implemented an asynchronous Tcp server in less time than it took for me to implement the threaded synchronous version. Async is much easier to debug - and if you are intent on Tcp (my favorite choice) then you really have few worries in lost messages, missing data, or whatever.
will there be interleaving of the streams or will the socket block on the first thread (assuming tcp)?
I had to research interleaved streams (from wiki) to ensure that I was accurate in my understanding of what you are asking. To further understand interleaving and mixed messages, refer to these links on wiki:
Real Time Messaging Protocol
Transmission Control Protocol
Specifically, the power of Tcp is best described in the following section:
Due to network congestion, traffic load balancing, or other unpredictable network behavior, IP packets can be
lost, duplicated, or delivered out of order. TCP detects these problems, requests retransmission of lost
packets, rearranges out-of-order packets, and even helps minimize network congestion to reduce the
occurrence of the other problems. Once the TCP receiver has finally reassembled a perfect copy of the data
originally transmitted, it passes that datagram to the application program. Thus, TCP abstracts the application's
communication from the underlying networking details.
What this means is that TCP puts packets that arrive out of order back into the sequence in which the sender wrote them. Note that this concerns packets on the network; it does not separate bytes written by two unsynchronized threads on the sending side. It is expected that threading is or would be involved in developing a performance-driven Tcp client/server mechanism - whether through async or sync methods.
In order to keep a socket from blocking, you can set its Blocking property to false.
I hope this gives you some good information to work with. Heck, I even learned a little bit...
