Before I dive into the code, can someone tell me if there is any documentation available for confirmed delivery in Socket.IO?
Here's what I've been able to glean so far:
A callback can be provided to be invoked when and if a message is acknowledged
There is a special mode "volatile" that does not guarantee delivery
There is a default mode that is not "volatile"
This leaves me with some questions:
If a message is not volatile, how is it handled? Will it be buffered indefinitely?
Is there any way to be notified if a message can't be delivered within a reasonable amount of time?
Is there any way to unbuffer a message if I want to give up?
I'm at a bit of a loss as to how Socket.IO can be used in a time sensitive application without falling back to volatile mode and using an external ACK layer that can provide failure events and some level of configurability. Or am I missing something?
TL;DR You can't have reliable confirmed delivery unless you're willing to wait until the universe dies.
The delivery confirmation you seek is related to the theoretical Two Generals Problem, which is also discussed in this SO answer.
TCP manages the reliability problem by guaranteeing delivery after infinite retries. We live in a finite universe, so the word "guarantee" is theoretically dubious :-)
Theory aside, consider this: engine.io, the underpinnings of socket.io 1.x, uses the following transports:
WebSocket
FlashSocket
XHR polling
JSONP polling
Each of those transports is based upon TCP, and TCP is reliable. So as long as connections stay connected and transports don't change, each individual socket.io message or event should be reliable. However, two things can happen on the fly:
engine.io can change transports
socket.io can reconnect in case the underlying transport disconnects
So what happens when a client or your server squirts off a few messages while the plumbing is being fiddled with like that? It doesn't say in either the engine.io protocol or the socket.io protocol (at versions 3 and 4, respectively, as of this writing).
As you suggest in your comments, there is some acknowledgement logic in the implementation. But even simple digital communication has nontrivial behavior, so I do not trust an unsupervised socket.io connection for reliable delivery in mission- or safety-critical operations. That won't change until reliable delivery is part of their protocol and their methods have been independently and formally verified.
You're welcome to adopt my policies:
Number my messages
Ask for a resend when in doubt
Do not mutate my state - client or server - unless I know I'm ready
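Those three policies can be sketched roughly as follows. This is only an illustration, not part of socket.io: the class name SequencedReceiver, the seq field and both callbacks are hypothetical, and the sender is assumed to number its messages 1, 2, 3, ...

```javascript
// Hypothetical helper: applies a message only if it is exactly the next
// one expected, and asks for a resend as soon as a gap is noticed.
class SequencedReceiver {
  constructor(applyMessage, requestResend) {
    this.expected = 1;                  // next sequence number we are ready for
    this.applyMessage = applyMessage;   // the only place local state is mutated
    this.requestResend = requestResend; // asks the sender to replay from a number
  }

  receive(message) {
    if (message.seq < this.expected) {
      return 'duplicate';               // already applied, ignore it
    }
    if (message.seq > this.expected) {
      this.requestResend(this.expected); // in doubt: ask again, do not mutate
      return 'gap';
    }
    this.applyMessage(message);
    this.expected += 1;
    return 'applied';
  }
}
```

The receiver never touches state for a message it was not ready for; an out-of-order message is simply discarded and requested again.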
In Short:
Guaranteed message delivery acknowledgement is proven impossible, but TCP guarantees delivery and order given "infinite" retries. I'm less confident about socket.io messages, but they're really powerful and easy to use so I just use them with care.
I ensured delivery using several strategies:
I send data over the socket and include a nonce in each message, so repeated messages can be detected and don't cause errors
The other party sends a confirmation of the received message, or I resend after x seconds
The client makes a REST call every 30 seconds to request all new messages sent by the server, to catch any messages dropped in transport
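A rough sketch of the first two strategies, under some assumptions: the names ResendingSender and DedupingReceiver are made up for illustration, the send function stands in for whatever actually writes to the socket, and time is passed in explicitly so the resend logic can be driven by any timer.

```javascript
// Sender side: remember every unconfirmed message and resend it
// once resendAfterMs has passed without a confirmation.
class ResendingSender {
  constructor(send, resendAfterMs) {
    this.send = send;                 // hypothetical: writes one frame to the socket
    this.resendAfterMs = resendAfterMs;
    this.pending = new Map();         // nonce -> { payload, lastSentAt }
    this.nextNonce = 1;
  }

  emit(payload, now) {
    const nonce = this.nextNonce++;
    this.pending.set(nonce, { payload, lastSentAt: now });
    this.send({ nonce, payload });
    return nonce;
  }

  confirm(nonce) {
    return this.pending.delete(nonce); // true if it was still pending
  }

  // Call periodically, e.g. from setInterval, with the current time.
  tick(now) {
    for (const [nonce, entry] of this.pending) {
      if (now - entry.lastSentAt >= this.resendAfterMs) {
        entry.lastSentAt = now;
        this.send({ nonce, payload: entry.payload });
      }
    }
  }
}

// Receiver side: handle each nonce once, but confirm every copy,
// because the first confirmation may have been lost.
class DedupingReceiver {
  constructor(handle) {
    this.seen = new Set();
    this.handle = handle;
  }

  receive({ nonce, payload }) {
    if (!this.seen.has(nonce)) {
      this.seen.add(nonce);
      this.handle(payload);
    }
    return nonce;                     // value to send back as the confirmation
  }
}
```

In a long-running process the seen set would need pruning (for example by dropping nonces older than the resend window); that is left out here.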
Related
I am using the ws Node.js package to create a simple WebSocket client connection to a server that is sending hundreds of messages per second. Even with a simple onMessage handler that just console.logs incoming messages, the client cannot keep up. My understanding is that this is referred to as backpressure, and incoming messages may start piling up in a network buffer on the client side, or the server may throttle the connection or disconnect altogether.
How can I monitor backpressure, or the network buffer, from the client side? I've found several articles that discuss this issue from the perspective of the server, but I have no control over the server and need to know just how slow my client is.
So you don't have control over the server and want to know how slow your client is (it seems you have already read about backpressure). Then I can only think of using a stress tool like Artillery.
Check this blog post; it might help you set up a benchmarking scenario.
https://ma.ttias.be/benchmarking-websocket-server-performance-with-artillery/
Add timing metrics to your onMessage function to track how long it takes to process each message. You can also use RUM instrumentation from the APM providers -- New Relic or AppDynamics for paid options, or the free tier of Google Analytics timing.
If you can, include a unique identifier for correlation between the client and server for each message sent.
Then you can correlate for a given window how long a message took to send from the server and how long it spent being processed by the client.
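As a minimal sketch of the timing idea, a handler can be wrapped so that every call is measured. The function name timed and the shape of the stats object are my own invention; the clock is injectable only so the wrapper can be tested without real delays.

```javascript
// Wraps a synchronous message handler and records how long each call takes.
function timed(handler, now = () => Date.now()) {
  const stats = { count: 0, totalMs: 0, maxMs: 0 };

  const wrapped = (message) => {
    const start = now();
    try {
      return handler(message);
    } finally {
      // Runs whether the handler returned or threw.
      const elapsed = now() - start;
      stats.count += 1;
      stats.totalMs += elapsed;
      stats.maxMs = Math.max(stats.maxMs, elapsed);
    }
  };

  wrapped.stats = stats; // read e.g. stats.totalMs / stats.count for the average
  return wrapped;
}
```

With the ws package this would be attached as ws.on('message', timed(myHandler)). If the average processing time multiplied by the incoming message rate approaches one second per second, the client cannot keep up. Date.now() only has millisecond resolution, so very fast handlers may need a finer clock.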
You can't get directly to the network socket buffer associated with your websocket traffic since you're inside the browser sandbox. I checked the WebSocket APIs and there are no properties that expose receive buffer information.
If you don't have control over the server, you are limited. But you could try some client tricks to simulate throttling.
This heavily assumes you don't mind skipping messages.
One approach would be to enable the socket, start receiving events and set your own max count in an in-memory queue/array. Once you reach a full queue, turn off the socket. Process enough of the queue, then enable the socket again.
This has high cost to disable/enable the socket, as well as the loss of events, but at least your client will not crash.
Once your client is not crashing, you can put some additional counts on timestamp and the queue size to determine the threshold before the client starts crashing.
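The queue-and-pause approach could look something like the sketch below. Everything here is hypothetical naming: BoundedInbox, maxSize and resumeBelow are illustrative, and pause and resume stand in for whatever turns your socket off and on again.

```javascript
// In-memory queue that pauses the source when full and resumes it
// once enough of the backlog has been processed.
class BoundedInbox {
  constructor({ maxSize, resumeBelow, pause, resume }) {
    this.maxSize = maxSize;         // queue length at which the socket is turned off
    this.resumeBelow = resumeBelow; // queue length at which it is turned on again
    this.pause = pause;
    this.resume = resume;
    this.queue = [];
    this.paused = false;
    this.dropped = 0;               // messages that arrived while paused
  }

  push(message) {
    if (this.paused) {
      this.dropped += 1;
      return false;
    }
    this.queue.push(message);
    if (this.queue.length >= this.maxSize) {
      this.paused = true;
      this.pause();
    }
    return true;
  }

  // Process up to `count` queued messages, oldest first.
  drain(count, processMessage) {
    const batch = this.queue.splice(0, count);
    batch.forEach(processMessage);
    if (this.paused && this.queue.length <= this.resumeBelow) {
      this.paused = false;
      this.resume();
    }
    return batch.length;
  }
}
```

Logging queue.length and dropped together with a timestamp on every drain gives the counts needed to find the threshold.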
I am having serious problems making message delivery fail-proof in a chat system.
I have several node.js servers with live communication to the clients via websocket, and I use rabbit to call back the correct consumer at a specific node.
I declare my queues as {durable: true, prefetch:1, expires: 2*3600*1000, autoDelete: true}
consumerOption is {noAck: false, exclusive: false}
Once I receive a message from the server, I call back the server, get the message, and use message.ack(false)
Sometimes a message shows up in rabbit with a pending ACK and, as I would expect, the consumers stop being called back.
Here is my overall strategy:
1- When the socket disconnects, I recover the queue using queue.recover() during the reconnection/connection (more frequent).
2- When I send a message to the server and do not receive it back, I send a message to the server to recover the queue.
3- I use the socket callback function to send the ack confirmation. On the server, I use message.ack(false). The server keeps a hashmap {[ackCode: string]: RabbitMessage} and I send the ackCode back to the server, so it can retrieve the correct message and ack it.
5- If the client has not received any message for 2 minutes, I ask the server to recover the queue.
Step 5 should not need to exist, but even with it, sometimes I send a recover-queue request to the server, the server executes the command, but nothing happens and the chat freezes.
These events are very difficult to debug. I am using a TypeScript library that has gone 3 years without a commit, and this could be one of the causes.
Regarding the strategy, is it correct? Any idea what I could be facing?
What I've learned, and why I think I couldn't use rabbit to solve the specific problem mentioned in the original post.
The domain: A "chat" where the message order is very important (some are chains) and we must be sure that the message will be delivered if/when the client is online.
The problem: We have several node.js servers, and sockets are spread among them. Sockets drop all the time, and it is common for a client connection that was on the first server to reconnect on another. We don't use cookies, and session affinity by IP won't handle the issue.
Limitations: That being said, I can't activate a consumer that is currently active on another server, so if a customer queue is tied to server 1 I can't activate it on server 2. And all the messages that need to be sent are tied to this specific queue.
Another limitation is that I don't have an easy way to consume queues, re-queue, know in advance how many un-acked messages I have in the queue, aggregate them and bulk send them via socket.
The solution: I am no longer using {noAck: false}, and I am controlling the ack in a Redis queue. Thus, I am using Rabbit as a pub-sub, to call back the correct consumers to send the message using the socket. When Rabbit wakes me up, the first thing I do is put the message at the end of a Redis queue. When I send a message via socket, I always start sending from the beginning of the queue, regardless of which message has just woken me up. I send the message and wait for the callback event. If it is not ok, I re-queue the messages.
After decoupling the pub-sub from the queue/ack control, I can now easily move my rabbit pub/sub from one server to another (declaring it using socket.id and no longer with the client queue), with no concern about losing any message. Also, I am now capable of much more advanced operations on my queue.
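The queue/ack half of this design can be sketched as below. This is not my production code: a plain array stands in for the Redis list so the example runs on its own, and the names ClientOutbox, send and onResult are hypothetical. The send function represents the socket emit and onResult represents the socket callback event.

```javascript
// Per-client outbox: new messages go to the tail, sending always starts
// at the head, and a message leaves the queue only after it is confirmed.
class ClientOutbox {
  constructor(send) {
    this.send = send;      // hypothetical: emits one message over the socket
    this.queue = [];       // in-memory stand-in for the Redis queue
    this.inFlight = false; // true while waiting for the callback event
  }

  // Called when pub/sub wakes us up with a new message.
  enqueue(message) {
    this.queue.push(message);
    this.pump();
  }

  // Sends the head of the queue unless a send is already outstanding.
  pump() {
    if (this.inFlight || this.queue.length === 0) return;
    this.inFlight = true;
    this.send(this.queue[0]);
  }

  // Called with the outcome of the socket callback (or a timeout).
  onResult(ok) {
    this.inFlight = false;
    if (ok) {
      this.queue.shift();
      this.pump();
    }
    // On failure the message stays at the head; call pump() again
    // later, for example after the client reconnects.
  }
}
```

Because the queue lives outside the pub/sub subscription, any server that currently holds the client's socket can run this loop against the same queue.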
As my use case doesn't allow me to use the full power of exchanges/bindings (I have complex routing rules), I am evaluating the possibility of changing from rabbit to redis pub/sub, but in this case, I would continue to separate the pub/sub from the queue.
After more than a month trying to make rabbit work in this scenario, I think I was applying a good technology to the wrong use case. It is much simpler now.
So I have a socket.io server which works well. It's very simple: it kind of mimics screen sharing, in that it broadcasts one client's position on the page to the other, who catches it and moves to said location, etc. All of this works fine, but because of the way I'm catching movement, it's possible (and quite common) for too many messages to be sent at once, making it impossible for the other client to keep up.
I was wondering if there is a way to make socket.io 'sleep' or 'wait' for a certain interval, ignore the messages sent during this interval without returning an error, and then begin listening again?
It is feasible to implement this in each client (and this may be the better option), but I just wanted to know if this is possible on the server side too.
Use volatile messages. If there are too many messages, they will just be dropped, and delivery continues with the real-time messages.
socket.volatile.emit('msg', data);
From the socket.io website:
Sending volatile messages.
Sometimes certain messages can be dropped. Let's say you have an app that shows realtime tweets for the keyword `bieber`.
If a certain client is not ready to receive messages (because of network slowness or other issues, or because he's connected through long polling and is in the middle of a request-response cycle), if he doesn't receive ALL the tweets related to bieber your application won't suffer.
In that case, you might want to send those messages as volatile messages.
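The client-side option raised in the question (ignore messages during an interval, without errors) can be sketched with a small throttle wrapper. This is independent of socket.io; the name throttle and the injectable clock are assumptions made for the example.

```javascript
// Returns a wrapper that lets at most one call through per interval
// and silently drops the rest.
function throttle(fn, intervalMs, now = () => Date.now()) {
  let lastCall = -Infinity;
  return (...args) => {
    const t = now();
    if (t - lastCall < intervalMs) {
      return false;      // still inside the quiet interval: drop
    }
    lastCall = t;
    fn(...args);
    return true;
  };
}
```

On the sending client this would wrap the function that emits the position, so that at most one position per interval ever reaches the server.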
For interactive web apps, things like WebSockets are getting more popular. However, as clients and proxies are not always fully compliant, one usually uses a complex framework like 'Socket.IO', which hides several different mechanisms as fallbacks in case something disables the others.
I just wonder what the downsides of a properly implemented long polling are, because with today's servers like node.js it is quite easy to implement and relies on old HTTP technology which is well supported (although the long polling behaviour itself may break it).
From a high-level view, long polling (despite some additional overhead, feasible for medium-traffic apps) resembles true push behaviour as WebSockets provide, as the server actually sends its answer whenever it likes (apart from some timeout / heartbeat mechanism).
So we have some more overhead due to the additional TCP/IP acknowledgements, I guess, but no constant traffic like frequent polling would produce.
And using an event driven server, we would have no thread overhead to keep the connections blocked.
So is there any other hard downside that forces medium-traffic apps like chats to use WebSockets rather than long polling?
Overhead
It will create a new connection each time, so it will send the HTTP headers... including the cookie header that may be large.
Also, just checking "if there is something new" is another connection for nothing. A connection implies work for many components such as firewalls, load balancers, web servers, etc. Establishing the connection is probably the most time-consuming part once your IT infrastructure has several inspectors.
If you are using HTTPS, you are doing again and again the most expensive operation, the TLS handshake. TLS performance is good once the connection is established and the symmetric encryption is working, but the process of establishing the connection, key exchange and all that jazz is not fast.
Also, when connections are made, log entries are written somewhere, counters are incremented somewhere, memory is consumed, objects are created... etc. For example, the reason why we have different logging configurations in production and in development is that writing log entries also affects performance.
Presence
When is a long polling user connected or disconnected? If you check for this at a given moment of time... what would be the reliable amount of time you should wait to double check, to ensure it is disconnected or connected?
This may be totally irrelevant if your application just broadcasts stuff, but it may be very relevant if your application is a game.
Not persistent
This is the big deal.
Since a new connection is created each time, if you have load-balanced servers in a round-robin scenario, you cannot know which server the next connection is going to land on.
When a user's server is known, as when using a WebSocket, you can push the events to that server straight away, and the server will relay them to the connection. If the user disconnects, the server can notify straight away that the user is not connected anymore, and when the user connects again they can subscribe again.
If you don't know which server the user is connected to at the moment an event for them is generated, you have to wait for the user to connect so that you can then say "hey, user 123 is here, give me all the news since this timestamp", which makes it a little more cumbersome. Long polling is not really push technology but request-response, so if you plan for an event-driven architecture (EDA), at some point you are going to hit some impedance mismatch that you have to address; for example, you need an event aggregator that can give you all the events since a given timestamp (the last time that user connected to ask for news).
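To make the "give me all the news since this timestamp" idea concrete, here is a very small in-memory sketch. The EventLog name and its methods are hypothetical; a real aggregator would sit in a store shared by all servers rather than in one process.

```javascript
// Minimal event aggregator: stores events per user and returns
// everything recorded after a given timestamp.
class EventLog {
  constructor() {
    this.events = []; // kept in insertion order
  }

  append(userId, payload, timestamp) {
    this.events.push({ userId, payload, timestamp });
  }

  // Events for this user that are strictly newer than `timestamp`.
  since(userId, timestamp) {
    return this.events.filter(
      (e) => e.userId === userId && e.timestamp > timestamp
    );
  }
}
```

Each long-poll request would carry the timestamp of the last event the client saw, and whichever server receives it answers from the shared log.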
SignalR (I guess it is the .NET equivalent of socket.io), for example, has a message bus named backplane that relays all the messages to all the servers, as the key to scaling out. Therefore, when a user connects to another server, "his" pending events are there "as well"(!) It is a "not too bad" approach but, as you can guess, it affects the throughput:
Limitations
Using a backplane, the maximum message throughput is lower than it is when clients talk directly to a single server node. That's because the backplane forwards every message to every node, so the backplane can become a bottleneck. Whether this limitation is a problem depends on the application. For example, here are some typical SignalR scenarios:
Server broadcast (e.g., stock ticker): Backplanes work well for this scenario, because the server controls the rate at which messages are sent.
Client-to-client (e.g., chat): In this scenario, the backplane might be a bottleneck if the number of messages scales with the number of clients; that is, if the rate of messages grows proportionally as more clients join.
High-frequency realtime (e.g., real-time games): A backplane is not recommended for this scenario.
For some projects, this may be a showstopper.
Some applications just broadcast general data, but others have connection semantics, like a multiplayer game, where it is important to send the right events to the right connections.
IMHO
Long polling is a good solution for small projects, but becomes a big burden for highly scalable apps that need high-frequency and/or very segmented event sending.
I implemented a Node.js Express server that supported long polling. The biggest mistake I made was not cleaning up the requests, which slowed the server down. If your server doesn't support concurrency or threads, one of the essential tasks is to set appropriate timeouts for the requests/responses to release them from the loop, which you have to do yourself.
Edit: Also keep in mind that browsers have their own limit on the number of connections (e.g. 6 per hostname for Google Chrome). So if you have too many long polling connections open at the same time, you will probably block yourself.
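The cleanup described above might be organised roughly like this. It is a sketch, not the code I actually ran: PendingPolls and its method names are made up, and respond stands in for whatever writes the HTTP response.

```javascript
// Registry of waiting long-poll requests. Every entry is released exactly
// once: by an event, by its timeout, or because the client went away.
class PendingPolls {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.waiting = new Set();
  }

  add(respond) {
    const entry = { respond, timer: null };
    entry.timer = setTimeout(
      () => this.finish(entry, { timedOut: true, events: [] }),
      this.timeoutMs
    );
    this.waiting.add(entry);
    return entry;
  }

  finish(entry, body) {
    if (!this.waiting.delete(entry)) return; // already answered or dropped
    clearTimeout(entry.timer);
    entry.respond(body);
  }

  // Client disconnected before anything happened: forget the request.
  drop(entry) {
    if (this.waiting.delete(entry)) clearTimeout(entry.timer);
  }

  // Answer every waiting request with the new event.
  publish(event) {
    for (const entry of [...this.waiting]) {
      this.finish(entry, { timedOut: false, events: [event] });
    }
  }
}
```

In an Express route, respond would typically be (body) => res.json(body), and drop would be wired to the request's 'close' event so abandoned connections do not pile up.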
Is there some kind of ordering mechanism in Socket.IO that guarantees that events are received in order by clients?
For example: a server emits event Evt1 to client A, and then the server broadcasts Evt2 to all clients.
Client A should thus receive Evt1 then Evt2, and only in that order.
My guess is NO and, if that's the case, how would you implement it, or are there existing solutions?
Since previous answers could potentially be misleading, I think it's important to clarify one thing:
When using Socket.io with WebSockets, packet order is guaranteed to be maintained.
This is an old post, but it's worth noting that Socket.io over WebSockets actually does guarantee that event ordering is maintained. This is because TCP itself, which is the underlying technology for WebSockets & HTTP, guarantees that packet ordering is maintained. However, Socket.io also supports several other protocols that do not guarantee order.
There have been several posted questions about this issue, supporting this fact:
https://github.com/josephg/ShareJS/issues/375
Can websocket messages arrive out-of-order?
No, you have to do that at the application level if you need to do that. The internet doesn't guarantee that two different packets will take the same route, so timing can vary. Maybe add a timestamp to each message so you can sort by that timestamp to keep things in order.
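One way to do this at the application level is sketched below. Note that it swaps the timestamp suggested above for a per-sender sequence number, which makes gaps detectable; that substitution, and the names ReorderBuffer and deliver, are my own assumptions.

```javascript
// Holds back messages that arrive early and releases them in sequence order.
class ReorderBuffer {
  constructor(deliver, firstSeq = 1) {
    this.deliver = deliver; // called with each payload, strictly in order
    this.next = firstSeq;   // sequence number we are waiting for
    this.held = new Map();  // seq -> payload for messages that came too early
  }

  push(seq, payload) {
    if (seq < this.next || this.held.has(seq)) return; // duplicate, ignore
    this.held.set(seq, payload);
    // Release everything that is now contiguous with what was delivered.
    while (this.held.has(this.next)) {
      this.deliver(this.held.get(this.next));
      this.held.delete(this.next);
      this.next += 1;
    }
  }
}
```

If a message never arrives, the buffer waits forever, so in practice it would be paired with a timeout or a resend request.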