Hard downsides of long polling? - node.js

For interactive web apps, technologies like WebSockets are getting more popular. However, because clients and proxies are not always fully compliant, one usually uses a framework like 'Socket.IO', which hides several different transport mechanisms so that a fallback is available whenever one of them is blocked.
I just wonder what the downsides of properly implemented long polling are, because with today's servers like node.js it is quite easy to implement, and it relies on plain old HTTP, which is well supported (even though the long-polling behaviour itself may trip up some intermediaries).
From a high-level view, long polling (despite some additional overhead, which is acceptable for medium-traffic apps) resembles the true push behaviour of WebSockets, since the server actually sends its answer whenever it likes (apart from some timeout / heartbeat mechanism).
So I guess we have some extra overhead from the additional TCP/IP round trips, but no constant traffic like frequent polling would produce.
And with an event-driven server, there is no per-thread overhead for keeping the connections open.
So is there any other hard downside that forces medium-traffic apps like chats to use WebSockets rather than long polling?

Overhead
Long polling creates a new request each time, so it resends the HTTP headers... including the cookie header, which may be large.
Also, a request that merely asks "is there anything new?" is a connection made for nothing. Every connection involves work for many components: firewalls, load balancers, web servers... etc. Establishing the connection is probably the most time-consuming part once your IT infrastructure has several of these inspectors in the path.
If you are using HTTPS, you repeat the most expensive operation over and over: the TLS handshake. TLS performance is good once the connection is established and the symmetric encryption is working, but the process of establishing the connection, key exchange and all that jazz, is not fast.
Also, every time a connection is made, log entries are written somewhere, counters are incremented somewhere, memory is consumed, objects are created... etc... etc. For example, the reason why we have different logging configurations in production and in development is that writing log entries also affects performance.
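To make that per-request cost concrete, here is a minimal browser-side sketch of a long-polling loop (the /poll endpoint, event shape and handleEvent function are hypothetical); every iteration is a fresh request carrying the full header set:

```js
// Minimal browser-side long-polling loop (hypothetical /poll endpoint and event shape).
// Every iteration is a brand-new HTTP request: full headers and cookies each time,
// and a fresh TCP + TLS handshake whenever the connection is not kept alive.
let lastEventId = 0;

function handleEvent(event) {
  console.log('event received:', event); // application-specific handling goes here
}

async function longPoll() {
  for (;;) {
    try {
      const res = await fetch(`/poll?since=${lastEventId}`, { credentials: 'include' });
      if (res.status === 200) {
        const events = await res.json();
        events.forEach(handleEvent);
        if (events.length) lastEventId = events[events.length - 1].id;
      }
      // A 204 means the server timed out with nothing to say; just re-poll.
    } catch (err) {
      await new Promise((r) => setTimeout(r, 5000)); // back off before retrying
    }
  }
}

longPoll();
```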
Presence
When is a long-polling user considered connected or disconnected? If you check at a given moment in time... how long should you reliably wait before double-checking, to be sure the user is really connected or disconnected?
This may be totally irrelevant if your application just broadcasts stuff, but it may be very relevant if your application is a game.
Not persistent
This is the big deal.
Since a new connection is created each time, if you have load-balanced servers in a round-robin scenario, you cannot know which server the next connection is going to land on.
When a user's server is known, as when using a WebSocket, you can push events to that server straight away, and the server will relay them to the connection. If the user disconnects, the server can notify straight away that the user is not connected anymore, and when the user connects again it can re-subscribe.
If the server where the user is connected at the moment an event is generated for him is unknown, you have to wait for the user to connect so you can say "hey, user 123 is here, give me all the news since this timestamp", which makes it a bit more cumbersome. Long polling is not really push technology but request-response, so if you plan an event-driven (EDA) architecture, at some point you will have some level of impedance to address; for example, you need an event aggregator that can give you all the events since a given timestamp (the last time the user connected to ask for news).
SignalR (I guess it is the .NET equivalent of socket.io), for example, has a message bus called a backplane that relays all messages to all servers, which is key for scaling out. Therefore, when a user connects to another server, "his" pending events are there "as well"(!). It is a "not too bad" approach, but as you can guess it affects throughput:
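For illustration, a minimal sketch of such an aggregator endpoint, assuming Express and an in-memory store (all names are hypothetical):

```js
// Hypothetical event aggregator: clients ask for "everything since the last
// timestamp I saw", so it does not matter which server answers the request.
const express = require('express');
const app = express();

const events = []; // in-memory stand-in; a real system would use a shared store

// Called by the rest of the application whenever something happens for a user.
function publish(userId, payload) {
  events.push({ userId, payload, ts: Date.now() });
}

// "Hey, user 123 is here, give me all the news since this timestamp."
app.get('/events/:userId', (req, res) => {
  const since = Number(req.query.since) || 0;
  res.json(events.filter((e) => e.userId === req.params.userId && e.ts > since));
});

app.listen(3000);
```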
Limitations
Using a backplane, the maximum message throughput is lower than it is when clients talk directly to a single server node. That's because the backplane forwards every message to every node, so the backplane can become a bottleneck. Whether this limitation is a problem depends on the application. For example, here are some typical SignalR scenarios:
Server broadcast (e.g., stock ticker): Backplanes work well for this scenario, because the server controls the rate at which messages are sent.
Client-to-client (e.g., chat): In this scenario, the backplane might be a bottleneck if the number of messages scales with the number of clients; that is, if the rate of messages grows proportionally as more clients join.
High-frequency realtime (e.g., real-time games): A backplane is not recommended for this scenario.
For some projects, this may be a showstopper.
Some applications just broadcast general data, but others have connection semantics, like for example a multiplayer game, where it is important to send the right events to the right connections.
IMHO
Long polling is a good solution for small projects, but it becomes a big burden for highly scalable apps that need high-frequency and/or very segmented event delivery.

I implemented a Node.js Express server that supported long polling. The biggest mistake I made was not cleaning up the requests, which slowed the server down. If your server doesn't support concurrency or threads, one of the essential tasks is to set appropriate timeouts for the requests/responses to release them from the loop, which you have to do yourself.
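A sketch of the kind of cleanup described above, assuming Express (the route name, waiter registry and timeout value are arbitrary):

```js
// Long-polling endpoint that always releases the held response: either an event
// arrives, or a timeout fires. Forgetting this cleanup is what slowly exhausts
// the server. (Sketch with a hypothetical in-memory waiter registry.)
const express = require('express');
const app = express();

const waiters = new Set(); // pending responses waiting for an event

app.get('/poll', (req, res) => {
  const waiter = { res, timer: null };
  waiters.add(waiter);

  // 1. Hard timeout: answer with 204 so the client can simply re-poll.
  waiter.timer = setTimeout(() => {
    waiters.delete(waiter);
    res.status(204).end();
  }, 25000);

  // 2. Client went away: drop the waiter so it doesn't leak.
  req.on('close', () => {
    clearTimeout(waiter.timer);
    waiters.delete(waiter);
  });
});

// Called whenever new data is available.
function broadcast(event) {
  for (const { res, timer } of waiters) {
    clearTimeout(timer);
    res.json(event);
  }
  waiters.clear();
}

app.listen(3000);
```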
Edit: Also keep in mind that browsers have a specific limit on the number of connections (e.g. 6 per hostname in Google Chrome). So if you have too many long-polling connections open at the same time, you will probably block yourself.

Related

Node.js design approach. Server polling periodically from clients

I'm trying to learn Node.js and adequate design approaches.
I've implemented a little API server (using express) that fetches a set of data from several remote sites, according to client requests that use the API.
This process can take some time (several fetch / await calls), so I want the user to know how his request is doing. I've read about socket.io / WebSockets, but maybe that's an overkill solution for this case.
So what I did is:
For each client request, a requestID is generated and returned to the client.
With that ID, the client can query the API (via another endpoint) to know his request status at any time.
Using setTimeout() on the client page and some DOM manipulation, I can update and display the current request status every X, like a polling approach.
Although the solution works fine, even with several clients connecting concurrently, maybe there's a better approach? Are there any caveats I'm not considering?
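A minimal sketch of the pattern described above, assuming Express (route names and the in-memory job store are hypothetical):

```js
// Server side: issue a request ID, do the slow work in the background,
// and let the client poll a status endpoint until the job is done.
const express = require('express');
const crypto = require('crypto');
const app = express();

const jobs = new Map(); // requestId -> { state, progress }

app.post('/api/fetch', (req, res) => {
  const id = crypto.randomUUID();
  jobs.set(id, { state: 'running', progress: 0 });
  runJob(id).catch(() => jobs.set(id, { state: 'failed' }));
  res.json({ requestId: id }); // client stores this and polls /api/status/:id
});

app.get('/api/status/:id', (req, res) => {
  res.json(jobs.get(req.params.id) || { state: 'unknown' });
});

async function runJob(id) {
  // stand-in for the several remote fetch/await calls
  for (let i = 1; i <= 5; i++) {
    await new Promise((r) => setTimeout(r, 1000));
    jobs.set(id, { state: 'running', progress: i * 20 });
  }
  jobs.set(id, { state: 'done', progress: 100 });
}

app.listen(3000);
```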
TL;DR The approach you're using is just fine, although it may not scale very well. Websockets are a different approach to solve the same problem, but again, may not scale very well.
You've identified what are basically the only two options for real-time (or close to it) updates on a web site:
polling the server - the client requests information periodically
using Websockets - the server can push updates to the client when something happens
There are a couple of things to consider.
How important are "real time" updates? If the user can wait several seconds (or longer), then go with polling.
What sort of load can the server handle? If load is a concern, then Websockets might be the way to go.
That last question is really the crux of the issue. If you're expecting a few or a few dozen clients to use this functionality, then either solution will work just fine.
If you're expecting thousands or more to be connecting, then polling starts to become a concern, because now we're talking about many repeated requests to the server. Of course, if the interval is longer, the load will be lower.
It is my understanding that the overhead for Websockets is lower, but still can be a concern when you're talking about large numbers of clients. Again, a lot of clients means the server is managing a lot of open connections.
The way large services handle this is to design their applications in such a way that they can be distributed over many identical servers and which server you connect to is managed by a load balancer. This is true for either polling or Websockets.

How does Server keep track of all Client(s) connected in Real time data pushing scenario?

I kinda understand that Websocket is the protocol that is used for real-time data flowing back & forth.
My question may be very premature, but I couldn't find much help on the web.
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
If this is some sort of loop on the server side, where all connected clients' details are cached and the update is then sent out to all of them, isn't that an overhead?
This Stack Overflow answer made some sense but didn't clear up my doubt.
How does Server keep track of all Client(s) connected in Real time data pushing scenario?
It doesn't... it only keeps track of the clients it's serving specifically.
This answer is not node.js specific.
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
To actually understand this a little better, we should consider larger numbers. For example, let's assume 1 million clients connected to a service.
Obviously, a sane design will require redundancy, so no single service will hold all 1 million connections (and if a single server instance fails, clients can re-connect to a different server instance).
In this case, there's no single server that is aware of all clients.
It makes more sense for each server to manage its own internal subscription / client list. Each server will also act as a pub/sub client of a centralized pub/sub service (such as a Redis cluster or whatever).
Assuming 1000 server instances, each serving 1000 clients, we would find that the pub/sub service is aware of only 1,000 "clients" (the server instances). Each server is unaware of the other clients; it's only aware of the 1,000 clients it's managing.
If this is some sort of loop on the server side, where all connected clients' details are cached and the update is then sent out to all of them, isn't that an overhead?
The algorithm itself is implementation specific, but in general, each server will incur some overhead in order to manage the pub/sub layer.
However, since each server only manages a small subset of the total client count, the overhead is distributed across a number of systems.
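A minimal sketch of that layout for one server instance, assuming the ioredis and ws packages and a hypothetical 'stock-prices' channel:

```js
// Each server instance subscribes once to a central Redis channel and relays
// messages only to the clients it is holding itself. (Sketch using ioredis + ws.)
const Redis = require('ioredis');
const { WebSocketServer, WebSocket } = require('ws');

const sub = new Redis();                          // subscriber connection to the pub/sub service
const wss = new WebSocketServer({ port: 8080 });  // this instance's own ~1000 clients

sub.subscribe('stock-prices');
sub.on('message', (channel, message) => {
  // Loop only over the local clients; every other instance does the same for its own.
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(message);
  }
});
```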
Channel Oriented vs. Connection Oriented Design
I should probably note that the pub/sub design isn't connection oriented.
The server isn't (or shouldn't be) looping over all the connections asking "are you subscribed to this channel?".
Rather, pub/sub design assumes a "channel" oriented design, where it locates the channel object(s) and loops over a client list.
On one hand, this approach might (or might not) consume more memory. Since each "channel" should contain a list of clients listening to that channel, a single client object might belong to more than a single list.
On the other hand, the loop has less code branches and experiences less overhead than a connection oriented design. Also, this approach allows for pub/sub clients that aren't connection bound (such as internal hooks / callbacks).
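A tiny sketch of that channel-oriented bookkeeping (plain data structures; the client objects and their send() method are assumed):

```js
// Channel-oriented bookkeeping: look up the channel, loop over its subscribers.
// A single client object may appear in several channel sets at once.
const channels = new Map(); // channel name -> Set of client objects

function subscribe(channel, client) {
  if (!channels.has(channel)) channels.set(channel, new Set());
  channels.get(channel).add(client);
}

function publish(channel, message) {
  const subscribers = channels.get(channel);
  if (!subscribers) return;
  for (const client of subscribers) client.send(message); // client.send is assumed
}
```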
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
Socket.io already keeps track by itself, and it's pretty easy to emit to all connected clients.
Socket.io - Emit Cheatsheet
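For example, broadcasting a price update to every connected client with socket.io (the event name and trigger function are hypothetical):

```js
// socket.io tracks the connections itself; io.emit() reaches all of them.
const { Server } = require('socket.io');
const io = new Server(3000);

io.on('connection', (socket) => {
  console.log('client connected', socket.id);
});

// Whenever a price changes (hypothetical trigger):
function onPriceChange(update) {
  io.emit('price-update', update); // broadcast to every connected client
}
```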
If you are worried about what would happen when your user-base grows, you can scale your service to multiple nodes.
If you actually end up scaling and have more than one server node, then you can use
socketio-redis.
Adapter to enable broadcasting of events to multiple separate socket.io server nodes.

Reliable and fast way to send database updates to one or more web browsers

What is a reliable and fast way to send database updates to one or more web browsers?
I have a Postgres database with a few tables being updated over time. The updates range from 0 to 1000 per second. When a table is updated, I want one or more web clients to receive the updates as quickly and efficiently as possible. The updates are less than 1K each.
UDP will be the fastest, but it requires dedicated clients and data loss is likely to occur.
TCP/IP guarantees data integrity, which means you can use SSE or WebSockets for browser clients. However, it requires the data to be sent to each client individually.
SSE only supports text data and is unidirectional. It also imposes other limits and uses up one of the browser's per-domain connection limits (browsers are often limited to 6 HTTP connections per domain).
WebSockets are bi-directional and offer more flexibility. Also, they don't detract from a browser's per-domain connection limit.
Polling is really a bad idea as far as performance goes, both due to overhead and the chance of redundant requests.
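As an illustration of the push approach, here is a sketch that relays Postgres NOTIFY payloads to browsers over WebSockets, assuming the pg and ws packages and a hypothetical table_updates trigger channel:

```js
// Relay Postgres NOTIFY payloads to browser clients over WebSockets.
// Assumes the updated tables have triggers doing: NOTIFY table_updates, '<json payload>';
const { Client } = require('pg');
const { WebSocketServer, WebSocket } = require('ws');

const wss = new WebSocketServer({ port: 8080 });
const pg = new Client({ connectionString: process.env.DATABASE_URL });

async function main() {
  await pg.connect();
  await pg.query('LISTEN table_updates');
  pg.on('notification', (msg) => {
    // Push the raw payload to every browser this instance is serving.
    for (const client of wss.clients) {
      if (client.readyState === WebSocket.OPEN) client.send(msg.payload);
    }
  });
}

main();
```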
A short search will get you more information. Many questions about this subject have been asked before.
There's a discussion about WebSockets vs. SSE, performance discussions about polling vs. WebSockets and an overview of use-cases for AJAX in a WebSockets world.
These should get you started.

choose between tcp "long" connection and "short" connection for internal service

I have an app where a web server redirects some requests to backend servers, and the backend servers (Linux) do complicated computations and respond to the web server.
For managing the TCP socket connections between the web server and the backend servers, I think there are two basic strategies:
"short" connection: that is, one connection per request. This seems very easy for socket management and simplify the whole program structure. After accept, we just get some thread to process the request and finally close this socket.
"long" connection: that is, for one tcp connection, there could be multi request one by one. It seems this strategy could make better use of socket resource and bring some performance improvement(i am not quite sure). BUT it seems this brings a lot of complexity than "short" connection. For example, since now socket fd may be used by multi-threads, synchronization must be involved. and there are more, socket failure process, message sequence...
Are there any suggestions for choosing between these two strategies?
UPDATE: #SargeATM's answer reminded me that I should say more about the backend service.
Each request is kind of context-free. The backend service can do its calculation based on a single request message, so it is essentially stateless.
Without getting into the architecture of the backend, which I think heavily influences this decision, I prefer short connections for stateless "quick" request/response traffic and long connections for stateful protocols like synchronization or file transfer.
I know there is some TCP overhead for establishing a new connection (if it isn't localhost), but that has never been something I have had to optimize in my applications.
Ok, I will get a little into architecture, since this is important. I always use threads not per request but by function. So I would have a thread that listens on the socket, another thread that reads packets off all the active connections, another thread doing the backend calculations, and a last thread saving to a database if needed. This keeps things clean and simple: easy to measure slow spots, easy to maintain, and easy to optimize later if needed.
What about a third option... no connection!
If your job descriptions and job results are both small, UDP sockets may be a good idea. You have even fewer resources to manage, as there's no need to bind the request/response to a file descriptor, which gives you some flexibility for the future. Imagine you add more backend services and would like to do some load balancing: a busy service can forward the job, together with the UDP address of the job submitter, to another one. The submitter just waits for the result and doesn't care where the task was performed.
Obviously you'd have to deal with lost, duplicated and out-of-order packets, but as a reward you don't have to deal with broken connections. Out-of-order packets are probably not a big deal if you can fit the request and response in one UDP message, duplication can be taken care of by job ids, and lost packets... well, they can simply be resent ;-)
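A minimal sketch of that connectionless approach with Node's built-in dgram module (port number and payload format are arbitrary; both ends live in one file only for the sake of the sketch):

```js
// Connectionless backend: each job fits in one UDP datagram, tagged with an id
// so duplicates can be ignored and lost requests simply resent.
const dgram = require('dgram');

// --- backend worker ---
const worker = dgram.createSocket('udp4');
worker.on('message', (msg, rinfo) => {
  const job = JSON.parse(msg);
  const result = { id: job.id, answer: job.x + job.y }; // stand-in computation
  worker.send(JSON.stringify(result), rinfo.port, rinfo.address);
});

worker.bind(41234, () => {
  // --- web server side (same process here only for the sketch) ---
  const client = dgram.createSocket('udp4');
  client.on('message', (msg) => console.log('result:', msg.toString()));
  client.send(JSON.stringify({ id: 1, x: 2, y: 3 }), 41234, '127.0.0.1');
});
```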
Consider this!
Well, you are right.
The biggest problem with persistent connections is making sure the app gets a "clean" connection from the pool, without any leftover data from another request.
There are a lot of ways to deal with that problem, but in the end it is better to close() a tainted connection and open a new one than to try to clean it...

Additional technologies to correctly use node.js and Socket.IO in a time-intensive app?

As a hypothetical example, let's say that I wanted to make an application that displays peoples twitter networks. I would provide an API that would allow a client to query on a single username. That user's top x tweets would be sent to the client. Then, each person that had been mentioned by the initial person would be scanned. Their top x tweets would be sent to the client. This process would recursively continue, breadth-first, until a pre-defined depth was reached. The client would be receiving the data in real time, displaying statistics such as number of users scanned, number of known users remaining to scan, and a growing list of the tweet data. None of the processing is complicated (regex of small amounts of text), but many, many network requests would be spawned from a single initial request.
I really want the fantastic realtime capabilities of node.js with socket.io, but I feel like this is an abuse of those technologies - they're not meant for heavy server-side lifting. Is there a more appropriate toolset for what I am trying to accomplish, or a particular way to use these tools to that end? Milewise is doing something similar-ish, but I think that my application would consume significantly more network resources than theirs.
Thanks.
The best network transport you can get on the web right now is WebSockets, which offer a persistent, bi-directional, real-time connection between server and client. Although not every browser supports them, socket.io gives you a couple of fallback solutions, which may however decrease network performance compared to WebSockets, as stated in this article:
During making connection with WebSocket, client and server exchange data per frame which is 2 bytes each, compared to 8 kilo bytes of http header when you do continuous polling.
...
Reducing kilobytes of data to 2 bytes… and reducing latency from 150ms to 50ms is far more than marginal. In fact, these two factors alone are enough to make WebSocket seriously interesting to Google.
Apart from the network transport, other things also matter, for example how you fetch, format and process the data on the server side. In node.js, heavy CPU-bound computations may block the processing of other asynchronous operations, so these kinds of operations should be dispatched to separate threads or processes to prevent blocking.
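One way to do that in Node.js is a single-file sketch using the built-in worker_threads module (the regex and function names are hypothetical):

```js
// Offload CPU-heavy text processing to a worker so the event loop (and the
// socket.io traffic) keeps flowing. Single-file sketch using worker_threads.
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // Main thread: spawn a worker per batch of tweets and await its result.
  function analyzeTweets(tweets) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: tweets });
      worker.once('message', resolve);
      worker.once('error', reject);
    });
  }
  // usage: analyzeTweets(arrayOfTweets).then((mentions) => io.emit('mentions', mentions));
  module.exports = { analyzeTweets };
} else {
  // Worker thread: the regex scanning happens here, off the main thread.
  const mentions = workerData.flatMap((t) => t.text.match(/@\w+/g) || []);
  parentPort.postMessage(mentions);
}
```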
