Scalable push application with node.js - node.js

I'm thinking about writing a few web applications having almost the same requirements as a chat. And I would like them to be able to scale easily.
I have worked a bit with node.js and I understand how it can help design push applications but I have some difficulties when thinking about having them run on multiple servers.
Here are some designs I can think of for a large-scale chat app:
1 - Servers have state: they keep the connections open and clients can have new messages pushed to them. In this scenario, we are limited by the physical memory of one server, so we cannot scale linearly if we have too many users per room.
2 - Servers have no state: they query a distributed database to respond to clients' requests. In this scenario, clients poll the servers. We could scale linearly, but throughput is decreased, messages are not delivered instantly, and polling is considered bad practice when scaling.
3 - A mix of 1 and 2. Each server keeps its clients' connections open and polls the distributed database. The application is more complex to write and we still use polling. Similar client requests (clients of the same room) are just grouped into a single one made by the server. The code becomes unnecessarily complicated and it does not scale when we have many rooms and few users per room.
4 - Servers have no state and the database cluster uses events to notify every registered server about new messages. This is the solution I would like to have, but I haven't heard of any database which has this feature. (Some people are talking about this feature for mongodb here: https://jira.mongodb.org/browse/SERVER-124)
So why is the 4th solution not used much today?
How do people usually design their applications in this case?

Since you want a push application, you would probably use Socket.IO with RedisStore.
By using this combination, the data for all the connections is kept in Redis (in-memory database), so you can scale outside a process. Another use of Redis here is for pub-sub.
The idea is to trigger an event when something needs to be pushed, then send a message to the browser using Socket.IO. If you want to listen to database changes, perhaps it's better to use CouchDB with its _changes feature.
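The pub-sub idea can be sketched without any external dependency. The following is only an illustration, not Socket.IO's or RedisStore's actual code: a Node `EventEmitter` stands in for the Redis pub/sub channel, and `ServerNode`, `connect`, `publish` and `pushToLocalClients` are hypothetical names chosen for this example.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a Redis pub/sub channel (assumption: in production this
// would be a real broker shared by all server processes).
const broker = new EventEmitter();

// Hypothetical server node: it only knows its own local connections.
class ServerNode {
  constructor(name, broker) {
    this.name = name;
    this.broker = broker;
    this.clients = new Map(); // clientId -> array of received messages
    // Every node subscribes once to the shared channel.
    broker.on('message', (msg) => this.pushToLocalClients(msg));
  }

  connect(clientId) {
    this.clients.set(clientId, []);
  }

  // Called when a local client sends something: publish it once.
  publish(msg) {
    this.broker.emit('message', msg);
  }

  // Push to the connections this node holds; in a real app this would be
  // a Socket.IO emit instead of an array push.
  pushToLocalClients(msg) {
    for (const inbox of this.clients.values()) {
      inbox.push(msg);
    }
  }
}

const nodeA = new ServerNode('A', broker);
const nodeB = new ServerNode('B', broker);
nodeA.connect('alice');
nodeB.connect('bob');

nodeA.publish('hello from alice');
console.log(nodeB.clients.get('bob')); // [ 'hello from alice' ]
```

Neither node needs to know which clients the other one holds; each only reacts to what arrives on the shared channel.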
Resources:
https://github.com/dshaw/talks/tree/master/2011-10-jsclub/sample-app
http://www.ranu.com.ar/2011/11/redisstore-and-rooms-with-socketio.html
How to reuse redis connection in socket.io?

Instead of triggers for case 4, you might want to hook into MongoDB replication.
Let's assume you have a replica set (you wouldn't run single mongod, would you?).
Every change is written to the oplog on the primary and then is replicated to secondaries.
You can efficiently pull new updates from the oplog, using tailable cursors. Note, this is still pull, not push.
Then your node.js will push these events to the clients.
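The "pull, then push" loop can be sketched as follows. This is not the MongoDB driver API: `OpLog`, `append` and `readAfter` are hypothetical in-memory stand-ins for the oplog and a tailable cursor, used only to show that the reader remembers its position, pulls whatever is new, and then pushes it on.

```javascript
// Hypothetical in-memory stand-in for the replication log.
class OpLog {
  constructor() {
    this.entries = [];
  }
  append(op) {
    this.entries.push(op);
  }
  // Return all entries after the given position (like resuming a cursor).
  readAfter(position) {
    return this.entries.slice(position);
  }
}

// The tailer pulls new entries and hands each one to a callback,
// which in a real app would emit to the connected clients.
class Tailer {
  constructor(oplog, onEntry) {
    this.oplog = oplog;
    this.onEntry = onEntry;
    this.position = 0; // number of entries already seen
  }
  poll() {
    const fresh = this.oplog.readAfter(this.position);
    this.position += fresh.length;
    for (const op of fresh) {
      this.onEntry(op);
    }
    return fresh.length;
  }
}

const oplog = new OpLog();
const pushed = [];
const tailer = new Tailer(oplog, (op) => pushed.push(op));

oplog.append({ op: 'insert', text: 'first message' });
oplog.append({ op: 'insert', text: 'second message' });
tailer.poll(); // picks up both entries
oplog.append({ op: 'insert', text: 'third message' });
tailer.poll(); // picks up only the new one
console.log(pushed.length); // 3
```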

Related

How does Server keep track of all Client(s) connected in Real time data pushing scenario?

I kinda understand that Websocket is the protocol that is used for real-time data flowing back & forth.
My question may be very premature, but I couldn't find much help on the web.
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
If this is some sort of looping that happens on the server side, where all connected clients' details are cached and the update is then sent out to all of them, isn't it an overhead?
This SOF answer made some sense but didn't clear my doubt.
How does Server keep track of all Client(s) connected in Real time data pushing scenario?
It doesn't... it only keeps track of the clients it's serving specifically.
This answer is not node.js specific.
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
To actually understand this a little better, we should consider larger numbers. i.e., let's assume 1 million clients connected to a service.
Obviously, a sane design will require redundancy, so no single service will hold all 1 million connections (and if a single server instance fails, clients can re-connect to a different server instance).
In this case, there's no single server that is aware of all clients.
It makes more sense for each server to manage its own internal subscription / client list. Each server will also act as a pub/sub client for a centralized pub/sub service (such as a Redis cluster or whatever).
Assuming 1,000 server instances, each serving 1,000 clients, we would find that the pub/sub service is aware of only 1,000 "clients" (the server instances). Each server is unaware of the other servers' clients; it is only aware of the 1,000 clients it is managing.
If this is some sort of looping that happens on the server side, where all connected clients' details are cached and the update is then sent out to all of them, isn't it an overhead?
The algorithm itself is implementation specific, but in general, each server will incur some overhead in order to manage the pub/sub layer.
However, since each server only manages a small subset of the total client count, the overhead is distributed across a number of systems.
Channel Oriented vs. Connection Oriented Design
I should probably note that the pub/sub design isn't connection oriented.
The server isn't (or shouldn't be) looping over all the connections asking "are you subscribed to this channel?".
Rather, pub/sub design assumes a "channel" oriented design, where it locates the channel object(s) and loops over a client list.
On one hand, this approach might (or might not) consume more memory. Since each "channel" should contain a list of clients listening to that channel, a single client object might belong to more than a single list.
On the other hand, the loop has fewer code branches and experiences less overhead than a connection-oriented design. Also, this approach allows for pub/sub clients that aren't connection bound (such as internal hooks / callbacks).
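To make the difference concrete, here is a small sketch of both designs. All names (`publishByConnection`, `ChannelRegistry`, the `inbox` arrays) are hypothetical and exist only for this illustration; the counters show how many objects each approach has to touch per published message.

```javascript
// Connection-oriented: every connection is asked whether it subscribed.
function publishByConnection(connections, channel, msg) {
  let checks = 0;
  for (const conn of connections) {
    checks += 1;
    if (conn.channels.has(channel)) conn.inbox.push(msg);
  }
  return checks;
}

// Channel-oriented: each channel keeps its own client list.
class ChannelRegistry {
  constructor() {
    this.channels = new Map(); // channel name -> Set of clients
  }
  subscribe(channel, client) {
    if (!this.channels.has(channel)) this.channels.set(channel, new Set());
    this.channels.get(channel).add(client);
  }
  unsubscribe(channel, client) {
    const members = this.channels.get(channel);
    if (!members) return;
    members.delete(client);
    if (members.size === 0) this.channels.delete(channel);
  }
  publish(channel, msg) {
    const members = this.channels.get(channel) || [];
    let visited = 0;
    for (const client of members) {
      visited += 1;
      client.inbox.push(msg);
    }
    return visited;
  }
}

// 1,000 connections, only 3 of them subscribed to "AAPL".
const connections = [];
const registry = new ChannelRegistry();
for (let i = 0; i < 1000; i++) {
  const conn = { id: i, channels: new Set(), inbox: [] };
  if (i < 3) {
    conn.channels.add('AAPL');
    registry.subscribe('AAPL', conn);
  }
  connections.push(conn);
}

const checks = publishByConnection(connections, 'AAPL', 'tick');
const visited = registry.publish('AAPL', 'tick');
console.log(checks);  // 1000
console.log(visited); // 3
```

The memory trade-off mentioned above is visible here too: a client subscribed to several channels appears in several `Set`s.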
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
Socket.io already keeps track of them by itself, and it's pretty easy to emit to all connected clients.
Socket.io - Emit Cheatsheet
If you are worried about what would happen when your user-base grows, you can scale your service to multiple nodes.
If you actually end up scaling and have more than one server node, you can use socket.io-redis, an adapter that enables broadcasting of events to multiple separate socket.io server nodes.

Hard downsides of long polling?

For interactive web apps, things like WebSockets are getting more popular. However, as clients and proxies are not always fully compliant, one usually uses a complex framework like 'Socket.IO', which hides several different mechanisms as fallbacks in case one of them is unavailable.
I just wonder what the downsides of properly implemented long polling are, because with today's servers like node.js it is quite easy to implement and relies on old HTTP technology which is well supported (although the long polling behaviour itself may break it).
From a high-level view, long polling (despite some additional overhead, which is feasible for medium-traffic apps) resembles true push behaviour as WebSockets do, since the server actually sends its answer whenever it likes (apart from some timeout / heartbeat mechanism).
So we have some more overhead due to the more TCP/IP acknowledgements I guess, but no constant traffic like frequent polling would do.
And using an event driven server, we would have no thread overhead to keep the connections blocked.
So is there any else hard downside that forces medium-traffic apps like chats to use WebSockets rather than long polling?
Overhead
It will create a new connection each time, so it will send the HTTP headers... including the cookie header that may be large.
Also, just checking "is there something new?" is another connection for nothing. A connection involves work by many components such as firewalls, load balancers, web servers, etc. Establishing the connection is probably the most time-consuming part as soon as your IT infrastructure has several inspectors.
If you are using HTTPS, you are doing again and again the most expensive operation, the TLS handshake. TLS performance is good once the connection is established and the symmetric encryption is working, but the process of establishing the connection, key exchange and all that jazz is not fast.
Also, when connections are made, log entries are written somewhere, counters are incremented somewhere, memory is consumed, objects are created, etc. For example, the reason we have different logging configurations in production and in development is that writing log entries also affects performance.
Presence
When is a long polling user connected or disconnected? If you check for this at a given moment of time... what would be the reliable amount of time you should wait to double check, to ensure it is disconnected or connected?
This may be totally irrelevant if your application just broadcast stuff, but it may be very relevant if your application is a game.
Not persistent
This is the big deal.
Since a new connection is created each time, if you have load-balanced servers in a round-robin scenario, you cannot know which server the next connection is going to land on.
When a user's server is known, as when using a WebSocket, you can push the events to that server straight away, and the server will relay them to the connection. If the user disconnects, the server can notify straight away that the user is not connected anymore, and when the user connects again, he can subscribe again.
If the server the user is connected to is unknown at the moment an event for him is generated, you have to wait for the user to connect so that you can then say "hey, user 123 is here, give me all the news since this timestamp", which makes it a little more cumbersome. Long polling is not really push technology but request-response, so if you plan for an event-driven architecture (EDA), at some point you will have some impedance mismatch to address; for example, you need an event aggregator that can give you all the events since a given timestamp (the last time that user connected to ask for news).
SignalR (I guess it is the .NET equivalent of socket.io), for example, has a message bus called a backplane that relays all messages to all servers, as the key to scaling out. Therefore, when a user connects to another server, "his" pending events are there "as well"(!). It is a "not too bad" approach but, as you can guess, it affects throughput:
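Such an event aggregator could look roughly like this. It is a minimal in-memory sketch under stated assumptions: `EventAggregator`, `record` and `eventsSince` are hypothetical names, timestamps are plain numbers, and a real deployment would need shared storage and pruning of old events.

```javascript
// Hypothetical event aggregator: keeps events per user so that a
// reconnecting long-polling client can ask for everything it missed.
class EventAggregator {
  constructor() {
    this.eventsByUser = new Map(); // userId -> [{ timestamp, payload }]
  }
  record(userId, payload, timestamp = Date.now()) {
    if (!this.eventsByUser.has(userId)) this.eventsByUser.set(userId, []);
    this.eventsByUser.get(userId).push({ timestamp, payload });
  }
  // "User 123 is here, give me all the news since this timestamp."
  eventsSince(userId, timestamp) {
    const events = this.eventsByUser.get(userId) || [];
    return events.filter((e) => e.timestamp > timestamp);
  }
}

const aggregator = new EventAggregator();
aggregator.record(123, 'friend request', 1000);
aggregator.record(123, 'new message', 2000);
aggregator.record(456, 'unrelated', 2500);

// User 123 last polled at t=1000, so only later events are returned.
const missed = aggregator.eventsSince(123, 1000);
console.log(missed.map((e) => e.payload)); // [ 'new message' ]
```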
Limitations
Using a backplane, the maximum message throughput is lower than it is
when clients talk directly to a single server node. That's because the
backplane forwards every message to every node, so the backplane can
become a bottleneck. Whether this limitation is a problem depends on
the application. For example, here are some typical SignalR scenarios:
Server broadcast (e.g., stock ticker): Backplanes work well for this
scenario, because the server controls the rate at which messages are
sent.
Client-to-client (e.g., chat): In this scenario, the backplane might
be a bottleneck if the number of messages scales with the number of
clients; that is, if the rate of messages grows proportionally as more
clients join.
High-frequency realtime (e.g., real-time games): A backplane is not
recommended for this scenario.
For some projects, this may be a showstopper.
Some applications just broadcast general data, but others have a connection semantics, like for example a multiplayer game, and it is important to send the right events to the right connections.
IMHO
Long polling is a good solution for small projects, but becomes a big burden for highly scalable apps that need high-frequency and/or very segmented event sending.
I implemented a Node.js Express server that supported long polling. The biggest mistake I made was not cleaning up the requests, which slowed the server down. If your server doesn't support concurrency or threads, one of the essential tasks is to set appropriate timeouts for the requests/responses to release them from the loop, which you have to do yourself.
Edit: Also you need to keep in mind that browsers have their own limit on the number of connections (e.g. 6 per hostname for Google Chrome). So if you have too many long polling connections at the same time, you will probably block yourself.
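One way to do that cleanup is to keep every parked request in a registry that always releases it, either with a message or on timeout. This is a sketch rather than the code from that server: `PendingPolls`, `park`, `release` and the 30-second timeout are assumptions, and `respond` stands for whatever ends the HTTP response in your framework.

```javascript
// Hypothetical registry of parked long-poll requests. Each entry is released
// either by a published message or by its timeout, never left hanging.
class PendingPolls {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.pending = new Set();
  }

  // `respond` is a callback that ends the HTTP response, for example
  // (body) => res.end(JSON.stringify(body)) in an Express handler.
  park(respond) {
    const entry = { respond, timer: null };
    entry.timer = setTimeout(
      () => this.release(entry, { timeout: true }),
      this.timeoutMs
    );
    this.pending.add(entry);
    // Return a cancel function to call when the client disconnects early.
    return () => this.release(entry, null);
  }

  release(entry, body) {
    if (!this.pending.delete(entry)) return; // already released
    clearTimeout(entry.timer);
    if (body !== null) entry.respond(body);
  }

  publish(message) {
    for (const entry of [...this.pending]) {
      this.release(entry, { message });
    }
  }
}

const polls = new PendingPolls(30000);
const received = [];
polls.park((body) => received.push(body));
const cancel = polls.park((body) => received.push(body));

cancel();               // second client went away: nothing is sent to it
polls.publish('hello'); // first client gets the message
console.log(received);           // [ { message: 'hello' } ]
console.log(polls.pending.size); // 0
```

In an HTTP handler, the cancel function would typically be wired to the request's `close` event so that abandoned requests do not accumulate.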

Using Backbone.iobind (socket.io) with a cluster of node.js servers

I'm using Backbone.iobind to bind my client Backbone models over socket.io to the back-end server which in turn store it all to MongoDB.
I'm using socket.io so I can synchronize changes back to other clients Backbone models.
The problems starts when I try to run the same thing over a cluster of node.js servers.
Setting a session store was easy using connect-mongo which stores the session to MongoDB.
But now I can't inform all the clients on every change, since the clients are distributed between the different node.js servers.
The only solution I found is to set a pub/sub queue between the different node.js servers (e.g. mubsub), which seems like a very heavy weight solution that will trigger an event on all the servers for every change.
How did you reach the conclusion that pub/sub is a "very heavy weight solution"?
Sounds like you got it right up until that part :-)
Oh, and pub/sub is not a queue.
Let's examine that claim:
The nice thing about pub/sub is that you publish and subscribe to channels/topics.
So, using the classic chat server example, let's say you have a million users connected in total, but #myroom only has 50 users in it.
When a message is sent to #myroom, it's being published once. No duplication whatsoever.
In most use-cases you won't even need to store it on disk/RAM, so we're mostly looking at network/bandwidth here. And, I mean, you're probably throwing more data (probably over the wire?) to MongoDB already, so I assume that's not your bottleneck.
If you also use socket.io's rooms feature (which is basically its own pub/sub mechanism), that means only 50 users will have that message emitted to them over the websocket.
And no, socket.io won't iterate over 1M clients to find out which of them are in room #myroom ;-)
So the message is published once, each subscriber (node.js instance) will get notified once, and only the relevant clients -- socket.io won't waste CPU cycles in order to find them as it keeps track of them when they join() or leave() a room -- will receive the message.
Doesn't that sound pretty efficient and light-weight?
Give Redis a shot.
It's really simple to set-up, runs entirely in memory, blazing-fast, replication is extremely simple, etc.
That's the way socket.io recommends passing events between nodes.
You can find more information/code here.
Additionally, if MongoDB can't handle the load at any point, you can use Redis as your session-store as well.
Hope this helps!

node.js server with socket.io handling 50000 simultaneous clients

We are developing a Javascript control which should be constantly connected to a server for receiving animation updates.
We are planning to host this stuff on an Amazon cloud.
The scenario is like this: server connects to activemq queue waiting for updates, for each update it broadcasts it to all connected clients.
Is it even possible to handle such load with node.js + socket.io?
Will a single node.js server be able to handle such load?
How to organize fast transport between different nodes if we will have to use more than one node?
Will single node.js server be able to handle such load?.. How to organize fast transport between different nodes if we will have to use more than one node
You say that you are planning to host on Amazon. So first off, nothing should be scoped for a single server. Amazon machines will simply "disappear", so you have to assume that you are going to use multiple computers.
...handling 50k simultaneous clients
So to start with, 50k connections for a single box is a very big number. Here's a very detailed blog post discussing "getting to 10k" with node.js+socket.io.
Here's a very telling quote:
it seemed as though 10,000 clients simply required more serialization
than my server was able to handle.
So a key component to "getting to 50k" is going to be the amount of work required just pushing data over the wire.
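One generic way to reduce that work when broadcasting the same update to everyone is to serialize the payload once and reuse the resulting string. The sketch below only illustrates the idea; it makes no claim about how socket.io serializes internally, and `broadcastNaive`, `broadcastSerializedOnce` and `makeClients` are hypothetical helpers.

```javascript
// Naive broadcast: the same payload is serialized once per client.
function broadcastNaive(clients, payload) {
  let serializations = 0;
  for (const client of clients) {
    client.send(JSON.stringify(payload));
    serializations += 1;
  }
  return serializations;
}

// Serialize once, then reuse the same string for every client.
function broadcastSerializedOnce(clients, payload) {
  const frame = JSON.stringify(payload);
  for (const client of clients) {
    client.send(frame);
  }
  return 1;
}

// Hypothetical clients that just remember the last frame they were sent.
function makeClients(count) {
  return Array.from({ length: count }, () => ({
    last: null,
    send(frame) {
      this.last = frame;
    },
  }));
}

const clients = makeClients(10000);
const update = { frame: 42, sprites: [{ id: 1, x: 10, y: 20 }] };

const naiveCount = broadcastNaive(clients, update);
const onceCount = broadcastSerializedOnce(clients, update);
console.log(naiveCount); // 10000
console.log(onceCount);  // 1
```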
How to organize fast transport between different nodes if we will have to use more than one node.
That blog post is the first of 3. When you're done with the first, read the other two. That should point you in the right direction.

How are Node.js+Socket.io+MongoDB webapps truly asynchronous?

I have a good old-style LAMP webapp. A week ago I needed to add a push notification mechanism to it.
Therefore, what I did was to add node.js+socket.io on the server and poll the MySQL database every 10 seconds using node.js to check whether there were new items: if so, I would have sent them to the client(s) with socket.io.
I was pretty happy with the result, even if that is not a proper realtime notification (as there is a lag of up to 10 secs).
Now, I am about to build a new webapp which will need push notifications, too. I am wondering whether to go with the same approach as the first one (that I believe is more stable and mature) or to go totally Node.js, without PHP and Apache. As for the database, I have already decided to go for MongoDB.
Finally, my question is: if I go for Node.js+Socket.io+MongoDB will I get a truly near-real-time webapp? I mean, as soon as a new record is inserted into MongoDB, will there be some sort of event triggered that I can catch via node.js, do some checking on it and, if relevant, send the notification to the client? Or will there be anyway some sort of polling on the db server-side and lag, as with my first LAMP webapp?
A related question: can you build a realtime webapp on MySQL without doing any polling as I did with my first app. Or do you need MongoDB (or Redis)?
I hope this question is not too silly - sorry, I am just starting with Node.js and co.
Thanks.
I understand your problem because I switched to node.js from php/apache/mysql too.
Generally node.js is stable; modules and your own scripts are the main sources of errors.
Real-time has nothing to do with the database; it's all about client and server. You can query as much data as you want in your requests and push it to the other client.
Choosing node.js is very wise, but it's harder to implement.
When you insert a new record into your db, the event is the request itself: you push the event alongside the database query, something like:
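// Please note this is not real code, just an example of the idea.
// It assumes an Express-style `app`, a MySQL-style `db` whose callback is
// (err, rows), and `socket`, dan's already-connected Socket.IO socket.
app.get('/query', function (request, response) {
  // Query your database
  db.query('SELECT * FROM users', function (err, rows) {
    if (err) {
      response.statusCode = 500;
      return response.end('error');
    }
    // Push notification to dan
    socket.emit('database_query_executed', 'to_dan', rows);
    // End request
    response.end('success');
  });
});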
Of course you can use MySQL! And any database you want, as I said real-time has nothing to do with databases because the database is in the middle of the process and it's totally optional.
If you want to use node.js for push notifications and php/apache for mysql, then you will need to make two requests, one to each server, something like:
// this is javascript
ajax('http://node.yoursite.com/push', node_options)
ajax('http://php.yoursite.com/mysql_query', php_options)
or if you want just one request, or you want to use a form, you can call your php and inside php you can create an http or net request to node.js from php, something like:
// this is php
new HttpRequest('http://node.yoursite.com/push', HttpRequest::METH_GET);
Using:
A regular MongoDB Collection as the Store,
A MongoDB Capped Collection with Tailable Cursors as the Queue,
A Node worker with Socket.IO watching the Queue as the Worker,
A Node server to serve the page with the Socket.IO client, and to receive POSTed data (or however else the data gets added) as the Server
It goes like:
The new data gets sent to the Server,
The Server puts the data in the Store,
The Server adds the data's ObjectID to the Queue,
The Queue will send the newly arrived ObjectID to the open Tailable Cursor on the Worker,
The Worker goes and gets the actual data in the ObjectID from the Store,
The Worker emits the data through the socket,
The client receives the data from the socket.
This is 'push' from the initial addition of the data all the way to receipt at the client - no polling, so as real-time as you can get given the processing time at each step.
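The steps above can be sketched with in-memory stand-ins. None of this is MongoDB or Socket.IO code: a `Map` plays the Store, one `EventEmitter` plays the capped collection with its open tailable cursor, another plays the socket, and `receiveData` and the numeric IDs are hypothetical.

```javascript
const { EventEmitter } = require('events');

// Stand-ins (assumptions): a Map for the regular collection and an
// EventEmitter for the capped collection with an open tailable cursor.
const store = new Map();
const queue = new EventEmitter();
const socket = new EventEmitter(); // stands in for the Socket.IO connection

let nextId = 1;

// Server: save the data in the Store, then add its ID to the Queue.
function receiveData(data) {
  const id = nextId++;
  store.set(id, data);
  queue.emit('id', id);
  return id;
}

// Worker: gets each new ID from the Queue, loads the data from the Store,
// and emits it through the socket.
queue.on('id', (id) => {
  const data = store.get(id);
  socket.emit('data', data);
});

// Client: receives the data from the socket.
const receivedByClient = [];
socket.on('data', (data) => receivedByClient.push(data));

receiveData({ text: 'hello' });
receiveData({ text: 'world' });
console.log(receivedByClient.length); // 2
```

Passing only the ID through the queue keeps queue entries small; the worker fetches the full document from the store when it needs it.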
Re: triggers in MongoDB - please see this answer: https://stackoverflow.com/a/12405093/1651408
There are much more convenient triggers in MySQL, but to call Node.js from them would require a bit of work with MySQL UDFs (user-defined functions), for instance pushing data through a Unix socket. Please note that this is necessary only when other applications (besides your Node.js process) are updating the database, and be sure to choose InnoDB as storage in this case (row- vs. table-level locking).
I can see no big problem with your technology choice of Socket.IO: even if client-side WebSockets aren't supported, you'll fall back (gracefully, I hope) to polling.
Finally, your question is not silly at all, since push technology is definitely superior to the flood of polling requests - it scales better. EDIT: However, I would not describe either technology as real-time.
Another EDIT: for a quite well-known and successful setup of this kind please read this: http://blog.fogcreek.com/the-trello-tech-stack/
Have you discovered Chloe? It works separately from your web server and interfaces with it by using HTTP POSTs. That way you can code your web app any which way you want.
Actually, using push technology like Socket.IO helps you use the server's resources efficiently, and it also lets you support both old and modern browsers through WebSocket or WebSocket-like connections.
Polling every 10 seconds means an HTTP request each time, which is expensive, especially when a lot of users are present.
Unlike polling, push technology is relatively cheap. The user's client opens a dedicated socket (i.e. a WebSocket) to listen for the server's push notifications.
Usually your client-side JavaScript performs some action when a push notification is received.
Using your LAMP stack and Socket.IO on a different port (other than 80) will be good enough to implement what you need.
But using Node.js + MongoDB + Socket.IO actually helps you manage your server's resources much more efficiently, because all three are non-blocking by nature.
If you understand the non-blocking concept correctly and implement your app appropriately, an equivalent app (same features, but a different language and database) would be able to handle a lot more requests than a typical LAMP stack.
[Figure: a well-known chart comparing non-blocking and thread-based concurrency handling, Apache (threads) vs. Nginx (non-blocking).]
MySQL is a great database. I believe you won't need joins and transactions for realtime notifications.
MongoDB does not have those two features unless you implement similar features yourself.
Because it lacks those two and has some characteristics of its own, MongoDB can store and fetch data much faster than traditional SQL databases.
Switching from MySQL to MongoDB will decrease the time taken to insert and fetch data.
With JS you can open a socket to your server (not in old browsers). The server will run an ad-hoc program (on an ad-hoc port, so you need permission to open the port and run a program on your server) that sends data in (almost) real time to and from the client, without the HTTP protocol's overhead. Old browsers will just fall back to a polling mechanism.
I can't see another way to do this (there are probably already "cooked" frameworks that do this).

Resources