node.js server with socket.io handling 50000 simultaneous clients

We are developing a JavaScript control that needs to stay constantly connected to a server to receive animation updates.
We are planning to host this on the Amazon cloud.
The scenario is this: the server subscribes to an ActiveMQ queue and waits for updates; for each update, it broadcasts the payload to all connected clients.
Is it even possible to handle such load with node.js + socket.io?
Will a single node.js server be able to handle such load?
How should we organize fast transport between nodes if we end up needing more than one?
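To make the setup concrete, here is a minimal sketch of the pipeline we have in mind (assuming the stompit STOMP client for ActiveMQ and a standalone socket.io v4 server; the queue and event names are placeholders):

    const stompit = require('stompit');
    const { Server } = require('socket.io');

    // socket.io server that the browser controls connect to
    const io = new Server(3000);

    // subscribe to the ActiveMQ queue over STOMP (default port 61613)
    stompit.connect({ host: 'activemq-host', port: 61613 }, (err, client) => {
      if (err) throw err;
      client.subscribe({ destination: '/queue/animation-updates' }, (err, message) => {
        if (err) throw err;
        message.readString('utf-8', (err, body) => {
          if (err) throw err;
          // fan the update out to every connected client
          io.emit('animation-update', body);
        });
      });
    });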

Will a single node.js server be able to handle such load? How to organize fast transport between different nodes if we will have to use more than one node?
You say that you are planning to host on Amazon. So first off, nothing should be scoped for a single server. Amazon machines will simply "disappear"; you have to assume that you are going to use multiple machines.
...handling 50k simultaneous clients
So to start with, 50k connections for a single box is a very big number. Here's a very detailed blog post discussing "getting to 10k" with node.js+socket.io.
Here's a very telling quote:
it seemed as though 10,000 clients simply required more serialization than my server was able to handle.
So a key component to "getting to 50k" is going to be the amount of work required just pushing data over the wire.
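One general technique that helps on that front (my illustration, not from the blog post): serialize each update once and reuse the same payload for every connection, rather than re-serializing per client:

    // serialize once per update, not once per client
    const payload = JSON.stringify(update);
    for (const socket of connectedSockets) {
      socket.send(payload); // every client receives the same pre-built string
    }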
How to organize fast transport between different nodes if we will have to use more than one node.
That blog post is the first of three. When you're done with the first, read the other two. That should point you in the right direction.

Related

Working with WebSockets and NodeJs clusters

I currently have a Node server running that works with MongoDB. It handles some HTTP requests, but it is largely used for WebSockets. Basically, the server connects multiple users to rooms with WebSockets.
My server currently has around 12k WebSockets open, and it's almost crippling my single-threaded server; now I'm not sure how to convert it over.
The server holds HashMap variables for the connected users and rooms, and when a user performs an action, the server often references those HashMaps. So I'm not sure how to use clusters for this. I thought about creating a thread for every WebSocket message, but I'm not sure if this is the right approach, and those threads would not be able to access the HashMaps for the other users.
Does anyone have any ideas on what to do?
Thank you.
You can look at the socket.io-redis adapter for architectural ideas, or you can just decide to use socket.io and the Redis adapter.
It moves the equivalent of your HashMaps into a separate process (the Redis in-memory database) so that all clustered processes can access them.
The socket.io-redis adapter also supports higher-level functions so that you can emit to every socket in a room with one call and the adapter finds where everyone in the room is connected, contacts that specific cluster server, and has it send the message to them.
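For illustration, a minimal setup might look like this (assuming the socket.io-redis package and a local Redis; exact APIs vary across socket.io versions, and the room and event names are placeholders):

    const io = require('socket.io')(3000);
    const redisAdapter = require('socket.io-redis');

    // all socket.io nodes sharing this Redis instance form one logical server
    io.adapter(redisAdapter({ host: 'localhost', port: 6379 }));

    // this reaches every socket in the room, on whichever node it is connected
    io.to('room42').emit('room-event', { some: 'payload' });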
I thought maybe creating a thread for every WebSocket message, but I'm not sure if this is the right approach, and it would not be able to access the HashMaps for the other users
Threads in node.js are not lightweight things (each has its own V8 instance), so you will not want a node.js thread for every WebSocket connection. You could group a certain number of WebSocket connections per worker, but at that point it is likely easier to use clustering, because node.js will handle the distribution across the cluster for you automatically, whereas you would have to do that yourself for your own worker pool.
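If you do go the clustering route, a sketch with Node's built-in cluster module looks roughly like this (./websocket-server is a placeholder module; note that WebSockets also need sticky sessions so a client keeps hitting the same worker):

    const cluster = require('cluster');
    const os = require('os');

    if (cluster.isMaster) {
      // fork one worker per CPU core
      for (let i = 0; i < os.cpus().length; i++) cluster.fork();
      cluster.on('exit', () => cluster.fork()); // replace crashed workers
    } else {
      // each worker runs its own socket.io server; shared state lives in Redis
      require('./websocket-server');
    }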

How does Server keep track of all Client(s) connected in Real time data pushing scenario?

I kind of understand that WebSocket is the protocol used for real-time data flowing back and forth.
My question may be very premature, but I couldn't find much help on the web.
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
If this is some sort of looping that happens on the server side, where all connected clients' details are cached and then an update is sent out to all of them, isn't it an overhead?
This Stack Overflow answer made some sense but didn't clear up my doubt.
How does Server keep track of all Client(s) connected in Real time data pushing scenario?
It doesn't... it only keeps track of the clients it's serving specifically.
This answer is not node.js specific.
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
To actually understand this a little better, we should consider larger numbers, e.g., let's assume 1 million clients connected to a service.
Obviously, a sane design will require redundancy, so no single service will hold all 1 million connections (and if a single server instance fails, clients can re-connect to a different server instance).
In this case, there's no single server that is aware of all clients.
It makes more sense for each server to manage its own internal subscription/client list. Each server will also act as a pub/sub client for a centralized pub/sub service (such as a Redis cluster or whatever).
Assuming 1,000 server instances, each serving 1,000 clients, we would find that the pub/sub service is aware of only 1,000 "clients" (the server instances). Each server is unaware of the other clients; it's only aware of the 1,000 clients it's managing.
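As a sketch, each server instance's relay could look like this (assuming the node-redis v4 client; the channel, event names and Redis address are made up):

    const { createClient } = require('redis');
    const { Server } = require('socket.io');

    const io = new Server(3000);

    (async () => {
      // each server instance is one pub/sub "client" of the central service
      const subscriber = createClient({ url: 'redis://pubsub-host:6379' });
      await subscriber.connect();

      // relay every published update to this server's own connected clients
      await subscriber.subscribe('stock-prices', (message) => {
        io.emit('price-update', JSON.parse(message));
      });
    })();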
If this is some sort of looping that happens on the server side, where all connected clients' details are cached and then an update is sent out to all of them, isn't it an overhead?
The algorithm itself is implementation specific, but in general, each server will incur some overhead in order to manage the pub/sub layer.
However, since each server only manages a small subset of the total client count, the overhead is distributed across a number of systems.
Channel Oriented vs. Connection Oriented Design
I should probably note that the pub/sub design isn't connection oriented.
The server isn't (or shouldn't be) looping over all the connections, asking each one, "are you subscribed to this channel?"
Rather, pub/sub design assumes a "channel" oriented design, where it locates the channel object(s) and loops over a client list.
On one hand, this approach might (or might not) consume more memory. Since each "channel" should contain a list of clients listening to that channel, a single client object might belong to more than a single list.
On the other hand, the loop has less code branches and experiences less overhead than a connection oriented design. Also, this approach allows for pub/sub clients that aren't connection bound (such as internal hooks / callbacks).
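In code, the channel-oriented design boils down to a map from channel name to subscriber list, roughly like this (a generic sketch, not any particular library's internals):

    // channel name -> set of subscriber sockets (or callbacks)
    const channels = new Map();

    function subscribe(channel, socket) {
      if (!channels.has(channel)) channels.set(channel, new Set());
      channels.get(channel).add(socket); // one socket may appear in many channels
    }

    function publish(channel, payload) {
      const subscribers = channels.get(channel);
      if (!subscribers) return;
      // loop only over this channel's subscribers, never over all connections
      for (const socket of subscribers) socket.send(payload);
    }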
Say 1000 clients are connected to a server which sends out real-time stock prices. When there is an update on the server front, how will server know all the 1000 clients to which it needs to send an update?
Socket.io already keeps track by itself, and it's pretty easy to emit to all connected clients.
Socket.io - Emit Cheatsheet
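The most common patterns from that cheatsheet look like this (event and room names here are made up):

    io.emit('price-update', data);               // every client connected to this node
    io.to('AAPL').emit('price-update', data);    // only clients in the 'AAPL' room
    socket.broadcast.emit('price-update', data); // everyone except this socket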
If you are worried about what would happen when your user-base grows, you can scale your service to multiple nodes.
If you actually end up scaling and have more than one server node, then you can use socketio-redis.
Adapter to enable broadcasting of events to multiple separate socket.io server nodes.

What is the best way to communicate between two servers?

I am building a web app which has two parts. One part uses a real-time connection between the server and the client, and the other part does some CPU-intensive work to provide relevant data.
I am implementing the real-time communication in node.js and the CPU-intensive part in Python/Java. What is the best way for the node.js server to participate in duplex communication with the other server?
For a basic solution you can use Socket.IO, if you are already using it and know how it works. It will get the job done, since it allows communication between a client and a server where the "client" can be a different server in a different language.
If you want a more robust solution with additional options and controls or which can handle higher traffic throughput (though this shouldn't be an issue if you are ultimately just sending it through the relatively slow internet) you can look at something like ØMQ (ZeroMQ). It is a messaging queue which gives you more control and lots of different communications methods beyond just request-response.
Whichever you set up, I would recommend using your CPU-intensive server as the stable end (the server) and your web server(s) as the client. This assumes that you are using a single machine for your CPU-intensive tasks and running several node.js server instances to take advantage of multiple cores for your web tier. It simplifies your communication, since you want a single point to connect to.
If you foresee needing multiple CPU servers you will want to setup a routing server that can route between multiple web servers and multiple CPU servers and in this case I would recommend the extra work of learning ØMQ.
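For reference, the node.js end of a simple REQ/REP pair might look like this (assuming the zeromq npm package v6; the Python/Java CPU server would bind the matching REP socket; the address is a placeholder):

    const zmq = require('zeromq');

    async function requestAnalysis(task) {
      const sock = new zmq.Request();        // REQ end; the CPU server binds REP
      sock.connect('tcp://cpu-server:5555'); // placeholder address
      await sock.send(JSON.stringify(task));
      const [reply] = await sock.receive();  // resolves when the REP end answers
      return JSON.parse(reply.toString());
    }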
You can use the http.request method to make an HTTP request (the equivalent of a curl call) from within Node's code.
The http.request method is also used for implementing authentication APIs.
You can put your callback in the request's success handler, and when you get the response data in Node, send it back to the user.
Meanwhile, in the background, the Java/Python server can handle the CPU-intensive task for Node's request.
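A sketch using the core http module (hostname, port and path are placeholders):

    const http = require('http');

    const req = http.request({
      hostname: 'cpu-server',   // the Java/Python service
      port: 8080,
      path: '/analyze',
      method: 'POST',
      headers: { 'Content-Type': 'application/json' }
    }, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        // forward the computed result back to the user here
        console.log(JSON.parse(body));
      });
    });

    req.on('error', console.error);
    req.end(JSON.stringify({ task: 'heavy-computation' }));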
I maintain a node.js application that intercommunicates among 34 tasks spread across 2 servers.
In your case, for communication between the web server and the app server you might consider mqtt.
I use mqtt for this kind of communication. There are mqtt clients for most languages, including node/javascript, python and java. In my case I publish JSON messages to mqtt 'topics', and any task that has registered to subscribe to a 'topic' receives its data when published. If you google "pub sub", "mqtt" and "mosquitto" you'll find lots of references and examples. Mosquitto (now an Eclipse project) is only one of a number of mqtt brokers that are available. Another very good broker, written in Java, is called HiveMQ.
This is a very simple, reliable solution that scales well. In my case literally millions of messages reliably pass through mqtt every day.
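For illustration, the node.js side of this pattern with the mqtt npm client might look like this (broker address and topic names are placeholders):

    const mqtt = require('mqtt');
    const client = mqtt.connect('mqtt://broker-host:1883'); // e.g. a Mosquitto broker

    client.on('connect', () => {
      client.subscribe('jobs/results'); // listen for answers
      client.publish('jobs/requests', JSON.stringify({ id: 42, op: 'analyze' }));
    });

    client.on('message', (topic, payload) => {
      // every task subscribed to 'jobs/results' receives this
      const data = JSON.parse(payload.toString());
      console.log(topic, data);
    });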
You must be looking for Socket.IO.
Socket.IO enables real-time bidirectional event-based communication.
It works on every platform, browser or device, focusing equally on reliability and speed.
Sockets have traditionally been the solution around which most realtime systems are architected, providing a bi-directional communication channel between a client and a server.

Using Backbone.iobind (socket.io) with a cluster of node.js servers

I'm using Backbone.iobind to bind my client Backbone models over socket.io to the back-end server, which in turn stores it all in MongoDB.
I'm using socket.io so I can synchronize changes back to other clients Backbone models.
The problems start when I try to run the same thing over a cluster of node.js servers.
Setting a session store was easy using connect-mongo which stores the session to MongoDB.
But now I can't inform all the clients on every change, since the clients are distributed between the different node.js servers.
The only solution I found is to set up a pub/sub queue between the different node.js servers (e.g. mubsub), which seems like a very heavyweight solution that will trigger an event on all the servers for every change.
How did you reach the conclusion that pub/sub is a "very heavyweight solution"?
Sounds like you got it right up until that part :-)
Oh, and pub/sub is not a queue.
Let's examine that claim:
The nice thing about pub/sub is that you publish and subscribe to channels/topics.
So, using the classic chat server example, let's say you have a million users connected in total, but #myroom only has 50 users in it.
When a message is sent to #myroom, it's being published once. No duplication whatsoever.
In most use-cases you won't even need to store it on disk/RAM, so we're mostly looking at network/bandwidth here. And, I mean, you're probably throwing more data (probably over the wire?) to MongoDB already, so I assume that's not your bottleneck.
If you also use socket.io's rooms feature (which is basically its own pub/sub mechanism), that means only those 50 users will have that message emitted to them over the websocket.
And no, socket.io won't iterate over 1M clients to find out which of them are in room #myroom ;-)
So the message is published once, each subscriber (node.js instance) will get notified once, and only the relevant clients -- socket.io won't waste CPU cycles in order to find them as it keeps track of them when they join() or leave() a room -- will receive the message.
Doesn't that sound pretty efficient and light-weight?
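In socket.io terms, the bookkeeping happens once at join time, so the emit itself stays cheap (event names here are made up):

    io.on('connection', (socket) => {
      // membership is recorded once, at join time
      socket.join('myroom');
      // (it is removed on socket.leave() or automatically on disconnect)
    });

    // published once; delivered only to the sockets tracked in #myroom
    io.to('myroom').emit('chat-message', { text: 'hello' });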
Give Redis a shot.
It's really simple to set up, runs entirely in memory, is blazing fast, replication is extremely simple, etc.
That's the way socket.io recommends passing events between nodes.
You can find more information/code here.
Additionally, if MongoDB can't handle the load at any point, you can use Redis as your session-store as well.
Hope this helps!

Scalable push application with node.js

I'm thinking about writing a few web applications that have almost the same requirements as a chat. And I would like them to be able to scale easily.
I have worked a bit with node.js and I understand how it can help design push applications but I have some difficulties when thinking about having them run on multiple servers.
Here are some designs I can think of for a large-scale chat app:
1 - Servers have state; they keep the connections open, and clients can have new messages pushed to them. In this scenario, we are limited by the physical memory of one server, so we cannot scale linearly if we have too many users per room.
2 - Servers have no state; they query a distributed database to respond to client requests. In this scenario, clients poll the servers. We could scale linearly, but throughput decreases, messages are not delivered instantly, and polling has been shown to be a bad practice when scaling.
3 - A mix of 1 and 2. Each server keeps its clients' connections open and polls the distributed database. The application is more complex to write, and we still use polling. Similar client requests (from clients in the same room) are just grouped into a single one made by the server. The code becomes unnecessarily complicated, and it does not scale in the situation where we have many rooms and few users per room.
4 - Servers have no state, and the database cluster uses events to notify every registered server about new messages. This is the solution I would like to have, but I haven't heard of any database which has this feature. (Some people are discussing this feature for MongoDB here: https://jira.mongodb.org/browse/SERVER-124)
So why is the 4th solution not used much today?
How do people usually design their applications in this case?
Since you want a push application, you would probably use Socket.IO with RedisStore.
By using this combination, the data for all the connections is kept in Redis (an in-memory database), so you can scale beyond a single process. Another use of Redis here is for pub/sub.
The idea is to trigger an event when something needs to be pushed, then send a message to the browser using Socket.io. If you want to listen to database changes, perhaps it's better to use CouchDB with its _changes feature.
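A sketch of consuming the _changes feed over plain HTTP (the database name is a placeholder, and io is assumed to be a socket.io server as above; libraries like nano wrap this for you):

    const http = require('http');

    const url = 'http://localhost:5984/chat/_changes?feed=continuous&since=now&include_docs=true';

    http.get(url, (res) => {
      res.setEncoding('utf8');
      let buffer = '';
      res.on('data', (chunk) => {
        buffer += chunk;
        let nl;
        // the continuous feed emits one JSON object per line
        while ((nl = buffer.indexOf('\n')) >= 0) {
          const line = buffer.slice(0, nl).trim();
          buffer = buffer.slice(nl + 1);
          if (line) {
            const change = JSON.parse(line);
            io.emit('doc-changed', change.doc); // push to connected clients
          }
        }
      });
    });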
Resources:
https://github.com/dshaw/talks/tree/master/2011-10-jsclub/sample-app
http://www.ranu.com.ar/2011/11/redisstore-and-rooms-with-socketio.html
How to reuse redis connection in socket.io?
Instead of triggers for case 4, you might want to hook into MongoDB replication.
Let's assume you have a replica set (you wouldn't run a single mongod, would you?).
Every change is written to the oplog on the primary and then is replicated to secondaries.
You can efficiently pull new updates from the oplog, using tailable cursors. Note, this is still pull, not push.
Then your node.js will push these events to the clients.
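A sketch of that oplog tailing with the MongoDB Node driver (the namespace and event names are placeholders; on recent MongoDB versions, change streams are the supported way to get the same effect):

    const { MongoClient } = require('mongodb');

    async function tailOplog(io) {
      const client = await MongoClient.connect('mongodb://localhost:27017');
      const oplog = client.db('local').collection('oplog.rs');

      // tailable + awaitData keeps the cursor open, waiting for new entries
      const cursor = oplog.find(
        { ns: 'chat.messages' },             // only ops on this collection
        { tailable: true, awaitData: true }
      );

      for await (const entry of cursor) {
        io.emit('new-message', entry.o);     // entry.o holds the changed document
      }
    }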
