Please validate my approach on Redis, Socket.io in node.js - node.js

I am new to node.js, so I need your help validating the approach below.
Problem: I am developing a node.js application that broadcasts messages to people who are subscribed to a topic. If a user is logged into the application, either via web or mobile, I want to use socket.io to push new messages as they are created. As mentioned, I need to push messages only to a selected list of logged-in users based on certain filters; the message is not pushed to everyone logged in, only to the users matching the filter criteria. This is an Express application.
Approach: As soon as a client makes a connection to the server, a socket is established. The socket is allocated to a room whose key is the login name, so if there are further connections from the same login (e.g. multiple browser windows), those sockets are added to the same room. The login name and the room with its sockets are stored in Redis. When a new message is created, internal application logic determines the users who need to be notified. Fetch only those logins from Redis along with the room information and push the message to them. When the sockets are closed, remove the Redis entry for that login. Also, this needs to be scalable since I might use node cluster in the future.
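Here is a rough sketch of what I mean (untested, just to illustrate; it assumes the login name arrives as a handshake query parameter, and the Redis key names are placeholders I made up):

var io = require('socket.io').listen(server);       // server = your http server
var redisClient = require('redis').createClient();

io.sockets.on('connection', function (socket) {
    var login = socket.handshake.query.login; // login name sent by the client on connect
    socket.join(login);                       // one room per login; extra windows join the same room
    redisClient.sadd('sockets:' + login, socket.id);

    socket.on('disconnect', function () {
        redisClient.srem('sockets:' + login, socket.id, function () {
            redisClient.scard('sockets:' + login, function (err, count) {
                if (count === 0) redisClient.del('sockets:' + login); // last socket gone, drop the entry
            });
        });
    });
});

// later, when a new message is created and the filter logic has run:
usersToNotify.forEach(function (login) {
    io.sockets.in(login).emit('new-message', message);
});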
I have read a lot about the socket.io and Redis pub/sub approach, and I am not using it above; I am just storing the logins and sockets as key-value pairs.
Can you please let me know if this is a good approach? Will there be any performance/scalability issues? Is there a better way to do this?
Thanks a lot for all your help....

Your Redis model will have to be a little more complicated than that. You'll need to maintain an index using sets so you can take intersections, which can be used to find all users in a given room. You'll then need to use Redis's pub/sub functionality to enable realtime notifications. You'll also need to store messages in indexed sets, then publish to inform your application that a change has been made, thereby sending the new data from the set.
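For example (the key names here are invented, not a prescription), the set-based index could look like this with node_redis:

var redis = require('redis');
var client = redis.createClient();

// index membership with sets
client.sadd('room:generalChat', 'user:john'); // john is in the room
client.sadd('online', 'user:john');           // john is currently connected

// everyone in generalChat who is online = intersection of the two sets
client.sinterstore('online:generalChat', 'room:generalChat', 'online', function (err) {
    client.smembers('online:generalChat', function (err, users) {
        // users -> the logins to push the new message to
    });
});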
If you could provide an example I can provide some redis commands to better explain how Redis works.
Update
This is in response to comments below.
Technologies I would use:
Nginx
Socket.io
mranney/node_redis
Redis
Scaling Redis
There are several solutions for scaling Redis. If you need higher concurrency you can scale using master-slave replication. If you need more memory you can set up partitioning, or you can use the Redis Cluster beta (3.0.0). Alternatively you can outsource your solution to one of the many Redis services (RedisGreen, RedisLabs, etc.), however this is best paired with a PaaS provider (AWS, Google Compute, Joyent) so it can be deployed in the same cloud.
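For example, master-slave replication is a one-line directive on the slave (host and port here are placeholders):

# in the slave's redis.conf, or at runtime via redis-cli
slaveof 192.168.1.10 6379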
Scaling Socket.io
Socket.io can be scaled behind Nginx; this is pretty common practice when scaling WebSockets. You can then synchronize each node app (with socket.io) using Redis as the messaging layer (pub/sub).
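A minimal sketch of the Nginx side (the upstream name and ports are made up; ip_hash keeps a client on the same node so socket.io handshakes don't bounce between servers):

upstream socket_nodes {
    ip_hash;                  # sticky sessions
    server 127.0.0.1:3001;
    server 127.0.0.1:3002;
}

server {
    listen 80;

    location / {
        proxy_pass http://socket_nodes;
        proxy_http_version 1.1;                     # required for WebSocket upgrades
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}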
You can SUBSCRIBE to a connections channel to track when a user joins or leaves. Whichever app/server fires the event will PUBLISH connections update or PUBLISH connections "user:john left". If a user leaves, as in the latter example, you must also remember to remove that user from the set that represents the room (e.g. generalChat), so something like SREM generalChat user:john, and then execute the PUBLISH in the callback of the SREM command. Once the PUBLISH is sent, all apps/servers connected to Redis that have subscribed will receive the message in realtime, notifying them to update. All apps/servers then broadcast to the corresponding room either a new user list (a Redis set) or a command telling the frontend to remove the user.
Basically all your sockets are in sync with Redis, so you can host multiple socket.io servers and use pub/sub messaging to queue actions across your entire cloud.
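Roughly, each app/server would wire that up like this (channel and key names are illustrative; note that a subscribed Redis connection can't issue other commands, hence two clients):

var redis = require('redis');
var pub = redis.createClient(); // ordinary commands + PUBLISH
var sub = redis.createClient(); // dedicated to SUBSCRIBE

sub.subscribe('connections');
sub.on('message', function (channel, message) {
    // e.g. message === 'user:john left' -> tell every client in the room
    io.sockets.in('generalChat').emit('user-left', message); // io = your socket.io instance
});

// when a user leaves on *this* server:
pub.srem('generalChat', 'user:john', function (err) {
    pub.publish('connections', 'user:john left'); // executed in the SREM callback, as above
});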
Examples
It's not hard to scale socket.io with Redis. Redis may be cumbersome to set up and scale, but it doesn't use that much memory, because you manage your own relations and therefore only map the relations you actually need. Also, you can lease cloud hosting with 8GB for $80 a month, which would support higher concurrency than the Big Boy plan from Pusher for less than half the price, and you get persistence as well, so your stack is more uniform and has fewer dependencies.
If you were to use Pusher you'd probably also need a persistent storage medium such as MongoDB, MySQL, Postgres, etc., which creates more traffic depending on your implementation. With Redis you can rely on it for all your data storage (excluding file storage).
Ex 1
You can use pusher to notify changes and refer to the backend to populate the new/changed data.
Pusher for Messaging
Boiler Plate:
Client <== Socket Opened ==> Pusher
Client === User Left ==> Pusher
All Clients <== User left === Pusher
All Clients === Request New Data ==> Backend <==> Database
All Clients <== Response === Backend
This can create a lot of problems, and you'd have to implement timeouts. This also takes a lot of Pusher connections, which is expensive.
Ex 2
You can connect your backend to Pusher to save the frontend from handling many requests (probably better for mobile users). This saves Pusher traffic, because it's not sending to hundreds/thousands of clients, just to a handful of your backend servers.
This example assumes that you have 4 socket.io servers running.
Pusher for MQ on Backend
Boiler Plate:
Backend 1/2/3/4 <== Socket Opened ==> Pusher
Backend 1 === Remove User from room ==> Database
Backend 1 === User Leaves ==> Pusher
Backend 1/2/3/4 <== User Left === Pusher
Backend 1/2/3/4 === Get Data ==> Database
Backend 1/2/3/4 <== Receive Data === Database
Backend 1/2/3/4 === New Data ==> Room(clients)
Ex 3
You can use Redis as explained above.
Again assuming 4 socket.io servers.
Redis as MQ and datastore
Boiler Plate:
Backend 1/2/3/4 <== Connected ==> Redis
Backend 1/2/3/4 === Subscribe ==> Redis
Backend 1 === User Left ==> Redis (removes user)
Backend 1 === PUBLISH queue that user left ==> Redis
Backend 1/2/3/4 <== User Left Message === Redis
Backend 1/2/3/4 === Get New Data ==> Redis
Backend 1/2/3/4 <== New Data === Redis
Backend 1/2/3/4 === New Data ==> Room(clients)
All of these examples can be improved and optimized significantly, but I won't do that for the sake of readability and clarity.
Conclusion
If you know how Redis works, implementing this should be fairly straightforward. If you're still learning Redis, you should start out a little smaller to get the hang of how it works (it's more than key:value storage). In the end, running Redis would be more cost-effective and efficient, but would take longer to develop. Pusher would be much more expensive, would add more dependencies to your stack, and wouldn't be as effective (Pusher runs on a different cloud). The only advantage of using Pusher, or any similar service, is the ease of use of the platform they provide. You're essentially paying a monthly fee for boilerplate code and stack management.
Bottom Line
It would be best to reverse proxy with Nginx regardless of which stack you choose, so you can easily scale.
A Redis, Socket.io, Node.js stack would be best for large-scale projects and professional products. It will keep your operating cost down and increase your concurrency without dramatically increasing your cost as you scale.
A Redis, Socket.io (optional), Node.js, Pusher, Database stack would be best for smaller projects that you don't expect much growth out of. Once you get to 5,000 connections you're forking out $199/mo just for Pusher, and then you have to consider the cost of the rest of your stack. If you connect your backend to Pusher instead you'll save money and speed up development, but you'll still suffer performance hits from retrieving data from a third-party cloud.

Related

How to build scalable realtime chat messaging with WebSockets?

I'm trying to build a realtime (private) chat between users of a video game with 25K+ concurrent connections. We currently run 32 nodes where users can connect through a load balancer. The problem I'm trying to solve is how to route messages to each user?
Currently, we are using socket.io & socket.io-redis, where each websocket joins a room named after its user ID, and we emit each message a user should receive to that room. The problem with this design is that we are reaching the limits of Redis pub/sub and of socket.io, which doesn't scale well here (socket.io emits messages to all nodes, each of which checks whether the user is connected; this is not viable).
Our current stack is composed of Postgres, Redis & RabbitMQ. I have been thinking about this problem a lot and have come up with 3 different solutions:
Route all messages through RabbitMQ. When a user connects, we create a fanout exchange named after the user ID and one queue per websocket connection (we have to handle multiple connections per user). When we want to emit to that user, we simply publish to that exchange. The problem with this approach is that we have to create a lot of queues, and I have heard that this may not be very efficient.
Create a queue for each node in RabbitMQ. When a user connects, we save the node & socket ID in a Redis set, so when we have to send a message to that specific user, we first get the list of nodes and emit to each node's queue, which then handles routing to the specific client in the app. The problem with this approach is that in case of a node failure we may record that a user is connected when that is not the case. To fix that we would need to expire the user's Redis entry, but this is not a perfect fix. Also, if we later want to implement group chat, it would mean sending duplicate messages to Rabbit, which is not ideal.
Go all-in with Firebase Cloud Messaging. We have a mobile app, and we plan to use FCM for push notifications when the user isn't connected, but would it be a good fit even when the user is connected?
What do you think is the best fit for our use case? Do you have any other idea?
I found a better solution: create a binding for each user, but use only one queue on each node; we then route each message to the right user.
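Roughly, with amqplib, the consuming side looks like this (the exchange/queue names and the NODE_ID variable are just placeholders I chose):

var amqp = require('amqplib');

amqp.connect('amqp://localhost').then(function (conn) {
    return conn.createChannel();
}).then(function (ch) {
    var nodeQueue = 'chat.node.' + process.env.NODE_ID; // one queue per node
    return ch.assertExchange('chat.users', 'direct', { durable: false })
        .then(function () { return ch.assertQueue(nodeQueue, { exclusive: true }); })
        .then(function () {
            // on websocket connect on this node:   ch.bindQueue(nodeQueue, 'chat.users', userId);
            // on websocket disconnect:             ch.unbindQueue(nodeQueue, 'chat.users', userId);
            return ch.consume(nodeQueue, function (msg) {
                var userId = msg.fields.routingKey;
                // look up this user's local sockets and emit the message to them
                ch.ack(msg);
            });
        });
});

// sending to a user from any node:
//   ch.publish('chat.users', userId, new Buffer(JSON.stringify(payload)));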

Is a node.js app that both serves a REST API and handles web sockets a good idea?

Disclaimer: I'm new to node.js so I am sorry if this is a weird question :)
I have a node.js app using express.js to serve a REST API. The data served by the REST API is fetched from a NoSQL database by the node.js app. All clients only use HTTP GET. There is one exception though: data is PUT and DELETEd from the master database (a relational database on another server).
The thought behind this setup is of course to let the 'node.js/NoSQL database' server(s) be a public front end, thereby protecting the master database from heavy traffic.
Potentially a number of different client applications will use the REST API, but mainly it will be used by a client app with a long lifetime (typically 0.5 to 2 hours). Instead of letting this app constantly poll the REST API for possible new data, I want to use websockets so that data is only sent to the client when there is new data. I will use a node.js app for this, and probably socket.io so that it can fall back to API polling if websockets are not supported by the client. New data should be sent to clients each time the master database PUTs or DELETEs objects in the NoSQL database.
The question is whether I should use one node.js app for both the API and the websockets, or one for the API and one for the websockets.
Things to consider:
- Performance: The app(s) will be hosted on a cluster of servers with a load balancer and an HTTP accelerator in front. Would one app handling everything perform better than two apps with distinct tasks?
- Traffic between apps: If I choose a two-app solution, the API app that receives PUTs and DELETEs from the master database will have to notify the websocket app every time it receives new data (or the master database will have to notify both apps). Could the doubled traffic be a performance issue?
- Code cleanliness: I believe two apps will result in cleaner and better code, but then again there will surely be some code common to both apps, which will lead to having two copies of it.
As for how heavy the load can be, it is very difficult to say, but a possible peak could involve:
50000 clients
each listening to up to 5 different channels
new data being sent from the master every 5 seconds
new data being sent to approximately 25% of the clients (some data should be sent to all clients, other data probably to below 1% of the clients)
UPDATE:
Thanks for the answers, guys. More food for thought here. I have decided to have two node.js apps, one for the REST API and one for web sockets. The reason is that I believe it will be easier to scale them. To begin with, the whole system will be hosted on three physical servers, and one node.js app for the REST API on each server should be sufficient, but the websocket app will probably need several instances on each physical server.
This is a very good question.
If you are looking at a legacy system and you already have a REST interface defined, there is not a lot of advantage to adding WebSockets. Things that may point you toward WebSockets would be:
a demand for server-to-client or client-to-client real-time data
a need to integrate with server-components using a classic bi-directional protocol (e.g. you want to write an FTP or sendmail client in javascript).
If you are starting a new project, I would try to have a hard split in the project between:
the serving of static content (images, js, css) using HTTP (that was what it was designed for) and
the serving of dynamic content (real-time data) using WebSockets (load-balanced, subscription/messaging based, automatic reconnect enabled to handle network blips).
So, why should we try to have a hard separation? Let's consider the advantages of a HTTP-based REST protocol.
The use of the HTTP protocol for REST semantics is an invention that has certain advantages:
Stateless Interactions: none of the client's context is to be stored on the server side between the requests.
Cacheable: Clients can cache the responses.
Layered System: undetectability of intermediaries
Easy testing: it's easy to use curl to test an HTTP-based protocol
On the other hand...
The use of a messaging protocol (e.g. AMQP, JMS/STOMP) on top of WebSockets does not preclude any of these advantages.
WebSockets can be transparently load-balanced, messages and state can be cached, efficient stateful or stateless interactions can be defined.
A basic reactive analysis style can define which events trigger which messages between the client and the server.
Key additional advantages are:
a WebSocket is intended to be a long-term persistent connection, usable for multiple different messaging purposes over a single connection
a WebSocket connection allows for full bi-directional communication, allowing data to be sent in either direction in sympathy with network characteristics.
one can use connection offloading to share subscriptions to common topics using intermediaries. This means with very few connections to a core message broker, you can serve millions of connected users efficiently at scale.
monitoring and testing can be implemented with an admin interface to send/receive messages (provided with all message brokers).
the cost of all this is that one needs to deal with re-establishing state when the WebSocket reconnects after being dropped. Many protocol designers build in the notion of a "sync" message to provide context from the server to the client, as in the sketch below.
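For instance, a reconnect handshake could look like this (the event names and the fetchMessagesSince helper are hypothetical, just to show the shape of the idea):

// client: after the socket reconnects, ask for everything missed
socket.on('reconnect', function () {
    socket.emit('sync', { lastSeq: lastSeenSeq }); // last sequence number the client saw
});

// server: replay anything newer than the client's last sequence number
socket.on('sync', function (req) {
    fetchMessagesSince(req.lastSeq, function (err, missed) { // hypothetical helper
        socket.emit('sync-response', { messages: missed });
    });
});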
Either way, your model object could be the same whether you use REST or WebSockets, but that might mean you are still thinking too much in terms of request-response rather than publish/subscribe.
The first thing you must think about is how you're going to scale the servers and manage their state. With a REST API this is largely straightforward: REST APIs are for the most part stateless, and every load balancer knows how to proxy HTTP requests, so they can be scaled horizontally, leaving the few bits of state to the persistence layer (database). With websockets it's often a different matter. You need to research which load balancer you're going to use (in a cloud deployment this often depends on the provider), then figure out what kind of websocket support or configuration that load balancer needs. Then, depending on your application, you need to figure out how to manage the state of your websocket connections across the cluster. Think about the different use cases: if a websocket event on one server alters the state of the data, will you need to propagate this change to a different user on a different connection? If the answer is yes, then you'll probably need something like Redis to manage your ws connections and communicate changes between the servers.
As for performance, at the end of the day it's still just HTTP connections, so I doubt there will be a big difference in separating the server functionality. However, I think two servers would go a long way toward improving code cleanliness, as long as you have another 'core' module to isolate code common to both servers.
Personally I would do them together, because you can share the models and most of the code between the REST and the WS endpoints.
At the end of the day what Yuri said in his answer is correct, but it is not that much work to load-balance WS anyway; everyone does it nowadays. The approach I took is to have REST for everything and then create some WS "endpoints" for subscribing to realtime data server-to-client.
From what I understood, your client would just get notifications from the server with updates, so I would definitely go with WS. You subscribe to some events and then you get new results when there are any. Polling with HTTP calls is not the best way.
We had this need and basically built a small framework around this idea http://devalien.github.io/Axolot/
Basically you can see our approach in the controller below (this is just an example; in our real-world app we have subscriptions so we can notify when we have new data or when we finish a procedure). Under actions are the REST endpoints, and under sockets the websocket endpoints.
module.exports = {
    model: 'user', // We are attaching the user model, so CRUD operations are there (good for dev purposes)
    path: '/user', // This is the endpoint
    actions: {
        'get /': [
            function (req, res) {
                var query = {};
                Model.user.find(query).then(function (user) { // Find from the user model declared above
                    res.send(user);
                }).catch(function (err) {
                    res.send(400, err);
                });
            }
        ]
    },
    sockets: {
        getSingle: function (userId, cb) { // Callable from socket.io as "user:getSingle"
            Model.user.findOne(userId).then(function (user) {
                cb(user);
            }).catch(function (err) {
                cb({ error: err });
            });
        }
    }
};

Using Backbone.iobind (socket.io) with a cluster of node.js servers

I'm using Backbone.iobind to bind my client Backbone models over socket.io to the back-end server which in turn store it all to MongoDB.
I'm using socket.io so I can synchronize changes back to other clients Backbone models.
The problems start when I try to run the same thing over a cluster of node.js servers.
Setting a session store was easy using connect-mongo which stores the session to MongoDB.
But now I can't inform all the clients on every change, since the clients are distributed between the different node.js servers.
The only solution I found is to set a pub/sub queue between the different node.js servers (e.g. mubsub), which seems like a very heavy weight solution that will trigger an event on all the servers for every change.
How did you reach the conclusion that pub/sub is a "very heavy weight solution"?
Sounds like you got it right up until that part :-)
Oh, and pub/sub is not a queue.
Let's examine that claim:
The nice thing about pub/sub is that you publish and subscribe to channels/topics.
So, using the classic chat server example, let's say you have a million users connected in total, but #myroom only has 50 users in it.
When a message is sent to #myroom, it's being published once. No duplication whatsoever.
In most use-cases you won't even need to store it on disk/RAM, so we're mostly looking at network/bandwidth here. And, I mean, you're probably throwing more data (probably over the wire?) to MongoDB already, so I assume that's not your bottleneck.
If you also use socket.io's rooms feature (which is basically its own pub/sub mechanism), that means only those 50 users will have the message emitted to them over the websocket.
And no, socket.io won't iterate over 1M clients to find out which of them are in room #myroom ;-)
So the message is published once, each subscriber (node.js instance) will get notified once, and only the relevant clients -- socket.io won't waste CPU cycles in order to find them as it keeps track of them when they join() or leave() a room -- will receive the message.
Doesn't that sound pretty efficient and light-weight?
Give Redis a shot.
It's really simple to set up, runs entirely in memory, is blazing-fast, replication is extremely simple, etc.
That's the way socket.io recommends passing events between nodes.
You can find more information/code here.
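The wiring is just a few lines with socket.io 0.9's built-in RedisStore (the require paths may differ depending on your socket.io version):

var RedisStore = require('socket.io/lib/stores/redis');
var redis = require('socket.io/node_modules/redis');

var io = require('socket.io').listen(server);
io.set('store', new RedisStore({
    redisPub: redis.createClient(),    // publishes events to the other nodes
    redisSub: redis.createClient(),    // receives events from the other nodes
    redisClient: redis.createClient()  // ordinary commands
}));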
Additionally, if MongoDB can't handle the load at any point, you can use Redis as your session-store as well.
Hope this helps!

Scalable push application with node.js

I'm thinking about writing a few web applications having almost the same requirements as a chat. And I would like them to be able to scale easily.
I have worked a bit with node.js and I understand how it can help design push applications but I have some difficulties when thinking about having them run on multiple servers.
Here are some design I can think of for a large scale chat app :
1 - Servers have state: they keep connections open and clients can have new messages pushed to them. In this scenario we are limited by the physical memory of one server, so we cannot scale linearly if we have too many users per room.
2 - Servers have no state: they query a distributed database to respond to client requests. In this scenario clients poll the servers. We can scale linearly, but throughput is decreased, messages are not delivered instantly, and polling has been shown to be bad practice when scaling.
3 - A mix of 1 and 2: servers keep their clients' connections open and poll the distributed database. The application is more complex to write and we still use polling. Similar client requests (clients in the same room) are just grouped into a single request made by the server. The code becomes unnecessarily complicated, and it does not scale when we have many rooms with few users per room.
4 - Servers have no state and the database cluster uses events to notify every registered server about new messages. This is the solution I would like to have, but I haven't heard of any database with this feature. (Some people are discussing this feature for MongoDB here: https://jira.mongodb.org/browse/SERVER-124)
So why is the 4th solution not used much today?
How do people usually design their applications in this case?
Since you want a push application, you would probably use Socket.IO with RedisStore.
By using this combination, the data for all connections is kept in Redis (an in-memory database), so you can scale beyond a single process. Redis is also used here for pub/sub.
The idea is to trigger an event when something needs to be pushed, then send a message to the browser using Socket.io. If you want to listen to database changes, perhaps it's better to use CouchDB with its _changes feature.
Resources:
https://github.com/dshaw/talks/tree/master/2011-10-jsclub/sample-app
http://www.ranu.com.ar/2011/11/redisstore-and-rooms-with-socketio.html
How to reuse redis connection in socket.io?
Instead of triggers for case 4, you might want to hook into MongoDB replication.
Let's assume you have a replica set (you wouldn't run single mongod, would you?).
Every change is written to the oplog on the primary and then is replicated to secondaries.
You can efficiently pull new updates from the oplog, using tailable cursors. Note, this is still pull, not push.
Then your node.js will push these events to the clients.
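A rough sketch with the node MongoDB driver (the oplog.rs collection name is real; the namespace filter and option details vary by driver version):

var MongoClient = require('mongodb').MongoClient;

// connect to the 'local' database of a replica set member
MongoClient.connect('mongodb://localhost:27017/local', function (err, db) {
    var oplog = db.collection('oplog.rs');

    var stream = oplog.find(
        { ns: 'mydb.messages' },  // only the namespace we care about
        { tailable: true, awaitdata: true, numberOfRetries: -1 }
    ).stream();

    stream.on('data', function (doc) {
        // doc.op is 'i'/'u'/'d' (insert/update/delete); push the change to clients here
    });
});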

Building a web app to support team collaboration using Socket.io

I'm building a web application that will allow team collaboration. That is, a user within a team will be able to edit shared data, and their edits should be pushed to other connected team members.
Are Socket.io rooms a reasonable way of achieving this?
i.e. (roughly speaking):
All connected team members will join the same room (dynamically created upon the first team member connecting).
Any edits received by the server will be broadcast to the room (in addition to being persisted, etc).
On the client side, any edits received will be used to update the shared data displayed in the browser accordingly.
Obviously it will need to somehow handle simultaneous updates to the same data.
Does this seem like a reasonable approach?
Might I need to consider something more robust, such as a Redis database to hold the shared data during an editing session (with it being 'flushed' to the persistent DB at regular intervals)?
All you need is Socket.IO (with RedisStore) and Express.js. With Socket.IO you can set up rooms and also limit access per room to only users who are authenticated.
Using Redis you can make your app scale beyond a single process.
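In its simplest form, the room flow is only a few lines (the event names here are made up):

io.sockets.on('connection', function (socket) {
    socket.on('join-team', function (teamId) {
        socket.join('team:' + teamId); // the room is created on first join and removed when empty
    });

    socket.on('edit', function (edit) {
        // persist the edit here, then push it to everyone else on the team
        socket.broadcast.to('team:' + edit.teamId).emit('edit', edit);
    });
});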
Useful links for you to read:
Handling Socket.IO, Express and sessions
Scaling Socket.IO
How to reuse redis connection in socket.io?
socket.io chat with private rooms
How to handle user and socket pairs with node.js + redis
Node.js, multi-threading and Socket.io
