I have a socket.io chat application running on a single machine, and its traffic is outgrowing that machine. We have run benchmarks using the ws library for sockets, and it performs much better, which would make better use of our hardware. That would come at the cost of having to rewrite our application, though.
Our socket.io application allows users to create private chat rooms, which are implemented using namespaces, e.g.
localhost:8080/room/1
localhost:8080/room/2
localhost:8080/room/3
When everything is on one instance it is quite easy, but now we are looking to expand this capacity across multiple nodes.
We run this instance in Amazon's cloud. Previously, scaling WebSockets behind an ELB looked like a problem, but we have noticed that Amazon now offers an Application Load Balancer (ALB) which supports WebSockets. This sounds great, but after reading the documentation I must admit I don't really know what it means. If I am using socket.io with thousands of namespaces, do I just put instances behind this ALB and everything will work? My main question is:
If x users join a namespace, will the ALB automatically route my messages to and from the proper users? Say I have 5 vanilla socket.io instances running behind the ALB. User 1 creates a namespace. A few hours pass and user 99999 wants to join that namespace. Will any additional code need to be written for this, or will the ALB route everything where it should go? The same goes for sending and receiving messages.
While the ALB will load balance the users correctly, you will need to adapt your code a little, since users who joined a specific room will be spread across different servers.
In its documentation, socket.io provides a way to do this:
Now that you have multiple Socket.IO nodes accepting connections, if you want to broadcast events to everyone (or even everyone in a certain room) you’ll need some way of passing messages between processes or computers.
The interface in charge of routing messages is what we call the Adapter. You can implement your own on top of the socket.io-adapter (by inheriting from it) or you can use the one we provide on top of Redis: socket.io-redis:
var io = require('socket.io')(3000);
var redis = require('socket.io-redis');
io.adapter(redis({ host: 'localhost', port: 6379 }));
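With the adapter in place, emits work across nodes transparently. A minimal sketch of what that looks like (the event name is made up; the namespace follows the room URLs above):
// Minimal sketch: with socket.io-redis wired up, an emit on any instance
// reaches sockets connected to the other instances as well.
var nsp = io.of('/room/1');                  // one namespace per private chat room
nsp.on('connection', function (socket) {
  socket.on('chat message', function (msg) {
    nsp.emit('chat message', msg);           // broadcast via Redis to every node
  });
});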
ALB setup
I would recommend enabling sticky sessions in your ALB; otherwise the socket.io handshake will fail when a non-WebSocket transport such as long polling is used, because the handshake over those transports requires more than one request, and all of those requests need to hit the same server.
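If you would rather not rely on stickiness, a common workaround (a sketch, not something the ALB documentation mandates) is to restrict clients to the WebSocket transport so the handshake happens over a single connection:
// Client-side sketch (the URL is a placeholder): skip long polling entirely
// so no multi-request handshake has to land on the same server.
var socket = io('https://chat.example.com', { transports: ['websocket'] });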
Alternative: using ALB routing without the socket.io adapter.
What if I wanted to avoid having a Redis database? For example, since my rooms are created by users, if userA creates a room on instance 4 and another user wants to join this room, how would they know which instance it is on? Would I need the adapter here too?
The goal of this alternative is to have each room assigned to a specific EC2 instance. We're going to achieve this using ALB routing:
N rooms > 1 instance.
Step 1:
You will need to change your room URLs to something like:
/i1/room/550
/i1/room/20
/i2/room/5
/i5/room/492
following the pattern:
/{instance-number}/room/{room-id}
This is needed so ALB can route each room to a specific instance.
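One caveat worth noting: socket.io multiplexes every namespace over a single HTTP path (/socket.io/ by default), so for a path rule to also match the WebSocket traffic, the transport path can carry the instance prefix as well. A rough sketch, assuming an INSTANCE_ID environment variable and a rule that also covers /iX/socket.io/*:
// Rough sketch (INSTANCE_ID is an assumed environment variable): instance 1
// serves its socket.io endpoint under /i1/socket.io so an ALB path rule can match it.
var prefix = '/i' + (process.env.INSTANCE_ID || '1');
var io = require('socket.io')(8080, { path: prefix + '/socket.io' });
io.of('/room/550').on('connection', function (socket) {
  // room 550 exists only on this instance, so no adapter is needed
});
// Client that was told the room lives on instance 1:
// var socket = io('https://chat.example.com/room/550', { path: '/i1/socket.io' });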
Step 2:
Create N target groups (N being the number of instances you have at the moment)
Step 3:
Register each instance with its corresponding target group:
Target Groups > Instance X target group > Targets tab > Edit > Choose instance X > Add to registered
Target group X > EC2 Instance X
Target group Y > EC2 Instance Y
Step 4:
Edit ALB target rules
Load Balancers > Your ALB > Listeners > View/Edit Rules
Step 5:
Create one rule per target group/instance with the following settings:
IF > Path: /iX/room/*
THEN > Forward to: Instance X target group
Once you have this set up, when you enter:
/i1/room/550 you will be using EC2 Instance 1.
/i2/room/200 will be using EC2 Instance 2
and so on.
Now you will have to implement your own logic to keep the rooms balanced across your instances. You don't want one instance hosting almost all of the rooms.
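A minimal sketch of such logic (instance names are hypothetical): track how many rooms each instance hosts and send new rooms to the least loaded one.
// Hypothetical sketch: assign each new room to the instance with the fewest rooms.
var roomsPerInstance = { i1: 0, i2: 0, i3: 0, i4: 0, i5: 0 };

function assignRoom(roomId) {
  var least = Object.keys(roomsPerInstance).reduce(function (a, b) {
    return roomsPerInstance[a] <= roomsPerInstance[b] ? a : b;
  });
  roomsPerInstance[least] += 1;
  return '/' + least + '/room/' + roomId;    // e.g. "/i3/room/550"
}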
I recommend the first approach since it can be autoscaled easily.
Related
I am trying to find a good way to horizontally scale a stateful NodeJS service.
The Problem
The problem is that most of the options I find online assume the service is stateless. The NodeJS cluster documentation says:
Node.js [Cluster] does not provide routing logic. It is, therefore important to design an application such that it does not rely too heavily on in-memory data objects for things like sessions and login.
https://nodejs.org/api/cluster.html
We are using Kubernetes so scaling across multiple machines would also be easy if my service was stateless, but it is not.
Current Setup
I have a list of objects that stay in memory; each object is its own transaction boundary. Requests to this service always have the object ID in the URL. Requests for the same object ID are put into a queue and processed one at a time.
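For reference, the kind of per-ID queue described above can be captured in a few lines (an illustrative sketch, not the asker's actual code):
// Illustrative sketch: serialize work per object ID using a promise chain per ID.
var queues = new Map();   // objectId -> tail of the promise chain

function enqueue(objectId, task) {
  var tail = queues.get(objectId) || Promise.resolve();
  var next = tail.then(function () { return task(); });
  // Swallow errors on the stored tail so a failed task doesn't block later ones.
  queues.set(objectId, next.catch(function () {}));
  return next;
}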
Desired Setup
I would like to keep this interface to the external world, but internally spread this list of objects across multiple nodes and route each request to the appropriate node based on the ID in the URL.
What is the usual way to do this in Node.js? I've seen people use the user session to make sure a given user always goes to the same node; I would like to do the same thing, but keyed on the ID in the URL instead of the user session.
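One common pattern is a thin routing layer in front of the workers that hashes the ID from the URL onto a fixed set of backends, for example with the http-proxy package (a sketch; the backend hostnames and the /objects/:id URL shape are assumptions). In Kubernetes this pairs naturally with a StatefulSet, whose pods get stable DNS names.
// Sketch: route /objects/:id to a backend chosen by hashing the ID.
var http = require('http');
var crypto = require('crypto');
var httpProxy = require('http-proxy');

var backends = ['http://worker-0:3000', 'http://worker-1:3000'];   // assumed hosts
var proxy = httpProxy.createProxyServer({});

proxy.on('error', function (err, req, res) {
  res.statusCode = 502;
  res.end('backend unavailable');
});

http.createServer(function (req, res) {
  var id = req.url.split('/')[2] || '';                  // assumes /objects/:id/...
  var hash = crypto.createHash('md5').update(id).digest();
  var target = backends[hash.readUInt32BE(0) % backends.length];
  proxy.web(req, res, { target: target });               // same ID always hits the same node
}).listen(8080);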
How do I implement a custom load-balancing decision method to specify exactly which server should process a request?
Currently I am working with Azure, so Microsoft solutions are preferable (ARR or WLBS).
Each server instance may have several unique resources ("unique" means that only this particular server instance has it).
An application creates a unique ResourceID for each resource and gives this ResourceID to a client "on demand".
The client's further requests are specified by the ResourceID.
The custom load balancer decision method should allow me to specify how:
To get the ResourceID from the request (should work on Layer 7).
To get the ServerInstanceID (or IP or whatever is required) based on the ResourceID (from my custom table).
To notify the load balancer which exactly application server instance should process this request (pass the ServerInstanceID).
P.S. Maybe I should say "proxy" here instead of "load balancer". But for the sake of high availability, that would require several proxy servers and a load balancer to spread traffic between them, so a pure proxy solution would just add another tier to the application.
I have found two useful threads on the IIS.NET Forums:
Custom load balancing decision function
using URL Rewrite Module for custom load balancing
Two main approaches are recommended:
To use custom load balancing.
To use custom Application Request Routing (ARR).
The most interesting thing is that each thread recommends the approach from the other :)
Nevertheless, I will review both of the suggested approaches.
I'm looking at building a global app from the ground up that can be updated and scaled transparently to the user.
The architecture so far is very simple: each part of the application has its own process and talks to the others through sockets.
This way I can spawn as many instances as I want for each part of the application and distribute them across the globe according to my needs.
In front of the system I'll have a load balancer, which will then route users to their closest instance. When new code is deployed, my instances will spawn new processes with the new code, route new requests to them, and gracefully shut down the old ones.
Thank you very much for any advice.
Edit:
The question is: what is the best (and simplest) solution for achieving zero downtime when deploying Node to multiple instances?
About the app:
https://github.com/Raynos/boot for "socket" connections,
http for http requests,
mongo for database
Solutions I'm trying at the moment:
https://www.npmjs.org/package/thalassa (which manages HAProxy configuration files and app instances). If you don't know it, watch this talk: https://www.youtube.com/watch?v=k6QkNt4hZWQ and be aware that crowsnest is being replaced by https://github.com/PearsonEducation/thalassa-consul
Deployment with zero downtime is only possible if the data you share between old and new nodes is compatible.
So if you change the data structure, you have to build an intermediate release that can handle both the old and the new structure without actually using the new one, until you have replaced all nodes with that intermediate version. Then roll out the new version.
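As a made-up illustration: if a user record is being restructured, the intermediate release reads both shapes but keeps writing the old one until every node runs it.
// Made-up illustration: the intermediate release tolerates both record shapes.
function readUserName(user) {
  if (user.name) {                               // new shape: { name: { first, last } }
    return user.name.first + ' ' + user.name.last;
  }
  return user.firstName + ' ' + user.lastName;   // old shape: { firstName, lastName }
}

function writeUserName(user, first, last) {
  user.firstName = first;                        // still writes only the old shape
  user.lastName = last;
}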
Taking nodes in and out of production can be done with your load balancer (with a grace period until all sessions on a node have expired; I don't know enough about your application to be more specific).
I wrote a multi-process realtime WebSocket server which uses the session id to load-balance traffic to the relevant worker based on the port number that it is listening on. The session id contains the hostname, source port number, worker port number and the actual hash id which the worker uses to uniquely identify the client. A typical session id would look like this:
localhost_9100_8000_0_AoT_eIwV0w4HQz_nAAAV
I would like to know the security implications of having the worker port number (in this case 9100) as part of the session id like that.
I am a bit worried about denial-of-service (DoS) threats. In theory, this could allow a malicious user to generate a large number of HTTP requests targeted at a specific port number (for example by using a fake session ID which contains that port number). But is this a serious threat, assuming you have decent firewalls? How do big companies like Google handle sticky sessions from a security perspective?
Are there any other threats which I should consider?
The reason I designed the server like this is to account for the initial HTTP handshake, and also for when the client does not support WebSocket (in which case HTTP long polling is used, and hence subsequent HTTP requests from a client need to go to the same worker in the backend).
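For context, the routing step described above amounts to roughly the following (all identifiers invented; the real implementation differs):
// Simplified sketch: the balancer pulls the claimed worker port out of the
// session id and only forwards to ports it actually knows about.
var KNOWN_WORKER_PORTS = [9100, 9101, 9102];           // assumed list of worker ports

function pickWorkerPort(sessionId) {
  var claimed = sessionId.split('_').map(Number).filter(function (p) {
    return KNOWN_WORKER_PORTS.indexOf(p) !== -1;       // ignore forged port values
  })[0];
  return claimed || KNOWN_WORKER_PORTS[0];             // fall back for bogus ids
}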
So there are several sub-questions in your question. I'll try to split them up and answer them accordingly:
Is a DoS attack on a specific worker a serious threat?
It depends. If you only have 100 users, probably not. But you can be sure that there are people out there who will take a look at your application, try to figure out its weaknesses, and exploit them.
Now, is a DoS attack on single workers a serious possibility if you can just attack the whole server? I would actually say yes, because it is a more precise attack: you need fewer resources to kill the workers when you do it one by one. However, if you only allow connections from the outside on port 80 for HTTP and block everything else, this problem is solved.
How do big companies like Google handle sticky sessions?
Simple answer: who says they do? There are multiple other ways to solve the problem of sessions in a distributed system:
don't store any session state on the server; just keep a key in the cookie with which you can identify the user again, similar to an automatic login.
store the session state in a database or object storage (this adds some overhead); see the sketch after this list.
store session information in the proxy (or broker, HTTP endpoint, ...) and send it along with the request to the next worker.
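For instance, the second option could look roughly like this with a Redis client such as ioredis (the key prefix and TTL are made up):
// Sketch of option 2 (key prefix and TTL invented): any worker can load the
// session from Redis, so requests no longer need to stick to one process.
var Redis = require('ioredis');
var redis = new Redis();                               // assumes Redis on localhost:6379

function loadSession(sessionKey) {
  return redis.get('session:' + sessionKey).then(function (raw) {
    return raw ? JSON.parse(raw) : {};
  });
}

function saveSession(sessionKey, session) {
  return redis.set('session:' + sessionKey, JSON.stringify(session), 'EX', 3600);
}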
Are there any other threats which I should consider?
There are always unforeseen threats, and that's the reason why you should never expose more information than necessary. For that reason, most big companies don't even publish the correct name and version of their web server (for Google it is gws, for instance).
That being said, I see why you might want to keep your implementation, but maybe you can modify it slightly: store in your load balancer a dictionary keyed by a hash of the hostname, source port number, and worker port number, and make the session id a pair of hashes. Then the load balancer knows, by looking into the dictionary, which worker the request needs to be sent to. Save this info together with a timestamp of when it was last used, and every minute delete the unused entries.
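A rough sketch of that idea (every name here is invented):
// Rough sketch: the balancer maps an opaque session hash to a worker
// and ages out entries that haven't been used for a minute.
var routes = new Map();   // sessionHash -> { workerPort, lastSeen }

function resolveWorker(sessionHash) {
  var entry = routes.get(sessionHash);
  if (entry) entry.lastSeen = Date.now();
  return entry ? entry.workerPort : null;
}

function rememberWorker(sessionHash, workerPort) {
  routes.set(sessionHash, { workerPort: workerPort, lastSeen: Date.now() });
}

setInterval(function () {                       // drop entries unused for a minute
  var cutoff = Date.now() - 60 * 1000;
  routes.forEach(function (entry, key) {
    if (entry.lastSeen < cutoff) routes.delete(key);
  });
}, 60 * 1000);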
I have a problem with auto scaling in Azure. The scaling process works fine, but when a new instance is added it receives no traffic.
My scenario:
I have 2 running instances with a WCF web service on them. Now I send data to the web service from 2 other servers (not Azure).
After a while the auto scaling kicks in and a new instance is added. The 2 servers are still producing load on the first 2 Azure servers, but the new one doesn't get any.
I thought Azure uses round robin for load balancing, or am I missing something else?
Thanks for any help.
The problem is caused by TCP connection keep-alive: when the clients first connect, connections are established to the existing instances and then persist to those instances. So when the service scales out, the clients won't reconnect unless the connection is broken. New clients will connect to both existing and new instances.
Here's another question for a very similar scenario. For testing purposes you can just disable keep-alive to ensure that load is indeed distributed between instances.
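The asker's clients are WCF rather than Node, but as a generic illustration of the idea (the hostname and response header below are placeholders): disabling keep-alive means each request opens a fresh connection, so requests can land on different instances behind the balancer.
// Generic illustration (not WCF-specific): agent: false opens a new TCP
// connection per request instead of reusing one, so the balancer re-picks an instance.
var http = require('http');

http.get({ host: 'myservice.cloudapp.net', path: '/', agent: false }, function (res) {
  console.log('status', res.statusCode, 'instance', res.headers['x-instance'] || 'n/a');
});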