Faye clustering multiple nodes (Node.js)

I am trying to build a pub/sub infrastructure using Faye (Node.js), and I want to know whether horizontal scaling is possible.
One nodejs process will run on single core, so when people are talking about clustering, they talk about creating multiple processes on the same machine, sharing a port, and sharing data through redis.
Like this:
http://www.davidado.com/2013/12/18/using-node-js-cluster-with-socket-io-for-push-notifications/
Firstly, I don't understand how we make sure that each of the forked processes goes to a different core. If I fork 10 node servers on a machine with 4 cores, is it ensured that they are distributed evenly?
What if I wish to add a new machine, and scale out that way? I have not seen any such support anywhere, and I am not sure it is even possible.
Let's say somehow multiple nodes are being used and there is some load balancer. But one client will connect to only one server process. So when a client C1 publishes on a channel on which a client C2 has subscribed, and C1 is connected to process P1 and C2 is connected to process P2, how will P1 publish the message to C2 when it doesn't have the connection?
This would probably be possible in the case of a single machine, because the cluster module enables all processes to share the same port, and the connections too.
I am fairly new to the web world, as well as nodejs and faye. Please enlighten me if there is something wrong in the question.

You are correct in thinking that the cluster module allows multiple cores to be used on a single machine. The cluster module allows the same application to be spawned multiple times whilst listening to the same port. The distribution amongst the cores is down to the operating system, so if you have 10 processes and 4 cores then the OS will figure out how best to distribute them (as long as they haven't been spawned with a set affinity). By default this shouldn't be a concern for you.
Load balancing can be done through node too, but that is separate from clustering. Instead you would have a separate application that grabs the load statistics on each running server and proxies the HTTP request to the most appropriate server (using http-proxy as an example). A very primitive load balancer will send each request to the running server instances in turn (round-robin) to give an even distribution.
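As a rough illustration of that primitive round-robin idea, here is a minimal sketch using the http-proxy package; the port and the list of target instances are assumptions made up for the example:

var http = require('http');
var httpProxy = require('http-proxy');

// Hypothetical app instances started elsewhere (e.g. two cluster workers or two servers).
var targets = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002'];
var proxy = httpProxy.createProxyServer({});
var next = 0;

http.createServer(function (req, res) {
  // Hand each request to the next target in turn for an even distribution.
  var target = targets[next];
  next = (next + 1) % targets.length;
  proxy.web(req, res, { target: target });
}).listen(8000);

A real balancer would also watch load statistics and health, as described above, rather than rotating blindly.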
The final point about sharing messages between all the instances assumes that there is a single point where all the messages are held. In the article you linked to they assume that there is only one server and all the processes share access to the redis instance. As they all access the same redis instance, all processes will be able to receive the same messages. If we're going to start thinking about multiple servers that are in different locations in the world that all have different message stores (i.e. their own redis instances) then we get into the domain of 'replication'. Some data stores are built with this in mind and redis is one of them. You end up with a 'master' set of data and a set of 'slaves' that will periodically update with the master and grab anything they are missing. It is important to note here that messages will not be sent in 'real-time' here unless you have a very intensive replication process.
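For the Faye case from the question specifically, the usual way to let clients connected to different processes (or machines) see each other's messages is to point every Faye server at the same Redis instance through the faye-redis engine. A minimal sketch, assuming Redis is running locally on the default port:

var http = require('http');
var faye = require('faye');
var fayeRedis = require('faye-redis');

var server = http.createServer();
var bayeux = new faye.NodeAdapter({
  mount: '/faye',
  timeout: 45,
  engine: {
    type: fayeRedis,   // all processes share this message store
    host: 'localhost', // assumption: local Redis on the default port
    port: 6379
  }
});

bayeux.attach(server);
server.listen(8000);

With this setup, when C1 publishes through process P1, the message goes into Redis, and P2 picks it up and delivers it to C2 over C2's own connection.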
In conclusion, developers go through this chain of scaling for their applications. The first is to make the application multi-process (the cluster module). The second is to have a load balancer that proxies the http request to the appropriate server that is running the multi-process application. The third is to replicate the datastores so that the servers can run independently but keep in sync with each other.

Related

How to use clusters in Node.js?

I am very new to Node.js and express. I am currently learning it by building my own services.
I recently read about clusters. I understood what clusters do. What I am not able to understand is how to make use of clusters in a production application.
One way I can think of is to use the Master process to just sit in front and route the incoming request to the next available child process in a round robin fashion. I am not sure if this is how it is designed to be used. I would like to know how should clusters be used in a typical web application.
Thanks.
The node.js cluster module is used with node.js any time you want to spread request processing across multiple node.js processes. This is most often used when you wish to increase your ability to handle more requests per second and you have multiple CPU cores in your server. By default, a single instance of node.js will not fully utilize multiple cores because the core Javascript you run in your server is single threaded (uses one core). Node.js itself does use threads for some things internally, but that's still unlikely to fully utilize a multi-core system. Setting up a clustered node.js process for each CPU core will allow you to better maximize the available compute resources.
Clustering also provides you with some additional fault tolerance. If one cluster process goes down, the other live cluster processes can keep serving requests while the failed one restarts.
The cluster module for node.js has a couple different scheduling algorithms - the round robin you mention is one. You can read more about that here: Cluster Round-Robin Load Balancing.
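To make that concrete, here is a minimal sketch of the common pattern: the master forks one worker per core, opts into round-robin scheduling explicitly, and replaces workers that die (the port is an assumption for the example):

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  cluster.schedulingPolicy = cluster.SCHED_RR; // explicit round-robin (set before forking)
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
  // Basic fault tolerance: replace any worker that dies.
  cluster.on('exit', function (worker, code, signal) {
    cluster.fork();
  });
} else {
  // Every worker listens on the same port; the master distributes connections.
  http.createServer(function (req, res) {
    res.end('handled by worker ' + cluster.worker.id);
  }).listen(8000);
}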
Because each cluster is a separate process, there is no automatic shared data among the different cluster processes. As such, clustering is simplest to implement either where there is no shared data or where the shared data is already in a place that it can be accessed by multiple processes (such as in a database).
Keep in mind that a single node.js process (if written to properly use async I/O and not heavily compute bound) can serve many requests at once by itself. Clustering is for when you want to expand scalability beyond what one instance can deliver.
I have created a proof of concept for clustering in node.js and described the details in the blog posts below. Going through them may provide some clarity.
https://jksnu.blogspot.com/2022/02/cluster-in-node-js-application.html
https://jksnu.blogspot.com/2022/02/cluster-management-in-node-js.html

Is cluster for node.js required when running it as a job worker?

Do we need cluster module for a node.js script which just fetches some job from gearman server or from a rest api like in AWS SQS and performs it?
What I know is that cluster is more useful for socket sharing (e.g. listening on a port), as in a web server.
PS: I am already using monit to monitor and restart these daemon processes in case of a crash, and in future I'm planning to use pm2 (in non-cluster mode, i.e. without the -i flag).
No, you do not have to use the cluster module in order to service multiple operations from some sort of work queue. You should use the cluster module when its specific features match up well with the type of work you are doing (load balancing multiple incoming connections).
In fact, if the operations to be done are mostly asynchronous (such as sending an update to an external database), you may not even need multiple processes. But if you do, then you can use the child_process module to start other worker processes to carry out individual tasks, and use the main central server to monitor and coordinate them. This is part of what the cluster module does, but its use is more specialized than a general mechanism for starting external processes, which you could code yourself.
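A rough sketch of that coordinator pattern with the built-in child_process module; the file names and message shapes are made up for the example:

// coordinator.js - the main process farms tasks out to a worker it forked
var fork = require('child_process').fork;

var worker = fork(__dirname + '/task-worker.js'); // hypothetical worker script

worker.on('message', function (result) {
  console.log('worker finished:', result);
});

worker.send({ taskId: 1, payload: 'job-from-queue' }); // hypothetical task format

// task-worker.js - receives one task at a time and reports back
process.on('message', function (task) {
  // ... do the actual work here (e.g. fetch the job from gearman/SQS and run it) ...
  process.send({ taskId: task.taskId, status: 'done' });
});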

Connecting Node.js applications running on different servers

It is not uncommon to think about distributing the logic of an application between different servers whether because of scalability, security or any other arbitrary concern. In such a scenario it's important to have reliable channels of communication between the separate modules or applications.
A practical case could look like this:
(Server #1) You have a DB table filling up with tasks (in the form of table entries) that need to be processed.
(Server #2) You have an arbitrator that fetches these tasks one by one so as to handle them in some specific fashion.
(Server #3 -- #n) You have multiple worker applications that receive tasks from the arbitrator and return the results back to it.
Now imagine that everything is programmed with Node.js. You want the worker servers to be able to spawn when more resources are needed and be terminated when the processing load is low. When a worker node is created it has to connect back to the arbitrator to signal that it is ready to receive tasks.
What are the available options for connecting the worker nodes with the arbitrator, such that the arbitrator can detect when a new worker node is connecting to it and data can start to flow between them? Or, in other words, how does one go about creating reliable, stateful channels of communication between two remote Node.js applications?
As much as this shouldn't turn into a battle of messaging technologies, another option is RabbitMQ. They have quick tutorials for both worker queues and remote procedure calls (rpc).
Although these tutorials are in Python, they are still easy to follow (and I believe a bit of googling will find you Node translations on github).
In your situation, Rabbit will be able to handle dispatching messages to particular workers, however I think you will have to write your scaling logic yourself.
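As a minimal sketch of the worker-queue pattern with the amqplib client (the queue name and broker URL are assumptions): the arbitrator publishes tasks to a durable queue, and any number of workers consume them one at a time.

var amqp = require('amqplib');

// Arbitrator side: publish a task to a durable queue.
amqp.connect('amqp://localhost').then(function (conn) {
  return conn.createChannel().then(function (ch) {
    return ch.assertQueue('tasks', { durable: true }).then(function () {
      ch.sendToQueue('tasks', Buffer.from(JSON.stringify({ id: 1 })), { persistent: true });
    });
  });
});

// Worker side: a new worker simply connects and starts consuming; Rabbit
// notices it and begins dispatching work, which covers the 'signal ready' step.
amqp.connect('amqp://localhost').then(function (conn) {
  return conn.createChannel().then(function (ch) {
    ch.prefetch(1); // take one unacknowledged task at a time
    return ch.assertQueue('tasks', { durable: true }).then(function () {
      ch.consume('tasks', function (msg) {
        var task = JSON.parse(msg.content.toString());
        // ... process the task, then acknowledge it ...
        ch.ack(msg);
      });
    });
  });
});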
ZeroMQ is a good option for that as well.
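As a rough sketch with the zeromq package, a push/pull socket pair gives you the arbitrator-to-worker channel; a new worker announces itself simply by connecting (the address is an assumption):

const zmq = require('zeromq');

// Arbitrator: bind a PUSH socket and hand out tasks to connected workers.
async function arbitrator() {
  const sock = new zmq.Push();
  await sock.bind('tcp://127.0.0.1:5555'); // assumed address
  await sock.send(JSON.stringify({ taskId: 1 }));
}

// Worker: connect a PULL socket and receive tasks as they arrive.
async function worker() {
  const sock = new zmq.Pull();
  sock.connect('tcp://127.0.0.1:5555');
  for await (const [msg] of sock) {
    console.log('received task:', JSON.parse(msg.toString()));
  }
}

arbitrator();
worker();

Results can flow back over a second pair in the opposite direction, or you can switch to ZeroMQ's request/reply sockets.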

How to make a distributed node.js application?

Creating a node.js application is simple enough.
var app = require('express')();
app.get('/', function (req, res) {
  res.send("Hello world!");
});
app.listen(3000); // missing from the original snippet: the server must listen on a port
But suppose people became obsessed with your Hello World! application and exhausted your resources. How could this example be scaled up in practice? I don't understand it, because yes, you could start several node.js instances on different computers - but when someone accesses http://your_site.com/, it points directly at one specific machine, that specific port, that specific node process. So how?
There are many many ways to deal with this, but it boils down to 2 things:
being able to use more cores per server
being able to scale beyond more than one server.
node-cluster
For the first option, you can use node-cluster or the same solution as for the second option. node-cluster (http://nodejs.org/api/cluster.html) is essentially a built-in way to fork the node process into one master and multiple workers. Typically, you'd want 1 master and n-1 to n workers (n being your number of available cores).
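Applied to the question's Hello World app, a minimal node-cluster sketch (the port is an assumption) would look like this:

var cluster = require('cluster');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // 1 master and n-1 workers, leaving a core's worth of headroom for the master.
  for (var i = 0; i < numCPUs - 1; i++) {
    cluster.fork();
  }
} else {
  var app = require('express')();
  app.get('/', function (req, res) {
    res.send("Hello world!");
  });
  app.listen(3000); // all workers share this port
}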
load balancers
The second option is to use a load balancer that distributes the requests amongst multiple workers (on the same server, or across servers).
Here you have multiple options as well. Here are a few:
a node based option: Load balancing with node.js using http-proxy
nginx: Node.js + Nginx - What now? (using more than one upstream server)
apache: (no clearly helpful link I could use, but a valid option)
One more thing: once you have multiple processes serving requests, you can no longer use in-process memory to store state; you need an additional service to hold shared state. Redis (http://redis.io) is a popular choice, but by no means the only one.
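For example, a visit counter kept in Redis instead of a local variable, so every process (and every server) sees the same value; a sketch using the classic callback-style redis client, with the key name made up for the example:

var redis = require('redis');
var client = redis.createClient(); // assumes Redis on localhost:6379

var app = require('express')();
app.get('/', function (req, res) {
  // The shared counter lives in Redis, not in process memory.
  client.incr('hits', function (err, hits) {
    if (err) return res.status(500).send('redis error');
    res.send('Hello world! You are visitor #' + hits);
  });
});
app.listen(3000);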
If you use services such as cloudfoundry, heroku, and others, they set it up for you so you only have to worry about your app's logic (and using a service to deal with shared state)
I've been working with node for quite some time, but recently got the opportunity to try scaling my node apps, and I have been researching this same topic for a while now. I have come across the following prerequisites for scaling:
My app needs to be available on distributed systems, each running multiple instances of node
Each system should have a load balancer that helps distribute traffic across the node instances.
There should be a master load balancer that should distribute traffic across the node instances on distributed systems.
The master balancer should always be running OR should have a dependable restart mechanism to keep the app stable.
For the above requisites I've come across the following:
Use modules like cluster to start multiple instances of node in a system.
Use nginx, always. It's one of the simplest mechanisms for creating a load balancer I've come across so far.
Use HAProxy to act as a master load balancer. A few pointers on how to use it and keep it forever running.
Useful resources:
Horizontal scaling node.js and websockets.
Using cluster to take advantage of multiple cores.
I'll keep updating this answer as I progress.
The basic way to use multiple machines is to put them behind a load balancer and point all your traffic at the load balancer. That way, someone going to http://my_domain.com will hit the load balancer machine. The sole purpose (for this example anyway; in theory more could be done) of the load balancer is to delegate the traffic to a given machine running your application. This means that you can have x number of machines running your application, while an external machine (in this case a browser) can go to the load balancer address and get to one of them. The client doesn't know (and doesn't have to know) which machine is actually handling its request. If you are using AWS, it's pretty easy to set up and manage this. Note that Pascal's answer has more detail about your options here.
With Node specifically, you may want to look at the Node Cluster module. I don't really have a lot of experience with this module, but it should allow you to spawn multiple processes of your application on one machine, all sharing the same port. Also note that it's still experimental and I'm not sure how reliable it will be.
I'd recommend taking a look at http://senecajs.org, a microservices toolkit for Node.js. It is a good starting point for beginners, and for starting to think in "services" instead of monolithic applications.
Having said that, building distributed applications is hard: it takes time to learn and a LOT of time to master, and you will usually face many trade-offs between performance, reliability, maintenance, etc.
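To give a taste of the service-oriented style Seneca encourages, here is a minimal sketch based on its pattern-matching API (the port is an assumption):

var seneca = require('seneca')();

// Define a service action identified by a pattern rather than a URL.
seneca.add({ role: 'math', cmd: 'sum' }, function (msg, respond) {
  respond(null, { answer: msg.left + msg.right });
});

// Expose the action over the network...
seneca.listen(10101);

// ...and call it from another process with a matching client:
// require('seneca')().client(10101).act(
//   { role: 'math', cmd: 'sum', left: 1, right: 2 },
//   function (err, result) { console.log(result.answer); } // -> 3
// );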

How do I set up routing to multiple instances of a node.js server on one url?

I have a simple node.js server app built that I'm hoping to test out soon. It's single threaded and works fine without any child processing whatsoever. My problem is that the server box has multiple cores, and the simplest way I can think of to utilize them is to run multiple instances of the server app. However, they would all need to be on the same domain name, so some sort of request routing is required. I personally don't have much experience with servers in general and don't know if this is a task for node.js or for some other, less complicated program (or a more complicated one). If there is a node.js mechanism to solve this, for example one running instance sending incoming requests on to the next instance, then how would I detect when this needs to happen? Conversely, if I use some other program, how will it detect when it needs to start talking to a new instance?
Node.js includes built-in support for managing a cluster of instances of your application to take advantage of multiple cores via the cluster module.
