nodeJS multi node Web server - multithreading

I need to create multi node web server that will be allow to control number of nodes in real time and change process UID and GUID.
For example at start server starts 5 workers and pushes them into workers pool.
When the server gets the new request it searches for free workers, sets UID or GUID if needed, and gives it the request to proces. In case if there is no free workers, server will create new one, set GUID or UID, also pushes it into pool and so on.
Can you suggest me how it can be implemented?
I've tried this example http://nodejs.ru/385 but it doesn't allow to control the number of workers, so I decided that there must be other solution but I can't find it.
If you have some examples or links that will help me to resolve this issue write me please.

I guess you are looking for this: http://learnboost.github.com/cluster/

I don't think cluster will do it for you.
What you want is to use one process per request.
Have in mind that this can be very innefficient, and node is designed to work around those types of worker processing, but if you really must do it, then you must do it.
On the other hand, node is very good at handling processes, so you need to keep a process pool, which is easily accomplished by using node internal child_process.spawn API.
Also, you will need a way for you to communicate to the worker process.
I suggest opening a unix-domain socket and sending the client connection file descriptor, so you can delegate that connection into the new worker.
Also, you will need to handle edge-cases for timeouts, etc.

https://github.com/pgte/fugue I use this.

Related

Distributing topics between worker instances with minimum overlap

I'm working on a Twitter project, using their streaming API, built on Heroku with Node.js.
I have a collection of topics that my app needs to process, which are pulled from MongoDB. I need to track each of these topics via the API, however it needs to be done such that each topic is tracked only once. As each worker process expires after approximately 1 hour, when a worker receives SIGTERM it needs to untrack each topic assigned, and release it back to the pool again.
I've been using RabbitMQ to communicate between app and worker processes, however with this I'm a little stuck. Are there any good examples, or advice you can offer on the correct way to do this?
Couldn't the worker just send a message via the messagequeue to the application when it receives a SIGTERM? According to the heroku docs on shutdown the process is allowed a couple of seconds (10) before it will be forecefully killed.
So you can do something like this:
// listen for SIGTERM sent by heroku
process.on('SIGTERM', function () {
// - notify app that this worker is shutting down
messageQueue.sendSomeMessageAboutShuttingDown();
// - shutdown process (might need to wait for async completion
// of message delivery to not prevent it from being delivered)
process.exit()
});
Alternatively you could break up your work in much smaller chunks and have workers only 'take' work that will run for a couple of minutes or even seconds max. Your main application should be the bookkeeper and if a process doesn't complete its task within a specified time assume it has gone missing and make the task available for another process to handle. You can probably also implement this behavior using confirms in rabbitmq.
RabbitMQ won't do this for you.
It will allow you to distribute the work to another process and/or computer, but it won't provide the kind of mechanism you need to prevent more than one process / computer from working on a particular topic.
What you want is a semaphore - a way to control access to a particular "resource" from multiple processes... a way to ensure only one process is working on a particular resource at a given time. In your case the "resource" will be the topic... but it will still be the resource that you want to control access to.
FWIW, there has been discussion of using RabbitMQ to implement a distributed semaphore in the past:
https://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/
https://aphyr.com/posts/315-call-me-maybe-rabbitmq
but the general consensus is that this is a bad idea. there are too many edge cases and scenarios in which RabbitMQ will fail to work as proper semaphore.
There are some node.js semaphore libraries available. I would recommend looking at them, and using one of them. Have a single process manage the semaphore and decide which other process can / cannot work on which topic.

zmq_connect() a socket while waiting for a zmq_send() or zmq_recv()

I'm working on an application where I want to use ZeroMQ to connect nodes of different types which may be added and removed while the system is running. This means that I want to call zmq_connect() or zmq_disconnect() at any time as nodes come and go.
Some connection use sockets of type ZMQ_REQ, which block when no peers are available. Thus, it may happen that one node is blocked in a zmq_recv(), without any node available for processing the request. If then a new node becomes available, I would like to connect the socket using zmq_connect(). The only way I can see how I could do that is to call zmq_connect() from a different thread. But the documentation states pretty clearly that zmq_socket instances cannot be used from multiple threads simultaneously.
How can I solve this problem, sending messages on a ZMQ_REQ socket without any connections (or connection which cannot be established) and then later add connections and have the waiting requests being processed?
You should not use zmq_recv() when no messages are ready. That way you avoid blocking your thread. Instead check that there indeed are a message to receive. The easiest way to achieve this is using a poller. Since you haven't stated which library or language you're using I can't give you the right example, but I guess C example from the ZeroMQ Guide's examples here could be of use.
Building ZeroMQ based applications is, in my experience, most effective by building one threaded nodes that reacts to messages and, if necessary, runs methods based on time intervals.
For building a system like you talk about I suggest you look at the Service Discovery chapter of the awesome ZeroMQ Guide.

Develop a clock and workers in node.js on heroku

I'm working on a service that needs to analyze data from social media networks every five minutes for different users. I'm developing it in node.js and I will implement it on Heroku.
According to this article on Heroku website, the best way to do that is separating the logic of the scheduler from the logic of the worker. In fact, the idea is to have one dyno dedicated to schedule tasks to avoid duplication. This dyno instructs a farm of workers (n dynos as needed) to do the tasks.
Here is the procfile of this architecture:
web: node web.js
worker: node worker.js
clock: node clock.js
The problem is how to implement it in node.js. I googled it, and the suggestion is to use message queue systems (like IronMQ, RabbitMQ or CloudAMQP). But I'm trying to set my code and app simple and with the minor need of add-ons.
The question is: is there a way to communicate directly from my scheduler (clock) to the worker dynos?
Thanks for your answers.
Heroku dynos do not have fixed IP addresses, so there is no way to open a direct connection between them. That's why you need to create a separate server instance with a static IP or other fixed endpoint that acts as a go-between.
You have at least two viable options: a RabbitMQ-type message queue, or a stripped down version using a pub-sub redis feed. I generally use the latter because it's quick, simple, and sufficiently robust for all my needs (e.g. if a message gets lost every once in a blue moon, it's no big deal). If, however, it is essential that you never lose a message, you should use a full-blown message queue like RabbitMQ.
Setting up the redis implementation is very straightforward. There are several redis add-ons (I use RedisCloud) with free and inexpensive plans. When you provision them, you get an endpoint to connect to and a password. Then you just connect your web dyno(s) and worker dyno(s) to your redis instance such that your web app publishes tasks to a channel and the worker subscribes to that channel.
If you need the web app to communicate with the client after task completion, you just create another channel for the worker to publish task completion messages and the web app to listen for them.
You'll never get duplication of tasks, as each time a worker receives a message it pops off the queue.
If I understood this correctly, you want to spin a clock as one app, and then spin workers as separate apps? Sure, there is a direct way. You open a connection from the clock app towards the worker app.
For example, have every worker open a client sockets connection to the clock. Then the clock can communicate to them and relay orders.
Or use WebRTC. That way the workers will talk to the clock, but they can also talk to each other.
Or make an (authenticated) HTTP(s) REST endpoint on the worker where it will receive tasks. Like, POST /tasks will create a task on the worker. If the task is short, it can reply right away, so that the clock knows the job is done. Or if it's a longer task, it can acknowledge it, but later call an endpoint on the clock to say it's done, something like PUT /tasks/32.
Or even more directly, open a direct net connection towards the clock, for example on worker start (and the other way around). Use dgram and send UDP messages between worker and clock.
In any way, I also believe that the people suggesting MQ like RabbitMQ is much better to just push jobs/tasks on. Then it can distribute tasks as needed, and based on unacked count on the job queue, it can spin up more workers when needed.
But your question is very broad, so to get more details, you could provide a little more details.
This might be helpful.
http://blog.andyjiang.com/intermediate-cron-jobs-with-heroku/
Basically you require the worker file directly into the clock file.
I solved in an easy way with the following three steps:
Set credit card information in Heroku account;
Installed Heroku Scheduler addon (you can use the command heroku addons:create scheduler:standard --app <yourAppName>)
Set up the script to run as scheduled job
More info here or here.

Node.js - child process handle urls related to itself

I'm trying to scale a chatter app using socket.io + cluster. Is it possible for child processes to handle incoming request belong to its process id (assigned when fork)?
For example:
http://mydomain/calculate?process=1
The above request is only handled by process 1, other processes will ignore it. In this way, I want to make sure requests of the same room are handled by same process, so I may don't have to use RedisStore as socket.io backend.
I also wonder how RedisStore works, because when using it, I found io.sockets.manager.rooms data are not accurate in all processes.
Edit:
Put it another way: can cluster master process dispatch request to different child processes based on the querystring?
The answer is no. The OS takes care of load balancing in this situation and in order to process query string you already have to be connected to a web server ( in your case child process ).
From my experience I find cluster a bit useless. It is a lot easier to spawn multiple NodeJS processes ( on multiple ports ) and put a proxy ( nginx? ) in front of them. It is easy and scalable.
As for socket.io: I don't think it works correctly with cluster ( because of sharing global variables, which causes issues ). Again: spawning separate NodeJS processes should fix the problem. Also it will be useful once you reach the point when you will have to scale to multiple machines. Any tricks with cluster won't help you at that point.
One last note: socket.io does not scale well. I suggest writing your own WebSocket server ( based on WS for example ) and implement your own scaling mechanism. For example based on all-to-all UDP pinging, which should scale well when dealing with small amount of servers ( 50? 100? ).

How does the cluster module work in Node.js?

Can someone explain in detail how the core cluster module works in Node.js?
How the workers are able to listen to a single port?
As far as I know that the master process does the listening, but how it can know which ports to listen since workers are started after the master process? Do they somehow communicate that back to the master by using the child_process.fork communication channel? And if so how the incoming connection to the port is passed from the master to the worker?
Also I'm wondering what logic is used to determine to which worker an incoming connection is passed?
I know this is an old question, but this is now explained at nodejs.org here:
The worker processes are spawned using the child_process.fork method,
so that they can communicate with the parent via IPC and pass server
handles back and forth.
When you call server.listen(...) in a worker, it serializes the
arguments and passes the request to the master process. If the master
process already has a listening server matching the worker's
requirements, then it passes the handle to the worker. If it does not
already have a listening server matching that requirement, then it
will create one, and pass the handle to the worker.
This causes potentially surprising behavior in three edge cases:
server.listen({fd: 7}) -
Because the message is passed to the master,
file descriptor 7 in the parent will be listened on, and the handle
passed to the worker, rather than listening to the worker's idea of
what the number 7 file descriptor references.
server.listen(handle) -
Listening on handles explicitly will cause the
worker to use the supplied handle, rather than talk to the master
process. If the worker already has the handle, then it's presumed that
you know what you are doing.
server.listen(0) -
Normally, this will cause servers to listen on a
random port. However, in a cluster, each worker will receive the same
"random" port each time they do listen(0). In essence, the port is
random the first time, but predictable thereafter. If you want to
listen on a unique port, generate a port number based on the cluster
worker ID.
When multiple processes are all accept()ing on the same underlying
resource, the operating system load-balances across them very
efficiently. There is no routing logic in Node.js, or in your program,
and no shared state between the workers. Therefore, it is important to
design your program such that it does not rely too heavily on
in-memory data objects for things like sessions and login.
Because workers are all separate processes, they can be killed or
re-spawned depending on your program's needs, without affecting other
workers. As long as there are some workers still alive, the server
will continue to accept connections. Node does not automatically
manage the number of workers for you, however. It is your
responsibility to manage the worker pool for your application's needs.
NodeJS uses a round-robin decision to make load balancing between the child processes. It will give the incoming connections to an empty process, based on the RR algorithm.
The children and the parent do not actually share anything, the whole script is executed from the beginning to end, that is the main difference between the normal C fork. Traditional C forked child would continue executing from the instruction where it was left, not the beginning like NodeJS. So If you want to share anything, you need to connect to a cache like MemCache or Redis.
So the code below produces 6 6 6 (no evil means) on the console.
var cluster = require("cluster");
var a = 5;
a++;
console.log(a);
if ( cluster.isMaster){
worker = cluster.fork();
worker = cluster.fork();
}
Here is a blog post that explains this
As an update to #OpenUserX03's answer, nodejs has no longer use system load-balances but use a built in one. from this post:
To fix that Node v0.12 got a new implementation using a round-robin algorithm to distribute the load between workers in a better way. This is the default approach Node uses since then including Node v6.0.0

Resources