Distributing topics between worker instances with minimum overlap - node.js

I'm working on a Twitter project, using their streaming API, built on Heroku with Node.js.
I have a collection of topics that my app needs to process, which are pulled from MongoDB. I need to track each of these topics via the API, however it needs to be done such that each topic is tracked only once. As each worker process expires after approximately 1 hour, when a worker receives SIGTERM it needs to untrack each topic assigned, and release it back to the pool again.
I've been using RabbitMQ to communicate between app and worker processes, however with this I'm a little stuck. Are there any good examples, or advice you can offer on the correct way to do this?

Couldn't the worker just send a message via the messagequeue to the application when it receives a SIGTERM? According to the heroku docs on shutdown the process is allowed a couple of seconds (10) before it will be forecefully killed.
So you can do something like this:
// listen for SIGTERM sent by heroku
process.on('SIGTERM', function () {
// - notify app that this worker is shutting down
messageQueue.sendSomeMessageAboutShuttingDown();
// - shutdown process (might need to wait for async completion
// of message delivery to not prevent it from being delivered)
process.exit()
});
Alternatively you could break up your work in much smaller chunks and have workers only 'take' work that will run for a couple of minutes or even seconds max. Your main application should be the bookkeeper and if a process doesn't complete its task within a specified time assume it has gone missing and make the task available for another process to handle. You can probably also implement this behavior using confirms in rabbitmq.

RabbitMQ won't do this for you.
It will allow you to distribute the work to another process and/or computer, but it won't provide the kind of mechanism you need to prevent more than one process / computer from working on a particular topic.
What you want is a semaphore - a way to control access to a particular "resource" from multiple processes... a way to ensure only one process is working on a particular resource at a given time. In your case the "resource" will be the topic... but it will still be the resource that you want to control access to.
FWIW, there has been discussion of using RabbitMQ to implement a distributed semaphore in the past:
https://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/
https://aphyr.com/posts/315-call-me-maybe-rabbitmq
but the general consensus is that this is a bad idea. there are too many edge cases and scenarios in which RabbitMQ will fail to work as proper semaphore.
There are some node.js semaphore libraries available. I would recommend looking at them, and using one of them. Have a single process manage the semaphore and decide which other process can / cannot work on which topic.

Related

How to send a message to ReactPHP/Amp/Swoole/etc. from PHP-FPM?

I'm thinking about making a worker script to handle async tasks on my server, using a framework such as ReactPHP, Amp or Swoole that would be running permanently as a service (I haven't made my choice between these frameworks yet, so solutions involving any of these are helpful).
My web endpoints would still be managed by Apache + PHP-FPM as normal, and I want them to be able to send messages to the permanently running script to make it aware that an async job is ready to be processed ASAP.
Pseudo-code from a web endpoint:
$pdo->exec('INSERT INTO Jobs VALUES (...)');
$jobId = $pdo->lastInsertId();
notify_new_job_to_worker($jobId); // how?
How do you typically handle communication from PHP-FPM to the permanently running script in any of these frameworks? Do you set up a TCP / Unix Socket server and implement your own messaging protocol, or are there ready-made solutions to tackle this problem?
Note: In case you're wondering, I'm not planning to use a third-party message queue software, as I want async jobs to be stored as part of the database transaction (either the whole transaction is successful, including committing the pending job, or the whole transaction is discarded). This is my guarantee that no jobs will be lost. If, worst case scenario, the message cannot be sent to the running service, missed jobs may still be retrieved from the database at a later time.
If your worker "runs permanently" as a service, it should provide some API to interact through. I use AmPHP in my project for async services, and my services implement HTTP/Websockets servers (using Amp libraries) as an API transport.
Hey ReactPHP core team member here. It totally depends on what your ReactPHP/Amp/Swoole process does. Looking at your example my suggestion would be to use a message broker/queue like RabbitMQ. That way the process can pic it up when it's ready for it and ack it when it's done. If anything happens with your process in the mean time and dies it will retry as long as it hasn't acked the message. You can also do a small HTTP API but that doesn't guarantee reprocessing of messages on fatal failures. Ultimately it all depends on your design, all 3 projects are a toolset to build your own architectures and systems, it's all up to you.

Does Node.js need a job queue?

Say I have a express service which sends email:
app.post('/send', function(req, res) {
sendEmailAsync(req.body).catch(console.error)
res.send('ok')
})
this works.
I'd like to know what's the advantage of introducing a job queue here? like Kue.
Does Node.js need a job queue?
Not generically.
A job queue is to solve a specific problem, usually with more to do than a single node.js process can handle at once so you "queue" up things to do and may even dole them out to other processes to handle.
You may even have priorities for different types of jobs or want to control the rate at which jobs are executed (suppose you have a rate limit cap you have to remain below on some external server or just don't want to overwhelm some other server). One can also use nodejs clustering to increase the amount of tasks that your node server can handle. So, a queue is about controlling the execution of some CPU or resource intensive task when you have more of it to do than your server can easily execute at once. A queue gives you control over the flow of execution.
I don't see any reason for the code you show to use a job queue unless you were doing a lot of these all at once.
The specific https://github.com/OptimalBits/bull library or Kue library you mention lists these features on its NPM page:
Delayed jobs
Distribution of parallel work load
Job event and progress pubsub
Job TTL
Optional retries with backoff
Graceful workers shutdown
Full-text search capabilities
RESTful JSON API
Rich integrated UI
Infinite scrolling
UI progress indication
Job specific logging
So, I think it goes without saying that you'd add a queue if you needed some specific queuing features and you'd use the Kue library if it had the best set of features for your particular problem.
In case it matters, your code is sending res.send("ok") before it finishes with the async tasks and before you know if it succeeded or not. Sometimes there are reasons for doing that, but sometimes you want to communicate back whether the operation was successful or not (which you are not doing).
Basically, the point of a queue would simply be to give you more control over their execution.
This could be for things like throttling how many you send, giving priority to other actions first, evening out the flow (i.e., if 10000 get sent at the same time, you don't try to send all 10000 at the same time and kill your server).
What exactly you use your queue for, and whether it would be of any benefit, depends on your actual situation and use cases. At the end of the day, it's just about controlling the flow.

Netty multi threading per connection

I am new to netty. I would like to develop a server which aims at receiving requests from possibly few(say Max is of 2) clients. But each client will be sending many requests to server continuously. Server has to process such requests and respond to client. So, here I assume that even though if I configure multiple worker threds,it may not be useful as there are only 2 active connections. Worker thread again block till it process and respond to client. So, please let me know how to handle these type of problems.
If I use threadpoolexecutor in worker thread to process both clients requests in multi threaded manner, will it be efficient? Or if it cane achieved through netty framework, plz let me know how to do this?
Thanks in advance...
If I understand correctly: your clients (2) will send many messages, each of them implying an answear as quickly as possible from the server.
2 options can be seen:
The answear process is short time (short enough to not be an isssue for the rate you want to reach, meaning 1 thread is able to answear as fast as you need for 1 client): then you can stay with the standard threads from Netty (1 worker thread for 1 client at a time) set up in the server bootstrap. This is the shortest path.
The answear process is not short time enough (the rate will be terrible, for instance because there is a "long time" process, such as blocking call, database access, file writing, ...): then you can add a thread pool (a group) in the Netty pipeline for you ChannelHandler doing such blocking/long process.
Here is an extract of the API documentation taken from ChannelPipeline:
http://netty.io/4.0/api/io/netty/channel/ChannelPipeline.html
// Tell the pipeline to run MyBusinessLogicHandler's event handler methods
// in a different thread than an I/O thread so that the I/O thread is not blocked by
// a time-consuming task.
// If your business logic is fully asynchronous or finished very quickly, you don't
// need to specify a group.
pipeline.addLast(group, "handler", new MyBusinessLogicHandler());
just add a ChannelHandler with a special EventExecutorGroup to the ChannelPipeline. For example UnorderedThreadPoolEventExecutor (src).
something like this.
UnorderedThreadPoolEventExecutor executorGroup = ...;
pipeline.addLast(executorGroup, new MyChannelHandler());

Node/Express: running specific CPU-instensive tasks in the background

I have a site that makes the standard data-bound calls, but then also have a few CPU-intensive tasks which are ran a few times per day, mainly by the admin.
These tasks include grabbing data from the db, running a few time-consuming different algorithms, then reuploading the data. What would be the best method for making these calls and having them run without blocking the event loop?
I definitely want to keep the calculations on the server so web workers wouldn't work here. Would a child process be enough here? Or should I have a separate thread running in the background handling all /api/admin calls?
The basic answer to this scenario in Node.js land is to use the core cluster module - https://nodejs.org/docs/latest/api/cluster.html
It is an acceptable API to :
easily launch worker node.js instances on the same machine (each instance will have its own event loop)
keep a live communication channel for short messages between instances
this way, any work done in the child instance will not block your master event loop.

Develop a clock and workers in node.js on heroku

I'm working on a service that needs to analyze data from social media networks every five minutes for different users. I'm developing it in node.js and I will implement it on Heroku.
According to this article on Heroku website, the best way to do that is separating the logic of the scheduler from the logic of the worker. In fact, the idea is to have one dyno dedicated to schedule tasks to avoid duplication. This dyno instructs a farm of workers (n dynos as needed) to do the tasks.
Here is the procfile of this architecture:
web: node web.js
worker: node worker.js
clock: node clock.js
The problem is how to implement it in node.js. I googled it, and the suggestion is to use message queue systems (like IronMQ, RabbitMQ or CloudAMQP). But I'm trying to set my code and app simple and with the minor need of add-ons.
The question is: is there a way to communicate directly from my scheduler (clock) to the worker dynos?
Thanks for your answers.
Heroku dynos do not have fixed IP addresses, so there is no way to open a direct connection between them. That's why you need to create a separate server instance with a static IP or other fixed endpoint that acts as a go-between.
You have at least two viable options: a RabbitMQ-type message queue, or a stripped down version using a pub-sub redis feed. I generally use the latter because it's quick, simple, and sufficiently robust for all my needs (e.g. if a message gets lost every once in a blue moon, it's no big deal). If, however, it is essential that you never lose a message, you should use a full-blown message queue like RabbitMQ.
Setting up the redis implementation is very straightforward. There are several redis add-ons (I use RedisCloud) with free and inexpensive plans. When you provision them, you get an endpoint to connect to and a password. Then you just connect your web dyno(s) and worker dyno(s) to your redis instance such that your web app publishes tasks to a channel and the worker subscribes to that channel.
If you need the web app to communicate with the client after task completion, you just create another channel for the worker to publish task completion messages and the web app to listen for them.
You'll never get duplication of tasks, as each time a worker receives a message it pops off the queue.
If I understood this correctly, you want to spin a clock as one app, and then spin workers as separate apps? Sure, there is a direct way. You open a connection from the clock app towards the worker app.
For example, have every worker open a client sockets connection to the clock. Then the clock can communicate to them and relay orders.
Or use WebRTC. That way the workers will talk to the clock, but they can also talk to each other.
Or make an (authenticated) HTTP(s) REST endpoint on the worker where it will receive tasks. Like, POST /tasks will create a task on the worker. If the task is short, it can reply right away, so that the clock knows the job is done. Or if it's a longer task, it can acknowledge it, but later call an endpoint on the clock to say it's done, something like PUT /tasks/32.
Or even more directly, open a direct net connection towards the clock, for example on worker start (and the other way around). Use dgram and send UDP messages between worker and clock.
In any way, I also believe that the people suggesting MQ like RabbitMQ is much better to just push jobs/tasks on. Then it can distribute tasks as needed, and based on unacked count on the job queue, it can spin up more workers when needed.
But your question is very broad, so to get more details, you could provide a little more details.
This might be helpful.
http://blog.andyjiang.com/intermediate-cron-jobs-with-heroku/
Basically you require the worker file directly into the clock file.
I solved in an easy way with the following three steps:
Set credit card information in Heroku account;
Installed Heroku Scheduler addon (you can use the command heroku addons:create scheduler:standard --app <yourAppName>)
Set up the script to run as scheduled job
More info here or here.

Resources