Scheduling function calls in a stateless Node.js application

I'm trying to figure out a design pattern for scheduling events in a stateless Node back-end with multiple instances running simultaneously.
Use case example:
Create a message object with a publish date/time and save it to a database
Optionally update the publishing time or delete the object
When the publish time is reached, the message content is sent to a 3rd party API endpoint
Right now my best idea is to use bee-queue or bull to queue delayed jobs. Either should be able to store the state and ensure that the job is executed only once. However, I feel like it might introduce a single point of failure, especially when state has to live in Redis for months and I have to hope that a future version of the queue library can still read it.
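For illustration, the Bull approach I have in mind would look roughly like this (the queue name, Redis URL, and the loadMessage/sendToApi helpers are placeholders):

    const Queue = require('bull');

    const publishQueue = new Queue('publish', 'redis://127.0.0.1:6379');

    // On create: schedule the publish, keyed by the message's own id so the
    // job can be looked up again later.
    async function schedulePublish(message) {
      await publishQueue.add(
        { messageId: message.id },
        { delay: message.publishAt - Date.now(), jobId: message.id }
      );
    }

    // On update or delete: remove the pending job (and re-add it if needed).
    async function cancelPublish(messageId) {
      const job = await publishQueue.getJob(messageId);
      if (job) await job.remove();
    }

    // Worker side (any instance): Bull ensures a delayed job runs only once.
    publishQueue.process(async (job) => {
      const message = await loadMessage(job.data.messageId); // hypothetical DB read
      await sendToApi(message);                              // hypothetical 3rd-party call
    });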
Another option is a worker process that polls the database for upcoming events every n minutes, but this seems like a potential scaling issue down the line for a multi-tenant SaaS.
Are there more robust design patterns for solving this?

Don't worry about Redis breaking. It's pretty stable, and you can always pin the version if you need to.
For jobs that will be executed in the future, I would suggest a database with a disk store, like Mongo or Redis with persistence enabled. That way you survive a reboot, you don't have to reinvent the wheel, and you already have a nice set of tools for scalability.

Related

Solution for user-specific background job queues

I have been researching how to efficiently solve the following use case and I am struggling to find the best solution.
Basically I have a Node.js REST API which handles requests for users from a mobile application. We want some requests to launch background tasks outside of the req/res flow, because they are CPU-intensive or might just take a while to execute. We are trying to implement, or find an existing framework able to handle, different job queues in the following way (or at least one compatible with this use case):
Every user has their own set of job queues (there are different kinds of jobs).
The jobs within one specific queue have to be executed sequentially, one job at a time, but everything else can run in parallel (ideally no single queue hogs the workers, so all queues get roughly the same priority).
Some queues might fill up with hundreds of tasks at a given time but most likely they will be empty a lot of the time.
Queues need to be persistent.
We currently have a solution with RabbitMQ, with one queue for every kind of task, shared by all users. The users dump tasks into the same queues, which results in a queue filling up with tasks from a single user for a long time while the rest of the users wait for those tasks to finish before their own start to be consumed. We have looked into priority queues, but we don't think that's the way to go for our use case.
The first somewhat logical solution we thought of is to create temporary queues whenever a user needs to run background jobs and have them be deleted when empty. Nevertheless we are not sure if having that many queues is scalable and we are also struggling with dynamically creating RabbitMQ queues, exchanges, etc (we have even read somewhere that it might be an anti-pattern?).
We have been doing some more research, and maybe the way to go is something else entirely, such as Kafka, or Redis-based tooling like BullMQ or similar.
What would you recommend?
If you're on AWS, have you considered SQS? There is no limit on the number of standard queues you can create, and in-flight messages can reach up to 120k per queue. This would seem to satisfy your requirements above.
While the suggested SQS solution did prove to be very scalable, the amount of polling we would need to do (or the alternative of using SNS) made it less than optimal. On the other hand, implementing a self-made solution via database polling was too much for our use case, and we did not have the time or computational resources to add a new database to our stack.
Luckily, we found that the Pro version of BullMQ has a "Groups" feature which applies a round-robin strategy to different groups of tasks within a single queue. This ended up fitting our use case perfectly and is what we went with.
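For anyone who finds this later, the Groups usage looks roughly like this (a sketch based on the BullMQ Pro docs; the queue name, connection, and processTask handler are illustrative):

    const { QueuePro, WorkerPro } = require('@taskforcesh/bullmq-pro');

    const connection = { host: 'localhost', port: 6379 };
    const queue = new QueuePro('tasks', { connection });

    // Jobs sharing a group id are kept together; the workers consume the
    // groups round-robin, so no single user can hog the queue.
    async function enqueueForUser(userId, payload) {
      await queue.add('task', payload, { group: { id: userId } });
    }

    // group.concurrency: 1 keeps each user's jobs strictly sequential while
    // different groups are still processed in parallel.
    const worker = new WorkerPro(
      'tasks',
      async (job) => processTask(job.data), // hypothetical task handler
      { connection, group: { concurrency: 1 } }
    );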

Task scheduling behind multiple instances

Currently I am solving an engineering problem, and want to open the conversation to the SO community.
I want to implement a task scheduler. I have two separate instances of a Node.js application sitting behind an elastic load balancer (ELB). The problem is that when both instances come up, they both try to execute the same task logic, causing the tasks to run more than once.
My current solution is to use node-schedule to schedule tasks to run, then have each task check the database to verify that it hasn't already run within its specified interval.
The logic here is a little messy, and I am wondering if there is a more elegant way I could go about doing this.
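One way I've considered to make that check less messy is to fold it into a single atomic UPDATE, so exactly one instance wins the right to run the task (sketched with node-postgres; the table and task names are made up):

    const schedule = require('node-schedule');
    const { Pool } = require('pg');

    const pool = new Pool(); // connection settings come from the usual PG* env vars

    // Both instances fire at the same time, but only the one whose UPDATE
    // actually changes the row (rowCount === 1) runs the task.
    schedule.scheduleJob('*/5 * * * *', async () => {
      const { rowCount } = await pool.query(
        `UPDATE scheduled_tasks
            SET last_run = now()
          WHERE name = $1
            AND last_run < now() - interval '4 minutes'`,
        ['cleanup']
      );
      if (rowCount === 1) {
        await runCleanup(); // hypothetical task body
      }
    });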
Perhaps it is possible to set a particular env variable on a specific instance - so that only that instance will run the tasks.
What do you all think?
What you are describing appears to be a perfect example of a use case for AWS Simple Queue Service.
https://aws.amazon.com/sqs/details/
Key points to look out for in your solution (a short consumer sketch follows this list):
Make sure that you pick a visibility timeout that is reflective of your workload (so messages don't re-enter the queue whilst still being processed by another worker).
Don't store your workload in the message, reference it! A message can only be up to 256KB in size, and message size has an impact on performance and cost.
Make sure you understand billing! Requests are charged in 64KB chunks, meaning one 220KB message is charged as 4 x 64KB chunks/requests.
If you make your messages small, you can save more money by doing batch requests as your bang for buck will be far greater!
Use long polling to retrieve messages to get the most value out of your message requests.
Grant your application permissions to SQS via an EC2 IAM role, as this is security best practice and the approach recommended by AWS.
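For example, a minimal long-polling consumer applying these points might look like this (assuming the AWS SDK v3 for JavaScript; the queue URL and processTask handler are placeholders):

    const {
      SQSClient,
      ReceiveMessageCommand,
      DeleteMessageCommand,
    } = require('@aws-sdk/client-sqs');

    const client = new SQSClient({});
    const QueueUrl = process.env.QUEUE_URL;

    async function poll() {
      while (true) {
        const { Messages = [] } = await client.send(new ReceiveMessageCommand({
          QueueUrl,
          MaxNumberOfMessages: 10, // batch receives for better bang for buck
          WaitTimeSeconds: 20,     // long polling
          VisibilityTimeout: 120,  // longer than your worst-case processing time
        }));
        for (const msg of Messages) {
          await processTask(JSON.parse(msg.Body)); // body is a small reference, not the payload
          await client.send(new DeleteMessageCommand({
            QueueUrl,
            ReceiptHandle: msg.ReceiptHandle,
          }));
        }
      }
    }

    poll();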
It's an excellent service, and should resolve your current need nicely.
Thanks!
Xavier.

Scaling and selecting unique records per worker

I'm part of a project where we have to deal with a lot of data in a stream. It's going to be passed to Mongo, and from there it needs to be processed by workers to decide whether it needs to be persisted, amongst other things, or discarded.
We want to scale this horizontally. My question is, what methods are there for ensuring that each worker selects a unique record, that isn't already being processed by another worker?
Is a central main worker required to hand out jobs to the sub-workers? If that is the case, the bottleneck and point of failure is that central worker, right?
Any ideas or suggestions welcome.
Thanks!
Josh
You can use findAndModify to both select and flag a document atomically, making sure that only one worker gets to process it. My experience is that this can be slow due to excessive database locking, but that experience is based on MongoDB 2.x so it may not be an issue anymore on 3.x.
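In the current MongoDB Node.js driver, findAndModify is exposed as findOneAndUpdate; a minimal claim looks something like this (collection and field names are illustrative):

    const { MongoClient } = require('mongodb');

    const client = new MongoClient('mongodb://localhost:27017');
    const records = client.db('stream').collection('records');

    // Atomically claim one unprocessed record: the match and the update happen
    // as a single operation, so no two workers can get the same document.
    async function claimNext(workerId) {
      return records.findOneAndUpdate(
        { status: 'new' },
        { $set: { status: 'processing', claimedBy: workerId, claimedAt: new Date() } },
        { returnDocument: 'after' } // driver v6+: resolves to the document, or null
      );
    }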
Also, with MongoDB it's difficult to "wait" for new jobs/documents (you can tail the oplog, but you'd have to do this from every worker and each one will wake up and perform the findAndModify() query, resulting in the aforementioned locking).
I think that ultimately you should consider using a proper messaging solution (write the data to MongoDB, write the _id to the broker, have the workers subscribe to the message queue, and if you configure things properly only one worker will get a job). Well-known brokers are RabbitMQ and nsq.io, and with a bit of extra work you can even use Redis.
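With RabbitMQ via amqplib, for instance, that pattern might look like this (the queue name and the processRecord step are placeholders):

    const amqp = require('amqplib');
    const { MongoClient, ObjectId } = require('mongodb');

    async function main() {
      const mongo = await MongoClient.connect('mongodb://localhost:27017');
      const records = mongo.db('stream').collection('records');

      const conn = await amqp.connect('amqp://localhost');
      const ch = await conn.createChannel();
      await ch.assertQueue('records', { durable: true });

      // prefetch(1): the broker hands each worker one unacked message at a
      // time, so every job goes to exactly one worker.
      ch.prefetch(1);
      ch.consume('records', async (msg) => {
        // producer side does: ch.sendToQueue('records', Buffer.from(id), { persistent: true })
        const doc = await records.findOne({ _id: new ObjectId(msg.content.toString()) });
        await processRecord(doc); // hypothetical processing step
        ch.ack(msg);
      });
    }

    main();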

Orchestrating a Windows Azure web role to cope with occasional high workload

I'm running a Windows Azure web role which, on most days, receives very low traffic, but there are some (foreseeable) events which can lead to a high amount of background work which has to be done. The background work consists of many database calls (Azure SQL) and HTTP calls to external web services, so it is not really CPU-intensive, but it requires a lot of threads which are waiting for the database or the web service to answer. The background work is triggered by a normal HTTP request to the web role.
I see two options to orchestrate this, and I'm not sure which one is better.
Option 1, Threads: When the request for the background work comes in, the web role starts as many threads as necessary (or queues the individual work items to the thread pool). In this option, I would configure a larger instance during the heavy workload, because these threads could require a lot of memory.
Option 2, Self-Invoking: When the request for the background work comes in, the web role which receives it generates a HTTP request to itself for every item of background work. In this option, I could configure several web role instances, because the load balancer of Windows Azure balances the HTTP requests across the instances.
Option 1 is somewhat more straightforward, but it has the disadvantage that only one instance can process the background work. If I want more than one Azure instance to participate in the background work, I don't see any other option than sending HTTP requests from the role to itself, so that the load balancer can delegate some of the work to the other instances.
Maybe there are other options?
EDIT: Some more thoughts about option 2: When the request for the background work comes in, the instance that receives it would save the work to be done in some kind of queue (either Windows Azure Queues or some SQL table which works as a task queue). Then, it would generate a lot of HTTP requests to itself, so that the load balancer 'activates' all of the role instances. Each instance then dequeues a task from the queue and performs the task, then fetches the next task etc. until all tasks are done. It's like occasionally using the web role as a worker role.
I'm aware this approach has a smelly air (abusing web roles as worker roles, HTTP requests to the same web role), but I don't see the real disadvantages.
EDIT 2: I see that I should have elaborated a little bit more about the exact circumstances of the app:
The app needs to do some small tasks all the time. These tasks usually don't take more than 1-10 seconds, and they don't require a lot of CPU work. On normal days we have only 50-100 tasks to be done, but on 'special days' (New Year is one of them) they can grow to several tens of thousands of tasks which have to be done inside a 1-2 hour window. The tasks are done in a web role, and we have a cron job which initiates the tasks every minute. So, every minute the web role receives a request to process new tasks; it checks which tasks have to be processed and adds them to some sort of queue (currently an SQL table with an UPDATE with OUTPUT INSERTED, but we intend to switch to Azure Queues sometime). Currently the same instance processes the tasks immediately after queueing them, but this won't scale, since serially processing several tens of thousands of tasks takes too long. That's the reason why we're looking for a mechanism to broadcast the event "tasks are available" from the initial instance to the others.
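In Node terms, that UPDATE-with-OUTPUT-INSERTED dequeue boils down to something like this (a sketch using the mssql package; the connection string, table, and column names are illustrative):

    const sql = require('mssql');

    async function main() {
      await sql.connect(process.env.SQL_CONN_STRING); // placeholder connection string

      // Claim the next pending task. OUTPUT INSERTED returns the row as it
      // looks after the update, and the UPDATE itself is atomic, so two
      // instances can never claim the same task.
      const result = await sql.query`
        UPDATE TOP (1) Tasks
           SET Status = 'processing'
        OUTPUT INSERTED.Id, INSERTED.Payload
         WHERE Status = 'pending'`;

      const task = result.recordset[0]; // undefined when the queue is empty
      if (task) await runTask(task);    // hypothetical task runner
    }

    main();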
Have you considered using queues for distribution of work? You can put the "tasks" which need to be processed in a queue and then distribute the work to many worker processes.
The problem I see with approach 1 is that it is a "scale up" pattern rather than a "scale out" pattern. Deploying many small VM instances instead of one large instance will give you more scalability + availability, IMHO. Furthermore, you mentioned that your jobs are not CPU-intensive. If you consider the X-Small instance, for the cost of 1 Small instance ($0.12/hour) you can deploy 6 X-Small instances ($0.02/hour), and likewise for the cost of 1 Large instance ($0.48) you could deploy 24 X-Small instances.
Furthermore, it's easy to scale with a "scale out" pattern, as you just add or remove instances. With a "scale up" (or "scale down") pattern, since you're changing the VM size, you would end up redeploying the package.
Sorry, if I went a bit tangential :) Hope this helps.
I agree with Gaurav and others to consider one of the Azure Queue options. This is really a convenient pattern for cleanly separating concerns while also smoothing out the load.
This basic Queue-Centric Workflow (QCW) pattern has the work request placed on a queue in the handling of the Web Role's HTTP request (the mechanism that triggers the work, apparently done via a cron job that invokes wget). Then the IIS web server in the Web Role goes on doing what it does best: handling HTTP requests. It does not require any support from a load balancer.
The Web Role needs to accept requests as fast as they come (then enqueues a message for each), but the dequeue part is a pull so the load can easily be tuned for available capacity (or capacity tuned for the load! this is the cloud!). You can choose to handle these one at a time, two at a time, or N at a time: whatever your testing (sizing exercise) tells you is the right fit for the size VM you deploy.
As you probably also are aware, the RoleEntryPoint::Run method on the Web Role can also be implemented to do work continually. The default implementation on the Web Role essentially just sleeps forever, but you could implement an infinite loop to query the queue to remove work and process it (and don't forget to Sleep whenever no messages are available from the queue! failure to do so will cause a money leak and may get you throttled). As Gaurav mentions, there are some other considerations in robustly implementing this QCW pattern (what happens if my node fails, or if there's a bad ("poison") message, bug in my code, etc.), but your use case does not seem overly concerned with this since the next kick from the cron job apparently would account for any (rare, but possible) failures in the infrastructure and perhaps assumes no fatal bugs (so you can't get stuck with poison messages), etc.
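The shape of that loop, sketched in today's terms with the Azure Storage Queue SDK for JavaScript rather than a .NET RoleEntryPoint (the queue name and handleTask are placeholders):

    const { QueueClient } = require('@azure/storage-queue');

    const queue = new QueueClient(process.env.AZURE_STORAGE_CONNECTION_STRING, 'tasks');

    async function run() {
      while (true) {
        const { receivedMessageItems } = await queue.receiveMessages({
          numberOfMessages: 1,
          visibilityTimeout: 60, // seconds the message stays hidden while we work on it
        });
        if (receivedMessageItems.length === 0) {
          // Nothing to do: sleep instead of hammering the queue (the "money leak").
          await new Promise((resolve) => setTimeout(resolve, 5000));
          continue;
        }
        const msg = receivedMessageItems[0];
        await handleTask(msg.messageText); // hypothetical task handler
        await queue.deleteMessage(msg.messageId, msg.popReceipt);
      }
    }

    run();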
Decoupling placing items on the queue from processing items from the queue is really a logical design point. By this I mean you could change this at any time and move the processing side (the code pulling from the queue) to another application tier (a service tier) rather easily without breaking any part of the essential design. This gives a lot of flexibility. You could even run everything on a single Web Role node (or two if you need the SLA - not sure you do based on some of your comments) most of the time (two-tier), then go three-tier as needed by adding a bunch of processing VMs, such as for the New Year.
The number of processing nodes could also be adjusted dynamically based on signals from the environment - for example, if the queue length is growing or above some threshold, add more processing nodes. This is the cloud and this machinery can be fully automated.
Now getting more speculative since I don't really know much about your app...
By using the Run method mentioned earlier, you might be able to eliminate the cron job as well and do that work in that infinite loop; this depends on complexity of cron scheduling of course. Or you could also possibly even eliminate the entire Web tier (the Web Role) by having your cron job place work request items directly on the queue (perhaps using one of the SDKs). You still need code to process the requests, which could of course still be your Web Role, but at that point could just as easily use a Worker Role.
[Adding as a separate answer to avoid SO telling me to switch to chat mode + bypass comments length limitation] & thinking out loud :)
I see your point. Basically, through an HTTP request, you're broadcasting the availability of a new task to the other instances.
So if I understand correctly, when an instance receives a request for a task to be processed, it pushes that request into some kind of queue (as you mentioned, it could be either Windows Azure Queues [personally, I would prefer that] or a SQL Azure database [I would not prefer that, because you would have to implement your own message-locking algorithm]) and then broadcasts a message to all instances that some work needs to be done. The remaining instances (or maybe the instance which is broadcasting) can then see if they're free to process that task. One instance, depending on its availability, can then fetch the task from the queue and start processing it.
Assuming you used Windows Azure Queues, when an instance fetched the message, it becomes unavailable to other instances immediately for some amount of time (visibility timeout period of Azure queues) thus avoiding duplicate processing of the task. If the task is processed successfully, the instance working on that task can delete the message.
If for some reason the task is not processed, it will automatically reappear in the queue after the visibility timeout period has expired. This however leads to another problem. Since your instances look for tasks based on a trigger (an incoming HTTP request) rather than polling, how will you ensure that all tasks get done? Suppose only one task is queued and it fails: since no new request arrives to trigger processing, that task will never get processed again. Obviously it won't happen often in practice, but it is something you might want to think about.
Does this make sense?
I would definitely go for a scale-out solution: less complex, more manageable, and better in pricing. Plus you have a lower risk of downtime in case of deployment failure (of course the mechanism of fault and upgrade domains should cover that, but nevertheless). So for that matter I completely back Gaurav on this one!

Which one to use: Windows services or threading

We have a web application built using ASP.NET 3.5 with SQL Server as the database. It is quite big and is used by around 300 super users to manage around 5000 staff.
Now we are implementing SMS functionality in the application, which means users will be able to send and receive SMS. Every two minutes the third party's SMS server is polled to check whether there are any new messages. Outgoing SMS are also held in a queue and sent at intervals of 15 to 30 minutes.
I want this checking and sending process to run in the background of the application all the time, even if the user closes the browser window.
I need some advice on how do I do this?
Will using a thread achieve this, or do I need to create a Windows service for it, or are there other options?
More information:
I want to execute a task on a timer. What will happen if I close the browser window? The task won't be completed, will it?
For example I am saving 10 records to the database in a time interval of 5 minutes, which means every 5 minutes when the timer tick event fires, a record is inserted into the database.
How do I run this task if I close the browser window?
I tried looking at Windows services, but how do I pass a generic collection of data to one for processing?
There really is no thread-versus-service choice: a service can be (and usually is!) multi-threaded, and a thread can start a service.
There are three basic choices:
Somehow start another thread running when a user logs in -- this is probably a very poor choice for what you want, as you cannot really keep it running once the user session is lost.
Write a fully fledged Windows service which starts on OS startup and continues running until the server is shut down. You can make it dependent on the SQL Server service, so it starts after the DB is available. This is the "best" solution but may be overkill for your purposes. Also, you need to know the services API to write it properly, as you need to respond correctly to shutdown and status requests.
You can schedule your task to run periodically using either the Windows scheduler or, preferably, the scheduler built into SQL Server (SQL Server Agent). I think this would be the most suitable option for your needs.
Distinguish between what the browser is doing and what's happening server-side.
Your web app is sitting server-side, waiting for requests from whatever browsers may be running, and servicing those requests; in doing so, I guess it may well put messages on a queue and look in a database for any new messages.
You want the daemon processor, which talks to the third-party SMS server, to be triggered by time rather than by browser activity. Either of your suggestions would work:
A completely independent service could run and work against the queues and database.
Your web app, which I assume is already a service, could spawn a thread.
In either case there are a few technical questions about avoiding race conditions between the browser-request processing and the daemon, but databases and queueing systems can deal with that.
So I would decide between stand-alone daemon and background thread like this:
Which is easier to implement? I'm a Java EE developer; I know that in my app server I have an API for specifying code to be run according to a timer, and the API deals with the threading issues, so for me that's very easy. I don't know what you have available. Timers are not quite as trivial as they may appear, so having a reliable API is beneficial. If this were a more complex requirement, where the daemon code were gnarly and might possibly interfere with the web app code, then I might prefer to keep it conspicuously separate.
Which is easier to deploy and administer? Deploy separate Web App and daemon, or deploy one thing. In the Java EE world we could have a single Enterprise Application with all the code, so that's a single thing to deploy, start and control.
One other thing to consider: scaling and resilience. You might choose to have more than one copy of your web app running, either to provide fail-over capabilities or just because you need the extra power. In which case, how many daemons would you have? Would it be a problem to have two daemons running? You might need some extra code to mediate between the two daemons, for example logging the time of the last run in the database so each daemon can say "Oh, my buddy already did the 10:30 job, I'll go back to sleep."