What are ways to trigger a worker process in Heroku? In particular, a process that is needed infrequently but quickly when needed, e.g. bluemoon.js.
Polling every second to read a task queue (which can be stored in a database) is the approach I can think of.
Trigger makes more sense to me for this case. Is there a way to directly trigger a worker process when needed? Or is there no real downside to frequent polling?
What you really want is a message queueing service like Amazon SQS, RabbitMQ, or something similar.
What message queueing services do is this:
You have your web dyno fire off a message into a messaging service that says "Hey! Run this task. Here's some data to process."
The message service then grabs this message, and relays it (quickly) to any of your worker dynos.
Your worker dynos then complete the work that needs to be done, and can communicate back to the messaging service that the job has been finished.
The reason the above pattern works so well is that these services are optimized for speed and cost -- they're VERY inexpensive to run (I'm a huge fan of Amazon SQS myself), have almost no overhead, and are incredibly fast.
The reason you DON'T want to poll a database (which is what most people think of when they imagine stuff like this) is because it's going to waste resources and cause problems later on:
You'll be constantly hitting your database server from your worker dynos, using a lot of unnecessary bandwidth / IO / CPU resources.
You'll be constantly hitting your database server making queries, which is going to slow down your database and reduce the amount of important queries it can run.
In general, for problems like this, a messaging service is the perfect solution!
Related
I'm trying to figure out a design pattern for scheduling events in a stateless Node back-end with multiple instances running simultaneously.
Use case example:
Create an message object with a publish date/time and save it to a database
Optionally update the publishing time or delete the object
When the publish time is reached, the message content is sent to a 3rd party API endpoint
Right now my best idea is to use bee-queue or bull to queue delayed jobs. It should be able to store the state and ensure that the job is executed only once. However, I feel like it might introduce a single point of failure, especially when maintaining state on Redis for months and then hoping that the future version of the queue library is still working.
Another option is a service worker that polls the database for upcoming events every n minutes, but this seems like a potential scaling issue down the line for multi-tenant SaaS.
Are there more robust design patterns for solving this?
Don't worry about redis breaking. It's pretty stable, and eventually you can decide to freeze the version.
If there are jobs that will be executed in the future I would suggest a database, like Mongo or Redis, with a disk-store. So you will survive a reboot, you don't have to reinvent the wheel, and already have a nice set of tools for scalability.
In Node.js cluster mode, if multiple jobs exist in the event loop for one process, should the current job crash the process, what happens to the remaining job?
I'm assuming the remaining jobs in the event loop would go unfulfilled or return a server error. My question is, why is this an acceptable risk? Why would someone opt to use Node.js cluster mode in production then, rather than use something like PHP in production, where there is no risk of this, because PHP handles each request in its own process.
Edit:
Obviously this doesn't just apply to Node.js cluster mode. It can happen on a single instance, in which case obviously the end user would just get a server error. Cluster mode just happens to be my personal use case.
I'm looking for a way to pick back up a job in the queue job should a previous job cause the process to exit, before the subsequent job gets a change to be fulfilled. I am currently reading about how you can use a tool like RabbitMQ to handle your job queue outside of the node.js cluster, and each cluster instance just pulls jobs from the RabbitMQ queue. If anyone has any input on that, that would also be greatly appreciated.
If multiple jobs exist in the event loop for one process. What happens to the remaining jobs if the current job crashes the process?
If a node.js process crashes, the same thing happens to it that happens to any other process. All open sockets get automatically disconnected and the client will receive an immediate close on their socket (socket connection dropped essentially).
If you were using a Java server that was in the middle of handling 10 requests (perhaps in threads) and it crashed, the consequences would be the same. All 10 socket connections would get dropped.
If process isolation from one request to another is your #1 criteria for selecting a server environment, then I guess you wouldn't pick any environment that ever serves multiple requests from the same process. But, you would give up a lot of get that. One of the reasons for the node.js design is that is scales really, really well for a high number of concurrent connections that are all doing mostly I/O things (disk, networking, database stuff, etc...) which happens to be most web servers. Whereas a design that fires up a new process for every incoming connection does not scale as well for a large number of concurrent connections because a process is a much more heavy-weight thing in the eyes of the operating system (memory usage, other system resource usage, task switching overhead, etc...) than the way node.js does things.
And, there are obviously hundreds of other considerations too when choosing a server environment. So, you kind of have to look at the whole picture of what you're designing for and make the best set of tradeoffs.
In general, I wouldn't put this issue anywhere on the radar for why you should choose one over the other unless you expect to be running risky code (perhaps out of your control) that crashes a lot and this issue is therefore more important in your deployment than all the other differences. And, if that was the case, I'd probably isolate the risky code to its own process (even when using nodejs) to alleviate any pain from that crash. You could have a process pool waiting to process risky things. For example, if you were running code submitted by a user, I might run that code in its own isolated VM.
If you're just worried about your own code crashing a lot, then you probably have bigger problems and need more extensive unit testing, more robust error handling and need to take advantage of other tools just as a linter and other code analysis tools to find potential problem areas. With proper design, implementation and error handling, you should be able to keep a single incoming request from harming anything other than itself. That's certainly the philosophy that every server environment that serves multiple requests from the same process advises and the people/companies deploying those servers use.
I have a service bus trigger function that when receiving a message from the queue will do a simple db call, and then send out emails/sms. Can I run > 1000 calls in my service bus queue to trigger a function to run simultaneously without the run time being affected?
My concern is that I queue up 1000+ messages to trigger my function all at the same time, say 5:00PM to send out emails/sms. If they end up running later because there is so many running threads the users receiving the emails/sms don't get them until 1 hour after the designated time!
Is this a concern and if so is there a remedy?
FYI - I know I can make the function run asynchronously, would that make any difference in this scenario?
1000 messages is not a big number. If your e-mail/sms service can handle them fast, the whole batch will be gone relatively quickly. Few things to know though:
Functions won't scale to 1000 parallel executions in this case. They will start with 1 instance doing ~16 parallel calls at the same time, and then observe how fast the processing goes, then maybe add a second instance, wait again etc.
The exact scaling behavior is not publicly described and can change over time. Thus, YMMV, and you need to test against your specific scenario.
Yes, make the functions async whenever you can. I don't expect a huge boost in processing speed just because of that, but it certainly won't hurt.
Bottom line: your scenario doesn't sound like a problem for Functions, but if you need very short latency, you'll have to run a test before relying on it.
I'm assuming you are talking about an Azure Service Bus Binding to an Azure Function. There should be no issue with >1000 Azure Functions firing at the same time. They are a Serverless runtime and should be able to scale greatly if you are running under a consumption model. If you are running the functions in a service plan, you may be limited by the service plan.
In your scenario you are probably more likely to overwhelm the downstream dependencies: the database and SMS sending system, before you overwhelm the Azure Functions infrastructure.
The best thing to do is to do some load testing, and monitor the exceptions coming out of the connections to the database and SMS systems.
I've seen some older posts touching on this topic but I wanted to know what the current, modern approach is.
The use case is: (1) assume you want to do a long running task on a video file, say 60 seconds long, say jspm install that can take up to 60 seconds. (2) you can NOT subdivide the task.
Other requirements include:
need to know when a task finishes
nice to be able to stop a running task
stability: if one task dies, it doesn't bring down the server
needs to be able to handle 100s of simultaneous requests
I've seen these solutions mentioned:
nodejs child process
webworkers
fibers - not used for CPU-bound tasks
generators - not used for CPU-bound tasks
https://adambom.github.io/parallel.js/
https://github.com/xk/node-threads-a-gogo
any others?
Which is the modern, standard-based approach? Also, if nodejs isn't suited for this type of task, then that's also a valid answer.
The short answer is: Depends
If you mean a nodejs server, then the answer is no for this use case. Nodejs's single-thread event can't handle CPU-bound tasks, so it makes sense to outsource the work to another process or thread. However, for this use case where the CPU-bound task runs for a long time, it makes sense to find some way of queueing tasks... i.e., it makes sense to use a worker queue.
However, for this particular use case of running JS code (jspm API), it makes sense to use a worker queue that uses nodejs. Hence, the solution is: (1) use a nodejs server that does nothing but queue tasks in the worker queue. (2) use a nodejs worker queue (like kue) to do the actual work. Use cluster to spread the work across different CPUs. The result is a simple, single server that can handle hundreds of requests (w/o choking). (Well, almost, see the note below...)
Note:
the above solution uses processes. I did not investigate thread solutions because it seems that these have fallen out of favor for node.
the worker queue + cluster give you the equivalent of a thread pool.
yea, in the worst case, the 100th parallel request will take 25 minutes to complete on a 4-core machine. The solution is to spin up another worker queue server (if I'm not mistaken, with a db-backed worker queue like kue this is trivial---just make each point server point to the same db).
You're mentioning a CPU-bound task, and a long-running one, that's definitely not a node.js thing. You also mention hundreds of simultaneous tasks.
You might take a look at something like Gearman job server for things like that - it's a dedicated solution.
Alternatively, you can still have Node.js manage the requests, just not do the actual job execution.
If it's relatively acceptable to have lower then optimal performance, and you want to keep your code in JavaScript, you can still do it, but you should have some sort of job queue - something like Redis or RabbitMQ comes to mind.
I think job queue will be a must-have requirement for long-running, hundreds/sec tasks, regardless of your runtime. Except if you can spawn this job on other servers/services/machines - then you don't care, your Node.js API is just a front and management layer for the job cluster, then Node.js is perfectly ok for the job, and you need to focus on that job cluster, and you could then make a better question.
Now, node.js can still be useful for you here, it can help manage and hold those hundreds of tasks, depending where they come from (ie. you might only allow requests to go through to your job server for certain users, or limit the "pause" functionality to others etc.
Easily perform Concurrent Execution to LongRunning Processes using Simple ConcurrentQueue. Feel free to improve and share feedback.
👨🏻💻 Create your own Custom ConcurrentExecutor and set your concurrency limit.
🔥 Boom you got all your long-running processes run in concurrent mode.
For Understanding you can have a look:
Concurrent Process Executor Queue
I'm running a Windows Azure web role which, on most days, receives very low traffic, but there are some (foreseeable) events which can lead to a high amount of background work which has to be done. The background work consists of many database calls (Azure SQL) and HTTP calls to external web services, so it is not really CPU-intensive, but it requires a lot of threads which are waiting for the database or the web service to answer. The background work is triggered by a normal HTTP request to the web role.
I see two options to orchestrate this, and I'm not sure which one is better.
Option 1, Threads: When the request for the background work comes in, the web role starts as many threads as necessary (or queues the individual work items to the thread pool). In this option, I would configure a larger instance during the heavy workload, because these threads could require a lot of memory.
Option 2, Self-Invoking: When the request for the background work comes in, the web role which receives it generates a HTTP request to itself for every item of background work. In this option, I could configure several web role instances, because the load balancer of Windows Azure balances the HTTP requests across the instances.
Option 1 is somewhat more straightforward, but it has the disadvantage that only one instance can process the background work. If I want more than one Azure instance to participate in the background work, I don't see any other option than sending HTTP requests from the role to itself, so that the load balancer can delegate some of the work to the other instances.
Maybe there are other options?
EDIT: Some more thoughts about option 2: When the request for the background work comes in, the instance that receives it would save the work to be done in some kind of queue (either Windows Azure Queues or some SQL table which works as a task queue). Then, it would generate a lot of HTTP requests to itself, so that the load balancer 'activates' all of the role instances. Each instance then dequeues a task from the queue and performs the task, then fetches the next task etc. until all tasks are done. It's like occasionally using the web role as a worker role.
I'm aware this approach has a smelly air (abusing web roles as worker roles, HTTP requests to the same web role), but I don't see the real disadvantages.
EDIT 2: I see that I should have elaborated a little bit more about the exact circumstances of the app:
The app needs to do some small tasks all the time. These tasks usually don't take more than 1-10 seconds, and they don't require a lot of CPU work. On normal days, we have only 50-100 tasks to be done, but on 'special days' (New Year is one of them), they could go into several 10'000 tasks which have to be done inside of a 1-2 hour window. The tasks are done in a web role, and we have a Cron Job which initiates the tasks every minute. So, every minute the web role receives a request to process new tasks, so it checks which tasks have to be processed, adds them to some sort of queue (currently it's an SQL table with an UPDATE with OUTPUT INSERTED, but we intend to switch to Azure Queues sometime). Currently, the same instance processes the tasks immediately after queueing them, but this won't scale, since the serial processing of several 10'000 tasks takes too long. That's the reason why we're looking for a mechanism to broadcast the event "tasks are available" from the initial instance to the others.
Have you considered using Queues for distribution of work? You can put the "tasks" which needs to be processed in queue and then distribute the work to many worker processes.
The problem I see with approach 1 is that I see this as a "Scale Up" pattern and not "Scale Out" pattern. By deploying many small VM instances instead of one large instance will give you more scalability + availability IMHO. Furthermore you mentioned that your jobs are not CPU intensive. If you consider X-Small instance, for the cost of 1 Small instance ($0.12 / hour), you can deploy 6 X-Small instances ($0.02 / hour) and likewise for the cost of 1 Large instance ($0.48) you could deploy 24 X-Small instances.
Furthermore it's easy to scale in case of a "Scale Out" pattern as you just add or remove instances. In case of "Scale Up" (or "Scale Down") pattern since you're changing the VM Size, you would end up redeploying the package.
Sorry, if I went a bit tangential :) Hope this helps.
I agree with Gaurav and others to consider one of the Azure Queue options. This is really a convenient pattern for cleanly separating concerns while also smoothing out the load.
This basic Queue-Centric Workflow (QCW) pattern has the work request placed on a queue in the handling of the Web Role's HTTP request (the mechanism that triggers the work, apparently done via a cron job that invokes wget). Then the IIS web server in the Web Role goes on doing what it does best: handling HTTP requests. It does not require any support from a load balancer.
The Web Role needs to accept requests as fast as they come (then enqueues a message for each), but the dequeue part is a pull so the load can easily be tuned for available capacity (or capacity tuned for the load! this is the cloud!). You can choose to handle these one at a time, two at a time, or N at a time: whatever your testing (sizing exercise) tells you is the right fit for the size VM you deploy.
As you probably also are aware, the RoleEntryPoint::Run method on the Web Role can also be implemented to do work continually. The default implementation on the Web Role essentially just sleeps forever, but you could implement an infinite loop to query the queue to remove work and process it (and don't forget to Sleep whenever no messages are available from the queue! failure to do so will cause a money leak and may get you throttled). As Gaurav mentions, there are some other considerations in robustly implementing this QCW pattern (what happens if my node fails, or if there's a bad ("poison") message, bug in my code, etc.), but your use case does not seem overly concerned with this since the next kick from the cron job apparently would account for any (rare, but possible) failures in the infrastructure and perhaps assumes no fatal bugs (so you can't get stuck with poison messages), etc.
Decoupling placing items on the queue from processing items from the queue is really a logical design point. By this I mean you could change this at any time and move the processing side (the code pulling from the queue) to another application tier (a service tier) rather easily without breaking any part of the essential design. This gives a lot of flexibility. You could even run everything on a single Web Role node (or two if you need the SLA - not sure you do based on some of your comments) most of the time (two-tier), then go three-tier as needed by adding a bunch of processing VMs, such as for the New Year.
The number of processing nodes could also be adjusted dynamically based on signals from the environment - for example, if the queue length is growing or above some threshold, add more processing nodes. This is the cloud and this machinery can be fully automated.
Now getting more speculative since I don't really know much about your app...
By using the Run method mentioned earlier, you might be able to eliminate the cron job as well and do that work in that infinite loop; this depends on complexity of cron scheduling of course. Or you could also possibly even eliminate the entire Web tier (the Web Role) by having your cron job place work request items directly on the queue (perhaps using one of the SDKs). You still need code to process the requests, which could of course still be your Web Role, but at that point could just as easily use a Worker Role.
[Adding as a separate answer to avoid SO telling me to switch to chat mode + bypass comments length limitation] & thinking out loud :)
I see your point. Basically through HTTP request, you're kind of broadcasting the availability of a new task to be processed to other instances.
So if I understand correctly, when an instance receives request for the task to be processed, it pushes that request in some kind of queue (like you mentioned it could either be Windows Azure Queues [personally I would actually prefer that] or SQL Azure database [Not prefer that because you would have to implement your own message locking algorithm]) and then broadcast a message to all instances that some work needs to be done. Remaining instances (or may be the instance which is broadcasting it) can then see if they're free to process that task. One instance depending on its availability can then fetch the task from the queue and start processing that task.
Assuming you used Windows Azure Queues, when an instance fetched the message, it becomes unavailable to other instances immediately for some amount of time (visibility timeout period of Azure queues) thus avoiding duplicate processing of the task. If the task is processed successfully, the instance working on that task can delete the message.
If for some reason, the task is not processed, it will automatically reappear in the queue after visibility timeout period has expired. This however leads to another problem. Since your instances look for tasks based on a trigger (generating HTTP request) rather than polling, how will you ensure that all tasks get done? Assuming you get to process just one task and one task only and it fails since you didn't get a request to process the 2nd task, the 1st task will never get processed again. Obviously it won't happen in practical situation but something you might want to think about.
Does this make sense?
i would definitely go for a scale out solution: less complex, more manageable and better in pricing. Plus you have a lesser risk on downtime in case of deployment failure (of course the mechanism of fault and upgrade domains should cover that, but nevertheless). so for that matter i completely back Gaurav on this one!