Best way of implementing a fan-out pattern with Azure Service Bus

I'm looking for some best practices on how to implement the following pattern using (micro) services and a service bus. In this particular case Service Fabric services and an Azure service bus instance, but from a pattern point of view that might not even be that important.
Suppose a scenario in which we get a work package with a number of files in it. For processing, we want each individual file to be processed in parallel so that we can easily scale this operation onto a number of services. Once all processing completes, we can continue our business process.
So it's a fan-out, followed by a fan-in to gather all results and continue. For example's sake, let's say we have a ZIP file: we unzip it, have each file processed, and once all files are done we can continue.
The fan-out bit is easy. Unzip the file, for n files post n messages onto a service bus queue and have a number of services handle those in parallel. But now how do we know that these services have all completed their work?
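Before getting to that, here is a minimal sketch of the fan-out half, assuming the Azure.Messaging.ServiceBus SDK; the queue name, connection string, and the idea of stamping each message with the originating ZIP's ID and total file count are placeholders of my own, not anything Service Bus mandates.

```csharp
// Fan-out sketch: unzip locally, post one Service Bus message per file.
// Assumes the Azure.Messaging.ServiceBus package; names are placeholders.
using System;
using System.IO.Compression;
using Azure.Messaging.ServiceBus;

var connectionString = "<service-bus-connection-string>"; // placeholder
await using var client = new ServiceBusClient(connectionString);
ServiceBusSender sender = client.CreateSender("file-processing"); // hypothetical queue

using ZipArchive archive = ZipFile.OpenRead("package.zip"); // hypothetical input
string zipId = Guid.NewGuid().ToString(); // identifies the originating ZIP

foreach (ZipArchiveEntry entry in archive.Entries)
{
    var message = new ServiceBusMessage(entry.FullName) { CorrelationId = zipId };
    // Workers can use this to report back which package the file belongs to
    // and how many siblings it has.
    message.ApplicationProperties["TotalFiles"] = archive.Entries.Count;
    await sender.SendMessageAsync(message);
}
```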
A number of options I'm considering:
Next to sending a service bus message for each file, we also store the files in a table of sorts, along with the name of the originating ZIP file. Once a worker is done processing, it removes that file from the table again and checks whether it was the last one. If it was, we can post a message to indicate that the entire ZIP has now been processed.
Similar to 1, but instead the worker replies that it's done and the ZIP processing service then checks whether any work is left. A little cleaner, as the responsibility for that table now clearly lies with the ZIP processing service (a sketch of this bookkeeping follows the list below).
Have the ZIP processing service actively wait for all the reply messages in separate threads, but just typing this already makes my head hurt a bit.
Introduce a specific orchestrator service which takes the n messages and takes care of the fan-out / fan-in pattern. This would still require solution 2 as well, but it's now located in a separate service so we don't have any of this logic (+ storage) in the ZIP processing service itself.
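For the bookkeeping in options 1 and 2, the essential trick is an atomic decrement, so two workers finishing at the same time can't both think they were last. A minimal sketch, assuming the tracking table lives in SQL (the table, column names, and Microsoft.Data.SqlClient are my assumptions) and was seeded with the total file count during fan-out:

```csharp
// Option 2 bookkeeping sketch. Assumes Microsoft.Data.SqlClient and a
// hypothetical ZipProgress table seeded with RemainingFiles = <file count>.
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class ZipProgressTracker
{
    public static async Task<bool> MarkFileDoneAsync(string connectionString, string zipId)
    {
        const string sql = @"
            UPDATE ZipProgress
            SET RemainingFiles = RemainingFiles - 1
            OUTPUT INSERTED.RemainingFiles
            WHERE ZipId = @zipId";

        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();
        using var command = new SqlCommand(sql, connection);
        command.Parameters.AddWithValue("@zipId", zipId);

        int remaining = (int)await command.ExecuteScalarAsync();
        // True means this worker handled the last file; the caller can now
        // post the "entire ZIP processed" message.
        return remaining == 0;
    }
}
```

Because OUTPUT INSERTED returns the post-decrement value, exactly one worker observes zero, so the completion message is posted exactly once.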
I looked into whether Service Bus already has a feature of some sort to support this pattern, but could not find anything suitable. Durable Functions seems to support a scenario like this, but we're not using Functions within this project and I'd rather not start doing so just to implement this one pattern.
We're surely not the first ones to implement such a thing, so I was really hoping to find some real-world advice as to what works and what should be avoided at all cost.

Related

Trigger multiple concurrent service bus trigger azure functions without time degradation

I have a service bus trigger function that, when receiving a message from the queue, will do a simple DB call and then send out emails/SMS. Can I put more than 1000 messages in my service bus queue to trigger the function to run simultaneously without the run time being affected?
My concern is that I queue up 1000+ messages to trigger my function all at the same time, say 5:00 PM, to send out the emails/SMS. If they end up running later because there are so many running threads, the users receiving the emails/SMS won't get them until an hour after the designated time!
Is this a concern and if so is there a remedy?
FYI - I know I can make the function run asynchronously; would that make any difference in this scenario?
1000 messages is not a big number. If your email/SMS service can handle them fast, the whole batch will be gone relatively quickly. A few things to know, though:
Functions won't scale to 1000 parallel executions in this case. They will start with one instance doing ~16 parallel calls at a time, observe how fast the processing goes, then maybe add a second instance, wait again, etc.
The exact scaling behavior is not publicly described and can change over time. Thus, YMMV, and you need to test against your specific scenario.
Yes, make the functions async whenever you can. I don't expect a huge boost in processing speed just from that, but it certainly won't hurt.
Bottom line: your scenario doesn't sound like a problem for Functions, but if you need very short latency, you'll have to run a test before relying on it.
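For reference, a hedged sketch of what the async handler could look like, assuming the in-process C# Functions model; the queue name, the connection setting name, and the lookup/send helpers are placeholders I've made up:

```csharp
// Async Service Bus triggered function sketch (in-process model).
// Queue name, connection setting name, and the two helpers are placeholders.
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class NotificationFunction
{
    [FunctionName("SendNotification")]
    public static async Task Run(
        [ServiceBusTrigger("notifications", Connection = "ServiceBusConnection")] string message,
        ILogger log)
    {
        // Awaiting the I/O keeps the host free to pick up other messages
        // while the DB call and the email/SMS call are in flight.
        string recipient = await LookupRecipientAsync(message);
        await SendEmailOrSmsAsync(recipient, message);
        log.LogInformation("Notified {Recipient}", recipient);
    }

    // Stubs standing in for the real DB lookup and email/SMS sender.
    private static Task<string> LookupRecipientAsync(string message) =>
        Task.FromResult("user@example.com");

    private static Task SendEmailOrSmsAsync(string recipient, string body) =>
        Task.CompletedTask;
}
```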
I'm assuming you are talking about an Azure Service Bus binding to an Azure Function. There should be no issue with more than 1000 Azure Functions executions firing at the same time. Functions are a serverless runtime and should be able to scale out greatly if you are running under a consumption plan. If you are running the Functions in an App Service plan, you may be limited by that plan.
In your scenario you are more likely to overwhelm the downstream dependencies, the database and the SMS sending system, than the Azure Functions infrastructure.
The best thing to do is run some load testing and monitor the exceptions coming out of the connections to the database and SMS systems.

AWS multiple SQS queues and workers optimal design

I have the following task to implement using the AWS stack:
One job is triggered periodically and puts a message onto a queue (SQS). A worker receives this task and, based on it, creates additional tasks (approximately 1-10K of them). All of these tasks are put onto another queue, and additional workers process them.
This flow can be displayed in the following way:
Periodic task -> SQS -> worker_1 (creates more tasks) -> SQS -> workers_2
Based on project conventions and bureaucracy, it will take some time to create two separate services (worker_1, which listens for the periodic task and creates the fine-grained tasks, and workers_2, which just process those tasks), make Docker images, set up CI jobs, etc., and get it all deployed.
So, here is the trade-off:
1. Spend additional time and create two separate services. On the other hand, these services might be really simple, and it's even debatable whether they warrant two separate projects.
2. Make this a single service that puts messages onto the queue, also listens for messages on that same queue, and performs the work of both worker_1 and workers_2.
Any suggestions or thoughts are appreciated!
I don't think there can be a "correct" answer to this; you already have a good list of pros and cons for both options. Some additional things I thought of:
SQS queues don't really allow you to pick out specific types of messages; you pretty much need to read everything first-in-first-out. So if you share queues, you have less control over prioritizing messages.
For the two services to interact, they need a shared message definition. Sharing the same codebase would make it easier to dev and test the messaging code. Of course, it could also be a shared library.
Deploying both worker types in the same server/application would share resources, which might be more economical at the low end, or it might be confusing at high scale.
It may be possible to develop all the code in the same application and leave it as a deployment-time decision whether everything runs on the same server and queue or on separate servers reading from separate queues. This seems ideal to me.
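A sketch of that deployment-time choice, assuming the AWSSDK.SQS package: one codebase, with the queue URL and the worker role selected by configuration. The environment variable names and the handler bodies are placeholders of my own.

```csharp
// One worker codebase; which queue it reads and which role it plays is
// decided per deployment through configuration. Assumes AWSSDK.SQS.
using System;
using Amazon.SQS;
using Amazon.SQS.Model;

var queueUrl = Environment.GetEnvironmentVariable("QUEUE_URL");   // per-deployment setting
var role = Environment.GetEnvironmentVariable("WORKER_ROLE");     // "splitter" or "processor"
var sqs = new AmazonSQSClient();

while (true)
{
    var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
    {
        QueueUrl = queueUrl,
        MaxNumberOfMessages = 10,
        WaitTimeSeconds = 20   // long polling keeps the empty-queue cost down
    });

    foreach (Message message in response.Messages)
    {
        if (role == "splitter")
        {
            // worker_1 behavior: expand the periodic task in message.Body into
            // fine-grained tasks and enqueue them on the second queue (omitted).
        }
        else
        {
            // workers_2 behavior: process one fine-grained task (omitted).
        }
        await sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle);
    }
}
```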

How do Azure WebJobs prioritize messages when monitoring multiple queues?

I'm using Azure WebJobs as part of a project at work. These are configured as continuously running jobs that monitor a number of different queues. As queue messages are received they cause various API commands to be run. The issue I have is that some of the API commands run quickly (i.e. a few seconds) and some run slowly (several minutes), and I'm not sure how best to split the queue handlers between the WebJobs.
For example, I could put all of the slow API command handlers in one WebJob and all of the quick handlers in a different WebJob. My concern is that the "slow" WebJob process would always be busy whereas the "quick" WebJob process would be idling most of the time.
Another approach would be to mix quick and slow handlers in the same WebJob project. My concern with that would be the quicker handlers starving the slower ones of attention, or vice versa.
A third approach would be to have a separate WebJob for each individual message handler, but given the number of message types we have to deal with I'd rather not go down that route. It also seems like overkill to be honest.
I was wondering if anyone had encountered a similar scenario and could offer any insight into how Azure WebJobs choose which message to handle when they are monitoring multiple queues? Numerous internet searches have failed to turn up any guidance or help in this area. To be clear, I'm not really after opinions as to which approach people think would be best; I'm looking for answers from people who have actually dealt with this kind of problem and can say with some degree of certainty which of the different approaches would be best, given the way the Azure WebJobs API currently prioritizes queue message handling.
If you have multiple functions listening on different queues, the SDK will call them in parallel when messages are received simultaneously. You cannot set which queue should be processed first.
Depending on your configuration, you will handle them in parallel. If you think some executions will stall others, you can split the handling into multiple WebJobs and scale them separately.
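A minimal sketch of what that looks like, assuming the WebJobs SDK with Azure Storage queues; the queue names and the handler bodies are placeholders. Both functions live in one WebJob, and the SDK invokes them in parallel when both queues have messages.

```csharp
// Two queue handlers in a single WebJob; queue names are placeholders.
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class CommandFunctions
{
    public static void HandleQuickCommand(
        [QueueTrigger("quick-commands")] string command, ILogger logger)
    {
        logger.LogInformation("Quick command: {Command}", command);
        // a few seconds of work
    }

    public static void HandleSlowCommand(
        [QueueTrigger("slow-commands")] string command, ILogger logger)
    {
        logger.LogInformation("Slow command: {Command}", command);
        // several minutes of work; if these stall the quick ones, move this
        // handler into a separate WebJob and scale it independently
    }
}
```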

Orchestrating a Windows Azure web role to cope with occasional high workload

I'm running a Windows Azure web role which, on most days, receives very low traffic, but there are some (foreseeable) events which can lead to a high amount of background work which has to be done. The background work consists of many database calls (Azure SQL) and HTTP calls to external web services, so it is not really CPU-intensive, but it requires a lot of threads which are waiting for the database or the web service to answer. The background work is triggered by a normal HTTP request to the web role.
I see two options to orchestrate this, and I'm not sure which one is better.
Option 1, Threads: When the request for the background work comes in, the web role starts as many threads as necessary (or queues the individual work items to the thread pool). In this option, I would configure a larger instance during the heavy workload, because these threads could require a lot of memory.
Option 2, Self-Invoking: When the request for the background work comes in, the web role which receives it generates an HTTP request to itself for every item of background work. In this option, I could configure several web role instances, because the load balancer of Windows Azure balances the HTTP requests across the instances.
Option 1 is somewhat more straightforward, but it has the disadvantage that only one instance can process the background work. If I want more than one Azure instance to participate in the background work, I don't see any other option than sending HTTP requests from the role to itself, so that the load balancer can delegate some of the work to the other instances.
Maybe there are other options?
EDIT: Some more thoughts about option 2: When the request for the background work comes in, the instance that receives it would save the work to be done in some kind of queue (either Windows Azure Queues or some SQL table which works as a task queue). Then, it would generate a lot of HTTP requests to itself, so that the load balancer 'activates' all of the role instances. Each instance then dequeues a task from the queue and performs the task, then fetches the next task etc. until all tasks are done. It's like occasionally using the web role as a worker role.
I'm aware this approach has a smelly air (abusing web roles as worker roles, HTTP requests to the same web role), but I don't see the real disadvantages.
EDIT 2: I see that I should have elaborated a little bit more about the exact circumstances of the app:
The app needs to do small tasks all the time. These tasks usually don't take more than 1-10 seconds, and they don't require a lot of CPU work. On normal days we have only 50-100 tasks to be done, but on 'special days' (New Year is one of them) they can grow to several tens of thousands of tasks which have to be done inside a 1-2 hour window. The tasks are done in a web role, and we have a cron job which initiates the tasks every minute. So every minute the web role receives a request to process new tasks; it checks which tasks have to be processed and adds them to some sort of queue (currently an SQL table with an UPDATE with OUTPUT INSERTED, but we intend to switch to Azure Queues sometime). Currently the same instance processes the tasks immediately after queueing them, but this won't scale, since serially processing several tens of thousands of tasks takes too long. That's the reason why we're looking for a mechanism to broadcast the event "tasks are available" from the initial instance to the others.
Have you considered using queues for distribution of work? You can put the "tasks" which need to be processed in a queue and then distribute the work to many worker processes.
The problem I see with approach 1 is that it is a "scale up" pattern rather than a "scale out" pattern. Deploying many small VM instances instead of one large instance will give you more scalability + availability IMHO. Furthermore, you mentioned that your jobs are not CPU intensive. If you consider the X-Small instance, then for the cost of 1 Small instance ($0.12/hour) you can deploy 6 X-Small instances ($0.02/hour), and likewise for the cost of 1 Large instance ($0.48) you could deploy 24 X-Small instances.
Furthermore, it's easy to scale with a "scale out" pattern, as you just add or remove instances. With a "scale up" (or "scale down") pattern, since you're changing the VM size, you would end up redeploying the package.
Sorry if I went a bit tangential :) Hope this helps.
I agree with Gaurav and others to consider one of the Azure Queue options. This is really a convenient pattern for cleanly separating concerns while also smoothing out the load.
This basic Queue-Centric Workflow (QCW) pattern has the work request placed on a queue in the handling of the Web Role's HTTP request (the mechanism that triggers the work, apparently done via a cron job that invokes wget). Then the IIS web server in the Web Role goes on doing what it does best: handling HTTP requests. It does not require any support from a load balancer.
The Web Role needs to accept requests as fast as they come (and then enqueue a message for each), but the dequeue part is a pull, so the load can easily be tuned for the available capacity (or the capacity tuned for the load: this is the cloud!). You can choose to handle these one at a time, two at a time, or N at a time: whatever your testing (sizing exercise) tells you is the right fit for the size of VM you deploy.
As you are probably also aware, the RoleEntryPoint::Run method on the Web Role can be implemented to do work continually. The default implementation on the Web Role essentially just sleeps forever, but you could implement an infinite loop that queries the queue to remove work and process it (and don't forget to Sleep whenever no messages are available from the queue; failure to do so will cause a money leak and may get you throttled). As Gaurav mentions, there are some other considerations in robustly implementing this QCW pattern (what happens if my node fails, or if there's a bad ("poison") message, or a bug in my code, etc.), but your use case does not seem overly concerned with this, since the next kick from the cron job would account for any (rare, but possible) failures in the infrastructure, and perhaps you can assume no fatal bugs (so you can't get stuck with poison messages), etc.
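A hedged sketch of that Run-method loop, assuming the classic WindowsAzure.Storage client and the ServiceRuntime assembly of that era; the storage connection string, the queue name, and ProcessTask are placeholders.

```csharp
// RoleEntryPoint.Run polling loop sketch; names are placeholders.
using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public class WebRole : RoleEntryPoint
{
    public override void Run()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("tasks");

        while (true)
        {
            CloudQueueMessage message = queue.GetMessage();
            if (message == null)
            {
                // No work available: sleep instead of spinning, so the loop
                // doesn't burn CPU and storage transactions for nothing.
                Thread.Sleep(TimeSpan.FromSeconds(5));
                continue;
            }

            ProcessTask(message.AsString); // hypothetical task handler
            queue.DeleteMessage(message);
        }
    }

    private static void ProcessTask(string taskBody)
    {
        // Placeholder for the 1-10 second task described in the question.
    }
}
```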
Decoupling placing items on the queue from processing items from the queue is really a logical design point. By this I mean you could change this at any time and move the processing side (the code pulling from the queue) to another application tier (a service tier) rather easily without breaking any part of the essential design. This gives a lot of flexibility. You could even run everything on a single Web Role node (or two if you need the SLA - not sure you do based on some of your comments) most of the time (two-tier), then go three-tier as needed by adding a bunch of processing VMs, such as for the New Year.
The number of processing nodes could also be adjusted dynamically based on signals from the environment - for example, if the queue length is growing or above some threshold, add more processing nodes. This is the cloud and this machinery can be fully automated.
Now getting more speculative since I don't really know much about your app...
By using the Run method mentioned earlier, you might be able to eliminate the cron job as well and do that work in that infinite loop; this depends on complexity of cron scheduling of course. Or you could also possibly even eliminate the entire Web tier (the Web Role) by having your cron job place work request items directly on the queue (perhaps using one of the SDKs). You still need code to process the requests, which could of course still be your Web Role, but at that point could just as easily use a Worker Role.
[Adding as a separate answer to avoid SO telling me to switch to chat mode + bypass comments length limitation] & thinking out loud :)
I see your point. Basically, through the HTTP request, you're kind of broadcasting the availability of a new task to the other instances.
So if I understand correctly, when an instance receives a request for a task to be processed, it pushes that request into some kind of queue (as you mentioned, either Windows Azure Queues [personally I would prefer that] or a SQL Azure database [I would not prefer that, because you would have to implement your own message locking algorithm]) and then broadcasts a message to all instances that some work needs to be done. The remaining instances (or maybe the instance doing the broadcasting) can then see if they're free to process that task. One instance, depending on its availability, can then fetch the task from the queue and start processing it.
Assuming you use Windows Azure Queues: when an instance fetches a message, the message immediately becomes invisible to other instances for some amount of time (the visibility timeout period of Azure queues), thus avoiding duplicate processing of the task. If the task is processed successfully, the instance working on it can delete the message.
If for some reason the task is not processed, it will automatically reappear in the queue after the visibility timeout period has expired. This, however, leads to another problem. Since your instances look for tasks based on a trigger (the generated HTTP request) rather than by polling, how will you ensure that all tasks get done? Suppose you get to process just one task and only that one, and it fails; since you never get a request to process a 2nd task, the 1st task will never be processed again. Obviously this won't happen in a practical situation, but it's something you might want to think about.
Does this make sense?
I would definitely go for a scale-out solution: less complex, more manageable, and better in pricing. Plus you have a lower risk of downtime in case of a deployment failure (of course the mechanism of fault and upgrade domains should cover that, but nevertheless). So for that matter, I completely back Gaurav on this one!

Microsoft Azure Master-Slave worker roles

I am trying to port an application to the Azure platform. I want to run an existing application multiple times. My initial idea is as follows: I have a master_process and many slave_processes. Each process is a worker role in Azure. Each slave_process will run an instance of the application independently. I want the master_process to start many slave_processes and provide them with input arguments. At the end, the master_process will collect the results. Currently, I have a working setup for calling the whole application from a C# wrapper. So, for this to succeed, I need two things: first, I have to find a way to start slave workers inside of a master worker (just like threads); second, I need a way to store the results of the slave workers and reach these result files from the master worker. Can anyone help me?
I think I would try and solve the problem differently. Deploying a whole new instance can take 15 to 30 minutes. Adding extra instances to an already running worker role is a little quicker, but not by much. I'm going to presume that you want results faster than that and that this process is something that is run frequently.
I would have just one worker role type that runs your existing logic, and as many instances of that worker role as you determine you'll need. Whatever your client is will decide that it needs to break the work up into a certain number of pieces, let's say 10 for the sake of argument. It will give each piece of work an ID (e.g. a GUID) and then put 10 messages that contain the parameters and the ID into a queue. Your worker role instances take messages off the queue, do their work and write their results to storage somewhere (SQL Azure, Azure Table Storage or maybe even blob storage, depending on what the results are). The client polls that storage, waits for all of the results to be complete and then carries on.
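A hedged sketch of that client-side flow, assuming the classic WindowsAzure.Storage client; the queue and container names, the message format, and the convention that workers write one result blob per work item under the batch ID are all assumptions of mine.

```csharp
// Client side: enqueue one message per piece of work tagged with a batch ID,
// then poll blob storage until every result has appeared. Names are placeholders.
using System;
using System.Linq;
using System.Threading;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Queue;

var account = CloudStorageAccount.Parse("<storage-connection-string>");
CloudQueue workQueue = account.CreateCloudQueueClient().GetQueueReference("work-items");
CloudBlobContainer results = account.CreateCloudBlobClient().GetContainerReference("results");

string batchId = Guid.NewGuid().ToString();
string[] pieces = Enumerable.Range(0, 10).Select(i => $"piece-{i}").ToArray();

foreach (string piece in pieces)
{
    // Workers are expected to write their output to results/{batchId}/{piece}.
    workQueue.AddMessage(new CloudQueueMessage($"{batchId}|{piece}"));
}

// Poll until all results exist, then carry on with the rest of the process.
while (results.ListBlobs(prefix: batchId, useFlatBlobListing: true).Count() < pieces.Length)
{
    Thread.Sleep(TimeSpan.FromSeconds(10));
}
```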
If this is a process that is only run infrequently, then rather than having the worker roles deployed all of the time, you could use the same method I've described, but in addition get the client code to deploy the worker roles when it starts and then delete them when it's finished through the management API. There are samples on MSDN on how to use this.
I have a similar situation you might find useful:
I have a large sequential batch process I run in Azure which requires pre- and post-processing. The technique I used was to run instances of a single multifunctional worker role, but to use a "quorum" to nominate a head node, which then controls the workflow.
The way I do it is by using an Azure page blob as the quorum (basically a kind of global mutex/lock), because once a node grabs it for writing, it's locked. For resilience, in case there's an issue with the head node, all nodes occasionally try to recapture the quorum.
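A hedged sketch of that blob-lock idea, assuming the classic WindowsAzure.Storage client. I use a block blob here for brevity rather than the page blob the answer describes, and the container and blob names are placeholders; the point is that a blob lease acts as the distributed lock.

```csharp
// Head-node election via a blob lease; names are placeholders.
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class HeadNodeElection
{
    public static string TryBecomeHeadNode(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference("quorum");
        container.CreateIfNotExists();

        CloudBlockBlob blob = container.GetBlockBlobReference("head-node-lock");
        if (!blob.Exists())
        {
            blob.UploadText(string.Empty);
        }

        try
        {
            // A finite lease must be 15-60 seconds; the head node has to renew
            // it regularly, which is what lets another node take over if it dies.
            return blob.AcquireLease(TimeSpan.FromSeconds(60), proposedLeaseId: null);
        }
        catch (StorageException)
        {
            // Someone else currently holds the lease: act as a normal worker
            // and try again later.
            return null;
        }
    }
}
```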
