Worker Role vs Web Role with async controllers - azure

Several components of our system have "long running" operations. They can take anywhere from seconds to minutes, and vary in their CPU usage. E.g., report generation will peg a CPU for a few seconds, but data collection is largely spent waiting on a database query.
I am faced with two choices:
(1) Web role + worker role + queue + table. Worker role spins on a queue, gets a message with parameters, does work, updates table with progress and completion flags. Client spins, displaying progress until it's marked done. One web role, scale up number of worker roles as needed.
(2) Web role + async method. Make my long running operations use .NET 4.5's async/await stuff, and have the controller actions marked async. Scale up number of web roles as needed.
Option 1 obviously is much more complicated, but has the advantage of keeping the web roles free to do web stuff, and allowing proper queuing if things start getting really busy. Option 2 is simpler and will require fewer roles and storage resources, but doesn't it have the potential to choke up the entire website if things start getting busy? I am strongly leaning to option 2 just for simplicity. Is there any particular reason not to do this? If the website starts slowing down, simply increasing the number of web role instances will solve performance problems, right?

As with most architectural decisions, the answer is only "it depends".
In this case, option (2) is easier to code. If you're not expecting to scale massively, then I would say that's fine.
The key advantage to option (1) is that you have two knobs for scaling: your web roles handle web requests and your worker roles handle the work, and you can scale up your workers independently of your webs.
But unless you're going to scale massively, I wouldn't worry about it. Option (2) can scale quite well, just not with perfect efficiency. And if you do start scaling massively, you can (inefficiently) crank up scaling with option (2) and (presumably) use your massively scaling income to develop option (1).
P.S. You should use async for both options.

As Stephen posted, this is going to be an "It Depends" type of answer.
In this case I would probably go with either the first option, or an option in between where you have the web role, queue, and table, but create an executable that also runs on your web role as a startup executable.
With option 2, you're going to be open to connection timeouts, instance restarts, etc., and any additional demand for resources will have to be met by scaling your web roles or by a code rewrite.
With option 1 (as Stephen pointed out), your workload is separated and you can scale the roles independently. In addition, letting the queues and workers handle the work allows the workers to manage their own lifespan, and you can build in some resiliency for planned or unplanned restarts by not removing the queue items until you have finished the job (that way they will resurface if you crash in the middle). You also have the ability to take fuller advantage of the role resources (scale on threads first, then additional instances), as your polling mechanism will be controlling the workload rather than random people hitting the website. On the web side, you can then choose to either have the async method wait for completion and return, or return a token and let the client poll for completion of the work. Either case should be relatively simple code.
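The "don't remove the queue item until the job is finished" idea can be sketched with a small in-memory stand-in for a queue with a visibility timeout. This is only an illustration: `VisibilityQueue` and `process_one` are hypothetical names, time is passed in explicitly as `now` to keep the example deterministic, and a real Azure queue client would replace the class.

```python
import itertools


class VisibilityQueue:
    """In-memory stand-in (hypothetical) for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}            # message id -> [body, invisible_until]
        self._ids = itertools.count(1)

    def put(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, 0.0]
        return msg_id

    def get(self, now):
        """Return (id, body) of the first visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)

    def __len__(self):
        return len(self._messages)


def process_one(queue, handler, now):
    """Dequeue one message, run the handler, and delete only on success."""
    item = queue.get(now)
    if item is None:
        return False
    msg_id, body = item
    try:
        handler(body)
    except Exception:
        # Simulates a crash mid-job: the message is NOT deleted, so it
        # becomes visible again once the timeout expires.
        return False
    queue.delete(msg_id)
    return True
```

If the handler fails, the message stays in the queue and another worker (or the same one after a restart) picks it up once the timeout has passed. The flip side is that handlers should tolerate being run more than once for the same message.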
Option 1.5 may be the best starting point, though. If you're trying to start small, then using a background executable on your web roles will be the cheapest solution. You're going to want to have at least 2 web role instances to ensure coverage by the SLA, so this solution would let you start with just those two instances and no more. Build the executable that does the queue polling and report (or whatever) execution separately, then configure it as a startup task for the web role. This will let you keep the cost down while you're starting, and the code for those exes is separate and becomes the code for your worker role later if you need to expand. The biggest thing to watch for in this situation is that your executable handles all exceptions, because if this process exits due to an unhandled exception, it won't get restarted (unlike worker roles, which Azure will keep restarting every time they die).
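The "handle all exceptions so the process never exits" advice could look roughly like the loop below. It is a sketch, not a definitive implementation: `run_forever`, `poll_once`, and `should_stop` are hypothetical names, and `poll_once` stands for whatever dequeues and runs one job.

```python
import logging


def run_forever(poll_once, should_stop, log=logging.getLogger("worker")):
    """Call poll_once() until should_stop() is true, surviving job failures.

    Returns the number of failed iterations, which is handy for monitoring.
    """
    failures = 0
    while not should_stop():
        try:
            poll_once()
        except Exception:
            # Log and carry on: nothing restarts this process if it dies.
            # (KeyboardInterrupt and SystemExit are deliberately not caught.)
            failures += 1
            log.exception("job failed; continuing")
    return failures
```

A failing job is logged and skipped rather than taking the whole background process down with it.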

Related

Task scheduling behind multiple instances

Currently I am solving an engineering problem, and want to open the conversation to the SO community.
I want to implement a task scheduler. I have two separate instances of a nodeJS application sitting behind an elastic load balancer (ELB). The problem is that when both instances come up, they try to execute the same task logic, causing the tasks to run more than once.
My current solution is to use node-schedule to schedule tasks to run, then have them check the database to see whether the task has already been run within its specified run time interval.
The logic here is a little messy, and I am wondering if there is a more elegant way I could go about doing this.
Perhaps it is possible to set a particular env variable on a specific instance - so that only that instance will run the tasks.
What do you all think?
What you are describing appears to be a perfect example of a use case for AWS Simple Queue Service.
https://aws.amazon.com/sqs/details/
Key points to look out for in your solution:
Make sure that you pick a visibility timeout that is reflective of your workload (so messages don't reenter the queue whilst still in process by another worker)
Don't store your workload in the message, reference it! A message can only be up to 256 KB in size, and message sizes have an impact on performance and cost.
Make sure you understand billing! Requests are billed in 64 KB chunks, so one 220 KB message is charged as four 64 KB chunks (four requests).
If you make your messages small, you can save more money by doing batch requests as your bang for buck will be far greater!
Use long polling to retrieve messages to get the most value out of your message requests.
Grant your application permissions to SQS by the use of an EC2 IAM Role, as this is the best security practice and the recommended approach by AWS.
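The billing point above is simple arithmetic, sketched here under the answer's stated rule that every started 64 KB chunk of payload counts as one request. `billed_requests` is a hypothetical helper, not part of any AWS SDK.

```python
import math

CHUNK_BYTES = 64 * 1024  # billing chunk size stated in the answer


def billed_requests(payload_bytes):
    """Number of billed requests for one call carrying payload_bytes of data.

    Assumes each started 64 KB chunk counts as one request and that even
    an empty call is billed as one request.
    """
    return max(1, math.ceil(payload_bytes / CHUNK_BYTES))


one_large = billed_requests(220 * 1024)        # one 220 KB message
ten_small_batched = billed_requests(10 * 1024) # ten 1 KB messages in one batch
ten_small_separate = 10 * billed_requests(1024)
```

With these assumptions the 220 KB message costs four requests, ten 1 KB messages sent one by one cost ten, and the same ten messages sent as a single batch cost one, which is why small messages plus batching pay off.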
It's an excellent service, and should resolve your current need nicely.
Thanks!
Xavier.

How does one configure "Autoscale" to deal with Web instances which have long wait times due to external processes?

I am using MVC3, ASP.NET4.5, C#, Razor, EF6.1, SQL Azure
I have been doing some load testing using JMeter, and I have found some surprising results.
I have a test of 30 concurrent users, ramping up over 10 secs. The test plan is fairly simple:
Login
Navigate to page
Do query
Navigate back
Logout
I am using "small" "standard" instances.
I have noticed that web instances may be waiting on external processes, such as database queries, so the web CPU could be low, but the instance is still a bottleneck. The CPU could be idling at 40% while waiting for a result set from the DB, which could also be why the extra instance is not being triggered. This is a real issue: how do you trigger an extra instance based on longer wait times? At the moment the only way round this is to have 2 instances up there permanently, or to proactively scale against a schedule.
Use async calls and you won't have to worry about scaling up. The waiting threads will be asleep, freeing up resources to handle other users.
If you still see lengthened response times after that, it's probably the external process that's choking and in need of being scaled up.
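To illustrate why async calls help when requests spend most of their time waiting, here is a small sketch using Python's `asyncio` (the question is about ASP.NET, so this only demonstrates the principle, not the .NET API). `fake_query` is a hypothetical stand-in for a database call, and 30 matches the number of concurrent users in the load test.

```python
import asyncio
import time


async def fake_query(delay):
    """Stand-in for awaiting a database result; no thread is blocked."""
    await asyncio.sleep(delay)
    return delay


async def handle_requests(n, delay):
    """Serve n concurrent 'requests' that each wait delay seconds on I/O."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_query(delay) for _ in range(n)))
    return len(results), time.perf_counter() - start


count, elapsed = asyncio.run(handle_requests(30, 0.1))
print(f"{count} requests finished in {elapsed:.2f}s")
```

Handled one after another, thirty 0.1 s waits would take about 3 s; handled concurrently they overlap, so the total stays close to a single wait. If the database itself is saturated, none of this helps, which is the second point in the answer.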

Orchestrating a Windows Azure web role to cope with occasional high workload

I'm running a Windows Azure web role which, on most days, receives very low traffic, but there are some (foreseeable) events which can lead to a high amount of background work which has to be done. The background work consists of many database calls (Azure SQL) and HTTP calls to external web services, so it is not really CPU-intensive, but it requires a lot of threads which are waiting for the database or the web service to answer. The background work is triggered by a normal HTTP request to the web role.
I see two options to orchestrate this, and I'm not sure which one is better.
Option 1, Threads: When the request for the background work comes in, the web role starts as many threads as necessary (or queues the individual work items to the thread pool). In this option, I would configure a larger instance during the heavy workload, because these threads could require a lot of memory.
Option 2, Self-Invoking: When the request for the background work comes in, the web role which receives it generates an HTTP request to itself for every item of background work. In this option, I could configure several web role instances, because the load balancer of Windows Azure balances the HTTP requests across the instances.
Option 1 is somewhat more straightforward, but it has the disadvantage that only one instance can process the background work. If I want more than one Azure instance to participate in the background work, I don't see any other option than sending HTTP requests from the role to itself, so that the load balancer can delegate some of the work to the other instances.
Maybe there are other options?
EDIT: Some more thoughts about option 2: When the request for the background work comes in, the instance that receives it would save the work to be done in some kind of queue (either Windows Azure Queues or some SQL table which works as a task queue). Then, it would generate a lot of HTTP requests to itself, so that the load balancer 'activates' all of the role instances. Each instance then dequeues a task from the queue and performs the task, then fetches the next task etc. until all tasks are done. It's like occasionally using the web role as a worker role.
I'm aware this approach has a smelly air (abusing web roles as worker roles, HTTP requests to the same web role), but I don't see the real disadvantages.
EDIT 2: I see that I should have elaborated a little bit more about the exact circumstances of the app:
The app needs to do some small tasks all the time. These tasks usually don't take more than 1-10 seconds, and they don't require a lot of CPU work. On normal days, we have only 50-100 tasks to be done, but on 'special days' (New Year is one of them), there can be tens of thousands of tasks which have to be done inside a 1-2 hour window. The tasks are done in a web role, and we have a Cron Job which initiates the tasks every minute. So, every minute the web role receives a request to process new tasks, checks which tasks have to be processed, and adds them to some sort of queue (currently it's an SQL table with an UPDATE with OUTPUT INSERTED, but we intend to switch to Azure Queues sometime). Currently, the same instance processes the tasks immediately after queueing them, but this won't scale, since the serial processing of tens of thousands of tasks takes too long. That's the reason why we're looking for a mechanism to broadcast the event "tasks are available" from the initial instance to the others.
Have you considered using Queues for distribution of work? You can put the "tasks" which need to be processed in a queue and then distribute the work to many worker processes.
The problem I see with approach 1 is that it is a "Scale Up" pattern and not a "Scale Out" pattern. Deploying many small VM instances instead of one large instance will give you more scalability + availability IMHO. Furthermore, you mentioned that your jobs are not CPU intensive. If you consider X-Small instances, for the cost of 1 Small instance ($0.12 / hour) you can deploy 6 X-Small instances ($0.02 / hour each), and likewise for the cost of 1 Large instance ($0.48 / hour) you could deploy 24 X-Small instances.
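The instance-count comparison is easy to check. The hourly prices below are the ones quoted in this answer (they are historical and will not match current pricing); prices are held in cents so the division stays exact.

```python
# Hourly prices in cents, as quoted in the answer above (historical figures).
PRICE_CENTS_PER_HOUR = {"x-small": 2, "small": 12, "large": 48}


def equivalent_instances(budget_size, target_size):
    """How many target_size instances cost the same as one budget_size instance."""
    return PRICE_CENTS_PER_HOUR[budget_size] // PRICE_CENTS_PER_HOUR[target_size]


xsmall_per_small = equivalent_instances("small", "x-small")   # 12 // 2
xsmall_per_large = equivalent_instances("large", "x-small")   # 48 // 2
```

For I/O-bound tasks like the ones described, the extra instances mostly buy more concurrent waiting capacity and more fault domains, not more CPU.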
Furthermore it's easy to scale in case of a "Scale Out" pattern as you just add or remove instances. In case of "Scale Up" (or "Scale Down") pattern since you're changing the VM Size, you would end up redeploying the package.
Sorry, if I went a bit tangential :) Hope this helps.
I agree with Gaurav and others to consider one of the Azure Queue options. This is really a convenient pattern for cleanly separating concerns while also smoothing out the load.
This basic Queue-Centric Workflow (QCW) pattern has the work request placed on a queue in the handling of the Web Role's HTTP request (the mechanism that triggers the work, apparently done via a cron job that invokes wget). Then the IIS web server in the Web Role goes on doing what it does best: handling HTTP requests. It does not require any support from a load balancer.
The Web Role needs to accept requests as fast as they come (then enqueues a message for each), but the dequeue part is a pull so the load can easily be tuned for available capacity (or capacity tuned for the load! this is the cloud!). You can choose to handle these one at a time, two at a time, or N at a time: whatever your testing (sizing exercise) tells you is the right fit for the size VM you deploy.
As you probably also are aware, the RoleEntryPoint::Run method on the Web Role can also be implemented to do work continually. The default implementation on the Web Role essentially just sleeps forever, but you could implement an infinite loop to query the queue to remove work and process it (and don't forget to Sleep whenever no messages are available from the queue! failure to do so will cause a money leak and may get you throttled). As Gaurav mentions, there are some other considerations in robustly implementing this QCW pattern (what happens if my node fails, or if there's a bad ("poison") message, bug in my code, etc.), but your use case does not seem overly concerned with this since the next kick from the cron job apparently would account for any (rare, but possible) failures in the infrastructure and perhaps assumes no fatal bugs (so you can't get stuck with poison messages), etc.
Decoupling placing items on the queue from processing items from the queue is really a logical design point. By this I mean you could change this at any time and move the processing side (the code pulling from the queue) to another application tier (a service tier) rather easily without breaking any part of the essential design. This gives a lot of flexibility. You could even run everything on a single Web Role node (or two if you need the SLA - not sure you do based on some of your comments) most of the time (two-tier), then go three-tier as needed by adding a bunch of processing VMs, such as for the New Year.
The number of processing nodes could also be adjusted dynamically based on signals from the environment - for example, if the queue length is growing or above some threshold, add more processing nodes. This is the cloud and this machinery can be fully automated.
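A queue-length-based scaling rule can be as small as the function below. This is a sketch under assumed numbers: `desired_workers`, `tasks_per_worker`, and the min/max bounds are all hypothetical and would come from your own sizing exercise, and actually changing the instance count is left to whatever management API you use.

```python
import math


def desired_workers(queue_length, tasks_per_worker, min_workers=1, max_workers=20):
    """Pick a worker count from the current backlog.

    tasks_per_worker is how many queued tasks one node is expected to
    clear within the window you care about (an assumed, measured value).
    """
    needed = math.ceil(queue_length / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))
```

On a normal day with under a hundred tasks this stays at the minimum; with a backlog of 30,000 tasks and 500 tasks per worker it asks for 60 nodes and is capped at the configured maximum of 20.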
Now getting more speculative since I don't really know much about your app...
By using the Run method mentioned earlier, you might be able to eliminate the cron job as well and do that work in that infinite loop; this depends on complexity of cron scheduling of course. Or you could also possibly even eliminate the entire Web tier (the Web Role) by having your cron job place work request items directly on the queue (perhaps using one of the SDKs). You still need code to process the requests, which could of course still be your Web Role, but at that point could just as easily use a Worker Role.
[Adding as a separate answer to avoid SO telling me to switch to chat mode + bypass comments length limitation] & thinking out loud :)
I see your point. Basically through HTTP request, you're kind of broadcasting the availability of a new task to be processed to other instances.
So if I understand correctly, when an instance receives a request for the task to be processed, it pushes that request onto some kind of queue (like you mentioned, it could be either Windows Azure Queues [personally I would prefer that] or a SQL Azure database [I would not prefer that, because you would have to implement your own message locking algorithm]) and then broadcasts a message to all instances that some work needs to be done. The remaining instances (or maybe the instance which is broadcasting) can then see if they're free to process that task. One instance, depending on its availability, can then fetch the task from the queue and start processing it.
Assuming you used Windows Azure Queues, when an instance fetches the message, it immediately becomes unavailable to other instances for some amount of time (the visibility timeout period of Azure queues), thus avoiding duplicate processing of the task. If the task is processed successfully, the instance working on that task can delete the message.
If for some reason the task is not processed, it will automatically reappear in the queue after the visibility timeout period has expired. This however leads to another problem. Since your instances look for tasks based on a trigger (an incoming HTTP request) rather than by polling, how will you ensure that all tasks get done? Suppose you have just one task to process and it fails: unless another request arrives to trigger processing, that task will never get picked up again. Obviously this is unlikely to happen in practice, but it is something you might want to think about.
Does this make sense?
I would definitely go for a scale-out solution: less complex, more manageable, and better in pricing. Plus you have less risk of downtime in case of a deployment failure (of course the mechanism of fault and upgrade domains should cover that, but nevertheless). So for that matter I completely back Gaurav on this one!

Several worker roles more expensive?

What scenario is less expensive in $$$ using Windows Azure? And is it better to separate the two tasks. E-mails are rarely sent, but chat messages are posted all the time.
Having one worker role processing e-mails fetched from the Azure Queue every 10 seconds, and one worker role processing posted chat messages from the Azure Queue every 1 second.
Having one generic worker role that processes both e-mails sending and chat messages every 1 second.
Worker roles are most efficient when run at or near to full CPU capacity- you are, after all, paying by CPU hour for them. A useful way to achieve this is to combine worker roles such that all of your background jobs end up being performed in a single role.
A great way to run single worker role type architectures is to use some sort of generic worker role pattern: basically a plugin pattern whereby the worker role reads a message off the queue and uses some metadata encoded into the message (or the name of the queue) to determine the type of processing it requires. It will then go to blob storage to retrieve the .NET assembly to perform that type of processing, instantiate this into a new appdomain, and marshal the context into that assembly for processing.
This is covered in the Asynchronous Workloads session in the Windows Azure Platform Training Kit. This also contains a hands-on lab that guides you through a sample implementation of one of these types of approaches.
The folks from Lokad have a really elegant implementation including all the polish and administration mechanisms that you'd need if you did this properly. Their implementation is New BSD licensed and was the winner of the MSFT Azure Partner of the Year award last year. It's an essential part of almost every Azure project that I build. Highly recommended and trivial to integrate. http://code.google.com/p/lokad-cloud/
So in short, I prefer a generic worker role implemented as a plugin type pattern with dynamic type loading and instantiation.
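The dispatch half of that pattern (pick the processing code from metadata in the message) can be sketched as follows. Note the simplification: instead of fetching a .NET assembly from blob storage and loading it into a new appdomain, this uses a plain in-process registry. The message shape (`type` and `payload` fields) and all function names are assumptions made for the example.

```python
import json

HANDLERS = {}  # message type -> handler function


def handles(kind):
    """Decorator that registers a function as the handler for one message type."""
    def register(fn):
        HANDLERS[kind] = fn
        return fn
    return register


@handles("email")
def send_email(payload):
    return f"email to {payload['to']}"


@handles("chat")
def post_chat(payload):
    return f"chat in {payload['room']}"


def dispatch(raw_message):
    """Route one queue message to its handler based on the 'type' field."""
    message = json.loads(raw_message)
    handler = HANDLERS.get(message["type"])
    if handler is None:
        raise LookupError(f"no handler for message type {message['type']!r}")
    return handler(message["payload"])
```

A single worker loop can then call `dispatch` for every message it dequeues, and adding a new kind of background job means registering one more handler rather than deploying one more role.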
This all depends on your scaling strategy and how many instances you're going to need to run to handle your load.
If you're planning to take advantage of the supported SLA (99.95% uptime), you will need at least 2 instances for every role.
Thus, if you split them up, you will need at least 4 instances. If you keep them together, you'll need at least 2.
Processing 1 email per 10s and 1 chat message per second does not sound like a lot and I don't think you'll need more than 2 instances to handle everything.
However, if processing power gets to be lopsided (i.e. chat messages need more computing power than email messages) and total load exceeds 4 instances, I suggest splitting them up, so that you can scale the two processes separately.
You're charged based on how many hours you're running and how many CPU cores you're running. So if you spin up four small VMs all doing the same thing versus two small VMs doing one thing and two doing another, the cost is the same.

Run multiple WorkerRoles per instance

I have several WorkerRole that only do job for a short time, and it would be a waste of money to put them in a single instance each. We could merge them in a single one, but it'd be a mess and in the far future they are supposed to work independently when the load increases.
Is there a way to create a "multi role" WorkerRole in the same way you can create a "multi site" WebRole?
If not, I think I can create a "master worker role" that is able to load the assemblies from a given folder, look for RoleEntryPoint derived classes with reflection, create instances, and invoke the .Run() or .OnStart() method. This "master worker role" will also rethrow unexpected exceptions, and call .OnStop() in all sub RoleEntryPoints when .OnStop() is called in the master one. Would it work? What should I be aware of?
As mentioned by others, this is a very common technique for maximizing utilization of your instances. There are many examples and "frameworks" that abstract the worker infrastructure and the actual work you want to be done, including one in this (our) sample: http://msdn.microsoft.com/en-us/library/ff966483.aspx (scroll down to "inside the implementation")
The most common ways of triggering work are:
1. Time-scheduled workers (like "cron" jobs).
2. Message-based workers (work triggered by the presence of a message).
The code sample mentioned above implements further abstractions for #2 and is easily extensible for #1.
Bear in mind though that all interactions with queues are based on polling. The worker will not wake up with a new message on the queue. You need to actively query the queue for new messages. Querying too often will make Microsoft happy, but probably not you :-). Each query counts as a transaction that is billed (10K of those = $0.01). A good practice is to poll the queue for messages with some kind of delayed back-off. Also, get messages in batches.
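Polling with a delayed back-off, and what naive polling costs, could be sketched like this. All names are hypothetical; `receive_batch` stands for a call that fetches up to N messages at once, `sleep` is injected so the example can be run without waiting, and the price is the one quoted above (10K transactions = $0.01).

```python
def backoff_delays(initial, maximum, factor=2.0):
    """Yield initial, initial*factor, ... capped at maximum, forever."""
    delay = initial
    while True:
        yield delay
        delay = min(maximum, delay * factor)


def poll(receive_batch, handle, sleep, max_idle_polls, initial=1.0, maximum=60.0):
    """Poll until max_idle_polls consecutive empty polls have been seen.

    A real worker would loop forever; the idle limit only keeps this
    sketch finite. The delay resets as soon as a message arrives.
    """
    delays = backoff_delays(initial, maximum)
    idle = 0
    while idle < max_idle_polls:
        batch = receive_batch()
        if batch:
            for message in batch:
                handle(message)
            delays = backoff_delays(initial, maximum)
            idle = 0
        else:
            sleep(next(delays))
            idle += 1


def daily_poll_cost(interval_seconds, dollars_per_10k=0.01):
    """Cost of one queue polling at a fixed interval for 24 hours."""
    polls = 86400 / interval_seconds
    return polls / 10000 * dollars_per_10k
```

At the quoted price, one worker polling every second costs a little under nine cents a day, which is small per worker but multiplies across queues and instances; backing off to a 60 second ceiling while idle cuts the idle-time cost by roughly that factor.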
Finally, taking this to an extreme, you can also combine web roles and worker roles in a single instance. See here for an example: http://blog.smarx.com/posts/web-page-image-capture-in-windows-azure
Multiple worker roles provide a very clean implementation. However, the cost footprint for idle role instances is going to be much higher than a single worker role.
Role-combining is a common pattern I've seen, working with ISV's on their Windows Azure deployments. You can have a background thread that wakes up every so often and runs a process. Another common implementation technique is to use an Azure Queue to send a message representing a process to execute. You can have multiple queues if you want, or a single command queue. In any case, you would have a queue listener running in a background thread, which would run in each instance. The first one to get the message processes it. You could take it further, and have a timed process pushing those messages onto the queue (maybe every 24 hours, or every hour).
Aside from CPU and memory limits, just remember that a single role can only have a maximum of 5 endpoints (less if you're using Remote Desktop).
EDIT: As of September 2011, role configuration has become much more flexible, now that you have 25 Input endpoints (accessible from the outside world) and 25 Internal endpoints (used for communication between roles) across an entire deployment. The MSDN article is here
I recently blogged about overloading a Web Role, which is somewhat related.
While there's no real issue with the solutions that have been pointed out for finding ways to do multiple worker components within a single Worker Role, I just want you to keep in mind the entire point of having distinct Worker Roles defined in the first place is isolation in the face of faults. If you just shove everything into a single Worker Role instance, just one of those worker components behaving badly has the ability to take down every other worker component in that role. Now all of a sudden you're writing a lot of infrastructure to provide isolation and fault tolerance across components which is pretty much what Azure is there to provide for you.
Again, I'm not saying it's an absolute rule to strictly do one thing per role. There's a place where multiple components under a single Worker Role make sense (especially monetarily). I'm simply saying that you should keep in mind why it's designed this way in the first place and factor that in appropriately as you plan your architecture.
Why would a 'multi role' be a mess? You could write each worker role implementation as a loosely coupled component and then compose a Worker Role from all appropriate components.
When you later need to separate some of the responsibilities out to a separate worker role, you can compose a new worker role with only this component, while at the same time removing it from the old worker role.
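Composing a role from loosely coupled components, with a minimum of fault isolation between them, might look like the sketch below. `CompositeWorker` is a hypothetical name and each component is reduced to a callable that performs one unit of work. This only shields components from each other's exceptions; it does not give the process-level isolation that separate roles provide.

```python
class CompositeWorker:
    """Runs several independent worker components inside one host."""

    def __init__(self, components):
        self.components = dict(components)   # name -> callable doing one step

    def run_once(self):
        """Run every component once; one failing must not stop the others."""
        results = {}
        for name, step in self.components.items():
            try:
                results[name] = step()
            except Exception as exc:
                results[name] = exc
        return results

    def split_off(self, name):
        """Remove one component and return a new host containing only it."""
        return CompositeWorker({name: self.components.pop(name)})
```

Moving a responsibility to its own worker role later then amounts to calling `split_off` in spirit: the component's code is unchanged, only the host it is composed into differs.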
If you wanted to, you could employ late binding so that this could even be done without recompilation, but often I don't think that would be worth the effort.

Resources