Suppose I have an Azure role with three instances running. I ask Azure to change the role count to two either by Management Portal or via Management API.
How will Azure decide which role to take down?
As British Developer mentioned, the Windows Azure Fabric Controller decides which instances to shut down, and you cannot control this process. I don't think it is always the highest-numbered instance, because I am not sure whether the fabric controller renames the instances after shutting one down. So even if it shuts down IN_1, at the end of the process we may still have IN_0 and IN_1, instead of IN_0 and IN_2 for example.
You can use the RoleEnvironment.Stopping event to handle a proper, clean shutdown of your instance. This event is raised after the VM has been taken out of load balancer rotation and before the OnStop method of your RoleEntryPoint class is called.
I am not sure where I read it, but I know there is a hard time limit in which you have to complete your cleanup; I believe the instance will be shut down after about 5 minutes of waiting on the OnStop or Stopping handler (I can't remember exactly, but the fabric controller will not wait forever for you to clean up).
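For illustration, here is a minimal sketch of wiring up that event in a RoleEntryPoint (the five-minute figure in the comment reflects the rough limit mentioned above, not a documented constant):

```csharp
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Raised after the instance has been pulled out of load balancer
        // rotation, and before OnStop runs.
        RoleEnvironment.Stopping += (sender, e) =>
        {
            Trace.TraceInformation("Stopping: flushing state and releasing resources.");
            // Persist checkpoints, drain in-flight work, release leases here.
        };
        return base.OnStart();
    }

    public override void OnStop()
    {
        // Keep this fast: the fabric controller enforces a hard limit
        // (roughly five minutes, per the note above) before forcing shutdown.
        Trace.TraceInformation("OnStop called.");
        base.OnStop();
    }
}
```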
It will usually be the last one to be spun up. So if you have IN_0, IN_1 and IN_2, I've only ever seen IN_2 go down when you remove one, so that seems to be the one.
However, this is not documented anywhere by Microsoft, so this isn't guaranteed to be the one that goes down... it just seems to be in practice.
Related
I have an Azure cloud service which scales instances out and in. This works fine, using some App Insights metrics to manage the auto-scaling rules.
The issue comes in when the service scales in and Azure eliminates hosts; is there a way for it to only scale in an instance once that instance is done processing its task?
There is no way to do this automatically. Azure will always scale in the highest number instance.
The ideal solution is to make the work idempotent and chunked so that if an instance that was doing some set of work is interrupted (scaling in, VM reboot, power loss, etc), then another instance can pick up the work where it left off. This lets you recover from a lot of possible scenarios such as power loss, instead of just trying to design something specific for scale in.
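As a sketch of that idea with classic storage queues (the queue name and work method below are hypothetical): a message that an interrupted instance never deleted simply becomes visible again after its visibility timeout, so a surviving instance picks it up.

```csharp
using System;
using Microsoft.WindowsAzure.Storage.Queue;

static class QueueWorker
{
    static void ProcessOne(CloudQueue queue)
    {
        // Keep the message invisible while we work on it. If this instance
        // is scaled in (or loses power) mid-task, the message reappears
        // after 10 minutes and another instance picks it up.
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(10));
        if (msg == null) return;

        // The work must be idempotent: it may run more than once.
        DoIdempotentChunkOfWork(msg.AsString); // hypothetical work method

        // Delete only after the results are durably stored.
        queue.DeleteMessage(msg);
    }

    static void DoIdempotentChunkOfWork(string payload) { /* ... */ }
}
```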
Having said that, you can manually create a scaling solution that only removes instances that are not doing work, but doing so will require a fair bit of code on your part. Essentially you will use a signaling mechanism running in each instance that will let some external service (a Logic app or WebJob or something like that) know when an instance is free or busy, and that external service can delete the free instances using the Delete Role Instances API (https://learn.microsoft.com/en-us/rest/api/compute/cloudservices/rest-delete-role-instances).
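As a rough, unverified sketch of what that delete call might look like against the classic Service Management endpoint (all identifiers are placeholders; verify the exact URL, version header, and body format against the linked documentation):

```csharp
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Threading.Tasks;

static class ScaleInController
{
    static async Task DeleteIdleInstanceAsync(X509Certificate2 managementCert)
    {
        // Classic Service Management uses management-certificate auth.
        var handler = new WebRequestHandler();
        handler.ClientCertificates.Add(managementCert);

        using (var client = new HttpClient(handler))
        {
            // Placeholder subscription/service/deployment identifiers.
            string url = "https://management.core.windows.net/{subscription-id}" +
                         "/services/hostedservices/{service-name}" +
                         "/deployments/{deployment-name}/roleinstances/?comp=delete";
            string body = "<RoleInstances xmlns=\"http://schemas.microsoft.com/windowsazure\">" +
                          "<Name>WorkerRole1_IN_2</Name></RoleInstances>";

            var request = new HttpRequestMessage(HttpMethod.Post, url)
            {
                Content = new StringContent(body, Encoding.UTF8, "application/xml")
            };
            request.Headers.Add("x-ms-version", "2013-08-01");

            HttpResponseMessage response = await client.SendAsync(request);
            response.EnsureSuccessStatusCode(); // the operation completes asynchronously
        }
    }
}
```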
For more discussion on this topic see:
How to Stop single Instance/VM of WebRole/WorkerRole
Azure autoscale scale in kills in use instances
Another solution, though it breaks the assumption that we are using an Azure cloud service: if you use App Services instead of the cloud service, you will be able to set up auto-scaling on the App Service plan, effectively taking care of the instance drop you are experiencing.
This is an infrastructure change, so it's not a two-click thing, but I believe App Services are better suited in many situations, including this one.
You can look at some pros and cons, but if your product is traffic-managed this switch will not be painful.
Kwill, thanks for the links/information, the top item in the second link was the best compromise.
The process work length was usually under 5 minutes, and the service already had re-handling of failed processes. After some research, it was decided to track when the service was processing a queue item and to use a while loop in the RoleEnvironment.Stopping event to delay restart and scale-in events until the process had a chance to finish.
App Insights was used to track custom events during the Stopping event to record how often the process completes versus restarts during the delay cycles.
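Roughly, that pattern looked like the sketch below (the busy flag, the 270-second budget, and the event names are illustrative rather than the exact production code):

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;
using Microsoft.WindowsAzure.ServiceRuntime;

static class GracefulStop
{
    // Set to true while a queue item is being processed (hypothetical flag).
    static volatile bool processing;
    static readonly TelemetryClient telemetry = new TelemetryClient();

    public static void Register()
    {
        RoleEnvironment.Stopping += (sender, e) =>
        {
            // Hold the shutdown for up to 4.5 minutes, staying under the
            // fabric controller's hard limit, while the current item finishes.
            DateTime deadline = DateTime.UtcNow.AddSeconds(270);
            while (processing && DateTime.UtcNow < deadline)
            {
                Thread.Sleep(1000);
            }

            telemetry.TrackEvent(processing ? "StoppedWhileBusy" : "StoppedClean");
            telemetry.Flush();
        };
    }
}
```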
I have an Azure web role project which involves a long startup task of installing 3rd-party software on the instance.
Occasionally I've seen instances that don't respond, so I'm implementing a probe so that the load balancer will take note of this and not direct traffic to bad instances.
This of course isn't enough - what I'd want is for Azure (Fabric?) to then reboot the instance, and if that doesn't help (that is, make the instance reply properly to the probe) - reimage the instance.
Is that the behavior, and if so, where is that documented? I searched for quite a while but didn't find anything useful.
Thanks
Using the Management API you should be able to externally monitor your role instances. Then, if one is taking too long, you should be able to force it to be re-imaged.
http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx describes the health of a role instance, what Azure does for recovery, and how to use a load balancer probe.
When you say that your instance doesn't respond, does that mean that the instance shows as Busy (or something besides Ready) in the portal, or just that IIS isn't responding to requests? If the former (instance showing Busy) then you don't need a load balancer probe since Azure will automatically remove that instance from rotation. If the latter (IIS not responding) then you can potentially implement a StatusCheck event in your web code such that if w3wp itself is having a problem then the instance will be taken out of rotation by the fabric, but if w3wp itself is healthy and it is just the requests that are not responding then you will need the load balancer probe.
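For the StatusCheck route, a minimal sketch might look like this (the health flag is a hypothetical stand-in for whatever signal tells you w3wp is unhealthy):

```csharp
using Microsoft.WindowsAzure.ServiceRuntime;

public static class HealthReporting
{
    // Hypothetical flag maintained elsewhere in the web role.
    public static volatile bool W3wpHealthy = true;

    public static void Register()
    {
        RoleEnvironment.StatusCheck += (sender, e) =>
        {
            // Reporting Busy takes this instance out of rotation until a
            // later status check reports it healthy again.
            if (!W3wpHealthy)
            {
                e.SetBusy();
            }
        };
    }
}
```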
Having a good monitoring and recovery solution in place is very valuable, but I would recommend that instead of rebooting instances to mitigate a w3wp problem you should instead investigate the root cause of why your instances aren't responding. Fix the source of the problem rather than apply a Band-Aid :). The blog post at http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx, and in particular the troubleshooting scenario 5, may be a good place to start the investigation.
Is there a way to stop the worker process by itself? I have already coded a console application which uses the REST API to start and stop the worker process and delete the cloud service deployment. Per the latest announcement, stopping worker processes will not cost anything; it is free now.
Can I make the worker process stop itself? Is there any event in the worker process to stop itself? Please let me know.
So I think you're referring to Worker Roles, right? A worker process would simply be something you run in your app (like a thread, a method, something). Azure Worker Roles are full VMs.
Assuming that's what you meant: The new announcement about stopping VMs does not apply to Web / Worker Role instances; it applies to Virtual Machines. And those can be stopped easily via REST call (or much easier via PowerShell that wraps the REST call). You could make that call from a Virtual Machine, which would effectively shut itself down, but I'm not so sure that's a sound idea. If you take that approach, it will be very hard for you to track the role-stop progress, since you would have just stopped the VM that made the call.
I need to test how my code will handle the failure of a web role instance in a development environment.
How do I terminate one of the instances? I can't see any option in the UI for this. Seems like a strange omission.
Update
The issue relates to a distributed cache layer (I know that Azure offers its own).
I want to be able to test how the system reacts to a missing or additional node, etc.
Perhaps my real question is:
how up to date is RoleEnvironment.CurrentRoleInstance.Role.Instances
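For reference, here is a minimal sketch of how I'm reading that collection, reacting via the RoleEnvironment.Changed event (as I understand it, other instances only show up in the collection when the role defines an internal endpoint):

```csharp
using System;
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

static class TopologyWatcher
{
    public static void Register()
    {
        RoleEnvironment.Changed += (sender, e) =>
        {
            if (e.Changes.OfType<RoleEnvironmentTopologyChange>().Any())
            {
                int count = RoleEnvironment.CurrentRoleInstance.Role.Instances.Count;
                Console.WriteLine("Topology changed; instance count is now " + count);
                // Rebalance the distributed cache membership here.
            }
        };
    }
}
```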
The need to simulate ungraceful exits in the dev emulator usually arises because you are doing something in your web role that is stateful or long-running. That is generally discouraged, but sometimes it is unavoidable.
I suspect the best way to simulate a failure is to kill processes. If you open Task Manager (or better, Process Explorer), you will see "WatDebugger" hosting either "WaIISHost" or "WaWorkerHost". If you kill this process, I think it will simulate a failure.
Honestly, it is easier to test this in the cloud, however. You can RDP into one of the instances and kill the 'WaAppAgent' process. That will kill your RoleEntryPoint and fabric controller agent. That will be a true ungraceful failure.
By failure, do you mean becoming unavailable? It should be seamless because the next request would simply be handled by one of the other instances. As long as there is one instance available Azure will route calls to that instance.
This is the nature of a high-available system, requests are handled by the available instances. This is why you have multiple instances in the first place, to handle requests in the case of failure in one or more instances.
This is why you need to always be watchful of how your application handles state. State needs to be maintained outside of the instance, either in queues or in a database. This ensures that any process can pickup a piece of work and execute against it.
There is another question dealing with Session State that should help: How does Microsoft Azure handle Session State?
By terminating an instance, do you mean reducing the instance count and seeing which one gets killed? I like Ryan's view about ungraceful exits, but if it's a forced kill by the fabric it'll be a different ball game.
Background
I am trying to work out the best structure for an Azure application. Each of my worker roles will spin up multiple long-running jobs. Over time I can transfer jobs from one instance to another by switching them to a read-only mode on the source instance, spinning them up on the target instance, and then spinning the original down on the source instance.
If I have too many jobs then I can tell Azure to spin up extra role instances and use them for new jobs. Conversely, if my load drops (e.g. during the night), I can consolidate outstanding jobs onto a few machines and tell Azure to give me fewer instances.
The trouble is that (as I understand it) Azure provides no mechanism to allow me to decide which instance to stop. Thus I cannot know which servers to consolidate onto, and some of my jobs will die when their instance stops, causing delays for users while I restart those jobs on surviving instances.
Idea 1: I decide which instance to stop, and return from its Run(). I then tell Azure to reduce my instance count by one, and hope it concludes that the broken instance is a good candidate. Has anyone tried anything like this?
Idea 2: I predefine a whole bunch of different worker roles, with identical contents. I can individually stop and start them by switching their instance count from zero to one, and back again. I think this idea would work, but I don't like it because it seems to go against the natural Azure way of doing things, and because it involves me in a lot of extra bookkeeping to manage the extra worker roles.
Idea 3: Live with it.
Any better ideas?
In response to your ideas
Idea 1: I haven't tried doing exactly what you're describing, but in my experience your first instance has a name that ends with _0, the next _1 and I'm sure you can guess the rest. When you decrease the instance count it drops off the instance with the highest number suffix. I would be surprised if it took into account the state of any particular instance.
Idea 2: As I think you hint at, this will create management problems. You can only have 5 different workers per hosted service, so you'll need a service for each group of 5 roles that you want to be able to scale to. Also when you deploy updates you'll have to upload X times more services where X is the maximum number of instances you currently support.
Idea 3: Technically the easiest. Pending some clarification, this is probably what I'd be doing for now. To reduce the downsides of this option it may pay to investigate ways of loading the data faster. There is usually a Goldilocks level (not too much, not too little) of parallelism that helps with this.
You're right - you cannot choose which instance to stop. In general, you'd run the same jobs on each worker role instance, where each instance watches the same queue (or maybe multiple threads or jobs watching multiple queues).
If you really need to run a job on one instance (such as a scheduler), consider using blob leases as the way to constrain this. Create a blob as a mutex. Then, as each instance spins up, the scheduler job attempts to obtain a write lease on that blob. If it succeeds, it runs. If it fails, it simply sleeps (maybe for a minute) and tries again. At some point in the future, as you scale down in instance count, let's say the instance running the scheduler is killed. A minute later (or whatever time span you choose), another instance tries to acquire the lease, succeeds, and now runs the scheduler code.
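A sketch of that blob-lease mutex, using the classic storage SDK (container and blob names are placeholders):

```csharp
using System;
using System.Threading;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

static class SchedulerMutex
{
    public static void RunIfLeaseAcquired(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobContainer container = account.CreateCloudBlobClient()
                                              .GetContainerReference("locks");
        container.CreateIfNotExists();

        CloudBlockBlob mutex = container.GetBlockBlobReference("scheduler-mutex");
        if (!mutex.Exists())
        {
            mutex.UploadText(string.Empty); // the blob only serves as a lock
        }

        while (true)
        {
            try
            {
                // Non-infinite lease durations must be 15-60 seconds.
                string leaseId = mutex.AcquireLease(TimeSpan.FromSeconds(60), null);
                AccessCondition lease = AccessCondition.GenerateLeaseCondition(leaseId);

                // We hold the lease: run the scheduler, renewing the lease
                // more often than it expires.
                while (true)
                {
                    RunSchedulerIteration(); // hypothetical scheduler body
                    mutex.RenewLease(lease);
                }
            }
            catch (StorageException)
            {
                // Another instance holds the lease; sleep and try again.
                Thread.Sleep(TimeSpan.FromMinutes(1));
            }
        }
    }

    static void RunSchedulerIteration() { }
}
```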