I have an Azure web role project with a long startup task that installs third-party software on the instance.
Occasionally I've seen instances that don't respond, so I'm implementing a probe so the load balancer takes note of this and stops directing traffic to bad instances.
This of course isn't enough - what I'd want is for Azure (the fabric controller?) to then reboot the instance and, if that doesn't help (that is, if the instance still doesn't reply properly to the probe), reimage it.
Is that the behavior, and if so, where is that documented? I searched for quite a while but didn't find anything useful.
Thanks
Using the management API you should be able to externally monitor your role instances. Then, if one is taking too long to respond, you should be able to force it to be reimaged.
http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx describes the health of a role instance, what Azure does for recovery, and how to use a load balancer probe.
When you say that your instance doesn't respond, does that mean the instance shows as Busy (or something besides Ready) in the portal, or just that IIS isn't responding to requests? If the former (instance showing Busy), you don't need a load balancer probe since Azure will automatically remove that instance from rotation. If the latter (IIS not responding), you can potentially handle the StatusCheck event in your web code so that if w3wp itself is having a problem the instance is taken out of rotation by the fabric; but if w3wp is healthy and it's just the requests that aren't responding, you will need the load balancer probe.
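As a rough sketch of the StatusCheck approach (the `IsSiteHealthy()` self-check is a hypothetical placeholder you would implement yourself, e.g. by pinging a local endpoint), something along these lines in the role entry point reports the instance as Busy while your own check fails:

```csharp
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        RoleEnvironment.StatusCheck += (sender, e) =>
        {
            // SetBusy() marks this instance Busy for the current status-check
            // cycle only, so it re-enters rotation once the check passes again.
            if (!IsSiteHealthy())
            {
                e.SetBusy();
            }
        };
        return base.OnStart();
    }

    // Hypothetical self-check; replace with whatever "healthy" means for your app.
    private static bool IsSiteHealthy()
    {
        return true;
    }
}
```

Because the Busy state only lasts for that status-check cycle, the instance comes back into rotation on its own once the check passes again.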
Having a good monitoring and recovery solution in place is very valuable, but I would recommend that instead of rebooting instances to mitigate a w3wp problem you should instead investigate the root cause of why your instances aren't responding. Fix the source of the problem rather than apply a Band-Aid :). The blog post at http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx, and in particular troubleshooting scenario 5, may be a good place to start the investigation.
Related
I have an Azure cloud service which scales instances out and in. This works fine, using some App Insights metrics to manage the auto-scaling rules.
The issue comes when the service scales in and Azure eliminates hosts: is there a way for it to only scale in an instance once that instance is done processing its current task?
There is no way to do this automatically. Azure will always scale in the highest-numbered instance.
The ideal solution is to make the work idempotent and chunked so that if an instance that was doing some set of work is interrupted (scaling in, VM reboot, power loss, etc), then another instance can pick up the work where it left off. This lets you recover from a lot of possible scenarios such as power loss, instead of just trying to design something specific for scale in.
Having said that, you can manually create a scaling solution that only removes instances that are not doing work, but doing so will require a fair bit of code on your part. Essentially you will use a signaling mechanism running in each instance that lets some external service (a Logic App or WebJob or something like that) know when an instance is free or busy, and that external service can delete the free instances using the Delete Role Instances API (https://learn.microsoft.com/en-us/rest/api/compute/cloudservices/rest-delete-role-instances). A rough sketch of the signaling side follows.
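To make that concrete, here is a sketch of the signaling side only, assuming the classic WindowsAzure.Storage SDK and an illustrative container name; the external service would read these markers before calling the Delete Role Instances API:

```csharp
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;

public static class InstanceStatusSignal
{
    // Each instance writes a tiny "busy" / "free" marker blob; an external
    // service (Logic App, WebJob, etc.) reads these before deleting instances.
    public static void Report(bool busy)
    {
        var account = CloudStorageAccount.Parse("<storage connection string>");
        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("instance-status");
        container.CreateIfNotExists();

        // One blob per role instance, keyed by the instance id (e.g. "MyWorkerRole_IN_3").
        var blob = container.GetBlockBlobReference(RoleEnvironment.CurrentRoleInstance.Id);
        blob.UploadText(busy ? "busy" : "free");
    }
}
```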
For more discussion on this topic see:
How to Stop single Instance/VM of WebRole/WorkerRole
Azure autoscale scale in kills in use instances
Another solution, though it breaks the assumption that you're using an Azure cloud service: if you use App Service instead of the cloud service, you can set up auto-scaling on the App Service plan, effectively taking care of the instance drops you are experiencing.
This is an infrastructure change, so it's not a two-click thing, but I believe App Service is better suited to many situations, including this one.
You can weigh the pros and cons, but if your product is traffic-managed this switch will not be painful.
Kwill, thanks for the links/information - the top item in the second link was the best compromise.
Each work item usually took under 5 minutes and the service already re-handled failed processes, so after some research it was decided to track whether the service was processing a queue item and to use a while loop in the RoleEnvironment.Stopping event handler to delay restart and scale-in events until the work in flight had a chance to finish (a sketch of the pattern is below).
App Insights custom events were used in the Stopping handler to record how often the work completes versus how often the instance restarts during the delay cycles.
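For anyone finding this later, the pattern ended up looking roughly like the sketch below (the `_isProcessing` flag, the five-minute cap, and the event names are illustrative, not the exact production code):

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    // Set to true while a queue item is being processed (illustrative flag,
    // toggled by the queue-processing code).
    private static volatile bool _isProcessing;
    private static readonly TelemetryClient Telemetry = new TelemetryClient();

    public override bool OnStart()
    {
        RoleEnvironment.Stopping += (sender, e) =>
        {
            // Hold the Stopping event (restart, scale-in, etc.) until the current
            // item finishes or a cap is reached; items normally take < 5 minutes.
            var waited = TimeSpan.Zero;
            while (_isProcessing && waited < TimeSpan.FromMinutes(5))
            {
                Thread.Sleep(TimeSpan.FromSeconds(10));
                waited += TimeSpan.FromSeconds(10);
            }

            Telemetry.TrackEvent(_isProcessing ? "StoppedWhileProcessing" : "StoppedClean");
            Telemetry.Flush();
        };
        return base.OnStart();
    }
}
```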
We have a few of our internal business services hosted on an isolated ASE (App Service Environment) in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production for a little more than a month now and has been performing fairly well, apart from occasional sudden CPU spikes to 100% in one of the instances, which bring down the services.
We don't have auto scaling setup but have 2 instances running all the time.
The services are `aspnetcore` webapi and the runtime is dotnet core 2.0.
Every time I have come across this issue in the last couple of weeks I have not managed to log in to Kudu and grab a process dump to investigate further. The business is breathing down my neck to get the service up and running as quickly as possible, and the easiest route is to restart the faulting service or swap slots with a pre-prod environment.
Access to the ASE is also restricted from our network, which makes it all the more difficult for me to switch to Wi-Fi and then go through jump boxes to log in to Kudu. I had asked our ops engineer to get me the dump when this issue is reported, but he has not managed to either, mostly for the same reasons I'm not able to do it myself.
All the exceptions I can see in Application Insights are due to the services themselves going down; there are no exceptions there that could have caused the issue in the first place (at least I've not found one yet).
This led me to make a few guesses and look at metrics; the only thing raising my suspicions is garbage collection. I don't see any sudden spike in the GC graphs either: each time the service is restarted the graph is fairly flat (over 24 hours), but it increases day by day and ends up like the graph below.
The working set, however, is a sinusoidal graph, which makes me think there are no memory leaks. But is the above graph over 3 days normal?
The drop is when I restart the service. But all services have a similar trajectory, even the one that has not gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a Service Bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that there are no unmanaged resources, and the DB connections are always wrapped in `using` statements.
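For reference, the `HttpClient` usage is essentially the standard shared-instance pattern, roughly like this (the class and endpoint names below are illustrative, not the actual code):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public class DownstreamClient
{
    // A single shared HttpClient avoids socket exhaustion from creating a new
    // client per request; the base address is illustrative.
    private static readonly HttpClient Http = new HttpClient
    {
        BaseAddress = new System.Uri("https://internal-service.example/")
    };

    public async Task<string> GetDataAsync(string id)
    {
        // Hypothetical endpoint path; failures are surfaced to the caller.
        var response = await Http.GetAsync($"api/items/{id}");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```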
I understand I would need a process dump to investigate further, but my biggest concern is why the application gateway (load balancer) is not sending traffic to the healthy instance. Because the gateway goes unhealthy, Cloudflare returns a `502` response to clients using the API.
MS support haven't been able to help and haven't answered whether our load balancers are working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart from this sudden surge.
Thanks
It could be that the backend is pegged at 100% CPU and is unable to respond to the Application Gateway health probes. When such an issue occurs, were you able to verify the health state of your backends using the backend health logs? If both backend instances were unhealthy, that would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case, please reply with the subscription ID, gateway name, and approximate time window of the incident for us to take a look.
By looking at my Pingdom reports I have noted that my Web Site instance is getting recycled. Basically, Pingdom is used to keep my site warm. When I look deeper into the Azure logs, i.e. /LogFiles/kudu/trace, I notice a number of small XML files with "shutdown" or "startup" suffixes, e.g.:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might be to do with MS patching the VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicion that it is happening at the OS level. Is there a way to find out why my instance is being shut down?
I should also mention that I am using one S2 instance, scalable to three depending on CPU usage. We may have to review this and use a 2-3 setup, though obviously that doubles the cost.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with a status of "succeeded", but nothing for the times the above files were written. So it seems that the "instance" is being shut down, but the event is not appearing in the Operation Log. Why would this be? I had about 5 shutdowns yesterday, yet the last Operation Log entry was 29/7.
An example of one of yesterday's shutdown xml file:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
You should see entries regarding backend maintenance in the Operation Logs.
As for keeping your site alive, Standard plans let you use the "Always On" feature, which does pretty much what Pingdom is doing to keep your website warm. Just enable it from the Configure tab in the portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
Every site on Azure runs two applications: one is yours and the other is the SCM endpoint (a.k.a. Kudu). These "shutdown" traces are for the Kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself, just like Kudu does (a rough sketch is below). If you don't have Always On enabled, Kudu gets shut down after an hour of inactivity (as far as I remember).
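For a classic ASP.NET site, one minimal way to get a similar shutdown trace (this is only a sketch, not how Kudu itself does it; the log path is illustrative) is to log from `Application_End` together with `HostingEnvironment.ShutdownReason`:

```csharp
using System;
using System.IO;
using System.Web.Hosting;

public class Global : System.Web.HttpApplication
{
    protected void Application_End()
    {
        // ShutdownReason tells you why ASP.NET is tearing the app domain down
        // (HostingEnvironment, ConfigurationChange, etc.).
        var reason = HostingEnvironment.ShutdownReason;
        var line = $"{DateTime.UtcNow:o} shutdown reason: {reason}";

        // Illustrative location under the site's persisted HOME storage.
        var path = Path.Combine(
            Environment.GetEnvironmentVariable("HOME") ?? ".",
            "LogFiles", "my-shutdown-trace.log");
        File.AppendAllLines(path, new[] { line });
    }
}
```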
Aside from that, as you mentioned, Azure will shut down your app during machine upgrades, though I don't think these shutdowns result in Operation Log events.
Are you seeing any side effects? Is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. We bring the site up on a new machine before shutting it down on the old one and let connections drain, so this should not result in any perceivable downtime.
Is it possible to create one or several Azure VMs on my local machine? I want to create a web app and load test it locally, without needing to put it in the cloud. I'm thinking of the following scenario: I have a local VM running an IIS server with my web app; I use a tool to generate a lot of load; then I need to deploy a second VM containing the same things as the first VM. The downtime of the web app should be zero (hopefully).
Clarification (update):
I want to achieve the following: create a web app and a monitoring app (CPU, memory) and deploy them on one VM. During a load test, if the VM cannot handle it (e.g. CPU goes above 80%), I want to programmatically deploy a new VM (with the same configuration, containing both the web app and the monitoring app), such that no downtime occurs.
Azure has several ways for you to host sites.
Virtual Machines are just that: normal VMs. You can create them locally and upload them, but everything is up to you, including how to handle upgrades. If that is the route you take, I don't know how you would handle upgrades with no downtime on a single machine; you can, however, add multiple VMs to a load balancer and upgrade them one at a time.
It sounds like what you really want to explore is Cloud Services. You can run one or more instances locally in the emulator, upgrade with no downtime once in the cloud, and implement auto-scaling (you will have to use a tool or write some code).
Alternatively you may want to look at Azure Web Sites, but that is a completely different concept and you can't really test load and load balancing locally the same way.
Based on your statement that you essentially want to auto-scale your application, you want to look at Cloud Services with auto-scaling. However, you can't fully test this in the local emulator - but you can test your logic.
Background
Azure Cloud Services is designed for this kind of thing. You don't really work with VMs the way you may be used to; instead you create a package that Azure then deploys to as many servers as you like. Once it's up and running, you can manually go into the management console and increase or decrease the number of active servers simply by moving a slider. Of course, you want to do this automatically, so you have a few options.
There is a management API you can use to change the number of servers. So it would be quite simple to write a bit of code that you spin up on another thread from WebRole.OnStart, which simply sits and monitors the CPU on the machine and calls the management API to spin up a new server instance if the CPU goes over a certain threshold. Granted, locally you can only test that the call to the management API is made; you won't actually see the new server coming up. But if you grab your free trial of Azure and just try it, you will see that you really don't need to test that part - it just works.
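A minimal sketch of that monitoring thread, assuming a simple CPU performance counter; the threshold, sampling interval, and the `RequestScaleOut()` call into the management API are placeholders you would fill in yourself:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Fire-and-forget monitor; in real code you'd keep a handle and stop it in OnStop.
        Task.Run(() => MonitorCpu());
        return base.OnStart();
    }

    private static void MonitorCpu()
    {
        var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        cpu.NextValue(); // the first read is always 0; prime the counter

        while (true)
        {
            Thread.Sleep(TimeSpan.FromSeconds(30));
            if (cpu.NextValue() > 80) // illustrative threshold
            {
                // Placeholder: call the Service Management API here to increase
                // the instance count (e.g. by updating the deployment configuration).
                RequestScaleOut();
            }
        }
    }

    private static void RequestScaleOut()
    {
        // Hypothetical helper - left unimplemented in this sketch.
    }
}
```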
However, in practice there is an awful lot more to auto-scaling. Here are some of the things you need to consider:
Even relatively idle web servers will often spike briefly to 100%, so a simple threshold is unlikely to be good enough; you need to decide how long the server must stay over the threshold before you spin up another server instance.
What happens when you have more than one server? And, on Azure, you should always have at least two servers to ensure you have resilience. Note that the idea with Cloud Services really is to have many small servers rather than a few big servers. You pay per core, not per number of servers.
Imagine you currently have three servers and one is really busy for some reason and the other two are idle. Do you want to spin up a fourth server?
Imagine you currently have two servers and they are both quite busy. Do you really want them both to start a new server so you end up with four servers running?
There are several ways to handle these challenges. For starters, rather than having monitor programs running locally on each server, you are better off moving that monitoring outside; Azure comes with the ability to dump performance metrics to table storage at whatever interval you choose. You can then run an external program that retrieves the performance data over time from all your current servers and reasons about the overall workload before deciding to spin up or shut down additional servers. Now, you can of course host that external monitor program in a separate thread on each of your web roles to give your monitoring resilience - but the key point is that the monitoring program doesn't monitor the server it runs on, it monitors all the servers.
You will, of course, still have to deal with stopping multiple monitoring program instances from all starting and stopping servers. One way to do this is to place stop/start commands onto an Azure message queue (there are a few different types) and use the built-in "de-duper", which will automatically delete identical commands that are put on the queue within a certain time window (I am oversimplifying, but you get the idea); a sketch of that trick follows.
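As an illustration of the de-dupe trick, here is a sketch using a Service Bus queue with duplicate detection enabled; the queue name and the time-bucketed `MessageId` scheme are assumptions, not something Azure prescribes:

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

public static class ScaleCommands
{
    public static void RequestScaleOut(string connectionString)
    {
        // The queue must be created with RequiresDuplicateDetection = true and a
        // suitable DuplicateDetectionHistoryTimeWindow (e.g. 10 minutes).
        var client = QueueClient.CreateFromConnectionString(connectionString, "scale-commands");

        // Bucket the MessageId by time so that identical commands sent by several
        // monitor instances within the same window collapse into one message.
        var bucket = DateTime.UtcNow.ToString("yyyyMMddHHmm").Substring(0, 11); // ~10-minute bucket
        var message = new BrokeredMessage("scale-out")
        {
            MessageId = "scale-out-" + bucket
        };

        client.Send(message);
    }
}
```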
The actual answer
Really, though, you want to look at the Auto Scaling Application Block which will do most of this for you. I guess that is the real answer to your question, but I wanted to provide a bit of context first.
Again, I recognise you asked how to test this locally - but I believe that question doesn't really make sense in the context of Azure, and I hope the above information helps.
I'm pretty sure you can't do that, and it wouldn't make sense anyway. If you want load testing, you need to run it in an environment as similar to production as possible, and that means you have to run your application in the Azure cloud. How else would you know that the load will actually be handled fine on the real cloud?
I'm looking into Windows Azure now and wondering whether one can implement a TCP/IP server using worker roles - i.e., when a request comes in on a socket, a worker role (not a web role) will accept it, handle it, and then return an answer on that same socket.
Another question is - should I do it, or maybe just implement my own non-blocking server using .NET and put it in one worker role or a VM?
Thanks!
There's a full worked example of a telnet server on Maarten Balliauw's blog - see http://blog.maartenballiauw.be/post/2010/01/17/Creating-an-external-facing-Azure-Worker-Role-endpoint.aspx
On your second question, most answers seem to recommend using worker roles for code instead of using VMs - worker roles in general are "architecturally preferred" for Azure, and VMs are there mainly for when you need to support existing (legacy) code.
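To give a flavour of what the worker role approach looks like (assuming an input endpoint declared in the service definition with the illustrative name "TcpIn"), the role's Run method can simply listen on the endpoint Azure assigns:

```csharp
using System.Net.Sockets;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        // Resolve the IP/port Azure assigned to the endpoint declared as "TcpIn"
        // in ServiceDefinition.csdef, then listen on it like any TcpListener.
        var endpoint = RoleEnvironment.CurrentRoleInstance.InstanceEndpoints["TcpIn"].IPEndpoint;
        var listener = new TcpListener(endpoint);
        listener.Start();

        while (true)
        {
            using (var client = listener.AcceptTcpClient())
            using (var stream = client.GetStream())
            {
                // Echo a single byte back as a trivial request/response example.
                var b = stream.ReadByte();
                if (b >= 0)
                {
                    stream.WriteByte((byte)b);
                }
            }
        }
    }
}
```

The same InstanceEndpoints lookup works for internal endpoints as well, so the pattern is identical whether the service is public-facing or only visible to your other role instances.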
Adding to Stuart's answer: A Worker Role will give you nearly everything a VM role is going to offer you, without you having to worry about maintaining the OS. VM roles are needed for a few specific scenarios. I enumerated them in this other StackOverflow answer, but just for completeness, here are those scenarios which require a VM role:
Startup / setup takes a really long time. This is a bit subjective, but a good rule-of-thumb is around 5 minutes. Remember that, every time your role instances boot up, they need to re-run any tasks in your startup, including software installs, so role instance availability is delayed until all startup tasks are run.
Startup / setup tasks are unreliable and don't always work the first time you run them. Software setups need to run in unattended mode, and must reliably succeed.
Human interaction is required. If the software install can't be completely automated, there's no way to script it.
When it comes to hosting a TCP service, you can choose to host something that is either publicly available or only internal to your other role instances. For public hosts, you have up to 25 endpoints to work with across your deployment, and for internal hosts, you have up to 5 endpoints per role. See my blog post here for more details around this.