When I try to scale out my Azure Web App I experience very slow response times for requests on the second or third instance of the app.
This seems to happen because the other instances were in cold mode and had to switch into hot mode once the load balancer redirected the request to them.
The problem is that in my scenario there usually isn't much going on in the system, so most of the time only one instance is used by the load balancer. But roughly four times a day there is a peak and I need more than one instance. If those extra instances are in cold mode and have to wake up first, scaling out actually makes things worse.
The question is what to do?
I've already set the app to "always on" and ARR Affinity to "off".
In the past I've already experienced problems with my app going into some sort of sleep mode even though the app was set to "always on". I solved this by setting up a scheduler task that called the app every hour. But I don't think this would work with multiple instances anymore because the task would only call one instance and the other instances would still stay in sleep mode.
Any suggestions?
Increase your App Service plan (to S3) and try it again. I had a similar problem and this solved it.
Alternatively, you can reconfigure your scaling rules.
Consider enabling logging to debug which instances are receiving the requests and why these requests are slow.
For the comment that around four times a day you need more than one instance: consider setting up autoscale with a recurrence profile on your App Service plan to scale out automatically. You can set up autoscale rules with different instance counts based on the time of day.
Following a recent investigation into an Azure web API going down (it does not handle cold restarts well, as the queued requests then swamp the server, which starts returning 503s), I received the following:
Your application was restarted as site binding status changed. This
can most likely occur due to recent deployment slot swap operations.
In some cases after the swap the web app in the production slot may
restart later without any action taken by the app owner. This restart
may take place several hours/days after the swap took place. This
usually happens when the underlying storage infrastructure of Azure
App Service undergoes some changes. When that happens the application
will restart on all VMs at the same time which may result in a cold
start and a high latency of the HTTP requests. This event occurred
multiple times during the day.
The recommendation was:
to minimize the random cold starts, you can set this app setting
WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG to 1 in every slot of
the app.
Can someone please elaborate on this?
Am I right in thinking that if we ever do a swap (eg: staging to production) at some random point in the future the app will restart?
What does the app setting actually do and how will it stop Azure restarting the production slot?
Answer from the link provided by Patrick Goode, whose Google-fu is far better than mine:
"Just to explain the specifics of what
WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG app setting does. By
default we put the site’s hostnames into the site’s
applicationHost.config file “bindings” section. Then when the swap
happens the hostnames in the applicationHost.config get out of sync
with what the actual site’s hostnames are. That does not affect the
app in any way while it is running, but as soon as some storage event
occurs, e.g. storage volume fail over, that discrepancy causes the
worker process app domain to recycle. If you use this app setting then
instead of the hostnames we will put the sitename into the “bindings”
section of the appHost.config file. The sitename does not change
during the swap so there will be no such discrepancy after the swap
and hence there should not be a restart."
Looks like this setting is supposed to prevent 'random cold restarts'
https://ruslany.net/2019/06/azure-app-service-deployment-slots-tips-and-tricks
I'm using the Azure REST API to create, deploy and start a Cloud Service (classic) (a cspkg hosted in Azure Storage) with hundreds of instances. I'm noticing that the time Azure takes to provision and start the requested instances is really heterogeneous. The first instances might start in 6-7 minutes, but the last ones might take 15-20 minutes, about 10 minutes longer than the first ones. So my questions are:
Is this the expected behaviour? If so, what's the logic behind this? Could I do anything to speed things up?
How does Azure bill this? Does it count the total number of instances from the moment the Cloud Service is deployed, or does it take into account the specific start time of each individual instance?
UPDATE: I've been testing more scenarios and found a puzzling surprise. If I replace all the processes that my Cloud Service instances should run with a simple wait of a few minutes (a .bat file running the timeout command), then all the instances start almost at the same time (about 15 seconds between the fastest and slowest instance). It was not just luck or random behaviour; I've verified that this behaviour is repeatable, and I can't explain the root cause.
I also checked this a few weeks ago. The startup time depends on the size of the machine: if it is larger it has more resources, so the boot time is faster. Also, if there is any error or exception during startup, the VM will recycle until it can start successfully. I googled it but did not find any way to speed this up, so I don't think anything can be done about the startup time. In the background, every time you deploy something, Azure creates a Windows Server VM, boots it up, deploys your package onto it and puts your web roles behind the load balancer. This is why it takes so long: a lot of things are happening.
The billing part is also not great for classic cloud services: you have to pay even during startup and recycling, and even when an instance is turned off. So when you are done with your update, you should delete the VMs from your staging slot or scale it down, because you will pay for them even if they are turned off.
I have an azure cloud service which scales instances out and in. This works fine using some app insights metrics to manage the auto-scaling rules.
The issue comes in when the service scales in and Azure eliminates hosts; is there a way for it to only scale in an instance once that instance is done processing its task?
There is no way to do this automatically. Azure will always scale in the highest-numbered instance.
The ideal solution is to make the work idempotent and chunked so that if an instance that was doing some set of work is interrupted (scaling in, VM reboot, power loss, etc), then another instance can pick up the work where it left off. This lets you recover from a lot of possible scenarios such as power loss, instead of just trying to design something specific for scale in.
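To make the "pick up where it left off" idea concrete, here is a minimal sketch of checkpointed, idempotent chunk processing. WorkChunk, ProcessChunk, LoadCheckpoint and SaveCheckpoint are illustrative placeholders (not Azure APIs); in a real role you would back the checkpoint with durable storage (table storage, blobs, a database) so that any instance can resume the job.

```csharp
using System.Collections.Generic;

// Sketch only: the helpers below are placeholders, not Azure SDK calls.
public class WorkChunk
{
    public string JobId { get; set; }
    public int Index { get; set; }
}

public class ChunkedWorker
{
    public void ProcessJob(string jobId, IReadOnlyList<WorkChunk> chunks)
    {
        // Resume from the last durably recorded checkpoint.
        int start = LoadCheckpoint(jobId);

        for (int i = start; i < chunks.Count; i++)
        {
            // Each chunk must be safe to run twice: the instance can disappear
            // (scale-in, reboot, power loss) after the work but before the checkpoint.
            ProcessChunk(chunks[i]);

            // Persist progress so another instance can pick up where this one stopped.
            SaveCheckpoint(jobId, i + 1);
        }
    }

    protected virtual void ProcessChunk(WorkChunk chunk) { /* do the actual work */ }
    protected virtual int LoadCheckpoint(string jobId) { return 0; }       // read from durable storage
    protected virtual void SaveCheckpoint(string jobId, int nextIndex) { } // write to durable storage
}
```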
Having said that, you can manually create a scaling solution that only removes instances that are not doing work, but doing so will require a fair bit of code on your part. Essentially you will use a signaling mechanism running in each instance that will let some external service (a Logic app or WebJob or something like that) know when an instance is free or busy, and that external service can delete the free instances using the Delete Role Instances API (https://learn.microsoft.com/en-us/rest/api/compute/cloudservices/rest-delete-role-instances).
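If you do go the manual route, the in-instance half can be as small as a heartbeat that each instance writes somewhere the external service can read. A rough sketch, assuming one status blob per instance (the container name and the "busy"/"free" convention are made up for illustration); the external WebJob or Logic App would read these blobs and call the Delete Role Instances API only for instances reporting "free":

```csharp
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Sketch of the in-instance half of an "only remove idle instances" scaler.
public static class InstanceStatusReporter
{
    public static void Report(bool busy, string storageConnectionString)
    {
        var account = CloudStorageAccount.Parse(storageConnectionString);
        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("instance-status"); // assumed name
        container.CreateIfNotExists();

        // One blob per role instance, keyed by the instance id (e.g. "MyWorkerRole_IN_3").
        var blob = container.GetBlockBlobReference(RoleEnvironment.CurrentRoleInstance.Id);
        blob.UploadText(busy ? "busy" : "free");
    }
}
```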
For more discussion on this topic see:
How to Stop single Instance/VM of WebRole/WorkerRole
Azure autoscale scale in kills in use instances
Another solution, though it breaks the assumption that you are using an Azure cloud service: if you use App Services instead, you will be able to set up autoscaling on the App Service plan, effectively taking care of the instance drops you are experiencing.
This is an infrastructure change, so it's not a two-click thing, but I believe App Services are better suited to many situations, including this one.
You can weigh the pros and cons, but if your product is traffic-managed this switch will not be painful.
Kwill, thanks for the links/information; the top item in the second link was the best compromise.
Each work item usually took under 5 minutes and the service already re-handled failed processes, so after some research it was decided to track whether the service was processing a queue item and to use a while loop in the RoleEnvironment.Stopping event to delay restart and scale-in events until the process had a chance to finish.
App Insights was used to track custom events during the Stopping event, to measure how often the process completes versus being restarted during the delay cycles.
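For anyone taking the same approach, the general shape of that Stopping handler is roughly the following; the IsProcessing flag, the 5-minute cap and the telemetry event names are placeholders for whatever your service already tracks, not code from the original solution:

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;
using Microsoft.WindowsAzure.ServiceRuntime;

// Sketch: delay scale-in/restart in the Stopping event until the current
// queue item has finished, and record what happened in App Insights.
public class WorkerRole : RoleEntryPoint
{
    private static volatile bool IsProcessing; // set true/false around queue-item processing

    private readonly TelemetryClient telemetry = new TelemetryClient();

    public override bool OnStart()
    {
        RoleEnvironment.Stopping += OnStopping;
        return base.OnStart();
    }

    private void OnStopping(object sender, RoleEnvironmentStoppingEventArgs e)
    {
        // Work items normally take under 5 minutes, so wait at most that long.
        var deadline = DateTime.UtcNow.AddMinutes(5);

        while (IsProcessing && DateTime.UtcNow < deadline)
        {
            telemetry.TrackEvent("StoppingDelayed"); // one custom event per delay cycle
            Thread.Sleep(TimeSpan.FromSeconds(10));
        }

        telemetry.TrackEvent(IsProcessing ? "StoppedWhileProcessing" : "StoppedClean");
        telemetry.Flush();
    }
}
```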
I have an Azure web role project which involves a long startup task of installing 3rd-party software on the instance.
Occasionally I've seen instances that don't respond, so I'm implementing a probe for the load balancer to take note of this and not direct traffic to bad instances.
This of course isn't enough. What I'd want is for Azure (the fabric?) to then reboot the instance, and if that doesn't help (that is, doesn't make the instance reply properly to the probe), reimage the instance.
Is that the behavior, and if so, where is that documented? I searched for quite a while but didn't find anything useful.
Thanks
Using the management API you should be able to externally monitor your role instances. Then, if one is taking too long, you should be able to force it to be re-imaged.
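Something along these lines should work, assuming the classic Service Management REST API with certificate authentication; the URL shape, ?comp=reimage and the x-ms-version value here are from memory, so verify them against the Service Management documentation before relying on this:

```csharp
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

// Sketch of forcing a reimage through the classic Service Management REST API.
public static class RoleInstanceReimager
{
    public static async Task ReimageAsync(
        string subscriptionId, string cloudService, string deployment,
        string roleInstanceName, X509Certificate2 managementCertificate)
    {
        var handler = new HttpClientHandler();
        handler.ClientCertificates.Add(managementCertificate); // management certificate auth

        using (var client = new HttpClient(handler))
        {
            client.DefaultRequestHeaders.Add("x-ms-version", "2014-06-01"); // assumed version

            var url = $"https://management.core.windows.net/{subscriptionId}" +
                      $"/services/hostedservices/{cloudService}/deployments/{deployment}" +
                      $"/roleinstances/{roleInstanceName}?comp=reimage";

            var response = await client.PostAsync(url, new StringContent(string.Empty));
            response.EnsureSuccessStatusCode(); // the operation is accepted asynchronously
        }
    }
}
```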
http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx describes the health of a role instance, what Azure does for recovery, and how to use a load balancer probe.
When you say that your instance doesn't respond, does that mean that the instance shows as Busy (or something besides Ready) in the portal, or just that IIS isn't responding to requests? If the former (instance showing Busy) then you don't need a load balancer probe since Azure will automatically remove that instance from rotation. If the latter (IIS not responding) then you can potentially implement a StatusCheck event in your web code such that if w3wp itself is having a problem then the instance will be taken out of rotation by the fabric, but if w3wp itself is healthy and it is just the requests that are not responding then you will need the load balancer probe.
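If you take the StatusCheck route, the handler lives in the web application itself (so it runs inside w3wp) and looks roughly like this; IsHealthy() is a placeholder for whatever self-test makes sense for your app:

```csharp
using System;
using System.Web;
using Microsoft.WindowsAzure.ServiceRuntime;

// Sketch: when the app decides it is unhealthy, SetBusy() reports the instance
// as Busy so the fabric takes it out of the load balancer rotation until a
// later status check passes again.
public class Global : HttpApplication
{
    protected void Application_Start(object sender, EventArgs e)
    {
        RoleEnvironment.StatusCheck += (s, args) =>
        {
            if (!IsHealthy())
            {
                args.SetBusy();
            }
        };
    }

    private static bool IsHealthy()
    {
        // e.g. check that recent requests completed, a heartbeat timestamp, queue depth, etc.
        return true;
    }
}
```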
Having a good monitoring and recovery solution in place is very valuable, but I would recommend that instead of rebooting instances to mitigate a w3wp problem you should instead investigate the root cause of why your instances aren't responding. Fix the source of the problem rather than apply a Band-Aid :). The blog post at http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx, and in particular the troubleshooting scenario 5, may be a good place to start the investigation.
I'm utilizing Azure for hosting a cloud service, which I recently modified to be scalable across multiple instances, including a session caching worker role. My question is, why would I be seeing extreme load (upwards of 90%) on one instance, but not on other instances (15-20% across all other instances)? Should I be worried?
Before I set up load balancing and when my single instance hit upwards of 95% load, it would slow to a crawl --- becoming unusable. Is there any way to ensure that I don't have any users experiencing this because they're somehow round-robin'd onto the overloaded instance?
We ran into a similar situation when one load-balanced instance failed over: all the load transferred, but it wouldn't balance out again. We found that turning off keep-alive for a couple of minutes let the load spread again, after which we could turn it back on.
http://technet.microsoft.com/en-us/library/cc772183(v=ws.10).aspx
Well... the Azure load balancer is based on round robin, so the distribution should be almost equal (something like 60-40 or even 70-30 is still acceptable). So just to be sure: are you sure you're not using the IIS "redirect" feature (I forgot the name of the feature) that would set sticky sessions?
I must say that without further details about what your site actually does and how, it's quite hard to advise... This behaviour is strange, but it's not clear that the load balancer is at fault.
Edit1: I would suggest that you examine further what the 90% instance is doing by tracing its activity... Maybe you're out of luck and the requests that cause heavy load are landing on that machine while the quick ones are being handled by the other. Another possibility is that something is stuck (maybe an infinite loop). If you have implemented a scalable architecture, I would recommend provisioning another machine and killing the one that is suffering.
Edit2: A simple way to verify that the load balancer is working: log in remotely to the service machines and replace something like an image that is displayed on the main page (something you can easily spot just by looking at the page). On server 1 put, say, a yellow image and on server 2 a red one (ok... maybe something less drastic, but you get the point). Then keep loading the page again and again...
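A less manual variant of the same check, assuming a cloud service web role: instead of swapping images, stamp every response with the id of the instance that served it and watch the header change (or not) while refreshing. The header name is arbitrary:

```csharp
using System.Web;
using Microsoft.WindowsAzure.ServiceRuntime;

// Adds an "X-Instance-Id" header (e.g. "MyWebRole_IN_0") to every response so you
// can see which instance handled each request. Register the module in web.config
// under system.webServer/modules.
public class InstanceIdModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            var response = ((HttpApplication)sender).Response;
            response.AppendHeader("X-Instance-Id", RoleEnvironment.CurrentRoleInstance.Id);
        };
    }

    public void Dispose() { }
}
```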