Suddenly I stopped receiving CPU and the other normal metric on 2 of my VMs and therefore the scale function wasn't operating. Did something change that I need to enable something for this to work? I went onto portal and turned on Diagnosticians including Basic metrics.
I've seen this happen multiple times as well, the availability of the metrics is just not very reliable. They usually come back after a while.
Related
since a month one of our web application hosted as WebApp on Azure is having some kind of problem and I cannot find the root cause of that.
This WebApp is hosted on Azure on a 2 x B2 App Service Plan. On the same App Service Plan there is another WebApp that is currently working without any issue.
This WebApp is an ASP.NET WebApi application and exposes a REST set of API.
Effect: without any apparent sense (at least for what I know by now), the ThreadCount metric starts to spin up, sometimes very slowly, sometimes in few minutes. What happens is that no requests seems to be served and the service is dead.
Solution: a simple restart of the application (an this means a restart of the AppPool) causes an immediate obvious drop of the ThreadCount and everything starts as usual.
Other observations: there is no "periodicity" in this event. It happened in the evening, in the morning and in the afternoon. It seems that evening is a preferred timeframe, but I won't say there is any correlation.
What I measured through Azure Monitoring Metric:
- Request Count seems to oscillate normally. There is no peak that causes that increase in ThreadCount
- CPU and Memory seems to be normal, nothing strange.
- Response time, like the others metrics
- Connections (that should be related to sockets) oscillates normally. So I'd exclude something related to DB connections.
What may I do in order to understand what's going on?
After a lot of research, this happened to be related to a wrong usage of Dependency Injection (using Ninject) and an application that wasn't designed to use it.
In order to diagnose, I discovered a very helpful feature in Azure. You can reach it by entering into the app that is having the problem, click on "Diagnose and solve problems" then click on "Diagnostic tools" and then select "Collect .NET profiler report". In that panel, after configuring the storage for the diagnostic files, you can select "Add thread report".
In those report you can easily understand what's going wrong.
Hope this helps.
We're looking for automated way to horizontally, vertically scale the pull of self hosted integration runtime virtual machines used in ADF.
Reading Microsoft docs does not provide answer.
Well, I don't have the experience, so I can only give you a theoretical answer, but maybe it's helpfull for you.
AFAIK, neither way is configurable out-of-the-box. For scale-out you'll have to deploy an additional IR machine yourself. So probably you'll want to create an image that you can provision from docker or kubernetes and has the IR and pre-requirements installed. The IR installation provides an PowerShell script that can be used to create an automated connection.
For scale-up/down, you'll have to run some script that scales your vm. In an IaaS solution (f.e.) Azure VM, that should be doable with an API call to change your VM.
For both cases you'll have to have some kind of montitor in place that monitors the IR loads and makes changess as needed. I think the measures provided in the Data Factory should do. Maybe you can use Log Analyics to monitor the loads.
I'm curious about your use case for this.
My solution is just for scaling out/in since the VM must be restarted if you are scaling up/down, which causes downtime and job failures etc.
At a high level this solution requires just 3 simple things:
Azure Metric Alert that fires when Scale-Out should occur (VM Start)
Azure Metric Alert that fires when Scale-In should occur (VM Deallocation)
Logic App that is triggered by Azure Alert and actually executes the Start/Stop of the VM, along with any other automation associated with this (eg posting to a Teams channel when Scale in/out occurs)
Here are more of the details surrounding how we setup the conditions for the alerts, but the main thing to keep in mind is (IR CPU %, IR queue length, Number of Nodes, and possibly IR Memory)
Scale-Out
Scale-In
Actions for Alerts
As you can see below we have the alert triggering 1 Logic App, using the payload that is passed to the Logic App, you can determine if the Logic App should be starting the VM, or stopping the VM. (As well as any other additional actions)
Logic App
There is a small chance that due to timing (and depending on how many ADF's the IR is shared to), that pipeline activities could be sent to Node 2 at the same time a deallocation command is sent to the VM for Node 2. I have not seen this as of yet, but adjusting the alert conditions based on your need could help avoid this. Feel free to play around with the conditions of the alerts, granularity, thresholds, etc. This is not a one size fits all solution.
We're experiencing CPU spikes on our Azure App Service Plan for no obvious reason. Its not something that stops the service, but we'd like to have an understanding of when&how that kind of things happen.
For example, CPU percentage sits at 0-1% range for days but then all of the sudden it spikes to 98%, 45%, 60% and comes back to 0-1% range very quickly. Memory stays unchanged at comfortable 40-45% level, no incoming requests to it, no web jobs, nothing unusual in logs, no failures, service health ok, nothing we could point our finger to as a reason.
We tried to find out through kudu > support > analyze (metrics)...but we couldn't get request submited. It just keeps giving error to try later.
There is only one web app running in that app service plan, its a asp.net core 2.0. web api.
Could someone shed some light on this kind of behavior? Is this normal, expected? If so, why it happens? Is there a danger that it spikes to 90% and don't immediately come back?
Just, what's going on?
After speaking with MS support i've got an answer it is a normal behavior coming from their monitoring tool:
We reviewed our internal tools taking as starting point 12/26 and
today 12/29 and we could notice that this was majority System
processes doing background tasks, which is normal for each sandbox
environment. In your case, it was mostly MonAgentCore.exe fluctuating
in CPU which is our diagnostic log capturing process and this looks
like a very temporary spike and appears normal.
By looking at my Pingdom reports I have noted that my WebSite instance is getting recycled. Basically Pingdom is used to keep my site warm. When I look deeper into the Azure Logs ie /LogFiles/kudu/trace I notice a number of small xml files with "shutdown" or "startup" suffixes ie:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might be to do with MS patching VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicions that it is happening at the OS level. Is there a way to find out why my Instance is being shutdown?
I also admit I am using a one S2 instance scalable to three dependent on CPU usage. We may have to review this to use a 2-3 setup. Obviously this doubles the costs.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with status of "succeeded", however nothing for the times I saw the above files for. So it seems that the "instance" is being shutdown, but the event is not appearing in the "Operation Log". Why would this be? Had about 5 yesterday, yet the last "Operation Log" entry was 29/7.
An example of one of yesterday's shutdown xml file:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
You should see entries regarding backend maintenance in operation logs like this:
As for keeping your site alive, standard plans allows you to use the "Always On" feature which pretty much do what pingdom is doing to keep your website warm. Just enable it by using the configure tab of portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
Every site on Azure runs 2 applications. 1 is yours and the other is the scm endpoint (a.k.a Kudu) these "shutdown" traces are for the kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself just like kudu does. If you don't have Always On enabled, Kudu get's shutdown after an hour of inactivity (as far as I remember).
Aside from that, like you mentioned Azure will shutdown your app during machine upgrade, though I don't think these shutdowns result in operational log events.
Are you seeing any side-effects? is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. We bring the site up on a new machine before shutting it down on the old one and letting connections drain, however this should not result in any perceivable downtime.
This morning I found 5 of my Azure Virtual machines to be stuck in Starting mode.
All other VMs are working ok.
I managed to stop the VMs using the Azure command shell and then start them again but they are still stuck in starting mode with no end in sight.
It has now been over 5 1/2 hours and still stuck in starting mode.
I have contacted Microsoft support but they are taking hours to respond :(((
The Azure Status page doesn't show anything is wrong in my region.
Has anybody else experiencing this problem?
We've had the same issue and it's linked to a big issue Azure is having this morning.
The trick we used in order to get the instance running again is:
1. stop the VMs via Powershell
2. change the size of the vm and back (preferably from A to D as this is different hardware)
3. start the VM
We also have people complaining about RDP not working where reboots fixed the problem.
There are currently some problems with Azure, including the VM service. Also the status page does not reflect all of the problems. Here you have to keep in mind that this page also show impacts affecting most of the service customers. It does not reflect minor outages to single customers. You should keep an eye at the Azure blog which possibly gives a statement related to the current problems.
What works for me is a redeploy of the Virtual Machine within the Azure Portal whenever it gets stuck at "Starting...". Altho it takes half an hour to redeploy, it solves the issue. More details here.
Same problem I experienced and what I did is I resized Virtual Machine's Disk Size, You can go for increasing the whole VM size / power but for me the Disk size fixed it, probably it was updating and the disk file ran out.