We have an Azure App Service (running Umbraco), and once a day the system stops accepting requests. We know this because we have added logging to Application_BeginRequest, and during the 10-15 minute outage, no requests are logged. We also log Application_EndRequest and keep a running counter of "active" requests, as we thought the problem might be related to traffic volume; however, the outage still occurs even when there are 0 active requests. The outage corrects itself after about 15 minutes without intervention, although restarting the App Service "corrects" it more quickly. There is nothing in the Umbraco logs (errors or otherwise) to suggest anything is running. Upgrading Umbraco to the latest version (twice) has not resolved the issue, and we have several other similarly sized and similarly trafficked Umbraco sites that do not have this problem.
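For reference, the instrumentation is essentially this (a sketch; Trace is standing in for whatever logger the site actually uses):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Web;

public class Global : HttpApplication
{
    // Running counter of in-flight requests, shared across the app domain.
    private static int _activeRequests;

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        int active = Interlocked.Increment(ref _activeRequests);
        Trace.TraceInformation(
            "BeginRequest {0} (active: {1})", Request.RawUrl, active);
    }

    protected void Application_EndRequest(object sender, EventArgs e)
    {
        int active = Interlocked.Decrement(ref _activeRequests);
        Trace.TraceInformation(
            "EndRequest {0} (active: {1})", Request.RawUrl, active);
    }
}
```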
During the outage:
CPU stays steady at 5-20% usage
Memory stays steady at 30% usage
DTU (on the Azure SQL DB) stays steady at 10-20%
This tells me that none of the above is the bottleneck, so my guess is some sort of deadlock or I/O issue. Which leads to the question: from an infrastructure standpoint, knowing the above, what are some possible causes of this behavior?
We have a question about Azure: in some cases we see dead time when requests arrive at one of our App Services, or when Service Bus triggers one of our Azure Functions.
The following Application Insights trace shows an example:
[AppInsights example image]
We execute a request that itself runs in about 5 seconds, but Azure takes more than 30 seconds to start executing it. We have made a lot of optimizations in our apps, but we have no visibility into this delay.
Has anyone faced the same issue and found a solution? We believe it is a performance issue in the workers, but it also happens when the workers are under low memory and CPU load, so we don't know how to auto-scale the resource horizontally when it shows no load.
This also happens in our Azure Functions, though there we believe it is an issue between Service Bus and the Functions container. In those cases we found the Function has higher CPU consumption, but we don't know why, because in the local environment we process a lot of messages with multithreading without any problem.
We are running a setup on Azure consisting of:
S3 web app in UK South
S2 failover in UK West
200DTU Elastic Pool with around 25 databases
Redis server
Several times this week we have had periods where Redis stops hitting and missing data, and no additional items are added to the cache. In effect, caching completely ceases to be available.
Flushing the cache does not make any difference: nothing is added, nothing is hit or even missed.
The only way to re-enable it is to restart the web app itself, after which everything is back to normal.
Our developers are looking into potential causes in our codebase, but I wonder if anyone has any ideas on how to diagnose or solve this issue.
Thanks
If restarting your web app fixes it, it sounds like contention on the client side. You might find this link useful: https://gist.github.com/JonCole/db0e90bedeb3fc4823c2#file-diagnoserediserrors-clientside-md. It is most likely thread-pool or CPU contention on the client machines hosting your web app.
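One quick way to check for that contention is to log the busy thread counts, roughly as that gist suggests (a sketch; the counts come straight from System.Threading.ThreadPool):

```csharp
using System.Diagnostics;
using System.Threading;

static class ThreadPoolProbe
{
    // Logs how many worker and IOCP threads are currently busy.
    // Sustained growth here, especially past the minimum thread count,
    // is the classic client-side contention signature.
    public static void LogBusyThreads()
    {
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
        ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIo);
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);

        Trace.TraceInformation(
            "Busy worker: {0}/{1} (min {2}), busy IOCP: {3}/{4} (min {5})",
            maxWorker - freeWorker, maxWorker, minWorker,
            maxIo - freeIo, maxIo, minIo);
    }
}
```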
We're experiencing CPU spikes on our Azure App Service plan for no obvious reason. It's not something that stops the service, but we'd like to understand when and how this kind of thing happens.
For example, the CPU percentage sits in the 0-1% range for days but then all of a sudden spikes to 98%, 45%, 60% and comes back to the 0-1% range very quickly. Memory stays unchanged at a comfortable 40-45% level, there are no incoming requests, no web jobs, nothing unusual in the logs, no failures, and service health is OK; nothing we could point a finger at as the reason.
We tried to investigate through Kudu > Support > Analyze (metrics), but we couldn't get the request submitted; it just keeps returning an error telling us to try later.
There is only one web app running in that App Service plan; it's an ASP.NET Core 2.0 Web API.
Could someone shed some light on this kind of behavior? Is it normal and expected? If so, why does it happen? Is there a danger that it spikes to 90% and doesn't immediately come back?
In short: what's going on?
After speaking with MS support, I got an answer: it is normal behavior coming from their monitoring tool:
"We reviewed our internal tools, taking as starting point 12/26 and today 12/29, and we could notice that this was majority System processes doing background tasks, which is normal for each sandbox environment. In your case, it was mostly MonAgentCore.exe fluctuating in CPU, which is our diagnostic log capturing process, and this looks like a very temporary spike and appears normal."
I maintain an Azure cloud service. It is set to auto-scale based on load. To monitor the health of this service, I have another service that pings it every 2 minutes. The usual response time from this service is around 100 ms.
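For context, the pinging service is essentially a loop like this (a sketch; the URL is a placeholder and the timeout is arbitrary):

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class PingMonitor
{
    // Pings the service every 2 minutes and logs the response time,
    // flagging anything that fails or times out.
    static async Task Main()
    {
        var client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };
        var url = "https://example-service/health"; // placeholder URL

        while (true)
        {
            var sw = Stopwatch.StartNew();
            try
            {
                var response = await client.GetAsync(url);
                Console.WriteLine(
                    $"{DateTime.UtcNow:o} {(int)response.StatusCode} in {sw.ElapsedMilliseconds} ms");
            }
            catch (Exception ex)
            {
                Console.WriteLine(
                    $"{DateTime.UtcNow:o} NO RESPONSE after {sw.ElapsedMilliseconds} ms: {ex.Message}");
            }
            await Task.Delay(TimeSpan.FromMinutes(2));
        }
    }
}
```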
Once or twice a week I see that the service does not respond. It is not really a worry for me, because it happens quite infrequently, but I am still trying to figure out what could be causing it. I do not think the problem is with the pinging service: none of the other services it pings (not on Azure, but on other servers) show any issues.
What could be causing these occasional delays? Are any other Azure service owners seeing such delays?
I am having quite similar problems, but I use Application Insights, so I have some statistics: for example, response time increases together with SQL Azure access time and CPU usage. My average response time according to Application Insights is about 600 ms and average RPS is about 0.6. During these problems RPS is usually higher than average (up to 1.5), but average response time grows to as much as 1 minute! (During the day my RPS can grow to 3 or even higher without any response-time growth.) Since I have a 1-minute SQL connection timeout and I see a dramatic growth in total SQL Azure access time during these periods, I can assume the problem is caused by SQL Azure. This also happens once a day or two, for about 10-15 minutes at most, and my ping service also always reports that the service doesn't respond.
So my advice here: install Application Insights to analyze what happens during these response delays. It would be great if you shared your results here.
P.S. I also use autoscale based on load, though it doesn't really help in these particular situations.
In our still-in-development project we have noticed sudden delays when accessing our ASP.NET Web API services. Using the awesome Mini Profiler, we pinned the delays down to the moments when connections to the Azure Data Cache (Preview) service are dropped and have to be re-established. This process takes about 3.3 seconds; after reconnecting, getting an object from the cache takes 1.4 ms.
When I increased maxConnectionsToServer from 1 to 20, I noticed another thing. If I don't make requests to the Web API for 1 or 2 minutes (which is usually when the connections are dropped) and then start making calls, the next 20 requests are each delayed by 3.3 seconds, which I guess is how connection pooling works (round-tripping the connections from the pool).
Both the Web API and the Caching service are hosted in the East US region; we have disabled the local cache, SSL is disabled, and auto-discover is enabled.
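In code, that client configuration looks roughly like this (a sketch against the Microsoft.ApplicationServer.Caching client; the endpoint name is a placeholder):

```csharp
using Microsoft.ApplicationServer.Caching;

static class CacheSetup
{
    public static DataCache CreateCache()
    {
        var config = new DataCacheFactoryConfiguration();

        // The setting discussed above, raised from 1 to 20.
        config.MaxConnectionsToServer = 20;

        // Auto-discover is enabled; the endpoint name is a placeholder.
        config.AutoDiscoverProperty =
            new DataCacheAutoDiscoverProperty(true, "ourcache.cache.windows.net");

        // Local cache stays disabled (the default); SSL is off.
        var factory = new DataCacheFactory(config);
        return factory.GetDefaultCache();
    }
}
```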
So, I'm wondering whether something is wrong with our configuration, or whether this happens because Azure Cache is still in preview.
Any information will be appreciated.
Thanks!
It sounds like your shared cache is being offloaded due to inactivity. One way to test this would be to add an In-Role Cache to an existing service (if available) and swap your cache usage to this new cache. In-Role cache is described here.
Once the cache is moved off of the shared offering, wait the requisite 1-2 minutes for the idle timeout and retry the connection; the delay should not be present.
Assuming you want to stick with the shared cache option after isolating the problem, the only workaround I am currently aware of is running a background task that periodically pings the cache to keep it alive.
If you are running a full Web role, you can launch a background task on application start-up, along the lines of the sketch below.
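A minimal sketch of such a keep-alive task (assuming a DataCache client like the one shown earlier; the key and interval are arbitrary):

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationServer.Caching;

static class CacheKeepAlive
{
    private static Timer _timer;

    // Call this from Application_Start. Touches the cache every 60 seconds
    // so the shared offering never sees the 1-2 minute idle window.
    public static void Start(DataCache cache)
    {
        _timer = new Timer(_ =>
        {
            try { cache.Get("keepalive"); } // the key itself doesn't matter
            catch { /* ignore transient failures; this is only a heartbeat */ }
        }, null, TimeSpan.Zero, TimeSpan.FromSeconds(60));
    }
}
```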
If you are deploying via Mobile Services, you can run the "ping" via Scheduled Jobs. The only issue you may run into here is that the minimum interval for a scheduled job is 1 minute, which may not be aggressive enough to keep your cache alive 100% of the time.
Nothing that I see points to you doing anything wrong per se. It may be that Azure is genuinely having problems getting the cache connections up and running quickly. According to several best-practice documents and MSDN posts, you want to increase the number of connections to your caches to allow for failover to an active connection, which you have effectively done with your configuration change.
Try making sure that your cache accessor is a static object (another MSDN recommendation). It may be a long shot, but also consider using the sliding-window option for object expiration and see if that not only resets the expiration countdown for the stored object, but also prompts the cache service to keep the connection alive.
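Something along these lines covers both suggestions (a sketch; GetSliding is a hypothetical helper that uses DataCache.ResetObjectTimeout to defer expiration on each read):

```csharp
using System;
using Microsoft.ApplicationServer.Caching;

public static class CacheAccessor
{
    // Static, lazily created cache client per the MSDN guidance:
    // DataCacheFactory is expensive, so build it once and share it.
    private static readonly Lazy<DataCache> _cache =
        new Lazy<DataCache>(() => new DataCacheFactory().GetDefaultCache());

    public static DataCache Cache
    {
        get { return _cache.Value; }
    }

    // Sliding-window style access: each successful read pushes the
    // object's expiration out by the given window.
    public static object GetSliding(string key, TimeSpan window)
    {
        object value = Cache.Get(key);
        if (value != null)
        {
            Cache.ResetObjectTimeout(key, window);
        }
        return value;
    }
}
```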