Azure App Service - Incoming connections spike without generating requests

I have a .NET Core 2.1 web app running in Azure App Service. Several times lately the app has crashed after encountering a high number of "Connections".
The only documentation I can find says this:
Connections
The number of bound sockets existing in the sandbox (w3wp.exe and its child processes). A bound socket is created by calling bind()/connect() APIs and remains until said socket is closed with CloseHandle()/closesocket().
When looking at the metrics, Connections seems to spike while Requests stays fairly flat.
What could cause this? The documentation seems to suggest it's the total number of open TCP connections. If these are just standard API calls, why wouldn't they register as requests as well? Could it be caused by an underlying problem (such as a dependency), with connections being held open as response latency increases?
Edit: Sorry, at the time the web app was running .NET Core 2.2. I've since rolled back to 2.1 and am seeing no instability issues, but neither have I seen another spike in connections.
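Outbound calls to dependencies are one way the Connections metric can climb while Requests stays flat: each outbound call binds a socket in the sandbox but is never counted as an incoming request. As a minimal, hedged sketch of that mechanism (the class name and URL are placeholders, not from the original app), reusing a single HttpClient rather than creating one per call keeps the number of bound sockets down:

using System.Net.Http;
using System.Threading.Tasks;

// Illustrative only: "OrdersClient" and the URL are placeholders.
public class OrdersClient
{
    // Anti-pattern: a new HttpClient per call opens a fresh connection each
    // time; under load this churns sockets and can exhaust outbound ports,
    // even though the incoming Requests metric never changes.
    public static async Task<string> GetOrdersLeakyAsync()
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync("https://example.com/api/orders");
        }
    }

    // Better: one shared HttpClient (or IHttpClientFactory on ASP.NET Core
    // 2.1+) reuses pooled connections, so bound sockets stay roughly constant.
    private static readonly HttpClient SharedClient = new HttpClient();

    public static Task<string> GetOrdersAsync()
    {
        return SharedClient.GetStringAsync("https://example.com/api/orders");
    }
}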

Since you said you have Application Insights installed, you are most likely the victim of these issues, which are fixed in the latest versions:
https://github.com/Microsoft/ApplicationInsights-dotnet/issues/594
https://github.com/Microsoft/ApplicationInsights-aspnetcore/issues/690
Please update to the latest stable version of the SDKs and see if it helps.
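For example, a hedged .csproj fragment bumping the ASP.NET Core Application Insights package (the version shown is only illustrative; take the latest stable release from NuGet):

<ItemGroup>
  <!-- Illustrative version; use the latest stable release from NuGet -->
  <PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.6.1" />
</ItemGroup>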

Related

Azure search - connection timeout

I've had a couple of random instances, within a one-hour period, of the Azure Search service returning a connection timeout. It is being called from a .NET Core web application running as an Azure App Service.
App Insights shows a dependency failure at the same time (a POST to /indexes('products')/docs/search.post.search?api-version=2019-05-06) with a response of "Faulted".
Any help or ideas on why this happened and how I can prevent it would be appreciated.
You could be attempting to retrieve too much data at once, or you may be getting throttled because of too much traffic. The reason for the timeout can't be determined without more context.
To avoid timeouts, you could reduce the response size, limit the number of requests, or address the underlying root cause.
Also, consider implementing a retry mechanism with exponential backoff. See this thread for information: Azure Search .net SDK- How to use "FindFailedActionsToRetry"?
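A minimal retry-with-backoff sketch along those lines (the attempt count, delays, and the commented-out search call are assumptions, not part of the original answer):

using System;
using System.Threading.Tasks;

public static class SearchRetry
{
    // Retries a transient-failure-prone operation with exponential backoff.
    // Pass the Azure Search query in as the delegate.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation, int maxAttempts = 4)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Wait 1s, 2s, 4s, ... between attempts; tune for your workload.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}

// Hypothetical usage with the Azure Search .NET SDK:
// var results = await SearchRetry.WithBackoffAsync(
//     () => indexClient.Documents.SearchAsync<Product>("query text"));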
As Dan mentioned, it is recommended to use retries, since failures can happen due to the network or many other reasons, and retrying will help improve your app's availability. However, if you are seeing failures repeatedly or need more information, please open a support issue so the support team can investigate further.

Azure WebApps leaking handles "out of nothing"

I have 6 WebApps (ASP.NET, Windows) running on Azure, and they have been running for years. I do tweak them from time to time, but no major changes.
About a week ago, all of them started to leak handles, as shown in the image: this is just the last 30 days, but the constant curve goes back "forever". Now, while I did make some minor changes to some of the sites, there are at least 3 sites that I did not touch at all.
But still, major leakage started for all sites a week ago. Any ideas what could be causing this?
I would like to add that one of the sites only has a single aspx page, and another site does not have any code at all; it's just there to run a WebJob containing the letsencrypt script. That hasn't changed for several months.
So basically, I'm looking for any pointers, but I doubt this can have anything to do with my code, given that 2 of the sites do not contain any of my code and still show the same symptom.
Final information from the product team:
The Microsoft Azure Team has investigated the issue you experienced, which resulted in an increased number of handles in your application. The excessive number of handles can potentially contribute to application slowness and crashes.
Upon investigation, engineers discovered that a recent upgrade of Azure App Service, containing improvements to platform monitoring, resulted in a leak of registry key handles in application worker processes. The registry key handle in question is not properly closed by a module which is owned by the platform and is injected into every Web App. This module ensures various basic functionalities and features of Azure App Service, such as correct processing of HTTP headers, remote debugging (if enabled and applicable), correct return of responses through load balancers to clients, and others. This module was recently improved to include additional information passed around within the infrastructure (not leaving the boundary of Azure App Service, so this information is not visible to customers). This information includes the versions of the modules which processed each request, so that internal detection of issues can be easier and faster when they are caused by component version changes. The issue is caused by not closing a specific registry key handle while reading the version information from the machine's registry.
As a workaround/mitigation in case customers see any issues (such as increased application latency), it is advised to restart the web app, which resets all handles and instantly cleans up all leaks in memory.
Engineers prepared a fix which will be rolled out in the next regularly scheduled upgrade of the platform. There is also a parallel rollout of a temporary fix which should finish by 12/23. Any apps restarted after this temporary fix is rolled out shouldn’t observe the issue anymore as the restarted processes will automatically pick up a new version of the module in question.
We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
• Fixing the registry key handle leak in the platform module
• Fixing the gap in test coverage and monitoring to ensure that such regressions will not happen again in the future and will be automatically detected before they are rolled out to customers
So it appears this is a problem with Azure. Here is the relevant part of the current response from Azure technical support:
==>
We have discussed this with the PG team directly and have observed that a few other customers are also facing this issue, so our product team is actively working to resolve it as soon as possible. There is a good chance that the fixes will be available within a few days, unless something unexpected comes up and prevents us from completing the patch.
<==
Will add more info as it becomes available.

ASP.NET Core 2.2 experiencing high CPU usage

So I have an ASP.NET Core 2.2 web service hosted on Azure (S2 plan). The problem is that my application sometimes gets high CPU usage (almost 99%). What I have done for now: checked Process Explorer on Azure. I see a lot of processes there that are consuming CPU. Maybe someone knows whether it's okay for these processes to consume CPU?
Currently, I have no idea where they come from. Maybe it's normal to have them here.
Briefly, about my application:
Currently, there is not much traffic: 500-600 requests a day. Most of the requests communicate with MS SQL by querying records, adding records, etc.
I am also using MS WebSockets, but the high CPU happens when no WebSocket client is connected to the web service, so I doubt that's the cause. I tried to use Apache ab for load testing, but there isn't any pattern where a load test reliably produces high CPU afterwards; sometimes it happens during load testing, sometimes it doesn't.
So I just updated the screenshot of the processes; I see that lots of threads are being locked/used while FluentMigrator is running its logging.
Update*
I will remove the FluentMigrator logging middleware from the Configure method and keep an eye on the situation.
UPDATE**
So I removed the FluentMigrator logging. So far I haven't noticed any CPU usage over 90%.
But still, I am confused. My CPU usage keeps oscillating. Is this a healthy CPU usage graph or not?
Also, I tried load testing the WebSocket server.
I made a script that calls some WebSocket functions every 100ms from 6-7 clients. So every 100ms there are 7 calls to the WebSocket server from different clients, and every function queries or inserts some data (approximately 3-4 queries per WebSocket function).
What I did notice: on Azure SQL S1 (20 DTU), after 2 minutes I run out of SQL pool connections. If I increase it to 100 DTU, it handles the 7 clients properly without any 'no connection pool' errors.
So the first question: is this CPU oscillation normal?
Second: should I get a 'no free SQL connection' error with this kind of load test on a 10 DTU Azure SQL database? I am afraid that by creating a scoped service inside the singleton WebSocket service I am leaking connections.
This topic is getting too long; maybe I should move it to a new question?
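On the scoped-service-inside-a-singleton concern above, a common pattern that avoids leaking connections is to create a DI scope per operation so the scoped DbContext (and its SQL connection) is disposed promptly. A hedged sketch, with AppDbContext and the handler name as placeholders rather than the poster's actual code:

using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

// Placeholder for whatever scoped DbContext the app actually registers.
public class AppDbContext : DbContext { }

public class WebSocketHandler
{
    private readonly IServiceScopeFactory _scopeFactory;

    public WebSocketHandler(IServiceScopeFactory scopeFactory)
    {
        _scopeFactory = scopeFactory;
    }

    public async Task HandleMessageAsync(string payload)
    {
        // Each message gets its own scope, so the scoped DbContext and its
        // underlying SQL connection are disposed (returned to the pool) when
        // the scope ends, instead of living as long as the singleton handler.
        using (var scope = _scopeFactory.CreateScope())
        {
            var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();
            // ... run this message's queries/inserts with db ...
            await db.SaveChangesAsync();
        }
    }
}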
At this stage I would say you need to profile your application and figure out which areas of your code are CPU intensive. In the past I have used dotTrace, which highlights the most expensive methods in a call tree.
Once you know which areas of your code base are the least efficient, you can begin to refactor them. This could simply mean changing some small operations, adding caching for queries, or using distributed locking, for example.
I believe the reason the other DLLs are showing CPU usage is that your code is calling methods within those DLLs.
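On the caching suggestion, here is a hedged sketch with IMemoryCache that caches a hot query result so bursts of requests don't re-run the SQL query (the service and the GetProductsFromDbAsync stand-in are made up for illustration):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class ProductService
{
    private readonly IMemoryCache _cache;

    public ProductService(IMemoryCache cache) => _cache = cache;

    public Task<List<string>> GetProductsAsync()
    {
        // Serve repeated requests from memory for a minute instead of
        // re-running the query (and spending CPU on it) every time.
        return _cache.GetOrCreateAsync("products", entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(1);
            return GetProductsFromDbAsync(); // hypothetical DB call
        });
    }

    private Task<List<string>> GetProductsFromDbAsync()
        => Task.FromResult(new List<string> { "example" });
}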

Azure App Service "Local Written Bytes"

I have an app service running that has 8 instances running in the service plan.
The app is written in ASP.NET Core; it's an older version than is currently available.
Occasionally I have an issue where the servers start returning a high number of 5xx errors after a period of sustained load.
It appears that only one instance is having an issue - which is causing the failed request rate to climb.
I've noticed that there is a corresponding increase in the "locally written bytes" on the instance that is having problems - I am not writing any data locally so I am confused as to what this metric is actually measuring. In addition the number of open connections goes high and then stays high - rebooting the problematic instance doesn't seem to achieve anything.
The only thing I suspect is that we are copying data from the user's request straight into Azure Blob Storage using UploadFromStreamAsync on the HttpRequest.Body, with the data coming from a mobile phone app.
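For context, the upload path in question looks roughly like this: a sketch using the older Microsoft.WindowsAzure.Storage SDK, with the container and blob name as placeholders, streaming the request body to the blob without this code writing anything to local disk:

using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.WindowsAzure.Storage.Blob;

public static class UploadHelper
{
    // Sketch only; not the poster's actual code.
    public static async Task CopyRequestToBlobAsync(
        HttpRequest request, CloudBlobContainer container, string blobName)
    {
        CloudBlockBlob blob = container.GetBlockBlobReference(blobName);

        // Streams the body to Blob storage as it arrives; no temp file is
        // written here, which is why the "locally written bytes" spike is
        // surprising.
        await blob.UploadFromStreamAsync(request.Body);
    }
}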
Microsoft support suggested we switch to using Local Cache as an option to reduce issues with storage; however, this has not resolved the issue.
Can anyone tell me what "locally written bytes" is actually measuring? There is little documentation on this metric that I can find on Google.

IIS Connection Pool interrogation/leak tracking

Per this helpful article I have confirmed I have a connection pool leak in some application on my IIS 6 server running W2k3.
The tough part is that I'm serving 300 websites written by 700 developers from this server in 6 application pools, 50% of which are .NET 1.1, which doesn't even show connections in the CLR Data performance counters. I could watch connections grow on my end if everything were .NET 2.0+, but I'm out of luck even with that slim monitoring tool.
My 300 websites connect to probably 100+ databases spread out between Oracle, SQL Server, and outliers, so I cannot watch the connections from the database end either.
Right now my best and only plan is to do a loose binary search for my worst offenders. I will kill application pools and slowly remove applications from them until I find which individual applications result in the most connections dropping when I kill their pool. But since this is a production box and I like continued employment, this could take weeks as a tracing method.
Does anyone know of a way to interrogate the IIS connection pools to learn their origin or owner? Is there an MSMQ trigger to which I might be able to attach when they are created? Anything silly I'm overlooking?
Kevin
(I'll include the error code to facilitate others finding your answers through search:
Exception: System.InvalidOperationException
Message: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.)
Try starting with this first article from Bill Vaughn.
Todd Denlinger wrote a fantastic class (http://www.codeproject.com/KB/database/connectionmonitor.aspx) which watches SQL Server connections and reports on ones that have not been properly disposed of within a period of time. Wire it into your site, and it will let you know when there is a leak.
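The usual root cause a monitor like that surfaces is a connection that is opened but not disposed on every code path. A minimal sketch of the leak and the fix (connection string and query are placeholders):

using System.Data.SqlClient;

public static class CustomerRepository
{
    // Leaky: if ExecuteScalar throws, Close() is never reached and the pooled
    // connection stays checked out until the pool hits its maximum size.
    public static object GetCountLeaky(string connectionString)
    {
        SqlConnection conn = new SqlConnection(connectionString);
        conn.Open();
        SqlCommand cmd = new SqlCommand("SELECT COUNT(*) FROM Customers", conn);
        object result = cmd.ExecuteScalar();
        conn.Close();
        return result;
    }

    // Fixed: using blocks guarantee Dispose (and return to the pool) even
    // when an exception is thrown.
    public static object GetCount(string connectionString)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand("SELECT COUNT(*) FROM Customers", conn))
        {
            conn.Open();
            return cmd.ExecuteScalar();
        }
    }
}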
