We run an App Service on Azure configured with 8 nodes. After the latest restart of the application, only one node is responding; we can see this in the live stream data in Application Insights. Requests from clients mostly fail because they are directed to the dead nodes.
We run a Windows environment with Java and Tomcat.
Any idea what could have gone wrong?
Sorry for the vague request, guys; this was something of an emergency. Anyway, the problem is resolved, and I hope this information proves useful to anyone running into the same kind of trouble.
The problem turned out to be with the Azure infrastructure, which experienced a failure and switched to a failover storage resource. Our application could not start after that. On the advice of Microsoft support, we downgraded our service plan and then upgraded it again, causing all nodes to be recreated from scratch. That helped.
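For anyone who needs to script the same recovery, the downgrade/upgrade dance can be automated. A rough sketch, assuming the azure-mgmt-web Python SDK; exact method and model names vary between SDK versions, and all resource names below are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.web import WebSiteManagementClient
from azure.mgmt.web.models import SkuDescription

# Placeholder subscription and resource names throughout.
client = WebSiteManagementClient(DefaultAzureCredential(), "<subscription-id>")

def set_plan_sku(resource_group: str, plan_name: str, sku: str, tier: str) -> None:
    """Redeploy the App Service plan with a new SKU, forcing node reallocation."""
    plan = client.app_service_plans.get(resource_group, plan_name)
    plan.sku = SkuDescription(name=sku, tier=tier)
    client.app_service_plans.begin_create_or_update(
        resource_group, plan_name, plan).result()

# Downgrade, then upgrade again, so all nodes are created from scratch.
set_plan_sku("my-rg", "my-plan", "B1", "Basic")
set_plan_sku("my-rg", "my-plan", "P1v2", "PremiumV2")
```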
Related
I have an API application that runs in a Docker container, and since moving to AWS, the API stops daily with the error: "Erlang closed the connection". I've monitored the server during that time, and IOPS do not seem to be causing the issue. Beyond that, when the API fails, it won't restart on its own on one of our clusters. I'm not sure where to find the logs to get more context and could use any input that may be helpful. For additional context: this API worked fairly well before in our data center on physical servers, but now, in AWS, it fails daily. Any thoughts or suggestions as to why it may be failing to restart?
I've looked at the syslogs and the application server logs and don't see any kind of failure; maybe I'm not looking in the right place. At this point, someone from one of our teams has to manually restart the API with an init.d command. I don't want to create a cron job to "fix" this, because that's a band-aid and not a true fix.
There really isn't enough information about the structure of your app or its connections, so no one can give you a "true fix".
The problem may be as small as configuring your nodes differently or changing some of the server's local configuration, or you may need some "keep-alive" handler towards AWS.
Maybe try adding a manual periodic dump to see if it's an accumulating problem, but if Erlang closed the connection, I believe there's a problem between your API and AWS, and your API can't tell where it is located.
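If the API talks to RabbitMQ (the "Erlang closed the connection" wording is typical of its Erlang-based broker), an explicit heartbeat is one such keep-alive: AWS NAT gateways and load balancers silently drop idle TCP connections, and the broker then closes its side. A minimal sketch assuming the pika Python client; the host and values are placeholders:

```python
import pika

# Heartbeats keep the TCP connection from looking idle to AWS NAT
# gateways / load balancers, which otherwise drop it silently; the
# broker then closes its side ("Erlang closed the connection").
params = pika.ConnectionParameters(
    host="broker.internal.example",   # placeholder broker address
    heartbeat=30,                     # exchange a heartbeat every 30 seconds
    blocked_connection_timeout=300,   # fail fast instead of hanging forever
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
```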
We are using Azure DevOps self-hosted agents to build and release our application. We often see the error below, from which the agent recovers automatically. Does anyone know what this error is, how to tackle it, and where exactly to check the logs for it?
We stopped hearing from agent <agent name>. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink?Linkid=846610
This seems to be a known issue with both self-hosted and Microsoft-hosted agents that many people have been reporting.
Quoting the reply from #zachariahcox from the Azure Pipelines Product Group:
To provide some context, the azure pipelines agent is composed of two processes: agent.listener and agent.worker (one of these per step in the job). The listener is responsible for reporting that workers are still making progress. If the agent.listener is unable to communicate with the server for 10 minutes (we attempt to communicate every minute), we assume something has Gone Wrong and abandon the job.
So, if you're running a private machine, anything that can interfere with the listener's ability to communicate with our server is going to be a problem.
Among the issues I've seen are anti-virus programs identifying it as a threat, local proxies acting up in various ways, the physical machine running out of memory or disk space (quite common), the machine rebooting unexpectedly, someone ctrl+c'ing the whole listener process, the work payload being run at a way higher priority than the listener (thus "starving" the listener out), unit tests shutting down network adapters (quite common), having too many agents at normal priority on the same machine so they starve each other out, etc.
If you think you're seeing an issue that cannot be explained by any of the above (and nothing jumps out at you from the _diag logs folder), please file an issue at https://azure.microsoft.com/en-us/support/devops/
If everything seems to be perfectly alright with your agent and none of the steps mentioned in the pipeline troubleshooting guide help, please report it on Developer Community, where the Azure DevOps team and the DevOps community actively answer questions.
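For intuition, the failure detection quoted above is a plain missed-heartbeat window. A toy sketch of the server-side logic (not the actual agent code, just the idea):

```python
import time

HEARTBEAT_INTERVAL = 60   # the listener reports in roughly every minute
ABANDON_AFTER = 10 * 60   # the server gives up after 10 minutes of silence

def check_agent(last_heartbeat_at: float) -> str:
    """Server-side view: abandon the job once the listener goes silent."""
    silence = time.monotonic() - last_heartbeat_at
    if silence >= ABANDON_AFTER:
        return "We stopped hearing from agent: abandoning the job"
    return f"healthy ({silence:.0f}s since last heartbeat)"

# Anything from the quoted list (AV kills, OOM, network loss, starvation)
# simply stops the heartbeats, and the window above expires.
print(check_agent(time.monotonic() - 45))        # healthy
print(check_agent(time.monotonic() - 11 * 60))   # abandoned
```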
I have 6 WebApps (ASP.NET, Windows) running on Azure, and they have been running for years. I do tweak things from time to time, but no major changes.
About a week ago, all of them started leaking handles, as shown in the image: this is just the last 30 days, but the constant curve goes back "forever". Now, while I did make some minor changes to some of the sites, there are at least 3 sites that I did not touch at all.
But still, major leakage started for all sites a week ago. Any ideas what could be causing this?
I would like to add that one of the sites has only a single aspx page, and another site has no code at all; it's just there to run a WebJob containing the letsencrypt script. That hasn't changed for several months.
So basically, I'm looking for any pointers, but I doubt this has anything to do with my code, given that 2 of the sites contain none of my code and still show the same symptom.
Final information from the product team:
The Microsoft Azure team has investigated the issue you experienced, which resulted in an increased number of handles in your application. An excessive number of handles can potentially contribute to application slowness and crashes.
Upon investigation, engineers discovered that a recent upgrade of Azure App Service, which included improvements to platform monitoring, resulted in a leak of registry key handles in application worker processes. The registry key handle in question is not properly closed by a module which is owned by the platform and is injected into every Web App. This module ensures various basic functionalities and features of Azure App Service, such as correct processing of HTTP headers, remote debugging (if enabled and applicable), correct return of responses through load balancers to clients, and others. This module was recently improved to include additional information passed around within the infrastructure (this information does not leave the boundary of Azure App Service, so it is not visible to customers). It includes the versions of the modules which processed each request, so that internal detection of issues caused by component version changes can be easier and faster. The issue is caused by not closing a specific registry key handle while reading the version information from the machine's registry.
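To make the described bug class concrete (illustrative only, not the platform module's actual code), here is what an unclosed registry key handle looks like in Python's winreg module, with a hypothetical key path:

```python
import winreg

# Leaky pattern: the handle returned by OpenKey is never closed, so every
# call leaks one registry key handle in the worker process.
def read_version_leaky() -> str:
    key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                         r"SOFTWARE\ExampleModule")         # hypothetical path
    value, _ = winreg.QueryValueEx(key, "Version")
    return value                                            # handle never closed

# Fixed pattern: the key object is a context manager and is closed on exit.
def read_version_fixed() -> str:
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                        r"SOFTWARE\ExampleModule") as key:  # hypothetical path
        value, _ = winreg.QueryValueEx(key, "Version")
    return value
```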
As a workaround/mitigation, in case customers see any issues (such as increased application latency), it is advised to restart the web app, which resets all handles and instantly cleans up all leaks in memory.
Engineers have prepared a fix, which will be rolled out in the next regularly scheduled upgrade of the platform. There is also a parallel rollout of a temporary fix, which should finish by 12/23. Apps restarted after this temporary fix is rolled out shouldn't observe the issue anymore, as the restarted processes will automatically pick up the new version of the module in question.
We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
• Fixing the registry key handle leak in the platform module
• Fixing the gap in test coverage and monitoring to ensure that such regressions will not happen again in the future and will be automatically detected before they are rolled out to customers
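For the restart workaround mentioned above, the restart can also be scripted. A sketch assuming the azure-mgmt-web Python SDK again; the names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.web import WebSiteManagementClient

# Placeholder names; a restart resets all process handles, clearing the leak.
client = WebSiteManagementClient(DefaultAzureCredential(), "<subscription-id>")
client.web_apps.restart("my-resource-group", "my-web-app")
```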
So it appears this is a problem with Azure. Here is the relevant part of the current response from Azure technical support:
==>
We have discussed this with the PG team directly and observed that a few other customers are also facing this issue; hence our product team is actively working to resolve it as soon as possible. There is a good chance that the fixes will be available within a few days, unless something unexpected comes up and prevents us from completing the patch.
<==
Will add more info as it becomes available.
For about a month now, one of our web applications, hosted as a WebApp on Azure, has been having some kind of problem, and I cannot find the root cause.
This WebApp is hosted on Azure on a 2 x B2 App Service Plan. On the same App Service Plan there is another WebApp that is currently working without any issue.
This WebApp is an ASP.NET Web API application and exposes a set of REST APIs.
Effect: without any apparent cause (at least as far as I can tell so far), the ThreadCount metric starts to climb, sometimes very slowly, sometimes within a few minutes. When that happens, no requests seem to be served and the service is dead.
Solution: a simple restart of the application (which means a restart of the AppPool) causes an immediate, obvious drop in ThreadCount, and everything starts up as usual.
Other observations: there is no periodicity to this event. It has happened in the evening, in the morning, and in the afternoon. Evening seems to be a preferred timeframe, but I wouldn't say there is any correlation.
What I measured through Azure Monitor metrics:
- Request Count oscillates normally; there is no peak that matches the increase in ThreadCount.
- CPU and Memory seem normal, nothing strange.
- Response Time behaves like the other metrics.
- Connections (which should be related to sockets) oscillate normally, so I'd exclude anything related to DB connections.
What can I do to understand what's going on?
After a lot of research, this turned out to be related to incorrect usage of Dependency Injection (using Ninject) in an application that wasn't designed to use it.
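The general failure mode, reduced to a minimal sketch (hypothetical Python, not our actual Ninject setup): a component with the wrong DI lifetime starts a background thread per request and is never disposed, so ThreadCount only ever grows:

```python
import threading

class Worker:
    """A component that starts a background thread when constructed."""
    def __init__(self):
        self._stop = threading.Event()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while not self._stop.wait(1.0):
            pass                      # poll some resource

    def close(self):
        self._stop.set()              # never called under the wrong lifetime

# Misconfigured DI: a fresh Worker per request, never closed or reused.
def handle_request():
    Worker()                          # should be a singleton, or disposed
    return "ok"

for _ in range(100):
    handle_request()
print(threading.active_count())       # climbs with every request
```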
To diagnose it, I discovered a very helpful feature in Azure. Open the app that is having the problem, click "Diagnose and solve problems", then "Diagnostic tools", and select "Collect .NET profiler report". In that panel, after configuring the storage for the diagnostic files, you can select "Add thread report".
In those reports you can easily see what's going wrong.
Hope this helps.
I noticed that, for some reason, one of the web roles stopped and restarted itself. Could someone help me understand in which scenarios a web role restarts itself?
And is there any way to find out why the web role restarted itself?
That happens once in a while when Azure performs guest OS upgrades: it stops instances while honoring upgrade domains and then starts them again shortly thereafter. This is the most frequent scenario; the same can happen if the server hosting the VM is diagnosed as faulty, but that happens quite rarely.
You should be ready for such restarts - they are normal - and your code should be designed to continue working after such a restart.
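As a rough illustration of restart-tolerant design (placeholder code, with a local file standing in for durable storage such as Azure Blob or Table storage): persist progress externally, so a freshly started instance resumes where the stopped one left off.

```python
import json, os

CHECKPOINT = "checkpoint.json"   # stand-in for durable external storage

def load_position() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["position"]
    return 0

def run(items) -> None:
    pos = load_position()                 # resume after any instance restart
    for i in range(pos, len(items)):
        print("processing", items[i])
        with open(CHECKPOINT, "w") as f:  # commit progress before moving on
            json.dump({"position": i + 1}, f)

run(["a", "b", "c"])
```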
Here's a post with more details on the upgrade process.