I'm using Azure DevOps Pipelines with a self-hosted build agent to deploy to App Services. We occasionally get "No space left on device" errors that cause deployments to fail.
The errors seem to occur somewhat at random, so we're having trouble pinpointing the cause.
We scheduled a maintenance job to clean up records, but we are still occasionally running into the issue. The failures tend to happen more often in the mornings (8AM-10AM EST) and after 4PM, but it's random enough that I'm not sure whether that's relevant. I've also had a Saturday with fairly consistent failures. Eventually the deployments start working again with no intervention on our end.
Any insight into what is going on, or potential solutions, would be greatly appreciated.
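For what it's worth, I'm considering adding a diagnostic step along the lines of the sketch below before each deployment (a Linux agent and the agent path are assumptions on my part), so I can at least correlate the failures with actual disk pressure:

#!/usr/bin/env bash
# Hypothetical pre-deployment check for a self-hosted agent; AGENT_HOME is a placeholder.
AGENT_HOME="${AGENT_HOME:-/home/azagent/myagent}"

echo "=== Disk usage at $(date -u +%FT%TZ) ==="
df -h /

echo "=== Largest pipeline working directories ==="
du -sh "${AGENT_HOME}/_work"/* 2>/dev/null | sort -rh | head -n 10

# Optionally prune working directories untouched for 3+ days
# (only safe when no other job is using them).
find "${AGENT_HOME}/_work" -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} +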
Related
We have multiple Terraform scripts that create/update hundreds of resources on Azure. If we want to change anything on the API Management related resources, it takes ages and regularly even times out. Running it again sometimes resolves the issue, but sometimes it also tells us that the API we want to create already exists, and similar errors.
The customer is getting really annoyed by us providing unreliable update scripts that create considerable extra work for the operations team, which is responsible for deploying and running the whole product. Saving changes in API Management also takes ages and runs into errors when we use the Azure portal.
Is there any trick or clue on how to improve our situation?
(This has been going on for a while now and feels like it's getting worse over time.)
I'd start by using the debugging options to sort out precisely which resources are taking the longest. You could consider breaking those out into a separate state so you don't have to refresh them on every run.
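For the debugging part, a minimal sketch (the log path and the grep pattern are just assumptions to adapt) is to capture provider debug logs for one run and look for large gaps in the timestamps around the API Management calls:

# Enable Terraform/provider debug logging for a single run and keep it in a file.
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform-debug.log

terraform plan -out=tfplan
terraform apply tfplan

# The log lines are timestamped, so large gaps between consecutive
# apimanagement entries point at the slow operations.
grep -i "apimanagement" terraform-debug.log | less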
Next, ensure that the process running Terraform has its timeouts set greater than Terraform's own. Killing Terraform mid-run is a good way to end up with a confused state.
Aside from that, there are some resources for which you can provide operation timeouts (where available). With those you can ensure Terraform treats the resource as failed before the process running Terraform gets killed.
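As a sketch of how those two deadlines should relate (the two-hour value is only an assumption; pick something longer than your slowest resource timeout), the wrapper in CI could look like this:

# Give the wrapping process a LONGER deadline than any Terraform operation timeout,
# so Terraform can mark a slow resource as failed and record it in state
# instead of being killed mid-run.
timeout --signal=INT --kill-after=60 7200 terraform apply -auto-approve
# 7200s outer limit; resource-level timeouts of, say, 90m would still fire first.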
I'd consider opening a bug on the azurerm provider or asking in the Terraform Section of the Community Forum.
Azure API Management is slow in applying changes because it's a distributed service. An update operation takes time because it waits until the changes are applied to all instances. If you are facing similar issues in the portal, that's a sign it has nothing to do with Terraform or AzureRM. I would contact Azure support, as they will have the telemetry to help you further.
In my personal experience, a guaranteed way to get things stuck is to make a lot of small changes in succession without waiting for the previous ones to finish, so I would start by checking for that.
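One cheap way to check for that (just a sketch; the name and resource group are placeholders, and it assumes the Azure CLI is available wherever you run Terraform) is to wait until the instance's provisioning state settles before starting the next run:

APIM_NAME="my-apim"          # placeholder
RESOURCE_GROUP="my-rg"       # placeholder

# Block until API Management is no longer mid-update before applying more changes.
until [ "$(az apim show --name "$APIM_NAME" --resource-group "$RESOURCE_GROUP" \
            --query provisioningState --output tsv)" = "Succeeded" ]; do
  echo "APIM is still applying a previous change; waiting 60s..."
  sleep 60
done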
Finally, if the previous steps don't help, I would try using Bicep/ARM to manage the APIM. Usually the ARM deployment API is a bit more robust than the APIs used by Terraform and the Go SDK.
We are using Azure DevOps self-hosted agents to build and release our application. We often see the error below, and the agent recovers automatically. Does anyone know what this error is, how to tackle it, and where exactly to check the logs for it?
We stopped hearing from agent <agent name>. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink?Linkid=846610
This seems to be a known issue with both self-hosted and Microsoft-hosted agents that many people have been reporting.
Quoting the reply from zachariahcox of the Azure Pipelines Product Group:
To provide some context, the azure pipelines agent is composed of two processes: agent.listener and agent.worker (one of these per step in the job). The listener is responsible for reporting that workers are still making progress. If the agent.listener is unable to communicate with the server for 10 minutes (we attempt to communicate every minute), we assume something has Gone Wrong and abandon the job.
So, if you're running a private machine, anything that can interfere with the listener's ability to communicate with our server is going to be a problem.
Among the issues i've seen are anti-virus programs identifying it as a threat, local proxies acting up in various ways, the physical machine running out of memory or disk space (quite common), the machine rebooting unexpectedly, someone ctrl+c'ing the whole listener process, the work payload being run at a way higher priority than the listener (thus "starving" the listener out), unit tests shutting down network adapters (quite common), having too many agents at normal priority on the same machine so they starve each other out, etc.
If you think you're seeing an issue that cannot be explained by any of the above (and nothing jumps out at you from the _diag logs folder), please file an issue at https://azure.microsoft.com/en-us/support/devops/
If everything seems to be perfectly alright with your agent and none of the steps mentioned in the Pipeline troubleshooting guide help, please report it on Developer Community, where the Azure DevOps team and the wider DevOps community are actively answering questions.
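If you want to triage the agent machine yourself before filing anything, a quick first pass could look like the sketch below (a Linux agent is assumed and the agent path is a placeholder), checking the _diag logs mentioned above plus the usual disk and memory suspects:

AGENT_HOME="${AGENT_HOME:-/home/azagent/myagent}"   # placeholder path

# Most recent listener/worker logs; connectivity problems usually show up here.
ls -lt "${AGENT_HOME}/_diag" | head -n 5

# Scan the newest diag log for errors around the time the job was abandoned.
grep -iE "error|exception|disconnect" "$(ls -t "${AGENT_HOME}/_diag"/*.log | head -n 1)" | tail -n 40

# The "quite common" causes from the quote above: disk and memory pressure.
df -h /
free -m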
We have a pretty large project running on Azure. For some reason swap times have recently become really slow: at least 10 minutes.
Sometimes during the swap the site becomes super slow, to the point that it doesn't respond for minutes.
Other times the swap just doesn't work for one reason or another.
We are using initializationPage to warm up specific pages, but it doesn't seem to help.
Question
Is it possible to see what's going on during the swap? I'm trying to debug why it's so slow. Is there any log that shows what it's stuck on and why?
We can't deploy emergency fixes without bringing the whole site down, and sometimes the whole site goes down anyway.
Any help with debugging swap problems would be greatly appreciated.
Update
I found the following in the 'Activity log' in the Azure portal, but I still can't find any details or any hint about what exactly is going on.
So: The resource operation completed with terminal provisioning state 'Failed'.
Where can I find details? It really annoys me that I have to buy Azure Developer support while I'm already spending hundreds of euros per month on something that seems broken, or at least very uninformative about what is going wrong.
So: The resource operation completed with terminal provisioning state 'Failed'.
Where can I find details?
Microsoft has a few things that may help you.
You can view the operations for a deployment through the Azure portal. You may be most interested in viewing the operations when you have received an error during deployment, so this article focuses on viewing operations that have failed. The portal provides an interface that enables you to easily find the errors and determine potential fixes.
The "View deployment operations with Azure Resource Manager" is directly from Microsoft it has several steps to follow. Follow the URL: Microsoft
I hope this helps.
I have a configuration with a Worker role and a Web role with a few instances. Some of the instances seem to be unhealthy, and it looks like they are constantly restarting.
The Azure management portal gives the following status:
Busy (Starting role... Runtime is initializing. [2014-07-02T08:38:18Z])
like.. constantly.
So I guess they are unhealthy for some reason. I've uploaded a new deployment to the staging slot, but I can't do a VIP swap, because that gives me the following error:
Failed to swap the deployments in cloud service poulescom.
Windows Azure is currently performing an operation on this deployment that requires exclusive access.
So right now I'm in some kind of deadlock, and I can't get my healthy, fresh version online without taking the whole site down!
Anyone know what to do?
This is a major bug in the current implementation of the Azure platform, which anyone can reproduce at any moment as follows. Add this line as the first line of your role entry point's OnStart():
throw new InvalidOperationException();
Build and deploy the package into the production slot. Then remove this line, build another package, and deploy it into the staging slot. The one in the staging slot will run just fine; the one in the production slot will keep recycling. That's the expected part. The unexpected part is that when you attempt a swap, you get the "requires exclusive access" error message.
Now think of it this way. What if the deployment in the production slot were recycling not because of a deliberately planted error, but because of some unintended error that had not occurred until recently? Say it was working for several days, then some repeatable unhandled exception started being thrown in one of the instances, and now your deployment is partially degraded.
What would you want to do? I guess you'd fix the bug, build a new package, deploy it into the staging slot, and then try to swap. Doing so leads to the "requires exclusive access" error message every time. You wanted to seamlessly swap deployments to prevent downtime, but the feature designed for exactly this doesn't work at the moment you need it most.
You can't resolve this in the current implementation. Either you wait until both deployments stop recycling (which of course is not guaranteed), or you can do the following:
fix and redeploy the staging until it runs fine
test the staging deployment
(very carefully) delete the production deployment
do the swap (yes, you can swap between a non-empty staging slot and an empty production slot)
The sequence above will cost you about a minute of downtime (and a lot of lost nerve cells), but it's better than nothing.
I have two free subscriptions for Windows Azure, and because I exceeded the limit on the first one, Microsoft closed it down. So I tried to deploy my application from the other subscription and changed a few settings, but it seems to take a lot longer, and the DNS name of the deployed application (in the production environment) does not seem to work. (I've been waiting for about 15 minutes; with the other subscription the link started working almost immediately.) Also, my web role seems to be stuck in a busy state for a very long time.
The application always worked fine, and now I'm getting all this trouble just from switching subscriptions. I'm getting really frustrated with this, especially because it all worked perfectly before. Now I have to 'waste' my time getting everything to work again and can't start on anything new. I don't think this is normal, but I can't seem to find a solution either.
edit:
After over half an hour the DNS finally started working, but this still does not fix the problem of the extremely slow deployment and the busy state of the web role.
Please study the discussion below to understand why the time to deploy an application can vary between 10 and 30 minutes:
Is there a way to reduce time between Azure deployment start and role OnStart() code being invoked?
The details above should help answer your statement "...this still does not fix the problem with the extreme slow deploying and the busy state of the webrole...".
To add to that: while your application is being deployed, it goes through several states, and in some cases the time spent in one state can be longer than expected. During that time you will see statuses such as "Busy", "Initializing", "Starting...", etc., which indicate how far along the deployment is. I hope this helps you understand the time taken during deployment.