I've got an Azure Web App for Linux (Container) running and I have set up Diagnostic Settings so that AppServiceHTTPLogs are exposed, and I have also set Application logging under App Service Logs to Filesystem. I can view and query them in Logs under the Monitoring section of the Web App settings.
This works fine for a few days but then it just stops logging. At first I thought it was a space issue, so I changed the quota from 35 MB to 100 MB, and it started logging again. Then a day later it stopped logging again, so I changed the retention from 7 days to 1 day. It started logging again. Now it has stopped and I can't go higher than 100 MB or lower than 1 day. Additionally, when I look at the filesystem storage used, it's only sitting at a few megabytes.
I have no idea why it just stops logging. Has anyone experienced this?
EDIT:
As a wild experiment I just set the retention days back to 7 and, lo and behold, it started logging again. It's as if it's just seeking attention.
Related
I'm working on an application that's hosted within Azure using an AppService, it sits behind an Azure Firewall and WAF (for reasons).
Over the Christmas break, most of my test environments went to sleep and never came back (they started dying after between 7 and 16 days of idle time). I could see the firewall attempting to health check them every 2 seconds, but at some point they all stopped responding. The AppService started returning 500.30 errors (which are visible in the AppServiceHttpLogs), but our applications weren't starting, and there were no ApplicationInsights logs (i.e. the app wasn't started/starting).
We also noticed that if we made any configuration change to the environment (not the app), the app would start and behave just fine.
It is worth noting that "Always On" is configured off because, as far as I'm aware, the only cost should be some initial request latency when the app starts up after 20 minutes of idle.
Has anybody got a good suggestion as to what happened? Could there be some weird interaction between "Always On" and Azure Firewall, and if so, why did it take weeks before it kicked in?
Thanks.
To answer my own question (partially).
There was an update to Azure, which rolled out across our environments over a couple of weeks. After the update there was a ~50% chance that the automatic restart killed our apps.
The apps were dying because, after a restart, there was a chance that the App Service would not route to its Key Vault via the VNet but instead via a public IP, which Key Vault would reject.
We determined that this was the issue using Kudu --> Tools --> Diagnostic dump --> (some dump).zip --> LogFiles --> eventlog.xml
If you ever want to find App Service startup failure stack traces, this is a great place to look.
Now we've got to work out why Key Vault requests sometimes don't get routed via the VNet and instead go via the public IP.
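In case it helps anyone hitting the same thing, a quick diagnostic we could have added at startup is to resolve the vault's hostname and log whether it comes back as a private (VNet) address or a public one. This is only a sketch: the vault name is a placeholder and the check assumes the private endpoint sits in the usual RFC 1918 ranges.

    using System;
    using System.Net;
    using System.Threading.Tasks;

    class KeyVaultRouteCheck
    {
        // Placeholder vault hostname - replace with your own vault.
        const string VaultHost = "my-vault.vault.azure.net";

        static async Task Main()
        {
            IPAddress[] addresses = await Dns.GetHostAddressesAsync(VaultHost);

            foreach (var ip in addresses)
            {
                // A private (RFC 1918) address suggests the private endpoint / VNet route;
                // a public address means requests will leave via the public internet.
                string route = IsPrivate(ip) ? "private (VNet)" : "PUBLIC";
                Console.WriteLine($"{VaultHost} resolves to {ip} -> {route}");
            }
        }

        static bool IsPrivate(IPAddress ip)
        {
            byte[] b = ip.GetAddressBytes();
            if (b.Length != 4) return false; // only checking IPv4 here
            return b[0] == 10
                || (b[0] == 172 && b[1] >= 16 && b[1] <= 31)
                || (b[0] == 192 && b[1] == 168);
        }
    }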
Following a recent investigation into an Azure web API going down (it does not like cold restarts, as the queued requests then swamp the server, which then returns 503s), I received the following:
Your application was restarted as site binding status changed. This can most likely occur due to recent deployment slot swap operations. In some cases after the swap the web app in the production slot may restart later without any action taken by the app owner. This restart may take place several hours/days after the swap took place. This usually happens when the underlying storage infrastructure of Azure App Service undergoes some changes. When that happens the application will restart on all VMs at the same time which may result in a cold start and a high latency of the HTTP requests. This event occurred multiple times during the day.
The recommendation was:
to minimize the random cold starts, you can set this app setting WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG to 1 in every slot of the app.
Can someone please elaborate on this?
Am I right in thinking that if we ever do a swap (e.g. staging to production), the app may restart at some random point in the future?
What does the app setting actually do and how will it stop Azure restarting the production slot?
Answer from the link provided by Patrick Goode, whose Google-fu is far better than mine:
"Just to explain the specifics of what
WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG app setting does. By
default we put the site’s hostnames into the site’s
applicationHost.config file “bindings” section. Then when the swap
happens the hostnames in the applicationHost.config get out of sync
with what the actual site’s hostnames are. That does not affect the
app in anyway while it is running, but as soon as some storage event
occurs, e.g. storage volume fail over, that discrepancy causes the
worker process app domain to recycle. If you use this app setting then
instead of the hostnames we will put the sitename into the “bindings”
section of the appHost.config file. The sitename does not change
during the swap so there will be no such discrepancy after the swap
and hence there should not be a restart."
Looks like this setting is supposed to prevent 'random cold restarts'
https://ruslany.net/2019/06/azure-app-service-deployment-slots-tips-and-tricks
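If you add the setting, one quick way to sanity-check that it is actually present in the slot that is serving traffic is to read it back at runtime, since App Service exposes app settings as environment variables. A minimal sketch (the console output is illustrative; WEBSITE_SITE_NAME is a standard App Service variable):

    using System;

    class SiteBindingSettingCheck
    {
        static void Main()
        {
            // App settings show up as environment variables inside App Service.
            string binding = Environment.GetEnvironmentVariable(
                "WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG");
            string siteName = Environment.GetEnvironmentVariable("WEBSITE_SITE_NAME");

            Console.WriteLine($"Site: {siteName}");
            Console.WriteLine(binding == "1"
                ? "WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG is set to 1."
                : "WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG is NOT set in this slot.");
        }
    }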
I deployed an Azure web app back in July and it's been running flawlessly up until about three weeks ago. At that time, I would notice my CPU utilization constantly between 80% and 100%, with no corresponding increase in traffic. The first time I saw this, after concluding it wasn't my app or increased traffic causing it, I restarted the web app service and the CPU utilization returned to its normal 5% to 15%. Then after a couple of days it started to do it again. And, again, a restart solved the issue.
My question is this. Is this normal to have to restart the web service every day or so? And, if so, why?
Assuming no changes have been made to your code and you have not seen a corresponding increase in traffic, it is not normal. An Azure Web App with no app deployed should almost always stay at 0% CPU utilization. I say "almost always" because Microsoft does run diagnostic and monitoring tools in the background that can cause some very temporary spikes. See here for a thread on that particular issue.
My recommendations are:
1. When CPU pegs and stays pegged, log into your SCM site. Check the Process Explorer and confirm that it's your w3wp.exe that's pegging the CPU (note there's a separate w3wp.exe for your SCM site).
2. Ensure that you don't have any Site Extensions or WebJobs that are losing their mind. You can check your installed Site Extensions on the SCM site under the Site Extensions -> Installed tab. Any WebJobs will show up in your SCM process explorer as separate processes from step #1.
3. Log into the Azure Portal and browse to your Web App's management blade. Go to the Diagnose and Solve Problems blade. From here, you can try "Metrics per Instance" and go through all of the Perf Counters to see if it gives you a clue as to what's wrong. For example, I had SignalR go nuts once and only found it by seeing that my thread count was out of control.
4. On the Diagnose and Solve Problems blade, you can also check Application Events.
5. You may have some light shed on this by installing Application Insights on your web application; a minimal setup sketch follows after this list. It has a free tier that will likely have enough space to troubleshoot for a few days. If this is something going bananas in your code, you may get some insight here.
6. I'm including failed request tracing logs here for completeness. But these would likely show up in Application Insights.
7. If you've exhausted all of these possibilities, file a support ticket with Microsoft. As the above link shows, they have access to diagnostic tools that we don't and can eliminate the possibility of a runaway diagnostics or infrastructure process. I don't know how much help they can be if it's your own w3wp.exe that's spiking the CPU.
Of course, if your app is seriously easy to redeploy and it's not a ridiculous hassle, you can just re-provision it and see if you see the same behavior.
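On the Application Insights suggestion above: if the app happens to be ASP.NET Core, the minimal wiring is just the Microsoft.ApplicationInsights.AspNetCore package plus one registration at startup (for classic ASP.NET it's a different package and ApplicationInsights.config instead). A sketch, assuming the connection string / instrumentation key is supplied via app settings:

    using System;
    using Microsoft.AspNetCore.Builder;
    using Microsoft.Extensions.DependencyInjection;

    var builder = WebApplication.CreateBuilder(args);

    // Assumes the Microsoft.ApplicationInsights.AspNetCore package is installed.
    // Picks up the connection string / instrumentation key from configuration,
    // e.g. the APPLICATIONINSIGHTS_CONNECTION_STRING app setting.
    builder.Services.AddApplicationInsightsTelemetry();

    var app = builder.Build();

    // Simple endpoint so requests, dependencies and perf counters start flowing.
    app.MapGet("/", () => "Hello from a web app with Application Insights wired up.");

    app.Run();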
I'm facing the following problem with MassTransit 3. I'm publishing messages from a WebApi to a backend (run as a continuous WebJob). When the backend job is started, all works well and messages are picked up properly. After about 20 minutes, all messages published from the WebApi stop being picked up by the backend. The message is published to Azure Service Bus properly but is picked up only after a restart of the WebJob process.
The MT debug log is completely silent and shows no issues, so this question is mainly for the authors of MT, in case they can think of anything that could cause this.
Update 1
The WebJob is continuous and running in Standard mode, therefore the 20-minute timeout mentioned in the Azure documentation shouldn't apply.
I've checked the logs and the job is running. The environment doesn't log anything about stopping the job, and the process explorer shows the job, although with quite a high thread count (I have just 3 consumers). All threads are in a wait state.
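For reference, the bus is hosted in the continuous WebJob along these lines. This is a trimmed sketch rather than the real code: the namespace URI, queue name and consumer are placeholders, and the exact Azure Service Bus host/credential configuration differs between MassTransit versions, so treat that part as illustrative.

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using MassTransit;

    class MyMessage
    {
        public string Text { get; set; }
    }

    class MyMessageConsumer : IConsumer<MyMessage>
    {
        public Task Consume(ConsumeContext<MyMessage> context)
        {
            Console.WriteLine($"Received: {context.Message.Text}");
            return Task.CompletedTask;
        }
    }

    class Program
    {
        static void Main()
        {
            var busControl = Bus.Factory.CreateUsingAzureServiceBus(cfg =>
            {
                // Placeholder namespace; credentials omitted - configure the host
                // (SAS key / token provider) the way your MassTransit version expects.
                var host = cfg.Host(new Uri("sb://my-namespace.servicebus.windows.net/"), h => { });

                cfg.ReceiveEndpoint(host, "backend-queue", e =>
                {
                    // Placeholder consumer registered on the endpoint.
                    e.Consumer<MyMessageConsumer>();
                });
            });

            busControl.Start();

            try
            {
                // Continuous WebJob: block so the process (and the bus) stays alive.
                Thread.Sleep(Timeout.Infinite);
            }
            finally
            {
                busControl.Stop();
            }
        }
    }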
You should be creating a cloud service and not a web job. Web jobs are not meant for continuous processes. A worker role is exactly what you need.
From the Azure documentation:
Web apps in Free mode can time out after 20 minutes if there are no requests to the scm (deployment) site and the web app's portal is not open in Azure. Requests to the actual site will not reset this.
Resolved. The MT process got stuck after spawning around 2k threads. The issue must have been in the Azure transport, as trying the same configuration with RabbitMQ worked well.
After updating to a newer MT version (.11 beta), the transport started to behave properly.
So I'm trying to familiarise myself with Azure and have started work on a website which is currently being deployed on git commit to Azure. I decided I had to look at logging and so turned on application diagnostics in the Azure portal. I logged via a trace statement in my code and sure enough it writes to a log file.
I noticed that, on hovering over the info icon next to the "Application logging (Filesystem)" toggle, it notes that it will be turned off after 12 hours. I presumed that meant diagnostic logging would be turned off after 12 hours, but over 20 hours later that seems not to be the case.
Does the 12 hours refer to the retention of log files after creation, or genuinely that logging will (at some point) be switched off?
From the little I've read, if I want durable logging I need to consider pushing log files to blob storage or Azure tables (possibly writing directly). Are my thoughts on the 12-hour retention correct?
Thanks
Tim
This 12-hour limit is about application logging to text file(s): if you use an ILogger instance to log data (i.e. logger.LogInformation(...)), then this feature will be disabled after 12 hours.
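If you need the logs to survive beyond that window, one option (assuming ASP.NET Core and the Microsoft.Extensions.Logging.AzureAppServices package) is to add the App Service logging provider, which honours whatever you enable under App Service Logs in the portal, so ILogger output can also be written to durable blob storage. A minimal sketch:

    using System;
    using Microsoft.AspNetCore.Builder;
    using Microsoft.Extensions.Logging;

    var builder = WebApplication.CreateBuilder(args);

    // Assumes the Microsoft.Extensions.Logging.AzureAppServices package.
    // The provider only activates inside App Service and writes to the
    // filesystem and/or blob storage depending on the App Service Logs settings.
    builder.Logging.AddAzureWebAppDiagnostics();

    var app = builder.Build();

    app.MapGet("/", (ILogger<Program> logger) =>
    {
        logger.LogInformation("Handled a request at {Time}", DateTimeOffset.UtcNow);
        return "Logged.";
    });

    app.Run();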