Azure Functions portal log / monitor isn't very accurate

I've been using Functions for a while, and it seems the longer a Function is around, the less accurate the Portal logs are. For roughly the first 3 months, monitoring and logging were fine. Over time, things started getting less accurate.
Now I see the real logs by opening Microsoft Azure Storage Explorer and checking the AzureWebJobsStorage account.
First, when I bring up the code/logs view, the last log it shows isn't accurate. It is usually from a few days ago, or it is the last error. When the function triggers, though, it does get the live feed. This isn't that big a deal; what's bad is the monitor being inactive and not being able to see the logs from it. I suppose I'll just use Azure Storage Explorer.
The Monitor invocation logs always seem to be a few days behind. They used to be accurate, but for the last month or so they have always been a few days behind.

Dan,
The local, file-based logs exist primarily to support the portal experience, so the behavior you're observing in the log window is expected: the runtime does not write those logs as part of the normal invocation process, only when you're actively developing/testing in the portal.
The issue you're experiencing with the monitor is due to a regression that has been patched; the fix should be fully rolled out today (you can see more details here).
We've been listening to feedback on our logging capabilities, and there has been a lot of investment in that area, resulting in the recently announced built-in integration with Application Insights. That integration addresses some of the pain points you've brought up as well as other issues, so I'd strongly recommend trying it out. You can find more information about it here.

Related

Azure WebApps leaking handles "out of nothing"

I have 6 WebApps (ASP.NET, Windows) running on Azure, and they have been running for years. I do tweak them from time to time, but there have been no major changes.
About a week ago, all of them started to leak handles, as shown in the image: this is just the last 30 days, but the constant curve goes back "forever". Now, while I did make some minor changes to some of the sites, there are at least 3 sites that I did not touch at all.
But still, major leakage started for all sites a week ago. Any ideas what would be causing this?
I would like to add that one of the sites has only a single aspx page, and another site does not have any code at all. It's just there to run a WebJob containing the letsencrypt script. That hasn't changed for several months.
So basically, I'm looking for any pointers, but I doubt this has anything to do with my code, given that 2 of the sites do not have any of my code and still show the same symptom.
Final information from the product team:
The Microsoft Azure Team has investigated the issue you experienced, which resulted in an increased number of handles in your application. The excessive number of handles can potentially contribute to application slowness and crashes.
Upon investigation, engineers discovered that the recent upgrade of Azure App Service with improvements for monitoring of the platform resulted in a leak of registry key handles in application worker processes. The registry key handle in question is not properly closed by a module which is owned by the platform and is injected into every Web App. This module ensures various basic functionalities and features of Azure App Service, like correct processing of HTTP headers, remote debugging (if enabled and applicable), correct returning of responses through load balancers to clients, and others. This module has been recently improved to include additional information passed around within the infrastructure (not leaving the boundary of Azure App Service, so this information is not visible to customers). This information includes versions of modules which processed every request, so internal detection of issues can be easier and faster when caused by component version changes. The issue is caused by not closing a specific registry key handle while reading the version information from the machine’s registry.
As a workaround/mitigation in case customers see any issues (like increased application latency), it is advised to restart the web app, which resets all handles and instantly cleans up all leaks in memory.
Engineers prepared a fix which will be rolled out in the next regularly scheduled upgrade of the platform. There is also a parallel rollout of a temporary fix which should finish by 12/23. Any apps restarted after this temporary fix is rolled out shouldn’t observe the issue anymore as the restarted processes will automatically pick up a new version of the module in question.
We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
• Fixing the registry key handle leak in the platform module
• Fixing the gap in test coverage and monitoring to ensure that such regressions will not happen again in the future and will be automatically detected before they are rolled out to customers
So it appears this is a problem with azure. Here is the relevant part of the current response from azure technical support:
==>
We had discussed with PG team directly and we had observed that, few other customers are also facing this issue and hence our product team is actively working on it to resolve this issue at the earliest possible. And there is a good chance, that the fixes should be available within few days unless something unexpected comes in and prevent us from completing the patch.
<==
Will add more info as it becomes available.

Log streaming tab doesn't work in Azure functions

I posted this on github earlier, but hoping I would get an answer here from the wider community.
In the azure portal, one can look at live logs of a Function app in the Log Streaming tab. I have noticed that this often doesn't work for weeks, and I am wondering if I am doing something obviously wrong. More details below:
I have a function that receives messages from a service bus. I am able to see the logs in application insights, and I can see that it's processing the requests as expected. The problem is that I don't see any logs in the "Log Streaming" tab in the portal. See image below:
The above image indicates that no lines were logged between 6:17:46 and 6:18:46. However, see the below image from application insights logs, and you can see there were clearly several requests that were processed during this time (and several log lines written).
I tried the Edge and Chrome browsers, and also a private tab in Edge, but I see the same behavior.
Note that I end up seeing this for extended periods of time, and then sometimes it resolves itself. For example, I noticed it not working for weeks in June. Then, surprisingly, I saw it work for at least a week at the beginning of July. But now I have noticed it is not working again.
Also, I see this behavior across all the function apps that I currently have. (So it's not limited to just one app.)
I am using Azure Functions v2, .NET Core, C#.

How to debug Azure swapping process (sometimes bringing site down)

We have a pretty large project that is running on Azure. For some reason, swap times recently became really slow: at least 10 minutes.
Sometimes during the swap the site becomes extremely slow, to the point that it doesn't respond for minutes.
Other times the swap just doesn't work for one reason or another.
We are using initializationPage to warm up the most specific pages, but it doesn't seem to help.
Question
Is it possible to see what's going on during the swap? I'm trying to debug why it's so slow. Is there any log that I can see why it's stuck on what?
We can't deploy emergency fixes without bringing the whole site down, and sometimes the whole site goes down.
Any help debugging swapping problems would be greatly appreciated.
Update
I found the following in 'Activity log' on the Azure Portal, but I still can't find any details or any hint what is going on exactly.
So: The resource operation completed with terminal provisioning state 'Failed'.
Where can I find details? It really annoys me that I have to buy Azure Developer support while I'm already spending hundreds of euros per month on something that seems broken, or at least very uninformative about what is going wrong.
So: The resource operation completed with terminal provisioning state 'Failed'.
Where can I find details?
Microsoft has a few things that may help you.
You can view the operations for a deployment through the Azure portal. You may be most interested in viewing the operations when you have received an error during deployment so this article focuses on viewing operations that have failed. The portal provides an interface that enables you to easily find the errors and determine potential fixes.
The article "View deployment operations with Azure Resource Manager" comes directly from Microsoft and has several steps to follow. Follow the URL: Microsoft
I hope this helps.
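The deployment operations view shows what the platform did. To see how the site itself responds while a swap is running, one option is to poll it from outside and record status codes and latency. The following is a minimal Python sketch under stated assumptions: the function names, the 2-second threshold, and whatever URL you pass in are hypothetical, and nothing here is an Azure API.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

Sample = Tuple[int, float]  # (HTTP status, elapsed seconds)


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth recording.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def probe(
    url: str,
    attempts: int,
    interval: float = 5.0,
    fetch: Callable[[str], int] = fetch_status,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Request url `attempts` times and record status and latency each time."""
    samples: List[Sample] = []
    for i in range(attempts):
        started = clock()
        status = fetch(url)
        samples.append((status, clock() - started))
        if i + 1 < attempts:
            sleep(interval)
    return samples


def slow_or_failed(samples: List[Sample], max_seconds: float = 2.0) -> List[Sample]:
    """Keep samples with no response, a 5xx status, or latency over the threshold."""
    return [(s, t) for s, t in samples if s == 0 or s >= 500 or t > max_seconds]
```

Running `probe(...)` against the production URL for the duration of a swap and then passing the result to `slow_or_failed` gives a rough timeline of when the site stopped responding, which can be lined up against the Activity log entries.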

How does one know why an Azure WebSite instance(WebApp) was shutdown?

By looking at my Pingdom reports I have noted that my WebSite instance is getting recycled. Basically, Pingdom is used to keep my site warm. When I look deeper into the Azure logs, i.e. /LogFiles/kudu/trace, I notice a number of small XML files with "shutdown" or "startup" suffixes, e.g.:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might have to do with MS patching VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicion that it is happening at the OS level. Is there a way to find out why my instance is being shut down?
I should also admit I am using one S2 instance, scalable to three depending on CPU usage. We may have to review this and use a 2-3 setup. Obviously this doubles the costs.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with a status of "succeeded", but nothing for the times at which I saw the above files. So it seems that the "instance" is being shut down, but the event is not appearing in the "Operation Log". Why would this be? I had about 5 yesterday, yet the last "Operation Log" entry was 29/7.
An example of one of yesterday's shutdown XML files:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
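To build a timeline from these trace files, the names themselves can be parsed. The following is a minimal Python sketch that assumes the pattern seen in the two example names above: a timestamp, an identifier, a sequence number, the event name, and a duration in seconds. That reading of the fields is inferred from the examples only and is not documented Kudu behaviour; the function and field names are hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional

# Pattern inferred from the example file names; field meanings are assumptions.
TRACE_NAME = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})"
    r"_(?P<instance>[^_]+)"
    r"_(?P<sequence>\d+)"
    r"_(?P<event>[A-Za-z]+)"
    r"_(?P<seconds>\d+)s\.xml$"
)


class TraceEvent(NamedTuple):
    when: datetime
    instance: str
    sequence: int
    event: str
    seconds: int


def parse_trace_name(name: str) -> Optional[TraceEvent]:
    """Parse one trace file name; return None if it does not fit the pattern."""
    m = TRACE_NAME.match(name)
    if m is None:
        return None
    return TraceEvent(
        when=datetime.strptime(m.group("stamp"), "%Y-%m-%dT%H-%M-%S"),
        instance=m.group("instance"),
        sequence=int(m.group("sequence")),
        event=m.group("event"),
        seconds=int(m.group("seconds")),
    )


def lifecycle_events(names: Iterable[str]) -> List[TraceEvent]:
    """Return the startup/shutdown events among names, oldest first."""
    parsed = (parse_trace_name(n) for n in names)
    wanted = [e for e in parsed if e and e.event.lower() in ("startup", "shutdown")]
    return sorted(wanted, key=lambda e: e.when)
```

Feeding it a directory listing of /LogFiles/kudu/trace gives a sorted list of start and stop times that can be compared against the Pingdom report.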
You should see entries regarding backend maintenance in operation logs like this:
As for keeping your site alive, Standard plans allow you to use the "Always On" feature, which pretty much does what Pingdom is doing to keep your website warm. Just enable it on the Configure tab of the portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
Every site on Azure runs 2 applications: one is yours and the other is the scm endpoint (a.k.a. Kudu). These "shutdown" traces are for the Kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself, just like Kudu does. If you don't have Always On enabled, Kudu gets shut down after an hour of inactivity (as far as I remember).
Aside from that, like you mentioned, Azure will shut down your app during machine upgrades, though I don't think these shutdowns result in operation log events.
Are you seeing any side effects? Is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. However, we bring the site up on a new machine before shutting it down on the old one and letting connections drain, so this should not result in any perceivable downtime.

Intermittent Microsoft Azure Web Site access failure

I have a number of small MVC apps deployed as Microsoft Windows Azure websites. This has been working for several months.
Yesterday I rolled out a new one, and the deployment was unremarkable, everything worked fine. But a couple of hours later, access to the site was unavailable. The symptoms were that when the browser tried to navigate to the URL for that site, it would try to load for several minutes and then just give up with a completely blank page.
I attempted to stop and restart the site, and it worked once, but the symptoms came back several minutes later. Then I tried to stop and restart, and it didn't work.
I deployed the identical app to three additional URLs. Again, immediately on deployment, they all work fine, however, they fail at some interval in the future. They seem to not all fail at once. Sometimes restarting the site will fix the problem, and sometimes not.
IMPORTANT: If I wait for some period of time, the site may start to work again on its own.
However, deploying four versions of the app so that our users can go to a backup one if the primary one is not working is not optimal.
Any words of wisdom as to how I might go about debugging this?
ADDITIONAL INFO NOV 25, 2013:
When sites are failing, the IIS logs show either 500 (Internal Server Error) or 502 (Bad Gateway) responses. Our own MVC code is never hit, not even app_start.
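To see how often and when those 500 and 502 responses occur, the IIS logs can be tallied by status code. The following is a minimal Python sketch that assumes the logs are in the W3C extended format, where a `#Fields:` directive names the space-separated columns and `sc-status` holds the response status. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_statuses(lines: Iterable[str]) -> Counter:
    """Tally the sc-status column of a W3C extended format log."""
    fields: List[str] = []
    counts: Counter = Counter()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive lists the column names for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives such as #Software or #Date
        if "sc-status" not in fields:
            continue  # no usable header seen yet, so the row cannot be read
        values = line.split()
        idx = fields.index("sc-status")
        if idx < len(values):
            counts[values[idx]] += 1
    return counts


def server_errors(counts: Counter) -> Dict[str, int]:
    """Keep only the 5xx entries from a status tally."""
    return {status: n for status, n in counts.items() if status.startswith("5")}
```

Calling `server_errors(count_statuses(open(path)))` on each downloaded log file, where `path` is wherever you saved it, shows which files (and therefore which time windows) contain the failures.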
You can start by checking the logs and remote debugging
http://www.drdobbs.com/windows/azure-sdk-22-supports-visual-studio-2013/240163499
Are the apps working locally?
Might not be the same problem, but from time to time our Azure instances will get the blue question mark of death as a status.
The reason, we found out, was that Microsoft will do upgrades on instances from time to time. If you have just one instance in a cloud service/role, then from time to time they will do maintenance, and during that time it will be dead.
I have confirmed this with their support.
The only way to get around this that I know of is to create two instances. Then Microsoft guarantees ~99% availability.
Of course I also confirmed with them that this means twice the cost. =/
If that's not the issue I would enable RDP and get onto the machine to see what the problem is. Microsoft has these tools to help debug problems: http://blogs.msdn.com/b/kwill/archive/2013/08/26/azuretools-the-diagnostic-utility-used-by-the-windows-azure-developer-support-team.aspx
First, you should always run multiple instances of your web role with more than 1 upgrade domain. This is configurable in the service definition (CSDEF). Without this, you don't get an SLA from Microsoft, so you can't really complain that the VMs go down.
Second, to figure out what might be going on with these boxes, you should have both logs (my preference is to roll my own with page blobs or table storage), AND you should always have RDP access to a pre-production environment (production as well if you're not too fussed about security). Once on the box, look through the event viewer for errors.
Third, when an outage occurs check out the azure service dashboard (http://www.windowsazure.com/en-us/support/service-dashboard/) for outages.
Lastly, contact Microsoft support. It may take a few hours, but they are pretty good.
Given that it is happening repeatedly and for extended periods of time (more than 5 minutes), I would bet there's something wrong with your hosted service. Again, RDP in and poke around. Good luck.
To debug your sites try to enable diagnostic logs:
http://www.windowsazure.com/en-us/develop/net/common-tasks/diagnostics-logging-and-instrumentation/
Another nice way to look around your site is using the debug console:
https://github.com/projectkudu/kudu/wiki/Kudu-console

Resources