Performance impact of writing Azure diagnostic logs to blob storage

Our C# web app, running on Azure, uses System.Diagnostics.Trace to write trace statements for debugging/troubleshooting. Once we enable blob storage for these logs (using the "Application Logging (blob)" option in the Azure portal), the response time for our application slows down considerably. If I turn this option off, the web app speeds up again (though obviously we don't get logs in blob storage anymore).
Does anyone know if this is expected? We certainly write a lot of trace statements on every request (100 or so per request), but I would not think that is unusual for a web application. Is there some way to diagnose why enabling blob storage for the logs dramatically slows down the execution of these trace statements? Is writing the trace statement synchronous with the logs being updated in blob storage, for instance?

I was unable to find any information about how logging to blob storage in Azure was implemented. However, this is what I was able to deduce:
I confirmed that disabling the global lock had no effect. Therefore, the performance problem was not directly related to lock contention.
I also confirmed that if I turned AutoFlush off, the performance problem did not occur.
From further cross-referencing the source code for the .NET trace API, my conclusion is that enabling blob storage for logs appears to inject some kind of trace listener into your application (the same way you might add a listener in web.config), and that listener synchronously writes every trace statement it receives to blob storage.
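To make that deduction concrete, here is roughly how trace listeners fan out in .NET. The built-in TextWriterTraceListener below is only a stand-in for whatever listener the blob logging feature injects; that injection is my assumption, not something Azure documents:

```csharp
// Illustration only: TextWriterTraceListener stands in for whatever listener the
// blob-logging feature appears to inject. With AutoFlush on, every Trace.* call is
// written and flushed synchronously through each registered listener.
using System.Diagnostics;

class TraceListenerDemo
{
    static void Main()
    {
        Trace.Listeners.Add(new TextWriterTraceListener("app.log"));
        Trace.AutoFlush = true; // each statement below blocks until the listener has flushed

        for (int i = 0; i < 100; i++) // roughly what our app does per request
        {
            Trace.TraceInformation("Handling request step {0}", i);
        }
    }
}
```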
As such, there seem to be a few ways to work around this behavior:
Don't turn on AutoFlush, but flush manually on a periodic basis (see the sketch after this list). This prevents the synchronous blob writes from interrupting every log statement.
Write your own daemon that periodically copies local log files to blob storage, or something along those lines.
Don't use this blob storage feature at all but instead leverage the tracing functionality in Application Insights.
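For the first workaround, here is a minimal sketch of what I mean; the background timer and the 30-second interval are just illustrative choices, not anything the Azure feature prescribes:

```csharp
// Workaround #1 sketch: leave AutoFlush off and flush on a background timer so the
// hot path only buffers trace output in memory. The 30-second interval is arbitrary.
using System;
using System.Diagnostics;
using System.Threading;

static class PeriodicTraceFlush
{
    private static Timer _flushTimer;

    public static void Start()
    {
        Trace.AutoFlush = false; // no flush (and no blob write) per trace statement

        _flushTimer = new Timer(
            _ => Trace.Flush(),  // push everything buffered so far to the listeners
            state: null,
            dueTime: TimeSpan.FromSeconds(30),
            period: TimeSpan.FromSeconds(30));
    }
}
```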
I ended up doing #3 because, as it turns out, we already had Application Insights configured and turned on; we just didn't realize it could handle trace logging and querying. After disabling sampling for trace events, we now have a way to easily query any log statement remotely and get the full set of traces matching whatever criteria we need (a keyword match, all traces for a particular request, all traces in a particular time period, etc.). Moreover, there is no noticeable synchronous overhead to writing log statements with the Application Insights trace listener, so nothing in our application has to change (we can continue using the .NET Trace class). As a bonus, since Application Insights tracing is fairly flexible about the tracing source, we can even switch to another, higher-performance logging API (e.g. ETW or log4net) if needed, and Application Insights will still work.
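For reference, wiring the Application Insights trace listener up in code (it can also be done in web.config) might look roughly like this, assuming the Microsoft.ApplicationInsights.TraceListener NuGet package; exact type names can vary by SDK version:

```csharp
// Sketch: route existing System.Diagnostics.Trace output to Application Insights.
// Assumes the Microsoft.ApplicationInsights.TraceListener NuGet package is installed
// and the instrumentation key / connection string is configured elsewhere.
using System.Diagnostics;
using Microsoft.ApplicationInsights.TraceListener;

static class TraceToAppInsights
{
    public static void Configure()
    {
        Trace.Listeners.Add(new ApplicationInsightsTraceListener());

        // Existing code keeps calling the .NET Trace class unchanged:
        Trace.TraceInformation("This now arrives as a trace telemetry item in Application Insights.");
    }
}
```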
Ultimately, you should consider using Application Insights for storing and querying your traces. Depending on why you wanted your logs in blob storage in the first place, it may or may not meet your needs, but it worked for us.

Related

Application Insights Down - Impact on the Application

We use Application Insights from Azure Functions, and we are currently documenting outage scenarios for different Azure components. What will be the impact on Functions if the Application Insights service goes down? I hope it doesn't impact function executions and they continue to operate as normal.
Also, when it comes back online, let's say half an hour after the outage, would all the logs written during that time be lost?
What will be the impact on Functions if the Application Insights service goes down? I hope it doesn't impact function executions and they continue to operate as normal.
It does not. Functions will continue to run as usual. Telemetry submission is done in the background anyway.
Also, when it comes back online, let's say half an hour after the outage, would all the logs written during that time be lost?
It depends on what Channel is configured. The Application Insights .NET and .NET Core SDKs ship with two built-in channels:
InMemoryChannel: A lightweight channel that buffers items in memory until they're sent. Items are buffered in memory and flushed once every 30 seconds, or whenever 500 items are buffered. This channel offers minimal reliability guarantees because it doesn't retry sending telemetry after a failure. This channel also doesn't keep items on disk, so any unsent items are lost permanently upon application shutdown (graceful or not). This channel implements a Flush() method that can be used to force-flush any in-memory telemetry items synchronously.
ServerTelemetryChannel: A more advanced channel that has retry policies and the capability to store data on a local disk. This channel retries sending telemetry if transient errors occur. This channel also uses local disk storage to keep items on disk during network outages or high telemetry volumes.
When using the ServerTelemetryChannel, you may want to configure the location where telemetry is stored during the downtime. See also the docs regarding offline storage.
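As a rough sketch, configuring the ServerTelemetryChannel with an explicit offline-storage folder might look like the following; the connection string and folder path below are placeholders, and it assumes the Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel package:

```csharp
// Sketch of wiring up ServerTelemetryChannel with an explicit offline-storage folder.
// The connection string and storage folder path are placeholders to adapt to your setup.
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel;

class TelemetryChannelSetup
{
    static void Main()
    {
        var config = TelemetryConfiguration.CreateDefault();
        config.ConnectionString = "InstrumentationKey=00000000-0000-0000-0000-000000000000"; // placeholder

        var channel = new ServerTelemetryChannel
        {
            // Where telemetry is buffered on disk while the ingestion endpoint is unreachable.
            StorageFolder = @"D:\home\data\ApplicationInsightsBuffer" // placeholder path
        };
        channel.Initialize(config);
        config.TelemetryChannel = channel;

        var client = new TelemetryClient(config);
        client.TrackTrace("Buffered locally and retried if the service is temporarily down.");
    }
}
```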

Blob trigger affecting application insight logging in azure functions

I have two Azure Functions that exist in the same function app, and they are both connected to the same instance of Application Insights:
TimerFunction uses a TimerTrigger and executes every 60 seconds and logs each log type for testing purposes.
BlobFunction uses a BlobTrigger and its functionality is irrelevant for this question.
It appears that when BlobFunction is enabled (it isn't being triggered, by the way), it clogs up Application Insights with polling, as I don't receive some of the log messages written in TimerFunction. If I disable BlobFunction, then the logs I see in the development tools monitor for TimerFunction are all there.
This is shown in the screenshot below. TimerFunction and BlobFunction were both running until I disabled BlobFunction at 20:24, where you can clearly see the logs working "normally", then at 20:26 I re-enabled BlobFunction and the logs written by TimerFunction are again intermittent, and missing my own logged info.
Here is the sample telemetry from the live metrics tab:
Am I missing something glaringly obvious here? What is going on?
FYI: My host.json file does not set any log levels; I took them all out in the process of testing this, and it is currently a near-skeleton. I also changed the BlobFunction to use an HttpTrigger instead, and the issue disappeared, so I'm 99% certain it's because of the BlobTrigger.
EDIT:
I tried to add an Event Grid trigger instead, as Peter Bons suggested, but my resource group shows no storage account for some reason. The approach in the linked article, and in this video (https://www.youtube.com/watch?v=0sEzimJYhME&list=WL), just doesn't work for me. The options are simply different, as shown below:
It is normal behavior that the polling clutters your logs. You can of course set a log level in host.json to filter out those messages, though you might lose some valuable other logging as well.
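For reference, the filtering could look roughly like the host.json below; the Function.<FunctionName>.User category covers logs written by your own code, TimerFunction is the function from the question, and the exact categories you need to adjust may differ in your setup:

```json
{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "default": "Warning",
      "Function.TimerFunction.User": "Information"
    }
  }
}
```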
As for possibly missing telemetry: it could very well be that some logs are dropped due to sampling, which is enabled by default. I would also not be surprised if some logging is simply not shown in the portal. I've personally experienced logging being delayed by up to 10 minutes, or not available at all, in the Azure Functions log page on the portal. Try a direct query in App Insights as well.
Or you can go directly to the App Insights resource and create some queries yourself that filter out those messages using Search or Logs.
The other option is to not rely on polling with the BlobTrigger, but instead use an Event Grid trigger that invokes the function once a blob is added. Here is an example of calling a function when an image is uploaded to an Azure Storage blob container. Because there is no polling involved, this is a much more efficient way of reacting to storage events.
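A minimal sketch of such an Event Grid triggered function is below. It assumes the in-process model with the Microsoft.Azure.WebJobs.Extensions.EventGrid package; the EventGridEvent type has moved between SDK versions, so adjust the using directive to match yours:

```csharp
// Sketch: react to a blob-created event pushed by Event Grid instead of polling the container.
using Azure.Messaging.EventGrid;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.EventGrid;
using Microsoft.Extensions.Logging;

public static class BlobCreatedFunction
{
    [FunctionName("BlobCreatedFunction")]
    public static void Run(
        [EventGridTrigger] EventGridEvent eventGridEvent, // delivered by Event Grid, no storage polling
        ILogger log)
    {
        log.LogInformation("Blob event received: {Subject}", eventGridEvent.Subject);
    }
}
```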

Azure Functions - Logging Consolidation - Controlling the host log?

I run a system on top of a bunch of Azure Functions, and I'm just tidying up some last threads. I mostly abandoned the logging provided out of the box by Azure Functions because I found the flush timings to be very irregular, and I also wanted to consolidate the logs from all of my functions into one spot and be able to query them. This all works for the most part, but I have one annoying use case remaining: if a function binding is faulty (e.g. the Azure Function method signature is wrong because someone checked garbage into Git), the function won't be invoked, the function's own log won't even be written, and the error will instead be placed into a different file (the host log).
Now I guess I can just access the storage account that backs the Azure Function and pull the host log from there, but I was wondering if there is a better means of directly controlling/intercepting the logging in Azure Functions. Does anyone know if there is at least a way of getting it to flush more quickly?
You can see host logs as well as function logs in the associated Application Insights:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-monitoring#other-categories

Determining cause of CPU spike in azure

I am relatively new to Azure. I have a website that has been running for a couple of months without much traffic: when users are on the system, the various dashboard monitors go up, and then they flat-line the rest of the time. This week, the CPU time went way up when there were no requests and no data going in or out of the site. Is there a way to determine the cause of this CPU activity when the site is not active? It doesn't make sense to me that CPU activity should be attributed to my site when there is no site activity.
If your website has significant processing at application start, it is possible your VM got rebooted or your app pool was recycled, causing your start-up handler to execute again (which would cause CPU to spike without any requests).
You can analyze this by adding application logging to your Application_Start event (after initializing tracing). There is another comment detailing how to enable logging, but you can also consult this link.
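As a rough sketch of what that could look like in a Global.asax-based ASP.NET application (the message text is only illustrative):

```csharp
// Trace from Application_Start so cold starts and app pool recycles show up in the logs,
// assuming a trace listener (file, Azure app logging, etc.) is already configured.
using System;
using System.Diagnostics;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Start(object sender, EventArgs e)
    {
        Trace.TraceInformation(
            "Application_Start at {0:o} (cold start or app pool recycle)",
            DateTimeOffset.UtcNow);
    }
}
```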
You need to collect data to understand what's going on. So first thing I would say is:
1. Go to the Azure management portal -> your website (assuming you are using Azure Websites) -> Dashboard -> Operation logs. Try to see whether there is any suspicious activity going on.
2. Download the logs for your site using any FTP client and analyze what's happening. If there is not much data, I would suggest adding more logging in your application to see what is happening or which module is spinning.
A great way to detect CPU spikes, and even determine slow-running areas of your application, is to use a profiler like New Relic. It's a free add-on for Azure that collects data and presents it in a dashboard. You might find it useful for determining the exact cause of the CPU spike.
We regularly use it to monitor the performance of our applications. I would recommend it.

What happens to Azure diagnostic information when a role stops?

When an Azure worker role stops (either because of an unhandled exception or because Run() finishes), what happens to local diagnostic information that has not yet been transferred? Microsoft documentation says diagnostics are transferred to storage at scheduled intervals or on demand, neither of which can cover an unhandled exception. Does this mean diagnostic information is always lost in this case? This seems particularly odd because crash dumps are part of the diagnostic data (set up by default in DiagnosticMonitorConfiguration.Directories). How then can you ever get a crash dump back (related to this question)?
To me it would be logical if diagnostics were also transferred when a role terminates, but this is not my experience.
It depends on what you mean by 'role stops'. The Diagnostic Monitor in SDK 1.3 and later is implemented as a background task that has no dependency on the RoleEntryPoint. So, if you mean your RoleEntryPoint is reporting itself as unhealthy or something like that, then your DiagnosticMonitor (DM) will still be responsive and will send data according to the configuration you have setup.
However, if you mean that a role stop is a scale down operation (shutting down the VM), then no, there is no flush of the data on disk. At that point, the VM is shutdown and the DM with it. Anything not already flushed (transferred) can be considered lost.
If you are only rebooting the VM, then in theory you will be connected back to the same resource VHDs that hold the buffered diagnostics data so you would not lose it, it would be transferred on next request. I am pretty sure that sticky storage is enabled on it, so it won't be cleaned on reboot.
HTH.
The diagnostic data is stored locally before it is transferred to storage. So that information is available to you there; you can review/verify this by using RDP to check it out.
I honestly have not tested to see if it gets transferred after the role stops. However, you can request transfers on demand. So using that approach, you could request the logs/dumps to be transferred one more time after the role has stopped.
I would suggest checking out a tool like Cerebrata Azure Diagnostics Manager to request on demand transfer of your logs, and also analyze the data.
I answered your other question as well. Part of my answer was to add the event that would allow you to change your logging and transfer settings on the fly.
Hope this helps
I think it works like this: local diagnostic data is stored in the local storage named "DiagnosticStore", which I guess has cleanOnRoleRecycle set to false. (I don't know how to verify this last bit - LocalResource has no corresponding attribute.) When the role is recycled that data remains in place and will eventually be uploaded by the new diagnostic monitor (assuming the role doesn't keep crashing before it can finish).
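If you want to check where that buffered data lives on an instance, something along these lines should work from inside the role (assuming the classic Microsoft.WindowsAzure.ServiceRuntime assembly):

```csharp
// Sketch: log the path and size of the built-in "DiagnosticStore" local resource,
// which is where the Diagnostic Monitor buffers data before transferring it to storage.
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class DiagnosticStoreInfo
{
    public static void LogDiagnosticStorePath()
    {
        LocalResource store = RoleEnvironment.GetLocalResource("DiagnosticStore");
        Trace.TraceInformation(
            "DiagnosticStore root: {0} (max size: {1} MB)",
            store.RootPath,
            store.MaximumSizeInMegabytes);
    }
}
```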
