What happens to Azure diagnostic information when a role stops?

When an Azure worker role stops (either because of an unhandled exception or because Run() finishes), what happens to local diagnostic information that has not yet been transferred? Microsoft documentation says diagnostics are transferred to storage at scheduled intervals or on demand, neither of which can cover an unhandled exception. Does this mean diagnostic information is always lost in this case? This seems particularly odd because crash dumps are part of the diagnostic data (set up by default in DiagnosticMonitorConfiguration.Directories). How then can you ever get a crash dump back (related to this question)?
To me it would be logical if diagnostics were also transferred when a role terminates, but this is not my experience.
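For reference, a minimal sketch of the kind of OnStart wiring being described (the SDK 1.x-era API, with crash dump collection and a scheduled transfer of the Directories data source); the one-minute transfer period is just an illustrative value, not a recommendation:

```csharp
// A minimal sketch of the diagnostics setup described above (SDK 1.x-era API).
// The one-minute transfer period is an illustrative value only.
using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Crash dumps are written to a local directory that belongs to the
        // Directories data source; they only reach blob storage on a transfer.
        CrashDumps.EnableCollection(true);

        DiagnosticMonitorConfiguration config =
            DiagnosticMonitor.GetDefaultInitialConfiguration();
        config.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(1.0);

        DiagnosticMonitor.Start(
            "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

        return base.OnStart();
    }
}
```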

It depends on what you mean by 'role stops'. The Diagnostic Monitor in SDK 1.3 and later is implemented as a background task that has no dependency on the RoleEntryPoint. So, if you mean your RoleEntryPoint is reporting itself as unhealthy or something like that, then your Diagnostic Monitor (DM) will still be responsive and will send data according to the configuration you have set up.
However, if by a role stop you mean a scale-down operation (shutting down the VM), then no, there is no flush of the data on disk. At that point the VM is shut down and the DM with it. Anything not already flushed (transferred) can be considered lost.
If you are only rebooting the VM, then in theory you will be reconnected to the same resource VHDs that hold the buffered diagnostics data, so it would not be lost; it would be transferred on the next request. I am pretty sure that sticky storage is enabled on it, so it won't be cleaned on reboot.
HTH.

The diagnostic data is stored locally before it is transferred to storage. So that information is available to you there; you can review/verify this by using RDP to check it out.
I honestly have not tested to see if it gets transferred after the role stops. However, you can request transfers on demand. So using that approach, you could request the logs/dumps to be transferred one more time after the role has stopped.
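For anyone who prefers code over a tool, a hedged sketch of triggering that on-demand transfer with the older Microsoft.WindowsAzure.Diagnostics.Management API; the connection string, deployment ID, role name, instance ID, and queue name are all placeholders you would substitute:

```csharp
// Hedged sketch: request an on-demand transfer of the Directories buffer (crash
// dumps, custom log directories) for one role instance. All string values below
// are placeholders.
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.Diagnostics.Management;

public static class OnDemandTransfer
{
    public static Guid RequestDirectoriesTransfer()
    {
        CloudStorageAccount storage =
            CloudStorageAccount.Parse("<wad-storage-connection-string>");

        var deployment = new DeploymentDiagnosticManager(storage, "<deployment-id>");
        var instance = deployment.GetRoleInstanceDiagnosticManager(
            "MyWorkerRole", "MyWorkerRole_IN_0");

        var options = new OnDemandTransferOptions
        {
            From = DateTime.UtcNow.AddHours(-1),          // window of buffered data to push
            To = DateTime.UtcNow,
            NotificationQueueName = "wad-transfer-queue"  // completion notice lands here
        };

        // Returns a request id you can correlate with the notification message.
        return instance.BeginOnDemandTransfer(DataBufferName.Directories, options);
    }
}
```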
I would suggest checking out a tool like Cerebrata Azure Diagnostics Manager to request on demand transfer of your logs, and also analyze the data.
I answered your other question as well. Part of my answer was to add the event that would allow you to change your logging and transfer settings on the fly.
Hope this helps

I think it works like this: local diagnostic data is stored in the local storage named "DiagnosticStore", which I guess has cleanOnRoleRecycle set to false (I don't know how to verify this last bit; LocalResource exposes no corresponding property). When the role is recycled, that data remains in place and will eventually be uploaded by the new Diagnostic Monitor (assuming the role doesn't keep crashing before it can finish).
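One thing you can check from code is where that local store actually lives and how big it is; a small sketch (this only exposes the path and size, not cleanOnRoleRecycle):

```csharp
// Logs the root path and size of the implicit DiagnosticStore local resource.
// Note: cleanOnRoleRecycle itself is not exposed through this API.
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class DiagnosticStoreInfo
{
    public static void LogLocation()
    {
        LocalResource store = RoleEnvironment.GetLocalResource("DiagnosticStore");
        Trace.TraceInformation("DiagnosticStore at {0} ({1} MB)",
                               store.RootPath, store.MaximumSizeInMegabytes);
    }
}
```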

Related

Performance impact of writing Azure diagnostic logs to blob storage

Our C# web app, running on Azure, uses System.Diagnostics.Trace to write trace statements for debugging/troubleshooting. Once we enable blob storage for these logs (using the "Application Logging (blob)" option in the Azure portal), the response time for our application slows down considerably. If I turn this option off, the web app speeds up again (though obviously we don't get logs in blob storage anymore).
Does anyone know if this is expected? We certainly write a lot of trace statements on every request (100 or so per request), but I would not think this is unusual for a web application. Is there some way to diagnose why enabling blob storage for the logs dramatically slows down the execution of these trace statements? Is writing the trace statement synchronous with the logs being updated in blob storage, for instance?
I was unable to find any information about how logging to blob storage in Azure was implemented. However, this is what I was able to deduce:
I confirmed that disabling the global trace lock (the useGlobalLock setting in system.diagnostics) had no effect. Therefore, the performance problem was not directly related to lock contention.
I also confirmed that if I turn AutoFlush off, the performance problem did not occur.
From further cross-referencing the source code for the .NET trace API, it appears that when you enable blob storage for logs, Azure injects some kind of trace listener into your application (the same way you might add a listener in web.config) and it synchronously writes every trace statement it receives to blob storage.
As such, there are a few ways to work around this behavior:
1. Don't turn on AutoFlush, but flush manually on a periodic basis (see the sketch after this list). This prevents the synchronous blob writes from interrupting every log statement.
2. Write your own daemon that periodically copies local log files to blob storage, or something along those lines.
3. Don't use the blob storage feature at all and instead leverage the tracing functionality in Application Insights.
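As a rough illustration of the first workaround, something like the following batches the blob writes instead of paying for one on every trace statement (the 30-second interval is an arbitrary example value):

```csharp
// Sketch of workaround 1: keep AutoFlush off and flush the trace buffer on a timer.
using System;
using System.Diagnostics;
using System.Threading;

public static class PeriodicTraceFlusher
{
    private static Timer _timer;

    public static void Start()
    {
        Trace.AutoFlush = false;                 // no synchronous write per Trace call
        _timer = new Timer(_ => Trace.Flush(),   // push buffered statements in one batch
                           null,
                           TimeSpan.FromSeconds(30),
                           TimeSpan.FromSeconds(30));
    }
}
```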
I ended up doing #3 because, as it turns out, we already had Application Insights configured and on, we just didn't realize it could handle trace logging and querying. After disabling sampling for tracing events, we now have a way to easily query for any log statement remotely and get the full set of traces subject to any criteria (keyword match, all traces for a particular request, all traces in a particular time period, etc.) Moreover, there is no noticeable synchronous overhead to writing log statements with the Application Insights trace listener, so nothing in our application has to change (we can continue using the .NET trace class). As a bonus, since Application Insights tracing is pretty flexible with the tracing source, we can even switch to another higher performance logging API (e.g. ETW or log4net) if needed and Application Insights still works.
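For readers who want to try the same route, a hedged sketch of wiring up the Application Insights trace listener in code; it assumes the Microsoft.ApplicationInsights.TraceListener NuGet package, the instrumentation key is a placeholder, and in practice the listener is usually registered via config rather than in code:

```csharp
// Hedged sketch: route System.Diagnostics.Trace output to Application Insights.
// Assumes the Microsoft.ApplicationInsights.TraceListener package; the key is a placeholder.
using System.Diagnostics;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.ApplicationInsights.TraceListener;

public static class AppInsightsTracing
{
    public static void Enable()
    {
        TelemetryConfiguration.Active.InstrumentationKey = "<your-instrumentation-key>";
        Trace.Listeners.Add(new ApplicationInsightsTraceListener());

        // Existing Trace calls keep working; they now flow to Application Insights.
        Trace.TraceInformation("Application Insights trace listener enabled");
    }
}
```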
Ultimately, you should consider using Application Insights for storing and querying your traces. Depending on why you wanted your logs in blob storage in the first place, it may or may not meet your needs, but it worked for us.

How does one know why an Azure Website instance (Web App) was shut down?

By looking at my Pingdom reports I have noted that my Website instance is getting recycled. Basically Pingdom is used to keep my site warm. When I look deeper into the Azure logs, i.e. /LogFiles/kudu/trace, I notice a number of small XML files with "shutdown" or "startup" suffixes, e.g.:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might have to do with MS patching VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicion that it is happening at the OS level. Is there a way to find out why my instance is being shut down?
I also admit I am only using one S2 instance, scalable to three depending on CPU usage. We may have to review this and use a 2-3 setup. Obviously this doubles the costs.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with a status of "succeeded", but nothing for the times when the above files were written. So it seems that the instance is being shut down, but the event is not appearing in the Operation Log. Why would this be? There were about 5 yesterday, yet the last Operation Log entry was 29/7.
An example of one of yesterday's shutdown xml file:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
You should see entries regarding backend maintenance in the operation logs.
As for keeping your site alive, Standard plans allow you to use the "Always On" feature, which does pretty much what Pingdom is doing to keep your website warm. Just enable it on the Configure tab of the portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
Every site on Azure runs two applications: one is yours and the other is the SCM endpoint (a.k.a. Kudu). These "shutdown" traces are for the Kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself, just like Kudu does. If you don't have Always On enabled, Kudu gets shut down after an hour of inactivity (as far as I remember).
Aside from that, as you mentioned, Azure will shut down your app during machine upgrades, though I don't think these shutdowns result in Operation Log events.
Are you seeing any side effects? Is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. We bring the site up on a new machine before shutting it down on the old one and letting connections drain; however, this should not result in any perceivable downtime.

Determining cause of CPU spike in Azure

I am relatively new to Azure. I have a website that has been running for a couple of months with not much traffic... when users are on the system, the various dashboard monitors go up, and they flatline the rest of the time. This week, the CPU time went way up even though there were no requests and no data going into or out of the site. Is there a way to determine the cause of this CPU activity when the site is not active? It doesn't make sense to me that CPU activity should be attributed to my site when there is no site activity.
If your website does significant processing at application start, it is possible your VM got rebooted or your app pool recycled, and your start-up handler executed again (which would cause CPU to spike without any requests).
You can analyze this by adding application logging to your Application_Start event (after initializing tracing). There is another comment detailing how to enable logging, but you can also consult this link.
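A minimal sketch of that suggestion, assuming a standard Global.asax code-behind: trace from Application_Start so a recycle shows up in the logs with a timestamp you can line up against the CPU spike.

```csharp
// Sketch: log application starts so app-pool recycles are visible in the trace logs.
using System;
using System.Diagnostics;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Start(object sender, EventArgs e)
    {
        // If this timestamp lines up with a CPU spike, the recycle (and any heavy
        // start-up work that follows it) is the likely culprit.
        Trace.TraceInformation("Application_Start at {0:u}", DateTime.UtcNow);
    }
}
```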
You need to collect data to understand what's going on, so here is what I would do first:
1. Go to the Azure management portal -> your website (assuming you are using Azure Websites) -> Dashboard -> Operation Logs, and check whether there is any suspicious activity going on.
2. Download the logs for your site using any FTP client and analyze what's happening. If there is not much data, I would suggest adding more logging to your application to see what is happening or which module is spinning.
A great way to detect CPU spikes and even determine slow-running areas of your application is to use a profiler like New Relic. It's a free add-on for Azure that collects data and provides you with a dashboard. You might find it useful for determining the exact cause of the CPU spike.
We regularly use it to monitor the performance of our applications. I would recommend it.

Caching Diagnostics recommends 20GB of local storage(!). Why?

I installed the Azure 1.8 tools/SDK and it upgraded my project's co-located caching from preview to final. However, it also decided to add 20GB to the role's local storage (DiagnosticStore). I manually dialed it down to 500MB, but then I get the following message on the role's property page (cloud project => Roles => right-click the role => Properties, i.e. the GUI for ServiceDefinition.csdef):
Caching Diagnostics recommends 20GB of local storage. If you decrease
the size of local storage a full redeployment is required which will
result in a loss of virtual IP addresses for this Cloud Service.
I don't know who signed off on this operating model within MS, but it begs a simple question: why? For better understanding, I'm breaking that "why" into 3 sub-questions for caching in Azure SDK 1.8:
Why is the diagnostics of caching coupled with the caching itself? We just need caching for performance...
Why is the recommendation for a whopping 20Gigs? What happens if I dial it down to 500MB?
Slightly off-topic but still related: why does the decreasing of local storage require a full redeployment? This is especially painful since Azure doesn't provide any strong controls to reserve IP addresses. So if you need to work with 3rd parties that use whitelisted IPs - too bad!?
PS: I did contemplate breaking it into 3 separate questions. But given that they are tightly coupled it seems this would be a more helpful approach for future readers.
The diagnostic store is used for storing cache diagnostic data, which includes server logs, crash dumps, counter data, etc., and which can be automatically uploaded to Azure Storage by configuring cache diagnostics (the CacheDiagnostics.ConfigureDiagnostics call in the OnStart method - without this call, data is generated on the local VM but not uploaded to Azure Storage). The amount of data collected is controlled by the diagnostic level (the higher the level, the more data is collected), which can be changed dynamically. More details on cache diagnostics are available at: http://msdn.microsoft.com/en-us/library/windowsazure/hh914135.aspx
Since you enabled cache, it comes with a default diagnostic level that should help in diagnosing cache issues if they happen. This data is stored locally unless you call the ConfigureDiagnostics method in OnStart (which uploads the data to Azure Storage).
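A hedged sketch of that OnStart wiring, following the pattern in the cache diagnostics article linked above:

```csharp
// Sketch of the OnStart wiring described above: without ConfigureDiagnostics the
// cache diagnostic data stays on the local VM and is never uploaded.
using Microsoft.ApplicationServer.Caching;      // CacheDiagnostics (caching SDK)
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class CacheWorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        DiagnosticMonitorConfiguration config =
            DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Adds the cache data sources (server logs, crash dumps, counters) to the
        // diagnostics configuration so they can be transferred to Azure Storage.
        CacheDiagnostics.ConfigureDiagnostics(config);

        DiagnosticMonitor.Start(
            "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

        return base.OnStart();
    }
}
```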
If a lower storage value is provided (say 2GB), then the higher diagnostic levels cannot be used, since they need more space (a crash dump by itself can take upwards of 12GB for XL VMs). And if you want higher levels, you would have to upgrade the deployment with a change in the diagnostic store size, which defeats the purpose of being able to change the diagnostic level without a redeployment/upgrade/update/restart. That is why a limit of 20GB is set: to cater to all diagnostic levels (and they can be changed in a running deployment with a cscfg change).
That should answer the first two questions.
Hope this helps.
I'll answer question #3 - local storage decreases are one of the only deployment changes that can't be done in-place (increases are fine, as are VM size changes and several other changes that are now possible without a redeploy). See this post for details around in-place updates.

What happens when the disk holding logs on Azure is full?

Our website is currently deployed to Azure and we are writing trace logs using Azure diagnostics. We then ship the logs to blob storage periodically and read them using Cerebrata's Azure Diagnostics Manager software. I would like to know what happens when the disk holding the logs on Azure is full, i.e. before the logs are shipped. When do the logs get purged? And is it any different if the logs are not shipped? My concern is that the site may somehow fall over when exceptions are raised (if at all) while trying to write to a full disk.
Many Thanks
If you are using Windows Azure Diagnostics, it will age out the logs on disk (deleting the oldest files first). You have a quota that is specified in your wad-control-container in blob storage on a per-instance basis. By default this is 4GB (you can change it). All of your traces, counters, and event logs need to fit into this 4GB of disk space. You can also set separate quotas per data source if you like. The diagnostics agent takes care of managing the data sources and the quota.
Now, there was a bug in older versions of the SDK where the disk could get full and diagnostics stopped working. You will know if you might be impacted by this bug by RDP'ing into an instance and trying to navigate to C:\Resources\Directory\\Monitor directory. If you are denied access, then you are likely to hit this bug. If you can view this directory as normal admin on machine, you should not be impacted. There was a permission issue in an older SDK version where deletes to this directory failed. Unfortunately, the only symptom of this impact is that suddenly you won't get data transferred out anymore. There is no overt failure.
Are you using System.Diagnostics.Trace to "write" your logs, or are you writing to log files yourself?
Either way there is a roll-over: if you hit the storage quota before the logs are transferred, the oldest logs are deleted. But you can easily increase the default (4GB!) logs quota.
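As a rough sketch, increasing the quota looks something like this in the diagnostics configuration (the sizes are example values, and the total still has to fit inside the DiagnosticStore local resource):

```csharp
// Sketch: raise the overall and per-data-source quotas. Example sizes only; the
// DiagnosticStore local resource must be large enough to hold them.
using System;
using Microsoft.WindowsAzure.Diagnostics;

public static class DiagnosticsQuota
{
    public static DiagnosticMonitorConfiguration Build()
    {
        DiagnosticMonitorConfiguration config =
            DiagnosticMonitor.GetDefaultInitialConfiguration();

        config.OverallQuotaInMB = 8192;                          // total local buffer (default ~4 GB)
        config.Logs.BufferQuotaInMB = 1024;                      // quota for trace logs
        config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

        return config;
    }
}
```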
Please take a look at the following articles and posts, which describe diagnostics in Windows Azure in detail:
http://blogs.msdn.com/b/golive/archive/2012/04/21/windows-azure-diagnostics-from-the-ground-up.aspx
http://www.windowsazure.com/en-us/develop/net/common-tasks/diagnostics/
http://msdn.microsoft.com/en-us/library/windowsazure/hh411544.aspx
