Caching Diagnostics recommends 20GB of local storage(!). Why? - azure

I installed the Azure 1.8 tools/SDK and it upgraded my projects co-located caching from preview to final. However, it also decided to add 20GB to the Role's Local Storage (DiagnosticStore). I manually dialed it down to 500MB but then I get the following message in the Role's Property page (cloud proj => roles => right click role => properties i.e. GUI for ServiceDefinition.csdef):
Caching Diagnostics recommends 20GB of local storage. If you decrease
the size of local storage a full redeployment is required which will
result in a loss of virtual IP addresses for this Cloud Service.
I don't know who signed off on this operating model within MS but it begs a simple Why?. For better understanding, I'm breaking that "Why" into 3 "Why" subquestions for caching in Azure SDK 1.8:
Why is the diagnostics of caching coupled with the caching itself? We just need caching for performance...
Why is the recommendation for a whopping 20Gigs? What happens if I dial it down to 500MB?
Slightly off-topic but still related: why does the decreasing of local storage require a full redeployment? This is especially painful since Azure doesn't provide any strong controls to reserve IP addresses. So if you need to work with 3rd parties that use whitelisted IPs - too bad!?
PS: I did contemplate breaking it into 3 separate questions. But given that they are tightly coupled it seems this would be a more helpful approach for future readers.

Diagnostic store is used for storing cache diagnostic data which includes - server logs, crash dumps, counter data etc. which can be automatically uploaded to Azure Storage by configuring the cache diagnostics (CacheDiagnostics.ConfigureDiagnostics call in OnStart method - without this call, data is generated on local VM but not uplaoded into Azure Storage ). And the amount of data that is collected is controlled by diagnostic level (higher the level, more data is collected) which can be changed dynamically. More details on cache diagnostics is avialble at: http://msdn.microsoft.com/en-us/library/windowsazure/hh914135.aspx
Since you enabled cache, it will come with default diagnostic level that should help in diagnosing cache issues if they happen. This data is stored locally unless you call the ConfigureDiagnostics method in OnStart (which uploads the data to Azure storage).
If a lower storage value is provided (say 2GB), then higher diagnostic levels cannot be used since they need more space (crash dump itself can take upwards 12GB for XL VMs). And if you want higher levels, then you might want to upgrade the deployment with change in the diagnostic store size which defeats the purpose - change diagnostic level without redeployment/upgrade/update/restarts. That is the reason why a limit of 20GB is set to cater to all diagnostic levels (and they can be changed in a running deployment with cscfg change).
is answered above.
Hope this helps.

I'll answer question #3 - local storage decreases are one of the only deployment changes that can't be done in-place (increases are fine, as well as VM size changes and several other changes now possible without redeploy). See this post for details around in-place updates.

Related

Auto suggest, Azure Webapp & .Net core WebAPI iMemoryCache

Tech Stack
Azure WebApp
.Net core 2.1 WebApi
We have around 4k reference data which is used during auto suggest lookup, so in this i was wondering whether i should cache this data on WebApp or should always get it from database / 3rd party API.
I know i can use RedisCache to solve this issue, but i would like to know how Azure WebApp works when it comes to caching, it will have memory pressure? When? Yes then scale-up is the only solution?
We are using IMemoryCache in .net Core to store reference data and it expires on daily basis or when Azure WebApp is restarted (So 1st user will get delay till it gets all data in cache).
Data size is in range of 500KB - 1MB & sometimes goes till 3MB+.
What is the best approach?
iMemoryCache is not suggested when using WebApps because it is tightly bound to your application instance, so if you try to scale out your app (in case of load surges during the day) your caching mechanism will be broken.
RedisCache is pretty much a dictionary, key-value pairs.
It is very fast on look-ups but it could be very slow in some other operations like a GetAllKeys when it has to run through the whole cache. That will bring your cache server to its knees, so it needs to be handled carefully.
It will not put any significant pressure in the memory consumption of your app, you only need to have a static client. The rest is handled by the redis server.
If you plan to scale up your application (give more RAM and CPU resources to your one running instance) the iMemory cache is probably fine.
If you plan to scale out (create multiple instances of your application), that is strongly suggested for all stateless applications, then RedisCache (or any other distributed cache) is an one way for you if you need a caching mechanism.
Value and key max size is 512MB so you are on the safe side regarding value data size.
Attention
Be sure to use the Connection multiplexer as it is suggested in the official documentation because it automatically re-establishes the connection in case it is lost. That was a bug earlier, when redis cache server was going into maintenance your calls where redirected to the fail over instance but the connection was failing, so you needed to restart your application.

Wildcards in counter specifiers in Azure Diagnostic

What I am trying to achieve is similar to what you can see in mongo-azure repo, specifically, I want to write Azure Diagnostic Configuration File in such a way so that I can get all the instances of performance counters, for example
\Processor(*)\% Processor Time
and it doesn't seem to be working - no data is visible in the table in my storage account.
Is it achievable at all with configuration, and if so, how?
UPD: We were able to get this working for a simple single VM (so it is possible!), but for some reason it still doesn't work for VMs in VMSS where Service Fabric Cluster is running
UPD #2: We did upgrade to VS 2015 tools 1.5 and now it magically works. I am not really sure if that was the root cause problem or we screwed up anywhere else.
Is it achievable at all with configuration, and if so, how?
Based on the documentation here, it seems it is not possible. From this link:
Performance counters available for Microsoft Azure
Azure provides a subset of the performance counters available for
Windows Server, IIS and the ASP.NET stack. The following table lists
some of the performance counters of particular interest for Azure
applications.
Table that follows below only includes Processor(_Total).
This is possible. We do this in CloudMonix when monitoring VMSS and it is working.
How are you instrumenting Diagnostics and which tables are you looking at?
Specifying \LogicalDisk(*)\Available Megabytes yileds all drives and their free space, for example

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes, because apparently Microsoft had a 'blip' on their storages. They told me that it is because of a shared file system between the instances (making it a single point of failure?)
I didn't understand it and asked how file share is involved, because I would assume a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below:
This is their reply to my question (I didn't include the drawing)
The file shares are not necessarily for your web app to communicate to
another resources but they are on our end where the app content
resides on. That is what we meant when we suggested that about storage
being unavailable on our file servers. The reason the restarts would
be triggered for your app that is on both the instances is because the
resources are shared, the underlying storage would be the same for
both the instances. That’s the reason if it goes down on one, the
other would also follow eventually. If you really want the
availability of the app to be improved, you can always use a traffic
manager. However, there is no guarantee that even with traffic manager
in place, the app doesn’t go down but it improves overall availability
of your app. Also we have recently rolled out an update to production
that should take care of restarts caused by storage blips ideally, but
for this feature to be kicked it you need to make sure that there is
ample amount of memory needs to be available in the cases where this
feature needs to kick in. We have couple of options that you can have
set up in order to avoid any unexpected restarts of the app because of
a storage blip on our end:
You can evaluate if you want to move to a bigger instance so that
we might have enough memory for the overlap recycling feature to be
kicked in.
If you don’t want to move to a bigger instance, you can always use
local cache feature as outlined by us in our earlier email.
Because of the time differences the communication takes ages. Can anyone tell me what is wrong in my thinking?
The only thing that I think of is that when you've enabled two instances, they run on the same physical server. But that makes really little sense to me.
I have two instances one core, 1.75 GB memory.
My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description) Largely based on Web Apps sales spiel which states
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web apps appears to be that the VM's themselves are separated into availability sets. However all of the instances use the same fileserver to share the underlying disk space. This file server being a significant single point of failure.
To mitigate this Azure have created the WEBSITE_LOCAL_CACHE_OPTION which will cache the contents of the file server onto the individual Web App instances. Using caching in lieu of solid, high availability engineering principles.
The problem here is that as a customer we have no visibility into this issue, we've no idea if there is a plan to fix it, or if or when it will ever be fixed since it seems unlikely that Azure is going to issue a document that admits to how badly this has been engineered, even if it is to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email, with a link to the Web Apps page, (and this question) with a copy of the quote and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines Azure will let you specify an availability set. An availability set will automatically split VMs into separate update and fault domains. Meaning that servers will end up in different server racks, and those server racks won't get updates at the same time. (it is a little more complex than that, but that's the basics!)
Azure Web Apps do used a shared file storage. The best way to think about it is that all the instances of your app map to the same network share that have your files. So if you modify the files by any mean (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.

What happens when when the disc holding logs on azure is full?

Our website is currently deployed to azure and we are writing trace logs using azure diagnostics. We then ship the logs to blob storage periodically and read them using Cerebrata’s Windows Diagnostics Manger software. I would like to know what happens when the disc holding the logs on azure is full i.e before the logs are shipped. When do the logs get purged? and is it is any different if the logs are not shipped. My concern is that the site may somehow fall over when exceptions are raised (if at all) when trying to write to a full disc.
Many Thanks
If you are using Windows Azure Diagnostics, then it will age out the logs on disk (deleting the oldest files first). You have a quota that is specified in your wad-control-container in blob storage on an instance level basis. By default, this will be 4GB (you can change it). All of your traces, counters, and event logs needs to fit in this 4GB of disk space. You can set separate quotas here if you like per data source as well. The Diagnostics Manager takes care to manage the data sources and the quota.
Now, there was a bug in older versions of the SDK where the disk could get full and diagnostics stopped working. You will know if you might be impacted by this bug by RDP'ing into an instance and trying to navigate to C:\Resources\Directory\\Monitor directory. If you are denied access, then you are likely to hit this bug. If you can view this directory as normal admin on machine, you should not be impacted. There was a permission issue in an older SDK version where deletes to this directory failed. Unfortunately, the only symptom of this impact is that suddenly you won't get data transferred out anymore. There is no overt failure.
Are you using System.Diagnostics.Trace to "write" your logs, or you are writing in log files.
In either way there is a roll-up. Meaning that if you hit storage quota, before logs beeing transfered, the oldest logs are being deleted. But you can easily increase your default (4G !) logs quota.
Please take a look at following articles and posts, describing in detail diagnostics in Windows Azure:
http://blogs.msdn.com/b/golive/archive/2012/04/21/windows-azure-diagnostics-from-the-ground-up.aspx
http://www.windowsazure.com/en-us/develop/net/common-tasks/diagnostics/
http://msdn.microsoft.com/en-us/library/windowsazure/hh411544.aspx

What happens to Azure diagnostic information when a role stops?

When an Azure worker role stops (either because of an unhandled exception or because Run() finishes), what happens to local diagnostic information that has not yet been transferred? Microsoft documentation says diagnostics are transferred to storage at scheduled intervals or on demand, neither of which can cover an unhandled exception. Does this mean diagnostic information is always lost in this case? This seems particularly odd because crash dumps are part of the diagnostic data (set up by default in DiagnosticMonitorConfiguration.Directories). How then can you ever get a crash dump back (related to this question)?
To me it would be logical if diagnostics were also transferred when a role terminates, but this is not my experience.
It depends on what you mean by 'role stops'. The Diagnostic Monitor in SDK 1.3 and later is implemented as a background task that has no dependency on the RoleEntryPoint. So, if you mean your RoleEntryPoint is reporting itself as unhealthy or something like that, then your DiagnosticMonitor (DM) will still be responsive and will send data according to the configuration you have setup.
However, if you mean that a role stop is a scale down operation (shutting down the VM), then no, there is no flush of the data on disk. At that point, the VM is shutdown and the DM with it. Anything not already flushed (transferred) can be considered lost.
If you are only rebooting the VM, then in theory you will be connected back to the same resource VHDs that hold the buffered diagnostics data so you would not lose it, it would be transferred on next request. I am pretty sure that sticky storage is enabled on it, so it won't be cleaned on reboot.
HTH.
The diagnostic data is stored locally before it is transferred to storage. So that information is available to you there; you can review/verify this by using RDP to check it out.
I honestly have not tested to see if it gets transferred after the role stops. However, you can request transfers on demand. So using that approach, you could request the logs/dumps to be transferred one more time after the role has stopped.
I would suggest checking out a tool like Cerebrata Azure Diagnostics Manager to request on demand transfer of your logs, and also analyze the data.
I answered your other question as well. Part of my answer was to add the event that would allow you to change your logging and transfer settings on the fly.
Hope this helps
I think it works like this: local diagnostic data is stored in the local storage named "DiagnosticStore", which I guess has cleanOnRoleRecycle set to false. (I don't know how to verify this last bit - LocalResource has no corresponding attribute.) When the role is recycled that data remains in place and will eventually be uploaded by the new diagnostic monitor (assuming the role doesn't keep crashing before it can finish).

Resources