Azure changing hardware

I have a product which uses the CPU ID, network MAC address, and disk volume serial numbers for validation. Basically, when my product is first installed these values are recorded, and then each time the app loads, the current values are compared against the recorded ones.
Something very mysterious happened recently. Inside of an Azure VM that had not been restarted in weeks, my app failed to load because some of these values were different. Unfortunately the person who caught the error deleted the VM before it was brought to my attention.
My question is, when an Azure VM is running, what hardware resources may change? Is that even possible?
Thanks!
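For context, here is a minimal sketch of the kind of fingerprint check being described (Windows-only, shelling out to wmic; the specific identifiers, file name, and strict equality comparison are illustrative assumptions, not the actual product code):

```python
# Minimal sketch of a hardware fingerprint check like the one described above.
# Windows-only; assumes the wmic tool is available. The chosen identifiers and
# the strict equality comparison are assumptions, not the actual product code.
import json
import subprocess
import uuid

def wmic_value(args):
    # e.g. wmic_value(["cpu", "get", "ProcessorId"]) -> "BFEBFBFF000306D4"
    out = subprocess.check_output(["wmic"] + args, text=True)
    lines = [line.strip() for line in out.splitlines() if line.strip()]
    return lines[1] if len(lines) > 1 else ""

def current_fingerprint():
    return {
        "cpu_id": wmic_value(["cpu", "get", "ProcessorId"]),
        "mac": format(uuid.getnode(), "012x"),
        "volume_serial": wmic_value(["path", "win32_logicaldisk",
                                     "where", "DeviceID='C:'",
                                     "get", "VolumeSerialNumber"]),
    }

def validate(recorded_path="fingerprint.json"):
    # Compare the values recorded at install time against the current ones.
    with open(recorded_path) as f:
        recorded = json.load(f)
    return recorded == current_fingerprint()
```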

Answering this requires a short rundown of how Azure works.
In each data centre there are thousands of individual machines. Each machine runs a hypervisor which allows a number of operating systems to share the same underlying hardware.
When you start a role, Azure looks for available resources - disk space, CPU, RAM, etc. - and boots up a copy of the appropriate OS VM on those available resources. I understand from your question that this is a VM role - so this VM is the one you uploaded or created.
As long as your VM is running, the underlying virtual resources provided by the hypervisor are not likely to change. (The caveat to this is that Windows Server 2012's hypervisor can move virtual machines around over the network even while they are running. Whether Azure takes advantage of this, I don't know.)
Now, Azure keeps charging you even when your role has stopped, because it considers your role "deployed". So in theory, those underlying resources still "belong" to your role.
This is not guaranteed. Azure could decide to boot up your VM on a different set of virtualized hardware for any number of reasons - hardware failure being at the top of the list, with insufficient capacity being second.
It is even possible (though unlikely) for your resources to be provided by different hardware nodes.
An additional point of consideration is that Azure policy allows disaster recovery (or another major event) to include transferring your roles to run in a separate data centre entirely.
My point is that the underlying hardware is virtual and treating it otherwise is most unwise. Roles are at the mercy of the Azure Management Routines, and we can't predict in advance what decisions they may make.
So the answer to your question is that ALL of the underlying resources may change. And it is very, very possible.

Related

Azure Confidential VM benchmarks perform better than same-size non-confidential VM instance

This is my first attempt at using AMD SEV, and before deploying any applications I wanted to verify the expected performance overhead compared to a VM without AMD SEV. I am trying to replicate these results, which find Confidential VMs 2-8% slower than their non-confidential counterparts. However, in my case, the confidential VM seems to consistently perform better than the Standard VM.
I deployed 2 Azure VMs, one D4asv5 (Security: Standard) and one DC4asv5 (Security: Confidential VM), both in US West, with the rest of the configuration exactly the same.
I ran CPU and fileio benchmarks with sysbench (same parameters for both VMs) and CoreMark 666 benchmarks, as in the article above. In all cases, the Standard VM was 4-9% slower than the Confidential VM.
I have double-checked that the VMs are configured as intended, so I assume this must be an issue with the Standard VM's performance. I redeployed the VM to force a host migration, and ran Performance Diagnostics, which didn't identify any issues. The benchmarks ran over a period of a few days, during which I reallocated the VMs multiple times.
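For reference, this is roughly the kind of sysbench run being compared on each VM (the thread count, prime limit, file size, and duration here are assumptions, not necessarily the parameters actually used):

```python
# Rough sketch of the sysbench comparison described above. The parameters
# (threads, cpu-max-prime, file size, duration) are assumptions, not the
# exact ones used in the tests.
import re
import subprocess

def run(cmd):
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(" ".join(cmd))
    print(result.stdout)
    return result.stdout

# CPU benchmark: "events per second" is the headline number to compare.
cpu_out = run(["sysbench", "cpu", "--cpu-max-prime=20000", "--threads=4", "run"])
print("events/sec:", re.search(r"events per second:\s+([\d.]+)", cpu_out).group(1))

# fileio benchmark: prepare the test files, run random read/write, clean up.
run(["sysbench", "fileio", "--file-total-size=4G", "prepare"])
run(["sysbench", "fileio", "--file-total-size=4G",
     "--file-test-mode=rndrw", "--time=120", "run"])
run(["sysbench", "fileio", "--file-total-size=4G", "cleanup"])
```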
Is there anything else that I am missing, that could justify this performance difference?

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes, because apparently Microsoft had a 'blip' on their storage. They told me that it is because of a shared file system between the instances (making it a single point of failure?).
I didn't understand it and asked how a file share is involved, because I would have assumed a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below:
This is their reply to my question (I didn't include the drawing)
The file shares are not necessarily for your web app to communicate to
another resources but they are on our end where the app content
resides on. That is what we meant when we suggested that about storage
being unavailable on our file servers. The reason the restarts would
be triggered for your app that is on both the instances is because the
resources are shared, the underlying storage would be the same for
both the instances. That’s the reason if it goes down on one, the
other would also follow eventually. If you really want the
availability of the app to be improved, you can always use a traffic
manager. However, there is no guarantee that even with traffic manager
in place, the app doesn’t go down but it improves overall availability
of your app. Also we have recently rolled out an update to production
that should take care of restarts caused by storage blips ideally, but
for this feature to be kicked it you need to make sure that there is
ample amount of memory needs to be available in the cases where this
feature needs to kick in. We have couple of options that you can have
set up in order to avoid any unexpected restarts of the app because of
a storage blip on our end:
You can evaluate if you want to move to a bigger instance so that
we might have enough memory for the overlap recycling feature to be
kicked in.
If you don’t want to move to a bigger instance, you can always use
local cache feature as outlined by us in our earlier email.
Because of the time difference the communication takes ages. Can anyone tell me what is wrong with my thinking?
The only thing I can think of is that when you've enabled two instances, they run on the same physical server. But that makes very little sense to me.
I have two instances, each with one core and 1.75 GB of memory.
My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description), largely based on the Web Apps sales spiel, which states:
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web Apps appears to be that the VMs themselves are separated into availability sets. However, all of the instances use the same file server to share the underlying disk space, and this file server is a significant single point of failure.
To mitigate this, Azure has created the WEBSITE_LOCAL_CACHE_OPTION setting, which caches the contents of the file server onto the individual Web App instances - using caching in lieu of solid high-availability engineering principles.
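For what it's worth, a sketch of turning that setting on with the Azure CLI (wrapped in Python only to keep the examples in one language; the resource group and app names are placeholders, and the cache size setting is optional):

```python
# Sketch: enable the local cache option mentioned above via the Azure CLI.
# Resource group and app names are placeholders.
import subprocess

subprocess.run([
    "az", "webapp", "config", "appsettings", "set",
    "--resource-group", "my-rg",           # placeholder
    "--name", "my-web-app",                # placeholder
    "--settings",
    "WEBSITE_LOCAL_CACHE_OPTION=Always",
    "WEBSITE_LOCAL_CACHE_SIZEINMB=1000",   # optional cache size in MB
], check=True)
```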
The problem here is that as customers we have no visibility into this issue. We have no idea if there is a plan to fix it, or if or when it will ever be fixed, since it seems unlikely that Azure will issue a document admitting how badly this has been engineered, even if only to say that it has been fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email, with a link to the Web Apps page, (and this question) with a copy of the quote and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines, Azure lets you specify an availability set. An availability set automatically splits VMs into separate update and fault domains, meaning that the servers end up in different server racks, and those racks won't receive updates at the same time. (It is a little more complex than that, but that's the basics!)
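As a purely conceptual illustration of that idea (a toy sketch, not Azure's actual placement algorithm; the domain counts are just typical defaults):

```python
# Toy illustration of spreading VMs across fault and update domains.
# Not Azure's actual algorithm; the domain counts are typical defaults.
FAULT_DOMAINS = 2    # separate racks / power / network switches
UPDATE_DOMAINS = 5   # groups that are never rebooted for host updates together

def place(vm_names):
    placement = {}
    for i, name in enumerate(vm_names):
        placement[name] = {
            "fault_domain": i % FAULT_DOMAINS,
            "update_domain": i % UPDATE_DOMAINS,
        }
    return placement

print(place(["web-0", "web-1", "web-2", "web-3"]))
# Adjacent instances land in different fault domains and different update
# domains, so a single rack failure or host update never takes out every
# instance at once.
```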
Azure Web Apps do use shared file storage. The best way to think about it is that all the instances of your app map to the same network share that has your files. So if you modify the files by any means (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.
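A quick way to see both halves of that from inside a scaled-out Web App is to compare the per-instance environment with the shared content. The environment variable names below are what App Service typically exposes; treat them as assumptions if your environment differs:

```python
# Each scaled-out instance reports its own instance ID / machine name, even
# though they all serve content from the same shared file storage. The
# variable names are those App Service typically exposes (an assumption here).
import os

print("site name   :", os.environ.get("WEBSITE_SITE_NAME"))
print("instance id :", os.environ.get("WEBSITE_INSTANCE_ID"))
print("machine name:", os.environ.get("COMPUTERNAME"))
```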

Will I lose all my data and hosted websites if I change size of Virtual Machine (VM) from Small to large from Azure portal?

Let's imagine I have created an Azure virtual machine, a small one initially. I have installed SQL Server and created databases. I have also hosted a website with IIS on the virtual machine.
I can see that the performance of the small one is not up to the mark. I want to upgrade to a larger, more powerful machine. I know I can do this from the Azure portal.
My question is: since I have already fully configured this machine, with databases and websites running on the small VM, will I lose all my data and hosted websites if I change the size of the virtual machine from small to large from the Azure portal? I am worried that with this upgrade I may lose my data and websites.
You will not lose your (entire) data when you scale.
Why did I put "entire" in parentheses? Because your data is on the system drive (C:), which by default (if you have not turned this off) has read/write host caching enabled. The write cache can cause some data corruption when the VM is not gracefully shut down, or while changing the size, and this is the only issue you have to worry about.
Changing VM size is kind of a common task that everyone does almost on a daily basis, especially when using IaaS as dev/test environment.
It is also a recommended corrective action to take if you are having issues with booting up the VM.
So, go ahead and change the size. As a precaution, you can stop IIS before resizing to avoid data loss. This only makes sense if your application has some logic which writes files to the local (C:) drive.
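A sketch of that precaution, with placeholder names (the resize step uses the Azure CLI and reboots the VM; the iisreset step has to run on the VM itself with admin rights):

```python
# Sketch of the precaution described above: stop IIS so nothing is writing to
# the local drive, then resize the VM. Names and the target size are placeholders.
import subprocess

# 1. On the VM itself (requires admin rights): stop IIS before the resize.
subprocess.run(["iisreset", "/stop"], check=True)

# 2. From a machine logged in to the Azure CLI: resize the VM (this reboots it).
subprocess.run([
    "az", "vm", "resize",
    "--resource-group", "my-rg",      # placeholder
    "--name", "my-vm",                # placeholder
    "--size", "Standard_D2s_v3",      # placeholder target size
], check=True)

# 3. Once the VM is back up, start IIS again with: iisreset /start
```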

Azure Worker Role data volatility

I would like to create an application that holds a large amount of volatile data in memory. Only a small part of this data needs to be persisted when the host machine shuts down, or in case of maintenance. Outages should be rare; this in-memory data needs to be accessible most of the time, but rare restarts of the service are bearable.
If I were developing for a server, I would create a Windows Service, which runs reliably while the machine is up, and I would persist a fraction of the data in the OnStop() method.
I'm thinking of moving this whole thing to the cloud. The question is whether a Worker Role is similar to a Windows Service from this point of view. Does it run most of the time with rare outages, or is it recycled/restarted from time to time or when it is idle?
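For illustration, a rough Python analogue of that persist-on-stop pattern (the file name, data shapes, and signal handling are illustrative assumptions):

```python
# Rough analogue of the Windows Service / OnStop() pattern described above:
# hold a large volatile structure in memory and persist only a small slice
# of it when the process is asked to stop. Names and file are assumptions.
import json
import signal
import time

volatile_data = {}   # large, in-memory only, acceptable to lose
must_persist = {}    # the small fraction worth saving on shutdown

running = True

def on_stop(signum, frame):
    # Equivalent of OnStop(): flush the small persistent slice, then exit the loop.
    global running
    with open("state.json", "w") as f:
        json.dump(must_persist, f)
    running = False

signal.signal(signal.SIGTERM, on_stop)
signal.signal(signal.SIGINT, on_stop)

while running:
    # ... do the real work here, filling volatile_data / must_persist ...
    time.sleep(1)
```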
Like a Windows Service, a Worker Role is meant for processing background tasks. However, one thing you need to keep in mind is that your worker role can go down at any time, whether because of hardware failure or software updates. Thus you can't always assume it will be highly available. That's why Windows Azure recommends deploying multiple instances of your application.
What you could do is have multiple instances of your worker role running and all of them sharing a common cache where you would put the volatile data. Do take a look at Windows Azure Caching (http://msdn.microsoft.com/en-us/library/windowsazure/gg278356.aspx), where you can either dedicate some memory of a VM (i.e. an instance) for caching purposes or have a full VM dedicated to caching. That way you'll have your volatile data somewhere outside of your worker roles, making it available to all instances.

Azure Compute : OS Patching, Updates and Downtime

We know that Azure Compute is PaaS, so the operating system (Windows Server 2008 R2) has to be patched and upgraded automatically.
I just wanted to know: will there be any downtime during patching or Compute upgrades?
If you only have a single instance of a particular VM role, then yes - you'll have a short bit of downtime, as your instance needs to be rebooted. Likewise, if the host OS is patched, you'll have a bit of downtime.
If you run two or more instances, then the SLA kicks in, because your instances are separated into different containers/network branches/etc. These are fault domains. So even if a network segment, router, or entire rack were to go offline, you'd have another instance somewhere else.
During OS updates, your instances are divided into upgrade domains, so that they're not all upgraded at once. This leaves your service in an always-available state, as long as you have two or more instances of your roles. For background processes that aren't customer-facing (say, a worker role that simply reads from queues and processes queue items asynchronously), you can probably get away with a single instance of that role, provided you can handle the workload and it would be OK to have occasional processing delays.
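For that kind of single-instance queue reader, a minimal sketch using the azure-storage-queue Python SDK (connection string, queue name, and the processing function are placeholders); occasional delays are tolerable because messages simply wait in the queue while the instance is down:

```python
# Minimal sketch of a background worker that reads from a storage queue and
# processes items. Connection string, queue name, and process() are placeholders.
import time
from azure.storage.queue import QueueClient

def process(body):
    # Placeholder for the real work item handling.
    print("processing:", body)

queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>",   # placeholder
    queue_name="work-items",                  # placeholder
)

while True:
    messages = queue.receive_messages(messages_per_page=16, visibility_timeout=60)
    for msg in messages:
        process(msg.content)        # handle the work item
        queue.delete_message(msg)   # delete only after successful processing
    time.sleep(5)                   # back off briefly when the queue is empty
```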
See this recent TechNet blog post for more details around fault domains and upgrade domains.