Azure Compute: OS Patching, Updates and Downtime

We know that Azure Compute is PaaS, so the operating system (Windows Server 2008 R2) is patched and upgraded automatically.
I just wanted to know: will there be any downtime during patching or compute upgrades?

If you only have a single instance of a particular VM role, then yes - you'll have a short bit of downtime, as your instance needs to be rebooted. Likewise, if the host OS is patched, you'll have a bit of downtime.
If you run two or more instances, then the SLA kicks in, because your instances are separated into different containers/network branches/etc. These are fault domains. So even if a network segment, router, or entire rack were to go offline, you'd have another instance somewhere else.
During OS updates, your instances are divided into upgrade domains, so that they're not all upgraded at once. This leaves your service in an always-available state, as long as you have two or more instances of your roles. For background processes that aren't customer-facing (say, a worker role that simply reads from queues and processes queue items asynchronously), you can probably get away with a single instance of that role, provided one instance can handle the workload and occasional processing delays are acceptable.
See this recent TechNet blog post for more details around fault domains and upgrade domains.
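As a rough illustration of that kind of single-instance, queue-draining worker role, here's a hedged sketch assuming the classic Microsoft.WindowsAzure.Storage SDK; the queue name, the connection-string setting name, and ProcessItem are placeholders, not anything from the original answer:

    // Hedged sketch of a queue-draining worker role (classic Azure SDK assumed).
    // "work-items", "StorageConnectionString", and ProcessItem are placeholders.
    using System;
    using System.Threading;
    using Microsoft.WindowsAzure.ServiceRuntime;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;

    public class WorkerRole : RoleEntryPoint
    {
        public override void Run()
        {
            var account = CloudStorageAccount.Parse(
                RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
            var queue = account.CreateCloudQueueClient().GetQueueReference("work-items");
            queue.CreateIfNotExists();

            while (true)
            {
                var message = queue.GetMessage();
                if (message == null)
                {
                    Thread.Sleep(TimeSpan.FromSeconds(5)); // back off when the queue is empty
                    continue;
                }

                ProcessItem(message.AsString);             // your domain logic goes here
                queue.DeleteMessage(message);              // delete only after successful processing
            }
        }

        private void ProcessItem(string payload) { /* ... */ }
    }

If that single instance is rebooted during patching, messages simply wait in the queue until it comes back, which is why occasional processing delays have to be acceptable.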

Related

Availability set azure fault domain & update domain

Q. I have 2 servers, so I'll have 2 FDs (FD0, FD1) and 2 UDs (UD0, UD1). What if UD0 is down and, at the same time, FD1 goes down for some reason? What will happen?
If I correlate the actual question with the diagram in Ashok's answer, there are two scenarios here:
1) An update domain is down only while an update is in progress (planned or unplanned). So if FD1 goes down, no update will happen on UD0, because there are no other servers to take the load; UD0 will have to wait until FD1 comes back online before it can be updated.
2) While an update is in progress on UD1, UD0 will definitely be running and serving the load/handling the traffic. If FD0 goes down at that moment, your app will be down. To cover this scenario you would need 3 FDs.
Very simple: both of your servers would be out.
This isn't even specific to Azure: if you had 2 machines, hosted in two locations by 2 different providers, and the first was down for maintenance while the second crashed, you'd end up with everything down. So fault domains and update domains will not protect you from a full outage in such an event.
This is how FDs and UDs are useful in the case of two machines:
Having each machine in its own FD and its own UD lets you avoid a full outage both in the event of an unexpected outage in one FD and in the event of an update.
Having both machines in the same FD but in different UDs lets you avoid a full outage during update operations, but does not prevent a full outage in the event of an unexpected FD outage.
Having both machines in the same UD but in different FDs (yes, it's possible) lets you avoid a full outage in the event of an unexpected outage in one FD, but you'll have a full outage for each update operation.
Having both machines in the same FD and in the same UD does not protect you from anything: you'll have a full outage for both unexpected FD outages and update outages.
For all Virtual Machines that have two or more instances deployed in the same Availability Set, Microsoft guarantees you will have Virtual Machine Connectivity to at least one instance at least 99.95% of the time.
For any Single Instance Virtual Machine using premium storage for all Operating System Disks and Data Disks, Microsoft guarantees you will have Virtual Machine Connectivity of at least 99.9%.
Each virtual machine in your availability set is assigned an update domain and a fault domain by the underlying Azure platform. For a given availability set, five non-user-configurable update domains are assigned by default (Resource Manager deployments can then be increased to provide up to 20 update domains) to indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time. When more than five virtual machines are configured within a single availability set, the sixth virtual machine is placed into the same update domain as the first virtual machine, the seventh in the same update domain as the second virtual machine, and so on. The order of update domains being rebooted may not proceed sequentially during planned maintenance, but only one update domain is rebooted at a time. A rebooted update domain is given 30 minutes to recover before maintenance is initiated on a different update domain.
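As a rough illustration of that round-robin placement (illustration only, not an Azure API; the numbers simply mirror the description above):

    using System;

    class UpdateDomainPlacement
    {
        // Azure assigns VMs to update domains round-robin, so with the default
        // five update domains the 6th VM shares UD 0 with the 1st, the 7th
        // shares UD 1 with the 2nd, and so on.
        static void Main()
        {
            const int updateDomains = 5;
            for (int vm = 0; vm < 7; vm++)
                Console.WriteLine($"VM {vm + 1} -> update domain {vm % updateDomains}");
        }
    }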
Fault domains define the group of virtual machines that share a common power source and network switch. By default, the virtual machines configured within your availability set are separated across up to three fault domains for Resource Manager deployments (two fault domains for Classic). While placing your virtual machines into an availability set does not protect your application from operating system or application-specific failures, it does limit the impact of potential physical hardware failures, network outages, or power interruptions.
Here is an article that helps you understand Fault Domains and Update Domains.

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes, because apparently Microsoft had a 'blip' in their storage. They told me that it is because of a shared file system between the instances (making it a single point of failure?).
I didn't understand this and asked how a file share is involved, because I assumed a really dumb, stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below (diagram not included):
This is their reply to my question (I didn't include the drawing):
The file shares are not necessarily for your web app to communicate with other resources; they are on our end, where the app content resides. That is what we meant when we said that storage was unavailable on our file servers. The reason the restarts were triggered for your app on both instances is that the resources are shared: the underlying storage is the same for both instances. That's why, if it goes down for one, the other will eventually follow. If you really want to improve the availability of the app, you can always use a traffic manager. However, there is no guarantee that even with a traffic manager in place the app won't go down, but it does improve the overall availability of your app. Also, we have recently rolled out an update to production that should ideally take care of restarts caused by storage blips, but for this feature to kick in you need to make sure there is an ample amount of memory available. We have a couple of options that you can set up in order to avoid any unexpected restarts of the app because of a storage blip on our end:
You can evaluate whether you want to move to a bigger instance, so that there might be enough memory for the overlap recycling feature to kick in.
If you don't want to move to a bigger instance, you can always use the local cache feature as outlined in our earlier email.
Because of the time differences, the communication takes ages. Can anyone tell me what is wrong with my thinking?
The only thing I can think of is that when you've enabled two instances, they run on the same physical server. But that makes very little sense to me.
I have two instances, each with one core and 1.75 GB of memory.
My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description), largely based on the Web Apps sales spiel, which states:
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web Apps appears to be that the VMs themselves are separated into availability sets. However, all of the instances use the same file server to share the underlying disk space, and that file server is a significant single point of failure.
To mitigate this, Azure has created the WEBSITE_LOCAL_CACHE_OPTION setting, which caches the contents of the file server onto the individual Web App instances - using caching in lieu of solid, high-availability engineering principles.
The problem here is that, as customers, we have no visibility into this issue: we have no idea whether there is a plan to fix it, or if or when it will ever be fixed, since it seems unlikely that Azure will publish a document admitting how badly this has been engineered, even if only to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email with a link to the Web Apps page (and this question), a copy of the quote, and a question about how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines, Azure lets you specify an availability set. An availability set automatically splits VMs into separate update and fault domains, meaning that servers end up in different server racks, and those racks won't get updates at the same time. (It is a little more complex than that, but those are the basics!)
Azure Web Apps does use shared file storage. The best way to think about it is that all the instances of your app map to the same network share that has your files. So if you modify the files by any means (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.
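To see both halves of that picture in practice - shared content but separate VMs - here's a hedged snippet; it assumes the WEBSITE_INSTANCE_ID and HOME environment variables that App Service sets on each instance, and the marker file name is made up for illustration:

    using System;
    using System.IO;

    class SharedStorageDemo
    {
        static void Main()
        {
            // WEBSITE_INSTANCE_ID differs per instance; HOME points at the shared
            // content share, so a file written there is visible to every instance.
            string instanceId = Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID") ?? "local";
            string home = Environment.GetEnvironmentVariable("HOME") ?? ".";
            string marker = Path.Combine(home, "site", "last-writer.txt");

            File.WriteAllText(marker, instanceId);   // lands on the shared file server
            Console.WriteLine($"Instance {instanceId} wrote {marker}");
        }
    }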

Azure changing hardware

I have a product which uses the CPU ID, network MAC address, and disk volume serial numbers for validation. Basically, when my product is first installed, these values are recorded, and then when the app loads, the current values are compared against the stored ones.
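For reference, identifiers like these are typically read via WMI on Windows; here's a rough sketch of the kind of query involved (not my actual validation code - it needs a reference to System.Management):

    using System;
    using System.Management;

    class HardwareFingerprint
    {
        // Return the first non-empty value of a WMI property, or "" if none found.
        static string FirstValue(string wmiClass, string property)
        {
            using (var searcher = new ManagementObjectSearcher($"SELECT {property} FROM {wmiClass}"))
            {
                foreach (ManagementObject mo in searcher.Get())
                {
                    var value = mo[property]?.ToString();
                    if (!string.IsNullOrEmpty(value)) return value;
                }
            }
            return "";
        }

        static void Main()
        {
            Console.WriteLine("CPU ID:        " + FirstValue("Win32_Processor", "ProcessorId"));
            Console.WriteLine("MAC address:   " + FirstValue("Win32_NetworkAdapterConfiguration", "MACAddress"));
            Console.WriteLine("Volume serial: " + FirstValue("Win32_LogicalDisk", "VolumeSerialNumber"));
        }
    }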
Something very mysterious happened recently. Inside of an Azure VM that had not been restarted in weeks, my app failed to load because some of these values were different. Unfortunately the person who caught the error deleted the VM before it was brought to my attention.
My question is, when an Azure VM is running, what hardware resources may change? Is that even possible?
Thanks!
Answering this requires a short rundown of how Azure works.
In each data centre there are thousands of individual machines. Each machine runs a hypervisor which allows a number of operating systems to share the same underlying hardware.
When you start a role, Azure looks for available resources - disk space, CPU, RAM, etc. - and boots a copy of the appropriate OS VM on those available resources. I understand from your question that this is a VM role, so this VM is the one you uploaded or created.
As long as your VM is running, the underlying virtual resources provided by the hypervisor are not likely to change. (The caveat is that Windows Server 2012's hypervisor can move virtual machines around over the network even while they are running; whether Azure takes advantage of this, I don't know.)
Now, Azure keeps charging you even when your role has stopped, because it considers your role "deployed". So, in theory, those underlying resources still "belong" to your role.
This is not guaranteed. Azure could decide to boot up your VM on a different set of virtualized hardware for any number of reasons - hardware failure being at the top of the list, with insufficient capacity being second.
It is even possible (though unlikely) for your resources to be provided by different hardware nodes.
An additional consideration is that, under Azure policy, disaster recovery (or another major event) may include transferring your roles to run in an entirely separate data centre.
My point is that the underlying hardware is virtual and treating it otherwise is most unwise. Roles are at the mercy of the Azure Management Routines, and we can't predict in advance what decisions they may make.
So the answer to your question is that ALL of the underlying resources may change. And it is very, very possible.

Azure Worker Role data volatility

I would like to create an application that holds a large amount of volatile data in memory. Only a small part of this data needs to be persisted when the host machine shuts down or in case of maintenance. Outages should be rare; this in-memory data needs to be accessible most of the time, but rare restarts of the service are bearable.
If I were developing for a server, I would create a Windows Service, which runs reliably while the machine is up, and I would persist a fraction of the data in the OnStop() method.
I'm thinking of moving this whole thing to the cloud. The question is whether a Worker Role is similar to a Windows Service from this point of view. Does it run most of the time with rare outages, or is it recycled/restarted from time to time or when it is idle?
Like a Windows Service, a worker role is meant for processing background tasks. However, one thing you need to keep in mind is that your worker role can go down at any time, whether because of hardware failure or software updates, so you can't assume a single instance is highly available. That's why Windows Azure recommends deploying multiple instances of your application.
What you could do is have multiple instances of your worker role running, all of them sharing a common cache where you put the volatile data. Take a look at Windows Azure Caching (http://msdn.microsoft.com/en-us/library/windowsazure/gg278356.aspx), where you can either dedicate some memory of a VM (i.e. an instance) to caching or dedicate a full VM to caching. That way your volatile data lives somewhere outside of your worker roles and is available to all instances.
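As a rough sketch of that shared-cache idea, assuming the (since-retired) Windows Azure Caching client API and treating the cache and key names as placeholders:

    using Microsoft.ApplicationServer.Caching;

    public class VolatileState
    {
        // Every role instance talks to the same named cache, so the volatile
        // data survives any single instance being recycled.
        private static readonly DataCache Cache = new DataCacheFactory().GetDefaultCache();

        public static void Save(string key, object value)
        {
            Cache.Put(key, value);
        }

        public static object Load(string key)
        {
            return Cache.Get(key);
        }
    }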

Long running (or forever) task on Windows Azure

I need to write some data to a database every 50 seconds or so. It's similar to a Windows service that runs in the background and silently does its job. Starting and stopping is not an option in my case, as I need a small amount of previously inserted data to be kept in memory. What's the best solution for this when using Windows Azure or AWS?
Thank you.
With Windows Azure, you can choose either a Web or Worker role (both basically Windows Server 2008 R2 or SP2) and have some type of timed event, as Lucifure suggested. You could also run a scheduler, like Quartz.NET, or take advantage of Windows Azure queues or Service Bus queues to have messages show up at a certain time. However: you cannot have a "forever" task in a given role instance, in that periodically your VM instances will be rebooted (e.g. for host OS maintenance every month). With role shutdowns you'll get notice, which you can handle in Stopping() or OnStop(). If you have multiple instances, you can use a scheduler or queue to ensure your events still trigger every 50 seconds or so and get handled across multiple instances (but only by one instance at any given time).
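For the shutdown-notice part, here's a minimal sketch assuming the classic cloud-service RoleEntryPoint API; PersistCriticalState is a placeholder for whatever durable write you need:

    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WorkerRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            // Stopping fires when the instance is about to be taken down
            // (e.g. for host OS maintenance), before OnStop is called.
            RoleEnvironment.Stopping += (sender, args) => PersistCriticalState();
            return base.OnStart();
        }

        public override void OnStop()
        {
            PersistCriticalState();   // last chance before the shutdown window closes
            base.OnStop();
        }

        private void PersistCriticalState() { /* write the must-keep data to durable storage */ }
    }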
To preserve your in-memory information, one idea is to store that information in a cache. You have 2 choices:
Distributed (shared) cache service, which has been around for some time now. It runs independently of your role instances.
In-memory cache, just introduced in June 2012. Assuming you have more than one instance, the cache is spread across those instances. You can even run the cache inside the memory of your existing roles.
More information on caching is here.
There are a few StackOverflow answers regarding Quartz.net and Windows Azure, such as this one.
On Windows Azure, you can use a Worker Role to do this. It can be as simple as a while loop.
Try this article for an introduction.
http://www.c-sharpcorner.com/uploadfile/40e97e/windows-azu-creating-and-deploying-worker-role/
You could set up a System.Threading.Timer to fire every 50 seconds or so and do your work whenever the callback fires.
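A minimal sketch of that timer approach (WriteToDatabase is a placeholder for your persistence call):

    using System;
    using System.Threading;

    public class PeriodicWriter
    {
        private Timer _timer;

        public void Start()
        {
            // Fire immediately, then every 50 seconds; keep a reference so the
            // timer isn't garbage-collected.
            _timer = new Timer(_ => WriteToDatabase(), null,
                               TimeSpan.Zero, TimeSpan.FromSeconds(50));
        }

        private void WriteToDatabase() { /* insert the latest in-memory values */ }
    }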
