Azure Functions: High Availability without double maintenance - azure

The Consumption Plan for Azure Functions is quite ideal, with its pleasant pricing and automatic scaling. However, I haven't found much information about High Availability with such a plan.
Let's consider a scenario. Imagine that, based on the load, there is currently one instance of the function app running. Then there is a problem in that data center. The consumption plan only scales out based on load. I can find no guarantees that a new instance will be added in this scenario, let alone that downtime will be prevented.
I'm aware that we could use Azure Front Door, with two separate function apps behind it. However, it appears that we must manage those function apps separately. That is a hassle. Swapping slots twice, remembering to change app settings in two places... That's no good.
What I'd like to achieve is something like Azure SQL in its Premium or Business-Critical tier, preferably with zone-redundant configuration. The diagram here shows how that works:
In simple terms, there is a primary replica with automatic failover to a secondary replica in the same data center, and also automatic failover to secondary replicas in two different data centers (availability zones) within the same region.
Notice how there is no manual management of the secondary replicas, since they are simply replicated from the primary.
How much of this can we achieve with Azure Functions, and how?
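On the double-maintenance concern specifically: if two function apps behind Front Door turn out to be necessary, the "change app settings in two places" step can at least be scripted. Below is a minimal sketch that builds the same `az functionapp config appsettings set` invocation for every app from one settings dictionary. The app names, resource groups, and setting keys are hypothetical, and the script only prints the commands instead of running them. It does not address the failover question itself.

```python
# Sketch: define app settings once and generate an identical Azure CLI
# invocation for each function app. All names below are hypothetical.

APPS = [
    {"name": "myfunc-westeurope", "resource_group": "rg-westeurope"},
    {"name": "myfunc-northeurope", "resource_group": "rg-northeurope"},
]

SETTINGS = {"FEATURE_X_ENABLED": "true", "QUEUE_NAME": "orders"}


def build_commands(apps, settings):
    """Return one `az functionapp config appsettings set` argv list per app."""
    # Sort keys so every app receives the settings in the same order.
    pairs = [f"{key}={value}" for key, value in sorted(settings.items())]
    return [
        ["az", "functionapp", "config", "appsettings", "set",
         "--name", app["name"],
         "--resource-group", app["resource_group"],
         "--settings", *pairs]
        for app in apps
    ]


if __name__ == "__main__":
    for argv in build_commands(APPS, SETTINGS):
        # Printed only; to apply, pass argv to subprocess.run(argv, check=True).
        print(" ".join(argv))
```

Keeping the settings in one place means a forgotten second update becomes a script bug rather than a manual slip, but the two apps are still separate deployments.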


Scaling criteria for Azure functions premium/consumption plan

I am working on an HttpTrigger-based Azure Function and trying to understand its scaling behavior and cold start issues.
While looking into scaling, I found that the Azure Functions documentation states that
"instances of the Azure Functions host are dynamically added and removed based on the number of incoming events"
This confuses me: how can the number of events alone determine when function instances scale out, when different functions can need very different amounts of compute power or memory to execute?
And where exactly can I find this "number of events" that supposedly triggers a new instance to be added?
You won't find a specific "number of events"; scaling is based on a wide variety of factors that Microsoft measures to determine the load on the currently running instances. Functions that are grouped together in a single project and deployed as a single Function App on Azure scale together. If you need functions that consume different levels of resources to scale independently, then be sure to deploy them as separate Function Apps (in the C#/VS world, that means different Projects).
If you have cold start issues, then the Premium plan can come into play. You pay for at least one instance to always be on and "pre-warmed" so that you never have cold starts. The plan will then scale from there based on the previously mentioned "events" that Azure measures to determine if scaling out is needed. MS has said that scaling out tends to be faster on the Premium plan. You also get a longer default function timeout on Premium if that is necessary (30 minutes vs. 5 minutes on Consumption).

Multi regional Azure Container Service DC/OS clusters

I'm experimenting a little with ACS using the DC/OS orchestrator, and while spinning up a cluster within a single region seems simple enough, I'm not quite sure what the best practice would be for doing deployments across multiple regions.
Azure itself does not seem to support deploying to more than one region right now. With that assumption, I guess my only other option is to create multiple, identical clusters in all the regions I wish to be available, and then use Azure Traffic Manager to route incoming traffic to the nearest available cluster.
While this solution works, it also causes a few issues I'm not 100% sure on how I should work around.
Our deployment pipelines must make sure to deploy to all regions when deploying a new version of a service. If we have an East US and a North Europe region, then during deployments from our CI tool I have to connect to the Marathon API in both regions to trigger the new deployments. If the deployment fails in one region and succeeds in the other, I suddenly have a disparity between the two regions.
If I have a service deployed that uses local persistent volumes, let's say PostgreSQL or ElasticSearch, it needs to have instances in both regions, since service discovery will only find services local to the region. That brings up the problem of replication between regions to keep all state in all regions; this seems to require some (or a lot of) manual configuration to get working.
Has anyone ever used a setup somewhat like this using Azure Container Service (or really Amazon Container Service, as I assume the same challenges can be found there) and have some pointers on how to approach this?
You have multiple options for spinning up across regions. I would use a custom installation together with terraform for each of them. This here is a great starting point:
Distributing agents across different regions should be no problem, ensuring that your services will keep running despite failures.
Distributing masters (giving you control over the services during failures) is a little more difficult, as it involves distributing a ZooKeeper quorum across high-latency links, so you should be careful in choosing the "distance" between regions.
Have a look at the documentation for more details.
You are correct: ACS does not currently support multi-region deployments.
Your first issue is specific to Marathon in DC/OS. I'll ping some of the engineering folks over there to see if they have any input on best practice.
Your second point is something we (I'm the ACS PM) are looking at. There are some solutions you can use in certain scenarios (e.g. ArangoDB is in the DC/OS universe and will provide replication). The DC/OS team may have something to say here too. In ACS we are evaluating the best approaches to providing solutions for this use case but I'm afraid I can't give any indication of timeline.
An alternative solution is to have your database in a SaaS offering. This takes away all the complexity of managing redundancy and replication.
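For the first issue in the question (a deployment that succeeds in one region and fails in another), one possible approach is to have the pipeline treat the regions as a unit and restore the regions it already updated when a later one fails. The sketch below only illustrates that control flow. The `deploy(region, version)` callable is hypothetical and stands in for whatever call your CI tool makes to each region's Marathon API; it is assumed to raise an exception on failure.

```python
from typing import Callable, Dict, List


def deploy_everywhere(
    regions: List[str],
    new_version: str,
    current_versions: Dict[str, str],
    deploy: Callable[[str, str], None],
) -> bool:
    """Deploy new_version to every region, or leave all regions unchanged.

    `deploy` is a hypothetical wrapper around one region's deployment API.
    `current_versions` maps each region to the version it runs right now.
    """
    updated: List[str] = []
    for region in regions:
        try:
            deploy(region, new_version)
            updated.append(region)
        except Exception:
            # Restore already-updated regions (most recent first) so that
            # every region ends up on its previous version again.
            for done in reversed(updated):
                deploy(done, current_versions[done])
            return False
    # Every region succeeded: record the new version everywhere.
    for region in regions:
        current_versions[region] = new_version
    return True


if __name__ == "__main__":
    running = {"eastus": "1.0", "northeurope": "1.0"}
    calls = []

    def fake_deploy(region: str, version: str) -> None:
        # Simulate a failure of the new version in one region only.
        if region == "northeurope" and version == "1.1":
            raise RuntimeError("deployment failed")
        calls.append((region, version))

    print(deploy_everywhere(["eastus", "northeurope"], "1.1", running, fake_deploy))
    print(calls)
```

This sketch does not handle a rollback that itself fails, and it says nothing about replicated state; it only keeps the deployed versions consistent across regions.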

Change cloud service region

Is it possible to change a Cloud Service region (i.e: move from East US to West US)?
I don't see an option in the management console to do it, or maybe I did not dig deep enough.
I would like to do this because my database is in a different region from my application, and I guess that could decrease performance.
No, there is no way to change a Cloud Service's region. You have to create a new Cloud Service in the desired region and redeploy there. It becomes more complicated when you also have Storage accounts with data that you have to move. For this you could probably use Red Gate's Cloud Services or another mature product.
And you are right about the database and performance. It is not only about performance, but also about cost savings. When your database is in a different geographic region, all data that comes out of your database is basically outbound (egress) traffic, which is charged per GB!
You can write your own script in PowerShell, which is a powerful tool and can help you a lot, including copying the data between regions directly (without passing through your computer). I am going to do that now.

Azure Traffic Manager Load Balance Options

I tried digging through MSDN but could not find a concrete statement about which load balancing method is best.
Could someone please shed some light on which of the options is best for the given scenario:
Round Robin.
x Web Roles hosted on Large VMs in a single data center.
Must be 100% up 24x7.
Thank you.
First: Do you really want to offer a 100% uptime SLA for your customers, when Azure itself doesn't offer 100% in its SLAs?
That said: Traffic Manager only load-balances your compute, not your storage. So if you're trying to increase uptime by having a set of backup compute nodes running in another data center, you need to think about data access speed and cost:
With round robin, you'll now have traffic distributed across multiple data centers, guaranteed, and constantly. And if your data is in a single data center (keeping data in a single System of Record is a good idea, unless you have replication logic all taken care of), some of your users are going to see increased latency, as the nodes separated from your data will be requesting it across many miles (potentially between continents). Plus, data egress has a $$$ cost to it.
With performance, your users are directed toward the data center which offers them the lowest latency. Again, this now means traffic across multiple data centers, with the same issues as round robin.
With failover, you now have all traffic going to one data center, with another designated as your failover data center (so it's for High Availability). In the event you have an outage in the primary data center, you'd now have a failover data center to rely on. This may help justify the added latency and cost, as you'd only experience this latency+cost when your primary app location becomes unavailable for some reason.
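To make the difference between the three methods concrete, here is a small simulation of how each one picks an endpoint. It is an illustrative sketch only, not Traffic Manager's actual implementation: the `Endpoint` class, the endpoint names, the priorities, and the latency figures are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Endpoint:
    name: str
    priority: int        # lower number = preferred; used by failover only
    healthy: bool = True


def _healthy(endpoints: List[Endpoint]) -> List[Endpoint]:
    alive = [ep for ep in endpoints if ep.healthy]
    if not alive:
        raise RuntimeError("no healthy endpoints")
    return alive


def pick_round_robin(endpoints: List[Endpoint], request_number: int) -> Endpoint:
    """Rotate through healthy endpoints, regardless of where the user is."""
    alive = _healthy(endpoints)
    return alive[request_number % len(alive)]


def pick_performance(endpoints: List[Endpoint], latency_ms: Dict[str, float]) -> Endpoint:
    """Choose the healthy endpoint with the lowest latency for this user."""
    return min(_healthy(endpoints), key=lambda ep: latency_ms[ep.name])


def pick_failover(endpoints: List[Endpoint]) -> Endpoint:
    """Always choose the highest-priority endpoint that is still healthy."""
    return min(_healthy(endpoints), key=lambda ep: ep.priority)


if __name__ == "__main__":
    endpoints = [Endpoint("us-east", priority=1), Endpoint("eu-north", priority=2)]
    european_user = {"us-east": 120.0, "eu-north": 25.0}  # hypothetical latencies

    print([pick_round_robin(endpoints, i).name for i in range(4)])
    print(pick_performance(endpoints, european_user).name)
    print(pick_failover(endpoints).name)

    endpoints[0].healthy = False  # simulate an outage in the primary
    print(pick_failover(endpoints).name)
```

Only the failover policy keeps all traffic next to the data while the primary is healthy; the other two send some share of requests to the remote data center all the time.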
So: If you're going for the high availability route, to help approach the 100% availability mark, I'm guessing you'd be best off with the failover model.
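As a rough illustration of why a failover data center helps approach (but never reach) 100%, the arithmetic can be sketched as follows. The 99.95% figure is only an example input, not a quoted SLA, and the calculation assumes the two data centers fail independently and that failover is instant, which is optimistic.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments suffices."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    single = 0.9995  # illustrative value only
    pair = combined_availability(single, 2)
    print(f"single: {downtime_hours_per_year(single):.2f} hours/year")
    print(f"pair:   {downtime_hours_per_year(pair) * 3600:.1f} seconds/year")
```

With the example input, a single deployment allows roughly 4.4 hours of downtime per year, while two independent deployments bring the theoretical figure down to seconds. In practice, failover detection time and correlated failures make the real number worse.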
Traffic Manager comes into the picture only when your application is deployed across multiple cloud services, whether within the same data center or in different data centers. If your application is hosted in a single cloud service (with multiple instances, of course), then the instances are load balanced using a round-robin pattern. This is the default load balancing pattern and comes to you without any extra charge.
You can read more about traffic manager here:
In my opinion, there is no single best load balancing method in Azure Traffic Manager. Each has its own advantages, and the right choice depends on the requirements of the application. The most common scenario is to use the performance load balancing option. But as Gaurav said, your application will have to be hosted on more than one cloud service. If you wish to implement performance load balancing then here is the link to get you started -

How do I make my Windows Azure application resistant to Azure datacenter catastrophic event?

AFAIK Amazon AWS offers so-called "regions" and "availability zones" to mitigate risks of partial or complete datacenter outage. Looks like if I have copies of my application in two "regions" and one "region" goes down my application still can continue working as if nothing happened.
Is there something like that with Windows Azure? How do I address risk of datacenter catastrophic outage with Windows Azure?
Within a single data center, your Windows Azure application has the following benefits:
Going beyond one compute instance, your VMs are divided into fault domains, across different physical areas. This way, even if an entire server rack went down, you'd still have compute running somewhere else.
With Windows Azure Storage and SQL Azure, storage is triple replicated. This is not eventual replication - when a write call returns, at least one replica has been written to.
Ok, that's the easy stuff. What if a data center disappears? Here are the features that will help you build DR into your application:
For SQL Azure, you can set up Data Sync. This facility synchronizes your SQL Azure database with either another SQL Azure database (presumably in another data center), or an on-premises SQL Server database. More info here. Since this feature is still considered a Preview feature, you have to go here to set it up.
For Azure storage (tables, blobs), you'll need to handle replication to a second data center, as there is no built-in facility today. This can be done with, say, a background task that pulls data every hour and copies it to a storage account somewhere else. EDIT: Per Ryan's answer, there's data geo-replication for blobs and tables. HOWEVER: Aside from a mention in this blog post in December, and possibly at PDC, this is not live.
For Compute availability, you can set up Traffic Manager to load-balance across data centers. This feature is currently in CTP - visit the Beta area of the Windows Azure portal to sign up.
Remember that, with DR, whether in the cloud or on-premises, there are additional costs (such as bandwidth between data centers, storage costs for duplicate data in a secondary data center, and Compute instances in additional data centers).
Just like with on-premises environments, DR needs to be carefully thought out and implemented.
David's answer is pretty good, but one piece is incorrect. For Windows Azure blobs and tables, your data is actually geographically replicated today between sub-regions (e.g. North and South US). This is an async process that has a target of about a 10 min lag or so. This process is also out of your control and is purely for a data center loss. In total, your data is replicated 6 times in 2 different data centers when you use Windows Azure blobs and tables (impressive, no?).
If a data center was lost, they would flip over your DNS for blob and table storage to the other sub-region and your account would appear online again. This is true only for blobs and tables (not queues, not SQL Azure, etc).
So, for a true disaster recovery, you could use Data Sync for SQL Azure and Traffic Manager for compute (assuming you run a hot standby in another sub-region). If a datacenter was lost, Traffic Manager would route to the new sub-region and you would find your data there as well.
The one failure that you didn't account for is an error being replicated across data centers. In that scenario, you may want to consider running Azure PaaS as part of the HP Cloud offering in either a load-balanced or failover scenario.
