automatic failover if webserver is down (SRV / additional A-record / ?) - dns

I am starting to develop a webservice that will be hosted in the cloud but needs higher availability than typical cloud SLAs provide.
Typical SLAs, e.g. Windows Azure, promise an availability of 99.9%, i.e. up to 43min downtime per month. I am looking for an order of magnitude better availability (<5min down time per month). While I can configure several load balanced database back-ends to resolve that part of the issue I see a bottleneck at the webserver. If the webserver fails, the whole service is unavailable to the customer. What are the options of reducing that risk without introducing another possible single point of failure? I see the following solutions and drawbacks to each:
SRV-record:
I duplicate the whole infrastructure (and take care that the databases are in sync) and add additional SRV records for the domain so that the user tying to access www.example.com will automatically get forwarded to example.cloud1.com or if that one is offline to example.cloud2.com. Googling around it seems that SRV records are not supported by any major browser, is that true?
second A-record:
Add an additional A-record as alternatives. Drawbacks:
a) at my hosting provider I do not see any possibility to add a second A-record but just one... is that normal?
b)if one server of two servers are down I am not sure if the user gets automatically re-directed to the other one or 50% of all users get a 404 or some other error
Any clues for a best-practice would be appreciated
Cheers,
Sebastian

The availability of the instance i.e. SLA when specified by the Cloud Provider means the "Instance's Health is server running in the context of Hypervisor or Fabric Controller". With that said, you need to take an effort and ensure the instance is not failing because of your app / OS / or pretty much anything running inside the instance. There are few things which devops tend to miss and that kind of hit back hard like for instance - forgetting to configure the OS Updates and Patches.
The fundamental axiom with the availability is the redundancy. More redundant your application / infrastructure is more availabile is your app.
I recommend your to look into the Azure Traffic Manager and then re-work on your architecture. You need not worry about the SRV record or A-Record. Just a CNAME for the traffic manager would do the trick.
The idea of traffic manager is simple, you can tell the traffic
manager to stand after the domain name ( domain name resolution of the
app ) then the traffic manager decides where to send the request on
considerations of factors like Round-Robin, Disaster Management etc.
With the combination of the Traffic Manager and multi-region infrastructure setup; you will march towards the high availability goal.
Links
Azure Traffic Manager Overview
Cloud Power: How to scale Azure Websites globally with Traffic Manager

Maybe You should configure a corosync cluster with DRBD ?
DRBD will ensure You that the data on both nodes are replicated (for example website files and db files).
Apache as web server will be available under a virtual IP to which domain is pointed. In case of one server is down corosync will move all services to second server within few seconds.

Related

Slow communications between Azure regions

So just a quick summary of what we are doing to put everything into context. We have a socket server running as an Azure Cloud Service (worker role) within the South Central US region. All of our other components (Queue, DBs, web app, API etc) are located in East US. The reasons being is sadly due to not being able to modify the static IP address that was created for the South Central US a few years ago. The devices in the field cannot alter their IP as well :/ So we are stuck communicating cross region.
So what Im asking, is there a way to improve latency? Can we "port forward" ? What other options do we have? Im assuming the latency is our biggest enemy here as we pipe data back and forth.
Looking at load balancing at moment - https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-overview
Thoughts?
Load Balancer is a regional service and cannot direct traffic across regions.
There are a couple of options:
1) build your own VM's with a TCP proxy to achieve your scenario. You could use Load Balancer to scale and protect your TCP proxy instance if you want to pursue that path.
2) explore using Application Gateway for this scenario since it is a proxy and can direct to IP address destinations. This is essentially a managed service for option 1, although limited to HTTP & HTTPS.
3) migrate to a DNS approach for locating your service and orchestrating a migration across regions over time.
Either way, traffic would remain on Microsoft's own backbone between regions.
Best regards,
Christian

Is Azure Traffic manager is reliable for failover? what are other problems I should be worried about?

I am planning to use Azure Traffic manager to do a failover of my app running on one Azure zone to Azure zone.
I need some suggestion, if that is the correct approach to do a failover ? We have seen issue with Azure that, most of the services in one region goes down for few hours. Although I understand that Azure traffic manager is not associated with the region. But is it possible that Azure traffic manager goes down or that traffic manager endpoint is not reachable although my backend webapp is reachable?
If I am planning to use Azure traffic manager, what are other problems I should be worried about ?
I've been working with TM for some time now, so here are a few issues I haven't seen mentioned before:
Keep-Alive
If your service allows Keep-Alive, then your DNS entry will be ignored as long as the connection remains open. I've seen some exceptionally odd behavior result from this, including users being stuck on a fallback page since they kept using the connection, causing it to remain open indefinitely. If you have access to IIS Manager, you can force Keep-Alive to be false.
Browser DNS Caching
Most browsers have their own DNS cache, and very few honor DNS Time To Live. In my experience Chrome is pretty responsive, with IE and Edge having significant delays if you need them to rollover quickly. I've heard that Opera is particularly bad.
Other DNS Caching
Even if you're not accessing your service through a browser, other components can have DNS caches, and some of them will allow you to manage the cache yourself. This can in theory even depend on ISP's DNS caching, though reports on the magnitude of this vary significantly.
Traffic Manager works at the DNS level, which itself is replicated. However, even then, you should still build in redundancy into your solution.
Take a look at the Azure Architecture Center under "Make all things redundant" and you will see a recommendation for Traffic Manager:
consider adding another traffic management solution as a failback. If
the Azure Traffic Manager service fails, change your CNAME records in
DNS to point to the other traffic management service.
The Traffic Manager internal architecture is resilient to the failure of any single Azure region. So, even if a region fails, Traffic Manager should stay up. That applies to all Traffic Manager components: control plane, endpoint monitoring, and DNS name servers.
Since Traffic Manager works at the DNS level, it doesn't have an 'endpoint' that proxies your traffic--it uses DNS to direct clients to the appropriate endpoint, and clients then connect to those endpoints directly. Thus, an unreachable endpoint is an application problem, not a Traffic Manager problem.
That said, if the Traffic Manager DNS name servers are down, you have a serious problem. You DNS resolution path will fail and your customers will be impacted. The only solution is to either accept the risk (small, but can never be zero) or have a plan in place to use another DNS system, either in parallel or failover. This is not a limitation of Traffic Manager; you could say the same about any DNS-based traffic management system.
The earlier answer from DornaDigital is very good (other than the first point which suggests DNS caching will protect you through a name server outage--it won't). It covers some important points. In short, DNS-based failover works well for new sessions. Existing clients may have to refresh or even close their browser and reconnect.
I also agree with the details provided dornadigital.
There are considerations for front end applications as well. The browsers all have different thresholds for how long they maintain persistent connections. Chromium, for example, currently maintains a connection unless there is inactivity for 300 seconds.
In our web applications, we are detecting the failover by the presence of a certain number of failed requests to the endpoint. After requests begin failing, we pause requests for 301 seconds to allow the connection to reset. This allows the DNS change from the traffic manager to be applied to subsequent requests. We pop up a snackbar to indicate to the user that we are having an issue and display the count down when requests will resume. Similar to Gmail when it has an issue connecting to their servers.
I hope that gives you one idea on how to build some redundancy into your web apps.
I disagree with Jonathan as his understanding of the resiliency of the Traffic Manager service is in disagreement with Microsoft's own documentation on the subject.
When you provision Azure Traffic Manager, you select a region in which to deploy the service. I (correctly) inferred this to assert if said region were to fail, the Traffic Manager service could also be impacted and in turn, your application solution would not properly fail over to the secondary region.
According to Microsoft's Azure Application Architecture Guide, under "Make all things redundant", a customer should deploy Traffic Manager into more than one region:
Include redundancy for Traffic Manager. Traffic Manager is a possible failure point. Review the Traffic Manager SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a failback. If the Azure Traffic Manager service fails, change your CNAME records in DNS to point to the other traffic management service.
Azure Application Architecture Guide - Make all things redundant
My thought and intention is to not deploy Traffic Manager within the primary service region, but instead to deploy it into the secondary (failover region) and a tertiary (3rd) region as a backup.

Azure Website doesnt detect Traffic Manager Change

I have an Azure website (website.mycompany.com) that uses a WCF service for some data. The WCF Service sits behind an Azure Traffic Manager (service.mycompany.com) running in "priority mode", with 2 instances of the service for failover handling. With priority mode, the primary always serves up the data first, unless it's unavailable. If unavailable, the 2nd instance will reply.. and so on down the line.
We've had a few instances recently where the primary endpoint for service.mycompany.com was offline. For "partnerships" who point to service.mycompany.com, they detected the switch and all was fine. Lately however, our own site (website.mycompany.com) does NOT detect the traffic manager switch, and the website has errors since the service fails to reply.
Our failover endpoint in these instances is up, and in the past the Azure website detected the switch, it's only recently we've encountered this issue. Has anyone experienced similar issues? Are there perhaps any DNS changes that we need to tweak in our Azure Website to help it detect TTL's?
Has anyone experienced similar issues?
Do you mean the traffic manager can't switch to another endpoint immediately?
Traffic manager works at the DNS level, here are the reasons why traffic manager can't switch immediately:
The duration of the cache is determined by the 'time-to-live' (TTL) property of each DNS record. Shorter values result in faster cache expiry and thus more round-trips to the Traffic Manager name servers. Longer values mean that it can take longer to direct traffic away from a failed endpoint.
The traffic manager endpoint monitor effects the response time. More information about how azure traffic manager works, please refer to the link.
The following timeline is a detailed description of the monitoring process.
Also we can check traffic manager profile using nslookup and ipconfig in windows. About how to vertify traffic Manager settings, please refer to the link.
By the way, because traffic manager works at the DNS level, it cannot influence existing connections to any endpoint. When it directs traffic between endpoints (either by changed profile settings, or during failover or failback), Traffic Manager directs new connections to available endpoints. However, other endpoints might continue to receive traffic via existing connections until those sessions are terminated. To enable traffic to drain from existing connections, applications should limit the session duration used with each endpoint.
I'm going to refer you to my answer here because while the situation isn't exactly the same, it seems like it could have the same solution. To summarize, I find it likely that you have a connection left open to the down service that isn't being properly closed. This connection is independent of TTL, which only deals with DNS caching, and as such bypasses Traffic Manager completely.

Azure Traffic Manager

Fast question is it possible to have Azure Traffic Manager
I would like to rent dedicated servers in 3rd party suppler and to load balancer from Azure
Question 1:
Can I setup this scenario? and use the load balancer from Azure?
Question 2:
Will I pay Outgoing bandwidth
Question 3:
Will you share for website with 10 000 000 page views per month how much you pay for DNS look ups as average.
Question 4 please suggest same service competitors... Google, Amazon, Rackspace I already know
The link you provided to the article already answers #1 and #3. Yes you can set this up. Billing is done by DNS lookup at $0.75 per million lookup, so your 10m page views will cost at most $7.50, but this isn't taking into consideration DNS caching which will drastically lower this (already very low) cost.
Question 2 is not an Azure Traffic Manager related question. No bandwidth goes through ATM so there is no charge. I am sure you will pay bandwidth charges with whatever 3rd party datacenter provider you are going to use.
I don't understand question 4. What do you want suggestions for? A cloud provider? There are lots of good ones but it depends on your scenario.
Azure Traffic Manager is a DNS routing system. It is similar to the routing features of AWS Route 53 (although Route 53 is a more full-featured DNS system).
Azure Traffic Manager uses DNS to point incoming traffic to different endpoints, which can be either within Azure or external urls. Because it uses DNS, it doesn't actually see any of the data itself, it just translates something like myapp.trafficmanager.net to 'webserver1.example.comorwebserver2.example.com` based on your rules and setup.
You can use round-robin, weighted or performance (which directs to the geographically closest address you have setup). You can further use Azure's DNS or another DNS system to use your own (sub)domain to CNAME to the trafficmanager.net domain name.
Load balancers like Azure Load Balancers and Amazon's Elastic Load Balancers are used to actually spread the traffic itself to different machines or services. Each work only with services hosted with the cloud provider so Azure Load Balancers can be used to load balance Azure VM's but not some servers you have hosted elsewhere.
Load balancers have bandwidth charges because they actually pass through the traffic. Azure Traffic Manager just has DNS query charges because that's all it does.
In your case, yes you can use Azure Traffic Manager to point to several external endpoints for your dedicated servers. You can also nest Traffic Manager profiles so that you can first use geo-location then round-robin. Azure Traffic Manager does support basic http/https monitoring to make sure the endpoint is still active.
Because it is based on DNS, there will always be a lag between changes with the TTL value and how clients cache DNS addresses. This is inherent with all DNS routing. To be extra safe, you can use Azure Traffic Manager to route to your datacenter and then run your own load balancing software locally to spread the load among servers.

Understanding availability set in Windows Azure

I am reading the explanation of Availability Sets on Microsoft' website but can't 100% understand the concept.
http://www.windowsazure.com/en-us/documentation/articles/manage-availability-virtual-machines/
There are many questions people ask in comments, but there is no technical support from Microsoft is there to answer them.
As I properly understand with availability sets you can duplicate your VM with IIS application and VM with SQL, which means you have to use 4 VM(pay for 4) instead of 2. This means that whenever IIS1 virtual machine is down, website will still be online with help of IIS2 virtual machine and vice versa? Same goes for SQL1 and SQL2 virtual machines?
Am I going to the right direction? If this is the case, how do I keep the data synchronized in SQL1 and SQL2, IIS1 and IIS2 virtual machines at the same time, so website will still be up with latest data and code if one VM is down for updates?
An availability set combines two concepts from the Windows Azure PaaS world - upgrade domains and fault domains - that help to make a service more robust. When several VMs are deployed into an availability set the Windows Azure fabric controller will distribute them among several upgrade domains and fault domains.
A fault domain represents a grouping of VMs which have a single point of failure - a convenient (although not precisely accurate) way to think about it is a rack with a single top or rack router. By deploying the VMs into different fault domains the fabric controller ensures that a single failure will not take the entire service offline.
The fabric controller uses upgrade domains to control the manner in which host OS upgrades (i.e., of the underlying physical server) are performed. The fabric controller performs these upgrades one upgrade domain at a time, only moving onto the next upgrade domain when the upgrade of the preceding upgrade domain has completed. Doing this ensures that the service remains available, although with reduced capacity, during a host OS upgrade. These upgrades appear to happen every month or two, and services in which all VMs are deployed into availability sets receive no warning since they are supposedly resilient towards the upgrade. Microsoft does provide warning about upgrades to subscriptions containing VMs deployed outside availability sets.
Furthermore, there is no SLA for services which have VMs deployed outside availability sets.
As regards SQL Server, you may want to look into the use of SQL Server Availability Groups which sit on top of Windows Server Failover Cluster and use synchronous replication of the data. For IIS, you may want to look at the possibility of deploying your application into a PaaS cloud service since that provides significant advantages over deploying it into an IaaS cloud service. You can create a service topology integrating PaaS and IaaS cloud services through the use of a VNET.
Availability set is combination of these two feature
Fault Domain(you have option to select max 3 when creating new Availability Set)
Update Domains (you have option to select max 20 when creating new Availability Set)
Fault Domain is the physical(like rack, power) set lets you selected 2 fault domain in your availability set and your machine in that availability set will have value 1 and 2 so at least one can be available in case of power failure at any physical set.
Update Domain is set which will be updated by azure system update at once.
if select 4 update domains and your 2 VM have value 2,3 that means they will not be updated together for any planed maintenance
For high availability duplicate VM should not be on same Fault Domain or same Update Domain
Now You can not change availability set after creation of a VM it should be set at the time of creation

Resources