How is the Multi-AZ deployment of Amazon RDS realized?

Recently I've been considering using an Amazon RDS Multi-AZ deployment for a production service, and I've read the related documentation.
However, I have a question about failover. The Amazon RDS FAQ describes failover as follows:
Q: What happens during Multi-AZ failover and how long does it take?
Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer. Failover times are a function of the time it takes crash recovery to complete. Start-to-finish, failover typically completes within three minutes.
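The FAQ's advice to implement connection retry at the application layer could look something like this minimal sketch (assuming a MySQL engine and a hypothetical endpoint; any DB-API driver follows the same pattern):

```python
import os
import time

import pymysql  # assumption: MySQL engine; swap in your driver of choice

RDS_ENDPOINT = "mydb.example123.us-east-1.rds.amazonaws.com"  # hypothetical

def connect_with_retry(retries=10, delay=5.0):
    """Keep retrying until DNS points at the newly promoted standby."""
    for attempt in range(1, retries + 1):
        try:
            return pymysql.connect(
                host=RDS_ENDPOINT,
                user="app",
                password=os.environ["DB_PASSWORD"],
                database="appdb",
                connect_timeout=5,
            )
        except pymysql.err.OperationalError:
            if attempt == retries:
                raise
            time.sleep(delay)  # failover typically completes within minutes
```

Because the failover is just a CNAME flip, a client that reconnects (rather than caching a resolved IP) lands on the new primary automatically.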
From the above description, I guess there must be a monitoring service that can detect failure of the primary instance and perform the CNAME flip.
My question is: which AZ is this monitoring service hosted in? There are 3 possibilities:
1. Same AZ as the primary
2. Same AZ as the standby
3. Another AZ
Clearly 1 and 2 can't be the case, since neither could handle an entire AZ becoming unavailable. So, if 3 is the case, what happens if the AZ hosting the monitoring service goes down? Is there another service to monitor the monitoring service? It seems like an endless line of dominoes.
So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?

So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?
I think that the "how" in this case is abstracted by design away from the user, given that RDS is a PaaS service. A multi-AZ deployment has a great deal that is hidden, however, the following are true:
You don't have any access to the secondary instance, unless a failover occurs
You are guaranteed that a secondary instance is located in a separate AZ from the primary
In his blog post, John Gemignani mentions the notion of an observer managing which RDS instance is active in the multi-AZ architecture. But to your point, what is the observer? And where is it observing from?
Here's my guess, based upon my experience with AWS:
The observer in an RDS multi-AZ deployment is a highly available service that is deployed throughout every AZ in every region that RDS multi-AZ is available, and makes use of existing AWS platform services to monitor the health and state of all of the infrastructure that may affect an RDS instance. Some of the services that make up the observer may be part of the AWS platform itself, and otherwise hidden from the user.
I would be willing to bet that the same underlying services that comprise CloudWatch Events are used in some capacity for the RDS Multi-AZ observer. In his blog post announcing CloudWatch Events, Jeff Barr describes the service this way:
You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired in to every nook and cranny of the supported services, and becomes aware of operational changes as they happen. Then, driven by your rules, it activates functions and sends messages (activating muscles, if you will) to respond to the environment, making changes, capturing state information, or taking corrective action.
Think of the observer the same way: it's a component of the AWS platform that provides a function that we, as users of the platform, do not need to think about. It's part of AWS's responsibility in the Shared Responsibility Model.

Educated guess: the monitoring service runs in all AZs and refers to a shared list of running instances (synchronously replicated across the AZs). As soon as the monitoring service in one AZ notices that another AZ is down, it flips the CNAMEs of all affected instances to an AZ that is currently up.
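To make the "flipping" concrete: conceptually it is just a DNS record update. RDS manages its own internal DNS, but if you sketched the same mechanic with a user-facing AWS API, a Route 53 UPSERT would look like this (zone ID, record name, and target are hypothetical):

```python
import boto3

route53 = boto3.client("route53")

def flip_cname(zone_id, record_name, new_target):
    """Repoint the instance's CNAME at the standby (the 'flip' the FAQ describes)."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 5,  # low TTL so clients pick up the change quickly
                    "ResourceRecords": [{"Value": new_target}],
                },
            }]
        },
    )

flip_cname("Z123EXAMPLE", "mydb.example.com.", "standby.us-east-1b.example.com")
```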

We did not get to choose where the failover instance resides, but our primary is in us-west-2c and the secondary is in us-west-2b.
Using PostgreSQL, our data became corrupted because of a physical problem with the Amazon volume (as near as we could tell). We did not have Multi-AZ set up at the time, so to recover we had to perform a point-in-time restore as close in time as we could to the event. Amazon support assured us that had we gone ahead with Multi-AZ, they would have automatically rolled over to the other AZ. This raises the questions of how they could have determined that, and whether the data corruption would have propagated to the other AZ.
Because of that disaster, we also added a read-only replica, which seems to make a lot more sense to me. We also use the RO replica for reads and other functions. My understanding from my Amazon rep is that one can think of the Multi-AZ setting as more like a RAID situation.

From the docs, failover occurs if any of the following conditions are met:
Loss of availability in primary Availability Zone
Loss of network connectivity to primary
Compute unit failure on primary
Storage failure on primary
This implies that the monitoring is not located in the same AZ. Most likely, the replica is using MySQL replication status functions (https://dev.mysql.com/doc/refman/5.7/en/replication-administration-status.html) to monitor the status of the master, and taking action if the master becomes unreachable.
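A health check along those lines, run from the replica's side, might look like this sketch (assuming the MySQL 5.7 status columns from the linked page):

```python
import pymysql

def primary_reachable(replica_conn):
    """True if the replica still sees a healthy replication link to the master."""
    with replica_conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")  # SHOW REPLICA STATUS on MySQL 8.0.22+
        row = cur.fetchone()
        return (row is not None
                and row["Slave_IO_Running"] == "Yes"
                and row["Slave_SQL_Running"] == "Yes")
```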
Of course, this raises the question of what happens if the replica's AZ fails. Amazon most likely has checks in the replica's failure detection to figure out whether it is the one failing or the primary is.

Related

Architecture recommendation - Azure Webjob

I have a WebJob that subscribes to an Azure Service Bus topic. The WebJob automates a very important business process. The Service Bus is the Premium SKU and has Geo-Recovery configured. My question is about the best practice for setting up high availability for my WebJob (to ensure that the process always runs). I already have the App Service Plan deployed in two regions, and the WebJob is installed in both regions. However, I would like the WebJob in the secondary region to run only if the primary region is down, perhaps temporarily due to an outage. How can this be implemented? If I run both WebJobs in parallel, that will create some serious duplication issues. Is there an architectural pattern I can refer to, or any features within App Service or Azure I can use to implement this?
With Service Bus, when you pick up a message it is locked, so it shouldn't be picked up by another process unless the lock expires or you complete the message back to Service Bus. In your case, if you are using Peek-Lock, you can use it to prevent the same message being picked up by different instances. See the docs.
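A minimal sketch of the Peek-Lock receive/complete loop, using the azure-servicebus Python SDK (queue name and connection-string setting are hypothetical; a topic subscription would use get_subscription_receiver instead):

```python
import os

from azure.servicebus import ServiceBusClient, ServiceBusReceiveMode

def handle(msg):
    print("processing:", str(msg))  # your business logic goes here

conn_str = os.environ["SERVICEBUS_CONNECTION_STRING"]  # hypothetical setting

with ServiceBusClient.from_connection_string(conn_str) as client:
    receiver = client.get_queue_receiver(
        "orders", receive_mode=ServiceBusReceiveMode.PEEK_LOCK)
    with receiver:
        for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
            handle(msg)
            # Removes the message from the queue; if we crash before this line,
            # the lock expires and another instance picks the message up again.
            receiver.complete_message(msg)
```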
You can also make use of sessions, which are available in the Premium tier of Service Bus. This way you can group messages into a session, and each service instance handles its own sessions unless the other instance is unavailable.
Since a WebJob is associated with an App Service, it really depends how you have configured this. You already mentioned that the WebJobs are in two regions, which means you have App Services running in two regions (make sure you have multiple instances running in each region, across different Availability Zones).
Now it comes down to what configuration you have for the standby region: active/passive with hot standby, active/passive with cold standby, or active/active. If your secondary region is active, with at least one instance running, then its WebJob is actually processing messages.
I would recommend reading through these patterns: Standby Regions Configuration, Multi-Region Config.
Regarding Service Bus: when you process a message with Peek-Lock, the message is not visible in the queue, so no other instance will pick it up. If your WebJob cannot process it in time, fails, or crashes, the message becomes visible in the queue again and any other instance can pick it up, so no two instances process the same message.
Better Approach
I would recommend using Azure Functions to process queue messages. They are a serverless offering with a monthly free invocation credit and are naturally highly available. You can find more here:
Azure Function Service Bus Trigger
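For illustration, a Service Bus trigger in the Python v2 programming model might look like this sketch (queue name and connection setting name are hypothetical):

```python
import azure.functions as func

app = func.FunctionApp()

@app.service_bus_queue_trigger(
    arg_name="msg",
    queue_name="orders",                # hypothetical queue
    connection="ServiceBusConnection",  # app setting holding the connection string
)
def process_order(msg: func.ServiceBusMessage):
    # The Functions runtime peek-locks the message and completes it
    # automatically when the function returns without raising.
    body = msg.get_body().decode("utf-8")
    print("processing:", body)
```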

Azure - Linux Standard B2ms - Turned off automatically?

I have a Linux Standard B2ms Azure virtual machine. I have disabled the auto-shutdown feature you see in the dashboard under Operations. For some reason this server was still shut down after running for about 8 days.
What reasons are there which could shutdown this server if I haven't changed anything on it the last three days?
What reasons are there which could shutdown this server if I haven't changed anything on it the last three days?
There are many reasons a VM might be shut down; we should try to find some logs about this.
First, check Azure Alerts via the Azure portal and try to find logs about your VM.
Second, check the VM's performance; high CPU or memory usage could be a cause. We can find logs in /var/log/*.
Also check whether there is an Azure service issue: go to Service Health -> Health history to see whether there are issues in your region.
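To see who or what stopped the VM, the Activity Log is usually the quickest answer. A sketch with the Python SDK (subscription ID and resource group are placeholders):

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

since = (datetime.now(timezone.utc) - timedelta(days=8)).strftime("%Y-%m-%dT%H:%M:%SZ")
flt = f"eventTimestamp ge '{since}' and resourceGroupName eq 'my-rg'"  # hypothetical RG

for event in client.activity_logs.list(filter=flt):
    # Look for deallocate/powerOff/redeploy operations and who initiated them;
    # a platform caller suggests maintenance rather than a user action.
    print(event.event_timestamp, event.operation_name.localized_value, event.caller)
```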
By the way, with just one VM in Azure we can't avoid a single point of failure. Microsoft recommends that two or more VMs be created within an availability set to provide a highly available application and to meet the 99.95% Azure SLA.
An availability set is composed of two additional groupings that protect against hardware failures and allow updates to safely be applied - fault domains (FDs) and update domains (UDs).
Fault domains:
A fault domain is a logical group of underlying hardware that share a common power source and network switch, similar to a rack within an on-premises datacenter. As you create VMs within an availability set, the Azure platform automatically distributes your VMs across these fault domains. This approach limits the impact of potential physical hardware failures, network outages, or power interruptions.
Update domains:
An update domain is a logical group of underlying hardware that can undergo maintenance or be rebooted at the same time. As you create VMs within an availability set, the Azure platform automatically distributes your VMs across these update domains. This approach ensures that at least one instance of your application always remains running as the Azure platform undergoes periodic maintenance. The order of update domains being rebooted may not proceed sequentially during planned maintenance, but only one update domain is rebooted at a time.
In your scenario, there may have been an unplanned maintenance event: when Microsoft updates the VM host, they migrate your VM to another host, shutting your VM down before migrating it.
To achieve high availability, we should create at least two VMs in one availability set, for example along these lines:
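A minimal sketch with the azure-mgmt-compute SDK (resource group, name, and location are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create the availability set first, then create both VMs referencing it.
compute.availability_sets.create_or_update(
    "my-rg", "my-avset",
    {
        "location": "westeurope",
        "platform_fault_domain_count": 2,   # separate power source / network switch
        "platform_update_domain_count": 5,  # only one UD is rebooted at a time
        "sku": {"name": "Aligned"},         # required when the VMs use managed disks
    },
)
```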

Service fabric actors state

We are planning to use the Service Fabric actor model for one of our user services. We have thousands of users, each with their own profile data. From reading the materials so far, the Service Fabric actor model maintains its state within the Service Fabric cluster. I couldn't get a clear picture of disaster recovery, planned shutdown scenarios, or offline data access. In such cases, is it necessary to persist the data outside of the actor service?
What happens to the data, if we decided to shutdown all the service fabric cluster one day, and wanted to reactivate few days later?
In an SF cluster in Azure, the data is stored on the temp drive. There's no guarantee that a node that is shut down retains the temp drive, so shutting down all nodes simultaneously will result in data loss.
To avoid this, you should regularly create backups of your (Actor) Services. For instance by using this Nuget package. Store the resulting files outside the cluster.
The cluster technology will help keep your data safe during failures of nodes, e.g. in a 5 node cluster, 4 remaining healthy nodes can take over the work of a failed node. Data is stored redundantly, so your services remain operational. The same functionality also allows for rolling upgrades of services/actors.
Here's an article about DR.
I implemented a large enterprise application in Service Fabric, using the actor model for order management.
A few things that might help when choosing a strategy for backup and restore:
The package https://github.com/loekd/ServiceFabric.BackupRestore is not full-fledged, and you need to handle some scenarios yourself. For example, during a deployment your actor partitions move to other nodes; if you then take an incremental backup, it fails with FabricMissingFullBackupException, because no full backup has been taken on the node that just became primary, and someone has to fix that manually. How we added a retry pattern to fix that issue is out of scope for this question.
Incremental backups did not always restore during the restore process.
Sometimes incremental backup creation failed even with logTruncationIntervalInMinutes set properly.
If a developer deletes the service or application by mistake, you lose all your data.
If your system depends heavily on reminders (as ours did): during restore, all the reminders get reset.
A good solution: override the default KvsActorStateProvider with your own implementation that stores the data in DocumentDB, MongoDB, Cassandra, or Azure SQL, especially if you want to use Power BI for analytics.

Multi regional Azure Container Service DC/OS clusters

I'm experimenting a little with ACS using the DC/OS orchestrator, and while spinning up a cluster within a single region seems simple enough, I'm not quite sure what the best practice would be for doing deployments across multiple regions.
ACS itself does not seem to support deploying a single cluster to more than one region right now. With that assumption, I guess my only other option is to create multiple, identical clusters in all the regions I wish to be available in, and then use Azure Traffic Manager to route incoming traffic to the nearest available cluster.
While this solution works, it also causes a few issues that I'm not 100% sure how to work around.
Our deployment pipelines must make sure to deploy to all regions when deploying a new version of a service. If we have an East US and a North Europe region, our CI tool has to connect to the Marathon API in both regions to trigger the deployments. If the deployment fails in one region and succeeds in the other, I suddenly have a disparity between the two regions (see the sketch below).
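A deployment step along those lines might look like this sketch (endpoints and app definition are hypothetical; Marathon's REST API uses PUT /v2/apps/{app_id}):

```python
import sys

import requests

MARATHON_ENDPOINTS = [  # hypothetical per-region cluster addresses
    "https://marathon.eastus.example.com",
    "https://marathon.northeurope.example.com",
]

def deploy_everywhere(app_id, app_definition):
    """Deploy to every region; fail the pipeline if any region diverges."""
    failures = []
    for base in MARATHON_ENDPOINTS:
        resp = requests.put(f"{base}/v2/apps/{app_id}",
                            json=app_definition, timeout=30)
        if resp.status_code not in (200, 201):
            failures.append((base, resp.status_code, resp.text))
    if failures:
        # A real pipeline would also roll back the regions that succeeded,
        # or retry the failed ones, to remove the disparity.
        sys.exit(f"Deployment diverged across regions: {failures}")
```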
If I have a service using local persistent volumes deployed, let's say PostgreSQL or Elasticsearch, it needs instances in both regions, since service discovery will only find services local to the region. That raises the problem of replicating state between regions to keep it consistent everywhere, which seems to require a lot of manual configuration to get working.
Has anyone ever used a setup somewhat like this with Azure Container Service (or really Amazon's container service, as I assume the same challenges exist there) and got some pointers on how to approach this?
You have multiple options for spinning up across regions. I would use a custom installation together with Terraform for each of them. This is a great starting point: https://github.com/bernadinm/terraform-dcos
Distributing agents across different regions should be no problem, ensuring that your services will keep running despite failures.
Distributing masters (giving you control over the services during failures) is a little more difficult, as it involves distributing a ZooKeeper quorum across high-latency links, so you should be careful in choosing the "distance" between regions.
Have a look at the documentation for more details.
You are correct that ACS does not currently support multi-region deployments.
Your first issue is specific to Marathon in DC/OS, I'll ping some of the engineering folks over there to see if they have any input on best practice.
Your second point is something we (I'm the ACS PM) are looking at. There are some solutions you can use in certain scenarios (e.g. ArangoDB is in the DC/OS universe and will provide replication). The DC/OS team may have something to say here too. In ACS we are evaluating the best approaches to providing solutions for this use case but I'm afraid I can't give any indication of timeline.
An alternative solution is to have your database in a SaaS offering. This takes away all the complexity of managing redundancy and replication.

Turning off ServiceFabric clusters overnight

We are working on an application that processes Excel files and spits out output. Availability is not a big requirement.
Can we turn the VM scale sets off during the night and turn them on again in the morning? Will this kind of setup work with Service Fabric? If so, is there a way to schedule it?
Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and have documented the conversation here for the community's sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of Primary node types in order for the system services to maintain a quorum and ensure health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss, however this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than a half hour, and this can be automated by using Powershell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to setup the ARM template and deploy using Powershell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you have need to maintain other resources such as the storage account then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
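The answer above mentions automating this with PowerShell and an ARM template; an equivalent sketch with the Python azure-mgmt-resource SDK, assuming a pre-exported ARM template file for the cluster, might look like:

```python
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

res = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Recreate the whole resource group from the template each morning.
res.resource_groups.create_or_update("sf-nightly-rg", {"location": "westeurope"})

with open("sf-cluster.template.json") as f:  # hypothetical exported ARM template
    template = json.load(f)

res.deployments.begin_create_or_update(
    "sf-nightly-rg", "morning-recreate",
    {"properties": {"mode": "Incremental", "template": template, "parameters": {}}},
).result()
```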
Q) Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with the cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application to only deploy to those larger size VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem that once you lose quorum (below 3 replicas/nodes) of your worker VM then there is no guarantee that your SF service itself will come back with all of the state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service’s state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, instead store your state externally into something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric’s reliable collections in order to maintain a faster read-cache, just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster, then recreate it and redeploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However you can scale down the number of VM's in the cluster.
During the day you would run the number of VMs required. At night you can scale down to the minimum of 5. Check this page on how to scale VM scale sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
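The scale-down itself can be scheduled with a few lines against the scale set behind the node type; a sketch with azure-mgmt-compute (resource group and scale set names are hypothetical):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

def scale_nodetype(resource_group, vmss_name, capacity):
    """Resize the scale set behind a node type (keep >= 5 on the primary node type)."""
    vmss = compute.virtual_machine_scale_sets.get(resource_group, vmss_name)
    vmss.sku.capacity = capacity
    compute.virtual_machine_scale_sets.begin_create_or_update(
        resource_group, vmss_name, vmss).result()

scale_nodetype("sf-rg", "nt1vm", 5)    # evening
# scale_nodetype("sf-rg", "nt1vm", 10) # morning
```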
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to stop and start SF clusters on Azure by stopping and starting the VM scale sets associated with those clusters. But upon restart, all your applications (and with them their state) are gone and must be redeployed.
