Azure SQL Managed Instance failover with transactional replication

I have two Azure SQL Managed Instances in different regions and have configured a failover group between them. I have also configured transactional replication from an on-premises SQL Server / Azure IaaS VM to the primary managed instance. Now I want to test the failover group by failing over to the secondary and then back to the primary. What is the best possible way to do this so that replication is not disturbed?

If geo-replication is enabled on a publisher or distributor instance in a failover group, the managed instance administrator must clean up all publications on the old primary and reconfigure them on the new primary after a failover occurs. Refer to the Microsoft documentation for more information.

When you configure the subscriber, use the failover group read/write listener endpoint instead of the primary managed instance name.
The following information is available in the Microsoft documentation on this subject:
"If a subscriber SQL Managed Instance is in a failover group, the publication should be configured to connect to the failover group listener endpoint for the subscriber managed instance. In the event of a failover, subsequent action by the managed instance administrator depends on the type of failover that occurred:
For a failover with no data loss, replication will continue working
after failover.
For a failover with data loss, replication will work as well. It will
replicate the lost changes again.
For a failover with data loss, but the data loss is outside of the
distribution database retention period, the SQL Managed Instance
administrator will need to reinitialize the subscription database."
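To exercise the failover group in both directions without breaking replication, a planned (no-data-loss) failover can be initiated from PowerShell. A minimal sketch, assuming the Az.Sql module; the resource group, region, and group names are placeholders, not from the question:

    # Planned failover of an instance failover group: run against the current
    # secondary's region to make it the new primary. Omitting -AllowDataLoss
    # makes this a no-data-loss (friendly) failover.
    Switch-AzSqlDatabaseInstanceFailoverGroup `
        -ResourceGroupName "myResourceGroup" `
        -Location "East US 2" `
        -Name "myFailoverGroup"

    # To fail back, run the same command against the original primary's region.

If the managed instance is the subscriber and the publication points at the failover group's read/write listener (<failover-group-name>.<zone-id>.database.windows.net) rather than an instance name, a no-data-loss failover like this should leave replication working, per the quoted documentation; only the publisher/distributor clean-up described above applies.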

Related

Active geo replication, Auto failover groups and Read scale out in Azure

What is the difference between Active geo-replication, auto-failover groups and read scale-out in Azure?
Active geo-replication provides replication of your primary database to a secondary database in a different Azure region.
Auto-failover groups are a feature that provides automated management of failover: if the primary server goes down, traffic is routed to the secondary automatically.
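For a concrete sense of the difference, here is a hedged PowerShell sketch (Az.Sql module; all server and group names are placeholders, not from the question):

    # Active geo-replication: a per-database readable secondary in another
    # region. Failover is initiated manually, per database.
    New-AzSqlDatabaseSecondary -ResourceGroupName "rg-primary" -ServerName "sql-primary" `
        -DatabaseName "mydb" -PartnerResourceGroupName "rg-secondary" `
        -PartnerServerName "sql-secondary" -AllowConnections "All"

    # Auto-failover group: a group of databases behind stable listener
    # endpoints, with an optional automatic failover policy.
    New-AzSqlDatabaseFailoverGroup -ResourceGroupName "rg-primary" -ServerName "sql-primary" `
        -PartnerResourceGroupName "rg-secondary" -PartnerServerName "sql-secondary" `
        -FailoverGroupName "myfog" -FailoverPolicy Automatic
    Add-AzSqlDatabaseToFailoverGroup -ResourceGroupName "rg-primary" -ServerName "sql-primary" `
        -FailoverGroupName "myfog" `
        -Database (Get-AzSqlDatabase -ResourceGroupName "rg-primary" `
            -ServerName "sql-primary" -DatabaseName "mydb")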

Is there any Azure Cache/database supporting multi-region automatic failover

We have one web app running on Azure which pushes data to Azure Redis; an on-prem component then reads that data from Azure Redis and processes it.
Recently, due to an Azure region failure, that Azure Redis went down. The web app and my on-prem component were not able to contact Azure Redis.
How can I ensure zero downtime for my web app's access to Azure Redis?
Redis geo-replication doesn't solve my problem, as it is unidirectional and failover is manual. Also, my web app and on-prem component would both need to know both Redis endpoints and choose between them accordingly, which is not seamless.
Azure Redis doesn't support a cluster having shards in multiple regions.
So my requirement is: the web app and the on-prem component both use a single cache/database endpoint (without any knowledge of the cache/database replication). If the primary cache/DB fails, that endpoint should automatically point to the replicated cache or DB.
As per the Azure documentation, Azure Redis doesn't seem to be the correct fit for this requirement. Is there any other Azure component which fits it?
I had a look at Azure SQL with failover groups. As per the documentation, "you can configure a grace period that controls the time between the detection of the outage and the failover itself. It is possible that Traffic Manager initiates the endpoint failover before the failover group triggers the failover of the database. In that case the web application cannot immediately reconnect to the database. But the reconnections will automatically succeed as soon as the database failover completes." We can set that grace period to 1 hour (minimum).
So does it mean that with Azure SQL as well, in case of failure of one DB server, my web application will not be able to write to the DB for at least 1 hour? Is my understanding correct?
Azure SQL and Azure Cosmos DB both support a single endpoint and HA across regions; you might want to look into those.
Those are not caches, but they do allow for a single endpoint and failover
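On the grace-period worry: a failover group gives the app the single stable endpoint asked for, and the grace period bounds how long the service waits before an automatic failover that may lose data. A hedged sketch (Az.Sql module; all names are placeholders):

    # The minimum grace period before automatic failover with data loss is 1 hour.
    New-AzSqlDatabaseFailoverGroup -ResourceGroupName "rg1" -ServerName "sql-primary" `
        -PartnerResourceGroupName "rg2" -PartnerServerName "sql-secondary" `
        -FailoverGroupName "app-fog" -FailoverPolicy Automatic -GracePeriodWithDataLossHours 1

    # The application always connects to the listener; DNS is repointed on
    # failover, so neither the web app nor the on-prem component needs to
    # know both servers. Connection string shown only as an illustration.
    $connectionString = "Server=tcp:app-fog.database.windows.net,1433;Database=mydb;..."

So with the automatic policy the app may indeed be unable to write until the automatic failover fires, but an operator can issue a manual failover sooner with Switch-AzSqlDatabaseFailoverGroup rather than waiting out the grace period.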

Does enrolling a SQL Azure database in Geo-Replication provide failover out of the box?

If I go into the Azure portal and go to a SQL Azure db and click on Geo-Replication I can select another data center to have a secondary database in. I can configure this as "readable." With that done, do I automatically get failover?
So for example, if my primary db is in Central US and I configure Geo-Replication to US East 2, will anything automatically fail over my db to US East 2 if there is an error in Central US? Or do I have to initiate the failover through the portal or some code/monitoring solution? And would I have to update my connection string, or does the Azure infrastructure manage this for me?
I've reviewed a few docs below about this but looking for some more input:
- https://azure.microsoft.com/en-us/documentation/articles/sql-database-designing-cloud-solutions-for-disaster-recovery/?rnd=1
- https://azure.microsoft.com/en-us/documentation/articles/sql-database-geo-replication-failover-portal/
- https://azure.microsoft.com/en-us/documentation/articles/sql-database-geo-replication-overview/
Do I have to initiate the failover through the portal or some code/monitoring solution?
Yes, you have to initiate the failover explicitly. There is no automatic failover in case the primary goes offline.
Would I have to update my connection string or does the Azure infrastructure manage this for me?
You would have to update the connection string explicitly as well.
The Failover and DR drill sections of this link should provide the necessary info; it also talks about keeping firewall rules and users in sync between primary and secondary: https://azure.microsoft.com/en-us/blog/spotlight-on-new-capabilities-of-azure-sql-database-geo-replication/
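For completeness, with the current Az PowerShell module a manually initiated geo-replication failover looks roughly like this (placeholder names, not from the question; run against the secondary server to promote it):

    # Planned failover: promotes the secondary after full synchronization.
    Set-AzSqlDatabaseSecondary -ResourceGroupName "rg-east" -ServerName "sql-east2" `
        -DatabaseName "mydb" -PartnerResourceGroupName "rg-central" -Failover

    # Forced failover if the primary is unreachable (accepts data loss):
    # add -AllowDataLoss to the same command.

After the promotion, the application must be pointed at the new primary server, which is why auto-failover groups (which add a stable listener endpoint and an optional automatic policy) were later introduced on top of geo-replication.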

Turning off Service Fabric clusters overnight

We are working on an application that processes Excel files and spits out output. Availability is not a big requirement.
Can we turn the VM scale sets off overnight and turn them on again in the morning? Will this kind of setup work with Service Fabric? If so, is there a way to schedule it?
Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and documented the conversation here for the community's sake.
Response to the initial question:
A Service Fabric cluster must maintain a minimum number of Primary node types in order for the system services to maintain a quorum and ensure health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss, however this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than a half hour, and this can be automated by using PowerShell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to set up the ARM template and deploy using PowerShell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you need to maintain other resources such as the storage account, then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
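As a rough illustration of the delete-and-recreate approach (Az module; resource group name and template paths are placeholders):

    # Evening: tear down the whole cluster resource group (no state survives).
    Remove-AzResourceGroup -Name "sf-cluster-rg" -Force

    # Morning: recreate the group and redeploy the ARM template, then redeploy apps.
    New-AzResourceGroup -Name "sf-cluster-rg" -Location "West US"
    New-AzResourceGroupDeployment -ResourceGroupName "sf-cluster-rg" `
        -TemplateFile ".\cluster.json" `
        -TemplateParameterFile ".\cluster.parameters.json"

Both steps can be scheduled with Azure Automation runbooks or any scheduler that can run PowerShell.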
Q) Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
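In other words, something like the following (a hedged sketch with the Az.Compute module; names are placeholders; mind the quorum-loss caveat above):

    # Night: deallocate all scale set VMs to stop compute billing.
    Stop-AzVmss -ResourceGroupName "sf-cluster-rg" -VMScaleSetName "nt1vm" -Force

    # Morning: start them again and let the cluster attempt to recover quorum.
    Start-AzVmss -ResourceGroupName "sf-cluster-rg" -VMScaleSetName "nt1vm"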
Q) Can we create a primary node type with the cheapest VMs we can find and add a secondary node type with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application to deploy only to those larger VMs. However, if your Service Fabric service is storing state, then you will still run into a similar problem: once you lose quorum (below 3 replicas/nodes) of your worker VMs, there is no guarantee that your SF service will come back with all of its state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service’s state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, store your state externally in something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric’s reliable collections to maintain a faster read cache; just make sure all writes are persisted to the external store. This way you can freely delete and recreate your cluster at any time.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high-capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch, which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster, recreate it, and deploy the application again in the morning.
Turning off the cluster is, as Todd said, not an option. However, you can scale down the number of VMs in the cluster.
During the day you would run the number of VMs required. At night you can scale down to the minimum of 5; a sketch follows below. Check this page on how to scale VM scale sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
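Scaling the scale set down is a few lines at the Azure level (a sketch with the Az.Compute module; names are placeholders; note that shrinking a Service Fabric node type cleanly may also require disabling the nodes and removing node state on the Service Fabric side):

    # Reduce the scale set to 5 instances overnight.
    $vmss = Get-AzVmss -ResourceGroupName "sf-cluster-rg" -VMScaleSetName "nt1vm"
    $vmss.Sku.Capacity = 5
    Update-AzVmss -ResourceGroupName "sf-cluster-rg" -VMScaleSetName "nt1vm" `
        -VirtualMachineScaleSet $vmss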
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.

How is the Multi-AZ deployment of Amazon RDS realized?

Recently I have been considering using an Amazon RDS Multi-AZ deployment for a service in a production environment, and I've read the related documents.
However, I have a question about the failover. In the FAQ of Amazon RDS, failover is described as follows:
Q: What happens during Multi-AZ failover and how long does it take?
Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer. Failover times are a function of the time it takes crash recovery to complete. Start-to-finish, failover typically completes within three minutes.
From the above description, I guess there must be a monitoring service which can detect failure of the primary instance and do the flipping.
My question is: which AZ is this monitoring service hosted in? There are 3 possibilities:
1. Same AZ as the primary
2. Same AZ as the standby
3. Another AZ
Apparently 1 & 2 can't be the case, since they could not handle an entire AZ becoming unavailable. So, if 3 is the case, what happens if the AZ of the monitoring service goes down? Is there another service to monitor this monitoring service? It seems like an endless chain of dominoes.
So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?
So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?
I think that the "how" in this case is abstracted by design away from the user, given that RDS is a PaaS service. A multi-AZ deployment has a great deal that is hidden, however, the following are true:
- You don't have any access to the secondary instance, unless a failover occurs
- You are guaranteed that a secondary instance is located in a separate AZ from the primary
In his blog post, John Gemignani mentions the notion of an observer managing which RDS instance is active in the multi-AZ architecture. But to your point, what is the observer? And where is it observing from?
Here's my guess, based upon my experience with AWS:
The observer in an RDS multi-AZ deployment is a highly available service that is deployed throughout every AZ in every region that RDS multi-AZ is available, and makes use of existing AWS platform services to monitor the health and state of all of the infrastructure that may affect an RDS instance. Some of the services that make up the observer may be part of the AWS platform itself, and otherwise hidden from the user.
I would be willing to bet that the same underlying services that comprise CloudWatch Events are used in some capacity for the RDS multi-AZ observer. In Jeff Barr's blog post announcing CloudWatch Events, he describes the service this way:
You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired in to every nook and cranny of the supported services, and becomes aware of operational changes as they happen. Then, driven by your rules, it activates functions and sends messages (activating muscles, if you will) to respond to the environment, making changes, capturing state information, or taking corrective action.
Think of the observer the same way - it's a component of the AWS platform that provides a function that we, as the users of the platform do not need to think about. It's part of AWS's responsibility in the Shared Responsibility Model.
Educated guess - the monitoring service runs in all the AZs and refers to a shared list of running instances (which is sync-replicated across the AZs). As soon as the monitoring service in one AZ notices that another AZ is down, it flips the CNAMEs of all the affected instances to an AZ which is currently up.
We did not get to determine where the failover instance resides, but our primary is in US-West-2c and the secondary is in US-West-2b.
Using PostgreSQL, our data became corrupted because of a physical problem with the Amazon volume (as near as we could tell). We did not have a Multi-AZ setup at the time, so to recover we had to perform a point-in-time restore as close in time as we could to the event. Amazon support assured us that had we gone ahead with Multi-AZ, they would have automatically rolled over to the other AZ. This raises the questions of how they could have determined that, and whether the data corruption would have propagated to the other AZ.
Because of that disaster, we also added a read-only replica, which seems to make a lot more sense to me. We also use the RO replica for reads and other functions. My understanding from my Amazon rep is that one can think of the Multi-AZ setting as more like a RAID situation.
From the docs, failover occurs if any of the following conditions are met:
Loss of availability in primary Availability Zone
Loss of network connectivity to primary
Compute unit failure on primary
Storage failure on primary
This implies that the monitoring is not located in the same AZ. Most likely, the replica is using MySQL functions (https://dev.mysql.com/doc/refman/5.7/en/replication-administration-status.html) to monitor the status of the master, and taking action if the master becomes unreachable.
Of course, this raises the question of what happens if the replica's AZ fails. Amazon most likely has checks in the replica's failure detection to figure out whether it is failing or the primary is.
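Whatever the internals, the failover mechanics can at least be observed and exercised from the outside. A hedged sketch (AWS Tools for PowerShell plus the Windows Resolve-DnsName cmdlet; the instance identifier and endpoint are placeholders):

    # Force a Multi-AZ failover to the standby; rebooting with failover is the
    # documented way to test failover behavior (Restart-RDSDBInstance wraps the
    # RebootDBInstance API).
    Restart-RDSDBInstance -DBInstanceIdentifier "mydb" -ForceFailover $true

    # The endpoint name never changes; RDS repoints its CNAME at the standby,
    # and the flip can be watched from the client side.
    Resolve-DnsName "mydb.abcdefghijkl.us-west-2.rds.amazonaws.com" -Type CNAME

Combined with the connection-retry advice in the FAQ, this is usually all an application needs: keep the single endpoint, retry on broken connections, and let RDS move the CNAME.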
