We are using SQL Database failover groups. In case of an unplanned failover, some data loss might take place. It is documented that when the failed region becomes available, the old primary will automatically become secondary.
https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/auto-failover-group-sql-mi?view=azuresql&tabs=azure-powershell
We would like to preserve the failed replica so that the data that was not propagated to the DR site during the unplanned failover can be reconciled manually. What is the best way to do that?
What is the mechanism by which the old primary becomes the new secondary? Is it a complete copy, or is the log applied from some point?
Is it possible to have a failover group with no databases, just to establish read/write listeners?
Our application uses RA-GZRS for storage, which lets us read data from the secondary when the primary is down, but we can't write to it.
Is there a solution that lets the application both read from and write to storage if an Azure region goes down?
In Azure Storage, there can be only one region (primary) where you can write. The other region (secondary) will always be read only.
One possible solution would be to do a manual failover so that the secondary region of your account becomes the primary; you should then be able to write to it. However, please be aware that manual failover comes with a lot of caveats, so make sure you understand them first.
You can read more about those things here: https://learn.microsoft.com/en-us/azure/storage/common/storage-initiate-account-failover?tabs=azure-portal#important-implications-of-account-failover.
Please go through this. To quote from the article:
If the primary region becomes unavailable, you can choose to fail over
to the secondary region. After the failover has completed, the
secondary region becomes the primary region, and you can again read
and write data. For more information on disaster recovery and to learn
how to fail over to the secondary region, see Disaster recovery and
storage account failover.
The tutorial here tells you how to build a highly available application that automatically switches between endpoints on a failure. It uses a circuit breaker pattern.
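As a rough illustration of the circuit breaker idea for RA-GZRS reads, here is a minimal Python sketch. It is not the tutorial's code: `CircuitBreaker`, `read_with_fallback`, and the thresholds are hypothetical names and values, and `read_primary` / `read_secondary` stand in for whatever calls your application makes against the primary and secondary storage endpoints.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Tracks primary failures and 'opens' after too many in a row."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed value, tune for your app
        self.reset_after_s = reset_after_s          # assumed cool-down period
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the breaker and give the primary another chance.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0


def read_with_fallback(breaker: CircuitBreaker,
                       read_primary: Callable[[], object],
                       read_secondary: Callable[[], object]) -> object:
    """Read from the primary unless the breaker is open; fall back to the secondary."""
    if breaker.is_open():
        return read_secondary()
    try:
        result = read_primary()
    except Exception:
        breaker.record_failure()
        return read_secondary()
    breaker.record_success()
    return result
```

This only covers reads. Writes still need the primary, so a write path would have to queue or reject requests until the primary recovers or an account failover completes.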
We can have passive read-only asynchronous real-time sync-up for Azure SQL database, for disaster recovery.
But our requirement is to have real-time sync-up between both active read-write databases to provide low latency to customers in different locations of the world.
For example:
I run an e-commerce website. When I update data on one of the
database servers, the other connected databases that sync with this
database should receive the updates.
Users around the world will connect to their
nearest data center for low latency. If someone buys something or posts
a review, it should be updated in all the other databases. In this
way we need active-active database sync.
We explored multiple options, but did not find anything relevant.
Can anyone please guide me on how to achieve this?
SQL Server has Peer-to-Peer Transactional Replication, but you need to ensure in the application that conflicting changes are not introduced on multiple nodes.
SQL Server also has Merge Replication, which allows updates at any subscriber, and supports custom conflict resolution.
These are both available on SQL Server VMs. Limited replication options are available on Azure SQL Database Managed Instance. Azure SQL Database also has Data Sync.
Azure Cosmos DB also supports Multi-Master.
In either case multi-master introduces significant cost/complexity. Often it's better to just have a single writable master with regional readable replicas. In that configuration the application needs to connect to the global master for writing, but can read from a local replica. For this pattern you can simply use Failover Groups.
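A minimal sketch of that single-writer pattern in Python, assuming an Azure SQL Database failover group: writes go to the read-write listener and reads go to the read-only listener. The group name `myfog` and database name `shop` are hypothetical, credentials are left out on purpose, and the listener host names follow the `<group>.database.windows.net` / `<group>.secondary.database.windows.net` convention (Managed Instance listeners include an extra zone ID segment).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverGroupEndpoints:
    group_name: str  # hypothetical failover group name

    @property
    def read_write_host(self) -> str:
        # The read-write listener follows the current primary.
        return f"{self.group_name}.database.windows.net"

    @property
    def read_only_host(self) -> str:
        # The read-only listener points at the geo-secondary.
        return f"{self.group_name}.secondary.database.windows.net"


def connection_string(endpoints: FailoverGroupEndpoints, database: str,
                      read_only: bool) -> str:
    """Build an ODBC-style connection string (authentication settings omitted)."""
    host = endpoints.read_only_host if read_only else endpoints.read_write_host
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{host},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        # Declare read intent so the connection is treated as read-only.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


endpoints = FailoverGroupEndpoints("myfog")
write_conn = connection_string(endpoints, "shop", read_only=False)
read_conn = connection_string(endpoints, "shop", read_only=True)
```

Because the secondary is updated asynchronously, reads through the read-only listener can lag slightly behind the latest writes.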
In Event Grid, how do we set up geo-replication? As per the documentation, it is the publisher's responsibility to do the health check.
https://learn.microsoft.com/en-us/azure/event-grid/custom-disaster-recovery
https://learn.microsoft.com/en-us/azure/event-grid/geo-disaster-recovery
Is there something like pairing of two resources in Event Grid, as there is in other services such as Service Bus or SQL Database server?
Automatic geo-disaster recovery is already built in and requires no configuration on your end. Do note the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) guarantees that come with it.
Given those RPO/RTO guarantees, it's best to implement client-side recovery as well for maximum continuity.
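One way to picture client-side recovery is a publisher that keeps an ordered list of topics in different regions and sends each event to the first one that passes a health check. The Python sketch below is only an illustration under that assumption: `TopicEndpoint` and `publish_with_failover` are hypothetical names, and the `is_healthy` and `publish` callables stand in for your own health probe and Event Grid client calls.

```python
from typing import Callable, List, NamedTuple


class TopicEndpoint(NamedTuple):
    name: str                        # label for the topic, e.g. its region
    is_healthy: Callable[[], bool]   # hypothetical health probe supplied by the caller
    publish: Callable[[dict], None]  # hypothetical publish call supplied by the caller


def publish_with_failover(event: dict, endpoints: List[TopicEndpoint]) -> str:
    """Publish to the first endpoint that is healthy and accepts the event.

    Returns the name of the endpoint that took the event.
    """
    errors = []
    for endpoint in endpoints:
        try:
            if not endpoint.is_healthy():
                errors.append(f"{endpoint.name}: unhealthy")
                continue
            endpoint.publish(event)
            return endpoint.name
        except Exception as exc:
            # A failed probe or publish moves on to the next region.
            errors.append(f"{endpoint.name}: {exc}")
    raise RuntimeError("no endpoint accepted the event: " + "; ".join(errors))
```

Subscribers on both topics need to tolerate duplicates, since an event may be retried against the secondary after a partial failure on the primary.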
We are planning to use the Service Fabric actor model for one of our user services. We have thousands of users, each with their own profile data. From the material I have read so far, the Service Fabric actor model maintains its state within the Service Fabric cluster. I couldn't get a clear picture of the disaster recovery, planned shutdown, and offline data access scenarios. In such cases, is it necessary to persist the data outside the actor service?
What happens to the data if we decide to shut down the whole Service Fabric cluster one day and want to reactivate it a few days later?
In an SF cluster in Azure, the data is stored on the temp drive. There's no guarantee that a node that is shut down retains its temp drive, so shutting down all nodes simultaneously will result in data loss.
To avoid this, you should regularly create backups of your (Actor) Services, for instance by using this NuGet package. Store the resulting files outside the cluster.
The cluster technology will help keep your data safe during failures of nodes, e.g. in a 5 node cluster, 4 remaining healthy nodes can take over the work of a failed node. Data is stored redundantly, so your services remain operational. The same functionality also allows for rolling upgrades of services/actors.
Here's an article about DR.
I implemented a large enterprise application in Service Fabric using the actor model for order management.
A few things that might help when choosing a strategy for data backup and restoration:
The package https://github.com/loekd/ServiceFabric.BackupRestore is not full-fledged, so you need to handle some scenarios yourself.
For example: during deployment your actor partitions may move to other nodes, and if you then try to take an incremental backup it will fail with FabricMissingFullBackupException, because no full backup has been taken on that node since it became primary, and someone has to fix the issue manually.
How we added a retry pattern to fix that issue is outside the scope of this question.
Incremental backups did not always restore during the restoration process.
Sometimes incremental backup creation failed even with logTrunctationIntervalInMinutes set properly.
If a developer deletes the service or application by mistake, you will lose all your data.
If your system depends heavily on reminders, as ours did, be aware that
all reminders get reset during restoration.
Good solution: override the default KvsActorStateProvider with your own implementation that stores the data in DocumentDB, MongoDB, Cassandra, or Azure SQL (the last if you want to use Power BI for analytics).
I'm working on a quite large and critical application. It has been deployed to Azure with 3 web roles and a SQL Azure DB.
In case of disaster, we need to be able to restore both the web roles and SQL Azure to a different data center. Could someone please explain how we can restore the SQL Azure DB and the web role(s) to a different data center?
The simple answer is that you take regular backups of your SQL Azure database, which can be restored to a database in another datacenter. Any data written since the last backup will be lost, which is a harder problem to solve; the simplest option may be to keep a hot standby and use SQL Database Data Sync, but that may not be practical for all the data. Web roles are easier: you redeploy them somewhere else and change the connection strings to the database. You would also have to change the CNAME for your domain, as they will be restored to a different cloudapp.net name.
You did ask for restore, and not failover, right? Performing a failover (where you have a hot standby) is a more difficult problem, particularly as far as data synchronisation is concerned.
I would go back and question 'disaster' and correlate it with known facts. I am not sure of the outage history of Azure in specific data centres, but there have been significant Azure-wide outages (leap year 2012 and the certificate problem this year). The ability to restore to a different Azure datacentre won't help you in these scenarios (although AWS seems to have mostly regional outages). I don't think that a datacenter-specific recovery strategy is necessary on Windows Azure, but you may want to check the history and likelihood of datacenter-specific failures before making a final call. A multi-region architecture that distributes load and data across datacentres and handles live traffic across all of them (say, using Traffic Manager) has many benefits, one side effect being built-in disaster recovery, but it comes at an architectural, development, hosting and bandwidth cost.
Go back and write the business case for your datacenter disaster recovery scenario. You may find that it is not worth it financially, or doesn't solve your real problem.