In Event Grid, how do we set up geo-replication? As per the documentation, it is the publisher's responsibility to do the health check.
https://learn.microsoft.com/en-us/azure/event-grid/custom-disaster-recovery
https://learn.microsoft.com/en-us/azure/event-grid/geo-disaster-recovery
Is there something like pairing of two resources in Event Grid, as there is in other services such as Service Bus or SQL Database server?
Automatic geo disaster recovery is already built in and requires no configuration on your end. Do take note of the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) guarantees that are made.
Considering the RPO/RTO guarantees, it's best to have client-side recovery as well for maximum continuity.
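For the client-side piece, the custom disaster recovery article's pattern amounts to publishing to a primary custom topic and falling back to a secondary topic in another region when the publish fails. A minimal sketch in Python with requests (the topic endpoints, keys, timeout, and event payload are placeholders; Event Grid does not pair the two topics for you):

```python
import requests

# Hypothetical pair of custom topics in two regions; Event Grid does not pair them for you.
TOPICS = [
    {"endpoint": "https://mytopic-primary.westus2-1.eventgrid.azure.net/api/events",
     "key": "<primary-topic-key>"},
    {"endpoint": "https://mytopic-secondary.eastus2-1.eventgrid.azure.net/api/events",
     "key": "<secondary-topic-key>"},
]

def publish(events):
    """Try the primary topic first; fall back to the secondary on failure (publisher-side health check)."""
    last_error = None
    for topic in TOPICS:
        try:
            resp = requests.post(
                topic["endpoint"],
                json=events,
                headers={"aeg-sas-key": topic["key"]},
                timeout=5,
            )
            resp.raise_for_status()
            return topic["endpoint"]                # publish succeeded against this endpoint
        except requests.RequestException as err:    # timeout, connection error, 4xx/5xx, ...
            last_error = err
    raise RuntimeError("All Event Grid topic endpoints failed") from last_error

# Example call (Event Grid custom event schema):
# publish([{"id": "1", "eventType": "Order.Created", "subject": "orders/1",
#           "eventTime": "2024-01-01T00:00:00Z", "data": {"sku": "abc"}, "dataVersion": "1"}])
```

The subscriptions on both topics need to point at the same (or equivalent) handlers, and consumers should be prepared for duplicates, since a failed-then-retried publish can land on both topics.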
Related
We are using SQL Database failover groups. In case of an unplanned failover, some data loss might take place. It is documented that when the failed region becomes available, the old primary will automatically become secondary.
https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/auto-failover-group-sql-mi?view=azuresql&tabs=azure-powershell
We would like to preserve the failed replica so that manual data reconciliation can be done on the data that was not propagated to the DR site during the unplanned failover. What is the best way to do this?
What is the mechanism by which the old primary becomes the new secondary? Is it a complete copy, or are logs applied from some point?
Is it possible to have a failover group with no databases, just to establish read/write listeners?
We are thinking of enabling geo-redundancy for an existing Azure Cosmos DB account. But how do we simulate the failover for testing and development?
How important is it to consider data consistency after enabling geo-redundancy?
What are the general recommendations and guidelines/principles to follow before deciding on the required data consistency level?
Are any code changes required to account for geo-redundancy or data consistency? We are using cosmosdb-sqlapi.
Manual Failover
First, the Azure Cosmos account must be configured for manual failover for this operation to succeed.
https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-manage-database-account#set-failover-priorities-for-your-azure-cosmos-account
The process for performing a manual failover involves changing the account's write region (failover priority = 0) to another region configured for the account.
https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-manage-database-account#manual-failover
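If you want to drive that priority change from code, for example to simulate a failover in a test environment, a hedged sketch using the azure-mgmt-cosmosdb management SDK follows; the resource names are illustrative, and the exact operation name may vary between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import FailoverPolicies, FailoverPolicy

# Illustrative names; substitute your own subscription, resource group, and account.
client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Swap the write region: the region given failover_priority=0 becomes the new write region.
poller = client.database_accounts.begin_failover_priority_change(
    resource_group_name="my-rg",
    account_name="my-cosmos-account",
    failover_parameters=FailoverPolicies(failover_policies=[
        FailoverPolicy(location_name="East US 2", failover_priority=0),
        FailoverPolicy(location_name="West US 2", failover_priority=1),
    ]),
)
poller.result()  # long-running operation; blocks until the priority change completes
```

Running this against a non-production account is one way to exercise the failover path end to end before relying on it.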
Consistency
Azure Cosmos DB offers five well-defined levels. From strongest to weakest, the levels are:
Strong
Bounded staleness
Session
Consistent prefix
Eventual
https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels#consistency-levels-and-throughput
Cosmos DB Geo Redundancy in the Application
You should consider multiple points prior to implementing geo-redundancy.
https://learn.microsoft.com/en-us/azure/cosmos-db/high-availability#building-highly-available-applications
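On the "any code change required" question: with the SQL API, the typical client-side changes are to supply preferred read regions and, optionally, a per-client consistency level no stronger than the account default. A minimal sketch with the azure-cosmos Python SDK (account endpoint, key, region names, and database/container names are placeholders):

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint/key; the order of preferred_locations is the read-failover priority.
client = CosmosClient(
    url="https://<account>.documents.azure.com:443/",
    credential="<account-key>",
    consistency_level="Session",                      # may be weaker than, but not stronger than, the account default
    preferred_locations=["East US 2", "West US 2"],   # SDK reads from the first available region in this list
)

container = client.get_database_client("mydb").get_container_client("orders")
item = container.read_item(item="order-1", partition_key="customer-42")
```

With a single write region, writes are routed to that region automatically; the main application-level decision left is which consistency level your reads can tolerate.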
We can have a passive, read-only, asynchronous real-time sync for Azure SQL Database, for disaster recovery.
But our requirement is to have real-time sync between two active read-write databases, to provide low latency to customers in different locations around the world.
For example: I'm running an e-commerce website. I will update data in one of the database servers, and the other connected databases that are in sync with it should get the updates. Users from different parts of the world will connect to their nearest data center for low latency. If someone buys something or posts a review, it should be updated in all the other databases. In this way we need active-active database sync.
We explored multiple options for this, but did not find anything relevant.
Can anyone please guide me on how to achieve this?
SQL Server has Peer-to-Peer Transactional Replication, but you need to ensure in the application that conflicting changes are not introduced on multiple nodes.
SQL Server also has Merge Replication, which allows updates at any subscriber, and supports custom conflict resolution.
These are both available on SQL Server VMs. Limited replication options are available on Azure SQL Database Managed Instance. Azure SQL Database also has Data Sync.
Azure Cosmos DB also supports Multi-Master.
In either case, multi-master introduces significant cost and complexity. Often it's better to just have a single writable master with regional readable replicas. In that configuration the application needs to connect to the global master for writing, but can read from a local replica. For this pattern you can simply use Failover Groups.
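To make the failover-group pattern concrete: the application sends writes to the group's read-write listener and reads to the read-only listener. A sketch with pyodbc, using a hypothetical failover group named myfog and illustrative database, table, and credentials:

```python
import pyodbc

# Hypothetical failover group "myfog"; the two listener DNS names below are what failover groups expose.
WRITE_CONN = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myfog.database.windows.net,1433;"            # read-write listener (follows the primary on failover)
    "Database=shop;Uid=appuser;Pwd=<password>;Encrypt=yes;"
)
READ_CONN = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myfog.secondary.database.windows.net,1433;"  # read-only listener (routes to a readable secondary)
    "Database=shop;Uid=appuser;Pwd=<password>;Encrypt=yes;"
    "ApplicationIntent=ReadOnly;"
)

# Writes always go through the read-write listener, wherever the primary currently is.
with pyodbc.connect(WRITE_CONN) as conn:
    conn.execute("INSERT INTO dbo.Reviews (ProductId, Body) VALUES (?, ?)", 42, "Great!")
    conn.commit()

# Reads can be served by a regional readable secondary (asynchronous, so slightly stale).
with pyodbc.connect(READ_CONN) as conn:
    rows = conn.execute("SELECT TOP 10 Body FROM dbo.Reviews WHERE ProductId = ?", 42).fetchall()
```

Because replication to the secondaries is asynchronous, reads from the read-only listener may lag the primary slightly; that trade-off is usually acceptable for catalog and review pages, less so for checkout.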
I am planning to migrate my existing monolithic cloud RESTful Web API service to Service Fabric in three steps.
An in-process memory cache is heavily used in my cloud service.
Step 1) Migrate the cloud service to an SF stateful service with 1 replica and a single partition. The cache code stays as it is; no use of Reliable Collections.
Step 2) Horizontally scale the SF monolithic stateful service to 5 replicas and a single partition. The cache code is modified to use Reliable Collections.
Step 3) Break down the SF monolithic service into microservices (stateless/stateful).
Is the above approach clean? Any recommendations? Any drawbacks?
More on Step 2) Horizontal scaling of the SF stateful service
I am not planning to use an SF partitioning strategy, as I could not come up with a uniform data distribution for my application.
By adding more replicas and not partitioning the SF stateful service, I am just making my service more reliable (availability). Is my understanding correct?
I will modify the cache code to use a Reliable Collection (Reliable Dictionary). The same state data will be available in all replicas.
I understand that GETs can be executed on any replica, but updates/writes need to be executed on the primary replica?
How can I scale my SF stateful service without partitioning?
Can all of the replicas, including secondaries, listen to my client requests and respond the same way? GETs should be able to execute, but how do PUT and POST calls work?
Should I prefer an external cache store (Redis) over a Reliable Collection at this step? Should I use a stateless service?
This document has a good overview of options for scaling a particular workload in Service Fabric and some examples of when you'd want to use each.
Option 2 (creating more service instances, dynamically or upfront) sounds like it would map to your workload pretty well. Whether you decide to use a custom stateful service as your cache or use an external store depends on a few things:
Whether you have the space in your main compute machines to store the cached data
Whether your service can get away with a simple cache or whether it needs more advanced features provided by other caching services
Whether your service needs the performance improvement of a cache in the same set of nodes as the web tier or whether it can afford to call out to a remote service in terms of latency
Whether you can afford to pay for a caching service, or whether you want to make do with the memory, compute, and local storage you're already paying for with the VMs
Whether you really want to take on building and running your own cache
To answer some of your other questions:
Yes, adding more replicas increases availability/reliability, not scale. In fact it can have a negative impact on performance (for writes) since changes have to be written to more replicas.
The state data isn't guaranteed to be the same in all replicas, just a majority of them. Some secondaries can even be ahead, which is why reading from secondaries is discouraged.
So to your next question, the recommendation is for all reads and writes to always be performed against the primary so that you're seeing consistent quorum committed data.
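If you do opt for an external cache store such as Redis (one of the options above), the usual shape is a cache-aside lookup in front of your existing data access. A minimal sketch with redis-py; the host, key naming, and TTL are illustrative:

```python
import json
import redis

# Illustrative connection; Azure Cache for Redis uses TLS on port 6380.
cache = redis.Redis(host="<name>.redis.cache.windows.net", port=6380,
                    password="<access-key>", ssl=True)

def get_product(product_id, load_from_db):
    """Cache-aside: try the cache first, fall back to the database and populate the cache."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = load_from_db(product_id)              # your existing data-access call
    cache.setex(key, 300, json.dumps(product))      # 5-minute TTL; pick what fits your staleness budget
    return product
```

This keeps the web tier stateless (or lets a stateless SF service front the cache), at the cost of a remote hop and a separate service to pay for and operate, which is exactly the trade-off in the list above.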
We are currently looking at Azure Event Hubs as a mechanism to dispatch messages to background processors. At the moment, queue-based systems are being used.
Most processors are writing data to SQL Server databases, and the writes are wrapped in transactions.
Event Hubs is positioned as an at-least-once communication channel, so duplicate messages should be expected. EventProcessorHost is the recommended API on the read side; it automates lease management and checkpointing using Azure Blob Storage.
But for some of the most critical processors, we have an idea: implement checkpointing ourselves using a SQL Server table inside the same database, and write the checkpoint inside the same transaction as the processor's writes. This should give us the strong guarantee of exactly-once processing when needed.
Ignoring lease management for now (just run one processor per partition), is SQL-based checkpointing a good idea? Are there other drawbacks, apart from the need to work at a lower level of the API and handle checkpoints ourselves?
As per Fred's advice, we implemented our own Checkpoint Manager based on a table in SQL Server. You can find the code sample here.
This implementation plugs nicely into EventProcessorHost. We also had to implement ILeaseManager, because they are highly coupled in the default implementation.
In my blog post I've described my motivation for such SQL-based implementation, and the high-level view of the overall solution.
Azure Storage is the built-in solution, but we are not limited to that. If most of your processors are writing data to SQL Server databases and you do not want EventProcessorHost to store checkpoints in Azure Storage (which requires a storage account), then in my view storing checkpoints in your SQL database is a good solution, as it gives you an easy way to process events and manage checkpoints transactionally.
You could write your own checkpoint manager using the ICheckpointManager interface to store checkpoints in your SQL database.
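Whichever reader API you plug it into, the core of the SQL-based approach is a checkpoint table that is updated in the same transaction as the processor's writes and read again on startup to decide where to resume. A minimal sketch with pyodbc; the table and column names are illustrative, not part of any SDK:

```python
import pyodbc

CONN_STR = ("Driver={ODBC Driver 18 for SQL Server};Server=tcp:myserver.database.windows.net,1433;"
            "Database=processing;Uid=appuser;Pwd=<password>;Encrypt=yes;")

def load_checkpoint(partition_id):
    """Return the sequence number to resume from for this partition (None = start from the beginning)."""
    with pyodbc.connect(CONN_STR) as conn:
        row = conn.execute(
            "SELECT SequenceNumber FROM dbo.EventHubCheckpoints WHERE PartitionId = ?",
            partition_id).fetchone()
        return row[0] if row else None

def process_batch(partition_id, events):
    """Write the processor's output and the new checkpoint in one transaction: both commit or neither does."""
    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        for ev in events:
            cur.execute("INSERT INTO dbo.Orders (OrderId, Payload) VALUES (?, ?)",
                        ev["order_id"], ev["payload"])
        # Upsert the checkpoint for this partition inside the same transaction.
        cur.execute(
            "UPDATE dbo.EventHubCheckpoints SET SequenceNumber = ? WHERE PartitionId = ?;"
            "IF @@ROWCOUNT = 0 INSERT INTO dbo.EventHubCheckpoints (PartitionId, SequenceNumber) VALUES (?, ?);",
            events[-1]["sequence_number"], partition_id,
            partition_id, events[-1]["sequence_number"])
        conn.commit()  # on restart, load_checkpoint(partition_id) tells the reader where to resume
```

The main drawbacks remain the ones raised in the question: you give up the checkpointing and lease management that EventProcessorHost provides for free, so lease/ownership handling has to be solved separately once you run more than one processor per partition.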