Replicate part of the data to a different region - Azure

I am trying a use case with Cosmos DB where we want to maintain one Cosmos DB account but split the data between a US region and a Europe region using some partition key.
And for inserting/updating documents, the application knows which region (US/Europe) each document belongs to, so is it possible to point to the right region while inserting/updating the document?

As far as I know, the Cosmos DB global distribution mechanism guarantees consistency across all replica sets.
When you create a distributed Cosmos DB account, you enable geo-redundancy.
You will see the separation between read and write regions.
Write operations are completed in the write region and replicated to the other read regions to ensure consistency. On the client side, there is no need to point to a specific region to write data. From the perspective of consistency, the data in all regions is supposed to be the same.
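For illustration, a minimal sketch of that from the client side, assuming the Microsoft.Azure.Cosmos v3 .NET SDK (the endpoint, key, database/container names and item shape are placeholders):

    using Microsoft.Azure.Cosmos;

    // Placeholders: endpoint, key, database and container names.
    var client = new CosmosClient(
        "https://<your-account>.documents.azure.com:443/",
        "<your-key>",
        new CosmosClientOptions { ApplicationRegion = Regions.WestEurope });

    Container container = client.GetContainer("appdb", "orders");

    // No region is chosen here: the SDK routes the write to the account's
    // write region, and the service replicates it to the read regions.
    // ApplicationRegion above only influences which replica serves reads.
    // Assumes the container is partitioned on /region.
    await container.UpsertItemAsync(
        new { id = "order-1", region = "EU", total = 42.0 },
        new PartitionKey("EU"));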
For more details, you could refer to this document.
Hope it helps you.

Can you have multiple write regions?
DocumentDB has good built-in features to bring read operations closer to consumers by adding read regions to your DocumentDB account. You can read about it in the documentation: "How to setup Azure Cosmos DB global distribution using the SQL API".
My understanding based on this is that there is always only one write region at any given time. I would not bet on it, but it's hinted at in the documentation. For example in "Connecting to a preferred region using the SQL API":
The SDK will automatically send all writes to the current write region.
All reads will be sent to the first available region in the PreferredLocations list. If the request fails, the client will fail down the list to the next region, and so on.
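As an illustration, a hedged sketch of configuring that PreferredLocations list with the classic DocumentDB .NET SDK (Microsoft.Azure.DocumentDB); the account endpoint, key and region choices are placeholders:

    using System;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.Documents.Client;

    var connectionPolicy = new ConnectionPolicy();
    // Reads go to the first available region in this list; writes still go
    // to the account's single write region.
    connectionPolicy.PreferredLocations.Add(LocationNames.NorthEurope);
    connectionPolicy.PreferredLocations.Add(LocationNames.EastUS);

    var client = new DocumentClient(
        new Uri("https://<your-account>.documents.azure.com:443/"),
        "<your-key>",
        connectionPolicy);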
What you can do..
Things get more complicated when you also want to distribute writes (especially if you care about consistency and latency). DocumentDB's own documentation suggests you implement this as a combination of multiple accounts, each of which has its own local write region and automatic distribution to read/fallback nodes in other regions.
The downside is that your application would have to configure and implement reading from all accounts in your application code and merge the results. Having data well partitioned by geography could help avoid a full fan-out at times, but your DAL would still have to manage multiple storage accounts internally.
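A rough sketch of what such a data access layer could look like with the Microsoft.Azure.Cosmos v3 SDK (the two accounts, database/container names, the Order shape and the region-based routing rule are all hypothetical):

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    public record Order(string id, string Region, decimal Total);

    public class GeoSplitOrderStore
    {
        private readonly Container _us;
        private readonly Container _eu;

        // One CosmosClient per account; each account has its own write region.
        public GeoSplitOrderStore(CosmosClient usClient, CosmosClient euClient)
        {
            _us = usClient.GetContainer("orders-db", "orders");
            _eu = euClient.GetContainer("orders-db", "orders");
        }

        // Writes are routed to the account whose write region is local to the data.
        public Task WriteAsync(Order order) =>
            (order.Region == "US" ? _us : _eu)
                .UpsertItemAsync(order, new PartitionKey(order.Region));

        // Reads that are not geography-scoped fan out to both accounts and merge.
        public async Task<List<Order>> QueryAllAsync(string sql)
        {
            var results = new List<Order>();
            foreach (var container in new[] { _us, _eu })
            {
                FeedIterator<Order> it = container.GetItemQueryIterator<Order>(sql);
                while (it.HasMoreResults)
                    results.AddRange(await it.ReadNextAsync());
            }
            return results;
        }
    }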
This scenario is explained in more detail in the documentation page "Multi-master globally replicated database architectures with Azure Cosmos DB".
I would seriously consider if adding such complexity would be justified, or if distributing just the reads would suffice.

All regions for a given account have the same replicated data. If you want to separate data across regions, you'd need to split it into two accounts.
Given partition A in the US and partition B in the EU, there is very little difference whether A and B live under the same account or under different accounts; the collection/db/account are all just logical wrappers on top of the partition.

Related

Is there a built-in method to replicate a collection to a "follower" collection in the same region?

CosmosDB can geo-replicate collections and clients can be configured to make (read-only) queries to these "follower" regions.
Is there a built-in way for CosmosDB to provide a "follower" collection in the same region?
The scenario for using that is to use the "main" collection for fast interactive queries, and use the "follower" collection for slower, heavier backend queries, without the possibility of hitting limits and causing throttling that would impact the interactive case.
The usual answer for "copying" collections is to use a change feed (possibly via an Azure function), but this is "manual" work and the client (me) would have to take care of general dev-ops overhead like provisioning, telemetry, monitoring, alerting, key rotation etc.
I'd like to know if there's a "managed" way to do this, like there is for geo-replication.
The built-in geo-replication feature only works when replicating to different regions. You cannot replicate the same collection(s) back to the same region.
You'll need to set this up yourself. As you've already mentioned, you can use Change Feed to do this (though you called it a "manual" process and I don't see it as such, since this can be completely automated in code). You can also incorporate a messaging/event pattern: subscribe to database update events, and have multiple consumers writing to different database collections, per your querying needs.
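For reference, a hedged sketch of the Change Feed approach as an Azure Function (using the v3 Cosmos DB Functions extension; the database/collection names and the connection-string setting are placeholders), which keeps a same-region "follower" collection in sync:

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.WebJobs;

    public static class ReplicateToFollower
    {
        [FunctionName("ReplicateToFollower")]
        public static async Task Run(
            [CosmosDBTrigger(
                databaseName: "appdb",
                collectionName: "main",
                ConnectionStringSetting = "CosmosConnection",
                LeaseCollectionName = "leases",
                CreateLeaseCollectionIfNotExists = true)]
            IReadOnlyList<Document> changes,
            [CosmosDB(
                databaseName: "appdb",
                collectionName: "follower",
                ConnectionStringSetting = "CosmosConnection")]
            IAsyncCollector<Document> followerOut)
        {
            // Optionally reshape the document here (different partition key,
            // pre-computed aggregates, etc.) before writing to the follower.
            foreach (Document doc in changes)
                await followerOut.AddAsync(doc);
        }
    }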
Also: by having an independent collection where you provide the data-movement code, you can choose a different data model for your slower, heavier backend queries (maybe with a different partition key; maybe with some helpful aggregations; etc.).
There's really no way to avoid the added infrastructure setup.
Replication is limited to a single container/collection. For most scenarios like yours, one would use an alternate partition key to make the second collection read-optimized. You should also review your top queries and consider using an alternate database that is more read-optimized.
You could use this new tool:
https://github.com/Azure-Samples/azure-cosmosdb-live-data-migrator

How efficient can the Azure BLOB Table service be?

How efficient can Azure blob tables be?
The Azure storage service has various components like Containers, Queues and Tables. How efficient can Tables be, what is their exact use case, and why are they generally used with a supporting service like Azure Cosmos DB?
Can anyone help me understand the concept and the thought behind it?
Edit: The problem I am facing is that I have to log a processing batch of 700,000 data rows in C# into BLOB Tables. How do I achieve this following best practices?
This is a three in one question :-)
How efficient can tables be
Very efficient, if used properly. Every row in a table has a PartitionKey and a RowKey. When querying data, it performs very well if you can reduce the result set by using (parts of) the PartitionKey and RowKey. As soon as you start filtering on other columns, performance can decrease very fast. See also the docs regarding this topic.
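A brief hedged illustration of the difference, using the Azure.Data.Tables SDK (the table name, key values and the Status property are placeholders):

    using Azure.Data.Tables;

    var table = new TableClient("<connection-string>", "ProcessingLog");

    // Fast: a point lookup by PartitionKey + RowKey resolves directly to one entity.
    TableEntity row = await table.GetEntityAsync<TableEntity>(
        partitionKey: "batch-2023-10-01", rowKey: "000042");

    // Slow at scale: filtering on a non-key property scans entities.
    await foreach (TableEntity e in table.QueryAsync<TableEntity>(
        filter: "Status eq 'Failed'"))
    {
        // handle e
    }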
what is their exact use case
It is basically a key/value-pair NoSQL solution. It can be used very efficiently to store simple data in a fast and cheap manner. It is one of the cheapest options when it comes to data storage. Tables don't have a fixed schema (hence, NoSQL) and are used to store, for example, logs, configuration data and simple data structures.
and why are they generally used with a supporting service like Azure CosmosDB.
This is not the case. Azure Table Storage can be used on its own. Cosmos DB has a Table API that lets you use Cosmos DB from code written for Azure Table Storage without code modifications. It allows for premium performance, as not only the PartitionKey and RowKey are indexed, but all the other columns as well. So as soon as you start filtering on other columns, performance will still be very good. But it will cost you more in terms of money.
Data storage is best done using batches, as data is written per partition. See the answer of Ivan.
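For the 700,000-row logging scenario from the question, a hedged sketch of batched writes with the Azure.Data.Tables SDK, grouping by PartitionKey and respecting the 100-entity-per-batch limit (the table client and entity shape are placeholders):

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Azure.Data.Tables;

    static async Task WriteLogAsync(TableClient table, IEnumerable<TableEntity> rows)
    {
        // A transaction may only contain entities that share a PartitionKey,
        // and at most 100 of them. Enumerable.Chunk needs .NET 6+.
        foreach (var partition in rows.GroupBy(r => r.PartitionKey))
        {
            foreach (var chunk in partition.Chunk(100))
            {
                var batch = chunk
                    .Select(e => new TableTransactionAction(TableTransactionActionType.Add, e))
                    .ToList();
                await table.SubmitTransactionAsync(batch);
            }
        }
    }

Different partitions are independent, so they can be written in parallel to bring the total time for 700,000 rows down.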
Some more material on when to use it:
https://markheath.net/post/azure-tables-what-are-they-good-for
https://blogs.msdn.microsoft.com/brunoterkaly/2013/01/13/knowing-when-to-choose-windows-azure-table-storage-or-windows-azure-sql-database/

What is the recommended approach towards multi-tenant databases in Cassandra?

I'm thinking of creating a multi-tenant app using Apache Cassandra.
I can think of three strategies:
All tenants in the same keyspace, using tenant-specific fields for security
Table per tenant in a single shared DB
Keyspace per tenant
The voice in my head is suggesting that I go with option 3.
Thoughts and implications, anyone?
There are several considerations that you need to take into account:
Option 1: In pure Cassandra this option will work only if access to the database always goes through a "proxy" - an API, for example, that enforces filtering on the tenant field. Otherwise, if you provide CQL access, then everybody can read all the data. In this case, you also need to design the data model carefully, so the tenant is part of a composite partition key. DataStax Enterprise (DSE) has additional functionality called row-level access control (RLAC) that allows setting permissions at the table level.
Options 2 & 3: These are quite similar, except that when you have a keyspace per tenant, you have the flexibility to set up a different replication strategy - this can be useful to store a customer's data in different data centers bound to different geographic regions. But in both cases there are limitations on the number of tables in the cluster - a reasonable number of tables is around 200, with a "hard stop" above 500. The reason is that you need additional resources, such as memory, to keep auxiliary data structures (bloom filters, etc.) for every table, and this consumes both heap and off-heap memory.
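As an example of that per-tenant replication flexibility, a hedged sketch using the DataStax C# driver (the contact point, data center names and tenant names are all placeholders):

    using Cassandra;

    var cluster = Cluster.Builder().AddContactPoint("10.0.0.1").Build();
    ISession session = cluster.Connect();

    // One keyspace per tenant, each pinned to its own data centers.
    // Tenant bound to a European data center only.
    session.Execute(
        "CREATE KEYSPACE IF NOT EXISTS tenant_contoso WITH replication = " +
        "{'class': 'NetworkTopologyStrategy', 'dc_eu_west': 3}");

    // Tenant bound to a US data center only.
    session.Execute(
        "CREATE KEYSPACE IF NOT EXISTS tenant_fabrikam WITH replication = " +
        "{'class': 'NetworkTopologyStrategy', 'dc_us_east': 3}");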
I've done this for a few years now at large-scale in the retail space. So my belief is that the recommended way to handle multi-tenancy in Cassandra, is not to. No matter how you do it, the tenants will be hit by the "noisy neighbor" problem. Just wait until one tenant runs a BATCH update with 60k writes batched to the same table, and everyone else's performance falls off.
But the bigger problem is that there's no way you can guarantee that each tenant will even have a similar ratio of reads to writes. In fact they will likely be quite different. That's going to be a problem for options #1 and #2, as disk IOPS will be going to the same directory.
Option #3 is really the only way it realistically works. But again, all it takes is one ill-considered BATCH write to crush everyone. Also, want to upgrade your cluster? Now you have to coordinate it with multiple teams, instead of just one. Using SSL? Make sure multiple teams get the right certificate, instead of just one.
When we have new teams use Cassandra, each team gets their own cluster. That way, they can't hurt anyone else, and we can support them with fewer question marks about who is doing what.

multi-master in cosmosdb/documentdb

How can I set up multiple write regions in Cosmos DB so that I do not need to combine query results of two or more different regions in my application layer? From this documentation, it seems like Cosmos DB global distribution is global replication with one writer and multiple read secondaries, not true multi-master. https://learn.microsoft.com/en-us/azure/documentdb/documentdb-multi-region-writers
As of May 2018, Cosmos DB now supports multi-master natively using a combination of CRDT data types and automatic conflict resolution.
Multi-master in Azure Cosmos DB provides high levels of availability (99.999%), single-digit millisecond latency to write data and scalability with built-in comprehensive and flexible conflict resolution support.
Multi-master is composed of multiple master regions that equally participate in a write-anywhere model (active-active pattern) and it is used to ensure that data is available at any time where you need it. Updates made to an individual region are asynchronously propagated to all other regions (which in turn are master regions in their own right). Azure Cosmos DB regions operating as master regions in a multi-master configuration automatically work to converge the data of all replicas and ensure global consistency and data integrity.
Azure Cosmos DB implements the logic for handling conflicting writes inside the database engine itself. Azure Cosmos DB offers comprehensive and flexible conflict resolution support by offering several conflict resolution models, including Automatic (CRDT - conflict-free replicated data types), Last Write Wins (LWW), and Custom (Stored Procedure) for automatic conflict resolution. The conflict resolution models provide correctness and consistency guarantees and remove the burden from developers to have to think about consistency, availability, performance, replication latency, and complex combinations of events under geo-failovers and cross-region write conflicts.
More details here: https://learn.microsoft.com/en-us/azure/cosmos-db/multi-region-writers
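Once multi-region writes are enabled on the account, opting the client in looked roughly like this in the .NET SDK of that era (a hedged sketch assuming the Microsoft.Azure.DocumentDB 2.x SDK; endpoint, key, regions, database and collection names are placeholders):

    using System;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.Documents.Client;

    var policy = new ConnectionPolicy
    {
        UseMultipleWriteLocations = true   // opt in to multi-master writes
    };
    policy.PreferredLocations.Add(LocationNames.WestEurope);  // nearest region first
    policy.PreferredLocations.Add(LocationNames.EastUS);

    var client = new DocumentClient(
        new Uri("https://<your-account>.documents.azure.com:443/"),
        "<your-key>",
        policy);

    // With multi-region writes enabled, both reads and writes are served from
    // West Europe when it is available, instead of a single global write region.
    await client.UpsertDocumentAsync(
        UriFactory.CreateDocumentCollectionUri("appdb", "orders"),
        new { id = "order-1", region = "EU" });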
It's currently in preview and might require approval before you can use it.
According to your supplied link, and based on my understanding, multi-master in Cosmos DB/DocumentDB is implemented with multiple DocumentDB accounts acting as separate write regions, and documents are read by combining the query results. Currently it seems that it is not supported to set up multiple write regions in a single Cosmos DB account, so you cannot avoid combining query results from two or more different regions.
The referenced article describes how to simulate multi-master in Cosmos DB, while explicitly stating that it is not a multi-master database.
There are ways to "simulate" multi-master scenarios by configuring the consistency level (e.g. session) which will allow callers to see their local copy without having it written to the write region. You can find the details of the various levels here: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels.
Aside from that, consider if you truly need multi-master by working with the consistency levels, considering what acceptable latency is, etc. There are few scenarios that can't tolerate latency, particularly when you have adequate tools to provide a user experience that approximates a local write master. There is no such thing as real-time when remote networks are involved ;)
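If working with consistency levels turns out to be enough, a hedged sketch of relaxing the client to Session consistency with the Microsoft.Azure.Cosmos v3 SDK (endpoint, key and region are placeholders; the requested level can only be equal to or weaker than the account's default):

    using Microsoft.Azure.Cosmos;

    var client = new CosmosClient(
        "https://<your-account>.documents.azure.com:443/",
        "<your-key>",
        new CosmosClientOptions
        {
            ConsistencyLevel = ConsistencyLevel.Session,
            ApplicationRegion = Regions.WestEurope   // read from the nearest replica
        });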

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful, highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is well-nigh impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, figure out which server a given shard sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and is also the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10 GB limit per collection, but in the past it was up to you to figure out how to split your data into multiple collections to avoid hitting that limit.
Instead, you can now specify a partition key and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.
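A hedged sketch of what that looks like in code (using the current Microsoft.Azure.Cosmos v3 SDK rather than the SDK of that era; database/container names and the /date partition key path are placeholders):

    using System;
    using Microsoft.Azure.Cosmos;

    var client = new CosmosClient(
        "https://<your-account>.documents.azure.com:443/", "<your-key>");

    Database db = await client.CreateDatabaseIfNotExistsAsync("logsdb");

    // One logical partition per day of log data, as in the example above.
    Container logs = await db.CreateContainerIfNotExistsAsync(
        new ContainerProperties(id: "logs", partitionKeyPath: "/date"),
        throughput: 400);

    await logs.CreateItemAsync(
        new { id = Guid.NewGuid().ToString(), date = "2016-04-01", message = "hello" },
        new PartitionKey("2016-04-01"));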
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html
