Does Cosmos DB store data differently depending on which API you use? - azure

What is the ramification of using each api?
For example, if I am using the SQL API, am I sacrificing ACID, and which part of CAP am I getting? How does Azure achieve horizontal scalability?
If I am using the document API or key-value API, is the internal data layout different?

The internal storage format of the various database APIs in Cosmos DB doesn't have any bearing on ACID or CAP. The API you choose should be driven by the appropriate use case and/or your familiarity with it. For instance, both the SQL API and the MongoDB API are document databases. But if you have experience with Mongo, then it's probably a better choice than, say, Gremlin, which is a graph database, Table, which is a key/value store, or Cassandra, which is a columnar store.
Cosmos DB provides ACID support for data within the same logical partition. With regard to CAP, that depends on which consistency level/model you choose. Strong consistency trades "A" (availability) for "C" (consistency). All the other consistency models trade consistency for availability, but to varying degrees. For instance, bounded staleness defines an upper bound by which data can lag in consistency, measured in time or in number of updates. As that boundary approaches, Cosmos will throttle new writes to allow the replication queue to catch up, ensuring the consistency guarantees are met.
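The bounded-staleness throttling behavior can be pictured with a toy model (this is purely an illustration of the idea above, not the actual Cosmos DB replication engine; the class name and lag bound are invented for the sketch):

```python
from collections import deque

class BoundedStalenessSketch:
    """Toy model: writes are throttled once the read replica lags behind
    the write region by more than max_lag un-replicated operations."""
    def __init__(self, max_lag):
        self.max_lag = max_lag
        self.pending = deque()   # writes not yet replicated
        self.replica = []        # what the read region has applied

    def write(self, item):
        if len(self.pending) >= self.max_lag:
            return False         # throttled: replication must catch up first
        self.pending.append(item)
        return True

    def replicate_one(self):
        if self.pending:
            self.replica.append(self.pending.popleft())

store = BoundedStalenessSketch(max_lag=2)
assert store.write("a") and store.write("b")
assert not store.write("c")      # bound reached: new write throttled
store.replicate_one()            # replica catches up by one operation
assert store.write("c")          # lag is back under the bound
```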
Achieving horizontal scalability is a function of your partitioning strategy. For write-heavy workloads, the objective should be choosing a partition key that allows writes to be distributed across a wide range of values. For read-heavy workloads, you want queries to be served by one, or a bounded number of, partitions. If your workload is both, then use change feed to copy data into two or more containers as needed, such that the cost of copying the data is cheaper than running the queries in a container where they would be cross-partition. At the end of the day, choosing a good partition key requires testing under production levels of load and data.
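The write-distribution point can be sketched as follows. This is a toy model of hash partitioning (Cosmos DB's actual hashing is internal; the helper and partition count here are made up for illustration), showing why a high-cardinality partition key spreads writes while a low-cardinality one creates a hot partition:

```python
import hashlib

def physical_partition(partition_key: str, partition_count: int) -> int:
    """Toy stand-in for hash partitioning: hash the partition key value
    to decide which physical partition a logical partition lands on."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % partition_count

# A high-cardinality key (e.g. userId) spreads writes across partitions...
docs = [{"userId": f"user-{i}"} for i in range(1000)]
spread = {physical_partition(d["userId"], 10) for d in docs}
assert len(spread) == 10          # every partition receives writes

# ...while a single low-cardinality value concentrates them on one.
hot = {physical_partition("tenant-1", 10) for _ in range(1000)}
assert len(hot) == 1              # a hot partition
```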

Related

Replicate part of data to different region

I am trying a use case with Cosmos DB where we want to maintain one Cosmos DB account but split the data into a US region and a Europe region with some partition key.
For inserting/updating documents, the application knows which region (US/Europe) each document goes to, so is it possible to point to the right region while inserting/updating the document?
As far as I know, Cosmos DB's global distribution mechanism guarantees consistency across all replica sets.
When you create a globally distributed Cosmos DB account, you enable geo-redundancy.
You will see the separation between read and write regions.
Write operations are completed in the write region and replicated to the other read regions to ensure consistency. On the client side, there is no need to point to a specific region to write data. From the perspective of consistency, the data in all regions is supposed to be the same.
More details, you could refer to this document.
Hope it helps you.
Can you have multiple write regions?
DocumentDB has good built-in features to bring read operations closer to consumers by adding read regions to your DocumentDB account. You can read about it in the documentation: "How to setup Azure Cosmos DB global distribution using the SQL API".
My understanding based on this is that there is always only one write region at any given time. I would not bet my thumbs on it, but it's hinted at in the documentation. For example, in "Connecting to a preferred region using the SQL API":
The SDK will automatically send all writes to the current write region.
All reads will be sent to the first available region in the PreferredLocations list. If the request fails, the client will fail down the list to the next region, and so on.
What you can do..
Things get more complicated when you also want to distribute writes (especially if you care about consistency and latency). DocumentDB's own documentation suggests you implement this as a combination of multiple accounts, each of which has its own local write region and automatic distribution to read/fallback nodes in other regions.
The downside is that your application would have to configure and implement reading from all accounts in your application code and merge the results. Having data well partitioned by geography could help avoid a full fan-out at times, but your DAL would still have to manage multiple storages internally.
This scenario is explained in more detail in documentation page
"Multi-master globally replicated database architectures with Azure Cosmos DB".
I would seriously consider if adding such complexity would be justified, or if distributing just the reads would suffice.
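The multi-account pattern described in this answer can be sketched in a few lines. This is a hypothetical DAL outline with plain dicts standing in for the two regional accounts (all names are invented; a real implementation would hold a client per account), showing both the full fan-out merge and the geography-targeted read:

```python
# Two regional "accounts" for the sketch (dicts stand in for DocumentDB clients).
us_account = {"order-1": {"id": "order-1", "region": "US"}}
eu_account = {"order-9": {"id": "order-9", "region": "EU"}}

def query_all(predicate):
    """Fan the query out to every account and merge the results."""
    results = []
    for account in (us_account, eu_account):
        results.extend(doc for doc in account.values() if predicate(doc))
    return results

def query_by_geo(region, predicate):
    """If data is partitioned by geography, target a single account
    and skip the fan-out entirely."""
    account = us_account if region == "US" else eu_account
    return [doc for doc in account.values() if predicate(doc)]

assert len(query_all(lambda d: True)) == 2
assert query_by_geo("EU", lambda d: True)[0]["id"] == "order-9"
```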
All regions for a given account have the same replicated data. If you want to separate data across regions, you'd need to split it into two accounts.
Given partition A in the US and partition B in the EU – there is very little difference if A and B were under the same account, or under different accounts… the collection/db/account are all just logical wrappers on top of the partition.

multi-master in cosmosdb/documentdb

How can I set up multiple write regions in Cosmos DB so that I do not need to combine query results of two or more different regions in my application layer? From this documentation, it seems like Cosmos DB global distribution is global replication with one writer and multiple read secondaries, not true multi-master. https://learn.microsoft.com/en-us/azure/documentdb/documentdb-multi-region-writers
As of May 2018, Cosmos DB now supports multi-master natively using a combination of CRDT data types and automatic conflict resolution.
Multi-master in Azure Cosmos DB provides high levels of availability (99.999%), single-digit millisecond latency to write data, and scalability with built-in comprehensive and flexible conflict resolution support.
Multi-master is composed of multiple master regions that equally participate in a write-anywhere model (an active-active pattern), and it is used to ensure that data is available at any time where you need it. Updates made to an individual region are asynchronously propagated to all other regions (which are in turn master regions in their own right). Azure Cosmos DB regions operating as master regions in a multi-master configuration automatically work to converge the data of all replicas and ensure global consistency and data integrity.
Azure Cosmos DB implements the logic for handling conflicting writes inside the database engine itself. It offers comprehensive and flexible conflict resolution support through several conflict resolution models, including Automatic (CRDTs, conflict-free replicated data types), Last Write Wins (LWW), and Custom (stored procedure). The conflict resolution models provide correctness and consistency guarantees and remove the burden from developers of having to think about consistency, availability, performance, replication latency, and complex combinations of events under geo-failovers and cross-region write conflicts.
More details here: https://learn.microsoft.com/en-us/azure/cosmos-db/multi-region-writers
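Of the resolution models listed above, Last Write Wins is the simplest to picture: when two regions write the same document concurrently, the version with the higher value at the conflict resolution property wins. A toy sketch (the resolver function is invented for illustration; `_ts` mirrors the timestamp property Cosmos DB documents carry, but this is not the engine's actual code):

```python
def resolve_lww(local, remote, path="_ts"):
    """Toy Last-Write-Wins resolution: keep the version whose conflict
    resolution property (here the timestamp _ts) is highest."""
    return local if local[path] >= remote[path] else remote

# Concurrent writes to the same document from two master regions:
us_write = {"id": "doc1", "value": "from US", "_ts": 100}
eu_write = {"id": "doc1", "value": "from EU", "_ts": 105}

winner = resolve_lww(us_write, eu_write)
assert winner["value"] == "from EU"   # the later timestamp wins everywhere
```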
It's currently in preview and might require approval before you can use it.
Based on my understanding of the link you supplied, multi-master in Cosmos DB/DocumentDB is implemented with multiple DocumentDB accounts serving as separate write regions, with reads combined across them in a merged query. Currently it does not seem to be supported to set up multiple write regions in a single Cosmos DB account, so you would still need to combine query results from two or more regions.
The referenced article states how to implement multi-master in Cosmosdb, while explicitly stating that it is not a multi-master database.
There are ways to "simulate" multi-master scenarios by configuring the consistency level (e.g. session) which will allow callers to see their local copy without having it written to the write region. You can find the details of the various levels here: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels.
Aside from that, consider if you truly need multi-master by working with the consistency levels, considering what acceptable latency is, etc. There are few scenarios that can't tolerate latency, particularly when you have adequate tools to provide a user experience that approximates a local write master. There is no such thing as real-time when remote networks are involved ;)
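The "simulate with consistency levels" suggestion above hinges on session consistency's read-your-own-writes guarantee, which a toy model can make concrete (the class and its sequence-number token are invented for the sketch; they loosely mimic how a session token lets a client insist on a replica that has caught up to its own writes):

```python
class SessionSketch:
    """Toy read-your-writes model: each write returns a sequence number
    acting as the session token; a read only returns once the replica
    has replicated up to that client's token."""
    def __init__(self):
        self.write_log = []        # the write region's log
        self.replicated = 0        # how far the read replica has caught up

    def write(self, value):
        self.write_log.append(value)
        return len(self.write_log)     # session token for this client

    def read(self, session_token):
        if self.replicated < session_token:
            self.replicated = session_token   # wait/redirect until caught up
        return self.write_log[:self.replicated]

db = SessionSketch()
token = db.write("my update")
assert "my update" in db.read(token)   # a caller always sees its own writes
```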

graph traversal performance of azure Cosmos DB

Cosmos DB allows us to store graph data using the Gremlin query language.
Are there intelligent algorithms optimizing how the graph is split up among many servers? If not, I can imagine some queries being extremely slow due to network latency between the shards.
The documentation is still a bit lacking, but there are some performance considerations for DocumentDB itself. Namely, setting up a partition key that is adequately granular will split your data across multiple partitions, giving you higher throughput. You can find more here:
https://learn.microsoft.com/en-us/azure/documentdb/documentdb-partition-data

Azure: Redis vs Table Storage for web cache

We currently use Redis as the persistent cache for our web application, but given its limited memory and cost I'm starting to consider whether Table storage is a viable option.
The data we store is fairly basic JSON data with a clear two-part key, which we'd use for the partition and row key in Table storage, so I'm hoping that would mean fast querying.
I appreciate one is in-memory and one isn't, so Table storage will be a bit slower, but as we scale I believe there is only one CPU serving data from a Redis cache, whereas with Table storage we wouldn't have that issue, as it would come down to the number of web servers we have running.
Does anyone have any experience of using Table storage in this way, or comparisons between the two?
I should add we use Redis in a very minimalist way, get/set and nothing more; we evict our own data and, failing that, leave the eviction to Redis when it runs out of space.
This is a fairly broad/opinion-soliciting question. But from an objective perspective, these are the attributes you'll want to consider when deciding which to use:
Table Storage is a durable, key/value store. As such, content doesn't expire. You'll be responsible for clearing out data.
Table Storage scales to 500TB.
Redis is scalable horizontally across multiple nodes (or, scalable via Redis Service). In contrast, Table Storage will provide up to 2,000 transactions / sec on a partition, 20,000 transactions / sec across the storage account, and to scale beyond, you'd need to utilize multiple storage accounts.
Table Storage will have a significantly lower cost footprint than a VM or Redis service.
Redis provides features beyond Azure Storage tables (such as pub/sub, content eviction, etc).
Both Table Storage and Redis Cache are accessible via an endpoint, with many language-specific SDK wrappers around the APIs.
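The first attribute in the list above, that Table Storage content never expires on its own, is worth dwelling on if you use it as a cache: the application has to stamp entries and evict stale ones itself. A minimal sketch of that pattern (the class is invented, and a plain dict stands in for a table keyed by partition and row key; a real implementation would call the storage SDK):

```python
import time

class TableCacheSketch:
    """Toy cache over a durable key/value store (a dict standing in for a
    table): the store never expires entries itself, so the application
    stamps each value with a write time and evicts stale ones on read."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}          # (partition_key, row_key) -> (value, written_at)

    def set(self, pk, rk, value, now=None):
        self.store[(pk, rk)] = (value, now if now is not None else time.time())

    def get(self, pk, rk, now=None):
        entry = self.store.get((pk, rk))
        if entry is None:
            return None
        value, written_at = entry
        if (now if now is not None else time.time()) - written_at > self.ttl:
            del self.store[(pk, rk)]   # stale: evict and report a miss
            return None
        return value

cache = TableCacheSketch(ttl_seconds=60)
cache.set("session", "user42", {"name": "ada"}, now=0)
assert cache.get("session", "user42", now=30) == {"name": "ada"}   # fresh hit
assert cache.get("session", "user42", now=120) is None             # expired
```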
I found some materials about Azure Redis and Table storage; I hope they can help you. There is a video about Azure Redis that also includes a demo comparing Table storage and Redis, starting at around the 50th minute.
Perhaps it can serve as a reference, but detailed performance depends on your application, data records, and so on.
The pricing of Table storage depends on its capacity; please refer to the pricing details. It is much cheaper than Redis.
There are many differences you might care about, including price, performance, feature set, persistence of data, and data consistency.
Because Redis is an in-memory data store, it is pretty expensive; that is the price you pay for low latency. Check out Azure's planning FAQ for a general understanding of Redis performance in a throughput sense.
Azure Redis planning FAQ
Redis does have an optional persistence feature that you can turn on if you want your data persisted and restored when the servers have rare downtime. But it doesn't have a strong consistency guarantee.
Azure Table Storage is not a caching solution. It's a persistent storage solution that saves the data permanently on some kind of disk. Historically (disclaimer: I have not looked for the latest and greatest performance numbers) it has had much higher read and write latency. It is also strictly a key-value store model (with two-part keys). Values can have properties, but with many strict limitations around the size of objects you can store, the length of properties, and so on. These limitations are inflexible and painful if your app runs up against them.
Redis has a larger feature set. It can do key-value but also has a bunch of other data structures like sets and lists, and many apps can find ways to benefit from that added flexibility.
See 'Introduction to Redis' (redis docs) .
Cosmos DB could be yet another alternative to consider if you're leaning primarily towards Azure technologies. It is pretty expensive, but quite fast and feature-rich, while also being primarily intended as a persistent store.

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful, highly limited alternative - you need to structure all your data in a two-part hierarchy: PartitionKey and RowKey. Provided you do that (which is nigh well impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and is the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10 GB limit per collection, but in the past it was up to you to figure out how to split your data across multiple collections to avoid hitting that limit.
Instead, you can now specify a partition key and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition the data on the date value in your JSON document, so that a new partition is created each day.
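The date-based partition key idea can be sketched like this (the helper and property names are invented for illustration; any date-valued property in the document can serve as the partition key):

```python
from datetime import datetime, timezone

def log_document(message, ts=None):
    """Toy sketch: stamp each log document with a date-valued partition key,
    so each day's writes land in a new logical partition."""
    ts = ts or datetime.now(timezone.utc)
    return {
        "partitionKey": ts.strftime("%Y-%m-%d"),  # e.g. "2018-05-07"
        "message": message,
        "ts": ts.isoformat(),
    }

doc = log_document("disk full", ts=datetime(2018, 5, 7, tzinfo=timezone.utc))
assert doc["partitionKey"] == "2018-05-07"   # all of May 7th shares a partition
```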
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html