Azure Cosmos DB per-partition size limit

In an Azure Cosmos DB partitioned collection, does each partition have a size limit?
According to this old document, partitions have a size limit of 10 GB. Is that still the case?
https://azure.microsoft.com/en-in/blog/10-things-to-know-about-documentdb-partitioned-collections/

A partitioned collection is made up of individual 10GB partitions. For a given partition key value, you cannot exceed 10GB of data. This has not changed.
You'll need to pick a partition key that distributes your data across many partitions, rather than creating "hot" partitions which can fill up (at which point you'd get an error when attempting to write content).

There are two types of collection:
Single Partition Collection (10GB and 10,000 RU/s)
Partitioned Collection (250GB and 250,000 RU/s); you can increase these limits as needed by contacting the Azure team.
For a partitioned collection you must specify a partition key, chosen based on your query filters for better read performance; if you don't specify one, you get a single partition collection by default (see the sketch below).
Note: A collection is a logical space and can span multiple nodes (hence quorum) in the background, based on the RUs and other parameters. In short, it's a PaaS and the infrastructure handling is automated behind the scenes; you will not have much control over it.
More info here:
Partitioning and horizontal scaling in Azure Cosmos DB

Related

Azure Cosmos DB Indexes

I am unable to find any documentation describing how Cosmos DB indexes are organized with respect to the number of physical partitions. Suppose my logical partitions are spread across multiple physical partitions, I am not including the partition key in my filter, and I have created an index on the field I am querying on.
What would the behavior be in terms of the index? Does Cosmos create an individual index per physical partition, or a centrally maintained global index?
Can someone please explain what the behavior would be in such a case, or point to some Azure documentation which explains how this works.
A physical partition is simply a compute and storage node on which your data resides. A partition key within your WHERE clause routes the query to the partition where that data resides. Indexes reside within each partition and index the data for that partition only; partitions are shared-nothing. In addition to routing, partition keys must also be included in your indexing policy when used in queries.
A query without a partition key in the filter will fan out to every partition within a container. At small scale (< 10K RU/s or < 50GB) this isn't much of an issue, because all the data is located on a single physical partition. However, as the amount of storage and throughput grows, this query will likely become increasingly expensive, with greater latency. In short, the query will not scale, because as the size grows, so does the number of physical partitions that must be contacted to serve the same query.
More information here: Tuning query performance with Azure Cosmos DB, and here: Indexing Overview.
Perhaps this MS Learn article provides the information you are looking for, or this one for more details.
A logical partition is mapped to only one physical partition.
Physical partitions are an internal implementation detail of the system and are entirely managed by Azure Cosmos DB.
Azure Cosmos DB will automatically create new physical partitions by splitting existing ones.

How do I create a Cosmos DB container with a single partition?

I have a container that stores ~5000 documents. Each document is not very large. The most frequent query is just to select everything in this container (so that the frontend can display it in a nice table client-side). Each document has a unique ID. I was using this as the partition key (/id) for the container, but I have read that querying data like this is more efficient in terms of time and RU/s when all the data comes from the same partition, as I can avoid cross-partition queries.
Can I create a container without a partition key? Or a container that only has one partition? Will I have to add a property to every document that is the same value to force this or is there an easier way?
The number of partitions of a container is defined by the provisioned RU and data size: https://learn.microsoft.com/azure/cosmos-db/partitioning-overview#physical-partitions
So, if you create a container with less than 10K RU and keep the data size small (<50GB), it should be a single physical partition.
If you use a single value for your Partition Key, you will hit the data cap: https://learn.microsoft.com/azure/cosmos-db/sql/troubleshoot-forbidden#partition-key-exceeding-storage because your database simply won't be able to scale.
Looking at the number of documents in your container (~5000), it would ideally land in a single physical partition unless you have a huge throughput requirement above 10,000 RU/s. Assuming you have less than 10,000 RU/s provisioned, this would be in a single physical partition. You can confirm this by looking at the Metrics (Classic) option in the left-hand blade in the portal.

Partition key for a lookup table in Azure Cosmos DB Tables

I have a very simple lookup table I want to call from an Azure function.
Schema is incredibly simple:
Name | Value 1 | Value 2
Name will be unique, but value 1 and value 2 will not be. There is no other data in the lookup table.
For an Azure Table you need a partition key and a row key. Obviously the RowKey would be the Name field.
What exactly should I use for Partition Key?
Right now, I'm using a constant because there won't be a ton of data (maybe a couple hundred rows at most) but using a constant seems to go against the point.
This answer applies to all Cosmos DB containers, including Tables.
When does it make sense to store your Cosmos DB container in a single partition (use a constant as the partition key)?
If you are sure the data size of your container will always remain well under 10GB.
If you are sure the throughput requirement for your container will always remain under 10,000 RU/s (RU per second).
If either of the above conditions are false, or if you are not sure about future growth of data size or throughput requirements then using a partition key based on the guidelines below will allow the container to scale.
How partitioning works in Cosmos DB
Cosmos groups container items into a set of logical partitions based on the partition key. These logical partitions are then mapped to physical partitions. A physical partition is the unit of compute/storage which makes up the underlying database infrastructure.
You can determine how your data is split into logical partitions by your choice of partition key. You have no control over how your logical partitions are mapped to physical partitions, Cosmos handles this automatically and transparently.
Distributing your container across a large number of physical partitions is the way Cosmos allows the container to scale to virtually unlimited size and throughput.
Each logical partition can contain a maximum of 10GB of data. An unpartitioned container can have a maximum throughput of 10,000 RU/s, which implies a limit of 10,000 RU/s per logical partition.
The RU/s allocated to your container are evenly split across all physical partitions hosting the container's data. For instance, if your container has 4,000 RU/s allocated and its logical partitions are spread across 4 physical partitions, then each physical partition will have 1,000 RU/s allocated to it. This also means that if one of your physical partitions is under heavy load or 'hot', it will be rate-limited at 1,000 RU/s, not at 4,000. This is why it is very important to choose a partition key that spreads your data, and access to that data, evenly across partitions.
If your container is in a single logical partition, it will always be mapped to a single physical partition and the entire allocation of RU/s for the container will always be available.
All Cosmos DB transactions are scoped to a single logical partition, and the execution of a stored procedure or trigger is also scoped to a single logical partition.
How to choose a good partition key
Choose a partition key that will evenly distribute your data across logical partitions, which in turn helps ensure the data is evenly mapped across physical partitions. This prevents 'bottleneck' or 'hot' partitions, which cause rate-limiting and may increase your costs.
Choose a partition key that will be the filter criterion for a high percentage of your queries. By providing the partition key as a filter to your query, Cosmos can efficiently route the query to the correct partition. If the partition key is not supplied, the query will 'fan out' to all partitions, which increases your RU cost and may hinder performance. If you frequently filter based on multiple fields, see this article for guidance.
Summary
The primary purpose of partitioning your containers in Cosmos DB is to allow the containers to scale in terms of both storage and throughput.
Small containers which will not grow significantly in data size or throughput requirements can use a single partition.
Large containers, or containers expected to grow in data size or throughput requirements, should be partitioned using a well-chosen partition key.
The choice of partition key is critical and may significantly impact your ability to scale, your RU cost and the performance of your queries.

Throughput value for Azure Cosmos db

I am confused about how partitioning affects the size limit and throughput value for Azure Cosmos DB (in our case, we are using DocumentDB). If I understand the documentation correctly:
For a partitioned collection, does the 10GB storage limit apply to each partition?
Does the throughput value, e.g. 400 RU/s, apply to each partition rather than the collection?
Whether you use single-partition collections or multi-partition collections, each partition can be up to 10GB. This means that a single-partition collection cannot exceed that size, whereas a multi-partition collection can.
Taken from Azure Cosmos DB FAQ:
What is a collection?
A collection is a group of documents and their associated JavaScript application logic. A collection is a billable entity, where the cost is determined by the throughput and used storage. Collections can span one or more partitions or servers and can scale to handle practically unlimited volumes of storage or throughput.
Collections are also the billing entities for Azure Cosmos DB. Each collection is billed hourly, based on the provisioned throughput and used storage space. For more information, see Azure Cosmos DB Pricing.
Billing is per collection, where one collection can have one or more partitions. Since Azure allocates partitions to host your collection, the amount of RUs needs to be defined per collection. Otherwise a customer with lots and lots of partitions would get far more RUs than a different customer with an equal collection but far fewer partitions.
For more info, see the quotes below:
Taken from Azure Cosmos DB Pricing:
Provisioned throughput
At any scale, you can store data and provision throughput capacity. Each container is billed hourly based on the amount of data stored (in GBs) and throughput reserved in units of 100 RUs/second, with a minimum of 400 RUs/second. Unlimited containers have a minimum of 100 RUs/second per partition.
Taken from Request Units in Azure Cosmos DB:
When starting a new collection, table or graph, you specify the number of request units per second (RU per second) you want reserved. Based on the provisioned throughput, Azure Cosmos DB allocates physical partitions to host your collection and splits/rebalances data across partitions as it grows.
The other answers here provide a great starting point on throughput provisioning, but fail to touch on an important point that isn't mentioned often in the docs.
Your throughput is actually divided across the number of physical partitions in your collection. So a multi-partition collection provisioned with 1,000 RU/s and 10 physical partitions actually gets 100 RU/s per partition. If you have hot partitions that are accessed more frequently, you'll receive throttling errors even though you haven't exceeded the total RUs assigned to the collection.
For a single-partition collection you obviously get the full RUs assigned to that partition, since it's the only one.
If you're using a multi-partition collection, you should strive to pick a partition key with an even access pattern, so that your workload can be evenly distributed across the underlying partitions without bottlenecking.
For a partitioned collection, does the 10GB storage limit apply to each partition?
That is correct. Each partition in a partitioned collection can hold a maximum of 10GB.
Does the throughput value, e.g. 400 RU/s, apply to each partition rather than the collection?
The throughput is at the collection level, not the partition level. Furthermore, the minimum for a partitioned collection is 2,500 RU/s, not 400 RU/s; 400 RU/s is the default for a non-partitioned collection.

Azure Cosmos DB - Understanding Partition Key

I'm setting up our first Azure Cosmos DB. I will be importing into the first collection the data from a table in one of our SQL Server databases. In setting up the collection, I'm having trouble understanding the meaning of and the requirements around the partition key, which I specifically have to name while setting up this initial collection.
I've read the documentation here: (https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-partition-data) and still am unsure how to proceed with the naming convention of this partition key.
Can someone help me understand how I should be thinking in naming this partition key? See the screenshot below for the field I'm trying to fill in.
In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URLs, and several other secondary identifiers for each record's URL. Not sure if any of that information has any bearing on how I should name my partition key.
EDIT: I've added a screenshot of several records from the table from which I'm importing, per request from @Porschiey.
Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.
But, in a nutshell:
The PartitionKey is a property that will exist on every single object, and is best used to group similar objects together.
Good examples include Location (like City), Customer Id, Team, and more. Naturally, it wildly depends on your solution; perhaps if you were to post what your object looks like, we could recommend a good partition key.
EDIT: Should be noted that PartitionKey isn't required for collections under 10GB. (thanks David Makogon)
* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.
The partition key acts as a logical partition.
Now, what is a logical partition, you may ask? A logical partition may vary based upon your requirements; suppose you have data that can be categorized on the basis of your customers. Then the customer "Id" will act as the logical partition, and info for the users will be placed according to their customer Id.
What effect does this have on the query?
While querying, you pass the partition key in the feed options and don't include it in your filter.
E.g.: if your query was
SELECT * FROM T WHERE T.CustomerId = 'CustomerId'
it will now be
// The partition key is supplied via FeedOptions rather than the WHERE clause:
var options = new FeedOptions { PartitionKey = new PartitionKey(CustomerId) };
var query = _client.CreateDocumentQuery(CollectionUri, "SELECT * FROM T", options).AsDocumentQuery();
I've put together a detailed article on this: Azure Cosmos DB. Partitioning.
What's a logical partition?
Cosmos DB is designed to scale horizontally by distributing data between physical partitions (PP) (think of a PP as a separately deployable, self-sufficient underlying node), while a logical partition (LP) is a bucket of documents with the same characteristic (partition key value) which is always stored entirely on the same PP. So an LP can't have part of its data on PP1 and another part on PP2.
There are two main limitations on physical partitions:
Max throughput: 10k RUs
Max data size (sum of the sizes of all LPs stored on the PP): 50GB
A logical partition has one limit: 20GB in size.
NOTE: Size limits have grown since the initial releases of Cosmos DB, and I won't be surprised if they increase again soon.
How to select right partition key for my container?
Based on the Microsoft recommendation, for maintainable data growth you should select a partition key with high cardinality (like the id of the document or a composite field). The main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
It is critical to analyze the application's data consumption pattern when choosing the right partition key. In very rare scenarios larger partitions might work, though such solutions should implement data archiving from the get-go to keep the DB size in check (see the example below explaining why). Otherwise you should be prepared for increasing operational costs just to maintain the same DB performance, plus potential PP data skew, unexpected "splits", and "hot" partitions.
Having a very granular and small partitioning strategy will lead to an RU overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when consuming data distributed between a number of physical partitions (PPs), but it will be negligible compared to the issues that occur when data starts growing beyond 50, 100, 150GB.
Why large partitions are a terrible choice in most cases, even though the documentation says "select whatever works best for you"
The main reason is that Cosmos DB is designed to scale horizontally, and the provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs due to exceeding the 50GB size, your max throughput for the existing PPs, as well as the two newly created PPs, will be lower than it was before the split.
So imagine the following scenario (consider days as the measure of time between actions):
You create a container provisioned with 10k RUs and CustomerId as the partition key (which generates one underlying physical partition, PP1). The maximum throughput per PP is 10k/1 = 10k RUs.
Gradually adding data to the container, you end up with 3 big customers holding C1 [10GB], C2 [20GB] and C3 [10GB] of invoices.
When another customer is onboarded to the system with C4 [15GB] of data, Cosmos DB has to split PP1's data into two newly created partitions, PP2 (30GB) and PP3 (25GB). The maximum throughput per PP is 10k/2 = 5k RUs.
Two more customers, C5 [10GB] and C6 [15GB], are added to the system, and both end up on PP2, which leads to another split -> PP4 (20GB) and PP5 (35GB). The maximum throughput per PP is now 10k/3 = 3.333k RUs.
IMPORTANT: As a result, on [Day 2] C1's data could be queried with up to 10k RUs, but on [Day 4] with at most 3.333k RUs, which directly impacts the execution time of your queries.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (12.03.21).
Cosmos DB can be used to store virtually any amount of data. How does it do this in the back end? Using the partition key. Is it the same as the primary key? No.
Primary key: uniquely identifies the data.
Partition key: helps with sharding of data (for example, one partition for the city New York when city is the partition key).
Partitions have a limit of 10GB, and the better we spread the data across partitions, the more we can use. Though it will eventually need more connections to get data from all partitions: getting data from a single partition in a query will always be faster than getting data from multiple partitions.
The partition key is used for sharding; it acts as a logical partition for your data and provides Cosmos DB with a natural boundary for distributing data across partitions.
You can read more about it here: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Each partition on a table can store up to 10GB (and a single table can store as many document schema types as you like). You have to choose your partition key such that all the documents stored against that key (and therefore falling into that partition) stay under that 10GB limit.
I'm thinking about this too right now - should the partition key be a date range of some type? In that case, it would really depend on how much data is stored in a given period of time.
You are defining a logical partition.
Underneath, the data is physically split into physical partitions by Azure.
Ideally the partition key should be a primary key, or a field with high cardinality, to ensure proper distribution, with the self-generated id field within that partition also set to the primary key; that will make fetching a document by id much faster.
You cannot change a partition key once the container is created.
Looking at the dataset, captureId is a good candidate for the partition key, with id set manually to this field rather than an auto-generated Cosmos one.
There is documentation available from Microsoft about partition keys. In my opinion, you need to check the queries or operations that you plan to perform against Cosmos DB. Are they read-heavy or write-heavy? If read-heavy, it is ideal to choose a partition key that appears in the WHERE clause of your queries; if it is a write-heavy workload, look for a key with high cardinality.
Point reads/writes are always preferable, since they consume far fewer RUs than running other queries.
