CosmosDb search on the Index vs partition Key - azure

By default in Cosmos DB, all properties in documents are indexed, so why should I care about querying on the partition key when queries on indexed properties work perfectly well and cost next to nothing?
I have a Cosmos DB container with one million documents like the one below, each containing an array. The partition key is "tankId", e.g.:
{
  "id": "67acdb16-80dd-4a6c-a5b0-118d5f5fdb97",
  "tankId": "67acdb16-80dd-4a6c-a5b0-118d5f5fdb97",
  "UserIds": [
    "905336a5-bf96-444f-bb11-3eedb65c3760",
    "432270f5-780f-401b-9772-72ec96166be1",
    "cfecdf7e-5067-46b1-ab4e-25ca7d597248"
  ]
}
If I run a query on "UserIds" across this million documents, which is not the partition key but an indexed property, it takes only 3.32 RU! Wow.
SELECT *
FROM c
WHERE ARRAY_CONTAINS(c.UserIds, "905336a5-bf96-444f-bb11-3eedb65c3760")
Is it good practice to run that kind of query? I am a little worried about my design.

It starts to matter once your number of physical partitions starts growing. Using the partition key allows Cosmos to map the query to a logical partition that resides in a single physical partition. The query then isn't a so-called 'cross-partition query', and it doesn't have to check the indexes of the other physical partitions (which would also consume RUs).
In your case you are talking about a million documents, which likely take up a lot less than 50GB of data (the max size of a physical partition), so it's all stored in the same physical partition. Therefore you won't see any noticeable effect on RU usage.
So, to answer your underlying question of whether you should make any changes: Is your database mostly read-heavy? Do you have a property that is often used for querying? Are you sure that your partitions will stay under the logical partition size limit (20GB)? If yes, then you should likely consider using that property as the partition key in your design. Even then it'll only matter once your data starts to split across physical partitions.
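A rough way to reason about when this starts to matter is to estimate how many physical partitions a container needs. The sketch below only uses the figures quoted in these threads (50GB and 10k RU/s per physical partition) and assumes the count is driven by whichever limit is hit first; the function name and the sample numbers are made up for illustration, and the real service may allocate partitions differently.

```python
import math

# Limits as quoted in this thread; treat them as assumptions that may change.
MAX_PARTITION_STORAGE_GB = 50
MAX_PARTITION_THROUGHPUT_RU = 10_000


def estimate_physical_partitions(storage_gb: float, provisioned_ru: int) -> int:
    """Rough lower bound on the number of physical partitions (hypothetical helper)."""
    by_storage = math.ceil(storage_gb / MAX_PARTITION_STORAGE_GB)
    by_throughput = math.ceil(provisioned_ru / MAX_PARTITION_THROUGHPUT_RU)
    return max(1, by_storage, by_throughput)


# One million documents at a hypothetical 1 KB each is roughly 1 GB.
small = estimate_physical_partitions(storage_gb=1.0, provisioned_ru=400)
# A hypothetical larger container: 400 GB of data, 30k RU/s provisioned.
large = estimate_physical_partitions(storage_gb=400, provisioned_ru=30_000)
print(small, large)  # prints: 1 8
```

With a single physical partition, a query without the partition key has nowhere else to fan out to, which is consistent with the 3.32 RU seen in the question.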

Related

Azure Cosmos Db Indexes

I am unable to find any documentation on how Cosmos DB indexes are organized with respect to the number of physical partitions. Suppose my data is split across multiple physical partitions, I do not include the partition key in the filter, and I have created an index on the field I am querying on.
What would the behavior be in terms of the index? Does Cosmos create an individual index per physical partition, or a centrally maintained global index?
Can someone please explain what the behavior would be in such a case, or point to some Azure documentation that explains how this works?
A physical partition is simply a compute and storage node on which your data resides. A partition key within your WHERE clause routes the query to the partition where that data resides. Indexes reside within each partition and index the data for that partition only. Partitions are shared-nothing. In addition to routing, partition keys must also be included in your index policy when used in queries.
A query without a partition key in the filter will fan out to every partition within a container. At small scales (< 10K RU/s or < 50GB) this isn't much of an issue because the data is all located on a single physical partition. However, as the amount of storage and throughput grows, this query will likely become increasingly expensive, with greater latency. In short, the query will not scale. This is because as the size grows, so does the number of physical partitions that must be contacted to serve the same query.
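To make the "local index per partition" idea concrete, here is a toy in-memory model. It is only a sketch: the class names are hypothetical, routing uses a simple hash-modulo rather than whatever Cosmos DB does internally, and the document shape (tankId as partition key, a UserIds array) is borrowed from the first question above.

```python
import hashlib
from collections import defaultdict


class Partition:
    """One shared-nothing partition with its own local index on UserIds."""

    def __init__(self):
        self.docs = {}
        # user id -> ids of documents stored on THIS partition only
        self.user_index = defaultdict(set)

    def insert(self, doc):
        self.docs[doc["id"]] = doc
        for user_id in doc["UserIds"]:
            self.user_index[user_id].add(doc["id"])


class Container:
    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

    def _route(self, partition_key):
        # Simplified routing: hash of the partition key modulo partition count.
        digest = hashlib.md5(partition_key.encode()).hexdigest()
        return self.partitions[int(digest, 16) % len(self.partitions)]

    def insert(self, doc):
        self._route(doc["tankId"]).insert(doc)

    def query_by_user(self, user_id, tank_id=None):
        """Return (matching docs, number of partitions contacted)."""
        # With a partition key we consult one local index; without it, all of them.
        targets = [self._route(tank_id)] if tank_id is not None else self.partitions
        hits = []
        for partition in targets:
            for doc_id in sorted(partition.user_index.get(user_id, ())):
                doc = partition.docs[doc_id]
                if tank_id is None or doc["tankId"] == tank_id:
                    hits.append(doc)
        return hits, len(targets)


container = Container(partition_count=4)
for n in range(100):
    container.insert({"id": f"doc-{n}", "tankId": f"tank-{n}", "UserIds": [f"user-{n % 10}"]})

docs_all, contacted_all = container.query_by_user("user-3")
docs_one, contacted_one = container.query_by_user("user-3", tank_id="tank-3")
print(len(docs_all), contacted_all)  # 10 matches, 4 partitions contacted
print(len(docs_one), contacted_one)  # 1 match, 1 partition contacted
```

The index lookup itself is cheap in both cases; what grows with the number of partitions is how many separate indexes have to be asked.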
More information here, Tuning query performance with Azure Cosmos DB and here, Indexing Overview
Perhaps this ms learn article provides the information you are looking for or this one for more details.
A Logical partition is mapped to only one physical partition;
Physical partitions are an internal implementation of the system and they are entirely managed by Azure Cosmos DB.
Azure Cosmos DB will automatically create new physical partitions by splitting existing ones
Kind regards

Cassandra : optimal partition size

I plan to have a simple table like this (simple key/value use case) :
CREATE TABLE my_data (
id bigint,
value blob,
PRIMARY KEY (id)
)
With the following characteristics:
as you can see, one partition = one blob (value)
each value is always accessed by the corresponding key
each value is a blob of 1MB max (average also 1MB)
with 1MB blobs, that gives 60 million partitions
What do you think about the 1MB blob? Is that OK for Cassandra?
I could indeed divide my data further, to work with 1KB blobs, but in that case it will lead to many more partitions in Cassandra (more than 600 million?), and many more partitions to retrieve for the same client-side query.
Thanks
The general recommendation is to stay as close as possible to 100MB partition sizes, although this isn't a hard limit. There are some edge cases where partitions can get beyond 1GB and still be acceptable for some workloads, as long as you're willing to accept the tradeoffs.
In your case, however, keeping blobs to 1MB is a strong recommendation, but again not a hard limit. You will notice a significant performance hit for larger blob sizes if you run a reasonable load test.
600 million partitions is not a problem at all. Cassandra is designed to handle billions, trillions of partitions and beyond. Cheers!
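For what it's worth, the partition counts in the question can be checked with simple arithmetic. This sketch assumes the total data volume is exactly 60 million blobs of 1MB each (about 60TB) and uses decimal units to keep the numbers easy to follow; the helper name is made up.

```python
MB = 1_000_000  # decimal units, for readability
KB = 1_000

partitions_at_1mb = 60_000_000            # figure from the question
total_bytes = partitions_at_1mb * 1 * MB  # about 60 TB, assuming every blob is 1 MB


def partition_count(total_bytes: int, blob_bytes: int) -> int:
    """Number of one-blob partitions needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // blob_bytes)


print(partition_count(total_bytes, 1 * MB))    # 60000000
print(partition_count(total_bytes, 100 * KB))  # 600000000
print(partition_count(total_bytes, 1 * KB))    # 60000000000
```

Under that assumption, 600 million partitions would correspond to roughly 100KB blobs, while 1KB blobs would mean around 60 billion partitions, and reading back one original 1MB value would touch about a thousand of them.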

Azure CosmosDB view partition data

Does anyone know how to check what kind of data is on a certain partition in Azure Cosmos/MongoDB?
I have one partition exceeding the storage limit, but I can't figure out why. I have one collection, with around 70 partitions. All of the partitions are 3.5GB or less, but one partition is 10GB, causing problems as it exceeds the maximum amount of data.
Yesterday, I removed about 15% of the data for that partition, but it still claims to be at 10GB.
How can I check which documents are residents of the particular partition?
Rens Groenveld, firstly, I'd like to apologize for the misunderstanding in my comment. After looking at your screenshot and combining it with this blog: the partition key (in the 2nd pic) refers to the logical partition, and the partition name (in the 3rd pic) refers to the physical partition.
You specify the partition key to create a logical partition that
guarantees to keep items with the same hash of the key together.
Cosmos DB manages the physical partitions based on needs. In the
portal, you can see that although we have a few dozen partition keys,
there are only a handful of partitions.
You can learn the difference between logical and physical partitions from this official document.
Unlike logical partitions, physical partitions are an internal
implementation of the system. You can't control their size, placement,
the count, or the mapping between the logical partitions and the
physical partitions. However, you can control the number of logical
partitions and the distribution of data and throughput by choosing the
right partition key.
Back to your issue: your logically partitioned data is kept together logically, but Cosmos DB balances it physically, which we can't influence. Data for the same logical partition may not reside on the same physical partition. I assume that's the reason removing the data didn't reduce the size. I suggest getting in touch with the Azure Cosmos DB team to see what they can do about your physical partition.
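If the goal is to find out which partition key values account for most of the data, one client-side approach is to group documents by their partition key and add up approximate sizes. The sketch below works on documents already loaded into memory; the field name "tenant" and the sample documents are hypothetical, and the JSON-encoded length is only an approximation of what the service actually stores.

```python
import json
from collections import Counter


def size_by_partition_key(documents, key_field):
    """Approximate bytes per partition key value, largest first."""
    sizes = Counter()
    for doc in documents:
        sizes[doc[key_field]] += len(json.dumps(doc).encode("utf-8"))
    return sizes.most_common()


# Hypothetical sample; in practice these would be read from the collection.
sample = [
    {"id": "1", "tenant": "a", "payload": "x" * 1000},
    {"id": "2", "tenant": "b", "payload": "x"},
    {"id": "3", "tenant": "a", "payload": "x" * 10},
]
ranking = size_by_partition_key(sample, key_field="tenant")
print(ranking[0][0])  # the key value holding the most data, "a" for this sample
```

For a large collection you would probably want to stream or sample documents rather than load them all at once.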

Azure CosmosDb create partition only

And probably I already know the answer, yet I would love some feedback.
I have an Azure Cosmos DB collection without a partition key (empty). I want to create one because the RUs are too high, so that performance improves.
My would-be partition key is Date (20181005).
My question is: if I don't send the Date as part of the queries (most of the time we request the object by ID), will partitioning help performance?
I believe it will, since it will physically organize the documents better; however, I would love some feedback.
Thanks
The document id is only unique within its own logical partition. You can have multiple documents with the exact same id property as long as they are in different logical partitions.
If you partition your collection you have to deal with 2 (of many) realities.
The logical partition size cannot exceed 10GB
In order to have efficient queries and reads you have to provide the partition key value alongside your operations.
You can still do any querying operation using a cross partition query but this is something that should be avoided if possible. If you see yourself needing to use a cross partition query frequently then there is a problem with your partitioning strategy.
The bottom line is that your query performance will be far worse if you don't provide a partition key when querying.
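Here is a small in-memory sketch of why the id alone is not enough. It models a store keyed by (partition key, id); the class and method names are invented for illustration and are not part of any Cosmos DB SDK.

```python
class PartitionedStore:
    """Toy store where ids are unique only within a logical partition."""

    def __init__(self):
        self._partitions = {}  # partition key -> {id -> document}

    def upsert(self, partition_key, doc):
        self._partitions.setdefault(partition_key, {})[doc["id"]] = doc

    def point_read(self, partition_key, doc_id):
        """Direct lookup: exactly one logical partition is touched."""
        return self._partitions.get(partition_key, {}).get(doc_id)

    def find_by_id(self, doc_id):
        """Cross-partition lookup: every logical partition has to be visited."""
        matches, visited = [], 0
        for docs in self._partitions.values():
            visited += 1
            if doc_id in docs:
                matches.append(docs[doc_id])
        return matches, visited


store = PartitionedStore()
store.upsert("20181005", {"id": "order-1", "total": 10})
store.upsert("20181006", {"id": "order-1", "total": 25})  # same id, different partition

print(store.point_read("20181005", "order-1"))  # {'id': 'order-1', 'total': 10}
matches, visited = store.find_by_id("order-1")
print(len(matches), visited)  # 2 matches after visiting 2 partitions
```

With Date as the partition key and lookups mostly by id, every request would behave like find_by_id here unless the date is also supplied.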

Azure Cosmos DB - Understanding Partition Key

I'm setting up our first Azure Cosmos DB - I will be importing into the first collection, the data from a table in one of our SQL Server databases. In setting up the collection, I'm having trouble understanding the meaning and the requirements around the partition key, which I specifically have to name while setting up this initial collection.
I've read the documentation here: (https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-partition-data) and still am unsure how to proceed with the naming convention of this partition key.
Can someone help me understand how I should be thinking in naming this partition key? See the screenshot below for the field I'm trying to fill in.
In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URL's and several other secondary identifiers for that record's URL. Not sure if any of that information has any bearing on how I should name my Partition Key.
EDIT: I've added a screenshot of several records from the table from which I'm importing, per request from @Porschiey.
Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.
But, in a nutshell:
The PartitionKey is a property that will exist on every single object that is best used to group similar objects together.
Good examples include Location (like City), Customer Id, Team, and more. Naturally, it wildly depends on your solution; so perhaps if you were to post what your object looks like we could recommend a good partition key.
EDIT: Should be noted that PartitionKey isn't required for collections under 10GB. (thanks David Makogon)
* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.
Partition key acts as a logical partition.
Now, what is a logical partition, you may ask? It depends on your requirements. Suppose your data can be categorized by customer: the customer "Id" then acts as the logical partition, and the info for each user is placed according to their customer Id.
What effect does this have on the query?
When querying, you pass the partition key through the feed options rather than including it in your filter.
E.g., if your query was
SELECT * FROM T WHERE T.CustomerId = 'CustomerId';
it now becomes
// Scope the query to a single logical partition via FeedOptions instead of a WHERE filter.
var options = new FeedOptions { PartitionKey = new PartitionKey(CustomerId) };
var query = _client.CreateDocumentQuery(CollectionUri, "SELECT * FROM T", options).AsDocumentQuery();
I've put together a detailed article here Azure Cosmos DB. Partitioning.
What's a logical partition?
Cosmos DB is designed to scale horizontally by distributing data between physical partitions (PP; think of each as a separately deployable, self-sufficient underlying node). A logical partition (LP) is a bucket of documents sharing the same characteristic (the partition key value), and it is stored entirely on one PP. So an LP can't have part of its data on PP1 and another part on PP2.
There are two main limitations on physical partitions:
Max throughput: 10k RUs
Max data size (sum of sizes of all LPs stored in this PP): 50GB
A logical partition has one limit: 20GB in size.
NOTE: The size limits have grown since the initial releases of Cosmos DB, and I won't be surprised if they increase again soon.
How do I select the right partition key for my container?
Based on the Microsoft recommendation for maintainable data growth, you should select a partition key with the highest cardinality (like the Id of the document or a composite field), for this main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
It is critical to analyze your application's data consumption pattern when choosing the right partition key. In very rare scenarios larger partitions might work, though such solutions should implement data archiving from the get-go to keep the DB size under control (see the example below explaining why). Otherwise you should be ready for increasing operational costs just to maintain the same DB performance, plus potential PP data skew, unexpected "splits" and "hot" partitions.
A very granular, small-partition strategy leads to some RU overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when consuming data distributed across a number of physical partitions (PPs), but it is negligible compared to the issues that occur when data grows beyond 50, 100, 150GB.
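The effect of cardinality on distribution can be sketched with a simple hash-bucket experiment. This is an illustration only: it assumes hash-modulo placement over a fixed number of physical partitions, which is a simplification of what Cosmos DB does internally, and the city names and document ids are made-up sample keys.

```python
import hashlib


def spread(keys, partition_count):
    """Count how many items land on each partition under hash-modulo placement."""
    counts = [0] * partition_count
    for key in keys:
        digest = hashlib.md5(str(key).encode()).hexdigest()
        counts[int(digest, 16) % partition_count] += 1
    return counts


cities = ["NYC", "LA", "SF"]
low_cardinality = spread((cities[n % 3] for n in range(9000)), partition_count=8)
high_cardinality = spread((f"doc-{n}" for n in range(9000)), partition_count=8)

print(low_cardinality)   # at most 3 of the 8 partitions receive any data
print(high_cardinality)  # all 8 partitions receive a roughly similar share
```

With only three distinct key values, most partitions stay empty no matter how many documents are written, while a document-id-like key fills all of them fairly evenly.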
Why large partitions are a terrible choice in most cases, even though the documentation says "select whatever works best for you"
The main reason is that Cosmos DB is designed to scale horizontally, and provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs due to exceeding the 50GB size, the max throughput of the existing PPs as well as of the two newly created PPs will be lower than it was before the split.
So imagine the following scenario (consider days as a measure of time between actions):
[Day 1] You've created a container with 10k RUs provisioned and CustomerId as the partition key (which generates one underlying PP1). Maximum throughput per PP is 10k/1 = 10k RUs
[Day 2] Gradually adding data to the container, you end up with 3 big customers with C1[10GB], C2[20GB] and C3[10GB] of invoices
[Day 3] When another customer is onboarded with C4[15GB] of data, Cosmos DB has to split the PP1 data into two newly created partitions, PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k/2 = 5k RUs
[Day 4] Two more customers, C5[10GB] and C6[15GB], are added to the system and both end up in PP2, which leads to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs
IMPORTANT: As a result, on [Day 2] C1 data was queried with up to 10k RUs, but on [Day 4] with only up to 3.333k RUs, which directly impacts the execution time of your query.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (12.03.21).
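The arithmetic of the scenario above can be replayed in a few lines. This is just the division from the scenario (provisioned throughput shared equally between physical partitions); the function name and the step labels are mine.

```python
def max_ru_per_partition(provisioned_ru: int, physical_partitions: int) -> float:
    """Throughput ceiling of one physical partition under an even split."""
    return provisioned_ru / physical_partitions


# (step, number of physical partitions after that step), following the scenario
timeline = [
    ("container created", 1),
    ("C1, C2, C3 loaded", 1),
    ("C4 onboarded, first split", 2),
    ("C5, C6 onboarded, second split", 3),
]

for step, partition_count in timeline:
    ceiling = round(max_ru_per_partition(10_000, partition_count))
    print(f"{step}: {ceiling} RU/s per physical partition")
# The last line reports 3333 RU/s, down from 10000 RU/s at the start.
```

The provisioned total never changed; only the share available to the partition holding C1 did.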
Cosmos DB can be used to store any amount of data. The way it does this in the back end is by using the partition key. Is it the same as a primary key? No.
Primary key: uniquely identifies the data.
Partition key: helps with sharding the data (for example, one partition for the city New York when city is the partition key).
Partitions have a limit of 10GB, and the better we spread the data across partitions, the more of it we can use. It will, however, eventually need more connections to get data from all partitions. Example: getting data from a single partition in a query will always be faster than getting data from multiple partitions.
Partition Key is used for sharding, it acts as a logical partition for your data, and provides Cosmos DB with a natural boundary for distributing data across partitions.
You can read more about it here: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Each partition on a table can store up to 10GB (and a single table can store as many document schema types as you like). You have to choose your partition key though such that all the documents that get stored against that key (so fall into that partition) are under that 10GB limit.
I'm thinking about this too right now - so should the partition key be a date range of some type? In that case, it would really depend on how much data is getting stored in a period of time.
You are defining a logical partition.
Underneath, physically the data is split into physical partitions by Azure.
Ideally a partitionKey should be a primary key, or a field with high cardinality, to ensure proper distribution. Also setting the id field within that partition to the primary key (rather than a self-generated one) makes fetching a document by id much faster.
You cannot change a partitionKey once container is created.
Looking at the dataset, captureId is a good candidate for partitionKey, with id set manually to this field, and not an auto generated cosmos one.
Microsoft provides documentation about partition keys. In my opinion, you need to look at the queries or operations that you plan to perform with Cosmos DB. Are they read-heavy or write-heavy? If read-heavy, it is ideal to choose a partition key that will be used in the WHERE clause of your queries; if write-heavy, look for a key with high cardinality.
Point reads/writes are always better, since they consume far fewer RUs than running other queries.
