DocumentDB PartitionKey and performance - azure

I have a scenario where I store large amounts of third party data for ad-hoc analysis by business users. Most queries against the data will be complicated, using multiple self-joins, projections, and ranges.
When it comes to picking a PartitionKey for use in Azure DocumentDB, I see people advising to use a logical separator such as TenantId, DeviceId, etc.
Given the parallel nature of DocumentDB, however, I was curious how it would handle a PartitionKey based either on some sort of GUID or large integer so that, during large reads, it would be highly parellelized.
With that in mind, I devised a test with two collections:
test-col-1
PartitionKey is TenantId with roughly 100 possible values
test-col-2
PartitionKey is unique value assigned by third party that follows the pattern "AB1234568". Guaranteed to be globally unique by the third party.
Both collections are set to 100,000 RUs.
In my experiment I loaded both collections with roughly 2,000 documents. Each document is roughly 20 KB in size and is highly denormalized. Each document is an order, which contains multiple jobs, each of which contain users, prices, etc.
Example query:
SELECT
orders.Attributes.OrderNumber,
orders.Attributes.OpenedStamp,
jobs.SubOrderNumber,
jobs.LaborTotal.Amount As LaborTotal,
jobs.LaborActualHours As LaborHours,
jobs.PartsTotal.Amount As PartsTotal,
jobs.JobNumber,
jobs.Tech.Number As TechNumber,
orders.Attributes.OrderPerson.Number As OrderPersonNumber,
jobs.Status
FROM orders
JOIN jobs IN orders.Attributes.Jobs
JOIN tech IN jobs.Techs
WHERE orders.TenantId = #TentantId
AND orders.Attributes.Type = 1
AND orders.Attributes.Status IN (4, 5)";
In my testing I adjusted the following settings:
Default ConnectionPolicy
Best practices ConnectionPolicy
ConnectionMode.Direct, Protocol.Tcp
Various MaxDegreeOfParallelism values
Various MaxBufferedItemCount
The collection with the GUID PartitionKey was queried with EnableCrossPartitionQuery = true. I am using C# and the .NET SDK v1.14.0.
In my initial tests with default settings, I found that querying the collection with TentantId as the PartitionKey was faster, with it taking on average 3,765 ms compared to 4,680 ms on the GUID-keyed collection.
When I set the ConnectionPolicy to Direct with TCP, I discovered TenantID collection query times decreased by nearly 1000 ms to an average of 2,865 ms while the GUID collection increased by about 800 ms to an average of 5,492 ms.
Things started getting interesting when I started playing around with MaxDegreeOfParellelism and MaxBufferedItemCount. The TentantID collection query times were generally unaffected because the query wasn't cross-collection, however the GUID collection sped up considerably, reaching values as fast as 450 ms (MaxDegreeOfParellelism = 2000, MaxBufferedItemCount = 2000).
Given these observations, why would you not want to make the PartitionKey as broad a value as possible?

Things started getting interesting when I started playing around with MaxDegreeOfParellelism and MaxBufferedItemCount. The TentantID collection query times were generally unaffected because the query wasn't cross-collection, however the GUID collection sped up considerably, reaching values as fast as 450 ms (MaxDegreeOfParellelism = 2000, MaxBufferedItemCount = 2000).
MaxDegreeOfParallelism could set the maximum number of concurrent tasks enabled by ParallelOptions instance. As I known, this is a client side parallelism and it would cost your CPU / Memory resources that you have on your site.
Given these observations, why would you not want to make the PartitionKey as broad a value as possible?
For write operations, we could scale across partition keys, in order to use the throughout that you have provisioned. While for read operations, we need to minimize cross-partition lookups for a lower latency.
Also, as this official document mentioned:
The choice of the partition key is an important decision that you have to make at design time. You must pick a property name that has a wide range of values and has even access patterns.
It is a best practice to have a partition key with many distinct values (100s-1000s at a minimum).
To achieve the full throughput of the container, you must choose a partition key that allows you to evenly distribute requests among some distinct partition key values.
For more details, you could refer to How to partition and scale in Azure Cosmos DB and this channel 9 tutorial about Azure DocumentDB Elastic Scale - Partitioning.

Related

Cosmos DB partition key and query design for sequential access

We would like to store a set of documents in Cosmos DB with a primary key of EventId. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks. i.e. PartitionKey = int(EventId / PartitionSize). This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId and int(EventId / PartitionSize), then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
Edit:
Clarification of a couple of points:
Clients will access the events by specifying a list of CustomerId's, the last EventId they received, and a maximum number of records to retrieve.
For this reason, the use of EventId alone won't perform well, as it will result in a cross partition query (i.e. WHERE EventId > LastEventId).
The system will probably be writing on the order of 1GB a day, in 15 minute increments.
It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.
So first thing first, logical partitions size limit has now been increased to 20GB, please see here.
You can use EventID as a partition as well, as you have limit of logical partition's size in GB but you have no limit on amount of logical partitions. So using EventID is fine, you will get a point to point read which is very fast if you query using the EventID. Now you mention using this way you will have to do cross-partition queries, can you explain how?
Few things to keep in mind though, Cosmos DB is not really meant for storing this kind of Log based data as it stores everything in SSDs so please calculate how much is your 1 document size and how many in a second would you have to store then how much in a day to how much in a month. You can use TTL to delete from Cosmos when done though and for long term storage store it in Azure BLOB Storage and for fast retrievals use Azure Search to query the data in BLOB by using CustomerID and EventID in your search query.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
I faced a similar issue some time back and a PartitionKey with customerId + datekey e.g. cust1_20200920 worked well for me.
I created the date key as 20200920 (YYYYMMDD), but you can choose to ignore the date part or even the month (cust1_202009 /cust1_2020), based on your query requirement.
Also, IMO, if there are multiple known PartitionKeys at a query time it's kind of a good thing. For example, if you keep YYYYMM as the PartitionKey and want to get data for 4 months, you can run 4 queries in parallel and combine the data. Which is faster if you have many clients and these Partition Keys are distributed among multiple physical partitions.
On a separate note, Cosmos Db has recently introduced an analytical store for the transactional data which can be useful for your use case.
More about it here - https://learn.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
One approach is using multiple Cosmos containers as "hot/cold" tiers with different partitioning. We could use two containers:
Recent: all writes and all queries for recent items go here. Partitioned by CustomerId.
Archive: all items are copied here for long term storage and access. Partitioned by CustomerId + timespan (e.g. partition per calendar month)
The Recent container would provide single partition queries by customer. Data growth per partition would be limited either by setting reasonable TTL during creation, or using a separate maintenance job (perhaps Azure Function on timer) to delete items when they are no longer candidates for recent-item queries.
A Change Feed processor, implemented by an Azure Function or otherwise, would trigger on each creation in Recent and make a copy into Archive. This copy would have partition key combining the customer ID and date range as appropriate to limit the partition size.
This scheme should provide efficient recent-item queries from Recent and safe long-term storage in Archive, with reasonable Archive query efficiency given a desired date range. The main downside is two writes for each item (one for each container) -- but that's the tradeoff for efficient polling. Whether this tradeoff is worthwhile is probably best determined by simulating the load and observing performance.

Why should I not put all my data in one CosmosDB collection?

The problem
I have discovered that Cosmos DB is priced very aggressively and can be expensive if used with many data types.
I would think that a good structure, would be to put each data type I have in their own collection, almost like tables in a database (not quite).
However, each collection costs at least 24 USD per month. This is if I choose "Fixed", that limits me to 10GB and is NOT scalable. Hardly the point of Cosmos DB, so I would rather choose "Unlimited". However, here the price is at least 60 USD per month.
60 USD per month per data type.
This includes 1000 RU, but on top of this, I have to pay more for consumption.
This might be OK if I have a few data types, but if I a fully fledged business application with 30 data types (not at all uncommon), it becomes 1800 USD per month, at least. As a starting price. When I have no data yet.
The question
The structure of the data in the collection is not strict. I can store different types of documents in the same collection.
When using an "Unlimited" collection, I can use partition keys, which should be used to partition my data to ensure scalability.
However, why do I not just include the data type in the partition key?
Then the partition key becomes something like:
[customer-id]-[data-type]-[actual-partition-value, like 'state']
With one swift move, my minimum cost becomes 60 USD and the rest is based on consumption. Presumably, partition keys ensure satisfactory performance regardless of the data volume. So what am I missing? Is there some problem with this approach?
Update
Microsoft now supports sharing RU across all containers (without a minimum of 10000 RU) so this question is essentially no longer relevant, as you can now freely choose to separate data into different containers without any extra cost.
No there will be no problem per se.
It all boils down to whether you're fine with having 1000 RU/s, or more specifically a single bottleneck, for your whole system.
In fact you can simplify this even more by having your document id to be the partition key. This will guarantee the uniqueness of the document id and will enable the maximum possible distribution and scale in CosmosDB.
That's exactly how collection sharing works in Cosmonaut (disclaimer, I'm the creator of this project) and I have noticed no problems, even on systems with many different data types.
However you have to keep in mind that even though you can scale this collection up and down you still restrict your whole system with this one bottleneck. I would recommend that you don't just create one collection but probably 2 or 3 collections with shared entities in them. If this is done smart enough and you batch entities in a logical way then you can scale your throughput for specific parts of your system.

Azure Cosmos DB - Understanding Partition Key

I'm setting up our first Azure Cosmos DB - I will be importing into the first collection, the data from a table in one of our SQL Server databases. In setting up the collection, I'm having trouble understanding the meaning and the requirements around the partition key, which I specifically have to name while setting up this initial collection.
I've read the documentation here: (https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-partition-data) and still am unsure how to proceed with the naming convention of this partition key.
Can someone help me understand how I should be thinking in naming this partition key? See the screenshot below for the field I'm trying to fill in.
In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URL's and several other secondary identifiers for that record's URL. Not sure if any of that information has any bearing on how I should name my Partition Key.
EDIT: I've added a screenshot of several records from the table from which I'm importing, per request from #Porschiey.
Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.
But, in a nutshell:
The PartitionKey is a property that will exist on every single object that is best used to group similar objects together.
Good examples include Location (like City), Customer Id, Team, and more. Naturally, it wildly depends on your solution; so perhaps if you were to post what your object looks like we could recommend a good partition key.
EDIT: Should be noted that PartitionKey isn't required for collections under 10GB. (thanks David Makogon)
* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.
Partition key acts as a logical partition.
Now, what is a logical partition, you may ask? A logical partition may vary upon your requirements; suppose you have data that can be categorized on the basis of your customers, for this customer "Id" will act as a logical partition and info for the users will be placed according to their customer Id.
What effect does this have on the query?
While querying you would put your partition key as feed options and won't include it in your filter.
e.g: If your query was
SELECT * FROM T WHERE T.CustomerId= 'CustomerId';
It will be Now
var options = new FeedOptions{ PartitionKey = new PartitionKey(CustomerId)};
var query = _client.CreateDocumentQuery(CollectionUri,$"SELECT * FROM T",options).AsDocumentQuery();
I've put together a detailed article here Azure Cosmos DB. Partitioning.
What's logical partition?
Cosmos DB designed to scale horizontally based on the distribution of data between Physical Partitions (PP) (think of it as separately deployable underlaying self-sufficient node) and logical partition - bucket of documents with same characteristic (partition key) which is supposed to be stored fully on the same PP. So LP can't have part of the data on PP1 and another on PP2.
There are two main limitation on Physical Partitions:
Max throughput: 10k RUs
Max data size (sum of sizes of all LPs stored in this PP): 50GB
Logical partition has one - 20GB limit in size.
NOTE: Since initial releases of Cosmos DB size limits grown and I won't be surprised that soon size limitations might increase.
How to select right partition key for my container?
Based on the Microsoft recommendation for maintainable data growth you should select partition key with highest cardinality (like Id of the document or a composite field). For the main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
It is critical to analyze application data consumption pattern when considering right partition key. In a very rare scenarios larger partitions might work though in the same time such solutions should implement data archiving to maintain DB size from a get-go (see example below explaining why). Otherwise you should be ready to increasing operational costs just to maintain same DB performance and potential PP data skew, unexpected "splits" and "hot" partitions.
Having very granular and small partitioning strategy will lead to an RU overhead (definitely not multiplication of RUs but rather couple additional RUs per request) in consumption of data distributed between number of physical partitions (PPs) but it will be neglectable comparing to issues occurring when data starts growing beyond 50-, 100-, 150GB.
Why large partitions are a terrible choice in most cases even though documentation says "select whatever works best for you"
Main reason is that Cosmos DB is designed to scale horizontally and provisioned throughput per PP is limited to the [total provisioned per container (or DB)] / [number of PP].
Once PP split occurs due to exceeding 50GB size your max throughput for existing PPs as well as two newly created PPs will be lower then it was before split.
So imagine following scenario (consider days as a measure of time between actions):
You've created container with provisioned 10k RUs and CustomerId partition key (which will generate one underlying PP1). Maximum throughput per PP is 10k/1 = 10k RUs
Gradually adding data to container you end-up with 3 big customers with C1[10GB], C2[20GB] and C3[10GB] of invoices
When another customer was onboarded to the system with C4[15GB] of data Cosmos DB will have to split PP1 data into two newly created PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k/2 = 5k RUs
Two more customers C5[10GB] C6[15GB] were added to the system and both ended-up in PP2 which lead to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs
IMPORTANT: As a result on [Day 2] C1 data was queried with up to 10k RUs
but on [Day 4] with only max to 3.333k RUs which directly impacts execution time of your query
This is a main thing to remember when designing partition keys in current version of Cosmos DB (12.03.21).
CosmosDB can be used to store any limit of data. How it does in the back end is using partition key. Is it the same as Primary key? - NO
Primary Key: Uniquely identifies the data
Partition key helps in sharding of data(For example one partition for city New York when city is a partition key).
Partitions have a limit of 10GB and the better we spread the data across partitions, the more we can use it. Though it will eventually need more connections to get data from all partitions. Example: Getting data from same partition in a query will be always faster then getting data from multiple partitions.
Partition Key is used for sharding, it acts as a logical partition for your data, and provides Cosmos DB with a natural boundary for distributing data across partitions.
You can read more about it here: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Each partition on a table can store up to 10GB (and a single table can store as many document schema types as you like). You have to choose your partition key though such that all the documents that get stored against that key (so fall into that partition) are under that 10GB limit.
I'm thinking about this too right now - so should the partition key be a date range of some type? In that case, it would really depend on how much data is getting stored in a period of time.
You are defining a logical partition.
Underneath, physically the data is split into physical partitions by Azure.
Ideally a partitionKey should be a primary Key, or a field with high cardinality to ensure proper distribution, with the self generated id field within that partition also set to the primary key, that will help with documentFetchById much faster.
You cannot change a partitionKey once container is created.
Looking at the dataset, captureId is a good candidate for partitionKey, with id set manually to this field, and not an auto generated cosmos one.
There is documentation available from Microsoft about partition keys. According to me you need to check the queries or operations that you plan to perform with cosmos DB. Are they read-heavy or write-heavy? if read heavy it is ideal to choose a partition key in the where clause that will be used in the query, if it is a write heavy operation then look for a key which has high cardinality
Always point reads /writes are better since it consumes way less RU's than running other queries

Azure Table Storage Partition Key

Two somewhat related questions.
1) Is there anyway to get an ID of the server a table entity lives on?
2) Will using a GUID give me the best partition key distribution possible? If not, what will?
we have been struggling for weeks on table storage performance. In short, it's really bad, but early on we realized that using a randomish partition key will distribute the entities across many servers, which is exactly what we want to do as we are trying to achieve 8000 reads per second. Apparently our partition key wasn't random enough, so for testing purposes, I have decided to just use a GUID. First impression is it is waaaaaay faster.
Really bad get performance is < 1000 per second. Partition key is Guid.NewGuid() and row key is the constant "UserInfo". Get is execute using TableOperation with pk and rk, nothing else as follows: TableOperation retrieveOperation = TableOperation.Retrieve(pk, rk); return cloudTable.ExecuteAsync(retrieveOperation). We always use indexed reads and never table scans. Also, VM size is medium or large, never anything smaller. Parallel no, async yes
As other users have pointed out, Azure Tables are strictly controlled by the runtime and thus you cannot control / check which specific storage nodes are handling your requests. Furthermore, any given partition is served by a single server, that is, entities belonging to the same partition cannot be split between several storage nodes (see HERE)
In Windows Azure table, the PartitionKey property is used as the partition key. All entities with same PartitionKey value are clustered together and they are served from a single server node. This allows the user to control entity locality by setting the PartitionKey values, and perform Entity Group Transactions over entities in that same partition.
You mention that you are targeting 8000 requests per second? If that is the case, you might be hitting a threshold that requires very good table/partitionkey design. Please see the article "Windows Azure Storage Abstractions and their Scalability Targets"
The following extract is applicable to your situation:
This will provide the following scalability targets for a single storage account created after June 7th 2012.
Capacity – Up to 200 TBs
Transactions – Up to 20,000 entities/messages/blobs per second
As other users pointed out, if your PartitionKey numbering follows an incremental pattern, the Azure runtime will recognize this and group some of your partitions within the same storage node.
Furthermore, if I understood your question correctly, you are currently assigning partition keys via GUID's? If that is the case, this means that every PartitionKey in your table will be unique, thus every partitionkey will have no more than 1 entity. As per the articles above, the way Azure table scales out is by grouping entities in their partition keys inside independent storage nodes. If your partitionkeys are unique and thus contain no more than one entity, this means that Azure table will scale out only one entity at a time! Now, we know Azure is not that dumb, and it groups partitionkeys when it detects a pattern in the way they are created. So if you are hitting this trigger in Azure and Azure is grouping your partitionkeys, it means your scalability capabilities are limited to the smartness of this grouping algorithm.
As per the the scalability targets above for 2012, each partitionkey should be able to give you 2,000 transactions per second. Theoretically, you should need no more than 4 partition keys in this case (assuming that the workload between the four is distributed equally).
I would suggest you to design your partition keys to group entities in such a way that no more than 2,000 entities per second per partition are reached, and drop using GUID's as partitionkeys. This will allow you to better support features such as Entity Transaction Group, reduce the complexity of your table design, and get the performance you are looking for.
Answering #1: There is no concept of a server that a particular table entity lives on. There are no particular servers to choose from, as Table Storage is a massive-scale multi-tenant storage system. So... there's no way to retrieve a server ID for a given table entity.
Answering #2: Choose a partition key that makes sense to your application. just remember that it's partition+row to access a given entity. If you do that, you'll have a fast, direct read. If you attempt to do a table- or partition-scan, your performance will certainly take a hit.
See
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx for more info on key selection (Note the numbers are 3 years old, but the guidance is still good).
Also this talk can be of some use as far as best practice : http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/WAD-B406#fbid=lCN9J5QiTDF.
In general a given partition can support up to 2000 tps, so spreading data across partitions will help achieve greater numbers. Something to consider is that atomic batch transactions only apply to entities that share the same partitionkey. Additionally, for smaller requests you may consider disabling Nagle as small requests may be getting held up at the client layer.
From the client end, I would recommend using the latest client lib (2.1) and Async methods as you have literally thousands of requests per second. (the talk has a few slides on client best practices)
Lastly, the next release of storage will support JSON and JSON no metadata which will dramatically reduce the size of the response body for the same objects, and subsequently the cpu cycles needed to parse them. If you use the latest client libs your application will be able to leverage these behaviors with little to no code change.

Design of Partitioning for Azure Table Storage

I have some software which collects data over a large period of time, approx 200 readings per second. It uses an SQL database for this. I am looking to use Azure to move a lot of my old "archived" data to.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding as to how this will scale however; so was hoping somebody might be able to clear this up:
For performance Azure will/might split my table by partitionkey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
Few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or I need to retrieve the data for all metrics for a particular date/time range? If this is the case then you're looking at full table scan. Obviously you could avoid this by doing multiple queries (one query / PartitionKey)
Do I need to see the most latest results first or I don't really care. If it's former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also since PartitionKey is a string value, you may want to convert int value to a string value with some "0" prepadding so that all your ids appear in order otherwise you'll get 1, 10, 11, .., 19, 2, ...etc.
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only and not the RowKey. Within a Partition, RowKey serves as unique key. Windows Azure will try and keep data with the same PartitionKey in the same node but since each node is a physical device (and thus has size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition– a table partition are all of the entities in a
table with the same partition key value, and usually tables have many
partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the
20,000 entities/second, which is the overall account target described
above.
Now you mentioned that you've 10 - 20 different metric points and for for each metric point you'll write a maximum of 1 record per minute that means you would be writing a maximum of 20 entities / minute / table which is well under the scalability target of 2000 entities / second.
Now the question remains of reading. Assuming a user would read a maximum of 24 hours worth of data (i.e. 24 * 60 = 1440 points) per partition. Now assuming that the user gets the data for all 20 metrics for 1 day, then each user (thus each table) will fetch a maximum 28,800 data points. The question that is left for you I guess is how many requests like this you can get per second to meet that threshold. If you could somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.

Resources