RU charge the same for primary key and non-primary key - Cosmos DB - Azure

I have a Cosmos DB container and currently it has just 100 documents. When I query by the indexed id (primary key) or by a non-primary key, the RU charge remains the same.
So as the data grows, the RU charge will change, right? Or does Cosmos calculate it over the larger data set and report an average?
My id (primary key) values are unique ids, and I am setting the partition key to be the same as id, because I sometimes need to query by id. But now there is a new requirement to also query by a non-primary key. Should I make that field the partition key (again, unique ids) or add a secondary index?
The data is not yet in production.

At small sizes the choices you are making here likely won't have any impact either way on RU/s costs. It's only when you want to scale your workload that design decisions you make will have an impact.
If you are new to Cosmos DB, I would recommend you watch this presentation on how to model and partition data in Cosmos DB. https://youtu.be/utdxvAhIlcY
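To see the difference once data grows, here is a minimal sketch, assuming the Azure Cosmos DB Java SDK v4 and placeholder endpoint, key, and container names: a point read by id plus partition key stays cheap regardless of container size, while a query on a non-partition-key field fans out across partitions, so its charge grows with the data. Both report their cost through getRequestCharge().

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.azure.cosmos.models.PartitionKey;
import com.fasterxml.jackson.databind.node.ObjectNode;

CosmosClient client = new CosmosClientBuilder()
        .endpoint("https://<account>.documents.azure.com:443/") // placeholder
        .key("<key>")                                           // placeholder
        .buildClient();
CosmosContainer container = client.getDatabase("mydb").getContainer("mycontainer");

// Point read: id + partition key, the cheapest way to fetch a single item.
CosmosItemResponse<ObjectNode> read =
        container.readItem("some-id", new PartitionKey("some-id"), ObjectNode.class);
System.out.println("Point read RU: " + read.getRequestCharge());

// Query on a non-partition-key field: fans out across partitions, so its RU
// charge grows with data size even though it looks cheap at 100 documents.
container.queryItems("SELECT * FROM c WHERE c.someField = 'x'",
                new CosmosQueryRequestOptions(), ObjectNode.class)
        .iterableByPage()
        .forEach(page -> System.out.println("Query page RU: " + page.getRequestCharge()));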

Related

Azure table storage partition key and row key for one key entity

What is the best practice when choosing partition & row key for entities with one important key?
Sample entities:
Device1:
ID: AB1234567
IsRunning: Yes
IsUpdating: No
Device2:
ID: AB7654321
IsRunning: Yes
IsUpdating: Yes
I saw this post that suggests splitting the ID into partition key and row key.
But the Azure documentation actually recommends only using the partition key when the entity has just one key property. It doesn't say what should be set as the row key, though. Should it be empty? Or maybe a default value like '0'?
The expected record count is maybe in the tens of thousands: currently ~10k but growing.
PartitionKey in Table Storage
In Table Storage, you need to decide on the PartitionKey yourself. Ultimately, you are responsible for the performance you get out of your system. If you put every entity in the same partition, you are limited by the size of a single storage node for the amount of storage you can use, and you constrain the maximum throughput, because every request lands on the same partition.
RowKey in Table Storage
A RowKey in Table Storage is a very important thing: it is the "primary key" within a partition. The combination of PartitionKey and RowKey forms the composite unique identifier for an entity. Within one PartitionKey, RowKeys must be unique; if you use multiple partitions, the same RowKey can be reused in every partition.
This article by Maarten Balliauw will help you to decide What is the best practice when choosing partition & row key for entities.
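As an illustration only, here is a sketch with the azure-data-tables Java client showing both layouts for the sample devices; the table name and the two-character split point are assumptions, not anything the docs prescribe:

import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableClientBuilder;
import com.azure.data.tables.models.TableEntity;

TableClient devices = new TableClientBuilder()
        .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
        .tableName("devices") // assumed table name
        .buildClient();

// Option 1: split the ID, so entities spread across partitions and
// (PartitionKey, RowKey) together reconstruct the full ID.
devices.createEntity(new TableEntity("AB", "1234567")
        .addProperty("IsRunning", true)
        .addProperty("IsUpdating", false));

// Option 2: the whole ID as PartitionKey with a constant RowKey.
// An empty-string RowKey is legal; a null RowKey is not.
devices.createEntity(new TableEntity("AB7654321", "")
        .addProperty("IsRunning", true)
        .addProperty("IsUpdating", true));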

Can I have multiple entries under the same primary key while using DynamoDB?

I'm currently developing a skill which will be a Stock portfolio manager.
I want the table to have three headings
UserID (which comes from Alexa)
Stock name
Stock amount
UserID is the primary key. Currently I can add an amount of a stock to the table, and then in a separate method called GetPortfolioValue I query the database to return the Stock name and Stock amount for a specific UserID, which I then do some maths on to return the portfolio value.
The problem is, I can't add another entry under the same primary key, so the portfolio can only hold one stock, which sucks.
I want to later be able to edit the portfolio in case a user sells part of their stocks.
Is there a way I can do this using DynamoDB?
Primary keys must be unique. You probably need a compound primary key of UserID + Stock Name. In DynamoDB that is accomplished by setting UserID to the hash key, and Stock Name to the sort key, on the primary table index.
You can have several items with the same partition key when the table has a sort key. If the table has only a partition key and no sort key, then no. When there is a sort key, each partition key + sort key combination must be unique.
(The original answer shows a screenshot of such a table, where the PK attribute is the partition key and SK the sort key.)
Items with the same partition key are placed physically together, so there shouldn't be partition keys like "active" and "not_active": with only two possible values, all items fall into just two spaces. If you have a lot of rows with the same partition key, you create a hot spot for queries and may experience slow queries. But how to design a DynamoDB table is far too broad a topic to cover here.
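To make the compound-key answer concrete, here is a sketch using the AWS SDK for Java v1 document API; the table name "portfolio" and the attribute names are hypothetical:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;

DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient());
// Hypothetical table: hash key UserID, sort key StockName.
Table portfolio = dynamo.getTable("portfolio");

// Each (UserID, StockName) pair is its own item, so one user can hold many stocks.
portfolio.putItem(new Item()
        .withPrimaryKey("UserID", "amzn1.ask.account.EXAMPLE", "StockName", "AMZN")
        .withNumber("StockAmount", 45));

// One query on the hash key returns every stock item for the user.
ItemCollection<QueryOutcome> items =
        portfolio.query(new QuerySpec().withHashKey("UserID", "amzn1.ask.account.EXAMPLE"));
for (Item stock : items) {
    System.out.println(stock.getString("StockName") + ": " + stock.getNumber("StockAmount"));
}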
You will probably benefit from my article: https://lukasliesis.medium.com/learning-from-aws-fcb0cc71926b
There are two brilliant videos by Rick Houlihan on DynamoDB architecture:
https://www.youtube.com/watch?v=HaEPXoXVf2k
https://www.youtube.com/watch?v=6yqfmXiZTlM
I highly recommend watching both, multiple times, to get into the DynamoDB mindset: understanding query heat maps, putting a whole app into a single table, and using DynamoDB to its best.
Update: Adaptive Capacity
This was probably the most common issue DynamoDB once had.
If you’ve used DynamoDB before, you’re probably aware that DynamoDB recommends building your application to deliver evenly distributed traffic for optimal performance. That is, your requests should be evenly distributed across your primary key. This is because before adaptive capacity, DynamoDB allocated read and write throughput evenly across partitions. For example, if you had a table capable of 400 writes per second (in other words, 400 write capacity units, or “WCUs”) distributed across four partitions, each partition would be allocated 100 WCUs. If you had a nonuniform workload with one partition receiving more than 100 writes per second (a hot partition), those requests might have returned a ProvisionedThroughputExceededException error.
Read full article here:
https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Yes, assuming you're using the userId from the requests from Alexa, it will provide an excellent primary key for your DynamoDB database.
"user": {
"userId": "amzn1.ask.account.[unique-value-here]",
Simply store the stocks as a JSON object against each user.
{
    "stocks": [
        {
            "stockAmount": 45,
            "stockName": "AMZN"
        }
    ]
}
Storing the Object
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient(...));
Table table = dynamo.getTable("stocks");

// JSON document holding the user's stocks, stored under the "document" attribute.
String json = "{\"stocks\": [{\"stockAmount\": 45, \"stockName\": \"AMZN\"}]}";

Item item = new Item()
        .withPrimaryKey("alexa_id", 1)
        .withJSON("document", json);

table.putItem(item);
and how to get it back
import com.amazonaws.services.dynamodbv2.document.spec.GetItemSpec;

DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient(...));
Table table = dynamo.getTable("stocks");

// Fetch only the "document" attribute for this user's item.
Item documentItem = table.getItem(new GetItemSpec()
        .withPrimaryKey("alexa_id", 1)
        .withAttributesToGet("document"));

System.out.println(documentItem.getJSONPretty("document"));
As your users add or remove stocks you'll need to append to or rewrite the stocks array for the user. You'll be subject to a 400KB limit for each of your DynamoDB items; the size of a given item includes the attribute names (in UTF-8) and the attribute values.
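If you keep this single-document layout, the append can be done in place with an update expression instead of rewriting the whole item. A sketch under the same assumed attribute names ("alexa_id", "document") using list_append from the v1 document API:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;
import java.util.HashMap;
import java.util.Map;

DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient());
Table table = dynamo.getTable("stocks");

// New stock entry; a plain Map serializes as a DynamoDB map inside the list.
Map<String, Object> newStock = new HashMap<>();
newStock.put("stockAmount", 10);
newStock.put("stockName", "GOOG");

// list_append concatenates the existing stocks list with the new entry.
table.updateItem(new UpdateItemSpec()
        .withPrimaryKey("alexa_id", 1)
        .withUpdateExpression("SET #doc.stocks = list_append(#doc.stocks, :s)")
        .withNameMap(new NameMap().with("#doc", "document"))
        .withValueMap(new ValueMap().withList(":s", newStock)));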

Role of partition key in Cosmos DB SQL API insert? With the Bulk Executor?

I'm trying to repeatedly insert about 850 documents of 100-300 KB each into a Cosmos collection. I have them all under the same partition key.
The estimator suggests that 50K RU/s should handle this in short order, but even at well over 100K RU/s it's averaging 20 minutes or so per set rather than something more reasonable.
Should I have unique partition keys for each document? Is the problem that, with all the documents going to the same partition key, they are being handled in series and the capacity isn't load-leveling?
Will using the bulk executor fix this?
Should I have unique partition keys for each document? Is the problem that having all the documents going to the same partition key, they are being handled in series and the capacity isn't load leveling?
You can find the following statement in this doc:
To fully utilize the throughput provisioned for a container or a set of containers, you must choose a partition key that allows you to evenly distribute requests across all distinct partition key values.
So a well-chosen partition key helps both inserts and queries. However, choosing a partition key is really worth digging into; please refer to this doc to choose your partition key.
Will using the bulk executor fix this?
Yes, you could use a continuation token with the bulk insert. For more details, please refer to my previous case: How do I get a continuation token for a bulk INSERT on Azure Cosmos DB?
Hope it helps you.
Just to summarize: we needed to evaluate the default indexing for the collection. Indexing can take 100 to 1000x more RUs than actually writing the document.
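As a hedged illustration of that summary with the Azure Cosmos DB Java SDK v4 (the container name and index path are placeholders): restricting the indexing policy to only the paths you actually query can cut the per-write RU cost dramatically for large documents.

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosDatabase;
import com.azure.cosmos.models.CosmosContainerProperties;
import com.azure.cosmos.models.ExcludedPath;
import com.azure.cosmos.models.IncludedPath;
import com.azure.cosmos.models.IndexingPolicy;
import java.util.Collections;

CosmosClient client = new CosmosClientBuilder()
        .endpoint("https://<account>.documents.azure.com:443/") // placeholder
        .key("<key>")                                           // placeholder
        .buildClient();
CosmosDatabase database = client.getDatabase("mydb");

// Index only the one path used in filters; exclude everything else so
// 100-300 KB documents don't pay indexing RUs on every property per write.
IndexingPolicy policy = new IndexingPolicy();
policy.setIncludedPaths(Collections.singletonList(new IncludedPath("/lookupField/?")));
policy.setExcludedPaths(Collections.singletonList(new ExcludedPath("/*")));

CosmosContainerProperties props = new CosmosContainerProperties("docs", "/pk");
props.setIndexingPolicy(policy);
database.createContainerIfNotExists(props);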

How to decide a good partition key for Azure Cosmos DB

I'm new to Azure Cosmos DB, but I want to build a clear understanding of:
What is the partition key?
My understanding is shallow for now: items with the same partition key go to the same logical partition for storage, which enables better load balancing as the system grows.
How to decide on a good partition key?
Could somebody please provide an example?
Thanks a lot!
You have to choose your partition key based on your workload. Workloads can be classified into two types:
Read Heavy
Write Heavy
Read-heavy workloads are those where data is read far more than it is written, like a product catalog, where inserts and updates to the catalog are infrequent but people browse the products constantly.
Write-heavy workloads are those where data is written more than it is read. A common scenario is IoT devices sending readings from multiple sensors; you will be writing lots of data to Cosmos DB because data may arrive every second.
For a read-heavy workload, choose a partition key property that is used in the query filter. In the product example that is the product id, which is mostly what you fetch by when a user wants to read the product information and browse its reviews.
For a write-heavy workload, choose a partition key property with high uniqueness. For example, in the IoT scenario, use a partition key such as deviceid_signaldatetime, concatenating the device id that sends the signal with the DateTime of the signal, which gives more uniqueness.
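A sketch of that IoT pattern with the Cosmos DB Java SDK v4; the database, container, and field names here are illustrative assumptions:

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosDatabase;
import com.azure.cosmos.models.CosmosContainerProperties;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.PartitionKey;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

CosmosClient client = new CosmosClientBuilder()
        .endpoint("https://<account>.documents.azure.com:443/") // placeholder
        .key("<key>")                                           // placeholder
        .buildClient();
CosmosDatabase database = client.getDatabase("iot");

// Container partitioned on a synthetic "/pk" property.
database.createContainerIfNotExists(new CosmosContainerProperties("signals", "/pk"));
CosmosContainer signals = database.getContainer("signals");

// deviceid_signaldatetime: high cardinality, so writes spread across partitions.
String pk = "sensor-42_" + Instant.now();

Map<String, Object> reading = new HashMap<>();
reading.put("id", UUID.randomUUID().toString());
reading.put("pk", pk);
reading.put("temperature", 21.5);
signals.createItem(reading, new PartitionKey(pk), new CosmosItemRequestOptions());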
1. What is the partition key?
In Azure Cosmos DB there are two kinds of partitions: physical and logical.
A. A physical partition is a fixed amount of reserved SSD-backed storage combined with a variable amount of compute resources.
B. A logical partition is a partition within a physical partition that stores all the data associated with a single partition key value.
I think the partition key you mentioned is the logical partition key. The partition key acts as a logical partition for your data and provides Azure Cosmos DB with a natural boundary for distributing data across physical partitions. For more details, you could refer to How does partitioning work.
2. How to decide on a good partition key? Could somebody please provide an example?
You need to pick a property that has a wide range of values and even access patterns. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
For example, suppose your data has fields named id and color, and you filter on color more frequently. Picking color rather than id as the partition key is more efficient for your query performance, because queries that filter on the partition key can be routed directly to the right partition. Every item has a different id, but many items share a color, and the set of colors still has a wide range of values; if you add a new color, the partition key scales with it.
For more details, please read Partition and scale in Azure Cosmos DB.
Hope it helps you.

What is the disadvantage to unique partition keys?

My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct but I want to expand on it so you can understand a bit more why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced because partitions support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal which is great because it sounds like that's what you will be doing a lot of. If partition key has no logical meaning (ie, you won't want all the entities in a particular partition) you're best splitting out to many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure storage will return continuation tokens if we hit timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods which should help you handle this by automatically following these tokens as you iterate through the list.
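For instance, with the azure-data-tables Java client (the table name is assumed), listEntities() is one of those convenience methods: it follows continuation tokens for you while scanning the whole table.

import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableClientBuilder;
import com.azure.data.tables.models.TableEntity;

TableClient client = new TableClientBuilder()
        .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
        .tableName("items") // assumed table name
        .buildClient();

// listEntities() pages through results, transparently following the
// continuation tokens the service returns, so a full scan is one loop.
for (TableEntity entity : client.listEntities()) {
    System.out.println(entity.getPartitionKey() + " / " + entity.getRowKey());
}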
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
If you don't need EntityGroupTransaction to update entities in batch, unique partition keys are good option to you.
The table service auto-scale feature may not work perfectly, I think. When some of the data in a partition is "hot", the table service moves it to another cluster to improve performance. But since you have unique partition keys, probably none of your entities will be identified as "hot", whereas if you grouped them into partitions, a "hot" partition could be detected and moved. The problem below may also exist if you use a static partition key.
Besides, the table service may return only part of your query's results when:
More than 1000 entities are in the result.
A partition boundary is crossed.
From your requirements you also need a full query (return all entities). If you are using unique partition keys, each entity is its own partition, so your query may return one entity at a time with a continuation token, and you need to fire another query with that token to retrieve the next entity. I don't think this is what you want.
So my suggestion is: select a reasonable partition key in any case, even if it looks useless for your business, because it helps the table service optimize your data.
