Coming from an Azure DocumentDB (Cosmos DB) background to AWS DynamoDB, for an application where DynamoDB is already being used.
I have some confusion around the partition key in DynamoDB.
As I understand it, the partition key is used to segregate data into different partitions as the data grows. However, many people suggest using the primary key as the partition key, e.g. User Id, Customer Id, or Order Id. In that case I am not sure how we achieve better performance, since we end up with many partitions, so a query may need to be executed on multiple servers.
For example, if I wanted to develop a multi-tenant system where I use a single table to store all tenants' data, partitioned by tenant id, I would do the following in DocumentDB.
1) Storing data
Create objects with following schema.
Primary key: Order Id
Partition key: Tenant id
2) Retrieving all records for a tenant
SELECT * FROM Orders o WHERE o.tenantId="tenantId"
3) Retrieving a record by id for a tenant
SELECT * FROM Orders o WHERE o.Id='id' and o.tenantId="tenantId"
4) Retrieving all records for a tenant with sorting
SELECT * FROM Orders o WHERE o.tenantId="tenantId" ORDER BY o.CreatedDate
//by default, all fields in DocumentDB are indexed, so ORDER BY just works
How do I achieve the same operations in DynamoDB?
Finally I have figured out how to use DynamoDB properly. Thanks to @Jesse Carter; his comment was very helpful in understanding DynamoDB better. I am answering my own question now.
Compared to other NoSQL databases, DynamoDB is a bit difficult because its terminology is quite confusing. Below is a simplified DynamoDB table design for a few common scenarios.
Primary key
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all the other products out there, but that is the fact. Primary keys (in DynamoDB) are actually "partition keys".
Finding 1
You are always required to supply the primary key as part of a query.
Scenario 1 - Key value(s) store
Assume you want to create a table with an Id and multiple other attributes, and you query based on the Id attribute only. In this case, Id could be the primary key.
|---------------------|------------------|
| User Id | Name |
|---------------------|------------------|
| 12 | value1 |
| 13 | value2 |
|---------------------|------------------|
We can have User Id as the "Primary Key (Partition Key)".
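For illustration, a lookup by this simple primary key with the AWS SDK for Java (v1 document API) could look like the sketch below; the table name "Users" is my assumption, not something fixed by DynamoDB:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;

public class KeyValueLookup {
    public static void main(String[] args) {
        DynamoDB dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        // Simple primary key: the partition key alone identifies the item.
        Item user = dynamo.getTable("Users").getItem("UserId", "12");
        System.out.println(user.getString("Name"));
    }
}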
Scenario 2
Now say we want to store messages for users as shown below, and we will query by User Id to retrieve all messages for a user.
|---------------------|------------------|
| User Id | Message Id |
|---------------------|------------------|
| 12 | M1 |
| 12 | M2 |
| 13 | M3 |
|---------------------|------------------|
"User Id" will still be the primary key for this table. Remember, the primary key in DynamoDB does not need to be unique per item. Message Id can be the sort key.
So what is a sort key?
A sort key is a kind of unique key within a partition: the combination of partition key and sort key has to be unique.
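To make this concrete, here is a minimal sketch (AWS SDK for Java v1, document API; the table name "Messages" is assumed) of both access patterns against the table above: all messages for a user via the partition key alone, and a single message via partition key plus sort key:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.*;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class MessagesQuery {
    public static void main(String[] args) {
        Table messages = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient())
                .getTable("Messages");

        // Partition key only: all messages for user 12 (M1 and M2).
        ItemCollection<QueryOutcome> all = messages.query(new QuerySpec()
                .withKeyConditionExpression("UserId = :u")
                .withValueMap(new ValueMap().withString(":u", "12")));
        all.forEach(item -> System.out.println(item.toJSONPretty()));

        // Partition key + sort key: identifies exactly one item.
        Item one = messages.getItem("UserId", "12", "MessageId", "M1");
        System.out.println(one.toJSONPretty());
    }
}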
Creating table locally
If you are using Visual Studio, you can install the AWS Toolkit for Visual Studio to create local tables on your machine for testing.
Note: the table-creation tooling adds some more terms!
Hash key, range key. Always a surprise with DynamoDB, isn't it? :) Actually,
(Primary Key = Partition Key = Hash Key) != your application object's primary key
As per our second scenario, "Message Id" is supposed to be the primary key for our application; however, in DynamoDB terms, User Id became the primary key to gain the partitioning benefits.
(Sort Key = Range Key) = could be the application object's primary key
Local Secondary Indexes
We can create indexes within a partition; these are called local secondary indexes. For example, suppose we want to retrieve messages for a user based on message status:
|------------|--------------|------------|
| User Id | Message Id | Status |
|------------|--------------|------------|
| 12 | M1 | 1 |
| 12 | M2 | 0 |
| 13 | M3 | 2 |
|------------|--------------|------------|
Primary Key: User Id
Sort Key: Message Id
Local Secondary Index: Status
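A hedged sketch of querying such an index with the v1 document API follows; the index name "StatusIndex" is my assumption (LSIs are defined and named at table-creation time, here with Status as the alternate sort key):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.*;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class QueryByStatus {
    public static void main(String[] args) {
        DynamoDB dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Index statusIndex = dynamo.getTable("Messages").getIndex("StatusIndex");

        // Same partition key (User Id), alternate sort key (Status).
        // "Status" is a DynamoDB reserved word, hence the #s placeholder.
        ItemCollection<QueryOutcome> unread = statusIndex.query(new QuerySpec()
                .withKeyConditionExpression("UserId = :u and #s = :st")
                .withNameMap(new NameMap().with("#s", "Status"))
                .withValueMap(new ValueMap().withString(":u", "12").withInt(":st", 0)));
        unread.forEach(item -> System.out.println(item.toJSONPretty()));
    }
}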
Global Secondary Indexes
As the name states, it is a global index. If we want to retrieve a single message by its id, without the partition key (i.e. User Id), then we should create a global secondary index on Message Id.
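Again, a hedged sketch; the index name "MessageIdIndex" is my assumption. A GSI is queried through the same Index interface, but no table partition key (User Id) is needed:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.*;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class LookupByMessageId {
    public static void main(String[] args) {
        DynamoDB dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Index byMessageId = dynamo.getTable("Messages").getIndex("MessageIdIndex");

        // No User Id needed: the GSI has Message Id as its own partition key.
        ItemCollection<QueryOutcome> hits = byMessageId.query(new QuerySpec()
                .withKeyConditionExpression("MessageId = :m")
                .withValueMap(new ValueMap().withString(":m", "M1")));
        hits.forEach(item -> System.out.println(item.toJSONPretty()));
    }
}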
Please see the explanation from the AWS documentation:
The primary key uniquely identifies each item in a table. The primary key can be simple (partition key) or composite (partition key and sort key).
When it stores data, DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based upon the partition key value. Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the partition key values. Distributing requests across partition key values distributes the requests across partitions.
For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
This does not mean that you must access all of the partition key values to achieve your throughput level; nor does it mean that the percentage of accessed partition key values needs to be high. However, be aware that when your workload accesses more distinct partition key values, those requests will be spread out across the partitioned space in a manner that better utilizes your allocated throughput level. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
As I understand it, the partition key is used to segregate data into different partitions as the data grows. However, many people suggest using the primary key as the partition key, such as User Id, Customer Id, or Order Id. In that case I am not sure how we achieve better performance, as we have many partitions.
You are correct that the partition key is used in DynamoDB to segregate data to different partitions. However, the partition key and the physical partition in which an item resides are not a one-to-one mapping.
The number of partitions is decided based on your RCU/WCU, in such a way that all of your RCU/WCU can be utilized.
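As a rough, hedged illustration (this is the pre-adaptive-capacity sizing guidance AWS used to publish, so treat the exact constants as approximate): the initial partition count was about max(ceil(RCU/3000 + WCU/1000), ceil(table size / 10 GB)). A table provisioned with 6,000 RCU and 2,000 WCU would therefore start with ceil(6000/3000 + 2000/1000) = 4 partitions, with the provisioned throughput split across them.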
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all the other products out there, but that is the fact. Primary keys (in DynamoDB) are actually "partition keys".
This is a wrong understanding. The concept of the primary key is exactly the same as in the SQL standards, with extra restrictions, as you would expect from a NoSQL database. In short, you can have a partition key alone as the primary key, or a partition key plus a sort key as a composite primary key.
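To make the two options concrete, here is a minimal sketch (AWS SDK for Java v1; the table and attribute names are illustrative) of declaring a composite primary key; dropping the RANGE element would leave a simple primary key consisting of the partition key alone:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class CreateMessagesTable {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        client.createTable(new CreateTableRequest()
                .withTableName("Messages")
                // HASH = partition key, RANGE = sort key: together they form
                // the composite primary key that must be unique per item.
                .withKeySchema(
                        new KeySchemaElement("UserId", KeyType.HASH),
                        new KeySchemaElement("MessageId", KeyType.RANGE))
                .withAttributeDefinitions(
                        new AttributeDefinition("UserId", ScalarAttributeType.S),
                        new AttributeDefinition("MessageId", ScalarAttributeType.S))
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));
    }
}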
Related
What is the best practice when choosing partition & row key for entities with one important key?
Sample entities:
Device1:
ID: AB1234567
IsRunning: Yes
IsUpdating: No
Device2:
ID: AB7654321
IsRunning: Yes
IsUpdating: Yes
I saw this post that suggests splitting the ID between the partition key and the row key.
But the Azure documentation actually recommends using only the partition key when the entity has just one key property. It doesn't say what should be set as the row key, though. Should it be empty? Or maybe a default value like '0'?
The expected number of records is in the tens of thousands; currently ~10k, but growing.
PartitionKey in Table Storage
In Table Storage, you need to decide on the PartitionKey yourself; ultimately, you are responsible for the throughput you get out of your system. If you put every entity in the same partition, you will be limited by the size of the storage machines for the amount of storage you can use. Also, you will be constraining the maximal throughput, since lots of entities end up in the same partition.
RowKey in Table Storage
A RowKey in Table Storage is a very important thing: it is the "primary key" within a partition. The combination of PartitionKey and RowKey forms the composite unique identifier for an entity. Within one PartitionKey, you can only have unique RowKeys. If you use multiple partitions, the same RowKey can be reused in every partition.
This article by Maarten Balliauw will help you decide on the best practice when choosing the partition and row key for your entities.
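For what it's worth, a minimal sketch of the "partition key only" approach with the (legacy) Azure Storage SDK for Java, using a constant placeholder RowKey such as '0'; the class and table names are my own for illustration:
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.TableOperation;
import com.microsoft.azure.storage.table.TableServiceEntity;

public class DeviceEntity extends TableServiceEntity {
    private boolean isRunning;
    private boolean isUpdating;

    public DeviceEntity() { }            // no-arg constructor required by the SDK

    public DeviceEntity(String id) {
        this.partitionKey = id;          // the one important key
        this.rowKey = "0";               // constant placeholder row key
    }

    public boolean getIsRunning() { return isRunning; }
    public void setIsRunning(boolean b) { isRunning = b; }
    public boolean getIsUpdating() { return isUpdating; }
    public void setIsUpdating(boolean b) { isUpdating = b; }

    public static void main(String[] args) throws Exception {
        CloudTable devices = CloudStorageAccount
                .parse(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .createCloudTableClient()
                .getTableReference("devices");
        devices.execute(TableOperation.insertOrReplace(new DeviceEntity("AB1234567")));
    }
}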
I have a Cosmos DB container and currently it has just 100 documents. When I query with the indexed id (primary key) or a non-primary key, the RU charge remains the same.
So, as I have more data, the RUs will change, right? Or does Cosmos calculate based on more data and give an average?
I have id (the primary key) as a set of unique ids, and I am setting the partition key to be the same as id, because at times I need to query based on id. But now there is a new requirement to also query on the basis of a non-primary key. So, should I make that the partition key (again, unique ids) or add a secondary index?
The data is not yet in production.
At small sizes, the choices you are making here likely won't have any impact either way on RU/s costs. It's only when you want to scale your workload that the design decisions you make will have an impact.
If you are new to Cosmos DB, I would recommend you watch this presentation on how to model and partition data in Cosmos DB. https://youtu.be/utdxvAhIlcY
I'm currently developing a skill which will be a Stock portfolio manager.
I want the table to have three headings
UserID (which comes from Alexa)
Stock name
Stock amount
UserID is the primary key. Currently I can add an amount of a stock to the table, and then, in a separate method called GetPortfolioValue, I query the database to return the Stock name and Stock amount for a specific UserID, which I then do some maths on to return the portfolio value.
The problem is, I can't add another entry under the same primary key, so the portfolio can only hold one stock, which sucks.
I want to later be able to edit the portfolio in case a user sells part of their stocks.
Is there a way I can do this using DynamoDB?
Primary keys must be unique. You probably want a compound primary key of UserID + Stock name. In DynamoDB, that is accomplished by setting UserID as the hash key and Stock name as the sort key on the primary table index.
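A minimal sketch of that compound-key design with the AWS SDK for Java v1 document API; the table and attribute names are assumptions:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.*;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class Portfolio {
    public static void main(String[] args) {
        Table table = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient())
                .getTable("portfolio");
        String userId = "amzn1.ask.account.[unique-value-here]";

        // One item per (user, stock) pair: a second stock no longer collides.
        table.putItem(new Item()
                .withPrimaryKey("UserID", userId, "StockName", "AMZN")
                .withNumber("StockAmount", 45));

        // GetPortfolioValue: query all stocks for one user by hash key.
        ItemCollection<QueryOutcome> stocks = table.query(new QuerySpec()
                .withKeyConditionExpression("UserID = :u")
                .withValueMap(new ValueMap().withString(":u", userId)));
        for (Item s : stocks) {
            System.out.println(s.getString("StockName") + " x " + s.getNumber("StockAmount"));
        }
    }
}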
You can have several items with the same partition key when the table has a sort key. If the table has only a partition key without a sort key, then no. If the table has a sort key, then each partition key & sort key combination must be unique.
In such a table, the PK attribute is the partition key and SK is the sort key.
Items with the same partition key are placed physically together, so there shouldn't be partition keys like "active" & "not_active": with only 2 possible values, all items would fall into only 2 spaces. If you have a lot of rows with the same partition key, you create a heat spot for queries and may experience slow queries. But how to design a DynamoDB table is way too broad a topic to cover here.
You will probably benefit from my article: https://lukasliesis.medium.com/learning-from-aws-fcb0cc71926b
There are two brilliant videos by Rick Houlihan mentioned on DynamoDB Architecture:
https://www.youtube.com/watch?v=HaEPXoXVf2k
https://www.youtube.com/watch?v=6yqfmXiZTlM
I highly recommend watching both multiple times to get into the DynamoDB mindset: understanding query heat maps, putting a whole app into a single table, and using DynamoDB to its best.
Update: Adaptive Capacity
This addresses probably the most common issue that once existed in DynamoDB.
If you’ve used DynamoDB before, you’re probably aware that DynamoDB recommends building your application to deliver evenly distributed traffic for optimal performance. That is, your requests should be evenly distributed across your primary key. This is because before adaptive capacity, DynamoDB allocated read and write throughput evenly across partitions. For example, if you had a table capable of 400 writes per second (in other words, 400 write capacity units, or “WCUs”) distributed across four partitions, each partition would be allocated 100 WCUs. If you had a nonuniform workload with one partition receiving more than 100 writes per second (a hot partition), those requests might have returned a ProvisionedThroughputExceededException error.
Read full article here:
https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Yes. Assuming you're using the userId from the requests from Alexa, it will provide an excellent primary key for your DynamoDB database.
"user": {
"userId": "amzn1.ask.account.[unique-value-here]",
Simply store the stocks as a JSON object against each user.
{
"stocks": [
{
"stockAmount": 45,
"stockName": "AMZN"
}
]
}
Storing the Object
DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient(...));
Table table = dynamo.getTable("stocks");

// the JSON document holding this user's stocks
String json = "{\"stocks\": [{\"stockAmount\": 45, \"stockName\": \"AMZN\"}]}";

Item item = new Item()
        .withPrimaryKey("alexa_id", 1)
        .withJSON("document", json);
table.putItem(item);
and how to get it back
DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient(...));
Table table = dynamo.getTable("stocks");
Item documentItem =
table.getItem(new GetItemSpec()
.withPrimaryKey("alexa_id", 1)
.withAttributesToGet("document"));
System.out.println(documentItem.getJSONPretty("document"));
As your users add or remove stocks, you'll need to update the stocks array for the user. You'll also be subject to a 400 KB limit for each of your DynamoDB items; the size of a given item includes the attribute names (in UTF-8) and the attribute values.
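One hedged way to do the append without rewriting the whole document is an update expression with list_append. The sketch below reuses the "document" attribute and "alexa_id" key from the snippets above; the new stock values are made up for illustration:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class AppendStock {
    public static void main(String[] args) {
        Table table = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient())
                .getTable("stocks");
        // Append one entry to document.stocks; #d sidesteps reserved words.
        table.updateItem(new UpdateItemSpec()
                .withPrimaryKey("alexa_id", 1)
                .withUpdateExpression("SET #d.stocks = list_append(#d.stocks, :s)")
                .withNameMap(new NameMap().with("#d", "document"))
                .withValueMap(new ValueMap().withList(":s",
                        new ValueMap().withInt("stockAmount", 10)
                                      .withString("stockName", "TSLA"))));
    }
}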
Consider a table like this to store a user's contacts -
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
    // ^-- note the composite partition key
);
The composite partition key results in a row per contact.
Let's say there are 100 million users and every user has a few hundred contacts.
I can look up a particular user's particular contact's data by using
SELECT contact_data FROM contacts WHERE user_name='foo' AND contact_name='bar'
However, is it also possible to look up all contact names for a user using something like,
SELECT contact_name FROM contacts WHERE user_name='foo'
? Could the WHERE clause contain only some of the columns that form the primary key?
EDIT -- I tried this and Cassandra doesn't allow it. So my question now is: how would you model the data to support two queries -
Get data for a specific user & contact
Get all contact names for a user
I can think of two options -
Create another table containing user_name and contact_name, with only user_name as the partition key. But then, if a user has too many contacts, could that become a wide-row issue?
Create an index on user_name. But given 100M users with only a few hundred contacts per user, would user_name be considered a high-cardinality value and hence bad for use in an index?
In an RDBMS, the query planner might be able to create an efficient query plan for that kind of query. But Cassandra cannot; Cassandra would have to do a table scan. Cassandra tries hard not to let you make those kinds of queries, so it rejects them.
No, you cannot. If you look at the mechanism of how Cassandra stores data, you will understand why you cannot query by part of a composite partition key.
Cassandra distributes data across nodes based on the partition key. The coordinator of a write request generates a hash token from the partition key using the murmur3 algorithm and sends the write request to the token's owner (each node owns a token range). During a read, a coordinator again calculates the hash token from the partition key and sends the read request to the token's owner node.
Since you are using a composite partition key, during a write request all components of the key (user_name, contact_name) are used to generate the hash token. The owner node of this token has the entire row. During a read request, you have to provide all components of the key to calculate the token and issue the read request to the correct owner of that token. Hence, Cassandra forces you to provide the entire partition key.
You could use two different tables with the same structure but different partition keys:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE TABLE contacts_by_users (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name), contact_id)
);
With this structure you have data duplication and you have to maintain both tables manually.
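A hedged sketch of keeping both tables in step from application code (DataStax Java driver 3.x; the contact point and keyspace name "demo" are assumptions), using a logged batch so the two inserts go together:
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class DualWrite {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {
            BatchStatement batch = new BatchStatement(); // LOGGED by default
            batch.add(new SimpleStatement(
                    "INSERT INTO contacts (user_name, contact_name, contact_id) VALUES (?, ?, ?)",
                    "foo", "bar", 1));
            batch.add(new SimpleStatement(
                    "INSERT INTO contacts_by_users (user_name, contact_name, contact_id) VALUES (?, ?, ?)",
                    "foo", "bar", 1));
            session.execute(batch);
        }
    }
}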
If you are using Cassandra > 3.0, you can also use a materialized view:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE MATERIALIZED VIEW contacts_by_users AS
    SELECT *
    FROM contacts
    WHERE user_name IS NOT NULL
      AND contact_name IS NOT NULL
      AND contact_id IS NOT NULL
    PRIMARY KEY ((user_name), contact_name, contact_id)
    WITH CLUSTERING ORDER BY (contact_name ASC);
In this case, you only have to maintain the contacts table; the view will be updated automatically.
Please note that I am using NoSQL for the first time; pretty much every concept in this NoSQL world is new to me, having been on RDBMS for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move it out of MySQL, where transactions and the relational model don't make sense. What I would get is AP from CAP [availability and partition tolerance].
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of the application is similar to logging of activity!
I would like to move this to NoSQL per my requirements, and separate it from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking in terms of a map:
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data in the value!
After reading through user-defined types in Cassandra, could I use a UDT as the value, which essentially gives one key and multiple values? Otherwise, use normal columns without a UDT. One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and, within an application, each entity will be unique!
There is no application/business function that accesses this data without the key; in simple terms, there is no requirement to fetch data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least part of it). You create tables like so:
create table event (
    id uuid,
    timestamp timeuuid,
    some_column text,
    some_column2 list<text>,
    some_column3 map<text, text>,
    some_column4 map<text, text>,
    primary key (id, timestamp)  // further clustering columns could follow
);
Note the primary key: there are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key; these are called clustering keys. To query, you almost always hit a partition (by specifying equality in the WHERE clause). Any further filters in your query are then applied within the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
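A hedged sketch of that query pattern against the event table above (DataStax Java driver 3.x; the contact point and keyspace are assumptions): equality on the partition key, then a range on the clustering key:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class EventsForHour {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {
            UUID eventId = UUID.randomUUID(); // stand-in for a real event id
            // Hit one partition by equality, then range-filter on timestamp.
            ResultSet rs = session.execute(
                    "SELECT * FROM event WHERE id = ? AND timestamp >= ?",
                    eventId, UUIDs.startOf(System.currentTimeMillis() - 3600_000L));
            rs.forEach(row -> System.out.println(row));
        }
    }
}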
In terms of structure, you have a few column types: some primitives like text, int, etc., but also three collections - sets, lists, and maps. Yes, maps. UDTs are typically more useful when used inside collections; e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which lets you store "arbitrary" key-value data; that is what it seems you're looking to do.
One thing to watch out for: your primary key is unique per record. If you do another insert with the same primary key, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's part of the primary key for an existing row.
You mentioned that querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources... so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very, very good choice.