I've been working with Vogels and Node.js - Vogels handles creating the schema for me in DynamoDB Local, and it works perfectly.
For some reason, I'm having a problem deploying the app to AWS using the DynamoDB service. I am getting the error:
Details:TypeError: Cannot read property 'hashKey' of undefined
I have even tried to set up the schema manually; however, DynamoDB does not have an option for hashKey in the AWS Console. It only gives options for:
Partition key (String/Binary/Number)
Sort key (String/Binary/Number)
Has anyone come across this or know how to handle creating the schema?
When you say two primary keys, I presume you mean a hash key and a sort key (two separate attributes). Please note that two attributes can't be part of a hash key:
Hash Key - 1 attribute
Sort Key - 1 attribute
DynamoDB supports two different kinds of primary keys:
Partition Key—A simple primary key, composed of one attribute, known as the partition key. DynamoDB uses the partition key's value as input to an internal hash function; the output from the hash function determines the partition where the item will be stored. No two items in a table can have the same partition key value.
Partition Key and Sort Key—A composite primary key composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key. DynamoDB uses the partition key value as input to an internal hash function; the output from the hash function determines the partition where the item will be stored. All items with the same partition key are stored together, in sorted order by sort key value. It is possible for two items to have the same partition key value, but those two items must have different sort key values.
Reference: Primary key (AWS documentation)
Screenshot for creating the table in the AWS console:
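For reference, a minimal sketch of a Vogels model, assuming a hypothetical Account table and attribute names: Vogels' hashKey corresponds to what the AWS console calls the partition key, and rangeKey corresponds to the sort key.

var vogels = require('vogels');
var Joi = require('joi');

vogels.AWS.config.update({ region: 'us-east-1' }); // point at real DynamoDB instead of DynamoDB Local

var Account = vogels.define('Account', {
  hashKey: 'email',      // "Partition key" in the AWS console
  rangeKey: 'createdAt', // "Sort key" in the AWS console
  schema: {
    email: Joi.string().email(),
    createdAt: Joi.string(),
    name: Joi.string()
  }
});

// createTables issues CreateTable calls for any defined models whose tables don't exist yet
vogels.createTables(function (err) {
  if (err) console.log('Error creating tables:', err);
  else console.log('Tables have been created');
});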
Related
I am creating a table in cassandra database but I am getting an allow filtering error:
CREATE TABLE device_check (
device_id int,
checked_at bigint,
is_power boolean,
is_locked boolean,
PRIMARY KEY ((device_id), checked_at)
);
When I run the query:
SELECT * FROM device_check where checked_at > 1234432543
it gives an ALLOW FILTERING error. I tried removing the brackets from device_id, but it gives the same error. Even when I set only checked_at as the primary key, it still won't work with the > operator; with the = operator it works.
A PRIMARY KEY in Cassandra contains two types of keys:
Partition key
Clustering Key
It is expressed as `PRIMARY KEY ((partition key), clustering keys)`.
Cassandra is a distributed database where data can be present on any of the nodes, depending on the partition key. To search data fast, Cassandra asks users to send a partition key identifying the node where the data resides, and queries that node. If you don't give a partition key in your query, Cassandra complains that you are not querying the right way: it would have to search all the nodes, so it raises an ALLOW FILTERING error when you try to query without the partition key.
With respect to > not being supported on the partition key, the answer remains the same: a range search would force Cassandra to scan all the nodes to respond, which is not the right way to use Cassandra.
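As a minimal sketch with the Node.js cassandra-driver, assuming the corrected device_check table above and placeholder connection details: equality on the partition key plus a range on the clustering key is the query shape Cassandra supports without ALLOW FILTERING.

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'demo'
});

// Pin the partition with equality, then range-filter on the clustering key.
const query = 'SELECT * FROM device_check WHERE device_id = ? AND checked_at > ?';

client.execute(query, [42, 1234432543], { prepare: true })
  .then(function (result) { console.log(result.rows); })
  .catch(console.error);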
I have a table in Cassandra for saving messages. I have a uuid as the primary key, but I need to send clients bigints as message keys, which must be unique for that user.
How can I achieve that? Is there a way to combine the user primary key (which is a bigint) and the message key to generate a bigint message_id for that user?
Or should I use a bigint as the primary key for messages? If so, how can I generate unique bigints?
Cassandra allows you to have a compound primary key; in this case, message_id seems a good candidate to be used as a clustering key.
For more information, you can take a look here and here
There is no way to generate an auto-incremented bigint in Cassandra.
Either keep the key-generation logic somewhere else and use the generated value as part of the key in Cassandra,
or
build your own id service from which you fetch the next id. That service would have to run as a single instance, which makes it a scary, non-scaling factor.
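A minimal sketch of the compound-key approach, assuming a hypothetical messages table and placeholder connection details; the bigint message_id is generated outside Cassandra (e.g. by such an id service), since Cassandra has no auto-increment.

const cassandra = require('cassandra-driver');
const Long = cassandra.types.Long;

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'demo'
});

const createTable =
  'CREATE TABLE IF NOT EXISTS messages (' +
  '  user_id bigint,' +
  '  message_id bigint,' + // unique per user, not globally
  '  body text,' +
  '  PRIMARY KEY ((user_id), message_id))';

client.execute(createTable)
  .then(function () {
    return client.execute(
      'INSERT INTO messages (user_id, message_id, body) VALUES (?, ?, ?)',
      [Long.fromNumber(1), Long.fromNumber(100), 'hello'],
      { prepare: true });
  })
  .catch(console.error);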
If I store documents without providing a partition key, will the documentId be treated as the partition key of the logical partition?
If yes: what about a billion logical partitions in that collection? My queries only look up by documentId.
Now, inside the document JSON,
I have multiple fields, and I have provided /asset as the partition key. Is this a composite partition key now: /asset/documentId?
Or does /asset tell which partition to search for the documentId in?
If I store documents without providing a partition key, will the
documentId be treated as the partition key of the logical partition?
No. If you create a document without a partition key, the document id will not be treated as the partition key. The Cosmos DB engine puts all documents without a partition key value in a hidden logical partition. This particular partition can be accessed by specifying the partition key as {}.
You define the partition key when you create the collection (according to the screenshot, asset is the partition key in your case). If you don't provide a partition key when you create a collection, it will be limited to 10 GB of data (because without a partition key it can't be sharded).
Only the partition key is used to determine the partition of the document; the other fields are irrelevant when deciding which partition the document belongs to.
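As a minimal sketch with the @azure/cosmos Node.js SDK, assuming placeholder endpoint, key, database, and container names: per the answer above, a document created without a partition key lives in the hidden logical partition, addressed by passing {} as the partition key value.

const { CosmosClient } = require('@azure/cosmos');

const client = new CosmosClient({
  endpoint: 'https://<account>.documents.azure.com', // placeholder
  key: '<key>'                                       // placeholder
});
const container = client.database('mydb').container('mycoll');

async function readUnpartitionedDoc(documentId) {
  // {} targets the hidden partition for documents stored without a partition key value
  const { resource } = await container.item(documentId, {}).read();
  return resource;
}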
I'm coming from an Azure DocumentDB (Cosmos DB) background to AWS DynamoDB, for an application where DynamoDB is already being used.
I have some confusion around the partition key in DynamoDB.
As I understand it, the partition key is used to segregate the data into different partitions as it grows. However, many suggest using the primary key as the partition key, such as a user id, customer id, or order id. In that case I am not sure how we achieve better performance, as we have many partitions, so a query may need to be executed on multiple servers.
For example, if I wanted to develop a multi-tenant system where a single table stores all tenants' data, partitioned by tenant id, I would do the following in DocumentDB:
1) Storing data
Create objects with the following schema:
Primary key: Order Id
Partition key: Tenant id
2) Retrieving all records for a tenant
SELECT * FROM Orders o WHERE o.tenantId="tenantId"
3) Retrieving a record by id for a tenant
SELECT * FROM Orders o WHERE o.Id='id' and o.tenantId="tenantId"
4) Retrieving all records for a tenant with sorting
SELECT * FROM Orders o WHERE o.tenantId="tenantId" order by o.CreatedData
// by default all fields in DocumentDB are indexed, so ORDER BY just works
How do I achieve the same operations in DynamoDB?
Finally I have found out how to use DynamoDB properly. Thanks to Jesse Carter; his comment was very helpful for understanding DynamoDB better. I am answering my own question now.
Compared to other NoSQL databases, DynamoDB is a bit difficult because its terminology is so confusing. Below is a simplified DynamoDB table design for a few common scenarios.
Primary key
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all the other products out there, but it is a fact. Primary keys (in DynamoDB) are actually "partition keys".
Finding 1
You are always required to supply the primary key as part of a query.
Scenario 1 - Key-value store
Assume you want to create a table with an Id and multiple other attributes, and you query based on the Id attribute only. In this case Id could be the primary key.
|---------------------|------------------|
| User Id | Name |
|---------------------|------------------|
| 12 | value1 |
| 13 | value2 |
|---------------------|------------------|
We can have User Id as the "Primary Key (Partition Key)", as in the sketch below.
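A minimal sketch with the AWS SDK's DocumentClient, assuming a hypothetical Users table keyed on UserId as above; supplying the primary (partition) key is what makes the lookup possible.

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

docClient.get({
  TableName: 'Users',
  Key: { UserId: 12 } // the primary (partition) key must be supplied
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Item); // { UserId: 12, Name: 'value1' }
});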
Scenario 2
Now say we want to store messages for users as shown below, and we will query by user id to retrieve all messages for a user.
|---------------------|------------------|
| User Id | Message Id |
|---------------------|------------------|
| 12 | M1 |
| 12 | M2 |
| 13 | M3 |
|---------------------|------------------|
Still "User Id" shall be a primary key for this table. Please remember Primary key in dynamodb not need to be unique per document. Message Id can be Sort key
So what is Sort key.
Sort key is a kind of unique key within a partition. Combination of Partition key, and Sort key has to be unique.
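A minimal sketch of that query, assuming a hypothetical Messages table with UserId as the partition key and MessageId as the sort key.

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

docClient.query({
  TableName: 'Messages',
  KeyConditionExpression: 'UserId = :uid', // the partition key is still required
  ExpressionAttributeValues: { ':uid': 12 }
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Items); // M1 and M2, returned in MessageId order
});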
Creating a table locally
If you are using Visual Studio, you can install the AWS Toolkit for Visual Studio to create local tables on your machine for testing.
Note: the table-creation dialog adds some more terms:
hash key and range key. DynamoDB always surprises, doesn't it? :) Actually:
(Primary Key = Partition Key = Hash Key) != your application object's primary key
As per our second scenario, "Message Id" is supposed to be the primary key for our application; however, in DynamoDB terms, User Id becomes the primary key so we get the partitioning benefits.
(Sort Key = Range Key) = could be your application object's primary key
Local Secondary Indexes
We can create indexes within a partition; these are called local secondary indexes. For example, suppose we want to retrieve messages for a user based on message status:
|------------|--------------|------------|
| User Id | Message Id | Status |
|------------|--------------|------------|
| 12 | M1 | 1 |
| 12 | M2 | 0 |
| 13 | M3 | 2 |
|------------|--------------|------------|
Primary Key: User Id
Sort Key: Message Id
Local Secondary Index: Status
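A minimal sketch querying a hypothetical local secondary index named StatusIndex on the Messages table; note that an LSI query still requires the partition key.

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

docClient.query({
  TableName: 'Messages',
  IndexName: 'StatusIndex',
  KeyConditionExpression: 'UserId = :uid AND #s = :status',
  ExpressionAttributeNames: { '#s': 'Status' }, // STATUS is a DynamoDB reserved word
  ExpressionAttributeValues: { ':uid': 12, ':status': 1 }
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Items); // M1 only
});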
Global Secondary Indexes
As the name states, it is a global index. If we want to retrieve a single message based on its id, without the partition key (i.e. user id), we create a global secondary index on Message Id.
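A minimal sketch querying a hypothetical global secondary index named MessageIdIndex; unlike an LSI, no partition key of the base table is needed.

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

docClient.query({
  TableName: 'Messages',
  IndexName: 'MessageIdIndex',
  KeyConditionExpression: 'MessageId = :mid',
  ExpressionAttributeValues: { ':mid': 'M1' }
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Items);
});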
Please see the explanation from the AWS documentation:
The primary key uniquely identifies each item in a table. The primary key can be simple (partition key) or composite (partition key and sort key).
When it stores data, DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based upon the partition key value. Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the partition key values. Distributing requests across partition key values distributes the requests across partitions.
For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
This does not mean that you must access all of the partition key values to achieve your throughput level; nor does it mean that the percentage of accessed partition key values needs to be high. However, be aware that when your workload accesses more distinct partition key values, those requests will be spread out across the partitioned space in a manner that better utilizes your allocated throughput level. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
As I understand it, the partition key is used to segregate the data into different partitions as it grows. However, many suggest using the primary key as the partition key, such as a user id, customer id, or order id. In that case I am not sure how we achieve better performance, as we have many partitions.
You are correct that the partition key is used in DynamoDB to segregate data into different partitions. However, the mapping between partition keys and the physical partitions in which items reside is not one-to-one.
The number of partitions is decided based on your RCU/WCU in such a way that all RCU/WCU can be utilized.
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all the other products out there, but it is a fact. Primary keys (in DynamoDB) are actually "partition keys".
This is a wrong understanding. The concept of the primary key is exactly the same as in the SQL standard, with the extra restrictions you would expect a NoSQL database to have. In short, you can have a partition key as the primary key, or a partition key and a sort key together as a composite primary key.
Please note that this is my first time using NoSQL; pretty much every concept in this NoSQL world is new to me, having been on RDBMS for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move it out of MySQL, where the transactional/relational model doesn't make sense. What I would get is the A and P of CAP [availability and partition tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (integer) |
We can safely assume that this part of the application is similar to activity logging!
I would like to move this to NoSQL per my requirements, separate from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking at the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data as values!
After reading through user-defined types in Cassandra: can I use a UDT as the value, which essentially gives one key and multiple values? Or should I just use normal columns without a UDT? One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to it, since the key varies from application to application and within an application each entity will be unique!
No application/business function accesses this data without the key; in simple terms, there is no requirement to get data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp) -- further clustering keys could follow timestamp
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys.
To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then applied within the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with matches on subsequent clustering keys, in order, with a range query allowed on the last clustering key specified in your query. Anyway, that's all about querying.
In terms of structure, you have a few column types. There are primitives like text, int, etc., but also three collections: sets, lists, and maps. Yes, maps. UDTs are typically most useful when used inside collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which lets you store "arbitrary" key-value data; that is what it seems you're looking to do.
One thing to watch out for: your primary key is unique per record. If you do another insert with the same primary key, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's part of the primary key for any row.
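A minimal sketch of that upsert behaviour, assuming the event table above and placeholder connection details; the second insert silently overwrites the first because it has the same primary key.

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'demo'
});

const id = cassandra.types.Uuid.random();
const ts = cassandra.types.TimeUuid.now();
const insert = 'INSERT INTO event (id, timestamp, some_column) VALUES (?, ?, ?)';

client.execute(insert, [id, ts, 'first'], { prepare: true })
  .then(function () {
    // Same (id, timestamp) primary key: this overwrites, no error is raised.
    return client.execute(insert, [id, ts, 'second'], { prepare: true });
  })
  .then(function () {
    return client.execute('SELECT some_column FROM event WHERE id = ?', [id], { prepare: true });
  })
  .then(function (result) { console.log(result.rows[0].some_column); }) // prints "second"
  .catch(console.error);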
You mentioned that querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources... so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very, very good choice.