I want to use Cassandra as a DB to store messages, where in my model messages are aggregated by channel.
The three most important fields of a message are:
channel_id
created_by
message_id (unique)
The main read/fetch API is: get messages by channel, sorted by created_by.
In addition, I have a low-scale update of messages by channel_id + message_id.
So my question is regarding the primary key definition.
If I define it as (channel_id, created_by),
will I be able to do an UPDATE with a WHERE clause like channel_id=X and message_id=XX, even though message_id is not in the primary key (I do give the query the partition key)?
And if not,
if I define the primary key like this: (channel_id, created_by, message_id),
will I be able to do the read with a WHERE clause on only one clustering column (channel_id, created_by),
and do the update using the WHERE clause channel_id + message_id?
Thanks
define it (channel_id, created_by) will I be able to do an UPDATE with a WHERE clause like channel_id=X and message_id=XX
No. All primary key components are required for a write operation in Cassandra. First, you would have to provide created_by; and since message_id is not part of the key, it would have to be removed from the WHERE clause.
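As a minimal sketch (the body payload column and the text types are assumptions, since the question does not give them), the only UPDATE shape that key allows is:
-- With PRIMARY KEY (channel_id, created_by), both key columns are required:
UPDATE messages SET body = 'edited'
WHERE channel_id = 'X' AND created_by = 'Aaron';
-- Adding message_id = 'XX' here would be rejected, because message_id is not a primary key column.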
And if not, if I define the primary key like this: (channel_id, created_by, message_id), will I be able to do the read with a WHERE clause on only one clustering column (channel_id, created_by)
Yes, this will work:
SELECT * FROM messages WHERE channel_id='1' AND created_by='Aaron';
This ^ works because you have provided the first two primary key components without skipping any. Cassandra can easily find the node containing the partition for channel_id, and then scan down to the rows starting with that created_by value.
and do the update using the WHERE clause channel_id + message_id?
No. Again, you would need to provide created_by for the write to succeed.
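A minimal sketch of that second option (the body column and the text types are assumptions):
CREATE TABLE messages (
    channel_id text,
    created_by text,
    message_id text,
    body text,
    PRIMARY KEY (channel_id, created_by, message_id)
);
-- Read by partition key plus the first clustering column: works
SELECT * FROM messages WHERE channel_id = 'X' AND created_by = 'Aaron';
-- Update: every primary key component must appear in the WHERE clause
UPDATE messages SET body = 'edited'
WHERE channel_id = 'X' AND created_by = 'Aaron' AND message_id = 'XX';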
Choosing the primary key is one of the most important parts of Cassandra data modeling, and it requires understanding the table and its access patterns. I am not sure I can fully help with the information you have provided, but I will give it a try.
Your requirements:
Sort by created_by.
Update with channel_id + message_id
Try having channel_id + message_id as the partition key and created_by as the clustering key. Having message_id in the primary key will also help ensure uniqueness.
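A sketch of that suggestion (the table name, types, and body column are assumptions):
CREATE TABLE messages_by_channel (
    channel_id text,
    message_id text,
    created_by text,
    body text,
    PRIMARY KEY ((channel_id, message_id), created_by)
);
-- The partition for a message can now be located by channel_id + message_id:
SELECT * FROM messages_by_channel WHERE channel_id = 'X' AND message_id = 'XX';
-- Note: an UPDATE of a regular column would still also need created_by,
-- because it is a clustering column.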
Recently I found the DS220 course on data modeling at https://academy.datastax.com/. It is awesome.
I have a table in Cassandra for saving messages. I have a uuid as the primary key, but I need to send clients bigints as message keys, which must be unique for that user.
How can I achieve that? Is there a way to combine the user primary key, which is a bigint, with the message key to generate a bigint message_id for that user?
Or should I use a bigint as the primary key for messages? If so, how can I generate unique bigints?
Cassandra allows you to have a compound primary key; in this case, message_id seems like a good candidate to use as a clustering key.
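A minimal sketch of what that could look like (column names and types are assumptions, since the question only mentions a user key and a message key):
CREATE TABLE messages_by_user (
    user_id bigint,
    message_id bigint,   -- generated client-side; must be unique per user
    body text,           -- hypothetical payload column
    PRIMARY KEY (user_id, message_id)
);
-- message_id only has to be unique within a single user's partition:
SELECT * FROM messages_by_user WHERE user_id = 42 AND message_id = 1001;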
For more information, you can take a look here and here
There is no way to generate an auto-incremented bigint in Cassandra.
You either keep that key-generation logic somewhere else and use the generated value as part of the key in Cassandra,
or
build your own ID service from which you fetch the next ID. Such a service would run as a single instance, which makes it a scary, non-scaling factor.
I am coming from an Azure DocumentDB (Cosmos DB) background to AWS DynamoDB, for an application where DynamoDB is already being used.
I am confused about the partition key in DynamoDB.
As I understand it, the partition key is used to segregate data into different partitions as the data grows. However, many suggest using the primary key as the partition key, such as User Id, Customer Id, or Order Id. In that case I am not sure how we achieve better performance, as we have many partitions, so a query may need to be executed on multiple servers.
For example, say I wanted to develop a multi-tenant system using a single table to store all tenants' data, partitioned by tenant id. In DocumentDB I would do it as shown below.
1) Storing data
Create objects with the following schema.
Primary key: Order Id
Partition key: Tenant id
2) Retrieving all records for a tenant
SELECT * FROM Orders o WHERE o.tenantId="tenantId"
3) Retrieving a record by id for a tenant
SELECT * FROM Orders o WHERE o.Id='id' and o.tenantId="tenantId"
4) Retrieving all records for a tenant with sorting
SELECT * FROM Orders o WHERE o.tenantId="tenantId" order by o.CreatedData
//by default all fields in DocumentDB are indexed, so ORDER BY just works
How do I achieve the same operations in DynamoDB?
Finally I have found out how to use DynamoDB properly. Thanks to @Jesse Carter; his comment was very helpful for understanding DynamoDB better. I am answering my own question now.
Compared to other NoSQL databases, DynamoDB is a bit difficult because the terminology is so confusing. Below I have described simplified DynamoDB table designs for a few common scenarios.
Primary key
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all other products out there, but that is the fact. Primary keys (in DynamoDB) are actually "partition keys".
Finding 1
You are always required to supply the primary key as part of a query.
Scenario 1 - Key-value store
Assume you want to create a table with an Id and multiple other attributes, and you query based on the Id attribute only. In this case, Id could be the primary key.
|---------------------|------------------|
| User Id | Name |
|---------------------|------------------|
| 12 | value1 |
| 13 | value2 |
|---------------------|------------------|
We can have User Id as "Primary Key (Partition Key)"
Scenario 2
Now say we want to store messages for users as shown below, and we will query by user id to retrieve all messages for a user.
|---------------------|------------------|
| User Id | Message Id |
|---------------------|------------------|
| 12 | M1 |
| 12 | M2 |
| 13 | M3 |
|---------------------|------------------|
Still "User Id" shall be a primary key for this table. Please remember Primary key in dynamodb not need to be unique per document. Message Id can be Sort key
So what is Sort key.
Sort key is a kind of unique key within a partition. Combination of Partition key, and Sort key has to be unique.
Creating a table locally
If you are using Visual Studio, you can install the AWS Toolkit for Visual Studio to create local tables on your machine for testing.
Note: creating a table this way introduces some more terms: hash key and range key. DynamoDB is always full of surprises, isn't it? :) Actually:
(Primary Key = Partition Key = Hash Key) != your application object's primary key
In our second scenario, "Message Id" is supposed to be the primary key for our application; however, in DynamoDB terms, User Id becomes the primary key in order to get the benefits of partitioning.
(Sort Key = Range Key) = could be your application object's primary key
Local Secondary Indexes
We can create indexes within a partition; these are called local secondary indexes. For example, if we want to retrieve messages for a user based on message status:
|------------|--------------|------------|
| User Id | Message Id | Status |
|------------|--------------|------------|
| 12 | M1 | 1 |
| 12 | M2 | 0 |
| 13 | M3 | 2 |
|------------|--------------|------------|
Primary Key: User Id
Sort Key: Message Id
Local Secondary Index: Status
Global Secondary Indexes
As the name states, it is a global index. If we want to retrieve a single message based on its id, without the partition key (i.e., User Id), then we create a global secondary index on Message Id.
Please see the explanation from the AWS documentation:
The primary key uniquely identifies each item in a table. The primary key can be simple (partition key) or composite (partition key and sort key).
When it stores data, DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based upon the partition key value. Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the partition key values. Distributing requests across partition key values distributes the requests across partitions.
For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
This does not mean that you must access all of the partition key values to achieve your throughput level; nor does it mean that the percentage of accessed partition key values needs to be high. However, be aware that when your workload accesses more distinct partition key values, those requests will be spread out across the partitioned space in a manner that better utilizes your allocated throughput level. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
As I understand it, the partition key is used to segregate data into different partitions as the data grows. However, many suggest using the primary key as the partition key, such as User Id, Customer Id, or Order Id. In that case I am not sure how we achieve better performance, as we have many partitions.
You are correct that the partition key is used in DynamoDB to segregate data into different partitions. However, the mapping between partition key values and the physical partitions in which items reside is not one-to-one.
The number of partitions is decided based on your RCU/WCU, in such a way that all of the provisioned RCU/WCU can be utilized.
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all other products out there, but that is the fact. Primary keys (in DynamoDB) are actually "partition keys".
This is a misunderstanding. The concept of the primary key is exactly the same as in the SQL standard, with the extra restrictions you would expect a NoSQL database to have. In short, you can have the partition key alone as the primary key, or the partition key plus the sort key as a composite primary key.
Consider a table like this to store a user's contacts -
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
    // ^-- Note the composite partition key
);
The composite partition key results in a row per contact.
Let's say there are 100 million users and every user has a few hundred contacts.
I can look up a particular user's particular contact's data by using
SELECT contact_data FROM contacts WHERE user_name='foo' AND contact_name='bar'
However, is it also possible to look up all contact names for a user using something like the following?
SELECT contact_name FROM contacts WHERE user_name='foo'
In other words, can the WHERE clause contain only some of the columns that form the primary key?
EDIT: I tried this and Cassandra doesn't allow it. So my question now is: how would you model the data to support these two queries?
Get data for a specific user & contact
Get all contact names for a user
I can think of two options:
Create another table containing user_name and contact_name, with only user_name as the primary key. But then, if a user has too many contacts, could that become a wide-row issue?
Create an index on user_name. But given 100M users with only a few hundred contacts per user, would user_name be considered a high-cardinality value and therefore a bad fit for a secondary index?
In an RDBMS, the query planner might be able to create an efficient query plan for that kind of query, but Cassandra cannot; it would have to do a table scan. Cassandra tries hard not to let you make those kinds of queries, so it rejects them.
No, you cannot. If you look at the mechanism of how Cassandra stores data, you will understand why you cannot query by part of a composite partition key.
Cassandra distributes data across nodes based on the partition key. The coordinator of a write request generates a hash token from the partition key using the Murmur3 algorithm and sends the write request to the token's owner (each node owns a range of tokens). During a read, the coordinator again calculates the hash token from the partition key and sends the read request to the token's owner node.
Since you are using a composite partition key, all components of the key (user_name, contact_name) are used to generate the hash token during a write request. The owner node of this token has the entire row. During a read request, you have to provide all components of the key so that the token can be calculated and the read request sent to the correct owner of that token. Hence, Cassandra requires you to provide the entire partition key.
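You can see this directly in cqlsh with the built-in token() function, which takes the full partition key; a sketch against the contacts table above:
-- The token is computed over BOTH partition key columns:
SELECT token(user_name, contact_name), contact_id
FROM contacts
WHERE user_name = 'foo' AND contact_name = 'bar';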
You could use two different tables with the same structure but different partition keys:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE TABLE contacts_by_users (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name), contact_id)
);
With this structure you have data duplication and you have to maintain both tables manually.
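One common way to keep the two writes together is a logged batch; a sketch with hypothetical values (whether a batch is appropriate depends on your consistency and performance needs):
BEGIN BATCH
    INSERT INTO contacts (user_name, contact_name, contact_id, contact_data)
    VALUES ('foo', 'bar', 1, 0x00);
    INSERT INTO contacts_by_users (user_name, contact_name, contact_id, contact_data)
    VALUES ('foo', 'bar', 1, 0x00);
APPLY BATCH;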
If you are using Cassandra 3.0 or later, you can also use materialized views:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE MATERIALIZED VIEW contacts_by_users
AS
    SELECT *
    FROM contacts
    WHERE user_name IS NOT NULL
      AND contact_name IS NOT NULL
      AND contact_id IS NOT NULL
PRIMARY KEY ((user_name), contact_name, contact_id)
WITH CLUSTERING ORDER BY (contact_name ASC);
In this case, you only have to maintain the contacts table; the view will be updated automatically.
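Querying the view then covers the second access pattern, for example:
-- All contact names for one user, via the materialized view:
SELECT contact_name, contact_id FROM contacts_by_users WHERE user_name = 'foo';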
I am using Cassandra in an experimental project. My model is simple; I have the table below:
create table message(
id varchar,
msgId varchar,
tId varchar,
gtName varchar,
status varchar,
text text,
PRIMARY KEY (id, tId)
);
At the time of the first insert I only have id and tId. There is an immediate update where I get the msgId to persist, taken from the return value of the called method. There will be another call to my app containing the status, and that call only knows the msgId. In that case I will need a lookup in order to update the message table using a WHERE clause on msgId.
How can I get this working with Cassandra 2.1.0? I am also using spring-data-cassandra:1.1.0.RELEASE.
Thanks for your suggestions.
A first simple step would be to create a secondary index on that key:
create index on message(msgId);
select * from message where msgId='foo';
Secondary indexes do have some performance concerns and aren't always a good fit, depending on your data model. Another option is to create a second table which maps msgId back to id and tId:
create table msgid (
msgId varchar,
id varchar,
tId varchar,
primary key (msgId)
);
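Usage would then be a two-step lookup; a sketch with hypothetical values, assuming status is the column you want to change:
-- Step 1: resolve the message's primary key from its msgId
SELECT id, tId FROM msgid WHERE msgId = 'foo';
-- Step 2: update the message using the id and tId returned in step 1
UPDATE message SET status = 'delivered' WHERE id = 'abc' AND tId = 'xyz';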
Here are some useful discussions about usage of secondary indexes:
Cassandra query on secondary index is very slow - Stack Overflow
The sweet spot for Cassandra secondary indexing - Richard Low's blog
How to avoid secondary indexes in cassandra? - Stack Overflow