I have a table in Cassandra for saving messages. I use a uuid as the primary key, but I need to send clients bigints as message keys, which must be unique for that user.
How can I achieve that? Is there a way to combine the user's primary key (a bigint) with the message key to generate a bigint message_id for that user?
Or should I use a bigint as the primary key for messages? If so, how can I generate unique bigints?
Cassandra allows you to have a compound primary key; in this case, the message_id seems like a good candidate to be used as a clustering key.
For more information, you can take a look here and here
There is no way to generate an auto-incremented bigint in Cassandra.
Either keep that key-generation logic somewhere else and use the generated value as part of the key in Cassandra,
or
build your own ID service from which you fetch the next ID. That service would have to run as a single instance, making it a non-scaling, scary factor.
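If you keep the generation logic in the application, one common pattern that avoids a single-instance ID service is a Snowflake-style generator: each application node packs a millisecond timestamp, its own node ID, and a per-millisecond sequence into one 63-bit integer. This is only a sketch under assumed parameters (the bit layout and epoch below are my choices, not from the original answer):

```python
import time
import threading

class SnowflakeGenerator:
    """Builds 63-bit ids: 41 bits of milliseconds, 10 bits of node id, 12 bits of sequence."""
    EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC, an arbitrary custom epoch

    def __init__(self, node_id):
        assert 0 <= node_id < 1024, "node_id must fit in 10 bits"
        self.node_id = node_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - self.EPOCH_MS
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:  # sequence exhausted, spin until next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - self.EPOCH_MS
            else:
                self.sequence = 0
            self.last_ms = now
            return (now << 22) | (self.node_id << 12) | self.sequence

gen = SnowflakeGenerator(node_id=1)
ids = [gen.next_id() for _ in range(1000)]
```

Ids generated this way are unique per node without coordination and sort roughly by creation time, which also works well as a clustering key.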
I am creating a table in a Cassandra database, but I am getting an ALLOW FILTERING error:
CREATE TABLE device_check (
device_id int,
checked_at bigint,
is_power boolean,
is_locked boolean,
PRIMARY KEY ((device_id), checked_at)
);
When I make a query
SELECT * FROM device_check where checked_at > 1234432543
But it is giving an ALLOW FILTERING error. I tried removing the brackets from device_id, but it gives the same error. Even when I tried setting only checked_at as the primary key, it still won't work with the > operator. With the = operator it works.
A PRIMARY KEY in Cassandra contains two types of keys:
Partition key
Clustering key
It is expressed as `PRIMARY KEY ((partition key), clustering keys)`.
Cassandra is a distributed database where data can be present on any node, depending on the partition key. To search data fast, Cassandra asks users to send the partition key so it can identify the node where the data resides and query only that node. If you don't give the partition key in your query, Cassandra would have to search all the nodes, so it complains that you are not querying the right way and raises the ALLOW FILTERING error.
The same reasoning explains why > is not supported on the partition key: a range search over partition keys would force Cassandra to scan all the nodes, which is not the right way to use Cassandra.
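Assuming the device_check table above, a query that pins the partition key and ranges only over the clustering column needs no ALLOW FILTERING (the device id value here is illustrative):

```sql
SELECT * FROM device_check
WHERE device_id = 42
  AND checked_at > 1234432543;
```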
I want to use Cassandra as a DB to store messages, where in my model messages are aggregated by channel.
The three most important fields of a message are:
channel_id
created_by
message_id (unique)
The main read/fetch API is: get messages by channel, sorted by created_by.
Plus, I have low-scale message updates by channel_id + message_id.
So my question is about the primary key definition.
If I define it as (channel_id, created_by),
will I be able to do an UPDATE with a WHERE clause like channel_id=X AND message_id=XX, even though message_id is not in the primary key (I do give the query the partition key)?
And if not,
if I define the primary key like this: (channel_id, created_by, message_id),
will I be able to do the read with a WHERE clause on only (channel_id, created_by),
and do the update using the WHERE clause channel_id + message_id?
Thanks
define it (channel_id, created_by) will I be able to do an UPDATE with a WHERE clause like channel_id=X and message_id=XX
No. All primary key components are required for a write operation in Cassandra. First, you will have to provide created_by; and since message_id is not part of the key, it will have to be removed from the WHERE clause.
And if not, if I define the primary key like this (channel_id, created_by, message_id), will I be able to do the read with a WHERE clause on only (channel_id, created_by)
Yes, this will work:
SELECT * FROM messages WHERE channel_id='1' AND created_by='Aaron';
This ^ works because you have provided the first two primary key components without skipping any. Cassandra can easily find the node containing the partition for channel_id, and scan down to the rows starting with created_by.
and do the update using the WHERE clause channel_id + message_id?
No. Again, you would need to provide created_by for the write to succeed.
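For completeness, a write against the (channel_id, created_by, message_id) key has to name every primary key component. A sketch, where the `body` column and the literal values are illustrative, not from the question:

```sql
UPDATE messages
SET body = 'edited'
WHERE channel_id = '1'
  AND created_by = 'Aaron'
  AND message_id = 42;
```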
Primary key selection is one of the most important parts of Cassandra data modeling, and it requires understanding the table's access patterns. I am not sure I can help with only the information you provided, but I will give it a try.
Your requirements:
Sort by created_by.
Update by channel_id + message_id.
Try having channel_id + message_id as the partition key and created_by as the clustering key. Having message_id in the primary key will also help ensure uniqueness.
I recently found the DS220 data modeling course at https://academy.datastax.com/. It is awesome.
Here's the problem.
Our 'customers' are ingested regularly as part of a bulk file upload (CSV) from clients. The data we have for them is Name, Address, PostCode, and Client Reference Number.
We save these into a Cassandra 'Customer' table.
When we do this we assign a UUID which we then use throughout the rest of the system.
The question comes down to primary keys. We really have two options:
1) UUID as the primary key, or
2) a composite primary key of (name, address, postcode).
The problems with these options:
1) We don't have the UUID at initial insert, and it's possible that the 'customers' are duplicated, so how do we de-dupe? A get (select) followed by an upsert would be inefficient.
2) Has a couple of issues: a) if we perform an update operation, there is a possibility that the UUID could be overwritten; b) name, address, and postcode couldn't be updated, as they form the composite primary key. Issue a) might not matter, since a change to the UUID will emit an event picked up by other interested services, but that kind of removes the point of a UUID. For b), we can keep alias (AKA) fields for a customer's preferred or updated details while keeping the original data for reference, though this feels clumsy.
The preferred, and easiest, way would be option 1, but without using the primary key for the initial creation; I'm not sure this is possible. With option 2, we would also need to be able to update all fields except the UUID column.
You can only really use the UUID as the partition key if you know it beforehand. You won't be able to insert new customers into the table if you don't have the UUID.
Based on your description, you use the UUID as the unique ID for the rest of your system, so it really is the perfect partition key. You will, however, need to find a solution for the situations where you don't have the customer's UUID. Cheers!
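One possible solution for those situations, which is my suggestion rather than anything from the post, is to derive a deterministic, name-based UUID (version 5) from the fields you do have at ingest time. The same customer record then always maps to the same key, so duplicate rows in the CSV collapse on upsert:

```python
import uuid

# Any fixed namespace keeps the derivation stable across runs;
# reusing the standard DNS namespace here is an arbitrary choice.
CUSTOMER_NS = uuid.NAMESPACE_DNS

def customer_uuid(name: str, address: str, postcode: str) -> uuid.UUID:
    """Deterministically derive a customer UUID from identifying fields."""
    # Normalize the fields so trivially different inputs dedupe to one key.
    key = "|".join(part.strip().lower() for part in (name, address, postcode))
    return uuid.uuid5(CUSTOMER_NS, key)

a = customer_uuid("Jane Doe", "1 High St", "AB1 2CD")
b = customer_uuid("JANE DOE ", "1 High St", "AB1 2CD")  # same customer, messy input
```

The trade-off is that the UUID is now a function of (name, address, postcode), so it shares option 2's weakness: if those fields legitimately change, the derived key changes too.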
I've been working with Vogels and NodeJS. Vogels handles creating the schema for me in DynamoDB Local, and it works perfectly.
For some reason, I'm having a problem trying to deploy the app on AWS using the DynamoDB service. I am getting the error:
Details:TypeError: Cannot read property 'hashKey' of undefined
I have even tried to set up the schema manually; however, the AWS Console for DynamoDB does not have an option for hashKey. It only gives options for:
Primary Key/partition (String/Binary/Number)
Sort key (String/Binary/Number)
Has anyone come across this or know how to handle creating the schema?
When you say two primary keys, I presume you mean a hash key and a sort key (two separate attributes).
Please note that two attributes can't be part of a hash key:
Hash Key - 1 attribute
Sort Key - 1 attribute
DynamoDB supports two different kinds of primary keys:
Partition Key—A simple primary key, composed of one attribute, known as the partition key. DynamoDB uses the partition key's value as input to an internal hash function; the output from the hash function determines the partition where the item will be stored. No two items in a table can have the same partition key value.
Partition Key and Sort Key—A composite primary key composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key. DynamoDB uses the partition key value as input to an internal hash function; the output from the hash function determines the partition where the item will be stored. All items with the same partition key are stored together, in sorted order by sort key value. It is possible for two items to have the same partition key value, but those two items must have different sort key values.
(Screenshot: creating a table with a partition key and sort key in the AWS console.)
Consider a table like this to store a user's contacts -
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
    // ^-- Note the composite partition key
);
The composite partition key results in a separate partition per contact.
Let's say there are 100 million users and every user has a few hundred contacts.
I can look up a particular user's particular contact's data by using
SELECT contact_data FROM contacts WHERE user_name='foo' AND contact_name='bar'
However, is it also possible to look up all contact names for a user using something like,
SELECT contact_name FROM contacts WHERE user_name='foo'
Could the WHERE clause contain only some of the columns that form the primary key?
EDIT -- I tried this, and Cassandra doesn't allow it. So my question now is: how would you model the data to support these two queries?
Get data for a specific user & contact
Get all contact names for a user
I can think of two options -
Create another table containing user_name and contact_name, with only user_name as the primary key. But then, if a user has too many contacts, could that be a wide-row issue?
Create an index on user_name. But given 100M users with only a few hundred contacts per user, would user_name be considered a high-cardinality value and hence bad for use in an index?
In an RDBMS, the query planner might be able to create an efficient plan for that kind of query, but Cassandra cannot: it would have to do a full table scan. Cassandra tries hard not to let you make those kinds of queries, so it rejects this one.
No, you cannot. If you look at how Cassandra stores data, you will understand why you cannot query by only part of a composite partition key.
Cassandra distributes data across nodes based on the partition key. The coordinator of a write request generates a hash token from the partition key using the Murmur3 algorithm and sends the write to the token's owner (each node owns a range of tokens). During a read, the coordinator again calculates the hash token from the partition key and sends the read request to the token's owner node.
Since you are using a composite partition key, all components of the key (user_name, contact_name) are used to generate the hash token for a write, and the owner node of that token holds the entire row. For a read, you must provide all components of the key so the same token can be calculated and the request sent to the correct owner. Hence, Cassandra requires the entire partition key.
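The token mechanics can be pictured with any hash function: the token is computed over the full serialized composite key, so omitting a component produces an unrelated token rather than a "prefix match". Here Python's hashlib stands in for Cassandra's Murmur3 (the serialization format is made up for illustration):

```python
import hashlib

def token(*key_parts: str) -> int:
    """Stand-in for Cassandra's partition hash: hash the whole serialized key."""
    serialized = "\x00".join(key_parts).encode()
    # Take 8 bytes so the result resembles Cassandra's signed 64-bit token.
    return int.from_bytes(hashlib.md5(serialized).digest()[:8], "big", signed=True)

full = token("foo", "bar")   # token for partition key (user_name, contact_name)
partial = token("foo")       # a completely different token, not a prefix of `full`
```

Because `partial` bears no relation to `full`, a node cannot be located from user_name alone; every node would have to be scanned.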
You could use two different tables with the same structure but different partition keys:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE TABLE contacts_by_users (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name), contact_id)
);
With this structure you have data duplication, and you have to maintain both tables manually.
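With the second table in place, the per-user query becomes a plain partition lookup (the literal value is illustrative):

```sql
SELECT contact_name FROM contacts_by_users WHERE user_name = 'foo';
```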
If you are using Cassandra 3.0 or later, you can also use materialized views:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);
CREATE MATERIALIZED VIEW contacts_by_users
AS SELECT *
FROM contacts
WHERE user_name IS NOT NULL
    AND contact_name IS NOT NULL
    AND contact_id IS NOT NULL
PRIMARY KEY ((user_name), contact_name, contact_id)
WITH CLUSTERING ORDER BY (contact_name ASC);
In this case, you only have to maintain the contacts table; the view is updated automatically.