Cassandra primary key WHERE clause limitation

I want to use Cassandra as a DB to store messages, where in my model messages are aggregated by channel.
The three most important fields of a message are:
channel_id
created_by
message_id (unique)
The main read/fetch API is: get messages by channel, sorted by created_by.
In addition, I have a low volume of message updates by channel_id + message_id.
So my question is about the primary key definition.
If I define it as (channel_id, created_by),
will I be able to do an UPDATE with a WHERE clause like channel_id=X AND message_id=XX, even though message_id is not in the primary key (I do give the query the partition key)?
And if not,
if I define the primary key like this: (channel_id, created_by, message_id),
will I be able to do the read with a WHERE clause on only the first clustering column (channel_id, created_by),
and do the update using the WHERE clause channel_id + message_id?
Thanks

define it (channel_id, created_by) will I be able to do an UPDATE with WHERE clause like channel_id=X and message_id=XX
No. All primary key components are required for a write operation in Cassandra, so you would first have to provide created_by. And since message_id is not part of the key, it would have to be removed from the WHERE clause.
And if not, if I will define the primary key like this (channel_id, created_by, message_id) will I be able to do the read with WHERE clause with only 1 clustering column (channel_id, created_by)
Yes, this will work:
SELECT * FROM messages WHERE channel_id='1' AND created_by='Aaron';
This ^ works, because you have provided the first two primary key components, without skipping any. Cassandra can easily find the node containing the partition for channel_id, and scan down to the row starting with created_by.
and do the update using the WHERE clause channel_id + message_id?
No. Again, you would need to provide created_by for the write to succeed.
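To make that concrete, here is a minimal sketch of the second option (table name and column types are assumed, not from the question), with an UPDATE that works and one that Cassandra rejects:

CREATE TABLE messages (
    channel_id text,
    created_by text,
    message_id uuid,
    body text,
    PRIMARY KEY ((channel_id), created_by, message_id)
);

-- Works: every primary key component is present.
UPDATE messages SET body = 'edited'
WHERE channel_id = '1' AND created_by = 'Aaron'
  AND message_id = 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d;

-- Rejected: created_by is missing, so Cassandra cannot locate the row.
UPDATE messages SET body = 'edited'
WHERE channel_id = '1'
  AND message_id = 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d;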

Primary key selection is one of the most important parts of Cassandra data modeling. You need to understand how the table will be queried. I am not sure I can fully help with only the information you have provided, but I will still give it a try.
Your requirements:
Sort by created_by.
Update with channel_id + message_id
Try having channel_id + message_id as the partition key and created_by as the clustering key. Having message_id in the primary key will also help in ensuring uniqueness.
Recently I found the DS220 data modeling course on https://academy.datastax.com/. It is awesome.

Related

Cassandra Insert/Update without duplication when you can't rely on the primary key or uuid

Here's the problem.
Our 'customers' are ingested regularly as part of a bulk file upload (CSV) from clients. The data we have for them is Name, Address, PostCode, Client Reference Number.
We're saving these into a Cassandra 'Customer' table.
When we do this we assign a UUID which we then use throughout the rest of the system.
The question comes down to primary keys… we really have two options:
1) UUID as the primary key, or 2) a composite primary key of (name, address, postcode).
The problems with these options are:
1) We don't have the UUID at the initial insert, and it's possible that the 'customers' are duplicated, so how do we de-dupe? A get (SELECT) followed by an upsert would be inefficient.
2) This has a couple of issues: a) if we perform an update operation there is a possibility that the UUID could be overwritten; b) name, address and postcode couldn't be updated, as they form the composite primary key. Point a) might not be an issue, since a change to the UUID will emit an event that will be picked up by other interested services… but that rather removes the point of a UUID. For b) we could keep alias (AKA) fields for a customer's preferred or updated details, while keeping the original data for reference… though this feels clumsy.
The preferred, and easiest, way would be to go for option 1, but without using the primary key for the initial creation - not sure this is possible? With option 2, we would also need to be able to update all fields, with the exception of the UUID column…
You can only really use the UUID as the partition key if you know it beforehand. You won't be able to insert new customers into the table if you don't have the UUID.
Based on your description, you use the UUID as the unique ID for the rest of your system, so it really is the perfect partition key. You will, however, need to find a solution for the situations where you don't have the customer's UUID. Cheers!
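The answer doesn't spell out a solution for that gap; one possible approach (just a sketch, with assumed table and column names) is a lookup table keyed by the natural key, claimed once with a lightweight transaction so repeat ingests of the same customer reuse the same UUID:

-- Hypothetical lookup table, not part of the answer above.
CREATE TABLE customers_by_natural_key (
    name text,
    address text,
    postcode text,
    customer_id uuid,
    PRIMARY KEY ((name, address, postcode))
);

-- IF NOT EXISTS only applies the insert for the first writer; when it is not
-- applied, the returned row already carries the existing customer_id to reuse.
INSERT INTO customers_by_natural_key (name, address, postcode, customer_id)
VALUES ('Jane Doe', '1 High St', 'AB1 2CD', uuid())
IF NOT EXISTS;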

How to create a unique bigint primary key in Cassandra?

I have a table in Cassandra for saving messages. I have a uuid as the primary key, but I need to send clients bigints as message keys, which must be unique for that user.
How can I achieve that? Is there a way to combine the user primary key (which is a bigint) and a message key to generate a bigint message_id for that user?
Or should I use a bigint as the primary key for messages? If so, how can I generate unique bigints?
Cassandra allows you to have a compound primary key; in this case, message_id seems like a good candidate to be used as a clustering key.
For more information, you can take a look here and here
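A minimal sketch of that compound key (table and column names are assumed): the bigint message key then only has to be unique within each user's partition rather than globally.

CREATE TABLE messages_by_user (
    user_id bigint,
    message_id bigint,
    body text,
    PRIMARY KEY ((user_id), message_id)
);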
There is no way to generate an auto-incremented bigint in Cassandra.
You have to keep that key-generation logic somewhere else and use the result as part of the key in Cassandra,
or
build your own ID service from which you fetch the next ID. Such a service would only run as a single instance, and it would be a scary, non-scaling factor.

Cassandra CQL3 clustering order and pagination

I am building out a user favourites service using Cassandra. I want the favourites sorted by latest and then to be able to paginate over the track_ids, i.e. the front end sends back the last track_id of the 200-row page.
CREATE TABLE user_favorites (
    user_id uuid,
    track_id int,
    favourited_date timestamp,
    PRIMARY KEY ((user_id), favourited_date)
) WITH CLUSTERING ORDER BY (favourited_date DESC);
I've tried different combinations of primary and clustering keys but to no avail.
I am wondering if it is better to split this out over multiple tables also.
I solved it using the comment about the Java driver base64'ing the PagingState and returning it to the client.
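As a side note (an assumption on my part, not part of the accepted fix): keeping track_id as a final clustering column makes each favourite row unique, so two tracks favourited at the same timestamp don't overwrite each other, and it gives the "last track_id on the page" a stable position:

-- Hypothetical variant of the table above.
CREATE TABLE user_favorites (
    user_id uuid,
    track_id int,
    favourited_date timestamp,
    PRIMARY KEY ((user_id), favourited_date, track_id)
) WITH CLUSTERING ORDER BY (favourited_date DESC, track_id ASC);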

Can a cassandra table be queried using only a part of the composite partition key?

Consider a table like this to store a user's contacts -
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
    // ^-- note the composite partition key
);
The composite partition key results in a row per contact.
Let's say there are 100 million users and every user has a few hundred contacts.
I can look up a particular user's particular contact's data by using
SELECT contact_data FROM contacts WHERE user_name='foo' AND contact_name='bar'
However, is it also possible to look up all contact names for a user using something like,
SELECT contact_name FROM contacts WHERE user_name='foo'
Could the WHERE clause contain only some of the columns that form the primary key?
EDIT -- I tried this and Cassandra doesn't allow it. So my question now is: how would you model the data to support two queries -
Get data for a specific user & contact
Get all contact names for a user
I can think of two options -
Create another table containing user_name and contact_name with only user_name as the primary key. But then if a user has too many contacts, could that be a wide row issue?
Create an index on user_name. But given 100M users with only a few hundred contacts per user, would user_name be considered a high-cardinality value and hence bad for use in an index?
In an RDBMS the query planner might be able to create an efficient plan for that kind of query, but Cassandra cannot; it would have to do a table scan. Cassandra tries hard not to allow you to make those kinds of queries, so it rejects this one.
No, you cannot. If you look at how Cassandra stores data, you will understand why you cannot query by part of a composite partition key.
Cassandra distributes data across nodes based on the partition key. The coordinator of a write request generates a hash token from the partition key using the Murmur3 algorithm and sends the write to the token's owner (each node owns a range of tokens). During a read, the coordinator again calculates the hash token from the partition key and sends the read request to that token's owner node.
Since you are using a composite partition key, all components of the key (user_name, contact_name) are used to generate the hash token during a write. The owner node of that token has the entire row. During a read, you have to provide all components of the key so the token can be calculated and the request sent to the correct owner. Hence, Cassandra requires you to provide the entire partition key.
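To see this from cqlsh (an illustrative query, not from the answer), note that the token function takes every partition key component, which is why none of them can be omitted:

SELECT token(user_name, contact_name), contact_name
FROM contacts
WHERE user_name = 'foo' AND contact_name = 'bar';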
You could use two different tables with the same structure but not the same partition key:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);

CREATE TABLE contacts_by_users (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name), contact_id)
);
With this structure you have data duplication and you have to maintain both tables manually.
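With those two tables in place, the two queries from the question would look like this (example values assumed):

-- all contact names for a user, from the table partitioned by user only
SELECT contact_name FROM contacts_by_users WHERE user_name = 'foo';

-- a specific contact's data, from the table with the composite partition key
SELECT contact_data FROM contacts WHERE user_name = 'foo' AND contact_name = 'bar';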
If you are using Cassandra 3.0+, you can also use materialized views:
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_id int,
    contact_data blob,
    PRIMARY KEY ((user_name, contact_name), contact_id)
);

CREATE MATERIALIZED VIEW contacts_by_users
AS SELECT *
FROM contacts
WHERE user_name IS NOT NULL
  AND contact_name IS NOT NULL
  AND contact_id IS NOT NULL
PRIMARY KEY ((user_name), contact_name, contact_id)
WITH CLUSTERING ORDER BY (contact_name ASC);
In this case, you only have to maintain the contacts table; the view will be updated automatically.
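With the view in place, the per-user listing becomes (example value assumed), returned sorted by contact_name thanks to the clustering order:

SELECT contact_name, contact_id FROM contacts_by_users WHERE user_name = 'foo';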

Design data model for messaging system with Cassandra

I am new to Cassandra and trying to build a data model for a messaging system. I found a few solutions, but none of them exactly match my requirements. There are two main requirements:
Get a list of last messages for a particular user, from all other users, sorted by time.
Get a list of messages for one-to-one message history, sorted by time as well.
I thought of something like this,
CREATE TABLE chat (
    to_user text,
    from_user text,
    time text,
    msg text,
    PRIMARY KEY ((to_user, from_user), time)
) WITH CLUSTERING ORDER BY (time DESC);
But this design has a few issues: I won't be able to satisfy the first requirement, since it requires passing from_user as well. It would also become inefficient as the number of (to_user, from_user) pairs increases.
You are right. That one table won't satisfy both queries, so you will need two tables, one for each query. This is a core concept in Cassandra data modeling: query-driven design.
So the query looking for messages to a user:
CREATE TABLE chat (
    to_user text,
    from_user text,
    time text,
    msg text,
    PRIMARY KEY ((to_user), time)
) WITH CLUSTERING ORDER BY (time DESC);
Messages from a user to another user:
CREATE TABLE chat_by_from_user (
    to_user text,
    from_user text,
    time text,
    msg text,
    PRIMARY KEY ((to_user), from_user, time)
) WITH CLUSTERING ORDER BY (from_user ASC, time DESC);
A slight difference from yours: from_user is a clustering column and not part of the partition key. This minimizes the number of SELECT queries needed in application code.
It's possible to use the second table to satisfy both queries, but you will have to supply the 'from_user' to use a range query on time.
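For illustration (example values assumed, using the table names above), the two reads would look like:

-- 1. last messages to a user from everyone, newest first
SELECT from_user, time, msg FROM chat WHERE to_user = 'alice';

-- 2. one-to-one history, newest first; a range on time also works because
--    time is stored as text and compares lexicographically
SELECT time, msg FROM chat_by_from_user
WHERE to_user = 'alice' AND from_user = 'bob' AND time > '2016-01-01';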
