Updating Primary Key value Cassandra - cassandra

I have table with node_id, node_name and data. My requirement is to getByID and getByName. So I have made id and name as the primary keys. But I also need to sometimes update the name as well.
I know Cassandra does not allow updating primary keys and having non primary key in the WHERE clause.
How can I achieve this?
I did consider deleting the record first, and then inserting again with the same id and new name. But I read that this would create tombstones and affect the performance.

Use only node_id as the primary key. To implement getByName create a materialized view. materialized views in cassandra

create table users_by_id_name(
id int,
createdOn bigint, -- timestamp in millisec
name text,
age int,
primary key (id,name,createdOn)
)WITH CLUSTERING ORDER BY ( name DESC, createdOn DESC);
Use above table definition to insert users.
Insert query --
insert into users_by_id_name (id,createdOn,name,age) values (1,100,'darthvedar',28);
to update the user, insert the row again with same user id and updated name and createdOn value.
insert into users_by_id_name (id,createdOn,name,age) values (1,200,'obi-wan-kenobi',28);
while selecting the user use below query --
select by user id -
select * from users_by_id_name where id=1 limit 1;
Select user by name -
select * from users_by_id_name where name='obi-wan-kenobi' ALLOW FILTERING;
Other way is to use secondary index on user name. Think, user name is not going to change too frequently, so secondary index is also one good option.
Edit after comments -
If you have very frequent updates on user name, it would be better to use two different tables.
create table users_by_id(
id int,
name text,
age int,
primary key (id)
);
create table users_by_name(
id int,
name text,
age int,
primary key (name)
);
While inserting , insert in both the tables using batch statement.
Hope this will help.

Related

CQL query delete if not in list

I am trying to delete all rows in the table where the partition key is not in a list of guids.
Here's my table definition.
CREATE TABLE cloister.major_user (
user_id uuid,
user_handle text,
avatar text,
created_at timestamp,
email text,
email_verified boolean,
first_name text,
last_name text,
last_updated_at timestamp,
profile_type text,
PRIMARY KEY (user_id, user_handle)
) WITH CLUSTERING ORDER BY (user_handle ASC)
I want to retain certain user_ids and delete the rest. The following options have failed.
delete from juna_user where user_id ! in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2) ALLOW FILTERING
What am I doing wrong?
CQL supports only IN condition (see docs). You need to explicitly specify which primary key or partition keys to delete, you can't use condition not in, because it's potentially could be a huge amount of data. If you need to do that, you need to generate the list of entries to delete - you can do that using Spark Cassandra Connector, for example.

Data modelling conflicts in Cassandra

The schema I am using is following :
CREATE TABLE mytable(
id varchar,
date date,
name varchar,
PRIMARY KEY ((date),name, id)
) WITH CLUSTERING ORDER BY (name desc);
I have 2 queries for my use case :
Fetching all records for given name
Delete all records for given date.
As we can't delete records without partition key being specified, my partition key got fixed to date only and no other column can be added to partition key as I won't have anything except date at time of deletion.
But to fetch records using name, I need to use ALLOW FILTERING as I need to scan whole table of above schema which causes performance issue.
Can you suggest a better way so that I can skip ALLOW FILTERING with is also delete by date compatible.
You could use indexes:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSecondaryIndex.html
But you have to be careful, there could be issues depending on the size of the table. You should read this post for more informations:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
You need an additional table to support your requirements.
Your main query is to retrieve the records given a name. For this, you should use mytable as follow (note the primary key):
CREATE TABLE mytable(
id varchar,
date date,
name varchar,
PRIMARY KEY ((name),date, id)
) WITH CLUSTERING ORDER BY (date desc);
This table will let you retrieve your data for a given name with (query 1):
SELECT * FROM mytable WHERE name='bob';
Now, you want to be able to delete by date. For this you would need the following additional table:
CREATE TABLE mytable_by_date(
id varchar,
date date,
name varchar,
PRIMARY KEY ((date), name, id)
) WITH CLUSTERING ORDER BY (name);
This table will let you find the name (and id) for a given date with:
SELECT * from mytable_by_date WHERE date='your-date';
I don't know your business requirements, so you this query might return 0, 1 or maybe more results. Once you have that, you can issue the delete against the first and second table (maybe using a logged batch for atomicity?)
DELETE * from mytable_by_date WHERE date='your-date' and name='the-name' and id='the-id'
DELETE * from mytable WHERE name='the-name' and ...
Overall, you might need to adjust according to your business requirements (is name unique, is uniqueness enforced by id etc...)
Hope it helps!

Cassandra Update query | timestamp column as clustering key

I have a table in cassandra with following schema:
CREATE TABLE user_album_entity (
userId text,
albumId text,
updateDateTimestamp timestamp,
albumName text,
description text,
PRIMARY KEY ((userId), updateDateTimestamp)
);
The query required to get data would have a where userId = xxx order by updateTimestamp. Hence the schema had updateDateTimestamp.
Problem comes in updating the column of table.The query is: Update the album information for user where user id = xxx. But as per specs,for update query I would need the exact value of updateDateTimestamp which in normal world scenario, an application would never send.
What should be the answer to such problems since I believe this a very common use case where select query requires ordering on timestamp. Any help is much appreciated.
The problem is that your table structure allows the same album to have multiple records with the only difference being the timestamp (the clustering key).
Three possible solutions:
Remove the clustering key and sort your data at application level.
Remove the clustering key and add a Secondary Index to the timestamp field.
Remove the clustering key and create a Materialized View to perform the query.
If your usecase is such that each partition will contain exactly one row,
then you can model your table like:
CREATE TABLE user_album_entity (
userId text,
albumId text static,
updateDateTimestamp timestamp,
albumName text static,
description text static,
PRIMARY KEY ((userId), updateDateTimestamp)
);
modelling the table this way enables Update query to be done in following way:
UPDATE user_album_entity SET albumId = 'updatedAlbumId' WHERE userId = 'xyz'
Hope this helps.

how to handle search by unique id in Cassandra

I have a table with a composite primary key. name,description, ID
PRIMARY KEY (id, name, description)
whenever searching Cassandra I need to provide the three keys, but now I have a use case where I want to delete, update, and get just based on ID.
So I created a materialized view against this table, and reordered the keys to have ID first so I can search just based on ID.
But how do I delete or update record with just an ID ?
It's not clear if you are using a partition key with 3 columns, or if you are using a composite primary key.
If you are using a partition key with 3 columns:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY ((id, name, description))
);
notice the double parenthesis you need all 3 components to identify your data. So when you query your data by ID from the materialized view you need to retrieve also both name and description fields, and then issue one delete per tuple <id, name, description>.
Instead, if you use a composite primary key with ID being the only PARTITION KEY:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY (id, name, description)
);
notice the single parenthesis, then you can simply issue one delete because you already know the partition and don't need anything else.
Check this SO post for a clear explanation on primary key types.
Another thing you should be aware of is that the materialized view will populate a table under the hood for you, and the same rules/ideas about data modeling should also apply for materialized views.

How to choose proper tables structure in cassandra?

Suppose I have table with the following structure
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It allows me to get all tasks for user sorted by name ascending. Also I added task_id to primary key to avoid upserts. The following query holds
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get task with specific task_id. For example, following query crashes
select * from tasks where user_id = ? and task_id = ?
with this error
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires name column to be specified, but at the moment I have only task_id (from url, for example) and user_id (from session).
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
Then you can query user_id=? and tsakId=?
PRIMARY KEY column "task_id" cannot be restricted as preceding
column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
While I am usually not a fan of using secondary indexes, in this case it may perform ok. Reason being, is that you would still be restricting your query by partition key, which would eliminate the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that you cannot create a secondary index on a primary key component, so you would need to duplicate that column (and apply the index to the non-primary key column) to get that solution to work.
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both these tables. But Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);

Resources