I'm trying to work out the best way to model groups in Cassandra. There are basically two options: store members as a set, or create an individual row for each group member. In CQL this would be:
CREATE TABLE groups (
group_id uuid,
member_id text,
name text,
PRIMARY KEY ((group_id), member_id)
);
CREATE TABLE groups_by_member (
member_id text,
group_id uuid,
PRIMARY KEY ((member_id), group_id)
);
An alternative using sets:
CREATE TABLE groups (
group_id uuid PRIMARY KEY,
member_ids set<text>,
name text
);
CREATE TABLE groups_by_member (
member_id text PRIMARY KEY,
group_ids set<uuid>
);
The typical operations would be:
Find all members of a group
Find all groups of a member
In both cases, memberships would be changed by batched inserts or deletes. Is either one better than the other? Does the situation change if groups can have 100 or 1000 members?
Yet a third option would be to have just one table, with each group member as its own row, and then use a materialized view to find all the groups that a user belongs to.
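For example, under the set model a membership change would be a pair of collection updates, wrapped in a logged batch to keep the two tables in sync (values are illustrative; use "-" instead of "+" to remove a member):
BEGIN BATCH
UPDATE groups SET member_ids = member_ids + {'alice'} WHERE group_id = 62c36092-82a1-3a00-93d1-46196ee77204;
UPDATE groups_by_member SET group_ids = group_ids + {62c36092-82a1-3a00-93d1-46196ee77204} WHERE member_id = 'alice';
APPLY BATCH;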
I have a table with node_id, node_name and data. My requirement is to getByID and getByName, so I have made id and name the primary key. But I also sometimes need to update the name.
I know Cassandra does not allow updating primary key columns, or using non-primary-key columns in the WHERE clause.
How can I achieve this?
I did consider deleting the record first and then inserting it again with the same id and the new name, but I read that this would create tombstones and hurt performance.
Use only node_id as the primary key. To implement getByName, create a materialized view (see the Cassandra documentation on materialized views).
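A minimal sketch, assuming the base table is named nodes and node_id is its primary key:
create materialized view nodes_by_name as
select * from nodes
where node_name is not null and node_id is not null
primary key (node_name, node_id);
Alternatively, the following single-table design keeps every name change as its own row: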
create table users_by_id_name(
id int,
createdOn bigint, -- timestamp in milliseconds
name text,
age int,
-- createdOn comes first in the clustering order so that LIMIT 1 returns the latest row
primary key (id, createdOn, name)
) WITH CLUSTERING ORDER BY (createdOn DESC, name DESC);
Use the above table definition to insert users. Insert query:
insert into users_by_id_name (id,createdOn,name,age) values (1,100,'darthvedar',28);
To update the user, insert the row again with the same user id and the updated name and createdOn values:
insert into users_by_id_name (id,createdOn,name,age) values (1,200,'obi-wan-kenobi',28);
To select the latest record for a user by id:
select * from users_by_id_name where id=1 limit 1;
To select a user by name:
select * from users_by_id_name where name='obi-wan-kenobi' ALLOW FILTERING;
Another way is to use a secondary index on the user name. If the user name is not going to change too frequently, a secondary index is also a reasonable option.
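A sketch of that approach on the table above; an equality query on the indexed column then works without ALLOW FILTERING:
create index on users_by_id_name (name);
select * from users_by_id_name where name = 'obi-wan-kenobi';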
Edit after comments -
If you have very frequent updates on user name, it would be better to use two different tables.
create table users_by_id(
id int,
name text,
age int,
primary key (id)
);
create table users_by_name(
id int,
name text,
age int,
primary key (name)
);
While inserting, insert into both tables using a batch statement.
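For example (values are illustrative):
begin batch
insert into users_by_id (id, name, age) values (1, 'obi-wan-kenobi', 28);
insert into users_by_name (id, name, age) values (1, 'obi-wan-kenobi', 28);
apply batch;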
Hope this will help.
What is the best approach to updating a table with duplicated data?
I have a table
create table users (
id text PRIMARY KEY,
email text,
description text,
salary int
);
I will delete, update, insert, etc. on this table. But I also have a requirement to be able to search by email and by description. If I create a new table with new composite keys for email and description, then when I update my base table with
insert into users (id, salary) values (1, 500);
I do not have the data required to also update my secondary table, since all the client has is the id and salary. How does the second table get updated?
Other workarounds and shortcomings
I could create a materialized view, but a view's key can include at most one column that is not already in the base table's primary key, and my search requirement needs more than one extra column.
I could create secondary indexes on the columns to be searched, but performance would be bad since those columns (description, email, etc.) have high cardinality.
So, the "correct" way of doing this is to create three tables: salary_by_id, salary_by_email and salary_by_description.
create table salary_by_id (
id text PRIMARY KEY,
salary int
);

create table salary_by_email (
email text PRIMARY KEY,
salary int
);

create table salary_by_description (
description text,
id text,
salary int,
primary key (description, id)
);
The reason I added id to salary_by_description is that, by my own guess, description won't be globally unique, so it needs something else in its primary key.
Depending on the size of these tables, the last one might need something extra added to its partition key. And if needed, you can add id, email and description to the other tables.
Now, when inserting or deleting values you need to do it in all three tables. If you use a driver that supports asynchronous calls, like the Java driver, this doesn't cost very much extra.
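Deletes illustrate the point: each table must be addressed by its own key, so the client (or a prior read) has to supply the email and description too. A sketch with illustrative values; a logged batch is shown for atomicity, though individual asynchronous statements work as well:
begin batch
delete from salary_by_id where id = '1';
delete from salary_by_email where email = 'jane@example.com';
delete from salary_by_description where description = 'engineer' and id = '1';
apply batch;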
I have a table with a composite primary key: name, description, ID.
PRIMARY KEY (id, name, description)
Whenever searching in Cassandra I need to provide all three keys, but now I have a use case where I want to delete, update, and get based on the ID alone.
So I created a materialized view against this table, and reordered the keys to have ID first so I can search just based on ID.
But how do I delete or update a record with just an ID?
It's not clear if you are using a partition key with 3 columns, or if you are using a composite primary key.
If you are using a partition key with 3 columns:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY ((id, name, description))
);
Notice the double parentheses: you need all three components to identify your data. So when you query your data by ID from the materialized view, you also need to retrieve the name and description fields, and then issue one delete per <id, name, description> tuple.
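A sketch of that flow, assuming the materialized view is named tbl_by_id:
select id, name, description from tbl_by_id where id = ?;
-- then, for each tuple returned:
delete from tbl where id = ? and name = ? and description = ?;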
Instead, if you use a composite primary key with ID being the only PARTITION KEY:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY (id, name, description)
);
Notice the single parentheses: here you can simply issue one delete, because you already know the partition key and don't need anything else.
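That is, a single statement removes the whole partition:
delete from tbl where id = ?;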
Check this SO post for a clear explanation on primary key types.
Another thing you should be aware of is that the materialized view will populate a table under the hood for you, and the same rules/ideas about data modeling should also apply for materialized views.
Suppose I have a table with the following structure:
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It allows me to get all tasks for a user, sorted by name ascending. I also added task_id to the primary key to avoid upserts. The following query works
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get a task by a specific task_id. For example, the following query fails
select * from tasks where user_id = ? and task_id = ?
with this error
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires the name column to be specified, but at that point I have only the task_id (from the URL, for example) and the user_id (from the session).
How should I create this table to support both queries? Or do I need to create a separate table for the second case? What is the common pattern in this situation?
You could simply add one more redundant column, taskId, holding the same value as task_id, and create a secondary index on taskId.
Then you can query with user_id = ? and taskId = ?.
PRIMARY KEY column "task_id" cannot be restricted as preceding
column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
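Your second query then becomes a straight key lookup:
select * from tasks_by_user_and_task where user_id = ? and task_id = ?;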
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
While I am usually not a fan of using secondary indexes, in this case it may perform OK. The reason is that you would still be restricting your query by partition key, which eliminates the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that you cannot create a secondary index on a primary key component, so you would need to duplicate that column (and apply the index to the non-primary-key column) to get that solution to work.
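A sketch of that workaround (task_id_copy is an illustrative name for the duplicated column):
alter table tasks add task_id_copy uuid;
create index on tasks (task_id_copy);
select * from tasks where user_id = ? and task_id_copy = ?;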
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both these tables. But Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
Assume all people records are identified by a UUID, and likewise all groups. What data model would you create when you commonly need to query the list of all people in a group, and the list of all groups a person belongs to? For example:
create table membership (
person_uuid uuid,
group_uuid uuid,
joined bigint,
primary key (person_uuid, group_uuid));
The above would optimise for querying by person, and the below would optimise for querying by group.
create table membership (
group_uuid uuid,
person_uuid uuid,
joined bigint,
primary key (group_uuid, person_uuid));
Is there a neat way to handle this so you can query optimally by person_uuid and by group_uuid without having to use "allow filtering"? i.e.:
select group_uuid from membership where person_uuid=?
select person_uuid from membership where group_uuid=? allow filtering
Do you just go ahead and store two copies of the data for queries in both directions? That has atomicity issues, though, right?
@Jacob
What you can do is create a secondary index on the clustering column of the primary key to be able to query on it.
create table membership (
person_uuid uuid,
group_uuid uuid,
joined bigint,
primary key (person_uuid, group_uuid));
create index on membership(group_uuid);
Then an equality query on group_uuid can use the index directly (no ALLOW FILTERING needed), and it will be much faster than filtering without one.
If you choose to index the data with two tables instead of a secondary index, you can use an atomic batch when inserting to guarantee atomicity.
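A sketch, assuming the two copies are named membership_by_person and membership_by_group (the two definitions above would otherwise collide on the name membership):
begin batch
insert into membership_by_person (person_uuid, group_uuid, joined) values (?, ?, ?);
insert into membership_by_group (group_uuid, person_uuid, joined) values (?, ?, ?);
apply batch;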