Storing person group membership under cassandra - cassandra

Assume all person records are identified by a UUID and all groups are identified by a UUID. What data model would you create when you commonly need to query both the list of all people in a group and the list of all groups a person belongs to? For example:
create table membership (
person_uuid uuid,
group_uuid uuid,
joined bigint,
primary key (person_uuid, group_uuid));
The above would optimise for querying by person, and the below would optimise for querying by group.
create table membership (
group_uuid uuid,
person_uuid uuid,
joined bigint,
primary key (group_uuid, person_uuid));
Is there a neat way to handle this so that you can query optimally by both person_uuid and group_uuid without having to use "allow filtering"? I.e.:
select group_uuid from membership where person_uuid=?
select person_uuid from membership where group_uuid=? allow filtering
Do you just go ahead and store two copies of the data for queries in both directions? That has atomicity issues though, right?

#Jacob
What you can do is create a secondary index on the clustering column of the primary key (group_uuid) so that you can query on it.
create table membership (
person_uuid uuid,
group_uuid uuid,
joined bigint,
primary key (person_uuid, group_uuid));
create index on membership(group_uuid);
With the index in place, you can query by group_uuid alone without adding allow filtering. Bear in mind that such a query is served through the index rather than by a single partition lookup, so it may need to contact several nodes.
If you choose to use two tables instead of a secondary index, you can use an atomic (logged) batch when inserting data to guarantee atomicity.
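The two-table approach with an atomic batch could look roughly like the sketch below. The table names membership_by_person and membership_by_group, the UUID values and the joined value are made up for illustration; the columns are the ones from the question.

```cql
-- Hypothetical table names; same columns as the question's membership table.
CREATE TABLE membership_by_person (
  person_uuid uuid,
  group_uuid uuid,
  joined bigint,
  PRIMARY KEY (person_uuid, group_uuid));

CREATE TABLE membership_by_group (
  group_uuid uuid,
  person_uuid uuid,
  joined bigint,
  PRIMARY KEY (group_uuid, person_uuid));

-- Logged batch: either both inserts are eventually applied, or neither is.
BEGIN BATCH
  INSERT INTO membership_by_person (person_uuid, group_uuid, joined)
  VALUES (11111111-1111-1111-1111-111111111111, 22222222-2222-2222-2222-222222222222, 1400000000000);
  INSERT INTO membership_by_group (group_uuid, person_uuid, joined)
  VALUES (22222222-2222-2222-2222-222222222222, 11111111-1111-1111-1111-111111111111, 1400000000000);
APPLY BATCH;

-- Each direction is now a single-partition read, with no index and no filtering.
SELECT group_uuid FROM membership_by_person
WHERE person_uuid = 11111111-1111-1111-1111-111111111111;
SELECT person_uuid FROM membership_by_group
WHERE group_uuid = 22222222-2222-2222-2222-222222222222;
```

Note that a logged batch is atomic but not isolated: a reader may briefly see the row in one table before it appears in the other.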

Related

Will cassandra do multi partition search for compounded primary key

I have a scenario, let's say below is my cassandra table
CREATE TABLE USER (
id TEXT,
name TEXT,
age int,
role TEXT,
PRIMARY KEY ((id, role), age));
I need to be able to query the user table by id, by role, or by both id and role. My question is: when I use only id or only role in the WHERE clause to find a user, will Cassandra search for the user record across different partitions (and nodes), given that I am not supplying both id and role, which together make up the partition key of my table?
When you use a compound partition key, as in your example PRIMARY KEY ((id, role), age), Cassandra combines the two values into a single partition key. This technique is used to create a more granular partition key, which gives better control over how evenly data is distributed around the datacenter.
Because id and role are combined and then hashed, you must always provide both id AND role. Cassandra cannot locate the partition if you give only part of the compound partition key.
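As a rough illustration of that rule, assuming the USER table from the question (the values 'u1' and 'admin' and the table name user_by_id are invented for this sketch):

```cql
-- Works: both components of the partition key (id, role) are supplied,
-- so Cassandra can hash them and go straight to the owning partition.
SELECT * FROM user WHERE id = 'u1' AND role = 'admin';

-- Rejected as written: only part of the partition key is supplied.
SELECT * FROM user WHERE id = 'u1';

-- One option if lookups by id alone are needed: a second table
-- (hypothetical name) whose partition key is just id.
CREATE TABLE user_by_id (
  id TEXT,
  role TEXT,
  name TEXT,
  age int,
  PRIMARY KEY ((id), role, age));

SELECT * FROM user_by_id WHERE id = 'u1';
```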

How to choose proper tables structure in cassandra?

Suppose I have table with the following structure
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It lets me get all tasks for a user sorted by name in ascending order. I also added task_id to the primary key so that two tasks with the same name do not overwrite each other. The following query works
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get a task with a specific task_id. For example, the following query fails
select * from tasks where user_id = ? and task_id = ?
with this error
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires the name column to be specified, but at that point I have only task_id (from the URL, for example) and user_id (from the session).
How should I create this table to perform both queries? Or do I need to create a separate table for the second case? What is the common pattern in this situation?
You could simply add one more redundant column taskId with the same value as task_id and create a secondary index on taskId.
Then you can query with user_id=? and taskId=?
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
How should I create this table to perform both queries? Or do I need to create a separate table for the second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
You could simply add one more redundant column taskId with the same value as task_id and create a secondary index on taskId.
While I am usually not a fan of secondary indexes, in this case it may perform OK. The reason is that you would still be restricting your query by partition key, which eliminates the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that, in Cassandra versions that do not allow a secondary index on a primary key component, you would need to duplicate that column (and apply the index to the non-primary key column) to get that solution to work.
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both these tables. But Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
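A sketch of how the application might write to and read from these two tables. The UUIDs, task name and description are placeholder values.

```cql
-- Write the same task to both tables.
INSERT INTO tasks_by_name (user_id, name, task_id, description)
VALUES (11111111-1111-1111-1111-111111111111, 'write report',
        33333333-3333-3333-3333-333333333333, 'placeholder description');
INSERT INTO tasks_by_id (user_id, name, task_id, description)
VALUES (11111111-1111-1111-1111-111111111111, 'write report',
        33333333-3333-3333-3333-333333333333, 'placeholder description');

-- All tasks for a user, sorted by name, optionally from a given name onwards.
SELECT * FROM tasks_by_name
WHERE user_id = 11111111-1111-1111-1111-111111111111;
SELECT * FROM tasks_by_name
WHERE user_id = 11111111-1111-1111-1111-111111111111 AND name > 'm';

-- A single task, knowing only user_id and task_id.
SELECT * FROM tasks_by_id
WHERE user_id = 11111111-1111-1111-1111-111111111111
  AND task_id = 33333333-3333-3333-3333-333333333333;
```

If the two copies must not drift apart, the two inserts can be wrapped in BEGIN BATCH ... APPLY BATCH.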

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say 'ID', 'birthday'), I cannot perform this query because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This is by design, not by accident: a feature like update users set status=1 where id>10 would let a user update all the data in a table at once, which can be very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans within multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). However, queries with a very large set of keys in the IN clause may cause performance issues on C*.
But how can you run an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and neither of them is ideal:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound primary key (partition key plus clustering column), and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've partitioned the table by year to get a reasonably even distribution of partitions across the cluster. The general idea is that you shouldn't have a small number of very large partitions; prefer a large number of small ones if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
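For the first option (the dob_index table), maintaining and using the index might look roughly like this. The timeuuid values and dates are placeholders, and the application is assumed to issue one SELECT per year it cares about and then merge the results itself.

```cql
-- When a user is created, also record the id under its year and date of birth.
UPDATE dob_index
SET ids = ids + [50554d6e-29bb-11e5-b345-feff819cdc9f]
WHERE year = 2014 AND dob = '2014-06-15 00:00:00+0000';

-- "Born after 2014-05-01": a range query inside the 2014 partition...
SELECT ids FROM dob_index
WHERE year = 2014 AND dob > '2014-05-01 00:00:01+0000';
-- ...plus one full-partition query for each later year.
SELECT ids FROM dob_index WHERE year = 2015;

-- Then update the matching users by primary key, in reasonably small groups.
UPDATE users SET status = 'foo'
WHERE id IN (50554d6e-29bb-11e5-b345-feff819cdc9f);
```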
But I suggest revisiting your application logic so that you avoid updating/deleting data at all:
instead of updating the users table directly, you can have a separate table user_statuses into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
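With that table, a status change becomes an insert, and the current status is simply the newest row in the user's partition. A minimal sketch, with a placeholder timeuuid and timestamp:

```cql
-- Record a new status instead of updating an existing row.
INSERT INTO user_statuses (id, updated_at, status)
VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f, '2014-05-01 12:00:00+0000', 'active');

-- Current status: the most recent row for that user.
SELECT status FROM user_statuses
WHERE id = 50554d6e-29bb-11e5-b345-feff819cdc9f
ORDER BY updated_at DESC
LIMIT 1;
```

If the latest status is read far more often than the history, declaring the table WITH CLUSTERING ORDER BY (updated_at DESC) would make the newest row the first one in the partition.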
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I got a query problem.
Here is my table:
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I specify my partition key, doodle_id.
For example, I can run this without any problems:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But with the last request I made:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error:
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new to Cassandra, but correct me if I'm wrong: in a composite primary key, the first part is the PARTITION KEY, which is mandatory so that Cassandra knows where to look for the data.
The other parts are CLUSTERING KEYs, which sort the data.
But I still don't get why my first request works and the second one doesn't.
If anyone could help, I would really appreciate it.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarily with user_id) is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by:
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
You can think of your primary key as a file path doodle_id#123/user_id#456/schedule_id#789 where all data is stored in the deepest folder (i.e. schedule_id#789). When you query, you have to indicate the subfolder/subtree from which to start searching.
Your 2nd query doesn't work because of how columns are organized within partition. Cassandra can not get a continuous slice of columns in the partition because they are interleaved.
You should invert the primary key order (doodle_id, schedule_id, user_id) to be able to run your query.

Do manual indexes make sense on Cassandra?

Does this simple schema make sense in a Cassandra context? Or can I just use a unique-constraint index instead of manually indexing username and email through their own partition keys? My understanding is that, for an index lookup to be efficient on Cassandra, the query must include the partition key. So if I run a "get by" on a table with millions of rows, specifying only the indexed column and not the partition key, it may not be as fast as it should be, and a manual index built from new partition keys becomes the better choice. Is this notion correct? The only problem with manual indexing is that you have to maintain it yourself: if you delete a row from "users", you first need to read the values of the indexed columns so that you can delete the index rows as well, and you may also need to batch those deletes. Have I misunderstood Cassandra?
CREATE TABLE users (
id uuid PRIMARY KEY,
username text,
email text,
password_hash text,
password_salt text,
display_name text,
timezone int,
created_at timestamp,
last_login_at timestamp
);
CREATE TABLE usernames (
username text PRIMARY KEY,
user_id uuid
);
CREATE TABLE user_emails (
email text PRIMARY KEY,
user_id uuid
);
Manual indexing can be an overhead: you need to maintain the indexes along with the data whenever you perform CRUD operations.
So it's recommended to use Cassandra's secondary indexing support.
If you want to query on the username and email columns, then you should create secondary indexes on those columns. Secondary indexes are Cassandra's built-in mechanism for indexing non-key columns.
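For comparison, the maintenance work that the manual-index schema from the question implies can be sketched as follows. The user values and the UUID are placeholders; only the table and column names come from the question.

```cql
-- Create a user together with its two index rows in one logged batch.
BEGIN BATCH
  INSERT INTO users (id, username, email, display_name)
  VALUES (11111111-1111-1111-1111-111111111111, 'alice', 'alice@example.com', 'Alice');
  INSERT INTO usernames (username, user_id)
  VALUES ('alice', 11111111-1111-1111-1111-111111111111);
  INSERT INTO user_emails (email, user_id)
  VALUES ('alice@example.com', 11111111-1111-1111-1111-111111111111);
APPLY BATCH;

-- "Get by username" becomes two single-partition reads.
SELECT user_id FROM usernames WHERE username = 'alice';
SELECT * FROM users WHERE id = 11111111-1111-1111-1111-111111111111;

-- Deleting: read the indexed values first, then remove all three rows.
SELECT username, email FROM users WHERE id = 11111111-1111-1111-1111-111111111111;
BEGIN BATCH
  DELETE FROM users WHERE id = 11111111-1111-1111-1111-111111111111;
  DELETE FROM usernames WHERE username = 'alice';
  DELETE FROM user_emails WHERE email = 'alice@example.com';
APPLY BATCH;
```

Which approach performs better depends on the data and the Cassandra version, so it is worth testing both against a realistic data set.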
