Data modeling easy table in Cassandra not working - cassandra

I have to design a web page where a group leader can invite people to join his/her group. My requirements are really simple.
No sending duplicate emails out, if person was already contacted.
Show the group leader a list of invites sorted by invite date in ascending order.
Seems easy. I created this table.
CREATE TABLE invites (
email_address text,
invite_date timeuuid,
PRIMARY KEY (email_address, invite_date)
) WITH CLUSTERING ORDER BY (invite_date ASC);
Problem 1: LWT is of no use with invite_date as a clustering column.
I figured I'd use LWT to ensure email_address is unique, only to find out IF NOT EXISTS only seems to work on the whole PRIMARY KEY, so LWT in C* does not work for me.
Problem 2: I cannot get an ordered list of invites back to save my life, even with invite_date as a clustering column.
If I take invite_date out, I cannot issue an 'order by' in CQL. That said, having invite_date out of the PK lets me use LWT...
I can't even get a 2 column table to fulfill 2 easy requirements! Any help on data modeling design for this problem is much appreciated.
New Dec. 4, 2015:
In addition to the business requirements, I have a technical requirement:
I want to make sure I model this correctly in Cassandra, so that I can use CQL's LIMIT and the pagingState capabilities in the Java driver. This means I cannot just read all the rows in, sort them on the Java side, and return the results.

Problem 1:
I think that the easiest way to handle this might be to have two separate tables, one for emails_in_group and one for invites_by_group. This will allow each query to be fulfilled independently. The emails_in_group table would look something like this:
CREATE TABLE emails_in_group (
email_address text,
group_id text,
PRIMARY KEY (email_address , group_id));
Then this, combined with the table as defined in Problem 2 below could be updated using a conditional batch statement as shown here:
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use-batch-static.html
Problem 2:
So the basic problem here is that, as you currently have your data modeled, each email_address value will be in its own partition, and invite_date is only ordered within that partition. @bydsky is right that you need to add something like a group_id to your table and make it the partition key portion of your primary key. If you do this and then add invite_date as a clustering column, all records for that group_id will be stored in the same partition and the ORDER BY will work. ORDER BY only works within a single partition, not across partitions, which is what you were asking it to do.
CREATE TABLE invites_by_group (
group_id text,
email_address text,
invite_date timestamp,
PRIMARY KEY (group_id, invite_date));
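A rough sketch of how the two tables above might be used together. The group id, email address, and timestamp values are made up for illustration. As far as I know, Cassandra rejects a conditional batch that spans more than one table or partition, so this sketch issues the lightweight transaction first and writes the invite row only if the result reports [applied] = true; treat that ordering as an assumption about how the application coordinates the two writes.

```cql
-- Claim the (email, group) pair. The result row's [applied] column
-- is false if this person was already invited to the group.
INSERT INTO emails_in_group (email_address, group_id)
VALUES ('pat@example.com', 'group-42')
IF NOT EXISTS;

-- Only when [applied] was true: record the invite (and send the email).
INSERT INTO invites_by_group (group_id, invite_date, email_address)
VALUES ('group-42', '2015-12-04 10:15:00+0000', 'pat@example.com');

-- One partition per group, so ORDER BY and LIMIT apply to invite_date.
SELECT email_address, invite_date
FROM invites_by_group
WHERE group_id = 'group-42'
ORDER BY invite_date ASC
LIMIT 20;
```

Because the rows are already sorted inside the partition, LIMIT and driver-side paging can be applied to the last query without sorting in the application.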

I think you should use group_id and email as primary key.
For ordering, maybe you could do it in your application.
CREATE TABLE invites (
group_id text,
email_address text,
invite_date timestamp,
PRIMARY KEY (group_id, email_address) );
For Cassandra data modeling, watching DS220 is a good start.

Related

Cassandra Defining Primary key and alternatives

Here is a simple example of the user table in cassandra. What is best strategy to create a primary key.
My requirements are
search by uuid
search by username
search by email
All the keys mentioned will be high-cardinality keys. Also, at any moment I will have only one of them to search by.
PRIMARY KEY(uid,username,email)
What if I have only the username? Then the above primary key is not useful. I am not able to visualize a solution using a compound primary key.
What are the other options? Should we go with a new table mapping username to uid, and then search the user table?
All the articles out there on the internet recommend not creating a secondary index on high-cardinality keys.
CREATE TABLE medicscity.user (
uid uuid,
fname text,
lname text,
user_id text,
email_id text,
password text,
city text,
state_id int,
country_id int,
dob timestamp,
zipcode text,
PRIMARY KEY (??)
)
How do we solve this kind of situation ?
Yes, you need to go with duplicate tables.
Whenever you have to query a Cassandra table by column1, column2 or column3 independently, you will have to duplicate the table.
How much duplication you use is an individual choice.
In this example, you can either duplicate the table with its full data,
or you can keep a main table with column1 as the partition key and column1, column2, column3 as the primary key,
then create a second table with the same primary key columns but partitioned on column2,
and another one with the same primary key columns but partitioned on column3.
That way only the key columns are duplicated, but you will end up querying twice: once against the duplicate table, and once against the full-fledged table.
Big data technology, is there to speed up computation and let your system scale horizontally, and it comes at the expense of disk/storage. I mean just look at everything, even its base of replication factor does duplication of data.
Your PRIMARY KEY(uuid,username,email) doesn't fit your requirements, because you can't restrict a clustering column without also specifying the partition key, nor the second clustering column without the first.
E.g. you cannot search for username without uuid in the WHERE clause, and you cannot search for email without both uuid and username.
All you need is the denormalization and duplicate data.
Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.
In your case, you need to create 3 tables that have the same columns (the data you want to get back) but different PRIMARY KEYs: one with uuid as the PK, one with username as the PK, and one with email as the PK. :)
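A minimal sketch of the three-table approach, using a few of the columns from the question. The table names (user_by_uid, user_by_username, user_by_email) and the sample values are hypothetical.

```cql
-- Same data, three partition keys: one table per lookup.
CREATE TABLE user_by_uid (
  uid uuid,
  user_id text,
  email_id text,
  fname text,
  lname text,
  PRIMARY KEY (uid)
);

CREATE TABLE user_by_username (
  user_id text,
  uid uuid,
  email_id text,
  fname text,
  lname text,
  PRIMARY KEY (user_id)
);

CREATE TABLE user_by_email (
  email_id text,
  uid uuid,
  user_id text,
  fname text,
  lname text,
  PRIMARY KEY (email_id)
);

-- A logged batch ensures all three copies are eventually written together.
BEGIN BATCH
  INSERT INTO user_by_uid (uid, user_id, email_id, fname, lname)
  VALUES (123e4567-e89b-12d3-a456-426614174000, 'jdoe', 'jdoe@example.com', 'Jane', 'Doe');
  INSERT INTO user_by_username (user_id, uid, email_id, fname, lname)
  VALUES ('jdoe', 123e4567-e89b-12d3-a456-426614174000, 'jdoe@example.com', 'Jane', 'Doe');
  INSERT INTO user_by_email (email_id, uid, user_id, fname, lname)
  VALUES ('jdoe@example.com', 123e4567-e89b-12d3-a456-426614174000, 'jdoe', 'Jane', 'Doe');
APPLY BATCH;

-- Each lookup hits exactly one partition in its own table.
SELECT uid, email_id, fname, lname FROM user_by_username WHERE user_id = 'jdoe';
```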

Cassandra table based query and primary key uniqueness

I have read here that for a table like:
CREATE TABLE user (
username text,
password text,
email text,
company text,
PRIMARY KEY (username)
);
We can create a table like:
CREATE TABLE user_by_company (
company text,
username text,
email text,
PRIMARY KEY (company)
);
in order to support querying by company. But what about primary key uniqueness for the second table?
Modify your table's PRIMARY KEY definition and add username as a clustering key:
CREATE TABLE user_by_company (
company text,
username text,
email text,
PRIMARY KEY (company,username)
);
That will enforce uniqueness, as well as return all usernames for a particular company. Additionally, your result set will be sorted in ascending order by username.
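To illustrate what "uniqueness" means here, a short sketch with made-up users. Note that Cassandra writes are upserts: a second INSERT with the same (company, username) key does not fail, it overwrites the existing row.

```cql
INSERT INTO user_by_company (company, username, email)
VALUES ('Acme', 'jayne', 'jayne@example.com');

INSERT INTO user_by_company (company, username, email)
VALUES ('Acme', 'zoe', 'zoe@example.com');

-- Same (company, username) again: replaces jayne's email, adds no row.
INSERT INTO user_by_company (company, username, email)
VALUES ('Acme', 'jayne', 'jcobb@example.com');

-- Returns two rows for Acme, sorted by username: jayne, then zoe.
SELECT username, email FROM user_by_company WHERE company = 'Acme';
```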
Data will be partitioned across nodes by company name. What if there are a lot of users from one company and few from another? The data will be partitioned in an unbalanced way.
That's the balance that you have to figure out on your own. PRIMARY KEY definition in Cassandra is a give-and-take between data distribution and query flexibility. And unless the cardinality of company is very low (like single digits), you shouldn't have to worry about creating hot spots in your cluster.
Also, if one particular company gets too big, you can use a modeling technique known as "bucketing." If I were going to "bucket" your user_by_company table, I would first add a company_bucket column, and use it as an additional (composite) partitioning key:
CREATE TABLE user_by_company (
company text,
company_bucket text,
username text,
email text,
PRIMARY KEY ((company,company_bucket),username)
);
As for what to put into that bucket, it's up to you. Maybe that particular company has East and West locations, so something like this might work:
INSERT INTO user_by_company (company,company_bucket,username,email)
VALUES ('Acme','West','Jayne','jcobb@serenity.com');
The drawback here, is that you would then have to provide company_bucket whenever querying that table. But it is a solution that could help you if a company should get too big.
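For illustration, reads against the bucketed table might look like the following sketch. The 'East' bucket is assumed from the East/West example above.

```cql
-- Both partition key columns must be supplied.
SELECT username, email
FROM user_by_company
WHERE company = 'Acme' AND company_bucket = 'West';

-- To read the whole company, name every bucket. IN on the last
-- partition key column reads one partition per listed value.
SELECT username, email
FROM user_by_company
WHERE company = 'Acme' AND company_bucket IN ('East', 'West');
```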
I think there is a typo in the blog (the link you mentioned). You are right that, with the user_by_company table structured that way, there will be an issue with uniqueness.
To support the typo theory:
In this case, creating a secondary index in the company field in the
user table could be a solution because it has much lower cardinality
than the user's email but let’s solve it with performance in mind.
Secondary indexes are always slower than dedicated table approach.
These are the lines in the blog about querying users by company.
If you were to define company as the primary key OR part of the primary key, there would be no need to create a secondary index.

How to check whether a given key is in a map structure in CQL

I am trying to update a value stored in a map under a given key using CQL. Can anyone tell me how to do it? The following is my table:
create table game(game_id uuid, game_name text, participant_id_name map<uuid, text>, PRIMARY KEY (game_id));
create index on game(participant_id_name);
Now I have a given participant's uuid and want to update his/her name, but I don't know the game_id. How can I check whether the participant is in the participant_id_name column and then update the name?
I do not believe your data model is optimal given what you are trying to do, so I am proposing the following:
create table game(game_id uuid, game_name text, participant_id int, participant_id_name text, PRIMARY KEY (game_id, participant_id));
With this table, you have a partition key of game_id and a clustering column of participant_id, so the partition will include everything for the game_id ordered by the participant_id column. I believe this makes sense for what you are trying to do. Instead of using a database generated unique id for the participant_id, I suggest having the application provide an integer when inserting the name of the person so you know both pieces of data.
(Please note I do not have complete information, and am making a best effort with the information provided.)
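With game_id as the partition key and participant_id as the clustering column, renaming one participant might look like this sketch. The uuid, the participant number, and the new name are placeholders.

```cql
-- Both key columns are needed to address a single participant's row.
UPDATE game
SET participant_id_name = 'New Name'
WHERE game_id = 123e4567-e89b-12d3-a456-426614174000
  AND participant_id = 7;
```

Note that the update still needs the game_id; this model does not by itself provide a lookup from participant to game.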

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say 'ID', 'birthday'), I cannot perform this query because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This is not an accident: a statement such as update users set status=1 where id>10 would let a user update all the data in a table at once, which can be very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside an IN clause may cause performance issues in C*.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and neither of them is ideal:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound (partition + clustering) primary key, and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've split the table into multiple partitions (one per year) to get a reasonably even distribution across the cluster. The general idea is that you shouldn't have a small number of very large partitions; prefer a large number of small ones if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
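For the first option, maintaining and reading the dob_index table could look roughly like this. The timeuuid and the dates are placeholder values, and the application is assumed to write to the index whenever it writes a user.

```cql
-- Append a user's id to the index entry for their date of birth.
UPDATE dob_index
SET ids = ids + [50554d6e-29bb-11e5-b345-feff819cdc9f]
WHERE year = 2014 AND dob = '2014-03-15 00:00:00+0000';

-- Range scan inside one year's partition; repeat per year of interest.
SELECT dob, ids
FROM dob_index
WHERE year = 2014 AND dob < '2014-05-01 00:00:01+0000';
```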
But I suggest you revisit your application logic so that you avoid updating/deleting data at all:
instead of updating the users table directly, you can have a separate table user_statuses into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
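A small sketch of how this append-only table might be used: each status change is a new row, and the current status is the newest row in the partition. The id and timestamp are placeholders.

```cql
-- Record a status change as a new row instead of updating in place.
INSERT INTO user_statuses (id, updated_at, status)
VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f, '2014-05-01 12:00:00+0000', 'active');

-- Current status: newest clustering value first, one row.
SELECT status, updated_at
FROM user_statuses
WHERE id = 50554d6e-29bb-11e5-b345-feff819cdc9f
ORDER BY updated_at DESC
LIMIT 1;
```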
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I got a query problem.
Here is my table:
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I indicate my partition key, doodle_id.
For example, I can run this without any problem:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But with the last request I made:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error :
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new to Cassandra, but correct me if I'm wrong: in a composite primary key, the first part is the PARTITION KEY, which is mandatory so that Cassandra knows where to look for data.
The other parts are the CLUSTERING KEY, used to sort data.
But I still don't get why my first request works and the second one doesn't.
If anyone could help, I'd really appreciate it.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarily with user_id) is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by:
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
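As a sketch, the first three shapes in that list correspond to the following queries against the original vote table (ids taken from the question). Each one restricts the clustering columns in declaration order, which is why none of them needs ALLOW FILTERING.

```cql
-- doodle_id only: the whole partition.
SELECT * FROM vote
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7;

-- doodle_id/user_id: one user's votes within the doodle.
SELECT * FROM vote
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
  AND user_id = 97a7378a-e1bb-4586-ada1-177016405142;

-- doodle_id/user_id/schedule_id: a single row.
SELECT * FROM vote
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
  AND user_id = 97a7378a-e1bb-4586-ada1-177016405142
  AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
```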
You can see your primary key as a file path doodle_id#123/user_id#456/schedule_id#789 where all data is stored in the deepest folder (ie schedule_id#789). When you're querying you have to indicate the subfolder/subtree from where you start searching.
Your 2nd query doesn't work because of how columns are organized within a partition. Cassandra cannot get a contiguous slice of columns in the partition because the matching rows are interleaved with others.
You should invert the clustering order in the primary key, (doodle_id, schedule_id, user_id), to be able to run your query.
