Cassandra Update query | timestamp column as clustering key - cassandra

I have a table in cassandra with following schema:
CREATE TABLE user_album_entity (
userId text,
albumId text,
updateDateTimestamp timestamp,
albumName text,
description text,
PRIMARY KEY ((userId), updateDateTimestamp)
);
The query required to get the data is: where userId = xxx, ordered by updateDateTimestamp. Hence the schema has updateDateTimestamp as a clustering column.
The problem comes when updating a column of the table. The query is: update the album information for the user where userId = xxx. But as per the CQL rules, the update query would need the exact value of updateDateTimestamp, which in a real-world scenario an application would never send.
What is the right approach to such problems? I believe this is a very common use case, where the select query requires ordering on a timestamp. Any help is much appreciated.

The problem is that your table structure allows the same album to have multiple records with the only difference being the timestamp (the clustering key).
Three possible solutions:
Remove the clustering key and sort your data at application level.
Remove the clustering key and add a Secondary Index to the timestamp field.
Remove the clustering key and create a Materialized View to perform the query.
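For the materialized view option, a sketch could look like the following (this assumes the base table has been recreated with PRIMARY KEY (userId) as in the options above; the view name is illustrative):
-- view keyed by user, clustered by the update timestamp
CREATE MATERIALIZED VIEW user_album_by_update_time AS
SELECT * FROM user_album_entity
WHERE userId IS NOT NULL AND updateDateTimestamp IS NOT NULL
PRIMARY KEY ((userId), updateDateTimestamp);
Reads that need timestamp ordering then go against the view (SELECT * FROM user_album_by_update_time WHERE userId = 'xyz'), while updates go against the base table using only userId.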

If your use case is such that each partition will contain exactly one row,
then you can model your table like this:
CREATE TABLE user_album_entity (
userId text,
albumId text static,
updateDateTimestamp timestamp,
albumName text static,
description text static,
PRIMARY KEY ((userId), updateDateTimestamp)
);
Modelling the table this way allows the update query to be written as:
UPDATE user_album_entity SET albumId = 'updatedAlbumId' WHERE userId = 'xyz'
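A rough usage sketch under that assumption (the literal values are illustrative): the clustering column still has to be written with an explicit value when the row is created, but afterwards the static columns can be updated with only the partition key, and reads can be ordered by the timestamp.
-- create the row once, providing the clustering timestamp explicitly
INSERT INTO user_album_entity (userId, updateDateTimestamp, albumId, albumName, description)
VALUES ('xyz', toTimestamp(now()), 'album1', 'Holidays', 'Summer photos');
-- later updates touch only the static columns, so no timestamp is needed
UPDATE user_album_entity SET albumName = 'Holidays 2016', description = 'Renamed' WHERE userId = 'xyz';
-- reads within the single partition can be ordered by the clustering timestamp
SELECT * FROM user_album_entity WHERE userId = 'xyz' ORDER BY updateDateTimestamp DESC;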
Hope this helps.

Related

Cassandra: how can I get data with 2 tables?

Joins are not supported in Cassandra, but I want to show 20 videos with their comments.
I saw this example in a data modelling exercise:
CREATE TABLE videos (
id number(12),
userid number(12) NOT NULL,
name nvarchar2(255),
description nvarchar2(500),
location nvarchar2(255),
location_type int,
added_date timestamp,
CONSTRAINT users_userid_fk FOREIGN KEY (userid) REFERENCES users (Id) ON DELETE CASCADE,
PRIMARY KEY (id)
);
CREATE TABLE comments (
id number(12),
userId number(12),
videoId number(12),
comment_text nvarchar2(500),
comment_time timestamp(6),
PRIMARY KEY (id),
CONSTRAINT user_comment_fk FOREIGN KEY (userid) REFERENCES users (Id) ON DELETE CASCADE,
CONSTRAINT video_comment_fk FOREIGN KEY (videoId) REFERENCES videos (Id) ON DELETE CASCADE
);
So how can I now get all videos with their comments, given that joins are not supported?
Can anyone help me?
You are right that Cassandra does not support joins. The reason is that joins become slow for big, web-scale data, and Cassandra was designed to solve that particular problem.
Now coming back to your problem: to solve this in Cassandra, it is recommended to create a table (call it a join table) that can answer your query. So if you want to see a list of videos with their comments, you can group videos and comments in a single partition using user_id as the partition key. The table can look like this:
CREATE TABLE KEYSPACE.videos_comments_by_user_id (
user_id int,
video_id int,
comments list<text>,
PRIMARY KEY ((user_id), video_id)
);
While designing a table in Cassandra, one should always keep one thing in mind: what query shall this table serve? Do not think about joining the two tables; that will never work with Cassandra. Instead, design to serve a query.
Below is a table design that will serve your query.
Table: comments_by_user_videos
Columns:
userid
videoid
commentid
comment_timestamp
videoname static
videodescription static
location static
location_type static
comment_text
Constraints:
Primary key ((userid, videoid), commentid)
The static columns are shared for a row key (partition). In this design, userid and videoid define one partition, so for a given userid and videoid the values of the static columns remain the same. Static columns can be used to represent a one-to-many relationship.
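A CQL sketch of that design (the column types here are assumptions, since only the names are listed above):
CREATE TABLE comments_by_user_videos (
userid int,
videoid int,
commentid timeuuid,
comment_timestamp timestamp,
videoname text static,
videodescription text static,
location text static,
location_type int static,
comment_text text,
PRIMARY KEY ((userid, videoid), commentid)
);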
Query to get all the comments for a video of a user:
Select * from comments_by_user_videos where userid=? and videoid=?
Edit:
If you want to display just a few videos (e.g. the latest ones), you should design another table.
Table: user_videos
userid
timestamp
videoid
videoname
videoLocation
locationtype
Primary key ((userid), timestamp, videoid), with clustering order by timestamp DESC
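In CQL that could look like the following (column types are assumptions; the timestamp column is renamed added_at here for clarity):
CREATE TABLE user_videos (
userid int,
added_at timestamp,
videoid int,
videoname text,
videolocation text,
locationtype int,
PRIMARY KEY ((userid), added_at, videoid)
) WITH CLUSTERING ORDER BY (added_at DESC, videoid ASC);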
Query:
select * from user_videos where userid=? limit 10;
The query returns the latest 10 videos for a user.
Insert data into both tables; never hesitate to duplicate data.

how to handle search by unique id in Cassandra

I have a table with a composite primary key: name, description, ID.
PRIMARY KEY (id, name, description)
Whenever searching Cassandra I need to provide all three keys, but now I have a use case where I want to delete, update, and get based on the ID alone.
So I created a materialized view against this table, and reordered the keys to have ID first so I can search just based on ID.
But how do I delete or update a record with just an ID?
It's not clear if you are using a partition key with 3 columns, or if you are using a composite primary key.
If you are using a partition key with 3 columns:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY ((id, name, description))
);
Notice the double parentheses: you need all 3 components to identify your data. So when you query your data by ID from the materialized view, you also need to retrieve both the name and description fields, and then issue one delete per <id, name, description> tuple.
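For example, assuming the schema above, each of those deletes would look like:
DELETE FROM tbl WHERE id = ? AND name = ? AND description = ?;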
Instead, if you use a composite primary key with ID being the only PARTITION KEY:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY (id, name, description)
);
Notice the single parentheses: here you can simply issue one delete, because you already know the partition key and don't need anything else.
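That single delete, against the schema above, would be:
DELETE FROM tbl WHERE id = ?;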
Check this SO post for a clear explanation on primary key types.
Another thing you should be aware of is that the materialized view will populate a table under the hood for you, and the same rules/ideas about data modeling should also apply for materialized views.

How to model Cassandra in this particular situation?

If I have the table structure below, how can I query by
"source = 'abc' and created_at >= '2016-01-01 00:00:00'"?
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY (id)
)
I would like to model my system according to this:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Edit:
What we are doing is very similar to what you are proposing. The difference is our primary key doesn't have brackets around source:
PRIMARY KEY (source, created_at, id). We also have two other indexes:
CREATE INDEX articles_id_idx ON crawler.articles (id);
CREATE INDEX articles_url_idx ON crawler.articles (url);
Our system is really slow like this. What do you suggest?
Thanks for your replies!
Given the table structure
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
You can issue the following queries:
SELECT * FROM articles WHERE source='xxx'; // Give me all articles with source xxx
SELECT * FROM articles WHERE source='xxx' AND created_at > '2016-01-01 00:00:00'; // Give me all articles whose source is xxx and which were created after 2016-01-01 00:00:00
The pair (created_at, id) in the primary key is there to guarantee article uniqueness. Indeed, it is possible to have 2 different articles with the same created_at time.
Given the knowledge from the previous question you posted, where I said an index is slowing down your query, you need to solve two things:
Write an article only if it does not already exist
Query articles by source, with a range query on created_at
Based on those two I would go with two tables:
Reverse index table
CREATE TABLE article_by_id (
id text,
source text,
created_at timestamp,
PRIMARY KEY (id)
) WITH comment = 'Article by id.';
This table will be used to insert articles when they first arrive. Based on the result returned by INSERT ... IF NOT EXISTS, you will know whether the article already exists or is new, and if it is new you will write to the second table. This table can also serve to find all the key parts for the second table based on the article id. If you need the full article data you can add all the other fields (category, channel, etc.) to this table as well. It will be a skinny row, holding only a single article in one partition.
Example of INSERT:
INSERT INTO article_by_id(id, source, created_at) VALUES (%s,%s, %s) IF NOT EXISTS;
The Java driver returns true or false depending on whether this query was applied or not. It is probably the same in the Python driver, but I did not use it.
Table for range queries and queries by source
As doanduyhai suggested you create a second table:
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
You write to this table only if the first INSERT returned true, meaning you have a new article, not an existing one. This table will serve range queries and queries by source.
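The corresponding write (a sketch that simply mirrors the columns of the table above) could be:
-- executed only when the INSERT ... IF NOT EXISTS into article_by_id was applied
INSERT INTO articles (id, source, created_at, category, channel, last_crawled, text, thumbnail, title, url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?);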
Improvement suggestion
By using timeuuid instead of timestamp for created_at, you are sure that no two articles can have the same created_at, and you could lose the id altogether and rely on the timeuuid. However, from the second question I can see that you rely on an external id, so I only wanted to mention this as a side note.
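A sketch of that timeuuid variant (the table name is illustrative, and the remaining article columns would stay as in the table above):
CREATE TABLE articles_by_source (
source text,
created_at timeuuid,
title text,
url text,
PRIMARY KEY ((source), created_at)
);
-- range queries over a timeuuid column use the timeuuid helper functions
SELECT * FROM articles_by_source
WHERE source = 'xxx' AND created_at > maxTimeuuid('2016-01-01 00:00:00');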

How to choose a proper table structure in Cassandra?

Suppose I have table with the following structure
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It allows me to get all tasks for a user, sorted by name ascending. I also added task_id to the primary key to avoid upserts. The following query works
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get a task with a specific task_id. For example, the following query fails
select * from tasks where user_id = ? and task_id = ?
with this error
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires the name column to be specified, but at that moment I only have the task_id (from the URL, for example) and the user_id (from the session).
How should I create this table to support both queries? Or do I need to create a separate table for the second case? What is the common pattern in this situation?
You could simply add one more redundant column, taskId, holding the same value as task_id, and create a secondary index on taskId.
Then you can query with user_id=? and taskId=?
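A sketch of that approach (the redundant column is called taskid here, and the index name is left for Cassandra to generate):
-- duplicate the clustering column into a regular column and index it
ALTER TABLE tasks ADD taskid uuid;
CREATE INDEX ON tasks (taskid);
-- the partition key restriction keeps the index lookup local to one partition
SELECT * FROM tasks WHERE user_id = ? AND taskid = ?;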
PRIMARY KEY column "task_id" cannot be restricted as preceding
column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
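The additional query then runs against the new table:
SELECT * FROM tasks_by_user_and_task WHERE user_id = ? AND task_id = ?;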
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
While I am usually not a fan of using secondary indexes, in this case it may perform OK. The reason is that you would still be restricting your query by the partition key, which eliminates the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that you cannot create a secondary index on a primary key component, so you would need to duplicate that column (and apply the index to the non-primary-key column) to get that solution to work.
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both of these tables (a batch sketch follows the table definitions below). But Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
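One way to keep the two tables in sync on writes (a sketch, not the only option) is a logged batch:
BEGIN BATCH
INSERT INTO tasks_by_name (user_id, name, task_id, description) VALUES (?, ?, ?, ?);
INSERT INTO tasks_by_id (user_id, name, task_id, description) VALUES (?, ?, ?, ?);
APPLY BATCH;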

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I have a query problem.
Here is my table:
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I provide my partition key, doodle_id.
For example, I can run this without any problems:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But on the last request I made:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error:
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new to Cassandra, but correct me if I'm wrong: in a composite primary key, the first part is the PARTITION KEY, which is mandatory so that Cassandra knows where to look for the data.
The other parts are the CLUSTERING KEY, used to sort the data.
But I still don't get why my first request works and the second one doesn't.
If anyone could help, it would be a great pleasure.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarily with user_id) is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by:
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
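For instance, the last two patterns above would have to be written like this (such filtering queries scan data across the cluster and are best avoided in production):
SELECT * FROM vote WHERE user_id = 97a7378a-e1bb-4586-ada1-177016405142 ALLOW FILTERING;
SELECT * FROM vote WHERE user_id = 97a7378a-e1bb-4586-ada1-177016405142 AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633 ALLOW FILTERING;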
You can see your primary key as a file path, doodle_id#123/user_id#456/schedule_id#789, where all data is stored in the deepest folder (i.e. schedule_id#789). When you're querying, you have to indicate the subfolder/subtree from which you start searching.
Your 2nd query doesn't work because of how columns are organized within the partition. Cassandra cannot get a contiguous slice of columns in the partition because they are interleaved.
You would have to invert the clustering order to (doodle_id, schedule_id, user_id) to be able to run your query.
