Cassandra bookmark table data modeling - cassandra

I want to model a table that supports the following:
I can fetch the bookmarked items in descending timestamp order.
I can delete an individual bookmarked item.
My table looks like this:
CREATE TABLE bookmarked_content(
user_id uuid,
type varchar,
timestamp timestamp,
item_id uuid,
PRIMARY KEY (user_id, type, timestamp)
) WITH CLUSTERING ORDER BY (type ASC, timestamp DESC);
This is fine for fetching all the bookmarks of a specific type in descending timestamp order. The problem is that I can't delete a specific item from the table, and I don't want to depend on secondary indexes for this.
Thanks in advance

Related

CQL query delete if not in list

I am trying to delete all rows in the table where the partition key is not in a list of guids.
Here's my table definition.
CREATE TABLE cloister.major_user (
user_id uuid,
user_handle text,
avatar text,
created_at timestamp,
email text,
email_verified boolean,
first_name text,
last_name text,
last_updated_at timestamp,
profile_type text,
PRIMARY KEY (user_id, user_handle)
) WITH CLUSTERING ORDER BY (user_handle ASC)
I want to retain certain user_ids and delete the rest. The following options have failed.
delete from juna_user where user_id ! in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2) ALLOW FILTERING
What am I doing wrong?
CQL supports only the IN condition (see the docs), not NOT IN. You need to specify explicitly which partition keys (or full primary keys) to delete; a NOT IN condition is not allowed because it could match a huge amount of data. If you need this, first generate the list of entries to delete, for example with the Spark Cassandra Connector.
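The "generate the list of entries to delete" step can be sketched in plain Python, independent of how the ids are read out of the cluster. This is only a sketch under assumptions: ids_to_delete and delete_statements are hypothetical helpers, the table name is taken from the question's queries, and the %s placeholders follow the positional-parameter style used elsewhere on this page. No driver call is made here.

```python
from uuid import UUID

def ids_to_delete(all_ids, keep_ids):
    """Return the partition keys that are not in keep_ids, preserving input order."""
    keep = set(keep_ids)
    return [i for i in all_ids if i not in keep]

def delete_statements(ids, chunk_size=2, table="juna_user"):
    """Yield (cql, params) pairs, each deleting a small chunk of partitions via IN."""
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ", ".join(["%s"] * len(chunk))
        yield f"DELETE FROM {table} WHERE user_id IN ({placeholders})", tuple(chunk)

# Hypothetical data: five known partition keys, two of which must be retained.
all_ids = [UUID(int=n) for n in range(5)]
keep = [UUID(int=0), UUID(int=3)]
doomed = ids_to_delete(all_ids, keep)
statements = list(delete_statements(doomed))
```

Keeping each IN list small avoids sending one huge multi-partition delete to a single coordinator; the chunk size is something to tune.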

Data modelling conflicts in Cassandra

The schema I am using is following :
CREATE TABLE mytable(
id varchar,
date date,
name varchar,
PRIMARY KEY ((date),name, id)
) WITH CLUSTERING ORDER BY (name desc);
I have two queries for my use case:
Fetch all records for a given name.
Delete all records for a given date.
Since records can't be deleted without specifying the partition key, my partition key is fixed to date alone; no other column can be added to it, because date is all I have at deletion time.
But to fetch records by name I need ALLOW FILTERING, which scans the whole table with the schema above and causes performance issues.
Can you suggest a better approach that avoids ALLOW FILTERING and is still compatible with deleting by date?
You could use indexes:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSecondaryIndex.html
But you have to be careful: there can be issues depending on the size of the table. You should read this post for more information:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
You need an additional table to support your requirements.
Your main query is to retrieve the records for a given name. For this, you should define mytable as follows (note the primary key):
CREATE TABLE mytable(
id varchar,
date date,
name varchar,
PRIMARY KEY ((name),date, id)
) WITH CLUSTERING ORDER BY (date desc);
This table will let you retrieve your data for a given name with (query 1):
SELECT * FROM mytable WHERE name='bob';
Now, you want to be able to delete by date. For this you would need the following additional table:
CREATE TABLE mytable_by_date(
id varchar,
date date,
name varchar,
PRIMARY KEY ((date), name, id)
) WITH CLUSTERING ORDER BY (name ASC);
This table will let you find the name (and id) for a given date with:
SELECT * from mytable_by_date WHERE date='your-date';
I don't know your business requirements, so this query might return zero, one, or more results. Once you have them, you can issue the deletes against the first and second tables (maybe using a logged batch for atomicity?):
DELETE FROM mytable_by_date WHERE date='your-date' and name='the-name' and id='the-id'
DELETE FROM mytable WHERE name='the-name' and ...
Overall, you might need to adjust according to your business requirements (is name unique, is uniqueness enforced by id etc...)
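To make the read-then-delete flow concrete, here is a small in-memory sketch. It does not talk to Cassandra: the two dictionaries are stand-ins for the partitions of mytable and mytable_by_date, and insert_record and delete_by_date are hypothetical helper names.

```python
from collections import defaultdict

# In-memory stand-ins for the two tables, keyed by their partition key.
mytable = defaultdict(set)          # name -> {(date, id)}
mytable_by_date = defaultdict(set)  # date -> {(name, id)}

def insert_record(name, date, id_):
    """Write the same record to both tables, as the application would."""
    mytable[name].add((date, id_))
    mytable_by_date[date].add((name, id_))

def delete_by_date(date):
    """Read the date partition, then remove each record from both tables."""
    for name, id_ in mytable_by_date.pop(date, set()):
        mytable[name].discard((date, id_))

insert_record("bob", "2016-01-01", "1")
insert_record("bob", "2016-01-02", "2")
insert_record("alice", "2016-01-01", "3")
delete_by_date("2016-01-01")
```

After the call, only bob's record for 2016-01-02 is left in the by-name table, which is the state the two DELETE statements are meant to reach.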
Hope it helps!

Cassandra Update query | timestamp column as clustering key

I have a table in cassandra with following schema:
CREATE TABLE user_album_entity (
userId text,
albumId text,
updateDateTimestamp timestamp,
albumName text,
description text,
PRIMARY KEY ((userId), updateDateTimestamp)
);
The query to read the data is of the form where userId = xxx order by updateDateTimestamp, which is why updateDateTimestamp is a clustering key.
The problem comes when updating a column of the table. The query is: update the album information for the user where userId = xxx. But as per the specs, an update needs the exact value of updateDateTimestamp, which a real-world application would never send.
What is the right approach to such problems? I believe this is a very common use case, where the select query requires ordering on a timestamp. Any help is much appreciated.
The problem is that your table structure allows the same album to have multiple records with the only difference being the timestamp (the clustering key).
Three possible solutions:
Remove the clustering key and sort your data at application level.
Remove the clustering key and add a Secondary Index to the timestamp field.
Remove the clustering key and create a Materialized View to perform the query.
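The first option, sorting at application level, could look like the sketch below. It assumes the rows for one user have already been fetched and are available as dicts keyed by column name; that row shape and the helper name newest_first are assumptions for illustration, not something a driver guarantees.

```python
from datetime import datetime

def newest_first(rows):
    """Sort album rows by updateDateTimestamp, most recent first."""
    return sorted(rows, key=lambda r: r["updateDateTimestamp"], reverse=True)

# Hypothetical rows for a single user.
rows = [
    {"albumId": "a1", "updateDateTimestamp": datetime(2016, 1, 1)},
    {"albumId": "a2", "updateDateTimestamp": datetime(2016, 3, 1)},
    {"albumId": "a3", "updateDateTimestamp": datetime(2016, 2, 1)},
]
ordered_ids = [r["albumId"] for r in newest_first(rows)]
```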
If your use case is such that each partition contains exactly one row,
then you can model your table like this:
CREATE TABLE user_album_entity (
userId text,
albumId text static,
updateDateTimestamp timestamp,
albumName text static,
description text static,
PRIMARY KEY ((userId), updateDateTimestamp)
);
Modelling the table this way lets the update be written as follows:
UPDATE user_album_entity SET albumId = 'updatedAlbumId' WHERE userId = 'xyz';
Hope this helps.

Order by created date In Cassandra

I have a problem with ordering data in a Cassandra database.
This is my table structure:
CREATE TABLE posts (
id uuid,
created_at timestamp,
comment_enabled boolean,
content text,
enabled boolean,
meta map<text, text>,
post_type tinyint,
summary text,
title text,
updated_at timestamp,
url text,
user_id uuid,
PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
When I run this query, I get the following message:
Query:
select * from posts order by created_at desc;
message:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
And this query returns data without sorting:
select * from posts
There are a couple of things you need to understand.
In your case the partition key is "id" and the clustering key is "created_at".
That means each row is stored in a partition chosen by the hash of "id" (the default partitioner uses Murmur3), and inside that partition the data is sorted by your clustering key, here "created_at".
So when you query that table, the results within a partition come back sorted by the clustering order you specified when creating the table. However, there is a gotcha.
If you do not specify the partition key in the WHERE clause, the order of the result set depends on the hashed values of the partition key (in your case id).
So to get the posts in that specific order, you have to specify the partition key like this:
select * from posts WHERE id=1 order by created_at desc;
Note:
It is not necessary to specify the ORDER BY clause on a query if your desired sort direction (“ASCending/DESCending”) already matches the CLUSTERING ORDER in the table definition.
So essentially the above query is the same as
select * from posts WHERE id=1
You can read more about this here http://www.datastax.com/dev/blog/we-shall-have-order
The error message is pretty clear: you cannot use ORDER BY without restricting the partition key in the WHERE clause (with = or IN). This is by design.
The data you get when running without a WHERE clause is actually ordered, not by your clustering key but by the token function applied to your partition key. You can verify the order by issuing:
SELECT token(id), id, created_at, user_id FROM posts;
where the token function arguments exactly match your PARTITION KEY.
I suggest you read this and this to understand what you can and can't do.
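The ordering described in both answers can be imitated in a few lines of Python. This is only an illustration: fake_token is a hypothetical stand-in that uses MD5 instead of the real Murmur3 partitioner, created_at is a plain integer, and nothing here queries a cluster.

```python
import hashlib

def fake_token(partition_key):
    """Stand-in for the partitioner: any stable hash shows the effect."""
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def full_scan(rows):
    """Order rows like an unrestricted SELECT: by token of id, then created_at DESC."""
    return sorted(rows, key=lambda r: (fake_token(r["id"]), -r["created_at"]))

rows = [
    {"id": "a", "created_at": 1},
    {"id": "b", "created_at": 5},
    {"id": "a", "created_at": 3},
    {"id": "b", "created_at": 2},
]
scanned = full_scan(rows)
```

Rows of the same partition stay together and are sorted by created_at descending, but which partition comes first depends only on the hash, so the result is not globally ordered by created_at.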

How to model cassandra in this particular situations?

If I have the table structure below, how can I query by
"source = 'abc' and created_at >= '2016-01-01 00:00:00'"?
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY (id)
)
I would like to model my system according to this:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Edit:
What we are doing is very similar to what you are proposing. The difference is our primary key doesn't have brackets around source:
PRIMARY KEY (source, created_at, id). We also have two other indexes:
CREATE INDEX articles_id_idx ON crawler.articles (id);
CREATE INDEX articles_url_idx ON crawler.articles (url);
Our system is really slow like this. What do you suggest?
Thanks for your replies!
Given the table structure
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
You can issue the following queries:
SELECT * FROM articles WHERE source=xxx; // Give me all articles for the source xxx
SELECT * FROM articles WHERE source=xxx AND created_at > '2016-01-01 00:00:00'; // Give me all articles whose source is xxx and created after 2016-01-01 00:00:00
The pair (created_at, id) in the primary key is there to guarantee article uniqueness: two different articles can have the same created_at time.
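Why this key makes the range query cheap can be shown with a small in-memory model. It is a sketch only: the dictionary stands in for partitions keyed by source, each partition is a list kept sorted by (created_at, id), timestamps are plain integers, and insert_article and created_after are hypothetical helper names.

```python
import bisect
from collections import defaultdict

# source -> list of (created_at, id), kept sorted like clustering columns
articles = defaultdict(list)

def insert_article(source, created_at, id_):
    """Insert into the source partition, keeping clustering order."""
    bisect.insort(articles[source], (created_at, id_))

def created_after(source, ts):
    """Rows of one partition with created_at > ts, found by a slice of the sorted list."""
    rows = articles[source]
    keys = [created for created, _ in rows]
    return rows[bisect.bisect_right(keys, ts):]

insert_article("abc", 10, "x")
insert_article("abc", 10, "y")   # same created_at, different id: both rows are kept
insert_article("abc", 5, "z")
insert_article("abc", 20, "w")
insert_article("other", 30, "q")
recent = created_after("abc", 10)
```

The query touches a single partition and reads one contiguous slice of it, which is what makes source = ... AND created_at > ... efficient with this primary key.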
Given what came out of your previous question, where I said the index is slowing down your query, you need to solve two things:
Write an article only if it does not already exist.
Query articles by source with a range query on created_at.
Based on those two, I would go with two tables:
Reverse index table
CREATE TABLE article_by_id (
id text,
source text,
created_at timestamp,
PRIMARY KEY (id) ) WITH comment = 'Article by id.';
This table is used to insert articles when they first arrive. From the result of INSERT ... IF NOT EXISTS you will know whether the article already exists or is new; if it is new, you write it to the second table. This table can also be used to find all key parts for the second table from an article id. If you need the full article data here, you can add all the other fields (category, channel, etc.) to this table as well. It is a skinny row, holding a single article per partition.
Example of INSERT:
INSERT INTO article_by_id(id, source, created_at) VALUES (%s,%s, %s) IF NOT EXISTS;
The Java driver returns true or false depending on whether the query was applied. It is probably the same in the Python driver, but I have not used it.
Table for range queries and queries by source
As doanduyhai suggested, create a second table:
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
Write to this table only if the first INSERT returned true, meaning the article is new rather than an existing one. This table serves range queries and queries by source.
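The two-step write can be sketched without a cluster. In this sketch the dictionaries stand in for article_by_id and articles, and insert_if_not_exists imitates the applied/not-applied result of INSERT ... IF NOT EXISTS; both helper names are hypothetical and no driver is involved.

```python
article_by_id = {}  # id -> (source, created_at)
articles = {}       # (source, created_at, id) -> full article

def insert_if_not_exists(id_, source, created_at):
    """Imitate INSERT ... IF NOT EXISTS: return True only when the row was new."""
    if id_ in article_by_id:
        return False
    article_by_id[id_] = (source, created_at)
    return True

def store_article(article):
    """Write to the second table only when the first insert was applied."""
    applied = insert_if_not_exists(article["id"], article["source"], article["created_at"])
    if applied:
        key = (article["source"], article["created_at"], article["id"])
        articles[key] = article
    return applied

first = store_article({"id": "42", "source": "abc", "created_at": 100, "title": "t"})
# The same article crawled again later is rejected by the first table.
second = store_article({"id": "42", "source": "abc", "created_at": 200, "title": "t"})
```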
Improvement suggestion
By using timeuuid instead of timestamp for created_at, you can be sure no two articles have the same created_at, and you can lose id altogether and rely on the timeuuid. However, from your second question I can see you rely on an external id, so I mention this only as a side note.
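A CQL timeuuid is a version 1 (time-based) UUID, so the idea can be tried with Python's standard uuid module. This is a minimal sketch: timeuuid_to_datetime is a hypothetical helper, and the constant is the standard offset between the UUID epoch (1582-10-15) and the Unix epoch in 100 ns ticks.

```python
import uuid
from datetime import datetime, timezone

# 100-ns ticks between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01)
GREGORIAN_OFFSET = 0x01B21DD213814000

def timeuuid_to_datetime(u):
    """Recover the creation time embedded in a version 1 UUID."""
    seconds = (u.time - GREGORIAN_OFFSET) / 1e7
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Two ids generated back to back are distinct, and each carries its own timestamp.
a = uuid.uuid1()
b = uuid.uuid1()
created = timeuuid_to_datetime(a)
```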

Resources