How to model cassandra in this particular situations? - cassandra

if I have table structure below, how can i query by
"source = 'abc' and created_at >= '2016-01-01 00:00:00'"?
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY (id)
)
I would like to model my system according to this:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Edit:
What we are doing is very similar to what you are proposing. The difference is our primary key doesn't have brackets around source:
PRIMARY KEY (source, created_at, id). We also have two other indexes:
CREATE INDEX articles_id_idx ON crawler.articles (id);
CREATE INDEX articles_url_idx ON crawler.articles (url);
Our system is really slow like this. What do you suggest?
Thanks for your replies!

Given the table structure
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
You can issue the following queries:
SELECT * FROM articles WHERE source=xxx // Give me all article given the source xxx
SELECT * FROM articles WHERE source=xxx AND created_at > '2016-01-01 00:00:00'; // Give me all articles whose source is xxx and created after 2016-01-01 00:00:00
The couple (created_at,id) in the primary key is here to guarantee article unicity. Indeed, it is possible to have, at the same created_at time, 2 different articles

Given the knowledge from previous question you posted where I said index is slowing down your query you need to solve two things:
Write article only if it does not already exist
Query article based on source and range query on created at
Based on those two I would go with two tables:
Reverse index table
CREATE TABLE article_by_id (
id text,
source text,
created_at timestamp,
PRIMARY KEY (id) ) WITH comment = 'Article by id.';
This table will be used to insert articles when they first arrive. Based on return statement after INSERT ... IF NOT EXISTS you will know if article is existing or new and if it is new you will write to second table. Also this table can serve to find all key parts for second table based on article id. If you need full article data you can add to this table as well all fields (category, channel etc.). This will be skinny row holding only single article in one partition.
Example of INSERT:
INSERT INTO article_by_id(id, source, created_at) VALUES (%s,%s, %s) IF NOT EXISTS;
Java driver returns true or false whether this query was applied or not. Probably it is same in python driver but I did not use it.
Table for range queries and queries by source
As doanduyhai suggested you create a second table:
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
In this table you will write only if first INSERT returned true meaning you have new article, not existing one. This table will serve range queries and queries by source.
Improvement suggestion
By using timeuuid instead of timestamp for created_at you are sure no two article can have same created_at and you can loose id all together and rely on timeuuid. However from second question I can see you rely on external id so wanted to mention this as a sidenote.

Related

Cassandra Update query | timestamp column as clustering key

I have a table in cassandra with following schema:
CREATE TABLE user_album_entity (
userId text,
albumId text,
updateDateTimestamp timestamp,
albumName text,
description text,
PRIMARY KEY ((userId), updateDateTimestamp)
);
The query required to get data would have a where userId = xxx order by updateTimestamp. Hence the schema had updateDateTimestamp.
Problem comes in updating the column of table.The query is: Update the album information for user where user id = xxx. But as per specs,for update query I would need the exact value of updateDateTimestamp which in normal world scenario, an application would never send.
What should be the answer to such problems since I believe this a very common use case where select query requires ordering on timestamp. Any help is much appreciated.
The problem is that your table structure allows the same album to have multiple records with the only difference being the timestamp (the clustering key).
Three possible solutions:
Remove the clustering key and sort your data at application level.
Remove the clustering key and add a Secondary Index to the timestamp field.
Remove the clustering key and create a Materialized View to perform the query.
If your usecase is such that each partition will contain exactly one row,
then you can model your table like:
CREATE TABLE user_album_entity (
userId text,
albumId text static,
updateDateTimestamp timestamp,
albumName text static,
description text static,
PRIMARY KEY ((userId), updateDateTimestamp)
);
modelling the table this way enables Update query to be done in following way:
UPDATE user_album_entity SET albumId = 'updatedAlbumId' WHERE userId = 'xyz'
Hope this helps.

How to choose proper tables structure in cassandra?

Suppose I have table with the following structure
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It allows me to get all tasks for user sorted by name ascending. Also I added task_id to primary key to avoid upserts. The following query holds
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get task with specific task_id. For example, following query crashes
select * from tasks where user_id = ? and task_id = ?
with this error
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires name column to be specified, but at the moment I have only task_id (from url, for example) and user_id (from session).
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
Then you can query user_id=? and tsakId=?
PRIMARY KEY column "task_id" cannot be restricted as preceding
column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
While I am usually not a fan of using secondary indexes, in this case it may perform ok. Reason being, is that you would still be restricting your query by partition key, which would eliminate the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that you cannot create a secondary index on a primary key component, so you would need to duplicate that column (and apply the index to the non-primary key column) to get that solution to work.
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both these tables. But Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);

table definition statement for cassandra for range queries?

Here is the table data
video_id uuid
user_id timeuuid
added_year int
added_date timestamp
title text
description text
I want to construct table based on the following query
select * from video_by_year where added_year<2013;
create table videos_by_year (
video_id uuid
user_id timeuuid
added_year int
added_date timestamp
title text
description text
PRIMARY KEY ((added_year) added_year)
) ;
NOTE: I have used added_year as both primary key and clustering key which is not correct I suppose.
So one of the issues with data modeling in cassandra is that the first component - the partition key - must use "=". The reason for this is pretty clear if you realize what cassandra's doing - it uses that value, hashes it (md5 or murmur3), and uses that to determine which servers in the cluster own that partition.
For that reason, you can't use an inequality - it would require scanning every row in the cluster.
If you need to get videos added before 2013, consider a system where you use some portion of the date as partition key, and then SELECT from each of those date 'buckets', which you can do asynchronously and in parallel. For example:
create table videos_by_year (
video_id uuid
user_id timeuuid
added_date_bucket text
added_date timestamp
title text
description text
PRIMARY KEY ((added_date_bucket), added_date, video_id)
) ;
I used text for added_date_bucket so you could use 'YYYY', or 'YYYY-MM' or similar. Note that depending on how quickly you add videos to the system, you may even want 'YYYY-MM-DD' or 'YYYY-MM-DD-HH:ii:ss', because you'll hit a practical limit of a few million videos per bucket.
You could get clever and have the video_id be a timeuuid, then you get added_date and video_id in a single column.

How to save data in cassandra conditionally only if properties did not change

We have data model of article with lot of properties. Here is our table model:
CREATE TABLE articles (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, gtin)
) WITH COMMENT='Articles';
Where gtin uniquely identifies article and we save all articles of organization in one row. We have constraint to update each article only if something has changed. This is important since if article is changed, we update last_updated field and external devices know which articles to synchronizes since they have information when they synchronized last time.
We added one more table for that:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
So we can easily return all articles updated after certain point in time. This table must be cleared from duplicates per gtin since we import articles each day and sync is done from mobile devices so we want to keep dataset small (in theory we could save everything in that table, and overwrite with latest info but that created large datasets between syncs so we started deleting from that table, and to delete we needed to know last_updated from first table)
Problems we are facing right now are:
In order to check if article fields are updated we need to do read before write (we partially solved that with content_hash field which is hash over all fields so we read and compare hash of incoming article with value in DB)
We are deleting and inserting in second table since we need unique gtins there (need only latest change to send to devices, not duplicate articles) which produces awful lot of tombstones
We have feature to add to search by many different combinations of fields
Questions:
Is cassandra good choice for this kind of data or we should move it to some other storage (or even have elasticsearch and cassandra in combination where we can index changes after time and cassandra can hold only master data per gtin)?
Can data be modeled better for our use case to avoid read before write or deletes in second table?
Update
Just to clarify use case: other devices are syncing with pagination (sending last_sync_date and skip and count) so we need table with all article information, sorted by last_updated without duplicates and searchable by last_updated
If you are using Cassandra 2.1.1 and later, then you can use the "not equal" comparison in the IF part of the UPDATE statement (see CASSANDRA-6839 JIRA issue) to make sure you update data only if anything has changed. Your statement would look something like this:
UPDATE articles
SET
barcodes = <barcodes>,
... = <...>,
last_updated = <last_updated>
WHERE
organization_id = <organization_id>
AND gtin = <gtin>
IF content_hash != <content_hash>;
For your second table, you don't need to duplicate entire data from the first table as you can do the following:
create your table like this:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
last_updated timeuuid,
gtin text,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
Once you've updated the first table, you can read the last_updated value for that gtin again and if it is equal or greater than the last_updated value you passed in, then you know that the update was successful (by your or another process), so you can now go ahead and insert that retrieved last_updated value into the second table. You don't need to delete the records for this update. I assume you can create distinct updated gtin list on the application side, if you do polling (using a range query) on a regular basis, which I assume pulls a reasonable amount of data. You can TTL these new records after a few poll cycles to remove a necessity to do manual deletes for example. Then, after you found the gtins affected, then you do a second query where you pull all of the data from the first table. You can then run a second sanity check on the updated dates to avoid sending anything that is supposed to be sent on the next update (if it is necessary of course).
HTH.

What do you do change in the such a data model Cassandra?

I have task to create a social feed(news feed). I think no need to explain the standard functionality - all are how as FB.
I chose solution apache cassandra and designed a data column Posts for storing information about posts users:
CREATE TABLE Posts (
post_id uuid,
post_at timestamp,
user_id text,
name varchar,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY ((post_id, user_id), post_at)
)
WITH CLUSTERING ORDER BY (post_at DESC) COMPACT STORAGE;
The next table contains id user posts:
CREATE TABLE posts_user (
post_id bigint,
post_at timestamp,
user_id bigint,
PRIMARY KEY ((post_id), post_at, user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC) AND COMPACT STORAGE;
How do you think, is it good? What do you do change in the such a data model?
There are a couple of questions and a couple of improvements that jump out.
COMPACT STORAGE is deprecated now (if you want to take advantage of CQL 3 features). I do not think that you can create your table Posts as you have defined above since it uses CQL 3 features (collections) with COMPACT STORAGE as well as declaring more than one column that is not part of the primary key.
posts_user has completely different key types than Posts does. I am not clear on what the relationship between the two tables is, but I imagine that post_id is supposed to be consistent between them, whereas you have it as a uuid in one table and a bigint in the other. There are also discrepancies with the other fields.
Assuming post_id is unique and represents the id of an individual post, it is strange to have it as the first part of a compound primary key in the Posts table since if you know the post_id then you can already uniquely access the record. Furthermore, as it is part of the partition key it also prevents you from doing wider selects of multiple posts and taking advantage of your post_at ordering.
The common method to fix this is to create a dedicated index table to sort the data the way you want.
E.g.
CREATE TABLE posts (
id uuid,
created timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (id)
);
CREATE TABLE posts_by_user_index (
user_id uuid,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id,post_at,post_id)
WITH CLUSTERING ORDER BY (post_at DESC)
);
Or more comprehensively:
CREATE TABLE posts_by_user_sort_index (
user_id uuid,
post_id uuid,
sort_field text,
sort_value text,
PRIMARY KEY ((user_id,sort_field),sort_value,post_id)
);
However, in your case if you only wish to select the data one way, then you can get away with using your posts table to do the sorting:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (user_id,post_at,id)
WITH CLUSTERING ORDER BY (post_at DESC)
);
It will just make it more complicated if you wish to add additional indexes later since you will need to index each post not just by its post id, but by its user and post_at fields as well.

Resources