How to save data in Cassandra conditionally, only if properties have changed

We have a data model for articles with a lot of properties. Here is our table model:
CREATE TABLE articles (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, gtin)
) WITH COMMENT='Articles';
Here gtin uniquely identifies an article, and we save all articles of an organization in one partition. We have a constraint to update each article only if something has changed. This is important because when an article changes we update the last_updated field, and external devices know which articles to synchronize since they know when they last synchronized.
We added one more table for that:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
So we can easily return all articles updated after a certain point in time. This table must be cleared of duplicates per gtin, since we import articles each day and sync is done from mobile devices, so we want to keep the dataset small. (In theory we could save everything in that table and overwrite with the latest info, but that created large datasets between syncs, so we started deleting from the table; and to delete, we needed to know last_updated from the first table.)
Problems we are facing right now are:
In order to check whether article fields have changed, we need to do a read before write. We partially solved that with the content_hash field, which is a hash over all fields: we read and compare the hash of the incoming article with the value in the DB (see the sketch after this list).
We are deleting and inserting into the second table since we need unique gtins there (we need only the latest change to send to devices, not duplicate articles), which produces an awful lot of tombstones.
We have a feature to add: search by many different combinations of fields.
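For reference, the hash check mentioned above is a plain read before write; a minimal sketch (bind values elided):
SELECT content_hash FROM articles
WHERE organization_id = ? AND gtin = ?;
-- application code compares this with the hash of the incoming article
-- and only issues the UPDATE when the hashes differ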
Questions:
Is Cassandra a good choice for this kind of data, or should we move it to some other storage (or even use Elasticsearch and Cassandra in combination, where Elasticsearch indexes changes over time and Cassandra holds only the master data per gtin)?
Can the data be modeled better for our use case, to avoid the read before write or the deletes in the second table?
Update
Just to clarify the use case: other devices sync with pagination (sending last_sync_date plus skip and count), so we need a table with all article information, sorted by last_updated, without duplicates, and searchable by last_updated.

If you are using Cassandra 2.1.1 or later, you can use the "not equal" comparison in the IF part of the UPDATE statement (see the CASSANDRA-6839 JIRA issue) to make sure you update data only if anything has changed. Your statement would look something like this:
UPDATE articles
SET
barcodes = <barcodes>,
... = <...>,
last_updated = <last_updated>
WHERE
organization_id = <organization_id>
AND gtin = <gtin>
IF content_hash != <content_hash>;
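Note that a conditional statement returns an [applied] flag, so you know immediately whether the write happened; in cqlsh the result looks roughly like this:
 [applied]
-----------
      True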
For your second table, you don't need to duplicate the entire data from the first table. Instead, create the table like this:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
last_updated timeuuid,
gtin text,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
Once you've updated the first table, read the last_updated value for that gtin again. If it is equal to or greater than the last_updated value you passed in, you know the update was successful (by your process or another one), so you can go ahead and insert that retrieved last_updated value into the second table. You don't need to delete the records for this update: I assume you can build a distinct list of updated gtins on the application side if you poll with a range query on a regular basis, which presumably pulls a reasonable amount of data. You can TTL these new records after a few poll cycles to remove the need for manual deletes. Then, once you have found the affected gtins, run a second query to pull all of the data from the first table. You can also run a second sanity check on the updated dates to avoid sending anything that is supposed to be sent on the next update (if that is necessary, of course).
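A minimal sketch of that flow in CQL (the 7-day TTL is just an assumed value; tune it to your poll cycle):
-- read back the winning last_updated for this gtin
SELECT last_updated FROM articles
WHERE organization_id = ? AND gtin = ?;
-- if it is >= the value you wrote, record the change marker;
-- the TTL replaces manual deletes (604800 seconds = 7 days)
INSERT INTO articles_by_last_updated (organization_id, last_updated, gtin)
VALUES (?, ?, ?) USING TTL 604800;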
HTH.

Related

Insert table Mutation to different Cassandra table

I have a requirement to keep the old values of a row in a history table for auditing whenever we do a row update. Is there any solution available in Apache Cassandra to achieve this?
I looked at Triggers, and not much is mentioned in the docs. I'm not sure about the performance impact of using triggers. Also, if we use a trigger, will it give us the old value of a column when we do an update?
Cassandra is a great tool for keeping row history. I will try to explain it with an example. Consider the table design below:
CREATE TABLE user_by_id (
userId text,
timestamp timestamp,
name text,
fullname text,
email text,
PRIMARY KEY (userId,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
With this kind of table design you can keep the history of the record.
Here, userid is the partition key and timestamp the clustering key. Every insert for the same user will be recorded as a different row, for example:
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'x', 'xyz', 'x@xyz.com');
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'y', 'xyz', 'y@xyz.com');
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'z', 'xyz', 'z@xyz.com');
The insert statements above are actually updating the values of the name and email columns, but they will be saved as three different rows: because timestamp is a clustering key, the timestamp will be different for each row. If you want to get the latest value, just use LIMIT in your select query.
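For example, to read only the most recent version of a user (a sketch against the table above; the DESC clustering order makes the newest row come first):
SELECT * FROM user_by_id WHERE userId = '1' LIMIT 1;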
This design keeps the history of the row, which can be used for audit purposes.

If not materialized views and not secondary indexes, then what is the recommended way to query data in Cassandra?

I have some data in Cassandra, say:
create table MyTable (
id text PRIMARY KEY,
data text,
updated_on timestamp
);
My application, in addition to querying this data by the primary key id, needs to query it by the updated_on timestamp as well. To fulfil the query-by-time use case I have tried the following.
create table MyTable (
id text PRIMARY KEY,
data text,
updated_on timestamp,
updated_on_minute timestamp
);
A secondary index on the updated_on_minute field. As I understand it, secondary indexes are not recommended for high-cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover, my data gets updated frequently, which means updated_on_minute will keep changing.
A MaterializedView with updated_on_minute as the partition key and id as the clustering key. I am on Cassandra 3.9 and had just begun using these, but alas I found the release notes for 3.11.x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data as it comes in time-wise? Would love some input on this.
Thanks in advance.
As has always been the case, create an additional table to query by a different partition key.
In your case the table would be:
create table MyTable_by_timestamp (
id text,
data text,
updated_on timestamp,
PRIMARY KEY (updated_on, id)
);
Write to both tables, MyTable_by_timestamp and MyTable, and read from the corresponding table based on the partition key, either updated_on or id.
It's absolutely fine to duplicate data based on the use case (query) it's trying to solve.
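A sketch of the dual write, using a logged batch so both tables stay consistent (values are made up):
BEGIN BATCH
INSERT INTO MyTable (id, data, updated_on) VALUES ('a1', 'payload', '2020-01-01 10:00:00');
INSERT INTO MyTable_by_timestamp (id, data, updated_on) VALUES ('a1', 'payload', '2020-01-01 10:00:00');
APPLY BATCH;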
Edited:
If there is a concern about huge partitions, you can always bucket them into smaller partitions. For example, the table above could be broken down into:
create table MyTable_by_timestamp (
id text,
data text,
updated_on timestamp,
updated_min timestamp,
PRIMARY KEY (updated_min, id)
);
Here I have chosen a minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.
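For example, with the minute bucket computed in application code (illustrative values):
INSERT INTO MyTable_by_timestamp (updated_min, id, data, updated_on)
VALUES ('2016-01-01 10:15:00', 'a1', 'payload', '2016-01-01 10:15:37');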

How is denormalization handled in Cassandra

What is the best approach to updating a table with duplicated data?
I have a table:
create table users (
id text PRIMARY KEY,
email text,
description text,
salary int
);
I will delete, update, insert etc. to this table. But I also have a requirement to be able to search by email and by description. If I create a new table with new composite keys for email and description, then when I update my base table with
insert into users (id, salary) values ('1', 500);
I do not have the required data to also update my secondary table, since all the client has is id and salary. How is the second table updated?
Other workarounds and shortcomings
I could have created a materialized view, but since the base table has only one primary key column, I can only add one more column to the view's key. My search requirement needs more than one column.
I could create secondary indexes on the columns that will be searched on, but the performance would be bad since the columns I will be searching on (description, email, etc.) have high cardinality.
So, the "correct" way of doing this is to create 3 tables: salary_by_id, salary_by_email and salary_by_description.
create table salary_by_id (
id text PRIMARY KEY,
salary int
);
create table salary_by_email (
email text PRIMARY KEY,
salary int
);
create table salary_by_description (
description text,
id text,
salary int,
PRIMARY KEY (description, id)
);
The reason I added id to salary_by_description is that, at a guess, description won't be globally unique, so it has to have something else in its primary key.
Depending on the size of these tables, the last one might need something extra added to its partition key. And if needed you can add id, email and description to the other tables.
Now, when inserting or deleting values you need to do it in all 3 tables. If you use a driver that supports asynchronous calls, like the Java driver, this doesn't cost very much extra.
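A sketch of the triple write with example values; a logged batch keeps the three tables in sync at the cost of some extra latency:
BEGIN BATCH
INSERT INTO salary_by_id (id, salary) VALUES ('1', 500);
INSERT INTO salary_by_email (email, salary) VALUES ('user@example.com', 500);
INSERT INTO salary_by_description (description, id, salary) VALUES ('engineer', '1', 500);
APPLY BATCH;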

How to model Cassandra in this particular situation?

If I have the table structure below, how can I query by
"source = 'abc' and created_at >= '2016-01-01 00:00:00'"?
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY (id)
)
I would like to model my system according to this:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Edit:
What we are doing is very similar to what you are proposing. The difference is our primary key doesn't have brackets around source:
PRIMARY KEY (source, created_at, id). We also have two other indexes:
CREATE INDEX articles_id_idx ON crawler.articles (id);
CREATE INDEX articles_url_idx ON crawler.articles (url);
Our system is really slow like this. What do you suggest?
Thanks for your replies!
Given the table structure
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
You can issue the following queries:
SELECT * FROM articles WHERE source='xxx'; // Give me all articles for source xxx
SELECT * FROM articles WHERE source='xxx' AND created_at > '2016-01-01 00:00:00'; // Give me all articles whose source is xxx and that were created after 2016-01-01 00:00:00
The pair (created_at, id) in the primary key is there to guarantee article uniqueness. Indeed, it is possible to have two different articles with the same created_at time.
Given the knowledge from the previous question you posted, where I said an index was slowing down your query, you need to solve two things:
Write article only if it does not already exist
Query article based on source and range query on created at
Based on those two I would go with two tables:
Reverse index table
CREATE TABLE article_by_id (
id text,
source text,
created_at timestamp,
PRIMARY KEY (id)
) WITH COMMENT = 'Article by id.';
This table will be used to insert articles when they first arrive. Based on the result of INSERT ... IF NOT EXISTS you will know whether the article is existing or new, and if it is new you will write to the second table. This table can also serve to find all the key parts for the second table based on article id. If you need full article data, you can add all the other fields (category, channel, etc.) to this table as well. It will be a skinny row, holding only a single article per partition.
Example of INSERT:
INSERT INTO article_by_id(id, source, created_at) VALUES (%s,%s, %s) IF NOT EXISTS;
The Java driver returns true or false depending on whether the query was applied. It is probably the same in the Python driver, but I have not used it.
Table for range queries and queries by source
As doanduyhai suggested you create a second table:
CREATE TABLE articles (
id text,
source text,
created_at timestamp,
category text,
channel text,
last_crawled timestamp,
text text,
thumbnail text,
title text,
url text,
PRIMARY KEY ((source),created_at, id)
)
You will write to this table only if the first INSERT returned true, meaning you have a new article, not an existing one. This table will serve range queries and queries by source.
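Putting the two writes together, a sketch with made-up values:
-- step 1: claim the id; [applied] is true only for a new article
INSERT INTO article_by_id (id, source, created_at)
VALUES ('article-1', 'abc', '2016-01-01 10:00:00') IF NOT EXISTS;
-- step 2: only if applied, write the full row for range queries
INSERT INTO articles (id, source, created_at, title, url)
VALUES ('article-1', 'abc', '2016-01-01 10:00:00', 'Some title', 'http://example.com/article-1');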
Improvement suggestion
By using timeuuid instead of timestamp for created_at, you are sure no two articles can have the same created_at, and you can lose id altogether and rely on the timeuuid. However, from the second question I can see you rely on an external id, so I wanted to mention this as a side note.

Using Cassandra for time series data

I'm researching how to store logs in Cassandra.
The schema for logs would be something like this.
EDIT: I've changed the schema in order to clarify things.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
reason text,
item text,
price int,
count int,
PRIMARY KEY ((userid), time) - #1
PRIMARY KEY ((userid), time, reason, item, price, count) - #2
);
A new table will be created every day, so a table contains logs for only one day.
My querying condition is as follows.
Query all logs from a specific user on a specific day (date, not time).
So reason, item, price and count will not be used as conditions for queries at all.
My question is which PRIMARY KEY design suits this better.
EDIT: And the key here is that I want to store the logs in a schematic way.
If I choose #1, very many columns would be created per log, and the possibility of having more values per log is very high. The schema above is just an example; a log can contain values like subreason, friendid and so on.
If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any valuable information about the overhead of composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series; the fact that you're creating a table per day doesn't seem optimal either.
Instead, I'd recommend creating a single table, partitioning by userid and day, and using a timeuuid as the clustering column for the event. An example would look like:
CREATE TABLE log_per_day (
userid bigint,
date text,
time timeuuid,
value text,
PRIMARY KEY ((userid, date), time)
)
This will allow you to keep all of a day's events in a single partition and run your query per day per user.
Declaring time as the clustering column gives you a wide row where you can insert as many events as you need in a day.
So the row key is a composite of the userid plus the date as text, e.g.:
insert into log_per_day (userid, date, time, value) values (1000, '2015-05-06', aTimeUUID1, 'my value');
insert into log_per_day (userid, date, time, value) values (1000, '2015-05-06', aTimeUUID2, 'my value2');
The two inserts above will land in the same partition, and therefore you will be able to read them in a single query.
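And the read side, one query per user per day (sketch):
SELECT time, value FROM log_per_day
WHERE userid = 1000 AND date = '2015-05-06';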
Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.
Hope it helps,
José Luis
