Insert table Mutation to different Cassandra table - cassandra

I have a requirement to keep the old values of a row in a history table for auditing whenever a row is updated. Is there any solution available in Apache Cassandra to achieve this?
I looked at triggers, but the docs do not say much about them, and I am not sure about the performance impact of using them. Also, if we use a trigger, will it give the old value of a column when we do an update?

Cassandra is a good tool for keeping row history. I will try to explain it with an example. Consider the table design below:
CREATE TABLE user_by_id (
userId text,
timestamp timestamp,
name text,
fullname text,
email text,
PRIMARY KEY (userId,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
With this kind of table design you can keep the history of the record.
Here, userId is the partition key and timestamp is the clustering key. Every insert for the same user is recorded as a separate row. For example:
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'x', 'xyz', 'x#xyz.com');
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'y', 'xyz', 'y#xyz.com');
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'z', 'xyz', 'z#xyz.com');
The insert statements above are effectively updating the name and email columns. However, because timestamp is a clustering key and is different for each insert, they are saved as three separate rows. If you want only the latest value, use LIMIT 1 in your select query; with the DESC clustering order the newest row comes back first.
This design keeps the history of the row, which can be used for audit purposes.
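To make the behaviour concrete, here is a minimal plain-Python model of the table above. It is only a sketch: the class name UserHistory and its methods are made up for illustration and it does not talk to Cassandra. It mimics two things, namely that a row is identified by (userId, timestamp) and that rows in a partition come back newest first.

```python
from datetime import datetime, timezone


class UserHistory:
    """In-memory stand-in for user_by_id (hypothetical helper, not a driver API)."""

    def __init__(self):
        # userId -> {timestamp: row}; the inner key plays the role of the clustering key.
        self._partitions = {}

    def insert(self, user_id, ts, name, fullname, email):
        # Like a CQL INSERT, writing the same (userId, timestamp) again overwrites the row.
        self._partitions.setdefault(user_id, {})[ts] = {
            "timestamp": ts,
            "name": name,
            "fullname": fullname,
            "email": email,
        }

    def history(self, user_id):
        # Mimics CLUSTERING ORDER BY (timestamp DESC): newest version first.
        partition = self._partitions.get(user_id, {})
        return [partition[ts] for ts in sorted(partition, reverse=True)]

    def latest(self, user_id):
        # Equivalent in spirit to SELECT ... WHERE userId = ? LIMIT 1.
        rows = self.history(user_id)
        return rows[0] if rows else None
```

Three inserts with increasing timestamps for userId '1' leave three versions in history('1'), and latest('1') returns the most recent one.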

Related

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as this is what I am having trouble understanding):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
You retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'; Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB, otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine if you don't have too many documents within a day (so your partitions don't grow too much). It allows you to write queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need a range query across multiple months, you will need to issue multiple queries.
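A range that spans several months can be split into one query per (year, month) partition. The sketch below only builds the statements and their parameters; the function name month_range_queries is made up for illustration, and the %s placeholders follow the style used by the DataStax Python driver for simple statements, although nothing here connects to a cluster.

```python
from datetime import date


def month_range_queries(start, end):
    """Return one (cql, params) pair per (year, month) partition from start to end, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    cql = (
        "SELECT * FROM documents_by_date "
        "WHERE year = %s AND month = %s AND day >= %s AND day <= %s"
    )
    queries = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Only the first and last partitions need a tighter day bound.
        first_day = start.day if (year, month) == (start.year, start.month) else 1
        last_day = end.day if (year, month) == (end.year, end.month) else 31
        queries.append((cql, (year, month, first_day, last_day)))
        # Step to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return queries
```

For example, the range 2018-11-20 to 2019-01-05 yields three queries, one each for November, December, and January. The results would then be concatenated on the client side.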
If your partition is too large because of the data field, you will need to remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add the hour to the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use Stratio's Lucene index plugin for Cassandra and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it is a document, I am assuming that one user will create one document at a given "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) lets you get the data ordered by created for a given user.
For your first requirement you can query as shown below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query as shown below.
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we instead create a separate table with the date as primary key, it becomes tricky when many documents are inserted in the very same second. Clustering order works best for requirements where the documents for a given user need to be sorted by time.

How is denormalization handled in cassandra

What is the best approach to update table with duplicate data?
I have a table
CREATE TABLE users (
id text PRIMARY KEY,
email text,
description text,
salary int
);
I will delete, update, insert etc. on this table. But I also have a requirement to be able to search by email and description. If I create a new table with new composite keys for email and description,
then when I update my base table I do
insert into users (id, salary) values ('1', 500);
I do not have the required data to also update my secondary table, since all the client has is id and salary. How is the second table updated?
Other workarounds and shortcomings
I could have created a materialized view, but since the base table has a single-column primary key, I can only add one more column to the view's key. My search requirement needs more than one column.
Create secondary indexes on the columns that will be searched on. But the performance of this would be bad, since the columns I will be searching on have high cardinality, i.e. description, email, etc.
So, the "correct" way of doing this is to create 3 tables. salary_by_id, salary_by_email and salary_by_description.
CREATE TABLE salary_by_id (
id text PRIMARY KEY,
salary int
);
CREATE TABLE salary_by_email (
email text PRIMARY KEY,
salary int
);
CREATE TABLE salary_by_description (
description text,
id text,
salary int,
PRIMARY KEY (description, id)
);
The reason I added id to salary_by_description is that, as far as I can guess, description won't be globally unique, so it needs something else in its primary key.
Depending on the size of these tables, the last one might need something extra added to its partition key. And if needed you can add id, email and description to the other tables.
Now, when inserting or deleting values you need to do it in all 3 tables. If you use a driver that supports asynchronous calls, like the Java one, then this doesn't cost very much extra.
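As a rough sketch of what "do it in all 3 tables" means on the application side, the function below builds the three inserts for one logical write. The name salary_writes is made up for illustration, the %s placeholders follow the DataStax Python driver style, and it assumes the caller already has email and description at hand (for example from an earlier read), which is exactly the gap the question raises.

```python
def salary_writes(user_id, email, description, salary):
    """Return one (cql, params) pair per lookup table for a single logical write."""
    return [
        (
            "INSERT INTO salary_by_id (id, salary) VALUES (%s, %s)",
            (user_id, salary),
        ),
        (
            "INSERT INTO salary_by_email (email, salary) VALUES (%s, %s)",
            (email, salary),
        ),
        (
            "INSERT INTO salary_by_description (description, id, salary) "
            "VALUES (%s, %s, %s)",
            (description, user_id, salary),
        ),
    ]
```

Each pair could then be handed to the driver's asynchronous execute call so the three writes run concurrently instead of one after another.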

Using Cassandra for time series data

I'm researching how to store logs in Cassandra.
The schema for logs would be something like this.
EDIT: I've changed the schema in order to make some clarification.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
reason text,
item text,
price int,
count int,
PRIMARY KEY ((userid), time) - #1
PRIMARY KEY ((userid), time, reason, item, price, count) - #2
);
A new table will be created every day.
So a table contains logs for only one day.
My querying condition is as follows.
Query all logs from a specific user on a specific day (date, not time).
So the reason, item, price, count will not be used as hints or conditions for queries at all.
My Question is which PRIMARY KEY design suits better.
EDIT: And the key here is that I want to store the logs in a schematic way.
If I choose #1, many columns will be created per log, and the chance of a log having even more values is very high. The schema above is just an example; the log can contain values like subreason, friendid and so on.
If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any useful information about the overhead of composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series, and creating a table per day doesn't seem optimal either.
Instead I'd recommend creating a single table partitioned by userid and day, with a timeuuid as the clustering column for the event. An example of this would look like:
CREATE TABLE log_per_day (
userid bigint,
date text,
time timeuuid,
value text,
PRIMARY KEY ((userid, date), time)
);
This will allow you to have all events of a day in a single partition (a "wide row") and to run your query per day per user.
Declaring time as the clustering column gives you a wide row into which you can insert as many events as you need in a day.
So the row key is a composite of the userid plus the date as text, e.g.
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID1,'my value');
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID2,'my value2');
The two inserts above will land in the same row, and therefore you will be able to read them in a single query.
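On the application side, the only extra work is deriving the date part of the partition key from the event time. A minimal sketch follows, assuming timezone-aware datetimes and UTC day boundaries; both are assumptions, and the function name log_partition_key is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def log_partition_key(userid, event_time):
    """Return the (userid, date) partition key for log_per_day, with date as 'YYYY-MM-DD' in UTC."""
    if event_time.tzinfo is None:
        # Refuse naive datetimes so the day bucket never depends on the host's local zone.
        raise ValueError("event_time must be timezone-aware")
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (userid, day)
```

Every event of the same user on the same UTC day maps to the same key, so all of them end up in one partition and can be read back with a single query on userid and date.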
Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.
Hope it helps,
José Luis

How to save data in cassandra conditionally only if properties did not change

We have data model of article with lot of properties. Here is our table model:
CREATE TABLE articles (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, gtin)
) WITH COMMENT='Articles';
Here gtin uniquely identifies an article, and we save all articles of an organization in one row. We have a constraint to update each article only if something has changed. This is important because when an article changes we update the last_updated field, and external devices use it to know which articles to synchronize, since they know when they last synchronized.
We added one more table for that:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
This way we can easily return all articles updated after a certain point in time. This table must be kept free of duplicates per gtin: we import articles each day and the sync is done from mobile devices, so we want to keep the dataset small. (In theory we could save everything in that table and overwrite with the latest info, but that created large datasets between syncs, so we started deleting from that table, and to delete we needed to know last_updated from the first table.)
Problems we are facing right now are:
In order to check whether article fields have changed we need to do a read before write (we partially solved that with the content_hash field, which is a hash over all fields, so we read it and compare the hash of the incoming article with the value in the DB).
We are deleting and inserting in the second table since we need unique gtins there (we need only the latest change to send to devices, not duplicate articles), which produces an awful lot of tombstones.
We have an upcoming feature to search by many different combinations of fields.
Questions:
Is Cassandra a good choice for this kind of data, or should we move it to some other storage (or even use Elasticsearch and Cassandra in combination, where we index changes over time and Cassandra holds only the master data per gtin)?
Can the data be modeled better for our use case to avoid the read before write or the deletes in the second table?
Update
Just to clarify the use case: other devices sync with pagination (sending last_sync_date, skip and count), so we need a table with all article information, sorted by last_updated, without duplicates, and searchable by last_updated.
If you are using Cassandra 2.1.1 or later, you can use the "not equal" comparison in the IF part of the UPDATE statement (see the CASSANDRA-6839 JIRA issue) to make sure you update data only if something has changed. Your statement would look something like this:
UPDATE articles
SET
barcodes = <barcodes>,
... = <...>,
last_updated = <last_updated>
WHERE
organization_id = <organization_id>
AND gtin = <gtin>
IF content_hash != <content_hash>;
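The value bound to content_hash has to be computed the same way on every import. The question only says it is a hash over all fields stored as a uuid, so the sketch below is one possible scheme and not the poster's actual one: it serializes the fields as canonical JSON, takes SHA-256, and keeps the first 16 bytes as a UUID. The names content_hash, needs_update and EXCLUDED are made up for illustration.

```python
import hashlib
import json
import uuid

# Columns that change on every write and so must not feed the hash (assumption).
EXCLUDED = ("last_updated", "content_hash")


def content_hash(article):
    """Derive a stable UUID from an article's fields, independent of dict key order."""
    fields = {k: v for k, v in article.items() if k not in EXCLUDED}
    # sort_keys and fixed separators give a canonical byte string for equal content.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # A CQL uuid holds 16 bytes, so keep only the first 16 bytes of the digest.
    return uuid.UUID(bytes=digest[:16])


def needs_update(incoming, stored_hash):
    """Client-side equivalent of the IF content_hash != ? condition."""
    return content_hash(incoming) != stored_hash
```

With the conditional UPDATE the comparison happens inside Cassandra, so needs_update is only useful as a cheap pre-check or in tests.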
For your second table, you don't need to duplicate all of the data from the first table. You can do the following instead.
Create your table like this:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
last_updated timeuuid,
gtin text,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
Once you've updated the first table, you can read the last_updated value for that gtin again. If it is equal to or greater than the last_updated value you passed in, then you know that the update was successful (by your process or another one), so you can go ahead and insert that retrieved last_updated value into the second table. You don't need to delete any records for this update. I assume you can build a distinct list of updated gtins on the application side if you poll (using a range query) on a regular basis, which I assume pulls a reasonable amount of data. You can set a TTL on these new records so that they expire after a few poll cycles, which removes the need for manual deletes. Then, once you have found the affected gtins, you run a second query that pulls all of their data from the first table. You can then run a second sanity check on the updated dates to avoid sending anything that is supposed to be sent on the next update (if that is necessary, of course).
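Building the distinct gtin list on the application side can be as small as the sketch below. It assumes each polled row is a (last_updated, gtin) pair and that last_updated values are comparable; the test data would use plain integers as stand-ins for timeuuids. The function name latest_gtins is made up for illustration.

```python
def latest_gtins(rows):
    """Collapse polled (last_updated, gtin) rows to one entry per gtin, keeping the newest."""
    latest = {}
    for last_updated, gtin in rows:
        # Keep only the most recent change seen for each gtin.
        if gtin not in latest or last_updated > latest[gtin]:
            latest[gtin] = last_updated
    # Return (gtin, last_updated) pairs, oldest change first, ready for pagination.
    return sorted(latest.items(), key=lambda item: item[1])
```

The resulting gtins are then used to fetch the full article data from the first table.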
HTH.

Primary Key related CQL3 Queries cases & errors when sorting

I have two issues while querying Cassandra:
Query 1
> select * from a where author='Amresh' order by tweet_id DESC;
Order by with 2ndary indexes is not supported
What I learned: secondary indexes are made to be used only with a WHERE clause and not ORDER BY? If so, then how can I sort?
Query 2
> select * from a where user_id='xamry' ORDER BY tweet_device DESC;
Order by currently only supports the ordering of columns following their
declared order in the PRIMARY KEY.
What I learned: The ORDER BY column should be in the 2nd place in the primary key, maybe? If so, then what if I need to sort by multiple columns?
Table:
CREATE TABLE a(
user_id varchar,
tweet_id varchar,
tweet_device varchar,
author varchar,
body varchar,
PRIMARY KEY(user_id,tweet_id,tweet_device)
);
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't1', 'web', 'Amresh', 'Here is my first tweet');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't2', 'sms', 'Saurabh', 'Howz life Xamry');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't1', 'iPad', 'Kuldeep', 'You der?');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't2', 'mobile', 'Vivek', 'Yep, I suppose');
Create index user_index on a(author);
To answer your questions, let's focus on your choice of primary key for this table:
PRIMARY KEY(user_id,tweet_id,tweet_device)
As written, the user_id will be used as the partition key, which distributes your data around the cluster but also keeps all of the data for the same user ID on the same node. Within a single partition, unique rows are identified by the pair (tweet_id, tweet_device) and those rows will be automatically ordered by tweet_id because it is the second column listed in the primary key. (Or put another way, the first column in the PK that is not a part of the partition key determines the sort order of the partition.)
Query 1
The WHERE clause is author='Amresh'. Note that this clause does not involve any of the columns listed in the primary key; instead, it is filtering using a secondary index on author. Since the WHERE clause does not specify an exact value for the partition key column (user_id) using the index involves scanning all cluster nodes for possible matches. Results cannot be sorted when they come from more than one replica (node) because that would require holding the entire result set on the coordinator node before it could return any results to the client. The coordinator can't know what is the real "first" result row until it has confirmed that it has received and sorted every possible matching row.
If you need the information for a specific author name, separate from user ID, and sorted by tweet ID, then consider storing the data again in a different table. The data design philosophy with Cassandra is to store the data in the format you need when reading it and to actually denormalize (store redundant information) as necessary. This is because in Cassandra, writes are cheap (though it places the burden of managing multiple copies of the same logical data on the application developer).
Query 2
Here, the WHERE clause is user_id = 'xamry', which happens to be the partition key for this table. The good news is that this will go directly to the replica(s) holding this partition and not bother asking the other nodes. However, you cannot ORDER BY tweet_device because of what I explained at the top of this answer. Cassandra stores rows (within a single partition) sorted by the clustering columns in their declared order, so the sort is driven first by the second column in the primary key. In your case, you can access data for user_id = 'xamry' ORDER BY tweet_id but not ordered by tweet_device. The answer, if you really need the data sorted by device, is the same as for Query 1: store it in a table where that is the second column in the primary key.
If, when looking up the tweets by user_id you only ever need them sorted by device, simply flip the order of the last two columns in your primary key. If you need to be able to sort either way, store the data twice in two different tables.
The Cassandra storage engine does not offer multi-column sorting other than the order of columns listed in your primary key.
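The ordering rule can be illustrated with a small plain-Python model. This is not Cassandra code: the function partition_order is made up for illustration and simply sorts one partition's rows by the tuple of clustering columns, which is the order in which they would be stored.

```python
def partition_order(rows, clustering_columns):
    """Sort one partition's rows by the clustering columns in their declared order."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in clustering_columns))


# The two rows stored for user_id = 'xamry' in the question.
xamry_rows = [
    {"tweet_id": "t2", "tweet_device": "sms", "author": "Saurabh"},
    {"tweet_id": "t1", "tweet_device": "web", "author": "Amresh"},
]

# PRIMARY KEY(user_id, tweet_id, tweet_device): ordered by tweet_id first.
by_tweet = partition_order(xamry_rows, ("tweet_id", "tweet_device"))

# PRIMARY KEY(user_id, tweet_device, tweet_id): ordered by tweet_device first.
by_device = partition_order(xamry_rows, ("tweet_device", "tweet_id"))
```

With the original key the rows come back as t1 then t2; with the last two columns flipped they come back as sms then web. Only one of the two orders can be the stored order of a given table, which is why sorting both ways means storing the data twice.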
