Using INSERT with timestamp in Cassandra

I am trying to INSERT (also UPDATE and DELETE) data in Cassandra using a timestamp, but no change occurs in the table. Any help, please?
BEGIN BATCH
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('1',null,null,null) USING TIMESTAMP 0;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('2',null,null,null) USING TIMESTAMP 1;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('3',null,null,null) USING TIMESTAMP 2;
APPLY BATCH;

I think you're falling foul of Cassandra's "control of timestamps": operations in C* are (in effect¹) applied only if the timestamp of the new operation is higher than that of the previous one.
Let's see an example. Given the following insert
INSERT INTO test (key, value ) VALUES ( 'mykey', 'somevalue') USING TIMESTAMP 1000;
You expect this as output:
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
And it will be, unless someone performed an operation on this data with a higher timestamp before you. For instance, if you now write
INSERT INTO test (key, value ) VALUES ( 'mykey', '999value') USING TIMESTAMP 999;
Here's the output
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
As you can see, neither the value nor the timestamp has been updated.
¹ That's a slight simplification. Unless you are doing a specialised compare-and-set write, Cassandra doesn't read anything from the table before it writes, so it doesn't know whether there is existing data or what its timestamp is. You therefore end up with two versions of the row, with different timestamps, but when you read the row back you always get the one with the latest timestamp. Normally Cassandra compacts such duplicate rows after a while, which is when the row with the older timestamp gets discarded.
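Coming back to the batch in the question: the literal timestamps 0, 1 and 2 are almost certainly lower than the write timestamps already recorded for those rows, so the statements are silently shadowed by the existing data. A minimal sketch of a fix, reusing the transaction_test.users table from the question, is to either drop USING TIMESTAMP (so the coordinator assigns the current time) or supply a current epoch value in microseconds; the values below are placeholders for "now", not real measurements:
BEGIN BATCH
-- 1700000000000000 stands in for the current time in microseconds since the epoch
INSERT INTO transaction_test.users (email, age, firstname, lastname) VALUES ('1', null, null, null) USING TIMESTAMP 1700000000000000;
INSERT INTO transaction_test.users (email, age, firstname, lastname) VALUES ('2', null, null, null) USING TIMESTAMP 1700000000000001;
INSERT INTO transaction_test.users (email, age, firstname, lastname) VALUES ('3', null, null, null) USING TIMESTAMP 1700000000000002;
APPLY BATCH;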

Related

DataStax/Cassandra USING TIMESTAMP behavior is unpredictable when a new timestamp value equals the previous one

This behavior in Cassandra seems undocumented and counterintuitive. I want to know why this happens and how to prevent it.
Create a test table.
CREATE TABLE test_table (id text PRIMARY KEY, foo text);
Now create a row in the table with USING TIMESTAMP.
INSERT INTO test_table (id, foo)
VALUES ('first', 'hello')
USING TIMESTAMP 1566912993048082;
The result is
id | foo | writetime(foo)
-------+-------+------------------
first | hello | 1566912993048082
Now let's update the row using the same timestamp.
INSERT INTO test_table (id, foo)
VALUES ('first', 'hello2')
USING TIMESTAMP 1566912993048082;
Everything works fine.
id | foo | writetime(foo)
-------+--------+------------------
first | hello2 | 1566912993048082
Let's update the row again using the same timestamp.
INSERT INTO test_table (id, foo)
VALUES ('first', 'hello1')
USING TIMESTAMP 1566912993048082;
!!! Nothing changed.
id | foo | writetime(foo)
-------+--------+------------------
first | hello2 | 1566912993048082
Update the same row again.
INSERT INTO test_table (id, foo)
VALUES ('first', 'hello3')
USING TIMESTAMP 1566912993048082;
!!! Works again.
id | foo | writetime(foo)
-------+--------+------------------
first | hello3 | 1566912993048082
It seems like an update happens only in cases when old.foo < new.foo using the same timestamp.
Expected result (either one would do):
the update never happens when using the same timestamp, or
the update always happens when using the same timestamp
Actual result:
the update sometimes happens when using the same timestamp
FYI,
I opened a ticket to get an answer to your question. Here is the response, for others who may try this. Again, in a typical situation one wouldn't do what you're doing.
---- Response ----
As you are aware, DSE/Cassandra resolves conflicts via the write timestamp, where the latest always wins. In the event of a tie, as detailed in your thought experiment, there are actually two scenarios that need to be handled.
Live cell colliding with tombstone
In this situation the tombstone will always win. There is no way to know if that is what the client expects, but the behavior will be consistent.
Live cell colliding with another live cell
Similar to the tombstone situation, we have no way of knowing which cell should be returned. In order to provide consistency, when the write timestamps are the same, the larger value wins.
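Given those rules, the practical way to avoid the surprise is simply never to reuse a write timestamp for the same cell: either let the server assign the timestamp, or generate strictly increasing microsecond values on the client. A small sketch against the same test_table (the explicit value is just one microsecond later than the one used in the question):
-- server-assigned write timestamp (current time in microseconds)
INSERT INTO test_table (id, foo) VALUES ('first', 'hello4');
-- or a client-supplied, strictly increasing timestamp
INSERT INTO test_table (id, foo) VALUES ('first', 'hello5') USING TIMESTAMP 1566912993048083;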

How to store only the most recent entry in Cassandra?

I have a Cassandra table like this:
create table test(imei text,dt_time timestamp, primary key(imei, dt_time)) WITH CLUSTERING ORDER BY (dt_time DESC);
Partition Key is: imei
Clustering Key is: dt_time
Now I want to store only the most recent entry (by time) in this table for each partition key.
Let's say I am inserting entries into a table where there should be a single entry for each imei.
Now let's say for imei 98838377272 the stored dt_time is 2017-12-23 16:20:12. If, for the same imei, a dt_time such as 2017-12-23 15:20:00 comes in,
then that entry should not be inserted into the Cassandra table.
But if a time such as 2017-12-23 17:20:00 comes in, it should be inserted and the previous row should be replaced with this dt_time.
You can use the USING TIMESTAMP clause in your insert statement to mark data as most recent:
Marks inserted data (write time) with TIMESTAMP. Enter the time since epoch (January 1, 1970) in microseconds. By default, Cassandra uses the actual time of write.
Remove dt_time from the primary key to store only one entry per imei, and then:
1. Insert data and specify the write timestamp corresponding to 2017-12-23 16:20:12
2. Insert data and specify the write timestamp corresponding to 2017-12-23 15:20:00
In this case, a select by imei will return the record with the most recent timestamp (from point 1).
Please note, this approach will work if your dt_time (which will be specified as the write timestamp) is less than the current time. In other words, the select query will return records with the most recent timestamp but before the current time. If you insert data with a timestamp greater than the current time you will not see this data until that timestamp comes.
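A minimal sketch of that approach, assuming the table is redefined with only imei as the primary key; the USING TIMESTAMP values are the two event times expressed as microseconds since the epoch (treating them as UTC):
CREATE TABLE test (imei text PRIMARY KEY, dt_time timestamp);
-- 2017-12-23 16:20:12 UTC as a write timestamp in microseconds
INSERT INTO test (imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12') USING TIMESTAMP 1514046012000000;
-- the older event carries a lower write timestamp, so it cannot overwrite the newer one
INSERT INTO test (imei, dt_time) VALUES ('98838377272', '2017-12-23 15:20:00') USING TIMESTAMP 1514042400000000;
SELECT imei, dt_time FROM test WHERE imei = '98838377272';  -- returns the 16:20:12 row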
First, to store only the last entry in the table, you need to remove dt_time from the primary key - otherwise you'll get an entry inserted into the DB for every timestamp.
Cassandra supports so-called lightweight transactions, which allow you to check the data before writing.
So if you want to update the entry only if the existing dt_time is less than the new time, you can use something like the following.
First, insert data:
> insert into test(imei, dt_time) values('98838377272', '2017-12-23 15:20:12');
Then try to update the data with the same time (or a smaller one):
> update test SET dt_time = '2017-12-23 15:20:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 15:20:12';
[applied] | dt_time
-----------+---------------------------------
False | 2017-12-23 15:20:12.000000+0000
This fails, as shown by [applied] being False. But I can update it with a greater timestamp, and it will be applied:
> update test SET dt_time = '2017-12-23 16:21:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 16:21:12';
[applied]
-----------
True
There are several problems with this:
It will not work if the entry doesn't exist yet - in this case you may try to use INSERT ... IF NOT EXISTS before trying to update, or pre-populate the database with imei numbers (see the sketch after this list)
Lightweight transactions impose overhead on the cluster, as the data has to be read before writing; this can put significant load on the servers and decrease throughput.
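A minimal sketch of that bootstrap step, under the same single-row-per-imei schema as above (the timestamps are illustrative):
-- create the row only if this imei has never been seen before
INSERT INTO test (imei, dt_time) VALUES ('98838377272', '2017-12-23 15:20:12') IF NOT EXISTS;
-- from then on, only ever move dt_time forward
UPDATE test SET dt_time = '2017-12-23 17:20:00' WHERE imei = '98838377272' IF dt_time < '2017-12-23 17:20:00';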
Actually you cannot "update" a clustering key since it's part of the primary key, so you should remove the clustering key on dt_time.
Then you can update the row using a lightweight transaction which checks that the new value is after the existing value.
cqlsh:test> CREATE TABLE test1 (imei text, dt_time timestamp, PRIMARY KEY (imei));
cqlsh:test> INSERT INTO test1 (imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12');
cqlsh:test> SELECT * FROM test1;
imei | dt_time
-------------+---------------------------------
98838377272 | 2017-12-23 08:20:12.000000+0000
(1 rows)
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 15:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 15:20:00';
[applied] | dt_time
-----------+---------------------------------
False | 2017-12-23 08:20:12.000000+0000
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 17:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 17:20:00';
[applied]
-----------
True
The update for '15:20:00' returns 'False' and tells you the current value.
The update for '17:20:00' returns 'True'.
Reference: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertLWT.html

Failed to update static column in Cassandra

I have a strange problem with a Cassandra (version 2.2.3) database and static columns, while writing a proof of concept for a simple application with send-money functionality.
My table is:
CREATE TABLE transactions (
profile text,
timestamp timestamp,
amount text,
balance text,
lock int static,
PRIMARY KEY (profile, timestamp)) WITH CLUSTERING ORDER BY (timestamp ASC);
As a first step I add a new record:
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD');
Then I want to 'lock' the current user's transactions to do some action with their balance. I try to execute this request:
UPDATE transactions SET lock = 1 WHERE profile = 'test_profile' IF lock = null;
But as a result in cqlsh I see
[applied]
-----------
False
I don't understand why it's 'False', because the current data for the profile is:
profile | timestamp | lock | amount | balance
--------------+--------------------------+------+--------+---------
test_profile | 2015-11-05 15:20:01+0000 | null | 10USD | null
Any idea what I'm doing wrong?
UPDATE
After reading Nenad Bozic's answer I modified my example to clarify why I need the condition in the update. Full code sample:
CREATE TABLE transactions (
profile text,
timestamp timestamp,
amount text,
balance text,
lock int static,
balances map<text,text> static,
PRIMARY KEY (profile, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC);
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '1USD');
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
BEGIN BATCH
UPDATE transactions SET balances={'USD':'1USD'} WHERE profile='test_profile';
UPDATE transactions SET balance='1USD' WHERE profile='test_profile' AND timestamp='2015-11-05 15:20:01+0000';
DELETE lock FROM transactions WHERE profile='test_profile';
APPLY BATCH;
And if I try to get the lock again I get
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
[applied] | profile | timestamp | balances | lock | amount | balance
-----------+--------------+-----------+-----------------+------+--------+---------
False | test_profile | null | {'USD': '1USD'} | null | null | null
When you INSERT, you do not insert the lock field, which means the field does not exist. The null representation in cqlsh or DevCenter is only syntactic sugar to make results look like tabular data; in reality a row is a dynamic map of key-value pairs, and lock is simply not present in that map. It is useful to look at the Thrift representation of the data (even though Thrift is not used anymore) to get a sense of how it is stored on disk.
So when the UPDATE is fired, it expects the column to be present in order to update it. In your case the lock column is not even present, so it cannot update it. This thread on the difference between INSERT and UPDATE is also a good read.
You have two solutions to make this work:
Insert null explicitly
You can add lock to your insert statement and explicitly set it to null (which in Cassandra is different from excluding it from the insert: this way the column gets a null value, whereas when you exclude it the column does not exist at all):
INSERT INTO transactions (profile, timestamp, amount, lock)
VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD', null);
Use INSERT for the second statement
Since in the second statement you are inserting lock for the first time instead of updating an existing value, and since it is a static column for that partition, you can use INSERT ... IF NOT EXISTS instead of the UPDATE ... IF way of doing LWT (lock does not exist yet, so this will pass the first time and fail all subsequent times, since lock will then have a value):
INSERT INTO transactions (profile, lock)
VALUES ('test_profile', 1) IF NOT EXISTS;

Cassandra: Data Modelling

I currently have a table in Cassandra called macrecord which looks something like this:
macadd | position | record | timestamp
-------------------+----------+--------+---------------------
23:FD:52:34:DS:32 | 1 | 1 | 2015-09-28 15:28:59
However I now need to make queries which will use the timestamp column to query for a range. I don't think it is possible to do so without timestamp being part of the primary key (which is currently just macadd), i.e. without it being a clustering key.
If I make timestamp part of the primary key the table looks like below:
macadd | timestamp | position | record
-------------------+---------------------+----------+--------
23:FD:52:34:DS:32 | 2015-09-28 15:33:26 | 1 | 1
However, now I cannot update the timestamp column whenever I get a duplicate macadd.
update macrecord set timestamp = dateof(now()) where macadd = '23:FD:52:34:DS:32';
gives an error:
message="PRIMARY KEY part timestamp found in SET part"
I cannot think of another solution in this case other than deleting the whole row when there is a duplicate macadd and then inserting a new row with an updated timestamp.
Is there a better way to update the timestamp whenever there is a duplicate macadd, or an alternative way to query for a range of timestamp values in my original table where only macadd is the primary key?
To do a range query in CQL, you'll need to have timestamp as a clustering key. But as you have seen, you can't update key fields without doing a delete and insert of the new key.
One option that will become available in Cassandra 3.0 when it is released in October is materialized views. That would allow you to have timestamp as a value column in the base table and as a clustering column in the view. See an example here.
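A minimal sketch of how that might look on Cassandra 3.0+. The day column is a hypothetical bucket added so the view has partitions worth range-scanning; it is not part of the original answer, any low-cardinality grouping column would do, and the column types are guesses:
-- base table: timestamp is a plain column, so it can be updated in place
CREATE TABLE macrecord (
    day text,            -- hypothetical bucketing column, e.g. '2015-09-28'
    macadd text,
    position int,
    record int,
    timestamp timestamp,
    PRIMARY KEY (day, macadd));
-- the view keeps the same data but clusters it by timestamp for range queries
CREATE MATERIALIZED VIEW macrecord_by_time AS
    SELECT day, timestamp, macadd, position, record
    FROM macrecord
    WHERE day IS NOT NULL AND macadd IS NOT NULL AND timestamp IS NOT NULL
    PRIMARY KEY (day, timestamp, macadd);
-- updating the timestamp for a duplicate macadd now works against the base table...
UPDATE macrecord SET timestamp = dateof(now()) WHERE day = '2015-09-28' AND macadd = '23:FD:52:34:DS:32';
-- ...and the range query runs against the view
SELECT macadd, timestamp, position, record FROM macrecord_by_time
WHERE day = '2015-09-28' AND timestamp >= '2015-09-28 15:00:00';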

Cassandra - Overlapping Data Ranges

I have the following 'Tasks' table in Cassandra.
Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get all the tasks that are applicable within that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here, because to add a range query on Ends_On I would have to have an equality check on Starts_On.
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid and eventtime:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
If you need to query by whether an event is starting or ending (I did not), you will want to order eventType ahead of eventTime in the primary key (a sketch of that ordering follows these notes).
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
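For the first note, a minimal sketch of that key ordering (same illustrative columns; the table name is hypothetical):
-- putting eventType before eventTime lets you pin the type and then range over time
CREATE TABLE userEventsByType (
    userid UUID,
    eventType TEXT,
    eventTime TIMEUUID,
    eventDesc TEXT,
    PRIMARY KEY ((userid), eventType, eventTime));
SELECT userid, eventtype, dateof(eventtime), eventdesc FROM userEventsByType
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtype='Begin'
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');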
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On timestamp in descending order. (Ends_On is just a regular column.) Then I constrain the query with Starts_On < ?, where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.
I then iterate through the results until the row's Ends_On is earlier than the start of the period of interest, and throw away the rest of the result rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small the driver ends up retrieving multiple pages, which could hurt performance. If it is too large you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than is necessary. In practice you can probably find a good compromise.
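A minimal sketch of the schema and query this approach assumes; the bucket column is a hypothetical grouping key, since the tasks being scanned together need to share a partition:
CREATE TABLE tasks_by_start (
    bucket int,
    starts_on timestamp,
    task_id uuid,
    ends_on timestamp,
    PRIMARY KEY (bucket, starts_on, task_id))
WITH CLUSTERING ORDER BY (starts_on DESC, task_id ASC);
-- T2 = end of the period of interest: exclude tasks that start after it
SELECT task_id, starts_on, ends_on FROM tasks_by_start
WHERE bucket = 1 AND starts_on < '2015-09-29 00:00:00';
-- client code then walks the descending results and stops once ends_on < T1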

Resources