How to store only most recent entry in Cassandra?

I have a Cassandra table like:
create table test(imei text,dt_time timestamp, primary key(imei, dt_time)) WITH CLUSTERING ORDER BY (dt_time DESC);
Partition Key is: imei
Clustering Key is: dt_time
Now I want to store only the most recent entry in this table (on a time basis) for each partition key.
Let's say I am inserting entries into a table where there should be a single entry for each imei.
Now let's say for imei 98838377272 the dt_time is 2017-12-23 16:20:12. If for the same imei a dt_time like 2017-12-23 15:20:00 comes in,
then that entry should not be inserted into the Cassandra table.
But if a time like 2017-12-23 17:20:00 comes, it should be inserted and the previous row should be replaced with this dt_time.

You can use the USING TIMESTAMP clause in your insert statement to mark data as most recent:
Marks inserted data (write time) with TIMESTAMP. Enter the time since epoch (January 1, 1970) in microseconds. By default, Cassandra uses the actual time of write.
Remove dt_time from the primary key to store only one entry per imei, and then:
Insert data and specify timestamp as 2017-12-23 16:20:12
Insert data and specify timestamp as 2017-12-23 15:20:00
In this case, a select by imei will return the record with the most recent timestamp (from point 1).
Please note, this approach will work if your dt_time (which is specified as the write timestamp) is less than the current time. In other words, the select query will return records with the most recent timestamp, but before the current time. If you insert data with a timestamp greater than the current time, you will not see this data until that timestamp comes.
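A minimal sketch of this approach (test2 is an illustrative table name, not from the answer; USING TIMESTAMP takes microseconds since epoch, so 2017-12-23 16:20:12 UTC is 1514046012000000):
-- dt_time removed from the primary key: one row per imei
CREATE TABLE test2(imei text PRIMARY KEY, dt_time timestamp);
-- 2017-12-23 16:20:12 UTC = 1514046012000000 microseconds since epoch
INSERT INTO test2(imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12') USING TIMESTAMP 1514046012000000;
-- an "older" write: its lower write timestamp means it loses on read
INSERT INTO test2(imei, dt_time) VALUES ('98838377272', '2017-12-23 15:20:00') USING TIMESTAMP 1514042400000000;
-- returns the 16:20:12 row, because its write timestamp is higher
SELECT imei, dt_time, writetime(dt_time) FROM test2 WHERE imei = '98838377272';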

First, to store only the last entry in the table, you need to remove dt_time from the primary key - otherwise you'll get an entry inserted into the DB for every timestamp.
Cassandra supports so-called lightweight transactions that allow you to check the data before writing.
So if you want to update the entry only if dt_time is less than the new time, then you can use something like:
First insert the data:
> insert into test(imei, dt_time) values('98838377272', '2017-12-23 15:20:12');
then try to update the data with the same time (or a smaller one):
> update test SET dt_time = '2017-12-23 15:20:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 15:20:12';
 [applied] | dt_time
-----------+---------------------------------
     False | 2017-12-23 15:20:12.000000+0000
This fails, as seen from [applied] equal to False. If I update with a greater timestamp instead, it succeeds:
> update test SET dt_time = '2017-12-23 16:21:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 16:21:12';
[applied]
-----------
True
There are several problems with this:
It will not work if the entry doesn't exist yet - in this case you may try to use INSERT ... IF NOT EXISTS before trying to update, or pre-populate the database with imei numbers (see the sketch after this list)
Lightweight transactions impose overhead on the cluster, as the data has to be read before writing; this can put significant load on the servers and decrease throughput.
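Putting the two statements together, a minimal sketch (assuming dt_time has already been removed from the primary key, as described above):
-- create the row only if this imei is not present yet
INSERT INTO test(imei, dt_time) VALUES ('98838377272', '2017-12-23 15:20:12') IF NOT EXISTS;
-- if [applied] came back False, fall back to the conditional update
UPDATE test SET dt_time = '2017-12-23 16:21:12'
WHERE imei = '98838377272'
IF dt_time < '2017-12-23 16:21:12';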

Actually you cannot "update" a clustering key since it's part of the primary key, so you should remove the clustering key on dt_time.
Then you can update the row using a lightweight transaction which checks whether the new value is after the existing value.
cqlsh:test> CREATE TABLE test1(imei text, dt_time timestamp, PRIMARY KEY (imei));
cqlsh:test> INSERT INTO test1 (imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12');
cqlsh:test> SELECT * FROM test1;
imei | dt_time
-------------+---------------------------------
98838377272 | 2017-12-23 08:20:12.000000+0000
(1 rows)
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 15:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 15:20:00';
 [applied] | dt_time
-----------+---------------------------------
     False | 2017-12-23 08:20:12.000000+0000
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 17:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 17:20:00';
[applied]
-----------
True
The update for '15:20:00' will return 'False' and tell you the current value.
The update for '17:20:00' will return 'True'.
Reference: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertLWT.html

Related

Storing time specific data in Cassandra

I am looking for a good way to store time specific data in Cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows:
Find all rows with start_time<=current_time.
Then find the value with maximum start_time from the rows obtained in the first step.
PS: Edited the question to make it clearer.
The exact requirement is not possible to satisfy, but we can get close to it with one more column.
First, to be able to use the <= operator, your start_time column needs to be the clustering key of your table.
Then, you need a different partition key. You could choose a fixed value, but that could cause problems once the partition has too many rows. It is better to use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
    year bigint,
    start_time timestamp,
    value text,
    PRIMARY KEY ((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you need to know the value of the partition key:
Find all rows with start_time<=current_time
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
select the value with maximum start_time (thanks to the DESC clustering order, both conditions combine into a single query)
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time LIMIT 1;
Create two separate tables like below:
CREATE TABLE data (
    start_time timestamp,
    value int,
    PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
    partition int PRIMARY KEY,
    value int
);
Now you have to insert data into both tables; for the second table, use a fixed partition value like 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
In the current_value table the data is upserted, so a select will always return the latest value.
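A minimal sketch of the dual write (the batch and the sample values are illustrative, not from the answer):
BEGIN BATCH
    INSERT INTO data (start_time, value) VALUES ('2017-12-23 16:20:12', 10);
    INSERT INTO current_value (partition, value) VALUES (1, 10);
APPLY BATCH;
-- always returns the latest upserted value
SELECT value FROM current_value WHERE partition = 1;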

Failed to update static column in cassandra

I have a strange problem with a Cassandra (version 2.2.3) database using static columns, which I hit while writing a proof of concept for a simple application with send-money functionality.
My table is:
CREATE TABLE transactions (
    profile text,
    timestamp timestamp,
    amount text,
    balance text,
    lock int static,
    PRIMARY KEY (profile, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC);
As a first step I add a new record:
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD');
Then I want to 'lock' the current user's transactions to do some action with their balance. I try to execute this request:
UPDATE transactions SET lock = 1 WHERE profile = 'test_profile' IF lock = null;
But as a result in cqlsh I see:
[applied]
-----------
False
I don't understand why it's 'False', because the current data for the profile is:
profile | timestamp | lock | amount | balance
--------------+--------------------------+------+--------+---------
test_profile | 2015-11-05 15:20:01+0000 | null | 10USD | null
Any idea what I am doing wrong?
UPDATE
After reading Nenad Bozic's answer I modified my example to clarify why I need the condition in the update. Full code sample:
CREATE TABLE transactions (
    profile text,
    timestamp timestamp,
    amount text,
    balance text,
    lock int static,
    balances map<text,text> static,
    PRIMARY KEY (profile, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC);
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '1USD');
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
BEGIN BATCH
    UPDATE transactions SET balances={'USD':'1USD'} WHERE profile='test_profile';
    UPDATE transactions SET balance='1USD' WHERE profile='test_profile' AND timestamp='2015-11-05 15:20:01+0000';
    DELETE lock FROM transactions WHERE profile='test_profile';
APPLY BATCH;
And if I try to get the lock again, I get:
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
[applied] | profile | timestamp | balances | lock | amount | balance
-----------+--------------+-----------+-----------------+------+--------+---------
False | test_profile | null | {'USD': '1USD'} | null | null | null
When you INSERT you do not set the lock field, which means the field does not exist. The null representation in cqlsh or DevCenter is only syntactic sugar to make the results look like tabular data; in reality a row holds dynamic key-value pairs, and lock is not present in that map. It is useful to look at the Thrift representation of the data, even though it is not used anymore, to get a sense of how it is stored on disk.
So when the UPDATE is fired, it expects the column to be present in order to update it. In your case the lock column is not even present, so it cannot update it. This thread on the difference between INSERT and UPDATE is also a good read.
You have two solutions to make this work:
Insert null explicitly
You can add lock to your insert statement and set it explicitly to null (which in Cassandra is different from excluding it from the insert: this way the column gets a null value, whereas when you exclude it the column does not exist in the row at all):
INSERT INTO transactions (profile, timestamp, amount, lock)
VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD', null);
Use insert on second statement
Since your second statement is setting lock for the first time rather than updating an existing value, and since it is a static column for that partition, you can use INSERT ... IF NOT EXISTS instead of the UPDATE ... IF LWT way of doing it (lock does not exist yet, so this will pass the first time and fail all subsequent times, since lock will then have a value):
INSERT INTO transactions (profile, lock)
VALUES ('test_profile', 1) IF NOT EXISTS;

Cassandra: Data Modelling

I currently have a table in Cassandra called macrecord which looks something like this:
macadd | position | record | timestamp
-------------------+----------+--------+---------------------
23:FD:52:34:DS:32 | 1 | 1 | 2015-09-28 15:28:59
However, I now need to make queries which use the timestamp column to query for a range. I don't think that is possible without timestamp being part of the primary key (currently just macadd), i.e. without it being a clustering key.
If I make timestamp part of the primary key, the table looks like below:
macadd | timestamp | position | record
-------------------+---------------------+----------+--------
23:FD:52:34:DS:32 | 2015-09-28 15:33:26 | 1 | 1
However, now I cannot update the timestamp column whenever I get a duplicate macadd.
update macrecord set timestamp = dateof(now()) where macadd = '23:FD:52:34:DS:32';
gives an error:
message="PRIMARY KEY part timestamp found in SET part"
I cannot think of any solution other than deleting the whole row when there is a duplicate macadd and then inserting a new row with the updated timestamp.
Is there a better solution to update the timestamp whenever there is a duplicate macadd, or an alternative way to query for a range of timestamp values in my original table where only macadd is the primary key?
To do a range query in CQL, you'll need to have timestamp as a clustering key. But as you have seen, you can't update key fields without doing a delete and insert of the new key.
One option that will become available in Cassandra 3.0 when it is released in October is materialized views. That would allow you to have timestamp as a value column in the base table and as a clustering column in the view. See an example here.
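For illustration, a hedged sketch of what such a view could look like for this table (the view name and column types are assumptions; the syntax follows the Cassandra 3.0 materialized view feature):
-- base table keeps timestamp as a regular column, so it stays updatable
CREATE TABLE macrecord (
    macadd text PRIMARY KEY,
    position int,
    record int,
    timestamp timestamp
);
-- the view re-exposes timestamp as a clustering key, per macadd
CREATE MATERIALIZED VIEW macrecord_by_time AS
    SELECT macadd, timestamp, position, record
    FROM macrecord
    WHERE macadd IS NOT NULL AND timestamp IS NOT NULL
    PRIMARY KEY (macadd, timestamp);
-- range query against the view
SELECT * FROM macrecord_by_time
WHERE macadd = '23:FD:52:34:DS:32' AND timestamp >= '2015-09-28 00:00:00';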

Order latest records by timestamp in Cassandra

I'm trying to display the latest values from a list of sensors. The list should also be sortable by the time-stamp.
I tried two different approaches. I included the update time of the sensor in the primary key:
CREATE TABLE sensors (
    customerid int,
    sensorid int,
    changedate timestamp,
    value text,
    PRIMARY KEY (customerid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Then I can select the list like this:
select * from sensors where customerid=0 order by changedate desc;
which results in this:
customerid | changedate | sensorid | value
------------+--------------------------+----------+-------
0 | 2015-07-10 12:46:53+0000 | 1 | 2
0 | 2015-07-10 12:46:52+0000 | 1 | 1
0 | 2015-07-10 12:46:52+0000 | 0 | 2
0 | 2015-07-10 12:46:26+0000 | 0 | 1
The problem is, I don't get only the latest results, but all the old values too.
If I remove the changedate from the primary key, the select fails altogether.
InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got changedate"
Updating the sensor values is also not an option:
update overview set changedate=unixTimestampOf(now()), value = '5' where customerid=0 and sensorid=0;
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part changedate found in SET part"
This fails because changedate is part of the primary key.
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
Edit:
In the meantime I tried another approach: only storing the latest value.
I used this schema:
CREATE TABLE sensors (
    customerid int,
    sensorid int,
    changedate timestamp,
    value text,
    PRIMARY KEY (customerid, sensorid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Before inserting the latest value, I would delete all old values:
DELETE FROM sensors WHERE customerid=? and sensorid=?;
But this fails because changedate is NOT part of the WHERE clause.
The problem is, I don't get only the latest results, but all the old values too.
Since you are storing with a CLUSTERING ORDER of DESC, it will always be very easy to get the latest records; all you need to do is add LIMIT to your query, i.e.:
select * from sensors where customerid=0 order by changedate desc limit 10;
This would return at most 10 records with the highest changedate. Even though you are using LIMIT, you are still guaranteed to get the latest records, since your data is ordered that way.
If I remove the changedate from the primary key, the select fails all together.
This is because you cannot order on a column that is not a clustering key (the secondary part of the primary key), except maybe with a secondary index, which I would not recommend.
Updating the sensor values is also no option
Your update query is failing because it is not legal to include part of the primary key in SET. To make this work, move changedate into the WHERE clause and set only non-key columns, i.e.:
update overview set value = '5', sensorid = 0 where customerid=0 and changedate=unixTimestampOf(now())
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
You can do this by creating a separate table named 'latest_sensor_data' with the same table definition, except for the primary key. The primary key will now be 'customerid, sensorid', so you can only have one record per sensor. The process of creating separate tables is called denormalization and is a common pattern, particularly in Cassandra data modeling. When you insert sensor data you would now insert into both 'sensors' and 'latest_sensor_data' (see the batch sketch below).
CREATE TABLE latest_sensor_data (
    customerid int,
    sensorid int,
    changedate timestamp,
    value text,
    PRIMARY KEY (customerid, sensorid)
);
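A short sketch of that dual write (values borrowed from the sample data above; this assumes the second 'sensors' schema with PRIMARY KEY (customerid, sensorid, changedate)):
BEGIN BATCH
    INSERT INTO sensors (customerid, sensorid, changedate, value)
        VALUES (0, 1, '2015-07-10 12:46:53+0000', '2');
    INSERT INTO latest_sensor_data (customerid, sensorid, changedate, value)
        VALUES (0, 1, '2015-07-10 12:46:53+0000', '2');
APPLY BATCH;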
In Cassandra 3.0, 'materialized views' will be introduced, which will make this unnecessary, as you can use a materialized view to accomplish this for you.
Now doing the following query:
select * from latest_sensor_data where customerid=0
Will give you the latest value for every sensor for that customer.
I would recommend renaming 'sensors' to 'sensor_data' or 'sensor_history' to make it clearer what the data is. Additionally, you should change its primary key to 'customerid, changedate, sensorid', as that would allow you to have multiple sensors at the same date (which seems possible).
Your first approach looks reasonable. If you add "limit 1" to your query, you would only get the latest result, or limit 2 to see the latest 2 results, etc.
If you want to automatically remove old values from the table, you can specify a TTL (Time To Live) for data points when you do the insert. So if you wanted to keep data points for 10 days, you could do this by adding "USING TTL 864000" on your insert statements. Or you could set a default TTL for the entire table.
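Both variants, sketched (dateof(now()) is used here just for illustration; the table-level default applies only to writes made after it is set):
-- per-insert TTL: this row expires 10 days after the write
INSERT INTO sensors (customerid, sensorid, changedate, value)
VALUES (0, 0, dateof(now()), '5') USING TTL 864000;
-- or a default TTL for the whole table
ALTER TABLE sensors WITH default_time_to_live = 864000;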

Using Insert with timestamp in Cassandra

I am trying to INSERT (also UPDATE and DELETE) data in Cassandra using a timestamp, but no change occurs to the table. Any help please?
BEGIN BATCH
    INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('1',null,null,null) USING TIMESTAMP 0;
    INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('2',null,null,null) USING TIMESTAMP 1;
    INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('3',null,null,null) USING TIMESTAMP 2;
APPLY BATCH;
I think you're falling foul of Cassandra's control of timestamps. Operations in C* are (in effect¹) executed only if the timestamp of the new operation is higher than the previous one's.
Let's see an example. Given the following insert
INSERT INTO test (key, value ) VALUES ( 'mykey', 'somevalue') USING TIMESTAMP 1000;
You expect this as output:
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
And it will be like this unless someone before you performed an operation on this data with a higher timestamp. For instance, if you now write:
INSERT INTO test (key, value ) VALUES ( 'mykey', '999value') USING TIMESTAMP 999;
Here's the output
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
As you can see, neither the value nor the timestamp has been updated.
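Conversely (this continuation is mine, not part of the original example), a write with a higher timestamp does take effect:
INSERT INTO test (key, value ) VALUES ( 'mykey', '1001value') USING TIMESTAMP 1001;
select key,value,writetime(value) from test where key='mykey';
 key   | value     | writetime(value)
-------+-----------+------------------
 mykey | 1001value |             1001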
¹ That's a slight simplification. Unless you are doing a specialised 'compare-and-set' write, Cassandra doesn't read anything from the table before it writes, so it doesn't know if there is existing data or what its timestamp is. You end up with two versions of the row, with different timestamps, but when you read the row back you always get the one with the latest timestamp. Normally Cassandra will compact such duplicate rows after a while, which is when the older-timestamp row gets discarded.
