Failed to update static column in Cassandra

I have a strange problem with Cassandra (version 2.2.3) and static columns, which I hit while writing a proof of concept for a simple application with send-money functionality.
My table is:
CREATE TABLE transactions (
profile text,
timestamp timestamp,
amount text,
balance text,
lock int static,
PRIMARY KEY (profile, timestamp)) WITH CLUSTERING ORDER BY (timestamp ASC);
As a first step I add a new record:
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD');
Then I want to 'lock' the current user's transactions to do some action with their balance. I try to execute this request:
UPDATE transactions SET lock = 1 WHERE profile = 'test_profile' IF lock = null;
But as a result in cqlsh I see:
[applied]
-----------
False
I don't understand why it is 'False', because the current data for the profile is:
profile | timestamp | lock | amount | balance
--------------+--------------------------+------+--------+---------
test_profile | 2015-11-05 15:20:01+0000 | null | 10USD | null
Any idea what I'm doing wrong?
UPDATE
After reading Nenad Bozic's answer I modified my example to clarify why I need a condition in the update. Full code sample:
CREATE TABLE transactions (
profile text,
timestamp timestamp,
amount text,
balance text,
lock int static,
balances map<text,text> static,
PRIMARY KEY (profile, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC);
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '1USD');
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
BEGIN BATCH
UPDATE transactions SET balances={'USD':'1USD'} WHERE profile='test_profile';
UPDATE transactions SET balance='1USD' WHERE profile='test_profile' AND timestamp='2015-11-05 15:20:01+0000';
DELETE lock FROM transactions WHERE profile='test_profile';
APPLY BATCH;
And if I try to take the lock again I get:
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
[applied] | profile | timestamp | balances | lock | amount | balance
-----------+--------------+-----------+-----------------+------+--------+---------
False | test_profile | null | {'USD': '1USD'} | null | null | null

When you INSERT, you do not set the lock field, which means that column does not exist for the row. The null shown in cqlsh or DevCenter is only syntactic sugar to make the result look like tabular data; in reality the row is a dynamic set of key-value pairs, and lock is simply not present in that set. It is useful to look at the Thrift representation of the data, even though Thrift is not used anymore, to get a sense of how it is stored on disk.
So when the UPDATE is fired it expects the column to be present in order to update it. In your case the lock column is not even present, so it cannot be updated. This thread on the difference between INSERT and UPDATE is also a good read.
You have two solutions to make this work:
Insert null explicitly
You can add lock to your insert statement and set it explicitly to null (which in Cassandra is different from excluding it: this way the column gets a null value, whereas if you exclude it the column does not exist at all):
INSERT INTO transactions (profile, timestamp, amount, lock)
VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD', null);
Use insert on second statement
Since your second statement is inserting lock for the first time rather than updating an existing value, and since it is a static column for that partition, you can use the INSERT ... IF NOT EXISTS flavour of lightweight transactions instead of UPDATE ... IF (lock does not exist yet, so this will pass the first time and fail every time after that, since lock will then have a value):
INSERT INTO transactions (profile, lock)
VALUES ('test_profile', 1) IF NOT EXISTS;

Related

How to store only most recent entry in Cassandra?

I have a Cassandra table like this:
create table test(imei text,dt_time timestamp, primary key(imei, dt_time)) WITH CLUSTERING ORDER BY (dt_time DESC);
Partition Key is: imei
Clustering Key is: dt_time
Now I want to store only the most recent entry in this table (on a time basis) for each partition key.
Let's say I am inserting entries into a table where there should be a single entry for each imei.
Say for imei 98838377272 the stored dt_time is 2017-12-23 16.20.12. Now, if for the same imei a dt_time like 2017-12-23 15.20.00 comes in, that entry should not be inserted into the Cassandra table.
But if a time like 2017-12-23 17.20.00 comes in, it should get inserted and the previous row should be replaced with this dt_time.
You can use the USING TIMESTAMP clause in your insert statement to mark data as most recent:
Marks inserted data (write time) with TIMESTAMP. Enter the time since epoch (January 1, 1970) in microseconds. By default, Cassandra uses the actual time of write.
Remove dt_time from the primary key so that only one entry is stored per imei, and then:
1. Insert data and specify the timestamp as 2017-12-23 16.20.12
2. Insert data and specify the timestamp as 2017-12-23 15.20.00
In this case, a select by imei will return the record with the most recent write timestamp (from point 1).
Please note, this approach will only work if your dt_time (which is specified as the write timestamp) is less than the current time. In other words, the select query will return records with the most recent write timestamp, but only those written before the current time. If you insert data with a timestamp greater than the current time, you will not see that data until that timestamp arrives.
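For illustration, a minimal sketch of that approach (the table name test_latest is made up here; dt_time is dropped from the primary key, and the same value converted to microseconds since the epoch is reused as the write timestamp):
CREATE TABLE test_latest (imei text PRIMARY KEY, dt_time timestamp);
-- 2017-12-23 16:20:12 UTC = 1514046012 seconds since the epoch = 1514046012000000 microseconds.
INSERT INTO test_latest (imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12') USING TIMESTAMP 1514046012000000;
-- The "older" reading carries a smaller write timestamp, so last-write-wins keeps the 16:20:12 row.
INSERT INTO test_latest (imei, dt_time) VALUES ('98838377272', '2017-12-23 15:20:00') USING TIMESTAMP 1514042400000000;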
First, to store only the last entry in the table, you need to remove dt_time from the primary key - otherwise you'll get a new entry inserted for every timestamp.
Cassandra supports so-called lightweight transactions that allow you to check the data before writing it.
So if you want to update the entry only if the stored dt_time is less than the new time, you can use something like the following.
First, insert data:
> insert into test(imei, dt_time) values('98838377272', '2017-12-23 15:20:12');
Then try to update the data with the same time (or it could be a smaller one):
> update test SET dt_time = '2017-12-23 15:20:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 15:20:12';
[applied] | dt_time
-----------+---------------------------------
False | 2017-12-23 15:20:12.000000+0000
This will fail, as can be seen from [applied] being False. If I update it with a greater timestamp, it is applied:
> update test SET dt_time = '2017-12-23 16:21:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 16:21:12';
[applied]
-----------
True
There are several problems with this:
It will not work if the entry doesn't exist yet - in this case you may use INSERT ... IF NOT EXISTS before trying to update, or pre-populate the database with imei numbers, as sketched below.
Lightweight transactions impose overhead on the cluster, as the data has to be read before it is written; this can put significant load on the servers and decrease throughput.
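A minimal sketch of that pre-population step, assuming dt_time has already been removed from the primary key as suggested above:
-- Seed the row once; this applies only if the imei is not present yet.
INSERT INTO test (imei, dt_time) VALUES ('98838377272', '2017-12-23 15:20:12') IF NOT EXISTS;
-- After that, writes only move dt_time forward.
UPDATE test SET dt_time = '2017-12-23 17:20:00' WHERE imei = '98838377272' IF dt_time < '2017-12-23 17:20:00';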
Actually you cannot "update" a clustering key since it's part of the primary key, so you should remove dt_time from the clustering key.
Then you can update the row using a lightweight transaction which checks that the new value is after the existing value.
cqlsh:test> CREATE TABLE test1(imei text, dt_time timestamp, PRIMARY KEY (imei));
cqlsh:test> INSERT INTO test1 (imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12');
cqlsh:test> SELECT * FROM test1;
imei | dt_time
-------------+---------------------------------
98838377272 | 2017-12-23 08:20:12.000000+0000
(1 rows)
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 15:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 15:20:00';
[applied] | dt_time
-----------+---------------------------------
False | 2017-12-23 08:20:12.000000+0000
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 17:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 17:20:00';
[applied]
-----------
True
The update for '15:20:00' will return 'false' and tell you the current value.
The update for '17:20:00' will return 'true'.
Reference: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertLWT.html

As far as I know, in range queries Cassandra returns results ordered by clustering key. Can I change this behavior in my query?

I'm trying to store and retrieve last active sensors by this schema:
CREATE TABLE last_signals (
section bigint,
sensor bigint,
time bigint,
PRIMARY KEY (section, sensor)
);
Rows of this table will be updated every few seconds and, as a result, hot sensors will remain in the memtable. But what will happen when I run a query like this:
SELECT * FROM last_signals
WHERE section = ? AND time > ?
Limit ?
ALLOW FILTERING;
And the result will be something like this (Ordered by clustering key):
sect | sens | time
------+------+------
1 | 1 | 4
1 | 2 | 3
1 | 4 | 2
1 | 5 | 9
The first question: is this result guaranteed to be the same in all versions? (I'm using 3.7.) The next one is how I can change this behavior (with a query option, modeling, etc.). Basically I need to get the last writes first, regardless of clustering-key order. I think in this case my reads will be much faster.
I don't think there is any way to guarantee order besides using clustering keys. Thus your ALLOW FILTERING query is potentially costly and may even time out. You could consider the following schema:
CREATE TABLE last_signals_by_time (
section bigint,
sensor bigint,
time bigint,
dummy bool,
PRIMARY KEY ((section, sensor), time)
) WITH CLUSTERING ORDER BY (time DESC);
Instead of updates, do inserts with a TTL so that you do not have to clean up old entries manually. (The dummy field is needed for the TTL to work.)
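For example, a write could look like the following (the 3600-second TTL here is just an illustrative value):
-- Each reading is a fresh insert that expires on its own after an hour.
INSERT INTO last_signals_by_time (section, sensor, time, dummy) VALUES (1, 42, 1467916800, true) USING TTL 3600;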
And then just run your read queries per section/sensors in parallel:
SELECT * FROM last_signals_by_time
WHERE section = ? AND sensor = ?
LIMIT 1;

Order latest records by timestamp in Cassandra

I'm trying to display the latest values from a list of sensors. The list should also be sortable by the time-stamp.
I tried two different approaches. I included the update time of the sensor in the primary key:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Then I can select the list like this:
select * from sensors where customerid=0 order by changedate desc;
which results in this:
customerid | changedate | sensorid | value
------------+--------------------------+----------+-------
0 | 2015-07-10 12:46:53+0000 | 1 | 2
0 | 2015-07-10 12:46:52+0000 | 1 | 1
0 | 2015-07-10 12:46:52+0000 | 0 | 2
0 | 2015-07-10 12:46:26+0000 | 0 | 1
The problem is, I don't get only the latest results, but all the old values too.
If I remove the changedate from the primary key, the select fails altogether.
InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got changedate"
Updating the sensor values is also not an option:
update overview set changedate=unixTimestampOf(now()), value = '5' where customerid=0 and sensorid=0;
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part changedate found in SET part"
This fails because changedate is part of the primary key.
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
Edit:
In the meantime I tried another approach, storing only the latest value.
I used this schema:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Before inserting the latest value, I would delete all old values
DELETE FROM sensors WHERE customerid=? and sensorid=?;
But this fails because changedate is NOT part of the WHERE clause.
The problem is, I don't get only the latest results, but all the old values too.
Since you are storing with a CLUSTERING ORDER of DESC, it will always be very easy to get the latest records; all you need to do is add a LIMIT to your query, i.e.:
select * from sensors where customerid=0 order by changedate desc limit 10;
This would return at most 10 records with the highest changedate. Even though you are using a limit, you are still guaranteed to get the latest records since your data is ordered that way.
If I remove the changedate from the primary key, the select fails altogether.
This is because you cannot order on a column that is not the clustering key(s) (the secondary part of the primary key) except maybe with a secondary index, which I would not recommend.
Updating the sensor values is also not an option
Your update query is failing because it is not legal to include part of the primary key in 'set'. To make this work all you need to do is update your query to include changedate in the where clause, i.e.:
update overview set value = '5', sensorid = 0 where customerid=0 and changedate=unixTimestampOf(now());
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
You can do this by creating a separate table named 'latest_sensor_data' with the same table definition, with the exception of the primary key. The primary key will now be 'customerid, sensorid', so you can only have one record per sensor. Creating separate tables like this is called denormalization and is a common pattern, particularly in Cassandra data modeling. When you insert sensor data you would now insert into both 'sensors' and 'latest_sensor_data'.
CREATE TABLE latest_sensor_data (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid)
);
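A write would then go to both tables, for example (wrapping the two inserts in a logged batch, as sketched here, is one way to keep the copies in sync, not a requirement):
BEGIN BATCH
INSERT INTO sensors (customerid, sensorid, changedate, value) VALUES (0, 1, '2015-07-10 12:46:53+0000', '2');
INSERT INTO latest_sensor_data (customerid, sensorid, changedate, value) VALUES (0, 1, '2015-07-10 12:46:53+0000', '2');
APPLY BATCH;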
In Cassandra 3.0, 'materialized views' will be introduced, which will make this unnecessary, as a materialized view can accomplish this for you.
Now doing the following query:
select * from latest_sensor_data where customerid=0
Will give you the latest value for every sensor for that customer.
I would recommend renaming 'sensors' to 'sensor_data' or 'sensor_history' to make it more clear what the data is. Additionally you should change the primary key to 'customerid, changedate, sensorid' as that would allow you to have multiple sensors at the same date (which seems possible).
Your first approach looks reasonable. If you add "limit 1" to your query, you would only get the latest result, or limit 2 to see the latest 2 results, etc.
If you want to automatically remove old values from the table, you can specify a TTL (Time To Live) for data points when you do the insert. So if you wanted to keep data points for 10 days, you could do this by adding "USING TTL 864000" on your insert statements. Or you could set a default TTL for the entire table.
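For example, using the sensors table from the question (both forms use the 10-day TTL mentioned above):
-- Per-insert TTL: this data point expires after 10 days (864000 seconds).
INSERT INTO sensors (customerid, sensorid, changedate, value) VALUES (0, 1, '2015-07-10 12:46:53+0000', '2') USING TTL 864000;
-- Or set a default TTL for every write to the table.
ALTER TABLE sensors WITH default_time_to_live = 864000;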

Using Insert with timestamp in Cassandra

I am trying to INSERT (and also UPDATE and DELETE) data in Cassandra using timestamps, but no change occurs in the table. Any help please?
BEGIN BATCH
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('1',null,null,null) USING TIMESTAMP 0;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('2',null,null,null) USING TIMESTAMP 1;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('3',null,null,null) USING TIMESTAMP 2;
APPLY BATCH;
I think you're falling into Cassandra's "control of timestamps". Operations in C* are (in effect [1]) applied only if the timestamp of the new operation is "higher" than that of the previous one.
Let's see an example. Given the following insert
INSERT INTO test (key, value ) VALUES ( 'mykey', 'somevalue') USING TIMESTAMP 1000;
You expect this as output:
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
And it should stay like this unless someone before you performed an operation on this data with a higher timestamp. For instance, if you now write
INSERT INTO test (key, value ) VALUES ( 'mykey', '999value') USING TIMESTAMP 999;
Here's the output
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
As you can see, neither the value nor the timestamp has been updated.
[1] That's a slight simplification. Unless you are doing a specialised 'compare-and-set' write, Cassandra doesn't read anything from the table before it writes and it doesn't know if there is existing data or what its timestamp is. So you end up with two versions of the row, with different timestamps. But when you read the row back you always get the one with the latest timestamp. Normally Cassandra will compact such duplicate rows after a while, which is when the older timestamp row gets discarded.

Cassandra compound clustering key and queries with ordering

We use Cassandra wide rows heavily to store per-user time series, as they are perfect for that use case. Let's assume we have a table:
create table user_events (
user_id text,
timestmp timestamp,
event text,
primary key((user_id), timestmp));
What if clashes on the timestamp happen (the same user can emit two different events with the same timestamp)? What is the best way to tweak this schema to resolve that, assuming we have an ordering for all events (a sequence int for each event)?
If I modify schema the following way:
create table user_events (
user_id text,
timestmp timestamp,
seq int,
event text,
primary key((user_id), timestmp, seq));
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – Cassandra does not allow that.
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – Cassandra does not allow that.
You might be seeing an error because you are repeating ASC. This should work:
WHERE user_id = ? ORDER BY timestmp,seq ASC
Also, as long as you have defined your primary key as PRIMARY KEY((user_id),timestmp,seq) you don't even need to specify ORDER BY x[,y] ASC. It will cluster the data on disk in that order, and thus return it to you already sorted in that order. ORDER BY should only be necessary when you want to put your results in descending order (or whatever the opposite of how you have it defined is).
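For instance, reading a user's events newest-first only needs the reversed direction (a sketch using the same table):
SELECT * FROM user_events WHERE user_id = ? ORDER BY timestmp DESC, seq DESC;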
What if clashes on timestamp may happen?
I think your extra seq column should be sufficient, depending on how you plan on inserting the data. If you are setting the timestmp from the client, then you should be ok. However, look what happens when I (using your second table) INSERT rows while creating the timestamp two different ways.
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Mal',dateof(now()),1,'commanding');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Wash',dateof(now()),1,'piloting');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),2,'killing reavers');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',2,'killing reavers');
Querying that data by a user_id of "River" yields:
aploetz#cqlsh:stackoverflow> SELECT * FROM user_events WHERE user_id='River';
user_id | timestmp | seq | event
---------+--------------------------+-----+-----------------
River | 2015-01-13 13:14:00-0600 | 1 | freaking out
River | 2015-01-13 13:14:00-0600 | 2 | killing reavers
River | 2015-01-13 13:14:00-0600 | 3 | being weird
River | 2015-01-14 12:58:41-0600 | 1 | freaking out
River | 2015-01-14 12:58:57-0600 | 3 | being weird
River | 2015-01-14 12:58:57-0600 | 2 | killing reavers
(6 rows)
Notice that using the now() function to generate a timeuuid, and then converting that to a timestamp with dateof() causes the two rows with the timestmp "2015-01-14 12:58:57-0600" to appear to be the same. But they are not the same, as you can tell by the seq column.
So just a bit of caution on using/generating timestamps. They might look the same, but they may not be stored as the same value. Just to be on the safe side, I would use a timeuuid instead.
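A sketch of what that could look like (the table and column names here are invented for illustration):
-- now() yields a unique timeuuid, so two events written in the same
-- millisecond no longer need a seq column to break ties.
create table user_events_by_timeuuid (
user_id text,
event_time timeuuid,
event text,
primary key((user_id), event_time));
INSERT INTO user_events_by_timeuuid (user_id, event_time, event) VALUES ('River', now(), 'freaking out');
-- dateof() recovers the wall-clock timestamp when reading.
SELECT user_id, dateof(event_time) AS ts, event FROM user_events_by_timeuuid WHERE user_id = 'River';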
