Cassandra Optimistic Locking

I have a Cassandra table, table1:
CREATE TABLE Policy.table1 (
    name VARCHAR,
    date TIMESTAMP,
    version_num INT,
    PRIMARY KEY (name)
) WITH caching = 'all';
-- and memtable_flush_period_in_ms = 7200;
I need to implement optimistic locking on this table: when we read a row from table1 we remember its version_num, and when we want to update that row we compare the current version_num with the value we remembered. We also need to increment version_num on each update.
Problems:
We cannot put version_num into the WHERE clause; this produces an error: Bad Request: Non PRIMARY KEY version_num found in where clause:
update table1 set date = <new date> where name = 'abc' and version_num = 3
We cannot make version_num part of the primary key, because we need to update its value
If we index version_num it will not help for update statements; the same exception will be thrown
The only way I see is to fetch the current version_num value in Java and, if the expected and actual version_num values are the same, execute the update. The problem is that then the check of version_num and the update of the row are not one atomic operation.
Do you see any solution for this problem?

The solution was found here: Cassandra 2.0 lightweight transactions, http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
If I execute the query:
update table1 set version_num = 5 where name = 'abc' if version_num = 4;
I will receive a row with an [applied] column. This column contains a boolean value: true if the update was applied, false otherwise.
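In cqlsh the result of the conditional update is a single row (illustrative output, for the case where the stored version_num was 4, so the condition matched):

 [applied]
-----------
      True

When the condition fails, Cassandra returns [applied] = False together with the current version_num, so the client can re-read and retry the compare-and-set.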

Related

Update column value in Cassandra table if value exists

I have a Cassandra table as below
CREATE TABLE inventory(
prodid varchar,
loc varchar,
qty float,
PRIMARY KEY (prodid)
) ;
Requirement:
For the provided primary key, if no record exists in the table, we need to insert, which is straightforward. But when a record already exists for the primary key, we need to update the qty column by adding the value received to the existing value in the table.
As per my understanding, I need to query the table first for the provided primary key, get the value of the qty column, add the new value received in the request, and execute the update query with a lightweight transaction.
Ex: the table has qty 10 for prodid=1, and if I receive a new qty of 2 from the user (which is a delta), then I need to update qty to 12 for prodid=1.
Is that logic correct? Or is there a better way to design the table or handle the use case? Will this approach introduce latency under load, since we need to do a select query first and, if data exists, update the column value with the new value? Please help.
You can change the qty column to a static column. This way you do not have to UPDATE the table, only INSERT. Updates are resource intensive, so Cassandra treats an UPDATE statement as an insert statement anyway. Note that a static column requires at least one clustering column, so loc moves into the primary key. Your table definition should be:
CREATE TABLE inventory (
    prodid varchar,
    loc varchar,
    qty float static,
    PRIMARY KEY (prodid, loc)
);
So you can use your business logic to calculate the new value of the qty column and use an INSERT statement, which in turn updates the same column.
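For instance (a sketch; the concrete prodid, loc, and quantities are illustrative), after reading qty = 10 and receiving a delta of 2, the client writes the computed total back:

INSERT INTO inventory (prodid, loc, qty) VALUES ('1', 'store1', 12);  -- the static qty is shared across the whole prodid partition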
The other way is to use a counter column:
CREATE TABLE inventory (
    prodid varchar,
    loc varchar,
    qty counter,
    PRIMARY KEY (prodid, loc)
);
With this design you can just use an update query like the one below:
update inventory set qty = qty + <calculated quantity> where prodid = '1' and loc = '<loc>';
Notice that, in the second table design, all other columns have to be part of the primary key. In your case, that is easy and convenient.
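Using the numbers from the question (the loc value is illustrative): starting from qty = 10 for prodid '1', applying the delta of 2 leaves 12, with no read before the write:

update inventory set qty = qty + 2 where prodid = '1' and loc = 'store1';
select qty from inventory where prodid = '1' and loc = 'store1';  -- qty is now 12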

How to manipulate timestamp columns in Apache Cassandra

I have a table with a timestamp column, and I'd like to manipulate the values of that column. For instance, I need to do something along the lines of:
UPDATE mytable SET datetimecolumn = datetimecolumn + 10mins
How is it done in Apache Cassandra?
UPDATE: The answer seems to be "you can't", but the selected answer is apparently the closest we can get.
You can write a query like this one only if the data type is counter.
Using Counter :
A counter is a special column used to store a number that is changed in increments. For example, you might use a counter column to count the number of times a page is viewed.
Define a counter in a dedicated table only and use the counter data type. You cannot index, delete, or re-add a counter column. All non-counter columns in the table must be defined as part of the primary key.
Example :
CREATE TABLE mytable (
pk1 int PRIMARY KEY,
datetimecolumn counter
);
Here you have to store the datetimecolumn value in milliseconds.
The first time, you have to use an update query with the time as a value in milliseconds, say 1487686182403:
UPDATE mytable SET datetimecolumn = datetimecolumn + 1487686182403 where pk1 = 1
Now mytable with pk1 = 1 contains datetimecolumn = 1487686182403.
If you want to increment datetimecolumn by 10 minutes (600000 milliseconds):
UPDATE mytable SET datetimecolumn = datetimecolumn + 600000 where pk1 = 1
Source : https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
Read-before-write is an anti-pattern in Cassandra. You have to manipulate the value on the client side and update as usual.
In other words: you have to select the value, change it (increment it by 10 minutes), and write the new value back to Cassandra.
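A minimal sketch of that flow against the original table (assuming datetimecolumn is a plain timestamp; the literal written back is computed by the client):

SELECT datetimecolumn FROM mytable WHERE pk1 = 1;
-- the client adds 10 minutes to the returned timestamp, then writes it back:
UPDATE mytable SET datetimecolumn = '2017-02-21 14:19:42+0000' WHERE pk1 = 1;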

Order latest records by timestamp in Cassandra

I'm trying to display the latest values from a list of sensors. The list should also be sortable by the time-stamp.
I tried two different approaches. I included the update time of the sensor in the primary key:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Then I can select the list like this:
select * from sensors where customerid=0 order by changedate desc;
which results in this:
customerid | changedate | sensorid | value
------------+--------------------------+----------+-------
0 | 2015-07-10 12:46:53+0000 | 1 | 2
0 | 2015-07-10 12:46:52+0000 | 1 | 1
0 | 2015-07-10 12:46:52+0000 | 0 | 2
0 | 2015-07-10 12:46:26+0000 | 0 | 1
The problem is, I don't get only the latest results, but all the old values too.
If I remove the changedate from the primary key, the select fails all together.
InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got changedate"
Updating the sensor values is also not an option:
update overview set changedate=unixTimestampOf(now()), value = '5' where customerid=0 and sensorid=0;
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part changedate found in SET part"
This fails because changedate is part of the primary key.
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
Edit:
In the meantime I tried another approach: storing only the latest value.
I used this schema:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Before inserting the latest value, I would delete all old values
DELETE FROM sensors WHERE customerid=? and sensorid=?;
But this fails because changedate is NOT part of the WHERE clause.
The problem is, I don't get only the latest results, but all the old values too.
Since you are storing in a CLUSTERING ORDER of DESC, it will always be very easy to get the latest records, all you need to do is add 'LIMIT' to your query, i.e.:
select * from sensors where customerid=0 order by changedate desc limit 10;
Would return you at most 10 records with the highest changedate. Even though you are using limit, you are still guaranteed to get the latest records since your data is ordered that way.
If I remove the changedate from the primary key, the select fails all together.
This is because you cannot order on a column that is not the clustering key(s) (the secondary part of the primary key) except maybe with a secondary index, which I would not recommend.
Updating the sensor values is also no option
Your update query is failing because it is not legal to include part of the primary key in the SET clause. To make this work, all you need to do is update your query to include changedate in the WHERE clause (and move sensorid into the SET part), i.e.:
update overview set value = '5', sensorid = 0 where customerid = 0 and changedate = unixTimestampOf(now())
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
You can do this by creating a separate table named 'latest_sensor_data' with the same table definition except for the primary key, which will now be 'customerid, sensorid', so you can only have one record per sensor. Creating separate tables like this is called denormalization and is a common pattern, particularly in Cassandra data modeling. When you insert sensor data you would now insert into both 'sensors' and 'latest_sensor_data'.
CREATE TABLE latest_sensor_data (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid)
);
In Cassandra 3.0, 'materialized views' will be introduced, which will make this unnecessary, as a materialized view can maintain such a table for you.
Now doing the following query:
select * from latest_sensor_data where customerid=0
Will give you the latest value for every sensor for that customer.
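A sketch of the dual write described above (the values are illustrative; a logged batch keeps the two tables consistent at the cost of some coordination overhead):

BEGIN BATCH
    INSERT INTO sensors (customerid, sensorid, changedate, value) VALUES (0, 1, '2015-07-10 12:46:53+0000', '2');
    INSERT INTO latest_sensor_data (customerid, sensorid, changedate, value) VALUES (0, 1, '2015-07-10 12:46:53+0000', '2');
APPLY BATCH;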
I would recommend renaming 'sensors' to 'sensor_data' or 'sensor_history' to make it more clear what the data is. Additionally you should change the primary key to 'customerid, changedate, sensorid' as that would allow you to have multiple sensors at the same date (which seems possible).
Your first approach looks reasonable. If you add "limit 1" to your query, you would only get the latest result, or limit 2 to see the latest 2 results, etc.
If you want to automatically remove old values from the table, you can specify a TTL (Time To Live) for data points when you do the insert. So if you wanted to keep data points for 10 days, you could do this by adding "USING TTL 864000" on your insert statements. Or you could set a default TTL for the entire table.
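For example (a sketch; the sample row is illustrative and 864000 seconds is the 10 days mentioned above):

INSERT INTO sensors (customerid, sensorid, changedate, value) VALUES (0, 1, dateof(now()), '5') USING TTL 864000;
-- or set a table-wide default instead:
ALTER TABLE sensors WITH default_time_to_live = 864000;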

Updating a Column in Cassandra based on Where Clause

I have a very simple table
cqlsh:hell> describe columnfamily info ;
CREATE TABLE info (
nos int,
value map<text, text>,
PRIMARY KEY (nos)
)
The following is the query where I am trying to update the value.
update info set value = {'count' : '0' , 'onget' : 'function onget(value,count) { count++ ; return {"value": value, "count":count} ; }' } where nos <= 1000 ;
Bad Request: Invalid operator LTE for PRIMARY KEY part nos
Whatever operator I use to specify the constraint, it complains that the operator is invalid. I am not sure what I am doing wrong here; according to the Cassandra 3.0 CQL docs, there are similar update queries.
The following is my version
[cqlsh 4.1.0 | Cassandra 2.0.3 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I have no idea what's going wrong.
The answer is really in my comment, but it needs a bit of elaboration. To restate from the comment...
The first predicate of the where clause has to uniquely identify the partition key. In your case, since the primary key is only one column, the partition key == the primary key.
Cassandra can't do range scans over partitions. In the language of CQL, a partition is a potentially wide storage row that is uniquely identified by a key; in this case, the values in your nos column. The values of the partition keys are hashed into tokens which explicitly identify where that data lives in the cluster. Since that hash has no order to it, Cassandra cannot use any operator other than equality to route a statement to the correct destination. This isn't a primary key index that could potentially be updated; it is the fundamental partitioning mechanism in Cassandra. So, you can't use inequality operators in the first clause of a predicate. You can use them in subsequent clauses, because the partition has been identified and now you're dealing with an ordered set of columns.
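To make that concrete (a sketch; the map literal is shortened):

-- routable: equality on the partition key hashes to exactly one token
update info set value = {'count' : '0'} where nos = 1000;
-- not routable: a range over hashed partition keys is rejected
-- update info set value = {'count' : '0'} where nos <= 1000;   -- Invalid operator LTE for PRIMARY KEY part nos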
You can't use non-equal condition on the partition key (nos is your partition key).
http://cassandra.apache.org/doc/cql3/CQL.html#selectWhere
Cassandra currently does not support user defined functions inside a query such as the following.
update info set value = {'count' : '0' , 'onget' : 'function onget(value,count) { count++ ; return {"value": value, "count":count} ; }' } where nos <= 1000 ;
First, can you push this onget function into the application layer? You can first query all the rows where nos < 1000, then increment those rows via some batch of statements.
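A sketch of that approach (the keys and recomputed map values come from the application; '...' stands for the recomputed onget body):

BEGIN BATCH
    UPDATE info SET value = {'count' : '1', 'onget' : '...'} WHERE nos = 1;
    UPDATE info SET value = {'count' : '1', 'onget' : '...'} WHERE nos = 2;
APPLY BATCH;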
Otherwise, you can use a counter column for nos, not an int data type. Note, though, that you cannot mix the map data type with counter column families unless the non-counter columns are part of a composite key.
Also, you probably do not want nos, a column whose value changes, as the primary key.
CREATE TABLE info (
id UUID,
value map<text, text>,
PRIMARY KEY (id)
)
CREATE TABLE nos_counter (
info_id UUID,
nos COUNTER,
PRIMARY KEY (info_id)
)
Now you can update the nos counter like this:
update nos_counter set nos = nos + 1 where info_id = <some uuid>;

Does an UPDATE become an implied INSERT

For Cassandra, do UPDATEs become an implied INSERT if the selected row does not exist? That is, if I say
UPDATE users SET name = 'Raedwald' WHERE id = 545127
and id is the PRIMARY KEY of the users table, and the table has no row with a key of 545127, will that be equivalent to
INSERT INTO users (id, name) VALUES (545127, 'Raedwald')
I know that the opposite is true: an INSERT for an id that already exists becomes an UPDATE of the row with that id. Older Cassandra documentation talked about inserts actually being "upserts" for that reason.
I'm interested in the case for CQL3, Cassandra version 1.2+.
Yes, for Cassandra UPDATE is synonymous with INSERT, as explained in the CQL documentation where it says the following about UPDATE:
Note that unlike in SQL, UPDATE does not check the prior existence of the row: the row is created if none existed before, and updated otherwise. Furthermore, there is no mean to know which of creation or update happened. In fact, the semantic of INSERT and UPDATE are identical.
For the semantics to be different, Cassandra would need to do a read to know if the row already exists. Cassandra is write optimized, so you can always assume it doesn't do a read before write on any write operation. The only exception is counters (unless replicate_on_write = false), in which case replication on increment involves a read.
Unfortunately, the accepted answer is not 100% accurate. Inserts are different from updates:
cqlsh> create table ks.t (pk int, ck int, v int, primary key (pk, ck));
cqlsh> update ks.t set v = null where pk = 0 and ck = 0;
cqlsh> select * from ks.t where pk = 0 and ck = 0;
pk | ck | v
----+----+---
(0 rows)
cqlsh> insert into ks.t (pk,ck,v) values (0,0,null);
cqlsh> select * from ks.t where pk = 0 and ck = 0;
pk | ck | v
----+----+------
0 | 0 | null
(1 rows)
Scylla does the same thing.
In Scylla and Cassandra rows are sequences of cells. Each column gets a corresponding cell (or a set of cells in the case of non-frozen collections or UDTs). But there is one additional, invisible cell - the row marker (in Scylla at least; I suspect Cassandra has something similar).
The row marker makes a difference for rows in which all other cells are dead: a row shows up in a query if and only if there's at least one alive cell. Thus, if the row marker is alive, the row will show up, even if all other columns were previously set to null using e.g. updates.
Inserts create a live row marker, while updates don't touch the row marker, so clearly they are different. The example above illustrates that.
One could argue that row markers are "internal" to Cassandra/Scylla, but as you can see, their effects are visible. Row markers affect your life whether you like it or not, so it may be useful to remember about them.
It's sad that no documentation mentions row markers (well, I found this: https://docs.scylladb.com/architecture/sstable/sstable2/sstable-data-file/#cql-row-marker but it's in the context of explaining SSTable internals, and is probably aimed more at Scylla developers than at users).
Bonus: a cell delete:
delete v from ks.t where pk = 0 and ck = 0
is the same as a null update:
update ks.t set v = null where pk = 0 and ck = 0
Indeed, a cell delete also doesn't touch the row marker; it only sets the specified cell to null.
This is different from a row delete:
delete from ks.t where pk = 0 and ck = 0
because row deletes insert a row tombstone, which kills all cells in the row (including the row marker). You could say that row deletes are the opposite of an insert. Updates and cell deletes are somewhere in between.
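Continuing the cqlsh session above (the output follows from the row-tombstone semantics just described):

cqlsh> delete from ks.t where pk = 0 and ck = 0;
cqlsh> select * from ks.t where pk = 0 and ck = 0;
 pk | ck | v
----+----+---
(0 rows)

The row inserted earlier disappears entirely, because the row tombstone killed its row marker along with the cells.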
What one can do, however, is this:
UPDATE table_name SET field = false WHERE key = 55 IF EXISTS;
This will ensure that your update is a true update and not an upsert.
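Like any lightweight transaction, it reports whether it took effect; illustrative cqlsh output for the case where no row with key = 55 exists:

 [applied]
-----------
     False

In that case the update is simply not applied, rather than upserting a new row.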
