Cassandra: How to perform an upsert in a lightweight transaction

In Cassandra, I want to add a row and, if it already exists, update it only if the existing date is earlier than the new date. This is how I'm doing it now:
INSERT INTO tbl (...) VALUES (...) IF NOT EXISTS;
If the first query is not applied, I'm running this second one:
UPDATE tbl SET ...
WHERE ...
IF date <= ?;
Is it possible to merge the two queries into one, perhaps by using the UPDATE as an upsert while keeping the IF condition? We are having performance issues with these statements (timeouts), which is why I want to change this.
Regular updates (without IFs) also perform inserts if the row doesn't exist, but the lightweight transaction doesn't. Maybe it's possible to "trick" it into inserting as well.
Thanks!

A lightweight transaction (LWT) basically performs a check before executing the data mutation. Conditional execution is enabled only for INSERT and UPDATE, with these conditions:
1. IF NOT EXISTS for INSERT
2. IF column = 'value' for UPDATE
You cannot mix and match these conditions across the two operations. If there were an option to say UPDATE ... IF column <= 'value', Cassandra would have to hit all the nodes and propose a transaction to all of them, which would be a huge performance impact. LWT hurts performance even with equality conditions, and those only hit the replica nodes.
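In CQL terms, the only conditional forms described above look like this (a minimal sketch, assuming a hypothetical tbl with an id partition key and a date column):

-- hypothetical schema: tbl(id int PRIMARY KEY, date timestamp, ...)
INSERT INTO tbl (id, date) VALUES (1, '2016-01-01') IF NOT EXISTS;

UPDATE tbl SET date = '2016-02-01'
WHERE id = 1
IF date = '2016-01-01';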

Related

Cassandra insert/update if not exists or field=value

I would like, in a single operation, to insert a record if it doesn't exist, or update it only if a field of the existing row has a certain value.
Imagine the table:
CREATE TABLE my_table (id VARCHAR PRIMARY KEY, field_a VARCHAR, field_b VARCHAR);
Is it possible to have something like:
UPDATE my_table SET field_a='test' WHERE id='an-id' IF NOT EXISTS OR IF field_b='AVALUE';
That is, a single INSERT/UPDATE statement that updates the record if field_b has the value 'AVALUE', creates a new row if no row exists yet, but fails the update if a row exists whose field_b has a different value.
UPDATE my_table SET field_a='test' WHERE id='an-id'
IF NOT EXISTS OR IF field_b='AVALUE';
There are a few nuances here. First, it's important to remember that when doing a compare-and-set (CAS) operation in CQL, the syntax and capabilities of INSERT and UPDATE are not the same.
Case in point: the IF NOT EXISTS conditional is valid for INSERT, but not for UPDATE. On the flip side, IF EXISTS is valid for UPDATE, but not for INSERT.
Second, OR is not a valid operator in CQL WHERE clauses or in CAS conditionals.
Third, using UPDATE with IF EXISTS short-circuits any subsequent conditionals. So UPDATE can use either IF EXISTS or IF (condition) [ AND (another condition) ], but not both.
Considering these points, it would seem one approach here would be to split the statement into two:
INSERT INTO my_table (id,field_a) VALUES ('an-id','test') IF NOT EXISTS;
And:
UPDATE my_table SET field_a='test' WHERE id='an-id' IF field_b='AVALUE';
These are both valid CQL. However, that doesn't really help this situation. An alternative would be to build this logic on the application side. Technically, read-before-write approaches are considered an anti-pattern in Cassandra (the built-in CAS operations notwithstanding, since they pay for this with lightweight transactions).
Perhaps something like SELECT field_a,field_b FROM my_table WHERE id='an-id'; is enough to answer both whether the row exists and what the value of field_b is, and can then trigger the appropriate write? There's a potential race condition here, so I'd closely examine the business requirements to see whether something like this could work.
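A minimal sketch of that application-side flow, in plain CQL (the branching itself has to happen in client code; the comments mark the decision points):

-- 1. read the current state of the row
SELECT field_a, field_b FROM my_table WHERE id = 'an-id';

-- 2a. if no row came back, insert it (still guarded against races)
INSERT INTO my_table (id, field_a) VALUES ('an-id', 'test') IF NOT EXISTS;

-- 2b. if a row came back, update it only when field_b matches
UPDATE my_table SET field_a = 'test' WHERE id = 'an-id' IF field_b = 'AVALUE';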

Cassandra ALLOW FILTERING

I have a table as below:
CREATE TABLE test (
    day int,
    id varchar,
    start int,
    action varchar,
    PRIMARY KEY ((day), start, id)
);
I want to run this query:
SELECT * FROM test WHERE day=1 AND start > 1475485412 AND start < 1485785654
AND action='accept' ALLOW FILTERING;
Is this ALLOW FILTERING efficient?
I am expecting Cassandra to filter in this order:
1. By the partition column (day)
2. By the range column (start) on the result of 1
3. By the action column on the result of 2
So ALLOW FILTERING would not be a bad choice for this query.
When there are multiple filtering parameters in the WHERE clause and the non-indexed column is the last one, how will the filtering work?
Please explain.
Is this ALLOW FILTERING efficient?
By "this" you mean in the context of your query and your model, but the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data, this is a hard question to answer.
I am expecting that Cassandra will filter in this order...
Yes, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in a query usually signals a poor table design; that is, you're not following some of the guidelines of Cassandra modeling (specifically "one query <--> one table").
As a solution, I'd suggest including the action field in the clustering key, just before the start field, modifying your table definition:
CREATE TABLE test (
    day int,
    id varchar,
    start int,
    action varchar,
    PRIMARY KEY ((day), action, start, id)
);
You could then rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654;
The only minor issue is that if a record "switches" action values, you cannot update the action field in place (because it's now part of the clustering key); you need to perform a delete with the old action value and an insert with the correct new value. But if you have Cassandra 3.0+, all of this can be handled with the help of the new Materialized Views implementation. Have a look at the documentation for further information.
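A minimal sketch of that materialized-view route (the view name is hypothetical), keeping the original base table with PRIMARY KEY ((day), start, id) unchanged:

CREATE MATERIALIZED VIEW test_by_action AS
    SELECT * FROM test
    WHERE day IS NOT NULL AND action IS NOT NULL
        AND start IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY ((day), action, start, id);

-- the query is then served by the view, with no ALLOW FILTERING
SELECT * FROM test_by_action
WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654;

Cassandra maintains the view on every write to test, so the delete-plus-insert dance on an action change is handled for you.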
In general, ALLOW FILTERING is not efficient.
But in the end it depends on the amount of data you are fetching (for which Cassandra has to use filtering) relative to the amount of data it is being fetched from.
In your case, Cassandra does not need filtering up to step 2 of your list (the range restriction on start), as you mentioned. After that, it relies on filtering to find the data, which you are explicitly allowing in the query itself.
Now, keep the following in mind:
If your table contains, for example, 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient: Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on the action column.
So check this first. If the numbers work in your favour, use ALLOW FILTERING. Otherwise, it would be wise to add a secondary index on action.
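A minimal sketch of the index option (the index name is hypothetical):

CREATE INDEX test_action_idx ON test (action);

-- the index narrows the scan to rows matching action; depending on
-- your Cassandra version, the range on start may still require
-- ALLOW FILTERING, but over a much smaller candidate set
SELECT * FROM test
WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654;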

Cassandra: Batch with conditions cannot span multiple tables

I am trying to execute 3 conditional inserts into different tables inside a batch, using the Cassandra cpp-driver:
BEGIN BATCH
INSERT INTO table1 (...) VALUES (...) IF NOT EXISTS;
INSERT INTO table2 (...) VALUES (...) IF NOT EXISTS;
INSERT INTO table3 (...) VALUES (...) IF NOT EXISTS;
APPLY BATCH;
But I am getting the following error:
Batch with conditions cannot span multiple tables
If the above is not possible in Cassandra, what is the alternative to perform multiple conditional inserts as a transaction and ensure that all succeed or all fail?
I'm afraid there are no alternatives. Conditional statements in a BATCH are limited to a single table, and I don't think there's room for that to change in the future.
This is due to how Cassandra works internally: a batch containing a conditional update (a so-called lightweight transaction) can only be applied to one partition, because conditional updates are based on Paxos, and the Paxos implementation works at the partition level only. Moreover, when multiple conditional statements appear in the same BATCH, all of the conditions must hold for the batch to succeed: if even one conditional update fails, the entire batch fails.
You can read more about BATCH statements in the documentation.
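For contrast, a conditional batch that Cassandra does accept keeps every statement in one table and one partition; a minimal sketch, assuming a hypothetical table1 with PRIMARY KEY ((id), seq):

BEGIN BATCH
INSERT INTO table1 (id, seq, val) VALUES (1, 1, 'x') IF NOT EXISTS;
INSERT INTO table1 (id, seq, val) VALUES (1, 2, 'y') IF NOT EXISTS;
APPLY BATCH;

Both rows live in partition id=1, so a single Paxos round can guard the whole batch.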
You'd basically be paying a performance hit for the conditional update plus a performance hit for the batched operation, and Cassandra stops you before you get that far.
It seems to me you designed this in an RDBMS-like fashion. A NoSQL alternative (I don't know whether it can be applied to your use case, though) would be to denormalize your data into a fourth table that combines the other three, and then issue a single conditional insert against that fourth table.
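A minimal sketch of that denormalization, with hypothetical table and column names (one column standing in for each original table's payload):

CREATE TABLE combined (
    id int PRIMARY KEY,
    t1_data text,  -- what used to live in table1
    t2_data text,  -- what used to live in table2
    t3_data text   -- what used to live in table3
);

INSERT INTO combined (id, t1_data, t2_data, t3_data)
VALUES (1, 'a', 'b', 'c')
IF NOT EXISTS;

A single conditional insert either succeeds or fails as a whole, which restores the all-or-nothing behaviour you were after.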

Primary key: query & updates

A little problem here with Cassandra. Basically, my data has a status (INITIALIZED, PERFORMED, ENDED...), and I have different scheduled tasks that query this data based on the status with an IN clause. So one scheduler works with the data that is INITIALIZED, one with the PERFORMED data, some with both, etc.
Once the data is retrieved, it is processed and the status changes accordingly (INITIALIZED -> PERFORMED -> ENDED).
The problem: in order to be able to use the IN clause, the status has to be part of my table's primary key. But when I update the status... it creates a new record in my table, since the upsert doesn't find any row with the given primary key values...
How do I solve that ?
Instead of including the status column among your primary key columns, you can create a secondary index on it. However, the IN clause is not (yet) supported for secondary index columns. But since you have a very limited number of values to look up, you could use one equality condition per status in your WHERE clauses and then merge the results client-side.
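A minimal sketch of that, with hypothetical table and index names:

CREATE INDEX tasks_status_idx ON tasks (status);

-- one equality query per status of interest...
SELECT * FROM tasks WHERE status = 'INITIALIZED';
SELECT * FROM tasks WHERE status = 'PERFORMED';
-- ...then merge the two result sets in the application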
Beware that using secondary indexes comes at a cost. Check out "when not to use an index". In your case these points may apply:
1. On a frequently updated or deleted column. See "Problems using an index on a frequently updated or deleted column".
2. To look for a row in a large partition unless narrowly queried. See "Problems using an index to look for a row in a large partition unless narrowly queried".

Cassandra Query by Date

How do I update a column based on a greater-than or less-than date comparison in Cassandra?
Example:
update asset_by_file_path set received = true where file_path = '/file/path' and time_received = '2015-07-24 02:14:34-0600';
This works fine. But I would like to do it for all rows that match this file path and whose time_received is greater than 2015-07-24 02:14:34-0600.
time_received is a date and the clustering column.
file_path is a string and the partition key.
Cassandra's WHERE clause has many limitations, and if you have several clustering columns things may not work as you expect; in particular, there are limitations on the >, >=, <, <= operators. There is a fairly recent blog post from DataStax about WHERE clause nuances that also covers some upcoming features.
I think UPDATE can only modify a single row at a time, so I don't see a way to update multiple rows server-side in CQL.
A couple of possible programmatic approaches:
Do a range query to return all the rows you want to update, and then, on the client side, update each row returned. Since they would all be in the same partition, you could issue the updates as batched statements (see the sketch after this list).
If you have Spark available, you could read all the rows you want to update into an RDD using a range query, do a transformation on the RDD to set received to true, and then save the RDD back to Cassandra.
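A minimal sketch of the first approach, in plain CQL; in practice the client reads the SELECT results and generates one UPDATE per returned time_received (the two timestamps in the batch are hypothetical placeholders):

SELECT time_received FROM asset_by_file_path
WHERE file_path = '/file/path'
AND time_received > '2015-07-24 02:14:34-0600';

BEGIN BATCH
UPDATE asset_by_file_path SET received = true
WHERE file_path = '/file/path' AND time_received = '2015-07-24 03:00:00-0600';
UPDATE asset_by_file_path SET received = true
WHERE file_path = '/file/path' AND time_received = '2015-07-24 04:00:00-0600';
APPLY BATCH;

Since every statement targets the same partition, the batch stays on a single replica set and is applied as a unit.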
