Cassandra insert/update if not exists or field=value

I would like to do, in a single operation, an insert if a record doesn't exist, or an update only if a field of the row has a certain value.
Imagine the table:
CREATE TABLE my_table (id VARCHAR PRIMARY KEY, field_a VARCHAR, field_b VARCHAR);
Is it possible to have something like:
UPDATE my_table SET field_a='test' WHERE id='an-id' IF NOT EXISTS OR IF field_b='AVALUE';
That is, a single INSERT/UPDATE statement that updates the record if field_b has the value AVALUE, creates a new row if one doesn't already exist, but fails if a row is in the table with a different value in field_b.

UPDATE my_table SET field_a='test' WHERE id='an-id'
IF NOT EXISTS OR IF field_b='AVALUE';
There are a few nuances here. First, it's important to remember that when doing a compare-and-set (CAS) operation in CQL, the syntax and capabilities of INSERT and UPDATE are not the same.
Case in point, the IF NOT EXISTS conditional is valid for INSERT, but not for UPDATE. On the flip side, IF EXISTS is valid for UPDATE, but not for INSERT.
Secondly, OR is not a valid operator in CQL WHERE or in CAS operation conditionals.
Third, using UPDATE with IF EXISTS short-circuits any subsequent conditionals. So UPDATE can either use IF EXISTS or IF (condition) [ AND (another condition) ], but not both.
Considering these points, it would seem one approach here would be to split the statement into two:
INSERT INTO my_table (id,field_a) VALUES ('an-id','test') IF NOT EXISTS;
And:
UPDATE my_table SET field_a='test' WHERE id='an-id' IF field_b='AVALUE';
These are both valid CQL. However, that doesn't really help this situation. An alternative would be to build this logic on the application side. Technically, read-before-write approaches are considered an anti-pattern in Cassandra, the built-in CAS operations notwithstanding (they rely on lightweight transactions).
Perhaps something like SELECT field_a,field_b FROM my_table WHERE id='an-id'; is enough to answer both whether the row exists and what the value of field_b is, and then trigger the appropriate write? There's a potential for a race condition here, so I'd closely examine the business requirements to see if something like this could work.
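A rough sketch of that read-then-write flow, using the table from the question (the choice between the two conditional statements is made in the application, based on the SELECT result):
-- read first to learn whether the row exists and what field_b holds
SELECT field_a, field_b FROM my_table WHERE id='an-id';
-- if no row came back, try to create it; IF NOT EXISTS guards against a concurrent insert
INSERT INTO my_table (id, field_a) VALUES ('an-id', 'test') IF NOT EXISTS;
-- if a row exists with field_b='AVALUE', apply the conditional update instead
UPDATE my_table SET field_a='test' WHERE id='an-id' IF field_b='AVALUE';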

Related

Multiple rows insert operation in Postgres should return failed rows information

I want to insert multiple rows using a single INSERT statement in Postgres.
The catch is that if the insert fails for a single row, all the other successful inserts are rolled back. Is there a way to avoid the rollback and have the query return a list of the failed rows?
Otherwise, I end up writing a loop of insert statements. I am using the Node pg module. Is there a recommended way of achieving this if Postgres doesn't support it?
Edit - 1
insert into test(name, email) values ('abcd', 'abcd#a'), ('efgh', 'blah'), (null, 'abcd') ON CONFLICT DO NOTHING;
ERROR: null value in column "name" violates not-null constraint
DETAIL: Failing row contains (9, null, abcd).
After the above query, a select statement returns 0 rows. I am looking for a solution wherein the first two rows get inserted.
Sounds like the failure you're talking about is hitting some sort of unique constraint? Take a look at "PostgreSQL INSERT ON CONFLICT UPDATE (upsert) use all excluded values" for a question related to INSERT ... ON CONFLICT usage.
For example, if you do INSERT INTO X VALUES (...100 rows...) ON CONFLICT DO NOTHING;, any duplicates that collide with the primary key will just be ignored. The alternative to DO NOTHING is to do an UPDATE on conflict.
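For illustration, a sketch of that DO UPDATE variant against the test table above, assuming a unique constraint exists on email (the conflict target must match an actual constraint in your schema):
-- on a duplicate email, overwrite the name instead of failing the whole statement
INSERT INTO test (name, email) VALUES ('abcd', 'abcd#a')
ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;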
EDIT to match the newly stated question. ON CONFLICT does not help with null constraint violations. You can use a WITH clause and select only the values without null properties. Here's a sample I just tested in Postgres:
create extension if not exists pgcrypto;
create table tmp (id uuid primary key default gen_random_uuid());
with data_set_to_insert as (
select x.id from (values (null), (gen_random_uuid())) x(id) -- alias x can be anything, does not matter what it is
)
insert into tmp(id) select id from data_set_to_insert where id is not null returning *;

Cassandra Allow filtering

I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that Cassandra will filter in this order:
1. By the partition column (day)
2. By the range column (start) on 1's result
3. By the action column on 2's result
So ALLOW FILTERING will not be a bad choice for this query.
In the case of multiple filtering parameters in the WHERE clause where the non-indexed column is the last one, how will the filter work?
Please explain.
Is this ALLOW FILTERING efficient?
When you write "this" you mean in the context of your query and your model; however, the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data, this is a hard question to answer.
I am expecting that cassandra will filter in this order...
Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is, you're not following some guidelines of Cassandra modeling (specifically "one query <--> one table").
As a solution, I'd suggest including the action field in the clustering key, just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
having only the minor issue that if one record "switches" action values, you cannot update the single action field (because it's now part of the clustering key); instead you need to perform a delete with the old action value and an insert with the correct new value. But if you have Cassandra 3.0+, all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
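For reference, a sketch of what such a view could look like on Cassandra 3.0+, keeping the original test table untouched and querying the view instead (the view name is illustrative):
CREATE MATERIALIZED VIEW test_by_action AS
  SELECT * FROM test
  WHERE day IS NOT NULL AND action IS NOT NULL AND start IS NOT NULL AND id IS NOT NULL
  PRIMARY KEY ((day), action, start, id);
-- same query as above, but against the view and without ALLOW FILTERING
SELECT * FROM test_by_action WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654;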
In general ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which Cassandra has to use ALLOW FILTERING) and the size of the data it is being fetched from.
In your case Cassandra does not need filtering up to:
By the range column (start) on 1's result
as you mentioned. But after that, it will rely on filtering to find the data, which you are allowing in the query itself.
Now, keep the following in mind:
If your table contains, for example, 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999,998 rows for nothing. If the query is often used, it is probably better to add an index on the column being filtered (action here).
So check this first. If it works in your favour, use ALLOW FILTERING.
Otherwise, it would be wise to add a secondary index on action.
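A sketch of that alternative (the index name is illustrative):
CREATE INDEX test_action_idx ON test (action);
-- with the partition key restricted and action indexed, the equality lookup no longer relies on ALLOW FILTERING
SELECT * FROM test WHERE day=1 AND action='accept';
Whether the additional range on start can be combined with the index without ALLOW FILTERING depends on the Cassandra version, so it is worth verifying against your own data.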

Primary key: query & updates

Little problem here with Cassandra. Basically my data has a status (INITIALIZED, PERFORMED, ENDED...), and I have different scheduled tasks that will query this data based on the status with an "IN" clause. So one scheduler will work with the data that is INITIALIZED, one with the PERFORMED, some with both, etc.
Once the data is retrieved, it is processed and the status changes accordingly (INITIALIZED -> PERFORMED -> ENDED).
The problem: in order to be able to use the IN clause, the status has to be part of my table's primary key. But when I update the status... it creates a new record in my table, since the upsert doesn't find any data with the given primary key...
How do I solve that?
Instead of including the status column in your primary key columns, you can create a secondary index on the column. However, the IN clause is not (yet) supported for secondary index columns. But as you have a very limited number of values to look up, you could use equality conditions in your WHERE clause and then merge the results client-side?
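A sketch of that idea with illustrative names (a tasks table with a status column; adjust to your actual schema):
CREATE INDEX tasks_status_idx ON tasks (status);
-- one equality query per status value instead of a single IN, merged in the application
SELECT * FROM tasks WHERE status = 'INITIALIZED';
SELECT * FROM tasks WHERE status = 'PERFORMED';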
Beware that using secondary indexes comes at a cost. Check out "when not to use an index". In your case these points may apply:
- On a frequently updated or deleted column. See "Problems using an index on a frequently updated or deleted column" below.
- To look for a row in a large partition unless narrowly queried. See "Problems using an index to look for a row in a large partition unless narrowly queried" below.

Cassandra: How to perform upsert in lightweight transaction

In Cassandra, I want to add a row, and if it already exists, only update it if the existing date is earlier than the new date. This is how it's done:
INSERT INTO tbl (...) VALUES (...) IF NOT EXISTS;
If the first query is not applied, I'm running this second one:
UPDATE tbl SET ...
WHERE ...
IF date <= ?;
Is it possible to merge the two queries into one? Maybe using the UPDATE as an upsert, while keeping the IF condition. We are having performance issues with these statements (timeouts), which is why I want to change this.
Regular updates (without IFs) also perform inserts if the row doesn't exist, but the lightweight transaction doesn't. Maybe it's possible to "trick" it into inserting as well.
Thanks!
LWT basically does a check before executing the data mutation. Conditional execution is enabled only for INSERT and UPDATE, with these conditions:
1. IF NOT EXISTS for INSERT
2. IF column = 'value' for UPDATE
You cannot mix and match these conditions with different operations. If there were an option to say UPDATE ... IF column <= 'value', it would have to hit all the nodes and propose a transaction to all of them, which would be a huge performance impact. LWT impacts performance even with equality conditions, while hitting only the replica nodes.

Difference between UPDATE and INSERT in Cassandra?

What is the difference between UPDATE and INSERT when executing CQL against Cassandra?
It looks like there used to be no difference, but now the documentation says that INSERT does not support counters while UPDATE does.
Is there a "preferred" method to use? Or are there cases where one should be used over the other?
Thanks so much!
There is a subtle difference. Records written via INSERT remain if you set all non-key fields to null. Records written via UPDATE go away if you set all non-key fields to null.
Try this:
CREATE TABLE T (
pk int,
f1 int,
PRIMARY KEY (pk)
);
INSERT INTO T (pk, f1) VALUES (1, 1);
UPDATE T SET f1=2 where pk=2;
SELECT * FROM T;
Returns:
pk | f1
----+----
1 | 1
2 | 2
Now, update each row setting f1 to null.
UPDATE T SET f1 = null WHERE pk = 1;
UPDATE T SET f1 = null WHERE pk = 2;
SELECT * FROM T;
Note that row 1 remains, while row 2 is removed.
pk | f1
----+------
1 | null
If you look at these using cassandra-cli, you will see a difference in how the rows are added.
I'd sure like to know whether this is by design or a bug and see this behavior documented.
Counter columns in Cassandra can't be set to an arbitrary value: they can only be incremented or decremented by an arbitrary amount.
For this reason, INSERT doesn't support counter columns: you cannot "insert" a value into a counter column. You can only UPDATE them (increment or decrement) by some value. Here's how you would update a counter column:
UPDATE ... SET name1 = name1 + <value>
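For example, a minimal counter table (the names are illustrative):
CREATE TABLE page_views (page text PRIMARY KEY, views counter);
-- counters can only be changed relative to their current value
UPDATE page_views SET views = views + 1 WHERE page = '/home';
-- an INSERT setting views to a literal value would be rejected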
You asked:
Is there a "preferred" method to use? Or are there cases where one should be used over the other?
Yes. If you are inserting values into the database, you can use INSERT. If the column doesn't exist, it will be created for you. Otherwise, INSERT's effect is similar to UPDATE. INSERT is useful when you don't have a pre-designed schema (a dynamic column family, i.e. insert anything, anytime). If you are designing the schema beforehand (a static column family, similar to an RDBMS) and know each column, then you can use UPDATE.
Another subtle difference (I'm starting to believe CQL is a terrible interface to Cassandra, full of subtleties and caveats due to using SQL-like syntax with slightly different semantics) is with setting TTLs on existing data. With UPDATE you cannot update the TTL of the keys, even if the new values are equal to the old values. The solution is to INSERT the new row instead, with the new TTL already set.
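A short sketch of that workaround, reusing table T from the example above (the 86400-second TTL is just an assumed value):
-- re-insert the same values to set the TTL; per the point above, an UPDATE would not refresh the TTL of the key itself
INSERT INTO T (pk, f1) VALUES (1, 1) USING TTL 86400;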
Regarding the subtle difference highlighted by billbaird (I'm unable to comment on that post directly) where a row created by an update operation will be deleted if all non-key fields are null:
That is expected behavior and not a bug based on the bug report at https://issues.apache.org/jira/browse/CASSANDRA-11805 (which was closed as "Not A Problem")
I ran into this myself when using Spring Data for the first time. I was using the save(T entity) method of a repository, but no row was being created. It turned out Spring Data was using an UPDATE because it determined that the object wasn't 'new' (not sure that test for 'isNew' makes sense here), and I happened to be testing with entities that only had the key fields set.
For this Spring Data case, the Cassandra-specific repository interfaces do provide an insert method that appears to consistently use an INSERT, if that behavior is desired instead (though Spring's documentation doesn't cover these details sufficiently either).
