Difference between UPDATE and INSERT in Cassandra?

What is the difference between UPDATE and INSERT when executing CQL against Cassandra?
It looks like there used to be no difference, but now the documentation says that INSERT does not support counters while UPDATE does.
Is there a "preferred" method to use? Or are there cases where one should be used over the other?
Thanks so much!

There is a subtle difference. Records written with INSERT remain even if you later set all their non-key fields to null, while records created with UPDATE disappear once all their non-key fields are set to null.
Try this:
CREATE TABLE T (
    pk int,
    f1 int,
    PRIMARY KEY (pk)
);
INSERT INTO T (pk, f1) VALUES (1, 1);
UPDATE T SET f1 = 2 WHERE pk = 2;
SELECT * FROM T;
Returns:
 pk | f1
----+----
  1 |  1
  2 |  2
Now, update each row setting f1 to null.
UPDATE T SET f1 = null WHERE pk = 1;
UPDATE T SET f1 = null WHERE pk = 2;
SELECT * FROM T;
Note that row 1 remains, while row 2 is removed.
 pk | f1
----+------
  1 | null
If you look at these rows using cassandra-cli, you will see a difference in how they are stored.
I'd sure like to know whether this is by design or a bug and see this behavior documented.

Counter columns in Cassandra cannot be set to an arbitrary value: they can only be incremented or decremented by an arbitrary amount.
For this reason, INSERT does not support counter columns, because you cannot "insert" a value into a counter column. You can only UPDATE them (increment or decrement) by some value. Here's how you would update a counter column:
UPDATE ... SET name1 = name1 + <value>
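For example, a minimal sketch (the page_views table and its columns are made up for illustration; note that counter tables may contain only primary key columns and counters):
CREATE TABLE page_views (
    page text PRIMARY KEY,
    views counter
);
-- Counters can only be moved relative to their current value:
UPDATE page_views SET views = views + 1 WHERE page = 'home';
-- An INSERT into page_views would be rejected, since counters cannot be set directly.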
You asked:
Is there a "preferred" method to use? Or are there cases where one should be used over the other?
Yes. If you are inserting values into the database, you can use INSERT. If the column doesn't exist, it will be created for you. Otherwise, INSERT's effect is similar to UPDATE. INSERT is useful when you don't have a pre-designed schema (a dynamic column family, i.e. insert anything, anytime). If you are designing the schema beforehand (a static column family, similar to an RDBMS) and know each column, then you can use UPDATE.
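To see the overlap concretely, here is a small sketch against the table T from the first answer (the key value 3 is arbitrary). Because Cassandra treats every write as an upsert, both statements leave the same cell behind:
INSERT INTO T (pk, f1) VALUES (3, 3);
-- ends up with the same stored value:
UPDATE T SET f1 = 3 WHERE pk = 3;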

Another subtle difference (I'm starting to believe CQL is a terrible interface to Cassandra, full of subtleties and caveats, because it uses SQL-like syntax with slightly different semantics) is in setting TTLs on existing data. With UPDATE you cannot update the TTL of the keys, even if the new values are equal to the old values. The solution is to INSERT the new row instead, with the new TTL already set, as sketched below.
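A minimal sketch, reusing the table T from the first answer (the TTL of 86400 seconds is arbitrary):
-- Re-insert the full row so that all of its cells, including the row
-- marker that INSERT writes, carry the fresh TTL:
INSERT INTO T (pk, f1) VALUES (1, 1) USING TTL 86400;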

Regarding the subtle difference highlighted by billbaird (I'm unable to comment on that post directly), where a row created by an UPDATE is deleted if all non-key fields are set to null:
That is expected behavior and not a bug, based on the report at https://issues.apache.org/jira/browse/CASSANDRA-11805 (which was closed as "Not A Problem").
I ran into this myself when using Spring Data for the first time. I was using the save(T entity) method of a repository, but no row was being created. It turned out Spring Data was issuing an UPDATE because it determined that the object wasn't 'new' (I'm not sure that 'isNew' test makes sense here), and I happened to be testing with entities that only had the key fields set.
For this Spring Data case, the Cassandra-specific repository interfaces do provide an insert method that appears to consistently use an INSERT, if that behavior is desired instead (though Spring's documentation doesn't cover these details sufficiently either).
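In CQL terms, what effectively happened is the UPDATE-with-nulls case demonstrated in the first answer. A hedged reconstruction against that answer's table T (the key value 3 is arbitrary):
-- An UPDATE whose SET assignments are all null writes only tombstones,
-- so no new row ever appears:
UPDATE T SET f1 = null WHERE pk = 3;
SELECT * FROM T WHERE pk = 3; -- returns no rows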

Related

Cassandra insert/update if not exists or field=value

I would like to perform, in a single operation, an insert if a record doesn't exist, or an update only if a field of the row has a certain value.
Imagine the table:
CREATE TABLE my_table (id VARCHAR PRIMARY KEY, field_a VARCHAR, field_b VARCHAR);
Is it possible to have something like:
UPDATE my_table SET field_a='test' WHERE id='an-id' IF NOT EXISTS OR IF field_b='AVALUE';
That is, a single INSERT/UPDATE statement that creates the row if it doesn't already exist, updates it if field_b has the value AVALUE, but fails the update if the row exists with a different value of field_b.
UPDATE my_table SET field_a='test' WHERE id='an-id'
IF NOT EXISTS OR IF field_b='AVALUE';
There are a few nuances here. First, it's important to remember that when doing a compare-and-set (CAS) operation in CQL, the syntax and capabilities of INSERT and UPDATE are not the same.
Case-in-point, the IF NOT EXISTS conditional is valid for INSERT, but not for UPDATE. On the flip-side, IF EXISTS is valid for UPDATE, but not for INSERT.
Secondly, OR is not a valid operator in CQL WHERE clauses or in CAS operation conditionals.
Third, using UPDATE with IF EXISTS short-circuits any subsequent conditionals. So UPDATE can either use IF EXISTS or IF (condition) [ AND (another condition) ], but not both.
Considering these points, it would seem one approach here would be to split the statement into two:
INSERT INTO my_table (id,field_a) VALUES ('a1','test') IF NOT EXISTS;
And:
UPDATE my_table SET field_a='test' WHERE id='an-id' IF field_b='AVALUE';
These are both valid CQL. However, that doesn't really solve this situation. An alternative would be to build this logic on the application side. Technically, read-before-write approaches are considered anti-patterns in Cassandra (the built-in CAS operations notwithstanding, due to their use of lightweight transactions).
Perhaps something like SELECT field_a,field_b FROM my_table WHERE id='an-id'; is enough to answer whether it exists as well as what the value of field_b is, thus triggering an additional write? There's a potential for a race condition here, so I'd closely examine the business requirements to see if something like this could work.
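A hedged sketch of that application-side flow, keeping CAS conditions on the writes so that each statement re-checks its assumption at write time (a failed condition then surfaces the race instead of silently overwriting):
-- 1. Read first to decide which branch to take:
SELECT field_a, field_b FROM my_table WHERE id = 'an-id';
-- 2a. If no row came back, try to create it:
INSERT INTO my_table (id, field_a) VALUES ('an-id', 'test') IF NOT EXISTS;
-- 2b. If a row exists and field_b = 'AVALUE', update it conditionally:
UPDATE my_table SET field_a = 'test' WHERE id = 'an-id' IF field_b = 'AVALUE';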

Multiple rows insert operation in Postgres should return failed rows information

I want to use insert multiple rows using a single insert statement in Postgres.
Here, the catch is that if the insert fails for a single row, all the other successful inserts are rolled back. Is there a way to avoid the rollback and make the query return the list of failed rows?
Otherwise, I end up writing a loop of insert statements. I am using the Node pg module. Is there a recommended way of achieving my requirement if Postgres doesn't support this?
Edit - 1
insert into test(name, email) values ('abcd', 'abcd#a'), ('efgh', 'blah'), (null, 'abcd') ON CONFLICT DO NOTHING;
ERROR: null value in column "name" violates not-null constraint
DETAIL: Failing row contains (9, null, abcd).
After the above query, a select statement returns 0 rows. I am looking for a solution wherein the first two rows get inserted.
Sounds like the failure you're talking about is hitting some sort of unique constraint? Take a look at PostgreSQL INSERT ON CONFLICT UPDATE (upsert) use all excluded values for a question related to INSERT ... ON CONFLICT usage.
For example, if you do INSERT INTO X VALUES (...100 rows...) ON CONFLICT DO NOTHING;, any duplicates that collide with the primary key will just be ignored. The alternative to DO NOTHING is to do an UPDATE on conflict.
EDIT to match the newly stated question: ON CONFLICT does not help with null constraint violations. You can use a WITH clause to select only the values without null properties. Here's a sample I just tested in Postgres:
create extension if not exists pgcrypto;
create table tmp (id uuid primary key default gen_random_uuid());
with data_set_to_insert as (
    select x.id from (values (null), (gen_random_uuid())) x(id) -- the alias x can be anything, it does not matter what it is
)
insert into tmp(id) select id from data_set_to_insert where id is not null returning *;

Why Cassandra does not allow udf in update statements?

I am new to Cassandra. I created a table and inserted some data into it. Now I want to select data from it, and in the output I want some calculated columns.
I created a user-defined function concat, which concatenates 2 strings and returns the result. Then I noticed that this function works correctly when I use it in a SELECT statement, but it does not work when I use it in an UPDATE statement.
That is, this works:
select concat(prov,city), year,mnth,acno,amnt from demodb.budgets;
but this does not:
update demodb.budgets set extra=concat(prov,city) where prov='ON';
In addition, the UPDATE also does not work if we simply assign one column's value to another column of same type (without any calculations), as below;
update demodb.budgets set extra=city where prov='ON';
Also, even a simple arithmetic calculation doesn't work in an UPDATE statement;
that is, this too doesn't work:
update demodb.budgets set amnt = amnt + 20 where prov='ON';
Here amnt is a simple double-typed column.
(When I saw this, all I could do was pull my hair and say I can't work with Cassandra; I don't want it if it cannot do simple arithmetic.)
Can someone please help how can I achieve the desired updates?
I think the basic answer to your question is that read-before-write is a huge anti-pattern in Cassandra. Any UPDATE whose assignment depends on an existing value (a UDF over current columns, copying one column into another, or amnt = amnt + 20 on a regular column) would force Cassandra to read the current row before writing, which its write path deliberately avoids; that is why these forms are rejected.
The issue of concurrency in a distributed environment is a key point there.
More info.
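Two hedged workaround sketches. If a running total is the actual goal, counter columns do support in-place arithmetic; the budget_totals table below is hypothetical, since counter tables cannot mix counters with regular non-key columns:
CREATE TABLE demodb.budget_totals (prov text PRIMARY KEY, amnt counter);
UPDATE demodb.budget_totals SET amnt = amnt + 20 WHERE prov = 'ON';
Otherwise, read the current values in the application, compute the result there, and guard the write-back with a lightweight transaction (assuming, as in the question's statements, that prov alone identifies the row; the values 100 and 120 are illustrative):
-- read amnt (say it was 100), compute 100 + 20 client-side, then:
UPDATE demodb.budgets SET amnt = 120 WHERE prov = 'ON' IF amnt = 100;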

CQL check if record exists

I'm on my path to learning Cassandra and the differences between CQL and SQL, but I'm noticing the absence of a way to check whether a record exists in Cassandra. Currently, the best way I have is to use
SELECT primary_keys FROM TABLE WHERE primary_keys = blah
and check whether the result set is empty. Is there a better way to do this, or do I have the right idea for now?
Using count will make Cassandra traverse all the matching rows just to be able to count them. But you only need to check for one, so just LIMIT 1 and return whatever comes back. Then interpret the presence of a result as true and its absence as false. E.g.,
SELECT primary_keys FROM TABLE WHERE primary_keys = blah LIMIT 1
That's the usual way in Cassandra to check if a row exists. You might not want to return all the primary keys if all you care about is if the row exists or not, so you could do this:
SELECT count(*) FROM TABLE WHERE primary_keys = blah
This would just return a 1 if the row exists, and a 0 if it doesn't exist.
If you are using the primary key to filter rows, all three of the above solutions (including yours) are fine, and I don't think there are real differences.
But if you are filtering rows in a more general way (such as by an indexed column or by just the partition key), you should take the LIMIT 1 solution, which avoids useless network traffic.
There is a related example at:
The best way to check existence of filtered rows in Cassandra? by user-defined aggregate?

Do specific rows in a partition have to be specified in order to update and/or delete static columns?

The CQL3 specification description of the UPDATE statement begins with the following paragraph:
The UPDATE statement writes one or more columns for a given row in a table. The <where-clause> is used to select the row to update and must include all columns composing the PRIMARY KEY (the IN relation is only supported for the last column of the partition key). Other columns values are specified through <assignment> after the SET keyword.
The description in the specification of the DELETE statement begins with a similar paragraph:
The DELETE statement deletes columns and rows. If column names are provided directly after the DELETE keyword, only those columns are deleted from the row indicated by the <where-clause> (the id[value] syntax in <selection> is for collection, please refer to the collection section for more details). Otherwise whole rows are removed. The <where-clause> allows to specify the key for the row(s) to delete (the IN relation is only supported for the last column of the partition key).
The key portions of each of these descriptions (the sentences about the <where-clause> selecting the row or rows to modify) state, in layman's terms, that these statements can be used to modify data in a solely row-based manner.
However, given the nature of the relationship (or lack thereof) between the rows of a table and its static columns (which exist independently of any particular row), it seems as though there should be a way to modify such columns given only the keys of the partitions they are respectively contained in. According to the specification, however, that does not seem to be possible, and I'm not sure whether that is a product of the difficulty of allowing this in the CQL3 syntax, or something else.
If a static column cannot be updated or deleted independent of any row in its table, then such operations become coupled with their non-static-column-based counterparts, making the set of columns targeted by such operations, difficult to determine. For example, given a populated table with the following definition:
CREATE TABLE IF NOT EXISTS example_table
(
    partitionKeyColumn int,
    clusteringColumn int,
    nonPrimaryKeyColumn int,
    staticColumn varchar static,
    PRIMARY KEY (partitionKeyColumn, clusteringColumn)
);
... it is not immediately obvious if the following DELETE statements are equivalent:
//#1 (Explicitly specifies all of the columns in and "in" the target row)
DELETE partitionKeyColumn, clusteringColumn, nonPrimaryKeyColumn, staticColumn FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
//#2 (Implicitly specifies all of the columns in (but not "in"?) the target row)
DELETE FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
So, phrasing my observations in question form:
Are the above DELETE statements equivalent?
Does the primary key of at least one row in a CQL3 table have to be supplied in order to update or delete a static column in said table? If so, why?
I do not know about the specification, but in the real Cassandra world, your two DELETE statements are not equivalent.
The first statement deletes the staticColumn, whereas the second one does not. The reason is that static columns are shared by all rows of a partition; you have to name one explicitly to actually delete it.
Furthermore, I do not think it's a good idea to DELETE static columns and non-static columns at the same time. By the way, this statement won't work:
DELETE staticColumn FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
The error output is:
Bad Request: Invalid restriction on clustering column priceable_name since the DELETE statement modifies only static columns
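To delete only the static column, a sketch that follows from the error above: restrict by the partition key alone, since the static column belongs to the whole partition rather than to any single row.
-- No clustering column in the WHERE clause:
DELETE staticColumn FROM example_table WHERE partitionKeyColumn = 1;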
