Cassandra Conditional Update combined with IF EXISTS

I know that it's possible to use conditional updates (lightweight transactions) in Cassandra.
UPDATE myTable
SET col1 = 'abc'
WHERE id = 1
IF priority < 2;
This allows me to update only rows with a higher priority. Unfortunately, when the row does not exist the statement returns false. Is it possible to combine "IF NOT EXISTS" and "IF" with an OR operation?
Otherwise I have to execute multiple queries against the cluster. In my use case this could be a significant performance issue.

I figured it out - unfortunately it's not possible in Cassandra. We have to separate the insert and update statements.
https://issues.apache.org/jira/browse/CASSANDRA-8335
insert ... if not exists
update ... if priority < 2
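Concretely, with the table from the question (a sketch; the column list and the initial priority value in the insert are assumptions):
-- assumed columns and initial priority, for illustration
INSERT INTO myTable (id, col1, priority) VALUES (1, 'abc', 1) IF NOT EXISTS;
UPDATE myTable SET col1 = 'abc' WHERE id = 1 IF priority < 2;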

You can use the IN operator, assuming that "priority" has a finite (and possibly small) allowed set of values, such as 0 and 1. By adding the NULL value to the set, the query will also succeed if the row does not exist.
You're using a single query, but note that the performance implications are non-negligible and could be the same as, or worse than, using two.
UPDATE myTable
SET col1 = 'abc'
WHERE id = 1
IF priority IN (0, 1, NULL);

Related

Is there any way we could limit an update?

Generally, we can limit a select with select * from table where predicate = value limit N. I am currently in a situation where I have 200 records falling under a predicate, but I want to update the first 100, like update table set column = 1 where predicate = value limit ...?, and the second half with update table set column = 2 where predicate = value. It could be done by having ranges (<=, >=) in the predicate section; unfortunately, I have none of them.
Currently, I don't think we have this feature, as the WHERE clause must identify the row or rows to be updated by primary key. However, you could further limit the number of rows to be updated by using an IF EXISTS condition.
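As a sketch of that idea (hypothetical table and column names), IF EXISTS turns the upsert into a true update, so only rows that are already present are touched:
-- hypothetical names; rows that don't exist are skipped instead of created
UPDATE mytable SET column1 = 1 WHERE pk = 'value' IF EXISTS;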

Cassandra: backfilling new boolean field

I added a new boolean column called subscribe to my email_subscriptions table in Cassandra. I noticed that it returns false for all rows' subscribe field.
I wanted to default all rows in the table with the subscribe field as true, but this StackOverflow answer says:
there is no default value in Cassandra.
So my question is, how do I set all rows in my email_subscriptions table to have their subscribe field set to true? Do I need to backfill via a batch update?
There are two ways you can do this.
You can create a program that selects every record from email_subscriptions and inserts it back with subscribe = true (a sketch follows below).
Or
When selecting the subscribe field, check whether the value is null (for example, with the driver's isNull() method, which returns true if the column is null and false otherwise). If it returns true, subscribe was never written for that row, and you can treat it as subscribe = true.
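For the first option, a sketch of the statements such a program would issue, assuming a hypothetical partition key column email:
-- read every row (page through the result set in practice)
SELECT email FROM email_subscriptions;
-- write each key back with the flag set
INSERT INTO email_subscriptions (email, subscribe) VALUES ('user@example.com', true);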
The only way is to backfill your whole table. Depending on the size of your table, you could have problems querying it with a simple
SELECT * FROM mytable;
due to timeouts. Since the partition key is mandatory in an UPDATE statement, you have to find a way to extract your partition keys from that table.
This is the perfect scenario for using the TOKEN function. Assuming you did your homework and don't have any overly wide partitions, you can scan your whole dataset by splitting it into ranges of partitions. How wide each range should be depends on your data. From a general point of view, you need to:
SELECT __partition_key_columns__ FROM mytable WHERE
TOKEN(__partition_key_columns__) >= min_range AND
TOKEN(__partition_key_columns__) < max_range;
and min_range and max_range go from -2^63 to 2^63-1 (using the Murmur3 partitioner) in steps of a guessed window size W:
SELECT __partition_key_columns__ FROM mytable WHERE TOKEN(__partition_key_columns__) >= -2^63 AND TOKEN(__partition_key_columns__) < -2^63 + W;
SELECT __partition_key_columns__ FROM mytable WHERE TOKEN(__partition_key_columns__) >= -2^63 + W AND TOKEN(__partition_key_columns__) < -2^63 + 2*W;
...
until you have covered the whole range up to 2^63-1. If you get a timeout, make W smaller and try again; if you don't, expand the window W to speed up the process. You will then have all the partition keys needed to issue the updates for each range.
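For completeness, a sketch of the final step: each partition key extracted by the scans above gets its own update (placeholder names, bind marker as in a prepared statement):
-- one update per extracted partition key
UPDATE mytable SET subscribe = true WHERE partition_key_column = ?;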
EDIT: This blog post explains exactly how to perform such a task.

Cassandra OperationTimedOut

select count (*) from my_table gives me OperationTimedOut: errors={}, last_host=127.0.0.1
I have already tried changing request_timeout_in_ms in cassandra.yaml and request_timeout in cqlshrc.sample (both are in C:\Programs\DataStax-DDC\apache-cassandra\conf), but without success.
How can I increase the timeout?
select count (*) is not doing what you think: it is actually expensive, as it counts the rows one by one. You can track the number of records using a separate column family with a counter, which you will need to increment for every insert you do into your table. For example:
CREATE TABLE IF NOT EXISTS my_table_counter (
    mykey text,
    count counter,
    PRIMARY KEY (mykey)
);
Then, for every insert into your table, do a counter update:
INSERT into my_table (mykey, mydata) VALUES (?, ?);
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;
To get the count:
SELECT count FROM my_table_counter WHERE mykey = ?;
Note that counters are not idempotent, so in the rare event of a failure your count might be under- or over-counted. Also, the code above assumes that you only insert with new keys.
If you need precise counting, Cassandra may not be a good fit. Also, if you are not inserting with unique keys, consider using a lightweight transaction (INSERT ... IF NOT EXISTS) and updating the counter only if the transaction was applied.
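A sketch of that pattern, reusing the tables above: a conditional insert returns an [applied] column in its result set, and the client increments the counter only when it is true (the two statements cannot go in the same batch, since counter updates and lightweight transactions don't mix):
-- conditional insert; the result contains a boolean [applied] column
INSERT INTO my_table (mykey, mydata) VALUES (?, ?) IF NOT EXISTS;
-- issued by the client only when [applied] came back true
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;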

How to debug the "Each GROUP BY expression must contain at least one column that is not an outer reference" error

Since SSRS doesn't allow filters on aggregates, I found some code that helped me come up with the query below. However, when I run it I get:
Each GROUP BY expression must contain at least one column that is not an outer reference
I have searched everywhere but can't find how to fix this. I've even removed the two extra tables from the query so there were no joins at all. I need to not return any order where the total of the lines on the order is less than $500 and greater than 0.
SELECT
tdsls041_sales_order_lines.company,
tdsls041_sales_order_lines.order_number,
tdsls041_sales_order_lines.amount,
tdsls041_sales_order_lines.item,
tdsls041_sales_order_lines.container
FROM
tdsls041_sales_order_lines AS tdsls041_sales_order_lines
WHERE
(tdsls041_sales_order_lines.company = 610) AND
(tdsls041_sales_order_lines.order_number IN
(SELECT
tdsls041_sales_order_lines.order_number
FROM
tdsls041_sales_order_lines AS tdsls041_sales_order_lines_1
GROUP BY
tdsls041_sales_order_lines.order_number
HAVING
(SUM(tdsls041_sales_order_lines.amount) <= 500) OR
SUM(tdsls041_sales_order_lines.amount) > 0))
The issue that SQL Server is complaining about is that the grouping wants an aggregate function in the SELECT statement. Unfortunately, you want to use IN, which needs a list of order numbers.
You just need to add an aggregate function to your subquery and then add another layer to select just the Order Numbers from that.
SELECT T1.company, T1.order_number, T1.amount, T1.item, T1.container
FROM tdsls041_sales_order_lines AS T1
WHERE (T1.company = 610) AND (T1.order_number IN
(SELECT order_number FROM
(SELECT TSOL.order_number, SUM(TSOL.amount) AS TTL
FROM tdsls041_sales_order_lines AS TSOL
GROUP BY TSOL.order_number
HAVING (SUM(TSOL.amount) <= 500) OR
SUM(TSOL.amount) > 0) AS T2) )
You can filter on aggregates in Charts and Tables. You have to put the aggregate filter on your GROUP instead of on the table itself (Group Properties -> Filters tab).

Does an UPDATE become an implied INSERT

For Cassandra, do UPDATEs become an implied INSERT if the selected row does not exist? That is, if I say
UPDATE users SET name = 'Raedwald' WHERE id = 545127
and id is the PRIMARY KEY of the users table, and the table has no row with a key of 545127, will that be equivalent to
INSERT INTO users (id, name) VALUES (545127, 'Raedwald')
I know that the opposite is true: an INSERT for an id that already exists becomes an UPDATE of the row with that id. Older Cassandra documentation talked about inserts actually being "upserts" for that reason.
I'm interested in the case for CQL3, Cassandra version 1.2+.
Yes, for Cassandra UPDATE is synonymous with INSERT, as explained in the CQL documentation where it says the following about UPDATE:
Note that unlike in SQL, UPDATE does not check the prior existence of the row: the row is created if none existed before, and updated otherwise. Furthermore, there is no means to know which of creation or update happened. In fact, the semantics of INSERT and UPDATE are identical.
For the semantics to be different, Cassandra would need to do a read to know if the row already exists. Cassandra is write optimized, so you can always assume it doesn't do a read before write on any write operation. The only exception is counters (unless replicate_on_write = false), in which case replication on increment involves a read.
Unfortunately, the accepted answer is not 100% accurate. Inserts are different from updates:
cqlsh> create table ks.t (pk int, ck int, v int, primary key (pk, ck));
cqlsh> update ks.t set v = null where pk = 0 and ck = 0;
cqlsh> select * from ks.t where pk = 0 and ck = 0;
pk | ck | v
----+----+---
(0 rows)
cqlsh> insert into ks.t (pk,ck,v) values (0,0,null);
cqlsh> select * from ks.t where pk = 0 and ck = 0;
pk | ck | v
----+----+------
0 | 0 | null
(1 rows)
Scylla does the same thing.
In Scylla and Cassandra rows are sequences of cells. Each column gets a corresponding cell (or a set of cells in the case of non-frozen collections or UDTs). But there is one additional, invisible cell - the row marker (in Scylla at least; I suspect Cassandra has something similar).
The row marker makes a difference for rows in which all other cells are dead: a row shows up in a query if and only if there's at least one live cell. Thus, if the row marker is alive, the row will show up even if all other columns were previously set to null using e.g. updates.
Inserts create a live row marker, while updates don't touch the row marker, so clearly they are different. The example above illustrates this.
One could argue that row markers are "internal" to Cassandra/Scylla, but as you can see, their effects are visible. Row markers affect your life whether you like it or not, so it may be useful to remember about them.
It's a pity that no documentation mentions row markers (well, I found this: https://docs.scylladb.com/architecture/sstable/sstable2/sstable-data-file/#cql-row-marker, but it's in the context of explaining SSTable internals, which is probably aimed more at Scylla developers than at users).
Bonus: a cell delete:
delete v from ks.t where pk = 0 and ck = 0
is the same as a null update:
update ks.t set v = null where pk = 0 and ck = 0
Indeed, a cell delete also doesn't touch the row marker; it only sets the specified cell to null.
This is different from a row delete:
delete from ks.t where pk = 0 and ck = 0
because row deletes insert a row tombstone, which kills all cells in the row (including the row marker). You could say that row deletes are the opposite of an insert. Updates and cell deletes are somewhere in between.
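A quick way to see the difference, continuing the ks.t example above (a sketch, output omitted):
cqlsh> insert into ks.t (pk, ck, v) values (0, 0, 1);
cqlsh> delete v from ks.t where pk = 0 and ck = 0;  -- cell delete: the row still shows up, with v = null
cqlsh> delete from ks.t where pk = 0 and ck = 0;    -- row delete: the row disappears entirely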
What one can do, however, is this:
UPDATE table_name SET field = false WHERE key = 55 IF EXISTS;
This will ensure that your update is a true update and not an upsert.
