Is an update in Cassandra not an anti-pattern?

As per the DataStax documentation, a read before a write in Cassandra is an anti-pattern.
Whenever we use UPDATE, either in cqlsh or via the DataStax drivers, to set a few columns (with IF conditions and collection updates), does it not do a read before the write first? Is that not an anti-pattern? Am I missing something?
P.S. I am not talking about mere upserts but UPDATEs on specific columns.
TIA!

No, Update is not an anti-pattern.
In Cassandra, an update is an upsert operation, similar to an insert.
UPDATE writes one or more column values to a row in a Cassandra table. Like INSERT, UPDATE is an upsert operation: if the specified row does not exist, the command creates it. All UPDATEs within the same partition key are applied atomically and in isolation.
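A quick illustration of the upsert behavior (the table and column names here are assumed for the example): an UPDATE against a row that does not exist simply writes the row, with no read beforehand.
CREATE TABLE users (id text PRIMARY KEY, email text);

-- No row with id 'u1' exists yet; this UPDATE creates it (upsert),
-- without Cassandra reading anything first.
UPDATE users SET email = 'a@example.com' WHERE id = 'u1';

SELECT * FROM users WHERE id = 'u1';  -- returns the freshly written row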
Lightweight transactions, however, are read-before-write operations, at the cost of four round trips.
Example of Lightweight transaction :
-- Lightweight transaction INSERT
INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', 'lauras@gmail.com')
IF NOT EXISTS;

-- Lightweight transaction UPDATE: the WHERE clause names the primary key,
-- and the IF condition checks a non-key column before applying the write.
UPDATE customer_account
SET customer_email = 'laurass@gmail.com'
WHERE customerID = 'LauraS'
IF customer_email = 'lauras@gmail.com';
Both of the above statements are lightweight transactions.
Source : http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html#cqlUpdate__description

Related

Cassandra delete/update a row and get its previous value

How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch, but that does not seem to be allowed:
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items, and I've built several tables that allow items to be looked up based on that information.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might contain wrong data if I delete an item while someone, at the same time, e.g. hits the like button of that same item:
1. The deleteItem method executes a SELECT query to fetch the current row of the item from the main table.
2. The likeItem method, running at the same time, executes an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement was executed, and the UPDATE query runs before the DELETE query.
3. The deleteItem method then deletes the item from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. That data does not yet contain the just-added like. The item is therefore deleted, but the just-added like remains in the liked_items_by_user table.
You can do a SELECT beforehand, then do a lightweight transaction on the DELETE to ensure that the data still looks exactly like it did when you selected. If it does, you know the latest state before you deleted. If it does not, keep retrying the whole procedure until it sticks.
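For example, against the items table from this question (the concrete values are assumed for illustration):
-- Read the current state of the row.
SELECT owner, liking_users FROM items WHERE id = 'item1';

-- Delete conditionally: this only applies if the row still matches what
-- was read; otherwise the result row has [applied] = false and you retry.
DELETE FROM items WHERE id = 'item1'
IF owner = 'foo' AND liking_users = {'alice'};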
Unfortunately you cannot do a SELECT query inside a batch statement. If you read the docs here, only INSERT, UPDATE, and DELETE statements can be used.
What you're looking for is atomicity on the execution, but batch statements are not going to be the way forward. If the data has been altered, your worst-case situation is zombies, or data that could reappear.
Cassandra uses a grace-period mechanism (gc_grace_seconds) to deal with this; you can find the details here. If for whatever reason this is critical to your business logic, the "best" thing you can do in this situation is to increase the consistency level, or restructure the read pattern at the application level so it does not rely on perfect atomicity, whichever is the right trade-off for you. So either you give up some performance, or you tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can use ALL and pay the performance penalty, which means all replicas for the given foo partition key will have to acknowledge the write in both the commitlog and the memtable before the delete is considered complete. You can tune the consistency to the level you require.
You don't have atomicity in the SQL sense, but depending on your throughput it's unlikely that you will need it (touch wood).
TLDR:
CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing now comes from the ANTLR3 grammar parser for CQL 3, which is not designed to accept SELECT queries inside batches, simply because they are not supported; you can see that here.

How can I get the return of an UPDATE/INSERT statement from the nodejs driver?

I am using the DataStax nodejs driver for Cassandra (https://github.com/datastax/nodejs-driver).
I am wondering if I can get the number of rows updated after executing an update statement.
Thanks,
Updates in Cassandra are treated as just another insert. In other words, there is no read before write (update, in this case) in Cassandra. Updates are simply writes to another immutable SSTable, and during a read Cassandra stitches them together; it is at read time that the latest value for each column is chosen.
So in short, there isn't a way in Cassandra to deduce the number of rows updated.
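If you really need to know whether a matching row existed, the closest workaround is a lightweight transaction, since an LWT result contains an [applied] column. This comes at the usual LWT cost of extra round trips; the table and column names below are assumed:
-- Returns a single row with [applied] = true if a row with id 'u1'
-- existed and was updated, or [applied] = false if it did not.
UPDATE users SET email = 'new@example.com' WHERE id = 'u1' IF EXISTS;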

Cassandra hard vs soft delete

I have multiple tables that I want to keep their deleted data.
I thought of two options to achieve that:
1. Create a new table called deleted_x and, when deleting from x, immediately insert the row into deleted_x.
Advantage: querying from only one table.
Disadvantages:
- an insert for each delete
- when the original table structure changes, I will have to change the deleted table too
2. Have a column called is_deleted, put it in the partition key of each of these tables, and set it to true when deleting a row.
Advantage: one table structure.
Disadvantage: having to mention is_deleted in every query against the table.
Are there any performance considerations I should think of additionally?
Which way is the better way?
Option #1 is awkward, but it's probably the right way to do things in Cassandra. You could issue the two mutations (one DELETE and one INSERT) in a single batch, and guarantee that both are written.
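A sketch of option #1 as one logged batch (the table and column names are assumed):
BEGIN BATCH
  -- Copy the row into the archive table...
  INSERT INTO deleted_x (id, payload) VALUES ('item1', 'some data');
  -- ...then remove it from the live table. Both mutations are guaranteed
  -- to eventually be applied together.
  DELETE FROM x WHERE id = 'item1';
APPLY BATCH;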
Option #2 isn't really as easy as you may expect if you're coming from a relational background, because adding an is_deleted column to a table in Cassandra and expecting to be able to query against it isn't trivial. The primary reason is that Cassandra performs significantly better when querying against the primary key (partition key(s) plus optional clustering key(s)) than against secondary indexes. Therefore, for maximum performance, you'd need to model this as a clustering key; doing so then prohibits you from simply issuing an update, and you'd need to delete + insert anyway.
Option #2 becomes somewhat more viable in 3.0+ with Materialized Views - if you're looking at Cassandra 3.0+, it may be worth considering.
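As a sketch of how that could look in 3.0+ (the schema here is assumed for illustration), a materialized view turns the "active rows" lookup into a primary-key query instead of a delete + insert dance:
CREATE TABLE items_with_flag (
  id text,
  is_deleted boolean,
  payload text,
  PRIMARY KEY (id)
);

-- The view repartitions the rows by the flag; querying active rows is then
-- a plain primary-key read, and flipping is_deleted via UPDATE moves the
-- row between view partitions automatically.
CREATE MATERIALIZED VIEW active_items AS
  SELECT * FROM items_with_flag
  WHERE is_deleted IS NOT NULL AND id IS NOT NULL
  PRIMARY KEY (is_deleted, id);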
Are there any performance considerations I should think of additionally?
You will effectively double the write load and storage size for your cluster by inserting your data twice. This includes compactions, repairs, bootstrapping new nodes and backups.
Which way is the better way?
Let me suggest a 3rd option instead.
Create a table all_data that contains each row and is never deleted from.
Create a table active_data using the same partition key. This table contains only the non-deleted rows (edit: not the data itself, just the key!).
Checking whether the key is in active_data before reading from all_data allows you to read only non-deleted rows (sketched below).
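A minimal sketch of that layout (names assumed):
CREATE TABLE all_data (id text PRIMARY KEY, payload text);
CREATE TABLE active_data (id text PRIMARY KEY);

-- Soft delete: remove only the key from active_data; all_data keeps the row.
DELETE FROM active_data WHERE id = 'item1';

-- Read path: check active_data first, then fetch the full row.
SELECT id FROM active_data WHERE id = 'item1';
SELECT * FROM all_data WHERE id = 'item1';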

Checking to see if a RethinkDB shard is in use?

I am wondering if there is a way to check whether a RethinkDB shard is in use before performing a ReQL query on it.
I am currently calling two functions back to back, the first creating a RethinkDB table and inserting data, the second will read the data from that newly created table. This works okay if the data being inserted is minimal, but once the size of the data set being inserted increases, I start getting:
Unhandled rejection RqlRuntimeError: Cannot perform write: Primary replica for shard ["", +inf) not available
This is because the primary shard is still busy with the write from the previous function. I guess I am wondering if there is some RethinkDB-specific way of avoiding this, or if I need to emit/listen for events or something?
You can probably use the wait command for this. From the docs:
Wait for a table or all the tables in a database to be ready. A table may be temporarily unavailable after creation, rebalancing or reconfiguring. The wait command blocks until the given table (or database) is fully up to date.
http://rethinkdb.com/api/javascript/wait/

Is it possible to do a sequential batch in Cassandra?

Is it possible to do a sequential batch in Cassandra?
e.g.:
Insert into table1, take the uuid from this insert operation, and pass it to the table2 insert statement.
If the table2 insert fails, fail the entire operation.
If not, what's my best option?
(It's kind of transactional.)
Your best shot is Cassandra's BATCH statement:
BATCH - Cassandra documentation
Combined with IF EXISTS constraints (like here: DELETE - Cassandra documentation), it may be what you need.
However, I don't believe there is a way to "insert into table1, take the uuid from this insert operation, and pass it to the table2 insert statement". You can think of batches in C* as transactions in SQL: a batch is fully executed or not at all. (See the sketch after the notes below for the usual workaround.)
Important things to note:
batches can span multiple tables in C*
although batches are atomic, they are not isolated. Some portion of a batch can already be applied, and another query can read those changes before the rest of the batch's writes become visible.
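Since Cassandra never returns generated values from an INSERT, the usual workaround is to generate the uuid in the application and reuse it in both statements of a single logged batch. A sketch with assumed table and column names:
-- The uuid is generated client-side (e.g. UUID.randomUUID() in Java,
-- or types.Uuid.random() in the nodejs driver) and reused in both inserts.
BEGIN BATCH
  INSERT INTO table1 (id, data)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'first');
  INSERT INTO table2 (table1_id, info)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'second');
APPLY BATCH;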
