Batch Mutation In Cassandra?

I want to update multiple rows in two CFs.
I don't care about the order in which they get updated.
But is it guaranteed that if one update succeeds, the others will eventually succeed too, even if some C* node fails in between?
Also, does Hector's BatchMutation class use a batch update or an atomic batch update? These are two separate things.

You should use an atomic batch in CQL3. This guarantees that either the entire batch succeeds or the entire batch fails. An example from the CQL3 docs:
BEGIN BATCH
INSERT INTO users (userid, password, name) VALUES ('user2', 'ch@ngem3b', 'second user');
UPDATE users SET password = 'ps22dhds' WHERE userid = 'user3';
INSERT INTO users (userid, password) VALUES ('user4', 'ch@ngem3c');
DELETE name FROM users WHERE userid = 'user1';
APPLY BATCH;
The Hector BatchMutation class uses the Thrift operation batch_mutate. This is weaker than atomic_batch_mutate, which is the Thrift equivalent of the above. batch_mutate is only atomic for updates to the same key (they can be in different CFs, though), whereas atomic_batch_mutate is atomic across all updates. I don't think Hector has implemented atomic_batch_mutate, so you will need to move to CQL3 and a CQL3-capable driver, e.g. DataStax's Java driver.
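If you go that route, the same atomic (logged) batch can be issued through the DataStax Java driver. A minimal sketch, assuming the users table above and a node reachable on localhost (the keyspace name is made up):
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class AtomicBatchExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo"); // hypothetical keyspace

        // LOGGED is the default batch type and gives batch-level atomicity:
        // either all statements eventually apply, or none do.
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(new SimpleStatement(
            "INSERT INTO users (userid, password, name) VALUES ('user2', 'ch@ngem3b', 'second user')"));
        batch.add(new SimpleStatement(
            "UPDATE users SET password = 'ps22dhds' WHERE userid = 'user3'"));
        session.execute(batch);

        cluster.close();
    }
}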

Related

Cassandra delete/update a row and get its previous value

How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch, but that does not seem to be allowed:
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items, and I've built several tables that allow items to be looked up based on that information.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might contain wrong data if I delete an item and at the same time someone, e.g., hits the like button of that same item:
1. The deleteItem method executes a SELECT query to fetch the current row of the item from the main table.
2. The likeItem method, which runs at the same time, executes an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement was executed, and the UPDATE query is executed before the DELETE query.
3. The deleteItem method then deletes the item from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. That data does not yet contain the just-added like. The item is therefore deleted, but the just-added like remains in the liked_items_by_user table.
You can do a select beforehand, then do a lightweight transaction on the delete to ensure that the data still looks exactly like it did when you selected. If it does, you know the latest state before you deleted. If it does not, keep retrying the whole procedure until it sticks.
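Here is a minimal sketch of that retry procedure with the DataStax Java driver. It assumes data_by_user has user as its full primary key plus a single column named data; both are assumptions, since the actual schema isn't shown:
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DeleteAndReturn {
    // Read the row, then delete it only if it still holds the value we read.
    // Retry the whole procedure until the conditional delete is applied.
    static Row deleteAndGet(Session session, String user) {
        while (true) {
            Row row = session.execute(
                "SELECT * FROM data_by_user WHERE user = ?", user).one();
            if (row == null) {
                return null; // nothing to delete
            }
            ResultSet rs = session.execute(
                "DELETE FROM data_by_user WHERE user = ? IF data = ?",
                user, row.getString("data"));
            if (rs.wasApplied()) {
                return row; // the row's state just before deletion
            }
            // [applied] = false: a concurrent write happened; re-read and retry.
        }
    }
}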
Unfortunately you cannot do a SELECT query inside a batch statement. If you read the docs here, only INSERT, UPDATE, and DELETE statements can be used.
What you're looking for is atomicity on the execution, but batch statements are not going to be the way forward. If the data has been altered, your worst-case situation is zombies, or data that could reappear.
Cassandra uses a grace period mechanism to deal with this; you can find the details here. If, for whatever reason, this is critical to your business logic, the "best" thing you can do in this situation is to increase the consistency level, or restructure the read pattern at the application level to not rely on perfect atomicity, whichever is the right trade-off for you. So either you give up some of the performance, or you tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can do an ALL, and you pay the performance penalty, but that means all replicas for the given foo partition key will have to acknowledge the write both in the commitlog and the memtable. Note, this still means a flush from the commitlog will need to happen before the delete is complete, but you can tune the consistency to the level you require.
You don't have atomicity in the SQL sense, but depending on throughput it's unlikely that you will need it (touch wood).
TLDR:
CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing now comes from the ANTLR3 grammar parser for CQL 3, which is not designed to accept SELECT queries inside batches, simply because they are not supported; you can see that here.
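From the Java driver, the equivalent of the TLDR is to set the consistency level on the statement itself; a minimal sketch:
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class DeleteAtAll {
    static void deleteUser(Session session, String user) {
        SimpleStatement delete = new SimpleStatement(
            "DELETE FROM data_by_user WHERE user = ?", user);
        // Every replica of the partition must acknowledge the delete.
        delete.setConsistencyLevel(ConsistencyLevel.ALL);
        session.execute(delete);
    }
}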

Is an update in Cassandra not an anti-pattern?

As per the DataStax documentation, a read before a write in Cassandra is an anti-pattern.
Whenever we use UPDATE, either in cqlsh or via the DataStax drivers, to set a few columns (with IFs and collection updates), does it not do a read before the write first? Is that not an anti-pattern? Am I missing something?
P.S. I am not talking about mere UPSERTs but UPDATEs on specific columns.
TIA!
No, UPDATE is not an anti-pattern.
In Cassandra, UPDATE is an upsert operation, similar to INSERT.
UPDATE writes one or more column values to a row in a Cassandra table. Like INSERT, UPDATE is an upsert operation: if the specified row does not exist, the command creates it. All UPDATEs within the same partition key are applied atomically and in isolation.
Lightweight transactions, however, are read-before-write operations, at the cost of four round trips.
Examples of lightweight transactions:
#Lightweight transaction Insert
INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', 'lauras@gmail.com')
IF NOT EXISTS;
#Lightweight transaction Update
UPDATE customer_account
SET customer_email='laurass@gmail.com'
IF customerID='LauraS';
Both of the above statements are lightweight transactions.
Source : http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html#cqlUpdate__description
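As a side note, when you run one of these conditional statements through the DataStax Java driver, the result of the Paxos read-before-write comes back in the [applied] column, exposed as wasApplied(). A small sketch against the customer_account table above (the keyspace name is made up):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class LwtCheck {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo"); // hypothetical keyspace
        ResultSet rs = session.execute(
            "INSERT INTO customer_account (customerID, customer_email) "
            + "VALUES ('LauraS', 'lauras@gmail.com') IF NOT EXISTS");
        // The Paxos read-before-write reports its outcome in the [applied] column.
        // When not applied, the result row also carries the existing values.
        System.out.println(rs.wasApplied()
            ? "inserted"
            : "row already existed: " + rs.one().getString("customer_email"));
        cluster.close();
    }
}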

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters they can specify is to get only the most recent row; in that case I append "LIMIT 1" to the end of the CQL statement, since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device ids to get back the latest entries for. So, my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of parameters for it: under the hood it makes requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify LIMIT, it applies to the whole statement; basically, you can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in IN would become one query) and put a LIMIT 1 on every one of them.
To be honest, this was my solution in a lot of projects and it works pretty much fine. The coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the results for you, and might run into timeouts etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
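A sketch of that client-side fan-out with the DataStax Java driver, using the table and key columns from the question (the bound values and device ids are made up):
import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class LatestPerDevice {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo"); // hypothetical keyspace

        // Prepare once; LIMIT 1 plus DESC clustering order yields the newest row.
        PreparedStatement ps = session.prepare(
            "SELECT * FROM data WHERE application_id = ? AND partner_id = ? "
            + "AND location_id = ? AND device_id = ? AND data_schema = ? LIMIT 1");

        // One small single-partition query per device, all in flight at once,
        // instead of a single IN query funneled through one coordinator.
        List<ResultSetFuture> futures = new ArrayList<>();
        for (String deviceId : new String[] {"dev-1", "dev-2", "dev-3"}) {
            futures.add(session.executeAsync(
                ps.bind("app", "partner", "loc", deviceId, "schema")));
        }
        for (ResultSetFuture f : futures) {
            Row latest = f.getUninterruptibly().one();
            if (latest != null) {
                System.out.println(latest);
            }
        }
        cluster.close();
    }
}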
This is all in case you can't afford more disk space for your cluster.
Usual Cassandra solution
Data in Cassandra should be modeled so it is ready for your queries (query-first design). So basically you would need one additional table with the same partition key as you have now, but with the clustering column activity_timestamp dropped, i.e.:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
double (()) is intentional.
Every time you write to your main table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can run the IN query you need against this table; since it holds only the latest entry, you don't have to use LIMIT 1, because there is only one entry per partition key. That would be the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry: they are inexpensive and CPU-bound. With Cassandra it's always "bring on the writes", I guess :)
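A sketch of the dual write, assuming a latest_entry table with the same columns as data minus the clustering column, and a single payload column standing in for the actual data columns (all assumptions):
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class DualWrite {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo"); // hypothetical keyspace

        PreparedStatement insertData = session.prepare(
            "INSERT INTO data (application_id, partner_id, location_id, device_id, "
            + "data_schema, activity_timestamp, payload) VALUES (?, ?, ?, ?, ?, ?, ?)");
        // Same partition key, no clustering column: one row per key, always the latest.
        PreparedStatement insertLatest = session.prepare(
            "INSERT INTO latest_entry (application_id, partner_id, location_id, "
            + "device_id, data_schema, activity_timestamp, payload) VALUES (?, ?, ?, ?, ?, ?, ?)");

        // A logged batch keeps the two tables in sync even if a node fails mid-write.
        BatchStatement batch = new BatchStatement();
        Object[] values = {"app", "partner", "loc", "dev-1", "schema",
                           new java.util.Date(), "payload-bytes"};
        batch.add(insertData.bind(values));
        batch.add(insertLatest.bind(values));
        session.execute(batch);
        cluster.close();
    }
}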
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. Indeed, IN is only supported on the last column of the partition key or the last column of the clustering key. So you can:
swap the last two fields of your partition key
use one query for each device id

batch update cassandra with lightweight transaction

I am using Cassandra 2.2.3 and want to make a batch update with two statements, both using a lightweight transaction.
BEGIN BATCH
UPDATE account SET values['balance'] = 11 WHERE id = 1 IF values['balance'] = 10;
UPDATE account SET values['balance'] = 11 WHERE id = 2 IF values['balance'] = 10;
APPLY BATCH;
The batch returns following error:
InvalidRequest: code=2200 [Invalid query] message="Batch with conditions cannot span multiple partitions".
I understand that it is not possible to make a batch on various PKs in the WHERE clause because of the partitions, but why is it not possible to do a batch on the same PK? The problem is the IF statements: removing them, the batch works.
So is there a solution to successfully execute such a batch update? Or any workaround?
EDIT:
This is my schema:
CREATE TABLE booking.account (
id int PRIMARY KEY,
values map<varchar, decimal>,
timestampCreate timestamp,
timestampUpdate timestamp
);
I understand that it is not possible to make a batch on various PKs in
the WHERE clause because of the partitions, but why is it not possible
to do a batch on the same PK?
You could make a batch on various PKs in the WHERE clause; however, this is not recommended (please refer to Cassandra: Batch loading without the Batch keyword).
The problem here is the conditional update (the IF statement). A quote from the DataStax CQL reference:
In Cassandra 2.0.6 and later, you can batch conditional updates
introduced as lightweight transactions in Cassandra 2.0. Only updates
made to the same partition can be included in the batch because the
underlying Paxos implementation works at the granularity of the
partition. You can group updates that have conditions with those that
do not, but when a single statement in a batch uses a condition, the
entire batch is committed using a single Paxos proposal, as if all of
the conditions contained in the batch apply.
So do you really need a batch statement? Read this: Using and misusing batches.
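To make the quoted restriction concrete: if both rows lived in the same partition, the conditional batch would be accepted, because a single Paxos proposal can cover it. A hypothetical restructuring (table and column names are made up) sketched with the DataStax Java driver:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class SamePartitionConditionalBatch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("booking");

        // Hypothetical schema: both accounts share the partition key group_id,
        // so one Paxos proposal can cover the whole conditional batch.
        session.execute("CREATE TABLE IF NOT EXISTS account_v2 ("
            + "group_id int, id int, balance decimal, "
            + "PRIMARY KEY ((group_id), id))");

        ResultSet rs = session.execute(
            "BEGIN BATCH "
            + "UPDATE account_v2 SET balance = 11 WHERE group_id = 1 AND id = 1 IF balance = 10; "
            + "UPDATE account_v2 SET balance = 11 WHERE group_id = 1 AND id = 2 IF balance = 10; "
            + "APPLY BATCH");
        // Either both conditions held and both rows changed, or neither did.
        System.out.println("applied: " + rs.wasApplied());
        cluster.close();
    }
}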

Cassandra PreparedStatement vs normal insert

I'm using Cassandra for my project and I was facing a timeout issue during writes, the same one described in this post: Cassandra cluster with bad insert performance and insert stability (at the moment I'm testing with only one node, the Java driver, and the latest release of Cassandra). The application has to insert a huge quantity of data per user once per day (during the night). I have a REST controller that accepts files and then processes them in parallel as they arrive, inserting values into Cassandra. I have to insert 1 million entries per user, where an entry has up to 8 values (time is not so important; it can also take 10 minutes). Following the answer provided in Cassandra cluster with bad insert performance and insert stability, I decided to add executeAsync(), a Semaphore, and PreparedStatement to my application, while previously I was using none of them.
The problem now is that, with variable keyspaces (one per user) and the need to update lists in the database, I can't initialize my PreparedStatements in the initialization phase; I have to do it at least once per file processed (one file contains 10k+ entries), and a user can upload up to 100 files per day. For this reason, I'm getting this warning:
Re-preparing already prepared query INSERT INTO c2bdd9f7073dce28ed973238ac85b6e5d6162fce.sensorMonitoringLog (timestamp, sensorId, isLogging) VALUES (?, ?, ?). Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.
My question is: is it good practice to use PreparedStatement like this, or is it better to use a normal insert with executeAsync()?
Thank you
If you are facing timeouts during writes, it is a good idea to use PreparedStatement, but not asynchronous inserts. Timeouts are a way for Cassandra to protect itself from overload. With asynchronous execution you are giving it more work at the same time, and the risk of OOM grows.
To do things properly with PreparedStatement, you have to create one and only one Session object per keyspace. Then each session must prepare its own statements once.
Moreover, be aware that there is a thread-safety risk with PreparedStatement and asynchronous execution: preparing a statement must be synchronized. But once again, I advise you not to use executeAsync in such a case.
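For what it's worth, here is a sketch of the prepare-once pattern this answer describes: one Session per keyspace and a per-session cache, so each statement is prepared a single time (the class and map are my own invention):
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class StatementCache {
    private final Cluster cluster;
    private final ConcurrentMap<String, Session> sessions = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, PreparedStatement> statements = new ConcurrentHashMap<>();

    public StatementCache(Cluster cluster) {
        this.cluster = cluster;
    }

    // One Session per keyspace, created on first use.
    public Session session(String keyspace) {
        return sessions.computeIfAbsent(keyspace, cluster::connect);
    }

    // Each query is prepared once per keyspace; later calls hit the cache,
    // which avoids the "re-preparing already prepared query" warning.
    // computeIfAbsent is atomic per key, which covers the thread-safety concern.
    public PreparedStatement prepare(String keyspace, String cql) {
        return statements.computeIfAbsent(keyspace + "|" + cql,
            k -> session(keyspace).prepare(cql));
    }
}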
