I am using Cassandra 2.2.3 and want to make a batch update with two statements, both using a lightweight transaction.
BEGIN BATCH
UPDATE account SET values['balance'] = 11 WHERE id = 1 IF values['balance'] = 10;
UPDATE account SET values['balance'] = 11 WHERE id = 2 IF values['balance'] = 10;
APPLY BATCH;
The batch returns the following error:
InvalidRequest: code=2200 [Invalid query] message="Batch with conditions cannot span multiple partitions".
I understand that it is not possible to make a batch on various PKs in the WHERE clause because of the partitions, but why is it not possible to do a batch on the same PK? The problem is the IF statements: if I remove them, the batch works.
So is there a solution to successfully execute such a batch update? Or any workaround?
EDIT:
This is my schema:
CREATE TABLE booking.account (
id int PRIMARY KEY,
values map<varchar, decimal>,
timestampCreate timestamp,
timestampUpdate timestamp
);
I understand that it is not possible to make a batch on various PKs in
the WHERE clause because of the partitions, but why is it not possible
to do a batch on the same PK?
You can make a batch on various PKs in the WHERE clause; however, this is not recommended (please refer to Cassandra: Batch loading without the Batch keyword).
The problem here is the conditional update (the IF clause). Quoting from the DataStax CQL reference:
In Cassandra 2.0.6 and later, you can batch conditional updates
introduced as lightweight transactions in Cassandra 2.0. Only updates
made to the same partition can be included in the batch because the
underlying Paxos implementation works at the granularity of the
partition. You can group updates that have conditions with those that
do not, but when a single statement in a batch uses a condition, the
entire batch is committed using a single Paxos proposal, as if all of
the conditions contained in the batch apply.
So do you really need a batch statement? Read Using and misusing batches.
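If the two accounts really must live in different partitions, the usual workaround is to give up cross-partition atomicity and run each conditional update as its own statement, checking the [applied] column that every lightweight transaction returns. A minimal sketch with the Python driver, against the booking.account schema from the question (the driver choice and contact point are assumptions; the question doesn't say which client is used):

from decimal import Decimal
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])      # assumed contact point
session = cluster.connect('booking')

# One LWT per partition; each runs its own Paxos round.
update = session.prepare(
    "UPDATE account SET values['balance'] = ? "
    "WHERE id = ? IF values['balance'] = ?")

for account_id in (1, 2):
    rs = session.execute(update, (Decimal('11'), account_id, Decimal('10')))
    if not rs.was_applied:
        # The condition failed for this account. The other account's
        # update is unaffected - there is no cross-partition rollback.
        print('balance condition failed for id', account_id)

Note that this is weaker than what the original batch asked for: if the second condition fails, the first update stays applied.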
Related
I have a table like this:
CREATE TABLE events (
id int,
eventdate timestamp,
PRIMARY KEY (id)
);
What I'm trying to do is a conditional insert that verifies that eventdate is not older than 3 years and inserts the data if the condition is met.
In SQL something similar could be achieved with DATEADD.
How can I handle this in Cassandra?
SELECT * FROM events and iterate (with paging) through the result set, issuing an insert for everything older than 3 years. A quick Python script, given a day or two to run, will accomplish this in less time than more elaborate approaches, particularly if this is a one-off task. If you need to do it regularly, I would recommend writing a Spark job. If you don't want to use Spark and want to run it locally, you can be more efficient by splitting the SELECT statement up by token ranges along the ring boundaries.
Cassandra won't support large bulk operations that require reads before writes across the entire data set. They wouldn't work on the clusters it is designed to support (think petabytes across many data centers).
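A minimal sketch of that "quick python script", assuming the events table from the question; the keyspace name and the events_archive destination table are hypothetical stand-ins, since the answer doesn't say where the old rows should be inserted:

from datetime import datetime, timedelta
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

cutoff = datetime.utcnow() - timedelta(days=3 * 365)

# fetch_size makes the driver page through the table instead of
# loading the whole result set into memory at once.
scan = SimpleStatement("SELECT id, eventdate FROM events", fetch_size=1000)
insert = session.prepare(
    "INSERT INTO events_archive (id, eventdate) VALUES (?, ?)")  # hypothetical table

for row in session.execute(scan):
    if row.eventdate < cutoff:  # the read-before-write happens client-side
        session.execute(insert, (row.id, row.eventdate))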
Cassandra doesn't guarantee any particular order in which statements are executed.
Statements like the code below do not execute in the order they are sent.
INSERT INTO channel
JSON '{"cuid":"NQAA0WAL6drA"
,"owner":"123"
,"status":"open"
,"post_count":0
,"mem_count":1
,"link":"FWsA609l2Og1AADRYODkzNjE2MTIyOTE="
,"create_at":"1543328307953"}';
BEGIN BATCH
UPDATE channel
SET title = ?, description = ? WHERE cuid = ?;
INSERT INTO channel_subscriber
JSON '{"cuid":"NQAA0WAL6drA"
,"user_id":"123"
,"status":"subscribed"
,"priority":"owner"
,"mute":false
,"setting":{"create_at":"1543328307956"}}';
APPLY BATCH;
According to system_traces.sessions, each of them is received by a different node.
Sometimes the started_at times of both queries are equal (to the millisecond), and sometimes the started_at time of the second query is earlier than that of the first.
So this ruins the order of the statements and the data.
We use Erlang with the marina driver, the consistency level is QUORUM, and the clocks of all Cassandra nodes and the application server are synchronized.
How can I force Cassandra to execute queries in order?
Because of Cassandra's distributed nature, queries can be received by different nodes, and depending on the load on a particular node, a query sent later may be executed earlier. In your case you can put the first insert into the batch itself. Or, as implemented in some drivers (for example, the Java driver), use a whitelist policy to send all queries to a single node - but that node then becomes a bottleneck (and I'm really not sure that your driver has such functionality).
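A minimal sketch of the first suggestion, using the Python driver for illustration (the question uses Erlang with marina, so treat this as pseudocode for the same idea): put the channel insert into the batch so that all three statements travel through one coordinator together.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

channel_json = ('{"cuid":"NQAA0WAL6drA","owner":"123","status":"open",'
                '"post_count":0,"mem_count":1,'
                '"link":"FWsA609l2Og1AADRYODkzNjE2MTIyOTE=",'
                '"create_at":"1543328307953"}')
subscriber_json = ('{"cuid":"NQAA0WAL6drA","user_id":"123",'
                   '"status":"subscribed","priority":"owner","mute":false,'
                   '"setting":{"create_at":"1543328307956"}}')

batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
batch.add(SimpleStatement("INSERT INTO channel JSON %s"), (channel_json,))
batch.add(SimpleStatement(
    "UPDATE channel SET title = %s, description = %s WHERE cuid = %s"),
    ('a title', 'a description', 'NQAA0WAL6drA'))  # hypothetical values
batch.add(SimpleStatement("INSERT INTO channel_subscriber JSON %s"),
          (subscriber_json,))
session.execute(batch)

By default all statements in a batch share one timestamp, so conflict resolution no longer depends on which statement a node happens to process first - but remember that a batch is atomic, not isolated, and test this against your schema.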
As per the DataStax documentation, a read before a write in Cassandra is an anti-pattern.
Whenever we use UPDATE, either in cqlsh or via the DataStax drivers, to set a few columns (with IF conditions and collection updates), does it not do a read before the write? Is that not an anti-pattern? Am I missing something?
P.S. I am not talking about mere UPSERTs but UPDATEs on specific columns.
TIA!
No, UPDATE is not an anti-pattern.
In Cassandra, UPDATE is an upsert operation, similar to INSERT.
UPDATE writes one or more column values to a row in a Cassandra table. Like INSERT, UPDATE is an upsert operation: if the specified row does not exist, the command creates it. All UPDATEs within the same partition key are applied atomically and in isolation.
But lightweight transactions are read-before-write operations, at the cost of four round trips.
Examples of lightweight transactions:
#Lightweight transaction Insert
INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', 'lauras@gmail.com')
IF NOT EXISTS;

#Lightweight transaction Update
UPDATE customer_account
SET customer_email = 'laurass@gmail.com'
WHERE customerID = 'LauraS'
IF EXISTS;
Both of the above statements are lightweight transactions.
Source: http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html#cqlUpdate__description
Is it possible to do a sequential batch in Cassandra?
e.g.:
Insert into table1, take the uuid from this insert operation, and pass it to the table2 insert statement.
If the table2 insert fails, fail the entire operation.
If not, what's my best option?
(It's kind of transactional.)
Your best shot is Cassandra's BATCH statement:
BATCH - Cassandra documentation
Combined with IF EXISTS constraints (like here: DELETE - Cassandra documentation), it may be what you need.
However, I don't believe there is a way to "insert into table1 and take the uuid from this insert operation and pass it to the table2 insert statement". You can think of batches in C* like transactions in SQL - they are either fully executed or not at all.
Important things to note:
batches can span multiple tables in C*
although batches are atomic, they are not isolated: some portion of a batch can already be executed and read by another query, yet those changes may still be revoked because the batch as a whole fails
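Since the uuid cannot be read back from the first insert, the usual workaround is to generate it on the client and use the same value in both statements of one atomic batch. A minimal sketch with the Python driver; the table and column names here are hypothetical:

import uuid
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

new_id = uuid.uuid4()  # generated client-side - no read back needed

batch = BatchStatement()
batch.add(session.prepare(
    "INSERT INTO table1 (id, payload) VALUES (?, ?)"),      # hypothetical columns
    (new_id, 'first row'))
batch.add(session.prepare(
    "INSERT INTO table2 (table1_id, info) VALUES (?, ?)"),  # hypothetical columns
    (new_id, 'second row'))

# Applied as one logged batch: both mutations are eventually applied
# together, or neither is (atomic, but not isolated).
session.execute(batch)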
I'm sending a batch of inserts to a single table, where each row has a unique key and an IF NOT EXISTS condition, and there is a problem when even one of the rows already exists.
I need the condition to apply per row and not to the whole batch.
Let's say I have a table "users" with only one column, "user_name", and it contains the row "jhon". Now I'm trying to import new users:
BEGIN BATCH
INSERT INTO "users" ("user_name") VALUES ('jhon') IF NOT EXISTS;
INSERT INTO "users" ("user_name") VALUES ('mandy') IF NOT EXISTS;
APPLY BATCH;
It will not insert "mandy" because "jhon" exists. What can I do to isolate them?
I have a lot of rows to insert, about 100-200K, so I need to use a batch.
Thanks!
First: what you describe is documented as intended behavior:
In Cassandra 2.0.6 and later, you can batch conditional updates introduced as lightweight transactions in Cassandra 2.0. Only updates made to the same partition can be included in the batch because the underlying Paxos implementation works at the granularity of the partition. You can group updates that have conditions with those that do not, but when a single statement in a batch uses a condition, the entire batch is committed using a single Paxos proposal, as if all of the conditions contained in the batch apply.
That basically confirms: your updates are to different partitions, so only one Paxos proposal is going to be used, which means the entire batch will succeed, or none of it will.
That said, with Cassandra, batches aren't meant to speed up bulk loading - they're meant to create pseudo-atomic logical operations. From http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html :
Batches are often mistakenly used in an attempt to optimize performance. Unlogged batches require the coordinator to manage inserts, which can place a heavy load on the coordinator node. If other nodes own partition keys, the coordinator node needs to deal with a network hop, resulting in inefficient delivery. Use unlogged batches when making updates to the same partition key.
The coordinator node might also need to work hard to process a logged batch while maintaining consistency between tables. For example, upon receiving a batch, the coordinator node sends batch logs to two other nodes. In the event of a coordinator failure, the other nodes retry the batch. The entire cluster is affected. Use a logged batch to synchronize tables.
In your schema, each INSERT is to a different partition, which is going to add a LOT of load on your coordinator.
You can run your 200k inserts with a client that executes them asynchronously, and they'll run quite fast - probably as fast as (or faster than) you'd see with the batch.
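A minimal sketch of that async approach with the Python driver's execute_concurrent_with_args helper (driver choice, contact point, and keyspace are assumptions). Keeping IF NOT EXISTS on each statement makes every row its own lightweight transaction, so one existing user no longer blocks the others:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

insert = session.prepare(
    'INSERT INTO "users" ("user_name") VALUES (?) IF NOT EXISTS')

names = [('jhon',), ('mandy',)]  # your 100-200K rows go here
results = execute_concurrent_with_args(session, insert, names,
                                       concurrency=50)

for name, (success, rs) in zip(names, results):
    # Each LWT succeeds or fails on its own; was_applied is False
    # when the row already existed.
    if success and not rs.was_applied:
        print('already existed:', name[0])

Note that each IF NOT EXISTS still costs a Paxos round, so this won't be as fast as plain inserts, but it gives you the per-row behavior the question asks for.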