How do atomic batches work in Cassandra?

How can atomic batches guarantee that either all statements in a single batch will be executed or none?

In order to understand how batches work under the hood, it's helpful to look at the individual stages of the batch execution.
The client
Batches are supported using CQL3 or modern Cassandra client APIs. In each case you'll be able to specify a list of statements you want to execute as part of the batch, a consistency level to be used for all statements and an optional timestamp. You'll be able to batch execute INSERT, DELETE and UPDATE statements. If you choose not to provide a timestamp, the current time is automatically used and associated with the batch.
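For illustration, here is a minimal sketch of building and executing such a batch with the DataStax Java driver (3.x API); the keyspace, table and column names are made up for the example, and the explicit timestamp is optional:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LoggedBatchExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {        // hypothetical keyspace

            BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
            batch.setConsistencyLevel(ConsistencyLevel.QUORUM);         // CL used for all statements
            batch.setDefaultTimestamp(1450000000000000L);               // optional, microseconds (protocol v3+)

            // INSERT, UPDATE and DELETE can be mixed in one batch (hypothetical tables).
            batch.add(new SimpleStatement(
                "INSERT INTO users (user_id, name) VALUES ('u1', 'alice')"));
            batch.add(new SimpleStatement(
                "UPDATE users_by_name SET user_id = 'u1' WHERE name = 'alice'"));
            batch.add(new SimpleStatement(
                "DELETE FROM pending_signups WHERE user_id = 'u1'"));

            session.execute(batch);
        }
    }
}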
The client will have to handle two exceptions in case the batch cannot be executed successfully.
UnavailableException - there are not enough nodes alive to fulfill any of the updates with the specified batch CL
WriteTimeoutException - timeout while either writing the batchlog or applying any of the updates within the batch. This can be checked by reading the writeType value of the exception (either BATCH_LOG or BATCH).
Failed writes during the batchlog stage will be retried once automatically by the DefaultRetryPolicy in the Java driver. Batchlog creation is critical to ensure that a batch will always be completed in case the coordinator fails mid-operation. Read on to find out why.
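As a sketch of how a client might handle those two cases with the Java driver (the class and method names here are made up for the example):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.UnavailableException;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class BatchErrorHandling {
    static void executeBatch(Session session, BatchStatement batch) {
        try {
            session.execute(batch);
        } catch (UnavailableException e) {
            // Not enough replicas were alive to attempt the batch at the requested CL.
            System.err.println("Unavailable: " + e.getMessage());
        } catch (WriteTimeoutException e) {
            if (e.getWriteType() == WriteType.BATCH_LOG) {
                // Timed out while writing the batchlog itself: the batch may never have
                // been stored, so the "all or nothing" guarantee does not apply yet.
                System.err.println("Timeout while writing the batchlog");
            } else if (e.getWriteType() == WriteType.BATCH) {
                // The batchlog was written, so the batch will eventually be replayed
                // and completed even though this request timed out.
                System.err.println("Timeout while applying the batch statements");
            }
        }
    }
}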
The coordinator
All batches sent by the client will be executed by the coordinator, just as with any write operation. What's different from normal write operations is that Cassandra will also make use of a dedicated log that contains all pending batches currently being executed (called the batchlog). This log is stored in the local system keyspace and is managed by each node individually. Each batch execution starts by creating a log entry with the complete batch on, preferably, two nodes other than the coordinator. Once the coordinator has created the batchlog on the other nodes, it will start to execute the actual statements in the batch.
Each statement in the batch will be written to the replicas using the CL and timestamp of the whole batch. Apart from that, there's nothing special about the writes happening at this point. Writes may also be hinted or throw a WriteTimeoutException, which can be handled by the client (see above).
After the batch has been executed, all created batchlogs can be safely removed. Therefore the coordinator will send a batchlog delete message upon successful execution to the nodes that received the batchlog before. This happens in the background and will go unnoticed if it fails.
Let's wrap up what the coordinator does during batch execution:
sends the batchlog to two other nodes (preferably in different racks)
executes all statements in the batch
deletes the batchlog from those nodes again after successful batch execution
The batchlog replica nodes
As described above, the batchlog will be replicated across two other nodes (if the cluster size allows it) before batch execution. The idea is that any of these nodes will be able to pick up pending batches in case the coordinator goes down before finishing all statements in the batch.
What makes things a bit complicated is the fact that those nodes won't notice that the coordinator is not alive anymore. The only point at which the batchlog nodes are updated with the current status of the batch execution is when the coordinator issues a delete message indicating the batch has been successfully executed. In case such a message doesn't arrive, the batchlog nodes will assume the batch hasn't been executed for some reason and replay the batch from the log.
Batchlog replay potentially takes place every minute, i.e. that is the interval at which a node checks whether there are any pending batches in its local batchlog that haven't been deleted by the (possibly dead) coordinator. To give the coordinator some time between the batchlog creation and the actual execution, a fixed grace period is used (write_request_timeout_in_ms * 2, default 4 sec). If the batchlog entry still exists after 4 sec, it will be replayed.
Just as with any write operation in Cassandra, timeouts may occur. In this case the node will fall back to writing hints for the timed-out operations. When the timed-out replicas come back up, writes can resume from hints. This behavior doesn't seem to be affected by whether hinted_handoff_enabled is set or not. There's also a TTL value associated with the hint which will cause the hint to be discarded after a longer period of time (the smallest gc_grace_seconds of any involved column family).
Now you might be wondering whether it isn't potentially dangerous to replay a batch on two nodes at the same time, which may happen since we replicate the batchlog on two nodes. What's important to keep in mind here is that each batch execution is idempotent, due to the limited kinds of supported operations (inserts, updates and deletes) and the fixed timestamp associated with the batch. There won't be any conflicts even if both nodes and the coordinator retry executing the batch at the same time.
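A small sketch of why a replay is harmless (hypothetical users table): re-applying the same mutation with the same explicit timestamp leaves exactly the same data behind, because conflict resolution is last-write-wins by timestamp.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class IdempotentReplay {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            String insert = "INSERT INTO users (user_id, name) VALUES ('u1', 'alice') "
                          + "USING TIMESTAMP 1450000000000000";
            session.execute(insert);
            // A second execution (e.g. a replay by a batchlog replica) writes the same
            // cells with the same timestamp, so the end state is unchanged.
            session.execute(insert);
        }
    }
}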
Atomicity guarantees
Let's get back to the atomicity aspects of "atomic batches" and review what exactly is meant by atomic (source):
"(Note that we mean “atomic” in the database sense that if any part of
the batch succeeds, all of it will. No other guarantees are implied;
in particular, there is no isolation; other clients will be able to
read the first updated rows from the batch, while others are in
progress.)"
So in a sense we get "all or nothing" guarantees. In most cases the coordinator will just write all the statements in the batch to the cluster. However, in case of a write timeout, we must check at which point the timeout occurred by reading the writeType value. The batch must have been written to the batchlog (writeType BATCH rather than BATCH_LOG) in order to be sure that those guarantees still apply. At this point other clients may also read partially executed results from the batch.
Getting back to the question, how can Cassandra guarantee that either all or no statements at all in a batch will be executed?
Atomic batches basically depend on successful replication and idempotent statements. It's not a 100% guaranteed solution, as in theory there might be scenarios that will still cause inconsistencies. But for a lot of use cases in Cassandra it's a very useful tool if you're aware of how it works.

Batch documentation (doc):
In Cassandra 1.2 and later, batches are atomic by default. In the context of a Cassandra batch operation, atomic means that if any of the batch succeeds, all of it will. To achieve atomicity, Cassandra first writes the serialized batch to the batchlog system table that consumes the serialized batch as blob data. When the rows in the batch have been successfully written and persisted (or hinted) the batchlog data is removed. There is a performance penalty for atomicity. If you do not want to incur this penalty, prevent Cassandra from writing to the batchlog system by using the UNLOGGED option: BEGIN UNLOGGED BATCH
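For reference, an unlogged batch can also be expressed through the driver; a minimal sketch, assuming a hypothetical events table where both rows live in the same partition:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class UnloggedBatchExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            // UNLOGGED skips the batchlog: no atomicity guarantee across partitions,
            // but also no batchlog-related performance penalty.
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            batch.add(new SimpleStatement(
                "INSERT INTO events (day, seq, payload) VALUES ('2016-01-01', 1, 'a')"));
            batch.add(new SimpleStatement(
                "INSERT INTO events (day, seq, payload) VALUES ('2016-01-01', 2, 'b')"));
            session.execute(batch);
        }
    }
}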

Cassandra batches:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
To add to the above answers:
With Cassandra 2.0, you can write batch statements combined with lightweight transactions (LWT). The restriction, though, is that all DML statements must target the same partition.
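A minimal sketch of such a conditional batch, assuming a hypothetical accounts table partitioned by user_id (both statements target the same partition):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConditionalBatchExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            BatchStatement batch = new BatchStatement();  // logged by default
            batch.add(new SimpleStatement(
                "INSERT INTO accounts (user_id, account_id, balance) VALUES ('u1', 'a1', 0) IF NOT EXISTS"));
            batch.add(new SimpleStatement(
                "INSERT INTO accounts (user_id, account_id, balance) VALUES ('u1', 'a2', 0) IF NOT EXISTS"));
            // The conditions are resolved together, so the batch is applied as a whole or not at all.
            ResultSet rs = session.execute(batch);
            System.out.println("Batch applied: " + rs.wasApplied());
        }
    }
}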

Related

What does Cassandra return to client on dropped mutations?

When there are "dropped mutations" on the Cassandra side, does it return a corresponding failure to the calling client? Or is it always a success response to the calling client that invoked the transaction, even though the corresponding mutations are dropped on the server side, resulting in data loss?
In one particular instance we observed lots of dropped mutations (around 6k dropped mutations per sec) when we had a TPS of around 80K/sec and increased latency of 4000+ ms. The cluster is a 6-node cluster. I don't have the node/cassandra.yaml config with me now. In general, how do you troubleshoot these "dropped mutations"?
Strangely, we couldn't reproduce this behavior even at a later point.
On writes, if enough replicas respond within write_request_timeout_in_ms (2 seconds by default) you will see successful responses at the client.
So consider that case where you are writing with consistency QUORUM with a replication factor of 3. When a write is sent from a client to the coordinator, the coordinator sends a write request to all three replicas simultaneously. If 2 replicas are able to respond within write_request_timeout_in_ms, the coordinator will then send a successful response back to the client. Meanwhile, if the third replica is not able to begin processing the write mutation within write_request_timeout_in_ms it will drop the mutation.
In this scenario, the fact that the mutation was dropped is not visible to the client, but that's OK from the client perspective! All you asked for was a quorum of nodes to acknowledge the write.
From an operational perspective, however, this is a cause for concern. You have replicas that aren't even able to start working on the mutation until after the timeout has elapsed, and that's not good!
There are multiple possible causes for this: garbage collection thrashing, hardware issues, or maybe your cluster is simply under-provisioned. Monitoring for dropped mutations to identify these situations is a good step towards understanding what is happening.
If you are worried about consistency issues between replicas, Cassandra employs multiple anti-entropy mechanisms to get into a consistent state. If inconsistencies are identified while reading data, read repair will get the involved replicas into a consistent state by applying the cells with the highest timestamp. Even if the data does match between the required replicas, a read repair may still be triggered based on the table's configured read_repair_chance to ensure consistent data among all replicas. You should run scheduled repairs as well.
One last note, in the case that not enough replicas respond to meet your consistency level, you will see WriteTimeoutExceptions surfaced to the client. This could mean that your replicas are dropping mutations, but that isn't necessarily the case. They could have begun processing the mutation, but not completed processing within the timeout. In this case, the write will be applied on those replicas.
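A small sketch of what the client sees in both cases, using the Java driver (the table and column names are hypothetical):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class QuorumWriteExample {
    static void write(Session session) {
        SimpleStatement stmt = new SimpleStatement(
            "INSERT INTO users (user_id, name) VALUES ('u1', 'alice')");
        stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
        try {
            session.execute(stmt);
            // Success only means a quorum of replicas acknowledged in time; a third
            // replica may still have dropped the mutation without the client noticing.
        } catch (WriteTimeoutException e) {
            // Not enough replicas answered within write_request_timeout_in_ms. The
            // write may still have been applied on the replicas that did process it.
            System.err.printf("Got %d of %d required acks at %s%n",
                e.getReceivedAcknowledgements(),
                e.getRequiredAcknowledgements(),
                e.getConsistencyLevel());
        }
    }
}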

Cassandra : Batch write optimisation

I get a bulk write request for, let's say, some 20 keys from the client.
I can either write them to C* in one batch or write them individually in an async way and wait on the futures for them to complete.
Writing in a batch does not seem to be a good option as per the documentation, as my insertion rate will be high and, if keys belong to different partitions, the coordinator will have to do extra work.
Is there a way in the DataStax Java driver with which I can group
keys that belong to the same partition, club them into small batches,
and then do individual unlogged batch writes asynchronously? That way
I make fewer RPC calls to the server and, at the same time, the
coordinator only has to write locally. I will be using a token-aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so some coordination happens on the driver side.
Then, you could group your requests by equality of partition key; that would probably be enough, depending on your workload.
What I mean by 'grouping by equality of partition key' is, e.g., you have some data that looks like
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then when inserting several such objects, you group them by MyData.partitioningKey. That is, for all existing partitioningKey values, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
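Here is a minimal sketch of that grouping, assuming a hypothetical my_table with the columns from MyData above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

public class GroupedBatchWriter {

    // Minimal stand-in for the MyData shape described above.
    static class MyData {
        final String partitioningKey;
        final String clusteringKey;
        final String otherValue;
        MyData(String p, String c, String o) {
            this.partitioningKey = p; this.clusteringKey = c; this.otherValue = o;
        }
    }

    static List<ResultSetFuture> writeGrouped(Session session, List<MyData> rows) {
        PreparedStatement insert = session.prepare(
            "INSERT INTO my_table (partitioning_key, clustering_key, other_value) VALUES (?, ?, ?)");

        // Group rows by partition key so each batch touches a single partition.
        Map<String, BatchStatement> batches = new HashMap<>();
        for (MyData row : rows) {
            batches
                .computeIfAbsent(row.partitioningKey,
                    k -> new BatchStatement(BatchStatement.Type.UNLOGGED))
                .add(insert.bind(row.partitioningKey, row.clusteringKey, row.otherValue));
        }

        // Execute each single-partition batch asynchronously; with TokenAwarePolicy the
        // driver routes each batch to a replica that owns that partition.
        List<ResultSetFuture> futures = new ArrayList<>();
        for (BatchStatement batch : batches.values()) {
            futures.add(session.executeAsync(batch));
        }
        return futures;
    }
}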
If you wish to go further and mimic Cassandra's hashing, then you should look at the cluster metadata via the getMetadata method of the com.datastax.driver.core.Cluster class; there is a getTokenRanges method whose result you can compare to the result of Murmur3Partitioner.getToken, or of any other partitioner you configured in cassandra.yaml. I've never tried that myself though.
So, I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than without batches, let alone batches without grouping.
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on the partition key distribution. If your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
You can find sample code on the above site showing how to handle multiple async writes:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
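For orientation only, a minimal sketch of the individual-async-writes approach with the Java driver (my_table and its columns are assumptions, and the back-pressure is deliberately naive):

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

public class AsyncWriter {
    static void writeIndividually(Session session, List<String> keys) {
        PreparedStatement insert = session.prepare(
            "INSERT INTO my_table (partitioning_key, other_value) VALUES (?, ?)");

        List<ResultSetFuture> futures = new ArrayList<>();
        for (String key : keys) {
            BoundStatement bound = insert.bind(key, "some value");
            futures.add(session.executeAsync(bound));

            // Naive back-pressure: don't keep too many writes in flight at once.
            if (futures.size() >= 128) {
                for (ResultSetFuture f : futures) {
                    f.getUninterruptibly();
                }
                futures.clear();
            }
        }
        // Wait for the remaining writes to finish.
        for (ResultSetFuture f : futures) {
            f.getUninterruptibly();
        }
    }
}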
EDIT:
please read this also:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The
coordinator doesn’t have any extra work (as for multi partition
writes) because everything goes into a single partition. Single
partition batches are optimized: they are applied with a single
RowMutation [10].
In a few words: single partition batches don’t put much more load on
the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this
very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That
batch log is replicated to two other nodes in case the coordinator
fails. If the coordinator fails then another replica for the batch log
will take over. [..] The coordinator has to do a lot more work than
any other node in the cluster.
Again, in bullets what has to be done:
serialize the batch statements
write the serialized batch to the batch log system table
replicate this serialized batch to 2 nodes
coordinate writes to nodes holding the different partitions
on success remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions have been deprecated since Cassandra 2.1.6.

Cassandra batch isolation guarantee

I have a question regarding Cassandra batch isolation:
Our cluster consist of a single datacenter, replication factor of 3, reading and writing in LOCAL_QUORUM.
We must provide a news feed resembling an 'after' trigger, to notify clients about CRUD events of data in the DB.
We thought of performing the actual operation, and inserting an event on another table (also in another partition), within a batch. Asynchronously, some process would read events from event table and send them through an MQ.
Because we're writing to different partitions, and operation order is not necessarily maintained in a batch operation, is there a chance our event is written and our process reads it before our actual data is persisted?
Could the same happen in case our batch ultimately fails?
Regards,
Alejandro
From ACID properties, Cassandra can provide ACD. Therefore, don't expect Isolation in its classical sense.
Batching records will provide you with Atomicity. So it does guarantee that all or none of the records within a batch are written. However, because it doesn't guarantee Isolation, you can end up with some of the records persisted and others not (e.g. written to your queue table, but not to the master table).
Cassandra docs explain how it works:
To achieve atomicity, Cassandra first writes the serialized batch to the batchlog system table that consumes the serialized batch as blob data. When the rows in the batch have been successfully written and persisted (or hinted) the batchlog data is removed. There is a performance penalty for atomicity.
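Applied to the use case above, a hedged sketch might look like this (master_data and events are hypothetical tables): the logged batch gives you atomicity for both inserts, but a concurrent reader may still see one row before the other.

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class MasterPlusEventWrite {
    static void save(Session session) {
        // Logged batch: both rows will eventually be written (atomicity), but a reader
        // may observe the event row before the master row, or vice versa (no isolation).
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        batch.add(new SimpleStatement(
            "INSERT INTO master_data (id, payload) VALUES ('42', 'new value')"));
        batch.add(new SimpleStatement(
            "INSERT INTO events (day, event_id, kind, target_id) "
          + "VALUES ('2016-01-13', now(), 'UPDATE', '42')"));
        session.execute(batch);
    }
}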
Finally, using a Cassandra table as an MQ is considered an anti-pattern.

Cassandra Batches with if not exists condition

I'm sending a batch of inserts to a single table, where each row has a unique key and an IF NOT EXISTS condition, and there is a problem whenever even one of the rows already exists.
I need the conditions to be applied per row and not to the whole batch.
Let's say I have a table "users" with only one column, "user_name", which contains the row "jhon". Now I'm trying to import new users:
BEGIN BATCH
INSERT INTO "users" ("user_name") VALUES ("jhon") IF NOT EXISTS;
INSERT INTO "users" ("user_name") VALUES ("mandy") IF NOT EXISTS;
APPLY BATCH;
It will not insert "mandy" because "jhon" already exists. What can I do to isolate them?
I have a lot of rows to insert, about 100-200K, so I need to use batches.
Thanks!
First: what you describe is documented as intended behavior:
In Cassandra 2.0.6 and later, you can batch conditional updates introduced as lightweight transactions in Cassandra 2.0. Only updates made to the same partition can be included in the batch because the underlying Paxos implementation works at the granularity of the partition. You can group updates that have conditions with those that do not, but when a single statement in a batch uses a condition, the entire batch is committed using a single Paxos proposal, as if all of the conditions contained in the batch apply.
That basically confirms: your updates are to different partitions, so only one Paxos proposal is going to be used, which means the entire batch will succeed, or none of it will.
That said, with Cassandra, batches aren't meant to speed up bulk loading; they're meant to create pseudo-atomic logical operations. From http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html :
Batches are often mistakenly used in an attempt to optimize performance. Unlogged batches require the coordinator to manage inserts, which can place a heavy load on the coordinator node. If other nodes own partition keys, the coordinator node needs to deal with a network hop, resulting in inefficient delivery. Use unlogged batches when making updates to the same partition key.
The coordinator node might also need to work hard to process a logged batch while maintaining consistency between tables. For example, upon receiving a batch, the coordinator node sends batch logs to two other nodes. In the event of a coordinator failure, the other nodes retry the batch. The entire cluster is affected. Use a logged batch to synchronize tables, as shown in this example:
In your schema, each INSERT is to a different partition, which is going to add a LOT of load on your coordinator.
You can run your 200k inserts with a client using async executes, and they'll run quite fast, probably as fast as (or faster than) you'd see with a batch.
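A minimal sketch of that async approach, keeping the per-row IF NOT EXISTS condition (the helper method and its names are made up for the example):

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

public class ConditionalImport {
    static List<String> importUsers(Session session, List<String> userNames) {
        PreparedStatement insert = session.prepare(
            "INSERT INTO users (user_name) VALUES (?) IF NOT EXISTS");

        List<ResultSetFuture> futures = new ArrayList<>();
        for (String name : userNames) {
            futures.add(session.executeAsync(insert.bind(name)));
        }

        // Each insert carries its own IF NOT EXISTS condition, so "jhon" failing
        // does not prevent "mandy" from being inserted.
        List<String> rejected = new ArrayList<>();
        for (int i = 0; i < futures.size(); i++) {
            if (!futures.get(i).getUninterruptibly().wasApplied()) {
                rejected.add(userNames.get(i));
            }
        }
        return rejected;
    }
}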

Cassandra's atomicity and "rollback"

The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we only receive 1 ACK (before the timeout). So it's clear that if we read with consistency ONE, we may read the value, OK.
But which of the following statements is also true:
It may occur that the write has been persisted on a second node, but the node's ACK got lost? (Note: this could result in a read of the value even at read consistency QUORUM!)
It may occur that the write will be persisted later on a second node (e.g. due to hinted handoff)? (Note: this could result in a read of the value even at read consistency QUORUM!)
It's impossible that the write is persisted on a second node, and the written value will eventually be removed from the node via read repair?
It's impossible that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise your second statement is correct: the coordinator will store a hint so that the failed node (replica) gets this row replicated to it later.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure

Resources