What mechanisms does Delta Lake use to ensure the atomicity, consistency, isolation, and durability of transactions initiated by user operations on a DeltaTable?
0. The DeltaLog
DeltaLog = Delta Lake's transaction log.
The DeltaLog is an ordered collection of JSON files. It acts as the single source of truth, giving users access to the latest version of a DeltaTable's state.
1. Atomicity
Delta Lake breaks down every operation performed by a user into commits, each composed of actions.
A commit is recorded in the DeltaLog only once all of its actions have completed successfully (otherwise it is reverted and retried, or an error is thrown), which makes it atomic.
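A minimal sketch of the "record only after every action succeeded" idea, not Delta Lake's actual code. The function name, the shape of `actions` and the local directory standing in for the log are assumptions made for illustration; the zero-padded JSON file name mirrors how DeltaLog entries are named. Exclusive file creation is used here so that a given version can be recorded only once; a real object store would need its own put-if-absent mechanism.

```python
import json
import os


def commit(log_dir, version, actions):
    """Record a commit only after every action has succeeded.

    `actions` is a list of zero-argument callables, each returning a
    JSON-serializable record of what it did (hypothetical shape).
    """
    # Run all actions first; if any of them raises, nothing is logged.
    records = [action() for action in actions]

    # Mode "x" raises FileExistsError if the file already exists,
    # so the same version number cannot be recorded twice.
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as log_file:
        for record in records:
            log_file.write(json.dumps(record) + "\n")
    return path
```

If one of the callables raises, the exception propagates before the log file is created, so readers of the log never see a partially applied commit.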
2. Consistency
The consistency of a DeltaTable is guaranteed by its strong schema checking.
3. Isolation
Concurrent commits are managed so that they stay isolated from one another, using optimistic concurrency control:
When a commit starts executing, the thread takes a snapshot of the current DeltaLog.
When the commit's actions have completed, the thread checks whether another thread has updated the DeltaLog in the meantime:
If not, it records the commit in the DeltaLog.
Otherwise, it updates its view of the DeltaTable and tries again to record the commit, reprocessing first if needed.
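The snapshot / validate / retry loop described above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not Delta Lake's implementation: `TransactionLog`, `commit_with_occ`, `build_commit` and `conflicts_with` are all hypothetical names, and the log is just a Python list guarded by a lock.

```python
import threading


class TooManyConflicts(Exception):
    """Raised when the commit keeps losing the race (hypothetical)."""


class TransactionLog:
    """In-memory stand-in for the ordered commit log."""

    def __init__(self):
        self._commits = []
        self._lock = threading.Lock()

    def version(self):
        with self._lock:
            return len(self._commits)

    def commits_since(self, version):
        with self._lock:
            return list(self._commits[version:])

    def try_append(self, expected_version, commit):
        """Append only if nobody has committed since expected_version."""
        with self._lock:
            if len(self._commits) != expected_version:
                return False
            self._commits.append(commit)
            return True


def commit_with_occ(log, build_commit, conflicts_with, max_attempts=10):
    """Snapshot the log, do the work, then validate and record it."""
    read_version = log.version()         # snapshot of the log
    commit = build_commit(read_version)  # perform the commit's actions
    for _ in range(max_attempts):
        if log.try_append(read_version, commit):
            return read_version          # position at which the commit landed
        # Another writer got there first: catch up on what it recorded.
        others = log.commits_since(read_version)
        read_version += len(others)
        if any(conflicts_with(commit, other) for other in others):
            # Reprocess against the newer state before trying again.
            commit = build_commit(read_version)
    raise TooManyConflicts("gave up after repeated concurrent commits")
```

The work is redone only when a concurrent commit actually conflicts; otherwise the same commit is simply re-submitted at the next position.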
4. Durability
Commits containing actions that mutate the DeltaTable's data need to finish their writes/deletions on underlying Parquet files (stored on the filesystem) to be considered as successfully completed, making them durable.
Further reading:
Diving Into Delta Lake: Unpacking The Transaction Log
ACID properties
Related
I have a Delta table into which multiple Databricks jobs can merge/upsert data concurrently.
How can I avoid getting a ConcurrentAppendException?
I cannot use this solution, as the incoming changes can be a part of any partition and I cannot filter any partition.
Is there a way to check whether the Delta table is being appended to/merged/updated/deleted, wait until that has completed, and only then acquire the locks and start the merge for the second job?
Just FYI, these are 2 independent Azure Data Factory jobs trying to update one Delta table.
Cheers!
You should handle concurrent appends to Delta as you would with any other data store that uses optimistic offline locking: add application-specific retry logic to your code for whenever that particular exception happens.
Here's a good video on the inner workings of Delta.
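As a rough sketch of the retry logic suggested above: the helper below re-runs an operation when it fails with a concurrency exception, backing off between attempts. `run_with_retry` and its parameters are hypothetical, and the exception class is a local stand-in so the snippet runs on its own; in a real job you would catch the exception class your Delta client actually raises and pass in a function that performs the merge.

```python
import random
import time


class ConcurrentAppendException(Exception):
    """Local stand-in for the exception raised on a concurrent append."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a concurrency conflict, wait and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            # Exponential backoff plus jitter, so that two jobs that
            # collided once do not keep retrying in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The jitter matters in the two-job scenario from the question: without it, both jobs would wait the same amount of time and likely collide again.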
I am reading Cassandra: The Definitive Guide, 3rd edition. It has the following text:
The serial consistency level can apply on reads as well. If Cassandra detects that a query is reading data that is part of an uncommitted transaction, it commits the transaction as part of the read, according to the specified serial consistency level.
Why does a read commit an uncommitted transaction, and doesn't that interfere with the writer's ability to roll back?
https://community.datastax.com/questions/5769/why-a-read-is-committing-an-uncommitted-transactio.html
Committed means that a mutation (INSERT, UPDATE or DELETE) has been added to the commitlog.
Uncommitted is when a mutation is still in the process of being saved to the commitlog.
In order for the LWT to provide guarantees such as IF EXISTS or IF NOT EXISTS, it has to add to the commitlog any data that another in-flight operation has not yet written there.
Here, uncommitted data doesn't mean a failed write. Uncommitted data is data that was successfully written to some node in the cluster but has not yet been updated on the current node.
Here, "it commits the transaction as part of the read" means that Cassandra will initiate a read repair and update the data on the node before sending the data back to the client.
Rollback is not in the picture here, because the write was successful and this only concerns the replication of data across nodes.
When a new member joins a cluster, table repartitioning and data merge will happen.
If the data is large, I believe it will take some time. While it is happening, what is the state of the cache like?
If I am using embedded mode, does it block my application until the merge is complete? Or, if I don't want to work with an incomplete cache, do I need to wait (somehow) before starting my application's operations?
Partition migration will start as soon as the member joins the cluster. It will not block your application because it will progress asynchronously in the background.
Only mutating operations that fall into a migrating partition are blocked. Read-only operations are not blocked.
Mutating operations will get a PartitionMigrationException, which is a RetryableHazelcastException, so they will be retried for 2 minutes by default. If your partitions are small, migrating a partition will take less time. You can increase the partition count via the system property hazelcast.partition.count.
If you want to block your application until all migrations finish, you can check the isClusterSafe method to make sure there are no migrating partitions in the cluster. But beware that isClusterSafe returns the status of the cluster rather than of the current member, so it might not be something to rely on. Instead, I would recommend not blocking the application while partitions are migrating.
I have a question regarding Cassandra batch isolation:
Our cluster consists of a single datacenter with a replication factor of 3, reading and writing at LOCAL_QUORUM.
We must provide a news feed resembling an 'after' trigger, to notify clients about CRUD events of data in the DB.
We thought of performing the actual operation, and inserting an event on another table (also in another partition), within a batch. Asynchronously, some process would read events from event table and send them through an MQ.
Because we're writing to different partitions, and operation order is not necessarily maintained in a batch operation, is there a chance that our event is written and our process reads it before the actual data is persisted?
Could the same happen if our batch ultimately fails?
Regards,
Alejandro
Of the ACID properties, Cassandra can provide A, C and D. Therefore, don't expect isolation in its classical sense.
Batching records will provide you with atomicity, so it does guarantee that either all or none of the records within a batch are written. However, because it doesn't guarantee isolation, a reader can observe some of the records persisted and others not yet (e.g. written to your queue table, but not to the master table).
Cassandra docs explain how it works:
To achieve atomicity, Cassandra first writes the serialized batch to the batchlog system table that consumes the serialized batch as blob data. When the rows in the batch have been successfully written and persisted (or hinted) the batchlog data is removed. There is a performance penalty for atomicity.
Finally, using a Cassandra table as an MQ is considered an anti-pattern.
How can atomic batches guarantee that either all statements in a single batch will be executed or none?
In order to understand how batches work under the hood, it's helpful to look at the individual stages of batch execution.
The client
Batches are supported using CQL3 or modern Cassandra client APIs. In each case you'll be able to specify a list of statements you want to execute as part of the batch, a consistency level to be used for all statements and an optional timestamp. You'll be able to batch execute INSERT, DELETE and UPDATE statements. If you choose not to provide a timestamp, the current time is automatically used and associated with the batch.
The client will have to handle two exceptions in case the batch could not be executed successfully.
UnavailableException - there are not enough nodes alive to fulfill any of the updates with specified batch CL
WriteTimeoutException - timeout while either writing the batchlog or applying any of the updates within the batch. This can be checked by reading the writeType value of the exception (either BATCH_LOG or BATCH).
Failed writes during the batchlog stage will be retried once automatically by the DefaultRetryPolicy in the Java driver. Batchlog creation is critical to ensure that a batch will always be completed in case the coordinator fails mid-operation. Read on to find out why.
The coordinator
All batches sent by the client will be executed by the coordinator just as with any write operation. What's different from normal write operations is that Cassandra will also make use of a dedicated log that contains all pending batches currently being executed (called the batchlog). This log is stored in the local system keyspace and is managed by each node individually. Each batch execution starts by creating a log entry with the complete batch on preferably two nodes other than the coordinator. Once the coordinator has been able to create the batchlog on the other nodes, it will start to execute the actual statements in the batch.
Each statement in the batch will be written to the replicas using the CL and timestamp of the whole batch. Apart from that, there's nothing special about the writes happening at this point. Writes may also be hinted or throw a WriteTimeoutException, which can be handled by the client (see above).
After the batch has been executed, all created batchlogs can be safely removed. Therefore, upon successful execution the coordinator will send a batchlog delete message to the nodes that received the batchlog earlier. This happens in the background and will go unnoticed in case it fails.
Let's wrap up what the coordinator does during batch execution:
sends the batchlog to two other nodes (preferably in different racks)
executes all statements in the batch
deletes the batchlog from those nodes again after successful batch execution
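The three coordinator steps can be illustrated with a toy model. This is only a sketch with plain dictionaries standing in for the batchlog nodes and the replicated table; `execute_logged_batch` and its `crash_after` switch are hypothetical and exist only to show why the batchlog is written first.

```python
def execute_logged_batch(batch_id, mutations, batchlog_nodes, table,
                         crash_after=None):
    """Run a batch following the coordinator steps listed above.

    batchlog_nodes: list of dicts, one per node's local batchlog.
    table: dict standing in for the replicated data.
    crash_after: if set, simulate the coordinator dying after that
        many statements have been written.
    Returns True if the batch completed and the batchlogs were removed.
    """
    # 1. Store the complete batch on the batchlog nodes before any data
    #    is touched.
    for batchlog in batchlog_nodes:
        batchlog[batch_id] = list(mutations)

    # 2. Execute every statement in the batch.
    for done, (key, value) in enumerate(mutations):
        if crash_after is not None and done >= crash_after:
            return False  # batchlog entries are left behind on purpose
        table[key] = value

    # 3. Remove the batchlog entries once everything has been written.
    for batchlog in batchlog_nodes:
        batchlog.pop(batch_id, None)
    return True
```

When the simulated coordinator dies halfway through, the table holds only part of the batch but both batchlog dictionaries still contain the full batch, which is what lets another node finish the job later.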
The batchlog replica nodes
As described above, the batchlog will be replicated across two other nodes (if the cluster size allows it) before batch execution. The idea is that any of these nodes will be able to pick up pending batches in case the coordinator goes down before finishing all statements in the batch.
What makes things a bit complicated is the fact that those nodes won't notice that the coordinator is no longer alive. The only point at which the batchlog nodes are updated with the current status of the batch execution is when the coordinator issues a delete message indicating that the batch has been successfully executed. If such a message doesn't arrive, the batchlog nodes will assume the batch hasn't been executed for some reason and will replay the batch from the log.
Batchlog replay potentially takes place every minute, i.e. that is the interval at which a node checks whether there are any pending batches in the local batchlog that haven't been deleted by the (possibly killed) coordinator. To give the coordinator some time between the batchlog creation and the actual execution, a fixed grace period is used (write_request_timeout_in_ms * 2, default 4 sec). If the batchlog still exists after 4 sec, it will be replayed.
Just as with any write operation in Cassandra, timeouts may occur. In this case the node will fall back to writing hints for the timed-out operations. When the timed-out replicas are up again, writes can resume from the hints. This behavior doesn't seem to be affected by whether hinted_handoff_enabled is set or not. There's also a TTL value associated with the hint, which will cause the hint to be discarded after a longer period of time (the smallest GCGraceSeconds of any involved CF).
Now you might be wondering whether it isn't potentially dangerous to replay a batch on two nodes at the same time, which may happen as we replicate the batchlog on two nodes. What's important to keep in mind here is that each batch execution will be idempotent, due to the limited kinds of supported operations (updates and deletes) and the fixed timestamp associated with the batch. There won't be any conflicts even if both nodes and the coordinator retry executing the batch at the same time.
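To see why replaying the same batch several times is harmless, here is a small sketch that models writes as last-write-wins cells keyed by timestamp. It is an illustration under that assumption, not Cassandra code; `apply_mutation`, `replay_batch` and the batch dictionary layout are hypothetical.

```python
def apply_mutation(table, key, value, timestamp):
    """Last-write-wins: keep the cell with the highest timestamp."""
    current = table.get(key)
    if current is None or timestamp >= current[0]:
        table[key] = (timestamp, value)


def replay_batch(table, batch):
    """Apply every mutation using the batch's single, fixed timestamp."""
    for key, value in batch["mutations"]:
        apply_mutation(table, key, value, batch["timestamp"])


if __name__ == "__main__":
    table = {}
    batch = {"timestamp": 100, "mutations": [("a", 1), ("b", 2)]}

    # The coordinator and both batchlog nodes all run the same batch.
    for _ in range(3):
        replay_batch(table, batch)
    print(table)  # same state as after a single run

    # A newer write is not clobbered by a late replay of the old batch.
    apply_mutation(table, "a", 9, 200)
    replay_batch(table, batch)
    print(table["a"])
```

Because every replay carries the same timestamp and the same values, running it once or three times leaves the table in the same state, and a later write with a newer timestamp still wins over a delayed replay.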
Atomicity guarantees
Let's get back to the atomicity aspects of "atomic batches" and review what exactly is meant by atomic (source):
"(Note that we mean “atomic” in the database sense that if any part of
the batch succeeds, all of it will. No other guarantees are implied;
in particular, there is no isolation; other clients will be able to
read the first updated rows from the batch, while others are in
progress."
So in a sense we get "all or nothing" guarantees. In most cases the coordinator will just write all the statements in the batch to the cluster. However, in case of a write timeout, we must check at which point the timeout occurred by reading the writeType value. The batch must have been written to the batchlog in order to be sure that those guarantees still apply. Also, at this point other clients may read partially executed results from the batch.
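A client-side decision based on the writeType value might look roughly like this. The classes below are local stand-ins written for this sketch (they only mimic the idea of a timeout exception carrying a write type), and the returned labels are hypothetical; check your driver's documentation for the real exception and enum names.

```python
from enum import Enum


class WriteType(Enum):
    """Stand-in for the two write types mentioned in the answer."""
    BATCH_LOG = "BATCH_LOG"
    BATCH = "BATCH"


class WriteTimeout(Exception):
    """Stand-in for a driver timeout exception that carries a write type."""

    def __init__(self, write_type):
        super().__init__(f"write timeout during {write_type.value}")
        self.write_type = write_type


def handle_batch_timeout(exc):
    """Decide what a client could do after a batch write timed out."""
    if exc.write_type is WriteType.BATCH_LOG:
        # The batchlog may never have been written, so the atomicity
        # guarantee cannot be assumed: the batch has to be retried.
        return "retry"
    # The timeout happened while applying the statements, after the
    # batchlog was written: the batch is expected to complete eventually.
    return "will-complete"
```

The asymmetry follows from the paragraph above: only a batch that made it into the batchlog is covered by the "all or nothing" guarantee.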
Getting back to the question, how can Cassandra guarantee that either all or no statements at all in a batch will be executed?
Atomic batches basically depend on successful replication and idempotent statements. It's not a 100% guaranteed solution, as in theory there might be scenarios that will still cause inconsistencies. But for a lot of use cases in Cassandra it's a very useful tool if you're aware of how it works.
Batch documentation (doc):
In Cassandra 1.2 and later, batches are atomic by default. In the context of a Cassandra batch operation, atomic means that if any of the batch succeeds, all of it will. To achieve atomicity, Cassandra first writes the serialized batch to the batchlog system table that consumes the serialized batch as blob data. When the rows in the batch have been successfully written and persisted (or hinted) the batchlog data is removed. There is a performance penalty for atomicity. If you do not want to incur this penalty, prevent Cassandra from writing to the batchlog system by using the UNLOGGED option: BEGIN UNLOGGED BATCH
Cassandra batches:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
To add to the above answers:
With Cassandra 2.0, you can combine batch statements with LWT. The restriction, though, is that all DML statements must be on the same partition.