I have read over several documents regarding the Cassandra commit log and, to me, there is conflicting information regarding this "structure(s)". The diagram shows that when a write occurs, Cassandra writes to the memtable and commit log. The confusing part is where this commit log resides.
The diagram that I've seen over-and-over shows the commit log on disk. However, if you do some more reading, they also talk about a commit log buffer in memory - and that piece of memory is flushed to disk every 10 seconds.
DataStax Documentation states:
"When a write occurs, Cassandra stores the data in a memory structure called memtable, and to provide configurable durability, it also appends writes to the commit log buffer in memory. This buffer is flushed to disk every 10 seconds".
Nowhere in their diagram do they show a memory structure called a commit log buffer. They only show the commit log residing on disk.
It also states:
"When a write occurs, Cassandra stores the data in a structure in memory, the memtable, and also appends writes to the commit log on disk."
So I'm confused by the above. Is it written to the commit log memory buffer, which is eventually flushed to disk (which I would assume is also called the "commit log"), or is it written to the memtable and commit log on disk?
Apache's documentation states this:
"Instead, like other modern systems, Cassandra provides durability by appending writes to a commitlog first. This means that only the commitlog needs to be fsync'd, which, if the commitlog is on its own volume, obviates the need for seeking since the commitlog is append-only. Implementation details are in ArchitectureCommitLog.
Cassandra's default configuration sets the commitlog_sync mode to periodic, causing the commitlog to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time."
What I have inferred from the Apache statement is that ONLY because of the asynchronous nature of writes (acknowledgement of a cache write) could you lose data (it even states you can lose data if all replicas crash before it is flushed/sync'd).
I'm not sure what I can infer from the DataStax documentation and diagram as they've mentioned two different statements regarding the commit log - one in memory, one on disk.
Can anyone clarify what I consider a poorly worded and conflicting set of documentation?
I'll assume there is a commit log buffer, as they both reference it (yet DataStax doesn't show it in the diagram). How and when it is managed is, I think, key to understanding this.
Generally when explaining the write path, the commit log is characterized as a file, and it is true that the commit log is the on-disk storage mechanism that provides durability. The confusion creeps in when the explanation goes deeper and starts talking about the buffer cache and the need to issue fsyncs. The reference to "commit log buffer in memory" is talking about the OS buffer cache, not a memory structure in Cassandra. You can see in the code that there's not a separate in-memory structure for the commit log, but rather the mutation is serialized and written to a file-backed buffer.
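To make the difference between "written to the file" and "fsynced to disk" concrete, here is a minimal Python sketch. It is not Cassandra's code; the function names, the record format (4-byte length prefix) and the file name are all made up for illustration. The point is only that a write can return while the bytes still sit in a buffer, and that a separate sync step is what asks the OS to persist them.

```python
import os
import tempfile


def append_mutation(log_file, payload: bytes) -> None:
    """Append a length-prefixed record (hypothetical format).

    After this returns, the bytes may still be in a user-space or OS buffer.
    """
    log_file.write(len(payload).to_bytes(4, "big"))
    log_file.write(payload)


def sync_to_disk(log_file) -> None:
    """Flush user-space buffers to the OS, then ask the OS to persist them."""
    log_file.flush()
    os.fsync(log_file.fileno())


def read_records(path):
    """Read back every complete record from the log file."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            size = int.from_bytes(header, "big")
            records.append(f.read(size))
    return records


def demo():
    # Hypothetical segment file in a throwaway directory.
    path = os.path.join(tempfile.mkdtemp(), "commitlog-segment.log")
    with open(path, "ab") as log_file:
        append_mutation(log_file, b"INSERT k1=v1")
        append_mutation(log_file, b"INSERT k2=v2")
        sync_to_disk(log_file)  # only now is durability requested
    return read_records(path)
```

Calling demo() returns the two records in append order. How often the sync step runs relative to the appends is exactly what the commitlog_sync setting controls.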
Cassandra comes with two strategies for managing fsync on the commit log.
commitlog_sync
(Default: periodic) The method that Cassandra uses to acknowledge writes in milliseconds:
periodic: (Default: 10000 milliseconds [10 seconds])
Used with commitlog_sync_period_in_ms to control how often the commit log is synchronized to disk. Periodic syncs are acknowledged immediately.
batch: (Default: disabled)
Used with commitlog_sync_batch_window_in_ms (Default: 2 ms) to control how long Cassandra waits for other writes before performing a sync. When using this method, writes are not acknowledged until fsynced to disk.
The periodic setting offers better performance at the cost of a small increase in the chance that data can be lost. The batch setting guarantees durability at the cost of latency.
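The trade-off between the two modes can be modelled in a few lines of Python. This is a toy model under my own assumptions, not how Cassandra is implemented: the class CommitLogSketch and its methods are hypothetical, and sync() stands in for the fsync that would be triggered by the timer (periodic) or by the batch window (batch).

```python
class CommitLogSketch:
    """Toy model of when a write is acknowledged relative to the fsync."""

    def __init__(self, mode: str):
        if mode not in ("periodic", "batch"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.unsynced = []  # appended, but not yet fsynced
        self.durable = []   # covered by a completed fsync
        self.acked = []     # acknowledged to the client

    def write(self, mutation):
        self.unsynced.append(mutation)
        if self.mode == "periodic":
            # periodic: acknowledge right away, the sync happens later
            self.acked.append(mutation)

    def sync(self):
        """Stands in for the fsync of everything appended so far."""
        synced, self.unsynced = self.unsynced, []
        self.durable.extend(synced)
        if self.mode == "batch":
            # batch: acknowledge only after the data has been fsynced
            self.acked.extend(synced)

    def lost_on_crash(self):
        """Acknowledged writes that a crash right now would lose on this node."""
        return [m for m in self.acked if m not in self.durable]
```

In periodic mode a write shows up in lost_on_crash() until the next sync(); in batch mode that list is always empty, but nothing is acknowledged until sync() has run.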
Related
Does Cassandra acknowledge a write as soon as it is written to the commit log, or does it wait for the write to be written to the memtable as well before sending success to the client?
Write success occurs when data is written to commitlog and memtable.
Again, for a multi-node cluster with RF > 1, it depends on the consistency level you set for writes.
Per DSE Architecture guide:
The write consistency level determines how many replica nodes must respond with a success acknowledgment for the write to be considered successful. Success means data was written to the commit log and the memtable.
And it makes sense, because it may not be possible to write data to the memtable, for example, if all memtables are still flushing and there is no space left for a new write. In this case, Cassandra will return an error. Writing to the commit log just lowers the chance that you lose the data if the machine loses power or the process crashes.
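As a rough illustration of how the consistency level turns per-replica results into one success or failure, here is a small Python sketch. The helper names required_acks and write_succeeds are hypothetical, only a handful of consistency levels are covered, and each replica's result is reduced to a single boolean meaning "commit log and memtable write both succeeded".

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed (subset of levels only)."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def write_succeeds(consistency_level, replication_factor, replica_results):
    """replica_results: one bool per replica, True if that replica wrote
    the mutation to its commit log and memtable."""
    acks = sum(1 for ok in replica_results if ok)
    return acks >= required_acks(consistency_level, replication_factor)
```

For example, with RF = 3 and one replica failing, write_succeeds("QUORUM", 3, [True, False, True]) is True, while the same results at ALL give False.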
In Cassandra architecture, when we perform a write operation, data is first written in commit log, then into memtable, and when memtable reaches threshold, data is flushed into SSTable.
So at a given time we have 2 copies of the data on a given node: one copy is in the commit log and another copy is either in the memtable or flushed to an SSTable.
So why do we need to have 2 copies? Isn't commit log enough for recovery purposes? Or do they serve totally different purposes? And how are these 3 different from each other?
When you write, Cassandra saves the data to both the commit log and the memtable, which makes the operation very fast. If the node restarts before the data is saved to a persistent SSTable, the data in memory is lost, but it can be recovered from the commit log.
So Cassandra uses memtables and SSTables for lookups, and the commit log allows a node to be restarted at any moment without losing data.
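A toy Python model of the division of labour described above may help. Everything here is hypothetical (the NodeSketch class, its method names, using a list as the "durable" log); it only shows that reads are served from the memtable and SSTables, while the commit log is touched again only on restart.

```python
class NodeSketch:
    """Toy single-node model: commit log for recovery, memtable/SSTables for reads."""

    def __init__(self):
        self.commit_log = []  # stands in for the durable on-disk log
        self.memtable = {}    # in memory, lost on a crash
        self.sstables = []    # immutable on-disk tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # append-only
        self.memtable[key] = value

    def crash(self):
        self.memtable = {}  # memory is gone, the log survives

    def restart(self):
        # Replay the log in order, so later writes overwrite earlier ones.
        for key, value in self.commit_log:
            self.memtable[key] = value

    def read(self, key):
        # Reads never consult the commit log.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            if key in table:
                return table[key]
        return None
```

After a crash(), read() finds nothing for unflushed keys until restart() has replayed the log, at which point the latest value for each key is back in the memtable.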
AFAIK, when Cassandra does a write, it writes to the memtable as well as to the commit log on disk (appending). If the commit message is very small, too small to fill up an SSD page, won't this cause some fragmentation and write amplification in the long run (after the disk fills up)?
This datastax article may answer your question:
https://www.datastax.com/dev/blog/updates-to-cassandras-commit-log-in-2-2
Specifically:
Since version 1.1 a feature of the commit log infrastructure in Cassandra has been the ability to reuse segments. This is done in order to reduce fragmentation on the logging drive -- a number of commitlog segments will be kept reserved by the database for overwriting after the data they contain has been flushed, which means that most of the time the commit log will not need to allocate new space in order to write. This does not eliminate all fragmentation that can be caused by the log, as it will continue writing after its space quota has been reached while memtable flushes are in progress, and afterwards it will release the overallocated space. Still, since less space is allocated and freed, there is a lower chance of introducing fragmentation on the drive.
Been reading up on Cassandra, and I get the feeling that it's REALLY not fault tolerant, is it?
I mean, take a very simple scenario: an incoming write. You write to the WAL, then to the memtable, and then mark in the WAL that the write succeeded. Then the server crashes before the memtable gets full, so it's not flushed to disk as an SSTable, meaning I just lost this write and I won't be able to redo it since it's marked as "Done" in the WAL.
Am I missing something here, or is it really not fault tolerant? That seems very weird to me, since it's used in so many places and for so much data, which makes me think I'm missing something.
The commit log is written to before the memtable. You just write the mutation, there is no marking the mutation as applied to the memtable. The mutation is not removed from the commitlog until after the memtable has been completely flushed to a new sstable.
It is important to know, though, that some commitlog sync strategies do not block the write acknowledgement on the commitlog flush, so you can still have a data loss window that is only protected by RF. So it's important to understand the consistency levels and replication factors for durability as well in those cases. In 4.0+ I think the group commitlog sync is a great option between batch and periodic.
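To illustrate the point about mutations staying in the commit log until the memtable has been flushed, here is a simplified Python sketch. It is a model under my own assumptions, not Cassandra's implementation: the SegmentedLog class is hypothetical, segments are plain dicts, and a flush of a table is assumed to cover everything written to that table so far.

```python
class SegmentedLog:
    """Toy commit log split into segments that track which tables are 'dirty'."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # mutations per segment (hypothetical unit)
        self.segments = [self._new_segment()]

    @staticmethod
    def _new_segment():
        return {"mutations": [], "dirty_tables": set()}

    def append(self, table: str, mutation: str) -> None:
        # No "applied" marker is ever written; the mutation is just appended.
        if len(self.segments[-1]["mutations"]) >= self.capacity:
            self.segments.append(self._new_segment())
        segment = self.segments[-1]
        segment["mutations"].append((table, mutation))
        segment["dirty_tables"].add(table)

    def memtable_flushed(self, table: str) -> None:
        """The table's memtable is now an SSTable, so its log entries are no
        longer needed. Drop every non-active segment with no dirty tables."""
        for segment in self.segments:
            segment["dirty_tables"].discard(table)
        active = self.segments[-1]
        self.segments = [
            s for s in self.segments if s["dirty_tables"] or s is active
        ]
```

A segment only disappears once every table with data in it has been flushed, so a crash at any earlier point still finds the mutation in the log for replay.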
I am currently exploring Cassandra in depth, as I want to specialize in it. I came across the Cassandra "write path" and am now trying to understand the commit log. As I understand it, the write is acknowledged when it is written to the commit log first, then to the memtable (an in-memory table). But commit logs are written to the file system, and so are SSTables. So what is the magic that makes writing to the commit log faster? As it is stated in many posts and documentation:
A write is said to be successful once it is written to the commit log and
memory, so there is very minimal disk I/O at the time of write
Why is it not written to the SSTable and memtable to be considered successful?
SSTables are immutable, so appending to them would be impossible. Therefore writes are sent to both a memtable and the commit log (for durability). Under normal operations the memtable is periodically flushed to disk as an SSTable, after which it is compacted with existing SSTables to make reads more efficient. The commit log is only replayed on node restart to recover writes that had not been flushed to SSTables.
SSTables are created from flushed memtables. While the commit log is synced periodically, memtables are not flushed on a timer. That is because a memtable first needs to hit a certain threshold (i.e. size) before it is written to disk. This makes sure that the created SSTable will be large enough to be handled efficiently. If memtables were flushed periodically a couple of times a minute, we could end up with lots of tiny SSTables that would have to be compacted again.
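The threshold-based flush can be sketched like this in Python. The class name MemtableSketch, the byte-counting rule and the threshold value are all assumptions made for the example; real size accounting is considerably more involved.

```python
class MemtableSketch:
    """Toy memtable that flushes to a sorted 'SSTable' once it is big enough."""

    def __init__(self, flush_threshold_bytes: int):
        self.flush_threshold_bytes = flush_threshold_bytes
        self.rows = {}
        self.size_bytes = 0
        self.sstables = []  # each entry is a sorted list of (key, value)

    def write(self, key: str, value: str) -> None:
        self.rows[key] = value
        # Crude size estimate: every write counts, including overwrites.
        self.size_bytes += len(key) + len(value)
        if self.size_bytes >= self.flush_threshold_bytes:
            self.flush()

    def flush(self) -> None:
        if not self.rows:
            return
        # Written out in key order; the result is never modified afterwards.
        self.sstables.append(sorted(self.rows.items()))
        self.rows = {}
        self.size_bytes = 0
```

With a threshold of 10 bytes, two 4-byte writes stay in memory and the third write triggers a single flush containing all three rows in key order, rather than three tiny tables.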
Writing to Cassandra is so fast because appending to a log is already very fast, and you are also adding to an in-memory data structure such as a B-tree or an AVL tree, which is referred to as a memtable. Memtables are sorted, and when they get written to disk the SSTables remain sorted as well, which makes reading very efficient, though not as fast as writing.
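Why sorted data helps reads can be shown with a small Python sketch using binary search. This is only an illustration: build_sstable and sstable_get are hypothetical helpers, and a real SSTable has indexes, bloom filters and so on that are ignored here.

```python
import bisect


def build_sstable(memtable: dict):
    """Turn an in-memory dict into parallel, key-sorted lists."""
    keys = sorted(memtable)
    values = [memtable[k] for k in keys]
    return keys, values


def sstable_get(sstable, key):
    """Binary search over the sorted keys instead of scanning every row."""
    keys, values = sstable
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```

Because the keys are sorted once at flush time, each lookup takes a logarithmic number of comparisons, whereas the write path never had to keep anything on disk sorted.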
The point to note is that clients never touch the commit log. Its only purpose is to act as a backup. If your machine dies, all the data in the memtable is lost, so the machine then uses the commit log to rebuild the memtable.
You want your reads to be fast, and this is only possible by laying the data out sequentially, which also makes it easier to cache. If you were to write to an SSTable on every write, either you would end up doing random reads, making reads slow, or you would have to wait for the disk to rotate so that you could do sequential writes.