Been reading up on a Cassandra, and I get the feeling thats its REALLY not fault tolerant, is it?
I mean, take a very simple scenario, incoming write, you write to to the WAL, to the memtable and then mark in the WAL that the write succeeded and then the server crashes before the memtable gets full so its not flushed to disk as an SSTable, meaning I just lost this write + I wont be able to redo it since its marked as "Done" in the WAL.
Am I missing something here or is it really not fault tolerant? Which seems very weird to me since its used in so many places and for so much data, which makes me think im missing something.
The commit log is written to before the memtable. You just write the mutation, there is no marking the mutation as applied to the memtable. The mutation is not removed from the commitlog until after the memtable has been completely flushed to a new sstable.
Although it is important to know, with some commitlog strategies they dont block the ack from write on the commitlog flush, so you can still have a data loss window that is only protected with RF. So its important to know the consistency levels and replication factors for durability as well in those cases. In 4.0+ I think the group commitlog sync is great option between batch and periodic.
Related
AFAIK, when Cassandra does a write, it writes to the Memtable as well as writing to the commit log on disk (appending). If the commit message is very small, too small to fill up a SSD page, won't this cause some fragmentation and write amplification in the long run? (After disk fills up)
This datastax article may answer your question:
https://www.datastax.com/dev/blog/updates-to-cassandras-commit-log-in-2-2
Specifically:
Since version 1.1 a feature of the commit log infrastructure in Cassandra has been the ability to reuse segments. This is done in order to reduce fragmentation on the logging drive -- a number of commitlog segments will be kept reserved by the database for overwriting after the data they contain has been flushed, which means that most of the time the commit log will not need to allocate new space in order to write. This does not eliminate all fragmentation that can be caused by the log, as it will continue writing after its space quota has been reached while memtable flushes are in progress, and afterwards it will release the overallocated space. Still, since less space is allocated and freed, there is a lower chance of introducing fragmentation on the drive.
I have read over several documents regarding the Cassandra commit log and, to me, there is conflicting information regarding this "structure(s)". The diagram shows that when a write occurs, Cassandra writes to the memtable and commit log. The confusing part is where this commit log resides.
The diagram that I've seen over-and-over shows the commit log on disk. However, if you do some more reading, they also talk about a commit log buffer in memory - and that piece of memory is flushed to disk every 10 seconds.
DataStax Documentation states:
"When a write occurs, Cassandra stores the data in a memory structure called memtable, and to provide configurable durability, it also appends writes to the commit log buffer in memory. This buffer is flushed to disk every 10 seconds".
Nowhere in their diagram do they show a memory structure called a commit log buffer. They only show the commit log residing on disk.
It also states:
"When a write occurs, Cassandra stores the data in a structure in memory, the memtable, and also appends writes to the commit log on disk."
So I'm confused by the above. Is it written to the commit log memory buffer, which is eventually flushed to disk (which I would assume is also called the "commit log"), or is it written to the memtable and commit log on disk?
Apache's documentation states this:
"Instead, like other modern systems, Cassandra provides durability by appending writes to a commitlog first. This means that only the commitlog needs to be fsync'd, which, if the commitlog is on its own volume, obviates the need for seeking since the commitlog is append-only. Implementation details are in ArchitectureCommitLog.
Cassandra's default configuration sets the commitlog_sync mode to periodic, causing the commitlog to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time."
What I have inferred from the Apache statement is that ONLY because of the asynchronous nature of writes (acknowledgement of a cache write) could you lose data (it even states you can lose data if all replicas crash before it is flushed/sync'd).
I'm not sure what I can infer from the DataStax documentation and diagram as they've mentioned two different statements regarding the commit log - one in memory, one on disk.
Can anyone clarify, what I consider, a poorly worded and conflicting set of documentation?
I'll assume there is a commit log buffer, as they both reference it (yet DataStax doesn't show it in the diagram). How and when this is managed, I think, is a key to understand.
Generally when explaining the write path, the commit log is characterized as a file - and it's true the commit log is the on-disk storage mechanism that provides durability. The confusion is introduced when going deeper and the part about buffer cache and having to issue fsyncs is introduced. The reference to "commit log buffer in memory" is talking about OS buffer cache, not a memory structure in Cassandra. You can see in the code that there's not a separate in-memory structure for the commit log, but rather the mutation is serialized and written to a file-backed buffer.
Cassandra comes with two strategies for managing fsync on the commit log.
commitlog_sync
(Default: periodic) The method that Cassandra uses to acknowledge writes in milliseconds:
periodic: (Default: 10000 milliseconds [10 seconds])
Used with commitlog_sync_period_in_ms to control how often the commit log is synchronized to disk. Periodic syncs are acknowledged immediately.
batch: (Default: disabled)note
Used with commitlog_sync_batch_window_in_ms (Default: 2 ms) to control how long Cassandra waits for other writes before performing a sync. When using this method, writes are not acknowledged until fsynced to disk.
The periodic offers better performance at the cost of a small increase in the chance that data can be lost. The batch setting guarantees durability at the cost of latency.
Currently I am debugging performance issue with Apache Cassandra. when Memtable for the column family is filled, it is queued to be flushed to SSTable. This flushing happens often when you perform massive writes.
When this queue is filled up, writes are blocked until next successful completion of flush. This indicates that your node cannot handle writes it is receiving.
Is there a matrix in nodetool indicating this behaviour? In other words, I want a data indicating a node cannot keep up with writes it is receiving.
Thanks!!
Thats not really true for a couple years. The active memtable is switched and a new memtable takes its position as live. New mutations occur on this live memtable while the "to be flushed" memtables are included in local reads. The MemtableFlushWriter thread pool has the flush tasks queued on it. So you can see how many are pending there (under tpstats). The mutations backing up you can also see under the MutationStage.
Ultimately
nodetool tpstats
Is likely what your looking for.
I want a data indicating a node cannot keep up with writes it is receiving.
Your issue is likely bound to disk I/O not being able to handle the throughput --> flushes of memtables queue up --> writes are blocked
the command dstat is your friend to investigate I/O issues. Some others linux commands may be also handy. Read this excellent blog post from Amy Tobey: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
Is there a matrix in nodetool indicating this behaviour?
nodetool tpstats
I believe you're looking for tp (thread pool) stats.
nodetool tpstats
Typically blocked FlushWriters indicates that your storage system is having trouble keeping up with the write workload. Are you using spinning disks by chance? You'll also want to keep an eye on iostat in this case as well.
Here's the docs for tpstats: https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsTPstats.html
I am currently exploring Cassandra in Depth as I am willing to specialize in it. I came across Cassandra "write path" and now trying to understand the Commit Logs. As I understand the write is acknowledged when it is written to the Commit Log, first, then to MemTable ( An in memory table ). But, if commit logs are written to the FILE SYSTEM, so as SSTables. What is the magical thing that makes writing to commit logs faster or as it is stated in many posts and documentations
A write is said to successful once it is written to the commit log and
memory, so there is very minimal disk I/O at the time of write
Why it is not written to SSTable and MemTable to be considered successful ?
SSTables are immutable, so appending to them would be impossible. Therefore writes are sent to both a memtable and the commit log (for durability). Under normal operations the memtable is periodically flushed to disk as an SSTable, after which it is compacted with existing SSTables to make reads more efficient. The commit log is only replayed on node restart to recover writes that had not been flushed to SSTables.
SSTables are created based on flushed memtables. While the commit log updates do happend periodically, the memtable flushing does not. That is because a memtable first needs to hit a certain treshold (ie. size) before getting written to disk. This makes sure that the created sstable will be large enough to be handled efficiently. In case memtables would be flushed periodically a couple of times a minute, we potentially end up with lots of tiny sstables that would have to be compacted again.
Writing to Cassandra is so fast because writing to a log is already very fast, you are also adding to an in memory datastructure like a b tree or an avl tree which is referred to as a memtable. Memtables are sorted and when they get written to disk, SStables also remain sorted and thus making reading very efficient but not as fast as writing.
The point to note is that clients never touch the commit log. It's only purpose is for creating a backup. If your machine dies then all your data in the memtable is lost. So the machine then uses the commit log to replay back the memtable.
You want your reads to be fast and this is only possible by putting all the data sequentially which also makes it easier to cache data. If you were to write to SStable on every write disk, either you would have to do random reads making reads slow, or you will have to wait for the disk to rotate so that you do sequential writes.
Insert-heavy workloads are CPU-bound in Cassandra before becoming
memory-bound. (All writes go to the commit log, but Cassandra is so
efficient in writing that the CPU is the limiting factor.
Can some body explain me this statement why I/O is not a limiting factor here? I mean as I understand it first heads to I/O and then to CPU.
I took a look at This StackOverflow question or Cassndra Incubator or Apache email chain but still its not clear for me.
Cassandra keeps a log of items, yes that part is I/O. But this log is appended continueusly. Therefore Cassandra doesn’t need to wait for HDD seek. Looking at HDD Burst write speeds - which are above 100MB/s this really doesn’t seem like a limiting factor to me. In fact the network would be limiting. But because you probably won’t reach write speeds at which the network becomes limiting, the CPU limitation kicks in.
I hope that now this part of the answer makes sense:
To process an insert, Cassandra needs to deserialize the messages from the clients, find which nodes should store the data and send messages to those nodes. Those nodes then store the data in an in memory data structure called a Memtable.
This is almost always CPU bound initially. However, as more data is inserted, the memtables grow large and are flushed to disk and new (empty) memtables are created. The flushed memtables are stored in files known as SSTables. There is an ongoing background process called compaction that merges SSTables together into progressively larger and larger files.
by Richard from Explanation required for a statement in Cassandra documentation