Cassandra write semantics

In the Cassandra architecture, when we perform a write operation, data is first written to the commit log, then into a memtable, and when the memtable reaches a threshold, the data is flushed to an SSTable.
So at a given time we have two copies of the data on a given node: one copy is in the commit log, and another copy is either in the memtable or flushed to an SSTable.
So why do we need two copies? Isn't the commit log enough for recovery purposes? Or do they serve totally different purposes? And how do these three structures differ from each other?

When you write, Cassandra saves the data to both the commit log and the memtable, which makes the operation very fast. If the node restarts before the data is saved to a persistent SSTable, the data in memory is lost, but it can be recovered from the commit log.
So Cassandra uses memtables and SSTables for lookups, while the commit log allows restarting a node at any moment without losing data.
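To make the two copies concrete, here is roughly where each one lives on a node, assuming the default directory layout (the paths and file names below are illustrative, not exact):

/var/lib/cassandra/commitlog/CommitLog-7-1675000000.log    <- durable, append-only recovery copy
(JVM memory) one memtable per table                        <- sorted in-memory copy used for reads
/var/lib/cassandra/data/my_ks/my_table-*/nb-1-big-Data.db  <- SSTable written when the memtable flushes

Once a memtable is flushed, its data lives in the SSTable, and the corresponding commit log segment becomes eligible for recycling.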

Related

When does Cassandra acknowledge the write?

Does Cassandra acknowledge a write as soon as it is written to the commit log? Or does it wait for the write to reach the memtable as well before sending success to the client?
Write success occurs when data is written to both the commit log and the memtable.
For a multi-node cluster with RF > 1, it also depends on the consistency level you set for writes.
Per the DSE Architecture guide:
The write consistency level determines how many replica nodes must respond with a success acknowledgment for the write to be considered successful. Success means data was written to the commit log and the memtable.
And this makes sense, because it may not be possible to write data to the memtable, for example, if all memtables are still flushing and there is no space left for a new write; in that case, Cassandra will return an error. Writing to the commit log just lowers the chance that you lose data if the machine loses power or the process crashes.
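For example, in cqlsh you can set the consistency level per session before issuing a write (the keyspace, table, and data here are hypothetical):

CONSISTENCY QUORUM;
INSERT INTO my_ks.users (id, name) VALUES (42, 'alice');

With RF = 3, this write is acknowledged once two of the three replicas have written it to both their commit log and their memtable.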

How to flush Cassandra CDC changes periodically to disk?

Desired behaviour
I'm trying to configure Cassandra CDC in a way that the commit log segments are flushed periodically to the cdc_raw directory (let's say every 10 seconds).
Based upon documentation from http://abiasforaction.net/apache-cassandra-memtable-flush/ and from https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configCDCLogging.html I found:
memtable_flush_period_in_ms – This is a CQL table property that
specifies the number of milliseconds after which a memtable should be
flushed. This property is specified on table creation.
and
Upon flushing the memtable to disk, CommitLogSegments containing data
for CDC-enabled tables are moved to the configured cdc_raw directory.
Putting those together, I would think that by setting memtable_flush_period_in_ms: 10000, Cassandra flushes its CDC changes to disk every 10 seconds, which is what I want to accomplish.
My configuration
Based on the aforementioned configuration, I would expect the memtable to be flushed to the cdc_raw directory every 10 seconds. I'm using the following configuration:
cassandra.yaml:
cdc_enabled: true
commitlog_segment_size_in_mb: 1
commitlog_total_space_in_mb: 2
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
table configuration:
memtable_flush_period_in_ms = 10000
cdc = true
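For reference, the table settings above correspond to CQL along these lines (the keyspace, table, and columns are hypothetical):

CREATE TABLE my_ks.orders (
    id int PRIMARY KEY,
    payload text
) WITH cdc = true
  AND memtable_flush_period_in_ms = 10000;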
Problem
The memtable is not flushed periodically to the cdc_raw directory, but instead gets flushed to the commitlog directory when a certain size threshold is reached.
In detail, the following happens:
When a commit log segment reaches 1 MB, it is flushed to the commitlog directory. There is a maximum of 2 commit log segments in the commitlog directory (see the configuration commitlog_total_space_in_mb: 2). When this threshold is reached, the oldest commit log file in the commitlog directory is moved to the cdc_raw directory.
Question
How to flush Cassandra CDC changes periodically to disk?
Apache Cassandra's CDC in the current version is tricky.
The commit log is 'global', meaning changes to any table go to the same commit log.
Your commit log segment can (and will) contain logs from tables other than the ones with CDC enabled, including system tables.
A commit log segment is deleted (or moved to the cdc_raw directory) only after every log in that segment has been flushed.
So even if you configure your CDC-enabled table to flush every 10 seconds, there are logs from other tables still in the same commit log segment, which prevents the segment from moving to the CDC directory.
There is no way to change this behavior other than trying to speed up the process by reducing commitlog_segment_size_in_mb (but be careful not to reduce it below the size of a single write request).
This behavior is improved in the next major version, v4.0: you will be able to read CDC data as soon as the commit log is synced to disk (so when you are using periodic commit log sync, you can read your changes every commitlog_sync_period_in_ms milliseconds).
See CASSANDRA-12148 for details.
By the way, you set commitlog_total_space_in_mb to 2, which I definitely do not recommend. What you are seeing right now is Cassandra flushing every table whenever the commit log size exceeds this value, in order to make more space. If Cassandra cannot reclaim commit log space, it will start throwing errors and rejecting writes.
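For comparison, a stock cassandra.yaml ships with far larger values, along the lines of:

commitlog_segment_size_in_mb: 32
commitlog_total_space_in_mb: 8192

so a 2 MB total budget forces Cassandra to flush tables almost continuously just to reclaim commit log space.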

Cassandra Load status does not update (nodetool status)

Using nodetool status I can read the Load of each node. Adding or removing data from a table should have a direct impact on that value. However, the value remains the same no matter how many times the nodetool status command is executed.
The Cassandra documentation states that the Load value takes 90 seconds to update. Even allowing several minutes between runs of the command, the result is always wrong. The only way I was able to make this value update was to restart the node.
I don't believe it is relevant, but I should add that I am using Docker containers to create the cluster.
In the documentation that you linked, under Load it also says
Because all SSTable data files are included, any data that is not
cleaned up, such as TTL-expired cell or tombstoned data is counted.
It's important to note that when Cassandra deletes data, the data is marked with a tombstone and doesn't actually get removed until compaction. Thus, the load doesn't decrease immediately. You can force a major compaction with nodetool compact.
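For example, to compact a single table (the keyspace and table names are hypothetical):

nodetool compact my_keyspace my_table

Note that a major compaction rewrites the table's SSTables into one and can be expensive on large tables, so use it deliberately.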
You can also try flushing the memtable if data is being added. The Apache documentation notes that
Cassandra writes are first written to the CommitLog, and then to a
per-ColumnFamily structure called a Memtable. When a Memtable is full,
it is written to disk as an SSTable.
So you either need to add more data until the memtable is full, or you can run nodetool flush to force it.
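For example (the keyspace and table names are hypothetical):

nodetool flush my_keyspace my_table
nodetool status

After the flush, the newly written SSTable is included in the Load value reported by nodetool status.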

Cassandra commit log clarification

I have read over several documents regarding the Cassandra commit log and, to me, there is conflicting information regarding this "structure(s)". The diagram shows that when a write occurs, Cassandra writes to the memtable and commit log. The confusing part is where this commit log resides.
The diagram that I've seen over and over shows the commit log on disk. However, if you do some more reading, they also talk about a commit log buffer in memory, and that piece of memory is flushed to disk every 10 seconds.
DataStax Documentation states:
"When a write occurs, Cassandra stores the data in a memory structure called memtable, and to provide configurable durability, it also appends writes to the commit log buffer in memory. This buffer is flushed to disk every 10 seconds".
Nowhere in their diagram do they show a memory structure called a commit log buffer. They only show the commit log residing on disk.
It also states:
"When a write occurs, Cassandra stores the data in a structure in memory, the memtable, and also appends writes to the commit log on disk."
So I'm confused by the above. Is it written to the commit log memory buffer, which is eventually flushed to disk (which I would assume is also called the "commit log"), or is it written to the memtable and commit log on disk?
Apache's documentation states this:
"Instead, like other modern systems, Cassandra provides durability by appending writes to a commitlog first. This means that only the commitlog needs to be fsync'd, which, if the commitlog is on its own volume, obviates the need for seeking since the commitlog is append-only. Implementation details are in ArchitectureCommitLog.
Cassandra's default configuration sets the commitlog_sync mode to periodic, causing the commitlog to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time."
What I have inferred from the Apache statement is that ONLY because of the asynchronous nature of writes (acknowledgement of a cached write) could you lose data (it even states you can lose data if all replicas crash before it is flushed/sync'd).
I'm not sure what I can infer from the DataStax documentation and diagram as they've mentioned two different statements regarding the commit log - one in memory, one on disk.
Can anyone clarify, what I consider, a poorly worded and conflicting set of documentation?
I'll assume there is a commit log buffer, as they both reference it (yet DataStax doesn't show it in the diagram). How and when this buffer is managed is, I think, the key to understanding.
Generally when explaining the write path, the commit log is characterized as a file - and it's true the commit log is the on-disk storage mechanism that provides durability. The confusion is introduced when going deeper and the part about buffer cache and having to issue fsyncs is introduced. The reference to "commit log buffer in memory" is talking about OS buffer cache, not a memory structure in Cassandra. You can see in the code that there's not a separate in-memory structure for the commit log, but rather the mutation is serialized and written to a file-backed buffer.
Cassandra comes with two strategies for managing fsync on the commit log.
commitlog_sync
(Default: periodic) The method that Cassandra uses to acknowledge writes in milliseconds:
periodic: (Default: 10000 milliseconds [10 seconds])
Used with commitlog_sync_period_in_ms to control how often the commit log is synchronized to disk. Periodic syncs are acknowledged immediately.
batch: (Default: disabled)
Used with commitlog_sync_batch_window_in_ms (Default: 2 ms) to control how long Cassandra waits for other writes before performing a sync. When using this method, writes are not acknowledged until fsynced to disk.
The periodic setting offers better performance at the cost of a small increase in the chance that data can be lost. The batch setting guarantees durability at the cost of latency.
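In cassandra.yaml the two strategies look like this; only one mode is active at a time (the values shown are the defaults quoted above):

commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

or:

commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2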

What makes the CommitLog faster than writing to SSTables in Cassandra?

I am currently exploring Cassandra in depth, as I am willing to specialize in it. I came across Cassandra's "write path" and am now trying to understand the commit logs. As I understand it, a write is acknowledged when it is written to the commit log first, then to the memtable (an in-memory table). But if commit logs are written to the FILE SYSTEM, just like SSTables, what is the magical thing that makes writing to commit logs faster? Or, as it is stated in many posts and documentation:
A write is said to be successful once it is written to the commit log and to memory, so there is very minimal disk I/O at the time of write
Why is a write not written to an SSTable and the memtable to be considered successful?
SSTables are immutable, so appending to them would be impossible. Therefore writes are sent to both a memtable and the commit log (for durability). Under normal operations the memtable is periodically flushed to disk as an SSTable, after which it is compacted with existing SSTables to make reads more efficient. The commit log is only replayed on node restart to recover writes that had not been flushed to SSTables.
SSTables are created from flushed memtables. While commit log updates do happen periodically, memtable flushing does not: a memtable first needs to hit a certain threshold (i.e., size) before being written to disk. This ensures that the resulting SSTable is large enough to be handled efficiently. If memtables were flushed periodically, a couple of times a minute, we could end up with lots of tiny SSTables that would have to be compacted again.
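The thresholds that trigger a flush live in cassandra.yaml; a sketch of the relevant knobs (names per the 3.x configuration, values illustrative):

memtable_heap_space_in_mb: 2048
memtable_cleanup_threshold: 0.11

When the space used by live memtables exceeds the cleanup threshold's share of that budget, the largest memtable is flushed to disk.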
Writing to Cassandra is so fast because writing to a log is already very fast; you are also adding to an in-memory data structure such as a B-tree or AVL tree, which is referred to as a memtable. Memtables are sorted, and when they get written to disk, SSTables remain sorted as well, making reading very efficient, though not as fast as writing.
The point to note is that clients never touch the commit log. Its only purpose is recovery: if your machine dies, all the data in the memtable is lost, and the node then uses the commit log to replay and rebuild the memtable.
You want your reads to be fast, and this is only possible by writing all the data sequentially, which also makes it easier to cache. If you wrote to an SSTable on disk for every write, you would either have to do random I/O, making reads slow, or wait for the disk to rotate so that your writes stay sequential.
