What is the purpose of Cassandra's commit log? - cassandra

Please some one clarify for me to understand Commit Log and its use.
In Cassandra, while writing to Disk is the commit log the first entry point or MemTables.
If Memtables is what is getting flushed to disk, what is the use of Commit log, is the only purpose of commit log is to server sync issues if a data node is down?

You can think of the commit log as an optimization, but Cassandra would be unusably slow without it. When MemTables get written to disk we call them SSTables. SSTables are immutable, meaning once Cassandra writes them to disk it does not update them. So when a column changes Cassandra needs to write a new SSTable to disk. If Cassandra was writing these SSTables to disk on every update it would be completely IO bound and very slow.
So Cassandra uses a few tricks to get better performance. Instead of writing SSTables to disk on every column update, it keeps the updates in memory and flushes those changes to disk periodically to keep the IO to a reasonable level. But this leads to the obvious problem that if the machine goes down or Cassandra crashes you would lose data on that node. To avoid losing data, in addition to keeping recent changes in memory, Cassandra writes the changes to its CommitLog.
You may be asking why is writing to the CommitLog any better than just writing the SSTables. The CommitLog is optimized for writing. Unlike SSTables which store rows in sorted order, the CommitLog stores updates in the order which they were processed by Cassandra. The CommitLog also stores changes for all the column families in a single file so the disk doesn't need to do a bunch of seeks when it is receiving updates for multiple column families at the same time.
Basically writting the CommitLog to the disk is better because it has to write less data than writing SSTables does and it writes all that data to a single place on disk.
Cassandra keeps track of what data has been flushed to SSTables and is able to truncate the Commit log once all data older than a certain point has been written.
When Cassandra starts up it has to read the commit log back from that last known good point in time (the point at which we know all previous writes were written to an SSTable). It re-applies the changes in the commit log to its MemTables so it can get into the same state when it stopped. This process can be slow so if you are stopping a Cassandra node for maintenance it is a good idea to use nodetool drain before shutting it down which will flush everything in the MemTables to SSTables and make the amount of work on startup a lot smaller.

The write path in Cassandra works like this:
Cassandra Node ---->Commitlog-----------------> Memtable
| |
| |
|---> Periodically |---> Periodically
sync to disk flush to SSTable
Memtable and Commitlog are NOT written (kind of) in parallel. Write to Commitlog must be finished before starting to write to Memtable. Related source code stack is:
org.apache.cassandra.service.StorageProxy.mutateMV:mutation.apply->
org.apache.cassandra.db.Mutation.apply:Keyspace.open(keyspaceName).apply->
org.apache.cassandra.db.Keyspace.apply->
org.apache.cassandra.db.Keyspace.applyInternal{
Tracing.trace("Appending to commitlog");
commitLogPosition = CommitLog.instance.add(mutation)
...
Tracing.trace("Adding to {} memtable",...
...
upd.metadata().name(...);
...
cfs.apply(...);
...
}
The purpose of the Commitlog is to be able to recreate the Memtable after a node crashes or gets rebooted. This is important, since the Memtable only gets flushed to disk when it's 'full' - meaning the configured Memtable size is exceeded - or the flush is performed by nodetool or opscenter. So the data in Memtable is not persisted directly.
Having said that, a good thing before rebooting a node or container is to call nodetool flush to make sure your Memtables are fully persisted (flushed) to SSTables on disk. This also will reduce playback time of the Commitlog after the node or container comes up again.

Related

Cassandra read path

I'm learning Cassandra's read path.
According to some sources:
"When Cassandra receives the read request, data will be searched first in the Memtable, then data will be searched in SSTables and if data exists it is returned"
Also, I know that Memtables are periodically flushed to SSTables on disk.
My questions:
are memtables fully deleted from RAM after flushing to SSTables?
Suppose, we have a read request on a node. Node contains both memtables and SSTables.
Is it possible for Cassandra to get required data only from Memtables without accessing SSTables? If yes, when it is possible and how can Cassandra determine that required data stored only in Memtables and there are no other related data stored on disk (SSTables)?
Short answer to 2nd question - No. Cassandra will always check SSTables even if the data is in the memtable. The reason for that is the data in the memtable could be older than data in the SSTable. For example, if you're explicitly set write timestamp for records, or data is replayed from the hints on other node. When memtable is flushed, data is removed from memory. But in some cases you can use row cache if you have data that is often accessed.
You can read more about read path in the DSE arch guide.

Commitlog incremental switch

We get a burst of 'No segments I reserve; creating a new one'in log,as soon as we start our performance burst run on a single node cassandra Cluster.
Presumably, this means the commit log is expanding to grab more segments;which would impact the performance metrics.
Questions on the above:
1.Can this be avoided by pre-reserving segments for commit log and how.
2.How can we query the current limit of a commitlog for a cassandra instace.
3.We also notice memtable flush pointing to commitlog activity. Does commitlog sync duration trigger table flushes
We are using version 3.11.x

What does Cassandra nodetool repair exactly do?

From http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html I know that
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data.
but how does it fix the inconsistencies? It's written it uses Merkle trees - but that's for comparison not for fixing 'broken' data.
How the data can be 'broken'? Any common cases despite hard drive failure?
Question aside: it's compaction which evicts tombstones, right? So the requirement for running nodetool repair more frequently than gc_grace seconds is only to ensure that all data is spread to appropriate replicas? Shouldn't be that the usual scenario?
The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
The reason you need to run repair within gc grace seconds is so that tombstones will be sync'd to all nodes. If a node is missing a tombstone, then it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.

How can I switch from multiple disks to a single disk in cassandra?

Because I ran out of space when shuffling, I was forced to add multiple disks on my Cassandra nodes.
When I finish compacting, cleaning up, and repairing, I'd like to remove them and return to one disk per node.
What is the procedure to make the switch?
Can I just kill cassandra, move the data from one disk to the other, remove the configuration for the second disk, and re-start cassandra?
I assume files will not have the same name and thus not be overwritten, is this the case?
Run disablegossip and disablethrift from nodetool, such that this
node is seen as DOWN by other nodes.
flush/drain the memtables, run compaction to merge SSTables, if any
[optionally, take snapshot as a precaution]
This stops all the other nodes/clients from writing to this node and since memtables are flushed to disk
stop Cassandra (though this node is down, cluster is available for
write/read, so zero downtime)
move data/log contents from other disk to the disk you want
make changes in cassandra.yaml to change the below paths:
commitlog_directory
saved_caches_directory
data_file_directories
log_directory
restart cassandra
do this for all nodes.

Explanation required for a statement in Cassandra documentation

I was going through the DataStax documentation and found an interesting statement.
It claimed "Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound".
Can someone explain about how this claim is made? and what might be causing this behavior of Cassandra??
Thanks.
For different workloads, Cassandra clusters can be CPU, memory, I/O or (occasionally) network bound. The claim in the documentation is, if you start a new cluster and make lots of inserts, the cluster will initially be CPU bound but after a while it becomes bottlenecked on memory.
To process an insert, Cassandra needs to deserialize the messages from the clients, find which nodes should store the data and send messages to those nodes. Those nodes then store the data in an in memory data structure called a Memtable.
This is almost always CPU bound initially. However, as more data is inserted, the memtables grow large and are flushed to disk and new (empty) memtables are created. The flushed memtables are stored in files known as SSTables. There is an ongoing background process called compaction that merges SSTables together into progressively larger and larger files.
There are a few reasons why more memory will help at this stage:
If Cassandra is low on heap space, it will flush memtables when they are smaller. This creates smaller SSTables so more work to compact them.
If the workload involves overwrites or inserts to the same row at different times, it is much cheaper to do this if the row is still in a current memtable. If not, the overwrite and new column is stored in a new memtable, then flushed and merged during compaction. So again, less memory means more compaction work.
Your OS uses memory to buffer reads and writes during compaction. If the OS can't then there will be extra I/O, slowing down memtable flushing and compaction.
Inserts into Cassandra consume lots of Java objects so create work for the garbage collector. If the heap is too small inserts may be paused while GC runs to make some free heap. (On the other hand, if the heap is too large, inserts may be paused for a few seconds during stop-the-world GC.)
So inserts may become memory bound, but they could also become I/O bound. If there isn't enough I/O to flush memtables then inserts will become blocked once the memtable flush queue is full. So I think the claim could be a bit more accurate:
Insert-heavy workloads are CPU-bound in Cassandra before becoming memory or I/O bound.

Resources