Apache Cassandra: Nodetool Stats That Show Memtable Flush Frequency

Currently I am debugging a performance issue with Apache Cassandra. When the memtable for a column family fills up, it is queued to be flushed to disk as an SSTable. This flushing happens often when you perform massive writes.
When this queue fills up, writes are blocked until the next flush completes successfully. This indicates that the node cannot handle the writes it is receiving.
Is there a metric in nodetool indicating this behaviour? In other words, I want data indicating that a node cannot keep up with the writes it is receiving.
Thanks!!

That hasn't really been true for a couple of years. The active memtable is switched out and a new memtable takes its place as the live one. New mutations go to this live memtable, while the "to be flushed" memtables are still included in local reads. The MemtableFlushWriter thread pool has the flush tasks queued on it, so you can see how many are pending there (under tpstats). You can also see mutations backing up under the MutationStage.
Ultimately
nodetool tpstats
Is likely what you're looking for.
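For illustration, the relevant part of the output looks roughly like this (the numbers here are invented); the columns to watch are Pending and All time blocked for MemtableFlushWriter, and Pending for MutationStage:

nodetool tpstats

Pool Name             Active   Pending   Completed   Blocked   All time blocked
MutationStage              4      1532    91234567         0                  0
MemtableFlushWriter        2         6        8211         0                 17

A consistently non-zero Pending count or a growing All time blocked count on MemtableFlushWriter is the sign that flushes cannot keep up with the incoming writes.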

I want data indicating that a node cannot keep up with the writes it is receiving.
Your issue is likely that disk I/O cannot handle the throughput --> memtable flushes queue up --> writes are blocked.
The dstat command is your friend for investigating I/O issues. Some other Linux commands may also be handy. Read this excellent blog post from Amy Tobey: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
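For example, one common invocation (the flag combination is just a reasonable starting point) is:

dstat -lrvn 10

which prints load average, disk I/O requests, vmstat-style CPU/memory counters and network throughput every 10 seconds; a high wai (I/O wait) value while flushes are pending points at the disks.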
Is there a metric in nodetool indicating this behaviour?
nodetool tpstats

I believe you're looking for tp (thread pool) stats.
nodetool tpstats
Typically, blocked FlushWriters indicate that your storage system is having trouble keeping up with the write workload. Are you using spinning disks by chance? You'll also want to keep an eye on iostat in this case.
Here are the docs for tpstats: https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsTPstats.html
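For example (the interval is in seconds, and the flag choice is just a common one):

iostat -x 5

Watch the %util and await columns for the data and commit log disks; %util pinned near 100 or high await while FlushWriters are blocked points at storage.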

Related

Cassandra Write ahead log and memtables flush to disk

I've been reading up on Cassandra, and I get the feeling that it's REALLY not fault tolerant, is it?
I mean, take a very simple scenario: an incoming write. You write to the WAL, then to the memtable, and then mark in the WAL that the write succeeded. Then the server crashes before the memtable gets full, so it's not flushed to disk as an SSTable, meaning I just lost this write, and I won't be able to redo it since it's marked as "Done" in the WAL.
Am I missing something here, or is it really not fault tolerant? That seems very weird to me, since it's used in so many places and for so much data, which makes me think I'm missing something.
The commit log is written to before the memtable. You just write the mutation; there is no marking of the mutation as applied to the memtable. The mutation is not removed from the commit log until after the memtable has been completely flushed to a new SSTable.
It is important to know, though, that with some commit log sync strategies the write ack is not blocked on the commit log flush, so you can still have a data-loss window that is only protected by RF. So it's important to understand the consistency levels and replication factors for durability as well in those cases. In 4.0+ I think the group commit log sync is a great option between batch and periodic.
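For reference, these are the cassandra.yaml knobs being discussed (option names as they appear up through the 4.0 yaml; the values shown are illustrative, not recommendations):

# periodic (the default): ack writes immediately, fsync the commit log every N ms
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# batch: hold each ack until the commit log has been fsynced
# commitlog_sync: batch

# group (4.0+): wait up to a small window, fsync once for the whole group, then ack
# commitlog_sync: group
# commitlog_sync_group_window_in_ms: 15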

Cassandra: removing old, unneeded data

I have a two-node Cassandra cluster, with RF of 2. So both nodes contain 100% of data.
Now, I am running short on disk space. I can remove some old data, since they were aggregated and processed before, and I don't need them anymore.
I tried running a delete query from cqlsh, but I get a timeout. I tried increasing the timeouts, but it seems that running the query from cqlsh takes even more time than that.
How can I disable this timeout for a single query or connection? Is there any other way, besides increasing timeout, to remove some data from a node?
My Cassandra version is 3.11.0.
PS: I increased write_request_timeout_in_ms in cassandra.yaml. Is this the correct setting for delete queries?
Deletes really shouldn't time out unless there is a problem somewhere else. A delete just inserts a tombstone, with no reads or anything, and it should be fast/cheap regardless of what already exists. Reads, on the other hand, can be impacted a lot. I would guess GC problems caused by reads. You could check the GC logs and maybe increase the heap and reduce CMSInitiatingOccupancyFraction (if using CMS and not G1).
So check the GC and normal logs for issues (look for WARN and ERROR in the system log) and look at the pause times in the GC logs; there should be none over 1 second.
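A quick way to do that (the log paths are the usual package-install defaults; adjust for your setup):

grep -E 'WARN|ERROR' /var/log/cassandra/system.log | tail -50
grep GCInspector /var/log/cassandra/system.log | tail -50

GCInspector lines report GC pauses that Cassandra itself considered long, so entries there of a second or more back up the GC theory.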
After issuing the delete you could try forcing a compaction (nodetool compact keyspace table) to see if it helps disk space. The delete by itself will not reduce disk usage until the data has been compacted away together with the tombstone.
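Roughly, with placeholder keyspace/table names:

# in cqlsh
DELETE FROM my_keyspace.my_table WHERE partition_key = 'no-longer-needed';

# then, from a shell on each node
nodetool compact my_keyspace my_table

With RF=2 on two nodes, the compaction has to run on both nodes before the space is reclaimed everywhere.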
write_request_timeout_in_ms is the right setting, but if you're hitting it, something is wrong and you're just masking it. A delete should really take less than a millisecond in normal use.
Side note: RF=2 on a 2-node cluster is not how C* is designed to run. You have no availability on a database that sacrificed consistency for high availability.

Cassandra Load status does not update (nodetool status)

Using nodetool status I can read out the Load of each node. Adding or removing data from a table should have a direct impact on that value. However, the value remains the same, no matter how many times the nodetool status command is executed.
The Cassandra documentation states that the Load value takes 90 seconds to update. Even allowing several minutes between runs of the command, the result is always wrong. The only way I was able to make this value update was to restart the node.
I don't believe it is relevant, but I should add that I am using docker containers to create the cluster.
In the documentation that you linked, under Load it also says
Because all SSTable data files are included, any data that is not
cleaned up, such as TTL-expired cell or tombstoned data is counted.
It's important to note that when Cassandra deletes data, the data is marked with a tombstone and doesn't actually get removed until compaction. Thus, the load doesn't decrease immediately. You can force a major compaction with nodetool compact.
You can also try flushing the memtable if data is being added. The Apache documentation notes that
Cassandra writes are first written to the CommitLog, and then to a
per-ColumnFamily structure called a Memtable. When a Memtable is full,
it is written to disk as an SSTable.
So you either need to add more data until the memtable is full, or you can run a nodetool flush (documented here) to force it.
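Putting that together (the keyspace/table names are placeholders):

nodetool flush my_keyspace my_table      # force memtables to be written out as SSTables
nodetool compact my_keyspace my_table    # optionally compact away tombstoned data
nodetool status my_keyspace              # the Load column should now reflect the change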

What makes the CommitLog faster than writing to SSTables in Cassandra?

I am currently exploring Cassandra in depth, as I want to specialize in it. I came across the Cassandra "write path" and am now trying to understand the commit log. As I understand it, a write is acknowledged when it is written to the commit log first, then to the memtable (an in-memory table). But commit logs are written to the FILE SYSTEM, just as SSTables are. What is the magical thing that makes writing to the commit log faster, or, as it is stated in many posts and documentation:
A write is said to successful once it is written to the commit log and
memory, so there is very minimal disk I/O at the time of write
Why is a write not written to an SSTable and the memtable to be considered successful?
SSTables are immutable, so appending to them would be impossible. Therefore writes are sent to both a memtable and the commit log (for durability). Under normal operations the memtable is periodically flushed to disk as an SSTable, after which it is compacted with existing SSTables to make reads more efficient. The commit log is only replayed on node restart to recover writes that had not been flushed to SSTables.
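You can watch this write path with tracing in cqlsh. The trace below is abbreviated, and the table name and timings are made up, but the "Appending to commitlog" and "Adding to ... memtable" steps are what you should see, with no SSTable involved:

cqlsh> TRACING ON;
cqlsh> INSERT INTO my_keyspace.my_table (id, value) VALUES (1, 'x');

 activity                      | source    | source_elapsed
-------------------------------+-----------+----------------
 Appending to commitlog        | 127.0.0.1 |            182
 Adding to my_table memtable   | 127.0.0.1 |            257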
SSTables are created from flushed memtables. While commit log updates happen continuously, memtable flushing does not. That is because a memtable first needs to hit a certain threshold (i.e., size) before it is written to disk. This makes sure the resulting SSTable is large enough to be handled efficiently. If memtables were flushed a couple of times a minute, we could end up with lots of tiny SSTables that would have to be compacted again.
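The thresholds involved live in cassandra.yaml; a sketch with illustrative values (option names as they appear up through the 4.0 yaml):

memtable_heap_space_in_mb: 2048     # on-heap space shared by all memtables
memtable_offheap_space_in_mb: 2048  # off-heap space, if an off-heap allocation type is configured
memtable_cleanup_threshold: 0.11    # fraction of that space that triggers flushing the largest memtable
memtable_flush_writers: 2           # how many flushes can run concurrently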
Writing to Cassandra is so fast because writing to a log is already very fast, and you are also adding to an in-memory data structure, such as a b-tree or an AVL tree, which is referred to as a memtable. Memtables are sorted, and when they get written to disk the SSTables remain sorted as well, making reading very efficient, though not as fast as writing.
The point to note is that clients never touch the commit log. Its only purpose is to act as a backup. If your machine dies, then all the data in the memtable is lost, and the node then uses the commit log to replay and rebuild the memtable.
You want your reads to be fast, and this is only possible by laying the data out sequentially, which also makes it easier to cache. If you were to write to an SSTable on every write, you would either have to do random I/O, making writes slow, or you would have to wait for the disk to rotate so that you could do sequential writes.

Are insert-heavy workloads CPU-bound or I/O-bound?

Insert-heavy workloads are CPU-bound in Cassandra before becoming
memory-bound. (All writes go to the commit log, but Cassandra is so
efficient in writing that the CPU is the limiting factor.)
Can somebody explain this statement to me: why is I/O not the limiting factor here? As I understand it, a write first goes to I/O and then to the CPU.
I took a look at this StackOverflow question, the Cassandra Incubator, and an Apache email chain, but it's still not clear to me.
Cassandra keeps a log of items; yes, that part is I/O. But this log is only appended to, so Cassandra doesn't need to wait for HDD seeks. Looking at HDD burst write speeds, which are above 100 MB/s, this really doesn't seem like a limiting factor to me. In fact, the network would become limiting first. But because you probably won't reach write speeds at which the network becomes limiting, the CPU limitation kicks in.
I hope that now this part of the answer makes sense:
To process an insert, Cassandra needs to deserialize the messages from the clients, find which nodes should store the data and send messages to those nodes. Those nodes then store the data in an in memory data structure called a Memtable.
This is almost always CPU bound initially. However, as more data is inserted, the memtables grow large and are flushed to disk and new (empty) memtables are created. The flushed memtables are stored in files known as SSTables. There is an ongoing background process called compaction that merges SSTables together into progressively larger and larger files.
by Richard from Explanation required for a statement in Cassandra documentation

Resources