Cassandra Load status does not update (nodetool status)

Using nodetool status I can read the Load of each node. Adding or removing data from a table should have a direct impact on that value. However, the value remains the same no matter how many times the nodetool status command is executed.
The Cassandra documentation states that the Load value takes 90 seconds to update. But even allowing several minutes between runs of the command, the result is always wrong. The only way I was able to make this value update was to restart the node.
I don't believe it is relevant, but I should add that I am using Docker containers to create the cluster.

In the documentation that you linked, under Load it also says:
Because all SSTable data files are included, any data that is not cleaned up, such as TTL-expired cells or tombstoned data, is counted.
It's important to note that when Cassandra deletes data, the data is marked with a tombstone and doesn't actually get removed until compaction. Thus, the Load doesn't decrease immediately. You can force a major compaction with nodetool compact.
You can also try flushing the memtable if data is being added. Apache notes that:
Cassandra writes are first written to the CommitLog, and then to a per-ColumnFamily structure called a Memtable. When a Memtable is full, it is written to disk as an SSTable.
So you either need to add more data until the memtable is full, or you can run nodetool flush to force it.
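Putting both suggestions together, here is a minimal sketch of the sequence that should make the Load figure move. The keyspace and table names are placeholders, and nodetool is assumed to be on the PATH (in a Docker setup you would prefix the commands with docker exec <container>):

    # Sketch: force a memtable flush and a major compaction, then re-check Load.
    # "my_keyspace"/"my_table" are placeholder names.
    import subprocess

    def nodetool(*args):
        subprocess.run(["nodetool", *args], check=True)

    nodetool("flush", "my_keyspace", "my_table")    # memtable -> new SSTable on disk
    nodetool("compact", "my_keyspace", "my_table")  # purges tombstoned/expired data
    nodetool("status")                              # Load should now track SSTable sizes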

Related

Cassandra write semantics

In the Cassandra architecture, when we perform a write operation, data is first written to the commit log, then into the memtable, and when the memtable reaches a threshold, the data is flushed to an SSTable.
So at a given time we have two copies of the data on a given node: one copy is in the commit log and another copy is either in the memtable or flushed to an SSTable.
So why do we need two copies? Isn't the commit log enough for recovery purposes? Or do they serve totally different purposes? And how are these three structures different from each other?
When you write, Cassandra saves the data to both the commit log and the Memtable, which is what makes the operation very fast. If the node restarts before the data is saved to a persistent SSTable, the data in memory is lost, but it can be recovered by replaying the commit log.
So Cassandra uses Memtables and SSTables for lookups, while the commit log allows a node to restart at any moment without losing data.
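To make those different purposes concrete, here is a toy Python model of the write path; it is purely illustrative and nothing like Cassandra's actual implementation. The commit log is append-only and used solely for replay after a crash, while the memtable is what reads consult, and it is periodically flushed to an immutable SSTable:

    # Toy model of the write path -- purely illustrative, not Cassandra's real code.
    import json

    class ToyNode:
        def __init__(self, memtable_limit=4):
            self.commitlog = open("commitlog.txt", "a")  # durable append-only log
            self.memtable = {}                           # fast in-memory lookups
            self.sstables = []                           # immutable on-disk segments
            self.memtable_limit = memtable_limit

        def write(self, key, value):
            # 1. Append to the commit log first, for durability only.
            self.commitlog.write(json.dumps({"k": key, "v": value}) + "\n")
            self.commitlog.flush()
            # 2. Update the memtable, which is what reads actually consult.
            self.memtable[key] = value
            if len(self.memtable) >= self.memtable_limit:
                self.flush()

        def flush(self):
            # Persist the memtable as a sorted, immutable "SSTable"; once this
            # happens, the corresponding commit log segment can be recycled.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

        def read(self, key):
            # Reads never touch the commit log: memtable first, then the
            # SSTables from newest to oldest.
            if key in self.memtable:
                return self.memtable[key]
            for sstable in reversed(self.sstables):
                if key in sstable:
                    return sstable[key]
            return None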

Cassandra, removing old, not needed data

I have a two-node Cassandra cluster, with RF of 2. So both nodes contain 100% of data.
Now, I am running short on disk space. I can remove some old data, since they were aggregated and processed before, and I don't need them anymore.
I tried running a delete query from cqlsh, but I get a timeout. I tried increasing timeouts, but it seems that running the query from cqlsh takes much longer than that.
How can I disable this timeout for a single query or connection? Is there any other way, besides increasing timeout, to remove some data from a node?
My Cassandra version is 3.11.0.
P.S. I increased write_request_timeout_in_ms in cassandra.yaml. Is this the correct setting for delete queries?
Deletes really shouldn't time out unless there is a problem related to something else. A delete just inserts a tombstone, with no reads or anything, and should be fast and cheap regardless of what already exists. Reading, on the other hand, can be impacted a lot. I would guess GC-related problems caused by reads. You could check the GC logs and maybe increase the heap and reduce CMSInitiatingOccupancyFraction (if using CMS and not G1).
So check the GC and normal logs for issues (look for WARN and ERROR in the system log) and for pause times over one second in the GC logs; there should be none.
After issuing the delete you could try a major compaction (nodetool compact keyspace table) to see if it helps disk space. The delete by itself will not reduce disk space until the data has been compacted away along with the tombstone.
write_request_timeout_in_ms is the right setting, but if you're hitting it something is wrong and you're just masking it. A delete should really take less than a millisecond in normal use.
Side note: RF=2 on a two-node cluster is not how C* is designed to run. A quorum of two replicas is both nodes, so if either node goes down you have no availability on a database that sacrificed consistency for high availability.
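On the timeout itself: cqlsh also has its own client-side request timeout (the --request-timeout option, in seconds), which is often what actually fires, independently of the server-side write_request_timeout_in_ms. As a sketch with the DataStax Python driver, where the keyspace, table, and predicate are placeholders, you can raise the timeout for a single statement:

    # Sketch: a delete issued with an explicit per-request client timeout.
    # Keyspace/table/column names are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")

    # The timeout argument (seconds) overrides the driver's default for this
    # one request; it is independent of write_request_timeout_in_ms on the server.
    session.execute(
        "DELETE FROM events WHERE day = %s",
        ("2017-01-01",),
        timeout=120.0,
    )
    cluster.shutdown()

If the timeouts persist even then, the advice above stands: find the underlying problem rather than masking it.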

Impact of not running Cassandra repair on a time series data

Data we store in Cassandra is pure time series with no manual deletes. Data gets deleted only by TTL.
For such use cases, is repair really needed? What is the impact of not running repair?
Tombstoned data is really deleted only after gc_grace_seconds plus a compaction. If a table with tombstoned data is never compacted, you will be stuck with that data, and it will cause performance degradation.
If you don't run repair within the gc_grace period, deleted data can come back to life. Here's the DataStax article on this (and on why you need to run repairs regularly):
https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
EDIT:
TTLed data isn't tombstoned at the moment it expires, but only when a compaction process runs (at least in 3.9). You will not see expired data even while there are no tombstones yet.
So, if there is a problem with a node and TTLed data doesn't get its tombstone during one compaction, it will get one on the next compaction, or it will simply be deleted. Given that, and the fact that the data is NEVER deleted manually and only expires, and that you don't have any overwrites to the same key, you don't have to run repairs for data consistency.
That said, I would still recommend running repairs once in a while (with a much longer interval between them), in case something was accidentally written outside your usual write path.
If you set a TTL, Cassandra will mark the data with a tombstone after the time is exceeded. If you don't run repair regularly, a huge number of tombstones will accumulate and affect Cassandra performance.
After the number of seconds since the column's creation exceeds the TTL value, TTL data is considered expired and is not included in results. Expired data is marked with a tombstone on the next read on the read path, but it remains for a maximum of gc_grace_seconds. After this amount of time, the tombstoned data is automatically removed during the normal compaction and repair processes.
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
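As a concrete sketch of that lifecycle, here is how a pure-TTL time-series table might be declared and written with the DataStax Python driver. The keyspace, table, and values are placeholders, and lowering gc_grace_seconds like this is only reasonable because the workload has no manual deletes or overwrites:

    # Sketch: TTL-only time-series table with a reduced gc_grace_seconds.
    # "my_keyspace"/"metrics" are placeholder names.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")

    session.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            sensor_id text,
            ts timestamp,
            value double,
            PRIMARY KEY (sensor_id, ts)
        ) WITH default_time_to_live = 604800  -- every cell expires after 7 days
          AND gc_grace_seconds = 86400        -- purge eligible 1 day after that
    """)

    # A per-write TTL overrides the table default:
    session.execute(
        "INSERT INTO metrics (sensor_id, ts, value) "
        "VALUES (%s, toTimestamp(now()), %s) USING TTL 3600",
        ("sensor-1", 21.5),
    )
    cluster.shutdown()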

Freeing disk space of overwritten data?

I have a table whose rows get overwritten frequently using regular INSERT statements. This table holds ~50 GB of data, and the majority of it is overwritten daily.
However, according to OpsCenter, disk usage keeps going up and is not freed.
I have validated that rows are being overwritten and not simply being appended to the table. But they're apparently still taking up space on disk.
How can I free disk space?
Under the covers, what Cassandra does during these writes is append a new row to the SSTable with a newer timestamp. When you perform a read, the newest row (based on timestamp) is returned to you as the row. However, this also means that you are using twice the disk space to accomplish this. It is not until Cassandra runs a compaction operation that the older rows are removed and the disk space recovered. Here is some information on how Cassandra writes to disk, which explains the process:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_write_path_c.html?scroll=concept_ds_wt3_32w_zj__dml-compaction
A compaction is done on a node-by-node basis and is a very disk-intensive operation which may affect the performance of your cluster while it is running. You can run a manual compaction using the nodetool compact command:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html
As Aaron mentioned in his comment above, overwriting all the data in your cluster daily is not really the best use case for Cassandra, because of issues such as this one.
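To see the effect, you can compare the table's reported disk usage before and after a major compaction. A small sketch, assuming nodetool is on the PATH and using placeholder keyspace/table names (on pre-3.0 versions the subcommand is cfstats rather than tablestats):

    # Sketch: compare "Space used (live)" before and after a major compaction.
    # "my_keyspace"/"my_table" are placeholders; nodetool must be on the PATH.
    import subprocess

    def space_used(keyspace, table):
        out = subprocess.run(
            ["nodetool", "tablestats", f"{keyspace}.{table}"],
            capture_output=True, text=True, check=True,
        ).stdout
        return next((line.strip() for line in out.splitlines()
                     if "Space used (live)" in line),
                    "Space used (live): not reported")

    print("before:", space_used("my_keyspace", "my_table"))
    subprocess.run(["nodetool", "compact", "my_keyspace", "my_table"], check=True)
    print("after: ", space_used("my_keyspace", "my_table"))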

Is update in place possible in Cassandra?

I have a table in Cassandra where I populate some rows with thousands of entries (each row has 10,000+ columns). The entries in the rows are updated very frequently: basically just one field (an integer) is updated with different values. All other column values remain unmodified. My question is: will the updates be done in place? How well does Cassandra handle frequent updates of entries?
First of all, every update is also a sequential write as far as Cassandra is concerned, so it makes no difference to Cassandra whether you update or insert.
The real question is how quickly you need those writes to be available for reading. As @john suggested, all writes first go to a mutable Memtable, which resides in memory. So every update is essentially appended as a new sequential entry to the memtable for a particular CQL table. It is also concurrently written to the commit log (synced every 10 seconds by default) for durability.
When the Memtable is full, or the total commit log size threshold is reached, Cassandra flushes all the data to an immutable Sorted String Table (SSTable). Compaction is the later procedure in which, for each partition key, the newest column values are kept and all the previous (pre-update) values are removed.
Frequent flushing brings the overhead of frequent sequential writes to disk plus compaction, which can consume a lot of I/O and have a serious impact on Cassandra performance.
As far as reads go, Cassandra will first try the row cache (if enabled) and the memtable. Failing those, it goes through the bloom filter, key cache, partition summary, and partition index, and finally to the SSTables, in that order. Once the data has been collected for all the column values, it is aggregated in memory, the column values with the latest timestamps are returned to the client, and an entry is made in the row cache for that partition key.
So yes, when you query a partition key, Cassandra will scan all the SSTables for that particular CQL table, as well as the memtable for column values that have not yet been flushed to disk.
Initially these updates are stored in an in-memory data structure called the Memtable. Memtables are flushed to immutable SSTables at regular intervals.
So a single wide row may end up spread across several SSTables. It is during a process called 'compaction' that the different SSTables are merged into a bigger SSTable on disk.
Increasing the thresholds for flushing Memtables is one way of optimizing. If updates arrive very quickly, before the Memtable is flushed to disk, I think the update should be in place in memory, though I'm not sure.
Also, each read operation checks the Memtables first; if the data is still there, it is simply returned – this is the fastest possible access.
Cassandra read path:
When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question
No, in-place updates are not possible.
As @john suggested, if you have frequent writes then you should delay the flush process. During the flush, the multiple writes to the same partition that are stored in the Memtable will be written as a single partition in the newly created SSTable.
C* is fine for heavy writes. However, you'll need to monitor the number of SSTables accessed per read. If the number is too high, you'll need to review your compaction strategy.
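For that monitoring, nodetool tablehistograms reports per-read SSTable counts as percentiles. A minimal sketch with placeholder names, nodetool assumed on the PATH:

    # Sketch: check how many SSTables a typical read touches.
    # "my_keyspace"/"my_table" are placeholders.
    import subprocess

    out = subprocess.run(
        ["nodetool", "tablehistograms", "my_keyspace", "my_table"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out)
    # Inspect the SSTables column: if the 95th percentile is high (say, more
    # than a handful), the compaction strategy is worth revisiting.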

Resources