Cassandra not removing deleted rows despite running nodetool compact - cassandra

Very often I have ghost rows that stay on the server and won't disappear after I delete a row in Cassandra.
I have tried all possible administration options with nodetool (compact, flush, etc.) and also connected to the cluster with jconsole and forced a GC through it, but the rows remain on the cluster.
For testing purposes I updated some rows with a TTL of 0 before doing the DELETE, and these rows disappeared completely.
Do I need to live with that or can I somehow trigger a final removal of these deleted rows?
My test cluster uses Cassandra 1.0.7 and has only a single node.

This phenomenon that you are observing is the result of how distributed deletes work in Cassandra. See the Cassandra FAQ and the DistributedDeletes wiki page.
Basically the row will be completely deleted after GCGraceSeconds has passed and a compaction has run.
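To make the two conditions concrete, here is a minimal sketch of the purge rule in Python. It is a simplified model, not Cassandra's actual code: the names `Tombstone`, `is_purgeable` and `compact` are hypothetical, and real compaction applies further checks (for example, whether other SSTables still hold older data for the same key).

```python
from dataclasses import dataclass

GC_GRACE_SECONDS = 864000  # 10 days, the default mentioned in the related questions


@dataclass
class Tombstone:
    deleted_at: int  # deletion time, in seconds since the epoch


def is_purgeable(tombstone, now, gc_grace_seconds=GC_GRACE_SECONDS):
    """A tombstone may be dropped only once the grace period has fully elapsed."""
    return now > tombstone.deleted_at + gc_grace_seconds


def compact(tombstones, now, gc_grace_seconds=GC_GRACE_SECONDS):
    """Model of a compaction: keep every tombstone that is still inside the grace period."""
    return [t for t in tombstones if not is_purgeable(t, now, gc_grace_seconds)]


if __name__ == "__main__":
    stones = [Tombstone(deleted_at=0), Tombstone(deleted_at=500000)]
    # Compacting too early changes nothing, which matches the "ghost rows" in the question.
    print(compact(stones, now=100))
    # After the grace period of the first tombstone, a compaction drops it.
    print(compact(stones, now=864001))
```

In this model, running a compaction before the grace period has passed is a no-op for tombstones, which would explain why nodetool compact appears to do nothing right after the DELETE.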

Related

Cassandra deletion - how does it resolve conflicts after some servers go down

Imagine a simplest Cassandra table on a Cassandra cluster of 2 nodes.
I issue a deletion command for a record. Imagine that node #2 is down at the time. The Cassandra client receives a success response from node #1 and happily continues (consistency level ONE for the command).
Then node #2 comes back up and tries to sync data with node #1. Node #2 claims that it has a record that node #1 doesn't. How do they figure out that it was a deletion that removed the record from node #1, and not an insert that added a record to node #2 (and didn't reach node #1 for some reason)? The reason I am talking about deletions is that I assume that after a deletion, Cassandra doesn't store a timestamp for the deleted item.
Any useful links on the issue would be appreciated.
What I am talking about in particular is either a hinted handoff scenario or read repair.
Cassandra Repair takes care of these situations.
When you delete data in Cassandra, the data is not removed immediately. Instead, Cassandra creates tombstones indicating that the row/column is deleted. Tombstones are kept until gc_grace_seconds has elapsed.
If you run repair regularly:
When you run repair, the nodes sync both the data and the tombstones that were created. After gc_grace_seconds the tombstones can then be removed.
If you do not run repair regularly:
Suppose gc_grace_seconds = 10 days and you delete some data on node #1 while node #2 is down. Cassandra creates a tombstone for the deleted data on node #1. If you later bring node #2 back up but do not run repair, then after gc_grace_seconds (10 days) the tombstone is removed from node #1, while node #2 never received it and still holds the original data. If you read the data now, it will reappear instead of staying deleted.
Hence you must run repair regularly on the Cassandra cluster.
Refer to the Cassandra docs on deletes:
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_about_deletes_c.html
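The two cases above can be replayed with a toy model. This is only a sketch under simplifying assumptions: two in-memory replicas, last-write-wins by timestamp, the same clock used for write timestamps and for tombstone age, and a `repair` that copies every cell in both directions. `Replica`, `repair` and `scenario` are hypothetical names, not Cassandra APIs.

```python
GC_GRACE = 10 * 24 * 3600  # 10 days, in seconds


class Replica:
    def __init__(self):
        # key -> (timestamp, value); a value of None marks a tombstone
        self.cells = {}

    def write(self, key, value, ts):
        current = self.cells.get(key)
        if current is None or ts > current[0]:  # last write wins
            self.cells[key] = (ts, value)

    def delete(self, key, ts):
        self.write(key, None, ts)  # a delete is a write of a tombstone

    def compact(self, now):
        # Drop tombstones whose grace period has elapsed; keep everything else.
        self.cells = {
            k: (ts, v)
            for k, (ts, v) in self.cells.items()
            if not (v is None and now > ts + GC_GRACE)
        }

    def read(self, key):
        cell = self.cells.get(key)
        return None if cell is None else cell[1]


def repair(a, b):
    """Exchange all cells in both directions, tombstones included."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.cells.items()):
            dst.write(key, value, ts)


def scenario(repair_in_time):
    n1, n2 = Replica(), Replica()
    n1.write("k", "v", ts=1)
    n2.write("k", "v", ts=1)
    n1.delete("k", ts=2)  # node #2 is down and misses the delete
    if repair_in_time:
        repair(n1, n2)  # tombstone reaches node #2 within the grace period
    now = 2 + GC_GRACE + 1
    n1.compact(now)
    n2.compact(now)
    repair(n1, n2)  # a late repair (or read repair) after the grace period
    return n1.read("k"), n2.read("k")


if __name__ == "__main__":
    print(scenario(repair_in_time=True))   # (None, None): the delete sticks
    print(scenario(repair_in_time=False))  # ('v', 'v'): the deleted value is back
```

Without the timely repair, node #1 has forgotten the delete by the time the nodes sync, so node #2's old value looks like new data and is copied back.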

gc_grace_seconds to remove tombstone rows in cassandra

I am using the awesome Cassandra DB (3.7.0) and I have questions about tombstones.
I have a table called raw_data. This table has a default TTL of 1 hour and gets new data every second. Another processor then reads one row at a time and removes it.
It seems that this raw_data table becomes slow at reading and writing after several days of running.
Is this because deleted rows are staying around as tombstones? This table already has a TTL of 1 hour. Should I set gc_grace_seconds to something less than 10 days (the default value) to remove tombstones quickly? (By the way, I am running a single-node DB.)
Thank you in advance.
Explicitly deleting data is one way to run into tombstone problems. TTL expiry is the other.
It is pretty normal for reads on such a table to become slower and slower as tombstones accumulate, and Cassandra will eventually refuse to serve reads that scan too many of them (the limit is tombstone_failure_threshold in cassandra.yaml).
Setting gc_grace_seconds to less than the default 10 days is only one part of the equation. The other part is the compaction strategy you use, because tombstones are only removed when a compaction runs.
I'd change my mind about my single-node cluster and I'd go with the minimum standard 3 nodes with RF=3. Then I'd design my project around something that doesn't explicitly delete data. If you absolutely need to delete data, make sure that C* runs compaction periodically and removes tombstones (or force C* to run compactions), and make sure to have plenty of IOPS, because compaction is very IO intensive.
In short, Cassandra uses tombstones to mark data as deleted and to replicate the deletion to other nodes so that the deleted data doesn't reappear. These tombstones are kept until gc_grace_seconds has elapsed. Creating many tombstones can slow down your table. Since you are using a single-node Cassandra, there is nothing to replicate to other nodes, so you can lower gc_grace_seconds to 1 day without any ill effect. If you plan to add new nodes or data centers in the future, revisit this gc_grace_seconds value.
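As an illustration of the suggested change, the small helper below only builds the CQL statement for lowering gc_grace_seconds; it does not connect to a cluster. The helper name `alter_gc_grace_cql` and the keyspace name `my_keyspace` are assumptions made for the example (the question only names the table, raw_data), and identifiers are not quoted or escaped.

```python
def alter_gc_grace_cql(keyspace, table, seconds):
    """Build an ALTER TABLE statement that sets gc_grace_seconds for one table."""
    if seconds < 0:
        raise ValueError("gc_grace_seconds must not be negative")
    return f"ALTER TABLE {keyspace}.{table} WITH gc_grace_seconds = {seconds};"


if __name__ == "__main__":
    one_day = 24 * 3600
    # "my_keyspace" is a placeholder; substitute the keyspace that holds raw_data.
    print(alter_gc_grace_cql("my_keyspace", "raw_data", one_day))
```

The printed statement can be pasted into cqlsh. Lowering the value only makes tombstones eligible for removal sooner; a compaction still has to run before they actually disappear.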

Memsql columnstore data not deleted from disk after TRUNCATE or DROP TABLE

I created a columnstore table in memsql and populated it with around 10 million records, after which I started running several update scenarios. I noticed that the size of the data in /var/lib/memsql/leaf-3307/data/columns keeps increasing constantly and nothing there seems to be deleted. Initially the size of that folder is a couple hundred MB, but it quickly jumps to a couple of GB after some full table updates. The "Columnstore Disk Usage" reported by memsql-ops also increases but at a very slow pace (far from what I see on disk).
This makes me think that data is never actually deleted from disk. The documentation states that running the OPTIMIZE commands should compact the row segment groups and that deleted rows would be removed:
Delete - Deleting a row in a columnstore index causes the row to be marked as deleted in the segment meta data leaving the data in place within the row segment. Segments which only contain deleted rows are removed, and the optimization process covered below will compact segments that require optimization.
Running the OPTIMIZE command didn't help. I also tried truncating the table and even dropping it but nothing helped. The data in the columns folder is still there. The only way I could find of cleaning that up is to DROP the entire database.
This doesn't seem like the desired behavior and I can't find any documentation justifying it. Can anybody explain why this is happening, if it should happen or point me to some relevant documentation?
Thanks in advance
MemSQL will keep around columnstore_window_size bytes of deleted columnstore data on disk per partition database. This is part of the implementation of columnstore replication (it keeps some old files around in case slaves are behind). If you lower the value of that system variable you'll see the disk usage drop. If you're not using redundancy 2, there is no harm in lowering it.

Does nodetool cleanup affect Apache Spark rdd.count() of a Cassandra table?

I've been tracking the growth of some big Cassandra tables using Spark rdd.count(). Up until now the behavior was as expected: the number of rows was constantly growing.
Today I ran nodetool cleanup on one of the seeds, and as usual it ran for 50+ minutes.
And now rdd.count() returns one third of the rows it did before....
Did I destroy data using nodetool cleanup? Or is the Spark count unreliable and was it counting ghost keys? I got no errors during cleanup and the logs don't show anything out of the usual. It did seem like a successful operation, until now.
Update 2016-11-13
Turns out the Cassandra documentation set me up for the loss of 25+ million rows of data.
The documentation is explicit:
Use nodetool status to verify that the node is fully bootstrapped and
all other nodes are up (UN) and not in any other state. After all new
nodes are running, run nodetool cleanup on each of the previously
existing nodes to remove the keys that no longer belong to those
nodes. Wait for cleanup to complete on one node before running
nodetool cleanup on the next node.
Cleanup can be safely postponed for low-usage hours.
Well, you check the status of the other nodes via nodetool status and they are all Up and Normal (UN), BUT here's the catch: you also need to run nodetool describecluster, where you might find that the schemas are not synced.
My schemas were not synced and I ran cleanup when all nodes were UN, up and running normally, as per the documentation. The Cassandra documentation does not mention running nodetool describecluster after adding new nodes.
So I merrily added nodes, waited till they were UN (Up / Normal) and ran cleanup.
As a result, 25+ million rows of data are gone. I hope this helps others avoid this dangerous pitfall. Basically the Datastax documentation sets you up to destroy data by recommending cleanup as a step of the process of adding new nodes.
In my opinion, that cleanup step should be taken out of the new node procedure documentation altogether. It should be mentioned elsewhere that cleanup is a good practice, but not in the same section as adding new nodes... it's like recommending rm -rf / as one of the steps for virus removal. Sure, it will remove the virus...
Thank you Aravind R. Yarram for your reply, I came to the same conclusion as your reply and came here to update this. Appreciate your feedback.
I am guessing you might have either added/removed nodes from the cluster or decreased replication factor before running nodetool cleanup. Until you run the cleanup, I guess Cassandra still reports the old key ranges as part of the rdd.count() as old data still exists on those nodes.
Reference:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCleanup.html
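What cleanup does to a node's local data can be sketched with a toy token ring. This is a simplified model with replication factor 1 and one token per node; `owner` and `cleanup` are hypothetical helpers, not Cassandra or Spark APIs, and the model says nothing about how rdd.count() itself reads data.

```python
def owner(token, ring):
    """Return the node owning `token`.

    `ring` is a list of (node_token, node_name) sorted by node_token. A node
    owns every token up to and including its own, and the ring wraps around.
    """
    for node_token, name in ring:
        if token <= node_token:
            return name
    return ring[0][1]  # wrap around to the first node


def cleanup(node, local_tokens, ring):
    """Keep only the locally stored tokens that `node` still owns."""
    return {t for t in local_tokens if owner(t, ring) == node}


if __name__ == "__main__":
    old_ring = [(50, "A"), (100, "B")]
    new_ring = [(25, "C"), (50, "A"), (100, "B")]  # node C joined at token 25

    on_disk_a = {10, 30, 45}  # data node A wrote while it owned tokens up to 50
    print(cleanup("A", on_disk_a, old_ring))  # nothing to remove on the old ring
    print(cleanup("A", on_disk_a, new_ring))  # token 10 now belongs to C and is dropped

    # If, for whatever reason, C never received token 10, A's copy was the only one.
    on_disk_c = set()
    print(cleanup("A", on_disk_a, new_ring) | on_disk_c)
```

In this model cleanup is only safe once the new owner really holds the data: after node A drops token 10, nothing is left to recover it from if node C's copy is missing.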

Cassandra Load status does not update (nodetool status)

Using the nodetool status I can read out the Load of each node. Adding or removing data from the table should have direct impact on that value. However, the value remains the same, no matter how many times the nodetool status command is executed.
The Cassandra documentation states that the Load value takes 90 seconds to update. Even when I allow several minutes between runs of the command, the result is always wrong. The only way I was able to make this value update was to restart the node.
I don't believe it is relevant, but I should add that I am using docker containers to create the cluster.
In the documentation that you linked, under Load it also says
Because all SSTable data files are included, any data that is not
cleaned up, such as TTL-expired cell or tombstoned data is counted.
It's important to note that when Cassandra deletes data, the data is marked with a tombstone and doesn't actually get removed until compaction. Thus, the load doesn't decrease immediately. You can force a major compaction with nodetool compact.
You can also try flushing the memtable if data is being added. Apache notes that
Cassandra writes are first written to the CommitLog, and then to a
per-ColumnFamily structure called a Memtable. When a Memtable is full,
it is written to disk as an SSTable.
So you either need to add more data until the memtable is full, or you can run a nodetool flush (documented here) to force it.
