How to properly cleanup Cassandra after stress testing? - cassandra

I've been tasked with standing up and prepping our production Cassandra cluster (3.11.1). Everything was fine: I loaded in a few hundred million records with the stress testing tool, great. However, after I was done I ran "DROP KEYSPACE keyspace1;" (the keyspace used by the stress test), assuming this was like MySQL and the space would be cleaned up.
Now I've run nodetool cleanup, flush, truncatehints, clearsnapshot and just about every other command variation I can find. The disk usage is still ~30GB per node and nothing seems to be going on in Cassandra.
So #1 - How do I recover the disk space that is being absorbed by the now-deleted keyspace?
And #2 - How should I have deleted this data, if this was the "wrong way"?

After you drop the keyspace you can delete its directory inside your data directory, which will clean it up; there isn't a command that does it for you.
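A minimal sketch of that sequence, assuming the default data directory /var/lib/cassandra/data and the stress keyspace keyspace1 (adjust paths and names for your layout):

cqlsh -e "DROP KEYSPACE keyspace1;"    # drop the keyspace so no node still serves it
# then, on every node, the leftover directory (including any auto-snapshot taken by the drop) can be removed by hand:
rm -rf /var/lib/cassandra/data/keyspace1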

Like Chris said, we can manually delete the data. Also, to add on to what he said, dropping the data does NOT really remove the data until after the specified gc_grace_seconds has passed. The default is 864000 seconds (10 days).
We can actually modify this by running this in cqlsh:
ALTER TABLE keyspace.table WITH gc_grace_seconds = 5;
And check again with:
SELECT table_name,gc_grace_seconds FROM system_schema.tables WHERE keyspace_name='transcript';
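Lowering gc_grace_seconds only makes the tombstones eligible for removal; they are actually purged when a compaction runs. On a test cluster you can force that by hand, for example (the keyspace and table names below are just placeholders):

nodetool flush keyspace1 standard1      # flush memtables so the tombstones land in SSTables
nodetool compact keyspace1 standard1    # major compaction drops tombstones older than gc_grace_seconds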

Related

Cassandra data directory does not get updated with deletion

Currently, I am benchmarking the Cassandra database using the YCSB framework. During this time I have performed (batch) insertions and deletions of data quite regularly.
I am using the Truncate command to delete keyspace rows. However, I am noticing that my Cassandra data directory swells up as the experiments run.
I have checked and can confirm that there is no data left in the keyspace, yet the data directory is still large. Is there a way to make Cassandra automatically release the stored space, or does it happen over time?
When you use Truncate, Cassandra will create snapshots of your data.
To disable this you will have to set auto_snapshot: false in the cassandra.yaml file.
If you are using Delete, then Cassandra uses tombstones, i.e. your data will not get deleted immediately. The data will only be removed once compaction has run.
To remove previous snapshots you can use the nodetool clearsnapshot command.
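For example (the keyspace name below is a placeholder, and the exact flags vary a little between Cassandra versions):

nodetool listsnapshots                  # show existing snapshots and the space they hold
nodetool clearsnapshot -- mykeyspace    # remove snapshots for a single keyspace
nodetool clearsnapshot                  # remove all snapshots on this node (newer versions may require --all)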

Local cassandra for testing purposes getting slower over time

I do know that it's a Cassandra anti-pattern to delete rows (and more so to do it frequently), but in my simple use case I have a local Cassandra instance (single node, replication factor set to 1) that I use for unit tests, which drop all tables before running, naturally to perform the tests with a clean slate.
Over time, the performance of this Cassandra instance degraded severely. It surprised me a bit that dropping the keyspaces altogether didn't help at all. Only by manually deleting everything in the Cassandra data directory did I manage to recover the performance.
This solution is quite fine for me as I don't care about the test data I delete over and over again, but it certainly feels a bit weird to have to delete these things manually on the file system. Is there a better way to deal with such a situation? Or am I going about this whole case completely wrong?
Based on the little information provided, here is some info:
First, deleting data creates tombstones in Cassandra. The default behavior is to keep these tombstones for 10 days, set by the variable gc_grace_seconds.
Given you only have 1 node and don't care about the data once you delete it, you could set gc_grace_seconds to zero. You could also make sure to run a compaction after you do a lot of deletes.
Documentation here:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html
Lastly, there is a feature known as TTL, Time To Live. You could use that instead of deleting and let the database do the "deletes" once the data expires. If you go this route, I would still set gc_grace_seconds to zero and run compactions (via an hourly cronjob, since it's a dev environment).
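A rough sketch of that setup, with the keyspace and table names as placeholders:

ALTER TABLE test_ks.test_table
  WITH default_time_to_live = 86400   -- rows expire automatically after one day
  AND gc_grace_seconds = 0;           -- tombstones become purgeable immediately

# crontab entry on the dev box: force a major compaction every hour
0 * * * * nodetool compact test_ks test_table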

Remove all data Cassandra?

I have an eight-node Cassandra setup. I am saving data with a 3-day TTL, but the data is useless after I take a summary (using my Java script: counts of things, etc.). I want to delete all the data in a table. I can stop Cassandra for some time to do the deletion, so the data is removed from all nodes.
Should I run truncate and nodetool repair afterwards, or should I flush first and then delete? What's the proper way to do it?
You can drop the tables or truncate them... but keep in mind that Cassandra will snapshot your tables by default, so you'll also need to run nodetool clearsnapshot on all of your nodes afterwards. There is no need to stop Cassandra while you do this delete.
I don't know that there is a right way per se... but what I do when I need to clear a table is, first, run truncate on the table using cqlsh. Then I run nodetool clearsnapshot on my nodes using pssh (https://code.google.com/p/parallel-ssh/).
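Something along these lines, where the keyspace/table names and the host file are placeholders:

cqlsh -e "TRUNCATE mykeyspace.mytable;"                # truncate via cqlsh on one node
pssh -h cassandra_hosts.txt "nodetool clearsnapshot"   # then clear the resulting snapshots on every node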
Hope this helps

Drop table or truncate table in Cassandra, which is better

We have a use case where we need to re-create a table every day with current data in Cassandra. For this, should we use drop table or truncate table, and which would be more efficient? We do not want the data to be backed up, etc.
Thanks
Ankur
I think for almost all cases Truncate is a safer operation than a drop and recreate. There have been several issues with dropping/recreating in the past: ghost data, schema disagreement, etc. Although there have been a number of fixes to try to make drop/recreate more stable, if it's an operation you are performing every day, Truncate should be much cheaper and more stable.
Drop table drops the table and all data. Truncate clears all data in the table, and by default creates a snapshot of the data (but not the schema). Efficiency-wise they're close, though truncate will create the snapshot. You can disable this by setting auto_snapshot to false in the cassandra.yaml config (sketched below), but it is server-wide. If it's not too much trouble, I'd drop and recreate the table, but I've seen issues if you don't wait a while after the drop before recreating.
Source : https://support.datastax.com/hc/en-us/articles/204226339-FAQ-How-to-drop-and-recreate-a-table-in-Cassandra-versions-older-than-2-1
NOTE: By default, snapshots are created when tables are dropped or truncated. These will need to be cleaned out manually to reclaim disk space.
Tested manually as well.
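For reference, the server-wide setting mentioned above is a single line in cassandra.yaml (and needs a node restart to take effect):

# cassandra.yaml: skip the automatic snapshot normally taken on TRUNCATE and DROP
auto_snapshot: false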
Truncate will keep the schema though, drop will not.
Beware!
From datastax documentation: https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlTruncate.html
Note: TRUNCATE sends a JMX command to all nodes, telling them to delete SSTables that hold the data from the specified table. If any of these nodes is down or doesn't respond, the command fails and outputs a message like the following:
truncate cycling.user_activity;
Unable to complete request: one or more nodes were unavailable.
Unfortunately, there is nothing in the documentation saying whether DROP behaves differently.

Cassandra not removing deleted rows despite running nodetool compact

Very often I have ghost rows that stay on the server and won't disappear after deleting a row in Cassandra.
I have tried all possible administration options with nodetool (compact, flush, etc.) and also connected to the cluster with jconsole and forced a GC through it, but the rows remain on the cluster.
For testing purposes I updated some rows with a TTL of 0 before doing the DELETE, and those rows disappeared completely.
Do I need to live with that or can I somehow trigger a final removal of these deleted rows?
My testcluster uses Cassandra 1.0.7 and has only one single node.
This phenomenon that you are observing is the result of how distributed deletes work in Cassandra. See the Cassandra FAQ and the DistributedDeletes wiki page.
Basically the row will be completely deleted after GCGraceSeconds has passed and a compaction has run.
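Concretely, once GCGraceSeconds has elapsed (or been lowered on the column family), flushing and then compacting should make the rows disappear; the keyspace and column family names below are placeholders:

nodetool flush MyKeyspace MyColumnFamily      # write pending mutations, including the tombstones, to SSTables
nodetool compact MyKeyspace MyColumnFamily    # major compaction purges tombstones older than GCGraceSeconds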
