Why would dropping a table in Cassandra take many minutes?

When I drop a table in Cassandra, it takes close to forever.
From what I can see, this is because it first creates a snapshot of the table. What I do not understand is that the snapshot is supposedly done by creating a hard link (copy-on-write) to the existing data files. So why would it still take that long? Once the hard link is created, deleting the original file should take a split second, right?
On my VMs, and even on my main computer, it can take minutes to drop one single table!

One minute is a bit high and I don't think a snapshot can take that long. What you are probably seeing is the memtables flushing before the snapshots are taken, and a flush could lead to a compaction.
Try disabling the auto_snapshot property in cassandra.yaml and check again how long the drop takes. Also check the number of SSTables: without snapshots Cassandra will remove all of them on a drop, and as long as their number is low enough performance should be fine, but when you have a very large number of SSTables the unlink speed of your filesystem becomes the bottleneck.
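For example (the keyspace and table names here are just placeholders), turn off automatic snapshots in cassandra.yaml:

auto_snapshot: false

and count the SSTables for the table before dropping it:

nodetool tablestats my_keyspace.my_table | grep "SSTable count"

(on older versions the command is nodetool cfstats). If the count is in the thousands, the drop time is dominated by unlinking files rather than by the snapshot itself.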

Related

Cassandra: how to automatically delete old records to avoid disk space shortage?

We are using TWCS for time series data, with a default TTL of 30 days and a compaction window size of 1 day.
Unfortunately, there are cases when the incoming data rate gets higher and there is not much disk space left to write it. At the same time, due to budget constraints, adding new nodes to the cluster is not an option. Currently we resort to manually deleting old SSTables, but that is error prone.
What is the best way, in the TWCS case, to make Cassandra delete, say, all records that are older than a certain date? I mean not creating tombstones in a new SSTable, but actually deleting the old records from disk to free up space.
Of course, I can reduce the TTL, but that only affects new records (so it helps in the long run, not immediately), and when there is not much incoming data, records would be stored for a shorter period than they could be.
Basically, automatically removing old data is exactly what TTLs are meant for. An explicit deletion always creates a tombstone, and tombstones don't work well with TWCS. So right now the solution would be to stop the node, remove the old files to free space, and start the node again, repeating on all nodes. But you're doing that already.
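If you stay with the manual approach, the per-node sequence is roughly the following (the service name and tooling are assumptions about your install, so adjust them):

nodetool drain
sudo service cassandra stop
# remove only SSTables that are fully expired; sstablemetadata or
# sstableexpiredblockers can tell you which ones are safe to delete
sudo service cassandra start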

Time window compaction strategy on data with TTLed inserts followed by TTLed updates

I am facing a problem with Cassandra compaction on a table that stores event data. These events are generated by sensors and have an associated TTL. By default each event has a TTL of 1 day. A few events have a different TTL, such as 7, 10, or 30 days, which is a business requirement. A few events can have a TTL of 5 years if the event needs to be retained. More than 98% of rows have a TTL of 1 day.
Although minor compaction is triggered from time to time, disk usage is constantly increasing. This is because of how size-tiered compaction strategy (STCS) works, i.e. it chooses SSTables of similar size for compaction. This creates a few huge SSTables that aren't compacted for a long time. The presence of a few large SSTables increases the average SSTable size, so compaction runs less frequently. It looks like STCS is not the right choice. In a load-test environment, I added data to the tables and switched to leveled compaction strategy (LCS). With LCS, disk space was reclaimed up to a certain point and then disk usage stayed constant. CPU usage was also lower compared to STCS. However, time window compaction strategy (TWCS) looks more promising, as it works well for time-series TTLed data. I am going to test TWCS with my dataset. Meanwhile, I am trying to find answers to a few questions for which I didn't find an answer, or whatever I found was not clear to me.
In my use case, an event is added to the table with an associated TTL. Then there are 5 more updates to the same event within the next minute. Updates are not made to a single column; instead the complete row is re-written with a new TTL (which is the same for all columns). This new TTL is likely to be slightly less than the previous TTL. For example, an event is created with a TTL of 86400 seconds. If it is updated after 5 seconds, the new TTL would be 86395. A further update would come with a new TTL slightly less than 86395. After 4-5 updates, no more updates are made to more than 99% of rows. 1% of rows are re-written with a TTL of 5 years.
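To illustrate the write pattern (the table and column names here are made up for the example):

INSERT INTO events (event_id, payload) VALUES ('e1', '...') USING TTL 86400;
-- ~5 seconds later the whole row is re-written with the remaining TTL
INSERT INTO events (event_id, payload) VALUES ('e1', '...') USING TTL 86395;
-- after 4-5 such re-writes the row is left alone; ~1% of rows are later
-- re-written with USING TTL 157680000 (5 years)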
From what I read, TWCS is for data inserted with an immutable TTL. Does this mean I should not use TWCS?
Also, out-of-order writes are not handled well by TWCS. If an event is created at 10 AM on 5th Sep with a 1-day TTL, and the same event row is re-written with a TTL of 5 years on 10th or 12th Sep, would that be an out-of-order write? I suppose out-of-order would be when I am setting the timestamp on the data myself while adding it to the DB, or something caused by read repair.
Any guidance/suggestion will be appreciated!
NOTE: I am using Cassandra 2.2.8, so I'll be building the TWCS jar myself and then using it.
TWCS is a great option under certain circumstances. Here are the things to keep in mind:
1) One of the big benefits of TWCS is that merging/reconciliation among sstables does not occur. The oldest one is simply "lopped" off. Because of that, you don't want to have rows/cells span multiple "buckets/windows".
For example, suppose you insert a single column during one window and then in the next window you insert a different column (i.e. an update of the same row, but a different column, at a later point in time). Instead of compaction creating a single row with both columns, TWCS would lop one of the columns off (the oldest). Actually, I am not sure TWCS even allows this to occur; I was giving you an example of what would happen if it did. In this example, I believe TWCS will disallow the removal of either SSTable until both windows expire. Not 100% sure though. Either way, avoid this scenario.
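To make the scenario concrete (hypothetical table, for illustration only):

-- window 1 (day 1):
INSERT INTO readings (id, col_a) VALUES (1, 'x');
-- window 2 (day 2), same row, different column:
INSERT INTO readings (id, col_b) VALUES (1, 'y');

The cells of that row now live in SSTables belonging to two different windows, so neither SSTable can be dropped on its own without losing part of the row.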
2) TWCS has similar problems when out-of-order writes occur (overlap). There is a great article by The Last Pickle that explains this:
https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Overlap can occur from repair or from an old compaction (i.e. if you were using STCS and then switched to TWCS, some of the SSTables may overlap).
If there is overlap between, say, 2 SSTables, you have to wait for both SSTables to completely expire before TWCS can remove either of them, and when it does, both will be removed.
If you avoid both scenarios described above, TWCS is very efficient due to the nature of how it cleans things up - no more merging sstables. Simply remove the oldest window.
When you do set up TWCS, you have to remember that the oldest window gets removed only after the TTLs expire and GC grace passes as well - don't forget to add that part. Having varying TTL values among rows, as you have described, may delay windows from getting removed. If you want to see what is blocking TWCS from removing an SSTable, or what the SSTables look like, you can use sstableexpiredblockers or the script in the above-mentioned URL (which is essentially sstablemetadata with some fancy scripting).
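For example (keyspace and table names are placeholders):

sstableexpiredblockers my_keyspace my_table

shows which newer SSTables are preventing older, fully expired ones from being dropped, and

sstablemetadata <path-to-a-Data.db-file>

prints the min/max timestamps and estimated droppable tombstones for a single SSTable.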
Hopefully that helps.
-Jim

Cleanup space in almost full Cassandra Node

I have a Cassandra cluster (2 DCs) with 6 nodes each and RF 2. 4 of the nodes (in each DC) are getting full, so I need to clean up space very soon.
I tried to run a full repair, but it ended up being a bad idea since disk usage increased even more and the repair eventually hung. As a last resort I am thinking of repairing and then running cleanup on specific column families, starting from the smallest and going to the biggest, i.e.:
nodetool repair -full foo_keyspace bar_columnfamily
nodetool cleanup foo_keyspace bar_columnfamily
Do you think that this procedure will be safe for the data?
Thank you
The commands that you presented in your question make several incorrect assumptions. First, "repair" is not supposed to, and will not, save any space. All repair does is find inconsistencies between different replicas and repair them. It will either do nothing (if there are no inconsistencies) or add data; it does not remove data.
Second, "cleanup" is something you need to do after adding new nodes to the cluster - after each node sent some of its data to the new node, a "cleanup" removes the data from the old nodes. But cleanup is not relevant when not adding node.
The command you may be looking for is "compact". This can save space, but only when you know you had a lot of overwrites (rewriting existing rows), deletions, or data expirations (TTL). What compaction strategy are you using? If it's the default, size-tiered compaction strategy (STCS), you can start a major compaction (nodetool compact), but you should be aware of the big risk involved:
Major compaction merges all the data into one SSTable (Cassandra's on-disk file format), dropping deleted, expired, or overwritten data. However, during this compaction process you have both input and output files, and in the worst case this may double your disk usage, and it may fail if the disk is more than 50% full. This is why a lot of Cassandra best-practice guides suggest never filling more than 50% of the disk. But this is just the worst case. You can get along with less free space if you know that the output file will be much smaller than the input (because most of the data has been deleted). Perhaps more usefully, if you have many separate tables (column families), you can compact each one separately (as you suggested, from smallest to biggest), and the maximum amount of disk space needed temporarily during the compaction can be much less than 50% of the disk.
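For example, reusing the placeholder names from the question, you could compact the tables one at a time, smallest first, and only move on when each one finishes:

nodetool compact foo_keyspace bar_columnfamily

You can watch the remaining work with nodetool compactionstats.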
Scylla, a C++ reimplementation of Cassandra, is developing something known as "hybrid compaction" (see https://www.slideshare.net/ScyllaDB/scylla-summit-2017-how-to-ruin-your-performance-by-choosing-the-wrong-compaction-strategy) which is like Cassandra's size-tiered compaction but does compaction in small pieces instead of generating one huge file, to avoid the huge temporary disk usage during compaction. Unfortunately, Cassandra doesn't have this feature yet.
A good idea is to first run repair on the smallest table in the smallest keyspace, one table at a time, and let each repair complete. It will take time, but it is a safer approach, with less chance of hanging or of traffic loss.
Once the repair has completed, run cleanup in the same way, table by table. This way there is little impact on the node and on the cluster.
You shouldn't fill more than about 50-60% of your disks, to leave room for compaction. If you're above that amount of disk usage, you need to consider getting bigger disks or adding more nodes.
Datastax recommendations are usually good to follow: https://docs.datastax.com/en/dse-planning/doc/planning/planPlanningDiskCapacity.html

Freeing disk space of overwritten data?

I have a table whose rows get overwritten frequently using the regular INSERT statements. This table holds ~50GB data, and the majority of it is overwritten daily.
However, according to OpsCenter, disk usage keeps going up and is not freed.
I have validated that rows are being overwritten and not simply being appended to the table. But they're apparently still taking up space on disk.
How can I free disk space?
Under the covers, what Cassandra does during these writes is append a new version of the row, with a newer timestamp, to an SSTable. When you perform a read, the newest version (based on timestamp) is returned to you as the row. However, this also means that you are using twice the disk space. It is not until Cassandra runs a compaction operation that the older versions are removed and the disk space is recovered. Here is some information on how Cassandra writes to disk which explains the process:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_write_path_c.html?scroll=concept_ds_wt3_32w_zj__dml-compaction
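In other words, with a table like the one below (the names are made up for illustration), the second INSERT does not replace the first on disk; both copies exist until a compaction merges them:

INSERT INTO my_table (id, payload) VALUES (42, '...v1...');
-- later the same day, same primary key:
INSERT INTO my_table (id, payload) VALUES (42, '...v2...');

Reads return v2 (the newest timestamp wins), but v1 still occupies space in an older SSTable until a compaction rewrites it.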
A compaction is done on a node-by-node basis and is a very disk-intensive operation which may affect the performance of your cluster while it is running. You can run a manual compaction using the nodetool compact command:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html
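For example (keyspace and table names are placeholders):

nodetool compact my_keyspace my_table

You can follow the progress with nodetool compactionstats, and disk usage should drop once the old SSTables are deleted at the end of the compaction.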
As Aaron mentioned in his comment above, overwriting all the data in your cluster daily is not really the best use case for Cassandra, because of issues such as this one.

What's the function of snapshot in Cassandra?

Although I checked the DataStax documentation about snapshots, I am still confused about what a snapshot in Cassandra is. What is the function or main purpose of a snapshot?
Under the snapshots folder, I find some subfolders named following this convention:
1426256545571-tablename
What does the number at the very beginning mean? Anyway, I just need an easy way to understand what a snapshot is.
The number is the number of milliseconds since the epoch (a timestamp). A snapshot is just a local backup. It is taken automatically for some types of operations, like truncate (in case it was done by accident and you want to undo it).
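For example, the 1426256545571 in the folder name above converts to a human-readable date like this (GNU date, dropping the milliseconds):

date -u -d @1426256545

which gives Fri Mar 13 14:22:25 UTC 2015, i.e. when the snapshot was taken.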
They are very fast and don't cost any extra disk space up front, since they're just hard links to the immutable data files. Eventually you want to clean them up, though, to reclaim disk space as compactions occur. You can disable the auto_snapshot option in cassandra.yaml if you don't want them anymore. You will likely still see them created while doing repairs.
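To see what snapshots exist on a node and to remove them (the keyspace name is a placeholder, and the exact flags vary a bit between Cassandra versions):

nodetool listsnapshots
nodetool clearsnapshot -t 1426256545571 my_keyspace

Running nodetool clearsnapshot without a snapshot name removes all snapshots for the given keyspaces (newer versions require --all to clear everything).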
