What does it mean by "Key cache save of system.KeyCache" on Opscenter? - cassandra

I have running Cassandra cluster and opscenter too.I found that suddenly came "Key cache save of system.KeyCache" when I ran nodetool compactionstats. Also, found same on opscenter. Is there any performance impact due to this?

The key cache saving reuses the compaction manager so it shows up in the current compaction tasks which will show up in compactionstats (and also opscenter). It shouldn't cause any performance issues but if it takes a real long time it might block regular compactions (if your concurrent compactions are lower) from completing.
This is really just so when a node starts up you don't have to wait for the key cache to warm up to improve read performance so not critical, and with a low hit rate may not be very meaningful. If they take a long time to save it may be that your data model has a ton of small partitions so the keycache has many entries that need to be serialized. In that case I would recommend setting key_cache_keys_to_save in your cassandra.yaml to something like 100, 1000 or something you can tune until your save time is more reasonable.

Related

How to limit number of validation compactions during repair on Apache Cassandra 3.11

While running full repair on a cassandra cluster with 15 nodes, RF=3 and 3racks(single datacenter) using command ./nodetool repair -pr -full -seq I can see multiple validation compactions running at the same time (>10). Is there any way to limit simultaneous validations in cassandra 3.11.1 like we can limit normal compactions?
As the cluster size has increased, I limited repairs to run table by table and also used -pr and -seq to restrict load on the nodes. But now, the load is very high due to concurrent validation compactions. Need a way to restrict concurrent validation compactions to reduce load on nodes during repairs. I'm also exploring reaper to manage repairs but need to find some workaround for the load issues till I use reaper.
If you're seeing (validation) compactions becoming cumbersome, there are two settings that you should look at:
compaction_throughput_mb_per_sec
concurrent_compactors
compaction_throughput_mb_per_sec
This is the main tuneable setting for compaction. I mentioned this setting in a related answer here: Advise on stopping compaction to reduce slowness
I would recommend checking this setting, and then reducing it until contention is resolved. Or, you could try to set compaction throughput to 1 (the lowest setting) during the day. Then, raise it back up once business hours are over.
% bin/nodetool setcompactionthroughput 1
% bin/nodetool getcompactionthroughput
Current compaction throughput: 1 MB/s
But definitely check it first, just to see what you're running at, and maybe consider halving that and check the effect.
concurrent_compactors
So this defaults to the smaller of (number of disks, # number of cores), with a minimum of 2 and a maximum of 8. There is some solid advice out there around forcing this to a value of 1 if you're using spinning disks, and maybe set it to 4 for SSDs. The default is usually fine, but if it's too high, compactions can overwhelm disk I/O.
tl;dr;
Focus on compaction throughput for now. My advise is to check it, lower it, observe it, and repeat until things improve.

Cleanup space in almost full Cassandra Node

I have a Cassandra Cluster (2 DC) with 6 nodes each and RF 2. 4 of the nodes (in each DC) getting full so I need to cleanup space very soon.
I tried to run a full repair but ended up as a bad idea since the space start increased even more and the repair eventually hanged. As a last solution I am thinking to start repairing and then cleanup specific columns starting from the smallest to the biggest.
i.e
nodetool repair -full foo_keyspace bar_columnfamily
nodetool cleanup foo_keyspace bar_columnfamily
Do you think that this procedure will be safe for the data?
Thank you
The commands that you presented in your question make several incorrect assumptions. First, "repair" is not supposed to, and will not, save any space. All repair does is to find inconsistencies between different replicas and repair them. It will either do nothing (if there's no inconsistencies), or add data, not remove data.
Second, "cleanup" is something you need to do after adding new nodes to the cluster - after each node sent some of its data to the new node, a "cleanup" removes the data from the old nodes. But cleanup is not relevant when not adding node.
The command you may be looking for is "compact". This can save space, but only when you know you had a lot of overwrites (rewriting existing rows), deletions or data expirations (TTL). What compaction strategy are you using? If it's the default, size-tiered compaction strategy (STCS) you can start major compaction (nodetool compact) but should be aware of a big risk involved:
Major compaction merges all the data into one sstable (Cassandra's on-disk file format), dropping deleted, expired or overwritten data. However, during this compaction process, you have both input and output files, and at worst case this may double your disk usage, and may fail if the disk is more than 50% full. This is why a lot of Cassandra best-practice guides suggest never to fill more than 50% of the disk. But this is just the worst case. You can get along with less free space if you know that the output file will be much smaller than the input (because most of the data has been deleted). Perhaps more usefully, if you have many separate tables (column family), you can compact each one separately (as you suggested, from smallest to biggest) and the maximum amount of disk space needed temporarily during the compaction can be much less than 50% of the disk.
Scylla, a C++ reimplementation of Cassandra, is developing something known as "hybrid compaction" (see https://www.slideshare.net/ScyllaDB/scylla-summit-2017-how-to-ruin-your-performance-by-choosing-the-wrong-compaction-strategy) which is like Cassandra's size-tiered compaction but does compaction in small pieces instead of generating one huge file, to avoid the huge temporary disk usage during compaction. Unfortunately, Cassandra doesn't have this feature yet.
Good idea is first start repair on smallest table on smallest keyspace one by one and complete repair. It will take time but safer way and no chance to hang and traffic loss.
Once repair completed start cleanup in the same way as repair. This way no impact on node and cluster as well.
You shouldn't fill more than about 50-60 % of your disks to make room for compaction. If you're above that amount of disk usage you need to consider getting bigger disks or add more nodes.
Datastax recommendations are usually good to follow: https://docs.datastax.com/en/dse-planning/doc/planning/planPlanningDiskCapacity.html

Cassandra 2.1 speed up full compaction

I have a Cassandra 2.1 cluster using Leveled Compaction Strategy.
Base on my calculation, the cluster will run out of space before compaction kick in automatically when it reaches next level. For that reason, I have a cron job that runs "nodetool compact" every week to run a full (major) compaction to remove tomb stoned data points.
I noticed that full compaction consumes very little CPU/network resources. With bigger data set, full compaction runs for days.
I have tried to "setcompactionthroughput" to higher number (128MB/s instead of 32MB/s by default, even tried to set it to 0 (no limit), but full compaction speed doesn't seem to change at all.
Is there anything I can tune to make it faster? Thanks in advance.
There are very few cases where you should run full compaction via nodetool compact - it causes what you're likely seeing now (a single huge data file, which never naturally compacts with other sstables, even/especially when other deletions have happened).
Recovering from the state your in isn't trivial, but is possible. If you have a lot of cpu/IO to spare, you can try toggling from STCS to LCS, and LeveledCompactionStrategy will naturally split up that huge file into thousands of tiny files, and will be much more aggressive about rewriting those files over time (so tombstones are compacted away much more regularly). This is very much CPU and IO intensive, so don't do it if you're near tipping. Also, it will duplicate all data on disk for a short period, so you'll need to be under 50% disk utilization to do this.
If you're over 50% disk utilization, you've backed yourself into a corner, and you'll probably need to add more disk temporarily in order to recover.

Cassandra, removing old, not needed data

I have a two-node Cassandra cluster, with RF of 2. So both nodes contain 100% of data.
Now, I am running short on disk space. I can remove some old data, since they were aggregated and processed before, and I don't need them anymore.
I tried running a delete query from cqlsh, but I get a timeout. I tried increasing timeouts, but it seems that running a query from cqlsh will take much more time.
How can I disable this timeout for a single query or connection? Is there any other way, besides increasing timeout, to remove some data from a node?
My Cassandra version is 3.11.0.
PS. I increases write_request_timeout_in_ms in cassandra.yaml. Is this the correct one for delete queries?
Deletes really shouldn't timeout unless there is a problem related to something else. Its inserting a tombstone with no reads or anything and should be fast/cheap regardless of what exists already. Reading on other hand can be impacted a lot. I would guess GC related problems related to reads. You could check GC logs and maybe increase heap and reduce CMSInitiatingOccupancyFraction (if using cms and not g1).
So check GC and normal logs for issues (look for WARN, ERROR in system log) and at pause times in gc logs >1 second, there should be none.
After issuing delete you could try to do a force compaction (nodetool compact keyspace table) to see if it helps disk space. The delete by itself will not reduce disk space until the data has been compacted with the tombstone.
write_request_timeout_in_ms is the right setting, but if your hitting it something is wrong and your just masking it. It should really take less than 1 millisecond normal use.
Side note: RF=2 on a 2 node cluster is not how C* is designed to run. You have no availability on a database that sacrificed consistency for high availability.

Cassandra Compaction takes all the resources and leads to node failure

I met very strange problem during testing cassandra. I have a very simple column family that stores video data (keys point to time period and there is only one column containing ~2MB video for this period).
Use Case
I start to load data using Hector API (round-robin) to 6 empty nodes (8GB RAM for Cassandra)- load is run in 4 threads adding 4 rows in second for each thread.
After a while (running load for hour or so) near 100-200 GB are added to the node (depending on the replication factor) and then one or several nodes become unreachable. (no pinging just reboot helps)
Why Compaction
I do use tiered-level compaction and monitoring the system(Debian) i can see that it actually not writes but compaction that takes almost all resources (disk, memory) and cause server to refuse writes and than fail.
After like 30-40 minutes of test compaction tasks just cannot be handled and get queued. Interesting thing is that there are no deletes and updates - so compaction just reads/writes data again and again without bringing actual value to me (like it can be compacted once in the evening).
When i slow down the pace - i.e running 2 threads with 1 second delay things go better but whether it still be working when i have 20TB not 100 GB on a node.
Is Cassandra optimized for such type of workload? How the resources are normally distributed between compaction and reads/writes?
Update
Update of network driver solved problem with unreachable cluster
Thanks,
Sergey.
Cassandra will use up to in_memory_compaction_limit_in_mb memory for a compaction. It is routine to have compaction running while reads and writes are served simultaneously. It is also normal that compaction can fall behind if you continue to throw writes at it as fast as possible; if your read workload requires that compaction be up to date or close to it at all times, then you'll need a larger cluster to spread the load around more machines.
Recommended amount of disk per node for online queries is up to 500GB, maybe 1TB if you're pushing it. Remember that this amount of data will have to be rebuilt if a node fails. Typical Cassandra workloads are CPU-bound or iops-bound, not disk-space bound, so you won't be able to make good use of that space anyway.
(It's also possible to do batch analytics against Cassandra, which we do with the Cassandra Filesystem, in which case higher disk:cpu ratios are desirable, but we use a custom compaction strategy for that as well.)
It's not clear from your report why a server would become unreachable. This is really an OS-level problem. (Are you swapping? Disabling swap would be a good first step.)

Resources