Cassandra: impact of high compaction/streaming throughput values?

The default throughput values are as follows (cassandra.yaml):
compaction_throughput_mb_per_sec: 16
stream_throughput_outbound_megabits_per_sec: 200
inter_dc_stream_throughput_outbound_megabits_per_sec: 200
To speed up things like compaction, I have set these values:
$ nodetool getcompactionthroughput
Current compaction throughput: 10000 MB/s
$ nodetool getstreamthroughput
Current stream throughput: 10000 Mb/s
$ nodetool getinterdcstreamthroughput
Current inter-datacenter stream throughput: 10000 Mb/s
Cassandra data directories are backed by both SSD and HDD, depending on the keyspace.
Is there any impact (such as read or write timeouts) from applying such very high values?
Thank you

The only impact I can see from a high compaction throughput is on your application traffic (reads/writes). SSDs have a finite IOPS capacity, so giving most of it to compaction will slow down your traffic while compaction is running.
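One detail worth keeping in mind when comparing the settings in the question: going by the property names, the compaction limit is expressed in megabytes per second while the two streaming limits are in megabits per second. A minimal sketch of the conversion (the helper name is made up for illustration):

```python
def megabits_to_megabytes(mbit_per_s: float) -> float:
    """Convert a rate in megabits/s to megabytes/s (8 bits per byte)."""
    return mbit_per_s / 8.0


# Default streaming limits quoted in the question, in megabits per second.
defaults_mbit = {
    "stream_throughput_outbound_megabits_per_sec": 200,
    "inter_dc_stream_throughput_outbound_megabits_per_sec": 200,
}

for name, value in defaults_mbit.items():
    print(f"{name}: {value} Mb/s = {megabits_to_megabytes(value):.0f} MB/s")
```

So the same number means an eight times lower byte rate for streaming than it does for compaction.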


Troubleshooting and fixing Cassandra OOM issue

Although there are multiple threads about OOM issues, I would like to clarify certain things. We are running a 36-node Cassandra 3.11.6 cluster in Kubernetes with 32 GB allocated to each container.
The container is getting OOM-killed (note: not a Java heap OOM error, but the Linux cgroup OOM killer) because it reaches the 32 GB memory limit of its cgroup.
Stats and configs
map[limits:map[ephemeral-storage:2Gi memory:32Gi] requests:map[cpu:7 ephemeral-storage:2Gi memory:32Gi]]
Cgroup Memory limit
34359738368 -> 32 Gigs
The JVM sizes auto-calculated by Cassandra: -Xms19660M -Xmx19660M -Xmn4096M
Grafana Screenshot
Cassandra Yaml -->
JVM Options -->
Nodetool info output on a node which is already consuming 98% of the memory
nodetool info
ID : 59c53bdb-4f61-42f5-a42c-936ea232e12d
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 179.71 GiB
Generation No : 1643635507
Uptime (seconds) : 9134829
Heap Memory (MB) : 5984.30 / 19250.44
Off Heap Memory (MB) : 1653.33
Data Center : datacenter1
Rack : rack1
Exceptions : 5
Key Cache : entries 138180, size 99.99 MiB, capacity 100 MiB, 9666222 hits, 10281941 requests, 0.940 recent hit rate, 14400 save period in seconds
Row Cache : entries 10561, size 101.76 MiB, capacity 1000 MiB, 12752 hits, 88528 requests, 0.144 recent hit rate, 900 save period in seconds
Counter Cache : entries 714, size 80.95 KiB, capacity 50 MiB, 21662 hits, 21688 requests, 0.999 recent hit rate, 7200 save period in seconds
Chunk Cache : entries 15498, size 968.62 MiB, capacity 1.97 GiB, 283904392 misses, 34456091078 requests, 0.992 recent hit rate, 467.960 microseconds miss latency
Percent Repaired : 8.28107989669628E-8%
Token : (invoke with -T/--tokens to see all 256 tokens)
What has been done
We have made sure there is no memory leak in the Cassandra process, since we run custom trigger code. GC log analysis shows we occupy roughly 14 GB of the total JVM space.
We know Cassandra also uses off-heap memory (Bloom filters, memtables, etc.).
The Grafana screenshot shows the node using 98% of the 32 GB. JVM heap = 19.5 GB, and the off-heap memory in the nodetool info output = 1653.33 MB (about 1.6 GB), so JVM heap + off-heap is about 21 GB. Where is the remaining memory (about 10 GB)? How can we account exactly for what is occupying it? (nodetool tablestats and nodetool cfstats output are not shared for compliance reasons.)
Our production cluster requires tons of approvals, so deploying with remote jconsole is tough. Are there any other ways to account for this memory usage?
Once we have accounted for the memory usage, what are the next steps to fix this and avoid the OOM kill?
There's a good chance that the SSTables are getting mapped to memory (cached with mmap()). If this is the case, it wouldn't be immediate: memory usage would grow over time as SSTables are read and then cached. I've written about this issue in
There's an issue with a not-so-well-known configuration property called "disk access mode". When it's not set in cassandra.yaml, it defaults to mmap, which means that all SSTables get mmapped to memory. If so, you'll see an entry in the system.log on startup that looks like:
INFO [main] 2019-05-02 12:33:21,572 - \
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
The solution is to configure disk access mode to only cache SSTable index files (not the *-Data.db component) by setting:
disk_access_mode: mmap_index_only
For more information, see the link I posted above. Cheers!
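One way to check this without JMX access is to read the kernel's own per-mapping accounting. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/smaps of the Cassandra process is readable; the function names, the categories and the sample paths are invented for illustration. It sums the resident size (Rss) of each mapping, grouped by whether the mapped file is an SSTable data file, an SSTable index file, anonymous memory (heap and other JVM allocations), or something else.

```python
import re
from collections import defaultdict

# Header line of one mapping in /proc/<pid>/smaps:
# address-range  perms  offset  dev  inode  [pathname]
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$")


def classify(path: str) -> str:
    """Bucket a mapping by its backing file (categories are illustrative)."""
    if path.endswith("-Data.db"):
        return "sstable_data"
    if path.endswith("-Index.db"):
        return "sstable_index"
    if path == "" or path.startswith("["):
        return "anonymous"
    return "other_files"


def rss_by_category(smaps_text: str) -> dict:
    """Sum the 'Rss:' values (in kB) of all mappings, per category."""
    totals = defaultdict(int)
    category = None
    for line in smaps_text.splitlines():
        match = HEADER.match(line)
        if match:
            category = classify(match.group(1).strip())
        elif line.startswith("Rss:") and category is not None:
            totals[category] += int(line.split()[1])
    return dict(totals)


# Made-up sample in the smaps layout, so the sketch runs anywhere.
SAMPLE = """\
7f0000000000-7f0000800000 r--s 00000000 fd:01 1001 /var/lib/cassandra/data/ks/t/md-1-big-Data.db
Size:               8192 kB
Rss:                4096 kB
7f0000800000-7f0000900000 r--s 00000000 fd:01 1002 /var/lib/cassandra/data/ks/t/md-1-big-Index.db
Size:               1024 kB
Rss:                 512 kB
7f0001000000-7f0002000000 rw-p 00000000 00:00 0
Size:              16384 kB
Rss:               10240 kB
"""

if __name__ == "__main__":
    for name, kb in sorted(rss_by_category(SAMPLE).items()):
        print(f"{name}: {kb / 1024:.1f} MiB")
```

On a real node you would pass in the contents of /proc/<pid>/smaps for the Cassandra PID instead of SAMPLE. A large sstable_data total would be consistent with the mmap explanation above.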

Cassandra read latency increases while writing

I have a Cassandra cluster whose read latency increases during writes. The writes mostly come from Spark jobs at night and arrive in huge bursts. Writes use LOCAL_QUORUM and reads use LOCAL_ONE. Is there a way to reduce read latency while the writes are happening?
Cassandra Cluster Configs
10 Node cassandra cluster (5 in DC1, 5 in DC2)
CPU: 8 Core
Memory: 32GB
Grafana Metrics
I can give some advice:
Use the LCS compaction strategy.
Prefer a round-robin load balancing policy for reads.
Choose the partition key wisely so that requests do not all hit a single partition.
Partition size also plays a big role. Cassandra recommends smaller partitions. However, I have tested with partitions of 10000 rows, each row being 800 bytes, and it worked better than with 3000 rows (or even 1 row). Very tiny partitions tend to increase CPU usage when the stored data is large in terms of row count. Very large partitions should be avoided as well.
Choose the replication factor strategically. The write consistency level should be decided considering the replication of all keyspaces.
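To put the partition sizes from the test above into perspective, here is the raw arithmetic as a small sketch. It only multiplies row count by row size, so it ignores storage overhead and compression; the helper name is made up.

```python
def partition_size_bytes(rows: int, row_bytes: int) -> int:
    """Raw payload size of one partition, ignoring overhead and compression."""
    return rows * row_bytes


# Row counts mentioned in the answer, with 800-byte rows.
for rows in (1, 3000, 10000):
    size = partition_size_bytes(rows, 800)
    print(f"{rows:>6} rows x 800 B = {size / 1_000_000:.2f} MB")
```

The 10000-row case comes to about 8 MB of raw data per partition.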

STCS : how I can improve compaction performance?

I have a six-node Cassandra cluster that hosts a large column family (CQL table) which is immutable (from the application's point of view it is a kind of history table). The table holds about 400 GB of compressed data, which is not that much!
So after truncating the table and then ingesting the app history data into it, I trigger nodetool compact on each node, in order to get the best read performance by reducing the number of SSTables. The compaction strategy is STCS.
After running nodetool compact, I run nodetool compactionstats to follow the compaction progress:
id   compaction type  keyspace    table    completed  total     unit   progress
xxx  Compaction       mykeyspace  mytable  3.65 GiB   1.11 TiB  bytes  0.32%
Hours later, on that same node, I have:
id   compaction type  keyspace    table    completed  total     unit   progress
xxx  Compaction       mykeyspace  mytable  4.08 GiB   1.11 TiB  bytes  0.36%
So the compaction process seems to work, but it's terribly slow.
Even with nodetool setcompactionthreshold -- 0, the compaction remains terribly slow. Moreover, CPU seems to be used to 100% because of that compaction.
Questions :
Which configuration parameters can I tune to try to boost compaction performance?
Could the 100% CPU usage during compaction be related to GC pressure?
If compaction is too slow, is it relevant to add more nodes, or to add more CPU/RAM to each node? Could it help?
Compaction performance depends on the underlying hardware, such as what kind of disks are used. But it also depends on how many compaction threads are allowed to run and what throughput is configured for them. From the command line, compaction throughput is configured with nodetool setcompactionthroughput, not nodetool setcompactionthreshold as you used. The number of concurrent compactors is set with nodetool setconcurrentcompactors (but it's only available from 3.1, IIRC). You can also configure default values in cassandra.yaml.
So if you have enough CPU power and good SSDs, you can bump up the compaction throughput and the number of compactors.
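To get a feel for what the throughput setting means for a compaction of the size shown in the question, here is a rough sketch. It assumes the cap is the only limit and treats the setting's megabytes as MiB; the function name is made up.

```python
MIB_PER_TIB = 1024 * 1024


def hours_at_cap(total_tib: float, cap_mib_per_s: float) -> float:
    """Lower bound on compaction time if the throughput cap is the only limit."""
    seconds = total_tib * MIB_PER_TIB / cap_mib_per_s
    return seconds / 3600


# 16 is the default cap; the other two values are arbitrary examples.
for cap in (16, 64, 256):
    print(f"{cap:>3} MB/s cap: at least {hours_at_cap(1.11, cap):.1f} h for 1.11 TiB")
```

At the default of 16 MB/s, 1.11 TiB takes a bit over 20 hours at best. The progress reported in the question (3.65 GiB to 4.08 GiB over some hours) is well below even that rate, so the cap may not be the limiting factor there.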

CPU 100% due to thousands of Pending Compaction

Recently we inserted millions of records into and deleted millions of records from a table; a table of size 10 GB was truncated.
We are running 2 nodes with SizeTieredCompactionStrategy. CPU utilization is currently at 100% and pending compactions keep increasing; the current count is 293144.
Any pointers to reduce CPU utilization and get the compaction done quickly?
"reduce CPU utilization and get the compaction done quickly."
These two goals pull in opposite directions: you can either accelerate compaction (by using more resources), or limit the resources given to compaction so that your writes aren't affected, at the cost of it taking longer.
If you have an ingest running against your Cassandra cluster, I would try to ensure that it is not affected by your compactions. As long as the number of pending compactions is decreasing over time, it's just a matter of time.
If you don't have reads or writes coming in (i.e. during downtime or while bootstrapping), it's okay to let compactions use up all your resources and finish fast.
The levers are:
1) get/set compaction throughput (nodetool): only kicks in for the next available compaction. This is how fast the compaction will run. The default is 16 MB/s; if you have resources available, you can increase it.
2) concurrent compactors: there are 2 values you have to set in JMX. You can do this on the fly using jmxsh, jconsole, etc. This is the number of compactions you can run at a time (number of cores).
Watch nodetool compactionstats or OpsCenter (where you can also chart pending compactions and set alerts) to follow the progress of the current compactions, or nodetool compactionhistory for completed compactions.
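Checking that the backlog is actually shrinking can be scripted. The sketch below assumes the compactionstats output contains a line of the form "pending tasks: N"; verify that against your version. The sample outputs and function names are hypothetical.

```python
import re


def pending_tasks(compactionstats_output: str) -> int:
    """Extract N from a 'pending tasks: N' line (format is an assumption)."""
    match = re.search(r"^pending tasks:\s*(\d+)", compactionstats_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'pending tasks' line found")
    return int(match.group(1))


def is_draining(samples: list[int]) -> bool:
    """True if the backlog is lower at the end of the window than at the start."""
    return len(samples) >= 2 and samples[-1] < samples[0]


# Hypothetical outputs, e.g. captured every few minutes on one node.
outputs = [
    "pending tasks: 293144\n",
    "pending tasks: 291870\n",
    "pending tasks: 290012\n",
]
samples = [pending_tasks(text) for text in outputs]
print(samples, "draining" if is_draining(samples) else "not draining")
```

Comparing only the first and last sample keeps the check tolerant of short-lived bumps while an ingest is running.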
Other things
"a table of size 10 GB was truncated."
Truncates are free, no compaction needed.

Determining how full a Cassandra cluster is

I just imported a lot of data in a 9 node Cassandra cluster and before I create a new ColumnFamily with even more data, I'd like to be able to determine how full my cluster currently is (in terms of memory usage). I'm not too sure what I need to look at. I don't want to import another 20-30GB of data and realize I should have added 5-6 more nodes.
In short, I have no idea if I have too few/many nodes right now for what's in the cluster.
Any help would be greatly appreciated :)
$ nodetool -h ring
Address  DC           Rack   Status  State   Load     Owns    Token
                                                              151236607520417094872610936636341427313
         datacenter1  rack1  Up      Normal  7.19 GB  11.11%  0
         datacenter1  rack1  Up      Normal  7.18 GB  11.11%  18904575940052136859076367079542678414
         datacenter1  rack1  Up      Normal  7.23 GB  11.11%  37809151880104273718152734159085356828
         datacenter1  rack1  Up      Normal  4.2 GB   11.11%  56713727820156410577229101238628035242
         datacenter1  rack1  Up      Normal  4.25 GB  11.11%  75618303760208547436305468318170713656
         datacenter1  rack1  Up      Normal  4.1 GB   11.11%  94522879700260684295381835397713392071
         datacenter1  rack1  Up      Normal  4.83 GB  11.11%  113427455640312821154458202477256070485
         datacenter1  rack1  Up      Normal  2.24 GB  11.11%  132332031580364958013534569556798748899
         datacenter1  rack1  Up      Normal  3.06 GB  11.11%  151236607520417094872610936636341427313
# nodetool -h cfstats
Keyspace: stats
Read Count: 232
Read Latency: 39.191931034482764 ms.
Write Count: 160678758
Write Latency: 0.0492021849459404 ms.
Pending Tasks: 0
Column Family: DailyStats
SSTable count: 5267
Space used (live): 7710048931
Space used (total): 7710048931
Number of Keys (estimate): 10701952
Memtable Columns Count: 4401
Memtable Data Size: 23384563
Memtable Switch Count: 14368
Read Count: 232
Read Latency: 29.047 ms.
Write Count: 160678813
Write Latency: 0.053 ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 115533264
Key cache capacity: 200000
Key cache size: 1894
Key cache hit rate: 0.627906976744186
Row cache: disabled
Compacted row minimum size: 216
Compacted row maximum size: 42510
Compacted row mean size: 3453
[default#stats] describe;
Keyspace: stats:
Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Durable Writes: true
Options: [replication_factor:3]
Column Families:
ColumnFamily: DailyStats (Super)
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
Row cache size / save period in seconds / keys to save : 0.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 200000.0/14400
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
Replicate on write: true
Built indexes: []
Column Metadata:
Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
Compression Options:
Obviously, there are two types of memory -- disk and RAM. I'm going to assume you're talking about disk space.
First, you should find out how much space you're currently using per node. Check the on-disk usage of the Cassandra data dir (by default /var/lib/cassandra/data) with du -ch /var/lib/cassandra/data, then compare that to the size of your disk, which can be found with df -h. Only consider the df entry for the disk your Cassandra data is on, by checking the Mounted on column.
Using those stats, you should be able to calculate how full, in %, the Cassandra data partition is. Generally you don't want to get too close to 100%, because Cassandra's normal compaction processes temporarily use more disk space. If you don't have enough, a node can get caught with a full disk, which can be painful to resolve (as a side note, I occasionally keep a "ballast" file of a few gigs that I can delete in case I need to free up some extra space). I've generally found that not exceeding about 70% disk usage is on the safe side for the 0.8 series.
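The same check can be scripted. This is a minimal sketch using Python's standard shutil.disk_usage; the 70% threshold is the rule of thumb from above, and the function names are made up. The percentage is computed as used/total, which may differ slightly from the Use% column of df.

```python
import shutil


def percent_used(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def check(path: str, threshold_pct: float = 70.0) -> str:
    """Report usage for `path` and whether it is below the threshold."""
    pct = percent_used(path)
    status = "OK" if pct < threshold_pct else "above threshold"
    return f"{path}: {pct:.1f}% used ({status})"


if __name__ == "__main__":
    # "." keeps the sketch runnable anywhere; substitute the Cassandra
    # data directory (by default /var/lib/cassandra/data) on a real node.
    print(check("."))
```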
If you're using a newer version of cassandra, then I'd recommend giving the Leveled Compaction strategy a shot to reduce temporary disk usage. Instead of potentially using twice as much disk space, the new strategy will at most use 10x of a small, fixed size (5MB by default).
You can read more about how compaction temporarily increases disk usage on this excellent blog post from Datastax: It also explains the compaction strategies.
So to do a little capacity planning, you can work out how much more space you'll need. With a replication factor of 3 (what you're using above), adding 20-30GB of raw data would add 60-90GB after replication. Split between your 9 nodes, that's roughly 7-10GB more per node. Does adding that kind of disk usage per node push you too close to having full disks? If so, you might want to consider adding more nodes to the cluster.
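Working the arithmetic through as a small sketch, assuming the data spreads evenly over the ring and ignoring compression and temporary compaction overhead (the function name is made up):

```python
def added_gb_per_node(raw_gb: float, replication_factor: int, nodes: int) -> float:
    """Extra on-disk data per node for a given amount of new raw data."""
    return raw_gb * replication_factor / nodes


# 20-30 GB of new raw data, replication factor 3, 9 nodes (from the question).
for raw in (20, 30):
    print(f"{raw} GB raw -> {added_gb_per_node(raw, 3, 9):.1f} GB more per node")
```

With the uneven loads shown in the ring output, the fullest nodes are the ones to check this against.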
One other note is that your nodes' loads aren't very even: they range from 2GB up to 7GB. If you're using the ByteOrderedPartitioner rather than the RandomPartitioner, that can cause uneven load and "hotspots" in your ring. You should consider using the random one if possible. The other possibility is that you have extra data hanging around that needs to be taken care of (hinted handoffs and snapshots come to mind). Consider cleaning that up by running nodetool repair and nodetool cleanup on each node, one at a time (be sure to read up on what those do first!).
Hope that helps.
