Advice on stopping compaction to reduce slowness - Cassandra

I am seeing high CPU and memory usage by Cassandra on a seed node. Is it advisable to stop compaction (nodetool stop) and re-enable it during off-peak hours? Should I run manual compaction or enable autocompaction? I see a lot of Native-Transport-Requests. I have three seed nodes; this is the first one.
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 54255 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 2 2566 352765 0 0
MutationStage 0 0 2659921760 0 0
MemtableReclaimMemory 0 0 180958 0 0
PendingRangeCalculator 0 0 21 0 0
GossipStage 0 0 338375 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 63 0 0
RequestResponseStage 0 1 1684328696 0 0
Native-Transport-Requests 4 0 1538523706 0 47006391
ReadRepairStage 0 0 2197 0 0
CounterMutationStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
MemtablePostFlush 1 1 216220 0 0
PerDiskMemtableFlushWriter_0 1 1 180958 0 0
ValidationExecutor 0 0 33250 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 1 1 180958 0 0
InternalResponseStage 0 0 141677 0 0
ViewMutationStage 0 0 0 0 0
AntiEntropyStage 0 0 166254 0 0
CacheCleanupExecutor 0 0 0 0 0
Repair#9 0 0 5719 0 0
I do see a high number of pending compactions. Is it advisable to disable compactions using nodetool stop?
$ nodetool info
ID : ebeda774-cea8-40bb-9322-69c6fcded5a9
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 535.37 GiB
Generation No : 1636316595
Uptime (seconds) : 73152
Heap Memory (MB) : 19542.18 / 32168.00
Off Heap Memory (MB) : 1337.98
Data Center : us-west2
Rack : a
Exceptions : 15
Key Cache : entries 152283, size 23.07 MiB, capacity 100 MiB, 23835 hits, 280738 requests, 0.085 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 6782, size 423.88 MiB, capacity 480 MiB, 23947952 misses, 24381819 requests, 0.018 recent hit rate, 250.977 microseconds miss latency
Percent Repaired : 0.49796724500672584%
Token : (invoke with -T/--tokens to see all 256 tokens)
$ free -h
total used free shared buff/cache available
Mem: 62G 53G 658M 1.0M 8.5G 8.5G
Swap: 0B 0B 0B
~$ nodetool compactionstats
pending tasks: 197
id compaction type keyspace table completed total unit progress
5e555610-40b2-11ec-9b5a-27bc920e6e55 Compaction mykeyspace table1 27299674 89930474 bytes 30.36%
5e55f251-40b2-11ec-9b5a-27bc920e6e55 Compaction mykeyspace table2 13922048 74426264 bytes 18.71%
Active compaction remaining time : 0h00m02s

I would definitely not run compaction manually. Many of the compaction thresholds are based on file size, which means that forcing a compaction creates files sized outside of the normal progression. As a result, the chances of compaction running on that table again are extremely slim. Basically, once you start down that path, you'll be running manual compactions forever.
I would also say that compaction is a good thing. You want it to happen, as compacted files are necessary to keep reads performing well. Of course, that's not much consolation when the compaction process is affecting operational activity.
One thing I have done in the past is to lower compaction throughput during the day. I'm not sure what throughput you're currently running with, but you can find out by running nodetool getcompactionthroughput:
% bin/nodetool getcompactionthroughput
Current compaction throughput: 64 MB/s
So at the times when customer/operational traffic is high, you can reduce that significantly:
% bin/nodetool setcompactionthroughput 1
% bin/nodetool getcompactionthroughput
Current compaction throughput: 1 MB/s
1 MB / second is the lowest that compaction throughput can be set. If you set it to zero, it's "un-throttled," which means it'll consume all the resources that it can get at. Setting it to 1 brings its resource use (and speed) down to a trickle.
Once the busy daily traffic subsides, that setting can be turned back up:
% bin/nodetool setcompactionthroughput 256
% bin/nodetool getcompactionthroughput
Current compaction throughput: 256 MB/s
This can be accomplished with a scheduled job for each command.
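The scheduled-job idea could be sketched roughly as follows in Python. This is only an illustration: the busy window, the two rates and the function names are assumptions made for the example, and it expects nodetool to be on the PATH of the user running the job.

```python
import subprocess
from datetime import datetime

# Hypothetical values: tune the busy window and the rates for your own cluster.
PEAK_HOURS = range(8, 20)      # 08:00-19:59 local time counts as "busy"
PEAK_MB_PER_SEC = 1            # trickle while customer traffic is high
OFF_PEAK_MB_PER_SEC = 256      # let compaction catch up once traffic subsides


def throughput_for_hour(hour: int) -> int:
    """Return the compaction throughput (MB/s) to apply at the given hour."""
    return PEAK_MB_PER_SEC if hour in PEAK_HOURS else OFF_PEAK_MB_PER_SEC


def build_command(mb_per_sec: int, nodetool: str = "nodetool") -> list:
    """Build the nodetool invocation; 0 is rejected because it means un-throttled."""
    if mb_per_sec < 1:
        raise ValueError("refusing to set 0 (un-throttled) from a scheduled job")
    return [nodetool, "setcompactionthroughput", str(mb_per_sec)]


def apply_now() -> None:
    """Apply the rate for the current hour; meant to be called from cron or similar."""
    cmd = build_command(throughput_for_hour(datetime.now().hour))
    subprocess.run(cmd, check=True)
```

Calling apply_now() once an hour on each node would keep the setting in step with the clock. The setting applies per node and, as far as I know, a value set through nodetool does not survive a restart, so the job also restores it after a node bounce.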


CounterMutationStage and ViewMutationStage metrics are missing in Cassandra 4.0

When invoking nodetool tpstats on Cassandra 4.0, here is what I got: [nodetool result screenshot]
But no CounterMutationStage and ViewMutationStage found. Where are they?
Those metrics are still there. The issue, though, is that they expose their data "lazily," which basically means they won't show at all when the value is zero. Once you start writing to counters or views, those metrics execute their "lazy initialization," and only then are they exposed. I tested this out using Cassandra 4.0 beta4.
Running a baseline nodetool tpstats | head -n 4:
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 1 0 0
ReadStage 0 0 27 0 0
CompactionExecutor 0 0 41 0 0
Next, I'll create a simple counter table.
CREATE TABLE games_popularity (game text PRIMARY KEY, popularity counter);
I'll increment the counter a few times and SELECT it.
aploetz#cqlsh> SELECT * FROM stackoverflow.games_popularity ;
game | popularity
Cyberpunk 2077 | 3
(1 rows)
Now rerunning nodetool tpstats | head -n 4 does indeed show CounterMutationStage:
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 12 0 0
CounterMutationStage 0 0 3 0 0
ReadStage 0 0 96 0 0
Note that in 4.0 these metrics are also exposed in the system_views.thread_pools virtual table, which you can view with SELECT * FROM system_views.thread_pools;.
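If you have a script that reads nodetool tpstats, the practical consequence is that it should not assume every pool is listed. A minimal Python sketch of that idea follows; the function names are made up for illustration, and the sample text is the output shown in this answer.

```python
def parse_tpstats(text: str) -> dict:
    """Map pool name -> counters parsed from `nodetool tpstats` output."""
    fields = ("active", "pending", "completed", "blocked", "all_time_blocked")
    pools = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 5)   # pool name, then five numeric columns
        if len(parts) != 6:
            continue
        try:
            values = [int(p) for p in parts[1:]]
        except ValueError:
            continue                   # header or other non-pool line
        pools[parts[0].strip()] = dict(zip(fields, values))
    return pools


def completed(pools: dict, name: str) -> int:
    """Treat a pool that is not listed (not yet initialized) as zero."""
    return pools.get(name, {}).get("completed", 0)


SAMPLE = """\
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 12 0 0
CounterMutationStage 0 0 3 0 0
ReadStage 0 0 96 0 0
"""

pools = parse_tpstats(SAMPLE)
print(completed(pools, "CounterMutationStage"))   # 3
print(completed(pools, "ViewMutationStage"))      # 0, pool not listed
```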
Thanks to the good work that has been done by the Cassandra developers, the metrics are now lazily initialised to improve performance.
The best way to "wake up" all lazy metrics is:
nodetool getconcurrency

Cassandra tombstones not deleted a month after actual record TTL

Running into an issue with DSE 4.7.
The tombstones are not being deleted even after compactions, cleanup, rebuild_index and repair. Records have a 15-day TTL.
The sstablemetadata output suggests that about 90% of the data is droppable tombstones.
Any ideas?
sstablemetadata output
SSTable: ./abcd-abcd-ka-478675
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.010000
Minimum timestamp: 1527521280829593
Maximum timestamp: 1527596173976435
SSTable max local deletion time: 1528892173
Compression ratio: 0.36967428395684393
Estimated droppable tombstones: 0.9073013816277629
SSTable Level: 0
Repaired at: 0
ReplayPosition(segmentId=1520529283052, position=4626679)
Estimated tombstone drop times:%n
1528817679: 18318196
1528818619: 20753822
1528819513: 24176310
Count Row Size Cell Count
1 0 0
2 0 1752560
3 0 0
4 0 6355421
5 0 0
6 0 687302
7 0 0
8 0 529613
10 0 444801
12 0 410107
14 0 456011
17 0 1347893
20 0 184960
24 0 152814
770 1347893 137
924 184960 109
1109 220403 68
1331 121620 86
1597 2044030 102
1916 185601 195
2299 184816 158273
2759 868754 0
3311 62795 0
3973 1668 0
4768 2143 0
5722 1812541 0
6866 828 0
Ancestors: [476190, 474027, 475201, 478160]
Estimated cardinality: 20059264
Cassandra marks TTL data with a tombstone after the requested amount of time has expired. A tombstone exists for gc_grace_seconds. After data is marked with a tombstone, the data is automatically removed during the normal compaction process.
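To check whether the tombstones in that SSTable are already past gc_grace_seconds, the epoch values from the sstablemetadata output in the question can be converted to dates. The Python sketch below does that under two loudly stated assumptions: the table uses the default gc_grace_seconds of 10 days (the question does not say), and the rule is simplified to "deletion time plus gc_grace_seconds", which ignores other conditions such as overlapping SSTables that still hold older data for the same partitions.

```python
from datetime import datetime, timezone


def utc(epoch_seconds: int) -> datetime:
    """Convert an epoch value in seconds to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def earliest_purge_time(local_deletion_time: int, gc_grace_seconds: int) -> datetime:
    """Earliest moment a compaction may drop the tombstone (simplified rule)."""
    return utc(local_deletion_time + gc_grace_seconds)


# "SSTable max local deletion time" from the sstablemetadata output.
MAX_LOCAL_DELETION_TIME = 1528892173
# Assumption: default gc_grace_seconds (10 days); check the table definition.
GC_GRACE_SECONDS = 10 * 24 * 3600

print(utc(MAX_LOCAL_DELETION_TIME).isoformat())
print(earliest_purge_time(MAX_LOCAL_DELETION_TIME, GC_GRACE_SECONDS).isoformat())
```

If the second date is well in the past and the tombstones are still there, gc_grace_seconds is probably not what is holding them back.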
You can try running a major compaction to evict the tombstones.
Tombstones get deleted during normal compaction. Still, you will sometimes find stale tombstoned data (even in prod). One reason could be that one of the nodes in the cluster was down, so the tombstoned data could not be deleted. Also, inserting null values into columns creates tombstones.

Cassandra NoHostAvailableException when deletes are executed with cqlsh

We have a cluster with 7 nodes and we use the datastax java driver to connect to the cluster. The problem is that I am getting constant NoHostAvailableException like this:
Caused by:
com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (tried: /
(com.datastax.driver.core.exceptions.DriverException: Timeout while
trying to acquire available connection (you may want to increase the
driver number of per-host connections)), /
(com.datastax.driver.core.exceptions.DriverException: Timeout while
trying to acquire available connection (you may want to increase the
driver number of per-host connections)), /
(com.datastax.driver.core.exceptions.DriverException: Timeout while
trying to acquire available connection (you may want to increase the
driver number of per-host connections)), /,
/, /, / [only
showing errors of first 3 hosts, use getErrors() for more details])
All the nodes are up:
UN 152.21 GB 256 14.5% 58abea69-e7ba-4e57-9609-24f3673a7e58 RAC1
UN 168.4 GB 256 14.5% bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752 RAC1
UN 177.71 GB 256 13.7% 8dc7bb3d-38f7-49b9-b8db-a622cc80346c RAC1
UN 158.57 GB 256 14.1% 94022081-a563-4042-81ab-75ffe4d13194 RAC1
UN 176.83 GB 256 14.6% 0dda3410-db58-42f2-9351-068bdf68f530 RAC1
UN 159 GB 256 13.6% 01e013fb-2f57-44fb-b3c5-fd89d705bfdd RAC1
UN 166.05 GB 256 15.0% 4d009603-faa9-4add-b3a2-fe24ec16a7c1 RAC1
but two of them have high CPU load, especially the 232 one, because I am running a lot of deletes using cqlsh on that node.
I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal for all the hosts to be inaccessible.
Our configuration for the java connection is:
com.datastax.driver.core.Cluster cluster = null;
// Get contact points
String[] contactPoints = this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");
cluster = com.datastax.driver.core.Cluster.builder()
    .addContactPoints(contactPoints)
    .withQueryOptions(new QueryOptions())
    .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
    .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
    .build();
Metadata metadata = cluster.getMetadata();
for (Host host : metadata.getAllHosts()) {
    System.out.print("Datacenter: "+host.getDatacenter()+"; Host: "+host.getAddress()+"; DC: "+host.getDatacenter()+"\n");
}
and the contact points are:,,,,
Does anyone know how I can solve this problem? Or does anyone at least have a hint about how to deal with this situation?
Update: If I get the error messages with e.getErrors(), I obtain:
/ [/] Operation timed out,
/ [/] Operation timed out,
/ [/] Operation timed out,
/ [/] Operation timed out,
/ [/] Operation timed out}
The replication factor of the keyspace is 3.
For the deletes, I'm running them from several files containing the CQL queries:
cqlsh ip_node_1 -f script-1.duplicates
cqlsh ip_node_1 -f script-2.duplicates
cqlsh ip_node_1 -f script-3.duplicates
I am not specifying any consistency level, so it is using the default one, which is ONE.
Each of the previous files contain deletes like this:
DELETE FROM WHERE idline1 = 837 and idline2 = 841 and partid = 8558 and id = 18c04c20-8a3a-11e5-9e20-0025905a2ab2;
And the column family is:
CREATE TABLE search (
idline1 bigint,
idline2 bigint,
partid int,
id uuid,
field3 int,
field4 int,
field5 int,
field6 int,
field7 int,
field8 int,
field9 double,
field10 bigint,
field11 bigint,
field12 bigint,
field13 boolean,
field14 boolean,
field15 int,
field16 bigint,
field17 int,
field18 int,
field19 int,
field20 int,
field21 uuid,
field22 boolean,
PRIMARY KEY ((idline1, idline2, partid), id)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='Table with the snp between lines' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=0 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX search_partid ON search (partid);
CREATE INDEX search_field8 ON search (field8);
UPDATE (18-03-2016):
After the deletes start executing, I found that the CPU usage of some of the nodes increases a lot.
I checked the processes on those nodes and only Cassandra is running, but it is consuming a lot of CPU. The rest of the nodes are using almost no CPU.
UPDATE (04-04-2016): I do not know if it is related. I checked the nodes that use a lot of CPU (near 96%), and the GC activity remains at 1.6% (using only 3 GB of the 10 GB assigned).
Checking the thread pool stats:
nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 20042001 0 0
RequestResponseStage 0 0 149365845 0 0
MutationStage 32 117720 181498576 0 0
ReadRepairStage 0 0 799373 0 0
ReplicateOnWriteStage 0 0 13624173 0 0
GossipStage 0 0 5580503 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 32173 0 0
MigrationStage 0 0 9 0 0
MemtablePostFlusher 0 0 45044 0 0
MemoryMeter 0 0 9553 0 0
FlushWriter 0 0 9425 0 18
ValidationExecutor 0 0 15980 0 0
MiscStage 0 0 0 0 0
PendingRangeCalculator 0 0 7 0 0
CompactionExecutor 0 0 1293147 0 0
commitlog_archiver 0 0 0 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 273 0 0
Message type Dropped
I notice that the pending count for MutationStage keeps growing while the active value remains the same. Could this be the problem?
I see two problems with your data model.
You use two secondary indexes. One is on a field that is part of the partition key. I don't know how Cassandra behaves in this case. In the worst case, even if you specify the complete partition key (as you do in your example delete), Cassandra does a lookup in the secondary index. That would mean a full cluster scan, because secondary indexes are stored locally on each node. Since only part of the partition key is indexed, Cassandra does not know which node holds the index information. This behavior would at least explain the timeouts.
You said you delete a lot of rows in a specific partition. That is also a problem. For each deletion Cassandra creates a tombstone. The more tombstones there are, the slower reads become. Sooner or later this leads to timeouts or exceptions (I believe that by default Cassandra logs a warning when a read scans 1,000 tombstones and fails the query at 100,000). By the way, these tombstones are also created in the secondary index. By default Cassandra removes tombstones after gc_grace_seconds (10 days by default) when a compaction is performed. You can change this property per table. More information on these table properties can be found here: Table Properties
I believe the first point could be the reason for the timeouts.

Cassandra - compaction stuck

Upfront warning - cassandra beginner
I have set up a 4-node m3.xlarge cluster on AWS using the DataStax Enterprise AMI and loaded data using the Cassandra bulk loader approach.
Cassandra version is "ReleaseVersion:"
One of the four nodes, the one I started the bulk load from, seems to be stuck in compaction (nothing has changed for the last 12 hours).
$ nodetool compactionstats
pending tasks: 1
compaction type keyspace table completed total unit progress
Compaction xxx yyy 60381305196 66396499686 bytes 90.94%
Active compaction remaining time : 0h05m58s
I have also noticed that the node sometimes becomes unavailable (goes red in OpsCenter), but after a while (a long while) it becomes available again.
There is an exception in the Cassandra log (see below). What is weird, though, is that there is lots of disk space left.
> ERROR [MemtableFlushWriter:3] 2015-10-29 23:54:21,511
> - Exception in thread
> Thread[MemtableFlushWriter:3,5,main]
> No space
> left on device
> at$IndexWriter.close(
> ~[cassandra-all-]
> at
> ~[cassandra-all-]
> at
> ~[cassandra-all-]
> at
> ~[cassandra-all-]
> at
> ~[cassandra-all-]
> at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(
> ~[cassandra-all-]
> at org.apache.cassandra.db.Memtable$FlushRunnable.runMayThrow(
> ~[cassandra-all-]
> at
> ~[cassandra-all-]
> at$SameThreadExecutorService.execute(
> ~[guava-16.0.1.jar:na]
> at org.apache.cassandra.db.ColumnFamilyStore$
> ~[cassandra-all-]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ~[na:1.7.0_80]
> at java.util.concurrent.ThreadPoolExecutor$
> ~[na:1.7.0_80]
> at ~[na:1.7.0_80] Caused by: No space left on device
> at Method) ~[na:1.7.0_80]
> at ~[na:1.7.0_80]
> at
> ~[na:1.7.0_80]
> at
> ~[na:1.7.0_80]
> at
> ~[cassandra-all-]
> at$IndexWriter.close(
> ~[cassandra-all-]
> ... 12 common frames omitted
Tpstats output is
$ nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 19485 0 0
RequestResponseStage 0 0 116191 0 0
MutationStage 0 0 386132 0 0
ReadRepairStage 0 0 848 0 0
GossipStage 0 0 46669 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 1 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 80 0 0
MemtableReclaimMemory 0 0 79 0 0
PendingRangeCalculator 0 0 4 0 0
MemtablePostFlush 1 33 127 0 0
CompactionExecutor 1 1 27492 0 0
InternalResponseStage 0 0 4 0 0
HintedHandoff 0 0 3 0 0
Message type Dropped
Does anyone have any tips on how to make that hanging compaction go away, and on why this happens in the first place?
All tips tremendously appreciated!
Let's assume you're using SizeTieredCompactionStrategy and you have four SSTables of size X. A compaction will merge them into one SSTable of size Y, and this process repeats itself.
Problem: A compaction creates a new SSTable of size Y, and both the new SSTable and the old SSTables of size X exist during the compaction.
In the worst case (with no deletes or overwrites), a compaction requires twice the on-disk space used by the SSTables being compacted. More specifically, at certain points you need enough disk space to hold both the SSTables of size X and the new one of size Y.
So even though it seems that you have enough space left, you might run out of disk space during compaction.
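As a rough illustration of that arithmetic, here is a small Python sketch. The SSTable sizes and free-space figures are hypothetical, and the check only models the worst case described above (no deletes or overwrites, so Y equals the sum of the inputs).

```python
def worst_case_output_bytes(input_sstable_bytes: list) -> int:
    """Size of the new SSTable (Y) if nothing is deleted or overwritten."""
    return sum(input_sstable_bytes)


def can_compact(input_sstable_bytes: list, free_bytes: int) -> bool:
    """The inputs stay on disk until the compaction ends, so Y must fit in free space."""
    return free_bytes >= worst_case_output_bytes(input_sstable_bytes)


GIB = 1024 ** 3
inputs = [50 * GIB] * 4   # hypothetical: four SSTables of size X = 50 GiB

print(can_compact(inputs, free_bytes=150 * GIB))   # False: Y may reach 200 GiB
print(can_compact(inputs, free_bytes=250 * GIB))   # True
```

In the first case the disk looks far from full (150 GiB free), yet the compaction can still fail with "No space left on device" partway through.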
You might wanna try LeveledCompactionStrategy because it needs much less space for compaction (10 x sstable_size_in_mb). See also for when to use LeveledCompactionStrategy.
No matter which compaction strategy you use, you should always leave enough free disk space to accommodate streaming, repair, and snapshots.

/proc/[pid]/stat refresh period

Hi, I am a Linux programmer.
I have been asked to monitor the CPU usage of a process, so I use fields 14 and 15 of /proc/[pid]/stat. These values are called utime and stime.
Example [/proc/[pid]/stat]
30182 (TTTTest) R 30124 30182 30124 34845 30182 4218880 142 0 0 0 5274 0 0 0 20 0 1 0 55611251 17408000 386 18446744073709551615 4194304 4260634 140733397159392 140733397158504 4203154 0 0 0 0 0 0 0 17 2 0 0 0 0 0 6360520 6361584 33239040 140733397167447 140733397167457 140733397167457 140733397168110 0
State after 5 sec
30182 (TTTTest) R 30124 30182 30124 34845 30182 4218880 142 0 0 0 5440 0 0 0 20 0 1 0 55611251 17408000 386 18446744073709551615 4194304 4260634 140733397159392 140733397158504 4203154 0 0 0 0 0 0 0 17 2 0 0 0 0 0 6360520 6361584 33239040 140733397167447 140733397167457 140733397167457 140733397168110 0
In the test environment, this file refreshed every 1-2 seconds, so I assumed the system updates it roughly once per second.
So I use this calculation:
process_cpu_usage = ((utime - old_utime) + (stime - old_stime))/ period
For the values above:
33.2 = ((5440 - 5274) + (0 - 0)) / 5
But in a commercial server environment, where processes run under high load (CPU and file I/O), the /proc/[pid]/stat update period increases to as much as 20-60 seconds!
As a result, the top/htop utilities can't measure the correct process usage.
Why does this happen?
Our system is [CentOS Linux release 7.1.1503 (Core)]
Most (if not all) files in the /proc filesystem are special files: their content at any given moment reflects the actual OS/kernel data at that very moment. They are not files whose contents are updated periodically. See the /proc filesystem doc.
In particular, the /proc/[pid]/stat content changes whenever the respective process state changes (for example, after every scheduling event). For processes that are mostly sleeping, the file will appear to be "updated" at slower rates, while for active/running processes on lightly loaded systems it will appear to be updated at higher rates. Check, for example, the corresponding files for a shell process which doesn't do anything and for a browser process playing some video stream.
On heavily loaded systems with many processes in the ready state (like the one mentioned in this Q&A, for example) there can be scheduling delays making the file content "updates" appear less often despite the processes being ready/active. Such conditions seem to be more often encountered in commercial/enterprise environments (debatable, I agree).
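For reference, the utime/stime calculation from the question can be written as a small Python sketch. The function names are illustrative only. The result is expressed as a percentage of one core by dividing by the clock tick rate (SC_CLK_TCK); when that rate is 100, this gives the same number as the question's ticks-per-second formula.

```python
import os
import time


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (fields 14 and 15) from a /proc/[pid]/stat line."""
    # The command name (field 2) may contain spaces or parentheses,
    # so only split what follows the last closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field n sits at rest[n - 3].
    return int(rest[11]) + int(rest[12])


def cpu_percent(old_line: str, new_line: str, period_s: float, clk_tck: int) -> float:
    """CPU usage over the period, as a percentage of one core."""
    delta = cpu_ticks(new_line) - cpu_ticks(old_line)
    return 100.0 * delta / (clk_tck * period_s)


def read_stat(pid: int) -> str:
    """Read the file; the kernel generates its content at read time."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read()


def sample(pid: int, period_s: float = 5.0) -> float:
    """Take two readings period_s apart and return the CPU percentage."""
    old = read_stat(pid)
    time.sleep(period_s)
    new = read_stat(pid)
    return cpu_percent(old, new, period_s, os.sysconf("SC_CLK_TCK"))
```

With the two sample lines from the question, a 5 second period and a tick rate of 100, cpu_percent gives 33.2, matching the hand calculation. Since the counters only advance while the process actually runs on a CPU, a process that is starved by scheduling delays will show unchanged values no matter how often the file is read.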
