I'm using Cassandra (DSC 2.1.5) with 3 nodes, and the following table definition:
cqlsh> DESCRIBE TABLE mykeyspace.mytable;
CREATE TABLE mykeyspace.mytable (
a text,
b text,
c timestamp,
d timestamp,
e text,
PRIMARY KEY ((a, b), c)
) WITH CLUSTERING ORDER BY (c ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The first and third nodes are working fine (all three have the same cassandra.yaml), but the second node has accumulated more and more pending compaction tasks, which I can see with the nodetool compactionstats -H command. The situation is so bad that my Spark jobs get stuck and only run when I completely shut down the second node.
I have around 130 GB free on the second node.
Also, here is the cfstats output:
> nodetool cfstats mykeyspace.mytable
Keyspace: mykeyspace
Read Count: 0
Read Latency: NaN ms.
Write Count: 54316
Write Latency: 0.1877597945356801 ms.
Pending Flushes: 0
Table: mytable
SSTable count: 1249
Space used (live): 1125634027755
Space used (total): 1125634027755
Space used by snapshots (total): 0
Off heap memory used (total): 1202327957
SSTable Compression Ratio: 0.11699340657338655
Number of keys (estimate): 34300801
Memtable cell count: 758856
Memtable data size: 351011415
Memtable off heap memory used: 0
Memtable switch count: 10
Local read count: 0
Local read latency: NaN ms
Local write count: 54319
Local write latency: 0.188 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 48230904
Bloom filter off heap memory used: 48220912
Index summary off heap memory used: 11161093
Compression metadata off heap memory used: 1142945952
Compacted partition minimum bytes: 925
Compacted partition maximum bytes: 52066354
Compacted partition mean bytes: 299014
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
What could be causing this?
I have a Scylla table as shown below:
cqlsh:sampleks> describe table test;
CREATE TABLE test (
client_id int,
when timestamp,
process_ids list<int>,
md text,
PRIMARY KEY (client_id, when) ) WITH CLUSTERING ORDER BY (when DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 172800
AND max_index_interval = 1024
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
And this is how we are querying it. It's been a long time since I worked on Cassandra, so this PER PARTITION LIMIT is new to me (it looks like a recent addition). Can someone explain what it does, with an example, in layman's terms? I couldn't find any documentation that explains it clearly.
SELECT * FROM test WHERE client_id IN ? PER PARTITION LIMIT 1;
The PER PARTITION LIMIT clause can be helpful in a "wide partition scenario": it caps the number of rows returned from each partition. With PER PARTITION LIMIT 1, your query returns only the first row (by clustering order) of each matching partition.
Take this query:
aploetz#cqlsh:stackoverflow> SELECT client_id,when,md
FROM test PER PARTITION LIMIT 2 ;
Considering the PRIMARY KEY definition of (client_id, when), that query will iterate over each client_id. Cassandra will then return only the first two rows (as clustered by when) from each partition, regardless of how many rows the partition contains.
In this case, I inserted 7 rows into your test table, using two different client_ids (2 partitions total). Using a PER PARTITION LIMIT of 2, I get 4 rows returned: 2 client_ids x 2 rows each = 4 rows.
client_id | when | md
-----------+---------------------------------+-----
1 | 2020-05-06 12:00:00.000000+0000 | md1
1 | 2020-05-05 22:00:00.000000+0000 | md1
2 | 2020-05-06 19:00:00.000000+0000 | md2
2 | 2020-05-06 01:00:00.000000+0000 | md2
(4 rows)
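For reference, sample rows like those returned above could have been created with inserts along these lines (a sketch only; the exact timestamps and md values are illustrative, chosen to match the result set shown):

-- illustrative inserts against the test table from the question
INSERT INTO test (client_id, when, md) VALUES (1, '2020-05-05 22:00:00+0000', 'md1');
INSERT INTO test (client_id, when, md) VALUES (1, '2020-05-06 12:00:00+0000', 'md1');
INSERT INTO test (client_id, when, md) VALUES (2, '2020-05-06 01:00:00+0000', 'md2');
INSERT INTO test (client_id, when, md) VALUES (2, '2020-05-06 19:00:00+0000', 'md2');

Because the table is defined WITH CLUSTERING ORDER BY (when DESC), the two most recent rows per client_id are the ones the limited query returns.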
In my 3-node Cassandra cluster I tried to get node info by executing nodetool info, but I see some NaN values in the cache details.
Rack : 2a
Exceptions : 0
Key Cache : entries 478610, size 36.52 MiB, capacity 50 MiB, 251452781 hits, 292195506 requests, 0.861 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 25 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 18, size 1.12 MiB, capacity 219 MiB, 259 misses, 9648 requests, 0.973 recent hit rate, NaN microseconds miss latency
I can't figure out why it is returning NaN values.
Using Cassandra ReleaseVersion: 3.11.6
It happens because there is no activity in the corresponding caches; as a result, the hit rate cannot be calculated, since 0 hits divided by 0 requests gives you NaN (not a number).
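If you want those hit rates to report real numbers, the cache in question has to receive traffic first. As a minimal sketch, assuming a row cache capacity has been configured via row_cache_size_in_mb in cassandra.yaml (the keyspace and table names here are hypothetical):

-- hypothetical table; enable row caching for up to 100 rows per partition
ALTER TABLE ks.tbl
WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};

Once reads start hitting the row cache, nodetool info reports an actual recent hit rate instead of NaN.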
I am trying two things with our Cassandra-based application (it is not possible to stop all services for testing purposes):
Test performance, such as op/s and 99.9% latency, using the Python driver. To get more accurate results, we want to know the current workload of Cassandra, such as read and write op/s.
Get information such as the total number of rows a table contains (our table has almost 8 billion records right now) and how many records are inserted every week (there are some data sources we cannot control, so it's hard to get this information from the insert scripts directly).
I have tried some methods for these two problems:
Updated in comments.
select count(*) from xxx does not work at all; it is far too slow (a token-range workaround is sketched at the end of this post).
I tried to get some information using nodetool tablestats; take system_distributed for example:
Keyspace : system_distributed
Read Count: 0
Read Latency: NaN ms
Write Count: 0
Write Latency: NaN ms
Pending Flushes: 0
Table: parent_repair_history
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: -1.0
Number of partitions (estimate): 0
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 100.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Table: repair_history
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: -1.0
Number of partitions (estimate): 0
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 100.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Table: view_build_status
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: -1.0
Number of partitions (estimate): 0
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 100.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
These include some parameters that I cannot understand:
a. What does Local write count mean? If a table is distributed across different nodes with multiple replicas, how do I calculate how many rows that table has?
b. Do the first 5 lines (Read Count, Write Count, etc.) describe information for the keyspace as a whole (system_distributed)?
c. Do all the latencies here mean average latency?
I would appreciate any suggestions.
Jiashi
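Regarding the row count, one common workaround (a sketch only; the table xxx is from the question, but the partition key column pk is an assumed placeholder) is to split the full Murmur3 token ring into sub-ranges and count each range with a separate query, so that no single count has to scan the whole table:

-- count one token sub-range of table xxx (assumed partition key pk)
SELECT count(*) FROM xxx
WHERE token(pk) >= -9223372036854775808
AND token(pk) < -9000000000000000000;

Issuing one such query per sub-range (for example, from a small driver script) and summing the results produces a total without a single full-table count(*). Tools such as DSBulk's count mode automate this pattern.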
While running tablehistograms I get the message below:
nodetool tablehistograms keyspace tablename
Column counts are larger than 1996099046, unable to calculate percentiles
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%         0.00      0.00           0.00          268650950       NaN
75%         0.00      0.00           0.00          3449259151      NaN
95%         0.00      0.00           0.00          25628284214     NaN
98%         0.00      0.00           0.00          44285675122     NaN
99%         0.00      0.00           0.00          44285675122     NaN
Min         0.00      0.00           0.00          105779          0
Max         0.00      0.00           0.00          44285675122     9223372036854776000
Cassandra version:
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Replication factor 3
4 node cluster
I am getting the above message on one node only.
I tried repairing the table, but it failed with a streaming error:
ERROR [StreamReceiveTask:53] 2019-06-10 13:54:33,684 StreamSession.java:593 - [Stream #c9214180-8b82-11e9-90ce-399bac480141] Streaming error occurred on session with peer <IP ADDRESS>
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
at org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:51) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:373) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.index.SecondaryIndexManager.buildIndexesBlocking(SecondaryIndexManager.java:383) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.index.SecondaryIndexManager.buildAllIndexesBlocking(SecondaryIndexManager.java:270) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:216) ~[apache-cassandra-3.11.2.jar:3.11.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_144]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_144]

ERROR [Reference-Reaper:1] 2019-06-10 13:54:33,907 Ref.java:224 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State#7bd8303d) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup#1084465868:PATH/talename-5b621cd0c53311e7a612ffada4e45177/mc-26405-big-Index.db was not released before the reference was garbage collected
The table description includes:
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Any idea why this is happening? Any help or suggestions are welcome.
You cannot have 2 billion cells in a partition. Also, having a secondary index on a table with a 44 GB partition is going to cause issues for multiple reasons. There really isn't much you can do to fix this short of dropping your index and building a new data model to migrate into. You could build a custom version of Cassandra that ignores that exception, but something else will break very soon, as you are at the extreme limit of what is even theoretically possible. You are already past the point at which I am surprised it is running.
If the streaming error is from repairs, you can ignore it while you fix your data model. If it is from bootstrapping, I think you will need a custom version of Cassandra to stay running in the meantime (or you can just ignore the down node you are replacing). Keep in mind that node failures are a serious threat to you now, because bootstrapping will likely not work. When you put so much data in a single partition, it cannot be scaled out, so your options are limited.
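A replacement data model would typically split such giant partitions by adding a bucket component to the partition key. As a hedged sketch (the table and column names below are invented for illustration, not taken from the question):

-- hypothetical remodel: a per-day bucket keeps each partition bounded
CREATE TABLE events_by_day (
entity_id bigint,
day date,        -- bucket: one partition per entity per day
ts timestamp,
payload text,
PRIMARY KEY ((entity_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

Readers then query one or a few buckets at a time (WHERE entity_id = ? AND day = ?), so no partition ever approaches the 2-billion-cell ceiling.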
I've successfully set up a Cassandra cluster with 7 nodes. However, I can't get it to work for basic queries.
CREATE TABLE lgrsettings (
siteid bigint,
channel int,
name text,
offset float,
scalefactor float,
units text,
PRIMARY KEY (siteid, channel)
);
insert into lgrsettings (siteid,channel,name,offset,scalefactor,units) values (999,1,'Flow',0.0,1.0,'m');
Then on one node:
select * from lgrsettings;
Request did not complete within rpc_timeout.
And on another:
select * from lgrsettings;
Bad Request: unconfigured columnfamily lgrsettings
Even though the keyspace and column family show up on all nodes.
Any ideas where I could start looking?
Alex
Interesting results. The node that handled the keyspace creation and insert shows:
Keyspace: testdata
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: 0.304 ms.
Pending Tasks: 0
Column Family: lgrsettings
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 10
Memtable Data Size: 129
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Column Family: datapoints
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Other nodes don't have this in their cfstats output, but they do show it in DESCRIBE KEYSPACE testdata; in the CQL3 clients...
Request did not complete within rpc_timeout
Check your Cassandra logs to confirm whether there is any issue; sometimes exceptions in Cassandra lead to timeouts on the client.
In a comment, the OP said he found the cause of his problem:
I've managed to solve the issue. It was due to time sync between the nodes so I installed ntpd on all nodes, waited 5 minutes and tried again and I have a working cluster!