Cassandra NodeTool Showing value as NaN

Cassandra NodeTool Showing value as NaN - cassandra

In my 3 node Cassandra cluster I tried to get node info by executing "nodetool info" but I can see some NaN values in Cache details.
Rack : 2a
Exceptions : 0
Key Cache : entries 478610, size 36.52 MiB, capacity 50 MiB, 251452781 hits, 292195506 requests, 0.861 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 25 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 18, size 1.12 MiB, capacity 219 MiB, 259 misses, 9648 requests, 0.973 recent hit rate, NaN microseconds miss latency
Can't figure out why it is returning NaN values.
Using Cassandra ReleaseVersion: 3.11.6

It happens because there is no activity in corresponding caches, and as result, hit ratio couldn't be calculated as 0 hits divided by 0 requests gives you a NaN (not a number). You can see this discussion about NaNs

Related

dsbulk unload is failing on large table

trying to unload data from a huge table, below is the command used and output.
$ /home/cassandra/dsbulk-1.8.0/bin/dsbulk unload --driver.auth.provider PlainTextAuthProvider --driver.auth.username xxxx --driver.auth.password xxxx --datastax-java-driver.basic.contact-points 123.123.123.123 -query "select count(*) from sometable with where on clustering column and partial pk -- allow filtering" --connector.name json --driver.protocol.compression LZ4 --connector.json.mode MULTI_DOCUMENT -maxConcurrentFiles 1 -maxRecords -1 -url dsbulk --executor.continuousPaging.enabled false --executor.maxpersecond 2500 --driver.socket.timeout 240000
Setting dsbulk.driver.protocol.compression is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.protocol.compression instead.
Setting dsbulk.driver.auth.* is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.auth-provider.* instead.
Operation directory: /home/cassandra/logs/COUNT_20210423-070104-108326
total | failed | rows/s | p50ms | p99ms | p999ms
1 | 1 | 0 | 109,790.10 | 110,058.54 | 110,058.54
Operation COUNT_20210423-070104-108326 completed with 1 errors in 1 minute and 50 seconds.
Here are the dsbulk records --
cassandra#somehost> cd logs
cassandra#somehost> cd COUNT_20210423-070104-108326/
cassandra#somehost> ls
operation.log unload-errors.log
cassandra#somehost> cat operation.log
2021-04-23 07:01:04 WARN Setting dsbulk.driver.protocol.compression is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.protocol.compression instead.
2021-04-23 07:01:04 WARN Setting dsbulk.driver.auth.* is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.auth-provider.* instead.
2021-04-23 07:01:04 INFO Operation directory: /home/cassandra/logs/COUNT_20210423-070104-108326
2021-04-23 07:02:55 WARN Operation COUNT_20210423-070104-108326 completed with 1 errors in 1 minute and 50 seconds.
2021-04-23 07:02:55 INFO Records: total: 1, successful: 0, failed: 1
2021-04-23 07:02:55 INFO Memory usage: used: 212 MB, free: 1,922 MB, allocated: 2,135 MB, available: 27,305 MB, total gc count: 4, total gc time: 98 ms
2021-04-23 07:02:55 INFO Reads: total: 1, successful: 0, failed: 1, in-flight: 0
2021-04-23 07:02:55 INFO Throughput: 0 reads/second
2021-04-23 07:02:55 INFO Latencies: mean 109,790.10, 75p 110,058.54, 99p 110,058.54, 999p 110,058.54 milliseconds
2021-04-23 07:02:58 INFO Final stats:
2021-04-23 07:02:58 INFO Records: total: 1, successful: 0, failed: 1
2021-04-23 07:02:58 INFO Memory usage: used: 251 MB, free: 1,883 MB, allocated: 2,135 MB, available: 27,305 MB, total gc count: 4, total gc time: 98 ms
2021-04-23 07:02:58 INFO Reads: total: 1, successful: 0, failed: 1, in-flight: 0
2021-04-23 07:02:58 INFO Throughput: 0 reads/second
2021-04-23 07:02:58 INFO Latencies: mean 109,790.10, 75p 110,058.54, 99p 110,058.54, 999p 110,058.54 milliseconds
cassandra#somehost> cat unload-errors.log
Statement: com.datastax.oss.driver.internal.core.cql.DefaultBoundStatement#1083fef9 [0 values, idempotence: <UNSET>, CL: <UNSET>, serial CL: <UNSET>, timestamp: <UNSET>, timeout: <UNSET>]
SELECT batch_id from .... allow filtering (Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded))
at com.datastax.oss.dsbulk.executor.api.subscription.ResultSubscription.toErrorPage(ResultSubscription.java:534)
at com.datastax.oss.dsbulk.executor.api.subscription.ResultSubscription.lambda$fetchNextPage$1(ResultSubscription.java:372)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.setFinalError(CqlRequestHandler.java:447) [4 skipped]
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.access$700(CqlRequestHandler.java:94)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.processRetryVerdict(CqlRequestHandler.java:859)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.processErrorResponse(CqlRequestHandler.java:828)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.onResponse(CqlRequestHandler.java:655)
at com.datastax.oss.driver.internal.core.channel.InFlightHandler.channelRead(InFlightHandler.java:257)
at java.lang.Thread.run(Thread.java:748) [24 skipped]
Caused by: com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
Cassandra's system.log snippet ----
DEBUG [ScheduledTasks:1] 2021-04-23 00:01:48,539 MonitoringTask.java:152 - 1 operations timed out in the last 5015 msecs:
<SELECT * FROM my query being run with limit - LIMIT 5000>, total time 10004 msec, timeout 10000 msec/cross-node
INFO [ScheduledTasks:1] 2021-04-23 00:02:38,540 MessagingService.java:1302 - RANGE_SLICE messages were dropped in last 5000 ms: 0 internal and 1 cross node
. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 10299 ms
INFO [ScheduledTasks:1] 2021-04-23 00:02:38,551 StatusLogger.java:114 -
Pool Name Active Pending Completed Blocked All Time Blocked
ReadStage 1 0 1736872997 0 0
ContinuousPagingStage 0 0 586 0 0
RequestResponseStage 0 0 1483193130 0 0
ReadRepairStage 0 0 9079516 0 0
CounterMutationStage 0 0 0 0 0
MutationStage 0 0 351841038 0 0
ViewMutationStage 0 0 0 0 0
CommitLogArchiver 0 0 32961 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 12034828 0 0
MemtableReclaimMemory 0 0 68612 0 0
PendingRangeCalculator 0 0 9 0 0
AntiCompactionExecutor 0 0 0 0 0
GossipStage 0 0 20137208 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 3798 0 0
MigrationStage 0 0 8 0 0
MemtablePostFlush 0 0 338955 0 0
PerDiskMemtableFlushWriter_0 0 0 66297 0 0
ValidationExecutor 0 0 247600 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 41757 0 0
InternalResponseStage 0 0 525242 0 0
AntiEntropyStage 0 0 767527 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 958717934 0 65
CompactionManager 0 0
MessagingService n/a 0/0
Cache Type Size Capacity KeysToSave
KeyCache 104857216 104857600 all
RowCache 0 0 all

The problem is that you are running an unload command in DSBulk to perform a SELECT COUNT() which means it has to do a full table scan to return a single row.
In addition, ALLOW FILTERING is not recommended unless you restrict the query to a single partition. In any case, the performance of ALLOW FILTERING is very unpredictable even in optimal situations.
I suggest that you instead use the DSBulk count command which is optimised for counting rows or partitions in Cassandra. For details, see the Counting data with DSBulk example.
There's also additional examples in this DSBulk Counting blog post which Alex Ott already linked in his answer. Cheers!

Expand select count(*) from sometable with where on clustering column and partial pk -- allow filtering with an additional condition on the token ranges, like this: and partial pk token(full_pk) > :start and token(full_pk) <= :end - in this case, DSBulk will perform many queries against specific token ranges that are sent to multiple nodes, and won't create the load onto the single node as in your case.
Look into the documentation for -query option, and for 4th blog in this series of blog posts about DSBulk, that could provide more information & examples: 1, 2, 3, 4, 5, 6

Is there a function I can use when having min and max condition values in a data frame?

I have a data frame that look like this:
ID Product Option Value Calculated_Charge
1 ODS REBATE 500 2.2
2 FBS PAYAS 10000 44
3 SSD BUNDL 200 0.88
4 WXR BUNDL 70000 308
What I'm trying to do is identify Revenue leakage. If a customer transacts they are charged a withdrawal fee that is determined by the amount the customer has withdrawn. The calculated fee is:
calculated_charge= Value * 0.44%
But if the customer's charged amount is less than the min_fee_value = $5 then output the min_fee_value instead of the calculated_charge. If the customer's Charged amount is greater than the max_fee_value = $60 then output the max_fee_value instead of the calculated_charge.
This is how i want my output to look like:
ID Product Option Value Calculated_Charge
1 ODS REBATE 500 5
2 FBS PAYAS 10000 44
3 SSD BUNDL 200 5
4 WXR BUNDL 70000 60
I tried using this code, but it only outputs the calculated_charge:
if (df['Product]== 'ODS' ) and (df['Option']=='REBATE'):
calculation = Value * 0.44%
if calculation > 60:
return 60
elif calculation < 5:
return 5
else:
return calculation

Same as suggested by user3483203. Putting her so that this can be closed (please ask user3483203 post his answer & accept it)
df['Calculated_Charge'] = (df['Value']*0.0044).clip(5, 60)

The best way to get table statues like total row numbers and current workload like op/s of Cassandra?

I am trying two things about our Cassandra based application, which is not possible to stop all service for testing purpose:
Test the performance like op/s and 99.9% latency using python driver. To get more accurate result, we want to know current workload such as read and write op/s of Cassandra.
Get some information like total number rows a table contains(our table have almost 8 billion record for now) & how many records inserted in every week(there are some data source that we cannot control, so it's hard to get this information from insert script directly).
I have tried some methods for these two problems:
Updated in comments.
select count(*) from xxx does not work at all, and it is too slow.
I tried to get some information using nodetool tablestats, take system_distributed for example:
Keyspace : system_distributed
Read Count: 0
Read Latency: NaN ms
Write Count: 0
Write Latency: NaN ms
Pending Flushes: 0
Table: parent_repair_history
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: -1.0
Number of partitions (estimate): 0
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 100.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Table: repair_history
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: -1.0
Number of partitions (estimate): 0
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 100.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Table: view_build_status
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: -1.0
Number of partitions (estimate): 0
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 100.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
These have some parameters that I cannot understand:
a. What does Local write count mean? If I have a table distributed on different nodes and have multiple replica, how to calculate how many rows of that table?
b. Do the first 5 lines (Read Count, Write Count) describe information of that keyspace(system_distributed)?
c. Does all latency here mean average latency？
Appreciate if you guys could give me any suggestion.
Jiashi

Cassandra write performance

We have this Cassandra cluster and would like to know if current performance is normal and what we can do to improve it.
Cluster is formed of 3 nodes located on the same datacenter with total capacity of 465GB and 2GB of Heap per node. Each node has 8 cores and 8GB or RAM. Version of different components are cqlsh 5.0.1 | Cassandra 2.1.11.872 | DSE 4.7.4 | CQL spec 3.2.1 | Native protocol v3
The workload is described as follows:
Keyspace use org.apache.cassandra.locator.SimpleStrategy placement strategy and replication factor of 3 (this is very important for us)
Workload consist mainly of write operations into a single table. The table schema is as follows:
CREATE TABLE aiceweb.records (
process_id timeuuid,
partition_key int,
collected_at timestamp,
received_at timestamp,
value text,
PRIMARY KEY ((process_id, partition_key), collected_at, received_at)
) WITH CLUSTERING ORDER BY (collected_at DESC, received_at ASC)
AND read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
Write operations come from a NodeJS based API server. The Nodejs driver provided by Datastax is used (version recently updated from 2.1.1 to 3.2.0). The code in charge of performing the write request will group write operations per Primary Key and additionally it will limit the request size to 500 INSERTs per request. The write operation is performed as a BATCH. The only options explicitly set are prepare:true, logged:false.
OpsCenter reflect a historial level of less than one request per second in the last year using this setup (each write request been a BATCH of up to 500 operations directed to the same table and the same partition). Write request latency has been at 1.6ms for 90% of requests for almost the entire year but lately it has increased up to more than 2.6ms for the 90% of requests. Os Load has been below 2.0 and Disk Utilization has been below 5% most of the time with few peaks at 7%. Average Heap usage has been 1.3GB the entire year with peaks at 1.6GB even though currently this peak is rising during the last month.
The problem with this setup is that API performance has been degrading the entire year. Currently the BATCH operation can take from 300ms up to more than 12s (leading to a operation timeout). In some cases the NodeJS driver report all Cassandra drivers down even when OpsCenter report all nodes alive and healthy.
Compaction Stats shows always 0 on each node and nodetool tpstats is showing something like:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 10554 0 0
ReadStage 0 0 687567 0 0
RequestResponseStage 0 0 767898 0 0
MutationStage 0 0 393407 0 0
ReadRepairStage 0 0 411 0 0
GossipStage 0 0 1314414 0 0
CacheCleanupExecutor 0 0 48 0 0
MigrationStage 0 0 0 0 0
ValidationExecutor 0 0 126 0 0
Sampler 0 0 0 0 0
MemtableReclaimMemory 0 0 497 0 0
InternalResponseStage 0 0 126 0 0
AntiEntropyStage 0 0 630 0 0
MiscStage 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MemtableFlushWriter 0 0 485 0 0
PendingRangeCalculator 0 0 4 0 0
MemtablePostFlush 0 0 7879 0 0
CompactionExecutor 0 0 263599 0 0
AntiEntropySessions 0 0 3 0 0
HintedHandoff 0 0 8 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
Any help or suggestion with this problem will be deeply appreciated. Feel free to request any other information you need to analyze it.
Best regards

You are staying with same amount of requests, or the workload is growing?
Looks like the server is overloaded (maybe network).

I would try to find a reproducer and run the reproducer with tracing enabled - hopefully that will help to understand what the issue is (especially if you compare it to a trace in which the latency is good).
There is an example on how to enable query tracing and retrive the output via nodejs driver examples retrieve-query-trace.js (can be found on https://github.com/datastax/nodejs-driver)

Even simple queries not working for a cluster ("Request did not complete within rpc_timeout")

I've successfully set up a Cassandra cluster with 7 nodes. However, I can't get it to work for basic queries.
CREATE TABLE lgrsettings (
siteid bigint,
channel int,
name text,
offset float,
scalefactor float,
units text,
PRIMARY KEY (siteid, channel)
)
insert into lgrsettings (siteid,channel,name,offset,scalefactor,units) values (999,1,'Flow',0.0,1.0,'m');
Then on one node:
select * from lgrsettings;
Request did not complete within rpc_timeout.
And on Another:
select * from lgrsettings;
Bad Request: unconfigured columnfamily lgrsettings
Even though the keyspace and column family shows up on all nodes.
Any ideas where I could start looking?
Alex
Interesting results. The node that handled the keyspace creation and insert shows:
Keyspace: testdata
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: 0.304 ms.
Pending Tasks: 0
Column Family: lgrsettings
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 10
Memtable Data Size: 129
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Column Family: datapoints
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Other nodes don't have this in the cfstats but do show it in DESCRIBE KEYSPACE testdata; in the CQL3 clients...

Request did not complete within rpc_timeout
Check your Cassandra logs to confirm if there is any issue - sometimes exceptions in Cassandra lead to timeouts on the client.

In a comment, the OP said he found the cause of his problem:
I've managed to solve the issue. It was due to time sync between the nodes so I installed ntpd on all nodes, waited 5 minutes and tried again and I have a working cluster!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string