Cassandra write performance - node.js

We have this Cassandra cluster and would like to know if current performance is normal and what we can do to improve it.
Cluster is formed of 3 nodes located on the same datacenter with total capacity of 465GB and 2GB of Heap per node. Each node has 8 cores and 8GB or RAM. Version of different components are cqlsh 5.0.1 | Cassandra 2.1.11.872 | DSE 4.7.4 | CQL spec 3.2.1 | Native protocol v3
The workload is described as follows:
Keyspace use org.apache.cassandra.locator.SimpleStrategy placement strategy and replication factor of 3 (this is very important for us)
Workload consist mainly of write operations into a single table. The table schema is as follows:
CREATE TABLE aiceweb.records (
process_id timeuuid,
partition_key int,
collected_at timestamp,
received_at timestamp,
value text,
PRIMARY KEY ((process_id, partition_key), collected_at, received_at)
) WITH CLUSTERING ORDER BY (collected_at DESC, received_at ASC)
AND read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
Write operations come from a NodeJS based API server. The Nodejs driver provided by Datastax is used (version recently updated from 2.1.1 to 3.2.0). The code in charge of performing the write request will group write operations per Primary Key and additionally it will limit the request size to 500 INSERTs per request. The write operation is performed as a BATCH. The only options explicitly set are prepare:true, logged:false.
OpsCenter reflect a historial level of less than one request per second in the last year using this setup (each write request been a BATCH of up to 500 operations directed to the same table and the same partition). Write request latency has been at 1.6ms for 90% of requests for almost the entire year but lately it has increased up to more than 2.6ms for the 90% of requests. Os Load has been below 2.0 and Disk Utilization has been below 5% most of the time with few peaks at 7%. Average Heap usage has been 1.3GB the entire year with peaks at 1.6GB even though currently this peak is rising during the last month.
The problem with this setup is that API performance has been degrading the entire year. Currently the BATCH operation can take from 300ms up to more than 12s (leading to a operation timeout). In some cases the NodeJS driver report all Cassandra drivers down even when OpsCenter report all nodes alive and healthy.
Compaction Stats shows always 0 on each node and nodetool tpstats is showing something like:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 10554 0 0
ReadStage 0 0 687567 0 0
RequestResponseStage 0 0 767898 0 0
MutationStage 0 0 393407 0 0
ReadRepairStage 0 0 411 0 0
GossipStage 0 0 1314414 0 0
CacheCleanupExecutor 0 0 48 0 0
MigrationStage 0 0 0 0 0
ValidationExecutor 0 0 126 0 0
Sampler 0 0 0 0 0
MemtableReclaimMemory 0 0 497 0 0
InternalResponseStage 0 0 126 0 0
AntiEntropyStage 0 0 630 0 0
MiscStage 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MemtableFlushWriter 0 0 485 0 0
PendingRangeCalculator 0 0 4 0 0
MemtablePostFlush 0 0 7879 0 0
CompactionExecutor 0 0 263599 0 0
AntiEntropySessions 0 0 3 0 0
HintedHandoff 0 0 8 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
Any help or suggestion with this problem will be deeply appreciated. Feel free to request any other information you need to analyze it.
Best regards

You are staying with same amount of requests, or the workload is growing?
Looks like the server is overloaded (maybe network).

I would try to find a reproducer and run the reproducer with tracing enabled - hopefully that will help to understand what the issue is (especially if you compare it to a trace in which the latency is good).
There is an example on how to enable query tracing and retrive the output via nodejs driver examples retrieve-query-trace.js (can be found on https://github.com/datastax/nodejs-driver)

Related

dsbulk unload is failing on large table

trying to unload data from a huge table, below is the command used and output.
$ /home/cassandra/dsbulk-1.8.0/bin/dsbulk unload --driver.auth.provider PlainTextAuthProvider --driver.auth.username xxxx --driver.auth.password xxxx --datastax-java-driver.basic.contact-points 123.123.123.123 -query "select count(*) from sometable with where on clustering column and partial pk -- allow filtering" --connector.name json --driver.protocol.compression LZ4 --connector.json.mode MULTI_DOCUMENT -maxConcurrentFiles 1 -maxRecords -1 -url dsbulk --executor.continuousPaging.enabled false --executor.maxpersecond 2500 --driver.socket.timeout 240000
Setting dsbulk.driver.protocol.compression is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.protocol.compression instead.
Setting dsbulk.driver.auth.* is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.auth-provider.* instead.
Operation directory: /home/cassandra/logs/COUNT_20210423-070104-108326
total | failed | rows/s | p50ms | p99ms | p999ms
1 | 1 | 0 | 109,790.10 | 110,058.54 | 110,058.54
Operation COUNT_20210423-070104-108326 completed with 1 errors in 1 minute and 50 seconds.
Here are the dsbulk records --
cassandra#somehost> cd logs
cassandra#somehost> cd COUNT_20210423-070104-108326/
cassandra#somehost> ls
operation.log unload-errors.log
cassandra#somehost> cat operation.log
2021-04-23 07:01:04 WARN Setting dsbulk.driver.protocol.compression is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.protocol.compression instead.
2021-04-23 07:01:04 WARN Setting dsbulk.driver.auth.* is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.auth-provider.* instead.
2021-04-23 07:01:04 INFO Operation directory: /home/cassandra/logs/COUNT_20210423-070104-108326
2021-04-23 07:02:55 WARN Operation COUNT_20210423-070104-108326 completed with 1 errors in 1 minute and 50 seconds.
2021-04-23 07:02:55 INFO Records: total: 1, successful: 0, failed: 1
2021-04-23 07:02:55 INFO Memory usage: used: 212 MB, free: 1,922 MB, allocated: 2,135 MB, available: 27,305 MB, total gc count: 4, total gc time: 98 ms
2021-04-23 07:02:55 INFO Reads: total: 1, successful: 0, failed: 1, in-flight: 0
2021-04-23 07:02:55 INFO Throughput: 0 reads/second
2021-04-23 07:02:55 INFO Latencies: mean 109,790.10, 75p 110,058.54, 99p 110,058.54, 999p 110,058.54 milliseconds
2021-04-23 07:02:58 INFO Final stats:
2021-04-23 07:02:58 INFO Records: total: 1, successful: 0, failed: 1
2021-04-23 07:02:58 INFO Memory usage: used: 251 MB, free: 1,883 MB, allocated: 2,135 MB, available: 27,305 MB, total gc count: 4, total gc time: 98 ms
2021-04-23 07:02:58 INFO Reads: total: 1, successful: 0, failed: 1, in-flight: 0
2021-04-23 07:02:58 INFO Throughput: 0 reads/second
2021-04-23 07:02:58 INFO Latencies: mean 109,790.10, 75p 110,058.54, 99p 110,058.54, 999p 110,058.54 milliseconds
cassandra#somehost> cat unload-errors.log
Statement: com.datastax.oss.driver.internal.core.cql.DefaultBoundStatement#1083fef9 [0 values, idempotence: <UNSET>, CL: <UNSET>, serial CL: <UNSET>, timestamp: <UNSET>, timeout: <UNSET>]
SELECT batch_id from .... allow filtering (Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded))
at com.datastax.oss.dsbulk.executor.api.subscription.ResultSubscription.toErrorPage(ResultSubscription.java:534)
at com.datastax.oss.dsbulk.executor.api.subscription.ResultSubscription.lambda$fetchNextPage$1(ResultSubscription.java:372)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.setFinalError(CqlRequestHandler.java:447) [4 skipped]
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.access$700(CqlRequestHandler.java:94)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.processRetryVerdict(CqlRequestHandler.java:859)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.processErrorResponse(CqlRequestHandler.java:828)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.onResponse(CqlRequestHandler.java:655)
at com.datastax.oss.driver.internal.core.channel.InFlightHandler.channelRead(InFlightHandler.java:257)
at java.lang.Thread.run(Thread.java:748) [24 skipped]
Caused by: com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
Cassandra's system.log snippet ----
DEBUG [ScheduledTasks:1] 2021-04-23 00:01:48,539 MonitoringTask.java:152 - 1 operations timed out in the last 5015 msecs:
<SELECT * FROM my query being run with limit - LIMIT 5000>, total time 10004 msec, timeout 10000 msec/cross-node
INFO [ScheduledTasks:1] 2021-04-23 00:02:38,540 MessagingService.java:1302 - RANGE_SLICE messages were dropped in last 5000 ms: 0 internal and 1 cross node
. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 10299 ms
INFO [ScheduledTasks:1] 2021-04-23 00:02:38,551 StatusLogger.java:114 -
Pool Name Active Pending Completed Blocked All Time Blocked
ReadStage 1 0 1736872997 0 0
ContinuousPagingStage 0 0 586 0 0
RequestResponseStage 0 0 1483193130 0 0
ReadRepairStage 0 0 9079516 0 0
CounterMutationStage 0 0 0 0 0
MutationStage 0 0 351841038 0 0
ViewMutationStage 0 0 0 0 0
CommitLogArchiver 0 0 32961 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 12034828 0 0
MemtableReclaimMemory 0 0 68612 0 0
PendingRangeCalculator 0 0 9 0 0
AntiCompactionExecutor 0 0 0 0 0
GossipStage 0 0 20137208 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 3798 0 0
MigrationStage 0 0 8 0 0
MemtablePostFlush 0 0 338955 0 0
PerDiskMemtableFlushWriter_0 0 0 66297 0 0
ValidationExecutor 0 0 247600 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 41757 0 0
InternalResponseStage 0 0 525242 0 0
AntiEntropyStage 0 0 767527 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 958717934 0 65
CompactionManager 0 0
MessagingService n/a 0/0
Cache Type Size Capacity KeysToSave
KeyCache 104857216 104857600 all
RowCache 0 0 all
The problem is that you are running an unload command in DSBulk to perform a SELECT COUNT() which means it has to do a full table scan to return a single row.
In addition, ALLOW FILTERING is not recommended unless you restrict the query to a single partition. In any case, the performance of ALLOW FILTERING is very unpredictable even in optimal situations.
I suggest that you instead use the DSBulk count command which is optimised for counting rows or partitions in Cassandra. For details, see the Counting data with DSBulk example.
There's also additional examples in this DSBulk Counting blog post which Alex Ott already linked in his answer. Cheers!
Expand select count(*) from sometable with where on clustering column and partial pk -- allow filtering with an additional condition on the token ranges, like this: and partial pk token(full_pk) > :start and token(full_pk) <= :end - in this case, DSBulk will perform many queries against specific token ranges that are sent to multiple nodes, and won't create the load onto the single node as in your case.
Look into the documentation for -query option, and for 4th blog in this series of blog posts about DSBulk, that could provide more information & examples: 1, 2, 3, 4, 5, 6

Excel: Match and Index based on range

I am stumped with the following problem and not sure how to accomplish it in excel. Here is an example of the data:
A B
1 Date Stock_Return
2 Jan-95 -5.2%
3 Feb-95 2.1%
4 Mar-95 3.7%
5 Apr-95 6.9%
6 May-95 6.5%
7 Jun-95 -5.6%
8 Jul-95 6.6%
9 Aug-95 6.2%
What I would like is to have the dates returned which fall within a certain return range and sorted from low to high.
For example:
1 2 3 4 5
Below -7% 0 0 0 0 0
-7% to -5% Jun-95 Jan-95 0 0 0
-5% to -3% 0 0 0 0 0
-3% to 0% 0 0 0 0 0
0% to 3% Feb-95 0 0 0 0
3% to 5% Mar-95 0 0 0 0
5% to 7% Aug-95 May-95 Jul-95 Apr-95 0
I thought Index and Match might make the most sense but when I drag across columns it doesn't work. Any help is very much appreciated.
You can use AGGREGATE function:
=IFERROR(AGGREGATE(14,6,$A$2:$A$9/(($B$2:$B$9>$D2)*($B$2:$B$9<=$E2)),COLUMN(A1)),"0")
If you have Excel O365, you can use the FILTER function:
F2: =IFERROR(TRANSPOSE(FILTER($A$2:$A$9,(F2<=$B$2:$B$9)*(G2>=$B$2:$B$9))),"")
and fill down.

Is there a way to prevent heap exhausted in common lisp

I am not too familiar with garbage collection in lisp, and I wonder how it is possible to manage it in order to avoid the
fatal error: Heap exhausted during garbage collection in *inferior-lisp*.
SLIME 2.20 and SBCL 2.0.1
Heap exhausted during garbage collection: 0 bytes available, 16 requested.
Gen Boxed Code Raw LgBox LgCode LgRaw Pin Alloc Waste Trig WP GCs Mem-age
1 19685 0 1 0 0 0 5 644710992 359856 21474836 19686 0 1.0000
2 29304 0 1 0 0 0 13 960070208 196032 21474836 29305 0 0.0000
3 0 0 0 0 0 0 0 0 0 21474836 0 0 0.0000
4 0 0 0 0 0 0 0 0 0 21474836 0 0 0.0000
5 367 1 130 34 0 15 51 17101008 823088 38575844 547 13 0.0000
6 485 2 221 55 0 10 0 24716944 612720 2000000 773 0 0.0000
7 15224 0 1 0 0 0 0 498860064 32736 2000000 15225 0 0.0000
Total bytes allocated = 2145459216
Dynamic-space-size bytes = 2147483648
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = true
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 84761(tid 0x700000026000):
Heap exhausted, game over.
Error opening /dev/tty: Device not configured
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>
I am using an algorithm to solve combinatorial issues, and as you can guess, the field of search increase exponentially. It is no point to increase the dynamic-space-size, because this will not solve the issue. So, the idea is to stop the process before heap exhausted. Something like a condition when a memory limit is reached for instance.
Any help is welcome.
It seems that there is no real answer to my question.
I found some guidelines with which I will manage my issue.
Here are the links for the record: https://sourceforge.net/p/sbcl/mailman/message/30184781/ and https://sourceforge.net/p/sbcl/mailman/message/18414749/

Subtract two dataframes to get the value difference

I have a dataframe on which I get a daily aggregate for particular dates.Below is the dataframe for the date 2018-02-11 where I have found out the mean, min, max, std
cpu cpu cpu cpu mem mem mem mem load load load load drops drops drops drops latency latency latency latency gw_latency gw_latency gw_latency gw_latency upload upload upload upload download download download download sap_drops sap_drops sap_drops sap_drops sap_latency sap_latency sap_latency sap_latency
mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std
date
2018-02-11 4.282442748 0 17 4.361148065 13.61068702 0 27 6.123815451 3.891450382 0 47.62 6.426298507 1.526717557 0 100 12.30842628 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Similarly I have another dataframe for the date 2018-02-12 for which I found the mean, min, max, std
cpu cpu cpu cpu mem mem mem mem load load load load drops drops drops drops latency latency latency latency gw_latency gw_latency gw_latency gw_latency upload upload upload upload download download download download sap_drops sap_drops sap_drops sap_drops sap_latency sap_latency sap_latency sap_latency
mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std
date
2018-02-12 5.726315789 0 21 2.938315053 22.30526316 0 23 3.581474037 6.06 0 44.75 6.798944285 0.5263157895 0 100 7.254762501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is code below
import pandas as pd
df = pd.read_csv("metrics.csv", parse_dates=["date"])
df.set_index("date", inplace=True)
df_prev = df.loc['2018-02-11'].resample('D')['cpu', 'mem', 'load', 'drops', 'latency',
'gw_latency', 'upload', 'download', 'sap_drops',
'sap_latency'].agg(['mean', 'min', 'max', 'std']).fillna(0)
df_next = df.loc['2018-02-12'].resample('D')['cpu', 'mem', 'load', 'drops', 'latency',
'gw_latency', 'upload', 'download', 'sap_drops',
'sap_latency'].agg(['mean', 'min', 'max', 'std']).fillna(0)
Now I want to subtract the two dataframes to get the value difference for each of the column.This is what I do
df_diff = df_next.sub(df_prev, fill_value=0)
print(df_diff)
But it doesn't subtract anything and I also get the dates which doesn't make any sense since I only want the stats difference.
cpu cpu cpu cpu mem mem mem mem load load load load drops drops drops drops latency latency latency latency gw_latency gw_latency gw_latency gw_latency upload upload upload upload download download download download sap_drops sap_drops sap_drops sap_drops sap_latency sap_latency sap_latency sap_latency
mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std
date
2018-02-11 -4.282442748 0 -17 -4.361148065 -13.61068702 0 -27 -6.123815451 -3.891450382 0 -47.62 -6.426298507 -1.526717557 0 -100 -12.30842628 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2018-02-12 5.726315789 0 21 2.938315053 22.30526316 0 23 3.581474037 6.06 0 44.75 6.798944285 0.5263157895 0 100 7.254762501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
As you can see it doesn't do any subtraction at all.Why is that happening?
PS I ultimately want to find out the percentage difference between the stats of the two dates.Is there any direct way to do that?
To get the difference
df_next - df_prev.values
To get the % change,
(df_next - df_prev.values)/(df_prev.values) * 100

Even simple queries not working for a cluster ("Request did not complete within rpc_timeout")

I've successfully set up a Cassandra cluster with 7 nodes. However, I can't get it to work for basic queries.
CREATE TABLE lgrsettings (
siteid bigint,
channel int,
name text,
offset float,
scalefactor float,
units text,
PRIMARY KEY (siteid, channel)
)
insert into lgrsettings (siteid,channel,name,offset,scalefactor,units) values (999,1,'Flow',0.0,1.0,'m');
Then on one node:
select * from lgrsettings;
Request did not complete within rpc_timeout.
And on Another:
select * from lgrsettings;
Bad Request: unconfigured columnfamily lgrsettings
Even though the keyspace and column family shows up on all nodes.
Any ideas where I could start looking?
Alex
Interesting results. The node that handled the keyspace creation and insert shows:
Keyspace: testdata
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: 0.304 ms.
Pending Tasks: 0
Column Family: lgrsettings
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 10
Memtable Data Size: 129
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Column Family: datapoints
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Other nodes don't have this in the cfstats but do show it in DESCRIBE KEYSPACE testdata; in the CQL3 clients...
Request did not complete within rpc_timeout
Check your Cassandra logs to confirm if there is any issue - sometimes exceptions in Cassandra lead to timeouts on the client.
In a comment, the OP said he found the cause of his problem:
I've managed to solve the issue. It was due to time sync between the nodes so I installed ntpd on all nodes, waited 5 minutes and tried again and I have a working cluster!

Resources