I see a Cassandra Java process on my Linux instance; it uses ~38 GB of memory and shows ~700 threads under it.
When connections are made to the database via Python or Java, do they become threads under the main Java process or separate OS processes?
When a cluster connection spawns multiple threads, do those also become threads under the main process? If so, how can I distinguish between connection threads and connection-spawning threads?
Is the memory allocated for session threads allocated under non-heap memory?
Update:
@chris - here is the output of nodetool tpstats:
[username@hostname ~]$ nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 110336013 0 0
ContinuousPagingStage 0 0 31 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 4244757 0 0
MutationStage 0 0 25309020 0 0
GossipStage 0 0 2484700 0 0
RequestResponseStage 0 0 46705216 0 0
ReadRepairStage 0 0 2193356 0 0
CounterMutationStage 0 0 3563130 0 0
MemtablePostFlush 0 0 117717 0 0
ValidationExecutor 1 1 111176 0 0
MemtableFlushWriter 0 0 23843 0 0
ViewMutationStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Repair#1953 1 3 1 0 0
MemtableReclaimMemory 0 0 28251 0 0
PendingRangeCalculator 0 0 6 0 0
AntiCompactionExecutor 0 0 0 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 29 0 0
Native-Transport-Requests 0 0 110953286 0 0
MigrationStage 0 0 19 0 0
PerDiskMemtableFlushWriter_0 0 0 27853 0 0
Sampler 0 0 0 0 0
InternalResponseStage 0 0 21264 0 0
AntiEntropyStage 0 0 350913 0 0
Message type Dropped Latency waiting in queue (micros)
50% 95% 99% Max
READ 0 0.00 0.00 0.00 10090.81
RANGE_SLICE 0 0.00 0.00 10090.81 10090.81
_TRACE 0 N/A N/A N/A N/A
HINT 0 0.00 0.00 0.00 0.00
MUTATION 0 0.00 0.00 0.00 10090.81
COUNTER_MUTATION 0 0.00 0.00 0.00 10090.81
BATCH_STORE 0 0.00 0.00 0.00 0.00
BATCH_REMOVE 0 0.00 0.00 0.00 0.00
REQUEST_RESPONSE 0 0.00 0.00 0.00 12108.97
PAGED_RANGE 0 N/A N/A N/A N/A
READ_REPAIR 0 0.00 0.00 0.00 0.00
The connections go to a Netty service, which should have a number of threads equal to the number of cores even if you have 10,000 connected clients. However, Cassandra was initially designed around a Staged Event-Driven Architecture (SEDA), which sits somewhere between an async and a fully threaded model: it creates pools of threads to handle different types of tasks. This does mean that, depending on the configuration in your yaml, there can be a lot of threads. For example, by default there are up to 128 threads for the native transport pool, 32 concurrent readers, 32 concurrent writers, 32 counter mutations, etc., but if your cluster was tuned for SSDs these might be higher. With that in mind, a number of these pools draw from a shared pool (the threads show up as SharedWorkers) backed by the SEPExecutor (shared executor pool), so a spike may create many threads that are then rarely utilized afterwards.
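To see this from the client side, here is a minimal sketch using the DataStax Python driver (the contact point and query are placeholders). The driver keeps only a small pool of TCP connections per node and multiplexes many concurrent requests over them; on the server those connections are serviced by the shared native-transport (Netty) threads inside the existing Cassandra JVM, not by new OS processes:

```python
# Minimal sketch using the DataStax Python driver (cassandra-driver).
# Contact point and query are placeholders; adjust for your cluster.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # client-side driver, runs in your app process
session = cluster.connect()        # opens a small pool of TCP connections per node

# Many concurrent requests are multiplexed over those few connections;
# on the server they are handled by the shared native-transport threads,
# not by per-connection processes.
futures = [session.execute_async("SELECT release_version FROM system.local")
           for _ in range(100)]
results = [f.result() for f in futures]
print(len(results), "queries completed over", len(cluster.metadata.all_hosts()), "host(s)")

cluster.shutdown()
```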
nodetool tpstats will give you details on the different pools and how many threads are active in each, which can help identify which threads are in use and whether they are being used at all. Beyond that, you can also use jstack (run it as the same user as the Cassandra process) to dump the thread traces. If that is too much to look through, there are tools like https://fastthread.io/ that make viewing the dump easier.
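If you just want a quick count of which pools the ~700 threads belong to, a rough Python sketch can group a saved jstack dump by thread-name prefix (the file name `threads.txt` is a placeholder; capture it with `jstack <pid> > threads.txt` as the Cassandra user):

```python
# Rough sketch: count threads in a jstack dump by name prefix, as a quick
# alternative to uploading the dump to fastthread.io.
import re
from collections import Counter

counts = Counter()
with open("threads.txt") as f:
    for line in f:
        # jstack thread headers look like: "Native-Transport-Requests-12" #123 daemon ...
        m = re.match(r'^"([^"]+)"', line)
        if m:
            # strip trailing numeric suffixes so pool members group together,
            # e.g. "SharedPool-Worker-42" -> "SharedPool-Worker"
            name = re.sub(r'[-_]?\d+$', '', m.group(1))
            counts[name] += 1

for name, n in counts.most_common(20):
    print(f"{n:5d}  {name}")
```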
For what it is worth, 32 GB of memory and 700 threads doesn't sound like an issue.
Related
We have a 3-node Cassandra cluster.
We have an application that uses a keyspace that creates a high read load on the disks. The problem has a cumulative effect: the more days we interact with the keyspace, the more the disk reads grow:
(screenshot: high-load read)
Reads go up to > 700 MB/s. Then the storage (SAN) begins to degrade, and then the Cassandra cluster also degrades.
UPD 25.10.2021: "I wrote it a little incorrectly: the space is allocated to the virtual machine through the SAN and appears as a normal drive."
The only thing that helps is clearing the keyspace.
Output of the "tpstats" and "cfstats" commands:
[cassandra-01 ~]$ nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 1 1 1837888055 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 6789640 0 0
MutationStage 0 0 870873552 0 0
MemtableReclaimMemory 0 0 7402 0 0
PendingRangeCalculator 0 0 9 0 0
GossipStage 0 0 18939072 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 3 0 0
RequestResponseStage 0 0 1307861786 0 0
Native-Transport-Requests 0 0 2981687196 0 0
ReadRepairStage 0 0 346448 0 0
CounterMutationStage 0 0 0 0 0
MigrationStage 0 0 168 0 0
MemtablePostFlush 0 0 8193 0 0
PerDiskMemtableFlushWriter_0 0 0 7402 0 0
ValidationExecutor 0 0 21 0 0
Sampler 0 0 10988 0 0
MemtableFlushWriter 0 0 7402 0 0
InternalResponseStage 0 0 3404 0 0
ViewMutationStage 0 0 0 0 0
AntiEntropyStage 0 0 71 0 0
CacheCleanupExecutor 0 0 0 0 0
Message type Dropped
READ 7
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 5
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
[cassandra-01 ~]$ nodetool cfstats box_messages -H
Total number of tables: 73
----------------
Keyspace : box_messages
Read Count: 48847567
Read Latency: 0.055540737801741485 ms
Write Count: 69461300
Write Latency: 0.010656743870327794 ms
Pending Flushes: 0
Table: messages
SSTable count: 6
Space used (live): 3.84 GiB
Space used (total): 3.84 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 10.3 MiB
SSTable Compression Ratio: 0.23265712113582082
Number of partitions (estimate): 4156030
Memtable cell count: 929912
Memtable data size: 245.04 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 92
Local read count: 20511450
Local read latency: 0.106 ms
Local write count: 52111294
Local write latency: 0.013 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 57318
Bloom filter false ratio: 0.00841
Bloom filter space used: 6.56 MiB
Bloom filter off heap memory used: 6.56 MiB
Index summary off heap memory used: 1.78 MiB
Compression metadata off heap memory used: 1.95 MiB
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 17084
Compacted partition mean bytes: 3287
Average live cells per slice (last five minutes): 2.0796939751354797
Maximum live cells per slice (last five minutes): 10
Average tombstones per slice (last five minutes): 1.1939751354797576
Maximum tombstones per slice (last five minutes): 2
Dropped Mutations: 5 bytes
(I'm unable to comment and hence posting it as an answer)
As folks mentioned, a SAN is not going to be the best fit here; you could read through the list of anti-patterns documented here, which also applies to OSS C*.
I have a 3-node Cassandra cluster (3.7), a keyspace:
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
and a table:
CREATE TABLE tradingdate (key text, tradingdate date, PRIMARY KEY (key, tradingdate));
One day, when deleting one row like:
delete from tradingdate
where key='tradingDay' and tradingdate='2018-12-31'
the deleted row became a ghost. When I run the query:
select * from tradingdate
where key='tradingDay' and tradingdate>'2018-12-27' limit 2;
key | tradingdate
------------+-------------
tradingDay | 2018-12-28
tradingDay | 2019-01-02
select * from tradingdate
where key='tradingDay' and tradingdate<'2019-01-03'
order by tradingdate desc limit 2;
key | tradingdate
------------+-------------
tradingDay | 2019-01-02
tradingDay | 2018-12-31
So when using ORDER BY, the deleted row (tradingDay, 2018-12-31) comes back.
I guess I only deleted the row on one node, but it still exists on another node. So I executed:
nodetool repair demo tradingdate
on all 3 nodes, and then the deleted row disappeared completely.
So I want to know: why, when using ORDER BY, can I see the ghost row?
This is some good reading about deletes in Cassandra (and other distributed systems as well):
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
As well as:
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html
You will need to run/schedule a routine repair at least once within gc_grace_seconds (which defaults to ten days) to prevent deleted data from reappearing in your cluster.
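If you want to confirm how much time you have, a small sketch with the Python driver (the contact point is a placeholder) reads the table's current gc_grace_seconds from system_schema, which exists on Cassandra 3.x:

```python
# Quick check of gc_grace_seconds for the table in question (Cassandra 3.x
# keeps it in system_schema.tables). Contact point is a placeholder.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

row = session.execute(
    "SELECT gc_grace_seconds FROM system_schema.tables "
    "WHERE keyspace_name = %s AND table_name = %s",
    ("demo", "tradingdate"),
).one()

if row:
    days = row.gc_grace_seconds / 86400
    print(f"gc_grace_seconds = {row.gc_grace_seconds} (~{days:.1f} days); "
          "schedule repairs more frequently than this")

cluster.shutdown()
```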
Also, you should look for dropped messages in case one of your nodes is missing deletes (or other messages):
# nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 787032744 0 0
ReadStage 0 0 1627843193 0 0
RequestResponseStage 0 0 2257452312 0 0
ReadRepairStage 0 0 99910415 0 0
CounterMutationStage 0 0 0 0 0
HintedHandoff 0 0 1582 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 6649458 0 0
MemtableReclaimMemory 0 0 17987 0 0
PendingRangeCalculator 0 0 46 0 0
GossipStage 0 0 22766295 0 0
MigrationStage 0 0 8 0 0
MemtablePostFlush 0 0 127844 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 17851 0 0
InternalResponseStage 0 0 8669 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 631966060 0 19
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
MUTATION 0
COUNTER_MUTATION 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
Dropped messages indicate that there is something wrong.
I got some pointers about Native Transport Requests in Cassandra from this link: What are native transport requests in Cassandra?
As per my understanding, any query I execute in Cassandra is a Native Transport Request.
I frequently get a Request Timed Out error in Cassandra, and I observed the following in the Cassandra debug log as well as via nodetool tpstats:
/var/log/cassandra# nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 186933949 0 0
ViewMutationStage 0 0 0 0 0
ReadStage 0 0 781880580 0 0
RequestResponseStage 0 0 5783147 0 0
ReadRepairStage 0 0 0 0 0
CounterMutationStage 0 0 14430168 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 366708 0 0
MemtableReclaimMemory 0 0 788 0 0
PendingRangeCalculator 0 0 1 0 0
GossipStage 0 0 0 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 0 0 0
MigrationStage 0 0 0 0 0
MemtablePostFlush 0 0 799 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 788 0 0
InternalResponseStage 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 477629331 0 1063468
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 0
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
1) What is the 'All time blocked' state?
2) What does the value 1063468 denote? How harmful is it?
3) How do I tune this?
Each request is processed by the NTR stage before being handed off to the read/mutation stage, but it still blocks while waiting for completion. To prevent being overloaded, the stage starts to block tasks being added to its queue in order to apply back pressure to the client. Every time a request is blocked, the 'All time blocked' counter is incremented. So 1063468 requests have at some point been blocked for a period of time because too many requests were backed up.
In situations where the app has spikes of queries, this blocking is unnecessary and can cause issues, so you can increase the queue limit with something like -Dcassandra.max_queued_native_transport_requests=4096 (the default is 128). You can also throttle requests on the client side (see the sketch below), but I'd try increasing the queue size first.
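As one way to throttle on the client side, here is a hedged sketch with the Python driver that caps the number of in-flight async requests using a semaphore; the contact point, keyspace, table, and the limit of 64 are illustrative placeholders, not recommendations:

```python
# Sketch of client-side throttling: cap in-flight async requests so the
# coordinator's NTR queue is not flooded during spikes.
import threading
from cassandra.cluster import Cluster

MAX_IN_FLIGHT = 64
in_flight = threading.Semaphore(MAX_IN_FLIGHT)

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")   # placeholder keyspace
insert = session.prepare("INSERT INTO my_table (id, payload) VALUES (?, ?)")  # placeholder table

def release(_result):
    in_flight.release()

for i in range(100_000):
    in_flight.acquire()                    # blocks once 64 requests are outstanding
    future = session.execute_async(insert, (i, "some payload"))
    future.add_callbacks(callback=release, errback=release)

# drain: wait for the remaining requests to finish before shutting down
for _ in range(MAX_IN_FLIGHT):
    in_flight.acquire()

cluster.shutdown()
```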
There may also be some request that is exceptionally slow and clogging up your system. If you have monitoring set up, look at high-percentile read/write coordinator latencies. You can also use nodetool proxyhistograms. There may be something in your data model or queries that is causing issues.
I am doing some tests in a Cassandra cluster, and now I have a table with 1 TB of data per node. When I used YCSB to do more insert operations, I found the throughput was really low (about 10,000 ops/sec) compared to an identical, new table in the same cluster (about 80,000 ops/sec). While inserting, CPU usage was about 40%, and there was almost no disk usage.
I used nodetool tpstats to get task details; it showed:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 102 0 0
RequestResponseStage 0 0 41571733 0 0
MutationStage 384 21949 82375487 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 247100 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 6 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 16 16 4745 0 0
MemtableReclaimMemory 0 0 4745 0 0
PendingRangeCalculator 0 0 4 0 0
MemtablePostFlush 1 163 9394 0 0
CompactionExecutor 8 29 13713 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 2 2 5 0 0
I found there was a large number of pending MutationStage and MemtablePostFlush tasks.
I have read some related articles about Cassandra write limitations, but found no useful information. I want to know why there is such a huge difference in throughput between two otherwise identical tables that differ only in data size.
In addition, I use SSDs on my server; however, this phenomenon also occurs in another cluster using HDDs.
While Cassandra was running, I found that both %user and %nice CPU utilization were about 10% while only compaction tasks were running, with compaction throughput of about 80 MB/s, even though I had set the nice value to 0 for my Cassandra process.
Wild guess: your system is busy compacting the sstables.
Check it with nodetool compactionstats.
BTW, YCSB does not use prepared statements, which makes it a bad estimator of actual application load.
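For reference, here is a minimal Python-driver sketch of the difference (keyspace, table, and values are placeholders): the prepared statement is parsed once by the server and afterwards only bound values are sent, which is closer to how a real application behaves than YCSB's ad-hoc statements.

```python
# Minimal illustration of unprepared vs. prepared inserts with the Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")   # placeholder keyspace

# Unprepared: the full CQL string is parsed by the coordinator every time.
for i in range(1000):
    session.execute("INSERT INTO kv (k, v) VALUES (%s, %s)", (i, "value"))

# Prepared: parsed once, then executed with bound values only.
insert = session.prepare("INSERT INTO kv (k, v) VALUES (?, ?)")
for i in range(1000):
    session.execute(insert, (i, "value"))

cluster.shutdown()
```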
I have a three-node Cassandra cluster running perfectly fine. When I run the query select count(*) from usertracking; on one of the nodes of my cluster, I get the following error:
errors={}, last_host=localhost
Statement trace did not complete within 10 seconds
It works fine on the other two nodes of the cluster. Can anyone tell me why I am getting this error only on one node, and what the reason for it is?
As suggested in https://stackoverflow.com/questions/27766976/cassandra-cqlsh-query-fails-with-no-error, I have also increased the timeout parameters read_request_timeout_in_ms and range_request_timeout_in_ms in cassandra.yaml, but that didn't help.
Keyspace definition:
CREATE KEYSPACE cw WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
Table definition:
CREATE TABLE usertracking (
cwc text,
cur_visit_id text,
cur_visit_datetime timestamp,
cur_visit_last_ts bigint,
prev_visit_datetime timestamp,
prev_visit_last_ts bigint,
tot_page_view bigint,
tot_time_spent bigint,
tot_visit_count bigint,
PRIMARY KEY (cwc)
);
Output of nodetool status:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.200 146.06 MB 1 ? 92c5bd4a-8f2b-4d7b-b420-6261a1bb8648 rack1
UN 192.168.1.201 138.53 MB 1 ? 817d331b-4cc0-4770-be6d-7896fc00e82f rack1
UN 192.168.1.202 155.04 MB 1 ? 351731fb-c3ad-45e0-b2c8-bc1f3b1bf25d rack1
Output of nodetool tpstats:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 25 0 0
RequestResponseStage 0 0 257103 0 0
MutationStage 0 0 593226 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 612335 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 87 0 0
MemtableReclaimMemory 0 0 87 0 0
PendingRangeCalculator 0 0 3 0 0
MemtablePostFlush 0 0 2829 0 0
CompactionExecutor 0 0 216 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 2 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
Not sure if this helps: I have a very similar configuration in my development environment and was getting OperationTimedOut errors when running count operations.
Like yourself, I originally tried working with the various TIMEOUT variables in cassandra.yaml, but these appeared to make no difference.
In the end, the timeout that was being exceeded was actually in the cqlsh client itself. When I updated/created the ~/.cassandra/cqlshrc file with the following, I was able to run the count without failure:
[connection]
client_timeout = 20
This example sets the client timeout to 20 seconds.
There is some information in the following article about the cqlshrc file: CQL Configuration File
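If you hit the same timeout from an application rather than from cqlsh, the Python driver has an equivalent client-side request timeout you can raise. A sketch, with the contact point as a placeholder and the 20-second value mirroring the cqlshrc example above:

```python
# Sketch: raising the client-side request timeout in the Python driver,
# analogous to client_timeout in cqlshrc.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("cw")

session.default_timeout = 20  # seconds; the driver default is 10

# ...or override it per query:
rows = session.execute("SELECT count(*) FROM usertracking", timeout=20)
print(rows.one())

cluster.shutdown()
```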
Hopefully this helps, sorry if I'm barking up the wrong tree.