Cassandra node stuck in Joining state

I'm trying to add a new node to an existing Cassandra 3.11.1.0 cluster with the auto_bootstrap: true option. The new node completed streaming data from the other nodes, then finished the secondary index build and compaction for the main table, but after that it seems to be stuck in the JOINING state. There are no errors or warnings in the node's system.log, just INFO messages.
There was also significant CPU load on the node during the secondary index build and compaction, and now there is none. So it looks like the node got stuck during bootstrap and is currently idle.
# nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID            Rack
UN  XX.XX.XX.109  33.37 GiB  256     ?     xxxx-9f1c79171069  rack1
UN  XX.XX.XX.47   35.41 GiB  256     ?     xxxx-42531b89d462  rack1
UJ  XX.XX.XX.32   15.18 GiB  256     ?     xxxx-f5838fa433e4  rack1
UN  XX.XX.XX.98   20.65 GiB  256     ?     xxxx-add6ed64bcc2  rack1
UN  XX.XX.XX.21   33.02 GiB  256     ?     xxxx-660149bc0070  rack1
UN  XX.XX.XX.197  25.98 GiB  256     ?     xxxx-703bd5a1f2d4  rack1
UN  XX.XX.XX.151  21.9 GiB   256     ?     xxxx-867cb3b8bfca  rack1
nodetool compactionstats shows some pending compactions, but I have no idea whether there is any activity or it is just stuck:
# nodetool compactionstats
pending tasks: 4
- keyspace_name.table_name: 4
nodetool netstats shows that the Completed counters for Small and Gossip messages are increasing:
# nodetool netstats
Mode: JOINING
Bootstrap xxxx-81b554ae3baf
/XX.XX.XX.109
/XX.XX.XX.47
/XX.XX.XX.98
/XX.XX.XX.151
/XX.XX.XX.21
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 0 0 0
Small messages n/a 0 571777 0
Gossip messages n/a 0 199190 0
nodetool tpstats shows that the Completed counters for the CompactionExecutor, MigrationStage, and GossipStage pools are increasing:
# nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 0 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 251 0 0
MutationStage 0 0 571599 0 0
MemtableReclaimMemory 0 0 98 0 0
PendingRangeCalculator 0 0 7 0 0
GossipStage 0 0 185695 0 0
SecondaryIndexManagement 0 0 2 0 0
HintsDispatcher 0 0 0 0 0
RequestResponseStage 0 0 6 0 0
ReadRepairStage 0 0 0 0 0
CounterMutationStage 0 0 0 0 0
MigrationStage 0 0 14 0 0
MemtablePostFlush 0 0 148 0 0
PerDiskMemtableFlushWriter_0 0 0 98 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 98 0 0
InternalResponseStage 0 0 11 0 0
ViewMutationStage 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 124
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
So it looks like the node is still receiving some data from other nodes and applying it, but I don't know how to check the progress, or whether I should wait or cancel the bootstrap. I have already tried to re-bootstrap this node once and ended up in the same situation: the node stayed in the UJ state for a long time (16 hours), with some pending compactions and 99.9% of CPU idle. I also added nodes to the cluster about a month ago without any issues: they joined within 2-3 hours and moved to the UN state.
Also, nodetool cleanup is running on one of the existing nodes, and on that node I see the following warnings in system.log:
WARN [STREAM-IN-/XX.XX.XX.32:46814] NoSpamLogger.java:94 log Spinning trying to capture readers [BigTableReader(path='/var/lib/cassandra/data/keyspace_name/table_name-6750375affa011e7bdc709b3eb0d8941/mc-1117-big-Data.db'), BigTableReader(path='/var/lib/cassandra/data/keyspace_name/table_name-6750375affa011e7bdc709b3eb0d8941/mc-1070-big-Data.db'), ...]
Since cleanup is a local procedure, it should not affect the new node during bootstrap, but I could be wrong.
Any help will be appreciated.

Sometimes this can happen. Maybe there was an issue with gossip communicating that joining had completed, or maybe another node quickly reported as DN and disrupted the process.
When this happens, you have a couple of options:
You can always stop the node, wipe it, and try to join it again.
If you're sure that all (or most) of the data is there, you can stop the node and add the line auto_bootstrap: false to its cassandra.yaml. The node will start, join the cluster, and serve its data. With this option, it's usually a good idea to run a repair once the node is up.

Just set auto_bootstrap: false in cassandra.yaml on the new node, then restart the node. It will join as UN. After some time, run a full repair to ensure consistency.
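To confirm after the restart that no node is still lingering in a non-normal state, the nodetool status output can be scanned with a short script. This is only a minimal sketch: non_normal_nodes is a hypothetical helper, not part of Cassandra, and it assumes the status layout shown in the question (a two-letter state code followed by the address).

```python
def non_normal_nodes(status_output: str) -> list[tuple[str, str]]:
    """Return (state, address) pairs for every node that is not Up/Normal."""
    nodes = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        state = fields[0]
        # Node rows start with a two-letter code: Up/Down, then
        # Normal/Leaving/Joining/Moving. Header rows such as "--" fail this check.
        if len(state) == 2 and state[0] in "UD" and state[1] in "NLJM":
            if state != "UN":
                nodes.append((state, fields[1]))
    return nodes


# Shortened copy of the output from the question (assumed layout).
SAMPLE_STATUS = """\
Datacenter: dc1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN XX.XX.XX.109 33.37 GiB 256 ? xxxx-9f1c79171069 rack1
UJ XX.XX.XX.32 15.18 GiB 256 ? xxxx-f5838fa433e4 rack1
UN XX.XX.XX.98 20.65 GiB 256 ? xxxx-add6ed64bcc2 rack1
"""

print(non_normal_nodes(SAMPLE_STATUS))
```

An empty result means every listed node reports UN; the script only reads text, so the output of nodetool status has to be captured and passed in separately.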

Related

High disk I/O (read) on Cassandra nodes

We have a 3-node Cassandra cluster.
We have an application that uses a keyspace and creates a high read load on the disks. The problem has a cumulative effect: the more days we interact with the keyspace, the more the disk reads grow.
Reads go up to > 700 MB/s. Then the storage (SAN) begins to degrade, and then the Cassandra cluster degrades as well.
UPD 25.10.2021: "I worded it slightly wrong: the space is allocated from the SAN to a virtual machine, like a normal drive."
The only thing that helps is clearing the keyspace.
Output of the "tpstats" and "cfstats" commands:
[cassandra-01 ~]$ nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 1 1 1837888055 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 6789640 0 0
MutationStage 0 0 870873552 0 0
MemtableReclaimMemory 0 0 7402 0 0
PendingRangeCalculator 0 0 9 0 0
GossipStage 0 0 18939072 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 3 0 0
RequestResponseStage 0 0 1307861786 0 0
Native-Transport-Requests 0 0 2981687196 0 0
ReadRepairStage 0 0 346448 0 0
CounterMutationStage 0 0 0 0 0
MigrationStage 0 0 168 0 0
MemtablePostFlush 0 0 8193 0 0
PerDiskMemtableFlushWriter_0 0 0 7402 0 0
ValidationExecutor 0 0 21 0 0
Sampler 0 0 10988 0 0
MemtableFlushWriter 0 0 7402 0 0
InternalResponseStage 0 0 3404 0 0
ViewMutationStage 0 0 0 0 0
AntiEntropyStage 0 0 71 0 0
CacheCleanupExecutor 0 0 0 0 0
Message type Dropped
READ 7
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 5
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
[cassandra-01 ~]$ nodetool cfstats box_messages -H
Total number of tables: 73
----------------
Keyspace : box_messages
Read Count: 48847567
Read Latency: 0.055540737801741485 ms
Write Count: 69461300
Write Latency: 0.010656743870327794 ms
Pending Flushes: 0
Table: messages
SSTable count: 6
Space used (live): 3.84 GiB
Space used (total): 3.84 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 10.3 MiB
SSTable Compression Ratio: 0.23265712113582082
Number of partitions (estimate): 4156030
Memtable cell count: 929912
Memtable data size: 245.04 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 92
Local read count: 20511450
Local read latency: 0.106 ms
Local write count: 52111294
Local write latency: 0.013 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 57318
Bloom filter false ratio: 0.00841
Bloom filter space used: 6.56 MiB
Bloom filter off heap memory used: 6.56 MiB
Index summary off heap memory used: 1.78 MiB
Compression metadata off heap memory used: 1.95 MiB
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 17084
Compacted partition mean bytes: 3287
Average live cells per slice (last five minutes): 2.0796939751354797
Maximum live cells per slice (last five minutes): 10
Average tombstones per slice (last five minutes): 1.1939751354797576
Maximum tombstones per slice (last five minutes): 2
Dropped Mutations: 5 bytes
(I'm unable to comment and hence posting it as an answer)
As folks mentioned, a SAN is not going to be the best fit here, and one could read through the list of anti-patterns documented here, which could also apply to OSS C*.

Deleted data in Cassandra comes back, like a ghost

I have a 3-node Cassandra cluster (3.7), a keyspace
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
and a table
CREATE TABLE tradingdate (key text, tradingdate date, PRIMARY KEY (key, tradingdate));
One day I deleted one row like this:
delete from tradingdate
where key='tradingDay' and tradingdate='2018-12-31';
After that, the deleted row became a ghost. When I query:
select * from tradingdate
where key='tradingDay' and tradingdate>'2018-12-27' limit 2;
key | tradingdate
------------+-------------
tradingDay | 2018-12-28
tradingDay | 2019-01-02
select * from tradingdate
where key='tradingDay' and tradingdate<'2019-01-03'
order by tradingdate desc limit 2;
key | tradingdate
------------+-------------
tradingDay | 2019-01-02
tradingDay | 2018-12-31
So when I use order by, the deleted row (tradingDay, 2018-12-31) comes back.
I guess the row was deleted on only one node and still exists on another. So I executed:
nodetool repair demo tradingdate
on all 3 nodes, and then the deleted row disappeared completely.
So I want to know why I can see the ghost row when I use order by.
This is some good reading about deletes in Cassandra (and other distributed systems as well):
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
As well as:
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html
You will need to run/schedule a routine repair at least once within gc_grace_seconds (which defaults to ten days) to prevent deleted data from reappearing in your cluster.
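The repair window can be made concrete with a little arithmetic. The sketch below assumes the default gc_grace_seconds of 864000 (ten days); repair_is_overdue is a hypothetical helper for illustration, and the timestamp of the last repair has to come from your own scheduling records.

```python
from datetime import datetime, timedelta

# Default gc_grace_seconds: 10 days expressed in seconds.
DEFAULT_GC_GRACE_SECONDS = 864_000


def repair_is_overdue(last_repair: datetime, now: datetime,
                      gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """True if more than gc_grace_seconds have passed since the last repair."""
    return now - last_repair > timedelta(seconds=gc_grace_seconds)


now = datetime(2019, 1, 15)
print(repair_is_overdue(datetime(2019, 1, 10), now))  # repaired 5 days ago
print(repair_is_overdue(datetime(2019, 1, 1), now))   # repaired 14 days ago
```

If a table overrides gc_grace_seconds, pass that table's value instead of the default.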
Also, you should look for dropped messages in case one of your nodes is missing deletes (and other messages):
# nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 787032744 0 0
ReadStage 0 0 1627843193 0 0
RequestResponseStage 0 0 2257452312 0 0
ReadRepairStage 0 0 99910415 0 0
CounterMutationStage 0 0 0 0 0
HintedHandoff 0 0 1582 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 6649458 0 0
MemtableReclaimMemory 0 0 17987 0 0
PendingRangeCalculator 0 0 46 0 0
GossipStage 0 0 22766295 0 0
MigrationStage 0 0 8 0 0
MemtablePostFlush 0 0 127844 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 17851 0 0
InternalResponseStage 0 0 8669 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 631966060 0 19
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
MUTATION 0
COUNTER_MUTATION 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
Dropped messages indicate that there is something wrong.
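Checking the dropped-message section by eye gets tedious across several nodes, so it can be scripted. The following is a rough sketch under the assumption that tpstats prints the two-column "Message type / Dropped" layout shown in this thread (other Cassandra versions may print extra columns); dropped_messages is a hypothetical helper name.

```python
def dropped_messages(tpstats_output: str) -> dict[str, int]:
    """Parse the 'Message type / Dropped' section of nodetool tpstats output."""
    dropped = {}
    in_section = False
    for line in tpstats_output.splitlines():
        fields = line.split()
        if fields[:2] == ["Message", "type"]:
            # Everything after this header belongs to the dropped section.
            in_section = True
            continue
        if in_section and len(fields) == 2 and fields[1].isdigit():
            dropped[fields[0]] = int(fields[1])
    return dropped


# Shortened sample in the layout used in this thread.
SAMPLE_TPSTATS = """\
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 571599 0 0
GossipStage 0 0 185695 0 0
Message type Dropped
READ 0
HINT 0
MUTATION 124
READ_REPAIR 0
"""

nonzero = {name: count for name, count in dropped_messages(SAMPLE_TPSTATS).items() if count}
print(nonzero)
```

Any non-empty result is a reason to look more closely at that node, since the counters are cumulative since the node started.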

Native Transport Requests in Cassandra

I picked up some points about Native Transport Requests in Cassandra from this link: What are native transport requests in Cassandra?
As I understand it, any query I execute in Cassandra is a Native Transport Request.
I frequently get Request Timed Out errors in Cassandra, and I observed the following in the Cassandra debug log as well as in nodetool tpstats:
/var/log/cassandra# nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 186933949 0 0
ViewMutationStage 0 0 0 0 0
ReadStage 0 0 781880580 0 0
RequestResponseStage 0 0 5783147 0 0
ReadRepairStage 0 0 0 0 0
CounterMutationStage 0 0 14430168 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 366708 0 0
MemtableReclaimMemory 0 0 788 0 0
PendingRangeCalculator 0 0 1 0 0
GossipStage 0 0 0 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 0 0 0
MigrationStage 0 0 0 0 0
MemtablePostFlush 0 0 799 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 788 0 0
InternalResponseStage 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 477629331 0 1063468
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 0
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
1) What is the All time blocked state?
2) What does the value 1063468 denote? How harmful is it?
3) How can I tune this?
Each request is processed by the Native-Transport-Requests (NTR) stage before being handed off to the read/mutation stage, but the NTR thread still blocks while waiting for completion. To avoid being overloaded, the stage starts blocking tasks from being added to its queue, which applies back pressure to the client. Every time a request is blocked, the All time blocked counter is incremented. So 1063468 requests have at some point been blocked for some period of time because too many requests were backed up.
In situations where the app has spikes of queries, this blocking is unnecessary and can cause issues, so you can increase the queue limit with something like -Dcassandra.max_queued_native_transport_requests=4096 (default 128). You can also throttle requests on the client side, but I'd try increasing the queue size first.
There may also be some request that's exceptionally slow and is clogging up your system. If you have monitoring set up, look at the high-percentile read/write coordinator latencies. You can also use nodetool proxyhistograms. There may be something in your data model or queries that is causing issues.
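To put the All time blocked figure in proportion, it helps to compare it with the Completed counter of the same pool. The sketch below is an illustration only: pool_row and blocked_ratio are hypothetical helper names, and the parsing assumes the six-column pool layout shown in the question.

```python
from typing import Optional


def pool_row(tpstats_output: str, pool_name: str) -> Optional[dict[str, int]]:
    """Return the counters of one thread pool from nodetool tpstats output."""
    for line in tpstats_output.splitlines():
        fields = line.split()
        # Pool rows have a name followed by five integer counters.
        if len(fields) == 6 and fields[0] == pool_name:
            active, pending, completed, blocked, all_time_blocked = map(int, fields[1:])
            return {
                "active": active,
                "pending": pending,
                "completed": completed,
                "blocked": blocked,
                "all_time_blocked": all_time_blocked,
            }
    return None


def blocked_ratio(row: dict[str, int]) -> float:
    """Fraction of completed requests that were blocked at some point."""
    if row["completed"] == 0:
        return 0.0
    return row["all_time_blocked"] / row["completed"]


# The NTR line from the question.
SAMPLE_LINE = "Native-Transport-Requests 0 0 477629331 0 1063468"

ntr = pool_row(SAMPLE_LINE, "Native-Transport-Requests")
print(f"{blocked_ratio(ntr):.4%} of completed requests were blocked at some point")
```

For the figures in the question this comes out at roughly 0.2% of requests. Because both counters are cumulative, comparing two samples taken some time apart shows whether blocking is still happening now or only happened during a past spike.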

Cassandra 2.1 writes slow on a table with 1 TB of data

I am running some tests on a Cassandra cluster, and I now have a table with 1 TB of data per node. When I used YCSB to do more insert operations, I found the throughput was really low (about 10000 ops/sec) compared to an identical, new table in the same cluster (about 80000 ops/sec). While inserting, CPU usage was about 40%, with almost no disk usage.
I used nodetool tpstats to get task details. It showed:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 102 0 0
RequestResponseStage 0 0 41571733 0 0
MutationStage 384 21949 82375487 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 247100 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 6 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 16 16 4745 0 0
MemtableReclaimMemory 0 0 4745 0 0
PendingRangeCalculator 0 0 4 0 0
MemtablePostFlush 1 163 9394 0 0
CompactionExecutor 8 29 13713 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 2 2 5 0 0
I found there was a large number of pending tasks in MutationStage and MemtablePostFlush.
I have read some related articles about Cassandra write limitations, but found no useful information. I want to know why there is such a huge difference in Cassandra throughput between two tables that are identical except for data size.
In addition, I use SSDs on my servers. However, this phenomenon also occurs in another cluster using HDDs.
While Cassandra was running, I found that both %user and %nice CPU utilization were about 10% while only compaction tasks were running, with a compaction throughput of about 80 MB/s, even though I have set the nice value to 0 for my Cassandra process.
Wild guess: your system is busy compacting SSTables.
Check it out with nodetool compactionstats.
BTW, YCSB does not use prepared statements, which makes it a bad estimator for actual application load.
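To see at a glance which stages are backing up, the tpstats output from the question can be sorted by its Pending column. This is a minimal sketch assuming the six-column pool layout shown above; pools_with_backlog is a hypothetical helper, not a Cassandra tool.

```python
def pools_with_backlog(tpstats_output: str) -> list[tuple[str, int]]:
    """Return (pool, pending) pairs with pending > 0, largest backlog first."""
    backlog = []
    for line in tpstats_output.splitlines():
        fields = line.split()
        # Pool rows: name, Active, Pending, Completed, Blocked, All time blocked.
        if len(fields) == 6 and all(f.isdigit() for f in fields[1:]):
            pending = int(fields[2])
            if pending > 0:
                backlog.append((fields[0], pending))
    return sorted(backlog, key=lambda item: item[1], reverse=True)


# A few rows copied from the question.
SAMPLE_POOLS = """\
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 102 0 0
MutationStage 384 21949 82375487 0 0
MemtableFlushWriter 16 16 4745 0 0
MemtablePostFlush 1 163 9394 0 0
CompactionExecutor 8 29 13713 0 0
HintedHandoff 2 2 5 0 0
"""

for pool, pending in pools_with_backlog(SAMPLE_POOLS):
    print(f"{pool}: {pending} pending")
```

Pending flush and compaction tasks alongside a large MutationStage queue are consistent with the compaction guess above, though the counters alone do not prove it.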

Error in cqlsh command line while querying

I have a three-node Cassandra cluster running perfectly fine. When I run the query select count(*) from usertracking; on one of the nodes of my cluster, I get the following error:
errors={}, last_host=localhost
Statement trace did not complete within 10 seconds
However, it works fine on the other two nodes of the cluster. Can anyone tell me why I am getting this error on only one node, and what the reason for the error is?
As suggested in https://stackoverflow.com/questions/27766976/cassandra-cqlsh-query-fails-with-no-error, I have also increased the timeout parameters read_request_timeout_in_ms and range_request_timeout_in_ms in cassandra.yaml, but that didn't help.
Keyspace definition:
CREATE KEYSPACE cw WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
Table definition:
CREATE TABLE usertracking (
cwc text,
cur_visit_id text,
cur_visit_datetime timestamp,
cur_visit_last_ts bigint,
prev_visit_datetime timestamp,
prev_visit_last_ts bigint,
tot_page_view bigint,
tot_time_spent bigint,
tot_visit_count bigint,
PRIMARY KEY (cwc)
);
Output of nodetool status:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns  Host ID                               Rack
UN  192.168.1.200  146.06 MB  1       ?     92c5bd4a-8f2b-4d7b-b420-6261a1bb8648  rack1
UN  192.168.1.201  138.53 MB  1       ?     817d331b-4cc0-4770-be6d-7896fc00e82f  rack1
UN  192.168.1.202  155.04 MB  1       ?     351731fb-c3ad-45e0-b2c8-bc1f3b1bf25d  rack1
Output of nodetool tpstats:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 25 0 0
RequestResponseStage 0 0 257103 0 0
MutationStage 0 0 593226 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 612335 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 87 0 0
MemtableReclaimMemory 0 0 87 0 0
PendingRangeCalculator 0 0 3 0 0
MemtablePostFlush 0 0 2829 0 0
CompactionExecutor 0 0 216 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 2 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
Not sure if this helps. I have a very similar configuration in my development environment and was getting OperationTimedOut errors when running count operations.
Like you, I originally tried adjusting the various timeout settings in cassandra.yaml, but these appeared to make no difference.
In the end, the timeout being exceeded was actually that of the cqlsh client itself. When I updated/created the ~/.cassandra/cqlshrc file with the following, I was able to run the count without failure.
[connection]
client_timeout = 20
This example sets the client timeout to 20 seconds.
There is some information about the cqlshrc file in the following article: CQL Configuration File
Hopefully this helps; sorry if I'm barking up the wrong tree.
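If the same setting has to be applied on several machines, the cqlshrc edit above can be scripted. This is a sketch only: set_cqlsh_client_timeout is a hypothetical helper, it uses the client_timeout option exactly as given in this answer (the option name may differ in other cqlsh versions, so check the documentation for yours), and rewriting the file with configparser drops any comments it contained.

```python
import configparser
import tempfile
from pathlib import Path


def set_cqlsh_client_timeout(cqlshrc: Path, seconds: int) -> None:
    """Create or update the [connection] client_timeout entry in a cqlshrc file."""
    config = configparser.ConfigParser()
    config.read(cqlshrc)  # a missing file is simply ignored
    if not config.has_section("connection"):
        config.add_section("connection")
    config["connection"]["client_timeout"] = str(seconds)
    cqlshrc.parent.mkdir(parents=True, exist_ok=True)
    with cqlshrc.open("w") as handle:
        config.write(handle)


if __name__ == "__main__":
    # Demonstrate against a throwaway directory rather than the real home directory.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / ".cassandra" / "cqlshrc"
        set_cqlsh_client_timeout(target, 20)
        print(target.read_text())
```

To apply it for real, pass Path.home() / ".cassandra" / "cqlshrc" as the target, ideally after backing up the existing file.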
