Decommission a Cassandra node

I just decommissioned one of the nodes in a 3-node Cassandra cluster (all running Cassandra 3.3). One of the remaining nodes is constantly printing:
DEBUG [GossipTasks:1] 2016-05-29 15:30:16,770 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:17,770 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:18,771 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:19,771 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:20,771 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
in the logs, once a second, for half a day or so. Any idea why? What does this mean?
Thanks
Edit:
I noticed that both remaining nodes have been printing this message in their logs for the past 48 hours!

It's normal to see this message for up to 72 hours after the node is decommissioned.
More details here:
https://issues.apache.org/jira/browse/CASSANDRA-10371
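If the messages keep appearing well past that 72-hour window, here is a minimal sketch of how you could check what gossip still remembers about the removed node (the IP is the one from this question; nodetool assassinate is a last resort and should only be run against a node that has genuinely left the cluster):
# Confirm the decommissioned node no longer appears in the ring
nodetool status
# Inspect the gossip state kept for the old endpoint (it should show STATUS LEFT with an expiry)
nodetool gossipinfo | grep -A 10 '10.80.64.33'
# Last resort, only if the LEFT entry never expires on its own:
# nodetool assassinate 10.80.64.33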


Anticompaction phase post nodetool repair in cassandra

After running repair on version 3.11.2, I am getting the below message in debug.log:
mc-50-big-Data.db fully contained in range (-9223372036854775808,-9223372036854775808], mutating repairedAt instead of anticompacting
Why is this SSTable fully contained in the range (-9223372036854775808,-9223372036854775808], despite having multiple tokens/keys as shown in the output below?
Keys found in this SSTable:
sstabledump demo/msisdn-e59722f0d1e711ebb52c1524f01c1145/mc-50-big-Data.db | grep key
"key" : [ "1" ],
"key" : [ "2" ],
"key" : [ "22" ],
"key" : [ "833" ],
"key" : [ "3232" ],
"key" : [ "98" ],
"key" : [ "900" ],
"key" : [ "173" ],
Different tokens found in this SSTable:
account_id | system.token(account_id)
------------+--------------------------
1 | -4069959284402364209
2 | -3248873570005575792
22 | -1117083337304738213
833 | -1083053322882870066
3232 | -1016771166277942908
98 | -463622059452620815
900 | -300805731578844817
173 | 298622069266553728
I executed the below command:
nodetool repair -full -seq demo msisdn
For older versions of C*, here is an example of what anticompaction does to C* SSTables during repair.
Older Cassandra versions (the example below uses C* 2.1) actually performed anticompaction, where one SSTable is split into two parts:
repaired
unrepaired
Below is the example.
Current "Repaired at" values of the SSTables:
client:~/css/apache-cassandra-2.1.23/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d$ for i in `ls *Data.db` ; do echo $i ; ~/css/apache-cassandra-2.1.23/tools/bin/sstablemetadata $i | grep Repai ; done | grep -v commitlog
demo-msisdn-ka-1-Data.db
Repaired at: 0
demo-msisdn-ka-2-Data.db
Repaired at: 0
client:~/css/apache-cassandra-2.1.23/bin$ ./nodetool repair -st -5196837186409114737 -et -178801028445334456 -par -inc
[2021-06-25 19:55:04,270] Nothing to repair for keyspace 'system'
[2021-06-25 19:55:04,293] Starting repair command #6, repairing 4 ranges for keyspace system_traces (parallelism=PARALLEL, full=false)
[2021-06-25 19:55:04,395] Repair session 3bdd7000-d5ef-11eb-966c-d7c730160a5d for range (-5196837186409114737,-4923763406927773451] finished
[2021-06-25 19:55:04,395] Repair session 3be05630-d5ef-11eb-966c-d7c730160a5d for range (-4923763406927773451,-2187651444700558944] finished
[2021-06-25 19:55:04,396] Repair session 3be38a80-d5ef-11eb-966c-d7c730160a5d for range (-2187651444700558944,-574543093143405237] finished
[2021-06-25 19:55:04,401] Repair session 3be62290-d5ef-11eb-966c-d7c730160a5d for range (-574543093143405237,-178801028445334456] finished
[2021-06-25 19:55:04,421] Repair command #6 finished
[2021-06-25 19:55:04,437] Starting repair command #7, repairing 4 ranges for keyspace demo (parallelism=PARALLEL, full=false)
[2021-06-25 19:55:04,504] Repair session 3bf0f800-d5ef-11eb-966c-d7c730160a5d for range (-5196837186409114737,-4923763406927773451] finished
[2021-06-25 19:55:04,504] Repair session 3bf1e260-d5ef-11eb-966c-d7c730160a5d for range (-4923763406927773451,-2187651444700558944] finished
[2021-06-25 19:55:04,507] Repair session 3bf64f30-d5ef-11eb-966c-d7c730160a5d for range (-2187651444700558944,-574543093143405237] finished
[2021-06-25 19:55:04,514] Repair session 3bf760a0-d5ef-11eb-966c-d7c730160a5d for range (-574543093143405237,-178801028445334456] finished
[2021-06-25 19:55:04,753] Repair command #7 finished
After repair, "Repaired at" values of the SSTables:
client:~/css/apache-cassandra-2.1.23/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d$ for i in `ls *Data.db` ; do echo $i ; ~/css/apache-cassandra-2.1.23/tools/bin/sstablemetadata $i | grep Repai ; done | grep -v commitlog
demo-msisdn-ka-3-Data.db
Repaired at: 1624650904440
demo-msisdn-ka-4-Data.db
Repaired at: 0
demo-msisdn-ka-5-Data.db
Repaired at: 1624650904440
demo-msisdn-ka-6-Data.db
Repaired at: 0
---- LOG -----
INFO [MemtableFlushWriter:7] 2021-06-25 19:52:19,665 Memtable.java:382 - Completed flushing /home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-tmp-ka-1-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1624646218285, position=185893)
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,713 Validator.java:257 - [repair #d9ad0620-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,783 Validator.java:257 - [repair #d9c485c0-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,845 Validator.java:257 - [repair #d9d21a50-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,861 Validator.java:257 - [repair #d9d54ea0-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,330 Validator.java:257 - [repair #3bdd7000-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,337 Validator.java:257 - [repair #3bdd7000-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,349 Validator.java:257 - [repair #3be05630-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,355 Validator.java:257 - [repair #3be05630-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,372 Validator.java:257 - [repair #3be38a80-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,374 Validator.java:257 - [repair #3be38a80-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,388 Validator.java:257 - [repair #3be62290-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,391 Validator.java:257 - [repair #3be62290-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,413 CompactionManager.java:496 - Starting anticompaction for system_traces.events on 0/0 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,413 CompactionManager.java:561 - Completed anticompaction successfully
INFO [CompactionExecutor:45] 2021-06-25 19:55:04,414 CompactionManager.java:496 - Starting anticompaction for system_traces.sessions on 0/0 sstables
INFO [CompactionExecutor:45] 2021-06-25 19:55:04,414 CompactionManager.java:561 - Completed anticompaction successfully
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,453 Validator.java:257 - [repair #3bf0f800-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,472 Validator.java:257 - [repair #3bf1e260-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,485 Validator.java:257 - [repair #3bf64f30-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,496 Validator.java:257 - [repair #3bf760a0-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,516 CompactionManager.java:496 - Starting anticompaction for demo.msisdn on 1/1 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,516 CompactionManager.java:537 - SSTable SSTableReader(path='/home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-ka-1-Data.db') ((-7133164915313410844,6369609434230030255]) will be anticompacted on range (-5196837186409114737,-178801028445334456]
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,517 CompactionManager.java:1125 - Performing anticompaction on 1 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,517 CompactionManager.java:1137 - Anticompacting SSTableReader(path='/home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-ka-1-Data.db')
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,670 CompactionManager.java:1197 - Anticompaction completed successfully, anticompacted from 1 to 2 sstable(s).
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,678 CompactionManager.java:561 - Completed anticompaction successfully
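For reference, a slightly tidied version of the loop used above to inspect the repaired state of every SSTable in a table directory (the paths are the ones from this example and will differ on your install):
# Run from the table's data directory; prints each Data.db file and its "Repaired at" value
cd ~/css/apache-cassandra-2.1.23/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d
for f in *Data.db; do
  echo "$f"
  ~/css/apache-cassandra-2.1.23/tools/bin/sstablemetadata "$f" | grep "Repaired at"
done
A "Repaired at: 0" means the SSTable is still marked unrepaired; a non-zero value is the timestamp of the repair that marked it repaired.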

Azure AKS Scale out

I have an AKS cluster with 3 nodes. I tried to manually scale out from 3 to 4 nodes, and the scale-out itself went fine.
After ~20 minutes, all 4 nodes were in NotReady status and the kube-system pods were no longer Ready.
NAME STATUS ROLES AGE VERSION
aks-agentpool-40760006-vmss000000 Ready agent 16m v1.18.14
aks-agentpool-40760006-vmss000001 Ready agent 17m v1.18.14
aks-agentpool-40760006-vmss000002 Ready agent 16m v1.18.14
aks-agentpool-40760006-vmss000003 Ready agent 11m v1.18.14
NAME STATUS ROLES AGE VERSION
aks-agentpool-40760006-vmss000000 NotReady agent 23m v1.18.14
aks-agentpool-40760006-vmss000002 NotReady agent 24m v1.18.14
aks-agentpool-40760006-vmss000003 NotReady agent 19m v1.18.14
k get po -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-748cdb7bf4-7frq2 0/1 Pending 0 10m
coredns-748cdb7bf4-vg5nn 0/1 Pending 0 10m
coredns-748cdb7bf4-wrhxs 1/1 Terminating 0 28m
coredns-autoscaler-868b684fd4-2gb8f 0/1 Pending 0 10m
kube-proxy-p6wmv 1/1 Running 0 28m
kube-proxy-sksz6 1/1 Running 0 23m
kube-proxy-vpb2g 1/1 Running 0 28m
metrics-server-58fdc875d5-sbckj 0/1 Pending 0 10m
tunnelfront-5d74798f6b-w6rvn 0/1 Pending 0 10m
The node events show:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 25m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 25m kubelet Updated Node Allocatable limit across pods
Normal Starting 25m kube-proxy Starting kube-proxy.
Normal NodeReady 24m kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeReady
Warning FailedToCreateRoute 5m5s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 50.264754ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m55s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 45.945658ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m45s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.180158ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m35s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.550858ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m25s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 44.74355ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m15s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 42.428456ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m5s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 41.664858ms: timed out waiting for the condition
Warning FailedToCreateRoute 3m55s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 48.456954ms: timed out waiting for the condition
Warning FailedToCreateRoute 3m45s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 38.611964ms: timed out waiting for the condition
Warning FailedToCreateRoute 65s (x16 over 3m35s) route_controller (combined from similar events): Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 13.972487ms: timed out waiting for the condition
You can use the cluster autoscaler option to avoid such situations in the future.
To keep up with application demands in Azure Kubernetes Service (AKS),
you may need to adjust the number of nodes that run your workloads.
The cluster autoscaler component can watch for pods in your cluster
that can't be scheduled because of resource constraints. When issues
are detected, the number of nodes in a node pool is increased to meet
the application demand. Nodes are also regularly checked for a lack of
running pods, with the number of nodes then decreased as needed. This
ability to automatically scale up or down the number of nodes in your
AKS cluster lets you run an efficient, cost-effective cluster.
You can update an existing AKS cluster to enable the cluster autoscaler, keeping your current resource group.
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
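If the autoscaler is already enabled and you later just want to change its bounds, there is (as far as I know) a matching update flag; the resource group and cluster names below are the same placeholders as above:
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--update-cluster-autoscaler \
--min-count 1 \
--max-count 5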
It seems to be OK now. It turned out I was lacking the rights to scale out the nodes.

Cassandra node running, but can't connect?

Using Cassandra version 3.11.8, openjdk-8u242-b08.
Prior to this crashing, I was altering a table with 50k+ columns, so this might be (is) a factor in all of this. I would ideally rather lose the data in the commit (if it is still perpetually inserting a backlog) so that I can connect to the hosts and service can be resumed.
Before the errors started, I ran ALTER TABLE commands adding many columns to the table, in batches of 1000 at a time. Eventually, after it may have done about half of them, I received this error for all the nodes.
2020-09-10 15:34:29 WARNING [control connection] Error connecting to 127.0.0.3:9042:
Traceback (most recent call last):
  File "cassandra\cluster.py", line 3522, in cassandra.cluster.ControlConnection._reconnect_internal
  File "cassandra\cluster.py", line 3591, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra\cluster.py", line 3588, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra\cluster.py", line 3690, in cassandra.cluster.ControlConnection._refresh_schema
  File "cassandra\metadata.py", line 142, in cassandra.metadata.Metadata.refresh
  File "cassandra\metadata.py", line 165, in cassandra.metadata.Metadata._rebuild_all
  File "cassandra\metadata.py", line 2522, in get_all_keyspaces
  File "cassandra\metadata.py", line 2031, in get_all_keyspaces
  File "cassandra\metadata.py", line 2719, in cassandra.metadata.SchemaParserV3._query_all
  File "cassandra\connection.py", line 985, in cassandra.connection.Connection.wait_for_responses
  File "cassandra\connection.py", line 983, in cassandra.connection.Connection.wait_for_responses
  File "cassandra\connection.py", line 1435, in cassandra.connection.ResponseWaiter.deliver
cassandra.OperationTimedOut: errors=None, last_host=None
I am running 8 nodes on one server. I have restarted all the nodes and the handshakes complete, but I cannot connect to my cluster on any of the nodes. My system.log and debug.log show similar entries throughout once Cassandra starts running. gc.log has not updated in some time, which makes me wonder what is going on. An interesting point is that I only retrieve the list of columns in the table 3 times in total; I have run this code on my desktop without issues using 2 nodes (much, much fewer resources) and have not received any of these issues.
Edit: just for clarity, my application/connections are not running, and the logs below are what is happening periodically. I tried looking at the scheduled tasks and cannot find information about Cassandra for this. I wonder what backlog it is reading from and whether I can stop it. Ideally, I would like to stop this backlog of operations from happening...
-------SYSTEM.LOG-------
INFO [GossipStage:1] 2020-09-10 17:38:52,376 StorageService.java:2400 - Node /127.0.0.9 state jump to NORMAL
WARN [OptionalTasks:1] 2020-09-10 17:38:54,802 CassandraRoleManager.java:377 - CassandraRoleManager skipped default role setup: some nodes were not ready
INFO [OptionalTasks:1] 2020-09-10 17:38:54,802 CassandraRoleManager.java:416 - Setup task failed with error, rescheduling
INFO [HANDSHAKE-/127.0.0.4] 2020-09-10 17:38:56,965 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.4
INFO [HANDSHAKE-/127.0.0.4] 2020-09-10 17:38:58,262 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,102 Gossiper.java:1139 - Node /127.0.0.4 has restarted, now UP
INFO [GossipStage:1] 2020-09-10 17:38:59,103 TokenMetadata.java:497 - Updating topology for /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,103 TokenMetadata.java:497 - Updating topology for /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,105 Gossiper.java:1103 - InetAddress /127.0.0.4 is now UP
INFO [HANDSHAKE-/127.0.0.5] 2020-09-10 17:38:59,813 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:00,104 StorageService.java:2400 - Node /127.0.0.4 state jump to NORMAL
INFO [HANDSHAKE-/127.0.0.5] 2020-09-10 17:39:01,029 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,266 Gossiper.java:1139 - Node /127.0.0.5 has restarted, now UP
INFO [GossipStage:1] 2020-09-10 17:39:01,267 TokenMetadata.java:497 - Updating topology for /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,267 TokenMetadata.java:497 - Updating topology for /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,270 Gossiper.java:1103 - InetAddress /127.0.0.5 is now UP
INFO [GossipStage:1] 2020-09-10 17:39:04,271 StorageService.java:2400 - Node /127.0.0.5 state jump to NORMAL
INFO [ScheduledTasks:1] 2020-09-10 17:43:05,805 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:48:40,892 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:54:35,999 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:59:36,083 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [Service Thread] 2020-09-10 18:00:24,722 GCInspector.java:285 - ParNew GC in 237ms. CMS Old Gen: 717168160 -> 887151520; Par Eden Space: 1718091776 -> 0; Par Survivor Space: 12757512 -> 214695936
INFO [ScheduledTasks:1] 2020-09-10 18:04:56,160 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
------DEBUG.LOG------
INFO [Service Thread] 2020-09-10 18:00:24,722 GCInspector.java:285 - ParNew GC in 237ms. CMS Old Gen: 717168160 -> 887151520; Par Eden Space: 1718091776 -> 0; Par Survivor Space: 12757512 -> 214695936
DEBUG [ScheduledTasks:1] 2020-09-10 18:00:26,102 MonitoringTask.java:173 - 1 operations were slow in the last 4996 msecs:
<SELECT * FROM system_schema.columns>, was slow 2 times: avg/min/max 1256/1232/1281 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:00:56,110 MonitoringTask.java:173 - 1 operations were slow in the last 5007 msecs:
<SELECT * FROM system_schema.columns>, time 795 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:01:01,111 MonitoringTask.java:173 - 1 operations were slow in the last 5003 msecs:
<SELECT * FROM system_schema.columns>, time 808 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:03:41,143 MonitoringTask.java:173 - 1 operations were slow in the last 5002 msecs:
<SELECT * FROM system_schema.columns>, time 853 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:06,148 MonitoringTask.java:173 - 1 operations were slow in the last 4996 msecs:
<SELECT * FROM system_schema.columns>, time 772 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:26,153 MonitoringTask.java:173 - 1 operations were slow in the last 4991 msecs:
<SELECT * FROM system_schema.columns>, time 838 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:31,154 MonitoringTask.java:173 - 1 operations were slow in the last 5009 msecs:
<SELECT * FROM system_schema.columns>, time 841 msec - slow timeout 500 msec
INFO [ScheduledTasks:1] 2020-09-10 18:04:56,160 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:56,160 MonitoringTask.java:173 - 1 operations were slow in the last 5004 msecs:
<SELECT * FROM system_schema.columns>, time 772 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:11,165 MonitoringTask.java:173 - 1 operations were slow in the last 4994 msecs:
<SELECT * FROM system_schema.columns>, time 808 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:31,171 MonitoringTask.java:173 - 1 operations were slow in the last 5004 msecs:
<SELECT * FROM system_schema.columns>, time 834 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:56,176 MonitoringTask.java:173 - 1 operations were slow in the last 5010 msecs:
<SELECT * FROM system_schema.columns>, was slow 2 times: avg/min/max 847/837/857 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:07:16,196 MonitoringTask.java:173 - 1 operations were slow in the last 5003 msecs:
<SELECT * FROM system_schema.columns>, time 827 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:07:31,200 MonitoringTask.java:173 - 1 operations were slow in the last 5007 msecs:
<SELECT * FROM system_schema.columns>, time 834 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:01,207 MonitoringTask.java:173 - 1 operations were slow in the last 5000 msecs:
<SELECT * FROM system_schema.columns>, time 799 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:16,211 MonitoringTask.java:173 - 1 operations were slow in the last 4999 msecs:
<SELECT * FROM system_schema.columns>, time 780 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:36,217 MonitoringTask.java:173 - 1 operations were slow in the last 5000 msecs:
<SELECT * FROM system_schema.columns>, time 835 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:09:01,221 MonitoringTask.java:173 - 1 operations were slow in the last 5002 msecs:
<SELECT * FROM system_schema.columns>, time 832 msec - slow timeout 500 msec
INFO [ScheduledTasks:1] 2020-09-10 18:09:56,231 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2020-09-10 18:09:56,231 MonitoringTask.java:173 - 1 operations were slow in the last 4995 msecs:
<SELECT * FROM system_schema.columns>, time 778 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:10:06,233 MonitoringTask.java:173 - 1 operations were slow in the last 5009 msecs:
<SELECT * FROM system_schema.columns>, time 1099 msec - slow timeout 500 msec
The timeout is from the driver trying to parse the schema while establishing the control connection.
The driver uses the control connection for admin tasks such as discovering the cluster's topology and schema during the initialisation phase. I've discussed it in a bit more detail in this post -- https://community.datastax.com/questions/7702/.
In your case, the driver initialisation times out while parsing the thousands of columns in the table you mentioned. I have to admit that this is new to me. I've never worked with a cluster that had thousands of columns, so I'm curious to know what your use case is; perhaps there is a better data model for it.
As a workaround, you can try bumping up the default timeout to see if the driver is eventually able to initialise. However, this is going to be a band-aid solution, since the driver needs to re-parse the schema every time a DDL statement takes place. Cheers!
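For example, here is a minimal sketch of how you could first check whether a longer timeout lets a client initialise at all, using cqlsh from the same host (the timeout values are illustrative; in the Python driver the equivalent knobs are the connect_timeout and control_connection_timeout arguments to Cluster):
# cqlsh takes its timeouts in seconds; bump both and run a cheap query against the schema tables
cqlsh 127.0.0.3 9042 --connect-timeout=60 --request-timeout=120 \
-e "SELECT count(*) FROM system_schema.columns;"
If that succeeds, raising the corresponding timeouts in your application driver should at least let it connect while you rework the schema.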

Cassandra issue while adding jmx_prometheus

I want to add Cassandra monitoring using Prometheus (ref: https://blog.pythian.com/step-step-monitoring-cassandra-prometheus-grafana/).
When I add the following to /etc/cassandra/cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7070:/opt/jmx_prometheus/cassandra.yml"
I get an error:
ubuntu#ip-172-21-0-111:~$ sudo service cassandra status
● cassandra.service - LSB: distributed storage system for structured data
Loaded: loaded (/etc/init.d/cassandra; bad; vendor preset: enabled)
Active: active (exited) since Mon 2020-04-13 05:43:38 UTC; 3s ago
Docs: man:systemd-sysv-generator(8)
Process: 3557 ExecStop=/etc/init.d/cassandra stop (code=exited, status=0/SUCCESS)
Process: 3570 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Starting LSB: distributed storage system for structured data...
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Started LSB: distributed storage system for structured data.
ubuntu#ip-172-21-0-111:~$ nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
When I remove the jmx_prometheus entry, it works:
ubuntu#ip-172-21-0-111:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.21.0.111 1.83 GiB 128 100.0% b52324d0-c57f-46e3-bc10-a6dc07bae17a rack1
ubuntu#ip-172-21-0-111:~$ tail -f /var/log/cassandra/system.log
INFO [main] 2020-04-13 05:37:36,609 StorageService.java:2169 - Node /172.21.0.111 state jump to NORMAL
INFO [main] 2020-04-13 05:37:36,617 CassandraDaemon.java:673 - Waiting for gossip to settle before accepting client requests...
INFO [main] 2020-04-13 05:37:44,621 CassandraDaemon.java:704 - No gossip backlog; proceeding
INFO [main] 2020-04-13 05:37:44,713 NativeTransportService.java:70 - Netty using native Epoll event loop
INFO [main] 2020-04-13 05:37:44,773 Server.java:161 - Using Netty Version: [netty-buffer=netty-buffer-4.0.36.Final.e8fa848, netty-codec=netty-codec-4.0.36.Final.e8fa848, netty-codec-haproxy=netty-codec-haproxy-4.0.36.Final.e8fa848, netty-codec-http=netty-codec-http-4.0.36.Final.e8fa848, netty-codec-socks=netty-codec-socks-4.0.36.Final.e8fa848, netty-common=netty-common-4.0.36.Final.e8fa848, netty-handler=netty-handler-4.0.36.Final.e8fa848, netty-tcnative=netty-tcnative-1.1.33.Fork15.906a8ca, netty-transport=netty-transport-4.0.36.Final.e8fa848, netty-transport-native-epoll=netty-transport-native-epoll-4.0.36.Final.e8fa848, netty-transport-rxtx=netty-transport-rxtx-4.0.36.Final.e8fa848, netty-transport-sctp=netty-transport-sctp-4.0.36.Final.e8fa848, netty-transport-udt=netty-transport-udt-4.0.36.Final.e8fa848]
INFO [main] 2020-04-13 05:37:44,773 Server.java:162 - Starting listening for CQL clients on /172.21.0.111:9042 (unencrypted)...
INFO [main] 2020-04-13 05:37:44,811 CassandraDaemon.java:505 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO [SharedPool-Worker-1] 2020-04-13 05:37:46,625 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
INFO [OptionalTasks:1] 2020-04-13 05:37:46,752 CassandraRoleManager.java:339 - Created default superuser role 'cassandra'
It worked! I changed the port from 7070 to 7071:
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7071:/opt/jmx_prometheus/cassandra.yml"
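In case it helps anyone else hitting the same symptom, a minimal sketch of how you might confirm that the agent port is the problem before picking a new one (the port numbers are the ones from this question; ss and curl are assumed to be available on the host):
# See whether something is already listening on the port you intended to give the exporter
sudo ss -tlnp | grep ':7070'
# After switching the agent to a free port (7071 here) and restarting Cassandra,
# the exporter should answer with Prometheus metrics:
curl -s http://localhost:7071/metrics | head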

ConfigurationException while launching Apache Cassandra DB: This node was decommissioned and will not rejoin the ring

This is a snippet from the system log while shutting down:
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:28:50,995 StorageService.java:3788 - Announcing that I have left the ring for 30000ms
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,995 ThriftServer.java:142 - Stop listening to thrift clients
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Server.java:182 - Stop listening for CQL clients
WARN [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 MessagingService.java:786 - Waiting for messaging service to quiesce
INFO [ACCEPT-sysengplayl0127.bio-iad.ea.com/10.72.194.229] 2016-07-27 22:29:20,998 MessagingService.java:1133 - MessagingService has terminated the accept() thread
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:21,022 StorageService.java:1411 - DECOMMISSIONED
INFO [main] 2016-07-27 22:32:17,534 YamlConfigurationLoader.java:89 - Configuration location: file:/opt/cassandra/product/apache-cassandra-3.7/conf/cassandra.yaml
And then while starting up:
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:630 - Cassandra version: 3.7
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:631 - Thrift API version: 20.1.0
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:632 - CQL supported versions: 3.4.2 (default: 3.4.2)
INFO [main] 2016-07-27 22:32:20,351 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 397 MB and a resize interval of 60 minutes
ERROR [main] 2016-07-27 22:32:20,357 CassandraDaemon.java:731 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: This node was decommissioned and will not rejoin the ring unless cassandra.override_decommission=true has been set, or all existing data is removed and the node is bootstrapped again
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:815) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:725) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:370) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:585) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:714) [apache-cassandra-3.7.jar:3.7]
WARN [StorageServiceShutdownHook] 2016-07-27 22:32:20,358 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [StorageServiceShutdownHook] 2016-07-27 22:32:20,359 MessagingService.java:786 - Waiting for messaging service to quiesce
Is there something wrong with the configuration?
I had faced the same issue.
Posting the answer so that it might help others.
As the log suggests, the property "cassandra.override_decommission" needs to be set.
Start Cassandra with the syntax:
cassandra -Dcassandra.override_decommission=true
This should add the node back to the cluster.
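If the node is started through the service manager rather than by running cassandra by hand (as in a package install), an equivalent approach, under that assumption, is to pass the same flag via JVM_OPTS in cassandra-env.sh and remove it again once the node has rejoined the ring:
# /etc/cassandra/cassandra-env.sh (path may differ per install); remove this line after the node is back in the ring
JVM_OPTS="$JVM_OPTS -Dcassandra.override_decommission=true"
# then restart the service, e.g.:
sudo service cassandra restart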
