After running a repair on version 3.11.2, I am getting the message below in debug.log:
mc-50-big-Data.db fully contained in range (-9223372036854775808,-9223372036854775808], mutating repairedAt instead of anticompacting
Why is this SSTable fully contained in the range (-9223372036854775808,-9223372036854775808], despite having multiple tokens/keys as shown in the output below?
Keys found in this SSTable:
sstabledump demo/msisdn-e59722f0d1e711ebb52c1524f01c1145/mc-50-big-Data.db | grep key
"key" : [ "1" ],
"key" : [ "2" ],
"key" : [ "22" ],
"key" : [ "833" ],
"key" : [ "3232" ],
"key" : [ "98" ],
"key" : [ "900" ],
"key" : [ "173" ],
Different tokens found in this SSTable:
account_id | system.token(account_id)
------------+--------------------------
1 | -4069959284402364209
2 | -3248873570005575792
22 | -1117083337304738213
833 | -1083053322882870066
3232 | -1016771166277942908
98 | -463622059452620815
900 | -300805731578844817
173 | 298622069266553728
I executed the command below:
nodetool repair -full -seq demo msisdn
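As a side note, the repairedAt metadata that the debug message refers to can be inspected with sstablemetadata; a sketch, reusing the illustrative path from the sstabledump call above:
tools/bin/sstablemetadata demo/msisdn-e59722f0d1e711ebb52c1524f01c1145/mc-50-big-Data.db | grep "Repaired at"
After a successful full repair this value should change from 0 to the repair timestamp, without the SSTable being rewritten into two files.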
Here is an example of what anticompaction does to Cassandra SSTables during repair. Cassandra 2.1 and later perform anticompaction as part of incremental repair, where one SSTable is split into two parts:
repaired
unrepaired
Below is an example, run on C* 2.1.
'Repaired at' of the SSTables before the repair:
client:~/css/apache-cassandra-2.1.23/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d$ for i in $(ls *Data.db) ; do echo $i ; ~/css/apache-cassandra-2.1.23/tools/bin/sstablemetadata $i | grep Repai ; done | grep -v commitlog
demo-msisdn-ka-1-Data.db
Repaired at: 0
demo-msisdn-ka-2-Data.db
Repaired at: 0
client:~/css/apache-cassandra-2.1.23/bin$ ./nodetool repair -st -5196837186409114737 -et -178801028445334456 -par -inc
[2021-06-25 19:55:04,270] Nothing to repair for keyspace 'system'
[2021-06-25 19:55:04,293] Starting repair command #6, repairing 4 ranges for keyspace system_traces (parallelism=PARALLEL, full=false)
[2021-06-25 19:55:04,395] Repair session 3bdd7000-d5ef-11eb-966c-d7c730160a5d for range (-5196837186409114737,-4923763406927773451] finished
[2021-06-25 19:55:04,395] Repair session 3be05630-d5ef-11eb-966c-d7c730160a5d for range (-4923763406927773451,-2187651444700558944] finished
[2021-06-25 19:55:04,396] Repair session 3be38a80-d5ef-11eb-966c-d7c730160a5d for range (-2187651444700558944,-574543093143405237] finished
[2021-06-25 19:55:04,401] Repair session 3be62290-d5ef-11eb-966c-d7c730160a5d for range (-574543093143405237,-178801028445334456] finished
[2021-06-25 19:55:04,421] Repair command #6 finished
[2021-06-25 19:55:04,437] Starting repair command #7, repairing 4 ranges for keyspace demo (parallelism=PARALLEL, full=false)
[2021-06-25 19:55:04,504] Repair session 3bf0f800-d5ef-11eb-966c-d7c730160a5d for range (-5196837186409114737,-4923763406927773451] finished
[2021-06-25 19:55:04,504] Repair session 3bf1e260-d5ef-11eb-966c-d7c730160a5d for range (-4923763406927773451,-2187651444700558944] finished
[2021-06-25 19:55:04,507] Repair session 3bf64f30-d5ef-11eb-966c-d7c730160a5d for range (-2187651444700558944,-574543093143405237] finished
[2021-06-25 19:55:04,514] Repair session 3bf760a0-d5ef-11eb-966c-d7c730160a5d for range (-574543093143405237,-178801028445334456] finished
[2021-06-25 19:55:04,753] Repair command #7 finished
'Repaired at' of the SSTables after the repair:
client:~/css/apache-cassandra-2.1.23/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d$ for i in $(ls *Data.db) ; do echo $i ; ~/css/apache-cassandra-2.1.23/tools/bin/sstablemetadata $i | grep Repai ; done | grep -v commitlog
demo-msisdn-ka-3-Data.db
Repaired at: 1624650904440
demo-msisdn-ka-4-Data.db
Repaired at: 0
demo-msisdn-ka-5-Data.db
Repaired at: 1624650904440
demo-msisdn-ka-6-Data.db
Repaired at: 0
---- LOG -----
INFO [MemtableFlushWriter:7] 2021-06-25 19:52:19,665 Memtable.java:382 - Completed flushing /home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-tmp-ka-1-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1624646218285, position=185893)
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,713 Validator.java:257 - [repair #d9ad0620-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,783 Validator.java:257 - [repair #d9c485c0-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,845 Validator.java:257 - [repair #d9d21a50-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,861 Validator.java:257 - [repair #d9d54ea0-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,330 Validator.java:257 - [repair #3bdd7000-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,337 Validator.java:257 - [repair #3bdd7000-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,349 Validator.java:257 - [repair #3be05630-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,355 Validator.java:257 - [repair #3be05630-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,372 Validator.java:257 - [repair #3be38a80-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,374 Validator.java:257 - [repair #3be38a80-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,388 Validator.java:257 - [repair #3be62290-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,391 Validator.java:257 - [repair #3be62290-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,413 CompactionManager.java:496 - Starting anticompaction for system_traces.events on 0/0 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,413 CompactionManager.java:561 - Completed anticompaction successfully
INFO [CompactionExecutor:45] 2021-06-25 19:55:04,414 CompactionManager.java:496 - Starting anticompaction for system_traces.sessions on 0/0 sstables
INFO [CompactionExecutor:45] 2021-06-25 19:55:04,414 CompactionManager.java:561 - Completed anticompaction successfully
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,453 Validator.java:257 - [repair #3bf0f800-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,472 Validator.java:257 - [repair #3bf1e260-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,485 Validator.java:257 - [repair #3bf64f30-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,496 Validator.java:257 - [repair #3bf760a0-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,516 CompactionManager.java:496 - Starting anticompaction for demo.msisdn on 1/1 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,516 CompactionManager.java:537 - SSTable SSTableReader(path='/home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-ka-1-Data.db') ((-7133164915313410844,6369609434230030255]) will be anticompacted on range (-5196837186409114737,-178801028445334456]
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,517 CompactionManager.java:1125 - Performing anticompaction on 1 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,517 CompactionManager.java:1137 - Anticompacting SSTableReader(path='/home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-ka-1-Data.db')
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,670 CompactionManager.java:1197 - Anticompaction completed successfully, anticompacted from 1 to 2 sstable(s).
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,678 CompactionManager.java:561 - Completed anticompaction successfully
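Reading the log above: demo-msisdn-ka-1-Data.db covered the token span (-7133164915313410844,6369609434230030255], but only (-5196837186409114737,-178801028445334456] of it was repaired, so the SSTable was split into a repaired part and an unrepaired part. This also suggests why the 3.11 message in the original question appears: -9223372036854775808 is the minimum Murmur3 token, so the wraparound range (-9223372036854775808,-9223372036854775808] denotes the entire ring. A nodetool repair -full covers all of the node's ranges, so every SSTable (including one holding tokens such as -4069959284402364209 for key 1) is fully contained in the repaired range, and Cassandra can simply mutate repairedAt instead of anticompacting.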
Related
I created a Kafka Streams application with Spring Cloud Stream which reads data from one topic and writes to another. I'm trying to deploy and run the job in AKS with an ACR image, but the stream gets closed without any error after reading all the available messages (lag 0) in the topic. The strange thing is that it runs fine in IntelliJ.
Here are my AKS pod logs:
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=62): org.apache.kafka.common.requests.FetchResponse@7b021a01
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=49) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=49,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63): org.apache.kafka.common.requests.FetchResponse@50fb365c
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=50) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=64) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=50,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,710] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@dc9876b, started on Tue Mar 02 17:29:08 GMT 2021
[2021-03-02 17:30:39,715] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@71391b3f, started on Tue Mar 02 17:29:12 GMT 2021, parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@dc9876b
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:369] Stopping beans in phase 2147483547
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:242] Bean 'org.springframework.kafka.config.internalKafkaListenerEndpointRegistry' completed its stop procedure
[2021-03-02 17:30:39,719] [DEBUG] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:1016] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] Stopping Streams client with timeoutMillis = 10000 ms.
[2021-03-02 17:30:39,719] [INFO] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:287] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:1116] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Informed to shut down
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:221] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:772] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State already transits to PENDING_SHUTDOWN, skipping the run once call after poll request
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:206] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Ignoring request to transit from PENDING_SHUTDOWN to PENDING_SHUTDOWN: only DEAD state is a valid next state
[2021-03-02 17:30:39,788] [INFO] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:1130] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Shutting down
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.AssignedStreamsTasks AssignedStreamsTasks.java:529] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Clean shutdown of all active tasks
Please advise.
Using Cassandra version 3.11.8, OpenJDK 8u242-b08.
Prior to this crashing, I was altering a table with 50k+ columns, so this might be (and likely is) a factor in all of this. I would ideally rather lose the data in the commit log (if it is still perpetually inserting a backlog) so that I can connect to the hosts and service can be resumed.
Before the errors started, I ran ALTER TABLE commands adding many columns to the table, 1000 at a time. Eventually, after it may have done about half of them, I received this error on all the nodes.
2020-09-10 15:34:29 WARNING [control connection] Error connecting to 127.0.0.3:9042:
Traceback (most recent call last):
  File "cassandra\cluster.py", line 3522, in cassandra.cluster.ControlConnection._reconnect_internal
  File "cassandra\cluster.py", line 3591, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra\cluster.py", line 3588, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra\cluster.py", line 3690, in cassandra.cluster.ControlConnection._refresh_schema
  File "cassandra\metadata.py", line 142, in cassandra.metadata.Metadata.refresh
  File "cassandra\metadata.py", line 165, in cassandra.metadata.Metadata._rebuild_all
  File "cassandra\metadata.py", line 2522, in get_all_keyspaces
  File "cassandra\metadata.py", line 2031, in get_all_keyspaces
  File "cassandra\metadata.py", line 2719, in cassandra.metadata.SchemaParserV3._query_all
  File "cassandra\connection.py", line 985, in cassandra.connection.Connection.wait_for_responses
  File "cassandra\connection.py", line 983, in cassandra.connection.Connection.wait_for_responses
  File "cassandra\connection.py", line 1435, in cassandra.connection.ResponseWaiter.deliver
cassandra.OperationTimedOut: errors=None, last_host=None
I am running 8 nodes on a server. I have reset all nodes and the handshakes are done, but I cannot connect to my cluster from any of the nodes. My system.log and debug.log show similar entries throughout once Cassandra starts running. gc.log has not updated in some time, so it makes me wonder what is going on. An interesting point is that I only retrieve the list of columns in the table 3 times in total, and I have run this code on my desktop using 2 nodes (with far fewer resources) without hitting any of these issues.
Edit: just for clarity, my application/connections are not running, and the logs below show what is happening periodically. I tried looking at scheduled tasks and cannot find information about Cassandra for this. I wonder what backlog it is reading from and whether I can stop it. Ideally I would like to stop this backlog of operations from happening.
-------SYSTEM.LOG-------
INFO [GossipStage:1] 2020-09-10 17:38:52,376 StorageService.java:2400 - Node /127.0.0.9 state jump to NORMAL
WARN [OptionalTasks:1] 2020-09-10 17:38:54,802 CassandraRoleManager.java:377 - CassandraRoleManager skipped default role setup: some nodes were not ready
INFO [OptionalTasks:1] 2020-09-10 17:38:54,802 CassandraRoleManager.java:416 - Setup task failed with error, rescheduling
INFO [HANDSHAKE-/127.0.0.4] 2020-09-10 17:38:56,965 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.4
INFO [HANDSHAKE-/127.0.0.4] 2020-09-10 17:38:58,262 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,102 Gossiper.java:1139 - Node /127.0.0.4 has restarted, now UP
INFO [GossipStage:1] 2020-09-10 17:38:59,103 TokenMetadata.java:497 - Updating topology for /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,103 TokenMetadata.java:497 - Updating topology for /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,105 Gossiper.java:1103 - InetAddress /127.0.0.4 is now UP
INFO [HANDSHAKE-/127.0.0.5] 2020-09-10 17:38:59,813 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:00,104 StorageService.java:2400 - Node /127.0.0.4 state jump to NORMAL
INFO [HANDSHAKE-/127.0.0.5] 2020-09-10 17:39:01,029 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,266 Gossiper.java:1139 - Node /127.0.0.5 has restarted, now UP
INFO [GossipStage:1] 2020-09-10 17:39:01,267 TokenMetadata.java:497 - Updating topology for /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,267 TokenMetadata.java:497 - Updating topology for /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,270 Gossiper.java:1103 - InetAddress /127.0.0.5 is now UP
INFO [GossipStage:1] 2020-09-10 17:39:04,271 StorageService.java:2400 - Node /127.0.0.5 state jump to NORMAL
INFO [ScheduledTasks:1] 2020-09-10 17:43:05,805 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:48:40,892 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:54:35,999 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:59:36,083 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [Service Thread] 2020-09-10 18:00:24,722 GCInspector.java:285 - ParNew GC in 237ms. CMS Old Gen: 717168160 -> 887151520; Par Eden Space: 1718091776 -> 0; Par Survivor Space: 12757512 -> 214695936
INFO [ScheduledTasks:1] 2020-09-10 18:04:56,160 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
------DEBUG.LOG------
INFO [Service Thread] 2020-09-10 18:00:24,722 GCInspector.java:285 - ParNew GC in 237ms. CMS Old Gen: 717168160 -> 887151520; Par Eden Space: 1718091776 -> 0; Par Survivor Space: 12757512 -> 214695936
DEBUG [ScheduledTasks:1] 2020-09-10 18:00:26,102 MonitoringTask.java:173 - 1 operations were slow in the last 4996 msecs:
<SELECT * FROM system_schema.columns>, was slow 2 times: avg/min/max 1256/1232/1281 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:00:56,110 MonitoringTask.java:173 - 1 operations were slow in the last 5007 msecs:
<SELECT * FROM system_schema.columns>, time 795 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:01:01,111 MonitoringTask.java:173 - 1 operations were slow in the last 5003 msecs:
<SELECT * FROM system_schema.columns>, time 808 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:03:41,143 MonitoringTask.java:173 - 1 operations were slow in the last 5002 msecs:
<SELECT * FROM system_schema.columns>, time 853 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:06,148 MonitoringTask.java:173 - 1 operations were slow in the last 4996 msecs:
<SELECT * FROM system_schema.columns>, time 772 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:26,153 MonitoringTask.java:173 - 1 operations were slow in the last 4991 msecs:
<SELECT * FROM system_schema.columns>, time 838 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:31,154 MonitoringTask.java:173 - 1 operations were slow in the last 5009 msecs:
<SELECT * FROM system_schema.columns>, time 841 msec - slow timeout 500 msec
INFO [ScheduledTasks:1] 2020-09-10 18:04:56,160 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:56,160 MonitoringTask.java:173 - 1 operations were slow in the last 5004 msecs:
<SELECT * FROM system_schema.columns>, time 772 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:11,165 MonitoringTask.java:173 - 1 operations were slow in the last 4994 msecs:
<SELECT * FROM system_schema.columns>, time 808 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:31,171 MonitoringTask.java:173 - 1 operations were slow in the last 5004 msecs:
<SELECT * FROM system_schema.columns>, time 834 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:56,176 MonitoringTask.java:173 - 1 operations were slow in the last 5010 msecs:
<SELECT * FROM system_schema.columns>, was slow 2 times: avg/min/max 847/837/857 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:07:16,196 MonitoringTask.java:173 - 1 operations were slow in the last 5003 msecs:
<SELECT * FROM system_schema.columns>, time 827 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:07:31,200 MonitoringTask.java:173 - 1 operations were slow in the last 5007 msecs:
<SELECT * FROM system_schema.columns>, time 834 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:01,207 MonitoringTask.java:173 - 1 operations were slow in the last 5000 msecs:
<SELECT * FROM system_schema.columns>, time 799 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:16,211 MonitoringTask.java:173 - 1 operations were slow in the last 4999 msecs:
<SELECT * FROM system_schema.columns>, time 780 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:36,217 MonitoringTask.java:173 - 1 operations were slow in the last 5000 msecs:
<SELECT * FROM system_schema.columns>, time 835 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:09:01,221 MonitoringTask.java:173 - 1 operations were slow in the last 5002 msecs:
<SELECT * FROM system_schema.columns>, time 832 msec - slow timeout 500 msec
INFO [ScheduledTasks:1] 2020-09-10 18:09:56,231 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2020-09-10 18:09:56,231 MonitoringTask.java:173 - 1 operations were slow in the last 4995 msecs:
<SELECT * FROM system_schema.columns>, time 778 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:10:06,233 MonitoringTask.java:173 - 1 operations were slow in the last 5009 msecs:
<SELECT * FROM system_schema.columns>, time 1099 msec - slow timeout 500 msec
The timeout is from the driver trying to parse the schema while establishing the control connection.
The driver uses the control connection for admin tasks such as discovering the cluster's topology and schema during the initialisation phase. I've discussed it in a bit more detail in this post -- https://community.datastax.com/questions/7702/.
In your case, the driver initialisation times out while parsing the thousands of columns in the table you mentioned. I have to admit that this is new to me. I've never worked with a cluster that had thousands of columns so I'm curious to know what your use case is and perhaps there might be a better data model for it.
As a workaround, you can try bumping up the default timeout to see if the driver is eventually able to initialise. However, this is going to be a band-aid solution, since the driver needs to re-parse the schema every time a DDL change takes place. Cheers!
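A minimal sketch of that workaround with the Python driver (which the traceback above comes from). The control_connection_timeout parameter governs how long the driver waits on control-connection queries such as the schema reads; the 30-second value is an illustrative assumption, not a recommendation:
from cassandra.cluster import Cluster

# Give the control connection more time to fetch and parse the very
# large system_schema.columns result (the default timeout is 2.0 seconds).
cluster = Cluster(
    contact_points=["127.0.0.3"],     # taken from the traceback above; adjust to your nodes
    control_connection_timeout=30.0,  # illustrative value only
)
session = cluster.connect()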
I want to add Cassandra monitoring using Prometheus (ref: https://blog.pythian.com/step-step-monitoring-cassandra-prometheus-grafana/).
When I add the following to /etc/cassandra/cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7070:/opt/jmx_prometheus/cassandra.yml"
I get this error:
ubuntu@ip-172-21-0-111:~$ sudo service cassandra status
● cassandra.service - LSB: distributed storage system for structured data
Loaded: loaded (/etc/init.d/cassandra; bad; vendor preset: enabled)
Active: active (exited) since Mon 2020-04-13 05:43:38 UTC; 3s ago
Docs: man:systemd-sysv-generator(8)
Process: 3557 ExecStop=/etc/init.d/cassandra stop (code=exited, status=0/SUCCESS)
Process: 3570 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Starting LSB: distributed storage system for structured data...
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Started LSB: distributed storage system for structured data.
ubuntu@ip-172-21-0-111:~$ nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
When I remove the jmx_prometheus entry, it works:
ubuntu@ip-172-21-0-111:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.21.0.111 1.83 GiB 128 100.0% b52324d0-c57f-46e3-bc10-a6dc07bae17a rack1
ubuntu@ip-172-21-0-111:~$ tail -f /var/log/cassandra/system.log
INFO [main] 2020-04-13 05:37:36,609 StorageService.java:2169 - Node /172.21.0.111 state jump to NORMAL
INFO [main] 2020-04-13 05:37:36,617 CassandraDaemon.java:673 - Waiting for gossip to settle before accepting client requests...
INFO [main] 2020-04-13 05:37:44,621 CassandraDaemon.java:704 - No gossip backlog; proceeding
INFO [main] 2020-04-13 05:37:44,713 NativeTransportService.java:70 - Netty using native Epoll event loop
INFO [main] 2020-04-13 05:37:44,773 Server.java:161 - Using Netty Version: [netty-buffer=netty-buffer-4.0.36.Final.e8fa848, netty-codec=netty-codec-4.0.36.Final.e8fa848, netty-codec-haproxy=netty-codec-haproxy-4.0.36.Final.e8fa848, netty-codec-http=netty-codec-http-4.0.36.Final.e8fa848, netty-codec-socks=netty-codec-socks-4.0.36.Final.e8fa848, netty-common=netty-common-4.0.36.Final.e8fa848, netty-handler=netty-handler-4.0.36.Final.e8fa848, netty-tcnative=netty-tcnative-1.1.33.Fork15.906a8ca, netty-transport=netty-transport-4.0.36.Final.e8fa848, netty-transport-native-epoll=netty-transport-native-epoll-4.0.36.Final.e8fa848, netty-transport-rxtx=netty-transport-rxtx-4.0.36.Final.e8fa848, netty-transport-sctp=netty-transport-sctp-4.0.36.Final.e8fa848, netty-transport-udt=netty-transport-udt-4.0.36.Final.e8fa848]
INFO [main] 2020-04-13 05:37:44,773 Server.java:162 - Starting listening for CQL clients on /172.21.0.111:9042 (unencrypted)...
INFO [main] 2020-04-13 05:37:44,811 CassandraDaemon.java:505 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO [SharedPool-Worker-1] 2020-04-13 05:37:46,625 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
INFO [OptionalTasks:1] 2020-04-13 05:37:46,752 CassandraRoleManager.java:339 - Created default superuser role 'cassandra'
It worked! I changed the port from 7070 to 7071 in:
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7071:/opt/jmx_prometheus/cassandra.yml"
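After restarting Cassandra, one can verify that the agent is serving metrics (a quick check, assuming the exporter is now listening on port 7071 of the same host):
curl -s http://localhost:7071/metrics | head
If this prints Prometheus-format metric lines, the scrape target can be pointed at port 7071.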
Can someone point me to documentation on what the -100 exit code means? This is an EMR cluster running Spark 2.0.0 on YARN (the standard EMR spark-cluster deployment). I've seen https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_sg_yarn_container_exec_errors.html, which gives some error codes, but -100 is not one of them. Also, as a more general question, it seems that neither the YARN container logs nor the Spark container logs contain much information on what causes such a failure. From the YARN logs I see:
17/01/18 17:51:58 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 4164 executors.
17/01/18 17:51:58 INFO YarnAllocator: Driver requested a total number of 4163 executor(s).
17/01/18 17:51:58 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 4163 executors.
17/01/18 17:51:58 INFO YarnAllocator: Driver requested a total number of 4162 executor(s).
17/01/18 17:51:58 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 4162 executors.
17/01/18 17:51:59 INFO YarnAllocator: Driver requested a total number of 4161 executor(s).
17/01/18 17:51:59 INFO YarnAllocator: Driver requested a total number of 4160 executor(s).
17/01/18 17:51:59 INFO YarnAllocator: Canceling requests for 2 executor container(s) to have a new desired total 4160 executors.
17/01/18 17:52:00 INFO YarnAllocator: Driver requested a total number of 4159 executor(s).
17/01/18 17:52:00 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 4159 executors.
17/01/18 17:52:00 INFO YarnAllocator: Completed container container_1483555419510_0037_01_000114 on host: ip-172-20-221-152.us-west-2.compute.internal (state: COMPLETE, exit status: -100)
17/01/18 17:52:00 WARN YarnAllocator: Container marked as failed: container_1483555419510_0037_01_000114 on host: ip-172-20-221-152.us-west-2.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
17/01/18 17:52:00 INFO YarnAllocator: Completed container container_1483555419510_0037_01_000107 on host: ip-172-20-221-152.us-west-2.compute.internal (state: COMPLETE, exit status: -100)
17/01/18 17:52:00 WARN YarnAllocator: Container marked as failed: container_1483555419510_0037_01_000107 on host: ip-172-20-221-152.us-west-2.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
17/01/18 17:52:00 INFO YarnAllocator: Will request 2 executor containers, each with 7 cores and 22528 MB memory including 2048 MB overhead
17/01/18 17:52:00 INFO YarnAllocator: Canceled 0 container requests (locality no longer needed)
17/01/18 17:52:00 INFO YarnAllocator: Submitted container request (host: Any, capability: <memory:22528, vCores:7>)
17/01/18 17:52:00 INFO YarnAllocator: Submitted container request (host: Any, capability: <memory:22528, vCores:7>)
17/01/18 17:52:01 INFO YarnAllocator: Driver requested a total number of 4158 executor(s).
17/01/18 17:52:01 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 4158 executors.
17/01/18 17:52:02 INFO YarnAllocator: Driver requested a total number of 4157 executor(s).
and in the Spark executor logs I see:
17/01/18 17:39:39 INFO MemoryStore: MemoryStore cleared
17/01/18 17:39:39 INFO BlockManager: BlockManager stopped
17/01/18 17:39:39 INFO ShutdownHookManager: Shutdown hook called
Neither of which is very informative.
"Exit status: -100. Diagnostics: Container released on a lost node" tell you that the node has lost
This is a snippet from the system log while shutting down:
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:28:50,995 StorageService.java:3788 - Announcing that I have left the ring for 30000ms
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,995 ThriftServer.java:142 - Stop listening to thrift clients
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Server.java:182 - Stop listening for CQL clients
WARN [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 MessagingService.java:786 - Waiting for messaging service to quiesce
INFO [ACCEPT-sysengplayl0127.bio-iad.ea.com/10.72.194.229] 2016-07-27 22:29:20,998 MessagingService.java:1133 - MessagingService has terminated the accept() thread
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:21,022 StorageService.java:1411 - DECOMMISSIONED
INFO [main] 2016-07-27 22:32:17,534 YamlConfigurationLoader.java:89 - Configuration location: file:/opt/cassandra/product/apache-cassandra-3.7/conf/cassandra.yaml
And then while starting up:
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:630 - Cassandra version: 3.7
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:631 - Thrift API version: 20.1.0
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:632 - CQL supported versions: 3.4.2 (default: 3.4.2)
INFO [main] 2016-07-27 22:32:20,351 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 397 MB and a resize interval of 60 minutes
ERROR [main] 2016-07-27 22:32:20,357 CassandraDaemon.java:731 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: This node was decommissioned and will not rejoin the ring unless cassandra.override_decommission=true has been set, or all existing data is removed and the node is bootstrapped again
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:815) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:725) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:370) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:585) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:714) [apache-cassandra-3.7.jar:3.7]
WARN [StorageServiceShutdownHook] 2016-07-27 22:32:20,358 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [StorageServiceShutdownHook] 2016-07-27 22:32:20,359 MessagingService.java:786 - Waiting for messaging service to quiesce
Is there something wrong with the configuration?
I had faced the same issue; posting the answer so that it might help others.
As the log suggests, the property cassandra.override_decommission should be overridden. Start Cassandra with:
cassandra -Dcassandra.override_decommission=true
This should add the node back to the cluster.
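If Cassandra is started as a service rather than directly from the command line, the same system property can be set in cassandra-env.sh instead (a sketch; the file location depends on your installation):
JVM_OPTS="$JVM_OPTS -Dcassandra.override_decommission=true"
Once the node has rejoined the ring, it is worth removing this flag so that a future decommission is not silently overridden on restart.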