Cassandra node running, but can't connect? - cassandra

Using Cassandra version 3.11.8, openjdk-8u242-b08.
Prior to this crash, I was altering a table with 50k+ columns, so this is probably a factor in all of this. Ideally, I would rather lose the data in the commit (if it is still perpetually inserting a backlog) so that I can connect to the hosts and service can be resumed.
Before the error started, I ran ALTER TABLE commands adding many columns to the table, 1000 per call. After about half of them had been applied, I received this error for all the nodes.
2020-09-10 15:34:29 WARNING [control connection] Error connecting to 127.0.0.3:9042:
Traceback (most recent call last):
  File "cassandra\cluster.py", line 3522, in cassandra.cluster.ControlConnection._reconnect_internal
  File "cassandra\cluster.py", line 3591, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra\cluster.py", line 3588, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra\cluster.py", line 3690, in cassandra.cluster.ControlConnection._refresh_schema
  File "cassandra\metadata.py", line 142, in cassandra.metadata.Metadata.refresh
  File "cassandra\metadata.py", line 165, in cassandra.metadata.Metadata._rebuild_all
  File "cassandra\metadata.py", line 2522, in get_all_keyspaces
  File "cassandra\metadata.py", line 2031, in get_all_keyspaces
  File "cassandra\metadata.py", line 2719, in cassandra.metadata.SchemaParserV3._query_all
  File "cassandra\connection.py", line 985, in cassandra.connection.Connection.wait_for_responses
  File "cassandra\connection.py", line 983, in cassandra.connection.Connection.wait_for_responses
  File "cassandra\connection.py", line 1435, in cassandra.connection.ResponseWaiter.deliver
cassandra.OperationTimedOut: errors=None, last_host=None
I am running 8 nodes on one server. I have reset all nodes and the handshakes are done, but I cannot connect to my cluster on any of the nodes. My system.log and debug.log show similar entries throughout once Cassandra starts running. gc.log has not updated in some time, which makes me wonder what is going on. An interesting point is that I only retrieve the list of columns in the table 3 times in total. I have run this code on my desktop using 2 nodes (with far fewer resources) without any of these issues.
Edit: just for clarity, my application and its connections are not running, and the logs below show what is happening periodically. I tried looking at scheduled tasks and cannot find information about Cassandra for this. I wonder what backlog it is reading from and whether I can stop it. Ideally, I would like to stop this backlog of operations from happening.
-------SYSTEM.LOG-------
INFO [GossipStage:1] 2020-09-10 17:38:52,376 StorageService.java:2400 - Node /127.0.0.9 state jump to NORMAL
WARN [OptionalTasks:1] 2020-09-10 17:38:54,802 CassandraRoleManager.java:377 - CassandraRoleManager skipped default role setup: some nodes were not ready
INFO [OptionalTasks:1] 2020-09-10 17:38:54,802 CassandraRoleManager.java:416 - Setup task failed with error, rescheduling
INFO [HANDSHAKE-/127.0.0.4] 2020-09-10 17:38:56,965 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.4
INFO [HANDSHAKE-/127.0.0.4] 2020-09-10 17:38:58,262 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,102 Gossiper.java:1139 - Node /127.0.0.4 has restarted, now UP
INFO [GossipStage:1] 2020-09-10 17:38:59,103 TokenMetadata.java:497 - Updating topology for /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,103 TokenMetadata.java:497 - Updating topology for /127.0.0.4
INFO [GossipStage:1] 2020-09-10 17:38:59,105 Gossiper.java:1103 - InetAddress /127.0.0.4 is now UP
INFO [HANDSHAKE-/127.0.0.5] 2020-09-10 17:38:59,813 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:00,104 StorageService.java:2400 - Node /127.0.0.4 state jump to NORMAL
INFO [HANDSHAKE-/127.0.0.5] 2020-09-10 17:39:01,029 OutboundTcpConnection.java:561 - Handshaking version with /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,266 Gossiper.java:1139 - Node /127.0.0.5 has restarted, now UP
INFO [GossipStage:1] 2020-09-10 17:39:01,267 TokenMetadata.java:497 - Updating topology for /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,267 TokenMetadata.java:497 - Updating topology for /127.0.0.5
INFO [GossipStage:1] 2020-09-10 17:39:01,270 Gossiper.java:1103 - InetAddress /127.0.0.5 is now UP
INFO [GossipStage:1] 2020-09-10 17:39:04,271 StorageService.java:2400 - Node /127.0.0.5 state jump to NORMAL
INFO [ScheduledTasks:1] 2020-09-10 17:43:05,805 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:48:40,892 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:54:35,999 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [ScheduledTasks:1] 2020-09-10 17:59:36,083 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [Service Thread] 2020-09-10 18:00:24,722 GCInspector.java:285 - ParNew GC in 237ms. CMS Old Gen: 717168160 -> 887151520; Par Eden Space: 1718091776 -> 0; Par Survivor Space: 12757512 -> 214695936
INFO [ScheduledTasks:1] 2020-09-10 18:04:56,160 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
------DEBUG.LOG------
INFO [Service Thread] 2020-09-10 18:00:24,722 GCInspector.java:285 - ParNew GC in 237ms. CMS Old Gen: 717168160 -> 887151520; Par Eden Space: 1718091776 -> 0; Par Survivor Space: 12757512 -> 214695936
DEBUG [ScheduledTasks:1] 2020-09-10 18:00:26,102 MonitoringTask.java:173 - 1 operations were slow in the last 4996 msecs:
<SELECT * FROM system_schema.columns>, was slow 2 times: avg/min/max 1256/1232/1281 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:00:56,110 MonitoringTask.java:173 - 1 operations were slow in the last 5007 msecs:
<SELECT * FROM system_schema.columns>, time 795 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:01:01,111 MonitoringTask.java:173 - 1 operations were slow in the last 5003 msecs:
<SELECT * FROM system_schema.columns>, time 808 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:03:41,143 MonitoringTask.java:173 - 1 operations were slow in the last 5002 msecs:
<SELECT * FROM system_schema.columns>, time 853 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:06,148 MonitoringTask.java:173 - 1 operations were slow in the last 4996 msecs:
<SELECT * FROM system_schema.columns>, time 772 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:26,153 MonitoringTask.java:173 - 1 operations were slow in the last 4991 msecs:
<SELECT * FROM system_schema.columns>, time 838 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:31,154 MonitoringTask.java:173 - 1 operations were slow in the last 5009 msecs:
<SELECT * FROM system_schema.columns>, time 841 msec - slow timeout 500 msec
INFO [ScheduledTasks:1] 2020-09-10 18:04:56,160 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2020-09-10 18:04:56,160 MonitoringTask.java:173 - 1 operations were slow in the last 5004 msecs:
<SELECT * FROM system_schema.columns>, time 772 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:11,165 MonitoringTask.java:173 - 1 operations were slow in the last 4994 msecs:
<SELECT * FROM system_schema.columns>, time 808 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:31,171 MonitoringTask.java:173 - 1 operations were slow in the last 5004 msecs:
<SELECT * FROM system_schema.columns>, time 834 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:05:56,176 MonitoringTask.java:173 - 1 operations were slow in the last 5010 msecs:
<SELECT * FROM system_schema.columns>, was slow 2 times: avg/min/max 847/837/857 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:07:16,196 MonitoringTask.java:173 - 1 operations were slow in the last 5003 msecs:
<SELECT * FROM system_schema.columns>, time 827 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:07:31,200 MonitoringTask.java:173 - 1 operations were slow in the last 5007 msecs:
<SELECT * FROM system_schema.columns>, time 834 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:01,207 MonitoringTask.java:173 - 1 operations were slow in the last 5000 msecs:
<SELECT * FROM system_schema.columns>, time 799 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:16,211 MonitoringTask.java:173 - 1 operations were slow in the last 4999 msecs:
<SELECT * FROM system_schema.columns>, time 780 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:08:36,217 MonitoringTask.java:173 - 1 operations were slow in the last 5000 msecs:
<SELECT * FROM system_schema.columns>, time 835 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:09:01,221 MonitoringTask.java:173 - 1 operations were slow in the last 5002 msecs:
<SELECT * FROM system_schema.columns>, time 832 msec - slow timeout 500 msec
INFO [ScheduledTasks:1] 2020-09-10 18:09:56,231 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2020-09-10 18:09:56,231 MonitoringTask.java:173 - 1 operations were slow in the last 4995 msecs:
<SELECT * FROM system_schema.columns>, time 778 msec - slow timeout 500 msec
DEBUG [ScheduledTasks:1] 2020-09-10 18:10:06,233 MonitoringTask.java:173 - 1 operations were slow in the last 5009 msecs:
<SELECT * FROM system_schema.columns>, time 1099 msec - slow timeout 500 msec

The timeout is from the driver trying to parse the schema while establishing the control connection.
The driver uses the control connection for admin tasks such as discovering the cluster's topology and schema during the initialisation phase. I've discussed it in a bit more detail in this post -- https://community.datastax.com/questions/7702/.
In your case, the driver initialisation times out while parsing the thousands of columns in the table you mentioned. I have to admit that this is new to me. I've never worked with a cluster that had thousands of columns so I'm curious to know what your use case is and perhaps there might be a better data model for it.
As a workaround, you can try increasing the default timeout to see whether the driver is eventually able to initialise. However, this is a band-aid solution, since the driver needs to parse the schema every time a schema change (DDL) takes place. Cheers!
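A minimal sketch of raising the timeouts with the DataStax Python driver (the driver that appears in the traceback). The 60-second value, the contact point, and the helper names are assumptions chosen for illustration, not recommendations. `control_connection_timeout` and `connect_timeout` are `Cluster` constructor arguments; the driver import is deferred so the helper can be inspected without the driver installed.

```python
def build_cluster_kwargs(timeout_seconds=60.0):
    """Return Cluster keyword arguments with longer-than-default timeouts.

    timeout_seconds is an illustrative value, not a tuned recommendation.
    """
    return {
        # Timeout for queries made on the control connection, which includes
        # the system_schema queries issued while the schema is refreshed.
        "control_connection_timeout": timeout_seconds,
        # Timeout for creating new connections to a node.
        "connect_timeout": timeout_seconds,
    }


def connect(contact_points=("127.0.0.3",), port=9042, timeout_seconds=60.0):
    """Open a session using the relaxed timeouts (requires cassandra-driver)."""
    from cassandra.cluster import Cluster  # imported lazily on purpose

    cluster = Cluster(
        contact_points=list(contact_points),
        port=port,
        **build_cluster_kwargs(timeout_seconds),
    )
    return cluster.connect()
```

Calling `connect()` with a larger `timeout_seconds` only gives the schema queries more time to finish; it does not make them cheaper, so the data model concern still applies.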

Related

Anticompaction phase post nodetool repair in cassandra

After running repair on version 3.11.2, I am getting the message below in debug.log:
mc-50-big-Data.db fully contained in range (-9223372036854775808,-9223372036854775808], mutating repairedAt instead of anticompacting
Why is this SSTable fully contained in the range (-9223372036854775808,-9223372036854775808]
even though it holds multiple tokens/keys, as shown in the output below?
Keys found in this SSTable:
sstabledump demo/msisdn-e59722f0d1e711ebb52c1524f01c1145/mc-50-big-Data.db| grep key
"key" : [ "1" ],
"key" : [ "2" ],
"key" : [ "22" ],
"key" : [ "833" ],
"key" : [ "3232" ],
"key" : [ "98" ],
"key" : [ "900" ],
"key" : [ "173" ],
Different tokens found in this SSTable:
account_id | system.token(account_id)
------------+--------------------------
1 | -4069959284402364209
2 | -3248873570005575792
22 | -1117083337304738213
833 | -1083053322882870066
3232 | -1016771166277942908
98 | -463622059452620815
900 | -300805731578844817
173 | 298622069266553728
I executed the command below:
nodetool repair -full -seq demo msisdn
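One way to read that log line, offered as a sketch rather than as Cassandra's actual code: token ranges are written as (left, right] on a ring, and a range whose left bound is not smaller than its right bound wraps around the ring. Under that reading, a range that starts and ends at the minimum token covers the whole ring, so every token in the SSTable falls inside it. The function name below is hypothetical; the tokens are the ones listed above.

```python
MIN_TOKEN = -(2 ** 63)  # -9223372036854775808, the bound shown in the log line


def range_contains(left, right, token):
    """Return True if token lies in the ring range (left, right]."""
    if left >= right:
        # Wrap-around range: everything after left, or anything up to right.
        return token > left or token <= right
    return left < token <= right


# Tokens reported for the partition keys in the SSTable.
tokens = [
    -4069959284402364209,
    -3248873570005575792,
    -1117083337304738213,
    -1083053322882870066,
    -1016771166277942908,
    -463622059452620815,
    -300805731578844817,
    298622069266553728,
]

fully_contained = all(range_contains(MIN_TOKEN, MIN_TOKEN, t) for t in tokens)
print("fully contained:", fully_contained)
```

With a full-ring range there is nothing left over to split off, which would be consistent with the log saying that repairedAt is mutated instead of anticompacting.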
For C* older than
Example of what anticompaction does to C* SSTables during repair.
Older Cassandra versions (> C* 2.2) actually performed anticompaction, in which one SSTable is split into two parts:
repaired
unrepaired
Below is the example.
Current "Repaired at" values of the SSTables:
client:~/css/apache-cassandra-2.1.23/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d$ for i in $(ls *Data.db) ; do echo $i ; ~/css/apache-cassandra-2.1.23/tools/bin/sstablemetadata $i | grep Repai ;done | grep -v commitlog
demo-msisdn-ka-1-Data.db
Repaired at: 0
demo-msisdn-ka-2-Data.db
Repaired at: 0
client:~/css/apache-cassandra-2.1.23/bin$ ./nodetool repair -st -5196837186409114737 -et -178801028445334456 -par -inc
[2021-06-25 19:55:04,270] Nothing to repair for keyspace 'system'
[2021-06-25 19:55:04,293] Starting repair command #6, repairing 4 ranges for keyspace system_traces (parallelism=PARALLEL, full=false)
[2021-06-25 19:55:04,395] Repair session 3bdd7000-d5ef-11eb-966c-d7c730160a5d for range (-5196837186409114737,-4923763406927773451] finished
[2021-06-25 19:55:04,395] Repair session 3be05630-d5ef-11eb-966c-d7c730160a5d for range (-4923763406927773451,-2187651444700558944] finished
[2021-06-25 19:55:04,396] Repair session 3be38a80-d5ef-11eb-966c-d7c730160a5d for range (-2187651444700558944,-574543093143405237] finished
[2021-06-25 19:55:04,401] Repair session 3be62290-d5ef-11eb-966c-d7c730160a5d for range (-574543093143405237,-178801028445334456] finished
[2021-06-25 19:55:04,421] Repair command #6 finished
[2021-06-25 19:55:04,437] Starting repair command #7, repairing 4 ranges for keyspace demo (parallelism=PARALLEL, full=false)
[2021-06-25 19:55:04,504] Repair session 3bf0f800-d5ef-11eb-966c-d7c730160a5d for range (-5196837186409114737,-4923763406927773451] finished
[2021-06-25 19:55:04,504] Repair session 3bf1e260-d5ef-11eb-966c-d7c730160a5d for range (-4923763406927773451,-2187651444700558944] finished
[2021-06-25 19:55:04,507] Repair session 3bf64f30-d5ef-11eb-966c-d7c730160a5d for range (-2187651444700558944,-574543093143405237] finished
[2021-06-25 19:55:04,514] Repair session 3bf760a0-d5ef-11eb-966c-d7c730160a5d for range (-574543093143405237,-178801028445334456] finished
[2021-06-25 19:55:04,753] Repair command #7 finished
"Repaired at" values of the SSTables after repair:
client:~/css/apache-cassandra-2.1.23/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d$ for i in $(ls *Data.db) ; do echo $i ; ~/css/apache-cassandra-2.1.23/tools/bin/sstablemetadata $i | grep Repai ;done | grep -v commitlog
demo-msisdn-ka-3-Data.db
Repaired at: 1624650904440
demo-msisdn-ka-4-Data.db
Repaired at: 0
demo-msisdn-ka-5-Data.db
Repaired at: 1624650904440
demo-msisdn-ka-6-Data.db
Repaired at: 0
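The non-zero "Repaired at" values can be read as timestamps. The sketch below assumes the value is Unix epoch time in milliseconds and that 0 marks an unrepaired SSTable; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def repaired_at_to_utc(repaired_at_ms):
    """Convert a 'Repaired at' value to a UTC datetime; 0 is treated as unrepaired."""
    if repaired_at_ms == 0:
        return None
    # timedelta arithmetic keeps the millisecond part exact.
    return EPOCH + timedelta(milliseconds=repaired_at_ms)


for value in (1624650904440, 0):
    print(value, "->", repaired_at_to_utc(value))
```

Under that assumption, 1624650904440 converts to 2021-06-25 19:55:04.440 UTC, which lines up with the start of repair command #7 (19:55:04,437) in the output above if that output is in UTC.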
---- LOG -----
INFO [MemtableFlushWriter:7] 2021-06-25 19:52:19,665 Memtable.java:382 - Completed flushing /home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-tmp-ka-1-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1624646218285, position=185893)
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,713 Validator.java:257 - [repair #d9ad0620-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,783 Validator.java:257 - [repair #d9c485c0-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,845 Validator.java:257 - [repair #d9d21a50-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:52:19,861 Validator.java:257 - [repair #d9d54ea0-d5ee-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,330 Validator.java:257 - [repair #3bdd7000-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,337 Validator.java:257 - [repair #3bdd7000-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,349 Validator.java:257 - [repair #3be05630-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,355 Validator.java:257 - [repair #3be05630-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,372 Validator.java:257 - [repair #3be38a80-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,374 Validator.java:257 - [repair #3be38a80-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,388 Validator.java:257 - [repair #3be62290-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/events
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,391 Validator.java:257 - [repair #3be62290-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for system_traces/sessions
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,413 CompactionManager.java:496 - Starting anticompaction for system_traces.events on 0/0 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,413 CompactionManager.java:561 - Completed anticompaction successfully
INFO [CompactionExecutor:45] 2021-06-25 19:55:04,414 CompactionManager.java:496 - Starting anticompaction for system_traces.sessions on 0/0 sstables
INFO [CompactionExecutor:45] 2021-06-25 19:55:04,414 CompactionManager.java:561 - Completed anticompaction successfully
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,453 Validator.java:257 - [repair #3bf0f800-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,472 Validator.java:257 - [repair #3bf1e260-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,485 Validator.java:257 - [repair #3bf64f30-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [AntiEntropyStage:1] 2021-06-25 19:55:04,496 Validator.java:257 - [repair #3bf760a0-d5ef-11eb-966c-d7c730160a5d] Sending completed merkle tree to /127.0.0.5 for demo/msisdn
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,516 CompactionManager.java:496 - Starting anticompaction for demo.msisdn on 1/1 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,516 CompactionManager.java:537 - SSTable SSTableReader(path='/home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-ka-1-Data.db') ((-7133164915313410844,6369609434230030255]) will be anticompacted on range (-5196837186409114737,-178801028445334456]
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,517 CompactionManager.java:1125 - Performing anticompaction on 1 sstables
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,517 CompactionManager.java:1137 - Anticompacting SSTableReader(path='/home/divyanshu_sharma/css/apache-cassandra-2.1.22/data/data/demo/msisdn-495d5c00d5ee11eb966cd7c730160a5d/demo-msisdn-ka-1-Data.db')
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,670 CompactionManager.java:1197 - Anticompaction completed successfully, anticompacted from 1 to 2 sstable(s).
INFO [CompactionExecutor:43] 2021-06-25 19:55:04,678 CompactionManager.java:561 - Completed anticompaction successfully

Cassandra issue while adding jmx_prometheus

I want to add Cassandra monitoring using Prometheus. Reference: https://blog.pythian.com/step-step-monitoring-cassandra-prometheus-grafana/
When I add the following line to /etc/cassandra/cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7070:/opt/jmx_prometheus/cassandra.yml"
I get an error:
ubuntu@ip-172-21-0-111:~$ sudo service cassandra status
● cassandra.service - LSB: distributed storage system for structured data
Loaded: loaded (/etc/init.d/cassandra; bad; vendor preset: enabled)
Active: active (exited) since Mon 2020-04-13 05:43:38 UTC; 3s ago
Docs: man:systemd-sysv-generator(8)
Process: 3557 ExecStop=/etc/init.d/cassandra stop (code=exited, status=0/SUCCESS)
Process: 3570 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Starting LSB: distributed storage system for structured data...
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Started LSB: distributed storage system for structured data.
ubuntu@ip-172-21-0-111:~$ nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
When I remove the jmx_prometheus entry, it works:
ubuntu@ip-172-21-0-111:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.21.0.111 1.83 GiB 128 100.0% b52324d0-c57f-46e3-bc10-a6dc07bae17a rack1
ubuntu@ip-172-21-0-111:~$ tail -f /var/log/cassandra/system.log
INFO [main] 2020-04-13 05:37:36,609 StorageService.java:2169 - Node /172.21.0.111 state jump to NORMAL
INFO [main] 2020-04-13 05:37:36,617 CassandraDaemon.java:673 - Waiting for gossip to settle before accepting client requests...
INFO [main] 2020-04-13 05:37:44,621 CassandraDaemon.java:704 - No gossip backlog; proceeding
INFO [main] 2020-04-13 05:37:44,713 NativeTransportService.java:70 - Netty using native Epoll event loop
INFO [main] 2020-04-13 05:37:44,773 Server.java:161 - Using Netty Version: [netty-buffer=netty-buffer-4.0.36.Final.e8fa848, netty-codec=netty-codec-4.0.36.Final.e8fa848, netty-codec-haproxy=netty-codec-haproxy-4.0.36.Final.e8fa848, netty-codec-http=netty-codec-http-4.0.36.Final.e8fa848, netty-codec-socks=netty-codec-socks-4.0.36.Final.e8fa848, netty-common=netty-common-4.0.36.Final.e8fa848, netty-handler=netty-handler-4.0.36.Final.e8fa848, netty-tcnative=netty-tcnative-1.1.33.Fork15.906a8ca, netty-transport=netty-transport-4.0.36.Final.e8fa848, netty-transport-native-epoll=netty-transport-native-epoll-4.0.36.Final.e8fa848, netty-transport-rxtx=netty-transport-rxtx-4.0.36.Final.e8fa848, netty-transport-sctp=netty-transport-sctp-4.0.36.Final.e8fa848, netty-transport-udt=netty-transport-udt-4.0.36.Final.e8fa848]
INFO [main] 2020-04-13 05:37:44,773 Server.java:162 - Starting listening for CQL clients on /172.21.0.111:9042 (unencrypted)...
INFO [main] 2020-04-13 05:37:44,811 CassandraDaemon.java:505 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO [SharedPool-Worker-1] 2020-04-13 05:37:46,625 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
INFO [OptionalTasks:1] 2020-04-13 05:37:46,752 CassandraRoleManager.java:339 - Created default superuser role 'cassandra'
It worked after I changed the port from 7070 to 7071: JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7071:/opt/jmx_prometheus/cassandra.yml"
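The thread does not say why 7070 failed. One plausible cause, stated here as an assumption, is that another process had already bound that port, so the agent could not start its HTTP listener. A small check like the following (the helper name is hypothetical) shows whether a port can be bound before it is written into JVM_OPTS.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


if __name__ == "__main__":
    for candidate in (7070, 7071):
        state = "free" if port_is_free(candidate) else "not available"
        print(candidate, state)
```

Run it on the Cassandra host while Cassandra is stopped; a port reported as not available there would point to a conflict with another service.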

Docker-Flink: TaskManagers can't find JobManager when in different nodes in Docker Swarm

This happens even when the nodes are in the same subnet.
I am using the Docker-Flink project in:
https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
I am creating the services with the following commands:
docker network create -d overlay overlay
docker service create --name jobmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager -p 8081:8081 --network overlay --constraint 'node.hostname == ubuntu-swarm-manager' flink jobmanager
docker service create --name taskmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager --network overlay --constraint 'node.hostname != ubuntu-swarm-manager' flink taskmanager
This is the error I get:
- Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
These are my environment configurations:
node: ubuntu-swarm-master Azure VM Standard D4s v3 (4 vcpus, 16 GB memory) Docker version 17.03.1-ce, build c6d412e
node: azure-swarm-worker-1 Azure VM Standard D2 v2 Promo (2 vcpus, 7 GB memory) Docker version 17.09.0-ce, build afdb6d4
Flink: using image 1.3.2-hadoop2-scala_2.10
This is from the log of the container running TaskManager:
Starts ok...
Starting Task Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting taskmanager as a console application on host 00afd4130a94.
Then there are some errors (scroll right):
2017-11-02 14:06:51,064 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2017-11-02 14:06:51,065 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2017-11-02 14:06:51,067 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address jobmanager/10.0.0.2:6123.
2017-11-02 14:06:54,578 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:54,779 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:54,829 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:54,880 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:54,931 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:54,981 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:55,031 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:55,032 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:56,034 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:57,036 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,037 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,038 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:58,138 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:58,339 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:58,389 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,439 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,490 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:58,541 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:59,593 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:07:00,595 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:07:01,600 WARN org.apache.flink.runtime.net.ConnectionUtils - Could not connect to jobmanager/10.0.0.2:6123. Selecting a local address using heuristics.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address '00afd4130a94' (10.0.0.5) for communication.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at 00afd4130a94:0.
2017-11-02 14:07:01,947 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 14:07:01,978 INFO Remoting - Starting remoting
2017-11-02 14:07:02,168 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@00afd4130a94:33881]
2017-11-02 14:07:02,174 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
2017-11-02 14:07:02,192 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: 00afd4130a94/10.0.0.5, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2017-11-02 14:07:02,199 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2017-11-02 14:07:02,201 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 29 GB, usable 25 GB (86.21% usable)
2017-11-02 14:07:02,286 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 101 MB for network buffer pool (number of memory segments: 3260, bytes per segment: 32768).
2017-11-02 14:07:02,393 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2017-11-02 14:07:02,400 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 2 ms).
2017-11-02 14:07:02,434 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 32 ms). Listening on SocketAddress /10.0.0.5:42921.
2017-11-02 14:07:02,493 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2017-11-02 14:07:02,498 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-e57d51fa-2269-4df0-9910-0fe26c6042bd for spill files.
2017-11-02 14:07:02,501 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 14:07:02,553 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-2c0c063f-464e-48f1-9fb8-fcfa48868e3a
2017-11-02 14:07:02,564 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-0c5e2b25-70a2-4964-9eec-24b0e79d560e
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1719715507.
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: df5992297d269fa16a5e945e1dce0451 @ 00afd4130a94 (dataPort=42921)
2017-11-02 14:07:02,573 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 2 task slot(s).
2017-11-02 14:07:02,574 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 113/1024/1024 MB, NON HEAP: 33/33/-1 MB (used/committed/max)]
2017-11-02 14:07:02,576 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-11-02 14:07:03,106 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
2017-11-02 14:07:04,126 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
Here is the log from the container running JobManager:
Starting Job Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting jobmanager as a console application on host c30e0fe7b765.
2017-11-02 13:42:33,721 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 @ 10:23:11 UTC)
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: flink
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.141-b15
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 981 MiBytes
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: /docker-java-home/jre
2017-11-02 13:42:33,799 INFO org.apache.flink.runtime.jobmanager.JobManager - Hadoop version: 2.7.2
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - /opt/flink/conf
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - cluster
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Classpath: /opt/flink/lib/flink-python_2.11-1.3.2.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.3.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.3.2.jar:::
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2017-11-02 13:42:33,911 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /opt/flink/conf
2017-11-02 13:42:33,914 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,924 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager without high-availability
2017-11-02 13:42:33,926 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on jobmanager:6123 with execution mode CLUSTER
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,962 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to flink (auth:SIMPLE)
2017-11-02 13:42:34,026 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system reachable at jobmanager:6123
2017-11-02 13:42:34,290 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 13:42:34,327 INFO Remoting - Starting remoting
2017-11-02 13:42:34,505 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@jobmanager:6123]
2017-11-02 13:42:34,524 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set.
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'jobmanager.web.log.path'.
2017-11-02 13:42:34,532 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-9f0ba581-3488-4086-a79c-53e17b56352c for the web interface files
2017-11-02 13:42:34,533 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-17a58ccf-7d8b-475e-b727-4a7935a19c0f for web frontend JAR file uploads
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
2017-11-02 13:42:34,751 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-d10b620a-73ae-40af-bd23-aad5211fe1cc
2017-11-02 13:42:34,752 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2017-11-02 13:42:34,763 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 13:42:34,769 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager on port 8081
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@jobmanager:6123/user/jobmanager:00000000-0000-0000-0000-000000000000.
2017-11-02 13:42:34,776 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@jobmanager:6123/user/jobmanager.
2017-11-02 13:42:34,785 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
2017-11-02 13:42:34,801 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2017-11-02 13:42:34,814 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#844712453] - leader session 00000000-0000-0000-0000-000000000000
Why can't the TaskManagers talk to JobManager? I wonder if there's some configuration missing. Any help will be much appreciated. Thank you very much!
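One way to narrow this down is to check whether the TaskManager container can open a plain TCP connection to the JobManager's RPC port at all. The following is only a rough sketch: the helper name `can_connect` is made up for illustration, and the host `jobmanager` and port 6123 are simply the values from the config file shown above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False
```

Running `can_connect("jobmanager", 6123)` from inside the TaskManager container would show whether the name resolves and the port is reachable. A `False` result points at container networking or name resolution; a `True` result suggests looking at the Flink/Akka configuration instead.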

ConfigurationException while launching Apache Cassandra DB: This node was decommissioned and will not rejoin the ring

This is a snippet from the system log while shutting down:
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:28:50,995 StorageService.java:3788 - Announcing that I have left the ring for 30000ms
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,995 ThriftServer.java:142 - Stop listening to thrift clients
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Server.java:182 - Stop listening for CQL clients
WARN [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 MessagingService.java:786 - Waiting for messaging service to quiesce
INFO [ACCEPT-sysengplayl0127.bio-iad.ea.com/10.72.194.229] 2016-07-27 22:29:20,998 MessagingService.java:1133 - MessagingService has terminated the accept() thread
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:21,022 StorageService.java:1411 - DECOMMISSIONED
INFO [main] 2016-07-27 22:32:17,534 YamlConfigurationLoader.java:89 - Configuration location: file:/opt/cassandra/product/apache-cassandra-3.7/conf/cassandra.yaml
And then while starting up:
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:630 - Cassandra version: 3.7
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:631 - Thrift API version: 20.1.0
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:632 - CQL supported versions: 3.4.2 (default: 3.4.2)
INFO [main] 2016-07-27 22:32:20,351 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 397 MB and a resize interval of 60 minutes
ERROR [main] 2016-07-27 22:32:20,357 CassandraDaemon.java:731 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: This node was decommissioned and will not rejoin the ring unless cassandra.override_decommission=true has been set, or all existing data is removed and the node is bootstrapped again
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:815) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:725) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:370) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:585) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:714) [apache-cassandra-3.7.jar:3.7]
WARN [StorageServiceShutdownHook] 2016-07-27 22:32:20,358 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [StorageServiceShutdownHook] 2016-07-27 22:32:20,359 MessagingService.java:786 - Waiting for messaging service to quiesce
Is there something wrong with the configuration?
I faced the same issue.
Posting the answer in case it helps others.
As the log suggests, the property "cassandra.override_decommission" needs to be set to true.
Start Cassandra with:
cassandra -Dcassandra.override_decommission=true
This should add the node back to the cluster.

Decommission a Cassandra node

I just decommissioned one of the nodes in a 3-node Cassandra cluster (all running Cassandra 3.3). One of the remaining nodes is constantly printing:
DEBUG [GossipTasks:1] 2016-05-29 15:30:16,770 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:17,770 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:18,771 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:19,771 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
DEBUG [GossipTasks:1] 2016-05-29 15:30:20,771 Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false
This has been showing up in the logs about once a second for half a day or so. Any idea why? What does this mean?
Thanks
Edit:
I noticed both nodes have been printing this message in the logs for the past 48 hours!
It's normal to see this message for up to 72 hours after the node is decommissioned.
More details here:
https://issues.apache.org/jira/browse/CASSANDRA-10371
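To see how often each endpoint is being convicted, and when the messages should stop if the 72-hour figure above holds, something like this sketch may help. The function names and the regular expression are my own, written against the log lines quoted in the question. The 72 hours is taken from this answer, not read from any Cassandra setting.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Matches the Gossiper DEBUG lines quoted in the question, e.g.
# "... Gossiper.java:336 - Convicting /10.80.64.33 with status LEFT - alive false"
CONVICT = re.compile(r"Convicting /(?P<ip>\S+) with status (?P<status>\w+)")


def count_convictions(lines):
    """Count 'Convicting' messages per (endpoint, status) pair."""
    counts = Counter()
    for line in lines:
        match = CONVICT.search(line)
        if match:
            counts[(match.group("ip"), match.group("status"))] += 1
    return counts


def expected_quiet_time(decommissioned_at, hours=72):
    """Time after which the messages should stop, assuming the 72-hour window."""
    return decommissioned_at + timedelta(hours=hours)
```

For example, `count_convictions(open("debug.log"))` gives a per-endpoint tally (the file name is just a placeholder for wherever your debug log lives), and `expected_quiet_time(datetime(2016, 5, 29, 15, 30))` gives the point after which the messages would be worth investigating.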

Resources