I have deployed this PostgreSQL image to the IBM Cloud, Google Cloud Platform and Microsoft Azure using Kubernetes. https://github.com/paunin/PostDock
It was successfully deployed on all 3 platforms with identical configurations and an identical process. The IBM cloud fails with the error "psql: FATAL: password authentication failed for user "replica_user""
You can find below the logs from all 3 cloud platforms. Has anyone experienced this?
IBM Cloud Log
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> TUNING UP POSTGRES...
>>> Cleaning data folder which might have some garbage...
psql: FATAL: password authentication failed for user "replica_user"
psql: could not connect to server: Connection refused
Is the server running on host "cyclos-postgres-node2-service" (172.30.65.206) and accepting
TCP/IP connections on port 5432?
>>> Auto-detected master name: ''
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait db replica_db on cyclos-postgres-node1-service:5432(user: replica_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
psql: FATAL: password authentication failed for user "replica_user"
>>>>>> Db replica_db is still not accessable on cyclos-postgres-node1-service:5432 (will try 30 times more)
....
The last couple of lines are then repeated many times.
This is the log file from deploying the same application, using identical processes on the Google Cloud. It works just fine on the Google Cloud Platform.
Google Cloud Log
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> TUNING UP POSTGRES...
>>> Cleaning data folder which might have some garbage...
psql: could not connect to server: Connection refused
Is the server running on host "cyclos-postgres-node1-service" (10.52.0.11) and accepting
TCP/IP connections on port 5432?
psql: could not connect to server: Connection refused
Is the server running on host "cyclos-postgres-node2-service" (10.52.0.12) and accepting
TCP/IP connections on port 5432?
>>> Auto-detected master name: ''
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait db replica_db on cyclos-postgres-node1-service:5432(user: replica_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
psql: could not connect to server: Connection refused
Is the server running on host "cyclos-postgres-node1-service" (10.52.0.11) and accepting
TCP/IP connections on port 5432?
>>>>>> Db replica_db is still not accessable on cyclos-postgres-node1-service:5432 (will try 30 times more)
>>>>>> Db replica_db is still not accessable on cyclos-postgres-node1-service:5432 (will try 29 times more)
psql: could not connect to server: Connection refused
Is the server running on host "cyclos-postgres-node1-service" (10.52.0.11) and accepting
TCP/IP connections on port 5432?
psql: could not connect to server: Connection refused
Is the server running on host "cyclos-postgres-node1-service" (10.52.0.11) and accepting
TCP/IP connections on port 5432?
>>>>>> Db replica_db is still not accessable on cyclos-postgres-node1-service:5432 (will try 28 times more)
>>>>>> Db replica_db exists on cyclos-postgres-node1-service:5432!
>>> REPLICATION_UPSTREAM_NODE_ID=1
>>> Sending in background postgres start...
>>> Waiting for upstream postgres server...
>>> Wait db replica_db on cyclos-postgres-node1-service:5432(user: replica_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Db replica_db exists on cyclos-postgres-node1-service:5432!
>>> Starting standby node...
>>> Instance hasn't been set up yet.
>>> Clonning primary node...
>>> Waiting for upstream postgres server...
>>> Wait db replica_db on cyclos-postgres-node1-service:5432(user: replica_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
NOTICE: destination directory '/var/lib/postgresql/data' provided
INFO: connecting to upstream node
INFO: Successfully connected to upstream node. Current installation size is 34 MB
INFO: checking and correcting permissions on existing directory /var/lib/postgresql/data ...
>>>>>> Db replica_db exists on cyclos-postgres-node1-service:5432!
NOTICE: starting backup (using pg_basebackup)...
INFO: executing: '/usr/lib/postgresql/9.5/bin/pg_basebackup -l "repmgr base backup" -D /var/lib/postgresql/data -h cyclos-postgres-node1-service -p 5432 -U replica_user -c fast -X stream '
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example : pg_ctl -D /var/lib/postgresql/data start
HINT: After starting the server, you need to register this standby with "repmgr standby register"
[REPMGR EVENT] Node id: 2; Event type: standby_clone; Success [1|0]: 1; Time: 2018-02-02 13:24:32.87843+00; Details: Cloned from host 'cyclos-postgres-node1-service', port 5432; backup method: pg_basebackup; --force: Y
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to exists file
>>> Starting postgres...
>>> Waiting for local postgres server start...
>>> Wait db replica_db on cyclos-postgres-node2-service:5432(user: replica_user,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
LOG: incomplete startup packet
LOG: incomplete startup packet
LOG: database system was interrupted; last known up at 2018-02-02 13:24:31 UTC
FATAL: the database system is starting up
psql: FATAL: the database system is starting up
>>>>>> Db replica_db is still not accessable on cyclos-postgres-node2-service:5432 (will try 60 times more)
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000F8
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
>>>>>> Db replica_db exists on cyclos-postgres-node2-service:5432!
>>> Waiting for replication on this node is over(if any in progress): CLEAN_UP_ON_FAIL=, INTERVAL=30
>>> Replication is done
>>> Unregister the node if it was done before
DELETE 0
>>> Registering node with role standby
INFO: connecting to standby database
INFO: connecting to master database
INFO: retrieving node list for cluster 'postgres_cluster'
INFO: registering the standby
[REPMGR EVENT] Node id: 2; Event type: standby_register; Success [1|0]: 1; Time: 2018-02-02 13:24:51.891592+00; Details:
INFO: standby registration complete
NOTICE: standby node correctly registered for cluster postgres_cluster with id 2 (conninfo: user=replica_user password=replica_pass host=cyclos-postgres-node2-service dbname=replica_db port=5432 connect_timeout=2)
Locking standby (NEW_UPSTREAM_NODE_ID=1)...
>>> Starting repmgr daemon...
[2018-02-02 13:24:53] [NOTICE] looking for configuration file in current directory
[2018-02-02 13:24:53] [NOTICE] looking for configuration file in /etc
[2018-02-02 13:24:53] [NOTICE] configuration file found at: /etc/repmgr.conf
[2018-02-02 13:24:53] [INFO] connecting to database 'user=replica_user password=replica_pass host=cyclos-postgres-node2-service dbname=replica_db port=5432 connect_timeout=2'
[2018-02-02 13:24:53] [INFO] connected to database, checking its state
[2018-02-02 13:24:53] [INFO] connecting to master node of cluster 'postgres_cluster'
[2018-02-02 13:24:53] [INFO] retrieving node list for cluster 'postgres_cluster'
[2018-02-02 13:24:53] [INFO] checking role of cluster node '1'
[2018-02-02 13:24:53] [INFO] checking cluster configuration with schema 'repmgr_postgres_cluster'
[2018-02-02 13:24:53] [INFO] checking node 2 in cluster 'postgres_cluster'
[2018-02-02 13:24:53] [INFO] reloading configuration file
[2018-02-02 13:24:53] [INFO] configuration has not changed
[2018-02-02 13:24:53] [INFO] starting continuous standby node monitoring
ERROR: cannot execute DELETE in a read-only transaction
STATEMENT: DELETE FROM repmgr_postgres_cluster.repl_nodes WHERE conninfo LIKE '%host=cyclos-postgres-node3-service%'
And on the Azure Cloud, it works just fine as well.
Azure Cloud Log
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> TUNING UP POSTGRES...
>>> Cleaning data folder which might have some garbage...
psql: could not connect to server: Connection refused
Is the server running on host "cyclos-postgres-node2-service" (10.244.0.9) and accepting
TCP/IP connections on port 5432?
>>> Auto-detected master name: 'cyclos-postgres-node1-service'
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait db replica_db on cyclos-postgres-node1-service:5432(user: replica_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Db replica_db exists on cyclos-postgres-node1-service:5432!
>>> REPLICATION_UPSTREAM_NODE_ID=1
>>> Sending in background postgres start...
>>> Waiting for upstream postgres server...
>>> Wait db replica_db on cyclos-postgres-node1-service:5432(user: replica_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Db replica_db exists on cyclos-postgres-node1-service:5432!
>>> Starting standby node...
>>> Instance hasn't been set up yet.
>>> Clonning primary node...
>>> Waiting for upstream postgres server...
>>> Wait db replica_db on cyclos-postgres-node1-service:5432(user: replica_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
NOTICE: destination directory '/var/lib/postgresql/data' provided
INFO: connecting to upstream node
>>>>>> Db replica_db exists on cyclos-postgres-node1-service:5432!
INFO: Successfully connected to upstream node. Current installation size is 34 MB
INFO: checking and correcting permissions on existing directory /var/lib/postgresql/data ...
NOTICE: starting backup (using pg_basebackup)...
INFO: executing: '/usr/lib/postgresql/9.5/bin/pg_basebackup -l "repmgr base backup" -D /var/lib/postgresql/data -h cyclos-postgres-node1-service -p 5432 -U replica_user -c fast -X stream '
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example : pg_ctl -D /var/lib/postgresql/data start
HINT: After starting the server, you need to register this standby with "repmgr standby register"
[REPMGR EVENT] Node id: 2; Event type: standby_clone; Success [1|0]: 1; Time: 2018-02-02 06:50:47.340146+00; Details: Cloned from host 'cyclos-postgres-node1-service', port 5432; backup method: pg_basebackup; --force: Y
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to exists file
>>> Starting postgres...
>>> Waiting for local postgres server start...
>>> Wait db replica_db on cyclos-postgres-node2-service:5432(user: replica_user,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
LOG: incomplete startup packet
LOG: database system was interrupted; last known up at 2018-02-02 06:50:46 UTC
LOG: incomplete startup packet
FATAL: the database system is starting up
psql: FATAL: the database system is starting up
>>>>>> Db replica_db is still not accessable on cyclos-postgres-node2-service:5432 (will try 60 times more)
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/2000130
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
>>>>>> Db replica_db exists on cyclos-postgres-node2-service:5432!
>>> Waiting for replication on this node is over(if any in progress): CLEAN_UP_ON_FAIL=, INTERVAL=30
>>> Replication is done
>>> Unregister the node if it was done before
DELETE 0
>>> Registering node with role standby
INFO: connecting to standby database
INFO: connecting to master database
INFO: retrieving node list for cluster 'postgres_cluster'
INFO: registering the standby
[REPMGR EVENT] Node id: 2; Event type: standby_register; Success [1|0]: 1; Time: 2018-02-02 06:51:05.083455+00; Details:
INFO: standby registration complete
NOTICE: standby node correctly registered for cluster postgres_cluster with id 2 (conninfo: user=replica_user password=replica_pass host=cyclos-postgres-node2-service dbname=replica_db port=5432 connect_timeout=2)
Locking standby (NEW_UPSTREAM_NODE_ID=1)...
>>> Starting repmgr daemon...
[2018-02-02 06:51:05] [NOTICE] looking for configuration file in current directory
[2018-02-02 06:51:05] [NOTICE] looking for configuration file in /etc
[2018-02-02 06:51:05] [NOTICE] configuration file found at: /etc/repmgr.conf
[2018-02-02 06:51:05] [INFO] connecting to database 'user=replica_user password=replica_pass host=cyclos-postgres-node2-service dbname=replica_db port=5432 connect_timeout=2'
[2018-02-02 06:51:06] [INFO] connected to database, checking its state
[2018-02-02 06:51:06] [INFO] connecting to master node of cluster 'postgres_cluster'
[2018-02-02 06:51:06] [INFO] retrieving node list for cluster 'postgres_cluster'
[2018-02-02 06:51:06] [INFO] checking role of cluster node '1'
[2018-02-02 06:51:06] [INFO] checking cluster configuration with schema 'repmgr_postgres_cluster'
[2018-02-02 06:51:06] [INFO] checking node 2 in cluster 'postgres_cluster'
[2018-02-02 06:51:06] [INFO] reloading configuration file
[2018-02-02 06:51:06] [INFO] configuration has not changed
[2018-02-02 06:51:06] [INFO] starting continuous standby node monitoring
ERROR: cannot execute DELETE in a read-only transaction
STATEMENT: DELETE FROM repmgr_postgres_cluster.repl_nodes WHERE conninfo LIKE '%host=cyclos-postgres-node3-service%'
I was able to run this on a paid cluster in IBM Cloud and it appears to be working. I did NOT use the persistent volumes and I was on a paid cluster. Please note that persistent volumes are not available on free clusters, so if you are testing on a free cluster you will get issues if you use persistent volumes.
My cluster has 3 workers of size u2c.2x4 (the smallest available) and is on the default version of Kubernetes for IBM Cloud (1.8.6), if that helps you debug at all. Please try again or if your setup is different than mine, let me know and I can try with a matching setup.
$ kubectl logs --namespace=mysystem mysystem-db-node1-0
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> TUNING UP POSTGRES...
>>> Cleaning data folder which might have some garbage...
psql: could not translate host name "mysystem-db-node1-service" to address: Name or service not known
psql: could not translate host name "mysystem-db-node2-service" to address: Name or service not known
>>> Auto-detected master name: ''
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
>>> Sending in background postgres start...
>>> Waiting for local postgres server start...
>>> Wait db replica_db on mysystem-db-node1-service:5432(user: replica_user,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
psql: could not translate host name "mysystem-db-node3-service" to address: Name or service not known
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
psql: could not connect to server: Connection refused
Is the server running on host "mysystem-db-node1-service" (172.30.207.54) and accepting
TCP/IP connections on port 5432?
selecting default shared_buffers ... >>>>>> Db replica_db is still not accessable on mysystem-db-node1-service:5432 (will try 60 times more)
128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/postgresql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
pg_ctl -D /var/lib/postgresql/data -l logfile start
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....LOG: could not bind IPv6 socket: Cannot assign requested address
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
LOG: database system was shut down at 2018-02-14 15:40:14 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
done
server started
CREATE DATABASE
CREATE ROLE
/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/entrypoint.sh
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Config file was replaced with standard one!
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>> Creating replication user 'replica_user'
CREATE ROLE
>>> Creating replication db 'replica_db'
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
waiting for server to shut down....LOG: shutting down
LOG: database system is shut down
done
server stopped
PostgreSQL init process complete; ready for start up.
LOG: database system was shut down at 2018-02-14 15:40:16 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: incomplete startup packet
LOG: incomplete startup packet
>>>>>> Db replica_db exists on mysystem-db-node1-service:5432!
>>> Registering node with role master
INFO: connecting to master database
INFO: master register: creating database objects inside the 'repmgr_mysystem_cluster' schema
INFO: retrieving node list for cluster 'mysystem_cluster'
[REPMGR EVENT] Node id: 1; Event type: master_register; Success [1|0]: 1; Time: 2018-02-14 15:40:27.337393+00; Details:
[REPMGR EVENT] will execute script '/usr/local/bin/cluster/repmgr/events/execs/master_register.sh' for the event
[REPMGR EVENT::master_register] Node id: 1; Event type: master_register; Success [1|0]: 1; Time: 2018-02-14 15:40:27.337393+00; Details:
[REPMGR EVENT::master_register] Locking master...
[REPMGR EVENT::master_register] Unlocking standby...
NOTICE: master node correctly registered for cluster 'mysystem_cluster' with id 1 (conninfo: user=replica_user password=replica_pass host=mysystem-db-node1-service dbname=replica_db port=5432 connect_timeout=2)
>>> Starting repmgr daemon...
[2018-02-14 15:40:27] [NOTICE] looking for configuration file in current directory
[2018-02-14 15:40:27] [NOTICE] looking for configuration file in /etc
[2018-02-14 15:40:27] [NOTICE] configuration file found at: /etc/repmgr.conf
[2018-02-14 15:40:27] [INFO] connecting to database 'user=replica_user password=replica_pass host=mysystem-db-node1-service dbname=replica_db port=5432 connect_timeout=2'
[2018-02-14 15:40:27] [INFO] connected to database, checking its state
[2018-02-14 15:40:27] [INFO] checking cluster configuration with schema 'repmgr_mysystem_cluster'
[2018-02-14 15:40:27] [INFO] checking node 1 in cluster 'mysystem_cluster'
[2018-02-14 15:40:27] [INFO] reloading configuration file
[2018-02-14 15:40:27] [INFO] configuration has not changed
[2018-02-14 15:40:27] [INFO] starting continuous master connection check
This happens even when the nodes are in the same subnet.
I am using the Docker-Flink project in:
https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
I am creating the services with the following commands:
docker network create -d overlay overlay
docker service create --name jobmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager -p 8081:8081 --network overlay --constraint 'node.hostname == ubuntu-swarm-manager' flink jobmanager
docker service create --name taskmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager --network overlay --constraint 'node.hostname != ubuntu-swarm-manager' flink taskmanager
This is the error I get:
- Trying to register at JobManager akka.tcp://flink#jobmanager:6123/ user/jobmanager (attempt 4, timeout: 4000 milliseconds)
These are my environment configurations:
node: ubuntu-swarm-master Azure VM Standard D4s v3 (4 vcpus, 16 GB
memory) Docker version 17.03.1-ce, build c6d412e
node: azure-swarm-worker-1 Azure VM Standard D2 v2 Promo (2 vcpus, 7
GB memory) Docker version 17.09.0-ce, build afdb6d4
Flink: using image 1.3.2-hadoop2-scala_2.10
This is from the log of the container running TaskManager:
Starts ok...
Starting Task Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting taskmanager as a console application on host 00afd4130a94.
Then there are some errors (scroll right):
2017-11-02 14:06:51,064 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2017-11-02 14:06:51,065 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2017-11-02 14:06:51,067 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address jobmanager/10.0.0.2:6123.
2017-11-02 14:06:54,578 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:54,779 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:54,829 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:54,880 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:54,931 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:54,981 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:55,031 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:55,032 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:56,034 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:57,036 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,037 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,038 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:58,138 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:58,339 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:58,389 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,439 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,490 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:58,541 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:59,593 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:07:00,595 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:07:01,600 WARN org.apache.flink.runtime.net.ConnectionUtils - Could not connect to jobmanager/10.0.0.2:6123. Selecting a local address using heuristics.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address '00afd4130a94' (10.0.0.5) for communication.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at 00afd4130a94:0.
2017-11-02 14:07:01,947 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 14:07:01,978 INFO Remoting - Starting remoting
2017-11-02 14:07:02,168 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink#00afd4130a94:33881]
2017-11-02 14:07:02,174 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
2017-11-02 14:07:02,192 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: 00afd4130a94/10.0.0.5, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2017-11-02 14:07:02,199 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2017-11-02 14:07:02,201 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 29 GB, usable 25 GB (86.21% usable)
2017-11-02 14:07:02,286 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 101 MB for network buffer pool (number of memory segments: 3260, bytes per segment: 32768).
2017-11-02 14:07:02,393 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2017-11-02 14:07:02,400 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 2 ms).
2017-11-02 14:07:02,434 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 32 ms). Listening on SocketAddress /10.0.0.5:42921.
2017-11-02 14:07:02,493 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2017-11-02 14:07:02,498 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-e57d51fa-2269-4df0-9910-0fe26c6042bd for spill files.
2017-11-02 14:07:02,501 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 14:07:02,553 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-2c0c063f-464e-48f1-9fb8-fcfa48868e3a
2017-11-02 14:07:02,564 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-0c5e2b25-70a2-4964-9eec-24b0e79d560e
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1719715507.
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: df5992297d269fa16a5e945e1dce0451 # 00afd4130a94 (dataPort=42921)
2017-11-02 14:07:02,573 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 2 task slot(s).
2017-11-02 14:07:02,574 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 113/1024/1024 MB, NON HEAP: 33/33/-1 MB (used/committed/max)]
2017-11-02 14:07:02,576 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-11-02 14:07:03,106 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
2017-11-02 14:07:04,126 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
Here is the log from the container running JobManager:
Starting Job Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting jobmanager as a console application on host c30e0fe7b765.
2017-11-02 13:42:33,721 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 # 10:23:11 UTC)
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: flink
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.141-b15
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 981 MiBytes
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: /docker-java-home/jre
2017-11-02 13:42:33,799 INFO org.apache.flink.runtime.jobmanager.JobManager - Hadoop version: 2.7.2
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - /opt/flink/conf
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - cluster
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Classpath: /opt/flink/lib/flink-python_2.11-1.3.2.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.3.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.3.2.jar:::
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2017-11-02 13:42:33,911 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /opt/flink/conf
2017-11-02 13:42:33,914 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,924 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager without high-availability
2017-11-02 13:42:33,926 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on jobmanager:6123 with execution mode CLUSTER
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,962 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to flink (auth:SIMPLE)
2017-11-02 13:42:34,026 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system reachable at jobmanager:6123
2017-11-02 13:42:34,290 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 13:42:34,327 INFO Remoting - Starting remoting
2017-11-02 13:42:34,505 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink#jobmanager:6123]
2017-11-02 13:42:34,524 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set.
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'jobmanager.web.log.path'.
2017-11-02 13:42:34,532 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-9f0ba581-3488-4086-a79c-53e17b56352c for the web interface files
2017-11-02 13:42:34,533 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-17a58ccf-7d8b-475e-b727-4a7935a19c0f for web frontend JAR file uploads
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
2017-11-02 13:42:34,751 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-d10b620a-73ae-40af-bd23-aad5211fe1cc
2017-11-02 13:42:34,752 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2017-11-02 13:42:34,763 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 13:42:34,769 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager on port 8081
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#jobmanager:6123/user/jobmanager:00000000-0000-0000-0000-000000000000.
2017-11-02 13:42:34,776 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink#jobmanager:6123/user/jobmanager.
2017-11-02 13:42:34,785 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink#jobmanager:6123/user/jobmanager
2017-11-02 13:42:34,801 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2017-11-02 13:42:34,814 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#844712453] - leader session 00000000-0000-0000-0000-000000000000
Why can't the TaskManagers talk to JobManager? I wonder if there's some configuration missing. Any help will be much appreciated. Thank you very much!
This is a snippet from the system log while shutting down:
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:28:50,995 StorageService.java:3788 - Announcing that I have left the ring for 30000ms
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,995 ThriftServer.java:142 - Stop listening to thrift clients
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Server.java:182 - Stop listening for CQL clients
WARN [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 MessagingService.java:786 - Waiting for messaging service to quiesce
INFO [ACCEPT-sysengplayl0127.bio-iad.ea.com/10.72.194.229] 2016-07-27 22:29:20,998 MessagingService.java:1133 - MessagingService has terminated the accept() thread
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:21,022 StorageService.java:1411 - DECOMMISSIONED
INFO [main] 2016-07-27 22:32:17,534 YamlConfigurationLoader.java:89 - Configuration location: file:/opt/cassandra/product/apache-cassandra-3.7/conf/cassandra.yaml
And then while starting up:
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:630 - Cassandra version: 3.7
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:631 - Thrift API version: 20.1.0
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:632 - CQL supported versions: 3.4.2 (default: 3.4.2)
INFO [main] 2016-07-27 22:32:20,351 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 397 MB and a resize interval of 60 minutes
ERROR [main] 2016-07-27 22:32:20,357 CassandraDaemon.java:731 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: This node was decommissioned and will not rejoin the ring unless cassandra.override_decommission=true has been set, or all existing data is removed and the node is bootstrapped again
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:815) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:725) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:370) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:585) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:714) [apache-cassandra-3.7.jar:3.7]
WARN [StorageServiceShutdownHook] 2016-07-27 22:32:20,358 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [StorageServiceShutdownHook] 2016-07-27 22:32:20,359 MessagingService.java:786 - Waiting for messaging service to quiesce
Is there something wrong with the configuration?
I had faced same issue.
Posting the answer so that it might help others.
As the log suggests, the property "cassandra.override_decommission" should be overridden.
start cassandra with the syntax:
cassandra -Dcassandra.override_decommission=true
This should add the node back to the cluster.
After install of the Spark package at Linux (SuSE SLES 12) I see the following connectivity error ("failed to connect"), which beside the Spark slave process also impacts the "pyspark" examples, rejecting connections. Any hint how to activate the port 7077 connectivity via localhost addresses is welcome. Part of the problem might be the default Linux firewall settings.
Firewall Commands to Open Localhost addresses:
sudo iptables -A INPUT -s 127.0.0.1 -d 127.0.0.1 -j ACCEPT
sudo iptables -A INPUT -s 127.0.0.1 -d zbra2016 -j ACCEPT
Starting the Spark Master - commands:
export SPARK_LOCAL_IP=zbra2016
./sbin/stop-master.sh
./sbin/start-master.sh
16/04/19 10:12:29 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/04/19 10:12:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/19 10:12:29 INFO SecurityManager: Changing view acls to: linux1
16/04/19 10:12:29 INFO SecurityManager: Changing modify acls to: linux1
16/04/19 10:12:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(linux1); users with modify permissions: Set(linux1)
16/04/19 10:12:30 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/04/19 10:12:30 INFO Master: Starting Spark master at spark://zbra2016:7077
16/04/19 10:12:30 INFO Master: Running Spark version 1.6.1
16/04/19 10:12:30 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
16/04/19 10:12:30 INFO Utils: Successfully started service 'MasterUI' on port 8081.
16/04/19 10:12:30 INFO MasterWebUI: Started MasterWebUI at http://localhost:8081
16/04/19 10:12:30 INFO Utils: Successfully started service on port 6066.
16/04/19 10:12:30 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/04/19 10:12:31 INFO Master: I have been elected leader! New state: ALIVE
Starting the Spark Worker - commands:
./sbin/stop-slave.sh
./sbin/start-slave.sh spark://zbra2016:7077
Logfile displays a "Failed to Connect Error Message":
/data/spark/spark/logs/spark-linux1-org.apache.spark.deploy.worker.Worker-1-zbra2016.out
16/04/19 10:15:46 INFO Worker: Retrying connection to master (attempt # 1)
16/04/19 10:15:46 INFO Worker: Connecting to master zbra2016:7077...
16/04/19 10:15:47 WARN Worker: Failed to connect to master zbra2016:7077
java.io.IOException: Failed to connect to zbra2016/127.0.0.1:7077
Testing connectivity of alias: zbra2016 = localhost
linux1#zbra2016:/data/spark/spark> ping zbra2016
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.022 ms
We just found a solution for it in the setup of the Linux iptables firewall. I used the following command to open localhost traffic:
iptables -I INPUT 1 -p all -s localhost -d localhost -j ACCEPT
Now the worker process is able to connect to the master through the localhost ports.
You may be able to change the settings allowing port 7077 through your firewall.
Try:
sudo ufw allow 7077
I'm trying to setup a gridgain cluster in a cloud environment (opensciencedatacloud.org).
I've verified that UDP multicast is available and port 47400 is open in this environment, but unfortunately GridGain is unable to find the other nodes when they are launched. Do you have clue why it is not working.
Following you can find below the a cluster node log:
INFO o.g.grid.kernal.GridKernal%nextflow - Config URL: n/a
INFO o.g.grid.kernal.GridKernal%nextflow - Daemon mode: off
INFO o.g.grid.kernal.GridKernal%nextflow - OS: Linux 2.6.32-358.2.1.el6.x86_64 amd64
INFO o.g.grid.kernal.GridKernal%nextflow - OS user: root
INFO o.g.grid.kernal.GridKernal%nextflow - Language runtime: Groovy
INFO o.g.grid.kernal.GridKernal%nextflow - VM information: Java(TM) SE Runtime Environment 1.7.0_51-b13 Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 24.51-b03
INFO o.g.grid.kernal.GridKernal%nextflow - VM total memory: 0.83GB
INFO o.g.grid.kernal.GridKernal%nextflow - Remote Management [restart: off, REST: on, JMX (remote: off)]
INFO o.g.grid.kernal.GridKernal%nextflow - GRIDGAIN_HOME=/root
INFO o.g.grid.kernal.GridKernal%nextflow - VM arguments: [-Djava.awt.headless=true]
WARN o.g.grid.kernal.GridKernal%nextflow - SMTP is not configured - email notifications are off.
INFO o.g.grid.kernal.GridKernal%nextflow - Configured caches ['allSessions']
INFO o.g.grid.kernal.GridKernal%nextflow - 3-rd party licenses can be found at: /root/libs/licenses
INFO o.g.grid.kernal.GridKernal%nextflow - Local node user attribute [ROLE=worker]
[gridgain-#5%pub-nextflow%] WARN o.g.grid.kernal.GridDiagnostic - Initial heap size is less than 512MB (59MB). It is highly recommended to allocate at least 512MB of initial heap to run GridGain. Use -Xms512m -Xmx512m to set initial heap size.
INFO o.g.grid.kernal.GridKernal%nextflow - Non-loopback local IPs: 172.16.1.98, fe80:0:0:0:78b5:53ff:fe01:643b%3, fe80:0:0:0:f816:3eff:fe54:f4e8%2, 172.17.42.1
INFO o.g.grid.kernal.GridKernal%nextflow - Enabled local MACs: FA163E54F4E8, 7AB55301643B
INFO o.g.g.s.c.t.GridTcpCommunicationSpi - IPC shared memory server endpoint started [port=48100, tokDir=/root/work/ipc/shmem/cf5dbd14-4bb8-420b-998f-820056aa6d1c-2646]
INFO o.g.g.s.c.t.GridTcpCommunicationSpi - Successfully bound shared memory communication to TCP port [port=48100, locHost=0.0.0.0/0.0.0.0]
INFO o.g.g.s.c.t.GridTcpCommunicationSpi - Successfully bound to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0]
WARN o.g.g.s.c.noop.GridNoopCheckpointSpi - Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
INFO o.g.grid.kernal.GridKernal%nextflow - Security status [authentication=off, secure-session=off]
WARN o.g.g.k.p.cache.GridCacheProcessor - Cache write synchronization mode is set to FULL_ASYNC. All single-key 'put' and 'remove' operations will return 'null', all 'putx' and 'removex' operations will return 'true'.
WARN o.g.g.k.p.cache.GridCacheProcessor - Automatically set write order mode to PRIMARY for write synchronization mode [writeSynchronizationMode=FULL_ASYNC, cacheName=allSessions]
WARN o.g.g.k.p.cache.GridCacheProcessor - Query indexing is disabled (queries will not work) for cache: 'allSessions'. To enable change GridCacheConfiguration.isQueryIndexEnabled() property.
INFO o.g.g.k.p.cache.GridCacheDgcManager - <allSessions> DGC trace log disabled.
INFO o.g.g.k.p.cache.GridCacheProcessor - Started cache [name=allSessions, mode=REPLICATED]
INFO org.eclipse.jetty.server.Server - jetty-9.0.5.v20130815
INFO o.e.jetty.server.ServerConnector - Started ServerConnector#7b9617a0{HTTP/1.1}{0.0.0.0:8080}
INFO o.g.g.k.p.r.p.h.j.GridJettyRestProtocol - Command protocol successfully started [name=Jetty REST, host=/0.0.0.0, port=8080]
INFO o.g.g.k.p.r.p.t.GridTcpRestProtocol - Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11211]
INFO o.g.g.s.d.tcp.GridTcpDiscoverySpi - Successfully bound to TCP port [port=47500, localHost=/172.16.1.98]
WARN o.g.g.s.d.t.i.m.GridTcpDiscoveryMulticastIpFinder - GridTcpDiscoveryMulticastIpFinder has no pre-configured addresses (it is recommended in production to specify at least one address in GridTcpDiscoveryMulticastIpFinder.getAddresses() configuration property)
>>> +------------------------------------------------------------------------------------+
>>> GridGain ver. platform-os-6.0.2#20140323-sha1:f9c796a1b29d2d7ce2737e681cbe578b5315d79f
>>> +------------------------------------------------------------------------------------+
>>> OS name: Linux 2.6.32-358.2.1.el6.x86_64 amd64
>>> CPU(s): 2
>>> Heap: 0.83GB
>>> VM name: 2646#node.novalocal
>>> Grid name: nextflow
>>> Local node [ID=CF5DBD14-4BB8-420B-998F-820056AA6D1C, order=1]
>>> Local node addresses: [node.novalocal/172.16.1.98]
>>> Local ports: TCP:8080 TCP:11211 TCP:47100 TCP:47500 TCP:48100
>>> GridGain documentation: http://www.gridgain.com/documentation
INFO o.g.g.k.m.d.GridDiscoveryManager - Topology snapshot [ver=1, nodes=1, CPUs=2, heap=0.83GB]
Usually software firewalls prevent multicast packets. Can you try with firewall disabled on your system?