Trouble with RetryUpToMaximumCountWithFixedSleep of Spark job running in Dataproc - apache-spark

I am running a Spark Streaming job on a Google Dataproc cluster. I want to test what happens when Dataproc abruptly stops or fails while the streaming job is still running, so I stopped the Dataproc cluster manually while the job was running.
After stopping it, I got the following INFO messages in the log:
22/10/12 06:46:39 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:39 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:40 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:40 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:41 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:41 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
The problem is that the job just hangs at this point. The RetryUpToMaximumCountWithFixedSleep counter never reaches 10, so the job's status stays 'running' in the Dataproc console. I want the job to fail instead.
Is there any way to reduce the maxRetries of RetryUpToMaximumCountWithFixedSleep? I want to reduce it to 2. I also don't understand why the job hangs and the retry count never reaches 10.
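A minimal sketch of the settings that appear to control this policy, assuming the standard Hadoop client properties ipc.client.connect.max.retries and ipc.client.connect.retry.interval (the maxRetries and sleepTime in the log message) and the YARN client properties yarn.resourcemanager.connect.max-wait.ms and yarn.resourcemanager.connect.retry-interval.ms. Spark copies any property prefixed with spark.hadoop. into the Hadoop configuration. The class name, method name, and values below are illustrative, and whether these settings alone make the Dataproc job fail has not been verified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RetrySettings {
    // Spark forwards properties carrying this prefix to the Hadoop Configuration.
    static final String SPARK_HADOOP_PREFIX = "spark.hadoop.";

    /** Builds Spark properties that lower the IPC and ResourceManager retry limits. */
    static Map<String, String> sparkProperties(int maxRetries, long sleepMillis,
                                               long rmMaxWaitMillis, long rmRetryIntervalMillis) {
        Map<String, String> props = new LinkedHashMap<>();
        // Inner policy: the maxRetries and sleepTime shown in the log message.
        props.put(SPARK_HADOOP_PREFIX + "ipc.client.connect.max.retries",
                String.valueOf(maxRetries));
        props.put(SPARK_HADOOP_PREFIX + "ipc.client.connect.retry.interval",
                String.valueOf(sleepMillis));
        // Outer policy: how long the YARN client keeps trying to reach the ResourceManager.
        props.put(SPARK_HADOOP_PREFIX + "yarn.resourcemanager.connect.max-wait.ms",
                String.valueOf(rmMaxWaitMillis));
        props.put(SPARK_HADOOP_PREFIX + "yarn.resourcemanager.connect.retry-interval.ms",
                String.valueOf(rmRetryIntervalMillis));
        return props;
    }

    public static void main(String[] args) {
        // Illustrative values: 2 retries, 1 s apart; give up on the ResourceManager after 10 s.
        sparkProperties(2, 1000L, 10_000L, 2_000L)
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Each printed key=value pair could then be passed as a Spark property when the job is submitted.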

Related

spring cloud kafka streams terminating in azure kubernetes service

I created a Kafka Streams application with Spring Cloud Stream that reads data from one topic and writes to another. I'm trying to deploy and run it in AKS from an ACR image, but the stream is closed without any error after it has read all the available messages (lag 0) in the topic. The strange thing is that it runs fine in IntelliJ.
Here are my AKS pod logs:
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=62): org.apache.kafka.common.requests.FetchResponse#7b021a01
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=49) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=49,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63): org.apache.kafka.common.requests.FetchResponse#50fb365c
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=50) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=64) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=50,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,710] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext#dc9876b, started on Tue Mar 02 17:29:08 GMT 2021
[2021-03-02 17:30:39,715] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext#71391b3f, started on Tue Mar 02 17:29:12 GMT 2021, parent: org.springframework.context.annotation.AnnotationConfigApplicationContext#dc9876b
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:369] Stopping beans in phase 2147483547
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:242] Bean 'org.springframework.kafka.config.internalKafkaListenerEndpointRegistry' completed its stop procedure
[2021-03-02 17:30:39,719] [DEBUG] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:1016] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] Stopping Streams client with timeoutMillis = 10000 ms.
[2021-03-02 17:30:39,719] [INFO] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:287] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:1116] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Informed to shut down
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:221] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:772] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State already transits to PENDING_SHUTDOWN, skipping the run once call after poll request
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:206] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Ignoring request to transit from PENDING_SHUTDOWN to PENDING_SHUTDOWN: only DEAD state is a valid next state
[2021-03-02 17:30:39,788] [INFO] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:1130] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Shutting down
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.AssignedStreamsTasks AssignedStreamsTasks.java:529] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Clean shutdown of all active tasks
Please advise.

Stop retrying connect to YARN resource manager after limited attempts in Java

I am using the org.apache.spark.deploy.yarn.Client API in Java to submit a Spark application to YARN:
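// Imports the snippet relies on (StringUtils is assumed to be Spring's, given hasText).
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.springframework.util.StringUtils;

// appName, appJarPath, appMainClass and queue are supplied by the caller.
SparkConf sparkConf = new SparkConf();
List<String> submitArgs = new ArrayList<>();
if (StringUtils.hasText(appName)) {
    submitArgs.add("--name");
    submitArgs.add(appName);
    sparkConf.setAppName(appName);
}
submitArgs.add("--jar");
submitArgs.add(appJarPath);
submitArgs.add("--class");
submitArgs.add(appMainClass);
System.setProperty("SPARK_YARN_MODE", "true");
// Submit in cluster mode to the given YARN queue.
sparkConf.setMaster("yarn")
    .set("spark.submit.deployMode", "cluster")
    .set("spark.yarn.queue", queue);
ClientArguments clientArguments = new ClientArguments(submitArgs.toArray(new String[submitArgs.size()]));
Client client = new Client(clientArguments, sparkConf);
client.run();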
I am trying to create a scenario in which the client cannot connect to the ResourceManager: after a number of attempts, it should time out and throw an exception. However, the client keeps retrying the connection to the YARN ResourceManager without ever timing out. See the console output below:
2018-09-18 14:20:25.046 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.yarn.client.RMProxy : Connecting to ResourceManager at /0.0.0.0:8032
2018-09-18 14:20:27.600 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:29.603 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:31.600 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:33.607 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:35.608 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:37.619 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:39.620 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:41.623 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:43.633 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:20:45.632 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:18.665 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:20.676 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:22.679 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:24.686 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:26.691 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:28.695 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:30.697 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-09-18 14:21:32.708 INFO 10480 --- [nio-8080-exec-1] org.apache.hadoop.ipc.Client : Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
What custom configuration should I apply, and how, so that the retries stop after a limited number of attempts or a limited time?
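In the output above, the "Already tried" counter runs from 0 to 9 and then starts again at 0 roughly half a minute later, which suggests a second, outer retry loop wrapped around the IPC-level policy. The sketch below models that nesting to show why the counter resets instead of ending in an exception. All names are hypothetical, and the 900000 ms and 30000 ms figures are assumed defaults for yarn.resourcemanager.connect.max-wait.ms and yarn.resourcemanager.connect.retry-interval.ms; check them against yarn-default.xml for the Hadoop version in use.

```java
import java.util.ArrayList;
import java.util.List;

class NestedRetryModel {
    /** Returns the "Already tried N time(s)" values a client would log. */
    static List<Integer> loggedCounters(int innerMaxRetries, int outerRounds) {
        List<Integer> counters = new ArrayList<>();
        for (int round = 0; round < outerRounds; round++) {
            // Inner policy: fixed-sleep retries, counted from 0 in every round.
            for (int tried = 0; tried < innerMaxRetries; tried++) {
                counters.add(tried);
            }
            // Outer policy: wait for its own interval, then start a fresh round.
        }
        return counters;
    }

    /** Number of outer rounds when the outer policy is bounded by a total wait time. */
    static long maxOuterRounds(long maxWaitMillis, long outerIntervalMillis) {
        return maxWaitMillis / outerIntervalMillis;
    }

    public static void main(String[] args) {
        // Two rounds of ten inner retries: the counter resets after 9.
        System.out.println(loggedCounters(10, 2));
        // Assumed defaults: 900000 ms total wait, 30000 ms between rounds.
        System.out.println(maxOuterRounds(900_000L, 30_000L));
    }
}
```

Under this model, lowering only the inner maxRetries shortens each round but does not bound the total time; the outer wait would need to be reduced as well.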

Docker-Flink: TaskManagers can't find JobManager when in different nodes in Docker Swarm

This happens even when the nodes are in the same subnet.
I am using the Docker-Flink project in:
https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
I am creating the services with the following commands:
docker network create -d overlay overlay
docker service create --name jobmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager -p 8081:8081 --network overlay --constraint 'node.hostname == ubuntu-swarm-manager' flink jobmanager
docker service create --name taskmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager --network overlay --constraint 'node.hostname != ubuntu-swarm-manager' flink taskmanager
This is the error I get:
- Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
These are my environment configurations:
node: ubuntu-swarm-master, Azure VM Standard D4s v3 (4 vcpus, 16 GB memory), Docker version 17.03.1-ce, build c6d412e
node: azure-swarm-worker-1, Azure VM Standard D2 v2 Promo (2 vcpus, 7 GB memory), Docker version 17.09.0-ce, build afdb6d4
Flink: using image 1.3.2-hadoop2-scala_2.10
This is from the log of the container running TaskManager:
Starts ok...
Starting Task Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting taskmanager as a console application on host 00afd4130a94.
Then there are some errors (scroll right):
2017-11-02 14:06:51,064 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2017-11-02 14:06:51,065 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2017-11-02 14:06:51,067 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address jobmanager/10.0.0.2:6123.
2017-11-02 14:06:54,578 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:54,779 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:54,829 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:54,880 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:54,931 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:54,981 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:55,031 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:55,032 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:56,034 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:57,036 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,037 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,038 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:58,138 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:58,339 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:58,389 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,439 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,490 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:58,541 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:59,593 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:07:00,595 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:07:01,600 WARN org.apache.flink.runtime.net.ConnectionUtils - Could not connect to jobmanager/10.0.0.2:6123. Selecting a local address using heuristics.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address '00afd4130a94' (10.0.0.5) for communication.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at 00afd4130a94:0.
2017-11-02 14:07:01,947 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 14:07:01,978 INFO Remoting - Starting remoting
2017-11-02 14:07:02,168 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink#00afd4130a94:33881]
2017-11-02 14:07:02,174 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
2017-11-02 14:07:02,192 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: 00afd4130a94/10.0.0.5, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2017-11-02 14:07:02,199 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2017-11-02 14:07:02,201 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 29 GB, usable 25 GB (86.21% usable)
2017-11-02 14:07:02,286 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 101 MB for network buffer pool (number of memory segments: 3260, bytes per segment: 32768).
2017-11-02 14:07:02,393 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2017-11-02 14:07:02,400 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 2 ms).
2017-11-02 14:07:02,434 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 32 ms). Listening on SocketAddress /10.0.0.5:42921.
2017-11-02 14:07:02,493 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2017-11-02 14:07:02,498 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-e57d51fa-2269-4df0-9910-0fe26c6042bd for spill files.
2017-11-02 14:07:02,501 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 14:07:02,553 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-2c0c063f-464e-48f1-9fb8-fcfa48868e3a
2017-11-02 14:07:02,564 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-0c5e2b25-70a2-4964-9eec-24b0e79d560e
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1719715507.
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: df5992297d269fa16a5e945e1dce0451 # 00afd4130a94 (dataPort=42921)
2017-11-02 14:07:02,573 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 2 task slot(s).
2017-11-02 14:07:02,574 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 113/1024/1024 MB, NON HEAP: 33/33/-1 MB (used/committed/max)]
2017-11-02 14:07:02,576 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-11-02 14:07:03,106 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
2017-11-02 14:07:04,126 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
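The registration timeouts in these messages double with each attempt (500, 1000, 2000 and, in the error quoted earlier, 4000 milliseconds). A minimal sketch of that doubling pattern follows. The class name, method name, and the 30-second cap are hypothetical; this models the logged values and is not Flink's actual code.

```java
class RegistrationBackoff {
    /** Timeout for a 1-based attempt number, doubling from initialMillis up to capMillis. */
    static long timeoutMillis(int attempt, long initialMillis, long capMillis) {
        long timeout = initialMillis;
        // Double once per earlier attempt; stop at the cap to avoid overflow.
        for (int i = 1; i < attempt && timeout < capMillis; i++) {
            timeout *= 2;
        }
        return Math.min(timeout, capMillis);
    }

    public static void main(String[] args) {
        // Hypothetical cap of 30000 ms; the first four values match the log.
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.println("attempt " + attempt + ", timeout: "
                    + timeoutMillis(attempt, 500L, 30_000L) + " milliseconds");
        }
    }
}
```

The short gaps between the logged attempts fit this pattern: each attempt is given its own timeout before the next one starts, so the messages keep coming while the JobManager stays unreachable.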
Here is the log from the container running JobManager:
Starting Job Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting jobmanager as a console application on host c30e0fe7b765.
2017-11-02 13:42:33,721 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 # 10:23:11 UTC)
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: flink
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.141-b15
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 981 MiBytes
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: /docker-java-home/jre
2017-11-02 13:42:33,799 INFO org.apache.flink.runtime.jobmanager.JobManager - Hadoop version: 2.7.2
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - /opt/flink/conf
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - cluster
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Classpath: /opt/flink/lib/flink-python_2.11-1.3.2.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.3.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.3.2.jar:::
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2017-11-02 13:42:33,911 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /opt/flink/conf
2017-11-02 13:42:33,914 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,924 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager without high-availability
2017-11-02 13:42:33,926 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on jobmanager:6123 with execution mode CLUSTER
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,962 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to flink (auth:SIMPLE)
2017-11-02 13:42:34,026 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system reachable at jobmanager:6123
2017-11-02 13:42:34,290 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 13:42:34,327 INFO Remoting - Starting remoting
2017-11-02 13:42:34,505 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@jobmanager:6123]
2017-11-02 13:42:34,524 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set.
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'jobmanager.web.log.path'.
2017-11-02 13:42:34,532 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-9f0ba581-3488-4086-a79c-53e17b56352c for the web interface files
2017-11-02 13:42:34,533 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-17a58ccf-7d8b-475e-b727-4a7935a19c0f for web frontend JAR file uploads
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
2017-11-02 13:42:34,751 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-d10b620a-73ae-40af-bd23-aad5211fe1cc
2017-11-02 13:42:34,752 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2017-11-02 13:42:34,763 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 13:42:34,769 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager on port 8081
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@jobmanager:6123/user/jobmanager:00000000-0000-0000-0000-000000000000.
2017-11-02 13:42:34,776 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@jobmanager:6123/user/jobmanager.
2017-11-02 13:42:34,785 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
2017-11-02 13:42:34,801 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2017-11-02 13:42:34,814 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#844712453] - leader session 00000000-0000-0000-0000-000000000000
Why can't the TaskManagers talk to the JobManager? I wonder if some configuration is missing. Any help would be much appreciated. Thank you very much!

Configuring hadoop with Blob storage on Azure

We are testing a Hadoop HA cluster on Azure v2 running on Linux, and we are trying to switch to Azure Blob storage. We are not sure how to configure the name nodes when using Blob storage. We are getting the following error:
2015-12-22 13:05:50,193 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread...
Checkpointing active NN at http://bd-azure-qa-nn2:50070
Serving checkpoints at http://bd-azure-qa-nn1:50070
2015-12-22 13:07:50,240 INFO ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(269)) - Triggering log roll on remote NameNode bd-azure-qa-nn2/10.0.0.7:8020
2015-12-22 13:07:51,387 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:52,391 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:53,400 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:54,416 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:55,425 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:56,450 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:57,456 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:58,462 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:07:59,473 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:08:00,478 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:08:01,482 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:08:02,490 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:08:03,501 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:08:04,515 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:08:05,520 INFO ipc.Client (Client.java:handleConnectionFailure(858)) - Retrying connect to server: bd-azure-qa-nn2/10.0.0.7:8020. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-12-22 13:08:05,966 WARN ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(274)) - Unable to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1719)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1352)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6339)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:933)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy14.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:145)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
We are not sure how to configure the name nodes with Azure Blob storage, since Blob storage effectively takes over the HDFS functionality.
We have a standard HA name node configuration in hdfs-site.xml, and we modified core-site.xml with the following properties to enable Blob storage.
<property>
<name>fs.AbstractFileSystem.wasb.impl</name>
<value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
<name>fs.azure.account.key.OUR_STORAGE_ACCOUNT.blob.core.windows.net</name>
<value>"OUR_KEY"</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>wasb://blob-hdfs@OUR_STORAGE_ACCOUNT.blob.core.windows.net</value>
<final>true</final>
</property>
<property>
<name>fs.azure.page.blob.dir</name>
<value>/datadir</value>
</property>
<property>
<name>fs.azure.selfthrottling.read.factor</name>
<value>1.000000</value>
</property>
<property>
<name>fs.azure.selfthrottling.write.factor</name>
<value>1.000000</value>
</property>
In the original cluster, our HA name service was set as follows:
<!-- <property>
<name>fs.defaultFS</name>
<value>hdfs://namenodeha</value>
</property> -->
We didn't touch hdfs-site.xml at all.
We are not sure about the name node settings. The two name nodes from the original setup are probably overkill, since the underlying Blob storage should handle replication and similar concerns.
Could someone please clarify?
The following properties were missing from our hdfs-site.xml; adding them made it work.
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:50475</value>
</property>
<property>
<name>dfs.namenode.https-address.namenodeha.nn1</name>
<value>bd-azure-qa-nn1:50470</value>
</property>
<property>
<name>dfs.namenode.https-address.namenodeha.nn2</name>
<value>bd-azure-qa-nn2:50470</value>
</property>
<property>
<name>dfs.journalnode.https-address</name>
<value>0.0.0.0:8481</value>
</property>
Use the appropriate host names for your own cluster, of course.
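As a quick sanity check, the sketch below parses a Hadoop-style site file and reports which of the four properties listed above are absent. It is only an illustration: the helper name missing_properties and the sample XML are assumptions made for this example, not part of any Hadoop tooling.

```python
import xml.etree.ElementTree as ET

# Property names taken from the answer above; adjust for your own name service.
REQUIRED = [
    "dfs.datanode.https.address",
    "dfs.namenode.https-address.namenodeha.nn1",
    "dfs.namenode.https-address.namenodeha.nn2",
    "dfs.journalnode.https-address",
]


def missing_properties(xml_text, required=REQUIRED):
    """Return the required property names not defined in a *-site.xml document."""
    root = ET.fromstring(xml_text)
    # Collect every <name> found inside a <property> element.
    defined = {
        (prop.findtext("name") or "").strip()
        for prop in root.iter("property")
    }
    return [name for name in required if name not in defined]


# Hypothetical sample: a config that defines only one of the four properties.
SAMPLE = """<configuration>
  <property>
    <name>dfs.datanode.https.address</name>
    <value>0.0.0.0:50475</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for name in missing_properties(SAMPLE):
        print("missing:", name)
```

To check a real file, read its contents and pass the text to missing_properties in place of SAMPLE.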

Port in the core-site.xml file

I tried to create a folder in the HDFS file system with the command
./hadoop fs -mkdir /user/hadoop
and as a result I received the following messages
13/02/17 09:45:50 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 0 time(s).
13/02/17 09:45:51 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 1 time(s).
13/02/17 09:45:52 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 2 time(s).
13/02/17 09:45:53 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 3 time(s).
13/02/17 09:45:54 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 4 time(s).
13/02/17 09:45:55 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 5 time(s).
13/02/17 09:45:56 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 6 time(s).
13/02/17 09:45:57 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 7 time(s).
13/02/17 09:45:58 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 8 time(s).
13/02/17 09:45:59 INFO ipc.Client: Retrying connect to server: one/192.168.1.8:9000. Already tried 9 time(s).
Bad connection to FS. command aborted. exception: Call to one/192.168.1.8:9000 failed on connection exception: java.net.ConnectException: Connection refused
In the /export/hadoop-1.0.1/conf/core-site.xml file the following is specified:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.8:9000</value>
</property>
</configuration>
Did I specify the port correctly? If not, what should it be?
It seems the Hadoop daemons are not running. Check with jps.
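If no NameNode process is running, nothing is listening on the port, which would be consistent with the "Connection refused" error. A minimal Python sketch of that check follows. The helper name port_is_open is hypothetical, and the host and port are simply the ones from the question's core-site.xml.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port from fs.default.name in the question; change for your setup.
    host, port = "192.168.1.8", 9000
    state = "open" if port_is_open(host, port) else "closed or unreachable"
    print(f"{host}:{port} is {state}")
```

A closed port here says nothing about whether 9000 is the right value. It only shows that no process is currently accepting connections on it.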

Resources