Spring Cloud Kafka Streams terminating in Azure Kubernetes Service (AKS)

I created a Kafka Streams application with Spring Cloud Stream that reads data from one topic and writes to another. I'm trying to deploy and run the job in AKS with an ACR image, but the stream gets closed, without any error, after reading all the available messages in the topic (lag 0). The strange thing is that it runs fine in IntelliJ.
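For context, the topology itself isn't shown, but the description (consume one topic, produce to another) maps to a Spring Cloud Stream function-style binding along these lines; the bean name and the transform below are placeholders, not the actual code:

// Minimal sketch of the kind of processor described (placeholder logic).
// Bound to topics via spring.cloud.stream.bindings.process-in-0.destination
// and process-out-0.destination in application.yml.
import java.util.function.Function;
import org.apache.kafka.streams.kstream.KStream;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class StreamProcessor {
    @Bean
    public Function<KStream<String, String>, KStream<String, String>> process() {
        return input -> input.mapValues(String::toUpperCase); // placeholder transform
    }
}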
Here are my AKS pod logs:
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=62): org.apache.kafka.common.requests.FetchResponse@7b021a01
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=49) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=49,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63): org.apache.kafka.common.requests.FetchResponse@50fb365c
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=50) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=64) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=50,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,710] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@dc9876b, started on Tue Mar 02 17:29:08 GMT 2021
[2021-03-02 17:30:39,715] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@71391b3f, started on Tue Mar 02 17:29:12 GMT 2021, parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@dc9876b
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:369] Stopping beans in phase 2147483547
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:242] Bean 'org.springframework.kafka.config.internalKafkaListenerEndpointRegistry' completed its stop procedure
[2021-03-02 17:30:39,719] [DEBUG] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:1016] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] Stopping Streams client with timeoutMillis = 10000 ms.
[2021-03-02 17:30:39,719] [INFO] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:287] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:1116] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Informed to shut down
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:221] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:772] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State already transits to PENDING_SHUTDOWN, skipping the run once call after poll request
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:206] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Ignoring request to transit from PENDING_SHUTDOWN to PENDING_SHUTDOWN: only DEAD state is a valid next state
[2021-03-02 17:30:39,788] [INFO] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:1130] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Shutting down
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.AssignedStreamsTasks AssignedStreamsTasks.java:529] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Clean shutdown of all active tasks
Please advise.

Related

Trouble with RetryUpToMaximumCountWithFixedSleep of Spark job running in Dataproc

I am running a Spark Streaming job in a Google Dataproc cluster. I want to test what happens when Dataproc abruptly stops or fails while the streaming job is still running, so I stopped the Dataproc cluster manually while the job was running.
Once I stopped it, I got the following in the log:
22/10/12 06:46:39 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:39 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:40 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:40 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:41 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/10/12 06:46:41 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: airflow-stream-test-m/10.***:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
The problem is that the job just hangs right there: RetryUpToMaximumCountWithFixedSleep never reaches the count of 10, so the job's status is still shown as 'running' in the Dataproc console. But I want the job to fail.
Is there any way to reduce RetryUpToMaximumCountWithFixedSleep? I want to reduce it to 2. I also don't understand why the job hangs and the retry count never increments to 10.
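For what it's worth, the policy printed in the log (maxRetries=10, sleepTime=1000 MILLISECONDS) matches Hadoop's IPC client connect-retry defaults, and an outer RM-proxy retry loop re-invoking that inner policy would explain why the counter keeps resetting instead of reaching 10. A sketch of the knobs involved; the values below are illustrative, not a verified fix for this exact hang:

<!-- core-site.xml: the inner IPC policy seen in the log line above -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>2</value>
</property>
<!-- yarn-site.xml: the outer RM-connection retry window; lowering it makes
     the client give up sooner instead of re-entering the inner policy -->
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>60000</value>
</property>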

Cassandra issue while adding jmx_prometheus

I want to add Cassandra monitoring using Prometheus, following https://blog.pythian.com/step-step-monitoring-cassandra-prometheus-grafana/.
When I add the following line to /etc/cassandra/cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7070:/opt/jmx_prometheus/cassandra.yml"
I get an error:
ubuntu@ip-172-21-0-111:~$ sudo service cassandra status
● cassandra.service - LSB: distributed storage system for structured data
Loaded: loaded (/etc/init.d/cassandra; bad; vendor preset: enabled)
Active: active (exited) since Mon 2020-04-13 05:43:38 UTC; 3s ago
Docs: man:systemd-sysv-generator(8)
Process: 3557 ExecStop=/etc/init.d/cassandra stop (code=exited, status=0/SUCCESS)
Process: 3570 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Starting LSB: distributed storage system for structured data...
Apr 13 05:43:38 ip-172-21-0-111 systemd[1]: Started LSB: distributed storage system for structured data.
ubuntu@ip-172-21-0-111:~$ nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
When I remove the jmx_prometheus entry, it works:
ubuntu@ip-172-21-0-111:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.21.0.111 1.83 GiB 128 100.0% b52324d0-c57f-46e3-bc10-a6dc07bae17a rack1
ubuntu@ip-172-21-0-111:~$ tail -f /var/log/cassandra/system.log
INFO [main] 2020-04-13 05:37:36,609 StorageService.java:2169 - Node /172.21.0.111 state jump to NORMAL
INFO [main] 2020-04-13 05:37:36,617 CassandraDaemon.java:673 - Waiting for gossip to settle before accepting client requests...
INFO [main] 2020-04-13 05:37:44,621 CassandraDaemon.java:704 - No gossip backlog; proceeding
INFO [main] 2020-04-13 05:37:44,713 NativeTransportService.java:70 - Netty using native Epoll event loop
INFO [main] 2020-04-13 05:37:44,773 Server.java:161 - Using Netty Version: [netty-buffer=netty-buffer-4.0.36.Final.e8fa848, netty-codec=netty-codec-4.0.36.Final.e8fa848, netty-codec-haproxy=netty-codec-haproxy-4.0.36.Final.e8fa848, netty-codec-http=netty-codec-http-4.0.36.Final.e8fa848, netty-codec-socks=netty-codec-socks-4.0.36.Final.e8fa848, netty-common=netty-common-4.0.36.Final.e8fa848, netty-handler=netty-handler-4.0.36.Final.e8fa848, netty-tcnative=netty-tcnative-1.1.33.Fork15.906a8ca, netty-transport=netty-transport-4.0.36.Final.e8fa848, netty-transport-native-epoll=netty-transport-native-epoll-4.0.36.Final.e8fa848, netty-transport-rxtx=netty-transport-rxtx-4.0.36.Final.e8fa848, netty-transport-sctp=netty-transport-sctp-4.0.36.Final.e8fa848, netty-transport-udt=netty-transport-udt-4.0.36.Final.e8fa848]
INFO [main] 2020-04-13 05:37:44,773 Server.java:162 - Starting listening for CQL clients on /172.21.0.111:9042 (unencrypted)...
INFO [main] 2020-04-13 05:37:44,811 CassandraDaemon.java:505 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO [SharedPool-Worker-1] 2020-04-13 05:37:46,625 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
INFO [OptionalTasks:1] 2020-04-13 05:37:46,752 CassandraRoleManager.java:339 - Created default superuser role 'cassandra'
It worked! I changed the port from 7070 to 7071:
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7071:/opt/jmx_prometheus/cassandra.yml"
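The fact that moving the agent off 7070 fixed it suggests something else was already bound to that port on the host; a quick way to check (standard Linux tooling, nothing Cassandra-specific) would be:

# See what, if anything, is already listening on the original agent port
sudo lsof -i :7070
# or equivalently
sudo ss -tlnp | grep 7070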

A YARN worker node does not join the cluster when I create a Spark cluster on Dataproc

I have created a Spark cluster on Dataproc with 1 master and 6 worker nodes.
On the GCP console I can see 6 VMs running, but I only see 5 nodes in the YARN NodeManager UI.
When I SSH into the missing machine, the yarn-yarn-nodemanager log shows that it keeps restarting and trying to re-register with the ResourceManager.
How can I make this node rejoin the cluster?
Update: my command:
gcloud dataproc clusters create ${GCS_CLUSTER} \
--image pyspark-with-conda \
--bucket test-spark-data \
--zone asia-east1-b \
--master-boot-disk-size 500GB \
--master-machine-type n1-standard-2 \
--num-masters 1 \
--num-workers 2 \
--worker-machine-type n1-standard-8 \
--num-preemptible-workers 4 \
--preemptible-worker-boot-disk-size 500GB
Error message:
2018-08-22 08:25:24,801 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at test-spark-cluster-m/10.140.0.34:8031
2018-08-22 08:25:24,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2018-08-22 08:25:24,843 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2018-08-22 08:25:24,978 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
2018-08-22 08:25:24,979 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,081 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,084 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2018-08-22 08:25:25,185 INFO org.apache.hadoop.ipc.Server: Stopping server on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
2018-08-22 08:25:25,205 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2018-08-22 08:25:25,205 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,205 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,205 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2018-08-22 08:25:25,206 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,208 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
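The "Disallowed NodeManager" message means the ResourceManager is rejecting the node's registration, which usually points at the RM's include/exclude host lists. A hedged check to run on the master (property names are standard YARN; file paths vary by image):

# See which host lists the RM is enforcing
grep -A1 "nodes.include-path\|nodes.exclude-path" /etc/hadoop/conf/yarn-site.xml
# After fixing the lists, tell the RM to re-read them
yarn rmadmin -refreshNodes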
OP confirmed that this issue was resolved and they did not encounter it again.

Hive on Spark infinite connections

I'm trying to run a simple query on a table with only 10 rows:
select MAX(Column3) from table;
However, the Spark application runs indefinitely, printing the following messages:
2017-05-10T16:23:40,397 DEBUG [IPC Parameter Sending Thread #0] ipc.Client: IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu sending #1841
2017-05-10T16:23:40,397 DEBUG [IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu] ipc.Client: IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu got value #1841
2017-05-10T16:23:40,397 DEBUG [main] ipc.ProtobufRpcEngine: Call: getApplicationReport took 0ms
2017-05-10T16:23:41,397 DEBUG [main] security.UserGroupInformation: PrivilegedAction as:ubuntu (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)
2017-05-10T16:23:41,398 DEBUG [IPC Parameter Sending Thread #0] ipc.Client: IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu sending #1842
2017-05-10T16:23:41,398 DEBUG [IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu] ipc.Client: IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu got value #1842
2017-05-10T16:23:41,398 DEBUG [main] ipc.ProtobufRpcEngine: Call: getApplicationReport took 1ms
2017-05-10T16:23:41,399 DEBUG [main] security.UserGroupInformation: PrivilegedAction as:ubuntu (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)
2017-05-10T16:23:41,399 DEBUG [IPC Parameter Sending Thread #0] ipc.Client: IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu sending #1843
2017-05-10T16:23:41,399 DEBUG [IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu] ipc.Client: IPC Client (1360312263) connection to /0.0.0.0:8032 from ubuntu got value #1843
2017-05-10T16:23:41,399 DEBUG [main] ipc.ProtobufRpcEngine: Call: getApplicationReport took 0ms
The issue was related to an unhealthy node, so YARN was not able to assign the task. The solution was to increase the maximum disk utilization percentage YARN tolerates in yarn-site.xml, because my disk was at 97% usage:
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>99</value>
</property>
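Before raising the threshold, one way to confirm this diagnosis (both commands are standard YARN/Linux CLI) is to check the node states and the actual disk usage; the default for max-disk-utilization-per-disk-percentage is 90:

# Unhealthy nodes show up with state UNHEALTHY here
yarn node -list -all
# Compare disk usage on the worker against the 90% default threshold
df -h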

ConfigurationException while launching Apache Cassandra DB: This node was decommissioned and will not rejoin the ring

This is a snippet from the system log while shutting down:
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:28:50,995 StorageService.java:3788 - Announcing that I have left the ring for 30000ms
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,995 ThriftServer.java:142 - Stop listening to thrift clients
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Server.java:182 - Stop listening for CQL clients
WARN [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:20,997 MessagingService.java:786 - Waiting for messaging service to quiesce
INFO [ACCEPT-sysengplayl0127.bio-iad.ea.com/10.72.194.229] 2016-07-27 22:29:20,998 MessagingService.java:1133 - MessagingService has terminated the accept() thread
INFO [RMI TCP Connection(12)-127.0.0.1] 2016-07-27 22:29:21,022 StorageService.java:1411 - DECOMMISSIONED
INFO [main] 2016-07-27 22:32:17,534 YamlConfigurationLoader.java:89 - Configuration location: file:/opt/cassandra/product/apache-cassandra-3.7/conf/cassandra.yaml
And then while starting up:
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:630 - Cassandra version: 3.7
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:631 - Thrift API version: 20.1.0
INFO [main] 2016-07-27 22:32:20,316 StorageService.java:632 - CQL supported versions: 3.4.2 (default: 3.4.2)
INFO [main] 2016-07-27 22:32:20,351 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 397 MB and a resize interval of 60 minutes
ERROR [main] 2016-07-27 22:32:20,357 CassandraDaemon.java:731 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: This node was decommissioned and will not rejoin the ring unless cassandra.override_decommission=true has been set, or all existing data is removed and the node is bootstrapped again
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:815) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:725) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625) ~[apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:370) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:585) [apache-cassandra-3.7.jar:3.7]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:714) [apache-cassandra-3.7.jar:3.7]
WARN [StorageServiceShutdownHook] 2016-07-27 22:32:20,358 Gossiper.java:1508 - No local state or state is in silent shutdown, not announcing shutdown
INFO [StorageServiceShutdownHook] 2016-07-27 22:32:20,359 MessagingService.java:786 - Waiting for messaging service to quiesce
Is there something wrong with the configuration?
I had faced the same issue. Posting the answer so that it might help others.
As the log suggests, the property cassandra.override_decommission should be overridden. Start Cassandra with:
cassandra -Dcassandra.override_decommission=true
This should add the node back to the cluster.
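The exception message itself names the alternative: remove all existing data and bootstrap the node again. A sketch of that path, assuming default package-install data locations (check data_file_directories in cassandra.yaml before deleting anything):

# Wipe-and-rebootstrap alternative; paths are the packaged defaults
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data /var/lib/cassandra/commitlog /var/lib/cassandra/saved_caches
sudo service cassandra start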
