We have a Ceph cluster with 408 OSDs, 3 MONs and 3 RGWs. A few days ago we upgraded the cluster from Nautilus 14.2.14 to Octopus 15.2.12. Since the upgrade, the garbage-collector process that runs after the lifecycle process causes slow ops and makes some OSDs restart. In each run the garbage collector deletes about 1 million objects. Below is the log of one of the OSDs before it restarts.
2021-07-18T00:44:38.807+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:38.807+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:39.847+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:39.847+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:39.847+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:40.895+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:40.895+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:40.895+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:41.859+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:41.859+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:41.859+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:42.811+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:42.811+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
What is a suitable GC configuration for such a heavy delete workload so that it does not cause slow ops? We had the same delete load on Nautilus and never had this problem there.
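For reference, the knobs that control how aggressively the RGW garbage collector runs are the rgw_gc_* options. The values below are only an illustrative sketch of where one might throttle GC, not recommendations, and depending on how your RGW daemons are named the settings may need to go under the specific client.rgw.<instance> section rather than the generic client.rgw mask:

# Illustrative throttling of RGW GC so it competes less with client I/O
ceph config set client.rgw rgw_gc_max_concurrent_io 5    # parallel GC I/Os per RGW (default 10)
ceph config set client.rgw rgw_gc_max_trim_chunk 8       # fewer GC log entries trimmed per op
ceph config set client.rgw rgw_gc_obj_min_wait 7200      # let deleted objects age before GC picks them up
# rgw_gc_processor_period and rgw_gc_max_objs control how often GC runs and how many shards it touches per cycle.

Whether Octopus changed GC behaviour relative to Nautilus is a separate question for the Ceph tracker or mailing list; the snippet only shows where the throttles live.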
We are running Spark jobs on Kubernetes (EKS, not EMR) using the Spark operator.
After some time, some executors receive SIGNAL TERM. An example log from an executor:
Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.
On the driver side, after 2 minutes without heartbeats, the driver decides to kill the executor:
Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
I tried to check whether we are crossing some resource limit at the Kubernetes level, but couldn't find anything like that.
What can I look for to understand why Kubernetes is killing the executors?
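A few things worth checking (pod and namespace names below are placeholders):

# Why did the executor pod stop? In the container's last state, exit code 137
# usually means OOMKilled/SIGKILL, 143 means SIGTERM (eviction, scale-down, node loss).
kubectl describe pod <executor-pod> -n <namespace>

# Events around the time of the kill: evictions, node NotReady, autoscaler scale-down, preemption
kubectl get events -n <namespace> --sort-by=.lastTimestamp
kubectl get events -A --field-selector involvedObject.kind=Node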
FOLLOW UP:
I missed a log message on the driver side:
Mar 01 21:04:23.471 Disabling executor 50.
and then on the executor side:
Mar 01 21:04:23.348 RECEIVED SIGNAL TERM
I looked at which class writes the "Disabling executor" log message and found KubernetesDriverEndpoint. It seems that the onDisconnected method is called for all these executors, and this method calls disableExecutor in DriverEndpoint.
So the question now is why these executors are considered disconnected.
Looking at the explanation on this site:
https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback
it says:
Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
But I couldn't find any WARN logs on the driver side. Any suggestions?
The reason the executors were being killed was that we were running them on AWS spot instances. This is how it looks: the first sign we saw of an executor being killed is this line in its log:
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Once we moved the executors to on-demand instances as well, not a single executor was terminated in 20-hour jobs.
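For anyone who wants to keep Spark pods off spot capacity rather than changing the whole node group, one option (a sketch, not part of the setup above) is Spark's Kubernetes node-selector configuration; EKS managed node groups label nodes with eks.amazonaws.com/capacityType:

# Sketch: schedule Spark pods only onto on-demand nodes on EKS.
# Note: spark.kubernetes.node.selector.* applies to driver and executor pods alike.
spark-submit \
  --conf spark.kubernetes.node.selector.eks.amazonaws.com/capacityType=ON_DEMAND \
  ...

With the Spark operator, the equivalent would be a nodeSelector on the driver and executor sections of the SparkApplication spec.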
I created a Kafka Streams application with Spring Cloud Stream that reads data from one topic and writes to another. I'm trying to deploy and run the job in AKS with an ACR image, but the stream gets closed without any error after reading all the available messages (lag 0) in the topic. The strange thing is that it runs fine in IntelliJ.
Here are my AKS pod logs:
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=62): org.apache.kafka.common.requests.FetchResponse#7b021a01
[2021-03-02 17:30:39,131] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=49) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,132] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=49,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:840] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Received FETCH response from node 3 for request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=63): org.apache.kafka.common.requests.FetchResponse#50fb365c
[2021-03-02 17:30:39,636] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:463] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Node 0 sent an incremental fetch response with throttleTimeMs = 3 for session 614342128 with 0 response partition(s), 1 implied partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:1177] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Added READ_UNCOMMITTED fetch request for partition test.topic at position FetchPosition{offset=128, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[vm3.lab (id: 3 rack: 1)], epoch=1}} to node vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.FetchSessionHandler FetchSessionHandler.java:259] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Built incremental fetch (sessionId=614342128, epoch=50) for node 3. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 1 partition(s)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.c.i.Fetcher Fetcher.java:261] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(test.topic)) to broker vm3.lab (id: 3 rack: 1)
[2021-03-02 17:30:39,637] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.c.NetworkClient NetworkClient.java:505] [Consumer clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, groupId=latest] Sending FETCH request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1-consumer, correlationId=64) and timeout 60000 to node 3: {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=614342128,session_epoch=50,topics=[],forgotten_topics_data=[],rack_id=}
[2021-03-02 17:30:39,710] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext#dc9876b, started on Tue Mar 02 17:29:08 GMT 2021
[2021-03-02 17:30:39,715] [DEBUG] [SpringContextShutdownHook] [o.s.c.a.AnnotationConfigApplicationContext AbstractApplicationContext.java:1006] Closing org.springframework.context.annotation.AnnotationConfigApplicationContext#71391b3f, started on Tue Mar 02 17:29:12 GMT 2021, parent: org.springframework.context.annotation.AnnotationConfigApplicationContext#dc9876b
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:369] Stopping beans in phase 2147483547
[2021-03-02 17:30:39,718] [DEBUG] [SpringContextShutdownHook] [o.s.c.s.DefaultLifecycleProcessor DefaultLifecycleProcessor.java:242] Bean 'org.springframework.kafka.config.internalKafkaListenerEndpointRegistry' completed its stop procedure
[2021-03-02 17:30:39,719] [DEBUG] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:1016] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] Stopping Streams client with timeoutMillis = 10000 ms.
[2021-03-02 17:30:39,719] [INFO] [SpringContextShutdownHook] [o.a.k.s.KafkaStreams KafkaStreams.java:287] stream-client [latest-e07d649d-5178-4107-898b-08b8008d822e] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:1116] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Informed to shut down
[2021-03-02 17:30:39,729] [INFO] [kafka-streams-close-thread] [o.a.k.s.p.i.StreamThread StreamThread.java:221] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:772] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] State already transits to PENDING_SHUTDOWN, skipping the run once call after poll request
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:206] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Ignoring request to transit from PENDING_SHUTDOWN to PENDING_SHUTDOWN: only DEAD state is a valid next state
[2021-03-02 17:30:39,788] [INFO] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.StreamThread StreamThread.java:1130] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Shutting down
[2021-03-02 17:30:39,788] [DEBUG] [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] [o.a.k.s.p.i.AssignedStreamsTasks AssignedStreamsTasks.java:529] stream-thread [latest-e07d649d-5178-4107-898b-08b8008d822e-StreamThread-1] Clean shutdown of all active tasks
Please advise.
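The SpringContextShutdownHook entries suggest the JVM is shutting down in response to an external signal rather than Streams failing on its own. A quick way to see whether Kubernetes terminated the container (names are placeholders):

# Exit code 143 = SIGTERM (e.g. failed liveness probe, eviction), 137 = SIGKILL/OOMKilled,
# 0 = the JVM exited on its own.
kubectl describe pod <pod> -n <namespace>      # Last State, Reason, restart count, Events
kubectl logs <pod> -n <namespace> --previous   # output of the previous container, if it restarted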
We have an XtraDB cluster with three nodes. One node was not stopped properly and won't start. The other two nodes are working correctly and responding. The only thing in the logs is this:
-- Unit mysql.service has begun starting up.
Aug 25 04:40:45 percona-prod-perconaxtradb-vm-0 /etc/init.d/mysql[2503]: MySQL PID not found, pid_file detected/guessed: /var/run/mysql
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 mysql[2462]: Starting MySQL (Percona XtraDB Cluster) database server: mysqld . . . . .
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 mysql[2462]: failed!
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 systemd[1]: mysql.service: control process exited, code=exited status=1
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 systemd[1]: Failed to start LSB: Start and stop the mysql (Percona XtraDB Cluster) daem
-- Subject: Unit mysql.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
In /var/lib/mysql/wsrep_recovery.qEEkjd we found this:
2018-08-25T05:49:31.055887Z 0 [ERROR] Found 20 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
2018-08-25T05:49:31.055892Z 0 [ERROR] Aborting
2018-08-25T05:49:31.055901Z 0 [Note] Binlog end
We would like to completely drop these 20 prepared transactions.
The other two nodes are consistent and working, so it would be enough to tell this node "ignore your state and sync with other nodes".
In the end we removed the /data folder on the dead node and restarted the node. The node then started SST replication, which takes a long time and whose only visible progress is the growing size of the folder. But then it worked.
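For anyone hitting the same situation: starting mysqld with --tc-heuristic-recover (COMMIT or ROLLBACK), as the log suggests, would only resolve the 20 prepared transactions, whereas wiping the datadir forces a full SST from a healthy donor, which is what worked here. Roughly (paths and service name depend on the installation):

# On the broken node only -- force a full SST from the cluster
service mysql stop
mv /var/lib/mysql /var/lib/mysql.bak   # removing just grastate.dat is often enough to trigger SST
service mysql start                    # the node rejoins and requests SST from a donor
du -sh /var/lib/mysql                  # the growing datadir is the only visible progress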
I have created a Spark cluster on Dataproc with 1 master and 6 worker nodes.
On the GCP console I can see 6 VMs running, but I only see 5 nodes in the YARN Node Manager UI.
When I SSH into the missing machine and look at the yarn-yarn-nodemanager log, I see that it keeps restarting and trying to re-register with the ResourceManager.
How can I make this node rejoin the cluster?
Update: my command:
gcloud dataproc clusters create ${GCS_CLUSTER} \
--image pyspark-with-conda \
--bucket test-spark-data \
--zone asia-east1-b \
--master-boot-disk-size 500GB \
--master-machine-type n1-standard-2 \
--num-masters 1 \
--num-workers 2 \
--worker-machine-type n1-standard-8 \
--num-preemptible-workers 4 \
--preemptible-worker-boot-disk-size 500GB
Error message:
2018-08-22 08:25:24,801 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at test-spark-cluster-m/10.140.0.34:8031
2018-08-22 08:25:24,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2018-08-22 08:25:24,843 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2018-08-22 08:25:24,978 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
2018-08-22 08:25:24,979 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,081 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,084 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup#0.0.0.0:8042
2018-08-22 08:25:25,185 INFO org.apache.hadoop.ipc.Server: Stopping server on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
2018-08-22 08:25:25,205 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2018-08-22 08:25:25,205 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,205 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,205 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2018-08-22 08:25:25,206 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,208 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
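"Disallowed NodeManager" means the ResourceManager has this node on its excluded/lost node list, which often happens on Dataproc when a preemptible worker is preempted and comes back under the same hostname. Two things that are commonly tried (a sketch; the cluster name is taken from the command above):

# On the master: have the ResourceManager re-read its include/exclude node lists
yarn rmadmin -refreshNodes

# Or recreate the preemptible workers so they re-register cleanly
# (newer gcloud versions call this flag --num-secondary-workers)
gcloud dataproc clusters update ${GCS_CLUSTER} --num-preemptible-workers 0
gcloud dataproc clusters update ${GCS_CLUSTER} --num-preemptible-workers 4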
OP confirmed that this issue is resolved and that they did not encounter it again.
I want to create a file when the service is restarted.
rpm -qa | grep super
supervisor-3.1.3-0.5.b1.el6.noarch
My config file:
[program:tests]
command=/root/while.sh
directory=/root
user=root
autostart=true
autorestart=true
[eventlistener:tests_ls]
command=touch /tmp/32
events=PROCESS_STATE
My log file errors:
2016-06-27 10:48:15,794 ERRO pool tests_ls event buffer overflowed, discarding event 8
2016-06-27 10:48:15,794 INFO exited: tests_ls (exit status 0; not expected)
2016-06-27 10:48:16,796 ERRO pool tests_ls event buffer overflowed, discarding event 9
2016-06-27 10:48:16,796 INFO gave up: tests_ls entered FATAL state, too many start retries too quickly
2016-06-27 10:48:39,155 ERRO pool tests_ls event buffer overflowed, discarding event 10
2016-06-27 10:48:39,155 INFO exited: tests (terminated by SIGKILL; not expected)
2016-06-27 10:48:40,157 ERRO pool tests_ls event buffer overflowed, discarding event 11
2016-06-27 10:48:40,160 INFO spawned: 'tests' with pid 26378
2016-06-27 10:48:41,163 ERRO pool tests_ls event buffer overflowed, discarding event 12
supervisorctl
tests RUNNING pid 26378, uptime 0:09:02
tests_ls FATAL Exited too quickly (process log may have details)
Where did I go wrong?
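For what it's worth: an eventlistener has to be a long-running process that speaks supervisor's event protocol on stdin/stdout (the READY / RESULT handshake). A one-shot command like touch exits immediately, so supervisord can never hand it an event; the pool buffer overflows and the listener ends up FATAL after too many quick restarts. A minimal sketch of such a listener (untested; the path /usr/local/bin/tests_ls.sh is made up):

#!/usr/bin/env bash
# Minimal supervisor event listener: touch a file on every PROCESS_STATE event.
while true; do
  echo "READY"                # tell supervisord we are ready for an event
  read -r header              # header line, e.g. "ver:3.0 server:supervisor ... len:71"
  len=${header##*len:}        # payload length from the header
  head -c "$len" > /dev/null  # consume (and ignore) the event payload
  touch /tmp/32               # the actual work
  printf 'RESULT 2\nOK'       # acknowledge the event (2 = length of "OK")
done

Then point the listener at the script instead of touch, e.g. command=/usr/local/bin/tests_ls.sh (keeping events=PROCESS_STATE), and make the script executable.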