Supervisord: run an action when a service is restarted (Linux)

I want to create a file whenever the service is restarted.
Installed version:
rpm -qa | grep super
supervisor-3.1.3-0.5.b1.el6.noarch
My config file:
[program:tests]
command=/root/while.sh
directory=/root
user=root
autostart=true
autorestart=true
[eventlistener:tests_ls]
command=touch /tmp/32
events=PROCESS_STATE
The errors in my log file:
2016-06-27 10:48:15,794 ERRO pool tests_ls event buffer overflowed, discarding event 8
2016-06-27 10:48:15,794 INFO exited: tests_ls (exit status 0; not expected)
2016-06-27 10:48:16,796 ERRO pool tests_ls event buffer overflowed, discarding event 9
2016-06-27 10:48:16,796 INFO gave up: tests_ls entered FATAL state, too many start retries too quickly
2016-06-27 10:48:39,155 ERRO pool tests_ls event buffer overflowed, discarding event 10
2016-06-27 10:48:39,155 INFO exited: tests (terminated by SIGKILL; not expected)
2016-06-27 10:48:40,157 ERRO pool tests_ls event buffer overflowed, discarding event 11
2016-06-27 10:48:40,160 INFO spawned: 'tests' with pid 26378
2016-06-27 10:48:41,163 ERRO pool tests_ls event buffer overflowed, discarding event 12
Output of supervisorctl:
tests RUNNING pid 26378, uptime 0:09:02
tests_ls FATAL Exited too quickly (process log may have details)
Where did I go wrong?
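For context on where this goes wrong: an [eventlistener] command is expected to speak supervisor's event listener protocol on stdin/stdout (print READY, read the event header and payload, then print a RESULT line). A plain touch exits immediately without consuming any events, so supervisord restarts it until it goes FATAL and the event buffer overflows, which matches the log above. Below is a minimal sketch of a listener script that touches a marker file on process-state events; the path /root/listener.py and the marker file /tmp/32 are illustrative names, not part of the original setup.
#!/usr/bin/env python
# minimal supervisord event listener sketch (illustrative)
import sys

def write_stdout(s):
    # stdout is reserved for the event listener protocol; log to stderr instead
    sys.stdout.write(s)
    sys.stdout.flush()

def main():
    while True:
        write_stdout('READY\n')                 # tell supervisord we can take an event
        header_line = sys.stdin.readline()      # e.g. "ver:3.0 ... eventname:PROCESS_STATE_EXITED len:84"
        headers = dict(tok.split(':', 1) for tok in header_line.split())
        sys.stdin.read(int(headers['len']))     # consume the event payload
        if headers.get('eventname', '').startswith('PROCESS_STATE'):
            open('/tmp/32', 'a').close()        # "touch" the marker file
        write_stdout('RESULT 2\nOK')            # acknowledge the event

if __name__ == '__main__':
    main()
Make the script executable and point the listener at it:
[eventlistener:tests_ls]
command=/root/listener.py
events=PROCESS_STATE
With that in place, every PROCESS_STATE_* transition of the tests program (including restarts after a kill) reaches the listener instead of overflowing the buffer.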

Related

Ceph Octopus garbage collector causes slow ops

We have a Ceph cluster with 408 OSDs, 3 mons and 3 RGWs. We upgraded the cluster from Nautilus 14.2.14 to Octopus 15.2.12 a few days ago. Since the upgrade, the garbage collector process, which runs after the lifecycle process, causes slow ops and makes some OSDs restart. In each run the garbage collector deletes about 1 million objects. Below is the log of one of the OSDs before it restarts.
2021-07-18T00:44:38.807+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:38.807+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:39.847+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:39.847+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:39.847+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:40.895+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:40.895+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:40.895+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:41.859+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:41.859+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
2021-07-18T00:44:41.859+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
2021-07-18T00:44:42.811+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
2021-07-18T00:44:42.811+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
What is a suitable GC configuration for such a heavy delete workload, so that it does not cause slow ops? We had the same delete load on Nautilus and never had this problem there.
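As a hedged sketch rather than a confirmed fix: these are the RGW GC throttling options commonly tuned for delete-heavy workloads. The option names exist in Octopus, but the values below are illustrative starting points, and the section name must match your RGW instance.
[client.rgw.<instance-name>]
rgw_gc_max_concurrent_io = 5        # fewer parallel GC I/Os hitting the OSDs
rgw_gc_max_trim_chunk = 16          # GC log entries trimmed per pass
rgw_gc_processor_max_time = 1800    # cap the length of a single GC cycle (seconds)
rgw_gc_processor_period = 3600      # how often the GC processor runs (seconds)
rgw_gc_obj_min_wait = 7200          # delay before deleted objects become GC-eligible (seconds)
Whatever you change, watch OSD op latency while radosgw-admin gc process runs to confirm the load is actually being spread out.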

Some YARN worker nodes do not join the cluster when I create a Spark cluster on Dataproc

I have created a Spark cluster on Dataproc with 1 master and 6 worker nodes.
In the GCP console I can see 6 VMs running, but I only see 5 nodes in the YARN Node Manager UI.
When I SSH into the missing machine, the yarn-yarn-nodemanager log shows that it keeps restarting and reconnecting to the ResourceManager.
How can I make this node rejoin the cluster?
Update: the command I used:
gcloud dataproc clusters create ${GCS_CLUSTER} \
--image pyspark-with-conda \
--bucket test-spark-data \
--zone asia-east1-b \
--master-boot-disk-size 500GB \
--master-machine-type n1-standard-2 \
--num-masters 1 \
--num-workers 2 \
--worker-machine-type n1-standard-8 \
--num-preemptible-workers 4 \
--preemptible-worker-boot-disk-size 500GB
Error message:
2018-08-22 08:25:24,801 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at test-spark-cluster-m/10.140.0.34:8031
2018-08-22 08:25:24,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2018-08-22 08:25:24,843 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2018-08-22 08:25:24,978 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
2018-08-22 08:25:24,979 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,081 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,084 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup#0.0.0.0:8042
2018-08-22 08:25:25,185 INFO org.apache.hadoop.ipc.Server: Stopping server on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
2018-08-22 08:25:25,205 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2018-08-22 08:25:25,205 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,205 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,205 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2018-08-22 08:25:25,206 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,208 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
OP confirmed that this issue was resolved and did not occur again.
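For anyone hitting the same "Disallowed NodeManager" registration error, a hedged checklist of standard YARN checks (the yarn-site.xml path is the usual Dataproc location; adjust for your image):
# on the master: list nodes in every state (RUNNING, LOST, DECOMMISSIONED, ...)
yarn node -list -all
# check whether the worker's hostname ended up in the ResourceManager exclude list
grep -A1 nodes.exclude-path /etc/hadoop/conf/yarn-site.xml
# after removing the host from that file, ask the ResourceManager to re-read it
yarn rmadmin -refreshNodes
# on the affected worker: restart the NodeManager (service name may vary by image)
sudo systemctl restart hadoop-yarn-nodemanager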

pm2 running at 100% CPU - how to debug

For the last couple of weeks I have been struggling with pm2 hitting 100% CPU usage, which hangs my Node server.
I tried reading the logs but didn't find any issue there.
My Node version: 6.9.1
PM2 version: 2.4.4
OS: Ubuntu 14.04
On average my CPU usage is ~5.
As a workaround I restart all apps manually with pm2 restart all.
pm2.log:
fluid_admin#instance-2:~$ tail -15 .pm2/pm2.log
2017-04-12 09:55:30: Starting execution sequence in -fork mode- for app name:fluid-prod id:0
2017-04-12 09:55:30: App name:fluid-prod id:0 online
2017-04-14 13:53:21: Stopping app:fluid-prod id:0
2017-04-14 13:53:21: Stopping app:nedbserver id:1
2017-04-14 13:53:21: App [nedbserver] with id [1] and pid [32557], exited with code [0] via signal [SIGINT]
2017-04-14 13:53:21: pid=32574 msg=failed to kill - retrying in 100ms
2017-04-14 13:53:21: pid=32557 msg=process killed
2017-04-14 13:53:21: Starting execution sequence in -fork mode- for app name:nedbserver id:1
2017-04-14 13:53:21: App [fluid-prod] with id [0] and pid [32574], exited with code [0] via signal [SIGINT]
2017-04-14 13:53:21: App name:nedbserver id:1 online
2017-04-14 13:53:21: pid=32574 msg=process killed
2017-04-14 13:53:21: Starting execution sequence in -fork mode- for app name:fluid-prod id:0
2017-04-14 13:53:21: App name:fluid-prod id:0 online
The app is running on GAE; the OS is Ubuntu 14.04.
(Screenshot: my CPU usage while pm2 is high.)
I moved from forever to pm2 about 6 months ago. Until recently it was working fine, but now I am frequently hitting this issue.
I don't know how to deal with this problem. Can someone help me debug it?
It happened again. Output of the top command:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25059 fluid_a+ 20 0 1260364 105532 8236 R 99.7 17.5 58:27.27 node fluid/server/tools/www +
1 root 20 0 33520 3208 1792 S 0.0 0.5 0:04.82 /sbin/init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kthreadd]
Script to start my app:
pm2 stop fluid-prod
pm2 start \
-n fluid-prod \
-e /path/to/fluid-error.log \
-o /path/to/fluid-out.log \
$(dirname $0)/www -- --max-memory-restart 200M --env=production
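A hedged way to see where that 100% CPU actually goes, instead of only restarting: profile the hot process (pid 25059 in the top output above; substitute your own), or restart the app with the V8 sampling profiler enabled. These are standard node/pm2 tools, not something specific to this setup.
# sample the spinning process live
sudo perf top -p 25059
# or restart with V8 profiling enabled
pm2 delete fluid-prod
pm2 start $(dirname $0)/www -n fluid-prod --node-args="--prof"
# reproduce the spike, then turn V8's isolate log into a readable report
node --prof-process isolate-*-v8.log > profile.txt
One more thing worth double-checking in the start script above: everything after -- is passed to the app itself, so --max-memory-restart 200M appears to reach www rather than pm2; moving it before the script path lets pm2 apply it.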

Heroku "State changed from starting to down Stopping all processes with SIGTERM"

After booting up my Node.js Heroku app with this Procfile:
web: node www/main.js
I used to get:
Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
So, to get around this, I changed my Procfile to a generic command, following advice I found here, using:
start: node www/main.js
And I am still getting shut down after 60 seconds. These are the errors now:
2015-01-20T13:04:01.452819+00:00 heroku[worker.1]: State changed from up to starting
2015-01-20T13:04:02.728905+00:00 heroku[worker.1]: State changed from starting to down
2015-01-20T13:04:03.434251+00:00 heroku[worker.1]: Starting process with command node www/main.js
2015-01-20T13:04:03.874370+00:00 heroku[worker.1]: Stopping all processes with SIGTERM
2015-01-20T13:04:05.188100+00:00 heroku[worker.1]: Process exited with status 143
2015-01-20T13:04:05.930916+00:00 app[worker.1]: [Tue Jan 20 2015 13:04:05 GMT+0000 (UTC)] INFO Connecting...
2015-01-20T13:04:06.837197+00:00 app[worker.1]: Welcome to Slack. You are #derpy of
2015-01-20T13:04:06.837559+00:00 app[worker.1]: You are in: #general
2015-01-20T13:04:06.837637+00:00 app[worker.1]: As well as:
2015-01-20T13:04:06.837739+00:00 app[worker.1]: You have 13 unread messages
2015-01-20T13:04:07.526373+00:00 heroku[worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
2015-01-20T13:04:07.526508+00:00 heroku[worker.1]: Stopping remaining processes with SIGKILL
I am using https://github.com/slackhq/node-slack-client and have not adapted the code too much. I have tried all the usual things and now I'm asking for help.
The other weird thing is that the Slack bot connects and is running perfectly for those 60 seconds.
Do this in your main.js file:
socket = io.listen(process.env.PORT);
and revert your Procfile back to web: node www/main.js
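If the process genuinely serves no HTTP traffic (a Slack bot usually doesn't), a hedged alternative to binding a throwaway socket is to declare it as a worker dyno, which Heroku does not subject to the R10 port-bind check:
worker: node www/main.js
Then scale the dynos so only the worker runs, e.g. heroku ps:scale web=0 worker=1.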

Error running supervisord with gearmand on Ubuntu Natty

I am using Ubuntu Natty.
I'm trying to use supervisord to daemonize gearmand. I've installed both gearmand and supervisord.
However, whenever I start supervisord I get the following log entries:
2012-05-18 12:23:29,219 CRIT Supervisor running as root (no user in config file)
2012-05-18 12:23:29,287 INFO RPC interface 'supervisor' initialized
2012-05-18 12:23:29,287 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2012-05-18 12:23:29,293 INFO daemonizing the supervisord process
2012-05-18 12:23:29,294 INFO supervisord started with pid 16596
2012-05-18 12:23:30,302 INFO spawned: 'gearman' with pid 16599
2012-05-18 12:23:30,312 INFO exited: gearman (exit status 127; not expected)
2012-05-18 12:23:31,318 INFO spawned: 'gearman' with pid 16630
2012-05-18 12:23:31,329 INFO exited: gearman (exit status 127; not expected)
2012-05-18 12:23:33,337 INFO spawned: 'gearman' with pid 16631
2012-05-18 12:23:33,346 INFO exited: gearman (exit status 127; not expected)
2012-05-18 12:23:36,355 INFO spawned: 'gearman' with pid 16632
2012-05-18 12:23:36,365 INFO exited: gearman (exit status 127; not expected)
2012-05-18 12:23:37,366 INFO gave up: gearman entered FATAL state, too many start retries too quickly
Below is my program entry for gearmand in supervisord.conf
[program:gearman]
command=/usr/sbin/gearmand -u root
numprocs=1
directory=/usr/local/php
stdout_logfile=/var/log/supervisord.log
autostart=true
autorestart=true
user=root
stopsignal=KILL
When I run /usr/sbin/gearmand -u root on the command line, it works fine.
Not sure what I'm doing wrong; I would appreciate some assistance.
Thanks.
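Exit status 127 usually means the command could not be found or executed in supervisord's environment, so a few hedged checks are worth running before changing anything else (paths below are the usual Ubuntu locations; adjust as needed):
which gearmand                      # where does the binary actually live?
ls -l /usr/sbin/gearmand            # does the path used in command= exist and is it executable?
ls -ld /usr/local/php               # directory= must also point at an existing directory
/usr/sbin/gearmand -u root          # run it exactly as supervisord would
If the binary turns out to live elsewhere (for example /usr/local/sbin/gearmand), point command= at that full path, or add an environment=PATH="..." line to the [program:gearman] section.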

Resources