Failed to stop Apache Spark Master or Slave using Systemd - apache-spark

Perspectives
I need to configure two service files: one for the Spark Master and another for the Spark Slave (Worker) node. Please find the environment and service configurations below:
Configurations
/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
JAVA_HOME="/usr/lib/jvm/java-17-openjdk-amd64"
SPARK_HOME="/opt/cli/spark-3.3.0-bin-hadoop3"
PYSPARK_PYTHON="/usr/bin/python3"
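These values are loaded into both units below via EnvironmentFile=. If there is any doubt that they actually reach the Spark process, the running daemon's environment can be inspected directly; a small check using standard tooling (illustrative only):
# resolve the unit's main PID, then dump the interesting variables from its environment
MAINPID=$(systemctl show --value -p MainPID spark-master.service)
sudo tr '\0' '\n' < "/proc/${MAINPID}/environ" | grep -E 'JAVA_HOME|SPARK_HOME|PYSPARK_PYTHON'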
/etc/systemd/system/spark-master.service
[Unit]
Description=Apache Spark Master
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-master.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-master.sh
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-master.sh
[Install]
WantedBy=multi-user.target
/etc/systemd/system/spark-slave.service
[Unit]
Description=Apache Spark Slave
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-slave.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-slave.sh spark://spark.cdn.chorke.org:7077
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-slave.sh
[Install]
WantedBy=multi-user.target
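For reference, the units were exercised with the usual systemd workflow; the exact commands are not part of the original notes, but they amount to something like this:
sudo systemctl daemon-reload
sudo systemctl start spark-master.service
sudo systemctl start spark-slave.service
sudo systemctl stop spark-slave.service
sudo systemctl stop spark-master.service
systemctl status spark-master.service spark-slave.service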
Outcome
Both services start successfully, but they fail to stop cleanly: stopping the Apache Spark Master or Slave through systemd leaves the unit in a failed state.
Spark Master Stop Status
× spark-master.service - Apache Spark Master
Loaded: loaded (/etc/systemd/system/spark-master.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2022-09-26 18:43:39 +08; 8s ago
Docs: https://spark.apache.org/docs/3.3.0
Process: 488887 ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-master.sh (code=exited, status=0/SUCCESS)
Process: 489000 ExecStartPost=/bin/bash -c echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-master.pid (code=exited, status=0/SUCCESS)
Process: 489484 ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-master.sh (code=exited, status=0/SUCCESS)
Main PID: 488903 (code=exited, status=143)
CPU: 4.813s
Spark Slave Stop Status
× spark-slave.service - Apache Spark Slave
Loaded: loaded (/etc/systemd/system/spark-slave.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2022-09-26 18:38:22 +08; 15s ago
Docs: https://spark.apache.org/docs/3.3.0
Process: 489024 ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-slave.sh spark://ns12-pc04:7077 (code=exited, status=0/SUCCESS)
Process: 489145 ExecStartPost=/bin/bash -c echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-slave.pid (code=exited, status=0/SUCCESS)
Process: 489174 ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-slave.sh (code=exited, status=0/SUCCESS)
Main PID: 489040 (code=exited, status=143)
CPU: 4.306s
Expected Behavior
Guidance on how to shut down both the Master and Slave nodes without any error would be appreciated.

Theoretical Solution
In this case you could write your own script that wraps the shutdown and forces exit code 0 instead of 143. If you are lazy like me, you can instead add 143 to SuccessExitStatus. By default systemd considers only exit code 0 a successful exit, so we need to change that default behaviour for these units.
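For completeness, the wrapper-script idea can be sketched as a small helper that runs a command and translates exit code 143 (128 + 15, i.e. a JVM that shut down cleanly after receiving SIGTERM) into 0. The script name and location below are assumptions for illustration only; the SuccessExitStatus approach in the next section is the simpler fix.
#!/usr/bin/env bash
# /opt/cli/spark-3.3.0-bin-hadoop3/etc/remap-exit.sh (hypothetical helper)
# Run the given command and report exit code 143 as success (0).
"$@"
rc=$?
if [ "$rc" -eq 143 ]; then
    exit 0
fi
exit "$rc"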
Practical Solution
/etc/systemd/system/spark-master.service
[Unit]
Description=Apache Spark Master
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
SuccessExitStatus=143
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-master.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-master.sh
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-master.sh
[Install]
WantedBy=multi-user.target
/etc/systemd/system/spark-slave.service
[Unit]
Description=Apache Spark Slave
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
SuccessExitStatus=143
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-slave.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-slave.sh spark://spark.cdn.chorke.org:7077
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-slave.sh
[Install]
WantedBy=multi-user.target
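After editing the unit files, systemd must reload them before the new SuccessExitStatus takes effect; a typical sequence (commands assumed, not taken from the original page):
sudo systemctl daemon-reload
sudo systemctl restart spark-master.service spark-slave.service
sudo systemctl stop spark-slave.service spark-master.service
systemctl status spark-master.service spark-slave.service
With SuccessExitStatus=143 in place, the main process exiting with 143 on SIGTERM is counted as a clean shutdown, and both units stop without entering the failed state.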

Related

Linux service returns (code=exited, status=2)

I have created a Golang web app and built it as an executable. The executable runs fine when executed as ./main (main is the name of the executable) from its own directory.
I'm trying to run it as a service. I have the file goservice.service:
[Unit]
Description=goservice
[Service]
Type=simple
Restart=always
RestartSec=5s
ExecStart=/home/nshankar/go/eperssona/main
[Install]
WantedBy=multi-user.target
When I check the status of the service, I get the following response:
Loaded: loaded (/etc/systemd/system/goservice.service; enabled; preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Fri 2022-10-28 10:47:54 UTC; 498ms ago
Process: 13286 ExecStart=/home/nshankar/go/eperssona/main (code=exited, status=2)
Main PID: 13286 (code=exited, status=2)
CPU: 4ms
The service is listening on port 9002.
Why am I getting this error?
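No answer is recorded here, but status=2 is simply the program's own exit code, so the first step is usually to look at what the binary printed when systemd started it, and to check whether it relies on files relative to its own directory. The WorkingDirectory= value below is an assumption derived from the ExecStart path, shown only as a diagnostic sketch:
# what did the binary print before exiting?
journalctl -u goservice.service -n 50 --no-pager

# if the app loads templates/config via relative paths, give it a working directory
[Service]
WorkingDirectory=/home/nshankar/go/eperssona
ExecStart=/home/nshankar/go/eperssona/main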

How to avoid cgroup.subtree_control of a delegated unit being reset on service restart

In a systemd v250 environment, cgroup.subtree_control of my service is reset whenever I restart the service.
If I modify the value of cgroup.subtree_control, systemd overwrites it when the service is restarted. For example, if I add the cpu controller to subtree_control, systemd removes it again on restart. If I have also created a subdirectory, and processes are still in that subgroup at restart time, the startup fails. The error is as follows:
Control process exited, code=exited, status=219/CGROUP
...
Unit process 543222 (xxx) remains running after unit stopped.
...
This appears to be a failure while systemd rewrites subtree_control.
I want to manage cgroup.subtree_control myself when delegation is turned on.
I would like systemd not to modify it, i.e. not to reset the delegated service's cgroup.subtree_control when the service is restarted.
Relevant files
unit file:
[Unit]
Description=DelegateTest
[Service]
Type=simple
TimeoutSec=60s
KillMode=process
ExecStartPre=/bin/bash /test/start_pre.sh
ExecStart=/bin/bash /test/loader.sh
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID
Delegate=yes
/test/start_pre.sh:
echo "start_pre"
/test/loader.sh:
echo "executing"
# Resolve this shell's cgroup (cgroup v2 unified hierarchy) to a filesystem path
CGROUP=$(cat /proc/$$/cgroup)
CGROUP_PATH=/sys/fs/cgroup${CGROUP#*::}
# Create a "job" subgroup inside the delegated cgroup if it does not exist yet
if [[ ! -d ${CGROUP_PATH}/job ]]; then
mkdir ${CGROUP_PATH}/job
fi
# Move this shell into the subgroup (a cgroup with member processes cannot
# enable controllers for its children), then enable the cpu controller
echo $$ > ${CGROUP_PATH}/job/cgroup.procs
echo "+cpu" > ${CGROUP_PATH}/cgroup.subtree_control
# Keep a long-running foreground process so the service stays active
ping 127.0.0.1 > /dev/null
Steps to reproduce this problem
systemctl start DelegateTest.service
systemctl status DelegateTest.service
Loaded: loaded (/usr/lib/systemd/system/DelegateTest.service; static)
Active: active (running) since Thu 2022-07-21 09:58:31 CST; 1s ago
Process: 541635 ExecStartPre=/bin/bash /test/start_pre.sh (code=exited, status=0/SUCCESS)
Main PID: 541636 (bash)
Tasks: 2 (limit: 23196)
Memory: 660.0K
CPU: 11ms
CGroup: /system.slice/DelegateTest.service
└─job
├─541636 /bin/bash /test/loader.sh
└─541639 ping 127.0.0.1
systemctl stop DelegateTest.service
systemctl status DelegateTest.service
Loaded: loaded (/usr/lib/systemd/system/DelegateTest.service; static)
Active: inactive (dead) since Thu 2022-07-21 09:58:36 CST; 981ms ago
Process: 541635 ExecStartPre=/bin/bash /test/start_pre.sh (code=exited, status=0/SUCCESS)
Process: 541636 ExecStart=/bin/bash /test/loader.sh (code=killed, signal=TERM)
Process: 541644 ExecStop=/bin/kill $MAINPID (code=exited, status=0/SUCCESS)
Main PID: 541636 (code=killed, signal=TERM)
Tasks: 1 (limit: 23196)
Memory: 300.0K
CPU: 13ms
CGroup: /system.slice/DelegateTest.service
└─job
└─541639 ping 127.0.0.1
systemctl start DelegateTest.service
Job for DelegateTest.service failed because the control process exited with error code.
See "systemctl status DelegateTest.service" and "journalctl -xeu DelegateTest.service" for details.
systemctl status DelegateTest.service
× DelegateTest.service - DelegateTest
Loaded: loaded (/usr/lib/systemd/system/DelegateTest.service; static)
Active: failed (Result: exit-code) since Thu 2022-07-21 09:58:53 CST; 1s ago
Process: 541649 ExecStartPre=/bin/bash /test/start_pre.sh (code=exited, status=219/CGROUP)
Tasks: 1 (limit: 23196)
Memory: 300.0K
CPU: 84us
CGroup: /system.slice/DelegateTest.service
└─job
└─541639 ping 127.0.0.1
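To observe the reset directly, the delegated unit's subtree_control can be read before and after a restart; a minimal check, assuming the default cgroup v2 mount point used in loader.sh above:
cat /sys/fs/cgroup/system.slice/DelegateTest.service/cgroup.subtree_control
# prints "cpu" after loader.sh has run; after systemd restarts (or stops) the
# unit it rewrites the file and the manually added controller disappears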
Relevant issues:
https://github.com/systemd/systemd/issues/24064
https://github.com/systemd/systemd/issues/20026
https://github.com/systemd/systemd/issues/18293
https://github.com/systemd/systemd/pull/9119
https://github.com/systemd/systemd/issues/8645
https://github.com/systemd/systemd/issues/18104

Start Airflow Webserver Automatically as a Service

I would like to write a systemd unit file that starts the Airflow webserver. Here is my unit file definition:
#airflow-webserver.service
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
#EnvironmentFile=/etc/default/airflow
Environment="PATH=/root/airflow_venv/bin"
User=airflow
Group=airflow
Type=simple
ExecStart=/root/airflow_venv/bin/airflow webserver -p 8080 --pid /root/airflow/airflow-webserver.pid
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
However, running "sudo service airflow-webserver status" returns:
● airflow-webserver.service - Airflow webserver daemon
Loaded: loaded (/etc/systemd/system/airflow-webserver.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2021-01-25 06:58:14 UTC; 3s ago
Process: 5913 ExecStart=/root/airflow_venv/bin/airflow webserver -p 8080 --pid /root/airflow/airflow-webserver.pid (code=exited, status=217/USER)
Main PID: 5913 (code=exited, status=217/USER)
CGroup: /system.slice/airflow-webserver.service
I already looked at How to enable a virtualenv in a systemd service unit? as it sounds like a similar issue. However, I could not find my solution there. The OS I am operating on is Amazon Linux 2 AMI.
Can somebody help?
For anyone who gets the same error:
Main PID: 5913 (code=exited, status=217/USER)
This indicates that the user is incorrect. After adapting the user and group, the service worked for me.
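In practice that means checking that the account named by User=/Group= exists on the host before systemd tries to switch to it; an illustrative check (the useradd line is only an example):
# does the configured account exist?
id airflow
getent group airflow

# if not, create a system account for it (example only)
sudo useradd --system --no-create-home --shell /sbin/nologin airflow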

How do I install Spark as a daemon

I start Spark as master and slave on two machines following this guide:
https://www.tutorialkart.com/apache-spark/how-to-setup-an-apache-spark-cluster/
Then I create a systemd .service file for each of them, but when I start them as services they fail to start. Here is my systemctl status:
● sparkslave.service - Spark Slave
Loaded: loaded (/etc/systemd/system/sparkslave.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2019-12-09 07:30:22 EST; 55s ago
Process: 31680 ExecStart=/usr/lib/spark/sbin/start-slave.sh spark://172.16.3.90:7077 (code=exited, status=0/SUCCESS)
Main PID: 31680 (code=exited, status=0/SUCCESS)
Dec 09 07:30:19 SparkSlave1 systemd[1]: Started Spark Slave.
Dec 09 07:30:19 SparkSlave1 start-slave.sh[31680]: starting org.apache.spark.deploy.worker.Worker, logging to /usr/lib/spark/logs/spark-spark-user-org.apache.spark.deploy.worker.Worker-1-SparkSlave1.out
And this is my sparkslave.service:
[Unit]
Description=Spark Slave
After=network.target
[Service]
User=spark-user
WorkingDirectory=/usr/lib/spark/sbin
ExecStart=/usr/lib/spark/sbin/start-slave.sh spark://172.16.3.90:7077
Restart=on-failure
RestartSec=10s
[Install]
WantedBy=multi-user.target
What is the problem?
The service type must be changed from simple to forking:
[Service]
Type=forking
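Applied to the unit above, the [Service] section would then look roughly like this (only Type= is added; everything else is unchanged):
[Service]
Type=forking
User=spark-user
WorkingDirectory=/usr/lib/spark/sbin
ExecStart=/usr/lib/spark/sbin/start-slave.sh spark://172.16.3.90:7077
Restart=on-failure
RestartSec=10s
Follow up with systemctl daemon-reload and a restart of the unit.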

Systemd service leaves out command in script

I am trying to start a service named pigpiod.service via systemd. It invokes a script with three commands, but the second one is skipped. Why is this?
/etc/systemd/system/pigpiod.service:
[Unit]
Description=Starts pigpiod
Before=touchscreen.service
[Service]
ExecStart=/home/sysop/pigpiod.sh
[Install]
WantedBy=multi-user.target
/home/sysop/pigpiod.sh:
#!/bin/sh
touch /home/sysop/before_pigpiod
/usr/bin/pigpiod
touch /home/sysop/after_pigpiod
When restarting the machine, the two files get created in /home/sysop/, but pigpiod does not start.
The same happens when starting the service manually via sudo systemctl start pigpiod.
When running sudo /home/sysop/pigpiod.sh manually, pigpiod actually starts!
This is the output of sudo systemctl status pigpiod -l right after boot:
● pigpiod.service - Starts pigpiod
Loaded: loaded (/etc/systemd/system/pigpiod.service; enabled)
Active: inactive (dead) since Sat 2017-09-16 20:02:03 UTC; 2min 29s ago
Process: 440 ExecStart=/home/sysop/pigpiod.sh (code=exited, status=0/SUCCESS)
Main PID: 440 (code=exited, status=0/SUCCESS)
Sep 16 20:02:02 kivypie systemd[1]: Starting Starts pigpiod...
Sep 16 20:02:02 kivypie systemd[1]: Started Starts pigpiod.
Why does systemd skip the execution of /usr/bin/pigpiod, while manually running the script as root does not?
My system: Raspberry Pi Model 3B, Raspbian GNU/Linux 8 (jessie)
pigpiod forks (daemonizes) unless it is started with the -g option, so either use Type=forking or run pigpiod -g:
[Unit]
Description=Starts pigpiod
Before=touchscreen.service
[Service]
ExecStart=/home/sysop/pigpiod.sh
Type=forking
[Install]
WantedBy=multi-user.target
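Alternatively, keeping the daemon in the foreground with -g avoids relying on forking semantics entirely; a sketch of that variant (same ordering constraint, Type=simple because pigpiod no longer forks):
[Unit]
Description=Starts pigpiod
Before=touchscreen.service
[Service]
Type=simple
ExecStart=/usr/bin/pigpiod -g
[Install]
WantedBy=multi-user.target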

Resources