I started Spark as a master and a slave on two machines by following this guide:
https://www.tutorialkart.com/apache-spark/how-to-setup-an-apache-spark-cluster/
Then I created a systemd .service unit for each of them, but when I start them as services they fail to start. Here is the systemctl status output:
● sparkslave.service - Spark Slave
Loaded: loaded (/etc/systemd/system/sparkslave.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2019-12-09 07:30:22 EST; 55s ago
Process: 31680 ExecStart=/usr/lib/spark/sbin/start-slave.sh spark://172.16.3.90:7077 (code=exited, status=0/SUCCESS)
Main PID: 31680 (code=exited, status=0/SUCCESS)
Dec 09 07:30:19 SparkSlave1 systemd[1]: Started Spark Slave.
Dec 09 07:30:19 SparkSlave1 start-slave.sh[31680]: starting org.apache.spark.deploy.worker.Worker, logging to /usr/lib/spark/logs/spark-spark-user-org.apache.spark.deploy.worker.Worker-1-SparkSlave1.out
And this is my sparkslave.service:
[Unit]
Description=Spark Slave
After=network.target
[Service]
User=spark-user
WorkingDirectory=/usr/lib/spark/sbin
ExecStart=/usr/lib/spark/sbin/start-slave.sh spark://172.16.3.90:7077
Restart=on-failure
RestartSec=10s
[Install]
WantedBy=multi-user.target
What is the problem?
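The service type must be changed from simple (the default) to forking. start-slave.sh launches the worker in the background and then exits, so with Type=simple systemd treats the exit of the script as the end of the service: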
[Service]
Type=forking
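A rough way to see the difference, as a sketch rather than what the Spark scripts literally do: the hypothetical launch_in_background function below stands in for a launcher such as start-slave.sh, and sleep stands in for the worker process. The launcher returns immediately while the "worker" keeps running, which is the pattern Type=forking is meant for.

```shell
#!/bin/sh
# Hypothetical stand-in for a launcher script such as start-slave.sh:
# it starts a long-running process in the background and returns at once.
launch_in_background() {
    # "sleep 5" is only a placeholder for the real worker; it is not Spark.
    nohup sleep 5 >/dev/null 2>&1 &
    echo "$!"   # PID of the background process, printed for the caller
}

worker_pid=$(launch_in_background)

# The launcher has already returned, yet the "worker" is still alive.
if kill -0 "$worker_pid" 2>/dev/null; then
    echo "launcher exited, worker $worker_pid still running"
fi

kill "$worker_pid" 2>/dev/null   # clean up the demo process
```

With Type=simple, systemd would track the launcher itself as the main process; with Type=forking it expects the launcher to exit and then tracks the process left behind.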
Related
I have created a Golang web app and built it as an executable. The executable runs well when executed as ./main (main is the name of the executable) from its subdirectory.
I'm trying to run it as a service. I have a file goservice.service:
[Unit]
Description=goservice
[Service]
Type=simple
Restart=always
RestartSec=5s
ExecStart=/home/nshankar/go/eperssona/main
[Install]
WantedBy=multi-user.target
When I check the status of the service, I get the following response:
Loaded: loaded (/etc/systemd/system/goservice.service; enabled; preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Fri 2022-10-28 10:47:54 UTC; 498ms ago
Process: 13286 ExecStart=/home/nshankar/go/eperssona/main (code=exited, status=2)
Main PID: 13286 (code=exited, status=2)
CPU: 4ms
The service is listening on port 9002.
Why am I getting this error?
Perspectives
I need to configure two service files: one for the Spark Master and another for the Spark Slave (Worker) node. The environment and service configuration are as follows:
Configurations
/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
JAVA_HOME="/usr/lib/jvm/java-17-openjdk-amd64"
SPARK_HOME="/opt/cli/spark-3.3.0-bin-hadoop3"
PYSPARK_PYTHON="/usr/bin/python3"
/etc/systemd/system/spark-master.service
[Unit]
Description=Apache Spark Master
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-master.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-master.sh
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-master.sh
[Install]
WantedBy=multi-user.target
/etc/systemd/system/spark-slave.service
[Unit]
Description=Apache Spark Slave
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-slave.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-slave.sh spark://spark.cdn.chorke.org:7077
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-slave.sh
[Install]
WantedBy=multi-user.target
Outcome
The services start successfully but fail to stop cleanly: stopping the Apache Spark Master or Slave through systemd leaves the unit in a failed state.
Spark Master Stop Status
× spark-master.service - Apache Spark Master
Loaded: loaded (/etc/systemd/system/spark-master.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2022-09-26 18:43:39 +08; 8s ago
Docs: https://spark.apache.org/docs/3.3.0
Process: 488887 ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-master.sh (code=exited, status=0/SUCCESS)
Process: 489000 ExecStartPost=/bin/bash -c echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-master.pid (code=exited, status=0/SUCCESS)
Process: 489484 ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-master.sh (code=exited, status=0/SUCCESS)
Main PID: 488903 (code=exited, status=143)
CPU: 4.813s
Spark Slave Stop Status
× spark-slave.service - Apache Spark Slave
Loaded: loaded (/etc/systemd/system/spark-slave.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2022-09-26 18:38:22 +08; 15s ago
Docs: https://spark.apache.org/docs/3.3.0
Process: 489024 ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-slave.sh spark://ns12-pc04:7077 (code=exited, status=0/SUCCESS)
Process: 489145 ExecStartPost=/bin/bash -c echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-slave.pid (code=exited, status=0/SUCCESS)
Process: 489174 ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-slave.sh (code=exited, status=0/SUCCESS)
Main PID: 489040 (code=exited, status=143)
CPU: 4.306s
Expected Behavior
Any guidance on shutting down both the Master and Slave nodes without errors would be appreciated.
Theoretical Solution
In this case you would have to write your own shutdown script that forces exit code 0 instead of 143 (the status the JVM exits with after receiving SIGTERM from the stop script). If you are as lazy as me, you can instead add 143 to SuccessExitStatus. By default systemd only treats exit code 0 as a successful exit of the main process, so an exit with 143 marks the unit as failed. We need to change that default behavior.
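To see where 143 comes from: by shell convention, statuses above 128 mean 128 plus a signal number, and 143 - 128 = 15, which is SIGTERM. The helper below is only an illustrative sketch (decode_exit_status is a made-up name, not part of systemd or Spark).

```shell
#!/bin/sh
# Illustrative helper: interpret an exit status using the 128+N convention.
# Statuses above 128 are read as "terminated by signal N".
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

decode_exit_status 143   # prints: signal 15
```

Running kill -l 15 in a shell prints the name of signal 15 if you want to double-check the mapping.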
Practical Solution
/etc/systemd/system/spark-master.service
[Unit]
Description=Apache Spark Master
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
SuccessExitStatus=143
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-master.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-master.sh
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-master.sh
[Install]
WantedBy=multi-user.target
/etc/systemd/system/spark-slave.service
[Unit]
Description=Apache Spark Slave
Wants=network-online.target
After=network-online.target
[Service]
User=spark
Group=spark
Type=forking
SuccessExitStatus=143
WorkingDirectory=/opt/cli/spark-3.3.0-bin-hadoop3/sbin
EnvironmentFile=/opt/cli/spark-3.3.0-bin-hadoop3/etc/env
ExecStartPost=/bin/bash -c "echo $MAINPID > /opt/cli/spark-3.3.0-bin-hadoop3/etc/spark-slave.pid"
ExecStart=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/start-slave.sh spark://spark.cdn.chorke.org:7077
ExecStop=/opt/cli/spark-3.3.0-bin-hadoop3/sbin/stop-slave.sh
[Install]
WantedBy=multi-user.target
With systemd v250, cgroup.subtree_control is reset when I restart the service.
If I modify the value of cgroup.subtree_control, systemd tries to overwrite it when the service is restarted. For example, if I add the cpu controller to subtree_control, systemd removes it again on restart. If I have also created a sub-cgroup and it still contains processes at restart time, the startup fails. The error is as follows:
Control process exited, code=exited, status=219/CGROUP
...
Unit process 543222 (xxx) remains running after unit stopped.
....
This seems to be a failure while systemd rewrites subtree_control.
I want to manage cgroup.subtree_control myself when delegation is turned on.
I would like systemd not to modify or reset cgroup.subtree_control of the delegated service when the service is restarted.
Relevant files
unit file:
[Unit]
Description=DelegateTest
[Service]
Type=simple
TimeoutSec=60s
KillMode=process
ExecStartPre=/bin/bash /test/start_pre.sh
ExecStart=/bin/bash /test/loader.sh
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID
Delegate=yes
/test/start_pre.sh:
echo "start_pre"
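/test/loader.sh:
echo "executing"
# Resolve this shell's cgroup v2 path from the "0::/path" line in /proc.
CGROUP=$(cat /proc/$$/cgroup)
CGROUP_PATH=/sys/fs/cgroup${CGROUP#*::}
# Create the "job" sub-cgroup on first start.
if [[ ! -d "${CGROUP_PATH}/job" ]]; then
    mkdir "${CGROUP_PATH}/job"
fi
# Move this shell into the sub-cgroup, then enable the cpu controller for children.
echo $$ > "${CGROUP_PATH}/job/cgroup.procs"
echo "+cpu" > "${CGROUP_PATH}/cgroup.subtree_control"
ping 127.0.0.1 > /dev/null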
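For readers unfamiliar with the ${CGROUP#*::} expansion in loader.sh, here is a small standalone sketch of the same path derivation. It assumes a pure cgroup v2 host, where /proc/<pid>/cgroup holds a single line of the form 0::/path; the function name cgroup_dir_from_line is made up for this illustration.

```shell
#!/bin/sh
# Sketch: turn one line of /proc/<pid>/cgroup (cgroup v2 format "0::/path")
# into the matching directory under /sys/fs/cgroup.
cgroup_dir_from_line() {
    line=$1
    # "#*::" strips the shortest prefix ending in "::", i.e. the leading "0::".
    echo "/sys/fs/cgroup${line#*::}"
}

cgroup_dir_from_line "0::/system.slice/DelegateTest.service"
# prints: /sys/fs/cgroup/system.slice/DelegateTest.service
```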
Steps to reproduce this problem
systemctl start DelegateTest.service
systemctl status DelegateTest.service
Loaded: loaded (/usr/lib/systemd/system/DelegateTest.service; static)
Active: active (running) since Thu 2022-07-21 09:58:31 CST; 1s ago
Process: 541635 ExecStartPre=/bin/bash /test/start_pre.sh (code=exited, status=0/SUCCESS)
Main PID: 541636 (bash)
Tasks: 2 (limit: 23196)
Memory: 660.0K
CPU: 11ms
CGroup: /system.slice/DelegateTest.service
└─job
├─541636 /bin/bash /test/loader.sh
└─541639 ping 127.0.0.1
systemctl stop DelegateTest.service
systemctl status DelegateTest.service
Loaded: loaded (/usr/lib/systemd/system/DelegateTest.service; static)
Active: inactive (dead) since Thu 2022-07-21 09:58:36 CST; 981ms ago
Process: 541635 ExecStartPre=/bin/bash /test/start_pre.sh (code=exited, status=0/SUCCESS)
Process: 541636 ExecStart=/bin/bash /test/loader.sh (code=killed, signal=TERM)
Process: 541644 ExecStop=/bin/kill $MAINPID (code=exited, status=0/SUCCESS)
Main PID: 541636 (code=killed, signal=TERM)
Tasks: 1 (limit: 23196)
Memory: 300.0K
CPU: 13ms
CGroup: /system.slice/DelegateTest.service
└─job
└─541639 ping 127.0.0.1
systemctl start DelegateTest.service
Job for DelegateTest.service failed because the control process exited with error code.
See "systemctl status DelegateTest.service" and "journalctl -xeu DelegateTest.service" for details.
systemctl status DelegateTest.service
× DelegateTest.service - DelegateTest
Loaded: loaded (/usr/lib/systemd/system/DelegateTest.service; static)
Active: failed (Result: exit-code) since Thu 2022-07-21 09:58:53 CST; 1s ago
Process: 541649 ExecStartPre=/bin/bash /test/start_pre.sh (code=exited, status=219/CGROUP)
Tasks: 1 (limit: 23196)
Memory: 300.0K
CPU: 84us
CGroup: /system.slice/DelegateTest.service
└─job
└─541639 ping 127.0.0.1
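In the output above, ping (PID 541639) is still listed in the job sub-cgroup after the stop; with KillMode=process only the main process is signalled. A quick way to check for such leftovers before restarting could look like the sketch below. The leftover_pids helper is hypothetical, and the directory you pass in is an assumption about where the unit's cgroup lives (here it would be /sys/fs/cgroup/system.slice/DelegateTest.service).

```shell
#!/bin/sh
# Sketch: print every PID still listed in any cgroup.procs file at or below
# the given cgroup directory. Empty output means no processes are left.
leftover_pids() {
    find "$1" -name cgroup.procs -exec cat {} \; 2>/dev/null
}

# Example call (path is an assumption, adjust to your unit):
# leftover_pids /sys/fs/cgroup/system.slice/DelegateTest.service
```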
Relevant issues:
https://github.com/systemd/systemd/issues/24064
https://github.com/systemd/systemd/issues/20026
https://github.com/systemd/systemd/issues/18293
https://github.com/systemd/systemd/pull/9119
https://github.com/systemd/systemd/issues/8645
https://github.com/systemd/systemd/issues/18104
I would like to write a systemd unit file that starts the Airflow webserver. Here is my unit file:
#airflow-webserver.service
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
#EnvironmentFile=/etc/default/airflow
Environment="PATH=/root/airflow_venv/bin"
User=airflow
Group=airflow
Type=simple
ExecStart=/root/airflow_venv/bin/airflow webserver -p 8080 --pid /root/airflow/airflow-webserver.pid
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
However, running "sudo service airflow-webserver status" returns:
● airflow-webserver.service - Airflow webserver daemon
Loaded: loaded (/etc/systemd/system/airflow-webserver.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2021-01-25 06:58:14 UTC; 3s ago
Process: 5913 ExecStart=/root/airflow_venv/bin/airflow webserver -p 8080 --pid /root/airflow/airflow-webserver.pid (code=exited, status=217/USER)
Main PID: 5913 (code=exited, status=217/USER)
CGroup: /system.slice/airflow-webserver.service
I already looked at "How to enable a virtualenv in a systemd service unit?" as it sounds like a similar issue. However, I could not find a solution there. The OS I am running is Amazon Linux 2 AMI.
Can somebody help?
Lazloo
For anyone who gets the same error:
Main PID: 5913 (code=exited, status=217/USER)
This indicates that systemd could not resolve or switch to the user or group named in User= and Group=. After correcting the user and group, the service worked for me.
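Before starting the unit, it can help to confirm that the account named in User= actually exists on the machine. A minimal sketch, where account_exists is a made-up helper name and airflow is the user from the unit file above:

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the named user account can be resolved.
account_exists() {
    id -u "$1" >/dev/null 2>&1
}

if account_exists airflow; then
    echo "user airflow exists"
else
    echo "user airflow is missing; status=217/USER is the likely result"
fi
```

The same idea applies to Group=; how you look up groups depends on the system, so that part is left out here.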
I tried to run my app as a systemd service using the commands systemctl enable photogrid.service and systemctl start photogrid.service on Ubuntu 16.
The Node.js app itself runs as expected. The service is meant to ensure that the application restarts automatically when it crashes or the server reboots.
The service apparently did not start, so I ran systemctl status photogrid.service to see what happened. Below is what I got from the terminal.
● photogrid.service - Photogrid
Loaded: loaded (/lib/systemd/system/photogrid.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2016-11-09 04:35:36 UTC; 7s ago
Process: 27523 ExecStart=/usr/local/bin/node /home/ubuntu/photogrid/app.js (code=exited, status=203/EXEC)
Main PID: 27523 (code=exited, status=203/EXEC)
Nov 09 04:35:36 ip-172-31-34-151 systemd[1]: photogrid.service: Main process exited, code=exited, status=203/EXEC
Nov 09 04:35:36 ip-172-31-34-151 systemd[1]: photogrid.service: Unit entered failed state.
Nov 09 04:35:36 ip-172-31-34-151 systemd[1]: photogrid.service: Failed with result 'exit-code'.
This is the unit file that I wrote for the service, at the path /lib/systemd/system/photogrid.service:
[Unit]
Description=Photogrid
[Service]
Type=simple
Restart=always
RestartSec=10
Environment=NODE_ENV=production
ExecStart=/usr/local/bin/node /home/ubuntu/photogrid/app.js
[Install]
WantedBy=multi-user.target
Under ExecStart, make sure you point to the correct Node.js executable; status=203/EXEC means systemd could not execute the binary it was given. In my case node was in a different folder, not /usr/local/bin/node. Assuming Node.js is downloaded and installed correctly, run which node to print the path of your node executable.
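A small sketch of that check, in case it is useful: check_exec is a hypothetical helper that reports whether a path is an executable file, which is what ExecStart needs as its first argument.

```shell
#!/bin/sh
# Sketch: report whether a path points at an executable (non-directory) file.
check_exec() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        echo "ok: $1"
    else
        echo "not executable or missing: $1"
    fi
}

# Path taken from the unit file in the question.
check_exec /usr/local/bin/node
```

If it reports a problem, command -v node (or which node) in your normal shell prints the path to put into ExecStart instead.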