Proper way to keep a process running in a container - linux

I'm not sure whether this counts as a duplicate, since it concerns a specific case.
I have created a Docker-outside-of-Docker image for my Jenkins agent that restarts the agent automatically without using supervisor (which lacks Python 3.7 support). Because I'm using openjdk:slim as the base image and don't want to install any additional dependencies, I compensate for the lack of tools such as lsof and ps by writing the PID of the started process to a file and then checking whether /proc/<pid>/status exists to decide if the process is still running. This currently works, and it is the main reason I built this solution for auto-starting the agents.
But my question is: is this the best or most appropriate approach?
Here is the implementation:
#!/bin/bash
set -e
agent_runner() {
    while :
    do
        if [ ! -f "/proc/$(cat /tmp/agent.pid)/status" ]
        then
            curl $JNLP_AGENT_DOWNLOAD_URL -o agent.jar
            java \
                -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 \
                -Dhttps.protocols=TLSv1.2 \
                -jar agent.jar \
                -jnlpUrl $JNLP_AGENT_URL \
                -secret $JENKINS_SECRET \
                -workDir "$JENKINS_WORKDIR" &
            echo $! > /tmp/agent.pid
        else
            :
        fi
        sleep 10
    done
}
while :
do
    if [ cat < /dev/tcp/"$TARGET" ]; then
        echo "Starting Agent"
        agent_runner
    else
        echo "Jenkins master is offline, waiting...."
    fi
    sleep 10
done
Link for the repository: https://github.com/thcp/jenkins-agent-dod

If the main process in the container dies, you should let the container die with it.
Docker and the various layers above it have functionality to restart whole containers. There is a docker run --restart option for the basic Docker CLI, an equivalent Docker Compose option, and restarting dying containers after some backoff is the default behavior for Kubernetes pods.
So, if you just let a container die on its own, you get out-of-the-box support for the container engine to restart it, without adding any special support to your image; just set the CMD to the thing you actually need the container to do and go. This approach also has the benefit that if you detect your environment has become unstable (“I depend on a database and it’s unreachable”), the process can choose to abort and be restarted later, when hopefully the environment has improved.
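As a rough sketch of what that could look like for the agent in the question, the entrypoint below checks the master once and then replaces itself with the Java process, so the container exits whenever the agent does. It assumes TARGET has the host/port form that the question's /dev/tcp check implies; the function names tcp_reachable and run_agent are made up for illustration, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/bash
# Sketch of an entrypoint that does no supervision of its own.
# tcp_reachable and run_agent are illustrative names, not part of any tool.

# Succeed if a TCP connection to "host/port" opens within 5 seconds.
# Uses bash's built-in /dev/tcp, so no extra packages are needed.
tcp_reachable() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1"' _ "$1" 2>/dev/null
}

# Fail fast when the master is down; otherwise become the agent process.
run_agent() {
    if ! tcp_reachable "$TARGET"; then
        echo "Jenkins master is offline, exiting" >&2
        return 1
    fi
    curl -fsS "$JNLP_AGENT_DOWNLOAD_URL" -o agent.jar || return 1
    # exec replaces the shell, so the container stops when java stops.
    exec java \
        -Dhttps.protocols=TLSv1.2 \
        -jar agent.jar \
        -jnlpUrl "$JNLP_AGENT_URL" \
        -secret "$JENKINS_SECRET" \
        -workDir "$JENKINS_WORKDIR"
}
```

The script would end with a call to run_agent, and the container would be started with a restart policy such as docker run --restart unless-stopped, so that the engine rather than the script handles retries.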

Related

Stopping subprocess ran by exec.command in golang

This might seem like a question that has been asked many times, but I couldn't find a similar one, so I'm asking here.
I have a Go program named abc.go with two functions that run and stop the script someScript.sh. Run() and Stop() are called when an API endpoint is hit. I run abc.go with the command sudo go run abc.go someFolder/someScript.sh, passing the path to someScript.sh as an argument. In Stop(), I save the process group ID and then kill the whole process group.
But when I call Run() and then Stop(), I get this output:
pid=5844 duration=13.667µs err=exec: already started
and the running Docker container is not actually stopped (I check with docker container ls -a).
The someScript.sh file is:
#!/bin/bash
docker container run --rm --name someContainerName nginx
The abc.go file is:
func Run() {
    someVar = true
    execCMD = exec.Command("/bin/sh", "-c", commandFromTerminal)
    output, err = execCMD.CombinedOutput()
    fmt.Println("Output()=", bp.Output())
    someVar = false
}
func Stop() {
    execCMD.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
    start := time.Now()
    syscall.Kill(-execCMD.Process.Pid, syscall.SIGKILL)
    err := execCMD.Run()
    fmt.Printf("pid=%d duration=%s err=%s\n", execCMD.Process.Pid, time.Since(start),
        err)
}
As I understand it, the docker command in someScript.sh did not run the container as a child or grandchild of /bin/bash but as a separate process, which the code in my Stop() is unable to actually stop.
Below is a flow diagram of my understanding: abc.go internally calls /bin/bash, which runs sudo as its child, and sudo in turn has someScript.sh as a child. Finally there is the Docker container, which does not run as a child or grandchild in that hierarchy but as a different process.
My question, finally, is how to stop this Docker container when Stop() is called, or how to make the container run as a descendant in that hierarchy so that I can kill it using the process group ID method I used above.
PS: I have also tried
err := execCMD.Process.Kill()
if err != nil {
panic(err.Error())
}
execCMD.Process.Release()
but it too didn't help.
docker is just a client for the docker daemon. docker run simply sends a few HTTP requests to the daemon, and the daemon sets up the container and executes it.
So docker run is a grandchild of your Go program, but the nginx processes are descendants of the Docker daemon, and entirely unrelated to your Go program. Mind you, the docker daemon can even be on a different machine, in principle at least.
That being said,
Assigning SysProcAttr after a process has been started has no effect.
You're calling Run in Stop (very suspicious) and you cannot Run a process that has already been started, even after it terminated.
Sending SIGKILL gives docker run no chance to terminate the container. After fixing the other errors, it's possible that the docker daemon takes care of the cleanup due to the --rm flag (I forget how this works, exactly). If not, send SIGTERM instead.
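Since the container belongs to the daemon rather than to the Go process, one option is to ask the daemon to stop it by name instead of signalling the client. A minimal sketch, reusing the container name from the question; the DOCKER variable is a hypothetical hook that lets the helpers be dry-run with DOCKER=echo, and the function names are illustrative.

```shell
#!/bin/bash
# DOCKER is a hypothetical override so the helpers can be dry-run (DOCKER=echo).
DOCKER="${DOCKER:-docker}"

# Start the container in the foreground under a known name.
run_container() {
    "$DOCKER" container run --rm --name "$1" "$2"
}

# Ask the daemon to stop the container by name. The daemon signals the
# container's main process itself, so process groups do not matter here.
stop_container() {
    "$DOCKER" container stop "$1"
}
```

From Go, Stop() could run the equivalent of stop_container someContainerName through a fresh exec.Command rather than reusing the command object that started the container.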

Databricks init scripts not working sometimes

OK, this is very strange. I have some init scripts that I would like to run when a cluster starts.
The cluster has an init script, which is in a file in DBFS,
basically this:
dbfs:/databricks/init-scripts/custom-cert.sh
Now, when I create the init script like this, it works (no SSL errors for my endpoints). Also, the event log for the cluster shows a duration of 1 second for the init script.
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I just put the init script in a bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes, according to the event log, but the execution duration is 0 seconds.
I have the sh script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/, it does not contain /dbfs/orgcertificates/orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and, at least to the naked eye, I can't see any difference,
i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both the scenarios. What is the problem with the 2nd case?
EDIT: I read a bit more about init scripts and found that the logs of init scripts are written here
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is update-ca-certificates found in the first case but not when I put the same script in an sh file and upload it to DBFS (instead of executing dbutils.fs.put within a notebook)?
EDIT 2: In response to the first answer. After running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh. I then restart the cluster with the init script location set to dbfs:/databricks/init-scripts/custom-cert.sh, and it works. So the init script is reading essentially the same content (the generated sh script). Why can't it read the script if I don't use dbutils.fs.put but instead put the contents in a bash file and upload it during the CI/CD process?
As we know, an init script is a shell script that runs during startup of each cluster node, before the Apache Spark driver or worker JVM starts.
Case 2: when you run a bash command with the %sh magic command, you are executing it on the local driver node only, so the worker nodes cannot access the result.
Case 1: when you use the %fs magic command (dbutils.fs.put), the copy runs against the DBFS root, so the worker nodes can access the path as well as the driver node.
Ref: https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram
It seems that the observations I made in the comments section of my question are the way to go.
I now create the init script using a Databricks job that I run during the CI/CD pipeline from Azure DevOps.
The notebook has the commands
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job (pointing to this notebook); the cluster is a job cluster, which is only temporary. In my case, even this job creation is automated using a PowerShell script.
I then call this Databricks job in the release pipeline, again using a PowerShell script.
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I do not know (and still do not understand) why the same script file can't just be part of a repo and uploaded during the release process (instead of having this Databricks job call a notebook). I would love to know the reason. The other answer to this question does not hold, since, as you can see, the init script is created by a job cluster and then used by another cluster as its init script.
It simply boils down to how the init script gets created.
But I get my job done. Just if it helps someone get their job done too.
I have raised a support case though to understand the reason.
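One thing the error output in the question would be consistent with, although nothing in this thread confirms it, is Windows-style line endings in the file uploaded by the pipeline. The shell would then look for a command whose name ends in a carriage return, and cat would display both files identically. A minimal check under that assumption; the function names are illustrative.

```shell
#!/bin/bash
# Count the lines in a file that end with a carriage return (CRLF endings).
count_crlf_lines() {
    grep -c $'\r$' "$1" || true    # grep exits 1 when the count is 0
}

# Print the file with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

Running count_crlf_lines /dbfs/databricks/init-scripts/custom-cert.sh from a %sh cell against both versions of the script would show whether they really are byte-for-byte the same.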

Docker - unable to run script

What I'm doing
I am using AWS Batch to run a Docker container for a large compute job. To the best of my knowledge I have configured ECR/ECS successfully, but I am having trouble running the required commands, for reasons beyond my understanding of Docker (I'm a newbie).
What I need to do is pass the commands below into my application and start it to perform some heavy computing tasks; all the commands listed below must be present.
The Issue(s)
The issue arises when I submit the job to AWS Batch; the service pulls the image from ECR (Amazon Elastic Container Registry) and spins up a compute environment. The problem occurs when it tries to run the command I pass in, which I will go through below.
"command": [
"mkdir -p logging",
"chmod 777 logging/",
"docker run -t -i -e my-application", # container name
"-e APIKEY",
"-e BASEURI",
"-e APIUSER",
"-v WORKSPACE /logging:/src/log",
"DOCKERIMAGE",
"python my_app.py",
"-t APP_USER",
"-e APP_ENVIRONMENT",
"-u APP_USERNAME",
"-p APP_PASSWORD",
"-i IN_PATH",
"-o OUT_PATH",
"-b tmp/"
]
The command above generates the following error(s)
container_linux.go:370: starting container process caused: exec: "mkdir -p log": executable file not found in $PATH
I tried passing in a command to echo the $PATH env var, but I could not get a response and it resulted in a similar error.
I have successfully run "ls" and was able to see the directory contents of my application.
However, I am not able to run any of the commands that I included in the command [] section. I have tried running just python and the like in the hope of getting a more detailed error, but without success.
Logic in plain English
Create a path called logging if it doesn't exist
Set the permissions for logging
Run the docker container and pass in the environment variables while doing so
Tell docker to run the python file my_app.py and pass in the expected runtime args
Execute and perform the required logic delegated to the python3 application
Questions
Why can I not create a directory called "logging" here? Where am I?
Am I running these properly as defined by AWS Batch or Docker?
What am I missing, or where am I going wrong?
AWS Batch high level doc
AWS Batch link specific to what i'm doing
Assuming that you're following the syntax described in the Container Properties section of the AWS docs, you have several problems with the syntax of your command directive.
First
The command directive can only run a single command. You can't mash together a bunch of commands as you're trying to do in your example. If you need to run multiple commands you would need to embed them as an argument to a shell. For example, something like:
command: ["/bin/sh", "-c", "mkdir -p logging; chmod 777 logging; ..."]
Second
You must properly tokenize your command lines -- that is, when you type mkdir -p logging at the command prompt, the shell splits this into three parts (or "tokens"): ['mkdir', '-p', 'logging']. You need to do the same thing when building up the list of arguments to command.
This is invalid:
command: ["mkdir -p logging"]
That would look for a command named mkdir -p logging, and of course no such command exists. It would properly be written as:
command: ["mkdir", "-p", "logging"]
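The difference between one token and three can be seen directly in a shell. The small sketch below uses a made-up helper, show_tokens, that prints each argument it receives in brackets; a quoted string arrives as a single argument, just as a single list element in command does.

```shell
#!/bin/bash
# Print every argument on its own line, in brackets, so that the token
# boundaries are visible. show_tokens is an illustrative helper.
show_tokens() {
    printf '[%s]\n' "$@"
}

show_tokens "mkdir -p logging"   # one token: the whole string is one argument
show_tokens mkdir -p logging     # three tokens: command name plus two arguments
```

The first call prints [mkdir -p logging] on one line; the second prints [mkdir], [-p] and [logging] on three separate lines.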
Third
I'm not very familiar with the AWS Batch environment, but it's unlikely you can run a docker command inside a docker container as you're trying to do. It's unclear why you're doing this, though: why not just configure your AWS Batch job with the appropriate image, environment variables, etc.?
Take a look at some of these example job definitions.

How to run airflow with CeleryExecutor on a custom docker image

I am adding Airflow to a web application that manually adds a directory containing business logic to the PYTHON_PATH env var and also does additional system-level setup that I want to be consistent across all servers in my cluster. I've been successfully running Celery for this application, with RMQ as the broker and Redis as the task results backend, for a while, and I have prior experience running Airflow with LocalExecutor.
Instead of using Puckel's image, I have an entry point for a base backend image that runs a different service based on the SERVICE env var. It looks like this:
if [ $SERVICE == "api" ]; then
    # upgrade to the data model
    flask db upgrade
    # start the web application
    python wsgi.py
fi
if [ $SERVICE == "worker" ]; then
    celery -A tasks.celery.celery worker --loglevel=info --uid=nobody
fi
if [ $SERVICE == "scheduler" ]; then
    celery -A tasks.celery.celery beat --loglevel=info
fi
if [ $SERVICE == "airflow" ]; then
    airflow initdb
    airflow scheduler
    airflow webserver
fi
I have an .env file that I build the containers with, which defines my Airflow parameters:
AIRFLOW_HOME=/home/backend/airflow
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql+pymysql://${MYSQL_USER}:${MYSQL_ROOT_PASSWORD}@${MYSQL_HOST}:${MYSQL_PORT}/airflow?charset=utf8mb4
AIRFLOW__CELERY__BROKER_URL=amqp://${RABBITMQ_DEFAULT_USER}:${RABBITMQ_DEFAULT_PASS}@${RABBITMQ_HOST}:5672
AIRFLOW__CELERY__RESULT_BACKEND=redis://${REDIS_HOST}
With my entrypoint set up this way, it never reaches the webserver. Instead, it runs the scheduler in the foreground without ever invoking the web server. I can change this to
airflow initdb
airflow scheduler -D
airflow webserver
Now the webserver runs, but it isn't aware of the scheduler, which is now running as a daemon:
Airflow does, however, know that I'm using a CeleryExecutor and looks for the dags in the right place:
airflow | [2020-07-29 21:48:35,006] {default_celery.py:88} WARNING - You have configured a result_backend of redis://redis, it is highly recommended to use an alternative result_backend (i.e. a database).
airflow | [2020-07-29 21:48:35,010] {__init__.py:50} INFO - Using executor CeleryExecutor
airflow | [2020-07-29 21:48:35,010] {dagbag.py:396} INFO - Filling up the DagBag from /home/backend/airflow/dags
airflow | [2020-07-29 21:48:35,113] {default_celery.py:88} WARNING - You have configured a result_backend of redis://redis, it is highly recommended to use an alternative result_backend (i.e. a database).
I can solve this by going inside the container and manually firing up the scheduler:
The trick seems to be running both processes in the foreground within the container, but I'm stuck on how to do that inside the entrypoint. I've checked out Puckel's entrypoint code, but it's not obvious to me what it is doing. I'm sure that with just a slight tweak this will be off to the races... Thanks in advance for the help. Also, if there's any major anti-pattern that I'm at risk of running into here, I'd love to get the feedback so that I can implement Airflow properly. This is my first time implementing CeleryExecutor, and there's a decent amount involved.
Try using nohup: https://en.wikipedia.org/wiki/Nohup
nohup airflow scheduler >scheduler.log &
In your case, you would update your entrypoint as follows:
if [ $SERVICE == "airflow" ]; then
    airflow initdb
    nohup airflow scheduler > scheduler.log &
    nohup airflow webserver
fi
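With nohup the container keeps running as long as the webserver does, even if the scheduler has died in the background. A possible alternative, not taken from the answer above, is to start both and exit as soon as either one stops, so that the container engine can restart the whole thing. The sketch assumes bash 4.3 or later for wait -n; run_pair is a hypothetical helper name.

```shell
#!/bin/bash
# run_pair is a hypothetical helper: it starts two commands and returns as
# soon as either one exits, passing that command's exit status back.
run_pair() {
    bash -c "$1" &
    bash -c "$2" &
    wait -n                    # bash 4.3+: wait for whichever job ends first
    local status=$?
    # Stop whatever is still running before returning.
    kill $(jobs -p) 2>/dev/null || true
    return "$status"
}
```

In the entrypoint from the question this would be called as run_pair "airflow scheduler" "airflow webserver" after airflow initdb. Running the scheduler and the webserver as two separate containers from the same image avoids the problem altogether.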

Azure IoT Edge Module - Backoff State

I am trying to push a custom Docker image (not C/C#) to an Azure IoT Edge device from Azure IoT Hub. The Docker image runs without exiting when run manually, e.g. docker run -itd is perfectly fine. When the module is published via IoT Hub, it continually shows a status of backoff and keeps restarting. The full Dockerfile is as follows:
FROM alpine
RUN apk -U -u add sqlite && \
mkdir -p /db && \
rm -rf /var/lib/apt/lists/*
#RUN /usr/bin/sqlite3 /db/arf.sqlite
CMD /bin/sh
The custom create options are as follows:
{
"Env": [],
"HostConfig": {
"Binds": [
"/work:/db"
]
}
}
There are no specific module twin setting and hence I am passing it as
{}
I am attaching a screen shot that (hopefully) explains this better.
I figured this out. When running manually, I was using the -itd flags, which keep an interactive terminal attached while the container runs detached. When published through IoT Hub, the module ran /bin/sh as specified in CMD and exited.
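The difference can be reproduced without Docker. The sketch below assumes that a container started without -i sees an empty standard input, which is approximated here by redirecting from /dev/null; the shell reads end-of-file straight away and exits.

```shell
#!/bin/sh
# A shell whose standard input is already at end-of-file has no commands
# to read, so it returns immediately instead of waiting for input.
/bin/sh </dev/null
echo "sh exited with status $?"
```

This prints "sh exited with status 0", which matches a module that starts, exits cleanly and is then restarted over and over.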
Cruel Workaround:
Add a run.sh that just does nothing. I hate this solution - but it works.
while :; do
sleep 1000
done
What would be nice
Is it possible to specify anywhere in the IoT module metadata that the module should run in daemon mode, so that the edge device passes -d when starting it?

Resources