How to run airflow with CeleryExecutor on a custom docker image - python-3.x

I am adding airflow to a web application that manually adds a directory containing business logic to the PYTHONPATH env var, as well as does additional system-level setup that I want to be consistent across all servers in my cluster. I've been successfully running celery for this application with RMQ as the broker and redis as the task results backend for a while, and have prior experience running Airflow with LocalExecutor.
Instead of using Puckel's image, I have an entry point for a base backend image that runs a different service based on the SERVICE env var. That looks like this:
if [ $SERVICE == "api" ]; then
# upgrade to the data model
flask db upgrade
# start the web application
python wsgi.py
fi
if [ $SERVICE == "worker" ]; then
celery -A tasks.celery.celery worker --loglevel=info --uid=nobody
fi
if [ $SERVICE == "scheduler" ]; then
celery -A tasks.celery.celery beat --loglevel=info
fi
if [ $SERVICE == "airflow" ]; then
airflow initdb
airflow scheduler
airflow webserver
I have an .env file that I build the containers with, which defines my airflow parameters:
AIRFLOW_HOME=/home/backend/airflow
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql+pymysql://${MYSQL_USER}:${MYSQL_ROOT_PASSWORD}@${MYSQL_HOST}:${MYSQL_PORT}/airflow?charset=utf8mb4
AIRFLOW__CELERY__BROKER_URL=amqp://${RABBITMQ_DEFAULT_USER}:${RABBITMQ_DEFAULT_PASS}@${RABBITMQ_HOST}:5672
AIRFLOW__CELERY__RESULT_BACKEND=redis://${REDIS_HOST}
With how my entrypoint is set up currently, it doesn't make it to the webserver. Instead, it runs the scheduler in the foreground without ever invoking the web server. I can change this to:
airflow initdb
airflow scheduler -D
airflow webserver
Now the webserver runs, but it isn't aware of the scheduler, which is now running as a daemon.
Airflow does, however, know that I'm using a CeleryExecutor and looks for the dags in the right place:
airflow | [2020-07-29 21:48:35,006] {default_celery.py:88} WARNING - You have configured a result_backend of redis://redis, it is highly recommended to use an alternative result_backend (i.e. a database).
airflow | [2020-07-29 21:48:35,010] {__init__.py:50} INFO - Using executor CeleryExecutor
airflow | [2020-07-29 21:48:35,010] {dagbag.py:396} INFO - Filling up the DagBag from /home/backend/airflow/dags
airflow | [2020-07-29 21:48:35,113] {default_celery.py:88} WARNING - You have configured a result_backend of redis://redis, it is highly recommended to use an alternative result_backend (i.e. a database).
I can solve this by going inside the container and manually firing up the scheduler.
The trick seems to be running both processes in the foreground within the container, but I'm stuck on how to do that inside the entrypoint. I've checked out Puckel's entrypoint code, but it's not obvious to me what he's doing. I'm sure that with just a slight tweak this will be off to the races... Thanks in advance for the help. Also, if there's any major anti-pattern that I'm at risk of running into here, I'd love to get the feedback so that I can implement airflow properly. This is my first time implementing CeleryExecutor, and there's a decent amount involved.

Try using nohup: https://en.wikipedia.org/wiki/Nohup
nohup airflow scheduler > scheduler.log &
In your case, you would update your entrypoint as follows:
if [ $SERVICE == "airflow" ]; then
airflow initdb
nohup airflow scheduler > scheduler.log &
nohup airflow webserver
fi
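A variant of the same idea (a sketch, not part of the original answer) avoids nohup by keeping the webserver as the container's foreground process, so the container stops, and can be restarted by Docker, if the webserver dies:
if [ $SERVICE == "airflow" ]; then
    airflow initdb
    # scheduler in the background, logging to a file under AIRFLOW_HOME
    airflow scheduler > "$AIRFLOW_HOME/scheduler.log" 2>&1 &
    # webserver in the foreground as the container's main process
    exec airflow webserver
fi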

Related

Keycloak 18 in Gitlab Service sometimes does not load realm (without error)

I am wondering if anyone would know about this problem: I am starting Keycloak as a Gitlab service in order to run integration tests in a pipeline, using the "--import-realm" option. It works very well locally, and it works some of the time in Gitlab. However, sometimes (I'd say a little more than 50% of the time), the realm is simply not imported, without any error message (and then, of course, my test fails).
Here is my job description:
integration-tests-common:
variables:
FF_NETWORK_PER_BUILD: "true"
KEYCLOAK_DATA_IMPORT_DIR: /builds/js-dev/myproject/Keycloak-testapp/data
KEYCLOAK_ADMIN: admin
KEYCLOAK_ADMIN_PASSWORD: admin
KC_HTTPS_CERTIFICATE_FILE: /opt/keycloak/certificates/keycloak.crt.pem
KC_HTTPS_CERTIFICATE_KEY_FILE: /opt/keycloak/certificates/keycloak.key.pem
services:
#(custom image below is based on quay.io/keycloak/keycloak:18.0.2)
- name: myinternalrepo/mykeycloakimage:mytag
alias: keycloak
command: ["start-dev","--import-realm", "--health-enabled=true", "--http-port=8089","--log=console,file"]
script:
# Before E2E tests: First wait for keycloak
- |
set -x
count=0;
while [ "$(curl -s -o /dev/null -w '%{http_code}' http://keycloak:8089/health )" != "200" ]
do
echo "waiting for Keycloak..."
sleep 1;
let count=count+1;
if [ $count -gt 100 ]
then
echo "Keycloak is not starting, exiting"
exit 1;
fi
done
echo "Keycloak is UP after $count retries"
set +x
#... (the rest is my integration test)
KEYCLOAK_DATA_IMPORT_DIR is used by a custom entrypoint to create a symbolic link to /opt/keycloak/data/import (since I cannot mount a volume for a Gitlab service, as far as I know):
ln -s $KEYCLOAK_DATA_IMPORT_DIR /opt/keycloak/data/import
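For context, the custom entrypoint boils down to something like the following (a sketch rather than the verbatim script; the hand-off to kc.sh mirrors what the stock Keycloak 18 image's entrypoint does):
#!/bin/bash
# Link the realm exports from the CI checkout into Keycloak's import directory,
# then hand over to the stock Keycloak entrypoint with whatever args Gitlab passes.
ln -s "$KEYCLOAK_DATA_IMPORT_DIR" /opt/keycloak/data/import
exec /opt/keycloak/bin/kc.sh "$@"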
In working cases, I have this log:
2022-08-02 05:46:14,468 INFO [org.keycloak.services] (main) KC-SERVICES0050: Initializing master realm
2022-08-02 05:46:19,869 INFO [org.keycloak.services] (main) KC-SERVICES0004: Imported realm test from file /opt/keycloak/bin/../data/import/realm-export.json.
2022-08-02 05:46:20,232 INFO [org.keycloak.services] (main) KC-SERVICES0009: Added user 'admin' to realm 'master'
But in other cases, the log shows no error, it continues as if the import option was not given:
2022-08-02 06:04:14,230 INFO [org.keycloak.services] (main) KC-SERVICES0050: Initializing master realm
2022-08-02 06:04:18,220 INFO [org.keycloak.services] (main) KC-SERVICES0009: Added user 'admin' to realm 'master'
I have also added an nginx to the custom Keycloak image to expose the Keycloak logs (because it's difficult to get full logs from Gitlab services otherwise!), but I couldn't find anything more in them.
I don't know if this is a problem with my custom entrypoint and the symbolic link, with Keycloak, or related to Gitlab services... All I know is that when it fails I retry the job, sometimes multiple times, and usually it eventually works. Any help would be appreciated.
By adding a "ls" in my custom Keycloak image entrypoint, I noticed that the Gitlab project files are not present in the error cases. So this is more a Gitlab services issue than a Keycloak issue.
In addition, it is not clear from the Gitlab services doc (https://docs.gitlab.com/ee/ci/services/) if they are supposed to access the project files or not. I had assumed so, because I made a test which worked. But finally, the solution was to integrate my realm's file into my base docker image, and not rely on the files from the repo.
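A minimal sketch of that fix, assuming the realm export sits at the root of the image build context (the file name matches the one in the working log above):
# based on the image already used for the service
FROM quay.io/keycloak/keycloak:18.0.2
# Bake the realm into the image so --import-realm no longer depends on
# the Gitlab project files being visible to the service container.
COPY realm-export.json /opt/keycloak/data/import/realm-export.json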

Databricks init scripts not working sometimes

OK, it is very strange. I have some init scripts that I would like to run when a cluster starts.
The cluster is configured with the init script, which is in a file (in DBFS),
basically this:
dbfs:/databricks/init-scripts/custom-cert.sh
Now, when I create the init script like this, it works (no SSL errors for my endpoints). Also, the event log for the cluster shows the duration as 1 second for the init script:
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I just put the init script in a bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes, as per the event log, but the execution duration is 0 seconds.
I have the sh script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/, it does not contain /dbfs/orgcertificates/orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and, at least to the naked eye, I can't see any difference,
i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both scenarios. What is the problem in the second case?
EDIT: I read a bit more about init scripts and found that the logs of init scripts are written here
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is it that update-ca-certificates is found in the first case, but not when I put the same script in an sh file and upload it to DBFS (instead of executing dbutils.fs.put within a notebook)?
EDIT 2: In response to the first answer: after running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh, and when I restart the cluster with the init script location set to dbfs:/databricks/init-scripts/custom-cert.sh it works. So it is essentially the same content that the init script is reading (the generated sh script). Why can't it read it if I don't use dbutils.fs.put, but instead just put the contents in a bash file and upload it during the CI/CD process?
As we are aware, an init script is a shell script that runs during the startup of each cluster node, before the Apache Spark driver or worker JVM starts. In case 2, when you run a bash command via the %sh magic command, you are executing it only on the local driver node, so the worker nodes are not able to access the result. In case 1, by using dbutils.fs.put you are writing the file to the DBFS root, so the worker nodes can access that path along with the driver node.
Ref: https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram
It seems that the observation I made in the comments section of my question is the way to go.
I now create the init script using a databricks job that I run during the CI/CD pipeline from Azure DevOps.
The notebook has the commands
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job (pointing to this notebook); the cluster is a job cluster, which is just temporary. Of course, in my case, even this job creation is automated using a PowerShell script.
I then call this Databricks job in the release pipeline, again using a PowerShell script.
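For reference, triggering such a job from a pipeline is essentially one call to the Jobs API; the sketch below uses curl rather than the PowerShell actually used here, and the job ID, host, and token are placeholders, not values from this setup:
# trigger the notebook job that (re)creates the init script in DBFS
curl -s -X POST "https://$DATABRICKS_HOST/api/2.1/jobs/run-now" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"job_id": 123}'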
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I do not know (or still do not understand) why the same script file can't simply be part of a repo and uploaded during the release process, instead of having this Databricks job call a notebook. I would love to know the reason. The other answer on this question does not hold true, as you can see: the init script is created by a job cluster and then accessed from another cluster as part of its init script.
It simply boils down to how the init script gets created.
But I get my job done, and maybe this helps someone get their job done too.
I have raised a support case, though, to understand the reason.

Proper way for keep process running on a container

I'm not sure whether this could be considered a duplicate, since it's a problem for a specific case.
Currently, I have created a docker-outside-of-docker image for handling my Jenkins agent, which performs auto restarts without using supervisor as a solution (due to the lack of Python 3.7 support). Since I'm using openjdk:slim as the base image and I don't want to install any additional dependencies, I opted to compensate for the lack of tools like lsof and ps (or anything else for checking whether a process is running) by writing the started process's PID to a file, which is then used to check whether the process exists under the path /proc/<pid>/status. Currently this works, and it is the main reason for creating this solution for handling the auto start of the agents.
But my question is: is this the best or most appropriate approach?
Please find the following code with the implementation:
#!/bin/bash
set -e
agent_runner() {
while :
do
if [ ! -f "/proc/$(cat /tmp/agent.pid)/status" ]
then
curl $JNLP_AGENT_DOWNLOAD_URL -o agent.jar
java \
-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 \
-Dhttps.protocols=TLSv1.2 \
-jar agent.jar \
-jnlpUrl $JNLP_AGENT_URL \
-secret $JENKINS_SECRET \
-workDir "$JENKINS_WORKDIR" &
echo $! > /tmp/agent.pid
else
:
fi
sleep 10
done
}
while :
do
# crude connectivity check: the redirection from /dev/tcp/$TARGET fails when the Jenkins master is unreachable
if [ cat < /dev/tcp/"$TARGET" ]; then
echo "Starting Agent"
agent_runner
else
echo "Jenkins master is offline, waiting...."
fi
sleep 10
done
Link for the repository: https://github.com/thcp/jenkins-agent-dod
If the main process in the container dies, you should let the container die with it.
Docker and the various layers above it have functionality to restart whole containers. There is a docker run --restart option for the basic Docker CLI, an equivalent Docker Compose option, and restarting dying containers after some backoff is the default behavior for Kubernetes pods.
So, if you just let a container die on its own, you’ll have out-of-the-box support for the container engine to restart itself, without adding any special support into your image; just set the CMD to the thing you actually need the container to do and go. This approach also has the benefit that if you detect your environment has become unstable (“I depend on a database and it’s unreachable”) the process can choose to abort itself and let it be restarted later when hopefully the environment has improved.
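As a sketch of what that could look like for the agent in the question (assuming the same environment variables; this replaces the polling loop rather than reproducing the original script):
#!/bin/bash
# entrypoint.sh: run the agent as the container's main process and exit if it dies
set -e
curl "$JNLP_AGENT_DOWNLOAD_URL" -o agent.jar
exec java \
    -Dhttps.protocols=TLSv1.2 \
    -jar agent.jar \
    -jnlpUrl "$JNLP_AGENT_URL" \
    -secret "$JENKINS_SECRET" \
    -workDir "$JENKINS_WORKDIR"
The container engine then handles retries, for example with docker run --restart=on-failure my-jenkins-agent.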

Specifying Parallel Environment on Google Compute Engine using Elasticluster

I recently created a Grid Engine cluster on Compute Engine using Elasticluster (http://googlegenomics.readthedocs.org/en/latest/use_cases/setup_gridengine_cluster_on_compute_engine/index.html).
I was wondering what the appropriate command is to run shared-memory multithreaded batch jobs on a cluster of Compute Engine virtual machines running Grid Engine.
In other words, what is the name (i.e. pe_name) of the Grid Engine parallel environment?
Let's say I want to run a job requesting 4 CPUs on 1 node; what would be the right qsub command?
So far I tried the following command:
qsub -cwd -l h_vmem=800G -pe smp 6 run.sh
Unable to run job: job rejected: the requested parallel environment "smp" does not exist.
qsub -cwd -l h_vmem=800G -pe omp 6 run.sh
Unable to run job: job rejected: the requested parallel environment "omp" does not exist.
Thank you for your help!
I don't believe that Elasticluster's Ansible playbook includes a parallel environment. You can see the main configuration run on the master here:
https://github.com/gc3-uzh-ch/elasticluster/blob/master/elasticluster/providers/ansible-playbooks/roles/gridengine/tasks/master.yml
I believe you can simply connect to the master and issue the "add parallel environment" command:
$ qconf -ap smp
and write a configuration file like:
pe_name smp
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
and then modify the queue configuration for all.q:
$ qconf -mq all.q
...
pe_list make smp
...
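Once the PE is attached to all.q, the qsub from the question should then be accepted, e.g.:
# request 4 slots in the new parallel environment
qsub -cwd -l h_vmem=800G -pe smp 4 run.sh
One note (my addition, based on standard Grid Engine behavior rather than this particular setup): with allocation_rule $fill_up the slots may be spread across hosts; if all slots must land on a single node, $pe_slots is the allocation rule that enforces that.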
I would also suggest filing an issue with Elasticluster here:
https://github.com/gc3-uzh-ch/elasticluster/issues
I would expect that someone has already done this in a fork of Elasticluster and may be able to provide a pull request to the master fork.
Hope that helps.
-Matt

Cannot run nodejs app and mongo within a docker container

I'm setting up a container with the following Dockerfile
# Start with project/baseline => image with mongo / nodejs / sailsjs
FROM project/baseline
# Create folder that will contain all the sources
RUN mkdir -p /var/project
# Load the configuration file and the deployment script
ADD init.sh /var/project/init.sh
# src contains a list of folders, each one being a sails app
ADD src/ /var/project/
# Compile the sources / run the services / run mongodb
CMD /var/project/init.sh
The init.sh script is called when the container runs.
It should start a couple of web apps and mongodb.
#!/bin/bash
PROJECT_PATH=/var/project
# Start mongodb
function start_mongo {
mongod --fork --logpath /var/log/mongodb.log # attempt to have mongo running in daemon
}
# Start services
function start {
for service in $(ls);do
cd $PROJECT_PATH/$service
npm start # Runs sails lift on each service
done
}
# start mongodb
start_mongo
# start web applications defined in /var/project
start
Basically, there are a couple of nodejs (sailsjs) applications in /var/project.
When I run the container, I got the following message:
$ sudo docker run -t -i projects/test
about to fork child process, waiting until server is ready for connections.
forked process: 10
and then it remains stuck.
How can mongo and the sails processes be started while keeping the container in a running state?
UPDATE
I now use this supervisord.conf file
[supervisord]
nodaemon=false
[program:mongodb]
command=/usr/bin/mongod
[program:process1]
command=/bin/bash "cd /var/project/service1 && node app.js"
[program:process2]
command=/bin/bash "cd /var/project/service2 && node app.js"
it is called in the Dockerfile like:
# run the applications (mongodb + project related services)
CMD ["/usr/bin/supervisord"]
As my services are dependent upon mongo starting correctly, supervisord does not wait that long, and so the services are not started. Any idea how to solve that?
By the way, is it even a best practice to run mongo in the same container?
UPDATE 2
I went back to a service.sh script that is called when the container runs. I know this is not clean (but I'll say it's temporary while I fix the problem I have with supervisord), but I'm doing the following:
run nohup mongod &
wait 60 sec
run my node (forever) processes
The thing is, the container exits right after the forever processes are run... how can it be kept active?
If you want to cleanly start multiple services inside a container, one option is to use a process supervisor of some sort. One approach is documented here, in the official Docker documentation.
I've done something similar using runit. You can see my base runit image here, and a multi-service application image using that here.
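Not part of either answer, but as a sketch of how the supervisord.conf from the question could be adjusted (directory= and nodaemon=true are the key changes; the paths are taken from the question):
[supervisord]
; supervisord itself must stay in the foreground when it is the container's CMD
nodaemon=true

[program:mongodb]
command=/usr/bin/mongod

[program:process1]
directory=/var/project/service1
command=node app.js

[program:process2]
directory=/var/project/service2
command=node app.js
Note that supervisord's priority setting only orders startup; it does not wait for mongo to be ready, so the node apps should still retry their database connection on startup.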
