Airflow lowering performance? (Linux)

I'm using Apache Airflow (1.9) to run a bash script (which starts Logstash to query data from a database and transfers this data into ElasticSearch).
When I run my script from Airflow, it takes about 95 minutes to complete. When I run the exact same script from the terminal (on the same machine) this task takes 65 minutes to complete.
I really can't figure out what's going on. I'm running a very simple instance of Airflow, using the LocalExecutor. My DAG is really simple, as shown below; I'm not using anything fancy such as Variables, XComs, or the like.
My DAG:
import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'Me',
    'start_date': dt.datetime(2018, 4, 12),
    'retries': 2,
    'retry_delay': dt.timedelta(minutes=1),
    'sla': dt.timedelta(hours=2),
    'depends_on_past': False,
}

syncQDag = DAG('q13', catchup=False, default_args=default_args,
               schedule_interval='0 1 * * *', concurrency=1)

l13 = BashOperator(
    task_id='q13',
    bash_command='sudo -H -u root /usr/share/logstash/bin/logstash --path.settings /etc/logstash-manual -f /etc/logstash/conf.d/q13.conf',
    dag=syncQDag,
)
Is there any clear reason why a job initiated through Airflow performs about 50% worse? All the same components are used and running while the job executes, so I'd expect there to be no real difference between running it through Airflow and running it from the terminal. Or am I missing something here? Thanks!
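One way to narrow this down is a hedged sketch like the one below: run the exact same command, wrapped in bash's time builtin, from a manually triggered diagnostic DAG (the DAG id and task id here are made up), so the task log records real/user/sys time that you can compare both with the 65-minute terminal run and with the duration Airflow reports for the task.

import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical one-off diagnostic DAG: trigger it manually and compare the
# `time` output in the task log with the task duration Airflow records.
# If the wall-clock time inside the task is close to 65 minutes but the task
# still takes ~95, the overhead lies outside the command itself (queueing,
# sudo/environment differences, resource contention on the worker, etc.).
timing_dag = DAG(
    'q13_timing_check',          # made-up DAG id
    start_date=dt.datetime(2018, 4, 12),
    schedule_interval=None,      # manual triggers only
    catchup=False,
)

timed_run = BashOperator(
    task_id='q13_timed',
    bash_command='time sudo -H -u root /usr/share/logstash/bin/logstash --path.settings /etc/logstash-manual -f /etc/logstash/conf.d/q13.conf',
    dag=timing_dag,
)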


Airflow exception when using SparkSubmitOperator: env_vars not supported in standalone-cluster

My infrastructure is as follows:
Spark cluster
Airflow cluster
both in the same cloud (OpenShift), but in different namespaces.
Airflow UI/Admin/Connections/spark content:
Connection ID: spark
Connection Type: spark
Description: blank
Host: spark://master
Port: 7077
Extras: {"queue": "root.default", "master": "spark://master:7077", "deploy-mode": "cluster", "spark_binary": "/usr/local/spark/bin/spark-submit", "namespace": "default"}
Airflow DAG file content:
spark_submit_local = SparkSubmitOperator(
    application='/opt/airflow/dags/repo/dags/airflow-pyspark-app.py',
    conn_id='spark',
    spark_binary='spark-submit',
    task_id='spark_submit_task',
    name='airflow-pyspark-test',
    verbose=True,
    dag=dag
)
When running via the Airflow UI, the following Airflow Exception is being thrown:
airflow.exceptions.AirflowException: SparkSubmitHook env_vars is not supported in standalone-cluster mode.
There are a few AIRFLOW_CTX environment variables loaded at the start:
[2022-12-07, 12:50:32 UTC] {taskinstance.py:1590} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=airflow-pyspark-test
AIRFLOW_CTX_TASK_ID=spark_submit_task
AIRFLOW_CTX_EXECUTION_DATE=2022-12-07T12:50:29.450280+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-12-07T12:50:29.450280+00:00
AIRFLOW_CTX_UID=5d5607d2-0396-51eb-abb2-ab7d3bd96f49
My questions are:
How do I resolve this issue so that the Airflow DAG can run the SparkSubmitOperator task against the cluster?
If I remove the environment variables loaded at start (e.g. AIRFLOW_CTX_*), will it break Airflow? Or should I expect that to resolve the issue?
Is there anything else I can do to make it work with environment variables? (There are some other env variables I want to pass to spark-submit, for reporting to another service.)
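I can't verify this against your cluster, but the exception comes from SparkSubmitHook, which refuses environment variables whenever the connection points at a standalone master with "deploy-mode": "cluster"; with client deploy mode the hook instead copies env_vars into the environment of the spark-submit process. A hedged sketch of that workaround follows, using a hypothetical second connection spark_client whose Extra copies yours except for "deploy-mode": "client":

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Sketch only: "spark_client" is an assumed connection id whose Extra matches the
# existing "spark" connection except that "deploy-mode" is "client". In client
# mode env_vars are added to the environment of the spark-submit process instead
# of triggering the standalone-cluster exception.
spark_submit_client = SparkSubmitOperator(
    task_id='spark_submit_task_client',
    application='/opt/airflow/dags/repo/dags/airflow-pyspark-app.py',
    conn_id='spark_client',                       # hypothetical client-mode connection
    name='airflow-pyspark-test',
    env_vars={'REPORTING_ENDPOINT': 'http://reporting.example'},  # illustrative env var
    verbose=True,
    dag=dag,
)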

Airflow on Docker: Can't Write to Volume (Permission Denied)

Goal
I'm trying to run a simple DAG which creates a pandas DataFrame and writes to a file. The DAG is being run in a Docker container with Airflow, and the file is being written to a named volume.
Problem
When I start the container, I get the error:
Broken DAG: [/usr/local/airflow/dags/simple_datatest.py] [Errno 13] Permission denied: '/usr/local/airflow/data/local_data_input.csv'
Question
Why am I getting this error? And how can I fix this so that it writes properly?
Context
I am loosely following a tutorial here, but I've modified the DAG. I'm using the puckel/docker-airflow image from Docker Hub. I've attached a volume pointing to the appropriate DAG, and I've created another volume to contain the data written within the DAG (created by running docker volume create airflow-data).
The run command is:
docker run -d -p 8080:8080 \
-v /path/to/local/airflow/dags:/usr/local/airflow/dags \
-v airflow-data:/usr/local/airflow/data:Z \
puckel/docker-airflow \
webserver
The DAG located at the /usr/local/airflow/dags path on the container is defined as follows:
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
import pandas as pd

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2021, 12, 31),
    'email': ['me@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('datafile', default_args=default_args)

def task_make_local_dataset():
    print("task_make_local_dataset")
    local_data_create = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')

t1 = BashOperator(
    task_id='write_local_dataset',
    python_callable=task_make_local_dataset(),
    bash_command='python3 ~/airflow/dags/datatest.py',
    dag=dag)
The error in the DAG appears to be in the line
local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')
I don't have permission to write to this location.
Attempts
I've tried changing the location of the data directory in the container, but Airflow can't access it. Do I have to change permissions? This seems like a really simple thing that most people would want to do: write a file from inside a container. I'm guessing I'm just missing something.
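A quick way to see why the write is refused is to check, from inside the running container, which user the process runs as and who owns the mounted data directory. The sketch below uses only the standard library and the path from the error message; run it with python3 inside the container (e.g. via docker exec):

import os
import pwd

DATA_DIR = '/usr/local/airflow/data'  # the named-volume mount point from the error

def report_permissions(path=DATA_DIR):
    # Which user is this process running as inside the container?
    uid = os.getuid()
    print('running as uid', uid, '->', pwd.getpwuid(uid).pw_name)
    # Who owns the mount point, and is it writable for this user?
    st = os.stat(path)
    print('owner uid/gid:', st.st_uid, st.st_gid, 'mode:', oct(st.st_mode))
    print('writable by current user?', os.access(path, os.W_OK))

if __name__ == '__main__':
    report_permissions()

If the directory turns out not to be writable by that user, the fix belongs at the volume or image level (ownership, or SELinux labelling given the :Z mount flag) rather than in the DAG itself.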
Don't use the Puckel Docker image. It hasn't been maintained for years, and Airflow 1.10 reached end of life in June 2021. You should only look at Airflow 2, and Airflow has an official reference image that you can use.
Airflow 2 also has Quick Start guides based on that image and Docker Compose: https://airflow.apache.org/docs/apache-airflow/stable/start/index.html
There is also a Helm chart that can be used to productionize your setup: https://airflow.apache.org/docs/helm-chart/stable/index.html
Don't waste your (and others') time on Puckel and Airflow 1.10.
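Separately, note that the DAG above passes python_callable=task_make_local_dataset() with parentheses to a BashOperator, so the CSV write actually runs while the DAG file is being parsed, which is why it surfaces as a Broken DAG as soon as the container starts. Once you are on Airflow 2 and the volume permissions are sorted out, a minimal sketch of the intended task with a PythonOperator (function name and file path kept from the question) would be:

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def task_make_local_dataset():
    # Runs only when the task executes, not when the DAG file is parsed.
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    df.to_csv('/usr/local/airflow/data/local_data_input.csv')

dag = DAG(
    'datafile',
    start_date=datetime(2021, 12, 31),
    schedule_interval=None,   # trigger manually while testing
    catchup=False,
)

write_local_dataset = PythonOperator(
    task_id='write_local_dataset',
    python_callable=task_make_local_dataset,   # no parentheses: pass the function itself
    dag=dag,
)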

Writing to a file on disk using Airflow is not working

I am using a Windows machine and have created a container for Airflow.
I am able to read data from the local filesystem through a DAG, but I am unable to write data to a file. I have also tried giving the full path, and tried different operators (Python and Bash), but it still doesn't work.
The DAG succeeds; there aren't any failures to show.
Note: /opt/airflow is the $AIRFLOW_HOME path.
What might be the reason?
A snippet of code:
from airflow import DAG
from datetime import datetime
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def pre_process():
    f = open("/opt/airflow/write.txt", "w")
    f.write("world")
    f.close()

with DAG(dag_id="test_data", start_date=datetime(2021, 11, 24), schedule_interval='@daily') as dag:
    check_file = BashOperator(
        task_id="check_file",
        bash_command="echo Hi > /opt/airflow/hi.txt "
    )
    pre_processing = PythonOperator(
        task_id="pre_process",
        python_callable=pre_process
    )
    check_file >> pre_processing
It is most likely written, but inside the container that is running Airflow.
You need to understand how containers work. They provide isolation, but this also means that unless you do some data sharing, whatever you create in the container stays in the container, and you do not see it outside of it (that's what container isolation is all about).
You can usually enter the container via the docker exec command (https://docs.docker.com/engine/reference/commandline/exec/), or you can, for example, mount a folder from your host into the container and write your files there (as far as I know, by default on Windows some folders are mounted for you, but you need to check the Docker documentation for that).
In your pre_process code, add os.chdir('your/path') before writing your data to a file.
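For example (a sketch; '/opt/airflow/shared' is an assumed path, and this only helps if that directory is actually mounted from the host, otherwise the file still stays inside the container):

import os

def pre_process():
    # Switch into a directory that is bind-mounted from the Windows host,
    # then write with a relative path. "/opt/airflow/shared" is an assumed
    # mount point, not something Airflow creates for you.
    os.chdir('/opt/airflow/shared')
    with open('write.txt', 'w') as f:
        f.write('world')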

How to run airflow with CeleryExecutor on a custom docker image

I am adding Airflow to a web application that manually adds a directory containing business logic to the PYTHON_PATH env var, and also does additional system-level setup that I want to be consistent across all servers in my cluster. I've been successfully running Celery for this application with RabbitMQ as the broker and Redis as the task results backend for a while, and I have prior experience running Airflow with the LocalExecutor.
Instead of using Puckel's image, I have an entrypoint for a base backend image that runs a different service based on the SERVICE env var. It looks like this:
if [ $SERVICE == "api" ]; then
# upgrade to the data model
flask db upgrade
# start the web application
python wsgi.py
fi
if [ $SERVICE == "worker" ]; then
celery -A tasks.celery.celery worker --loglevel=info --uid=nobody
fi
if [ $SERVICE == "scheduler" ]; then
celery -A tasks.celery.celery beat --loglevel=info
fi
if [ $SERVICE == "airflow" ]; then
airflow initdb
airflow scheduler
airflow webserver
I have an .env file that I build the containers with, which defines my Airflow parameters:
AIRFLOW_HOME=/home/backend/airflow
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql+pymysql://${MYSQL_USER}:${MYSQL_ROOT_PASSWORD}@${MYSQL_HOST}:${MYSQL_PORT}/airflow?charset=utf8mb4
AIRFLOW__CELERY__BROKER_URL=amqp://${RABBITMQ_DEFAULT_USER}:${RABBITMQ_DEFAULT_PASS}@${RABBITMQ_HOST}:5672
AIRFLOW__CELERY__RESULT_BACKEND=redis://${REDIS_HOST}
With my entrypoint set up as it currently is, it never makes it to the webserver: it runs the scheduler in the foreground without ever invoking the webserver. I can change this to:
airflow initdb
airflow scheduler -D
airflow webserver
Now the webserver runs, but it isn't aware of the scheduler, which is now running as a daemon.
Airflow does, however, know that I'm using the CeleryExecutor and looks for the DAGs in the right place:
airflow | [2020-07-29 21:48:35,006] {default_celery.py:88} WARNING - You have configured a result_backend of redis://redis, it is highly recommended to use an alternative result_backend (i.e. a database).
airflow | [2020-07-29 21:48:35,010] {__init__.py:50} INFO - Using executor CeleryExecutor
airflow | [2020-07-29 21:48:35,010] {dagbag.py:396} INFO - Filling up the DagBag from /home/backend/airflow/dags
airflow | [2020-07-29 21:48:35,113] {default_celery.py:88} WARNING - You have configured a result_backend of redis://redis, it is highly recommended to use an alternative result_backend (i.e. a database).
I can solve this by going inside the container and manually firing up the scheduler.
The trick seems to be running both processes in the foreground within the container, but I'm stuck on how to do that inside the entrypoint. I've checked out Puckel's entrypoint code, but it's not obvious to me what he's doing. I'm sure that with just a slight tweak this will be off to the races... Thanks in advance for the help. Also, if there's any major anti-pattern that I'm at risk of running into here, I'd love the feedback so that I can implement Airflow properly. This is my first time using the CeleryExecutor, and there's a decent amount involved.
Try using nohup: https://en.wikipedia.org/wiki/Nohup
nohup airflow scheduler > scheduler.log &
In your case, you would update your entrypoint as follows:
if [ $SERVICE == "airflow" ]; then
airflow initdb
nohup airflow scheduler > scheduler.log &
nohup airflow webserver
fi

Unable to create a cron job of my pyspark script using Airflow

I have a PySpark script which is working perfectly fine. Now I want to schedule that job to run every minute, and for that I'm using Apache Airflow. I have created a .py file for Airflow, which is the following:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import os
from builtins import range
import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

srcDir = os.getcwd() + '/home/user/testing.py'
sparkSubmit = '/home/usr/spark-2.4.0-bin-hadoop2.7/bin/spark-submit'

default_args = {
    "owner": "usr",
    "depends_on_past": False,
    "start_date": datetime(2019, 4, 8),
    "email": ["abc@gmail.com"],
    "email_on_failure": True,
    "email_on_retry": True,
    'retries': 5,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('my_airflow', default_args=default_args, schedule_interval='* * * * *')

t1 = BashOperator(
    task_id='task1',
    bash_command='/home/user/spark-2.4.0-bin-hadoop2.7/bin/spark-submit' + ' ' + srcDir,
    dag=dag,
)
But when I run this with python3 air_flow.py, it shows nothing, neither on the console nor in the Airflow UI.
How can I get my PySpark script scheduled to run every minute by Apache Airflow?
Any help would be really appreciated.
Running python3 air_flow.py just parses your file.
To run the file on a schedule, you would first need to start the Airflow webserver and the Airflow scheduler:
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
Then, in your browser, visit http://localhost:8080, which will take you to the Airflow webserver UI.
Your script will run automatically every minute. If you want to trigger it manually from the UI, click the Run button on the right side of your DAG.
Follow the Quick Start Guide: https://airflow.readthedocs.io/en/1.10.2/start.html
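One more thing worth checking in the DAG above: srcDir concatenates os.getcwd() with an absolute path, so it ends up pointing at something like <current-dir>/home/user/testing.py rather than at the script itself, and the bash_command then hard-codes the spark-submit path instead of reusing sparkSubmit. A small sketch of a safer way to build the command (the paths are assumed and need to match where the files really live):

import os

# Assumed locations; adjust to wherever the script and Spark actually live.
SPARK_SUBMIT = '/home/user/spark-2.4.0-bin-hadoop2.7/bin/spark-submit'
SRC = '/home/user/testing.py'   # absolute path, not os.getcwd() + '/home/...'

bash_command = SPARK_SUBMIT + ' ' + SRC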
