Airflow on Docker: Can't Write to Volume (Permission Denied)

Goal
I'm trying to run a simple DAG which creates a pandas DataFrame and writes to a file. The DAG is being run in a Docker container with Airflow, and the file is being written to a named volume.
Problem
When I start the container, I get the error:
Broken DAG: [/usr/local/airflow/dags/simple_datatest.py] [Errno 13] Permission denied: '/usr/local/airflow/data/local_data_input.csv'
Question
Why am I getting this error? And how can I fix this so that it writes properly?
Context
I am loosely following a tutorial here, but I've modified the DAG. I'm using the puckel/docker-airflow image from Docker Hub. I've attached a volume pointing to the appropriate DAG, and I've created another volume to contain the data written within the DAG (created by running docker volume create airflow-data).
The run command is:
docker run -d -p 8080:8080 \
    -v /path/to/local/airflow/dags:/usr/local/airflow/dags \
    -v airflow-data:/usr/local/airflow/data:Z \
    puckel/docker-airflow \
    webserver
The DAG located at the /usr/local/airflow/dags path on the container is defined as follows:
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
import pandas as pd

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2021, 12, 31),
    'email': ['me@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('datafile', default_args=default_args)

def task_make_local_dataset():
    print("task_make_local_dataset")
    local_data_create = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')

t1 = BashOperator(
    task_id='write_local_dataset',
    python_callable=task_make_local_dataset(),
    bash_command='python3 ~/airflow/dags/datatest.py',
    dag=dag)
The error in the DAG appears to be in the line
local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')
I don't have permission to write to this location.
Attempts
I've tried changing the location of the data directory on the container, but Airflow can't access it. Do I have to change permissions? It seems like a really simple thing that most people would want to do: write to a file from inside a container. I'm guessing I'm just missing something.
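One quick way to confirm whether it really is a filesystem permission problem is to check, from inside the container, which user Airflow runs as and who owns the mounted directory. A throwaway debugging sketch (the path comes from the setup above; this only diagnoses, it doesn't fix anything):

import os

DATA_DIR = '/usr/local/airflow/data'  # the volume mount path from the question

# Which user is the process running as inside the container?
print('uid:', os.geteuid(), 'gid:', os.getegid())

# Who owns the mounted directory, and with what mode?
st = os.stat(DATA_DIR)
print('owner uid:', st.st_uid, 'owner gid:', st.st_gid, 'mode:', oct(st.st_mode))

# Can the current process actually write there?
print('writable:', os.access(DATA_DIR, os.W_OK))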

Don't use the Puckel Docker image. It hasn't been maintained for years, and Airflow 1.10 reached end of life in June 2021. You should only look at Airflow 2, and Airflow has an official reference image that you can use.
Airflow 2 also has quick-start guides, based on the image and Docker Compose: https://airflow.apache.org/docs/apache-airflow/stable/start/index.html
And it has a Helm chart that can be used to productionize your setup: https://airflow.apache.org/docs/helm-chart/stable/index.html
Don't waste your (and others') time on Puckel and Airflow 1.10.
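Separately from the image choice, note that the task in the question calls task_make_local_dataset() while building the BashOperator, so the CSV write runs when the scheduler parses the DAG file; that is why it shows up as a Broken DAG error rather than a task failure. Below is a minimal sketch of deferring the write to task run time with a PythonOperator on Airflow 2 (the /opt/airflow/data path is an assumption standing in for wherever a writable volume is mounted; /opt/airflow is the official image's AIRFLOW_HOME):

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def make_local_dataset():
    # Runs only when the task executes, not when the DAG file is parsed.
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    # Assumption: /opt/airflow/data is a mounted, writable directory.
    df.to_csv('/opt/airflow/data/local_data_input.csv')


with DAG('datafile', start_date=datetime(2021, 12, 31), schedule_interval=None) as dag:
    write_local_dataset = PythonOperator(
        task_id='write_local_dataset',
        python_callable=make_local_dataset,  # pass the function itself, do not call it here
    )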

Related

Airflow exception when using SparkSubmitOperator env_vars not supported in standalone-cluster

My infrastructure is as follows:
Spark cluster
Airflow cluster
both in the same cloud (OpenShift), but in different namespaces.
Airflow UI/Admin/Connections/spark content:
Connection ID: spark
Connection Type: spark
Description: blank
Host: spark://master
Port: 7077
Extras: {"queue": "root.default", "master": "spark://master:7077", "deploy-mode": "cluster", "spark_binary": "/usr/local/spark/bin/spark-submit", "namespace": "default"}
Airflow DAG file content:
spark_submit_local = SparkSubmitOperator(
    application='/opt/airflow/dags/repo/dags/airflow-pyspark-app.py',
    conn_id='spark',
    spark_binary='spark-submit',
    task_id='spark_submit_task',
    name='airflow-pyspark-test',
    verbose=True,
    dag=dag
)
When running via the Airflow UI, the following Airflow Exception is being thrown:
airflow.exceptions.AirflowException: SparkSubmitHook env_vars is not supported in standalone-cluster mode.
There are a few AIRFLOW_CTX environment variables loaded at the start:
[2022-12-07, 12:50:32 UTC] {taskinstance.py:1590} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=airflow-pyspark-test
AIRFLOW_CTX_TASK_ID=spark_submit_task
AIRFLOW_CTX_EXECUTION_DATE=2022-12-07T12:50:29.450280+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-12-07T12:50:29.450280+00:00
AIRFLOW_CTX_UID=5d5607d2-0396-51eb-abb2-ab7d3bd96f49
My questions are:
How do I resolve this issue when trying to make an Airflow DAG run a file containing a SparkSubmitOperator (against the cluster)?
If I remove the environment variables loaded at the start (e.g. AIRFLOW_CTX_*), will it break Airflow? Or should I expect that to resolve the issue?
Is there anything else I can do to make it work with environment variables? (There are some other env variables I want to pass to spark-submit, for reporting to another service.)
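No answer is recorded here, but for the third question one direction worth noting (a sketch under assumptions, not a verified fix) is to pass values to the executors through Spark configuration properties such as spark.executorEnv.<NAME> via the operator's conf argument instead of env_vars; whether that sidesteps the standalone-cluster restriction depends on your provider version:

# Sketch only: same task as above, but passing an executor environment variable
# through Spark conf instead of Airflow's env_vars argument.
# Assumes the apache-airflow-providers-apache-spark package; REPORTING_ENDPOINT
# and its value are hypothetical placeholders.
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

spark_submit_local = SparkSubmitOperator(
    application='/opt/airflow/dags/repo/dags/airflow-pyspark-app.py',
    conn_id='spark',
    spark_binary='spark-submit',
    task_id='spark_submit_task',
    name='airflow-pyspark-test',
    conf={
        # Sets REPORTING_ENDPOINT in each executor's environment.
        'spark.executorEnv.REPORTING_ENDPOINT': 'https://reporting.example.com',
    },
    verbose=True,
    dag=dag,  # the same dag object defined in the DAG file above
)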

Writing to a file on disk using Airflow is not working

I am using a Windows machine and have created a container for Airflow.
I am able to read data from the local filesystem through a DAG, but I am unable to write data to a file. I have also tried giving the full path, and tried different operators (Python and Bash), but it still doesn't work.
The DAG succeeds; there aren't any failures to show.
Note: /opt/airflow is the $AIRFLOW_HOME path.
What may be the reason?
A snippet of code:
from airflow import DAG
from datetime import datetime
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def pre_process():
    f = open("/opt/airflow/write.txt", "w")
    f.write("world")
    f.close()

with DAG(dag_id="test_data", start_date=datetime(2021, 11, 24), schedule_interval='@daily') as dag:
    check_file = BashOperator(
        task_id="check_file",
        bash_command="echo Hi > /opt/airflow/hi.txt "
    )
    pre_processing = PythonOperator(
        task_id="pre_process",
        python_callable=pre_process
    )
    check_file >> pre_processing
It likely is written, but inside the container that is running Airflow.
You need to understand how containers work. They provide isolation, but this also means that unless you do some data sharing, whatever you create in the container stays in the container, and you do not see it outside of it (that's what container isolation is all about).
You can usually enter the container via the docker exec command (https://docs.docker.com/engine/reference/commandline/exec/), or you can, for example, mount a folder from your host into your container and write your files there (as far as I know, on Windows some folders are mounted for you by default, but you need to check the Docker documentation for that).
In your pre_process code, add os.chdir('your/path') before writing your data to a file.
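To make the suggestions above concrete, here is a small sketch of pre_process writing into a directory shared with the host. The /opt/airflow/shared path and the corresponding -v mount are hypothetical; any host-mounted folder works, and you can verify the file with docker exec either way:

import os

# Hypothetical path: assumes the container was started with something like
#   -v C:\airflow-output:/opt/airflow/shared
# so that files written here show up on the Windows host.
OUTPUT_DIR = "/opt/airflow/shared"

def pre_process():
    # Following the os.chdir suggestion: relative writes now land in the shared folder.
    os.chdir(OUTPUT_DIR)
    with open("write.txt", "w") as f:
        f.write("world")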

Error Building Airflow On Windows Using docker-compose

I am trying to run Airflow on my Windows machine using Docker. Here is the link that I am following from the official docs: https://airflow.apache.org/docs/apache-airflow/2.0.1/start/docker.html.
I have created the directory structure as expected and also downloaded the docker-compose YAML file. On running 'docker-compose up airflow-init' as suggested by the documentation, I get the error below:
airflow-init_1 |
airflow-init_1 | [2021-07-03 10:19:29,721] {cli_action_loggers.py:105} WARNING - Failed to log action with (psycopg2.errors.UndefinedTable) relation "log" does not exist
airflow-init_1 | LINE 1: INSERT INTO log (dttm, dag_id, task_id, event, execution_dat...
airflow-init_1 | ^
airflow-init_1 |
airflow-init_1 | [SQL: INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) VALUES (%(dttm)s, %(dag_id)s, %(task_id)s, %(event)s, %(execution_date)s, %(owner)s, %(extra)s) RETURNING log.id]
airflow-init_1 | [parameters: {'dttm': datetime.datetime(2021, 7, 3, 10, 19, 29, 712157, tzinfo=Timezone('UTC')), 'dag_id': None, 'task_id': None, 'event': 'cli_upgradedb', 'execution_date': None, 'owner': 'airflow', 'extra': '{"host_name": "7f142ce11611", "full_command": "[\'/home/airflow/.local/bin/airflow\', \'db\', \'upgrade\']"}'}]
From the logs it's clear that the log table does not exist and Airflow is trying to insert into it. I'm not sure, though, why this happens or how this error can be fixed. I am using the original docker-compose file that is published on the Airflow docs page.
This is the current status of my Airflow Docker image.
On trying to access the Airflow UI at http://localhost:8080/admin/, I get the Airflow 404 ("lots of circles") error page.
This is just a warning: the Airflow CLI tries to add an audit log entry to the log table before the tables get created.
I get the same warning on a fresh DB initially, but then the output continues.
The output should continue and you should get something like this at the end (I ran it with the just-released 2.1.1, which I recommend you start with):
airflow-init_1 | [2021-07-03 15:54:01,449] {manager.py:784} WARNING - No user yet created, use flask fab command to do it.
airflow-init_1 | Upgrades done
airflow-init_1 | [2021-07-03 15:54:06,899] {manager.py:784} WARNING - No user yet created, use flask fab command to do it.
airflow-init_1 | Admin user airflow created
airflow-init_1 | 2.1.1
You need to initialize the Airflow DB:
docker exec -ti airflow-webserver airflow db init && echo "Initialized airflow DB"
Create an admin user:
docker exec -ti airflow-webserver airflow users create --role Admin --username {AIRFLOW_USER} --password {AIRFLOW_PASSWORD} -e {AIRFLOW_USER_EMAIL} -f {FIRST_NAME} -l {LAST_NAME} && echo "Created airflow admin user"

Unable to create a cron job of my pyspark script using Airflow

I have a PySpark script which is working perfectly fine. Now I want to schedule that job to run every minute, and for that I'm using Apache Airflow. I have created a .py file for Airflow, which is the following:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import os
from builtins import range
import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

srcDir = os.getcwd() + '/home/user/testing.py'
sparkSubmit = '/home/usr/spark-2.4.0-bin-hadoop2.7/bin/spark-submit'

default_args = {
    "owner": "usr",
    "depends_on_past": False,
    "start_date": datetime(2019, 4, 8),
    "email": ["abc@gmail.com"],
    "email_on_failure": True,
    "email_on_retry": True,
    'retries': 5,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('my_airflow', default_args=default_args, schedule_interval='* * * * *')

t1 = BashOperator(
    task_id='task1',
    bash_command='/home/user/spark-2.4.0-bin-hadoop2.7/bin/spark-submit' + ' ' + srcDir,
    dag=dag,
)
But when I run this with python3 air_flow.py, it shows nothing, neither on the console nor in the Airflow UI.
How can I make my PySpark script run every minute with Apache Airflow?
Any help would be really appreciated.
Running python3 air_flow.py just parses your file.
To run the file on a schedule, you would first need to start the Airflow webserver and the Airflow scheduler:
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
Then, in your browser, visit http://localhost:8080, which will take you to the Airflow webserver UI.
Your script will then be run automatically every minute. If you want to trigger it manually from the UI, click on the Run button on the right side of your DAG.
Follow the Quick Start Guide: https://airflow.readthedocs.io/en/1.10.2/start.html

Airflow lowering performance?

I'm using Apache Airflow (1.9) to run a bash script (which starts Logstash to query data from a database and transfers this data into ElasticSearch).
When I run my script from Airflow, it takes about 95 minutes to complete. When I run the exact same script from the terminal (on the same machine) this task takes 65 minutes to complete.
I really can't figure out what's going on. I'm running a very simple instance of Airflow, using the LocalExecutor. My DAG is really simple, as shown below. I'm not using any fancy stuff such as Variables or XComs.
My dag:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
    'owner': 'Me',
    'start_date': dt.datetime(2018, 4, 12),
    'retries': 2,
    'retry_delay': dt.timedelta(minutes=1),
    'sla': dt.timedelta(hours=2),
    'depends_on_past': False,
}
syncQDag = DAG('q13', catchup=False, default_args=default_args, schedule_interval='0 1 * * *', concurrency=1)
l13 = BashOperator(task_id='q13', bash_command='sudo -H -u root /usr/share/logstash/bin/logstash --path.settings /etc/logstash-manual -f /etc/logstash/conf.d/q13.conf', dag=syncQDag)
Is there any clear reason why a job initiated through Airflow performs about 50% worse? Since all the same components are used and running while the job is executed, I'd expect there to be no real difference between execution through Airflow and execution from the terminal. Or am I missing something here? Thanks!
