Unable to create a cron job of my pyspark script using Airflow - apache-spark

I have a pyspark script that is working perfectly fine. Now I want to schedule that job to run every minute, and for that I'm using Apache Airflow. I have created the following .py file for Airflow:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import os
from builtins import range
import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

srcDir = os.getcwd() + '/home/user/testing.py'
sparkSubmit = '/home/usr/spark-2.4.0-bin-hadoop2.7/bin/spark-submit'

default_args = {
    "owner": "usr",
    "depends_on_past": False,
    "start_date": datetime(2019, 4, 8),
    "email": ["abc@gmail.com"],
    "email_on_failure": True,
    "email_on_retry": True,
    'retries': 5,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('my_airflow', default_args=default_args, schedule_interval='* * * * *')

t1 = BashOperator(
    task_id='task1',
    bash_command='/home/user/spark-2.4.0-bin-hadoop2.7/bin/spark-submit' + ' ' + srcDir,
    dag=dag,
)
But when I run this with python3 air_flow.py, nothing shows up, neither on the console nor in the Airflow UI.
How can I get Apache Airflow to run my pyspark script every minute?
Any help would be really appreciated.

Running python3 air_flow.py just parses your file.
To run the file on a schedule, you first need to start the Airflow webserver and the Airflow scheduler:
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
Then visit http://localhost:8080 in your browser, which will take you to the Airflow webserver UI.
Your script will then run automatically every minute. If you want to trigger it manually from the UI, click the Run button on the right side of your DAG's row.
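If you prefer the command line over the UI, here is a hedged sketch using the Airflow 1.10 CLI (my_airflow is the dag_id defined in the DAG above):
# let the scheduler pick up the DAG (DAGs are typically paused at creation)
airflow unpause my_airflow
# fire a single run manually
airflow trigger_dag my_airflow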
Follow the Quick Start Guide: https://airflow.readthedocs.io/en/1.10.2/start.html

Related

Airflow exception when using SparkSubmitOperator env_vars not supported in standalone-cluster

My infrastructure is as follows:
Spark cluster
Airflow cluster
Both are in the same cloud (OpenShift), but in different namespaces.
Airflow UI/Admin/Connections/spark content:
Connection ID: spark
Connection Type: spark
Description: blank
Host: spark://master
Port: 7077
Extras: {"queue": "root.default", "master": "spark://master:7077", "deploy-mode": "cluster", "spark_binary": "/usr/local/spark/bin/spark-submit", "namespace": "default"}
Airflow DAG file content:
spark_submit_local = SparkSubmitOperator(
    application='/opt/airflow/dags/repo/dags/airflow-pyspark-app.py',
    conn_id='spark',
    spark_binary='spark-submit',
    task_id='spark_submit_task',
    name='airflow-pyspark-test',
    verbose=True,
    dag=dag
)
When running via the Airflow UI, the following Airflow Exception is being thrown:
airflow.exceptions.AirflowException: SparkSubmitHook env_vars is not supported in standalone-cluster mode.
There are a few AIRFLOW_CTX environment variables loaded at the start:
[2022-12-07, 12:50:32 UTC] {taskinstance.py:1590} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=airflow-pyspark-test
AIRFLOW_CTX_TASK_ID=spark_submit_task
AIRFLOW_CTX_EXECUTION_DATE=2022-12-07T12:50:29.450280+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-12-07T12:50:29.450280+00:00
AIRFLOW_CTX_UID=5d5607d2-0396-51eb-abb2-ab7d3bd96f49
My questions are:
How do I resolve this issue so that an Airflow DAG can run a file containing a SparkSubmitOperator against the cluster?
If I remove the environment variables loaded at start (e.g. AIRFLOW_CTX_*), will it break Airflow? Or should I expect that to resolve the issue?
Is there anything else I can do to make it work with environment variables? (There are some other env variables I want to pass to spark-submit, for reporting to another service; one possible direction is sketched below.)
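One hedged workaround sketch, not from the original question: SparkSubmitHook rejects env_vars in standalone-cluster mode, but executor-side environment variables can be passed through Spark's own spark.executorEnv.* settings via the operator's conf argument. This does not set variables on the driver, and the variable name REPORTING_URL below is a made-up placeholder:
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

spark_submit_task = SparkSubmitOperator(
    task_id='spark_submit_task',
    application='/opt/airflow/dags/repo/dags/airflow-pyspark-app.py',
    conn_id='spark',
    name='airflow-pyspark-test',
    # hand environment variables to the executors through Spark config
    # instead of the hook's unsupported env_vars argument
    conf={'spark.executorEnv.REPORTING_URL': 'http://reporting.example.internal'},
    verbose=True,
    dag=dag,
)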

Airflow on Docker: Can't Write to Volume (Permission Denied)

Goal
I'm trying to run a simple DAG which creates a pandas DataFrame and writes to a file. The DAG is being run in a Docker container with Airflow, and the file is being written to a named volume.
Problem
When I start the container, I get the error:
Broken DAG: [/usr/local/airflow/dags/simple_datatest.py] [Errno 13] Permission denied: '/usr/local/airflow/data/local_data_input.csv'
Question
Why am I getting this error? And how can I fix this so that it writes properly?
Context
I am loosely following a tutorial here, but I've modified the DAG. I'm using the puckel/docker-airflow image from Docker Hub. I've attached a volume pointing to the appropriate DAG, and I've created another volume to contain the data written within the DAG (created by running docker volume create airflow-data).
The run command is:
docker run -d -p 8080:8080 \
-v /path/to/local/airflow/dags:/usr/local/airflow/dags \
-v airflow-data:/usr/local/airflow/data:Z \
puckel/docker-airflow \
webserver
The DAG located at the /usr/local/airflow/dags path on the container is defined as follows:
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
import pandas as pd

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2021, 12, 31),
    'email': ['me@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('datafile', default_args=default_args)

def task_make_local_dataset():
    print("task_make_local_dataset")
    local_data_create = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')

t1 = BashOperator(
    task_id='write_local_dataset',
    python_callable=task_make_local_dataset(),
    bash_command='python3 ~/airflow/dags/datatest.py',
    dag=dag)
The error in the DAG appears to be in the line
local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')
I don't have permission to write to this location.
Attempts
I've tried changing the location of the data directory in the container, but Airflow can't access it. Do I have to change permissions? Writing a file from a container seems like a really simple thing that most people would want to do, so I'm guessing I'm just missing something.
Don't use the Puckel Docker image. It has not been maintained for years, and Airflow 1.10 reached end of life in June 2021. You should only look at Airflow 2, which has an official reference image you can use.
Airflow 2 also has a Quick Start guide based on that image and Docker Compose: https://airflow.apache.org/docs/apache-airflow/stable/start/index.html
It also has a Helm Chart that can be used to productionize your setup: https://airflow.apache.org/docs/helm-chart/stable/index.html
Don't waste your (and others') time on Puckel and Airflow 1.10.
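For reference, the Docker Compose quick start from the linked guide boils down to a handful of commands. This is a sketch based on that guide (2.x.x is a placeholder for whichever Airflow version you pick); the AIRFLOW_UID line makes files written to the mounted folders owned by your host user, which avoids exactly the kind of permission-denied error described above:
# fetch the official compose file (replace 2.x.x with a concrete Airflow version)
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.x.x/docker-compose.yaml'
# create the folders that get mounted into the containers
mkdir -p ./dags ./logs ./plugins
# run the containers as your host user so they can write to the mounts
echo "AIRFLOW_UID=$(id -u)" > .env
# initialize the database and create the default account, then start everything
docker compose up airflow-init
docker compose up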

Writing to a file on disk using Airflow is not working

I am using a Windows machine and have created a container for Airflow.
I am able to read data from the local filesystem through a DAG, but I am unable to write data to a file. I have also tried giving the full path and tried different operators (Python and Bash), but it still doesn't work.
The DAG succeeds; there are no failures to show.
Note: /opt/airflow is the $AIRFLOW_HOME path.
What may be the reason?
A snippet of code:
from airflow import DAG
from datetime import datetime
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def pre_process():
    f = open("/opt/airflow/write.txt", "w")
    f.write("world")
    f.close()

with DAG(dag_id="test_data", start_date=datetime(2021, 11, 24), schedule_interval='@daily') as dag:
    check_file = BashOperator(
        task_id="check_file",
        bash_command="echo Hi > /opt/airflow/hi.txt "
    )
    pre_processing = PythonOperator(
        task_id="pre_process",
        python_callable=pre_process
    )
    check_file >> pre_processing
It is most likely written, but inside the container that is running Airflow.
You need to understand how containers work. They provide isolation, but this also means that unless you do some data sharing, whatever you create in the container, stays in the container and you do not see it outside of it (that's what container isolation is all about).
You can usually enter the container via the docker exec command (https://docs.docker.com/engine/reference/commandline/exec/), or you can, for example, mount a folder from your host into your container and write your files there (as far as I know, on Windows some folders are mounted for you by default, but you need to check the Docker documentation for that).
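For example, here is a quick sketch for confirming that the files exist inside the container (<container-name> is a placeholder for the name or ID shown by docker ps):
# find the Airflow container
docker ps
# open a shell inside it and look for the files the DAG wrote
docker exec -it <container-name> bash
ls -l /opt/airflow/hi.txt /opt/airflow/write.txt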
In your pre_process code, add os.chdir('your/path') before writing your data to a file.

How to upload files to Amazon EMR?

My code is as follows:
# test2.py
from pyspark import SparkContext, SparkConf, SparkFiles

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)

from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)

with open(SparkFiles.get("test_warc.txt")) as f:
    print("opened")

sc.stop()
It works when I run it locally with:
spark-submit --deploy-mode client --files ../input/test_warc.txt test2.py
But after adding a step to the EMR cluster:
spark-submit --deploy-mode cluster --files s3://brand17-stock-prediction/test_warc.txt s3://brand17-stock-prediction/test2.py
I am getting the error:
FileNotFoundError: [Errno 2] No such file or directory:
'/mnt1/yarn/usercache/hadoop/appcache/application_1618078674774_0001/spark-e7c93ba0-7d30-4e52-8f1b-14dda6ff599c/userFiles-5bb8ea9f-189d-4256-803f-0414209e7862/test_warc.txt'
The path to the file is correct, but it is not being uploaded from S3 for some reason.
I tried to open it from the executors instead:
from pyspark import SparkContext, SparkConf, SparkFiles
from operator import add

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)

from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)

def f(_):
    a = 0
    with open(SparkFiles.get("test_warc.txt")) as f:
        a += 1
        print("opened")
    return a  # test_module.test()

count = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(count)  # printing 2
sc.stop()
And it works without errors.
It looks like the --files argument uploads files to the executors only. How can I upload to the master?
Your understanding is correct: the --files argument uploads files to the executors only.
See this in the Spark documentation:
file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
You can read more about this at advanced-dependency-management
Now coming back to your second question
How can I upload to master?
EMR has a concept of a bootstrap action. The official documentation describes it as follows:
You can use a bootstrap action to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on the cluster after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.
How do I use it in my case?
While spawning the cluster, you can specify your script in the BootstrapActions JSON, something like the following, along with other custom configurations:
BootstrapActions=[
    {
        'Name': 'Setup Environment for downloading my script',
        'ScriptBootstrapAction': {
            'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
        }
    }
]
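For context, here is a hedged sketch of where that BootstrapActions block sits in a boto3 run_job_flow call (the cluster name, release label, instance settings, and region are placeholders, not from the original question):
import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='my-test-cluster',
    ReleaseLabel='emr-6.2.0',
    Applications=[{'Name': 'Spark'}],
    Instances={
        'InstanceGroups': [
            {'Name': 'Master', 'InstanceRole': 'MASTER',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
            {'Name': 'Core', 'InstanceRole': 'CORE',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 2},
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    # the bootstrap action downloads test_warc.txt onto every node
    BootstrapActions=[
        {'Name': 'Setup Environment for downloading my script',
         'ScriptBootstrapAction': {
             'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
         }}
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])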
The content of download-file.sh should look something like this:
#!/bin/bash
set -x
workingDir=/opt/your-path/
sudo mkdir -p $workingDir
sudo aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt $workingDir
Now, in your Python script, you can read the file from $workingDir/test_warc.txt (i.e. /opt/your-path/test_warc.txt).
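A minimal driver-side sketch, assuming the bootstrap script above copied the file to /opt/your-path/ (bootstrap actions run on every node by default, so the file is present wherever the driver lands):
# test2.py
from pyspark import SparkContext, SparkConf

conf = SparkConf()
sc = SparkContext(appName="test", conf=conf)

# read the file that the bootstrap action copied onto each node
with open("/opt/your-path/test_warc.txt") as f:
    print("opened, first line:", f.readline())

sc.stop()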
There is also an option to execute your bootstrap action on the master node only, on task nodes only, or on a mix of both; bootstrap-actions/run-if is the script that can be used for this case. More reading on this can be done at emr-bootstrap-runif.

Airflow lowering performance?

I'm using Apache Airflow (1.9) to run a bash script (which starts Logstash to query data from a database and transfers this data into ElasticSearch).
When I run my script from Airflow, it takes about 95 minutes to complete. When I run the exact same script from the terminal (on the same machine) this task takes 65 minutes to complete.
I really can't figure out what's going on. I'm running a very simple instance of Airflow with the LocalExecutor. My DAG is really simple, as shown below. I'm not using any fancy stuff such as Variables, XCom, or anything like that.
My dag:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'Me',
    'start_date': dt.datetime(2018, 4, 12),
    'retries': 2,
    'retry_delay': dt.timedelta(minutes=1),
    'sla': dt.timedelta(hours=2),
    'depends_on_past': False,
}

syncQDag = DAG('q13', catchup=False, default_args=default_args, schedule_interval='0 1 * * *', concurrency=1)

l13 = BashOperator(task_id='q13', bash_command='sudo -H -u root /usr/share/logstash/bin/logstash --path.settings /etc/logstash-manual -f /etc/logstash/conf.d/q13.conf', dag=syncQDag)
Is there any clear reason why a job initiated through Airflow performs about 50% worse? All the same components are used and running while the job executes, so I'd expect no real difference between execution through Airflow and execution from the terminal. Or am I missing something here? Thanks!
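One hedged way to narrow this down (not from the original thread): run the same task through Airflow's own task runner in isolation and time it, so the scheduler and LocalExecutor are out of the picture. In the Airflow 1.9 CLI that looks roughly like:
# run the q13 task directly (no scheduler, no state recorded) and time it
time airflow test q13 q13 2018-04-12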
