How to write to local file path in in Airflow- MacOS

How to write to local file path in in Airflow- MacOS - python-3.x

I am writing an Airflow pipeline which involves writing the results to a csv file located on my local file system.
I am using MacOS and the file path is similar to /User/name/file_path/file_name.csv)
Here is my code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.models import Variable
import os
from airflow.operators.python_operator import PythonOperator
#Import boto3 module
import boto3
import logging
from botocore.exceptions import ClientError
import csv
import numpy as np
import pandas as pd
bucket='my_bucket_name'
s3 = boto3.resource('s3',
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY
)
def load_into_csv(years):
df = pd.DataFrame()
for year in years:
for buckett in s3.buckets.all():
for aobj in buckett.objects.filter(Bucket=bucket,Prefix=PREFIX):
if year in aobj.key:
bucket_name= "'{}' ".format(buckett.name)
the_key= "'{}' ".format(aobj.key)
last_mod= "'{}' ".format(aobj.last_modified)
stor_class= "'{}' ".format(aobj.storage_class)
size_1= "'{}' ".format(aobj.size)
dd = {'bucket_name':[bucket_name], 'S3_key_path':[the_key], 'last_modified_date':[last_mod], 'storage_class':[stor_class], 'size':[size_1] }
df_2 = pd.DataFrame(data=dd)
df = df.append(df_2, ignore_index=True)
#Get local directory
path=os.getcwd()
export_csv = df.to_csv (r'{}/results.csv'.format(path) ,index = None, header=True)
load_into_csv(years)
#######################################################################################################################
default_args = {
'owner': 'name',
'depends_on_past': False,
'start_date': datetime(2020,1,1),
'email': ['email#aol.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
'retry_delay': timedelta(minutes=1)
}
dag = DAG('bo_v1',
description = 'this is a test script',
default_args=default_args,
schedule_interval= '#once',
catchup = False )
years=['2017','2018','2019']
for year in years:
t1 = PythonOperator(
task_id='load 2017',
python_callable= load_into_csv,
provide_context=False,
dag = dag)
If you look at the path variable, i try collecting the local os path and then setting that as the output file in the export csv variable-but to no avail.
Is there a way to set your local MacOS file path (/Users/name/path/file_name.csv) as the filepath in the export_to_csv variable? I am new to Airflow so any ideas or suggestions would help!!!

I have tried the path dynamic way on my Mac as you did with path=os.getcwd(). I put it in the task or global namespace, but it didn't return any feasible path. One way you can work around this is to put the path as a variable in Airflow variables, and then get it from there when you need it.

Related

I am relatively new to Airflow and spark and want to use the Kubernetes Operator in airflow Dag to run a Spark Submit command

I am using the kubernetes version 1.25 client and server, I have deployed Airflow using the official helm charts on the environment. I want the Airflow dags kubernetes pod operator that has code to trigger the spark-submit operation to spawn the driver pod and an executor pod that will run inside the spark submit command and perform a task. The Dag performs the following task 1. Take a table from mysql, 2.dump it in a text file, 3. put the same file to a minio bucket(similar to aws S3) Currently the driver pod spawns with executor pod. The Driver pod then fails eventually as it does not come into a running state. This event causes the executor pod to fail as well. I am authenticating the call going to kubernetes api using the a Service Account that I am passing as a configuration.
This is my redacted dag that I am using, note that spark-submit command works perfectly fine inside the container of the image on the command line and generates a expected outcome, So I doubt its some dag configuration that I might be missing here. Also not that all the jars that I am referring here are already part of the image and are being referenced from the**/opt/spark/connectors/** I have verified this by doing exec inside the container image
import logging
import csv
import airflow
from airflow import DAG
from airflow.utils import dates as date
from datetime import timedelta, datetime
from airflow.providers.apache.spark.operators.spark_jdbc import SparkSubmitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from dateutil.tz import tzlocal
from airflow.kubernetes.volume import Volume
from airflow.kubernetes.volume_mount import VolumeMount
import pendulum
#from airflow.models import Variables
local_tz = pendulum.timezone("Asia/Dubai")
volume_config = {"persistentVolumeClaim": {"claimName": "nfspvc-airflow-executable"}}
air_connectors_volume_config = {"persistentVolumeClaim": {"claimName": "nfspvc-airconnectors"}}
volume_mount = VolumeMount(
"data-volume",
mount_path="/air-spark/",
sub_path=None,
read_only=False,
)
air_connectors_volume_mount = VolumeMount(
"air-connectors",
mount_path="/air-connectors/",
sub_path=None,
read_only=False,
)
volume = Volume(
name="data-volume",
configs=volume_config
)
air_connectors_volume = Volume(
name="air-connectors",
configs=air_connectors_volume_config
)
default_args = {
'owner': 'panoiqtest',
'depends_on_past': False,
'start_date': datetime(2021, 5, 1, tzinfo=local_tz),
'retries': 1,
'retry_delay': timedelta(hours=1),
'email': ['admin#panoiq.com'],
'email_on_failure': False,
'email_on_retry': False
}
dag_daily = DAG(dag_id='operator',
default_args=default_args,
catchup=False,
schedule_interval='0 */1 * * *')
_config = {
'application': '/opt/spark/connectors/spark-etl-assembly-2.0.jar',
'num_executors': 2,
'driver_memory': '5G',
'executor_memory': '10G',
'driver_class_path':'/opt/spark/connectors/mysql-connector-java-5.1.49.jar',
'jars':'/opt/spark/connectors/mysql-connector-java-5.1.49.jar,/opt/spark/connectors/aws-java-sdk-bundle-1.12.374.jar,/opt/spark/connectors/hadoop-aws-3.3.1.jar',
#'java_class': 'com.spark.ETLHandler'
}
spark_config = {
"spark.executor.extraClassPath":"/opt/spark/connectors/mysql-connector-java-5.1.49.jar,/opt/spark/connectors/aws-java-sdk-bundle-1.12.374.jar,/opt/spark/connectors/hadoop-aws-3.3.1.jar",
"spark.driver.extraClassPath":"/opt/spark/connectors/mysql-connector-java-5.1.49.jar,/opt/spark/connectors/aws-java-sdk-bundle-1.12.374.jar,/opt/spark/connectors/hadoop-aws-3.3.1.jar"
}
t2 = BashOperator(
task_id='bash_example',
# "scripts" folder is under "/usr/local/airflow/dags"
bash_command="ls /air-spark/ && pwd",
dag=dag_daily)
def get_tables(table_file='/csv-directory/success-dag.csv', **kwargs):
logging.info("#Starting get_tables()#")
tables_list=[]
with open(table_file) as csvfile:
reader = csv.reader(csvfile, delimiter=',')
tables_list= [row for row in reader]
tables_list.pop(0) #remove header
return tables_list
def load_table(table_name, application_args, **kwargs):
k8s_arguments = [
'--name=datalake-pod',
'--master=k8s://https://IP:6443',
'--deploy-mode=cluster',
# '--driver-cores=4',
# '--executor-cores=4',
# '--num-executors=1',
# '--driver-memory=8192m',
'--executor-memory=8192m',
'--conf=spark.kubernetes.authenticate.driver.serviceAccountName=air-airflow-sa',
'--driver-class-path=/opt/spark/connectors//mysql-connector-java-5.1.49.jar,/opt/spark/connectors/aws-java-sdk-bundle-1.12.374.jar,/opt/spark/connectors/hadoop-aws-3.3.1.jar',
'--conf=spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp',
'--jars=/opt/spark/connectors/mysql-connector-java-5.1.49.jar,/opt/spark/connectors/aws-java-sdk-bundle-1.12.374.jar,/opt/spark/connectors/hadoop-aws-3.3.1.jar',
'--conf=spark.kubernetes.namespace=development',
# '--conf=spark.driver.cores=4',
# '--conf=spark.executor.cores=4',
# '--conf=spark.driver.memory=8192m',
# '--conf=spark.executor.memory=8192m',
'--conf=spark.kubernetes.container.image=image_name',
'--conf=spark.kubernetes.container.image.pullSecrets=Secret_name',
'--conf=spark.kubernetes.container.image.pullPolicy=Always',
'--conf=spark.dynamicAllocation.enabled=true',
'--conf=spark.dynamicAllocation.shuffleTracking.enabled=true',
'--conf=spark.kubernetes.driver.volumes.persistentVolumeClaim.air-connectors.mount.path=/air-connectors/',
'--conf=spark.kubernetes.driver.volumes.persistentVolumeClaim.air-connectors.mount.readOnly=false',
'--conf=spark.kubernetes.driver.volumes.persistentVolumeClaim.air-connectors.options.claimName=nfspvc-airconnectors',
'--conf=spark.kubernetes.file.upload.path=/opt/spark',
'--class=com.spark.ETLHandler',
'/opt/spark/connectors/spark-etl-assembly-2.0.jar'
];
all_arguments = k8s_arguments + application_args
return KubernetesPodOperator(
dag=dag_daily,
name="zombie-dry-run", #spark_submit_for_"+table_name
# image='image_name',
image='imagerepo.io:5050/panoiq/tools:sparktester',
image_pull_policy = 'Always',
image_pull_secrets = 'registry',
namespace='development',
cmds=['spark-submit'],
arguments=all_arguments,
labels={"foo": "bar"},
task_id="dry_run_demo", #spark_submit_for_"+table_name
# config_file="conf",
volumes=[volume, air_connectors_volume],
volume_mounts=[volume_mount, air_connectors_volume_mount],
)
push_tables_list = PythonOperator(task_id= "load_tables_list",
python_callable=get_tables,
dag=dag_daily)
complete = DummyOperator(task_id="complete",
dag=dag_daily)
for rec in get_tables():
table_name = rec[9]
s3_folder_name = rec[14]
s3_object_name = rec[13]
jdbcUrl = rec[4] + rec[8]
lineagegraph = ",".join(rec[17].split("#"))
entitlement = rec[10]
remarks = rec[11]
username = rec[5]
password = rec[6]
s3_folder_format = rec[16]
select_query = rec[9]
application_args= [select_query, s3_folder_name, jdbcUrl, lineagegraph,entitlement, remarks,username,password,s3_folder_format,s3_object_name]
push_tables_list >> load_table(table_name, application_args) >> complete
Any Help or pointers are appreciated on the issue!! Thanks in advance!!

I was able to fix this issue with the code below, I was able to use the Airflow pod itself as driver and that will just spawn a executor pod and run the jobs and die once completed the job flow
Below is my Python File for anyone that needs to do this again
import logging
import csv
import airflow
from airflow import DAG
from airflow.utils import dates as date
from datetime import timedelta, datetime
from airflow.providers.apache.spark.operators.spark_jdbc import SparkSubmitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
#from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from dateutil.tz import tzlocal
from airflow.kubernetes.volume import Volume
from airflow.kubernetes.volume_mount import VolumeMount
import pendulum
#from airflow.models import Variables
local_tz = pendulum.timezone("Asia/Dubai")
default_args = {
'owner': 'test',
'depends_on_past': False,
'start_date': datetime(2021, 5, 1, tzinfo=local_tz),
'retries': 1,
'retry_delay': timedelta(hours=1),
'email': ['admin#test.com'],
'email_on_failure': False,
'email_on_retry': False
}
dag_daily = DAG(dag_id='datapipeline',
default_args=default_args,
catchup=False,
schedule_interval='#hourly')
start = DummyOperator(task_id='run_this_first', dag=dag_daily)
_config = {
'application': '/air-spark/spark-etl-assembly-2.0.jar',
'num_executors': 2,
'driver_memory': '5G',
'executor_memory': '10G',
'driver_class_path':'/air-connectors/mysql-connector-java-5.1.49.jar',
'jars':'/air-connectors/mysql-connector-java-5.1.49.jar,/air-connectors/aws-java-sdk-bundle-1.12.374.jar,/air-connectors/hadoop-aws-3.3.1.jar',
#'java_class': 'com.spark.ETLHandler'
}
spark_config = {
"spark.executor.extraClassPath":"/air-connectors/mysql-connector-java-5.1.49.jar,/air-connectors/aws-java-sdk-bundle-1.12.374.jar,/air-connectors/hadoop-aws-3.3.1.jar",
"spark.driver.extraClassPath":"/air-connectors/mysql-connector-java-5.1.49.jar,/air-connectors/aws-java-sdk-bundle-1.12.374.jar,/air-connectors/hadoop-aws-3.3.1.jar"
}
t2 = BashOperator(
task_id='bash_example',
# "scripts" folder is under "/usr/local/airflow/dags"
bash_command="ls /air-spark/ && pwd",
dag=dag_daily)
def get_tables(table_file='/csv-directory/success-dag.csv', **kwargs):
logging.info("#Starting get_tables()#")
tables_list=[]
with open(table_file) as csvfile:
reader = csv.reader(csvfile, delimiter=',')
tables_list= [row for row in reader]
tables_list.pop(0) #remove header
return tables_list
def load_table(table_name, application_args, **kwargs):
k8s_arguments = [ "--master", "local[*]", "--conf", "spark.executor.extraClassPath=/air-connectors/mysql-connector-java-5.1.49.jar",
"--conf", "spark.driver.extraClassPath=/opt/spark/connectors/mysql-connector-java-5.1.49.jar", "--jars",
"/opt/spark/connectors/mysql-connector-java-5.1.49.jar,/opt/spark/connectors/ojdbc11-21.7.0.0.jar",
"--conf=spark.kubernetes.container.image=imagerepo.io:5050/tools:sparktesterV0.6",
"--conf=spark.kubernetes.container.image.pullSecrets=registry",
"--num-executors", "5", "--executor-memory", "1G", "--driver-memory", "2G", "--class=com.spark.ETLHandler",
"--name", "arrow-spark", "/opt/spark/connectors/spark-etl-assembly-2.0.jar" ];
all_arguments = k8s_arguments + application_args
# spark =
return KubernetesPodOperator(
image="imagerepo.io:5050/tools:sparktesterV0.6",
service_account_name="air-airflow-worker",
name="data_pipeline_k8s",
task_id="data_pipeline_k8s",
get_logs=True,
dag=dag_daily,
namespace="development",
image_pull_secrets="registry",
image_pull_policy="Always",
cmds=["spark-submit"],
arguments=all_arguments
)
# spark.set_upstream(start)
push_tables_list = PythonOperator(task_id= "load_tables_list",python_callable=get_tables,dag=dag_daily)
complete = DummyOperator(task_id="complete",dag=dag_daily)
for rec in get_tables():
table_name = rec[9]
s3_folder_name = rec[14]
s3_object_name = rec[13]
jdbcUrl = rec[4] + rec[8]
lineagegraph = ",".join(rec[17].split("#"))
entitlement = rec[10]
remarks = rec[11]
username = rec[5]
password = rec[6]
s3_folder_format = rec[16]
select_query = rec[9]
application_args= [select_query, s3_folder_name, jdbcUrl, lineagegraph,entitlement, remarks,username,password,s3_folder_format,s3_object_name]
push_tables_list >> load_table(table_name, application_args) >> complete

how to access s3 bucket in airflow without giving connection string via admin -> connection UI

I would like the connect to aws s3 without making use of the Admin-> configuration UI of Airflow.
Can any one ,please let me know is there a way to do so in Apache Airflow?

Since you want to connect to AWS S3 without using the default s3 operator in Airflow, You can use the PythonOperator and make sure boto is added to the python dependencies which Airflow is using.
For installing boto you can use
pip install boto3
If you are using docker you have to add boto in the docker file.
from airflow.operators import PythonOperator
from airflow.models import DAG
from datetime import datetime, timedelta
from datetime import datetime, timedelta
import time
args = {
'owner': 'airflow_default',
'start_date': datetime(2022,6,6)
}
def s3_connect(ds, **kwargs):
'''' Connect to s3 using Boto script '''
s3 = boto3.resource(
service_name='s3',
region_name='us-east-2',
aws_access_key_id='mykey',
aws_secret_access_key='mysecretkey'
)
for bucket in s3.buckets.all():
print(bucket.name)
def my_dummy_function():
''' Dummy function for doing nothing '''
return 'dummy function test'
dag = DAG(
dag_id='example_python_operator_without_s3', default_args=args,
schedule_interval=None)
run_this = PythonOperator(
task_id='print_the_context',
provide_context=True,
python_callable=s3_connect,
dag=dag)
also_this = PythonOperator(task_id="dummy_function", python_callable=my_dummy_function,dag= dag)
run_this>>also_this

Unable to run python code in Azure function

I have init.py and blobquickstartv12.py within the same Azure Function "Test-v3". While init.py is a blob trigger, "blobquickstartv12.py " has the python code that I want to run. The only way I am able to run my code in blobquickstartv12.py is if I paste the entire code within the main() function of init.py.
I tried using from blobquickstartv12 import load where load is a function in my blobquickstartv12.py code but that gave me Exception: ModuleNotFoundError: No module named 'blobquickstartv12'
Can anyone tell me how can I call my custom code from within init.py
This is how the structure of my Azure Function looks like:
Here is my code in init.py:
import azure.functions as func
import pandas as pd
import numpy as np
from datetime import datetime
from pandas import ExcelFile
from pandas import ExcelWriter
from datetime import datetime, timedelta
from azure.storage.blob import BlockBlobService
import pyodbc
import sys
import os
from io import StringIO
import pkgutil
from . import blobquickstartv12
def main(myblob: func.InputStream):
logging.info(f"Python blob trigger function processed blob \n"
f"Name: {myblob.name}\n"
f"Blob Size: {myblob.length} bytes")
load=blobquickstartv12.load()
Here is my code for blobquickstart.py:
class load:
#CODE FOR CONNECTING TO THE SQL DATABASE
SERVER = 'xxxxxx.database.windows.net'
DATABASE = 'XYZ'
username = 'USERNAME'
pwd = 'PASSWORD'
driver= '{ODBC Driver 17 for SQL Server}'
cnxn = pyodbc.connect('DRIVER='+driver+';SERVER='+SERVER+';PORT=1433;DATABASE='+DATABASE+';UID='+username+';PWD='+ pwd)
cursor = cnxn.cursor()
print("Connected to Azure SQL")
#sqlcommand = ("INSERT INTO Stage.File(File_ID,File_type) VALUES (1235,'D')")
Curr_dt = datetime.now()
BLOB_STORAGEACCOUNTNAME="blobstorage"
BLOB_STORAGEACCOUNTKEY="AccountKey"
BLOBNAME="BlobName"
CONTAINERNAME= "ContainerName"

Update:
Please check the structure. On my side it is no problem. The code can import blobquickstartv12 fine.
This is the structure of azure function:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#folder-structure
This is the doc of how to import:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#import-behavior
Original Answer:
import module in the module should be like this:
For example, I have a dog.py and I want to use it.
This is the dog.py:
class Dog:
def __init__(self,name):
super().__init__()
self.name = name
def showdog(self):
print("This is a dog!")
In the _init_.py, you should use this:
from . import dog
mydog = dog.Dog("Woodie")
It work fine on my side.
This is the structure:

No module named airfow.gcp - how to run dataflow job that uses python3/beam 2.15?

When I go to use operators/hooks like the BigQueryHook I see a message that these operators are deprecated and to use the airflow.gcp... operator version. However when i try and use it in my dag it fails and says no module named airflow.gcp. I have the most up to date airflow composer version w/ beta features, python3. Is it possible to install these operators somehow?
I am trying to run a Dataflow Job in python 3 using beam 2.15. I have tried virtualenv operator, but that doesn't work because it only allows python2.7. How can I do this?

The newest Airflow version available in Composer is either 1.10.2 or 1.10.3 (depending on the region). By then, those operators were in the contrib section.
Focusing on how to run Python 3 Dataflow jobs with Composer you'd need for a new version to be released. However, if you need an immediate solution you can try to back-port the fix.
In this case I defined a DataFlow3Hook which extends the normal DataFlowHook but that it does not hard-code python2 in the start_python_dataflow method:
class DataFlow3Hook(DataFlowHook):
def start_python_dataflow(
...
py_interpreter: str = "python3"
):
...
self._start_dataflow(variables, name, [py_interpreter] + py_options + [dataflow],
label_formatter)
Then we'll have our custom DataFlowPython3Operator calling the new hook:
class DataFlowPython3Operator(DataFlowPythonOperator):
def execute(self, context):
...
hook = DataFlow3Hook(gcp_conn_id=self.gcp_conn_id,
delegate_to=self.delegate_to,
poll_sleep=self.poll_sleep)
...
hook.start_python_dataflow(
self.job_name, formatted_options,
self.py_file, self.py_options, py_interpreter="python3")
Finally, in our DAG we just use the new operator:
task = DataFlowPython3Operator(
py_file='/home/airflow/gcs/data/main.py',
task_id=JOB_NAME,
dag=dag)
See full code here. Job runs with Python 3.6:
Environment details and dependencies used (Beam job was a minimal example):
softwareConfig:
imageVersion: composer-1.8.0-airflow-1.10.3
pypiPackages:
apache-beam: ==2.15.0
google-api-core: ==1.14.3
google-apitools: ==0.5.28
google-cloud-core: ==1.0.3
pythonVersion: '3'
Let me know if that works for you. If so, I'd recommend moving the code to a plugin for code readability and to reuse it across DAGs.

As an alternative, you can use the PythonVirtualenvOperator on older airflow versions. Given some beam pipeline (wrapped in a function) saved as dataflow_python3.py:
def main():
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
import argparse
import logging
class ETL(beam.DoFn):
def process(self, row):
#do data processing
def run(argv=None):
parser = argparse.ArgumentParser()
parser.add_argument(
'--input',
dest='input',
default='gs://bucket/input/input.txt',
help='Input file to process.'
)
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_args.extend([
'--runner=DataflowRunner',
'--project=project_id',
'--region=region',
'--staging_location=gs://bucket/staging/',
'--temp_location=gs://bucket/temp/',
'--job_name=job_id',
'--setup_file=./setup.py'
])
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
with beam.Pipeline(options=pipeline_options) as p:
rows = (p | 'read rows' >> beam.io.ReadFromText(known_args.input))
etl = (rows | 'process data' >> beam.ParDo(ETL()))
logging.getLogger().setLevel(logging.DEBUG)
run()
You can run it using the following DAG file:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonVirtualenvOperator
import sys
import dataflow_python3 as py3 #import your beam pipeline file here
default_args = {
'owner': 'John Smith',
'depends_on_past': False,
'start_date': datetime(2016, 1, 1),
'email': ['email#gmail.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 3,
'retry_delay': timedelta(minutes=1),
}
CONNECTION_ID = 'proj_id'
with DAG('Dataflow_Python3', schedule_interval='#once', template_searchpath=['/home/airflow/gcs/dags/'], max_active_runs=15, catchup=True, default_args=default_args) as dag:
dataflow_python3 = PythonVirtualenvOperator(
task_id='dataflow_python3',
python_callable=py3.main, #this is your beam pipeline callable
requirements=['apache-beam[gcp]', 'pandas'],
python_version=3,
dag=dag
)
dataflow_python3

I have run Python 3 Beam -2.17 by using DataflowTemplateOperator and it worked like a charm.
Use below command to create template:
python3 -m scriptname --runner DataflowRunner --project project_id --staging_location staging_location --temp_location temp_location --template_location template_location/script_metadata --region region --experiments use_beam_bq_sink --no_use_public_ips --subnetwork=subnetwork
scriptname would be name of your Dataflow Python file(without .py extension)
--template_location - The location where dataflow template would be created, don't put any extension like .json to it. Simply, scriptname_metadata would work.
--experiments use_beam_bq_sink - This parameter would be used if your sink is BigQuery otherwise you can remove it.
import datetime as dt
import time
from airflow.models import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
lasthour = dt.datetime.now() - dt.timedelta(hours=1)
args = {
'owner': 'airflow',
'start_date': lasthour,
'depends_on_past': False,
'dataflow_default_options': {
'project': "project_id",
'staging_location': "staging_location",
'temp_location': "temp_location",
'region': "region",
'runner': "DataflowRunner",
'job_name': 'job_name' + str(time.time()),
},
}
dag = DAG(
dag_id='employee_dataflow_dag',
schedule_interval=None,
default_args=args
)
Dataflow_Run = DataflowTemplateOperator(
task_id='dataflow_pipeline',
template='template_location/script_metadata',
parameters ={
'input':"employee.csv",
'output':'project_id:dataset_id.table',
'region':"region"
},
gcp_conn_id='google_cloud_default',
poll_sleep=15,
dag=dag
)
Dataflow_Run

Dynamic DAG Creation Not working as expected in Apache airflow

I was trying to understand how dynamic dags are created in Apache airflow as I need this to create dynamic dags in my project.
Below is the link iam following:Dynamic DAG creation in Apache airflow
Below is the code block for creating a sample hello world dynamic DAGS.(Dynamic DAGs creation based on input parameters).
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
def create_dag(dag_id,
schedule,
dag_number,
default_args):
def hello_world_py(*args):
print('Hello World')
print('This is DAG: {}'.format(str(dag_number)))
dag = DAG(dag_id,
schedule_interval=schedule,
default_args=default_args)
with dag:
t1 = PythonOperator(
task_id='hello_world',
python_callable=hello_world_py,
dag_number=dag_number)
return dag
# build a dag for each number in range(10)
for n in range(1, 10):
dag_id = 'hello_world_{}'.format(str(n))
default_args = {'owner': 'airflow',
'start_date': datetime(2018, 1, 1)
}
schedule = '#daily'
dag_number = n
globals()[dag_id] = create_dag(dag_id,
schedule,
dag_number,
default_args)
The expectation is to create 9 such DAGs.But what I could see is that once i compile the above code block with python3 code_sample.py,it creates 9 DAGs however the code embeded in the DAG is entire sample code.
But to my understanding the created DAGs should have only the below code block which is available inside create_dag method in the above sample code block.
Expected DAG code:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
def hello_world_py(*args):
print('Hello World')
print('This is DAG: {}'.format(str(dag_number)))
dag = DAG(dag_id,
schedule_interval=schedule,
default_args=default_args)
with dag:
t1 = PythonOperator(
task_id='hello_world',
python_callable=hello_world_py,
dag_number=dag_number)
Actual DAG code:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
def create_dag(dag_id,
schedule,
dag_number,
default_args):
def hello_world_py(*args):
print('Hello World')
print('This is DAG: {}'.format(str(dag_number)))
dag = DAG(dag_id,
schedule_interval=schedule,
default_args=default_args)
with dag:
t1 = PythonOperator(
task_id='hello_world',
python_callable=hello_world_py,
dag_number=dag_number)
return dag
# build a dag for each number in range(10)
for n in range(1, 10):
dag_id = 'hello_world_{}'.format(str(n))
default_args = {'owner': 'airflow',
'start_date': datetime(2018, 1, 1)
}
schedule = '#daily'
dag_number = n
globals()[dag_id] = create_dag(dag_id,
schedule,
dag_number,
default_args)
Let me know what is creating the above problem

The code that you see in Airflow UI when clicking on "Code" tab is simply the whole .py file source code. See how this function is implemented:
https://github.com/apache/airflow/blob/master/airflow/www/views.py#L437

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to write to local file path in in Airflow- MacOS - python-3.x

I have tried the path dynamic way on my Mac as you did with path=os.getcwd(). I put it in the task or global namespace, but it didn't return any feasible path. One way you can work around this is to put the path as a variable in Airflow variables, and then get it from there when you need it.

Related

I am relatively new to Airflow and spark and want to use the Kubernetes Operator in airflow Dag to run a Spark Submit command

how to access s3 bucket in airflow without giving connection string via admin -> connection UI

Unable to run python code in Azure function

No module named airfow.gcp - how to run dataflow job that uses python3/beam 2.15?

Dynamic DAG Creation Not working as expected in Apache airflow

Categories

Resources