I am trying to use DruidHook to load data from HDFS into Druid. Below is my DAG script:
from datetime import datetime, timedelta
import json
from airflow.hooks import HttpHook, DruidHook
from airflow.operators import PythonOperator
from airflow.models import DAG
def check_druid_con():
    dr_hook = DruidHook(druid_ingest_conn_id='DRUID_INDEX',druid_query_conn_id='DRUID_QUERY')
    dr_hook.load_from_hdfs("druid_airflow","hdfs://xx.xx.xx.xx/demanddata/demand2.tsv","stay_date",["channel","rate"],"2016-12-11/2017-12-13",1,-1,metric_spec=[{ "name" : "count", "type" : "count" }],hadoop_dependency_coordinates="org.apache.hadoop:hadoop-client:2.7.3")
default_args = {
    'owner': 'TC',
    'start_date': datetime(2017, 8, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG('druid_data_load', default_args=default_args)
druid_task1=PythonOperator(task_id='check_druid',
                           python_callable=check_druid_con,
                           dag=dag)
I keep getting the error: TypeError: load_from_hdfs() takes at least 10 arguments (10 given). However, I have passed 10 arguments to load_from_hdfs, and it still fails. Please help.
Regards
Rahul
The issue was a mismatch between the function signature in the Airflow documentation and the actual definition in the code.
In the documentation it is defined as:
load_from_hdfs(datasource, static_path, ts_dim, columns, intervals, num_shards, target_partition_size, metric_spec=None, hadoop_dependency_coordinates=None)
But in reality the function definition is:
def load_from_hdfs(
        self, datasource, static_path, ts_dim, columns,
        intervals, num_shards, target_partition_size, query_granularity, segment_granularity,
        metric_spec=None, hadoop_dependency_coordinates=None):
Passing the arguments that the actual definition requires got me past this error. The count in the error message includes self and the two keyword arguments, which is why it reports 10 given even though query_granularity and segment_granularity were missing.
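To illustrate why the original call fails, here is a minimal sketch using a stand-in function with the same parameter list as the definition quoted above (minus self). It is not the Airflow implementation, and the granularity values "NONE" and "DAY" are placeholders I am assuming for illustration, not values confirmed by the question. Python 3 also words the TypeError differently from the Python 2 message shown in the question.

```python
def load_from_hdfs(datasource, static_path, ts_dim, columns,
                   intervals, num_shards, target_partition_size,
                   query_granularity, segment_granularity,
                   metric_spec=None, hadoop_dependency_coordinates=None):
    """Stand-in (not Airflow code) that only echoes the two granularity arguments."""
    return {"query_granularity": query_granularity,
            "segment_granularity": segment_granularity}


# The seven positional arguments used in the question.
common_args = ("druid_airflow", "hdfs://xx.xx.xx.xx/demanddata/demand2.tsv",
               "stay_date", ["channel", "rate"], "2016-12-11/2017-12-13", 1, -1)
keyword_args = {
    "metric_spec": [{"name": "count", "type": "count"}],
    "hadoop_dependency_coordinates": "org.apache.hadoop:hadoop-client:2.7.3",
}

# Call shaped like the one in the question: the two granularity arguments are missing.
try:
    load_from_hdfs(*common_args, **keyword_args)
    error_raised = False
except TypeError:
    error_raised = True

# Same call with the two extra positional arguments supplied (placeholder values).
result = load_from_hdfs(*common_args, "NONE", "DAY", **keyword_args)
print(error_raised, result)
```

When the documentation and the installed version disagree, inspecting the installed function's signature (for example with help() or inspect.signature) is usually the quickest way to see what it really expects.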
Airflow is running, but the task is stuck in the queued state.
I ran airflow scheduler.
Here are my code and screenshots of the Airflow UI.
Can anyone explain what the problem might be?
import datetime as dt
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
import pandas as pd
def CSVToJson():
    df = pd.read_csv('/Users/daeyong/Desktop/Projects/Python/airflow2/file.csv')
    # iterrows() yields (index, row) pairs
    for i, r in df.iterrows():
        print(r['name'])
    # orient='records' writes one JSON object per row
    df.to_json('fromAirflow.json', orient='records')
default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2022, 3, 10),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5)
}
with DAG('MyCSVDAG',
         default_args=default_args,
         schedule_interval=timedelta(minutes=5),
         # '0 * * * *',
         ) as dag:
    print_starting = BashOperator(task_id='starting',
                                  bash_command='echo "I am reading the CSV now....."')
    CSVJson = PythonOperator(task_id='convertCSVtoJson', python_callable=CSVToJson)
    print_starting >> CSVJson
[Screenshots of the Airflow UI: airflow_screenshot_1, airflow_screenshot_2]
Without more context, there are two possible reasons:
1. Your default pool does not have any slots assigned or available.
2. Your task declarations need to be indented so that they fall within the "with DAG" statement.
Scheduler logs and a screenshot of your Pools page would help narrow this down.
I have been trying to write some simple text to a local txt file from a DAG script. Even though the task runs successfully, I cannot find the file anywhere. Is it because I am using WSL on Windows?
Here is my simple script:
import os
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 12, 5),
    "retries": 0,
}
dag = DAG(
    "simple_dag",
    default_args=default_args,
    schedule_interval="@once",
)
t1 = BashOperator(
    task_id="print_file",
    bash_command='echo "pipeline" > opDE.txt',
    dag=dag)
t1
You need to specify an explicit path for the output file.
When Airflow executes your command, it runs it from a temporary location, so the file is written there rather than next to your DAG script. You can also see this in the task log.
So the fix is to write the file to a path of your choosing.
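The effect can be reproduced outside Airflow. The sketch below is only an illustration under assumptions: it uses two throwaway directories as stand-ins for Airflow's temporary working directory and for your desired output folder, and it assumes a POSIX shell (as you would have under WSL). It shows that a relative redirect lands in whatever directory the command runs from, while an absolute path lands where you expect.

```python
import os
import subprocess
import tempfile

# run_dir stands in for the temporary directory the command is executed from;
# out_dir stands in for the folder where you actually want the file (both hypothetical).
with tempfile.TemporaryDirectory() as run_dir, tempfile.TemporaryDirectory() as out_dir:
    # Relative redirect: the file is created inside the working directory.
    subprocess.run('echo "pipeline" > opDE.txt', shell=True, cwd=run_dir, check=True)
    relative_landed_in_run_dir = os.path.exists(os.path.join(run_dir, "opDE.txt"))

    # Absolute redirect: the file is created at the path you name.
    target = os.path.join(out_dir, "opDE.txt")
    subprocess.run('echo "pipeline" > "{}"'.format(target),
                   shell=True, cwd=run_dir, check=True)
    with open(target) as fh:
        absolute_content = fh.read().strip()

print(relative_landed_in_run_dir, absolute_content)
```

In the DAG itself, the equivalent change is to put an absolute path after the > in bash_command.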
The DAG constructor has a "schedule_interval" attribute that accepts a cron expression. I think cron has a limitation here: we cannot make a job/task run every 25 consecutive days. Below is the cron expression to run a job on every 25th day of the month.
5 10 */25 * *
But I need a job/DAG to run every 25 days, counted consecutively. Is there a way to schedule the DAG to satisfy this requirement?
You can set schedule_interval using datetime.timedelta.
For example, to schedule a DAG to run for the first time in 25 days from today at 10:05 CET and then run every 25 days, the DAG script can be specified like this:
import pendulum
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'Airflow',
    'start_date': datetime(
        2019, 11, 24, 10, 5, tzinfo=pendulum.timezone('Europe/Berlin')
    ),
}
with DAG(
    'my_dag', schedule_interval=timedelta(days=25), default_args=default_args,
) as dag:
    op = DummyOperator(task_id='dummy')
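To see the difference between the two approaches, here is a small standard-library sketch (no Airflow needed). It assumes the usual cron step semantics, under which */25 in the day-of-month field matches days 1 and 26 of every month, and it compares the gaps between those dates with the gaps of a fixed 25-day interval starting from the start_date used above.

```python
from datetime import date, timedelta

start = date(2019, 11, 24)

# Fixed 25-day spacing, as a timedelta-based interval produces.
every_25_days = [start + timedelta(days=25 * i) for i in range(4)]

# Dates over the same span whose day of month is 1 or 26,
# i.e. what a "*/25" day-of-month field is assumed to match.
span_days = (every_25_days[-1] - start).days
cron_like = [start + timedelta(days=n) for n in range(span_days + 1)
             if (start + timedelta(days=n)).day in (1, 26)]

# Day counts between consecutive dates in each series.
gaps_timedelta = [(b - a).days for a, b in zip(every_25_days, every_25_days[1:])]
gaps_cron = [(b - a).days for a, b in zip(cron_like, cron_like[1:])]

print(gaps_timedelta)
print(gaps_cron)
```

The timedelta series keeps a constant gap, while the cron-style series resets at every month boundary, which is why a timedelta fits the "every 25 days" requirement.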
I've created a DAG that is scheduled to run every 5 minutes using cron syntax.
A pool with a single slot was also created for this DAG.
I've tried restarting the server and scheduler and resetting the database. Currently, the DAG runs in UTC. I've also tried setting my local timezone, 'Europe/Minsk' (UTC+3), but it had no effect.
import random
import time
import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'pool': 'download',
    # 'priority_weight': 10,
    # 'queue': 'bash_queue',
}
params = {
    'table': 'api_avitoimage',
}
dag = DAG(
    dag_id='test_download_avitoimage',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
)
def sleep_for_a_bit(random_base):
    time.sleep(random_base)
with dag:
    download = BashOperator(
        task_id='download',
        bash_command='/usr/bin/python3 /home/artur/downloader.py --table {{ params.table }}',
        params=params,
        dag=dag)
    sleep = PythonOperator(
        task_id='sleep_for_a_bit',
        python_callable=sleep_for_a_bit,
        op_kwargs={'random_base': random.uniform(0, 1)},
        dag=dag,
    )
    download >> sleep
Issue: the DAG runs about 2-3 times per minute, which is clearly not the intended schedule.
EDITED: It turns out that there are 16/16 simultaneously active DAG runs, but I cannot understand where this "magic number 16" comes from.
By default, Airflow tries to complete all "missed" DAG runs since start_date. As your start_date is set to airflow.utils.dates.days_ago(2), Airflow is going to run the DAG 576 times (two days of 5-minute intervals) before it starts launching runs on schedule. You can turn this off by adding catchup=False to your DAG definition (not default_args).
The magic number 16 comes from the parameter max_active_runs_per_dag = 16, which is the default.
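The 576 figure is just arithmetic, which this short sketch reproduces. The variable names are mine, and the count is approximate in practice, since days_ago(2) is anchored to midnight and the exact backlog depends on the time of day.

```python
from datetime import timedelta

# Window between start_date (two days back) and now, and the 5-minute schedule step.
catchup_window = timedelta(days=2)
schedule_step = timedelta(minutes=5)

# Number of schedule intervals that fit in the window.
missed_runs = catchup_window // schedule_step

# Default cap on simultaneously active runs per DAG, as stated above.
max_active_runs_per_dag = 16
concurrent_runs = min(missed_runs, max_active_runs_per_dag)

print(missed_runs, concurrent_runs)
```

So the scheduler works through the backlog 16 runs at a time, which matches the 16/16 active runs you observed.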
I have a task that reads a list of files from Azure and pushes the results to an XCOM. The operator is specifically the AzureDataLakeStorageListOperator. The source is here: adls_list_operator.py
I want to print the output of this task using something like a BashOperator but I am unsure how to do this locally. As far as I can tell, the airflow test command only does single tasks, and so I can't get the output of my first task in testing my second task.
Here is my full DAG:
from airflow import DAG
from airflow.contrib.operators.adls_list_operator import AzureDataLakeStorageListOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['vishaalkal@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG('adls', default_args=default_args,
          schedule_interval=timedelta(days=1))
# t1 and t2 are examples of tasks created by instantiating operators
t1 = AzureDataLakeStorageListOperator(
    task_id='list_adls_files',
    path='reportdata/*.csv',
    dag=dag)
# xcom_pull expects the upstream task_id ('list_adls_files'), not the Python variable name t1
t2 = BashOperator(
    task_id='templated',
    bash_command='date; echo "{{ task_instance.xcom_pull(task_ids="list_adls_files") }}"',
    dag=dag
)
t2.set_upstream(t1)
I believe I've set up authentication correctly as it was giving me AccessDenied errors before but no longer raises an exception.