How to handle different task intervals in a single DAG in Airflow? - python-3.x

I have a single DAG with multiple tasks in this simple structure: tasks A, B, and C can run at the start without any dependencies, but task D depends on A. Now here is my question:
Tasks A, B, and C run daily, but I need task D to run weekly, after A succeeds. How can I set up this DAG?
Does changing the schedule_interval of a task work? Is there any best practice for this problem?
Thanks for your help.

You can use a ShortCircuitOperator to do this.
import airflow
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.models import DAG

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

# schedule_interval belongs on the DAG itself, not in default_args
dag = DAG(dag_id='example', default_args=args, schedule_interval='0 10 * * *')

a = DummyOperator(task_id='a', dag=dag)
b = DummyOperator(task_id='b', dag=dag)
c = DummyOperator(task_id='c', dag=dag)
d = DummyOperator(task_id='d', dag=dag)

def check_trigger(execution_date, **kwargs):
    # Let downstream run only on Mondays (weekday() == 0)
    return execution_date.weekday() == 0

check_trigger_d = ShortCircuitOperator(
    task_id='check_trigger_d',
    python_callable=check_trigger,
    provide_context=True,
    dag=dag
)

a.set_downstream(b)
b.set_downstream(c)
a.set_downstream(check_trigger_d)
# Perform D only if the trigger function returns a truthy value
check_trigger_d.set_downstream(d)

In Airflow version >= 2.1.0, you can use the BranchDayOfWeekOperator, which is exactly suited to your case.
See this answer for more details.
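A minimal sketch of that alternative (replacing the ShortCircuitOperator above; skip_d is a hypothetical placeholder task for the non-Monday branch, and the Airflow >= 2.1 import paths are assumed):
from airflow.operators.dummy import DummyOperator
from airflow.operators.weekday import BranchDayOfWeekOperator
from airflow.utils.weekday import WeekDay

# Branch to task 'd' only when the run's execution day is a Monday.
check_weekday = BranchDayOfWeekOperator(
    task_id='check_weekday',
    week_day=WeekDay.MONDAY,
    follow_task_ids_if_true='d',
    follow_task_ids_if_false='skip_d',
    use_task_execution_day=True,
    dag=dag,
)
skip_d = DummyOperator(task_id='skip_d', dag=dag)

a >> check_weekday >> [d, skip_d]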

Related

How can I query a BigQuery table only once per DAG run instead of every time Airflow refreshes?

I have a DAG that queries a BigQuery table to pull data from it, and that also uses a ShortCircuitOperator to check whether the DAG needs to run, again based on a BQ table. The issue is that this currently queries the table every time Airflow refreshes (parses) the DAG. The table is small (each query is less than 1 kB), but I'm concerned this will get more expensive as it's scaled up. Is there a way to only query this table once per DAG run instead?
Here's a code snippet to show what's going on:
client = bigquery.Client()

def create_query_lists(list_type):
    query_job = client.query(
        """
        SELECT filename
        FROM `requests`
        """
    )
    results = query_job.result()
    results_list = []
    for row in results:
        results_list.append(row.filename)
    return results_list

def check_contents():
    if len(create_query_lists()) == 0:
        raise ValueError('Nothing to do')
        return False
    else:
        print("There's stuff to do")
        return True

# Create task to check if the data being pulled is empty; if so, fail so other tasks don't run
check_list = ShortCircuitOperator(
    task_id="check_column_not_empty",
    provide_context=True,
    python_callable=check_contents
)

check_list  # do subsequent tasks which use the same function
If I understood your need correctly, you want to execute the downstream tasks only if the result of the SQL query is not empty.
In this case you can also use a BranchPythonOperator, for example:
import airflow
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from google.cloud import bigquery

client = bigquery.Client()

def create_query_lists():
    query_job = client.query(
        """
        SELECT filename
        FROM `requests`
        """
    )
    results = query_job.result()
    results_list = []
    for row in results:
        results_list.append(row.filename)
    return results_list

def check_contents():
    # Return the task_id of the branch to follow
    if len(create_query_lists()) == 0:
        return 'KO'
    else:
        return 'OK'

with airflow.DAG(
        "your_dag",
        schedule_interval=None) as dag:

    branching = BranchPythonOperator(
        task_id='file_exists',
        python_callable=check_contents,
        dag=dag
    )

    ok = DummyOperator(task_id='OK', dag=dag)
    ko = DummyOperator(task_id='KO', dag=dag)
    fake_task = DummyOperator(task_id='fake_task', dag=dag)

    branching >> ok >> fake_task
    branching >> ko
The BranchPythonOperator executes the query; if the result is not empty it returns 'OK', otherwise 'KO'.
We create two DummyOperator tasks, one for OK and the other for KO (the two branches).
Depending on the result, the run follows either the OK or the KO branch.
The KO branch finishes the DAG without running any other tasks.
The OK branch continues the DAG with the tasks that follow (fake_task in my example).
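As a side note on the original concern about the table being queried on every refresh (a sketch, not part of the answer above): as long as the bigquery.Client() construction and the query stay inside the python_callable, they run only when the task executes, not every time the scheduler parses the DAG file. For example:
def check_contents(**kwargs):
    # Everything here runs only at task execution time, so nothing is
    # queried while the scheduler parses the DAG file.
    from google.cloud import bigquery
    client = bigquery.Client()
    rows = list(client.query("SELECT filename FROM `requests`").result())
    return 'OK' if rows else 'KO'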

use applyInPandas with PySpark on a cluster

The applyInPandas method can be used to apply a function in parallel to a PySpark GroupedData object, as in the minimal example below.
import pandas as pd
from time import sleep
from pyspark.sql import SparkSession

# spark session object
spark = SparkSession.builder.getOrCreate()

# test function
def func(x):
    sleep(1)
    return x

# run test function in parallel
pdf = pd.DataFrame({'x': range(8)})
sdf = spark.createDataFrame(pdf)
sdf = sdf.groupby('x').applyInPandas(func, schema=sdf.schema)
dx = sdf.toPandas()
The minimal example has been tested on an 8-CPU single-node system (e.g. an m5.4xlarge Amazon EC2 instance) and takes approximately 1 second to run, as the one-second sleep function is applied to each of the 8 groups in parallel, one per CPU. The pdf and dx objects are shown in the screenshot below.
My issue is how to run the same minimal example on a cluster, e.g. an Amazon EMR cluster. So far, after setting up a cluster, the code is executed on a single core, so it requires approximately 8 seconds to run (each function call executed in series).
UPDATE
Following @Douglas M's answer, the code below parallelizes on an EMR cluster:
import pandas as pd
from datetime import datetime
from time import sleep
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# test function
def func(x):
    sleep(1)
    return x

# run and time the test function
sdf = spark.range(start=0, end=8, step=1, numPartitions=8)
sdf = sdf.groupby('id').applyInPandas(func, schema=sdf.schema)

now = datetime.now()
dx = sdf.toPandas()
print((datetime.now() - now).total_seconds())  # 1.09 sec
However, using repartition does not parallelize (code below).
import pandas as pd
from datetime import datetime
from time import sleep
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# test function
def func(x):
    sleep(1)
    return x

# run and time the test function
pdf = pd.DataFrame({'x': range(8)})
sdf = spark.createDataFrame(pdf)
sdf = sdf.groupby('x').applyInPandas(func, schema=sdf.schema)
sdf = sdf.repartition(8)

now = datetime.now()
dx = sdf.toPandas()
print((datetime.now() - now).total_seconds())  # 8.33 sec
Running the above code, the Spark progress bar first indicates 8 tasks and then switches to 1 task.
Spark's parallelism is based on the number of partitions in the dataframe it is processing. Your sdf dataframe has only one partition, because it is very small.
It would be better to first create your range with SparkSession.range:
SparkSession.range(start: int, end: Optional[int] = None, step: int = 1, numPartitions: Optional[int] = None) → pyspark.sql.dataframe.DataFrame
Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.
New in version 2.0.0.
Parameters:
start (int): the start value
end (int, optional): the end value (exclusive)
step (int, optional): the incremental step (default: 1)
numPartitions (int, optional): the number of partitions of the DataFrame
Returns: DataFrame
For a quick fix, add a repartition right after createDataFrame:
sdf = spark.createDataFrame(pdf).repartition(8)
This will put each of the 8 elements into its own partition, and the partitions can then be processed by separate worker cores.
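Putting the quick fix into the original minimal example gives something like the following sketch (the key point being that the repartition happens before the groupby/applyInPandas, unlike in the UPDATE above):
import pandas as pd
from time import sleep
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def func(x):
    sleep(1)
    return x

pdf = pd.DataFrame({'x': range(8)})
sdf = spark.createDataFrame(pdf).repartition(8)  # 8 partitions up front
sdf = sdf.groupby('x').applyInPandas(func, schema=sdf.schema)
dx = sdf.toPandas()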

Scheduling airflow TaskGroup throws AttributeError

So I am creating tasks in a TaskGroup and trying to add them to my DAG's sequence of tasks, but it is throwing this error:
Broken DAG: [/Users/abc/projects/abc/airflow_dags/dag.py] Traceback (most recent call last):
File "/Users/abc/.pyenv/versions/3.8.12/envs/vmd-3.8.12/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 1234, in set_downstream
self._set_relatives(task_or_task_list, upstream=False)
File "/Users/abc/.pyenv/versions/3.8.12/envs/vmd-3.8.12/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 1178, in _set_relatives
task_object.update_relative(self, not upstream)
AttributeError: 'NoneType' object has no attribute 'update_relative'
I am creating my task group and tasks like this:
def get_task_group(dag, task_group):
    t1 = DummyOperator(task_id='t1', dag=dag, task_group=task_group)
    t2 = DummyOperator(task_id='t2', dag=dag, task_group=task_group)
    t3 = DummyOperator(task_id='t3', dag=dag, task_group=task_group)
    t4 = DummyOperator(task_id='t4', dag=dag, task_group=task_group)
    t5 = DummyOperator(task_id='t5', dag=dag, task_group=task_group)
    t_list = [t2, t3, t4]
    t1.set_downstream(t_list)
    t5.set_upstream(t_list)
with DAG('some_dag', default_args=args) as dag:
    with TaskGroup(group_id=f"run_model_tasks", dag=dag) as tg:
        run_model_task_group = get_task_group(dag, tg)

    a1 = DummyOperator(task_id='a1', dag=dag)
    a2 = DummyOperator(task_id='a2', dag=dag)
    a3 = DummyOperator(task_id='a3', dag=dag)
    a4 = DummyOperator(task_id='a4', dag=dag)

    a1.set_downstream(a2)
    a2.set_downstream(run_model_task_group)
    a3.set_upstream(run_model_task_group)
    a3.set_downstream(a4)
If I remove the task group tasks from the sequencing by removing the lines
a2.set_downstream(run_model_task_group)
a3.set_upstream(run_model_task_group)
I can see that a1, a2, a3 & a4 are sequenced properly and I can see the disconnected run_model_task_group tasks, but as soon as I add the group to the sequence, I get the aforementioned error.
Can anyone explain what might be happening here?
Note that I am using a function taking dag and task_group parameters to create the task group tasks because I want to create the same set of tasks for another DAG too.
Python Version: 3.8.8
Airflow Version: 2.0.1
The AttributeError: 'NoneType' object has no attribute 'update_relative' is happening because run_model_task_group is None outside the scope of the with block (the get_task_group() function returns nothing), which is expected Python behaviour.
Without changing things too much from what you have done so far, you could refactor get_task_group() to return a TaskGroup object, like this:
def get_task_group(dag, group_id):
    with TaskGroup(group_id=group_id, dag=dag) as tg:
        t1 = DummyOperator(task_id='t1', dag=dag)
        t2 = DummyOperator(task_id='t2', dag=dag)
        t3 = DummyOperator(task_id='t3', dag=dag)
        t4 = DummyOperator(task_id='t4', dag=dag)
        t5 = DummyOperator(task_id='t5', dag=dag)
        t_list = [t2, t3, t4]
        t1.set_downstream(t_list)
        t5.set_upstream(t_list)
    return tg
In the DAG definition simply call it with:
run_model_task_group = get_task_group(dag, "run_model_tasks")
The resultant graph view looks like this:
DAG definition:
with DAG('some_dag',
         default_args=default_args,
         start_date=days_ago(2),
         schedule_interval='@once') as dag:

    # with TaskGroup(group_id="run_model_tasks", dag=dag) as tg:
    #     run_model_task_group = get_task_group(dag, )
    run_model_task_group = get_task_group(dag, "run_model_tasks")

    a1 = DummyOperator(task_id='a1', dag=dag)
    a2 = DummyOperator(task_id='a2', dag=dag)
    a3 = DummyOperator(task_id='a3', dag=dag)
    a4 = DummyOperator(task_id='a4', dag=dag)

    a1.set_downstream(a2)
    a2.set_downstream(run_model_task_group)
    a3.set_upstream(run_model_task_group)
    a3.set_downstream(a4)
Finally, consider using the bitshift operators (>> and <<) instead of set_downstream and set_upstream; it's the recommended way and also less verbose (source here).
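For illustration, the dependencies above could be written with the bitshift syntax as (a sketch, equivalent to the set_downstream/set_upstream calls):
a1 >> a2 >> run_model_task_group >> a3 >> a4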
Let me know if that worked for you.
Tested with: Airflow version: 2.1.4, Python 3.8.10

Python Multiprocessing Scheduling

In Python 3.6, I am running multiple processes in parallel, where each process pings a URL and returns a Pandas dataframe. I want to keep the (2+) processes running continually, so I have created a minimal representative example below.
My questions are:
1) My understanding is that since I have different functions, I cannot use Pool.map_async() and its variants. Is that right? The only examples of these I have seen repeat the same function, as in this answer.
2) What is the best practice for making this setup run perpetually? In my code below, I use a while loop, which I suspect is not suited for this purpose.
3) Is the way I am using Process and Manager optimal? I use multiprocessing.Manager.dict() as the shared dictionary to return the results from the processes. I saw in a comment on this answer that using a Queue here would make sense, however the Queue object has no .dict() method, so I am not sure how that would work.
I would be grateful for any improvements and suggestions with example code.
import numpy as np
import pandas as pd
import multiprocessing
import time

def worker1(name, t, seed, return_dict):
    '''worker function'''
    print(str(name) + 'is here.')
    time.sleep(t)
    np.random.seed(seed)
    df = pd.DataFrame(np.random.randint(0, 1000, 8).reshape(2, 4), columns=list('ABCD'))
    return_dict[name] = [df.columns.tolist()] + df.values.tolist()

def worker2(name, t, seed, return_dict):
    '''worker function'''
    print(str(name) + 'is here.')
    np.random.seed(seed)
    time.sleep(t)
    df = pd.DataFrame(np.random.randint(0, 1000, 12).reshape(3, 4), columns=list('ABCD'))
    return_dict[name] = [df.columns.tolist()] + df.values.tolist()

if __name__ == '__main__':
    t = 1
    while True:
        start_time = time.time()
        manager = multiprocessing.Manager()
        parallel_dict = manager.dict()
        seed = np.random.randint(0, 1000, 1)  # send seed to worker to return a diff df
        jobs = []
        p1 = multiprocessing.Process(target=worker1, args=('name1', t, seed, parallel_dict))
        p2 = multiprocessing.Process(target=worker2, args=('name2', t, seed + 1, parallel_dict))
        jobs.append(p1)
        jobs.append(p2)
        p1.start()
        p2.start()
        for proc in jobs:
            proc.join()
        parallel_end_time = time.time() - start_time
        # print(parallel_dict)
        df1 = pd.DataFrame(parallel_dict['name1'][1:], columns=parallel_dict['name1'][0])
        df2 = pd.DataFrame(parallel_dict['name2'][1:], columns=parallel_dict['name2'][0])
        merged_df = pd.concat([df1, df2], axis=0)
        print(merged_df)
Answer 1 (map on multiple functions)
You're technically right.
With map, map_async and the other variations, you should use a single function.
But this constraint can be bypassed by implementing a small dispatcher and passing the function to execute as part of the parameters:
def dispatcher(args):
    return args[0](*args[1:])
So, a minimal working example:
import multiprocessing as mp

def function_1(v):
    print("hi %s" % v)
    return 1

def function_2(v):
    print("by %s" % v)
    return 2

def dispatcher(args):
    return args[0](*args[1:])

with mp.Pool(2) as p:
    tasks = [
        (function_1, "A"),
        (function_2, "B")
    ]
    r = p.map_async(dispatcher, tasks)
    r.wait()
    results = r.get()
Answer 2 (Scheduling)
I would remove the while loop from the script and schedule a cron job (on GNU/Linux; Windows has an equivalent) so that the OS is responsible for its execution.
On Linux you can run crontab -e and add the following line to make the script run every 5 minutes.
*/5 * * * * python /path/to/script.py
Answer 3 (Shared Dictionary)
Yes, but no.
To my knowledge, using the Manager for data such as collections is the best way.
For arrays or primitive types (int, float, etc.) there are Value and Array, which are faster (see the sketch further below).
As in the documentation:
A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.
A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Barrier, Queue, Value and Array.
Server process managers are more flexible than using shared memory objects because they can be made to support arbitrary object types. Also, a single manager can be shared by processes on different computers over a network. They are, however, slower than using shared memory.
But you only have to return a DataFrame, so the shared dictionary isn't needed.
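A minimal sketch of Value and Array, for illustration (not from the original answer):
import multiprocessing as mp

def work(counter, values):
    with counter.get_lock():      # Value carries an associated lock
        counter.value += 1
    for i in range(len(values)):  # Array supports indexing and len()
        values[i] *= 2

if __name__ == '__main__':
    counter = mp.Value('i', 0)            # shared signed int
    values = mp.Array('d', [1.0, 2.0])    # shared array of doubles
    p = mp.Process(target=work, args=(counter, values))
    p.start()
    p.join()
    print(counter.value, list(values))    # 1 [2.0, 4.0]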
Cleaned Code
Using all the previous ideas, the code can be rewritten as:
map version
import numpy as np
import pandas as pd
from time import sleep
import multiprocessing as mp

def worker1(t, seed):
    print('worker1 is here.')
    sleep(t)
    np.random.seed(seed)
    return pd.DataFrame(np.random.randint(0, 1000, 8).reshape(2, 4), columns=list('ABCD'))

def worker2(t, seed):
    print('worker2 is here.')
    sleep(t)
    np.random.seed(seed)
    return pd.DataFrame(np.random.randint(0, 1000, 12).reshape(3, 4), columns=list('ABCD'))

def dispatcher(args):
    return args[0](*args[1:])

def task_generator(sleep_time=1):
    seed = np.random.randint(0, 1000, 1)
    yield worker1, sleep_time, seed
    yield worker2, sleep_time, seed + 1

with mp.Pool(2) as p:
    results = p.map(dispatcher, task_generator())

merged = pd.concat(results, axis=0)
print(merged)
If the concatenation of the DataFrames is the bottleneck, an approach with imap might be a better fit.
imap version
with mp.Pool(2) as p:
    merged = pd.DataFrame()
    for result in p.imap_unordered(dispatcher, task_generator()):
        merged = pd.concat([merged, result], axis=0)

print(merged)
The main difference is that in the map case, the program first waits for all the process tasks to end and then concatenates all the DataFrames.
In the imap_unordered case, as soon as a task has ended, its DataFrame is concatenated to the current results.

How to process multiple Spark SQL queries in parallel [duplicate]

I am trying to run 2 functions doing completely independent transformations on a single RDD in parallel using PySpark. What are some methods to do the same?
from multiprocessing import Process

from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

def doXTransforms(sampleRDD):
    # ... X transforms ...
    pass

def doYTransforms(sampleRDD):
    # ... Y transforms ...
    pass

if __name__ == "__main__":
    sc = SparkContext(appName="parallelTransforms")
    sqlContext = SQLContext(sc)
    hive_context = HiveContext(sc)
    rows_rdd = hive_context.sql("select * from tables.X_table")

    p1 = Process(target=doXTransforms, args=(rows_rdd,))
    p1.start()
    p2 = Process(target=doYTransforms, args=(rows_rdd,))
    p2.start()
    p1.join()
    p2.join()
    sc.stop()
This does not work and I now understand this will not work.
But is there an alternative way to make this work? Specifically, are there any PySpark-specific solutions?
Just use threads and make sure that the cluster has enough resources to process both tasks at the same time.
from threading import Thread
import time

def process(rdd, f):
    def delay(x):
        time.sleep(1)
        return f(x)
    return rdd.map(delay).sum()

rdd = sc.parallelize(range(100), int(sc.defaultParallelism / 2))

t1 = Thread(target=process, args=(rdd, lambda x: x * 2))
t2 = Thread(target=process, args=(rdd, lambda x: x + 1))
t1.start(); t2.start()
Arguably this is not that often useful in practice, but otherwise it should work just fine.
You can further use in-application scheduling with the FAIR scheduler and scheduler pools for better control over the execution strategy.
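A minimal sketch of the scheduler-pool idea (assuming spark.scheduler.mode is set to FAIR in the Spark configuration, a recent PySpark where local properties apply per Python thread, and illustrative pool names):
from threading import Thread

def process_in_pool(pool_name, rdd, f):
    # Jobs submitted from this thread are assigned to the given pool.
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    return rdd.map(f).sum()

t1 = Thread(target=process_in_pool, args=("pool_x", rdd, lambda x: x * 2))
t2 = Thread(target=process_in_pool, args=("pool_y", rdd, lambda x: x + 1))
t1.start(); t2.start()
t1.join(); t2.join()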
You can also try pyspark-asyncactions (disclaimer: the author of this answer is also the author of the package), which provides a set of wrappers around the Spark API and concurrent.futures:
import asyncactions
import concurrent.futures
f1 = rdd.filter(lambda x: x % 3 == 0).countAsync()
f2 = rdd.filter(lambda x: x % 11 == 0).countAsync()
[x.result() for x in concurrent.futures.as_completed([f1, f2])]
