I have a very simple Airflow instance setup in GCP Composer. It has the bucket and everything. I want to set up each dag to run it its own environment with PythonVirtualenvOperator.
The structure in it is as follows:
dags ->
------> code_snippets/
----------> print_name.py - has function called print_my_name() which prints a string into the terminal
------> test_dag.py
test_dag.py:
import datetime
from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow import DAG
def main_func():
import pandas as pd
import datetime
from code_snippets.print_name import print_my_name
print_my_name()
df = pd.DataFrame(data={
'date': [str(datetime.datetime.now().date())]
})
print(df)
default_args = {
'owner': 'test_dag',
'start_date': datetime.datetime(2020, 7, 3, 5, 1, 00),
'concurrency': 1,
'retries': 0
}
dag = DAG('test_dag', description='Test DAGS with environment',
schedule_interval='0 5 * * *',
default_args=default_args, catchup=False)
test_the_dag = PythonVirtualenvOperator(
task_id="test_dag",
python_callable=main_func,
python_version='3.8',
requirements=["DateTime==4.3", "numpy==1.20.2", "pandas==1.2.4", "python-dateutil==2.8.1", "pytz==2021.1",
"six==1.15.0", "zope.interface==5.4.0"],
system_site_packages=False,
dag=dag,
)
test_the_dag
Everything works until I start importing custom modules - having an init.py does not help, it still gives out the same error, which in my case is:
from code_snippets.print_name import print_my_name\nModuleNotFoundError: No module named \'code_snippets\'
I also have a local instance of Airflow and i experience the same issue. I have tried moving things around or adding the path to the folders to PATH, adding inits in the directories or even changing the import statements, but the error persists as long as I am importing custom modules.
system_site_packages=False or True also has no effect
Is there a fix for that or a way to go around it so I can utilize the custom code I have separated outside of the DAGs?
Airflow Version : 1.10.14+composer
Python version for Airflow is set to: 3
The implementation for airflow.operators.python.PythonVirtualenvOperator is such that the python_callable is expected to not reference external names.
Any non-standard library packages used in the callable must be declared as external dependencies in the requirements.txt file.
If you need to use code_snippets, publish it as a package either to pypi or a VCS repository and add it in the list of packages in the requirements kwargs for the PythonVirtualenvOperator.
Related
I am trying to copy file from SFTP to Azure Blob store using SFTPToWasbOperator. I am getting error. It seems like I'm doing something wrong, but I can't figure out what it is. Could someone please check the following code and see if there is anything wrong with it?
Airflow Logs
**
[2022-07-10, 13:08:48 UTC] {sftp_to_wasb.py:188} INFO - Uploading /SPi_ESG_Live/07-04-2022/DataPoint_2022_07_04.csv to wasb://testcotainer as https://test.blob.core.windows.net/testcotainer/DataPoint_2022_07_04.csv
[2022-07-10, 13:08:48 UTC] {_universal.py:473} INFO - Request URL: 'https://.blob.core.windows.net/***/test/https%3A//test.blob.core.windows.net/testcontainer/DataPoint_2022_07_04.csv'
Error msg
"azure.core.exceptions.ServiceRequestError: URL has an invalid label."
Airflow DAG
import os
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.providers.microsoft.azure.operators.wasb_delete_blob import WasbDeleteBlobOperator
from airflow.providers.microsoft.azure.transfers.sftp_to_wasb import SFTPToWasbOperator
from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.providers.sftp.operators.sftp import SFTPOperator
AZURE_CONTAINER_NAME = "testcotainer"
BLOB_PREFIX = "https://test.blob.core.windows.net/testcotainer/"
SFTP_SRC_PATH = "/SPi_test_Live/07-04-2022/"
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_sftp_to_wasb"
with DAG(
DAG_ID,
schedule_interval=None,
catchup=False,
start_date=datetime(2021, 1, 1), # Override to match your needs
) as dag:
# [START how_to_sftp_to_wasb]
transfer_files_to_azure = SFTPToWasbOperator(
task_id="transfer_files_from_sftp_to_wasb",
# SFTP args
sftp_source_path=SFTP_SRC_PATH,
# AZURE args
container_name=AZURE_CONTAINER_NAME,
blob_prefix=BLOB_PREFIX,
)
# [END how_to_sftp_to_wasb]
The problem is with BLOB_PREFIX, its not a url its the prefix after the azure url
see this source example : https://airflow.apache.org/docs/apache-airflow-providers-microsoft-azure/stable/_modules/tests/system/providers/microsoft/azure/example_sftp_to_wasb.html
I have successfully created a Great_Expectation result and I would like to output the results of the expectation to an html file.
There are few links highlighting how show the results in human readable from using what is called 'Data Docs' https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started/set_up_data_docs.html#tutorials-getting-started-set-up-data-docs
But to be quite honest, the documentation is extremely hard to follow.
My expectation simply verifies the number of passengers from my dataset fall within 1 and 6. I would like help outputting the results to a folder using 'Data Docs' or however it is possible to output the data to a folder:
import great_expectations as ge
import great_expectations.dataset.sparkdf_dataset
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType
from great_expectations.data_asset import DataAsset
from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, FilesystemStoreBackendDefaults
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.resource_identifiers import ValidationResultIdentifier
from datetime import datetime
from great_expectations.data_context import BaseDataContext
df_taxi = spark.read.csv('abfss://root#adlspretbiukadlsdev.dfs.core.windows.net/RAW/LANDING/yellow_trip_data_sample_2019-01.csv', inferSchema=True, header=True)
taxi_rides = SparkDFDataset(df_taxi)
taxi_rides.expect_column_value_lengths_to_be_between(column='passenger_count', min_value=1, max_value=6)
taxi_rides.save_expectation_suite()
The code is run from Apache Spark.
If someone could just point me in the right direction, I will able to figure it out.
You can visualize Data Docs on Databricks - you just need to use correct renderer combined with DefaultJinjaPageView that renders it into HTML, and its result could be shown with displayHTML. We need to import necessary classes/functions:
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
from great_expectations.render.renderer import *
from great_expectations.render.view import DefaultJinjaPageView
To see result of profiling, we need to use ProfilingResultsPageRenderer:
expectation_suite, validation_result = BasicDatasetProfiler.profile(SparkDFDataset(df))
document_model = ProfilingResultsPageRenderer().render(validation_result)
displayHTML(DefaultJinjaPageView().render(document_model))
it will show something like this:
We can visualize results of validation with ValidationResultsPageRenderer:
gdf = SparkDFDataset(df)
gdf.expect_column_values_to_be_of_type("county", "StringType")
gdf.expect_column_values_to_be_between("cases", 0, 1000)
validation_result = gdf.validate()
document_model = ValidationResultsPageRenderer().render(validation_result)
displayHTML(DefaultJinjaPageView().render(document_model))
it will show something like this:
Or we can render expectation suite itself with ExpectationSuitePageRenderer:
gdf = SparkDFDataset(df)
gdf.expect_column_values_to_be_of_type("county", "StringType")
document_model = ExpectationSuitePageRenderer().render(gdf.get_expectation_suite())
displayHTML(DefaultJinjaPageView().render(document_model))
it will show something like this:
If you're not using Databricks, then you can render the data into HTML and store it as files stored somewhere
I have been in touch with the developers of Great_Expectations in connection with this question. They have informed me that Data Docs is not currently available with Azure Synapse or Databricks.
As a newbie to airflow, I'm looking at the example_branch_operator:
"""Example DAG demonstrating the usage of the BranchPythonOperator."""
import random
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.dates import days_ago
args = {
'owner': 'airflow',
}
with DAG(
dag_id='example_branch_operator',
default_args=args,
start_date=days_ago(2),
schedule_interval="#daily",
tags=['example', 'example2'],
) as dag:
run_this_first = DummyOperator(
task_id='run_this_first',
)
options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']
branching = BranchPythonOperator(
task_id='branching',
python_callable=lambda: random.choice(options),
)
run_this_first >> branching
join = DummyOperator(
task_id='join',
trigger_rule='none_failed_or_skipped',
)
for option in options:
t = DummyOperator(
task_id=option,
)
dummy_follow = DummyOperator(
task_id='follow_' + option,
)
branching >> t >> dummy_follow >> join
Looking at the join operator, I'd expect for it to collect all the branches, but instead it's just another task that happens at the end of each branch. If multiple branches are executed, join will run that many times.
(yes, yes, it should be idempotent, but that's not the point of the question)
Is this a bug, a poorly named task, or am I missing something?
The tree view displays a complete branch from each DAG root node. Multiple branches that converge on a single task will be shown multiple times but they will only be executed once. Check out the Graph View of this DAG:
For the data migrations ,I have created a DAG which ultimately inserts data to a migration table after all the tasks with required logic.
DAG has a sql which is something similar to the below which initially extracts the data and feeds to other tasks:
sql=" select col_names from tables where created_on >=date1 and created_on <=date2"
For each DAG run Iam manually changing date1 and date2 in above sql and initiating data migrations(as data chunk is heavy,as of now date range length is 1 week).
I just want to automate this date changing process ex.if i give date intervals ,after the first DAG is run,the second run is initiated and so on until the end date interval.
I have researched so far,one solution I got was dynamic DAGS in airflow.But the problem is it creates multiple DAG file instances and its also very difficult to debug and maintain .
Is there a way to repeat a DAG with changing date parameter so that I no longer have to keep changing dates manually.
I had the exact same issue! Backfilling in Airflow doesn't seem to make any sense if you don't have the DAG interval start and end as input parameters. If you want to do data migration, you'll probably need to store your last migration time in a file to read. However, this goes against some of the properties an Airflow DAG/task should have (idempotence).
My solution was to add two tasks to my DAG before the start of my "main" tasks. I have two operators (you can possibly make it one) which gets the start and end times of the current DAG run. The "start" and "end" names are sort of misleading because the "start" is actually the start of the previous run and "end" the start of the current run.
I can't reveal the custom operator I wrote but you can do this in a single Python operator:
from croniter import croniter
def get_interval_start_end(**kwargs):
dag = kwargs['dag']
ti = kwargs['ti']
dag_execution = ti.execution_date # current DAG scheduled start
dag_interval = dag._scheduled_interval # note the preceding underscore
cron_iter = croniter(dag_interval, dag_execution)
dag_prev_execution = cron_iter.get_prev()
return (dag_execution, dag_prev_execution)
# dag
task = PythonOperator(task_id='blabla',
python_callable=get_interval_start_end,
provide_context=True)
# other tasks
Then pull these values from xcom in your next task.
There is also a way to get the "last_run" of the DAG using dag.get_last_dagrun() instead. However, it doesn't return the previous scheduled run but the previous actual run. If you have already run your DAG for a "future" time, your "last dag run" will be after your current execution! Then again, I might not have tested with the right settings, so you can try that out first.
I had similar req and here is how I accessed the dates which later can be used in SQLs for backfill.
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from datetime import datetime, timedelta
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 8, 1),
'end_date': datetime(2020, 8, 3),
'retries': 0,
}
dag = DAG('helloWorld_v1', default_args=default_args, catchup=True, schedule_interval='0 1 * * *')
def print_dag_run_date(**kwargs):
print(kwargs)
execution_date = kwargs['ds']
prev_execution_date = kwargs['prev_ds']
return (execution_date, prev_execution_date)
# t1, t2 are examples of tasks created using operators
bash = BashOperator(
task_id='bash',
depends_on_past=True,
bash_command='echo "Hello World from Task 1"',
dag=dag)
py = PythonOperator(
task_id='py',
depends_on_past=True,
python_callable=print_dag_run_date,
provide_context=True,
dag=dag)
py.set_upstream(bash)
I want to set up a mock database (as opposed to creating a test database if possible) to check if the data is being properly queried and than being converted into a Pandas dataframe. I have some experience with mock and unit testing and have set-up previous test successfully. However, I'm having difficulty in applying how to mock real-life objects like databases for testing.
Currently, I'm having trouble generating a result when my test is run. I believe that I'm not mocking the database object correctly, I'm missing a step involved or my thought process is incorrect. I put my tests and my code to be tested in the same script to simplify things.
I've thoroughly read thorough the Python unittest and mock documentation so I know what it does and how it works (For the most part).
I've read countless posts on mocking in Stack and outside of it as well. They were helpful in understanding general concepts and what can be done in those specific circumstances outlined, but I could not get it to work in my situation.
I've tried mocking various aspects of the function including the database connection, query and using the 'pd_read_sql(query, con)' function to no avail. I believe this is the closest I got.
My Most Recent Code for Testing
import pandas as pd
import pyodbc
import unittest
import pandas.util.testing as tm
from unittest import mock
# Function that I want to test
def p2ctt_data_frame():
conn = pyodbc.connect(
r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
r'DBQ=My\Path\To\Actual\Database\Access Database.accdb;'
)
query = 'select * from P2CTT_2016_Plus0HHs'
# I want to make sure this dataframe object is created as intended
df = pd.read_sql(query, conn)
return df
class TestMockDatabase(unittest.TestCase):
#mock.patch('directory1.script1.pyodbc.connect') # Mocking connection
def test_mock_database(self, mock_access_database):
# The dataframe I expect as the output after query is run on the 'mock database'
expected_result = pd.DataFrame({
'POSTAL_CODE':[
'A0A0A1'
],
'DA_ID':[
1001001
],
'GHHDS_DA':[
100
]
})
# This is the line that I believe is wrong. I want to create a return value that mocks an Access table
mock_access_database.connect().return_value = [('POSTAL_CODE', 'DA_ID', 'GHHDS_DA'), ('A0A0A1', 1001001, 100)]
result = p2ctt_data_frame() # Run original function on the mock database
tm.assert_frame_equal(result, expected_result)
if __name__ == "__main__":
unittest.main()
I expect that the expected dataframe and the result after running the test using the mock database object is one and the same. This is not the case.
Currently, if I print out the result when trying to mock the database I get:
Empty DataFrame
Columns: []
Index: []
Furthermore, I get the following error after the test is run:
AssertionError: DataFrame are different;
DataFrame shape mismatch
[left]: (0, 0)
[right]: (1, 3)
I would break it up into a few separate tests. A functional test that the desired result will be produced, a test to make sure you can access the database and get expected results, and the final unittest on how to implement it. I would write each test in that order completing the tests first before the actual function. If found that if I can't figure out how to do something I'll try it on a separate REPL or create a git branch to work on it then go back to the main branch. More information can be found here: https://obeythetestinggoat.com/book/praise.harry.html
Comments for each test and the reason behind it is in the code.
import pandas as pd
import pyodbc
def p2ctt_data_frame(query='SELECT * FROM P2CTT_2016_Plus0HHs;'): # set query as default
with pyodbc.connect(
r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
r'DBQ=My\Path\To\Actual\Database\Access Database.accdb;'
) as conn: # use with so the connection is closed once completed
df = pd.read_sql(query, conn)
return df
Separate test file:
import pandas as pd
import pyodbc
import unittest
from unittest import mock
class TestMockDatabase(unittest.TestCase):
def test_p2ctt_data_frame_functional_test(self): # Functional test on data I know will not change
actual_df = p2ctt_data_frame(query='SELECT * FROM P2CTT_2016_Plus0HHs WHERE DA_ID = 1001001;')
expected_df = pd.DataFrame({
'POSTAL_CODE':[
'A0A0A1'
],
'DA_ID':[
1001001
],
'GHHDS_DA':[
100
]
})
self.assertTrue(actual_df == expected_df)
def test_access_database_returns_values(self): # integration test with the database to make sure it works
with pyodbc.connect(
r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
r'DBQ=My\Path\To\Actual\Database\Access Database.accdb;'
) as conn:
with conn.cursor() as cursor:
cursor.execute("SELECT TOP 1 * FROM P2CTT_2016_Plus0HHs WHERE DA_ID = 1001001;")
result = cursor.fetchone()
self.assertTrue(len(result) == 3) # should be 3 columns by 1 row
# Look for accuracy in the database
info_from_db = []
for data in result: # add to the list all data in the database
info_from_db.append(data)
self.assertListEqual( # All the information matches in the database
['A0A0A1', 1001001, 100], info_from_db
)
#mock.patch('directory1.script1.pd') # testing pandas
#mock.patch('directory1.script1.pyodbc.connect') # Mocking connection so nothing sent to the outside
def test_pandas_read_sql_called(self, mock_access_database, mock_pd): # unittest for the implentation of the function
p2ctt_data_frame()
self.assert_True(mock_pd.called) # Make sure that pandas has been called
self.assertIn(
mock.call('select * from P2CTT_2016_Plus0HHs'), mock_pd.mock_calls
) # This is to make sure the proper value is sent to pandas. We don't need to unittest that pandas handles the
# information correctly.
*I was not able to test this so there might be some bugs I need to fix