Pass value from operator to dag - python-3.x

I have a BashOperator in my DAG which runs a Python script. At the end of its run, the Python script creates a JSON report. In case of failure, I want to send this JSON to Slack. I have an on_failure_callback function which can push a message to Slack, but I have no elegant way to pass the JSON value to the DAG. Currently, I save the JSON to a file and then read it from the file in the DAG and report it. I also tried storing it in an environment variable and reading it in the DAG. Is there a more direct way to pass this value to the DAG, preferably without saving it to a file or an environment variable?

Rather than using on_failure_callback, you should use another task to send your Slack message.
For that, simply use XComs to push/pull the report: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#
Then you can pull the XCom in the task that forwards it to Slack, as in the sketch below.
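A minimal sketch of that approach, assuming the report script prints the JSON as the last line of its stdout (BashOperator pushes that line to XCom when do_xcom_push=True); the task ids, script path, and the Slack call itself are illustrative, not taken from the question:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def notify_slack(**context):
    # Pull the JSON string pushed by the bash task; the actual Slack call
    # (e.g. via a webhook) is left out of this sketch.
    report = context["ti"].xcom_pull(task_ids="run_report")
    print("Report to send to Slack:", report)

with DAG("report_dag", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    run_report = BashOperator(
        task_id="run_report",
        bash_command="python /path/to/report_script.py",  # hypothetical path
        do_xcom_push=True,  # push the last line of stdout to XCom
    )
    send_report = PythonOperator(
        task_id="send_report",
        python_callable=notify_slack,
        trigger_rule="all_done",  # run even if run_report failed
    )
    run_report >> send_report

The trigger_rule="all_done" on the notification task is what lets it stand in for on_failure_callback: it runs whether the upstream task succeeded or failed, and it can inspect the pulled report to decide whether to post.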

Related

Azure Data Factory, how to pass parameters from trigger/pipeline into data source

I need help. I've created a pipeline for data processing, which imports a csv and copies data to a DB. I've also configured a Blob storage trigger, which triggers the pipeline with a dataflow when a specific file is uploaded into a container. At the moment, this trigger monitors one container, but I would like to make it more universal: monitor all containers in the desired Storage Account, so that whenever someone uploads files, the pipeline is triggered. For that I need to pass the container name to the pipeline to be used in the data source file path. For now I've created something like this:
In the pipeline, I've added this parameter, @pipeline().parameters.sourceFolder:
Next, in the trigger, I've set this:
Now what should I set here to pass this folder path?
You need to use dataset parameters for this.
Like the folderPath parameter in the pipeline, create another pipeline parameter for the file name, and assign @triggerBody().folderPath and @triggerBody().fileName to them when creating the trigger.
Pipeline parameters:
Make sure you select all containers in the storage event trigger while creating it.
Assigning trigger parameters to pipeline parameters:
Now, create two dataset parameters for the folder and file name like below.
Source dataset parameters:
Use these in the file path of the dataset dynamic content.
If you use a copy activity for this dataset, assign the pipeline parameter values (which we get from the trigger parameters) to the dataset parameters like below.
If you use dataflows for the dataset, you can assign these in the dataflow activity itself, like below, after setting the dataset as the source in the dataflow.
Thank you Rakesh
I need to process a few specific files from a package that will be sent to the container. Each time, the user/application will send the same set of files, so in the trigger I'm checking whether a new drive.xml file was sent to any container. This file defines the type of the data that was sent, so if it arrives, I know that new data files have been sent as well and they will be present in a lower folder.
E.g. if drive.xml was found in /container/data/somefolder/2022-01-22/drive.xml, then I know that /container/data/somefolder/2022-01-22/datafiles/ contains 3 files that I need to process.
Therefore, in the parameters I only need to pass the file path; the file names will always be the same.
The dataset configuration looks like this:
and the event trigger like this:

The datetime in an Airflow task instance keeps updating itself

I created a trigger_date_time from a datetime object like this:
trigger_date_time = str(datetime.now(tz=timezone.utc)).replace(' ', '_')
I use this variable in my Airflow DAG definition. I noticed that its value keeps updating as the task instance is being executed. I cannot understand how this works or how to solve it. I simply want trigger_date_time to hold the value from when I triggered the DAG execution, throughout the execution.
Airflow re-parses the DAG file every 30 seconds (by default), and that's why datetime.now() keeps assigning a new value to your trigger_date_time variable.
To get the execution date, you need to get the value from the DAG run itself.
You didn't share the full code, so it's hard to understand the context here, but if you are in the DAG itself you can use a Jinja template, for example {{ ts }}.
If you are inside a PythonOperator, you can use context['execution_date']; a short sketch of both options follows.
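A minimal sketch, assuming Airflow 2.x; the DAG and task names are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def use_trigger_time(**context):
    # 'ts' is the same value the {{ ts }} Jinja template renders: the run's
    # logical/execution date as an ISO-8601 string. It is fixed per DAG run,
    # unlike datetime.now() evaluated at parse time.
    trigger_date_time = context["ts"]
    print("Triggered at:", trigger_date_time)
    # The same value is also available as context["execution_date"]
    # (a pendulum datetime object).

with DAG("fixed_trigger_time", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="use_trigger_time", python_callable=use_trigger_time)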

How to log errors in an ADF dataflow with parallel sources

I have to do some data engineering by reading manifest.cdm.json files from a data lake, adding a pipeline run id column, and pushing to a SQL database.
I have one JSON list file which has the required parameters to read the CDM json files in the source of the dataflow.
Previous approach: I used ForEach and passed parameters to a dataflow with a single activity, then captured errors. But using a dataflow inside a ForEach costs too much.
Current approach: I manually created a dataflow with all the CDM files. But here I'm not able to capture errors. If any source gets an error, the whole dataflow activity fails, and if I select "skip error" in the dataflow activity, I don't get any error at all.
So what should the approach be to get errors from the current approach?
You can capture the error using a Set Variable activity in Azure Data Factory.
Use the expression below to capture the error message with the Set Variable activity:
@activity('Data Flow1').Error.message
Later you can store the error message in blob storage for future reference using a copy activity. In the example below, we save the error message in a .csv file using a DelimitedText dataset.

Size of the input / output parameters in the pipeline step in Azure

While running the pipeline creation Python script, I am facing the following error:
"AzureMLCompute job failed. JobConfigurationMaxSizeExceeded: The specified job configuration exceeds the max allowed size of 32768 characters. Please reduce the size of the job's command line arguments and environment settings"
I haven't seen that error before! My guess is that you're passing data as a string argument to a downstream pipeline step when you should be using PipelineData or OutputFileDatasetConfig.
I strongly suggest you read more about moving data between steps of an AML pipeline.
We hit this when we tried to pass quite lengthy content as an argument value to a pipeline. You can instead upload the file to blob storage, optionally create a dataset, and then pass the dataset name or file path to the AML pipeline as a parameter; the pipeline step then reads the content of the file from the blob. A sketch of the intermediate-data approach is below.
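A minimal sketch of the OutputFileDatasetConfig route, assuming the v1 azureml-sdk for Python; the script names and compute target are illustrative, not taken from the question:

from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Pass the large payload between steps as an intermediate dataset instead of a
# string argument, so it does not count against the 32768-character limit.
intermediate = OutputFileDatasetConfig(name="prepared_data")

prepare = PythonScriptStep(
    name="prepare",
    script_name="prepare.py",  # writes its output under the mounted output path
    arguments=["--output", intermediate],
    compute_target="cpu-cluster",  # hypothetical compute target
)

consume = PythonScriptStep(
    name="consume",
    script_name="consume.py",  # reads the files from the mounted input path
    arguments=["--input", intermediate.as_input(name="prepared_data")],
    compute_target="cpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[prepare, consume])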

Groovy script to call Python and batch file

I wish to send an email containing the build status of all child jobs.
Therefore, I have used batch and Python scripts to prepare the HTML file, which I will import in the Editable Email Notification plugin.
However, in the Pre-send Script tab, we can only write Groovy script.
So I would like to call my Python file, which contains my logic, from Groovy.
You would have to call your Python script as part of the job (possibly via a shell or batch step), save the HTML in the workspace, and then attach that file to the email. Alternatively, you could create a custom email template that does the same; a small Python sketch for generating the HTML follows.
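A minimal sketch of the Python side, writing the HTML summary into the Jenkins workspace; the job names and statuses here are placeholders, and in practice you would query the Jenkins REST API or have the parent job pass in the child build results:

import os

# Hypothetical child job results; replace with data from the Jenkins REST API
# or with values passed in by the parent job.
results = {"child-job-a": "SUCCESS", "child-job-b": "FAILURE"}

rows = "\n".join(
    f"<tr><td>{job}</td><td>{status}</td></tr>" for job, status in results.items()
)
html = f"<table><tr><th>Job</th><th>Status</th></tr>\n{rows}\n</table>"

# WORKSPACE is set by Jenkins for each build; the email template can then
# include build_status.html from the workspace.
out_path = os.path.join(os.environ.get("WORKSPACE", "."), "build_status.html")
with open(out_path, "w") as f:
    f.write(html)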