Can we set task-wise parameters using the Databricks Jobs API "run-now"?

I have a job with multiple tasks, e.g. Task1 -> Task2, and I am trying to trigger it using the "run now" API. The task details are below:
Task1 - executes a notebook with some input parameters
Task2 - executes a notebook with some input parameters
So how can I provide per-task parameters to the jobs API when calling "run now" for Task1 and Task2?
I have a parameter "lib" which needs to have the value 'pandas' for one task and 'spark' for the other.
I know that we can give unique parameter names like Task1_lib and Task2_lib and read them that way.
Current way:
json = {"job_id": 3234234, "notebook_params": {"Task1_lib": "a", "Task2_lib": "b"}}
Is there a way to send task-wise parameters?

It's not supported right now - parameters are defined at the job level. You can ask your Databricks representative (if you have one) to pass this request on to the product team that works on Databricks Workflows.
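If you go with the prefixed-name workaround the question already mentions, the run-now call itself stays a plain job-level request. Here is a rough sketch; the host, token, job_id and the Task1_lib/Task2_lib names are placeholders, not an official per-task mechanism:
# Sketch of the prefix workaround: run-now only accepts job-level notebook_params,
# so the task name is baked into each parameter key.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "<personal-access-token>"

payload = {
    "job_id": 3234234,
    "notebook_params": {
        "Task1_lib": "pandas",  # Task1's notebook reads this via dbutils.widgets.get("Task1_lib")
        "Task2_lib": "spark",   # Task2's notebook reads this via dbutils.widgets.get("Task2_lib")
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the triggered run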


Am I using switch/case wrong here to control?

I am trying to check the logs and, depending on the last log, run a different step in the transformation. Am I supposed to use some other steps, or am I making another mistake here?
For example, if the query returns 1 I want the Execute SQL script step to run, for 2 I want Execute SQL script 2 to run, and for 3 I want the transformation to abort. But it keeps running all the steps even if only one value is returned from the CONTROL step.
The transformation looks like this
And the switch/case step looks like this
It looks like it's correctly configured, but keep in mind that in a transformation all steps are initiated at the beginning of the transformation, waiting to receive streaming data from the previous step. So the Abort and Execute script steps are started as soon as the transformation starts; if they don't need data from a previous step to run, they will run right at the beginning.
If you want the scripts to be executed depending on the result of the CONTROL output, you'll need to use a job, which runs its steps (actions) sequentially:
A transformation runs the CONTROL step, and afterwards you put a "Copy rows to result" step to make the data produced by the CONTROL step available to the job.
After the transformation, you use a "Simple evaluation" action in the job to determine which script to run (or whether to abort). Jobs also have an "Execute SQL Script" action, so you can put it afterwards.
I'm assuming your CONTROL step only produces one row; if the output is more than one row, the "Simple evaluation" action won't do the job, and you'll have to design one or more transformations to be executed for each row of the previous transformation, running what you need.

Airflow DAGs - reporting of the runtime for tracking purposes

I am trying to find a way to capture the DAG stats - i.e. run time (start time, end time), status, dag id, task id, etc. - for various DAGs and their tasks in a separate table.
I found the default logs which go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into the S3 table.
Building a separate process to load those logs into S3 would duplicate the data, and there would also be too much data to scan and filter, as tons of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other possibilities are there to get this done efficiently, or can any other Airflow built-in feature be used?
You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available at Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start_date and end_date, you can simply query in the format below:
select start_date, end_date from dag_run where dag_id = 'your_dag_name'
The above query returns the start_date and end_date of the DAG for all of its runs. If you wish to get details for a particular run, you can add another filter condition, like below:
select start_date, end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' -- this is a sample time
You can get this execution_date from the tree or graph views. You can also get other stats like id, dag_id, execution_date, state, run_id and conf.
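If you would rather collect these stats programmatically (for example, to land them in S3 on a schedule), here is a minimal sketch that reads the same dag_run table through Airflow's own models; 'your_dag_name' is a placeholder, and the script assumes it runs somewhere the airflow package and its metadata database are reachable:
# Sketch: pull DAG run stats from the Airflow metadata database via the ORM.
from airflow.models import DagRun
from airflow.settings import Session

session = Session()
runs = session.query(DagRun).filter(DagRun.dag_id == "your_dag_name").all()
for run in runs:
    # each row carries the stats mentioned above: state, run_id, start/end times
    print(run.dag_id, run.run_id, run.state, run.start_date, run.end_date)
session.close()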
You can also refer to https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html for more details.
You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if they suit your need.
However, pulling this information from the database is not efficient in any case, but it is still an option if you are not looking for real-time data collection.

How to access the output folder from a PythonScriptStep?

I'm new to azure-ml and have been tasked with making some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:
from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref'
                         )
data_ref = DataReference(datastore=ds,
                         data_reference_name='data_ref',
                         path_on_datastore='/data'
                         )

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=['--main_path', main_ref,
               '--data_ref_folder', data_ref
               ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
I would like:
my data_prep_step to run,
have it store some data on the path of my data_ref, and
I would then like to access this stored data afterwards outside of the pipeline
But, I can't find a useful function in the documentation. Any guidance would be much appreciated.
two big ideas here -- let's start with the main one.
main ask
With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?
short answer
Consider using OutputFileDatasetConfig (docs example), instead of DataReference.
To your example above, I would just change your last two definitions.
from azureml.data import OutputFileDatasetConfig

data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=[
        '--main_path', main_ref,
        '--data_ref_folder', data_ref
    ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
some notes:
be sure to check out how DataPaths work; they can be tricky at first glance.
set overwrite=False in the .as_upload() method if you don't want future runs to overwrite the first run's data.
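Once the run has finished, one way to get at that uploaded data from outside the pipeline (a sketch, assuming the (ds, '/data') destination above) is to read it back as a file dataset:
# Sketch: read the step's uploaded output back outside the pipeline.
# Assumes the OutputFileDatasetConfig above uploaded to (ds, '/data').
from azureml.core import Dataset

output_ds = Dataset.File.from_files(path=(ds, '/data'))
local_paths = output_ds.download(target_path='./data_ref_output', overwrite=True)
print(local_paths)  # local copies of whatever data_prep.py wrote to the output folder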
more context
PipelineData used to be the de facto object for passing data ephemerally between pipeline steps. The idea was to make it easy to:
stitch steps together
get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)
The downside was that you had no control over where the data was saved. If you wanted the data to be more than just a baton passed between steps, you could add a DataTransferStep to land the PipelineData wherever you please after the PythonScriptStep finishes.
This downside is what motivated OutputFileDatasetConfig.
auxiliary ask
how might I programmatically test the functionality of my Azure ML pipeline?
there are not enough people talking about data pipeline testing, IMHO.
There are three areas of data pipeline testing:
unit testing (does the code in the step work?)
integration testing (does the code work when submitted to the Azure ML service?)
data expectation testing (does the data coming out of the step meet my expectations?)
For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions.
For #2, why not just see if the whole pipeline completes? I think you get more information that way. That's how we run our CI.
#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me you have two options for including expectation tests in your Azure ML pipeline:
within the PythonScriptStep itself, i.e.
run whatever code you have
test the outputs with GE before writing them out; or,
for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.
Our team does #1, but either strategy should work; a rough sketch of the in-step approach is below. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
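For illustration, here is what option #1 can look like inside the step script. This is a hedged sketch, not our actual checks; the file paths, column names and thresholds are hypothetical:
# Sketch of expectation tests inside a PythonScriptStep (option #1 above).
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("outputs/raw.parquet")  # hypothetical: whatever the step just produced
df_ge = ge.from_pandas(df)
df_ge.expect_column_values_to_not_be_null("customer_id")
df_ge.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

results = df_ge.validate()
if not results.success:
    raise ValueError(f"Expectation tests failed: {results}")

# only write the step's output once the expectations pass
df.to_parquet("outputs/validated.parquet")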

Azure DevOps Releases skip tasks

I'm currently working on implementing CI/CD pipelines for my company in Azure DevOps 2020 (on premise). There is one requirement I just can't seem to solve conveniently: skipping certain tasks depending on user input in a release pipeline.
What I want:
User creates new release manually and decides if a task group should be executed.
Agent Tasks:
1. Powershell
2. Task Group (conditional)
3. Task Group
4. Powershell
What I tried:
Splitting the tasks into multiple jobs, with the task group depending on a manual intervention task.
Does not work: if the manual intervention is rejected, the whole execution stops as failed.
Splitting the tasks into multiple stages, doing almost the same as above, with the same outcome.
Splitting the tasks into multiple stages and triggering every stage manually.
Not very usable, because you have to execute what you want in the correct order and only after the previous stages have succeeded.
A variable set at release creation (true/false).
I will use that if nothing better comes up, but it is prone to typos and not very usable for the colleagues who will work with this. Unfortunately, Azure DevOps does not seem to support dropdown or checkbox variables for releases (it works with parameters in builds, though).
Two stages, one with tasks 1, 2, 3, 4 and one with tasks 1, 3, 4.
Not very desirable for me because of the duplication.
Any help would be highly appreciated!
It depends on what the criteria are for the pipelines to run. One recommendation would be two pipelines calling the same template, with each pipeline embedding a true/false value to pass as a parameter to the template.
The template will have all the tasks defined in it; however, the conditional one will have a condition like:
condition: and(succeeded(), eq('${{ parameters.runExtraStep}}', true))
This condition would be set at the task level.
Any specific triggers can be defined in the corresponding pipeline.
Here is the documentation on Azure YAML Templates to get you started.
Unfortunately, it's impossible to add a custom condition for a Task Group, but this feature is on the roadmap. Check the following user voice item, which you can vote for:
https://developercommunity.visualstudio.com/idea/365689/task-group-custom-conditions-at-group-and-task-lev.html
The workaround is to clone the release definition (right-click a release definition > Clone), then remove some tasks or task groups and save it; after that you can create a release from whichever release definition fits the scenario.
Finally I decided to stick with Releases and split my tasks into 3 agent jobs: job 1 with the first PowerShell task, job 2 with the conditional task group that executes only if a variable is true, and job 3 with the remaining tasks.
As both cece-dong and dreadedfrost stated, I could have achieved a selectable runtime parameter for the condition with YAML pipelines. Unfortunately, one of the task groups needs a specific artifact from a YAML pipeline. Most of the time it would be the "latest", which can easily be achieved with a download artifacts task, but sometimes a previous artifact gets chosen. I have found no easy way to achieve this as conveniently as in releases, where you get a dropdown with a list of artifacts by default.
I found this blog post, for anyone interested in how you can handle different build artifacts in YAML pipelines.
Thanks for helping me out!

How to re-try an ADF pipeline execution until conditions are met

An ADF pipeline needs to be executed on a daily basis, let's say at 03:00 AM.
But prior to execution we also need to check whether the data sources are available.
Data is provided by an external agent; it periodically loads the corresponding data into each source table and lets us know when this process is completed using a flag table: if data source 1 is ready, it sets its flag to 1.
I can't find a way to implement this logic with ADF.
We would need something that, for instance, at 03:00 would trigger an 'element' that checks the flags; if the flags are not up, don't launch the pipeline. After, let's say, 10 minutes, check the flags again, and keep doing this at most X times OR until the flags are up.
If the flags are up, launch the pipeline execution and stop trying to launch the pipeline any further.
How would you do it?
The logic per se is not complicated in any way, but I wouldn't know where to implement it. Should I develop an Azure Function that launches the pipeline, or is there a way to achieve it with an out-of-the-box ADF activity?
There is an Until iteration activity where you can check your condition.
Example:
Your Azure Function (AF) checks the flag and returns 0 or 1.
Build an ADF pipeline with an Until activity where you check the output of the AF (if it's 1, do something). Inside the Until activity you can have your processing step. For example, you have a flag variable that is 0 before the Until activity; inside the Until activity you check whether it's 1: if it is, run your processing step, and if it's not, add a Wait activity of 10 minutes or so.
So in ADF you have the ability to iterate until a condition is satisfied.
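The flag-checking function itself can be very small. A rough sketch of an HTTP-triggered Azure Function follows; the connection string, table and column names are placeholders, not a known schema:
# Sketch of an HTTP-triggered Azure Function that returns the readiness flag.
# Connection string, table and column names are placeholders.
import os
import pyodbc
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    conn = pyodbc.connect(os.environ["FLAG_DB_CONNECTION"])
    row = conn.execute(
        "SELECT flag FROM dbo.source_flags WHERE source_name = ?",
        req.params.get("source", "source1"),
    ).fetchone()
    conn.close()
    # the Until activity's expression then compares this value to 1
    return func.HttpResponse(str(row[0] if row else 0))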
Hope that this will help you :)
