How do I pass custom data into the DatabricksRunNowOperator in airflow - apache-spark

I am trying to create a DAG which uses the DatabricksRunNowOperator to run PySpark. However, I'm unable to figure out how to access the Airflow config inside the PySpark script.
parity_check_run = DatabricksRunNowOperator(
    task_id='my_task',
    databricks_conn_id='databricks_default',
    job_id='1837',
    spark_submit_params=["file.py", "pre-defined-param"],
    dag=dag,
)
I've tried accessing it via kwargs but that doesn't seem to be working.

You can use the notebook_params argument, as seen in the documentation.
e.g:
job_id = 42
notebook_params = {
    "dry-run": "true",
    "oldest-time-to-consider": "1457570074236"
}
notebook_run = DatabricksRunNowOperator(
    task_id='notebook_run',
    job_id=job_id,
    notebook_params=notebook_params,
)
Then you can access the value via dbutils.widgets.get("oldest-time-to-consider") in the PySpark code.
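For completeness, a minimal sketch of the notebook side, assuming the widget names match the notebook_params keys above (dbutils is provided by the Databricks runtime):
# Inside the Databricks notebook behind the job: read the values passed
# via notebook_params as widgets; the names must match the dictionary keys.
dry_run = dbutils.widgets.get("dry-run")
oldest_time = dbutils.widgets.get("oldest-time-to-consider")
print(dry_run, oldest_time)  # widget values always arrive as strings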

The DatabricksRunNowOperator supports different ways of providing parameters to an existing job, depending on how the job is defined (see the docs):
notebook_params - used with notebook tasks; a dictionary of widget name -> value. You can fetch the values with dbutils.widgets.get.
python_params - a list of parameters passed to a Python task; fetch them via sys.argv (see the sketch below).
jar_params - a list of parameters passed to a JAR task; read them as the usual arguments of a Java/Scala main method.
spark_submit_params - a list of parameters passed to spark-submit.
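For example, if the job wraps a Python task, a sketch with hypothetical parameter values could look like this (the job_id is the one from the question):
run_python_job = DatabricksRunNowOperator(
    task_id='run_python_job',
    databricks_conn_id='databricks_default',
    job_id='1837',
    python_params=["--run-date", "2021-01-01"],  # hypothetical values
    dag=dag,
)
and the Python task reads them like any other command-line arguments:
import sys

# sys.argv[0] is the script path; the python_params follow in order.
run_date = sys.argv[2]  # '2021-01-01'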

Related

How to retrieve nested output from XCom using taskflow syntax in Airflow

Well, I know this seems to be possible; I just don't know how. To begin with, I am using traditional operators (without the @task decorator), but I am interested in the XComArg output format these operators return, which can be used in downstream tasks. Below is a sample example:
task_1 = DummyOperator(
    task_id='task_1'
)  # returns {"data": {"foo": [{"cmd": "ls"}]}}
task_2 = BashOperator(
    task_id='task_2',
    bash_command=task_1.output['return_value']['data']['foo'][0]['cmd']  # does not give what I need and returns null
    # bash_command="{{ ti.xcom_pull(task_ids='task_1', key='return_value')['data']['foo'][0]['cmd'] }}"  # gives what I need
)
In this example, pure Jinja templating works for me, but the new syntax using XComArgs does not. I have tried setting render_template_as_native_obj=True in the DAG configuration, but it does not change anything. I want to use the .output format, which returns an XComArg object and does return the complete dict, but I have not been able to access nested keys like the ones above. I have also tried converting the string to JSON and various combinations, but nothing seems to work.
Unfortunately, retrieving nested values from XComArgs is a limitation of the TaskFlow API.
The TaskFlow API uses __getitem__ to override the XCom key to use. In your example, the key ends up being "cmd" rather than the value of what cmd represents in that nested object. You'll have to use the original ti.xcom_pull() method until that limitation is addressed.
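As a sketch of that workaround with the example above (assuming task_1 really does push the dict shown to XCom under return_value), the BashOperator can index into the nested structure inside the Jinja template itself:
from airflow.operators.bash import BashOperator

task_2 = BashOperator(
    task_id='task_2',
    # Pull the whole return_value and drill into the nested keys in Jinja.
    bash_command="{{ ti.xcom_pull(task_ids='task_1', key='return_value')['data']['foo'][0]['cmd'] }}",
)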

How to trigger google dataproc job using airflow and pass parameter as well

As part of a DAG, I am triggering a GCP PySpark Dataproc job using the code below:
job = DataProcPySparkOperator(
    dag=dag,
    gcp_conn_id=gcp_conn_id,
    region=region,
    main=pyspark_script_location_gcs,
    task_id='pyspark_job_1_submit',
    cluster_name=cluster_name,
    job_name="job_1"
)
How can I pass a variable as a parameter to the PySpark job so that it is accessible in the script?
You can use the arguments parameter of DataProcPySparkOperator:
arguments (list) – Arguments for the job. (templated)
job = DataProcPySparkOperator(
    gcp_conn_id=gcp_conn_id,
    region=region,
    main=pyspark_script_location_gcs,
    task_id='pyspark_job_1_submit',
    cluster_name=cluster_name,
    job_name="job_1",
    arguments=[
        "-arg1=arg1_value",  # or just "arg1_value" for non-named args
        "-arg2=arg2_value"
    ],
    dag=dag
)
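On the PySpark side these arrive as plain command-line arguments, so a sketch of the script (using the hypothetical -arg1/-arg2 names from above) can read them from sys.argv:
import sys

# Dataproc passes the `arguments` list straight through to the script,
# so the values appear in sys.argv after the script name.
print(sys.argv[1:])  # ['-arg1=arg1_value', '-arg2=arg2_value']

params = dict(a.lstrip('-').split('=', 1) for a in sys.argv[1:] if '=' in a)
print(params.get('arg1'))  # 'arg1_value'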

How to use Pipeline parameters on AzureML

I've built a pipeline in AzureML Designer and I'm trying to use pipeline parameters, but I'm not able to get the values of those parameters in a Python script module.
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline
This documentation contains a section called "Use pipeline parameters for arguments that change at inference time" but, unfortunately, it is empty.
I'm defining the parameters in the pipeline settings; see the screenshot at the bottom. Does anyone know how to use the parameters when using the Designer to build the pipeline?
You can correlate each pipeline stage's outputs with its inputs. For example, given the results of a model evaluation, we should be able to easily identify all the artifacts (model evaluation configuration, model specification, model parameters, training script, training data, etc.) pertaining to said evaluation.
Azure Machine Learning Pipelines Referenced Article:
https://github.com/Azure/MachineLearningNotebooks/blob/4a3f8e7025334ea8c0de0bada69b031ce54c24a0/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-use-databricks-as-compute-target.ipynb
We have an AMLS pipeline that we are trying to parameterize with a date string so that we can process the pipeline in the context of old historical dates.
Here's the code we're using to submit the pipeline:
from azureml.core.authentication import InteractiveLoginAuthentication
import requests

auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()
rest_endpoint = published_pipeline.endpoint
print("You can perform HTTP POST on URL {} to trigger this pipeline".format(rest_endpoint))

# specify the param when running the pipeline
response = requests.post(
    rest_endpoint,
    headers=aad_token,
    json={
        "ExperimentName": "dtpred-Dock2RTEG-EX-param",
        "RunSource": "SDK",
        "DataPathAssignments": {
            "input_datapath": {
                "DataStoreName": "erpgen2datastore",
                "RelativePath": "teams/PredictiveInsights/DatePrediction/2019/10/10"
            }
        },
        "ParameterAssignments": {"param_inputDate": "2019/10/10"}
    }
)
run_id = response.json()["Id"]
print('Submitted pipeline run: ', run_id)
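For the original Designer-style question, the usual SDK pattern (a sketch, assuming a Python script step and the param_inputDate parameter used above; the script and directory names are hypothetical) is to declare a PipelineParameter and forward it to the script as a command-line argument:
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

input_date_param = PipelineParameter(name="param_inputDate", default_value="2019/10/10")

step = PythonScriptStep(
    script_name="process.py",           # hypothetical script
    source_directory="./scripts",       # hypothetical folder
    arguments=["--input_date", input_date_param],
    compute_target=compute_target,      # assumed to be defined elsewhere
)
Inside process.py the value can then be read with argparse (parser.add_argument("--input_date")).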

How to pass session parameters with python to snowflake?

The code below is my attempt at passing a session parameter to Snowflake through Python. This is part of an existing codebase which runs in AWS Glue, and the only part of the following that doesn't work is the session_parameters.
I'm trying to understand how to add session parameters from within this code. Any help in understanding what is going on here is appreciated.
sf_credentials = json.loads(CACHE["SNOWFLAKE_CREDENTIALS"])
CACHE["sf_options"] = {
    "sfURL": "{}.snowflakecomputing.com".format(sf_credentials["account"]),
    "sfUser": sf_credentials["user"],
    "sfPassword": sf_credentials["password"],
    "sfRole": sf_credentials["role"],
    "sfDatabase": sf_credentials["database"],
    "sfSchema": sf_credentials["schema"],
    "sfWarehouse": sf_credentials["warehouse"],
    "session_parameters": {
        "QUERY_TAG": "Something",
    }
}
In AWS CloudWatch, I can see that the parameter was sent with the other options. In Snowflake, the parameter was never set.
I can add more detail where necessary, I just wasn't sure what details are needed.
It turns out that there is no need to specify that a given parameter is a session parameter when you are using the Spark Connector. So instead:
sf_credentials = json.loads(CACHE["SNOWFLAKE_CREDENTIALS"])
CACHE["sf_options"] = {
    "sfURL": "{}.snowflakecomputing.com".format(sf_credentials["account"]),
    "sfUser": sf_credentials["user"],
    "sfPassword": sf_credentials["password"],
    "sfRole": sf_credentials["role"],
    "sfDatabase": sf_credentials["database"],
    "sfSchema": sf_credentials["schema"],
    "sfWarehouse": sf_credentials["warehouse"],
    "QUERY_TAG": "Something",
}
Works perfectly.
I found this in the Snowflake documentation for using the Spark connector, in the section on setting session parameters.
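For reference, a sketch of how these options are typically consumed when reading through the Spark connector (spark is assumed to be an existing SparkSession, and the table name is hypothetical):
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = (
    spark.read.format(SNOWFLAKE_SOURCE_NAME)
    .options(**CACHE["sf_options"])           # includes QUERY_TAG as shown above
    .option("dbtable", "MY_SCHEMA.MY_TABLE")  # hypothetical table
    .load()
)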

Passing URL parameters to Apache Zeppelin paragraph

I need to pass request parameters to a specified Zeppelin paragraph and have them available to the Spark context. To be honest, this is proving a real nightmare. I can write some JS in the %angular interpreter to retrieve the query parameters, but since z.angularBind("myparam", "value") currently only works in the Spark interpreter (Scala), I can't use this.
My next thought was to retrieve the Paragraph and/or Notebook object - I'm thinking it must have a reference somewhere to the url that invoked it. However all you can easily get is the paragraphId/noteId from the InterpreterContext.
Anyone point me in the right direction?
You can pass parameters through dynamic forms. Create the parameters through a dynamic form for your notebook. To pass a value for the dynamic form, use the following request body:
{
    "params": {
        "formLabel1": "value1",
        "formLabel2": "value2"
    }
}
Doc: https://zeppelin.apache.org/docs/0.7.2/rest-api/rest-notebook.html#run-a-paragraph-synchronously
Note that you can pass params only when you want to run a single paragraph.
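As a sketch of that call against the run-a-paragraph-synchronously endpoint (the note id, paragraph id, host, and port are hypothetical):
import requests

note_id = "2F8KN6TKK"
paragraph_id = "20190101-000000_123456789"
url = "http://zeppelin-host:8080/api/notebook/run/{}/{}".format(note_id, paragraph_id)

# Run the paragraph synchronously, filling its dynamic forms with these values;
# the paragraph can then read them through its dynamic-form variables.
response = requests.post(url, json={
    "params": {
        "formLabel1": "value1",
        "formLabel2": "value2"
    }
})
print(response.json())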
