ForEach activity in Azure Synapse

A notebook needs to be run over an array of arrays in a ForEach activity. Is there a way to pass each subarray as a parameter to the notebook inside the ForEach activity?
The parameter data type for a notebook in Synapse doesn't accept an array type.

Currently, the Notebook activity does not support passing arrays to a notebook.
So pass the array as a String, then convert it back into an array in the notebook.
My Array parameter in pipeline:
Inside ForEach, pass it as String.
Output of Notebook in one iteration:
Use eval() to convert the string into a list in the notebook.
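Since eval() executes whatever expression the string contains, a more restrictive parser may be preferable when the parameter value is not fully trusted. The following is a minimal sketch of that conversion, assuming the subarray arrives either as JSON text or as a Python-style literal; the function name parse_array_param and the sample value are hypothetical.

```python
import ast
import json


def parse_array_param(raw: str) -> list:
    """Convert a string-typed notebook parameter back into a Python list."""
    try:
        # Handles valid JSON such as '["a", "b"]'
        value = json.loads(raw)
    except json.JSONDecodeError:
        # Falls back to Python literal syntax such as "['a', 'b']"
        value = ast.literal_eval(raw)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


# Hypothetical value of the string parameter for one ForEach iteration
sub_array = parse_array_param("[1, 2, 3]")
print(sub_array)
```

Unlike eval(), both json.loads and ast.literal_eval only accept data literals, so the parameter cannot run arbitrary code.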

Related

How to do pattern match part of file names using ADF

I have about 10 files in blob storage and need to pattern match part of each file name; if it matches, a variable should be set to true. I will be getting the child names and file name from "Get metadata stage".
How to achieve this using Azure data factory?
Is it possible to match the pattern using Databricks Notebook by getting metadata using "Get metadata stage"?
You can do it using a ForEach activity after Get Metadata activity in ADF.
Please follow the demonstration below:
My files in blob storage, with the word "pattern" as the pattern to match.
Use a Get Metadata activity to pass this list of files to ForEach. Create an array variable in the pipeline.
Give the ForEach Items dynamic content as below.
@activity('Get Metadata1').output.childItems
Inside ForEach, use an Append Variable activity to append the file name and true or false, based on the pattern, to the array created earlier (newfiles in my case).
childItems gives the name and type of each file, so take only the file name from each item in ForEach and check it against the pattern.
@concat(item().name,'-',if(contains(string(item().name),'pattern'),'true','false'))
Finally, a Set Variable activity holds the result (optional, only to show the output).
Output:
Is it possible to match the pattern using Databricks Notebook by getting metadata using "Get metadata stage"?
Yes, it is possible. If you don't need the file types, use an Append Variable activity inside ForEach to pass only the file names. If you need the file types, pass the childItems from Get Metadata directly to the notebook.
To pass just the file names:
Pass this newfiles variable to the Databricks notebook and apply the pattern-matching condition in the notebook.
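On the notebook side, the check could look roughly like the sketch below. It assumes the newfiles array reaches the notebook as a JSON string (notebook parameters are strings); the function name flag_matching_files, the variable names, and the sample file names are hypothetical.

```python
import json


def flag_matching_files(names, pattern="pattern"):
    """Map each file name to True if it contains the pattern, else False."""
    return {name: pattern in name for name in names}


# Hypothetical parameter value as it might arrive from the pipeline
raw_param = '["sales_pattern_01.csv", "inventory.csv"]'
file_names = json.loads(raw_param)

matches = flag_matching_files(file_names)
# Single flag that is True when at least one file name matched
any_match = any(matches.values())
print(matches, any_match)
```

For patterns more complex than a substring, the `pattern in name` test could be swapped for `re.search` from the standard library.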

How to pass a static value dynamically on the basis of a column value in Azure Databricks

How can I pass a static value dynamically on the basis of a column value in Azure Databricks? Currently I have 13 notebooks, each of them scheduled, and I want to schedule only one notebook instead. Each of the 13 notebooks defines its own column value (13 rows in total), so how can I pass that value dynamically?
You can create different jobs that refer to the single notebook, pass parameters to each job, and then retrieve these parameters using Databricks Widgets (widgets are available for all languages). In the notebook it will look as follows (for example, in Python):
# this is necessary only if you execute notebook interactively
dbutils.widgets.text("param1_name", "default_value", "description")
# get job parameter
param1 = dbutils.widgets.get("param1_name")
# ... use param1

How to set and get variable value in Azure Synapse or Data Factory pipeline

I have created a pipeline with a Copy activity, say activity1, in an Azure Synapse Analytics workspace. It loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2), using a REST API as the source and ADLSGen2 as the sink (destination).
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or are there any better suggestions/solutions to achieve this task?
Remarks: I have tried the Set Variable activity by first declaring a variable in the pipeline and then using that variable, say myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the URL from the above JSON. Please note that the JSON file name (MyJsonFile.json) is constant, but the zip file name in the URL is dynamic (based on a timestamp), so we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file, then extract the required URL from the lookup output value using a Set Variable activity.
Connect the Lookup activity to the sink dataset of the previous Copy Data activity.
Output of lookup activity:
I have used the substring function in the Set Variable activity to extract the URL from the lookup output.
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of set variable:
Set variable output value:
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then, for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url

How can I access python variable in Spark SQL?

I have a Python variable created under %python in my Jupyter notebook file in Azure Databricks. How can I access the same variable to make comparisons under %sql? Below is the example:
%python
RunID_Goal = sqlContext.sql("""SELECT CONCAT(SUBSTRING(RunID,1,6),SUBSTRING(RunID,1,6),'01_') AS RunID_Goal
FROM RunID_Pace""").first()[0]
%sql
SELECT Type , KPIDate, Value
FROM table
WHERE
RunID = RunID_Goal (This is the variable created under %python and want to compare over here)
When I run this it throws an error:
Error in SQL statement: AnalysisException: cannot resolve 'RunID_Goal' given input columns:
I am new to Azure Databricks and Spark SQL; any sort of help would be appreciated.
One workaround could be to use Widgets to pass parameters between cells. For example, on the Python side it could be as follows:
# generate test data
import pyspark.sql.functions as F
spark.range(100).withColumn("rnd", F.rand()).write.mode("append").saveAsTable("abc")
# set widgets
import random
vl = random.randint(0, 100)
dbutils.widgets.text("my_val", str(vl))
and then you can refer to the value from the widget inside the SQL code:
%sql
select * from abc where id = getArgument('my_val')
will give you:
Another way is to pass the variable via the Spark configuration. You can set the variable value like this (please note that the variable should have a prefix; in this case it's c.):
spark.conf.set("c.var", "some-value")
and then from SQL refer to the variable as ${var-name}:
%sql
select * from table where column = '${c.var}'
One advantage of this is that you can also use this variable for table names, etc. The disadvantage is that you need to handle the escaping of the variable yourself, such as putting it into single quotes for string values.
You cannot access this variable. It is explained in the documentation:
When you invoke a language magic command, the command is dispatched to the REPL in the execution context for the notebook. Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of another language. REPLs can share state only through external resources such as files in DBFS or objects in object storage.
Here is another workaround.
# Optional code to use databricks widgets to assign python variables
dbutils.widgets.text('my_str_col_name','my_str_col_name')
dbutils.widgets.text('my_str_col_value','my_str_col_value')
my_str_col_name = dbutils.widgets.get('my_str_col_name')
my_str_col_value = dbutils.widgets.get('my_str_col_value')
# Query with string formatting
query = """
select *
from my_table
where {0} < '{1}'
"""
# Modify query with the values of Python variable
query = query.format(my_str_col_name,my_str_col_value)
# Execute the query
display(spark.sql(query))
A quick complement to the answer.
You can use widgets to pass parameters to another cell that uses the %sql magic, as was mentioned:
dbutils.widgets.text("table_name", "db.mytable")
And in the cell where you use this variable, you can use the $ shortcut (getArgument isn't supported):
%sql
select * from $table_name

Can I return more than 1 value from a Databricks notebook in a single command?

I have a set of values to return as output from my Databricks notebook. Can anyone suggest a way to do that in an efficient and easy way?
You can only export a string, but you can organize its content as you want. You could export JSON as a string containing your different values. The output is then parsed to retrieve the different values.
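As a rough illustration of the JSON approach, the sketch below packs several values into one string and unpacks them again. The keys and values are hypothetical, and the dbutils calls appear only in comments because dbutils is available only inside Databricks; the assumption is that the callee ends with dbutils.notebook.exit and the caller receives the string from dbutils.notebook.run.

```python
import json

# --- Callee notebook: pack several values into a single string ---
results = {"row_count": 42, "status": "ok", "output_path": "/tmp/out"}  # hypothetical values
payload = json.dumps(results)
# In Databricks, the callee would end with:
#   dbutils.notebook.exit(payload)

# --- Caller notebook: parse the returned string ---
# In Databricks, the string would come from:
#   payload = dbutils.notebook.run("<path to your notebook>", 60)
parsed = json.loads(payload)
row_count = parsed["row_count"]
status = parsed["status"]
print(row_count, status)
```

JSON keeps numbers, strings, lists, and nested objects distinguishable after the round trip, which a hand-built delimited string would not.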
You could run said notebook within your current notebook using: %run <path to your notebook>. This will, however, import all variables etc from that notebook.
