How to pass a variable value into a direct Databricks SQL query instead of spark.sql(""" """) - databricks

In a Databricks notebook, I have written this query:
%sql
set four_date='2021-09-16';
select * from df2_many where four_date='{{four_date}}'
It's not working. Please advise how to apply the variable in a direct query instead of spark.sql(""" """).
Note: I don't want to use $ because it prompts for the value in a text box. Is there any alternative way to use a variable's value in a direct query in Databricks?

If you are using a Databricks notebook, you will need to use widgets:
https://docs.databricks.com/notebooks/widgets.html
CREATE WIDGET TEXT four_date DEFAULT "2021-09-16"
SELECT * FROM df2_many WHERE four_date=getArgument("four_date")

Related

Object embedded in Databricks SQL command

I came across the following SQL command in a Databricks notebook, and I am confused about what the ${da.paths.working_dir} object in it is. Is it a Python object or something else?
SELECT * FROM parquet.`${da.paths.working_dir}/weather`
I know it contains the path of a working directory, but how can I access or print it?
I tried to demystify it but failed as illustrated in the following figure.
NOTE: My notebook is SQL notebook
Finally, I figured it out. This is a high-level variable in Databricks SQL and we can access it using the SELECT keyword in Databricks SQL as shown below:
SELECT '${da.paths.working_dir}';
EDIT: This variable is a Spark configuration property, which can be set as follows:
## spark.conf.set(key, value)
spark.conf.set("da.paths.working_dir", "/path/to/files")
To access this property in Python:
spark.conf.get("da.paths.working_dir")
To access this property in Databricks SQL:
SELECT '${da.paths.working_dir}'

Write Spark DataFrame to an existing Delta table by providing TABLE NAME instead of TABLE PATH

I am trying to write a Spark DataFrame into an existing Delta table.
I have multiple scenarios in which I save data into different tables, as shown below.
SCENARIO 01:
I have an existing Delta table, and I have to write a DataFrame into it with the mergeSchema option, since the schema may change with each load.
I am doing this with the command below, providing the Delta table path:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
I just want to know whether this can be done by providing the existing Delta TABLE NAME instead of the Delta PATH.
This has been resolved by updating the write command as below:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way?
SCENARIO 02:
I have to update the existing table if the record already exists and if not insert a new record.
For this I am currently doing as shown below.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
  .as("t")
  .merge(
    finalDataFrame.as("s"),
    "t.id = s.id AND t.name = s.name")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
I tried the script below:
destMasterTable.as("t")
  .merge(
    vehMasterDf.as("s"),
    "t.id = s.id")
  .whenNotMatched().insertAll()
  .execute()
but I get the error below (even with alias instead of as):
error: value as is not a member of String
destMasterTable.as("t")
Here too I am using the Delta table path as the destination. Is there any way to provide the Delta TABLE NAME instead of the TABLE PATH?
It would be better to provide the TABLE NAME instead of the TABLE PATH, so that changing the table path later does not affect the code.
I have not seen anywhere in databricks documentation providing table name along with mergeSchema and autoMerge.
Is it possible to do so?
To use existing data as a table instead of a path, you either need to have used saveAsTable from the beginning, or you can register the existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (the syntax could be slightly different depending on whether you're running on Databricks or OSS Spark, and on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
After that, you can use saveAsTable.
For the second question: it looks like destMasterTable is just a String. To refer to an existing table, you need to use the forName function of the DeltaTable object:
DeltaTable.forName(destMasterTable)
.as("t")
...

How can I access python variable in Spark SQL?

I have a Python variable created under %python in my Jupyter notebook file in Azure Databricks. How can I access the same variable to make comparisons under %sql? Below is an example:
%python
RunID_Goal = sqlContext.sql("""
SELECT CONCAT(SUBSTRING(RunID,1,6),SUBSTRING(RunID,1,6),'01_') AS RunID_Goal
FROM RunID_Pace""").first()[0]
%sql
SELECT Type , KPIDate, Value
FROM table
WHERE
RunID = RunID_Goal -- this is the variable created under %python that I want to compare against here
When I run this it throws an error:
Error in SQL statement: AnalysisException: cannot resolve 'RunID_Goal' given input columns:
I am new to Azure Databricks and Spark SQL; any sort of help would be appreciated.
One workaround could be to use widgets to pass parameters between cells. For example, on the Python side it could look like the following:
# generate test data
import pyspark.sql.functions as F
spark.range(100).withColumn("rnd", F.rand()).write.mode("append").saveAsTable("abc")
# set widgets
import random
vl = random.randint(0, 100)
dbutils.widgets.text("my_val", str(vl))
and then you can refer to the value from the widget inside the SQL code:
%sql
select * from abc where id = getArgument('my_val')
Another way is to pass the variable via the Spark configuration. You can set the variable value like this (please note that the variable should have a prefix, in this case c.):
spark.conf.set("c.var", "some-value")
and then refer to the variable from SQL as ${var-name}:
%sql
select * from table where column = '${c.var}'
One advantage of this is that you can also use this variable for table names, etc. The disadvantage is that you need to handle the quoting of the variable yourself, such as putting it into single quotes for string values.
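To see why the quoting matters, it can help to think of the ${...} reference as plain text replacement that happens before the SQL is parsed. Below is a minimal pure-Python sketch of that behaviour. The substitute helper, the conf dictionary, and the c.table key are hypothetical stand-ins for illustration only; this is not Spark's actual implementation.

```python
import re

# Hypothetical stand-in for the Spark configuration; in a notebook the
# values would come from calls like spark.conf.set("c.var", "some-value").
conf = {"c.var": "some-value", "c.table": "my_db.my_table"}

def substitute(sql: str, conf: dict) -> str:
    """Replace every ${key} in sql with conf[key], as plain text."""
    def lookup(match):
        return conf[match.group(1)]
    return re.sub(r"\$\{([^}]+)\}", lookup, sql)

# With quotes, the value ends up as a string literal.
quoted = substitute("select * from ${c.table} where column = '${c.var}'", conf)
# Without quotes, the value is pasted in bare and would be parsed as SQL.
unquoted = substitute("select * from ${c.table} where column = ${c.var}", conf)

print(quoted)
print(unquoted)
```

The same mechanism is what makes the variable usable for table names: the text is pasted in wherever the reference appears, so nothing is quoted unless the query itself adds the quotes.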
You cannot access this variable. It is explained in the documentation:
When you invoke a language magic command, the command is dispatched to the REPL in the execution context for the notebook. Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of another language. REPLs can share state only through external resources such as files in DBFS or objects in object storage.
Here is another workaround.
# Optional code to use databricks widgets to assign python variables
dbutils.widgets.text('my_str_col_name','my_str_col_name')
dbutils.widgets.text('my_str_col_value','my_str_col_value')
my_str_col_name = dbutils.widgets.get('my_str_col_name')
my_str_col_value = dbutils.widgets.get('my_str_col_value')
# Query with string formatting
query = """
select *
from my_table
where {0} < '{1}'
"""
# Modify query with the values of Python variable
query = query.format(my_str_col_name,my_str_col_value)
# Execute the query
display(spark.sql(query))
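Because str.format pastes the values into the SQL text verbatim, a value containing a single quote would break the query above. The following is a minimal sketch of one way to guard against that, assuming Spark SQL's default string-literal parsing in which backslash is the escape character. The helper names sql_string_literal, sql_identifier, and build_query are hypothetical, introduced here for illustration.

```python
import re

def sql_string_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal using backslash escapes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def sql_identifier(name: str) -> str:
    """Accept only plain column names; reject anything else."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe column name: {name!r}")
    return name

def build_query(col_name: str, col_value: str) -> str:
    """Build the same query as above, with the value quoted and escaped."""
    return "select * from my_table where {0} < {1}".format(
        sql_identifier(col_name), sql_string_literal(col_value)
    )

query = build_query("my_str_col_name", "O'Brien")
print(query)
```

In the notebook, the resulting string would be passed to display(spark.sql(query)) exactly as before.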
A quick complement to the answers above.
You can use widgets to pass parameters to another cell that uses the %sql magic, as was mentioned:
dbutils.widgets.text("table_name", "db.mytable")
In the cell where you use this variable, you can use the $ shortcut (getArgument isn't supported):
%sql
select * from $table_name

Update a table from PySpark using JDBC

I have a small log DataFrame that holds metadata about the ETL performed within a given notebook. The notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures, so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job',mode='append')
This works swimmingly. However, as my Data Factory is invoking a stored procedure,
I need to work in a query like
query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, it only seems to mention using select statements with the query parameter:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
The target database is an Azure SQL Database.
Just to add: this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with DataFrames. You can only append to or replace the entire table.
You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (see "How to install PYODBC in Databricks"), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
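As a rough illustration of the pyodbc route, here is a minimal sketch of a parameterised UPDATE issued through a DB-API 2.0 connection. It assumes a driver that uses ? placeholders, which pyodbc and Python's built-in sqlite3 both do. The update_running_job helper is hypothetical, only the MaxIngestionDate column from the question is set, and the demo runs against an in-memory SQLite table as a stand-in so that it works without an Azure SQL Database.

```python
import sqlite3

def update_running_job(conn, max_ingestion_date, table="[internal].[Job]"):
    """Run a parameterised UPDATE through a DB-API 2.0 connection
    that uses '?' placeholders, and return the number of rows changed."""
    # A table name cannot be a bound parameter, so it is formatted in;
    # only pass trusted, hard-coded names here.
    sql = (
        f"UPDATE {table} "
        "SET [MaxIngestionDate] = ? "
        "WHERE [IsRunning] = 1 AND [FinishDateTime] IS NULL"
    )
    cursor = conn.cursor()
    cursor.execute(sql, (max_ingestion_date,))
    conn.commit()
    return cursor.rowcount

# Stand-in demo: an in-memory SQLite table with the columns used above.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Job (IsRunning INTEGER, FinishDateTime TEXT, MaxIngestionDate TEXT)"
)
demo.execute(
    "INSERT INTO Job VALUES (1, NULL, NULL), (0, '2021-09-15', '2021-09-14')"
)
updated = update_running_job(demo, "2021-09-16", table="Job")
rows = demo.execute(
    "SELECT IsRunning, MaxIngestionDate FROM Job ORDER BY IsRunning"
).fetchall()
print(updated, rows)
```

With pyodbc, the connection would instead come from pyodbc.connect with the connection string for the target database, and the default table name would apply.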

Dynamic Forms + SparkSQL Variable Binding in Zep Notebooks

Is it possible to use SparkSQL in a Zeppelin Notebook to take the input of a dynamic form and bind it, the way that one can with the Angular interpreter?
I'm trying to use SparkSQL in a notebook to create a dashboard, but I want the user to be able to input a universal variable value at the beginning of the notebook and have it apply for multiple paragraphs.
Note-level dynamic forms are not supported in Zeppelin yet (there is a Jira issue, "Introduce Note level dynamic form").
I am using a workaround for now:
Dedicate a paragraph to the dynamic forms and variable binding, e.g.:
z.angularBind("BIND_VAR_A", z.input("VAR_A", 111))
z.angularBind("BIND_VAR_B", z.input("VAR_B", "Default"))
Recover the variables in any paragraph that shares the same context, e.g.:
val VAR_A = z.angular("BIND_VAR_A")
val data = "(select * from table where id = " + VAR_A + ") as data"
It also works with the SQL interpreter:
%sql
select * from data where id = VAR_A
