I came across the following SQL command in a Databricks notebook, and I am confused about what the ${da.paths.working_dir} object in it is. Is it a Python object or something else?
SELECT * FROM parquet.${da.paths.working_dir}/weather
I know it contains the path of a working directory, but how can I access/print it?
I tried to demystify it myself but failed.
NOTE: My notebook is a SQL notebook.
Finally, I figured it out. This is a higher-level variable in Databricks SQL, and we can access it using the SELECT keyword as shown below:
SELECT '${da.paths.working_dir}';
EDIT: This variable is a Spark configuration property, which can be set as follows:
# spark.conf.set(key, value)
spark.conf.set("da.paths.working_dir", "/path/to/files")
To access this property in Python:
spark.conf.get("da.paths.working_dir")
To access this property in Databricks SQL:
SELECT '${da.paths.working_dir}';
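Putting it together, a minimal sketch of the round trip (the key and path below are just illustrative placeholders):

# Set the value as a Spark configuration property from a Python cell
spark.conf.set("da.paths.working_dir", "/path/to/files")

# Read it back in Python
print(spark.conf.get("da.paths.working_dir"))

# The same ${...} substitution used in SQL cells should also work through spark.sql
display(spark.sql("SELECT '${da.paths.working_dir}' AS working_dir"))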
I have created a Python notebook in Databricks. I have Python logic and need to execute a %sql cell. Say I wanted to execute the command in cmd2 based on a Python variable:
cmd1
EXECUTE_SQL = True

cmd2
if condition:
    %sql .....
As mentioned, you can use the following Python code (or Scala) to get behavior similar to the %sql cell:
if condition:
    display(spark.sql("your-query"))
One advantage of this approach is that you can embed variables into the query text.
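For example, a minimal sketch that builds the query text from Python variables (the table and column names and the filter value here are hypothetical; EXECUTE_SQL is the flag from the question):

EXECUTE_SQL = True                  # the flag set in cmd1
table_name = "events"               # hypothetical table name
cutoff_date = "2021-09-16"          # hypothetical filter value

if EXECUTE_SQL:
    query = f"SELECT * FROM {table_name} WHERE event_date >= '{cutoff_date}'"
    display(spark.sql(query))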
Another alternative I used:
Extract the SQL into a different notebook. In my case I don't want any results back; I am just cleaning up the Delta tables and deleting their contents.
/clean-deltatable-notebook (a SQL notebook):
delete from <database>.<table>
Then call dbutils.notebook.run() from the Python notebook.
cmd2
if condition:
    result = dbutils.notebook.run('<path-of-clean-deltatable-notebook>', timeout_seconds=30)
    print(result)
Link to dbutils.notebook.run() in the Databricks documentation
In a Hive session it is possible to list the available variables with the following:
0: jdbc:hive2://127.0.0.1:10000>set;
How can I list the Hive variables from a Databricks notebook?
The solution was to just run the SET SQL command:
%sql
SET
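If you prefer running this from a Python cell, a small sketch that should do the same thing (the LIKE pattern below is just an illustrative filter):

variables_df = spark.sql("SET")                         # returns the variables as a key/value DataFrame
display(variables_df)
display(variables_df.filter("key LIKE 'spark.sql%'"))   # narrow down to specific keys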
In a Databricks notebook, I have written the query:
%sql
set four_date='2021-09-16';
select * from df2_many where four_date='{{four_date}}'
It's not working. Please advise how to apply this in a direct query instead of using spark.sql(""" """).
Note: I don't want to use $, since that asks for a value in a text box. Please confirm whether there is any other alternative solution.
How can I apply variable values directly in a query in Databricks?
If you are using a Databricks Notebook, you will need to use Widgets:
https://docs.databricks.com/notebooks/widgets.html
CREATE WIDGET TEXT four_date DEFAULT "2021-09-16"
SELECT * FROM df2_many WHERE four_date=getArgument("four_date")
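If you drive this from a Python cell instead, a rough equivalent (the widget name, default, and table come from the question; the rest is a sketch, not the answer's exact code):

dbutils.widgets.text("four_date", "2021-09-16")   # create the text widget
four_date = dbutils.widgets.get("four_date")      # read its current value
display(spark.sql(f"SELECT * FROM df2_many WHERE four_date = '{four_date}'"))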
I am writing the code below in an Azure Synapse notebook:
%%spark
val df = spark.read.sqlanalytics("emea_analytics.abc.cde_mydata")
df.write.mode("overwrite").saveAsTable("default.t1")
I am getting the below error:
Error: com.microsoft.spark.sqlanalytics.exception.SQLAnalyticsConnectorException: The specified table does not exist. Please provide a valid table.
at com.microsoft.spark.sqlanalytics.read.SQLAnalyticsReader.readSchema(SQLAnalyticsReader.scala:103)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:175)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:204)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at org.apache.spark.sql.SqlAnalyticsConnector$SQLAnalyticsFormatReader.sqlanalytics(SqlAnalyticsConnector.scala:42)
The error message clearly says - The specified table does not exist. Please provide a valid table.
Error : com.microsoft.spark.sqlanalytics.exception.SQLAnalyticsConnectorException: The specified table does not exist. Please provide a valid table.
Make sure the specified table exists before running the above code.
Reference: Azure Synapse Analytics - Load the NYC Taxi data into the Spark nyctaxi database.
I have a small log dataframe which has metadata regarding the ETL performed within a given notebook; the notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job', mode='append')
This works swimmingly. However, as my Data Factory is invoking a stored procedure, I need to work in a query like:
query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, it only seems to mention using SELECT statements with the query parameter:
Target Database is an Azure SQL Database.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Just to add: this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with DataFrames; you can only append or replace the entire table.
You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (How to install PYODBC in Databricks), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
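For illustration, a minimal pyodbc sketch of that UPDATE; the driver name, server, database, and credentials are placeholders, and the SET clause is my reading of what the question's query intends:

import datetime
import pyodbc

max_ingestion_date = datetime.date.today()  # stands in for the question's {date}

# Placeholder connection details for the Azure SQL Database target
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;"
    "DATABASE=<database>;"
    "UID=<user>;PWD=<password>"
)
cursor = conn.cursor()
cursor.execute(
    """
    UPDATE [internal].[Job]
    SET [MaxIngestionDate] = ?,
        [DataLakeMetadataRaw] = NULL,
        [DataLakeMetadataCurated] = NULL
    WHERE [IsRunning] = 1
      AND [FinishDateTime] IS NULL
    """,
    max_ingestion_date,
)
conn.commit()
conn.close()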