In a Hive session it is possible to list the available variables with the following:
0: jdbc:hive2://127.0.0.1:10000>set;
How can I list the hive variables from a databricks notebook?
The solution was to just run the SET SQL command:
%sql
SET
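If you want the same listing from a Python cell, SET can also be run through spark.sql, which returns the properties as a regular DataFrame; a minimal sketch, assuming the notebook's spark session:
# Run SET programmatically; the result is a DataFrame with key/value columns
props = spark.sql("SET")
props.show(truncate=False)

# Optionally narrow the listing to a prefix of interest (example prefix only)
props.filter("key LIKE 'spark.sql%'").show(truncate=False)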
I have created a Python notebook in Databricks. I have Python logic and need to execute a %sql command cell.
Say I wanted to execute that command cell (cmd2) based on a Python variable:
cmd1
EXECUTE_SQL = True
cmd2
if condition:
    %sql .....
As mentioned, you can use the following Python code (or its Scala equivalent) to get behavior similar to a %sql cell:
if condition:
    display(spark.sql("your-query"))
One advantage of this approach is that you can embed variables into the query text.
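For instance, a minimal sketch of embedding variables (table_name and threshold are hypothetical names, not from the question):
table_name = "my_database.events"   # hypothetical table
threshold = 100                     # hypothetical filter value

if condition:
    # Build the query text from Python variables before passing it to spark.sql
    display(spark.sql(f"SELECT * FROM {table_name} WHERE event_count > {threshold}"))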
Another alternative I used:
Extracted the SQL to a different notebook.
In my case I don't want any results back,
and I am also cleaning up the Delta tables by deleting their contents.
/clean-deltatable-notebook (a SQL notebook):
delete from <database>.<table>
Called dbutils.notebook.run() from the Python notebook:
cmd2
if condition:
    result = dbutils.notebook.run('<path-of-clean-deltatable-notebook>', timeout_seconds=30)
    print(result)
See the Databricks documentation on dbutils.notebook.run().
I came across the following SQL command in a Databricks notebook and I am confused about what this ${da.paths.working_dir} object is. Is it a Python object or something else?
SELECT * FROM parquet.`${da.paths.working_dir}/weather`
I know it contains the path of a working directory, but how can I access/print it?
I tried to demystify it but failed as illustrated in the following figure.
NOTE: My notebook is a SQL notebook.
Finally, I figured it out. This is a variable defined in the Spark configuration, and in Databricks SQL we can access it with a SELECT statement, as shown below:
SELECT '${da.paths.working_dir}';
EDIT: This variable is actually a Spark configuration property, which can be set as follows:
# spark.conf.set(key, value)
spark.conf.set("da.paths.working_dir", "/path/to/files")
To access this property in Python:
spark.conf.get("da.paths.working_dir")
To access this property in Databricks SQL:
SELECT '${da.paths.working_dir}';
I want to cache a table (dataframe) in one notebook and use it in another notebook. I am using the same Databricks cluster for both notebooks.
Please suggest if this is possible, and if yes, then how?
You can share a dataframe between notebooks.
In the first notebook, register it as a global temp view:
df_shared.createOrReplaceGlobalTempView("df_shared")
In the second notebook, read it from the global temp database:
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
df_shared = spark.table(global_temp_db + ".df_shared")
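The same global view can also be read from a SQL cell in the second notebook; a small sketch, assuming the default global temp database name global_temp:
%sql
-- Global temporary views are registered under the global_temp database
SELECT * FROM global_temp.df_shared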
Yes, it is possible with the following setup.
You can register your dataframe as a temp view. The lifetime of a temp view created by createOrReplaceTempView() is tied to the Spark session in which the dataframe was created.
Set spark.databricks.session.share to true.
This setting shares the Spark session across notebooks, so temporary views created in one notebook are visible in the others.
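A minimal sketch of that flow, assuming spark.databricks.session.share true has been added to the cluster's Spark configuration (it is a cluster-level setting, not one you flip at runtime) and df is a hypothetical dataframe:
# Notebook 1: register a regular temp view
df.createOrReplaceTempView("df_shared_temp")

# Notebook 2: because the Spark session is shared, the same temp view is visible here
df_shared = spark.table("df_shared_temp")
display(df_shared)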
Env: Linux (spark-submit xxx.py)
Target database: Hive
We used to use Beeline to execute HQL, but now we are trying to run the HQL through PySpark and ran into an issue when trying to set table properties while creating a table.
SQL
CREATE EXTERNAL TABLE example.a(
column_a string)
TBLPROPERTIES (
'discover.partitions'='true',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"column_a","type":"string","nullable":true,"metadata":{}}]}',
'spark.sql.sources.schema.partCol.0'='received_utc_date_partition');
Error message
Hive - ERROR - Cannot persist
example.a into Hive metastore as table property
keys may not start with 'spark.sql.': [spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts,
spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0];
In lines 130-147 of the Spark source code, it seems that Spark rejects all table properties that start with "spark.sql".
I'm not sure if I did something wrong or if there's another way to set table properties for a Hive table.
Any kind of suggestion is appreciated.
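To illustrate the constraint in the error message (a sketch under assumptions, not a confirmed fix): Spark writes the spark.sql.sources.* properties itself, so one option is to declare the partition column with PARTITIONED BY and keep only the non-reserved property; the database, table, and column names follow the question, while the LOCATION path is hypothetical:
from pyspark.sql import SparkSession

# Hive support is needed to create external tables in the Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Sketch only: the partition column is declared via PARTITIONED BY instead of
# the reserved spark.sql.sources.* table properties, which Spark manages itself
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS example.a (
        column_a string
    )
    PARTITIONED BY (received_utc_date_partition string)
    LOCATION '/path/to/external/location'
    TBLPROPERTIES ('discover.partitions' = 'true')
""")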
I'm trying to execute my queries in Hive using the Spark engine, but I need to run them in a specific queue. I couldn't find any queue-name property other than spark-submit --queue. So far I've used these settings:
set hive.execution.engine=spark;
set spark.job.queue.name=MyQueue;
set spark.executor.instances=50;
or
set spark.queue.name=MyQueue;
but they won't start the jobs.
I found another option:
set spark.yarn.queue=MyQueue
but it doesn't work either.