Databricks python notebook to execute %sql commandlet based on condition - databricks

I have created a Python notebook in Databricks. It contains Python logic, and I need to execute a %sql command conditionally.
Say I want to execute the %sql command in cmd2 based on a Python variable set in cmd1:
cmd1
EXECUTE_SQL = True
cmd2
if condition:
    %sql .....

As mentioned, you can use the following Python code (or the Scala equivalent) to get behavior similar to a %sql cell:
if condition:
    display(spark.sql("your-query"))
One advantage of this approach is that you can embed variables into the query text.
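A minimal sketch of that idea: build the query text from a Python variable and run it only when a condition holds. The names build_query, run_if, and my_db.events are hypothetical, and the runner argument is an assumption that lets the sketch run outside Databricks; in a notebook it would be something like lambda q: display(spark.sql(q)).

```python
def build_query(table, min_id):
    # int() rejects anything that is not a number, so min_id cannot inject SQL text
    return f"SELECT * FROM {table} WHERE id >= {int(min_id)}"


def run_if(condition, query, runner):
    """Call runner(query) only when condition is true; return its result or None."""
    if condition:
        return runner(query)
    return None


EXECUTE_SQL = True
query = build_query("my_db.events", 10)  # hypothetical table name

# In Databricks: run_if(EXECUTE_SQL, query, lambda q: display(spark.sql(q)))
executed = []
run_if(EXECUTE_SQL, query, executed.append)
print(executed)
```

Passing the runner as an argument keeps the condition logic separate from Spark, so it can be checked without a cluster.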

Another alternative, which I used, is to extract the SQL into a separate notebook.
In my case I don't need any results back; I am only cleaning up Delta tables by deleting their contents.
/clean-deltatable-notebook (a SQL notebook)
delete from <database>.<table>
I then called dbutils.notebook.run() from the Python notebook:
cmd2
if condition:
    result = dbutils.notebook.run('<path-of-clean-deltatable-notebook>', timeout_seconds=30)
    print(result)
Link on dbutils.notebook.run() from Databricks

Related

Object embedded in Databricks SQL command

I came across the following SQL command in a Databricks notebook, and I am confused about what the ${da.paths.working_dir} object in it is. Is it a Python object or something else?
SELECT * FROM parquet.`${da.paths.working_dir}/weather`
I know it contains the path of a working directory, but how can I access or print it?
I tried to demystify it but failed as illustrated in the following figure.
NOTE: My notebook is a SQL notebook.
Finally, I figured it out. This is a high-level variable in Databricks SQL and we can access it using the SELECT keyword in Databricks SQL as shown below:
SELECT '${da.paths.working_dir}';
EDIT: This high-level variable is a Spark configuration property, which can be set as follows:
## spark.conf.set(key, value)
spark.conf.set("da.paths.working_dir", "/path/to/files")
To access this property in Python:
spark.conf.get("da.paths.working_dir")
To access this property in Databricks SQL:
SELECT '${da.paths.working_dir}'

Execute multiple notebooks in parallel in pyspark databricks

The question is simple:
master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark?
The image below explains what I am trying to do. It errors for some reason; am I missing something here?
Just for others, in case they are after how it worked:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/"+path, timeout_seconds= 60, arguments={"input-data": path}),notebooks)
Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name.
You need to either change the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...),
or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...).
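A minimal sketch of the path fix combined with the thread pool: each notebook name is joined to a base folder before the run call. The helper run_all and the stand-in fake_run are hypothetical, added so the sketch runs outside Databricks; in a notebook you would pass dbutils.notebook.run as the first argument instead.

```python
from multiprocessing.pool import ThreadPool


def run_all(run_notebook, base_path, names, timeout_seconds=60, workers=5):
    """Run each notebook under base_path concurrently; results keep the order of names."""
    def run_one(name):
        # Build the full notebook path, e.g. "/Test/Threading/dim_1"
        path = base_path.rstrip("/") + "/" + name
        return run_notebook(path, timeout_seconds, {"input-data": name})

    with ThreadPool(workers) as pool:
        return pool.map(run_one, names)


def fake_run(path, timeout_seconds, arguments):
    # Stand-in for dbutils.notebook.run (an assumption for local testing)
    return f"ran {path}"


results = run_all(fake_run, "/Test/Threading", ["dim_1", "dim_2"])
print(results)
```

Threads are enough here because each call mostly waits on the child notebook rather than doing work on the driver.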
Databricks now has Workflows (multi-task jobs). Your master_dim task can trigger other tasks to run in parallel after it finishes, passing task values as parameters to dim_1, dim_2, etc.

double percent spark sql in jupyter notebook

I'm using a Jupyter Notebook on a Spark EMR cluster and want to learn more about a certain command, but I don't know which part of the technology stack to search under. Is it Spark? Python? Jupyter special syntax? PySpark?
When I try to google it, I get only a couple of results, and none of them actually include the content I quoted. It's as if the search ignores the %%.
What does "%%spark_sql" do, what does it originate from, and what are arguments you can pass to it like -s and -n?
An example might look like
%%spark_sql -s true
select
*
from df
These are called magic commands/functions. Try running %pinfo %%spark_sql or %pinfo2 %%spark_sql in a Jupyter cell and see if it gives you detailed information about %%spark_sql.

How can I access python variable in Spark SQL?

I have a Python variable created under %python in my notebook in Azure Databricks. How can I access the same variable to make comparisons under %sql? Below is an example:
%python
RunID_Goal = sqlContext.sql("""
SELECT CONCAT(SUBSTRING(RunID,1,6),SUBSTRING(RunID,1,6),'01_') AS RunID_Goal
FROM RunID_Pace""").first()[0]
%sql
SELECT Type , KPIDate, Value
FROM table
WHERE
RunID = RunID_Goal -- this is the variable created under %python that I want to compare against here
When I run this it throws an error:
Error in SQL statement: AnalysisException: cannot resolve 'RunID_Goal' given input columns:
I am new to Azure Databricks and Spark SQL; any sort of help would be appreciated.
One workaround could be to use widgets to pass parameters between cells. For example, on the Python side it could look like this:
# generate test data
import pyspark.sql.functions as F
spark.range(100).withColumn("rnd", F.rand()).write.mode("append").saveAsTable("abc")
# set widgets
import random
vl = random.randint(0, 100)
dbutils.widgets.text("my_val", str(vl))
and then you can refer to the value from the widget inside the SQL code:
%sql
select * from abc where id = getArgument('my_val')
Another way is to pass the variable via the Spark configuration. You can set the variable's value like this (please note that the variable name should have a prefix; in this case it's c.):
spark.conf.set("c.var", "some-value")
and then from SQL refer to variable as ${var-name}:
%sql
select * from table where column = '${c.var}'
One advantage of this is that you can also use the variable for table names, etc. The disadvantage is that you need to handle escaping yourself, such as putting string values into single quotes.
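To see why the quoting matters, here is a toy model of the substitution as plain text replacement. It is an illustration only, not Spark's implementation; the substitute helper and the c.table key are hypothetical.

```python
import re


def substitute(sql, conf):
    """Replace each ${key} in sql with conf[key], as plain text (toy model)."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: conf[m.group(1)], sql)


conf = {"c.var": "some-value", "c.table": "my_table"}  # hypothetical values

# With quotes, the value lands inside a SQL string literal
quoted = substitute("select * from ${c.table} where column = '${c.var}'", conf)
# Without quotes, the value is pasted in as raw SQL text
unquoted = substitute("select * from ${c.table} where column = ${c.var}", conf)

print(quoted)
print(unquoted)
```

Because the value is pasted in before the statement is parsed, the same mechanism works for table names, and an unquoted string value ends up being read as SQL rather than as a literal.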
You cannot access this variable. It is explained in the documentation:
When you invoke a language magic command, the command is dispatched to the REPL in the execution context for the notebook. Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of another language. REPLs can share state only through external resources such as files in DBFS or objects in object storage.
Here is another workaround.
# Optional code to use databricks widgets to assign python variables
dbutils.widgets.text('my_str_col_name','my_str_col_name')
dbutils.widgets.text('my_str_col_value','my_str_col_value')
my_str_col_name = dbutils.widgets.get('my_str_col_name')
my_str_col_value = dbutils.widgets.get('my_str_col_value')
# Query with string formatting
query = """
select *
from my_table
where {0} < '{1}'
"""
# Modify query with the values of Python variable
query = query.format(my_str_col_name,my_str_col_value)
# Execute the query
display(spark.sql(query))
A quick complement to the answer.
You can use widgets to pass parameters to another cell that uses the %sql magic, as was mentioned:
dbutils.widgets.text("table_name", "db.mytable")
And in the cell where you use this variable, you can use the $ shortcut (getArgument isn't supported):
%sql
select * from $table_name

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
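The point in the quoted documentation can be illustrated with a toy class, which is a stand-in and not PySpark: when a column name matches a method on the class, attribute access returns the method, while indexing still returns the column. ToyFrame and its column values are hypothetical.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame; not PySpark."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def count(self):
        # A regular method, analogous to DataFrame.count()
        return len(self._columns)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, name):
        # Indexing always looks up the column, never a method
        return self._columns[name]


df = ToyFrame({"age": "column<age>", "count": "column<count>"})

print(df.age)              # attribute access works for a non-clashing name
print(df["count"])         # indexing finds the column named "count"
print(callable(df.count))  # attribute access finds the method instead
```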
What I did was upgrade Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering form now works, although I still can't explain why 1.3.0 has problems with it.
