Is sharing cache/persisted dataframes between databricks notebook possible? - apache-spark

I want to cache a table(dataframe) in one notebook and use it in another notebook , I am using same databricks cluster for both the notebooks.
Please suggest if this is possible , If yes then how ?

You can share dataframe between notebooks.
On the first notebook please register it as temp view:
df_shared.createOrReplaceGlobalTempView("df_shared")
On the second notebook please read it from global temp database:
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
df_shared= table(global_temp_db + ".df_shared")

Yes it is possible based on the following setups .
You can register your dataframe as temp table . The lifetime of temp view created by createOrReplaceTempView() is tied to Spark Session in which the dataframe has been created.
spark.databricks.session.share to true
this setup global temporary views to share temporary views across notebooks.
ref : link

Related

How to list hive variables available in a notebook session?

In a hive session it is possible to list the variables available with the following:
0: jdbc:hive2://127.0.0.1:10000>set;
How can I list the hive variables from a databricks notebook?
The solution was to just run the SET SQL command:
%sql
SET

Object embedded in Databricks SQL command

I came across the following SQL command in Databricks notebook and I am confused about what is this ${da.paths.working_dir} object in following SQL command. Is it a python object or something else?
SELECT * FROM parquet.${da.paths.working_dir}/weather
I know it contains the path of a working directory but how can I access/print it.
I tried to demystify it but failed as illustrated in the following figure.
NOTE: My notebook is SQL notebook
Finally, I figured it out. This is a high-level variable in Databricks SQL and we can access it using the SELECT keyword in Databricks SQL as shown below:
SELECT '${da.paths.working_dir}';
EDIT: This high variable is spark configuration which can be set as follows:
## spark.conf.set(key, value)
spark.conf.set(da.paths.working_dir, "/path/to/files")
To access this property in python:
spark.conf.get(da.paths.working_dir)
To access this property in Databricks SQL:
SELECT {da.paths.working_dir}

How to get workspace name inside a python notebook in databricks

I am trying get the workspace name inside a python notebook. Is there any way we can do this?
Ex:
My workspace name is databricks-test.
I want to capture this in variable in python notebook
To get the workspace name (not Org ID which the other answer gives you) you can do it one of two main ways
spark.conf.get("spark.databricks.workspaceUrl")
which will give you the absolutely URL and you can then split on the first.
i.e
spark.conf.get("spark.databricks.workspaceUrl").split('.')[0]
You could also get it these two ways:
dbutils.notebook.entry_point.getDbutils().notebook().getContext() \
.browserHostName().toString()
or
import json
json.loads(dbutils.notebook.entry_point.getDbutils().notebook() \
.getContext().toJson())['tags']['browserHostName']
Top tip if you're ever wondering what Spark Confs exist you can get most of them in a list like this:
sc.getConf().getAll()
By using of below command , we can get the working workspace ID . But getting the workspace name ,I think difficult to find it .
spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId")
spark.conf.get("spark.databricks.clusterUsageTags.clusterName")
This command will return the cluster name :)

databricks: writing spark dataframe directly to excel

Are there any method to write spark dataframe directly to xls/xlsx format ????
Most of the example in the web showing there is example for panda dataframes.
but I would like to use spark dataframe for working with my data. Any idea ?
I'm assuming that because you have the "databricks" tag you are wanting to create an .xlsx file within databricks file store and that you are running code within databricks notebooks. I'm also going to assume that your notebooks are running python.
There is no direct way to save an excel document from a spark dataframe. You can, however, convert a spark dataframe to a pandas dataframe then export from there. We'll need to start by installing the xlsxwriter package. You can do this for your notebook environment using a databricks utilites command:
dbutils.library.installPyPI('xlsxwriter')
dbutils.library.restartPython()
I was having a few permission issues saving an excel file directly to dbfs. A quick workaround was to save to the cluster's default directory then sudo move the file into dbfs. Here's some example code:
# Creating dummy spark dataframe
spark_df = spark.sql('SELECT * FROM default.test_delta LIMIT 100')
# Converting spark dataframe to pandas dataframe
pandas_df = spark_df.toPandas()
# Exporting pandas dataframe to xlsx file
pandas_df.to_excel('excel_test.xlsx', engine='xlsxwriter')
Then in a new command, specifying the command to run in shell with %sh:
%sh
sudo mv excel_test.xlsx /dbfs/mnt/data/
It is possible to generate an Excel file from pySpark.
df_spark.write.format("com.crealytics.spark.excel")\
.option("header", "true")\
.mode("overwrite")\
.save(path)
You need to install the com.crealytics:spark-excel_2.12:0.13.5 (or a more recent version of course) library though, for example in Azure Databricks by specifying it as a new Maven library in the libraries list of your cluster (one of the buttons on the left sidebar of the Databricks UI).
For more info see https://github.com/crealytics/spark-excel.
I believe you can do it like this.
sourcePropertySet.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("D:\\resultset.csv")
I'm not sure you can write directly to Excel, but Excel can definitely consume a CSV. This is almost certainly the easiest way of doing this kind of thing and the cleanest as well. In Excel you have all kinds of formatting, which can throw errors when used in some systems (think of merged cells).
You can not save it directly but you can have it as its stored in temp location and move it to your directory. My code piece is:
import xlsxwriter import pandas as pd1
workbook = xlsxwriter.Workbook('data_checks_output.xlsx')
worksheet = workbook.add_worksheet('top_rows')
Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')
output = dataset.limit(10)
output = output.toPandas()
output.to_excel(writer, sheet_name='top_rows',startrow=row_number)
writer.save()
Below code does the work of moving files.
%sh
sudo mv data_checks_output.xlsx /dbfs/mnt/fpmount/
Comment if anyone has new update or better way to do it.
Yet Pyspark does not offer any method to save excel file. But you can save csv file, then it can be read in Excel.
From pyspark.sql module version 2.3 you have write.csv:
df.write.csv('path/filename'))
Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=save

How to export data from a dataframe to a file databricks

I'm doing right now Introduction to Spark course at EdX.
Is there a possibility to save dataframes from Databricks on my computer.
I'm asking this question, because this course provides Databricks notebooks which probably won't work after the course.
In the notebook data is imported using command:
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets',
'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')
I found this solution but it doesn't work:
df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')
Databricks runs a cloud VM and does not have any idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df) and there's an option to download the results.
You can also save it to the file store and donwload via its handle, e.g.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")
You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...
Download in this case (for Databricks west europe instance)
https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv
I haven't tested it but I would assume the row limit of 1 million rows that you would have when donwloading it via the mentioned answer from #MrChristine does not apply here.
Try this.
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")
This will save the file into Unix Server.
if you give only /home/yphani/datacsv it looks for the path on HDFS.

Resources