Databricks: convert data frame and export to xls / xlsx - excel

Is it possible in Databricks to convert a data frame, export it to xls/xlsx, and save it to blob storage?
Using Python.

Here's an example of writing a dataframe to Excel:
Using pyspark:
df.write \
    .format("com.crealytics.spark.excel") \
    .option("dataAddress", "'My Sheet'!B3:C35") \
    .option("useHeader", "true") \
    .option("dateFormat", "yy-mmm-d") \
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") \
    .mode("append") \
    .save("Worktime2.xlsx")
Based upon this library: spark-excel by Crealytics.
The following way does not require as much maneuvering: first convert your PySpark dataframe to a pandas dataframe with toPandas(), then use to_excel to write it out in Excel format.
import pandas
df.describe().toPandas().to_excel('fileOutput.xls', sheet_name = 'Sheet1', index = False)
Note, the above requires the xlwt package to be installed (pip install xlwt on the command line).
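Since the original question also asks about blob storage, one hedged follow-up (assuming the container is mounted under /mnt/<mount-name>, a placeholder, and that the file was written to the driver's working directory) is to copy the local file into DBFS with dbutils:
# Copy the locally written workbook into a mounted blob container.
# Both paths are assumptions: adjust the driver-side source path and the
# mount point to match your workspace.
dbutils.fs.cp("file:/databricks/driver/fileOutput.xls",
              "dbfs:/mnt/<mount-name>/fileOutput.xls")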

Does it have to be an Excel file? CSV files are so much easier to work with. You can certainly open a CSV in Excel and save it as an Excel file from there. As far as I know, you can write directly to Blob storage and completely bypass the step of storing the data locally.
df.write \
.format("com.databricks.spark.csv") \
.option("header", "true") \
.save("myfile.csv")
In this example, you can try changing the extension to xls before you run the job. I can't test this because I don't have Databricks set up on my personal laptop.
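For the write-directly-to-Blob-storage route, a minimal sketch assuming an account access key is registered on the Spark conf; the <storage-account>, <container>, and <access-key> values are placeholders:
# Register the storage account key, then write the CSV straight to a wasbs:// path.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    "<access-key>"
)
df.write \
    .format("csv") \
    .option("header", "true") \
    .save("wasbs://<container>@<storage-account>.blob.core.windows.net/myfile.csv")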

Related

Column names appearing as record data in Pyspark databricks

I'm working with PySpark in Python. I downloaded a sample CSV file from Kaggle (Covid Live.csv), and the data from the table looks as follows when opened in Visual Studio Code:
(Raw CSV data only partial data)
#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........
The problem I'm facing here is that the column names are also being displayed as records in the PySpark Databricks console when executed with the code below:
from pyspark.sql.types import *
df1 = spark.read.format("csv") \
.option("inferschema", "true") \
.option("header", "true") \
.load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv") \
.select("*")
Spark Jobs -->
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string
As can be observed above, Spark is detecting only two columns, # and Country, but is not aware that 'Total Cases', 'Total Deaths', etc. are also columns.
How do I tackle this malformation?
A few ways to go about this:
1. Fix the header in the CSV before reading (it should be on a single line). Also pay attention to the quoting and escape settings.
2. Read in PySpark with a manually provided schema and filter out the bad lines.
3. Read using pandas, skipping the first 12 lines, add proper column names, and convert to a PySpark dataframe (a sketch follows this list).
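A hedged sketch of the pandas route (option 3), assuming the 12-line multi-line header shown above; the cleaned-up column names and the /dbfs/ path prefix are assumptions for illustration:
import pandas as pd

# Skip the broken multi-line header, supply clean names, then hand off to Spark.
cols = ["id", "country", "total_cases", "total_deaths", "new_deaths",
        "total_recovered", "active_cases", "serious_critical",
        "tot_cases_per_1m", "deaths_per_1m", "total_tests",
        "tests_per_1m", "population"]
pdf = pd.read_csv("/dbfs/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv",
                  skiprows=12, header=None, names=cols)
df1 = spark.createDataFrame(pdf)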
So, the solution is pretty simple and does not require you to 'edit' the data manually or anything of that sort.
I just had to add .option("multiLine", "true") \ and the data is displaying as desired!
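For reference, the full read from the question with that option applied (a minimal sketch reusing the original file path):
df1 = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("multiLine", "true") \
    .load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv")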

How to read excel as a pyspark dataframe

I am able to read all files and formats like CSV, Parquet, and Delta from an ADLS Gen2 account with OAuth2 credentials.
However, when I try to read an Excel file like below,
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'excel sheet name'!A1") \
.load(filepath)
I am getting the below error:
Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
Note: I have installed the external library "com.crealytics:spark-excel_2.11:0.12.2" to read Excel as a dataframe.
Can anyone help me with the error here?
Try using the following in your configs: "fs.azure.account.oauth2.client.secret": "<key-name>",
Also, different versions have different sets of parameters, so try using the latest release: https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/0.13.7
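For context, a minimal sketch of the service-principal (OAuth2) configuration for ADLS Gen2; the <application-id>, <client-secret>, and <tenant-id> values are placeholders:
# Hadoop ABFS OAuth settings for a service principal; all <...> values are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
for key, value in configs.items():
    spark.conf.set(key, value)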

databricks: writing spark dataframe directly to excel

Is there any method to write a Spark dataframe directly to xls/xlsx format?
Most of the examples on the web are for pandas dataframes,
but I would like to use a Spark dataframe to work with my data. Any ideas?
I'm assuming that because you have the "databricks" tag you want to create an .xlsx file within the Databricks file store and that you are running code within Databricks notebooks. I'm also going to assume that your notebooks are running Python.
There is no direct way to save an Excel document from a Spark dataframe. You can, however, convert a Spark dataframe to a pandas dataframe and then export from there. We'll need to start by installing the xlsxwriter package. You can do this for your notebook environment using a Databricks utilities command:
dbutils.library.installPyPI('xlsxwriter')
dbutils.library.restartPython()
I was having a few permission issues saving an Excel file directly to DBFS. A quick workaround was to save to the cluster's default directory and then sudo move the file into DBFS. Here's some example code:
# Creating dummy spark dataframe
spark_df = spark.sql('SELECT * FROM default.test_delta LIMIT 100')
# Converting spark dataframe to pandas dataframe
pandas_df = spark_df.toPandas()
# Exporting pandas dataframe to xlsx file
pandas_df.to_excel('excel_test.xlsx', engine='xlsxwriter')
Then in a new command, specifying the command to run in shell with %sh:
%sh
sudo mv excel_test.xlsx /dbfs/mnt/data/
It is possible to generate an Excel file from pySpark.
df_spark.write.format("com.crealytics.spark.excel")\
.option("header", "true")\
.mode("overwrite")\
.save(path)
You need to install the com.crealytics:spark-excel_2.12:0.13.5 (or a more recent version of course) library though, for example in Azure Databricks by specifying it as a new Maven library in the libraries list of your cluster (one of the buttons on the left sidebar of the Databricks UI).
For more info see https://github.com/crealytics/spark-excel.
I believe you can do it like this:
sourcePropertySet.write \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("D:\\resultset.csv")
I'm not sure you can write directly to Excel, but Excel can definitely consume a CSV. This is almost certainly the easiest way of doing this kind of thing and the cleanest as well. In Excel you have all kinds of formatting, which can throw errors when used in some systems (think of merged cells).
You cannot save it directly, but you can have it stored in a temp location and then move it to your directory. My code piece is:
import xlsxwriter
import pandas as pd1

workbook = xlsxwriter.Workbook('data_checks_output.xlsx')
worksheet = workbook.add_worksheet('top_rows')

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')

# Take the top rows of the Spark dataset and convert them to pandas.
output = dataset.limit(10)
output = output.toPandas()

# row_number is assumed to be defined earlier in the original code (e.g. 0).
output.to_excel(writer, sheet_name='top_rows', startrow=row_number)
writer.save()
Below code does the work of moving files.
%sh
sudo mv data_checks_output.xlsx /dbfs/mnt/fpmount/
Comment if anyone has a newer update or a better way to do it.
PySpark does not yet offer any method to save an Excel file directly. But you can save a CSV file, which can then be opened in Excel.
From the pyspark.sql module (version 2.3+) you have write.csv:
df.write.csv('path/filename')
Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=save

Pulling log file directory name into the Pyspark dataframe

I have a bit of a strange one. I have loads of logs that I need to trawl. I have done that successfully in Spark & I am happy with it.
However, I need to add one more field to the dataframe, which is the data center.
The only place that the datacenter name can be derived is from the directory path.
For example:
/feedname/date/datacenter/another/logfile.txt
What would be the way to extract the log file path and inject it into the dataframe? From there, I can do some string splits & extract the bit I need.
My current code:
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.withColumn("Datacenter", input_file_name())\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data.printSchema()
mpe_data.createOrReplaceTempView("mpe")
You can get the file path using input_file_name() in Spark 2.0+:
from pyspark.sql.functions import input_file_name
df.withColumn("Datacenter", input_file_name())
Using your piece of code as an example, once you have read your file, use withColumn to get the file name:
mpe_data = my_spark.read\
    .option("header", "false")\
    .option("delimiter", "\t")\
    .csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)

mpe_data = mpe_data.withColumn("Datacenter", input_file_name())
mpe_data.printSchema()
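From there, a hedged sketch of pulling the datacenter segment out of the captured path with a string split; the index below is an assumption based on the /feedname/date/datacenter/... layout and will need adjusting for the real path depth:
from pyspark.sql.functions import input_file_name, split, col

# Keep the full path, then split on "/" and pick the segment holding the datacenter.
mpe_data = mpe_data.withColumn("file_path", input_file_name())
mpe_data = mpe_data.withColumn("Datacenter",
                               split(col("file_path"), "/").getItem(6))  # index is an assumption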

Overwriting Parquet File in Bluemix Object Storage with Apache Spark Notebook

I'm running a Spark Notebook to save a DataFrame as a Parquet File in the Bluemix Object Storage.
I want to overwrite the Parquet file when rerunning the notebook, but it's actually just appending the data.
Below a sample of the iPython Code:
df = sqlContext.sql("SELECT * FROM table")
df.write.parquet("swift://my-container.spark/simdata.parquet", mode="overwrite")
I'm not a Python guy, but setting the save mode to overwrite (the PySpark equivalent of SaveMode.Overwrite) works for dataframes like this:
df.write.mode("overwrite").parquet("swift://my-container.spark/simdata.parquet")
I think the block storage replaces only 'simdata.parquet', while the 'part-0000*' files remain, because they were written under 'simdata.parquet' with the UUID of the app id; when you try to read, the DataFrame reads all files matching 'simdata.parquet*'.
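If stale part files from earlier runs really are the cause, a hedged workaround is to delete the target path through the Hadoop FileSystem API before writing again (a sketch using PySpark's JVM gateway, reusing the path from the question):
# Delete the old output directory, then rewrite it with overwrite mode.
path = "swift://my-container.spark/simdata.parquet"
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = Path(path).getFileSystem(hadoop_conf)
fs.delete(Path(path), True)  # True = recursive

df.write.mode("overwrite").parquet(path)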
