How to write dataset object to excel in spark java? - apache-spark

I am reading an Excel file using the com.crealytics.spark.excel package.
Below is the code to read an Excel file in Spark Java:
Dataset<Row> SourcePropertSet = sqlContext.read()
.format("com.crealytics.spark.excel")
.option("location", "D:\\5Kto10K.xlsx")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false")
.load("com.databricks.spark.csv");
I then tried to use the same (com.crealytics.spark.excel) package to write the dataset object to an Excel file in Spark Java:
SourcePropertSet.write()
.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false").save("D:\\resultset.xlsx");
But I am getting the below error:
java.lang.RuntimeException: com.crealytics.spark.excel.DefaultSource
does not allow create table as select.
I also tried the org.zuinnote.spark.office.excel package. Below is the code for that:
SourcePropertSet.write()
.format("org.zuinnote.spark.office.excel")
.option("write.locale.bcp47", "de")
.save("D:\\result");
I have added the following dependencies in my pom.xml:
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>hadoopoffice-fileformat</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>spark-hadoopoffice-ds_2.11</artifactId>
<version>1.0.3</version>
</dependency>
But I am getting the below error:
java.lang.IllegalAccessError: tried to access method org.zuinnote.hadoop.office.format.mapreduce.ExcelFileOutputFormat.getSuffix(Ljava/lang/String;)Ljava/lang/String;
from class org.zuinnote.spark.office.excel.ExcelOutputWriterFactory
Please help me write the dataset object to an Excel file in Spark Java.

It looks like the library you chose, com.crealytics.spark.excel, does not have any code related to writing Excel files. Underneath it uses Apache POI for reading Excel files; there are also a few examples.
The good news is that Excel can open a CSV file directly, so you can use spark-csv to write one. You need to change your code like this:
SourcePropertSet.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("D:\\resultset.csv");
Keep in mind that Spark writes one output file per partition, so you might want to call .repartition(1) to get exactly one result file.
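For example, a minimal sketch using the same dataset and options as above (the output path is only illustrative):
// collapse to a single partition so Spark produces a single CSV part file
SourcePropertSet.repartition(1)
.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("D:\\resultset.csv");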

The error you face when writing comes from an old version of the HadoopOffice library. Please make sure that you have only version 1.0.3, or better 1.0.4, as a dependency. Can you provide your build file? The following should work:
SourcePropertSet.write()
.format("org.zuinnote.spark.office.excel")
.option("spark.write.useHeader",true)
.option("write.locale.bcp47", "us")
.save("D:\\result");
Version 1.0.4 of the Spark2 data source for HadoopOffice also supports inferring the schema when reading:
Dataset<Row> SourcePropertSet = sqlContext.read()
.format("org.zuinnote.spark.office.excel")
.option("spark.read.useHeader", "true")
.option("spark.read.simpleMode", "true")
.load("D:\\5Kto10K.xlsx");
Please note that it is not recommended to mix different Excel data sources based on POI in one application.
More information here: https://github.com/ZuInnoTe/spark-hadoopoffice-ds

Related

Why is the data frame not throwing a RuntimeException with the "FAILFAST" option in Spark while reading using com.crealytics.spark.excel?

schema = <Schema of excel file>
df = spark.read.format("com.crealytics.spark.excel").\
option("useHeader", "true").\
option("mode", "FAILFAST"). \
schema(schema).\
option("dataAddress", "Sheet1"). \
load("C:\\Users\\ABC\\Downloads\\Input.xlsx")
df.show()
The PySpark read-Excel snippet above does not fail or throw a runtime exception when reading (when the show() action is called) from incorrect/corrupt data. option("mode", "FAILFAST") works fine for CSV, but when I use the com.crealytics.spark.excel jar the code does not fail; it returns results with the incorrect/corrupt data dropped.
Has anyone encountered the same issue?
Thanks in advance!
Based on the following documentation, there is no mention that the mode option is supported:
https://github.com/crealytics/spark-excel

How to read excel as a pyspark dataframe

I am able to read all the other files and formats, like CSV, Parquet, and Delta, from an ADLS Gen2 account with OAuth2 credentials.
However, when I try to read an Excel file like below,
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'excel sheet name'!A1") \
.load(filepath)
I am getting the below error:
Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
Note: I have installed the external library "com.crealytics:spark-excel_2.11:0.12.2" to read Excel as a dataframe.
Can anyone help me with the error here?
Try using "fs.azure.account.oauth2.client.secret": "<key-name>" in your configs.
Also, different versions have different sets of parameters, so try using the latest release: https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/0.13.7

Pulling log file directory name into the Pyspark dataframe

I have a bit of a strange one. I have loads of logs that I need to trawl. I have done that successfully in Spark and I am happy with it.
However, I need to add one more field to the dataframe: the data center.
The only place the datacenter name can be derived from is the directory path.
For example:
/feedname/date/datacenter/another/logfile.txt
What would be the way to extract the log file path and inject it into the dataframe? From there, I can do some string splits and extract the bit I need.
My current code:
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.withColumn("Datacenter", input_file_name())\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data.printSchema()
mpe_data.createOrReplaceTempView("mpe")
You can get the file path using input_file_name() in Spark 2.0+:
from pyspark.sql.functions import input_file_name
df.withColumn("Datacenter", input_file_name())
Using your piece of code as an example, once you have read your file, use withColumn to add the file name:
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data = mpe_data.withColumn("Datacenter", input_file_name())
mpe_data.printSchema()

Databricks: convert data frame and export to xls / xlsx

Is it possible in Databricks to convert a data frame, export it to xls/xlsx, and save it to Blob storage, using Python?
Here's an example of writing a dataframe to Excel using PySpark:
df.write \
.format("com.crealytics.spark.excel") \
.option("dataAddress", "'My Sheet'!B3:C35") \
.option("useHeader", "true") \
.option("dateFormat", "yy-mmm-d") \
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") \
.mode("append") \
.save("Worktime2.xlsx")
Based upon this library: spark-excel by Crealytics.
The following way does not require as much maneuvering. First convert your PySpark dataframe to a pandas data frame with toPandas(), and then use to_excel to write it out in Excel format.
import pandas
df.describe().toPandas().to_excel('fileOutput.xls', sheet_name = 'Sheet1', index = False)
Note that the above requires the xlwt package to be installed (pip install xlwt on the command line).
Does it have to be an Excel file? CSV files are so much easier to work with. You can certainly open a CSV in Excel and save it as an Excel file. As far as I know, you can write directly to Blob storage and completely bypass the step of storing the data locally.
df.write \
.format("com.databricks.spark.csv") \
.option("header", "true") \
.save("myfile.csv")
In this example, you can try changing the extension to xls before you run the job. I can't test this because I don't have Databricks set up on my personal laptop.

spark-xml library is parsing the xml file many times

I use the spark-xml library from Databricks to parse an XML file (550 MB).
Dataset<Row> books = spark.sqlContext().read()
.format("com.databricks.spark.xml")
.option("rootTag", "books")
.option("rowTag", "book")
.option("treatEmptyValuesAsNulls", "true")
.load("path");
Spark parses the file a first time with many tasks/partitions.
Then, when I call this code :
books.select("code").count()
Spark starts parsing the file again.
Is there a solution to avoid re-parsing the file on every action called on the dataset?
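A minimal sketch, not from the original thread, of one common way to avoid the re-read: cache the parsed Dataset so later actions reuse it (cache()/persist() are standard Spark APIs; the path is illustrative):
// parse once, then keep the rows in memory (spilling to disk if needed)
Dataset<Row> books = spark.sqlContext().read()
.format("com.databricks.spark.xml")
.option("rootTag", "books")
.option("rowTag", "book")
.option("treatEmptyValuesAsNulls", "true")
.load("path")
.cache(); // or .persist(StorageLevel.MEMORY_AND_DISK())
books.select("code").count(); // first action triggers the parse and fills the cache
books.select("code").count(); // subsequent actions read from the cache, no new parse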
