How to read excel as a pyspark dataframe - azure

I am able to read files in formats such as CSV, Parquet, and Delta from an ADLS Gen2 account using OAuth2 credentials.
However, when I try to read an Excel file as shown below,
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'excel sheet name'!A1") \
.load(filepath)
I get the following error:
Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
Note: I have installed the external library "com.crealytics:spark-excel_2.11:0.12.2" to read Excel files as a DataFrame.
Can anyone help me with this error?

Try setting the client secret in your configs as: "fs.azure.account.oauth2.client.secret": "<key-name>"
Different versions of the library also accept different sets of parameters, so try the latest release (pick the artifact built for your cluster's Scala version): https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/0.13.7
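For reference, here is a minimal sketch of the OAuth settings that the ABFS driver reads for a service principal. The helper name abfs_oauth_conf and all placeholder values are hypothetical. The key names are the usual Hadoop ABFS ones, but verify them against the documentation for your runtime.

```python
def abfs_oauth_conf(account, client_id, client_secret, tenant_id):
    """Build the Hadoop ABFS OAuth settings for one storage account.

    Hypothetical helper for illustration; all arguments are placeholders.
    """
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


conf = abfs_oauth_conf("<storage-account>", "<client-id>", "<client-secret>", "<tenant-id>")
for key in sorted(conf):
    print(key)

# On a cluster (assumption: a live `spark` session exists), apply with:
# for key, value in conf.items():
#     spark.conf.set(key, value)
```

An error that mentions fs.azure.account.key may mean that the Excel reader did not see these OAuth settings and fell back to account-key authentication. In that case, setting them in the cluster's Spark config instead of only in the notebook session is worth trying.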

Related

Column names appearing as record data in Pyspark databricks

I'm working with PySpark in Python. I downloaded a sample CSV file from Kaggle (Covid Live.csv); this is how the data looks when opened in Visual Studio Code:
(Raw CSV data, partial)
#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........
The problem I'm facing is that the column names are also displayed as records in the Databricks console when I run the code below:
from pyspark.sql.types import *
df1 = spark.read.format("csv") \
.option("inferschema", "true") \
.option("header", "true") \
.load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv") \
.select("*")
Spark Jobs -->
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string
As can be seen above, Spark detects only two columns, "#" and "Country,", and is not aware that 'Total Cases', 'Total Deaths', and so on are also columns.
How do I handle this malformed header?
There are a few ways to go about this:
1. Fix the header in the CSV before reading (it should be on a single line). Also pay attention to the quoting and escape settings.
2. Read in PySpark with a manually provided schema and filter out the bad lines.
3. Read using pandas, skip the first 12 lines, add proper column names, and convert to a PySpark DataFrame.
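So, the solution is pretty simple and does not require editing the data manually or anything of that sort.
The header fields are quoted and contain line breaks, so the reader has to be told that a record can span several lines. I just had to add .option("multiLine", "true") \ to the reader, and the data is displayed as desired!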
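To see why the header was cut off, here is a small sketch using Python's standard csv module (not Spark) on a shortened sample of the same shape. The sample rows are trimmed from the data in the question, and the variable names are only illustrative.

```python
import csv
import io

# Shortened sample in the same shape as the Kaggle file: the quoted header
# fields contain line breaks, so the header occupies several physical lines.
raw = (
    '#,"Country,\nOther","Total\nCases",Population\n'
    '1,USA,"98,166,904","334,805,269"\n'
    '2,India,"44,587,307","1,406,631,776"\n'
)

# A plain line split sees only the start of the header ...
first_physical_line = raw.split("\n")[0]

# ... while a quote-aware parser treats newlines inside quotes as data.
rows = list(csv.reader(io.StringIO(raw, newline="")))
header = [name.replace("\n", " ") for name in rows[0]]
records = rows[1:]

print(first_physical_line)
print(header)
print(len(records))
```

The first physical line contains only `#` and the opening of `"Country,`, which matches the two columns Spark reported when it read the file line by line. Once records may span lines, the header parses into its full set of columns, and the embedded line breaks can be replaced to get cleaner column names.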

Querying snowflake metadata using spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and get the error "Object 'SHOW' does not exist or not authorized".
df = spark.read \
.format("snowflake") \
.options(**options) \
.option("query", "show tables") \
.load()
df.show()
A sample query like "SELECT 1" works as expected.
I know that I could install the native Snowflake Python driver, but I want to avoid that if possible because I have already opened the session using Spark.
There is also the "Utils.runQuery" function, but as I understand it, that is only relevant for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
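Since only SELECT is accepted, one possible workaround is to read the same metadata with a SELECT on INFORMATION_SCHEMA.TABLES. The sketch below only builds the query text. The helper name tables_query and the schema name are hypothetical, and the commented Spark call assumes the `options` dict from the question.

```python
import re


def tables_query(schema):
    """Return a SELECT that lists the tables in one schema (hypothetical helper)."""
    # Accept only plain identifiers so the schema name can be embedded safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$]*", schema):
        raise ValueError(f"unexpected schema name: {schema!r}")
    # Assumption: the schema was created unquoted, so its stored name is upper case.
    return (
        "SELECT table_name, table_type "
        "FROM information_schema.tables "
        f"WHERE table_schema = '{schema.upper()}'"
    )


query = tables_query("public")
print(query)

# With a live session (assumption: `spark` and `options` as in the question):
# df = spark.read.format("snowflake").options(**options).option("query", query).load()
```

This keeps everything inside the existing Spark session, at the cost of returning only the columns that INFORMATION_SCHEMA exposes.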

JDBC not truncating Postgres table on pyspark

I'm using the following code to truncate a table before inserting data into it.
df.write \
.option("driver", "org.postgresql:postgresql:42.2.16") \
.option("truncate", True) \
.jdbc(url=pgsql_connection, table="service", mode='append', properties=properties_postgres)
However, it is not working: the table still contains the old data. I'm using append since I don't want the DB to drop and recreate the table every time.
I've also tried .option("truncate", "true"), but that did not work either.
I get no error messages. How can I solve this problem using .option to truncate my table?
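You need to use overwrite mode; the truncate option has no effect in append mode:
# "driver" expects the JDBC driver class name, not the Maven coordinate.
df.write \
.option("driver", "org.postgresql.Driver") \
.option("truncate", True) \
.jdbc(url=pgsql_connection, table="service", mode='overwrite', properties=properties_postgres)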
As stated in the documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html):
truncate: true -> When SaveMode.Overwrite is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it.
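The interaction between the save mode and the truncate option can be summarized with a rough model. This is only an illustration of the behavior described in the quoted documentation, not Spark's actual implementation, and the function name jdbc_write_plan is hypothetical.

```python
def jdbc_write_plan(mode, truncate=False):
    """Describe what a JDBC write does to an existing table.

    Illustrative model of the documented behavior; not Spark's code.
    """
    if mode == "append":
        # The truncate option is only consulted in overwrite mode.
        return ["INSERT rows"]
    if mode == "overwrite":
        if truncate:
            return ["TRUNCATE TABLE", "INSERT rows"]
        return ["DROP TABLE", "CREATE TABLE", "INSERT rows"]
    raise ValueError(f"mode not covered by this sketch: {mode!r}")


for mode, truncate in [("append", True), ("overwrite", True), ("overwrite", False)]:
    print(mode, truncate, jdbc_write_plan(mode, truncate))
```

Under this model, append with truncate set never touches the existing rows, which matches what the question observed.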

Databricks: convert data frame and export to xls / xlsx

Is it possible in Databricks to convert a DataFrame, export it to xls / xlsx, and save it to blob storage, using Python?
Here's an example of writing a DataFrame to Excel using PySpark:
# Assumption: the header option is "useHeader" in older spark-excel releases and "header" in newer ones; check the README for your version.
df.write \
.format("com.crealytics.spark.excel") \
.option("dataAddress", "'My Sheet'!B3:C35") \
.option("useHeader", "true") \
.option("dateFormat", "yy-mmm-d") \
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") \
.mode("append") \
.save("Worktime2.xlsx")
This is based on the spark-excel library by Crealytics.
The following approach does not require as much maneuvering. First, convert your PySpark DataFrame to a pandas DataFrame (toPandas()), and then use "to_excel" to write it in Excel format.
import pandas
df.describe().toPandas().to_excel('fileOutput.xls', sheet_name = 'Sheet1', index = False)
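Note: the above requires the xlwt package to be installed (pip install xlwt on the command line). Recent pandas releases no longer support xlwt; there, write to an .xlsx file instead, which needs openpyxl.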
Does it have to be an Excel file? CSV files are so much easier to work with. You can certainly open a CSV in Excel and save that as an Excel file. As far as I know, you can write directly to Blob storage and completely bypass the step of storing the data locally.
df.write \
.format("com.databricks.spark.csv") \
.option("header", "true") \
.save("myfile.csv")
In this example, you can try changing the extension to xls before you run the job. I can't test this because I don't have Databricks set up on my personal laptop.

How to write dataset object to excel in spark java?

I am reading an Excel file using the com.crealytics.spark.excel package.
Below is the code to read an Excel file in Spark Java.
Dataset<Row> SourcePropertSet = sqlContext.read()
.format("com.crealytics.spark.excel")
.option("location", "D:\\5Kto10K.xlsx")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false")
.load("com.databricks.spark.csv");
I then tried to use the same package (com.crealytics.spark.excel) to write a Dataset object to an Excel file in Spark Java.
SourcePropertSet.write()
.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false").save("D:\\resultset.xlsx");
But I am getting the error below.
java.lang.RuntimeException: com.crealytics.spark.excel.DefaultSource does not allow create table as select.
I also tried the org.zuinnote.spark.office.excel package.
Below is the code for that.
SourcePropertSet.write()
.format("org.zuinnote.spark.office.excel")
.option("write.locale.bcp47", "de")
.save("D:\\result");
I have added the following dependencies to my pom.xml:
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>hadoopoffice-fileformat</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>spark-hadoopoffice-ds_2.11</artifactId>
<version>1.0.3</version>
</dependency>
But I am getting the error below.
java.lang.IllegalAccessError: tried to access method org.zuinnote.hadoop.office.format.mapreduce.ExcelFileOutputFormat.getSuffix(Ljava/lang/String;)Ljava/lang/String; from class org.zuinnote.spark.office.excel.ExcelOutputWriterFactory
Please help me write a Dataset object to an Excel file in Spark Java.
It looks like the library you chose, com.crealytics.spark.excel, does not have any code for writing Excel files. Underneath, it uses Apache POI for reading Excel files; there are also a few examples.
The good news is that Excel can open CSV files, and you can use spark-csv to write one. You need to change your code like this:
SourcePropertSet.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("D:\\resultset.csv");
Keep in mind that Spark writes one output file per partition, so you might want to call .repartition(1) to get exactly one result file.
The error you face when writing comes from an old version of the HadoopOffice library. Please make sure that you have only version 1.0.3, or better 1.0.4, as a dependency. Can you provide your build file? The following should work:
SourcePropertSet.write()
.format("org.zuinnote.spark.office.excel")
.option("spark.write.useHeader",true)
.option("write.locale.bcp47", "us")
.save("D:\\result");
Version 1.0.4 of the Spark2 data source for HadoopOffice also supports inferring the schema when reading:
Dataset<Row> SourcePropertSet = sqlContext.read()
.format("org.zuinnote.spark.office.excel")
.option("spark.read.useHeader", "true")
.option("spark.read.simpleMode", "true")
.load("D:\\5Kto10K.xlsx");
Please note that it is not recommended to mix different Excel data sources based on POI in one application.
More information here: https://github.com/ZuInnoTe/spark-hadoopoffice-ds
