Spark : skip top rows with spark-excel - excel

I have an Excel file whose first 3 rows are damaged and need to be skipped. I'm using the spark-excel library to read the file, but their GitHub shows no such functionality, so is there a way to achieve this?
This is my code:
Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
.option("location", filePath)
.option("sheetName", "Feuil1")
.option("useHeader", "true")
.option("delimiter", "|")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false")
.load(filePath);

I have looked at the source code and there is no option for this:
https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DefaultSource.scala
You should fix your Excel file and remove the first 3 rows. Otherwise you would need to create a patched version of the library to do the same, which would be far more effort than producing a correct Excel sheet.

This issue is fixed as of spark-excel 0.9.16 (see the corresponding issue on GitHub).

You can use the skipFirstRows option (I believe it is deprecated after version 0.11)
Library dependency: "com.crealytics" %% "spark-excel" % "0.10.2"
Sample code:
val df = sparkSession.read.format("com.crealytics.spark.excel")
.option("location", inputLocation)
.option("sheetName", "sheet1")
.option("useHeader", "true")
.option("skipFirstRows", "2") // Mention the number of top rows to be skipped
.load(inputLocation)
Hope it helps!
Feel free to let me know in comments if you have any doubts/issues. Thanks!

skipFirstRows was deprecated in favor of the more generic dataAddress option. For your specific example, you can skip rows by specifying the start cell of your data:
Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
.option("location", filePath)
.option("useHeader", "true")
.option("delimiter", "|")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false")
.option("dataAddress", "'Feuil1'!A3") // From the docs: Start cell of the data. Reading will return all rows below and all columns to the right
.load(filePath);

Related

spark.read.excel - Not reading all Excel rows when using custom schema

I am trying to read a Spark DataFrame from an 'excel' file. I used the crealytics dependency.
Without any predefined schema, all rows are read correctly, but only as string-type columns.
To prevent that, I am using my own schema (in which I declared certain columns as Integer type), but in this case most of the rows are dropped when the file is read.
The library dependency used in build.sbt:
"com.crealytics" %% "spark-excel" % "0.11.1",
Scala version - 2.11.8
Spark version - 2.3.2
val inputDF = sparkSession.read.excel(useHeader = true).load(inputLocation(0))
The above reads all the data - around 25000 rows.
But,
val inputWithSchemaDF: DataFrame = sparkSession.read
.format("com.crealytics.spark.excel")
.option("useHeader" , "false")
.option("inferSchema", "false")
.option("addColorColumns", "true")
.option("treatEmptyValuesAsNulls" , "false")
.option("keepUndefinedRows", "true")
.option("maxRowsInMey", 2000)
.schema(templateSchema)
.load(inputLocation)
This gives me only 450 rows.
Is there a way to prevent that? Thanks in advance!
As of now, I haven't found a fix to this problem, but I tried solving it in a different way by manually type-casting. To keep the number of lines of code down, I used a for loop. My solution is as follows:
Step 1: Create my own schema of type 'StructType':
import org.apache.spark.sql.types._

val requiredSchema = new StructType()
.add("ID", IntegerType, true)
.add("Vendor", StringType, true)
.add("Brand", StringType, true)
.add("Product Name", StringType, true)
.add("Net Quantity", StringType, true)
Step 2: Type-cast the DataFrame AFTER it has been read (WITHOUT the custom schema) from the Excel file, instead of applying the schema while reading:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Cast every column whose type differs from the corresponding field in requiredSchema.
def convertInputToDesiredSchema(inputDF: DataFrame, requiredSchema: StructType)(implicit sparkSession: SparkSession): DataFrame =
{
var schemaDf: DataFrame = inputDF
for(i <- inputDF.columns.indices)
{
if(inputDF.schema(i).dataType.typeName != requiredSchema(i).dataType.typeName)
{
schemaDf = schemaDf.withColumn(schemaDf.columns(i), col(schemaDf.columns(i)).cast(requiredSchema.apply(i).dataType))
}
}
schemaDf
}
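A call site could look like this (a sketch: inputDF and requiredSchema are the values built in steps 1 and 2, and the implicit SparkSession is assumed to be in scope):
val typedDF = convertInputToDesiredSchema(inputDF, requiredSchema)
typedDF.printSchema() // "ID" should now be reported as integer rather than string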
This might not be an efficient solution, but it is better than typing out many lines of code to typecast multiple columns.
I am still searching for a solution to my original question.
This workaround is just in case someone needs a quick fix in the meantime.
Here's a workaround in PySpark, using a schema file that consists of "fieldname" and "dataType":
# 1st load the dataframe with StringType for all columns
from pyspark.sql.types import *
from pyspark.sql.functions import col
import pandas as pd
input_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", isHeaderOn) \
.option("treatEmptyValuesAsNulls", "true") \
.option("dataAddress", xlsxAddress1) \
.option("setErrorCellsToFallbackValues", "true") \
.option("ignoreLeadingWhiteSpace", "true") \
.option("ignoreTrailingWhiteSpace", "true") \
.load(inputfile)
# 2nd Modify the datatypes within the dataframe using a file containing column names and the expected data type.
dtypes = pd.read_csv("/dbfs/mnt/schema/{}".format(file_schema_location), header=None).to_records(index=False).tolist()
fields = [StructField(dtype[0], globals()[f'{dtype[1]}']()) for dtype in dtypes]
schema = StructType(fields)
for dt in dtypes:
    colname = dt[0]
    coltype = dt[1].replace("Type", "")
    input_df = input_df.withColumn(colname, col(colname).cast(coltype))

Error while reading an Excel file using Spark Scala with filename as a parameter

Can someone help me with reading an Excel file using the Spark Scala read API? I tried installing com.crealytics:spark-excel_2.11:0.13.1 (from Maven) on a cluster with Databricks Runtime 6.5 and 6.6 (Apache Spark 2.4.5, Scala 2.11), but it only works if I hard-code the file path.
val df = spark.read
.format("com.crealytics.spark.excel")
.option("sheetName", "Listing_Attributed")
.option("header", "true")
.option("inferSchema", "false")
.option("addColorColumns", "true") // Optional, default: false
.option("badRecordsPath", Vars.rootSourcePath + "BadRecords/" + DataCategory)
.option("dateFormat", "dd-MON-yy")
.option("timestampFormat", "MM/dd/yyyy hh:mm:ss")
.option("ignoreLeadingWhiteSpace",true)
.option("ignoreTrailingWhiteSpace",true)
.option("escape"," ")
.load("/ABC/Test_Filename_6.12.20.xlsx") // hard-coded path works...
// .load(filepath) // Filepath is a parameter and throws error, "java.io.IOException: GC overhead limit exceeded"
Use .option("location", inputPath) as below:
val df = spark.read
.format("com.crealytics.spark.excel")
.option("sheetName", "Listing_Attributed")
.option("header", "true")
.option("location", inputPath)
.load()

How can you go about creating a csv file from an empty Dataset<Row> in spark 2.1 with headers

Spark 2.1's default behaviour is to write empty files when creating a CSV from an empty Dataset.
How can you go about creating a CSV file with headers?
This is what I am using to write the file:
dataFrame.repartition(NUM_PARTITIONS).write()
.option("header", "true")
.option("delimiter", "\t")
.option("overwrite", "true")
.option("nullValue", "null")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.csv("some/path");

How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?

I have a large Excel (xlsx and xls) file with multiple sheets and I need to convert it to an RDD or DataFrame so that it can be joined to another dataframe later. I was thinking of using Apache POI to save it as a CSV and then read the CSV into a dataframe. But if there is any library or API that can help with this process, it would be easy. Any help is highly appreciated.
The solution to your problem is to use the Spark Excel dependency in your project.
Spark Excel has flexible options to play with.
I have tested the following code to read from Excel and convert it to a dataframe and it works perfectly:
def readExcel(file: String): DataFrame = sqlContext.read
.format("com.crealytics.spark.excel")
.option("location", file)
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "False")
.load()
val data = readExcel("path to your excel file")
data.show(false)
You can give the sheet name as an option if your Excel file has multiple sheets:
.option("sheetName", "Sheet2")
I hope it's helpful.
Here are read and write examples showing how to read from and write to Excel with the full set of options...
Source spark-excel from crealytics
Scala API
Spark 2.0+:
Create a DataFrame from an Excel file
import org.apache.spark.sql._
val spark: SparkSession = ???
val df = spark.read
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily") // Required
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "false") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("Worktime.xlsx")
Write a DataFrame to an Excel file
df.write
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily")
.option("useHeader", "true")
.option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
.mode("overwrite")
.save("Worktime2.xlsx")
Note: instead of sheet1 or sheet2 you can also use the actual sheet names; in the example above, Daily is the sheet name.
If you want to use it from spark shell...
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.13.1
Dependencies need to be added (in the case of Maven etc.):
groupId: com.crealytics
artifactId: spark-excel_2.11
version: 0.13.1
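For an sbt build, the equivalent coordinates would look like this (assuming a Scala 2.11 project so that the _2.11 artifact is resolved):
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.1"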
Further reading: see my article (How to do Simple reporting with Excel sheets using Apache Spark, Scala ?) on how to write to many Excel sheets after performing aggregations.
Tip: this is a very useful approach, particularly for writing Maven test cases, where you can place Excel sheets with sample data in the src/main/resources folder and access them in your unit test cases (Scala/Java), creating DataFrame[s] out of the Excel sheets; see the sketch below.
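A minimal sketch of that test-resource idea (the file name sample-data.xlsx and the sheet name are placeholders, not part of the original answer):
// Inside a unit test: load a workbook bundled on the test classpath.
val resourceUrl = getClass.getResource("/sample-data.xlsx") // hypothetical fixture file
val fixtureDf = spark.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Daily")
  .option("useHeader", "true")
  .load(resourceUrl.getPath)
fixtureDf.show(false)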
Another option you could consider is spark-hadoopoffice-ds:
A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1. However, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following formats of the HadoopOffice library:
Excel Datasource format: org.zuinnote.spark.office.Excel (loading and saving of old Excel (.xls) and new Excel (.xlsx)). This datasource is available on Spark-packages.org and on Maven Central.
Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which also supports encrypted Excel documents and linked workbooks, among other features. Of course Spark is also supported.
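For orientation, a hedged sketch of reading through that datasource (the lowercase format string follows the HadoopOffice package naming; the path is a placeholder and further tuning options are omitted):
// Assumes the spark-hadoopoffice-ds package is on the classpath.
val officeDf = spark.read
  .format("org.zuinnote.spark.office.excel")
  .load("/path/to/workbook.xlsx")
officeDf.printSchema()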
I have used the com.crealytics.spark.excel 0.11 version jar and wrote this in Spark Java; it would be the same in Scala too, you just need to change JavaSparkContext to SparkContext.
Dataset<Row> tempTable = new SQLContext(javaSparkContext).read()
.format("com.crealytics.spark.excel")
.option("sheetName", "sheet1")
.option("useHeader", "false") // Required
.option("treatEmptyValuesAsNulls","false") // Optional, default: true
.option("inferSchema", "false") //Optional, default: false
.option("addColorColumns", "false") //Required
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff] .schema(schema)
.schema(schema)
.load("hdfs://localhost:8020/user/tester/my.xlsx");
Hope this helps.
val df_excel = spark.read
.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "false")
.option("inferSchema", "false")
.option("addColorColumns", "false")
.load(file_path)
display(df_excel)

Apache Spark Dataframe - Load data from nth line of a CSV file

I would like to process a huge order CSV file (5 GB) with some metadata rows at the start of the file.
Header columns are represented in row 4 (starting with "h,") followed by another metadata row, describing optionality. Data rows start with "d,"
m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100
Is it possible to skip a specified number of rows when loading the file and use 'inferSchema' option for DataSet?
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("\home\user\data\20170326.csv");
Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the dataset with rows to be ignored?
You can try setting the "comment" option to "m", effectively telling the csv reader to skip lines beginning with the "m" character.
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("comment", "m")
.load("\home\user\data\20170326.csv");
