Apache Spark DataFrame - Load data from nth line of a CSV file

I would like to process a huge order CSV file (5 GB) that has some metadata rows at the start of the file.
The header columns are in row 4 (starting with "h,"), followed by another metadata row describing optionality. Data rows start with "d,":
m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100
Is it possible to skip a specified number of rows when loading the file and still use the 'inferSchema' option for the Dataset?
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("\home\user\data\20170326.csv");
Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the Dataset containing the rows to be ignored?

You can try setting the "comment" option to "m", effectively telling the csv reader to skip lines beginning with the "m" character.
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("comment", "m")
.load("/home/user/data/20170326.csv");

Related

spark-sftp not considering the single quote ("\'") option; reads the single quote as part of the value

I have to read a CSV file from an SFTP server into a Spark DataFrame. It has a column containing currency values like the ones below and another column containing text values. Since the comma is the delimiter, each currency value is contained inside single quotes.
'$1,200.00', abc
'$1,201.00', und
'$1,202.00', jsn
'$1,203.00', yhs
'$1,204.00', rfs
'$1,205.00', jsn
'$1,202.00', han
When I read this using the code below, Spark reads it as three columns when it should read it as two.
val df = sqlContext.read.format("com.springml.spark.sftp")
.option("quote","\'")
.option("host","mylocalhost")
.option("username","user")
.option("password","password")
.option("header", "false")
.option("fileType", "csv")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("path/to/file.csv")
df:
'$1, 200.00' abc

spark data read with quoted string

I have a CSV data file as given below.
Each line is terminated by a carriage return ('\r'),
but certain text values are multi-line fields that use a line feed ('\n') as the line delimiter. How can I use a Spark data source API option to handle this issue?
Spark 2.2.0 added support for parsing multi-line CSV files. You can use the following to read a CSV with multi-line values:
val df = spark.read
.option("sep", ",")
.option("quote", "")
.option("multiLine", "true")
.option("inferSchema", "true")
.csv(file_name)
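For reference, here is a minimal sketch (not from the original answer; the file contents, path, and column names are made up) of the kind of record multiLine handles:
// Hypothetical sample file /tmp/multiline_sample.csv:
//   id,comment
//   1,"first line
//   second line"
//   2,"single line"
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("multiLine", "true")     // available since Spark 2.2.0
  .csv("/tmp/multiline_sample.csv")
df.show(truncate = false)          // the quoted field keeps its embedded newline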

Spark : skip top rows with spark-excel

I have an Excel file with damaged rows at the top (the first 3 rows) which need to be skipped. I'm using the spark-excel library to read the Excel file; on their GitHub there is no such functionality, so is there a way to achieve this?
This is my code:
Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
.option("location", filePath)
.option("sheetName", "Feuil1")
.option("useHeader", "true")
.option("delimiter", "|")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false")
.load(filePath);
I have looked at the source code and there is no option for this:
https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DefaultSource.scala
You should fix your Excel file and remove the first 3 rows, or else you would need to create a patched version of the library to allow the same, which would be far more effort than having a correct Excel sheet.
This issue is fixed in spark-excel 0.9.16; see the issue link on GitHub.
You can use the skipFirstRows option (I believe it is deprecated after version 0.11)
Library Dependency : "com.crealytics" %% "spark-excel" % "0.10.2"
Sample Code :
val df = sparkSession.read.format("com.crealytics.spark.excel")
.option("location", inputLocation)
.option("sheetName", "sheet1")
.option("useHeader", "true")
.option("skipFirstRows", "2") // Mention the number of top rows to be skipped
.load(inputLocation)
Hope it helps!
Feel free to let me know in comments if you have any doubts/issues. Thanks!
skipFirstRows was deprecated in favor of the more generic dataAddress option. For your specific example, you can skip rows by specifying the start cell of your data:
Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
.option("location", filePath)
.option("useHeader", "true")
.option("delimiter", "|")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false")
.option("dataAddress", "'Feuil1'!A3") // From the docs: Start cell of the data. Reading will return all rows below and all columns to the right
.load(filePath);

How can you go about creating a csv file from an empty Dataset<Row> in spark 2.1 with headers

Spark 2.1's default behaviour is to write empty files when creating a CSV from an empty Dataset.
How can you go about creating a CSV file with headers?
This is what I am using to write the file:
dataFrame.repartition(NUM_PARTITIONS).write()
.option("header", "true")
.option("delimiter", "\t")
.option("overwrite", "true")
.option("nullValue", "null")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.csv("some/path");

Reading csv file as data frame in spark

I am new to Spark and I have a CSV file with over 1500 columns. I would like to load it as a DataFrame in Spark. I am not sure how to do this.
Thanks
Use this project: https://github.com/databricks/spark-csv
Here is an example from the front page:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
