spark data read with quoted string - apache-spark

I have a CSV data file, as shown below.
Each line is terminated by a carriage return ('\r'),
but certain text values are multi-line fields that use a line feed ('\n') as an internal delimiter. How do I use the Spark data source API options to handle this?

Spark 2.2.0 added support for parsing multi-line CSV files. You can use the following to read such a CSV:
val df = spark.read
.option("sep", ",")
.option("quote", "\"")
.option("multiLine", "true")
.option("inferSchema", "true")
.csv(file_name)
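A PySpark equivalent (a sketch; file_name is a placeholder path):
df = spark.read \
    .option("sep", ",") \
    .option("quote", "\"") \
    .option("multiLine", True) \
    .option("inferSchema", True) \
    .csv(file_name)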

Related

load data from csv with encoding utf-16le

I am using Spark version 3.1.2, and I need to load data from a CSV with encoding utf-16le.
df = spark.read.format("csv") \
    .option("delimiter", ",") \
    .option("header", True) \
    .option("encoding", "utf-16le") \
    .load(file_path)
df.show(4)
It seems Spark can only read the first line correctly; starting from the second row, I get either garbled characters or null values.
However, plain Python reads the data correctly:
with open(file_path, encoding='utf-16le', mode='r') as f:
    text = f.read()
    print(text)
Add these options while creating the Spark DataFrame from the CSV file source:
.option('encoding', 'UTF-16')
.option('multiline', 'true')
Note, however, that the multiline option ignores the encoding option in the DataFrameReader; it is not possible to use both options at the same time. You may need to fix the multi-line problems in your data first and then specify an encoding to read the characters correctly.
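If you need both, one workaround (a minimal sketch; tmp_path is a hypothetical intermediate file) is to re-encode the file to UTF-8 first, so Spark only has to deal with the multi-line records:
# decode the UTF-16LE source with plain Python and rewrite it as UTF-8
tmp_path = "/tmp/data_utf8.csv"
with open(file_path, encoding="utf-16le") as src, \
     open(tmp_path, "w", encoding="utf-8") as dst:
    dst.write(src.read())

df = spark.read.format("csv") \
    .option("header", True) \
    .option("multiline", True) \
    .load(tmp_path)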

spark-sftp not considering the single quote("\'") option. Reads single quote as part of the value

I have to read a CSV file from an SFTP server into a Spark DataFrame. It has one column containing currency values like the ones below and another column containing text values. Since the comma is the delimiter, each currency value is enclosed in single quotes.
'$1,200.00', abc
'$1,201.00', und
'$1,202.00', jsn
'$1,203.00', yhs
'$1,204.00', rfs
'$1,205.00', jsn
'$1,202.00', han
When I read this using the code below, Spark reads it as three columns when it should read it as two.
val df = sqlContext.read.format("com.springml.spark.sftp")
.option("quote","\'")
.option("host","mylocalhost")
.option("username","user")
.option("password","password")
.option("header", "false")
.option("fileType", "csv")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("path/to/file.csv")
df (the currency value is split at the comma into two columns):
'$1    200.00'    abc
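For comparison, a minimal sketch in PySpark (assuming a local copy of the file, leaving spark-sftp out of the picture) showing Spark's built-in CSV reader honoring the single-quote character:
df = spark.read.format("csv") \
    .option("delimiter", ",") \
    .option("quote", "'") \
    .option("inferSchema", True) \
    .load("path/to/file.csv")
df.show()  # '$1,200.00' should parse as a single column
If the built-in reader parses the file correctly, the problem is specific to how spark-sftp forwards the quote option.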

How to read csv with second line as header in pyspark dataframe

I am trying to load a CSV and make the second line the header. How can I achieve this? Please let me know. Thanks.
file_location = "/mnt/test/raw/data.csv"
file_type = "csv"
infer_schema = "true"
delimiter = ","
data = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", "false") \
.option("sep", delimiter) \
.load(file_location)
First read the data as an RDD, drop the first row, and then pass the RDD to spark.read.csv():
data = sc.textFile('/mnt/test/raw/data.csv')
firstRow = data.first()
# drop every line identical to the first row, so the second line becomes the header
data = data.filter(lambda row: row != firstRow)
df = spark.read.csv(data, header=True)
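If other lines in the file could be identical to the first one, an alternative sketch (not from the original answer) uses zipWithIndex to drop exactly the first physical line:
rdd = sc.textFile('/mnt/test/raw/data.csv')
# pair each line with its position, then keep everything after index 0
rdd = rdd.zipWithIndex().filter(lambda pair: pair[1] > 0).keys()
df = spark.read.csv(rdd, header=True)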
For reference on DataFrame functions, use the link below; it covers all of the DataFrame operations you will need. For a specific version of Spark, replace "latest" in the URL with the version you want:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

Write a DataFrame to csv file with a custom row/line delimiter/separator

I need to produce a delimited file where each row is separated by a '^' and columns are delimited by '|'.
There don't seem to be any options to change the row delimiter for the CSV output type.
eg:
df.coalesce(1).write\
.format("com.databricks.spark.csv")\
.mode("overwrite")\
.option("header", "true")\
.option("sep","|")\
.save(destination_path)
# no option available here for setting the row delimiter to '^'
One solution is to convert the DataFrame to an RDD:
// join the columns with '|' and terminate each row with '^'
df.rdd.map(x => x.mkString("|") + "^").saveAsTextFile("OutCSV")
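A PySpark equivalent of the same idea (a sketch):
df.rdd.map(lambda row: "|".join(str(c) for c in row) + "^") \
    .saveAsTextFile("OutCSV")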
In PySpark version 3+ there is an option to set the line separator:
df.coalesce(1).write\
.format("com.databricks.spark.csv")\
.mode("overwrite")\
.option("header", "true")\
.option("sep","|")\
.option("lineSep","^")\
.save(destination_path)

Apache Spark Dataframe - Load data from nth line of a CSV file

I would like to process a huge order CSV file (5GB) that has some metadata rows at the start.
Header columns appear in row 4 (starting with "h,"), followed by another metadata row describing optionality. Data rows start with "d,".
m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100
Is it possible to skip a specified number of rows when loading the file and still use the 'inferSchema' option for the Dataset?
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/home/user/data/20170326.csv");
Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the dataset with rows to be ignored?
You can try setting the "comment" option to "m", effectively telling the csv reader to skip lines beginning with the "m" character.
Dataset<Row> df = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("comment", "m")
.load("/home/user/data/20170326.csv");
