pyspark read csv file multiLine option not working for records which has newline spark2.3 and spark2.2 - python-3.x

I am trying to read the dat file using pyspark csv reader and it contains newline character ("\n") as part of the data. Spark is unable to read this file as single column, rather treating it as new row.
I tried using the "multiLine" option while reading , but still its not working.
spark.read.csv(file_path, schema=schema, sep=delimiter,multiLine=True)
Data is something like this. Here $ is CRLF for newline shown in vim.
name,test,12345,$
$
,desc$
name2,test2,12345,$
$
,desc2$
So pyspark is treating desc as next record.
How to read such data in pyspark .
Tried this in both spark2.2 and spark2.3 versions.

I created my own hadoop Custom Record Reader and was able to read it by invoking the api .
spark.sparkContext.newAPIHadoopFile(file_path,'com.test.multi.reader.CustomFileFormat','org.apache.hadoop.io.LongWritable','org.apache.hadoop.io.Text',conf=conf)
And in the Custom Record Reader implemented the logic to handle the newline characters encountered .

Related

Escape Backslash(/) while writing spark dataframe into csv

I am using spark version 2.4.0. I know that Backslash is default escape character in spark but still I am facing below issue.
I am reading a csv file into a spark dataframe (using pyspark language) and writing back the dataframe into csv.
I have some "//" in my source csv file (as mentioned below), where first Backslash represent the escape character and second Backslash is the actual value.
Test.csv (Source Data)
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,"//",abc,Val2
I am reading the Test.csv file and creating dataframe using below piece of code:
df = sqlContext.read.format('com.databricks.spark.csv').schema(schema).option("escape", "\\").options(header='true').load("Test.csv")
And reading the df dataframe and writing back to Output.csv file using below code:
df.repartition(1).write.format('csv').option("emptyValue", empty).option("header", "false").option("escape", "\\").option("path", 'D:\TestCode\Output.csv').save(header = 'true')
Output.csv
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,/,abc,Val2
In 2nd row of Output.csv, escape character is getting lost along with the quotes("").
My requirement is to retain the escape character in output.csv as well. Any kind of help will be much appreciated.
Thanks in advance.
Looks like you are using the default behavior .option("escape", "\\"), change this to:
.option("escape", "'")
It should work.
Let me know if this solves your problem!

difference between spark read textFile and csv

I am trying to read a text file delimited by |. I am trying this
spark.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", "|").option("inferSchema","true").csv("/tmp/file.txt").show()
I am only reading/seeing only the header but no data.
When I try the same with textFile, I am getting data but all in one column
spark.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", "|").option("inferSchema","true").textFile("/tmp/file.txt").show()
Is there a way to read data via csv? I am using spark 2.4.4
The reason for the issue was the file is in UTF16 so I had to convert it and do run dostounix on it. Thanks for your advice. Apologies I really did not know that

Pyspark dataframe.write.csv use pipe as separator cause strange characters in output file

I have a dataframe with two columns. Both are of string type.
When I tried to save the dataframe as csv with pipe as separator with following code:
df.write.csv("/outputpath/",sep="|")
the output file contains strange characters.
Please see attached screenshot.
If I instead use tab as separator sep="\t", everything looks good.
Just wonder if anyone has any idea what could go wrong here?
I'm using
Spark 2.2.0 with Python 3.4

Read CSV with linebreaks in pyspark

Read CSV with linebreaks in pyspark
I want to read with pyspark a "legal" (it follows RFC4180) CSV that has breaklines (CRLF) in some of the rows. The next code sample shows how it does seem when opened it with Notepad++:
I try to read it with sqlCtx.read.load using format ='com.databricks.spark.csv. and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark 2.1.0.2 version.
Is there any command or alternative way of reading the csv that allows me to read these two lines only as one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 there was added new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read all file and handle multiline CSV.
There is no such option in Spark 2.1. You can read file using sparkContext.wholeTextFile or just use newer verison
wholeFile does not exist (anymore?) in the spark api documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false

How to read sequence files exported from HBase

I used the following code to export an HBase table and save the output to HDFS:
hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1
Output files are binary files. If I use pyspark to read the file folder:
test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
test1.show(5)
It shows:
u'SEQ\x061org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result\x00\x00\x00\x00\x00\x00\ufffd-\x10A\ufffd~lUE\u025bt\ufffd\ufffd\ufffd&\x00\x00\x04\ufffd\x00\x00\x00'
u'\x00\x00\x00\x067-2010\ufffd\t'
u'|'
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0'
u'u'
I can tell that
'7-2010' in the 2nd line is the Rowkey,
'r' in the 4th line is the column family,
'clo-0101' in the 4th line is the column name,
'6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' is the value.
I don't know where 3rd and 5th line came from. It seems like Hbase-export followed its own rule to generate the file, if I use my own way to decode it, data might got corrupted.
Question:
How can I convert this file back to a readable format? For example:
7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0
I have tried:
test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
test1.show(5)
and
test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1'
, keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
, valueClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
, keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
, valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringCon verter'
, minSplits=None
, batchSize=100)
No luck, the code did not work, ERROR:
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
Any suggestions? Thank you!
I had this problem recently myself. I solved it by going away from sc.sequenceFile, and instead using sc.newAPIHadoopFile (or just hadoopFile if you're on the old API). The Spark SequenceFile-reader appears to only handle keys/values that are Writable types (it's stated in the docs).
If you use newAPIHadoopFile it uses the Hadoop deserialization logic, and you can specify which Serialization types you need in the config-dictionary you give it:
hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"}
sc.newAPIHadoopFile(
<input_path>,
'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
valueClass='org.apache.hadoop.hbase.client.Result',
keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
conf=hadoop_conf)
Note that the value in hadoop_conf for "io.serializations" is a comma separated list which includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key configuration you need to be able to deserialize the Result. The WritableSerialization is also needed in order to be able to deserialize ImmutableBytesWritable.
You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary.

Resources