Escape backslash (\) while writing Spark dataframe into CSV

I am using Spark version 2.4.0. I know that backslash is the default escape character in Spark, but I am still facing the issue below.
I am reading a CSV file into a Spark dataframe (using PySpark) and writing the dataframe back out as CSV.
I have some "\\" in my source CSV file (as shown below), where the first backslash represents the escape character and the second backslash is the actual value.
Test.csv (Source Data)
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,"//",abc,Val2
I am reading the Test.csv file and creating a dataframe using the piece of code below:
df = sqlContext.read.format('com.databricks.spark.csv').schema(schema).option("escape", "\\").options(header='true').load("Test.csv")
And I read the df dataframe and write it back out to the Output.csv file using the code below:
df.repartition(1).write.format('csv').option("emptyValue", empty).option("header", "false").option("escape", "\\").option("path", r'D:\TestCode\Output.csv').save(header='true')
Output.csv
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,/,abc,Val2
In the 2nd row of Output.csv, the escape character is lost along with the quotes ("").
My requirement is to retain the escape character in Output.csv as well. Any help will be much appreciated.
Thanks in advance.

It looks like you are using the default escape character via .option("escape", "\\"). Change this to:
.option("escape", "'")
It should work.
Let me know if this solves your problem!
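A minimal PySpark sketch of the suggested fix, assuming the Test.csv layout from the question: apply the single-quote escape on both read and write so the backslash survives the round trip.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("escape-demo").getOrCreate()
# Read with a single quote as the escape character instead of the default backslash
df = (spark.read
      .option("header", "true")
      .option("escape", "'")
      .csv("Test.csv"))
# Write with the same escape setting; the backslashes in the data are kept as-is
(df.repartition(1).write
   .option("header", "true")
   .option("escape", "'")
   .csv("Output"))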

Related

Pandas read_csv method can't get 'œ' character properly while using encoding ISO 8859-15

I have some trouble reading a CSV file with pandas which includes the special character 'œ'.
I've done some research and it appears that this character has been added to the ISO 8859-15 encoding standard.
I've tried to pass this encoding to the pandas read_csv method, but it doesn't properly read this special character (I get a '☐' instead) in the resulting dataframe:
df = pd.read_csv(my_csv_path, sep=";", header=None, encoding="ISO-8859-15")
Does anyone know how I could get the right 'œ' character (or even better, the string 'oe') instead?
Thanks a lot :)
As a matter of fact, I've just tried to write out the dataframe that I get from read_csv with ISO-8859-15 encoding (using the pd.to_csv method and "ISO-8859-15" encoding), and the special 'œ' character properly appears in the resulting CSV file:
df.to_csv(my_csv_full_path, sep=';', index=False, encoding="ISO-8859-15")
So it seems that pandas has properly read the special character in my CSV file but can't display it within the dataframe...
Does anyone have a clue? I've managed the problem by manually rewriting this special character before reading my CSV with pandas, but that doesn't answer my question :(
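For what it's worth, a hedged workaround sketch, assuming the file really is ISO-8859-15 (my_csv_path is the question's placeholder): read with that encoding, then normalize the ligature to plain 'oe' in every string cell.
import pandas as pd
df = pd.read_csv(my_csv_path, sep=";", header=None, encoding="ISO-8859-15")
# Replace the ligature with "oe" in every string cell
df = df.applymap(lambda v: v.replace("œ", "oe") if isinstance(v, str) else v)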

pyspark read csv file: multiLine option not working for records which have newlines (Spark 2.3 and Spark 2.2)

I am trying to read a .dat file using the pyspark CSV reader, and it contains a newline character ("\n") as part of the data. Spark is unable to read this field as a single column; it treats the newline as the start of a new row instead.
I tried using the "multiLine" option while reading, but it still doesn't work.
spark.read.csv(file_path, schema=schema, sep=delimiter, multiLine=True)
The data looks something like this; here $ is the CRLF newline as shown in vim.
name,test,12345,$
$
,desc$
name2,test2,12345,$
$
,desc2$
So pyspark is treating desc as the next record.
How can I read such data in pyspark?
I tried this in both Spark 2.2 and Spark 2.3.
I created my own custom Hadoop RecordReader and was able to read the file by invoking the API:
spark.sparkContext.newAPIHadoopFile(file_path,'com.test.multi.reader.CustomFileFormat','org.apache.hadoop.io.LongWritable','org.apache.hadoop.io.Text',conf=conf)
The custom RecordReader implements the logic to handle the newline characters encountered.
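For reference, a hedged pure-PySpark alternative (not the answerer's RecordReader): since the stray newlines here are not inside quoted fields, multiLine cannot help, but the physical lines can be stitched back together by counting delimiters. This naively assumes a fixed number of comma-separated fields per record and no commas inside the data; file_path and schema are the question's placeholders.
def merge_records(content, n_fields=4):
    records, buf = [], ""
    for line in content.splitlines():
        buf = line if not buf else buf + "\n" + line
        if buf.count(",") >= n_fields - 1:  # enough delimiters: record is complete
            records.append(buf)
            buf = ""
    return records

rdd = (spark.sparkContext
       .wholeTextFiles(file_path)               # [(path, full_content)]
       .flatMap(lambda kv: merge_records(kv[1])))
df = spark.read.csv(rdd, schema=schema)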

Pyspark dataframe.write.csv using pipe as separator causes strange characters in output file

I have a dataframe with two columns. Both are of string type.
When I tried to save the dataframe as CSV with pipe as the separator using the following code:
df.write.csv("/outputpath/",sep="|")
the output file contains strange characters.
Please see attached screenshot.
If I instead use tab as the separator, sep="\t", everything looks good.
I just wonder if anyone has any idea what could be going wrong here?
I'm using
Spark 2.2.0 with Python 3.4
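A quick, hedged way to check whether the bytes on disk are actually wrong or merely rendered oddly by the viewer is to read the output back with Spark itself (/outputpath/ is the question's placeholder):
check = spark.read.csv("/outputpath/", sep="|")
check.show(truncate=False)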

Read CSV with linebreaks in pyspark

I want to read with pyspark a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) in some of the rows. The following shows how the file looks when opened with Notepad++:
I try to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any command or alternative way of reading the csv that allows me to read these two lines only as one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 there was added new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multiline CSV.
There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles or just use a newer version.
wholeFile does not exist (anymore?) in the Spark API documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the API documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
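A short hedged usage sketch: multiLine (Spark 2.2+) relies on the embedded line breaks being inside quoted fields, as RFC 4180 requires.
df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv("file.csv"))
df.show(truncate=False)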

Unable to append "Quotes" in write for dataframe

I am trying to save a dataframe as .csv in Spark. All fields are required to be enclosed in "quotes". Currently, the fields in the file are not enclosed in quotes.
I am using Spark 2.1.0
Code :
DataOutputResult.write.format("com.databricks.spark.csv").
option("header", true).
option("inferSchema", false).
option("quoteMode", "ALL").
mode("overwrite").
save(Dataoutputfolder)
Output format (actual):
Name, Id,Age,Gender
XXX,1,23,Male
Output format (required):
"Name","Id","Age","Gender"
"XXX","1","23","Male"
Options I have tried so far:
quoteMode and quote in the options while saving the file, but with no success.
("quote", "all"), replace quoteMode with quote
or play with concat or concat_wsdirectly on df columns and save without quote - mode
import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._  // needed for the $"col" syntax
val newDF = df.select(concat($"Name", lit("\""), $"Age"))
or create your own UDF to add the desired behaviour; please find more examples in Concatenate columns in apache spark dataframe
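A hedged PySpark sketch of that manual-quoting idea (names are illustrative): wrap each column in literal double quotes, then write with quoting disabled via quote="\u0000" (discussed below) so the manual quotes survive.
from pyspark.sql import functions as F
# Wrap every column value in literal double quotes
quoted = df.select([F.concat(F.lit('"'), F.col(c), F.lit('"')).alias(c) for c in df.columns])
# Disable the writer's own quoting so our quotes are not doubled up
quoted.write.csv("out", header=True, quote="\u0000")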
Unable to add as a comment to the above answer, so posting as an answer.
In Spark 2.3.1, use quoteAll
df1.write.format("csv")
.option("header", true)
.option("quoteAll","true")
.save(Dataoutputfolder)
Also, to add to the comment of @Karol Sudol (great answer btw): .option("quote", "\u0000") will work only if you are using PySpark with Python 3, whose default encoding is 'utf-8'. A few people reported that the option did not work, probably because they were using PySpark with Python 2, whose default encoding is 'ascii', hence the error "java.lang.RuntimeException: quote cannot be more than one character".
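And a hedged PySpark equivalent of the Scala snippet above; quoteAll is a keyword argument of DataFrameWriter.csv in Spark 2.3+ (df1 and Dataoutputfolder are the names used above):
(df1.write
    .mode("overwrite")
    .csv(Dataoutputfolder, header=True, quoteAll=True))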
