Unable to append "Quotes" in write for dataframe - apache-spark

I am trying to save a dataframe as a .csv file in Spark, and all fields are required to be enclosed in double quotes. Currently, the fields in the output file are not quoted.
I am using Spark 2.1.0
Code:
DataOutputResult.write.format("com.databricks.spark.csv")
  .option("header", true)
  .option("inferSchema", false)
  .option("quoteMode", "ALL")
  .mode("overwrite")
  .save(Dataoutputfolder)
Output format (actual):
Name, Id,Age,Gender
XXX,1,23,Male
Output format (required):
"Name","Id","Age","Gender"
"XXX","1","23","Male"
Options I tried so far:
quoteMode and quote in the write options, but with no success.

("quote", "all"), replace quoteMode with quote
or play with concat or concat_wsdirectly on df columns and save without quote - mode
import org.apache.spark.sql.functions.{concat, lit}
val newDF = df.select(concat($"Name", lit("\""), $"Age"))
or create your own UDF to add the desired behaviour; please find more examples in Concatenate columns in apache spark dataframe
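For reference, a rough PySpark sketch of the same concat idea, assuming a dataframe df with Name and Age columns (names taken from the question); it wraps each value in literal double quotes and then writes with the standard quote character disabled via the "\u0000" trick mentioned further down in this thread. Treat it as a sketch, not tested code.
from pyspark.sql.functions import concat, lit, col

# Wrap each value in literal double quotes before writing.
quoted = df.select(
    concat(lit('"'), col("Name"), lit('"')).alias("Name"),
    concat(lit('"'), col("Age").cast("string"), lit('"')).alias("Age"),
)

# Write with the quote character effectively disabled so the manually
# added quotes are not escaped again by the CSV writer.
quoted.write \
    .option("header", "true") \
    .option("quote", "\u0000") \
    .mode("overwrite") \
    .csv("Dataoutputfolder")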

Unable to add as a comment to the above answer, so posting as an answer.
In Spark 2.3.1, use quoteAll
df1.write.format("csv")
.option("header", true)
.option("quoteAll","true")
.save(Dataoutputfolder)
Also, to add to the comment of @Karol Sudol (great answer btw): .option("quote", "\u0000") will work only if you are using PySpark with Python 3, whose default encoding is 'utf-8'. A few users reported that the option did not work; they were most likely using PySpark with Python 2, whose default encoding is 'ascii', hence the error "java.lang.RuntimeException: quote cannot be more than one character".
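For completeness, a minimal PySpark sketch of the quoteAll approach described above, assuming Spark 2.3+; the sample data and output path are made up for illustration.
# Build a tiny dataframe matching the question's sample row.
df1 = spark.createDataFrame([("XXX", 1, 23, "Male")], ["Name", "Id", "Age", "Gender"])

df1.write \
    .format("csv") \
    .option("header", "true") \
    .option("quoteAll", "true") \
    .mode("overwrite") \
    .save("/tmp/quoted_output")
# Resulting rows look like: "Name","Id","Age","Gender" and "XXX","1","23","Male"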

Related

Column names appearing as record data in Pyspark databricks

I'm working with PySpark (Python). I downloaded a sample csv file from Kaggle (Covid Live.csv) and the data from the table is as follows when opened in Visual Studio Code
(raw CSV data, only partial data shown):
#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........
The problem I'm facing here is that the column names are also being displayed as records in the PySpark Databricks console when executed with the code below:
from pyspark.sql.types import *
df1 = spark.read.format("csv") \
.option("inferschema", "true") \
.option("header", "true") \
.load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv") \
.select("*")
Spark Jobs -->
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string
As can be observed above, Spark is detecting only two columns, # and Country, and is not aware that 'Total Cases', 'Total Deaths', etc. are also columns.
How do I tackle this malformation?
A few ways to go about this:
1. Fix the header in the csv before reading (it should be on a single line). Also pay attention to quoting and escape settings.
2. Read in PySpark with a manually provided schema and filter out the bad lines.
3. Read using pandas, skip the first 12 lines, add proper column names, and convert to a PySpark dataframe.
So, the solution is pretty simple and does not require you to 'edit' the data manually or anything of that sort.
I just had to add .option("multiLine", "true") \ and the data is displaying as desired!
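For clarity, here is how the full read looks with the multiLine option added (same DBFS path as in the question); with this option Spark treats the quoted line breaks in the header as part of a single record instead of starting new rows.
df1 = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv")

df1.printSchema()  # now shows 'Total Cases', 'Total Deaths', ... as separate columns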

es.read.source.filter vs. es.read.field.include when reading data with elasticsearch-hadoop

When reading data from Elasticsearch with elasticsearch-hadoop, there are two options to specify how to read a subset of fields from the source, according to the official documentation, i.e.:
es.read.field.include: Fields/properties that are parsed and considered when reading the documents from Elasticsearch...;
es.read.source.filter: ...this property allows you to specify a comma delimited string of field names that you would like to return from Elasticsearch.
Both can be set as a comma-separated string.
I have tested the two options, and found that only es.read.field.include works as expected, while es.read.source.filter has no effect (all data fields are returned).
Question: what is the difference between these two options, and why does es.read.source.filter not take effect?
Here is the code for reading the data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TestSparkElasticsearch').getOrCreate()
es_options = {"nodes": "node1,node2,node3", "port": 9200}
df = spark.read\
.format("org.elasticsearch.spark.sql")\
.options(**es_options)\
.option("es.read.source.filter", 'sip,dip,sport,dport,protocol,tcp_flag')\
.load("test_flow_tcp")
df.printSchema() # returns all the fields; the option does not work
df = spark.read\
.format("org.elasticsearch.spark.sql")\
.options(**es_options)\
.option("es.read.field.include", 'sip,dip,sport,dport,protocol,tcp_flag')\
.load("test_flow_tcp")
df.printSchema() # returns only the specified fields, as expected
Update: es.read.source.filter was added in Elasticsearch 5.4 and I am using version 5.1, hence the option is ignored. Nevertheless, the difference between these two options in newer versions is still not clearly explained in the documentation.

Escape Backslash(\) while writing spark dataframe into csv

I am using Spark version 2.4.0. I know that backslash is the default escape character in Spark, but I am still facing the issue below.
I am reading a csv file into a spark dataframe (using pyspark language) and writing back the dataframe into csv.
I have some "//" in my source csv file (as mentioned below), where first Backslash represent the escape character and second Backslash is the actual value.
Test.csv (Source Data)
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,"//",abc,Val2
I am reading the Test.csv file and creating a dataframe using the piece of code below:
df = sqlContext.read.format('com.databricks.spark.csv').schema(schema).option("escape", "\\").options(header='true').load("Test.csv")
And then reading the df dataframe and writing it back to the Output.csv file using the code below:
df.repartition(1).write.format('csv').option("emptyValue", empty).option("header", "false").option("escape", "\\").option("path", r'D:\TestCode\Output.csv').save(header = 'true')
Output.csv
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,/,abc,Val2
In 2nd row of Output.csv, escape character is getting lost along with the quotes("").
My requirement is to retain the escape character in output.csv as well. Any kind of help will be much appreciated.
Thanks in advance.
It looks like you are using the default escape behaviour, .option("escape", "\\"). Change this to:
.option("escape", "'")
It should work.
Let me know if this solves your problem!
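Applied to the write from the question, the change looks roughly like this (a simplified sketch: the Windows path is the questioner's, and the header handling is reduced to a single option):
# Write the dataframe back out with a single-quote escape character so
# the embedded backslashes survive the round trip.
df.repartition(1).write.format('csv') \
    .option("header", "true") \
    .option("escape", "'") \
    .save(r'D:\TestCode\Output.csv')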

Pulling log file directory name into the Pyspark dataframe

I have a bit of a strange one. I have loads of logs that I need to trawl. I have done that successfully in Spark & I am happy with it.
However, I need to add one more field to the dataframe, which is the data center.
The only place that the datacenter name can be derived is from the directory path.
For example:
/feedname/date/datacenter/another/logfile.txt
What would be the way to extract the log file path and inject it into the dataframe? From there, I can do some string splits & extract the bit I need.
My current code:
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.withColumn("Datacenter", input_file_name())\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data.printSchema()
mpe_data.createOrReplaceTempView("mpe")
You can get the file path using input_file_name in Spark 2.0+:
from pyspark.sql.functions import input_file_name
df.withColumn("Datacenter", input_file_name())
Adding your piece of code as an example: once you have read your file, use withColumn to get the file name.
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data = mpe_data.withColumn("Datacenter", input_file_name())
mpe_data.printSchema()
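And, since the question mentions doing string splits to extract the datacenter segment, here is a hedged sketch of that last step; the index is illustrative and depends on how deep the datacenter directory sits in your real HDFS path.
from pyspark.sql.functions import input_file_name, split

# Keep only the datacenter directory segment of the path. The index 7 is
# chosen for a path like hdfs://nameservice/data/feed/mpe/dt=20191013/<datacenter>/.../logfile.txt;
# adjust it to match your actual layout.
mpe_data = mpe_data.withColumn("Datacenter", split(input_file_name(), "/").getItem(7))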

Read CSV with linebreaks in pyspark

I want to read with pyspark a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) in some of the rows. The next sample shows how it looks when opened with Notepad++:
I try to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any command or alternative way of reading the csv that allows me to read these two lines only as one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 a new option was added - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multiline CSV.
There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles or just use a newer version.
wholeFile does not exist (anymore?) in the Spark API documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
