Spark: Store DataFrame as CSV with unicode separator \u2592

Looking to store a Spark DataFrame as CSV, but the columns need to be separated with the unicode character \u2592.
Considering my DataFrame name is myDf:
myDf.write.option("header", true)
  .option("encoding", "......")
  .option("delimiter", ".....")
  .csv(s"$path")
The data should look like:
my_cd▒my_cd▒flag_cd
00000051▒R▒Y
00000051▒R▒Y
0000007a▒D▒Y

Finally I found the solution. I had been trying delimiter; you can pass any desired unicode separator as below: option("sep", "\u2592"). It worked for me.
myDf.write.option("header", true)
  .option("sep", "\u2592")
  .csv(s"$path")

Related

Read a CSV that contains an array of strings in PySpark

I'm trying to read a csv that has the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema results in the stops field spilling over into the next columns and messing up the DataFrame.
If I give my own schema like:
from pyspark.sql.types import (StructType, StructField, StringType,
    TimestampType, BooleanType, ArrayType, DoubleType)

schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])
it results in this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
So how would I properly read the CSV without this failure?
Since CSV doesn't support arrays, you need to read the column as a string first, then convert it.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Set the escape option to '"', since it is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
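Since this reads every column as a plain string, the remaining columns can be cast afterwards. A short follow-up sketch (the cast targets are assumptions based on the sample data):

df = (df
    .withColumn('date', F.col('date').cast('timestamp'))
    .withColumn('win', F.col('win').cast('boolean'))
    .withColumn('cost', F.col('cost').cast('double')))
df.printSchema()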
I guess this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if it helps

load data from csv with encoding utf-16le

I am using Spark version 3.1.2, and I need to load data from a CSV with encoding utf-16le.
df = (spark.read.format("csv")
    .option("delimiter", ",")
    .option("header", True)
    .option("encoding", "utf-16le")
    .load(file_path))
df.show(4)
It seems Spark can only read the first line normally; starting from the second row, it returns either garbled characters or null values.
However, Python can read the data correctly with this code:
with open(file_path, encoding='utf-16le', mode='r') as f:
    text = f.read()
print(text)
The print result shows the data read correctly.
Add these options while creating the Spark DataFrame from the CSV file source:
.option('encoding', 'UTF-16')
.option('multiline', 'true')
Note that with the DataFrameReader, the multiline option causes the encoding option to be ignored; it is not possible to use both options at the same time.
Maybe you can fix the multiline problems in your data first and then specify an encoding to read the characters correctly.
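If the two options can't be combined, one hedged workaround is to decode the file in plain Python on the driver and hand the decoded lines to Spark's CSV parser, which accepts an RDD of strings. A sketch (assumes the file fits in driver memory and is readable from the driver):

# Decode UTF-16LE in Python, then let Spark parse the CSV rows.
with open(file_path, encoding='utf-16le', mode='r') as f:
    lines = f.read().splitlines()
rdd = spark.sparkContext.parallelize(lines)
df = spark.read.option("header", True).csv(rdd)
df.show(4)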

spark-sftp not considering the single quote ("\'") option. Reads single quote as part of the value

I have to read a CSV file from an SFTP server into a Spark DataFrame. It has a column containing currency values like the ones below and another column containing text values. Since the comma is the delimiter, the currency value is contained inside single quotes.
'$1,200.00', abc
'$1,201.00', und
'$1,202.00', jsn
'$1,203.00', yhs
'$1,204.00', rfs
'$1,205.00', jsn
'$1,202.00', han
When I read this using the code below, Spark reads it as three columns when it should read it as two.
val df = sqlContext.read.format("com.springml.spark.sftp")
.option("quote","\'")
.option("host","mylocalhost")
.option("username","user")
.option("password","password")
.option("header", "false")
.option("fileType", "csv")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("path/to/file.csv")
df:
'$1, 200.00' abc
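For comparison, Spark's built-in CSV reader does honor a single-quote quote character, so the issue appears specific to the spark-sftp connector. A minimal PySpark sketch against a local copy of the file (hypothetical path, not using spark-sftp):

# With quote set to ', the comma inside '$1,200.00' no longer splits the field.
df = (spark.read
    .option("quote", "'")
    .option("delimiter", ",")
    .option("header", False)
    .option("inferSchema", True)
    .csv("path/to/file.csv"))
df.show()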

Write a DataFrame to a CSV file with a custom row/line delimiter/separator

I need to produce a delimited file where each row is separated by a '^' and columns are delimited by '|'.
There don't seem to be any options to change the row delimiter for the CSV output type.
E.g.:
(df.coalesce(1).write
    .format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .option("sep", "|")
    # no option for setting lineSep to '^'
    .save(destination_path))
One solution is to convert the DataFrame to an RDD:
df.rdd.map(x => x.mkString("^")).saveAsTextFile("OutCSV")
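A comparable PySpark sketch that matches the question's exact requirement ('|' between columns, '^' between rows); it collects the rows to the driver, so it assumes the output is small:

# Join columns with '|' and rows with '^', then write a single text file.
rows = df.rdd.map(lambda r: "|".join("" if c is None else str(c) for c in r))
payload = "^".join(rows.collect())
spark.sparkContext.parallelize([payload], 1).saveAsTextFile(destination_path)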
In PySpark version 3+ there is an option to set the line separator:
df.coalesce(1).write\
.format("com.databricks.spark.csv")\
.mode("overwrite")\
.option("header", "true")\
.option("sep","|")\
.option("lineSep","^")\
.save(destination_path)
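The same pair of options can be used to read the file back as a sanity check (Spark 3+; a quick sketch):

check = (spark.read
    .option("header", "true")
    .option("sep", "|")
    .option("lineSep", "^")
    .csv(destination_path))
check.show()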

spark data read with quoted string

I have a CSV data file as given below.
Each line is terminated by a carriage return ('\r'), but certain text values are multiline fields having a line feed ('\n') as the line delimiter. How do I use the Spark data source API options to handle this issue?
Spark 2.2.0 added support for parsing multi-line CSV files. You can use the following to read a CSV with multi-line values:
val df = spark.read
  .option("sep", ",")
  .option("quote", "\"")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(file_name)
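If the rows really are terminated by '\r' as the question describes, Spark 3+ also lets you set the record separator on read, so a hedged PySpark variant (whether lineSep is needed alongside multiLine depends on the data) could be:

df = (spark.read
    .option("sep", ",")
    .option("lineSep", "\r")  # '\r' ends a record; embedded '\n' stays inside fields
    .option("inferSchema", True)
    .csv(file_name))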
