Special characters not encoded properly when creating Spark dataframe from parquet - apache-spark

My input parquet file has a column defined as optional binary title (UTF8);, which may include special characters such as the German umlaut (e.g. Schrödinger).
When using Spark to load the contents of the parquet file into a DataFrame, the row does not show the value Schrödinger correctly; the ö comes back garbled. I believe the best explanation of why this could be happening is answered here, though I was under the impression that Spark reads parquet files as UTF-8 by default anyway.
I have attempted to force UTF-8 encoding by using the option argument as described here, but still no luck. Any suggestions?

Can you try the CP1252 encoding? It worked for us for most of the special characters that did not come through correctly as UTF-8.
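To illustrate the kind of mismatch being discussed, here is a minimal plain-Python sketch (not Spark code). It assumes the garbling comes from UTF-8 bytes being decoded as CP1252 somewhere along the way, which the question does not confirm; the variable names are illustrative only.

```python
# Assumption: the text was stored as UTF-8 but decoded as CP1252 at some point.
original = "Schrödinger"

# Encode correctly as UTF-8, then decode with the wrong charset.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # the single character ö turns into two characters

# Undoing the wrong decode step recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

If the garbled value in the DataFrame matches what this produces, the bytes in the file are probably fine and the wrong charset is being applied on read or on display.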

Related

SPARK encoding issue while reading a csv with multiline=true option

I am stuck on an issue while reading a CSV file in Spark with the multiline=true option; the file contains characters like Ř and Á. The CSV is read as UTF-8, but when we read the data with multiline=true we get characters that do not match the ones in the file, something like ŘÃ�. So a word such as ZŘÁKO gets transformed to ZŘÃ�KO. I went through several other Stack Overflow questions about the same issue, but none of the solutions actually works!
I tried the following encodings for the read/write operations: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', SJIS and a couple more, but none of them gave me the expected result. Somehow, multiline=false generates the correct output.
I cannot read/write the file as text, because the project's current policy is built around an ingestion framework where we read the file only once and everything else is expected to happen in memory, and I must set multiline to true.
I would really appreciate any thoughts on this matter. Thank you!
sample data:
id|name
1|ZŘÁKO
df = (spark.read.format('csv')
      .option('header', True)
      .option('delimiter', '|')
      .option('multiline', True)
      .option('encoding', 'utf-8')
      .load())
df.show()
output:
1|Z�KO
#trying to force utf-8 encoding as below :
df.withColumn("name", sql.functions.encode("name", 'utf-8'))
gives me this :
1|[22 5A c3..]
I tried the above steps with all the supported encodings in spark
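One detail about the last step: encoding turns text into raw bytes rather than repairing it, which would be consistent with the byte listing shown above. Here is a plain-Python sketch of the same idea, using the sample value from the question (this is not Spark code, and the variable names are illustrative):

```python
# Encoding a string yields bytes; it does not fix text that was mis-decoded.
name = "ZŘÁKO"
raw = name.encode("utf-8")

# Show the bytes as hex, similar to how a binary column is displayed.
hex_view = raw.hex(" ")
print(hex_view)

# Decoding with the same charset simply gives the text back unchanged.
round_trip = raw.decode("utf-8")
print(round_trip == name)
```

So if the string is already wrong after the read, encoding it afterwards only exposes the wrong bytes; the fix has to happen at read time.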

Encoding data to ISO_8859_1 in Bigquery using pyspark

I have multi-language characters in my pyspark dataframe. After writing the data to BigQuery, it shows me strange characters because of its default encoding scheme (UTF-8).
How can I change encoding in Bigquery to ISO_8859_1 using pyspark / dataproc?
The issue turned out to be in the source file itself, which comes through an API, so I was able to resolve it there.
First, check the source system: find out how it sends the data and which encoding it uses. If the result is still wrong, investigate as follows.
AFAIK PySpark reads the JSON as UTF-8 and loads it into BigQuery, as per your comments, so it is not BigQuery's fault, since its default is UTF-8.
You can change the encoding to ISO-8859-1 when loading the JSON, like below:
spark.read.option('encoding','ISO-8859-1').json("yourjsonpathwith latin-1 ")
and then load it into BigQuery.
Also, to test and debug where it goes wrong, you can apply the decode function to the column with both the ISO-8859-1 and UTF-8 charsets:
pyspark.sql.functions.decode(columnname, charset)
You can also write the DataFrame with pyspark.sql.functions.decode(col, charset) applied.
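The same "try both charsets" check can be sketched in plain Python before involving Spark or BigQuery. The sample payload and the helper name try_decode are made up for illustration; the only assumption is that you can get at a few raw bytes from the source.

```python
# Hypothetical sample: the bytes for "café" as a Latin-1 source would send them.
payload = "café".encode("iso-8859-1")

def try_decode(data, charset):
    """Return the decoded text, or None if the bytes are invalid for charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None

# Decode the same bytes under each candidate charset and compare the results.
results = {cs: try_decode(payload, cs) for cs in ("utf-8", "iso-8859-1")}
for charset, text in results.items():
    print(charset, "->", text)
```

A charset that fails outright can be ruled out. One that decodes without error still needs a visual check, because ISO-8859-1 accepts any byte sequence.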

Pandas read_csv method can't get 'œ' character properly while using encoding ISO 8859-15

I am having some trouble using pandas to read a csv file that includes the special character 'œ'.
I've done some research and it appears that this character was added in the ISO 8859-15 encoding standard.
I've tried to specify this encoding in the pandas read_csv method, but it doesn't properly get this special character (I get a '☐' instead) in the resulting dataframe:
df = pd.read_csv(my_csv_path, sep=";", header=None, encoding="ISO-8859-15")
Does someone know how I could get the right 'œ' character (or even better, the string 'oe') instead of this?
Thanks a lot :)
As a matter of fact, I've just tried to write out the dataframe that I get from read_csv with the ISO-8859-15 encoding (using the DataFrame.to_csv method and the "ISO-8859-15" encoding), and the special 'œ' character appears properly in the resulting csv file:
df.to_csv(my_csv_full_path, sep=';', index=False, encoding="ISO-8859-15")
So it seems that pandas has read the special character in my csv file properly but can't show it within the dataframe...
Does anyone have a clue? I've worked around the problem by manually rewriting this special character before reading my csv with pandas, but that doesn't answer my question :(
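A quick check outside pandas can help separate a decoding problem from a display problem. The sketch below uses only the standard library; the sample bytes are made up, and it assumes the file really is ISO 8859-15. If the decoded value compares equal to 'œ', the '☐' is more likely a font or console rendering issue than a read error, though that is an assumption about this setup.

```python
# Assumption: the file really is ISO 8859-15; the sample bytes are made up.
raw = b"c\xbdur;1\n"

as_latin9 = raw.decode("iso-8859-15")  # byte 0xBD is œ in ISO 8859-15
as_latin1 = raw.decode("iso-8859-1")   # the same byte is ½ in ISO 8859-1

print(as_latin9 == "cœur;1\n")
print(as_latin1 == "c½ur;1\n")

# Replace the ligature with plain "oe" once the text is decoded.
plain = as_latin9.replace("œ", "oe")
print(plain)
```

The same replacement could be applied to a string column after read_csv, for example with Series.str.replace.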

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
some part of the data has '\xa0' prefixed to every word, and other part of the data doesn't have that special character. I am attaching 2 pictures, one with '\xa0', and another without '\xa0'. The content shown in 2 pictures belong to same file. Only some part of the data from same file is read that way by Spark. I have checked the original data file present in HDFS, and it was problem free.
I feel that it has something to do with encoding. I tried several approaches, such as using replace in flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but none worked for me. This question might sound dumb, but I am a newbie with Apache Spark, and I need some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file.
One solution is to read your file with sc.binaryFiles and then apply the expected encoding.
sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is the file content as bytes.
If you need to keep only the text and apply a decoding function:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding, depending on your file
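For reference, '\xa0' is the non-breaking space (U+00A0), which is stored as the two bytes C2 A0 in UTF-8. Here is a minimal plain-Python sketch of decoding first and then normalising it before splitting; the sample bytes are made up, and this only approximates what the lambda inside map or flatMap would do on decoded text.

```python
# Hypothetical sample: UTF-8 bytes where two words are joined by U+00A0.
raw = b"alpha\xc2\xa0beta gamma"

# Decode with the charset the file was actually written in.
text = raw.decode("utf-8")
print(repr(text))  # the non-breaking space shows up as '\xa0'

# Turn non-breaking spaces into ordinary spaces, then split into words.
words = text.replace("\u00a0", " ").split(" ")
print(words)
```

The replacement only matches once the data is proper Unicode text, which is why the decode step comes first.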

Read CSV with linebreaks in pyspark

I want to use pyspark to read a "legal" CSV (it follows RFC4180) that has line breaks (CRLF) in some of the rows. The next code sample shows how it looks when opened with Notepad++:
I try to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any command or alternative way of reading the csv that allows me to read these two lines only as one?
You can use "csv" instead of the Databricks CSV format; the latter now redirects to the default Spark reader. But that's only a hint :)
In Spark 2.2 a new option, wholeFile, was added. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multiline CSV.
There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles or just use a newer version.
wholeFile does not exist (anymore?) in the spark api documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
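To make "records which may span multiple lines" concrete, here is a small sketch using Python's standard csv module rather than Spark. The sample data is made up; it shows an RFC 4180 style record whose quoted field contains a CRLF being parsed as one row, which is the behaviour the multiLine option is meant to give in Spark.

```python
import csv
import io

# Made-up sample: the second field of the data row contains a CRLF inside quotes.
data = 'id,comment\r\n1,"first line\r\nsecond line"\r\n'

# newline="" lets the csv module handle the line endings itself.
rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(rows))  # header plus one data record
print(rows[1])
```

A reader that splits on line endings before parsing would see three lines here instead of two records, which matches the symptom described in the question.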
