Encoding data to ISO_8859_1 in Bigquery using pyspark - apache-spark

I have multi-language characters in my PySpark DataFrame. After writing the data to BigQuery, it shows strange characters because of its default encoding scheme (UTF-8).
How can I change the encoding in BigQuery to ISO_8859_1 using PySpark / Dataproc?

There was an issue in the source file itself, as it comes through an API. Knowing that, I was able to resolve the issue.

First, check the source system: find out how it sends the data and which encoding it uses. If the encoding still differs from what you expect, investigate as follows.
AFAIK PySpark is reading the JSON as UTF-8 and loading it into BigQuery, as per your comments. So it's not BigQuery's fault, since its default is UTF-8.
You can set the encoding to ISO-8859-1 when reading the JSON, like below:
spark.read.option('encoding', 'ISO-8859-1').json("yourjsonpathwith latin-1 ")
and then load it into BigQuery.
Also, before writing the DataFrame to BigQuery, you can test/debug with the decode function, passing the column and a charset (try both ISO-8859-1 and UTF-8), to understand where it's going wrong:
pyspark.sql.functions.decode(columnname, charset)
This shows whether the column can be decoded to UTF-8 or not.
You can then write the DataFrame with the column transformed by pyspark.sql.functions.decode(col, charset).
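A plain-Python sketch of the debugging idea above: decode the same raw bytes under both charsets and compare the results. It uses only the standard library, not Spark; the sample string and the helper name decode_both are illustrative assumptions, not something from the original post.

```python
# Hypothetical sample: text that the source system encoded as ISO-8859-1.
raw = "caf\u00e9 se\u00f1or".encode("iso-8859-1")


def decode_both(data: bytes) -> dict:
    """Decode the same bytes as ISO-8859-1 and as UTF-8 for comparison."""
    return {
        "iso-8859-1": data.decode("iso-8859-1"),
        # errors="replace" substitutes U+FFFD for invalid UTF-8 sequences
        # instead of raising, so the damage stays visible.
        "utf-8": data.decode("utf-8", errors="replace"),
    }


results = decode_both(raw)
for charset, text in results.items():
    print(f"{charset:>10}: {text}")
```

If the UTF-8 line shows replacement characters while the ISO-8859-1 line reads cleanly, the bytes were most likely not written as UTF-8 in the first place.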

Related

SPARK encoding issue while reading a csv with multiline=true option

I am stuck on an issue while trying to read a CSV file with the multiline=true option in Spark. The file has characters like Ř and Á. The CSV is read as UTF-8, but when we read the data with multiline=true we get characters that are not equivalent to the ones in the file, something like ŘÃ�. So a word that should read ZŘÁKO gets transformed to ZŘÃ�KO. I went through several other questions on Stack Overflow about the same issue, but none of the solutions actually works!
I tried the following encodings for the read/write operations: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', SJIS and a couple more, but none of them gave me the expected result. Somehow multiline=false generates the correct output.
I cannot read/write the file as text, because the project's ingestion framework reads the file only once and then everything is expected to be done in memory, and I must use multiline=true.
I would really appreciate any thoughts on this matter. Thank you!
sample data:
id|name
1|ZŘÁKO
df = (spark.read.format('csv').option('header', True)
      .option('delimiter', '|').option('multiline', True).option('encoding', 'utf-8').load())
df.show()
output:
1|Z�KO
#trying to force utf-8 encoding as below :
df.withColumn("name", sql.functions.encode("name", 'utf-8'))
gives me this :
1|[22 5A c3..]
I tried the above steps with all the supported encodings in spark
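One hedged observation about the garbled sample: the pair Ã� is what the UTF-8 bytes of Á look like when they are decoded as Windows-1252 with invalid bytes replaced. The standard-library sketch below reproduces only that part. It does not explain why Ř survived in the question, and it is not a claim about what Spark does internally with multiline=true.

```python
# UTF-8 encodes Á (U+00C1) as the two bytes C3 81.
utf8_bytes = "\u00c1".encode("utf-8")
print(utf8_bytes.hex())

# Python's cp1252 codec maps C3 to Ã and leaves 81 undefined, so decoding
# with errors="replace" yields Ã followed by the replacement character.
garbled = utf8_bytes.decode("cp1252", errors="replace")
print(garbled)

# Decoding the same bytes as UTF-8 gives the original character back.
restored = utf8_bytes.decode("utf-8")
print(restored)
```

If this pattern matches, some step in the pipeline is treating UTF-8 bytes as a single-byte Windows charset, which narrows down where to look.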

Special characters not encoded properly when creating Spark dataframe from parquet

My input parquet file has a column defined as optional binary title (UTF8);, which may include special characters such as the German umlaut (e.g. Schrödinger).
When using Spark to load the contents of the parquet file into a DataFrame, the value Schrödinger is loaded as SchrÃ¶dinger. I believe the best explanation of why this could be happening is answered here, though I was under the impression that Spark reads parquet files as UTF-8 by default anyway.
I have attempted to force UTF-8 encoding by using the option argument as described here, but still no luck. Any suggestions?
Can you try the CP1252 encoding? It worked for us for most of the special characters that were not coming through correctly as UTF-8.
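A standard-library sketch of one common cause of this kind of garbling: text that is valid UTF-8 gets decoded as Latin-1 (or CP1252) somewhere along the way. Whether that is what happened with this particular parquet file is an assumption, and the variable names are illustrative. When that is the cause, the damage is reversible, because Latin-1 maps every byte to exactly one character.

```python
original = "Schr\u00f6dinger"  # Schrödinger

# Mis-decoding: UTF-8 bytes interpreted as Latin-1. The two bytes of ö
# (C3 B6) become the two characters Ã and ¶.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Repair: turn the characters back into the original bytes, then decode
# them with the charset they were actually written in.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The round trip only works if no step in between dropped or replaced bytes; once a replacement character has been substituted, the original byte is gone.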

PySpark :- How to pass encoding format to sc.newAPIHadoopFile config

I need to read an ISO-8859-1 encoded file, do some operations, and create a parquet file.
I am reading the file using sc.newAPIHadoopFile() and creating a PySpark DataFrame, but I cannot find a config property to pass the encoding to sc.newAPIHadoopFile().
I know we can pass the encoding to the DataFrame reader by using the encoding option:
df = spark.read.format(
"com.databricks.spark.csv").schema(schema).option(
"escape", '\"').option(
"quote", '\"').option(
"header", "false").option(
"encoding", "ISO-8859-1").csv(rdd)
But is there a way to pass the encoding directly to sc.newAPIHadoopFile()?

what encoding are files after being dumped by nutch?

I have been using the readseg function to dump data after crawling with nutch. But I have been having encoding issues. What encoding are the files after being dumped by nutch?
The HTML content is still in the original encoding. Starting with Nutch 1.17 it can be optionally converted to UTF-8, see NUTCH-2773. You need to set the property segment.reader.content.recode to true. Of course, this will not work for binary document formats.
All other data (metadata, extracted plain-text) is always encoded in UTF-8 when segments are dumped.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte while accessing csv file

I am trying to access a CSV file from an AWS S3 bucket and get the error 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. The code is below; I am using Python 3.7.
from io import StringIO
import boto3
import pandas as pd
import gzip

s3 = boto3.client('s3', aws_access_key_id='######',
                  aws_secret_access_key='#######')
response = s3.get_object(Bucket='#####', Key='raw.csv')
# print(response)
# Decoding the body as UTF-8 is the step that raises the UnicodeDecodeError.
s3_data = StringIO(response.get('Body').read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Kindly help me out here: how can I resolve this issue?
Using gzip worked for me:
client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
csv_obj = client.get_object(Bucket='####', Key='###')
body = csv_obj['Body']
# Decompress the body while reading it as text, then parse the CSV.
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)
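The byte 0x8b at position 1 is consistent with gzip's two-byte magic number 1f 8b, which would explain why gzip helps here. Below is a standard-library sketch that reproduces the error on in-memory data and then checks for the magic number before decoding. The sample CSV and the helper name read_text are assumptions for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decompress gzip data if the magic number is present, then decode."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Hypothetical stand-in for the bytes returned by the S3 object body.
compressed = gzip.compress("id,name\n1,alice\n".encode("utf-8"))

try:
    compressed.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Same failure as in the question: 0x8b is not a valid UTF-8 start byte.
    error = str(exc)
print(error)

text = read_text(compressed)
print(text)
```

Checking the first two bytes before decoding lets the same code path handle both compressed and uncompressed objects.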
The error you're getting means the CSV file you're getting from this S3 bucket is not encoded using UTF-8.
Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.
If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).
So maybe try .decode('windows-1252')?
If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.
Finally, I suggest that, instead of using an explicit decode() and using a StringIO object for the contents of the file, store the raw bytes in an io.BytesIO and have pd.read_csv() decode the CSV by passing it an encoding argument.
import io
s3_data = io.BytesIO(response.get('Body').read())
data = pd.read_csv(s3_data, encoding='windows-1252')
As a general practice, you want to delay decoding as much as you can. In this particular case, having access to the raw bytes can be quite useful, since you can write a copy of them to a local file that you can then inspect with a text editor or in Excel.
Also, if you want to detect the encoding (using chardet, for example), you need to do so before you decode, which again requires the raw bytes. That's yet another advantage of using the BytesIO here.
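The same delay-the-decoding idea with only the standard library, as a sketch: keep the raw bytes in a BytesIO, try candidate encodings in order, and decode only at read time. The sample bytes and the helper name first_working_encoding are assumptions. Note that Python's own latin-1 codec maps every byte, including 0x8b, to a code point, so it never raises and belongs last in the candidate list.

```python
import csv
import io

# Hypothetical raw CSV bytes containing 0x8b, as in the error message.
raw = b"id,name\n1,\x8bquoted\n"


def first_working_encoding(data: bytes, candidates) -> str:
    """Return the first candidate that decodes the data without error."""
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


encoding = first_working_encoding(raw, ["utf-8", "windows-1252", "latin-1"])

# Decode lazily: wrap the untouched bytes and let the reader decode them.
buffer = io.BytesIO(raw)
rows = list(csv.reader(io.TextIOWrapper(buffer, encoding=encoding, newline="")))
print(encoding, rows)
```

A candidate that decodes without error is only a plausible guess, not proof of the real encoding, so it is still worth inspecting a few decoded rows by eye.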
