I need to read an ISO-8859-1 encoded file, perform some operations, and write the result as a Parquet file.
I am reading the file with sc.newAPIHadoopFile() and creating a PySpark DataFrame from it, but I cannot find a configuration property for passing the encoding format to sc.newAPIHadoopFile().
I know we can pass the encoding format to a DataFrame via the encoding option:
df = (spark.read.format("com.databricks.spark.csv")
      .schema(schema)
      .option("escape", '\"')
      .option("quote", '\"')
      .option("header", "false")
      .option("encoding", "ISO-8859-1")
      .csv(rdd))
but is there still a way to pass the encoding format directly to sc.newAPIHadoopFile()?
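As far as I know, Hadoop's TextInputFormat assumes UTF-8, so one possible workaround (a sketch, not a documented newAPIHadoopFile option; the path is a placeholder) is to read the raw bytes and decode them yourself:

# Read whole files as bytes and decode them manually with ISO-8859-1.
# Suits files that individually fit in executor memory.
rdd = (sc.binaryFiles("/path/to/latin1/files")
         .flatMap(lambda kv: kv[1].decode("ISO-8859-1").splitlines()))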
I have a .dat file exported from a mainframe system. It is EBCDIC-encoded (cp037). I would like to load its contents into a pandas or Spark DataFrame.
I tried using "iconv" to convert the file to ASCII, but it does not support conversion from cp037; "iconv -l" does not list cp037.
What is the best way to achieve this?
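If it helps, here is a sketch using Python's built-in cp037 codec (the file name and column widths are placeholders, and it assumes a plain-text EBCDIC export rather than packed-decimal/COMP-3 fields, which would need a copybook-aware parser):

import pandas as pd

# Python ships a cp037 codec, so the file can be decoded directly.
with open("my_file.dat", encoding="cp037") as f:
    df = pd.read_fwf(f, widths=[10, 20, 8])  # hypothetical fixed-width layout
print(df.head())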
I have a DataFrame which I need to convert to a CSV file and then send to an API. Since I'm sending it to an API, I don't want to save it to the local filesystem and need to keep it in memory. How can I do this?
Easy way: convert your DataFrame to a pandas DataFrame with toPandas(), then save it to a string. To get a string instead of a file, call to_csv() with path_or_buf=None. Then send the string in an API call.
From to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
File path or object, if None is provided the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
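To send it in an API call, a minimal sketch using requests (the URL and header are placeholders for whatever your API expects):

import requests

response = requests.post(
    "https://example.com/api/upload",        # placeholder endpoint
    data=csv_string.encode("utf-8"),
    headers={"Content-Type": "text/csv"},
)
response.raise_for_status()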
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file, as sketched below. Or you can even use a regular file; just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
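A sketch of the SpooledTemporaryFile route (the 100 MB threshold is an arbitrary choice; the data stays in memory until it grows past max_size):

import tempfile

with tempfile.SpooledTemporaryFile(max_size=100 * 1024 * 1024, mode="w+") as buf:
    df.toPandas().to_csv(buf, index=False)  # write the CSV into the in-memory buffer
    buf.seek(0)
    csv_string = buf.read()                 # read it back as a string to send to the API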
I have multi-language characters in my PySpark DataFrame. After writing the data to BigQuery, it shows strange characters because of its default encoding scheme (UTF-8).
How can I change the encoding in BigQuery to ISO-8859-1 using PySpark / Dataproc?
It turned out there was an issue in the source file itself, as it comes through an API. Fixing that at the source resolved the issue.
First, check at the source or source system how it is sending the data and confirm which encoding it uses. If the encoding still differs, investigate as follows.
AFAIK, per your comments, PySpark is reading the JSON with UTF-8 encoding and loading it into BigQuery, so it's not BigQuery's fault, since its default is UTF-8.
You can change the encoding to ISO-8859-1 and load the JSON like below:
spark.read.option('encoding', 'ISO-8859-1').json("your json path with latin-1")
and then load it into BigQuery.
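A sketch of the load step, assuming the spark-bigquery connector is available on the Dataproc cluster (the paths, table, and bucket names are placeholders):

df = spark.read.option("encoding", "ISO-8859-1").json("gs://your-bucket/latin1-json/")

(df.write.format("bigquery")
   .option("table", "your_dataset.your_table")        # placeholder table
   .option("temporaryGcsBucket", "your-temp-bucket")  # placeholder staging bucket
   .mode("append")
   .save())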
Also, while writing the DataFrame into BigQuery, you can test/debug with the decode function, passing the column and the charset in both ISO-8859-1 and UTF-8, to understand where it goes wrong. pyspark.sql.functions.decode(column, charset) shows whether a value can be decoded to UTF-8 or not, and you can also write the DataFrame with pyspark.sql.functions.decode(col, charset) applied.
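For example, a debugging sketch, assuming a string column named "name" that displays garbled; encode() turns the string into bytes in one charset and decode() reinterprets them in another, which helps show where the mis-decoding happens:

from pyspark.sql import functions as F

debug_df = df.select(
    "name",
    F.decode(F.encode(F.col("name"), "ISO-8859-1"), "UTF-8").alias("as_utf8"),
    F.decode(F.encode(F.col("name"), "UTF-8"), "ISO-8859-1").alias("as_latin1"),
)
debug_df.show(truncate=False)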
I am trying to access a CSV file from an AWS S3 bucket and I am getting the error 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. The code is below. I am using Python 3.7.
from io import BytesIO, StringIO
import boto3
import pandas as pd
import gzip

s3 = boto3.client('s3', aws_access_key_id='######',
                  aws_secret_access_key='#######')
response = s3.get_object(Bucket='#####', Key='raw.csv')
# print(response)
s3_data = StringIO(response.get('Body').read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Kindly help me out here; how can I resolve this issue?
Using gzip worked for me:
client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
csv_obj = client.get_object(Bucket='####', Key='###')
body = csv_obj['Body']
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)
The error you're getting means the CSV file you're getting from this S3 bucket is not encoded using UTF-8.
Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.
If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).
So maybe try .decode('windows-1252')?
If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.
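For example, a quick sketch running chardet on the raw bytes pulled from S3 (the output shown in the comment is illustrative):

import chardet

raw_bytes = response['Body'].read()  # the raw, undecoded bytes from S3
guess = chardet.detect(raw_bytes)    # run detection before any decode()
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}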
Finally, I suggest that, instead of using an explicit decode() and using a StringIO object for the contents of the file, store the raw bytes in an io.BytesIO and have pd.read_csv() decode the CSV by passing it an encoding argument.
import io
s3_data = io.BytesIO(response.get('Body').read())
data = pd.read_csv(s3_data, encoding='windows-1252')
As a general practice, you want to delay decoding as much as you can. In this particular case, having access to the raw bytes can be quite useful, since you can use them to write a copy to a local file (which you can then inspect with a text editor or in Excel).
Also, if you want to do detection of the encoding (using chardet, for example), you need to do so before you decode it, so again in that case you need the raw bytes, so that's yet another advantage to using the BytesIO here.
I am a new learner of PySpark. I have a requirement in my project to read a JSON file with a schema and convert it to a CSV file.
Can someone help me with how to proceed with this request using PySpark?
You can load the JSON and write the CSV with a SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ETL").getOrCreate()
df = spark.read.json("path-to-json")
df.write.csv("path-to-csv")
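Since the question mentions reading the JSON with a schema, here is a sketch with a hypothetical two-field schema (the field names and paths are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local").appName("ETL").getOrCreate()

# Hypothetical schema; replace the field names and types with your own.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.schema(schema).json("path-to-json")
df.write.mode("overwrite").option("header", "true").csv("path-to-csv")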