How to detect encoding of a file format - python-3.x

I have files in an S3 bucket and I am reading them as a stream. I want to detect the encoding of the different files.
I used the chardet library and I am getting this error:
TypeError: Expected object of type bytes or bytearray, got: <class 'botocore.response.StreamingBody'>
My code is:
a = obj.get()['Body']
reader = chardet.detect(a).get('encoding')
print(reader)
Is there any other way to detect the encoding before opening a file?

I got it working: you need to call the read function on the stream first, so that chardet receives bytes instead of the StreamingBody wrapper:
a = (obj.get()['Body']._raw_stream).read()
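For reference, here is a minimal end-to-end sketch (the bucket and key names are placeholders). The StreamingBody also exposes read() directly, so its bytes can be fed straight to chardet.detect():
import boto3
import chardet

s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'my-key.txt')  # placeholder bucket/key

# StreamingBody.read() returns bytes, which is what chardet expects
raw_bytes = obj.get()['Body'].read()

result = chardet.detect(raw_bytes)
print(result['encoding'], result['confidence'])
Note that this reads the whole object into memory; for very large files you may want to run detection on only the first chunk of bytes.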

Related

python 3 coding issue

I created a JSON file with VS Code (or Notepad++). It contains an array of strings, one of which is "GRÜN". Then I read the file with Python 3:
with codecs.open(file, 'r', "iso-8859-15") as infile:
    dictionary = json.load(infile)
If I print the array (inside "dictionary") to the console, I see: "GRÃ\x9cN"
How can I convert "GRÃ\x9cN" back to "GRÜN"?
I tried reading the JSON file with the codec "iso-8859-1" too, but the issue still occurred.
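This is the classic mojibake pattern: editors such as VS Code and Notepad++ save as UTF-8 by default, so the two UTF-8 bytes of "Ü" (0xC3 0x9C) get decoded as the two Latin-1 characters "Ã" and "\x9c". A minimal sketch of both the proper fix and the after-the-fact repair, assuming the file really is UTF-8:
import codecs
import json

file = "data.json"  # placeholder path

# Proper fix: decode with the encoding the file was actually written in
with codecs.open(file, 'r', 'utf-8') as infile:
    dictionary = json.load(infile)

# After-the-fact repair: undo the wrong decode, then redo it correctly
broken = "GRÃ\x9cN"
fixed = broken.encode('iso-8859-1').decode('utf-8')
print(fixed)  # GRÜN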

How to read an excel file with extension csv in jupyter?

I have an Excel file with a .csv extension. I want to read it in a Jupyter notebook.
My code is:
real_csv_data = pd.read_csv("/Users/xxx/Downloads/myfile.csv")
and I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
myfile is an Excel file with a .csv extension. I tried the same with a .txt file and it worked fine.
Any idea?
OK, I added an encoding parameter to pd.read_csv(). Now it looks like:
real_csv_data = pd.read_csv("/Users/xxx/Downloads/myfile.csv", encoding="utf-16")
and now it works fine.
Also, if anyone is interested, you can add
sep='\t'
for example, to get the data in a nice table.
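That fits the error message: byte 0xff in position 0 is the start of a UTF-16 little-endian byte order mark (0xFF 0xFE), which Excel writes when you export as "Unicode Text" (a tab-separated file). A hedged way to confirm this before calling read_csv (the path is the placeholder from the question):
import pandas as pd

path = "/Users/xxx/Downloads/myfile.csv"  # placeholder path

# Peek at the first two bytes: b'\xff\xfe' = UTF-16 LE BOM, b'\xfe\xff' = UTF-16 BE BOM
with open(path, 'rb') as f:
    bom = f.read(2)

if bom in (b'\xff\xfe', b'\xfe\xff'):
    real_csv_data = pd.read_csv(path, encoding='utf-16', sep='\t')
else:
    real_csv_data = pd.read_csv(path)
print(real_csv_data.head())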

Python3 convert byte object to file object

I am using an API that only takes file objects (a BufferedRandom object returned by open(file_name, 'r+b')).
However, what I have in hand is a bytes object, returned by:
with open(file_name, "rb") as file:
    data = file.read()
I am wondering how to convert this bytes object into something like the BufferedRandom object to serve as input to the API, because if I pass the bytes object to the API function I get the error 'bytes' object has no attribute 'read'.
Thank you very much!
Found an answer here.
You can get your bytes data into a file object with this:
import io
f = io.BytesIO(raw_bytes_data)
Now f behaves just like a file object, with methods like read, seek, etc.
I had a similar issue when I needed to download (export) files from one REDCap database and upload (import) them into another using PyCap. The export_file method returns the file contents as a raw bytestream, while import_file needs a file object. This works with Python 3.
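As a hedged usage sketch (the API function here is a stand-in for whatever real API expects a file object), the BytesIO wrapper can be handed to anything that only needs read/seek:
import io

def api_that_needs_a_file(fobj):        # stand-in for the real API
    return fobj.read(8)

file_name = "example.bin"               # placeholder path
with open(file_name, "rb") as f:
    raw_bytes_data = f.read()

file_like = io.BytesIO(raw_bytes_data)  # wrap the bytes in a file-like object
file_like.seek(0)                       # rewind, just to be explicit
print(api_that_needs_a_file(file_like))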

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte while accessing csv file

I am trying to access a CSV file from an AWS S3 bucket and getting the error 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. My code is below; I am using Python 3.7.
from io import BytesIO, StringIO
import boto3
import pandas as pd
import gzip

s3 = boto3.client('s3', aws_access_key_id='######',
                  aws_secret_access_key='#######')
response = s3.get_object(Bucket='#####', Key='raw.csv')
# print(response)
s3_data = StringIO(response.get('Body').read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Kindly help me out here, how can I resolve this issue?
Using gzip worked for me (byte 0x8b in position 1 matches the gzip magic bytes 0x1f 0x8b, so the object is most likely gzip-compressed):
client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
csv_obj = client.get_object(Bucket=####, Key=###)
body = csv_obj['Body']
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)
The error you're getting means the CSV file you're getting from this S3 bucket is not encoded using UTF-8.
Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.
If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).
So maybe try .decode('windows-1252')?
If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.
Finally, I suggest that, instead of using an explicit decode() and using a StringIO object for the contents of the file, store the raw bytes in an io.BytesIO and have pd.read_csv() decode the CSV by passing it an encoding argument.
import io
s3_data = io.BytesIO(response.get('Body').read())
data = pd.read_csv(s3_data, encoding='windows-1252')
As a general practice, you want to delay decoding as much as you can. In this particular case, having access to the raw bytes can be quite useful, since you can use them to write a copy to a local file (which you can then inspect with a text editor, or in Excel).
Also, if you want to do detection of the encoding (using chardet, for example), you need to do so before you decode it, so again in that case you need the raw bytes, so that's yet another advantage to using the BytesIO here.
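Following up on that chardet suggestion, here is a small hedged sketch of detecting the encoding from the raw bytes before deciding how to decode (response is the get_object result from the question):
import io
import chardet
import pandas as pd

raw = response.get('Body').read()   # raw bytes, not decoded yet

guess = chardet.detect(raw)         # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
print(guess)

# guess['encoding'] can be None if detection fails (e.g. the data is compressed)
data = pd.read_csv(io.BytesIO(raw), encoding=guess['encoding'])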

Python3.5 error using BytesIO or StringIO with base64.standard_b64encode

I am trying to take the contents of a BytesIO or StringIO object and encode it with base64.standard_b64encode(). I have tried both. This works fine in Python 2.7, but in Python 3.5 I get the following error.
TypeError: Can't convert 'bytes' object to str implicitly
This is the portion of code having the problem.
output = BytesIO()
img.save(output, format="PNG")
output.seek(0)
data = "data:image/png;base64," + base64.standard_b64encode(output.read())
html = "<html><body><img src='DATA'></body></html>"
I have seen references to fixing this error for strings using b"string", but I don't know how that would apply to reading from a file.
Thanks
It turns out the problem was not with the base64 encoding, but rather with the string I was trying to append it to. I had to do the following so that Python no longer treated the result as bytes:
base64.b64encode(output.read()).decode()
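Put back into the original snippet, a hedged sketch of the working version (a tiny generated PIL image stands in for the img from the question, and the str.format substitution into the HTML is an assumption):
import base64
from io import BytesIO
from PIL import Image

img = Image.new("RGB", (1, 1))   # stand-in for the real image
output = BytesIO()
img.save(output, format="PNG")
output.seek(0)

# b64encode returns bytes; .decode() turns it into a str so it can be concatenated
data = "data:image/png;base64," + base64.standard_b64encode(output.read()).decode()
html = "<html><body><img src='{}'></body></html>".format(data)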
