Double encoding through cp1252 and base64 - python-3.x

From a client I receive a PDF file, which is encoded in cp1252 and, for transfer, additionally encoded in base64. Until now, a PowerShell script has restored the file to its original form through this line:
$output = [System.Text.Encoding]::GetEncoding(1252).GetString([System.Convert]::FromBase64String($input))
and this works.
Now I am implementing a Python version to supersede this implementation. It looks roughly like this:
enc_file = read_from_txt.open_file(location_of_file)
plain_file = base64.b64decode(enc_file)
with open('filename', 'w') as writer:
    writer.write(plain_file.decode('cp1252'))
where read_from_txt.open_file just does this:
with open(file_location, 'rb') as fileReader:
    read = fileReader.read()
    return read
But for some reason I am getting an error from plain_file.decode('cp1252'), which cannot decode a byte in the file. From what I understand, though, the Python program should do exactly the same thing as the PowerShell does.
The concrete error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 188: character maps to <undefined>
Any help is appreciated.
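A likely diagnosis (an assumption on my part, not stated in the question): 0x81 is one of the five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that cp1252 leaves undefined. .NET's code page 1252 silently maps them to the matching C1 control characters, which is why the PowerShell line succeeds, while Python's cp1252 codec is strict and raises. Since a PDF is binary data anyway, the decode step can be skipped entirely by writing the decoded bytes in binary mode, sketched here with a simulated payload:

```python
import base64

# Simulated transfer payload: binary data containing 0x81, one of the
# bytes that Python's cp1252 codec refuses to decode.
original = b"%PDF-1.4\x81\x00 binary payload"
enc_file = base64.b64encode(original)

plain_file = base64.b64decode(enc_file)

# A PDF is binary, so write the decoded bytes as-is instead of
# round-tripping them through a text codec:
with open("output.pdf", "wb") as writer:
    writer.write(plain_file)

with open("output.pdf", "rb") as reader:
    assert reader.read() == original
```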

Related

Python: Base64 decode codec can't decode bytes in position 47-48: invalid continuation byte

I found many questions related to this one on Stack Overflow, and even after following and applying what solved the problem for other users, I'm still at the starting line.
I'm receiving a response in UTF-8-encoded format. It's an XML file that I want to decode.
I saved the response in a .txt file with UTF-8 encoding and tried the following:
import base64

with open('docdata.txt', 'r') as f:
    e = f.read()
print(e[:50])
decoded = base64.b64decode(e)
print(str(decoded, "utf-8"))
When I run the above program, I get this error:
print(str(decoded, "utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 47-48: invalid continuation byte
The file size is around 26 MB. When I uploaded the same file to Base64decode, I got a proper output file without any error.
print(decoded[:50])
>> b'PK\x03\x04\x14\x00\x08\x08\x08\x005O=Q\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x11\x00\x00\x00newUserPkList.xml\xec\xbd\xcb'
print(decoded[47:50])
>> b'\xec\xbd\xcb'
Please let me know what mistake I'm making and how I can solve this error.
Thanks.
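A likely diagnosis (my assumption): the decoded bytes start with the ZIP magic number PK\x03\x04 and name a member newUserPkList.xml, so the payload is a ZIP archive containing the XML rather than raw UTF-8 text, and the "invalid continuation byte" at position 47-48 is where the compressed data begins. A sketch that extracts the XML instead, using a fabricated archive in place of the real response:

```python
import base64
import io
import zipfile

# Build a stand-in for the real response: a base64-encoded ZIP archive
# holding newUserPkList.xml (the member name visible in the decoded bytes).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("newUserPkList.xml", "<users><user id='1'/></users>")
e = base64.b64encode(buf.getvalue()).decode("ascii")

decoded = base64.b64decode(e)
print(decoded[:4])   # b'PK\x03\x04', the ZIP magic, not XML text

# Extract the member instead of decoding the compressed bytes:
with zipfile.ZipFile(io.BytesIO(decoded)) as z:
    xml_text = z.read("newUserPkList.xml").decode("utf-8")
print(xml_text)
```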

Encoding issues related to Python and foreign languages

Here's a problem I am facing with encoding and decoding text.
I am trying to write code that finds a 'string' or 'bytes' in a file and returns the path of the file.
Currently, the files I am opening are encoded as 'windows-1252' (also called 'cp1252'), so I have been trying to:
1. encode my string into bytes matching the encoding of the file
2. match the file and get the path of that file
I have a file, say 'f', with the encoding 'windows-1252'/'cp1252'. It includes text in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text))  # encoding() is a separate function I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you can see, the raw bytes for [跑Online農場] are [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I literally convert the string to bytes, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp-1252') as f ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
which leaves me unsure how '[跑Online農場]' would 'translate' into '[¶]Online¹A³õ]'. An answer to this may also solve the problem.
What should I do to correctly 'encode' the Chinese/Foreign characters so that it matches the 'rb' bytes that the Python returns?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can do is trial and error with a few codecs.
I did this with the list of codecs supported by Python, searching for codecs for Chinese.
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.
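The trial-and-error approach described above can be sketched as a small loop over candidate codecs (the candidate list here is my own and not exhaustive):

```python
raw = b'[\xb6]Online\xb9A\xb3\xf5]'
expected = '[跑Online農場]'

# Try a few Windows codecs; only the right one reproduces the known text.
for codec in ('cp1252', 'gbk', 'cp950'):
    try:
        text = raw.decode(codec)
    except UnicodeDecodeError:
        print(f'{codec}: decode error')
        continue
    verdict = 'match' if text == expected else 'mojibake'
    print(f'{codec}: {text!r} ({verdict})')
```

As in the answer, cp1252 decodes without error but produces mojibake, while cp950 reproduces the expected Chinese text.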

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte while accessing csv file

I am trying to access a CSV file from an AWS S3 bucket and I am getting the error 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. The code is below; I am using Python 3.7.
from io import StringIO

import boto3
import pandas as pd

s3 = boto3.client('s3', aws_access_key_id='######',
                  aws_secret_access_key='#######')
response = s3.get_object(Bucket='#####', Key='raw.csv')
# print(response)
s3_data = StringIO(response.get('Body').read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Kindly help me out here; how can I resolve this issue?
Using gzip worked for me:
client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
csv_obj = client.get_object(Bucket=####, Key=###)
body = csv_obj['Body']
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)
The error you're getting means the CSV file you're getting from this S3 bucket is not encoded using UTF-8.
Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.
If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).
So maybe try .decode('windows-1252')?
If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.
Finally, I suggest that, instead of calling decode() explicitly and using a StringIO object for the contents of the file, you store the raw bytes in an io.BytesIO and have pd.read_csv() decode the CSV by passing it an encoding argument.
import io
s3_data = io.BytesIO(response.get('Body').read())
data = pd.read_csv(s3_data, encoding='windows-1252')
As a general practice, you want to delay decoding as much as you can. In this particular case, having access to the raw bytes can be quite useful, since you can use them to write a copy to a local file (which you can then inspect with a text editor or in Excel).
Also, if you want to do detection of the encoding (using chardet, for example), you need to do so before you decode it, so again in that case you need the raw bytes, so that's yet another advantage to using the BytesIO here.
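Given that the undecodable byte 0x8b sits at position 1, there is one more guess worth making before any charset detection (an observation of mine, consistent with the gzip answer above): the gzip magic number is 0x1f 0x8b, so the object is probably gzip-compressed rather than mis-encoded. Checking the first two raw bytes distinguishes the two cases:

```python
import gzip

def body_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    # gzip streams start with the magic bytes 0x1f 0x8b, the same 0x8b
    # that shows up at position 1 in the UnicodeDecodeError.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(encoding)

# Simulated S3 body: a gzipped CSV.
raw = gzip.compress(b"col1,col2\n1,2\n")
print(body_to_text(raw).splitlines()[0])  # -> col1,col2
```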

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>

I have code to convert docx files to plain text:
import docx
import glob

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

for file in glob.glob('*.docx'):
    outfile = open(file.replace('.docx', '-out.txt'), 'w', encoding='utf8')
    for line in open(file):
        print(getText(filename), end='', file=outfile)
    outfile.close()
However, when I execute it, there is the following error:
Traceback (most recent call last):
File "C:\Users\User\Desktop\add spaces docx\converting docx to pure text.py", line 16, in <module>
for line in open(file):
File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>
I am using Python 3.5.2.
Can anyone help to resolve this issue?
Thanks in advance.
Although I do not know the docx module very well, I think I can find a solution.
According to fileformat, the Unicode character 8f (which is what the charmap codec couldn't decode, resulting in a UnicodeDecodeError) is a control character.
You should be aware that when reading files (which seems to be what the docx module is doing), control characters can appear, and sometimes Python can't decode them.
The solution to this is to give up on the docx module, learn how .docx files work and are formatted, and, when you read a docx file, use open(filename, "rb") so Python will be able to decode it.
However, this might not be the problem. As you can see in the traceback, the encodings directory uses cp1252 as its (default) encoding instead of utf-8. Try changing it to utf_8.py (for me it comes up as utf_8.pyc).
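For what it's worth, a different reading of the traceback (my own guess): the failing line is `for line in open(file)`, which opens the binary .docx (a ZIP archive) in text mode, so Python tries to decode it with the platform default cp1252 and hits byte 0x8f. That loop is also redundant, since getText() reads the document itself. A sketch of the loop with the text-mode read removed, using a stub in place of the docx-based helper:

```python
import glob
import os
import tempfile

def getText(filename):
    # Stub standing in for the question's docx-based helper, which would
    # call docx.Document(filename) and join the paragraph texts.
    return "text of " + os.path.basename(filename)

workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "demo.docx"), "wb").close()  # placeholder file

for file in glob.glob(os.path.join(workdir, "*.docx")):
    # No `for line in open(file)` here: pass the path straight to
    # getText() and let the helper read the binary file itself.
    with open(file.replace(".docx", "-out.txt"), "w", encoding="utf8") as outfile:
        print(getText(file), end="", file=outfile)

print(open(os.path.join(workdir, "demo-out.txt"), encoding="utf8").read())  # -> text of demo.docx
```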

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5

I am trying to read a file using the following code.
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'))
features = precomputed['features']
But getting this error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5: ordinal not in range(128)
The file I am trying to read contains image features extracted using deep neural networks. The file content looks like this:
(dp0
S'imageIds'
p1
(lp2
I262145
aI131074
aI131075
aI393221
aI393223
aI393224
aI524297
aI393227
aI393228
aI262146
aI393230
aI262159
aI524291
aI322975
aI131093
aI524311
....
....
....
Please note that this is a big file, about 2.8 GB in size.
I know this is a duplicate question, and I followed the suggested solutions in other Stack Overflow posts, but I couldn't solve it. Any help would be appreciated!
Finally I found the solution. The problem was actually about unpickling a Python 2 object with Python 3, which I didn't understand at first because the pickle file I got was written by a Python 2 program.
Thanks to this answer, which solved the problem. All I needed to do was set the encoding parameter of pickle.load() to latin1, because latin1 works for any input: it maps the byte values 0-255 directly onto the first 256 Unicode code points.
So, the following worked for me!
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'), encoding='latin1')
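The effect of the encoding parameter can be reproduced with a tiny hand-made protocol-0 pickle containing a Python 2 byte string (the payload here is my own construction, not from the question's file):

```python
import pickle

# Protocol-0 pickle of the Python 2 str '\xcc': a single high byte, like
# the 0xcc at position 5 in the error message.
py2_pickle = b"S'\\xcc'\n."

try:
    pickle.loads(py2_pickle)  # default encoding='ascii'
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xcc ...

# latin1 maps bytes 0-255 straight onto code points U+0000-U+00FF:
print(pickle.loads(py2_pickle, encoding='latin1'))  # -> 'Ì'

# encoding='bytes' keeps the original byte string instead:
print(pickle.loads(py2_pickle, encoding='bytes'))   # -> b'\xcc'
```

The encoding='bytes' variant is worth knowing about as well: it avoids any text reinterpretation, at the cost of getting bytes keys and values back.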
