Python http request and utf-8 encoding - python-3.x

I use urllib to perform HTTP requests to a REST API. All data is JSON with UTF-8 encoding.
The problem is that when I read a special character via the REST API I get the correct encoding (e.g. if I read 90°C the ° is correctly coded as 0xc2 0xb0), but when I send it back with another request it seems to lose the UTF-8 encoding (i.e. ° is coded as 0xb0).
I made a little test saving the response to a file: if I write the response as bytes I can see it is encoded correctly, but when I load the response as JSON and then write it to a file I lose the UTF-8 encoding.
import json
import urllib.request

# req and context (an SSL context) are set up earlier, not shown here
resp = urllib.request.urlopen(req, context=context)
r = resp.read()
print(f'resp: {r}')

# writing the raw bytes: the degree sign is stored as 0xc2 0xb0 (UTF-8)
f = open('test-utf8', 'wb')
f.write(r)
f.close()

content = json.loads(r)
print(content)

# writing the parsed value as text: here the UTF-8 encoding seems to be lost
f = open('test2-utf8', 'w')
f.write(content['descrizione'])
f.close()
If I make a new request sending that data after reading it with json.loads, I get this error:
unable to decode byte 0xb0 near \'"\'
Using encode('utf-8') and decode('utf-8') before sending the request doesn't help either.
Where am I going wrong?

After some testing we found out that it was an encoding problem on our Tomcat RESTful server.
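For reference, the two byte sequences in the question are exactly what you get when the degree sign is encoded as UTF-8 versus Latin-1, so a stray 0xb0 usually means something in the chain re-encoded (or mis-declared) the data as ISO-8859-1:
>>> '°'.encode('utf-8')
b'\xc2\xb0'
>>> '°'.encode('latin-1')
b'\xb0'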

Related

Persian text cannot be parsed correctly when crawling a Persian website [duplicate]

When the server's content type is 'Content-Type: text/html', requests.get() returns improperly encoded data.
However, if the content type is explicitly 'Content-Type: text/html; charset=utf-8', it returns properly encoded data.
Also, when we use urllib.urlopen(), it returns properly encoded data.
Has anyone noticed this before? Why does requests.get() behave like this?
The "educated guesses" mentioned above are probably just a check of the Content-Type header sent by the server (quite a misleading use of "educated", imho).
For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (i.e. the default for HTML5 is UTF-8).
For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
Luckily for us, requests uses the chardet library and that usually works quite well (the requests.Response.apparent_encoding attribute), so you usually want to do:
import requests

r = requests.get("https://martin.slouf.name/")
# override the encoding with the real educated guess provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text
From requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check the encoding requests used for your page, and if it's not the right one, try to force it to be the one you need.
Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.
After getting the response, use response.content instead of response.text; .content gives you the raw bytes, with no (possibly wrong) decoding applied.
import requests

response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)
if response.status_code == 200:   # compare with ==, not 'is'
    body = response.content      # raw bytes, no text decoding applied
else:
    print("Unable to get response with Code : %d " % (response.status_code))
The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC 2854. UTF-8 was too young to become the default; it was born in 1993, about the same time as HTML and HTTP.
Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not care about the correct encoding, the value of .text may be off.
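To make the difference concrete, here is a minimal sketch (it reuses the URL from the answer above; overriding the encoding is only needed when the server's declared charset is wrong):
import requests

r = requests.get("https://martin.slouf.name/")

raw = r.content                   # bytes, exactly as sent by the server
guessed = r.text                  # str, decoded with the encoding requests guessed from the headers

r.encoding = r.apparent_encoding  # override with chardet's guess (or a known charset such as 'utf-8')
fixed = r.text                    # str, re-decoded with the overridden encoding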

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte while accessing csv file

I am trying to access a CSV file from an AWS S3 bucket and getting the error 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. The code is below. I am using Python 3.7.
from io import BytesIO, StringIO
import boto3
import pandas as pd
import gzip

s3 = boto3.client('s3', aws_access_key_id='######',
                  aws_secret_access_key='#######')
response = s3.get_object(Bucket='#####', Key='raw.csv')
# print(response)
s3_data = StringIO(response.get('Body').read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Kindly help me out here; how can I resolve this issue?
Using gzip worked for me:
import gzip
import boto3
import pandas as pd

client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
csv_obj = client.get_object(Bucket=####, Key=###)
body = csv_obj['Body']
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)
The error means the CSV file you're fetching from this S3 bucket is not encoded as UTF-8.
Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.
If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).
So maybe try .decode('windows-1252')?
If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.
Finally, I suggest that, instead of using an explicit decode() and using a StringIO object for the contents of the file, store the raw bytes in an io.BytesIO and have pd.read_csv() decode the CSV by passing it an encoding argument.
import io
s3_data = io.BytesIO(response.get('Body').read())
data = pd.read_csv(s3_data, encoding='windows-1252')
As a general practice, you want to delay decoding as much as you can. In this particular case, having access to the raw bytes can be quite useful, since you can use them to write a copy to a local file (which you can then inspect with a text editor, or in Excel).
Also, if you want to do detection of the encoding (using chardet, for example), you need to do so before you decode it, so again in that case you need the raw bytes, so that's yet another advantage to using the BytesIO here.
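For completeness, a minimal detection sketch with chardet might look like this (it assumes the chardet package is installed and reuses the response object from the question):
import io
import chardet
import pandas as pd

raw = response.get('Body').read()   # keep the raw, undecoded bytes
guess = chardet.detect(raw)         # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
print(guess)

data = pd.read_csv(io.BytesIO(raw), encoding=guess['encoding'])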

MIME base64 encoding ambiguity in rfc2045

According to the MIME base64 encoding specified in RFC 2045, the base64-encoded data must be split into lines of at most 76 characters.
When decoding, all characters not belonging to the base64 alphabet must be ignored.
How do we determine the end of MIME base64-encoded data?
When you've found the start of a base64 encoded object, it should always be possible to find the end without decoding it. Examples:
You might have an email message whose top-level encoding is base64. In that case, the end of the base64 stuff is the end of the body. The end of the body is recognized not by any internal structure, but by the lone . at the end of the SMTP DATA.
If you're reading an email message from an mbox file instead of receiving it via SMTP, the mbox format is responsible for telling you where the end of the message is.
If you have a multipart email body with one part base64, you can scan for the multipart boundary first to find the end of the body part, then pass the whole body part to the base64 decoder.
Similarly, if you have an RFC2047-encoded header with base64, you can find the terminating =? first, then pass the encoded portion to the base64 decoder.
Because the terminators are already identified before base64 decoding begins, the decoder never sees the terminator, so the rule "characters not belonging to the base64 alphabet" is not relevant.
The two steps of finding the end of the base64 data and decoding it can be combined into a single loop over the input, for efficiency. But conceptually they are separate.
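As a rough Python illustration of that two-step idea (a sketch only; it assumes the complete message is already available as raw_bytes, with its boundaries intact):
import base64
from email import message_from_bytes

msg = message_from_bytes(raw_bytes)

for part in msg.walk():
    if part.get('Content-Transfer-Encoding', '').lower() == 'base64':
        # The multipart boundary (or the end of the body) has already delimited
        # this part, so the decoder never sees the terminator at all.
        decoded = base64.b64decode(part.get_payload())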

Python3 urlopen read weirdness (gzip)

I'm fetching a URL from Schema.org. Its content type is "text/html".
Sometimes, read() works as expected: b'<!DOCTYPE html> ...'
Sometimes, read() returns something else: b'\x1f\x8b\x08\x00\x00\x00\x00 ...'
from urllib.request import urlopen
from urllib.error import URLError

try:
    with urlopen("http://schema.org/docs/releases.html") as f:
        txt = f.read()
except URLError:
    return
I've tried solving this with txt = f.read().decode("utf-8").encode() but this results in an error... sometimes: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The obvious work-around is to check the first byte (0x1f marks a gzip stream) and treat the response accordingly.
My question is: Is this a bug or something else?
Edit
Related question. Apparently, sometimes I'm getting a gzipped stream.
Lastly
I solved this by adding the following code as proposed here
from zlib import decompress, MAX_WBITS

if txt[0] == 31:  # 0x1f, the first byte of the gzip magic number
    txt = decompress(txt, 16 + MAX_WBITS)
The question remains: why does this return plain text/html sometimes and a gzipped body other times?
There are other questions in this category, but I cannot find an answer that addresses the actual cause of the problem.
Python's urllib.request.urlopen() cannot transparently handle compression. It also does not set the Accept-Encoding request header by default. Additionally, the interpretation of this situation according to the HTTP standard has changed over time.
As per RFC2616:
If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if "identity" is one of the available content-codings, then the server SHOULD use the "identity" content-coding, unless it has additional information that a different content-coding is meaningful to the client.
Unfortunately (for this use case), RFC 7231 changes this to:
If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.
Meaning that when you perform a request using urlopen() you can get a response in whatever content coding the server decides to use, and the response will still be conformant.
schema.org seems to be hosted by Google, i.e. it is most likely behind a distributed front-end load balancer network. So the different answers you get might be returned from load balancers with slightly different configurations.
Google engineers have in the past advocated for the use of HTTP compression, so this might well be a conscious decision.
So as a lesson: when using urlopen() we need to set Accept-Encoding.
You are indeed receiving a gzipped response. You should be able to avoid it by:
from urllib import request

try:
    req = request.Request("http://schema.org/docs/releases.html")
    req.add_header('Accept-Encoding', 'identity;q=1')
    with request.urlopen(req) as f:
        txt = f.read()
except request.URLError:
    return

TypeError: POST data should be bytes or an iterable of bytes. It cannot be str

I just updated from Python 3.1 to Python 3.2 (formatted the HD) and one of my scripts stopped working. It gives me the error in the title.
I would fix it myself but I don't even know what an iterable of bytes is. I tried casting with bytes(data), but that didn't work either: TypeError: string argument without an encoding
url = "http://example.com/index.php?app=core&module=global&section=login&do=process"
values = {"username" : USERNAME,
"password" : PASSWORD}
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data)
urllib.request.urlopen(req)
It crashes at the last line.
Works in 3.1, but not 3.2
You were basically correct in trying to convert the string into bytes, but you did it the wrong way. Python doesn't have typecasting (so what you did was not typecasting).
The way to do it is to encode the text (str) data into bytes, which you do with the encode method:
binary_data = data.encode('encoding')
What 'encoding' should be depends. You should probably use 'ascii' here. If you have characters that aren't ASCII, then you need to use another encoding, typically 'utf8', but then you also need to tell the receiving webserver that it is UTF-8. It might also not want UTF8, but then you have to ask it, and it's getting complicated. :-)
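Applied to the script in the question, a minimal sketch of the fix could look like this (USERNAME and PASSWORD are the placeholders from the question; 'ascii' is the encoding suggested above):
import urllib.parse
import urllib.request

url = "http://example.com/index.php?app=core&module=global&section=login&do=process"
values = {"username": USERNAME, "password": PASSWORD}

# urlencode() returns a str; encode it to bytes before passing it as POST data
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data)
urllib.request.urlopen(req)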
@Enders, I know this is an old question, but I'd like to explain a few more things for somebody fighting with this issue.
It is specifically with this line of code here:
data = urllib.parse.urlencode(values)
that you are having issues with: urlencode() percent-encodes the values, but the result is still a str.
If you refer to the urllib.parse documentation and scroll down to what urlencode does (https://docs.python.org/3/library/urllib.parse.html), you will see that you are encoding your user/pass into a percent-encoded string:
Convert a mapping object or a sequence of two-element tuples, which may contain str or bytes objects, to a percent-encoded ASCII text string. If the resultant string is to be used as a data for POST operation with the urlopen() function, then it should be encoded to bytes, otherwise it would result in a TypeError.
Perhaps what you are trying to do here is some kind of encryption of your user/password, but I don't really think this is the right way. If it is, then you probably need to make sure that the receiving end (the destination of your URL) knows that you're encoding your user/pass this way.
A more up-to-date approach is to use the Requests library, which supports the most common authentication schemes: http://docs.python-requests.org/en/master/user/authentication/
In this case, I'd do something like this:
import requests

requests.get(url, auth=('user', 'pass'))
