Python3 urlopen read weirdness (gzip) - python-3.x

I'm fetching a URL from Schema.org. Its content type is "text/html".
Sometimes, read() returns what I expect: b'<!DOCTYPE html> ....'
Sometimes, read() returns something else: b'\x1f\x8b\x08\x00\x00\x00\x00 ...'
from urllib.request import urlopen
from urllib.error import URLError

try:
    with urlopen("http://schema.org/docs/releases.html") as f:
        txt = f.read()
except URLError:
    return
I've tried solving this with txt = f.read().decode("utf-8").encode() but this results in an error... sometimes: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The obvious work-around is to test whether the response starts with the gzip magic byte (0x1f) and treat it accordingly.
My question is: Is this a bug or something else?
Edit
Related question. Apparently, sometimes I'm getting a gzipped stream.
Lastly
I solved this by adding the following code, as proposed here:
from zlib import decompress, MAX_WBITS

if 31 == txt[0]:
    txt = decompress(txt, 16 + MAX_WBITS)
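A less fragile variant (a sketch, not the original poster's code) is to check the Content-Encoding response header instead of sniffing magic bytes, and to let the gzip module undo the compression:
import gzip
from urllib.request import urlopen

with urlopen("http://schema.org/docs/releases.html") as f:
    txt = f.read()
    # The header says how the body was encoded on the wire.
    if f.headers.get("Content-Encoding") == "gzip":
        txt = gzip.decompress(txt)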
The question remains: why does this return plain text/html sometimes and a gzipped stream at other times?

There are other questions in this category, but I cannot find an answer that addresses the actual cause of the problem.
Python's urllib.request.urlopen() cannot transparently handle compression, and by default it does not set the Accept-Encoding request header. Additionally, the interpretation of this situation according to the HTTP standard has changed over time.
As per RFC2616:
If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if "identity" is one of the available content-codings, then the server SHOULD use the "identity" content-coding, unless it has additional information that a different content-coding is meaningful to the client.
Unfortunately for this use case, RFC7231 changes this to:
If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.
Meaning: when performing a request using urlopen(), you can get a response in whatever content-coding the server decides to use, and the response will still be conformant.
schema.org appears to be hosted by Google, i.e. it is most likely behind a distributed front-end load-balancer network, so the differing answers you get might be returned by load balancers with slightly different configurations.
Google engineers have in the past advocated the use of HTTP compression, so this might well be a conscious decision.
So, as a lesson: when using urlopen() we need to set the Accept-Encoding header explicitly.

You are indeed receiving a gzipped response. You should be able to avoid it by:
from urllib import request
try:
    req = request.Request("http://schema.org/docs/releases.html")
    req.add_header('Accept-Encoding', 'identity;q=1')
    with request.urlopen(req) as f:
        txt = f.read()
except request.URLError:
    return

Related

Python http request and utf-8 encoding

I use urllib to perform HTTP requests to a REST API. All data are in JSON with UTF-8 encoding.
The problem is that when I read some special character via the REST API I get the correct encoding (i.e. if I read 90°C, the ° is correctly coded as 0xc2 0xb0), but when I send it back with another request it seems to lose the UTF-8 encoding (i.e. ° is coded as 0xb0).
I made a little test saving the response to a file: if I write the response as bytes I can see it is coded right; when I load the response as JSON and then write it to a file I lose the UTF-8 encoding.
resp = urllib.request.urlopen(req, context=context)
r = resp.read()
print(f'resp: {r}')

# Writing the raw bytes: the file contains the correct UTF-8 sequence (0xc2 0xb0).
f = open('test-utf8', 'wb')
f.write(r)
f.close()

content = json.loads(r)
print(content)

# Writing the decoded string in text mode: the encoding is the platform default.
f = open('test2-utf8', 'w')
f.write(content['descrizione'])
f.close()
If I make a new request sending that data after reading it with json.loads, I get this error:
unable to decode byte 0xb0 near '"'
If I use encode('utf-8') and decode('utf-8') before sending the request, it doesn't work either.
Where is my mistake?
After some tests we found out that it was an encoding problem on our Tomcat RESTful server.
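Independent of that server-side fix, a client-side sketch for keeping the UTF-8 intact is to encode the JSON body to bytes yourself and declare the charset explicitly; the endpoint URL and payload below are hypothetical:
import json
import urllib.request

url = 'https://example.com/api/items'  # hypothetical endpoint
payload = {'descrizione': '90°C'}      # hypothetical payload

# json.dumps returns str; encode it so the degree sign stays 0xc2 0xb0 on the wire.
body = json.dumps(payload, ensure_ascii=False).encode('utf-8')
req = urllib.request.Request(
    url,
    data=body,
    headers={'Content-Type': 'application/json; charset=utf-8'},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)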

Python3 - urllib.request.urlopen and readlines to utf-8?

Consider this example:
import urllib.request # Python3 URL loading
filelist_url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
filelist_fobj = urllib.request.urlopen(filelist_url)
#filelist_fobj_fulltext = filelist_fobj.read().decode('utf-8')
#print(filelist_fobj_fulltext) # ok, works
lines = filelist_fobj.readlines()
print(type(lines[0]))
This code prints the type of the first entry in the list returned by readlines() on the file object for the urlopen()'d URL:
<class 'bytes'>
... and in fact, all of the entries in the returned list are of the same type.
I am aware that I could do .read().decode('utf-8') as in the commented lines, and then split that result on \n -- however, I'd like to know: is there otherwise any way to use urlopen with .readlines() and get a list of ("utf-8") strings?
urllib.request.urlopen returns a http.client.HTTPResponse object, which implements the io.BufferedIOBase interface, which returns bytes.
The io module provides TextIOWrapper, which can wrap a BufferedIOBase object (or other similar objects) to add an encoding. The wrapped object's readlines method returns str objects decoded according to the coding you specified when you created the TextIOWrapper, so if you get the encoding right, everything will work. (On Unix-like systems, utf-8 is the default encoding, but apparently that's not the case on Windows. So if you want portability, you need to provide an encoding. I'll get back to that in a minute.)
So the following works fine:
>>> from urllib.request import urlopen
>>> from io import TextIOWrapper
>>> url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
>>> with urlopen(url) as response:
...     lines = TextIOWrapper(response, encoding='utf-8').readlines()
...
>>> for line in lines[:5]: print(type(line), line.strip())
...
<class 'str'> The following are the graphical (non-control) characters defined by
<class 'str'> ISO 8859-1 (1987). Descriptions in words aren't all that helpful,
<class 'str'> but they're the best we can do in text. A graphics file illustrating
<class 'str'> the character set should be available from the same archive as this
<class 'str'> file.
It's worth noting that both the HTTPResponse object and the TextIOWrapper which wraps it implement the iterator protocol, so you can use a loop like for line in TextIOWrapper(response, ...): instead of saving the entire web page using readlines(). The iterator protocol can be a big win because it lets you start processing the web page before it has all been downloaded.
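For illustration, a minimal sketch of that streaming style, reusing the same URL and assuming UTF-8:
from io import TextIOWrapper
from urllib.request import urlopen

url = "https://www.w3.org/TR/PNG/iso_8859-1.txt"
with urlopen(url) as response:
    # Each line is decoded to str as it is read; nothing is buffered up front.
    for line in TextIOWrapper(response, encoding='utf-8'):
        print(line.strip())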
Since I work on a Linux system, I could have left out the encoding='utf-8' argument to TextIOWrapper, but regardless, the assumption is that I know that the file is UTF-8 encoded. That's a pretty safe assumption, but it's not universally valid. According to W3Techs survey (updated daily, at least when I wrote this answer), 97.6% of websites use UTF-8 encoding, which means that one in 40 does not. (If you restrict the survey to what W3Techs considers the top 1,000 sites, the percentage increases to 98.7%. But that's still not universal.)
Now, the conventional wisdom, which you'll find in a number of SO answers, is that you should dig the encoding out of the HTTP headers, which you can fairly easily do:
>>> # Tempting though this is, DO NOT DO IT. See below.
>>> with urlopen(url) as response:
...     lines = TextIOWrapper(
...         response,
...         encoding=response.headers.get_content_charset()
...     ).readlines()
...
Unfortunately, that will only work if the website declares the content encoding in the HTTP headers, and many sites prefer to put the encoding in a meta tag. So when I tried the above with a randomly-selected Windows-1252-encoded site (taken from the W3Techs survey), it failed with an encoding error:
>>> with urlopen(win1252_url) as response:
...     lines = TextIOWrapper(
...         response,
...         encoding=response.headers.get_content_charset()
...     ).readlines()
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 346: invalid continuation byte
Note that although the page is encoded in Windows-1252, that information was not provided in the HTTP headers, so TextIOWrapper chose the default encoding, which on my system is UTF-8. If I supply the correct encoding, I can read the page without problems, letting me see the encoding declaration in the page itself.
>>> with urlopen(win1252_url) as response:
...     lines = TextIOWrapper(
...         response,
...         encoding='Windows-1252'
...     ).readlines()
...
>>> print(lines[3].strip())
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
Clearly, if the encoding is declared in the content itself, it's not possible to set the encoding before reading the content. So what to do in these cases?
The most general solution, and the simplest to code, appears to be the well-known BeautifulSoup package, which is capable of using a variety of techniques to detect the character encoding. Unfortunately, that requires parsing the entire page, which is a much more time-consuming task than just reading lines.
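As a rough sketch of that approach (it assumes the beautifulsoup4 package is installed and reuses the win1252_url variable from the examples above):
from urllib.request import urlopen
from bs4 import BeautifulSoup

with urlopen(win1252_url) as response:
    soup = BeautifulSoup(response.read(), 'html.parser')

# BeautifulSoup records whichever encoding it ended up detecting.
print(soup.original_encoding)
text = soup.get_text()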
Another option would be to read the first kilobyte or so of the webpage, as bytes, and then try to find a meta tag, as sketched below. Content providers are supposed to put the meta tag close to the beginning of the web page, and it certainly has to come before the first non-ASCII character. If you don't find a meta tag and there is no character encoding declared in the HTTP headers, then you could try to use a heuristic encoding detector on the bytes of the file already read.
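A sketch of that idea: scan the first kilobyte for a charset declaration with a crude regular expression (not a real HTML parser), fall back to the HTTP header, and finally to a default. win1252_url is the same variable as above.
import re
from urllib.request import urlopen

def sniff_charset(head, response, default='utf-8'):
    # 1. A charset declared in a meta tag within the first kilobyte wins.
    m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', head)
    if m:
        return m.group(1).decode('ascii')
    # 2. Otherwise fall back to the HTTP header, if the server sent one.
    # 3. Failing that, use the default (or hand `head` to a heuristic detector).
    return response.headers.get_content_charset() or default

with urlopen(win1252_url) as response:
    head = response.read(1024)
    encoding = sniff_charset(head, response)
    text = (head + response.read()).decode(encoding)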
The one thing you shouldn't do is rely solely on the character encoding declared in the HTTP header, regardless of the many suggestions to do so, which you will find here and elsewhere on the web. As we've already seen, the headers often don't contain this information, and even when they do, it is often wrong, because for a web designer it's easier to declare the encoding in the page itself than to reconfigure the server to send the correct headers. So you can't really rely on the HTTP header, and you should only use it if you have no other information to go on.

Persian text can not be parsed correctly when crawling a persian website [duplicate]

When the content-type of the server is 'Content-Type:text/html', requests.get() returns improperly encoded data.
However, if we have the content type explicitly as 'Content-Type:text/html; charset=utf-8', it returns properly encoded data.
Also, when we use urllib.urlopen(), it returns properly encoded data.
Has anyone noticed this before? Why does requests.get() behave like this?
The "educated guesses" (mentioned above) are probably just a check of the Content-Type header as sent by the server (quite a misleading use of "educated", imho).
For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (i.e. the default for HTML5 is UTF-8).
For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
Luckily for us, requests uses the chardet library and that usually works quite well (attribute requests.Response.apparent_encoding), so you usually want to do:
r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text
From requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check the encoding requests used for your page, and if it's not the right one - try to force it to be the one you need.
Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.
After getting the response, take response.content instead of response.text; that gives you the raw bytes, which in this case are UTF-8 encoded.
import requests

response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)
if response.status_code == 200:
    body = response.content
else:
    print("Unable to get response with Code : %d " % (response.status_code))
The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC-2854. UTF-8 was too young to become the default; it was born in 1993, about the same time as HTML and HTTP.
Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not care about the correct encoding, the value of .text may be off.
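As an illustration of that difference (the URL is hypothetical, and the server is assumed to actually send UTF-8 without declaring it):
import requests

r = requests.get('https://example.com/page')  # hypothetical URL

# .text decodes with r.encoding, which defaults to ISO-8859-1 for a bare text/html.
possibly_mojibake = r.text

# .content is the raw byte stream; decode it yourself when you know better.
correct = r.content.decode('utf-8')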

How to stream (upload) large amount of data with Requests in Python?

The requests module provides a high-level HTTP API. Using requests I'd like to send data via HTTP using a POST request. The documentation is very short about this, stating that a "file-like object" should be provided, without stating clearly what exactly requests would expect from that object. I have some binary data, but unfortunately it is generated data and I don't have a file-like object. How could I implement a "file-like object" myself that conforms to the expectations of requests? The documentation is quite poor in that regard and I wasn't able to clarify this by looking into the source code of requests myself. Has anyone done this before using the requests API?
"File-like object" is a standard Python term for an object that behaves like a file. This means that if you have a file, you already have a file-like object: simply open the file and pass the resulting object to Requests. If you have a more complex situation, you will need to give us a full description of the form of your data so we can help you more explicitly.
EDIT: To address your comment, here is the code to send a binary file to a host using Requests.
import requests

url = 'http://SomeSite/post'
files = {'files': ('mydata', open('mydata', mode='rb'), 'application/octet-stream')}
r = requests.post(url, files=files)
Opening the file with the Python open command creates the file-like object.
EDIT2: Whenever you open a file on disk, you create a file-like object in the process. However, Python supports other object types that act like files. Some examples are the standard stdin, stdout and stderr streams. In addition, pipes can be accessed using os.pipe and via subprocess.PIPE. These objects behave like files, i.e. they can be accessed with a subset of the file API, and their APIs behave in the same way as an object backed by a real file.
This is why they are called file-like: they use the same APIs and act in the same way. You can open, close, read or write a pipe in the same way as you do a file.
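For the original question's case of generated (not on-disk) data, two common approaches are sketched below: wrapping already-built bytes in io.BytesIO, or handing a generator to data= so Requests streams it with chunked transfer encoding. The URL and the generate_* helpers are hypothetical.
import io
import requests

url = 'http://SomeSite/post'  # hypothetical endpoint

# Option 1: the data already fits in memory - wrap it in a file-like object.
blob = generate_binary_data()           # hypothetical function returning bytes
r = requests.post(url, data=io.BytesIO(blob))

# Option 2: produce the data lazily - a generator body is sent chunk by chunk.
def chunks():
    for i in range(1000):
        yield generate_chunk(i)         # hypothetical function returning bytes

r = requests.post(url, data=chunks())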

TypeError: POST data should be bytes or an iterable of bytes. It cannot be str

I just updated from python 3.1 to python 3.2 (formatted HD) and one of my scripts stopped working. It gives me the error in the title.
I would fix it myself but I don't even know what an iterable of bytes is lol. I tried typecasting bytes(data) but that didn't work either. TypeError: string argument without an encoding
url = "http://example.com/index.php?app=core&module=global&section=login&do=process"
values = {"username" : USERNAME,
"password" : PASSWORD}
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data)
urllib.request.urlopen(req)
It crashes at the last line.
Works in 3.1, but not 3.2
You were basically correct in trying to convert the string into bytes, but you did it the wrong way. Python doesn't have typecasting (so what you did was not typecasting).
The way to do it is to encode the text data into bytes data, which you do with the encode function:
binary_data = data.encode('encoding')
What 'encoding' should be depends. You should probably use 'ascii' here. If you have characters that aren't ASCII, then you need to use another encoding, typically 'utf8', but then you also need to tell the receiving webserver that it is UTF-8. It might also not want UTF8, but then you have to ask it, and it's getting complicated. :-)
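Applied to the code from the question, a minimal sketch of the fix (assuming the credentials are ASCII-safe) looks like this:
import urllib.parse
import urllib.request

values = {"username": USERNAME, "password": PASSWORD}
# urlencode() returns a str; the POST body must be bytes, so encode it.
data = urllib.parse.urlencode(values).encode('ascii')

req = urllib.request.Request(url, data)
urllib.request.urlopen(req)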
@Enders, I know this is an old question, but I'd like to explain a few more things for somebody fighting with this issue.
It is specifically with this line of code:
data = urllib.parse.urlencode(values)
that you are having issues, because you are only URL-encoding the values into a str (urlencode), not encoding that string to bytes.
If you refer to the urllib.parse documentation and scroll to the bottom to see what urlencode does (https://docs.python.org/3/library/urllib.parse.html), you will see that you are trying to encode your user/pass into a data string:
Convert a mapping object or a sequence of two-element tuples, which may contain str or bytes objects, to a percent-encoded ASCII text string. If the resultant string is to be used as a data for POST operation with the urlopen() function, then it should be encoded to bytes, otherwise it would result in a TypeError.
Perhaps what you are trying to do here is some kind of encryption of your user/password, but I don't really think this is the right way. If it is, then you probably need to make sure that the receiving end (the destination of your URL) knows that you're encoding your user/pass this way.
A more up-to-date approach is to use the powerful Requests library. They have compatibility with very common authentication protocols: http://docs.python-requests.org/en/master/user/authentication/
In this case, I'd do something like this:
requests.get(url, auth=('user', 'pass'))
