Python3 - urllib.request.urlopen and readlines to utf-8?

Consider this example:
import urllib.request # Python3 URL loading
filelist_url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
filelist_fobj = urllib.request.urlopen(filelist_url)
#filelist_fobj_fulltext = filelist_fobj.read().decode('utf-8')
#print(filelist_fobj_fulltext) # ok, works
lines = filelist_fobj.readlines()
print(type(lines[0]))
This code prints the type of the first entry in the list returned by readlines() on the file object for the urlopen()'d URL as:
<class 'bytes'>
... and in fact, all of the entries in the returned list are of the same type.
I am aware that I could do .read().decode('utf-8') as in the commented lines, and then split that result on \n -- however, I'd like to know: is there otherwise any way to use urlopen with .readlines() and get a list of ("utf-8") strings?

urllib.request.urlopen returns an http.client.HTTPResponse object, which implements the io.BufferedIOBase interface, whose read methods return bytes.
The io module provides TextIOWrapper, which can wrap a BufferedIOBase object (or other similar objects) to add an encoding. The wrapped object's readlines method returns str objects decoded according to the encoding you specified when you created the TextIOWrapper, so if you get the encoding right, everything will work. (On Unix-like systems, utf-8 is the default encoding, but apparently that's not the case on Windows. So if you want portability, you need to provide an encoding. I'll get back to that in a minute.)
So the following works fine:
>>> from urllib.request import urlopen
>>> from io import TextIOWrapper
>>> url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
>>> with urlopen(url) as response:
...     lines = TextIOWrapper(response, encoding='utf-8').readlines()
...
>>> for line in lines[:5]: print(type(line), line.strip())
...
<class 'str'> The following are the graphical (non-control) characters defined by
<class 'str'> ISO 8859-1 (1987). Descriptions in words aren't all that helpful,
<class 'str'> but they're the best we can do in text. A graphics file illustrating
<class 'str'> the character set should be available from the same archive as this
<class 'str'> file.
It's worth noting that both the HTTPResponse object and the TextIOWrapper which wraps it implement the iterator protocol, so you can use a loop like for line in TextIOWrapper(response, ...): instead of saving the entire web page using readlines(). The iterator protocol can be a big win because it lets you start processing the web page before it has all been downloaded.
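For example, a streaming variant of the snippet above (same URL), which processes each line as it arrives instead of collecting them all first:
from io import TextIOWrapper
from urllib.request import urlopen

url = "https://www.w3.org/TR/PNG/iso_8859-1.txt"
with urlopen(url) as response:
    # Each line is decoded and yielded as the body is downloaded,
    # so processing can begin before the transfer finishes.
    for line in TextIOWrapper(response, encoding='utf-8'):
        print(line.rstrip())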
Since I work on a Linux system, I could have left out the encoding='utf-8' argument to TextIOWrapper, but regardless, the assumption is that I know that the file is UTF-8 encoded. That's a pretty safe assumption, but it's not universally valid. According to the W3Techs survey (updated daily, at least when I wrote this answer), 97.6% of websites use UTF-8 encoding, which means that one in 40 does not. (If you restrict the survey to what W3Techs considers the top 1,000 sites, the percentage increases to 98.7%. But that's still not universal.)
Now, the conventional wisdom, which you'll find in a number of SO answers, is that you should dig the encoding out of the HTTP headers, which you can fairly easily do:
>>> # Tempting though this is, DO NOT DO IT. See below.
>>> with urlopen(url) as response:
...     lines = TextIOWrapper(response,
...                           encoding=response.headers.get_content_charset()
...                          ).readlines()
...
Unfortunately, that will only work if the website declares the content encoding in the HTTP headers, and many sites prefer to put the encoding in a meta tag. So when I tried the above with a randomly-selected Windows-1252-encoded site (taken from the W3Techs survey), it failed with an encoding error:
>>> with urlopen(win1252_url) as response:
...     lines = TextIOWrapper(response,
...                           encoding=response.headers.get_content_charset()
...                          ).readlines()
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 346: invalid continuation byte
Note that although the page is encoded in Windows-1252, that information was not provided in the HTTP headers, so TextIOWrapper chose the default encoding, which on my system is UTF-8. If I supply the correct encoding, I can read the page without problems, letting me see the encoding declaration in the page itself.
>>> with urlopen(win1252_url) as response:
...     lines = TextIOWrapper(response,
...                           encoding='Windows-1252'
...                          ).readlines()
...
>>> print(lines[3].strip())
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
Clearly, if the encoding is declared in the content itself, it's not possible to set the encoding before reading the content. So what to do in these cases?
The most general solution, and the simplest to code, appears to be the well-known BeautifulSoup package, which is capable of using a variety of techniques to detect the character encoding. Unfortunately, that requires parsing the entire page, which is a much more time-consuming task than just reading lines.
Another option would be to read the first kilobyte or so of the webpage, as bytes, and then try to find a meta tag. Content providers are supposed to put the meta tag close to the beginning of the web page, and it certainly has to come before the first non-ASCII character. If you don't find a meta tag and there is no character encoding declared in the HTTP headers, then you could try to use a heuristic encoding detector on the bytes of the file already read.
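A minimal sketch of that approach (the meta-tag regex and the one-kilobyte cutoff are deliberate simplifications, and the fallback order anticipates the advice below):
import re
from urllib.request import urlopen

# Simplified: look for charset=... inside a meta tag.
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.I)

def read_with_sniffed_encoding(url, default='utf-8'):
    with urlopen(url) as response:
        head = response.read(1024)      # roughly the first kilobyte, as bytes
        body = head + response.read()   # remainder of the page
        header_charset = response.headers.get_content_charset()
    m = META_CHARSET.search(head)
    if m:                               # 1. meta tag in the page itself
        encoding = m.group(1).decode('ascii')
    elif header_charset:                # 2. HTTP header, as a fallback
        encoding = header_charset
    else:                               # 3. here a heuristic detector
        encoding = default              #    (e.g. chardet) could be tried
    return body.decode(encoding, errors='replace')
A real implementation would also handle the http-equiv form of the meta tag, and could hand step 3 to a detector such as the third-party chardet package instead of a fixed default.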
The one thing you shouldn't do is rely on the character encoding declared in the HTTP header, regardless of the many suggestions to do so, which you will find here and elsewhere on the web. As we've already seen, the headers often don't contain this information anyway, but even when they do, the information is often wrong, because for a web designer it's easier to declare the encoding in the page itself than to reconfigure the server to send the correct headers. So you can't really rely on the HTTP header, and you should only use it if you have no other information to go on.

Related

Encoding issues related to Python and foreign languages

Here's a problem I am facing with encoding and decoding texts.
I am trying to write code that finds a 'string' or 'bytes' in a file, and returns the path of the file.
The files I am opening are encoded in 'windows-1252' (a.k.a. 'cp1252'), so I have been trying to:
1. encode my string into a byte corresponding to the encoding of the file
2. match the file and get the path of that file
I have a file, say 'f', with that 'windows-1252'/'cp1252' encoding. It includes text in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text))  # encoding() is a separate function that I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you may see, the 'binary' text for [跑Online農場] is [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I literally convert the string into bytes, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp1252') as f: ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
and I am not sure how '[跑Online農場]' would 'translate' into '[¶]Online¹A³õ]'. An answer to this may also solve the problem.
What should I do to correctly 'encode' the Chinese/Foreign characters so that it matches the 'rb' bytes that the Python returns?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can do is trial and error with a few codecs.
I did this with the list of codecs supported by Python, searching for codecs for Chinese.
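A minimal sketch of that trial-and-error search (the candidate list is a hand-picked assumption, covering the common Chinese codecs plus cp1252 for comparison):
candidates = ['cp950', 'big5', 'gbk', 'gb18030', 'cp1252']

raw = b'[\xb6]Online\xb9A\xb3\xf5]'
expected = '[跑Online農場]'

for codec in candidates:
    try:
        if raw.decode(codec) == expected:
            print('match:', codec)
    except UnicodeDecodeError:
        pass  # the bytes are not valid in this codec; try the next one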
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.
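Applied to the question's open() call, that would look like this (root and filename stand in for the question's path variables):
import os

root, filename = '.', 'f'  # hypothetical path pieces, as in the question
with open(os.path.join(root, filename), mode='r', encoding='cp950') as f:
    text = f.read()        # '[跑Online農場]' now decodes correctly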

Python 3 pickle load from Python 2

I have a pickle file that was created (I don't know how exactly) in python 2. It is intended to be loaded by the following python 2 lines, which when used in python 3 (unsurprisingly) do not work:
with open('filename','r') as f:
    foo, bar = pickle.load(f)
Result:
'ascii' codec can't decode byte 0xc2 in position 1219: ordinal not in range(128)
Manual inspection of the file indicates it is utf-8 encoded, therefore:
with open('filename','r', encoding='utf-8') as f:
    foo, bar = pickle.load(f)
Result:
TypeError: a bytes-like object is required, not 'str'
With binary encoding:
with open('filename','rb', encoding='utf-8') as f:
    foo, bar = pickle.load(f)
Result:
ValueError: binary mode doesn't take an encoding argument
Without binary encoding:
with open('filename','rb') as f:
    foo, bar = pickle.load(f)
Result:
UnpicklingError: invalid load key, '
'.
Is this pickle file just broken? If not, how can I pry this thing open in python 3? (I have browsed the extensive collection of related questions and not found anything that works yet.)
Finally, note that the original
import cPickle as pickle
has been replaced with
import _pickle as pickle
Loading Python 2 pickles in Python 3 (version 3.7.2 in this example) can be helped by the fix_imports parameter of the pickle.load function, but in my case it also worked without setting that parameter to True.
I was attempting to load a scipy.sparse.csr.csr_matrix contained in a pickle generated using Python 2.
When inspecting the file format using the UNIX command file, it says:
>file -bi python2_generated.pckl
application/octet-stream; charset=binary
I could load the pickle in Python3 using the following code:
with open("python2_generated.pckl", "rb") as fd:
bh01 = pickle.load(fd, fix_imports=True, encoding="latin1")
Note that the loading was successful with and without setting fix_imports to True
As for the "latin1" encoding, the Python3 documentation (version 3.7.2) for the pickle.load function says:
Using encoding='latin1' is required for unpickling NumPy arrays and instances of datetime, date and time pickled by Python 2
Although this is specifically about scipy matrices (or NumPy arrays), and since Novak did not clarify what his pickle file contained,
I hope this can be of help to other users :)
Two errors were compounding each other.
First: By the time the .p file reached me, it had almost certainly been corrupted in transit, likely by FTP-ing (or similar) in ASCII rather than binary mode. I was able to get my hands on a properly transmitted copy, which allowed me to discover...
Second: Whatever the file might have implied on the inside, the proper encoding was 'latin1' not 'utf-8'.
So in a sense, yes, the file was broken, and even after that I was doing it wrong. I leave this here as a reminder to whoever eventually has the next bizarre pickle/python2/python3 issue that there can be multiple things gone wrong, and they have to be solved in the correct order.

Python3 urlopen read weirdness (gzip)

I'm getting a URL from Schema.org. Its content-type is "text/html".
Sometimes, read() functions as expected b'< !DOCTYPE html> ....'
Sometimes, read() returns something else b'\x1f\x8b\x08\x00\x00\x00\x00 ...'
try:
    with urlopen("http://schema.org/docs/releases.html") as f:
        txt = f.read()
except URLError:
    return
I've tried solving this with txt = f.read().decode("utf-8").encode() but this results in an error... sometimes: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The obvious work-around is to test whether the first byte is 0x1f (the start of the gzip magic number) and treat the data accordingly.
My question is: Is this a bug or something else?
Edit
Related question. Apparently, sometimes I'm getting a gzipped stream.
Lastly
I solved this by adding the following code as proposed here
from zlib import decompress, MAX_WBITS

if 31 == txt[0]:  # 31 == 0x1f, the first byte of the gzip magic number
    txt = decompress(txt, 16 + MAX_WBITS)
The question remains; why does this return text/html sometimes and zipped some other times?
There are other questions in this category, but I cannot find an answer that addresses the actual cause of the problem.
Python's urllib.request.urlopen() cannot transparently handle compression. It also does not set the Accept-Encoding request header by default. Additionally, the interpretation of this situation according to the HTTP standard has changed in the past.
As per RFC2616:
If no Accept-Encoding field is present in a request, the server MAY
assume that the client will accept any content coding. In this case,
if "identity" is one of the available content-codings, then the
server SHOULD use the "identity" content-coding, unless it has
additional information that a different content-coding is meaningful
to the client.
Unfortunately for this use case, RFC7231 changes this to:
If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.
Meaning: when performing a request using urlopen(), you can get a response in whatever content coding the server decides to use, and the response will still be conformant.
schema.org seems to be hosted by Google, i.e. it is most likely behind a distributed frontend load balancer network. So the different answers you get might be returned by load balancers with slightly different configurations.
Google engineers have in the past advocated for the use of HTTP compression, so this might as well be a conscious decision.
So as a lesson: when using urlopen() we need to set Accept-Encoding.
You are indeed receiving a gzipped response. You should be able to avoid it by:
from urllib import request

try:
    req = request.Request("http://schema.org/docs/releases.html")
    req.add_header('Accept-Encoding', 'identity;q=1')
    with request.urlopen(req) as f:
        txt = f.read()
except request.URLError:
    return
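Alternatively, if you would rather accept compressed responses and decompress them yourself, a minimal sketch that trusts the Content-Encoding response header instead of sniffing magic bytes:
import gzip
from urllib.request import urlopen

with urlopen("http://schema.org/docs/releases.html") as f:
    txt = f.read()
    # Check what the server says it sent, rather than the first byte.
    if f.headers.get('Content-Encoding') == 'gzip':
        txt = gzip.decompress(txt)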

Python's handling of shell strings

I still do not completely understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 has a completely different approach to the same issue.
What I know:
str is an older beast that saves strings encoded by one of the way too many encodings that history has forced us to work with.
unicode is a more standardised way of representing strings using a huge table of all possible characters, emojis, little pictures of dog poop and so on.
The decode method transforms a str into unicode; encode goes the other way around.
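For example (Python 2, assuming the source file is saved as UTF-8), a minimal round trip:
# -*- coding: utf-8 -*-
u = u'kožušček'                # a unicode object
s = u.encode('utf-8')          # encode: unicode -> str (bytes)
assert s.decode('utf-8') == u  # decode: str -> unicode, round trip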
If I, in python's shell, simply say:
>>> my_string = "some string"
then my_string is a str variable encoded in ascii (and, because ascii is a subset of utf-8, it is also encoded in utf-8).
Therefore, for example, I can convert this into a unicode variable by saying one of the lines:
>>> my_string.decode('ascii')
u'some string'
>>> my_string.decode('utf-8')
u'some string'
What I don't know:
How does Python handle non-ascii strings that are passed in the shell, and, knowing this, what is the correct way of saving the word "kožušček"?
For example, I can say
>>> s1 = 'kožušček'
In which case s1 becomes a str instance that I am unable to convert into unicode:
>>> s1='kožušček'
>>> s1
'ko\x9eu\x9a\xe8ek'
>>> print s1
kožušček
>>> s1.decode('ascii')
Traceback (most recent call last):
File "<pyshell#23>", line 1, in <module>
s1.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)
Now, naturally I can't decode the string with ascii, but what encoding should I then use? After all, my sys.getdefaultencoding() returns ascii! Which encoding did Python use to encode s1 when fed the line s1=kožušček?
Another thought I had was to say
>>> s2 = u'kožušček'
But then, when I printed s2, I got
>>> print s2
kouèek
which means that Python lost a whole letter. Can someone explain this to me?
str objects contain bytes. What those bytes represent Python doesn't dictate. If you produced ASCII-compatible bytes, you can decode them as ASCII. If they contain bytes representing UTF-8 data they can be decoded as such. If they contain bytes representing an image, then you can decode that information and display an image somewhere. When you use repr() on a str object Python will leave any bytes that are ASCII printable as such, the rest are converted to escape sequences; this keeps debugging such information practical even in ASCII-only environments.
Your terminal or console in which you are running the interactive interpreter writes bytes to the stdin stream that Python reads from when you type. Those bytes are encoded according to the configuration of that terminal or console.
In your case, your console encoded the input you typed to a Windows codepage, most likely. You'll need to figure out the exact codepage and use that codec to decode the bytes. Codepage 1252 seems to fit:
>>> print 'ko\x9eu\x9a\xe8ek'.decode('cp1252')
kožušèek
When you print those same bytes, your console is reading those bytes and interpreting them in the same codec it is already configured with.
Python can tell you what codec it thinks your console is set to; it tries to detect this information for Unicode literals, where the input has to be decoded for you. It uses the locale.getpreferredencoding() function to determine this, and the sys.stdin and sys.stdout objects have an encoding attribute; mine is set to UTF-8:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> 'kožušèek'
'ko\xc5\xbeu\xc5\xa1\xc3\xa8ek'
>>> u'kožušèek'
u'ko\u017eu\u0161\xe8ek'
>>> print u'kožušèek'
kožušèek
Because my terminal has been configured for UTF-8 and Python has detected this, using a Unicode literal u'...' works. The data is automatically decoded by Python.
Why exactly your console lost a whole letter I don't know; I'd have to have access to your console and do some more experiments, see the output of print repr(s2), and test all bytes between 0x00 and 0xFF to see if this is on the input or output side of the console.
I recommend you read up on Python and Unicode:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Your system does not necessarily use the sys.getdefaultencoding() encoding; it is merely the default used when you convert without telling it the encoding, as in:
>>> sys.getdefaultencoding()
'ascii'
>>> unicode(s1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)
Python's idea of your system locale is in the locale module:
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'
And using this we can decode the string:
>>> u1=s1.decode(locale.getdefaultlocale()[1])
>>> u1
u'ko\u017eu\u0161\u010dek'
>>> print u1
kožušček
There's a chance the locale has not been set up, as is the case for the 'C' locale. That may cause the reported encoding to be None even though the default is 'ascii'. Normally figuring this out is the job of setlocale, which getpreferredencoding will automatically call. I would suggest calling it once in your program startup and saving the value returned for all further use. The encoding used for filenames may also be yet another case, reported in sys.getfilesystemencoding().
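A minimal sketch of that startup pattern:
import locale

# Call setlocale once at startup so the process picks up the
# environment's locale instead of the minimal 'C' locale...
locale.setlocale(locale.LC_ALL, '')

# ...then save the preferred encoding for all further use.
PREFERRED_ENCODING = locale.getpreferredencoding()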
The Python-internal default encoding is set up by the site module, which contains:
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
So if you want it set by default in every run of Python, you can change that first if 0 to if 1.

Python 3.1 server-side can't output Unicode string to client

I'm using a free web host and choosing not to work with any Python framework, and I am stuck trying to print Chinese characters saved in the source file (using Emacs to save the file encoded in UTF-8) to the resulting HTML page. I thought Unicode "just works" in Python 3.1, so I am baffled. I found three solutions that aren't working. I might just be missing a detail or two.
The host is Alwaysdata, and it has been straightforward to use, so I have little clue about details of how they put together the parts. All I do is upload or edit (with ssh) Python files to a www folder, change permissions, point a browser to the right URL, and it works.
My first attempt, which works in local IDLE (and also in the server's Python command-line interactive shell, which makes me even more confused about why it won't work when the output goes to a browser):
#!/usr/bin/python3.1
mystr = "你好世界"
print("Content-Type: text/html\n\n")
print("""<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body>""")
print(mystr)
The error is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3:
ordinal not in range(128)
Then I tried
print(mystr.encode("utf-8"))
resulting in no error, but the following undesired output to the browser:
b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
Third, I added the following lines but got an error:
import sys
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Finally, replacing print with f.write:
import codecs
f = codecs.open(sys.stdout, "w", "utf-8")
mystr = "你好世界"
...
f.write(mystr)
error:
TypeError: invalid file: <_io.TextIOWrapper name='<stdout>'
encoding='ANSI_X3.4-1968'>
How do I get the output to work? Do I need to use a framework for a quick fix?
It sounds like you are using CGI, which is a stupid API as it's using stdout, made for output to humans, to output to your browser. This is the basic source of your problems.
You need to encode it in UTF-8, and then write to sys.stdout.buffer instead of sys.stdout.
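For example, a minimal sketch of that fix applied to the first attempt:
import sys

mystr = "你好世界"
out = sys.stdout.buffer  # the underlying binary stream
out.write(b"Content-Type: text/html; charset=utf-8\r\n\r\n")
out.write("""<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body>""".encode('utf-8'))
out.write(mystr.encode('utf-8'))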
And after that, get yourself a web framework. Really, you'll be a lot happier.
