Python 3.4: str: AttributeError: 'str' object has no attribute 'decode' - python-3.x

I have this code as part of a function that replaces badly encoded foreign characters in a string:
s = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s)
# b"String from an old database with weird mixed encodings"
I need a "real" string here, not bytes. But when I try to decode it, I get an exception:
s = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s.decode("utf-8"))
# AttributeError: 'str' object has no attribute 'decode'
Do you know why s is bytes here?
Why can't I decode it to a real string?
Do you know how to do it the clean way? (Today I return s[2:][:-1]. It works but it is very ugly, and I would like to understand this behavior.)
Thanks in advance!
EDIT :
pypyodbc in Python 3 uses Unicode everywhere by default, which is what confused me. On connect, you can tell it to use ANSI:
con_odbc = pypyodbc.connect("DSN=GP", False, False, 0, False)
Then I can convert the returned values using cp850, which is the original code page of the database:
str(odbc_str, "cp850", "replace")
No more need to manually replace each special character.
Thank you very much pepr

The printed b"String from an old database with weird mixed encodings" is not the representation of the string content. It is the value of the string content. As you did not pass the encoding argument to str()... (see the doc https://docs.python.org/3.4/library/stdtypes.html#str)
If neither encoding nor errors is given, str(object) returns object.__str__(), which is the “informal” or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).
This is what happened in your case. The leading b and the quote are actually two characters that are part of the string content. You can also try:
s1 = 'String from an old database with weird mixed encodings'
print(type(s1), repr(s1))
by = bytes(s1, 'cp1252')
print(type(by), repr(by))
s2 = str(by)
print(type(s2), repr(s2))
and it prints:
<class 'str'> 'String from an old database with weird mixed encodings'
<class 'bytes'> b'String from an old database with weird mixed encodings'
<class 'str'> "b'String from an old database with weird mixed encodings'"
This is the reason why s[2:][:-1] works for you.
If you think more about it, then (in my opinion) either you want to get bytes or a bytearray from the database (if possible) and fix the bytes (see bytes.translate, https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#bytes.translate), or you successfully get the string (being lucky that there was no exception when constructing it) and you want to replace the wrong characters with the correct ones (see also str.translate(), https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#str.translate), as sketched below.
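For example, a minimal sketch of the str.translate() route, assuming the misdecoded characters are '\x82' and '\x8a' as in the question (the real mapping depends on your data):
# Hypothetical mapping from wrong code points to the intended letters.
fix_table = str.maketrans({'\x82': 'é', '\x8a': 'è'})
s = 'caf\x82'                   # example string with a misdecoded character
print(s.translate(fix_table))   # café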
Possibly, the ODBC layer internally used the wrong encoding. (That is, the content of the database may be correct, but it was misinterpreted by the ODBC layer, and you are not able to tell it what the correct encoding is.) Then you want to encode the string back to bytes using that wrong encoding, and then decode the bytes using the right encoding.
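For instance, a minimal sketch of that round trip, assuming the database stores cp850 bytes and the driver effectively decoded them with a one-byte-per-character encoding such as latin-1 (the exact code pages here are an assumption):
wrongly_decoded = 'caf\x82'                    # what the ODBC layer handed back
raw_bytes = wrongly_decoded.encode('latin-1')  # recover the original bytes
fixed = raw_bytes.decode('cp850')              # decode with the correct code page
print(fixed)                                   # café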

Related

Python - Convert unicode entity into unicode symbol

My request.json(), when I loop through the dict it returns from an API, gives me "v\u00F6lk" (without the quotes).
But I want "völk" (without the quotes), which is how it appears raw in the API.
How do I convert it?
request = requests.post(get_sites_url, headers=api_header, params=search_sites_params, timeout=http_timeout_seconds)
return_search_results = request.json()
for site_object in return_search_results['data']:
    site_name = str(site_object['name'])
    site_name_fixed = str(site_name.encode("utf-8").decode())
    print("fixed site_name: " + site_name_fixed)
My guess: the API is actually returning the literal escaped version, so he is really getting:
"v\\u00F6lk"
Printing that gives what we think we are getting from the api:
print("v\\u00F6lk")
v\u00F6lk
I am not sure if there is a better way to do this, but encoding it with "utf-8" and then decoding with "unicode_escape" seemed to work:
>>> print(bytes("v\\u00F6lk", "utf-8").decode("unicode_escape"))
völk
>>> print("v\\u00F6lk".encode("utf-8").decode("unicode_escape"))
völk
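Applied to the loop from the question, a minimal sketch might look like this (assuming the 'name' values really contain literal \uXXXX escape sequences; the URL, headers and parameters are the ones from the question):
import requests

request = requests.post(get_sites_url, headers=api_header,
                        params=search_sites_params, timeout=http_timeout_seconds)
return_search_results = request.json()
for site_object in return_search_results['data']:
    site_name = str(site_object['name'])
    # Turn literal \uXXXX sequences into the characters they name.
    site_name_fixed = site_name.encode("utf-8").decode("unicode_escape")
    print("fixed site_name: " + site_name_fixed)
One caveat: unicode_escape treats the bytes as Latin-1, so this only round-trips cleanly when the names are ASCII plus escape sequences.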

How to use f'string bytes'string together? [duplicate]

I'm looking for a formatted byte string literal. Specifically, something equivalent to
name = "Hello"
bytes(f"Some format string {name}")
Possibly something like fb"Some format string {name}".
Does such a thing exist?
No. The idea is explicitly dismissed in the PEP:
For the same reason that we don't support bytes.format(), you may
not combine 'f' with 'b' string literals. The primary problem
is that an object's __format__() method may return Unicode data
that is not compatible with a bytes string.
Binary f-strings would first require a solution for
bytes.format(). This idea has been proposed in the past, most
recently in PEP 461. The discussions of such a feature usually
suggest either
adding a method such as __bformat__() so an object can control how it is converted to bytes, or
having bytes.format() not be as general purpose or extensible as str.format().
Both of these remain as options in the future, if such functionality
is desired.
In 3.6+ you can do:
>>> a = 123
>>> f'{a}'.encode()
b'123'
You were actually super close in your suggestion; if you add an encoding kwarg to your bytes() call, then you get the desired behavior:
>>> name = "Hello"
>>> bytes(f"Some format string {name}", encoding="utf-8")
b'Some format string Hello'
Caveat: This works in 3.8 for me, but the note at the bottom of the Bytes Objects section of the docs seems to suggest that this should work with any method of string formatting in all of 3.x (using str.format() for versions <3.6, since that is when f-strings were added, though the OP specifically asks about 3.6+).
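For example, a small sketch of the pre-3.6 equivalent, using str.format() instead of an f-string (same idea, no f-string syntax required):
name = "Hello"
data = bytes("Some format string {}".format(name), encoding="utf-8")
print(data)  # b'Some format string Hello'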
From Python 3.6.2 on, this percent formatting for bytes works for some use cases:
print(b"Some stuff %a. Some other stuff" % my_byte_or_unicode_string)
But as AXO commented:
This is not the same. %a (or %r) will give the representation of the string, not the string itself. For example, b'%a' % b'bytes' will give b"b'bytes'", not b'bytes'.
Which may or may not matter depending on if you need to just present the formatted byte_or_unicode_string in a UI or if you potentially need to do further manipulation.
As noted here, you can format this way:
>>> name = b"Hello"
>>> b"Some format string %b World" % name
b'Some format string Hello World'
You can see more details in PEP 461
Note that in your example you could simply do something like:
>>> name = b"Hello"
>>> b"Some format string " + name
b'Some format string Hello'
This was one of the bigger changes from Python 2 to Python 3; they handle Unicode and strings differently.
This is how you'd convert a string to bytes:
string = "some string format"
encoded = string.encode()
print(encoded)  # b'some string format'
And this is how you'd decode back to a string:
encoded.decode()
I had a better appreciation for the difference between Python 2 versus 3 change to unicode through this coursera lecture by Charles Severence. You can watch the entire 17 minute video or fast forward to somewhere around 10:30 if you want to get to the differences between python 2 and 3 and how they handle characters and specifically unicode.
I understand your actual question is how you could combine strings and bytes into one formatted string.
inBytes = b"testing"
inString = 'Hello'
type(inString) #This will yield <class 'str'>
type(inBytes) #this will yield <class 'bytes'>
Here you can see that I have a string variable and a bytes variable.
This is how you would combine a bytes object and a string into one string:
formattedString = inString + ' ' + inBytes.decode()

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n, are processed correctly.
Is there anything I can do to tell RequestParser which encoding the string uses?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure it's RequestParser() tries to be too smart and can't understand the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to #lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes as UTF-8. So when it sees a non-Unicode input field (among other Unicode fields!), it tries to load it as Unicode and fails. As a result, all characters are U+FFFD (replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.
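Put together, a minimal sketch of that workaround as a helper function might look like this (the field name "text" and the windows-1251 code page come from the question; wrapping it in a function is my own choice):
import re
from flask import request

def get_text_field_cp1251():
    # Read the raw multipart body instead of letting reqparse decode it as UTF-8.
    raw_data = request.get_data()
    contents = raw_data.decode('windows-1251')
    match = re.search(
        r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)',
        contents, re.MULTILINE | re.DOTALL)
    return match.group(2) if match else None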

pyqt5 fails to insert bytes

How do you insert a bytes object into a database. Whenever I attempt this I end up with an empty string (not NULL) being inserted.
with open(filepath, 'rb') as f:
    filehash = hashlib.md5(f.read()).hexdigest()
img_pkl = pickle.dumps(img, protocol=4)
record = self.tablemodel.record()
record.setValue('originfile_path', filepath)
record.setValue('originfile_hash', filehash)
record.setValue('image', img_pkl)
record.setValue('area', area)
self.tablemodel.insertRecord(-1, record)
The issue was noted on the mailing list but never addressed.
https://www.riverbankcomputing.com/pipermail/pyqt/2016-April/037260.html
It also turns out that it makes more sense to use the table model when you can instead of prepared statements (he also noted a bug in PyQt regarding exec() vs exec_()).
You must explicitly convert to a QByteArray.
record.setValue('image', QtCore.QByteArray(img_pkl))
Note: you also have to convert numpy.float64 objects using float()
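Applied to the snippet from the question, a minimal sketch of the corrected inserts might look like this (assuming area is a numpy.float64 and the surrounding model setup is as shown above):
from PyQt5 import QtCore

# Wrap the pickled bytes in a QByteArray so the model stores a BLOB
# instead of an empty string.
record.setValue('image', QtCore.QByteArray(img_pkl))
# Convert numpy.float64 to a plain Python float.
record.setValue('area', float(area))
self.tablemodel.insertRecord(-1, record)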

Extracting source code from html file using python3.1 urllib.request

I'm trying to extract data from an HTML file using regular expressions, with the following code:
import re
import urllib.request

def extract_words(wdict, urlname):
    uf = urllib.request.urlopen(urlname)
    text = uf.read()
    print(text)
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
which returns an error:
File "extract.py", line 33, in extract_words
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Upon experimenting further in IDLE, I noticed that uf.read() does indeed return the HTML source code the first time I invoke it, but from then on it returns just b''. Is there any way to get around this?
uf.read() will only read the contents once. Then you have to close it and reopen it to read it again. This is true for any kind of stream. This is however not the problem.
The problem is that reading from any kind of binary source, such as a file or a webpage, will return the data as a bytes type, unless you specify an encoding. But your regexp is not specified as a bytes type, it's specified as a unicode str.
The re module will quite reasonably refuse to use unicode patterns on byte data, and the other way around.
The solution is to make the regexp pattern a bytes string, which you do by putting a b in front of it. Hence:
match = re.findall(b"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
Should work. Another option is to decode the text so it also is a unicode str:
encoding = uf.headers.get_content_charset()  # the Python 3 way to read the charset from the headers
text = text.decode(encoding)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
(Also, to extract data from HTML, I would say that lxml is a better option).
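For instance, a minimal sketch of the lxml route, assuming you want the text of the first two cells of every table row (the selection logic is my guess at an equivalent of the regexp above):
import urllib.request
from lxml import html

def extract_rows(urlname):
    uf = urllib.request.urlopen(urlname)
    tree = html.fromstring(uf.read())   # lxml accepts bytes and sniffs the encoding
    uf.close()
    rows = []
    for tr in tree.xpath('//tr'):
        cells = [td.text_content().strip() for td in tr.xpath('td')]
        if len(cells) >= 2:
            rows.append((cells[0], cells[1]))
    return rows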
