Decode a Python string - python-3.x

Sorry for the generic title.
I am receiving a string from an external source: txt = external_func()
I am copying/pasting the output of various commands to make sure you see what I'm talking about:
In [163]: txt
Out[163]: '\\xc3\\xa0 voir\\n'
In [164]: print(txt)
\xc3\xa0 voir\n
In [165]: repr(txt)
Out[165]: "'\\\\xc3\\\\xa0 voir\\\\n'"
I am trying to transform that text to UTF-8 (?) to have txt = "à voir\n", and I can't see how.
How can I do transformations on this variable?

You can encode your txt to a bytes-like object using the encode-method of the str class.
Then this byte-like object can be decoded again with the encoding unicode_escape.
Now you have your string with all escape sequences parsed, but latin-1 decoded. You still have to encode it with latin-1 and then decode it again with utf-8.
>>> txt = '\\xc3\\xa0 voir\\n'
>>> txt.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'à voir\n'
The codecs module also has an undocumented funciton called escape_decode:
>>> import codecs
>>> codecs.escape_decode(bytes('\\xc3\\xa0 voir\\n', 'utf-8'))[0].decode('utf-8')
'à voir\n'

Related

How to decode string with unicode in python?

I have the following line:
%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22
Expected Result:
{"appVersion":1,"modulePrefix":"web-experience-app","environment":"production","rootURL":"/","
You can check it out here.
What I tried:
foo = '%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22'
codecs.decode(foo, 'unicode-escape')
foo.encode('utf-8').decode('utf-8')
This does not work. What other options are there?
The string is urlencoded. You can convert it by reversing the urlencoding.
from urllib import parse
s = '%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22'
unquoted = parse.unquote(s)
unquoted
'{"appVersion":1,"modulePrefix":"web-experience-app","environment":"production","rootURL":"/","'
This looks like part of a larger JSON string. The complete object can be de-serialised with json.loads.

How does ruamel.yaml determine the encoding of escaped byte sequences in a string?

I am having trouble figuring out where to modify or configure ruamel.yaml's loader to get it to parse some old YAML with the correct encoding. The essence of the problem is that an escaped byte sequence in the document seems to be interpreted as latin1, and I have no earthly clue where it is doing that, after some source diving here. Here is a code sample that demonstrates the behavior (this in particular was run in Python 3.6):
from ruamel.yaml import YAML
yaml = YAML()
yaml.load('a:\n b: "\\xE2\\x80\\x99"\n') # Note that this is a str (that is, unicode) with escapes for the byte escapes in the YAML document
# ordereddict([('a', ordereddict([('b', 'â\x80\x99')]))])
Here are the same bytes decoded manually, just to show what it should parse to:
>>> b"\xE2\x80\x99".decode('utf8')
'’'
Note that I don't really have any control over the source document, so modifying it to produce the correct output with ruamel.yaml is out of the question.
ruamel.yaml doesn't interpret individual strings, it interprets the
stream it gets hanled, i.e. the argument to .load(). If that
argument is a byte-stream or a file like object then its encoding is
determined based on the BOM, defaulting to UTF-8. But again: that is
at the stream level, not at individual scalar content after
interpreting escapes. Since you hand .load() Unicode (as this is
Python 3) that "stream" needs no further decoding. (Although
irrelevant for this question: it is done in the reader.py:Reader methods stream and
determine_encoding)
The hex escapes (of the form \xAB), will just put a specific hex
value in the type the loader uses to construct the scalar, that is
value for key 'b', and that is a normal Python 3 str i.e. Unicode in
one of its internal representations. That you get the â in your
output is because of how your Python is configured to decode it str
tyes.
So you won't "find" the place where ruamel.yaml decodes that
byte-sequence, because that is already assumed to be Unicode.
So the thing to do is that you double decode your double quoted
scalars (you only have to address those as plain, single quoted,
literal/folded scalars cannot have the hex escapes). There are various
points at which you can try to do that, but I think
constructor.py:RoundTripConsturtor.construct_scalar and
scalarstring.py:DoubleQuotedScalarString are the best candidates. The former of those might take some digging to find, but the latter is actually the type you'll get if you inspect
that string after loading when you add the option to preserve quotes:
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(type(data['a']['b']))
which prints:
<class 'ruamel.yaml.scalarstring.DoubleQuotedScalarString'>
knowing that you can inspect that rather simple wrapper class:
class DoubleQuotedScalarString(ScalarString):
__slots__ = ()
style = '"'
def __new__(cls, value, anchor=None):
# type: (Text, Any) -> Any
return ScalarString.__new__(cls, value, anchor=anchor)
"update" the only method there (__new__) to do your double
encoding (you might have to put in additional checks to not double encode all
double quoted scalars0:
import sys
import codecs
import ruamel.yaml
def my_new(cls, value, anchor=None):
# type information only needed if using mypy
# value is of type 'str', decode to bytes "without conversion", then encode
value = value.encode('latin_1').decode('utf-8')
return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value, anchor=anchor)
ruamel.yaml.scalarstring.DoubleQuotedScalarString.__new__ = my_new
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(data)
which gives:
ordereddict([('a', ordereddict([('b', '’')]))])

How to encode Cyrillic characters in JSON

I want to read a JSON file containing Cyrillic symbols.
The Cyrillic symbols are represented like \u123.
Python converts them to '\\u123' instead of the Cyrillic symbol.
For example, the string "\u0420\u0435\u0433\u0438\u043e\u043d" should become "Регион", but becomes "\\u0420\\u0435\\u0433\\u0438\\u043e\\u043d".
encode() just makes string look like u"..." or adds a new \.
How do I convert "\u0420\u0435\u0433\u0438\u043e\u043d" to "Регион"?
If you want json to output a string that has non-ASCII characters in it then you need to pass ensure_ascii=False and then encode manually afterward.
Just use the json module.
import json
s = "\u0420\u0435\u0433\u0438\u043e\u043d"
# Generate a json file.
with open('test.json','w',encoding='ascii') as f:
json.dump(s,f)
# Reading it directly
with open('test.json') as f:
print(f.read())
# Reading with the json module
with open('test.json',encoding='ascii') as f:
data = json.load(f)
print(data)
Output:
"\u0420\u0435\u0433\u0438\u043e\u043d"
Регион

How to handle utf-8 text with Python 3?

I need to parse various text sources and then print / store it somewhere.
Every time a non ASCII character is encountered, I can't correctly print it as it gets converted to bytes, and I have no idea how to view the correct characters.
(I'm quite new to Python, I come from PHP where I never had any utf-8 issues)
The following is a code example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')
print(title)
file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()
I'd like to print and write in a file the RSS title (BBC Japanese - ホーム) but instead the result is this:
b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'
Both on screen and file. Is there a proper way to do this ?
In python3 bytes and str are two different types - and str is used to represent any type of string (also unicode), when you encode() something, you convert it from it's str representation to it's bytes representation for a specific encoding.
In your case in order to the decoded strings, you just need to remove the encode('utf-8') part:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title')
print(title)
file = codecs.open("test.txt", "w", encoding="utf-8")
file.write(title)
file.close()
The function print(A) in python3 will first convert the string A to bytes with its original encoding, and then print it through 'gbk' encoding.
So if you want to print A in utf-8, you first need to convert A with gbk as follow:
print(A.encode('gbk','ignore').decode('gbk'))
JSON data to Unicode support for Japanese characters
def jsonFileCreation (messageData, fileName):
with open(fileName, "w", encoding="utf-8") as outfile:
json.dump(messageData, outfile, indent=8, sort_keys=False,ensure_ascii=False)

urlopen() throwing error in python 3.3

from urllib.request import urlopen
def ShowResponse(param):
uri = str("mysite.com/?param="+param+"&submit=submit")
print(urlopen(uri).read())
file = open("myfile.txt","r")
if file.mode == "r":
filelines = file.readlines()
for line in filelines:
line = line.strip()
ShowResponse(line)
this is my python code but when i run this it causes an error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-49: ordinal not in range(128)"
i dont know how to fix this. im new to python
I'm going to assume that the stack trace shows that line 4 (uri = str(...) is throwing the given error and myfile.txt contains UTF-8 characters.
The error is because you're trying to convert a Unicode object (decoded from assumed UTF-8) to an ASCII string object. ASCII simply can not represent your character.
URIs (including the Query String) must encode non-ASCII chars as percent-encoded UTF-8 bytes. Example:
€ (EURO SIGN) is encoded in UTF-8 is:
0xE2 0x82 0xAC
Percent-encoded, it's:
%E2%82%AC
Therefore, your code needs to re-encode your parameter to UTF-8 then percent-encode it:
from urllib.request import urlopen, quote
def ShowResponse(param):
param_utf8 = param.encode("utf-8")
param_perc_encoded = quote(param_utf8)
# or uri = str("mysite.com/?param="+param_perc_encoded+"&submit=submit")
uri = str("mysite.com/?param={0}&submit=submit".format(param_perc_encoded) )
print(urlopen(uri).read())
You'll also see I've changed your uri = definition slightly to use String.format() (https://docs.python.org/2/library/string.html#format-string-syntax), which I find easier to create complex strings rather than doing string concatenation with +. In this example, {0} is replaced with the first argument to .format().

Resources