How to decode string with unicode in python? - python-3.x

I have the following line:
%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22
Expected Result:
{"appVersion":1,"modulePrefix":"web-experience-app","environment":"production","rootURL":"/","
What I tried:
foo = '%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22'
codecs.decode(foo, 'unicode-escape')
foo.encode('utf-8').decode('utf-8')
This does not work. What other options are there?

The string is URL-encoded (percent-encoded). You can convert it back by reversing the URL encoding with urllib.parse.unquote:
from urllib import parse
s = '%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22'
unquoted = parse.unquote(s)
unquoted
'{"appVersion":1,"modulePrefix":"web-experience-app","environment":"production","rootURL":"/","'
This looks like part of a larger JSON string. The complete object can be de-serialised with json.loads.
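As a minimal sketch of that last step: the string in the question is cut off, so the example below assumes a hypothetical complete version that simply closes the object after rootURL. The trailing %7D and the names encoded and config are assumptions for illustration, not part of the original data.

```python
import json
from urllib import parse

# Hypothetical complete payload: the question's string is truncated, so the
# object is closed right after "rootURL" here (assumed, not from the source).
encoded = (
    '%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22'
    '%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%7D'
)

unquoted = parse.unquote(encoded)   # reverse the percent-encoding
config = json.loads(unquoted)       # parse the resulting JSON text

print(config['modulePrefix'])
```

The order matters: json.loads only accepts the text after it has been unquoted.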

Related

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Is there any way to tell RequestParser which encoding the string uses?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So I'm pretty sure it's RequestParser() that tries to be too smart and can't understand the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes as UTF-8. So when it sees a field in another encoding (among other UTF-8 fields!), it tries to decode it as UTF-8 and fails. As a result, every Cyrillic character becomes U+FFFD (the replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
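import re
from flask import request

raw_data = request.get_data()  # raw request body as bytes (call inside a request handler)
contents = raw_data.decode('windows-1251')  # 'windows-1251' is an alias for cp1251
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)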
Not the most beautiful solution, but it works.
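To see why the Cyrillic characters turned into U+FFFD while ! and \n survived, here is a small sketch that needs no Flask at all. It assumes, as described above, that the cp1251 bytes get decoded as UTF-8 with invalid bytes replaced; the sample text 'Привет !' is only a stand-in for the real field value.

```python
# Illustrative only: 'Привет !' stands in for the real field value.
original = 'Привет !'
raw = original.encode('cp1251')          # bytes as the client would send them

# cp1251 Cyrillic bytes are not valid UTF-8; with errors='replace'
# every undecodable byte becomes U+FFFD, while ASCII bytes survive.
garbled = raw.decode('utf-8', errors='replace')

# Decoding with the codec the sender actually used recovers the text.
recovered = raw.decode('cp1251')

print(ascii(garbled))
print(recovered)
```

Once the text has been replaced with U+FFFD the original bytes are gone, which is why changing the encoding of the output file made no difference.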

How to use python to convert a backslash in to forward slash for naming the filepaths in windows OS?

I have a problem in converting all the back slashes into forward slashes using Python.
I tried using os.sep together with str.replace() to accomplish this, but it wasn't 100% successful:
import os
pathA = 'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Expected Output:
'V:/Gowtham/2019/Python/DailyStandup.txt'
Actual Output:
'V:/Gowtham\x819/Python/DailyStandup.txt'
I don't understand why 2019 is converted into \x819. Could someone help me with this?
Your issue is already in pathA: if you print it out, you'll see that it already has this \x81, since \201 means the character with octal code 201, which is 81 in hexadecimal (\x81). The trailing 9 is left over because an octal escape takes at most three digits. For more information, take a look at the definition of string literals.
The quick solution is to use raw strings (r'V:\....'). But you should take a look at the pathlib module.
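A minimal sketch of both points, the octal escape and the pathlib route. The variable names path_a and new_path_a are mine; PureWindowsPath is used so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# In a normal (non-raw) literal, '\201' is an octal escape for one character.
assert '\201' == '\x81'
assert ord('\201') == 0o201 == 0x81

# A raw string keeps every backslash as written.
path_a = r'V:\Gowtham\2019\Python\DailyStandup.txt'

# PureWindowsPath parses Windows-style paths on any OS;
# as_posix() renders the path with forward slashes.
new_path_a = PureWindowsPath(path_a).as_posix()
print(new_path_a)  # V:/Gowtham/2019/Python/DailyStandup.txt
```

Note that pathA.replace(os.sep, '/') only replaces backslashes when the script runs on Windows, because os.sep is '/' on other systems.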
Using the raw string leads to the correct answer for me.
import os
pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Output:
V:/Gowtham/2019/Python/DailyStandup.txt
Try this, using the raw string format r'your-string':
>>> import os
>>> pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt' # raw string format
>>> newpathA = pathA.replace(os.sep,'/')
Output:
>>> print(newpathA)
V:/Gowtham/2019/Python/DailyStandup.txt

Decode a Python string

Sorry for the generic title.
I am receiving a string from an external source: txt = external_func()
I am copying/pasting the output of various commands to make sure you see what I'm talking about:
In [163]: txt
Out[163]: '\\xc3\\xa0 voir\\n'
In [164]: print(txt)
\xc3\xa0 voir\n
In [165]: repr(txt)
Out[165]: "'\\\\xc3\\\\xa0 voir\\\\n'"
I am trying to transform that text to UTF-8 (?) to have txt = "à voir\n", and I can't see how.
How can I do transformations on this variable?
You can encode your txt to a bytes-like object using the encode-method of the str class.
Then this byte-like object can be decoded again with the encoding unicode_escape.
Now you have your string with all escape sequences parsed, but latin-1 decoded. You still have to encode it with latin-1 and then decode it again with utf-8.
>>> txt = '\\xc3\\xa0 voir\\n'
>>> txt.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'à voir\n'
The codecs module also has an undocumented function called escape_decode:
>>> import codecs
>>> codecs.escape_decode(bytes('\\xc3\\xa0 voir\\n', 'utf-8'))[0].decode('utf-8')
'à voir\n'
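For reference, the first one-liner can be unpacked step by step. This is only a sketch; the helper name unescape_utf8 is hypothetical, and it assumes the input contains nothing but \xNN-style escapes that spell out UTF-8 bytes.

```python
def unescape_utf8(text: str) -> str:
    """Turn literal '\\xNN' escapes holding UTF-8 bytes into real text."""
    # 1. str -> bytes; the escapes themselves are plain ASCII.
    as_bytes = text.encode('utf-8')
    # 2. Interpret the escape sequences; each \xNN becomes one code point
    #    in the range U+0000..U+00FF, as if the bytes were Latin-1 text.
    escaped = as_bytes.decode('unicode_escape')
    # 3. Latin-1 maps code points 0-255 back to the same byte values.
    utf8_bytes = escaped.encode('latin-1')
    # 4. Decode those bytes with the encoding they were really in.
    return utf8_bytes.decode('utf-8')


txt = '\\xc3\\xa0 voir\\n'
print(repr(unescape_utf8(txt)))  # 'à voir\n'
```

Step 3 raises UnicodeEncodeError if the text also contains escapes such as \u20ac, since those produce code points above U+00FF that Latin-1 cannot encode.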

Encode to Unicode UTF-8 not working

My Code -
var utf8 = require('utf8');
var y = utf8.encode('एस एम एस गपशप');
console.log(y);
Input -
एस एम एस गपशप
Expecting Output - \xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x8F\xE0\xA4\xAE\x20\xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x97\xE0\xA4\xAA\xE0\xA4\xB6\xE0\xA4\xAA
Example Encoding using utf8.js
Output -
à¤à¤¸ à¤à¤® à¤à¤¸ à¤à¤ªà¤¶à¤ª
What am I doing wrong? Please help!
That code appears to be working. That output looks like UTF-8 bytes interpreted as some 8-bit character set, most likely ISO-8859-1, which is easily recognisable by the repeating patterns.
That example output is just how you would represent that string in source code.
Try this:
var utf8 = require('utf8');
var y = utf8.encode('एस');
console.log(y);
console.log('\xE0\xA4\x8F\xE0\xA4\xB8');
You will probably see the same output twice.
You can easily write some code to get those hexadecimal forms back using a lookup table and the charCodeAt function, but it is a rather unusual way to represent a string in JavaScript. JSON, for example, either just uses the literal characters or '\uXXXX' escapes.
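The same byte-level picture can be sketched in Python, the language used in the other threads on this page. This is not the utf8.js API, just an illustration of how the \xNN form and the garbled output relate to the same six bytes; the names hex_form and mojibake are mine.

```python
text = 'एस'
data = text.encode('utf-8')   # the UTF-8 bytes of the two characters

# The same bytes written as \xNN escapes, as in the expected output.
hex_form = ''.join(f'\\x{b:02X}' for b in data)
print(hex_form)  # \xE0\xA4\x8F\xE0\xA4\xB8

# Reading the UTF-8 bytes as ISO-8859-1 yields one character per byte,
# which is the kind of garbled output shown in the question.
mojibake = data.decode('latin-1')
print(len(text), len(data), len(mojibake))  # 2 6 6
```

Nothing is lost in the garbled form: encoding it back to ISO-8859-1 and decoding as UTF-8 returns the original text.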

How to change a Python 3 string into 'readable' text

I have this string in var1
var1 = '$a=1%7Cscroll%20on%20%22Page%3A%20Generator-Sets-Construction%3Fid%3Dci%26s%3DY2l8Tj00Mjk0NzQ4MDY5KzQyOTQ5NjM4OTY%3D%22%7C-%7Cscroll%7C1443616500011%7C1443616500586%7C3774$fId=16440287_806$rId=RID_-62268720$rpId=1762047089$domR=1443616443684$time=1443616500588'
How can I change the contents of the string into 'readable' text i.e. non-URL encoded.
From research, here is the code I have tried, but it still keeps the URL-encoded items e.g. %20 etc.
import html
print(html.unescape('$a=1%7Cscroll%20on%20%22Page%3A%20Generator-Sets-Construction%3Fid%3Dci%26s%3DY2l8Tj00Mjk0NzQ4MDY5KzQyOTQ5NjM4OTY%3D%22%7C-%7Cscroll%7C1443616500011%7C1443616500586%7C3774$fId=16440287_806$rId=RID_-62268720$rpId=1762047089$domR=1443616443684$time=1443616500588'))
All help is appreciated or if there is an existing module that does this.
What you are trying to do is unquote a URL-encoded parameter string, not unescape HTML. The following should work:
import urllib.parse
print(urllib.parse.unquote('$a=1%7Cscroll%20on%20%22Page%3A%20Generator-Sets-Construction%3Fid%3Dci%26s%3DY2l8Tj00Mjk0NzQ4MDY5KzQyOTQ5NjM4OTY%3D%22%7C-%7Cscroll%7C1443616500011%7C1443616500586%7C3774$fId=16440287_806$rId=RID_-62268720$rpId=1762047089$domR=1443616443684$time=1443616500588'))
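To make the difference between the two functions concrete, here is a small sketch. The fragment is a shortened piece of the string from the question (cut down by me for readability), and the HTML sample is made up for contrast.

```python
import html
from urllib import parse

# Shortened piece of the question's string (trimmed for illustration).
fragment = '1%7Cscroll%20on%20%22Page%3A%20Generator-Sets-Construction%3Fid%3Dci%22'

# html.unescape only handles HTML entities such as &amp; or &quot;,
# so a percent-encoded string passes through unchanged.
untouched = html.unescape(fragment)

# urllib.parse.unquote replaces each %XX sequence with its character.
readable = parse.unquote(fragment)
print(readable)  # 1|scroll on "Page: Generator-Sets-Construction?id=ci"

# For contrast, this is the kind of input html.unescape is meant for.
print(html.unescape('&quot;Page&quot; &amp; more'))  # "Page" & more
```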
