Python getting a url request to work with special characters [duplicate] - python-3.x

If I do
url = "http://example.com?p=" + urllib.quote(query)
It doesn't encode / to %2F (breaks OAuth normalization)
It doesn't handle Unicode (it throws an exception)
Is there a better library?

Python 2
From the documentation:
urllib.quote(string[, safe])
Replace special characters in string
using the %xx escape. Letters, digits,
and the characters '_.-' are never
quoted. By default, this function is
intended for quoting the path section
of the URL. The optional safe parameter
specifies additional characters that
should not be quoted — its default
value is '/'
That means passing '' for safe will solve your first issue:
>>> urllib.quote('/test')
'/test'
>>> urllib.quote('/test', safe='')
'%2Ftest'
As for the second issue, there is a bug report about it. Apparently it was fixed in Python 3. You can work around it by encoding as UTF-8 like this:
>>> query = urllib.quote(u"Müller".encode('utf8'))
>>> print urllib.unquote(query).decode('utf8')
Müller
By the way, have a look at urlencode.
Python 3
In Python 3, the function quote has been moved to urllib.parse:
>>> import urllib.parse
>>> print(urllib.parse.quote("Müller".encode('utf8')))
M%C3%BCller
>>> print(urllib.parse.unquote("M%C3%BCller"))
Müller

In Python 3, urllib.quote has been moved to urllib.parse.quote, and it does handle Unicode by default.
>>> from urllib.parse import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
>>> quote('/El Niño/')
'/El%20Ni%C3%B1o/'
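As a small sketch of the points above (the sample string is made up for illustration), the following compares quote with and without safe characters, plus quote_plus, which is meant for query-string values:

```python
from urllib.parse import quote, quote_plus, unquote

# Hypothetical query value containing a slash, a space and a non-ASCII letter.
query = "a/b Müller"

path_style = quote(query)        # '/' is left alone by default
strict = quote(query, safe="")   # '/' becomes %2F
form_style = quote_plus(query)   # '/' becomes %2F and ' ' becomes '+'

print(path_style)   # a/b%20M%C3%BCller
print(strict)       # a%2Fb%20M%C3%BCller
print(form_style)   # a%2Fb+M%C3%BCller

# Build the URL from the question and check the value round-trips.
url = "http://example.com?p=" + strict
print(unquote(strict) == query)  # True
```

Passing safe='' is what addresses the OAuth normalization concern from the question, because the slash is then escaped as well.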

I think the requests module is much better. It's based on urllib3.
You can try this:
>>> from requests.utils import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
My answer is similar to Paolo's answer.

If you're using Django, you can use urlquote:
>>> from django.utils.http import urlquote
>>> urlquote(u"Müller")
u'M%C3%BCller'
Note that changes to Python mean that this is now a legacy wrapper. From the Django 2.1 source code for django.utils.http:
A legacy compatibility wrapper to Python's urllib.parse.quote() function.
(was used for unicode handling on Python 2)

It is better to use urlencode here. There isn't much difference for a single parameter, but, IMHO, it makes the code clearer. (A function named quote_plus can look confusing, especially to people coming from other languages.)
In [21]: query='lskdfj/sdfkjdf/ksdfj skfj'
In [22]: val=34
In [23]: from urllib.parse import urlencode
In [24]: encoded = urlencode(dict(p=query,val=val))
In [25]: print(f"http://example.com?{encoded}")
http://example.com?p=lskdfj%2Fsdfkjdf%2Fksdfj+skfj&val=34
Documentation
urlencode
quote_plus
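If the receiving side expects spaces as %20 rather than +, urlencode accepts a quote_via argument. A minimal sketch, reusing the parameters from the example above:

```python
from urllib.parse import quote, urlencode

params = {"p": "lskdfj/sdfkjdf/ksdfj skfj", "val": 34}

# Default behaviour: values are escaped with quote_plus (space -> '+').
default_encoded = urlencode(params)

# With quote_via=quote, spaces are escaped as %20 instead.
percent_encoded = urlencode(params, quote_via=quote)

print(default_encoded)  # p=lskdfj%2Fsdfkjdf%2Fksdfj+skfj&val=34
print(percent_encoded)  # p=lskdfj%2Fsdfkjdf%2Fksdfj%20skfj&val=34
```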

An alternative method using furl:
import furl
url = "https://httpbin.org/get?hello,world"
print(url)
url = furl.furl(url).url
print(url)
Output:
https://httpbin.org/get?hello,world
https://httpbin.org/get?hello%2Cworld

Related

Why some emojis are not converted back into their representation?

I am working on an emoji detection module. For some emojis I am observing weird behavior: after encoding them to UTF-8 and decoding again, they are not shown in their original form. I need their exact colored representation to be sent in an API response instead of a Unicode-escaped string. Any leads?
In [1]: x = "example1: 🤭 and example2: 😁 and example3: 🥺"
In [2]: x.encode('utf8')
Out[2]: b'example1: \xf0\x9f\xa4\xad and example2: \xf0\x9f\x98\x81 and example3: \xf0\x9f\xa5\xba'
In [3]: x.encode('utf8').decode('utf8')
Out[3]: 'example1: \U0001f92d and example2: 😁 and example3: \U0001f97a'
In [4]: print( x.encode('utf8').decode('utf8') )
*example1: 🤭 and example2: 😁 and example3: 🥺*
Link Emoji used in example
Update 1:
This example should make it clearer. Here, two emojis are rendered when I send the Unicode escape string, but the third example failed to convert to the exact emoji. What should I do in such a case?
'\U0001f92d' == '🤭' is True. It is an escape code, but it is still the same character: two ways of displaying or entering it. The former is the repr() of the string, whereas printing calls str(). Example:
>>> s = '🤭'
>>> print(repr(s))
'\U0001f92d'
>>> print(str(s))
🤭
>>> s
'\U0001f92d'
>>> print(s)
🤭
When Python generates the repr(), it uses an escape code representation if it thinks the display can't handle the character. The content of the string is still the same: the Unicode code point.
It's a debug feature. For example, is the white space spaces or tabs? The repr() of the string makes it clear by using \t as an escape code.
>>> s = 'a\tb'
>>> print(s)
a b
>>> s
'a\tb'
As to why an escape code is used for one emoji and not another, it depends on the version of Unicode supported by the version of Python used.
Python 3.6 uses Unicode 9.0, and some of your emoji aren't defined at that version level:
>>> import unicodedata as ud
>>> ud.unidata_version
'9.0.0'
>>> ud.name('😁')
'GRINNING FACE WITH SMILING EYES'
>>> ud.name('🤭')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
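To confirm that only the display differs, here is a short sketch that checks the string content directly. The Unicode database version printed at the end depends on the interpreter you run it with, so no particular value is assumed:

```python
import unicodedata

s = "\U0001f92d"

# The escape form and the literal emoji are the same one-character string.
same = (s == "🤭")

# Encoding to UTF-8 and decoding again returns an identical string.
round_tripped = s.encode("utf8").decode("utf8")

print(same)                # True
print(round_tripped == s)  # True
print(len(s))              # 1

# ascii() always escapes non-ASCII characters, whatever the terminal supports.
print(ascii(s))            # '\U0001f92d'

# The Unicode version known to this interpreter (varies by Python release).
print(unicodedata.unidata_version)
```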

Can urllib.parse.quote_plus support lowercase?

Here is a URL like https://www.example.com?timestamp={timestamp}&sign={sign}
I have calculated the sign with some algorithm and got the following value:
org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE=".
Now I want to encode the value and add it to the URL.
urllib.parse.quote_plus maps '/' to '%2F' and '=' to '%3D', so I got 'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2Fx3Xv%2FKFNE%3D'.
But I want %2f and %3d, that is HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2fx3Xv%2fKFNE%3d. I cannot simply lowercase the whole sign with sign.lower(), because the sign is case-sensitive.
Python 3.7.2
>>> from urllib import parse
>>> org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE="
>>> parse.quote_plus(org_sign)
'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2Fx3Xv%2FKFNE%3D'
I read the urllib documentation, and it doesn't mention anything about the case of the escapes.
This workaround works for me:
import urllib.parse
class LowerCaseQuoter(urllib.parse.Quoter):
    def __missing__(self, b):
        # Handle a cache miss. Store quoted string in cache and return.
        res = chr(b) if b in self.safe else '%{:02x}'.format(b)
        self[b] = res
        return res

urllib.parse._safe_quoters[b''] = LowerCaseQuoter(b'').__getitem__
Original string: /media/Filmy/PohadkyCD/Sofie Prvni/Season 1 (2013-2014) + pilot/1x00.Sofie První, Byla jednou jedna princezna (Once Upon a Princess).avi
Encoded string by urllib.parse.quote(filename, safe=''): %2fmedia%2fFilmy%2fPohadkyCD%2fSofie%20Prvni%2fSeason%201%20%282013-2014%29%20%2b%20pilot%2f1x00.Sofie%20Prvn%c3%ad%2c%20Byla%20jednou%20jedna%20princezna%20%28Once%20Upon%20a%20Princess%29.avi
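The workaround above patches private internals of urllib.parse, which may not exist in other Python versions. A possible alternative, sketched here with a hypothetical helper name quote_plus_lower, is to post-process the output of quote_plus and lowercase only the percent escapes. This is safe because any literal % in the input is itself escaped as %25, so every %XX in the output is a genuine escape:

```python
import re
from urllib.parse import quote_plus


def quote_plus_lower(value: str) -> str:
    """Percent-encode like quote_plus, but emit lowercase hex digits.

    Hypothetical helper for illustration; not part of urllib.
    """
    encoded = quote_plus(value)
    # Lowercase each %XX escape; the rest of the string is left untouched.
    return re.sub(r"%[0-9A-F]{2}", lambda m: m.group(0).lower(), encoded)


org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE="
print(quote_plus_lower(org_sign))
# HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2fx3Xv%2fKFNE%3d
```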

How to use python to convert a backslash in to forward slash for naming the filepaths in windows OS?

I have a problem converting all the backslashes into forward slashes using Python.
I tried using os.sep as well as the string replace() method to accomplish my task. It wasn't 100% successful.
import os
pathA = 'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Expected Output:
'V:/Gowtham/2019/Python/DailyStandup.txt'
Actual Output:
'V:/Gowtham\x819/Python/DailyStandup.txt'
I don't understand why the number 2019 is converted into x819. Could someone help me with this?
Your issue is already in pathA: if you print it out, you'll see that it already has this \x81, since \201 denotes the character with octal value 201, which is 81 in hexadecimal (\x81). For more information, you can take a look at the definition of string literals.
The quick solution is to use raw strings (r'V:\....'). But you should take a look at the pathlib module.
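A minimal sketch of the pathlib approach, using the path from the question. PureWindowsPath parses Windows-style paths on any operating system, so this does not depend on os.sep:

```python
from pathlib import PureWindowsPath

# A raw string keeps the backslashes (and \201) from being read as escapes.
path_a = PureWindowsPath(r"V:\Gowtham\2019\Python\DailyStandup.txt")

# as_posix() returns the path as a string with forward slashes.
new_path_a = path_a.as_posix()
print(new_path_a)  # V:/Gowtham/2019/Python/DailyStandup.txt
```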
Using the raw string leads to the correct answer for me.
import os
pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Output:
V:/Gowtham/2019/Python/DailyStandup.txt
Try this, using the raw string format r'your-string'.
>>> import os
>>> pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt' # raw string format
>>> newpathA = pathA.replace(os.sep,'/')
Output:
>>> print(newpathA)
V:/Gowtham/2019/Python/DailyStandup.txt

how to fix the unicode problem on configparser

I use Python 3.7 and configparser 3.7.4.
I have a ranks.ini:
[example]
placeholder : \U0001F882
And I have a main.py file:
import configparser
config = configparser.ConfigParser()
config.read('ranks.ini')
print('🢂')
test = '\U0001F882'
print(type(test))
print(test)
test2 = config.get('example', 'placeholder')
print(type(test2))
print(test2)
The result of the code is:
🢂
<class 'str'>
🢂
<class 'str'>
\U0001F882
Why is the variable test2 not "🢂", and how can I fix it?
It took me a while to figure this one out, since Python 3 treats every str as Unicode.
If my understanding is correct, the literal '\U0001F882' in your source code is processed by Python as an escape sequence, so it is converted into the character.
However, configparser reads the value from the file as plain text, so no escape processing happens and you get a literal backslash followed by U0001F882, that is '\\U0001F882'.
You can see this difference if you print the repr of test and test2:
print(repr(test))
print(repr(test2))
To get the output you want, you have to decode the Unicode escape in the string value:
print(test2.encode('utf8').decode('unicode-escape'))
Hope this works for you.
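A self-contained sketch of the same idea, loading the INI content from a string instead of a file so it can be run as-is. It assumes the stored value contains only ASCII characters, which is why it encodes with 'ascii' before decoding the escape:

```python
import configparser

config = configparser.ConfigParser()
# Same content as the INI file in the question, supplied inline.
config.read_string("[example]\nplaceholder : \\U0001F882\n")

raw = config.get("example", "placeholder")
print(repr(raw))   # '\\U0001F882' (ten literal characters)
print(len(raw))    # 10

# Interpret the backslash escape; assumes the value is pure ASCII.
decoded = raw.encode("ascii").decode("unicode_escape")
print(len(decoded))                 # 1
print(decoded == "\U0001F882")      # True
```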

How to display chinese character in 65001 in python?

I am on Windows 7 with Python 3.3.
import os
os.system("chcp 936")
fh=open("test.ch","w",encoding="utf-8")
fh.write("你")
fh.close()
os.system("chcp 65001")
fh=open("test.ch","r",encoding="utf-8").read()
print(fh)
Äã
>>> print(fh.encode("utf-8"))
b'\xe4\xbd\xa0'
How can I display the Chinese character 你 in code page 65001?
If your terminal is capable of displaying the character directly (which it may not be, due to font issues), then it should Just Work(tm).
>>> hex(65001)
'0xfde9'
>>> u"\ufde9"
'\ufde9'
>>> print(u"\ufde9")
﷩
To avoid the use of literals, note that in Python 3, at least, the chr() function will take a code point and return the associated Unicode character. So this works too, avoiding the need to do hex conversions.
>>> print(chr(65001))
﷩
