Can urllib.parse.quote_plus support lowercase?

Here is a URL like https://www.example.com?timestamp={timestamp}&sign={sign}
I have calculated the sign with some algorithm and got the following value:
org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE="
Now I want to encode the value and add it to the URL.
With urllib.parse.quote_plus, '/' => '%2F' and '=' => '%3D', so I got 'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2Fx3Xv%2FKFNE%3D'.
But I want %2f and %3d, i.e. HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2fx3Xv%2fKFNE%3d. And I cannot simply lowercase the whole sign with sign.lower(), because the sign is case-sensitive.
Python 3.7.2
>>> from urllib import parse
>>> org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE="
>>> parse.quote_plus(org_sign)
'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2Fx3Xv%2FKFNE%3D'
And I read the urllib documentation. It doesn't mention anything about the case of the percent-encoding.

This workaround works for me:
import urllib.parse

class LowerCaseQuoter(urllib.parse.Quoter):
    def __missing__(self, b):
        # Handle a cache miss: store the lowercase-quoted string in the cache and return it.
        res = chr(b) if b in self.safe else '%{:02x}'.format(b)
        self[b] = res
        return res

# Register the lowercase quoter for safe='' (note: Quoter and _safe_quoters are
# undocumented internals of urllib.parse in Python 3.7).
urllib.parse._safe_quoters[b''] = LowerCaseQuoter(b'').__getitem__
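With the quoter for safe=b'' registered, quote_plus should now emit lowercase hex escapes (expected behavior on Python 3.7; these internals were changed in later versions):
>>> parse.quote_plus(org_sign)
'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2fx3Xv%2fKFNE%3d'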
Original string: /media/Filmy/PohadkyCD/Sofie Prvni/Season 1 (2013-2014) + pilot/1x00.Sofie První, Byla jednou jedna princezna (Once Upon a Princess).avi
Encoded string by urllib.parse.quote(filename, safe=''): %2fmedia%2fFilmy%2fPohadkyCD%2fSofie%20Prvni%2fSeason%201%20%282013-2014%29%20%2b%20pilot%2f1x00.Sofie%20Prvn%c3%ad%2c%20Byla%20jednou%20jedna%20princezna%20%28Once%20Upon%20a%20Princess%29.avi
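If relying on private internals is a concern, a version-independent alternative (my sketch, not part of the original answer) is to post-process the quoted string and lowercase only the percent escapes:
import re
from urllib import parse

def quote_plus_lower(s):
    # Lowercase only the %XX escapes; unescaped characters keep their case.
    return re.sub(r'%[0-9A-F]{2}', lambda m: m.group(0).lower(), parse.quote_plus(s))

>>> quote_plus_lower("HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE=")
'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2fx3Xv%2fKFNE%3d'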

Related

Python3 requests - cut response

How can I cut the output from a response using Python requests?
The output looks like:
...\'\n});\nRANDOMDATA\nExt.define...
or
...\'\n});\nOTHERRANDOMDATA\nExt.define...
And I only want to print out the RANDOMDATA.
import requests

req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print(response.content)
You can use the re module's findall() function to search for all occurrences of a regular expression in a string.
The following code searches for the string RANDOMDATA in the response returned by requests.get():
import re
import requests

req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print(response.content)
ar = re.findall('RANDOMDATA', str(response.content))
if len(ar):
    print(ar[0])
The re module documentation is a helpful place to learn about regular expressions.
Additionally, if you have a variable data containing the string to be searched and a variable t containing the string to search for, you can use:
import re
arr = re.findall(t, data)
to return all the occurrences of t in data, and:
arr = data.find(t)
to get the index of the first occurrence of t in data.
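Note that findall('RANDOMDATA', ...) only finds the literal string. If the data between the fixed delimiters is what varies, a capture group around it works better; here is a sketch assuming the "});\n" and "\nExt.define" delimiters from the sample output above:
import re

# Hypothetical sample mirroring the question's output format.
text = "...'\n});\nRANDOMDATA\nExt.define..."
m = re.search(r"\}\);\n(.*?)\nExt\.define", text)
if m:
    print(m.group(1))  # prints RANDOMDATA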

How to use Python to convert a backslash into a forward slash for naming the file paths in Windows OS?

I have a problem converting all the backslashes into forward slashes using Python.
I tried using os.sep as well as string.replace() to accomplish my task. It wasn't 100% successful in doing that.
import os
pathA = 'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Expected Output:
'V:/Gowtham/2019/Python/DailyStandup.txt'
Actual Output:
'V:/Gowtham\x819/Python/DailyStandup.txt'
I am not able to get why the number 2019 is converted into x819. Could someone help me on this?
Your issue is already in pathA: if you print it out, you'll see that it already contains this \x81, because \201 in a non-raw string literal denotes the character with octal code 201, which is 81 in hexadecimal (\x81). For more information, take a look at the definition of string literals.
The quick fix is to use a raw string (r'V:\....'), but you should also take a look at the pathlib module.
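For example, pathlib can do the conversion without a manual replace (a minimal sketch):
from pathlib import PureWindowsPath

pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt'
print(PureWindowsPath(pathA).as_posix())  # V:/Gowtham/2019/Python/DailyStandup.txt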
Using the raw string leads to the correct answer for me.
import os
pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace(os.sep,'/')
print(newpathA)
Output:
V:/Gowtham/2019/Python/DailyStandup.txt
Try this, using the raw string format r'your-string'.
>>> import os
>>> pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt' # raw string format
>>> newpathA = pathA.replace(os.sep,'/')
Output:
>>> print(newpathA)
V:/Gowtham/2019/Python/DailyStandup.txt
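One caveat to both answers: os.sep is '/' on non-Windows systems, so replace(os.sep, '/') only works when the script runs on Windows. Replacing the backslash explicitly (or using pathlib as above) is portable:
pathA = r'V:\Gowtham\2019\Python\DailyStandup.txt'
newpathA = pathA.replace('\\', '/')  # replace the backslash itself, not os.sep
print(newpathA)  # V:/Gowtham/2019/Python/DailyStandup.txt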

Getting a ValueError: invalid literal for int() with base 10: '56,990'

So I am trying to scrape a website containing the price of a laptop. However, the price is a string, and for comparison purposes I need to convert it to an int. On doing so I get a ValueError: invalid literal for int() with base 10: '56,990'
Below is the code:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.flipkart.com/apple-macbook-air-core-i5-5th-gen-8-gb-128-gb-ssd-mac-os-sierra-mqd32hn-a-a1466/p/itmevcpqqhf6azn3?pid=COMEVCPQBXBDFJ8C&srno=s_1_1&otracker=search&lid=LSTCOMEVCPQBXBDFJ8C5XWYJP&fm=SEARCH&iid=2899998f-8606-4b81-a303-46fd62a7882b.COMEVCPQBXBDFJ8C.SEARCH&qH=9e3635d7234e9051")
data = r.text
soup = BeautifulSoup(data, "lxml")
data = soup.find('div', {"class": "_1vC4OE _37U4_g"})
cost = data.text[1:].strip()
print(int(cost))
PS: I used text[1:] to remove the currency character.
I get the error in the last line. Basically I need to get the int value of the cost.
The value has a comma in it, so you need to replace the comma with an empty string before converting it to an integer.
print(int(cost.replace(',','')))
Python does not understand , group separators in integers, so you'll need to remove them. Note that in Python 3 str.translate takes a mapping rather than two strings, so try:
cost = data.text[1:].strip().translate({ord(','): None})
Rather than invent a new solution for every character you don't want (strip() function for whitespace, [1:] index for the currency, something else for the digit separator) consider a single solution to gather what you do want:
>>> import re
>>> text = "\u20B956,990\n"
>>> cost = re.sub(r"\D", "", text)
>>> print(int(cost))
56990
The re.sub() replaces anything that isn't a digit with nothing.
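If the string always follows a known locale's number format, the locale module can also parse it; a sketch, assuming the en_US.UTF-8 locale is installed on the system:
import locale

locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')  # assumption: this locale is available
print(locale.atoi('56,990'))  # 56990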

python3 uuid to base64.urlsafe encode and decode mismatch

I'm having a problem getting a base64-encoded uuid to match the original uuid.
Here is the code:
import base64, uuid

def uuid2slug(uuidstring):
    return base64.urlsafe_b64encode(uuid.uuid1().bytes).decode("utf-8").rstrip('=\n').replace('/', '_')

def slug2uuid(slug):
    return uuid.UUID(bytes=base64.urlsafe_b64decode((slug + '==').replace('_', '/')))

uid = uuid.uuid1()
urlslug = uuid2slug(uid)
urluid = slug2uuid(urlslug)
print(uid)
print(urlslug)
print(urluid)
This returns a mismatch in the first group of the UUID:
cfe71fa2-7d39-11e7-9264-000c29023711
z-cg7H05EeeSZAAMKQI3EQ
cfe720ec-7d39-11e7-9264-000c29023711
Any thoughts?
This is using Python 3.5.3
As mentioned in the comments, the problem in your code was that you were not using the argument you passed to the function, uuidstring.
Also note that you are using the urlsafe encoding and decoding functions, so you don't need to replace the slashes yourself.
For reference, a standard Base64 value can be described by the regex ^[A-Za-z0-9+/]+={0,2}$, where + and / are the only non-alphanumeric symbols and = is used only for padding. The URL-safe alphabet is explained in the Base64 (Wikipedia) article:
the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_', so that using URL encoders/decoders is no longer necessary
Long story short, the correct version of your functions, without the redundant calls to replace, is:
def uuid2slug(uuidstring):
    return base64.urlsafe_b64encode(uuidstring.bytes).decode("utf-8").strip('=')

def slug2uuid(slug):
    return uuid.UUID(bytes=base64.urlsafe_b64decode(slug + '=='))
If you run your code a couple of times, you will find hyphens and underscores, and no slashes.
E.g.
471f8fc4-5ec5-11ed-9645-06ca5f5b4308
Rx-PxF7FEe2WRQbKX1tDCA
471f8fc4-5ec5-11ed-9645-06ca5f5b4308
ac74e9fe-5ec6-11ed-b5e7-06ca5f5b4308
rHTp_l7GEe215wbKX1tDCA
ac74e9fe-5ec6-11ed-b5e7-06ca5f5b4308
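A quick round-trip check (a sketch using the fixed functions above) confirms they invert each other:
import uuid

uid = uuid.uuid1()
assert slug2uuid(uuid2slug(uid)) == uid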

Unable to remove string from text I am extracting from html

I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However, the text I get back often contains several &#13; strings. I have tried using Python's replace() function, regular-expression substitution, and the unicode encode and decode functions. None of these approaches have worked: with the replace and regular-expression approaches I just get back my original text with the &#13; strings still present, and with the unicode encode/decode approach I get the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using; it takes the initial URL and extracts the main article using readability. I have left in all my commented-out code corresponding to the different approaches I tried for removing the &#13; string. It appears as though &#13; is interpreted to be u'\xa9'.
from readability.readability import Document
import urllib

def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'", " ")
    #print re.sub("&#13;", '', readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii', 'ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are ", readable_article.count("&#13;"), "&#13;'s"
    #print readable_article.encode(sys.stdout.encoding, '')
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    #s2 = s1.replace("\n", "")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have included inside the def a URL that I have been testing the code with. If anybody has any ideas on how to remove this &#13; I would appreciate the help. Thanks, George
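For what it's worth, a minimal sketch of one approach (my assumption, not from the original post): readability's summary() returns a unicode string that contains the literal entity text, so a plain replace should strip it, and the UnicodeEncodeError comes from printing characters such as u'\xa9' (the copyright sign) through the ASCII codec, not from the replacement itself:
# Python 2, matching the code above.
cleaned = readable_article.replace(u"&#13;", u"")
print cleaned.encode("utf-8")  # encode explicitly to avoid the ASCII codec error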
