python3 uuid to base64.urlsafe encode and decode mismatch - python-3.x

I'm having a problem getting a base64-encoded uuid to match the original uuid.
Here is the code:
import base64, uuid

def uuid2slug(uuidstring):
    return base64.urlsafe_b64encode(uuid.uuid1().bytes).decode("utf-8").rstrip('=\n').replace('/', '_')

def slug2uuid(slug):
    return uuid.UUID(bytes=base64.urlsafe_b64decode((slug + '==').replace('_', '/')))
uid = uuid.uuid1()
urlslug = uuid2slug(uid)
urluid = slug2uuid(urlslug)
print(uid)
print(urlslug)
print(urluid)
This returns a mismatch in the UUID's first group:
cfe71fa2-7d39-11e7-9264-000c29023711
z-cg7H05EeeSZAAMKQI3EQ
cfe720ec-7d39-11e7-9264-000c29023711
Any thoughts?
This is using Python 3.5.3

As mentioned in the comments, the problem in your code was that you were not using the argument you passed to the function, uuidstring.
Also note that you are using the URL-safe encoding and decoding functions, so you don't need to replace the slashes yourself.
For reference, a Base64 value can be matched with the following regex, ^[A-Za-z0-9+/]+={0,2}$, where + and / are the only non-alphanumeric symbols, and = is only used for padding. The URL-safe variant is explained in the Base64 (Wikipedia) article:
the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_', so that using URL encoders/decoders is no longer necessary
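A quick illustration of the difference between the two alphabets (the input bytes here are chosen only so that the standard alphabet emits both of its special characters):
>>> import base64
>>> base64.b64encode(b'\xfb\xff\xbf')          # standard alphabet: '+' and '/'
b'+/+/'
>>> base64.urlsafe_b64encode(b'\xfb\xff\xbf')  # URL-safe alphabet: '-' and '_'
b'-_-_'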
Long story short, the correct versions of your functions, without the redundant calls to replace, are:
def uuid2slug(uuidstring):
    return base64.urlsafe_b64encode(uuidstring.bytes).decode("utf-8").strip('=')

def slug2uuid(slug):
    return uuid.UUID(bytes=base64.urlsafe_b64decode(slug + '=='))
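As a quick sanity check, the round trip now holds (a minimal sketch using the fixed functions above):
>>> u = uuid.uuid1()
>>> slug2uuid(uuid2slug(u)) == u
True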
If you run your code a couple of times, you will find hyphens and underscores, and no slashes.
E.g.
471f8fc4-5ec5-11ed-9645-06ca5f5b4308
Rx-PxF7FEe2WRQbKX1tDCA
471f8fc4-5ec5-11ed-9645-06ca5f5b4308
ac74e9fe-5ec6-11ed-b5e7-06ca5f5b4308
rHTp_l7GEe215wbKX1tDCA
ac74e9fe-5ec6-11ed-b5e7-06ca5f5b4308

Related

Python - Convert unicode entity into unicode symbol

When I loop through the dict that request.json() returns from an API, it returns "v\u00F6lk" (without the quotes).
But I want "völk" (without the quotes), which is how it is raw in the API.
How do I convert it?
request = requests.post(get_sites_url, headers=api_header, params=search_sites_params, timeout=http_timeout_seconds)
return_search_results = request.json()
for site_object in return_search_results['data']:
    site_name = str(site_object['name'])
    site_name_fixed = str(site_name.encode("utf-8").decode())
    print("fixed site_name: " + site_name_fixed)
My guess: the API is actually returning the literal (escaped) version, so he is really getting:
"v\\u00F6lk"
Printing that gives what we think we are getting from the API:
print("v\\u00F6lk")
v\u00F6lk
I am not sure if there is a better way to do this, but encoding it with "utf-8", then using "unicode_escape" to decode seemed to work:
>>> print(bytes("v\\u00F6lk", "utf-8").decode("unicode_escape"))
völk
>>> print("v\\u00F6lk".encode("utf-8").decode("unicode_escape"))
völk
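One caveat worth noting: unicode_escape assumes latin-1 for the underlying bytes, so it can mangle input that already contains real non-ASCII characters. If the value uses JSON-style \uXXXX escaping, json.loads is arguably a safer decoder (a small sketch, not part of the original answer):
>>> import json
>>> json.loads('"v\\u00F6lk"')  # json interprets the \uXXXX escape itself
'völk'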

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Anything I can do to advise RequestParser with the string encoding?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure RequestParser() tries to be too smart and can't understand the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes in as UTF-8. So when it sees a non-Unicode input field (among other, Unicode fields!), it tries to load it as Unicode and fails. As a result, all characters are U+FFFD (the replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
import re
from flask import request

raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.
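To see why every Cyrillic character came back as a replacement character, here is the mismatch in isolation (a hypothetical REPL session, not from the original post):
>>> raw = 'Привет'.encode('cp1251')        # the bytes the webhook actually sends
>>> raw.decode('utf-8', errors='replace')  # what a parser assuming UTF-8 sees
'\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
>>> raw.decode('cp1251')                   # the intended decoding
'Привет'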

How does ruamel.yaml determine the encoding of escaped byte sequences in a string?

I am having trouble figuring out where to modify or configure ruamel.yaml's loader to get it to parse some old YAML with the correct encoding. The essence of the problem is that an escaped byte sequence in the document seems to be interpreted as latin1, and I have no earthly clue where it is doing that, after some source diving here. Here is a code sample that demonstrates the behavior (this in particular was run in Python 3.6):
from ruamel.yaml import YAML
yaml = YAML()
yaml.load('a:\n b: "\\xE2\\x80\\x99"\n') # Note that this is a str (that is, unicode) with escapes for the byte escapes in the YAML document
# ordereddict([('a', ordereddict([('b', 'â\x80\x99')]))])
Here are the same bytes decoded manually, just to show what it should parse to:
>>> b"\xE2\x80\x99".decode('utf8')
'’'
Note that I don't really have any control over the source document, so modifying it to produce the correct output with ruamel.yaml is out of the question.
ruamel.yaml doesn't interpret individual strings, it interprets the stream it gets handed, i.e. the argument to .load(). If that argument is a byte-stream or a file-like object, then its encoding is determined based on the BOM, defaulting to UTF-8. But again: that happens at the stream level, not at individual scalar content after interpreting escapes. Since you hand .load() Unicode (as this is Python 3), that "stream" needs no further decoding. (Although irrelevant for this question: it is done in the reader.py:Reader methods stream and determine_encoding.)
The hex escapes (of the form \xAB) just put a specific hex value in the type the loader uses to construct the scalar, that is, the value for key 'b', and that is a normal Python 3 str, i.e. Unicode in one of its internal representations. That you get the â in your output is because of how your Python is configured to display str types.
So you won't "find" the place where ruamel.yaml decodes that
byte-sequence, because that is already assumed to be Unicode.
So the thing to do is to double decode your double-quoted scalars (you only have to address those, as plain, single-quoted, and literal/folded scalars cannot have the hex escapes). There are various points at which you can try to do that, but I think constructor.py:RoundTripConstructor.construct_scalar and scalarstring.py:DoubleQuotedScalarString are the best candidates. The former of those might take some digging to find, but the latter is actually the type you'll get if you inspect that string after loading, once you add the option to preserve quotes:
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(type(data['a']['b']))
which prints:
<class 'ruamel.yaml.scalarstring.DoubleQuotedScalarString'>
Knowing that, you can inspect that rather simple wrapper class:
class DoubleQuotedScalarString(ScalarString):
    __slots__ = ()
    style = '"'

    def __new__(cls, value, anchor=None):
        # type: (Text, Any) -> Any
        return ScalarString.__new__(cls, value, anchor=anchor)
"update" the only method there (__new__) to do your double
encoding (you might have to put in additional checks to not double encode all
double quoted scalars0:
import sys
import codecs
import ruamel.yaml

def my_new(cls, value, anchor=None):
    # type information only needed if using mypy
    # value is of type 'str': encode to bytes "without conversion" via latin-1,
    # then decode those bytes as UTF-8
    value = value.encode('latin_1').decode('utf-8')
    return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value, anchor=anchor)

ruamel.yaml.scalarstring.DoubleQuotedScalarString.__new__ = my_new
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(data)
which gives:
ordereddict([('a', ordereddict([('b', '’')]))])
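The core of the fix, seen in isolation (a hypothetical REPL session): latin-1 maps the code points 0-255 to identical byte values, so encoding with it turns the mis-decoded str back into the original bytes, which can then be decoded as UTF-8.
>>> s = '\xE2\x80\x99'                   # what the loader builds from the \xAB escapes
>>> s.encode('latin_1')                  # back to the raw bytes, unchanged
b'\xe2\x80\x99'
>>> s.encode('latin_1').decode('utf-8')
'’'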

unescape alternative in node.js

I'm using the deprecated unescape function in one of my programs.
The protocol I'm working with sends escaped binary strings via the query string. So on their side they are doing something along the lines of the following (everything except 0-9, a-z, A-Z and a few symbols such as '.', '-' and '_' is encoded using the "%nn" format):
var source = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a";
var encoded = escape(source);
// encoded is now "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
So I'm receiving this string on my end and I have to decode it. decodeURIComponent is not working in this case so I rely on unescape:
var received = "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A";
var binaryString = unescape(received);
Since unescape is deprecated, any pointers on how should I decode these binary strings?
Note: querystring.unescape doesn't work either...
Having a similar issue here; after doing some research I found this package, which can still be of use to someone:
https://www.npmjs.com/package/unescape
The unescape() function was deprecated in JavaScript version 1.5. Use decodeURI() or decodeURIComponent() instead.
The unescape() function decodes an encoded string.
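For what it's worth, the decoding itself is straightforward to reproduce outside JavaScript. Here is the same round trip sketched in Python (purely illustrative, assuming every %nn sequence encodes a single byte, as in the question):
>>> from urllib.parse import unquote_to_bytes
>>> received = "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
>>> unquote_to_bytes(received)  # each %nn becomes one raw byte
b'\x124Vx\x9a\xbc\xde\xf1#Eg\x89\xab\xcd\xef\x124Vx\x9a'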

Unable to remove string from text I am extracting from html

I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However, the text I get back often contains several &#13; strings. I have tried using Python's replace function, I have also tried using a regular expression's replace function, and I have also tried using the unicode encode and decode functions. None of these approaches has worked. For the replace and regular-expression approaches I just get back my original text with the &#13; strings still present, and with the unicode encode/decode approach I get the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using; it takes the initial URL and, using readability, extracts the main article. I have left in all my commented-out code corresponding to the different approaches I have tried to remove the &#13; string. It appears as though &#13; is interpreted to be u'\xa9'.
import re
import urllib
from readability.readability import Document

def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'", " ")
    #print re.sub("&#13;", '', readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii', 'ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are ", readable_article.count("&#13;"), "&#13;'s"
    #print readable_article.encode(sys.stdout.encoding, '')
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    s2 = s1.replace("\n", "")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have included, inside the def, a URL that I have been testing the code with. If anybody has any ideas on how to remove this &#13; I would appreciate the help. Thanks, George
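A likely explanation and fix, offered as a hedged sketch rather than a confirmed answer: &#13; is the HTML entity for a carriage return (U+000D), and the u'\xa9' in the traceback is the unrelated © sign, which merely breaks printing to an ASCII console. Replacing the entity text (or the already-decoded u'\r') directly, and encoding explicitly before printing, should work:
# &#13; may survive as the literal entity text, or as an already-decoded u'\r'
cleaned = readable_article.replace("&#13;", "").replace(u'\r', u'')
# encode explicitly instead of relying on the ASCII default when printing,
# which is what raised the UnicodeEncodeError on u'\xa9'
print cleaned.encode('utf-8')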
