How to change a Python 3 string into 'readable' text - python-3.x

I have this string in var1
var1 = '$a=1%7Cscroll%20on%20%22Page%3A%20Generator-Sets-Construction%3Fid%3Dci%26s%3DY2l8Tj00Mjk0NzQ4MDY5KzQyOTQ5NjM4OTY%3D%22%7C-%7Cscroll%7C1443616500011%7C1443616500586%7C3774$fId=16440287_806$rId=RID_-62268720$rpId=1762047089$domR=1443616443684$time=1443616500588'
How can I convert the contents of the string into 'readable' text, i.e. decode the URL encoding?
From my research, here is the code I have tried, but it leaves the URL-encoded items (e.g. %20) in place.
import html
print(html.unescape('$a=1%7Cscroll%20on%20%22Page%3A%20Generator-Sets-Construction%3Fid%3Dci%26s%3DY2l8Tj00Mjk0NzQ4MDY5KzQyOTQ5NjM4OTY%3D%22%7C-%7Cscroll%7C1443616500011%7C1443616500586%7C3774$fId=16440287_806$rId=RID_-62268720$rpId=1762047089$domR=1443616443684$time=1443616500588'))
Any help is appreciated, as is a pointer to an existing module that does this.

What you are trying to do is unquote a URL-encoded parameter string, not unescape HTML. The following should work:
import urllib.parse
print(urllib.parse.unquote('$a=1%7Cscroll%20on%20%22Page%3A%20Generator-Sets-Construction%3Fid%3Dci%26s%3DY2l8Tj00Mjk0NzQ4MDY5KzQyOTQ5NjM4OTY%3D%22%7C-%7Cscroll%7C1443616500011%7C1443616500586%7C3774$fId=16440287_806$rId=RID_-62268720$rpId=1762047089$domR=1443616443684$time=1443616500588'))
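To make the difference between the two functions concrete, here is a minimal sketch. The sample strings are shortened, made-up stand-ins for the one in the question. html.unescape only touches HTML entities such as &quot;, while urllib.parse.unquote decodes percent-escapes such as %20.

```python
import html
from urllib.parse import unquote

# Shortened, hypothetical sample in the same style as the question's string.
encoded = "scroll%20on%20%22Page%3A%20A%26B%22"

# html.unescape finds no HTML entities here, so the string comes back unchanged.
unescaped = html.unescape(encoded)

# unquote decodes the percent-escapes (%20 -> space, %22 -> ", %3A -> :, %26 -> &).
unquoted = unquote(encoded)

# For comparison: this is the kind of input html.unescape is meant for.
entity_text = html.unescape("&quot;A &amp; B&quot;")

print(unescaped)
print(unquoted)
print(entity_text)
```

If spaces in the data are encoded as + (as in form submissions), urllib.parse.unquote_plus is the variant to reach for instead.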

Related

How to delete unwanted icon (emoji) characters in Python

Is there a way to strip icon characters from a string automatically?
input: This is String 💋👌✅this is string✅✍️string✍️✔️
desired output: This is String this is stringstring
replace('💋👌✅', '') is not an option, because the icon characters differ from string to string and are not known in advance.
Try this:
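import re

# Miscellaneous symbols and dingbats, pictographs and emoticons, transport symbols,
# plus the variation selector U+FE0F that often trails an emoji.
RE_EMOJI = re.compile('[\u2600-\u27BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\uFE0F]')

def strip_emoji(text):
    return RE_EMOJI.sub('', text)

print(strip_emoji('This is String 💋👌✅this is string✅✍️string✍️✔️'))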
Alternatively, use the re module to remove every character you don't want, for example everything except ASCII letters and spaces:
import re
re.sub(r'[^a-zA-Z ]', '', my_string)
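Both answers rely on hand-written character ranges. As an alternative sketch (not taken from either answer), the standard unicodedata module can classify each character, so symbols are dropped by category rather than by range. The function name strip_symbols is made up for illustration, and the assumption is that every unwanted icon is in the Unicode category "So" (Symbol, other).

```python
import unicodedata

def strip_symbols(text):
    """Drop 'Symbol, other' characters and variation selectors."""
    kept = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            continue  # pictographs, dingbats and similar symbols
        if "\ufe00" <= ch <= "\ufe0f":
            continue  # variation selectors that may trail an emoji
        kept.append(ch)
    return "".join(kept)

# The question's input, written with escapes so the code points are explicit.
sample = ("This is String \U0001F48B\U0001F44C\u2705this is string"
          "\u2705\u270D\uFE0Fstring\u270D\uFE0F\u2714\uFE0F")
cleaned = strip_symbols(sample)
print(cleaned)
```

Note that category "So" also covers non-emoji symbols such as © or °, and that some emoji sequences use characters outside it (skin-tone modifiers, for example), so this is a starting point rather than a complete emoji filter.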

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Is there any way to tell RequestParser which encoding the string uses?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure RequestParser() is trying to be too smart and can't understand the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter arrives as UTF-8. So when it sees a field in another encoding (among other UTF-8 fields!), it tries to decode it as UTF-8 and fails. As a result, every character that cannot be decoded becomes U+FFFD (the replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
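import re
from flask import request

raw_data = request.get_data()  # raw request body as bytes
contents = raw_data.decode('windows-1251')  # windows-1251 is another name for cp1251
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)  # the body of the "text" form field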
Not the most beautiful solution, but it works.
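The failure described above can be reproduced without Flask. This sketch assumes, as the answer suggests, that the cp1251 bytes get decoded as UTF-8 with undecodable bytes replaced; the sample text is made up.

```python
original = "Привет!"
wire_bytes = original.encode("cp1251")  # what the client is assumed to send

# Decoding as UTF-8 with errors="replace": the Cyrillic bytes are not valid
# UTF-8, so they turn into U+FFFD, while the ASCII "!" survives.
misread = wire_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the client actually used recovers the text.
recovered = wire_bytes.decode("cp1251")

print(ascii(misread))
print(recovered)
```

Once the characters have been replaced with U+FFFD, the original bytes are gone, which is why changing the encoding of the log file afterwards made no difference. The decoding has to happen on the raw bytes, as the get_data() workaround does.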

How to decode string with unicode in python?

I have the following line:
%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22
Expected Result:
{"appVersion":1,"modulePrefix":"web-experience-app","environment":"production","rootURL":"/","
You can check it out here.
What I tried:
foo = '%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22'
codecs.decode(foo, 'unicode-escape')
foo.encode('utf-8').decode('utf-8')
This does not work. What other options are there?
The string is URL-encoded. You can convert it by reversing the URL encoding.
from urllib import parse
s = '%7B%22appVersion%22%3A1%2C%22modulePrefix%22%3A%22web-experience-app%22%2C%22environment%22%3A%22production%22%2C%22rootURL%22%3A%22/%22%2C%22'
unquoted = parse.unquote(s)
print(unquoted)
# {"appVersion":1,"modulePrefix":"web-experience-app","environment":"production","rootURL":"/","
This looks like part of a larger JSON string. The complete object can be de-serialised with json.loads.
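A minimal sketch of that last step, using a short, complete payload invented for illustration (the string in the question is truncated, so it cannot be parsed as JSON as-is):

```python
import json
from urllib.parse import unquote

# Hypothetical, complete payload in the same style as the question's string.
encoded = "%7B%22appVersion%22%3A1%2C%22rootURL%22%3A%22/%22%7D"

decoded = unquote(encoded)    # URL-decode first ...
config = json.loads(decoded)  # ... then parse the resulting JSON text

print(decoded)
print(config["appVersion"], config["rootURL"])
```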

How to extract a PDF's text using pdfrw

Can pdfrw extract the text out of a document?
I was thinking something along the lines of
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page()) # ..or something
In the docs they explain how to extract the text. However, it's just a bytestream. You could iterate over the pages and decode them individually.
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream  # a str holding the raw bytes, not a bytes object
    # TODO: decode bytestream somehow, maybe using zlib.decompress
    # then do something with the resulting text
Edit:
It may be worth noting that, according to the author, pdfrw does not yet support text decompression due to its complexity.
It depends on which filters are applied to the page.Contents.stream. If it is only FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.
Note: Give the whole Contents object in a list to the function
Note: This is not the same as pdfrw.PdfReader.uncompress()
Then you have to parse the string to find your text. It will be in blocks of lines between BT (begin text) and ET (end text) markers, inside round brackets on lines ending in either 'TJ' or 'Tj'.
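The parsing step described above can be sketched roughly as follows. This is an illustration only: the function name and the sample content stream are made up, and the regular expressions ignore escaped parentheses, hex strings and font encodings, all of which real PDFs may use.

```python
import re

def extract_text_operands(content):
    """Collect the string operands of Tj/TJ operators inside BT...ET blocks."""
    chunks = []
    for block in re.findall(r"\bBT\b(.*?)\bET\b", content, flags=re.DOTALL):
        for line in block.splitlines():
            line = line.strip()
            if line.endswith(("Tj", "TJ")):
                # Join every (...) operand on the line; kerning numbers are skipped.
                chunks.append("".join(re.findall(r"\((.*?)\)", line)))
    return chunks

# Hypothetical, already-decompressed content stream.
sample_stream = "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\n[(Wor) -20 (ld)] TJ\nET\n"
texts = extract_text_operands(sample_stream)
print(texts)
```

For anything beyond simple documents, a library with a dedicated text-extraction layer is likely to be more reliable than hand-parsing the stream.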
Here's an example that may be useful:
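for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)
Here extractText() returns the text of each page, and pages containing the keyword CSE are counted and added to the writer. Note that extractText() is, as far as I know, a PyPDF2 page method rather than a pdfrw one.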

Decode HTML Entities to UTF-8 in groovy

I am not sure how to decode HTML entities to UTF-8 in Groovy.
"& quot;"
should be decoded as " in a Groovy program.
Can anyone help me with a solution?
For semantic correctness: What you actually want to do is not to change the encoding, but to parse the markup.
(I may go to engineers' hell for giving you the following solution, but you might actually do this if you insist on staying with the Groovy essentials to solve your problem.)
You can surround your string or character with a valid HTML tag and use XmlSlurper to parse the input for you. For your example string, it looks like this:
import groovy.util.XmlSlurper // comes with Groovy (using 2.4.4)
def myString = "& quot;".replace(" ","") // your string + replaced whitespaces for correct parsing
def embeddedString = "<tag>" + myString + "</tag>" // tag name can be any name
def tmp = new XmlSlurper().parseText(embeddedString)
println tmp // prints: "
The output is what you desired. (and now I may burn)
