How to remove unicode characters without removing parts of text - python-3.x

I am trying to do an N-gram analysis on an ancient language that does not have a modern orthography, and I am running into an encoding problem.
The orthography looks like the following.
It is contained in a .docx document, and I use the following code to retrieve it:
Text = docx2txt.process(Corpus)
print(Text)
When I put it into a dictionary, it spits out the following:
"daniel.mahabi": {"xq\u2019xucubaquibms.xqui\u00e7i\ua72dih": 1.0},
I can partially resolve this with the following code:
Text = docx2txt.process(Corpus)
Text = Text.encode("ascii", "ignore")
Text = Text.decode()
However, upon doing that it also removes parts of the text: encoding to ASCII with "ignore" silently drops every non-ASCII character, including ones such as \u2019 and \ua72d that belong to the orthography.
What can I do to resolve this?
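One possible direction (a sketch, not part of the original question): instead of discarding everything outside ASCII, normalize the text and remove only the specific characters you do not want. The unwanted set below is purely illustrative.

import unicodedata

Text = docx2txt.process(Corpus)
# NFC normalization composes base characters and combining marks into
# single code points, so identical glyphs compare equal in the N-grams.
Text = unicodedata.normalize("NFC", Text)

# Drop only explicitly unwanted characters; the orthography survives.
unwanted = {"\u200b"}  # e.g. a zero-width space; adjust for your corpus
Text = "".join(ch for ch in Text if ch not in unwanted)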

Skip processing fenced code blocks when processing Markdown files line by line

I'm a very inexperienced Python coder, so it's quite possible that I'm approaching this particular problem in completely the wrong way, but I'd appreciate any suggestions/help.
I have a Python script that goes through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I'm doing this using two regexes in one function as shown below:
import logging
import re

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\n' strings.
    """
    file = file_obj
    linelist = []
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
        linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                 r"[\2](\2 \5.md)", line) for line in linelist]
        # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
        # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final
This works fine for most Markdown files. However, I can occasionally get a Markdown file that has [[wikilinks]] within fenced code blocks such as the following:
# Reference
Here is a reference to “the Reactome Project” using smart quotes.
Here is an image: ![](./images/Screenshot.png)
[[201802150808]](Product discovery)
```
[[201802150808 Product Prioritization]]
def foo():
    print("bar")
```
In the above case I should skip processing the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regex that identifies the fenced code block correctly namely:
(?<=```)(.*?)(?=```)
However, since the existing function is running line by line, I have not been able to figure out a way to skip the entire section in the for loop. How do I go about doing this?
You need to use a full Markdown parser to be able to cover all of the edge cases. Of course, most Markdown parsers convert Markdown directly to HTML. However, a few will use a two step process where step one converts the raw text to an Abstract Syntax Tree (AST) and step two renders the AST to the output format. It is not uncommon to find a Markdown renderer (outputs Markdown) which can replace the default HTML renderer.
You would simply need to modify either the parser step (using a plugin to add support for the wikilink syntax) or modify the AST directly. Then pass the AST to a Markdown renderer, which will give you a nicely formatted and normalized Markdown document. If you are looking for a Python solution, mistune or Pandoc filters might be a good place to start.
But why go through all that when a few well-crafted regular expressions can be run on the source text? Because Markdown parsing is complicated. I know, it seems easy at first. After all, Markdown is easy for a human to read (which was one of its defining design goals). However, parsing is actually very complicated, with parts of the parser reliant on previous steps.
For example, in addition to fenced code blocks, what about indented code blocks? But you can't just check for indentation at the beginning of a line, because a single line of a nested list could look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is broken across two lines? Generally when parsing inline markup, Markdown parsers will treat a single line break no different than a space. The point of all of this is that before you can start parsing inline elements, the entire document needs to first be parsed into its various block-level elements. Only then can you step through those and parse inline elements like links.
I'm sure there are other edge cases I haven't thought of. The only way to cover them all is to use a full-fledged Markdown parser.
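For a concrete sense of the two-step approach, here is a rough sketch. It assumes mistune 2.x, where create_markdown(renderer='ast') returns a list of token dictionaries instead of HTML; the exact call and token shapes differ between mistune versions, so treat this as illustrative only.

import mistune

# Step one: parse to an AST (block-level token dicts) instead of HTML.
markdown = mistune.create_markdown(renderer='ast')
tokens = markdown("# Heading\n\nSome text with [[201802150808]]\n\n```\n[[skipped]]\n```\n")

# Code-block tokens can simply be left untouched when walking the tree;
# wikilink rewriting would only modify text tokens.
for token in tokens:
    print(token['type'])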
I was able to create a reasonably complete solution to this problem by making a few changes to my original function, namely:
Replace the built-in re module with the regex module available on PyPI.
Change the function to read the entire file into a single variable instead of reading it line by line.
The revised function is as follows:
import logging
import regex

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """
    file = file_obj
    try:
        with open(file, encoding="utf8") as infile:
            line = infile.read()
            # Read the entire file as a single string
            linelist = regex.sub(r"(?V1)"
                                 r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                 # Ignore fenced & inline code blocks. The V1 engine allows in-line
                                 # flags, so we enable newline matching only here.
                                 r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                 # Ignore code blocks beginning with 4 spaces/1 tab
                                 r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
            # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar) or
            # [[foo]] (bar). Capture group $3 returns just foo
            linelist_final = regex.sub(r"(?V1)"
                                       r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                       r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                       # See the comments above for this portion.
                                       r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
            # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123
            # and capture group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final
The above function handles [[wikilinks]] in inline code blocks, fenced code blocks and code blocks indented with 4 spaces. There is currently one false-positive scenario, in which the indented-code rule also matches deeply nested list items and the function therefore ignores a valid [[wikilink]]: when the link appears on the 3rd level or deeper of a Markdown list, i.e.:
* Level 1
  * Level 2
    * [[wikilink]] #Not recognized
      * [[wikilink]] #Not recognized.
However, my documents do not have wikilinks nested at that level in lists, so it's not a problem for me.
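As a side note, the key trick in the function above is the (*SKIP)(*FAIL) pair of backtracking verbs, which the regex module supports but the built-in re does not. A minimal, self-contained illustration (the sample text is made up):

import regex

text = "keep [[foo]] but not `[[bar]]`"
# The code-span alternative matches first and (*SKIP)(*FAIL) discards
# that region, so only [[...]] outside backticks reaches the rewrite.
print(regex.sub(r"`[^`]*`(*SKIP)(*FAIL)|\[\[(.*?)\]\]", r"[\1](\1.md)", text))
# -> keep [foo](foo.md) but not `[[bar]]`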

How to save the output of text from selenium chrome (Python)

I'm using Selenium to extract the comments of a YouTube video.
Everything went well, but when I print comment.text, the output is only the last sentence.
I don't know how to save it for further analysis (cleaning and tokenization).
path = "/mnt/c/Users/xxx/chromedriver.exe"
This is the path where I saved the chromedriver I downloaded.
chrome = webdriver.Chrome(path)
url = "https://www.youtube.com/watch?v=WPni755-Krg"
chrome.get(url)
chrome.maximize_window()
# scroll down
sleep = 5
chrome.execute_script('window.scrollTo(0, 500);')
time.sleep(sleep)
chrome.execute_script('window.scrollTo(0, 1080);')
time.sleep(sleep)
text_comment = chrome.find_element_by_xpath('//*[@id="contents"]')
comments = text_comment.find_elements_by_xpath('//*[@id="content-text"]')
comment_ids = []
Try this approach for getting the text of all comments. (The for-loop part was edited; there was no indentation in the previous code.)
for comment in comments:
    comment_ids.append(comment.get_attribute('id'))
    print(comment.text)
When I print, I can see all the texts here, but how can I open them for further study? Should I always use a for loop? I want to tokenize the texts, but the output is only the last sentence. Is there a way to save all the texts into a file and open it again? I googled it a lot, but wasn't successful.
So it sounds like you're just trying to store these comments to reference later. Your current solution is to append them to a string and use a token to create substrings? I'm not familiar with Python's data structures, but this sounds like a great job for an array or a list, depending on how you plan to reference this data.
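In Python terms, that means collecting each comment's text into a list and then writing it to disk. A minimal sketch, assuming the comments list from the question above (the file name is arbitrary):

comment_texts = [comment.text for comment in comments]

# One comment per line, so the file can be re-read later for cleaning
# and tokenization.
with open("comments.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(comment_texts))

# Later, load the comments back for analysis:
with open("comments.txt", encoding="utf-8") as f:
    saved_comments = f.read().splitlines()

Note that a comment containing line breaks will span several lines in the file; dumping the list to JSON instead avoids that ambiguity.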

A workaround for a problem with s.lstrip()

I am reading an xml file that contains lines of the type:
<PLAYER_NAME>Andrew Tell</PLAYER_NAME>
I want to extract all the names from the file and I have tried:
name = (line.strip()
        .lstrip('<PLAYER_NAME>')
        .rstrip('</PLAYER_NAME>'))
and
name = line.strip()
name = name.lstrip('<PLAYER_NAME>')
name = name.rstrip('</PLAYER_NAME>')
These work for some names, but if a name starts with any of A, E, L, M, N, R, Y (and possibly some others), then that character is also stripped, so in the above example I get 'ndrew Tell', while William Tell is fine. I have not tested the full alphabet, but I do know that names starting with any of B, C, D, H, I, J, S, T, W are all extracted correctly.
I have had to resort to the ugly:
namebits = line.split('>',1)
name = namebits[-1].split('<')[0]
This seems to work for all names.
Is this a known problem with s.lstrip, or am I doing something wrong?
This is documented behavior rather than a bug: str.lstrip (and str.rstrip) treat their argument as a set of characters to remove, not as a prefix string, so any leading letter that happens to occur in '<PLAYER_NAME>' is stripped too. More fundamentally, though: use an XML parser for XML. Every other approach is broken.
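A quick demonstration of the set-based stripping (an illustrative session):

>>> '<PLAYER_NAME>Andrew Tell</PLAYER_NAME>'.lstrip('<PLAYER_NAME>')
'ndrew Tell</PLAYER_NAME>'
>>> '<PLAYER_NAME>William Tell</PLAYER_NAME>'.lstrip('<PLAYER_NAME>')
'William Tell</PLAYER_NAME>'

'A' occurs in the character set of '<PLAYER_NAME>', so it is stripped; 'W' does not, so William survives.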
Luckily an XML parser is built into Python and using it is easy. It's most probably easier than your current code.
import xml.etree.ElementTree as ET
tree = ET.parse('your_file.xml')
player_name = tree.find('.//PLAYER_NAME')
print(player_name.text)
Read file, search element, get text. No awkward string manipulation required. Assuming this XML file:
<PLAYER>
    <PLAYER_NAME>Andrew Tell</PLAYER_NAME>
</PLAYER>
the output is unsurprising:
Andrew Tell
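And if the file contains more than one player, findall returns every matching element rather than just the first; a small sketch:

import xml.etree.ElementTree as ET

tree = ET.parse('your_file.xml')
# Iterate over all PLAYER_NAME elements anywhere in the tree.
for player_name in tree.findall('.//PLAYER_NAME'):
    print(player_name.text)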

PyX unicode text

So I am trying to generate PostScript from Python.
Currently I am trying with PyX 0.14.1 on Python 3.4.2,
but I am open to suggestions if you know something simpler.
I was mostly following the suggestions found on the PyX
mailing list in this thread, which is for Python 2 and quite old.
The following shows my current code after many changes:
from pyx import *
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')
c = canvas.canvas()
c.text(5, 5, "Sören Sundstrøm".encode("utf8"))
p = document.page(c, paperformat=document.paperformat.A4,
                  centered=0)
d = document.document([p])
d.writePSfile('test.ps')
PyX stops with a TexResultError. The interesting part of the error
shows what's happening in TeX:
pyx.text.TexResultError: unhandled TeX response (might be an error)
The expression passed to TeX was:
\ProcessPyXBox{b'S\xc3\xb6ren Sundstr\xc3\xb8m'%
}{1}%
\PyXInput{7}%
After parsing the return message from TeX, the following was left:
*
*! Undefined control sequence.
<argument> b'S\xc
3\xb 6ren Sundstr\xc 3\xb 8m'
<*> }{1}
(cut after 5 lines; use errordetail.full for all output)
So it looks like LaTeX is receiving not utf-8,
but an escaped representation of the utf-8 bytes.
My question: How do I pass the string to canvas.text correctly?
Or is my preamble wrong?
I also tried to follow this answer by wobsta here on SO,
but besides being much too complicated, it does not work for me either.
(Looks like PyX does not understand a metafont message in this case).
Running latex directly on a simple utf-8 input file with the same preamble
works fine by the way.
Looking into the PyX code revealed the problem.
The text module prepares an io.TextIOWrapper with utf-8 encoding to be used for TeX input. The string parameters in text.preamble and canvas.text are passed verbatim to the wrapper, so in Python 3 you just pass a string without any encoding necessary. Encoding will be done by the wrapper.
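So the fix for the first problem is simply to drop the manual encoding and pass the plain string (consistent with the corrected call further below):

c.text(5, 5, "Sören Sundstrøm")  # a str; PyX's io.TextIOWrapper encodes it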
My original unsimplified code had another problem which made it difficult to solve this first problem. So for completeness here's the second problem and its solution. My original code had this order of operations:
from pyx import *
c = canvas.canvas()
# doing other stuff with canvas
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')
c.text(5, 5, "Sören Sundstrøm")
p = document.page(c, paperformat=document.paperformat.A4,
                  centered=0)
d = document.document([p])
d.writePSfile('test.ps')
This does not work either, because when a canvas is created it keeps a reference to a text.defaulttexrunner, which is set up with the current settings of the text module. The changed text module settings never influence the canvas instance. So you have to set up the text module before you create the canvas you want to draw text into.
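Putting both fixes together, the working order looks like this (a sketch assembled from the snippets above):

from pyx import *

# Configure TeX before any canvas is created, so the canvas picks up
# the utf-8-aware defaulttexrunner.
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')

c = canvas.canvas()
c.text(5, 5, "Sören Sundstrøm")  # plain str, no manual encoding
p = document.page(c, paperformat=document.paperformat.A4,
                  centered=0)
d = document.document([p])
d.writePSfile('test.ps')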
Thanks to anyone who looked into this.

Python3 utf-8 decoding/encoding problems with data hiding

I'm trying to take the text from a file (the text is Russian), hide it in an image, and then later be able to retrieve it from the image. However, I keep getting binascii.Error: Odd-length string when I try to retrieve the data from the image I hid it in.
I feel like the problem may lie in how I hide the text. When I do someString = file.read() on the file and print someString, everything comes out fine. But when I run:
file = open(<text file path>, 'r', encoding='utf-8')
entireText = file.read()
print(codecs.encode(entireText,'utf-8'))
I get the following:
b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xbf\xd0\xb5\xd0\xb2:\n\xd0\x9e\xd1\x87\xd0\xb8 \xd1\x87\xd1\x91\xd1\x80\xd0\xbd\xd1\x8b\xd0\xb5, \xd0\xbe\xd1\x87\xd0\xb8 \xd0
That is only a piece of it, but the pattern is clear: there are colons, spaces, commas, and \n sequences all throughout the bytes object, which is the type that codecs.encode returns. If I use codecs to decode it, then I get the original text back in perfect shape.
if it helps, here are the functions I use to make it happen:
import binascii

def stringToBinary(msg):
    return bin(int(binascii.hexlify(msg.encode('utf-8')), 16))[2:]

def binaryToString(bNum):
    return binascii.unhexlify('%x' % (int('0b' + bNum, 2))).decode('utf-8')
If that is not enough, the entire file is here: http://pastebin.com/f541DpzS
EDIT: I think I'm getting that issue because the image I'm trying to hide the text in didn't have enough pixels for me to hide the complete message, so it was trying to convert the binary number to a string without all of the bits, thus throwing binascii.Error: Odd-length string.
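That diagnosis fits how the helpers are written: bin(int(...)) also drops any leading zero bits, so the recovered hex string can easily end up with an odd length. A defensive variant of the decoder (a sketch; it cannot restore bits that were never embedded) pads the hex string before unhexlifying:

import binascii

def binaryToString(bNum):
    hexStr = '%x' % int(bNum, 2)
    if len(hexStr) % 2:  # unhexlify requires an even-length string
        hexStr = '0' + hexStr
    # errors='replace' keeps a truncated payload from raising a
    # UnicodeDecodeError on top of the odd-length problem.
    return binascii.unhexlify(hexStr).decode('utf-8', errors='replace')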
