Skip processing fenced code blocks when processing Markdown files line by line - python-3.x
I'm a very inexperienced Python coder so it's quite possible that I'm approaching this particular problem in completely the wrong way but I'd appreciate any suggestions/help.
I have a Python script that goes through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I'm doing this using two regexes in one function as shown below:
import logging
import re


def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.
    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\n' strings.
    """
    file = file_obj
    linelist = []
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
            linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                     r"[\2](\2 \5.md)", line) for line in linelist]
            # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
            # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final
This works fine for most Markdown files. However, I can occasionally get a Markdown file that has [[wikilinks]] within fenced code blocks such as the following:
# Reference
Here is a reference to “the Reactome Project” using smart quotes.
Here is an image: ![](./images/Screenshot.png)
[[201802150808]](Product discovery)
```
[[201802150808 Product Prioritization]]
def foo():
    print("bar")
```
In the above case I should skip processing the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regex that correctly identifies the fenced code block, namely:
(?<=```)(.*?)(?=```)
However, since the existing function is running line by line, I have not been able to figure out a way to skip the entire section in the for loop. How do I go about doing this?
You need to use a full Markdown parser to be able to cover all of the edge cases. Of course, most Markdown parsers convert Markdown directly to HTML. However, a few will use a two step process where step one converts the raw text to an Abstract Syntax Tree (AST) and step two renders the AST to the output format. It is not uncommon to find a Markdown renderer (outputs Markdown) which can replace the default HTML renderer.
You would simply need to modify either the parser step (using a plugin to add support for the wikilink syntax) or modify the AST directly. Then pass the AST to a Markdown renderer, which will give you a nicely formatted and normalized Markdown document. If you are looking for a Python solution, mistune or Pandoc Filters might be a good place to start.
But why go through all that when a few well crafted regular expressions can be run on the source text? Because Markdown parsing is complicated. I know, it seems easy at first. After all Markdown is easy to read for a human (which was one of its defining design goals). However, parsing is actually very complicated with parts of the parser reliant on previous steps.
For example, in addition to fenced code blocks, what about indented code blocks? You can't just check for indentation at the beginning of a line, because a single line of a nested list can look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is broken across two lines? Generally, when parsing inline markup, Markdown parsers treat a single line break the same as a space. The point of all of this is that before you can start parsing inline elements, the entire document first needs to be parsed into its various block-level elements. Only then can you step through those and parse inline elements like links.
I'm sure there are other edge cases I haven't thought of. The only way to cover them all is to use a full-fledged Markdown parser.
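To make the "blocks first, inline second" point concrete, here is a minimal sketch using only the standard library. It is not a Markdown parser: it tracks one piece of block-level state (whether we are inside a backtick or tilde fence) and only runs the inline substitution outside of it. The names rewrite_outside_fences, WIKILINK and FENCE are made up for this illustration, and the wikilink pattern is a simplified stand-in for the one in the question.

```python
import re

# Simplified, hypothetical pattern: [[foo]] not followed by "(" or " (".
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\](?!\s?\()")
# A fence line: up to 3 leading spaces, then 3+ backticks or 3+ tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")


def rewrite_outside_fences(lines):
    """Rewrite [[foo]] as [foo](foo.md), copying fenced code blocks verbatim."""
    out = []
    open_fence = None  # marker that opened the current fenced block, if any
    for line in lines:
        match = FENCE.match(line)
        if match:
            marker = match.group(1)
            if open_fence is None:
                open_fence = marker  # entering a fenced block
            elif marker[0] == open_fence[0] and len(marker) >= len(open_fence):
                open_fence = None  # matching closing fence
            out.append(line)
            continue
        if open_fence is not None:
            out.append(line)  # inside a code block: leave untouched
        else:
            out.append(WIKILINK.sub(r"[\1](\1.md)", line))
    return out
```

This still knows nothing about indented code blocks, inline code spans, nested lists or links broken across lines, which is exactly why a real parser is the safer route.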
I was able to create a reasonably complete solution to this problem by making a few changes to my original function, namely:
Replace the Python built-in re module with the regex module available on PyPI.
Change the function to read the entire file into a single variable instead of reading it line by line.
The revised function is as follows:
import logging

import regex


def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.
    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """
    file = file_obj
    try:
        with open(file, encoding="utf8") as infile:
            line = infile.read()
            # Read the entire file as a single string
            linelist = regex.sub(r"(?V1)"
                                 r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                 # Ignore fenced & inline code blocks. V1 engine allows in-line flags so
                                 # we enable newline matching only here.
                                 r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                 # Ignore code blocks beginning with 4 spaces/1 tab
                                 r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
            # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar) or
            # [[foo]] (bar). Capture group $3 returns just foo
            linelist_final = regex.sub(r"(?V1)"
                                       r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                       r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                       # Refer comments above for this portion.
                                       r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
            # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123 and capture
            # group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final
The above function handles [[wikilinks]] in inline code blocks, fenced code blocks and code blocks indented with 4 spaces. There is currently one false negative, where it ignores a valid [[wikilink]]: when the link appears on the 3rd level or deeper of a Markdown list (presumably because those lines are indented by 4 or more spaces and so look like indented code), i.e.:
* Level 1
  * Level 2
    * [[wikilink]] #Not recognized
      * [[wikilink]] #Not recognized.
However my documents do not have wikilinks nested at that level in lists so it's not a problem for me.
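For anyone who cannot install the third-party regex module: the standard re module has no (*SKIP)(*FAIL) verbs, but a similar effect can be approximated by matching code first and handing it back unchanged from a replacement function. The following is only a sketch of that substitute technique, not the function above. The names PATTERN, _replace and convert_wikilinks are hypothetical, the wikilink pattern is simplified, and indented code blocks are deliberately not handled.

```python
import re

# Alternatives are tried left to right, so a fenced block or inline code span
# is consumed as a whole before any wikilink inside it can be matched.
PATTERN = re.compile(
    r"(?P<code>```.*?```|`[^`\n]*`)"              # fenced block or inline code
    r"|\[\[(?P<target>[^\[\]\n]+)\]\](?!\s?\()",  # standalone [[wikilink]]
    re.DOTALL,
)


def _replace(match):
    if match.group("code") is not None:
        return match.group(0)  # code: put it back exactly as it was
    target = match.group("target")
    return f"[{target}]({target}.md)"


def convert_wikilinks(text):
    """Rewrite standalone [[foo]] links outside of fenced and inline code."""
    return PATTERN.sub(_replace, text)
```

Called on the whole file contents as one string, this leaves [[foo]](bar) style links alone, so a second pass would still be needed for those.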
Related
Can you remove a random string of commas and replace with one for exporting to CSV
I am using Netmiko to extract some data from Cisco switches and routers, and I would like to put that data into a spreadsheet. For example, show cdp neighbour gives me a string with irregular whitespace:
Port Name Status Vlan Duplex Speed Type
Et0/0 connected 1 auto auto unknown
Et0/1 connected 1 auto auto unknown
Et0/2 connected routed auto auto unknown
Et0/3 connected 1 auto auto unknown
I thought I could remove the whitespace and replace it with commas, but I get this:
Port,,,,,,Name,,,,,,,,,,,,,,,Status,,,,,,,Vlan,,,,,,,Duplex,,Speed,Type
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Is there any way of extracting data like the above? Ideally it would go straight into a structured table in Excel (cells and rows). Alternatively, is there a way to do what I did and then replace each run of repeated commas with just one, so I can export to CSV and then import into Excel?
I may be the most long-winded person you have ever seen because I am so new to programming :)
I'd go with regex matches, which are more flexible. You can adapt this to your needs. I put the data in a file for testing, but you could process one line at a time instead.
Here's the file (called mydata.txt):
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Here's how to read it and write the result to a CSV file (mydata.csv):
import re
_re = re.compile('([^,]+)+')
newfile = open(r'mydata.csv', 'w')
with open(r'mydata.txt') as data:
    for line in data.readlines():
        newfile.write(','.join(f for f in _re.findall(line)))
newfile.close()
And here is the output:
Et0/0,connected,1,auto,auto,unknown
Et0/1,connected,1,auto,auto,unknown
Et0/2,connected,routed,auto,auto,unknown
Et0/3,connected,1,auto,auto,unknown
Explanation:
The re library allows the use of regular expressions for parsing text, so the first line imports it.
The second line compiles the regular expression that matches anything that is not a comma. It is only a specification; it doesn't do the extraction yet.
The third line opens the output file, with 'w' specifying that we can write to it.
The fourth line opens the input file, which is referenced by the name 'data'.
The fifth line reads each line from the input file one at a time.
The sixth line is an all-at-once operation that separates the non-comma parts of the input, joins them back together separated by commas, and writes the resulting string to the output file.
The last line closes the output file.
I hope I didn't misunderstand you. To turn the repeated commas into one single comma, just run this code with your string s:
while ",," in s:
    s = s.replace(",,", ",")
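The same collapse can also be done in one pass with a regular expression. This is a small sketch; collapse_commas is a name made up here.

```python
import re


def collapse_commas(line):
    """Replace every run of one or more commas with a single comma."""
    return re.sub(r",+", ",", line)


row = "Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown"
print(collapse_commas(row))
```

One caveat that applies to any comma-collapsing approach: an empty column disappears. In the data above the Name column is blank, so the collapsed header has seven fields while each data row has six, and the columns will not line up in Excel without further handling.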
How to write to a file every next line?
I am working in Python tkinter and I am trying to write code that writes some contents to a file and saves it. I have used filedialog to do so. I wish to write each piece of content on its own line. While there are no errors in running the code, even after writing "\n" it is not writing to the next line. The "\n" just adds a space after. How do I resolve this issue?
I have tried using "\n" in every way I could think of. Yet it is not writing to the next line. Instead it only adds a space after, just like &nbsp; does.
Following is the relevant part of the code:
def save_file(event=""):
    data = filedialog.asksaveasfile(mode="w", defaultextension=".html")
    if data is None:
        return
    data.write("Content-1" + "\n"+ "Content-2" + "\n")
    data.close()
I expect the data to be written in the file as:
Content-1
Content-2
But it is writing to the file as:
Content-1 Content-2
You are creating HTML files. The \n characters in them are meaningless if you look at your file in a browser (which is the go-to for HTML files). You need to write HTML line breaks to your file if you want the lines to break when the browser displays the interpreted HTML:
data.write("Content-1" + "<br>\n"+ "Content-2" + "<br>\n")
That way you can "see" HTML newlines in your browser. Open your file in a text editor, not a browser, to see the \n characters that are actually written to your file.
re-organize data stored in a csv
I have successfully downloaded my data from a given URL, and to store it in a CSV file I used the following code:
fx = open(destination_url, "w")  # write data into a file
for line in lines:  # loop through the string
    fx.write(line + "\n")
fx.close()  # close the file object
return
What happened is that the data is stored, but not in separate lines. The data is not separated into different lines when I use '\n'. Every separate line of data that I wanted seems to be separated by '\r' within the same cell of the CSV file.
I know I am missing something here, but can I get some pointers on rearranging each line that ends with a \r into a separate line? I hope I have made myself clear. Thanks ~V
There is a method called writelines (https://www.tutorialspoint.com/python/file_writelines.htm); the link has some examples. Try that first, as it should work. If it does not, print out the element during each iteration, because we need to see the format of the data (what is inside each element).
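If it turns out that the elements themselves contain embedded '\r' characters (an assumption, since the actual data is not shown), one possible approach is to split each element on any line ending before writing. This is only a sketch; write_clean_lines is a hypothetical helper name.

```python
def write_clean_lines(path, lines):
    """Write each logical line on its own row, splitting on \\r, \\n or \\r\\n."""
    # newline="" stops Python from translating the "\n" we write ourselves.
    with open(path, "w", encoding="utf-8", newline="") as fx:
        for chunk in lines:
            for line in chunk.splitlines():
                fx.write(line + "\n")
```

str.splitlines treats '\r' as a line boundary, so an element such as "a,1\rb,2" ends up as two rows instead of one cell.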
Python3 utf-8 decoding/encoding problems with data hiding
I'm trying to take the text from a file (the text is Russian), hide it in an image, and then later be able to retrieve it from the image. However, I keep getting binascii.Error: Odd-length string when I try to retrieve the data from the image I hid it in.
I feel like the problem may lie in what I use to hide the text. When I do someString = file.read() on the file and print someString, everything comes out fine. But when I run:
file = open(<text file path>, 'r', encoding='utf-8')
entireText = file.read()
print(codecs.encode(entireText,'utf-8'))
I get the following:
b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xbf\xd0\xb5\xd0\xb2:\n\xd0\x9e\xd1\x87\xd0\xb8 \xd1\x87\xd1\x91\xd1\x80\xd0\xbd\xd1\x8b\xd0\xb5, \xd0\xbe\xd1\x87\xd0\xb8 \xd0
That is only a piece of it, but the theme is shown: it has colons, spaces, commas, and \n all throughout the bytes object, which is the type that codecs.encode returns. If I use codecs to decode it, then I get the original text back in perfect format.
If it helps, here are the functions I use to make it happen:
def stringToBinary(msg):
    return bin(int(binascii.hexlify(msg.encode('utf-8')), 16))[2:]

def binaryToString(bNum):
    return binascii.unhexlify('%x' % (int('0b' + bNum, 2))).decode('utf-8')
If that is not enough, the entire file is here: http://pastebin.com/f541DpzS
EDIT: I think I'm getting that issue because the image I'm trying to hide the text in didn't have enough pixels for me to hide the complete message, so it was trying to convert the binary number to a string without all of the bits, thus throwing binascii.Error: Odd-length string.
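A possible way to make that kind of truncation easier to spot is to use a fixed width of 8 bits per byte instead of going through one big integer, which drops leading zero bits and can yield an odd number of hex digits. The sketch below is a suggested alternative rather than the code from the pastebin; string_to_bits and bits_to_string are names invented here.

```python
def string_to_bits(msg):
    """Encode text as UTF-8 and emit exactly 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in msg.encode("utf-8"))


def bits_to_string(bits):
    """Inverse of string_to_bits; rejects input that is not whole bytes."""
    if len(bits) % 8:
        raise ValueError("bit string is truncated: length is not a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")
```

With this layout the number of pixels needed is known up front (len(bits)), so it can be compared against the image capacity before hiding anything. A message cut off at a byte boundary in the middle of a multi-byte character would still fail, but with a UnicodeDecodeError from decode.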
Checking/Writing lines to a .txt file using Python
I'm new both to this site and to Python, so go easy on me. Using Python 3.3, I'm making a hangman-esque game, and all is working bar one aspect. I want to check whether a string is in a .txt file and, if not, write it on a new line at the end of the .txt file. Currently I can write to the text file on a new line, but if the string already exists, it still writes to the text file. My code is below. Note that my text file has each string on a separate line.
write = 1
if over == 1:
    print("I Win")
    wordlibrary = file('allwords.txt')
    for line in wordlibrary:
        if trial in line:
            write = 0
        if write == 1:
            with open("allwords.txt", "a") as text_file:
                text_file.write("\n")
                text_file.write(trial)
Is this really the indentation from your program? As written above, in the first iteration of the loop on wordlibrary, the trial is compared to the line, and since (from your symptoms) it is not contained in the first line, the program moves on to the next part of the loop: since write==1, it will append trial to the text_file. cheers, Amnon
You don't need to know the number of lines present in the file beforehand. Just use a file iterator. You can find the documentation here: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects Pay special attention to the readlines method.
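Putting the two answers together, one way to structure it is to finish checking the whole file first and only then decide whether to append. This is a minimal sketch under the assumption that each word sits on its own line; add_word_if_missing is a hypothetical helper, not part of the original program.

```python
from pathlib import Path


def add_word_if_missing(path, word):
    """Append word on its own line unless an identical line already exists.

    Returns True if the word was written, False if it was already present.
    """
    file = Path(path)
    text = file.read_text(encoding="utf-8") if file.exists() else ""
    if word in text.splitlines():  # exact line match, not a substring match
        return False
    # Start a new line first if the file does not already end with one.
    prefix = "" if not text or text.endswith("\n") else "\n"
    with file.open("a", encoding="utf-8") as handle:
        handle.write(f"{prefix}{word}\n")
    return True
```

Comparing against whole lines also avoids the substring problem of "if trial in line", where a word such as "app" would be treated as present because "apple" is in the file.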