A workaround for a problem with s.lstrip() - python-3.x

I am reading an xml file that contains lines of the type:
<PLAYER_NAME>Andrew Tell</PLAYER_NAME>
I want to extract all the names from the file and I have tried:
name = (line.strip()
        .lstrip('<PLAYER_NAME>')
        .rstrip('</PLAYER_NAME>'))
and
name = line.strip()
name = name.lstrip('<PLAYER_NAME>')
name = name.rstrip('</PLAYER_NAME>')
These work for some names, but if a name starts with any of:
A, E, L, M, N, R, Y (and possibly some others) then that character is stripped as well, so in the above example I get 'ndrew Tell', while William Tell is fine. I have not tested the full alphabet, but I do know that names starting with any of B, C, D, H, I, J, S, T, W are all extracted correctly.
I have had to resort to the ugly:
namebits = line.split('>',1)
name = namebits[-1].split('<')[0]
This seems to work for all names.
Is this a known problem with s.lstrip or am I doing something wrong?

Use an XML parser for XML. Every other approach is broken.
Luckily an XML parser is built into Python and using it is easy. It's most probably easier than your current code.
import xml.etree.ElementTree as ET
tree = ET.parse('your_file.xml')
player_name = tree.find('.//PLAYER_NAME')
print(player_name.text)
Read file, search element, get text. No awkward string manipulation required. Assuming this XML file:
<PLAYER>
<PLAYER_NAME>Andrew Tell</PLAYER_NAME>
</PLAYER>
the output is unsurprising:
Andrew Tell
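On the lstrip question itself: this is not a bug. str.lstrip('<PLAYER_NAME>') treats its argument as a set of characters to remove, not as a prefix, so a name beginning with any character that appears in '<PLAYER_NAME>' loses that character too. And since the original goal was to extract all the names in the file, here is a slight variation on the snippet above using findall (file name assumed):
import xml.etree.ElementTree as ET

tree = ET.parse('your_file.xml')
# findall returns every matching element, not just the first one
names = [el.text for el in tree.findall('.//PLAYER_NAME')]
print(names)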

Related

Cutting Strings Python

I have a simple UI created with PyQt5.
It loads a file, lets you choose an output folder and creates a new txt file with additional information.
The string of the loaded file is written to
self.inputs.filename.text()
It looks like "C:/User/Folder/File.txt"
Later in the application I write into a new file in a specific location.
new_txt = open(self.inputs.foldername.text() + "/optimized.txt", "w")
I want to add the "optimized.txt" string to the original filename. But if I use self.inputs.filename.text() it gives back the whole path and creates an error. I tried it with .removesuffix() but since the path is always variable I can't find a solution to just keep the characters after the last "/".
Please don't lynch me, I'm quite new to Python.
You can use the split function of a string to get the element after the last '/', like this:
str_path = self.inputs.filename.text()
split_text = str_path.split('/')  # this gives you a list of str elements split on the character '/'
last_element = split_text[-1]
Now you can use the last element, which should contain 'File.txt'. You can split it again on "." to get only the name of the file without the .txt extension.
Hope I answered your question.
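A couple of standard-library alternatives do the same thing in one step; a short sketch using the example path from the question:
import os
from pathlib import Path

str_path = "C:/User/Folder/File.txt"

# os.path: basename gives the part after the last separator, splitext drops the extension
file_name = os.path.basename(str_path)      # 'File.txt'
name_only = os.path.splitext(file_name)[0]  # 'File'

# pathlib: .name and .stem give the same results
p = Path(str_path)
print(p.name, p.stem)                       # File.txt File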

Julia: Using ProtoBuf to read messages from gzipped file

A sensor provides a stream of frames containing object coordinates, which are stored in ProtoBuf format in a gzipped file. I would like to read this file in Julia.
Using protoc, I have generated the Protobuf files for both Python and Julia, coordinate_push.py and coordinate_push.jl
My Python code is as follows:
import gzip
from google.protobuf.internal.decoder import _DecodeVarint32
from src.proto import coordinate_push

frameList = []
with gzip.open(filePath) as f:
    data = f.read()

next_pos, pos = 0, 0
while pos < len(data):
    msg = coordinate_push.CoordinatesFrame()
    next_pos, pos = _DecodeVarint32(data, pos)
    msg.ParseFromString(data[pos:pos + next_pos])
    frameList.append(msg)
    pos += next_pos
I'd like to rewrite the above in Julia, and don't know where to start. Part of the problem is that I haven't fully understood the Python script (IO is not my strong point).
I understand that I need:
to open the gzip file, presumably using using GZip; file = GZip.open(file_path, "r")
to read in the data, along the lines of using ProtoBuf; data = readproto(iob, CoordinatesFrame())
What I don't understand is:
how to define iob, and especially how to link it to file (in the Julia Protobuf manual, we had iob = PipeBuffer(), but here it's a gzip-file that we'd like to read)
how to replicate the while-loop in Julia, and in particular the mysterious _DecodeVarint32 (I'm on Windows, if it's related to that.)
whether the file coordinate_push.jl has to be in the same directory as my main file, and if not, how I can properly import it (it is currently in a proto subfolder, and in Python I'd import it using from src.proto import coordinate_push)
Insight on any of the three points would be highly appreciated.
You should open an issue on the Gzip GitHub repo and ask this first part of your question there (I am not a Gzip expert unfortunately).
On the second point, I suggest looking here: https://github.com/JuliaIO/FileIO.jl/blob/master/README.md for lots of examples of FileIO loops, which seem to be exactly what you need to replicate that Python loop. For the second part of this question, your best bet for that function is to try to hunt down its definition on GitHub or in the docs somewhere.
For the third question, coordinate_push.jl does not need to be in the same folder as your "main file" (I am not sure what you mean by this, so perhaps it would help to add context about the structure of your files). To import that file, all you need to do is add include("path/to/coordinate_push.jl") at the top of the file you want to call/run the code from. It's worth noting that the path can be either the absolute path or the relative project path (in some cases).
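Regarding the mysterious _DecodeVarint32: it is an internal helper from the Python protobuf package that reads the varint length prefix written in front of each message in a length-delimited stream (it is unrelated to Windows). A minimal, hedged Python sketch of what it does; each byte carries 7 payload bits and the high bit signals that more bytes follow:
def decode_varint32(buf, pos):
    # Returns (value, new_pos): reads a protobuf varint starting at buf[pos].
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift  # low 7 bits are payload
        if not (b & 0x80):             # high bit clear -> last byte
            return result, pos
        shift += 7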

Skip processing fenced code blocks when processing Markdown files line by line

I'm a very inexperienced Python coder so it's quite possible that I'm approaching this particular problem in completely the wrong way but I'd appreciate any suggestions/help.
I have a Python script that goes through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I'm doing this using two regexes in one function as shown below:
import logging
import re

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.
    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\n' strings.
    """
    file = file_obj
    linelist = []
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
        linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                 r"[\2](\2 \5.md)", line) for line in linelist]
        # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
        # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final
This works fine for most Markdown files. However, I can occasionally get a Markdown file that has [[wikilinks]] within fenced code blocks such as the following:
# Reference
Here is a reference to “the Reactome Project” using smart quotes.
Here is an image: ![](./images/Screenshot.png)
[[201802150808]](Product discovery)
```
[[201802150808 Product Prioritization]]
def foo():
print("bar")
```
In the above case I should skip processing the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regex that identifies the fenced code block correctly, namely:
(?<=```)(.*?)(?=```)
However, since the existing function is running line by line, I have not been able to figure out a way to skip the entire section in the for loop. How do I go about doing this?
You need to use a full Markdown parser to be able to cover all of the edge cases. Of course, most Markdown parsers convert Markdown directly to HTML. However, a few will use a two step process where step one converts the raw text to an Abstract Syntax Tree (AST) and step two renders the AST to the output format. It is not uncommon to find a Markdown renderer (outputs Markdown) which can replace the default HTML renderer.
You would simply need to modify either the parser step (using a plugin to add support for the wikilink syntax) or the AST directly. Then pass the AST to a Markdown renderer, which will give you a nicely formatted and normalized Markdown document. If you are looking for a Python solution, mistune or Pandoc filters might be a good place to start.
But why go through all that when a few well crafted regular expressions can be run on the source text? Because Markdown parsing is complicated. I know, it seems easy at first. After all Markdown is easy to read for a human (which was one of its defining design goals). However, parsing is actually very complicated with parts of the parser reliant on previous steps.
For example, in addition to fenced code blocks, what about indented code blocks? But you can't just check for indentation at the beginning of a line, because a single line of a nested list could look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is broken across two lines? Generally when parsing inline markup, Markdown parsers will treat a single line break no different than a space. The point of all of this is that before you can start parsing inline elements, the entire document needs to first be parsed into its various block-level elements. Only then can you step through those and parse inline elements like links.
I'm sure there are other edge cases I haven't thought of. The only way to cover them all is to use a full-fledged Markdown parser.
I was able to create a reasonably complete solution to this problem by making a few changes to my original function, namely:
Replace the built-in Python re module with the regex module available on PyPI.
Change the function to read the entire file into a single variable instead of reading it line by line.
The revised function is as follows:
import logging
import regex

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.
    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """
    file = file_obj
    try:
        with open(file, encoding="utf8") as infile:
            line = infile.read()
            # Read the entire file as a single string
        linelist = regex.sub(r"(?V1)"
                             r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                             # Ignore fenced & inline code blocks. V1 engine allows in-line flags so
                             # we enable newline matching only here.
                             r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                             # Ignore code blocks beginning with 4 spaces/1 tab
                             r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
        # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar) or
        # [[foo]] (bar). Capture group $3 returns just foo
        linelist_final = regex.sub(r"(?V1)"
                                   r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                   r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                   # Refer comments above for this portion.
                                   r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
        # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123 and capture
        # group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final
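A minimal usage sketch (hypothetical file name; the function returns the rewritten text as one string, so it can simply be written back out):
new_text = modify_links("notes/example.md")
with open("notes/example.md", "w", encoding="utf8") as outfile:
    outfile.write(new_text)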
The above function handles [[wikilinks]] in inline code blocks, fenced code blocks and code blocks indented with 4 spaces. There is currently one false negative scenario where it ignores a valid [[wikilink]]: when the link appears at the 3rd level or deeper of a Markdown list, i.e.:
* Level 1
    * Level 2
        * [[wikilink]] #Not recognized
        * [[wikilink]] #Not recognized
However my documents do not have wikilinks nested at that level in lists so it's not a problem for me.
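For completeness, the line-by-line skipping that the original question asked about can also be approximated by toggling a flag whenever a ``` fence line is seen and rewriting only the lines outside a fence. This is not from the answers above, just a minimal sketch with a deliberately simplified wikilink pattern, and it shares the nested-list and indented-block limitations already discussed:
import re

WIKILINK = re.compile(r"\[\[(.*?)\]\](?!\()")  # standalone [[foo]] only

def rewrite_outside_fences(lines):
    # Toggle fence state on every ``` marker; rewrite only outside fences.
    in_fence = False
    out = []
    for line in lines:
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            out.append(line)
        else:
            out.append(WIKILINK.sub(r"[\1](\1.md)", line))
    return out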

os.rename is adding an extra character when renaming to something with a utf character

This is the first time I've asked a question on stackoverflow so let me know if I'm doing something wrong.
I'm trying to rename a file with the os library. I want the file name to include some non-ascii characters that are in a string that I've generated. Here's the code:
import os

for subdir, dirs, files in os.walk(startDir):
    for file in files:
        # some code to generate the newFileName string
        os.rename(os.path.join(subdir, file), os.path.join(subdir, newFileName))
Here's an example of what the newFileName string would be: "te©st©.txt"
However when the file saves, it adds in an extra character: "teÂ©stÂ©.txt"
From other reading I've done it sounds like utf-8 actually maps certain codes to two characters, or something like that, and that's where the Â is coming from. If I print the string right before calling os.rename, it prints to the terminal the way that I would expect it to. So I'm guessing it must be something with the way that os.rename is interacting with the filesystem.
I am using Windows.
Perhaps you can try using unicode all the way?
path = u'99 bottles of \N{greek small letter beta}eer on the wall.txt'
f = open(path, 'w')
f.write('Hello, World!\n')
f.close()
import glob
print(glob.glob(path)) # ['99 bottles of βeer on the wall.txt']
import os
print(os.path.getsize(path)) # 15
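For what it's worth, the extra 'Â' is classic mojibake: the UTF-8 encoding of '©' is the two bytes 0xC2 0xA9, and if those bytes get re-decoded somewhere as Latin-1/cp1252, the 0xC2 byte shows up as 'Â'. A two-line illustration of the symptom (not a claim about where in os.rename it happens):
name = "te©st©.txt"
mangled = name.encode("utf-8").decode("latin-1")
print(mangled)  # teÂ©stÂ©.txt -- an extra Â appears before each ©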

Why won't this Python script replace one variable with another variable?

I have a CSV file with two columns in it, the one on the left being an old string, and the one directly to the right being the new one. I have a heap of .xml files that contain the old strings, which I need to replace/update with the new ones.
The script is supposed to open each .xml file one at a time and replace all of the old strings from the CSV file with the new ones. I have tried to use a replace function to replace instances of the old string, called 'column[0]', with the new string, called 'column[1]'. However, I must be missing something, as this seems to do nothing. If I set the first variable in the replace function to an actual string with quotation marks, the replace function works. However, if both terms in the replace function are variables, it doesn't.
Does anyone know what I am doing wrong?
import os
import csv

with open('csv.csv') as csv:
    lines = csv.readline()
    column = lines.split(',')

fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    x = x.replace(column[0], column[1])
    print(x)
Example of CSV file:
oldstring1,newstring1
oldstring2,newstring2
Example of .xml file:
Word words words oldstring1 words words words oldstring2
What I want in the new .xml files:
Word words words newstring1 words words words newstring2
The problem here is that you are treating the CSV file as a normal text file and not looping over all the lines in it.
You need to read the file using the csv reader.
The following code will work for your task:
import os
import csv

with open('csv.csv') as csvfile:
    reader = csv.reader(csvfile)
    rows = list(reader)  # materialize the rows so they can be reused for every file

fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    for row in rows:
        x = x.replace(row[0], row[1])
    print(x)
It looks like this is better done using sed. However.
If we want to use Python, it seems to me that what you want to do is best achieved by:
reading all the obsolete/replacement pairs and storing them in a list of lists,
looping over the .xml files, as specified on the command line, using the handy fileinput module, asking it to operate in place and to keep the backup files around,
applying all the replacements to every line of each of the .xml files,
putting the modified line back in the original file (using simply a print, thanks to fileinput's magic; end='' because we don't strip each line, so it keeps its own trailing newline and whitespace).
import fileinput
import sys

old_new = [line.strip().split(',') for line in open('csv.csv')]

for line in fileinput.input(sys.argv[1:], inplace=True, backup='.bak'):
    for old, new in old_new:
        line = line.replace(old, new)
    print(line, end='')
If you save the code in replace.py, you will execute it like this
$ python3 replace.py *.xml subdir/*.xml another_one/a_single.xml
