Can you remove a random string of commas and replace with one for exporting to CSV - python-3.x

I am using Netmiko to extract some data from Cisco switches and routers, and I would like to put that data into a spreadsheet. For example, show cdp neighbour gives me a string with random whitespace in it:
Port      Name  Status     Vlan    Duplex  Speed  Type
Et0/0           connected  1       auto    auto   unknown
Et0/1           connected  1       auto    auto   unknown
Et0/2           connected  routed  auto    auto   unknown
Et0/3           connected  1       auto    auto   unknown
I thought I could remove the whitespace and replace it with ',' but I get this:
Port,,,,,,Name,,,,,,,,,,,,,,,Status,,,,,,,Vlan,,,,,,,Duplex,,Speed,Type
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Is there any way of extracting data like the above, ideally going straight into a structured table in Excel (cells and rows)? Or, failing that, is there a way to do what I did and then replace each run of repeated commas with just one, so I can export to CSV and then import into Excel? I may be the most long-winded person you have ever seen, because I am so new to programming :)

I'd go with regex matches, which are more flexible. You can adapt this to your needs. I put the data in a file for testing, but you could process your string one line at a time instead.
Here's the file (called mydata.txt)
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Here's how to read it and write the result to a CSV file (mydata.csv):
import re
_re = re.compile(r'[^,]+')          # match any run of non-comma characters
newfile = open(r'mydata.csv', 'w')  # open the output file for writing
with open(r'mydata.txt') as data:   # open the input file
    for line in data:               # read the input one line at a time
        # keep only the non-comma fields and re-join them with single commas
        newfile.write(','.join(_re.findall(line.strip())) + '\n')
newfile.close()                     # close the output file
And here is the output
Et0/0,connected,1,auto,auto,unknown
Et0/1,connected,1,auto,auto,unknown
Et0/2,connected,routed,auto,auto,unknown
Et0/3,connected,1,auto,auto,unknown
Explanation:
The re library allows the use of regular expressions for parsing text, so the first line imports it.
The second line compiles the regular expression that extracts anything that is not a comma; it is only a specification, and doesn't actually do the extraction.
The third line opens the output file, with 'w' specifying that we can write to it; this file is referenced by the name 'newfile'. The next line opens the input file.
The for loop reads the input file one line at a time.
The write line is an all-at-once operation: it separates out the non-comma parts of the line, joins them back together separated by single commas, and writes the resulting string (plus a newline) to the output file.
The last line closes the output file.
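Incidentally, since the original switch output is separated by runs of spaces rather than commas, you could also skip the comma-replacement step entirely: str.split() with no argument already collapses any run of whitespace, and the csv module takes care of the commas and quoting for you. A minimal sketch, assuming the raw command output was saved to a file called show_output.txt (my name, not yours):
import csv

with open('show_output.txt') as raw, open('mydata.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for line in raw:
        fields = line.split()   # split() with no argument collapses runs of whitespace
        if fields:              # skip any blank lines
            writer.writerow(fields)
The resulting mydata.csv opens directly in Excel with one field per cell.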

I hope I didn't misunderstand you. To turn those repeating commas into one single comma, just run this code with your string s:
while ",," in s:
    s = s.replace(",,", ",")
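If you'd rather avoid the loop, a single substitution with the standard re module collapses every run of commas in one pass (an equivalent sketch):
import re

s = re.sub(r',{2,}', ',', s)  # replace each run of two or more commas with a single one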

Related

Python efficient way to search for a pattern in text file

I need to find a pattern in a text file, which isn't big.
Therefore loading the entire file into RAM isn't a concern for me - as advised here:
I tried to do it in two ways:
with open(inputFile, 'r') as file:
    for line in file.readlines():
        for date in dateList:
            if re.search('{} \d* 1'.format(date), line):
OR
with open(inputFile, 'r') as file:
    contents = file.read()
    for date in dateList:
        if re.search('{} \d* 1'.format(date), contents):
The second one proved to be much faster.
Is there an explanation for this, other than the fact that I am using one less loop with the second approach?
As pointed out in the comments, the two snippets are not equivalent: the second one only looks for the first match in the whole file. Besides this, the first one is also more expensive, because the (relatively costly) format call over the dates is executed again for every line. Storing the regexps and precompiling them should help a lot. Even better: you can generate a single regexp that matches all the dates at once, using something like:
regexp = '({}) \d* 1'.format('|'.join('{}'.format(date) for date in dateList))
with open(inputFile, 'r') as file:
    contents = file.read()
    # Search for the first matching date from dateList
    if re.search(regexp, contents):
Note that you can use findall if you want all of them.
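For illustration, here is a sketch of that combined approach with a precompiled pattern; dateList and inputFile are assumed to be defined as in the question, and re.escape is added in case the dates contain regex metacharacters such as '.':
import re

# One alternation matching any of the dates, compiled once up front.
pattern = re.compile(r'({}) \d* 1'.format('|'.join(map(re.escape, dateList))))

with open(inputFile, 'r') as file:
    contents = file.read()

if pattern.search(contents):         # stop at the first match
    print("found a date")
matches = pattern.findall(contents)  # or collect every matching date in one pass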

Skip processing fenced code blocks when processing Markdown files line by line

I'm a very inexperienced Python coder so it's quite possible that I'm approaching this particular problem in completely the wrong way but I'd appreciate any suggestions/help.
I have a Python script that goes through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I'm doing this using two regexes in one function as shown below:
import logging
import re

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\n' strings.
    """
    file = file_obj
    linelist = []
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
        linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                 r"[\2](\2 \5.md)", line) for line in linelist]
        # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
        # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final
This works fine for most Markdown files. However, I can occasionally get a Markdown file that has [[wikilinks]] within fenced code blocks such as the following:
# Reference
Here is a reference to “the Reactome Project” using smart quotes.
Here is an image: ![](./images/Screenshot.png)
[[201802150808]](Product discovery)
```
[[201802150808 Product Prioritization]]
def foo():
    print("bar")
```
In the above case I should skip processing the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regex that identifies the fenced code block correctly namely:
(?<=```)(.*?)(?=```)
However, since the existing function is running line by line, I have not been able to figure out a way to skip the entire section in the for loop. How do I go about doing this?
You need to use a full Markdown parser to be able to cover all of the edge cases. Of course, most Markdown parsers convert Markdown directly to HTML. However, a few will use a two step process where step one converts the raw text to an Abstract Syntax Tree (AST) and step two renders the AST to the output format. It is not uncommon to find a Markdown renderer (outputs Markdown) which can replace the default HTML renderer.
You would simply need to modify either the parser step (using a plugin to add support for the wikilink syntax) or modify the AST directly. Then pass the AST to a Markdown renderer, which will give you a nicely formatted and normalized Markdown document. If you are looking for a Python solution, mistune or Pandoc filters might be a good place to start.
But why go through all that when a few well-crafted regular expressions can be run on the source text? Because Markdown parsing is complicated. I know, it seems easy at first; after all, Markdown is easy for a human to read (which was one of its defining design goals). However, parsing is actually very complicated, with parts of the parser reliant on previous steps.
For example, in addition to fenced code blocks, what about indented code blocks? You can't just check for indentation at the beginning of a line, because a single line of a nested list could look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is broken across two lines? When parsing inline markup, Markdown parsers will generally treat a single line break no differently than a space. The point of all of this is that before you can start parsing inline elements, the entire document first needs to be parsed into its various block-level elements. Only then can you step through those and parse inline elements like links.
I'm sure there are other edge cases I haven't thought of. The only way to cover them all is to use a full-fledged Markdown parser.
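For example, here is a rough sketch of that idea using the markdown-it-py parser (my choice for illustration; any tokenizing parser would do). It keeps the line-by-line flow from the question, but lets the parser decide which source lines belong to fenced or indented code blocks before the wikilink regex is applied; inline code spans would still need extra handling:
import re
from markdown_it import MarkdownIt

def rewrite_wikilinks(text):
    # Collect the source-line ranges of fenced and indented code blocks.
    code_lines = set()
    for token in MarkdownIt().parse(text):
        if token.type in ('fence', 'code_block') and token.map:
            code_lines.update(range(token.map[0], token.map[1]))
    # Rewrite [[foo]] -> [foo](foo.md) everywhere except inside code blocks.
    out = []
    for i, line in enumerate(text.splitlines(keepends=True)):
        if i not in code_lines:
            line = re.sub(r'\[\[(.*?)\]\](?!\()', r'[\1](\1.md)', line)
        out.append(line)
    return ''.join(out)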
I was able to create a reasonably complete solution to this problem by making a few changes to my original function, namely:
Replace the Python re built-in with the regex module available on PyPI.
Change the function to read the entire file into a single variable instead of reading it line by line.
The revised function is as follows:
import logging
import regex

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """
    file = file_obj
    try:
        with open(file, encoding="utf8") as infile:
            # Read the entire file as a single string
            line = infile.read()
            linelist = regex.sub(r"(?V1)"
                                 r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                 # Ignore fenced & inline code blocks. The V1 engine allows in-line flags, so
                                 # we enable newline matching only here.
                                 r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                 # Ignore code blocks beginning with 4 spaces or 1 tab
                                 r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
            # Finds references that are in style [[foo]] only, by excluding links in style [[foo]](bar) or
            # [[foo]] (bar). Capture group $3 returns just foo
            linelist_final = regex.sub(r"(?V1)"
                                       r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                       r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                       # See the comments above for this portion.
                                       r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
            # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123 and
            # capture group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final
The above function handles [[wikilinks]] in inline code, fenced code blocks and code blocks indented with 4 spaces. There is currently one false-negative scenario, in which it ignores a valid [[wikilink]]: when the link appears at the 3rd level or deeper of a Markdown list, i.e.:
* Level 1
    * Level 2
        * [[wikilink]] # Not recognized
            * [[wikilink]] # Not recognized
However my documents do not have wikilinks nested at that level in lists so it's not a problem for me.

Why won't this Python script replace one variable with another variable?

I have a CSV file with two columns in it, the one of the left being an old string, and the one directly to right being the new one. I have a heap of .xml files that contain the old strings, which I need to replace/update with the new ones.
The script is supposed to open each .xml file one at a time and replace all of the old strings from the CSV file with the new ones. I have tried to use a replace function to replace instances of the old string, called column[0], with the new string, called column[1]. However, I must be missing something, as this seems to do nothing. If I set the first variable in the replace function to an actual string with quotation marks, the replace function works; however, if both terms in the replace function are variables, it doesn't.
Does anyone know what I am doing wrong?
import os
import csv

with open('csv.csv') as csv:
    lines = csv.readline()
    column = lines.split(',')

fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    x = x.replace(column[0], column[1])
    print(x)
Example of CSV file:
oldstring1,newstring1
oldstring2,newstring2
Example of .xml file:
Word words words oldstring1 words words words oldstring2
What I want in the new .xml files:
Word words words newstring1 words words words newstring2
The problem here is that you are treating the CSV file as a normal text file and not looping over all of its lines.
You need to read the file using a csv.reader.
The following code will work for your task (with one adjustment: the rows are read into a list first, because a csv.reader can only be iterated once):
import os
import csv

with open('csv.csv') as csvfile:
    # Read all the old/new pairs up front; a csv.reader is exhausted
    # after one pass, so store the rows in a list for reuse.
    rows = list(csv.reader(csvfile))

fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    for row in rows:
        x = x.replace(row[0], row[1])
    print(x)
It looks like this is better done using sed. However, if we want to use Python, it seems to me that what you want to do is best achieved by:
reading all the obsolete/replacement pairs and storing them in a list of lists,
looping over the .xml files, as specified on the command line, using the handy fileinput module, telling it that we want to operate in place and keep backup files around,
operating all the replacements on every line of each .xml,
and putting the modified line back into the original file (using simply a print, thanks to fileinput's magic; end='' because each line keeps its newline and we don't want to add another).
import fileinput
import sys

# Build the list of (old, new) pairs from the CSV file.
old_new = [line.strip().split(',') for line in open('csv.csv')]

# Rewrite every file named on the command line in place, keeping a .bak copy.
for line in fileinput.input(sys.argv[1:], inplace=True, backup='.bak'):
    for old, new in old_new:
        line = line.replace(old, new)
    print(line, end='')
If you save the code in replace.py, you will execute it like this
$ python3 replace.py *.xml subdir/*.xml another_one/a_single.xml

re-organize data stored in a csv

I have successfully downloaded my data from a given url and for storing it into a csv file I used the following code:
fx = open(destination_url, "w") #write data into a file
for line in lines: #loop through the string
fx.write(line + "\n")
fx.close() # close the file object
return
What happened is that the data is stored, but not on separate lines: even though I use '\n', the data does not get split into different lines. In a screenshot of the file, every separate line of data that I wanted appears in the same cell of the CSV file, separated by '\r' characters (highlighted in yellow).
I know I am missing something here, but can I get some pointers on rearranging each chunk that ends with a '\r' onto a separate line?
I hope I have made myself clear.
Thanks
~V
There is a method called writelines:
https://www.tutorialspoint.com/python/file_writelines.htm
Some examples are in the given link; try that first, as it really should work. If the above method does not help, we need to see the format of the data (what is inside each element), so print it out during each iteration.
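If the stray '\r' characters are the problem, the usual fix is to strip whatever line ending each element already carries before adding your own. A sketch based on the snippet in the question (lines and destination_url as defined there):
with open(destination_url, "w", newline="") as fx:   # newline="" prevents extra translation
    for line in lines:
        fx.write(line.rstrip("\r\n") + "\n")         # drop any '\r'/'\n', then add a clean newline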

Checking/Writing lines to a .txt file using Python

I'm new both to this site and python, so go easy on me. Using Python 3.3
I'm making a hangman-esque game, and all is working bar one aspect: I want to check whether a string is in a .txt file and, if not, write it on a new line at the end of the file. Currently I can write to the text file on a new line, but if the string already exists, it still gets written to the file. My code is below:
Note that my text file has each string on a separate line.
write = 1
if over == 1:
    print("I Win")
    wordlibrary = file('allwords.txt')
    for line in wordlibrary:
        if trial in line:
            write = 0
        if write == 1:
            with open("allwords.txt", "a") as text_file:
                text_file.write("\n")
                text_file.write(trial)
Is this really the indentation from your program?
As written above, in the first iteration of the loop over wordlibrary, trial is compared to the first line, and since (from your symptoms) it is not contained in that line, the program moves on to the next part of the loop: since write==1, it appends trial to the text_file.
cheers,
Amnon
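In other words, the decision to append has to wait until the whole file has been checked. A sketch of the corrected logic (using open, since the file built-in no longer exists in Python 3):
if over == 1:
    print("I Win")
    with open('allwords.txt') as wordlibrary:
        # Only decide after every line has been checked.
        write = not any(trial in line for line in wordlibrary)
    if write:
        with open("allwords.txt", "a") as text_file:
            text_file.write("\n" + trial)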
You don't need to know the number of lines present in the file beforehand. Just use a file iterator. You can find the documentation here: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
Pay special attention to the readlines method.
