Python efficient way to search for a pattern in text file - python-3.x

I need to find a pattern in a text file, which isn't big.
Therefore loading the entire file into RAM isn't a concern for me - as advised here.
I tried to do it in two ways:
with open(inputFile, 'r') as file:
    for line in file.readlines():
        for date in dateList:
            if re.search('{} \d* 1'.format(date), line):
OR
with open(inputFile, 'r') as file:
    contents = file.read()
    for date in dateList:
        if re.search('{} \d* 1'.format(date), contents):
The second one proved to be much faster.
Is there an explanation for this, other than the fact that I am using one less loop with the second approach?

As pointed out in the comments, the two snippets are not equivalent, as the second one only looks for the first match in the whole file. Besides this, the first is also more expensive because the (relatively expensive) format call over all the dates is repeated for every line. Storing the regexps and precompiling them should help a lot. Even better: you can build a single regexp that matches all the dates at once, using something like:
regexp = r'({}) \d* 1'.format('|'.join(dateList))
with open(inputFile, 'r') as file:
    contents = file.read()
    # Search the first matching date existing in dateList
    if re.search(regexp, contents):
Note that you can use findall if you want all of them.
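For illustration, a minimal self-contained sketch of the combined, precompiled pattern; the dates and the sample text below are placeholders of my own, not the asker's data:

```python
import re

# Placeholder data standing in for the asker's dateList and file contents
dateList = ['2021-01-01', '2021-02-01']
contents = 'header\n2021-02-01 42 1\nother line\n'

# Build one alternation-based pattern and compile it once
pattern = re.compile(r'({}) \d* 1'.format('|'.join(map(re.escape, dateList))))

first = pattern.search(contents)   # first matching date, or None
every = pattern.findall(contents)  # all matching dates
```

Compiling once up front means the pattern is parsed a single time instead of once per date per call, which is where the original first version lost most of its time.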

Related

How do I add a column to an already existing text file in Python?

How do I add a column to an already existing text file in Python? First I just want to add the header, then through my investigation add values in the column.
Here is a minimal solution that does what you asked for, and gives you an overview of some useful Python features. Importantly, what you want to do can probably be done very easily with pandas, but I don't use it, so the solution is plain Python. It could also be done quite differently, and more efficiently, with numpy.
This is it,
input_file = 'test.txt'
output_file = 'mod_test.txt'

header = ''
matrix = []

# Parse and store all existing lines
with open(input_file, 'r') as fin:
    # The header - remove the newline character only
    header = next(fin).strip()
    # The data - parse and store each line as a list
    # of strings, store as a sublist of matrix
    for line in fin:
        matrix.append(line.strip().split())

# Update the header - now a newline is needed
header += ' CLASS\n'

# Now perform your calculations for each row
# and add new column - adding 0.0 as in the comment
for row in matrix:
    # Calculations would go here
    # This zero is already a string, normally a
    # conversion would be needed
    row.append('0.0')

# Write it to the new file
with open(output_file, 'w') as fout:
    # First the updated header
    fout.write(header)
    for row in matrix:
        # Turn the entries into a single string
        fout.write(' '.join(row) + '\n')
And this is a simple demonstration file that can go with it, test.txt,
C1 C2 C3
1.0 2.0 2.1
2.0 3.0 3.2
3.0 4.0 3.5
The comments highlight the most important details, but you should research each technique on your own; they are very useful when working with files.
Basically, you first load the original matrix, then modify it - add the last column - then save it into a different file. To save it to the same file you can just adjust output_file, or use input_file in both cases. Writing to a different file, especially during development and debugging, sounds like a better idea.
The new calculations and the writing can all be done in the same place, and if it is only adding a column of 0.0s, that is probably a better way - right now the code is unnecessarily stretched. However, if you want to actually perform some more involved calculations, I recommend keeping this structure (and putting any longer calculations in separate functions).
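If adding the constant column really is all that's needed, the two-step version above can be collapsed into a single pass. A sketch, where the function name and the line-based interface are my own choice to keep it file-agnostic:

```python
def add_column(lines, new_header, new_value):
    """Append a header word to the first line and a constant
    value to every following data line."""
    it = iter(lines)
    # First line is the header
    out = [next(it).strip() + ' ' + new_header]
    # Every remaining line is a data row
    for line in it:
        out.append(line.strip() + ' ' + new_value)
    return out

# With the demo file from above, this would be called as:
# add_column(open('test.txt'), 'CLASS', '0.0')
```

It produces the same rows the two-step version writes, without storing the parsed matrix.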

Can you remove a random string of commas and replace with one for exporting to CSV

I am using Netmiko to extract some data from Cisco switches and routers. I would like to put that data into a spreadsheet. For example, show cdp neighbour gives me a string with random whitespace in it:
Port Name Status Vlan Duplex Speed Type
Et0/0 connected 1 auto auto unknown
Et0/1 connected 1 auto auto unknown
Et0/2 connected routed auto auto unknown
Et0/3 connected 1 auto auto unknown
I thought I could remove the whitespace and replace it with , but I get this:
Port,,,,,,Name,,,,,,,,,,,,,,,Status,,,,,,,Vlan,,,,,,,Duplex,,Speed,Type
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Is there any way of extracting data like the above? Ideally it would go straight into a structured table in Excel (cells and rows); alternatively I could do what I did and then replace each run of repeated , with a single one, so I can export to CSV and then import into Excel. I may be the most long-winded person you have ever seen, because I am so new to programming :)
I'd go with regex matches which are more flexible. You can adapt this to your needs. I put the data in a list for testing, but you could process 1 line at a time instead.
Here's the file (called mydata.txt)
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Here's how to read it and write the result to a csv file (mydata.csv)
import re

_re = re.compile('[^,]+')

newfile = open(r'mydata.csv', 'w')
with open(r'mydata.txt') as data:
    for line in data.readlines():
        newfile.write(','.join(_re.findall(line)))
newfile.close()
And here is the output
Et0/0,connected,1,auto,auto,unknown
Et0/1,connected,1,auto,auto,unknown
Et0/2,connected,routed,auto,auto,unknown
Et0/3,connected,1,auto,auto,unknown
Explanation:
The re library allows the use of regular expressions for parsing text, so the first line imports it.
The second line specifies the regular expression to extract anything that is not a comma, but it is only a specification; it doesn't actually do the extraction.
The third line opens the output file, with 'w' specifying that we can write to it; this file is referenced by the name newfile. The next line opens the input file, referenced by the name data.
The for line reads each line from the input file, one at a time.
The write line is an all-at-once operation: separate the non-comma parts of the input, join them back together separated by commas, and write the resulting string to the output file.
The last line closes the output file.
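Note that, since the original switch output is whitespace-separated, plain str.split() with no argument gives the same fields without any regex. A sketch, using a sample line of my own making:

```python
# A sample line with irregular runs of spaces, as in the question
line = 'Et0/0      connected    1            auto   auto unknown'

# split() with no argument splits on any run of whitespace
fields = line.split()
csv_line = ','.join(fields)
```

This skips the intermediate comma-replacement step entirely.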
I hope I didn't misunderstand you. To turn the repeated commas into a single comma, just run this code on your string s:
while ",," in s:
    s = s.replace(",,", ",")
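The same collapsing can also be done in a single call with re.sub, replacing any run of commas with one; a small sketch with a made-up string:

```python
import re

# Made-up sample with runs of repeated commas
s = 'Et0/0,,,,,,connected,,,,1'

# ',+' matches one or more commas, replaced by a single comma
collapsed = re.sub(',+', ',', s)
```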

Why won't this Python script replace one variable with another variable?

I have a CSV file with two columns in it, the one of the left being an old string, and the one directly to right being the new one. I have a heap of .xml files that contain the old strings, which I need to replace/update with the new ones.
The script is supposed to open each .xml file one at a time and replace all of the old strings from the CSV file with the new ones. I have tried to use a replace function to replace instances of the old string, called column[0], with the new string, called column[1]. However, I must be missing something, as this seems to do nothing. If I set the first variable in the replace function to an actual string with quotation marks, the replace function works. However, if both terms in the replace function are variables, it doesn't.
Does anyone know what I am doing wrong?
import os
import csv

with open('csv.csv') as csv:
    lines = csv.readline()
    column = lines.split(',')
fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    x = x.replace(column[0], column[1])
    print(x)
Example of CSV file:
oldstring1,newstring1
oldstring2,newstring2
Example of .xml file:
Word words words oldstring1 words words words oldstring2
What I want in the new .xml files:
Word words words newstring1 words words words newstring2
The problem here is that you are treating the csv file as a normal text file and not looping over all the lines in it.
You need to read the file using a csv reader.
The following code will work for your task:
import os
import csv

with open('csv.csv') as csvfile:
    # Read all old/new pairs up front: a csv.reader can only be
    # iterated once, but the pairs are needed for every file
    rows = list(csv.reader(csvfile))
fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    for row in rows:
        x = x.replace(row[0], row[1])
    print(x)
It looks like this would be better done using sed. However.
If we want to use Python, it seems to me that what you want to do is best achieved like this:
read all the old/new replacement pairs and store them in a list of lists,
loop over the .xml files, as specified on the command line, using the handy fileinput module, specifying that we want to operate in place and that we want to keep the backup files around,
for every line in each of the .xml files, apply all the replacements,
put the modified line back in the original file (using simply a print, thanks to fileinput's magic; end='' because each line already ends with a newline, and we don't strip the line so that any whitespace is preserved).
import fileinput
import sys

old_new = [line.strip().split(',') for line in open('csv.csv')]
for line in fileinput.input(sys.argv[1:], inplace=True, backup='.bak'):
    for old, new in old_new:
        line = line.replace(old, new)
    print(line, end='')
If you save the code in replace.py, you will execute it like this
$ python3 replace.py *.xml subdir/*.xml another_one/a_single.xml

CSV reader with .txt file [duplicate]

I use Python and I don't know how to do this.
I want to read lots of lines in files, but I have to start reading from the second line. All the files have a different number of lines, so I don't know how to do it.
The code example below reads from the first line to the 16th line,
but I have to read each file from its second line to the end.
Thank you! :)
from itertools import islice

with open('filename') as fin:
    for line in islice(fin, 1, 16):
        print(line)
You should be able to call next and discard the first line:
with open('filename') as fin:
    next(fin)  # cast into oblivion
    for line in fin:
        ...  # do something
This is simple and easy because of the nature of fin: a file object is its own iterator.
with open("filename", "rb") as fin:
    print(fin.readlines()[1:])
Looking at the documentation for islice
itertools.islice(iterable, stop)
itertools.islice(iterable, start, stop[, step])
Make an iterator that returns selected elements from the iterable. If start is non-zero, then elements from the iterable are skipped until start is reached. Afterward, elements are returned consecutively unless step is set higher than one which results in items being skipped. If stop is None, then iteration continues until the iterator is exhausted, if at all; otherwise, it stops at the specified position. Unlike regular slicing, islice() does not support negative values for start, stop, or step. Can be used to extract related fields from data where the internal structure has been flattened (for example, a multi-line report may list a name field on every third line).
I think you can just tell it to start at the second line and iterate until the end, e.g.
from itertools import islice

with open('filename') as fin:
    for line in islice(fin, 1, None):  # <--- start 1 skips the first line; change 16 to None
        print(line)

Checking/Writing lines to a .txt file using Python

I'm new both to this site and python, so go easy on me. Using Python 3.3
I'm making a hangman-esque game, and all is working bar one aspect. I want to check whether a string is in a .txt file, and if not, write it on a new line at the end of the .txt file. Currently, I can write to the text file on a new line, but if the string already exists, it still writes to the text file. My code is below:
Note that my text file has each string on a separate line.
write = 1
if over == 1:
    print("I Win")
    wordlibrary = file('allwords.txt')
    for line in wordlibrary:
        if trial in line:
            write = 0
        if write == 1:
            with open("allwords.txt", "a") as text_file:
                text_file.write("\n")
                text_file.write(trial)
Is this really the indentation from your program?
As written above, in the first iteration of the loop over wordlibrary,
trial is compared to the first line, and since (from your symptoms) it is not contained in that line, the program moves on to the next part of the loop body: since write == 1, it appends trial to the text file.
cheers,
Amnon
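A minimal corrected sketch of the idea (the function name is mine, and trial is assumed to hold the word to add): scan the whole file first, and only then decide whether to append.

```python
def add_if_missing(filename, trial):
    # Read all existing words first, one per line
    with open(filename) as f:
        words = [line.strip() for line in f]
    # Only after checking every line, append if the word is new
    if trial not in words:
        with open(filename, 'a') as f:
            f.write('\n' + trial)
```

Comparing against whole stripped lines also avoids the substring problem of `trial in line` (e.g. "cat" matching inside "catalog").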
You don't need to know the number of lines in the file beforehand. Just use a file iterator. You can find the documentation here: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
Pay special attention to the readlines method.
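For instance, a sketch using the file iterator directly (the function name is my own; the first line is consumed, then iteration continues from the second):

```python
def read_from_second_line(filename):
    """Return the file's lines after the first, using the file iterator."""
    with open(filename) as fin:
        fin.readline()                          # consume and discard line 1
        return [line.rstrip('\n') for line in fin]
```

Because the file object remembers its position, the for loop simply picks up where readline left off.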
