Python: remember line index when reading lines from text file - python-3.x

I'm extracting data from a text file between two strings in a loop, using Python 3.6. I have several pairs of strings and want to extract the data between each pair; see the code below:
for i in range(0,len(strings1)):
    with open('infile.txt','r') as infile, open('outfile.txt', 'w') as outfile:
        copy = False
        for line in infile:
            if line == strings1[i]:
                copy = True
            elif line == strings2[i]:
                copy = False
            elif copy:
                outfile.write(line)
            continue
To decrease the processing time of the loop, I would like to modify my code so that after it has extracted the data between two strings, say strings1[1] and strings2[1], it remembers the line index of strings2[1] and starts the next iteration of the loop at that line index. That way it doesn't have to read the whole file during each iteration. The string lists are built such that earlier strings never occur after the current string, so this change won't break the loop.
Does anyone know how to do this?
===========================================================================
EDIT:
I've got a file in a format such as:
the first line
bla bla bla
FIRST some string 1
10 10
15 20
5 2.5
SECOND some string 2
bla bla bla
bla bla bla
FIRST some string 3
10 10
15 20
5 2.5
SECOND some string 4
The file goes on like this for many lines.
I want to extract the data between 'FIRST some string 1' and 'SECOND some string 2', and plot this data. When that is done, I want to do the same for the data between 'FIRST some string 3' and 'SECOND some string 4' (thus also plot the data). All the 'FIRST some string ..' are stored in strings1 list and all the 'SECOND some string ..' are stored in strings2 list.
To decrease computational time, I would like to modify the code so that after the first iteration it knows it can start from the line with 'SECOND some string 2' rather than from 'the first line', and also so that it can stop the first iteration as soon as it has found 'SECOND some string 2'.
Does anyone know how to do this? Please let me know if something is unclear.

The key issue is that you're reopening your files inside the for loop, so each iteration reads the input from the beginning again (and, because outfile.txt is opened with 'w', truncates the output each time). That is very inefficient. Open the files once, outside the loop, or load the input into memory first and then loop through strings1.
There are some other issues, namely here:
copy = False
for line in infile:
    if line == strings1[i]:
        copy = True
    elif line == strings2[i]:
        copy = False
    elif copy:
        outfile.write(line)
    continue
The elif copy: branch cannot execute until a line equal to strings1[i] has been seen, because copy is only set to True at that point. From then on every line is written to outfile until a line equal to strings2[i] sets copy back to False. Also note that lines read from a file keep their trailing newline, so line == strings1[i] only matches if the entries in strings1 end with \n as well. If that is not what you're trying to achieve, the logic needs to change.
Without a full context it's hard to understand what exactly you're looking for.
But maybe what you want to do instead is simply this:
with open('infile.txt','r') as infile, open('outfile.txt', 'w') as outfile:
    for line in infile:
        if line.rstrip('\n') in strings1:
            outfile.write(line)
What this code is doing:
1.) Open both files.
2.) Iterate through the lines of infile.
3.) Check whether the current line, with its trailing newline stripped, is in the list strings1. This assumes the items in strings1 have no trailing newline characters. If each item in strings1 already ends with \n, don't rstrip the line.
4.) If the line occurs in strings1, write it to outfile.
This looks to be the gist of what you're attempting.
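If the goal is what the edited question describes (pull out each block between a FIRST line and its matching SECOND line without rereading the file), a single pass over one open file may be enough, because a file object keeps its position between loops. The following is only a minimal sketch under stated assumptions: extract_blocks is a hypothetical helper name, the marker strings are assumed to be stored without trailing newlines, and io.StringIO stands in for the real infile.txt.

```python
import io


def extract_blocks(lines, starts, ends):
    """Yield the lines between each (start, end) marker pair in one pass.

    Hypothetical helper: `lines` is any iterable of lines (an open file,
    for example); markers are compared without their trailing newline.
    """
    lines = iter(lines)  # one shared iterator, so each loop resumes where the last stopped
    for start, end in zip(starts, ends):
        # Skip ahead until the start marker is found.
        for line in lines:
            if line.rstrip('\n') == start:
                break
        else:
            return  # ran out of input before finding this start marker

        # Collect lines until the end marker (or the end of the input).
        block = []
        for line in lines:
            if line.rstrip('\n') == end:
                break
            block.append(line)
        yield block


# Stand-in for the file from the question.
sample = io.StringIO(
    "the first line\n"
    "bla bla bla\n"
    "FIRST some string 1\n"
    "10 10\n15 20\n5 2.5\n"
    "SECOND some string 2\n"
    "bla bla bla\n"
    "bla bla bla\n"
    "FIRST some string 3\n"
    "10 10\n15 20\n5 2.5\n"
    "SECOND some string 4\n"
)
strings1 = ["FIRST some string 1", "FIRST some string 3"]
strings2 = ["SECOND some string 2", "SECOND some string 4"]

blocks = list(extract_blocks(sample, strings1, strings2))
for block in blocks:
    print(block)
```

With a real file, the call would be wrapped in a single with open('infile.txt') as infile: block, and each yielded block could be plotted before the next one is read.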

Related

Editing data in a text file in python for a given condition

I have a text file with the following contents:
1 a 20
2 b 30
3 c 40
I need to check if the first character of a particular line is 2 and edit its final two characters to 12, and rewrite the data into the file. New file should look something like this:
1 a 20
2 b 12
3 c 40
Need help doing this in python 3.
Couldn't figure it out. Help.
To modify contents of a file with python you will need to open the file in read mode to extract the contents of the file. You can then make changes on the extracted contents. To make your changes permanent, you have to write the contents back to the file.
The whole process looks something like this:
from pathlib import Path
# Define path to your file
your_file = Path("your_file.txt")
# Read the data in your file
with your_file.open('r') as f:
    lines = f.readlines()
# Edit lines that start with a "2"
for i in range(len(lines)):
    if lines[i].startswith("2"):
        lines[i] = lines[i][:-3] + "12\n"
# Write data back to file
with your_file.open('w') as f:
    f.writelines(lines)
Note that in order to change the last two characters of a string, you actually need to change the two characters before the last. This is because of the newline character, which indicates that the line has ended and new characters should be put on the line below. The \n you see after 12 is the newline character. If you don't put this in your replacement string, what originally was the next string will be put directly behind your replacement.
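One caveat with the [:-3] slice: it assumes every line ends with a newline. If the last line of the file has none, the slice removes one character too many. A small sketch of a more forgiving variant follows; replace_last_two is a hypothetical helper name, not part of the answer above.

```python
def replace_last_two(line, new="12"):
    """Replace the last two visible characters, keeping the line ending as it was."""
    body = line.rstrip('\n')     # text without the trailing newline, if any
    ending = line[len(body):]    # '\n' or '' depending on the input
    return body[:-2] + new + ending


lines = ["1 a 20\n", "2 b 30\n", "3 c 40"]
fixed = [replace_last_two(l) if l.startswith("2") else l for l in lines]
print(fixed)
```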

file reading in python using different methods

# open file in read mode
f=open(text_file,'r')
# iterate over the file object
for line in f.read():
    print(line)
# close the file
f.close()
The content of the file is "Congratulations you have successfully opened the file"! When I try to run this code the output comes out in the following form:
c (newline) o (newline) n (newline) g.................
...... that is, each character is printed individually on a new line because I used read()! But with readline it gives the answer on a single line! Why is that?
f.read() returns one string with all characters (the full file content).
Iterating a string iterates it character wise.
Use
for line in f: # no read()
instead to iterate line wise.
f.read() returns the whole file as a single string. A for loop iterates over whatever it is given, and for a string that means its characters.
readline() should not print the whole line at once either: it reads the first line of the file as a string, and the loop would then print that line character by character, just like read(). Is it possible that you used readlines(), which returns the lines as a list?
One more thing: there is the with statement, which takes a "closable" object and auto-closes it at the end of the block. And you can iterate over a file object directly. So, your code can be improved like this:
with open(text_file, 'r') as f:
    for i in f:
        print(i)
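To see the difference side by side, here is a small sketch that uses io.StringIO as a stand-in for a real file (the two-line text is made up for illustration). It collects what a for loop would visit for each reading method.

```python
import io

text = "first line\nsecond line\n"

# read() returns one string, so a for loop visits single characters.
from_read = [item for item in io.StringIO(text).read()]

# readline() returns only the first line, again as one string.
from_readline = [item for item in io.StringIO(text).readline()]

# readlines() returns a list of lines, so a for loop visits whole lines.
from_readlines = [item for item in io.StringIO(text).readlines()]

# Iterating over the file object itself also yields whole lines.
from_file = [item for item in io.StringIO(text)]

print(from_read[:5])
print(from_readline[:5])
print(from_readlines)
print(from_file)
```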

Best way to fix inconsistent csv file in python

I have a csv file which is not consistent. It looks like this where some have a middle name and some do not. I don't know the best way to fix this. The middle name will always be in the second position if it exists. But if a middle name doesn't exist the last name is in the second position.
john,doe,52,florida
jane,mary,doe,55,texas
fred,johnson,23,maine
wally,mark,david,44,florida
Let's say that you have ① wrong.csv and want to produce ② fixed.csv.
You want to read a line from ①, fix it and write the fixed line to ②, this can be done like this
with open('wrong.csv') as infile, open('fixed.csv', 'w') as outfile:
    for line in infile:
        line = fix(line)
        outfile.write(line)
Now we want to define the fix function...
Each line has either 4 or 5 fields, separated by commas. So we split the line using the comma as a delimiter and return the unmodified line if the number of fields is 4; otherwise we join field 0 and field 1 (Python counts from zero...), reassemble the output line and return it to the caller.
def fix(line):
    items = line.split(',')  # items is a list of strings
    if len(items) == 4:  # no middle name, the line is OK as it stands
        return line
    # join first and middle name
    first_middle = ' '.join((items[0], items[1]))
    # we want to return a "fixed" line,
    # i.e., a string not a list of strings
    # we have to join the new name with the remaining info
    return ','.join([first_middle] + items[2:])
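The same idea can also be sketched with the standard csv module, which takes care of splitting and re-joining fields. This is an alternative to the split/join approach, not the answer's own method. fix_row is a hypothetical helper name, io.StringIO stands in for the real files, and the field counts are an assumption taken from the sample rows: five fields when a middle name is present, four when it is not.

```python
import csv
import io


def fix_row(row):
    """Merge first and middle name when the row has the extra field."""
    if len(row) == 5:  # assumption: five fields means a middle name is present
        return [row[0] + ' ' + row[1]] + row[2:]
    return row


# Stand-ins for wrong.csv and fixed.csv, using the rows from the question.
source = io.StringIO(
    "john,doe,52,florida\n"
    "jane,mary,doe,55,texas\n"
    "fred,johnson,23,maine\n"
    "wally,mark,david,44,florida\n"
)
target = io.StringIO()

writer = csv.writer(target, lineterminator='\n')
for row in csv.reader(source):
    writer.writerow(fix_row(row))

print(target.getvalue(), end='')
```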

str.format places last variable first in print

The purpose of this script is to parse a text file (sys.argv[1]), extract certain strings, and print them in columns. I start by printing the header. Then I open the file, and scan through it, line by line. I make sure that the line has a specific start or contains a specific string, then I use regex to extract the specific value.
The matching and extraction work fine.
My final print statement doesn't work properly.
import re
import sys
print("{}\t{}\t{}\t{}\t{}".format("#query", "target", "e-value",
                                  "identity(%)", "score"))
with open(sys.argv[1], 'r') as blastR:
    for line in blastR:
        if line.startswith("Query="):
            queryIDMatch = re.match('Query= (([^ ])+)', line)
            queryID = queryIDMatch.group(1)
            queryID.rstrip
        if line[0] == '>':
            targetMatch = re.match('> (([^ ])+)', line)
            target = targetMatch.group(1)
            target.rstrip
        if "Score = " in line:
            eValue = re.search(r'Expect = (([^ ])+)', line)
            trueEvalue = eValue.group(1)
            trueEvalue = trueEvalue[:-1]
            trueEvalue.rstrip()
            print('{0}\t{1}\t{2}'.format(queryID, target, trueEvalue), end='')
The problem occurs when I try to print the columns. When I print the first 2 columns, it works as expected (except that it's still printing new lines):
#query target e-value identity(%) score
YAL002W Paxin1_129011
YAL003W Paxin1_167503
YAL005C Paxin1_162475
YAL005C Paxin1_167442
The 3rd column is a number in scientific notation like 2e-34
But when I add the 3rd column, eValue, it breaks down:
#query target e-value identity(%) score
YAL002W Paxin1_129011
4e-43YAL003W Paxin1_167503
1e-55YAL005C Paxin1_162475
0.0YAL005C Paxin1_167442
0.0YAL005C Paxin1_73182
I have removed all newlines, as far as I know, using the rstrip() method.
At least three problems:
1) queryID.rstrip and target.rstrip are missing the call parentheses (), so the method is never actually called.
2) Something like trueEvalue.rstrip() doesn't mutate the string (strings are immutable); you would need
trueEvalue = trueEvalue.rstrip()
if you want to keep the change.
3) This might be a problem, but without seeing your data I can't be 100% sure. The r in rstrip stands for "right". If trueEvalue is 4e-43\n then it is true that trueEvalue.rstrip() would be free of newlines. But your values seem to be something like \n4e-43. If you simply use .strip(), newlines are removed from both sides.
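The three points can be checked in isolation. A small sketch, using a made-up value with a newline on both sides:

```python
value = "\n4e-43\n"

value.rstrip      # no parentheses: only refers to the method, nothing is called
value.rstrip()    # called, but the returned string is thrown away
unchanged = value

right_only = value.rstrip()  # keep the result: trailing newline removed
both_sides = value.strip()   # newlines removed from both ends

print(repr(unchanged), repr(right_only), repr(both_sides))
```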

IndexError: list index out of range, but list length OK

New to programming, looking for a deeper understanding of what's happening.
Goal: open a file and print the first 10 lines. (similar to head command)
Code:
with open('file') as f:
    for i in range(0,10):
        print([line.strip('\n') for line in f][i])
Result: prints first line fine, then returns the out of range error
File: Is a simple text file with 20 lines, no more than 50 chars per line
FYI - Removed range line and printed both type(list) and length(20). Printed specific indexes without issue (unless >1 in a row)
Able to get the desired result with different code, but trying to improve using with/as
You can actually iterate over a file. Which is what you should be doing here.
with open('file') as f:
    for i, line in enumerate(f):
        # Get out of the loop once 10 lines have been printed
        if i >= 10:
            break
        # Line already has a '\n' at the end
        print(line, end='')
The reason that your code is failing is because of your list comprehension:
[line.strip('\n') for line in f]
The first time through your loop, that consumes all of the lines in your file. Now your file has no more lines, so the next time through it creates an empty list and tries to get the [1]st element. But that doesn't exist, because the file is already at its end and there are no lines left to read.
If you wanted to keep your code mostly as-is you could do
lines = [line.rstrip('\n') for line in f]
for i in range(10):
    print(lines[i])
But that's also silly, because you could just do
lines = f.readlines()
But that's also silly if you just want up to the 10th line, because you could do this (the lines already end in '\n', so join on the empty string):
with open('file') as f:
    print(''.join(f.readlines()[:10]), end='')
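If the file is large, reading every line just to keep ten is wasteful. One possible alternative is itertools.islice, which stops after the requested number of lines. A minimal sketch, where head is a hypothetical helper name and io.StringIO stands in for the 20-line file from the question:

```python
import io
from itertools import islice


def head(lines, n=10):
    """Return the first n lines without reading the rest of the input."""
    return list(islice(lines, n))


# Stand-in for a text file with 20 lines.
f = io.StringIO(''.join(f'line {i}\n' for i in range(1, 21)))

first_ten = head(f)
print(''.join(first_ten), end='')

# The remaining lines have not been consumed yet.
rest = f.readlines()
print(len(rest))
```

If the input has fewer than n lines, islice simply stops early, so there is no IndexError to guard against.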
Some further explanation:
The shortest and worst way you could fix your code is by adding one line of code:
with open('file') as f:
    for i in range(0,10):
        f.seek(0) # Add this line
        print([line.strip('\n') for line in f][i])
Now your code will work - but this is a horrible way to get your code to work. The reason that your code isn't working the way you expect in the first place is that files are consumable iterators. That means that when you read from them eventually you run out of things to read. Here's a simple example:
import io
file = io.StringIO('''
This is is a file
It has some lines
okay, only three.
'''.strip())
for line in file:
    print(file.tell(), repr(line))
This outputs
18 'This is is a file\n'
36 'It has some lines\n'
53 'okay, only three.'
Now if you try to read from the file:
print(file.read())
You'll see that it doesn't output anything. That's because you've "consumed" the file. I mean obviously it's still on disk, but the iterator has reached the end of the file. But as shown, you can seek in the file.
print(file.tell())
file.seek(0)
print(file.tell())
print(file.read())
And you'll see your entire file printed. But what about those other positions?
file.seek(36)
print(file.read()) # => okay, only three.
As a side note, you can also specify how much to read:
file.seek(36)
print(file.read(4)) # => okay
print(file.tell()) # => 40
So when we read from a file or iterate over it we consume the iterator and get to the end of the file. Let's put your new tools to work and go back to your original code and explore what's happening.
with open('file') as f:
    print(f.tell())
    lines = [line.rstrip('\n') for line in f]
    print(f.tell())
    print(len([line for line in f]))
    print(lines)
You'll see that you're at a different location in the file. And the second list comprehension produces an empty list. That's because when a list comprehension is evaluated it executes immediately. So when you do this:
for i in range(10):
    print([line.strip('\n') for line in f][i])
The first time through, i = 0 and the list comprehension reads to the end of the file. Then it takes the [0]th element of the list, which is the first line in the file. But your file iterator is now at the end of the file.
So now we get back to the beginning of the loop and i = 1. We iterate to the end of the file again, but we're already at the end, so there are no lines to read and we get an empty list [] that we try to take the [1]st element of. But there's nothing there, so we get an IndexError.
List comprehensions can be useful, but when you're beginning it's usually much easier to write a for loop and then turn it into a list comprehension. So you might write something like this:
with open('file') as f:
    for i, line in enumerate(f):
        if i < 10:
            print(line.rstrip())
Now, we shouldn't print inside a list comprehension, so instead we'll collect everything. We start out by putting what we want:
[line.rstrip()
Now add the for bit:
[line.rstrip() for i, line in enumerate(f)
And finally add the filter and our closing bracket:
[line.rstrip() for i, line in enumerate(f) if i < 10]
For more on list comprehensions, this is a fantastic resource: http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/
