How to add to the beginning of each line of a large file (>100GB) the index of that line with Python? - python-3.x

some_file.txt: (before)
one
two
three
four
five
...
How can I efficiently modify a large file in Python?
with open("some_file.txt", "r+") as file:
for idx, line in enumerate(file.readlines()):
file.writeline(f'{idx} {line}') # something like this
some_file.txt: (after)
1 one
2 two
3 three
4 four
5 five
...

Don't try to load your entire file into memory, because the file may be too large for that. Instead, read line by line:
with open('input.txt') as inp, open('output.txt', 'w') as out:
    idx = 1
    for line in inp:
        out.write(f'{idx} {line}')
        idx += 1
You can't insert into the middle of a file without re-writing it. This is an operating system thing, not a Python thing.

Use pathlib for path manipulation. Rename the original file. Then copy it to a new file, adding the line numbers as you go. Keep the old file until you verify the new file is correct.
Open files are iterable, so you can use enumerate() on them directly without having to use readlines() first. The second argument to enumerate() is the number to start the count with. So the loop below will number the lines starting with 1.
from pathlib import Path
target = Path("some_file.txt")
# rename the file with ".old" suffix
original = target.rename(target.with_suffix(".old"))
with original.open("r") as source, target.open("w") as sink:
    for line_no, line in enumerate(source, 1):
        sink.write(f'{line_no} {line}')
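Once you've verified that the new file is correct, you can delete the backup. A minimal follow-up sketch (pathlib's unlink removes a file):

# Only after checking that some_file.txt came out right:
original.unlink()  # deletes some_file.old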

Related

Running a function on multiple files simultaneously with python

I have a specific function that manipulates text files, taking a directory and a file name as input.
The function is defined below:
def nav2xy(target_directory, target_file):
    after_rows = f'MOD {target_file}_alines.txt'
    after_columns = f'MOD {target_file}_acolumns.txt'
    # this segment removes the top lines (8 in this case) so we work with only the actual data
    infile = open(f'{target_directory}/{target_file}', 'r').readlines()
    with open(after_rows, 'w') as outfile:
        for index, line in enumerate(infile):
            if index >= 8:
                outfile.write(line)
    # this segment removes the unnecessary columns, leaving only coordinates for gmt use
    with open(after_rows) as In, open(after_columns, "w") as Out:
        for line in In:
            values = line.split()
            Out.write(f"{values[4]} {values[5]}\n")
I am searching for a way to run this code once on all files in the chosen directory (they could be targeted by name, or it could simply process all of them).
Should I change the function to use only the file name?
I tried running the function this way, to no avail:
for i in os.listdir('Geoseas_related_files'):
    nav2xy('target_directory', i)
This way mostly works, although somehow I still get this error with it:
(base) ms-iMac:python gan$ python3 coordinates_fromtxt.py
Traceback (most recent call last):
File "coordinates_fromtxt.py", line 7, in <module>
nav2xy('Geoseas_related_files', str(i))
File "/Users/gadraifman/research/python/GAD_MSC/Nav.py", line 19, in nav2xy
Out.write(f"{values[4]} {values[5]}\n")
IndexError: list index out of range
Any help or advice would be greatly appreciated.
From what I gather from Iterating through directories with Python, the best way to loop directories is using glob.
I made some other extensive modifications to your code, to simplify it and to remove the middle step of saving lines to a file just to read them again. If that step is mandatory, feel free to add it back.
import os, glob

def nav2xy(target_file):
    # New file name, just appending stuff.
    # "target_file" will contain the path as defined by root_dir + current filename
    after_columns = f'{target_file}_acolumns.txt'
    with open(target_file, 'r') as infile, open(after_columns, "w") as outfile:
        content = infile.readlines()
        # Skip the first 8 lines
        for line in content[8:]:
            # No need to write the lines to a file just to read them again.
            # Process them directly.
            values = line.split()
            outfile.write(f"{values[4]} {values[5]}\n")

# I guess this is the dir you want to loop through.
# Maybe an absolute path like c:\path\to\files is better.
root_dir = 'Geoseas_related_files'
for file_or_dir in glob.iglob(os.path.join(root_dir, "*")):
    # Skip directories, if there are any.
    if os.path.isfile(file_or_dir):
        nav2xy(file_or_dir)
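As for the IndexError in your traceback: it means some line split into fewer than six whitespace-separated fields (for example a blank line, or a header line that survived the 8-line skip). If that turns out to be the cause, a guard like this sketch, swapped in for the last two lines of the loop, would skip such lines instead of crashing:

values = line.split()
if len(values) >= 6:  # only write lines that actually have both columns
    outfile.write(f"{values[4]} {values[5]}\n")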

Replacing a float number in txt file

Firstly, I would like to say that I am a newbie in Python.
I will try to explain my problem as best as I can.
The main aim of the code is to be able to read, modify and copy a txt file.
In order to do that I would like to split the problem up into three different steps.
1 - Copy the first N lines into a new txt file (CopyFile), exactly as they are in the original file (OrigFile)
2 - Access a specific line where I want to change a float number for another. I want to append this line to CopyFile.
3 - Copy the rest of OrigFile, from the line in point 2 to the end of the file.
At the moment I have been able to do step 1 with next code:
with open("OrigFile.txt") as myfile:
head = [next(myfile) for x iin range(10)] #read first 10 lines of txt file
copy = open("CopyFile.txt", "w") #create a txt file named CopyFile.txt
copy.write("".join(head)) #convert list into str
copy.close #close txt file
For the second step, my idea is to access directly to the txt line I am interested in and recognize the float number I would like to change. Code:
import linecache, re

line11 = linecache.getline("OrigFile.txt", 11)  # opening and accessing line 11 directly
FltNmb = re.findall(r"\d+\.\d+", line11)  # regular expressions to identify float numbers
My problem comes when I need to change FltNmb for a new one, taking into consideration that I need to specify it inside the line11. How could I achieve that?
Open both files and write each line sequentially while incrementing a line counter.
Add a condition for line 11 to replace the float number; the rest of the lines are written without modification:
with open("CopyFile.txt", "w") as newfile:
with open("OrigFile.txt") as myfile:
linecounter = 1
for line in myfile:
if linecounter == 11:
newline = re.sub("^(\d+\.\d+)", "<new number>", line)
linecounter += 1
outfile.write(newline)
else:
newfile.write(line)
linecounter += 1
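For reference, re.sub returns a new string with the matched text replaced, so you can test the substitution on its own. A toy example (the line content here is made up):

import re
line11 = "0.123 some label"
print(re.sub(r"^(\d+\.\d+)", "9.999", line11))  # -> "9.999 some label"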

Using Python how to sort lines alphabetically, by the nth character from the left in the line?

I am writing a program that takes input from a file, appends a prefix and a suffix to each line, then writes the completed line to an output file. Then, the program takes input from the output files (3 of them), combines the lines and outputs that result into a "final" output file.
I am looking to see how I can then alphabetize the "final" output file to be organized by the 9th character from the left. The first 8 characters are all the same, so doing something like
newLines.sort()
won't work. Also, I can't sort any of the files individually, as the first file is first names, second file is last names, and third file is age. If I sort them individually, I will get the first and last names mixed up.
I have seen many questions answered using sort keys and lambda code, but I haven't been able to find documentation that explains it.
For instance, it seems like this line from a search result would work for me:
(key=lambda s: s.split()[1])
but I don't understand what the "s" is, nor the "[1]". So, I'm not sure how to use this line to target the 9th character in the line. Also, it seems their input has a space, mine does not.
Here is the code I am working with:
##-- Combine files --##
finalDest = open(r'[final output location]', 'wb')
firstColumn = open(r'[file 1 location]', 'rb')
secondColumn = open(r'[file 2 location]', 'rb')
thirdColumn = open(r'[file 3 location]', 'rb')
for line in firstColumn.readlines():
    finalDest.write(line.strip(b'\r\n') + secondColumn.readline().strip(b'\r\n') + thirdColumn.readline().strip(b'\r\n') + b'\r\n')
firstColumn.close()
secondColumn.close()
thirdColumn.close()
finalDest.close()
Here is an example from the "final" output:
<tr><td>Becky</td><td>Morgan</td><td>W 40-49</td></tr>
<tr><td>Kevin</td><td>Miller</td><td>M 20-29</td></tr>
<tr><td>Carol</td><td>Wilson</td><td>W 50-59</td></tr>
<tr><td>Joshua</td><td>Wilson</td><td>M 20-29</td></tr>
I would like that to be sorted to this:
<tr><td>Becky</td><td>Morgan</td><td>W 40-49</td></tr>
<tr><td>Carol</td><td>Wilson</td><td>W 50-59</td></tr>
<tr><td>Kevin</td><td>Miller</td><td>M 20-29</td></tr>
<tr><td>Joshua</td><td>Wilson</td><td>M 20-29</td></tr>
Based on the recommendation of @kabanus, I have adjusted my code to the following:
##-- Combine files --##
myLines = []
finalDest = open(r'[final-output location]', 'wb')
firstColumn = open(r'[file 1 location]', 'rb')
secondColumn = open(r'[file 2 location]', 'rb')
thirdColumn = open(r'[file 3 location]', 'rb')
for line in firstColumn.readlines():
    myLines.append(line.strip(b'\r\n') + secondColumn.readline().strip(b'\r\n') + thirdColumn.readline().strip(b'\r\n') + b'\r\n')
finalDest.write(b'\r\n'.join(myLines.sort()))
firstColumn.close()
secondColumn.close()
thirdColumn.close()
finalDest.close()
However, I am now getting an error:
Traceback (most recent call last):
File "[program location]", line 56, in <module>
finalDest.write(b'\r\n'.join(myLines.sort()))
TypeError: can only join an iterable
list.sort() sorts in place and returns None, so you end up passing None to join, which raises that TypeError. Also, by the time you call sort, the lines would already have been written. First collect your lines:
mylines = []
for line in firstColumn.readlines():
    mylines.append(line.strip(b'\r\n') + secondColumn.readline().strip(b'\r\n') + thirdColumn.readline().strip(b'\r\n'))
Now you can sort and write it:
finalDest.write(b'\r\n'.join(sorted(mylines)))
finalDest.close()
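On the key=lambda part you asked about: s is each element of the list being sorted, and the key function returns the value to compare by (so s.split()[1] would sort by the second whitespace-separated word). Since your first 8 characters are all identical, you could sort by everything from the 9th character on, swapping this in for sorted(mylines) above (a sketch; bytes slice the same way strings do):

sorted(mylines, key=lambda s: s[8:])  # s is one line; s[8:] skips the 8 identical leading characters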
You should read all the lines from the three input files (use f.readlines). You then zip the three lists of lines, giving you a list of tuples.
Sort that list however you want (if you use the default sort, you'll probably get what you want), then write each tuple to the output file as a line.
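A minimal sketch of that zip approach, with placeholder file names:

with open('first.txt', 'rb') as f1, open('second.txt', 'rb') as f2, open('third.txt', 'rb') as f3:
    rows = zip(f1.readlines(), f2.readlines(), f3.readlines())
with open('final.txt', 'wb') as out:
    # sorted() on tuples compares element by element, so this sorts by first name first
    for first, last, age in sorted(rows):
        out.write(first.strip(b'\r\n') + last.strip(b'\r\n') + age.strip(b'\r\n') + b'\r\n')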

IndexError: list index out of range, but list length OK

New to programming, looking for a deeper understanding of what's happening.
Goal: open a file and print the first 10 lines. (similar to head command)
Code:
with open('file') as f:
    for i in range(0, 10):
        print([line.strip('\n') for line in f][i])
Result: prints first line fine, then returns the out of range error
File: Is a simple text file with 20 lines, no more than 50 chars per line
FYI - I removed the range line and printed both the type (list) and the length (20). I printed specific indexes without issue (unless more than one in a row).
I was able to get the desired result with different code, but I'm trying to improve by using with/as.
You can iterate over a file directly, which is what you should be doing here.
with open('file') as f:
    for i, line in enumerate(f, start=1):
        # Line already has a '\n' at the end
        print(line, end='')
        # Get out of the loop once we've printed 10 lines
        if i >= 10:
            break
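Equivalently, itertools.islice takes just the first 10 lines without the manual counter; a small variation on the same idea:

from itertools import islice

with open('file') as f:
    for line in islice(f, 10):  # stops after 10 lines
        print(line, end='')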
Your code is failing because of your list comprehension:
[line.strip('\n') for line in f]
The first time through your loop, that comprehension consumes all of the lines in your file. Your file then has no more lines, so the next time through it builds a list of all the remaining lines (an empty list) and tries to get the [1]st element. But that doesn't exist, because there are no lines left to read.
If you wanted to keep your code mostly as-is you could do
lines = [line.rstrip('\n') for line in f]
for i in range(10):
    print(lines[i])
But that's also silly, because you could just do
lines = f.readlines()
But that's also silly if you just want up to the 10th line, because you could do this:
with open('file') as f:
    print(''.join(f.readlines()[:10]), end='')
Some further explanation:
The shortest and worst way you could fix your code is by adding one line of code:
with open('file') as f:
    for i in range(0, 10):
        f.seek(0)  # Add this line
        print([line.strip('\n') for line in f][i])
Now your code will work - but this is a horrible way to get your code to work. The reason that your code isn't working the way you expect in the first place is that files are consumable iterators. That means that when you read from them eventually you run out of things to read. Here's a simple example:
import io
file = io.StringIO('''
This is is a file
It has some lines
okay, only three.
'''.strip())
for line in file:
    print(file.tell(), repr(line))
This outputs
18 'This is is a file\n'
36 'It has some lines\n'
53 'okay, only three.'
Now if you try to read from the file:
print(file.read())
You'll see that it doesn't output anything. That's because you've "consumed" the file. I mean obviously it's still on disk, but the iterator has reached the end of the file. But as shown, you can seek in the file.
print(file.tell())
file.seek(0)
print(file.tell())
print(file.read())
And you'll see your entire file printed. But what about those other positions?
file.seek(36)
print(file.read()) # => okay, only three.
As a side note, you can also specify how much to read:
file.seek(36)
print(file.read(4)) # => okay
print(file.tell()) # => 40
So when we read from a file or iterate over it we consume the iterator and get to the end of the file. Let's put your new tools to work and go back to your original code and explore what's happening.
with open('file') as f:
    print(f.tell())
    lines = [line.rstrip('\n') for line in f]
    print(f.tell())
    print(len([line for line in f]))
    print(lines)
You'll see that you're at a different location in the file. And the second list comprehension produces an empty list. That's because when a list comprehension is evaluated it executes immediately. So when you do this:
for i in range(10):
    print([line.strip('\n') for line in f][i])
The first time through, i = 0 and the list comprehension reads to the end of the file. It takes the [0]th element of that list, the first line of the file, but your file iterator is now at the end of the file.
So we come back to the top of the loop with i = 1. We iterate over the file again, but we're already at the end, so there are no lines to read and we get an empty list [] that we try to get the [1]st element of. But there's nothing there, so we get an IndexError.
List comprehensions can be useful, but when you're beginning it's usually much easier to write a for loop and then turn it into a list comprehension. So you might write something like this:
with open('file') as f:
    for i, line in enumerate(f):
        if i < 10:
            print(line.rstrip())
Now, we shouldn't print inside a list comprehension, so instead we'll collect everything. We start out by putting what we want:
[line.rstrip()
Now add the for bit:
[line.rstrip() for i, line in enumerate(f)
And finally add the filter and our closing brace:
[line.rstrip() for i, line in enumerate(f) if i < 10]
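Putting that comprehension back into context, the whole thing would look something like this:

with open('file') as f:
    first_ten = [line.rstrip() for i, line in enumerate(f) if i < 10]
print('\n'.join(first_ten))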
For more on list comprehensions, this is a fantastic resource: http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/

Make python program read 2 files line by line in sync and conduct program on each line

This question is twofold:
Background: I have 2 large files, each line of file 1 is "AATTGGCCAA" and each line of file 2 is "AATTTTCCAA". Each file has 20,000 lines and I have a python code I have to run on each pair of lines in turn.
Firstly, how would you go about getting the python code to run on the same numbered line of each file e.g. line 1 of both files?
Secondly, how would you get the file to move down to line 2 on both files after running on line 1 etc?
File objects are iterators. You can pass them to any function that expects an iterable object and it will work. For your specific use case, you want to use the zip builtin function, which iterates over several objects in parallel and yields tuples with one object from each iterable.
with open(filename1) as file1, open(filename2) as file2:
    for line1, line2 in zip(file1, file2):
        do_something(line1, line2)
In Python 3, zip is an iterator, so this is efficient. If you needed to do the same thing in Python 2, you'd probably want to use itertools.izip instead, as the regular zip would cause all the data from both files to be read into a list up front.
File objects are iterators. You can open them and then call next() on the object to get the next line. An example:
for line in file1:
    other_line = next(file2)
    do_something(line, other_line)
The following code uses two Python features:
1. Generator function
2. File object treated as iterator
def get_line(file_path):
    # Generator function
    with open(file_path) as file_obj:
        for line in file_obj:
            # Give one line and return control to the calling scope
            yield line

# The generator function will not be executed here.
# Instead we get two generator instances.
lines_a = get_line(path_to_file_a)
lines_b = get_line(path_to_file_b)

while True:
    try:
        # Now grab one line from each generator
        line_pair = (next(lines_a), next(lines_b))
    except StopIteration:
        # This exception means we hit EOF in one of the files, so exit the loop
        break
    do_something(line_pair)
This assumes your code is wrapped in a do_something(line_pair) function that accepts a tuple of length 2 holding the pair of lines.
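For illustration only, a trivial do_something stub that would make the snippet runnable end to end (your real per-line processing goes here instead):

def do_something(line_pair):
    line_a, line_b = line_pair  # one line from each file
    print(line_a.strip(), line_b.strip())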
Here's the code that allows you to process lines in sync from multiple files:
from contextlib import ExitStack
with ExitStack() as stack:
    files = [stack.enter_context(open(filename)) for filename in filenames]
    for lines in zip(*files):
        do_something(*lines)
e.g., for 2 files it calls do_something(line_from_file1, line_from_file2) for each pair of lines in the given files.
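One caveat: zip stops as soon as the shortest file runs out, so trailing lines in a longer file are silently dropped. If you'd rather detect a mismatch, itertools.zip_longest is one option (a sketch building on the files list above):

from contextlib import ExitStack
from itertools import zip_longest

with ExitStack() as stack:
    files = [stack.enter_context(open(filename)) for filename in filenames]
    for lines in zip_longest(*files):  # missing lines come through as None
        if None in lines:
            raise ValueError("input files have different line counts")
        do_something(*lines)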
