Replacing a float number in txt file - python-3.x

First, I would like to say that I am a newbie in Python.
I will try to explain my problem as best I can.
The main aim of the code is to read, modify and copy a txt file.
To do that, I would like to split the problem into three steps:
1 - Copy the first N lines into a new txt file (CopyFile), exactly as they are in the original file (OrigFile).
2 - Access a specific line where I want to replace a float number with another one, and append this line to CopyFile.
3 - Copy the rest of OrigFile, from the line after the one in point 2 to the end of the file.
So far I have been able to do step 1 with the following code:
with open("OrigFile.txt") as myfile:
    head = [next(myfile) for x in range(10)]  # read first 10 lines of txt file
copy = open("CopyFile.txt", "w")  # create a txt file named CopyFile.txt
copy.write("".join(head))  # convert list into str
copy.close()  # close txt file
For the second step, my idea is to go directly to the line I am interested in and find the float number I would like to change. Code:
import linecache
import re
line11 = linecache.getline("OrigFile.txt", 11)  # read line 11 directly
FltNmb = re.findall(r"\d+\.\d+", line11)  # regular expression to find float numbers
My problem comes when I need to replace FltNmb with a new value, since I need to do it inside line11. How could I achieve that?

Open both files and write each line sequentially while incrementing a line counter.
Add a condition for line 11 that replaces the float number; the rest of the lines are written without modification:
import re

with open("CopyFile.txt", "w") as newfile:
    with open("OrigFile.txt") as myfile:
        linecounter = 1
        for line in myfile:
            if linecounter == 11:
                # "^" anchors the match to the start of the line;
                # drop it if the number sits elsewhere in the line
                newline = re.sub(r"^(\d+\.\d+)", "<new number>", line)
                newfile.write(newline)
            else:
                newfile.write(line)
            linecounter += 1
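The same idea can be wrapped in a small helper. This is only a sketch: the function name replace_float_on_line and its parameters are made up for illustration, and unlike the snippet above it replaces the first float found anywhere on the target line, not only one at the start.

```python
import re

FLOAT_RE = re.compile(r"\d+\.\d+")  # same pattern as in the question


def replace_float_on_line(src, dst, line_no, new_value):
    """Hypothetical helper: copy src to dst, replacing the first float
    on line `line_no` (1-based) with `new_value`."""
    with open(src) as infile, open(dst, "w") as outfile:
        for current, line in enumerate(infile, 1):
            if current == line_no:
                # count=1 limits the substitution to the first match
                line = FLOAT_RE.sub(str(new_value), line, count=1)
            outfile.write(line)
```

Calling replace_float_on_line("OrigFile.txt", "CopyFile.txt", 11, 3.14) would cover all three steps from the question in one pass.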

Related

How to add to the beginning of each line of a large file (>100GB) the index of that line with Python?

some_file.txt: (before)
one
two
three
four
five
...
How can I efficiently modify a large file in Python?
with open("some_file.txt", "r+") as file:
    for idx, line in enumerate(file.readlines()):
        file.writeline(f'{idx} {line}')  # something like this
some_file.txt: (after)
1 one
2 two
3 three
4 four
5 five
...
Don't try to load your entire file in memory, because the file may be too large for that. Instead, read line by line:
with open('input.txt') as inp, open('output.txt', 'w') as out:
    idx = 1
    for line in inp:
        out.write(f'{idx} {line}')
        idx += 1
You can't insert into the middle of a file without re-writing it. This is an operating system thing, not a Python thing.
Use pathlib for path manipulation. Rename the original file. Then copy it to a new file, adding the line numbers as you go. Keep the old file until you verify the new file is correct.
Open files are iterable, so you can use enumerate() on them directly without having to use readlines() first. The second argument to enumerate() is the number to start the count with. So the loop below will number the lines starting with 1.
from pathlib import Path

target = Path("some_file.txt")
# rename the file with ".old" suffix
original = target.rename(target.with_suffix(".old"))
with original.open("r") as source, target.open("w") as sink:
    for line_no, line in enumerate(source, 1):
        # file objects have write(), not writeline()
        sink.write(f'{line_no} {line}')
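One way to verify the new file before deleting the old one is to stream both files side by side. The sketch below is an assumption about what "correct" means here (every new line equals its line number, a space, and the original line); the function name is hypothetical.

```python
from itertools import zip_longest
from pathlib import Path


def numbered_copy_is_consistent(original: Path, numbered: Path) -> bool:
    """Hypothetical check: True if every line of `numbered` is
    '<n> ' followed by line n of `original`, with equal line counts."""
    with original.open("r") as source, numbered.open("r") as sink:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is detected instead of silently ignored
        for line_no, (old, new) in enumerate(zip_longest(source, sink), 1):
            if old is None or new is None:
                return False
            if new != f"{line_no} {old}":
                return False
    return True
```

Like the answer's loop, this reads one line at a time from each file, so memory use stays small even for very large inputs.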

Splitting up a large file into smaller files at specific points

I know this question has been asked several times, but those solutions don't help me here. I have a very large file (almost 5 GB) that I need to read so I can feed its data to my neural network. I have to read it line by line. At first I loaded the entire file into memory with .readlines(), which obviously ran out of memory. Next, instead of loading the whole file, I read it line by line, but that still hasn't worked. So now I am thinking of splitting the file into smaller files and reading each of those in turn. In the file format, each sequence has a header line starting with '>' followed by the sequence itself, for example:
>seq1
acgtccgttagggtjhtttttttttt
tttsggggggtattttttttt
>seq2
accggattttttstttttttttaasftttttttt
stttttttttttttttttttttttsttattattat
tttttttttttttttt
>seq3
aa
.
.
.
>seqN
bbbbaatatattatatatatattatatat
tatatattatatatattatatatattatat
tatattatatattatatatattatatatatta
tatatatatattatatatatatatattatatat
tatatatattatatattatattatatatattata
tatatattatatattatatatattatatatatta
So now I want to split my file, which has 12,700,000 sequences, into smaller files such that each header starting with '>' stays together with its corresponding sequence. How can I achieve this in Python without running into memory issues? Insights would be appreciated.
I was able to do this with 12,700,000 randomized lines with 1-20 random characters in each line. Though the size of my file was far less than 5GB (roughly 300MB)--likely due to format. All of that said, you can try this:
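# FILEPATH is a placeholder for your base directory;
# the "Sequence Files" folder must already exist.
x = 0  # number of headers seen so far
y = 1  # index of the next output file
string = ""  # lines collected for the current output file
cycle = "Seq1"
with open(f"{FILEPATH}/main.txt", "r") as file:
    for line in file:
        if line[0] == ">":
            if x % 5000 == 0 and x != 0:
                # 5000 sequences collected: flush them to their own file
                with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
                    newfile.write(string)
                cycle = f"Seq{y*5000+1}"
                y += 1
                string = ""
            string += line
            x += 1
        else:
            string += line
# write whatever is left after the last full batch
with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
    newfile.write(string)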
This reads the file line by line, appends the first 5000 sequences to a string, writes the string to a new file, and repeats for the rest of the original file. Each file is named after the first sequence it contains.
The line that reads if x % 5000 == 0 and x != 0: defines the number of sequences in each file, and the line cycle = f"Seq{y*5000+1}" builds the next filename. You can adjust the 5000 in both places if you change your mind about how many sequences go in each file (with 5000, you're creating 2,540 new files).
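If holding 5000 sequences in a string is still too much, a variant can write each line straight to the current output file and keep nothing in memory. This is a sketch rather than the answer's method: split_fasta, out_dir and per_file are names invented here, and the output naming only imitates the StartingSeqN.txt pattern used above.

```python
from pathlib import Path


def split_fasta(source, out_dir, per_file=5000):
    """Hypothetical helper: split `source` into files holding at most
    `per_file` sequences each, streaming line by line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0      # headers seen so far
    out = None     # currently open output file
    written = []   # paths of the files created
    try:
        with open(source) as infile:
            for line in infile:
                if line.startswith(">"):
                    if count % per_file == 0:
                        # start a new file, named after its first sequence
                        if out is not None:
                            out.close()
                        path = out_dir / f"StartingSeq{count + 1}.txt"
                        out = path.open("w")
                        written.append(path)
                    count += 1
                if out is not None:
                    out.write(line)  # lines before the first header are skipped
    finally:
        if out is not None:
            out.close()
    return written
```

Because a new file is only opened when a header line arrives, a header and its sequence lines always end up in the same file.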

Using Python how to sort lines alphabetically, by the nth character from the left in the line?

I am writing a program that takes input from a file, appends a prefix and a suffix to each line, then writes the completed line to an output file. Then, the program takes input from the output files (3 of them), combines the lines and outputs that result into a "final" output file.
I am looking to see how I can then alphabetize the "final" output file to be organized by the 9th character from the left. The first 8 characters are all the same, so doing something like
newLines.sort()
won't work. Also, I can't sort any of the files individually, as the first file is first names, second file is last names, and third file is age. If I sort them individually, I will get the first and last names mixed up.
I have seen many questions answered using sort keys and lambda code, but I haven't been able to find documentation that explains it.
For instance, it seems like this line from one search result would work for me:
(key=lambda s: s.split()[1])
but I don't understand what the "s" is, nor the "[1]". So, I'm not sure how to use this line to target the 9th character in the line. Also, it seems their input has a space, mine does not.
Here is the code I am working with:
##-- Combine files --##
finalDest = open(r'[final output location]', 'wb')
firstColumn = open(r'[file 1 location]', 'rb')
secondColumn = open(r'[file 2 location]', 'rb')
thirdColumn = open(r'[file 3 location]', 'rb')
for line in firstColumn.readlines():
    finalDest.write(line.strip(b'\r\n') + secondColumn.readline().strip(b'\r\n') + thirdColumn.readline().strip(b'\r\n') + b'\r\n')
firstColumn.close()
secondColumn.close()
thirdColumn.close()
finalDest.close()
Here is an example from the "final" output:
<tr><td>Becky</td><td>Morgan</td><td>W 40-49</td></tr>
<tr><td>Kevin</td><td>Miller</td><td>M 20-29</td></tr>
<tr><td>Carol</td><td>Wilson</td><td>W 50-59</td></tr>
<tr><td>Joshua</td><td>Wilson</td><td>M 20-29</td></tr>
I would like that to be sorted to this:
<tr><td>Becky</td><td>Morgan</td><td>W 40-49</td></tr>
<tr><td>Carol</td><td>Wilson</td><td>W 50-59</td></tr>
<tr><td>Joshua</td><td>Wilson</td><td>M 20-29</td></tr>
<tr><td>Kevin</td><td>Miller</td><td>M 20-29</td></tr>
Based on the recommendation of @kabanus, I have adjusted my code to the following:
##-- Combine files --##
myLines = []
finalDest = open(r'[final-output location]', 'wb')
firstColumn = open(r'[file 1 location]', 'rb')
secondColumn = open(r'[file 2 location]', 'rb')
thirdColumn = open(r'[file 3 location]', 'rb')
for line in firstColumn.readlines():
    myLines.append(line.strip(b'\r\n') + secondColumn.readline().strip(b'\r\n') + thirdColumn.readline().strip(b'\r\n') + b'\r\n')
finalDest.write(b'\r\n'.join(myLines.sort()))
firstColumn.close()
secondColumn.close()
thirdColumn.close()
finalDest.close()
However, I am now getting an error:
Traceback (most recent call last):
  File "[program location]", line 56, in <module>
    finalDest.write(b'\r\n'.join(myLines.sort()))
TypeError: can only join an iterable
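A file object has no 'sort' method, and by the time you call sort the lines have already been written. First collect your lines:
mylines = []
for line in firstColumn.readlines():
    mylines.append(line.strip(b'\r\n') + secondColumn.readline().strip(b'\r\n') + thirdColumn.readline().strip(b'\r\n'))
Now you can sort and write them. Use sorted(), which returns a new list; list.sort() sorts in place and returns None, which is what causes the TypeError above. The files are opened in binary mode, so the separator must be bytes too:
finalDest.write(b"\r\n".join(sorted(mylines)))
finalDest.close()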
You should read all the lines from the three input files (use f.readlines). You then zip the three lists of lines, giving you a list of tuples.
Sort that list however you want (if you use the default sort, you'll probably get what you want), then write each tuple to the output file as a line.
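As a rough illustration of the zip approach, and of what a sort key does, here is a sketch that works on text lines rather than the binary files in the question. Both function names are invented for this example. In key=lambda s: ..., the s is simply each line being sorted, and the value the lambda returns is what the lines are compared by.

```python
def combine_and_sort(first_names, last_names, ages):
    """Hypothetical helper: zip three parallel lists of lines into rows,
    sort the rows, and join each row into one output line."""
    rows = zip(
        (s.strip() for s in first_names),
        (s.strip() for s in last_names),
        (s.strip() for s in ages),
    )
    # Tuples compare element by element, so the default sort orders
    # by the first column, then the second, then the third
    return ["".join(row) for row in sorted(rows)]


def sort_from_column(lines, start=8):
    """Sort lines by the text from index `start` onward
    (index 8 is the 9th character)."""
    return sorted(lines, key=lambda s: s[start:])
```

Since the first eight characters are identical on every line in the question, sort_from_column(lines) and a plain sorted(lines) give the same order there; the key only matters when the prefix differs.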

python: editing specific lines in a text file. File not being read after first edit

I am new to Python. Right now I'm trying to learn how to edit text files (overwrite them).
So, I have a text file that stores these ints like this:
1
2
3
4
5
Then when I do this:
with open('badgeNumbers.txt', 'r') as f:
    lines = f.readlines()
self.firstBadge = lines[0].strip()
self.secondBadge = lines[1].strip()
self.thirdBadge = lines[2].strip()
self.fourthBadge = lines[3].strip()
self.fifthBadge = lines[4].strip()
int(self.thirdBadge)
lines[2] = 56
out = open('badgeNumbers.txt', 'w')
out.writelines(str(lines))
out.close()
it works and changes the number.
In the text file it is now saved like this:
['1\n', '2\n', 56, '3\n', '4\n', '5']
However, later if I want to run this again, it gives me this error:
self.secondBadge = lines[1].strip()
IndexError: list index out of range
I just need for it to be able to do the same thing as before the first text file edit.
Can somebody please help?
Thanks
The first problem is that 56 is an int with no newline at the end, so it and the next line end up on the same line. The second problem is that you are writing the string representation of the whole list onto one line instead of writing each string in the list on its own line. That single line is why lines[1] raises the IndexError on the next run. Change lines[2] = 56 to lines[2] = "56\n", and change out.writelines(str(lines)) to out.writelines(lines).
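Put together, the two changes might look like the sketch below. The helper name replace_line is made up for illustration; the point is only that every list item stays a string ending in a newline and that the list itself is passed to writelines().

```python
def replace_line(path, index, new_value):
    """Hypothetical helper: replace the line at `index` (0-based)
    in the text file at `path`."""
    with open(path) as f:
        lines = f.readlines()
    # store a string that ends with a newline, not a bare int
    lines[index] = f"{new_value}\n"
    with open(path, "w") as out:
        out.writelines(lines)  # writes each string in the list as-is
```

After replace_line('badgeNumbers.txt', 2, 56) the file still has one number per line, so it can be read again with readlines() on the next run.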

python3 opening files and reading lines

Can you explain what is going on in this code? I don't seem to understand
how you can open the file and read it line by line instead of all of the sentences at the same time in a for loop. Thanks
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv
filename = input("Please enter the name of a file: ")
f = open(filename,'r')
d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId,StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop. On each iteration, it retrieves the next line of the file. Because it includes the last character in the line, which could be a newline, it's often useful to check and remove the last character.
As Moshe wrote, open file objects can be iterated. However, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, the unit of iteration is one text line including the \n.
You can use line = line.rstrip() to remove the \n along with any trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: the open file should always be closed. That means calling f.close() after the for loop. Alternatively, you can wrap the open in the newer with construct, which closes the file for you. I suggest getting used to the latter approach.
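For illustration, the loop from the question could be rewritten with the with construct roughly as follows. This is a sketch, not the original program: the function name count_visits is invented, and collections.Counter is swapped in for the plain dict to shorten the counting.

```python
from collections import Counter


def count_visits(filename):
    """Hypothetical helper: count how often each (AnimalId, StationId)
    pair appears in a file of AnimalId:Timestamp:StationId lines."""
    counts = Counter()
    with open(filename) as f:  # the file is closed when the block ends
        for line in f:  # one text line per iteration, including the \n
            line = line.rstrip()
            if not line:
                continue  # skip blank lines
            animal_id, timestamp, station_id = line.split(":")
            counts[(animal_id, station_id)] += 1
    return counts
```

The file is closed as soon as the with block is left, even if split() raises an exception on a malformed line.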
