How to read specific blocks of a data file between certain keywords - python-3.x

I have a text data file that looks as shown below:
BEGIN_CYCLE
..
start_data
2d_data1
end_data
..
..
END_CYCLE
BEGIN_CYCLE
..
start_data
2d_data2
end_data
BEGIN_CYCLE
..
start_data
2d_data3
end_data
...
END_CYCLE
and so on
I am only interested in data blocks that start with the start_data keyword and end with the end_data keyword, AND fall between BEGIN_CYCLE and a matching END_CYCLE keyword. In the above example, I want to read 2d_data1 and 2d_data3. Notice that although 2d_data2 starts with start_data and ends with end_data, it is NOT bounded by BEGIN_CYCLE and a matching END_CYCLE: it has a BEGIN_CYCLE but no matching END_CYCLE. Of course I can have any number of begin and end cycles, not just 3. My code below still reads 2d_data2, actually skips over 2d_data3, and then reads subsequent data blocks correctly. I do not know exactly why this is happening.
indexes = []
with open(file) as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('BEGIN_CYCLE'):
            s = i
        elif line.startswith('END_CYCLE'):
            e = i
            indexes.append((s, e))
        else:
            pass

temp_list = [list(range(*idx)) for idx in indexes]
indexes = [item for sublist in temp_list for item in sublist]

data = []
with open(file) as f:
    for i, line in enumerate(f):
        if 'start_data' in line and i in indexes:
            chunk = []
            for line in f:
                if not line.startswith('end_data'):
                    chunk.append(''.join(line.strip().split()))
                else:
                    break
            data.append(chunk)
My thought process is to first identify valid test cycles (those with begin_cycle and end_cycle keywords) which explains the first part of the code. Then within these bounds, I am searching for start_data and end_data keywords and appending lines of data into chunks which I eventually collect in a list of data. The problem with my code is that 2d_data2 is read and not ignored. In fact, the code works fine whenever the test file always has matching BEGIN_CYCLE and END_CYCLE keywords. However, as soon as there is one or more instances of missing END_CYCLE keywords, then instead of ignoring any data block under that cycle, it includes it. Any help or alternative solution is appreciated. Thanks.

Below works exactly like I wanted. However, I don't like the idea of opening the file each time I loop over the indexes. I cannot think of a fix for this.
import itertools

indexes = []
with open(file) as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('BEGIN_CYCLE'):
            s = i
        elif line.startswith('END_CYCLE'):
            e = i
            indexes.append((s, e))
        else:
            pass

data = []
for idx in indexes:
    with open(file) as f:
        for line in itertools.islice(f, idx[0], idx[1]):
            if line.startswith('start_data'):
                chunk = []
                for line in f:
                    if not line.startswith('end_data'):
                        chunk.append(''.join(line.strip().split()))
                    else:
                        break
                data.append(chunk)

Related

Printing an entire list, instead of one line

I am having trouble writing the entire list into an outfile. Here is the code:
with open(infile, "r") as f:
    lines = f.readlines()
    for l in lines:
        if "ATOM" in l:
            split = l.split()
            if split[-1] == "1":
                print(split)
                #print(type(split))
                with open(newFile, "w") as f:
                    f.write("Model Number One" + "\n")
                    f.write(str(split))
When I use print(split) it lets me see the entire list:
with open(infile, "r") as f:
    lines = f.readlines()
    for l in lines:
        if "ATOM" in l:
            split = l.split()
            if split[-1] == "1":
                #print(split)
                print(type(split))
                with open(newFile, "w") as f:
                    f.write("Model Number One" + "\n")
                    for i in range(len(split)):
                        f.write(str(split))
However, when I try to use f.write(split) I get an error because the function can only take a str not a list. So, I used f.write(str(split)) and it worked. The only issue now is that it only writes the last item in the list, not the whole list.
The function print is slightly more permissive than the method f.write, in the sense that it can accept lists and various other types of objects as input. f.write must be called with pre-formatted strings, as you noticed.
I think the issue with the code is that the write routine is nested inside the loop. Because newFile is reopened in "w" mode on every match, Python erases the file's previous contents each time, so only the last line read (l) ends up in it.
The problem can be easily fixed by changing the open call to open( newFile,"a"). The flag "a" tells Python to append the new contents to the existing file newFile (without erasing information). If newFile does not exist yet, Python will automatically create it.
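An alternative to the append-mode fix (a sketch, with made-up file names model.txt and out.txt and sample ATOM lines standing in for the real data): open the output file once, in "w" mode, before the loop, so every matching line is written and the header appears only once.

```python
infile, newFile = 'model.txt', 'out.txt'   # placeholder file names

# Sample input mimicking the question's ATOM lines (last column is the model number).
with open(infile, 'w') as f:
    f.write('ATOM 1 N 1\n')
    f.write('ATOM 2 C 2\n')
    f.write('ATOM 3 O 1\n')

# Open the output once before the loop, so nothing is overwritten per iteration.
with open(infile) as f, open(newFile, 'w') as out:
    out.write("Model Number One\n")
    for l in f:
        if "ATOM" in l:
            split = l.split()
            if split[-1] == "1":
                out.write(" ".join(split) + "\n")
```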

Merge two consecutive lines only if both start with a given prefix, and write the rest of the text normally (Python)

Input
02000|42163,54|
03100|4|6070,00
03110|||6070,00|00|00|
00000|31751150201912001|01072000600074639|
02000|288465,76|
03100|11|9060,00
03110|||1299,00|00|
03110||||7761,00|00|
03100|29|14031,21
03110|||14031,21|00|
00000|31757328201912001|01072000601021393|
Code
prev = ''
with open('out.txt') as f:
    for line in f:
        if prev.startswith('03110') and line.startswith('03110'):
            print(prev.strip() + '|03100|XX|PARCELA|' + line)
        prev = line
Hi, I have this code that checks whether two consecutive lines start with 03110 and prints those lines, but I want to change the code so that it also prints (or writes to a .txt file) the rest of the lines.
Output should be like this
02000|42163,54|
03100|4|6070,00
03110|||6070,00|00|00|
00000|31751150201912001|01072000600074639|
02000|288465,76|
03100|11|9060,00
03110|||1299,00|00|3100|XX|PARCELA|03110||||7761,00|00|
03100|29|14031,21
03110|||14031,21|00|
00000|31757328201912001|01072000601021393|
I know that I'm only getting those two lines merged, because that is what the print() command outputs:
03110|||1299,00|00|3100|XX|PARCELA|03110||||7761,00|00|
But I don't know how to produce the desired output. Can anyone help me with my code?
# I assume the input is in a text file:
with open('myFile.txt', 'r') as my_file:
    splited_line = [line.rstrip().split('|') for line in my_file]  # split every line into a separate list

new_list = []
for i in range(len(splited_line)):
    # the current line and the previous line both start with 03110: merge them
    if i > 0 and splited_line[i][0] == '03110' and splited_line[i - 1][0] == '03110':
        first = '|'.join(splited_line[i - 1])
        second = '|'.join(splited_line[i])
        new_list.append(first + '|03100|XX|PARCELA|' + second)
    # this line will be merged with the next one: skip it to avoid duplicating it
    elif i + 1 < len(splited_line) and splited_line[i][0] == '03110' and splited_line[i + 1][0] == '03110':
        pass
    else:
        new_list.append('|'.join(splited_line[i]))

# write new_list to a text file
with open('new_file', 'a') as f:
    for item in new_list:
        print(item)
        f.write(item + '\n')
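An alternative sketch that stays closer to the questioner's original prev-based loop: buffer the previous line, then either merge it with the current one or emit it unchanged. The file names out.txt and merged.txt and the four sample lines are placeholders for illustration.

```python
# Small sample input shaped like the question's data.
with open('out.txt', 'w') as f:
    f.write('03100|11|9060,00\n')
    f.write('03110|||1299,00|00|\n')
    f.write('03110||||7761,00|00|\n')
    f.write('03100|29|14031,21\n')

with open('out.txt') as src, open('merged.txt', 'w') as dst:
    prev = None
    for line in src:
        if prev is not None and prev.startswith('03110') and line.startswith('03110'):
            dst.write(prev.rstrip('\n') + '|03100|XX|PARCELA|' + line)
            prev = None                    # both lines consumed by the merge
        else:
            if prev is not None:
                dst.write(prev)            # previous line passes through unchanged
            prev = line
    if prev is not None:
        dst.write(prev)                    # flush the final buffered line
```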

How can I expand List capacity in Python?

read = open('700kLine.txt')
# use readline() to read the first line
line = read.readline()
aList = []
for line in read:
    try:
        num = int(line.strip())
        aList.append(num)
    except ValueError:
        print("Not a number in line " + line)
read.close()
print(aList)
There are 700k lines in that file (every single line has a number of at most 2 digits).
I can only get ~280k of those lines into my aList.
So, how can I expand aList's capacity from 280k to 700k or more? (Or is there a different solution for this case?)
Hello, I just solved the problem. Thanks for all your help. It was an obvious output-buffer problem.
The solution is just to increase the size of the buffer; the link is here:
Increase output buffer when running or debugging in PyCharm
Please try this.
filename = '700kLine.txt'
with open(filename) as f:
    data = f.readlines()
print(data)
print(type(data))  # stores the data in a list
Yes, you can.
Once a list is defined, you can add, edit or delete its elements. To add more elements at the end, use the append function:
MyList.append(data)
Where MyList is the name of the list and data is the element you want to add.
I tried to re-create your problem:
# creating 700kLine file
with open('700kLine.txt', 'w') as f:
for i in range(700000):
f.write(str(i+1) + '\n')
# creating list from file entries
aList = []
with open('700kLine.txt', 'r') as f:
for line in f:
num = int(line.strip())
aList.append(num)
# print(aList)
print(aList[:30])
Jupyter notebook throws an error while printing all 700K lines due to too much memory used. If you really want to print all 700k values, run the python script from terminal.
It could be that your computer ran out of memory processing the file? I tried generating an infinite loop appending to a list and ended up with roughly 47 million elements (len(list) >> 47119572); the code I used to test this is below.
I tried the same code on an online REPL and it reached a significantly lower len(list).
list = []
while True:
    try:
        if len(list) > 0:
            list.append(list[-1] + 1)
        else:
            list.append(1)
    except MemoryError:
        print("memory error, last count is: ", list[-1])
        raise MemoryError
Maybe try saving bits of data read instead of reading the whole file at once?
Just my assumption.
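As a quick sanity check of the point above (a sketch; the 700 000 count mirrors the question's file size): a Python list has no fixed capacity. It grows automatically as elements are appended, limited only by available memory, so the ~280k cutoff cannot come from the list itself.

```python
nums = []
for i in range(700_000):
    nums.append(i % 100)   # numbers of at most two digits, like the question's file

print(len(nums))           # 700000: no capacity limit was hit
```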

python read/write data IndexError: list index out of range

I'm trying to write a simple piece of code to extract specific data columns from my measurement results (.txt files) and then save them into a new text file. Unfortunately I'm stuck even before the writing part. The code below results in the following error: IndexError: list index out of range
How do I solve this? It seems to be related to the size of the data, i.e. the same code worked for a much smaller data file.
f = open('data.txt', 'r')
header1 = f.readline()
header2 = f.readline()
header3 = f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    name = columns[2]
    j = columns[3]
    print(name, j)
Before indexing, you should check the length of the split() result, or validate the line's pattern with a regex.
Example of a length check, to add right after columns = line.split():
if len(columns) < 4:
    continue
That way, if a line does not match your expected data format, the code skips it instead of crashing.
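Putting the guard into the questioner's loop (a sketch; the sample data.txt contents, including a deliberately blank line, are made up for illustration):

```python
# Build a small sample file shaped like the question describes:
# three header lines, data columns, and one blank line that
# would have raised IndexError.
with open('data.txt', 'w') as f:
    f.write('header1\nheader2\nheader3\n')
    f.write('1 2 foo 3.14\n')
    f.write('\n')                      # blank line: split() -> []
    f.write('5 6 bar 2.71\n')

results = []
with open('data.txt') as f:
    for _ in range(3):                 # skip the three header lines
        next(f)
    for line in f:
        columns = line.split()
        if len(columns) < 4:           # guard against short/blank lines
            continue
        results.append((columns[2], columns[3]))

print(results)  # [('foo', '3.14'), ('bar', '2.71')]
```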

Python 3.x outputting a text file with names of files that contain a list of words

I have approximately 160,000 text files in a directory. My first objective is to create a list of files that contain at least one item from a list of about 50 keywords. My current code is
import os

ngwrds = [list of words]
for filename in os.listdir(os.getcwd()):
    with open(filename, 'r') as searchfile:
        for line in searchfile:
            if any(x in line for x in ngwrds):
                with open("keyword.txt", 'a') as out:
                    out.write(filename + '\n')
Which works but sends out duplicate filenames. Ideally what I would like is for the loop to stop once it hits the first keyword, write the file name to 'keyword.txt', and move on to the next file in the directory. Any thoughts on how to do this?
To expand on @strubbly's comment: you would simply add a break in the second for loop:
with open(filename, 'r') as searchfile:
    for line in searchfile:
        if any(x in line for x in ngwrds):
            with open("keyword.txt", 'a') as out:
                out.write(filename + '\n')
            break
What does break do? From the Python 3 docs:
The break statement, like in C, breaks out of the smallest enclosing for or while loop.
For more information on break, see the control flow documentation: https://docs.python.org/3/tutorial/controlflow.html
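A variant of the same idea (a sketch; the helper name files_with_keywords is my own): open keyword.txt once instead of reopening it for every match, and let any() stop scanning each file at its first hit, since generator arguments to any() short-circuit.

```python
import os

def files_with_keywords(directory, keywords):
    """Return names of files in `directory` that contain any keyword,
    stopping at the first matching line in each file."""
    hits = []
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue
        with open(path) as searchfile:
            # any() short-circuits, so scanning stops at the first hit.
            if any(any(x in line for x in keywords) for line in searchfile):
                hits.append(filename)
    return hits
```

The caller can then write hits to keyword.txt in one go, so each filename appears at most once.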
