Python Split itertools output to multi files (BIG output) - python-3.x

So I have created a script that reads lines from a file (1500 lines)
and writes them out 10 per line
(producing every possible combination we can get with product: a b c d a, a b c d b, etc...)
The thing is, the moment I run the script my computer freezes completely (because it writes so much data).
So I thought: is it possible to have the script save to a file every 100 MB, and save the current state, so that when I run the script again it will actually resume from where it stopped (the last line of the 100 MB file)?
Or if you have another solution I would love to hear it :P
Here's the script:
from itertools import product

with open('file.txt', 'r') as f:
    content = f.readlines()

comb = product(content, repeat=10)
new_content = [elem for elem in list(comb)]

with open('log.txt', 'w') as f:
    for line in new_content:
        f.write(str(line) + '\n')

The line
new_content = [elem for elem in list(comb)]
takes the iterator and materializes it into a list in memory, twice. The result is the same as just doing
new_content = list(comb)
Your computer freezes because this uses up all of the available RAM.
Since you use new_content only for iterating over it, you could just iterate over the initial generator directly instead:
from itertools import product

with open('file.txt', 'r') as f:
    content = f.readlines()

comb = product(content, repeat=10)

with open('log.txt', 'w') as f:
    for line in comb:
        f.write(str(line) + '\n')
But now this will fill up your hard disk instead, since with an input of 1500 lines it will produce 57665039062500000000000000000000 (1500**10) lines of output.
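If you still want the output broken into pieces of roughly fixed size, as the question asks, here is a minimal sketch (my own suggestion, not part of either answer; the 100 MB threshold and the log_NNN.txt naming are assumptions) that tracks the output position with tell() and rolls over to a new file once the threshold is crossed:
from itertools import product

CHUNK_BYTES = 100 * 1024 * 1024  # assumed rollover threshold: ~100 MB

with open('file.txt', 'r') as f:
    content = f.readlines()

part = 0
out = open('log_%03d.txt' % part, 'w')
for line in product(content, repeat=10):
    out.write(str(line) + '\n')
    # tell() reports the current position in the output file
    if out.tell() >= CHUNK_BYTES:
        out.close()
        part += 1
        out = open('log_%03d.txt' % part, 'w')
out.close()
Note that this only splits the output; truly resuming an interrupted run would additionally require persisting a count of combinations already written and skipping that many with itertools.islice on restart.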

I would open the file in a separate function and yield a line at a time - that way you're never going to blow your memory.
def read_file(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line
Then you can use this in your code (note the yielded lines already end with a newline):
with open("log.txt", "w") as f:
    for line in read_file("file.txt"):
        f.write(line)

How can I merge two files with numbers into a new file and make it sorted?
Code:
# combining the two files
filenames = ["numbers1.txt", "numbers2.txt"]
with open("allNumbers.txt", "w") as al_no:
    # iterating through the filenames list
    for f in filenames:
        with open(f) as infile:
            for line in infile:
                al_no.write(line)
There are two approaches you can use.
The first approach is to loop through, append the lines to a list, sort the list and then write that out to the file.
filenames = ["numbers1.txt", "numbers2.txt"]

# Step 1: merge the lines into a list
lines = []
for f in filenames:
    with open(f) as infile:
        for line in infile:
            lines.append(line)

# Step 2: write the list out to the file in sorted order
with open("allNumbers.txt", "w") as all_no:
    all_no.write(''.join(sorted(lines, key=lambda x: int(x.strip()))))
It is more succinct (and Pythonic) to use list comprehensions instead:
filenames = ["numbers1.txt", "numbers2.txt"]
lines = [line for sublist in [open(f).readlines() for f in filenames] for line in sublist]
with open("allNumbers.txt", "w") as all_no:
    all_no.write(''.join(sorted(lines, key=lambda x: int(x.strip()))))
Remember that when sorting, you need to use the key argument to sorted to ensure a numeric sort is done, rather than the default lexicographic sort.
This code assumes that each line in the source files contains a number, which is likely given the current approach you've taken.
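To see the difference the key makes, here is a quick illustration (not from the original answer):
nums = ["10", "2", "33", "4"]
print(sorted(nums))           # ['10', '2', '33', '4'] -- lexicographic order
print(sorted(nums, key=int))  # ['2', '4', '10', '33'] -- numeric order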
You did not provide a detailed description or sample files, so I worked from an assumption about the format.
Try this code:
# Reading data from file1
with open('numbers1.txt') as fp:
    data = fp.read()

# Reading data from file2
with open('numbers2.txt') as fp:
    data2 = fp.read()

# Merging the 2 files
data += "\n"
data += data2

# Convert str to list
list_data = data.split()

# Convert list items to int
list_data = list(map(int, list_data))

# Sort the list
list_data.sort()

# Save to file
with open('allNumbers.txt', 'w') as file:
    for data in list_data:
        file.write("%s\n" % data)
You can adapt the structure and use it.
Good luck!
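If each input file is itself already sorted, a lazier alternative to both answers is heapq.merge (a sketch; the pre-sorted-inputs assumption is mine), which never loads everything into memory:
import heapq

filenames = ["numbers1.txt", "numbers2.txt"]

# heapq.merge consumes the open files lazily, one line at a time;
# key=int compares the lines numerically
files = [open(f) for f in filenames]
try:
    with open("allNumbers.txt", "w") as out:
        for line in heapq.merge(*files, key=lambda x: int(x.strip())):
            out.write(line)
finally:
    for f in files:
        f.close()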

Adding new strings line by line from a file to a new one

I have a data output file in the format below from the script I run.
1. xxx %percentage1
2. yyy %percentage1
.
.
.
I am trying to take the percentages only, and append them to the same formatted file line by line (writing a new file once in the process).
1. xxx %percentage1 %percentage2
2. yyy %percentage1 %percentage2
The main idea is every time I run the code with a source data file I want it to add those percentages to the new file line by line.
1. xxx %percentage1 %percentage2 %percentage3 ...
2. yyy %percentage1 %percentage2 %percentage3 ...
This is what I could come up with:
import os

os.chdir("directory")
f = open("data1", "r")
n = 3
a = f.readlines()
b = []
for i in range(n):
    b.append(a[i].split(" ")[2])

file_lines = []
with open("data1", 'r') as f:
    for t in range(n):
        for x in f.readlines():
            file_lines.append(''.join([x.strip(), b[t], '\n']))
            print(b[t])

with open("data2", 'w') as f:
    f.writelines(file_lines)
With this code I get the new file, but the appended percentages are all from the first line instead of differing per line. And I can only get one set of percentages added; the file is overwritten rather than accumulating more values across runs.
I hope I explained it properly; if you can give some help I would be glad.
You can use a dict as a structure to load and write your data. This dict can then be pickled to store the data.
EDIT: added missing return statement
EDIT2: Fix return list of get_data
import pickle
import os

output = 'output'
dump = 'dump'

output_dict = {}
if os.path.exists(dump):
    with open(dump, 'rb') as f:
        output_dict = pickle.load(f)

def read_data(lines):
    """ Builds a dict from a list of lines where the keys are
    a tuple(w1, w2) and the values are w3 where w1, w2 and w3
    are the 3 words composing each line.
    """
    d = {}
    for line in lines:
        elts = line.split()
        assert(len(elts) == 3)
        d[tuple(elts[:2])] = elts[2]
    return d

def get_data(data):
    """ Recover data from a dict as a list of strings.
    The formatting for each element of the list is the following:
    k[0] k[1] v
    where k and v are the key/values of the data dict.
    """
    lines = []
    for k, v in data.items():
        line = list(k)
        line += [v, '\n']
        lines.append(' '.join(line))
    return lines

def update_data(output_d, new_d):
    """ Update a data dict with new data.
    The values are appended if the key already exists.
    Otherwise a new key/value pair is created.
    """
    for k, v in new_d.items():
        if k in output_d:
            output_d[k] = ' '.join([output_d[k], v])
        else:
            output_d[k] = v

for data_file in ('data1', 'data2', 'data3'):
    with open(data_file) as f:
        d1 = read_data(f.readlines())
    update_data(output_dict, d1)

print("Dumping data", output_dict)
with open(dump, 'wb') as f:
    pickle.dump(output_dict, f)

print("Writing data")
with open(output, 'w') as f:
    f.write('\n'.join(get_data(output_dict)))
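For example (hypothetical file contents, purely to illustrate how values accumulate across runs):
# Hypothetical data1:          Hypothetical data2:
#   1. xxx 10%                   1. xxx 15%
#   2. yyy 20%                   2. yyy 25%
#
# After processing both files, 'output' holds (modulo extra whitespace):
#   1. xxx 10% 15%
#   2. yyy 20% 25%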

read and write from and to file using functions

I'm trying to create 2 functions.
readfiles(file_path), which reads the file specified by file_path and returns a list of strings containing each line in the file.
writefiles(lines, file_path), which writes the content of the list lines, line by line, to the file specified by file_path.
When used one after another, the output file should be an exact copy of the input file (including the formatting).
This is what I have so far.
file_path = ("/myfolder/text.txt", "r")

def readfiles(file_path):
    with open file_path as f:
        for line in f:
            return line
            lst = list[]
            lst = line
            lst.append(line)
    return lst

read_file(file_path)
lines = lst []

def writefiles(lines, file_path):
    with open ("file_path", "w") as f:
    for line in lst:
        f.write(line)
        f.write("\n")
I can get it to kind of work when I use this for reading:
with open("/myfolder/text.txt", "r") as f:
    for line in f:
        print(line, end='')
and this for writing:
with open("/myfolder/text.txt", "w") as f:
    for line in f:
        f.write(line)
        f.write("\n")
But when I try to put them into functions, it all messes up.
I'm not sure why. I know it's a simple question but it's just not clicking for me. I've read documentation on it but I'm not following it fully and am at my wits' end. What's wrong with my functions?
I get varying errors from
lst = list[]
^
SyntaxError: invalid syntax
to
lst or list is not callable
Also I know there are similar questions but the ones I found don't seem to define a function.
The problems with your code are explained in the comments below:
file_path = ("/myfolder/text.txt", "r")  # this is a tuple of 2 elements; should be file_path = "/myfolder/text.txt"

def readfiles(file_path):
    with open file_path as f:  # "open" is a function and will probably throw an error if you use it without parentheses
                               # use open this way: open(file_path, "r")
        for line in f:
            return line  # it will return the first line and exit the function
            lst = list[]  # "lst = []" is how you define a list in python. also you want to define it outside the loop
            lst = line  # you are replacing the list lst with the string in line
            lst.append(line)  # will throw an error because lst is a string now and doesn't have the append method
    return lst

read_file(file_path)  # should be lines = read_file(file_path)
lines = lst []  # lines is an empty list

def writefiles(lines, file_path):
    with open ("file_path", "w") as f:
    for line in lst:  # this line should have 1 more tabulation
        f.write(line)  # this line should have 1 more tabulation
        f.write("\n")  # this line should have 1 more tabulation
Here's what the code should look like:
def readfiles(file_path):
    lst = []
    with open(file_path) as f:
        for line in f:
            lst.append(line.strip("\n"))
    return lst

def writefiles(lines, file_path):
    with open(file_path, "w") as f:
        for line in lines:
            f.write(line + "\n")

file_path = "/myfolder/text.txt"
filepathout = "/myfolder/text2.txt"
lines = readfiles(file_path)
writefiles(lines, filepathout)
A more Pythonic way to do it:
# readlines is a built-in method of file objects in python
with open(file_path) as f:
    lines = f.readlines()

# stripping line returns
lines = [line.strip("\n") for line in lines]

# join will convert the list to a string by adding a \n between the list elements
with open(filepathout, "w") as f:
    f.write("\n".join(lines))
Key points:
- the function stops after reaching the return statement
- be careful where you define your variables;
  e.g. "lst" defined inside a for loop will get redefined on each iteration
Defining variables:
- for a list: list_var = []
- for a tuple: tup_var = (1, 2)
- for an int: int_var = 3
- for a dictionary: dict_var = {}
- for a string: string_var = "test"
A couple of learning points here that will help.
In your reading function, you are kinda close. However, you cannot put the return statement in the loop. As soon as the function hits that for the first time, the function ends. Also, if you are going to make a container to hold the list of things read, you need to make it before you start your loop. Lastly, don't name anything list: it shadows the built-in type. If you want to make a new list, just do something like results = list() or results = []
So in pseudocode, you should (see the sketch after this list):
Make a list to hold results
Open the file as you are now
Make a loop to loop through lines
append to the results list
return the results (outside the loop)
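A minimal sketch of that pseudocode (functionally the same as the corrected readfiles above):
def readfiles(file_path):
    results = []                  # make a list to hold results
    with open(file_path) as f:    # open the file
        for line in f:            # loop through lines
            results.append(line)  # append to the results list
    return results                # return the results (outside the loop)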
Your writefiles is very close. You should be looping through the lines variable that is a parameter of your function. Right now you are referencing lst which is not a parameter of your function.
Good luck!

Split big file in multiple files in python3.x

I want to split the output into multiple files once the size of file_write exceeds 20 MB.
In random_function, I open big_file.txt, remove noise using remove_noise(), and write each clean line to outfile.
I am not sure how to split the file based on size in my current implementation. Please find the code below:
(Apologies for not providing proper implementation with example because it is really complicated)
I have gone through the example at this link: Split large text file(around 50GB) into multiple files
import os

def parses(lines, my_date_list):
    for line in reversed(list(lines)):
        line = line.strip()
        if not line:
            continue
        date_string = "2019-11-01"  # assumption
        yield date_string, line

def remove_noise(line):
    """ dummy function """
    return line

def random_function(path, output, cutoff="2019-10-31"):
    my_date_list = []
    if os.path.exists(path):
        with open(path) as f:
            lines = parses(f, my_date_list)
            for date, line in lines:
                if cutoff <= date:
                    results = remove_noise(line)
                    output.write(results + '\n')
                    continue
                else:
                    break
While writing lines to output, I need to check the size. If the size reaches 20 MB, I want to start writing to a second file {maybe output_2}, and so on.
if __name__ == '__main__':
    path = "./big_file.txt"
    file_write = "./write_file.txt"
    with open(file_write, 'w') as outfile:
        random_function(path=path, output=outfile)
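No answer was posted here, but one way to get the size-based split (a minimal sketch, not a definitive implementation; the write_file_N.txt naming and the use of tell() are my own choices) is a small writer object that swaps in a fresh file once the current one passes 20 MB. Since random_function only ever calls output.write(), it drops in without other changes:
MAX_BYTES = 20 * 1024 * 1024  # 20 MB per output file

class RollingWriter:
    """File-like writer that starts a new numbered file past MAX_BYTES."""
    def __init__(self, base="./write_file"):
        self.base = base
        self.index = 1
        self.f = open("%s_%d.txt" % (self.base, self.index), 'w')

    def write(self, text):
        self.f.write(text)
        # tell() reports the current position in the output file
        if self.f.tell() >= MAX_BYTES:
            self.f.close()
            self.index += 1
            self.f = open("%s_%d.txt" % (self.base, self.index), 'w')

    def close(self):
        self.f.close()

out = RollingWriter()
random_function(path="./big_file.txt", output=out)
out.close()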

Issue opening large number of files in Python

I'm trying to process a pipe-separated text file with the following format:
18511|1|2587198|2004-03-31|0|100000|0|1.97|0.49988|100000||||
18511|2|2587198|2004-06-30|0|160000|0|3.2|0.79669|60000|60|||
18511|3|2587198|2004-09-30|0|160000|0|2.17|0.79279|0|0|||
18511|4|2587198|2004-09-30|0|160000|0|1.72|0.79118|0|0|||
18511|5|2587198|2005-03-31|0|0|0|0|0|-160000|-100|||19
18511|1|2587940|2004-03-31|0|240000|0|0.78|0.27327|240000||||
18511|2|2587940|2004-06-30|0|560000|0|1.59|0.63576|320000|133.33||24|
18511|3|2587940|2004-09-30|0|560000|0|1.13|0.50704|0|0|||
18511|4|2587940|2004-09-30|0|560000|0|0.96|0.50704|0|0|||
18511|5|2587940|2005-03-31|0|0|0|0|0|-560000|-100|||14
For each line I want to isolate the second field and write that line to a file with that field as part of the filename, e.g. issue1.txt, issue2.txt, where the number is the second field in the file excerpt above. This number can be in the range 1 to 56. My code is shown below:
with open('d:\\tmp\issueholding.txt') as f, open('d:\\tmp\issue1.txt', 'w') as out_f1,\
open('d:\\tmp\issue2.txt', 'w') as out_f2,open('d:\\tmp\issue3.txt', 'w') as out_f3,\
open('d:\\tmp\issue4.txt', 'w') as out_f4,open('d:\\tmp\issue5.txt', 'w') as out_f5,\
open('d:\\tmp\issue6.txt', 'w') as out_f6,open('d:\\tmp\issue7.txt', 'w') as out_f7,\
open('d:\\tmp\issue8.txt', 'w') as out_f8,open('d:\\tmp\issue9.txt', 'w') as out_f9,\
open('d:\\tmp\issue10.txt', 'w') as out_f10,open('d:\\tmp\issue11.txt', 'w') as out_f11,\
open('d:\\tmp\issue12.txt', 'w') as out_f12,open('d:\\tmp\issue13.txt', 'w') as out_f13,\
open('d:\\tmp\issue14.txt', 'w') as out_f14,open('d:\\tmp\issue15.txt', 'w') as out_f15,\
open('d:\\tmp\issue16.txt', 'w') as out_f16,open('d:\\tmp\issue17.txt', 'w') as out_f17,\
open('d:\\tmp\issue18.txt', 'w') as out_f18,open('d:\\tmp\issue19.txt', 'w') as out_f19,\
open('d:\\tmp\issue20.txt', 'w') as out_f20,open('d:\\tmp\issue21.txt', 'w') as out_f21,\
open('d:\\tmp\issue22.txt', 'w') as out_f22,open('d:\\tmp\issue23.txt', 'w') as out_f23,\
open('d:\\tmp\issue24.txt', 'w') as out_f24,open('d:\\tmp\issue25.txt', 'w') as out_f25,\
open('d:\\tmp\issue32.txt', 'w') as out_f32,open('d:\\tmp\issue33.txt', 'w') as out_f33,\
open('d:\\tmp\issue34.txt', 'w') as out_f34,open('d:\\tmp\issue35.txt', 'w') as out_f35,\
open('d:\\tmp\issue36.txt', 'w') as out_f36,open('d:\\tmp\issue37.txt', 'w') as out_f37,\
open('d:\\tmp\issue38.txt', 'w') as out_f38,open('d:\\tmp\issue39.txt', 'w') as out_f39,\
open('d:\\tmp\issue40.txt', 'w') as out_f40,open('d:\\tmp\issue41.txt', 'w') as out_f41,\
open('d:\\tmp\issue42.txt', 'w') as out_f42,open('d:\\tmp\issue43.txt', 'w') as out_f43,\
open('d:\\tmp\issue44.txt', 'w') as out_f44,open('d:\\tmp\issue45.txt', 'w') as out_f45,\
open('d:\\tmp\issue46.txt', 'w') as out_f46,open('d:\\tmp\issue47.txt', 'w') as out_f47,\
open('d:\\tmp\issue48.txt', 'w') as out_f48,open('d:\\tmp\issue49.txt', 'w') as out_f49,\
open('d:\\tmp\issue50.txt', 'w') as out_f50,open('d:\\tmp\issue51.txt', 'w') as out_f51,\
open('d:\\tmp\issue52.txt', 'w') as out_f52,open('d:\\tmp\issue53.txt', 'w') as out_f53,\
open('d:\\tmp\issue54.txt', 'w') as out_f54,open('d:\\tmp\issue55.txt', 'w') as out_f55,\
open('d:\\tmp\issue56.txt', 'w') as out_f56:
    for line in f:
        field1_end = line.find('|') + 1
        field2_end = line.find('|', field1_end)
        f2 = line[field1_end:field2_end]
        out_f56.write(line)
My two issues are:
1) When trying to run the above, I get the following error message:
File "", line unknown
SyntaxError: too many statically nested blocks
2) How do I change the line out_f56.write(line) so that I can use the variable f2 to select the output file, rather than hard-coding it?
I am running this in a jupyter notebook running python3 under Windows. To be clear, the input file has approx 235 Million records so performance is key.
Appreciate any help or suggestions
Try something like this (see comments in code for explanation):
with open(R"d:\tmp\issueholding.txt") as f:
    for line in f:
        # splitting line into list of strings at '|' character
        fields = line.split('|')
        # defining output file name according to issue code in second field
        # NB: list indexes are zero-based, therefore use 1
        out_name = R"d:\tmp\issue%s.txt" % fields[1]
        # opening output file and writing current line to it
        # NB: make sure you use the 'a+' mode to append to an existing file
        with open(out_name, 'a+') as ff:
            ff.write(line)
To avoid opening files repeatedly inside the reading loop, you could do the following:
from collections import defaultdict

with open(R"D:\tmp\issueholding.txt") as f:
    # setting up dictionary to hold lines grouped by issue code
    # using a defaultdict here to automatically create a list when inserting
    # the first item
    collected_issues = defaultdict(list)
    for line in f:
        # splitting line into list of strings at '|' character and retrieving
        # current issue code from second token
        issue_code = line.split('|')[1]
        # appending current line to list of collected lines associated with
        # current issue code
        collected_issues[issue_code].append(line)
    else:
        for issue_code in collected_issues:
            # defining output file name according to issue code
            out_name = R"D:\tmp\issue%s.txt" % issue_code
            # opening output file and writing collected lines to it
            with open(out_name, 'a+') as ff:
                ff.write("".join(collected_issues[issue_code]))
This of course creates an in-memory dictionary holding all lines retrieved from the input file. Given your specification this could very well be not feasible with your machine. An alternative would be to split up the input file and processing it chunk by chunk instead. This can be achieved in code by defining a corresponding generator that reads a defined amount of lines (here: 1000) from the input file. A possible final solution could then look like this:
from itertools import islice
from collections import defaultdict

def get_chunk_of_lines(file, N):
    """
    Retrieves N lines from the specified opened file.
    """
    return [x.strip() for x in islice(file, N)]

def collect_issues(lines):
    """
    Collects and groups issues from the specified lines.
    """
    collected_issues = defaultdict(list)
    for line in lines:
        # splitting line into list of strings at '|' character and retrieving
        # current issue code from second token
        issue_code = line.split('|')[1]
        # appending current line to list of collected lines associated with
        # current issue code
        collected_issues[issue_code].append(line)
    return collected_issues

def export_grouped_issues(issues):
    """
    Exports collected and grouped issues.
    """
    for issue_code in issues:
        # defining output file name according to issue code
        out_name = R"D:\tmp\issue%s.txt" % issue_code
        # opening output file and writing collected lines to it
        # (the chunk reader stripped the newlines, so re-add them here)
        with open(out_name, 'a+') as f:
            f.write("\n".join(issues[issue_code]) + "\n")

with open(R"D:\tmp\issueholding.txt") as issue_src:
    chunk_cnt = 0
    while True:
        # retrieving 1000 input lines at a time
        line_chunk = get_chunk_of_lines(issue_src, 1000)
        # exiting while loop if no more chunk is left
        if not line_chunk:
            break
        chunk_cnt += 1
        print("+ Working on chunk %d" % chunk_cnt)
        # collecting, grouping and exporting issues
        issues = collect_issues(line_chunk)
        export_grouped_issues(issues)
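Since the issue codes only run from 1 to 56, a further alternative (my own suggestion, not part of the answer above) is to keep at most 56 output handles open in a dict and write each line as it is read, avoiding both the in-memory grouping and the repeated re-opening in append mode:
out_files = {}  # maps issue code -> open file handle (at most 56 entries)
try:
    with open(R"D:\tmp\issueholding.txt") as f:
        for line in f:
            issue_code = line.split('|', 2)[1]
            ff = out_files.get(issue_code)
            if ff is None:
                # first time this code is seen: open its output file once
                ff = out_files[issue_code] = open(
                    R"D:\tmp\issue%s.txt" % issue_code, 'w')
            ff.write(line)
finally:
    for ff in out_files.values():
        ff.close()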
