Merge multiple files into a single file, with each new file starting on a new line in the output file - python-3.x

I've written a script to merge multiple files into a single file and create a list from that.
Requirement: file1 + file2 = file3, like below.
File 1:
37717531209
201128307083
211669759863
496338947094
File 2:
348353447295
278262427715
901601149752
333676465561
My output file (not the output I expect):
37717531209
201128307083
211669759863
496338947094348353447295
278262427715
901601149752
333676465561
My expected output file:
37717531209
201128307083
211669759863
496338947094
348353447295
278262427715
901601149752
333676465561
My code is:
import glob
import shutil

with open(outputfile, 'wb') as outfile:
    for filename in glob.glob('*.accts'):
        if filename == outputfile:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

#accounts = list(outfile)
with open('accounts.txt') as f:
    acc = list(f)

accounts = []
for element in acc:
    accounts.append(element.strip())
I want each new file to start on the next line, not on the same line as the previous file's last entry.

You don't appear to be doing anything that would require writing the contents of the input files into an output file. At the end, after creating the output file, you reopen it and process the account numbers it contains. You could simply do this, which requires no output file at all:
accounts = []
for filename in glob.glob('*.accts'):
    with open(filename) as readfile:
        accounts.extend(line.strip() for line in readfile)
return accounts  # note: `return` assumes this runs inside a function
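If you do still want the merged file on disk, the root cause of the run-together numbers is that some input files do not end with a newline. A minimal sketch of a fix along the lines of the original code (reading each file fully instead of using shutil.copyfileobj, so the last byte can be checked):
import glob

with open(outputfile, 'wb') as outfile:
    for filename in glob.glob('*.accts'):
        if filename == outputfile:
            continue
        with open(filename, 'rb') as readfile:
            data = readfile.read()
        outfile.write(data)
        # If the file just copied did not end with a newline,
        # add one so the next file starts on a fresh line.
        if data and not data.endswith(b'\n'):
            outfile.write(b'\n')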
Furthermore, you could probably do away with downloading the *.accts files at all by using download_fileobj() to download each file into a file-like object (e.g. an io.BytesIO object) and processing it from there.
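For example, a rough sketch with boto3 (the bucket and key names here are hypothetical, and this assumes the *.accts files live in S3, which is what download_fileobj() suggests):
import io
import boto3

s3 = boto3.client('s3')
accounts = []
for key in ['accounts/file1.accts', 'accounts/file2.accts']:  # hypothetical keys
    buffer = io.BytesIO()
    s3.download_fileobj('my-bucket', key, buffer)  # download into memory
    buffer.seek(0)
    accounts.extend(line.decode().strip() for line in buffer)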

Related

Scan through large text file using contents of another text file

Hello, I am very new to coding and I am writing a small Python script, but I am stuck. The goal is to compare the contents of log.txt to the contents of LargeFile.txt, and to store in outfile.txt every line of log.txt that does not match any line of LargeFile.txt. With the code below, I only get the first line of log.txt repeating itself in outfile.txt.
logfile = open('log1.txt', 'r')           # This file is 8 KB
keywordlist = open('LargeFile.txt', 'r')  # This file is 1.4 GB
outfile = open('outfile.txt', 'w')

loglines = [n for n in logfile]
keywords = [n for n in keywordlist]

for line in loglines:
    for word in keywords:
        if line not in word:
            outfile.write(line)
outfile.close()
So conceptually you're trying to check whether any line of your 1+ GB file occurs in your 8 KB file.
This means one of the files needs to be loaded into RAM, and the smaller file is the natural choice. The other file can be read sequentially and does not need to be loaded in full.
We need:
- a list of lines from the smaller file;
- an index of those lines for quick look-ups (we'll use a dict for this);
- a loop that runs through the large file and checks each line against the index, making note of every matching line it finds;
- a loop that outputs the original lines and uses the index to determine whether they are unique or not.
The sample below prints the complete output to the console. Write it to a file as needed.
with open('log1.txt', 'r') as f:
    log_lines = list(f)

index = {line: [] for line in log_lines}

with open('LargeFile.txt', 'r') as f:
    for line_num, line in enumerate(f, 1):
        if line in index:
            index[line].append(line_num)

for line in log_lines:
    if len(index[line]) == 0:
        print(f'{line} -> unique')
    else:
        print(f'{line} -> found {len(index[line])}x')
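To produce the outfile.txt of non-matching lines that the question asks for, the final loop can write instead of print (a sketch under the same assumptions):
with open('outfile.txt', 'w') as out:
    for line in log_lines:
        if not index[line]:  # never found in LargeFile.txt
            out.write(line)  # line still carries its original newline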

Write each file's contents into its own named list in Python

I have a folder containing multiple .txt files, named (one.txt, two.txt, three.txt, ...). I need to read one.txt and write its contents into a list named onefile[], then read two.txt and write its contents into a list named twofile[], and so on. How can I do this?
Update! I am trying this code; now how can I print the values in each list?
import os
from pathlib import Path

def writeinlist(file_path, i):
    multilist = {}
    output = open(file_path, 'r')
    globals()['List%s' % i] = output
    print('List%s' % i)

input_path = Path(Path.home(), "Desktop", "NN")
index = 1
for root, dirs, files in os.walk(input_path):
    for file in files:
        file_path = Path(root, file)
        writeinlist(file_path, index)
        index += 1
Update 2: How can I delete \n from the values?
value_list1 = files_dict['file1']
print('Values of file1 are:')
print(value_list1)
I used the following to create a dictionary with dynamic keys (the names of the files) whose values are lists containing the lines of each file.
First, contents of onefile.txt:
First file first line
First file second line
First file third line
Contents of twofile.txt:
Second file first line
Second file second line
My code:
import os
import pprint

files_dict = {}
for file in os.listdir("/path/to/folder"):
    if file.endswith(".txt"):
        key = file.split(".")[0]
        full_filename = os.path.join("/path/to/folder", file)
        with open(full_filename, "r") as f:
            files_dict[key] = f.readlines()

pprint.pprint(files_dict)
Output:
{'onefile': ['First file first line\n',
             'First file second line\n',
             'First file third line'],
 'twofile': ['Second file first line\n', 'Second file second line']}
Another way to do this that's a bit more Pythonic:
import os
import pprint

files_dict = {}
for file in [
    f
    for f in os.listdir("/path/to/folder")
    if f.endswith(".txt")
]:
    with open(os.path.join("/path/to/folder", file), "r") as fo:
        files_dict[file.split(".")[0]] = fo.readlines()

pprint.pprint(files_dict)
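To address Update 2 (removing the \n from the values), readlines() can be swapped for a comprehension that strips each line; a small tweak to the first version above:
with open(full_filename, "r") as f:
    files_dict[key] = [line.rstrip("\n") for line in f]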

Read multiple text files, search few strings , replace and write in python

I have tens of text files in my local directory, named test1, test2, test3, and so on. I would like to read all these files, search for a few strings in each, replace them with other strings, and finally save the results back into my directory with names like newtest1, newtest2, newtest3, and so on.
For instance, if there was a single file, I would have done following:
# Read the file
with open('H:\\Yugeen\\TestFiles\\test1.txt', 'r') as file:
    filedata = file.read()

# Replace the target string
filedata = filedata.replace('32-83 Days', '32-60 Days')

# Write the file out again
with open('H:\\Yugeen\\TestFiles\\newtest1.txt', 'w') as file:
    file.write(filedata)
Is there any way that I can achieve this in python?
If you use Python 3 you can use scandir from the os library.
Python 3 docs: os.scandir
With that you can get the directory entries.
with os.scandir('H:\\Yugeen\\TestFiles') as it:
Then loop over these entries; your code could look something like this.
Notice I changed the path in your code to the entry object's path.
import os

# Get the directory entries
with os.scandir('H:\\Yugeen\\TestFiles') as it:
    # Iterate over directory entries
    for entry in it:
        # If not a file, continue to the next iteration
        # (unnecessary if you are 100% sure there are only files in the directory)
        if not entry.is_file():
            continue
        # Read the file
        with open(entry.path, 'r') as file:
            filedata = file.read()
        # Replace the target string
        filedata = filedata.replace('32-83 Days', '32-60 Days')
        # Write the file out again
        with open(entry.path, 'w') as file:
            file.write(filedata)
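Note that this overwrites each file in place. To save the results as newtest1.txt and so on, as the question asks, the write step could build a new name from the entry instead (a sketch):
# Write to a new file prefixed with "new" instead of overwriting the original
new_path = os.path.join('H:\\Yugeen\\TestFiles', 'new' + entry.name)
with open(new_path, 'w') as file:
    file.write(filedata)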
If you use Python 2 you can use listdir (also applicable to Python 3).
Python 2 docs: os.listdir
The code structure is the same in this case, but you also need to build the full path to each file, since listdir returns only filenames.
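A minimal sketch of the same loop using listdir:
import os

directory = 'H:\\Yugeen\\TestFiles'
for filename in os.listdir(directory):
    # listdir returns bare names, so build the full path yourself
    full_path = os.path.join(directory, filename)
    if not os.path.isfile(full_path):
        continue
    with open(full_path, 'r') as file:
        filedata = file.read()
    filedata = filedata.replace('32-83 Days', '32-60 Days')
    with open(full_path, 'w') as file:
        file.write(filedata)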

Pass a file with filepaths to Python in Ubuntu terminal to analyze each file?

I have a text file with file paths:
path1
path2
path3
...
path100000000
I have my Python script app.py that should run on each file (path1, path2, ...).
Please advise: what is the best way to do this?
Should I just get it as an argument, and then:
with open(input_file, "r") as f:
    lines = f.readlines()
    for line in lines:
        main_function(line)
Yes, that should work, except that readlines() doesn't remove newline characters:
with open(input_file, "r") as f:
    lines = f.readlines()
    for line in lines:
        main_function(line.strip())
**Note: The above code assumes the file is in the same directory as the Python script.
You are using a context manager, so place the code that uses the file inside the context.
According to your comment:
If you want to pass the filename, and main_function reads the file contents itself, then the above code will work.
If you want to read each file and pass its contents instead, then modify the above code to first read the content and then pass it to the function:
with open(input_file, "r") as f:
    lines = f.readlines()
    for line in lines:
        with open(line.strip(), "r") as data_file:
            main_function(data_file.read())
**Note: the above code will read each whole file as a single string (text).
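To wire this up from the Ubuntu terminal, the file of paths can be passed as a command-line argument; a minimal sketch of app.py (main_function here is a stand-in for whatever your script actually does with each file):
import sys

def main_function(path):
    # Stand-in: process a single file path
    print(path)

if __name__ == "__main__":
    input_file = sys.argv[1]  # e.g. python3 app.py paths.txt
    with open(input_file, "r") as f:
        for line in f:
            main_function(line.strip())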

How do I check whether a file already contains the text I want to append?

I am currently working on a project where I read all the *.pdf files in a directory, extract their text, and append it to a text file. So far so good; I was able to do this.
Now the problem: if I read the same directory again, it appends the same files again. Is there a way to check whether the extracted text is already in the output file and, if so, skip it?
My code for this looks like this right now (I created the directory variable already):
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        file = os.path.join(directory, filename)
        print(file)
        # parse data from file
        file_data = parser.from_file(file)
        # get the file's text content
        text = file_data['content']
        #print(type(text))
        print("len ", len(text))
        #print(text)
        # save to text file
        f = open("test2.txt", "a+", encoding='utf-8')
        f.write(text)
        f.close()
    else:
        continue
Thanks in advance!
One thing you could do is load the output file's contents and check whether the extracted text is already in it:
if text in open("test2.txt").read():
    # text is already in the file, don't write
else:
    # write here
However, this is very inefficient. A better way is to create a file with the filenames that you have already written, and check that:
(at the beginning of your code):
files = open("files.txt").read().splitlines()
(readlines() would keep the trailing newlines, so the file in files check below would never match; splitlines() avoids that.)
(before parser.from_file(file)):
if file in files:
    continue  # don't read or write
(after f.close()):
files.append(file)
(after the whole loop has finished):
with open("files.txt", "w") as f:
    f.write("\n".join(files))
Putting it all together:
files = open("files.txt").read().splitlines()

for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        file = os.path.join(directory, filename)
        if file in files:
            continue  # don't read or write
        print(file)
        # parse data from file
        file_data = parser.from_file(file)
        # get the file's text content
        text = file_data['content']
        print("len ", len(text))
        # save to text file
        f = open("test2.txt", "a+", encoding='utf-8')
        f.write(text)
        f.close()
        files.append(file)
    else:
        continue

with open("files.txt", "w") as f:
    f.write("\n".join(files))
Note that you need to create an (initially empty) file named files.txt in the current directory before the first run.
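If you would rather not create files.txt by hand, a small variant (an addition, not part of the original answer) tolerates a missing file and uses a set for faster membership tests:
import os

seen = set()
if os.path.exists("files.txt"):
    with open("files.txt") as f:
        seen = set(f.read().splitlines())
# ... in the loop: use `if file in seen: continue` and `seen.add(file)` ...
with open("files.txt", "w") as f:
    f.write("\n".join(sorted(seen)))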
